Research Article: New Research, Cognition and Behavior

Opponent Learning with Different Representations in the Cortico-Basal Ganglia Circuits

Kenji Morita, Kanji Shimomura and Yasuo Kawaguchi
eNeuro 18 January 2023, 10 (1) ENEURO.0422-22.2023; https://doi.org/10.1523/ENEURO.0422-22.2023
Kenji Morita
1 Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
2 International Research Center for Neurointelligence (WPI-IRCN), The University of Tokyo, Tokyo 113-0033, Japan
Kanji Shimomura
1 Physical and Health Education, Graduate School of Education, The University of Tokyo, Tokyo 113-0033, Japan
3 Department of Behavioral Medicine, National Institute of Mental Health, National Center of Neurology and Psychiatry, Kodaira 187-8551, Japan
Yasuo Kawaguchi
4 Brain Science Institute, Tamagawa University, Machida 194-8610, Japan
5 National Institute for Physiological Sciences (NIPS), Okazaki 444-8787, Japan

Figures & Data

Figures

Figure 1.

The simulated reward navigation task. A, The 5 × 5 grid space, where the agent moved around. The agent started from the fixed start state, (1, 1), and moved to one of the neighboring states (two at the corners, three at the edges, and four elsewhere) at each time step. There were nine reward candidate states, where reward could potentially be placed, namely, (1, 5), (2, 5), (3, 5), (4, 5), (5, 1), (5, 2), (5, 3), (5, 4), and (5, 5) (indicated by the gray color). B, Epochs in the task. During the initial 500 time steps, there was no reward; this was called the no-reward epoch. During the next 500 time steps, one of the nine reward candidate states was specified as the special candidate state, whereas the remaining eight were regarded as normal candidate states. There were in total nine rewarded epochs (500 time steps each), and each of the nine reward candidate states became the special candidate state in one of the nine epochs; the order was determined by pseudorandom permutation in each single simulation. C, In the rewarded epochs, when the agent reached the rewarded state and obtained the reward, the agent was carried back to the start state, and a new reward was placed at the special candidate state with 60% probability or at one of the eight normal candidate states with 5% probability each.
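To make the task mechanics concrete, here is a minimal Python sketch of the environment described above. It is our illustration, not the authors' code (which is provided as Extended Data 1); all names (`GridTask`, `step`, etc.) are ours, and the 1-indexed coordinates follow the caption.

```python
import numpy as np

class GridTask:
    """Minimal sketch of the 5 x 5 reward navigation task (Figure 1)."""
    SIZE = 5
    START = (1, 1)
    # The nine reward candidate states (1-indexed, as in the caption)
    CANDIDATES = [(1, 5), (2, 5), (3, 5), (4, 5),
                  (5, 1), (5, 2), (5, 3), (5, 4), (5, 5)]

    def __init__(self, special, rng):
        self.special = special              # special candidate state of this epoch
        self.rng = rng
        self.state = self.START
        self.reward_state = self._place_reward()

    def _place_reward(self):
        # Special candidate with 60%; each of the 8 normal candidates with 5%
        if self.rng.random() < 0.6:
            return self.special
        others = [c for c in self.CANDIDATES if c != self.special]
        return others[self.rng.integers(len(others))]

    def neighbors(self, s):
        # Two neighbors at the corners, three at the edges, four elsewhere
        x, y = s
        moves = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
        return [(a, b) for a, b in moves
                if 1 <= a <= self.SIZE and 1 <= b <= self.SIZE]

    def step(self, next_state):
        # Move; on reaching the rewarded state, return reward and reset to start
        self.state = next_state
        if self.state == self.reward_state:
            self.state = self.START
            self.reward_state = self._place_reward()
            return 1.0
        return 0.0
```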

Figure 2.

The reinforcement learning model with two systems. A, Model architecture. The model consists of two learning systems, system 1 and system 2, which may use different ways of state representation [successor representation (SR) or individual representation (IR)]. Each system has its own system-specific value of each state, and their mean (average) becomes the integrated state value. The agent selects an action to move to a neighboring state depending on the neighbors' integrated values in a soft-max manner. The integrated values are also used for calculating the temporal-difference reward prediction errors (TD-RPEs). The TD-RPEs are then used to update the system-specific state values in each of the two systems, or more precisely, to update the system-specific values in IR-based system(s) and the weights for the system-specific values in SR-based system(s). The learning rates of these updates can differ between the two systems and also depending on whether the TD-RPE is positive (non-negative) or negative. B, Possible combinations of the ways of state representation in the two systems.
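The caption's verbal description maps onto a few lines of update equations. The Python sketch below shows one way the value integration, soft-max choice, and sign-dependent TD-RPE updates could be implemented; it is a sketch under our assumptions (function and variable names are ours), not the authors' implementation. An IR-based system corresponds to `feat` being the identity matrix (so the weights are the state values themselves), and an SR-based system to `feat` being the SR matrix.

```python
import numpy as np

def system_and_integrated_values(w1, feat1, w2, feat2):
    """Integrated state value = mean of the two system-specific values.
    For IR, feat is np.eye(n) and v = w; for SR, feat is the SR matrix M."""
    v1, v2 = feat1 @ w1, feat2 @ w2
    return v1, v2, (v1 + v2) / 2.0

def softmax_choice(neighbor_values, beta, rng):
    """Soft-max action selection over the integrated values of the
    neighboring states, with inverse temperature beta."""
    p = np.exp(beta * (neighbor_values - neighbor_values.max()))
    p /= p.sum()
    return rng.choice(len(neighbor_values), p=p)

def td_update(w, feat, s, s_next, r, v_int, gamma, alpha_pos, alpha_neg):
    """One update of one system's weights from the TD-RPE, which is
    computed from the INTEGRATED values v_int (as the caption states).
    The learning rate depends on the sign of the TD-RPE."""
    delta = r + gamma * v_int[s_next] - v_int[s]
    alpha = alpha_pos if delta >= 0 else alpha_neg
    w += alpha * delta * feat[s]   # for IR, feat[s] is a one-hot vector
    return w
```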

Figure 3.

Performance of the model consisting of a system using the SR and another system using the IR, and its dependence on the ratios of the learning rates from positive and negative TD-RPEs in each system. A, Mean performance over n = 100 simulations for each condition. The axis rising to the left indicates the ratio of positive-/negative-error-based learning rates (denoted as α+/α−) in the system using the SR, while the axis rising to the right indicates the same ratio in the system using the IR. The sum of the learning rates from positive and negative TD-RPEs (α+ + α−) in each system was held constant at 1 in all conditions shown in this figure. The inverse temperature β was 5, and the time discount factor γ was 0.7. The vertical line corresponds to conditions where the α+/α− ratio is equal for both systems (bottom: negative-error-based learning dominates; top: positive-error-based learning dominates). The left side of the vertical line corresponds to conditions where the α+/α− ratio is larger in the SR-based system than in the IR-based system, whereas the opposite holds on the right side. The color in each square pixel indicates the mean total obtained rewards in the task, averaged across 100 simulations, for each condition (i.e., the set of αSR+/αSR− and αIR+/αIR− at the center of the pixel), in reference to the rightmost color bar. The white cross indicates the set of α+/α− ratios that gave the best performance (α+/α− = 4 and 0.25 for the SR-based and IR-based systems, respectively) among those examined in this panel. Note that the minimum value of the color bar is not 0; moreover, the maximum and minimum of the color bar do not match the highest and lowest performances in this panel but were instead set to cover the highest and lowest performances across the simulations of the same task with different parameters and/or model architectures (i.e., not only SR+IR but also SR+SR and IR+IR) shown in this panel and in Figures 4 and 6. B, The black solid line, gray thin error bars, and black thick error bars, respectively, show the mean, SD (normalized by n; same hereafter), and SEM (approximated by SD/√n; same hereafter) of the performance over n = 100 simulations for the conditions where the product of the α+/α− ratios of the SR-based and IR-based systems was equal to 1 (i.e., the conditions on the horizontal diagonal in A). C, Frequency (number of times) with which each condition gave the best mean performance over 100 simulations when 100 simulations per condition were executed 50 times (including the run shown in A, B).
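Because α+ + α− is held constant, each α+/α− ratio on the axes corresponds to a unique pair of learning rates. A small helper (ours, for illustration) makes the mapping explicit:

```python
def rates_from_ratio(ratio, total=1.0):
    """Given alpha_plus / alpha_minus = ratio and
    alpha_plus + alpha_minus = total, return (alpha_plus, alpha_minus)."""
    alpha_minus = total / (1.0 + ratio)
    return total - alpha_minus, alpha_minus

# The best-performing condition in A:
# SR-based system, ratio 4    -> (0.8, 0.2)
# IR-based system, ratio 0.25 -> (0.2, 0.8)
```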

Figure 4.

Performance (mean total rewards) of the model consisting of an SR-based system and an IR-based system with various sets of parameters. The sum of the learning rates from positive and negative TD-RPEs (α+ + α−) was varied over 0.75, 1, and 1.25. The inverse temperature β was varied over 5 and 10. The time discount factor γ was varied over 0.7 and 0.8. The ratio of positive-/negative-error-based learning rates (α+/α−) was varied over 0.2, 0.25, 1/3, 0.5, 1, 2, 3, 4, and 5 for the cases with α+ + α− = 0.75 or 1, but 0.2 and 5 were omitted for the cases with α+ + α− = 1.25 to avoid learning rates larger than 1. The color-performance correspondence is the same as in Figure 3A. The white cross in each panel indicates the set of α+/α− ratios that gave the best performance among those examined in that condition, and the gray number near the top of each panel indicates the best performance. The panel with α+ + α− = 1, β = 5, and γ = 0.7 shows the same results as Figure 3A.

Figure 5.

Learning profiles of the model consisting of an SR-based system and an IR-based system with different combinations of the ratios of positive-/negative-error-based learning rates in each system. Three cases included in Figure 3A [where the sum of the learning rates from positive and negative TD-RPEs (α+ + α−) in each system was 1, the inverse temperature β was 5, and the time discount factor γ was 0.7] were analyzed. A, Mean learning curves. The curves indicate the mean time (number of time steps) needed to obtain the first, second, third… reward placed in each of the second to ninth rewarded epochs (horizontal axis), averaged across simulations and also across the eight rewarded epochs (the second through the ninth). Only cases in which the reward was obtained in at least a quarter of the 100 simulations in all eight epochs are plotted. The brown, red, and blue-green curves correspond to the conditions with (α+/α− for the SR-based system, α+/α− for the IR-based system) = (4, 0.25), (1, 1), and (0.25, 4), respectively (the colors match those of the corresponding pixels in Fig. 3A). The error bars indicate ±SD across the eight epochs (after taking the averages across simulations). B–D, Examples of the system-specific state values and the integrated state values. Panels in B–D correspond to the conditions with (α+/α− for the SR-based system, α+/α− for the IR-based system) = (4, 0.25), (1, 1), and (0.25, 4), respectively. The left images show the system-specific state values (top: SR-based system, bottom: IR-based system) as a heat map on the 5 × 5 grid space, and the left graphs also show the system-specific state values (dotted line: SR-based system, dashed line: IR-based system) together with the integrated state values (thick solid line), with the horizontal axis indicating the 25 states [1–5 correspond to (1,1)–(5,1) in the grid space; 6–10 correspond to (1,2)–(5,2), and so on]. Shown are the values just after the agent obtained the last reward in the last (ninth) rewarded epoch (indicated by the white crosses in the left and right panels) in single simulations in which that reward was placed at the special candidate state of that epoch. Note the differences in the vertical scales between the left graphs in B or D and the left graph in C. The right graphs show only the integrated state values, with the vertical axes in B and D enlarged compared with the left graphs, and the right images show the same integrated state values as a heat map on the 5 × 5 grid space.
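For readers reproducing the left graphs, the caption's state indexing (1–25, incrementing the x coordinate first) can be written as a one-line mapping; this helper is ours, for illustration only:

```python
def index_to_grid(k, size=5):
    """Map state index 1..25 to grid coordinates (x, y):
    1 -> (1, 1), 5 -> (5, 1), 6 -> (1, 2), ..., 25 -> (5, 5)."""
    return ((k - 1) % size + 1, (k - 1) // size + 1)
```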

Figure 6.

Performance of the model consisting of two systems that employed the same way of state representation. Results are shown for the cases where both systems employed the SR (A) or both employed the IR (B). Configurations are the same as in Figure 4 [white cross: the best-performing set of α+/α− ratios in each panel; gray number: the best performance (total rewards)]. If the α+/α− ratio was equal between the two systems, they behaved in exactly the same manner, and thus such conditions (on the vertical lines) were equivalent to having only a single system. Also, only the left (A) or right (B) side is shown because, for example, "α1+/α1− = 0.2 and α2+/α2− = 3" is equivalent to "α1+/α1− = 3 and α2+/α2− = 0.2" given that both systems employed the same representation. The left (A) and right (B) placement was chosen to facilitate visual comparison with Figure 4. C, Comparison of the maximum performances (mean total rewards) of the cases where one system employed the SR while the other employed the IR (purple), both systems employed the SR (red), and both systems employed the IR (blue), for each examined set of the time discount factor (γ), inverse temperature (β), and sum of the learning rates from positive and negative TD-RPEs (α+ + α−). The bar lengths indicate the mean performance over 100 simulations for the condition that gave the best mean performance, and the black thick and gray thin error bars indicate ±SEM and ±SD for that condition, respectively.

Figure 7.

Performance of the model consisting of SR-based and IR-based systems in a broader parameter space. The learning rate from positive or negative TD-RPEs in each system was freely varied over 0.2, 0.35, 0.5, 0.65, and 0.8. Each row of panels shows the mean performance for each set of time discount factor (γ) and inverse temperature (β; shown at the left), varying the learning rate for the update of SR features (αSRfeature, shown at the top), projected onto the plane spanned by the ratios of the learning rates from positive and negative TD-RPEs in the two systems (αSR+/αSR− and αIR+/αIR−). The 25 cases with αSR+/αSR− = αIR+/αIR− = 1 could not be legibly drawn at the single point (1, 1) in this projected plane and were thus omitted. In the cases with αSR+/αSR− = 1 or αIR+/αIR− = 1 but not αSR+/αSR− = αIR+/αIR− = 1, five cases shared the same set of (αSR+/αSR−, αIR+/αIR−) [because α+/α− = 1 corresponds to five cases with α+ = α− = 0.2, 0.35, 0.5, 0.65, or 0.8], and these five cases were drawn as concentric circles, with larger radii corresponding to larger learning rates. All other cases were drawn as crosses. The color of each symbol (cross or circle) indicates the mean performance, in reference to the color bar at the top right. Extended Data Figure 7-1, left column, lists the sets of learning rate parameters that gave the top ten mean performances for each set of time discount factor and inverse temperature.
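αSRfeature is the learning rate with which the SR matrix itself is updated. A standard TD rule for the successor matrix is sketched below; this is the textbook formulation and may differ in detail from the authors' implementation:

```python
import numpy as np

def update_sr_matrix(M, s, s_next, gamma, alpha_feature):
    """TD update of the successor representation matrix M, where
    M[s, s'] estimates the expected discounted future occupancy of
    state s' starting from s under the current policy."""
    one_hot = np.zeros(M.shape[0])
    one_hot[s] = 1.0
    M[s] += alpha_feature * (one_hot + gamma * M[s_next] - M[s])
    return M
```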

Figure 8.

Performance of the model consisting of two SR-based systems or two IR-based systems in a broader parameter space. Each row of panels shows the mean performance for each set of time discount factor (γ) and inverse temperature (β; shown at the left), varying the learning rate for the update of SR features (αSRfeature, shown at the top) for the cases with two SR-based systems (five panels from the left), projected onto the plane spanned by the ratios of the learning rates from positive and negative TD-RPEs in the two systems (α1+/α1− and α2+/α2−). The cases with α1+/α1− = α2+/α2− = 1 were not drawn; the cases with α1+/α1− = 1 or α2+/α2− = 1 but not α1+/α1− = α2+/α2− = 1 were drawn as concentric circles; and all other cases were drawn as crosses, in the same manner as in Figure 7. The color of each symbol (cross or circle) indicates the mean performance, in reference to the color bar at the top right. Only the left or right side is shown for the cases with two SR-based or two IR-based systems, respectively, for the same reason as in Figure 6. Extended Data Figure 7-1, middle and right columns, lists the sets of learning rate parameters that gave the top ten mean performances for each set of time discount factor and inverse temperature in the models consisting of two SR-based systems (middle) and two IR-based systems (right). Extended Data Figure 8-1 lists the sets of learning rates giving the top ten mean performances for the cases with (γ, β) = (0.5, 15), (0.5, 20), and (0.6, 20) in the model consisting of two SR-based systems.

Figure 9.

Performance of the model consisting of SR-based and IR-based systems when task properties were changed. The model with the original set of parameters used in Figure 3 (αSR+ + αSR− = αIR+ + αIR− = 1, β = 5, and γ = 0.7) was used. The ranges of the color bars in this figure correspond to the ranges between the lowest and highest performances (i.e., the minimum and maximum mean total rewards) in the individual panels. A, Performance (mean total rewards) for the cases where the probability of reward placement at the special candidate state was varied over 70% (leftmost panel), 80%, 90%, and 100% (rightmost). B, Performance for the cases where the reward location (state) was reset every 500 (leftmost panel), 250, 100, or 50 (rightmost) time steps in the rewarded epochs; the reward location was determined according to the original stochastic rule, i.e., the reward was placed at the special candidate state for the epoch with 60% probability and at each of the other (normal) candidate states with 5% probability. C, Performance for the case where there was only a single rewarded epoch of 4500 time steps following the initial 500-time-step no-reward epoch. D, Performance for the case where the stochasticity of reward placement within each epoch was removed and the duration of each rewarded epoch was shortened from 500 to 100 time steps while the number of rewarded epochs was increased from 9 to 45. E, Performance for the case where the special candidate state was abolished and the reward was placed at each of the nine candidate states with equal probability (1/9 = 11.11...%). F, Performance for the case where the rewarded state was determined in a fixed order, namely, (5, 1), (5, 5), (1, 5), and again (5, 1), repeated throughout the task.

Figure 10.

Performance of the two-system models in the two-stage tasks. A, Schematic diagram of the two-stage task. Selection of one of the two first-stage options led to one of the two pairs of second-stage options with fixed probabilities. B, Left, Reward probabilities for the four second-stage options in the original two-stage task. The probability for each option was independently set according to a Gaussian random walk with reflecting boundaries at 0.25 and 0.75; an example is shown. Right, Mean performance (mean total rewards over n = 1000 simulations) of the model consisting of an SR-based system and an IR-based system, with the ratio of positive-/negative-error-based learning rates (α+/α−) for each system varied under α+ + α− = 1, β = 5, and γ = 1. C, Left, Reward probabilities for the four second-stage options in a variant of the task. The probabilities for the four options were set to specific values, which changed three times during the task. Right, Mean performance of the model. D, Left, Schematic diagram of a variant of the two-stage task in which there were three first-stage options and three pairs of second-stage options. Right, Reward probabilities for the six second-stage options, which were set to specific values and changed two times during the task. E–G, Top panels, Mean performance of the model consisting of SR-based and IR-based systems (E), two SR-based systems (F), or two IR-based systems (G) in the task variant with three first-stage options and three pairs of second-stage options. Bottom graphs, Mean (black solid line), SEM (black thick error bars; though hardly visible), and SD (gray thin error bars) of the performance over n = 1000 simulations for the conditions where the product of the α+/α− ratios of the two systems was equal to 1 (i.e., the conditions on the horizontal diagonal in the top panels).
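The drifting reward probabilities in B (a Gaussian random walk with reflecting boundaries at 0.25 and 0.75) can be generated as follows; the step size σ and the initial value 0.5 are our assumptions, since the caption does not specify them:

```python
import numpy as np

def reflect(p, lo=0.25, hi=0.75):
    """Fold a value back into [lo, hi] by reflection at the boundaries."""
    while p < lo or p > hi:
        if p < lo:
            p = 2 * lo - p
        if p > hi:
            p = 2 * hi - p
    return p

def random_walk_probs(n_steps, n_options=4, sigma=0.025, seed=0):
    """Independent Gaussian random walks with reflecting boundaries,
    one walk per second-stage option."""
    rng = np.random.default_rng(seed)
    probs = np.full(n_options, 0.5)
    trace = np.empty((n_steps, n_options))
    for t in range(n_steps):
        probs = np.array([reflect(p + rng.normal(0.0, sigma)) for p in probs])
        trace[t] = probs
    return trace
```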

Figure 11.

Explanation of the significance and mechanism of diverse findings about the cortico-BG circuits. A, Experimentally suggested limbic/visual cortical encoding of SR and limbic/visual->D1/direct and primary motor->D2/indirect preferential connections indicate preferential use of SR in the appetitive D1/direct pathway, whose functional merit was explained by the results of our simulations. B, Experimentally suggested involvement of IT-type, but not PT-type, corticostriatal neurons in goal-directed behavior, together with IT->D1/direct and PT->D2/indirect preferential connections, also indicates preferential use of SR in the appetitive D1/direct pathway. C, Less reciprocal IT-IT connections and wider IT->striatum axonal endpoints are in line with engagement of IT (rather than PT) neurons in SR-like representation. D, D1 neurons' similar responses to stimuli predicting similar outcomes, and D2 neurons' weaker stimulus-outcome association and stronger selectivity to stimulus identity, in the ventral striatal olfactory tubercle could be explained by preferential use of SR-like and IR-like representations in the D1 and D2 pathways, respectively. E, Engagement of D1/direct neurons in the dorsomedial striatum (DMS), a suggested locus of model-based behavior, in promotion of action, and of D2/indirect neurons in the dorsolateral striatum (DLS), a suggested locus of model-free behavior, in suppression of action, is potentially in line with the combination of appetitive SR-based and aversive IR-based systems in our model.

Extended Data

Extended Data 1

The code used in the present paper; its content is described in the readme.txt file. Download Extended Data 1, ZIP file.

Extended Data Figure 7-1

Performance of the models consisting of two systems in a broader parameter space. The left, middle, and right columns show the results for the model consisting of SR-based and IR-based systems, two SR-based systems, and two IR-based systems, respectively. Each subtable shows the sets of learning rate parameters that gave the top ten mean performances for each set of time discount factor (γ) and inverse temperature (β; shown at the left) in each of the three models. For the model consisting of SR-based and IR-based systems, cases with a combination of an appetitive (α+/α− > 1) SR-based system and an aversive (α+/α− < 1) IR-based system are shown in bold italic; notably, in all other cases shown in the subtables except one (shown in italic with an asterisk at the right), the α+/α− ratio was still higher in the SR-based system than in the IR-based system. Download Figure 7-1, DOCX file.

Extended Data Figure 8-1

Results of additional simulations for the model consisting of two SR-based systems. Each subtable shows the sets of learning rate parameters that gave the top ten mean performances for each set of time discount factor (γ) and inverse temperature (β) shown at the left. Download Figure 8-1, DOCX file.
