Model-based predictions for dopamine
Introduction
The striking correspondence between the phasic responses of midbrain dopamine neurons and the temporal-difference reward prediction error posited by reinforcement-learning theory is by now well established [1, 2, 3, 4, 5]. According to this theory, dopamine neurons broadcast a prediction error: the difference between the sum of the current reward and the value of the next state, on the one hand, and the learned predictive value of the current state (signaled by cues or features of the environment), on the other. Central to the normative grounding of temporal-difference reinforcement learning (TDRL) is the definition of ‘value’ as the expected sum of future (possibly discounted) rewards [6], from which the learning rule can be derived directly. The algorithm also provides a simple way to learn such values using prediction errors, a computation thought to be implemented in the brain through dopamine-modulated plasticity at corticostriatal synapses [7, 8] (Figure 1, left). This theory provides a parsimonious account of a number of features of dopamine responses across a range of learning tasks [9, 10, 11, 12].
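In standard notation, with discount factor $\gamma$ and learning rate $\alpha$, the quantities this theory refers to can be written compactly (these are textbook TDRL definitions, not equations reproduced from a particular study):

```latex
% 'Value': the expected (possibly discounted) sum of future rewards
V(s_t) = \mathbb{E}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \right]

% Temporal-difference prediction error, the putative dopamine signal
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

% Learning rule derived from this definition: nudge the value of the
% current state in proportion to the prediction error
V(s_t) \leftarrow V(s_t) + \alpha \delta_t
```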
Section snippets
Are model-free dopamine prediction errors a red herring?
A core tenet of TDRL is that it is ‘model-free’: learned state values are aggregate, scalar representations of total future expected reward, in some common currency [1, 13]. That is, the value of a state is a quantitative summary of future reward amount, irrespective of either the specific form of the expected reward (e.g., water, food, a combination of the two), or the sequence of future states through which it will be obtained (e.g., will water be presented before or after food?). Critically,
Temporal representation and dopamine
One notable property of dopamine prediction errors is that they are temporally precise: if an expected reward is omitted, the phasic decrease in dopamine neuron activity appears just after the time the reward would have occurred [2]. It is this phenomenon that inspired the TDRL algorithm, which models such temporally precise predictions by postulating sequences of time-point states that are triggered by a stimulus (a representation known as the ‘complete serial compound’ (CSC), or a ‘tapped delay line’).
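To make this representation concrete, the following is a minimal simulation sketch (ours, for illustration; not code from the studies cited here). It assumes a single cue at the start of each trial, a fixed cue-reward delay, and one one-hot time-point state per time step:

```python
# Minimal sketch of TD learning over a 'complete serial compound'
# (tapped-delay-line) representation: a cue at t = 0 triggers a cascade
# of one-hot time-point states, one per time step, so reward predictions
# (and prediction errors) are temporally precise.
import numpy as np

T = 10           # time steps per trial; the cue arrives at t = 0
REWARD_TIME = 5  # reward is delivered 5 steps after the cue
GAMMA = 1.0      # no discounting, for simplicity
ALPHA = 0.1      # learning rate

w = np.zeros(T)  # one learned value per time-point state

def run_trial(w, rewarded=True):
    """Run one trial, updating w in place; return the TD error at each step."""
    delta = np.zeros(T)
    for t in range(T):
        r = 1.0 if (rewarded and t == REWARD_TIME) else 0.0
        v_next = w[t + 1] if t + 1 < T else 0.0  # value of next time-point state
        delta[t] = r + GAMMA * v_next - w[t]     # TD prediction error
        w[t] += ALPHA * delta[t]                 # dopamine-gated value update
    return delta

for _ in range(500):                         # training: reward always follows the cue
    run_trial(w, rewarded=True)

probe = run_trial(w.copy(), rewarded=False)  # probe trial: omit the reward
print(np.round(probe, 2))
```

After training, the probe trial yields an error near zero at every step except a dip of about -1 at the usual reward time, mirroring the temporally precise pause in dopamine firing when an expected reward is omitted.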
Not all dopaminergic predictions are learned through direct experience
Indeed, a central aspect of TDRL that makes it model-free is that, in the algorithm, values for a state are learned (and cached) through direct experience with that state. Recent work suggests, however, that phasic dopamine may reflect values that have been learned indirectly. Of particular relevance is a sensory preconditioning experiment showing that reward predictions ascribed to a cue solely through its relationship to another neutral cue are reflected in dopamine neuron firing. Here,
Multiple dimensions of prediction in dopamine responses
Another fundamental property of TDRL is that it learns aggregate, scalar predictions of the sum of future rewards predicated on occupying the current state — a ‘common currency’ value that sums over apples, oranges, sex and sleep. As alluded to above, and complicating the mapping between dopamine and TDRL even further, it appears that dopamine neurons respond to deviations from predictions in dimensions other than scalar value [49]. In particular, prediction errors have been recorded for an
Model-based learning with dopamine prediction errors
All told, current findings suggest that dopamine neurons have access to model-based representations of expected rewards that reflect learned properties beyond a scalar representation of value (Figure 1, right). However, the convergence of TDRL to a useful value representation stems from the alignment between the computational goal of the agent (to maximize total reward through value-guided action) and the single dimension along which reward predictions are represented (i.e., scalar value).
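One way to make this tension concrete, as a toy formalization rather than a model proposed here, is to let predictions range over outcome identities instead of a single scalar. A TD-style error then exists for each identity; the two-outcome setup, state names, and utilities below are illustrative assumptions:

```python
# Toy sketch (an illustration, not the authors' model): predictions as
# vectors over outcome identities (e.g., water, food) rather than a single
# scalar value. A per-identity TD error means that a change in WHICH reward
# is expected produces an error even when scalar value is unchanged.
import numpy as np

UTILITY = np.array([1.0, 1.0])   # common-currency worth of [water, food]
ALPHA, GAMMA = 0.1, 1.0

# V[s] is a vector: expected future amount of each outcome identity from s
V = {s: np.zeros(2) for s in ("cue", "outcome", "end")}

def td_step(s, s_next, r_vec):
    """Vector-valued TD update; returns the per-identity prediction error."""
    delta = r_vec + GAMMA * V[s_next] - V[s]
    V[s] += ALPHA * delta
    return delta

# Training: cue -> outcome state, where 1 unit of water is delivered
for _ in range(300):
    td_step("cue", "outcome", np.zeros(2))
    td_step("outcome", "end", np.array([1.0, 0.0]))

# Probe: deliver food of equal utility instead of the expected water
delta = np.array([0.0, 1.0]) + GAMMA * V["end"] - V["outcome"]
print("scalar (common-currency) error:", round(float(UTILITY @ delta), 2))
print("identity-specific errors:", np.round(delta, 2))
```

When equally valued food replaces the expected water, the scalar common-currency error is near zero, yet the identity-specific errors are large; errors of this latter kind are precisely what a purely scalar teaching signal cannot carry.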
So what is the role of dopamine in learning?
One thing that these recent studies make clear is that a better understanding of the computational role of dopamine entails a broader consideration of what it means for a reinforcement learning algorithm to be ‘model-based’ [34]. Model-based prediction in RL has been most strongly identified with the use of models for forward planning, enabling values to be computed on the fly (as opposed to cached) in order to flexibly support goal-directed behavior [65]. But models may also be exploited to
Conflict of interest statement
Nothing declared.
References and recommended reading
Papers of particular interest, published within the period of review, have been highlighted as:
• of special interest
•• of outstanding interest
Acknowledgements
This work was funded by grant R01DA042065 from the National Institute on Drug Abuse (AJL, YN), grant W911NF-14-1-0101 from the Army Research Office (YN, MJS), an NHMRC CJ Martin fellowship (MJS), and the Intramural Research Program at the National Institute on Drug Abuse (ZIA-DA000587) (MJS, GS). The opinions expressed in this article are the authors’ own and do not reflect the view of the NIH/DHHS.
References (68)
- et al. Dialogues on prediction errors. Trends Cogn Sci (2008)
- et al. Midbrain dopamine neurons encode a quantitative reward prediction error signal. Neuron (2005)
- et al. Reinforcement learning: the good, the bad and the ugly. Curr Opin Neurobiol (2008)
- et al. Temporal difference models and reward-related learning in the human brain. Neuron (2003)
- et al. Reinforcement learning in multidimensional environments relies on attention mechanisms. J Neurosci (2015)
- et al. Dynamic interaction between reinforcement learning and attention in multidimensional environments. Neuron (2017)
- A unifying probabilistic view of associative learning. PLoS Comput Biol (2015)
- et al. Planning and acting in partially observable stochastic domains. Artif Intell (1998)
- et al. Learning latent structure: carving nature at its joints. Curr Opin Neurobiol (2010)
- Acquisition of representation-mediated conditioned food aversions. Learning Motiv (1981)