Selective Effects of the Loss of NMDA or mGluR5 Receptors in the Reward System on Adaptive Decision-Making

Introduction
Midbrain dopamine (DA) neurons originate from the ventral tegmental area and substantia nigra and, along with the major targets of their projections, including dopaminoceptive neurons in the frontal cortex and basal ganglia, play a central role in the organization of adaptive behavior (Berridge and Robinson, 1998; Wise, 2004; Floresco and Magyar, 2006; Salamone and Correa, 2012). In rodents and nonhuman primates, the burst firing of midbrain DA neurons and the subsequent phasic release of DA encode reward prediction error (Schultz et al., 1997; Bayer and Glimcher, 2005; Hart et al., 2014). This error in reward expectation is a signal of the need to modify synaptic plasticity at corticostriatal synapses and update the action values stored by striatal neurons (Reynolds et al., 2001; Samejima et al., 2005; Lee et al., 2012). In this way, the DA system provides a neural substrate for reinforcement learning mechanisms underlying decision-making and action selection (Glimcher, 2011; Schultz, 2015). It should be noted though that the role of DA extends beyond reinforcement learning, as it is also involved in the regulation of motivation and vigor as well as the performance of instrumental behavior (Salamone and Correa, 2012; Shiner et al., 2012).
The activity and plasticity in the DA system are largely dependent on excitatory glutamatergic transmission. Glutamatergic inputs activate NMDA receptors and drive the burst firing in DA neurons (Overton and Clark, 1992; Chergui et al., 1993), phasic DA release (Sombers et al., 2009; Wickham et al., 2013), and induction of long-term potentiation onto the dopaminergic neurons underlying cue-reward learning (Stuber et al., 2008; Harnett et al., 2009). Moreover, NMDA receptors and metabotropic glutamate receptor 5 (mGluR5) are crucial for the induction of synaptic and structural plasticity in dopaminoceptive striatal medium spiny neurons (Calabresi et al., 2007; Surmeier et al., 2009; Yagishita et al., 2014). Altogether, these observations indicate that glutamate-dependent signaling is crucial for DA-mediated reinforcement. However, in most studies, the observations are based on correlations and in vitro measurements; therefore, the causality or degree of contribution remains uncertain.
A more direct approach for testing the role of glutamate-dependent signaling in reinforcement learning is the use of genetically modified mice with an inactivation of glutamate receptors in DA or dopaminoceptive neurons. Such models have been generated and generally observed to result in impairments in tasks involving instrumental and pavlovian learning, confirming that a disruption in glutamate-dependent signaling in the DA system is sufficient to cause an impairment in reward-based learning (Zweifel et al., 2009; Novak et al., 2010; Parker et al., 2010; Beutler et al., 2011; Wang et al., 2011; James et al., 2015). However, most experiments were conducted using paradigms in which only a single lever or conditioned stimulus was reinforced. Therefore, a crucial aspect of adaptive decision-making (i.e., choosing among competing courses of action in a changing environment) was not comprehensively addressed in those studies.
Here, we sought to determine the contribution of glutamate receptor-dependent signaling in DA and dopaminoceptive neurons to adaptive decision-making. We used mice with cell type-specific, tamoxifen-inducible inactivation of NMDA receptors in DA and D1 receptor-expressing neurons (Jastrzębska et al., 2016; Sikora et al., 2016) and animals with a knockdown of mGluR5 receptors in D1 neurons (Novak et al., 2010; Rodriguez Parkitna et al., 2013). The animals were tested using a probabilistic reinforcement learning task, in which the mouse is required to estimate the expected value of two alternatives associated with different reward probabilities by trial and error. This task was followed by a probability-discounting task in which the animal is required to choose between two options that provide rewards that differ in magnitude (small vs large) and probability (certain vs uncertain).

Animals
The following three strains of genetically modified mice were used in the study: NR1 DATCreERT2 mice, which had an inducible deletion of the NR1 subunit of the NMDA receptor in DA transporter (DAT)-expressing neurons (Jastrzębska et al., 2016); NR1 D1CreERT2 animals, which had an inducible loss of the NR1 subunit of the NMDA receptor in D1 receptor-expressing neurons (Sikora et al., 2016); and mGluR5 KD-D1 mice, which had a selective knockdown of the mGluR5 receptor in D1-expressing neurons (Novak et al., 2010; Rodriguez Parkitna et al., 2013). All strains were bred to be congenic with the C57BL/6N strain. Genotyping was performed as previously described. The animals were housed two to five per cage in a room with a controlled temperature of 22 ± 2°C under a 12 h light/dark cycle. Unless otherwise indicated, the mice had ad libitum access to tap water and standard rodent laboratory chow.
Regarding the CreERT2-dependent mutations, the recombination was induced in adult animals at the age of 8-10 weeks using tamoxifen treatment. Tamoxifen (Sigma-Aldrich) was dissolved in sunflower oil, filtered through a 0.22 µm membrane, and injected intraperitoneally once a day for 5 consecutive days at a dose of 100 mg/kg and a volume of 5 µl/g. The genotype of the mutant mice was [Tg/0; flox/flox], and the genotype of the control animals was [0/0; flox/flox]. All tamoxifen-treated animals were allowed to rest for at least 3 weeks before the start of the behavioral procedures. Regarding mGluR5 KD-D1, no induction was necessary, and the expression of the transgene was initiated when the D1 promoter became active during late development. The genotype of the mutant mGluR5 KD-D1 animals was [Tg/0], and the genotype of their respective controls was [0/0].
Only male mice were used in the study. The mean ages and weights of the cohorts of animals used in the experiments were as follows: 16.25 ± 1.05 weeks and 25.6 ± 0.85 g for the NR1 DATCreERT2 mice and 16.57 ± 1.15 weeks and 29.43 ± 0.62 g for their respective controls; 18.33 ± 0.94 weeks and 26.39 ± 1.21 g for the NR1 D1CreERT2 mice and 19.33 ± 1.08 weeks and 27.85 ± 1.37 g for their controls; and 13.38 ± 1.31 weeks and 25.8 ± 1.1 g for the mGluR5 KD-D1 mice and 13.56 ± 1.12 weeks and 24.98 ± 1.02 g for their controls. The same cohorts of animals were used in the probabilistic reinforcement learning and probability-discounting tasks.

Behavioral procedures

Water deprivation
A week before the behavioral testing, water consumption was limited to 1-1.5 ml/d, and this water restriction schedule was maintained for the duration of the experiments. The mice were trained 5-7 d/week, and their body weight was monitored daily. The water restriction was lessened if the mice fell to <85% of their body weight from the beginning of the deprivation.

Apparatus
The experiments were performed using mouse operant chambers (ENV-307W-CT, Med Associates) enclosed in cubicles that were equipped with a fan to provide ventilation and mask extraneous noise. Each chamber was equipped with a dual cup liquid receptacle, a nose-poke port containing a cue light located on each side of a liquid receptacle, and a house light located on the wall opposite to the liquid receptacle. Saccharin-flavored water (0.01% w/v saccharin; Sigma-Aldrich) was delivered into an individual cup by an infusion pump (PHM-100, Med Associates) connected to the liquid receptacle via a silicone tube. The amount of fluid delivered (reward size) was dependent on the duration of the infusion.

Training
First, the mice were placed in the operant chamber for 30 min, during which 20 µl of water were delivered into the receptacle at 60 s intervals. This procedure allowed the animals to become familiar with the chamber and liquid reward. On subsequent days, the mice were trained under a continuous reinforcement schedule and were rewarded with 10 µl of water after poking their noses into the active port (with the cue light on). The other port was inactive. The nose pokes in the inactive port were recorded but had no consequences. The port assignment was counterbalanced, and the animals were trained until they reached the criterion of 60 rewarded responses in 40 min, which occurred first in one port and then in the other port in a subsequent session. This training was followed by additional training during which the left and right ports were active once in every pair of trials, and the order within the pair was random. These sessions ended when an animal completed 100 trials or 60 min elapsed, whichever came first. There was no limit to the trial duration, and each trial ended when a nose poke in the active port resulted in the delivery of a reward, followed by a 5 s intertrial interval (ITI). The animals had to complete at least 85 trials. Finally, the mice underwent omission training, which was similar to the training described above with two exceptions. First, the trial number was increased to 160. Second, responding in the active port resulted in a 50% chance of reward omission. Reward omission was signaled by switching on the house light for the duration of the ITI. The animals had to complete at least 120 trials.

Probabilistic reinforcement learning task
In this task, the nose-poke ports were randomly assigned reward probabilities of 80% or 20% (Fig. 1A). During each session, the reward probabilities were reversed after 60 trials. Thus, to maximize the long-term sum of the rewards, the mouse had to select the alternative with the higher success probability and adapt its choices to the changes in the reward contingencies. There was no limit to the trial duration, and the session ended when the animal completed 120 trials or 60 min elapsed. Rewarded choices resulted in the delivery of 10 µl of water, followed by a 5 s ITI. Unrewarded choices were signaled by turning on the house light for the duration of the ITI. The animals were trained in this task for 15 sessions.
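The reward schedule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code that controlled the chambers: the function name `simulate_session`, the `choose` callback, and the fixed random seed are hypothetical, and the 60 min session limit is ignored.

```python
import random

def simulate_session(choose, n_trials=120, reversal_at=60,
                     p_high=0.8, p_low=0.2, seed=0):
    """Run one session; `choose` maps a trial index to a port (0 or 1)."""
    rng = random.Random(seed)
    high_port = rng.randrange(2)       # ports are randomly assigned 80% / 20%
    trials = []
    for t in range(n_trials):
        if t == reversal_at:           # contingencies reverse after 60 trials
            high_port = 1 - high_port
        port = choose(t)
        p_reward = p_high if port == high_port else p_low
        rewarded = rng.random() < p_reward
        trials.append((port, high_port, rewarded))
    return trials

# Hypothetical agent that always pokes port 0: it is on the
# high-probability side for exactly one half of the session.
trials = simulate_session(lambda t: 0)
```

An agent that never switches ports selects the better alternative on exactly half of the trials, which is why adapting to the reversal is needed to maximize reward.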

Probability-discounting task
In this task, one nose-poke port was associated with the delivery of a small reward (10 µl), while the other nose-poke port was associated with the delivery of a large reward (20 µl). Each session consisted of 20 forced trials, followed by 40 free choice trials (see Fig. 5A). During the forced trials, only one port was active, whereas during the free choice trials, both ports were active. Once the preference for the large reward was stabilized, the probability of its delivery gradually decreased to 75%, 50%, or 25% during subsequent blocks of four to five sessions. Simultaneously, the small reward was always available at a 100% probability. The trials were separated by a 5 s ITI, and unrewarded choices were signaled by turning on the house light for the duration of the ITI.
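As a point of reference for the probabilities used in this task, the mean volume obtained per choice can be computed directly. The sketch below is only a normative benchmark under the assumption that reward value is linear in volume; it is not part of the analysis reported here, and the names `expected_volume`, `SMALL_UL`, and `LARGE_UL` are hypothetical.

```python
SMALL_UL = 10   # small reward (µl), always delivered
LARGE_UL = 20   # large reward (µl), delivered with probability p

def expected_volume(size_ul, probability):
    """Mean volume per choice, in microliters."""
    return size_ul * probability

for p in (1.0, 0.75, 0.5, 0.25):
    large = expected_volume(LARGE_UL, p)
    small = expected_volume(SMALL_UL, 1.0)
    print(f"p(large)={p:.2f}: large={large:.1f} ul, small={small:.1f} ul")
```

Under this benchmark the two ports are equivalent at 50% (20 µl × 0.5 = 10 µl), the large reward yields more at 100% and 75%, and the small reward yields more at 25%.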

Statistical analysis
A script written in R was used to parse the data files that were generated during the behavioral experiments. All statistical analyses were conducted using GraphPad Prism 7 (GraphPad Software) and R software. Statistical significance was estimated using an ANOVA, followed by a Bonferroni post hoc test or a Student's t test, as appropriate. The results were considered significant at α = 0.05. One animal from the control group in the NR1 DATCreERT2 strain was classified as an outlier (Grubbs' test) in choice reaction time measures (in both tasks) and was excluded from all analyses. Two animals (one from the NR1 DATCreERT2 strain and one control mouse from the NR1 D1CreERT2 strain) showed no preference for the freely available large reward in the probability-discounting task (0.5 ± 0.5% and 1.5 ± 0.6%, respectively) and were excluded from the analysis of this task, to avoid the misinterpretation of the effect of discounting. Confidence intervals (CIs) for post hoc comparisons are listed in Table 1.

Figure 1. The probabilistic reinforcement learning task. A, Schematic representation of the probabilistic reinforcement learning task. The animal could make a nose-poke in one of two ports. Following a nose-poke, water was delivered with a probability that depended on the chosen port. The nose-poke ports were randomly assigned 80% or 20% reward probabilities. During each session, the reward probabilities were reversed after 60 trials. B, An example of the choice behavior of a mouse in 600 trials (sessions 6-10). The black line shows the probability of choosing the left side (data smoothed with a 21-point moving average). The cyan bars indicate the side with the higher probability of reward delivery. The red dashed lines indicate session boundaries. C-H, Probability of selecting the alternative with the higher reward probability by the NR1 DATCreERT2 (mutant, n = 6; control, n = 8; C, F), mGluR5 KD-D1 (mutant, n = 8; control, n = 9; D, G), and NR1 D1CreERT2 (mutant, n = 6; control, n = 9; E, H) strains. C-E, Session-by-session analysis; data were collapsed across trials. F-H, Trial-by-trial analysis; data were collapsed across sessions. Data are represented as the mean ± SEM.

Computational modeling
We fitted three reinforcement learning models to trial-by-trial choice data of the probabilistic reinforcement learning task, which are all based on the Rescorla-Wagner model (Rescorla and Wagner, 1972), but include additional features. Model 1 assumes that animals learn with different rates when the reward prediction error is positive and negative (den Ouden et al., 2013). Model 2 assumes that the animals have learned that entering only one of the ports results in a high reward probability, so in this model after choosing one option, the expected rewards for both options are modified in opposite directions (Gläscher et al., 2009). Model 3 integrates models 1 and 2, so it includes separate learning rates for positive and negative prediction errors of the chosen option and updates the unchosen option using the fictitious learning component of model 2.
As model 3 is the most general, we start with its description, and then present how it can be simplified to give models 1 and 2. In model 3, the expected values of the chosen ($V_{c,t}$) and unchosen ($V_{uc,t}$) options are updated as follows on each trial $t$. If the prediction error on trial $t$ ($PE_t = r_t - V_{c,t}$) is $\geq 0$, the expected values of the chosen and unchosen options are updated with learning rate $\alpha^{+}$ ($0 \leq \alpha^{+} \leq 1$), as follows:

$$V_{c,t+1} = V_{c,t} + \alpha^{+} \, PE_t \tag{1}$$

$$V_{uc,t+1} = V_{uc,t} + \alpha^{+} \, PE'_t \tag{2}$$

Note that the unchosen option is updated with a fictitious prediction error ($PE'_t = -r_t - V_{uc,t}$) following the study by Gläscher et al. (2009). If $PE_t$ is $< 0$, the expected values of the chosen and unchosen options are updated with learning rate $\alpha^{-}$ ($0 \leq \alpha^{-} \leq 1$):

$$V_{c,t+1} = V_{c,t} + \alpha^{-} \, PE_t \tag{3}$$

$$V_{uc,t+1} = V_{uc,t} + \alpha^{-} \, PE'_t \tag{4}$$

In the simulations, $r_t$ is set to 1 if reward is received on trial $t$, or to $-1$ if it is omitted. Choice probabilities are computed based on the expected values as follows. If A and B refer to the two options of the probabilistic reinforcement learning task and $p_{t+1}(A)$ refers to the probability of choosing the option A on trial $t + 1$, then:

$$p_{t+1}(A) = \frac{1}{1 + \exp\left(\beta \, (V_{B,t+1} - V_{A,t+1})\right)} \tag{5}$$

Here, $\beta$ ($0 \leq \beta$) is the inverse temperature parameter, which governs the degree of exploitation and exploration (i.e., low and high values of $\beta$ indicate more exploration and exploitation, respectively). In summary, model 3 has three free parameters: $\alpha^{+}$ (learning rate for positive PE), $\alpha^{-}$ (learning rate for negative PE), and $\beta$ (inverse temperature). If we set $\alpha^{+} = \alpha^{-} = \alpha$, the model becomes model 2, which has two free parameters: $\alpha$ (learning rate) and $\beta$ (inverse temperature). If we only update the values of chosen options using Equations 1 and 3 (but do not use Equations 2 and 4), the model becomes model 1, which also has three free parameters: $\alpha^{+}$ (learning rate for positive PE), $\alpha^{-}$ (learning rate for negative PE), and $\beta$ (inverse temperature).
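The verbal description of model 3 (sign-dependent learning rates, a fictitious update of the unchosen option, and a softmax choice rule) can be sketched as a single-trial update in Python. This is an illustrative sketch and not the hBayesDM implementation: the function names are hypothetical, options are indexed 0 (A) and 1 (B), and the two-option logistic form of the choice rule is an assumption.

```python
import math

def update_values(v, choice, reward, alpha_pos, alpha_neg):
    """One trial of the model 3 update.

    v      : [V_A, V_B], expected values before the trial
    choice : 0 (option A) or 1 (option B)
    reward : +1 if the reward was delivered, -1 if it was omitted
    """
    unchosen = 1 - choice
    pe = reward - v[choice]              # prediction error for the chosen option
    pe_fict = -reward - v[unchosen]      # fictitious error for the unchosen option
    alpha = alpha_pos if pe >= 0 else alpha_neg
    new_v = list(v)
    new_v[choice] += alpha * pe
    new_v[unchosen] += alpha * pe_fict
    return new_v

def prob_choose_a(v, beta):
    """Probability of choosing option A given values v and inverse temperature beta."""
    return 1.0 / (1.0 + math.exp(beta * (v[1] - v[0])))

# A rewarded choice of A raises V_A and lowers V_B by the same amount.
v_after_win = update_values([0.0, 0.0], choice=0, reward=1,
                            alpha_pos=0.5, alpha_neg=0.2)
```

Setting `alpha_pos == alpha_neg` gives the behavior of model 2, and skipping the update of `new_v[unchosen]` gives model 1.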
We fitted the three models using hierarchical Bayesian analysis (HBA), which pools information across individuals and allows us to capture both individual differences and commonalities across subjects in a reliable way (Shiffrin et al., 2008; Ahn et al., 2011). To perform HBA, we used the hBayesDM package (Ahn et al., 2017), which is an R package that offers HBA of various computational models and tasks using the Stan software (Carpenter et al., 2017). The hBayesDM functions of models 1-3 are prl_rp, prl_fictitious_woa, and prl_fictitious_rp_woa, respectively. All source code and Bayesian model formulations are available in its GitHub repository: https://github.com/CCS-Lab/hBayesDM. We performed model comparisons and identified a best-fitting model using the leave-one-out cross-validation information criterion (LOOIC). To compute LOOIC for a given model we used the loo R package, which computes leave-one-out predictive density using Pareto smoothed importance sampling (Vehtari et al., 2017). The LOOIC inherently penalizes model complexity, as an overly complicated model will perform worse on unseen data than a simpler model. It also has an advantage over other measures designed to prevent overfitting by an overly complex model (such as the Akaike or Bayesian information criterion) in that it measures overfitting directly.

Simulation Analysis
To test whether the best-fitting model can describe the observed data well, we performed simulation analysis as previously described (Ahn et al., 2008;Steingroever et al., 2014). Briefly, by using estimated individual parameters alone (without access to trial-by-trial choice history), we generated simulated agents and computed their win-stay and lose-shift (switching to the alternative choice when the preceding response yielded no reward) probabilities. When we generated simulated data, for each group and condition, we used its total number of trials and subjects of the real data. Then, we simulated choices on the probabilistic reinforcement learning task using estimated individual parameters (individual posterior means) of each simulated agent for the whole trajectory (i.e., 1800 trials) using customized R codes.
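The win-stay and lose-shift probabilities used in this comparison can be computed from any sequence of choices and outcomes, whether observed or simulated. The sketch below is written in Python for illustration (the analysis itself used R); the function name is hypothetical, and for simplicity the sequence is treated as continuous, so transitions across session boundaries are not excluded.

```python
def win_stay_lose_shift(choices, rewards):
    """Return (P(stay | previous trial rewarded), P(shift | previous trial unrewarded))."""
    wins = stays = losses = shifts = 0
    # Pair each trial's choice and outcome with the choice on the next trial.
    for prev_choice, prev_reward, next_choice in zip(choices, rewards, choices[1:]):
        if prev_reward:
            wins += 1
            stays += (next_choice == prev_choice)
        else:
            losses += 1
            shifts += (next_choice != prev_choice)
    p_win_stay = stays / wins if wins else float("nan")
    p_lose_shift = shifts / losses if losses else float("nan")
    return p_win_stay, p_lose_shift

# Hypothetical five-trial example: ports coded 0/1, rewards as booleans.
p_ws, p_ls = win_stay_lose_shift([0, 0, 1, 1, 1],
                                 [True, False, True, False, True])
```

In this example both rewarded trials are followed by a repeated choice, and one of the two unrewarded trials is followed by a switch.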

Computational modeling results
We tested fits of three reinforcement learning models based on reward prediction error. Table 2 shows the LOOIC scores for the three models compared. For all groups tested, model 3 outperformed the others and had the lowest LOOIC scores by a large margin. Model 3 assumes that animals learn with different rates when the prediction error is positive or negative, and also that mice take the higher-order structure of the task into account, namely that they learn that at a given time only one of the ports gives high reward probability. Thus, in model 3 when unexpected reward is obtained following nose-poke to the left port, the expected reward associated with this port is increased, while the expected reward for the right port is decreased.
A summary of parameters calculated for the best-fitting model is shown in Figure 2A-C. For each parameter, we quantified the effect of the mutation by calculating the difference of hyperposterior distributions between mutant and control mice (Ahn et al., 2014), which is summarized as the 95% highest density interval (HDI). The 95% HDI refers to the range of parameter values that spans 95% of the distribution. If the 95% HDI of the difference lies far above or below 0, there is strong evidence of a group difference. While binary interpretations of the 95% HDI should be avoided, it is possible to check whether the 95% HDI excludes 0 for a heuristic judgment of "credible" group differences. As in the case of previous analyses, credible effects of mutations were observed in the NR1 DATCreERT2 mice (95% HDI = [−0.461, −0.102]) and mGluR5 KD-D1 mice (95% HDI = [−0.516, −0.143]). We found that the mutation in the NR1 DATCreERT2 and mGluR5 KD-D1 strains affected the inverse temperature ($\beta$) parameter, with mutant mice making more random rather than value-driven choices. However, the mutation did not

Table 1. Figure; Data structure; Type of test; 95% CIs or 95% HDIs. Assumed normal distribution; Bonferroni-corrected t test; (0.636, −6.974). Figure 6C forced 100%; Assumed normal distribution; Bonferroni-corrected t test; (2.767, −5.107). Figure 6C forced 75%; Assumed normal distribution; Bonferroni-corrected t test; (3.120, −4.754). Figure 6C forced 50%; Assumed normal distribution; Bonferroni-corrected t test; (3.388, −4.486). Figure 6C forced 25%; Assumed normal distribution; Bonferroni-corrected t test; (2.297, −5.578). Figure 6C free 100%; Assumed normal distribution; Bonferroni-corrected t test; (2.447, −5.754). Figure 6C free 75%; Assumed normal distribution; Bonferroni-corrected t test; (3.649, −4.552). Figure 6C free 50%; Assumed normal distribution; Bonferroni-corrected t test; (4.181, −4.020). Figure 6C

In agreement with the analysis of learning behavior of the NR1 DATCreERT2 group (Fig. 1F), the means of the posterior distributions of the learning rates for this group were lower than those of controls (Fig. 2A). However, unexpectedly, this effect was not credible (95% HDI = [−0.360, 0.078] for the reward learning rate; 95% HDI = [−0.295, 0.129] for the punishment learning rate). We did not observe any other credible effects of any of the mutations on learning rates. Another interesting observation was that in all groups, learning rates tended to be higher for positive than negative outcomes. Such a relationship between the learning rates has been observed before in a probabilistic choice task, and was proposed to arise because the animals might have learned that one option gives a higher reward on average, so a single reward omission may just be noise and should not change the behavior (Grogan et al., 2017). In summary, the computational modeling indicated that mutations significantly affected only the parameter influencing the preference for the alternative with a higher expected outcome. Additionally, the behavior in general was most consistent with models that included updates of the expected value of the nonselected alternative.
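The HDI-based comparison described above can be illustrated with a short Python sketch. It assumes that posterior draws of a group-level parameter are available as plain lists for mutants and controls, and that pairing the draws index by index is acceptable; the function names are hypothetical and this is not the routine used by hBayesDM.

```python
import math

def hdi(samples, mass=0.95):
    """Narrowest interval that contains a fraction `mass` of the samples."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, math.ceil(mass * n - 1e-9))   # number of samples inside the interval
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]

def group_difference_hdi(mutant_samples, control_samples, mass=0.95):
    """HDI of the mutant-minus-control difference of posterior draws."""
    diffs = [m - c for m, c in zip(mutant_samples, control_samples)]
    return hdi(diffs, mass)

def excludes_zero(interval):
    """Heuristic check for a 'credible' group difference."""
    low, high = interval
    return low > 0 or high < 0
```

Applied to the intervals reported above, `excludes_zero((-0.461, -0.102))` is true (inverse temperature, NR1 DATCreERT2), whereas `excludes_zero((-0.360, 0.078))` is false (reward learning rate in the same strain).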
The overall higher proportion of win-stay than lose-shift trials is in qualitative agreement with the higher learning rate from positive than from negative feedback (Fig. 2). To test whether the model can quantitatively reproduce the proportions of win-stay and lose-shift trials, we simulated the performance of model 3 with parameters set to the means of the posterior distributions in Figure 2A-C (Fig. 3D-F). Comparisons of actual (Fig. 3A-C) and simulated (Fig. 3D-F) behavioral performance revealed that our model indeed describes the observed data very well. Consistent with actual data, simulated NR1 DATCreERT2 and mGluR5 KD-D1 mice were significantly less likely to repeat the previously rewarded choice than the control animals (win-stay), but this was not observed in NR1 D1CreERT2 simulated mice (Fig. 3D: win-stay, t(12) = 3.939, p = 0.0020; Fig. 3E: win-stay, t(15) = 3.292, p = 0.0049; Fig. 3F: win-stay, t(13) = 1.657, p = 0.1215). We observed no effect of mutation on lose-shift behavior in any group, which is consistent with actual data (Fig. 3D: lose-shift, t(12) = 0.4638, p = 0.6511; Fig. 3E: lose-shift, t(15) = 0.8803, p = 0.3926; Fig. 3F: lose-shift, t(13) = 1.432, p = 0.1757).
These results confirmed that while none of the mutations appreciably affected the magnitude discrimination or probability discounting, the animals from the NR1 DATCreERT2 and mGluR5 KD-D1 strains were considerably slower in performing choices.

Discussion
The mutations in the NR1 DATCreERT2 and mGluR5 KD-D1 strains had three effects on the choice behavior. First, the performance in the probabilistic reinforcement learning task was impaired, leading to fewer choices of the alternative with the higher reward probability. This effect was transient in the NR1 DATCreERT2 strain, and the mutant mice eventually reached the same performance as the controls, whereas the mGluR5 KD-D1 animals showed a generally lower preference for the higher value option. Second, the NR1 DATCreERT2 and mGluR5 KD-D1 mice were less likely to repeat the previously rewarded choice. In accordance with this, computational modeling suggested that the behavior of both of these mutant groups was to a smaller extent influenced by acquired associations in comparison to controls (i.e., making more exploratory/random choices compared with controls). Finally, the third mutation effect in the NR1 DATCreERT2 and mGluR5 KD-D1 strains was an increase in the delay to make a choice. In contrast, there were no appreciable changes in the behavior of the NR1 D1CreERT2 mice.
Earlier studies have shown that the inactivation of functional NMDA receptors in DA neurons impaired burst firing and attenuated phasic DA release in the striatum (Zweifel et al., 2009; Parker et al., 2010; Wang et al., 2011). Consistent with this finding, we recently reported that the induction of the mutation in the NR1 DATCreERT2 mice causes a complete loss of NMDA receptor-dependent bursting of midbrain DA neurons (Jastrzębska et al., 2016). Considering the role of DA neuron burst firing in reward prediction error coding (Schultz et al., 1997; Glimcher, 2011), the observed effects of the mutation are to an extent unexpected, as no significant changes in learning rates were observed. Still, we note that the reduced win-stay probability is actually similar to the effect reported in optogenetic studies, where the inhibition of DA neurons imitating negative reward prediction error reduced the likelihood of returning to the previously rewarded alternative (Hamid et al., 2016; Parker et al., 2016). Moreover, the study by Pessiglione et al. (2006) offers a possible explanation for why the reduced bursting of DA neurons might have led to less deterministic behavior rather than a reduced learning rate. In that study, the effects of a drug reducing DA function on learning in an analogous task were studied in humans inside an fMRI scanner. The authors developed a computational model that captured both behavioral data and blood oxygenation level-dependent responses in the striatum, which are known to correlate with reward prediction error. According to this model, the drug had the effect of reducing the value of the reward parameter $r_t$ on trials where the reward is obtained (see Eqs. 1-4). Reducing $r_t$ has exactly the same effect on model behavior as reducing the inverse temperature $\beta$ (identified in our study for NR1 DATCreERT2 and mGluR5 KD-D1 mice) for the following reason.
Reducing $r_t$ decreases the value to which the estimators $V_{c,t}$ converge, because they approach the expected value of the reward. If both $V_{1,t}$ and $V_{2,t}$ are scaled down by the same constant factor, this factor can be taken outside the bracket in the softmax Equation 5 and incorporated into $\beta$, giving a lower effective value of $\beta$. Computational models with reduced $r_t$ and $\beta$ predict exactly the same behavior, and therefore cannot be distinguished on the basis of our data. Pessiglione et al. (2006) had additional neurophysiological data, indicating the value of reward prediction on each trial, which allowed them to distinguish between these models. Thus, in summary, the less deterministic behavior of NR1 DATCreERT2 mice in our study might have resulted from impaired encoding of reward prediction error that led to reduced estimates of expected reward.

Figure 5. The probability-discounting task. A, Schematic representation of the probability-discounting task. One nose-poke port was associated with the delivery of small certain rewards, while the other nose-poke port was associated with the delivery of large uncertain rewards. Each session consisted of 20 forced trials during which only one port was active, followed by 40 free choice trials during which both ports were active. B-D, The graphs show the frequency of choosing the larger reward as a function of its probability in the NR1 DATCreERT2 (mutant, n = 6; control, n = 7; B), mGluR5 KD-D1 (mutant, n = 8; control, n = 9; C), and NR1 D1CreERT2 (mutant, n = 5; control, n = 9; D) strains. Data are represented as the mean ± SEM.
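The equivalence between a reduced reward magnitude and a reduced inverse temperature can be checked numerically. The Python sketch below makes several simplifying assumptions that are not taken from the analysis reported here: a single learning rate, updates of the chosen option only, initial values of zero, and a scale factor applied to all outcomes (both +1 and −1). The function name and the example sequences are hypothetical.

```python
import math

def choice_probabilities(choices, outcomes, alpha, beta, reward_scale=1.0):
    """Return p(option 0) before each trial for a fixed choice/outcome history."""
    v = [0.0, 0.0]
    probs = []
    for choice, outcome in zip(choices, outcomes):
        probs.append(1.0 / (1.0 + math.exp(beta * (v[1] - v[0]))))
        # Delta-rule update of the chosen option with a scaled outcome.
        v[choice] += alpha * (reward_scale * outcome - v[choice])
    return probs

choices = [0, 0, 1, 0, 1, 1]
outcomes = [1, -1, 1, 1, -1, 1]

# Halving the outcomes (beta = 4) versus halving beta (outcomes unchanged).
scaled_reward = choice_probabilities(choices, outcomes, alpha=0.3, beta=4.0, reward_scale=0.5)
scaled_beta = choice_probabilities(choices, outcomes, alpha=0.3, beta=2.0, reward_scale=1.0)
baseline = choice_probabilities(choices, outcomes, alpha=0.3, beta=4.0, reward_scale=1.0)
```

Because the update is linear in the outcome, scaling every outcome by a factor k scales both values by k, so the product of the inverse temperature and the value difference is the same in the first two runs, and they produce identical choice probabilities that differ from the baseline.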
Nevertheless, we note that while the impairment caused by the mutation is clearly significant, it was arguably mild, and the NR1 DATCreERT2 mice eventually reached the same performance as that observed in the control animals. This is in agreement with the observation that after extended training, performance levels in mice with constitutive mutations are similar to those found in control animals (Zweifel et al., 2009; James et al., 2015). Furthermore, in addition to its role in signaling reward prediction errors, phasic DA encodes expected reward value and contributes to risk-based decision-making (Fiorillo et al., 2003; Tobler et al., 2005; Sugam et al., 2012). This hypothesis is supported by observations in which the pharmacological blockade of DA receptors or the attenuation of phasic activity in DA neurons biases choices away from larger but probabilistic rewards (St Onge and Floresco, 2009; St Onge et al., 2011; Stopper et al., 2013, 2014). However, we found no effect of the loss of NMDA receptors on probability discounting, suggesting that NMDA receptors in DA neurons are not required for assessing the reward value when choosing between deterministic and probabilistic outcomes.

The inactivation of mGluR5 receptors in D1-expressing neurons decreased the frequency of choosing the alternative with a higher reward probability. Thus, the mGluR5 KD-D1 mice made more random choices. In contrast, the NR1 D1CreERT2 mice showed normal performance. This result may be due to differences in the efficiency of the mutations in the dorsal part of the striatum. We have previously reported that a mutation in D1CreERT2-derived strains is efficient in the nucleus accumbens and ventral striatum but is less extensive in the dorsal parts of the striatum (Sikora et al., 2016), whereas, in the mGluR5 KD-D1 strain, the mutation is efficient in both regions (Novak et al., 2010; Rodriguez Parkitna et al., 2013). The ventral components of the striatum are involved in stimulus-outcome learning, but the dorsal striatum plays a key role in learning about actions and their consequences (Balleine et al., 2007; Yin et al., 2008). A dissociable role of the ventral and dorsal striatal regions in choice behavior was also recently reported by Parker et al. (2016). These authors showed that DA terminals in the ventral striatum responded preferentially to reward consumption and reward-predicting cues, whereas terminals in the dorsal striatum responded more strongly to choices. Accordingly, optogenetic studies have demonstrated that the stimulation of D1 neurons in the dorsal striatum mimics changes in action values and biases choice behavior during decision-making (Tai et al., 2012). Therefore, we speculate that when glutamate receptor-dependent plasticity is disrupted at corticostriatal synapses in the dorsal, rather than the ventral striatum, an increased randomness in action selection occurs.

Figure 6. Reaction times in the probability-discounting task. A-C, Time elapsed from the trial onset to the choice port entry during the forced choice (left) and free choice (right) trials in the NR1 DATCreERT2 (mutant, n = 6; control, n = 7; A), mGluR5 KD-D1 (mutant, n = 8; control, n = 9; B), and NR1 D1CreERT2 (mutant, n = 5; control, n = 9; C) strains. Bars represent the mean choice latency ± SEM. *p < 0.05, **p < 0.01, ***p < 0.001 (Bonferroni-corrected t test).
The strongest effect observed in our study was the increased delay in performing a choice in the NR1 DATCreERT2 and mGluR5 KD-D1 mice. This effect is consistent with a reported increase in the latency to choose in the appetitive T-maze task in NR1 DATCre mice (Zweifel et al., 2009) and the effect of optogenetic stimulation of DA neurons on the delay to engage in reward-seeking behavior (Hamid et al., 2016). Notably, our procedure imposed no limit on the trial length, while a 10 s limit was often used previously (Stopper et al., 2014; Parker et al., 2016). If a limit had been imposed, we would have likely observed a large number of omitted trials. Thus, a decision time limit could likely exacerbate the phenotypes observed in the probabilistic reinforcement learning task. It should also be noted that the mutations affected the time to collect the reward. However, only a slight increase in the reward latency was observed. An influence of the mutations on locomotor activity seems rather unlikely in this case. First, it was previously reported that a mutation in NR1 DATCreERT2 mice had no effect on locomotor activity in the home cage or open field arena, and only a mild reduction of activity in the novel environment was observed in mGluR5 KD-D1 mice, with no change in the distance traveled in a familiar environment. Second, based on the performance in the rotarod test, there is no evidence of motor impairment in NR1 DATCreERT2 mice (Jastrzębska et al., 2016). We thus believe that an increase in choice latency is a result of an internal decision (or motivational) process, rather than a result of impaired motor performance. This interpretation is in line with observations showing that perturbations in mesolimbic DA signaling result in decreased motivation to engage in reward-seeking behavior, which is expressed as an increase in latency to initiate the instrumental phase of reward-directed behavior (Nicola, 2010; Salamone and Correa, 2012).
In conclusion, we find that the loss of NMDA receptors in DA neurons and mGluR5 receptors in D1-expressing neurons affects the speed of the decision process and increases the number of exploratory choices. Nevertheless, mutant mice did improve their performance in the probabilistic reinforcement learning task and showed normal probability discounting. Overall, this indicates that reward-driven learning does occur in the absence of key receptors implicated in the plasticity of the reward system of the brain, but the decision-making process slows and loses efficiency.