The Neural Correlates of Individual Differences in Reinforcement Learning during Pain Avoidance and Reward Seeking

Organisms learn to gain reward and avoid punishment through action–outcome associations. Reinforcement learning (RL) offers a critical framework to understand individual differences in this associative learning by assessing learning rate, action bias, pavlovian factor (i.e., the extent to which action values are influenced by stimulus values), and subjective impact of outcomes (i.e., motivation to seek reward and avoid punishment). Nevertheless, how these individual-level metrics are represented in the brain remains unclear. The current study leveraged fMRI in healthy humans and a probabilistic learning go/no-go task to characterize the neural correlates involved in learning to seek reward and avoid pain. Behaviorally, participants showed a higher learning rate during pain avoidance relative to reward seeking. Additionally, the subjective impact of outcomes was greater for reward trials and associated with lower response randomness. Our imaging findings showed that individual differences in learning rate and performance accuracy during avoidance learning were positively associated with activities of the dorsal anterior cingulate cortex, midcingulate cortex, and postcentral gyrus. In contrast, the pavlovian factor was represented in the precentral gyrus and superior frontal gyrus (SFG) during pain avoidance and reward seeking, respectively. Individual variation of the subjective impact of outcomes was positively predicted by activation of the left posterior cingulate cortex. Finally, action bias was represented by the supplementary motor area (SMA) and pre-SMA whereas the SFG played a role in restraining this action tendency. Together, these findings highlight for the first time the neural substrates of individual differences in the computational processes during RL.


Introduction
A fundamental inquiry in behavioral neuroscience concerns the characterization of the brain processes underlying one's choice of action to maximize rewards and minimize punishment.Such decision-making processes can be understood using reinforcement learning (RL) theories (Sutton and Barto, 1998).In RL, an agent learns the values of choices by forming associations between actions and outcomes.These associations help predict the expected values for a particular set of actions and then bias action selection toward the agent's goals (O'Doherty et al., 2015).With roots in computer science and neurophysiological research, RL models offer a conceptual framework to identify distinct component processes of learning (Hampton et al., 2006;Niv, 2009).Specifically, RL models generate several parameters capturing learning rate, action bias, and subjective impact of outcomes that characterize differences in learning across individuals.It is suggested that the brain areas involved in goal-directed behaviors likely harbor signals related to these component variables (Lee et al., 2012).Describing such signals would be of critical importance in elucidating the neurobiology of decision-making, predicting future behaviors, and understanding the pathology of mental illnesses that implicate dysfunctional avoidance and/or reward learning.
A number of studies have investigated the neural correlates of RL model-derived indices of associative learning.For instance, Guitart-Masip et al. ( 2012) used a probabilistic learning go/no-go task (PLGT) with monetary gain and loss to localize the effects of action values in the striatum and substantia nigra/ventral tegmental area.Employing a social reinforcement task, another study reported higher pregenual anterior cingulate cortex (ACC) activation with increased expected stimulus values (Jones et al., 2011).The ventromedial prefrontal cortex was also found to track learned values of options in an RL task (Jocham et al., 2011).These investigations highlighted the brain underpinnings of model variables in trial-by-trial variation that are shared across participants, with an emphasis on reward learning.Despite the increasing interest in the neural bases of RL, the neural correlates of individual differences in RL model parameters remain underexplored.
RL models offer several indices that describe learning.For instance, learning rate assesses the change in action following an error, a metric that quantifies one's adaptive behavior (Oba et al., 2019).The subjective impact of outcomes which indicates the motivation to seek reward or avoid punishment is critical to individual differences in goal-directed behaviors.pavlovian factor, referring to the extent to which action values are influenced by stimulus values, independent of learning, determines the extent to which actions are motivated by outcomes.Another useful measure is action bias which assesses an individual's tendency to initiate an action regardless of the optimal strategy.Importantly, these individual differences at the extreme may suggest psychopathology in mental illnesses that involve learning dysfunctions.For instance, rewarddriven impulsivity is widely implicated in substance use disorders and behavioral addiction (Jentsch et al., 2014;Maxwell et al., 2020).In contrast, while learning to avoid aversive outcomes such as pain is critical to the survival and wellbeing of organisms, avoidance learning dysfunctions represent a detrimental feature in anxiety, obsessive-compulsive, posttraumatic stress, and substance use disorders (Radell et al., 2020).Thus, characterizing the individual differences both during reward and punishment learning will inform research on the pathophysiology of mental disorders.
Ample research has examined neuronal responses and regional activation to motivated actions.In particular, studies examining reward learning have implicated the ACC in representing learned cue values (Jones et al., 2011), past reward weights (Wittmann et al., 2016), and reinforcement history (Holroyd and Coles, 2008).The dorsal ACC, in particular, has been found to exhibit significantly higher activation during reward learning in learners versus nonlearners (Santesso et al., 2008).In contrast, the posterior cingulate cortex (PCC) has been implicated in actions aimed at avoiding punishment both in humans (Roy et al., 2011;Le et al., 2019) and animals (Riekkinen et al., 1995;Duvel et al., 2001), suggesting a role in avoidance learning.It is likely that the ACC and PCC circuits are involved in representing individual differences in reward and punishment learning.
Previous RL research has largely associated reward and punishment with approach and avoidance behaviors, respectively.However, punishment avoidance may at times require the initiation rather than suppression of an action to avert a negative event.Similarly, reward seeking may necessitate response inhibition for a successful outcome, likely engaging distinct regional activities (Groman et al., 2022).Thus, to investigate the neural correlates of RL model parameters, we employed a PLGT that fully orthogonalized action types and motivational goals in a balanced design (go vs no-go) × (pain vs reward).We used a whole-brain approach to identify the neural correlates of RL model parameters, including learning rate, subjective impact of outcomes, action bias, and pavlovian factor.We hypothesized that the ACC and PCC would demonstrate regional responses that reflect individual variation in the parameters of the RL model that best characterizes behavioral performance.

Participants
Eighty-two healthy adults (34 women, 35.9 ± 11.2 years in age) participated in the study.All participants were screened to be free from major medical, neurological, and Axis I psychiatric disorders.No participants were currently on psychotropic medications, and all tested negative for illicit substances on the study day.Subjects provided written informed consent after details of the study were explained, in accordance with the institute guidelines and procedures approved by the Institutional Human Investigation Committee.

PLGT
Participants underwent fMRI while performing the PLGT (Fig. 1A; Guitart-Masip et al., 2012;Oba et al., 2019).In each run, a cue (fractal) image was presented at the beginning of a trial to signal one of the four contingencies: go to win $1, no-go to win $1, go to avoid a painful shock, or no-go to avoid shock.There were eight images, two per cue category, with cue-outcome mappings randomized across participants.The cue was displayed for 2 s, and participants were instructed to decide whether to press a button (go) or not (no-go) before it disappeared.After a randomized interval of 1-5 s, feedback of reward (win trials), shock (avoid trials), or "null" (both win and avoid trials) was delivered.The intertrial interval varied randomly from 2 to 8 s.The randomization and intervening time intervals enabled the modeling of distinct regional responses to anticipation and feedback.The outcome was probabilistic, with 80%/20% of correct/incorrect responses in the win trials rewarded and the remaining 20%/80% of correct/incorrect responses leading to a null outcome (Fig. 1A).In avoid trials, electric shocks were avoided (a null outcome) on 80%/20% of correct/incorrect responses, with the remainder leading to electric shocks.In reality, despite the feedback display of the shock image, shocks were randomly delivered only half of the time to minimize head movements.Each shock was followed by a 20 s rest window to allow neural and physiological responses to return to baseline.
Prior to MRI, an appropriate pain intensity level was calibrated for each participant so that the shocks would be painful but tolerable.Participants also practiced on a variant of the task with identical rules, with the exception that there was one image per cue category.
With four 10 min learning runs, there were approximately 50 trials for each cue.Participants won ∼$78 on average (plus a base payment of $25) and experienced a total of 14 actual shocks and 14 omitted shocks on average.

RL models
We constructed RL models of the subjects' behavioral data.A detailed description of the models can be found in our previous work (Oba et al., 2019) and elsewhere (Guitart-Masip et al., 2012).All models assigned an action value to each action in a given trial.For instance, considering action a (go or no-go) in response to cue s on trial t, the action value Q t (a t , s t ) for a chosen action was updated based on the following equation: where ε was the learning rate governing how the action value was updated.The subjective impact of outcome ρ was a free parameter representing the effect size of reinforcement for a subject.The outcome value r t was +1 for a gain, −1 for a shock, or 0 for no gain or shock in trial t.The term ρr t −Q t (a t , s t ) represented the prediction error (PE) or δ t .Learning proceeded with a decision for each action according to the values, and the probabilities of implementing an action were calculated by the softmax function: where W t (a t , s t ) was an action weight corresponding to Q t (a t , s t ).
In addition, we included two other parameters validated in prior studies to better explain behavioral performance (Guitart-Masip et al., 2012, 2014).One parameter was the action bias b, a tendency to press a button regardless of learning, which influenced the action value on the weight: The other parameter was the pavlovian factor, which expresses the effect of a stimulus value independent of learning.
Several studies have reported that stimuli resulting in rewards tend to block action inhibition, thus reflecting an approach bias (Guitart-Masip et al., 2012, 2014).The action weight is adapted by the pavlovian factor π as follows: The stimulus value V t (s t ) was updated with the same parameters as the action value.The outputs of the models included the learning rate ε, which can be separated according to the sign of PE (positive, δ > 0, or negative, δ < 0).Models that produce the learning rates for the signed PE allow asymmetric effects of better or worse (than expected) outcomes on learning (Cazé and van der Meer, 2013).Furthermore, the learning rates can be modeled separately for avoid and reward trials; thus, the models would have four different learning rates: εWP for win trials with a positive PE, εWN for win trials with a negative PE, εAP for avoid trials with a positive PE, and εAN for avoid trials with a negative PE.The subjective impact of outcomes could also differ between win (ρW) and avoid (ρA) trials, as the subjective impacts of positive and negative reinforcers may not be the same.In sum, we examined a total of 12 parameters and identified the best combination of these parameters in modeling the behavioral data.Free parameters were estimated for each participant via a hierarchical type II maximum likelihood procedure, as with previous studies (Huys et al., 2011;Guitart-Masip et al., 2012).We supposed that the population-level distribution for each parameter was a normal distribution.To perform the estimation, the likelihood was maximized by the expectation-maximization procedure using the Laplace approximation to calculate the posterior probability.We used the Rsolnp package in R to optimize the likelihood functions.These models were evaluated with the integrated Bayesian information criterion (iBIC).A smaller iBIC value represents a better model (Huys et al., 2011).Briefly, the iBIC was calculated by using the following procedures: using the parameter values randomly generated by the population distributions, the likelihood was calculated 1,000 times for each participant data.Next, after dividing the total likelihood of each participant by the number of samples (1,000), these amounts were summed for all participants.Finally, the cost for the number of parameters was added to this value.Thus, the iBIC values approximated the log marginal likelihoods with a penalty for the number of free parameters.RL code was run on Lenovo Workstation TS P720, using the operating system Ubuntu 20.04.4 LTS.

Imaging protocol and data preprocessing
Conventional T1-weighted spin-echo sagittal anatomical images were acquired for slice localization using a 3T scanner (Siemens Trio).Anatomical images of the functional slice locations were next obtained with spin-echo imaging in the axial plane parallel to the AC-PC line with (TR) = 1,900 ms, echo time (TE) = 2.52 ms, bandwidth = 170 Hz/pixel, FOV = 250 × 250 mm, matrix = 256 × 256, 176 slices with slice thickness = 1 mm, and no gap.Functional BOLD signals were acquired using multiband imaging (multiband acceleration factor = 3) with a single-shot gradient-echo echoplanar imaging sequence.Fifty-one axial slices parallel to the AC-PC line covering the whole brain were acquired with TR = 1,000 ms, TE = 30 ms, bandwidth = 2,290 Hz/pixel, flip angle = 62°, field of view = 210 × 210 mm, matrix = 84 × 84, slice thickness = 2.5 mm, and no gap.
Imaging data were preprocessed using SPM12 (Wellcome Trust Centre for Neuroimaging).Subjects with BOLD runs with significant motion (>3 mm translation peak-to-peak movement and/or 1.5°rotation) were removed.Furthermore, we calculated framewise displacement (FD) for each task run and removed subjects with an average FD of >0.2.This resulted in the removal of five subjects, leaving a sample of 82 subjects as reported above.Images from the first five TRs at the beginning of each run were discarded to ensure only BOLD signals at steady-state equilibrium between RF pulsing and relaxation were included in analyses.Physiological signals including respiration and heart rate were regressed out to minimize the influence of these sources of noise.Images of each subject were first realigned (motion corrected) and corrected for slice timing.A mean functional image volume was constructed for each subject per run from the realigned image volumes.These mean images were coregistered with the high-resolution structural image and then segmented for normalization with affine registration followed by nonlinear transformation.The normalization parameters determined for the structure volume were then applied to the corresponding functional image volumes for each subject.Images were resampled to 2.5 mm isotropic voxel size.Finally, the images were smoothed with a Gaussian kernel of 4 mm FWHM.

Imaging data modeling and group analyses
We constructed a GLM to examine the shared and distinct brain processes underlying the initiation and inhibition of an action to either avoid pain or gain reward.To this end, we included two trial types, with cue onsets of individual trials convolved with a canonical HRF and with the temporal derivative of the canonical HRF and entered as regressors in the GLM.The two trial types included avoid and win trials, with go and no-go responses collapsed.This resulted in the contrasts avoid > 0 and win > 0. Motion regressors were included in the GLMs.We corrected for serial autocorrelation caused by aliased cardiovascular and respiratory effects by the FAST model.Finally, to examine potential action bias in subjects' learning, we constructed the contrast go versus no-go, collapsing avoid and win trials.
To identify the neural correlates of RL model parameters, we used whole-brain multiple regressions at the group level with the following parameters each as the predictor of the brain activity during the cue period: learning rates (i.e., the magnitude of change of action following errors), pavlovian factor (i.e., the extent to which action values are influenced by stimulus values and not by learning), subjective impact of outcomes (i.e., motivation to seek reward and avoid punishment), and action bias (i.e., the tendency to initiate an action).For learning rate, pavlovian factor, and subjective impact of outcomes, we combined go and no-go trials.Furthermore, we averaged the two learning rates from positive outcomes and negative outcomes each for win trials (i.e., εWP and εWN) and avoid trials (i.e., εAP and εAN).This resulted in two learning rates, one for win trials and one for avoid trials.These parameters were used to predict regional responses to pain avoidance and reward seeking during the cue period (i.e., contrasts avoid > 0 and win > 0, respectively).For action bias, we used the contrasts go > no-go and no-go > go to predict regional responses to action initiation and inhibition, respectively.All multiple regressions controlled for sex and age.As the total number of shocks showed a significant relationship with the pavlovian factor (r = 0.30, p = 0.007), the subjective impact of outcomes during avoidance (r = −0.48,p < 0.001), and action bias (r = 0.22, p = 0.04), we also included this variable as a covariate in the multiple regressions involving avoidance learning.
The results of all whole-brain analyses were evaluated with voxel p < 0.001 in combination with cluster p < 0.05, corrected for family-wise error of multiple comparisons, according to current reporting standards (Woo et al., 2014;Eklund et al., 2016).Minimum cluster size was 30.All peaks of activation were reported in MNI coordinates.

Stay/switch analysis
In addition to accuracy performance and response time, we conducted a stay/switch analysis to evaluate the subjects' task performance.The analysis yielded the percentage of instances in which a subject immediately switched their response (i.e., from go to no-go or no-go to go) after encountering an unfavorable outcome (e.g., null feedback in a win trial or shock feedback in an avoid trial).Switching was only advantageous if the previous response consistently yielded unfavorable outcomes.Due to the probabilistic nature of our task, excessive switching likely indicated poor learning.Previous studies showed that increased switching was associated with the lack of a coherent task strategy and "reward chasing" in individuals with substance use disorders (Myers et al., 2016;Robinson et al., 2021).

Mediation analysis
We conducted mediation analyses using a single-mediator model to examine the interrelationship between regional activities, learning parameters estimated by the model, and performance variables.The methods are detailed in a previous work (MacKinnon et al., 2007).Briefly, in a mediation analysis, the relation between the independent variable X and the dependent variable Y; that is, X → Y is tested to determine whether it is significantly mediated by a variable M. The mediation test is performed using the following three regression equations: and c represents X → Y. a, b, c, and c ′ are referred to as "path coefficients" or simply "paths."Variable M is said to be a mediator of connection X → Y, if (cc ′ ), which is mathematically equivalent to the product of the paths a × b, is significantly different from zero (MacKinnon et al., 2007).If (cc ′ ) is different from zero and the paths a and b are significant, one concludes that X → Y is mediated by M. In addition, if path c ′ is not significant, there is no direct connection from X to Y, and thus X → Y is completely mediated by M. Note that path b represents M → Y, controlling for X, and should not be confused with the correlation coefficient between Y and M. Significant correlations between X and Y and between X and M are required for one to perform the mediation test.The analysis was performed with package Lavaan (Rosseel, 2012) in R (https:// www.r-project.org).To test the significance of the mediation effect, we used the bootstrapping method (Preacher and Hayes, 2004) as it is generally considered advantageous to the Sobel test (MacKinnon et al., 2007).
Specifically, we evaluated the relationships of each of all four variables indexing learning (i.e., learning rate, pavlovian factor, subjective impact of outcomes, and action bias) with brain activation and task performance (see Results).For brain activations, we extracted the parameter estimates from the regions identified by the whole-brain multiple regressions with the learning parameters as the predictors of the contrast avoid > 0, win > 0, go > no-go, no-go > go for avoidance learning, reward learning, action initiation, and action inhibition, respectively.We tested the hypothesis that the relationship between brain activations and task performance was significantly mediated by the RL model parameters.
Code accessibility.The code/software described in the paper is freely available online at https://github.com/tml44/RL_models_PLGT/.The code is available as Extended Data.

Behavioral results
We conducted a 2 (motivational goal, avoid vs win) × 2 (response type, go vs no-go) repeated-measures ANOVA of performance accuracy (Fig. 1B).The results showed no significant main effect of motivational goal (p = 0.06) or response type (p = 0.41).However, there was a significant interaction effect (F (1,80) = 9.26, p = 0.002).Post hoc analyses revealed that performance accuracy was significantly higher for go-to-win conditions than no-go-to-win (p < 0.001), go-to-avoid (p = 0.002), and no-go-to-avoid (p = 0.01) conditions.As expected, avoid and win trials did not significantly differ in performance accuracy (p = 0.59).No other comparisons were significant.
During probabilistic learning, excessive switching (i.e., switching from go to no-go or vice versa for the same cue after one instance of unfavorable outcome) may indicate poor learning.Thus, we computed for each task condition a stay/ switch index, which measures the percentage of switches immediately after an encounter with unfavorable feedback.Next, we performed a 2 (motivational goal, avoid vs win) × 2 (response type, go vs no-go) repeated-measures ANOVA.There was a significant main effect of motivational goal (F (1,81) = 105.69,p < 0.001) but no significant main effect of Figure 1.A, PLGT: participants learned to respond to four cue categories to avoid electric shocks (i.e., go-to-avoid, no-go-to-avoid) and gain monetary rewards (i.e., go-to-win, no-go-to-win), with two different images per cue category.Correct responses yielded favorable outcomes 80% of the time whereas incorrect responses yielded unfavorable outcomes 80% of the time.Shocks were only delivered in 50% of the shock feedback instances to reduce head movement.B, Performance accuracy in the four conditions: participants performed significantly better in go-to-win than in any other conditions.**p ≤ 0.01; ***p ≤ 0.001.C, Learning rates: in monetary win and pain avoidance conditions, learning was separately estimated when unexpected favorable outcomes (positive) or unexpected unfavorable outcomes (negative) were encountered.D, Action value difference: subjects showed learning of the two go cues (go-to-avoid and go-to-win), as demonstrated by the increasing difference between go and no-go action values over time.A similar pattern of learning was found for the no-go trials as the difference between no-go and go action values also exhibited an upward trend as the trials progressed.Shade indicates confidence interval.response type (p = 0.49).The interaction effect was significant (F (1,80) = 4.97, p = 0.02).Post hoc analyses showed that participants switched significantly more in avoid relative to win trials (p < 0.001) whereas there was no significant difference between go and no-go trials (p = 0.36).Next, we examined the relationship between the stay/switch index and task performance and found a significant negative correlation between the index and performance accuracy of avoid trials (r = −0.31,p = 0.005).The relationship was not significant for the win trials (p = 0.41).Taken together, participants displayed more inconsistent response behavior in avoid trials and such inconsistency negatively impacted task performance.

RL model of performance
We constructed 14 models with different combinations of free parameters to determine the model that optimally predicted the choice data.Using a stepwise procedure for model comparison and selection, we added one free parameter to a model, calculated the iBIC, and accepted the parameter that decreased the iBIC the most at each step.The pavlovian factor π reduced the iBIC of the basic model (one learning rate ε and one subjective impact of outcomes ρ) over the other parameters.The iBIC of the model with π was diminished by the separation of the learning rates into positive and negative PEs (εP and εN).The iBIC value decreased further with learning rates computed separately for win (i.e., monetary win) and avoid (i.e., pain avoidance) trials (εWP, εWN, εAP, and εAN).The subjective impact of outcomes was separated for win (ρW) and avoid (ρA) trials.Finally, the action bias parameter b reduced the iBIC.Thus, the optimal model included four different learning rates (εWP, εWN, εAP, and εAN) and two subjective impact of outcomes (ρW and ρA), action bias b, and the pavlovian factor π.
We extracted the subjective impact of outcomes for win (ρW) and avoid (ρA) trials and found that ρW was significantly greater than ρA, suggesting that stimulus value was perceived to be greater for win trials.Interestingly, the stay/switch index for avoid and win trials was significantly correlated with ρA (r = −0.31,p = 0.005) and ρW (r = −0.30,p = 0.007), respectively.This finding indicates that the less value the subjects placed on the outcomes, the more likely the subjects showed inconsistent patterns of responses.In other words, the subjects with lower subjective impact of outcomes appeared to be more uncertain about the best action to take.
To further characterize subjects' learning, we estimated for individual subjects the action values Q t (go) and Q t (no-go) and computed the difference Q t (go) versus Q t (no-go) for each trial.We expected that, for go trials, learning would manifest as the increase of Q t (go)-Q t (no-go) over time.In contrast, for no-go trials, Q t (no-go)-Q t (go) would rise as the trials progressed.The results were consistent with our prediction (Fig. 1D).In other words, subjects placed increasingly more value on go versus no-go action as they encountered more go trials, whereas the reverse was true for no-go trials.These findings suggest subjects showed learning across all four trial types over time.

Neural correlates of individual differences in learning rates
To identify the neural correlates of learning, we conducted whole-brain multiple regressions, first with the learning rate during pain avoidance.A small ε indicated that the individual was less inclined to modify their future responses following the outcomes.The results showed that, with go and no-go trials collapsed, ε predicted activations to avoid > 0 during the cue period in the dACC, midcingulate cortex (MCC), left postcentral gyrus (PoCG), and left superior temporal sulcus (Fig. 2A, Table 1).No clusters showed a significant negative relationship.
Notably, the learning rate and performance accuracy were positively correlated during pain avoidance (r = 0.31, p = 0.005).To remove the effects of sex, age, and number of total shocks in this correlation, as is the case with all following correlations, we regressed out these variables (i.e., partial correlation).The activities (i.e., parameter estimates, or β's) of brain regions identified by the whole-brain multiple regression were also significantly correlated with performance accuracy (r = 0.36, p = 0.001).Using the mediation analysis, we determined a significant model in which the relationship between the brain activity during pain avoidance and performance accuracy was significantly and fully mediated by learning rate (Fig. 2B).Thus, the greater the activations in the dACC, MCC, left PoCG, and left superior temporal sulcus, the better the performance accuracy; and this relationship was positively and fully mediated by learning rate during pain avoidance.
The multiple regression applied to the learning rate of reward seeking (i.e., win > 0) did not yield any significant results at the same threshold.

Neural correlates of individual differences in pavlovian factor
In whole-brain multiple regression with the pavlovian factor as the predictor showed a negative correlation with the activation of the left precentral gyrus (PrCG) during pain avoidance (Fig. 3A, Table 1).
Across participants, the pavlovian factor and performance accuracy of pain avoidance were negatively correlated (r = −0.35,p = 0.002).Furthermore, the regional activity correlates of individual variation in pavlovian factor were significantly correlated with the accuracy of pain avoidance performance (r = 0.24, p = 0.03).Thus, we performed a mediation analysis to examine the interrelationships between the activation in the PrCG, the pavlovian factor, and task performance during avoidance (Fig. 3B).The model in which the relationship between the brain activation and task performance was mediated by the pavlovian factor was significant.In other words, the greater the brain activity in the PrCG, the higher the task performance, and this relationship was significantly and negatively mediated by the pavlovian factor.
We next examined the neural correlates of the pavlovian factor during reward seeking, using the win > 0 contrast.The pavlovian factor predicted lower activation of the right superior frontal gyrus (SFG) and right PCC (Fig. 3C, Table 1).The reverse contrast did not yield any significant findings.
As task performance on win trials was correlated with both pavlovian factor (r = −0.34,p = 0.002) and SFG/PCC activation (r = 0.26, p = 0.01), we conducted a mediation analysis, which showed that the relationship between SFG/PCC activity and task performance was significantly and negatively mediated by the pavlovian factor (Fig. 3D).

Neural correlates of subjective impact of outcomes
In a whole-brain multiple regression with the subjective impact of outcomes (ρA) as the predictor of regional responses to avoid > 0, the PCC and left superior parietal lobule (SPL) showed activation in positive correlation with ρA (Fig. 4A, Table 1).
Additionally, the performance accuracy during pain avoidance was positively correlated with both ρA (r = 0.59, p < 0.001) and the parameter estimates of PCC/SPL activity (r = 0.48, p < 0.001).A mediation analysis showed that the higher the PCC/SPL activity during pain avoidance, the greater the performance accuracy, a relationship significantly and positively mediated by ρA (Fig. 4B).
The same analyses applied to the ρW during reward seeking (i.e., win > 0) did not yield any significant results.

Neural correlates of action bias
Finally, to determine the neural processes underlying the tendency to initiate an action, a whole-brain regression with the action bias as the predictor of activations to go > no-go contrast was conducted.Greater action bias was significantly correlated with higher activations in the bilateral dlPFC; left superior frontal cortex; a cluster containing the SMA, pre-SMA, and dACC; and a large cluster containing occipital cortices, SPL, parahippocampal gyri, and cerebellum (Fig. 5A, Table 2).
We then extracted the parameter estimates for the regions identified in the multiple regression involving go > no-go and found a significant correlation with performance accuracy during go trials (r = 0.67, p < 0.001).The relationship between action bias and performance accuracy during go trials was also significant (r = 0.69, p < 0.001).A mediation analysis was thus performed to characterize the interrelationships between action bias, brain activation during go > no-go, and performance accuracy in go trials (Fig. 5B).We found that the model in which the relationship between brain activity in the regions identified and task performance during go trials was significantly and positively mediated by action bias.
Action bias also negatively predicted activations in the bilateral striate areas, bilateral superior temporal sulcus, bilateral anterior insula (AI), and right SFG (Fig. 5C, Table 2).We then extracted the parameter estimates for these regions and found a significant relationship with no-go behavioral performance (r = 0.51, p < 0.001).Finally, we conducted a mediation analysis, revealing that the relationship between brain activity during no-go responses and no-go behavioral performance was significantly and negatively mediated by action bias (Fig. 5D).

Discussion
We investigated regional brain activations reflecting individual differences in four distinct RL components in learning to avoid pain and to seek reward: learning rate, subjective impact of outcomes, pavlovian factor, and action bias.The dACC,  MCC, and left PoCG showed activity in a positive correlation with learning rate and predicted task performance of avoidance learning.The left PrCG showed activity in a negative correlation with the pavlovian factor during avoidance learning.Further, a cluster involving the right SFG and PCC showed activities in negative correlation with pavlovian factor during reward seeking.The subjective impact of outcomes was predictive of PCC and SPL activation and task performance during avoidance learning.Finally, we observed robust activations positively associated with action bias during go versus no-go trials in the dlPFC, SMA, pre-SMA, and dACC, SPL, and cerebellum.In contrast, there was a negative association between the right SFG activation and action bias.We discussed the main findings below.

Behavioral performance
Participants showed highest performance accuracy in go-to-win (81.1%) and lowest (65.1%) in no-go-to-win trials, consistent with effects of reward on motivating approach behavior (Arias-Carrión and P ǒppel, 2007;Guitart-Masip et al., 2012;Hidi, 2016) and attenuating inhibitory control, especially when such reward is concurrently paired with a go response (Meyer et al., 2021).Next, we examined the learning rate that was significantly lower for win versus avoid trials.It is plausible that the relatively easy go-to-win condition represented a less-than-optimal opportunity for learning whereas the no-go-to-win condition may have been too challenging.Another possibility is that subjects may have been simply less inclined to change responses following errors during reward seeking relative to pain avoidance, resulting in a lower learning rate.More nuanced learning characteristics were revealed when we further separated learning based on the preceding outcomes.We found that participants showed better learning following favorable versus unfavorable outcomes, an asymmetry suggesting a bias toward learning from positive events (Frank et al., 2004;Cazé and van der Meer, 2013).Specifically, during learning to avoid shocks, participants exhibited superior learning after successfully avoiding pain compared with receiving shocks.During learning to gain money, participants similarly exhibited greater learning after successfully obtaining a monetary reward than achieving no reward.These findings are consistent with a previous PLGT study using monetary gain and loss (Oba et al., 2019).It is also worth noting that we randomly omitted half of the shocks to reduce head motion.Such omission may have affected learning due to weakened reinforcement as a result of the absence of pain.However, examining the feedback period, we found that the feedback with omitted shocks (i.e., display of the shock symbol without actual shocks) elicited similar regional activities implicated in pain response as compared with the feedback with actual shocks (data not shown).This suggested that the visual shock feedback alone was sufficient in activating the pain circuit and other brain regions subserving avoidance learning, and that the absence of shock likely exerted limited effects on learning.
As expected, we found a greater subjective impact of outcomes in reward versus avoid trials.This may have indicated participants' greater valuation of monetary reward relative to shock avoidance.In the analyses of the stay/switch index, which assesses response certainty, we found that a higher subjective impact of outcomes was associated with greater response certainty in both reward and avoid trials.Thus, when subjects failed to learn the value of the stimuli, they appeared less certain in their responses, in line with previous reports (Myers et al., 2016;Robinson et al., 2021).The pavlovian factor, which quantifies the extent to which action values are influenced by stimulus values, independent of learning, was negatively associated with performance accuracy in both reward and avoid trials.Maximizing performance in the PLGT requires behavioral regulation.As stimulus value promotes go responses and reduces cognitive control (Dixon and Christoff, 2012), a heightened pavlovian factor can upset the regulatory processes in balancing between go and no-go.

Neural correlates of individual differences in learning rates
Our findings showed that higher dACC activation was associated with a greater avoidance learning rate across subjects.This finding is broadly in line with previous reports of dACC activation during experimentally induced pain (Xu et al., 2020), fear conditioning (Büchel and Dolan, 2000), and aversive delay conditioning across multiple stimulus modalities (Sehlmeyer et al., 2009).The dACC also responds to errors in go/no-go (Orr andHester, 2012), flanker (Gilbertson et al., 2021), continuous performance (Ichikawa et al., 2011) tasks, and during probabilistic learning (Holroyd et al., 2004), highlighting the region's involvement in learning.Error-related learning requires dopaminergic signaling (Allman et al., 2001;Holroyd and Yeung, 2012), at least partially supported by midbrain projections to dACC (Niv, 2007;Assadi et al., 2009;Holroyd and Yeung, 2012), and is known to exhibit individual differences in traits related to learning (Laakso et al., 2000;Reif and Lesch, 2003;Munafò et al., 2008).Further, in direct support of our work, μ-opioid receptor availability in the dACC was positively associated with individual differences in harm avoidance traits in healthy subjects (Tuominen et al., 2012).Despite such evidence, no work to our knowledge has explored how dACC may represent differences in avoidance learning characteristics across individuals.Thus, the current findings add to the literature by implicating the dACC in individual differences in associative learning involving avoidance of painful shocks.
Another region that showed activity in association with avoidance learning rate was the MCC.The MCC has been implicated in the anticipation and execution of avoidance response to pending noxious stimulation (Shackman et al., 2011).Motor neurons in the MCC may receive motivational information (e.g., pain, reward, error) to guide goal-directed behaviors (Vogt, 2016).In electroencephalographic recordings, the MCC appears to contribute to avoidance by evaluating the cost related to punishment in a Simon task (Cavanagh et al., 2014).During avoidance learning involving pain, the MCC was also found to respond to unexpected pain and pain avoidance (Jepma et al., 2022).Our findings of MCC activation in a positive association with avoidance learning rate contribute to this literature and substantiate the role of the MCC in tracking avoidance learning.It is worth pointing out that we observed higher dACC and MCC activities in association with individual variation in learning during pain avoidance but not reward seeking.Other factors potentially contributing to the intersubject variation in reward learning may not have been accounted for in our RL models.For instance, perseverance influences performance in difficult reward-learning tasks (Duckworth et al., 2007;Dale et al., 2018).With our participants' average at 73% in performance accuracy (chance at 50%), perseverance as an additional factor may be needed to fully model reward learning.

The neural correlates of individual differences in pavlovian factor
A higher pavlovian factor predicted SFG activation and worse performance in reward-seeking trials.The pavlovian factor refers to the extent to which actions are influenced by stimulus values, independent of learning.Thus, higher values of the pavlovian factor may have instigated go when no-go responses were required.The right SFG has been implicated in proactive control (Hu et al., 2016; see additional discussion of the role of the SFG below).The current findings thus extend the literature by showing the role of the right SFG in individual differences in balancing action initiation and inhibition, with hypoactivity negatively impacting behavioral performance during reward seeking.
Greater pavlovian factor was also correlated with lower left PrCG activation during pain avoidance, suggesting that stimulus value reduced PrCG activity.Mediation analysis further showed that the reduction in PrCG activity in turn decreased task performance, pointing to the possibility that the PrCG plays a role in avoidance behavior and that increased stimulus value may undermine this function of the region.Evidence for the involvement of the PrCG in the avoidance of aversive stimuli can be found in both human and animal research.For instance, stimulation of the polysensory zone of the PrCG elicited defensive movements in monkeys (Cooke and Graziano, 2004).An imaging study in humans showed left PrCG activity in a positive correlation with individual differences in avoidance response rates during the performance of an approach-avoidance task involving rewarding and aversive visual stimuli (Schlund et al., 2010).Other studies in healthy individuals showed higher left PrCG activity during response inhibition in a go/no-go task (Thoenissen et al., 2002) and during the avoidance of negative social scenes (Ascheid et al., 2019).As PrCG's activity may have been affected by stimulus value, it is plausible that the region receives inputs from brain structures associated with motivation and such inputs modulate PrCG activity.Our own previous work showed that the left PrCG exhibited functional connectivity with the medial orbitofrontal cortex, a region important for reward processing, and the pain-related periaqueductal gray.Importantly, these functional connectivities predicted positive alcohol expectancy in drinkers (Le et al., 2020).Thus, the PrCG's involvement in avoidance behavior may be affected by modulatory information from the reward and pain-related regions, which then help control motor responses accordingly.Taken together, left PrCG activity may support individual differences in the pavlovian factor via its interaction with the reward and punishment circuits.

Neural correlates of individual differences in subjective impact of outcomes
The PCC showed activities reflecting individual differences in the subjective impact of outcomes during avoidance learning-a measure of stimulus-driven motivation to avoid painful shocks.There is abundant evidence for the involvement of the PCC in avoidance behavior.In rodents, PCC neuronal activities increased during behavioral avoidance during water maze navigation (Riekkinen et al., 1995) and during avoidance learning involving foot shocks (Vogt et al., 1991).Synaptic plasticity within the PCC was found to play a critical role in memory consolidation for avoidance (Pereira et al., 2002).PCC inactivation by muscimol following training impaired both memory and subsequent avoidance behavior (Souza et al., 2002).In humans, the PCC is likewise involved in defensive behavior during exposure to fear and threats (McNaughton and Corr, 2004).This proposal was supported by evidence of increased PCC activation to pain processing (Nielsen et al., 2005).Importantly, in our previous work, we showed PCC activation during response inhibition in positive correlation with individual differences in punishment sensitivity trait (Le et al., 2019), lending weight to the notion that the region subserves avoidance behavior.
The results of the mediation analysis demonstrated that PCC's involvement in avoidance behavior may have been mediated by learning of motivational values of outcomes.The better the learning by the individual, the greater the PCC activation, and the more consistent and higher behavioral performance.This finding suggested that the PCC may receive modulatory information from other brain regions that evaluate stimulus motivational values to influence behaviors.Indeed, tract-tracing studies in nonhuman primates found that the PCC is highly anatomically connected.The PCC has dense connections with other paralimbic and limbic structures including the hippocampal formation and parahippocampal cortex (Vogt and Pandya, 1987;Kobayashi and Amaral, 2007).The hippocampus may serve as the hub for the convergence between the set of limbic regions related to ventral stream processing (e.g., orbitofrontal, anterior cingulate, amygdala) and the PCC related to dorsal stream processing (Rolls, 2019).This convergence likely helps with the memory of "what" happens (i.e., outcomes) and "where" (i.e., context), both of which are crucial for avoidance learning.

Neural correlates of individual differences in action bias
Finally, our whole-brain multiple regression with action bias as the predictor showed higher activity in the pre-SMA, SMA, dACC, and dlPFC.Both the pre-SMA and SMA have been implicated in action initiation, with reports of increased neuronal activities immediately before movement both in humans (Cunnington et al., 2005;Bortoletto and Cunnington, 2010) and nonhuman primates (Eccles, 1982;Nachev et al., 2008;Nakajima et al., 2009).The pre-SMA is a higher-order, cognitive motor region whose role likely includes the assessment of action value.Recent supporting evidence demonstrates that pre-SMA neurons in humans encoded option values during the decision phase of a two-armed bandit task (Aquino et al., 2023).Similarly, SMA neurons have been found to encode reward expectancy in monkeys performing eye movement tasks (Campos et al., 2005).Thus, both the pre-SMA and SMA offer a link between motivation and action.Here, individual pre-SMA/SMA activities were correlated both with action bias and go response accuracy to gain reward and avoid shocks, suggesting the regions may use motivational information (i.e., potential monetary gain or painful shocks) to execute appropriate motor activities.
Greater dACC activity was also associated with higher action bias.It is worth noting that this activity (z = 40) was located dorsal to the dACC cluster that represented individual avoidance learning rate (z = 20-40).With anatomical connectivity to the primary motor cortex and spinal cord (Matelli et al., 1991;Picard and Strick, 1996), the dACC is known to be involved in action initiation (Srinivasan et al., 2013).Other studies have suggested a more complex role, with dACC activity varying with motor and reward information during a pursuit/evasion task (Yoo et al., 2021) as well as with the probability of correct response selection during learning of stimulus-outcome association (Noonan et al., 2011).
We observed higher activation in bilateral dlPFC in association with the degree of action bias.The dlPFC is commonly implicated in executive control (Fuster, 2015) and, of specific relevance to the current findings, in representing and integrating goals and reward during decision-making (Miller and Cohen, 2001;Watanabe and Sakagami, 2007).As a hub where information about motivation, motor control, and desired outcomes converges, the dlPFC may have facilitated associative learning by relating visual cues to feedback (shock vs money) to determine required responses (go vs no-go).As with many other cortical regions, the dlPFC receives neuromodulatory inputs from the dopaminergic midbrain (Björklund and Dunnett, 2007), enabling RL via its influence on working memory, attention, and cognitive control processes (Ott and Nieder, 2019).In a study employing a simple rewarded reaction time task, dynamic causal modeling showed that reward information during goal-directed actions was the driving input to the dlPFC, which in turn modulated the activity of the ventral tegmental area and nucleus accumbens (Ballard et al., 2011).These findings support the dlPFC in reflecting individuals' tendency to initiate goal-directed actions.
In contrast, the bilateral AI and right SFG exhibited activity in a negative association with action bias, suggesting their role in restraining actions.It is worth noting that this SFG cluster is dorsal (z = 55) to the region (z = 14) implicated in individual variation in pavlovian factor.A meta-analysis showed robust activation of the AI and SFG in response inhibition across multiple tasks (Zhang et al., 2017).Right SFG response to no-go inhibition was negatively correlated with trait impulsivity (Horn et al., 2003).Using a stop-signal task, another study associated greater right SFG activation with more efficient response inhibition and less motor urgency across subjects (Hu et al., 2016).These findings suggest right SFG's role in representing individual differences in cognitive control.Similarly, the AI activation during failed no-go responses in the go/no-go task was associated with individual variations in motor impulsivity and reactive aggression in healthy male adults (Dambacher et al., 2013).More broadly, AI volume showed a negative relationship with multiple impulsivity and compulsivity measures in individuals with alcohol use disorders (Grodin et al., 2017).Taken together, the AI and right SFG may represent an index of reduced action bias during goal-directed behaviors.

Limitations
The current study has several potential limitations.First, we used electric shocks and monetary gains to motivate learning associated with pain avoidance and reward seeking.Across subjects, we observed a greater learning rate for pain avoidance relative to reward seeking.It is plausible that this difference reflected higher saliency for electric shocks versus money.To explore this possibility, we analyzed the skin conductance data during feedback of money versus shocks (including both actual and omitted shocks).The skin conductance responses did not differ significantly across the two conditions (data not shown), suggesting comparable saliency between the two contingencies.Despite this finding, we cannot rule out the possibility that learning was biased by the use of the two distinct motivational outcomes.Second, to reduce the effects of head motion, we only delivered the shocks half of the time the shock feedback was displayed.Learning may have been affected by the absence of shock delivery.However, we found that feedback with omitted and actual shocks both elicited robust activations in regions previously reported to be responsive to experimentally induced pain (Xu et al., 2020).Thus, negative feedback with omitted shocks was likely sufficiently salient to evoke activations indicative of pain or pain anticipation.

Conclusions
The current work investigated the neural correlates of individual differences in RL.Learning was facilitated by positive outcomes and hindered by the excessive influence of stimulus value.Our imaging results shed light on the neural correlates of various RL metrics, showing distinct roles of the medial frontal cortex, including the ACC, in representing avoidance learning rate, pavlovian factor, and action bias.We confirmed the involvement of the PCC and PrCG in avoidance learning by demonstrating their activities in correlation with individual differences in learning characteristics.These findings may have important implications for the understanding of mental disorders that manifest as dysfunctional RL.

Figure 2 .
Figure 2. Neural correlates of avoidance learning rate during pain avoidance.A, Whole-brain multiple regression of avoid > 0 on learning rate showed higher activations in the dorsal ACC, MCC, left PoCG, and left superior temporal sulcus.B, The regional activations positively predicted performance accuracy, and this relationship was significantly mediated by the learning rate during pain avoidance.*p < 0.05; **p < 0.01.

Figure 3 .
Figure 3. Neural correlates of pavlovian factor during pain avoidance and reward seeking.A, Whole-brain multiple regression of avoid > 0 on the pavlovian factor negatively predicted activation in the left PrCG.B, Mediation analysis: during pain avoidance, brain activation predicted performance accuracy, and this relationship was significantly and negatively mediated by the pavlovian factor.C, During reward seeking (i.e., win > 0), the pavlovian factor negatively predicted activation in the right SFG and right PCC.D, Mediation analysis: during monetary win trials, right SFG and right PCC activation predicted performance accuracy, and this relationship was significantly and negatively mediated by the pavlovian factor.*p < 0.05; **p < 0.01.

Figure 4 .
Figure 4. A, Whole-brain multiple regression of avoid > 0 on the subjective impact of outcomes positively predicted activations in the PCC and SPL.B, Mediation analysis: during pain avoidance, brain activation positively predicted performance accuracy, and this relationship was significantly and positively mediated by the subjective impact of outcomes.*p < 0.05; **p < 0.01.

Figure 5 .
Figure 5. Neural correlates of action bias during go versus no-go trials (A) Whole-brain multiple regression of go > no-go on action bias positively predicted activations in the bilateral dlPFC; left superior frontal cortex; a cluster containing the SMA, pre-SMA, and dACC; and a large cluster containing the bilateral SPL, visual cortices, parahippocampal gyri, and cerebellum.B, Mediation analysis: during go responses, brain activations positively predicted performance accuracy, and this relationship was significantly mediated by action bias.C, Whole-brain multiple regression of go > no-go on action bias negatively predicted activations in the bilateral striate areas, bilateral superior temporal sulcus, bilateral AI, and right SFG.D, Mediation analysis: during no-go responses, brain activations positively predicted performance accuracy, and this relationship was significantly mediated by action bias.*p < 0.05; **p < 0.01.

Table 1 .
Neural correlates of individual differences in RL parameters during pain avoidance and reward seeking

Table 2 .
Neural correlates of individual differences in RL parameters during action initiation