Abstract
Addiction is characterized by a profound intersubject (phenotypic) variability in the expression of addictive symptomatology and propensity to relapse following treatment. However, laboratory investigations have primarily focused on common neural substrates in addiction and have not yet been able to identify mechanisms that can account for the multifaceted phenotypic behaviors reported in the literature. To fill this knowledge gap theoretically, here we simulated phenotypic variations in addiction symptomatology and responses to putative treatments, using both a neural model, based on cortico-striatal circuit dynamics, and an algorithmic model of reinforcement learning (RL). These simulations rely on the widely accepted assumption that both the ventral, model-based, goal-directed system and the dorsal, model-free, habitual system are vulnerable to extra-physiologic dopamine reinforcements triggered by addictive rewards. We found that endophenotypic differences in the balance between the two circuit or control systems resulted in an inverted-U shape in optimal choice behavior. Specifically, greater imbalance led to a higher likelihood of developing addiction and more severe drug-taking behaviors. Furthermore, endophenotypes with opposite asymmetrical biases among cortico-striatal circuits expressed similar addiction behaviors, but responded differently to simulated treatments, suggesting personalized treatment development could rely on endophenotypic rather than phenotypic differentiations. We propose that our simulated results, confirmed across neural and algorithmic levels of analysis, inform on a fundamental and, to date, neglected quantitative method to characterize clinical heterogeneity in addiction.
Significance Statement
Addiction is known to encompass heterogeneity in its development, maintenance, and treatment response. While previous work has mostly focused on the common mechanisms underlying vulnerabilities in addiction at a group level, the neurocomputational causes for such intersubject variability in addiction are not well understood. To fill this knowledge gap, we combine a neural and a reinforcement learning (RL) model to reveal that the balance between neural circuits or computational control modalities characterizes the presence of behavioral phenotypes in addiction. The presence of converging effects, validated across neural and algorithmic levels of analysis, informs on a quantitative method to characterize clinical heterogeneity and potentially helps future development of precision treatments.
Introduction
Addiction is known to encompass a wide range of individual behavioral differences (i.e., phenotypes) in development, maintenance and severity of symptoms, and treatment response (Everitt and Robbins, 2016). Previous investigations into the mechanisms underlying this heterogeneity of behaviors have identified two fundamental neurocomputational alterations correlated with vulnerability in the development and severity of addictive behaviors (Garrison and Potenza, 2014; Jupp and Dalley, 2014; Belin et al., 2016). These neural and computational intersubject differentiations (i.e., endophenotypes) include (1) a dysregulation of D2 receptors in the striatum (Morgan et al., 2002; Nader and Czoty, 2005; Dalley et al., 2007; Flagel et al., 2014) and (2) an alteration of learning rates within a reinforcement-learning framework (Gutkin et al., 2006; Piray et al., 2010). However, these endophenotypic differences are found across a wide spectrum of dissociable phenotypes, so that the same neural or computational mechanism is used to account for separable behavioral traits. For instance, different forms of striatal D2 dysregulation are found in individuals differing in terms of their impulsivity (Dalley et al., 2007; Volkow et al., 2007), social dominance (Morgan et al., 2002; Gould et al., 2014), motor reactivity or preference for novelty (Flagel et al., 2010, 2014), or sensitivity to rewards (Belcher et al., 2014). Each of these behavioral traits is separately correlated with development of addiction, but they do not necessarily coexist in the same individuals (cf. novelty seeking and impulsivity: Ersche et al., 2010; Molander et al., 2011; Belin and Deroche-Gamonet, 2012). This mismatch between few known endophenotypic differences and a wide variety of multifaceted, dissociable, behavioral phenotypes suggests there are yet unknown neural and computational mechanisms that are responsible, alone or in interaction, for the reported behavioral differentiations. 
Finally, investigations into intersubject variability often emphasize the initial stage of addiction development (but see Belin et al., 2008; Economidou et al., 2009; Pelloux et al., 2015). Yet, individual differences also exist in treatment response, resulting in diverse relapse patterns among individuals showing similar severity of symptoms. These differences have so far not been addressed by neural or computational models.
Here, we propose a theoretical investigation into the interaction between ventral and dorsal cortico-striatal circuits and the associated behavioral control modalities. Several studies have emphasized that addiction is associated with alterations of ventral and dorsal cortico-striatal circuits, and of motivations and habits (Volkow and Morales, 2015; Everitt and Robbins, 2016; Koob and Volkow, 2016). However, the role played by the interaction between the two neural circuits, or between the two behavioral control modalities, in generating intersubject variability in addiction has so far been neglected. To investigate this interaction, we use two models to simulate neural dynamics and algorithmic (or normative) choice selections in a multiple-choice task involving drug and non-drug rewards. Then we test these models under different conditions of circuit or control modality dominance (i.e., simulated endophenotypes). Consistent with previous models, we assume addictive substances hijack the healthy reward prediction error signal (Schultz et al., 1997) by triggering extra-physiologic dopamine bursts (Nestler and Aghajanian, 1997; Koob and Volkow, 2016). These dopamine activities signal the presence of an aberrant unexpected reward, leading to the repetition of drug-related actions and escalation of consumption (Redish et al., 2008; Dayan, 2009). In our neural model, this process of reinforcement learning (RL; Sutton and Barto, 1998) is mediated by extra-physiologic changes in cortico-striatal connectivity weights (Hyman et al., 2006; Haber, 2008; Koob and Volkow, 2016). These changes in turn aberrantly affect circuit gain and the stability of both ventral and dorsal cortico-striatal circuits, disrupting their respective roles in encoding and selecting goal-directed behaviors (Balleine, 2005; Balleine and O'Doherty, 2010; Gruber and McDonald, 2012) and habitual responses (Yin et al., 2004; Balleine and O'Doherty, 2010). 
A similar effect is assumed for our algorithmic model, where overevaluation of drugs and related RL affect the two control modalities, termed model-based and model-free, that approximate ventral/goal-oriented and dorsal/habitual implementations (Dolan and Dayan, 2013; Voon et al., 2017). As a result, and consistently with previous formulations of RL models of addiction (Redish et al., 2008; Piray et al., 2010; Gillan et al., 2016), both the planned evaluation of known action-outcome contingencies, represented in an internal model of the world, and the reactive immediate motor responses are biased toward drug-related selections.
Based on these assumptions, our models show that phenotypic differentiation in addiction development and treatment response can emerge as a function of the interaction between ventral and dorsal circuits or model-based and model-free control modalities. Our simulated results offer a proof-of-concept that this interaction is a candidate independent neural and computational mechanism underlying addiction vulnerability, putatively characterizing three different endophenotypes differing in the likelihood to develop addiction, severity of symptoms and treatment response. We suggest this neurocomputational mechanism could interact with both previously described D2 receptor dysregulation in the striatum (Dalley et al., 2007; Flagel et al., 2014) and altered learning rates (Gutkin et al., 2006; Piray et al., 2010) to generate the variety of dissociable behavioral traits reported in the literature as associated with addiction vulnerabilities.
Materials and Methods
In brief, we present two complementary models simulating endophenotypic differences and their effects on addiction development and treatment response. In the models, intersubject differences are expressed in terms of either neural circuit dominance (i.e., ventral or dorsal circuit) or control modality dominance (i.e., model-based or model-free) in determining behavioral selections. The resulting phenotypes are tested in environments granting free access to a simulated substance of addiction, as usually implemented in laboratory studies. In particular, we compare our simulated phenotypic variability with the results described in a recent study investigating individual differences in rats self-administering the stimulants cocaine or a designer drug, a dopamine- and mixed dopamine-norepinephrine reuptake inhibitor, respectively (Gannon et al., 2017). We selected this study because it highlights how different drugs, dosages, and tasks result in different ranges of phenotypic differentiation. For instance, an initial acquisition phase, over a 10-d period, shows compulsive behavior developed in up to 75% of rats self-administering cocaine and 87.5% of those exposed to the designer drug. Furthermore, under a fixed ratio (=5) schedule, the study shows self-administration varied significantly among subjects. A subset of the rat population, termed high responders, self-administered cocaine up to 60% more times in comparison with a different subset, termed low responders, depending on dosage (cf. Gannon et al., 2017, and their Fig. 3). Importantly, the task setup chosen for both of our proposed models involves the selection of a drug reward over explicit non-drug-related alternatives; in contrast, the chosen empirical study utilizes a time-out responding paradigm, where the only explicit non-drug-related behavior (a lever-press) is not rewarded. 
As for most studies simulating addiction (Redish, 2004), we believe the choice to present our simulated agents with a richer set of options (i.e., more than one) does not invalidate a parallel between simulated and real data. We consider the simulated competing options as a proxy for the many conflicting stimuli and associated behaviors that animals have access to, even in the limited environment of a standard operant conditioning chamber. Thus, our focus is on perturbing the balance between the dorsal/model-free and the ventral/model-based systems, to compare our simulated behavioral differentiations in the escalation and compulsive selection of drug-related actions with the data reported in the chosen laboratory study.
The two models comprise a neural mass model that has been validated and described in the context of choice behavior and dopaminergic modulation (Fiore et al., 2016, 2018; Hauser et al., 2016) and a normative or algorithmic model based on standard RL schemes (Sutton and Barto, 1998). In the neural model, addiction and treatment response are modeled through DA-dependent associative plasticity in both ventral and dorsal circuits. In the RL model, aberrant learning is modeled using a duplex of model-based and model-free schemes that compete for control over action selection. The model-based scheme entails learning a model of the environment (in the form of probability transition matrices among states) that is used to compute value functions under the Bellman optimality principle (Bellman, 1966). The equivalent model-free scheme uses prediction error-based learning to directly acquire the value of state-action pairs. Both neural and RL models are tested under four successive stages or phases: (1) before exposure to the simulated drug (termed pre-drug); (2) learning of addictive behavior (termed addiction); (3) simulated ideal therapeutic interventions (termed treatment) that partially revert the learning of the previous phase; and finally, (4) reinstated access to the simulated drug following each treatment (termed relapse). The simulated treatments are conceived to emphasize endophenotypic response and relapse differentiation; therefore, they predominantly affect only one control system, targeting either the goal-oriented/model-based or the habitual/model-free system. The former treatment is assumed to modify only the internal model of the environment and related selection of action-outcome contingencies performed in the ventral circuit. 
The latter treatment represents a condition in which the model of the world of the agent remains mainly unaltered, but the acquired drug-related stimulus-response associations are disrupted, thus preventing the agent from exhibiting habitual responses (cf. Doll et al., 2009).
The unique aspect of this complementary modeling approach is that converging results from neural and algorithmic models can validate each other, as process and implementation theories (i.e., synaptic and dynamical mechanisms) complement the normative principles formalized in the RL model.
Neural field model
Basic model architecture and parameterization
In cortico-striatal circuits, the signal processed in the cortex is conveyed toward its respective area of the striatum, processed in basal ganglia and finally relayed to the same cortical area where it originated, via thalamus (Haber, 2003; Draganski et al., 2008; Jahanshahi et al., 2015). Thus, despite diverging in terms of the information processed, e.g., sensorimotor or rewards and outcomes, these circuits are characterized by similar computational dynamics (Obeso et al., 2014). Temporal responses in recurrent neural networks co-occur with state transitions or input transformations that are often described in terms of energy landscapes (Fig. 1A–C). If multiple inputs or initial states generate transitions toward the same final state, this state is termed an attractor state (Amit, 1989). In recurrent networks such as cortico-striatal circuits, learning processes modulate the circuit gain, thereby affecting the strength of the attractor states and the overall stability of the system (Fiore et al., 2015, 2016; Hauser et al., 2016).
We simulate the temporal responses in cortico-striatal circuits in a neural model (for illustrative representation of the neural architecture, see Fig. 1D). This neural model simulates mean-field activity (Deco et al., 2008) within multiple channels of both dorsal and ventral cortico-striatal loops. A continuous-time differential equation simulates changes over time of the average action potential of a pool of neurons (Eq. 1), and a positive transfer function (Eq. 2) converts this action potential in the final activation of the pool ( ). Finally, the plasticity of the connections ( ) between cortex and striatum is characterized by DA-dependent Hebbian learning, corrected with a constant threshold (th) as defined in Equation 3. The resulting rule strengthens the connections among all active nodes in the cortex and those active in the striatum and weakens the connections among nodes showing opposite activation status. (1) (2) (3)
The input ( ), reaching each node in the neural network is modulated by two coefficients λ and ϵ . These regulate the ratio between the signal affected by the presence of dopamine release d and the amount of signal that is computed independent of dopamine release. For most units, the values of the two coefficients are set to and , with the exception of the simulated striatal units, where these parameters are set to and , to simulate the differential effect dopamine has, depending on the most prevalent receptor type ( > 1 and λ < 0 for D1 and D2 receptors, respectively). Due to the different effects the dopamine receptors have on the activity of the simulated neurons, the drug-induced dopamine-dependent Hebbian learning significantly affects D1-enriched units in the striatum, while having negligible effects on D2-enriched units (Gerfen and Surmeier, 2011; Volkow and Morales, 2015).
Simulating different addiction phenotypes and treatment effects
Agents controlled by the neural model are immersed in a simplified environment and can select among three arbitrary actions or inactivity (cf. nonstationary three armed bandit environment). The selection of the actions is conducted in the circuit simulating the dorsal cortico-striatal activity, and it is considered completed if the neural activity of any of the units in the external layer of the simulated cortex (Fig. 1D) is maintained for at least 2 s. Ventral and dorsal circuits interact, both ways, via corticocortical connectivity. Therefore, the activity in the simulated ventral circuit biases action selection in the dorsal circuit and the selection of actions in the dorsal circuit biases the activity in the ventral circuit. To test our hypothesis about the effect these reciprocal biases have on choice behavior, we assumed corticocortical weights do not vary over time and we tested eleven combinations for the parameters determining their weights, as = [0.02–0.2], [0.03–0.17], [0.03–0.15], [0.05–0.15], [0.07–0.13], or [0.1–0.1] (and symmetrical). This spectrum of weights describes the strength of the biases between the two major circuits, thereby characterizing either a balanced condition or a dominance of one of the two circuits. We report the effects in terms of behavioral responses for these putative endophenotypes and test each of these with thirty noise seeds, random inputs and under four stages, to allow within phenotype comparisons. The first stage, “pre-drug,” represents an assessment of behavior before any drug or reward is introduced, as the three available inputs randomly change their value to determine a nonstationary order of preferences. Under the second stage, termed “addiction,” one action is associated with the administration of a simulated addictive substance, triggering DA phasic responses and associated Hebbian learning in cortico-striatal connections of both ventral and dorsal circuits. 
For the third stage, termed “treatment,” we simulate the effects of deprivation coupled with one of two hypothetical treatments targeting either the dorsal or the ventral cortico-striatal circuits. The treatments are simulated by reverting the learning process in either the dorsal or the ventral cortico-striatal circuit, respectively, representing an intervention that would block or extinguish either the habitual drug-related response (an ideal behavioral treatment) or the drug-related emotional and value association (an ideal cognitive treatment). The dorsal treatment brings back the pre-drug configuration in the dorsal circuit and keeps the configuration reached under the addiction stage for the ventral circuit. The ventral treatment is achieved with the opposite intervention. Finally, during the fourth stage, termed “relapse,” we reintroduce access to the simulated addictive substance, inducing relapse. For this stage, relapse time is defined as the time required to reinstate the configuration of cortico-striatal weights found at the end of the addiction stage.
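The eleven corticocortical weight combinations described above can be enumerated with a short sketch. This is only an illustration, not the authors' MATLAB code: the names `LISTED_PAIRS` and `endophenotype_weights` are hypothetical, and the ordering of each pair (which value is the ventral-to-dorsal weight and which is the dorsal-to-ventral weight) is an assumption, since the text does not specify it.

```python
# Six weight pairs as listed in the text. Which entry of each pair is the
# ventral-to-dorsal weight is NOT stated in the source (assumption).
LISTED_PAIRS = [
    (0.02, 0.2), (0.03, 0.17), (0.03, 0.15),
    (0.05, 0.15), (0.07, 0.13), (0.1, 0.1),
]


def endophenotype_weights(pairs=LISTED_PAIRS):
    """Return the listed pairs plus their mirrored ("symmetrical") versions.

    The balanced pair is its own mirror image, so five unbalanced pairs,
    one balanced pair, and five mirrored pairs give eleven combinations.
    """
    unbalanced = [p for p in pairs if p[0] != p[1]]
    balanced = [p for p in pairs if p[0] == p[1]]
    mirrored = [(b, a) for (a, b) in reversed(unbalanced)]
    return unbalanced + balanced + mirrored


weights = endophenotype_weights()
for index, (w_first, w_second) in enumerate(weights, start=1):
    print(f"endophenotype {index:2d}: weights = [{w_first}, {w_second}]")
```

With this ordering the balanced condition sits in the middle of the spectrum (position 6) and the most unbalanced conditions sit at the two ends, which matches how the endophenotypes are indexed in the Results.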
RL model
Basic model architecture and parameterization
In this model, we assume that the behavior of the agent relies on a hybrid model (Daw et al., 2011) that learns and computes the value of choices (actions, $a_t$) under each condition (state, $s_t$). Value is defined as a quantity that combines short and long-term expected rewards and negative outcomes when a specific strategy of action is followed (policy, π). It is formally defined as:
$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k}, a_{t+k})\right] \qquad (4)$$
In Equation 4, $R(s, a)$ denotes the instantaneous reward received when action a is performed in state s. γ is a discount factor, between 0 and 1, which defines the trade-off between immediate and long-term rewards. The value of a state given the policy is defined as $V^{\pi}(s) = Q^{\pi}(s, \pi(s))$. For each environment, there is an optimal policy $\pi^{*}$, which maximizes the value for every state (Sutton and Barto, 1998).
The environment can be completely characterized through the state transition distributions $P(s' \mid s, a)$ and the expected rewards $R(s, a)$. These two functions together represent a model of the environment. Model-based behaviors compute $Q$ and the policy relying on such functions, at each state, following the Bellman equation (Daw and Dayan, 2014):
$$Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a') \qquad (5)$$
The model-based component learns the transition distributions and the expected rewards during the interaction with the environment. Thus, differently from other hybrid models (Daw et al., 2005; Keramati et al., 2011; Pezzulo et al., 2013), the quality of Q value estimation at any given moment depends on the experience the agent acquired up to that point in time. To compute value estimation ($Q^{MB}$), this bounded (Gershman et al., 2015) component applies at each step the Bellman equation (Eq. 5) a limited number of times to states sampled stochastically following a heuristic for efficient state update selection. The algorithm is an early-interrupted variation of the Prioritized Sweeping algorithm (Moore and Atkeson, 1993) with stochastic state update selection. Crucially, our model-based component does not accumulate the variations of Q values over time, and restarts the computation after each step (desJardins et al., 1999). This choice is meant to instate a plausible bounded rationality for our model, which can account for the cognitive costs and ensuing limits of integrating old and new information about the environment, while updating and extending a complex plan to navigate it. This implementation is suitable for a bounded rational model-based component that shows controlled stochasticity of deliberation performances in nontrivial environments. This choice allows us to test the effects of the hypothesized endophenotypic differentiation in an environment characterized by a higher degree of complexity in comparison with both the one chosen for the neural model and those described in the literature of RL models of addiction. In particular, we consider drug consumption to be associated with complex after-effects that make it difficult to predict the overall result of pursuing the related action course.
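The bounded planning step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bounded_bellman`, the data layout (`P[s][a]` as a dictionary of next-state probabilities, `R[s][a]` as a reward table), and the parameter values are hypothetical. In particular, states are sampled uniformly here as a stand-in for the prioritized-sweeping heuristic, which the text does not specify in enough detail to reproduce.

```python
import random


def bounded_bellman(P, R, gamma=0.9, n_updates=50, rng=None):
    """Apply a limited number of Bellman backups to randomly chosen states.

    P[s][a] maps next states to probabilities and R[s][a] is the expected
    reward (both assumed to be the agent's current learned model).
    Q is rebuilt from zero on every call, mirroring the restart of the
    computation after each step described in the text.
    Uniform state sampling is an ASSUMPTION replacing the paper's heuristic.
    """
    rng = rng or random.Random(0)
    n_states, n_actions = len(R), len(R[0])
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_updates):
        s = rng.randrange(n_states)
        for a in range(n_actions):
            # Expected value of the best action in each possible next state.
            expected_next = sum(p * max(Q[s2]) for s2, p in P[s][a].items())
            Q[s][a] = R[s][a] + gamma * expected_next
    return Q


# Toy two-state model: in state 0, action 1 yields reward 1 and moves to
# the absorbing, unrewarded state 1; action 0 stays in state 0.
P = [[{0: 1.0}, {1: 1.0}], [{1: 1.0}, {1: 1.0}]]
R = [[0.0, 1.0], [0.0, 0.0]]
Q = bounded_bellman(P, R)
print(Q)
```

Because the number of backups is capped and the updated states are sampled, the quality of the resulting estimate varies from call to call in larger environments, which is the controlled stochasticity of deliberation the paragraph refers to.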
In comparison with other hybrid models such as Dyna and Dyna2 (Sutton, 1990; Silver et al., 2016), the proposed architecture does not share Q values between model-based and model-free components, nor does it require that the two processes share the same state representations. The two components separately represent their Q values and integrate them in a later phase. This decoupling is assumed to result in a more biologically plausible agent (Daw and Dayan, 2014), and it is essential for the simulation of two separate treatments, a requirement for establishing a comparison with the behavior simulated with the neural model. In contrast with previous work using a hybrid Dyna-like architecture and prioritized sweeping algorithm, where the sharing of the Q values explained the appearance of model-based drug-oriented behavior (Simon and Daw, 2012), in our simulations this model-based addiction emerges in independent model-free and model-based components. Thus, addiction behavior results from the joint effect of high reward (i.e., the drug), a limited number of stochastically selected policy updates and limited knowledge of the environment.
The model-free component has been implemented using the Q-Learning algorithm in tabular form (Watkins and Dayan, 1992). Q-learning updates initial state value estimations as follows:
$$\delta_t = R(s_t, a_t) + \gamma \max_{a} Q^{MF}(s_{t+1}, a) - Q^{MF}(s_t, a_t) \qquad (6)$$
$$Q^{MF}(s_t, a_t) \leftarrow Q^{MF}(s_t, a_t) + \alpha\, \delta_t \qquad (7)$$
where α is a learning factor between 0 and 1. Our hybrid model computes choice values in a fashion that balances model-free (MF in the equations) and model-based (MB in the equations) components depending on a parameter β. Six values (1, 0.8, 0.6, 0.4, 0.2, 0) are used for this parameter to simulate different endophenotypes, on a spectrum between purely model-based (β = 1) and purely model-free (β = 0) RL.
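As an illustration of the model-free component, the following is a minimal sketch of the standard tabular Q-learning update (Watkins and Dayan, 1992). The function name `q_learning_update`, the list-of-lists Q table, and the default parameter values are hypothetical choices made for this example and are not taken from the authors' code.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Update Q[s][a] in place from one observed transition.

    Q is a table indexed as Q[state][action]; alpha is the learning
    factor and gamma the discount factor (illustrative values only).
    Returns the prediction error used for the update.
    """
    # Prediction error: observed reward plus discounted best next value,
    # minus the current estimate for the chosen state-action pair.
    delta = r + gamma * max(Q[s_next]) - Q[s][a]
    Q[s][a] += alpha * delta
    return delta


# One update on a toy table with two states and two actions.
Q_table = [[0.0, 0.0], [0.0, 2.0]]
delta = q_learning_update(Q_table, s=0, a=1, r=1.0, s_next=1, alpha=0.5)
print(delta, Q_table[0][1])
```

A large drug reward entering this update produces a large prediction error, so the drug-related state-action value grows quickly; the model-free treatment described later acts on the learning factor of this update.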
To allow exploration, the action to execute is selected randomly 10% of the time. This exploration factor is kept constant to support adaptation to a changing environment (Singh et al., 2000) and to simulate the continuous update of knowledge necessary to cope with ecological environments. The remaining 90% of the time, actions are determined by maximizing $Q^{MX}(s, a)$ in a strategy defined as ε-greedy (ε = 0.1). These values are produced by combining the values computed by the model-based and model-free components:
$$Q^{MX}(s, a) = \beta\, Q^{MB}(s, a) + (1 - \beta)\, Q^{MF}(s, a) \qquad (8)$$
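The action selection described above can be sketched as follows. The sketch assumes that the combination of the two components is a β-weighted linear mixture, which is consistent with β = 1 being purely model-based and β = 0 purely model-free; the function names `mixed_q` and `select_action` are hypothetical.

```python
import random


def mixed_q(q_mb, q_mf, beta):
    """Combine per-action values of the two components for one state.

    ASSUMPTION: a linear mixture weighted by beta (1 = purely model-based,
    0 = purely model-free).
    """
    return [beta * mb + (1.0 - beta) * mf for mb, mf in zip(q_mb, q_mf)]


def select_action(q_mb, q_mf, beta, epsilon=0.1, rng=random):
    """Epsilon-greedy choice over the mixed values."""
    if rng.random() < epsilon:
        # Exploration: pick any action uniformly at random.
        return rng.randrange(len(q_mb))
    q = mixed_q(q_mb, q_mf, beta)
    # Exploitation: pick the action with the highest mixed value.
    return max(range(len(q)), key=q.__getitem__)


# The two components disagree: the model-based one prefers action 0,
# the model-free one prefers action 1.
q_mb, q_mf = [1.0, 0.0], [0.0, 1.0]
for beta in (1.0, 0.8, 0.6, 0.4, 0.2, 0.0):
    print(beta, select_action(q_mb, q_mf, beta, epsilon=0.0))
```

With exploration switched off in this toy case, the chosen action flips from the model-based preference to the model-free preference as β decreases, which is how the parameter defines the simulated endophenotypes.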
The choice for a fixed balance between model-based and model-free requires minimal assumptions on their interaction and has been used in recent RL architectures (Silver et al., 2016).
Simulating different addiction phenotypes and treatment effects
In comparison with the simulations characterizing the neural model, a more complex environment is in use for the RL model to highlight how our endophenotypic differentiations can also affect the likelihood to develop addiction. This environment is characterized by a total of 20 states divided into four different types (Fig. 2): (1) healthy rewards (i.e., normal rewards that are not directly associated with drugs); (2) neutral states (no reward or negative outcome); (3) drug-related states, which give a high reward but are followed by multiple (4) drug aftereffects, characterized by small negative outcomes. Similar to the neural model investigations, the agent deals with environment variations meant to simulate four phases of addiction: initial pre-drug phase (f1); addiction (i.e., the drug becomes accessible for the first time, f2); treatment (f3); relapse (i.e., second drug exposures; f4). Under the initial pre-drug phase (dinit = 50 steps), the agent does not receive any reward or negative outcome by entering the drug-related and aftereffects area, but a moderate reward is assigned (Rg = 1) by accessing the healthy reward state. Under the phases of addiction and post-treatment addiction (dtpy = 1000 steps), the agent can also receive a high reward, after accessing a drug-related state (Rd = 10). The drug state always leads to a series of randomized state transitions among the aftereffects states (Ra = -1.2) and simulates generic negative consequences associated with addiction. The agent can occasionally leave this aftereffect area of the environment (Fig. 2) to reach a neutral state, at the price of a further negative outcome (Ra = -4). Under the treatment phase (dtpy = 1000 steps), the drug-related state results in a negative outcome (Rdt = -1; Tables 1, 2, column f3), thus increasing the chances the agent stops pursuing this state. 
To allow for a comparison with the results in the neural model, we simulate a model-based and model-free treatment by manipulating the learning factor of the nontreated control modality, decreasing it: αCtpy = 0.01 * α. Under the relapse phase, we measure the simulated time required by the agents to reach at least 95% of drug-related action preference as recorded under the addiction phase, after the drug is introduced again in the environment. This threshold is used to measure the percentage of agents relapsing, as well as the time required to complete the relapse, per endophenotype.
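The relapse criterion can be made concrete with a small sketch. The text states the 95% threshold but not how drug-related action preference is computed during relapse, so the sliding-window estimate used here, the window length, and the function name `relapse_time` are all assumptions introduced for illustration.

```python
def relapse_time(drug_choices, addiction_preference, window=50, threshold=0.95):
    """Return the first step at which relapse is considered complete.

    drug_choices: sequence of 0/1 flags, 1 when the agent selected the
        drug-related action at that step of the relapse phase.
    addiction_preference: fraction of drug-related selections recorded
        under the addiction phase.
    The preference is estimated over a sliding window (ASSUMPTION).
    Returns None when the agent never reaches the threshold.
    """
    target = threshold * addiction_preference
    for end in range(window, len(drug_choices) + 1):
        recent = drug_choices[end - window:end]
        if sum(recent) / window >= target:
            return end
    return None


# Toy relapse phase: 10 steps without drug selections, then 20 with.
choices = [0] * 10 + [1] * 20
print(relapse_time(choices, addiction_preference=0.8, window=10))
```

Agents for which the function returns `None` would count as not relapsing, so the same criterion yields both the percentage of relapsing agents and the time to relapse per endophenotype.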
Code accessibility
All models rely on custom code developed in MATLAB (optimized for R2014b) that has been run successfully on multiple OS (iOS, Linux and Windows) on different computers and local servers. The code can be accessed at any time from the repository ModelDB (http://modeldb.yale.edu/239540). The downloadable archive file consists of two folders (respectively, for the neural model and the RL model), which include the entire source code required to replicate the data reported in our Results section. Code available as Extended Data Code File 1.
Extended Data Code File 1.
To access the source code of both models, visit the ModelDB website (https://senselab.med.yale.edu/modeldb/enterCode.cshtml?model=239540) and download the archive. The source code shows its structure in the commented main files “separate_test.m” and “RunExperimentLearning96.m,” respectively, in the folder “neural_model” and “RL_model.”
Results
Simulations from the neural field model
During all stages, the three stimuli randomly change every few seconds, putatively representing a dynamic fluctuation of values associated with perceived cues in a nonstationary environment. This setup requires the agents to rapidly adapt to these changes, transiently triggering the motor response associated with the most valuable cue, to achieve optimal behavior. During the pre-drug stage, dorsal and ventral circuits perform unbiased selections, collaborating in the generation of a near-optimal sequence of motor selections. All eleven endophenotypes show uniform distributions of action selections, complying with the random distribution of the input configurations (Fig. 3A). This control stage allows the simulated network to generate transient temporal responses that couple multiple initial states with multiple stable states, in a transient winner-take-all or winner-less competition (Rabinovich et al., 2006; Afraimovich et al., 2008).
During the simulated addiction stage, one of the actions is associated with drug administration (Fig. 3B, values represented in blue). Substance use triggers phasic dopamine bursts, leading to Hebbian learning in cortico-striatal connections of both dorsal and ventral circuits (Eq. 3). In recurrent networks, circuit gain increases as a direct function of the weights of reentrant synapses (Amit, 1989). A dopamine response triggered by healthy unexpected rewards would create a bias toward the selection of the reinforced motor response to a perceived cue (Cohen and Frank, 2009; Grahn et al., 2009; Baldassarre et al., 2013). However, drug consumption triggers extra-physiologic dopamine-dependent learning, which in our model results in aberrantly high circuit gain, compromising the ability of all affected circuits to discriminate among different inputs and produce temporal transitions toward multiple stable states (cf. Fiore et al., 2014). The cortico-striatal circuits become overstable and resistant to perturbation caused by a change of input or by noise as they are dominated by parasitic attractors (Hoffman and McGlashan, 2001; Fig. 1C). In the ventral cortico-striatal circuit, a parasitic attractor sets and maintains the selection of drug-related goals or outcomes, biasing the action-outcome assessments required for planning. In the dorsal circuit, the same process determines overstable selections of the reinforced motor behavior, generating reactive responses and habits. Importantly, the learning process simulated in our neural model leads to the generation of parasitic attractors in both circuits across all endophenotypes, as all agents eventually reach a fixed threshold in cortico-striatal neural plasticity. Despite the generation of a form of compulsive drug seeking behavior across all endophenotypes, we observe significant differences in motor response patterns as a function of the balance between ventral and dorsal circuits. 
Specifically, the endophenotypes characterized by unbalanced dorsal or ventral control (i.e., Fig. 3B, endophenotypes 1–3 and 9–11) express distributions of motor selections that are significantly more compromised by drug-related aberrant rewards, in comparison with balanced endophenotypes (i.e., Fig. 3B, endophenotypes 5–7). The presence of identical learning processes, and the associated attractor formation in both ventral and dorsal circuits, ascribes all phenotypic differences univocally to the only remaining independent variable, which controls corticocortical connectivity and therefore the strength of the biases between circuits. Unbalanced agents are characterized by more frequent drug-related selections as actions leading to drug consumption are selected more frequently than in balanced endophenotypes, in a range between +3% and +45%. This result identifies all phenotypes within the limits of individual differentiation described in the study chosen for behavioral comparison (Gannon et al., 2017).
Next, we investigate how the simulated endophenotypes behave during the stages of treatment and relapse. First, we measure the frequency of drug-related action selections during the stages of addiction and treatment (Fig. 4A,B). Both ventral (goal-oriented) and dorsal (habitual) treatments effectively reduce the number of actions associated with drug consumption, in comparison with baseline addiction. However, the dorsal treatment is more effective for dorsal-dominated endophenotypes and the ventral treatment is more effective for ventral-dominated endophenotypes. These endophenotype-specific treatment effects are further confirmed by our analysis of individual differences under the relapse stage (Fig. 4C,D): dorsal treatments are more effective in prolonging time to relapse for dorsal-dominated endophenotypes, whereas ventral treatments are more successful in delaying relapse for ventral-dominated endophenotypes. This analysis shows that simulated treatments focusing either on the dorsal circuit (and therefore habitual responses) or the ventral circuit (and therefore motivational responses) can have substantially different effects, depending on the balance between dorsal and ventral circuits. Importantly, these differences emerge only after the treatment is applied: a pre-treatment comparison between the compulsive behaviors expressed by the opposite unbalanced endophenotypes (i.e., ventral-dominant or dorsal-dominant) does not show any significant difference in choice selections (Fig. 3B, endophenotypes 1–3 and 9–11).
Simulations from the RL model
By simulating explicit negative outcomes associated with drug consumption, the RL model allows us to measure the likelihood that each agent develops addiction, as a function of its endophenotype. In our analysis, an agent is defined as addicted if, under the addiction phase, it selects the drug more frequently than the healthy alternative reward. The mean percentage of these addicted agents (over 300 runs) was 43.05%, across endophenotypes, which is consistent with the percentage of rats developing compulsive self-administration of cocaine, as reported in the reference study (∼40% over a period of 5 d; cf. Gannon et al., 2017). Importantly, when considering endophenotype differentiation, the percentage varies significantly: 60.3% for β = 0, 40.3% for β = 0.2, 30.1% for β = 0.4, 36.7% for β = 0.6, 39.3% for β = 0.8, and 51.6% for β = 1 (Fig. 5A,B). This phenotypic differentiation is consistent with well-established data from animal models. For instance, rat strains selectively bred for either high or low voluntary running differ in the likelihood to develop addiction when given free access to cocaine (respectively, ∼35% and ∼60% of each strain develop addiction over a period of 5 d; cf. Smethells et al., 2016). Free access to substances of abuse does not necessarily lead to compulsive behaviors (Piazza et al., 1989; Belin et al., 2011), as addiction varies as a function of factors such as exposure extent, amount of drug delivered, and associated negative effects (Pelloux et al., 2007; Jonkman et al., 2012). Our simulations suggest that endophenotypes with lower chances of addiction are characterized by balanced control modalities. Note that an optimal agent, knowing the environment structure and being able to compute the long-term effects of the drug, will never select drug states (Table 3).
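As a quick arithmetic check of the figures above, the sketch below recomputes the overall mean from the six per-endophenotype percentages. It assumes that each endophenotype contributes equally to the mean, which is consistent with the reported value but is not stated explicitly in the text. The variable names are illustrative.

```python
# Percentage of addicted agents for each value of beta, as reported in the text.
addicted_pct = {0.0: 60.3, 0.2: 40.3, 0.4: 30.1, 0.6: 36.7, 0.8: 39.3, 1.0: 51.6}

# Unweighted mean across endophenotypes (assumes equal numbers of agents).
mean_pct = sum(addicted_pct.values()) / len(addicted_pct)

# Endophenotype with the lowest percentage of addicted agents.
most_resistant_beta = min(addicted_pct, key=addicted_pct.get)

print(f"mean across endophenotypes: {mean_pct:.2f}%")
print(f"lowest percentage at beta = {most_resistant_beta}")
```

The unweighted mean is 258.3 / 6 = 43.05%, matching the reported value. The lowest percentage falls at an intermediate β (0.4), and both extremes (β = 0 and β = 1) exceed every intermediate value.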
Finally, the simulations suggest that the hypothetical treatment targeting model-free control is the most effective, reducing the likelihood to pursue drug-related behaviors for all endophenotypes (Fig. 5A). In contrast, the model-based treatment appears to be less effective for all endophenotypes, with the exception of the purely model-based one (β = 1; Fig. 5B). Under the relapse phase, our data confirm that the simulated treatments significantly differ in their effectiveness across the proposed endophenotypes, also suggesting the treatment targeting model-free control is the most successful in prolonging relapse time (Fig. 5C,D). Relapse time after model-free treatment is mostly similar to the time required to develop addiction behavior before any treatment (Fig. 5C). At the opposite side of the control spectrum, the model-based treatment shows a positive effect only for the purely model-based endophenotype (β = 1). All remaining endophenotypes show relapse times significantly shorter than those recorded for the first development of addiction (Fig. 5D).
Discussion
Individual differences in stress and anxiety responses (Dilleen et al., 2012; Jimenez and Grant, 2017), social dominance (Morgan et al., 2002; Covington and Miczek, 2005), aggressive temperament (McClintick and Grant, 2016), preference for saccharin (Carroll et al., 2002), sensation or novelty seeking (Suto et al., 2001; Nadal et al., 2002; Belin et al., 2011; Flagel et al., 2014), impulsivity (Perry and Carroll, 2008; Verdejo-García et al., 2008; Dalley et al., 2011), and sensitivity to rewards (Belcher et al., 2014) have all been found in both animal models and clinical studies in humans to be associated with addiction vulnerabilities, and in particular with the likelihood to develop and maintain addiction, or to resist treatment (Piazza et al., 1989; Belin et al., 2016; Everitt and Robbins, 2016). However, investigations into the mechanisms underlying this phenotypic differentiation in addiction have so far revealed few neural or computational candidates, which are found to be associated with diverse and dissociable behavioral traits. An important example is the endophenotypic differentiation reported in the expression and reactivity of striatal D2 dopaminergic receptors, which is found to be negatively correlated with the traits of impulsivity (Dalley et al., 2007), social dominance (Morgan et al., 2002), and sensitivity to rewards (Belcher et al., 2014) and nonlinearly correlated with novelty preference (Flagel et al., 2014). The overlap of this endophenotypic trait across multiple, noncoexisting phenotypes associated with addiction vulnerabilities suggests that other neural or computational mechanisms have yet to be identified to account for the reported variety in behavioral traits.
Here, we have presented a neural field model, augmented by an RL model, to expand on existing neuropsychological and computational accounts of addiction. Our models propose a theoretical investigation into the interaction among cortico-striatal circuits or behavioral control modalities, and the effects this interaction has on addiction development and treatment response. As described in classic models (Redish, 2004, 2008; Dayan, 2009), we have assumed that overevaluation of a drug leads to aberrant dopamine release and associated overlearning in multiple DA targets (Volkow and Morales, 2015; Koob and Volkow, 2016). In the neural field model, this mechanism results in the dysregulation of the circuit gain and associated dynamics of both ventral and dorsal cortico-striatal circuits (Fiore et al., 2014; Hauser et al., 2016). In the integrated model-based and model-free RL model, sequential choice behavior is confounded by the presence of a high immediate reward (drug state). This leads the agent to misrepresent the negative outcomes following drug consumption, if their distribution across states and time is sufficiently complex to exceed the agent's capability to correctly represent the environment (Doll and Daw, 2016; Sadacca et al., 2016). Both models jointly indicate that the balance between neural circuits or behavioral control modalities is a candidate neurocomputational mechanism characterizing endophenotypes in addiction. The neural and RL models converge in suggesting that individuals characterized by balanced behavioral control between reward-seeking or planning (ventral circuit/model-based) and reactive or habitual responses (dorsal circuit/model-free) would have a reduced chance to develop addiction and decreased severity of symptoms if developing addiction.
We propose that this neurocomputational mechanism may be interacting with other known endophenotypic differentiations, such as alterations of D2 receptors in the striatum (Morgan et al., 2002; Nader and Czoty, 2005; Dalley et al., 2007; Volkow et al., 2007; Belcher et al., 2014; Flagel et al., 2014) or differences in learning rates (Gutkin et al., 2006; Piray et al., 2010), to generate the multifaceted behavioral traits that have been reported in literature to be associated with addiction vulnerabilities.
In our neural model, ventral and dorsal circuits are mostly in phase in their selections under the pre-drug stage, exhibiting synchronous transient stability of neural activity and enhancing the overall ability of the system to adapt to changing stimuli (i.e., the two circuits adapt to the input changes at a similar pace and synchronize in their selection). Under the addiction stage, the two circuits are mostly pulled toward the parasitic attractor state associated with drug consumption, and they only occasionally select the competing non-drug stimuli. If only one of the two systems performs a selection outside of the attractor, the difference in selection generates a dissonance or interference. In neural endophenotypes characterized by unbalanced control, this dissonance is resolved by one circuit taking the lead, so that both systems eventually converge on the selection of the dominant circuit. These dynamics result in limited opportunities to generate non-drug-related responses to the external stimuli, as they can only be generated by the dominant circuit. Conversely, in balanced control endophenotypes, if either of the two circuits ignores the drug stimulus and selects a competing option, the resulting dissonance can trigger a state transition that pulls the system out of the parasitic attractor state associated with substance use. The endophenotypes in our simulations vary only in the parameters regulating the balance between circuits, as dopamine-driven learning processes established between cortex and striatum (Eq. 3) do not vary across endophenotypes, resulting in identical habit formation and drug-related biases in the outcome representations.
Thus, our proposed phenotypic differentiation does not interfere with the usual roles ascribed to the ventral and dorsal circuits as, respectively, implicated in the initial reward-seeking phase in addiction (Belin and Everitt, 2008; Willuhn et al., 2012) and the subsequent consolidation of habitual stimulus-response associations (Everitt and Robbins, 2013, 2016). However, our simulated dynamics show that, after addiction is developed, systemic overstability can be reduced or further enhanced, depending on the corticocortical biases between cortico-striatal circuits. In turn, this modulation of system stability can foster or further impair input discrimination and motor response versatility, affecting addiction symptomatology. As a result, our neural model shows phenotypic variability emerging after the reward simulating the drug is presented and addiction has developed, in a gradient of overselection of drug-related actions.
With the RL model, we investigate whether the balance between model-based and model-free modalities would also increase the robustness of the system against the selection of drug states in a more complex environment and in the presence of explicit negative outcomes. Similar to the neural model, a system with balanced control modalities introduces more diversity in action selection during exploration, reducing (yet not cancelling) the chances of developing maladaptive reactive responses. This increased diversity and overall reliability are likely to be induced by a higher redundancy and diversification of the system. While both components may fail, the causes of failures are not necessarily correlated. The model-based system can fail due to its sensitivity to cognitive resources, but it is more efficient in encoding previous experience of the agent. On the other hand, the model-free component is affected by limited exploration, but it is reliable in its selections, which are not affected by the availability of cognitive resources. Consistent with the neural model, differentiations in behaviors among endophenotypes emerge in an inverted-U shape, where unbalanced control systems are the most vulnerable to developing addiction.
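The redundancy argument above can be made concrete with a small worked example. It assumes, purely for illustration, that the combined value estimate is a β-weighted average of the model-based and model-free estimates (with β = 1 purely model-based, as in the Results), and that the two estimates carry independent, zero-mean errors of equal variance. Neither assumption is taken from the simulated RL model, and the function name and unit variances are hypothetical.

```python
def mixed_error_variance(beta, var_mb=1.0, var_mf=1.0):
    """Variance of beta * e_mb + (1 - beta) * e_mf for independent errors.

    e_mb and e_mf are hypothetical zero-mean errors of the model-based and
    model-free value estimates, with variances var_mb and var_mf.
    """
    return beta ** 2 * var_mb + (1.0 - beta) ** 2 * var_mf


# The same beta values used to define endophenotypes in the Results.
betas = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
variance_by_beta = {b: mixed_error_variance(b) for b in betas}

for b, v in variance_by_beta.items():
    print(f"beta = {b:.1f}: relative error variance = {v:.2f}")
```

Under these assumptions, the error variance is smallest for intermediate β and largest at the two extremes, which qualitatively mirrors the greater vulnerability reported for unbalanced agents. The sketch is symmetric by construction, so it does not reproduce the asymmetry between β = 0 and β = 1 seen in the simulations.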
The phenomenon of relapse is more elusive and the two models do not fully converge on this aspect. To investigate this phenomenon, we have adapted the complexity of real-world treatments to the capabilities of our simulated agents and environments, where we can easily manipulate or extinguish consolidated memory, but we cannot engage all other aspects commonly involved in addiction treatment, such as cognitive or emotional functions or developing new behavioral strategies to compete with drug-related habits. Therefore, we implemented two compartmentalized treatments that we consider as ideal reference models that target only a single decision system or circuit. These putatively represent treatments capable of affecting only drug-related emotional/value or habitual/motor associations. In the neural model, balanced dorsal and ventral endophenotypes respond well to both types of simulated treatments. For the unbalanced endophenotypes, however, only the appropriate treatment, targeting the dominant neural circuit, is effective. The simulations in the RL model do not show the same symmetric effects for the two treatments: the model-free treatment is effective for most endophenotypes, whereas the model-based treatment is mostly unsuccessful, with short relapse times across all endophenotypes except the purely model-based one. The latter result is possibly due to the learning process characterizing the model-based component, which is affected by conflicting information as drug use is associated with both positive and negative outcomes, experienced by the agent when entering the drug state under different phases.
It is worth noting that habitual and goal-oriented behaviors have neural representations in the dorsal and ventral cortico-striatal circuits, respectively, but they do not fully overlap with model-free and model-based control modalities in RL (Dolan and Dayan, 2013). Nonetheless, the neural and RL models independently simulate choices among competing options in addiction. Thus, we have been able to test our hypothesis of endophenotypic differentiation under two complementary levels among Marr’s three levels of analysis: the neural implementation and the algorithmic level (Marr and Poggio, 1976). This multilevel modeling approach has often been used in computational psychiatry (Maia and Frank, 2011; Montague et al., 2012; Adams et al., 2016; Hauser et al., 2016; Huys et al., 2016) to highlight model convergence and associate specific neural structure and dynamics with mathematical formalizations of optimal and suboptimal behavior in RL. The convergence of neural and RL models on important predictions also provides more confidence in the reliability of the identified computational mechanisms underlying addiction and the associated characterization of endophenotypes. Specifically, both models indicate that individuals with unbalanced cortico-striatal activity or control modality are at higher risk of developing addiction and of relapsing after any treatment. Thus, independent of phenotypic-specific treatments, our results suggest that individuals with these traits would require a prolonged or more intense treatment, in comparison with balanced endophenotypes. Finally, when considering phenomena on which the two models diverge (e.g., response across endophenotypes to our simulated treatments), our findings still demonstrate that important endophenotypic features might remain undetected in terms of pre-treatment observable behavior.
The models showed that opposite unbalanced agents resulted in similar addictive behaviors and vulnerabilities, but diverged in treatment response, potentially informing the development of precision interventions. Further studies will be required to provide empirical validation of our models. For example, computational analysis of fMRI data can be used to test effective connectivity among cortico-striatal circuits (Friston et al., 2003), in conjunction with cognitive tasks targeting the model-based and model-free control systems.
Acknowledgments
We thank Prof. Karl Friston for his comments and kind suggestions in shaping this manuscript.
Footnotes
The authors declare no competing financial interests.
This work is supported by the Dallas Foundation and a startup grant from the University of Texas at Dallas.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.