A Mechanistic Model for Reward Prediction and Extinction Learning in the Fruit Fly

Visual Abstract


Introduction
Fruit flies can learn to associate an odor stimulus with a positive or negative consequence, e.g., food reward or electric shock punishment. In the training phase flies are typically exposed to two odors (differential conditioning) where one odor [conditioned stimulus minus (CS-)] is perceived alone whereas a second odor [CS plus (CS1)] is presented together with either reward or punishment [unconditioned stimulus (US)]. Once an association has formed between the CS1 and the respective US, the learned anticipation of the US can be observed in a memory test that enforces a binary choice behavior between the CS1 and the CS- (Tempel et al., 1983;Tully and Quinn, 1985). A single learning trial can be sufficient to form a stable memory in the fruit fly (Beck et al., 2000;Krashes and Waddell, 2008;Zhao et al., 2019) and other insect species (see Discussion).
The prediction error theory (Rescorla and Wagner, 1972) describes a basic theoretical concept of classical conditioning. It assumes that the efficacy of learning is determined by the momentary discrepancy (or error) between the expected and the received reinforcement (i.e., reward or punishment). In vertebrates, it has been shown that prediction error coding dopaminergic neurons (DANs) are involved in learning (Schultz et al., 1997;Schultz, 2016). Recent studies give rise to the assumption that DANs could play a similar role in Drosophila melanogaster (Hammer, 1997;Riemensperger et al., 2005;Ichinose et al., 2015;Eichler et al., 2017;Felsenberg et al., 2017Felsenberg et al., , 2018Zhao et al., 2021;Eschbach et al., 2020) and other insects (Terao and Mizunami, 2017), rejuvenating an earlier hypothesis based on experimental observations in the honeybee (Hammer, 1997).
Re-exposing the flies to the CS1 after successful training and in the absence of the US leads to a reduction of the previously learned behavior (Tempel et al., 1983;Schwaerzel et al., 2002;Felsenberg et al., 2017Felsenberg et al., , 2018. This new learning is called extinction learning and has been observed across invertebrate (Eisenhardt and Menzel, 2007;Eisenhardt, 2014) and vertebrate species (Pavlov, 1927;Tully and Quinn, 1985;Myers and Davis, 2002;Bouton, 2004Bouton, , 2017Dudai, 2004). Following prediction error theory, extinction learning is caused by the mismatch between the expected outcome (predicted US) based on the initial learning and the actual outcome (no US). In humans, extinction learning is of high clinical relevance and is applied in the treatment of humans suffering from maladaptive memories (Quirk and Mueller, 2008;Chiamulera et al., 2014;Delamater and Westbrook, 2014;Bouton, 2017;Walsh et al., 2018). The majority of behavioral experiments in invertebrates and vertebrates suggest that extinction protocols, including the applied exposure therapy in humans, does not erase a memory but rather leads to the formation of a parallel but opposing memory, the extinction memory (Bouton, 2004;Eisenhardt and Menzel, 2007). Despite the relevance for clinical therapy, our understanding of the neural circuit mechanisms underlying extinction learning is still in its infancy.
We study a computational neural circuit model of the fruit fly mushroom body (MB) that captures the most recent anatomic and physiological facts. The model proposes detailed mechanisms for associative learning, extinction learning, prediction error coding, and singletrial learning. We compare the model outcomes quantitatively to the results of recent behavioral and physiological studies and we derive novel experimental predictions by mimicking neurogenetic manipulations of relevant neuron groups.

Circuit architecture
Our circuit model for olfactory coding and olfactory memory formation ( Fig. 1) consists of three neuron layers [projection neurons (PNs), Kenyon cells (KCs), MB output neurons (MBONs)] representing the three major stages of the olfactory pathway in Drosophila and two reinforcement mediating DANs representing the PAM and PPL1 cluster. Each neuron in the circuit can assume an activation rate in the range of 0-1.
Olfactory model input was simulated through the activation of 50 out of the total 100 PNs. This matches experimental observations of ;40À60% of PNs being activated by a single odor stimulus (Krofczik et al., 2009;Brill et al., 2013;Wilson, 2013) and follows the model of Peng and Chittka (2017). Each PN is activated with a random rate drawn from a uniform distribution in the range between 0.2 and 0.8. To enhance variability between the networks that were set up, PN input patterns were further multiplied by a factor that was randomly distributed between 0.8 and 1. PNs are connected to 2000 KCs via the weight matrix W1 where each PN is connected to 5-15 KCs, each connection has a fixed synaptic weight of 0.2. Activation of the KC vector in the next layer resulted from the matrix product of the PN population vector and the respective weight matrix W1 according to Out of the total 2000 KCs only the subpopulation of 100 KCs (5%) with the highest activation rate kept its activation. All other KCs are set to zero to enforce population sparseness (Peng and Chittka, 2017). In a next step, KCs are fully connected to the four MBONs via the weight matrix W2, with all synaptic weights initially set to 0.01. The excitatory input to each MBON was calculated as the matrix products We further included lateral connectivity between the MBONs. The M6 MBON receives inhibitory input from MVP2, whereas V2 gets inhibited by MV2. The respective inhibitory inputs are formalized according to The activation rate of M6 and MV2 results from a summation of inhibitory and excitatory input as whereas for MVP2 and MV2, the activation rate is solely determined by the excitatory input. The PAM and PPL1 DAN receive excitatory input by the M6 and V2 neurons, respectively. Additionally, reinforcing stimuli have an effect on both DANs. A rewarding US (US = 11, positive reinforcer) leads to an excitatory input to the PAM (Eq. 5a). At the same time, excitatory input from the V2 to the PPL1 is partially suppressed (represented by the factor r = 0.8 in Eq. 5b). Conversely, a punishing US (US = À1, negative reinforcer) results in excitation of the PPL1 (Eq. 5b) and in the suppression of excitatory input from M6 to the Figure 1. Circuit model. The olfactory pathway comprises three feedforward layers. Olfactory input activates an odor-specific combination of PNs. The connectivity matrix from PNs to the large population of KCs is divergent-convergent and sparse. The inhibitory feedback mechanism ensures population sparseness of 5% active cells in the KC layer. KCs are fully connected with four MBONs in the output layer. A positive or negative reinforcement stimulus directly excites the dopaminergic PAM or PPL1 neuron, respectively. Lateral inhibition between the MBONs and excitatory feedback from MBONs to DANs is crucial for the reward prediction and extinction mechanisms. Activation of the MVP2 output neuron mediates approach behavior, activation of MV2 mediates avoidance behavior. Behavioral preference toward an odor is calculated as the imbalance between the activations of MVP2 and MV2. PAM (Eq. 5a). The total DAN input was thus computed as We fixed the US induced excitation to R = 0.3 for all our experiments. However, generally the parameter R allows for a modeling of variable reward magnitudes linked to the US such as, e.g., different levels of sugar concentration contained in a rewarding US.
The output activation rate of each DAN encoding the prediction error is calculated with a sigmoid transfer function Plasticity rule at KC::MBON synapses We implemented synaptic plasticity at the KC::MBON synapses. In the initial state of the model, the weights of all KC:: MBON synapses are set to the same fixed value of 0.01. Each of the synaptic weights in W2 is subject to synaptic plasticity. At the end of a given trial t and for any synapse KC i ::MBON j a change of the synaptic weight w ij occurs if both, the presynaptic KC and the respective DAN were active in trial t. If both conditions are met, the synaptic weight is reduced according to the two-factor learning rule where w app refers to the weight of a synapse onto M6 or MV2 and w av refers to the weight of a synapse onto MVP2 or V2. Note that the weight changes are proportional to the activation rate of the respective DAN. We fixed the learning rate as d = 0.0045 for all our experiments.

Experimental protocols
We subjected our model to a set of conditioning and extinction protocols. Further, it was operated and evaluated in a trial-resolved fashion where, in each trial, the excitatory and inhibitory synaptic input to each of the neurons and its output activation rate was computed. Within-trial neuronal dynamics were neglected.

Classical conditioning
The training procedure derives from a group assay developed by Tully and Quinn (1985) that is often used to study olfactory learning in flies. A group of flies are trained in a Tully-machine with a training arm and two testing arms. In the training arm, the animals are typically exposed to the CS1 in combination with either sugar (reward) or a train of electric shocks (punishment). In Felsenberg et al. (2018) the CS1 is presented for 1 min and combined with 12 electric shocks. Subsequently, animals are exposed to the CS-without reinforcer. After the training, the flies are transferred to the testing chamber in which they can choose between a CS1 and a CS-perfused arm.
Our classical conditioning protocol also consisted of a training phase and a subsequent memory retention test ( Fig.  2A). An odor pattern (CS1) was paired with a positive or negative reinforcement (US) during each training trial. Subsequently, a second odor pattern (CS-) was presented without reinforcement. For our initial model analysis, we varied the number of training trials in the range of n = 1, ..., 24 to quantify the associative strength as a function of n (Fig. 2). Then, we fixed the number of training trials to n = 12 in our standard training procedure. The trial-based classical conditioning protocol is a standard procedure in other insect models for learning and memory, notably the honeybee (Bitterman et al., 1983;Pamir et al., 2011). Unlike in the experimental conditioning paradigm, paired and unpaired odors were presented alternately when the number of trials was .1. However, we expect no difference between continuous and alternating odor presentation, since CS-presentation does not induce synaptic plasticity. During retention, we presented the CS1 and CS-stimulus once. In contrast to the experimental paradigm, simultaneous presentation of both odors was not possible. However, effects on the behavioral outcome could be excluded by switching off learning during the retention test.

Extinction learning
The extinction protocol included an additional reactivation phase after training and before memory extinction ( Figure 3A). During reactivation, the CS1 odor was presented alone, i.e., without reinforcement, during 12 reactivation trials. This procedure again mimics the experimental protocol used by Felsenberg et al. (2018), in which the authors presented the CS1 without shocks for the same duration (1 min) as in the training.

Neurogenetic manipulation experiments
To investigate the role of individual neurons in the memory extinction process we selectively suppressed their activation by setting their activation rate to zero during the reactivation phase. Thereby, we mimic neurogenetic manipulation experiments that suppress the activation of specific neurons. We specifically compare our model results to the experimental results performed in Felsenberg et al. (2018) based on the neurogenetic tool shibire ts1 and we generate experimentally testable hypotheses in novel model experiments.

Evaluation of model performance and quantitative model predictions
We evaluated model outcome by different measures that allow for a quantitative comparison with the outcome of animal experiments, both at the behavioral and at the physiological level. Applying these measures in untested experiments allowed us to formulate experimental predictions for novel experiments.

Behavioral output
To this end, we computed two quantities that can be interpreted as measures of approach or avoidance behavior. We used the activation rates of these MVP2 and MV2 to calculate the preference index as The innate preference index of our model is zero by construction.
In behavioral extinction learning experiments (Felsenberg et al., 2017(Felsenberg et al., , 2018, the number of animals that chose the CS-odor was then subtracted from the number of animals that chose the CS1, and the result was normalized by the total group size to compute the performance index (Tully and Quinn, 1985). In analogy, we calculated a model performance index computing the difference between the preference indices for CS1 and CS-as

Dendritic input
We quantified activation rates of single neurons. Activation rates were measured in both DANs and MBONs. DAN activation rates were obtained during training trials, whereas the MBON activation rates were measured during the test trials (Fig. 4). Further, we compared the model to a recent study in which dendritic MBON activity was measured (Fig. 5;Felsenberg et al., 2018). KC:: MBON synapses in M6 populate the dendritic tree while the MVP2::M6 are located proximal to the dendritic root (Felsenberg et al., 2018). Measuring the calcium activation across the dendritic field can thus be interpreted as the . Differential conditioning leads to CS1-specific odor preference. A, Experimental protocols for simulation of associative conditioning experiments. The model was trained in a classical conditioning paradigm, in which a conditioned odor stimulus (CS1) was paired with a reward (green) or punishment (red). Subsequently, a second unconditioned odor stimulus (CS-) was presented without reinforcement. The standard protocol comprises 12 training trials. During the retention test, both odor stimuli were presented alone without reinforcement. Preference index and performance index in the model were computed after each trial and during the retention test. B, Dynamics of the odor preference index across 24 appetitive (top) and aversive (bottom) learning trials, respectively averaged across 10 networks. C, left, Combinatorial response patterns in the PN population. Each odor stimulus activates 50% of all PNs. Similarities between the CS1 reference odor and five novel odors are defined by their overlap in the PN activation pattern (x-axis). Right, Activation patterns across the subpopulation of 100 KCs (5%) activated by the CS1 odor. Pattern overlap in the KC population reduces rapidly with decreasing odor similarity as expressed in the percentage of PN pattern overlap (x-axis). D, Generalization to different odors after associative conditioning to the CS1 odor. The model shows a significant CS1 approach (green) or avoidance (red) after 12 training trials as expressed in the preference index (n = 15), which diminishes rapidly with decreasing odor similarity. Boxplots show the median and the lower and upper quartiles, whiskers indicate 1.5 times interquartile range, outliers are marked with 1 symbol.
Research Article: New Research overall excitatory KC input to the MBON. In analogy to the experimental approaches, not only the total synaptic input, but also the sum of excitatory synaptic input from KCs was measured.

Performance at the group level
Each new initialization of a network involved the random generation of the odor induced PN activation rates (ranging between 0.8 and 1, evenly distributed) and a random number of PN::KC connections (in the range 5-15, evenly distributed) for each KC to establish variability across individuals.
To test statistical significance of group differences in the performance index, the Wilcoxon rank-sum test was used. To test for differences of MBON activation rates in the same group before and after extinction learning, the Wilcoxon signed-rank test was used. Model results were compared with the experimental results obtained by Felsenberg et al. (2017Felsenberg et al. ( , 2018. To extract single data points from the experimental studies, the GRABIT tool by MathWorks was applied. Model simulation and statistical analyses were performed with MATLAB_R2018b (MathWorks Inc). The full code for the (MB model is available on the GitHub account of the lab: https://github.com/ nawrotlab.

Circuit model of the Drosophila MB
We implemented a neural network model of the olfactory memory circuit of D. melanogaster ( Fig. 1) where individual neurons exhibit trial-resolved activation dynamics (see Materials and Methods). Our network model integrates experimentally confirmed connections in the adult fruit fly. It involves three feed-forward layers of the antennal lobe PNs, the MB KCs, and four individual MBONs with specific lateral inhibitory connections. Reinforcing stimuli (US) of rewarding or punishing nature are mediated by two DANs. Each DAN receives specific feedback input from a single excitatory MBON and exerts a recurrent neuromodulatory effect on the plastic KC::MBON synapses.
Stimulation with a particular odor is modeled as one specific input pattern activating 50 out of total 100 PNs, each with a random activation rate (see Materials and Methods). Similarity between different odors was defined as the percentage of overlap between the PN activation patterns (Fig. 2C). This establishes a dense combinatorial odor code in the PN layer as reported experimentally for fruit flies (Wilson et al., 2004;Olsen et al., 2010) and other species, notably the honeybee (Joerges et al., 1997;Krofczik et al., 2009), the locust (Mazor and Laurent, 2005;Broome et al., 2006), and the moth (Namiki and Kanzaki, 2008).
The connectivity between the PNs and KCs is divergent-convergent and random (Caron et al., 2013) where each of the 2.000 KCs (Aso et al., 2009) connects to average 10 PNs (uniformly distributed, range of 5-15), matching the experimentally estimated numbers for Drosophila (Leiss et al., 2009;Turner et al., 2008) and establishing a sparse connectivity for each of the postsynaptic KCs. In a second step, a threshold criterion was implemented such that only the 5% of KCs with the highest activation rates retain their activation (Fig. 2C) while the activation rates of all other KCs was set to zero (Peng and Chittka, 2017). This enforces population sparse coding in the KC layer (Kloppenburg and Nawrot, 2014) as reported in physiological experiments in the fruit fly where, on average, ;5% of all KCs were activated by a single odor stimulus (Turner et al., 2008;Honegger et al., 2011;Lin et al., 2014).
In the MB output layer, we implemented four out of the 34 anatomically identified MBONs (Ito et al., 1998;Tanaka et al., 2008;Séjourné et al., 2011;Aso et al., 2014a). These are MV2 . The neuron cluster V2, consisting of five neurons, was modeled as a single neuron for simplicity. M6, MV2, MVP2, and V2 have been previously shown to be involved in odor valence coding with M6 and MV2 mediating odor driven avoidance behavior, and MVP2 and V2 promoting approach behavior toward an olfactory stimulus (Séjourné et al., 2011;Aso et al., 2014b;Bouzaiane et al., 2015;Perisse et al., 2016;Ueoka et al., 2017). All KCs connect to each of the four MBONs (full connectivity; Fig. 1). Our model includes two inhibitory lateral connections among MBONs. The inhibitory synapses between MVP2 and M6 (Perisse et al., 2016;Felsenberg et al., 2018) have been suggested to be functionally relevant for aversive memory extinction (Felsenberg et al., 2018). We additionally assume a symmetric lateral inhibitory connection from MV2 to the V2 neuron. It has been shown that the MV2 neuron projects onto MBON-a2sc and MBON-a2p3p, which are part of the V2 cluster and it was hypothesized that the glutamatergic MV2 acts inhibitory on both neurons (Aso et al., 2014a,b), as has been shown for glutamatergic neurons in the AL (Liu and Wilson, 2013).

Reinforcing pathways and synaptic plasticity
In our circuit model, the presence of the appetitive or the aversive US are signaled by two distinct neuromodulatory DANs, the PPL1 and PAM neuron, respectively (Fig.  1). The single PPL1 neuron is representative of the PPL1 neuron cluster activated by aversive sensory stimuli such as electric shock (Claridge-Chang et al., 2009;Mao and Davis, 2009;Aso et al., 2010Aso et al., , 2012, the PAM neuron represents the PAM cluster that is activated by appetitive sensory stimuli such as sucrose (Burke et al., 2012;Liu et al., 2012;Waddell, 2013;Yamagata et al., 2015). Both neurons receive additional excitatory input from either the M6 or the V2 MBON Eichler et al., 2017;Felsenberg et al., 2017Felsenberg et al., , 2018Eschbach et al., 2020).
The presence of a reinforcing signal (US) has differential effects on the DAN activation. A rewarding US leads to an excitation of PAM and can reduce excitation of PPL1 input (Eq. 5a). Vice versa, a punishing reinforcer excites PPL1 and at the same time can modulate excitation of PAM input (Eq. 5b). This modulatory effect in our model is based on an experimental study revealing that a negative reinforcement inhibits PAM-g 4 and PAM-g 5 neurons, whereas sugar feeding inhibits PPL1-g 2 and PAM-g 3 (Cohn et al., 2015). Plasticity in our model exclusively resides in the KC:: MBON synapses, reflecting the prevalent data-based hypothesis (see Discussion). These synapses are subject to plasticity according to the learning rule in Equations 7a, 7b. In the initial state of the model, the weights of all KC:: MBON synapses are set to the same fixed value. In a given experimental trial, and for any synapse KC::MV2 or KC::M6, plasticity occurs when activation of the presynaptic KC coincides with activation of the reward-mediating PAM neuron. Likewise, any KC::V2 or KC::MVP2 synapse undergoes changes when the presynaptic KC and the punishment-mediating PPL1 neuron are active at the same time. The reduction of a synaptic weight is proportional to the DAN activation rate (for details, see Materials and Methods). This establishes two distinct parallel but interconnected neuromodulatory pathways, each involving feedback from MBONs to the DANs that, in turn, can modulate KC::MBON synapses (Fig. 1).

Appetitive and aversive conditioning establish a behavioral odor preference
In a first set of experiments, we subjected our model to a classical conditioning protocol. This protocol ( Fig. 2A) mimics standard training procedures in the fruit fly (see Materials and Methods). In each trial the trained odor, CS1, was paired with either reward or punishment (US). Subsequently a second odor stimulus was presented without reinforcement (CS-).
After a predefined number of training trials, we performed a memory retention test by presenting the model with the CS1 alone (no reinforcement) and subsequently with the unpaired odor (CS). We computed a preference index (Eq. 8) based on the activation rates of approach and avoidance mediating MBONs for both odors, the CS1 and the CS-(see Materials and Methods). The model performance index (Eq. 9) mimics comparison with the behavioral performance index computed from experiments in the fruit fly.
Olfactory memories are established as a skew in the MBON network: aversive learning reduces CS1 driven responses in approach MBONs (MVP2-MBON and V2 cluster MBONs) resulting in a skew toward avoidance. In contrast, appetitive learning skews the network toward approach by reducing CS1 mediated input to avoidance coding M4 and M6 MBONs (Aso et al., 2014b;Perisse et al., 2016;Zhang et al., 2019).
We explored the preference index during the retention test for a varying number of training trials (Fig. 2B). In the naive state, i.e., before the first training trial, the model did not reveal a preference. This is because of the balanced synaptic weight initialization of the model. A single appetitive training trial yielded a strong preference toward the rewarded odor, the CS1. With each additional training trial, the preference index further incremented by a small amount and thus additional training led to only a gradual increase of the preference index saturating after ;10-15 trials. Aversive conditioning revealed a negative preference index reaching a similar absolute level after a single training trial (Fig. 2B, lower panel). Both, single-trial learning and saturation of behavioral performance within few trials matches the experimental observations during associative learning in insects (see Discussion).
For all subsequent experiments, we chose 12 training trials, both for the training and the reactivation phase. This number mimics the standard experimental protocol for aversive conditioning in fruit flies, in which a group of animals is exposed to 12 subsequent electric shocks while constantly being exposed to the CS1 (Tully and Quinn, 1985;Scheunemann et al., 2013;Felsenberg et al., 2018).

Generalization to novel odors
We next analyzed how the learned preference to the CS1 odor generalizes to novel odors of varying similarity with the CS1 odor by comparing the preference indices toward CS1 and toward a novel odor during the retention test (Fig. 2D). We designed the pattern of activated PNs for each novel odor such that it has a defined overlap with the CS1 pattern, ranging from zero to 80% overlap (see Materials and Methods). Our results in Figure 2D predict that generalization to a novel odor is rather low even for the highest odor similarity and decays with decreasing odor similarity, effectively reaching zero generalization for odors that share 40% or less activated PNs with the CS1 odor. For the subsequent extinction learning experiments, we used CS-odors that had 60% pattern overlap with the CS1 odor.

Extinction learning significantly reduces the conditioned odor response
We analyzed extinction learning after appetitive and aversive conditioning according to the stimulation protocol in Figure 3A. The initial training phase was followed by a reactivation phase during which the model is repeatedly presented with the CS1 odor alone (no reinforcement). This matches the experimental protocols for appetitive (Felsenberg et al., 2017) and aversive (Felsenberg et al., 2018) conditioning and subsequent extinction learning and allows for a direct comparison of our simulation results with the experimental results, both at the behavioral and the physiological level (compare Extended Data Fig. 3-1).
Repeated reactivation of the CS1 odor alone after either appetitive or aversive conditioning resulted in memory extinction (Fig. 3B,C), i.e., in a significant reduction of the learned CS1 approach or avoidance. In addition, generalization to the CS-odor was abolished (Fig. 3B,C). The observation that learning is induced when the expected reward or punishment is omitted is in line with the idea of prediction error dependent learning (see Discussion). For a quantitative comparison with the behavioral results of the binary memory test in Felsenberg et al. (2017), we computed the performance index of the model by subtracting the preference index for CS-from the preference index for CS1 (see Materials and Methods). The appetitive conditioning protocol in the model and in vivo yielded similar performance indices (model: 0.30 6 0.03 mean 6 SD, experiment: 0.29 6 0.05 mean 6 SEM). Memory extinction led to a significantly reduced approach performance (0.20 6 0.02; Fig. 2B). However, the effect of memory extinction in the behavioral experiment was considerably stronger, effectively erasing the initial behavioral memory expression (0.01 6 0.03; Felsenberg et al., 2017). The aversive conditioning protocol led to a negative performance index of À0.29 6 0.04 (mean 6 SD) for the model, which in absolute terms was smaller than the experimental value of À0.39 6 0.04 (mean 6 SEM; Felsenberg et al., 2018). The reactivation of the CS1 odor after the aversive conditioning led to a significantly reduced avoidance behavior (À0.20 6 0.02), very similar to the experimental results (À0.20 6 0.03). A detailed quantitative comparison of experimental and model results is provided in Extended Data Extended Data Fig. 3-1.

Memory is rapidly established within a single trial
As a next step, we investigated the dynamics of plasticity during the course of learning and extinction. To this end, we monitored the activation rates of DANs and MBONs and the gross synaptic input to the MBONs across trials during single network simulations for appetitive (Fig. 4A,B) and aversive (Fig. 4C,D) conditioning and extinction.
In both, appetitive and aversive conditioning, the first pairing of CS1 and US induced a strong response in either the PAM or PPL1, respectively. This strong neuromodulatory signal resulted in a significant reduction of the respective KC::MBON synapses and led to a strongly reduced KC::MBON input to either MV2/M6 or MVP2/V2 during the second training trial. As a consequence, feedback excitation from M6 to PAM (V2 to PPL1) in appetitive (aversive) learning is abolished from the second trial onward, leading to further synaptic weight modulations that are weak compared with those in the very first training trial. The evolution of plasticity is reflected in the model performance index, indicating a switch-like increase after the first training trial and moderate but steady increase after subsequent trials. Saturation of the learning effect  Figure 2A. In the extinction protocol, the training phase is followed by an odor reactivation phase, in which the CS1 was presented alone for 12 consecutive trials before performing the retention test. B, The appetitive conditioning protocol (light gray) induced a strong and significant preference for the CS1 odor and a weak and significant preference for the CS-odor. Extinction learning partly abolished the CS1 preference during the retention test (dark gray). The preference index for the CS1 odor was significantly lower than after initial appetitive conditioning, for the CS-odor no more significant preference was observed after extinction learning. C, Symmetric results were observed for aversive conditioning. Again, extinction learning significantly reduced the learned preference for the CS1 odor. Boxplots show the median and the lower and upper quartiles, whiskers indicate 1.5 times interquartile range, outliers are marked with 1 symbol. D, E, The performance indices were significantly reduced after extinction compared with the initial appetitive and aversive memories. Data are presented mean 6 SD; ***p , 0.001; n = 15 independent models. For preference and performance indices of the model tuned toward asymmetric appetitive and aversive learning pathways, see Extended Data Figure 3-2. becomes visible only after ;10-12 trials (compare Fig.  2B). In conclusion, the model learns within a single trial and more training leads to a stronger valence signal as encoded in the imbalance of MV2 and MVP2 activation.
Extinction learning follows a different and slower dynamics in our network model. While already a single presentation of the CS1 odor without reinforcer leads to the induction of the extinction memory and a significant reduction of the performance index (Fig. 4), we observe an only gradual extinction that saturates across the 12 extinction trials used here. No complete extinction is observed in our model.

Associative learning and extinction learning establish two separate memory traces
We tested the hypothesis that associative learning and subsequent extinction learning form two parallel and distinct memories at the physiological level. To this end, we mimicked experiments in Felsenberg et al. (2018), in which the authors measured KC input to individual MBONs by means of in vivo calcium-imaging from the dendritic field.
We first consider the case of aversive conditioning and subsequent extinction learning. After initial conditioning, the CS1-induced synaptic input to the MVP2 and V2 neurons is significantly lower than the CS--induced input (Fig. 5C). The synaptic input to the M6 and MV2 neurons, however, is similar for both odors (Fig. 5D). The extinction protocol, i.e., the subsequent reactivation with CS1 alone (Fig. 3A), did not alter KC synaptic input to MVP2 and V2 (Fig. 5C). In contrast, CS1-induced synaptic input to M6 and MV2 was significantly reduced after extinction and in comparison to the CS--induced input (Fig. 5D). These results are fully consistent with the in vivo calcium-imaging results by Felsenberg and colleagues (Felsenberg et al., 2018; compare Extended Data Fig. 3-1) who observed a significantly reduced dendritic calcium activation in MVP2 in response to the CS1 stimulus after aversive conditioning, while there was no significant change in M6 activation (Felsenberg et al., 2018;see their Fig. 3). After memory extinction, the decreased CS1 response in MVP2 remained while additional plasticity was observed as CS1specific reduction in calcium levels in M6.
Our circuit model predicts a symmetric network behavior for the extinction of an appetitive memory. After appetitive conditioning, the CS1-induced excitatory input was lower than the CS--mediated input to the avoidance-mediating M6 and MV2 (Fig. 5B). However, the input to the approach mediating MVP2 and MV2 remained unchanged (Fig. 5A). Reactivation of the memory by stimulation with the CS1 odor alone led to a reduction of the excitatory dendritic input into MVP2 and V2 (Fig. 5A). As a result, the CS1 preference and performance indices acquired during initial conditioning were significantly reduced after memory extinction (Fig. 3B,D).

Distinct pathways are crucial for appetitive and aversive memory extinction
To validate the idea of recurrent feedback as a teaching signal in our model we tested which neurons are necessary for extinction learning. To this end, we mimicked shibire ts1 experiments performed by Felsenberg et al. (2017Felsenberg et al. ( , 2018, who selectively suppressed the activity of individual neurons or neuron clusters during CS1 odor reactivation (see Materials and Methods; Fig. 6).
We first tested extinction of appetitive memory. Selective blocking of PPL1-cluster or V2-cluster during the reactivation phase abolished the extinction of an appetitive memory. In contrast, blocking of PAM had no effect on extinction (Fig. 6A). These findings agree with the experimental results (Extended Data Fig. 3-1) in Felsenberg et al. (2017). Conversely, in the extinction of aversive memory, Felsenberg et al. (2018) showed that the behavioral effect of extinguishing the aversive memory was significantly diminished by either blocking the PAM cluster or the M6 neurons. However, blocking the vesicle release machinery of the PPL1 cluster had no effect on aversive memory extinction (Extended Data Fig. 3-1). Our model showed the same qualitative result where blocking of either the M6 or the PAM but not of the PPL1 neuron abolished the extinction of an aversive memory (Fig. 6B).
To formulate novel predictions for future experiments we further analyzed our model by silencing other MBONs during the reactivation phase ( Fig. 6; Extended Data Fig. 3-1). Deactivating V2 during the reactivation phase did not significantly alter the aversive extinction memory. Conversely, a blockage of M6 during the extinction process had no effect on the appetitive extinction memory performance either. Blocking MVP2 or MV2 during odor reactivation had no effect on both, appetitive and aversive memory extinction. Thus, our model predicts that a single MBON, V2, or M6, is exclusively involved in the formation of the appetitive or aversive extinction memory, respectively.

KC activity is partly required for extinction learning
We additionally investigated the role of KCs in extinction learning in our model. To this end, we applied the shibire ts1 protocol to KCs. When we blocked all KCs during the training phase, neither appetitive nor extinction memory could be established. However, when we randomly chose and blocked 50% of all KCs during appetitive and aversive extinction training, we did not observe an effect on extinction memory acquisition (Fig. 6). Thus, partially blocking KC activity during odor reactivation might not or only mildly interfere with extinction learning while an almost complete block of all KCs should prevent extinction learning in the fly (Extended Data Fig. 6-1; see also Discussion).

Discussion
In the present study, we established a circuit model of the fruit fly based on confirmed anatomy and physiology. It can explain formation and extinction of an olfactory memory, and single-trial learning. We discuss the implications, limitations and predictions of our model.

Prediction error coding
The prediction error theory hypothesizes that learning takes place if there is an unexpected change in the valence of a stimulus (Rescorla and Wagner, 1972). The model performance index in Figure 4 computed after each training trial mimics behavioral performance across trials and fits the prediction error theory and experimental studies. The behavioral learning effect is reduced across trials and the learning curve saturates with increased training. The results match the well-known saturation in the conditioned response (CR) behavior in the honeybee (Pamir et al., 2011). This effect, typically observed across a group of animals, has been formalized in the Rescorla-Wagner model (Rescorla and Wagner, 1972). A recent rate-based model of the fly MB (Bennett et al., 2021) assumed that connections from MBONs to DANs are crucial for asymptotic learning based on reward prediction. Our model proposes that prediction error coding in D. melanogaster is achieved through a network mechanism supported by the specific circuit motif sketched in Figure 1. It is suggested that DANs receive relevant information not only from sensory neurons directly but also via odor activated MBONs (Felsenberg et al., 2017(Felsenberg et al., , 2018. It was implemented in the model through excitatory connections from laterally connected MBONs to DANs. Initial conditioning of the model leads to activation of the PAM (PPL1) DAN and therefore induces downregulation of the aversive (appetitive) MBONs. This learning effect declines after the first learning trial, since MBON . Neuron activation rates and behavioral performance across trials. A, Reward value (US, top) and DAN activation rates (bottom) across 12 initial conditioning trials (left) and 12 extinction trials (right). Trial zero indicates neuron activation rates in the naive model before learning. Reward presentation resulted in a prominent peak activation of the PAM during the very first training trial, i.e., when the actual reward strongly deviates from the predicted reward. Omission of the reward during the first extinction trial (right) resulted in a step-like reduction of PAM activation the simultaneous increase in PPL1 activation, which is crucial for extinction learning. B, Summed synaptic KC input to MBONs (top), MBON activation rates (middle), and model preference index (bottom) for each simulated test trial that followed the respective training trial (see Materials and Methods). The strong PAM activation during the first conditioning trial resulted in a strong change of the synaptic weights between KCs and MV2 and M6, and consequently in a step-like reduction of the synaptic input and the output activation rates. Consequently, the model preference index represents the imbalance between M6 and MV2 activation rates showed a step-like increase predicting a switch-like expression of a CR behavior after only a single conditioning trial. Additional training led to a further gradual reduction of gross synaptic input to and activation of MV2 and M6, paralleled by the gradual increase of the preference index. Extinction learning led to a gradual reduction of synaptic weights in the KC::MVP2 and KC::V2 pathway. This reduces the difference between M6 and MV2 activation and leads to a gradual extinction of the preference index. C, D, Same as A, B for aversive conditioning (left) and subsequent extinction (right). Simulation results are shown for a single network. This was initiated identically before appetitive (A, B) and aversive (C, D) conditioning to enforce identical initial conditions stressing the symmetric mechanism of reward prediction.
activity is not sufficient to activate DANs after being downregulated.

Single-trial learning
Learning within a single trial is a fundamental ability of vertebrates (Irwin et al., 1968;Cook and Fagot, 2009) and invertebrates, and the underlying mechanisms are at least partially conserved across phyla. In insects, single-trial learning has been intensely studied in classical and operant olfactory, tactile, and visual conditioning of the honeybee Apis mellifera (Menzel et al., 1974;Smith, 1991;Sandoz et al., 1995;Menzel, 1999;Pamir et al., 2014;Villar et al., 2020). Appetitive olfactory conditioning of the proboscis extension response allows the observation of the all-or-none CR behavior during successive training trials where the onset of US presentation is delayed with respect to the onset of the CS odor. Typically, 40-60% of bees show a CR after a single pairing of odor with reward (Pamir et al., 2014) and a single learning trial is sufficient to establish short-term and long-term memory, which can be recalled up to 3 d after training (Smith, 1991;Menzel, 1999;Pamir et al., 2014;Villar et al., 2020). Additional training trials lead to a saturation after approximately three to four trials with respect to the fraction of animals that express a CR.
In the fruit fly, single-trial learning has been established more recently. Aversive conditioning of adult flies with electric shock punishment showed that pairing of a single odor presentation (of 10-60 s in duration) with a single electric shock induces short term memory (Beck et al., 2000;Scheunemann et al., 2013) and long-term memory, retrievable for up to 14 d if the experimental context is kept strictly constant for training and memory test (Zhao et al., 2019). In appetitive conditioning, a single session of odor-reward pairing could establish long-term memory (Krashes and Waddell, 2008;Colomb et al., 2009). Weiglein et al. (2019) could show that also in the fruit fly larva a single session of appetitive conditioning of 5 min was sufficient to establish a short-term memory where the performance index in the memory test increased with increasing duration (in the order of minutes) of the training session.
In our circuit model, the very first training trial induces a switch-like change in the network response and in the behavioral memory expression. This parallels the switch-like behavioral dynamics observed in learner bees (Pamir et al., 2011) and provides an explanation for the single-trial induction of appetitive and aversive memories in the fruit fly (Krashes and Waddell, 2008;Colomb et al., 2009;Zhao et al., 2019). In contrast, during extinction learning we propose that an extinction memory builds up gradually in the opposing pathway.

Extinction learning
Experimental studies in the fruit fly have shown that extinction learning establishes distinct and opposing memory traces. The extinction of reward memory requires punishment coding dopamine neurons whereas extinction of aversive memory is mediated by reward coding neurons (Felsenberg et al., 2017(Felsenberg et al., , 2018. Recent findings in mice (Salinas-Hernández et al., 2018) and rats (Luo et al., 2018) suggest that this principle for extinction learning is conserved in mammals. The authors showed that DANs from the ventral tegmental area associated with reward signaling are required to extinguish fear memory.
The importance of MBON::DAN feedback to perform complex learning tasks such as extinction has been formalized recently in a rate-based model of the Drosophila larva (Eschbach et al., 2020). In our adult fly model, the MBON::MBON and MBON::DAN connections for the extinction of aversive and appetitive memories are strictly symmetrical. During extinction of an aversive memory the dopaminergic reward signal of the PAM neuron in extinction trials is driven by excitation from the M6 MBON, which receives reduced inhibition from the approach mediating MVP2 as a result of the initial aversive conditioning that reduced KC drive of MVP2. Thus, extinction learning establishes a reward-like extinction memory trace in parallel to the initial aversive memory trace. Intriguingly, the initial memory trace is not altered by extinction (Fig. 5). This matches the experimental results and the proposed mechanism in Felsenberg et al. (2018) supporting the long-standing hypothesis of two parallel memory traces after extinction (Bouton, 2004;Dudai, 2004;Eisenhardt and Menzel, 2007).
The strict symmetry of the recurrent pathways in our model has the consequence of symmetrical quantitative results for the performance index and neuron input activation rates (Extended Data Fig. 3-1). The experimental Figure 5. Two separate memories underlie initial and extinction learning. CS1 odor-induced activation rates of MBONs were measured before and after extinction training to locate associative and extinction memories. A, B, Appetitive learning led to a relative depression of CS1 odor-induced activation rates of MV2 and M6, but not MVP2 and V2 MBONs. The memory trace in MV2/M6 stayed intact even after appetitive memory was extinguished. Extinction decreased the CS1 response of MVP2/ V2, establishing a parallel memory trace. C, D, Aversive conditioning led to a reduced CS1 input into MVP2/V2, but not into MV2/M6. After memory extinction, the memory trace in MVP2/ V2 remained, and we observed an additional decrease in CS1 response in MV2/M6. Boxplots show the median and the lower and upper quartiles, whiskers indicate 1.5 times interquartile range, outliers are marked with 1 symbol; n.s. = not significant, **p , 0.01. Results across simulation of n = 10 networks in all panels. situation, however, does not provide fully symmetric behavioral results. Our model matches well in the aversive memory pathway, i.e., for appetitive conditioning and extinction of aversive memory. In aversive conditioning experiments, however, Felsenberg et al. (2017Felsenberg et al. ( , 2018 obtained stronger performance indices than our model and extinction of the appetitive memory was complete (Extended Data Fig. 3-1). Independent tuning of model parameters for the two parallel pathways in our symmetric model allows for an improved quantitative match with the experimental results (Extended Data Fig. 3-2). However, a complete fit to the experimental data were not achievable. In particular, the complete extinction of the appetitive memory cannot be obtained. We thus conclude that the current model lacks at least on additional mechanism that introduces asymmetry in circuit and behavior. Our minimal model does not include all MBONs and DANs and their proven interconnections (Takemura et al., 2017), i.e., the inhibitory connection between MVP2 and the PPL1 cluster, might add functional asymmetry between the appetitive and the aversive memory pathway (Perisse et al., 2016;Pavlowsky et al., 2018).

Limitations of the model
Our network model (Fig. 1) can be viewed as a minimal circuit model of the MB sufficient to explain prediction error coding, single-trial conditioning and extinction learning. We focused on only four out of 34 MBONs and restricted MB connectivity to feedforward from KCs to four MBONs and feedback from MBONs to DANs but did not explore other synaptic contacts among KCs and between KCs, MBONs and DANs within the MB lobes as uncovered in the fly EM connectome (Takemura et al., 2017). Learning in our model is achieved through reinforcementmediated plasticity at a single synaptic site, which has been shown to be involved in appetitive and aversive short-term olfactory memory (Zars et  Performance Index B A Figure 6. Blocking specific neuron groups can reverse memory extinction. The computational model was used to reproduce the neurogenetic manipulations with temperature-sensitive shibire ts1 mutant flies. The activation of individual neurons was suppressed during extinction training only. This allows to identify neurons that are crucial to the process of establishing the extinction memory in our model. A, Knocking down PPL1 or V2 during extinction training led to a completely abolished the appetitive extinction memory. Blocking PAM, M6, MV2, or MVP2 neurons during odor re-exposure had no effect on the appetitive extinction memory. Blockage of all KCs (100%) prevented the model from extinction learning. However, when only 50% of the KC population was blocked during odor reactivation, extinction learning remained unaffected. Further effects of partial KC blocking on extinction learning are presented in Extended Data Figure 6-1. B, Suppressing the activity of PAM DANs, M6 neurons, or KCs (100%) during CS1 odor re-exposure after aversive conditioning prevents the model from forming the aversive extinction memory. However, blocking PPL1 DANs, MV2, V2, MVP2, or KCs (50%) does not diminish the memory extinction. Data are presented as mean 6 SD; n.s. = not significant, **p , 0.01, ***p , 0.001; n = 15. and Préat, 2001;. Increasing experimental evidence indicates that olfactory learning can involve plasticity at multiple sites within the insect MB (Pascual and Préat, 2001;Kremer et al., 2010;Bouzaiane et al., 2015;Haenicke et al., 2018;Yamazaki et al., 2018). Also, it has been shown that memories can co-exist on different time scales representing short-term, mid-term, and long-term memories (Pascual and Préat, 2001;Davis, 2011;Trannoy et al., 2011;Yamagata et al., 2015). Our trial-based model approach does not accommodate explicit time scales that would allow to differentiate between simultaneous, delay and trace conditioning (Dylla et al., 2013), memory acquisition and consolidation (Felsenberg et al., 2017) or decay (Shuai et al., 2015). Future extensions may retain full temporal dynamics, e.g., by using spiking neural network models that have previously been used successfully to study classical conditioning in fruit flies (Smith et al., 2008;Wessnitzer et al., 2012;Faghihi et al., 2017;Gupta et al., 2018;Rapp and Nawrot, 2020).
In the present study, we aimed at reproducing the behavioral performance index that is measured in a group forced choice paradigm (Quinn et al., 1974) and thus represents a population measure across individuals, each performing either the correct or the incorrect behavioral choice. The long-held notion states that expression of the CR behavior in individual flies is stochastic and follows the group-averaged behavior (Quinn et al., 1974) has been questioned by Chabaud et al. (2010) and by studies in the honeybee (Pamir et al., 2011(Pamir et al., , 2014Haenicke et al., 2018) and in the cockroach (Arican et al., 2020). They have shown that the average CR in a group of animals does not accurately reflect memory expression in individuals. Rather, it confounds two subgroups of learners and non-learners where learners showed a switch-like and stable expression of the CR behavior mostly after a single trial (average 1.7 trials) with high memory retention rates (.90%). Different parameters may account for individual learning abilities such as internal state, e.g., of hunger and satiety (Pamir et al., 2014;Sayin et al., 2018) that could influence perception or motivation. Establishing neural circuit models that take into account individuality of learning and individual parameters that influence memory formation is a challenge we aim to address in future model studies.

Model predictions
Our model makes several predictions that can be tested experimentally. First, we hypothesize that DANs in the PAM and PPL1 cluster compute a prediction error. Specifically, in naive animals and for the example of appetitive learning the PAM should show spontaneous baseline spiking activity and no stimulus response during the presentation of a novel odor. During a first pairing of the CS1 odor with a reward (US) the PAM cluster should show a clear odor response, which will be strongly reduced after the first learning trials (Fig. 4A). When reward is omitted during extinction trials or in the memory test, we predict a response in which the PAM reduces its spiking output below spontaneous level. This prediction matches the observation in DANs in the monkey (Schultz et al., 1997) and in the octopaminergic VUMmx1 neuron in the honeybee (Montague et al., 1995;Hammer, 1997), although the circuit mechanisms hypothesized here differs from that proposed by Schultz et al. (1997) and Hammer (1997).
Second, we hypothesize a switch-like induction of a memory trace during a single training trial, implying a steplike change in the MBON activation rates (Fig. 4). A second or third training trial should only induce gradual changes in cellular physiology. Likewise, the behavioral memory expression should be observable after a single learning trial.
Third, our systematic analysis of blocking experiments (Fig. 6) for all MBONs indicate that MV2, M6, and MVP2 do not play a role during appetitive memory extinction, whereas MVP2, MV2, and V2 are dispensable for aversive extinction learning. This could be experimentally tested by blocking these MBONs specifically during extinction trials only. While the co-existence of two memory traces after extinction of an aversive memory has been shown in calcium-imaging from MVP2 and M6 (Felsenberg et al., 2018;Fig. 5D) our model predicts parallel memory traces also for extinction of an appetitive memory (Fig. 5A,B).
Finally, in a study by Schwaerzel et al. (2002), in which