2006 Special Issue
The misbehavior of value and the discipline of the will
Introduction
Theories of animal learning rest on a fundamental distinction between two classes of procedure: Pavlovian and instrumental conditioning (see Mackintosh (1983)). Crudely, the difference concerns contingency. In a Pavlovian (or classical) procedure, an animal learns that a stimulus (such as the ringing of a bell) predicts a biologically significant outcome (such as the delivery of a piece of meat) which is made to happen regardless of the animal’s actions. The characteristic behavioral responses (e.g. salivation) that result are taken to reflect directly the animal’s learned expectations. In instrumental (operant) conditioning, however, the delivery of the outcome is made to be contingent on appropriate actions (e.g. leverpresses) being taken by the animal. Ambiguity between Pavlovian and instrumental influences arises in that many behaviors, such as locomotion, can evidently occur under either Pavlovian or instrumental control. In fact, virtually all conditioning situations involve both sorts of circumstance; and the two varieties of learning are thought to interact with one another in a number of ways.
Here we investigate one such interaction: direct competition for behavioral output. This sort of competition has hitherto eluded the reinforcement learning (RL; Sutton and Barto (1998)) theories that nevertheless have wide application in modeling substantial issues in both classical and instrumental conditioning (e.g. Dayan and Balleine (2002), Doya (1999), Hikosaka et al. (1999), Houk, Adams, and Barto (1995), Montague, Dayan, and Sejnowski (1996), O’Doherty (2004), Schultz (1998), Suri and Schultz (1998) and Voorn, Vanderschuren, Groenewegen, Robbins, and Pennartz (2004)).
One high point in the debate about the relative importance of instrumental and classical effects in controlling behavior was the development of a Pavlovian procedure called autoshaping (Brown & Jenkins, 1968). This originally involved the observation that when the delivery of food reward is accompanied by the timely illumination of a pecking key, pigeons come to approach and peck the key. Critically, this pecking occurs even though (as a Pavlovian procedure) the food is delivered regardless of whether or not the key is pecked. In fact, this procedure leads more swiftly to reliable key pecking than the instrumental equivalent of only rewarding the birds with food on trials on which they peck. Classical conditioning ideas such as autoshaping actually underlie many schemes for shaping particular, apparently instrumental, animal behaviors.
By contrast, a procedure called negative automaintenance (Williams & Williams, 1969) uses an omission schedule (Sheffield, 1965) to pit classical and instrumental conditioning against each other. In the version of this adapted from autoshaping, the pigeons are denied food on any trial in which they peck the lit key. In this case, the birds still peck the key (albeit to a reduced degree), thereby getting less food than they might. This persistence in pecking despite the instrumental contingency between withholding pecking and food shows that Pavlovian responding is formally independent from instrumental responding, since the Pavlovian peck is never reinforced. However, it is disturbing for standard instrumental conditioning notions, which typically do not place restrictions on the range of behaviors that can be controlled by reward contingencies. Further, as Dayan and Balleine (2002) pointed out, but did not fix, it has particular force against the formal instantiation of instrumental conditioning in terms of RL. RL accounts neither for the fact that a particular action (pecking) accompanies the mere prediction of food, nor for the fact that this action choice can be better (or perhaps worse) than the instrumentally appropriate choice (in this case, of not pecking).
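To make the anomaly concrete, the following minimal sketch simulates a purely instrumental learner on an omission schedule. It is an illustration under stated assumptions, not the model developed in this paper: the function name, the single-state Q-learner, the logistic choice rule, the reward values (1 for withholding, 0 for pecking) and the parameters `alpha` and `beta` are all hypothetical choices. Such a learner quickly stops pecking, which is exactly what the birds fail to do.

```python
import math
import random


def simulate_instrumental_only(n_trials=2000, alpha=0.1, beta=5.0, seed=0):
    """Single-state Q-learner on an omission schedule (illustrative assumptions).

    'go' (pecking the lit key) cancels food, 'nogo' (withholding) earns food.
    Returns the learned action values and the peck rate over the last 500 trials.
    """
    rng = random.Random(seed)
    q = {"go": 0.0, "nogo": 0.0}        # instrumental action values
    reward = {"go": 0.0, "nogo": 1.0}   # omission contingency (assumed magnitudes)
    pecked = []
    for _ in range(n_trials):
        # Logistic (two-action softmax) choice between pecking and withholding.
        p_go = 1.0 / (1.0 + math.exp(-beta * (q["go"] - q["nogo"])))
        action = "go" if rng.random() < p_go else "nogo"
        # Delta-rule update of the chosen action's value.
        q[action] += alpha * (reward[action] - q[action])
        pecked.append(action == "go")
    late_peck_rate = sum(pecked[-500:]) / 500
    return q, late_peck_rate


if __name__ == "__main__":
    values, rate = simulate_instrumental_only()
    print("learned values:", values)
    print("late peck rate:", rate)
```

In this sketch the value of withholding approaches 1 while the value of pecking stays at 0, so pecking all but disappears; nothing in a standard instrumental learner of this kind produces the persistent pecking seen in negative automaintenance.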
Such an anomaly is merely the tip of a rococo iceberg. In a famous paper entitled The misbehavior of organisms, Breland and Breland (1961) described a variety of more exotic failures of conditioning procedures (see also Breland and Breland (1966)). For instance, animals that initially learn to deposit an object in a chute to obtain food, subsequently become hampered because of their inability to part with the food-predicting object. Equally, Hershberger (1986) showed that, in a ‘looking glass’ environment, chicks could not learn to run away from a source of food in order to get access to it. Many of these failures have the flavor of omission schedules, with an ecologically plausible action (the equivalent of approaching and pecking the lit key) interfering with the choices that would otherwise lead to desirable outcomes. Various of the behavioral anomalies arise progressively, with the instrumentally appropriate actions slowly being overwhelmed by Pavlovian ones.
Humans also exhibit behaviors that seem to violate their apparent goals. This has most frequently been studied in terms of a long-term plan (e.g. dieting) being bulldozed by a short-term opportunity (e.g. a cream bun). Indeed, this sort of intertemporal choice conflict lies at the heart of two popular theories. One theory suggests the conflict arises from hyperbolic discounting of the future (see Ainslie (1992, 2001), Laibson (1997), Loewenstein and Prelec (1992) and Myerson and Green (1995)), which makes short-term factors overwhelm a long-term view. Another theory is that the behavioral anomalies arise from competition between deliberative and affective choice systems (Loewenstein and O’Donoghue, 2004; McClure et al., 2004), with the latter ignoring long-term goals in favor of immediate ones. However, data on interactions between deliberative and affective instrumental systems in animals are well explained (see Daw, Niv, and Dayan (2005)) by assuming the controllers actually share the same goals and differ only in terms of the information they bring to bear on achieving those goals. Therefore, here we propose that, instead of intertemporal conflicts being key, the anomalies may arise from interactions between Pavlovian control and instrumental control of either stripe. The appearance of intertemporal competition follows from the character of the Pavlovian responses, which seem myopic due to being physically directed toward accessible reinforcers and their predictors.
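For readers unfamiliar with the hyperbolic discounting account mentioned above, the following sketch shows the preference reversal it predicts. The reward amounts, delays and the parameters `k` and `gamma` are hypothetical values picked only for illustration; the hyperbolic form amount / (1 + k × delay) is the commonly used one, taken here as an assumption rather than from this paper.

```python
def hyperbolic_value(amount, delay, k=1.0):
    """Hyperbolically discounted value: amount / (1 + k * delay)."""
    return amount / (1.0 + k * delay)


def exponential_value(amount, delay, gamma=0.8):
    """Exponentially discounted value: amount * gamma ** delay."""
    return amount * gamma ** delay


def prefers_larger_later(value_fn, lead_time, small=(5.0, 1.0), large=(10.0, 5.0)):
    """True if the larger, later reward is valued above the smaller, sooner one.

    `small` and `large` are (amount, delay) pairs (hypothetical numbers);
    `lead_time` pushes both options further into the future by the same amount.
    """
    v_small = value_fn(small[0], small[1] + lead_time)
    v_large = value_fn(large[0], large[1] + lead_time)
    return v_large > v_small


if __name__ == "__main__":
    for lead in (10.0, 0.0):
        print("lead time", lead,
              "| hyperbolic prefers larger-later:", prefers_larger_later(hyperbolic_value, lead),
              "| exponential prefers larger-later:", prefers_larger_later(exponential_value, lead))
```

With these numbers the hyperbolic discounter prefers the larger, later reward when both options are distant but switches to the smaller, sooner one as it becomes imminent, whereas the exponential discounter's preference does not change with lead time.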
In this paper, we propose a formal RL account of the interaction between the apparently misbehaving Pavlovian responses (arising from classically conditioned value predictions) and instrumental action preferences. As mentioned, we have recently (Daw et al., 2005) studied competition in RL between multiple subsystems for instrumental control: a more reflective, ‘goal-directed’ controller and its ‘habitual’ counterpart. The present work extends this approach to the interactions between instrumental (for simplicity, here represented by a single habitual controller) and Pavlovian control. In Section 2, we show how our model gives rise to negative automaintenance in an omission schedule, and in Section 3, we explore the richer and more varied sorts of misbehavior that it produces in the context of a navigational task. Finally, Ainslie (1992, 2001), and following him Loewenstein and O’Donoghue (2004), consider the will as the faculty that allows (human) subjects to keep their long-term preferences from being derailed by short-term ones. We consider how the will may curb Pavlovian misbehavior.
Negative automaintenance
Consider first the simplest case of instrumental conditioning, in which animals learn to execute one action (nogo: withholding a key peck), which leads to one reward, rather than another action (go: pecking the key), which leads to a different reward. For convenience, we consider a Q-learning scheme (Watkins, 1989) in which subjects acquire three quantities (all defined at a given trial):
1. the mean reward, learned from the reward delivered on each trial, with a learning
Detours
Omission schedules, and indeed the interestingly florid behaviors exhibited by Breland and Breland’s (1961) actors or Hershberger’s (1986) chicks, concern relatively constrained sets of actions. However, classical contingencies may exert a rather more all-pervasive influence over other sorts of behavior, warping choice according to the proximity of relatively immediate goals and their near precursors. We would argue that some of the many apparent illogicalities of choices, such as those studied
Discussion
In this paper, we have considered a simple, policy-blending interaction between Pavlovian and instrumental actions, using ideas from reinforcement learning to provide a formal framework. Pavlovian actions (such as approach to cues predicting rewards), which are presumably stamped in by their evolutionary appropriateness, can sometimes interfere negatively with instrumental goals, leading to poor control. This is starkly evident in omission schedules, which are designed to emphasize this
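The policy-blending idea can be sketched as follows. This is an illustration under loudly stated assumptions rather than the paper's exact equations: the blending weight `omega`, the choice of the running mean reward `v` as the Pavlovian value, the advantage-based instrumental propensities, the logistic choice rule and all parameter values are hypothetical. The Pavlovian controller here simply adds to the propensity of the approach action (go) in proportion to the predicted reward.

```python
import math
import random


def simulate_blended(omega, n_trials=3000, eta=0.05, beta=5.0, seed=0):
    """Peck rate on an omission schedule under blended control (assumed form).

    omega = 0 gives purely instrumental choice; larger omega gives the Pavlovian
    approach response (go) more weight. Returns the peck rate over the last
    1000 trials.
    """
    rng = random.Random(seed)
    q = {"go": 0.0, "nogo": 0.0}        # instrumental action values
    v = 0.0                             # running mean reward (Pavlovian value)
    reward = {"go": 0.0, "nogo": 1.0}   # pecking cancels food (assumed magnitudes)
    window = 1000
    pecks = 0
    for t in range(n_trials):
        # Instrumental propensities: advantages of each action over the mean.
        adv_go = q["go"] - v
        adv_nogo = q["nogo"] - v
        # Blend: only the Pavlovian action (go) gains a value-dependent bonus.
        m_go = (1.0 - omega) * adv_go + omega * v
        m_nogo = (1.0 - omega) * adv_nogo
        p_go = 1.0 / (1.0 + math.exp(-beta * (m_go - m_nogo)))
        action = "go" if rng.random() < p_go else "nogo"
        r = reward[action]
        q[action] += eta * (r - q[action])  # delta rule for the chosen action
        v += eta * (r - v)                  # delta rule for the mean reward
        if t >= n_trials - window and action == "go":
            pecks += 1
    return pecks / window


if __name__ == "__main__":
    for w in (0.0, 0.5):
        print("omega =", w, "late peck rate =", simulate_blended(w))
```

In this sketch, pecking persists at an intermediate rate when `omega` is positive: pecking lowers the mean reward, which weakens the Pavlovian bonus, which in turn reduces pecking, so the two influences settle into a balance in which food is lost on a fraction of trials.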
Acknowledgements
The authors are supported by the Gatsby Charitable Foundation (PD, NDD, YN), the Royal Society (NDD), the EU BIBA project (PD, NDD), a Dan David Fellowship (YN) and a Wellcome Trust program grant to Prof Ray Dolan (BS). We are very grateful to Kenji Doya and the Okinawa Computational Neuroscience Course for most pleasantly facilitating our interaction, and to two anonymous reviewers for their helpful comments.
References (54)
- et al. Goal-directed instrumental action: Contingency and incentive learning and their cortical substrates. Neuropharmacology (1998)
- et al. Inactivation of the infralimbic prefrontal cortex reinstates goal-directed responding in overtrained rats. Behavioral Brain Research (2003)
- et al. Reward, motivation and reinforcement learning. Neuron (2002)
- What are the computations of the cerebellum, the basal ganglia, and the cerebral cortex? Neural Networks (1999)
- et al. Parallel neural networks for learning sequential procedures. Trends in Neurosciences (1999)
- Reward representations and reward-related learning in the human brain: Insights from neuroimaging. Current Opinion in Neurobiology (2004)
- Cognitive planning in humans: Neuropsychological, neuroanatomical and neuropharmacological perspectives. Progress in Neurobiology (1997)
- et al. Putting a spin on the dorsal-ventral divide of the striatum. Trends in Neurosciences (2004)
- Picoeconomics (1992)
- Breakdown of will (2001)
- Neuronlike elements that can solve difficult learning problems. IEEE Transactions on Systems, Man, and Cybernetics
- Learning and sequential decision making
- Does CS termination reinforce avoidance behavior? Journal of Comparative and Physiological Psychology
- The misbehavior of organisms. American Psychologist
- Animal behavior
- Auto-shaping of the pigeon’s key-peck. Journal of the Experimental Analysis of Behavior
- Consequences of response-contingent change in unconditioned stimulus intensity upon the rabbit (Oryctolagus cuniculus) nictitating membrane response. Journal of Comparative and Physiological Psychology
- Discounting human lives. American Journal of Agricultural Economics
- Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience
- Learning and selective attention. Nature Neuroscience
- Actions and habits: the development of behavioural autonomy. Philosophical Transactions of the Royal Society of London
- The role of learning in motivation
- Models of hippocampally dependent navigation using the temporal difference learning rule. Hippocampus
- An approach through the looking-glass. Animal Learning and Behavior
- Relations between Pavlovian-instrumental transfer and reinforcer devaluation. Journal of Experimental Psychology: Animal Behavior Processes