Neural Networks

Volume 19, Issue 8, October 2006, Pages 1153-1160

2006 Special Issue
The misbehavior of value and the discipline of the will

https://doi.org/10.1016/j.neunet.2006.03.002

Abstract

Most reinforcement learning models of animal conditioning operate under the convenient, though fictive, assumption that Pavlovian conditioning concerns prediction learning whereas instrumental conditioning concerns action learning. However, it is only through Pavlovian responses that Pavlovian prediction learning is evident, and these responses can act against the instrumental interests of the subjects. This can be seen in both experimental and natural circumstances. In this paper we study the consequences of importing this competition into a reinforcement learning context, and demonstrate the resulting effects in an omission schedule and a maze navigation task. The misbehavior created by Pavlovian values can be quite debilitating; we discuss how it may be disciplined.

Introduction

Theories of animal learning rest on a fundamental distinction between two classes of procedure: Pavlovian and instrumental conditioning (see Mackintosh (1983)). Crudely, the difference concerns contingency. In a Pavlovian (or classical) procedure, an animal learns that a stimulus (such as the ringing of a bell) predicts a biologically significant outcome (such as the delivery of a piece of meat) which is made to happen regardless of the animal’s actions. The characteristic behavioral responses (e.g. salivation) that result are taken to reflect directly the animal’s learned expectations. In instrumental (operant) conditioning, however, the delivery of the outcome is made to be contingent on appropriate actions (e.g. leverpresses) being taken by the animal. Ambiguity between Pavlovian and instrumental influences arises in that many behaviors, such as locomotion, can evidently occur under either Pavlovian or instrumental control. In fact, virtually all conditioning situations involve both sorts of circumstance; and the two varieties of learning are thought to interact with one another in a number of ways.

Here we investigate one such interaction — direct competition for behavioral output. This sort of competition has hitherto eluded the reinforcement learning (RL; Sutton and Barto (1998)) theories that nevertheless have wide application in modeling substantial issues in both classical and instrumental conditioning (e.g. Dayan and Balleine (2002), Doya (1999), Hikosaka et al. (1999), Houk, Adams, and Barto (1995), Montague, Dayan, and Sejnowski (1996), O’Doherty (2004), Schultz (1998), Suri and Schultz (1998) and Voorn, Vanderschuren, Groenewegen, Robbins, and Pennartz (2004)).

One high point in the debate about the relative importance of instrumental and classical effects in controlling behavior was the development of a Pavlovian procedure called autoshaping (Brown & Jenkins, 1968). This originally involved the observation that when the delivery of food reward is accompanied by the timely illumination of a pecking key, pigeons come to approach and peck the key. Critically, this pecking occurs even though (as a Pavlovian procedure) the food is delivered regardless of whether or not the key is pecked. In fact, this procedure leads more swiftly to reliable key pecking than the instrumental equivalent of only rewarding the birds with food on trials on which they peck. Classical conditioning ideas such as autoshaping actually underlie many schemes for shaping particular, apparently instrumental, animal behaviors.

By contrast, a procedure called negative automaintenance (Williams & Williams, 1969) uses an omission schedule (Sheffield, 1965) to pit classical and instrumental conditioning against each other. In the version of this adapted from autoshaping, the pigeons are denied food on any trial in which they peck the lit key. In this case, the birds still peck the key (albeit to a reduced degree), thereby getting less food than they might. This persistence in pecking despite the instrumental contingency between withholding pecking and food shows that Pavlovian responding is formally independent from instrumental responding, since the Pavlovian peck is never reinforced. However, it is disturbing for standard instrumental conditioning notions, which typically do not place restrictions on the range of behaviors that can be controlled by reward contingencies. Further, as Dayan and Balleine (2002) pointed out, but did not fix, it has particular force against the formal instantiation of instrumental conditioning in terms of RL. RL accounts neither for the fact that a particular action (pecking) accompanies the mere prediction of food, nor for the fact that this action choice can be better (or perhaps worse) than the instrumentally appropriate choice (in this case, of not pecking).

Such an anomaly is merely the tip of a rococo iceberg. In a famous paper entitled The misbehavior of organisms, Breland and Breland (1961) described a variety of more exotic failures of conditioning procedures (see also Breland and Breland (1966)). For instance, animals that initially learn to deposit an object in a chute to obtain food, subsequently become hampered because of their inability to part with the food-predicting object. Equally, Hershberger (1986) showed that, in a ‘looking glass’ environment, chicks could not learn to run away from a source of food in order to get access to it. Many of these failures have the flavor of omission schedules, with an ecologically plausible action (the equivalent of approaching and pecking the lit key) interfering with the choices that would otherwise lead to desirable outcomes. Various of the behavioral anomalies arise progressively, with the instrumentally appropriate actions slowly being overwhelmed by Pavlovian ones.

Humans also exhibit behaviors that seem to violate their apparent goals. This has most frequently been studied in terms of a long-term plan (e.g. dieting) being bulldozed by a short-term opportunity (e.g. a cream bun). Indeed, this sort of intertemporal choice conflict lies at the heart of two popular theories. One theory suggests the conflict arises from hyperbolic discounting of the future (see Ainslie (1992, 2001), Laibson (1997), Loewenstein and Prelec (1992), and Myerson and Green (1995)), which makes short-term factors overwhelm a long-term view. Another theory is that the behavioral anomalies arise from competition between deliberative and affective choice systems (Loewenstein and O’Donoghue, 2004; McClure et al., 2004), with the latter ignoring long-term goals in favor of immediate ones. However, data on interactions between deliberative and affective instrumental systems in animals are well explained (see Daw, Niv, and Dayan (2005)) by assuming the controllers actually share the same goals and differ only in terms of the information they bring to bear on achieving those goals. Therefore, here we propose that, instead of intertemporal conflicts being key, the anomalies may arise from interactions between Pavlovian control and instrumental control of either stripe. The appearance of intertemporal competition follows from the character of the Pavlovian responses, which seem myopic due to being physically directed toward accessible reinforcers and their predictors.
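
For concreteness, the way hyperbolic discounting generates such conflicts can be seen with the standard Mazur/Ainslie form V = A/(1 + kD) for a reward of size A at delay D; the numbers below are purely illustrative and are not taken from this paper. Because the hyperbola falls steeply at short delays, a smaller-sooner reward overtakes a larger-later one as it becomes imminent, producing a preference reversal:

```python
# Hyperbolic discounting (standard Mazur/Ainslie form, illustrative numbers only)
k = 1.0                              # discount parameter

def value(amount, delay):
    return amount / (1.0 + k * delay)

small, large, gap = 4.0, 10.0, 3.0   # the larger reward arrives 3 time units later

# Viewed from far away, the larger-later reward is preferred...
print(value(small, 10.0), value(large, 10.0 + gap))   # 0.36 vs 0.71
# ...but once the smaller reward is imminent, preference reverses.
print(value(small, 0.0), value(large, 0.0 + gap))     # 4.0 vs 2.5
```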

In this paper, we propose a formal RL account of the interaction between the apparently misbehaving Pavlovian responses (arising from classically conditioned value predictions) and instrumental action preferences. As mentioned, we have recently (Daw et al., 2005) studied competition in RL between multiple subsystems for instrumental control — a more reflective, ‘goal-directed’ controller and its ‘habitual’ counterpart; the present work extends this approach to the interactions between instrumental (for simplicity, here represented by a single habitual controller) and Pavlovian control. In Section 2, we show how our model gives rise to negative automaintenance in an omission schedule, and in Section 3, we explore the richer and more varied sorts of misbehavior that it produces in the context of a navigational task. Finally, Ainslie (1992, 2001), and following him Loewenstein and O’Donoghue (2004), consider the will as the faculty that allows (human) subjects to keep their long-term preferences from being derailed by short-term ones. We consider how the will may curb Pavlovian misbehavior.

Negative automaintenance

Consider first the simplest case of instrumental conditioning in which animals learn to execute action N (nogo: withholding a key peck) which leads to reward r=1 rather than action G (go: pecking the key) which leads to reward r=0. For convenience, we consider a Q-learning scheme (Watkins, 1989) in which subjects acquire three quantities (all at trial t):

1. v(t), the mean reward, learned as v(0)=0, and v(t+1)=v(t)+η(r(t)−v(t)), where r(t)∈{0,1} is the reward delivered on trial t, and η is a learning rate.
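
The snippet is cut off at this point, but the flavor of such a scheme can be conveyed with a small simulation. The sketch below is our own illustration rather than the paper's exact model: it assumes that, alongside the Pavlovian value v(t), the subject learns instrumental values q for go and nogo by the same delta rule, and that action selection is a softmax over a blend of the instrumental values and a Pavlovian impetus toward 'go' that is proportional to v(t) and weighted by an assumed parameter ω. Under the omission contingency (reward only when the key is not pecked), the Pavlovian term keeps pecking alive.

```python
import numpy as np

rng = np.random.default_rng(0)

eta = 0.1      # learning rate (illustrative value)
beta = 5.0     # softmax inverse temperature (assumed)
omega = 0.5    # weight of the Pavlovian impetus (assumed)
n_trials = 2000

v = 0.0                    # Pavlovian value v(t) of the lit key
q = {"G": 0.0, "N": 0.0}   # instrumental values of go (peck) and nogo (withhold)

p_go_trace = []
for t in range(n_trials):
    # Pavlovian impetus: the prediction v biases behavior toward approach/pecking,
    # regardless of the instrumental consequences of pecking.
    m_go = (1 - omega) * q["G"] + omega * v
    m_nogo = (1 - omega) * q["N"]

    p_go = np.exp(beta * m_go) / (np.exp(beta * m_go) + np.exp(beta * m_nogo))
    a = "G" if rng.random() < p_go else "N"

    # Omission schedule: food is delivered only if the key is NOT pecked.
    r = 0.0 if a == "G" else 1.0

    # Delta-rule updates: Q-learning for the chosen action, Pavlovian value for the stimulus.
    q[a] += eta * (r - q[a])
    v += eta * (r - v)

    p_go_trace.append(p_go)

print("asymptotic probability of pecking:", np.mean(p_go_trace[-200:]))
```

With ω = 0 the simulated subject learns to withhold pecking almost entirely; with the Pavlovian term included, the probability of pecking settles well above zero even though pecking forfeits the food, which is the signature of negative automaintenance.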

Detours

Omission schedules, and indeed the interestingly florid behaviors exhibited by Breland and Breland’s (1961) actors or Hershberger’s (1986) chicks, concern relatively constrained sets of actions. However, classical contingencies may exert a rather more all-pervasive influence over other sorts of behavior, warping choice according to the proximity of relatively immediate goals and their near precursors. We would argue that some of the many apparent illogicalities of choices, such as those studied

Discussion

In this paper, we have considered a simple, policy-blending interaction between Pavlovian and instrumental actions, using ideas from reinforcement learning to provide a formal framework. Pavlovian actions (such as approach to cues predicting rewards), which are presumably stamped in by their evolutionary appropriateness, can sometimes interfere negatively with instrumental goals, leading to poor control. This is starkly evident in omission schedules, which are designed to emphasize this

Acknowledgements

The authors are supported by the Gatsby Charitable Foundation (PD, NDD, YN), the Royal Society (NDD), the EU BIBA project (PD, NDD), a Dan David Fellowship (YN) and a Wellcome Trust program grant to Prof Ray Dolan (BS). We are very grateful to Kenji Doya and the Okinawa Computational Neuroscience Course for most pleasantly facilitating our interaction, and to two anonymous reviewers for their helpful comments.

References (54)

  • Baird, L. C. (1993). Advantage updating. Technical report WL-TR-93-1146. Wright-Patterson Air Force...
  • Barto, A. G., et al. (1983). Neuronlike elements that can solve difficult learning problems. IEEE Transactions on Systems, Man, and Cybernetics.
  • Barto, A. G., et al. Learning and sequential decision making.
  • Bolles, R. C., et al. (1966). Does CS termination reinforce avoidance behavior? Journal of Comparative and Physiological Psychology.
  • Breland, K., et al. (1961). The misbehavior of organisms. American Psychologist.
  • Breland, K., et al. (1966). Animal behavior.
  • Brown, P. L., et al. (1968). Auto-shaping of the pigeon’s key-peck. Journal of the Experimental Analysis of Behavior.
  • Coleman, S. R. (1975). Consequences of response-contingent change in unconditioned stimulus intensity upon the rabbit (Oryctolagus cuniculus) nictitating membrane response. Journal of Comparative and Physiological Psychology.
  • Cropper, M., et al. (1991). Discounting human lives. American Journal of Agricultural Economics.
  • Daw, N. D., et al. (2005). Uncertainty-based competition between prefrontal and striatal systems for behavioral control. Nature Neuroscience.
  • Dayan, P., et al. (2000). Learning and selective attention. Nature Neuroscience.
  • Dickinson, A. (1985). Actions and habits — the development of behavioural autonomy. Philosophical Transactions of the Royal Society of London.
  • Dickinson, A., et al. The role of learning in motivation.
  • Foster, D. J. (2000). A computational inquiry into navigation, with particular reference to the hippocampus. Ph.D....
  • Foster, D. J., et al. (2000). Models of hippocampally dependent navigation using the temporal difference learning rule. Hippocampus.
  • Hershberger, W. A. (1986). An approach through the looking-glass. Animal Learning and Behavior.
  • Holland, P. C. (2004). Relations between Pavlovian-instrumental transfer and reinforcer devaluation. Journal of Experimental Psychology: Animal Behavior Processes.