Neural Networks

Volume 19, Issue 8, October 2006, Pages 1075-1090

2006 Special Issue
Neural mechanism for stochastic behaviour during a competitive game

Alireza Soltani, Daeyeol Lee, Xiao-Jing Wang

https://doi.org/10.1016/j.neunet.2006.05.044

Abstract

Previous studies have shown that non-human primates can generate highly stochastic choice behaviour, especially when this is required during a competitive interaction with another agent. To understand the neural mechanism of such dynamic choice behaviour, we propose a biologically plausible model of decision making endowed with synaptic plasticity that follows a reward-dependent stochastic Hebbian learning rule. This model constitutes a biophysical implementation of reinforcement learning, and it reproduces salient features of behavioural data from an experiment with monkeys playing a matching pennies game. Due to interaction with an opponent and learning dynamics, the model generates quasi-random behaviour robustly in spite of intrinsic biases. Furthermore, non-random choice behaviour can also emerge when the model plays against a non-interactive opponent, as observed in the monkey experiment. Finally, when combined with a meta-learning algorithm, our model accounts for the slow drift in the animal’s strategy based on a process of reward maximization.

Introduction

Decision making has been studied using a variety of paradigms in multiple disciplines. For example, economists have often employed tasks based on multiple gambles or lotteries to investigate how decisions are influenced by the decision maker’s attitude towards uncertainty (Kahneman & Tversky, 1979). Behavioural ecologists have approached the problem of decision making in the context of foraging (Stephens & Krebs, 1986), whereas psychologists have frequently investigated choice behaviour using concurrent schedules of reinforcement (Herrnstein, Rachlin, & Laibson, 1997). All of these paradigms, however, are designed to investigate the process of decision making in a socially isolated individual. In contrast, decision making in a socially interactive context introduces a new principle of optimality (von Neumann & Morgenstern, 1944), since the outcome of one’s decision can be influenced by the decisions of others in the same group.

Recently, neuroscientists have begun to investigate the neural basis of decision making using behavioural paradigms rooted in these various disciplines (see Lee (2006)). In some cases, these studies were carried out in non-human primates, allowing the investigators to examine the activity of individual neurons during various types of decision making. For example, Sugrue, Corrado, and Newsome (2004) found that the activity of neurons in the intraparietal cortex reflects the relative income from the target in their receptive fields during an oculomotor foraging task based on a concurrent variable-interval schedule. Using an approach based on standard economic choice theory, McCoy and Platt (2005) found that neurons in the posterior cingulate cortex modulate their activity according to the uncertainty of the reward expected from a particular target. Barraclough, Conroy, and Lee (2004) and Dorris and Glimcher (2004) examined the patterns of neural activity in the dorsolateral prefrontal cortex and the posterior parietal cortex, respectively, while the animal interacted competitively with a computer opponent.

In the present study, we focus on the choice behaviour of monkeys playing a simple competitive game known as matching pennies (Barraclough et al., 2004). During this task, monkeys were required to choose one of two visual targets in an oculomotor free-choice task, and they obtained a reward only if they chose the same target as the computer opponent in a given trial. The optimal strategy during this game is to choose the two targets randomly and with equal probability, and it therefore requires random sequences of choices. Studies have shown that people are in general relatively poor at generating a random sequence of choices (Bar-Hillel and Wagenaar, 1991, Camerer, 2003), but with feedback they can learn to generate sequences that pass standard randomness tests (Neuringer, 1986). More interestingly, it has been found that people can generate more random sequences of choices if they are engaged in a competitive game (Rapoport & Budescu, 1992). Nevertheless, the neural mechanisms responsible for the generation of such highly stochastic behaviour are unknown.
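To make the payoff structure concrete, the following is a minimal sketch of the matching pennies interaction as described above; the function names and the uniform-random opponent are our own illustration, not code from the study.

```python
import random

def matching_pennies_reward(monkey_choice, computer_choice):
    """The monkey (the matcher) is rewarded only when the two choices
    agree; the computer (the non-matcher) wins otherwise."""
    return 1 if monkey_choice == computer_choice else 0

def random_opponent():
    # The Nash-equilibrium strategy: either target with probability 0.5.
    return random.choice(["left", "right"])

# Against an equilibrium opponent, any strategy earns about 50% reward.
rewards = [matching_pennies_reward(random.choice(["left", "right"]),
                                   random_opponent())
           for _ in range(10000)]
print(sum(rewards) / len(rewards))  # approximately 0.5
```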

Consistent with these behavioural findings obtained in human subjects, the results from the previous study (Barraclough et al., 2004) showed that monkeys can learn to generate nearly random sequences of choices when they receive reward feedback regarding their performance. In addition, the degree of randomness in the animal’s choice behaviour varied according to the amount of information utilized by the computer opponent to predict the animal’s choice (Lee, Conroy, McGreevy, & Barraclough, 2004). In other words, the animal’s behaviour became more random when the computer utilized additional information about the animal’s previous choices and their outcomes. Furthermore, a simple reinforcement learning model was proposed to account for the fact that the animal’s choice was systematically influenced by the computer’s choices in previous trials (Lee et al., 2004).
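A minimal sketch of a reinforcement learning rule of this general kind is shown below; the decay-and-increment form, the softmax choice rule, and all parameter values are illustrative stand-ins rather than the fitted model of Lee et al. (2004).

```python
import math
import random

ALPHA = 0.9   # hypothetical decay factor applied to both value functions
DELTA = 1.0   # hypothetical increment for a rewarded choice
BETA = 3.0    # hypothetical inverse temperature of the softmax

value = {"left": 0.0, "right": 0.0}  # value function for each target

def choose():
    # Softmax: the target with the larger value function is chosen
    # more often, but not deterministically.
    p_left = 1.0 / (1.0 + math.exp(-BETA * (value["left"] - value["right"])))
    return "left" if random.random() < p_left else "right"

def update(action, reward):
    # Decay both value functions, then reinforce the chosen target
    # according to its outcome.
    for a in value:
        value[a] *= ALPHA
    value[action] += DELTA * reward
```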

Here, we show that a biophysically plausible network model of decision making endowed with plastic synapses can not only generate random sequences of choices, but also capture other important features of the animal’s choice behaviour during the matching pennies task. First, monkeys displayed a bias in their choice behaviour when playing against a non-responsive computer opponent that selected its target randomly, regardless of the animal’s behaviour. To understand the nature of such a bias, we analyze the steady-state behaviour of the reinforcement learning model described in Lee et al. (2004) and that of our model, and derive the conditions under which a biased and non-random choice behaviour can emerge. Second, when the computer opponent was partially exploitive and used only the information about the animal’s previous choices but not their outcomes, the animal’s choice strategy displayed a slow drift over a period of many days. We implement a meta-learning algorithm (Schweighofer & Doya, 2003) in our model, and show that it can account for the gradual change in the animal’s strategy. To our knowledge, this study is the first to propose a possible explanation for the slow, gradual behavioural change over a timescale of many days observed experimentally.

Section snippets

Experimental methods

A detailed description of the experimental methods used to collect the behavioural data during the matching pennies task has been published previously (Lee et al., 2004). In the following, the behavioural task used in this study is described only briefly. Three rhesus monkeys (C, E, and F) were trained to perform an oculomotor free-choice task according to the rules of the matching pennies game (Fig. 1).

The animal began each trial by fixating a small yellow square at the centre of the computer screen.

Stability of equilibrium strategy

In algorithm 0, the computer chose the two targets randomly with equal probability, independent of the monkey’s choice, which corresponds to the Nash equilibrium in the matching pennies game. Thus, in this condition, the animal’s choice was always rewarded with 50% probability for both targets. Nevertheless, each animal displayed a significant bias for choosing one of the two targets (Fig. 2, Fig. 3), indicating a deviation from the Nash equilibrium. This bias was extreme for monkey
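Why a bias can persist against this opponent is easy to see in a toy simulation: when the computer plays the equilibrium strategy, the reward rate is 0.5 for every choice policy, so the reward stream exerts no corrective pressure against an intrinsic bias. The sketch below (all numbers hypothetical) makes this explicit.

```python
import random

# A player with a fixed intrinsic bias toward the leftward target.
bias = 0.75
rewards = []
for _ in range(100000):
    choice = "left" if random.random() < bias else "right"
    opponent = random.choice(["left", "right"])  # algorithm 0
    rewards.append(choice == opponent)

# The reward rate stays near 0.5 no matter how strong the bias is,
# so reward feedback alone cannot correct it.
print(sum(rewards) / len(rewards))
```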

Description of network model

Details about the architecture of our network model can be found in Wang (2002) (see also Brunel and Wang (2001)). Briefly, the decision-making network consists of 2000 integrate-and-fire neurons (1600 excitatory and 400 inhibitory), grouped into three populations of excitatory neurons and a single population of inhibitory neurons (Fig. 5). Two of the excitatory populations (240 neurons each) are selective to the leftward and rightward targets, and the third excitatory population (1120 neurons) is non-selective.
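The skeleton of such a network can be sketched as follows; the population sizes are those quoted above, but the dynamics, weights, and inputs are crude placeholders rather than the biophysical parameters of Wang (2002) and Brunel and Wang (2001), so whether robust winner-take-all dynamics emerge depends entirely on these assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Population sizes from the text: 1600 excitatory and 400 inhibitory
# integrate-and-fire neurons; two excitatory pools of 240 neurons are
# selective for the two targets, the remaining 1120 are non-selective.
N_E, N_I, N_SEL = 1600, 400, 240
N = N_E + N_I
A = slice(0, N_SEL)           # pool selective to the leftward target
B = slice(N_SEL, 2 * N_SEL)   # pool selective to the rightward target

# Illustrative leaky integrate-and-fire parameters (placeholders).
dt, tau_m, v_th, v_reset = 0.1, 20.0, 1.0, 0.0

# Sparse random connectivity; recurrent excitation is strengthened
# within each selective pool, and inhibitory neurons (columns N_E:)
# project with negative weights onto the whole network.
W = 0.012 * (rng.random((N, N)) < 0.1)
W[A, A] = 0.05 * (rng.random((N_SEL, N_SEL)) < 0.1)
W[B, B] = 0.05 * (rng.random((N_SEL, N_SEL)) < 0.1)
W[:, N_E:] *= -4.0

v = rng.random(N) * v_th
spiked = np.zeros(N, dtype=bool)
counts = np.zeros(N)
for _ in range(2000):  # 200 ms of simulated time
    drive = (rng.random(N) < 0.05) * 0.5            # noisy external input
    drive[A] += (rng.random(N_SEL) < 0.005) * 0.5   # slight extra drive to A
    v += (dt / tau_m) * (-v) + W @ spiked + drive
    spiked = v >= v_th
    v[spiked] = v_reset
    counts += spiked

# Read out the decision from the competition between the two pools.
print("choice:", "left" if counts[A].sum() > counts[B].sum() else "right")
```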

Model’s behaviour in the game of matching pennies

As described in the previous section, our model with the belief-dependent learning rule at the synaptic level behaves similarly to the reinforcement learning model, and the synaptic strength of this model can represent the value function for each choice. In this section, the choice behaviour of our model is characterized further, focusing on its behaviour during algorithm 0 and on the robustness of the model.
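At the synaptic level, the learning scheme can be reduced to the following caricature: binary synapses onto each selective pool are potentiated or depressed stochastically depending on the outcome of the choice, and the fraction of potentiated synapses plays the role of the value function. The sketch below follows this description; the rates, the slope of the choice function, and the sigmoid summary of the network’s competition are our own assumptions for illustration.

```python
import math
import random

# Fraction of potentiated binary synapses onto each selective pool;
# these strengths serve as the value functions (see text).
c = {"left": 0.5, "right": 0.5}

Q_POT = 0.1    # hypothetical potentiation probability after a reward
Q_DEP = 0.1    # hypothetical depression probability after no reward
SIGMA = 0.1    # hypothetical slope of the network's choice function

def choose():
    # Summarize the network's stochastic competition by a sigmoid of
    # the difference between the two synaptic strengths.
    p_left = 1.0 / (1.0 + math.exp(-(c["left"] - c["right"]) / SIGMA))
    return "left" if random.random() < p_left else "right"

def update(action, reward):
    # Reward-dependent stochastic Hebbian rule, in expectation over the
    # pool: a rewarded choice potentiates a fraction of the remaining
    # depressed synapses; an unrewarded choice depresses potentiated ones.
    if reward:
        c[action] += Q_POT * (1.0 - c[action])
    else:
        c[action] -= Q_DEP * c[action]
```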

Meta-learning

The behavioural data and the estimates of learning rates described above suggest that, in addition to the trial-to-trial dynamics of the choice behaviour, there is a much slower change in the behaviour that takes place across multiple days during the course of the experiment. This slow change was most noticeable when the computer opponent switched to algorithm 2 (see Fig. 3, Fig. 12). During this experiment, animals were not explicitly cued for the transitions in the algorithms used by the computer opponent.
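The algorithm can be sketched as follows for a single metaparameter; the perturbation scheme (nudging a learning rate in the direction that raises a fast reward average above a slow one) follows the general logic of Schweighofer and Doya (2003), but the run_trial stub and every constant here are hypothetical.

```python
import random

def run_trial(learning_rate):
    """Hypothetical stub standing in for one trial of the full model;
    it should return the reward obtained with the given learning rate."""
    return 1.0 if random.random() < 0.5 else 0.0

alpha = 0.10                 # metaparameter being tuned (a learning rate)
MU = 0.001                   # meta-learning rate
r_short, r_long = 0.5, 0.5   # fast and slow running averages of reward

for _ in range(10000):
    noise = random.gauss(0.0, 0.01)      # perturb the metaparameter
    reward = run_trial(alpha + noise)    # behave with the perturbed value
    r_short += 0.050 * (reward - r_short)
    r_long += 0.005 * (reward - r_long)
    # Move the metaparameter in the perturbed direction whenever recent
    # reward exceeds the long-term baseline, and away from it otherwise.
    alpha += MU * (r_short - r_long) * noise
```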

Discussion

One of the most important and influential models of decision making is reinforcement learning (Sutton & Barto, 1998). In this framework, the desirability of each action is represented by a value function that estimates the expected amount of reward resulting from that action. Consequently, actions with high value functions are chosen more frequently. The outcome of each action is then compared to the previously expected outcome, and the resulting error is used to update the value functions.
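In its simplest single-step form, and purely as an illustration of the framework rather than the specific rule used in this study, the update can be written as

$$ Q_{t+1}(a_t) \;=\; Q_t(a_t) + \alpha \,\bigl[ r_t - Q_t(a_t) \bigr], $$

where $Q_t(a_t)$ is the value function of the action chosen on trial $t$, $r_t$ is the obtained reward, and $\alpha$ is a learning rate; the bracketed term is the reward prediction error.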

Acknowledgments

We are grateful to Dominic Barraclough, Michelle Conroy and Ben McGreevy for their help with the experiment. This study was supported by grant MH073246 from the National Institutes of Health.

References (45)

  • P. Redgrave et al., The basal ganglia: a vertebrate solution to the selection problem?, Neuroscience (1999)
  • J.N. Reynolds et al., Dopamine-dependent plasticity of corticostriatal synapses, Neural Networks (2002)
  • N. Schweighofer et al., Meta-learning in reinforcement learning, Neural Networks (2003)
  • X.-J. Wang, Probabilistic decision making by slow reverberation in cortical circuits, Neuron (2002)
  • D.J. Amit et al., Dynamic learning in neural networks with material synapses, Neural Computation (1994)
  • D.J. Barraclough et al., Prefrontal cortex and decision making in a mixed-strategy game, Nature Neuroscience (2004)
  • G.S. Berns et al., A computational model of how the basal ganglia produce sequences, Journal of Cognitive Neuroscience (1998)
  • N. Brunel et al., Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition, Journal of Computational Neuroscience (2001)
  • K.P. Burnham et al., Model selection and multimodel inference: A practical information-theoretic approach (2002)
  • C.F. Camerer, Behavioural game theory: Experiments in strategic interaction (2003)
  • S. Fusi, Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates, Biological Cybernetics (2002)
  • R.J. Herrnstein et al., The matching law: Papers in psychology and economics (1997)

1 Current address: Department of Neurobiology and Kavli Institute for Neuroscience, Yale University School of Medicine, New Haven, CT 06520, USA.

2 Tel.: +1 203 785 6302; fax: +1 203 785 5263.

3 Tel.: +1 203 785 3527; fax: +1 203 785 5263.
