2006 Special Issue: Neural mechanism for stochastic behaviour during a competitive game
Introduction
Decision making has been studied using a variety of paradigms in multiple disciplines. For example, economists have often employed tasks based on multiple gambles or lotteries to investigate how decisions are influenced by the decision maker’s attitude towards uncertainty (Kahneman & Tversky, 1979). Behavioural ecologists have approached the problem of decision making in the context of foraging (Stephens & Krebs, 1986), whereas psychologists have frequently investigated choice behaviour using a concurrent schedule of reinforcement (Herrnstein, Rachlin, & Laibson, 1997). All of these paradigms, however, are designed to investigate the process of decision making in a socially isolated individual. In contrast, decision making in a socially interactive context introduces a new principle of optimality (von Neumann & Morgenstern, 1944), since the outcome of one’s decision can be influenced by the decisions of others in the same group.
Recently, neuroscientists have begun to investigate the neural basis of decision making using behavioural paradigms rooted in these various disciplines (see Lee, 2006). In some cases, these studies were carried out in non-human primates, allowing the investigators to examine the activity of individual neurons during various types of decision making. For example, Sugrue, Corrado, and Newsome (2004) found that the activity of neurons in intraparietal cortex reflects the relative income from the target in their receptive fields during an oculomotor foraging task based on a concurrent variable-interval schedule. Using an approach based on a standard economic choice theory, McCoy and Platt (2005) found that neurons in posterior cingulate cortex modulate their activity according to the uncertainty of reward expected from a particular target. Barraclough, Conroy, and Lee (2004) and Dorris and Glimcher (2004) have examined the pattern of neural activity in dorsolateral prefrontal cortex and posterior parietal cortex, respectively, while the animal interacted competitively with a computer opponent.
In the present study, we focus on the choice behaviour of monkeys playing a simple competitive game known as matching pennies (Barraclough et al., 2004). During this task, monkeys were required to choose one of two visual targets in an oculomotor free-choice task, and they obtained a reward only if they chose the same target as the computer opponent in a given trial. The optimal strategy during this game is to choose the two targets randomly with equal probability, and therefore requires random sequences of choices. Some studies have shown that in general people are relatively poor at generating a random sequence of choices (Bar-Hillel & Wagenaar, 1991; Camerer, 2003), but with feedback they can learn to generate sequences that can pass standard randomness tests (Neuringer, 1986). More interestingly, it has been found that people can generate more random sequences of choices if they are engaged in a competitive game (Rapoport & Budescu, 1992). Nevertheless, the neural mechanisms responsible for the generation of such highly stochastic behaviour are unknown.
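Why the equal-probability strategy is optimal can be illustrated with a toy simulation. The sketch below assumes a hypothetical worst-case opponent that knows the player's bias and always picks the player's less frequent target to deny a match; it is not one of the computer algorithms used in the experiment.

```python
import random

def reward_rate(p_left, n_trials=200_000, seed=1):
    """Empirical reward rate in matching pennies for a player who picks
    'left' with probability p_left, against a worst-case opponent that
    always picks the player's less frequent target (denying a match).
    The player is rewarded only when both pick the same target."""
    rng = random.Random(seed)
    opponent = 'left' if p_left < 0.5 else 'right'  # minimises matches
    wins = 0
    for _ in range(n_trials):
        player = 'left' if rng.random() < p_left else 'right'
        wins += (player == opponent)
    return wins / n_trials

print(reward_rate(0.8))  # biased player: exploited down to about 0.2
print(reward_rate(0.5))  # unbiased player: secures about 0.5
```

Any deviation from 0.5 lowers the attainable reward against a fully exploitive opponent, which is why the task demands random, unbiased sequences of choices.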
Consistent with these behavioural findings obtained in human subjects, the results from the previous study (Barraclough et al., 2004) showed that monkeys can learn to generate nearly random sequences of choices when they receive reward feedback regarding their performance. In addition, the degree of randomness in the animal’s choice behaviour varied according to the amount of information utilized by the computer opponent to predict the animal’s choice (Lee, Conroy, McGreevy, & Barraclough, 2004). In other words, the animal’s behaviour became more random when the computer utilized additional information about the animal’s previous choices and their outcomes. Furthermore, a simple reinforcement learning model was proposed to account for the fact that the animal’s choice was systematically influenced by the computer’s choices in previous trials (Lee et al., 2004).
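A minimal sketch of a reinforcement-learning player of this general kind is given below. The prediction-error update and softmax choice are standard; the parameter values and the use of a purely random (algorithm 0) opponent are illustrative assumptions, not the model fitted in Lee et al. (2004).

```python
import math
import random

def simulate_rl_player(alpha=0.3, beta=3.0, n_trials=1000, seed=2):
    """Each target carries a value function; the chosen target's value moves
    toward the obtained reward (a prediction-error update), and the next
    choice is drawn by a softmax over the two values. The opponent here
    plays algorithm 0, i.e. it chooses randomly with equal probability."""
    rng = random.Random(seed)
    values = {'left': 0.5, 'right': 0.5}
    choices = []
    for _ in range(n_trials):
        # softmax: choice probability depends on the value difference
        p_left = 1.0 / (1.0 + math.exp(-beta * (values['left'] - values['right'])))
        choice = 'left' if rng.random() < p_left else 'right'
        computer = rng.choice(['left', 'right'])
        reward = 1.0 if choice == computer else 0.0
        values[choice] += alpha * (reward - values[choice])  # prediction-error update
        choices.append(choice)
    return choices
```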
Here, we show that a biophysically plausible network model of decision making endowed with plastic synapses can not only generate random sequences of choices, but also capture other important features of the animal’s choice behaviour during the matching pennies task. First, monkeys displayed a bias in their choice behaviour when playing against a non-responsive computer opponent that selected its targets randomly, regardless of the animal’s behaviour. To understand the nature of such a bias, we analyze the steady-state behaviour of the reinforcement learning model described in Lee et al. (2004) and that of our model, and derive the conditions under which a biased and non-random choice behaviour can emerge. Second, when the computer opponent was partially exploitive and used only the information about the animal’s previous choices but not their outcomes, the animal’s choice strategy displayed a slow drift over a period of many days. We implement a meta-learning algorithm (Schweighofer & Doya, 2003) in our model and show that it can account for the gradual change in the animal’s strategy. To our knowledge, this study is the first to propose a possible explanation for the slow, gradual behavioural change on a timescale of many days observed experimentally.
Section snippets
Experimental methods
A detailed description of the experimental methods used to collect the behavioural data during a matching pennies task has been published previously (Lee et al., 2004). In the following, the behavioural task used in this study is only briefly described. Three rhesus monkeys (C, E, and F) were trained to perform an oculomotor free-choice task according to the rule of the matching pennies game (Fig. 1).
The animal began each trial by fixating a small yellow square at the centre of the computer
Stability of equilibrium strategy
In algorithm 0, the computer chose the two targets randomly with equal probability, independently of the monkey’s choice, which corresponds to the Nash-equilibrium strategy in the matching pennies game. Thus, in this condition, the animal’s choice was always rewarded with 50% probability for both targets. Nevertheless, each animal displayed a significant bias for choosing one of the two targets (Fig. 2, Fig. 3), indicating that they deviated from the Nash equilibrium. This bias was extreme for monkey
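One reason such a bias can persist is that algorithm 0 exerts no corrective pressure: the expected reward is exactly 0.5 for any choice probability, so biased and unbiased strategies are indistinguishable in terms of reward. A two-line calculation makes this explicit:

```python
def expected_reward(p_left, q_left=0.5):
    """Expected reward in matching pennies when the player picks 'left' with
    probability p_left and the opponent independently picks 'left' with
    probability q_left (algorithm 0 uses q_left = 0.5)."""
    return p_left * q_left + (1 - p_left) * (1 - q_left)

# Against algorithm 0, every bias earns the same 50% reward rate,
# so deviations from the equilibrium strategy go unpunished.
for p in (0.1, 0.5, 0.9):
    print(p, expected_reward(p))
```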
Description of network model
Details about the architecture of our network model can be found in Wang (2002) (see also Brunel and Wang (2001)). Briefly, the decision-making network consists of 2000 integrate-and-fire neurons (1600 excitatory, and 400 inhibitory) which are grouped into three populations of excitatory neurons and a single population of inhibitory neurons (Fig. 5). Two of the excitatory populations (240 neurons each) are selective to the leftward and rightward targets and the third excitatory population (1120
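The competition between the two selective populations can be caricatured by a two-variable firing-rate sketch: each population effectively excites itself and, through the shared inhibitory pool, suppresses the other, so a small difference in input is amplified into a categorical choice. This reduction, and every parameter value in it, is an illustrative assumption; the actual model is the 2000-neuron spiking network described above.

```python
import math

def winner_take_all(i_left, i_right, dt=0.001, tau=0.02, n_steps=2000):
    """Two-variable rate sketch of the decision circuit: self-excitation
    (w_self) and effective cross-inhibition (w_inh) make the population
    with the larger input win while the other is suppressed."""
    w_self, w_inh = 2.0, 2.5
    f = lambda x: 1.0 / (1.0 + math.exp(-4.0 * (x - 1.0)))  # sigmoidal f-I curve
    r_left = r_right = 0.1
    for _ in range(n_steps):
        in_left = w_self * r_left - w_inh * r_right + i_left
        in_right = w_self * r_right - w_inh * r_left + i_right
        r_left += dt / tau * (-r_left + f(in_left))
        r_right += dt / tau * (-r_right + f(in_right))
    return 'left' if r_left > r_right else 'right'
```

Because the cross-inhibition outweighs the self-excitation, the symmetric state is unstable and the network settles into one of two attractors, one per choice.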
Model’s behaviour in the game of matching pennies
As described in the previous section, our model with the belief-dependent learning rule at the synaptic level behaves similarly to the reinforcement learning model, and the synaptic strength of this model can represent the value function for each choice. In this section, the choice behaviour of our model is characterized further, focusing on the behaviour during algorithm 0 and the robustness of the model.
Meta-learning
The behavioural data and the estimates of learning rates described above suggest that, in addition to the trial-to-trial dynamics of the choice behaviour, there is a much slower change in the behaviour that takes place across multiple days during the course of the experiment. This slow change was most noticeable when the computer opponent switched to algorithm 2 (see Fig. 3, Fig. 12). During this experiment, animals were not explicitly cued for the transitions in the algorithms used by the
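A meta-learning rule in the spirit of Schweighofer and Doya (2003) can be sketched as follows: a parameter such as the learning rate is itself adapted, but on a much slower timescale than the trial-to-trial value updates. The driving signal used here (a comparison of short- and long-term reward averages) and all constants are illustrative assumptions, not the exact rule implemented in our model.

```python
def meta_learn_rate(rewards, alpha0=0.2, mu=0.01, tau_short=10.0, tau_long=1000.0):
    """Slowly adapt a learning rate alpha from a sequence of trial rewards:
    a short-term reward average tracks recent performance, a long-term
    average serves as a baseline, and alpha drifts (bounded in [0, 1])
    in proportion to their difference."""
    alpha = alpha0
    r_short = r_long = 0.5
    trace = []
    for r in rewards:
        r_short += (r - r_short) / tau_short
        r_long += (r - r_long) / tau_long
        alpha = min(1.0, max(0.0, alpha + mu * (r_short - r_long)))
        trace.append(alpha)
    return trace
```

Because tau_long far exceeds tau_short, alpha changes only over hundreds of trials, producing the kind of slow, across-session drift that single-trial value updates alone cannot.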
Discussion
One of the most important and influential models of decision making is reinforcement learning (Sutton & Barto, 1998). In this framework, desirability of each action is represented by a value function that estimates the expected amount of reward resulting from a particular action. Consequently, actions with high value functions are chosen more frequently. The outcome of each action is then compared to the previously expected outcome, and the resulting error is used to update value functions
Acknowledgments
We are grateful to Dominic Barraclough, Michelle Conroy and Ben McGreevy for their help with the experiment. This study was supported by grant MH073246 from the National Institutes of Health.
References (45)

- Bar-Hillel & Wagenaar (1991). The perception of randomness. Advances in Applied Mathematics.
- Dorris & Glimcher (2004). Activity in posterior parietal cortex is correlated with the relative subjective desirability of action. Neuron.
- Metalearning and neuromodulation. Neural Networks (2002).
- Cascade models of synaptically stored memories. Neuron (2005).
- A stochastic reinforcement learning algorithm for learning real-valued functions. Neural Networks (1990).
- Dopamine: a potential substrate for synaptic plasticity and memory mechanisms. Progress in Neurobiology (2003).
- Lee (2006). Neural basis of quasi-rational decision making. Current Opinion in Neurobiology.
- Lee, Conroy, McGreevy, & Barraclough (2004). Reinforcement learning and decision making in monkeys during a competitive game. Cognitive Brain Research.
- Learning and decision making in monkeys during a rock-paper-scissors game. Cognitive Brain Research (2005).
- Learning behaviour in an experimental matching pennies game. Games and Economic Behaviour (1994).
- The basal ganglia: a vertebrate solution to the selection problem? Neuroscience.
- Dopamine-dependent plasticity of corticostriatal synapses. Neural Networks.
- Schweighofer & Doya (2003). Meta-learning in reinforcement learning. Neural Networks.
- Wang (2002). Probabilistic decision making by slow reverberation in cortical circuits. Neuron.
- Dynamic learning in neural networks with material synapses. Neural Computation.
- Barraclough, Conroy, & Lee (2004). Prefrontal cortex and decision making in a mixed-strategy game. Nature Neuroscience.
- A computational model of how the basal ganglia produce sequences. Journal of Cognitive Neuroscience.
- Brunel & Wang (2001). Effects of neuromodulation in a cortical network model of object working memory dominated by recurrent inhibition. Journal of Computational Neuroscience.
- Model selection and multimodel inference: A practical information-theoretic approach.
- Camerer (2003). Behavioural game theory: Experiments in strategic interaction.
- Hebbian spike-driven synaptic plasticity for learning patterns of mean firing rates. Biological Cybernetics.
- Herrnstein, Rachlin, & Laibson (1997). The matching law: Papers in psychology and economics.
- 1. Current address: Department of Neurobiology and Kavli Institute for Neuroscience, Yale University School of Medicine, New Haven, CT 06520, USA.
- 2. Tel.: +1 203 785 6302; fax: +1 203 785 5263.
- 3. Tel.: +1 203 785 3527; fax: +1 203 785 5263.