A Bayesian analysis of human decision-making on bandit problems

https://doi.org/10.1016/j.jmp.2008.11.002

Abstract

The bandit problem is a dynamic decision-making task that is simply described, well-suited to controlled laboratory study, and representative of a broad class of real-world problems. In bandit problems, people must choose between a set of alternatives, each with different unknown reward rates, to maximize the total reward they receive over a fixed number of trials. A key feature of the task is that it challenges people to balance the exploration of unfamiliar choices with the exploitation of familiar ones. We use a Bayesian model of optimal decision-making on the task, in which how people balance exploration with exploitation depends on their assumptions about the distribution of reward rates. We also use Bayesian model selection measures that assess how well people adhere to an optimal decision process, compared to simpler heuristic decision strategies. Using these models, we make inferences about the decision-making of 451 participants who completed a set of bandit problems, and relate various measures of their performance to other psychological variables, including psychometric assessments of cognitive abilities and personality traits. We find clear evidence of individual differences in the way the participants made decisions on the bandit problems, and some interesting correlations with measures of general intelligence.

Introduction

Imagine you are staying in an unfamiliar city for a few weeks, and are consigned to eat alone each evening. There are a number of Chinese restaurants, all with essentially the same menu, and all within a short walk of your hotel. Each menu is cheap enough (or your expense account is large enough) that whether the meal is ‘good’ or ‘bad’ acts as the sole criterion for choosing one restaurant over another.

Over the course of your stay, a natural dining goal would be to maximize the number of good meals. In the first few days, pursuing this goal might involve trying a number of the restaurants. If the meal on the first night was bad, it seems unlikely you would re-visit that restaurant on the second night.

Towards the end of your stay, however, it becomes increasingly likely you will visit the same reasonably good restaurant repeatedly, even if it does not produce good meals every night. There is less incentive to explore options about which you are less certain, and more incentive to exploit options which are reasonably good, and about which you are more certain. Because of the limited nature of your stay, how you search the environment of restaurants is likely to move from an initial exploration phase to a mature exploitation phase.

Another source of information that will affect your decision-making is any potential knowledge you have about the quality of Chinese restaurants in the city. If the city has a reputation for having many excellent Chinese restaurants, then a smaller number of bad meals will encourage the trying of alternatives. On the other hand, if you believe the city does not have many good Chinese restaurants, a relatively modest success rate at one place might encourage repeat dining.

It seems clear that your decision-making in trying to maximize the number of good meals you have is a non-trivial optimization problem. Choices must be sensitive to previous dining outcomes, how many days of your stay remain, what you know about the relevant base-rates for good and bad meals, and the interaction of all these factors.

This real-world decision-making scenario has the same essential characteristics as a formal mathematical optimization problem known as the ‘bandit’ problem,1 which, following its original description by Robbins (1952), has been studied extensively in the machine learning and statistics literatures (e.g., Berry (1972), Berry and Fristedt (1985), Brezzi and Lai (2002), Gittins, 1979, Gittins, 1989, Kaelbling, Littman, and Moore (1996), Macready and Wolpert (1998) and Sutton and Barto (1998)). In a general N-armed bandit problem, there is a set of N bandits, each having some fixed but unknown rate of reward. On each trial, a decision-maker must choose a bandit, after which they receive feedback as to whether or not one unit of probabilistically determined reward was attained. The decision-maker’s task is, over a set of K trials, to use this feedback to make a sequence of bandit choices that maximizes their reward.
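The general N-armed bandit task just described can be sketched in a few lines of Python. This is an illustrative simulation under our own naming conventions, not the authors' code: `play_bandit_game` and its `policy` argument (a rule mapping the observed success and failure counts to a chosen arm) are our assumptions.

```python
import random

def play_bandit_game(reward_rates, policy, num_trials):
    """Simulate one N-armed Bernoulli bandit game.

    reward_rates: fixed but unknown (to the player) reward probabilities.
    policy: function (successes, failures) -> index of chosen arm.
    Returns the total reward earned over num_trials.
    """
    n = len(reward_rates)
    successes = [0] * n  # rewards observed on each arm so far
    failures = [0] * n   # non-rewards observed on each arm so far
    total_reward = 0
    for _ in range(num_trials):
        arm = policy(successes, failures)
        # Feedback: one unit of reward with the chosen arm's fixed probability.
        if random.random() < reward_rates[arm]:
            successes[arm] += 1
            total_reward += 1
        else:
            failures[arm] += 1
    return total_reward
```

Any decision strategy, from the simple heuristics considered later to the optimal Bayesian process, can be slotted in as the `policy` function, which makes the outcome-optimal versus process-optimal distinction easy to explore in simulation.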

We believe that bandit problems provide an interesting and useful task for the study of human capabilities in decision-making and problem solving. They provide a challenging task, similar to many real-world problems, that is nevertheless simple to understand. They require people to search their environment in intelligent ways to make decisions, exploring uncertain alternatives and exploiting familiar ones. The ability to search effectively, striking the right balance between exploration and exploitation, is a basic requirement for successful decision-making (Gigerenzer & Todd, 1999). Bandit problems also have a known optimal solution process to which human decision-making can be compared. Being able to make the distinction between ‘outcome optimal’ and ‘process optimal’ decision-making is useful, because adherence to the process can always be achieved, but attainment of the reward is inherently stochastic. This property can make process optimality a less noisy measure of decision-making performance than outcome optimality (Lee, 2006).

Early studies of human performance on bandit problems used models and experimental manipulations motivated by theories of operant conditioning (e.g., Brand, Wood, and Sakoda (1956) and Brand, Sakoda, and Woods (1957)). Later studies were informed by economic theories, leading to a focus on deviations from rationality in human decision-making (e.g., Anderson (2001), Banks, Olson, and Porter (1997), Horowitz (1973) and Meyer and Shi (1995)). Most of these economic studies focused on two-choice bandit problems, but considered variants on the basic bandit problem we have defined. In particular, they sometimes fixed the reward rate of one of the two choices, or, motivated by the economic concept of an ‘infinite horizon’ (i.e., the potential of an arbitrarily long sequence of decision-making between the alternatives), considered bandit problems without a fixed number of trials, but instead introduced a small probability of terminating the problem after any given trial. Human performance on the bandit problem has also been a recent focus of interest in cognitive neuroscience (e.g., Cohen, McClure, and Yu (2007) and Daw, O’Doherty, Dayan, Seymour, and Dolan (2006)).

One issue that deserves more systematic study involves individual differences in solving bandit problems. The ability to solve problems involving the maximization (or minimization) of some criterion under multiple interacting constraints is generally regarded as a basic expression of intelligence. Accordingly, how well people solve optimization problems like bandit problems may provide an interesting window onto differences in basic human cognitive abilities. A good recent example of this line of inquiry is provided by Burns, Lee, and Vickers (2006), who explored the relationship between human performance on optimal stopping problems known as secretary problems (Gilbert & Mosteller, 1966) and standard psychometric measures of cognitive abilities. These authors demonstrated that performance on the secretary problem loaded on fluid intelligence (Gf), with performance on the problem also being shown to load approximately 0.4 on a general ability factor (g). This g-loading was comparable to that of the Digit Symbol task from the Wechsler Adult Intelligence Scale. In a similar spirit, it seems worth examining whether people’s abilities to follow optimal decision processes in bandit problems differ, and whether any differences are related to psychometric measures of cognitive abilities.

Performance on bandit problems also seems to have natural links to personality traits that control risk behavior. Too much exploration in solving a bandit problem could be regarded as a form of risk seeking behavior, while too much exploitation could be regarded as risk averse behavior. Recently, the risk propensity of clinical and normal populations has been a research focus in neuropsychology. In particular, clinical populations with damage to the ventromedial prefrontal cortex, cocaine abusers, and patients with Asperger’s syndrome often make decisions that are risk seeking relative to a normative analysis (Bechara et al., 1994; Yechiam et al., 2005). In a laboratory setting, risk-taking behavior is often quantified using tasks such as the Iowa Gambling Task (e.g., Bechara et al. (1994), Busemeyer and Stout (2002) and Wood, Busemeyer, Koling, Cox, and Davis (2005)), which can be conceived as a special type of bandit problem, and the Balloon Analog Risk Task (e.g., Wallsten, Pleskac, and Lejuez (2005)). The classic bandit problems we consider provide another task for continuing and expanding the study of individual differences in risk-related behavior.

In this paper, we study individual differences in how people balance exploration and exploitation in solving bandit problems, using a natural Bayesian extension of the optimal decision model. The basic idea is to extend the optimal decision process to operate in a range of environments, characterized by different distributions over the reward rates. The insight is that people who explore (i.e., choose alternatives about which less is known) can be regarded as making optimistic assumptions about the underlying reward rates, while people who exploit (i.e., choose alternatives about which more is known) can be regarded as making more pessimistic assumptions. Within this framework, differences in human decision-making can be explained in terms of differences in the assumptions people make about the statistical structure of their environment.

We begin by giving a formal characterization of the optimal Bayesian decision process for bandit problems, parameterized by assumptions about the distribution of underlying reward rates. We develop a method for inferring these parameters from the decisions people make solving bandit problems. These methods are then applied to data from 451 participants, for whom extensive additional cognitive ability and personality trait measures are also available. From a basic analysis, we observe that there are individual differences that make it useful to be able to measure the extent to which people adhere to the optimal decision process, rather than simpler heuristic strategies. Accordingly, we develop a method based on Bayesian model selection to compare the optimal decision model to three alternative heuristic models with varying degrees of psychological sophistication. When we fit the models and their parameters to human data, we find clear evidence of individual differences, both in terms of which model is used and which parameter values are inferred. Finally, we use our models, together with simpler experimental measures of performance, to examine how the decision-making of all 451 participants on the bandit problems relates to measures of their cognitive abilities and personality traits.

Environmental assumptions

We assume that the way people might think about bandit problem environments can be modeled by allowing the reward rates to come from any Beta distribution. In its standard form, the Beta distribution has two parameters: a count α of ‘prior successes’ and a count β of ‘prior failures’. A natural and useful re-parameterization is to consider α/(α+β), which is a measure of the expected rate of reward, and α+β, which is a measure of the strength of the prior evidence. Psychologically, α/(α+β)
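The re-parameterization can be written out directly. The helper functions below are an illustrative sketch, not part of the paper; the names `beta_reparameterize` and `beta_standard` are ours.

```python
def beta_reparameterize(alpha, beta):
    """Map the standard Beta(alpha, beta) parameters -- a count of
    'prior successes' and 'prior failures' -- to the psychologically
    interpretable (expected reward rate, prior strength) pair."""
    mean = alpha / (alpha + beta)  # expected rate of reward
    strength = alpha + beta       # strength of the prior evidence
    return mean, strength

def beta_standard(mean, strength):
    """Inverse mapping: recover (alpha, beta) from the expected
    reward rate and the strength of the prior evidence."""
    return mean * strength, (1 - mean) * strength
```

For example, Beta(2, 2) corresponds to an expected reward rate of 0.5 backed by the equivalent of four prior observations; doubling both counts to Beta(4, 4) keeps the same expected rate but strengthens the prior evidence.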

Subjects

A total of 451 participants completed a series of bandit problems, as well as a battery of other psychological tests, as part of ‘testweek’ at the University of Amsterdam.

Method

Each participant completed 20 bandit problems in sequence. All of the problems had 4 alternatives and 15 trials, and drew the reward rates for each alternative independently from a Beta(2,2) distribution (i.e., we set α=β=2). The drawing of rates for the games was done only once, so each participant played games with the same θ
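Under this design, generating the fixed environment of games might look like the following sketch (illustrative Python only; `generate_games` and the explicit seed are our assumptions, not the authors' procedure — the key point is that the rates are drawn once and reused for every participant).

```python
import random

def generate_games(num_games=20, num_arms=4, alpha=2.0, beta=2.0, seed=0):
    """Draw reward rates for a fixed set of bandit games from a
    Beta(alpha, beta) distribution. Drawing once, with a fixed seed,
    means every participant plays games with the same rates."""
    rng = random.Random(seed)
    return [[rng.betavariate(alpha, beta) for _ in range(num_arms)]
            for _ in range(num_games)]
```

With the defaults matching the experiment (20 games, 4 alternatives, Beta(2,2)), repeated calls with the same seed reproduce the identical environment.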

Evidence via model comparison

One way to measure the evidence the decisions made by participants provide for following the optimal decision process is to propose alternative accounts of how they completed the task, and use model selection methods to infer which of these is the most likely from their behavioral data. To this end, we considered three heuristic models of how people might solve bandit problems, and contrasted them with the optimal Bayesian model. The goal in developing these models was to build a sequence

Modeling results

In this section, we apply the models to the human data, in order to make inference about which model and parameterization best describes their decision-making. We then use information from these inferences to explore the relationship between individual decision-making and various psychometric and personality variables.

Discussion

The bandit problem provides an interesting decision-making task that is amenable to formal analysis, but realistic enough to engage many important issues in understanding how people search for information and make choices. In this paper, we have focused on the tradeoff between exploration and exploitation in decision-making that lies at the heart of the problem. In particular, we were interested in whether there are individual differences in how people make the tradeoff, and how any differences

Acknowledgments

This work was funded by award FA9550-07-1-0082 from the Air Force Office of Scientific Research. We are very grateful to two anonymous reviewers, and Jerry Busemeyer, for their extremely helpful comments on an earlier version of this paper.

References (37)

  • Anderson, C.M. (2001). Behavioral models of strategies in multi-armed bandit problems. Doctoral dissertation. California...
  • Banks, J., et al. (1997). An experimental analysis of the bandit problem. Economic Theory.
  • Bechara, A., et al. (1994). Insensitivity to future consequences following damage to human prefrontal cortex. Cognition.
  • Berry, D.A. (1972). A Bernoulli two-armed bandit. The Annals of Mathematical Statistics.
  • Berry, D.A., et al. (1985). Bandit problems: Sequential allocation of experiments.
  • Brand, H., et al. (1956). Anticipation of reward as a function of partial reinforcement. Journal of Experimental Psychology.
  • Brand, H., et al. (1957). Effects of a random versus pattern reinforcement instructional set in a contingent partial reinforcement situation. Psychological Reports.
  • Brezzi, M., et al. (2002). Optimal learning and experimentation in bandit problems. Journal of Economic Dynamics & Control.
  • Burns, N.R., et al. (2006). Individual differences in problem solving and intelligence. Journal of Problem Solving.
  • Busemeyer, J.R., et al. (2002). A contribution of cognitive decision models to clinical assessment: Decomposing performance on the Bechara gambling task. Psychological Assessment.
  • Cohen, J.D., et al. (2007). Should I stay or should I go? Exploration versus exploitation. Philosophical Transactions of the Royal Society B: Biological Sciences.
  • Daw, N.D., et al. (2006). Cortical substrates for exploratory decisions in humans. Nature.
  • Gelman, A., et al. (2004). Bayesian data analysis.
  • Gigerenzer, G., et al. (1999). Simple heuristics that make us smart.
  • Gilbert, J.P., et al. (1966). Recognizing the maximum of a sequence. American Statistical Association Journal.
  • Gittins, J.C. (1979). Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society, Series B.
  • Gittins, J.C. (1989). Multi-armed bandit allocation indices.
  • Griffiths, T.L., et al. Bayesian models of cognition.