Approximate Bayesian computation with differential evolution

https://doi.org/10.1016/j.jmp.2012.06.004

Abstract

Approximate Bayesian computation (ABC) is a simulation-based method for estimating the posterior distribution of the parameters of a model. The ABC approach is instrumental when a likelihood function for a model cannot be mathematically specified, or has a complicated form. Although difficulty in calculating a model’s likelihood is extremely common, current ABC methods suffer from two problems that have largely prevented their mainstream adoption: long computation time and an inability to scale beyond a few parameters. We introduce differential evolution as a computationally efficient genetic algorithm for proposal generation in our ABC sampler. We show how using this method allows our new ABC algorithm, called ABCDE, to obtain accurate posterior estimates in fewer iterations than kernel-based ABC algorithms and to scale to high-dimensional parameter spaces that have proven difficult for current ABC methods.

Highlights

► Traditional ABC methods are slow and scale poorly.
► We merge ABC with differential evolution (DE) to provide efficient proposals.
► ABCDE is a fast and scalable method of likelihood-free Bayesian estimation.

Introduction

Bayesian statistics has become an enormously popular tool for the analysis of random variables. While there are a myriad of practical and theoretical reasons to adopt the Bayesian framework, one clear advantage is the quantification of uncertainty in the form of the posterior distribution. In contrast to classic null hypothesis testing, directed analysis of the posterior distribution allows for statistical inference that does not compromise the process theory behind the experiments generating the data. In the last decade, many have turned to the problem of Bayesian estimation when the likelihood function is difficult or impossible to evaluate. One popular method for solving this problem is called approximate Bayesian computation (ABC). While ABC originated in genetics (Beaumont et al., 2002, Marjoram et al., 2003, Pritchard et al., 1999, Tavaré et al., 1997), it has received a considerable amount of attention in ecology, epidemiology, and systems biology (Beaumont, 2010), as well as, more generally, with regard to model choice (Robert, Cornuet, Marin, & Pillai, 2011). Recently, ABC has also become an important tool in the analysis of models of human behavior, whose likelihoods are often rather complex (Turner & Van Zandt, 2012).

The basic ABC algorithm proceeds as follows. Given an observed data set Y, some unknown parameters θ with prior distribution π(θ), and an assumed model such that Y ~ Model(y|θ), ABC algorithms propose candidate parameter values by randomly sampling θ and generating a data set X ~ Model(x|θ), where the method used to draw θ varies from one algorithm to the next. Because this random data set is always generated and paired with the random parameter value θ, we often use the notation (θ, X) to refer to a single point in the joint distribution of θ and X. The idea is that if the distance ρ(X,Y) between the two data sets is small enough, specifically, less than a value ϵ, we can assume that the proposed candidate parameter value θ has some nonzero probability of being in the posterior distribution π(θ|Y). In so doing, ABC algorithms attempt to approximate the posterior distribution by sampling from

π(θ|Y) ∝ ∫_𝒳 π(θ) Model(x|θ) I(ρ(X,Y) ≤ ϵ) dx, (1)

where 𝒳 is the support of the simulated data and I(a) is an indicator function returning a one if the condition a is satisfied and a zero otherwise.
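The rejection scheme above can be sketched in a few lines. The toy model (a normal with unknown mean), the broad normal prior, and the summary-statistic distance are all illustrative assumptions, not the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data: assume Y came from a Normal(mu=3, sd=1) model;
# the unknown parameter theta is the mean mu.
Y = rng.normal(3.0, 1.0, size=100)

def prior():                    # pi(theta): broad normal prior (assumed)
    return rng.normal(0.0, 10.0)

def model(theta, n):            # Model(x | theta): simulate a data set X
    return rng.normal(theta, 1.0, size=n)

def rho(X, Y):                  # distance between simulated and observed data,
    return abs(X.mean() - Y.mean())   # here via a mean summary statistic

def abc_rejection(epsilon, n_samples):
    posterior = []
    while len(posterior) < n_samples:
        theta = prior()               # 1. propose theta from the prior
        X = model(theta, len(Y))      # 2. simulate X ~ Model(x | theta)
        if rho(X, Y) <= epsilon:      # 3. keep theta only if rho(X, Y) <= eps
            posterior.append(theta)
    return np.array(posterior)

samples = abc_rejection(epsilon=0.1, n_samples=500)
```

Even in this one-parameter toy problem, shrinking ϵ sharply lowers the acceptance rate, which previews the computational cost discussed below.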

The ABC rejection framework (see Turner & Van Zandt, 2012 for a tutorial) has been embedded in Markov chain Monte Carlo (Marjoram et al., 2003), sequential Monte Carlo (Beaumont et al., 2009, Del Moral et al., 2008, Sisson et al., 2007, Toni et al., 2009), expectation propagation (Barthelmé & Chopin, 2011), and Gibbs sampling (Turner & Van Zandt, submitted for publication). While these algorithms have become increasingly complex and efficient, they all still rely on satisfying the condition ρ(X,Y) ≤ ϵ. Hence, the accuracy of the estimated posterior distribution and the computational efficiency hinge on the selection of the tolerance threshold ϵ: large values of ϵ produce inaccurate posteriors while small values of ϵ produce long computation times. Long computation times are a result of the decrease in the likelihood of proposing acceptable parameter values θ that produce data X such that the condition ρ(X,Y) ≤ ϵ is satisfied.

A number of methods have been proposed to balance the trade-off between long computation times and accurate posterior estimates. One common approach is to select a set of ϵ values that decrease monotonically from a larger tolerance value to some small one. This allows the algorithm to converge slowly to a computationally feasible final value of ϵ. Other methods allow for larger values of ϵ, and instead rely on a post-simulation correction to obtain accurate estimates of the posterior (Beaumont et al., 2002, Blum and François, 2010). These algorithms proceed as described above but then use regression techniques to reweight the samples based on their distance from the observed data Y. Although these techniques are very useful in practice, the post hoc nature of the correction suggests that the samplers used prior to the correction could be substantially improved. That is, algorithms that use the reweighting scheme internally would be more efficient than the regression correction schemes.
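A minimal sketch of the decreasing-tolerance idea follows. For brevity it omits the importance weights a full sequential sampler (e.g., Beaumont et al., 2009) would carry, and the schedule, perturbation kernel, and toy normal-mean model are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(3.0, 1.0, size=100)        # observed data (toy model, assumed)

def model(theta, n):
    return rng.normal(theta, 1.0, size=n)

def rho(X, Y):
    return abs(X.mean() - Y.mean())

# Monotonically decreasing tolerance schedule: start loose, end tight.
epsilons = [2.0, 1.0, 0.5, 0.25, 0.1]

particles = rng.normal(0.0, 10.0, size=200)   # initial draws from the prior
for eps in epsilons:
    survivors = []
    while len(survivors) < len(particles):
        # Perturb a randomly chosen particle from the previous population.
        theta = rng.choice(particles) + rng.normal(0.0, 0.5)
        if rho(model(theta, len(Y)), Y) <= eps:
            survivors.append(theta)
    particles = np.array(survivors)
```

Early, loose tolerances keep the acceptance rate workable while the population drifts toward the high-density region; only the final, tight tolerance determines the accuracy of the approximation.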

Although the basic ABC framework has been instrumental in furthering the widespread usage of Bayesian analysis, it currently suffers from the so-called “curse of dimensionality” (Beaumont, 2010). That is, when the number of parameters is increased, the acceptance rate declines dramatically because successful sets of parameter values (i.e., sets of parameter values that satisfy the tolerance threshold condition) become very hard to find. To attenuate this problem, a few new algorithms replace the indicator function in Eq. (1) with a kernel function ψ(ρ(X,Y)|δ) (Wilkinson, submitted for publication). The kernel function provides a continuous measure of the quality of a proposed parameter set θ rather than a dichotomous accept or reject (a one or a zero, respectively). While this dramatically improves the computational efficiency, the accuracy of the estimated posterior now relies on the selection of δ. In addition, when sampling from a high-dimensional parameter space, the selection of tuning parameters such as the transition kernel becomes vitally important.
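The kernel idea can be illustrated with a Gaussian ψ in a simple importance-sampling scheme; the choice of kernel, bandwidth δ, and toy normal-mean model are assumptions made for the sketch, not the specific setup of Wilkinson's algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)
Y = rng.normal(3.0, 1.0, size=100)        # observed data (toy model, assumed)

def model(theta, n):
    return rng.normal(theta, 1.0, size=n)

def rho(X, Y):
    return abs(X.mean() - Y.mean())

def psi(d, delta):
    # Gaussian kernel: a continuous weight in (0, 1] replaces the
    # all-or-nothing indicator I(rho(X, Y) <= eps).
    return np.exp(-0.5 * (d / delta) ** 2)

delta = 0.2
thetas = rng.normal(0.0, 10.0, size=5000)     # proposals from the prior
weights = np.array([psi(rho(model(t, len(Y)), Y), delta) for t in thetas])
weights /= weights.sum()
post_mean = np.sum(weights * thetas)          # importance-weighted estimate
```

Every proposal now contributes information in proportion to its fit, so no simulation is wasted outright; the cost is that the estimate is exact only as δ → 0.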

In this article we present a new ABC algorithm, which we call ABCDE, that uses differential evolution (DE) as a means of proposal generation (Hu and Tsui, 2005, ter Braak, 2006, Vrugt et al., 2009). DE is an extremely efficient population-based genetic algorithm for minimizing real-valued cost functions (Storn & Price, 1997). Instead of relying solely on random mutations of each member of the population to drive evolution, DE creates a self-organizing population by evolving each member based on the weighted difference between two other members of the population, similar in spirit to particle swarm optimization (Havangi et al., 2010, Kennedy and Eberhart, 1995, Tong et al., 2006). In practice, DE requires far fewer population members than other genetic algorithms, typically just enough to span the possible parameter space, because the entire population evolves towards the minimum cost function values (or in our case the regions of the parameter space with the highest densities). Here we merge ABC and DE, including enhancements to the basic DE algorithm to make it suitable for posterior estimation instead of simply locating the global minimum (Hu & Tsui, 2005).
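The core DE move can be sketched as follows: a proposal for particle i is formed from the weighted difference of two other randomly chosen particles, plus a small jitter. The values of γ and the jitter bound b here are illustrative, not the paper's recommended settings:

```python
import numpy as np

rng = np.random.default_rng(3)

# A population of P particles in a d-dimensional parameter space.
P, d = 24, 2
population = rng.normal(0.0, 1.0, size=(P, d))

def de_proposal(population, i, gamma=0.8, b=1e-4):
    """DE proposal for particle i: theta_i + gamma * (theta_j - theta_k) + noise,
    with j and k drawn at random from the population, excluding i."""
    P = len(population)
    j, k = rng.choice([m for m in range(P) if m != i], size=2, replace=False)
    noise = rng.uniform(-b, b, size=population.shape[1])  # small jitter
    return population[i] + gamma * (population[j] - population[k]) + noise

proposal = de_proposal(population, i=0)
```

Because the difference vectors are drawn from the population itself, the proposal distribution automatically matches the scale and correlation of the current population, which is what makes DE proposals attractive for correlated, high-dimensional posteriors.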

We first describe the algorithm (presented in Fig. 1) and motivate the use of each of its components. We then use the algorithm to fit three models whose posterior distributions are known to assess the accuracy of the estimates obtained using ABCDE. In the first example, we show that a reduced version of the ABCDE algorithm can easily handle the classic mixture of normals example that is prevalent in the ABC literature (Beaumont et al., 2009, Sisson et al., 2007, Toni et al., 2009, Wilkinson, submitted for publication). We then assess the accuracy of the ABCDE algorithm by comparing it to a basic kernel-based ABC algorithm (Wilkinson, submitted for publication). To do so, we use both algorithms to fit a model whose joint posterior distribution is highly correlated. In our final example, we test the scalability of ABCDE by fitting it to a model with 20 parameters, a problem that is extremely difficult for existing ABC algorithms to solve. We close with a brief discussion of the potential advancements and provide guidelines for the selection of the tuning parameters.

Section snippets

The algorithm

The ABCDE algorithm combines DE with a distributed genetic algorithm approach that relies on three steps: crossover, mutation, and migration (Hu & Tsui, 2005). After a high-level overview of the algorithm, we will discuss each of these steps in turn. Fig. 1 illustrates the structure of the ABCDE algorithm. We begin by selecting a pool of P particles and dividing this pool evenly into K groups, each of size G=P/K. Although we will refer to the group of individual members of our sampler as
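The pool-and-groups bookkeeping, together with a generic ring-style migration step, can be sketched as below. The specific migration rule shown (each group passes one randomly chosen particle to the next group) is a common choice in distributed genetic algorithms and is an assumption here; the parameter dimensionality is likewise illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

P, K = 24, 3                  # pool of P particles split evenly into K groups
G = P // K                    # group size G = P / K
# K groups of G particles, each particle a 2-D parameter vector (assumed).
groups = [rng.normal(0.0, 1.0, size=(G, 2)) for _ in range(K)]

def migrate(groups):
    """Migration step (sketch): each group sends one randomly chosen
    particle to the next group in a ring, preserving group sizes."""
    idx = [rng.integers(len(g)) for g in groups]         # one emigrant per group
    emigrants = [g[i].copy() for g, i in zip(groups, idx)]
    for k in range(len(groups)):
        groups[k][idx[k]] = emigrants[k - 1]             # receive from previous group
    return groups

groups = migrate(groups)
```

Occasional migration lets information spread between otherwise independent groups, which helps the sampler cover multi-modal posteriors without collapsing the whole pool onto a single mode.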

Simulations

We now present three examples meant to test the validity, speed, and scalability of the ABCDE algorithm. First, we use the ABCDE algorithm to fit the classic mixture of normals example. In this first example, we employ a simplified version of the ABCDE algorithm that uses only the crossover step. In the second example, we compare the speed of the ABCDE algorithm to a recently developed KABC algorithm (Wilkinson, submitted for publication). In this example, the joint posterior distribution is

General discussion

Typically, Monte Carlo methods such as Markov chain Monte Carlo or sequential Monte Carlo rely on transition kernels to move proposals around the region of interest. These transition kernels generally are Gaussian in form and are centered at the previous state of the chain. Thus, to fully specify the kernel, one only needs to select the standard deviation of the Gaussian as a tuning parameter. However, it can be very difficult to specify the optimal transition kernel, especially in the case of

Conclusions

In this article, we have presented a new algorithm, ABCDE, that attempts to solve the problem of scalability associated with ABC methods. The algorithm relies on DE to drive the sampler to high-density regions of the posterior distribution and automatically optimize the proposal efficiency. This type of proposal generation is very efficient, capturing both multi-modal and high-dimensional posteriors both quickly and accurately. Although we have not provided proofs of convergence for the ABCDE

References

  • D.J. Navarro et al. Fast and accurate calculations for first-passage times in Wiener diffusion models. Journal of Mathematical Psychology (2009)
  • B.M. Turner et al. A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology (2012)
  • Barthelmé, S., & Chopin, N. (2011). ABC-EP: expectation propagation for likelihood-free Bayesian computation. In...
  • M.A. Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics (2010)
  • M.A. Beaumont et al. Adaptive approximate Bayesian computation. Biometrika (2009)
  • M.A. Beaumont et al. Approximate Bayesian computation in population genetics. Genetics (2002)
  • M.G.B. Blum et al. Non-linear regression models for approximate Bayesian computation. Statistics and Computing (2010)
  • R.S. Chhikara et al. The inverse Gaussian distribution: theory, methodology and applications (1989)
  • P. Craigmile et al. Hierarchical Bayes models for response time data. Psychometrika (2010)
  • Del Moral, P., Doucet, A., & Jasra, A. (2008). An adaptive sequential Monte Carlo method for approximate Bayesian...
  • W. Feller. An introduction to probability theory and its applications, Vol. 1 (1968)
  • A. Gelman et al. Bayesian data analysis (2004)
  • R. Havangi et al. A multi swarm particle filter for mobile robot localization. International Journal of Computer Science (2010)
  • A. Heathcote. Fitting Wald and ex-Wald distributions to response time data: an example using functions for the S-PLUS package. Behavior Research Methods, Instruments, & Computers (2004)
  • Hu, B., & Tsui, K.-W. (2005). Distributed evolutionary Monte Carlo with applications to Bayesian analysis. Technical...
  • Kennedy, J., & Eberhart, R. (1995). Particle swarm optimization. In Proceedings of IEEE international conference on...
  • M.D. Lee et al. A Bayesian approach to diffusion models of decision-making and response time
  • Lee, M.D., & Wagenmakers, E.-J. (2012). A course in Bayesian graphical modeling for cognitive science, last downloaded...
  • R.D. Luce. Response times: their role in inferring elementary mental organization (1986)
  • P. Marjoram et al. Markov chain Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States (2003)
  • D. Matzke et al. Psychological interpretation of the ex-Gaussian and shifted Wald parameters: a diffusion model analysis. Psychonomic Bulletin and Review (2009)
  • Z. Oravecz et al. A hierarchical Ornstein–Uhlenbeck model for continuous repeated measurement data. Psychometrika (2009)

    Portions of this work were presented at the 8th Annual Context and Episodic Memory Symposium, Bloomington, Indiana, and the 45th Annual Meeting of the Society for Mathematical Psychology, Columbus, Ohio. The authors would like to thank Trisha Van Zandt, Simon Dennis, Rich Shiffrin, Jay Myung, and Dan Navarro for helpful comments that improved an earlier version of this article.
