Elsevier

Journal of Mathematical Psychology

Volumes 68–69, October–December 2015, Pages 13-24

A practical guide to the Probability Density Approximation (PDA) with improved implementation and error characterization

https://doi.org/10.1016/j.jmp.2015.08.006

Highlights

  • PDA couples non-parametric methods with MCMC to estimate a model’s posterior.

  • Kernel density estimation is used to construct an approximate likelihood.

  • Signal processing methods are used to accelerate this process.

  • A “resampled MCMC” that improves chain mixing for this method is presented.

  • Approximation errors are characterized theoretically and through example.

Abstract

A critical task in modeling is to determine how well the theoretical assumptions encoded in a model account for observations. Bayesian methods are an ideal framework for doing just this. Existing approximate Bayesian computation (ABC) methods, however, rely on "summary statistics" that are often insufficient. Here, I present and analyze a highly efficient extension of the recently proposed (Turner & Sederberg, 2014) Probability Density Approximation (PDA) method, which circumvents this insufficiency. This method combines Markov Chain Monte Carlo simulation with tools from non-parametric statistics to improve upon existing ABC methods. The primary contributions of this article are: (1) A more efficient implementation of this method that substantially improves computational performance is described. (2) Theoretical results describing the influence of methodological approximation errors on posterior estimation are discussed. In particular, while this method is highly accurate, even small errors have a strong influence on model comparisons when using standard statistical approaches (such as the deviance information criterion). (3) An augmentation of the standard PDA procedure, termed "resampled PDA", that reduces the negative influence of approximation errors on performance and accuracy, is presented. (4) A number of examples of varying complexity are presented, along with supplementary code for their implementation.

Introduction

Parameter estimation is a crucial step in determining how well a model, and by design the set of assumptions it encodes, accounts for observations. The canonical Bayesian model estimation problem is to determine the posterior probability distribution of a set of model parameters conditioned on observed data, π(θ|X). Given a model likelihood L(θ|X) and a prior distribution π(θ), this is accomplished via Bayes' theorem, π(θ|X) = L(θ|X)π(θ) / ∫L(θ|X)π(θ)dθ. In almost all practical situations however, computing the posterior directly is not possible since the required normalizing integral cannot be computed. For this reason, numerous MCMC methods have been developed to circumvent this integration.

In many cases however, it is not even possible to provide a model likelihood or it may be too cumbersome to compute. Approximate Bayesian Computation (ABC) methods have been developed to deal with this difficulty, see Csilléry, Blum, Gaggiotti, and François (2010) and Turner and Van Zandt (2012) for existing reviews. Generally speaking, ABC deals with the absence of a likelihood by prescribing a surrogate measure for how plausible a particular parameter set (θ) is. To accomplish this, a large number of simulated data observations (X̃) are drawn from the model. The observed (X) and simulated (X̃) data are then compared in some way to determine how likely that parameter set is. Typically this comparison is accomplished by compressing both data sets into a set of summary statistics S(X) and then defining a “distance” between them ρ(S(X),S(X̃)).
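To make the generic ABC recipe above concrete, here is a minimal rejection-ABC sketch in Python (the article's supplementary code is MATLAB; the model, summary choice, distance, and tolerance here are all illustrative assumptions, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n):
    # Hypothetical simulation-only model: normal with unknown mean theta.
    return rng.normal(theta, 1.0, size=n)

def summary(x):
    # Summary statistics S(X); their sufficiency is exactly what ABC
    # typically cannot verify, which is the issue discussed below.
    return np.array([x.mean(), x.std()])

def rho(s_obs, s_sim):
    # "Distance" between summary vectors.
    return np.linalg.norm(s_obs - s_sim)

def abc_rejection(x_obs, prior_draw, eps, n_keep):
    # Keep parameter draws whose simulated summaries land within eps.
    s_obs, kept = summary(x_obs), []
    while len(kept) < n_keep:
        theta = prior_draw()
        if rho(s_obs, summary(simulate(theta, len(x_obs)))) < eps:
            kept.append(theta)
    return np.array(kept)

x_obs = rng.normal(2.0, 1.0, size=200)
post = abc_rejection(x_obs, lambda: rng.uniform(-5.0, 5.0), eps=0.2, n_keep=100)
```

The accepted draws approximate π(θ|S(X)), not π(θ|X) — precisely the distinction emphasized in the next paragraph.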

One issue with this process is that the summary statistics must adequately represent the model's output, often referred to as a sufficiency condition (Dawid, 1979). ABC methods do not approximate the model's posterior π(θ|X), but rather the posterior of a new model augmented by the choice of S, π(θ|S(X)). So ABC only estimates the posterior distribution of the intended model if π(θ|S(X))=π(θ|X). The insidious issue with these summary statistics is that it is rarely possible to verify either sufficiency or insufficiency. Furthermore, if they are insufficient, it is usually not possible to determine how badly they have distorted results. Said another way, you know you are probably making errors, but you do not know how large they are.

The essential problem here is that the choice of summary statistics encodes assumptions on the structure of the likelihood function. Those assumptions may or may not be valid, potentially leading to serious errors (Robert, Cornuet, Marin, & Pillai, 2011). As an extreme example, using mean and variance as summary statistics to describe a distribution implies a normality assumption, which could be very poor if the underlying model likelihood is multimodal or heavily skewed. In a more cognitive context, choice response time distributions are often described by quantile summary statistics (Heathcote and Brown, 2004, Heathcote et al., 2002, Ratcliff and Tuerlinckx, 2002). This was however recently shown to be an insufficient summary of the data (Turner & Sederberg, 2014), leading to substantial posterior inaccuracies.

In the broader statistics field, such issues have been overcome through the development of "non-parametric statistical" methods, which free the user from having to make potentially erroneous assumptions. These methods were incorporated into maximum likelihood estimation more than a decade ago (Fermanian & Salanie, 2004). Recently they have been incorporated into Bayesian Computation (BC) to improve estimation. Mengersen, Pudlo, and Robert (2013) proposed an algorithm (termed BC_el) where empirical likelihood methods are used as a replacement for the analytic likelihood function in an importance sampling algorithm. Zhu, Diazaraque, and Leisen (2014) proposed an alternative (BC_bl) where the likelihood is estimated using a bootstrap approximation. Turner and Sederberg (2014) proposed an alternative, the Probability Density Approximation (PDA) method, where a kernel density estimate (KDE) (Silverman, 1986) is used to approximate the likelihood in a Metropolis–Hastings (MH) framework. These methods all use likelihood approximations to replace the unknown likelihood in a BC framework (in this way, BC_kde would be an alternative name for the PDA method, connecting it to existing methods).

The PDA method, which is built upon here, is a variation of the standard MH algorithm in which the exact likelihood used to compute acceptance rates is replaced by an approximate likelihood value. Constructing this approximation begins the same way as ABC, by first generating a synthetic data set (X̃) consisting of Ns model simulations. Next however, those model realizations are used to construct an approximation of the underlying likelihood L̂(θ|X) (from here on, a hat will always indicate an approximation). The approximate likelihood is then substituted (L → L̂) into the MH framework and an approximate posterior is determined. Fig. 1 illustrates the similarities and differences between the MH and PDA algorithms as well as the PDA variant developed in this work. The critical step in PDA is the construction of the approximate likelihood L̂(θ|X) that will replace L. The KDE is a powerful tool for doing just this (Silverman, 1982, Silverman, 1986).
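The substitution of an approximate, simulation-based likelihood into the MH acceptance step can be sketched as follows. This is a hedged illustration only: the toy model, flat prior, tuning constants, and the use of SciPy's `gaussian_kde` as a stand-in for the article's KDE are all my assumptions, not the article's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)

def simulate(theta, n_s):
    # Hypothetical simulation-only model: normal with unknown mean theta.
    return rng.normal(theta, 1.0, size=n_s)

def approx_log_like(theta, x_obs, n_s=2000):
    # KDE over synthetic data stands in for the intractable likelihood,
    # evaluated at the observed data points.
    kde = gaussian_kde(simulate(theta, n_s))
    return np.log(np.maximum(kde(x_obs), 1e-300)).sum()

def pda_mh(x_obs, theta0, n_iter=300, step=0.3):
    # Metropolis-Hastings with the approximate likelihood (flat prior assumed).
    chain, ll = [theta0], approx_log_like(theta0, x_obs)
    for _ in range(n_iter):
        prop = chain[-1] + rng.normal(0.0, step)
        ll_prop = approx_log_like(prop, x_obs)
        if np.log(rng.uniform()) < ll_prop - ll:
            chain.append(prop)
            ll = ll_prop
        else:
            chain.append(chain[-1])
    return np.array(chain)

x_obs = rng.normal(0.5, 1.0, size=100)
chain = pda_mh(x_obs, theta0=0.0)
```

Note that this sketch carries the current log-likelihood estimate forward unchanged, so a chain can stick on a luckily overestimated value; reducing that effect is what the article's "resampled PDA" variant addresses.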

The basic density estimation problem is to determine the density f(x) at a point x (the likelihood in the Bayesian context) from a collection of samples X̃ = {x̃_j} from f, where j = 1, …, Ns and observations are assumed to be independently distributed. The density at x is then approximated by f(x) ≈ f̂(x) = (1/Ns) Σ_{j=1}^{Ns} K_h(x − x̃_j). Here K_h is a "smoothing kernel" defined by K_h(z) = (1/h)K(z/h), where K is a continuous function that is symmetric about zero and integrates to one. The parameter h, commonly referred to as a "bandwidth", determines the smoothing properties of the kernel: large h heavily smoothes the sampled data while small h provides less smoothing. To illustrate this, consider the uniform kernel K(z) = I_{[−0.5, 0.5]}(z), where I is the standard indicator function that is one on the prescribed interval and zero elsewhere. This kernel produces a standard histogram estimator with h corresponding to the size of the histogram bins. Histograms with small bins (i.e. small h) of course produce noisy plots while those with large bins produce smoother but less refined plots.
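The kernel density estimator above can be implemented directly. A minimal sketch, assuming a Gaussian kernel (one common choice of K; the formula holds for any valid kernel):

```python
import numpy as np

def kde(x, samples, h):
    # f_hat(x) = (1/Ns) * sum_j K_h(x - x_j) with Gaussian K,
    # where K_h(z) = (1/h) * K(z/h) and K(z) = exp(-z**2 / 2) / sqrt(2*pi).
    z = (np.asarray(x)[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(2)
samples = rng.normal(0.0, 1.0, size=10_000)   # X_tilde, with Ns = 10,000
f_hat = kde(np.array([-1.0, 0.0, 1.0]), samples, h=0.2)
# f_hat approximates the standard normal density at -1, 0, and 1
```

Note the cost: each evaluation point requires an Ns-term sum, which is the expense the FFT-based implementation described later avoids.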

The most basic implementation of the PDA suffers from a number of inefficiencies: (1) computation of the KDE is time consuming and (2) approximation errors negatively impact proposal acceptance rates. Here, I present an improved implementation that alleviates these issues (to some extent). It can be applied in any setting where the original PDA method is applicable and will have essentially the same accuracy characteristics. As with any method, understanding the strengths, weaknesses, and sources of inaccuracy is required to implement this method and interpret its results. Thus, in addition to describing this variant, the influence of likelihood approximation errors on posterior estimation is discussed both theoretically and through example to illustrate its strengths and weaknesses.

The goals of this article are threefold: (1) describe improvements of the PDA method (Turner & Sederberg, 2014), (2) determine the strengths, weaknesses, and limitations of this method, and (3) present it in an accessible way so it can be utilized by a broad audience of end users. In particular, toward this third goal, a number of examples of this method are presented along with documented MATLAB code for two of the examples. This is not intended as a "plug in" software package, but rather to aid implementation of this method by others.

The remainder of this paper is structured as follows. Section 2 theoretically outlines some of the error characteristics and some pitfalls that an interested user should be aware of. Section 3 details the methodological changes that improve performance of the PDA, namely an improved implementation of the KDE and the incorporation of likelihood resampling. Section 4 provides an algorithmic overview of this method. Section 5 demonstrates application of this method for a sequence of increasingly complex models, and the pitfalls outlined in Section 2 are revisited. Section 6 presents profiling and performance data for this improvement.

Section snippets

PDA estimation error analysis

While distinct from the standard class of ABC methods, the PDA is approximate in the sense that the stationary distribution of the resulting Markov chain is the posterior of a different, approximate model. With ABC methods, summary statistics are the primary source of posterior distortion. With the PDA, the KDE introduces approximation errors. To see this, note that f̂ is an approximation to f and as such can be thought of as an estimator with some underlying distribution. For the standard
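The point that f̂ is itself a random estimator can be seen numerically (an illustration of the general phenomenon, not the article's analysis): repeating the approximation with fresh synthetic data yields a spread of approximate log-likelihood values scattered around the exact one. The toy model and SciPy KDE here are my assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde, norm

rng = np.random.default_rng(5)
x_obs = rng.normal(0.0, 1.0, size=200)

def kde_log_like(n_s):
    # Approximate log-likelihood of x_obs built from a fresh synthetic sample.
    kde = gaussian_kde(rng.normal(0.0, 1.0, size=n_s))
    return np.log(np.maximum(kde(x_obs), 1e-300)).sum()

# Each repetition uses new synthetic data, so the approximate log-likelihood
# is itself a random quantity with both Monte Carlo noise and smoothing bias.
estimates = np.array([kde_log_like(5_000) for _ in range(50)])
exact = norm.logpdf(x_obs).sum()
```

Even a small spread on the log scale matters for deviance-based model comparison, which is one of the pitfalls this section characterizes.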

Improved PDA implementation

One issue with the PDA is that computing the kernel density estimate can be computationally expensive. Here, an improvement that ameliorates this is presented. Before continuing, it is important to note that this improvement has the same accuracy, the same bias/variance issues as the standard KDE procedure, and will lead to essentially the same results as the standard KDE when embedded into the PDA procedure. It does however substantially improve efficiency. In what follows, I will assume
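The article's implementation details appear in the full text; as a hedged sketch of the general FFT-based KDE technique (bin the synthetic samples onto a regular grid, then smooth via the convolution theorem), assuming a Gaussian kernel and my own grid and padding choices:

```python
import numpy as np

def fft_kde(samples, h, n_grid=1024):
    # Bin the samples onto a regular grid, then convolve with the kernel in
    # the frequency domain: the Gaussian kernel's transform is
    # exp(-2 * (pi * f * h)**2) at frequency f (cycles per unit).
    lo, hi = samples.min() - 4.0 * h, samples.max() + 4.0 * h
    counts, edges = np.histogram(samples, bins=n_grid, range=(lo, hi))
    dx = edges[1] - edges[0]
    freqs = np.fft.rfftfreq(n_grid, d=dx)
    smooth = np.fft.irfft(np.fft.rfft(counts) * np.exp(-2.0 * (np.pi * freqs * h) ** 2), n_grid)
    grid = 0.5 * (edges[:-1] + edges[1:])
    # Clip tiny negative FFT ringing and normalize to a density estimate.
    return grid, np.maximum(smooth, 0.0) / (len(samples) * dx)

rng = np.random.default_rng(3)
grid, f_hat = fft_kde(rng.normal(0.0, 1.0, size=100_000), h=0.1)
```

The cost is dominated by binning plus one FFT pair, rather than an Ns-term kernel sum at every evaluation point, which is the source of the speedup.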

A procedural overview of the enhanced PDA method for the practitioner

Here, a procedural overview of the resampled PDA with an FFT based KDE implementation is outlined. Familiarity with MCMC methods is assumed and only the details that relate to the non-parametric component of this method are provided.

  • (0)

    Choose the number of samples to be used in the estimation process (Ns) and the kernel bandwidth (h). A minimum of Ns=10,000 should generally be used. Choosing h will require trial and error, but Silverman’s rule of thumb (Silverman, 1986) provides a good starting point.
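Silverman's rule of thumb mentioned in step (0) has a standard closed form, h = 0.9 · min(σ̂, IQR/1.34) · Ns^(−1/5) (Silverman, 1986). A small sketch:

```python
import numpy as np

def silverman_bandwidth(x):
    # Silverman's rule of thumb: h = 0.9 * min(sd, IQR / 1.34) * Ns**(-1/5).
    # The IQR term makes the rule robust to heavy tails and outliers.
    x = np.asarray(x)
    q75, q25 = np.percentile(x, [75, 25])
    sigma = min(x.std(ddof=1), (q75 - q25) / 1.34)
    return 0.9 * sigma * len(x) ** (-0.2)

rng = np.random.default_rng(4)
h0 = silverman_bandwidth(rng.normal(0.0, 1.0, size=10_000))
```

As the text says, this is only a starting point; the final h is tuned by trial and error.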

Examples

In the following sections, this method is applied to four examples of increasing complexity to demonstrate its strengths and weaknesses. In Example 1, the FFT based KDE is presented on its own to demonstrate the density construction process. In Example 2, this is embedded in an MCMC framework to demonstrate its use in parameter inference in a simple case. In Example 3, this method is applied to a more psychologically relevant model, the canonical Linear Ballistic Accumulator Model (LBA). Through

Profiling and performance

To demonstrate the performance gains of this method, performance of both the PDA and this variant have been profiled for the LBA and pLBA examples. Before continuing, it is important to note that there is no universally accepted way of determining the runtime of a program. Here, I use what is usually referred to as CPU time, which measures the total time a CPU is in operation. With multi-core processors, this is different from the wall clock time (a.k.a. runtime), which measures the time from

Discussion

This article presents an enhancement of the Probability Density Approximation method originally introduced in Turner and Sederberg (2014). This method, which combines non-parametric statistical methods with Bayesian inference techniques, is an extensible methodology for performing posterior parameter estimation. The purpose of this article is to discuss this method's properties, improve its efficiency, and make it accessible to a broader audience of end users.

A great many algorithms, ranging from

Acknowledgments

This work was supported by the National Science Foundation grant SES1530760. I would like to thank Joachim Vandekerckhove and Jennifer S. Trueblood for comments on this manuscript.

References (39)

  • S.D. Brown et al. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology.
  • K. Csilléry et al. (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution.
  • B.M. Turner et al. (2012). A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology.
  • J.F. Apgar et al. (2010). Sloppy models, parameter uncertainty, and the role of experimental design. Molecular BioSystems.
  • S. Brown et al. (2005). A ballistic model of choice response time. Psychological Review.
  • J.R. Busemeyer et al. (1993). Decision field theory: a dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review.
  • O. Cappé et al. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics.
  • A.P. Dawid (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society. Series B (Statistical Methodology).
  • C. de Boor (1972). Elementary numerical analysis.
  • P. Del Moral et al. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society. Series B (Statistical Methodology).
  • J.-D. Fermanian et al. (2004). A nonparametric simulated maximum likelihood estimation method. Econometric Theory.
  • K.O. Friedrichs (1944). The identity of weak and strong extensions of differential operators. Transactions of the American Mathematical Society.
  • A. Gelman et al. (2003). Bayesian data analysis.
  • R.N. Gutenkunst et al. (2007). Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology.
  • A. Heathcote et al. (2004). Reply to Speckman and Rouder: A theoretical basis for QML. Psychonomic Bulletin & Review.
  • A. Heathcote et al. (2002). Quantile maximum likelihood estimation of response time distributions. Psychonomic Bulletin & Review.
  • W.R. Holmes, J.S. Trueblood, & A. Heathcote (2015). A new framework for modeling decisions about changing...
  • A.C. Huk et al. (2005). Neural activity in macaque parietal cortex reflects temporal integration of visual motion signals during perceptual decision making. The Journal of Neuroscience.
  • R. Kiani et al. (2008). Bounded integration in parietal cortex underlies decisions even when viewing duration is dictated by the environment. The Journal of Neuroscience.