A practical guide to the Probability Density Approximation (PDA) with improved implementation and error characterization
Introduction
Parameter estimation is a crucial step in determining how well a model, and by design the set of assumptions it encodes, accounts for observations. The canonical Bayesian estimation problem is to determine the posterior probability distribution of a set of model parameters $\theta$ conditioned on observed data $Y$. Given a model likelihood $L(Y \mid \theta)$ and a prior distribution $\pi(\theta)$, this is accomplished via Bayes' theorem: $$\pi(\theta \mid Y) = \frac{L(Y \mid \theta)\,\pi(\theta)}{\int L(Y \mid \theta)\,\pi(\theta)\,d\theta}.$$ In almost all practical situations, however, computing the posterior directly is not possible since the required normalizing integral cannot be evaluated. For this reason, numerous Markov chain Monte Carlo (MCMC) methods have been developed to circumvent this issue.
In many cases, however, it is not even possible to write down a model likelihood, or it may be too cumbersome to compute. Approximate Bayesian Computation (ABC) methods have been developed to deal with this difficulty; see Csilléry, Blum, Gaggiotti, and François (2010) and Turner and Van Zandt (2012) for existing reviews. Generally speaking, ABC deals with the absence of a likelihood by prescribing a surrogate measure of how plausible a particular parameter set $\theta$ is. To accomplish this, a large number of simulated data observations $X$ are drawn from the model. The observed and simulated data are then compared in some way to determine how likely that parameter set is. Typically this comparison is accomplished by compressing both data sets into a set of summary statistics $S(\cdot)$ and then defining a "distance" between them, $\rho(S(Y), S(X))$.
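To make this recipe concrete, the following is a minimal rejection-ABC sketch. The toy "model" (a normal distribution with unknown mean), the summaries $S$ (mean and variance), the distance $\rho$, and the tolerance `tol` are all illustrative assumptions for this example, not prescriptions from the method above.

```matlab
% Minimal rejection-ABC sketch for a toy normal model with unknown mean.
Y   = 1.5 + 0.5*randn(100, 1);            % stand-in "observed" data
S   = @(d) [mean(d); var(d)];             % summary statistics S(.)
rho = @(a, b) norm(a - b);                % distance between summaries
tol = 0.1;                                % acceptance tolerance
accepted = [];
for i = 1:50000
    theta = 3*rand;                       % draw theta from a U(0,3) prior
    X = theta + 0.5*randn(100, 1);        % simulated data X from the model
    if rho(S(Y), S(X)) < tol              % keep theta if summaries are close
        accepted(end+1) = theta;          %#ok<AGROW>
    end
end
% 'accepted' approximates samples from the augmented posterior pi(theta | S(Y)).
```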
One issue with this process is that the summary statistics must adequately represent the model's output, often referred to as a sufficiency condition (Dawid, 1979). ABC methods do not approximate the model's posterior $\pi(\theta \mid Y)$, but rather the posterior of a new model augmented by the choice of $S$, namely $\pi(\theta \mid S(Y))$. So ABC only estimates the posterior distribution of the intended model if $\pi(\theta \mid S(Y)) = \pi(\theta \mid Y)$. The insidious issue with these summary statistics is that it is rarely possible to verify either sufficiency or insufficiency. Furthermore, if they are insufficient, it is usually not possible to determine how badly they have distorted results. Said another way, you know you are probably making errors, but you do not know how large they are.
The essential problem here is that the choice of summary statistics encodes assumptions about the structure of the likelihood function. Those assumptions may or may not be valid, potentially leading to serious errors (Robert, Cornuet, Marin, & Pillai, 2011). As an extreme example, using the mean and variance as summary statistics to describe a distribution implies a normality assumption, which could be very poor if the underlying model likelihood is multimodal or heavily skewed. In a more cognitive context, choice response time distributions are often described by quantile summary statistics (Heathcote and Brown, 2004, Heathcote et al., 2002, Ratcliff and Tuerlinckx, 2002). However, this was recently shown to be an insufficient summary of the data (Turner & Sederberg, 2014), leading to substantial posterior inaccuracies.
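The normality example can be made concrete. The two distributions below differ dramatically in shape yet are indistinguishable through mean/variance summaries; the mixture parameters are chosen only so that the first two moments match, and are otherwise arbitrary.

```matlab
% Two very different distributions with (nearly) matched mean and variance:
% a standard normal versus a bimodal, equal-weight mixture of two normals.
x1 = randn(1e5, 1);                          % unimodal: N(0,1)
x2 = [0.32*randn(5e4,1) - 0.95; ...          % bimodal mixture whose first two
      0.32*randn(5e4,1) + 0.95];             % moments match N(0,1)
fprintf('means: %6.3f vs %6.3f\n', mean(x1), mean(x2));
fprintf('vars : %6.3f vs %6.3f\n', var(x1),  var(x2));
% Any ABC procedure built on these summaries treats the two as equivalent.
```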
In the broader statistics field, such issues have been overcome through the development of non-parametric statistical methods, which free the user from having to make potentially erroneous assumptions. These methods were incorporated into maximum likelihood estimation more than a decade ago (Fermanian & Salanié, 2004). Recently they have been incorporated into Bayesian Computation (BC) to improve estimation. Mengersen, Pudlo, and Robert (2013) proposed an algorithm (termed $BC_{el}$) where empirical likelihood methods are used as a replacement for the analytic likelihood function in an importance sampling algorithm. Zhu, Diazaraque, and Leisen (2014) proposed an alternative ($BC_{bl}$) where the likelihood is estimated using a bootstrap approximation. Turner and Sederberg (2014) proposed another alternative, the Probability Density Approximation (PDA) method, where a kernel density estimate (KDE) (Silverman, 1986) is used to approximate the likelihood in a Metropolis–Hastings (MH) framework. These methods all use likelihood approximations to replace the unknown likelihood in a BC framework (in this way, $BC_{kde}$ would be an alternative name for the PDA method, connecting it to existing methods).
The PDA method, which is built upon here, is a variation of the standard MH algorithm in which the exact likelihood used to compute acceptance rates is replaced by an approximate likelihood value. Constructing this approximation begins the same way as ABC, by first generating a synthetic data set $X$ consisting of model simulations. Next, however, those model realizations are used to construct an approximation $\hat{L}(Y \mid \theta)$ of the underlying likelihood (from here on, a hat will always indicate an approximation). The approximate likelihood is then substituted into the MH framework and an approximate posterior is determined. Fig. 1 illustrates the similarities and differences between the MH and PDA algorithms as well as the PDA variant developed in this work. The critical step in PDA is the construction of the approximate likelihood that will replace $L(Y \mid \theta)$. The KDE is a powerful tool for doing just this (Silverman, 1982, 1986).
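The core substitution can be sketched as a single MH step. Here `simulate_model` is a hypothetical stand-in for the user's model simulator and `kde_at` a KDE evaluator (one possible implementation is given after the KDE definition below); working in log-likelihoods, as done here, avoids numerical underflow.

```matlab
% One schematic MH step with the exact likelihood replaced by a KDE built
% from model simulations. 'simulate_model' and 'kde_at' are hypothetical
% helpers: the former draws Ns realizations from the model at a parameter
% value, the latter evaluates a KDE of those draws at each observed datum.
function theta = pda_mh_step(theta, Y, log_prior, prop_sd, Ns, h)
    theta_prop = theta + prop_sd*randn;           % symmetric random-walk proposal
    X_prop = simulate_model(theta_prop, Ns);      % synthetic data at the proposal
    X_curr = simulate_model(theta, Ns);           % synthetic data at current value
    logL_prop = sum(log(kde_at(Y, X_prop, h)));   % log Lhat(Y | theta')
    logL_curr = sum(log(kde_at(Y, X_curr, h)));   % log Lhat(Y | theta)
    log_alpha = (logL_prop + log_prior(theta_prop)) ...
              - (logL_curr + log_prior(theta));   % MH log acceptance ratio
    if log(rand) < min(0, log_alpha)
        theta = theta_prop;                       % accept the proposal
    end
end
```

Whether the current state's approximate likelihood is stored or, as sketched here, re-estimated on each iteration is exactly the resampling question taken up in Section 3.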
The basic density estimation problem is to determine the density $f(y)$ at a point $y$ (the likelihood in the Bayesian context) from a collection of samples $X = \{x_1, \dots, x_{N_s}\}$ drawn from $f$, where the $N_s$ observations are assumed to be independently distributed. The density at $y$ is then approximated by $$\hat{f}(y) = \frac{1}{N_s}\sum_{i=1}^{N_s} K_h(y - x_i).$$ Here $K_h$ is a "smoothing kernel" defined by $K_h(z) = \frac{1}{h}K\!\left(\frac{z}{h}\right)$, where $K$ is a continuous function that is symmetric about zero and integrates to one. The parameter $h$, commonly referred to as a "bandwidth" size, determines the smoothing properties of the kernel: large $h$ heavily smooths the sampled data while small $h$ provides less smoothing. To illustrate this, consider the uniform kernel $K(z) = \frac{1}{2}\mathbb{1}_{[-1,1]}(z)$, where $\mathbb{1}$ is the standard indicator function that is one on the prescribed interval and zero elsewhere. This kernel produces a standard histogram estimator with $h$ corresponding to the size of the histogram bins. Histograms with small bins (i.e. small $h$) of course produce noisy plots while those with large bins produce smoother but less refined plots.
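For concreteness, a direct Gaussian-kernel implementation of this estimator (one possible version of the `kde_at` helper assumed above) is sketched below. Its cost grows as the product of the number of simulations and the number of evaluation points, which is what motivates the faster implementation of Section 3.

```matlab
% Direct KDE with a Gaussian kernel: fhat(y) = (1/Ns) * sum_i K_h(y - x_i).
% A minimal sketch; the cost is O(Ns * numel(y)).
function fhat = kde_at(y, x, h)
    % y : points at which the density is evaluated (e.g., observed data)
    % x : the Ns model simulations;  h : kernel bandwidth
    fhat = zeros(size(y));
    for i = 1:numel(y)
        z = (y(i) - x(:)) / h;                        % scaled distances
        fhat(i) = mean(exp(-0.5*z.^2)) / (h*sqrt(2*pi));
    end
end
```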
The most basic implementation of the PDA suffers from a number of inefficiencies: (1) computation of the KDE is time consuming, and (2) approximation errors negatively impact proposal acceptance rates. Here, I present an improved implementation that alleviates these issues (to some extent). It can be applied in any setting where the original PDA method is applicable and will have essentially the same accuracy characteristics. As with any method, understanding the strengths, weaknesses, and sources of inaccuracy is required to implement this method and interpret its results. Thus, in addition to describing this variant, the influence of likelihood approximation errors on posterior estimation is discussed both theoretically and through examples to illustrate its strengths and weaknesses.
The goals of this article are threefold: (1) describe improvements to the PDA method (Turner & Sederberg, 2014), (2) determine the strengths, weaknesses, and limitations of this method, and (3) present it in an accessible way so it can be utilized by a broad audience of end users. In particular, toward this third goal, a number of examples of this method are presented along with documented MATLAB code for two of the examples. This is not intended as a "plug-in" software package, but rather as an aid to implementation of this method by others.
The remainder of this paper is structured as follows. Section 2 theoretically outlines some of the error characteristics and some pitfalls that an interested user should be aware of. Section 3 details the methodological changes that improve performance of the PDA, namely an improved implementation of the KDE and the incorporation of likelihood resampling. Section 4 provides an algorithmic overview of this method. Section 5 demonstrates application of this method for a sequence of increasingly complex models, and the pitfalls outlined in Section 2 are revisited. Section 6 presents profiling and performance data for this improvement.
PDA estimation error analysis
While distinct from the standard class of ABC methods, the PDA is approximate in the sense that the stationary distribution of the resulting Markov chain is the posterior of a different, approximate model. With ABC methods, summary statistics are the primary source of posterior distortion. With the PDA, the KDE introduces approximation errors. To see this, note that $\hat{L}(Y \mid \theta)$ is an approximation to $L(Y \mid \theta)$ and as such can be thought of as an estimator with some underlying distribution. For the standard …
Improved PDA implementation
One issue with the PDA is that computing the kernel density estimate can be computationally expensive. Here, an improvement that ameliorates this is presented. Before continuing, it is important to note that this improvement has the same accuracy and the same bias/variance issues as the standard KDE procedure, and will lead to essentially the same results as the standard KDE when embedded into the PDA procedure. It does however substantially improve efficiency. In what follows, I will assume …
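Although the snippet above is truncated, the standard FFT-based KDE construction can be sketched: bin the simulations onto a regular grid, convolve the binned mass with the kernel in the frequency domain, and interpolate the result at the observed data. The grid size, padding, and wrapped-distance edge handling below are illustrative choices for a generic version, not necessarily the exact variant developed in this paper.

```matlab
% Generic FFT-based KDE sketch: one O(Ng log Ng) convolution replaces the
% O(Ns * numel(y)) double loop of the direct estimator.
function fhat = kde_fft_at(y, x, h)
    Ng    = 2^10;                                   % grid resolution (power of 2)
    lo    = min(x) - 4*h;   hi = max(x) + 4*h;      % padded support
    edges = linspace(lo, hi, Ng+1);
    grid  = (edges(1:end-1) + edges(2:end)) / 2;    % bin centers
    dx    = grid(2) - grid(1);
    w     = histcounts(x, edges) / numel(x);        % probability mass per bin
    d     = (0:Ng-1) * dx;                          % distance from the first bin
    d     = min(d, Ng*dx - d);                      % circular (wrapped) distance
    kern  = exp(-0.5*(d/h).^2) / (h*sqrt(2*pi));    % Gaussian kernel densities
    fgrid = real(ifft(fft(w) .* fft(kern)));        % circular convolution via FFT
    fhat  = interp1(grid, fgrid, y(:), 'linear', 0);% density at observed points
end
```

The padding keeps the wrapped convolution from bleeding mass across the boundaries of the grid.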
A procedural overview of the enhanced PDA method for the practitioner
Here, a procedural overview of the resampled PDA with an FFT-based KDE implementation is outlined. Familiarity with MCMC methods is assumed and only the details that relate to the non-parametric component of this method are provided.
- (0) Choose the number of samples $N_s$ to be used in the estimation process and the kernel bandwidth $h$. A sufficiently large $N_s$ should generally be used. Choosing $h$ will require trial and error, but Silverman's rule of thumb (Silverman, 1986) provides a good starting point (see the sketch below).
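As a starting point for step (0), Silverman's rule of thumb can be computed directly from the simulations. A minimal sketch follows; the variable names are illustrative, and `quantile` requires the Statistics Toolbox.

```matlab
% Silverman's rule-of-thumb bandwidth (Silverman, 1986), computed from the
% Ns model simulations in x; a starting point, to be tuned by trial and error.
sig   = std(x);                                  % sample standard deviation
iqr_x = diff(quantile(x, [0.25, 0.75]));         % interquartile range
h     = 0.9 * min(sig, iqr_x/1.34) * numel(x)^(-1/5);
```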
Examples
In the following sections, this method is applied to four examples of increasing complexity to demonstrate its strengths and weaknesses. In Example 1, the FFT-based KDE is presented on its own to demonstrate the density construction process. In Example 2, this is embedded in an MCMC framework to demonstrate its use in parameter inference in a simple case. In Example 3, this method is applied to a more psychologically relevant model, the canonical Linear Ballistic Accumulator Model (LBA). Through …
Profiling and performance
To demonstrate the performance gains of this method, the performance of both the PDA and this variant has been profiled for the LBA and pLBA examples. Before continuing, it is important to note that there is no universally accepted way of determining the runtime of a program. Here, I use what is usually referred to as CPU time, which measures the total time a CPU is in operation. With multi-core processors, this is different from the wall-clock time (a.k.a. runtime), which measures the time from …
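In MATLAB the distinction can be seen directly: `cputime` accumulates processor time across all cores the process uses, while `tic`/`toc` measures elapsed wall-clock time. The matrix product below is just an arbitrary multithreaded workload chosen for illustration.

```matlab
% CPU time vs. wall-clock time: on a multi-core machine the multithreaded
% matrix product below can log several CPU-seconds per wall-clock second.
t_cpu = cputime;  t_wall = tic;
A = randn(3000);  B = A * A';                  % arbitrary multithreaded work
fprintf('CPU: %.2f s, wall-clock: %.2f s\n', cputime - t_cpu, toc(t_wall));
```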
Discussion
This article presents an enhancement of the Probability Density Approximation method originally introduced in Turner and Sederberg (2014). This method, which combines non-parametric statistical methods with Bayesian inference techniques, is an extensible methodology for performing posterior parameter estimation. The purpose of this article is to discuss this method's properties, improve its efficiency, and make it accessible to a broader audience of end users.
A great many algorithms, ranging from …
Acknowledgments
This work was supported by the National Science Foundation grant SES-1530760. I would like to thank Joachim Vandekerckhove and Jennifer S. Trueblood for comments on this manuscript.
References (39)
- Brown, S. D., & Heathcote, A. (2008). The simplest complete model of choice response time: Linear ballistic accumulation. Cognitive Psychology.
- Csilléry, K., Blum, M. G. B., Gaggiotti, O. E., & François, O. (2010). Approximate Bayesian computation (ABC) in practice. Trends in Ecology & Evolution.
- Turner, B. M., & Van Zandt, T. (2012). A tutorial on approximate Bayesian computation. Journal of Mathematical Psychology.
- Apgar, J. F., Witmer, D. K., White, F. M., & Tidor, B. (2010). Sloppy models, parameter uncertainty, and the role of experimental design. Molecular BioSystems.
- Brown, S., & Heathcote, A. (2005). A ballistic model of choice response time. Psychological Review.
- Busemeyer, J. R., & Townsend, J. T. (1993). Decision field theory: A dynamic-cognitive approach to decision making in an uncertain environment. Psychological Review.
- Cappé, O., Guillin, A., Marin, J.-M., & Robert, C. P. (2004). Population Monte Carlo. Journal of Computational and Graphical Statistics.
- Dawid, A. P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
- Conte, S. D., & de Boor, C. (1972). Elementary numerical analysis.
- Del Moral, P., Doucet, A., & Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society, Series B (Statistical Methodology).
- Fermanian, J.-D., & Salanié, B. (2004). A nonparametric simulated maximum likelihood estimation method. Econometric Theory.
- Friedrichs, K. O. (1944). The identity of weak and strong extensions of differential operators. Transactions of the American Mathematical Society.
- Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. Bayesian data analysis.
- Gutenkunst, R. N., Waterfall, J. J., Casey, F. P., Brown, K. S., Myers, C. R., & Sethna, J. P. (2007). Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology.
- Heathcote, A., & Brown, S. (2004). Reply to Speckman and Rouder: A theoretical basis for QML. Psychonomic Bulletin & Review.
- Heathcote, A., Brown, S., & Mewhort, D. J. K. (2002). Quantile maximum likelihood estimation of response time distributions. Psychonomic Bulletin & Review.
- Huk, A. C., & Shadlen, M. N. (2005). Neural activity in macaque parietal cortex reflects temporal integration of visual motion signals during perceptual decision making. The Journal of Neuroscience.
- Kiani, R., Hanks, T. D., & Shadlen, M. N. (2008). Bounded integration in parietal cortex underlies decisions even when viewing duration is dictated by the environment. The Journal of Neuroscience.