## Abstract

A central question in neuroscience is how sensory inputs are transformed into percepts. At this point, it is clear that this process is strongly influenced by prior knowledge of the sensory environment. Bayesian ideal observer models provide a useful link between data and theory that can help researchers evaluate how prior knowledge is represented and integrated with incoming sensory information. However, the statistical prior employed by a Bayesian observer cannot be measured directly, and must instead be inferred from behavioral measurements. Here, we review the general problem of inferring priors from psychophysical data, and the simple solution that follows from assuming a prior that is a Gaussian probability distribution. As our understanding of sensory processing advances, however, there is an increasing need for methods to flexibly recover the shape of Bayesian priors that are not well approximated by elementary functions. To address this issue, we describe a novel approach that applies to arbitrary prior shapes, which we parameterize using mixtures of Gaussian distributions. After incorporating a simple approximation, this method produces an analytical solution for psychophysical quantities that can be numerically optimized to recover the shapes of Bayesian priors. This approach offers advantages in flexibility, while still providing an analytical framework for many scenarios. We provide a MATLAB toolbox implementing key computations described herein.

## Significance Statement

Ideal observer models in neuroscience are an important tool for developing and testing hypotheses about how sensory information is processed. Here, we review the canonical application of Bayesian ideal observer models for understanding sensory processing. We present a new mathematical generalization that will allow these models to be used for deeper investigations into how prior knowledge influences perception. We also provide a software toolkit for implementing the described models.

## Introduction

Sensory systems must encode information about environmental stimuli in a way that supports successful behaviors. However, sensory measurements are often noisy and ambiguous, making this a demanding task. For example, in the visual system, each retinal image is consistent with an infinite number of possible three-dimensional scenes. In the auditory system, the vibration of the inner ear intermixes both the identity and elevation of sound sources. Prior knowledge about the environment can help resolve these ambiguities (Knill and Richards, 1996; Simoncelli and Olshausen, 2001). Thus, advances in understanding sensation and perception often rely on understanding how prior knowledge is represented in the nervous system and how this prior knowledge influences our percepts.

The influence of prior knowledge on perception is often characterized using psychophysical experiments that measure the bias and variability of perceptual reports (Hürlimann et al., 2002; Weiss et al., 2002; Adams et al., 2004; Girshick et al., 2011; Vacher et al., 2018). For example, measured biases can be compared with biases predicted by ideal observer models, which can also inform our understanding of how sensory information is represented within neuronal populations (Ganguli and Simoncelli, 2010; Wei and Stocker, 2015, 2017; Morais and Pillow, 2018). Bayesian ideal observer models specifically posit that observers optimally combine noisy sensory measurements with a probability distribution representing the relative frequency with which events occur in the world (called the prior distribution, or simply the prior). Bayesian models are popular across many domains, including sensation and perception, because they can successfully explain a wide range of phenomena (Weiss et al., 2002; Adams et al., 2004; Burge et al., 2010; Girshick et al., 2011; Kim and Burge, 2018). However, these models are often poorly constrained. Without constraints on the shape of the prior, Bayesian models can effectively explain any biases. Thus, a set of important questions arise: What is the shape of the prior the observer is using? Does this shape accurately reflect probabilities in the world? Does it change systematically with experience?

Bayesian priors are often assumed to take the form of a Gaussian distribution for computational efficiency (Mamassian and Landy, 1998; Weiss et al., 2002; Beierholm et al., 2009; Sotiropoulos et al., 2011; Saunders and Chen, 2015; Rokers et al., 2018). This assumption, however, limits the ability to ask questions about the shape of the prior because a Gaussian only has two parameters. In addition, analyses of natural scene statistics suggest that the probability distributions of environmental stimuli are generally non-Gaussian (Dong and Atick, 1995; Girshick et al., 2011; Sprague et al., 2015). In order to more flexibly model prior distributions, a previous study introduced an analytic approach based on piecewise approximations that leverages assumptions about the local shape of the prior relative to the magnitude of measurement noise (Stocker and Simoncelli, 2006). An alternative approach to increasing flexibility without introducing assumptions about prior shape is to use numeric methods that do not place constraints on the global parametric form or local properties of the prior (Girshick et al., 2011; Acerbi et al., 2014; Sprague et al., 2015). Numeric methods, while able to fit an arbitrary prior, are often slower and require hand-tuning of the numerical support and precision. Thus, while researchers have a varied toolkit for modeling the shapes of Bayesian priors, there is still a need to diversify our tools for using these models in perceptual research.

Our goal is to provide an overview of how Bayesian ideal observer models can be used in perceptual research, and to describe a computational approach that uses mixture of Gaussian models to flexibly and efficiently model the influence of priors on perception. First, we review the general mathematical principles that link a Bayesian ideal observer to psychophysical data. Then, we present the analytic solutions for psychophysical quantities assuming a simple Gaussian prior and Gaussian measurement noise. Next, we introduce a mixture of Gaussians model of priors that provides increased flexibility. Mixture of Gaussian priors have been employed in other contexts, such as computer vision and signal processing (Olshausen and Millman, 1999; Snoussi and Mohammad-Djafari, 2001), but are not commonly used in ideal observer models (but see related applications for modeling perceptual inferences by Acerbi et al., 2014; Orhan and Jacobs, 2014). Lastly, we introduce a new analytical approximation that increases the computational efficiency of the mixture of Gaussians model. This approximation offers improvements in efficiency for adaptive experimental methods (e.g., adaptive stimulus staircasing) as compared with fully numerical approaches. An accompanying MATLAB (MathWorks) toolkit provides implementations that can be used to simulate and fit psychophysical data.

## Materials and Methods

### Bayesian ideal observer models

In a Bayesian ideal observer model, the observer makes a noisy measurement *m* of a stimulus *x* and uses that measurement to generate an estimate of the stimulus in the world *r*. We can represent this mapping of measurement onto response with the function *r *=* T*(·) where *T* is some estimation function. For example, in the context of a psychophysical experiment, *T*(·) may represent a point estimate of the presented stimulus (in which case

A Bayesian ideal observer selects the optimal response to a set of stimuli on the basis of the three components:

a prior distribution

*p*(*x*)a likelihood

*p*(*m*|*x*)a loss function

*L*(*x*,*r*)

The prior *p(x)* represents the observer’s knowledge of the probability of encountering the stimulus based on previous experience. The likelihood

To obtain the observer’s belief about the current stimulus *x* given a measurement *m* we first use Bayes’ rule to obtain the posterior distribution,

The posterior represents the probability distribution of a stimulus, given the current measurement, and can thus be used for drawing inferences. Here, *p*(*m*) is the model evidence (or marginal likelihood) that serves to normalize the posterior. This calculation is represented graphically in Figure 1. Since *p*(*m*) is a scalar value and does not affect the shape of the posterior, we can note that

This simple illustration, however, shows a likelihood based on only one example measurement. If we instead consider the full range of possible measurements, as shown in Figure 2, we can see how the resulting shape of the posterior varies. Figure 2*A* shows the prior as a function of *x*. By definition, the prior is independent of the measurement *m*, so it varies horizontally, but is constant along the vertical dimension. This two-dimensional (2D) format, similar to that used in (Girshick et al., 2011), helps illustrate the point that the posterior (Fig. 2*C*) arises from pointwise multiplying the prior (Fig. 2*A*) and likelihood (Fig. 2*B*). Figure 2*B* illustrates the likelihood by plotting the probability of the observer making each measurement, conditioned on each possible stimulus value. This 2D distribution is generated by assuming that the measurement associated with each stimulus value is corrupted by additive Gaussian noise, but is unbiased. A vertical slice through B represents what we refer to as the measurement distribution *m* given a particular stimulus *x _{n}*. A horizontal slice through B, on the other hand, represents the likelihood function

*p*(

*m*,

_{n}*x*), which is the probability of a given measurement

*m*as a function of different stimulus values

*x*. Thus, the 2D object

While the likelihood is the pertinent quantity for applying Bayes’ rule, the measurement distribution is the relevant quantity when considering samples of the noisy sensory observation process. Note the measurement distribution is a true probability density function based on a noise model (here, we use additive Gaussian noise). The likelihood, on the other hand, is not generally a probability distribution because it does not necessarily integrate to one.

By multiplying each row of the prior and likelihood plots and normalizing, we obtain the set of possible posterior distributions *C*). Note that since the prior is non-Gaussian and steeper around the left flank of the peak, the posteriors are more concentrated around these values.

Finally, a loss function is needed to complete the model. The loss function *L*(*x*,*r*) refers to the penalty of making a response *r* when the true stimulus was *x*. An optimal decision rule is one where the observer will minimize the loss on average over the course of a set of responses. To calculate the expected loss of a particular response, we can find the expected loss under the posterior:

A decision rule is Bayes optimal under a particular loss function if it minimizes the expected loss for all measurements. That is, *T**(·) is Bayes optimal if for all estimation functions *T*(*m*) and all measurement values *m*:

Note that here we show *T*(·) as a function of a single *m*, but it may also take multiple measurements into account, as in a two-alternative forced-choice paradigm (2AFC). In the following sections, we will discuss this loss in more concrete terms in the context of point estimation and 2AFC tasks.

While the derivations outlined in this paper do not assume any particular stimulus, they do assume that the measurements are unbiased, and that the measurement noise is additive and Gaussian distributed. In this case, the likelihood always takes the form of a Gaussian. Under these assumptions, the mean of the likelihood varies with and is equal to the measurement. We also assume that the width of the likelihood (i.e., the amount of noise) does not inherently vary with the measurement. However, the Weber–Fechner law across many stimuli suggests that this assumption does not hold if stimulus values are represented in many common sense units (e.g., candelas per square meter for luminance, visual degrees per second for speed), because sensory thresholds in these units increase systematically as stimulus values increase (Hecht, 1924; McKee et al., 1986; Pardo-Vazquez et al., 2019). Thus, a transformation of the stimulus values from physical space to “sensory space” may often be necessary to satisfy this assumption (Stocker and Simoncelli, 2006; Kwon et al., 2015). Indeed, the Weber–Fechner law suggests that the width of the likelihood or measurement distribution is approximately constant in logarithmic units across many stimulus domains (although deviations have been noted). For example, if one were to model visual speed perception, the measurement distribution and prior could be represented in terms of *p*(*x*), respectively. Throughout this document, we will represent the likelihoods as Gaussians even if a transformation is necessary, to keep the estimation of the prior computationally tractable. For reference, Table 1 provides a summary of notation used for each of the ideal observer parameters.

### Modeling psychophysical data from an observer with a Gaussian prior

We begin with a simple case in which the prior takes the form of a Gaussian distribution. If this condition is met, the posterior has an analytic solution and is also Gaussian. This property follows from the general rule defining the product of any two Gaussians. Specifically, if we denote a Gaussian distribution generally as *a* and standard deviation *b*, we can write the prior as *m* and a SD of *σ*: *ρ*, which relates the posterior to the product of prior and likelihood, is given by:

#### Selecting a sensory estimate from the posterior

To start linking this framework to psychophysical data, we first consider an experiment in which we want to fit a Bayesian ideal observer model to a set of data in which participants reported point estimates of the presented stimuli (e.g., through method of adjustment such that *x* is the true value). To convert the posterior into an optimal estimate, we can assert a loss function for our Bayesian ideal observer. In the general form, this loss function will determine the Bayes estimate that minimizes the expected error defined in Equation 2:

Two commonly used loss functions are the zero-one loss (where the loss is 0 when

For an ideal observer that uses a squared error loss function, the Bayesian least squares (BLS) estimate is the mean of the posterior:

When the posterior is Gaussian, the MAP and BLS estimates are equivalent and equal to μ_{post} (Eq. 7), which can be simplified to:
_{post} such that α is a shrinkage factor that determines how biased the posterior is toward the prior mean:

With these simplifications we can rewrite the posterior as:

When *α*)*ν*. Note that when the posterior is not Gaussian, the MAP and BLS estimates are not necessarily equivalent.

#### Distribution of sensory estimates

While the ideal observer model outlined in this paper is defined from the perspective of the observer, we will briefly shift our perspective to that of an experimenter to demonstrate how the model can be used in practice. In a task in which the observer is making repeated point estimates of the stimulus (e.g., judging its visual brightness, auditory volume, or speed), the mean of the measurement distribution on each trial will be equal to the true value of the stimulus, *x*, and we can define *m*. The responses thus form an estimate distribution

If we want to infer the underlying ideal observer parameters from a set of real behavioral data, we can fit a set of empirically measured observer estimates to this estimate distribution. To do so, we define an analytic form of this estimate distribution *m* in the measurement distribution

While we could also derive the estimate distribution more simply using the identity for the affine transformation of Gaussian random variables, we use a substitution of variables here to draw a parallel to the mixture of Gaussians case in the next section. Note that the form of the estimate distribution is similar to the posterior distribution associated with a single measurement (Eq. 14) with two key differences: the mean of the estimate distribution is dependent on the stimulus x instead of any specific noisy measurement, and the variance is equal to the variance of the likelihood scaled by α^{2} instead of α.

This distribution of observer estimates, given the stimulus, provides the likelihood function for fitting the Bayesian ideal observer model to data by performing maximum likelihood estimation (MLE; not to be confused with the likelihood of a Bayesian observer). Specifically, it is a likelihood when considered as a function of the model parameters θ = { ν, γ, σ }. Given a set of paired stimuli and observer reports

In practice, we optimize θ by minimizing the negative log-likelihood, which is obtained by taking the negative log of this expression:

#### Two-alternative forced-choice task

Experimenters often avoid having research participants report point estimates of stimuli because the origin of the noise in the measurement is ambiguous. For example, responses that incorporate a motor component may be contaminated by motor noise in addition to sensory noise. To avoid this issue, participants can make a categorical judgment about stimuli in perceptual space that can be related back to physical qualities of the stimulus. One such paradigm is a two-alternative forced-choice (2AFC) task in which participants view two stimuli either sequentially or concurrently and must select which of the two best fits the instructions they are given. In a speed judgment task, for example, the instruction might be: “indicate which of the two stimuli appeared to move faster”. Often, this task is repeated for a range of stimulus values, such as stimulus speed, to build up a psychometric function. This function, for example, might describe the probability that a test stimulus is perceived as moving faster than a fixed reference stimulus, as a function of the test stimulus speed. Importantly, the two stimuli should differ in reliability to estimate the best fitting parameters for both the likelihood and the prior.

If we consider two stimuli *x*_{1} and *x*_{2}, on each trial, the observer makes two noise-corrupted measurements, which we model with two measurement distributions *r* based on a decision function that takes both measurements as input (*x*_{2} best satisfies the instructions given the measurements (e.g., in our speed judgment example, was *x*_{2} faster than *x*_{1}). This is defined by the following decision rule:

Since we model the likelihoods as independent and the posteriors are both Gaussian (at this point in the derivations), we can more succinctly say this occurs whenever the estimate

Because this is now a classification task, we adopt the loss function:
*A–C* for three different combinations of noise levels for m_{1} and m_{2}. The slope of this line is determined by: *x*_{2} and *x*_{1} over repeated trials (i.e., a point on the psychometric curve), we can obtain a numerical solution by integrating the joint distribution above the decision boundary:

The results of this integration for Figure 4*A–C* are shown in Figure 4*D*, along with the full psychometric curves.

However, Bayesian ideal observer models with Gaussian posteriors also allow for an equivalent analytical alternative to this calculation. Specifically, we can obtain an analytic solution for points on the psychometric curve via an alternative model of the Bayesian observer in which the observer computes the MAP estimate for each stimulus and then compares which of the two is larger. This method has been used previously (Stocker and Simoncelli, 2006) and is equivalent to the optimal computation in Equation 25 when the prior and likelihoods are both Gaussian. Since *x*_{2} (*x*_{1} (

Equivalently,

From this equation,

To simplify the calculation of this integral, we can convert the difference distribution to a standard normal ϕ(·) by subtracting the mean and scaling all values by the inverse of the SD. The location on the standard normal curve that corresponds to the lower bound on the integral in Equation 28 is then equal to the original mean divided by the SD. This is useful because it allows us to integrate the standard normal above this (standardized) mean to find *x*_{2}. That is, instead of integrating the original normal from zero to infinity, we now integrate the standard normal up to the standardized mean. Lastly, we take advantage of the fact that the standard normal is symmetric about its mean to write the equation as follows:
*K*.

We can again take the perspective of the experimenter to demonstrate how to fit the ideal observer model to 2AFC data. This analytic solution is an efficient way to estimate the underlying parameters of the Bayesian ideal observer model given a dataset *T _{t}* is the participant’s response to stimulus pair

*t*. As in the point estimate case, we can solve for the best fitting parameters θ = {

*v*, γ, σ

_{1}, σ

_{2}} with MLE in which we minimize the following negative log-likelihood function:

#### Summary

Up to this point, we have described how to determine the posterior, the individual sensory estimates, the sensory estimate distribution, and the results of a 2AFC task for a Bayesian ideal observer with a Gaussian prior and likelihood. In the next section, we will generalize this framework by deriving the same quantities for an observer with a prior that can be modelled more flexibly as a mixture of Gaussian components.

### Modeling psychophysical data for an observer with a mixture of Gaussians prior

While the approach outlined in the previous section is computationally efficient, it assumes that the observer’s prior is well fit by a single Gaussian. This is unlikely to be the case assuming that the prior reflects knowledge of natural scene statistics, since many physical quantities have much heavier tails than a Gaussian (Dong and Atick, 1995; Sprague et al., 2015) or are even multimodal (Girshick et al., 2011; Kim and Burge, 2018). Accurately modeling these shapes is important. For example, long-tailed priors would predict that biases are reduced for stimulus values that fall within the the flatter regions of the stimulus probability distribution than in the more peaked regions. In this section, we propose an approach based on a mixture of Gaussians that retains some of the efficiency of the single Gaussian prior while better approximating realistic priors. Table 2 lists a summary of the additional notation adopted for this section.

Consider an observer with a prior defined by a mixture of *C* Gaussian components:
*w _{i}* ≥ 0 is the weight of the

*i*th component, with

_{i}and

*i*th Gaussian component, respectively (Fig. 5

*A*, red lines). If we assume a Gaussian likelihood with variance σ

^{2}, the posterior is also a mixture of Gaussians (Fig. 5

*B*, blue lines):

_{i}and

*i*th posterior component, respectively:

This is the mixture of Gaussians version of the posterior given in Equation 14. Here, *w _{i}* of the individual components of the prior, the scale factors

*ρ*(

_{i}*m*) for each of the components of the posterior (analogous to Eq. 5), and a normalization step to ensure the weights all sum to 1. To determine

*ρ*(

_{i}*m*) as:

_{i}with Equations 35 and 34, respectively, then simplifying, we obtain:

Note that *ρ _{i}*(

*m*) is inversely related to the difference between the measurement

*m*and the prior component mean

*ν*. Therefore, the posterior shape will change relative to the likelihood, not just shift as in the single Gaussian prior case. That is, as the measurement changes, the relative weight of each component changes. We can combine the scaling effects of

_{i}*w*and

_{i}*ρ*(

_{i}*m*) to define:

*v*to obtain the set of adjusted weights

_{i}In the following sections, we will first demonstrate how to fit the mixture of Gaussians prior to point estimation and 2AFC data using numerical evaluation of the log-likelihood. We then derive an analytical approximation that can reduce the computational load necessary to estimate the observer parameters.

#### Selecting a sensory estimate from the posterior

As before, let us first consider the case where we want to estimate a participant’s prior from a set of point estimates from an experimental dataset. We can use the posterior derived in Equation 33 and an appropriate loss function to define an optimal estimate

Without an analytical solution

#### Distribution of sensory estimates

We can use Equation 40 to define *m*. To demonstrate this, consider a simplified form where all

Instead, we can numerically estimate *m*} from *N* point estimates (all with the same sensory noise level, σ), we can then evaluate the likelihood of the data given the putative model parameters, θ, using Equation 17:

Note that we have abbreviated

The toolkit includes a function for this numerical approach (`fitEstimData_numerical.m`), which we will also return to in Results. This process can be computationally expensive, however, if we are trying to fit an observer’s prior with many Gaussian components (each of which is defined by three parameters *w*, ν, γ). While this may be acceptable for lower numbers of components and datasets that have already been collected, this is more problematic if the mixture of Gaussians model is used during the course of an experiment to guide an adaptive staircase.

To make the log-likelihood equation more tractable to solve, we can derive an approximate analytical solution for the point estimate distribution if we approximate Equation 40 using just the expected value of the measurement

This approximation allows us to solve for *m* in Equation 41:

We can then derive an analytic solution to

This approximates the true estimate distribution with a Gaussian with a mean

#### Two-alternative forced-choice task

As with the point estimate distributions, we will again describe a numerical and approximate analytical approach for handling data from a 2AFC task.

To numerically estimate the ideal observer’s prior from a set of experimental 2AFC data using a mixture of Gaussians prior, we can again use the general form of the log-likelihood defined in Equation 31. Here,

Compared with the single Gaussian case, the mixture of Gaussians decision boundary can be nonlinear for a few reasons. One reason is the dependence of each adjusted weight *m*: the weight of the shrinkage factor for each prior component decreases with a greater difference between the component mean and the likelihood mean. As a result, the perceptual bias that the prior exerts is different at different points along the stimulus domain. Nonlinear decision boundaries can also emerge when the prior is bimodal, with measurements biased in different directions depending on which mode is closest. A function for numerically evaluating `calcMoGPFxn_Numeric.m`).

As noted for the point estimation case with a mixture of Gaussians prior, this numerical calculation can be computationally expensive. We can, however, leverage the approximate analytical expression for the estimate distribution to define an approximate expression for the categorical data collected in a 2AFC experiment. The reason this is possible is that with this approximation, the two-point estimate distributions are Gaussian. Using the Bayesian least squares estimate

Note that we index the modified weights and means differently for the two stimuli (*i* for *x*_{1} and *j* for *x*_{2}) since these parameters of the posterior components are defined by both the prior components and the likelihood parameters, which differ whenever *x*_{1} is different from *x*_{2}. As before, we can derive an analytical (although approximate) solution to the psychometric function for the mixture of Gaussians approach using Equation 29, with the exception of substituting in the approximate estimate distribution

#### Code accessibility

The code is included as Extended Data 1 and is available at https://github.com/tsmanning/bayesIdealObserverMoG.

## Results

In this section, we will demonstrate that there are a number of ways to maintain the flexibility of the mixture of Gaussians approach while reducing the total number of parameters describing the prior, and then show that this approach can be used to fit leptokurtotic and bimodal distributions. Lastly, we show that the approximate 2AFC solution remains close to the numerical solution for a range of model parameters constrained to realistic values. Although we do not go into detail here about how to generate synthetic estimate or 2AFC data using a Bayesian ideal observer framework, we include some example code in the toolkit about how one might benchmark implementations of an observer model with a mixture of Gaussians prior `interactiveNumTrialsVSaccuracy.m`.

### Prior estimation error using mixture of Gaussians model with point estimation data

Theoretically, a mixture of Gaussians could fit an infinite number of prior shapes given enough Gaussian components in the model. But the number of model parameters increases by three for each additional component, potentially requiring large amounts of data to obtain reliable fits. Further, unrestricted models will likely be nonconvex with multiple local optima. These characteristics extend the number of iterations needed to find the global optimum of the log-likelihood objective functions at best and make it unlikely or impossible to find the global optimum at worst. In practice, unrestricted forms of the mixture of Gaussians model will likely need multiple optimization runs with different starting parameters to reliably minimize the log-likelihood functions. There are a few ways to maintain the flexibility of the mixture of Gaussian approach while reining in the number of parameters in the model.

In sensory subdomains where there is evidence that the probability of some stimulus values monotonically decreases with stimulus magnitude, such as the spectral content of retinal images (Field, 1987; Dong and Atick, 1995), we can reduce the number of parameters by a third in our ideal observer model by fixing all component means at zero. This allows us to model long-tailed distributions as can be seen in Figure 7*A*, and in fact, any distribution that is a member of the exponential power family with a peak at zero and power 1 ≤ *p* ≤ 2 can be approximated with enough components (West, 1987).

If there is not sufficient evidence that the true distribution of stimulus power in the environment is either symmetric or zero-peaked, one can take an alternative approach of tiling the components (Fig. 7*B*). Here, one defines a fixed number of components, their means, and their SDs and fits only the weights of the tiled components to the data. In this way, the mixture of Gaussians can approximate a prior with a peak at an arbitrary location, skewness, and kurtosis. This approach has been used previously with large numbers of components to approximate a “nonparametric” reconstruction of a complicated prior (Acerbi et al., 2014).

Here, we demonstrate proof of principle for both approaches by generating a synthetic dataset of 1000 point estimates using a zero-centered, non-Gaussian prior and a bimodal prior, and then recovering estimates of these priors using the mixture of Gaussians ideal observer model and the constraints illustrated in Figure 7.

We first defined a long-tailed prior using a Cauchy distribution *A* and calculating *γ _{i}* on the range

*ν*fixed at zero. Thus, the only observer parameters free to vary were the component weights

_{i}*w*, and the measurement noise level σ which was constant for all simulated stimuli (that is, we are assuming the stimulus properties that may affect this measurement noise are held constant throughout the experiment). The best fitting parameters

_{i}*m*} from the dataset of

_{i}*A*, as well as the correspondence between the true BLS estimates and the ones inferred through MLE. In general, the mixture of Gaussians model closely matches the true prior although each of the basis function components on their own are less kurtotic than their sum.

We then repeated this process using a bimodal prior defined by the normalized sum of two Gaussians *B*. Once again, the prior inferred from the data closely corresponds to the true prior, although this correspondence will change depending on the exact spacing and width of the basis functions (Fig. 8*B*).

### Error in mixture of Gaussians analytical approximation with 2AFC data

We will next examine how close the approximate analytical solution is to the numerical solution within a range of observer parameters that matches the biases and sensitivities seen in real human data.

#### Human bias

To get a sense of what a realistic range of biases is in the literature, we consider empircally measured perceptual biases for linear (i.e., noncircular, nonspherical) stimulus domains like speed and distance. For example, in Stone and Thompson (1992), participants performed a 2AFC speed judgment task in which they selected which of two contrast and speed-varying stimuli appeared to move faster. Depending on the contrast ratio between the two stimuli, biases in speed judgments ranged from ∼0.55 to 1.55 times the veridical speed. Similar results were found in later studies that developed Bayesian ideal observer models to explain these biases (Weiss et al., 2002; Stocker and Simoncelli, 2006). An analysis of speed judgments for contrast-varying stimuli in 2D and 3D (Cooper et al., 2016) found a bias of up to ∼1.75 times veridical. In a disparity judgment task, Burge and colleagues reported bias of ∼1.15 (Burge et al., 2010). Thus, we will ensure that the simulated observer parameter models will at least reach these levels in our error analysis. The relationship between bias and observer parameters is straight-forward for a single Gaussian prior and Gaussian likelihood. It is simply the fraction of the shrinkage factors

#### Sensitivity

The slope of the psychometric curve at the point of subjective equality (PSE; i.e., the value of *x*_{2} where

Although there is an infinite range of possible prior configurations to test, we will restrict ourselves here to two useful situations not well fit by a single Gaussian: (1) a prior with only zero-mean components creating a leptokurtotic unimodal distribution and (2) a bimodal prior.

#### Example 1: leptokurtotic unimodal prior

First, we randomly selected a set of stimulus and observer configurations 5000 times (Fig. 9*A*, top). Likelihood means were selected from a uniform distribution ranging from [–1, 1] and SDs *γ _{1}* was fixed at

*γ*was randomly selected from a uniform distribution on the range

_{2}*γ*to be only slightly larger than the largest σ ensured that the priors were non-Gaussian and long-tailed, and that the priors produced psychometric functions with a range of biases covering the targeted range (actual biases ranged from 0.55 to 1.83).

_{1}Next, we determined the difference between the psychometric curve resulting from the analytical and numerical approaches described in the previous section (Fig. 9*A*, bottom). We compare the approximate solution (Eq. 50) to a numerical evaluation (Eq. 25) for the observer with a mixture of Gaussian prior (blue and black lines). We also compare the best fit single Gaussian to the mixture of Gaussians prior (yellow line). In doing so, we can directly assess the improvement of the approximate mixture of Gaussians approach over the single Gaussian approximation. These results are plotted in Figure 9*B–E*. Figure 9*B*,*C* show the correspondence between the single Gaussian (Fig. 9*B*) and approximate mixture of Gaussians (Fig. 9*C*) models and the numerical evaluations. The points all fall near the identity line, indicated reasonable agreement, but the spread is clearly larger for the single Gaussian model. Figure 9*D* summarizes the errors, showing that the analytical mixture of Gaussians approximation has approximately three times lower RMS error than the single Gaussian fit. The error reduction is most profound about the PSE in the psychometric function, where the analytical and numerical approximations are essentially equivalent. This means that one can precisely estimate observer biases even for non-Gaussian priors in a computationally-efficient manner. The analytical approximation does show a slight tendency to overestimate the upper flank of the psychometric curve and underestimate the lower flank (visible with the mean signed errors; Fig. 9*E*), indicating a bias toward steeper psychometric functions. Thus, when using this approximation to fitting psychometric data of observers with heavy-tailed priors, this will produce prior and/or likelihood estimates that are narrower than the true values.

#### Example 2: bimodal prior

Next, we assess the performance of the mixture of Gaussians analytical approximation for fitting psychometric data from an observer with a bimodal prior. We randomly selected likelihood means and SDs in the same fashion as we did for the zero-mean prior. To define a bimodal, two-component prior on each randomization, we selected the component means

## Discussion

The Bayesian ideal observer framework has proven broadly useful for explaining perceptual phenomena in multiple sensory modalities. For example, a prior that peaks at zero speed (a “slow motion” prior) has successfully predicted systematic biases in judgements of the speed (Weiss et al., 2002; Stocker and Simoncelli, 2006) and direction (Weiss et al., 2002; Sotiropoulos et al., 2011; Rokers et al., 2018) of moving objects. A “light from above” prior about the position of the illuminant in a scene has been used to explain biases in the perceived shape of ambiguously shaded figures (Adams et al., 2004). Similarly, priors for viewing angle, convexity, and alignment between principal lines of curvature and surface contours can explain biases in the interpretation of surface curvature from simple line drawings (Mamassian and Landy, 1998). Other examples of the success of Bayesian perceptual models include prediction of biases in the timing of intervals between discrete events (Sohn and Jazayeri, 2021), the perceived structure in complex moving patterns (Yang et al., 2021), judgments in the orientation of contours (Girshick et al., 2011), and the orientation of surface tilt in natural scenes (Kim and Burge, 2018).

Here, we reviewed the straightforward approach for inferring Bayesian ideal observer models from psychophysical data when it is assumed that priors and sensory noise are Gaussian distributed. Following on a step-by-step formulation of this approach, we then extended the model to include prior distributions described with mixtures of Gaussians. In doing so, we build on previous work that has used mixture of Gaussian priors in other perceptual applications. For example, one group used a mixture of Gaussians to to define the relative probabilities of experimental stimuli and then probed suboptimalities in perceptual inference (Acerbi et al., 2014). Another group used a mixture of Gaussians approach to model human observer priors about homogeneity of orientation to understand biases in visual short-term memory tasks (Orhan and Jacobs, 2014).

Importantly, this mixture of Gaussians extension of the Bayesian ideal observer framework complements and expands on existing approaches for modeling the relationship between natural scene statistics and perceptual priors. First, if perceptual priors indeed match the non-Gaussian distributions of natural stimuli, then using a mixture of Gaussians model of priors may improve how well we predict perceptual biases when stimulus measurements fall on different regions of the stimulus domain, as compared with a single Gaussian model. Second, the mixture of Gaussians approach provides a tool for researchers to constrain Bayesian models using empirically measured stimulus statistics. Bayesian models have faced criticism because of their lack of constraint in how the priors or likelihoods are defined (Jones and Love, 2011; Marcus and Davis, 2013; Rahnev and Denison, 2018). One way to constrain the prior is to assume that the visual system has veridically learned the statistics of natural scenes and these learned statistics are reflected in the prior. In this case, one could define the ideal observer prior with a mixture of Gaussians that matches an empirically measured distribution of scene statistics, forgoing the need to fit the prior from perceptual judgment data. Indeed, several groups have made progress in the estimating the distribution of spectral content in terrestrial scenes (Field, 1987; Dong and Atick, 1995), tilt of objects in natural scenes (Burge et al., 2016), binocular disparity (Sprague et al., 2015), and the spectral content of retinal motion during eye and head movements (DuTell et al., 2020). While the match between estimates of natural statistics and perceptual biases has been investigated previously with numerical methods (Girshick et al., 2011; Sprague et al., 2015), a (relatively) low dimensional parameterization of these stimulus distributions opens up new opportunities for efficiency and experimental investigations.

### Limitations and alternative approaches

Although numerical estimation of the model parameters will find an exact solution with sufficient precision, this is not always possible in practice. Estimating

While the approximate analytical method dramatically improves the computational efficiency of the ideal observer parameter estimation, it deviates from the true solution for *x*_{2} gets from the point of subjective equality. As shown in Figure 10, this method is also especially prone to errors away from the PSE when the prior or posterior are bimodal. These problems can be mitigated in a few ways. If the approximate analytical method is to be used to adaptively select stimuli during an experiment, the numerical approach can be used after data collection to reach a more accurate solution. If there is good reason a priori to think that an observer’s prior is bimodal (e.g., based on natural stimulus statistics), one can just fall back to the numerical solution.

Throughout this document, we assert that the ideal observer likelihood and measurement distributions are Gaussian along the domain in which the observer encodes the stimuli. Other model parameterizations, however, have been proposed that constrain the likelihood based on physiology and other assumptions, and result in notably asymmetric, non-Gaussian likelihoods (Zhang and Stocker, 2022). We also focus here on stimuli defined along a linear axis (e.g., position, velocity, binocular disparity), and therefore, the methods as presented cannot be directly applied to perceptual judgments about stimuli defined on a circular axis (e.g., orientation, visual motion direction, position of an illuminant). Despite this limitation, previous work has successfully used circular statistics to explain perceptual biases with a Bayesian ideal observer model (Mamassian and Landy, 1998; Burge et al., 2016). As a circular analog of the Gaussian, a mixture of von Mises distributions is a natural extension of the mixture of Gaussians approach.

Finally, we focus here on perceptual priors and not priors involved in decision-making or perception-action contingencies. Decision strategies could presumably affect the loss function as well, if there was an advantage to taking some other summary statistics from the posterior distribution instead of the least squares estimate. The influences of these strategies have been considered elsewhere (Chambers et al., 2019) but are out of the scope of the current work.

In conclusion, many scientific questions about how prior knowledge is incorporated into perceptual judgments and perceptually-guided behaviors remain unanswered. Within the Bayesian framework, for example, do priors vary significantly between observers and do they vary between different tasks? How closely do priors follow from the statistics we can measure empirically from the environment across multiple stimulus domains? How adaptable are priors in response to changing stimulus statistics? A major limiting factor in answering these questions is the accuracy and efficiency with which we can estimate people’s priors from experimental data. Broadening the computational toolkit for experimenters and modelers to address this challenge is an important component of the larger effort to advance our understanding of the transformation from sensation to perception.

### Extended Data 1

The archive bayesIdealObserverMoG.zip contains MATLAB code for generating the figures in this document, for performing the model comparison presented in Figures 9 and 10, and for fitting psychophysical data from stimulus estimation and 2AFC tasks. Download Extended Data 1, ZIP file.

## Footnotes

The authors declare no competing financial interests.

This work was supported by National Institute of Health Grants F32 EY03232 and T32 EY007043 (to T.S.M.); the National Science Foundation Award 2041726 (to E.A.C.); the Aspire Grant VRI20-10 (to B.R.); and the McKnight Scholar’s Award, the Simons Collaboration on the Global Brain (SCGB) Grant AWD543027, and the National Institutes of Health BRAIN Initiative Grant R01EB026946 (to J.W.P.).

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.