## Abstract

Typical statistical practices in the biological sciences have been increasingly called into question due to failures to replicate a growing number of studies, many of which are confounded by the relative difficulty of null significance hypothesis testing designs and the interpretation of *p*-values. Bayesian inference, representing a fundamentally different approach to hypothesis testing, is receiving renewed interest as a potential alternative or complement to traditional null significance hypothesis testing due to its ease of interpretation and explicit declaration of prior assumptions. Bayesian models are more mathematically complex than equivalent frequentist approaches, which has historically limited their application to simplified analysis cases. However, the advent of probability distribution sampling tools, coupled with exponential increases in computational power, now allows for quick and robust inference under any distribution of data. Here we present a practical tutorial on the use of Bayesian inference in the context of neuroscientific studies, using both rat electrophysiological and computational modeling data. We first provide an intuitive discussion of Bayes' rule and inference, followed by the formulation of Bayesian-based regression and ANOVA models using data from a variety of neuroscientific studies. We show how Bayesian inference leads to easily interpretable analysis of data while providing an open-source toolbox to facilitate the use of Bayesian tools.

## Significance Statement

Bayesian inference has received renewed interest as an alternative to null significance hypothesis testing for its interpretability, ability to incorporate prior knowledge into current inference, and robust model comparison paradigms. Despite this renewed interest, implementations of Bayesian inference are often stymied by undue mathematical complexity and misunderstandings underlying the Bayesian inference process. In this article, we aim to empower neuroscientists to adopt Bayesian statistical inference by providing a practical methodological walkthrough using single and multiunit recordings from the rodent auditory circuit accompanied by a well-documented and user-friendly toolkit containing regression and ANOVA statistical models commonly encountered in neuroscience.

## Introduction

The scientific process generates new experiments and updates models based on quantitative differences among experimental variables, evaluated by statistical inference. Inference tools are foundational to these studies, providing the necessary machinery to draw decisions and conclusions from data. Frequentist-based null significance hypothesis testing (NSHT) has been the gold standard of inference in neuroscience and science at large, in part due to the computational simplicity of frequentist models compared with permutation sampling or Bayesian-based methods. A significant problem in the current practice of NSHT, however, arises from the adoption of the *p*-value as the de facto metric of experimental “success,” notorious for its difficulty in interpretation and correct usage (Krueger and Heck, 2019). The confluence of exponential increases in computational power with wider discussion of the problems of NSHT usage has created renewed interest in Bayesian inference as an alternative to frequentist NSHT, one offering interpretability benefits over the *p*-value and NSHT overall.

The use of *p*-value thresholds as a ubiquitous decision rule in frequentist methods is fraught with problems due to fundamental misunderstandings of their use and interpretability and, most pathologically, their susceptibility to intentional and unintentional *p*-hacking (Nuzzo, 2014). Contrary to the initial intent of Ronald Fisher (Fisher, 1992), the *p*-value has often become the gatekeeper of significance in studies. In this role, it limits deeper observations into data, and it is often used without the experimental design needed to ensure proper use and control. Statistical inference methods require first defining a statistical model with the power to adequately describe the data-generating process. Inference is then performed to estimate the population distribution from limited samples of observed data. Once estimates of population distributions are made, a determination is made as to whether these distributions represent a significant effect. NSHT is in some ways a victim of its own success, with common practice distilled to chasing the somewhat arbitrary *p* < 0.05 measure of significance devoid of model or data considerations (Krueger and Heck, 2019). Furthermore, even in the best of experimental designs, the *p*-value is arguably a surrogate for what a researcher is most interested in: how likely it is that observed data reflect some effect different from null (Kruschke, 2011; Gelman and Shalizi, 2013).

Bayesian methods offer a solution to the problem of pathological *p*-value use and interpretation by providing a direct measure of the probability that observations have some significant effect (Kruschke, 2011; Gelman and Shalizi, 2013). This is done by reallocating probability across candidate parameter values in a mathematical model of the data-generating process, leading to the probabilistic estimates desired by, but not attainable with, *p*-value analyses. Bayesian methods are inherently data driven; models are built with prior knowledge directly incorporated alongside parameters estimated from observed data.

Bayesian inference, though developed earlier than frequentist approaches, was not adopted as the primary inference paradigm due to the computational demands necessary to solve inference problems outside of certain canonical forms (Bishop, 2006) and the later adoption of the frequentist interpretation of probability (Fienberg, 2006). Inference on arbitrary distributions required deeper mathematical knowledge and the computation of integrals that were potentially intractable without modern numerical integration techniques. Frequentist paradigms, however, were more easily adapted to computationally simple algorithms, allowing researchers to “do statistics” without extensive formal training. Exponential increases in computational power, together with the development of powerful Markov chain Monte Carlo (MCMC) sampling methods, now allow researchers to perform meaningful Bayesian inference on arbitrary distributions underlying observed data (Gilks et al., 1996). While Bayesian approaches have received some attention for inference in electrophysiology (Wood et al., 2004; Wu et al., 2006; Vogelstein et al., 2009; Cronin et al., 2010; Gerwinn et al., 2010; Park and Pillow, 2011), the advantageous interpretability and data-driven nature of Bayesian statistics have not yet been widely exploited in neuroscientific studies. This is in part a pedagogical problem: most neuroscientists do not encounter Bayesian statistics during formal training, a gap compounded by the perception that a high level of mathematical acuity is necessary to perform Bayesian inference.

The goal of this tutorial is to remedy the opacity that often accompanies discussions of Bayesian inference by providing simple, step-by-step walkthroughs of Bayesian inference with four common inference paradigms utilizing tools that facilitate Bayesian computation for users at all levels of mathematical skill. We also aim to demonstrate the explanatory power of Bayesian inference in the context of neuroscience data. While the aim of this article is focused on application, this tutorial will begin with a brief introduction to Bayes' rule and its constituent components necessary for inference. For more theoretical and mathematical considerations of Bayesian inference, see the following books and articles: Gerwinn et al. (2010), Colombo and Seriès (2012), Bielza and Larranaga (2014), Kruschke (2014), Kruschke and Vanpaemel (2015), Ma (2019), Gelman et al. (2021), Van De Schoot et al. (2021), and Cinotti and Humphries (2022).

### Outline of Bayesian methods

To best facilitate practical application of Bayesian inference methods in neuroscience, a variety of datasets were acquired and analyzed. Data acquisition paradigms are described below in the Materials and Methods section. Bayesian inference is introduced in the context of regression to infer changes in inferior colliculus (IC) single-unit firing from changing auditory stimuli. *T* test-like group comparisons are demonstrated using computational models of basal ganglia–thalamocortical (BGTC) local field potentials (LFPs). Bayesian implementations of repeated measures and random effects models are demonstrated using chronic, multichannel single-unit recordings from the auditory cortex. Finally, Bayesian ANOVAs and ANCOVAs are utilized in assessing age-related changes in IC single-unit firing. All model implementations are available in our Bayesian inference for neuroscience toolbox at https://github.com/bscoventry/Practical-Bayesian-Inference-in-Neuroscience-Or-How-I-Learned-To-Stop-Worrying-and-Embrace-the-Dist and permanently indexed in our Zenodo repository at 10.5281/zenodo.11206116.

### A simple example of Bayesian inference in neuroscience

As a simple, motivating example of how Bayesian statistics can be performed in neuroscientific studies, consider a hypothetical in vitro whole-cell patch-clamp recording of a cortical neuron (Figs. 1, 2*A–C*). This will serve as a concrete example for explaining the components used in Bayesian inference. The experimental details are less important than the formulation of the Bayesian models but are given in some detail to help visualize the hypothetical study. Consider a study to assess the effect of a novel drug on firing rates in L2/3 cortical pyramidal neurons. The goal of the study is to assess the change in firing frequency versus injection current before and after drug application during current-clamp recording. To complete Bayesian inference, the state of knowledge of firing rates should be quantified in a structure known as the prior distribution, and the observed data are encoded in a structure called the likelihood function. The prior distribution and likelihood function are then used to create a posterior distribution from which all statistical inference is derived.

### An introduction to Bayesian inference

Bayes' rule forms the basis for all Bayesian inference as the machinery to reallocate probability from prior distributions and observed data to the posterior. While a full mathematical and computational treatment of Bayes' rule is out of scope for this article, we begin by outlining the components of Bayes' rule, how inference is performed, and show, using the above hypothetical experiment, how priors and likelihood functions can be formed.

### Bayes' rule

Foundational to Bayesian approaches is a complementary, but epistemically differing, view of probability from that of frequentist approaches. While the frequentist perspective treats probability as the relative frequency of the occurrence of some event, the Bayesian perspective instead treats probability as the expectation of an event occurring, which can be used to quantify not only the state of knowledge of an event but also the uncertainty involved in measuring an event (Bishop, 2006; Kruschke, 2011, 2014; Van De Schoot et al., 2021). Traditionally, the Bayesian perspective has been called “belief” (Kruschke, 2010), a perhaps unfortunate name that belies the fact that the Bayesian perspective of uncertainty of an event is fundamentally quantifiable. Perhaps a better description of Bayesian belief is instead quantification of the state of knowledge by accounting for uncertainty. The cornerstone of Bayesian inference is Bayes' rule, defined as follows:

$$P(H|E) = \frac{P(E|H)P(H)}{P(E)},$$

where *H* is the quantification of the state of a hypothesis and *E* is the quantification of observed evidence. In the context of inference, it is helpful to explicitly state the role of the model in Bayesian formulations:

$$P(\theta|D,M) = \frac{P(D|\theta,M)P(\theta|M)}{P(D|M)},$$

where *M* is the model of the data-generating process, *D* is the observed data, and θ are the model parameters. The individual components of Bayes' rule are given names corresponding to the purpose they serve (Fig. 1*A*). Each component plays a key role in Bayesian inference, and each will be discussed briefly below.
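To make these components concrete, the reallocation of probability described by Bayes' rule can be sketched numerically on a discrete grid of candidate parameter values. This is a hypothetical illustration: the firing rates, prior mean, and noise level below are invented for demonstration, not taken from the studies analyzed here.

```python
import numpy as np

# Hypothetical grid of candidate mean firing rates (Hz); illustrative values only
theta = np.linspace(0, 40, 401)

# Prior P(theta | M): normal, centered on an assumed literature-surveyed rate
prior = np.exp(-0.5 * ((theta - 15.0) / 5.0) ** 2)
prior /= prior.sum()

# Likelihood P(D | theta, M): probability of the observed rates under each
# candidate mean, assuming normally distributed observations (sigma assumed known)
data = np.array([18.2, 20.1, 17.5, 19.3])  # hypothetical observations
sigma = 2.0
log_like = np.sum(-0.5 * ((data[:, None] - theta[None, :]) / sigma) ** 2, axis=0)
like = np.exp(log_like - log_like.max())

# Evidence P(D | M): normalizing constant, summed over all candidate parameters
evidence = np.sum(like * prior)

# Posterior P(theta | D, M) via Bayes' rule
posterior = like * prior / evidence

# The posterior mode (MAP) sits between the prior mean and the data mean,
# reflecting the reallocation of probability toward the observed data
print(theta[np.argmax(posterior)])
```

Note how the posterior is a compromise between prior knowledge and data, with the data dominating as more observations accumulate.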

### The model evidence

The denominator term, *P*(*D*|*M*), is known as the model evidence or marginal likelihood: the probability of the observed data under the model, averaged over all possible parameter values weighted by the prior. In the context of inference over parameters, it acts as a normalizing constant, ensuring that the posterior is a proper probability distribution, and it forms the basis of Bayesian model comparison.

### The prior

The prior, *P*(θ|*M*), quantifies the state of knowledge about model parameters before the data are observed. For the hypothetical patch-clamp experiment, a survey of previously reported cortical firing rates can be used to construct a normally distributed prior (Fig. 2*B*, top) centered around the average firing rates of these cortical neurons across all studies surveyed before the experiment is conducted. The logic to this prior is that while the firing rate distribution in response to this novel drug is not known, the totality of observed firing rates reasonably approximates the range of possible firing states of that neuron.

This raises the question of what a prior might look like in newer avenues of study where a paucity of data exists or in situations where researchers and data analysts want the data to “speak for themselves” outside any influence of the prior. In these cases, priors can be designed to be “noninformative” or “weakly informative,” assigning broad, noncommittal distributions to the prior. One might assign a uniform distribution to the prior, effectively treating each parameter outcome as equally likely. Uniformly distributed priors do require some caution, however, as any parameter value outside the bounds of the uniform distribution is automatically assigned probability 0 in the posterior, even if that value has been observed (Fig. 1*B*, left). In many cases, it is better to allow small, but nonzero, probabilities for extreme values, such as the tails of a normal distribution, so that evidence for unexpected events is represented in the posterior given strong data (Fig. 1*B*, right). Conversely, priors can be made highly informative in situations where physiological bounds are well known and well studied, where extreme values are known to be biophysically irrelevant, impossible, or attributable to instrument noise (e.g., a large 50/60 Hz peak in a power spectrum indicative of wall power noise).
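The practical difference between a hard-bounded uniform prior and a weakly informative normal prior can be sketched numerically. All values below (firing-rate range, prior bounds, data location) are hypothetical choices for illustration:

```python
import numpy as np

theta = np.linspace(-10, 60, 701)  # candidate firing rates (Hz); hypothetical range

# Uniform prior on [0, 30]: anything outside the bounds gets probability 0 -- forever
uniform_prior = np.where((theta >= 0) & (theta <= 30), 1.0, 0.0)
uniform_prior /= uniform_prior.sum()

# Weakly informative normal prior: broad, but its tails keep extreme values possible
normal_prior = np.exp(-0.5 * ((theta - 15.0) / 10.0) ** 2)
normal_prior /= normal_prior.sum()

# Strong data centered at 40 Hz -- outside the uniform prior's support
log_like = -0.5 * ((theta - 40.0) / 1.0) ** 2
like = np.exp(log_like - log_like.max())

post_uniform = uniform_prior * like
post_uniform /= post_uniform.sum()
post_normal = normal_prior * like
post_normal /= post_normal.sum()

# The uniform-prior posterior piles up at the prior boundary (30 Hz), unable to
# follow the data; the normal-prior posterior tracks the data despite their
# surprise under the prior
print(theta[np.argmax(post_uniform)], theta[np.argmax(post_normal)])
```

This is the pathology described above: the uniform prior's hard bounds veto evidence the experiment actually produced, while the normal prior's tails let strong data dominate.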

### The likelihood

The likelihood function, *P*(*D*|θ,*M*), quantifies the probability of the observed data *D* given parameter values θ under model *M*. In the context of inference, the likelihood function updates information given in a prior distribution to the posterior distribution given the observed data (Etz, 2018). The likelihood function is generally not a proper distribution, in that it is conditioned on as-yet-unknown parameters and may not integrate to 1, but the evidence and prior terms ensure that the resultant posterior distributions are true probability densities. The idea of the likelihood function is present in both Bayesian and frequentist models but has vastly different interpretations. In the frequentist viewpoint, model parameters converge upon singular values, usually learned through maximum likelihood estimation. Bayesian approaches instead treat model parameters as distributions of credible values after observing the data at hand. For our hypothetical cortical slice experiment, the graduate student runs the study, plots the distribution of observed results, and finds that drug application produces small increases in mean firing rates that are very consistent between trials and cells, as evidenced by relatively tight distributions of data (Fig. 2*A*). However, the overlap of the naive and drug conditions means that the uncertainty of measurement of the drug effect must be quantified. The distributions of evoked firing rates as a function of injected current are the data encoded in the likelihood function. Upon observation of the acquired data, both the naive and drug-applied groups appear to be generated from a normally distributed underlying process. As such, a normally distributed likelihood function conditioned on the observed data is a good choice. This choice will be validated after the posterior is derived using a process called posterior predictive checking, described below.
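A small sketch makes the "not a proper distribution" point concrete. The data below are hypothetical, and the normal model with a known σ is a simplifying assumption; viewed as a function of the parameter, the likelihood peaks at the maximum likelihood estimate but need not integrate to 1:

```python
import numpy as np
from scipy import stats

# Hypothetical evoked firing rates (Hz) after drug application
data = np.array([21.3, 19.8, 22.5, 20.9, 21.7])

mu_grid = np.linspace(10, 30, 2001)
sigma = 1.5  # assumed known, for illustration only

# Likelihood L(mu) = P(data | mu): product of normal densities over observations,
# viewed as a function of the parameter mu rather than of the data
likelihood = np.array(
    [stats.norm.pdf(data, loc=mu, scale=sigma).prod() for mu in mu_grid]
)

# Integrating over mu does NOT give 1 -- the likelihood is not a probability
# distribution over parameters; the prior and evidence terms supply normalization
area = likelihood.sum() * (mu_grid[1] - mu_grid[0])
print(area)

# The likelihood still peaks at the most plausible parameter value (the sample mean)
print(mu_grid[np.argmax(likelihood)])
```

In a full Bayesian analysis this curve would be multiplied by the prior and divided by the evidence to yield a proper posterior density over μ.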

### The posterior

The prior, likelihood, and evidence then form the posterior, *P*(θ|*D*,*M*): the probability distribution over model parameters given the observed data and prior knowledge. The posterior is the central object of Bayesian inference; all parameter estimates, uncertainty quantification, and decisions about the data are derived from it.

### Estimation of the posterior

Historically, limited computational capacity posed a significant obstacle to the widespread adoption of Bayesian inference methodologies. Application of Bayes' rule requires computation of complex integrals, limiting early Bayesian inference to canonical prior-to-posterior transformations, called conjugate pairs. However, modern MCMC tools simplify the computation of these complex integrals, allowing for quick and easy estimation of arbitrary posterior distributions. MCMC involves the generation of random samples that converge to a target probability distribution, the details of which can be learned from the following reviews (Hoffman and Gelman, 2011; Betancourt, 2017).
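The core MCMC idea can be sketched with a minimal random-walk Metropolis sampler. The data and prior below are hypothetical, and a production analysis would use a mature sampler (e.g., the NUTS sampler in PyMC) rather than this toy; the key point is that the sampler only ever needs the posterior up to proportionality, so the evidence integral is never computed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed firing-rate differences (Hz); illustrative values only
data = np.array([2.1, 3.5, 2.8, 4.0, 3.1, 2.6])

def log_posterior(mu):
    # Normal prior (mean 0, sd 10) plus normal likelihood (sd 1), up to a
    # constant -- the unnormalized log posterior is all Metropolis needs
    log_prior = -0.5 * (mu / 10.0) ** 2
    log_like = -0.5 * np.sum((data - mu) ** 2)
    return log_prior + log_like

# Random-walk Metropolis: propose a jump, accept with probability
# min(1, posterior ratio); the chain's samples converge to the posterior
samples, mu = [], 0.0
for _ in range(20000):
    proposal = mu + rng.normal(0, 0.5)
    if np.log(rng.uniform()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    samples.append(mu)

samples = np.array(samples[5000:])  # discard burn-in
print(samples.mean())  # close to the analytical posterior mean
```

Modern samplers such as Hamiltonian Monte Carlo and NUTS (Hoffman and Gelman, 2011; Betancourt, 2017) follow the same accept/reject logic but propose moves far more efficiently in high dimensions.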

### Making decisions on the posterior

We define inference broadly as the process by which reasoning and decisions about phenomena are made from a sample of observations of the phenomena. Incorporation of prior knowledge in Bayesian inference allows for optimal decision-making on observed data (Blackwell and Ramamoorthi, 1982). The posterior contains all necessary information to make inferences on experimental data incorporating prior knowledge. However, it is best to consider the specific goals of inference before performing statistics. Possible goals of inference are as follows (Kutner et al., 2005; Kruschke, 2014):

1. Infer the parameters of a model.
2. Reject or confirm a hypothesis, null or otherwise.
3. Compare two or more competing models.

In the case of neuroscientific studies of unit electrophysiology, inferring model parameters occurs when an experiment aims to establish how neural firing rates change with changes in applied stimuli. Alternatively, one may want to confirm or reject a null hypothesis that a treatment has the desired effect or that there are differences between neural populations. Importantly, because Bayesian inference operates solely on the posterior distribution, one can confirm or reject competing hypotheses and not simply reject the null as in frequentist NSHT.

Regardless of the goal, inference always involves analyzing the posterior, which provides a complete representation of the distribution of a given parameter given the experimental data. Therefore, decisions about the data, the effect of model parameters, and/or which hypothesis has more evidence are made with calculations on the posterior. There are multiple decision rules that can be used to assess the posterior. The most common, and in our opinion the most intuitive, is the Bayesian credible interval. The credible interval is a statement of probability, describing the probability of a parameter estimate falling within a given range. As credible intervals are not strictly unique, Bayesian inference convention is to fix the interval to the smallest interval containing 95% of the posterior distribution density mass, called the highest density interval (HDI). Observations of posterior HDIs can then be used to assess the relative effect of a parameter. Regions of practical equivalence (ROPEs) may be included in the posterior distribution to explicitly define a range of values that are effectively equivalent to a null value, with parameters considered significant if the 95% HDI does not contain 0 or any values in the ROPE (Kruschke, 2018). Along with posterior HDIs, maximum a posteriori (MAP, distribution mode) estimates from the posterior are calculated to quantify the most likely parameter value. While decision rules are important for assessing the relative effect of statistical model parameters, we reiterate that simply passing a decision rule should not conclude the inference step. Inference should be made in the context of the evidence presented in model quality checks, observed data posterior distributions, and decision metrics.
Returning to our example of cortical neuron firing rates, the graduate student takes the prior derived from a survey of previous literature and the likelihood formed by the observed data and uses Bayes' rule to calculate posterior distributions. Plots of the posterior distributions and inference from the posterior show that application of the drug significantly increases cortical neuron firing rates (Fig. 2*C*, top; MAP estimate = 8.1). This effect is considered significant because the 95% HDI does not contain 0. This means that, given the observed data, the chance that there is no difference between groups has an upper bound of 5%. Given that the value 0 is far within the tail of the posterior distribution, the chance of no difference between groups is vanishingly small (but easily quantifiable if necessary). Measurement of the effect size distribution shows that the observed effect is most likely 3.6, with the probability of no effect also being vanishingly small. After completion of the experiment, the graduate student's advisor noticed that both in vitro and in vivo data were included in the prior, even though firing rates differ dramatically between those preparations in previous work. Rerunning the analysis with a highly informed prior (Fig. 2*B*, bottom) based only on previously observed in vitro data produces a similar result, but with a slight reduction in the uncertainty in the difference of group means as measured by group mean difference MAP estimates and 95% HDIs. In this case, comparing the weakly and strongly informed models also shows that relatively drastic changes in the prior make only minor changes in posterior estimates (Fig. 2*C*, bottom), suggesting that the observed data are the primary drivers of inference and are not unduly influenced by the prior. Taken together, broad, minimally informative and highly informative priors often lead to similar decisions on the data, and minimally informative priors are useful for cases where prior knowledge is sparse.
However, when prior knowledge is available, incorporating more information in the form of an informative prior backed by data can provide reduction in uncertainty in the estimation of study effects and better handle outliers within data.
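The HDI, MAP, and ROPE calculations used in decisions like the one above can be sketched directly from posterior samples. The samples below are simulated stand-ins for real MCMC output, and the ROPE half-width of 0.5 Hz is an assumed practical-equivalence threshold:

```python
import numpy as np

rng = np.random.default_rng(1)
# Simulated posterior samples of a group-mean difference (stand-in for MCMC output)
posterior = rng.normal(loc=3.6, scale=1.2, size=20000)

def hdi(samples, mass=0.95):
    """Highest density interval: the narrowest interval containing `mass`
    of the posterior samples."""
    x = np.sort(samples)
    n_in = int(np.ceil(mass * len(x)))
    widths = x[n_in - 1:] - x[: len(x) - n_in + 1]
    i = int(np.argmin(widths))
    return x[i], x[i + n_in - 1]

lo, hi = hdi(posterior)

# MAP estimate: the posterior mode, approximated here by the peak histogram bin
counts, edges = np.histogram(posterior, bins=100)
map_est = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])

# ROPE: values practically equivalent to null (assumed |effect| < 0.5 Hz);
# the effect is deemed credible if the 95% HDI excludes the entire ROPE
rope = (-0.5, 0.5)
credible_effect = lo > rope[1] or hi < rope[0]
print((lo, hi), map_est, credible_effect)
```

Because the whole posterior is in hand, any other summary (tail probability below zero, probability inside the ROPE) is a one-line computation on the same samples.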

This simple example shows one way in which priors and likelihoods are defined in neuroscience studies, in a manner most researchers unknowingly follow by running experiments and reading the literature. The prior serves to take intrinsic prior knowledge and quantify it, providing more transparent reporting of experimental results. Furthermore, the data form the likelihood function, facilitating inference that makes decisions on the data at hand while accounting for prior knowledge of the state of a particular field. The example also introduces the idea of using model comparison to assess prior influence on inference, which will be rigorously explored below.

### Inference using NSHT

While theoretical comparisons of Bayesian inference versus NSHT are out of scope for this article, a brief description of decision-making using NSHT is warranted to put Bayesian inference in context. For detailed descriptions and discussions of NSHT, see the following reference texts and articles: Kutner et al. (2005), Nieuwenhuis et al. (2011), Maris (2012), Kruschke (2014), and Bzdok and Ioannidis (2019). Proper NSHT experimental design is performed before data acquisition, with sample size, statistical tests, and significance levels set to achieve a desired level of statistical power. However, proper NSHT experimental design is largely not followed, in part due to lack of proper statistical training and external career pressure (Munafò et al., 2017) as well as the relative difficulty of data acquisition (Button et al., 2013). We believe that Bayesian approaches provide a way to do statistics more in line with current neuroscientific practice, in which beliefs about data are updated as data are acquired, informed by the current state of the field. Statistical power calculations rely on estimates of population variances and differences in group means; we note that this is somewhat analogous to the declaration of prior distributions in Bayesian inference. After data are acquired using the predefined stopping rule, NSHT inference is then performed. Similar to Bayesian inference, NSHT utilizes a parameterized likelihood function (Fig. 1*C,D*) that describes the data-generating process. The null hypothesis is the likelihood function descriptive of a nonexistent effect. Statistical significance is most commonly assessed through the *p*-value metric, a measure quantifying the probability of obtaining a value of a given test statistic at least as extreme as the one observed under the assumption of the null hypothesis.
Another common metric for statistical significance and parameter estimation is the confidence interval, which utilizes sample distributions to estimate parameters of the population distribution. Importantly, unlike data-driven Bayesian approaches, both *p*-values and confidence intervals draw inference in part from data that are not observed, owing to calculations over hypothetical replications of an experiment (Wagenmakers, 2007; Kruschke, 2014). Comparisons between Bayesian methods and NSHT are summarized in Table 1.

### Stopping problem bias in *p*-value calculation

A critical calculation in frequentist inference is the determination of the probability of correctly rejecting a null hypothesis when the alternative is true, called statistical power. As collecting data is an expensive venture, data collection stopping rules are defined such that a desired power is obtained while minimizing the total samples needed to uncover a desired statistical effect. While this stopping rule is explicitly stated in preregistered studies designed to elucidate a particular effect, a large portion of the neuroscientific literature does not state stopping rule intentions. It is, however, known that the choice of stopping rule fundamentally changes *p*-value results (Lindley and Phillips, 1976; Berger and Berry, 1988; Frick, 1998). To illustrate this, consider another simple hypothetical neuroanatomical experiment [adapted from Kruschke (2014)]. The striatum is a critical juncture in the basal ganglia circuit of the motor pathway. Striatal function is governed by two classes of neurons differentially expressing D1 and D2 dopaminergic receptors, respectively (Parker et al., 2018), with D1- and D2-expressing neurons relatively homogeneously distributed in the dorsal striatum of the mouse. Consider an experiment counting the number of D1-expressing medium spiny neurons (MSNs; *z*) identified in a total sample of 1,000 (*N*) MSNs. Under a stopping rule in which *N* is fixed in advance, the probability of observing *z* counts in *N* samples is a binomial distribution:

$$P(z|N,\theta) = \binom{N}{z}\theta^{z}(1-\theta)^{N-z},$$

where θ is the underlying proportion of D1-expressing MSNs (Fig. 2*D*). In this scenario, we are considering whether counts are bigger or smaller than our null hypothesis, necessitating a one-tailed critical value. Calculating the *p*-value for these data under this stopping condition yields one answer; a different stopping rule, however, can yield a different *p*-value from the exact same data.

Now, consider a second stopping rule. When asked why data collection was stopped, the researchers respond that identification of >525 D1-expressing MSNs and <500 D2-expressing MSNs was enough to establish the desired trends, with the total sample size of 1,000 occurring by chance. Note that while the data and sample size remain exactly the same, we are no longer discussing the probability of getting *z* counts in *N* samples; we instead have the probability of taking *N* samples to get *z* counts. This flips *N* from a fixed to a random variable and *z* from a random to a fixed variable. In this case, the probability of taking *N* samples to get *z* counts is a negative binomial distribution (Fig. 2*E*):

$$P(N|z,\theta) = \frac{z}{N}\binom{N}{z}\theta^{z}(1-\theta)^{N-z}.$$

Calculating the *p*-value under this stopping rule results in a different value from the exact same data, illustrating the danger of defining stopping rules *post hoc* without proper statistical design and consideration.
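The dependence of the *p*-value on the stopping rule can be verified directly with SciPy. The counts below are hypothetical stand-ins (not the values from the example above), chosen only to show that the same *z* and *N* produce different one-tailed *p*-values under the two sampling schemes:

```python
from scipy import stats

# Hypothetical counts: z D1-expressing MSNs identified in N total sampled cells,
# testing the null proportion theta0 = 0.5
N, z, theta0 = 1000, 470, 0.5

# Stopping rule 1 -- N fixed in advance: z is binomial, and the one-tailed
# p-value is P(Z <= z | N, theta0)
p_fixed_N = stats.binom.cdf(z, N, theta0)

# Stopping rule 2 -- sample until z successes: N is now the random variable
# (negative binomial). The one-tailed p-value is P(N' >= N | z, theta0), the
# probability of needing at least this many samples. SciPy's nbinom counts
# failures k = N - z, so this tail is P(k >= N - z)
p_fixed_z = stats.nbinom.sf(N - z - 1, z, theta0)

# Same data, same summary statistics -- different p-values
print(p_fixed_N, p_fixed_z)
```

Both values sit near the conventional 0.05 threshold yet disagree, so with other counts the two rules can land on opposite sides of the significance cutoff for identical data.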

Bayesian approaches do not suffer from the same stopping rule problem. Being a “data-driven” approach, all data observations are codified in the likelihood function, which is independent of any stopping rule (Anscombe, 1954; Kruschke, 2010; Van De Schoot et al., 2021). Examples of the unchanging property of the likelihood function can be found in numerous studies (Berger and Berry, 1988; Gelman et al., 2009; Kruschke, 2014). Bayesian models can certainly be influenced in a similar way by choosing a prior distribution that does not appropriately reflect the state of knowledge of the data-generating process. However, as the prior distribution is reported explicitly and encapsulates the researcher's knowledge of the data-generating process, ill-fitting priors are more easily identified at the stage of manuscript review as well as the pre- and postreview stages. This is in contrast to current trends in frequentist inference, where sampling intentions may not be directly stated unless the study has been actively preregistered.

### Error quantification and model comparison

Critical to any statistical model and inference therein is its fit to observed data. While it is entirely possible to perform linear regression on data distributions that are highly nonlinear, predictions and inference made by the model will likely be inaccurate. Both Bayesian and frequentist inference offer robust model error quantification. Bayesian approaches, however, can utilize the posterior distribution to not only quantify and bound the distribution of model errors but also include post hoc posterior predictive sampling as part of the inference paradigm. Posterior predictive sampling involves making random draws from the posterior and building a sampling distribution. This distribution is then compared with the observed data distribution to quantify the model's disparity from observed data. Along with posterior predictive checks, prior predictive checks act as a sensitivity measure of the influence of the prior distribution on the posterior distribution. Taken together, Bayesian inference thus allows for robust statistical inference on observed experimental data which appropriately includes prior knowledge of the state of the field.
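A posterior predictive check can be sketched as follows. The observed data and posterior draws below are simulated stand-ins (in practice the (μ, σ) draws come from MCMC), and the choice of the standard deviation as the test statistic is one common option among many:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical observed data and posterior samples (mu, sigma) for a fitted
# normal model -- stand-ins for real data and real MCMC output
data = rng.normal(20.0, 2.0, size=50)
mu_post = rng.normal(data.mean(), 0.3, size=1000)
sigma_post = np.abs(rng.normal(2.0, 0.2, size=1000))

# Posterior predictive sampling: for each posterior draw, simulate a replicated
# dataset of the same size as the observed data
replicates = rng.normal(mu_post[:, None], sigma_post[:, None], size=(1000, data.size))

# Compare a test statistic between observed and replicated data; a posterior
# predictive p-value near 0 or 1 flags model misfit, values near 0.5 indicate
# the model reproduces this aspect of the data well
obs_stat = data.std()
rep_stats = replicates.std(axis=1)
ppp = np.mean(rep_stats >= obs_stat)
print(ppp)
```

Repeating this with several test statistics (mean, range, autocorrelation) probes different failure modes of the model; a well-fitting model should not be consistently surprised by its own data.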

### Formulation of models and applied Bayesian inference

There are a multiplicity of programs and programming languages that facilitate Bayesian analysis, including standalone programs like JASP (Love et al., 2019) and probabilistic programming language packages such as BUGS (Brooks, 2003) and Stan (Carpenter et al., 2017). We chose to use PyMC (Salvatier et al., 2016) for its ease in explicitly declaring probability distributions and its implementation in Python, which is in common use in neuroscientific data analysis. Model formation is often conserved between frequentist and Bayesian approaches; it is only the mode of inference that differs. However, for clarity, we will discuss both model formation and performing inference in the subsequent sections.

## Materials and Methods

Bayesian inference was performed on a range of data typical to neuroscience experiments. Regression models, ANOVA models, and group comparisons are performed on single-unit activity recorded from IC neurons in response to auditory stimuli in young and aged rats (Palombi et al., 2001; Simon et al., 2004; Rabang et al., 2012; Herrmann et al., 2017; Bartlett et al., 2024). Random-effects regression models are performed on single units recorded in the auditory cortex (A1) using high-density recording arrays in response to infrared neural stimulation (INS; Izzo et al., 2007; Cayce et al., 2011, 2014; Coventry et al., 2024) of the medial geniculate body (MGB). All surgical procedures used in this study were approved by the Institutional Animal Care and Use Committee of Purdue University (West Lafayette, IN, #120400631) and in accordance with the guidelines of the American Association for Laboratory Animal Science and the National Institutes of Health guidelines for animal research. Sample sizes and subject details are given in their corresponding sections.

##### Computational hardware

To underscore that meaningful Bayesian inference does not require cluster computing or extensive computational resources, all computations were performed on a Windows MSI GS-66 laptop with an Intel i7 central processing unit (CPU) with an Nvidia RTX2070 graphics processing unit (GPU). Our inference programs are CPU-bound, not requiring any GPU resources. Computations can be performed on most modern CPUs, but are accelerated with more CPU threads and cores and parallelization on GPUs.

##### Bayesian linear regression and analysis of covariance assessment of the disruption of temporal processing in the IC due to aging

The IC is the major integrative center of the auditory pathway, receiving excitatory inputs from the ventral and dorsal cochlear nuclei, excitatory and inhibitory inputs from the lateral and medial superior olivary complex (Kelly and Caspary, 2005), and inhibitory inputs from the superior paraolivary nucleus and the dorsal and ventral nuclei of the lateral lemniscus (Cant and Benson, 2006; Loftus et al., 2010). The IC encodes auditory information through hierarchical processing of synaptic inputs with local IC circuitry (Caspary et al., 2002; Rabang et al., 2012; Grimsley et al., 2013). Age-related changes in auditory processing primarily arise as deficits in temporal processing (Frisina and Frisina, 1997; Parthasarathy et al., 2010; Parthasarathy and Bartlett, 2012; Herrmann et al., 2017; Bartlett et al., 2024). This dataset is composed of single-unit responses recorded from 9 young (age ≤ 6 months) and 8 aged (age ≥ 22 months) male Fisher 344 rats. Auditory brainstem responses were recorded from animal subjects a few days prior to surgery to ensure hearing thresholds were typical of the rodent's age. Single-unit recordings were performed in a 9′ × 9′ double-walled, electrically isolated anechoic chamber (Industrial Acoustics). Animals were initially anesthetized via a bolus injection of a ketamine (VetaKet, 60–80 mg/kg) and medetomidine (0.1–0.2 mg/kg) mixture via intramuscular injection. Oxygen was maintained via a manifold and pulse rate and blood oxygenation monitored through pulse oximetry. Supplemental doses of ketamine/medetomidine (20 mg/kg ketamine, 0.05 mg/kg medetomidine) were administered intramuscularly as required to maintain the surgical plane of anesthesia. An incision was made down midline and the skull exposed. Periosteum was resected and a stainless steel headpost was secured anterior to the bregma via three stainless steel bone screws. A craniectomy was made above the IC (−8.5 mm anterior/posterior and 1 mm medial/lateral from the bregma). 
A single tungsten electrode was advanced dorsally toward the central nucleus of the IC during which bandpass noise (200 ms, center frequencies 1–36 kHz in five steps per octave, 0.5 octave bandwidth) was delivered. The central nucleus of the IC was identified based on short-latency-driven responses to bandpass noise search stimuli with ascending tonotopy and narrowly tuned responses to pure tones of varying frequencies. Once neurons were identified, responses from 5 to 10 repetitions of sinusoidal amplitude-modulated (SAM) tones (750 ms tone length, modulation depth between −30 and 0 dB) were recorded using a preamplifying headstage (RA4PA, Tucker-Davis Technologies) and discretized at a sampling rate of 24.41 kHz (RZ-5, TDT). SAM tones were defined as follows:

$$s(t) = A\left[1 + m\sin(2\pi f_m t + \phi)\right]C(t)$$

where *m* is the modulation depth ranging between −30 and 0 dB, *f<sub>m</sub>* is the modulation frequency, *ϕ* is the reference phase of the modulator, *A* is the scaling factor for the stimulus sound level, and *C*(*t*) is the carrier tone. Single units were filtered between 0.3 and 5 kHz. Offline spike sorting was performed using OpenExplorer (TDT).

##### Bayesian multilinear regression assessment of thalamocortical recruitment from INS

INS is an optical technique that uses coherent infrared light to stimulate nerves and neurons without the need for genetic modification of the target or direct contact with tissue, offering spatially constrained activation beyond that of electrical stimulation (Wells et al., 2005; Izzo et al., 2007; Cayce et al., 2011, 2014; Coventry et al., 2020, 2024). In this study, seven male Sprague Dawley rats were chronically implanted in A1 with 16-channel planar Utah-style arrays (TDT) and stimulating optrodes in the MGB of the auditory thalamus (Thorlabs). Rodents were initially anesthetized with a bolus injection of a ketamine (80 mg/kg) and medetomidine (0.2 mg/kg) cocktail. Oxygen was maintained via a manifold and pulse rate and blood oxygenation monitored through pulse oximetry. Supplemental doses of ketamine/medetomidine (20 mg/kg ketamine, 0.05 mg/kg medetomidine) were administered intramuscularly as required to maintain the surgical plane of anesthesia. An incision was made down midline and the skull exposed. The periosteum was removed via blunt dissection, and three stainless steel bone screws were placed in the skull for headcap stability. An additional titanium bone screw was placed in the skull to serve as a chronic ground and reference point for recording electrodes. Craniectomies were made above the MGB (−6 mm anterior/posterior, −3.5 mm medial/lateral from the bregma) and auditory cortex (−6 mm anterior/posterior, −5 mm medial/lateral from the bregma). Fiber-optic-stimulating optrodes were placed in the midpoint of the MGB (−6 mm dorsal/ventral from the dura) and affixed to the skull using a UV-curable dental acrylic (Midwest Dental). A 16-channel planar recording array was putatively placed in layers 3/4 of the auditory cortex, with placement confirmed by short-latency, high-amplitude multiunit activity elicited from bandpass noise (200 ms, center frequencies 1–36 kHz in five steps per octave, 0.5 octave bandwidth) test stimuli. Recording electrodes were sealed onto the headcap. 
Animals were allowed to recover for 72 h prior to the beginning of the recording regime. All recordings were performed in a 9′ × 9′ electrically isolated anechoic chamber. During recording periods, animals received an intramuscular injection of medetomidine (0.2 mg/kg) for sedation. Optical stimuli were delivered from a 1,907 nm diode laser (INSight open-source optical stimulation system) coupled to the optrode with a 200 µm, 0.22 NA fiber (Thorlabs FG200LCC). Laser stimuli were controlled via an RX-7 stimulator (TDT) and consisted of train stimuli with pulse widths between 0.2 and 10 ms, interstimulus intervals between 0.2 and 100 ms, and energies per pulse between 0 and 4 mJ. Applied laser energies were randomized to limit effects from neural adaptation, with 30–60 repetitions per pulse width/interstimulus interval combination. Signals from recording electrodes were amplified via a Medusa 32-channel preamplifier, discretized and sampled at 24.414 kHz with an RZ-2 biosignal processor, and visualized using OpenEx software (TDT). Action potentials were extracted from raw waveforms via real-time digital bandpass filtering with cutoff frequencies of 300–5,000 Hz. Single units were extracted offline via superparamagnetic clustering in WaveClus (Quiroga et al., 2004). Studies were performed to assess the dose–response profiles of optically based deep brain stimulation (DBS) over the span of several months. As each electrode recorded diverse populations of neurons, which are potentially subject to change due to electrode healing-in, age of the device, and adaptation to the stimulus, a within-subject, repeated-measures regression model was warranted. Bayesian hierarchical regressions can easily handle complex models such as these. These data were part of a previous study (Coventry et al., 2024).

##### Bayesian *t* test assessment of computational models of BGTC function in Parkinson's disease

Parkinson's disease is a chronic and progressive neurological disorder resulting from a loss of dopaminergic neurons in the substantia nigra of the basal ganglia circuit (Dickson, 2018; Bove and Travagli, 2019). DBS, a technique by which therapeutic electrical currents are delivered to subthalamic neural structures using chronically implanted electrodes, has arisen as a potent treatment for motor symptoms of Parkinson's disease (Cagnan et al., 2019; Guidetti et al., 2021). However, the neural mechanisms underlying DBS remain unclear (Neumann et al., 2023). Computational models of basal ganglia-thalamocortical (BGTC) circuit function in response to DBS provide insight into alterations of circuit dynamics as well as potential therapeutic targets. In prior work, a modified human mean-field BGTC model (Grado et al., 2018; Coventry and Bartlett, 2023) was implemented to study network DBS-encoding mechanisms in dopamine-depleted states. In the case of BGTC circuits, mean-field modeling consists of the average extracellular electric field response of collections of neurons with cellular properties modeled after each stage of the BGTC circuit in both healthy and dopamine-depleted states (Van Albada and Robinson, 2009). In each model trial, local field potentials (LFPs) were recorded from the globus pallidus internus in resting state and stimulation conditions. Stimulation trials consisted of subthreshold (0.01 mA, 100 µs pulse width, 130 Hz) or suprathreshold (1.8 mA, 240 µs pulse width, 130 Hz) DBS of the subthalamic nucleus. LFP activity in the β-band (13–30 Hz) is a known biomarker for Parkinson's symptomology (Little et al., 2013; Guidetti et al., 2021). As such, LFP power spectral density estimates of β-band activity were calculated. The total LFP power in the β-band was calculated as follows:

$$P_\beta = \int_{13}^{30} S(f)\,df$$

where $S(f)$ is the LFP power spectral density at frequency *f*.
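As a sketch of this computation on a synthetic LFP trace (SciPy's Welch estimator is assumed here for the PSD; the signal parameters are illustrative, not the model's):

```python
import numpy as np
from scipy.signal import welch

# Synthetic LFP stand-in: a 20 Hz (beta-band) oscillation plus broadband noise
fs = 1000                                  # sampling rate (Hz)
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(0)
lfp = np.sin(2 * np.pi * 20 * t) + rng.normal(0, 1, t.size)

# Power spectral density estimate, then integrate S(f) over the 13-30 Hz band
f, psd = welch(lfp, fs=fs, nperseg=1024)
betaBand = (f >= 13) & (f <= 30)
betaPower = np.sum(psd[betaBand]) * (f[1] - f[0])  # approximate integral of S(f) df
```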

##### Inference reporting guidelines

Proper presentation of statistical methods and processes is critical to the interpretation of conclusions drawn from inference. While there are many reporting guidelines for Bayesian inference, we follow the Bayesian analysis reporting guidelines as given by Kruschke (2021) and provide an example reporting document including posterior predictive checks, Bayesian model comparisons, and sensitivity analysis in our GitHub code repository.

## Results

### Estimation of spike rates from auditory stimuli: a motivating example

To facilitate the discussion of Bayesian inference in neuroscience, consider an example found prominently in auditory neuroscience (Fig. 3). In our first experiment, single-unit recordings were made from the IC in response to applied SAM tones (see Materials and Methods). The goal of this analysis is to create a linear model of SAM temporal auditory processing by quantifying changes in evoked single-unit firing rates in response to changes in SAM modulation depth.

The linear regression model seeks to estimate a linear relationship between one (simple linear) or more (multilinear) predictor variables and a measured variable. In this model, both the measured result and predictors are metric variables that map to a continuum of possible values, as opposed to categorical variables mapping to discrete classes. The simple linear regression model takes the form as follows:

$$y = \alpha + \beta x + \epsilon$$

where *y* is the measured variable, *x* is the predictor, α is the intercept, β is the slope, and ϵ is the model error term. The measured single-unit responses are shown as a function of modulation depth (Fig. 3*A*) along with the distributions of data across all stimulation conditions (Fig. 3*B*).

Linear regression forms a model in which SAM depth predicts evoked firing rates, in which the model parameters are estimated and used to draw conclusions about the relative dependency of *y* on *x*. To begin, an observation of the distribution of the measured data, in this case firing rates elicited from the IC, will allow for robust inference model design. Data distributions are most often visualized through construction of data histograms. Probability density functions (pdfs) can then be estimated through kernel density estimation (KDE), a process in which data points are convolved with Gaussian kernels to create a smooth estimate of the underlying continuous data pdf (Rosenblatt, 1956; Parzen, 1962). Inspection of the distribution of firing rates (Fig. 3*B*) suggests that a log transform would allow the data to be normally distributed, making model computations easier through the use of canonical normal distributions. Observation of total data pdfs also guides the choice of the likelihood distribution, which describes the data-generating process. After natural log transformation, the data pdfs approximate a normal distribution, making a normally distributed likelihood function a good choice for regression inference. This choice, however, should be validated using posterior predictive checking, which is shown after performing Bayesian inference using the linear regression model (example shown in Fig. 4*C*).
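The KDE step can be sketched with SciPy's `gaussian_kde` (synthetic log-normal stand-in rates, not the recorded data):

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical IC firing rates (spikes/s), log-normally distributed stand-ins
rng = np.random.default_rng(0)
rates = rng.lognormal(mean=3.1, sigma=0.74, size=500)

# Log transform, then convolve samples with Gaussian kernels to estimate the pdf
logRates = np.log(rates)
kde = gaussian_kde(logRates)
grid = np.linspace(logRates.min(), logRates.max(), 200)
density = kde(grid)  # smooth estimate of the underlying continuous pdf
```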

### Performing Bayesian inference on the linear regression model

Turning back to the example of IC single-unit firing rates in response to SAM depth stimuli, the first step in inference is to place prior distributions on the model parameters. Previous studies and data can be used to inform the priors, but for this example, we chose to demonstrate regression with moderately informative priors on the intercept α, slope β, and error term ϵ.

The likelihood variable then translates our model into a Bayesian one by casting the model as a probability distribution, in this case as follows:

$$y \sim \mathcal{N}(\alpha + \beta x,\ \epsilon)$$

This routine calculates the regression model, generating a trace variable containing the posterior distributions of all model parameters after drawing numSamples samples, with numBurnIn samples used to initialize the chains. We also ran four chains in parallel with a target_accept probability of 90%. Acceptance probability is based on the statistics of the observed data and model, with more difficult posteriors benefiting from higher acceptance probability values (Gilks et al., 1996). Improper acceptance probabilities can give rise to an insufficient number of draws and malformed posterior distributions. PyMC provides a helpful readout when posterior draws are malformed and indicative of the need for a higher acceptance probability. In summary, in a few lines of code, the researcher has observed the distributions of the data, explicitly defined a model of the data generator, and likely gained a better intuition of the data and how it is distributed. All that is left is to observe the posteriors with HDIs to infer significance from the model.

Plotting the 95% HDI estimation of the regression line (Fig. 4*A*) of modulation depth versus natural log-transformed firing rates suggests a small but significant increase in firing rates with increases in modulation depth, as evidenced by a 95% HDI credible region (0.015–0.022) of the β posterior distribution that does not include zero (Fig. 4*B*). A MAP value of 0.018 represents the most probable slope value from the regression. Posterior distributions of model parameters (Fig. 4*B*) also show that there is an estimated basal firing rate above 0 (α MAP = 3.1), with model error terms considered small for being significantly smaller than the intercept term (ϵ MAP = 0.74). The spread of the 95% HDI on inferred parameters is used as a measure of uncertainty of the parameter, with narrow HDIs representing more certainty in the MAP-estimated parameter. In our model, the α parameter has a spread between 3.02 and 3.13, with a difference of 0.11 containing 95% of its posterior distribution, suggesting strong certainty in the MAP estimate of 3.1. A similar narrow spread is seen in the β parameter, with a difference of 0.007 containing 95% of the posterior. The model error term shows that the observed data deviation from the model is constrained between 0.71 and 0.76, suggesting relative certainty in the magnitude of deviation of the data from the model. In the context of regression, these posterior results can be interpreted as a mean firing rate increase by a factor of 1.02 (2%) per unit increase in modulation depth.

The α term in the context of this study represents the natural log firing rate for 0% modulation, corresponding to a pure tone stimulus. Statistical conclusions should not end after making inferences on model parameters, however. Critical to the validity of statistical inference is the quality of the model fit to observed data. This goodness of fit in Bayesian approaches can be analyzed by posterior predictive checks, in which sample draws are made from the posterior distribution, simulating observations of data generated from the experiment from which the statistical model was fit, and comparing sampled to observed data to assess deviation of model predictions from observed data distributions. In PyMC, posterior predictive checks can be easily performed using the following code:

To illustrate how posterior predictive checks can be used, a competing model was created that performs Bayesian linear regression fit to the same data with the same priors, except without log transformation of the data. In each case, random draws were made from the log-transformed and nonlog-transformed posteriors to create empirical data distributions. Comparison of the empirical distributions qualitatively shows that the log-transformed model presents a better fit to observed data than the nonlog-transformed model. The relative disparity between posterior predictive model fits and observed data can be quantified by the use of Bayesian *p*-values (Fig. 4*C*), a distance measure between two distributions (for details of Bayesian *p*-values, see Kruschke, 2014). The closer the Bayesian *p*-value is to 0.5, the better data sampled from the posterior overlaps with the distribution of observed data. Plotting the resulting distributions and the Bayesian *p*-values indeed shows the log-transformed model fits observed data better than the nontransformed model. Similar analyses can be performed on model hyperparameters, such as prior variances, to form a sensitivity analysis of the effect of prior distributions on resulting posterior inferences.

A secondary and quick check of posterior sampling can be performed by qualitative evaluation of the MCMC sampling chains, often called traces. Traces represent the long-term run of a Markov chain which represents the distribution of interest. As such, good traces show evidence of effective sampling and convergence to target probability distributions. PyMC offers easy ways to visualize posterior MCMC traces using the plot_trace function. Figure 5 shows traces obtained from our Bayesian regression example. The kernel density estimates of traces corresponding to the posterior distributions of regression parameters show good convergence of MCMC traces to a target distribution (Fig. 5*A*). As MCMC chains are time-series samples that form a distribution, evaluation of traces through sampling time can also be used as a diagnostic of sampling convergence. Traces should have a “fuzzy caterpillar”-like appearance (Fig. 5*B*) without any stark jump discontinuities from sample to sample. Quantitative trace evaluations are also available, with the Gelman–Rubin statistic ($\hat{R}$) calculated *post hoc* after sampling; PyMC will automatically flag if $\hat{R}$ deviates appreciably from 1, indicating poor chain convergence.

### Comparison to frequentist linear regression

We next applied frequentist ordinary least squares (OLS) regression to the above log-transformed data implemented in the SPSS software (IBM). OLS regression produced point estimates of model parameters very similar to those found in the Bayesian regressions, with increases in modulation depth resulting in increased evoked log firing rates (Fig. 4*B*). The MAP value directly gives the most likely parameter estimate and an observation of the scale of the effect. The credible interval then adds to the point MAP estimate by showing other probable slope values given the observed data. Given what is known about the neurons of the IC, this spread suggests some heterogeneity in neural response to changes in SAM stimuli modulation depth, generating hypotheses for future study. However, because the relative spread of the credible interval is small, any heterogeneity found in the IC response is likely to be small in variance. Furthermore, because the slope parameter value corresponding to no effect (β = 0) falls outside the credible interval, the effect can be considered credibly nonzero (Fig. 4*C*). Frequentist confidence intervals offer a similar inference modality to the credible interval. However, confidence intervals are not themselves distributions and do not directly assess the probability of the parameter θ given the observed data. Rather, confidence intervals are dependent on a potentially unobserved population distribution. Efforts are being made to create so-called “confidence distributions,” aiming to provide similar interpretability to Bayesian credible intervals in frequentist contexts (Xie and Singh, 2013), but confidence distributions are still nascent and not yet fully developed.

In many cases, frequentist paradigms are used to generate binary inferences (significant/nonsignificant) without consideration of the underlying data distribution or fit of the model to the observed data. As an example, consider the non-log-transformed linear regression shown in Figure 4, *B* and *C*. OLS regression returns a significant result so long as the threshold *p*-value is met, without considering how well the statistical model fits the observed data. An advantage of Bayesian approaches is that observation of posterior and data distributions is a requisite for inference, in which bad model fits to data are more easily observed.

Finally, we note that our OLS regression should be treated as a *post hoc* analysis, as sampling intentions were not planned ahead of time, and thus subject to statistical biases inherent to performing frequentist analyses without the proper pre-experimental planning stage formalized in the preregistration of studies.

### *T* tests and comparison of groups

Comparison of differences between two or more groups is one of the most fundamental inference paradigms, ubiquitous to all fields of neuroscience. Frequentist implementations of group comparisons commonly use the *t* test. We use the Bayesian estimation supersedes the *t* test (BEST) model implemented and described in Kruschke (2013), consisting of a hierarchical model with data described by *t* distributions and minimally informative priors on group means and standard deviations. Thus, the BEST model serves as the Bayesian analog to the frequentist *t* test.

### Comparison of β-band dynamics in a DBS model using BEST

To illustrate the use of BEST in group comparisons, a computational model of the dopamine-depleted BGTC circuit was utilized (Fig. 6*A*). Simulations consisted of measurements of oscillatory activity in the LFP β-band (13–30 Hz), elevation of which is a known biomarker for parkinsonian symptomology across the basal ganglia circuit (Singh, 2018). No stimulation, subthreshold “sham” stimulation, and effective DBS stimulation groups were tested and compared. LFPs were recorded from the globus pallidus internus, and the subthalamic nucleus was stimulated with 2,000 repetitions of each stimulation group. Figure 6*B* shows example LFP power spectra for the no stimulation, subthreshold stimulation, and DBS conditions, respectively, with reductions in β-band activity shown in DBS conditions compared with no stimulation and subthreshold stimulation. Posterior distributions allow for estimates of group means and standard deviations of LFP β-power (Fig. 6*C*), with qualitative observations of nonoverlapping credible regions of means (DBS, 0.46–0.47; no stimulation, 0.86–0.88) and standard deviations (DBS, 0.036–0.041; no stimulation, 0.068–0.078) suggesting that DBS produces strong reductions in β-power. Unlike frequentist methods, direct measurements of differences in group means are easily quantified from subtraction of group posterior distributions. We reiterate that unlike frequentist methods, this inference is formed directly from observed data and not conditioned on hypothetical population distributions. Differences in group means show strong reductions in β-band power in DBS conditions.

Estimation of group differences further shows the explanatory power of Bayesian inference in its ability to provide direct, probabilistic estimates of group summary statistics, group differences, and effect sizes. Bayesian approaches with appropriate priors are also more robust to sample size, particularly the lower sample sizes typical of many neuroscience experiments (Smid et al., 2020). To illustrate this, we utilized frequentist *t* test and BEST inference, comparing the no stimulation and subthreshold stimulation groups. BEST and *t* test estimates on these groups should confirm a null hypothesis and fail to reject the null hypothesis for Bayesian and frequentist inference, respectively, given no effect of subthreshold stimulation on LFP β-band activity. In this study, random study stopping points were simulated by performing BEST and *t* test estimates on increasing sample sizes (2–1,000, steps of 5). Data points were pulled from random permutations of the no stimulation and subthreshold stimulation groups. BEST estimation showed rapid convergence to estimated group differences of 0 at low sample sizes, while frequentist *p*-values showed strong fluctuations regardless of sample size (Fig. 6*D*). *P*-value oscillations showed several instances of falling below the significance threshold. While *p*-value dynamics are contextual, based on data type and underlying data distributions, these data show the potential of NSHT to draw drastically different experimental conclusions from sampled data distributions with very similar sample sizes.

### Multilinear regressions, repeated measures, and hierarchical models

In many experiments, inference across multiple possible data-generating parameters must be analyzed and accounted for. These models, called multilinear regressions, are extensions of standard linear regression as follows:

$$y = \alpha + \sum_{i=1}^{n} \beta_i x_i + \epsilon$$

where *n* is the total number of predictors, with the β terms corresponding to the slope of each predictor variable.

To illustrate the use of multilinear regressions, consider the case of thalamocortical INS (Fig. 7*A*). Auditory thalamic neurons in the MGB were excited by pulse trains of optical stimuli varying in pulse energy and time between pulses. The resulting auditory cortex single-unit responses were recorded using a planar, Utah-style array in layer 3/4. An important and understudied aspect of INS is the effect of laser energy and interstimulus interval changes on evoked firing rate responses, quantified by a dose–response curve. We begin by specifying predicted and predictor values. Dose–response relationships were measured by predicting maximum firing rates in response to applied INS energy (E) and interpulse intervals (ISI). As we suspect an interaction between E and ISI, an interaction term of E * ISI was incorporated. Therefore, the model was defined as follows:

$$FR_{max} = \alpha + \beta_{E} E + \beta_{ISI} ISI + \beta_{E \times ISI} (E \cdot ISI) + \epsilon$$

An important aspect of this study was that rats underwent chronic recordings through the duration of the lifetime of the implant. It is almost a certainty that stimulation and recording quality will change over the lifetime of the devices due to neural adaptation to stimulation (Falowski et al., 2011) and glial response and encapsulation of the devices (Van Kuyck et al., 2007; Woolley et al., 2013). This experimental paradigm is thus complicated by potentially meaningful repeated-measures within-subject variability. Furthermore, slight differences in electrode and optrode placement between rodents could create heterogeneity in the receptive fields of recorded neurons (Vasquez-Lopez et al., 2017), representing a potentially meaningful between-subject variance.

### Hierarchical structures capture latent variables

Models in both Bayesian and frequentist paradigms capture these within- and between-subject variances by adding hierarchical structure to the model. From the Bayesian perspective, hierarchical models are defined by allocating hyperparameters on the prior which encode within- and between-group variances in the model, with each hyperparameter given its own hyperprior distribution. Graphically, this is organized in Figure 7*B*. Bayesian and frequentist hierarchical models share similar roots, with particular hyperprior distribution choices rendering the Bayesian model analogous to a frequentist random-effects model.

While this appears to be a herculean task in data modeling, PyMC allows for declarations of hierarchical models, as shown in Code Snippet 4:

Owing to the scarcity of thalamocortical INS data, we assigned noninformative, widespread normal distributions on the priors and hyperpriors so as to let the data speak for itself. We also utilized a Student’s *t* distribution as the likelihood function to accommodate outliers in a modification known as robust regression (Kruschke, 2014). Student’s *t* distributions have tails that are not bounded by the exponential function, meaning that extreme values have less impact or skew on the posterior distribution. Half-Cauchy distributions are placed on the error term and Student’s *t* normality parameter
ν. Half-Cauchy distributions are advantageous in learning scale parameters from the data in hierarchical models (Gelman, 2006; Polson and Scott, 2012).

It is important to validate that our model and data-generating functions indeed represent the observed data. Sensitivity analyses and posterior predictive checks thus can be performed to ensure the model chosen is the one that best describes the observed data. Sensitivity analyses were performed by varying prior variance and comparing models that were nominal or natural log transformed with normal (N) and Student's *t* (ST) likelihood functions. Model comparisons can be performed in many ways, but a common paradigm is leave-one-out cross-validation (LOO; Gelman et al., 2014). LOO consists of partitioning data into training and test sets and iteratively fitting the model under test with training data and testing out-of-sample fits with test data. Models are then ranked using the expected log pointwise predictive density (ELPD) measure:

$$\mathrm{ELPD} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i})$$

where $p(y_i \mid y_{-i})$ is the predictive density of the held-out observation $y_i$ given the remaining data $y_{-i}$. Posterior predictive checks and Bayesian *p*-values should then be utilized on the final model to determine model fit. This seemingly complex model comparison can be quickly and easily done in PyMC with the following commands:

Model comparison results are given in Table 2. Similar to the simple regression above, the log-transformed model provided much better fits to observed data than nonlog-transformed models. Interestingly and instructively, moderately informative priors (variance 5) outperformed noninformative priors (variance 100), suggesting that constraining prior variance can have predictive power in inference. Posterior predictive checks on the winning model show good fits to observed data with a Bayesian *p*-value near 0.5.

We can now perform inference on our multiregression model. It was found (Fig. 7*C*) that α was significantly above 0 (MAP = 2.2; 95% HDI does not cross 0), suggesting that basal firing rates of recorded neurons were typically above 0, as expected. It was also seen that maximal firing rates were significantly dependent on the applied INS energy.

### Bayesian ANOVAs

Comparison of differences between groups is another routine statistical procedure used when predictor variables are nominal or categorical in nature or a mixture of metric and categorical predictors. The frequentist treatment of these experimental designs largely uses analysis of variance methods, namely, ANOVA for categorical predictors and, more generally, analysis of covariance (ANCOVA) for categorical predictors with metric covariates. ANOVAs are models that take the form of:

$$y = \beta_0 + \sum_{j} \beta_j x_j + \epsilon$$

where the $x_j$ are indicator variables encoding group membership and the group deflection terms $\beta_j$ are constrained to sum to zero.

ANCOVA is a modification to the ANOVA model to include a metric covariate term:

$$y = \beta_0 + \sum_{j} \beta_j x_j + \beta_{cov} x_{cov} + \epsilon$$

Bayesian analogs of ANOVA and ANCOVA can be easily defined in PyMC and are termed Bayesian analysis of variance (BANOVA) and Bayesian analysis of covariance (BANCOVA), respectively (Fig. 8*A*), to distinguish the models from their frequentist counterparts. Traditional ANOVAs make two key assumptions: (1) the underlying data are normally distributed, and (2) there exists a homogeneity of variance among groups. To account for these assumptions, normal distributions are placed on prior parameter and observed data distributions, and a uniform distribution prior is placed on the observed data variance σ.

As a guiding example, consider a similar experiment to that done in simple linear regression. In this experiment, we aim to understand age-related changes in the IC auditory processing of SAM sounds. This experiment consisted of two groups of young (animals <6 months in age) and aged (animals >22 months in age) rats. SAM stimuli at increasing modulation depths were played to the animals, with evoked single-unit responses recorded from the IC. As seen in the previous simple linear regression experiment (Fig. 4), there is a significant increase in evoked firing rate with increased modulation depth in young animals. As such, modulation depth should be included as a covariate in comparisons between the two groups. Taken together, this suggests that BANCOVA will serve as an appropriate model. BANCOVAs are inherently hierarchical (Gelman, 2005; Kruschke, 2014) (Fig. 8*A*), allowing between-subject variances to be represented in the prior if these variances mutually inform one another. Setting this hyperprior to a constant creates a model analogous to a frequentist ANCOVA (Kruschke, 2014). The formation of the BANCOVA is again relatively straightforward:

with inference made in exactly the same way as in the previous models.

After model sampling, posterior predictive checks were performed to ensure the posterior distributions adhere well to the observed data. Posterior predictive distributions show good qualitative fit to the observed firing rate data, with Bayesian *p*-values centered at ∼0.51, suggesting good model fit (Fig. 8*B*). Comparisons between groups are simple once posterior distributions are obtained. As in the Bayesian group comparison (BEST; Fig. 6), one simply measures the difference between the aged and young group mean parameter posteriors (Fig. 8*C*), which encode the influence of the young and aged groups on firing rates. The aged versus young contrast shows significantly elevated firing rates in young rats across all SAM stimuli (young-aged difference MAP = 0.25; 95% HDI excludes 0). Another advantage of Bayesian inference is the ability to observe the full distribution, and thus the most likely value and spread, of the effect size. In this analysis, the effect of age on SAM stimulus processing is significant but small (effect size MAP = 0.058; 95% HDI excludes 0), with a wide spread of effect (95% HDI between 0.025 and 0.64), suggesting variable temporal acuity between rodent subjects. Finally, firing rates versus SAM modulation depth are plotted for each group.
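In code, such a contrast reduces to subtracting posterior draws and checking whether the 95% HDI excludes 0. A minimal sketch using hypothetical posterior samples (real draws would come from the fitted model's trace):

```python
import numpy as np
import arviz as az

# Hypothetical posterior draws for the young and aged group mean parameters
rng = np.random.default_rng(1)
young = rng.normal(1.00, 0.05, size=4000)
aged = rng.normal(0.75, 0.05, size=4000)

# Contrast: difference of posteriors is itself a posterior distribution
diff = young - aged

# 95% highest-density interval of the difference
hdi_lo, hdi_hi = az.hdi(diff, hdi_prob=0.95)

# "Significant" in the Bayesian sense: the HDI excludes 0
significant = not (hdi_lo <= 0.0 <= hdi_hi)
```

The same subtraction applied to effect-size posteriors yields the spread-of-effect summaries reported above.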

### Comparisons between frequentist and Bayesian ANCOVAs

To compare BANCOVA results to classical frequentist ANCOVAs, the young-aged IC data underwent *post hoc* ANCOVA analysis in SPSS. The frequentist ANCOVA also found a significant difference between groups after accounting for changes in firing rate with modulation depth (young-aged log firing rate = 0.108; *p* < 0.001). The measured 95% confidence interval of 0.087–0.130 (meaning that 95% of such intervals computed from repeated samples would contain the true parameter value) also suggests some within- and between-group variance. Interestingly, the young-aged log firing rate estimate is noticeably smaller in the frequentist ANCOVA than in the BANCOVA. This difference stems from the inclusion of prior knowledge. As described in Figure 8*A*, the frequentist ANCOVA and BANCOVA have very similar model structures, with the biggest difference being the treatment of the prior on the group variance.

### Multiple comparisons in Bayesian inference

In traditional frequentist analyses, corrections for multiple comparisons are necessary to ensure that Type I errors (false positives) are constrained to a maximum of 5%. Bayesian inference instead handles multiple comparisons through hierarchical model structures (Fig. 8*A*), which can account for outliers by pulling parameter estimates toward group means when evidence is weak while allowing parameters with strong evidence to remain. This shrinkage is due to partial pooling, a phenomenon implicit to hierarchical structures that allows for quantification of both group differences and group similarities (Gelman et al., 2009).

## Discussion

The Bayesian inference approach provides a powerful statistical tool that encourages deep and meaningful exploration of data and allows for presentation of data in intuitive and transparent ways. In this tutorial, we demonstrated the ease with which Bayesian inference can be performed across a wide variety of experimental designs and provided source code, built entirely on freely available and open-source tools, that can be modified to accommodate new neuroscientific experiments. We intentionally used the base PyMC toolchain in order to explicitly show Bayesian model creation. However, PyMC companion tools such as Bambi (Capretto et al., 2022) can facilitate creation of Bayesian models in single lines of code. An example of Bambi-enabled model creation is provided in our Bayesian inference toolbox.

### Applications of Bayesian inference

In this tutorial, our inference examples largely focused on data commonly found in electrophysiology and computational neuroscience studies. However, Bayesian inference is agnostic to the form and type of data used in inference. The described statistical models are easily adapted to electroencephalography, neuroanatomical measures, behavioral measures, and calcium events, among others. Bayesian inference is also of particular interest to neuroscience experiments involving very large datasets, such as spatial transcriptomics (Gregory et al., 2022; Song et al., 2022).

We also focused primarily on canonical statistical model structures: *t* test group comparisons, simple regression, ANOVAs, and mixed-effect models. Bayesian analogs can further be defined for other statistical models such as logistic, Poisson, and ridge regressions, mixtures of Gaussians, and time-series analyses.

### Tempering expectations of Bayesian inference

Despite the enthusiasm of some Bayesian advocates, Bayesian inference is not a panacea. It is subject to problems similar to those of frequentist NSHT, in that models can be chosen that do not adequately fit the underlying data statistics, or priors can be chosen that dominate model performance and de-emphasize the observed data. However, Bayesian approaches support and encourage model transparency, requiring researchers to declare model priors and posteriors while encouraging continued discussion of data inference, as opposed to stopping once a *p*-value falls below an arbitrary threshold. A second caveat is that running MCMC can be slower than frequentist approaches, with run times occasionally in minutes as opposed to seconds. However, the time increases are not prohibitive and can be further reduced to levels similar to frequentist approaches by using GPU computing or programs such as JASP (Love et al., 2019), which utilize a C backend to speed up computation.

### The controversy of the prior

The prior is arguably the most contentious aspect of Bayesian inference, with critics arguing that the prior unduly influences decisions on data. It is absolutely possible to choose priors that distort posterior distributions into poor inference. However, similar arguments can be leveled at frequentist approaches that impose similar distortions on decision metrics, such as applying ANOVA tests when the underlying data are not normal. Oftentimes, these mistakes are made not out of malevolence but because of the modern framework in which statistics is performed. We argue that having to consider which prior to use, and thus what one's assumptions are, which distributions are physiologically relevant, and what the distributions of observed data look like will help prevent errors in statistical modeling while creating greater transparency in how conclusions are drawn.

### Decisions with Bayes’ factors

Some studies that utilize Bayesian inference use a decision metric called the Bayes' factor, the ratio of marginal likelihoods of two competing models, which quantifies the evidence for one model over another (Johnson et al., 2023). We intentionally chose not to utilize Bayes' factor metrics because, in the authors' opinions, they reduce inference to the evaluation of a single metric against an arbitrary threshold, as opposed to analysis over the full posterior distributions of observed data. Furthermore, certain prior declarations yield undefined Bayes' factors (Gelman and Rubin, 1995), potentially encouraging the use of suboptimal models in order to provide arbitrary decision metrics.

### Bayesian and frequentist approaches: a holistic approach to inference

Following in the footsteps of Bayarri and Berger (2004), data analysis should not consist solely of Bayesian or frequentist approaches devoid of the other. There are certainly cases where frequentist approaches should be used, such as clinical trials, where preregistration and proper protocol design can provide bounds on the false-positive and false-negative rates necessary for translation of medical therapeutics. Hybrid frequentist and Bayesian approaches can also provide richer insight into analyses where posterior distributions are unidentifiable or difficult to sample (Raue et al., 2013) or can help identify when improper models have been chosen (Berger et al., 1997). Bayesian ideas of posterior predictive checks and model comparison can also be applied to frequentist NSHT, and many of these practices would help address problems of replication and data transparency. As frequentist approaches are often baked into the pedagogy of neuroscience and neural engineering, we aim for this tutorial to be a thorough introduction to the application of Bayesian statistics, helping to develop a toolkit that can be used for robust data analysis on its own or in conjunction with previously established frequentist approaches. These models are also easily extendable into Bayesian analogs of logistic or multinomial regressions, Gaussian mixture models, and Bayesian time-series analyses, among many more.

## Code and Data Availability

The code/software, data, and SPSS files described in this article are freely available online at https://github.com/bscoventry/Practical-Bayesian-Inference-in-Neuroscience-Or-How-I-Learned-To-Stop-Worrying-and-Embrace-the-Dist and permanently indexed in a Zenodo repository: 10.5281/zenodo.11206116. The code is available as Extended Data.

## Code and Data Repository

https://github.com/bscoventry/BayesianNeuralAnalysis; https://doi.org/10.5281/zenodo.11206116.

## Footnotes

B.S.C. and E.L.B. hold a provisional patent (USPTO: 18/083,490) related to INS study data used in this article.

We thank the anonymous reviewers for their helpful and constructive feedback on this article. Funding for experiments that generated young and aged responses to sinusoidal amplitude-modulated stimuli was provided by the National Institutes of Health (NIDCD R01DC011580, PI: E.L.B.). Funding for experiments generating infrared neural stimulation dose–response data was provided by the Purdue Institute for Integrative Neuroscience Collaborative Training Grant (PI: B.S.C.). E.L.B.: NIDCD R01DC011580. B.S.C.: Purdue Institute for Integrative Neuroscience Collaborative Training Grant.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.