The Effect of Inclusion Criteria on the Functional Properties Reported in Mouse Visual Cortex

Abstract Neurophysiology studies require the use of inclusion criteria to identify neurons responsive to the experimental stimuli. Five recent studies used calcium imaging to measure the preferred tuning properties of layer 2/3 pyramidal neurons in mouse visual areas. These five studies employed different inclusion criteria and reported different, sometimes conflicting results. Here, we examine how different inclusion criteria can impact reported tuning properties, modifying inclusion criteria to select different subpopulations from the same dataset of almost 17,000 layer 2/3 neurons from the Allen Brain Observatory. The choice of inclusion criteria greatly affected the mean tuning properties of the resulting subpopulations; indeed, the differences in mean tuning because of inclusion criteria were often of comparable magnitude to the differences between studies. In particular, the mean preferred temporal frequencies (TFs) of visual areas changed markedly with inclusion criteria, such that the rank ordering of visual areas based on their TF preferences changed with the percentage of neurons included. It has been suggested that differences in TF tuning support a hierarchy of mouse visual areas. These results demonstrate that our understanding of the functional organization of the mouse visual cortex obtained from previous experiments critically depends on the inclusion criteria used.


Introduction
Five recent studies have employed two-photon calcium imaging to compare spatial frequency (SF) tuning, temporal frequency (TF) tuning, orientation selectivity, and directional selectivity of neurons across mouse visual cortical areas (Table 1; Fig. 1; Andermann et al., 2011; Marshel et al., 2011; Roth et al., 2012; Tohmi et al., 2014; Sun et al., 2016). Some results were consistent across studies, e.g., the mean preferred TF of neurons in area AL was greater than that of neurons in V1 (Fig. 1A), but there were also differences between studies, e.g., some studies found that the mean preferred TF of neurons in PM was greater than that of neurons in V1 while others found the opposite. Further, the magnitudes of average TF tuning, orientation selectivity index (OSI), and direction selectivity index (DSI) in individual visual areas, as well as the rank order of these properties between visual areas, differed across studies (Fig. 1). All five studies imaged layer 2/3 of mouse visual cortex and evoked activity with a drifting grating stimulus, but the studies differed in anesthesia state, calcium indicator, stimulus parameters, and the inclusion criteria used in analysis (Table 1). It is likely that all these differences contribute to the contrasting results. Here, we leverage a single large and open dataset, the Allen Brain Observatory, to quantify the impact of the choice of inclusion criteria on the measurement of tuning properties of neurons in mouse visual areas.
Calcium imaging studies usually require the use of inclusion criteria to select neurons that are deemed to be "active" or "responsive" such that the derived analysis of their activity is relevant to the aims of the experiment and not a quantification of noise. As the measured fluorescence shows continuous fluctuations, these criteria serve to identify which fluctuations reflect signal rather than noise. Criteria are often based on the amplitude of the fluorescence change, e.g., a threshold on the mean or median change in fluorescence over multiple trials, or its reproducibility, e.g., a statistically significant stimulus-evoked change in fluorescence on a subset of trials. Naturally, some neurons exhibit large-amplitude changes in fluorescence on every trial in response to a preferred stimulus and fulfil both amplitude and reproducibility criteria (Fig. 2). Although not often used as the basis for inclusion criteria, other features of the fluorescence traces, such as periodicity in the fluorescence in response to a periodic stimulus such as a drifting grating (Fig. 2I) and tuning to stimulus characteristics such as orientation and TF (Fig. 2C,H,I), may also be suggestive of stimulus-evoked activity (Niell and Stryker, 2008).
Each of the five studies used different inclusion criteria and it is unclear whether these different criteria select the same or different neurons and how they impact the distribution of measured responses to visual stimuli across the population. Here, we explore the effects of inclusion criteria on results from a single large dataset, eliminating the effects of different experimental conditions. We used recordings from the Allen Brain Observatory, a database of physiological activity in visual cortex measured with two-photon calcium imaging from adult GCaMP6f transgenic mice (de Vries et al., 2020). We found that tuning properties varied with inclusion criteria, in some cases changing the rank order of tuning properties across mouse cortical visual areas.

Stimulus and dataset
We used calcium imaging recordings from the Allen Brain Observatory, a publicly available dataset that surveys physiological activity in the mouse visual cortex (de Vries et al., 2020). We specifically used the responses to the drifting grating stimulus in this dataset. This stimulus consisted of a 2-s grating followed by a 1-s mean luminance gray period. Five TFs (1, 2, 4, 8, and 15 Hz), eight different directions, and one SF (0.04 cpd) were used. Each grating condition was presented 15 times.
Table 1 note: The last column shows the number of neurons selected from the Allen Brain Observatory.
This work was supported by the Allen Institute.
Data analysis was performed in Python using the AllenSDK. The evoked response was defined as the mean dF/F during the 2-s grating presentation. Responses to all 15 stimulus presentations were averaged together to calculate the mean evoked response.
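As an illustration of the evoked-response definition above, the following minimal sketch averages dF/F over the stimulus window of each trial and then across trials. It is written in plain Python rather than with the AllenSDK so that it is self-contained; the function names, the list-of-lists trace layout, and the sample indices marking the grating window are assumptions made for the example, not part of the published analysis code.

```python
from statistics import mean

def evoked_response(trace, stim_start, stim_end):
    """Mean dF/F over the samples inside the grating presentation.

    trace: list of dF/F samples for one trial (hypothetical layout).
    stim_start, stim_end: sample indices bounding the 2-s grating window.
    """
    return mean(trace[stim_start:stim_end])

def mean_evoked_response(trials, stim_start, stim_end):
    """Average the per-trial evoked responses for one grating condition."""
    return mean(evoked_response(t, stim_start, stim_end) for t in trials)

# Toy data: 3 trials of 6 samples; samples 2..5 stand in for the grating window.
toy_trials = [
    [0.0, 0.0, 0.10, 0.20, 0.20, 0.10],
    [0.0, 0.0, 0.00, 0.10, 0.10, 0.00],
    [0.0, 0.0, 0.30, 0.30, 0.30, 0.30],
]
toy_mean = mean_evoked_response(toy_trials, 2, 6)
```

In the dataset described here, each condition would contribute 15 trials rather than the three used in this toy example.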

Metrics
The preferred direction and TF condition was defined as the grating condition that evoked the largest mean response. To compute the average TF tuning of a population of neurons, these TF values were first converted to an octave scale (base 2), averaged, then converted back to a linear scale and reported.
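The octave-scale averaging described above is equivalent to a geometric mean of the preferred TFs. A minimal sketch follows; the function names and the dictionary keyed by (direction, TF) are hypothetical choices for illustration.

```python
import math

def preferred_condition(mean_responses):
    """Return the (direction, tf) key with the largest mean evoked response.

    mean_responses: dict mapping (direction_deg, tf_hz) -> mean dF/F.
    """
    return max(mean_responses, key=mean_responses.get)

def mean_preferred_tf(preferred_tfs):
    """Average TFs on a log2 (octave) scale, then convert back to Hz."""
    octaves = [math.log2(tf) for tf in preferred_tfs]
    return 2 ** (sum(octaves) / len(octaves))
```

For two neurons preferring 1 and 4 Hz, the octave-scale mean is 2 Hz, whereas the arithmetic mean would be 2.5 Hz; the octave scale treats each doubling of TF as an equal step.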
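Direction selectivity was computed for each neuron as the direction selectivity index (DSI):

$$\mathrm{DSI} = \frac{R_{\mathrm{pref}} - R_{\mathrm{null}}}{R_{\mathrm{pref}} + R_{\mathrm{null}}}$$

where $R_{\mathrm{pref}}$ is the mean response to the preferred direction and $R_{\mathrm{null}}$ is the mean response to the opposite direction.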
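A minimal sketch of the direction selectivity calculation is given below. It assumes the commonly used contrast form (R_pref - R_null)/(R_pref + R_null), takes the null direction to be the preferred direction plus 180 degrees, and uses hypothetical function names.

```python
import math

def dsi(r_pref, r_null):
    """Direction selectivity index from two mean responses (same units)."""
    denom = r_pref + r_null
    if denom == 0:
        return math.nan  # undefined when both responses are zero
    return (r_pref - r_null) / denom

def dsi_from_tuning(mean_by_direction):
    """mean_by_direction: dict mapping direction in degrees -> mean response."""
    pref = max(mean_by_direction, key=mean_by_direction.get)
    null = (pref + 180) % 360  # opposite direction of motion
    return dsi(mean_by_direction[pref], mean_by_direction[null])
```

Under this form, a neuron with no response to the null direction has a DSI of exactly 1 however small its response to the preferred direction is.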
Orientation selectivity was computed for each neuron using the global OSI (Ringach et al., 1997), defined as the following:

$$\mathrm{OSI} = \frac{\left|\sum_{\theta} R_{\theta}\, e^{2i\theta}\right|}{\sum_{\theta} R_{\theta}}$$

where $R_{\theta}$ is the mean response at each orientation $\theta$.
The coefficient of variation (CV) was used as our metric of robustness. CV was calculated for each neuron as the ratio of the SD of the 15 responses to the preferred condition (mean dF/F over the 2-s stimulus presentation) to the mean evoked response (see above). A low CV indicates high robustness.
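The two quantities defined above can be sketched as follows. The global OSI is written in its usual vector-sum form (magnitude of the response-weighted sum of exp(2iθ), divided by the summed response), which is an assumption about the exact formula used. The sample SD is also an assumption, since the text does not state whether the sample or population SD was used. Function names are hypothetical.

```python
import cmath
import math
from statistics import mean, stdev

def global_osi(mean_by_direction):
    """Global OSI from a dict mapping direction in degrees -> mean response."""
    total = sum(mean_by_direction.values())
    if total == 0:
        return math.nan  # undefined when there is no response at all
    # Doubling the angle maps opposite directions onto the same orientation.
    vector = sum(r * cmath.exp(2j * math.radians(theta))
                 for theta, r in mean_by_direction.items())
    return abs(vector) / total

def coefficient_of_variation(trial_responses):
    """SD of the single-trial responses divided by their mean (sample SD assumed)."""
    return stdev(trial_responses) / mean(trial_responses)
```

A neuron responding only to one orientation (two opposite directions) yields a global OSI of 1, and a neuron responding equally to all eight directions yields 0.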
Metrics were either computed using all available trials, or with cross validation. When using cross validation, half of the trials (chosen at random, without replacement) were used to identify the preferred direction and TF, and the other half of the trials were used to compute the metrics using those preferred conditions. This was iterated 50 times, and the resulting metrics were averaged together.
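The cross-validation procedure can be sketched as below. This is an illustrative reading of the description rather than the published code: it assumes trials are split independently within each condition, that an odd trial count is split as evenly as possible, and that the metric is supplied as a function of the held-out mean responses and the preferred condition. All names are hypothetical.

```python
import random
from statistics import mean

def cross_validated_metric(trials_by_condition, metric, n_iter=50, seed=0):
    """Average a tuning metric over repeated random half-splits of the trials.

    trials_by_condition: dict mapping condition -> list of per-trial responses.
    metric: callable(held_out_means, preferred_condition) -> float.
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n_iter):
        selection_means, held_out_means = {}, {}
        for cond, trials in trials_by_condition.items():
            shuffled = rng.sample(trials, len(trials))  # random order, no replacement
            cut = len(shuffled) // 2
            selection_means[cond] = mean(shuffled[:cut])
            held_out_means[cond] = mean(shuffled[cut:])
        # The preferred condition is chosen from one half only ...
        preferred = max(selection_means, key=selection_means.get)
        # ... and the metric is evaluated on the other half.
        values.append(metric(held_out_means, preferred))
    return mean(values)
```

Because the preferred condition is chosen on trials that do not enter the metric, a single large noisy trial cannot both define the preferred condition and inflate the value computed for it.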
When examining the effects of the number of trials, for each number of trials (n), n trials were chosen at random (without replacement), and the cross-validation was done as described above.
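The trial-subsampling step might look like the following sketch, whose output would then be passed to the cross-validated calculation. The function name and the per-condition sampling are assumptions.

```python
import random

def subsample_trials(trials_by_condition, n, seed=0):
    """Draw n trials per condition at random, without replacement."""
    rng = random.Random(seed)
    return {cond: rng.sample(trials, n)
            for cond, trials in trials_by_condition.items()}
```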

Inclusion criteria
Published studies used the following inclusion criteria, which we applied to cells in the Allen Brain Observatory dataset in the following manner:
Study 1: The mean evoked response (dF/F) to the preferred stimulus condition is >10% (Sun et al., 2016).
Study 2: In 50% of trials, the response is (1) larger than 3× the SD of the prestimulus baseline and (2) larger than 5% dF/F (Roth et al., 2012).
Study 3: Paired t test (p < 0.05) with Bonferroni correction comparing the mean evoked response during the blank sweeps with the mean evoked responses to the preferred stimulus condition (Andermann et al., 2011).
Study 5: The maximum fluorescence change (dF/F) during the 2-s stimulus presentation block to any stimulus condition was >4% (Tohmi et al., 2014).
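The threshold-based criteria listed above (Studies 1, 2, and 5) can be expressed as simple predicates, sketched below. The sketch assumes dF/F is stored as a fraction (0.10 for 10%), reads "in 50% of trials" as "in at least 50% of trials", and uses hypothetical function and argument names. The paired t test of Study 3 is omitted because it needs a statistics library.

```python
def passes_study1(mean_pref_response, threshold=0.10):
    """Study 1: mean evoked dF/F at the preferred condition exceeds 10%."""
    return mean_pref_response > threshold

def passes_study2(trial_responses, baseline_sds,
                  sd_factor=3.0, min_dff=0.05, min_fraction=0.5):
    """Study 2: enough trials exceed both 3x the baseline SD and 5% dF/F.

    trial_responses: per-trial evoked dF/F at the preferred condition.
    baseline_sds: per-trial SD of the prestimulus baseline.
    """
    hits = sum(1 for r, sd in zip(trial_responses, baseline_sds)
               if r > sd_factor * sd and r > min_dff)
    return hits / len(trial_responses) >= min_fraction

def passes_study5(peak_dff_by_condition, threshold=0.04):
    """Study 5: peak dF/F during the stimulus exceeds 4% for any condition."""
    return max(peak_dff_by_condition.values()) > threshold
```

Keeping the thresholds as arguments makes it easy to examine how the selected fraction of neurons changes as a criterion is made more or less restrictive.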

Code availability
The code used in this paper is available at https://github.com/nataliamv2/inclusion_criteria.

Results
The five studies employed a range of inclusion criteria, selecting 8-49% of the neurons in their respective studies (Table 1). The inclusion criteria were based on one or both of the amplitude and the trial-to-trial variability of the evoked responses and we therefore calculated the mean and SD of the response of each neuron to its peak stimulus condition (the direction and TF that evoked the largest mean response). We applied the five different inclusion criteria to the Allen Brain Observatory, a large two-photon calcium imaging dataset. We restricted our analysis to layer 2/3 excitatory neurons imaged 175-250 μm below the pia in Cux2-CreERT2;Camk2a;Ai93 and Slc17a7-IRES2-Cre;Camk2a;Ai93 mice, yielding a dataset of fluorescence recordings from 16,923 neurons. Different inclusion criteria selected different, often overlapping populations of neurons (6-94% of 16,923 neurons; Table 1, column 7), readily visualized by plotting the mean against the SD of the response (Fig. 3A). The results derived using these different criteria covered similar ranges to those in the published studies, consistent with the idea that effects of inclusion criteria could contribute to the disparate results across published studies (Fig. 3B).
Using CV (CV = SD/mean) as a measure of response robustness, we asked how increasing the number of neurons selected, from the most robust (lowest CV) to the least (highest CV), affects the computed tuning metrics. For some metrics, including more neurons affected tuning properties by almost as much as the differences between studies. For example, increasing included neurons changed the mean preferred TF for V1, PM, and AL as well as the rank order of these three areas, such that AL and PM display different mean TFs when only the top decile are included, but have the same mean TF when all neurons are included (Fig. 4A-D,M). Within V1, the change in mean TF reflects the fact that the highest decile (10% with highest CV) shows a broader distribution of preferred TF than the lowest decile (Fig. 4B,C). In contrast, the effect on OSI was smaller and more consistent across areas, with little change in either the value or the rank order across areas (Fig. 4E-H,M). Finally, increasing the number of neurons included increased the mean DSI, and did so consistently and significantly across all visual areas (Fig. 4I-L,M). The increase in DSI reflects the fact that many of the neurons in the lowest decile have a DSI of 1, whereas the neurons in the highest decile have a uniform distribution of DSIs (Fig. 4J,K). None of the inclusion criteria used in the published studies apply a threshold on the CV specifically, but some incorporate measurements of reliability that might have a similar effect. If criteria are selecting neurons based primarily on reliability, one might expect that selecting a population of neurons with matched mean CV would result in similar tuning properties and would replicate the differences observed between the studies.
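The CV-ranked inclusion analysis described above can be sketched as follows: neurons are sorted from lowest to highest CV and the population mean of a metric is recomputed as progressively larger fractions are included. The function name and the rounding of the included count are assumptions; for preferred TF, the plain mean would be replaced by the octave-scale mean.

```python
def metric_vs_inclusion(cvs, metric_values, fractions):
    """Mean metric among the most robust neurons, for each included fraction.

    cvs: per-neuron CV (lower means more robust).
    metric_values: per-neuron metric (e.g., DSI), in the same order as cvs.
    fractions: iterable of included fractions in (0, 1].
    """
    order = sorted(range(len(cvs)), key=lambda i: cvs[i])  # most robust first
    means = []
    for f in fractions:
        k = max(1, round(f * len(order)))  # include at least one neuron
        included = [metric_values[i] for i in order[:k]]
        means.append(sum(included) / len(included))
    return means

# Toy population in which noisier neurons (higher CV) have higher DSI.
toy_cv = [0.2, 1.5, 0.5, 3.0]
toy_dsi = [0.2, 0.9, 0.4, 1.0]
toy_curve = metric_vs_inclusion(toy_cv, toy_dsi, [0.5, 1.0])
```

In this toy population the mean DSI rises from about 0.3 to about 0.6 as the noisier half is added, the same qualitative pattern as the DSI result reported above.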
We selected populations of neurons that had the same mean CV as those chosen by each inclusion criterion, for each area separately, and compared the tuning properties for that population to the tuning properties for the neurons chosen by the criterion. For some metrics, there was a high correlation between these values, namely mean preferred TF and mean DSI (r = 0.82, Pearson's correlation, for both; Fig. 4N,P). For preferred TF the values were close to unity, indicating that selecting neurons by their CV closely matched the differences between studies. For DSI, however, the range of DSI values was more limited. Thus, while there was a high correlation between the values for neurons selected by CV and those for neurons selected by the criteria, the shallow slope of this relationship made it less predictive. Further, for the mean OSI, there was no correlation between these values (r = 0.09; Fig. 4O). Thus, some of the differences between the published studies could result from the inclusion criteria effectively selecting neurons based on their reliability at different thresholds. However, it is clear that the criteria did not select neurons exclusively based on reliability, as captured by the CV, because CV alone cannot account for all of the differences between the studies. Selection by CV displayed a greater effect on preferred TF and DSI than on OSI, likely because the measurements of preferred TF and DSI are more susceptible to noise. The neurons with the noisiest responses (greatest CV) commonly displayed DSI ~1 (Fig. 4J), which is inevitable when the response to the null direction is 0. The response to the preferred direction need not be large and could even result from a single trial having just a small-amplitude fluorescence change. As the preferred TF is the TF at which the neuron has its largest response, regardless of amplitude or reliability, the TF tuning is similarly sensitive to small numbers of noisy events.
In contrast, OSI is calculated from the responses to all eight directions of drifting gratings and is thus less sensitive to a small amplitude response in one condition.
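The text does not specify how the CV-matched populations were constructed. One possible approach, sketched below purely as an assumption, is to add neurons in order of increasing CV until adding another would push the running mean CV above the mean CV of the criterion-selected population. The function name is hypothetical.

```python
def cv_matched_population(cvs, target_mean_cv):
    """Indices of the most robust neurons whose mean CV stays at or below a target.

    cvs: per-neuron CV values.
    target_mean_cv: mean CV of the population chosen by an inclusion criterion.
    """
    order = sorted(range(len(cvs)), key=lambda i: cvs[i])  # most robust first
    selected, running_sum = [], 0.0
    for i in order:
        next_mean = (running_sum + cvs[i]) / (len(selected) + 1)
        if selected and next_mean > target_mean_cv:
            break  # the running mean only grows from here on
        selected.append(i)
        running_sum += cvs[i]
    return selected
```

The tuning metrics of the returned neurons could then be compared with those of the criterion-selected neurons, area by area.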
Might a calculation that is more robust to trial-to-trial variability reduce the sensitivity of measurements to inclusion criteria or CV? We recalculated OSI, DSI and TF with cross-validation, using half of the trials to identify the stimulus condition that evoked the largest mean responses (grating direction and TF) and then calculated OSI, DSI and TF for these preferred conditions from the other half of the trials. The overall effect of including more neurons based on their CV on the cross-validated metrics across different areas was similar to that on the non-cross-validated metrics (Fig. 5). The notable difference is that the noisy neurons in the lowest decile of robustness no longer have high DSI or OSI values, but are shifted to much lower values (Fig. 5F,J). This difference is also reflected in the fact that the overall curves are shifted to lower values (compare Figs. 5E,I and Fig. 4E,I). Thus, while more statistically robust metrics calculated through cross-validation likely better reflect the true values of the population, they do not reduce the impact of selection on those metrics.
Different studies presented each visual stimulus multiple times, with numbers of repetitions ranging from 4 to 24 trials (Table 1). Might the number of repetitions account for some of the differences between studies? We computed OSI, DSI and preferred TF using subsets of 4-14 trials. As expected, the variability of the responses decreased as the number of trials increased, resulting in a lower mean CV across the entire population (Fig. 6A). Visualizing the neurons by plotting response mean versus SD for n = 4 trials (Fig. 6B) and n = 14 trials (Fig. 6C), it is clear that the bulk of the data are shifted to more robust responses. Increasing the number of trials had a small effect on the cross-validated metrics (Fig. 6D-F), decreasing both the mean OSI and DSI across all areas (when including all neurons). The effect was consistent across all areas, however, so the number of trials did not affect the rank order across areas. Thus, while more trials can reduce the variability of the response measurements, it is unlikely that differences in trial number had a large effect on the differences observed between studies.

Discussion
We applied different inclusion criteria to the Allen Brain Observatory two-photon dataset to examine how these criteria impact the reported tuning properties across visual areas after experimental differences are eliminated. That different inclusion criteria selected different subsets of neurons might not be surprising, but the extent of the differences between selected neurons was substantial.
One key difference was in the numbers of neurons selected. To examine how including more, or fewer, neurons could impact the tuning properties, we used CV as a metric of robustness and shifted our threshold for inclusion. Mean TF, OSI, and DSI changed differently with the robustness of the responses of the underlying neurons. The preferred TF was the most sensitive, OSI the least sensitive.
Our results offer one possible explanation why published studies comparing TF, OSI, and DSI across mouse visual areas have produced different results for TF and more similar results for OSI and DSI. Mean TF tuning is more sensitive than OSI and DSI to the neurons selected. As a result, comparison across studies is difficult and there remains considerable uncertainty regarding the mean TF and the rank order of TF tuning across mouse visual areas.
We used CV to examine how including more neurons can impact the reported results, as one of the big differences of the criteria is the number of neurons they select from our dataset. But this is not the only difference between these criteria. The Venn diagram (Fig. 3E) reveals that the cells selected by different criteria are not described by a set of concentric circles, and neurons with mean CV matched to those selected by an inclusion criterion have different tuning (Fig. 4O,P), revealing that the inclusion criteria use features of the neural responses in addition to the size and reliability of neurons' responses to their preferred condition. For instance, the statistical tests employed in Studies 3 and 4 also depend on the size and reliability of the neurons' responses to the blank sweep.
Cross-validating metrics and increasing the number of trials can each improve the accuracy of the measured responses. Cross-validation can mitigate the impact of particularly noisy responses, reducing the impact of small numbers of outlier trials. This is most evident in the effect of cross-validation on the DSI distribution for the neurons in the lowest decile of robustness (Fig. 5J). It is possible that inclusion criteria based on the reliability of metrics across iterations of cross-validation might be more effective for identifying neurons with truly robust responses.
Our results illustrate how inclusion criteria can play a role in determining the tuning properties of visual areas. The choice of inclusion criteria is unlikely to account for all of the differences observed between the original studies, indicating that other experimental factors are important. Other factors likely include anesthesia state, the type of anesthesia used, the calcium indicator, image brightness, as well as visual stimulus parameters. Brain state can modulate neural responses in visual cortex, and anesthesia in particular can impact both the spontaneous and evoked responses. The type of anesthesia can also be a factor, with urethane impacting spontaneous and evoked firing rates but not OSI (Niell and Stryker, 2010) and atropine affecting OSI but not spontaneous firing rate, evoked firing rate, DSI, preferred TF, or preferred SF (Durand et al., 2016). Stimulus parameters, such as the size or contrast of the drifting gratings or the precise SFs and TFs, do also impact the evoked responses and could account for some of the differences observed between the original studies.
Figure 6. How trial number changes tuning metrics and CV. A, Mean CV calculated at the preferred condition using different numbers of trials and the cross-validation method. B, Mean peak response at the preferred condition versus SD at the preferred condition using only four trials and the cross-validation method. C, Same as in B but using 14 trials. D-F, OSI, TF, and DSI calculated using the cross-validation method as a function of the number of trials used in the analysis.
Calcium indicators have different sensitivities and signal-to-noise properties (Hendel et al., 2008; Chen et al., 2013), such that thresholds in mean dF/F appropriate for one indicator might not be appropriate for another. Most of the inclusion criteria selected ~40-50% of neurons when applied to their own data, but when applied to the Allen Brain Observatory data the percentage of neurons included often differed substantially, presumably because experimental conditions such as indicator brightness differed across studies. For example, simple thresholds on peak dF/F cannot be applied uniformly across different calcium indicators. Thus, it is unlikely that a single set of inclusion criteria would be appropriate across a wide range of experimental conditions; instead, these criteria must be chosen and validated by experimenters, including, for instance, an analysis of how metrics change based on how restrictive the criteria are (Kim et al., 2018).
Functional specialization of the higher visual areas in mouse cortex has been interpreted as evidence of parallel streams (Andermann et al., 2011; Marshel et al., 2011). For example, V1 is thought to transfer low TF, high SF information to PM, the putative gateway to the dorsomedial stream (López-Aranda et al., 2009; Polack and Contreras, 2012; Glickfeld et al., 2013). However, in some studies, neurons in V1 and PM have similar mean TF tuning (with PM's being 1.3-2× that of V1; Marshel et al., 2011; Roth et al., 2012), while others show mean TF tuning in PM neurons that is 1/3 that of V1 neurons (Andermann et al., 2011). Our results indicate that in the most robust neurons, V1 has a higher TF tuning than PM, but in the least robust neurons, PM has a higher TF tuning than V1, potentially explaining some of the difference between studies. Since TF is sensitive enough to inclusion criteria to change the relative order of TF tuning, it is currently difficult to interpret the relative TF tuning between visual areas. The most appropriate inclusion criteria would take into account how downstream targets filter or weight inputs and how robustness factors into that weighting. Since we do not know what this weighting is, we must be cautious in drawing conclusions about functional organization from these analyses.