Abstract
Recent advancements in two-photon calcium imaging have enabled scientists to record the activity of thousands of neurons with cellular resolution. This scope of data collection is crucial to addressing the next generation of neuroscience questions, but analyzing these large recordings requires automated methods for neuron segmentation. Supervised methods for neuron segmentation achieve state-of-the-art accuracy and speed but currently require large amounts of manually generated ground truth training labels. We reduced the required number of training labels by designing a semi-supervised pipeline. Our pipeline used neural network ensembling to generate pseudolabels to train a single shallow U-Net. We tested our method on three publicly available datasets and compared our performance to three widely used segmentation methods. Our method outperformed other methods when trained on a small number of ground truth labels and could achieve state-of-the-art accuracy after training on approximately a quarter of the number of ground truth labels required by supervised methods. When trained on many ground truth labels, our pipeline attained higher accuracy than that of state-of-the-art methods. Overall, our work will help researchers accurately process large neural recordings while minimizing the time and effort needed to generate manual labels.
Significance Statement
Modern neuroscience analyzes the activity of hundreds to thousands of neurons from large optical imaging datasets. One important step in this analysis is neuron segmentation. Supervised algorithms have performed neuron segmentation with class-leading accuracy and speed but lag behind unsupervised algorithms in training time. A large component of training time is the manual labeling of neurons as training samples; current supervised methods require many manual labels to achieve accurate prediction. We developed a semi-supervised neuron segmentation algorithm, SAND, that retained high accuracy in the few-label regime. SAND employed neural network ensembling to generate robust pseudolabels and used domain-specific hyperparameter optimization. SAND was more accurate than existing supervised and unsupervised algorithms in both low- and high-label regimes across multiple imaging conditions.
Introduction
Studying modern neuroscience questions requires scientists to simultaneously measure and analyze the coordinated activity of neural ensembles formed from hundreds to thousands of neurons (Stevenson and Kording, 2011; Yuste, 2015; Makino et al., 2017; Stringer et al., 2019; Rumyantsev et al., 2020; Vyas et al., 2020). Understanding the function of neural ensembles is technically challenging because distinctive genetic or functional subtypes of neurons within ensembles spatially overlap and temporally change on timescales ranging from seconds to days (Ziv et al., 2013; Driscoll et al., 2017; Pérez-Ortega et al., 2021; Sweis et al., 2021).
Calcium imaging using fluorescent protein sensors meets these technical recording challenges because it can record neural ensembles with cellular spatial resolution and genetic specificity over multiple months (Nakai et al., 2001; Stosiek et al., 2003; Chen et al., 2013; Y. Zhang et al., 2023). Calcium influx follows action potentials and typically increases the brightness of calcium indicators (Grienberger and Konnerth, 2012). Recent optical setups have successfully recorded the calcium activity of hundreds of thousands of neurons simultaneously (Demas et al., 2021). Modern calcium protein sensors have trended toward detection of single action potentials and linear response over multiple action potentials (Ryan et al., 2023; Y. Zhang et al., 2023).
Cellular or subcellular resolution imaging that captures rapid single-spike calcium transients creates large datasets. Extracting single neuron activity from these large-scale imaging datasets necessitates a pipeline of automated methods; such algorithms could save time and minimize human error during analysis (Stevenson and Kording, 2011). Analysis pipelines usually consist of four steps to predict spiking activity from calcium fluorescence recordings: (1) motion correction, (2) cell segmentation, (3) fluorescence extraction, and (4) spike inference (Theis et al., 2016; Pachitariu et al., 2017; Pnevmatikakis and Giovannucci, 2017; Keemink et al., 2018; Giovannucci et al., 2019; Bao et al., 2022). Automated neuron segmentation in particular has received substantial attention but needs improvement.
Both supervised and unsupervised machine learning methods exist for neuron segmentation (Abbas and Masip, 2022; Bao and Gong, 2023). Supervised methods consist of convolutional neural networks (CNNs), while unsupervised methods include dictionary learning, PCA/ICA, and matrix factorization [e.g., CaImAn (Giovannucci et al., 2019) and Suite2p (Pachitariu et al., 2017)]. Supervised methods are typically more accurate than unsupervised methods (Soltanian-Zadeh et al., 2019; Bao et al., 2021). For example, Shallow U-Net Neuron Segmentation (SUNS) is a supervised deep learning-based pipeline for neuron segmentation that achieves state-of-the-art accuracy and speed (Bao et al., 2021).
Supervised methods trade off superior performance for the large effort required to generate hundreds of ground truth labels for model training and hyperparameter optimization. The many manual labels can help train algorithms that account for idiosyncratic fluorescence and noise distributions within each image dataset but then necessitate labels for each imaging condition. Generating such labels is time consuming and subject to human error (Giovannucci et al., 2019; Zhang et al., 2020).
Semi-supervised learning presents an opportunity to reduce the burdens of manual labeling. Semi-supervised segmentation leverages limited numbers of ground truth labels and unlabeled images to train models using two primary approaches: pseudolabeling and consistency regularization (Ouali et al., 2020). Pseudolabeling increases the size of the training dataset by accepting high-confidence labels predicted on unlabeled data as ground truth labels that can further train the model (Lee, 2013; Zou et al., 2020). Consistency regularization trains models by penalizing dissimilar predictions for similar inputs (Chaitanya et al., 2020; Zhuang et al., 2021; Huang et al., 2022; Wu et al., 2022). A combination of pseudolabeling and consistency regularization significantly improved classification accuracy with small numbers of ground truth labels (Sohn et al., 2020).
An alternative paradigm to semi-supervised learning that improves generalizability is ensemble learning. Ensemble learning improves predictive accuracy by combining the outputs of multiple models (Sagi and Rokach, 2018). Averaging multiple independent models reduces overfitting, increases generalizability, and compensates for high model variability even when trained on limited data (Dietterich, 2002; Polikar, 2006). Previous work has successfully applied ensemble learning to neural networks for image classification and segmentation (Zheng et al., 2019; Muller et al., 2022), with the ensemble outperforming the individual models (Krizhevsky et al., 2017).
In this study, we developed a semi-supervised neuron segmentation pipeline that maintained state-of-the-art accuracy and prediction speed while limiting the number of manual training labels. Our approach, Semi-supervised Active Neuron Detection (SAND), used neural network ensemble learning to predict active neurons in unlabeled frames. These predictions acted as pseudolabels to augment our training set. We also developed a novel pipeline to choose algorithm hyperparameters with few ground truth labels.
Materials and Methods
Our SAND approach consisted of three main steps: (1) preprocessing the entire video to enhance active neurons, (2) semi-supervised CNN training using small numbers of manually labeled frames, and (3) postprocessing to segment unique neuron masks from the CNN output (Fig. 1A). The postprocessing step used four hyperparameters. Their values were determined using only the manually labeled frames.
Figure 1-1
Multiple neural network architectures generated the pseudolabels and final predictions. We used three U-Net architectures with the same encoder but varying decoders. The labels above the U-Nets represent the number of channels in each level of the decoder, starting with the deepest layer, and ‘c' denotes a concatenation. The numbers above each block represent the number of channels at each layer and the variables to the left represent the dimensions of the image at each step, given an input frame with dimensions m × n. We used a dropout rate of 0.1 for the first two depths and a rate of 0.2 for the deepest depth. The total number of trained parameters for each architecture was ∼5000 (blue), ∼6000 (green), and ∼7500 (purple). Download Figure 1-1, TIF file.
Figure 1-2
Pseudolabel training helped reduce and stabilize training loss. (A) Example learning curves for training and validation sets using different amounts of training labels on ABO 275 µm data. The first round of supervised training (green) lasted for 200 epochs. We then trained models on pseudolabels (orange) for 25 epochs. A final round of training on the labeled frames (purple) lasted for another 200 epochs. (Insets) Corresponding learning curves on pseudolabels, which used binary cross entropy loss. (B) Breakdown of training time for SAND and SUNS for each step of training using different amounts of ground truth labels. Total training time for SAND and SUNS increased as the number of training labels increased. Even when trained on 500 frames, both SAND and SUNS could be trained in less than an hour. Bars represent the average training time for 10 different models. Error bars represent the standard deviation for the total training time (n = 10 models). Download Figure 1-2, TIF file.
Figure 1-3
Pseudolabels and final probability maps closely aligned with ground truth temporal masks. (A) Example pseudolabels for two different ABO 275 μm frames generated by ensembles trained on different numbers of labeled frames. Pseudolabels were generally consistent across different numbers of training frames. The scale bar is 50 μm. (B) Pseudolabels were strongly correlated with the ground truth temporal masks. Bars represent the median correlation values between 1000 pseudolabels and ground truth temporal masks (100 labels from 10 different models). Error bars represent standard deviation. Data points represent the median correlation values for each model. (C) Example probability maps used for FLHO. Probability maps closely aligned with the ground truth temporal mask. Thresholding with the 25th percentile value helped retain lower confidence neurons. The scale bar is 50 μm. Download Figure 1-3, TIF file.
Figure 1-4
Few label hyperparameter optimization accurately estimated SAND hyperparameters and was invariant to optimization choices. (A) Grid searches for hyperparameters with small numbers of ground truth labels consistently underestimated the optimal p_thresh value. We performed a traditional grid search for all hyperparameters on 20 models, each trained on randomly selected sets of 10 frames from the ABO 275 µm dataset (2 models per video). We also used grid search for all hyperparameters using all labels from all frames to generate the ‘optimal' hyperparameters. p_thresh tuned on 10 frames was consistently lower than the optimal p_thresh. We then calculated the hyperparameters for the 20 models using our new hyperparameter optimization method, FLHO. This method produced p_thresh values that were similar to the optimal p_thresh. p_thresh is listed as a grayscale pixel value from 0 to 255 (e.g. a value of 205 corresponds to a probability of 80%). Parentheses show the mean ± standard deviation of the number of labeled neurons used in hyperparameter optimization. (B) The accuracy of our method was robust to small changes in intermediate p_thresh. We performed our hyperparameter optimization pipeline on 20 models using four different percentiles for intermediate p_thresh, used during step 2 of the pipeline. Each model was trained on 10 labeled frames from the ABO 275 µm dataset (73 ± 22 labeled neurons). A one-way Kruskal-Wallis test did not find a significant difference in performance between the four choices of p_thresh (p = 0.20). Bars and error bars respectively represent mean and standard error. (C) The values of centroid_dist and min_area determined by the grid search were robust to changes in the p_thresh percentile used in step 2 of FLHO. Bars represent the median values of centroid_dist and min_area for the models in B. A one-way Kruskal-Wallis test did not find a significant difference in the grid search selection between the four choices of p_thresh for min_area or centroid_dist (n = 20 models, p = 0.45 and p = 0.46, respectively). (D) The accuracy of our method was robust to changes in the index of max_continuous used to calculate min_consecutive. We performed our hyperparameter optimization pipeline on 20 models using three different conditions for min_consecutive for each model. Each model was trained on 10 labeled frames from the ABO 275 µm dataset (73 ± 22 labeled neurons). A one-way Kruskal-Wallis test did not find a significant difference between the three conditions (p = 0.43). Bars and error bars respectively represent mean and standard error. Download Figure 1-4, TIF file.
Figure 1-5
The Few Label Hyperparameter Optimization pipeline had four main steps. (A) We determined the 25th percentile and median values for p_thresh. The 25th percentile value was used in Step 2 and the median value was used in steps 3 and 4, which included neuron prediction. (B) We used a grid search to find the min_area and centroid_dist values that maximized F1 on labeled frames. (C) We found the maximum number of continuous frames for each ground truth neuron. (D) We determined the final hyperparameters. We placed an upper bound of 80% on p_thresh. To determine min_consecutive, we found the second smallest number of continuous frames from Step 3. We placed an upper bound of 8 frames on this value. Download Figure 1-5, TIF file.
Figure 1-6
We adjusted the number of training labels by randomly sampling different numbers of training frames. A scatter plot of the number of labeled frames vs. number of labeled neurons for each dataset shows that the number of training labels (neurons) increased as we increased the number of training frames. Each point represents a unique model. Download Figure 1-6, TIF file.
Table 1-1
The grid search values for post-processing hyperparameters were generally consistent between datasets. Values are listed as Begin:Step:End. Download Table 1-1, DOCX file.
Table 1-2
The datasets in this study covered multiple brain regions and imaging conditions. V1 was primary visual cortex, PPC was posterior parietal cortex, S1 was primary somatosensory cortex, vS1 was vibrissal primary somatosensory cortex. Download Table 1-2, DOCX file.
Preprocessing
Before training, we preprocessed the video to reduce noise and emphasize active neurons. We first applied pixel-by-pixel temporal filtering to the registered video, which highlighted fluorescence activity that was similar to calcium response waveforms (Bao et al., 2021). We convolved each pixel with the time-reversed average fluorescence response of the ground truth neurons. Selected fluorescence responses had a peak SNR between 5 and 8, and we aligned the transients by their peaks. We then diminished nonresponsive neurons and enhanced active neurons by converting the temporally filtered video into an SNR representation. We calculated this representation by first computing the pixel-wise median image and quantile-based noise image over the entire video. We then pixel-wise subtracted the median image from each frame and pixel-wise divided the result by the noise image.
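As a minimal sketch of this preprocessing step (assuming a registered movie stored as a (T, H, W) NumPy array; the function name, the quantile used for the noise estimate, and the boundary handling are illustrative choices rather than the exact SUNS/SAND implementation):

import numpy as np
from scipy.ndimage import convolve1d

def preprocess_snr(video, kernel, noise_quantile=0.25):
    """Temporally filter a (T, H, W) movie and convert it to an SNR representation.

    video          : registered fluorescence movie, shape (T, H, W)
    kernel         : time-reversed average calcium transient (1D array)
    noise_quantile : quantile used for the per-pixel noise estimate (illustrative)
    """
    # Matched filtering: convolve each pixel trace with the time-reversed
    # average transient to emphasize calcium-like fluorescence events.
    filtered = convolve1d(video.astype(np.float32), kernel, axis=0, mode='nearest')

    # Pixel-wise baseline (median over time) and quantile-based noise estimate.
    baseline = np.median(filtered, axis=0)
    noise = baseline - np.quantile(filtered, noise_quantile, axis=0)
    noise = np.maximum(noise, 1e-6)  # guard against division by zero

    # SNR representation: subtract the median image, divide by the noise image.
    return (filtered - baseline[None]) / noise[None]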
Model training
The original SUNS training pipeline used a fully supervised approach and trained a single shallow U-Net with a combination of dice loss and focal loss (Bao et al., 2021). The CNN predicted probability maps that underwent a postprocessing pipeline to calculate the final neuron masks. Our SAND approach used neural network ensembling to generate pseudolabels (Fig. 1B). We used an ensemble of three models, following recent work that developed a semi-supervised pipeline for accurate medical image segmentation using an ensemble of the same size (Wu et al., 2022). We first defined three separate shallow U-Nets. Each U-Net had a unique decoder architecture, and one U-Net had the same architecture as SUNS (Extended Data Fig. 1-1). We selected the three U-Net architectures tested by Bao et al. (2021) that achieved the highest accuracy. We trained all three U-Nets on frames with manually labeled masks using a weighted sum of dice and focal loss for 200 epochs (focal loss:dice loss, 100:1; Extended Data Fig. 1-2A). We then passed 1,800 unlabeled frames through each trained U-Net within the ensemble and averaged the output probability maps to serve as pseudolabels. Pseudolabels closely resembled the known temporal masks (Extended Data Fig. 1-3A,B). We then produced the final prediction U-Net by continuing to train the U-Net with the SUNS architecture on the pseudolabels. We trained this U-Net using binary cross-entropy loss for 25 epochs (Extended Data Fig. 1-2A) and then fine-tuned it with a final round of training on the original labeled frames using dice and focal loss for 200 epochs (Extended Data Fig. 1-2A). Training time increased as the number of labeled training frames increased but remained under an hour for up to 500 training frames (Extended Data Fig. 1-2B). For all training steps, we used the Adam optimizer with a 0.001 learning rate, and our training pipeline augmented the input frames with random flips and rotations to help prevent overfitting.
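A minimal sketch of the pseudolabel generation step, assuming Keras-style trained models that expose a predict method (the function name and array shapes are illustrative):

import numpy as np

def generate_pseudolabels(ensemble, unlabeled_frames):
    """Average the probability maps of an ensemble of trained U-Nets to form
    soft pseudolabels for unlabeled SNR frames.

    ensemble         : list of trained models exposing a Keras-style predict()
    unlabeled_frames : array of SNR frames, e.g. shape (N, H, W, 1)
    """
    # Each U-Net outputs a per-pixel probability of belonging to an active neuron.
    prob_maps = [model.predict(unlabeled_frames) for model in ensemble]
    # Averaging across the ensemble reduces the variance of any single model.
    return np.mean(prob_maps, axis=0)

The averaged probability maps then serve as soft targets for the binary cross-entropy round of training described above, before the final fine-tuning on the original labeled frames.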
Postprocessing
The output probability maps of our neural network represented the model’s confidence that a pixel belonged to an active neuron. Additional postprocessing converted the output series of probability maps into unique neuron masks (Fig. 1C). We followed the same postprocessing steps described in Bao et al. (2021). First, we binarized the probability maps with a probability threshold (p_thresh) to determine active pixels. Higher values of p_thresh retained only high-confidence predictions. Lower values preserved lower confidence predictions, such as pixels from neurons with relatively low SNR, but also kept more false-positive predictions. After probability thresholding, we grouped active pixels within a frame into separate components using connected component labeling. We removed components smaller than a minimum area (min_area), as these regions were unlikely to be neurons. Next, we merged colocalized components across different frames; active components in the same location across multiple frames likely represented the same neuron. We defined components as colocalized if the centers of mass (COMs) of two components were within a minimum threshold (COM distance < centroid_dist) or if the areas of two components were substantially overlapping. Overlapped neurons met either of two criteria: (1) intersection-over-union (IoU) > 0.5 or (2) consume ratio (consume) > 0.75, with IoU and consume defined for two binary masks m1 and m2 as follows (Bao et al., 2021):
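The referenced equations did not carry over into this version of the text. Written in their standard forms, and assuming the consume ratio is taken as the fraction of the smaller mask covered by the intersection (readers should confirm this convention against Bao et al., 2021), they are:

IoU(m1, m2) = |m1 ∩ m2| / |m1 ∪ m2|
consume(m1, m2) = |m1 ∩ m2| / min(|m1|, |m2|)

A brief sketch of the corresponding colocalization test, written with NumPy and illustrative function names, could read:

import numpy as np

def iou(m1, m2):
    # Intersection-over-union of two boolean masks.
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union else 0.0

def consume(m1, m2):
    # Assumed form: fraction of the smaller mask covered by the intersection.
    inter = np.logical_and(m1, m2).sum()
    smaller = min(m1.sum(), m2.sum())
    return inter / smaller if smaller else 0.0

def colocalized(m1, m2, com1, com2, centroid_dist):
    # Merge criterion: nearby centers of mass, or substantial spatial overlap.
    close = np.linalg.norm(np.asarray(com1) - np.asarray(com2)) < centroid_dist
    return close or iou(m1, m2) > 0.5 or consume(m1, m2) > 0.75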
Hyperparameter optimization
Selection of the optimal postprocessing hyperparameters after CNN training was crucial for accurately identifying neurons and distinguishing neurons from noise. Hyperparameter optimization with SUNS required manual labeling of all active neurons in the training video. The original SUNS pipeline used a grid search to determine the postprocessing parameters that maximized F1 on the training frames (Extended Data Table 1-1). Recall, precision, and F1 are common metrics to define segmentation accuracy:
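The corresponding formulas, which did not carry over here, follow the standard definitions, where TP is the number of predicted masks matched to ground truth masks, FP is the number of unmatched predicted masks, and FN is the number of unmatched ground truth masks:

recall = TP / (TP + FN)
precision = TP / (TP + FP)
F1 = 2 × precision × recall / (precision + recall)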
We developed a novel pipeline, Few Label Hyperparameter Optimization (FLHO), which optimized postprocessing hyperparameters using only a fraction of the number of ground truth labels required by SUNS (Extended Data Fig. 1-5). Instead of using a grid search to determine all four hyperparameters, we directly calculated p_thresh and min_consecutive using estimates from a small number of ground truth labels.
We first used the ground truth labels to estimate p_thresh (Extended Data Fig. 1-5A). For each labeled neuron, we identified the frames when that neuron’s peak SNR (pSNR) exceeded the threshold set by Bao et al. (2021). The trained CNN then calculated probability maps for these active frames. For each neuron, we found the median probability map value within its mask during its active frames. We used this distribution of median probability values for each neuron to find two values: (1) the 25th percentile, which was used for intermediate steps, and (2) the median, which was used as the final p_thresh. We used a lower p_thresh for intermediate steps that used only labeled frames because our initial small set of labeled training frames likely did not include the frames with the peak SNR or peak probability values for each neuron. Our 25th percentile value for p_thresh thresholded probability maps and retained neurons with relatively low SNR on the training frames (Extended Data Fig. 1-3C). We used these thresholded maps to perform a grid search for values of centroid_dist and min_area that maximized the F1 score on the labeled frames (Extended Data Fig. 1-5B).
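As a minimal illustrative sketch of the p_thresh estimate in step 1 (not the exact SAND implementation; the data structures, the bookkeeping of active frames, and the placement of the 80% upper bound are assumptions made here for readability):

import numpy as np

def estimate_p_thresh(prob_maps, neuron_masks, active_frames):
    """Estimate intermediate and final p_thresh from a few labeled neurons.

    prob_maps     : CNN output for the labeled frames, shape (T, H, W), values in [0, 1]
    neuron_masks  : dict {neuron_id: boolean (H, W) mask}
    active_frames : dict {neuron_id: indices of frames whose pSNR exceeded
                    the activity threshold}
    """
    per_neuron_medians = []
    for nid, mask in neuron_masks.items():
        frames = active_frames.get(nid, [])
        if len(frames) == 0:
            continue
        # Median CNN confidence inside the neuron's mask on its active frames.
        per_neuron_medians.append(np.median(prob_maps[frames][:, mask]))

    per_neuron_medians = np.array(per_neuron_medians)
    p_intermediate = np.percentile(per_neuron_medians, 25)  # used in step 2
    p_final = min(np.median(per_neuron_medians), 0.8)       # 80% upper bound
    return p_intermediate, p_final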
We found that the pipeline was robust to the choice of percentile with respect to the final algorithm accuracy (Extended Data Fig. 1-4B). The values of centroid_dist and min_area were also robust to changes in p_thresh, which may partially explain the robustness in accuracy across percentiles (Extended Data Fig. 1-4C). Additionally, the median of the p_thresh distribution estimated from a small number of labels was very similar to the optimal p_thresh value calculated using all labels (Extended Data Fig. 1-4A). Therefore, we set our final p_thresh to the median value. We set an upper bound on this value so that p_thresh was not >80% probability. Finally, we calculated min_consecutive by assessing the distribution of consecutive frames for all neurons (Extended Data Fig. 1-5C). For this step, we used the probability maps for all frames and therefore set p_thresh to its final (median) value. We thresholded these probability maps using p_thresh and min_area. We calculated the maximum number of consecutive frames that the model identified for each neuron. We observed that the minimum consecutive frame value among all neurons was occasionally an outlier, so we selected the second smallest value to be min_consecutive. However, the performance of our method was robust across different choices of min_consecutive (Extended Data Fig. 1-4D). We set an upper bound on min_consecutive so that it did not surpass eight frames (Extended Data Fig. 1-5D).
Peer segmentation methods
SUNS
SUNS is a supervised deep learning pipeline for neuron segmentation from fluorescence recordings (Bao et al., 2021). SUNS first computed an SNR representation of imaging videos that emphasized active neurons and de-emphasized inactive neurons. SUNS then trained a shallow U-Net on 1,800–2,400 imaging frames, with labels derived from a comprehensively labeled set of neurons over all imaging movies. Finally, a multistep postprocessing pipeline identified unique ROIs across all frames. SUNS determined the hyperparameters for this postprocessing pipeline with a grid search that evaluated accuracy against the ground truth labels. Python code for SUNS is available at https://github.com/YijunBao/Shallow-UNet-Neuron-Segmentation_SUNS.
CaImAn
CaImAn is a calcium imaging analysis pipeline that uses both unsupervised and supervised algorithms to identify active neurons (Pnevmatikakis et al., 2016; Giovannucci et al., 2019). The unsupervised step was a nonnegative matrix factorization method that separated spatially overlapping neurons based on the temporal activity of active neurons; these sparse decomposed components also included sources that represented background noise and neuropil activity. Components representing unique regions of interest (ROIs) were curated by iteratively combining components that exceeded a threshold for correlated temporal activity. The supervised portion was a quality control step to remove non-neuronal components. This step used a peak signal-to-noise ratio (SNR) threshold, spatial footprint consistency, and a CNN classifier. Python code for CaImAn is available at https://github.com/flatironinstitute/CaImAn (version 1.6.4).
Suite2p
Suite2p is another widely used pipeline that applies unsupervised algorithms to identify potential neurons and a supervised quality control step to refine the neurons (Pachitariu et al., 2017). Suite2p first reduced the dimensionality of the input video using singular value decomposition. Then, unsupervised non-negative matrix factorization identified ROIs and modeled decomposed neural activity as the weighted sum of underlying neural activity and neuropil signal. A supervised classifier then processed these ROIs and separated cells from noncells based on temporal and spatial features. Lastly, manual acceptance or rejection of the classifier’s predictions refined the final output neurons. Python code for Suite2p is available at https://github.com/MouseLand/suite2p (version 0.6.16).
Datasets
We tested our pipeline on two-photon videos from three different datasets, all recorded in mice. These videos covered multiple cortical and subcortical brain regions, were collected with multiple imaging conditions, and utilized various calcium sensors with different responses and kinetics (Extended Data Table 1-2).
Allen Brain Observatory
The dataset from the Allen Brain Observatory (ABO) consisted of 10 videos recorded from a depth of 275 µm and 10 videos recorded from a depth of 175 µm in the primary visual cortex (V1; de Vries et al., 2020). The 175 µm set had ∼200 neurons per video, and the 275 µm set had ∼300 neurons per video. For each depth, we used 10-fold cross-validation: we trained our model and determined the hyperparameters using one video and tested on the other nine videos. Data is available at http://observatory.brain-map.org/visualcoding.
Neurofinder
We used three sets of videos (01, 02, and 04) from three different labs with different imaging conditions from the Neurofinder competition (CodeNeuro, 2016). Each video was paired with another video obtained under the same imaging conditions, making six pairs of videos. For each of the six pairs, we trained the model and determined the hyperparameters on one video and tested on the other video. The 12 videos averaged ∼250 neurons per video. Videos are available at http://neurofinder.codeneuro.org/.
CaImAn
The CaImAn dataset (Giovannucci et al., 2019) contained four videos (J115, J123, K53, and YST) that imaged various brain regions. We divided each video into quarters to perform cross-validation, so that the training and test set had the same imaging conditions. For two of the videos (J115 and K53), the average number of neurons per subvideo was ∼200. For these videos, we trained the model on one subvideo and tested on the remaining three subvideos. The other two videos (J123 and YST) had ∼40 and ∼80 neurons per subvideo, respectively. For these videos containing far fewer neurons, we used leave-one-out cross-validation, training on three subvideos, and testing on the remaining subvideo.
Analysis
We compared three different deep learning segmentation pipelines: (1) SUNS, model training with supervised learning (SL) and hyperparameter optimization with a full grid search (GS), (2) SL and our new hyperparameter optimization pipeline (FLHO), and (3) SAND, model training using a combination of SL and neural ensemble learning and hyperparameter optimization with FLHO. We also compared SAND with the widely used matrix factorization methods Suite2p and CaImAn. We quantified the quality of the identified masks as the ratio of the mask’s area to the area of the mask’s convex hull. We evaluated model accuracy by calculating the F1 score of each method on the test videos when trained with different numbers of ground truth neuron masks from the training video. We altered the number of ground truth masks used in training by randomly sampling different sets of SNR frames (Extended Data Fig. 1-6). We evaluated F1 across all frames and neurons in the test videos using the same ground truth masks as previous work (Soltanian-Zadeh et al., 2019; Bao et al., 2021). For CaImAn and Suite2p, we used the F1 values found in Bao et al. (2021), which previously optimized the hyperparameters for these pipelines.
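A minimal sketch of this mask-quality metric, assuming scikit-image is available (the function name is illustrative):

import numpy as np
from skimage.morphology import convex_hull_image

def mask_quality(mask):
    """Mask quality: area of the mask divided by the area of its convex hull.
    Values near 1 indicate compact, soma-like shapes; lower values indicate
    irregular or fragmented masks.

    mask : 2D boolean array for a single neuron.
    """
    hull = convex_hull_image(mask)
    return mask.sum() / max(hull.sum(), 1)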
We ran multiple analyses to test the performance of SAND. First, we compared SAND with SUNS, SL + FLHO, CaImAn, and Suite2p when trained on a low number of ground truth neurons. We also compared the performance of SAND trained on a low number of ground truth neurons with the asymptotic performance of SUNS. Finally, we compared the asymptotic performance of SAND with the asymptotic performance of SUNS. We binned the F1 scores for each condition by the number of neurons used in training. We compared algorithms using the Wilcoxon rank-sum test and by computing the effect size (Cohen’s d).
Code accessibility
The code described in the paper is available at https://github.com/caseymbaker/SemiSupervisedNeuronSegmentation2p. The GitHub repository includes a tutorial for downloading and running SAND as well as the data and code for recreating figures in this paper. Code was tested on two Windows 10 PCs (AMD Ryzen 9 3900X, 128 GB RAM, RTX 2080, and Intel Core i7-7700K, 64 GB RAM, Quadro P5000).
Supplementary 1
Download Suppl 1, ZIP file.
Results
We first evaluated SAND using both ABO datasets (Fig. 2). Masks generated by SAND closely matched the ground truth masks even when trained on only 10 frames (Fig. 2A,B). Masks generated by SUNS trained on few frames, however, included many false positives, and masks generated by Suite2p and CaImAn were more irregularly shaped and less accurate than those generated by SAND (Fig. 2A,B; Extended Data Fig. 2-1A; Extended Data Tables 2-1, 2-2). SAND significantly outperformed all other methods when trained on 0–50 ground truth labels (∼10 labeled frames; Extended Data Fig. 2-2, Table 2-1). In the 275 µm dataset, SUNS achieved a median F1 score of 0.81 when trained on >250 labels (Fig. 2C; Extended Data Table 2-1). However, SAND achieved this F1 score when trained on only ∼25% of the number of labels and came within one standard deviation of this value when trained on only ∼12% of the number of labels (median F1 = 0.79; 34 ± 10 neurons). Additionally, the F1 score for SAND when trained on >250 neurons was significantly higher than the SUNS F1 score (Extended Data Table 2-1). In the 175 µm dataset (Fig. 2D), SUNS achieved a median F1 score of 0.81 when trained on >200 neuron labels (Extended Data Table 2-1). However, SAND came within one standard deviation of this value when trained on only ∼13% of the number of labels (median F1 = 0.77, 29 ± 12 neurons). Additionally, the F1 score for SAND when trained on >200 labels was significantly higher than the SUNS F1 score when trained on >200 labels (Extended Data Table 2-1). SAND also significantly outperformed the matrix factorization methods, CaImAn and Suite2p, over all numbers of ground truth masks (Extended Data Table 2-1). In particular, SAND trained on only 10 frames more reliably detected low pSNR neurons than CaImAn and Suite2p (Extended Data Fig. 2-3). SAND generally improved model precision (Fig. 2C,D). Both our new training method and our new hyperparameter optimization method helped maximize F1 in our pipeline. FLHO without pseudolabel training (SL + FLHO) had a modest effect on accuracy when trained on fewer ground truth masks (Fig. 2C,D). In addition to state-of-the-art accuracy, SAND also achieved the state-of-the-art processing speed of SUNS at ∼300 frames per second (Extended Data Fig. 2-4).
Figure 2-1
SAND had higher quality masks than competing methods. (A) Masks identified by SAND were more consistently shaped like neuron somas than masks identified by other methods on the ABO datasets. SUNS and SAND were both trained on 10 frames. (B) Masks identified by SAND were more consistently shaped like neuron somas than masks identified by other methods on the Neurofinder dataset. SUNS and SAND were both trained on 10 frames. (C) Masks identified by SAND were more consistently shaped like neuron somas than masks identified by other methods on the CaImAn datasets. SUNS and SAND were both trained on 10 frames for J115, K53, and YST. SUNS and SAND were both trained on 100 frames for J123. Bars represent average ratio values and error bars represent standard deviation. *** indicates p < 0.001 (Tables 2-2, 2-3, 2-4). Download Figure 2-1, TIF file.
Figure 2-2
SAND outperforms other methods on the ABO dataset when trained on fewer labeled frames. SAND had higher accuracy than other methods with low number of labeled frames on both the ABO 275 μm and ABO 175 μm datasets. Dots represent the average F1 score for each model when processing the nine test videos. Lines represent the mean F1 scores averaged over different numbers of training frames. Shaded regions represent standard error. Horizontal lines are the average F1 scores of Suite2p and CaImAn. Download Figure 2-2, TIF file.
Figure 2-3
Neuron recall with SAND outperformed unsupervised algorithms. We calculated recall as a function of neuron pSNR for each method on the ABO 275 μm dataset. SAND trained on 10 frames could more reliably detect neurons, particularly in the low pSNR regime, compared to CaImAn and Suite2p. SUNS trained on 10 frames had higher recall than SAND in the low pSNR regime, but also had lower precision than all other methods (Figure 2C). Lines represent the average recall across all 10 models for all 10 videos. Neurons were grouped by their pSNR in bins with a width of 3 (n = 3016 neurons). Download Figure 2-3, TIF file.
Figure 2-4
SAND achieved the same processing speed as SUNS on the ABO 275 μm dataset. Segmentation with SAND and SUNS consisted of three steps: pre-processing, CNN inference, and post-processing. For all steps, SAND and SUNS achieved comparable speeds. The total processing speed was an order of magnitude faster than the video’s frame rate (30 Hz). Bars represent the average processing speeds for 10 different models over 9 videos. Error bars represent the standard deviation. Download Figure 2-4, TIF file.
Table 2-1
SAND significantly outperformed SUNS on the ABO 275 µm and ABO 175 µm datasets when both trained on a small number of labels. SAND also significantly outperformed SUNS when trained on many labels. "Neuron #" indicates the range of labeled neurons used to train our models over many trials, grouped as a bin; the average number of training labels for each of these trials followed in parentheses. "Neuron #" for CaImAn and Suite2p is listed as NA because these methods were unsupervised; however, hyperparameter optimization for these unsupervised methods was done using all ground truth labels. nx shows the number of trials (models) in that bin. F1 is the median F1 value in that bin. We used a two-sided Wilcoxon rank-sum test to evaluate significance. *, **, ***, and n.s. represent p < 0.05, 0.01, 0.001, and not significant, respectively. Download Table 2-1, DOCX file.
Table 2-2
SAND had significantly higher quality masks than competing methods on the ABO datasets. We measured quality as the ratio of the mask’s area to the area of the mask's convex hull. SAND and SUNS used 10-fold cross validation to test performance; each model trained on a single video and was tested on the 9 remaining videos in that dataset (i.e. 9 applied models per video). CaImAn and Suite2p did not use cross validation (1 applied model per video). We evaluated SAND and SUNS based on models trained on 10 labeled frames. "# Training Frames" for Suite2p and CaImAn are N/A because these methods were unsupervised. We compared methods using a two-tailed Wilcoxon rank-sum test on all the masks generated by each model across all test videos. Download Table 2-2, DOCX file.
Table 2-3
SAND had significantly higher quality masks than competing methods on the Neurofinder dataset. We measured quality as the ratio of the mask's area to the area of the mask’s convex hull. SAND and SUNS used Train 1 Test 1 cross validation to test performance; each model was trained on 1 video and tested on a corresponding video (i.e. 1 applied model per video). CaImAn and Suite2p did not use cross validation (1 applied model per video). We evaluated SAND and SUNS based on models trained on 10 labeled frames. "# Training Frames" for Suite2p and CaImAn are N/A because these methods were unsupervised. We compared methods using a two-tailed Wilcoxon rank-sum test on all the masks generated by each model across all test videos. Download Table 2-3, DOCX file.
Table 2-4
SAND had significantly higher quality masks than competing methods on the CaImAn datasets. We measured quality as the ratio of the mask's area to the area of the mask's convex hull. SAND and SUNS trained on 1 video for the K53 and J115 datasets and tested on the remaining 3 videos (i.e. 3 applied models per video). SAND and SUNS trained on 3 videos for the J123 and YST datasets and tested on the held-out video (1 applied model per video). "# Training Frames" for Suite2p and CaImAn are N/A because these methods were unsupervised. We compared methods using a two-tailed Wilcoxon rank-sum test on all the masks generated by each model across all test videos. Download Table 2-4, DOCX file.
We next tested SAND on the Neurofinder dataset (Fig. 3). Masks generated by SAND closely matched the ground truth masks even when trained on only 10 frames (Fig. 3A,B). Masks generated by Suite2p and CaImAn were more irregularly shaped and had more false-negative predictions than SAND (Fig. 3A,B; Extended Data Fig. 2-1B, Table 2-3). SAND significantly outperformed SUNS when trained on 0–50 ground truth neuron labels (∼10 frames; Fig. 3C; Extended Data Fig. 3-1, Table 3-1). SUNS achieved a median F1 score of 0.58 when trained on 200–250 labels. However, the performance of SAND was not significantly different from this when trained on only ∼14% of the number of labels (median F1 = 0.53, 32 ± 12 neurons; Extended Data Table 3-1). Similar to observations when processing the ABO datasets, our new hyperparameter optimization without pseudolabel training partially improved accuracy when trained on fewer ground truth masks. SAND performed as well as or better than CaImAn segmentation over all numbers of ground truth masks (Extended Data Table 3-1). Overall, the Neurofinder dataset had the most variability in performance, likely due to the variety of imaging conditions throughout this dataset.
Figure 3-1
SAND outperforms SUNS on the Neurofinder dataset when trained on fewer labeled frames. SAND had higher precision and recall than SUNS when trained on only 10 labeled frames. Dots represent the F1 score for each model when processing its paired test video. Lines represent the mean F1 scores averaged over different numbers of training frames. Shaded regions represent standard error. Horizontal lines are the average F1 scores of Suite2p and CaImAn. Download Figure 3-1, TIF file.
Table 3-1
SAND significantly outperformed SUNS on the Neurofinder dataset when both trained on a small number of labels. SAND trained on 0-50 labels did not significantly differ from SUNS trained on 200-250 labels. "Neuron #" indicates the range of labeled neurons used to train our models over many trials, grouped as a bin; the average number of training labels for each of these trials followed in parentheses. "Neuron #" for CaImAn and Suite2p is listed as NA because these methods were unsupervised; however, hyperparameter optimization for these unsupervised methods was done using all ground truth labels. nx shows the number of trials (models) in that bin. F1 is the median F1 value in that bin. We used a two-sided Wilcoxon rank-sum test to evaluate significance. *, **, ***, and n.s. represent p < 0.05, 0.01, 0.001, and not significant, respectively. Download Table 3-1, DOCX file.
Finally, we tested SAND on the CaImAn dataset, starting with the K53 and J115 videos (Fig. 4A,B). When processing the K53 dataset, SAND significantly outperformed SUNS, Suite2p, and CaImAn at all numbers of ground truth neurons (Extended Data Table 4-1). SAND trained on 0–50 neurons (∼10–25 frames) was more accurate than SUNS trained on >150 ground truth neurons (∼500–1,800 frames; Fig. 4A; Extended Data Fig. 4-1, Table 4-1). When processing the J115 dataset, SAND significantly outperformed CaImAn and Suite2p on all numbers of ground truth neurons (Fig. 4B; Extended Data Table 4-1). SAND also significantly outperformed SUNS when trained on 0–50 ground truth neuron labels (∼10 frames; Extended Data Fig. 4-1, Table 4-1). For both videos, SAND’s predicted masks aligned closely with the ground truth masks, even when trained on just 10 frames (Extended Data Fig. 4-2). SUNS’s predicted masks included many false positives. Conversely, CaImAn and Suite2p both failed to detect many ground truth neurons. SAND outperformed CaImAn and Suite2p on both the YST and J123 videos over all numbers of ground truth neurons; however, SAND did not consistently outperform SUNS (Fig. 4C,D; Extended Data Fig. 4-3). On all of the CaImAn videos, SAND predicted masks with more consistent soma shapes than other methods (Extended Data Fig. 2-1C, Table 2-4).
Figure 4-1
SAND outperforms SUNS on K53 and J115 when trained on fewer labeled frames. SAND had higher precision than SUNS when trained on only 10 labeled frames from the K53 and J115 videos. Dots represent the average F1 score for each model when processing its test sub-videos. Lines represent the mean F1 scores averaged over different numbers of training frames. Shaded regions represent standard error. Horizontal lines are the average F1 scores of Suite2p and CaImAn. Download Figure 4-1, TIF file.
Figure 4-2
Masks generated by SAND closely matched ground truth masks on the K53 and J115 videos. (A) Example segmentations from a K53 sub-video. Masks generated by SAND were more accurate than those of other methods, even when trained on only 10 frames. The scale bar is 10 μm. (B) Example segmentations from a J115 sub-video. Masks generated by SAND were more accurate than those of other methods, even when trained on only 10 frames. For both videos, SUNS’s predictions included many false positives, while CaImAn and Suite2p had many false negatives. The scale bar is 10 μm. Download Figure 4-2, TIF file.
Figure 4-3
SAND and SUNS predicted similar neuron masks from the J123 and YST videos. (A) Example segmentations from a J123 sub-video. The scale bar is 25 μm. (B) Example segmentations from a YST sub-video. The scale bar is 10 μm. Download Figure 4-3, TIF file.
Figure 4-4
SAND greatly outperformed SUNS on datasets with high average pSNR or low variability of pSNR. Scatter plot of average vs standard error of pSNR for all neurons in each video. Dot size indicates the difference in median F1 between SAND and SUNS when trained on 0-50 neurons. Green dots indicate datasets where SAND significantly outperformed SUNS on low numbers of ground truth labels. Download Figure 4-4, TIF file.
Figure 4-5
SAND generalized well to videos with similar imaging conditions as the training data. We trained SAND (red) and SUNS (orange) on the ABO 175 μm dataset and tested the performance of those models on a dataset with similar imaging conditions (ABO 275 μm) and a dataset with different imaging conditions (K53). We evaluated these models against CaImAn (green) and Suite2p (purple) using the optimal hyperparameters for each test dataset. We also evaluated the models against SAND models that were trained on videos with the same imaging conditions as the test data (gray). SAND generalized well to datasets with similar imaging conditions to the training data. SAND outperformed unsupervised methods when generalizing to data with different imaging conditions. However, SAND performed best when training and testing videos had the same imaging conditions. Download Figure 4-5, TIF file.
Table 4-1
SAND significantly outperformed SUNS on the K53 and J115 videos when both trained on a small number of labels. SAND also significantly outperformed SUNS when trained on many labels. "Neuron #" indicates the range of labeled neurons used to train our models over many trials, grouped as a bin; the average number of training labels for each of these trials followed in parentheses. "Neuron #" for CaImAn and Suite2p is listed as NA because these methods were unsupervised; however, hyperparameter optimization for these unsupervised methods was done using all ground truth labels. nx shows the number of trials (models) in that bin. F1 is the median F1 value in that bin. We used a two-sided Wilcoxon rank-sum test to evaluate significance. *, **, ***, and n.s. represent p < 0.05, 0.01, 0.001, and not significant, respectively. Download Table 4-1, DOCX file.
Table 4-2
Calcium sensor SNR have improved over multiple generations of development. SNR values are the average fold increase of SNR relative to GCaMP3. Values are estimated from (Y. Zhang et al., 2023) and (Chen et al., 2013) and are based on single action potential results in vitro. Download Table 4-2, DOCX file.
To understand why SAND only moderately outperformed SUNS when processing the J123 and YST videos, we compared the quality of these videos to the quality of the other datasets. The pSNR of a neuron’s fluorescence can predict its likelihood of being detected by both supervised and unsupervised segmentation methods: neurons with higher pSNR were more likely to be detected (Bao et al., 2021). We calculated the average and standard error of pSNR for all ground truth neurons in each video (Extended Data Fig. 4-4).
Neurons in J123 and YST had both lower average pSNR and more variable pSNR than neurons in other videos. This suggests that SAND works best on videos with high pSNR values and low variability of pSNR across neurons. However, SAND appears to be effective when only one of these conditions is met. For example, SAND effectively processed video K53, which had high pSNR but high variability; it also effectively processed the Neurofinder dataset, which had low variability but low pSNR.
The type of calcium indicator used in each recording impacted the pSNR values. Notably, the J123 and YST videos used GCaMP5 (Akerboom et al., 2012) and GCaMP3 (Tian et al., 2009), respectively. These older sensors have very low SNR relative to modern sensors, such as the GCaMP6 used in the other videos (Extended Data Table 4-2). Protein sensors of calcium have continued to develop, so recent sensors in the GCaMP8 series have even higher SNR than that of GCaMP6 (Chen et al., 2013; Ryan et al., 2023; Y. Zhang et al., 2023). It is likely that the high SNR of modern sensors will translate to high pSNR in two-photon neural recordings. This superior signal fidelity should more effectively allow our pipeline to accurately process modern neural recordings with small numbers of ground truth labels.
Finally, we tested how different imaging conditions (e.g., pSNR variability) affected the generalizability of SAND (Extended Data Fig. 4-5). We found that SAND generalized well when the training and test data had similar imaging conditions. For example, SAND trained on the ABO 175 µm dataset and tested on the ABO 275 µm dataset performed as well as SAND trained on the ABO 275 µm dataset and tested on the ABO 275 µm dataset. We then tested ABO-trained SAND on the K53 dataset, which has higher average pSNR values and higher pSNR variability than the ABO dataset. We found that ABO-trained SAND still outperformed CaImAn and Suite2p on K53, but K53-trained SAND achieved the highest accuracy across all numbers of training labels. The accuracy of SAND and SUNS trained on the ABO 275 µm dataset and tested on the K53 dataset decreased as the number of ABO labels used to train these models increased. This is likely the result of increased model specificity when trained on data specific to certain imaging conditions. Augmenting the training data of SAND to make consistent predictions on a variety of noise levels would likely improve model generalizability. For example, we could add an additional training step to SAND to include mutual consistency learning: we could train SAND to predict the same probability maps after adding different amounts of noise to the same frame.
Discussion
Current methods of neuron segmentation have a trade-off between accuracy and manual effort: supervised methods have superior accuracy but require substantial manual effort to generate ground truth labels for each imaging condition (Abbas and Masip, 2022). This work developed SAND, the first semi-supervised pipeline to segment active neurons from two-photon calcium recordings with limited ground truth labels. SAND effectively operated in this low label regime by using neural network ensembling and a new hyperparameter optimization pipeline. The former process generated a large and robust set of pseudolabels that trained a deep learning segmentation algorithm, while the latter process determined postprocessing hyperparameters from limited numbers of ground truth labels.
SAND achieved higher accuracy than fully supervised methods at multiple scales of labeling. At the small scale, SAND trained on labels from <1% of frames and 25% of all ground truth labels available in a movie was comparably accurate to fully supervised methods trained on all labels. When trained on all available ground truth labels in our movies (>200 neurons), SAND attained higher accuracy than that of current methods. SAND trained on low numbers of ground truth labels also consistently outperformed matrix factorization methods.
The high accuracy of SAND trained on low numbers of manual labels could allow researchers to circumvent the accuracy–effort trade-off. SAND attained state-of-the-art accuracy with ∼25% of the manual labels, and likely an even smaller fraction of the labeling effort. Previous studies on supervised methods required the manual labeling of all hundreds to thousands of neurons in a single video to serve as a comprehensive training set (Soltanian-Zadeh et al., 2019; Bao et al., 2021). We estimate that manual labelers could identify and outline roughly one neuron per minute, with diminishing speed as they find fewer neurons when scanning through more frames of a movie. Therefore, SAND could greatly reduce the labeling time needed to generate effective training labels for deep learning neuron segmentation algorithms, to well under 1 h per experimental condition.
Pseudolabel training and FLHO both played a role in SAND’s high accuracy when trained on few labels. Pseudolabeling generated a robust training dataset much larger than the manually labeled training set. This larger training set helped train our shallow U-net to distinguish between noise and active neurons, reducing the number of false-positive calls. On the other hand, FLHO improved accuracy by improving hyperparameters in postprocessing. Selection of hyperparameters can greatly impact algorithm performance, but many other pipelines, such as SUNS, CaImAn, and Suite2p, employ supervised postprocessing steps that require large numbers of ground truth labels to accurately tune hyperparameters (Pachitariu et al., 2017; Giovannucci et al., 2019; Bao et al., 2021). FLHO helped bypass the accuracy–effort trade-off in hyperparameter optimization through direct calculation of certain parameters using the limited ground truth labels.
The relationship between the number of ground truth labels and accuracy of neuron segmentation displayed three trends. First, in the regime of extremely low numbers of labels, such as 20–50, SAND outperformed its fully supervised sibling SUNS. Second, both algorithms increased F1 performance as the number of training labels increased, often reaching performance asymptotes at high numbers of labels ranging from 150 to 250 labels. This large number of labels needed to saturate SAND and SUNS highlights the need for large sets of publicly available manual annotations for a variety of data, such that the field can better understand the conditions that saturate neural network-based segmentations. Third, precision often lagged recall in both SUNS and SAND; the increase in precision largely accounted for the increase in F1. The reason for this is likely twofold. First, our ensemble learning method averaged the predictions of three models to generate pseudolabels that were conservative and thus reduced training on samples near the detection threshold that could increase false positives. Second, FLHO was also likely conservative. It produced hyperparameters, such as p_thresh values that were higher than those found by grid search on the few-label dataset, which eliminated weakly confident predictions.
SAND reduces the manual labeling effort compared with fully supervised algorithms while inheriting the prediction speed of the underlying SUNS shallow U-Net architecture (Extended Data Fig. 2-4). This speed was an order of magnitude faster than the rate of data collection (Bao et al., 2021). Fast prediction can enable researchers to quickly identify neurons of interest from their recordings and perform targeted perturbation experiments within the same imaging session or even during imaging. This capability could help researchers study neural ensemble dynamics in memory and perception that are consistent on the minutes timescale but change from one day to the next (Ziv et al., 2013; Driscoll et al., 2017; Rule et al., 2020; Deitch et al., 2021; Pérez-Ortega et al., 2021). Our ensemble training and hyperparameter optimization processes also reduce training time compared with SUNS because SAND trained on only 10–25 labeled frames, far fewer than the 1,800 frames used for SUNS. The above benefits at training and test time could also arise from partnerships between existing or future neuron segmentation algorithms and our semi-supervised approaches. Because our ensemble learning and FLHO modify the training approach without dictating the underlying supervised machine learning architecture, these training approaches could retain the accuracy or speed of other algorithms while boosting their performance in the low-label regime.
Like all machine learning neuron segmentation algorithms, SAND will likely benefit from recent developments in protein engineering and video processing. Our work showed that SAND in particular benefits from recordings with larger calcium responses and smaller variance in response across neurons. Such distributional changes have been instantiated by recent generations of protein calcium indicators, which are both more responsive and more linear (Dana et al., 2019; Y. Zhang et al., 2023). Additionally, the development of novel unsupervised video denoising pipelines, such as DeepInterpolation (Lecoq et al., 2021) and DeepCAD-RT (Li et al., 2022), may also improve recall by reducing noise, thereby increasing SNR. Increases in pSNR have correlated with increased recall (Soltanian-Zadeh et al., 2019; Bao et al., 2021). SNR gains will likely increase precision as well by reporting even small calcium fluctuations.
Future work could directly improve our implementation of SAND or create alternative implementations. Direct improvement of SAND could optimize the frame selection or model selection to maximize accuracy. Our current approach randomly selected the frames used for labeling. It is possible that systematic selection of these frames could more effectively represent the range of neuron characteristics (e.g., size and pSNR) with even fewer ground truth labels. Additionally, our current approach defaulted to the SUNS shallow U-net architecture as the final neural network to make neuron predictions. Future iterations of SAND could evaluate the accuracy of all ensemble U-nets when processing the ground truth data and then perform pseudolabel training on the U-net with the lowest error. Finally, improvements to SAND or SUNS could also help detect neurons by improving the postprocessing classification step. Such changes could use dynamic information from a large temporal extent to detect sparsely and weakly active neurons (Soltanian-Zadeh et al., 2019).
Applications of SAND beyond the two-photon datasets in this work are potentially numerous. Future SAND applications could help process imaging data from one-photon or volumetric imaging settings, which generally have lower SNR than planar two-photon imaging (Jung et al., 2004; Ahrens et al., 2013; Ji et al., 2016; Waters, 2020). SAND can stand alone to process such data or pair with segmentation algorithms that target specific optical imaging data types (Yuanlong Zhang et al., 2023). Likewise, future testing could also apply SAND to process the diverse calcium recordings of many cell types, such as inhibitory neurons or glia (Akerboom et al., 2013; Semyanov et al., 2020; Mulholland et al., 2021). SAND’s ability to accurately segment neurons in the few-label regime can potentially help individual labs process imaging data from distinctive imaging preparations even if a substantial manually labeled training dataset, generated by a single lab or large community, does not yet exist.
Footnotes
The authors declare no competing financial interests.
This work was funded by the National Institutes of Health National Institute of Neurological Disorders and Stroke (NIH NINDS; 1DP2-NS111505) New Innovator Award (Y.G.), the NIH NINDS (1UF1-NS107678) BRAIN Initiative (Y.G.), and the NSF Graduate Research Fellowship (GRFP; C.M.B.).
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.