A Semi-supervised Pipeline for Accurate Neuron Segmentation with Fewer Ground Truth Labels

Recent advancements in two-photon calcium imaging have enabled scientists to record the activity of thousands of neurons with cellular resolution. This scope of data collection is crucial to answering the next generation of neuroscience questions, but analyzing these large recordings requires automated methods for neuron segmentation. Supervised methods for neuron segmentation achieve state-of-the-art accuracy and speed but currently require large amounts of manually generated ground truth training labels. We reduced the required number of training labels by designing a semi-supervised pipeline. Our pipeline used neural network ensembling to generate pseudolabels to train a single shallow U-Net. We tested our method on three publicly available datasets and compared our performance to three widely used segmentation methods. Our method outperformed the other methods when trained on a small number of ground truth labels and achieved state-of-the-art accuracy after training on approximately a quarter of the number of ground truth labels required by supervised methods. When trained on many ground truth labels, our pipeline attained higher accuracy than state-of-the-art methods. Overall, our work will help researchers accurately process large neural recordings while minimizing the time and effort needed to generate manual labels.


Introduction
Studying modern neuroscience questions requires scientists to simultaneously measure and analyze the coordinated activity of neural ensembles formed from hundreds to thousands of neurons (Stevenson and Kording, 2011; Yuste, 2015; Makino et al., 2017; Stringer et al., 2019; Rumyantsev et al., 2020; Vyas et al., 2020). Understanding the function of neural ensembles is technically challenging because distinctive genetic or functional subtypes of neurons within ensembles spatially overlap and temporally change on timescales ranging from seconds to days (Ziv et al., 2013; Driscoll et al., 2017; Pérez-Ortega et al., 2021; Sweis et al., 2021).
Calcium imaging using fluorescent protein sensors meets these technical recording challenges because it can record neural ensembles with cellular spatial resolution and genetic specificity over multiple months (Nakai et al., 2001; Stosiek et al., 2003; Chen et al., 2013; Y. Zhang et al., 2023). Calcium influx follows action potentials and typically increases the brightness of calcium indicators (Grienberger and Konnerth, 2012). Recent optical setups have successfully recorded the calcium activity of hundreds of thousands of neurons simultaneously (Demas et al., 2021). Modern calcium protein sensors have trended toward detection of single action potentials and linear response over multiple action potentials (Ryan et al., 2023; Y. Zhang et al., 2023).
Cellular or subcellular resolution imaging that captures rapid single-spike calcium transients creates large datasets. Extracting single neuron activity from these large-scale imaging datasets necessitates a pipeline of automated methods; such algorithms could save time and minimize human error during analysis (Stevenson and Kording, 2011). Analysis pipelines usually consist of four steps to predict spiking activity from calcium fluorescence recordings: (1) motion correction, (2) cell segmentation, (3) fluorescence extraction, and (4) spike inference (Theis et al., 2016; Pachitariu et al., 2017; Pnevmatikakis and Giovannucci, 2017; Keemink et al., 2018; Giovannucci et al., 2019; Bao et al., 2022). Automated neuron segmentation in particular has received substantial attention but needs improvement.
Supervised methods trade their superior performance against the large effort required to generate hundreds of ground truth labels for model training and hyperparameter optimization. The many manual labels help train algorithms to account for the idiosyncratic fluorescence and noise distributions within each image dataset but in turn necessitate new labels for each imaging condition. Generating such labels is time consuming and subject to human error (Giovannucci et al., 2019; Zhang et al., 2020).
Semi-supervised learning presents an opportunity to reduce the burdens of manual labeling. Semi-supervised segmentation leverages limited numbers of ground truth labels and unlabeled images to train models using two primary approaches: pseudolabeling and consistency regularization (Ouali et al., 2020). Pseudolabeling increases the size of the training dataset by accepting high-confidence labels predicted on unlabeled data as ground truth labels that can further train the model (Lee, 2013; Zou et al., 2020). Consistency regularization trains models by penalizing dissimilar predictions for similar inputs (Chaitanya et al., 2020; Zhuang et al., 2021; Huang et al., 2022; Wu et al., 2022). A combination of pseudolabeling and consistency regularization significantly improved classification accuracy with small numbers of ground truth labels (Sohn et al., 2020).
An alternative paradigm to semi-supervised learning that improves generalizability is ensemble learning. Ensemble learning improves predictive accuracy by combining the outputs of multiple models (Sagi and Rokach, 2018). Averaging multiple independent models reduces overfitting, increases generalizability, and compensates for high model variability even when trained on limited data (Dietterich, 2002; Polikar, 2006). Previous work has successfully applied ensemble learning to neural networks for image classification and segmentation (Zheng et al., 2019; Muller et al., 2022), with the ensemble outperforming the individual models (Krizhevsky et al., 2017).
In this study, we developed a semi-supervised neuron segmentation pipeline that maintained state-of-the-art accuracy and prediction speed while limiting the number of manual training labels. Our approach, Semi-supervised Active Neuron Detection (SAND), used neural network ensemble learning to predict active neurons in unlabeled frames. These predictions acted as pseudolabels to augment our training set. We also developed a novel pipeline to choose algorithm hyperparameters with few ground truth labels.

Materials and Methods
Our SAND approach consisted of three main steps: (1) preprocessing the entire video to enhance active neurons, (2) semi-supervised CNN training using small numbers of manually labeled frames, and (3) postprocessing to segment unique neuron masks from the CNN output (Fig. 1A). The postprocessing step used four hyperparameters. Their values were determined using only the manually labeled frames.

Preprocessing
Before training, we preprocessed the video to reduce noise and emphasize active neurons. We first applied pixel-by-pixel temporal filtering to the registered video, which highlighted fluorescence activity that was similar to calcium response waveforms (Bao et al., 2021). We convolved each pixel with the time-reversed average fluorescence response of the ground truth neurons. Selected fluorescence responses had a peak SNR between 5 and 8, and we aligned the transients by their peaks. We then diminished nonresponsive neurons and enhanced active neurons by converting the temporally filtered video into an SNR representation. We calculated this representation by first computing the pixel-wise median image and quantile-based noise image over the entire video. We then pixel-wise subtracted the median image from each frame and pixel-wise divided the result by the noise image.
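The preprocessing steps above can be sketched in a few lines of NumPy/SciPy. This is a schematic, not the released SUNS/SAND code: the `snr_video` helper name, the exact quantile used for the noise image, and the MAD-style 0.6745 Gaussian scaling are our assumptions.

```python
import numpy as np
from scipy.ndimage import correlate1d

def snr_video(video, template, noise_q=0.25):
    """Convert a registered video (T, H, W) into an SNR representation.

    video:    motion-corrected fluorescence video
    template: average ground truth calcium transient; correlating each
              pixel's trace with it equals convolving with its time reversal
    """
    video = video.astype(np.float32)
    # 1) pixel-by-pixel temporal matched filtering along the time axis
    filtered = correlate1d(video, template, axis=0, mode="nearest")
    # 2) pixel-wise median image over the entire video
    med = np.median(filtered, axis=0)
    # 3) quantile-based noise image (robust Gaussian scaling assumed)
    noise = (med - np.quantile(filtered, noise_q, axis=0)) / 0.6745
    noise = np.maximum(noise, 1e-6)  # guard against division by zero
    # 4) subtract the median and divide by the noise, frame by frame
    return (filtered - med) / noise
```

In this representation, a pixel's value is interpretable as a signal-to-noise ratio, so thresholds transfer more readily across videos with different brightness and noise levels.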

Model training
The original SUNS training pipeline used a fully supervised approach and trained a single shallow U-Net with a combination of dice loss and focal loss (Bao et al., 2021). The CNN predicted probability maps that underwent a postprocessing pipeline to calculate the final neuron masks. Our SAND approach used neural network ensembling to generate pseudolabels (Fig. 1B). We used an ensemble of three models, following recent work that developed a semi-supervised pipeline for accurate medical image segmentation using an ensemble of the same size (Wu et al., 2022). We first defined three separate shallow U-Nets. Each U-Net had a unique decoder architecture, and one U-Net had the same architecture as SUNS (Extended Data Fig. 1-1). We selected the three U-Net architectures tested by Bao et al. (2021) that achieved the highest accuracy. We trained all three U-Nets on frames with manually labeled masks using a weighted sum of dice and focal loss for 200 epochs (focal loss:dice loss, 100:1; Extended Data Fig. 1-2A). We then passed 1,800 unlabeled frames through each trained U-Net within the ensemble and averaged the output probability maps to serve as pseudolabels. Pseudolabels closely resembled the known temporal masks (Extended Data Fig. 1-3A,B). We then produced the final prediction U-Net by using the pseudolabels to continue training the U-Net with the SUNS architecture. We trained this U-Net with binary cross-entropy loss for 25 epochs on the pseudolabels (Extended Data Fig. 1-2A) and then fine-tuned it with a final round of training using dice and focal loss for 200 epochs on the original labeled frames (Extended Data Fig. 1-2A). Training time increased as the number of labeled training frames increased but remained under an hour for up to 500 training frames (Extended Data Fig. 1-2B). For all training steps, we used the Adam optimizer with a 0.001 learning rate, and our training pipeline augmented the input frames with random flips and rotations to help prevent overfitting.

[Figure 1 caption fragments, panels B and C. B, Semi-supervised training pipeline that used ensemble learning to predict active neurons in unlabeled frames. Three different shallow U-Nets were trained on labeled frames; the titles above the U-Nets give the number of channels in each level of the decoder, starting with the deepest level, and "c" denotes a concatenation. Unlabeled frames passed through each model, and the resulting probability maps were averaged to create pseudolabels. One of the three U-Nets was retrained with these pseudolabels and then fine-tuned with a final round of training on the labeled frames. See Extended Data Figures 1-1, 1-2, 1-3A,B, and 1-6 for more details. C, FLHO found optimal hyperparameters for the postprocessing pipeline that processed frames of probability maps to predict individual neurons: probability maps from the CNN were thresholded (p_thresh), ROIs smaller than min_area were removed, ROIs were merged across frames by their relative spatial locations (centroid_dist), and ROIs not active for enough consecutive frames (min_consecutive) were removed. * denotes hyperparameters. See Extended Data Figures 1-3C, 1-4, and 1-5, Table 1-1, and Table 1-2 for more details.]
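The ensembling step above reduces to averaging the per-pixel probability maps of the three trained networks on unlabeled frames. A minimal, framework-agnostic sketch (the `models` callables and the `make_pseudolabels` name are illustrative, not the paper's API):

```python
import numpy as np

def make_pseudolabels(models, unlabeled_frames):
    """Average ensemble probability maps to create soft pseudolabels.

    models:           sequence of trained networks, each a callable mapping
                      a batch of frames (N, H, W) to probabilities in [0, 1]
    unlabeled_frames: frames without manual labels (e.g., the 1,800 here)
    The returned average is used to retrain the SUNS-architecture U-Net.
    """
    probs = np.stack([m(unlabeled_frames) for m in models], axis=0)
    return probs.mean(axis=0)  # (N, H, W) soft labels
```

Averaging keeps the pseudolabels soft, so pixels on which the three decoders disagree contribute weaker training targets than pixels on which they agree.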

Postprocessing
The output probability maps of our neural network represented the model's confidence that a pixel belonged to an active neuron. Additional postprocessing converted the output series of probability maps into unique neuron masks (Fig. 1C). We followed the same postprocessing steps described in Bao et al. (2021). First, we binarized the probability maps with a probability threshold (p_thresh) to determine active pixels. Higher values of p_thresh retained only high-confidence predictions. Lower values preserved lower confidence predictions, such as pixels from neurons with relatively low SNR, but also kept more false-positive predictions. After probability thresholding, we grouped active pixels within a frame into separate components using connected component labeling. We removed components smaller than a minimum area (min_area), as these regions were unlikely to be neurons. Next, we merged colocalized components across different frames; active components in the same location across multiple frames likely represented the same neuron. We defined components as colocalized if the centers of mass (COMs) of two components were within a minimum threshold (COM distance < centroid_dist) or if the areas of two components were substantially overlapping. Overlapping components met either of two criteria: (1) intersection-over-union (IoU) > 0.5 or (2) consume ratio (consume) > 0.75, with IoU and consume defined for two binary masks m1 and m2 as follows (Bao et al., 2021):

IoU(m1, m2) = |m1 ∩ m2| / |m1 ∪ m2|
consume(m1, m2) = |m1 ∩ m2| / min(|m1|, |m2|)

These temporally merged components represented unique ROIs. Lastly, we removed masks that were not active for a minimum number of consecutive frames (min_consecutive) typical of calcium responses.
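The two merge criteria can be sketched directly on boolean masks. The consume-ratio denominator (the smaller mask's area) is one common reading of Bao et al. (2021); the released implementation may differ:

```python
import numpy as np

def iou(m1, m2):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / union if union else 0.0

def consume(m1, m2):
    """Consume ratio: fraction of the smaller mask covered by the overlap
    (one plausible definition; the original code may use a variant)."""
    inter = np.logical_and(m1, m2).sum()
    smaller = min(m1.sum(), m2.sum())
    return inter / smaller if smaller else 0.0

def colocalized(m1, m2, iou_thresh=0.5, consume_thresh=0.75):
    """Merge rule from the postprocessing step: either criterion suffices."""
    return iou(m1, m2) > iou_thresh or consume(m1, m2) > consume_thresh
```

The consume criterion catches the case where a small component is almost entirely contained in a larger one, which IoU alone would miss.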

Hyperparameter optimization
Selection of the optimal postprocessing hyperparameters after CNN training was crucial for accurately identifying neurons and distinguishing neurons from noise. Hyperparameter optimization with SUNS required manual labeling of all active neurons in the training video. The original SUNS pipeline used a grid search to determine the postprocessing parameters that maximized F1 on the training frames (Extended Data Table 1-1). Recall, precision, and F1 are common metrics to quantify segmentation accuracy:

recall = N_TP / (N_TP + N_FN)
precision = N_TP / (N_TP + N_FP)
F1 = 2 × precision × recall / (precision + recall),

where N_TP, N_FP, and N_FN are the numbers of true-positive, false-positive, and false-negative neuron masks. Evaluating F1 on an entire video is impossible when the video contains unlabeled neurons, because correctly detected but unlabeled neurons would be inaccurately counted as false positives. Similarly, the nature of the min_consecutive hyperparameter required all video frames to be used in its estimation. We found that a grid search failed to find the optimal hyperparameters when trained with a small number of labels. In particular, a grid search often underestimated the optimal p_thresh value when trained with limited manually labeled frames (Extended Data Fig. 1-4A).
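In code, the three metrics follow directly from the matched-mask counts (the function name is illustrative):

```python
def segmentation_scores(n_true_pos, n_false_pos, n_false_neg):
    """Recall, precision, and F1 from matched-mask counts.

    A predicted mask counts as a true positive when it is matched to a
    ground truth mask (e.g., by an overlap criterion such as IoU).
    """
    recall = n_true_pos / (n_true_pos + n_false_neg)
    precision = n_true_pos / (n_true_pos + n_false_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1
```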
We developed a novel pipeline, Few Label Hyperparameter Optimization (FLHO), to optimize postprocessing hyperparameters using only a fraction of the number of ground truth labels required by SUNS (Extended Data Fig. 1-5). Instead of using a grid search to determine all four hyperparameters, we directly calculated p_thresh and min_consecutive using estimates from a small number of ground truth labels.
We first used the ground truth labels to estimate p_thresh (Extended Data Fig. 1-5A). For each labeled neuron, we identified the frames when that neuron's peak SNR (pSNR) exceeded the threshold set by Bao et al. (2021). The trained CNN then calculated probability maps for these active frames. For each neuron, we found the median probability map value within its mask during its active frames. We used this distribution of median probability values across neurons to find two values: (1) the 25th percentile, which was used for intermediate steps, and (2) the median, which was used as the final p_thresh. We used a lower p_thresh for intermediate steps that used only labeled frames because our initial small set of labeled training frames likely did not include the frames with the peak pSNR or peak probability values for each neuron. Our 25th percentile value for p_thresh thresholded probability maps and retained neurons with relatively low SNR on the training frames (Extended Data Fig. 1-3C). We used these thresholded maps to perform a grid search for values of centroid_dist and min_area that maximized the F1 score on the labeled frames (Extended Data Fig. 1-5B).
We found that the pipeline's ultimate accuracy was robust across different choices of percentile (Extended Data Fig. 1-4B). The values of centroid_dist and min_area were also robust to changes in p_thresh, which may partially explain the robustness in accuracy across percentiles (Extended Data Fig. 1-4C). Additionally, the median value of the p_thresh distribution trained on a small number of labels was very similar to the optimal p_thresh value calculated using all labels (Extended Data Fig. 1-4A). Therefore, we set our final p_thresh to the median value. We set an upper bound on this value so that p_thresh did not exceed 80% probability. Finally, we calculated min_consecutive by assessing the distribution of consecutive frames for all neurons (Extended Data Fig. 1-5C). For this step, we used the probability maps for all frames. Therefore, we set p_thresh to its final (median) value. We thresholded these probability maps using p_thresh and min_area. We calculated the maximum number of consecutive frames that the model identified for each neuron. We observed that the minimum consecutive frame value among all neurons was occasionally an outlier, so we selected the second smallest value to be min_consecutive. However, the performance of our method was robust across different choices of min_consecutive (Extended Data Fig. 1-4D). We set an upper bound on min_consecutive so that it did not surpass eight frames (Extended Data Fig. 1-5D).
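The direct FLHO calculations above amount to a few order statistics. A sketch under the stated rules (25th percentile for intermediate steps, capped median for the final p_thresh, capped second smallest consecutive-frame count), with illustrative function names:

```python
import numpy as np

def estimate_p_thresh(median_probs, cap=0.8):
    """FLHO estimates of p_thresh from per-neuron median probability values
    on active labeled frames.

    Returns (intermediate, final): the 25th percentile for intermediate
    steps and the median, capped at 80% probability, as the final value.
    """
    intermediate = np.percentile(median_probs, 25)
    final = min(np.median(median_probs), cap)
    return intermediate, final

def estimate_min_consecutive(consec_counts, cap=8):
    """Second smallest per-neuron maximum consecutive-frame count, capped
    at eight frames, to be robust to a single outlier neuron."""
    return min(sorted(consec_counts)[1], cap)
```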
Peer segmentation methods
SUNS. SUNS is a supervised deep learning pipeline for neuron segmentation from fluorescence recordings (Bao et al., 2021). SUNS first computed an SNR representation of imaging videos that emphasized active neurons and de-emphasized inactive neurons. SUNS then trained a shallow U-Net on 1,800-2,400 imaging frames with labels derived from a comprehensively labeled set of neurons spanning each imaging movie. Finally, a multistep postprocessing pipeline identified unique ROIs across all frames. SUNS determined the hyperparameters for this postprocessing pipeline with a grid search that evaluated accuracy against the ground truth labels. Python code for SUNS is available at https://github.com/YijunBao/Shallow-UNet-Neuron-Segmentation_SUNS.
CaImAn. CaImAn is a calcium imaging analysis pipeline that uses both unsupervised and supervised algorithms to identify active neurons (Pnevmatikakis et al., 2016; Giovannucci et al., 2019). The unsupervised step was a nonnegative matrix factorization method that separated spatially overlapping neurons based on the temporal activity of active neurons; these sparse decomposed components also included sources that represented background noise and neuropil activity. Components representing unique regions of interest (ROIs) were curated by iteratively combining components that exceeded a threshold for correlated temporal activity. The supervised portion was a quality control step to remove nonneuronal components. This step used a peak signal-to-noise ratio (SNR) threshold, spatial footprint consistency, and a CNN classifier. Python code for CaImAn is available at https://github.com/flatironinstitute/CaImAn (version 1.6.4).
Suite2p. Suite2p is another widely used pipeline that applies unsupervised algorithms to identify potential neurons and a supervised quality control step to refine the neurons (Pachitariu et al., 2017). Suite2p first reduced the dimensionality of the input video using singular value decomposition. Then, unsupervised nonnegative matrix factorization identified ROIs and modeled decomposed neural activity as the weighted sum of underlying neural activity and neuropil signal. A supervised classifier then processed these ROIs and separated cells from noncells based on temporal and spatial features. Lastly, manual acceptance or rejection of the classifier's predictions refined the final output neurons. Python code for Suite2p is available at https://github.com/MouseLand/suite2p (version 0.6.16).

Datasets
We tested our pipeline on two-photon videos from three different datasets, all recorded in mice. These videos covered multiple cortical and subcortical brain regions, were collected with multiple imaging conditions, and utilized various calcium sensors with different responses and kinetics (Extended Data Table 1-2).
Allen Brain Observatory. The dataset from the Allen Brain Observatory (ABO) consisted of 10 videos recorded from a depth of 275 µm and 10 videos recorded from a depth of 175 µm in the primary visual cortex (V1; de Vries et al., 2020). The 175 µm set had ∼200 neurons per video, and the 275 µm set had ∼300 neurons per video. For each depth, we used 10-fold cross-validation: we trained our model and determined the hyperparameters using one video and tested on the other nine videos. Data is available at http://observatory.brain-map.org/visualcoding.
Neurofinder. We used three sets of videos (01, 02, and 04) from three different labs with different imaging conditions from the Neurofinder competition (CodeNeuro, 2016). Each video was paired with another video obtained under the same imaging conditions, making six pairs of videos. For each of the six pairs, we trained the model and determined the hyperparameters on one video and tested on the other video. The 12 videos averaged ∼250 neurons per video. Videos are available at http://neurofinder.codeneuro.org/.
CaImAn. The CaImAn dataset (Giovannucci et al., 2019) contained four videos (J115, J123, K53, and YST) that imaged various brain regions. We divided each video into quarters to perform cross-validation, so that the training and test sets had the same imaging conditions. For two of the videos (J115 and K53), the average number of neurons per subvideo was ∼200. For these videos, we trained the model on one subvideo and tested on the remaining three subvideos. The other two videos (J123 and YST) had ∼40 and ∼80 neurons per subvideo, respectively. For these videos containing far fewer neurons, we used leave-one-out cross-validation, training on three subvideos and testing on the remaining subvideo.

Analysis
We compared three different deep learning segmentation pipelines: (1) SUNS, model training with supervised learning (SL) and hyperparameter optimization with a full grid search (GS); (2) SL and our new hyperparameter optimization pipeline (FLHO); and (3) SAND, model training using a combination of SL and neural ensemble learning with hyperparameter optimization by FLHO. We also compared SAND with the widely used matrix factorization methods Suite2p and CaImAn. We quantified the quality of the identified masks as the ratio of each mask's area to the area of its convex hull. We evaluated model accuracy by calculating the F1 score of each method on the test videos when trained with different numbers of ground truth neuron masks from the training video. We altered the number of ground truth masks used in training by randomly sampling different sets of SNR frames (Extended Data Fig. 1-6). We evaluated F1 across all frames and neurons in the test videos using the same ground truth masks as previous work (Soltanian-Zadeh et al., 2019; Bao et al., 2021). For CaImAn and Suite2p, we used the F1 values found in Bao et al. (2021), which previously optimized the hyperparameters for these pipelines.
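The mask quality metric above is the solidity of each mask. A sketch using `scipy.spatial.ConvexHull` (using all four corners of each pixel so the hull bounds whole pixels is our discretization choice, not necessarily the paper's):

```python
import numpy as np
from scipy.spatial import ConvexHull

def solidity(mask):
    """Mask quality: mask area divided by the area of its convex hull.
    Values near 1 indicate compact, soma-like shapes."""
    ys, xs = np.nonzero(mask)
    if ys.size < 3:
        return 1.0  # too few pixels for a 2-D hull; treat as compact
    # all four corners of every pixel, so the hull encloses whole pixels
    corners = np.concatenate([np.stack([ys + dy, xs + dx], axis=1)
                              for dy in (0, 1) for dx in (0, 1)])
    hull_area = ConvexHull(corners).volume  # in 2-D, `volume` is the area
    return mask.sum() / hull_area
```

A solid square mask scores 1.0, while irregular, concave masks (as often produced by the matrix factorization methods) score lower.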
We ran multiple analyses to test the performance of SAND. First, we compared SAND with SUNS, SL + FLHO, CaImAn, and Suite2p when trained on a low number of ground truth neurons. We also compared the performance of SAND trained on a low number of ground truth neurons with the asymptotic performance of SUNS. Finally, we compared the asymptotic performance of SAND with the asymptotic performance of SUNS. We binned the F1 scores for each condition by the number of neurons used in training. We compared algorithms using the Wilcoxon rank-sum test and by computing the effect size (Cohen's d).
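The statistical comparison can be sketched with SciPy; the pooled-SD form of Cohen's d is assumed, as the paper does not specify the variant:

```python
import numpy as np
from scipy.stats import ranksums

def compare_f1(scores_a, scores_b):
    """Compare two methods' binned F1 scores with the Wilcoxon rank-sum
    test and Cohen's d (pooled-standard-deviation version assumed).

    Returns (two-sided p-value, effect size d)."""
    _, p = ranksums(scores_a, scores_b)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1)
                         + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return p, d
```

The rank-sum test makes no normality assumption about the binned F1 scores, which is appropriate for small, bounded samples.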

Results
We first evaluated SAND using both ABO datasets (Fig. 2). Masks generated by SAND closely matched the ground truth masks even when trained on only 10 frames (Fig. 2A,B). Masks generated by SUNS trained on few frames, however, included many false positives, and masks generated by Suite2p and CaImAn were more irregularly shaped and less accurate than those generated by SAND (Fig. 2A,B; Extended Data Fig. 2-1A; Extended Data Tables 2-1, 2-2). SAND significantly outperformed all other methods when trained on 0-50 ground truth labels (∼10 labeled frames; Extended Data Fig. 2-2, Table 2-1). In the 275 µm dataset, SUNS achieved a median F1 score of 0.81 when trained on >250 labels (Fig. 2C; Extended Data Table 2-1). However, SAND achieved this F1 score when trained on only ∼25% of the number of labels and came within one standard deviation of this value when trained on only ∼12% of the number of labels (median F1 = 0.79; 34 ± 10 neurons). Additionally, the F1 score for SAND when trained on >250 neurons was significantly higher than the SUNS F1 score (Extended Data Table 2-1). In the 175 µm dataset (Fig. 2D), SUNS achieved a median F1 score of 0.81 when trained on >200 neuron labels (Extended Data Table 2-1). However, SAND came within one standard deviation of this value when trained on only ∼13% of the number of labels (median F1 = 0.77, 29 ± 12 neurons). Additionally, the F1 score for SAND when trained on >200 labels was significantly higher than the SUNS F1 score when trained on >200 labels (Extended Data Table 2-1). SAND also significantly outperformed the matrix factorization methods, CaImAn and Suite2p, over all numbers of ground truth masks (Extended Data Table 2-1). In particular, SAND trained on only 10 frames more reliably detected low pSNR neurons than CaImAn and Suite2p (Extended Data Fig. 2-3). SAND generally improved model precision (Fig. 2C,D). Both our new training method and our new hyperparameter optimization method helped maximize F1 in our pipeline. FLHO without pseudolabel training (SL + FLHO) had a modest effect on accuracy when trained on fewer ground truth masks (Fig. 2C,D). In addition to state-of-the-art accuracy, SAND also achieved the state-of-the-art processing speed of SUNS at ∼300 frames per second (Extended Data Fig. 2-4).
We next tested SAND on the Neurofinder dataset (Fig. 3). Masks generated by SAND closely matched the ground truth masks even when trained on only 10 frames (Fig. 3A,B). Masks generated by Suite2p and CaImAn were more irregularly shaped and had more false-negative predictions than SAND (Fig. 3A,B; Extended Data Fig. 2-1B, Table 2-3). SAND significantly outperformed SUNS when trained on 0-50 ground truth neuron labels (∼10 frames; Fig. 3C; Extended Data Fig. 3-1, Table 3-1). SUNS achieved a median F1 score of 0.58 when trained on 200-250 labels. However, the performance of SAND was not significantly different from this when trained on only ∼14% of the number of labels (median F1 = 0.53, 32 ± 12 neurons; Extended Data Table 3-1). Similar to observations when processing the ABO datasets, our new hyperparameter optimization without pseudolabel training partially improved accuracy when trained on fewer ground truth masks. SAND performed as well as or better than CaImAn segmentation over all numbers of ground truth masks (Extended Data Table 3-1). Overall, the Neurofinder dataset had the most variability in performance, likely due to the variety of imaging conditions throughout this dataset.
Finally, we tested SAND on the CaImAn dataset, starting with the K53 and J115 videos (Fig. 4A,B). When processing the K53 dataset, SAND significantly outperformed SUNS, Suite2p, and CaImAn at all numbers of ground truth neurons (Extended Data Table 4-1). SAND's performance when trained on 0-50 neurons (∼10-25 frames) was more accurate than the performance of SUNS using >150 ground truth neurons (∼500-1,800 frames; Fig. 4A; Extended Data Fig. 4-1, Table 4-1). When processing the J115 dataset, SAND significantly outperformed CaImAn and Suite2p on all numbers of ground truth neurons (Fig. 4B; Extended Data Table 4-1). SAND also significantly outperformed SUNS when trained on 0-50 ground truth neuron labels (∼10 frames; Extended Data Fig. 4-1, Table 4-1). For both videos, SAND's predicted masks aligned closely with the ground truth masks, even when trained on just 10 frames (Extended Data Fig. 4-2). SUNS's predicted masks included many false positives. Conversely, CaImAn and Suite2p both failed to detect many ground truth neurons. SAND outperformed CaImAn and Suite2p on both the YST and J123 videos on all numbers of ground truth neurons; however, SAND did not consistently outperform SUNS (Fig. 4C,D; Extended Data Fig. 4-3). On all of the CaImAn videos, SAND predicted masks with more consistent soma shapes than other methods (Extended Data Fig. 2-1C, Table 2-4).
To understand why SAND only moderately outperformed SUNS when processing the J123 and YST videos, we compared the quality of these videos to the quality of the other datasets. The pSNR of a neuron's fluorescence can predict the likelihood of that neuron being detected by both supervised and unsupervised segmentation methods: neurons with higher pSNR were more likely to be detected (Bao et al., 2021). We calculated the average and standard error of pSNR for all ground truth neurons in each video (Extended Data Fig. 4-4).
Neurons in J123 and YST had both lower average pSNR and more variable pSNR than neurons in other videos. This suggests that SAND works best on videos with high pSNR values and low variability of pSNR across neurons. However, SAND appears to be effective when only one of these conditions is met. For example, SAND effectively processed video K53, which had high pSNR but high variability; it also effectively processed the Neurofinder dataset, which had low variability but low pSNR.
The type of calcium indicator used in each recording impacted the pSNR values. Notably, the J123 and YST videos used GCaMP5 (Akerboom et al., 2012) and GCaMP3 (Tian et al., 2009), respectively. These older sensors have very low SNR relative to modern sensors, such as the GCaMP6 used in the other videos (Extended Data Table 4-2). Protein sensors of calcium have continued to develop, so recent sensors in the GCaMP8 series have even higher SNR than that of GCaMP6 (Chen et al., 2013; Ryan et al., 2023; Y. Zhang et al., 2023). It is likely that the high SNR of modern sensors will translate to high pSNR in two-photon neural recordings. This superior signal fidelity should allow our pipeline to accurately process modern neural recordings with small numbers of ground truth labels.
Finally, we tested how different imaging conditions (e.g., pSNR variability) affected the generalizability of SAND (Extended Data Fig. 4-5). We found that SAND generalized well when the training and test data had similar imaging conditions. For example, SAND trained on the ABO 175 µm dataset and tested on the ABO 275 µm dataset performed as well as SAND trained and tested on the ABO 275 µm dataset. We then tested ABO-trained SAND on the K53 dataset, which had higher average pSNR values and higher pSNR variability than the ABO dataset. We found that ABO-trained SAND still outperformed CaImAn and Suite2p on K53, but K53-trained SAND achieved the highest accuracy across all numbers of training labels. The accuracy of SAND and SUNS trained on the ABO 275 µm dataset and tested on the K53 dataset decreased as the number of ABO labels used to train these models increased. This is likely the result of increased model specificity when trained on data specific to certain imaging conditions. Augmenting the training data of SAND to make consistent predictions across a variety of noise levels would likely improve model generalizability. For example, we could add a mutual consistency learning step to SAND: we could train SAND to predict the same probability maps after adding different amounts of noise to the same frame.

Discussion
Current methods of neuron segmentation have a trade-off between accuracy and manual effort: supervised methods have superior accuracy but require substantial manual effort to generate ground truth labels for each imaging condition (Abbas and Masip, 2022). This work developed SAND, the first semi-supervised pipeline to segment active neurons from two-photon calcium recordings with limited ground truth labels. SAND effectively operated in this low label regime by using neural network ensembling and a new hyperparameter optimization pipeline. The former process generated a large and robust set of pseudolabels that trained a deep learning segmentation algorithm, while the latter process determined postprocessing hyperparameters from limited numbers of ground truth labels.
SAND achieved higher accuracy than fully supervised methods at multiple scales of labeling. At the small scale, SAND trained on labels from <1% of frames and 25% of all ground truth labels available in a movie was comparably accurate to fully supervised methods trained on all labels. When trained on all available ground truth labels in our movies (>200 neurons), SAND attained higher accuracy than current methods. SAND trained on low numbers of ground truth labels also consistently outperformed matrix factorization methods.
The high accuracy of SAND trained on low numbers of manual labels could allow researchers to circumvent the accuracy-effort trade-off. SAND attained state-of-the-art accuracy with ∼25% of the manual labels, and likely an even smaller fraction of the labeling effort. Previous studies on supervised methods required manual labeling of all of the hundreds to thousands of neurons in a single video to serve as a comprehensive training set (Soltanian-Zadeh et al., 2019; Bao et al., 2021). We estimate that manual labelers could identify and outline a single neuron per minute, with diminishing speed as they find fewer neurons while scanning through more frames of a movie. SAND could therefore reduce the labeling time needed to generate effective training labels for deep learning neuron segmentation algorithms to well under 1 h per experimental condition.
Pseudolabel training and FLHO both contributed to SAND's high accuracy when trained on few labels. Pseudolabeling generated a robust training dataset much larger than the manually labeled training set. This larger training set helped our shallow U-Net distinguish between noise and active neurons, reducing the number of false-positive calls. FLHO, in turn, improved accuracy by improving the hyperparameters used in postprocessing. Hyperparameter selection can greatly impact algorithm performance, but many other pipelines, such as SUNS, CaImAn, and Suite2p, employ supervised postprocessing steps that require large numbers of ground truth labels to accurately tune hyperparameters (Pachitariu et al., 2017; Giovannucci et al., 2019; Bao et al., 2021). FLHO helped bypass the accuracy-effort trade-off in hyperparameter optimization by directly calculating certain parameters from the limited ground truth labels.
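As an illustration of direct parameter calculation, a threshold such as min_area could be derived from the areas of the few labeled neurons rather than by grid search. The specific rule below (`min_area_from_labels`, a low percentile shrunk by a safety factor) is hypothetical and may differ from the calculations SAND's FLHO actually uses.

```python
import numpy as np

def min_area_from_labels(gt_masks, percentile=5, shrink=0.8):
    """Derive the min_area postprocessing threshold directly from the
    pixel areas of the labeled neurons: a low percentile of the observed
    areas, shrunk slightly so borderline-small true neurons survive.
    (Illustrative rule, not SAND's exact FLHO calculation.)"""
    areas = [int(mask.sum()) for mask in gt_masks]
    return shrink * np.percentile(areas, percentile)

# Ten labeled neurons with areas of 50, 55, ..., 95 pixels.
masks = [np.ones((1, a)) for a in range(50, 100, 5)]
threshold = min_area_from_labels(masks)
```

Because the threshold is computed in one pass from the labels, it needs no validation set, which is the property that lets this style of optimization work in the low-label regime.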
The relationship between the number of ground truth labels and the accuracy of neuron segmentation displayed three trends. First, in the regime of extremely low numbers of labels, such as 20-50, SAND outperformed its fully supervised sibling SUNS. Second, the F1 performance of both algorithms increased with the number of training labels, often reaching asymptotes at 150-250 labels. The large number of labels needed to saturate SAND and SUNS highlights the need for large sets of publicly available manual annotations across a variety of data, so that the field can better understand the conditions that saturate neural network-based segmentations. Third, precision often lagged recall in both SUNS and SAND, and the increase in precision largely accounted for the increase in F1. The reason for this is likely twofold. First, our ensemble learning method averaged the predictions of three models to generate conservative pseudolabels, which reduced training on samples near the detection threshold that could increase false positives. Second, FLHO was also likely conservative: it produced hyperparameters, such as p_thresh values, that were higher than those found by grid search on the few-label dataset, eliminating weakly confident predictions.
SAND reduces the manual labeling effort compared with fully supervised algorithms while inheriting the prediction speed of the underlying SUNS shallow U-Net architecture (Extended Data Fig. 2-4). This speed was an order of magnitude faster than the rate of data collection (Bao et al., 2021). Fast prediction can enable researchers to identify neurons of interest from their recordings in real time and perform targeted perturbation experiments within the same imaging session or during imaging. This capability could help researchers study neural ensemble dynamics in memory and perception that are consistent on the minutes timescale but change from one day to the next (Ziv et al., 2013; Driscoll et al., 2017; Rule et al., 2020; Deitch et al., 2021; Pérez-Ortega et al., 2021). Our ensemble training and hyperparameter optimization processes also reduce training time compared with SUNS because SAND trained on only 10-25 labeled frames, far fewer than the 1,800 frames used for SUNS. These benefits at training and test time could also arise from partnerships between existing or future neuron segmentation algorithms and our semi-supervised approaches. Because our ensemble learning and FLHO modify the training approach without dictating the underlying supervised machine learning architecture, they could retain the accuracy or speed of other algorithms while boosting those algorithms' performance in the low-label regime.
Like all machine learning neuron segmentation algorithms, SAND will likely benefit from recent developments in protein engineering and video processing. Our work showed that SAND in particular benefits from higher responses and smaller variance in response. Such distributional changes have been instantiated by recent generations of protein calcium indicators, which are both more responsive and more linear (Dana et al., 2019; Y. Zhang et al., 2023). Additionally, novel unsupervised video denoising pipelines, such as DeepInterpolation (Lecoq et al., 2021) and DeepCAD-RT (Li et al., 2022), may also improve recall by reducing noise and thereby increasing SNR. Increases in pSNR have correlated with increased recall (Soltanian-Zadeh et al., 2019; Bao et al., 2021). SNR gains will likely increase precision as well by reporting even small calcium fluctuations.
Future work could directly improve our implementation of SAND or create alternative implementations. Direct improvement of SAND could optimize the frame selection or model selection to maximize accuracy. Our current approach randomly selected the frames used for labeling; systematic selection of these frames could more effectively represent the range of neuron characteristics (e.g., size and pSNR) with even fewer ground truth labels. Additionally, our current approach defaulted to the SUNS shallow U-Net architecture as the final neural network for neuron predictions. Future iterations of SAND could evaluate the accuracy of all ensemble U-Nets on the ground truth data and then perform pseudolabel training on the U-Net with the lowest error. Finally, improvements to SAND or SUNS could also help detect neurons by improving the postprocessing classification step. Such changes could use dynamic information from a large temporal extent to detect sparsely and weakly active neurons (Soltanian-Zadeh et al., 2019).
Applications of SAND beyond the two-photon datasets in this work are potentially numerous. Future SAND applications could help process imaging data from one-photon or volumetric imaging settings, which generally have lower SNR than planar two-photon imaging (Jung et al., 2004; Ahrens et al., 2013; Ji et al., 2016; Waters, 2020). SAND can stand alone to process such data or pair with segmentation algorithms that target specific optical imaging data types (Yuanlong Zhang et al., 2023). Likewise, future testing could apply SAND to the diverse calcium recordings of many cell types, such as inhibitory neurons or glia (Akerboom et al., 2013; Semyanov et al., 2020; Mulholland et al., 2021). SAND's ability to accurately segment neurons in the few-labels regime can potentially help individual labs process imaging data from distinctive imaging preparations even when a substantial manually labeled training dataset, generated by a single lab or a large community, does not yet exist.

Figure 1.
Figure 1. A multistep pipeline processed the input videos into masks using semi-supervised learning and FLHO for postprocessing. A, Examples of preprocessed video frames, intermediate SNR representation frames, model output, and final masks obtained from our pipeline. B, Schematic of our semi-supervised training pipeline that used ensemble learning to predict active neurons in unlabeled frames. We trained three different shallow U-Nets on labeled frames. The titles above the U-Nets represent the number of channels in each level of the decoder, starting with the deepest level, and "c" denotes a concatenation. We then passed unlabeled frames through each model and averaged the resulting probability maps to create pseudolabels. We retrained and fine-tuned one of the three U-Nets with these pseudolabels and then fine-tuned this network with a final round of training on the labeled frames. See Extended Data Figures 1-1, 1-2, 1-3A,B, and 1-6 for more details. C, FLHO found optimal hyperparameters for the postprocessing pipeline that processed frames of probability maps to predict individual neurons. We first thresholded the probability maps from the CNN (p_thresh). We then segmented ROIs in each frame and removed ROIs that were smaller than min_area. We then merged ROIs across frames by their relative spatial locations (centroid_dist) and removed any ROIs that were not active for enough consecutive frames (min_consecutive). * denotes hyperparameters. See Extended Data Figures 1-3C, 1-4, and 1-5, Table 1-1, and Table 1-2 for more details.
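The first two postprocessing steps described above (thresholding at p_thresh, then removing components smaller than min_area) can be sketched in plain NumPy. `segment_frame` is an illustrative name, not SAND's implementation; merging ROIs across frames by centroid_dist and filtering by min_consecutive would follow in the same style.

```python
import numpy as np

def segment_frame(prob_map, p_thresh=0.5, min_area=3):
    """Threshold one frame's probability map (p_thresh), then keep only
    4-connected components with at least min_area pixels."""
    binary = prob_map >= p_thresh
    seen = np.zeros_like(binary, dtype=bool)
    rois = []
    for r, c in zip(*np.nonzero(binary)):
        if seen[r, c]:
            continue
        # Flood-fill one connected component starting from (r, c).
        stack, component = [(r, c)], []
        seen[r, c] = True
        while stack:
            y, x = stack.pop()
            component.append((y, x))
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if (0 <= ny < binary.shape[0] and 0 <= nx < binary.shape[1]
                        and binary[ny, nx] and not seen[ny, nx]):
                    seen[ny, nx] = True
                    stack.append((ny, nx))
        if len(component) >= min_area:
            rois.append(component)
    return rois

prob = np.array([[0.9, 0.9, 0.0, 0.7],
                 [0.9, 0.6, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0]])
rois = segment_frame(prob, p_thresh=0.5, min_area=3)
# One 4-pixel ROI survives; the isolated 0.7 pixel is removed by min_area.
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-written flood fill; it is spelled out here only to keep the example self-contained.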

Figure 2.
Figure 2. SAND outperformed other pipelines on low numbers of ground truth labels when processing the ABO 275 µm and ABO 175 µm datasets. A, Example segmentations from ABO 275 µm video 539670003. Masks generated by SAND were more accurate than those of other methods, even when trained on only 10 frames. Yellow boxes indicate the region isolated in panel B. Scale bar, 50 µm. See Extended Data Figure 2-1A and Table 2-2 for more details. B, Example neurons zoomed from boxed regions in panel A. When trained on only 10 frames, SUNS identified many false-positive masks, whereas SAND accurately identified neurons. CaImAn and Suite2p both failed to find some ground truth neurons, and Suite2p in particular had irregularly shaped masks. Scale bar, 25 µm. SAND had higher accuracy than other methods with low numbers of ground truth labels on both the (C) ABO 275 µm and (D) ABO 175 µm datasets. Dots represent the average F1 score for each model when processing the nine test videos. Lines represent the mean F1 scores averaged over bins grouped by the number of training labels; bins spanned 0-50 labels, 50-100 labels, etc. Shaded regions represent standard error. Horizontal lines are the average F1 scores of Suite2p and CaImAn. SAND generally did not improve recall but improved precision for the ABO datasets. The red line (SAND) represents ensemble learning and hyperparameter optimization with FLHO. The blue line represents single-model supervised learning and hyperparameter optimization with FLHO. The orange line (SUNS) represents single-model supervised learning and grid search hyperparameter optimization. See Extended Data Figures 2-2, 2-3, and 2-4 and Table 2-1 for more details.

Figure 3.
Figure 3. SAND outperformed other pipelines on low numbers of ground truth labels using the Neurofinder datasets. A, Example segmentations from Neurofinder video 4.00. Masks generated by SAND were more accurate than those of other methods, even when trained on only 10 frames. Yellow boxes indicate the region isolated in panel B. Scale bar, 50 µm. See Extended Data Figure 2-1B and Table 2-3 for more details. B, Example neurons zoomed from boxed regions in panel A. When trained on only 10 frames, SAND correctly identified more masks than CaImAn and Suite2p. Scale bar, 25 µm. C, SAND generally had higher accuracy than other methods when trained on a low number of ground truth labels. Dots represent the average F1 score for each model when processing the test video(s). Lines represent the mean F1 scores averaged over bins grouped by the number of training labels; bins spanned 0-50 labels, 50-100 labels, etc. Shaded regions represent standard error. Horizontal lines are the average F1 scores of Suite2p and CaImAn. More than half of the Neurofinder videos did not have >250 neurons, so we did not include trials with >250 neurons in comparisons and binned results. The red line (SAND) represents ensemble learning and hyperparameter optimization with FLHO. The blue line represents single-model supervised learning and hyperparameter optimization with FLHO. The orange line (SUNS) represents single-model supervised learning and grid search hyperparameter optimization. See Extended Data Figure 3-1 and Table 3-1 for more details.

Figure 4.
Figure 4. SAND outperformed other pipelines on low numbers of ground truth labels using the CaImAn datasets. A, When trained with low numbers of ground truth neurons, SAND outperformed all other methods on the K53 video. SAND had the highest F1 and precision of all methods. See Extended Data Figure 2-1C and Table 2-4 for more details. B, When trained with low numbers of ground truth neurons, SAND outperformed all other methods on the J115 video. SAND outperformed CaImAn and Suite2p, but not SUNS, on the (C) J123 and (D) YST videos when trained on low numbers of ground truth neurons. Dots represent the average F1 score for each model when processing the test video(s). Lines represent the mean F1 scores averaged over bins grouped by the number of training labels; bins spanned 0-50 labels, 50-100 labels, etc. Shaded regions represent standard error. Horizontal lines are the average F1 scores of Suite2p and CaImAn. The red line (SAND) represents ensemble learning and hyperparameter optimization with FLHO. The blue line represents single-model supervised learning and hyperparameter optimization with FLHO. The orange line (SUNS) represents single-model supervised learning and grid search hyperparameter optimization. See Extended Data Figures 4-1 and 4-5 and Table 4-2 for more details.

Table 3-1 for more details.