GhostiPy: An Efficient Signal Processing and Spectral Analysis Toolbox for Large Data

Abstract Recent technological advances have enabled neural recordings consisting of hundreds to thousands of channels. As the pace of these developments continues to grow rapidly, it is imperative to have fast, flexible tools supporting the analysis of neural data gathered by such large-scale modalities. Here we introduce GhostiPy (general hub of spectral techniques in Python), a Python open source software toolbox implementing various signal processing and spectral analyses including optimal digital filters and time–frequency transforms. GhostiPy prioritizes performance and efficiency by using parallelized, blocked algorithms. As a result, it is able to outperform commercial software in both time and space complexity for high-channel count data and can handle out-of-core computation in a user-friendly manner. Overall, our software suite reduces frequently encountered bottlenecks in the experimental pipeline, and we believe this toolset will enhance both the portability and scalability of neural data analysis.


Introduction
Advancements in neural recording technologies have enabled the collection of large data in both space (high density/channel count) and time (continuous recordings). During subsequent analysis, the scale of the data induces certain challenges that may manifest as the following scenarios: (1) analysis code takes a long time to complete (high time complexity); and (2) code is unable to complete because of insufficient memory on the hardware (high space complexity). Moreover, the scientist may have difficulty finding existing tools that address both (1) and (2) while implementing the desired analyses.
Although a potential remedy is to simply upgrade the hardware, it is not an acceptable solution for scientists desiring portability, an important component that improves reproducibility and replicability. In more portable systems, hardware resources may be limited (e.g., using a laptop at the airport). We thus took an alternate approach by efficiently implementing analyses that would trivially scale for different hardware configurations. Our solution is GhostiPy (general hub of spectral techniques in Python), a free and open source Python toolbox that attempts to optimize both time and space complexity in the context of spectral analyses. Methods include linear filtering, signal envelope extraction, and spectrogram estimation, according to best practices. GhostiPy is designed for general purpose usage; while well suited for high-density continuous neural data, it works with any arbitrary array-like data object.
In this article, we first describe the software design principles of GhostiPy to increase efficiency. We then elaborate on featured methods along with code samples illustrating the user friendliness of the software. Finally, we benchmark our software against a comparable implementation, and we discuss strategies for working under an out-of-core (when data cannot fit into system memory) processing context.

Materials and Methods
An overview of implemented methods can be found in Table 1. Excluding out-of-core support, it is possible to use multiple different packages (OverLordGoldDragon, https://github.com/OverLordGoldDragon/ssqueezepy/; Bokil et al., 2010; Oostenveld et al., 2011; Gramfort et al., 2013; Yegenoglu et al., 2015; Lee et al., 2019; Tadel et al., 2019; Virtanen et al., 2020) to achieve the same functionality. However, the mix-and-match approach can reduce user friendliness, since application programming interfaces (APIs) differ across packages and dependency management is more difficult. We believe our unified package provides an attractive solution to this challenge. Table 2 documents the methods currently available in GhostiPy.

Software design considerations
As previously noted, successful completion of analyses may be hampered by long computation times or lack of system memory. Specifically, algorithmic time and space complexity is a major determinant for the efficiency and performance of a software method. In general, it is difficult to optimize both simultaneously. For example, time complexity may be reduced by increasing hardware parallelization, at the expense of higher space complexity (memory requirements). While we sought to lower both kinds of complexity compared with existing solutions, we gave space complexity a higher priority. Stated concretely, slow computation time is primarily a nuisance, but failure to complete an analysis because of insufficient memory is catastrophic.
Our design decision to prioritize space complexity was particularly critical because it directly influenced which backend library we chose for the fast Fourier transform (FFT), an operation used in the majority of the GhostiPy methods. While investigating the different options, we saw that numpy currently uses the pocketfft backend (https://gitlab.mpcdf.mpg.de/mtr/pocketfft; Van Der Walt et al., 2011). When accelerated with the Intel MKL library, it can be slightly faster than FFTW (https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library/benchmarks.html). However, we have found FFTW (Frigo and Johnson, 1998, 2005) to be superior for memory management and better suited for FFTs of arbitrary length, including prime and odd numbers. An additional benefit of FFTW was its multithreaded capabilities (Fig. 1). We therefore selected FFTW as our FFT backend.
To lower space complexity, we used blocked algorithms, including overlap save convolution, which is not offered in any of the standard Python numerical computing libraries such as numpy or scipy (Van Der Walt et al., 2011; Virtanen et al., 2020). This approach enabled us to process very large data that could not fit in memory (also known as out-of-core processing). Throughout our code, we also used other strategies such as in-place operations.
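As an illustration of the blocked approach, the following sketch implements overlap-save convolution with NumPy and SciPy. This is not GhostiPy's internal code; the function and parameter names are our own. Each iteration transforms only one block, so peak memory scales with the block size rather than the signal length:

```python
import numpy as np
from scipy.fft import rfft, irfft, next_fast_len

def overlap_save(x, h, block_size=4096):
    """FIR filtering via blocked overlap-save FFT convolution.

    Output matches np.convolve(x, h, mode='same'), but only one block of
    the input is transformed at a time, so peak memory is governed by the
    block size rather than the signal length.
    """
    L = len(h)
    nfft = next_fast_len(block_size + L - 1)
    step = nfft - (L - 1)              # new output samples produced per block
    H = rfft(h, nfft)                  # filter spectrum, computed once

    xp = np.concatenate([np.zeros(L // 2), x])   # delay compensation
    out = np.empty(len(x))
    written = 0
    while written < len(x):
        seg = xp[written:written + nfft]
        if len(seg) < nfft:
            seg = np.pad(seg, (0, nfft - len(seg)))  # zero-pad final block
        y = irfft(rfft(seg) * H, nfft)
        valid = y[L - 1:]              # first L-1 circular samples are invalid
        n = min(step, len(x) - written)
        out[written:written + n] = valid[:n]
        written += n
    return out

# Example: smooth white noise with a normalized Hann window.
x = np.random.default_rng(0).standard_normal(20_000)
h = np.hanning(101)
h /= h.sum()
y = overlap_save(x, h)
```

Because each block is processed independently, `x` could just as well be a disk-backed array read one block at a time.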
To lower the time complexity, we used efficient lengths of FFTs wherever possible, and we leveraged modern computing hardware by parallelizing our algorithms. For example, a wavelet transform can be trivially parallelized since the transform for each scale is not dependent on other scales.
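Both time-complexity strategies can be illustrated with SciPy's FFT interface (GhostiPy itself uses FFTW, but the same principles apply; the length and worker count below are arbitrary):

```python
import numpy as np
from scipy.fft import rfft, next_fast_len

# A large prime-length transform is a worst case for many FFT backends.
n = 1_000_003                      # prime length
x = np.random.default_rng(0).standard_normal(n)

# Zero-padding to the next "fast" composite length typically makes the
# transform much cheaper at negligible cost in memory.
nfft = next_fast_len(n)
X = rfft(x, nfft, workers=8)       # workers: parallelize across threads
```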

Finite impulse response filter design
GhostiPy provides classical signal processing capabilities such as filtering data, using efficient overlap save convolution. Filtering data is a ubiquitous operation, but before this stage, the filter must itself be designed. While this step may appear somewhat trivial, filter design choices can make a significant difference to scientific conclusions, including the very existence of theta–gamma phase–amplitude coupling (Canolty et al., 2006; Dvorak and Fenton, 2014).
Existing packages such as scipy and MNE offer a variety of finite impulse response (FIR) filter design methods (Gramfort et al., 2013; Virtanen et al., 2020). However, some methods suffer from the following issues. (1) Using the least-squares method, the solution may be a filter whose magnitude response is effectively zero throughout. This situation is more common when designing filters with a passband that is low relative to the sampling rate. (2) Using the Remez exchange method, the algorithm may simply fail to converge. (3) Using the window method, the transition bands cannot be controlled exactly, and optimality cannot be defined, as it can be for the least-squares (L2 optimal) and Remez exchange (L1 optimal) methods.
Therefore, the GhostiPy filter design uses the method defined in the study by Burrus et al. (1992) for the following reasons: (1) it is simple to design, its computational complexity is similar to that of a window method, and it can be implemented on embedded hardware if desired; (2) optimality can be defined, as it is optimal in the L2 sense; (3) transition bands can be defined exactly, and the steepness of the passband rolloff can be controlled by the spline power parameter; and (4) the filter impulse response can be defined analytically. Consequently, its computation does not suffer from the failure modes of the least-squares or Remez exchange methods, which must solve systems of linear equations. In other words, the design process is reliable and stable.
This method designs a low-pass filter according to the following:

$$h[n] = \frac{\sin(\omega_0 n)}{\pi n} \left[ \frac{\sin\!\left(\frac{(\omega_2 - \omega_1)\, n}{2p}\right)}{\frac{(\omega_2 - \omega_1)\, n}{2p}} \right]^p, \qquad \omega_0 = \frac{\omega_1 + \omega_2}{2}, \quad (1)$$

where ω1 and ω2 are radian frequencies defining the transition-band boundaries and p is the spline power. GhostiPy uses the low-pass filter defined in Equation 1 as a prototype to design more complicated filters. As a result, users can request filters with arbitrary magnitude response. An example is shown in Figure 2. (Note in the code that the data-sampling rate is 1250 Hz, the FFT is parallelized across eight threads, and ripple_data is a 1D numpy array.)
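To make the design concrete, the sketch below evaluates an analytic spline-transition low-pass impulse response in the spirit of Burrus et al. (1992). The function name, parameter names, and thresholds are ours, not GhostiPy's API:

```python
import numpy as np

def spline_lowpass(numtaps, w1, w2, p):
    """Low-pass FIR prototype with a pth-order spline transition band.

    w1, w2 : transition-band edges in radians/sample (0 < w1 < w2 < pi)
    p      : spline power; larger p gives a smoother, steeper rolloff
    The impulse response is fully analytic: no linear system is solved,
    so the design cannot fail to converge.
    """
    n = np.arange(numtaps) - (numtaps - 1) / 2
    w0 = (w1 + w2) / 2                 # transition-band center
    dw = (w2 - w1) / 2                 # half the transition width
    # np.sinc(t) = sin(pi t)/(pi t), so arguments are rescaled by pi.
    return (w0 / np.pi) * np.sinc(w0 * n / np.pi) * \
        np.sinc(dw * n / (p * np.pi)) ** p

# Design a low pass with transition band 0.2*pi to 0.3*pi rad/sample,
# then evaluate its magnitude response on a frequency grid.
h = spline_lowpass(401, 0.2 * np.pi, 0.3 * np.pi, p=2)
w = np.linspace(0, np.pi, 512)
H = np.abs(np.exp(-1j * np.outer(w, np.arange(len(h)))) @ h)
```

The impulse response is symmetric, so the filter is linear phase; the magnitude response is near unity in the passband and decays smoothly to near zero across the specified transition band.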

Multitaper method
Users often wish to perform a spectral decomposition on a signal of interest. This can be accomplished using the multitaper method (Thomson, 1982; Percival and Walden, 1993). The technique is well suited to reduce the variance of a spectrum estimate, which is particularly useful when working with noisy neural data. For a discrete signal x[n] with sampling frequency f_s, the spectrum estimate is obtained as an average of multiple statistically independent spectrum estimators, as follows:

$$\hat{S}(f) = \frac{1}{L} \sum_{l=0}^{L-1} \hat{S}_l(f), \qquad \hat{S}_l(f) = \frac{1}{f_s} \left| \sum_{n=0}^{N-1} v_{l,W}[n]\, x[n]\, e^{-i 2\pi f n / f_s} \right|^2 .$$

Given the length of data N and a smoothing half-bandwidth W, the tapers v_{l,W}[n] are computed by solving for vectors that satisfy the energy and orthogonality properties, as follows:

$$\sum_{n=0}^{N-1} v_{l,W}[n]^2 = 1, \quad (5)$$

$$\sum_{n=0}^{N-1} v_{l,W}[n]\, v_{m,W}[n] = 0, \quad l \neq m. \quad (6)$$

For the tapers, GhostiPy uses the discrete prolate spheroidal sequences (DPSSs), which satisfy Equations 5 and 6 and maximize the power in the band [−W, W] (Thomson, 1982). An example for computing the multitapered spectrum is shown in Figure 3.

[Figure 4 caption. Time-frequency decompositions. a-c, Users can leverage the scipy spectrogram (a) along with the methods of GhostiPy (b-d) for a richer understanding of their data. The synchrosqueezed transform in d gives the overall sharpest time and frequency resolution. Note in the code that data are a 1D numpy array, fs is the sampling rate, nperseg is the spectrogram window size in samples, noverlap is the number of samples overlapping in adjacent windows, and w is the bandwidth for the multitapered spectrogram.]
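A minimal multitaper estimate can be sketched with SciPy's DPSS implementation. This is not GhostiPy's API; `mtm_spectrum` and its parameters are our own:

```python
import numpy as np
from scipy.signal.windows import dpss

def mtm_spectrum(x, fs, bandwidth, n_tapers=None):
    """Average-of-eigenspectra multitaper PSD estimate (sketch)."""
    n = len(x)
    nw = bandwidth * n / (2 * fs)          # time half-bandwidth product
    if n_tapers is None:
        n_tapers = max(1, int(2 * nw) - 1) # usual 2NW - 1 rule
    tapers = dpss(n, nw, n_tapers)         # (n_tapers, n), orthonormal
    eigspectra = np.abs(np.fft.rfft(tapers * x, axis=-1)) ** 2 / fs
    psd = eigspectra.mean(axis=0)          # average across tapers
    freqs = np.fft.rfftfreq(n, 1 / fs)
    return freqs, psd

# Example: a 100 Hz tone in white noise, 4 Hz smoothing bandwidth.
fs = 1000
t = np.arange(2000) / fs
rng = np.random.default_rng(1)
x = np.sin(2 * np.pi * 100 * t) + 0.1 * rng.standard_normal(len(t))
freqs, psd = mtm_spectrum(x, fs, bandwidth=4)
```

Averaging across tapers trades a controlled amount of spectral smoothing (set by the bandwidth) for a substantial reduction in estimator variance.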

Continuous wavelet transform
Neuroscientists often use a continuous wavelet transform (CWT) to study transient oscillatory activity. The CWT itself is defined in the time domain by the following:

$$W(a, b) = \int_{-\infty}^{\infty} x(t)\, \frac{1}{a}\, \psi^{*}\!\left(\frac{t - b}{a}\right) dt, \quad (7)$$

where ψ(·) is the mother wavelet function. The transform represents a two-dimensional decomposition in the scale (a) and time (b) planes. In the frequency domain, the CWT for a given scale a is given by the inverse Fourier transform of the following:

$$\widetilde{W}_a(\omega) = X(\omega)\, \Psi^{*}(a\omega), \quad (8)$$

where X and Ψ are the Fourier transforms of x and ψ, respectively. Many mother wavelet functions have been investigated in the literature, but we have focused on the analytic wavelets, as they are found to be superior, particularly for estimating the phase (Olhede and Walden, 2002; Lilly and Gascard, 2006; Lilly and Olhede, 2009, 2012). We have implemented the analytic Morse, Morlet, and Bump wavelets, whose respective frequency domain definitions are as follows:

$$\Psi(a\omega) = 2\left(\frac{e\gamma}{\beta}\right)^{\beta/\gamma} (a\omega)^{\beta}\, e^{-(a\omega)^{\gamma}}\, H(\omega), \quad (9)$$

$$\Psi(a\omega) = 2\, e^{-(a\omega - \omega_0)^2/2}\, H(\omega), \quad (10)$$

$$\Psi(a\omega) = 2\, e^{1 - \frac{1}{1 - ((a\omega - \mu)/\sigma)^2}}\, \mathbb{1}_{[(\mu - \sigma)/a,\, (\mu + \sigma)/a]}(\omega), \quad (11)$$

where $\mathbb{1}_{[(\mu - \sigma)/a,\, (\mu + \sigma)/a]}$ is the indicator function for the interval (μ − σ)/a ≤ ω ≤ (μ + σ)/a and H(ω) is the Heaviside step function. In our implementation, we use Equation 8 to compute the CWT.

[Figure 6 caption. Speed spectrogram. The multitaper spectrogram (bandwidth, 5 Hz) was computed with GhostiPy for nonoverlapping 0.5 s time bins and then z scored for each frequency. Each time bin in the spectrogram was assigned to 1 of 21 logarithmically spaced speed bins spanning 0.125-64 cm/s. a, b, The mean PSD for each speed bin is shown for an animal exploring a novel environment (a) and a familiar environment (b).]
Note that in practice the time series x(t) is sampled, and the CWT is likewise sampled. Equation 8 then becomes a pointwise complex multiplication of discrete Fourier transforms, where the discretized angular frequencies ω_k are determined by the following:

$$\omega_k = \frac{2\pi k}{N \Delta t}, \qquad k = 0, 1, \ldots, N - 1, \quad (12)$$

where N is the number of data samples and Δt is the sampling interval. A naive implementation of the wavelet transform (Eq. 8) calculates untruncated wavelets the same length as the input data. This is often inefficient because it is equivalent to convolving the data with a time-domain wavelet consisting mainly of leading and trailing zeros. In our approach, we exploit the fact that wavelets are finite in time and frequency, and we use an overlap-save algorithm to compute the CWT purely in the frequency domain. The latter point is particularly critical: because of the Gibbs phenomenon, using any time-domain representation of the wavelet may violate numerical analyticity for wavelet center frequencies near the Nyquist frequency. It is therefore necessary to use only the frequency domain representation of the wavelet. While we offer both traditional/naive and blockwise convolution implementations, the latter will give superior performance for longer-duration data. We believe that this is a valuable option for researchers and that this is the first tool that uses blockwise convolution to implement the CWT.
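The frequency-domain computation can be sketched for the analytic Morlet wavelet as follows: the transform at each scale is a pointwise product of the data spectrum with the conjugated wavelet spectrum, followed by one inverse FFT. This is a naive, non-blocked sketch with our own names, not GhostiPy's code:

```python
import numpy as np

def morlet_cwt(x, scales, w0=6.0):
    """Analytic Morlet CWT computed entirely in the frequency domain."""
    n = len(x)
    omega = 2 * np.pi * np.fft.fftfreq(n)        # rad/sample
    X = np.fft.fft(x)
    out = np.empty((len(scales), n), dtype=complex)
    for i, a in enumerate(scales):
        # Analytic Morlet: 2 exp(-(a*omega - w0)^2 / 2) on positive freqs.
        psi = 2 * np.exp(-((a * omega - w0) ** 2) / 2) * (omega > 0)
        out[i] = np.fft.ifft(X * np.conj(psi))
    return out

# Example: a 50 Hz tone; the CWT power should peak at the scale where
# a * (2*pi*50/fs) is approximately w0, i.e., a of about 19.1.
fs = 1000
t = np.arange(4000) / fs
x = np.sin(2 * np.pi * 50 * t)
scales = np.geomspace(5, 60, 80)
W = morlet_cwt(x, scales)
power = (np.abs(W) ** 2).mean(axis=1)
best = scales[np.argmax(power)]
```

Because the wavelet is defined only on positive frequencies, the transform is numerically analytic by construction, avoiding the Gibbs-related issue described above.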
For electrophysiological data, a typical wavelet analysis will require computing Equation 8 for 50-500 scales. This is an obvious candidate for parallelization since the wavelet transform for each scale can be computed independently of the others. We use a backend powered by Dask to carry out the parallelization (Rocklin, 2015). Users can set the number of parallel computations to execute and thereby leverage the multicore capabilities offered by modern computing hardware.
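A standard-library sketch of this per-scale parallelism follows. GhostiPy itself uses a Dask backend; here a thread pool stands in, and threads only overlap to the extent that the FFT backend releases the GIL:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def cwt_one_scale(X, omega, a, w0=6.0):
    """Single-scale analytic Morlet CWT (frequency-domain product)."""
    psi = 2 * np.exp(-((a * omega - w0) ** 2) / 2) * (omega > 0)
    return np.fft.ifft(X * psi)

def cwt_parallel(x, scales, n_workers=4):
    """Dispatch independent per-scale computations to a thread pool."""
    n = len(x)
    X = np.fft.fft(x)                            # shared across scales
    omega = 2 * np.pi * np.fft.fftfreq(n)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        rows = list(pool.map(lambda a: cwt_one_scale(X, omega, a), scales))
    return np.stack(rows)

fs = 1000
t = np.arange(2000) / fs
x = np.sin(2 * np.pi * 30 * t)
scales = np.geomspace(10, 80, 32)
Wp = cwt_parallel(x, scales)
```

The input spectrum X is computed once and shared; only the per-scale products and inverse transforms are parallelized.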

Synchrosqueezing transform
One disadvantage of the wavelet transform is that its frequency resolution decreases as the temporal resolution increases. Strictly speaking, the CWT results in information contained in the (time, scale) plane, but a single frequency is typically assigned to each scale. Regardless, spectral smearing can be observed at higher frequencies/lower scales. However, Daubechies (1996) and Thakur et al. (2013) showed that the synchrosqueezing transform (SST) could mitigate this issue by transferring CWT (time, scale) plane information to the (time, frequency) plane.
The synchrosqueezing transform proceeds as follows. For every scale a: compute the CWT W(a, b) using Equation 8; compute the following partial derivative (evaluated in the frequency domain):

$$\partial_b W(a, b) = \frac{1}{2\pi} \int i\omega\, X(\omega)\, \Psi^{*}(a\omega)\, e^{i\omega b}\, d\omega; \quad (13)$$

and compute the following phase transform:

$$\omega(a, b) = \frac{1}{2\pi i} \frac{\partial_b W(a, b)}{W(a, b)}. \quad (14)$$

The phase transform contains the real frequencies to which each point in the CWT matrix should be assigned. In practice, the real frequency space is discretized, so the CWT points are assigned to frequency bins. Note that multiple CWT points at a given time coordinate, b, may map to the same frequency bin. In this situation, a given frequency bin is a simple additive accumulation of the CWT points mapped to it.
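These steps can be sketched for the Morlet CWT as follows: the derivative along b is obtained in the frequency domain by multiplying the spectrum by iω, and the phase transform is the imaginary part of the ratio, giving frequency in cycles per sample. The names below are ours, not GhostiPy's API:

```python
import numpy as np

def phase_transform(x, scales, w0=6.0):
    """Morlet CWT and its SST phase transform (sketch)."""
    n = len(x)
    omega = 2 * np.pi * np.fft.fftfreq(n)          # rad/sample
    X = np.fft.fft(x)
    W = np.empty((len(scales), n), dtype=complex)
    dW = np.empty_like(W)
    for i, a in enumerate(scales):
        psi = 2 * np.exp(-((a * omega - w0) ** 2) / 2) * (omega > 0)
        W[i] = np.fft.ifft(X * psi)
        dW[i] = np.fft.ifft(X * psi * 1j * omega)  # derivative along b
    with np.errstate(divide='ignore', invalid='ignore'):
        freq = np.imag(dW / W) / (2 * np.pi)       # cycles per sample
    return W, freq

# Example: for a pure 50 Hz tone, wherever |W| is appreciable the phase
# transform recovers the true frequency, 50/fs = 0.05 cycles per sample.
fs = 1000
t = np.arange(4000) / fs
x = np.sin(2 * np.pi * 50 * t)
scales = np.geomspace(5, 60, 40)
W, freq = phase_transform(x, scales)
```

Synchrosqueezing then accumulates |W| (or W) into discrete frequency bins indexed by `freq`, collapsing the smeared scale axis onto sharp frequency ridges.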
Note the similarity of the SST to the spectral reassignment algorithms in the studies by Gardner and Magnasco (2006) and Fitz and Fulop (2009). However, an important distinction is that the SST only operates along the scale dimension. In addition to preserving the temporal resolution of the CWT, this makes SST data easy to work with since uniform sampling can be maintained.
Overall, the spectrogram methods implemented by GhostiPy give an experimenter a more complete picture of the time-varying spectral content of neural data. Figure 4 illustrates this using the scipy standard spectrogram method along with the GhostiPy methods.

Data availability
The code/software described in the article is freely available online at https://github.com/kemerelab/ghostipy/.

[Figure 7 caption. Theta cycle clustering. The Morse wavelet (γ = 3, β = 20) CWT was computed with GhostiPy using 81 frequencies and was subsequently divided into multiple windows, where one window corresponded to one theta cycle. Each CWT sample in a window was assigned to 1 of 20 phase bins according to the instantaneous theta phase at that particular sample. The result was frequency-phase power profiles, which were then clustered into four clusters. Shown is the mean frequency-phase power profile for each cluster. As in Zhang et al. (2019), these use the hc-11 dataset from CRCNS.org (Grossmark and Buzsáki, 2016; Grossmark et al., 2016).]
The Jupyter Notebooks used to produce the figures can be found at https://github.com/kemerelab/ghostipy/tree/master/examples/2021paper. The code and notebooks are also available as Extended Data 1 and Extended Data 2. All results were obtained on an Intel Core i7-4790 desktop computer running the Ubuntu 16.04 operating system.

Example analyses
An example spectrogram of local field potentials recorded in area CA1 of the rat hippocampus is depicted in Figure 5. Clearly apparent are the theta oscillation, theta-nested gamma oscillations, and a sharp wave ripple, which occurs after the animal has stopped moving.
In addition, GhostiPy can be used as an intermediate for a multistep analysis. Figure 6 replicates the speed spectrogram analysis in the study by Kemere et al. (2013) for an animal exploring a novel and a familiar environment (Mattias et al., 2015). Figure 7 implements the clustering of theta cycles (Zhang et al., 2019) with Morse wavelets. We have also included notebooks to replicate these example analyses.

Performance and complexity
The calculation of the CWT is computationally intensive and consequently a good method to benchmark performance. Of the software packages listed in Table 1, only MATLAB offered an equivalent solution. It was thus chosen as the reference to compare our implementation against. Figure 8 shows that our implementation results in faster computation times and better memory usage.
It is not entirely clear what accounts for the higher jaggedness in the MATLAB curves of Figure 8. A possible explanation is that the FFT computation is less efficient for an odd-length transform, but the magnitude of the spikes in the curve is surprising given that the FFT backend of MATLAB also uses FFTW. Regardless, we have demonstrated that our implementation is able to achieve lower time and space complexity.

When using the functionality offered by GhostiPy, three primary scenarios arise with regard to the sizes of data involved in the processing: (1) both the input and output data fit into core memory; (2) the input fits into core memory, but the output does not; and (3) neither the input nor the output fits.
In all of the previous examples, we have restricted ourselves to case 1. However, with the ever-increasing sizes of data, the other cases will inevitably be encountered. Case 2 may arise when attempting to generate spectrograms. As the input is a single channel, memory constraints are rarely an issue; for example, even a 10 h local field potential (LFP) recording sampled at 1 kHz and saved as 64 bit floating point values will require less than 300 mebibytes (MiB) of memory. However, the size of a wavelet spectrogram computed from these data is directly proportional to the number of scales/frequencies. For a typical range of 1-350 Hz at 10 voices per octave, this amounts to a space requirement of 85 times that of the input data. Given that this can well exceed the core memory size of a machine, the GhostiPy CWT routine can also accept a preallocated output array that is stored on disk (Fig. 9).
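The same pattern can be sketched without HDF5 by preallocating a NumPy memmap on disk and writing the CWT one scale at a time. This is a simplified stand-in for the Figure 9 workflow, with our own names:

```python
import os
import tempfile
import numpy as np

def cwt_to_disk(x, scales, out_path, w0=6.0):
    """Morlet CWT written scale-by-scale to a disk-backed array, so peak
    RAM holds only one output row at a time."""
    n = len(x)
    out = np.lib.format.open_memmap(
        out_path, mode='w+', dtype=np.complex128, shape=(len(scales), n))
    omega = 2 * np.pi * np.fft.fftfreq(n)
    X = np.fft.fft(x)
    for i, a in enumerate(scales):
        psi = 2 * np.exp(-((a * omega - w0) ** 2) / 2) * (omega > 0)
        out[i] = np.fft.ifft(X * psi)   # written through to disk
    out.flush()
    return out_path

x = np.random.default_rng(0).standard_normal(10_000)
scales = np.geomspace(5, 60, 16)
path = os.path.join(tempfile.mkdtemp(), 'cwt.npy')
cwt_to_disk(x, scales, path)
W = np.load(path, mmap_mode='r')        # read back lazily
```

Any array-like object with NumPy-style assignment (an HDF5 dataset, a Zarr array) could be substituted for the memmap.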
Case 3 may arise when a user wishes to filter many channels of full bandwidth data. One example is a 1 h recording from a 256-channel probe sampled at 30 kHz and stored as a 2 byte signed integer type; this already requires 51 gibibytes. Our strategy is similar to case 2, where an output array is allocated and stored on disk. As for the input, it is read in chunks, and the chunk size can be chosen to lower memory usage, although potentially at a cost to computation time. The code in Figure 10 illustrates an example.
Several points can be made about the scheme in Figure 10. Our method allows for downsampling during the convolution, which can reduce the number of stages in a computational scheme. Given full bandwidth data, a traditional strategy to filter to the theta band would look like the following: (1) apply an antialiasing filter; (2) downsample to obtain LFP; (3) store the LFP to disk; (4) apply a theta-band filter; (5) downsample this output; and (6) save the result.
Using the GhostiPy method, it is not necessary to generate the intermediate LFP. To our knowledge, no other software allows out-of-core filtering and downsampling in a single function call. The result is a simultaneous reduction in time and space complexity, by storing only the downsampled result and by filtering only once. Filtering to the theta band is now simplified to the following steps: (1) apply a theta filter to the full bandwidth data; (2) downsample the result; and (3) save the result to disk.

[Figure 8 caption. CWT implementation performance. a, b, Our implementation of the Morse continuous wavelet transform outperforms MATLAB in both time (a) and space complexity (b). Note that MATLAB was unable to complete execution for the full range of the test parameter (data length) because of out-of-memory exceptions. The test machine was an Intel Core i7-4790 (eight hyperthreads) equipped with 32 gigabytes of RAM. In both cases, the CWT was parallelized over available CPU threads.]
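A single-pass, delay-corrected filter-plus-downsample can be sketched as follows. This is a simplified stand-in for GhostiPy's out-of-core routine, with our own names; only one chunk of output samples (plus filter history) is resident at a time, so `x` and `y` could both be disk-backed arrays:

```python
import numpy as np

def filter_downsample_chunked(x, h, ds, chunk=4096):
    """Delay-corrected FIR filtering fused with downsampling.

    Equivalent to np.convolve(x, h)[K::ds][:len(x)//ds], with K the group
    delay of the (symmetric) filter, but computed chunk by chunk so the
    full-rate filtered signal is never materialized.
    """
    L = len(h)
    K = (L - 1) // 2                     # group delay of a symmetric FIR
    N = len(x)
    n_out = N // ds
    y = np.empty(n_out)
    for m0 in range(0, n_out, chunk):
        m1 = min(m0 + chunk, n_out)
        # Full-convolution indices needed for outputs m0..m1-1:
        jlo = K + m0 * ds - (L - 1)
        jhi = K + (m1 - 1) * ds
        # Gather the required input span, zero-padding past the edges.
        xlo, xhi = max(jlo, 0), min(jhi + 1, N)
        seg = np.zeros(jhi - jlo + 1)
        seg[xlo - jlo:xlo - jlo + (xhi - xlo)] = x[xlo:xhi]
        block = np.convolve(seg, h, mode='valid')
        y[m0:m1] = block[::ds]           # fused downsampling
    return y

# Example: low-pass filter and 25x downsample in one pass.
x = np.random.default_rng(2).standard_normal(10_000)
h = np.hanning(51)
h /= h.sum()
y = filter_downsample_chunked(x, h, ds=25, chunk=97)
```

The deliberately odd chunk size exercises the chunk boundaries; in practice the chunk size would be tuned to the available memory.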

Discussion
We have described the key features of GhostiPy and given examples of its ease of use to perform computations efficiently. Users can thus conduct exploratory spectral analyses quickly across a range of parameters while reducing their concerns for running out of memory, especially since out-of-core computation is supported for many of the methods. Thus, we believe GhostiPy is well suited to handle the ever-increasing size of experimental data.
In the future, we plan to improve GhostiPy with various enhancements. For example, currently the methods are designed to offer the user a lot of low-level control over areas such as multithreading, and to work with raw array types. However, users may desire a higher-level API. For this reason, we believe it would be a worthwhile endeavor to incorporate our work into frameworks such as NWB (Teeters et al., 2015); this would also facilitate more widespread adoption. There are also other analyses we could implement, including the adaptive multitaper method (Percival and Walden, 1993) and other time-frequency reassignment techniques similar to the synchrosqueezing transform (Daubechies et al., 2016).
Our primary contribution is improving the ease and speed at which data analysis can be conducted, by developing user-friendly software implementing efficient algorithms well suited for large data sizes. This point is specifically demonstrated by our ability to outperform existing solutions in space and time complexity and to run computations even in out-of-core conditions, which enables machines with tens of gigabytes of memory to process data on the scale of hundreds of gigabytes and higher. In these ways, we have increased the accessibility of neural data analysis by enabling it to run on hardware such as a laptop computer, a scenario that often was not previously possible.
Finally, the software we developed has a much larger potential impact than the scope described in this article. Although many of the examples given in this article were specific to extracellular rodent hippocampal data, the functionality we implemented is intentionally generic.

[Figure 9 caption. CWT out-of-core. Example code when the output array is too large for main memory. The CWT method is first executed as a dry run to compute the necessary array sizes. Here data are a 1D numpy array, and cwt_data is an HDF5 array created to store the results to disk.]

[Figure 10 caption. Filtering out-of-core. Filtering data from a large array stored on disk and likewise storing the output on disk. Similar to the CWT out-of-core features, the method is called once as a dry run to compute array sizes, which the user can then pass in to store the result. The filtering method also allows correcting for the delay of the filter and downsampling without storing any intermediate results. Although the example uses the h5py library, any object that behaves like an array can be used. Here ds is the downsampling factor, K is the filter group delay, N is the number of samples, infile['chdata'] is a (n_channels, n_samples) array, and outdata is an HDF5 array.]