Closed-loop estimation of retinal network sensitivity reveals signature of efficient coding

According to the theory of efficient coding, sensory systems are adapted to represent natural scenes with high fidelity and at minimal metabolic cost. Testing this hypothesis for sensory structures performing non-linear computations on high dimensional stimuli is still an open challenge. Here we develop a method to characterize the sensitivity of the retinal network to perturbations of a stimulus. Using closed-loop experiments, we explore selectively the space of possible perturbations around a given stimulus. We then show that the response of the retinal population to these small perturbations can be described by a local linear model. Using this model, we computed the sensitivity of the neural response to arbitrary temporal perturbations of the stimulus, and found a peak in the sensitivity as a function of the frequency of the perturbations. Based on a minimal theory of sensory processing, we argue that this peak is set to maximize information transmission. Our approach is relevant to testing the efficient coding hypothesis locally in any context where no reliable encoding model is known.

According to the theory of efficient coding, sensory systems are adapted to represent natural scenes with high fidelity and at minimal metabolic cost. Testing this hypothesis for sensory structures performing non-linear computations on high dimensional stimuli is still an open challenge. Here we develop a method to characterize the sensitivity of the retinal network to perturbations of a stimulus. Using closed-loop experiments, we explore selectively the space of possible perturbations around a given stimulus. We then show that the response of the retinal population to these small perturbations can be described by a local linear model. Using this model, we computed the sensitivity of the neural response to arbitrary temporal perturbations of the stimulus, and found a peak in the sensitivity as a function of the frequency of the perturbations. Based on a minimal theory of sensory processing, we argue that this peak is set to maximize information transmission. Our approach is relevant to testing the efficient coding hypothesis locally in any context where no reliable encoding model is known.
The efficient coding hypothesis [1, 2] postulates that neural encoding of stimuli has adapted to represent natural occuring sensory scenes optimally in the presence of limited resources. This principle can be recast in the language of information theory, and it has been argued that early stages of sensory processing aim to represent information in an efficient form [3][4][5].
How can we test if this hypothesis is valid? When sensory systems can be approximated as linear or quasilinear filters [6][7][8], the efficient coding hypothesis yields quantitative predictions about the shape of the receptive fields [9][10][11][12][13]. Beyond the linear case, predictions have only been drawn in restricted cases, e.g. for onedimension stimuli or single neurons [14][15][16][17][18][19][20]. Testing the efficient coding hypothesis for non-linear systems is more challenging, especially if the stimulus and response are high dimensional. If a non-linear model existed that could accurately predict the network response to complex stimuli, then the sensitivity of the representationits ability to discriminate responses to small perturbations around any given stimulus-could be estimated directly from that model. According to efficient coding theory, the form of this sensitivity should be related to the stimulus statistics [21,22]. However, in the majority of cases, such a comprehensive non-linear model is not available, and there is no method to quantify sensitivity and to assess the efficient coding hypothesis. Even in the retina, previous studies have shown that the response to complex stimuli can be produced by complex circuits involving several non-linearities [23,24], and a general predictive model is lacking. New ways to test the efficiency of the retinal representation of complex scenes are thus needed.
Here we present a novel approach to measure experimentally the sensitivity of a non-linear network, and compare it to the prediction derived from the efficient coding hypothesis. Because any non-linear function can be linearized around a given point, we hypothesized that, even in a sensory network with non-linear responses, one can still define a local linear model that can well describe the network response to small perturbations around a given reference stimulus. This local model should only be valid around the reference stimulus, but it is sufficient to estimate the sensitivity of the network.
We applied this strategy to the retina. We recorded the activity of a large population of retinal ganglion cells stimulated by a randomly moving bar. We characterized the sensitivity of the retinal population to small stimulus changes, by testing perturbations around a reference stimulus, using closed-loop experiments. This allowed us to build a complete model of the population response in that region of the stimulus space, and to precisely quantify the sensitivity of the neural representation. We found that the sensitivity exhibits a peak as a function of frequency, in agreement with the prediction from efficient coding theory. Our approach is general and can be used for any sensory network, and allows for testing the efficient coding theory even for high dimension stimuli and non-linear networks.

Results
Measuring sensitivity using closed-loop experiments. We recorded from a population of 60 ganglion cells in the rat retina using a 252-electrode array while presenting a randomly moving bar (see Fig. 1A and Materials and Methods). Our aim was to measure the sensitivity of the retinal population to perturbations of a pre-defined stimulus. We measured the response to many repetitions of a short (0.9 s) reference stimulus, as well as many small perturbations around it. The reference stimulus was the random trajectory of a white bar on a dark background undergoing Brownian motion with a restoring force (see Materials and Methods). Perturbations were small changes affecting that reference trajectory in its middle portion, between 300 and 630 ms. Perturbations varied both in shape and in amplitude: we used 16 different perturbation shapes (shown in Fig. S1), each presented at different amplitudes (Fig. 1A). The population response was defined as sequences of spikes and silences in 20 ms time bins for each neuron.
To assess the sensitivity of the neural code, we asked how well different perturbations could be discriminated from the reference stimulus based on the population response. We expect the ability to discriminate perturbations to depend on their amplitude: responses to small perturbations should be hardly distinguishable, while large perturbations should elicit easily detectable changes, as can be seen in Fig. 1B. One thus needs to find the range of amplitudes for which discrimination is hard but not impossible. This requires looking for the adequate range of perturbation amplitudes "online," during the time course of the experiment.
In order to adapt the amplitude of perturbations to the sensitivity of responses for each perturbation shape, we implemented closed-loop experiments ( Fig. 2A). At each step, the retina was stimulated with a perturbed stimulus and the population response was recorded. Spikes were detected in real time for each electrode independently by threshold crossing (see Materials and Methods). This coarse characterization of the response is no substitute for spike sorting, but it is fast enough to be implemented in real time between two stimulus presentations, and sufficient to detect changes in the response. This method was used to adaptively select the range of perturbations in real time during the experiment. Proper spike sorting was performed after the experiment using the procedure Experimental setup: we stimulated a rat retina with a moving bar. Retinal ganglion cell (RGC) population responses were recorded extracellularly with a multi-electrode array. Electrode signals were high-pass filtered and spikes were detected by threshold crossing. We computed the discrimination probability of the population response, and adapted the amplitude of the next perturbation. B. Left: the neural responses of 60 sorted RCGs are projected along the axis going through the mean response to reference stimulus and the mean response to a large perturbation. Small dots are individual responses, large dots are means. Middle: mean and standard deviation (in grey) of response projections for different amplitudes of an example perturbation shape. Right: distributions of the projected responses to the reference (blue), and to small (purple) and large (red) perturbations. Discrimination is high when the distribution of the perturbation is well separated from the distribution of the reference. C. Discrimination probability as a function of amplitude A. The discrimination increases as an error function, (1/2)[1 + erf(d /2)], with d = c × A (grey line: fit). Ticks on the x axis show the amplitudes that have been tested during the closed-loop experiment. described in [25,26], and used for all subsequent analyses.
To test whether a perturbation was detectable from the retinal response, we projected the population responses along an axis defined by the difference between the mean response to a large-amplitude perturbation and the mean response to the reference (Fig. 2B). By definition, on average the projected response to a perturbation is larger than to the reference. However, both are noisy and broadly distributed around their mean (see Fig. 2B, right, for example distributions). We define the discrimination probability as the probability that the response to the perturbation is in fact larger than to the reference. Its value is 100% if the responses to the reference and perturbation are perfectly separable, and 50% if their distributions are identical, in which case the classifier does no better than chance. This discrimination probability is equal to the 'area under the curve of the receiveroperating characteristics,' which widely used for measuring the performance of binary discrimination tasks.
During our closed-loop experiment, our purpose was to find the perturbation amplitude with a discrimination of 85%. To this end we computed the discrimination probability online as describe above, and then chose the next perturbation amplitude to be displayed using the 'accelerated stochastic approximation' method [27,28]: when discrimination was above 85%, the amplitude was decreased, otherwise, it was increased (see Materials and Methods).  . 2C shows the discrimination probability as a function of the perturbation amplitude for an example perturbation shape. Discrimination grows linearly with small perturbations, and then saturates to 100% for large ones. This behavior is well approximated by an error function (gray line) parametrized by a single coefficient, which we call sensitivity coefficient and denote by c. This coefficient measures how fast the discrimination probability increases with perturbation amplitude: the higher the sensitivity coefficient, the easier it is to discriminate responses to small perturbations. It can be interpreted as the inverse of the amplitude at which discrimination reaches 76%, and is related to the classical sensitivity index d [29], through d = c × A, where A denotes the perturbation amplitude (see Materials and Methods).
All 16 different perturbation shapes were displayed, and the optimal amplitude was searched for each of them independently. We found a mean sensitivity coefficient of c = 0.0516 µm −1 . However, there were large differences across the different perturbation shapes, with a minimum of c = 0.028 µm −1 and a maximum of c = 0.065 µm −1 .
Sensitivity and Fisher information. Can one predict the sensitivity to any perturbation of the stimulus? The stimulus is the trajectory of a bar and is high dimensional. Generalizing the result of Seung and Sompolinsky [30] to arbitrary dimension, we show that the sensitivity can be expressed as (see Material and Meth-ods): where I is the Fisher information matrix, of the same dimension as the stimulus, and S the perturbation represented as a column vector. Thus, the Fisher information is sufficient to predict the code's sensitivity to any perturbation. Despite the generality of Eq. 1, it should be noted that estimating the Fisher information matrix requires a model of the population response. As already discussed, the non-linearities of the retinal code make the construction of a generic model of responses to arbitrary stimuli a very arduous task, and is still an open problem. However, the Fisher information matrix need only be evaluated locally, around the response to the reference stimulus.
Local model for predicting sensitivity. We introduce a local model to describe the stochastic population response to small perturbations of the reference stimulus. This model will then be used to estimate the Fisher information matrix, and from it the retina's sensitivity to arbitrary perturbations.
The model, schematized in Fig. 3A, assumes that perturbations are small enough that the response can be linearized around the reference stimulus. First, the response to the reference is described by conditionally independent neurons firing with time-dependent rates estimated from the peristimulus time histograms (PSTH). Second, the response to perturbations is modeled as follows: for each neuron and for each 20 ms time bin of the considered response, we use a linear projection of the perturbation onto a temporal filter to modify the spike rates relative to the reference. These temporal filters were inferred from the responses to all the presented perturbations, varying both in shape and amplitude (but small enough to remain within the linear approximation). Details of the model and its inference are given in Materials and Methods.
We first checked the validity of the local model by its ability to predict the PSTH of cells in response to perturbations ( Fig. 3B). To assess model performance, we computed the difference of PSTH between perturbation and reference, and compared it to the model prediction. Fig. 3D shows the correlation coefficient of this PSTH difference between model and data, averaged over all recorded cells for one perturbation shape. To control for noise in the responses, we computed the same quantity for responses generated by the model (black line), which gives an upper bound on the attainable performance given the limited amount of data. Model performance saturates that bound for amplitudes up to 60 µm, indicating that the local model can accurately predict the statistics of responses to perturbations within that range. For larger amplitudes, the linear approximation breaks down, and the local model fails to accurately predict the response. This failure for large amplitudes is expected if the retinal population responds non-linearly to the stimulus. We observed the same behavior for all the perturbation shapes tested. Prediction of the local model for the perturbation is shown in green. D. Performance of the local model at predicting the change in PSTH induced by a perturbation, as measured by Pearson's correlation coefficient between data and model, averaged over cells (green). The data PSTH were calculated by grouping perturbations of the same shape and of increasing amplitudes by groups of 20, and computing the mean firing rate at each time over the 20 perturbations of each group. The model PSTH was calculated by mimicking the same procedure. To control for noise from limited sampling, the same performance was calculated from synthetic data of the same size, where the model is known to be exact (black).
To further test the accuracy of the local model, we estimated its decoding capability, and compared it to the performance of a classical linear decoder [31,32] trained over an ensemble of random bar trajectories (see Materials and Methods). The local model is an encoding model: it predicts the probability of responses given an stimulus. Yet it can be used to create a 'Bayesian decoder' using Bayesian inversion (see Materials and Methods): given a response, what is the most likely stimulus that generated this response under the model? If the local model predicts the retinal response accurately, doing Bayesian inversion of this model should be the best decoding strategy, meaning that other decoders should perform equally or worse. The linear decoder has previously shown high performance in decoding this stimulus [32], and is therefore a good benchmark for comparison. When decoding the bar trajectory, the Bayesian decoder was more precise than the linear decoder, as measured by the variance of the reconstructed stimulus (Fig. 4A). The Bayesian decoder had a smaller error than the linear decoder when decoding perturbations of small amplitudes (Fig. 4B). For larger amplitudes, where the local model is expected to break down, the performance of the Bayesian decoder decreased.
To quantify decoding performance as a function of the stimulus frequency, we estimated the signal-to-noise ratio (SNR) of the decoding signal for small perturbations of various frequencies (see Materials and Methods). The Bayesian decoder had a much higher SNR than the linear decoder at all frequencies ( Fig. 4C; see also Fig. S3A for an example on another reference stimulus), even if both did fairly poorly at high frequencies. This results suggests that inverting the local model might be the best decoding strategy, and therefore confirms that the local model is an accurate description of the retinal response to small enough perturbations around the reference stimulus.
Ideal versus empirical sensitivity. Now that we have validated the local model, we can use it to compute the Fisher information matrix described above. This matrix takes a very simple form within the local linear model, I = F · C R · F T , where F is the matrix containing the model's temporal filters (stacked as row vectors), and C R is the covariance matrix of the entire response to the reference stimulus across neurons and time. This Fisher information is consistent with the sensitivity that we computed earlier. To demonstrate this consistency, we first used the local model to simulate a large number of responses to the different perturbations. When esti- mating the sensitivity of the response with the method describe above, we obtained identical results than if we estimated it directly from the Fisher information using Eq. 1. This confirms that the Fisher information is a compact and reliable way to estimate the sensitivity of a model. We then asked if the Fisher information could predict the sensitivity measured above from the recorded responses. We compared the sensitivity coefficients measured experimentally to the prediction by the Fisher information, for all perturbations (Fig. 5B). We found that the empirical sensitivity could be well predicted by the Fisher information (Pearson correlation: 0.82, p = 10 −8 ). Although the Fisher prediction always overestimates the observed sensitivity, a large fraction of this difference can be attributed to the limited sampling of the responses, as can be shown on simulated data (Fig. S2).

Signature of efficient coding in the sensitivity.
We showed that the Fisher information is a complete description of the sensitivity of the retinal response to perturbations around the reference stimulus. We can now use it to study the sensitivity of the retina to perturbations at different frequencies. Fig. 6A represents the power spectrum of the bar motion. Power is maximum at low frequencies, and quickly decays at large frequencies.
In many classical theories of efficient coding, sensitivity is expected to follow a inverse relationship with the stimulus power [21,22]. However, here we found that the sensitivity is bell shaped, with a peak in frequency around 4Hz (Fig. 6C; see also To interpret this non-monotonic sensitivity, we studied a minimal theory of retinal function (similar to [33]) to test how efficient coding would reflect on the sensitivity of the retinal response. In this theory, the stimulus  6: Signature of efficient coding in the sensitivity A. Spectral density of the stimulus used in experiments, which is monotonically decreasing. B. Simple theory of retinal function: the stimulus is filtered by noisy photoreceptors, whose signal is then filtered by the noisy retinal network. The retinal network filter was optimized to maximize information transfer at constant output power. C. Sensitivity of the recorded retina to perturbations of different frequencies.
Note the non monotonic behavior. D. Same as C, but for the theory of optimal processing. E. Information transmitted by the retina on the perturbations at different amplitudes. F. Same as E, but for the theory.
is first passed though a low-pass filter, then corrupted by an input noise. This first stage describes processing by photoreceptors [34]. The photoreceptor output is then transformed by a transfer function and corrupted by a second external noise, which mimics the subsequent stages of retinal processing leading to ganglion cell activity. Here the output is reduced to a single continuous signal (Fig. 6B, see Materials and Methods for mathe-matical details). Note that this theory is linear: we are not describing the response of the retina to any stimulus, which would be highly non-linear, but rather its linearized response to perturbations around a given stimulus, as in our experimental approach. To apply the efficient coding hypothesis, we assumed that the photoreceptor filter is fixed, and we maximized the transmitted information, measured by Shannon's mutual information, over the transfer function. We constrained the variance of the output to be constant, corresponding to a metabolic constraint on the firing rate of ganglion cells. In this simple and classical setting, this optimal transfer function, and the corresponding sensitivity, can be calculated analytically. Although the power spectrum of the stimulus and photoreceptor output are monotonically decreasing, and the noise spectrum is flat, we found that the optimal sensitivity of the theory is bell shaped (Fig. 6E), in agreement with our experimental findings (Fig. 6C).
One can intuitively understand this bell-shaped sensitivity. On one hand, in the small frequency regime, sensitivity is small to balance out and to share of information across frequencies. This result is classic: when the input noise is negligible, the best coding strategy for maximizing information is to whiten the input signal to obtain a flat output spectrum, which is obtained by having the squared sensitivity be inversely proportional to the stimulus power. On the other hand, at high frequencies, the input noise is too high for the stimulus to be recovered. Allocating sensitivity and output power to those frequencies is therefore a waste of resources, as it is devoted to amplifying noise, and sensitivity should remain low to maximize information. A peak of sensitivity is thus found between the high SNR region, where stimulus dominates noise and whitening is the best strategy, and the low SNR region, where information is lost into the noise and coding resources should be scarce. A result of this optimization is that the information transferred should monotonically decrease with frequency, just as the input power spectrum does (Fig. 6F). We tested if this prediction was verified in the data. We estimated similarly the information rate against frequency in our data, and found that it was also decreasing monotonically (Fig. 6D). The retinal response has therefore organized its sensitivity across frequencies in a manner that is consistent with an optimization of information transmission across the retinal network.

Discussion
We have developed a novel approach to characterize experimentally the sensitivity of a sensory network, and tested if this sensitivity was in agreement with the prediction from efficient coding theory. We used closed-loop experiments [35] to measure the sensitivity of the retinal network to small perturbations around a given stimulus. Using a local model that predicts well the neural responses to these perturbations, we could determine the sensitivity of the population, defined by the Fisher information matrix. When expressed in the frequency domain, we found that the Fisher information matrix has a particular structure, with a non-monotonic sensitivity as a function of frequency. We showed that this bell-shaped sensitivity curve corresponds to a signature of efficient coding of the photoreceptor responses by the retinal network.
Our approach circumvents the need to build a nonlinear model that would accurately predict responses to complex stimuli, by restricting ourselves to only a small neighbourhood of the stimulus space. This simplification allowed us to go beyond cases where the retinal function can be summarized by a linear or quasi-linear (e.g. 'linear-nonlinear' or LN) model. In the retina, efficient coding theory had led to key predictions about the shape of the receptive fields, explaining their spatial extent [4,36], or the details of the overlap between cells of the same type [7,8,37]. However, when stimulated with complex stimuli like a fine-grained image, or irregular temporal dynamics, the retina exhibits a nonlinear behaviour [23]. For this reason, up to now, there was no prediction of the efficient theory for these complex stimuli. Our approach allows us to overcome this barrier and to characterize sensitivity for any arbitrarily complex stimulus ensemble, as long as the experiment duration allows for a reasonable exploration of the perturbation space.
Another possible use of our method would be to test whether the retina efficiently codes for the statistics of the stimulus by adapting to them specifically, or if it remains an efficient code for natural stimuli, and performs sub-optimally for artificial stimuli. In our case, the temporal dynamics of the bar were close to the kind of dynamics encountered in natural scenes [38,39]. As a result, it is difficult to tease apart the two hypotheses. Future works could address this question by using stimulus statistics that would be strong departure form natural ones, and test if the retina adapts to code these novel statistics more efficiently, or remains an efficient encoder of natural statistics.
Recently, more elaborate versions of the efficient coding theory have been suggested, with different constraints on the information maximization. For example, it has been suggested that the retinal network could code efficiently the possible future stimuli, constrained on the information carried about the past stimuli [40]. Our approach could be used to test if this predictive information is optimized in cases where the retinal responses are not amenable to a linear or LN model.
More generally, different versions of the efficient coding theory have been proposed to explain the organisation of several areas of the visual system [8,10,12,13,41] and elsewhere [42][43][44][45]. Our approach is general and could be used in other sensory structures to test the validity of this hypothesis. In our case we found that the local model could be linear in the responses as well as in the stimulus, and still predict the responses to small perturbations. In more complex structures, it is possible that non-linear models in the response are necessary. Yet, the linearity in the stimulus would greatly simplify the problem. Our approach might therefore be useful in general for sensory systems.
The experimental characterization of the sensitivity of a neural network could be useful for other purposes than testing the efficient coding theory. First, the local model is supposed to be a very good model for the responses to small perturbations around a given stimulus, and as such, a Bayesian decoder based on this model should perform close to optimal at decoding the stimulus. This can be used to test how close to optimal different decoding methods are. In our case, we found that linear decoding, despite its very good performance, was quite far from the performance of the Bayesian inversion of our local model. This result implies that there should exist nonlinear decoding methods that outperform linear decoding [46]. Testing the optimality of the decoding method is crucial for brain machine interfaces [47]: in this case an optimal decoder is necessary to avoid missing a significant amount of information. Building our local model is a good strategy for benchmarking different decoding methods.
Finally, the estimation of the sensitivity along several dimensions of the stimulus perturbations allows us to define which changes of the stimulus evoke the strongest change in the sensory network, and which ones should not make a big difference. Similar measures could in principle be performed at the perceptual level, where some pairs of stimuli are perceptually indistinguishable, while others are well discriminated. Comparing the sensitivity of a sensory network to the sensitivity measured at the perceptual level could be a promising way to relate neural activity and perception.

Materials and Methods
Extracellular recording. Experiments were performed on the adult Long Evans rat, in accordance with institutional animal care standards. The retina was extracted from the euthanized animal and maintained in an oxygenated Ames' medium (Sigma-Aldrich). The retina was recorded extracellularly on the ganglion cell side with an array of 252 electrodes spaced by 60 µm (Multichannel Systems), as previously described [48]. Single cells were isolated offline using SpyKING CIRCUS, a custom spike sorting algorithm [26]. We then selected 60 cells that were well separated (no violations of refractory period), had enough spikes (firing rate larger than 0.5 Hz), had a stable firing rate during the whole experiment, and responded consistently to repetitions of a reference stimulus.
Online spike detection. During the experiment we detected spikes in real time on each electrode independently. Each electrode signal was high-pass filtered using a Butterworth filter with a 200 Hz frequency cutoff. A Perturbation direction 15 and 16 were also constrained to be orthogonal. The largest amplitude for each perturbation we presented was 115 µm. An exception was made for perturbations 15 and 16 applied to the second reference trajectory, as for this amplitude they had a discrimination probability below 70%. They were thus increased by a factor 1.5. The largest amplitude for each perturbation was repeated at least 93 times, with the exception of perturbation 15 (32 times) and 16 (40 times) on the second reference trajectory. spike was detected if the electrode potential U was lower than a threshold of 5 times the median absolute deviation of the voltage [26]. Stimulus. The stimulus was a movie of a white bar on a dark background projected at 50 Hz with a digital micromirror device. The bar had intensity 7.6 10 11 photons.cm −2 .s −1 , and 115 µm width. The bar trajectory consisted in 17034 snippets of 0.9 s consiting in 6431 random trajectories, 2 reference trajectories repeated 391 times each and perturbations of these reference trajectories. All results are shown for the first reference stimulus, except Fig. 5B which combines both. The second reference stimulus served to check the robustness of our results to the choice of reference. Results specific to the second reference stimulus are reported in Fig. S3. Continuity between snippets was ensured by constraining all snippets to start and end in the middle of the screen with velocity 0. Random trajectories followed the statistics of an overdamped stochastic oscillator [32]. We used a Metropolis-Hastings algorithm to generate random trajectories satisfying the boundary conditions. The two reference trajectories were drawn from that ensemble.
A perturbation is denoted by its discretized time series with time step δt = 20 ms, S = (S 1 , . . . , S L ), with L = 16, over the 320 ms of the perturbation. Perturbations can be decomposed as S = A × P , where t is the amplitude, and P = S/A the shape. Perturbations shapes were chosen to have zero value and derivative at their boundaries. They are represented in Fig. S1.
Linear discrimination. The response R of the N = 60 cells over time is binarized into B time bins of size δ = 20 ms: R ib = 1 if cell i spiked during the bth time bin, and 0 otherwise. R is thus a vector of size N × B, labeled by a joint index ib. The response is considered from the start of the perturbation until 280 ms afters its end, so that B = 30.
To discriminate perturbations, we first measured the responses R ref to multiple repetitions of the reference stimulus, and the responses R Smax to multiple repetitions of the largest amplitude of each perturbation shape (typically 110 µm). We computed the mean response to the reference, R ref , and to the largest-amplitude perturbation, R Smax , and projected all responses onto their difference: we denote the projection of a response to the reference, and x S = ( R Smax − R ref ) T · R S the projection of a response to S (when projecting, we recalculated the mean responses by excluding the response to project, to avoid overfitting). The distributions of x ref and x S are shown in blue and purple in Fig. 2B. The probability of discrimination D is defined as the probability that Adaptation of perturbation amplitude. To identify the range of perturbations that were neither too easy nor too hard to discriminate, we adapted perturbation amplitudes so that the discrimination probability converged to target value D * = 85%. For each shape, perturbation amplitudes were adaptated using the Accelerated Stochastic Approximation [27]. If an amplitude A n triggered a response with discrimination probability D n , then at the next step the perturbation was presented at amplitude A n+1 with where C = 0.74, and r n is the number of reversal steps in the experiment, i.e. the number of times when a discrimination D n larger than D * was followed by D n+1 smaller than D * , or vice versa. We also presented amplitudes regularly spaced on a log-scale. We presented the largest amplitude A max (value in caption of Fig. S1), and scaled it down by multiples of 1.4, A max /1.4 k with k = 1, . . . , 7. Discrimination and sensitivity. We assume that the difference of the projections, ∆x = x S − x ref , is approximately distributed as a Gaussian variable, by virtue of the central limit theorem. For small perturbations, its mean can be assumed to grow linearly with the perturbation amplitude, µ = αA, and its variance 2σ 2 = Var(x S ) + Var(x ref ) to be independent of A. Then the probability of discrimination is given by the error function: where d = µ/σ = c × A is the standard sensitivity index [29], and c = α/σ is defined as the sensitivity coefficient, which depends on the perturbation shape P . This coefficient determines the amplitude A = c −1 at which discrimination is equal to (1/2)[1 + erf(1/2)] = 76%. Optimal sensitivity and Fisher information. Given the distributions of responses to the reference stimulus, P (R|ref), and to a perturbation, P (R|S), optimal discrimination can be achieved by studying the sign of the log-ratio L = ln[P (R|S)/P (R|ref)]. Let us call L ref the value of L upon presentation of the reference stimulus, and L S its value upon presentation of S. The probability of successful discrimination is the probability that L S > L ref . Using the central limit theorem we assume again that L S and L ref are Gaussian. We can calculate their mean and variance at small S: µ = L S − L ref = S T · I · S and 2σ 2 = Var(L S ) + Var(L ref ) = 2S T · I · S, where is the Fisher information matrix calculated at the reference stimulus. The discrimination probability is: This result proves Eq. 1 with S = A × P . Local model. We introduce the local model as a linear expansion of the logarithm of response distribution as a function of both stimulus and response: The matrix F contains the linear filters with which the change in the response is calculated from the linear projection of the past stimulus. For ease of notation, hereafter we use matrix multiplications rather than explicit sums over ib and t.
The distribution of responses to the reference trajectory is assumed to be conditionally independent: Since the variables R ib are binary, their mean value R ib upon presentation of the reference completely specifies P (R|ref). They are directly evaluated from the responses to repetitions of the reference stimulus, with a small pseudo-count to avoid zero values.
To infer the filters F ib,t , we only include perturbations that are small enough to remain within the linear approximation. We first separated the dataset into a training (285 × 16 perturbations) and testing (20 × 16 perturbations) sets. We then defined, for each perturbation shape, a maximum perturbation amplitude above which the linear approximation was considered no longer valid. We selected this threshold by optimizing the model's ability to predict the changes in firing rates in the testing set. Model learning was performed for each cell independently by maximum likelihood with an L 2 smoothness regularization on the shape of the filters, using a pseudo-Newton algorithm. The amplitude threshold obtained from the optimization varied widely across perturbation shapes. The number of perturbations for each shape used in the inference ranged from 20 (7% of the total) to 260 (91% of the total). Overall only 32% of the perturbations were kept. Overfitting was limited: when tested on perturbations of similar amplitudes, the prediction performance on the testing set was never lower than 15% of the performance on the training set.
Evaluating the Fisher information matrix, Eq. (4), within the local model, Eq. 6, gives: where C R is the covariance matrix of R, which is diagonal because of conditional independence. Link with linear discrimination. There exists a mathematical relation between the Fisher information of Eq. 8 and linear discrimination. The linear discrimination task described earlier can be generalized by projecting the response difference, R S − R ref , along an arbitrary direction u: ∆x is again assumed to be Gaussian by virtue of the central limit theorem. We further assume that perturbations S are small, so that R S − R ref ≈ (∂ R S /∂S) · S, and that C R does not depend on S. Calculating the mean and variance of ∆x under these assumption gives an explicit expression of d in Eq. 3: Maximizing this expression of d over the direction of projection u yields u = const × C −1 R · (∂ R S /∂S) · S and where I L = (∂ R S /∂S) T · C −1 R · (∂ R S /∂S) is the linear Fisher information [49,50]. This expression of the sensitivity corresponds to the best possible discrimination based on a linear projection of the response.
Within the local linear model defined above, one has ∂ R S /∂S = F · C R , and I L = F · C R · F T , which is also equal to the true Fisher information (Eq. 8): I = I L . Thus, if the local model (Eq. 6) is correct, discrimination by linear projection of the response is optimal and saturates the bound given by the Fisher information.
Note that the optimal direction of projection only differs from the direction we used in the experiments, u = R S − R ref , by an equalization factor C −1 R . We have checked that applying that factor only improves discrimination by a few percents (data not shown).
Linear decoder. We infer the linear decoder filters by minimizing the mean square error [31] in the reconstruction of 4000 random trajectories governed by the dynamics of an overdamped oscillator with noise. To allow for a meaningful comparison with the local model, the linear filters integrate the past of the stimulus for τ = 15 × δt = 300 ms. Tested on a sequence of ∼ 400 repetitions of one of the two reference trajectories, where the first 300 ms of each have been cut out, we obtain a correlation coefficient of 0.87 among the stimulus and its reconstruction.
Bayesian decoder. We use Bayes' rule to infer the presented stimulus given the response: P (S|R) = P (R|S)P (S) P (R) (12) where P (R|S) is given by the local model (Eq. 6), P (S) is the prior distribution over the stimulus, and P (R) is a normalization factor that does not depend on the stimulus. P (S) is taken to be the distribution of trajectories from an overdamped stochastic oscillator with the same parameters as in the experiment. The stimulus is inferred by maximizing the posterior P (S|R) numerically, using a pseudo-Newton iterative algorithm. Local signal to noise ratio in decoding. For any decoder, the inferred value of the stimulusŜ can be written as:Ŝ where T is a transfer matrix which differs from the identity matrix when decoding is imperfect, b is a systematic bias, is a Gaussian noise of covariance C . We inferred the values of b and C from the ∼ 400 reconstructions of the reference stimulation using either of the two decoders, and the values of T from the reconstructions of the perturbed trajectories. The inference is done by an iterative algorithm similar to that used for the inference of the filters F of the local model. The signal-to-noise ratio (SNR) in decoding the perturbation S is then defined as: In Fig. 4C, to compute SNR for a frequency ν, we use Eq. 14 with S t = A exp(2πiνtδt), where A is the amplitude of the perturbation shown in Fig. 4A. Frequency dependence of sensitivity and information. To analyze the behavior in frequency of the sensitivity, we compute the sensitivity index for an oscillating perturbation of unitary amplitude. We apply Eq. 1 withŜ t (ν) ≡ exp(2πiνtδt). In order to estimate the spectrum of the information rate we compute its behavior within the linear theory [33]: MI(ν) = 1 2 log 1 + C S (ν)I(ν)/δt 2 where C S (ν) is the power spectrum of stimulus, and I(ν) = (δt/L)Ŝ T (ν) · I ·Ŝ(ν). Efficient coding theory. To build a theory of retinal sensitivity, we follow closely the approach of Van Hateren [33]. The stimulus is first linearly convolved with a filter f , of power F, then corrupted by an input white noise with uniform power H, then convolved with the linear filter r of the retina network of power G, and finally corrupted again by an external white noise Γ. The output power spectrum O(ν) can be expressed as a function of frequency ν: O(ν) = (δtL)G(ν)[(δtL)F(ν)C S (ν) + H] + Γ (16) where C S (ν) is the power spectrum of the input. The information capacity of such a noisy input-output channel is limited by the allowed total output power V = ν O(ν), which can be interpreted as a constraint on the metabolic cost. The efficient coding hypothesis consists in finding the input-output relationship g * , of power G * (ν), that maximizes the information transmission under a constraint on the total power of the output. The optimal Fisher information matrix can be computed in the frequency domain as: The photoreceptor filter [51] was taken to be exponentially decaying in time, f = τ −1 exp(−t/τ ) (for t ≥ 0), with τ = 100 ms. The curve I(ν) only depends on H, Γ and V through two independent parameters. For the plots in Fig. 6 we chose: H = 3.38 µm 2 s, Γ = 0.02 spikes 2 s and V = 307 spikes 2 s, δt = 20 ms, and L = 2, 500. In Fig. 6D, we plot the sensitivity to oscillating perturbation with fixed frequency ν, which results in I(ν)L/δt. In Fig. 6E we plot the spectral density of the transferred information rate: MI(ν) = 1 2 log 1 + (δtL) 2 G(ν)F(ν)C S (ν) Γ + (δtL)G(ν)H .

Ackownledgements.
We thank Stéphane Deny for his help with the experiments, and Jean-Pierre Nadal for stimulating discussions and crucial suggestions. This work was supported by ANR TRAJEC-TORY, ANR OPTIMA, the French State program Investissements d'Avenir managed by the Agence Nationale de la Recherche [LIFESENSES: ANR-10-LABX-65], European Comission grant from the Human Brain Project n. FP7-604102, and National Institutes of Health grant n. U01NS090501.