Learning Probabilistic Inference through Spike-Timing-Dependent Plasticity

Abstract Numerous experimental data show that the brain is able to extract information from complex, uncertain, and often ambiguous experiences. Furthermore, it can use such learnt information for decision making through probabilistic inference. Several models have been proposed that aim at explaining how probabilistic inference could be performed by networks of neurons in the brain. We propose here a model that can also explain how such neural network could acquire the necessary information for that from examples. We show that spike-timing-dependent plasticity in combination with intrinsic plasticity generates in ensembles of pyramidal cells with lateral inhibition a fundamental building block for that: probabilistic associations between neurons that represent through their firing current values of random variables. Furthermore, by combining such adaptive network motifs in a recursive manner the resulting network is enabled to extract statistical information from complex input streams, and to build an internal model for the distribution p* that generates the examples it receives. This holds even if p* contains higher-order moments. The analysis of this learning process is supported by a rigorous theoretical foundation. Furthermore, we show that the network can use the learnt internal model immediately for prediction, decision making, and other types of probabilistic inference.


Introduction
A large number of experimental data from neuroscience and cognitive science suggest that the brain performs probabilistic inference for a large number of probability distributions p ‫ء‬ that describe different aspects of the environment of the organism and its interaction with the environment (Griffiths and Tenenbaum, 2006;Bonawitz et al., 2014). In models for probabilistic inference in networks of neurons one has so far (with the exception of Siegelmann and Holzman, 2010; see Discussion) programmed the underlying distribution p ‫ء‬ into the architecture and parameters of the network model. However brains have to learn internal models p of salient distributions p ‫ء‬ in their environment from examples that are generated by p ‫ء‬ . We present in this article the first model that explains how networks of spiking neurons could learn such internal models p through experimentally supported local plasticity rules. Furthermore, we show how a network of spiking neurons can store the learnt information in a way that makes it readily accessible for probabilistic inference.
Two different approaches how networks of neurons in the brain could execute probabilistic inference have been proposed. One approach emulates an arithmetical (deterministic) algorithmic method for performing probabilistic inference through a suitable distributed organization of sums and products of probabilities (referred to as belief propagation or message passing; Steimer et al., 2009). The other approach for emulating probabilistic inference in networks of spiking neurons is based on the assumption that a network of neurons can "embody" a distribution p in such a way that it can generate samples from p. Probabilistic inference for p can then be performed through simple operations on these samples, for example the computation of a posterior marginal just requires to look at the distribution of the random variable of interest in these samples. This approach is known in computer science as Markov chain Monte Carlo (MCMC) sampling, and is widely used to perform probabilistic inference also for complex distributions p for which belief propagation approaches have no guarantee to provide correct answers. Whereas the previous approach prefers a deterministic network, where every stochasticity is detrimental for the performance, this second approach requires an inherently stochastic network of spiking neurons (Buesing et al., 2011). It has been argued that the dynamics of networks of neurons in the brain is in fact highly stochastic, both on the basis of inherently stochastic features of its components (such as stochastic synaptic release), and on the basis of trial-to-trial variability in neural recordings and observed variability in behavioral outputs.
The learning model that is presented in this article ties in to this second approach, and shows that stochastic networks of neurons are able to automatically absorb the relevant statistical information from examples that they receive. As a result, we have now one first complete theory for the emergence of probabilistic inference in networks of spiking neurons through learning. We focus in this article on somewhat idealized models for spiking neurons and plasticity rules, which allow us to give rigorous proofs for the convergence of learning.
We first show how an extension of an ubiquitous network motif of cortical microcircuits, interconnected populations of pyramidal cells with lateral inhibition (Winner-Take-All (WTA) circuits; Douglas and Martin, 2004;Nessler et al., 2013), gives rise to the basic building block for absorbing probabilistic information from examples. The output neurons of an array of WTA-circuits form the hidden layer of a three-layer feedforward learning module, to which we refer as a "stochastic association module. " We show that such a module can learn through spiketiming-dependent plasticity (STDP) and plasticity of the excitability of neurons (intrinsic plasticity) stochastic associations between the random variables that are encoded by the firing of neurons on its first and third layers. The module can learn this probabilistic information from the statistics of activation and coactivation of these neurons when the network processes examples that are provided by its environment. An important second finding is that recursive combinations of this network module can learn even complex probabilistic relationships between large numbers of random variables. This network learning capability is in fact universal in the sense that the underlying theory implies a proof of principle that an approximation to any distribution p ‫ء‬ over discrete random variables can be learnt by exposing the network to examples drawn from p ‫ء‬ . In fact, one can show that if the network is too small or has too few connections for learning a close approximation of p ‫ء‬ , it will still strive toward approximating p ‫ء‬ as well as it can, given its limited resources. The understanding of this learning process is supported by the theory of Expectation Maximization (EM).

Results
Previous models for probabilistic inference in networks of spiking neurons have shown that one can program the parameters (eg, conditioned probability tables) of a given distribution p ‫ء‬ over discrete random variables into a network of idealized models for spiking neurons, provided that the network has a suitable architecture. We provide in this article a proof of principle that these parameters of p ‫ء‬ do not have to be programmed into the network: they can be learnt by a network ᏺ of spiking neurons via simple local plasticity rules from examplesỹ that are generated by p ‫ء‬ . This does not hold for every neural network ᏺ, but like in any existing model for probabilistic inference in neural networks, only under suitable assumptions about the architecture of ᏺ. This result provides a proof of principle that networks of neurons in the brain can not only perform probabilistic inference for distributions p ‫ء‬ whose parameters are specified in the genetic code, but also for distributions p ‫ء‬ that an organism encounters in its environment.
The underlying theory of EM does not guarantee that p ‫ء‬ can be learnt perfectly. However it implies that a network ᏺ with a suitable architecture is expected to make progress in creating an increasingly more accurate internal model p for p ‫ء‬ when it receives more and more examples that are generated by p ‫ء‬ . EM does not guarantee that the internal model p converges to p ‫ء‬ , but it implies that the network learning process cannot "run around in circles" where p moves forth and back between better and worse approximations of p ‫ء‬ . This learning result is general insofar as it shows that internal models p can be learnt by a network ᏺ with a suitable architecture for external distributions p ‫ء‬ over any number of discrete random variables, with arbitrary, also higher-order, dependencies among these random variables. However, although the architecture of ᏺ will obviously have to depend on the number of random variables of p ‫ء‬ , we show that it suffices to assume that it consists of recursive interconnections of different copies of a simple generic network motif, to which we refer as a stochastic association module. This network motif is a three-layer feedforward network of excitatory spiking neurons with lateral inhibition on the hidden layer (see Fig. 2). We show that this simple microcircuit motif can be viewed as an atomic learning module, that extracts via STDP and intrinsic plasticity from examples probabilistic associations between input variables x and output variable z that are encoded through population coding on its input and output layer. An example is presented in Figure 3. We will then show in the following subsection that this atomic learning module can be recursively combined to form a network that automatically approximates through STDP and intrinsic plasticity arbitrarily complex distributions p ‫ء‬ over many discrete random variables from examples generated by p ‫ء‬ . In other words, this network learns an internal probabilistic model p for its stochastic environment p ‫ء‬ . Furthermore, this network has the property that it can readily apply this internal model by carrying out probabilistic inference for p through its inherent stochastic dynamics. Examples are presented in Figures 6-9.
The neurons in our models are stochastic integrateand-fire neurons, which have been shown to match biological data quite well (Jolivet et al., 2006;Mensi et al., 2011;. We assume that a neuron has at any time t the instantaneous firing probability density ͑t͒ ϭ 1 exp ͑u͑t͒͒, where u(t) is its membrane potential and is a time constant. When it fires a spike, the neuron enters an absolute refractory period of duration after which it resumes its stochastic firing. The membrane potential u͑t͒ ϭ ͚ i w i ⑀ i ͑t͒ ϩ b is assumed to be the sum of the PSPs ⑀ i ͑t͒ elicited by the spikes from its presynaptic neurons, where w i is the synaptic efficacy of the i-th synapse (and b is the bias of the neuron). The theoretically best tractable shape of a PSP ⑀ i ͑t͒ would be a step function of length . However, we show in Examples 1 and 2 that the relevant learning properties also hold for ␣-shaped EPSPs that are commonly considered in theoretical neuroscience. On the side, we would like to point out that for biological neurons the EPSPs vary from shapes with a pronounced initial peak to shapes with smooth hills in dependence of the distance of the synapse to the soma (Williams and Stuart, 2002), and obtain yet other shapes if amplified through NMDA or Ca spikes (Larkum et al., 2009).
We use a simple STDP rule, which has the advantage of being theoretically tractable. Let w be the weight of the synapse at the connection from some presynaptic neuron pre to a postsynaptic neuron post . At each postsynaptic spike of neuron post at time t this weight undergoes an update: w ¢ w ϩ ⌬w, where is the learning rate and

Ϫ1,
if pre did not fire in [t Ϫ , t] . (1) The parameter wis a baseline parameter in the learning rule, and is a parameter that corresponds to the duration of postsynaptic potentials (PSPs). Figure 1A shows the resulting STDP curve. The rule exhibits LTP only for pre-beforepost spiking within a time window of duration , otherwise it exhibits LTD. The causal part of the STDP window curve has the same shape as the PSP kernel, which is similar to other theoretically derived plasticity rules (Toyoizumi et al., 2005;Pfister et al., 2006). The properties of this plasticity rule were studied by Nessler et al. (2013). It was shown there that it supports learning of an internal probabilistic model of the inputs in a WTA network. It was also shown there that the weight dependence of ⌬w in Equation 1 fits quite well to experimental data. Figure 1B shows that the shape of the The red curve represents STDP for the simple rule that is used in the theoretical derivations. In computer simulations, we used also an STDP rule shown with the blue curve, that has a smoother, more biologically realistic shape. B, The change of the synaptic efficacy after a stimulation protocol where both the presynaptic and postsynaptic neuron fire at a frequency of 20 Hz, for different time differences ⌬t between a postsynaptic and presynaptic spike. The STDP curve shifts more toward LTP, and depression is no longer time independent due to overlapping PSPs (Nessler et al., 2013, their Fig. 4). This STDP curve is quite similar to experimental data (Sjöström et al., 2001).
STDP curve according to Equation 1 looks like the commonly considered one when applying an intermediate pairing frequency of 20 Hz (Sjöström et al., 2001 shows experimental data on the dependence of the shape of the STDP curve on the pairing frequency).
It is well known that the excitability of neurons changes in dependence of their history of firing activity (Daoudal and Debanne, 2003;Cudmore and Turrigiano, 2004). We model this intrinsic plasticity of neurons through a simple rule according to which the bias b of a neuron changes at each spike of the neuron according to b ¢ b ϩ Ј ⌬b, with: ⌬b ϭ e Ϫ(bϩb Ϫ ) , where Ј is a learning rate and bis a baseline parameter. In addition, we assume that the bias exhibits constant decay according to the differential equation: (3)

A network module for learning stochastic associations
The atomic learning module in our model is a simple microcircuit motif that learns associations between some array of random variables x ϭ ͑x 1 , . . . , x I ͒ and another random variable z from examples ͗x, z͘ that are presented to the network. The variables x and z could for example represent different higher-level features of an image. Or the variables x could represent higher level features of some visual stimulus, and the variable z a feature of a simultaneously occurring auditory stimulus. More formally, we assume that the network is exposed to examples ͗x, z͘ consisting of concrete assignments of discrete values to the variables x and z, that are drawn from some unknown distribution p ‫ء‬ (x, z). We want to determine under what conditions a network module is able to create autonomously from exposure to these examples an internal model p(x, z) for p ‫ء‬ (x, z), that approximates p ‫ء‬ when the number of examples grows. Note that in general the same input x will occur in combination with different values z(1), z(2), . . . of z in the training examples, and the goal of learning is to learn for each value z(i) the probability that it occurs for input x. Hence, the learning performance will not be evaluated by counting errors, ie, deviations from a target output value. Rather, it will be evaluated by how well the network reproduces for any input value x the distribution of output values z. We show that a three-layer network of spiking neurons with the architecture shown in Figure 2 can accomplish this learning task through STDP on synaptic connections from the first to the second layer and intrinsic plasticity of excitatory "hidden" neurons ␣ on the second layer. The weights of synaptic connections between the second and third layers are assumed to be fixed. These weights are assumed to have a large value, so that the firing of a neuron ␣ on the second layer causes with very high probability the firing of the neuron on layer 3 to which it is connected.
We assume that each of the random variables x and z is represented by a population of neurons ("population coding"), with each value of the variable encoded by a separate neuron in the population, as indicated at the top of Figure 2 for the variables x i , and at the bottom for the variable z. The firing of a particular neuron im in the population coding of variable x i encodes the fact that x i assumes the value m in the currently presented example. Similarly the firing of neuron l in the population for the variable z encodes the value l of this variable (Fig. 2).

Figure 2.
Structure of a stochastic association module that is able to learn probabilistic associations between multinomial variables x ϭ ͑x 1 , . . . , x I ͒ and z through STDP. Populations of neurons i (i ϭ 1, . . . , I) on the first layer encode the values of input variables x i . The population of neurons on the third layer encodes the value of z. The hidden layer consists of populations of excitatory neurons ␣ l (I ϭ 1, . . . , L) that are subject to lateral inhibition. STDP applied to the weights w im, j l of synaptic connections from the first layer to the neurons ␣ on the hidden layer enables the network to approximate for any network input x through the firing probability of neurons on the third layer the distribution of values z that were associated with x in previously processed examples ͗x, z͘.
Finally, we assume that an example ͗x, z͘ is presented to this three-layer network during learning in the following manner: Those neurons in the first layer that represent the given values of the variables x are made to fire at a high rate, whereas the other neurons in the first layer are inhibited and kept silent. In addition, if variable z has value l in this example, all hidden neurons outside the corresponding subpopulation ␣ l are inhibited, so that they cannot fire (actually, it would suffice to block STDP for these neurons). An alternative view is that only a selected subset of neurons is disinhibited. New experimental data (for review, see Caroni, 2015) suggest that in fact inhibition of synaptic plasticity (especially via Somatostatinpositive neurons) and disinhibition via VIP interneurons (Letzkus et al., 2015) play an important role in the control of plasticity in cortical microcircuits.
When the network does not receive examples, the neurons in this network module fire according to the stochastic dynamics of the model, and no plasticity is assumed to occur. We present in Materials and Methods a rigorous proof that after learning the distribution of output values z for a given network input x approximates in this stochastic association module the conditional distribution p ‫ء‬ ͑z Խ x͒ of the joint distribution p ‫ء‬ ͑x, z͒ from which the values of z and x are drawn in the training examples. In fact, one can view this module from a theoretical perspective as an implicit generative model p͑x, z; ͒ for the examples ͗x, z͘, and one can prove that the network module performs a stochastic search (stochastic online Expectation Maximization) that strives to minimize the Kullback-Leibler divergence D KL ͑p ‫ء‬ ͑x, z͒p͑x, z; ͒͒ between the external distribution p ‫ء‬ from which the examples are drawn and its internal model p͑x, z; ͒ (see Materials and Methods, Theorem 1 and Theorem 1 ‫ء‬ ). The implicit generative model of the module is encoded in the synaptic weights and biases of the ␣ neurons.
Because the learning module represents the full joint distribution p͑x, z; ͒, not just the conditional distribution p͑z Խ x; ͒, it is the joint distribution that is considered to be the internal model of the learning module. This is more than just representing p͑z Խ x; ͒ as the module also represents the distribution p͑x; ͒. The distribution p͑x; ͒ is represented in a sense that all probability values p͑x; ͒ for each value of x can be calculated from the synaptic weights and biases of the ␣ neurons. This can be done by first calculating p͑x, z; ͒ for each value of z based on the probabilistic model (see Materials and Methods), and then marginalizing out z. Another reason why p͑x, z; ͒ is considered as internal model is that the learning rules are based on minimizing the Kullback-Leibler divergence between the internally represented joint distribution p͑x, z; ͒ and the target distribution p ‫ء‬ (x, z) of the examples. In other words, the module implements generative model learning. The conditional distribution becomes important after learning, when the module performs its function realized through the firing of the output neurons that approximates p ‫ء‬ ͑z Խ x). This functional property of the learning module enables composing networks of modules that can learn larger distributions, as described in the section "Recursive combinations of the basic learning module enable efficient learning of complex distributions from examples." One may wonder why a two-layer network would not suffice for learning such stochastic associations between random variables x and z. The simplest approach would be a model without hidden neurons, where the strengths of the synaptic connections between the neurons in the population codes for x and z encode the probability that a vector x is encountered in conjunction with a particular value of z. But this approach would restrict very much the types of internal models p(x, z) that the network could learn. In particular, it could not handle a situation where the distribution p ‫ء‬ (x, z ϭ l) is multimodal, ie, when there are multiple modes in the distribution of x that are likely to occur in conjunction with a specific value l of z. For example in Figure 3B for z ϭ 2, the distribution p ‫ء‬ (x, z ϭ 2) has two modes, ie, x 1 ϭ 1 can occur in combination with x 2 ϭ 2, and x 1 ϭ 2 in combination with x 2 ϭ 1 (whereas the assignments where x 1 ϭ x 2 do not occur). The reason for this restriction to unimodal distributions is that the neuron l that represents z ϭ l in the population code for z would have to represent through the implicit generative model that is defined by the weights of afferent synapses and its excitability the marginal distribution p ‫ء‬ (x, z ϭ l). However, a single linear neuron can only represent one mode of a probability distribution of x. However if one considers more complex neuron models with nonlinear dendritic processing, they can in principle also represent multimodal distributions (Pecevski et al., 2011, their Figs. 4 and 5;Legenstein and Maass, 2011). Hence, with such more complex neuron models a more shallow learning network could potentially be used as a learning module in our architecture.
The three-layer circuit in Figure 2 can be viewed as a minimal model for allowing multimodal distributions of x to be associated with a value of z. In fact, if one allows sufficiently many hidden neurons ␣, this representation becomes arbitrarily precise. These hidden neurons ␣ represent combinations of features represented through the neurons i that encode the variables x. This mixed coding is reminiscent of experimental data on neurons in the cortex (Rigotti et al., 2013;Mante et al., 2013).
We exploit here a generic property of STDP in WTA circuits, that was made explicit by Nessler et al. (2013) and Habenschuss et al. (2013b): if the neurons in the populations for the variables x i are synaptically connected to a set of neurons ␣ in a WTA circuit, and these synaptic connections are subject to STDP, then the neurons ␣ learn automatically a multimodal internal model for the distribution of the variables x. The learned probabilistic model is a mixture of multinomials. More precisely, each WTA neuron ␣ specializes to fire in response to input patterns from one mode of p ‫ء‬ (x). This specialization is produced by the plasticity rules (Eqs. 1, 2), which, when a neuron fires in response to some input pattern, adapt the weights and biases of the neuron so that in the future it fires with higher probability in response to the same pattern. At the same time, the competition enforced by the lateral inhibition between the ␣ neurons tends to prevent that multiple WTA neurons specialize on the same mode of p ‫ء‬ (x). This emergent property of STDP in WTA circuits was considered by Nessler et al. (2013) andHabenschuss et al. (2013b) in a setting where no association of x with other variables z needed to be learnt. In order to learn associations with z, we apply this mechanism in parallel for every possible value of z. In particular, we assume that in the population ␣ of WTA neurons there are disjoint subpopulations ␣ l for each possible value l of z. The subpopulation ␣ l projects to the l neuron with strong synaptic weights so that a spike of a neuron in ␣ l causes also the neuron l to fire (Fig. 2). As the WTA subcircuit ␣ l is allowed to fire only for examples from p ‫ء‬ ͑x Խ z ϭ l) it learns to approximate this distribution.
Intrinsic plasticity of the excitability of the hidden neurons ␣ l is also essential for successful learning. As they are not firing during a presentation of example ͗x, z͘ with z l, the only adaptation in them for such example is a decay of their intrinsic excitabilities (Eq. 3). This supports learning of a representation of the marginal probability p ‫ء‬ (z ϭ l) in the biases of the neurons in ␣ l . Hence, the population ␣ l learns a probabilistic model p͑x, z ϭ l; ͒ of the target probability distribution p ‫ء‬ (x, z ϭ l) ϭ p ‫ء‬ ͑x Խ z ϭ l) p ‫ء‬ (z ϭ l). In this way all populations ␣ l together learn a generative model p͑x, z; ͒ for the joint distribution p ‫ء‬ ͑x, z͒ of the presented examples.
Further details can be found in Materials and Methods, section "Theoretical properties of the basic learning module (stochastic association module) and its plasticity."

Example 1: the learning module extracts complex stochastic associations from examples
We illustrate the inner workings of the learning module in an example where the task is to learn an internal model of an example target distribution p ‫ء‬ (x 1 , x 2 , z) over binary RVs. This learning task is nontrivial since the distributions of values of x that are stochastically associated with the values z ϭ 2 and z ϭ 1 according to p ‫ء‬ are multimodal (Fig. 3B, gray bars). After learning, the output neurons of the module should fire for input x according to the conditional probability p ‫ء‬ ͑z Խ x͒. The structure of the network module is depicted in Figure 3A. It has two hidden neurons in the populations ␣ 1 and ␣ 2 , which learn the two modes of p ‫ء‬ (x, z ϭ 1) and the two modes of p ‫ء‬ (x, z ϭ 2), respectively (Fig. 3B). The learning period of the module lasted 1200 s of simulated biological time. During learning, examples from p ‫ء‬ (x, z) were presented to the module for 100 ms each. After learning, each WTA subcircuit ␣ l had in fact acquired an approximation of the distribution p ‫ء‬ (x, z ϭ l), as can be seen in Figure 3B. The learning of the internal model in the WTA subcircuit ␣ 2 is achieved through a process where each hidden neuron specializes to represent one of the two modes of p ‫ء‬ (x, z ϭ 2) that are shown in Figure 3B. For the subcircuit ␣ 1 the results are similar (not shown). The learning of an approximation to p ‫ء‬ (x, z) as an internal model automatically produces an approximation of the conditional p ‫ء‬ ͑z Խ x͒ by the firing probabilities of the output neurons (Fig. 3D). A typical resulting firing pattern is shown in Figure 3F.

Recursive combinations of the basic learning module enable efficient learning of complex distributions from examples
The stochastic association module shown in Figures 2 and 3 is self-consistent in the sense that the input variables x i are encoded through population coding in the same way as the output variable z. Hence, one can recursively combine these modules so that the output population of one module becomes part of the input population of another module (Fig. 4). The resulting more complex network is then not only able to learn a single probabilistic association between RVs, but many such associations simultaneously. The basic learning modules form here not only chain connections, but typically also cycles, where the RV that is the output of the second module is simultaneously an input to the first module in the chain (like the variable y k in Fig. 4).
Elementary results from probability theory imply that such recursive combinations of probabilistic associations between RVs have a very powerful, in fact universal, representation capability: the dependency structure of every probability distribution p ‫ء‬ over discrete RVs can be represented as a network of probabilistic associations between each of the RVs y k and a subset of the other RVs (Bishop, 2006). More precisely, in the representation of an arbitrary distribution p ‫ء‬ one has a subnetwork (module) for each RV y k of p ‫ء‬ that has y k as output variable and the continued neurons projects to a different neuron in the population coding of the variable z on layer 3. B, Left, The target probability distribution p ‫ء‬ ͑x, z͒ of the examples (grey bars) and the internal model p͑x, z; ͒ (blue bars) that is extracted from the examples by the hidden neurons. The learned probabilities match the target probabilities quite well. Right, The two mixture components p͑x, z ϭ 2, ␣ 1 2 fires ; ͒ and p͑x, z ϭ 2, ␣ 2 2 fires ; ͒ represented by the hidden neurons ␣ 1 2 and ␣ 2 2 in the subpopulation ␣ 2 . Each has specialized to represent one of the two modes. The resulting internal model p͑x, z ϭ 2; ͒ is a sum of these two mixture components. C, Same as the plots on the left in B, but for a larger network where four hidden neurons were used in each subpopulation ␣ 1 , ␣ 2 of hidden neurons. This larger size of the subpopulations is suggested by the network construction from Pecevski et al. (2011), because the vector x can assume four different values. However, a comparison with C shows that smaller subpopulations suffice here for good learning performance. D, Left, The target probabilities p ‫ء‬ ͑z Խ x͒ (grey bars) compared with the learned firing probabilities of the output neurons 1 and 2 that represent p͑z ϭ 1 Խ x; ͒ and p͑z ϭ 2 Խ x; ͒ respectively. Right, The probability of firing in response to different inputs x for the two hidden neurons ␣ 1 2 and ␣ 2 2 that drive 2 to fire. E, Same as the left plots in D, but for the larger network as in C. Again, one sees that fewer hidden neurons are needed here than in the construction of Pecevski et al. (2011). F, Firing activity of the hidden neurons and output neurons in the module in response to two different input patterns ͑x 1 , x 2 ͒ ϭ ͑1, 1͒ and ͑x 1 , x 2 ͒ ϭ ͑2, 1͒. The firing rates of ␣ 1 2 and ␣ 2 2 correspond to their probabilities of firing shown in D.
random variables in the Markov blanket y B(k) of y k as input variables. The Markov blanket defines a set of random variables so that conditioned on their values, y k becomes independent from all remaining variables. For example, if p ‫ء‬ can be represented by a Bayesian network, it suffices to include in y B(k) the parents of y k together with the children of y k , and their coparents. Whereas the classical results from probability theory only imply that one can represent any distribution p ‫ء‬ over discrete RVs as such recursive network of probabilistic associations, it was shown in (Buesing et al., 2011;Pecevski et al., 2011) that any such target distribution p ‫ء‬ can also be represented as stationary distribution of a network of spiking neurons, if suitable parameters (weights and biases) are programmed into the network. One only needs to assume that every spike of a neuron that participates in the population coding for one of the RVs y k sets the value of y k for a time period of length (ϭ standard length of an EPSP) equal to the value encoded by this neuron. Then a suitably programmed network ᏺ of spiking neurons that results from recursive combinations of the basic module from Figure 2 can represent any distribution p ‫ء‬ through its spontaneous firing activity (provided that each module for a RV y k represents p ‫ء‬ ͑y k Խ y B͑k͒ ͒ as described above). If one decodes the current firing activity in the network ᏺ at any time t by setting each RV y k to that value that is indicated by the most recent firing of a neuron in the population code for y k , the resulting distribution of value assignments to the RVs y 1 , . . . , y K of p ‫ء‬ over time is exactly the one given by p ‫ء‬ . In other words, p ‫ء‬ is the stationary distribution of the Markov chain that is defined by this network ᏺ of stochastically firing neurons. On the side, we would like to point out that this holds only after some initial "burn-in" phase, during which the distribution of network states becomes independent of the initial network state (Habenschuss et al., 2013a).
We now show (Fig. 5) that if one takes the previously analyzed learning capability of the basic network modules ᏺ k into account, the composed spiking network ᏺ learns from examplesỹ͑n͒ of value assignments to ͗y 1 , . . . , y K ͘ drawn from p ‫ء‬ values for its weights and biases that provide an approximation p͑y; ͒ of p ‫ء‬ (y). This approximation p͑y; ͒ is represented by the network in the form of its stationary distribution of network states that result from its spontaneous firing activity. In order to achieve that, one just needs to allow each learning module ᏺ k to learn in parallel from those components of the example  Figures 2 and 3. For learning complex distributions p ‫ء‬ its input variables x 1 , . . . , x l form a Markov blanket of y k . Each variable x i is encoded by the same population coding as the output variables of learning modules, and can therefore be produced by the output of another learning module (as shown for the RV x l ). As here y k is in the Markov blanket of x l , y k appears among the input variables of the upper module, and its corresponding input neurons are the same as the output neurons of the lower module.
Theory/New Concepts ỹ͑n͒ that concern the random variables that it represents in the previously described manner. More precisely, each module ᏺ k receives the components ͗ỹ k ͑n͒,ỹ B͑k͒ ͑n͒͘ of each exampleỹ͑n͒ that is presented to the network (for n ϭ 1, 2, . . . ). The network ᏺ learns an approximation of p ‫ء‬ from examplesỹ͑n͒ without any additional computational overhead or teaching signals. Each subnetwork ᏺ k learns through STDP and intrinsic plasticity an internal model p k ͑y k , y B͑k͒ ; k ͒ of the marginal distribution p ‫ء‬ ͑y k , y B͑k͒ ͒, as described in the preceding section.
One can rigorously prove that the sum over k of Kullback-Leibler divergences between the marginal distributions p ‫ء‬ ͑y k , y B͑k͒ ͒ and the learnt internal models p k ͑y k , y B͑k͒ ; k ͒ converges through these adaptive processes to a local minimum (see Materials and Methods, Theorem 2). In this sense the spiking network ᏺ learns an approximation p͑y; ͒ of the distribution p ‫ء‬ ͑y͒.
As long as the RVs y are not split into inputs and outputs, this learning process is a typical example of unsupervised learning (see the definition in standard textbooks, such as Bishop (2006); Murphy, (2012); Haykin (2009)). Characteristic for unsupervised learning is that learning progress is measured in terms of the deviation between the learnt distribution p͑y; ͒ (ϭ learnt internal model) and the distribution p ‫ء‬ ͑y͒ from which the examples are generated. One also refers to this type of learning as density estimation.
Unsupervised learning has previously already been studied in a large number of artificial neural networks models: from Boltzmann machines (Ackley et al., 1985), neural belief networks (Neal, 1992), up to deep learning networks (Salakhutdinov and Hinton, 2012;Bengio et al., 2015). One major motivation of this work has been to discover learning principles of the brain, based on the argument that supervision for learning is rare in the brain. Our learning model provides a complementary approach, with the main difference being that it is based on networks of spiking neurons, rather than artificial neural networks, and that it uses STDP as primary plasticity mechanism. A major difference between the learning process in Boltzmann machines and our model is that our model does not require separate sleep phases. Its learning process is more similar to parameter learning in Bayesian networks (Koller and Friedman, 2009, Chapter 17). There the learning process also amounts to learning for each RV separately and in parallel from examples the conditional probability table for each RV, conditioned on the values of its parents. Such learning of a conditional probability table is analogous to the learning in a stochastic association module, except that such association module considers all RVs in the Markov blanket of a given RV, rather than just its parents. However, our learning approach is more general than parameter learning in Bayesian networks insofar, as it also encompasses aspects of structure learning (Koller and Friedman, 2009, Chapter 18), see section below, "Small numbers of hidden neurons in the learning modules often suffice." Like other generative models for unsupervised learning, our model also aims at extracting underlying structure in the training examples (Hinton et al., 1995), so that it can even generate fake examples that share the discovered underlying structure (Fig. 7). On the level of higher cortical areas such unsupervised learning could detect relationships between different types of features (Figs. 6-9), between object representations in different sensory modalities, or how an action modifies the environment.
In contrast to the previously mentioned paradigms for unsupervised learning in neural networks, and similar to parameter learning in Bayesian networks, the architecture that we are proposing has a clear modular structure (Fig. 5). It consists of stereotypical network motifs ᏺ k that each try to determine for one of the RVs y k to what extent values for y k can be predicted from the values of other RVs (more precisely: the RVs in its Markov blanket y B͑k͒ ). As soon as this prediction becomes better than chance, the learning module ᏺ k has discovered some underlying structure in the examplesỹ. One curious feature of this local prediction learning is that the learning process looks from the perspective of the learning module ᏺ k like supervised learning, because ỹ k is the prediction target for input y B͑k͒ to this module, and both ỹ k andỹ B͑k͒ are part of a training exampleỹ. This holds in spite of the fact that the whole examplesỹ are in general presented to the network without any supervision, ie, without any associated target output. This feature of the learning process is shared with parameter learning in Bayesian networks, where the learning of a conditional probability table for a RV y may look locally like supervised learning, because both the values of its parent nodes and the value of y are extracted from each training example.
Boltzmann machines and probabilistic graphical models such as Bayesian networks that are usually trained through unsupervised learning can, however, also be used for supervised learning (Hinton et al., 2006). In that case, the RVs y are split into two subsets y I and y O , ie, the target outputỹ O is combined with the vectorỹ I to form the Figure 5. Schematic description of the learning approach. Sequence of examplesỹ͑0͒,ỹ͑1͒, . . . drawn from the target distribution p ‫ء‬ ͑y͒ are presented to the neural network ᏺ. The neural network is composed of learning modules ᏺ k , one for each RV y k . ᏺ k learns from the components ͗ỹ k ͑n͒,ỹ B͑k͒ ͑n͒͘ of examplesỹ͑n͒ an approximation p k ͑y k , y B͑k͒ ; k ͒ of p ‫ء‬ ͑y k , y B͑k͒ ͒ as indicated in Figures 2 and 3. The theory based on EM ensures that the total network ᏺ learns in this way an approximation p͑y; ͒ of p ‫ء‬ ͑y͒. examplesỹ ϭ ͗y I , y O ͘ used for training. The goal is to learn a mapping from y I to y O , ie, to learn the distribution p ‫ء‬ ͑y OԽ y I ͒. We consider here the general case where the mapping from inputs to target values is stochastic, either due to present noise in the target values or their inherent stochastic relation to the inputs. Typical classification problems where the mapping is assumed to be a deterministic function represent a special case of the stochastic formulation. For example, for supervised learning of image categorization in Boltzmann machines one simply adds the target category like an additional feature to the feature vector of an image. The learning process remains exactly the same (ie, unsupervised learning from examplesỹ), and for learning one does not have to tell the network which variables are inputs and which are outputs.
The only difference is that after learning only the components y I of new test examples are provided and the network has to produce a guess for values of the components y O . The same principle applies to the neural network in our approach: If y is partitioned into y I and y O it can learn the input-output mapping by learning an internal model of the full joint distribution p ‫ء‬ ͑y I , y O ͒ of the examples. After the learning process has finished, the network can estimate the probabilities of the target values p ‫ء‬ ͑y OԽ y I ͒ given a particular input y I . However, note that, as it learns the joint distribution, the network can additionally also answer any other probabilistic inference queries based on the probability distribution p ‫ء‬ of the examples (see section "Flexible retrieval of learnt statistical information through probabilistic inference").  Knill and Kersten (1991). Both surfaces, top and bottom, have identical shading profiles in the horizontal direction. Nevertheless, subjects perceive that the reflectances of the two halves of the bottom panel are the same, whereas they perceive the left half of the top panel as being darker than the right half. The different contours of the two panels suggest different 3D shapes (flat vs cylindrical), which influences subjects' perception of the reflectance of the two halves of each surface. B, The "explaining away" Bayesian network proposed by Kersten and Yuille (2003) that models the effect from A. It consists of four RVs y 1 , y 2 , y 3 , and y 4 . The relative reflectance y 1 of the surfaces can have two values: y 1 ϭ 2 for different and y 1 ϭ 1 for the same reflectance of the two parts of the surface. The 3D shape of the surfaces (y 2 ) is either cylindrical (y 2 ϭ 2) or rectangular (y 2 ϭ 1). The relative reflectance and the 3D shape are direct causes of the shading or the luminance change of the surfaces (y 3 ), which can have the profile like in the bottom part of B (y 3 ϭ 2) or a different one (y 3 ϭ 1). The 3D shape of an object causes different 2D contours (y 4 ), which can be either straight (y 4 ϭ 1) or curved (y 4 ϭ 2). The observed variables are the contour (y 4 ) and the shading (y 3 ) of the surfaces. Subjects infer the value of the relative reflectance (y 1 ) and the 3D shape (y 2 ) based on these observed cues. C, The structure of the neural network ᏺ that corresponds to the Bayesian network in B. For each RV y k in the Bayesian network there is a learning module ᏺ k composed of a population of neurons that outputs y k in population coding, and a population of hidden neurons ␣ k . The learning modules are interconnected according to the Markov blankets of the RVs in the Bayesian network as indicated in Figure 4. For example, the RVs in the Markov blanket of y 1 are y 2 and y 3 , and therefore the learning module ᏺ 1 receives connections from ᏺ 2 and ᏺ 3 . D, Structure of the learning module ᏺ 1 for the RV y 1 in the neural network in C.
Classical learning of associative memory (like in a Hopfield network) also appears as a special case of the type of network learning that we are investigating: If p ‫ء‬ ͑y͒ is nonzero only for some setỹ͑0͒, . . . ,ỹ͑M Ϫ 1͒ of M memory items. For these memory items the network learns then input completion. Such input completion can be accomplished through the learnt internal model p͑y; ͒ in our framework: If one clamps some of the neurons to the values of some given incomplete pattern ŷ I , and lets the other neurons fire according to the stochastic dynamics of this internal model. But note that the learning task of our model is more demanding than classical learning of associative memory, because it also has to learn the probability distribution (frequency) of the patternsỹ͑n͒.
The underlying learning theory for our model (based on Expectation Maximization) does not guarantee that the internal model p͑y; ͒ converges to the target distribution p ‫ء‬ ͑y͒. Rather, as in all known cases of nontrivial unsupervised learning and self-organization it can only guarantee that a local optimum is achieved (which cannot get worse if learning continues). However, even this weak form of a theoretical guarantee is actually quite rare in the literature on neural network learning via STDP. Most known successful methods for unsupervised learning or self-organization in machine learning are supported theoretically in the same weak manner via Expectation Maximization. However many of these methods work very well in practice. The learning speed and quality depend for the learning framework that we have introduced on the nature of the target distribution p ‫ء‬ . We are demonstrating in Example 2 that this learning scheme works well for an example of p ‫ء‬ where it is well known that humans are able to learn a good approximation of probabilistic inference for p ‫ء‬ .

Flexible retrieval of learnt statistical information through probabilistic inference
After learning, the network ᏺ from Figure 5 has an approximation p͑y; ͒ to p ‫ء‬ ͑y͒ as its stationary distribution. This holds under the convention that the firing of a neuron that represents a specific value l of a variable y k sets this variable to value l for a duration equal to the duration of a generic EPSP. The learned distribution is manifested in the spontaneous activity of the network, ie, when no neurons in the network are clamped. Information can be extracted from this learnt internal model p͑y; ͒ through probabilistic inference via neural sampling. This type of information retrieval goes far beyond input completion, which is the only form of information retrieval in classical neural network models for memory. In particular, after inserting evidence into ᏺ by exciting or inhibiting some of the neurons that represent a subset y e of the variables, the spiking activity of the rest of the network generates according to Pecevski et al. (2011) samples from the conditional posterior distribution p͑y sԽ y e ; ͒, where y s is the subset of variables that are not in y e . Furthermore, the posterior marginal probabilities p ‫ء‬ ͑y k Խ y e ͒ of the variables y k in y s can be read out from the resulting firing rates of the neurons that represent these variables. Thus, information gathered from the examplesỹ͑n͒ that had been presented to the network ᏺ (Fig. 5) can be extracted from this network in very flexible ways through probabilistic inference. This will be demonstrated for an example in Figures 6-9. In particular, the network can produce estimates of posterior marginal probabilities of the type indicated through examples in the Significance Statement. Note that these marginal probabilities, that are represented by the firing rates of corresponding neurons in ᏺ, integrate automatically information from many modes of the learnt approximation p͑y; ͒ to p ‫ء‬ . Hence, if these modes represent individual memory items of a memory model, the network ᏺ can combine information from many different memory items (episodes), also in ways that could not be anticipated during learning.

Small numbers of hidden neurons in the learning modules often suffice
The structure of the network ᏺ in Figure 5 is very similar to the structure of a constructed network of spiking neurons that directly mimics a representation of p ‫ء‬ by a Bayesian network according to Pecevski et al. (2011). However, there the number of hidden neurons ␣ k for a random variable y k was required to be exponentially large in the number of variables in the Markov blanket of y k . In contrast, in the learning approach of this article, one can employ in principle any number, also a very small number, of hidden neurons in ␣ k . The described learning approach will approximate the marginal distribution of p ‫ء‬ over y k and the Markov blanket of y k with a mixture distribution whose number of modes is determined by the chosen number of hidden neurons in ␣ k . For example, in Example 1 we had chosen just two hidden neurons in ␣ for each of the two possible values of z, instead of two times four that were used in the construction of Pecevski et al. (2011) for representing all four possible assignments of values to the inputs x of a module. But as the comparison of Figure 3, B and C (and of D and E) shows, the network with the smaller number of hidden neurons ␣ works in this case about as well as the larger network. This effect is predicted from general results in learning theory: a learning network with fewer parameters sacrifices representation power, but gains generalization capability. Furthermore, naturally occurring distributions can often be approximated quite well by mixture distributions with a relatively small number of components (modes). This suggests that real world distributions p ‫ء‬ of examples can often be learnt by relatively small networks ᏺ. An ideal scenario from a biological perspective would be one where a population of hidden neurons in ᏺ can become larger if the size provides insufficient resolution or prediction capability for the examples that it receives.
Note that with this approach a spiking network ᏺ can in principle learn an approximation of a given distribution p ‫ء‬ even without prior information on the dependency structure among RVs of p ‫ء‬ . One can set up the network ᏺ so that each module ᏺ k extracts the probabilistic association between RV y k and all other RVs y \k (ie, replacing y B(k) by all RVs other than y k ). The size (ie, number of hidden neurons ␣; Fig. 2) of ᏺ k determines the quality of the resulting learnt approximation of p ‫ء‬ ͑y k Խ y \k ͒. Whereas a Theory/New Concepts good approximation can only be theoretically guaranteed if the number of hidden neurons in ᏺ k is exponential in the number K of RVs of p ‫ء‬ , acceptable results may emerge with drastically fewer hidden neurons, provided that their number is in the same range as the sum of the number of main modes of the distributions p ‫ء‬ ͑y \k Խ y k ϭ l͒ for different values l (Fig. 3 provides an illustration).

Example 2: autonomous learning of explaining away in perceptual inference
We demonstrate learning in recursive combinations of the basic learning module for a concrete example with four modules (Fig. 6C). We apply it to the task of learning a complex distribution p ‫ء‬ that represents a standard example for explaining away in visual perception. The famous experiment of (Knill and Kersten, 1991), depicted in Figure  6, had first demonstrated that nontrivial inference is involved in visual perception. A subsequent study (Kersten and Yuille, 2003) proposed that this perceptual effect can be understood as "explaining away" in probabilistic inference, and a Bayesian network with 4 RVs, y 1 , . . . , y 4 ( Fig.  6B) was introduced to demonstrate this. The probability distribution of the Bayesian network is p͑y 1 , y 2 , y 3 , y 4 ͒ ϭ p͑y 1 ͒p͑y 2 ͒p͑y 3 Խ y 1 , y 2 ͒p͑y 4 Խ y 2 ͒, and the inference task is to calculate the marginal posterior probability distributions p͑y 1 Խ y 3 , y 4 ͒ and p͑y 2 Խ y 3 , y 4 ͒.
In the experiment of Knill and Kersten (1991), the demonstrated different perception of the shading in the two surfaces in Figure 6A (see legend for more details) can be explained by two different competing causes, either by different relative reflectance of the two abutting surfaces (y 1 ), or by their cylindrical 3D shape (y 2 ). An observed curved contour of the surfaces is a cue that increases the probability of a cylindrical 3D shape. Because a cylindrical 3D shape alone is enough to explain the shading, it reduces the probability that the relative reflectance is different. Hence, one of the competing causes, the cylindrical 3D shape, "explains away" the other possible cause, the different reflectance of the surfaces. In the other case, when a flat contour is observed as an additional cue, this increases the probability of a rectangular 3D shape. As a rectangular 3D shape cannot explain the observed shading, the probability of the second possible cause for the shading, the different reflectance is increased. This type of explaining away in probabilistic inference can only occur for distributions p ‫ء‬ that have higher-order interactions between three or more RVs, like between the two competing causes and the observed shading in this example.
We show that the underlying distribution p ‫ء‬ can be learnt (approximately) from examples for this visual perception task, and that the network ᏺ which learns this approximation learns simultaneously to deal with the explaining away effect as an emergent phenomenon.
The structure of the neural network ᏺ suitable for learning this target probability distribution p ‫ء‬ is given in Figure 6C. It consists of four interconnected learning modules, where the connections between the learning modules reflect the dependencies between the RVs in the Bayesian network in Figure 6B. Additionally, the structure of one of the learning modules, the learning module ᏺ 1 for the RV y 1 , is given in Figure 6D in detail. Each subgroup ␣ 1l of hidden neurons has two neurons. This number of hidden neurons is smaller than the number of hidden neurons in the exact neural implementations of this Bayesian network by Pecevski et al. (2011), (their Implementation 2), equal to the total number of assignments of values to the RVs in the Markov blanket, which in this case is four. But we show that the smaller neural network can nevertheless learn the distributions p ‫ء‬ ͑y 2 , y 3 Խ y 1 ϭ l͒, because these distributions do not have more than two modes. In fact, we use here just two hidden neurons in the subgroups ␣ kl of all learning modules, also for the learning module ᏺ 3 , where the total number of assignments of values to the RVs in the Markov blanket is 8. As we will see in the results, two hidden neurons in the learning module ᏺ 3 are enough to learn a good approximation of p ‫ء‬ ͑y 1 , y 2 , y 4 Խ y 3 ϭ l͒.
We performed computer simulations of learning with this network, where examples drawn from the target probability distribution p ‫ء‬ were presented to the network successively during learning. The distribution p ‫ء‬ was defined according to Table 4 in Materials and Methods in order to capture the visual perception scenario of Figure 6 in a qualitative manner. The learning phase took 1200 s of simulated biological time, and each example was presented for a time period of 100 ms. The weights and biases of the neurons were randomly initialized before learning.
We first analyzed the stationary distribution of network states in the network ᏺ from Figure 6C after learning. Figure 7A shows that the network switches spontaneously between different network states, and occasionally remains longer in one of the network states that have high probability under the stationary distribution (Fig. 6B). One can relate the firing activity of this network ᏺ to an approximation p͑y; ͒ of the target distribution p ‫ء‬ ͑y͒ by assuming in the usual manner (Berkes et al., 2011;Buesing et al., 2011) that the firing of a neuron in the population code for variable y k sets the value of this variable for a time period of length to the value encoded by this neuron. The distribution over 4 binary random variables y 1 , . . . , y 4 obtained in this way from the spontaneous firing of the network ᏺ is shown in Figure 7B and compared with the target distribution p ‫ء‬ .
In Figure 8A we examined how the learning in the modules progresses in time. In each of the four learning modules in the network (one for each problem variable y k ; Figs. 5, 6C) the Kullback-Leibler (KL) divergence between the marginal target distribution p ‫ء‬ ͑y k , y B͑k͒ ͒ that it learns to approximate and its internal model p k ͑y k , y B͑k͒ ; ͒ decreases and stabilizes to a local minimum after about 300 s. The same is true for the sum of all KL divergences of the modules (Fig. 8B). As a result, the difference between the model distribution p͑y; ͒ of the network and the full target distribution p ‫ء‬ ͑y͒ also decreases, as shown in Figure 8C. This is because after learning a good internal model of the marginal target distributions p ‫ء‬ ͑y k , y B͑k͒ ͒, the firing probability of the output neurons of the learning modules approximate well the conditionals p ‫ء‬ ͑y k Խ y B͑k͒ ͒ of p ‫ء‬ , as shown in Figure 8D, which according to the theory leads to a good approximation of p ‫ء‬ by the network.
The PSPs and STDP curves in the simulated network have a more biological smoother alpha shape, which differs from the rectangular shape used in the theory. This can introduce minor deviations of the learning convergence from the theoretically optimal one, as for example the slight increase of the KL divergence in Figure 8C in the second half of the learning process.
This network ᏺ can extract the information that it has acquired from the examples in a very flexible manner through probabilistic inference. For example, if evidence e is entered for some of the problem variables y 1 , . . . , y 4 (by inducing corresponding neurons in their population codes to fire at high rates) the conditional marginal probabilities of other variables y i can be read off from the firing rates of neurons that represent y i through population coding. In particular, we demonstrate in Figure 9 that the network ᏺ has acquired autonomously from examples the capability to carry out nontrivial probabilistic inference that involves explaining away (ie, higher order dependencies among random variables).
Finally, we would like to point out that our approach is not restricted to Bayesian networks; it can be applied to any type of graphical models, eg, also to Markov random fields and factor graphs. The network connectivity is determined by the Markov blankets of the random variables, which can easily be read out from the graph structure of any graphical model.

Materials and Methods
We first present a rigorous learning theory that supports the learning results that are presented. After that, we provide details to the computer simulations.

Theoretical properties of the basic learning module (stochastic association module) and its plasticity
In this section, we give additional details about the structure of the module, the firing of the output neurons  Figure 6C after learning. A, Random sample of spontaneous activity of the network (without any external input). The network switches stochastically between different network states (some of which are labeled by colors), in a manner that is qualitatively similar to experimental data from networks of neurons in the cortex (Abeles et al., 1995;Jones et al., 2007). The spike trains of neurons that encode a problem variable y k through population coding are underlined by blue lines. Spike trains immediately above are from the hidden neurons ␣ k that drive this neuron to fire (as in Fig. 2). B, Frequency of network states that encode particular assignments to the problem variables y k (shown on the x-axis) resulting from the learned stationary distribution (spontaneous activity) p͑y; ͒ of the network are shown as green bars. The coloring of the network states labels is the same as for corresponding network states in A. Black bars indicate the probabilities of the same value assignments under the distribution p ‫ء‬ that had produced the examples for learning. and the learning procedure. In particular, the subsequent subsection titled "Implicitly represented generative model" defines the probabilistic model p͑x, z; ͒ represented in the module. After that, in the two subsections "Firing probability of the output neurons resulting from the internal generative model" and "Deriving the probability distribution of the firing of the neurons ␣" the firing of the output and the alpha neurons in the module is expressed through the represented probabilistic model. The last subsection "The plasticity rules minimize the KL divergence through Expectation Maximization" continues with the explanation how the plasticity rules implement learning of the stochastic associations between the variables x and z through EM.
The neurons ␣ in the learning module are interconnected through inhibitory interneurons (data not shown in Fig. 2) that enforce strong lateral inhibition between them.
The role of the lateral inhibition is to ensure that if one of the ␣ neurons fires at time t, no other neuron in ␣ will fire within the interval (t, t ϩ ). Each neuron in population ␣ receives input synaptic connections from all input neurons such that the firing of the population i encodes the value of the input variable x i (Fig. 2). For each non-zero value of the input variable x i ϭ m there is a dedicated neuron im , and if im fired in the time period (t -, t], the value of x i at time t is x i ϭ m. The firing in the population i is such that no two neurons fire in the time interval (t -, t], ie, after a spike there is no spiking period of duration . If there is no spike in the interval (t -, t], then the value of x i is 0. All neurons in the population ␣ l connect to their corresponding output neuron l . These synaptic connections are not subject to synaptic plasticity and have a fixed strong efficacy. The strong weights of these synapses achieve that whenever any of the neurons in ␣ l fires, it  Figure 6C in learning the perceptual inference task of Knill and Kersten (1991) from examples. A, Evolution of the KL divergence between the target distribution p ‫ء‬ ͑y k , y B͑k͒ ͒ of the examples (denoted in the plot as p k ‫ء‬ ) and the distribution of the internal model p k ͑y k , y B͑k͒ ; k ͒ during learning, for each of the learning modules ᏺ k in the network (k ϭ 1, 2, 3, 4). B, Evolution of the sum of the KL divergences shown in A. C, Evolution of the KL divergence between the internal model distribution p͑y; ͒ represented by the whole network and the target distribution p ‫ء‬ ͑y͒ of the examples during learning. D, The plots show the target conditional distributions p ‫ء‬ ͑y k Խ y B͑k͒ ͒ (black bars), and the learned conditional distributions p k ͑y k Խ y B͑k͒ ; k ͒ (green bars), for each problem variable y k . The bit string labels on the x-axis denote the assignment of values to the problem variables on which each distribution is conditioned.
Theory/New Concepts drives the output neuron l to fire. The output neurons fire only if they receive input spikes, otherwise they remain silent. Thus, the spikes that the neuron l emits can be seen as a union of all spikes of the hidden neurons in the group ␣ l .
One can give an analytical description for the firing probability density of the output neuron l according to the model. To do that, first we need to introduce some notation. With w im, j l we denote the efficacy of the synaptic connection from the input neuron im to the neuron ␣ j l from the group ␣ l . For the bias (intrinsic excitability) of the neuron ␣ j l , we write b j l . We also introduce, for simplicity, binary RVs x im corresponding to each neuron im . The binary RV x im assumes value 1 at time t if the current value of the RV x i is equal to m, and 0 otherwise. The current value of x im at time t can be determined just from the spikes of the neuron im , ie, it is 1 if im fired within the time interval [t, t -], and 0 otherwise. With this notation, we can describe the firing probability density of l at time t by: Figure 9. Demonstration that the network ᏺ from Figure 6C has learnt explaining away. A, B, Results of probabilistic inference (estimate of posterior marginal probabilities) by the network through observation of the firing rates of neurons that encode y 1 and y 2 (whereas other neurons are "clamped" to encode the evidence e) are shown as green bars. The results are from a single run with duration 400 ms per evidence value. Black bars indicate corresponding values for p ‫ء‬ ("ground truth"). The network estimates the probabilities p͑y 1 ϭ 2 Խ e; ͒ (indicating different relative reflectance according to Figure 6B) and p͑y 2 ϭ 2 Խ e; ͒ (indicating a cylindrical 3D shape) for two different evidence values e ϭ (y 3 , y 4 ) through sampling from its stationary distribution of network states. In the first inference we assume e ϭ (2, 2) (indicating an observation with discontinuous shading of the object and curved contour), and in the second e ϭ (2, 1) (observation of discontinuous shading and straight contour). C, The spiking activity of the network ᏺ during online inference of hidden causes, where the evidence e was switched after t ϭ 3s (red vertical line) from e ϭ (2, 2) to e ϭ (2, 1). Spike trains of neurons from population codes for problem variables y k are underlined by blue lines (as in Figure 7A). D, Firing rates (estimated with a sliding alpha kernel K͑t͒ ϭ t exp ͑ t ͒ , t ϭ 0.1 s) of the neurons encoding y 1 ϭ 2 (different relative reflectance, dashed line) and y 2 ϭ 2 (cylindrical 3D shape) during the inference simulation run shown in C. The "explaining away" effect is clearly visible from the complementary evolution of the firing rates of the two neurons that represent the two potential hidden causes "cylindrical 3D shape" and "different relative reflectance." (4) where u j l is the membrane potential of neuron ␣ j l , J l is the number of neurons in the subpopulation ␣ l , and M(x i ) denotes the maximum integer value that x i can assume. The firing probability density of the output neuron l is equal to the sum of the firing probability densities of all neurons in the subpopulation ␣ l , which follows from the specific connectivity structure in the learning module. The sums with the indices i and m iterate over all input neurons im . During learning, we assume that examples ͗z͑0͒, x ͑0͒͘, ͗z͑1͒, x ͑1͒͘, . . . , ͗z͑n͒, x ͑n͒͘, . . . drawn from the joint probability distribution p ‫ء‬ ͑z, x͒ are presented to the WTA circuit in succession, one by one. In an example the variable z has an integer value from the set ͕1, . . . , M͑z͖͒ (M(z) denotes the maximum integer value that z can assume), and each of the variables x i assumes a value from the set ͕1, . . . , M͑x i ͖͒. Each example ͗z ͑n͒, x ͑n͒͘ is presented for several tens or hundreds of milliseconds. During this time period, the input neurons fire according to the values of the input variables x ͑n͒. In particular, if x i ͑n͒ ϭ m, then the input neuron im fires with a high firing rate, whereas all other input neurons in the population i are silent. The value z ͑n͒ in the example is given to the network via inhibitory currents in a subset of the neurons ␣. More precisely, if in the current example z ͑n͒ ϭ l, then all neurons in ␣ that do not belong to ␣ l are inhibited with a strong negative external current which prevents them from firing. At the same time the neurons in the group ␣ l do not receive any external inhibitory currents and are free to fire according to their firing probabilities determined by their inputs. At the beginning of the learning process, it is assumed that the biases of the ␣ neurons have large values so that they fire with high firing rates regardless of what the presented input x ͑n͒ is.
As stated above, in the presented examples during learning the variables z and x i (i ϭ 1, . . . , I) assume values from the sets ͕1, . . . , M͑z͖͒ and ͕1, . . . , M͑x i ͖͒, respectively. These values could represent states of the external environment, internal beliefs or any other behavior related variable value. However, in the population coding of values of the variables z and x i (i ϭ 1, . . . , I) by their corresponding neuron populations in the learning module, the variables can have an additional value of zero. A neuron population that encodes a variable through population coding assigns value zero to this variable if none of the neurons fires for a period longer than . There is not a dedicated neuron whose firing signals a zero value, as is the case for the rest of the non-zero values of the variable. The assumed spike based encoding with a zero value is the same as in the neural sampling theory (Buesing et al., 2011;Pecevski et al., 2011). It defines a value for the variable at each moment in the continuous time dynamics of the neural network model.
The zero values in the population coding of the learning module do not represent states of the external environment. The reason for this is that the learning module does not learn the probabilities p͑x, z; ͒ in its internal model where some of the variables z and x i (i ϭ 1, . . . , I) have zero value. These probabilities have very fixed small (close to zero) values (see section "Firing probability of the output neurons resulting from the internal generative model"). Therefore, it is assumed that the examples contain only the encoded variable values whose stochastic relations can be learnt by the learning module.

Implicitly represented generative model
The learning theory of the module is based on its internally represented generative model p͑z, x; ͒, as it is discussed in Results, section "A network module for learning stochastic associations" (see also section "The plasticity rules minimize the KL divergence through Expectation Maximization"). In order to define the implicitly represented generative model, we first define an additional compound RV a in form of a vector of binary RVs, that we will relate to the stochastic activity of the neurons in the population ␣. More precisely, the RV a is defined as a population code, represented as a set of binary RVs a j l , one for each neuron ␣ j l . The generated values of a j l over time during the activity of the WTA circuit are defined by the spiking of ␣ j l in the same way as the spiking of the input neurons define the values of x. The value of a 0 0 , which does not have a corresponding neuron in the population ␣, is defined as a 0 0 ϭ 1 Ϫ ͚ l, j a j l . Note that as there can be no two neurons in ␣ spiking within the time window (t -, t) because of the lateral inhibition, it follows that the vector RV a has a restricted domain of allowed values, consisting only of values where exactly one a j l is equal to 1, and the others are 0. The value of a where only a j l ϭ 1 and the rest are 0 will be denoted by a ϭ ͑l, j͒.
Having defined a, we can now write the parametrized form of the generative model as follows: where A͑͒ is the normalization constant. In the sum over is not parametrized and specifies the deterministic relations between a and z. It is defined as follows: p͑z ϭ l 1Խ a ϭ ͑l 2 , j͒͒ ϭ 1 if l 1 ϭ l 2 , and ϭ 0 otherwise, for all l 1 ʦ͕0, . . . , L͖, and all (l 2 , j) that are valid values of a. These probability values express the fact that when the neuron ␣ j l fires and sets the value of a to be (l, j), then this also uniquely determines z, ie, sets the value of z to z ϭ l.
A sufficient condition for having a normalization constant A͑͒ ϭ 1 are the following constraints on the parameters: As we will see later, one property of the plasticity rules is that they move the parameter vector toward the region of the parameter space where these normalization constraints are approximately satisfied, ie, they try to keep the internal probabilistic model normalized. If one assumes that the normalization constant satisfies A͑͒ ϭ 1, then the generative model in Equation 5 obtains the form: By marginalizing a in the full generative model p͑a, z, x; ͒, we obtain the marginal probability distribution p͑z, x; ͒, which has the form of a mixture of multinomials. The marginal probabilistic model p͑z, x; ͒ models the probabilistic relations between the RV z, encoded by the firing of the output neurons, and the RVs x, which are encoded in the inputs. These relations are modeled through the vector of auxiliary RVs a, which are also called hidden RVs. In the mixture model the conditional probabilities p͑z Խ a͒ and p͑x i Խ a; ͒ for i ϭ 1, . . . , I are the likelihoods, and p͑a; ͒ is the prior. It can be easily shown from Equation 5 that by assuming A͑͒ ϭ 1, the likelihoods p͑x i Խ a; ͒ for i ϭ 1, . . . , I are: Similarly the priors are: Hence, given that the normalization constraints in Equation 6 hold, the parameters ŵ im, j l are equal to the log of the conditional probabilities from the likelihood ŵ im, j l ϭ log p͑x im ϭ 1 Խ a j l ϭ 1; ͒, whereas b j l represent the log probabilities of the prior b j l ϭ log p͑a j l ϭ 1; ͒. For convenience, Table 1 lists the mathematical notation that is used in the definition of the learning module.
Firing probability of the output neurons resulting from the internal generative model Here we will express the firing probability density of the output neurons through the probabilistic model p͑z, x; ͒. This will show that by learning the internal model p͑z, x; ͒ of the target distribution p ‫ء‬ ͑z, x͒, the output neurons actually learn to fire according to the conditional p ‫ء‬ ͑z Խ x͒. Thus, as discussed in Results, section "A network module for learning stochastic associations," the module exhibits the learned stochastic associations between the variables z and x through the firing of its output neurons given input x. Furthermore, this particular form of firing enables to recurrently interconnect multiple modules in larger networks to learn more complex distributions. Networks of interconnected learning modules are presented in Results, section "Recursive combinations of the basic learning module enable efficient learning of complex distributions from examples," and further discussed in Materials and Methods, section "Theoretical properties of networks of recursively interconnected basic learning modules." As a prerequisite to the derivation, we first establish a relation between the parameters ŵ im, j The relation is defined through the following equations: This entails that the modification of synaptic weights and biases via synaptic and intrinsic plasticity during learning results in adaptation of the parameters of the represented generative model. Not all parameters in the generative model are learned, however. We assume that the parameters: that do not have a corresponding synaptic weight or a bias, have fixed values that are not subject to learning. In particular, ŵ i0, j l , ŵ i0, 1 0 and b 1 0 are assumed to have the following values: where V is a large positive constant. We assume that the constant V is large enough so that the probabilities p͑z, x; ͒ where at least one of the variables z and x i (for i ϭ 1, . . . , I) has zero value are much smaller than the probabilities of value assignments where all variables have non-zero values (Eq. 5).
The fixed parameters are all subsumed in the baseline values wand b -: From this it follows that one can write the synaptic weights and the biases as: Hence, we have a one-to-one mapping between the synaptic weights and biases of the neurons ␣, and the parameters in the generative model that are subject to adaptation during learning. Note that from Equations 12 and 13, it follows that: If we now substitute the synaptic weights and biases from Equation 10 in the expression for the firing probability density l (t) of the output neuron l from Equation 4, we obtain:

Symbols related to the RVs of the inputs and the outputs x
Vector of all multinomial RVs ͑x 1 , . . . , x I ͒ corresponding to the inputs x i i-th multinomial RV from the vector x z RV corresponding to the output neurons M( . . . ) Operator that gives the maximum integer value of the RV given as an argument; for example M(x i ) and M(z) denote the maximum values of x i and z, respectively. p ‫ء‬ ͑x, z͒ Target probability distribution learned by the learning module The output and input neurons in the learning module i Population of input neurons that together encode the value of the RV x i through population coding im Input neuron in i whose firing signals the value m of the RV x i x im Binary RV that assumes value 1 if and only if x i ϭ m; it corresponds to the coding property of the input neuron im . Population of output neurons that encode the value of the RV z l Output neuron in whose firing signals the value l of the RV z In this expression, we can combine the two sums over the index i that iterates through 1, . . . , I (we do this both in the denominator and in the numerator), and by using the equality is a binary RV, such that x i0 ϭ 1 if and only if x i ϭ 0), we get the simplified form: where we write simply x im instead of x im (t), as we assume it is implicitly understood that it is a function of time. It is clear that the denominator is equal to p͑z ϭ 0, x; ͒. In the numerator, we can rewrite the sum by iterating over values of the RV a as: where the sum indexed by a iterates over all possible values of the compound RV a having only one a j l ϭ 1 and others equal to 0, and in both sums inside the exp function indexed by lЈ and j, the index lЈ iterates over all possible values of z which are ͕0, . . . , L͖, and j iterates from 1 to J lЈ . It is now easy to see that the numerator is equal to p͑z ϭ l, x; ͒ A͑͒, and finally we obtain: Thus, the firing probability of the output neuron l is proportional to p͑z ϭ l Խ x; ͒, and if the internal model p͑x, z; ͒ is close to p ‫ء‬ ͑x, z͒, it will be proportional to p ‫ء‬ ͑z ϭ l Խ x͒ (as pointed out in Results, section "A network module for learning stochastic associations"). For how this result is used in the theory for networks of interconnected learning modules that learn to perform probabilistic inference in larger distributions see section "Theoretical properties of networks of recursively interconnected basic learning modules." Deriving the probability distribution of the firing of the neurons ␣ We show in this subsection that the firing of the neurons ␣ in the WTA circuit during the presentation of an example ͗x͑n͒, z ͑n͒͘ generates samples from the distribution p͑a Խx ͑n͒, z ͑n͒; ͒. This result is used in section "The plasticity rules minimize the KL divergence through Expectation Maximization," where the link between the plasticity rules, the activity of the neurons, and different calculations of the EM algorithm are discussed. In particular, it is explained there that, as the RVs a are the hidden variables in the probabilistic model, the samples generated by the activity of the ␣ neurons together with the presented examples, form samples from the complete data distribution, which represents the expectation step of the EM algorithm. The result is further used in the section "Proof that the plasticity rules minimize the objective function ᐁ͑͒ through Expectation Maximization" in that the proof is based on analyzing mean weight updates of the plasticity rules over the complete data distribution. During the presentation of a single example, in the periods when it is not inhibited by the lateral inhibition due to a spike by another neuron in ␣, the firing probability density of each neuron ␣ j l in ␣ ideally remains constant over time. Let us denote the stationary distribution of its associated RV a j l during the presentation of the example ͗x͑n͒, z ͑n͒͘ with p ͑n͒ ͑a j l ͒. We will also here use the assumption that the total firing rate of the ␣ l neurons is very high upon presentation of an example with z ͑n͒ ϭ l irrespective of what the input x ͑n͒ is. That this assumption is true can be seen from the following. As we stated previously, we assume that the initial values of the biases of the ␣ neurons are high. This implies that in the beginning of learning, for each presented example there will be active ␣ neurons that fire in response to the example. In addition, as bhas a large negative value (Eqs. 13, 15), it follows that the plasticity rule for the biases defined in Equations 2 and 3 will drive and stabilize the biases of the active ␣ neurons toward very large values. Thus, when an example with z ͑n͒ ϭ l is presented, there will be always a subset of the neurons in ␣ that have high firing rates. Consequently, during the presentation of the example almost all the time there will be a neuron in ␣ l active and p ͑n͒ ͑a 1 0 ϭ 1͒ Ϸ 0. If we use the fact that p ͑n͒ ͑a 1 0 ϭ 1͒ Ϸ 0 for p ͑n͒ ͑a j l ͒ we can write: and equal to 0 otherwise. Here j l denotes the firing probability density of ␣ j l , which is constant during the presentation of the example. If we substitute now the firing probability density in Equation 20 with: and additionally substitute the synaptic weights and biases with their related model parameters according to Equation 10, we get: if z ͑n͒ ϭ l and 0 otherwise. Then, if we use the definition of the generative model in Equation 5, we can write both the numerator and denominator as: p (n) (a j l ϭ 1) ϭ ϭ p(a ϭ (l, j),x(n), z (n); )/p(a ϭ (0, 1),x(n); ) ͚ jЈϭ1 J l p(a ϭ (l, jЈ),x(n), z (n); )/p(a ϭ (0, 1),x(n); ) .
By multiplying now both the numerator and denominator by p͑a ϭ ͑0, 1͒, x ͑n͒, z ͑n͒; ͒ and by using the fact that: we finally arrive at: Hence, the neurons ␣ sample from p͑a Խx ͑n͒, z ͑n͒; ͒. As we pointed out at the beginning, this derivation is an important step in the proof that the plasticity rules implement the Expectation Maximization algorithm (see next subsection).

The plasticity rules minimize the KL divergence through Expectation Maximization
Here we further discuss the key property of the learning module pointed out in Results, section "A network module for learning stochastic associations," that the plasticity rules modify the synaptic weights and biases in a way that minimizes the KL divergence between the full joint distribution of the generative model p͑z, x; ͒ and the corresponding target joint distribution p ‫ء‬ ͑z, x͒. We denote the KL divergence between p͑z, x; ͒ and p ‫ء‬ ͑z, x͒ as the objective function ᐁ͑͒ ie, By using Equation 26, the key property of the learning module can be reformulated in form of the following theorem: Theorem 1. The synaptic and intrinsic plasticity rules introduced in Equations 1, 2, and 3 change the parameters so that they always converge to a local minimum of ᐁ͑͒ subject to the normalization constraints (Eq. 6).
The complete proof of the theorem is given in the next section "Proof that the plasticity rules minimize the objective function ᐁ͑͒ through Expectation Maximization." In this section we discuss the approach and the main steps of the proof, and we also describe how the convergence is carried out through the EM algorithm, ie, how the firing of the neurons and the synaptic weight updates by the plasticity rules implement different steps of EM. The complete details to the EM implementation can be found in the section "Proof that the plasticity rules minimize the objective function ᐁ͑͒ through Expectation Maximization." The proof of Theorem 1 uses results from the Robbins-Monro methods for stochastic approximation (Kushner and Yin, 2003). What is shown in the proof is that first, STDP and intrinsic plasticity drive the parameters toward satisfying the normalization constraints in Equation 6. As a second step, it is shown that given that satisfy the normalization constraints, then the mean update of the parameters averaged over many presented examples is in the direction of the negative gradient of the objective function ᐁ͑͒. From these two intermediate steps, it follows that the parameters converge to a local minimum of ᐁ͑͒ under the normalization constraints.
By reducing the KL divergence between p ‫ء‬ ͑z, x͒ and p͑z, x; ͒, the module also aims to approximate the target conditional p ‫ء‬ ͑z Խ x͒ by its internal model conditional p͑z Խ x; ͒ that defines the firing of the output neurons. In particular, the objective function ᐁ͑͒ is an upper bound of the function which is the mean KL divergence between the target conditional and the model conditional, averaged over the distribution p ‫ء‬ (x) of the inputs. This is easy to see from the equality: Thus, EM aims to reduce the measure of how much the two conditionals differ ͑͒ through minimizing its upper bound ᐁ͑͒. If the upper bound during learning decreases, this does not always mean ͑͒ will decrease as well. Nevertheless, if the upper bound ᐁ͑͒ converged during learning to a certain small value C, then after learning ͑͒ Յ C. And if C is small enough so that p͑z Խ x; ͒ is a good approximation of p ‫ء‬ ͑z Խ x͒, in such a way the minimization of the upper bound would lead to good learning of the conditional. Furthermore, in the simulation experiments we showed that the learning based on minimizing the upper bound ᐁ͑͒ works well.
The update of the synaptic weights and the biases of the neurons in the WTA circuit can be understood as performing an online stochastic approximation of the EM algorithm. The EM algorithm is an iterative optimization algorithm that finds a local minimum of ᐁ͑͒ indirectly through another function called the complete data loglikelihood. In order to establish the link with the neural network learning, we will consider a stochastic version of the EM algorithm (Wei and Tanner, 1990;Jank, 2006), that we describe in the following. We will also assume the more common version used in the literature where we have a finite sequence of examples ͗z ͑n͒, x ͑n͒͘ for n ϭ 1, . . . , N, instead of an infinite sequence drawn from p ‫ء‬ ͑z, x͒. In the case of a finite sequence, the target distribution p ‫ء‬ ͑z, x͒ is defined as the histogram of all examples. The main idea in EM is that we can substitute the original learning objective: to update the mixture generative model p͑z, x; ͒ so that it gets close to p ‫ء‬ ͑z, x͒ by another learning objective: to update the full (complete data) generative model p͑a, x, z; ͒ (together with the hidden RVs a) to get closer to the complete data target distribution: As there are not any complete data samples to learn the generative model p͑a, x, z; ͒, but just incomplete data examples for the RVs x, z, in stochastic EM one completes the data samples by generating sample values for the hidden RVs a from the generative model itself, ie, through the conditional distribution p͑a Խ x, z; ͒. This is done for all examples x ͑n͒ for n ϭ 1, . . . , N. Then one updates the parameters so that the likelihood that the complete data examples were generated by the complete data generative model p͑a, x, z; ͒ is increased. The completion of the data samples is called the "expectation" step, whereas the update of the parameters of the complete data generative model to increase the likelihood of the complete data examples is called the "maximization" step. The EM optimization is performed iteratively, where in each iteration one repeats these two steps. We can link the above two steps in the EM algorithm to concrete mechanisms in the dynamics of the WTA circuit during learning. Specifically, the expectation step is implemented through the firing of spikes by the neurons ␣ during the presentation of the data examples ͗x͑n͒, z ͑n͒͘. Indeed, the firing of ␣ generates samples from the conditional distribution p͑a Խ x, z; ͒ (see section "Deriving the probability distribution of the firing of the neurons ␣"). The presented data examples together with the generated samples by the neurons ␣ form complete data samples from p ‫ء‬ in Equation 29. The plasticity rules update the synaptic weights and biases exactly based on these complete data samples encoded in the spikes. Furthermore, given that the normalization constraints hold, the mean update of the parameters over the infinite sequence of examples (where the same finite sequence is repeated in succession infinitely) is in the direction of the gradient of the complete data log-likelihood (for details see section "Proof that the plasticity rules minimize the objective function ᐁ͑͒ through Expectation Maximization"). In view of this established link between the offline mean update of the weights and the EM algorithm, we see that the online plasticity rules in the neural network implement an online stochastic approximation of the EM algorithm in the spirit of Monte Carlo EM (Wei and Tanner, 1990;Jank, 2006).
During learning of the distribution p ‫ء‬ ͑x Խ z ϭ l͒ by the WTA circuit ␣ l , the neurons in ␣ l self-organize so that each of them specializes to fire in response to one cluster of similar, frequently occurring input patterns. In the probability distribution p ‫ء‬ ͑x Խ z ϭ l͒, this cluster corresponds to one mode of the distribution. A mode of the distribution can be described as a region of high probability in the probability space, surrounded by regions of low probability. The theoretical basis of the learning strategy via EM proves that the plasticity rules change the synaptic weights and biases to reduce the difference between the distribution of the inputs p ‫ء‬ ͑x Խ z ϭ l͒ and the represented generative model p͑x Խ z ϭ l; ͒. At the end, the learning process yields a mixture distribution where each neuron in ␣ l represents one mixture component in the form of the unimodal distribution p͑x, a j l ϭ 1; ͒ centered at one of the clusters of input patterns. The full generative model p͑x Խ z ϭ l; ͒ is then retrieved as a sum of the unimodal distributions. The unimodal mixture component is implic-itly represented in the weights and the bias of the corresponding neuron, where the vector of synaptic weights of the neuron actually represents the center of the mixture component, ie, the location in the input pattern space where its mode peaks with maximum probability.

Proof that the plasticity rules minimize the objective function ᐁ͑͒ through Expectation Maximization
We will give in this subsection a proof of Theorem 1 from the previous subsection "The plasticity rules minimize the KL divergence through Expectation Maximization." The theorem captures the main property of the learning module discussed in Results, section "A network module for learning stochastic associations," ie, that the plasticity rules install in the module an internal representation of the stochastic associations between the variables x and z in the presented examples.
Before we present the main part of the proof, we first introduce some needed definitions and derivations.

Definitions and assumptions
According to the learning procedure, independent and identically distributed examples ͗x͑0͒, z ͑0͒͘, ͗x͑1͒, z ͑1͒͘, . . . , ͗x͑n͒, z ͑n͒͘, . . . drawn from the target probability distribution are presented to the learning module one by one, each presented for a certain period of time ⌬t E . The learning procedure was explained in section "Theoretical properties of the basic learning module (stochastic association module) and its plasticity." The time interval ⌬t E for the presentation of the examples should be several times larger than the duration of the PSPs . The reason for this is that at the beginning of a time period ⌬t E when a new example ͗x͑n ϩ 1͒, z ͑n ϩ 1͒͘ is presented to the module, the values of the RVs x determined by the firing of the input neurons do not immediately change to the new example. Additionally, the neurons ␣ do not immediately start to sample from the new conditional distribution p͑a Խx ͑n ϩ 1͒, z ͑n ϩ 1͒; ͒. There can be residual active EPSPs at the synaptic inputs connecting from the input neurons that encode the valuesx͑n͒ from the previous time period ⌬t E . Similarly, there can be residual EPSPs from active neurons in ␣ that prevent the other ␣ neurons to immediately start sampling correctly from the new conditional distributions p͑a Խx ͑n ϩ 1͒, z ͑n ϩ 1͒; ͒. Therefore, for a time period of duration of one EPSP at the beginning of the presentation of a new example, the update of the synaptic weights and biases might be incorrect. Nevertheless, if is several times smaller than ⌬t E , this incorrect contribution to the learned weight (or bias) update should be insignificant.
As ⌬t E is larger than , this means that for each example ͗x͑n͒, z ͑n͒͘ the WTA circuit generates several samples from the conditional p͑a Խx ͑n͒, z ͑n͒; ͒ as the neurons ␣ spike several times during ⌬t E (see section "Deriving the probability distribution of the firing of the neurons ␣"). For simplicity of notation and formulation, in the convergence proof we will assume that for eachx͑n͒ exactly one sample ã ͑n͒ from p͑a Խx ͑n͒, z ͑n͒; ͒ is generated. The convergence proof can be easily extended for the more general case when several samples are generated from p͑a Խx ͑n͒, z ͑n͒; ͒. Given this assumption, the presented data example ͗x͑n͒, z ͑n͒͘ together with the generated sample ã ͑n͒ by the neurons ␣ represent one complete data example ͗x͑n͒, z ͑n͒, ã ͑n͒͘ drawn from the complete data distribution p ‫ء‬ ͑x, z, a; ͒.
If with ͑n͒ we denote the parameter values before applying the learning rule for the n-th example, and with ͑n ϩ 1͒ we denote the values after the application of the learning rule, then: where (n) is the learning rate used in the n-th iteration, ⌬͑n͒ is the update according to the learning rules defined in Equations 1, 2, and 3 (for easier reference, the synaptic and intrinsic plasticity rules are also given in Table 2), and ⍀͑͒ is a function that clips the parameter values within the intervals Ϫ w min Յ ŵ im, j l Յ 0 and Ϫ b min Յ b j l Յ 0. The set of values that are within these intervals we will denote with D͑͒. The subset of D͑͒ where the normalization constraints (Eq. 6) are satisfied we will denote with C . As the update rule for the biases b j l in Table 2 is defined in continuous time, we need to transform it into a form consistent with the previously stated simplification that the updates are performed for one data example drawn from p ‫ء‬ ͑x, z, a; ͒, or in other words, that the population of neurons ␣ fires one spike during the presentation of a data example ͗x ͑n͒, z ͑n͒͘. We can do that easily by assuming that the examples are presented for a period of (corresponding to one spike sample), which yields the following update for the model parameters encoded in the biases: Although in Table 2 the learning rates are different for the weights and the biases, here for simplicity we assume that the learning rate is the same for all parameters. We additionally assume that the sequence of learning rates ͑n͒ satisfies: The initial values of the parameters ͑0͒ are randomly drawn from D͑).
As pointed out previously (see section "Theoretical properties of the basic learning module (stochastic association module) and its plasticity"), in the joint distribution p ‫ء‬ ͑x, z͒ of the examples zero values of z and x i do not occur, ie, p ‫ء‬ ͑x i ϭ 0͒ ϭ 0 for all i ϭ 1, . . . , I, and additionally p ‫ء‬ ͑z ϭ 0͒ ϭ 0. In the implicitly represented generative model p͑x, z; ͒ the probabilities where at least one of the RVs z and x i (i ϭ 1, . . . , I) has zero value are represented by the fixed parameters (see Eq. 11 and accompanying text) and are not learned. The fixed parameters assume values (see Eq. 12) so that the corresponding probabilities over zero values are very small, ie, close to zero. Ideally, for the theoretical analysis we set these parameters so that: p(x i ϭ 0; ) ϭ 0 for all i ʦ ͕1, . . . , I͖, and p(z ϭ 0; ) ϭ 0 , in order to match the values p ‫ء‬ (x i ϭ 0) ϭ 0 and p ‫ء‬ (z ϭ 0) ϭ 0. In simulations, the probabilities over zero values represent a very small, insignificant portion of the mass of the learned generative model. This is also true when learning with a network of interconnected learning modules (see section "Theoretical properties of networks of recursively interconnected basic learning modules").
The assumption (Eq. 34) can be easily achieved by letting in (Eq. 12) the constant V to converge to V ¡ ϩϱ. Indeed, from V ¡ ϩϱ it follows that ŵ i0, j l ¡ Ϫϱ and b 1 0 ¡ Ϫ ϱ. The weight offset wremains constant at any time during the limit process (Eqs. 12, 13). The offset value for the biases does not remain constant, however, it converges to b Ϫ ¡Ϫϱ. To emulate this idealized condition in simulations it is sufficient to set bto a very large negative value. This ensures that the biases of the active ␣ neurons converge to large positive values, which then entails that there is always a non-empty subset of ␣ neurons that have high firing rates upon presentation of an example (see section "Deriving the probability distribution of the firing of the neurons ␣"). High firing rates of the ␣ neurons imply p͑a ϭ ͑0, 1͒ Խ x; ͒ Ϸ 0 and consequently p͑z ϭ 0; ͒ Ϸ 0 in Equation 34. In the section "Theoretical properties of networks of recursively interconnected basic learning modules," we also demonstrate that after the convergence of the weights and biases during learning to a local optimum, large negative value of bimplies p͑z ϭ 0 Խ x; ͒ Ϸ 0, which is consistent with Equation 34.

Finding a local minimum of ᐁ͑͒ through Expectation Maximization
We describe here the EM algorithm applied to the concrete case of finding a local minimum of the objective function ᐁ͑͒ in Equation 26, without any reference to the learning module and how it is implemented there. After we Table 2

Synaptic plasticity
At each postsynaptic spike of the neuron ␣ j l , at time t, the synaptic weight w im, j l undergoes an update w im, j l ¢ w im, j l ϩ ⌬w im, j l where is the learning rate and if im did not fire in ͓tϪ, t͔ . The parameter wis a baseline parameter, and is a parameter that corresponds to the duration of PSPs.

Intrinsic plasticity
At the time of each spike of the neuron ␣ j l the bias instantaneously changes its value according to Ј is the learning rate and b Ϫ is a baseline parameter for the bias. In addition, between spikes the bias exhibits continuous decay according to the differential equation give the convergence proof that the plasticity rules drive the synaptic weights and biases to a local minimum of ᐁ͑͒, we return to the question what mechanisms in the dynamics of the learning module implement different steps of the EM algorithm. The objective function ᐁ͑͒ can be written as follows: where is the log-likelihood function and is the entropy of the RVs x and z with respect to the target joint distribution p ‫ء‬ ͑x, z͒. As p‫ء‬ ͑x, z͒ does not depend on the parameters , the constrained local minima of ᐁ͑͒ match the constrained local maxima of ᏸ͑͒.
In the following we will explain the steps of the EM algorithm that finds a local maximum of ᏸ͑͒. EM optimizes ᏸ͑͒ indirectly through another function called the complete data log-likelihood where p ‫ء‬ ͑a, x, z; ͒ is the complete data probability distribution in Equation 29. It is an algorithm that updates the parameters in iterations. We will denote the values of the parameters in current iteration as old . Let ᏽ͑, old ͒ be the complete data log-likelihood ᏸ ͑͒, where in the probability distribution in the expectation we have the parameter values old from the current iteration, ie, ᏽ(, old ) ϭ ‫ޅ‬ p‫(ء‬a,x,z; old ) ͓log p(a, x, z; )͔ .
The iteration in the EM algorithm consists of two steps: Expectation step: calculate ᏽ͑, old ͒ as the expectation in Equation 39. Maximization step: find parameter values new that satisfy: and then set old ¢ new . This is a formulation of EM called the generalized EM algorithm (Dempster et al., 1977). The essence of the algorithm lies in the fact that Equation 40 implies also that: which means that in each iteration we increase the value of the log-likelihood. In fact, it also holds that: We will refer to this equation when we show that the learning module implements an online stochastic approximation of the generalized EM algorithm outlined here (see text after the convergence proof).

Definition of an auxiliary function ͑͒ used in the convergence proof
We define here an additional function ͑͒, which will be used in the convergence proof. A useful property of ͑͒ is that within the parameter domain where the normalization constraints (Eq. 6) are satisfied, it is equal to the log-likelihood function ᏸ͑͒. Hence, in order to demonstrate in the proof that the plasticity rules minimize ᐁ͑͒, it will be sufficient to show that the rules drive the weights and biases to satisfy the normalization constraints, and when the normalization constraints are satisfied, they maximize ͑͒.
The ͑͒ function is defined as follows: where q͑x, z; ͒ is the marginal of the probability distribution: The numerator of this distribution is the same as in the definition of the generative model p͑x, z, a; ͒ in Equation 5. The difference in the definition between p and q is the denominator. In the denominator in Equation 44, Â 0 ͑͒ is defined as: It can be easily seen that if the normalization constraints (Eq. 6) are satisfied, ie, if ʦC , then from which it follows that: The derivatives of ͑͒ with respect to the parameters ŵ im, j l are: If we now substitute q͑x, z, a; ͒ in the term inside the expectation in Equation 49 with Equation 44 and simplify, we obtain: Similarly, for the derivative of ͑͒ with respect to b j l we have the following: and if we substitute Equation 44, we arrive at: Deriving the local minima of ᐁ͑͒ under the constraints As we stated previously, the constrained local minima of ᐁ͑͒ match the constrained local maxima of the loglikelihood ᏸ͑͒. Additionally, the constrained local maxima ᏸ͑͒ also match the constrained local maxima of ͑͒, because ᏸ͑͒ ϭ ͑͒ for all ʦC . Here we derive the local maxima of ͑͒ under the normalization constraints in Equation 6 in analytical form. The result is used in the main proof in the subsection "The convergence theorem and the proof" where we will show that the sequence ͑n͒ during learning converges exactly to these local maxima.
The local maxima of ͑͒ can be found by calculating the derivatives of the Lagrangian function of ͑͒: and also: q ‫ء‬ (x im ϭ 1 Խ a j l ϭ 1; ) ϭ 1 . (60) Therefore, we can substitute in the implicit solutions the probability distribution q ‫ء‬ ͑x, z, a; ͒ with p ‫ء‬ ͑x, z, a; ͒ from (29), which yields: b j l ϭ log p ‫ء‬ (a j l ϭ 1; ) and ŵ im,j l ϭ log p ‫ء‬ (x im ϭ 1 Խ a j l ϭ 1; ) .
We will denote the set of all finite points that satisfy Equation 61 as G. It is assumed that the bounds of the parameters w min and b min are chosen so that GʦD͑͒. The critical points of the Lagrangian in G are either local maxima, local minima, or saddle points of the objective function ᐁ͑͒ in Equation 26 under the normalization constraints in Equation 6.
The convergence theorem and the proof After introducing the necessary definitions and notations in the previous subsections, we use them here to restate the Theorem 1 from the subsection "The plasticity rules minimize the KL divergence through Expectation Maximization" in more technical terms.
Theorem 1 ‫ء‬ . The infinite sequence ͑n͒ in Equation 30 converges with probability 1 to the set of local minima of ᐁ͑͒ in Equation 26 subject to the normalization constraints (Eq. 6).
Proof: We define a "mean limit" ordinary differential equation (ODE) that corresponds to the sequence ͑n͒ as ϭ h ͑͒, where h ͑͒ ϭ ‫ޅ‬ p‫͑ء‬x, z, a; ͒ ͓⌬͔. Here ⌬ is defined with the plasticity rules in Table 2. As for all n the parameters ͑n͒ remain in the bounded set D͑͒, it follows that sup n ‫ޅ‬ p‫͑ء‬x, z, a; ͒ ͓ ԽԽ ⌬ ԽԽ ͔ Ͻ ϱ and sup n ‫ޅ‬ p‫͑ء‬x, z, a; ͒ ͓͑⌬͒ 2 ͔ Ͻ ϱ. In order to prove convergence, we use the stochastic convergence Theorem 3.1 in Chapter 5 from Kushner and Yin, (2003) which states that, under the above assumptions, the sequence ͑0͒, ͑1͒, . . . , ͑n͒, . . . from Equation 30 converges with probability 1 to the stable points of the limit set of the mean limit ODE for all initial conditions ͑0͒ʦD͑͒. The limit set of the mean limit ODE with respect to a set of the initial conditions ͑0͒ʦA, which we denote as F(A), is defined as: Convergence to the limit set F(A) with probability 1 means that for all stochastic realizations of the sequence ͑n͒ it holds that: The main part of the proof will be to show that F͑D͑͒͒ ϭ G, ie, that the limit set of the ODE with the set of initial values D͑͒ is equal to the critical points of the Lagrangian ⌳͑, ͒, which are also critical points of ͑͒. We will show that in two steps. First, we will show that all trajectories of the ODE asymptotically converge to the set C where the normalization constraints (Eq. 6) are satisfied. Then in the second step, we will show that all trajectories with initial conditions ͑0͒ʦC converge to G.
Let c 0 ͑͒ and c ij l ͑͒ be the functions that express how much the sums in the normalization constraints (Eq. 6) deviate from 1, ie: For all points in the set C , where the constraints are satisfied, we have c 0 ͑͒ ϭ 0 and c ij l ͑͒ ϭ 0 (for all l 0 and all i and j). We now analyze how the values of c 0 ͑͒ and c ij l ͑͒ change when progresses along a solution trajectory of the mean limit ODE. For that purpose we calculate the dot products between h ͑͒ and the gradients of c 0 ͑͒ and c ij l ͑͒. The non-zero components of the gradient of c 0 ͑͒ are: and becomes equal to zero for the set of stationary points ϭ 0 which is the same as the set G. Hence, the ODE trajectories do not have any limit cycles and always converge to the critical points of the Lagrangian G, ie, F͑D͑͒പC ͒ ϭ G. Together with the conclusion (Eq. 69) of the first step, we finally obtain that the limit set of the ODE is: According to the stochastic convergence Theorem 3.1 in Chapter 5 from Kushner and Yin (2003), the infinite sequence ͑n͒ converges to the stable points of the limit set G. Stochastic fluctuations would drive the sequence to escape from the unstable points in the limit set. We have shown in Equation 70 that ͑͒ increases along an ODE trajectory that starts within ʦC . This implies that also ᏸ͑͒ increases as ͑͒ ϭ ᏸ͑͒ for ʦC . Thus, it follows that the stable points of the ODE in G are the local constrained maxima of ᏸ͑͒, subject to the normalization constraints, which concludes the proof.
By using results from the proof, we can now more precisely relate the update of the synaptic weights and biases by the plasticity rules to the maximization step of the EM algorithm, as discussed in subsection "The plasticity rules minimize the KL divergence through Expectation Maximization." In the WTA circuit the plasticity rules update the weights and biases online, after each spike of a neuron in ␣. If we instead consider an offline update where the updates are averaged over many spikes of the ␣ neurons (theoretically infinitely many), and then afterward they are applied to the weights and biases, then this offline update of the parameter vector in the limit of infinite number of spikes would be equal to: The vector h ͑͒ is the gradient of the ODE trajectory, which means that for a sufficiently small learning rate the offline mean update will always increase the expected log-likelihood ᏸ͑͒ provided that ʦC . This follows from the fact that ᏸ͑͒ increases along the ODE trajectory as we have shown in the proof (Equation 70 and the text afterward). Furthermore, the off-line mean update increases ᏸ͑͒ by calculating an expectation over the complete data samples from the distribution (Eq. 29) which represents the expectation step of the EM algorithm. In the subsection "The plasticity rules minimize the KL divergence through Expectation Maximization," we already showed that the spikes of the neurons ␣ together with the presented examples represent the complete data samples drawn from Equation 29. This clarifies that the offline mean update of the plasticity rules indeed implements the maximization step of the generalized EM algorithm as defined in Equations 42 and 40. From this, it follows that the online plasticity rules implement an online stochastic approximation version of the maximization step of EM.
The proof of Theorem 1 (reformulated in Theorem 1 ‫ء‬ in this subsection) completes the theory behind the learning mechanisms and learning capabilities of the learning module described in Results, section "A network module for learning stochastic associations." In particular, it rigorously shows that the learning module learns, via the EM algorithm, an internal model of the probabilistic relations between a set of variables x and another variable z in the presented examples.

Theoretical properties of networks of recursively interconnected basic learning modules
This section contains additional details on the learning approach with networks of recursively connected learning modules, that we described in Results, section "Recursive combinations of the basic learning module enable efficient learning of complex distributions from examples." After presenting the learning procedure for such networks, we will formulate in this section Theorem 2 which contains the theoretical basis of learning with networks of learning modules. Theorem 2 is derived from Theorem 1 (see section "The plasticity rules minimize the KL divergence through Expectation Maximization"), and shows that the plasticity mechanisms in a network of modules minimize another objective function that pertains to the represented probabilistic model in the whole network. This minimization then leads to learning an internal model of the stochastic associations between all variables y k of the presented examples.
The population of neurons that encodes the values of the RV y k (t) over time t we denote with k (these are the output neurons of the learning module ᏺ k ). For each non-zero value l of y k , there is a neuron kl whose firing signals this value. In particular, a spike of the neuron kl at time t sets the value of y k to y k ϭ l for a time period of duration , after which the value changes to y k ϭ 0. To ensure a valid value of y k , no other neuron in k should fire in the time interval [t, t ϩ ). This is ensured by the lateral inhibition between the ␣ k neurons, which drive the k neurons to fire.
The network learns from presented examples ỹ ͑0͒, ỹ ͑1͒, . . . , ỹ ͑n͒, . . . drawn from the target probability distribution p ‫ء‬ ͑y͒. In the examples the RV y k assumes an integer value drawn from the set ͕1, 2, . . . , M͑y k ͖͒. An example is presented to the network in form of injected currents in the neurons. These injected currents are assumed to originate from external neurons. The neurons in k are driven by the injected currents such that their firing reflects correctly the values of the RVs y k in the current exampleỹ͑n͒. More precisely, for ỹ k ͑n͒ ϭ l the neuron kl receives strong positive current and fires with a high firing rate, whereas the other neurons in the population k receive a strong negative current, which prevents them from firing. The neurons ␣ also have currents injected during learning, that give information about the current example, exactly in the same way as explained for a single learning module (see section "Theoretical properties of the basic learning module (stochastic association module) and its plasticity"), where the presented example ͗x ͑n͒, z ͑n͒͘ is here for the learning module ᏺ k equal to ͗ỹ B͑k͒ ͑n͒, ỹ k ͑n͒͘ (Fig. 5).
The mathematical symbols that are used in the definition of the neural network are listed in Table 3.
As we stated in the section "Theoretical properties of the basic learning module (stochastic association module) and its plasticity," only the non-zero values of z represent meaningful states of the external environment and are learned.
The probability of the zero value p͑z ϭ 0 Խ x; ͒ is not learned, and converges always to a very small value during learning. This can be easily seen if we express p͑z ϭ 0 Խ x; ͒ through the firing probability densities j l ͑x͒ of the ␣ neurons given input x: From the inequality j l ͑x͒ ϭ 1 exp ͑u j l ͑x͒͒ Ն 1 exp ͑b j l ͒ it follows that: If we substitute Equation 14 for the biases and use bϭ ϪV, which follows from Equations 12 and 15, we obtain: According to the proof of Theorem 1 ‫ء‬ (Eq. 67), during learning the plasticity rules drive the synaptic weights and biases of the ␣ neurons toward satisfying the normalization constraints in Equation 6, which simplifies Equation 75 to p͑z ϭ 0 Խ x; ͒ Յ exp ͑ Ϫ V͒. As V is a very large positive constant (Eq. 12 and accompanying text), we finally arrive at: Hence, we have shown that after learning, the sum of the firing probabilities of the output neurons is always very high (when inhibition is not active), regardless of the current input x. Similarly, in case of a network of learning modules, after learning, the output k neurons in each module in the network are in strong competition and In order to show that if each learning module ᏺ k learns an approximation of p ‫ء‬ ͑y k , y B͑k͒ ͒ in its internal model p k ͑y k , y B͑k͒ ; k ͒ then the stationary distribution of the whole network is close to p ‫ء‬ ͑y͒, we use as a theoretical basis the neural computability condition (NCC) from Pecevski et al., (2011) andBuesing et al. (2011). The NCC has been Table 3

Symbols related to the RVs in the target probability distribution y
Vector of all multinomial RVs ͑y 1 , . . . , y K ͒ from the target probability distribution p ‫ء‬ ͑y͒ y k k-th multinomial RV from the vector y M͑y k ͒ The maximum integer value the multinomial RV y k can assume p ‫ء‬ ͑y͒ Target probability distribution learned by the neural network y B͑k͒ Vector of the RVs from y that are in the Markov blanket of the RV y k ᏺ k The learning module in the network that approximates the r.h.s of the NCC in Equation 77 for the neurons k encoding the RV y k Output (and input) neurons of the modules ᏺ k and their associated RVs k Population of neurons in the network that together encode the value of the RV y k through population coding kl Neuron in k whose firing signals the value l of the RV y k y kl Binary RV that assumes value 1 if and only if y k ϭ l; it corresponds to the coding property of the neuron kl .
Other symbols k Superscript index used to indicate that the symbol describes an element of the learning module ᏺ k ␣ k , ␣ kl , ␣ j kl , J kl , a j kl , a kl , a k Symbols for the neurons in ␣ and their associated RV have the same meaning as the symbols for a single learning module in Table 1; the additional superscript index k indicates that the element belongs to the module ᏺ k . b j kl , w im, j kl , ŵ im, j kl , b j kl , k Synaptic weights, biases and the parameters of the generative models have the same meaning as the symbols for a single learning module in Table 1; the additional superscript k identifies that the element belongs to the module ᏺ k . p k ͑y B͑k͒ , y k ; k ͒ Probability distribution of the mixture generative model implicitly represented in the module ᏺ k Vector of union of all parameters in the neural network ϭ ͑ 1 , . . . , k ͒ from all generative models (corresponding to the learning modules ᏺ k ) et al. (2011) and Pecevski et al. (2011). The parameter that defines the encoding of values of the RVs by the spikes was equal to ϭ 15 ms. The neurons had absolute refractory period of duration . The lateral inhibition in the WTA circuits was implemented with a population of five inhibitory neurons, where all neurons in the population ␣ connect to these inhibitory neurons with a synaptic weight w e2i ϭ 80. The bias of the inhibitory neurons in the lateral inhibition was set to b inh ϭ Ϫ10, and the bias of all neurons corresponding to the RVs y was also set to b pp ϭ Ϫ10. The weights of the strong input synaptic connections to each neuron kl , that originated from its corresponding group of neurons in ␣ kl that drive it to fire, were set to w pp ϭ 20. During learning, the evolution of all synaptic weights was confined within a range [w min , w max ], where w min ϭ 0, w max ϭ 4 for the synapses of the ␣ neurons in the learning modules ᏺ 1 , ᏺ 2 and ᏺ 3 , and w max ϭ 2 for the synapses of the ␣ neurons of module ᏺ 4 . Synaptic weight changes induced by the plasticity rule that would go below w min , or above w max , were clipped to w min , or w max , respectively. At the beginning of each probabilistic inference simulation, the state of the network (defined by the RV values y(t)) was initialized according to a random state drawn from a uniform distribution over all possible initial network states. This was done by injecting very short current pulses in the neurons in the populations k . During the probabilistic inference simulations, evidence for the known values of the RVs was given to the network by injecting external currents in the k neurons of the known RVs. The currents were injected in exactly the same way and with exactly the same amplitudes as during the presentation of the examples in the learning process (see below).
During learning, the presented examples were generated from the Bayesian network in Figure 6B, with the values for the conditional probabilities given in Table 4. The prior probabilities p͑y 1 ϭ 2͒ and p͑y 2 ϭ 2͒ were both equal to 0.5.
During the presentation of the examples, the neuron kl had injected a strong positive current I ppϩ ϭ 30 if y k ϭ l in the example, or a strong negative current I pp-ϭ Ϫ30 if in the example y k l. Additionally, if in the current example y k ϭ l, then a strong negative external current with amplitude I F Ϫ ϭ Ϫ80 was injected in each of the neurons in the population ␣ k that are not in the subpopulation ␣ kl , whereas the neurons in ␣ kl did not receive any external current.
Each subpopulation ␣ kl in every learning module in the network consisted of two neurons. All excitatory neurons in the network had an alpha shaped EPSP defined by the kernel: Here ⑀ 0 ϭ 2.8 is a scaling factor, t 1 and t 2 and are the points in time where the basic alpha kernel of the form e·t·exp ͑Ϫt͒ is equal to 1 2 , and ␣ ϭ 8.5 ms is the time constant of the alpha kernel. The same shape of the EPSPs was also used by Pecevski et al. (2011). After an incoming spike at the synapse at time t 1 f , the time course of the EPSP is equal to w⑀ ␣ ͑t Ϫ t 1 f ͒ (w is the synaptic efficacy) until the next spike at t 2 f after which it is set to w⑀ ␣ ͑t Ϫ t 2 f ͒. The PSPs at the synapses connecting from, and to, the inhibitory neurons of the lateral inhibition had a rectangular shape with duration of ϭ 15 ms. The rectangular IPSPs of the inhibitory neurons approximate the effect that fast-spiking bursting inhibitory neurons with short duration IPSPs would have on the membrane potential of the neurons ␣. The synaptic weights of the connections from the inhibitory neurons in the lateral inhibition to the neurons ␣ were all equal to w i2e ϭ Ϫ7.
Before the start of learning, the synaptic weights of all neurons in ␣ were initialized randomly from a Gaussian distribution with mean w init ϭ w max /3 and SD w0 ϭ 0.1. If the randomly drawn value was not in the interval [w min , w max ], then it was redrawn again until the new value was in the interval (the same type of redrawing was done for the initial values of the biases). The initial biases of the ␣ neurons were randomly drawn from a Gaussian distribution with mean b init ϭ 5 and SD b0 ϭ 0.1. The evolution of the biases of the neurons in ␣ was restricted to be in the range [b min , b max ] with b min ϭ -30 and b max ϭ 5. If the intrinsic plasticity changed the bias above b max or below b min , the bias was clipped to b max or b min , respectively. In the learning rules we introduced an additional scaling factor T ϭ 0.58 in the potentiation part. In particular, the potentiation part of the learning rule (Eq. 1) for the weights had the form ⌬w ϭ e ϪT͑wϩw Ϫ ͒ Ϫ 1, and the potentiation part of the learning rule (Eq. 2) for the biases had form ⌬b ϭ e ϪT͑bϩb Ϫ ͒ .
The learning process lasted 1200 s of biological time. During the first 600 s of learning, the learning rate for the synaptic weights was decreasing linearly from ϭ 0.002 at t ϭ 0 to ϭ 0.0006 at t ϭ 600 s. In the second 600 s of the learning, after t ϭ 600 s, the synaptic plasticity was not active, ie, the learning rate was set to ϭ 0. The learning rate for the biases during the first 600 s was constant and equal to Ј ϭ 0.01. In the second part of learning, after t ϭ 600, it had another constant value equal to Ј ϭ 0.02.
The offset parameter in the synaptic plasticity rule was equal to w Ϫ ϭ 2.5 log ͑0.2͒, whereas the offset parameter of the intrinsic plasticity in the neuron ␣ j kl was: where Խ B͑k͒ Խ is the number of RVs in the Markov blanket of y k , I B(k) is the set of indices of the variables y i that are in the Markov blanket of y k , T is the scaling factor in the Table 4. Values for the conditional probabilities in the Bayesian network in Figure 6B used to generate examples for learning in Example 2 p ‫ء‬ ͑y 3 ϭ 2 Խ y 1 ϭ 1, y 2 ͒ p ‫ء‬ ͑y 3 ϭ 2 Խ y 1 ϭ 2, y 2 ͒ p ‫ء‬ ͑y 4 ϭ 2 Խ y 2 ͒ y 2 ϭ 1 0.13 0.87 0.13 y 2 ϭ 2 0.87 0.13 0.87 learning rule (see above) and M(y i ) denotes the largest value of the RV y i . As the neurons ␣ k in different learning modules have different number of input synapses, the bias offset parameter was set for each neuron to counterbalance appropriately the average total input it receives from the input synapses.
In Figure 7B, the frequency of the network states was calculated from the spontaneous activity of the neural network after learning, simulated for 20 s biological time. The neural network was initialized to be in a random network state at t ϭ 0. In Figure 8C, the KL divergence was calculated at 1 min time intervals, ie, at time points t ϭ 60i seconds (i ϭ 0, . . . ,19). At each time point, the distribution p͑y; ͒ was estimated by simulating the neural network with the values of the synaptic weights and biases of the neurons at the particular time point. The network was simulated for 1500 s, and the estimated probabilities were calculated from the network states in the time interval tʦ͓100s, 1500s͔ of the simulation. In Figure 8 Panels, A, B, and D, the distributions p k ͑y k , y B͑k͒ ; k ͒ and p k ͑y k Խ y B͑k͒ ; k ͒ of the learning modules were calculated analytically by using Equation 5, and then marginalizing ␣.

Details to the computer simulations in Example 1
The target probability distribution p ‫ء‬ ͑x 1 , x 2 , z͒ from which examples were generated is given in Table 5. The scaling factor T in the learning rules (see section "Details to the computer simulation in Example 2") was set to T ϭ 0.4. The maximum weight value was in this example w max ϭ 5. Before learning, the synaptic weights of all neurons in ␣ were initialized randomly from a Gaussian distribution with mean w init ϭ 3.0 and SD w0 ϭ 0.1. All other parameters in this example that define the learning module, as well as the learning process, were the same as the parameters of the learning modules in the computer simulation in Example 2 (see section "Details to the computer simulation in Example 2").

Discussion
Numerous models for probabilistic inference in networks of neurons have been proposed, and many models for the impact of learning on network computations have been proposed. However, surprisingly, these two lines of research have so far (with a few exceptions that are discussed below) not been brought together. We propose in this article a new model for learning and memory organization in recurrent networks of spiking neurons that makes the information that is gathered from numerous experiences immediately available for complex and unforeseen memory retrievals in the form of probabilistic inference. The network is able to perform such probabi-listic inference just through its inherent stochastic dynamics. More precisely, we assume that the network receives samples (ϭexamples) from some unknown multivariate distribution p ‫ء‬ , and show that a suitably structured recurrent network of spiking neurons is able to build through STDP and intrinsic plasticity of the excitability of neurons an internal model for p ‫ء‬ in such a way, that it can answer a diverse repertoire of probabilistic inference queries about the stored knowledge through sampling. Early models for memory storage and retrieval in recurrent networks of artificial neurons (Steinbuch, 1961;Hopfield, 1982Hopfield, , 1984 had focused on the storage of isolated memory items in attractors of a deterministic network dynamics. The only memory queries that are considered in these models are input completion tasks, or finding the most similar memory item to the input pattern. Answering queries that require the combination of several stored memory items is virtually impossible in such model, especially because the stored memory items were required to be scrambled (orthogonalized) before they were committed to the network. This is inconsistent with experimental data on human memory, which require a substantially more structured memory organization (Stickgold and Walker, 2013).
We have shown in this article, that very simple network motifs (Fig. 2) provide learning modules, which can extract probabilistic relationships between random variables from examples, simply by applying STDP and intrinsic plasticity. One feature of these network motifs is that different groups of neurons on layer 2 project to different neurons on layer 3. Such nonconvergent synaptic connections are difficult to identify with current experimental methods, but have already been found in the mouse cortex (Chen et al., 2013), where largely nonoverlapping populations of pyramidal cells in layer 2/3 of area S1 project to areas S2 and M1. Somewhat similar fine-scale connectivity patterns had previously been found within a cortical column of rat visual cortex (Yoshimura et al., 2005;Kampa et al., 2006). It remains to be tested whether these fine-scale network structures in the brain support the learning of stochastic associations as predicted by our model.
These simple stochastic association modules can be recursively combined (Figs. 4,5), and are then able to learn also complex stochastic relationships, including higher order moments (eg, explaining away), from examples. We have demonstrated this for a well-known visual inference task from Knill and Kersten (1991), which is known to require explaining away (Figs. 6-9), but where it has been an open question whether brains could in principle acquire this capability through learning. Furthermore, we have shown that this impressive learning capability of recurrent networks of spiking neurons can be understood for simple rules for STDP and intrinsic plasticity of neurons on the basis of a rigorous mathematical theory (Expectation Maximization).
The representations of statistical knowledge in networks of spiking neurons that are shown here to result from learning are structurally very similar to previously proposed ones by Pecevski et al. (2011) that were based Table 5. The target probability distribution p ‫ء‬ ͑x 1 , x 2 , z͒ in Example 1 p ‫ء‬ ͑x 1 , x 2 , z ϭ 1͒ p ‫ء‬ ͑x 1 , x 2 , z ϭ 2͒ x 1 ϭ 1, x 2 ϭ 1 0.04 0.04 x 1 ϭ 1, x 2 ϭ 2 0.21 0.21 x 1 ϭ 2, x 2 ϭ 1 0.04 0.21 x 1 ϭ 2, x 2 ϭ 2 0.21 0.04 on construction instead of learning, with one important difference: whereas the constructed networks required very large numbers of auxiliary neurons for representing random variables with larger Markov blankets, we show here that similar networks but with realistic numbers of hidden (auxiliary) neurons can still provide good approximations. These smaller modules automatically learn to make optimal use of the number of auxiliary neurons that are available to them. They learn automatically to approximate complex stochastic associations between random variables that emerge from examples, with a mixture distribution whose maximal number of components (modes) fits to the available number of auxiliary neurons. We have demonstrated this important feature of our learning approach in Figure 3 (compare panels B,D and C,E) and in our model for explaining away in Figure 6 (where the number of auxiliary neurons in the learning modules ( Fig.  6D) is smaller than the number required by the construction of Pecevski et al. (2011)). One other interesting general feature of our model is that it shows how noise in networks of neurons can be beneficial. In fact, noise provides here a necessary ingredient both for the self-organization of the network through STDP, and for the use of learnt probabilistic relationships for probabilistic inference via MCMC sampling (Habenschuss et al., 2013a). The necessary stochasticity in the neurons can come from different sources like for example the unreliable neurotransmitter release from the vesicles at the presynaptic site, or the stochastic closing and opening of membrane ion channels (Faisal et al., 2008).
Finally, we would like to point out that our network model produces answers to probabilistic inference queries ( Fig. 9) in the form of firing rates, which is obviously useful for communicating such answers to downstream networks. Internally however, the network works with a spike-based encoding of network states, where every spike has an impact on the network state (Fig. 7).
Our model and theory has identified concrete plasticity mechanisms and network architectures that would enable networks of neurons in the brain to build probabilistic internal models for their stochastic environment. We have focused here on an idealized model in order to keep it theoretically tractable. Further work will have to explore to what extent similar learning phenomena arise in more complex neural network models that sacrifice theoretical tractability for additional biological details.

Related work
There are several studies that propose neural implementations of probabilistic inference in general graphical models where, as in our approach, the present independencies of the distribution are directly exploited for reducing the complexity of the neural network structure. The majority of these base their implementations on the loopy belief propagation algorithm (Rao, 2006;Steimer et al., 2009;Siegelmann and Holzman, 2010;Litvak and Ullman, 2009). Except for Siegelmann and Holzman (2010), to the best of our knowledge, none of these studies proposes a way how these neural structures could emerge through learning from examples. The model of Siegelmann and Holzman (2010) is formulated on a more abstract level. It is based on the observation that belief propagation (message passing) requires only three arithmetical operations: summation, multiplication, and division (for normalization). Their network model is based on symbolic computational units (interpreted as multicompartment neurons) that carry out these arithmetical operations on real numbers. The resulting real numbers are interpreted as probabilities or messages that are sent to other units during belief propagation. Estimates of conditional probability tables are extracted from examples through online accumulators, assumed to be implemented as plasticity of the weight of a dendritic branch that represents a specific value assignment for a set of random variables (more precisely, for the Markov blanket of a random variable). Hence, this learning model is not based on synaptic plasticity, but rather on a plasticity mechanism that changes the weight of a whole dendritic branch. We are not aware of an attempt to implement this approach with spiking neurons, or with more local plasticity rules.
An alternative framework for probabilistic inference in neural circuits developed by Ma et al. (2006) and Beck et al. (2008Beck et al. ( , 2011Beck et al. ( , 2012 is based on representation of probability distributions in probabilistic population codes. To the best of our knowledge, the question of how the neural implementations in those studies can emerge through learning with local plasticity mechanisms has so far not been addressed. We have focused in this paper on the task of learning time-invariant distributions p ‫ء‬ over static patterns. Complementary to this, in several studies (Deneve, 2008;Rezende et al., 2011;Brea et al., 2013;Kappel et al., 2014) the authors developed neural network models for learning time-varying distributions, restricted to dynamical Bayesian networks that do not have dependencies between the RVs in the same time step, which simplifies the learning. In these models, the network learns to reproduce sequences of patterns, by developing latent representations as a memory about the recent history of patterns, and learning the stochastic transitions between the patterns in the sequence. In contrast to this, the neural networks in our approach learn a probability distribution of static patterns as their stationary distribution, where the distribution can contain arbitrary dependencies between the random variables without any restrictions.
The problem of learning a probability distribution from examples has been well studied in the artificial neural network community. The Boltzmann machine is one of the earliest developed neural networks that can learn general probability distributions (Ackley et al., 1985). The learning in the Boltzmann machine is however difficult for learning higher-dimensional probability distributions with a larger number of RVs (Hinton, 2002;Carreira-Perpinan and Hinton, 2005). Building on the Boltzmann machine idea, more recently developed deep belief networks have considerably improved the efficiency and scalability of learning by using a Boltzmann machine with a two-layer bipartite graphical model structure, called restricted Boltzmann machine, which is easier to train than the general Boltzmann machine (Hinton et al., 2006). Deep belief networks learn multilayer latent representations of the input data in a generative model by training layer-by-layer restricted Boltzmann machines. As in our approach, the learned probability distribution is embodied as a stationary distribution of the network states. However, the structure of Boltzmann machines and deep belief networks exhibits symmetric weights between the neurons which are not found in biological networks of neurons, and also the biological plausibility of layerwise training and the learning rules in these networks is not clear.
These approaches derive their learning rules from the maximum likelihood learning principle. Our learning approach for network of interconnected learning modules, on the other hand, is more reminiscent to maximizing the pseudo-likelihood (Besag, 1975), an alternative parameter estimation method in statistical learning, where the objective function can be formulated through the Kullback-Leibler divergences between the target and the model conditional distributions of one variable given the rest of the RVs. The difference is that in the pseudo-likelihood objective function the conditional distributions are all derived from a single probabilistic model, whereas in our approach each learning module learns a separate generative model of a marginal of the target probability distribution.

Experimentally testable predictions of our model and other questions for further research
Our model makes significantly different predictions compared with previous models for memory in neural networks with regard to the architecture of the neural networks involved. Models based on Hopfield networks and Boltzmann machines predict that memories are stored in homogeneous networks of only excitatory neurons with symmetric synaptic connections. In contrast, our model predicts that memories are stored in large assemblies of highly structured generic microcircuit motifs, each consisting of excitatory and inhibitory neurons. Our model gives rise to a concrete hypothesis, why different species have different learning capabilities, which cannot be explained in terms of the different numbers of neurons in their brains. It proposes that the structure or structural predisposition of interconnections between neurons is an important factor for learning performance. In particular, it predicts that the superior learning capabilities result from a genetically more precisely structured interaction of excitatory and inhibitory neurons. Our model of a probabilistic learning module (Fig. 2) proposes in fact two different functional roles of inhibitory neurons: lateral inhibition among excitatory neurons that induces each of them to specialize on different presynaptic firing patterns, and another type of inhibition that prevents hidden neurons that learn a probabilistic relationship between random variables ͗x, z͘ for a specific value z ϭ l, to engage plasticity for examples ͗x, z͘ with z l. This architecture provides concrete hypotheses for the analysis of data on the role of inhibitory neurons in the organization of plasticity (Caroni, 2015;Letzkus et al., 2015). Furthermore, our model for an atomic probabilistic learning module predicts that pyramidal cells within a network can represent probabilistic, rather than only deterministic relationships. This can in principle be tested experimentally by exciting (eg, through optogenetic methods) a subset A of these neurons, and observe the resulting firing probability of pyramidal cells B. Furthermore our model predicts that this firing probability of neurons in B can be changed upward through trials where both neurons in A and B are made to fire, and changed downward through trials where neurons in A are made to fire and neurons in B are inhibited.
In contrast to preceding memory models, our model does not predict that synaptic connections between neurons are in general symmetric. This would actually be impossible for synaptic connections between excitatory and inhibitory neurons. But also for synaptic connections between pyramidal cells in the cortex a symmetry of synaptic weights between them is not really consistent with the currently available experimental data (Song et al., 2005;Haeusler et al., 2009, their Fig. 4). With regard to the plasticity mechanisms involved, our model points to an essential contribution of intrinsic plasticity of the excitability of pyramidal cells for memory formation (Mozzachiodi and Byrne, 2010).
An interesting functional prediction of our model is that memory recall cannot only take the form of input completion (like in a Hopfield network), but can engage the full power of probabilistic inference. Furthermore, it proposes that this inference is implemented through the inherent stochastic dynamics of networks of neurons in the brain. Such implementation has previously already been proposed on the basis of data from cognitive science Vul and Pashler, 2008). Furthermore our model predicts that memory recall and imagination of possible scenarios are closely related brain computations, that engage similar network mechanisms. This prediction appears to be consistent with recent experimental data, which suggest that memory recall and imagination/fabulation engage largely the same brain systems (Schacter, 2002). Finally, our model predicts that the development of suitable brain networks that store basic insights about typical causal roles and dependencies of the objects and phenomena we encounter is essential for learning. This prediction is in line with experimental results from cognitive science, which argue that such basic knowledge about dependencies in the real world is already known to 2-year-old children (Landau et al., 1988). Because our learning approach emphasizes the role of probabilistic inference, it also provides a theoretical framework for integrating innate or previously learnt knowledge in the form of priors.
The learning paradigm that we have presented was designed to provide an alternative to other neural network models for higher-level memory. Such higher-level memory system in the brain receives high-dimensional inputs y from numerous brain areas, in particular also higher-level features that are extracted from sensory inputs by other learning systems.
It is at this point an open question to what extent the proposed model for learning in networks of spiking neu-rons can also provide insight into the organization of lower level learning systems in the brain, for example in the visual system. The underlying mathematical approach is quite general, and guarantees convergence to an internal model for any external distribution p ‫ء‬ over discrete RVs that generates the examples y that are presented to the network. How good this internal model will become is related to the question how well p ‫ء‬ can be approximated by a Bayesian network with small degrees of nodes, or more generally, by a simplified distribution where all RVs have small Markov blankets. However, a precise answer to this question is even more difficult, because our model is in principle also able to approximate some distributions with large Markov blankets, see the remarks at the end of Results section "Small numbers of hidden neurons in the learning modules often suffice." Numerical tests for a number of practically relevant distributions p ‫ء‬ are likely to provide further insight into this question.
In principle, it is also possible to boost the learning capability of our approach by stacking multiple copies of the learning network. The ␣ neurons in the network learn to encode salient combinations of values of different RVs, similarly as feature detectors on the first hidden layer of a deep learning network. Hence, it would make sense to send the output of these ␣ neurons also as input to a version of the same type of learning network on a second level. One would then expect that the second level network learns in the same unsupervised manner to detect and represent salient combinations of values in these ␣ neurons. The analysis of the performance of such a stacked learning architecture based on spiking neurons and STDP is a topic for future research.

Summary
Altogether, we have shown that some forms of probabilistic inference can be learnt through STDP, even in cases where the nontrivial "explaining away" effect occurs . We propose, that this new paradigm for network learning provides an alternative to previous models for associative learning that were based on learning categorical rather than probabilistic associations. In addition, in contrast to earlier memory models, this new model is compatible with basic properties of biological networks of neurons, such as spikes, trial-to-trial variability, stereotypical microcircuit motifs, and STDP.