THE LOW-DIMENSIONAL MANIFOLD HYPOTHESIS
Before starting this long journey, it is essential to spend a few pages discussing the low-dimensional manifold hypothesis, which encapsulates the key motivations behind this thesis. Although the motivation for this work comes from computational neuroscience, we also discuss the emergence of this hypothesis in the context of machine learning, in order to emphasize its generality and importance.
The low-dimensional manifold hypothesis states that real-world high-dimensional data may lie on low-dimensional manifolds embedded within the high-dimensional space. This definition may seem complicated at first sight, but its meaning is in fact very intuitive. In the following we introduce the hypothesis first in machine learning and then in computational neuroscience, so as to ground this abstract statement in concrete examples.
in machine learning
Efficiently representing and interpreting noisy high-dimensional data is an issue of growing importance in modern machine learning. A common procedure consists of searching for representations of the data in spaces of (much) lower dimension, an approach known as manifold learning [32, 53, 118, 146].
The key assumption of manifold learning is the low-dimensional manifold hypothesis, which states that although many data are a priori high-dimensional, they are intrinsically of much lower dimension. Several arguments support this hypothesis, for example:
- low dimensionality can result from physical transformations such as translation, rotation, change of scale and so on. If we take a real image, whose dimensionality (the number of pixels making up the photo) is usually very high, and apply the above-mentioned transformations, all the variants of the starting photo are linked together by a number of parameters much smaller than the number of pixels, each parameter corresponding to a particular transformation. The different photos therefore live in a space of dimensionality much lower than the a priori very high number of pixels, see Fig. 1(a);
- moreover, for a classic machine learning data-set like MNIST, see Fig. 1(b), it is possible to estimate the intrinsic dimension of its digits, that is, the number of variables needed in a minimal representation of the data, by counting the local transformations required to convert a digit into another one of its variants. This number turns out to be $\simeq 10$, much less than the number of pixels of the MNIST images, equal to 784.
The same goes for a more complicated example. Consider a data-set consisting of photos of a person's face in different poses, Fig. 1(c), where each picture is made of, say, $1000 \times 1000$ pixels. This data-set is clearly a very small subset of all possible colored pictures, each of which is defined by a $3 \times 10^6$-dimensional vector. The reason is that, for a given face, there are only 50 varying degrees of freedom (the positions of all muscles), a very small number compared to $10^6$. Hence, all data points lie on a (non-linear) manifold of very low dimension compared to that of the initial pixel space.
We can therefore generally conclude that even though data may be high-dimensional, very often the number of relevant dimensions is much smaller.
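The image-translation argument above can be checked numerically on a toy data-set. The sketch below is only an illustration under stated assumptions: the "photos" are synthetic Gaussian bumps on a $32 \times 32$ grid (1024 pixels) that differ only by a two-parameter translation, and the intrinsic dimension is estimated with a nearest-neighbour distance-ratio maximum-likelihood estimate. The function names `bump_images` and `two_nn_dimension` are hypothetical, and this estimator is not the procedure behind the MNIST figure quoted above.

```python
import numpy as np


def bump_images(n_images=1000, size=32, width=3.0, seed=0):
    """Synthetic images of one Gaussian bump, translated at random (assumed toy data)."""
    rng = np.random.default_rng(seed)
    # Keep the bump centre away from the borders so the bump is barely truncated.
    centers = rng.uniform(size / 4, 3 * size / 4, size=(n_images, 2))
    yy, xx = np.mgrid[0:size, 0:size]
    dx = xx[None, :, :] - centers[:, 0, None, None]
    dy = yy[None, :, :] - centers[:, 1, None, None]
    images = np.exp(-(dx**2 + dy**2) / (2.0 * width**2))
    return images.reshape(n_images, size * size)


def two_nn_dimension(X):
    """Intrinsic dimension from the ratio of second to first nearest-neighbour distance.

    For locally uniform data on a d-dimensional manifold the ratio mu = r2 / r1
    follows a Pareto law with exponent d, whose maximum-likelihood estimate is
    d = n / sum(log mu).
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared pairwise distances
    d2 = np.maximum(d2, 0.0)                          # guard against round-off
    np.fill_diagonal(d2, np.inf)                      # exclude self-distances
    nearest = np.partition(d2, 1, axis=1)[:, :2]      # two smallest per row
    log_mu = 0.5 * np.log(nearest[:, 1] / nearest[:, 0])
    return len(X) / np.sum(log_mu)


if __name__ == "__main__":
    X = bump_images()
    print("number of pixels:", X.shape[1])
    print("estimated intrinsic dimension:", two_nn_dimension(X))
```

Under these assumptions the estimate should come out close to 2, the number of translation parameters, rather than anywhere near the 1024 pixels.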
The low-dimensional manifold hypothesis explains (heuristically) why machine learning techniques are able to find useful features and produce accurate predictions from data-sets that have a potentially large number of dimensions (variables), bypassing the curse of dimensionality. Indeed, when the dimension of the data increases, the volume of the configuration space grows so fast (exponentially) that the available data become sparse, and this sparsity is problematic for any method that requires statistical significance. The fact that the data-set of interest actually lives in a space of low dimension means that a given machine learning model only needs to learn to focus on a few key features of the data-set to make decisions. However, these key features may turn out to be complicated functions of the original variables. Many of the algorithms behind machine learning techniques focus on ways to determine these (embedding) functions.
A simple application of this hypothesis comes from supervised learning [35, 96, 97], that is, the fitting of an input-output relation from examples, when neural networks are used to classify high-dimensional data: if the data actually live on manifolds of much smaller dimension, the task amounts to classifying the manifolds, see Fig. 1(a).
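The exponential growth of the configuration space can be made concrete with a small counting experiment. The following is a minimal sketch, assuming data drawn uniformly in the unit hypercube and a grid of 4 bins per axis; the function name `occupied_fraction` and all parameter values are illustrative choices, not taken from the text.

```python
import numpy as np


def occupied_fraction(n_samples, dim, bins=4, seed=0):
    """Fraction of grid cells of the unit hypercube that contain at least one sample."""
    rng = np.random.default_rng(seed)
    x = rng.random((n_samples, dim))            # uniform samples in [0, 1)^dim
    cells = np.floor(x * bins).astype(int)      # grid cell index of each sample
    n_occupied = len(np.unique(cells, axis=0))  # number of distinct occupied cells
    return n_occupied / bins**dim               # bins**dim cells in total


if __name__ == "__main__":
    for dim in (1, 2, 4, 6, 8, 10):
        print(dim, occupied_fraction(1000, dim))
```

With a fixed budget of 1000 samples, the number of cells grows as $4^{d}$, so at most a fraction $1000 / 4^{d}$ of them can be visited: the data become sparse as soon as $d$ exceeds a handful of dimensions.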
Figure 1 – (a) The set of variations associated with the image of a dog forms a continuous manifold of dimension much lower than that of the pixel space. Other object images, such as those corresponding to a cat in various poses, are represented by other manifolds in the same space. Figure adapted from ; (b) Examples of digits corresponding to the number two in the MNIST data-set. Figure taken from . (c) Pictures of a person with various facial expressions. They lie on a very low-dimensional manifold of the vector space of pictures with $1000 \times 1000$ pixels. Figure taken from .
in computational neuroscience
Low-dimensional representations of high-dimensional data are not restricted to machine learning, and are encountered in other fields, in particular computational neuroscience [1, 86, 250].
In a typical neuroscience experiment, one measures with electrodes the activity of neurons in a real neural network, see Fig. 2(a). Looking at the population of recorded neurons, one often finds that the activity of the network can be explained in terms of the relative activation of groups of neurons, called neural modes or cell assemblies, see Fig. 2(b). This means that even if the network activity is a priori high-dimensional, its trajectory in the space of neural configurations as a function of time is confined to a linear (or even non-linear) manifold of much smaller dimension $D \ll N$, where $N$ is the number of neurons that make up the network, see Fig. 2(c) and Fig. 2(d).
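The linear case of this picture can be illustrated with synthetic data. The sketch below assumes a population of $N = 50$ units whose (continuous) activity is a noisy linear combination of $D = 2$ neural modes driven by sinusoidal latent signals; these modelling choices and the function names `simulate_population` and `explained_variance_ratio` are assumptions made for illustration, not a description of any recording discussed here.

```python
import numpy as np


def simulate_population(n_neurons=50, n_modes=2, n_steps=500, noise=0.1, seed=0):
    """Activity of n_neurons units generated by n_modes latent neural modes plus noise."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 10.0, n_steps)
    freqs = 0.3 + 0.2 * np.arange(n_modes)                   # one frequency per mode
    latents = np.sin(2.0 * np.pi * freqs[None, :] * t[:, None])
    modes = rng.normal(size=(n_modes, n_neurons))            # weight of each mode on each neuron
    return latents @ modes + noise * rng.normal(size=(n_steps, n_neurons))


def explained_variance_ratio(activity):
    """Fraction of the total variance carried by each principal component."""
    centered = activity - activity.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    return variances / variances.sum()


if __name__ == "__main__":
    ratio = explained_variance_ratio(simulate_population())
    print("variance explained by the first two components:", ratio[:2].sum())
```

In this toy setting almost all of the variance is carried by the first $D$ principal components, which span the plane of the neural modes. A curved manifold, as in Fig. 2(d), would require a non-linear method instead of principal component analysis.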
It is now legitimate to ask what the dimensions of the manifold mean. One of the most credible hypotheses is that the collective coordinates generated by the neural network activity in the $D$-dimensional manifold encode sensory correlates, i.e., some external stimulus, such as the orientation of a bar presented to the retina [114, 115], and can be used, for example, by the motor cortex to make decisions and/or produce actions.
Also related to this is the fact that low-dimensional continuous attractors provide a paradigm for analog memories, in which the memory item is represented by an extended manifold, i.e., the cognitive map of a place in a certain context, see Section 3.3.
A crucial question to answer is therefore, as we will investigate in detail in the next
Figure 2 – (a) Neural modes as a generative model for population activity. The relative area of the blue/green regions in each neuron represents the relative magnitude of the contribution of each cell assembly to the neuron’s activity; (b) Spikes from three recorded neurons during task execution as a linear combination of two neural modes; (c) Trajectory of time-varying population activity in the neural space of the three recorded neurons (black). The trajectory is mostly confined to the neural manifold, a plane shown in gray and spanned by the neural modes u1 and u2; (d) A curved, nonlinear neural manifold, shown in blue. Figure adapted from .
RECURRENT NEURAL NETWORKS AND ATTRACTORS
Having presented in Chapter 2 the fundamental hypothesis on which this work is based, we can now introduce the subject of this thesis, i.e., recurrent neural networks and attractors, through a pedagogical illustration of a series of models introduced in the field of computational neuroscience. These models are always presented together with the experimental evidence that led to their formulation, and their connections with statistical physics are explained in detail; see Chapter 1 for a discussion of the link between statistical physics and computational neuroscience. The aim of this Chapter is therefore to place the work of this thesis in a precise context within the literature, stressing the problems of the current theory and hence the need for this work.
discrete and continuous attractors in recurrent neural networks
Let us start by defining in a pictorial way what a recurrent neural network (RNN) is, what an attractor is, and the difference between discrete and continuous attractors.
An RNN is a kind of non-linear dynamical system defined by a set of $N$ activity variables (neurons), indexed by $i = 1, \dots, N$, interconnected via pairwise synapses $\{W_{ij}\}$ (in the following we never consider self-connections, i.e., $W_{ii} = 0, \ \forall i$). Depending on the model, both neurons and synapses can take binary or continuous values, and may be subject to different biological constraints that we will discuss later, see Fig. 3(a).
In addition, the activity variables are updated over time, which can be taken as discrete or continuous as circumstances require, following a non-linear dynamics dictated by the connectivity matrix $W$ (and possibly by external fields), whose choice crucially determines the properties of the network.
The state of the RNN can therefore be represented by a point evolving in a very high-dimensional space, the space of neural configurations of dimension $N$. In particular, we will be interested in the trajectory of this point at long times, especially when the dynamics of the network settles into different fixed points depending on the initial condition of the neurons' activity variables: these fixed points are the celebrated attractors.
Depending on the choice of the synapses (and in the absence of external fields), we can have different scenarios for the structure of these fixed points:
- the fixed points (specific configurations of the network activity variables where the dynamics gets stuck) can be isolated from each other, each with its own basin of attraction, which determines, according to the initial condition, the fixed point to which the dynamics of the network converges: this is the case of discrete attractors, also called point attractors or 0-dimensional attractors, see Fig. 3(b);
- alternatively, the attractors, instead of being isolated points, can be composed of a continuous set of fixed points (manifolds) living in a $D$-dimensional space, where typically $D \ll N$: this is the case of continuous attractors, see Chapter 2. Here too, the same network can have several continuous attractors, each with its own basin of attraction; now, however, depending on the initial condition of the network within a basin, the dynamics of the RNN can come to rest at any point of the corresponding attractor, see Fig. 3(c). As discussed in Section 2.2, the important thing is to understand the physical meaning of the collective coordinate $r$, a $D$-dimensional vector that represents the state of the network on the continuous attractor.
It is important to note that the dynamics truly gets stuck at a fixed point only if it is noise-free (deterministic), i.e., zero-temperature Glauber dynamics; otherwise there are fluctuations around the fixed points that depend on the level of neural noise. In particular, as we will see later, at a suitable temperature the state of the network can spontaneously make transitions between one fixed point and another in the case of discrete attractors. The same is true for continuous attractors where, in addition to transitions between different attractors, the collective coordinate $r$ also undergoes diffusive dynamics on the attractors themselves.
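The role of the noise level can be illustrated with a small simulation. The sketch below makes several assumptions that are not fixed by the text: neurons are binary variables $\sigma_i = \pm 1$, updates are asynchronous, a neuron is set to $+1$ with probability $\frac{1}{2}\left[1 + \tanh(h_i / T)\right]$ where $h_i = \sum_j W_{ij} \sigma_j$ is its local field, and the couplings store a single random pattern so that one pair of fixed points is known in advance. The function names `glauber_sweep` and `mean_overlap` are hypothetical.

```python
import numpy as np


def glauber_sweep(W, sigma, T, rng):
    """One asynchronous sweep: every neuron is updated once, in random order."""
    for i in rng.permutation(len(sigma)):
        h = W[i] @ sigma                      # local field on neuron i (W[i, i] = 0)
        if T == 0:
            if h != 0:                        # deterministic rule: align with the field
                sigma[i] = 1 if h > 0 else -1
        else:
            p_up = 0.5 * (1.0 + np.tanh(h / T))
            sigma[i] = 1 if rng.random() < p_up else -1
    return sigma


def mean_overlap(T, n_neurons=200, n_sweeps=200, seed=0):
    """Average overlap with the stored pattern over the second half of the run."""
    rng = np.random.default_rng(seed)
    xi = rng.choice([-1, 1], size=n_neurons)  # stored pattern (assumed random)
    W = np.outer(xi, xi) / n_neurons          # couplings with a single attractor pair
    np.fill_diagonal(W, 0.0)                  # no self-connections
    sigma = xi.copy()                         # start exactly on the fixed point
    overlaps = []
    for _ in range(n_sweeps):
        glauber_sweep(W, sigma, T, rng)
        overlaps.append(xi @ sigma / n_neurons)
    return float(np.mean(overlaps[n_sweeps // 2:]))


if __name__ == "__main__":
    for T in (0.0, 0.5, 2.0):
        print(T, mean_overlap(T))
```

At $T = 0$ the state never leaves the fixed point, at moderate temperature it fluctuates around it, and at high temperature the overlap with the stored pattern is lost.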
Questions that we will answer in the following concern how to engineer the choice of synapses in order to build attractors with ad hoc (biological) properties and in particular what is the maximum number of attractors that can be stored in a recurrent network.
We will also place strong emphasis below on the experimental evidence (direct and indirect) for these mechanisms in the brain, especially in the context of memory, where memories correspond to the above-mentioned attractors.
hopfield model: multiple point-attractors
Undoubtedly the milestone in this field of research is the seminal work of J.J. Hopfield in 1982, where the model named after him was formulated.
He showed that the computational properties used by biological organisms or for the construction of computers may emerge as collective properties of systems that have a large number of simple equivalent components (or neurons).
In practice, J.J. Hopfield proposed a way to choose the connections in an RNN with many neurons in order to build multiple 0-dimensional attractors, showing that a very simple model of interacting binary neurons can have non-trivial collective properties, and in particular can implement autoassociative memories.
Basically there are two main ways to store information on a device: addressable and autoassociative memory.
- the first consists in comparing input search data (a tag) against a table of stored data and returning the matching entry;
- the second is any type of memory that enables one to retrieve a piece of data from only a tiny sample of itself.
The Hopfield model together with all the models we will see in the following are autoassociative memories, and are particularly important to study because they are more biologically plausible than addressable ones.
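Autoassociative retrieval can be sketched in a few lines. The example below relies on assumptions that this excerpt does not spell out: it uses the standard textbook Hebbian prescription $W_{ij} = \frac{1}{N} \sum_{\mu} \xi_i^{\mu} \xi_j^{\mu}$ with $W_{ii} = 0$ for binary patterns $\xi^{\mu} \in \{-1, +1\}^N$, together with noise-free asynchronous updates. The function names and the values $N = 200$, 5 patterns and 20% of corrupted bits are illustrative.

```python
import numpy as np


def hebbian_couplings(patterns):
    """Hebbian couplings for patterns of shape (P, N), without self-connections."""
    n_neurons = patterns.shape[1]
    W = patterns.T @ patterns / n_neurons
    np.fill_diagonal(W, 0.0)
    return W


def recall(W, cue, max_sweeps=50, seed=0):
    """Noise-free asynchronous dynamics, run until no neuron changes state."""
    rng = np.random.default_rng(seed)
    sigma = cue.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(sigma)):
            h = W[i] @ sigma                  # local field on neuron i
            new_state = sigma[i] if h == 0 else (1 if h > 0 else -1)
            if new_state != sigma[i]:
                sigma[i] = new_state
                changed = True
        if not changed:                       # fixed point reached
            break
    return sigma


def demo(n_neurons=200, n_patterns=5, flip_fraction=0.2, seed=1):
    """Overlap with the first stored pattern before and after retrieval."""
    rng = np.random.default_rng(seed)
    patterns = rng.choice([-1, 1], size=(n_patterns, n_neurons))
    W = hebbian_couplings(patterns)
    cue = patterns[0].copy()
    flipped = rng.choice(n_neurons, size=int(flip_fraction * n_neurons), replace=False)
    cue[flipped] *= -1                        # corrupt part of the stored pattern
    retrieved = recall(W, cue)
    return patterns[0] @ cue / n_neurons, patterns[0] @ retrieved / n_neurons


if __name__ == "__main__":
    before, after = demo()
    print("overlap of the cue:", before, "overlap after retrieval:", after)
```

The cue plays the role of the "sample" of the memory: the network is given no address, only a degraded version of the content, and the dynamics flows towards the stored pattern whose basin of attraction contains the cue.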
Moreover, this model has remarkable properties, such as robustness to the removal of a certain number of connections, the ability to correct patterns (memories) presented with errors, and the ability to store patterns in a time sequence and recall them in the right order, even though the single elementary components have independent dynamics, without a clock to synchronize them.
Before we discuss the Hopfield model in detail, let us recall the fundamental ingredients of biological inspiration that led to its formulation.
Table of contents :
1 statistical physics meets computational neuroscience
2 the low-dimensional manifold hypothesis
2.1 In machine learning
2.2 In computational neuroscience
3 recurrent neural networks and attractors
3.1 Discrete and continuous attractors in recurrent neural networks
3.2 Hopfield model: multiple point-attractors
3.2.1 Ingredients of the model
3.2.2 Model details and properties
3.2.3 Why is it important to study simple models?
3.3 Representation of space in the brain
3.3.2 Place cells and place fields
3.4 Storing a single continuous attractor
3.4.1 Tsodyks and Sejnowski’s model
3.4.2 CANN and statistical physics: Lebowitz and Penrose’s model
3.4.3 Continuous attractors and population coding
3.5 Why is it necessary to store multiple continuous attractors?
3.6 The case of multiple continuous attractors
3.6.1 Samsonovich and McNaughton’s model
3.6.2 Rosay and Monasson’s model
3.7 Issues with current theory
3.8 Experimental evidence for continuous attractors
3.8.1 Head-direction cells
3.8.2 The fruit fly central complex
3.8.3 Grid cells
3.8.4 Prefrontal cortex
3.8.5 Other examples
4 optimal capacity-resolution trade-off in memories of multiple continuous attractors
4.2 The Model
4.3 Learning the optimal couplings
4.4 Results of numerical simulations
4.4.1 Couplings obtained by SVM
4.4.2 Finite temperature dynamics (T > 0)
4.4.3 Zero temperature dynamics (T = 0) and spatial error
4.4.4 Comparison with Hebb rule
4.4.5 Capacity-Resolution trade-off
4.5 Gardner’s theory for RNN storing spatially correlated patterns
4.6 Quenched Input Fields Theory
4.6.1 Replica calculation
4.6.2 Log. volume and saddle-point equations close to the critical line
4.6.3 Large-p behavior of the critical capacity
5 spectrum of multi-space euclidean random matrices
5.2 Spectrum of MERM: free-probability-inspired derivation
5.2.1 Case of the extensive eigenvalue – k=0
5.2.2 Case of a single space (L=1)
5.2.3 Case of multiple spaces (L = N)
5.3 Spectrum of MERM: replica-based derivation
5.4 Application and comparison with numerics
5.4.1 Numerical computation of the spectrum
5.4.2 Merging of density “connected components”: behavior of the density at small
5.4.3 Eigenvectors of MERM and Fourier modes associated to the ERMs
6 towards greater biological plausibility
6.2 Border effects
6.3 Positive couplings constraint
6.3.1 Couplings obtained by SVMs with positive weights constraint
6.3.2 Stability obtained by SVMs with positive weights constraint
6.3.3 Adding the positive weights constraint in Gardner’s framework
6.4 Variants of the place cell model
6.4.2 Place fields of different volumes
6.4.3 Multiple place fields per cell in each space
6.4.4 Putting everything together
6.5 Individuality of neurons
6.6 Non uniform distribution of positions
6.7 Comparison between SVM and theoretical couplings
6.8 Dynamics of learning
6.9 Learning continuous attractors in RNN from real images