Robustness of Deep Neural Networks to Erasures in a Reading Task

Get Complete Project Material File(s) Now! »

Willshaw Networks and biological considerations

Willshaw networks are models of associative memories constituted of a given number of neurons. A stored message, or memory, is a combination of nodes taken in this set. The storage of this information element corresponds to the creation of connections with unitary weights between every two neurons in this message. The graphical pattern thus formed is termed « clique ». The storing process of n binary vectors xµ of length N , is equivalent to the modification of elements of the network’s connection matrix W , according to the formula: wij = max xiµxjµ (2.3) µ.
Note that here, the max operator is performed coeﬃcient-wise. Equivalently, the con-nection weight between nodes i and j is equal to 1 if, and only if, those two nodes are both part of one of the n stored messages.
The network’s density d is defined as the expected ratio of the number of ones in the matrix W to the number of ones it would contain if every possible message was stored. For cliques of order c, the number of connections they contain is 2c . Despite the correlation between these edges, the probability of a given connection to be picked when forming a message can be estimated as (2C) . Provided that the n stored messages are uniformly (N2) distributed and independent, the density of the network equates to the probability for any given connection to belong to at least one of these messages. This leads to the formula: c n d = 1 − 1 − 2 . (2.4) N 2.
The eﬃciency of a connectionist associative memory is defined as the ratio of the max-imal amount of information carried by the messages it is capable of storing then retrieving with high probability, over the total information quantity represented by its set of connec-tion weights. For a Willshaw network with N nodes, the number of potential connections, or binary resource, is Q = N(N − 1) [bits]. (2.5).

Preventing connections between neighbor neurons

Classic Willshaw networks have no topology. Their material is constituted with a list of neurons each having an index as sole referent. There is neither a notion of spatial position in these networks, nor, a fortiori, of spatial distance. We get closer here to a real neural network, by arranging them on a two-dimensional map. In the model we propose, the respective positions of two neurons impact the possibility for them to get connected together. The considered network is composed of a number N of nodes evenly distributed along a square grid, of side S = √N . Stored messages are of constant order, meaning they are all constituted of the same number of neurons. We forbid connections between nearby neurons. To this end, we apply a threshold σ on the spatial length of a connection. Stored messages must necessarily be conform to this constraint. Each message is formed in a random manner, units being chosen iteratively. Each new element of the message is picked from the positions left available after the removal of the neighbors of the formerly selected nodes, as indicated in Figure 2.2. One can consider the introduced constraint as applied on the network’s material, as the weights of a predetermined set of short-range connections will be enforced to stay null all along the network’s life. During the formation of a message, it is practical to pick neurons to satisfy this constraint in a sequential manner, with a local inhibition applied on a neuron’s neighborhood from the moment it is selected until the message generation is complete.
A link can be drawn between this approach and Kohonen Self-Organizing Maps, where close-by neurons encode more similar information [46]. Therefore, long-range distance sep-arates information elements that are diﬀerent in nature, whereas shorter-range distance depicts a diﬀerence in degree. Local competition is particularly relevant in this scheme.
During retrieval, the network is stimulated iteratively with a request that will most often change from one iteration to the next. Each node of the request will first stimulate every other element it is connected to. Scores are initialized with zero at the start of every iteration, and each stimulation is a unitary increment to the score of the receiver unit. For the first iteration, after the stimulation we apply a global Winner-Takes-All rule, which consists in excluding from the research scope all units that do not achieve the maximal score observed in the network. We know indeed that the neurons from the searched message will all have the maximum possible score, equal to the number of elements in the request. Once non-maximum elements are put to zero, we only pay interest in the remaining neurons during the rest of the retrieval process. Moreover, for every iteration after the first one, neurons in the new request are the only ones that can receive stimulation as the algorithm proceeds to only discard neurons from then on.

Visual recognition of partially displayed words

An experiment was formerly run on 23 human subjects confronted with partial displays of words [51, 52]. The volunteers were asked to signal through a computer keyboard whenever they thought they had recognized a word, at display rates always inferior to 30% of the pixels. They were then to type in the word they believed to have seen. The display rates came in ascending order, starting at 0.25% and were incremented each time all words in a sequence had been seen. The first increment was of 0.25%, and the following increments were of 0.5%. Participants signaled their recognition of all words, accurately or not, before reaching the 10% mark. The order of the sequence of words was randomly reset at each display rate increment, with correctly recognized words getting removed from the display list right away. A fixation cross was displayed for 500 milliseconds prior to the display of a word, centered on this location, for 350 milliseconds. A white screen was then displayed for 2000 milliseconds, before repetition of this tri-fold sequence with the next word. Figure 3.1 illustrates this protocol. Word length ranged from 4 to 9 letters, all letters were in upper-case and the used font was Courier.
The words were chosen to be suﬃciently well balanced in terms of length as well as frequency in the French lexicon. The results of this study showed the conjunct role of several parameters in word recognition, which can be arranged in three categories. The first category is constituted by bottom-up factors, i.e., characteristics of the display triggering recognition. Other studied parameters are linguistic factors, such as the frequency of a word and the size of its orthographic neighborhood in the lexicon, and subject-related factors, e.g., age and level of education. The prominent factors in the specifics of recognized visual displays happened to be the display rate of the whole word, followed by the specific display rates of lines and curves in the letters. Regarding lexical factors, the frequency of a word had a significant impact, with high-frequency words being recognized at lower display rates than lower-frequency words, on average. Longer words (8-9 letters) were also usually recognized at lower display rates than mid-length words (6-7 letters), themselves being recognized faster than short words (4-5 letters). As for the human subjects, age and level of education were the most determinant factor aﬀecting word recognition performances.

Table of contents :

Résumé
Abstract
Acknowledgements
1 Introduction
2 Sparse Neural Associative Memories
2.1 Introduction
2.2 Related work
2.2.1 Hopfield Networks
2.2.2 Willshaw Networks
2.2.3 Clustered Cliques Networks
2.3 Willshaw Networks and biological considerations
2.4 Preventing connections between neighbor neurons
2.5 Efficient use of the memory resource
2.6 Pattern completion
3 Robustness of Deep Neural Networks to Erasures in a Reading Task
3.1 Introduction
3.1.1 Visual recognition of partially displayed words
3.1.2 Pre-existing models of reading
3.1.3 Deep Convolutional Neural Networks
3.1.4 Previous study of the representation geometry and biological plausibility of CNNs
3.2 Methods
3.2.1 Data-set pre-processing
3.2.2 The convolutional architecture employed
3.2.3 Training procedure
3.2.4 Testings to study the effect of the different letter feature types on recognition
3.2.5 Study of the learned feature detectors and associated activation patterns
3.3 Results
3.3.1 Display rate effects
3.3.2 Relationship with lexical factors
3.3.3 Effects of feature types
3.3.4 Observations on the geometry of representations and the information carried by their activation patterns
3.4 Discussion
3.4.1 Reproducibility of lexical effects relative to word frequency
3.4.2 Role of lexical factors
3.4.3 Comparison with other models
3.4.4 Major differences with human recognition behavior
3.5 Conclusions
4 Assembly Output Codes for Learning Neural Networks
4.1 Introduction
4.2 Coding theory
4.3 Methodology
4.3.1 MNIST
4.3.2 SVHN
4.3.3 Assembly codes
4.3.4 Neural network settings
4.4 Results
4.4.1 Experimenting with random codes
4.4.2 MNIST – Shallow network
4.4.3 MNIST – Deep network
4.4.4 SVHN – Deep Network
4.5 Problems with a large number of classes
4.5.1 Memory constraints
4.5.2 Error-correcting decoding
4.5.3 Experiments
4.5.4 Results
5 Combination of Unsupervised Learning and Associative Memory
5.1 Introduction
5.2 Gradient descent learning
5.3 Sparse Clustered Associative Memories
5.3.1 CCNs
5.3.2 Sparse messages
5.3.3 Specific retrieval algorithms for sparse messages
5.3.4 Accelerations using GPUs
5.3.5 GWsTA
5.3.6 GLsKO
5.3.7 Experiments
5.4 Combination of learning and memory
6 Conclusion and Openings
6.1 Conclusion
6.2 Openings
6.2.1 Semi-supervised learning
6.2.2 Decision under uncertainty