Neuroanatomy of Multimodal Associative Areas
Investigation of the human being likely started with the study of the body. From the physical and chemical properties of organs, scientists could begin to understand some of their functions. The brain poses a totally different problem: it is not a single organ, but a complex system formed by massive connections between smaller units (neurons). To understand how it functions, we need to investigate its anatomical structure as well as the process of communication within the network of neurons. It has been suggested that the brain is anatomically connected in a way that is correlated with and facilitates its functional connectivity (Sporns, Tononi et al. 2000).
The cortex is divided into multiple areas on which the scientific community more or less agrees. They are defined by their cyto-architecture (the types of neurons and other neuronal material composing them), by their connectivity pattern and by the cognitive function they are involved in (Brodmann 1909; Amunts, Schleicher et al. 1999). Areas are connected together, but despite numerous studies, establishing a connectivity matrix is a huge task that has not yet been achieved in humans. Historically, the cortex has been thought of as a hierarchy (Felleman and Van Essen 1991); while it is now clear that this is not the case in the mathematical sense of the term (average connectivity rate of 66% (Markov, Ercsey-Ravasz et al. 2010)), a “hierarchical flavor” is still present in our understanding of the connectivity of the early areas. Studies by Kennedy’s team on the monkey provide us with a partial connectivity matrix summarizing which areas are connected and how. Statistical analysis of this matrix gives interesting results: a general pattern of connectivity seems to exist. Indeed, within an area or among areas, the strength of connectivity between two locations seems to depend on the distance between them, in the way represented in Figure 1. From the point of view of the early sensory areas, this organization indeed produces a “hierarchical gradient” of connections to the other areas, if we consider that position in the hierarchy is defined by the distance to the sensory cortex. This semi-hierarchical pattern is a well-suited design for the multimodal integration that I will develop in this chapter. After the initial sensory cortex (with V1, A1, S1, G1, O1), where each sensory modality is clearly identified, areas become more and more amodal. The premotor cortex of the monkey, for example, is well known to merge inputs coming from vision and proprioception (Graziano 1999; Maravita, Spence et al. 2003).
Merging proprioception with vision is important for biological systems; both modalities can contribute to a better estimation of the physical body status within the environment, therefore allowing a finer motor control.
Figure 1: Extracted from (Markov, Misery et al. 2011). The FLN is a measure of connectivity strength. The strength of connection between cortical areas is a matter of their relative distance. Original legend: Lognormal distribution of FLN values. The observed means (points) ordered by magnitude and SDs (error bars) of the logarithm of the FLNe for the cortical areas projecting on injection sites. (A) V1 (n = 5), (B) V2 (n = 3), and (C) V4 (n = 3). The relative variability increases as the size of the projection decreases. Over most of the range, the variability is less than an order of magnitude. The curves are the expected lognormal distribution for an ordered sample of size, n, equal to the number of source areas. The gray envelope around each curve indicates the 0.025 and 0.975 quantiles obtained by resampling n points from a lognormal distribution 10 000 times and ordering them.
This context leads to the strong intuition that multi-sensory merging is a core principle of cortical computation. When a subject interacts with the physical world, the changes induced are perceived by its sensors. The same action (in the broad sense, any motor act) will produce the same effects, thereby producing a coherence relation between the corresponding sensor activations. When I move my hand in front of my eyes, I will always see it follow the same trajectory, with the same shapes, and feel exactly the same proprioceptive percepts, i.e. the visual and proprioceptive images are correlated. Since proprioception and vision provide input to the same area, according to the most basic Hebbian rule, regular visuo-proprioceptive patterns will be coded within this multimodal area. This is one of the most obvious relations between our sensory spaces; however, it is interesting to look at the case of blind people. Neuroimaging tells us that the dorsal stream, which merges proprioception and vision in sighted people, seems to merge audition and proprioception in the congenitally blind (Fiehler and Rösler 2010), while another study shows that early vision during child development permanently shapes the tactile perception of the non-congenitally blind (Röder, Rösler et al. 2004). As an example of another such combination, integration of visual and auditory signals was found in the monkey for person identification (face + voice) (Ghazanfar, Maier et al. 2005). Multimodal areas are not predefined to use a specific combination of modalities; they are a mechanism for merging the modalities whose patterns are most regularly co-activated. Listing exhaustively all the multimodal areas and their inputs would be a huge and meaningless task; even “modal areas” were found to integrate information coming from each other (Cappe and Barone 2005).
The literature on multimodal integration in the brain is vast and is a standalone topic (Meredith and Stein 1986; Lipton, Alvarez et al. 1999; Sommer and Wennekers 2003; Ménard and Frezza-Buet 2005). However, the objective here is not to draw a map of the multimodal streams in the brain, but to highlight the fact that the merging of multiple modalities is likely one of the core mechanisms induced by cortical connectivity. This principle is at the core of the Convergence Zone Theory (Damasio and Damasio 1994), which I will use as a basis for modeling multimodal convergence.
As stated in the introduction, one of the most common and impressive manifestations of multimodal integration is the perceptual dependency created among different sensory modalities. It is reasonable to assume that our percepts are based on quite high-level areas and do not come directly from the raw sensor input; they therefore encode multimodal traces. From a computational point of view, this means that activity in one modality can produce a form of recall in the other, thereby biasing perception toward a more regular pattern. Most perceptual illusions are indeed inherent to this phenomenon: the ventriloquist and McGurk effects show the link between auditory and visual percepts (McGurk and MacDonald 1976; Bonath, Noesselt et al. 2007), the rubber hand experiment involves vision, proprioception and touch (Botvinick and Cohen 1998), etc. Taking the rubber hand experiment as an example: in a nutshell, the subject is presented with a fake hand as being his own, and therefore integrates the displacement of the fake hand as a displacement of his own limb (refer to (Botvinick and Cohen 1998; Ehrsson, Spence et al. 2004; Tsakiris and Haggard 2005) for details and variations). In this setup the subject feels the fake hand as being his own because the sensory inputs coming from proprioception and vision are coherent. The small displacement induced in vision creates a shift in proprioception. Indeed, given the visual input, the proprioception should not be what the body experiences; the subject therefore feels neither the reality nor the exact vector matching the vision, but a mixture of both. In this experiment the illusion is induced after a short training, a sort of priming that lets the subject associate the fake hand with his own. Indeed, psychophysics demonstrates two types of illusions, one induced by such priming and another related to long-term experience of world regularities.
While the first shows that multimodal integration is subject to short-term adaptation, the second type demonstrates that our experience shapes our perceptual system throughout our lives. An entertaining example based on the visual modality alone is presented in Figure 2: the balls seem to be flying or not according to the position of their shadows, while if you hide the shadows they appear to be on the same level. Knowing that our brain is used to perceiving a consistency between the height of an object and the position of its shadow, we can assume that integrative systems are indeed trying to make us perceive the situation in the image as it should be according to the laws of physics. Shadow position and spatial position are so tightly coupled in the world that manipulating only the perception of the shadow induces a major shift in the percept and in the estimated position of the object. This illusion is so common and useful that it has been studied (Kersten, Mamassian et al. 1997; Mamassian, Knill et al. 1998) and exploited for artistic purposes. This example is probably not the best one, but the point is easy to grasp: the brain is “fooling us” into perceiving not the real world, but the world shaped as we are used to experiencing it. Illusions happen when those two worlds do not match, the percept then being modulated into a mixture of both.
Figure 2: An illusion induced by long-term knowledge. Our experience of the world shapes our understanding of a perceptual stimulus so that the laws of physics are consistent with our daily experience. (Picture from Lawrence Cormack)
Indeed, illusions do not rely only on early sensory integration; they also touch the “semantic” level, with interference to and from language. Reading “blue” printed in a conflicting color takes longer than reading “blue” printed in blue, and YouTube had quite a buzz about hearing “fake speech” in songs while synchronously reading a text (the illusion might not work for non-native French speakers but one can check in case 4). This last case is an impressive illusion: despite the “sexual and comic” connotation of the video, it is an effect worthy of serious investigation, though to our knowledge no such study has been performed. In the illusion, one reads a text (in French) while listening to a foreign song. The sonority of the text is close to the lyrics of the song; therefore, if one reads it a few seconds before the audio comes, one actually hears what was just read. The auditory percept is “fooled” by vision, but via the language level, which demonstrates a three-step chain vision->word->audition. Indeed, it is tempting to suppose that while learning to read, a multimodal area becomes a convergence zone for written words and their audio representations.
To conclude this section on illusions, let us consider memories. French literature has a well-known description, by Proust, of what an adult feels when he happens to smell the odor of a cake he used to eat as a child. The odor triggers in the reader the “awakening” of dreams in which he experiences his childhood, completely disconnected from reality. Memories can be induced on demand, or suggested by environmental factors; however, it is clear that in both cases our percepts correspond to an illusion between the multimodal pattern of activation we experienced and the real world. Could we therefore say that the recall process is a “mind-induced” illusion, a memory-driven activation of a coherent pattern of sensory inputs? This is beyond the scope of this chapter, so we leave this question open and propose a model that can cope with this principle of perceiving each sensory modality as shaped by the others and by previous experience.
Model: Multi Modal Convergence Maps
The Convergence Zone Framework (CVZ) (Damasio 1989; Damasio and Damasio 1994) makes use of a standard and generic computational mechanism within the cortex: the integration of multiple modalities within a single area. This integration yields a memory capability that allows multimodal traces to be recalled from a partial cue (unimodal stimulation, for example). The original model was formalized by Moll (Moll and Miikkulainen 1997) and is quite similar to Minerva2 (Hintzman 1984), apart from the fact that the former uses a neural network while Minerva2 uses “brute force” storage of all the episodic traces. Both models fall into the category of Mixture of Experts models (Jacobs, Jordan et al. 1991; Jordan and Jacobs 1994), in which a pool of computational units (experts) is trained to respond to multimodal patterns. When a partial or noisy input signal is presented, all the experts examine it and respond with their level of confidence (activation) that this input is their pattern. Through a linear combination of their responses and their specific patterns, the missing or wrong information can be filled in. Another model which can be considered a special type of Mixture of Experts is the Self Organizing Map (SOM) of Kohonen (Kohonen 1990). While the formalisms are different, the core principle is the same: a pool of neurons is trained so that each of them tunes its receptive field (prototypical vector) in order to be most strongly activated by a specific input vector. The SOM is particularly well known for its directly visual and meaningful 2-D map representation, which allows an understanding of the network computation and makes it possible to map high-dimensional data into a 2-D space. SOMs are indeed based on the lateral organization of connectivity within cortical areas, which induces through learning a topographical mapping between the input vector and the neural map.
However, despite the fact that they are bidirectional by nature and allow recall, SOMs were never really used as a basis for multimodal integration, but mainly to perform vector quantization on high-dimensional datasets (Kaski, Kangas et al. 1998). In this section, I will present a model fusing ideas from the CVZ and from the SOM. I will first provide a preliminary explanation of those two models and finally present the Multi Modal Convergence Maps, which I will link to some very similar models in the recent literature on modeling multimodality.
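The recall principle described above (each stored pattern responds with an activation, and a linear combination of patterns weighted by these activations fills in the missing part) can be illustrated with a minimal sketch. This is not the formalization of Moll or Hintzman: the names `similarity`, `recall` and `sharpen`, the +1/-1 feature coding with 0 for "unknown", and the sharpening exponent are all assumptions chosen for the illustration.

```python
def similarity(cue, trace):
    """Normalized dot product, computed only over the components present in the cue."""
    present = [(c, t) for c, t in zip(cue, trace) if c != 0]
    if not present:
        return 0.0
    return sum(c * t for c, t in present) / len(present)


def recall(cue, traces, sharpen=3):
    """Return the activation of each stored trace and the activation-weighted sum of traces."""
    # Raising the similarity to an odd power keeps its sign and favors close matches
    # (illustrative choice for this sketch).
    activations = [similarity(cue, trace) ** sharpen for trace in traces]
    echo = [
        sum(a * trace[i] for a, trace in zip(activations, traces))
        for i in range(len(cue))
    ]
    return activations, echo


# Two hypothetical multimodal traces: 4 "visual" components followed by 4 "auditory" ones.
traces = [
    [1, 1, -1, -1, 1, -1, 1, -1],
    [-1, 1, 1, -1, -1, -1, 1, 1],
]

# Partial cue: only the visual half of the first trace is given, the rest is unknown (0).
cue = [1, 1, -1, -1, 0, 0, 0, 0]

activations, echo = recall(cue, traces)
recalled_audio = [1 if value > 0 else -1 for value in echo[4:]]
print(activations)     # [1.0, 0.0]
print(recalled_audio)  # [1, -1, 1, -1]
```

In this toy case the visual cue matches the first trace perfectly and is orthogonal to the second, so the auditory half of the first trace is recovered unchanged. With overlapping traces the echo would be a blend of several stored patterns.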
Convergence Zone & Self Organizing Map
A direct model of the CVZ was established by Moll (Moll, Miikkulainen et al. 1994; Moll and Miikkulainen 1997), in which multiple modality-specific layers are linked through a binding layer (the convergence zone). Each unit of the modality vectors is connected to all neurons of the binding layer with a weight of 0 or 1 (connected or not). To store a new pattern, the modalities are set and a random pool of binding neurons is chosen; the links between the activated input neurons and this pool are set to 1. For retrieval, a partial set of the input vectors (e.g. one modality) is activated; the neurons of the binding layer connected to it with weights of 1 are found and in turn activate all the input units that they encode. The process is summarized in Figure 3.
Figure 3: Taken from (Moll and Miikkulainen 1997). A stored pattern is retrieved by presenting a partial representation as a cue. The size of each square indicates the level of activation of the unit.
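The store and retrieve steps of such a binding layer can be sketched as follows. This is a simplified illustration and not Moll's implementation: the class name `ConvergenceZone`, its parameters, the "one active unit per modality" coding and the winner-take-all readout are assumptions made for the example.

```python
import random


class ConvergenceZone:
    """Toy binding layer with binary links between modality units and binding units."""

    def __init__(self, modality_sizes, n_binding, pool_size, seed=0):
        self.modality_sizes = modality_sizes
        self.n_binding = n_binding
        self.pool_size = pool_size
        self.rng = random.Random(seed)
        # A link (modality, unit, binding_unit) in the set has weight 1; absent means 0.
        self.links = set()

    def store(self, pattern):
        """pattern: one active unit index per modality."""
        pool = self.rng.sample(range(self.n_binding), self.pool_size)
        for modality, unit in enumerate(pattern):
            for binding_unit in pool:
                self.links.add((modality, unit, binding_unit))

    def retrieve(self, cue):
        """cue: dict {modality: active unit}. Returns one unit index per modality."""
        # Forward pass: count how many cued units are linked to each binding unit.
        votes = [
            sum((m, u, b) in self.links for m, u in cue.items())
            for b in range(self.n_binding)
        ]
        best = max(votes)
        active = [b for b, v in enumerate(votes) if v == best]
        # Backward pass: in each modality, keep the unit most supported by the active pool.
        pattern = []
        for m, size in enumerate(self.modality_sizes):
            support = [sum((m, u, b) in self.links for b in active) for u in range(size)]
            pattern.append(support.index(max(support)))
        return pattern


cvz = ConvergenceZone(modality_sizes=[5, 5, 5], n_binding=50, pool_size=5)
cvz.store([0, 3, 2])
cvz.store([1, 0, 4])
recalled_from_first = cvz.retrieve({0: 0})   # cue: unit 0 of modality 0 only
recalled_from_second = cvz.retrieve({1: 0})  # cue: unit 0 of modality 1 only
print(recalled_from_first, recalled_from_second)
```

As long as the two random pools are not identical, each unimodal cue should bring back the full pattern it was stored with. Note that, as discussed in the text, the position of a binding unit carries no meaning in this scheme.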
The focus of Moll’s research is to show that such a model can store a large number of traces within a reasonable number of neurons. On this basis they argue that it is a good representation of episodic memory storage within the human brain. However, the current interest lies more in the properties emerging from the mixture of multiple modalities: how the model behaves in the case of incoherent input activity (illusion) and how one modality can cause a drift in another. Indeed, this model does not hold any temporal or spatial relation among the stimuli: two stimuli close in time do not have to possess any kind of proximity (while this is typically the case in the real world), and there is no point in trying to find spatial clustering within the binding layer since the position of neurons is not used at all. Therefore, a multimodal pattern is a hard-to-imagine “binding constellation”, and it is difficult to label those binding neurons with the concept they relate to. Moreover, even if the learning process is fast, there is no evidence about how it behaves with respect to catastrophic forgetting (French 2003), and no benefit is drawn from past experience when learning a new trace. The SOM can cope with those points, although it is not designed to handle multiple modalities.
Self-Organizing Maps were introduced by Kohonen (Kohonen 1990) and have been extensively used and adapted to a huge diversity of problems; see (Kaski, Kangas et al. 1998) for a review.
The main purpose of SOMs is to perform vector quantization and to represent high-dimensional data. A SOM is a map of artificial neurons with 2 (or more) dimensions. An input vector is connected to the map so that each component of this input vector is connected to each node of the map (see the partial representation of the connections in Figure 4). In this context, each neuron of the map owns a vector of connections that has the size of the input vector, and each connection has a weight. The main idea is to fill the input vector with values and to compare these values with the weight vector of each neuron. Each neuron is activated according to the similarity between its weight vector and the input vector. One neuron will be more activated than all the others; we call it the winner.
Figure 4: Schematic representation of a self organizing map. A single input vector has each of its neurons connected to each neuron in the map.
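The activation and winner selection described above can be written in a few lines. The text does not specify the similarity measure; the sketch below assumes the negative Euclidean distance between weight vector and input, and the names `activation` and `find_winner` as well as the 2x2 map are hypothetical.

```python
import math


def activation(weights, x):
    """Similarity between a neuron's weight vector and the input.

    Assumption of this sketch: the closer the weights are to the input
    (Euclidean distance), the higher the activation.
    """
    return -math.dist(weights, x)


def find_winner(som, x):
    """som maps grid coordinates (row, col) to a weight vector; return the most activated neuron."""
    return max(som, key=lambda pos: activation(som[pos], x))


# A hypothetical 2x2 map with 2-component weight vectors.
som = {
    (0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0],
}

winner = find_winner(som, [0.9, 0.2])
print(winner)  # (1, 0)
```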
Training aims to shape the map so that two neurons that are close on the map encode similar input vectors. To do so, input vectors from a training set are presented, the winner neuron is computed and its weights are adjusted so that they move closer to the input values. While this general process is very similar to the CVZ, the learning step is quite important in the SOM. Indeed, not only the winner neuron learns, but also its neighbors, so that a region instead of a single neuron learns to respond to this input. The learning rate of the neighbors depends on their distance to the winner; this learning function is inspired by the lateral connectivity pattern and the resulting inhibition. The learning rate function is often called the “Mexican hat” because the learning rate is distributed like a sombrero whose center is the position of the winner neuron (Figure 5).
Figure 5: Neighbourhood function or Mexican Hat function (Credit for picture to Daneel Reventlov)
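A single learning step with a neighbourhood function can be sketched as below. As a simplifying assumption, a Gaussian of the grid distance to the winner stands in for the neighbourhood function (it has no negative surround, unlike a true Mexican hat); the function name `som_step` and the parameters `lr` and `sigma` are hypothetical.

```python
import math


def som_step(som, x, lr, sigma):
    """One learning step: find the winner, then pull every neuron toward x.

    The pull is scaled by a Gaussian of the grid distance to the winner,
    so the winner learns most and distant neurons learn least.
    """
    winner = min(som, key=lambda pos: math.dist(som[pos], x))
    for pos, weights in som.items():
        grid_d2 = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        h = math.exp(-grid_d2 / (2 * sigma ** 2))
        for i in range(len(weights)):
            weights[i] += lr * h * (x[i] - weights[i])
    return winner


som = {
    (0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
    (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0],
}
before = {pos: list(w) for pos, w in som.items()}
x = [0.9, 0.2]

winner = som_step(som, x, lr=0.5, sigma=1.0)

# Fraction of the initial distance to x covered by each neuron during this step.
moved = {
    pos: math.dist(before[pos], som[pos]) / math.dist(before[pos], x)
    for pos in som
}
print(winner, moved)
```

Here the winner covers half of its distance to the input (the learning rate), its direct grid neighbours cover about 30% and the diagonal neighbour about 18%. Repeating this step over a training set, usually while shrinking `lr` and `sigma`, is what progressively organizes the map.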
Learning will therefore tend to shape the map so that two similar inputs are stored within the same region of the map. A concrete example demonstrating this principle is the application of the SOM to image compression (Dekker 1994; Pei and Lo 1998). An image is composed of pixels, which in most cases hold three channels (R, G, B) with 256 levels each, accounting for 256*256*256 possible colors. However, when considering a single picture, it is clear that not all those colors are used. By considering each pixel as a 3-component vector and sequentially presenting pixels from an image to a SOM, it is easy to obtain a compact palette of the colors used. Indeed, after learning, the map stores gradients of colors in several regions, which together compose the most representative palette for this image. The number of colors coding the image is then the number of neurons forming the map, which can be used to greatly increase the compression. After training, the map can be represented by painting each neuron with the color its weights encode, providing a meaningful representation and understanding of the map encoding (Figure 6).
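The palette idea can be made concrete with a small sketch. It is not the method of Dekker or of Pei and Lo: the 4x4 map, the linear decay of the learning rate and neighbourhood width, the synthetic "image" made of two noisy color clusters and the names `train_palette` and `quantize` are all assumptions of this illustration.

```python
import math
import random


def train_palette(pixels, rows=4, cols=4, epochs=10, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small SOM on RGB pixels; each neuron ends up holding one palette color."""
    rng = random.Random(seed)
    som = {(r, c): [rng.uniform(0, 255) for _ in range(3)]
           for r in range(rows) for c in range(cols)}
    total = epochs * len(pixels)
    step = 0
    for _ in range(epochs):
        for px in rng.sample(pixels, len(pixels)):  # shuffled presentation order
            decay = 1.0 - step / total
            lr, sigma = lr0 * decay, max(sigma0 * decay, 0.5)
            win = min(som, key=lambda p: math.dist(som[p], px))
            for (r, c), w in som.items():
                h = math.exp(-((r - win[0]) ** 2 + (c - win[1]) ** 2) / (2 * sigma ** 2))
                for i in range(3):
                    w[i] += lr * h * (px[i] - w[i])
            step += 1
    return som


def quantize(pixels, som):
    """Replace each pixel by the index of its closest neuron; return (palette, indices)."""
    keys = sorted(som)
    palette = [tuple(round(v) for v in som[k]) for k in keys]
    indices = [min(range(len(keys)), key=lambda i: math.dist(som[keys[i]], px))
               for px in pixels]
    return palette, indices


# Synthetic "image": 100 reddish and 100 bluish pixels (illustrative data only).
rng = random.Random(1)


def noisy(base):
    return tuple(min(255, max(0, ch + rng.randint(-20, 20))) for ch in base)


pixels = [noisy((200, 40, 40)) for _ in range(100)] + \
         [noisy((40, 40, 200)) for _ in range(100)]

som = train_palette(pixels)
palette, indices = quantize(pixels, som)
bits_per_pixel = math.ceil(math.log2(len(palette)))
print(len(palette), bits_per_pixel)
```

With a 16-neuron map, each pixel is coded by a 4-bit index instead of 24 bits. For the 200 pixels above, that is 200*4 bits of indices plus 16*24 bits of palette (1184 bits) against 4800 bits for the raw colors, at the cost of approximating each color by its nearest palette entry.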
Table of contents :
Chapter I Embodied action, merging multiple sensory modalities
Neuroanatomy of Multimodal Associative Areas
Model: Multi Modal Convergence Maps
Convergence Zone & Self Organizing Map
Conducted: Proprioception enhance vision speed
To be conducted
Chapter II Symbolic Action Definition, from Primitives to meaning
Action Definition: Perceptive Level
Action Definition: Motor Level
Action Definition: Descriptive Level
Action Definition: Brain encoding and datastructure
Experimental Results: Application to Imitation
Recognition Process Details
Execution Process Details
Chapter III Cooperation, using Actions to compose Shared Plans
Shared Plans: Neurophysiology
Shared Plans: Child Development
Learning a shared plan by observation
Execution of a shared plan
Shared Plans: Implementation
Learning a shared plan by observation
Execution of a shared plan
Generation of a plan: teleological reasoning
Teaching a shared plan to the robot and using to test naïve subjects
Chapter IV Abstract Cognitive Machine(s)
Various scales of heterogeneity
Robots hardware heterogeneity
Robots Software heterogeneity
Cognitive Architecture heterogeneity
Shared experience framework
Perspectives and Inquiries
Annex 1: A Theory of Mirror Development
Annex 2: Central Cognition, Implementation Details
Appendix 1 Towards a Platform-Independent Cooperative Human-Robot Interaction System: I. Perception
Appendix 2 Towards a Platform-Independent Cooperative Human-Robot Interaction System: II. Perception, Execution and Imitation of Goal Directed Actions
Appendix 3 Linking Language with Embodied and Teleological Representations of Action for Humanoid Cognition
Appendix 4 The EFAA’s OPC format specification
Affordances & PMP
The entity « body part » (human awareness)