The importance of time in stereo correspondence


3D reconstruction

Figure 1.4: 3D reconstruction from multiple views. Image courtesy of (Hernández, 2004).

A major interest of being able to compute depth is the possibility of recovering the structure of a scene. While in binocular stereo the goal is to produce dense depth maps, here the expected result is a complete 3D model of the scene (see Fig. 1.4). The ability to reconstruct an observed environment allows us to build virtual scenes that are exact copies of the real world. This virtualized reality provides much richer and more realistic 3D environments than computer-generated virtual reality scenes (Kanade et al., 1995). Since the seminal work on multi-camera networks (Kanade et al., 1998), tele-immersion has become an important element of the next generation of live and interactive 3DTV applications. The goal of these techniques is to allow people at different physical locations to share a virtual environment.
Several methods exist for achieving 3D reconstruction from multiple views. Seitz et al. (Seitz et al., 2006) categorize them into four classes. The first class includes voxel colouring algorithms, which extract surfaces in a single sweep: a cost is assigned to each voxel of a given volume, and the voxel is reconstructed if this cost is under a certain threshold (Seitz and Dyer, 1997; Treuille et al., 2004). Variants obtain optimal surfaces using Markov Random Fields and max-flow (Furukawa, 2008; Roy and Cox, 1998; Sinha and Pollefeys, 2005; Vogiatzis et al., 2005) or multi-way graph cut (Kolmogorov and Zabih, 2002). The second class includes methods that iteratively refine surfaces by minimizing a cost function. Examples are space carving (Fromherz and Bichsel, 1995; Kutulakos and Seitz, 2000) and variants that progressively refine structures by adding or rejecting voxels to minimize an energy function (Bhotika et al., 2002; Eisert et al., 1999; Kutulakos, 2000; Kutulakos and Seitz, 2000; Saito and Kanade, 1999; Slabaugh et al., 2000, 2004; Yang et al., 2003; Zeng et al., 2005). Level-set techniques start from a large volume that shrinks or expands by minimizing a set of partial differential equations. The third class comprises methods that compute sets of depth maps; these image-space methods enforce consistency between depth maps in order to recover a 3D reconstruction of the scene (Gargallo and Sturm, 2005; Kolmogorov and Zabih, 2002; Narayanan et al., 1998; Szeliski, 1999; Zitnick et al., 2004).
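The thresholding idea behind voxel colouring can be sketched in a few lines. This is a minimal illustration, not any of the cited implementations: `project` stands in for a calibrated camera projection, and the colour-variance cost is one common photo-consistency choice.

```python
import numpy as np

def photo_consistency(colors):
    """Colour variance of a voxel's projections across views.

    colors: (n_views, 3) RGB samples, one per camera that sees the voxel.
    """
    return float(np.mean(np.var(colors, axis=0)))

def voxel_coloring(voxels, cameras, threshold):
    """Single-sweep reconstruction: keep voxels whose projections agree.

    `cameras` is a list of (project, image) pairs, where `project` maps a
    3D point to pixel coordinates -- both placeholders for a real
    calibrated setup.
    """
    kept = []
    for v in voxels:
        samples = []
        for project, image in cameras:
            u, w = project(v)
            if 0 <= u < image.shape[1] and 0 <= w < image.shape[0]:
                samples.append(image[w, u])
        # A voxel survives only if enough views see it consistently.
        if len(samples) >= 2 and photo_consistency(np.array(samples)) < threshold:
            kept.append(v)
    return kept
```

Occlusion handling, which the cited algorithms address through the sweep ordering, is omitted here for brevity.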
Finally, the fourth class includes methods that rely on feature extraction: features are first extracted and matched between viewpoints, and a surface-fitting method is then used to reconstruct the surfaces (Faugeras et al., 1990; Manessis et al., 2000; Morris and Kanade, 2000; Taylor, 2003). As shown, methods for 3D reconstruction have been the subject of intensive research over the last decades. Although much progress has been made, 3D reconstruction and its core problem, stereo matching, remain fundamental research problems in computer vision. Proposed approaches lack temporal resolution, and their performance is far from that of the examples found in nature, such as the 3D vision humans are able to perceive. Classical frame-based cameras capture dynamic scenes as a sequence of static image frames taken at a fixed frequency, typically around 30Hz. The precise temporal dynamics of the scene are therefore lost during the early acquisition phase, as the scene is sampled at discrete points in time. Current state-of-the-art methods approach the reconstruction of dynamic scenes as the reconstruction of sequences of static scenes; the produced reconstructions are therefore limited to the low temporal resolution of frame-based cameras, and the fine temporal structure of the scene is lost. Real-time 3D reconstruction has been achieved using depth cameras (such as the Microsoft Kinect or the Asus Xtion), but the results are noisy and only operate under specific lighting conditions. Other state-of-the-art methods able to produce real-time reconstructions must compromise on quality in order to gain computation speed (Nießner et al., 2013).
In nature, stereovision and 3D reconstruction are achieved effortlessly and are not limited to 30Hz, as we perceive the world continuously and not as a sequence of images. Current methods relying on classical cameras are computationally expensive even at low temporal resolution. The computational difficulties of current methods may stem from the way visual information is encoded and from the loss of temporal precision. The next section shows the importance of precise timing in depth perception, both in biology and in computer vision.

The importance of time in stereo correspondence

Several studies have shown that temporal information is not only used but is critical in the stereo-matching process of the human visual system. Two temporal factors seem to be particularly important: the duration of the stimulus and interocular synchronization (the synchronization of the images shown to the left and right eyes) (Howard and Rogers, 2008).
Early studies showed that depth perception could be achieved when a stimulus was presented for less than 1ms (when the eyes were previously converged), suggesting that stimulus duration was not important. Further research, however, showed that stereo matching requires time to resolve ambiguities and is more demanding for complex stimuli such as dense random-dot stereograms. A widely accepted idea is that correspondence is achieved by an expensive process of interocular correlation maximization (Cormack et al., 1991; Howard and Rogers, 2008). The synchronization between the images received by the left and right views has also been shown to play a critical role in stereo matching. Experiments have been conducted in which one of the views of a stimulus presented to an observer was delayed, either with a filter or by using computer-generated stimuli; results showed that disparity-induced depth was still perceived (Howard and Rogers, 2008). The tolerance for interocular delay, i.e. the amount of delay between views that can be tolerated while still perceiving depth, has been widely studied and shown to be up to 50ms (Mitchell and O'Hagan, 1972; Ross, 1974; Howard and Rogers, 2008). Some authors suggest that interocular delay is not only tolerated but can by itself produce a sensation of depth, calling this purely temporal stereoscopic disparity a temporal disparity; it was first described by Mach and Dvorak (1872) and later by Max Wolf in 1920 (Howard and Rogers, 2008; Ross, 1974). This effect, known as the Pulfrich effect, was studied in detail by Carl Pulfrich in 1922 (Gonzalez and Perez, 1998). However, despite the claims for the existence of temporal disparities, authors generally explain this phenomenon by assuming that the delay introduced in one eye interferes with the signals from the other eye (Howard and Rogers, 2008).
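The correlation-maximization model of correspondence can be illustrated computationally: for a pixel in one view, the match in the other view is the candidate disparity maximizing a windowed normalized cross-correlation. The sketch below is a minimal illustration under the assumption of a rectified image pair; all names are illustrative, not from the cited work.

```python
import numpy as np

def best_disparity(left, right, x, y, half, max_disp):
    """Pick the disparity maximizing normalized interocular correlation.

    left/right: 2D grayscale arrays (rectified pair); (x, y) is the pixel
    in the left image; `half` is the half-width of the correlation window.
    """
    patch_l = left[y - half:y + half + 1, x - half:x + half + 1].ravel()
    patch_l = patch_l - patch_l.mean()
    best, best_score = 0, -np.inf
    for d in range(max_disp + 1):
        xr = x - d  # rectified geometry: the match shifts left in the right image
        if xr - half < 0:
            break
        patch_r = right[y - half:y + half + 1, xr - half:xr + half + 1].ravel()
        patch_r = patch_r - patch_r.mean()
        denom = np.linalg.norm(patch_l) * np.linalg.norm(patch_r)
        score = patch_l @ patch_r / denom if denom > 0 else 0.0
        if score > best_score:
            best, best_score = d, score
    return best
```

The expense alluded to above is visible here: every pixel requires a full scan over candidate disparities, each involving a window-sized correlation.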
An important conclusion should be retained: temporal-consistency studies show that tighter synchronization between views leads to more accurate depth extraction, whereas interocular delays give rise to non-existent depth and deformed shapes (Chang, 2009).


Bio-inspired event-based vision

Biological retinas encode visual information differently from conventional cameras. Frame-based cameras transmit full image frames at a constant rate, where each frame contains luminance information for every pixel of the sensor. Biological retinas, in contrast, encode information as a stream of spikes: each photoreceptor independently generates spikes that encode light-intensity changes with millisecond precision (see Fig. 1.5).
Only the information from the parts of the scene that change (e.g. in luminosity) is therefore encoded, avoiding the acquisition and transmission of redundant data while adding precise timing information.
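An event in such a stream can be represented as a tuple (x, y, t, p): pixel coordinates, a timestamp, and a polarity indicating whether brightness increased or decreased. The sketch below is a deliberately simplified single-pixel model of this change-detection principle, not the actual sensor circuit.

```python
from dataclasses import dataclass
import math

@dataclass
class Event:
    x: int
    y: int
    t: float        # timestamp (e.g. microseconds)
    polarity: int   # +1 brightness increase, -1 decrease

def emit_events(x, y, samples, threshold=0.2):
    """Simplified change-detector pixel: emit an event each time the log
    intensity moves by `threshold` away from the last reference level.

    samples: list of (t, intensity) pairs observed at one pixel.
    """
    events = []
    ref = math.log(samples[0][1])
    for t, intensity in samples[1:]:
        level = math.log(intensity)
        # Large changes produce several events, one per threshold crossing.
        while abs(level - ref) >= threshold:
            pol = 1 if level > ref else -1
            ref += pol * threshold
            events.append(Event(x, y, t, pol))
    return events
```

A static pixel emits nothing at all, which is precisely how redundant data is avoided.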
In the late 1980s, the first neuromorphic vision sensor, mimicking the behaviour of the first three layers of the biological retina, was proposed by Mahowald (Mahowald, 1992). It introduced an analog photoreceptor that transforms the perceived light intensity into an output voltage following a logarithmic mapping. Delbrück and Mead improved the design by adding active adaptation (Delbrück and Mead, 1995), and Kramer further added a polarity output encoding the sign of the luminance change (Kramer, 2002).
In 2006, the Dynamic Vision Sensor (DVS) was proposed by Lichtsteiner, providing the first generation of ready-to-use sensors for asynchronous event-based vision (Lichtsteiner et al., 2006). In 2011, Posch et al. (Posch et al., 2011) proposed a QVGA-resolution sensor. Besides more than quadrupling the resolution of the DVS (Lichtsteiner et al., 2008), this sensor also provides luminance information: the gray level of an event is encoded as two events marking the beginning and the end of an exposure measurement. Another recent DVS development (Serrano-Gotarredona and Linares-Barranco, 2013) improves the contrast sensitivity, allowing the inclusion of more low-contrast visual information such as texture details. A review of the history and recent developments of artificial retina sensors can be found in (Delbruck et al., 2010).
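The exposure-measurement encoding can be decoded straightforwardly: brighter pixels reach the integration threshold faster, so intensity is inversely proportional to the time between the two events. A minimal sketch, in which the `calibration` constant is a hypothetical stand-in for sensor-specific scaling:

```python
def decode_intensity(t_start, t_end, calibration=1.0):
    """Recover a relative gray level from an exposure measurement.

    t_start, t_end: timestamps of the two events marking the beginning
    and end of the integration window. Brighter pixels integrate faster,
    so intensity is inversely proportional to the interval.
    """
    dt = t_end - t_start
    if dt <= 0:
        raise ValueError("end event must follow start event")
    return calibration / dt
```

For example, a pixel whose window closes after 1ms is decoded as twice as bright as one taking 2ms.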

Stereo-correspondence in neuromorphic engineering

Stereo vision with neuromorphic sensors remains poorly studied. Mahowald and Delbrück (Mahowald and Delbrück, 1989) implemented cooperative stereovision in a neuromorphic chip in 1989. The resulting sensor was composed of two 1D arrays of 5 neuromorphic pixels each. Local inhibition along the line of sight implemented the uniqueness constraint (one pixel from one view is associated with only one pixel in the other, except during occlusions), while lateral excitatory connectivity gave more weight to coplanar solutions in order to discriminate false matches from correct ones.
This method requires a large number of correlator units to deal with higher-resolution sensors. In 2008, Shimonomura, Kushima and Yagi implemented the biologically inspired disparity energy model to perform stereovision with two silicon retinas (Shimonomura et al., 2008). They simulated elongated receptive fields to extract the disparity of the scene and to control the vergence of the cameras. The approach is frame-based and extracts coarse disparity measurements to track objects in 3D.
Kogler et al. (Kogler et al., 2009) described a frame-based use of the event-based DVS cameras in 2009. They designed an event-to-frame converter to reconstruct event frames and then tested two conventional stereo vision algorithms: a window-based method and a feature-based method using center-segment features (Shi and Tomasi, 1993).
Delbruck implemented an event-based stereo tracker that tracks the position of a moving object in both views using an event-based median tracker and then reconstructs the position of the object in 3D (Lee et al., 2012a). This efficient and fast method lacks resolution on smaller features and is sensitive to noise when too many objects are present. In 2011, Rogister et al. (Rogister et al., 2011) (see Fig. 1.6) proposed an asynchronous event-based binocular stereo-matching algorithm combining epipolar geometry and timing information. Taking advantage of the high temporal resolution and the epipolar constraint, they provided a truly event-based approach to real-time stereo matching. However, the method is prone to errors, as the enforced constraints are weak and leave many ambiguities.
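The combination of epipolar and temporal constraints can be sketched as follows. This is an illustrative reading of the idea, not the authors' implementation: for each left-camera event, candidate right-camera events must be close in time and close to the epipolar line (obtained from calibration, assumed given), and the best candidate minimizes a simple weighted cost.

```python
import math

def epipolar_distance(point, line):
    """Distance from pixel (x, y) to the epipolar line ax + by + c = 0."""
    a, b, c = line
    x, y = point
    return abs(a * x + b * y + c) / math.hypot(a, b)

def match_event(left_event, right_events, epiline, dt_max, d_max):
    """Match a left-camera event against recent right-camera events.

    left_event / right_events: (x, y, t) tuples; `epiline` is the epipolar
    line of the left event in the right image. Candidates must fall within
    `dt_max` seconds and `d_max` pixels of the line; the candidate with the
    lowest combined temporal/geometric cost wins.
    """
    _, _, t_l = left_event
    best, best_cost = None, float("inf")
    for ev in right_events:
        x, y, t_r = ev
        dt = abs(t_r - t_l)
        d = epipolar_distance((x, y), epiline)
        if dt > dt_max or d > d_max:
            continue  # violates the temporal or geometric constraint
        cost = dt / dt_max + d / d_max  # simple normalized combination
        if cost < best_cost:
            best, best_cost = ev, cost
    return best
```

The sketch also makes the weakness visible: several right events firing near-simultaneously on the same epipolar line remain ambiguous, which is exactly the failure mode noted above.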

Table of contents:

1 Introduction 
1.1 Depth perception and stereovision
1.2 Stereo matching problem
1.3 Epipolar geometry
1.4 3D reconstruction
1.5 The importance of time in stereo correspondence
1.6 Bio-inspired event-based vision
1.7 Stereo-correspondence in neuromorphic engineering
1.8 Motivation and contribution
2 Asynchronous Event-Based N-Ocular Stereomatching 
2.1 Introduction
2.2 Asynchronous N-Ocular Stereo Vision
2.2.1 Trinocular geometry
2.2.2 Trinocular spatio-temporal match
2.2.3 Stereo match selection using Bayesian inference
2.2.4 Synchronization
2.2.5 N-ocular stereo matching
2.3 Experimental results
2.3.1 Experimental Setup
2.3.2 Reconstruction Evaluation
2.3.3 Processing time
2.4 Conclusion and Discussion
3 Scene flow from 3D point clouds 
3.1 Introduction
3.2 Scene flow parametrization
3.2.1 Plane approximation
3.2.2 Rank of M
3.3 Velocity estimation
3.3.1 Error cost function
3.3.2 Optimal spatio-temporal neighbourhood
3.4 Results
3.4.1 Simulated scene
3.4.2 Natural scene
3.5 Discussion
3.6 Conclusions
4 It’s (all) about time 
4.1 Introduction
4.2 Intensity and motion based stereo matching
4.3 Time encoded imaging
4.4 Event-based stereo matching
4.4.1 Geometrical error
4.4.2 Temporal error
4.4.3 Time-coded intensity matching
4.4.4 Motion matching
4.4.5 Error minimization
4.5 Results
4.5.1 Experimental setup
4.5.2 Method evaluation
4.5.3 Binocular matching
4.5.4 Trinocular matching
4.5.5 Performance evaluation
4.6 Discussion
4.6.1 3D Structure refinement using point cloud prediction
4.7 Conclusion
5 Discussion 