Object Segmentation by Interactive Perception


Geometrical analysis of Multivariate Normal Distribution

Ellipsoid of tolerance

In statistics, a way to study a dataset is to determine the confidence and tolerance regions of a distribution. While the confidence region can be used to estimate the parameters of the multivariate normal distribution (MVND), the tolerance region provides information on the whole dataset. For an MVND, the tolerance region is an ellipsoid.
The size of the ellipsoid is determined by the confidence level α, which means that all the data with a probability greater than 1 − α lie within the ellipsoid (for the confidence ellipsoid, the probability must be greater than α). Generally, the confidence level is low (around 5%). The tolerance ellipsoid is a good approximation of the MVND and allows us to perform a geometrical analysis. If the theoretical covariance matrix is considered, the distance, in the sense of Mahalanobis, between the estimated mean $\mu^*$ and the theoretical mean $\mu$ follows a χ² distribution with $p$ degrees of freedom: $n(\mu^* - \mu)^T \Sigma^{-1} (\mu^* - \mu) \sim \chi^2_p$.
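As an illustration, the axes of such a tolerance ellipsoid can be computed from the eigendecomposition of the covariance matrix and a χ² quantile. The following is a minimal sketch, assuming numpy and scipy; the function name and example values are illustrative and not taken from the thesis.

```python
# Minimal sketch: axes of the (1 - alpha) tolerance ellipsoid of an MVND.
import numpy as np
from scipy.stats import chi2

def tolerance_ellipsoid(mean, cov, alpha=0.05):
    """Return semi-axis lengths and directions of the tolerance ellipsoid."""
    p = mean.shape[0]
    r2 = chi2.ppf(1.0 - alpha, df=p)        # squared Mahalanobis radius
    eigvals, eigvecs = np.linalg.eigh(cov)  # shape and orientation of the ellipsoid
    semi_axes = np.sqrt(eigvals * r2)       # semi-axis lengths
    return semi_axes, eigvecs

# Illustrative 2-D example: 95% tolerance ellipsoid.
mean = np.array([0.0, 0.0])
cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])
axes, directions = tolerance_ellipsoid(mean, cov)
print(axes)  # lengths of the two semi-axes
```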

Visual Features and descriptors extraction

Visual feature extraction is a key part of the framework proposed in this thesis. A visual feature characterizes one supervoxel. Two different kinds of features are extracted from the 3D point cloud: color histograms, and fast point feature histograms (FPFH, Rusu et al. [2009]), which characterize the local geometry of a supervoxel. While the color histograms are extracted from the RGB image, the FPFH are extracted from the depth information provided by 3D cameras, such as stereoscopic cameras or active 3D cameras like the Microsoft Kinect or the Asus Xtion.
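As an illustration of the first kind of feature, here is a minimal sketch of a per-supervoxel color histogram, assuming each supervoxel is given as an (N, 3) array of RGB values in [0, 255]; the function name and bin count are illustrative, not the thesis implementation. FPFH descriptors, for their part, are available in libraries such as PCL or Open3D.

```python
# Minimal sketch: a normalized per-channel color histogram for one supervoxel.
import numpy as np

def color_histogram(rgb_points, bins=8):
    """Concatenate one normalized histogram per RGB channel (length 3 * bins)."""
    hists = []
    for channel in range(3):
        h, _ = np.histogram(rgb_points[:, channel], bins=bins, range=(0, 255))
        hists.append(h / max(len(rgb_points), 1))  # normalize by point count
    return np.concatenate(hists)
```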
Before explaining color histogram features, local invariant features must be mentioned because of their popularity. Local invariant features are descriptors localized on keypoints. They are extracted over the entire picture and produce keypoints which characterize its salient parts. The most used feature is the scale-invariant feature transform, better known as the SIFT descriptor (Lowe [1999]), but there are also SURF (speeded-up robust features, a lightweight enhancement of SIFT; Bay et al. [2006]), BRIEF (binary robust independent elementary features; Calonder et al. [2010]), GFT (good features to track; Shi and Tomasi [1993]), and many others (Tuytelaars et al. [2008]). This family of descriptors is local because they create keypoints which characterize a small part of an image, and they are invariant because they are insensitive to zoom, rotation, and translation. Thus, local invariant keypoints are very useful for image alignment, to make panoramic pictures for instance, or for object recognition and tracking.
For our work, however, local invariant features cannot be used, as the output keypoints are localized and cannot be extracted everywhere in an image. For instance, there would be only a few or no SIFT keypoints on a non-textured object, because SIFT is based on contrast and local differences in the image. A fully monochromatic image would have no SIFT keypoints at all.
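This limitation is easy to reproduce. Below is a hedged sketch using OpenCV (cv2.SIFT_create is available in recent OpenCV releases) on synthetic images rather than on the data of the thesis:

```python
# Sketch: SIFT finds no keypoints on a monochromatic image.
import cv2
import numpy as np

sift = cv2.SIFT_create()

flat = np.full((200, 200), 128, dtype=np.uint8)                   # uniform gray
textured = np.random.randint(0, 256, (200, 200), dtype=np.uint8)  # high contrast

print(len(sift.detect(flat, None)))      # 0: no contrast, no keypoints
print(len(sift.detect(textured, None)))  # typically hundreds of keypoints
```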

Bagging, Boosting and Random Forest

Boosting and random forest are part of the ensemble learning family of methods. These methods use a set of weak classifiers and combine them to produce a strong classifier, i.e. a classifier able to separate non-linearly separable datasets and handle non-convex classes. Random forest combines randomly generated trees by training each of them on a subset of the data. Boosting can be used with different kinds of weak classifiers and combines them by weighting them and optimizing these weights. The bootstrap aggregating (bagging) algorithm is a method to generate, from a unique dataset of size N, M datasets of size N. The samples are distributed by sampling the training dataset uniformly with replacement, which means that each generated dataset may contain duplicate samples; a sketch is given below. Boosting algorithms such as AdaBoost, as well as random forest, rely on bagging.
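A minimal sketch of this resampling step, assuming numpy; function and variable names are illustrative:

```python
# Sketch: generate M bootstrap datasets of size N by sampling with replacement.
import numpy as np

def bagging_datasets(X, y, M, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    for _ in range(M):
        idx = rng.integers(0, N, size=N)  # with replacement: duplicates occur
        yield X[idx], y[idx]              # one resampled training set
```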
In classical bagging, the data are distributed following a binomial distribution. In the online case, all the data are not available beforehand and must be distributed on-the-fly. The Poisson distribution is a good choice, as the binomial distribution tends toward the Poisson distribution when the dataset size grows to infinity. Thus, models trained with batch bagging or online bagging converge to the same classifiers as the training set grows to infinity (Oza [2005]); a sketch of the online scheme follows this paragraph.

Saffari et al. [2009, 2010] proposed an online random forest and an online boosting algorithm. These algorithms are used to track faces in videos in real time. In this task, exploration is not needed; thus, estimating the uncertainty or confidence of the classification to drive a sampling process is not required, and this issue is not addressed. One option for measuring uncertainty is to use the classification confidence of a tree, as in the random forest algorithm, where this confidence is used to estimate which tree will give the best prediction for a given sample. The online random forest of Saffari et al. [2009] features an unlearning process that prunes trees or discards entire trees. They use the out-of-bag error to evaluate which tree must be pruned or discarded. This error is computed by evaluating a tree on the samples which were not used for its training; these samples are said to be left out of the bag, hence the name of the error. For boosting, to the best of our knowledge, unlearning has not been addressed, as online boosting does not need such a process to be efficient.
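The online scheme of Oza [2005] mentioned above can be sketched as follows: each incoming sample is shown to each base learner a Poisson(1)-distributed number of times, approximating the bootstrap weights of offline bagging. This assumes base learners exposing an incremental, scikit-learn-style partial_fit; the names are illustrative.

```python
# Sketch of Oza-style online bagging for one incoming labeled sample (x, y).
import numpy as np

def online_bagging_update(learners, x, y, rng):
    for learner in learners:
        k = rng.poisson(1.0)           # Poisson(1) replaces binomial resampling
        for _ in range(k):             # the sample is used 0, 1, or more times
            # Assumes each learner was initialized with the full class list.
            learner.partial_fit([x], [y])
```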

Gaussian Mixture Models with an unknown number of components

In general, to estimate the parameters of a mixture model, the expectation-maximization (EM) algorithm is used. But in order to use the EM algorithm, the number of components, i.e. the number of summed distributions, must be provided. It is a hyperparameter specific to the complexity of the problem. This section reviews works which try to estimate the number of components. Vlassis and Likas [2002] proposed a greedy version of the EM algorithm. They assume that training a mixture with an unknown number of components can be achieved by directly maximizing the log-likelihood. At each step, they add a new component centered on a point of the dataset chosen by maximizing a utility function (the log-likelihood plus a second-order Taylor approximation of the log-likelihood). Then they run a partial EM algorithm to estimate the parameters of the new component while keeping the rest of the mixture fixed.
The main drawback of the method is the assumption that a good density estimate can be achieved by maximizing the log-likelihood alone, which is valid in offline learning. With online learning, such a hypothesis does not hold anymore because the training dataset is only partially representative of the problem. This means that the previously built GMM may no longer be good, therefore a process of unlearning or editing the previously added components is needed. The log-likelihood is then not a sufficient optimization criterion.

Richardson and Green [1997] proposed an algorithm called reversible jump Markov chain Monte Carlo (RJMCMC) for univariate GMMs with an unknown number of components, based on the Metropolis-Hastings update. The Metropolis-Hastings update consists in updating the parameters according to an acceptance probability. This probability is based on the ratio between the density before and after a move, i.e. a variation of a parameter, and on the probability of making this move. The issue with GMMs with an unknown number of components is that the parameter space has variable dimension, while classical MCMC methods are designed for a fixed dimensionality. Besides classical moves (updating the weights, updating the MVND parameters), RJMCMC has split/merge and birth/death moves, which consist in adding or removing components, and thus extending or shrinking the number of parameters. So, these moves change the dimension of the parameter space.
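Neither greedy EM nor RJMCMC is sketched here. As a point of comparison, a common offline baseline for the same problem fits GMMs with an increasing number of components and keeps the one minimizing the Bayesian information criterion (BIC); a minimal sketch with scikit-learn, with illustrative names:

```python
# Sketch: offline selection of the number of GMM components by BIC.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(X, max_components=10):
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bic = gmm.bic(X)               # log-likelihood penalized by model size
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model
```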


Query Strategies in Active Learning

The use of query strategies aims at limiting the number of samples to label. In supervised learning, all the training data must be labeled because they are generally processed in an arbitrary order, as in batch learning for instance. In active learning, the samples are labeled only when they are queried and are assumed to be processed in an optimized, well-chosen order. The classifier is then expected to converge with fewer training data with active learning than with classical supervised learning. The active learning data flow is illustrated in figure 4.2. In this section, different query strategies are reviewed with a focus on uncertainty sampling.
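As a concrete instance of this family, uncertainty sampling can be sketched as querying the unlabeled sample whose predicted class distribution has the highest entropy. The sketch assumes a classifier with a scikit-learn-style predict_proba; names are illustrative.

```python
# Sketch of uncertainty sampling: pick the unlabeled sample with maximal entropy.
import numpy as np

def query_most_uncertain(classifier, X_unlabeled):
    proba = classifier.predict_proba(X_unlabeled)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)  # per-sample entropy
    return int(np.argmax(entropy))  # index of the sample to label next
```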

Object Segmentation by Interactive Perception

The main goal of this set of studies on interactive perception is to obtain a clean segmentation of several objects in a cluttered environment. At the end of an experiment, the objects have to be well separated from each other. The bootstrap step often uses passive computer vision methods without any interaction. This step introduces assumptions about objects and environments. Table 5.1 summarizes the priors and techniques used for object hypothesis generation.
The studies of van Hoof et al. [2014], Gupta and Sukhatme [2012], Chang et al. [2012], Eitel et al. [2017], Patten et al. [2018], and Chaudhary et al. [2016] have a similar goal: separating stacked or heaped objects from each other. Their studies are limited to a tabletop scenario, which allows the table to be segmented, most of the time by using random sample consensus (RANSAC, Fischler and Bolles [1981]). Also, the objects are assumed to be rigid bodies.
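As an illustration of this tabletop assumption, the dominant plane can be removed with RANSAC; a hedged sketch using Open3D's segment_plane (the cited works may rely on other implementations), with an assumed input file name:

```python
# Sketch: RANSAC table-plane removal, leaving object candidates.
import open3d as o3d

pcd = o3d.io.read_point_cloud("scene.pcd")  # hypothetical input cloud
plane_model, inliers = pcd.segment_plane(distance_threshold=0.01,  # 1 cm tolerance
                                         ransac_n=3,
                                         num_iterations=1000)
objects = pcd.select_by_index(inliers, invert=True)  # points off the table
```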

Table of contents:

Bootstrapping the ecological perception of a robot through exploration and interactions (summary in French)
Abstract
Introduction
Context
Related work and domains
The relevance map
Collaborative mixture models
The affordance map
Conclusion
1 Introduction 
2 Context 
2.1 Introduction
2.2 Developmental Robotics
2.3 Classification problem and learning methods
2.3.1 Classification Problem
2.3.2 Semi-supervised and Active learning
2.4 Related Works
2.4.1 Related Domains
2.4.2 Putting it all together
2.5 Conclusion
3 Background 
3.1 Introduction
3.2 Gaussian Mixture Models
3.2.1 Classical GMM
3.2.2 Geometrical analysis of Multivariate Normal Distribution
3.3 Image Processing
3.3.1 Supervoxels Segmentation
3.3.2 Visual Features and descriptors extraction
3.4 Conclusion
4 Collaborative Mixture Models 
4.1 Introduction
4.2 Online Learning
4.2.1 Support Vector Machines
4.2.2 Bagging, Boosting and Random Forest
4.2.3 Mixture Models
4.3 Gaussian Mixture Models with an unknown number of components
4.4 Query Strategies in Active Learning
4.4.1 Uncertainty Sampling
4.4.2 Other Query Strategies
4.5 Definition of the classifier
4.6 Algorithm
4.6.1 Split and Merge operation
4.6.2 Query Strategy
4.7 Conclusion
5 Relevance Map 
5.1 Introduction
5.2 Interactive Perception
5.2.1 Object Segmentation by Interactive Perception
5.2.2 Discussion
5.3 Saliency Map
5.3.1 Salient Object Detection
5.3.2 Discussion
5.4 Method
5.4.1 Overview
5.4.2 Features Extraction
5.4.3 Building the Relevance Map
5.4.4 Query Strategy
5.4.5 Push Primitive
5.4.6 Change Detection
5.5 Experiments
5.5.1 Protocol
5.5.2 Classification Quality Measures
5.6 Results
5.6.1 Simplified Setups
5.6.2 Real World Experiments
5.7 Discussion and Future work
5.8 Conclusion
6 Extensive study of CMMs 
6.1 Introduction
6.2 Splitting and Merging
6.2.1 Protocol
6.2.2 Results
6.3 Query strategy
6.4 Supervoxel features
6.4.1 Protocol
6.4.2 Results
6.5 Discussion and Future works
6.6 Conclusion
7 Affordances Map 
7.1 Introduction
7.2 Affordances
7.2.1 Foundation and Definition(s)
7.2.2 Affordances in Robotics
7.2.3 Learning affordances from local features
7.3 Method
7.3.1 Affordances Formalisation
7.3.2 Classifier
7.3.3 Primitives and Effects Detection
7.4 Experiments
7.5 Results
7.6 Discussion and Future Works
7.7 Conclusion
8 Conclusion and Discussions 
8.1 Summary of the contributions
8.2 Discussion and Limitations
8.2.1 CMMs Limitations
8.2.2 Supervoxels
8.2.3 Learning from local features
8.3 Future Works
8.3.1 Possible Improvements
8.3.2 Next Developmental Steps
Bibliography 
A Singular Value Decomposition (SVD)
