Unsupervised Layer-Wise Model Selection in Deep Neural Networks 


Preprocessing and feature space

A solution, instead of trying to increase model complexity, possibly at the expense of generalization and computational cost, is to create a vector of features Φ(x) = (φ_1(x), …, φ_K(x)) as a pre-processing step, which are then meant to be used as input to the learning procedure. The objective of the new supervised learning problem is then to approximate the function f*: Φ(x) → y. If the features φ_i(x) extract relevant information from the raw data x, the result is a simplified problem which can hopefully be solved using a simple model. The feature extraction function Φ can be seen as projecting data into a feature space, i.e. representing data so as to make the Euclidean distance between training examples d_E(Φ(x), Φ(x′)) more meaningful than in the input space. Figure 2.7 shows how such a projection can make a non-linearly separable problem separable.
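A minimal sketch of this idea (placeholder data, NumPy assumed; not the projection of Figure 2.7): points labelled by whether they lie inside a circle are not linearly separable in the input plane, but adding the feature x_1² + x_2² makes a single threshold on that coordinate separate the two classes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # 1 inside the circle of radius sqrt(0.5)

def phi(x):
    """Feature map Phi(x) = (x1, x2, x1^2 + x2^2)."""
    return np.array([x[0], x[1], x[0] ** 2 + x[1] ** 2])

features = np.apply_along_axis(phi, 1, X)

# In the feature space, thresholding the third coordinate at 0.5 recovers the labels
# exactly, i.e. the problem has become linearly separable.
print(np.all((features[:, 2] < 0.5) == (y == 1)))      # True
```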

The kernel trick

In the illustration above, the projection Φ is easily computable, which allows us to show the data in the feature space. However, most linear methods do not require the exact coordinates Φ(x) of points in the new feature space but only rely on the inner products ⟨w, x⟩ between features w and input vectors x. Whereas the usual inner product is taken in the input space where ⟨w, x⟩ = wᵀx, the so-called kernel trick proposes to directly define an inner product K(w, x) without explicitly formulating the feature space to which it corresponds. Common choices of kernels include the polynomial kernel: K(w, x) = (wᵀx + c)^d for d ≥ 0.
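To make the correspondence concrete, here is a small numerical check (my own sketch, not from the article; NumPy assumed, shown for d = 2 and c = 1): the kernel value (wᵀx + c)² computed in the input space equals the ordinary inner product of explicit degree-2 feature vectors, which is exactly the feature space the kernel trick lets us avoid constructing.

```python
import numpy as np

def poly_kernel(w, x, c=1.0, d=2):
    """Polynomial kernel K(w, x) = (w^T x + c)^d, computed in the input space."""
    return (np.dot(w, x) + c) ** d

def poly_features(x, c=1.0):
    """Explicit feature map for d = 2 such that <phi(w), phi(x)> = (w^T x + c)^2."""
    pairwise = np.outer(x, x).ravel()        # all products x_i * x_j
    linear = np.sqrt(2.0 * c) * x            # cross terms with the constant c
    return np.concatenate([pairwise, linear, [c]])

rng = np.random.default_rng(0)
w, x = rng.normal(size=3), rng.normal(size=3)
print(poly_kernel(w, x))                              # via the kernel trick
print(np.dot(poly_features(w), poly_features(x)))     # via explicit features: same value
```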

Unsupervised representation learning

Although the projection into a feature space can be a very powerful tool, interesting features are often found with hard work, sometimes after years of research. Nevertheless, it is sometimes possible to learn interesting features with an unsupervised algorithm, i.e. with unsupervised representation learning, a central point of this thesis. Assuming a suitable set of features can be learned, the final supervised problem becomes simpler.
Putting aside the usually simple final problem, one could argue that learning representations only transforms a supervised learning problem into an equally difficult unsupervised learning problem. However, this transformation has several benefits.
Better generalization. In any practical application, the input variable x is expected to carry a lot of information about itself and only little information about the target variable y. Accordingly, an unsupervised learning problem on the variable x has access to a lot of information during learning and is thus less prone to over-fitting than the supervised learning problem on x and y. Additionally, once a representation has been obtained with unsupervised learning, the final supervised learning problem can be solved with a small number of parameters, which means that over-fitting is, again, less likely.
Access to more data with semi-supervised learning. Learning representations with an unsupervised learning algorithm has the advantage that it only depends on unlabeled data, which is easily available in most settings. Because more data is available, more complex models can be learned without an adverse effect on generalization. In essence, the complete learning procedure can then leverage the information contained in an unlabeled dataset to perform better on a supervised task. This approach is known as semi-supervised learning. Formally, a semi-supervised learning problem is given by two datasets: a labeled dataset D_L = {(x_1, y_1), …, (x_L, y_L)} and an unlabeled dataset D_U = {x_{L+1}, …, x_{L+U}}.
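A minimal sketch of this two-stage procedure (placeholder data and shapes; scikit-learn assumed, with PCA standing in for the unsupervised feature learners discussed later in the thesis): the representation is fitted on the large unlabeled set, and the final supervised model, with few parameters, is fitted on the small labeled set.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(10_000, 50))   # large unlabeled dataset: only x is observed
X_labeled = rng.normal(size=(100, 50))        # small labeled dataset
y_labeled = rng.integers(0, 2, size=100)

# Unsupervised representation learning on the unlabeled data
# (PCA is a stand-in for the RBM / auto-encoder layers of the thesis).
encoder = PCA(n_components=10).fit(X_unlabeled)

# The final supervised problem is solved in the learned feature space with few parameters.
classifier = LogisticRegression().fit(encoder.transform(X_labeled), y_labeled)
print(classifier.score(encoder.transform(X_labeled), y_labeled))
```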

Maximum a-posteriori and maximum likelihood

When trying to learn a distribution, the goal is often to find the best possible parameter value. The problem is then to maximize the posterior distribution with respect to the parameter θ, i.e. we are looking for θ* such that θ* = argmax_θ p(θ|D). θ* is then called the maximum a-posteriori estimate because it maximizes the posterior distribution.
Note that the evidence Σ_θ p(D|θ)p(θ) can in fact be written p(D) (marginalization rule) and does not depend on the parameter θ. Applying Bayes’ rule, the maximization problem is therefore equivalent to θ* = argmax_θ p(D|θ)p(θ), where we take the likelihood and the prior into account as expected. In cases where there is no useful prior, the prior can be chosen to be uniform and therefore does not depend on θ, i.e. p(θ) = const. We can further simplify the optimization problem into θ* = argmax_θ p(D|θ).
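As a small numerical illustration (a coin-flip example of my own, not taken from the thesis): maximizing log p(D|θ) + log p(θ) over a grid of parameter values shows that the MAP estimate coincides with the maximum-likelihood estimate under a uniform prior, while an informative prior pulls the estimate toward its own mode.

```python
import numpy as np

data = np.array([1, 1, 0, 1, 0, 1, 1, 1])        # observed coin flips (1 = heads)
theta = np.linspace(1e-3, 1 - 1e-3, 1001)        # grid of candidate parameter values

heads, tails = data.sum(), len(data) - data.sum()
log_likelihood = heads * np.log(theta) + tails * np.log(1 - theta)

uniform_log_prior = np.zeros_like(theta)                       # p(theta) = const
beta_log_prior = 2 * np.log(theta) + 2 * np.log(1 - theta)     # Beta(3, 3): favors theta near 0.5

ml_estimate  = theta[np.argmax(log_likelihood)]
map_uniform  = theta[np.argmax(log_likelihood + uniform_log_prior)]
map_informed = theta[np.argmax(log_likelihood + beta_log_prior)]

# ML == MAP under the uniform prior; the Beta prior pulls the estimate toward 0.5.
print(ml_estimate, map_uniform, map_informed)
```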


Table of contents:

Introduction
i optimization and machine learning 
1 optimization 
1.1 Problem statement
1.2 The curse of dimensionality
1.3 Convex functions
1.4 Continuous differentiable functions
1.5 Gradient descent
1.6 Black-box optimization and stochastic optimization
1.7 Evolutionary algorithms
1.8 EDAs
2 from optimization to machine learning 
2.1 Supervised and unsupervised learning
2.2 Generalization
2.3 Supervised Example: Linear classification
2.4 Unsupervised Example: Clustering and K-means
2.5 Supervised Example: Polynomial regression
2.6 Model selection
2.7 Changing representations
2.7.1 Preprocessing and feature space
2.7.2 The kernel trick
2.7.3 The manifold perspective
2.7.4 Unsupervised representation learning
3 learning with probabilities 
3.1 Notions in probability theory
3.1.1 Sampling from complex distributions
3.2 Density estimation
3.2.1 KL-divergence and likelihood
3.2.2 Bayes’ rule
3.3 Maximum a-posteriori and maximum likelihood
3.4 Choosing a prior
3.5 Example: Maximum likelihood for the Gaussian
3.6 Example: Probabilistic polynomial regression
3.7 Latent variables and Expectation Maximization
3.8 Example: Gaussian mixtures and EM
3.9 Optimization revisited in the context of maximum likelihood
3.9.1 Gradient dependence on metrics and parametrization
3.9.2 The natural gradient
ii deep learning 
4 artificial neural networks 
4.1 The artificial neuron
4.1.1 Biological inspiration
4.1.2 The artificial neuron model
4.1.3 A visual representation for images
4.2 Feed-forward neural networks
4.3 Activation functions
4.4 Training with back-propagation
4.5 Auto-encoders
4.6 Boltzmann Machines
4.7 Restricted Boltzmann machines
4.8 Training RBMs with Contrastive Divergence
5 deep neural networks 
5.1 Shallow vs. deep architectures
5.2 Deep feed-forward networks
5.3 Convolutional networks
5.4 Layer-wise learning of deep representations
5.5 Stacked RBMs and deep belief networks
5.6 Stacked auto-encoders and deep auto-encoders
5.7 Variations on RBMs and stacked RBMs
5.8 Tractable estimation of the log-likelihood
5.9 Variations on auto-encoders
5.10 Richer models for layers
5.11 Concrete breakthroughs
5.12 Principles of deep learning under question?
6 what can we do? 
iii contributions 
7 presentation of the first article 
7.1 Context
7.2 Contributions
Unsupervised Layer-Wise Model Selection in Deep Neural Networks 
1 Introduction
2 Deep Neural Networks
2.1 Restricted Boltzmann Machine (RBM)
2.2 Stacked RBMs
2.3 Stacked Auto-Associators
3 Unsupervised Model Selection
3.1 Position of the problem
3.2 Reconstruction Error
3.3 Optimum selection
4 Experimental Validation
4.1 Goals of experiments
4.2 Experimental setting
4.3 Feasibility and stability
4.4 Efficiency and consistency
4.5 Generality
4.6 Model selection and training process
5 Conclusion and Perspectives
References
7.3 Discussion
8 presentation of the second article
8.1 Context
8.2 Contributions
Layer-wise training of deep generative models 
Introduction
1 Deep generative models
1.1 Deep models: probability decomposition
1.2 Data log-likelihood
1.3 Learning by gradient ascent for deep architectures
2 Layer-wise deep learning
2.1 A theoretical guarantee
2.2 The Best Latent Marginal Upper Bound
2.3 Relation with Stacked RBMs
2.4 Relation with Auto-Encoders
2.5 From stacked RBMs to auto-encoders: layer-wise consistency
2.6 Relation to fine-tuning
2.7 Data Incorporation: Properties of qD
3 Applications and Experiments
3.1 Low-Dimensional Deep Datasets
3.2 Deep Generative Auto-Encoder Training
3.3 Layer-Wise Evaluation of Deep Belief Networks
Conclusions
References
8.3 Discussion
9 presentation of the third article
9.1 Context
9.2 Contributions
Information-Geometric Optimization Algorithms: A Unifying Picture via Invariance Principles 
Introduction
1 Algorithm description
1.1 The natural gradient on parameter space
1.2 IGO: Information-geometric optimization
2 First properties of IGO
3 IGO, maximum likelihood, and the cross-entropy method
4 CMA-ES, NES, EDAs and PBIL from the IGO framework
5 Multimodal optimization using restricted Boltzmann machines
5.1 IGO for restricted Boltzmann machines
5.2 Experimental setup
5.3 Experimental results
5.4 Convergence to the continuous-time limit
6 Further discussion and perspectives
Summary and conclusion
Appendix: Proofs
References
9.3 Discussion
Conclusion and perspectives
bibliography 
