Hard Multi-Task Metric Learning for Kernel Regression 

Understanding overfitting

In this subsection, we use a polynomial regression example to introduce the notion of overfitting. We discuss overfitting and its relationship to model complexity and training set size before explaining how overfitting is commonly estimated.
For simplicity, let us consider a scalar feature $f$ and a scalar label $l$. Let us also consider that the label and the feature are linked by a polynomial relationship of order 2, with $a$, $b$ and $c$ real-valued coefficients: $l = a f^2 + b f + c$.
We sample this function, add white noise, and fit polynomials of different degrees to the resulting points. The obtained results are represented on figure 2.1: solid green curves represent the initial polynomial, blue points represent the noisy samples, and red dashed curves correspond to the estimated polynomials.
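As an illustration, the following sketch reproduces the spirit of this experiment with numpy; the coefficient values, noise level and tested degrees are arbitrary choices, not the ones used to produce figure 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth degree-2 polynomial l = a*f**2 + b*f + c (arbitrary coefficients)
a, b, c = 1.0, -2.0, 0.5
f = np.linspace(-1.0, 1.0, 30)
labels = a * f**2 + b * f + c + rng.normal(scale=0.1, size=f.shape)  # sampling plus white noise

# Fit polynomials of increasing degree and compare their training errors
for degree in (1, 2, 9):
    coeffs = np.polyfit(f, labels, degree)
    mse = np.mean((np.polyval(coeffs, f) - labels) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")
```

The training error keeps decreasing as the degree grows, while the fit to the underlying polynomial degrades beyond degree 2: this is precisely the overfitting behavior discussed above.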

Semi-parametric models

Semi-parametric models make use of a cost function for optimizing a set of parameters. However, contrary to parametric models, these parameters alone are not sufficient for predicting test samples: the estimator uses the training samples in addition to the learned parameters.
In this subsection, we present a metric learning algorithm called Large Margin Nearest Neighbors (LMNN), proposed by Weinberger et al. [110]. The core idea of the algorithm is to estimate a Mahalanobis metric suited to a k-NN classifier. This algorithm translates the maximum-margin learning principle behind SVM to k-nearest-neighbor classification. It is based on the convex optimization of a cost function which is the weighted sum of two terms, the first of which (the pull term) is: $\varepsilon_1(M) = \sum_{i,j} \eta_{ij}\, d_M(x_i, x_j)^2$, where $\eta_{ij} = 1$ if $x_j$ is a target neighbor of $x_i$ and $0$ otherwise.
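As a concrete illustration of this pull term, the sketch below evaluates $\varepsilon_1(M)$ for a given metric matrix and a list of target neighbors; the toy data and the choice of a single target neighbor per sample are assumptions made for the example, not part of the original algorithm description.

```python
import numpy as np

def pull_term(M, X, target_neighbors):
    """First (pull) term of the LMNN cost: the sum of squared Mahalanobis
    distances between each sample and its target neighbors."""
    cost = 0.0
    for i, neighbors in enumerate(target_neighbors):
        for j in neighbors:
            diff = X[i] - X[j]
            cost += diff @ M @ diff  # d_M(x_i, x_j)^2, M positive semi-definite
    return cost

# Toy usage: four 2-D samples, each with a single target neighbor
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
target_neighbors = [[1], [0], [3], [2]]
print(pull_term(np.eye(2), X, target_neighbors))  # Euclidean metric as a starting point
```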

Metric Learning for Kernel Regression

In this section, we present and discuss the MLKR method on which our proposed regression methods are built. We first introduce the algorithm. Afterwards, we discuss issues caused by its non-convexity. Then, we discuss its training time and memory complexities and its robustness to overfitting. Finally, we discuss the extrapolation capabilities of its prediction function.

The MLKR method

Using the MLKR method, a test label is predicted with the Nadaraya-Watson estimator [70]. As we previously discussed, the space in which the samples lie has an important impact on prediction quality, which makes dimensionality reduction a relevant initial step. The goal of the MLKR method is to estimate the linear subspace that minimizes the Nadaraya-Watson squared error on the training set, using the commonly used Gaussian kernel. Considering an initial space of dimension $n_d$ and a reduced space of dimension $n_r$, the MLKR method estimates the projection matrix $A \in \mathcal{M}_{n_r,n_d}(\mathbb{R})$ that minimizes the following error: $L(A) = \sum_{i=1}^{n_s} (\hat{y}_i - y_i)^2$, where $n_s$ is the number of training samples and $\hat{y}_i$ is the Nadaraya-Watson prediction for sample $i$.
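To make the objective concrete, the following sketch evaluates the Nadaraya-Watson prediction in the reduced space and the corresponding squared error on the training set; it assumes the Gaussian kernel variance is absorbed into $A$ and, following the usual MLKR formulation, predicts each training label from the remaining samples. Function names and data shapes are illustrative only.

```python
import numpy as np

def nadaraya_watson_predict(A, X_train, y_train, x_test):
    """Nadaraya-Watson prediction in the space projected by A, with a
    Gaussian kernel whose variance is absorbed into A."""
    d2 = np.sum((X_train @ A.T - x_test @ A.T) ** 2, axis=1)  # squared distances in the reduced space
    w = np.exp(-d2)                                           # Gaussian kernel weights
    return np.dot(w, y_train) / np.sum(w)

def mlkr_loss(A, X_train, y_train):
    """Squared error L(A) = sum_i (yhat_i - y_i)^2, each training label
    being predicted from the remaining training samples."""
    n = len(y_train)
    loss = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        yhat_i = nadaraya_watson_predict(A, X_train[mask], y_train[mask], X_train[i])
        loss += (yhat_i - y_train[i]) ** 2
    return loss
```

In practice, $L(A)$ would be minimized by gradient descent over the entries of $A$, which is the optimization discussed in the following subsections.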

About MLKR convexity

Contrary to SVM, the MLKR optimization is non-convex. However, depending on the shape of the cost function, non-convexity may be a more or less important issue. For instance, if the cost function has a relatively small number of local minima with the same energy value, non-convexity may not even be an issue. However, if the cost function is highly non-convex, with dense local minima of various energy values, it may lead to a very inconvenient optimization process. In this subsection, we present simulations for evaluating the difficulties caused by MLKR non-convexity and discuss the parameters that impact the gradient descent.

We defined a function linking the label to the features in a non-linear manner, with enough non-linearity to induce issues in the optimization process. Let $\{f_i, i \in [\![1; n_f]\!]\}$ be a set of $n_f$ features. The label is defined according to: $\mathrm{label} = k \cos(a_1 f_1 + b_1)\cos(a_2 f_2 + b_2) + \cos(a_3 f_3 + b_3)$, with $k, a_1, a_2, a_3, b_1, b_2, b_3$ all being real-valued parameters. The label is thus linked to three of the features by a non-linear relationship. To give an idea of the induced non-linearity, we represent on figure 2.3 the label (corresponding to point colors) with respect to the three related features, using 600 randomly generated data points. We can notice that the small-intensity labels are located in three different areas of the space. We designed three different tests for evaluating the impact of the kernel variance, the amount of noise and the number of data points.
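For reference, synthetic data of this form can be generated along the following lines; the actual values of $k$, $a_1, a_2, a_3$, $b_1, b_2, b_3$ and the feature sampling range are not specified in the text, so the ones below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder parameter values; the text only states that they are real-valued.
k, a1, a2, a3 = 2.0, 3.0, 3.0, 3.0
b1, b2, b3 = 0.5, -0.5, 1.0

n_points, n_features = 600, 5  # 600 points in a five-dimensional feature space
F = rng.uniform(-1.0, 1.0, size=(n_points, n_features))  # only the first three features drive the label
labels = (k * np.cos(a1 * F[:, 0] + b1) * np.cos(a2 * F[:, 1] + b2)
          + np.cos(a3 * F[:, 2] + b3))
```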
The first test aims at evaluating the influence of the kernel variance on the convexity of MLKR. We use 600 randomly generated data points in a five-dimensional space. We run the MLKR algorithm with 20 random initializations for different kernel variances and store the mean squared errors. We represent on figure 2.4 the mean and variance of the obtained errors. In this test, the best kernel variance is $\frac{1}{10}$, because it leads to the smallest mean error as well as the smallest variance. The large differences between the different kernel variances show that the kernel variance has to be adapted to the data for the optimization process to be efficient. At first sight, this seems illogical, since multiplying the parameter matrix $A$ by a real-valued coefficient should make up for the kernel variance. However, this only holds within an acceptable range: a kernel variance that is too small or too large causes numerical issues in the optimization process because the initialization becomes unsuitable. When the kernel variance is too small, each point numerically seems to have only one neighbor. Indeed, we have: $\forall (a,b) \in \mathbb{R}^2$ such that $b > a > 0$, $\lim_{\sigma \to 0^+} \frac{e^{-a/\sigma}}{e^{-b/\sigma}} = +\infty$.
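The numerical consequence of a too small kernel variance can be checked directly: the ratio between the kernel weights of two neighbors at different distances blows up as $\sigma$ shrinks, so only the closest neighbor effectively contributes to the prediction. The distance and variance values below are arbitrary.

```python
import numpy as np

# Ratio of the Gaussian kernel weights of two neighbors at (squared) distances a < b:
# exp(-a/sigma) / exp(-b/sigma) = exp((b - a)/sigma), which diverges as sigma -> 0+.
a, b = 1.0, 2.0
for sigma in (1.0, 0.2, 0.1, 0.05, 0.02):
    ratio = np.exp((b - a) / sigma)  # numerically stable form of the quotient
    print(f"sigma = {sigma}: weight ratio = {ratio:.3e}")
```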


Table of contents:

List of figures
List of tables
1 Introduction 
1.1 What can be inferred from faces?
1.1.1 Facial Action Coding System
1.1.2 Towards high level information
1.2 Automatic facial analysis: applications and challenges
1.2.1 A few applications
1.2.2 How do automatic facial analysis systems work?
1.2.3 Difficulties in automatic facial expression analysis
1.2.4 International challenges
1.3 From landmark detection towards emotion recognition
1.3.1 Landmark detection
1.3.2 AU prediction
1.3.3 Mental state recognition
1.4 Outline and contributions
2 Hard Multi-Task Metric Learning for Kernel Regression 
2.1 Introduction to machine learning
2.1.1 A few definitions
2.1.2 Different methods
2.1.3 Understanding overfitting
2.2 Different model types
2.2.1 Non-parametric models
2.2.2 Parametric models
2.2.3 Semi-parametric models
2.3 Metric Learning for Kernel Regression
2.3.1 The MLKR method
2.3.2 About MLKR convexity
2.3.3 About MLKR complexity
2.3.4 About overfitting
2.3.5 About Nadaraya-Watson extrapolation capabilities
2.4 Our extensions
2.4.1 Feature selection
2.4.2 Stochastic gradient descent
2.4.3 Lasso-regularization
2.4.4 Multi-dimensional label extensions
2.5 Conclusion
3 Facial landmark detection 
3.1 Introduction
3.2 Commonly used appearance features
3.3 The 300W database
3.4 Our facial landmark prediction framework
3.4.1 Feature extraction
3.4.2 Proposed regression method
3.4.3 Experimental setup
3.5 Results on the 300W dataset
3.5.1 HOG normalizations
3.5.2 Comparison to CS-MLKR
3.5.3 Embedding more training data samples
3.5.4 Comparison to global PCA and Linear Regression
3.5.5 Comparison to state-of-the-art methods
3.6 Conclusion
4 Action Unit prediction 
4.1 Introduction
4.2 The BP4D dataset
4.3 AU prediction framework
4.3.1 Feature extraction
4.3.2 About learning with video data
4.3.3 Experimental setup
4.4 Results on the BP4D dataset
4.4.1 Analysis of feature impact
4.4.2 Evaluations and results on the BP4D dataset
4.4.3 Evaluation of regularization impact
4.4.4 Comparison to baseline systems on the FERA’15 development set
4.4.5 Comparison to baseline systems on the FERA’15 test set
4.4.6 Comparison to other participants on the FERA’15 test set
4.5 Conclusion
5 Conclusion and future works 
5.1 Conclusion
5.2 Future works
5.2.1 Towards coupling database design and model training
5.2.2 Towards smart system adaptation
5.2.3 Towards handling big data sets
References 
Appendix A Iterative Regularized Metric Learning 
Appendix B Emotion Prediction in a Continuous Space 
Appendix C Binary Map based Landmark Localization 
