
## Graphical Lasso for Gaussian mixtures

As we saw in the previous sections, the number of free parameters in a full GMM with $K$ components in dimension $p$ is $(K-1) + Kp + Kp(p+1)/2$, which means, for instance, that for $K = 5$ and $p = 100$ we have 25 754 parameters to estimate. In this high-dimensional setting, the EM algorithm experiences severe performance degradation. In particular, the inversion of the covariance matrices is one source of error. One way to circumvent these problems is to use regularization. To this end, we make a structural assumption on the inverses of the covariance matrices of the components, called precision or concentration matrices. The work presented in this chapter is inspired by [Friedman et al., 2007], [Banerjee et al., 2008], [Yuan and Lin, 2007] and [Meinshausen and Bühlmann, 2006], where it is suggested to penalize the off-diagonal entries of the precision matrix of a Gaussian graphical model. We attempt to generalize this work to the Gaussian mixture model.

We consider $X = (X^{(1)}, \ldots, X^{(p)})$, a random vector admitting a $p$-dimensional normal distribution $\mathcal{N}(\mu, \Sigma)$ with a non-singular $\Sigma$. One can construct an undirected graph $G = (V, E)$ with $p$ vertices, one per coordinate, where the edges $E = (e_{i,j})_{1 \le i < j \le p}$ describe the conditional independence relations among $X^{(1)}, \ldots, X^{(p)}$: the edge $e_{i,j}$ is absent from $E$ if and only if $X^{(i)}$ and $X^{(j)}$ are independent conditionally on the other variables $\{X^{(l)}\}_{l \ne i,j}$ (denoted $X^{(i)} \perp\!\!\!\perp X^{(j)} \mid X^{(l)},\ l \ne i,j$). Then $G$ is called the Gaussian concentration graph model for the Gaussian random vector $X$. This representation is particularly interesting for the study of the inverse of the covariance matrix. Let us denote by $\Sigma^{-1} = \Omega = (\omega_{i,j})$ the precision matrix. The entries of this matrix satisfy $\omega_{i,j} = 0$ if and only if $X^{(i)} \perp\!\!\!\perp X^{(j)}$ conditionally on all the other variables. We recall this well-known result in the following lemma.
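The link between zeros of the precision matrix and conditional independence is what the graphical lasso exploits: an $\ell_1$ penalty on the off-diagonal entries of $\Omega$ recovers a sparse concentration graph. As a minimal illustration (not the method developed in this chapter, and using scikit-learn's `GraphicalLasso` on synthetic chain-graph data of our own choosing), one can check that penalized estimation shrinks the entries corresponding to absent edges towards zero:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 5

# Ground-truth sparse precision matrix: a tridiagonal (chain) graph,
# so each coordinate is conditionally dependent only on its neighbours.
omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
sigma = np.linalg.inv(omega)

# Sample from N(0, sigma) and estimate the precision matrix with an
# l1 penalty on its off-diagonal entries.
X = rng.multivariate_normal(np.zeros(p), sigma, size=2000)
model = GraphicalLasso(alpha=0.05).fit(X)

# Entries corresponding to absent edges (e.g. between coordinates 0
# and 4) are shrunk towards zero, while true edges survive.
est = model.precision_
```

The penalty level `alpha` plays the role of the regularization parameter trading off sparsity against fit; its choice is part of the tuning difficulty discussed later in the chapter.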

### Estimating the number of clusters

In this section, we focus on the problem of estimating the number of clusters, $K$, in a Gaussian mixture model. Most popular clustering methods, such as K-means, Expectation-Maximization with a Gaussian mixture model, or spectral clustering, require this parameter as input. Various methods exist to select the best model according to a given criterion. As we saw in Section 1.1.4, a common approach is to perform multiple clusterings with the parameter $K$ ranging from, say, 2 to $K_{\max}$, where $K_{\max}$ is the maximum number of clusters we assume are present in the dataset, and to select the best model according to some prescribed criterion. In this work, we seek to incorporate the model selection step into the estimation step by means of an "adaptive" sparse estimation of the weight vector of the Gaussian components.
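The common approach described above can be sketched in a few lines. This is not the adaptive method proposed in this work, but the baseline it aims to replace: fit one GMM per candidate $K$ (here with scikit-learn's `GaussianMixture`, on synthetic data chosen for illustration) and keep the model minimizing a criterion such as BIC:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic data: three well-separated Gaussian clusters in 2D.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(200, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

# Fit one GMM per candidate K in {1, ..., K_max} and score each by BIC.
K_max = 6
bics = []
for k in range(1, K_max + 1):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    bics.append(gmm.bic(X))

# Select the K minimising the criterion.
best_K = int(np.argmin(bics)) + 1
```

Note that this requires $K_{\max}$ separate fits; the sparse-weight approach pursued in this work instead folds the selection of $K$ into a single estimation problem.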

#### Clustering and density estimation

We have tried several approaches to performing clustering using the GMM and various notions of sparsity. Unfortunately, the results were never as good as expected. One of the reasons is that all these approaches are too complex to be easily understood. In particular, they involve a number of parameters to be tuned, which turned out to be a difficult task. This led us to consider a slightly simpler task, investigated in the next chapters: choosing the weights of the components in a mixture, assuming that the densities of the components (which can be interpreted as clusters) are known in advance. In practice, these densities can be supplied by some other algorithm or by an expert. This approach can be seen as an ensemble method applied to unsupervised learning.
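To make the simpler task concrete, here is a minimal sketch (an assumption of ours for illustration, not the estimator analysed in the following chapters) of estimating only the weight vector by EM when the component densities are fixed: with the densities known, the M-step reduces to averaging the responsibilities.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

# Dictionary of known component densities (e.g. supplied by an expert
# or by another algorithm): three fixed unit-variance Gaussians.
means = np.array([-3.0, 0.0, 3.0])
true_w = np.array([0.5, 0.2, 0.3])

# Sample from the true mixture; only the weight vector is unknown.
n = 5000
labels = rng.choice(3, size=n, p=true_w)
X = rng.normal(means[labels], 1.0)

# EM restricted to the weights. The component densities never change,
# so they can be evaluated once, outside the loop.
dens = np.column_stack([norm(m, 1.0).pdf(X) for m in means])  # n x K
w = np.full(3, 1.0 / 3.0)
for _ in range(100):
    resp = w * dens
    resp /= resp.sum(axis=1, keepdims=True)  # E-step: responsibilities
    w = resp.mean(axis=0)                    # M-step: average them
```

Because the only unknowns are the $K-1$ free weights, this problem sidesteps the high-dimensional covariance estimation issues discussed earlier in the chapter.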

**Oracle inequalities in deviation and in expectation**

In this work, we prove several non-asymptotic risk bounds that imply, in particular, that the maximum likelihood estimator is optimal for model-selection aggregation, convex aggregation and $D$-sparse aggregation (up to logarithmic factors). In all the results of this section we assume the parameter in (3.3) to be equal to 0.

**Table of contents:**

**1 Introduction**

1.1 Clustering and density estimation problem

1.1.1 Centroid-Based Clustering: K-means

1.1.2 Agglomerative Hierarchical Methods

1.1.3 Spectral clustering

1.1.4 Finding the number of clusters

1.2 The Gaussian Mixture Model

1.2.1 EM Algorithm

1.2.2 K-means from the EM angle

1.3 The curse of dimensionality

**2 Partial contributions to clustering**

2.1 Graphical Lasso for Gaussian mixtures

2.1.1 Graphical Lasso on Gaussian mixtures

2.2 Estimating the number of clusters

2.2.1 First method: regularizing the posterior probabilities

2.2.2 Second method: penalizing the weight vector

2.3 Clustering and density estimation

**3 KL-Aggregation in Density Estimation**

3.1 Introduction

3.1.1 Related work

3.1.2 Additional notation

3.1.3 Agenda

3.2 Oracles inequalities

3.3 Discussion of the results

3.3.1 Lower bounds

3.3.2 Weight vector estimation

3.3.3 Extensions

3.4 Conclusion

3.5 Proofs: Upper bounds

3.5.1 Proof of Theorem 3.2.1

3.5.2 Proof of Theorem 3.2.2

3.5.3 Proof of Theorem 3.2.3

3.5.4 Proof of Proposition 2

3.5.5 Proof of Proposition 3

3.5.6 Auxiliary results

3.6 Proofs: Lower bounds

3.6.1 Lower bound on HF(0;D)

3.6.2 Lower bound on HF(; 1)

3.6.3 Lower bound holding for all densities

**4 Experiments for the KL-aggregation**

4.1 Introduction

4.2 Implementation

4.3 Alternative methods considered

4.3.1 SPADES

4.3.2 Adaptive Dantzig density estimation

4.3.3 Kernel density estimation

4.4 Experimental Evaluation

4.4.1 Dictionaries considered

4.4.2 Densities considered

4.4.3 Discussion of the results

4.5 A method for constructing the dictionary of densities

4.5.1 Implementation of the dictionary generator

4.5.2 Experimental evaluation

4.5.3 Concluding remarks