Get Complete Project Material File(s) Now! »

## Atomic norms for leveraging structure

In many machine learning applications, and particularly for ill-posed problems, models are constrained structurally so they have a simple representation in terms of some fundamental elements, consisting for example of a collection of specific vectors, matrices or tensors. Examples of such elements include sparse vectors for many sparsity inducing norms, rank-one matrices for the trace norm or low-rank tensors as used in the nuclear tensor norm (Liu et al., 2013). We call atoms these elements and atomic set A their (possibly uncountable) collection. In this section we remind usefull concepts and results of atomic norms that can be found in classical literature (Rockafellar, 1970; Chandrasekaran et al., 2012). We describe a model as « simple » if it can be written as a nonnegative combination of a « few » elements from an atomic set. In other words, the model is « simple » if it is a sparse combination of atoms. Parsimonious models are useful in machine learning for three main reasons: they lead to a better generalization of the model (avoid overfitting); they give interpretability through atom selection (extension of variable selection) and in some cases they are computationally cheaper both to learn at to perform predictions. The computational advantage depends on how hard is to find the atoms since the problem can be NP-hard.

### Definition from a collection of atoms

We define now atomic gauges. Let A be a subset of Rp, defined as a collection of atoms. These atoms can be sparse vectors, rank-one matrices and multiple other choices. The penalty of an element x in Rp is the minimum sum of non-negative weights ci such that x writes as a linear combination of atoms (ai)i > A, x = Pi ciai. Therefore, an atomic gauge is defined from its set of atoms A as the norm defined on the convex envelope of A, denoted CA. The formal definition of atomic gauge is stated below.

#### Coordinate descent

Coordinate descend methods optimize, either exatly or approximately, at each iteration the objective with respect to a single variable at a time while others are kept fixed. They all extend naturally to block coordinate descend where the optimization step is done on a block of variables while keeping the others fixed. Coordinate descent algorithms have many variants. The update applied to each variable in the optimization step can take different forms of approximate updates such as: one or a few gradient descent steps; one or a few projected gradient descent steps; one or a few proximal gradient steps; acceleration steps etc. We focus on presenting proximal coordinate descent. Let us consider problem in (3.1). When is separable, meaning that it can be written.

**Applying Frank-Wolfe to the regularized problem**

Beyond constrained optimization problems, the basic conditional gradient algorithm (corresponding to plain FW when f is quadratic) has been generalized to solve problems of the form minx f(x) + (x) where the set constraint CA is replaced by a proper convex function for which the subgradient of can be computed efficiently (Bredies et al., 2009; Yu et al., 2014). Bach (2015) shows that the obtained algorithm can be interpreted as a dual mirror descent algorithm. Yu et al. (2014); Bach (2015) and Nesterov et al. (2015) prove sublinear convergence rates for these algorithms. Corresponding generalizations of PWFW and FCFW are however not obvious. As exploited in Yu et al. (2014); Harchaoui et al. (2015), if = hX A, with h a nondecreasing convex function and A an atomic norm, and if an upper bound can be specified a priori on A(x) for x a solution of the problem, it can reformulated as min.

**Table of contents :**

**1 Introduction **

1.1 Outline and contributions

**2 Atomic norms to induce structure **

2.1 Concepts in convex optimization

2.2 Atomic norms for leveraging structure

**3 Optimization for convex sparsity **

3.1 Convex formulation

3.2 Proximal splitting methods

3.3 Coordinate descent

3.4 Frank-Wolfe methods

3.5 Column generation

**4 Fast column generation for atomic norm regularization **

4.1 Introduction and related work

4.2 Atomic norms

4.3 Previous approaches

4.4 Pivoting Frank Wolfe

4.5 Convergence and computational cost

4.6 Experiments

4.7 Discussion

**5 Learning structure in probabilistic graphical models **

5.1 Graphical models

5.2 Exponential families

5.3 Learning structure of graphical models

5.4 Inverse covariance estimation in Gaussian graphical models

**6 Learning the effect of latent variables in GGM **

6.1 Introduction

6.2 Related Work

6.3 Gaussian Graphical Models with Latent Variables

6.4 Spsd-rank(k) and a convex surrogate

6.5 Convex Formulation and Algorithm

6.6 Identifiability of S and of the sparse factors of L

6.7 Proofs of main theorems

6.8 Experiments

6.9 Conclusion

**7 Convex demixing by gauge minimization **

7.1 Introduction

7.2 Related work

7.3 Subspace associated with a gauge at a point

7.4 Identifiability conditions

7.5 Illustrative examples

7.6 Conclusion

**8 Conclusion and perspectives **

**A Column generation for atomic regularization **

A.1 Proof of Proposition

A.2 Rank one updates of the Hessian and its inverse in active-set

**B Learning the effect of latent variables in GGM **

B.1 Technical lemmas from the proof of Theorem

B.2 Proof of Proposition 7

B.3 Lemmas to control eigenvalues

B.4 Construction of sparse precision matrices

B.5 Experiments

**C Convex demixing by gauge minimization **

C.1 Lemmas for the main theorem

C.2 Technical results on gauges .