Get Complete Project Material File(s) Now! »

## Understanding the Properties of Structured Sparsity-Inducing Norms

abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual ℓ1 -norm and the group ℓ1 – norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an eﬃcient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.

**Introduction**

Sparse linear models have emerged as a powerful framework to deal with various supervised estimation tasks, in machine learning as well as in statistics and signal pro-cessing. These models basically seek to predict an output by linearly combining only a small subset of the features describing the data. To simultaneously address this vari-able selection and the linear model estimation, ℓ1-norm regularization has become a popular tool, that benefits both from eﬃcient algorithms (see, e.g., Efron et al., 2004; Lee et al., 2007; Beck and Teboulle, 2009; Yuan et al., 2010, and multiple references therein) and well-developed theory for generalization properties and variable selection consistency (Zhao and Yu, 2006; Wainwright, 2009; Bickel et al., 2009; Zhang, 2009).

When regularizing by the ℓ1-norm, sparsity is yielded by treating each variable indi-vidually, regardless of its position in the input feature vector, so that existing relation-ships and structures between the variables (e.g., spatial, hierarchical or related to the physics of the problem at hand) are merely disregarded. However, many practical situa-tions could benefit from this type of prior knowledge, potentially both for interpretability purposes and for improved predictive performance. For instance, in neuroimaging, one is interested in localizing areas in functional mag-netic resonance imaging (fMRI) or magnetoencephalography (MEG) signals that are discriminative to distinguish between diﬀerent brain states (Gramfort and Kowalski, 2009; Xiang et al., 2009, and references therein). More precisely, fMRI responses con-sist in voxels whose three-dimensional spatial arrangement respects the anatomy of the brain. The discriminative voxels are thus expected to have a specific localized spatial or-ganization (Xiang et al., 2009), which is important for the subsequent identification task performed by neuroscientists. In this case, regularizing by a plain ℓ1-norm to deal with the ill-conditionedness of the problem (typically only a few fMRI responses described by tens of thousands of voxels) would ignore this spatial configuration, with a potential loss in interpretability and performance.

Similarly, in face recognition, robustness to occlusions can be increased by consider-ing as features, sets of pixels that form small convex regions on the face images (Jenatton et al., 2010b). Again, a plain ℓ1-norm regularization fails to encode this specific spatial locality constraint (Jenatton et al., 2010b). The same rationale supports the use of struc-tured sparsity for background subtraction tasks (Cevher et al., 2008; Huang et al., 2009; Mairal et al., 2010b). Still in computer vision, object and scene recognition generally seek to extract bounding boxes in either images (Harzallah et al., 2009) or videos (Dalal et al., 2006). These boxes concentrate the predictive power associated with the con-sidered object/scene class, and have to be found by respecting the spatial arrangement of the pixels over the images. In videos, where series of frames are studied over time, the temporal coherence also has to be taken into account. An unstructured sparsity-inducing penalty that would disregard this spatial and temporal information is therefore not adapted to select such boxes.

Another example of the need for higher-order prior knowledge comes from bioin-formatics. Indeed, for the diagnosis of tumors, the profiles of array-based comparative genomic hybridization (arrayCGH) can be used as inputs to feed a classifier (Rapaport et al., 2008). These profiles are characterized by plenty of variables, but only a few sam-ples of such profiles are available, prompting the need for variable selection. Because of the specific spatial organization of bacterial artificial chromosomes along the genome, the set of discriminative features is expected to have specific contiguous patterns. Using this prior knowledge on top of a standard sparsity-inducing method leads to improvement in classification accuracy (Rapaport et al., 2008). In the context of multi-task regres-sion, a genetic problem of interest is to find a mapping between a small subset of single nucleotide polymorphisms (SNP’s) that have a phenotypic impact on a given family of genes (Kim and Xing, 2010). This target family of genes has its own structure, where some genes share common genetic characteristics, so that these genes can be embedded into a underlying hierarchy (Kim and Xing, 2010). Exploiting directly this hierarchical information in the regularization term outperforms the unstructured approach with a standard ℓ1-norm. Such hierarchical structures have been likewise useful in the context of wavelet regression (Baraniuk et al., 2010; Zhao et al., 2009; Huang et al., 2009; Jenat-ton et al., 2011c), kernel-based non linear variable selection (Bach, 2008a) and for topic modeling (Jenatton et al., 2011c).

These real world examples motivate the need for the design of sparsity-inducing regularization schemes, capable of encoding more sophisticated prior knowledge about the expected sparsity patterns.

As mentioned above, the ℓ1-norm focuses only on cardinality and cannot easily specify side information about the patterns of nonzero coeﬃcients (“nonzero patterns”) induced in the solution, since they are all theoretically possible. Group ℓ1-norms (Yuan and Lin, 2006; Roth and Fischer, 2008; Huang and Zhang, 2010) consider a partition of all variables into a certain number of subsets and penalize the sum of the Euclidean norms of each one, leading to selection of groups rather than individual variables. Moreover, recent works have considered overlapping but nested groups in constrained situations such as trees and directed acyclic graphs (Zhao et al., 2009; Bach, 2008a; Kim and Xing, 2010; Jenatton et al., 2010a, 2011c; Schmidt and Murphy, 2010).

In this chapter, we consider all possible sets of groups and characterize exactly what type of prior knowledge can be encoded by considering sums of norms of overlapping groups of variables. Before describing how to go from groups to nonzero patterns (or equivalently zero patterns), we show that it is possible to “reverse-engineer” a given set of nonzero patterns, i.e., to build the unique minimal set of groups that will generate these patterns. This allows the automatic design of sparsity-inducing norms, adapted to target sparsity patterns. We give in Section 2.3 some interesting examples of such designs in specific geometric and structured configurations, which covers the type of prior knowledge available in the real world applications described previously.

As will be shown in Section 2.3, for each set of groups, a notion of hull of a nonzero pattern may be naturally defined. For example, in the particular case of the two-dimensional planar grid considered in this chapter, this hull is exactly the axis-aligned bounding box or the regular convex hull. We show that, in our framework, the al-lowed nonzero patterns are exactly those equal to their hull, and that the hull of the relevant variables is consistently estimated under certain conditions, both in low and high-dimensional settings. Moreover, we present in Section 2.4 an eﬃcient active set algorithm that scales well to high dimensions. Finally, we illustrate in Section 2.6 the behavior of our norms with synthetic examples on specific geometric settings, such as lines and two-dimensional grids. **Notation.** For any finite set A with cardinality |A|, we also define the |A|-tuple (ya)a∈A ∈ Rp×|A| as the collection of p-dimensional vectors ya indexed by the elements of A. Furthermore, for two vectors x and y in Rp, we denote by x◦y = (x1y1, . . . , xpyp)⊤ ∈ Rp the elementwise product of x and y. where (ωg )g G

**Understanding the Properties of Structured Sparsity-Inducing Norms**

**Regularized Risk Minimization**

We consider the problem of predicting a random variable Y ∈ Y from a (potentially non random) vector x ∈ Rp, where Y is the set of responses, typically a subset of R. p for i J1; nK. We

We assume that we are given n observations (xi, yi) ∈ R × Y , ∈1 n define the empirical risk of a loading vector w ∈ Rp as L(w) = i=1 ℓ yi, w⊤xi , n where ℓ : Y × R R+ is a loss function. We assume that ℓ is convex and continuously diﬀerentiable with respect to the second parameter. Typical examples of loss functions are the square loss for least squares regression, i.e., ℓ(y, yˆ) = 12 (y − yˆ)2 with y ∈ R, and the logistic loss ℓ(y, yˆ) = log(1 + e−yyˆ) for logistic regression, with y ∈ {−1, 1}.

We focus on a general family of sparsity-inducing norms that allow the penalization of subsets of variables grouped together. Let us denote by G a subset of the power set of J1; pK such that g G g = J1; pK, i.e., a spanning set of subsets of J1; pK. Note that G does not necessarily define a partition of J1; pK, and therefore, it is possible for elements of G to overlap. We consider the norm Ω defined by (ωjg )2|wj |2 2 ωg ◦ w 2, (2.1) Ω(w) = = g Gj∈g g G is a |G|-tuple of p-dimensional vectors such that ωjg > 0 if j ∈ g and ωjg = 0 otherwise. A same variable wj belonging to two diﬀerent groups g1, g2 ∈ G is allowed to be weighted diﬀerently in g1 and g2 (by respectively ωjg1 and ωjg2 ). We do not study the more general setting where each ωg would be a (non-diagonal) positive-definite matrix, which we defer to future work. Note that a larger family of penalties with similar properties may be obtained by replacing the ℓ2-norm in Eq. (2.1) by other ℓq -norm, q > 1 (Zhao et al., 2009). Moreover, non-convex alternatives to Eq. (2.1) with quasi-norms in place of norms may also be interesting, in order to yield sparsity more aggressively (see, e.g., Jenatton et al., 2010b).

This general formulation has several important sub-cases that we present below, the goal of this chapter being to go beyond these, and to consider norms capable to incorporate richer prior knowledge.

• ℓ2-norm: G is composed of one element, the full set J1; pK.

• ℓ1-norm: G is the set of all singletons, leading to the Lasso (Tibshirani, 1996) for the square loss.

• ℓ2-norm and ℓ1-norm: G is the set of all singletons and the full set J1; pK, leading (up to the squaring of the ℓ2-norm) to the Elastic net (Zou and Hastie, 2005) for the square loss.

• Group ℓ1-norm: G is any partition of J1; pK, leading to the group-Lasso for the square loss (Yuan and Lin, 2006).

• Hierarchical norms: when the set J1; pK is embedded into a tree (Zhao et al., 2009) or more generally into a directed acyclic graph (Bach, 2008a), then a set of p groups, each of them composed of descendants of a given variable, is considered.

We study the following regularized problem: min ℓ yi, w⊤xi + Ω(w), (2.2) w∈Rp n i=1 where ≥ 0 is a regularization parameter. Note that a non-regularized constant term could be included in this formulation, but it is left out for simplicity. We denote by wˆ any solution of Eq. 2.2. Regularizing by linear combinations of (non-squared) ℓ2-norms is known to induce sparsity in wˆ (Zhao et al., 2009); our grouping leads to specific patterns that we describe in the next section.

### Groups and Sparsity Patterns

We now study the relationship between the norm Ω defined in 2.1 and the nonzero patterns the estimated vector wˆ is allowed to have. We first characterize the set of nonzero patterns, then we provide forward and backward procedures to go back and forth from groups to patterns.

**Stable Patterns Generated by G**

The regularization term Ω(w) = g∈G ωg ◦w 2 is a mixed (ℓ1, ℓ2)-norm (Zhao et al., 2009). At the group level, it behaves like an ℓ1-norm and therefore, Ω induces group sparsity. In other words, each ωg ◦ w, and equivalently each wg (since the support of ωg is exactly g), is encouraged to go to zero. On the other hand, within the groups g ∈ G, the ℓ2-norm does not promote sparsity. Intuitively, for a certain subset of groups G′ ⊆ G, the vectors wg associated with the groups g ∈ G′ will be exactly equal to zero, leading to a set of zeros which is the union of these groups, g∈G′ g. Thus, the set of allowed zero patterns should be the union-closure of G, i.e. (see Figure 2.1 for an example): Z = g; G′ ⊆ G . (2.3) g∈G′

The situation is however slightly more subtle as some zeros can be created by chance (just as regularizing by the ℓ2-norm may lead, though it is unlikely, to some zeros). Nevertheless, Theorem 1 shows that, under mild conditions, the previous intuition about the set of zero patterns is correct. Note that instead of considering the set of zero patterns Z, it is also convenient to manipulate nonzero patterns, and we define P = gc; G′ ⊆ G = zc; z ∈ Z . (2.4) g∈G′

We can equivalently use P or Z by taking the complement of each element of these sets. The following two results characterize the solutions of the problem (2.2). We first gives suﬃcient conditions under which this problem has a unique solution. We then ∂2 ℓ (y, y′) > 0 ∂y′ 2

**Understanding the Properties of Structured Sparsity-Inducing Norms**

Figure 2.1: Groups and induced nonzero pattern: three sparsity-inducing groups (middle and right, denoted by {g1, g2, g3}) with the associated nonzero pattern which is the complement of the union of groups, i.e., (g1 ∪ g2 ∪ g3)c (left, in black).

formally prove the aforementioned intuition about the zero patterns of the solutions of (2.2), namely they should belong to Z. In the following two results (see proofs in Appendix A.1.1 and Appendix A.1.2), we assume that ℓ : (y, y′) ℓ(y, y′) is nonnegative, twice continuously diﬀerentiable with positive second derivative with respect to the sec-ond variable and non-vanishing mixed derivative, i.e., for any y, y′ in R, and ∂2 ℓ ′ (y, y′) = 0. ∂y∂y

**Proposition 5** (Uniqueness)

Let Q denote the Gram matrix 1 n xix⊤. We consider the optimization problem in n i=1 i (2.2) with > 0. If Q is invertible or if the group J1; pK belongs to G, then the problem in (2.2) admits a unique solution.

Note that the invertibility of the matrix Q requires p ≤ n. For high-dimensional settings, the uniqueness of the solution will hold when {1, . . . , p} belongs to G, or as further discussed at the end of the proof, as soon as for any j, k ∈ {1, . . . , p}, there exists a group g ∈ G which contains both j and k. Adding the group {1, . . . , p} to G will in general not modify P (and Z), but it will cause G to lose its minimality (in a sense introduced in the next subsection). Furthermore, adding the full group {1, . . . , p} has to be put in parallel with the equivalent (up to the squaring) ℓ2-norm term in the elastic-net penalty (Zou and Hastie, 2005), whose eﬀect is to notably ensure strong convexity. For more sophisticated uniqueness conditions that we have not explored here, we refer the readers to Osborne et al. (2000a, Theorem 1, 4 and 5), Rosset et al. (2004, Theorem 5) or Dossal (2007, Theorem 3) in the Lasso case, and Roth and Fischer (2008) for the group Lasso setting. We now turn to the result about the zero patterns of the solution of the problem in (2.2):

**Theorem 1** (Stable patterns)

Assume that Y = (y1, . . . , yn)⊤ is a realization of an absolutely continuous probability distribution. Let k be the maximal number such that any k rows of the matrix (x1, . . . , xn) ∈ Rp×n are linearly independent. For > 0, any solution of the problem in (2.2) with at most k −1 nonzero coeﬃcients has a zero pattern in Z = g∈G′ g; G′ ⊆ G almost surely.

In other words, when Y = (y1, . . . , yn)⊤ is a realization of an absolutely continuous probability distribution, the sparse solutions have a zero pattern in Z = g∈G′ g; G′ ⊆ G almost surely. As a corollary of our two results, if the Gram matrix Q is invertible, the problem in (2.2) has a unique solution, whose zero pattern belongs to Z almost surely. Note that with the assumption made on Y , Theorem 1 is not directly applicable to the classification setting. Based on these previous results, we can look at the following usual special cases from Section 2.2 (we give more examples in Section 2.3.5):

• ℓ2-norm: the set of allowed nonzero patterns is composed of the empty set and the full set J1; pK.

• ℓ1-norm: P is the set of all possible subsets.

• ℓ2-norm and ℓ1-norm: P is also the set of all possible subsets.

• Group ℓ1-norm: P = Z is the set of all possible unions of the elements of the partition defining G.

• Hierarchical norms: the set of patterns P is then all sets J for which all ances-tors of elements in J are included in J (Bach, 2008a).

Two natural questions now arise: (1) starting from the groups G, is there an eﬃcient way to generate the set of nonzero patterns P ; (2) conversely, and more importantly, given P , how can the groups G—and hence the norm Ω(w)—be designed?

**General Properties of G, Z and P**

We now study the diﬀerent properties of the set of groups G and its corresponding sets of patterns Z and P .

Closedness. The set of zero patterns Z (respectively, the set of nonzero patterns P ) is closedKunder union (respectively, intersection), that is, for all K ∈ N and all z , . . . , z K 1 K ∈ Z, k=1 zk ∈ Z (respectively, p1, . . . , pK ∈ P , k=1 pk ∈ P ). This implies that when “reverse-engineering” the set of nonzero patterns, we have to assume it is closed under intersection. Otherwise, the best we can do is to deal with its intersection-closure. For instance, if we consider a sequence (see Figure 2.4), we cannot take P to be the set of contiguous patterns with length two, since the intersection of such two patterns may result in a singleton (that does not belong to P ). Minimality. If a group in G is the union of other groups, it may be removed from G without changing the sets Z or P . This is the main argument behind the pruning backward algorithm in Section 2.3.3. Moreover, this leads to the notion of a minimal set G of groups, which is such that for all G ′ ⊆ Z whose union-closure spans Z, we have G ⊆ G ′. The existence and uniqueness of a minimal set is a consequence of classical results in set theory (Doignon and Falmagne, 1998). The elements of this minimal set are usually referred to as the atoms of Z.

Minimal sets of groups are attractive in our setting because they lead to a smaller number of groups and lower computational complexity—for example, for 2 dimensional grids with rectangular patterns, we have a quadratic possible number of rectangles, i.e., |Z | = O(p2), that can be generated by a minimal set G whose size is |G| = O(√p).

Hull. Given a set of groups G, we can define for any subset I ⊆ J1; pK the G-adapted hull, or simply hull, as: Hull(I ) = g , g G, g∩I =∅ which is the smallest set in P containing I (see Figure 2.2); we always have I ⊆ Hull(I ) with equality if and only if I ∈ P . The hull has a clear geometrical interpretation for specific sets G of groups. For instance, if the set G is formed by all vertical and horizontal half-spaces when the variables are organized in a 2 dimensional-grid (see Figure 2.5), the hull of a subset I ⊂ {1, . . . , p} is simply the axis-aligned bounding box of I . Similarly, when G is the set of all half-spaces with all possible orientations (e.g., orientations ±π/4 are shown in Figure 2.6), the hull becomes the regular convex hull 1. Note that those interpretations of the hull are possible and valid only when we have geometrical information at hand about the set of variables. Graphs of patterns. We consider the directed acyclic graph (DAG) stemming from the Hasse diagram (Cameron, 1994) of the partially ordered set (poset) (G, ⊃). By definition, the nodes of this graph are the elements g of G and there is a directed edge from g1 to g2 if and only if G1 ⊃ G2 and there exists no g ∈ G such that g1 ⊃ g ⊃ g2 (Cameron, 1994). We can also build the corresponding DAG for the set of zero patterns Z ⊃ G, which is a super-DAG of the DAG of groups (see Figure 2.3 for examples). Note that we obtain also the isomorphic DAG for the nonzero patterns P , although it corresponds to the poset (P , ⊂): this DAG will be used in the active set algorithm presented in Section 2.4.

Prior works with nested groups (Zhao et al., 2009; Bach, 2008a; Kim and Xing, 2010; Jenatton et al., 2010a; Schmidt and Murphy, 2010) have also used a similar DAG structure, with the slight diﬀerence that in these works, the corresponding hierarchy of variables is built from the prior knowledge about the problem at hand (e.g., the tree of wavelets in Zhao et al. (2009), the decomposition of kernels in Bach (2008a) or the hierarchy of genes in Kim and Xing (2010)). The DAG we introduce here on the set of groups naturally and always comes up, with no assumption on the variables themselves (for which no DAG is defined in general).

#### From Patterns to Groups

We now assume that we want to impose a priori knowledge on the sparsity structure of a solution wˆ of our regularized problem in Eq. 2.2. This information can be exploited by restricting the patterns allowed by the norm Ω. Namely, from an intersection-closed set of zero patterns Z, we can build back a minimal set of groups G by iteratively pruning away in the DAG corresponding to Z, all sets which are unions of their parents. See Algorithm 1. This algorithm can be found under a diﬀerent form in Doignon and Falmagne (1998)—we present it through a pruning algorithm on the DAG, which is natural in our context (the proof of the minimality of the procedure can be found in Appendix A.1.3). The complexity of Algorithm 1 is O(p|Z |2). The pruning may reduce significantly the number of groups necessary to generate the whole set of zero patterns, sometimes from exponential in p to polynomial in p (e.g., the ℓ1-norm). In Section 2.3.5, we give other examples of interest where |G| (and |P |) is also polynomial in p.

**Algorithm 1 Backward procedure**

Input: Intersection-closed family of nonzero patterns P .

Output: Set of groups G.

Initialization: Compute Z = {P c; P ∈ P } and set G = Z.

Build the Hasse diagram for the poset (Z, ⊃).

for t = ming Z |g| to maxg Z |g| do

for each node g ∈ Z such that |g| = t do

if C∈Children(g) C = gthen

if (Parents(g) = ∅) then

Connect children of g to parents of g.

end if

Remove g from G.

end if

end for

end for

**Algorithm 2 Forward procedure**

Input: Set of groups G = {g1, . . . , gM }.

Output: Collection of zero patterns Z and nonzero patterns P .

Initialization: Z = {∅}.

for m = 1 to M do

C = {∅}

for each Z ∈ Z do

if (gm Z) and (∀g ∈ {g1, . . . , gm−1}, g ⊆ Z ∪ gm ⇒ g ⊆ Z) then

C ← C ∪ {Z ∪ gm} . end if

end for

Z←Z∪C.

end for

P = {Zc; Z ∈ Z}.

**From Groups to Patterns**

The forward procedure presented in Algorithm 2, taken from Doignon and Falmagne (1998), allows the construction of Z from G. It iteratively builds the collection of patterns by taking unions, and has complexity O(p|Z||G|2). The general scheme is straightfor-ward. Namely, by considering increasingly larger sub-families of G and the collection of patterns already obtained, all possible unions are formed. However, some attention needs to be paid while checking we are not generating a pattern already encountered. Such a verification is performed by the if condition within the inner loop of the algo-rithm. Indeed, we do not have to scan the whole collection of patterns already obtained (whose size can be exponential in |G|), but we rather use the fact that G generates Z. Note that in general, it is not possible to upper bound the size of |Z | by a polynomial term in p, even when G is very small (indeed, |Z | = 2p and |G| = p for the ℓ1-norm).

**Examples**

We now present several examples of sets of groups G, especially suited to encode geometric and temporal prior information.

Sequences. Given p variables organized in a sequence, if we want only contiguous nonzero patterns, the backward algorithm will lead to the set of groups which are in- tervals [1, k]k∈{1,…,p−1} and [k, p]k∈{2,…,p}, with both |Z | = O(p2) and |G| = O(p) (see Figure 2.4). Imposing the contiguity of the nonzero patterns is for instance relevant for the diagnosis of tumors, based on the profiles of arrayCGH (Rapaport et al., 2008).

Figure 2.4: (Left) The set of blue groups to penalize in order to select contiguous patterns in a sequence. (Right) In red, an example of such a nonzero pattern with its corresponding zero pattern (hatched area).

Two-dimensional grids. In Section 2.6, we notably consider for P the set of all rectangles in two dimensions, leading by the previous algorithm to the set of axis-aligned half-spaces for G (see Figure 2.5), with |Z | = O(p2) and |G| = O(√p). This type of structure is encountered in object or scene recognition, where the selected rectangle would correspond to a certain box inside an image, that concentrates the predictive power for a given class of object/scene (Harzallah et al., 2009). Larger set of convex patterns can be obtained by adding in G half-planes with other orientations than vertical and horizontal. For instance, if we use planes with angles that are multiples of π/4, the nonzero patterns of P can have polygonal shapes with up to 8 faces. In this sense, if we keep on adding half-planes with finer orientations, the nonzero patterns of P can be described by polygonal shapes with an increasingly larger number of faces. The standard notion of convexity defined in R2 would correspond to the situation where an infinite number of orientations is considered (Soille, 2003). See Figure 2.6. The number of groups is linear in √p with constant growing linearly with the number of angles, while |Z | grows more rapidly (typically non-polynomially in the number of angles). Imposing such convex-like regions turns out to be useful in computer vision. For instance, in face recognition, it enables the design of localized features that improve upon the robustness to occlusions (Jenatton et al., 2010b). In the same vein, regularizations with similar two-dimensional sets of groups have led to good performances in background subtraction tasks (Mairal et al., 2010b), where the pixel spatial information is crucial to avoid scattered results. Another application worth being mentioned is the design of topographic dictionaries in the context of image processing (Kavukcuoglu et al., 2009; Mairal et al., 2011). In this case, dictionaries self-organize and adapt to the underlying geometrical structure encoded by the two-dimensional set of groups.

**Table of contents :**

**1 Introduction and Related Work **

1.1 Summary of the Contributions of the Thesis

1.2 Notation

1.3 Variable Selection and Sparsity-Inducing Regularizations

1.4 Dictionary Learning with Structured Sparsity-Inducing Penalties

1.5 Some Elements of Convex Analysis and Convex Optimization for Sparse Methods

1.6 Some Ingredients to Study Statistical Properties of Sparse Methods

**2 Understanding the Properties of Structured Sparsity-Inducing Norms **

2.1 Introduction

2.2 Regularized Risk Minimization

2.3 Groups and Sparsity Patterns

2.4 Optimization and Active Set Algorithm

2.5 Pattern Consistency

2.6 Experiments

2.7 Conclusion

**3 Structured Sparse Principal Component Analysis **

3.1 Introduction

3.2 Problem Statement

3.3 Optimization

3.4 Experiments

3.5 Conclusions

**4 Proximal Methods for Structured Sparsity-Inducing Norms **

4.1 Introduction

4.2 Problem Statement and Related Work

4.3 Optimization

4.4 Application to Dictionary Learning

4.5 Experiments

4.6 Discussion

4.7 Extension: General Overlapping Groups and `1/`1-norms

**5 Application of Structured Sparsity to Neuroimaging **

5.1 Multi-scale Mining of fMRI Data with Hierarchical Structured Sparsity

5.2 Sparse Structured Dictionary Learning for Brain Resting-State Activity Modeling

**6 Local Analysis of Sparse Coding in Presence of Noise **

6.1 Introduction

6.2 Problem statement

6.3 Main result and structure of the proof

6.4 Some experimental validations

6.5 Conclusion

6.6 Proofs of the main results

6.7 Control of the Taylor expansion

6.8 Computation of the Taylor expansion

6.9 Technical lemmas

**7 Conclusion **

A Proofs

A.1 Proofs and Technical Elements of Chapter 1

A.2 Proofs and Technical Elements of Chapter 2

**Bibliography**