
## Proximal methods for structured optimization

This chapter presents an overview of structured optimization and shows how the proximal operator can be used to leverage the structure of the problem. We develop in particular the problem of minimizing the anisotropic total variation on an arbitrary weighted graph. We first define the context of structured optimization and give several examples. We then provide an overview of some of the most well-known methods for solving such problems using the proximal operator. Finally, we present in greater detail the Generalized Forward-Backward algorithm, introduce a preconditioned version and give numerical experiments.

The material of sections 2.4 and 2.5 is based on Raguet and Landrieu (2015), published in the SIAM Journal on Imaging Sciences (SIIMS), volume 8, issue 4 (2015).

**Introduction**

Many of the optimization problems encountered in machine learning are ill-posed, in the sense that they are underconstrained and have too many solutions, making them susceptible to overfitting (Hadamard, 1902). A solution is to add regularization functions, which provide the problem with mathematical properties ensuring that the solution is unique (Tihonov and Arsenin, 1978). Regularization can also be interpreted as encouraging the solution of the problem to satisfy a set of desirable properties. These properties can represent prior knowledge, such as the solution belonging to a given set, or useful properties such as smoothness.

Among the diversity of such regularizers existing in the literature, many lack differentiability. This is notably the case of set-characteristic functions and sparsity-inducing penalizations (Bach et al., 2012a), which encourage the solution to be mostly comprised of zeros. The non-differentiability of such problems prevents the use of traditional first order schemes such as gradient descent. The most straightforward approach to solving such problems is subgradient descent (Boyd et al., 2003) which, while simple, is quite slow, with a distance to optimality that decreases as O(1/√t). However it is often the case that the non-differentiable functions encountered present a special structure. In particular, regularized problems usually have a differentiable fidelity term ensuring that the solution stays close to the observations, and a non-differentiable regularizer. Furthermore such regularizers often present a simple structure, such as separability. This structure can be leveraged to design algorithms with convergence rates similar to those of differentiable problems: O(1/t), or O(1/t²) for accelerated schemes.

**Structured optimization problems**

This chapter presents some examples of optimization problems whose structure can be computationally exploited. We focus in particular on regularized problems, i.e. minimization problems whose objective function can be broken down into two parts:

x⋆ = arg minx∈Rn f(x) + λΦ(x),   (2.1)

with f : Rn → R a fidelity function, typically smooth, which measures the accuracy of a candidate solution x with respect to the observations, and Φ the regularizer. The regularization strength λ > 0 balances the influence of the two functions. While an optimal choice of λ is hard to find in general, low values denote trust in the observed data while high values indicate an emphasis on the desired properties.

**Projection on simple sets**

Let us consider f a smooth function to minimize over a convex subset Ω ⊂ Rn. The optimization problem can be written as follows:

x⋆ = arg minx∈Ω f(x).

Such a problem can be rewritten in regularized form by choosing as regularizer the characteristic function of Ω:

Φ(x) = ιΩ(x), with ιΩ(x) = 0 if x ∈ Ω, and ∞ otherwise.

As will be detailed further in this chapter, such problems can be solved eﬃciently as long as Ω is easy to project onto. Examples of such sets include:

• box constraints: Ω = {x | ai ≤ xi ≤ bi, ∀i ∈ 1 • • • n}, for a, b ∈ Rn.

• simplex constraints: Ω = {x | xi ≥ 0, ∀i ∈ 1 • • • n, ∑ni=1 xi = 1}.

• ℓ1 ball: Ω = {x | ∑ni=1 |xi| ≤ ω} for ω > 0.

• subspace constraint: Ω is a sub-vector space of Rn.
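As a concrete illustration, projections onto the first two of these sets can be computed exactly. The sketch below, in Python with function names of our choosing, implements box clipping and a standard sorting-based exact projection onto the probability simplex.

```python
import numpy as np

def project_box(x, a, b):
    """Project x onto the box {z | a_i <= z_i <= b_i}: coordinate-wise clipping."""
    return np.clip(x, a, b)

def project_simplex(x):
    """Project x onto the probability simplex {z | z_i >= 0, sum_i z_i = 1}.

    Standard sorting-based algorithm: find the threshold theta such that
    max(x - theta, 0) sums to one, then shrink and clip.
    """
    u = np.sort(x)[::-1]                      # sort in decreasing order
    cssv = np.cumsum(u) - 1.0                 # cumulative sums minus the target sum
    rho = np.nonzero(u * np.arange(1, len(x) + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(x - theta, 0.0)
```

Both projections run in O(n log n) at worst, which is what makes the corresponding constrained problems "easy" in the sense used above.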

**Regular sparsity**

The solution of an optimization problem is said to be sparse if its values at most indices are zero. Sparsity can be desirable, as such solutions are easier to interpret, are more compact in memory (Tropp et al., 2007), or can correspond to knowledge of the optimizer on the solution set.

The sparsity of the solutions can be ensured by adding a sparsity-inducing penalty to an optimization problem, i.e. a function Φ : Rn → R that decreases with the cardinality of the set of non-zero elements of its argument, called the support: {k | xk ≠ 0}. The most natural approach is to penalize by the cardinality of the support: Φ(x) = ‖x‖0 = |{k | xk ≠ 0}|.

The non-continuous and non-convex nature of this penalty can lead to combinatorial problems that are difficult to solve (Tropp, 2004). A successful alternative approach is to replace the cardinality with a convex approximation (Bach et al., 2012a) such as the ℓ1 norm: Φ(x) = ‖x‖1 = ∑ni=1 |xi|.

This is the celebrated Lasso penalty (Tibshirani, 1996), which has numerous advantages. Its convexity ensures the uniqueness of the solution, and it has been shown to be consistent (Zhao and Yu, 2006) in the sense that, under some conditions, it retrieves the same support as the non-relaxed problem. Furthermore the non-differentiability of |•| at 0 encourages most coordinates of x⋆ to be zero, thus inducing sparsity.

This behaviour can be illustrated by the one-dimensional minimization problem obtained for f(x) = ½(x − y)², Φ(x) = |x| and (x, y) ∈ R². The solution of this regularized optimization problem is as follows:

x⋆ =   0       if |y| ≤ λ
       y + λ   if y < −λ
       y − λ   if y > λ,

and is represented in Figure 2.1. We can see that x⋆ is encouraged to take the value zero whenever |y| is smaller than the regularization strength λ. We can also observe that for |y| ≥ λ, the solution x⋆ is shifted towards zero. This bias is not observed in the ℓ0 case, and can be a drawback of this approach.

### Structured Sparsity

Sparse methods are not limited to finding solutions for which the majority of parameters are zero. Indeed Huang et al. (2011) extend the sparsity of the vector of parameters to the notion of coding complexity, a measure of simplicity adapted to a given problem. Bach et al. (2012b) give an overview of how structured forms of sparsity can be induced by extending the ℓ1 norm to appropriate structured norms.

For example, group sparsity is induced by the group Lasso regularization (Bakin, 1999; Yuan and Lin, 2006). Consider [1, • • • , n] partitioned into k meaningful groups {g1, • • • , gk}. The group Lasso regularization takes the following form: Φ(x) = ∑ki=1 ‖xgi‖2, with ‖xg‖2 = √(∑j∈g xj²). As in the regular Lasso, the non-differentiability of ‖•‖2 at 0 encourages whole blocks x⋆gi to be exactly zero. This can be a desirable property of the solution, and can be exploited to decrease the number of samples needed to find the solution (Obozinski et al., 2011; Wipf and Rao, 2007).
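The mechanism by which whole blocks vanish can be seen in the proximal operator of the group Lasso penalty, which is block soft-thresholding (see e.g. Bach et al., 2012b). A minimal Python sketch, with function names of our choosing:

```python
import numpy as np

def group_soft_threshold(x, groups, lam):
    """Proximal operator of lam * sum_g ||x_g||_2 (block soft-thresholding).

    `groups` is a list of index arrays partitioning [0, n). Each block is
    shrunk toward 0, and set exactly to 0 when its norm falls below lam,
    so whole groups vanish at once.
    """
    out = np.empty_like(x, dtype=float)
    for g in groups:
        norm = np.linalg.norm(x[g])
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out[g] = scale * x[g]
    return out
```

Note that a coordinate with a small value survives if the rest of its group is large, whereas in the plain Lasso it would be thresholded on its own.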

Another variation of the Lasso is the fused Lasso, used to encourage the sparsity of the parameters as well as of the differences between successive elements in an ordered set (Tibshirani et al., 2005). Suppose that the ordering of [1, • • • , n] is meaningful; then the fused Lasso regularization writes: Φ(x) = α ∑ni=2 |xi − xi−1| + β ∑ni=1 |xi|.

The first part of this regularization encourages most consecutive values of x⋆ to be equal, forming a piecewise constant structure, while the second part encourages values to be exactly zero. Hence the solution x⋆ is not only sparse, but its non-zero values show a piecewise constant structure with respect to the chosen ordering.

**Graph-structured Sparsity**

An important class of regularizers derive their structure from graphs, as illustrated in (Peyré et al., 2008) for image processing. For example, the spatial structure of an image with n pixels can be captured by an undirected graph G = (V, E, w), with each element of V = [1, • • • , n] associated with one pixel and E linking neighboring pixels (4-, 8- or 16-neighborhoods are usually used). In this context x ∈ Rn is the vector of greyscale values associated with the pixels. The edge weights wij can be set based on the norm of the gradient between two pixels, to account for the likelihood of object boundaries to display sharp color changes (Boykov and Jolly, 2001). In the special case of a regular grid graph in the plane, Goldfarb and Yin (2009) propose to set the edge weights such that the total weight of the edges intercepted by a cut of the graph approximates its curve length, using the Cauchy-Crofton formula.

A natural way for regularizers to take a graph structure into account is to factorize with respect to the graph gradient:

Φ(x) = ∑(ij)∈E φij(xi − xj),   (2.2)

with φij : R → R. Well-chosen edge weights capture the specificity of each edge, so that the functions φij take the form φij = wijφ with φ : R → R+. In this case spatial regularity can be achieved when φ is a sparsity-inducing function. Indeed, as φ encourages xi = xj for most neighboring nodes, x will be constant over large connected components of G.

The challenge is to design a penalty which induces spatial regularity while allowing sharp discontinuities. Piecewise constant approximations have in particular been considered in the image processing literature. In that context Mumford and Shah (1989) introduce an energy whose minimization produces piecewise-smooth approximations of images (see Chapter 4 for a more detailed presentation of this literature). By setting the smoothness term to infinity, one can obtain piecewise constant approximations. With this parameterization, the energy amounts to a squared difference data term penalized by the contour length of the constant regions. For an arbitrary data term, and when the number of regions is fixed in advance, this problem is known as the minimal partition problem.

Rather than viewing images as functions on a continuous set, we consider the classical discretization of the problem on a regular grid. In this setting we can transpose this penalty by choosing φ(x) = 1x≠0, i.e. φ(x) = 0 if x = 0 and 1 otherwise, and G = (V, E, w) the pixel neighborhood graph weighted with the Cauchy-Crofton formula. For these choices Φ(x) can be interpreted as the approximate length of the boundaries between the connected components of G on which x is constant. Remark that the form of the regularizer (2.2) is not specific to grid graphs, and can be extended to arbitrary weighted graphs.

The main drawback of this penalty is its non-convexity, which implies a potential multitude of local optima and the impossibility of estimating their quality compared to the global optimum. Rudin et al. (1992) introduce a convex penalization inducing spatial regularity while allowing sharp discontinuities: the total variation. In our graph setting, this penalty is obtained by setting φ = |•|. This particular implementation is known as the anisotropic weighted total variation:

Φ(x) = ∑(ij)∈E wij |xi − xj|.   (2.3)
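The penalty (2.3) can be evaluated directly from an edge list; a minimal Python sketch (function name is ours):

```python
def graph_total_variation(x, edges, weights):
    """Anisotropic weighted total variation: sum over edges (i, j) of w_ij * |x_i - x_j|.

    `edges` is a list of (i, j) index pairs and `weights` the matching w_ij.
    Only edges crossing a boundary between constant regions of x contribute.
    """
    return sum(w * abs(x[i] - x[j]) for (i, j), w in zip(edges, weights))

# Path graph 0 - 1 - 2 with a single jump between nodes 1 and 2:
# only the crossing edge (1, 2), of weight 2, contributes.
tv = graph_total_variation([0.0, 0.0, 1.0], [(0, 1), (1, 2)], [1.0, 2.0])
```

With binary-valued x this reduces to the total weight of the cut separating the regions, which is the graph-cut interpretation used later in this thesis.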

#### Proximal splitting for structured optimization

In this section we present a brief overview of proximal splitting algorithms. An index of the methods presented and the contexts in which they are applicable is given in Table 2.1. This section is inspired by the work of Bauschke and Combettes (2011), Combettes (2004) and Combettes and Pesquet (2009), as well as the survey by Parikh and Boyd (2013).

**The subgradient**

Let Φ : Ω → R be a proper convex function defined over Ω ⊂ Rn. The subgradient is a generalization of the notion of gradient for convex functions that are not necessarily differentiable everywhere. By contrast with the gradient, which is a point-to-point operator, the subgradient ∂Φ takes its values in the convex subsets of Rn. The subgradient of Φ at x0 ∈ Rn is defined by the hyperplanes tangent to the set of points above the graph:

∂Φ(x0) = {c ∈ Rn | Φ(x) − Φ(x0) ≥ ⟨c, x − x0⟩, ∀x ∈ Rn}.   (2.4)

In dimension 1, the subgradient of Φ at x0 is the set containing the slopes of all the lines going through (x0, Φ(x0)) that are under the graph of Φ everywhere, as illustrated in Figure 2.2.

Figure 2.2: Illustration of the subgradient of a convex function. In blue, the graph of Φ. In red, the lines bounding the slopes in the subgradient. In pink, the set of points through which pass the lines defined by the slopes in the subgradient of Φ at x0.

If Φ is differentiable, we have ∂Φ(x) = {∇Φ(x)}. Generalizing the notion of gradient, the subgradient can be used to characterize the stationarity of non-differentiable functions.

Figure 2.3: Illustration of the proximal operator. The full black line represents the boundary of the definition domain of Φ, while the dashed lines represent its level sets.

The red arrows point from a point x to its proximal operator value proxΦ(x). Observe that the red arrows are perpendicular to the level sets of Φ at their destination.

**Proposition 1.** (x = arg minz∈Rn Φ(z)) ⇔ 0 ∈ ∂Φ(x).
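As an illustration of Proposition 1, the closed-form solution of the one-dimensional Lasso problem given earlier can be recovered by writing the stationarity condition with the subgradient of the absolute value. This is a standard derivation, sketched here:

```latex
% Stationarity for x* = argmin_x (1/2)(x - y)^2 + \lambda |x|:
0 \in \partial\Big(\tfrac{1}{2}(\cdot - y)^2 + \lambda|\cdot|\Big)(x^\star)
  = \{x^\star - y\} + \lambda\,\partial|\cdot|(x^\star),
\quad\text{with}\quad
\partial|\cdot|(x) =
\begin{cases}
  \{\operatorname{sign}(x)\} & \text{if } x \neq 0,\\
  [-1, 1] & \text{if } x = 0.
\end{cases}
```

For x⋆ > 0 this gives x⋆ = y − λ (valid when y > λ); for x⋆ < 0, x⋆ = y + λ (valid when y < −λ); and x⋆ = 0 is stationary exactly when |y| ≤ λ. This recovers the three cases of the soft-thresholding formula.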

**The proximal operator**

The proximal operator is a key concept from convex analysis used to design optimization algorithms for non-differentiable functions (Moreau, 1965).

**Definition 2.** For x ∈ Rn and λ > 0, the proximal operator of λΦ at x is defined as: proxλΦ(x) = arg mint∈Rn {½ ‖t − x‖² + λΦ(t)}.

If Φ = ιC is the characteristic function of a convex set C, whose value is 0 in C and ∞ elsewhere, then proxλΦ(x) = arg mint∈C {½ ‖t − x‖²}, i.e. the orthogonal projection onto C. The proximal operator can thus be seen as a generalization of the orthogonal projection.

If Φ is differentiable, then t = proxλΦ(x) is such that t + λ∇Φ(t) = x. In other words, t is obtained from x by a gradient descent step for which the gradient is computed at its destination t. This property is the reason why the adjectives implicit or backward are used to describe algorithms relying on proximal operators. An example of a proximal operator is soft thresholding, the proximal operator of Φ = |•|:

proxλ|•|(x) =   0       if |x| ≤ λ
                x + λ   if x < −λ
                x − λ   if x > λ.

We can see that t = proxλ|•|(x) gives the same result as a gradient step of length λ on the function |•| while dealing with its non-differentiability at 0.
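Applied coordinate-wise, soft thresholding is a one-line operation; a Python sketch (function name is ours):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * |.|, applied coordinate-wise:
    shrink each entry toward 0 by lam, setting it to 0 when |x_i| <= lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

This is the elementary building block of proximal algorithms for the Lasso, since the ℓ1 norm is separable and its prox therefore acts independently on each coordinate.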

More generally, the proximal operator of a function Φ maps a point x to a point t which reflects a compromise between decreasing Φ and moving away from x, all while remaining in the domain of Φ, as illustrated in Figure 2.3.
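The implicit (backward) nature of the proximal step can be checked on a simple differentiable example. The sketch below, with names of our choosing, uses Φ(t) = ½‖t‖², whose proximal operator has the closed form x/(1 + λ), and verifies the backward equation t + λ∇Φ(t) = x:

```python
import numpy as np

def prox_quadratic(x, lam):
    """Proximal operator of Phi(t) = ||t||^2 / 2.

    Setting the gradient of (1/2)||t - x||^2 + lam * ||t||^2 / 2 to zero
    gives t - x + lam * t = 0, hence the closed form x / (1 + lam).
    """
    return x / (1.0 + lam)

x = np.array([2.0, -3.0])
lam = 0.5
t = prox_quadratic(x, lam)
# Backward equation: grad Phi at the *destination* t is t itself,
# so t + lam * t recovers the starting point x.
assert np.allclose(t + lam * t, x)
```

Contrast this with an explicit (forward) gradient step x − λ∇Φ(x), which evaluates the gradient at the starting point instead.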

**Table of contents:**

**1 Introduction **

1.1 Spatial data and geostatistics

1.2 Spatial data analysis

1.3 Characteristics of geostatistical data

1.4 The weighted graph framework

1.5 Variational aggregation on weighted graphs

1.6 Graph structured prediction

1.7 Organisation of the thesis

**2 Proximal methods for structured optimization **

2.1 Introduction

2.2 Structured optimization problems

2.3 Proximal splitting for structured optimization

2.4 Generalized forward-backward

2.5 Experimental setup and results

2.6 Conclusion

**3 Aggregating spatial statistics with a generalized forward-backward splitting algorithm **

3.1 Aggregation as an optimization problem

3.2 Interpretation

**4 Cut Pursuit: fast algorithms to learn piecewise constant functions on general weighted graphs **

4.1 Introduction

4.2 A working set algorithm for total variation regularization

4.3 Minimal partition problems

4.4 Experiments

4.5 Conclusion

**5 Learning in graphical models **

5.1 Introduction

5.2 Undirected graphical models

5.3 Potts model

5.4 Continuous time Markov models

5.5 Conclusion

**6 Continuously indexed Potts model **

6.1 Introduction

6.2 Continuous graph Potts models

6.3 Learning with continuous graphs

6.4 Experiments

6.5 Conclusion

A Converting spatial data to graph

A.1 Converting spatial data to graph

B Appendix of Chapter 2

**Bibliography **

C Appendix of Chapter 4

D Appendix of Chapter 6