Stochastic Approximation and Least-Squares Regression

Get Complete Project Material File(s) Now! »

Supervised Machine Learning

Supervised machine learning aims at understanding the relationship between ele-ments of an arbitrary input set X and elements of an arbitrary output set Y. Typically X = Rd for a large d and Y is a finite set or a subset of R. For example: — X is a set of images that contains a hand-written number and Y is the set of associated numbers. A picture is coded through the grey level of its pixels and X = [0, 1)d for d pixels and Y = {0, . . . , 9}.
— X summarizes the air pollution, the crime rate, the high-school quality, the percent of green spaces around certain neighborhoods and Y = R depicts the housing prices.
Unfortunately an output y 2 Y cannot be always expressed exactly as a function of an input x 2 X since there may be some random noise or unknown factors. So, instead the couple (X, Y ) is modeled as random variables. Therefore we aim to predict the output Y associated to the input X where it is given that (X, Y ) is sampled from an unknown distribution D. We do this prediction through a predictor which is defined as a measurable function h : X ! Y. The set of all predictors is denoted by F(X , Y). A predictor is also called a hypothesis or an estimator. In order to measure the performance of a predictor we define a loss function ` : Y ⇥ Y ! R where `(y, y0) is the loss incurred when the true output is y whereas y0 is predicted. For instance:
Classification: Y is a finite set and `(y, y0) = y=y0 is the 0 1 loss. In binary classification this would be Y = {0, 1}.
Least-squares regression: X = R and `(y, y0) = |y y0|2. Least-squares regression is the main topic of this thesis and one of the most classical problems of statistical learning. It will be used to illustrate numerous examples throughout this introduction. In particular we shall be considering the parametric least-squares framework: we shall assume a linear parameterization of the predictor, h(x) = h✓, (x)i where the features (x ) 2 Rd are designed by experts using their knowledge of the phenomenon, or are independently learned, e.g., with neural networks [Bengio et al., 2013]. These features have a key importance in practice because linear predictors with relevant features may be more eﬃcient than non-linear predictors. Furthermore we note that a linear parametrization in ✓ does not imply a linear parametrization in x since may be non-linear.
The quality of a prediction is measured by the generalization error (also called risk ) of a predictor defined by L(h) = E(X,Y )⇠D`(Y, h(X)).
This is the averaged loss incurred when the learner predicts Y by h (X) and the data (X, Y ) are sampled following D. Thus the learner wants to solve the minimization problem min L(h). h2F(X ,Y)
The solution of this problem (when it exists) is called the Bayes predictor. For least-squares regression the generalization error of a predictor h can be decomposed as 1 1 L(h) = E[Y E(Y |X)]2 + E[E[Y |X] h(X)]2.
In other words E [Y |X] is the Euclidean projection of Y onto the set L2(X) of the square integrable functions of X. Consequently L(h) is always larger than 12 E[Y E[Y |X]]2 and the Bayes predictor is (in this particular case) h⇤(x) = E[Y |X = x].
Regrettably since the distribution D is unknown, this function is not computable. Therefore we assume that the learner is observing a finite training set of points Sn = ((X1 , Y1), . . . , ( Xn, Yn)) which are sampled independently from the unknown law D and wants to use them to predict the output Y associated to the input X, where (X, Y ) is sampled from D independently from Sn. This task is formalized through a learning algorithm : a function A : [n 1( X ⇥ Y)n ! F(X ⇥ Y) which relates a training set Sn to a predictor A(Sn). Since the training set Sn is random, the generalization error of the predictor A(Sn) is a random variable and the quality of a learning algorithm A is measured by ESn L( A(Sn)), where the expectation is measured with respect to the distribution of Sn. Alternatively L(A(Sn)) may be controlled in probability.
Let us stress that we adopt the distribution-free approach of Vapnik and Chervo-nenkis [1971, 1974]; see, e.g., Devroye et al. [1996]. There is no assumptions on the distribution D and the method has to be eﬃcient independently of it. In contrast classical statistics first assumes a particular statistical model for the distribution D and then estimate the parameters of this model. This approach will be temporar-ily taken in Chapter 6. It is also diﬀerent from the probably approximately correct (PAC) learning framework introduced by Valiant [1984] but the comparison is out-side the scope of this introduction. Interested readers can see the monograph by Shalev-Shwartz and Ben-David [2014] for more precisions on these frameworks.
As the distribution D is unknown the Bayes predictor cannot be directly computed and the generalization error cannot even be minimized. Instead the training set Sn is used to approximate these objects. There are two principal ways to address this problem:
Local averaging methods: They estimate the Bayes predictor E[Y |X = x] by av-eraging the Yi’s corresponding to the Xi’s close to x. This includes, for example, the Nadaraya-Watson estimator [Nadaraya, 1964, Watson, 1964] or the k-nearest neighbors algorithm Cover and Hart [1967]. These methods are well studied and eﬃcient for dimensions of small to middle scale compared to n [see, e.g., Hastie et al., 2009].
Empirical Risk Minimization (ERM): It approximates the generalization error by the training error (also called empirical risk ) ˆ defined by the average error over the sample Sn L(g) = Xi n `(Yi, h(Yi)), =1 and considers its minimizer [see, e.g., Vapnik, 1998]. This method raises two principal issues: Statistical problem: How well does the minimizer of the empirical risk per-form? This question will be studied in Section 1.4.
Optimization problem: How to minimize the empirical risk ˆ? This question will be examined in Section 1.5.
We will succinctly address these two questions through the lens of the least-squares regression. First of all we will introduce the mathematical framework adopted in this thesis.

Mathematical Framework

We shall now consider the Euclidean space Rd of dimension d 2 N⇤ endowed with the natural inner product h•, •i and the Euclidean norm k • k2.
The central concept in this thesis is convexity:
Definition 1. A set C 2 Rd is said to be convex if for all ✓1, ✓2 2 C and all t 2 (0, 1), (1 t)✓1 + t✓2 2 C.
While classical functional analysis does not attach the same importance to convex functions as to convex sets [Clarke, 2013], the former is a key concept for optimization theory:
Definition 2. A extended-valued function f : Rd ! R [ {+1} is said to be convex if for all ✓1, ✓2 2 Rd and all t 2 (0, 1), f (1 t)✓1 + t✓2) (1 t)f(✓1) + tf(✓2).
When f is twice diﬀerentiable this condition is equivalent to a symmetric positive-definite Hessian. It is worth noting that these concepts of convexity are independent of the notions of norm or even distance. Indeed convex sets and convex functions only need the basic operations of a vector space to be defined. The class of convex functions is very general and this thesis is restricted, for convenience, to the convex functions which are closed (their epigraphs epi(f) = {(✓,↵ ) 2 Rd ⇥ R : f(✓) ↵} are closed sets) and proper (not identically +1). We refer to the monographes by Rockafellar [1970], Hiriart-Urruty and Lemaréchal [2001] for more details on convex analysis.
Conceptually, convexity enables to turn a local information about the function into a global one. For example a local minimum is automatically a global minimum or even more importantly, the gradient rf(✓) at a point ✓ 2 Rd provides a global linear lower-bound on the function f.
Functions met in machine learning applications often have additional properties. A continuously diﬀerentiable function f is said to be L-smooth for L 2 R + when the gradient of f is L-Lipschitz. When f is twice diﬀerentiable, this is equiva-lent to a global upper-bound on the Hessian of f . Thus this assumption provides a quadratic upper bound on the function value. For example, the empirical risk L(h) = n Xi i=1 `(Yi, h(Xi)) is usually smooth when the loss function ` is smooth and bounded. the data P are almost surely (abbreviated to from now on as a.s.) µ 2
On the other side a function f is µ-strongly convex for µ 2 R+ if f k • k2 is convex. When f is twice diﬀerentiable this is equivalent to a global lower-bound on the Hessian of f and then this assumption provides a quadratic lower-bound on the function value. In this sense, µ measures the curvature of the function f.
strongly convex and smooth functions have several interesting properties we will not review here [see, e.g., Nesterov, 2004]. Moreover these two assumptions are connected: if f is µ-strongly convex then its Fenchel conjugate f⇤ is 1/µ-smooth [Hiriart-Urruty and Lemaréchal, 2001, Theorem 4.2.1]. On the other hand, contrary to convexity, both smoothness and strong convexity depend on the norm considered and Bauschke et al. [2016] recently relaxed this dependency.

Complexity Results in Convex Optimization

This thesis is focused on convex optimization and mainly on the study of optimiza-tion algorithms for quadratic functions. Nowadays, optimization is indeed divided between convex problems (which can be eﬃciently solved) and non-convex problems (for which there are no eﬃcient and generic methods).
Interestingly, this distinction was initially drawn between linear and non-linear problems. Linear programming is now essentially solved: (a) in practice since 1947 by Dantzig’s simplex algorithm (with an exponential worst-case complexity); (b) in theory by Khachiyan [1979] who achieved polynomial complexity using the ellipsoid method (with very slow convergence in practice); and (c) in both theory and practice since Karmakar [1984] proposed the first eﬃcient polynomial-time algorithm using interior point methods.
This winding path from practical to theoretical progress has fed and inspired con-vex optimization since the seminal works of von Neumann [1963] and those of Kuhn and Tucker [1951]. Nemirovski and Yudin [1979] first showed that all convex mini-mization problems with a first-order oracle can theoretically be solved in polynomial time using the ellipsoid method, but with slow convergence in practice. Then Nes-terov and Nemirovskii [1994] extended the interior point method to eﬃciently solve a wider class of convex problems.
However when the complexity of problems becomes higher, this generic solver becomes ineﬃcient and older first-order methods are preferred at the cost of precision to the advantage of many cheap iterations. Unfortunately there is no unified analysis for these large-scale problems and the algorithmic choices are constrained by the problem structure. This has prompted both theoreticians and practitioners to move away from black-box optimization and leverage the specifics of the problem being considered to design better adapted algorithms.
Therefore large-scale optimization contributes to blur the frontier between con-vexity and non-convexity by defining new common territories, i.e., algorithms which can eﬃciently solve diﬀerent problems indiﬀerently of their convex aspects [Ge et al., 2016, Jin et al., 2016].

Black Box Optimization and Lyapunov Analysis

We consider the general optimization problem: min f(✓), ✓2C where C is a convex set and f is a convex function. In order to conceptualize the optimization task, we adopt the black box framework defined by Nemirovski and Yudin [1979]. The convex set C is known to the algorithm but not the objective function f. The only assumption made on f is that it belongs to some class of functions F. The information about f is obtained through interacting with an oracle. When queried at a given point ✓, the first-order oracle we consider here, answers by giving the function value f(✓ ) and the gradient rf(✓). Therefore a first-order black box procedure is a sequence of mappings ( n)n 0 where n : Cn ⇥ (Rd)n ⇥ Rn ! C. The algorithm starts from ✓0 = 0 and then iterates:
✓n = n 1 ✓0, . . . , ✓n 1, rf(✓0), . . . , rf(✓n 1), f(✓0), . . . , f(✓n 1) .
Our aim is to determine the oracle complexity of the problem. This is the number of oracle queries necessary to solve the problem at a precision of « , i.e., to find an admissible point 2 C such that ˜ . To that purpose we should: f(✓) min˜2C f(✓) » (a) find an algorithm whose convergence rate matches the desired rate, this would imply an upper bound on the complexity; (b) provide a lower-bound on the complexity by showing that no admissible method can solve the problem faster. This is usually done by finding a particular function for which it is possible to bound from below the performance of any method.
We note that oracle complexity does not directly provide information about the computational complexity of the method since the oracle query and the mapping n may be computationally demanding. Therefore we will also pay attention to the com-putational complexity of each method. This explains why we often restrict ourselves to algorithms with finite-order iteration of the form:
✓n = n 1(✓n k, . . . , ✓n 1, rf(✓n k), . . . , rf(✓n 1), f(✓n k) . . . , f(✓n 1)).
They can be reformulated (after a change of variable ⇥n = (✓n1 , . . . , ✓n)) as iterative processes of the form ⇥n = Fn 1(⇥n 1).
To prove the convergence of these dynamical systems, it is common to rely on Lya-punov theory. It goes back to Lyapunov [1892] who showed that all the trajectories of the ordinary diﬀerential equation (ODE) x˙(t) = Ax(t) goes to zero (the ODE is stable) if and only if there exists a symmetric positive-definite matrix P such that A>P +PA40.
The matrix P solution of the Lyapunov inequality A> P A P 4 0 can be found analytically using convex optimization [Boyd et al., 1994] and ensures the convergence of the iterative process. Lyapunov theory has been frequently applied to control theory with major works from Lur’e and Postnikov [1944], Popov [1961], Kalman [1963], Yakubovich [1971]. This method can be extended to non-linear processes, by linearizing the iterative procedure [Perron, 1929, Ostrowski, 1966]. It can also be applied directly to optimization [Polyak, 1987] or by using more complex control theory tools [Lessard et al., 2016].
Most stability results for diﬀerence equations [Ortega and Rheinboldt, 2000] have been obtained as discrete analogues of corresponding results for diﬀerential equa-tions [see, e.g., LaSalle and Lefschetz, 1961]. Lyapunov’s second method [Hahn, 1958, Kalman and Bertram, 1960] is the most common and general method to prove convergence of iterative processes. Its idea is to introduce a nonnegative function V (the so-called Lyapunov function) which will decrease along the sequence of iterates, i.e., V (✓n+1) V ( ✓n) and therefore ensure the convergence of the method [see Polyak, 1987, Sec 2.2, for more details]. Therefore finding such a function V may prove the convergence of the iterative system and converse results ensure existence of a Lyapunov function for a stable iterative system [Halanay, 1963, Driver, 1965]. Un-fortunately there is no systematic way to find Lyapunov functions in practice. There are some classical solutions such as f (✓ ) f (✓⇤) or k✓ ✓⇤k2 , but these do not always work. In general, theoreticians have to use their experience to design them.

READ Regression with P-VOL, D-Yield and P-OUT

Smooth Optimization

In this section we present a panel of methods for optimizing a smooth function. We do not aim at being comprehensive and we refer to Nesterov [2004], Bubeck [2015] for surveys. We first present the gradient method dating back to Cauchy [1847] which will be the starting point for more refined first-order techniques.
The number of iterations can not be too large compared to the dimension of the problem. Note this restriction is necessary since the ellipsoid method [Ben-Tal and Nemirovski, 2001, Sec. 5.2 ] converges at exponential rate for n larger than d. Moreover this lower-bound is still informative about the first steps, when n is small compared to d. Surprisingly, the worst-case function designed to prove this result is quadratic. In this sense, quadratic functions are not easier to optimize than smooth functions. Regarding the strongly convex case, Nesterov [2004] provides a lower bound O⇣ p +p ⌘ for the oracle complexity.
Clearly, there is a gap between the lower bounds and the performance of gradient descent in Proposition 1. But the lower bounds are not to blame as it turns out they are matched by a slightly more involved algorithm, which we describe now.
Accelerated gradient descent. Taking inspiration from conjugate gradient, Ne-mirovsky and Yudin [1983] proposed the first method with optimal convergence rates. However this method needed some kind of line-search and was not practical. Nes-terov [1983] presented an optimal algorithm which only needs a first-order oracle. Nesterov’s accelerated gradient can be described as follows: start from ✓0 = ⌘0 2 Rd and then iterate
1 rf(⌘n 1)
✓n = ⌘n 1
⌘n = ✓n + n(✓n ✓n 1),
where the momentum coeﬃcient n 2 R is chosen to accelerate the convergence rate and has its roots in the heavy-ball algorithm from Polyak [1964]. Figure 1-2 illustrates the diﬀerence between gradient and accelerated gradient steps. For this algorithm one obtains the following result.
The algorithm was initially proposed by Nesterov [1983, 2004] with a more complex momentum term. The simplified version we present here is due to Tseng [2008]. Beck and Teboulle [2009], Schmidt et al. [2011] both proposed very concise proofs of these optimal convergence rates. It is worth noting that the accelerated gradient method is not adaptive to strong convexity. Indeed the value of µ is needed to obtain the linear convergence rate. Nesterov [2013], O’Donoghue and Candès [2013] proposed the use of restart heuristics to resolve this issue. Indeed Arjevani and Shamir [2016] show that no methods with fixed sequences of step-sizes can achieve the accelerated rate, unless the parameter µ is known. This method is not a descent algorithm and often exhibits an oscillatory behavior. This will be further detailed in Chapter 2. Until recently, accelerated gradient descend lacked an intuitive interpretation, contrary to the standard gradient descent. This has been addressed by Allen-Zhu and Orecchia [2017], Bubeck et al. [2015], and in a series of works on the connection with diﬀerential equations by Su et al. [2014], Krichene et al. [2015], Wibisono et al. [2016], Wilson et al. [2016], Attouch et al. [2016a], Attouch and Peypouquet [2016].

Specificity of Quadratic Functions

Minimizing a quadratic function f is equivalent to solving a linear system of equa-tions, i.e., rf (✓) = 0 . For instance, in least-squares regression, the ERM predictor is directly given by the normal equation Φ > ˆ ~ (1.2) Φ✓=ΦY, which is a set of d equations for d unknowns. Hence, the methods we presented so far may appear ineﬃcient or needlessly complicated for solving such problems. Nevertheless the worst case function used to prove the lower bound in Proposition 2 is quadratic and accordingly quadratic problems are just as hard as any smooth problem. We introduce presently two methods dedicated to minimizing quadratic functions.
Numerical algebra. First one may prefer to directly solve the system in Eq. (1.2) by using numerical linear algebra algorithms. The complexities of these methods can usually be decomposed as 3 2 to form > , 2 to form ~ and 3 to O(d + nd ) Φ Φ O (nd ) ΦY O(d ) solve the linear system. Gaussian elimination could be used to solve the linear system in Eq. (1.2) as any linear system but this approach is oblivious to the specific form of the empirical covariance Φ>Φ. One may thus prefer methods based on singular value decomposition or QR decomposition when Φ is full rank [see, e.g., Golub and Van Loan, 2013]. These methods rely on well-known linear algebra, but can only be used for small to medium scale problems (typically d 104). For large scale problems they are not suitable since they do not leverage the special structure of the problem and the conjugate gradient algorithm may be favored instead.
Conjugate gradient algorithm. Hestenes and Stiefel [1952] introduced the con-jugate gradient method which corresponds to the momentum algorithm ✓n = ✓n 1 nrf(✓n 1) + n(✓n 1 ✓n 2) with ( n, n) solutions of the two dimensional optimization problem: min f ✓n 1 rf(✓n 1) + (✓n 1 ✓n 2) . γ,δ
This line search procedure can be explicitly solved when f is a quadratic function with the same computational complexity as a gradient computation. The explicit method is detailed by Golub and Van Loan [2013]. Surprisingly this method converges to the solution ✓⇤ of the problem in at most d steps and the behavior of the method is globally controlled by [Polyak, 1987]

Extension to Composite Optimization

When the function f is no longer smooth, gradient descent with constant step size fails be to consistent [Bertsekas, 1999]. Shor [1962], Ermoliev [1966], Polyak [1967] solve this issue by showing that the projected sub-gradient method with decreasing step size O(1 /pn) converges at rate O(1 /pn) when the average of the iterates is considered as output of the method [Shor et al., 1985]. When the function is also µ-strongly convex, the projected sub-gradient method with step- size O(1 /µn) converges at rate O(log(n)/µn) and at rate O (1/µn ) when non-uniform average is used. More-over Nemirovsky and Yudin [1983] show these rates are optimal among first-order methods.
Yet there is a crucial diﬀerence between the class of smooth problems (with bounded smoothness constant) and the class of all convex problems since conver-gence rates are downgraded from O(1/n2) to O(1/pn). However the structure of the function as well as the reason for non-smoothness are often known. For instance the objective function is sometimes the sum of a smooth and a non-smooth function, as in Lasso or constrained problems. These are composite problems of the form:
min f(✓) + g(✓), ✓2Rd where f is an L-smooth function accessible through a first-order oracle and g is a simple convex function in a sense that will be explained below.
As with the gradient descent for smooth functions, at each iteration the smooth function f is linearized around the current iterate ✓n. Then proximal gradient meth-ods, also called forward-backward splitting methods [see, e.g., Beck and Teboulle, 2009, Wright et al., 2009, Combettes and Pesquet, 2011] consider the iterate operator is computable eﬀectively. This is the case if there exists a closed form expression and several examples are provided by Bach et al. [2012]. Surprisingly Beck and Teboulle [2009], Nesterov [2013] show convergence results similar to smooth optimization with the proofs following the same general guidelines.

Table of contents :

1 Introduction
1.1 Machine Learning
1.2 Supervised Machine Learning
1.3 Mathematical Framework
1.4 Analysis of Empirical Risk Minimization
1.5 Complexity Results in Convex Optimization
1.6 Stochastic Approximation
1.7 Online Convex Optimization
1.8 Digest of Least-Squares Regression
I Stochastic Approximation and Least-Squares Regression
2 Multistep Methods for Quadratic Optimization
2.1 Introduction
2.2 Second-Order Iterative Algorithms for Quadratic Functions
2.3 Convergence with Noiseless Gradients
2.4 Quadratic Optimization with Additive Noise
2.5 Experiments
2.6 Conclusion
Appendices
2.A Additional Experimental Results
2.B Proofs of Section 2.2
2.C Proof of Section 2.3
2.D Proof of Theorem 3
2.E Lower Bounds
2.F Proofs of Section 2.4
2.G Comparison with Additional Other Algorithms
2.H Lower Bound for Stochastic Optimization for Least-Squares
3 Optimal Convergence Rates for Least-Squares Regression through Stochastic Approximation
3.1 Introduction
3.2 Least-Squares Regression
3.3 Averaged Stochastic Gradient Descent
3.4 Accelerated Stochastic Averaged Gradient Descent
3.5 Tighter Dimension-Independent Convergence Rates
3.6 Experiments
3.7 Conclusion
Appendices
3.A Proofs of Section 3.3
3.B Proof of Theorem 7
3.C Convergence of Accelerated Averaged Stochastic Gradient Descent
3.D Technical Lemmas
4 Dual Averaging Algorithm for Composite Least-Squares Problems
4.1 Introduction
4.2 Dual Averaging Algorithm
4.3 Stochastic Convergence Results for Quadratic Functions
4.4 Parallel Between Dual Averaging and Mirror Descent
4.5 Experiments
4.6 Conclusion
Appendices
4.A Unambiguity of the Primal Iterate
4.B Proof of Convergence of Deterministic DA
4.C Proof of Proposition 9
4.D Proof of Proposition 10
4.E Lower Bound for Non-Strongly Convex Quadratic Regularization
4.F Lower Bound for Stochastic Approximation Problems
4.G Lower Bounds on the Rates of Convergence of DA and MD Algorithms
4.H Continuous Time Interpretation of DA et MD
4.I Examples of Different Geometries
4.J Proof of Proposition
4.K Standard Benchmarks
II Applications of the Quadratic Loss in Machine Learning
5 Application to Discriminative Clustering
5.1 Introduction
5.2 Joint Dimension Reduction and Clustering
5.3 Regularization
5.4 Extension to Multiple Labels
5.5 Theoretical Analysis
5.6 Algorithms
5.7 Experiments
5.8 Conclusion
Appendices
5.A Joint Clustering and Dimension Reduction
5.B Full (Unsuccessful) Relaxation
5.C Equivalent Relaxation
5.D Auxiliary Results for Section 5.5.1
5.E Auxiliary Results for Sparse Extension
5.F Proof of Multi-Label Results
5.G Efficient Optimization Problem
6 Application to Isotonic Regression and Seriation Problems
6.1 Introduction
6.2 Problem Setup and Related Work
6.3 Main Results
6.4 Further Results in the Monotone Case
6.5 Discussion
Appendices
6.A Proof of the Upper Bounds
6.B Metric Entropy
6.C Proof of the Lower Bounds
6.D Matrices with Increasing Columns
6.E Upper bounds in a Trivial Case
6.F Unimodal Regression
7 Conclusion and Future Work
7.1 Summary of the Thesis
7.2 Perspectives