Get Complete Project Material File(s) Now! »

## Stochastic Gradient Descent

In some cases, the gradient rf may be very expensive to compute, but stochastic estimates can be obtained in a cheap way. This is for instance the case in many machine learning applications, in which the objective function is the average of a loss function (that measures how well a given model fits the data) over all samples in a dataset. Computing a full gradient is expensive, since one needs to compute the gradient of the loss for all samples of the dataset. Yet, a stochastic estimate of the gradient can be obtained by drawing a few samples at random. This is the fundamental principle of Stochastic Gradient Descent (SGD), introduced by Robbins and Monro [1951], which writes: xt+1 = xt − ηtgt, with E[gt] = rf(xt). (1.1.25)

This method is widely used in practice, as it has very good performances for training large-scale models [Bottou, 2010]. It also benefits from very good statistical properties, as it is possible to avoid overfitting and obtain bounds on the testing error (the error on unseen data generated from the true distribution) if only one pass is performed (each training sample is used only once) [Nemirovski and Yudin, 1983, Polyak and Juditsky, 1992]. In particular, this avoids the need for explicit regularization, represented by the λ term in Equation (1.1.13). In order to analyze SGD, we need to assume further conditions on the stochastic estimates gt. One of the simplest assumptions is that the variance of gt is bounded.

Assumption 1. The stochastic gradients gt are independent and such that: E[kgt − rf(xt)k2] ≤ χ2 (1.1.26)

This assumption is quite strong, as it requires the variance to be bounded at all times. This may be hard to achieve when f is strongly convex, since the magnitude of rf grows (a) Small dataset (N = 100) (b) Large dataset (N = 10000)

Figure 3. Comparison between deterministic and stochastic gradient descent on the LibSVM a9a dataset, for a regularization λ = 10−3. SGD was run with step size η = G/L. We see that SGD decreases the error rapidly, but quickly saturates to mean error level that depends on G.

when x is far from x?. Under this assumption, the following convergence result holds [Robbins and Monro, 1951, Bottou et al., 2018].

Theorem 3. Under Assumption 1, the iterates produced by SGD with constant step-size η ≤ L1 guarantee: Et[kxT − x?k2] ≤ (1 − ησ)T kx0 − x?k2 + ηχ2 (1.1.27)

Proof. Using that E[gt] = rf(xt), we obtain the following slightly perturbed version of Equation (1.1.19): Et[kxt+1 − x?k2] ≤ kxt − x?k2 − 2ηtDf (x?, xt) − 2ηt [f(xt) − f(x?)] + ηt2Et[kgtk2]. (1.1.28) Then, since E[gt] = rf(xt): E[kgtk2] = E[kgt − rf(xt)k2] + krf(xt)k2 (1.1.29)

Thus, the exact same calculations as in the case of (deterministic) gradient descent can be performed, and we obtain: Et[kxt+1 − x?k2] ≤ (1 − ηtσ)kxt − x?k2 + ηt2χ2. (1.1.30)

Taking a constant ηt = η, and using that PTt=0(1−ησ)k ≤ 1/(ησ) leads to the final result.

Thus, despite the fact that it only uses stochastic estimates of the gradient, SGD con-verges as fast as its deterministic counterpart, but only up to a given precision. At some point, the variance in the gradients takes over and SGD fails to optimize f beyond this error. We illustrate this phenomenon on Figure 3, in which we compare the GD (Equa-tion (1.1.2)) and SGD (Equation (1.1.25)) algorithms on the Logistic Regression Problem of Equation (1.1.13). The a9a dataset was downloaded from the LibSVM repository1.

In order to obtain exact convergence to x?, one needs to employ other techniques, such as using a diminishing step-size, or averaging the iterates [Ruppert, 1988, Polyak and Juditsky, 1992]. Due to its huge practical impact, this method has been widely studied since [Bach and Moulines, 2011a, Dieuleveut et al., 2017, Gower et al., 2019]. Yet, one does not obtain the linear convergence rate that we show in Theorem 3 in this case, and only sublinear rates (in O(T ) at best without acceleration) can be achieved. This thesis focuses on convex problems, but SGD is also a widespread method both in theory and in practice for minimizing non-convex problems. In this case, convergence guarantees only hold for the norm of the gradient (in the form of Equation (1.1.16), plus a noise term).

**Randomized Coordinate Descent**

When the full gradient rf is expensive to compute, another way to obtain cheap estimates of the gradient can be to only perform coordinate updates (instead of full updates). This method is called randomized coordinate descent, and can be seen as a special case of stochastic gradient descent. In its most simple form, it writes, with pi ∈ R the probability to pick coordinate i:

ηt rif(xt),

xt+1 = xt − pi (1.1.31)

with rif(xt) = eie>irf(xt), where ei ∈ Rd is the unit vector corresponding to coordinate i. In the case of coordinate descent, the variance at optimum is ke>irf(x?)k2 = 0, and so one can mimick the deterministic gradient descent proof. In particular, we use the notion of directional smoothness, and say that f is Li smooth in the direction i if for all δ > 0, Df (x + δei, x) ≤ Li δ2. (1.1.32)

Using this, the following theorem can be derived, where we denote Et[ • ] = E[ • |xt; It = i] to simplify notations, where It is the coordinate picked at time t.

Theorem 4. Denote Lt = 1 Et[kxt − x?k2 + η [f(xt) − f(x?)]], with pmin = mini pi. 2 pmin

Then, if η ≤ piLi for all i, the iterates generated by coordinate descent (Equation (1.1.31)) verify:

Lt ≤ max(1 − ησ, 1 − pmin)tL0. (1.1.33)

Proof. We start again from the main Equation (1.1.19): Et[kxt+1 −x?k2] ≤ kxt −x?k2 −ηDf (x?, xt) −2η [f(xt) − f(x?)] + Et[ krif(xt)k2] (1.1.34)

This time, the update at time t has support on ei, and so: i Df (xt+1, xt) = f(xt+1) − f(xt) + η krif(xt)k2 ≤ η2Li krif(xt)k2 ≤ η krif(xt)k2, p i 2p2 2p

Thus, f(xt+1) ≤ f(xt), and so i i Et[ krif(xt)k2 ] ≤ [f(xt) − Et[f(xt+1)]] . (1.1.35) p2 p min

Plugging this into Equation (1.1.34), we obtain: 1 [ k x t+1− x ?k 2+ η [f(x ) − f(x )]] ≤ 1−ησ [ k x t− x ?k 2]+ η(1 − pmin) [f(x ) − f(x )] . 2Et pmin t+1 ? 2 Et pmin t ?

The result is obtained by taking the minimum of the two terms, and chaining the inequalities.

Note that in this case, the Lyapunov function is not as simple as it was for gradient descent. Because of the stochastic aspect, function values terms are required to guarantee the decrease of the Lyapunov function Lt. Note that this introduces an additive pmin term in the complexity which was expected: without further assumptions, one cannot guarantee a given decrease of the objective if all coordinates have not been seen. Note that it is possible to choose the coordinates in a cyclic fashion (instead of uniformly at random), but the analysis is much more complex in this case [Saha and Tewari, 2013]. For more details on coordinate descent methods, and in particular for composite objectives, we refer the interested reader to Nesterov [2012], Richtárik and Takáč [2014], Lu and Xiao [2015]. Note that parallel and block versions can also be derived in a similar way [Richtárik and Takáč, 2016], but we discuss these methods in further details in Section 1.3.

**Bregman gradients.**

To end this section, we present a last algorithm, which is known as Mirror Descent [Nemirovski and Yudin, 1983, Beck and Teboulle, 2003], or Bregman gradients. The iterations are the same, but the “Mirror Descent” terminology is usually used when the objective function is not smooth, whereas “Bregman gradients” is generally used together with the notion of relative smoothness (that we introduce in Definition 4). This algorithm is similar to gradient descent, but relies on a strictly convex potential function h ∈ C2(Rd) to perform its updates: rh(xt+1) = rh(xt) − ηtrf(xt). (1.1.36)

Another way to see these updates is to notice that the standard gradient descent iterations from (1.1.2) are equivalent to: x = arg min η f(x )>x + 1 x x 2. (1.1.37) t+1 tr 2k − tk x∈Rd t

In this form, we see that gradient descent consists in going in the direction of −rf(xt), but without going too far from the initial point xt in order to guarantee the descent condition. The “without going too far” part is measured in terms of Euclidean distance, but one could imagine measuring it diﬀerently, using Bregman divergences for instance, leading to the following iterations: x t+1 = arg min η tr f(x )>x + D (x, x ). (1.1.38) x∈Rd t h t

This is exactly the idea of Bregman gradient methods, and writing the optimality condi-tion for the inner problem of (1.1.38) actually leads to (1.1.36), thus showing that these two iterations are indeed equivalent. Thus, Bregman gradients can be seen as standard gradient descent, on a space in which distances are measured using the function h, and is often referred to as a “non-Euclidean method”. Mirror descent is widely used in the bandit and reinforcement learning communities, in which the parameters that are optimized repre-sented probabilities, and for which the Kullback-Leibler divergence is very natural [Even-Dar et al., 2009, Bubeck, 2011, Hazan, 2012]. To take advantage of this non-Euclidean aspect, smoothness and strong convexity need to be adapted and defined relatively to the function h [Bauschke et al., 2017, Lu et al., 2018].

**Definition 4** (Relative smoothness and strong convexity). A function f is Lf/h smooth and σf/h relatively strongly convex with respect to h if for all x, y, σf/hDh(x, y) ≤ Df (x, y) ≤ Lf/hDh(x, y). (1.1.39)

First note that when h = 12 k•k22, then the usual notions of smoothness and strong convex-ity are recovered (see Equation (1.1.9)). A very interesting consequence of this definition is that it allows to perform “smooth analyses” for non-smooth functions by changing the gra-dient step. For instance, some objectives (such as Poisson inverse problems) are non-smooth in the classical definition, but they are smooth with respect to the entropy. Thus, taking h as the entropy and performing Bregman gradients instead of gradient descent allows to recover a smooth behaviour. For instance, using relative smoothness, it is possible to prove the following lemma, which is a Bregman generalization of the cocoercivity inequality (a consequence of smoothness for convex functions).

**Lemma 2.** [Dragomir et al., 2021a] For all ηt ≤ , the following holds: Lf/h Dh∗(rh(x) − ηt [rf(x) − rf(y)] , rh(x)) ≤ ηtDf (y, x). (1.1.40)

Using this Lemma, we can directly prove the following theorem, which can also be ob-tained directly [Bauschke et al., 2017, Lu et al., 2018].

**Theorem 5.** Let T ∈ N. If f ∈ C1(Rd) is Lf/h smooth and σf/h strongly convex relative to h, and ηt ≤ L−f/h1 for all t ≤ T , then the iterates produced by T mirror descent iterations are such that:

Y (1.1.41)

Dh(x?, xT ) ≤ (1 − ηtσf/h)Dh(x?, x0).

t=0

The main diﬀerence in the proof is that in the Bregman framework, the natural notion of distance to the solution is given by Dh(x?, x). Yet, Dh(x?, xt+1) cannot be decomposed as easily as in the Euclidean case, but a Bregman analog of Equation (1.1.19) can still be obtained, as we show below.

**Proof.** Define Vt(x) = ηtrf(xt)>x + Dh(x, xt), so that rVt(xt+1) = 0. Then, we write that: Vt(x?) − Vt(xt+1) = DVt (x?, xt+1) = Dh(x?, xt+1), (1.1.42) using the fact that the Bregman divergences of Vt and h are equal (since they only diﬀer by a linear term). The previous equation can be rearranged as: Dh(x?, xt+1) = Dh(x?, xt)−Dh(xt+1, xt)+ηtrf(xt)>(x?−xt)−ηtrf(xt)>(xt+1−xt). (1.1.43)

Then, the last term can be expressed as: −ηtrf(xt)>(xt+1 − xt) = (rh(xt+1) − rh(xt))>(xt+1 − xt) = Dh(xt+1, xt) + Dh(xt, xt+1), and so we obtain a Bregman analog to Equation (1.1.19), which writes: Dh(x?, xt+1) = Dh(x?, xt) − ηt [Df (x?, xt) + f(xt) − f(x?)] + Dh(xt, xt+1). (1.1.44)

Note that this equation is very similar to the base Equation (1.1.20), that we repeatedly used in the Euclidean case, and that it can also be obtained directly by using that rh(xt+1) = rh(xt) − ηtrf(xt). Then, Lemma 2 tells us that: Dh(xt, xt+1) ≤ ηt [f(xt) − f(x?)] , (1.1.45) and so Dh(x?, xt+1) = Dh(x?, xt) − ηtDf (x?, xt), (1.1.46) and we finish the proof using the relative strong convexity of f.

Theorem 5 is very interesting in the sense that modifying the notions of distance and regularity from standard gradient descent does not change the convergence guarantees, but only the way suboptimality is measured. In particular, specifying Theorem 5 to the case h = 12 k • k2 exactly leads to Theorem 1. In Chapters 4 and 5, we show that the same thing happens when extending Bregman gradients to the stochastic setting (coordinate descent and stochastic gradient descent), so that the Euclidean convergence rates are recovered as specific cases. Now that we have presented a few basic first-order methods for convex optimization, we present more advanced techniques that improve the convergence guarantees.

### Designing Faster methods

Gradient descent is a natural algorithm, and we have presented its convergence properties in the previous section. Yet, one may wonder whether gradient descent is optimal, or if it is possible to design algorithms that converge faster with the same assumptions.

**Nesterov acceleration**

In order to design such a method, a natural first ques-tion is the following: How fast can first-order methods be? Yet, to answer this, one must define the notion of speed. The common notion is that of oracle complexity. The complexity of an algorithm is measured by the number of calls to an oracle (which in our case is the gradient), to achieve a given precision level ε > 0. Then, one can design hard instances that cannot be optimized up to a given precision without a minimum number of calls to the oracle.

The proof of this theorem is out of scope of this section. The key point is that the con-vergence rate of this algorithm matches the lower bound from Theorem 6 (up to constants). This means that the lower bound is tight, and that Accelerated Gradient Descent (AGD) achieves the fastest possible rates in this class of algorithms. The complexity gap between gradient descent and accelerated gradient descent is significant, since the constant κf can be very large for standard machine learning problems. Thus, having a √κf bound instead of κf one can make a problem tractable, or allow to use much smaller regularization for a given statistical learning problem (since κf scales as the inverse of the regularization). We illustrate this speedup numerically in Figure 4.

We have presented acceleration for the standard gradient descent method in the smooth and strongly convex setting, but it was originally introduced in the convex setting (σ = 0), √ in which case the iteration complexity changes from O(1/ε) to O(1/ ε) [Nesterov, 1983].

Another famous accelerated method is the so-called Heavy Ball Acceleration [Polyak, 1964], but it enjoys weaker convergence guarantees in general. Nesterov acceleration can also be applied to many other settings, for instance to composite objectives, which incorporate a non-smooth term (but for which we know the proximal operator) [Beck and Teboulle, 2009b, Nesterov, 2013a, Chambolle and Dossal, 2015]. Similar ideas can also be applied to the proximal point algorithm (in which one iterates the proximal operator of the function instead of applying gradient descent) [Güler, 1992]. Note that a generic acceleration framework named Catalyst has also been developed, that allows to obtain comparable improvements in the convergence rate (up to logarithmic factors) with any base algorithm [Lin et al., 2015a]. This framework is based on an accelerated proximal point algorithm in which inner problems are solved by the algorithm that one wishes to accelerate. We refer the interested reader to d’Aspremont et al. [2021] for more details on accelerated methods.

We see that acceleration significantly speeds up first-order algorithms for ill-conditioned problems. A large part of this thesis is devoted to developping accelerated methods that are relevant for distributed optimization.

Accelerated Stochastic methods. Just like standard gradient descent, accelerated gradient descent can be used in stochastic settings. In this case, the following more general form of accelerated gradient descent is generally prefered, with gt an approximation of the gradient:

yt = (1 − αt)xt + αt(1 − βt)vt

1 − αtβt (1.2.3)

vt+1 = (1 − βt)vt + βtyt − at+1

gt

Bt+1

xt+1 = yt + αt(vt+1 − (1 − βt)vt − βtyt).

This algorithm is robust to stochastic noise as long as the gt are unbiased, i.e., as long as E[gt] = rf(yt). In this case, convergence can be proven for accelerated coordinate de-scent [Lin et al., 2015b, Nesterov and Stich, 2017, Allen-Zhu et al., 2016], which corresponds to the special case: 1 eiei>rf(yt). gt = pi (1.2.4)

In Chapter 3, we study variants of this algorithm that support strong convexity in semi-norms induced by non-PSD quadratics, together with proximal operators of a non-smooth term and arbitrary sampling (non-uniform choice of pi). Other types of gradient estimates can be used, and in particular Vaswani et al. [2019] prove convergence of iterations (1.2.3) under a strong growth condition of the form E[kgtk2] ≤ ρkrf(xt)k2, (1.2.5) for some constant ρ > 0.

Bregman gradients. We have seen that the lower bound for smooth and strongly convex optimization does not match the convergence rate of gradient descent, and that accelerated versions can be developed in this setting. The lower bound from Theorem 6 is still valid since the Bregman assumptions generalize that of gradient descent in the sense that it can be recovered by choosing h = 12 k • k2. Yet, the acceleration proofs are specific to gradient descent.

Recall that the convergence results from gradient descent could directly be generalized to the Bregman setting, so one may hope to obtain a complexity that is similar to the euclidean case. Yet, surprisingly, the opposite happens in this case: the lower bound from Theorem 6 is actually not tight in the case of arbitrary h, and can be strengthened as shown by Dragomir et al. [2021b]. In particular, one cannot hope to beat the simple Bregman gradient scheme presented in (1.1.36) without assuming more structure or regularity on the functions than just relative smoothness.

Yet, weaker forms of acceleration can be obtained under a favorable triangle scaling inequality [Hanzely et al., 2021], or when assuming that h is regular enough, as we detail in Chapter 5.

1.2.2. Variance reduction. We have seen in Section 1.1.3 that SGD only converges to a region around x? because of the variance in the gradients. We focus in this section on the specific finite sum problem, in which the objective function f can be decomposed as: f(x) = 1 m fi(x), (1.2.6) where each function fi ∈ C1(Rd) is L-smooth. This corresponds to the typical Empirical Risk Minimization (ERM) problem in which we have a model x ∈ Rd, a dataset of samples (zi) for i ∈ {1, . . . , m}, and a loss function `, where `(x, z) measures the error made by a model x when predicting a given sample z. In this case, the regularized ERM objective writes for λ>0: 1 m λ (1.2.7) m `(x, zi) + 2 kxk2, and the goal is to minimize it in x in order to obtain the model that achieves the lowest error on the training dataset. The regularization parameter λ > 0 allows to limit the model complexity and prevent overfitting, i.e., having a model that performs well on the training data, but would perform poorly on new test data. An instanciation of SGD for this problem would be: xt+1 = xt − ηtrfit (xt), (1.2.8) where it is a random index drawn at time t and fit (x) = `(x, zit ) + λ2 kxk2. Yet, the noise has a lot of structure in this case, since we know that it comes from taking the gradient of a specific function fit instead of f, and we know which fit it comes from.

Direct approaches. A direct approach for variance reduction is to perform the following updates: xt+1 = xt − ηtgt, (1.2.9) where gt is a variance-reduced gradient. Informally, this means that we would like to have that gt → 0 when xt → x?. Many such gt are possible, and we detail here the main ones.

Stochastic Average Gradient (SAG) [Schmidt et al., 2017]. A very natural estimate gt consists in taking gt = i=1 rfi(φt ), (1.2.10) m where the φ(ti) are such that φ(0)t = x0, and then φ(t+1it) = xt, and φ(t+1j) = φ(tj) for j 6= it. In other words, the algorithm stores the most recent rfi(xt) that it has computed for each i, and aggregates them to form the pseudo-gradient gt. Ideally, one would like to have φ(ti) = xt for all i, but this would require recomputing the full gradient at each step, which we specifically would like to avoid. As xt approaches the optimum, each φ(ti) approaches x? and so gt ≈ n1 mi=1 rfi(x?) = 0. This algorithm corresponds to SAG, and although it has P a very simple form, it is quite hard to analyze. This mainly comes from the fact that the updates are biased, in the sense that E[gt] 6= rf(xt).

SAGA [Defazio et al., 2014a]. As previously stated, SAG provides a biased estimator of the gradient, and the method is thus very hard to analyze. Instead, SAGA uses a slightly diﬀerent update: (i) 1 m (i) X gt = rfi(xt) − rfi(φt ) + i=1 rfi(φt ) (1.2.11) m

Similarly to SAG, gt → 0 as xt converges to x?, but this time Et[gt] = rf(xt), so that the stochastic updates are unbiased. The analysis is thus similar to that of SGD, with an additional error term that vanishes this time.

Stochastic Variance Reduced Gradient (SVRG) [Johnson and Zhang, 2013b]. SAG and SAGA store the stochastic gradients corresponding to each component and use them to construct an estimate of the full gradient. Instead, the SVRG method maintains a running point φt, which is periodically updated, and performs the following update: ˜ ˜ (1.2.12) gt = rfi(xt) − rfi(φt) + rf(φt). ˜

If φt is updated every K = O(m) rounds, then the overhead of computing a full gradient every K iterations is only of a constant factor, and the memory cost is significantly lower ˜ r ˜ than SAG or SAGA since only φt and f(φt) need to be stored. Yet, this algorithm has a double-loop structure which makes it harder to tune in practice.

Other direct methods such as Finito [Defazio et al., 2014b] or MISO [Mairal, 2013] have also been introduced, but we do not detail them in this thesis. One key feature is that all these methods have very similar convergence rates. We refer the interested reader to the corresponding paper for each method, but a general convergence theorem can be stated informally as follows.

**Table of contents :**

**Chapter 1. Introduction **

1.1. Basics of Convex Optimization

1.2. Designing Faster methods

1.3. Distributed Optimization Models

1.4. Standard Decentralized Approaches with Linear Convergence

**Chapter 2. Accelerated Decentralized Optimization with Pairwise Updates **

2.1. Introduction

2.2. Model

2.3. Algorithm

2.4. Performances

2.5. Experiments

2.6. Conclusion

Appendices

2.A. Detailed average time per iteration proof

2.B. Execution speed comparisons

2.C. Detailed rate proof

**Chapter 3. Accelerated Variance-reduced decentralized stochastic optimization **

3.1. Introduction

3.2. Model and notations

3.3. Related work

3.4. Optimal rates

3.5. Block Accelerated Proximal Coordinate Gradient with Arbitrary Sampling

3.6. Accelerated Decentralized Stochastic Algorithm

3.7. Local synchrony and the randomized gossip model

3.8. Experiments

3.9. Conclusion

Appendices

3.A. Accelerated Proximal Block Coordinate Descent with Arbitrary Sampling

3.B. Average Time per Iteration

3.C. Algorithm Performances

**Chapter 4. Dual-Free Decentralized Algorithm with Variance Reduction **

4.1. Introduction

4.2. Algorithm Design

4.3. Convergence Rate

4.4. Acceleration

4.5. Experiments

4.6. Conclusion

Appendices

4.A. Bregman Coordinate Descent

4.B. Convergence results for DVR

4.C. Catalyst acceleration

4.D. Experiments

**Chapter 5. Statistically Preconditioned Accelerated Gradient Method **

5.1. Introduction

5.2. Related Work

5.3. The SPAG Algorithm

5.4. Bounding the Relative Condition Number

5.5. Experiments

5.6. Conclusion

5.7. Statistical Preconditioning with other Bregman algorithms

Appendices

5.A. Convergence Analysis of SPAG

5.B. Concentration of Hessians

5.C. Experiment Setting and Additional Results

**Chapter 6. Quantifying the natural differential privacy guarantees of gossip protocols**

6.1. Introduction

6.2. Background and Related Work

6.3. A Model of Differential Privacy for Gossip Protocols

6.4. Extreme Privacy Cases

6.5. Privacy vs. Speed Trade-offs

6.6. Empirical Evaluation

6.7. Concluding Remarks

Appendices

6.A. Model Extensions

6.B. Delayed Start Gossip

6.C. Detailed Proofs

6.D. Challenges of Private Gossip for General Graphs

**Conclusion and Research Directions **

Summary of the thesis

Perspectives

**Bibliography **