Langevin Monte-Carlo with inaccurate gradient
In this paper, we study the problem of sampling from a given probability density function that is known to be smooth and strongly log-concave. We analyze several methods of approximate sampling based on discretizations of the (highly overdamped) Langevin diﬀusion and establish guarantees on its error measured in the Wasserstein-2 distance. Our guarantees improve or extend the state-of-the-art results in three directions. First, we provide an upper bound on the error of the first-order Langevin Monte Carlo (LMC) algorithm with optimized varying step-size. This result has the advantage of being horizon free (we do not need to know in advance the target precision) and to improve by a logarithmic factor the corresponding result for the constant step-size. Second, we study the case where accurate evaluations of the gradient of the log-density are unavailable, but one can have access to approximations of the aforementioned gradient. In such a situation, we consider both deterministic and stochastic approximations of the gradient and provide an upper bound on the sampling error of the first-order LMC that quantifies the impact of the gradient evaluation inaccuracies. Third, we establish upper bounds for two versions of the second-order LMC, which leverage the Hessian of the log-density. We provide nonasymptotic guarantees on the sampling error of these second-order LMCs. These guarantees reveal that the second-order LMC algorithms improve on the first-order LMC in ill-conditioned settings.
This chapter is based on a joint work with Arnak Dalalyan called “User-friendly guarantees for the Langevin Monte-Carlo with inaccurate gradient”. It is published in Stochastic Processes and their Applications.
The problem of sampling a random vector distributed according to a given target distribution is central in many applications. In the present paper, we consider this problem in the case of a target distribution having a smooth and log-concave density and when the sampling is performed by a version of the Langevin Monte Carlo algorithm (LMC). More precisely, for a positive integer p, we consider a continuously diﬀerentiable function f : Rp ! R satisfying the following assumption: For some positive constants m and M, it holds f( ) f( 0) r f( 0)>( 0) (m=2)k 0k22; 8 ; 0 2 Rp; (2.1) < f( ) r f( 0) 2 M k 0 k 2; :kr k
where rf stands for the gradient of f and k k2 is the Euclidean norm. The target distributions considered in this paper are those having a density with respect to the Lebesgue measure on Rp given by e f( ) ()=R : Rp e f(u) du
We say that the density ( ) / e is log-concave (resp. strongly log-concave) if the function f satisfies the first inequality of (2.1) with m = 0 (resp. m > 0).
Most part of this work focused on the analysis of the LMC algorithm, which can be seen as the analogue in the problem of sampling of the gradient descent algorithm for optimization. For a sequence of positive parameters h = fhkgk2N, referred to as the step-sizes and for an initial point #0;h 2 Rp that may be deterministic or random, the iterations of the LMC algorithm are defined by the update rule #k+1;h = #k;h hk+1rf(#k;h) + p2hk+1 k+1; k = 0; 1; 2; : : : (2.2) where 1 ; : : : ; k; : : : is a sequence of mutually independent, and independent of #0;h, centered Gaussian vectors with covariance matrices equal to identity.
When all the hk ’s are equal to some value h > 0, we will call the sequence in (2.2) the constant step LMC and will denote it by #k+1;h. When f satisfies assumptions (2.1), if h is small and k is large (so that the product kh is large), the distribution of #k;h is known to be a good approximation to the distribution with density ( ). An important question is to quantify the quality of this approximation. An appealing approach to address this question is by establishing non asymptotic upper bounds on the error of sampling; this kind of bounds are particularly useful for deriving a stopping rule for the LMC algorithm, as well as for understanding the computational complexity of sampling methods in high dimensional problems. In the present paper we establish such bounds by focusing on their user-friendliness. The latter means that our bounds are easy to interpret, hold under conditions that are not diﬃcult to check and lead to simple theoretically grounded choice of the number of iterations and the step-size.
In the present work, we measure the error of sampling in the Wasserstein-Monge-Kantorovich distance W2 . For two measures and defined on (Rp; B(Rp)), and for a real number q 1, Wq is defined by q( ) = %2%( ; ) ZRp Rp k k2 ( ) 0 q d% ; 0 1=q W ; inf ;
where the inf is with respect to all joint distributions % having and as marginal distributions. For statistical and machine learning applications, we believe that this distance is more suitable for assessing the quality of approximate sampling schemes than other metrics such as the total variation or the Kullback-Leibler divergence. Indeed, bounds on the Wasserstein distance—unlike the bounds on the total-variation—provide direct guarantees on the accuracy of approximating the first and the second order moments.
Asymptotic properties of the LMC algorithm, also known as Unadjusted Langevin Algorithm (ULA), and its Metropolis adjusted version, MALA, have been studied in a number of papers [RT96a, RR98, ST99b, ST99c, JH00, RS02]. These results do not emphasize the eﬀect of the dimension on the computational complexity of the algorithm, which is roughly proportional to the number of iterations. Non asymptotic bounds on the total variation error of the LMC for log-concave and strongly log-concave distributions have been established by [Dal17b]. If a warm start is available, the results in [Dal17b] imply that after O(p= »2) iterations the LMC algorithm has an error bounded from above by « . Furthermore, if we assume that in addition to (2.1) the function f has a Lipschitz continuous Hessian, then a modified version of the LMC, the LMC with Ozaki discretization (LMCO), needs O(p= ») iterations to achieve a precision level « . These results were improved and extended to the Wasserstein distance by [DM17, DM19]. More precisely, they removed the condition of the warm start and proved that under the Lipschitz continuity assumption on the Hessian of f, it is not necessary to modify the LMC for getting the rate O(p= »). The last result is closely related to an error bound between a diﬀusion process and its Euler discretization established by [AJKH14].
On a related note, [BEL18] studied the convergence of the LMC algorithm with reflection at the boundary of a compact set, which makes it possible to sample from a compactly supported density (see also [BDMP17]). Extensions to non-smooth densities were presented in [DMP18, LFC17]. [CB18 ] obtained guarantees similar to those in [Dal17b] when the error is measured by the Kullback-Leibler divergence. Very recently, [CCBJ18] derived non asymptotic guarantees for the kinetic LMC which turned out to improve on the previously known results. Langevin dynamics was used in [ ARW16, BDM18] in order to approximate normalizing constants of target distributions. [HZ17] established tight bounds in Wasserstein distance between the invariant distributions of two (Langevin) diﬀusions; the bounds involve mixing rates of the diﬀusions and the deviation in their drifts.
The goal of the present work is to push further the study of the LMC and its variants both by improving the existing guarantees and by extending them in some directions. Our main contributions can be summarized as follows:
• We state simplified guarantees in Wasserstein distance with improved constants both for the LMC and the LMCO when the step-size is constant, see Theorem 4 and Theorem 9.
• We propose a varying-step LMC which avoids a logarithmic factor in the number of iterations required to achieve a precision level « , see Theorem 5.
• We extend the previous guarantees to the case where accurate evaluations of the gradient are unavailable. Thus, at each iteration of the algorithm, the gradient is computed within an error that has a deterministic and a stochastic component. Theorem 7 deals with functions f satisfying (2.1), whereas Theorem 8 requires the additional assumption of the smoothness of the Hessian of f.
• We propose a new second-order sampling algorithm termed LMCO0. It has a per-iteration computational cost comparable to that of the LMC and enjoys nearly the same guarantees as the LMCO, when the Hessian of f is Lipschitz continuous, see Theorem 9.
• We provide a detailed discussion of the relations between, on the one hand, the sampling methods and guarantees of their convergence and, on the other hand, optimization methods and guarantees of their convergence (see Section 2.5).
We have to emphasize right away that Theorem 4 is a corrected version of [Dal17a, Theorem 1], whereas Theorem 7 extends [Dal17a, Theorem 3] to more general noise. In particular, Theorem 7 removes the unbiasedness and independence conditions. Furthermore, thanks to a shrewd use of a recursive inequality, the upper bound in Theorem 7 is tighter than the one in [Dal17a, Theorem 3].
As an illustration of the first two bullets mentioned in the above summary of our contributions, let us consider the following example. Assume that m = 10, M = 20 and we have at our disposal an initial sampling distribution 0 satisfying W2( 0; ) = p+(p=m). The main inequalities in Theorem 4 and Theorem 5 imply that after K iterations, the distribution K obtained by the LMC algorithm satisfies W2( K; ) (1 mh)K W2( 0; ) + 1:65(M=m)(hp)1=2 (2.3) for the constant step LMC M + m + (2=3)m(K K1) for the varying-step LMC, where K1 is an integer the precise value of which is provided in Theorem 5. One can compare these inequalities with the corresponding bound in [DM19]: adapted to the constant-step, it takes the form W22(K; ) 21 mM h K W22( 0; ) m + M 2 h 2 2
For any » > 0, we can derive from these guarantees the smallest number of iterations, K », for which there is a h > 0 such that the corresponding upper bound is smaller than « . The logarithms of these values K » for varying » 2 f0:001; 0:005; 0:02g and p 2 f25; : : : ; 1000g are plotted in Figure 2.1. We observe that for all the considered values of » and p, the number of iterations derived from (2.4) (referred to as Theorem 5) is smaller than those derived from (2.3) (referred to as Theorem 4) and from (2.5) (referred to as DM bound). The diﬀerence between the varying-step LMC and the constant step LMC becomes more important when the target precision level » gets smaller. In average over all values of p, when » = 0:001, the number of iterations derived from (2.5) is 4.6 times larger than that derived from (2.4), and almost 3 times larger than the number of iterations derived from (2.3).
Table of contents :
1.1 High-dimensional Parametric Statistics
1.2 Statistiques paramétriques en grande dimension
1.3 Convex Optimization
1.4 Sampling from a probability distribution
1.5 Langevin sampling
1.7 Résumé substantiel
2 Langevin Monte-Carlo with inaccurate gradient
2.2 Guarantees in the Wasserstein distance with accurate gradient
2.3 Guarantees for the inaccurate gradient version
2.4 Guarantees under additional smoothness
2.5 Relation with optimization
Appendix to Chapter 2
2.A Proof of Theorem 4
2.B Proof of Theorem 5
2.C Proof of Theorem 6
2.D Proof of Theorem 7
2.E Proof of Theorem 8
2.F Proof of Theorem 9
2.G Proofs of the lemmas
3 Langevin Monte-Carlo with convex potentials
3.2 Further precisions on the analyzed methods
3.3 How to measure the complexity of a sampling scheme?
3.4 Overview of main contributions
3.5 Prior work
3.6 Precision and computational complexity of the LMC
3.7 Precision and computational complexity of KLMC and KLMC2
3.8 Bounding moments
Appendix to Chapter 3
3.A Proof of Proposition 13
3.B Proof of Proposition 14
3.C Proof of Proposition 15
3.D Proof of Proposition 12
3.E Technical lemmas
4 Penalized Langevin Dynamics
4.2 Convergence of penalized Langevin dynamics
4.3 The counterpart in optimization: penalized gradient flow
4.4 Prior work and outlook
Appendix to Chapter 4
4.A Proof of Theorem 14
4.B Proof of Theorem 15
4.C Proofs of the lemmas used in Theorem 14
4.D Proof of Proposition 17
4.E (Weakly) convex potentials: what is known and what we can hope for
4.F Proofs of the lemmas used in Theorem 15
4.G Penalized Gradient Flow
4.H Examples of functions satisfying condition A(D; q)
5 Summary and Perspectives
5.1 Summary of the Thesis