# Learning with Sinkhorn Divergences


## Distances Between Probability Measures and Weak-Convergence

This section introduces three types of discrepancies between measures. They are not all distances strictly speaking, but each defines a notion of closeness between probability measures. We review ϕ-divergences, Maximum Mean Discrepancy (MMD) (which comes from the larger class of Integral Probability Metrics) and Optimal Transport (OT) distances, of which the Wasserstein distance is a special instance, as these are all popular losses in machine learning problems.

ϕ-divergences

The simplest tools to compare two measures are ϕ-divergences. Roughly speaking, they compare dα/dβ(x) to 1 through the following formulation:
Definition 1. (ϕ-divergence) (Csiszár, 1975) Let ϕ be a convex, lower semi-continuous function such that ϕ(1) = 0. When α is absolutely continuous with respect to β, the ϕ-divergence of α with respect to β is defined as
Dϕ(α|β) def= ∫X ϕ(dα/dβ(x)) dβ(x).
The best-known ϕ-divergence is the so-called Kullback-Leibler divergence (see Table 1.1 for examples), which is widely used in machine learning problems (see Chapter 2 for an overview of learning with ϕ-divergences). However, it is equal to +∞ if the two measures do not share the same support, which causes discontinuity issues. For instance, consider the case on R where α = δ0, a Dirac mass at 0, and αn = δ1/n, a Dirac mass at 1/n. Then DKL(αn|α) = +∞ for all n, although it would seem natural to say that αn gets closer to α as n goes to infinity. As for DTV (which is a norm) and DH² (which is the square of a distance), they are both constant and finite for all n in this same case. This issue is simply an illustration of the fact that ϕ-divergences do not metrize weak convergence.
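To make the Dirac example concrete, here is a short numerical sketch (the function names and the discretization on the two support points are illustrative, and the constants in the Hellinger and TV conventions vary between references): DKL blows up on disjoint supports while DTV and DH² stay bounded.

```python
import numpy as np

def kl_div(a, b):
    """Kullback-Leibler divergence sum_i a_i log(a_i / b_i), with 0 log 0 = 0."""
    mask = a > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

def tv_dist(a, b):
    """Total variation distance 0.5 * sum_i |a_i - b_i|."""
    return 0.5 * float(np.sum(np.abs(a - b)))

def hellinger_sq(a, b):
    """Squared Hellinger distance 0.5 * sum_i (sqrt(a_i) - sqrt(b_i))^2."""
    return 0.5 * float(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

# alpha = delta_0 and alpha_n = delta_{1/n}, viewed on the joint support {0, 1/n}
alpha   = np.array([1.0, 0.0])   # all mass at 0
alpha_n = np.array([0.0, 1.0])   # all mass at 1/n

print(kl_div(alpha_n, alpha))        # inf: the supports are disjoint
print(tv_dist(alpha_n, alpha))       # 1.0, for every n
print(hellinger_sq(alpha_n, alpha))  # 1.0, for every n
```

Note that the outputs do not depend on n at all: none of these divergences "sees" that 1/n approaches 0.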
Definition 2. (Weak convergence) We say that a sequence of measures (αn)n weakly converges to α (or converges in law) if
∫ f(x)dαn(x) → ∫ f(x)dα(x) ∀f ∈ Cb(X), (2.1)
where Cb(X) denotes the set of continuous bounded functions on X.
We say that a discrepancy d metrizes the weak convergence of measures if d(αn, α) → 0 ⇔ αn ⇀ α,
where ⇀ denotes weak convergence (or convergence in law, written Xn ⇀ X where Xn ∼ αn and X ∼ α). Remark 1. As pointed out in Sec. 5.1 of (Ambrosio et al., 2006), it is sufficient to check (2.1) on any subset Ω of bounded continuous functions whose linear envelope span(Ω) is uniformly dense (i.e. dense in the uniform topology induced by the infinity norm) in Cb(X).
The fact that ϕ-divergences do not metrize weak convergence is a major issue and makes them poor candidates for learning problems, in spite of their appreciated computational simplicity. We discuss this in detail in Chapter 2, where we focus on finding a good notion of distance between measures to fit a (generative) parametric model to a dataset. For now, let us introduce another class of distances between measures which can metrize weak convergence under some assumptions.

Integral Probability Metrics and Maximum Mean Discrepancy

The notion of Integral Probability Metrics (IPMs) was introduced by (Müller, 1997) as a class of maximization problems on certain sets of functions, regrouping some well-known distances:
Definition 3. (Integral probability metrics) (Müller, 1997) Consider two probability distributions α and β on a space X. Given a set of measurable functions F, the integral probability metric dF is defined as
dF(α, β) def= sup f∈F |Eα(f(X)) − Eβ(f(Y))|. (2.2)
Let us now give a sufficient condition on F so that the associated IPM metrizes weak convergence:
Proposition 1. If span(F) is uniformly dense in Cb(X), then dF metrizes weak convergence.
We give some examples of well-known IPMs in Table 1.2. The Wasserstein-1 distance, which is the IPM for the set of 1-Lipschitz functions, can be reformulated using Kantorovich-Rubinstein duality as
W1(α, β) def= min π∈Π(α,β) ∫X×X ‖x − y‖2 dπ(x, y),
where Π(α, β) is the set of probability distributions over the product set X × X with marginals α and β. This formulation is known as an optimal transport problem between α and β with cost function c(x, y) = ‖x − y‖2. It is well known that W1 metrizes weak convergence of measures. We will come back to a more general definition of the Wasserstein distance with other cost functions in the following section, since it extends beyond the frame of IPMs. As for TV, which is both a ϕ-divergence and an IPM, it does not metrize weak convergence: convergence in TV implies weak convergence but not the other way around.

We now focus on Maximum Mean Discrepancy for a while. Maximum Mean Discrepancies are IPMs on the unit ball of a Reproducing Kernel Hilbert Space (RKHS), where the norm is the one induced by its kernel function k. The fact that MMDs metrize weak convergence requires some conditions on the kernel k. Let us start by introducing these concepts in more detail:
Definition 4. (Reproducing Kernel Hilbert Space) Consider a Hilbert space H of real-valued functions on a space X. Let Lx be the evaluation operator, such that Lx(f) def= f(x). Then H is a Reproducing Kernel Hilbert Space if and only if Lx is continuous for all x ∈ X.
From this definition, the role of the reproducing kernel in the reproducing kernel Hilbert space is not obvious. We first give the definition of a reproducing kernel.
Definition 5. (Reproducing Kernel) Consider a Hilbert space H of real-valued functions on a space X. A function k : X × X → R is a reproducing kernel of H if it verifies:
1. ∀x ∈ X, k(x, ·) ∈ H,
2. ∀f ∈ H, f(x) = ⟨f, k(x, ·)⟩H.
Proposition 2. A function k : X × X → R is a reproducing kernel if and only if it is positive definite, i.e. for all (x1, . . . , xn) ∈ Xⁿ and (a1, . . . , an) ∈ Rⁿ,
∑ᵢ₌₁ⁿ ∑ⱼ₌₁ⁿ aᵢaⱼ k(xᵢ, xⱼ) ≥ 0.
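Positive definiteness can be checked numerically on any finite set of points by inspecting the spectrum of the Gram matrix; a small sketch (the Gaussian kernel, sample size and tolerance are illustrative choices, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 3))  # 50 points in R^3

# Gram matrix K_ij = k(x_i, x_j) for the Gaussian kernel k(x, y) = exp(-||x - y||^2)
d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K = np.exp(-d2)

# Positive definiteness of k <=> every Gram matrix is positive semi-definite,
# i.e. all eigenvalues of the symmetric matrix K are nonnegative
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)  # True, up to floating-point error
```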
We can now state a theorem giving an equivalent definition for RKHS.
Theorem 5. A Hilbert space H of real-valued functions on a space X is a Reproducing Kernel Hilbert Space if and only if it has a reproducing kernel. Besides, this reproducing kernel is unique.
Thanks to this theorem, it is possible to define the RKHS associated to any positive definite kernel k.
Remark 2. The proof that any RKHS has a reproducing kernel relies on the Riesz representation theorem. Since the evaluation functional is linear and continuous, there exists a function kx ∈ H such that f(x) = ⟨f, kx⟩H. Defining the bilinear function k : X × X → R by k(x, y) = kx(y), we clearly have that k is a reproducing kernel of H.
The reproducing property of RKHS allows us to derive a much simpler expression for their associated IPM, which becomes a closed-form formula: for independent X, X′ ∼ α and Y, Y′ ∼ β,
MMDk(α, β)² = E[k(X, X′)] + E[k(Y, Y′)] − 2E[k(X, Y)]. (2.3)
Theorem 6. (MMD and weak convergence) (Sriperumbudur et al., 2010) Consider Maximum Mean Discrepancy with kernel k between two measures α and β on some space X , as defined in (2.3).
(i) Let X be a compact space. If the kernel k is universal (i.e. its associated RKHS is dense in the space of continuous functions), then MMDk metrizes weak convergence on M₁₊(X).
The most widely used kernel is the Gaussian kernel k(x, y) = exp(−‖x − y‖²/σ²), which is a universal kernel. According to Theorem 6 it metrizes weak convergence on a compact set, but it does not verify the required hypotheses on Rᵈ. Gaussian kernels are however characteristic, meaning that MMDk(α, β) = 0 ⇔ α = β. Examples of kernels that verify the hypotheses of Theorem 6 on Rᵈ are the so-called Matérn kernels, whose associated RKHSs are Sobolev spaces. We further discuss the use of various kernels for learning problems in Chapter 2 and for function estimation in Chapters 3 and 4.
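The closed-form expression makes MMD straightforward to estimate from samples. A minimal sketch with a Gaussian kernel (the bandwidth σ and the biased V-statistic estimator are illustrative choices, not the thesis's):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Kernel matrix K_ij = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    d2 = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_squared(x, y, sigma=1.0):
    """Biased (V-statistic) estimator of MMD_k(alpha, beta)^2 from samples."""
    kxx = gaussian_kernel(x, x, sigma).mean()   # estimates E[k(X, X')]
    kyy = gaussian_kernel(y, y, sigma).mean()   # estimates E[k(Y, Y')]
    kxy = gaussian_kernel(x, y, sigma).mean()   # estimates E[k(X, Y)]
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))  # samples from alpha
y = rng.normal(2.0, 1.0, size=(200, 2))  # samples from a shifted beta
print(mmd_squared(x, x))  # 0: identical samples
print(mmd_squared(x, y) > 0.1)  # True: well-separated distributions
```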

### Optimal Transport

We consider two probability measures α ∈ M₁₊(X) and β ∈ M₁₊(Y). The Kantorovich formulation (Kantorovich, 1942) of Optimal Transport (OT) between α and β is defined by:
Wc(α, β) def= min π∈Π(α,β) ∫X×Y c(x, y) dπ(x, y), (P)
Figure 1.1 – Illustration of optimal transport between two measures α and β in the continuous case (left) and discrete case (right). In the continuous case, the transport plan is a probability distribution on X × Y while in the discrete case it is a matrix. For the latter, each entry πij corresponds to how much mass is moved from i to j.
where the feasible set is composed of probability distributions over the product space X × Y with fixed marginals α, β:
Π(α, β) def= {π ∈ M₁₊(X × Y); P1♯π = α, P2♯π = β},
where P1♯π (resp. P2♯π) is the marginal distribution of π for the first (resp. second) variable, using the projection maps P1(x, y) = x, P2(x, y) = y along with the push-forward operator ♯.
An optimizer π is called the transport plan between α and β, and quantifies how mass is optimally moved from α to β, see Figure 1.1. The cost function c represents the cost to move a unit of mass from x to y, and Wc(α, β) represents the total cost of moving all mass from α to β.
Remark 3 (p-Wasserstein distance). When X = Y is endowed with a distance dX, choosing c(x, y) = dX(x, y)ᵖ with p ≥ 1 yields the p-th power of the p-Wasserstein distance. The latter is an actual distance between probability measures, which metrizes weak convergence.
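In the Dirac example from earlier, W1(δ1/n, δ0) = 1/n → 0, in contrast with the KL divergence which stayed at +∞. This can be checked with scipy's 1-D Wasserstein routine (assuming scipy is available):

```python
from scipy.stats import wasserstein_distance

# W1 between the Dirac masses delta_{1/n} and delta_0 on the real line
for n in (1, 10, 100):
    w1 = wasserstein_distance([1.0 / n], [0.0])
    print(n, w1)  # w1 = 1/n: it vanishes as n grows, so W1 detects the convergence
```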
For other cost functions c, Wc(α, β) is not necessarily a distance, since it does not always satisfy the triangle inequality, but it is still symmetric and positive under natural assumptions on the cost function (e.g. c(x, y) = 0 ⇔ x = y, c(x, y) ≥ 0).
Optimal transport is a powerful tool to capture the underlying geometry of the measures, since it relies on the cost function c which encodes the geometry of the space X, and it can make meaningful comparisons even when the supports of the measures do not overlap (which is not the case for the Kullback-Leibler divergence, for instance). Besides, the transport plan π gives a mapping between measures which can be used for instance in domain adaptation (Courty et al., 2014). More structure can be enforced with extensions of OT (Alvarez-Melis et al., 2017), which can for instance take into account labels of the data in supervised learning. However, OT suffers from a computational and statistical burden:
• Computing OT is costly: solving OT when dealing with discrete distributions (i.e., finite weighted sums of Dirac masses) amounts to solving a large-scale linear program. This can be done using network flow solvers, which can be further refined to assignment problems when comparing measures of the same size with uniform weights (Burkard et al., 2009). The computational complexity is O(n³ log n), where n is the number of points in the discrete measure (see also the monograph on Computational OT by Peyré et al. (2017) for a detailed review of OT solvers).
• OT suffers from a curse of dimensionality: considering a probability measure α ∈ M₁₊(Rᵈ) and its empirical estimation α̂n, we have E[Wp(α, α̂n)] = O(n^(−1/d)) (see (Weed and Bach, 2017) for refined convergence rates depending on the support of α). Thus the error made when approximating the Wasserstein distance from samples decays very slowly in high dimension: the number of samples required to reach a given accuracy grows exponentially with the dimension of the ambient space.
These two issues have caused OT to be neglected in machine learning applications for a long time in favor of simpler ϕ−divergences or MMD.
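For small discrete measures, the linear program (P) can be written out and solved directly; a sketch using scipy.optimize.linprog (the helper name `ot_lp` and the toy 2-point measures are illustrative):

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(a, b, C):
    """Solve min_pi <C, pi> s.t. pi 1 = a, pi^T 1 = b, pi >= 0 as a linear program."""
    n, m = C.shape
    # Row-marginal constraints: sum_j pi_ij = a_i (pi flattened in row-major order)
    A_rows = np.kron(np.eye(n), np.ones(m))
    # Column-marginal constraints: sum_i pi_ij = b_j
    A_cols = np.kron(np.ones(n), np.eye(m))
    A_eq = np.vstack([A_rows, A_cols])
    b_eq = np.concatenate([a, b])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun, res.x.reshape(n, m)

a = np.array([0.5, 0.5])          # uniform weights on two source points
b = np.array([0.5, 0.5])          # uniform weights on two target points
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])        # pairwise costs c(x_i, y_j)
cost, plan = ot_lp(a, b, C)
print(cost)  # 0.0: the identity assignment moves all mass at zero cost
```

This direct approach is only viable for small n: the LP has n·m variables, which is exactly the scaling issue discussed above.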
Let us conclude this section on OT with a recent extension introduced in (Chizat et al., 2018), (Liero et al., 2018). While OT is restricted to positive measures of mass 1, Unbalanced Optimal Transport can compare any two arbitrary positive measures. The marginal constraints are relaxed, as they are replaced with ϕ-divergences:
Definition 6. (Unbalanced Optimal Transport) Consider two positive measures α ∈ M₊(X) and β ∈ M₊(Y). Unbalanced Optimal Transport is defined as the following minimization problem:
min π∈M₊(X×Y) ∫X×Y c(x, y) dπ(x, y) + Dψ1(P1♯π | α) + Dψ2(P2♯π | β), (2.4)
where ψ1 and ψ2 are positive, lower semi-continuous functions such that ψ1(1) = 0 and ψ2(1) = 0.
Note that there is no constraint on the transport plan besides positivity: it is not required to have marginals equal to α and β nor to have mass 1. For specific choices of c, ψ1, ψ2, unbalanced OT defines a distance on M+(X ). This extension is popular in several applications due to the fact that it can compare any arbitrary positive measures. Whenever possible, we extend our results on regularized OT to the unbalanced case.

#### Regularized Optimal Transport

We introduce regularized optimal transport, which consists in regularizing the original problem by penalizing it with the ϕ-divergence of the transport plan with respect to the product measure:
Wᵠc,ε(α, β) def= min π∈Π(α,β) ∫X×Y c(x, y) dπ(x, y) + ε ∫X×Y ϕ( dπ(x, y) / (dα(x)dβ(y)) ) dα(x)dβ(y), (Pε,ϕ)
where ϕ is a convex function with domain R₊.
Entropic regularization, which is the main focus of this thesis, corresponds to the case ϕ(w) = w log(w) − w + 1 (or alternatively ϕ(w) = w log(w)), but one may also choose the squared penalty ϕ(w) = w²/2 + ιR₊(w), where ι denotes the convex indicator function. However, most of the properties we derive for regularized optimal transport – in particular fast numerical solvers and improved sample complexity – are specific to the entropic regularization.
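In the entropic case and for discrete measures, problem (Pε,ϕ) can be solved with Sinkhorn's algorithm (detailed in Sec. 4.2); a minimal sketch (the value of ε and the fixed iteration count are illustrative choices):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.05, n_iter=500):
    """Sinkhorn iterations for entropy-regularized OT between discrete measures.

    Alternately scales the Gibbs kernel K = exp(-C/eps) so that the transport
    plan pi = diag(u) K diag(v) matches the marginals a and b.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)   # enforce the column marginal
        u = a / (K @ v)     # enforce the row marginal
    pi = u[:, None] * K * v[None, :]
    return pi, float(np.sum(pi * C))

a = np.array([0.5, 0.5])
b = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])
pi, cost = sinkhorn(a, b, C)
print(np.allclose(pi.sum(axis=1), a))  # True: the plan has the right marginals
```

Each iteration only involves matrix-vector products, which is what makes the entropic solver so much cheaper than the linear program for large discrete measures.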

Dual Expectation Formulation

Another benefit of the regularization introduced above is the fact that it can be rewritten as the maximization of an expectation with respect to the product measure α ⊗ β.
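For the entropic choice of ϕ, this dual expectation form (as derived in e.g. Genevay et al., 2016) reads, up to an additive constant depending on the normalization of the entropy:

```latex
W_{c,\varepsilon}(\alpha,\beta)
  = \max_{u,v}\;
    \mathbb{E}_{\alpha\otimes\beta}\!\left[
      u(X) + v(Y)
      - \varepsilon\, e^{\frac{u(X)+v(Y)-c(X,Y)}{\varepsilon}}
    \right] + \varepsilon .
```

This expectation form is what opens the door to stochastic optimization: the objective can be estimated from samples of α and β, without discretizing the measures.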

Entropy-Regularized Optimal Transport

Entropic regularization is the main focus of this thesis, as it presents several specific properties:
• a closed-form primal-dual relationship, allowing one to recover the transport plan π after solving the simpler (unconstrained) dual problem,
• a fast numerical solver for finite discrete measures, Sinkhorn’s algorithm (see Sec. 4.2),
• a discrepancy between measures interpolating between standard OT and MMD (see Chapter 2),
• an improved sample complexity compared to OT, breaking the curse of dimensionality for a large enough regularization parameter (see Chapter 3),
• a reformulation of the dual as the maximization of an expectation in a Reproducing Kernel Hilbert Space (RKHS) ball of finite radius, allowing one to solve the dual problem with a kernel version of stochastic gradient descent (see Chapter 3 and Chapter 4),
• a semi-dual formulation (Sε), allowing one to solve semi-discrete OT with online descent algorithms (see Chapter 4).

Solving the Regularized Dual Problem

As most of the methods we develop in this thesis rely on the dual formulation of entropy-regularized OT, we give a proof of existence of a solution to this problem, for a general setting. We discuss further the regularity of the dual potentials in Chapter 3. This section is dedicated to proving the following existence theorem:
Theorem 7. (Existence of a dual solution) Consider the dual of entropy-regularized OT, with marginals α ∈ M₁₊(X) and β ∈ M₁₊(Y) supported on two subsets of Rᵈ, and with a cost function c bounded on X × Y. Let L∞(α) def= {f : X → R | ∃C > 0 such that |f(x)| ≤ C α-a.e.}. Then the dual problem has solutions (u∗, v∗) ∈ L∞(α) × L∞(β) which are unique α- and β-a.e. up to an additive constant. It is straightforward to see that for any solution (u∗, v∗) to the dual problem, the pair (u∗ + k, v∗ − k) for k ∈ R is also a solution. Besides, modifying the values of u∗ and v∗ outside of the supports of the measures does not have any effect on the value of the problem.
The proof of existence of a solution to the dual problem essentially amounts to rewriting the optimality condition as a fixed point equation, and proving that a fixed point exists. To do so, we show that the operator in the fixed point equation is a contraction for a certain metric, called the Hilbert metric. This proof is based on the same idea as the proof of existence of a solution to Schrödinger's system (which shares strong links with regularized OT) in (Chen et al., 2016), itself inspired by the original proof of (Franklin and Lorenz, 1989), which deals with discrete regularized OT. We prove the existence of potentials in a general framework, as we consider arbitrary measures α and β and any bounded regular cost function c.
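The contraction property at the heart of this proof can be observed numerically: by Birkhoff's theorem, multiplication by a matrix with positive entries contracts the Hilbert projective metric. A small illustrative sketch (the matrix and vectors are arbitrary, not tied to a specific OT instance):

```python
import numpy as np

def hilbert_metric(u, v):
    """Hilbert projective metric d_H(u, v) = log max_i(u_i/v_i) - log min_i(u_i/v_i)."""
    r = u / v
    return float(np.log(r.max()) - np.log(r.min()))

rng = np.random.default_rng(0)
K = np.exp(-rng.random((5, 5)))  # positive matrix, e.g. a Gibbs kernel exp(-C/eps)
u = rng.random(5) + 0.1          # positive vectors, not proportional to each other
v = rng.random(5) + 0.1

# Birkhoff: d_H(K u, K v) <= lambda(K) d_H(u, v) with contraction ratio lambda(K) < 1
print(hilbert_metric(K @ u, K @ v) < hilbert_metric(u, v))  # True
```

Since the Sinkhorn operator is built from such positive-kernel multiplications (and pointwise divisions, which are Hilbert-metric isometries), iterating it converges to the fixed point given by the dual potentials.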

Chapter 1: Entropy-regularized Optimal Transport
1 Introduction
2 Distances Between Probability Measures
2.1 ϕ-divergences
2.2 Integral Probability Metrics and Maximum Mean discrepancy
2.3 Optimal Transport
3 Regularized Optimal Transport
3.1 Dual Formulation
3.2 The Case of Unbalanced OT
3.3 Dual Expectation Formulation
4 Entropy-Regularized Optimal Transport
4.1 Solving the Regularized Dual Problem
4.1.1 Hilbert Metric
4.1.2 Fixed Point Theorem
4.2 Sinkhorn’s Algorithm
4.3 Semi-Dual Formulation
4.3.1 Case of a Discrete Measure
4.3.2 Semi-Dual Expectation Formulation
4.3.3 Some Analytic Properties of the Semi-Dual Functional
4.4 Convergence of Entropy-Regularized OT to Standard OT
Chapter 2: Learning with Sinkhorn Divergences
1 Introduction
2 Density Fitting
2.1 Learning with ϕ-divergences
2.2 Maximum Mean Discrepancy and Optimal Transport
2.3 Regularized OT and Variants of the Regularized OT Loss
2.4 Sinkhorn Divergences: an Interpolation Between OT and MMD
3 Sinkhorn AutoDiff Algorithm
3.1 Mini-batch Sampling Loss
3.2 Sinkhorn Iterates
3.3 Learning the Cost Function Adversarially
3.4 The Optimization Procedure in Practice
4 Applications
4.1 Benchmark on Synthetic Problems
4.2 Data Clustering with Ellipses
4.3 Tuning a Generative Neural Network
4.3.1 With a Fixed Cost c
4.3.2 Learning the Cost
Chapter 3: Sample Complexity of Sinkhorn Divergences
1 Introduction
2 Reminders on Sinkhorn Divergences
3 Approximating Optimal Transport with Sinkhorn Divergences
4 Properties of Sinkhorn Potentials
5 Approximating Sinkhorn Divergence from Samples
6 Experiments
Chapter 4: Stochastic Optimization for Large Scale OT
1 Introduction
2 Optimal Transport: Primal, Dual and Semi-dual Formulations
2.1 Primal, Dual and Semi-dual Formulations
2.2 Stochastic Optimization Formulations
3 Discrete Optimal Transport
3.1 Discrete Optimization and Sinkhorn
3.2 Incremental Discrete Optimization with SAG when ε > 0
3.3 Numerical Illustrations on Bags of Word-Embeddings
4 Semi-Discrete Optimal Transport
4.1 Stochastic Semi-discrete Optimization with SGD
4.2 Numerical Illustrations on Synthetic Data
5 Continuous Optimal Transport Using RKHS
5.1 Kernel SGD
5.2 Speeding up Iterations with Kernel Approximation
5.2.1 Incomplete Cholesky Decomposition
5.2.2 Random Fourier Features
5.3 Comparison of the Three Algorithms on Synthetic Data
Conclusion
Bibliography
