Subspace Robust Wasserstein Distances

Get Complete Project Material File(s) Now! »

Chapter 3: Regularized Optimal Transport is Ground-Cost Adversarial

This chapter is based on [Paty and Cuturi, 2020].
Regularizing the optimal transport problem has proven crucial for the opti-mal transport theory to impact the field of machine learning. For instance, it is known that regularizing OT problems with entropy leads to faster computations and better diﬀerentiation using the Sinkhorn algorithm [Cuturi, 2013], as well as better sample complexity bounds than classic OT [Genevay et al., 2019]. In this chapter, we depart from this practical perspective and propose a new interpretation of regularization as a robust mechanism, and show using Fenchel duality that any convex regularization of optimal transport can be interpreted as ground-cost adversarial.
Related work The key to using optimal transport in applications lies in the diﬀerent forms of regularization of the original optimal transport problem. Adding a small convex regularization to the classical linear cost not only helps on the algorithmic side, by convexifying the objective and allowing for faster solvers, but also introduces a regularity trade-oﬀ that prevents from overfitting on data measures. Although entropy-regularized OT is the most studied regularization of OT, due to its algorithmic advantages [Cuturi, 2013], several other convex regularizations of the transport plan have been proposed in the community: quadratically-regularized OT [Essid and Solomon, 2017], OT with capacity constraints [Korman and McCann, 2015], Group-Lasso regularized OT [Courty et al., 2016], OT with Laplacian regularization [Flamary et al., 2014], Tsallis Regularized OT [Muzellec et al., 2017], among others. On the other hand, regularizing the dual Kantorovich problem was shown in [Liero et al., 2018] to be equivalent to unbalanced OT, that is optimal transport with relaxed marginal constraints. The question of understanding why regularizing OT proves critical has triggered several approaches. A compelling reason is statistical: although classical OT suﬀers from the curse of dimensionality, as its empirical version converges at a rate of order O n 1=d [Dudley, 1969, Fournier and Guillin, 2015], regularized OT and more precisely Sinkhorn divergences [Genevay et al., 2018] have a sample complexity of O(1=pn) [Genevay et al., 2019, Mena and Niles-Weed, 2019]. Entropy-regularized OT was also shown to perform maximum likelihood estimation in the Gaussian deconvolution model [Rigollet and Weed, 2018]. Taking another approach, Dessein et al. [2018], Blondel et al. [2018] have considered general classes of convex regularizations and characterized them from a more geometrical perspective. Recently, several papers [Genevay et al., 2018, Flamary et al., 2018, Deshpande et al., 2019, Kolouri et al., 2019, Niles-Weed and Rigollet, 2019, Paty and Cuturi, 2019] have proposed to maximize OT with respect to the ground-cost function, which can in turn be interpreted in light of ground metric learning [Cuturi and Avis, 2014]. This approach can also be viewed as an instance of robust optimization [Ben-Tal and Nemirovski, 1998, Ben-Tal et al., 2009, Bertsimas et al., 2011]: instead of considering a data-dependent, hence unstable minimization problem minx f^(x) where ^ represents the data, the robust optimization literature adversarially chooses the parameters in a neighborhood of the data: max 2 minx f (x).
Contributions Continuing along these lines, our main contribution in this chapter is to make a connection between regularizing and maximizing OT. Our main goal is to provide a novel interpretation of regularized optimal transport in terms of ground-cost robustness: regularizing OT amounts to maximizing unregularized OT with respect to the ground-cost function.
1. We interpret convex regularizations of the transport plan as ground-cost robustness: Here, X is a compact Hausdorﬀ space. Let F : M(X 2) ! R [ f+1g be a lower semi-continuous (l.s.c) functional.
We show that: inf F ( ) = sup Tc( ; ) F (c);
2(;) c2C(X2) where F : c sup 2 ) c d F ( ) is the Legendre-Fenchel con- F 7! 2M(X jugate of . This means R transport plan corresponds to the maximization of the classical linear optimal transport value over the ground-cost function, with a penalization on the ground-cost function. An important reformulation of this fact is the following: if R : M(X 2) ! R [ f+1g is l.s.c, c0 : X 2 ! R+ is continuous and » > 0, then: 2(;)Z 0 c2C(X2) Tc » inf c d + « R( ) = sup (;) « R c c0 ; where R is the Legendre-Fenchel dual of R. In particular, this formulation implies that regularizing OT boils down to maximizing classic OT with respect to the ground-cost function, with a regularization that somewhat forces the adversarial ground-cost function to be “close” to the original one c0, in the sense that R c c0 cannot be too large. The link between the regularization on the transport plan and the regularization on the ground-cost function is given by the Legendre-Fenchel transform of the regularizers.
2. We investigate the properties of optimal adversarial ground-cost functions: We prove, under some technical assumption (that is verified e.g. for entropic or quadratic regularizations of OT), a duality theorem for “regularized” OT: 2(;) ; 2C(X) Z Z d F () inf F ( ) = sup d + where : (x; y) 7! (x)+ (y). We use this duality result to show that under the same technical assumption, there exists an optimal adversarial ground-cost function that is separable, i.e. that there exists functions ; : X ! R such that is optimal in the maximization problem supc2C(X2 ) Tc( ; ) F (c). In some sense, this is disappointing since a separable ground-cost function does not yield an interesting geometry on the ground space that could help analyze the data.

Smooth and Strongly-Convex Nearest Brenier Potentials

This chapter is based on [Paty et al., 2020].
One of the greatest achievements of the OT literature in recent years lies in regularity theory: Caﬀarelli [2000] showed that the OT map between two well behaved measures is Lipschitz, or equivalently when considering 2-Wasserstein distances, that Brenier convex potentials (whose gradient yields an optimal map) are smooth. Instead of considering regularity as a property that can be proved under suitable assumptions, we consider in this chapter regularity as a condition that must be enforced when estimating OT: we propose algorithms that can recover optimal transport maps with small distortion that best approach the transport constraint, so that the associated Brenier potentials are strongly convex and smooth.
Related work Learning an optimal Monge map is crucial in some appli-cations [Courty et al., 2016, Schiebinger et al., 2019]. To do so, two main approaches have been considered in the literature:
• learn an optimal transport map from an optimal transport plan, by considering its barycentric projection [Courty et al., 2016]: although the obtained map is indeed an optimal transport map, it is only defined on the data and cannot generalize to new points;
• learn a transport map that is defined everywhere, by considering a parametric family [Perrot et al., 2016, Seguy et al., 2018]: in this case, such maps cannot be guaranteed to be optimal transport maps.
In order to reconciliate those two approaches, we turn to the optimal transport literature, which provides a rich theory about Monge maps: the regularity theory. This theory gives properties of the optimal Monge map pushing forward a measure onto with a small average cost. When that cost is the quadratic Euclidean distance, the Monge map is necessarily the gradient rf of a convex function f. This major result, known as the Brenier [1987] theorem, states that the OT problem between and is solved as soon as there exists a convex function f such that rf] = . Estimating a Monge map by forcing it to be the gradient of a convex potential will therefore guarantee its optimality. In that context, regularity in OT is usually understood as the property that the map rf is Lipschitz, a seminal result due to Caﬀarelli [2000] who proved that the Monge map can be guaranteed to be 1-Lipschitz when transporting a “fatter than Gaussian” measure / eV d towards a “thinner than Gaussian” measure / e W d (here d is the Gaussian measure on Rd, d / e 2 , and V; W are two convex potentials). Equivalently, this result can be stated as the fact that the Monge map is the gradient of a 1-smooth Brenier [1987] potential. Adding such kind of regularity constraints on the estimated Monge map will allow for generalization outside the data, by means of the results obtained in the convex interpolation literature [Taylor et al., 2017, Section 3].
Contributions The main contribution of this chapter is to translate the idea that the OT map between suﬃciently well-behaved distributions should be regular into an estimation procedure.
1. We define smooth and strongly convex nearest-Brenier (SSNB) potentials: Given two probability measures ; 2 P2(Rd), a L-smoooth and ‘-strongly convex function f such that rf] = may not always exist. We relax this equality and look instead for a smooth strongly convex function f that minimizes the Wasserstein distance between rf] and : min W2 (rf] ; ) : (SSNB) f is L-smooth, ‘-strongly convex
We call such potential nearest-Brenier because they provide the way to transport to a measure as close as possible to using a smooth and strongly convex potential.
2. We show that such potentials can be computed when measures are discrete: Even when ; are discrete probability measures, prob-lem (SSNB) is infinite-dimensional because of the minimization over the set of smooth and strongly convex functions. We show that this prob-lem can actually be rewritten as a finite-dimensional separately-convex optimization problem: at fixed potential f, the problem is a discrete OT problem; at fixed transport plan, the minimization over f is a con-vex quadratically-constrained quadratic program (QCQP), which can be solved using on-the-shelf solvers.
3. We remark that the one-dimensional case is easier to compute: In the univariate case, we show that computing the nearest-Brenier potential is equivalent to solving a variant of the isotonic regression problem in which the map must be strongly increasing and Lipschitz. A projected gradient descent approach can be used to solve this problem eﬃciently.
4. We show how to compute the SSNB potential and map on the whole space: We exploit the solutions to both these optimization problems to extend the Brenier potential and Monge map at any point. We show this can be achieved by solving a convex QCQP for each new point.

Grandes lignes et contributions de cette thèse

Le transport optimal date de la fin du XVIIIe siècle, lorsque le mathématicien français Monge [1781] proposa de résoudre le problème des déblais et des rem-blais. Cependant, la formulation mathématique de Monge atteignit rapidement ses limites en tant qu’il fut impossible de prouver l’existence d’une solution à son problème. Ce n’est qu’après cent cinquante ans que le transport optimal vécut une seconde jeunesse, lorsque le mathématicien russe Kantorovich [1942] com-prit quel serait le cadre adéquat qui permettrait l’étude du problème de Monge, donnant ainsi naissance à des outils fondamentaux en mathématiques pures et appliquées. Ces dernières années, le transport optimal a trouvé de nouvelles applications au sein des statistiques et de l’apprentissage automatique, en tant qu’outil d’analyse des données : en apprentissage supervisé [Frogner et al., 2015, Abadeh et al., 2015, Courty et al., 2016], en informatique graphique [Solomon et al., 2015, Bonneel et al., 2016], en imagerie [Rabin and Papadakis, 2015, Cuturi and Peyré, 2016], pour des modèles génératifs [Arjovsky et al., 2017, Salimans et al., 2018, Genevay et al., 2018], en biologie [Hashimoto et al., 2016, Schiebinger et al., 2019, Huizing et al., 2021] ou en traitement automatique du langage [Grave et al., 2019, Alaux et al., 2019].

Le fléau de la dimension en transport optimal

J’ai commencé ma thèse en septembre 2018. À ce moment-là, l’un des problèmes majeurs dans l’estimation des distances de Wasserstein était le fléau de la dimension qu’elles subissent. Il est connu depuis [Dudley, 1969] qu’étant données une mesure de probabilité 2 Pp(Rd) et sa version empirique ^n = 1 n xi n i=1 où x1; : : : ; xn sont des échantillons i.i.d. de , E[Wp(^n; )] = Pn 1=d .
Des bornes non-asymptotiques et des résultats de concentration ont aussi prouvés par Fournier and Guillin [2015]. Dans [Weed and Bach, 2019], dont la première version a été publiée en 2017, les auteurs prouvent des bornes asymptotiques améliorées lorsque le support de la mesure est une variété de faible dimension : dans ce cas, E[Wp(^n; )] = O n 1=k où k est la dimension de la variété supp( ). Bien que ce résultat semble résoudre la question du fléau de la dimension, on ne peut pas toujours supposer que la dimension de supp( ) est faible : même si elle est de dimension aussi faible que k = 10 dans un espace de dimension d = 106 (penser à des images), une vitesse de O n 1=10 W de reste très lente. Notons que l’inégalité triangulaire pour p nous permet prouver des bornes similaires sur la quantité E [jWp(^n; ^n) Wp( ; )j] où 2 Pp(Rd) et ^n est sa version empirique. Finalement, Genevay et al. [2019] ont montré au début de mes recherches doctorales que le transport optimal régularisé par l’entropie permettait eﬀectivement de réduire le fléau de la dimension : pour une fonction de coût lipschitzienne c et à force de régularisation > 0 fixée, E [jSc (^n; ^n) Sc ( ; )j] = O (1=pn). Cependant, ce résultat se détériore lorsque ! 0, et est valable quelles que soient les mesures et , tant qu’elles sont à supports compacts. Ainsi, les structures potentiellement faiblement dimensionnelles des mesures ne sont pas prises en compte par le transport optimal régularisé par l’entropie. La question de trouver des approches algorithmiques qui puissent réduire les eﬀets du fléau de la dimension, tout en tirant partie des structures de faible dimension cachées dans les mesures, était donc partiellement ouverte au début de mon doctorat.

Obtention de robustesse grâce aux projections

Une approche, cependant, exploitait déjà les structures de faible dimension dans les mesures : la distance de Wasserstein par tranches (WT) introduite par Rabin et al. [2011]. La raison principale qui a mené à la définition de la distance WT est algorithmique, puisque calculer la distance de Wasserstein entre les projections unidimensionnelles des deux mesures possède une complexité log-linéaire, bien meilleure que la complexité cubique du problème de transport optimal originel. Bien que cette approche améliore les aspects computationnels, il m’a semblé que projeter des mesures aux supports de grande dimension sur des lignes unidimensionnelles causerait des pertes importantes d’information sur les mesures initiales, ou du moins que ce choix algorithmique des lignes était un peu ad hoc. Pire encore, lorsque l’on moyenne les valeurs Wp P ] ; P ] sur la sphère 2 Sd 1, la plupart des projections ne discriminent que peu entre les mesures, polluant ainsi la moyenne globale. Il m’a donc paru clair que projeter les mesures, non plus sur des lignes mais sur des sous-espaces de faible dimension que nous choisirions de manière à optimiser un certain critère, créerait des opportunités de recherches novatrices. Dans l’analyse en composantes principales (ACP) [Pearson, 1901], les données sont projetées sur le sous-espace de faible dimension qui maximise la variance des points projetés. La traduction de ce fait dans le cadre des mesures de probabilité et du transport optimal s’est avérée être la clef pour définir des variantes robustes des distances de Wasserstein qui exploitent les structures de faible dimension dans les mesures.
Ce problème se réécrit naturellement comme la maximisation de la distance de Wasserstein par rapport à la fonction de coût. Le choix d’une fonction de coût lorsque l’on calcule le transport optimal est un problème important : comme remarqué par Genevay et al. [2018, Section 3], apprendre une fonction de coût de manière adverse s’avère crucial lorsque l’on calcule le transport optimal dans des espaces de grande dimension. Mais comment interpréter un tel choix ? Dans notre cas, le lien avec les sous-espaces de faible dimension rend l’interprétation aisée : le problème peut se réinterpréter naturellement comme un problème de transport optimal dans lequel l’objectif linéaire est remplacé par un objectif spectral convexe, à la manière de l’ACP où le sous-espace optimal est calculé à partir de la décomposition en sous-espaces propres de la matrice de covariance des données. Nous avons ainsi mis au jour un lien, quoique apparemment frêle, entre la maximisation des distances de Wasserstein par rapport à la fonction de coût d’une part, et la convexification de l’objectif classique du transport optimal d’autre part. J’ai ainsi décidé de poursuivre dans cette voie, et d’éclaircir ce lien, ce qui s’est avéré productif : toute convexification du coût traditionnel de transport optimal peut se réécrire grâce à la dualité de Legendre comme une maximisation du problème par rapport à la fonction de coût. Réciproquement, cela signifie donc que toute régularisation convexe du transport optimal, et en particulier le transport optimal régularisé par l’entropie, est robuste à la fonction de coût. Ce travail a apporté une nouvelle lumière sur les raisons qui font que la régularisation fonctionne en pratique : la régularisation implique un choix automatique d’une fonction de coût robuste, et inversement, maximiser par rapport à la fonction de coût comme dans [Genevay et al., 2018] correspond à une régularisation implicite du problème de transport optimal.

Table of contents :

1 Introduction
1.1 Background on optimal transport
1.2 Outline and contributions of this thesis
1.3 Grandes lignes et contributions de cette thèse
1.4 Notation
I Ground-Cost Robustness
2 Subspace Robust Wasserstein Distances
2.1 Introduction
2.2 Subspace Robust Wasserstein Distances
2.3 Geometry of Subspace Robust Distances
2.4 Computation
2.5 Experiments
2.6 Supplementary Results about Projection Robust Wasserstein Distances
3 Regularized Optimal Transport is Ground-Cost Adversarial
3.1 Introduction
3.2 Background and Notations
3.3 Ground-Cost Adversarial Optimal Transport
3.4 Examples
3.5 Characterization of the Adversarial Cost and Duality
3.6 Adversarial Ground-Cost for Several Measures
3.7 Algorithms
3.8 Experiments
II Regularity-Constrained Maps
4 Smooth and Strongly-Convex Nearest Brenier Potentials
4.1 Introduction
4.2 Regularity in Optimal Transport
4.3 Regularity as Regularization
4.4 One-dimensional Case and the Link with Constrained Isotonic Regression
4.5 Estimation of the Wasserstein Distance and Monge Map
4.6 Experiments
Conclusion
Bibliography