Extending Gossip Algorithms to Estimation of U-statistics 

Decentralized estimation of U-statistics

Decentralized estimation has many applications in sensor and peer-to-peer networks, as well as in extracting knowledge from massive information graphs such as interlinked Web documents and online social media. Algorithms running on such networks must often operate under tight constraints: the nodes forming the network cannot rely on a centralized entity for communication and synchronization, may not be aware of the global network topology, and/or may have limited resources (computational power, memory, energy). Gossip algorithms (TSITSIKLIS, 1984; SHAH, 2009; DIMAKIS et al., 2010), in which each node exchanges information with at most one of its neighbors at a time, have emerged as a simple yet powerful technique for distributed computation in such settings.
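
To make this communication model concrete, the following minimal Python sketch simulates asynchronous pairwise gossip averaging: at each step, a random node wakes up, contacts one random neighbor, and the two replace their local values with their pairwise average. On a connected graph, every local value converges to the network-wide mean. This is only an illustration under simplifying assumptions (scalar values, a toy ring topology, hypothetical names), not an algorithm taken from the thesis.

    import random

    def gossip_average(values, neighbors, num_iters=10000, seed=0):
        # values: one local scalar per node; neighbors: adjacency lists.
        rng = random.Random(seed)
        x = list(values)
        n = len(x)
        for _ in range(num_iters):
            i = rng.randrange(n)               # a node wakes up...
            j = rng.choice(neighbors[i])       # ...and contacts one neighbor
            x[i] = x[j] = (x[i] + x[j]) / 2.0  # both keep the pairwise average
        return x                               # each entry approaches mean(values)

    # Toy usage: 4 nodes arranged in a ring
    vals = [0.0, 1.0, 2.0, 3.0]
    ring = [[1, 3], [0, 2], [1, 3], [0, 2]]
    print(gossip_average(vals, ring))          # all entries close to 1.5

Since each pairwise average preserves the sum of the local values, the common limit can only be the initial mean; connectivity of the graph is what guarantees that mixing actually occurs.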

Decentralized optimization for pairwise functions

Decentralized optimization is particularly well suited to address the challenges posed by the advent of big data and the “Internet of Things”. For instance, in large-scale machine learning, one aims to find a model that minimizes a loss function over a massive dataset distributed across several machines in a commodity cluster or cloud computing platform. Other prominent applications come from wired and wireless networks, where local agents must coordinate in order to minimize a global objective function. As in decentralized estimation, common strategies for solving such optimization problems rely on gossip algorithms. These algorithms have attracted a lot of interest due to their simplicity and their ability to operate in peer-to-peer networks, where centralized coordination may be prohibitively expensive or even unavailable.
One of the flagship problems in decentralized optimization is to find a parameter vector $\theta$ which minimizes an empirical risk expressed as an average of convex functions, $(1/n) \sum_{i=1}^{n} f(\theta; x_i)$, where the data point $x_i$ is only known to agent $i$. Various gossip algorithms based on (sub)gradient descent (NEDIĆ and OZDAGLAR, 2009; JOHANSSON, RABI, and JOHANSSON, 2010; RAM, NEDIĆ, and VEERAVALLI, 2010; BIANCHI and JAKUBOWICZ, 2013), ADMM (WEI and OZDAGLAR, 2012; WEI and OZDAGLAR, 2013; IUTZELER et al., 2013) and dual averaging (DUCHI, AGARWAL, and WAINWRIGHT, 2012; YUAN et al., 2012; LEE, NEDIĆ, and RAGINSKY, 2015; TSIANOS, LAWLOR, and RABBAT, 2015) have been proposed to solve this problem, possibly with additional constraints or regularization terms. In these methods, each agent seeks to minimize its local function by applying local updates (e.g., gradient steps) while exchanging information with its neighbors to ensure convergence to a consensus value.
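
As a rough illustration of this template (a gossip averaging step followed by a local gradient step), the Python sketch below minimizes $(1/n) \sum_{i=1}^{n} f(\theta; x_i)$ for the toy choice $f(\theta; x_i) = \frac{1}{2}(\theta - x_i)^2$, whose global minimizer is the mean of the data. It is a simplified caricature of the methods cited above, not a faithful implementation of any of them; the function names and the step-size schedule are illustrative assumptions.

    import random

    def gossip_subgradient(x_data, neighbors, num_iters=20000, seed=0):
        # Toy decentralized minimization of (1/n) * sum_i 0.5 * (theta - x_i)**2;
        # agent i only ever reads its own data point x_data[i].
        rng = random.Random(seed)
        n = len(x_data)
        theta = [0.0] * n                              # one local iterate per agent
        for t in range(1, num_iters + 1):
            i = rng.randrange(n)
            j = rng.choice(neighbors[i])
            avg = (theta[i] + theta[j]) / 2.0          # gossip (consensus) step
            step = 1.0 / t                             # decreasing step size
            theta[i] = avg - step * (avg - x_data[i])  # local gradient step at i
            theta[j] = avg - step * (avg - x_data[j])  # local gradient step at j
        return theta                                   # entries approach mean(x_data)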

Model Selection Based on Incomplete U-Statistics

Automatic selection of model complexity is a crucial issue in machine learning: examples include choosing the number of clusters in cluster analysis (see CLÉMENÇON, 2014) or the number of possible values taken by a piecewise constant scoring function in multipartite ranking (cf. CLÉMENÇON and VAYATIS, 2009). In the present situation, this boils down to choosing the adequate level of complexity of the class of kernels $\mathcal{H}$, measured for simplicity through its (assumed finite) VC dimension, in order to minimize the (theoretical) risk of the empirical minimizer. The purpose of this subsection is to show that the incomplete U-statistic (3.19) can be used to define a penalization method for selecting a prediction rule with nearly minimal risk, avoiding procedures based on data splitting/resampling and extending the celebrated structural risk minimization principle (see VAPNIK, 1999). Let $\mathcal{H}$ be the collection of all symmetric kernels on $\mathcal{X} = \prod_{k=1}^{K} \mathcal{X}_k^{d_k}$ and set $\mu^{\star} = \inf_{H \in \mathcal{H}} \mu(H)$. Let $\mathcal{H}_1, \mathcal{H}_2, \ldots$ be a sequence of uniformly bounded major subclasses of $\mathcal{H}$, of increasing complexity (VC dimension). For any $m \geq 1$, let $V_m$ denote the VC dimension of the class $\mathcal{H}_m$ and set $M_{\mathcal{H}_m} = \sup_{(H, x) \in \mathcal{H}_m \times \mathcal{X}} |H(x)| < +\infty$. We suppose that there exists $M < +\infty$ such that $\sup_{m \geq 1} M_{\mathcal{H}_m} \leq M$. Given $1 \leq B \leq |\Lambda|$ and $m \geq 1$, the complexity-penalized empirical risk of a solution $\widetilde{U}_{B,m}$ of the ERM problem (3.22) with $\mathcal{H} = \mathcal{H}_m$ combines its incomplete empirical risk with a penalty term that increases with the VC dimension $V_m$.
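
For intuition on the quantity involved, the following Python sketch computes an incomplete U-statistic in the basic one-sample pairwise case: instead of averaging a symmetric kernel $H$ over all $\binom{n}{2}$ pairs, it averages over $B$ pairs drawn uniformly at random, which is the Monte Carlo approximation underlying the penalization scheme discussed above. The kernel and data below are placeholders chosen for illustration.

    import random

    def incomplete_u_statistic(H, data, B, seed=0):
        # Approximates the complete pairwise U-statistic
        #   U_n(H) = (2 / (n * (n - 1))) * sum_{i < j} H(x_i, x_j)
        # by averaging H over B pairs of distinct indices drawn at random.
        rng = random.Random(seed)
        n = len(data)
        total = 0.0
        for _ in range(B):
            i, j = rng.sample(range(n), 2)   # one pair of distinct indices
            total += H(data[i], data[j])
        return total / B

    # Toy usage with a placeholder kernel (squared difference of scalars)
    data = [0.2, 1.5, 3.0, 4.4, 5.1]
    print(incomplete_u_statistic(lambda a, b: (a - b) ** 2, data, B=1000))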

Table of contents:

Abstract
Thesis summary (in French)
1 Summary in French
1.1 U-statistics sampling
1.1.1 Incomplete U-statistics
1.1.2 Numerical experiments
1.2 Gossip protocols
1.2.1 Background
1.2.2 Time model
1.2.3 Graph Laplacian
1.3 Decentralized estimation of a U-statistic
1.3.1 The GOSTA algorithms
Synchronous setting
Asynchronous setting
1.3.2 Experiments
1.4 Decentralized optimization for pairwise functions
1.4.1 Problem statement
1.4.2 Gossip dual averaging for pairwise functions
Synchronous setting
Asynchronous setting
1.4.3 Numerical experiments
1.5 Conclusion
2 Introduction 
2.1 U-statistics sampling
2.1.1 Incomplete U-statistics
2.1.2 Application to Stochastic Gradient Descent
2.1.3 Numerical Experiments
2.2 Gossip protocols
2.2.1 Background
2.2.2 Clock modelling
2.2.3 Graph Laplacian
2.3 Decentralized estimation of U-statistics
2.3.1 GOSTA Algorithms
Synchronous Setting
Asynchronous Setting
2.3.2 Experiments
2.4 Decentralized optimization for pairwise functions
2.4.1 Problem Statement
2.4.2 Pairwise gossip dual averaging
Synchronous setting
Asynchronous Setting
2.4.3 Numerical experiments
2.5 Conclusion
3 Scaling-up Empirical Risk Minimization: Optimization of incomplete U-statistics 
3.1 Introduction
3.2 Background and Preliminaries
3.2.1 U-Statistics/Processes: Definitions and Properties
3.2.2 Motivating Examples
Clustering
Metric Learning
Multipartite Ranking
3.2.3 Empirical Minimization of U-Statistics
3.3 Empirical Minimization of Incomplete U-Statistics
3.3.1 Uniform Approximation of Generalized U-Statistics
3.3.2 Model Selection Based on Incomplete U-Statistics
3.3.3 Fast Rates for ERM of Incomplete U-Statistics
3.3.4 Alternative Sampling Schemes
3.4 Application to Stochastic Gradient Descent
3.5 Numerical Experiments
3.5.1 Metric Learning
One-Time Sampling
Stochastic Gradient Descent
3.5.2 Model Selection in Clustering
3.6 Conclusion
3.7 Proofs
3.7.1 Proof of Proposition 2
3.7.2 Proof of Theorem 11
3.7.3 Proof of Corollary 1
3.7.4 Proof of Theorem 12
3.7.5 Proof of Theorem 13
3.7.6 Proof of Theorem 14
3.7.7 Proof of Proposition
4 Extending Gossip Algorithms to Estimation of U-statistics 
4.1 Introduction
4.2 Background
4.2.1 Definitions and Notations
4.2.2 Problem Statement
4.3 Related Work
4.3.1 Sample mean estimation
4.3.2 U-statistics estimation
4.4 GOSTA Algorithms
4.4.1 Synchronous Setting
4.4.2 Asynchronous Setting
4.5 Experiments
4.5.1 Comparison to U2-GOSSIP
AUC measure
Within-cluster point scatter
4.5.2 Comparison to Baseline Methods
4.6 Conclusion
4.7 Proofs
4.7.1 Preliminary Results
4.7.2 Convergence Proofs for GOSTA
Proof of Theorem 18 (Asynchronous Setting)
4.7.3 U2-gossip Algorithm
5 Gossip Dual Averaging for Decentralized Optimization of Pairwise  Functions 
5.1 Introduction
5.2 Notations and problem statement
5.2.1 Definitions and Notation
5.2.2 Problem Statement
5.3 Centralized Dual Averaging
5.3.1 Deterministic Setting
5.3.2 Stochastic Dual Averaging
5.3.3 Ergodic dual averaging
Problem setting
Convergence analysis
5.4 Decentralized Dual Averaging
5.5 Pairwise Gossip Dual Averaging
5.5.1 Synchronous Setting
5.5.2 Asynchronous Setting
5.6 Extension to Multiple Points per Node
5.7 Numerical Simulations
5.8 Conclusion
5.9 Proofs
5.9.1 Ergodic dual averaging
Error after mixing (Lemma 12)
Consecutive iterates bound (Lemma 13)
Gap with noisy objectives (Lemma 14)
5.9.2 Synchronous Pairwise Gossip Dual Averaging
5.9.3 Asynchronous Pairwise Dual Averaging
Conclusion
Bibliography
