Robust Subgaussian Estimation of a Mean Vector in Nearly Linear Time

Get Complete Project Material File(s) Now! »

Thorough introduction on the robust mean vector estima-tion problem

Estimating the mean of a random variable in a d-dimensional space when given some of its realizations is arguably the oldest and most fundamental problem of statistics. In the past few years, it has received important attention from two communities: the statistics community, see for instance Catoni (2012); Minsker (2015); Chen et al. (2018); Catoni and Giulini (2017); Lugosi and Mendelson (2019c); Minsker (2018a); M. Lerasle and Lecu´ (2017); Hopkins (2018); Chera-panamjeri et al. (2019); Lei et al. (2020); Dalalyan and Minasyan (2020) and the computer science one, see Diakonikolas et al. (2016, 2019a, 2018a,b, 2019c); Cheng et al. (2019a); Diakonikolas and Kane (2019); Hopkins et al. (2020). Both communities consider the problem of robust mean estimation, focusing mainly on diﬀerent definitions of robustness.
In recent years, many eﬀorts have been made by the statistics community on the construction of estimators performing in a subgaussian way for heavy-tailed data. As seen in the introduction, such estimators achieve the same statistical properties as the empirical mean ¯ of ( • • • ), XN X1, , XN a N-sample of i.i.d. Gaussian variables N (µ, Σ) where µ ∈ Rd and Σ 0 is the covariance matrix. In that case, for a given confidence 1 − δ, the subgaussian rate as defined in Lugosi and Mendelson (2019c) is (up to an absolute multiplicative constant) rδ = Tr(Σ) + kΣkop log(1/δ) (3.1) where Tr(Σ) is the trace of Σ and kΣkop is the operator norm of Σ. Indeed, it follows from Borell-TIS’s inequality (see Theorem 7.1 in Ledoux (2001) or pages 56-57 in Ledoux and Talagrand (2011)) that with probability at least 1 − δ, N − 2 = kvk2≤1 N − ≤ E kvk2 ≤1 N − + q2 log(1 ) where σ = supkvk2 ≤1 E X¯N − µ, v 2 is the weak variance of the Gaussian process. It is that sup µ, v Tr(Σ)/N and σ = Σ /N, which straightforward to check E ≤ qk kop kvk2≤1 N − constant √2 on thep second term in (3.1)). In most of leads to the rate in (3.1) (up to the the recent works, the eﬀort has been made to achieve the rate rδ for i.i.d. heavy-tailed data even under the minimal requirement that the data only have a second moment. Under this second-moment assumption only, the empirical mean cannot1 achieve the rate (3.1) and one needs to consider other procedures. As, recalled in the general introduciton, over the years, some procedures have been proposed to achieve such a goal: it started with Catoni (2012) and Lerasle and Oliveira (2011), then, a Le Cam test estimator, called a tournament estimator in Lugosi and Mendelson (2019c), a minmax median-of-means estimator in M. Lerasle and Lecu´ (2017) and a PAC-Bayesian estimator in Catoni and Giulini (2017) were constructed. The constructions in Lerasle and Oliveira (2011); Lugosi and Mendelson (2019c); M. Lerasle and Lecu´ (2017) are based on the median-of-means principle, a technique that we will also use.
On the other side, the computer science (CS) community mostly considers a diﬀerent definition of robustness and targets a diﬀerent goal. In many recent CS papers, tractable algorithms (and not only theoretical estimators) have been constructed and proved to be robust with respect to adversarial contamination of the dataset seen in the general introduction. We recall now this adversarial contamination model together with the heavy-tailed setup which will serve as our unique assumption in this work.
Assumption 3.1. There exists N random vectors (Xi)i=1 in R mean µ and covariance matrix E(X˜i − µ)(X˜i − µ)> Σ where Σ is an unknown covariance ˜ N matrix. The N random vectors (Xi)i=1 are first given to an ”adversary” who is allowed to modify up to |O| of these vectors. This modification does not have to follow any rule. Then, the ”adversary” gives the modified dataset (Xi )Ni=1 to the statistician. Hence, the statistician receives an ”adversarially” contaminated dataset of N vectors in Rd which can be partitioned into two groups: the modified data (Xi)i∈O, which can be seen as outliers and the ”good data” or inliers ( ) such that ∀ ∈ I = ˜ . Of course, the statistician does not know which data has been Xi i∈I i , Xi Xi modified or not so that the partition O ∪ I = {1, . . . , N} is unknown to the statistician.
In the adversarial contamination model from Assumption 3.1, the set O can depend arbitrarily ˜ N on the initial data (Xi)i=1; the corrupted data (Xi)i∈O can have any arbitrary dependance structure; and the informative data (X ) i∈I may also be correlated (for instance, it is the case, ˜ i d in general, when the |O| data Xi with largest `2-norm are modified by the adversary). The computer science community looks at the problem of robust mean estimation from algorithmic perspectives such as the running time in this contamination model. A typical result in this line of research is Theorem 1.3 from Cheng et al. (2019a) that we recall now.
Theorem 3.1 (Theorem 1.3, Cheng et al. (2019a)). Let X1, . . . , XN be a data points in Rd following Assumption 3.1. We assume that the covariance matrix Σ of the inliers satisfies Σ σ2Id. We assume that = |O|/N is such that 0 < < 1/3 and N & d log(d)/ . There exists p 1Under only a second-moment assumption, the empirical mean achieves the rate Tr(Σ)/(δN) which can not be improved in general, see Catoni (2012) an algorithm running in ˜( ) poly( ) which outputs ˆ such that with probability at least 9 10,ONd/µ/ kµˆ − µk2 . σ√ .
The notation ˜( ) stands for the computational running time of an algorithm up to log( ) O N d N d factors. The first result proving the existence of a polynomial time algorithm robust to adversarial contamination may be found in Diakonikolas et al. (2016) and the first achieving such a result under only a second moment assumption may be found in Diakonikolas et al. (2017). Theorem 3.1 improves upon many existing results since it achieves the optimal information theoretic-lower bound with a (nearly) linear-time algorithm.
Finally, there are two recent papers for which both algorithmic and statistical considerations are important. In Hopkins (2018); Cherapanamjeri et al. (2019), algorithms achieving the subgaussian rate in (3.1) have been constructed. They both run in polynomial time: O(N24 +N d) for Hopkins (2018) and O(N4 + N2d) for Cherapanamjeri et al. (2019) (see Cherapanamjeri et al. (2019) for more details on these running times). They do not consider a contamination of the dataset even though their results easily extend to this setup. Some other estimators which have been proposed in the statistics literature are very fast to compute but they do not achieve the optimal subgaussian rate from (3.1). A typical example is Minsker’s geometric median estimator Minsker (2015) which achieves the rate ˜ Tr(Σ) log(1/δ)/N in linear time O(N d). All the later three papers use the median-of-means principle. We will also use this principle. What we mainly p borrow from the literature on MOM estimators is the advantage to work with local block means instead of the data themselves. We will identify two such advantages by doing so: a stochastic one and a computational one (see Remark 3.4 below for more details).
The aim of this work is to show that a single algorithm can answer the three problems: robustness to heavy-tailed data, to adversarial contamination and computational cost. Assump-tion 3.1 covers the two concepts of robustness considered in the statistics and computer science communities since the informative data (data indexed by I) are only assumed to have a second moment and there are |O| adversarial outliers in the dataset. Our aim is to show that the rate of convergence (3.1) which is the rate achieved by the empirical mean in the ideal i.i.d. Gaussian case can be achieved in the corrupted and heavy-tailed setup from Assumption 3.1 with a fast algorithm: we construct an algorithm running in time ˜( + log(1 ) ) which O N d u /δ d outputs an estimator of the true mean achieving the subgaussian rate (3.1) with confidence 1 − δ − (1/10)u (for exp(−c0N) ≤ δ ≤ exp(−c1|O|)) on a corrupted database and under a second moment assumption only. It is therefore robust to heavy-tailed data and to contamination. Our approach takes ideas from both communities: the median-of-means principle which has been recently used in the statistics community and a SDP relaxation from Cheng et al. (2019a) which can be theoretically computed fast. The baseline idea is to construct K equal size groups of data from the given ones and to compute their empirical means ¯ = 1 . These N Xk, k , . . . , K K empirical means are used successively to find a robust descent direction thanks to a SDP relaxation from Cheng et al. (2019a). We prove the robust subgaussian statistical property of the resulting descent algorithm under only the Assumption 3.1.
The chapter is organized as follows. In the next section, we give a high-level description of the algorithm and summarize its statistical and computation performance in our main result Theorem 3.2. We also clearly identify how it improves upon existing results on the same subject. In Section 3, we prove its statistical properties and give a precise definition of the algorithm. In Section 4, we study the statistical performance of the SDP relaxation at the heart of the descent direction. In Section 5, we fully characterize its computational cost. In Section 3.6, we construct a procedure achieving the same statistical properties and can automatically adapt to the number of outliers. This latter adaptive procedure is also proved to satisfy estimation results in expectation.
We will use the following notation [n] = {1, . . . , n} for any n ∈ N and `d2 stands for the Euclidean space Rd endowed with its canonical Euclidean norm k•k2 : x = (xj)dj=1 ∈ Rd → Pj x2j 1/2. A `d2-ball centered in x ∈ Rd with radius r > 0 is denoted by B2d(x, r), the `d2 unit ball is denoted by B2d and the `d2 unit sphere is denoted by S2d−1.

Construction of the algorithms and main result

The construction of our robust subgaussian descent procedure is using two ideas. The first one comes from the median-of- means (MOM) approach which has recently received a lot of attention in the statistical and machine learning communities, see for instance Bubeck et al. (2013); Lerasle and Oliveira (2011); Devroye et al. (2016); Minsker and Strawn (2017); Minsker (2015); Nemirovsky and Yudin (1983); Alon et al. (1999); Jerrum et al. (1986); Birg´e (1984). The MOM approach often yields robust estimation strategies (but usually at a high computational cost). Let us recall the general idea behind that approach already exposed in the introduction: we first randomly split the data into K equal-size blocks B1, . . . , BK (if K does not divide N, we just remove some data). We then compute the empirical mean within each block: for k = 1, . . . , K, ¯ 1 iX∈k | Bk | Xk = Xi where we set |Bk| = Card(Bk) = N/K. In the one-dimensional case, we then take the median of the latter K empirical means to construct a robust and subgaussian estimator of the mean (see Devroye et al. (2016)). It is more complicated in the multi-dimensional case, where there is no definitive equivalent of the one dimensional median but instead there are several candidates: coordinate-wise median, the geometric median (also known as Fermat point), the Tukey Median, among many others (see Small (1990)). The strength of this approach is the robustness of the median operator, which leads to good statistical properties even on corrupted databases. For the construction of our algorithm, we use the idea of grouping the data and compute iteratively some median of the bucketed means ¯ = 1 . Xk, k , . . . , K
In Cherapanamjeri et al. (2019), the authors propose to use these block means for a gradient descent algorithm: at the current point xc of the iterative algorithm, a ”robust descent direction” well aligned with xc −µ is constructed with high probability. Note that xc −EX is the best descent direction towards EX starting from xc; we can also re-write that as a matrix problem: a top eigenvector (i.e. an eigenvector associated with the largest singular value) of (EX − xc)(EX −xc )> is the optimal descent direction (xc − EX)/ kxc − EXk2. As a consequence, a top eigenvector of a solution to the optimization problem argmax M, (EX − xc)(EX − xc)> (3.2) M 0,Tr(M)=1 also yields the best descent direction we are looking for (note that A, B = Tr(A> B) is the inner product between two matrices A and B). Optimization problem (3.2) may be seen as a SDP relaxation for the problem of finding a top eigenvector and it is the reason why we go into SDP optimization techniques. Recently, this SDP relaxation has been bypassed thanks to the power method in Lei et al. (2020) whose aims is also to approximate a top eigenvector.
Of course, we don’t know ( X − xc)( X − xc)> in (3.2) but we are given a database of N E E data X1, . . . , XN (among which |I| of them have mean µ). We use these data to estimate in a robust way the unknown quantity ( X − xc)( E X − xc)> in (3.2). Ideally, we would like to E k∈K(X¯k − xc)(X¯k − xc)>, where identify the informative data and their block means (1/|K|) = k : B = , to estimate this quantity but this information is not available either. K { k ∅} P T O
To address this problem we use a tool introduced in Cheng et al. (2019a); Diakonikolas et al. ¯ (2016) adapted to the block means. The idea is to endow each block mean Xk with a weight ωk taken in ΔK defined as ΔK = ((ωk)kK=1 : 0 ≤ ωk ≤ 9K/10, . ωk = 1) X k=1
Ideally we would like to put 0 weights to all block means ¯ corrupted by outliers. But, we Xk cannot do it since K is unknown. To overcome this issue, we learn the optimal weights and consider the following minmax optimization problem max min K k( X¯ c)( X¯ ( xc ) M M, ω k − x k − x > . E 0,Tr(M)=1 w ΔK X c) ∈ k=1
This is the dual problem from Cheng et al. (2019a) adapted to the block means. The key insight from Cheng et al. (2019a) is that an approximate solution Mc of the maximization problem in (Exc ) can be obtained in a reasonable amount of time using a covering SDP approach Cheng et al. (2019a); Peng et al. (2012) (see Section 3.4). We expect a solution (in M) to (Exc ) to be close to a solution of the minimization problem in (3.2) – which is M∗ = (µ − xc)(µ − xc)>/ kµ − xck22 – and the same for their top eigenvectors (up to the sign). We note that in order to find a good descent direction the authors of Cherapanamjeri et al. (2019) also use a (diﬀerent) SDP relaxation. Theirs costs O(N4 + N d) to be computed.
At a high level description, the robust descent algorithm we perform outputs µˆK after at most log d iterations of the form xc − θcv1 where v1 is a top eigenvector of an approximate solution Mc to the problem (Exc ) and θc is a step size. It starts at the coordinate-wise median of the bucketed means ¯ ¯ . In Algorithm 4, we define precisely the step size and the stopping criteria X1,…,XK we use to define the algorithm (it requires too much notation to be defined at this stage). This algorithm outputs the vector µˆK whose running time and statistical performance are gathered in the following result.
Theorem 3.2. Grant Assumption 3.1. Let K ∈ {1, . . . , N} be the number of equal-size blocks and assume that K ≥ 300|O|. Let u ∈ N∗ be a parameter of the covering SDP used at each descent step. With probability at least 1 − exp(−K/180000) − (1/10)u, the descent algorithm ˜ finishes in time O(N d + Kud) and outputs µˆK such that k µˆK µ k 2 ≤ 808 1200sTr(Σ) + s1200 kΣkop K . − N N
To make the presentation of the proof of Theorem 3.2 as simple as possible we did not optimize the constants (better constants have been obtained in Catoni (2012); Catoni and Giulini (2017)). Theorem 3.2 generalizes and improves Theorem 3.1 in several ways. We first improve the confidence from a constant “9/10” to an exponentially large confidence 1 − exp(−c0K) (when u ∼ K), which was a major technical challenge (note however that the confidence 9/10 in Cheng et al. (2019a) can be increased to any desired confidence at the expense of deteriorating the rate of convergence – see footnote of page 2 in Cheng et al. (2019a)). We obtain the result for any covariance structure Σ and µˆK does not require the knowledge of Σ for its construction. We obtain a result which holds for any N (even in the case where N ≤ d). The construction of µˆK does not require the knowledge of the exact proportion of outliers in the dataset, but it requires an upper bound in the number of outlier, so that we can chose K & |O|. Moreover, using a Lepskii adaptation method Lepski˘ı (1991, 1990) it is also possible to automatically choose K and therefore to adapt to the proportion of outliers if we have some extra knowledge on Tr(Σ) and kΣkop (see Section 3.6 for more details). Moreover, if we only care about constant 9/10 confidence, our runtime does not depend on and is nearly-linear ˜( ). We also refer the O N d reader to Corollary 3.2 for more comparison with Theorem 3.1.
Remark 3.1 (Nearly-linear time). We identify two important situations where the algorithm from Theorem 3.2 runs in nearly-linear time, that is, in time ˜( ). First, when the number √ O√N d of outliers is known to be less than N, we can choose K ≤ N and u = K. In that case, the algorithm runs in time ˜( ) and the subgaussian rate is achieved with probability at least O N d 1 − 2 exp(−c0K) for some constant c0 (see also Corollary 3.3 for an adaptive to K version of this result). Another widely investigated situation is when we only want to have a constant confidence like 9/10 as it is the case in the CS community such as in Theorem 3.1. In that case, one may choose u = 1 and any values of K ∈ [N] can be chosen (so we can have any number of outliers up to a N/300) to achieve the rate in Theorem 3.2 with constant probability and in nearly-linear time ˜( ) (see also Corollary 3.2 for an adaptive to version of this result). Finally, it is possible O N d K to get a subgaussian estimator for the all range of K ∈ [N] which is also robust to adversarial outliers up to a constant fraction of N when we take u = K. In that case, the running time is ˜ 2 ˜ 2 ˜ O(N d + K d) which is at worst O(N d). So algorithm outputs µˆK in time between O(N d) and ˜ 2 d) depending on the number of outliers and the probability deviation certifying the result O(N we want.
Theorem 3.2 improves the result from Hopkins (2018); Cherapanamjeri et al. (2019) since µˆK runs faster than the polynomial times O(N24 + N d) and O(N4 + N d) in Hopkins (2018) and Cherapanamjeri et al. (2019). The algorithm µˆK also does not require the knowledge of Tr(Σ) and kΣkop. Finally, Theorem 3.2 provides running time guarantees on the algorithm unlike in Lugosi and Mendelson (2019c); M. Lerasle and Lecu´ (2017); Catoni and Giulini (2017) and it improves upon the statistical performance from Minsker (2015). The main technical novelty lies in Proposition 3.1, necessary to improve analysis from Cheng et al. (2019a) toward exponentially large confidence 1 − exp(−c0K). Proposition 3.1 may be of independent interest. Theorem 3.2 ˜ 6 ) and the constant probability also improves the running time in Cheng et al. (2019a) O(N d/ deviation (see Theorem 3.1 for more details) – both probability estimates and computational time have been improved by using bucketed means in place of the data themselves (see Remark 3.4 below for more details). The computational time improvement from Theorem 3.2 upon the one in Cherapanamjeri et al. (2019) is due to the use of covering SDP Allen-Zhu et al. (2014); Peng et al. (2012); Cheng et al. (2019a) at each iteration of the robust gradient descent algorithm. Very recent works Lei et al. (2020); Hopkins et al. (2020); Depersin (2020b) obtain similar results to the one of Theorem 3.2. They were also able to replace SDPs by spectral methods for the computations of a robust descent direction at each step. Even though cover SDPs are from a theoretical point of view computationally eﬃcient, see Allen-Zhu et al. (2014); Peng et al. (2012), they are notoriously diﬃcult to implement in practice whereas the power methods used in Lei et al. (2020); Hopkins et al. (2020); Depersin (2020b) open the door to implementable algorithms. For more references on robust mean estimation, we refer the reader to the survey from Diakonikolas and Kane (2019).

READ Modular framework for knowledge management

Table of contents :

1 Introduction: Robustness and Complexity
1.1 What is Robust Statistics?
1.2 High-dimensional Robustness
1.3 Statistical complexities
1.4 Main contributions
1.5 Unanswered Questions and Future Research Direction
2 Introduction en franc¸ais : Robustesse et complexit´e
2.1 Qu’est-ce que la statistique robuste?
2.2 Robustesse en haute dimension
2.3 Contributions principales
3 Robust Subgaussian Estimation of a Mean Vector in Nearly Linear Time
3.1 Thorough introduction on the robust mean vector estimation problem
3.2 Construction of the algorithms and main result
3.3 Proof of the statistical performance in Theorem 3.2
3.4 Approximately solving the SDP (Exc)
3.5 The final algorithm and its computational cost: proof of Theorem 3.2
3.6 Adaptive choice of K and results in expectation
4 A Spectral Algorithm for Robust Regression with Subgaussian Rates
4.1 Introduction
4.2 Assumptions and preliminary stochastic results
4.3 Analysis of the algorithm
4.4 Experiments
4.5 Conclusion
4.6 Proofs
4.7 Appendix
5 Robust subgaussian estimation with VC-dimension
5.1 Introduction
5.2 Warm-up: MOM principle, VC-dimension and mean estimation
5.3 Sparse setting and other estimation tasks
5.4 An algorithm to improve risk bounds
5.5 Conclusion: concurrent work and discussion
5.6 Main Proofs
6 On the robustness to adversarial corruption and to heavy-tailed data of the Stahel-Donoho median of means
6.1 Introduction
6.2 The Gaussian case
6.3 The L2 case and beyond
6.4 Estimation of using MOMAD
6.5 Study of the HN,K,v, v 2 Sd−1
2 functions
6.6 Conclusion
6.7 Proofs
7 Optimal robust mean and location estimation via convex programs with respect to any pseudo-norms
7.1 Introduction
7.2 Deviation minimax rates in the Gaussian case: benchmark subgaussian rates for the mean estimation w.r.t. k·kS
7.3 Convex programs
7.4 Proofs . .