On Anomaly Ranking and Excess-Mass Curves 


Extreme Values Analysis through STDF estimation

This section is a summary of Chapter 6, which is based on previous work published in Goix et al. (2015b).
Recall that scoring functions are built by approximating density level sets / minimum volume sets of the underlying 'normal' density. As mentioned previously, for anomaly detection purposes we are interested in being accurate on level sets corresponding to high quantiles, namely with level $t$ close to $0$, or equivalently on minimum volume sets with mass constraint close to $1$. In the univariate case, suppose we want to consider the $(1-p)$-th quantile of the distribution $F$ of a random variable $X$, for a given exceedance probability $p$, that is $x_p = \inf\{x \in \mathbb{R} : \mathbb{P}(X > x) \le p\}$. For moderate values of $p$, a natural empirical estimate is $x_{p,n} = \inf\{x \in \mathbb{R} : \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}_{X_i > x} \le p\}$. However, if $p$ is very small, the finite sample $X_1, \ldots, X_n$ carries insufficient information and $x_{p,n}$ becomes irrelevant. In the multivariate case, this problem translates into learning density level sets with very low level, or equivalently scoring functions inducing such level sets. Extreme value theory specifically addresses such issues, in the one-dimensional as well as in the multi-dimensional setup.
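To make this concrete, here is a minimal Python sketch (the sample and function below are purely illustrative) contrasting the empirical quantile estimate $x_{p,n}$ for a moderate $p$ with its degenerate behaviour when $p$ is far below $1/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)          # illustrative i.i.d. sample, n = 1000

def empirical_tail_quantile(sample, p):
    """Empirical estimate x_{p,n} = inf{x : (1/n) #{X_i > x} <= p}."""
    x_sorted = np.sort(sample)
    n = len(sample)
    j = int(np.ceil(n * (1 - p)))      # smallest rank j such that (n - j)/n <= p
    return x_sorted[max(j, 1) - 1]

print(empirical_tail_quantile(X, 0.05))    # moderate p: sensible 95% quantile estimate
print(empirical_tail_quantile(X, 1e-6))    # p << 1/n: degenerates to the sample maximum
```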
Preliminaries. Extreme Value Theory (EVT) develops models for learning the unusual rather than the usual. These models are widely used in fields involving risk management like finance, insurance, telecommunication or environmental sciences. One major application of EVT is to provide a reasonable assessment of the probability of occurrence of rare events.
To illustrate this point, suppose we want to manage the risk of a portfolio containing $d$ different assets, $\mathbf{X} = (X_1, \ldots, X_d)$. A fairly general purpose is then to evaluate the probability of events of the kind $\{X_1 \ge x_1 \text{ or } \ldots \text{ or } X_d \ge x_d\}$, for large multivariate thresholds $\mathbf{x} = (x_1, \ldots, x_d)$. Under not too stringent conditions on the regularity of $\mathbf{X}$'s distribution, EVT shows that for large enough thresholds, $\mathbb{P}\{X_1 \ge x_1 \text{ or } \ldots \text{ or } X_d \ge x_d\} \simeq l(p_1, \ldots, p_d)$, where $l$ is the stable tail dependence function (STDF) and the $p_j$'s are the marginal exceedance probabilities, $p_j = \mathbb{P}(X_j \ge x_j)$. The functional $l$ characterizes the dependence among extremes.
The joint distribution (over large thresholds) can thus be recovered from the knowledge of the marginal distributions together with the STDF $l$. In practice, $l$ can be learned from 'moderately extreme' data, typically the $k$ 'largest' ones among a sample of size $n$, with $k \ll n$.
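As an illustration, the following sketch implements the classical rank-based empirical STDF (in the spirit of Huang (1992)), built from the $k$ largest observations of each margin; the function name and the simulated data are purely illustrative.

```python
import numpy as np

def empirical_stdf(X, x, k):
    """Rank-based empirical STDF l_n(x), the standard nonparametric estimator
    (see e.g. Huang (1992)). X: (n, d) sample, x: (d,) evaluation point,
    k: number of 'moderately extreme' observations used (k << n)."""
    n, d = X.shape
    # ranks[i, j] = rank of X[i, j] within column j, from 1 (smallest) to n (largest)
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    # observation i is counted if at least one margin j is among its k * x_j largest values
    exceed = ranks > n - k * np.asarray(x)        # boolean array of shape (n, d)
    return exceed.any(axis=1).sum() / k

# Illustrative usage on simulated data with independent margins,
# for which the true STDF is l(x1, x2) = x1 + x2.
rng = np.random.default_rng(0)
X = rng.pareto(2.0, size=(10_000, 2))
print(empirical_stdf(X, x=(1.0, 1.0), k=200))     # should be close to 2
```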
Recovering the $p_j$'s can be done following a well-paved way: in the univariate case, EVT essentially consists in modeling the distribution of the maxima (resp. the upper tail) as a generalized extreme value distribution, namely an element of the Gumbel, Fréchet or Weibull parametric families (resp. by a generalized Pareto distribution). In contrast, in the multivariate case, there is no finite-dimensional parametrization of the dependence structure. The latter being characterized by the STDF, estimating this functional is one of the main issues in multivariate EVT. Asymptotic properties of the empirical STDF have been widely studied, see Huang (1992); Drees & Huang (1998); Embrechts et al. (2000); De Haan & Ferreira (2007) for the bivariate case, and Qi (1997); Einmahl et al. (2012) for the general multivariate case under smoothness assumptions.
However, no bounds exist on the finite-sample error. The contribution summarized in the next section and published in Goix et al. (2015b) derives such non-asymptotic bounds. Our results do not require any assumption other than the existence of the STDF.

Learning the dependence structure of rare events. Classical VC inequalities aim at bounding the deviation of empirical from population quantities over relatively simple classes of sets, called VC classes. These classes typically cover the support of the underlying distribution. However, when dealing with rare events, it is of great interest to have such bounds on a class of sets which covers only a small probability region and thus contains (very) few observations. This yields sharper bounds, since only differences between very small quantities are involved. The starting point of this analysis is a VC-type inequality adapted to the study of low probability regions, stated and proved in Chapter 6.
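To fix ideas, the classical relative-deviation VC bound already conveys why restricting attention to a region of probability at most $p$ sharpens the rate; the inequality established in Chapter 6 is a refined, two-sided version adapted to this setting, and the display below is only the classical bound, up to constants. With probability at least $1-\delta$, simultaneously for all $A \in \mathcal{A}$ with $\mathbb{P}(A) \le p$,

$$\mathbb{P}(A) - \mathbb{P}_n(A) \;\le\; 2\,\sqrt{\,p\;\frac{V_{\mathcal{A}}\log\frac{2en}{V_{\mathcal{A}}} + \log\frac{4}{\delta}}{n}\,},$$

where $V_{\mathcal{A}}$ is the VC dimension of the class $\mathcal{A}$ and $\mathbb{P}_n$ the empirical measure: the deviation scales with $\sqrt{p}$ instead of being uniform over the whole support.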

Sparse Representation of Multivariate Extremes

This section is a summary of Chapter 7, which is based on previous work published in Goix et al. (2016c) and on its extended version, Goix et al. (2016b), currently under review.
EVT has been intensively used in anomaly detection in the one-dimensional situation, see for instance Roberts (1999, 2000); Clifton et al. (2011, 2008); Lee & Roberts (2008). In the multivariate setup, however, there is –to the best of our knowledge– no anomaly detection method relying on multivariate EVT. Until now, the multidimensional case has only been tackled by means of extreme value statistics based on univariate EVT. The major reason is the difficulty to scale up existing multivariate EVT models with the dimensionality. In the present work we bridge the gap between the practice of anomaly detection and multivariate EVT by proposing a method which is able to learn a sparse ‘normal profile’ of multivariate extremes and, as such, may be implemented to improve the accuracy of any usual anomaly detection algorithm.
Context: multivariate extreme values in large dimension. Parametric or semi-parametric estimation of the structure of multivariate extremes is relatively well documented in the statistical literature, see e.g. Coles & Tawn (1991); Fougères et al. (2009); Cooley et al. (2010); Sabourin & Naveau (2014) and the references therein. However, restrictive structural assumptions have to be made, stipulating e.g. that only some pre-defined subgroups of components may be concomitantly extreme, or, on the contrary, that all of them must be. In addition, their practical use is restricted to moderate-dimensional problems (say, $d \le 10$); otherwise, simplifying modeling choices are needed, as e.g. in Stephenson (2009). Finally, uncertainty assessment concerning the output of these models is made under the hypothesis that the training set is 'asymptotic', in the sense that one assumes that, above a fixed high threshold, the data are exactly distributed according to the limit distribution of extremes. In other words, the modeling error is ignored.
Non-parametric estimation of the angular measure has only been treated in the two-dimensional case, in Einmahl et al. (2001); Einmahl & Segers (2009), in an asymptotic framework. Here we extend the non-asymptotic study on STDF estimation (previous section) to the angular measure of extremes, restricted to a well-chosen class of sets corresponding to lower-dimensional regions of the space. The objective is to learn a representation of the angular measure that is rough enough to control the variance in high dimension and, at the same time, accurate enough to gain information about the 'probable directions' of extremes. This yields a first nonparametric estimate of the angular measure in any dimension, restricted to a class of subcones, with a non-asymptotic bound on the error. Note incidentally that this method can also be used as a preprocessing stage, for dimensionality reduction purposes, before proceeding with a parametric or semi-parametric estimation which could benefit from the structural information obtained in the first step. Such applications are beyond the scope of this thesis.
The framework we develop is non-parametric and lies at the intersection of support estimation, density estimation and dimensionality reduction: it consists in learning from training data the support of a distribution, decomposed into subcones, each hopefully of low dimension and carrying some mass, thereby defining empirical versions of the probability measures of specific extreme regions. It produces a scoring function that is defined and especially accurate on extreme regions, which can thus be exploited to detect anomalies among extremes. Thanks to its moderate complexity, of order $dn \log n$, this algorithm is suitable for the treatment of real-world large-scale learning problems, and experimental results reveal a significantly increased performance on extreme regions compared with standard anomaly detection approaches.
In a wide range of situations, one may expect the occurrence of two phenomena:
1- Only a 'small' number of groups of components may be concomitantly extreme, so that only a 'small' number of hyper-cubes (those corresponding precisely to these subsets of indices) have non-zero mass ('small' is relative to the total number of groups, $2^d$).
2- Each of these groups contains a limited number of coordinates (compared to the original dimensionality), so that the corresponding hyper-cubes with non-zero mass have small dimension compared to $d$.
The main purpose of this work is to introduce a data-driven methodology for identifying such faces, so as to reduce the dimensionality of the problem and thus to learn a sparse representation of extreme behaviors. In case hypothesis 2- is not fulfilled, such a sparse 'profile' can still be learned, but loses the low-dimensional property of its supporting hyper-cubes. One major issue is that real data generally do not concentrate on sub-spaces of zero Lebesgue measure. This is circumvented by setting to zero any coordinate below some small threshold $\epsilon > 0$, so that the corresponding 'angle' is assigned to a lower-dimensional face.
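To illustrate the mechanism, here is a minimal Python sketch of the face-assignment and mass-estimation step just described (the standardization, the tolerance $\epsilon$ and the function name are illustrative; the complete procedure, the DAMEX algorithm, is detailed in Chapter 7):

```python
import numpy as np
from collections import Counter

def sparse_extreme_profile(X, k, eps=0.1):
    """Sketch of the face (subcone) mass estimation described above.
    X: (n, d) raw data; k: number of extreme points retained;
    eps: tolerance below which a coordinate of a normalized extreme is set to zero."""
    n, d = X.shape
    # 1. Standardize margins to (approximately) unit Pareto via ranks.
    ranks = X.argsort(axis=0).argsort(axis=0) + 1          # 1 .. n per column
    V = n / (n + 1.0 - ranks)                              # large value <=> large rank
    # 2. Keep the k points with the largest sup-norm (the 'extremes').
    norms = V.max(axis=1)
    threshold = np.sort(norms)[-k]
    extremes = V[norms >= threshold]
    # 3. Assign each extreme to a face: the set of coordinates whose
    #    normalized magnitude exceeds eps (the others are truncated to zero).
    angles = extremes / extremes.max(axis=1, keepdims=True)
    faces = [tuple(np.flatnonzero(a > eps)) for a in angles]
    # 4. Empirical mass of each face = fraction of extremes assigned to it.
    counts = Counter(faces)
    return {face: c / len(extremes) for face, c in counts.items()}

# Illustrative usage: with independent heavy-tailed margins, most extremes
# should concentrate on the one-dimensional faces {0}, {1}, {2}.
rng = np.random.default_rng(0)
X = rng.pareto(1.0, size=(5000, 3))
print(sparse_extreme_profile(X, k=250))
```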


Heuristic approaches

The two contributions in this section are heuristic in nature and not yet supported by statistically sound theoretical results. Although this ongoing work has not been published yet and will certainly be completed in the near future, we believe that it has its place in this manuscript, given the numerous convincing numerical experiments we carried out and the rationale behind the approaches we promote. These two contributions address two major challenges in anomaly detection:
How to evaluate unsupervised anomaly detection in practice?
How to grow random forests with only one class available?
The first point has been partially addressed in Section 1.3 with the MV and EM curves. However, these two criteria were originally introduced to build scoring functions via Empirical Risk Minimization (ERM), and no study has been made of their use for evaluating scoring functions, as the ROC and PR criteria do. Besides, using them to measure the quality of a scoring function $s_n$ involves computing the Lebesgue measure $\mathrm{Leb}(s_n \ge u)$, which is very challenging in high-dimensional frameworks.
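For instance, a direct Monte Carlo approach to $\mathrm{Leb}(s_n \ge u)$, sketched below with illustrative names, requires sampling uniformly over a hyper-rectangle enclosing the data, which is exactly what becomes unreliable as the dimension grows:

```python
import numpy as np

def lebesgue_level_volume(score, X, u, n_mc=100_000, rng=None):
    """Monte Carlo estimate of Leb({x : score(x) >= u}) restricted to the
    bounding hyper-rectangle of the data. In high dimension, almost no
    uniform draw falls inside the level set, so the estimate degrades."""
    rng = rng or np.random.default_rng(0)
    lo, hi = X.min(axis=0), X.max(axis=0)
    volume_box = np.prod(hi - lo)
    U = rng.uniform(lo, hi, size=(n_mc, X.shape[1]))   # uniform draws in the box
    frac = (score(U) >= u).mean()                      # fraction falling in the level set
    return frac * volume_box

# Illustrative usage with a toy scoring function (higher score = more 'normal').
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 2))
score = lambda x: -np.linalg.norm(x, axis=1)
print(lebesgue_level_volume(score, X, u=-1.0))         # roughly the area of the unit disk
```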
The two proposed approaches are heuristic-based, and no theoretical guarantees such as consistency or convergence rates are derived. However, extensive benchmarks show the relevance of these approaches.

Table of contents:

List of Contributions
List of Figures
List of Tables
1 Summary 
1.1 Introduction
1.2 Anomaly Detection, Anomaly Ranking and Scoring Functions
1.3 M-estimation and criteria for scoring functions
1.3.1 Minimum Volume sets
1.3.2 Mass-Volume curve
1.3.3 The Excess-Mass criterion
1.4 Accuracy on extreme regions
1.4.1 Extreme Values Analysis through STDF estimation
1.4.2 Sparse Representation of Multivariate Extremes
1.5 Heuristic approaches
1.5.1 Evaluation of anomaly detection algorithms
1.5.2 One-Class Random Forests
1.6 Scikit-learn contributions
1.7 Conclusion and Scientific Output
I Preliminaries
2 Concentration Inequalities from the Method of bounded differences 
2.1 Two fundamental results
2.1.1 Preliminary definitions
2.1.2 Inequality for Bounded Random Variables
2.1.3 Bernstein-type Inequality (with variance term)
2.2 Popular Inequalities
2.3 Connections with Statistical Learning and VC theory
2.4 Sharper VC-bounds through a Bernstein-type inequality
3 Extreme Value Theory
3.1 Univariate Extreme Value Theory
3.2 Extension to the Multivariate framework
4 Background on classical Anomaly Detection algorithms 
4.1 What is Anomaly Detection?
4.2 Three efficient Anomaly Detection Algorithms
4.2.1 One-class SVM
4.2.2 Local Outlier Factor algorithm
4.2.3 Isolation Forest
4.3 Examples through scikit-learn
4.3.1 What is scikit-learn?
4.3.2 LOF examples
4.3.3 Isolation Forest examples
4.3.4 Comparison examples
5 On Anomaly Ranking and Excess-Mass Curves 
5.1 Introduction
5.2 Background and related work
5.3 The Excess-Mass curve
5.4 A general approach to learn a scoring function
5.5 Extensions – Further results
5.5.1 Distributions with non compact support
5.5.2 Bias analysis
5.6 Simulation examples
5.7 Detailed Proofs
6 Learning the dependence structure of rare events: a non-asymptotic study 
6.1 Introduction
6.2 Background on the stable tail dependence function
6.3 A VC-type inequality adapted to the study of low probability regions
6.4 A bound on the STDF
6.5 Discussion
7 Sparse Representation of Multivariate Extremes 
7.1 Introduction
7.1.1 Context: multivariate extreme values in large dimension
7.1.2 Application to Anomaly Detection
7.2 Multivariate EVT Framework and Problem Statement
7.2.1 Statement of the Statistical Problem
7.2.2 Regularity Assumptions
7.3 A non-parametric estimator of the subcones’ mass: definition and preliminary results
7.3.1 A natural empirical version of the exponent measure $\mu$
7.3.2 Accounting for the non-asymptotic nature of data: $\epsilon$-thickening
7.3.3 Preliminaries: uniform approximation over a VC-class of rectangles
7.3.4 Bounding empirical deviations over thickened rectangles
7.3.5 Bounding the bias induced by thickened rectangles
7.3.6 Main result
7.4 Application to Anomaly Detection
7.4.1 Extremes and Anomaly Detection.
7.4.2 DAMEX Algorithm: Detecting Anomalies among Multivariate Extremes
7.5 Experimental results
7.5.1 Recovering the support of the dependence structure of generated data
7.5.2 Sparse structure of extremes (wave data)
7.5.3 Application to Anomaly Detection on real-world data sets
7.6 Conclusion
7.7 Technical proofs
7.7.1 Proof of Lemma 7.5
7.7.2 Proof of Lemma 7.6
7.7.3 Proof of Proposition 7.8
7.7.4 Proof of Lemma 7.10
7.7.5 Proof of Remark 7.14
8 How to Evaluate the Quality of Anomaly Detection Algorithms? 
8.1 Introduction
8.2 Mass-Volume and Excess-Mass based criteria
8.2.1 Preliminaries
8.2.2 Numerical unsupervised criteria
8.3 Scaling with dimension
8.4 Benchmarks
8.4.1 Datasets description
8.4.2 Results
8.5 Conclusion
8.6 Further material on the experiments
9 One Class Splitting Criteria for Random Forests
9.1 Introduction
9.2 Background on decision trees
9.3 Adaptation to the one-class setting
9.3.1 One-class splitting criterion
9.3.2 Prediction: a majority vote with one single candidate?
9.3.3 OneClassRF: a Generic One-Class Random Forest algorithm
9.4 Benchmarks
9.4.1 Default parameters of OneClassRF.
9.4.2 Hyper-Parameters of tested algorithms
9.4.3 Description of the datasets
9.4.4 Results
9.5 Theoretical justification for the one-class splitting criterion
9.5.1 Underlying model
9.5.2 Adaptive approach
9.6 Conclusion
9.7 Further details on benchmarks and unsupervised results
10 Conclusion, limitations & perspectives 
11 Résumé des contributions en Français 
11.1 Introduction
11.2 Détection d’anomalies, ranking d’anomalies et fonctions de scores
11.3 M-estimation et critères de performance pour les fonctions de scores
11.3.1 Ensembles à volume minimal
11.3.2 La courbe Masse-Volume
11.3.3 Le critère d’excès de masse
11.4 Précision sur les régions extrêmes
11.4.1 Analyse du point de vue de la théorie des valeurs extrêmes par  l’estimation de la STDF
11.4.2 Représentation parcimonieuse des extrêmes multivariés
11.5 Approches heuristiques
11.5.1 Évaluer un algorithme de détection d’anomalies
11.5.2 Forêts aléatoires à une classe
11.6 Contributions sur scikit-learn
11.7 Conclusion and production scientifique
Bibliography
