Estimation of the convex or polytopal support of a uniform density

Get Complete Project Material File(s) Now! »

Definitions and Notation

Set-valued estimators

An important part of statistics deals with estimation. From a sample of observations, say, X , one may expect to be able to understand better the underlying mechanism, which created that sample. This mechanism may be Nature, or some intricate machinery. For instance, take X = {X1, . . . , Xn}, where Xi is a binary variable giving the sex of the i-th newborn from a sample of n newborns. Then Xi can be – maybe artificially – seen as the realization of a random variable, whose distribution is Bernoulli, with some parameter pi ∈ (0, 1). A common modeling of this situation consists in assuming that the parameters pi take one common value for all i, say p, and that the random variables X1, . . . , Xn are mutually independent. In other words, the observations are assumed to be the realization of one and the same phenomenon repeated n times. Given this modeling, the only unknown quantity is the value of the parameter p. The statistician would like to use the observations X1, . . . , Xn, in order to have a precise idea of the value of that number p. One way to do this is to estimate p. An estimator of p is a random variable depending on the observations. More precisely, in this example, an estimator can be written as Sn(X1, . . . , Xn), where Sn is a measurable function from {0, 1}n onto (0, 1).
Let us now be more general. Consider a finite sample X , consisting of n observations, each of which belongs to a given measurable set E. Assume that these observations are realizations of random variables, whose distribution depends on one unknown quantity θ ∈ Θ, where Θ is a measurable set. An estimator of the unknown θ is a random variable of the form Sn(X1, . . . , Xn), where Sn : En −→ Θ is a measurable function. Therefore, constructing an estimator requires a structure on the set which contains the unknown quantity of interest. In this thesis, we deal with estimation of sets and, more precisely, of convex bodies. Hence, we need to allow Θ to be the class of all convex bodies in Rd, where d is a given positive integer. However, there is no natural way to provide a non trivial measurability structure on this class. A particular definition of set-valued random variables is given in [Mol05]. For a given topological space E, consider the class F of all closed subsets of E. The Eﬀros σ-algebra on F is defined as the σ-algebra generated by {F ∈ F : F ∩ K = ∅}, where K runs through the family of all compact subsets of E. Denote the Eﬀros σ-algebra by B(F). Let (Ω, F, P) be a probability space. A map X : Ω −→ F is measurable with respect to F and B(F) if, and only if, for every compact subset K of E, {ω∈Ω:X(ω)∩K=∅}∈F. (1.2)
If the latest condition is satisfied, the map X is a random variable, valued in F, and is called a random set. Note that this definition restricts the random sets to be closed subsets of E, but it is not too restrictive for our purposes, since we only consider compact sets. However, Condition (1.2) may be too diﬃcult to check, sometimes, and it is not very useful for our purposes. Indeed, this definition allows one to extend the usual properties of random variables to random sets. For example, the expectation of a random set is defined in [Mol05]. The random sets that we consider are estimators, and we are interested in their pointwise accuracy, defined as their distance to their target. This distance is real-valued, and what is important to us is that it is a random variable. Indeed, this would allow us to compute its expectation, its probability deviations, etc.
We can bypass Condition (1.2) by defining a set-valued estimator, as a set-valued mapping, such that its accuracy, measured with a given distance, is a random variable. This will be the case most of the time. In order to make this even more simple, we prefer to consider outer probabilities, in order to avoid measurability conditions.

Notation and first definitions

Before going further in this introduction, we need some notation. In the whole thesis, d ≥ 1 denotes a positive integer, and the ambient space is the Euclidean space Rd.
General notation and definitions Denote by ρd – or simply ρ, when there is no ambiguity – the Euclidean distance in Rd, by B2d the Euclidean unit closed ball in Rd, and by βd its volume. Denote by B2d(a, r) the Euclidean ball centered at the point a ∈ Rd and of radius r ≥ 0. If x ∈ Rd and G ⊆ Rd, we denote by ρd(x, G) the Euclidean distance from x to the set G, i.e., ρd(x, G) = inf ρd(x, y). y∈G
We use the convention that the infimum on the empty set is infinite ⊆ d ¯ ˚ ¯\˚ If G R , we denote by G its closure, G its interior and ∂G its boundary, i.e., ∂G = G G.
If G is a closed subset of Rd and ǫ is a positive number, we denote by Gǫ the closed ǫ-neighborhood of G, i.e., the set of all x ∈ Rd such that ρ(x, G) ≤ ǫ or, in other words, Gǫ = G + ǫB2d. If G is any set, 1(• ∈ G) stands for the indicator function of G.
The Lebesgue measure in Rd is denoted by | • |d. For brevity, we omit the dependence on d when there is no ambiguity.
For two sets G1 and G2, their symmetric diﬀerence G1△G2 is defined as (G1\G2)∪(G2\G1). The Nikodym pseudo-distance between two measurable subsets G1 and G2 of Rd is defined as the Lebesgue measure of their symmetric diﬀerence, namely |G1△G2|d. We will often refer to this pseudo-distance as the Nikodym distance.
The Hausdorﬀ distance between two subsets G1 and G2 of Rd is denoted by dH (G1, G2), and is defined as dH (G1, G2) = inf{ǫ > 0 : G1 ⊆ Gǫ2, G2 ⊆ Gǫ1}.
Equivalently, one has: dH (G1, G2) = max inf ρ(x, G2), inf ρ(y, G1) . x∈G1 y∈G2
If x ∈ R, we denote by ⌊x⌋ its integer part, i.e., the greatest integer smaller or equal to x. For 1 ≤ p < ∞, and E a measurable subset of Rd, denote by Lp(E) the set of real valued p the and measurable functions f , such that E |f | < ∞. The integral is defined with respect to 1 p Lebesgue measure. The Lp(E)-norm is denoted by • p, and is defined as f p = |f |p .
If f is a bounded function, we set f ∞ = esssup |f (x)|. If f is a real-valued function, define x∈E f + = max(f, 0).
Geometric notation and definitions Let S be a subset of Rd. Its convex hull, denoted by CH(S), is the smallest convex set containing S. Equivalently, it is the intersection of all convex sets which contain S. Its aﬃne hull is the intersection of all aﬃne subspaces which contain S.
A convex set is said k-dimensional if its aﬃne hull has dimension k (k ∈ N). A convex body in Rd is a compact and convex subset of Rd with positive volume. We denote by Kd the class of all convex bodies in Rd, and by Kd(1) the class of those convex bodies that are included in [0, 1]d. Note that a convex set in Rd has a positive volume if and only if it is d-dimensional. Thus, considering convex sets of positive volume is equivalent to considering real d-dimensional convex sets.
A convex body K ∈ Kd is said to have a smooth boundary if and only if its boundary ∂K is k times continuously diﬀerentiable – in the sense of diﬀerential submanifolds, see [Sch93a, Section 2.5] -, for some integer k ≥ 2.
A convex polytope is the convex hull of a finite set. If P = CH(S) is a convex polytope, S being a finite subset of Rd, the vertices of P are those x ∈ S such that CH(S\{x}) = P . By polytope, we will always mean convex polytope.
For an integer r ≥ d + 1, we denote by Pr the class of all convex polytopes in Rd with at most r vertices and with positive volume, and by Pr(1) the class of those P ∈ Pr such that P ⊆ [0, 1]d. We define Pr,n(1) as the class of all polytopes in Pr(1) with vertices whose coordinates are integer multiples of 1/n. We denote by P = r≥d+1 Pr the class of all convex polytopes in Rd, and by P(1) = P(1) the class of all convex polytopes in [0, 1]d. We will also use the r≥d+1 r notation P∞(1) for the class Kd(1). Note that P(1) P∞(1).
A polytope is the convex hull of finitely many points. The dual form of this definition states that a polytope is a bounded intersection of finitely many closed halfspaces. A polytope can be described either by its vertices – a polytope is the set of convex combinations of its vertices – or as intersection of halfspaces – a polytope is a set of points satisfying a finite number of linear inequalities -. The boundary of a polytope is entirely determined by these halfspaces, and its structure can be viewed from a combinatorial point of view.
Definition 1.1. ”Supporting hyperplane”
Let K ∈ Kd and let H an aﬃne hyperplane of Rd. We call H a supporting hyperplane of K if and only if H ∩ K = ∅ and there exists a non zero vector e, orthogonal to H, such that for all t > 0, (H + te) ∩ K = ∅.
Let x ∈ ∂K. A supporting hyperplane of K at x is a supporting hyperplane of K which contains the point x.
Note that the supporting hyperplane of a convex body at a point of its boundary need not be unique.
Definition 1.2. Let P be a polytope and k ∈ {0, . . . , d − 1} be an integer. A face of P is the intersection of P with a supporting hyperplane. A k-face is a face whose aﬃne hull has dimension k. The vertices of P are exactly the 0-faces of P . The 1-faces of P are called the edges of P . The (d − 1)-faces of P are called the facets of P .
There are some universal relations between the numbers of k-faces of a polytope. The most simple one is Euler’s formula. Let P be any d-dimensional polytope. Let us denote by fk the number of k-faces of P , for k ∈ {0, . . . , d − 1}. When d = 2, it is clear that f0 = f1. If d = 3, Euler’s formula states that: f0 − f1 + f2 = 2. (1.3)
This formula still holds in any finite dimension (see [BM71] for one of the first proof of this identity, after several unsuccessful attempts by other mathematicians in the nineteen century): (−1)kfk = 1 + (−1)d−1. (1.4) k=0
Other combinatorial equations exist, in specific cases, e.g. Dehn-Sommerville equations for simplicial polytopes. We do not provide the details, because we do not consider such cases in this thesis. We refer to [Bro83, Zie95] for a detailed account.
In this thesis, we deal with the classes Pr, r ≥ d + 1, i.e., the polytopes in Rd with known value of f0 – or, to be more precise, with known upper bound on f0 -. In dimension 2, the knowledge of f0 determines completely the value of f1: f0 = f1. In higher dimensions, this is not true anymore, and a given value of f0 may be compatible with diﬀerent values of the fk’s, k = 1, . . . , d − 1. We are interested in upper bounds on fk, i.e., in controlling the possible values of fk, given f0 ≤ r. In particular, for k = d − 1, it will be useful in Section 2.2.1 to know if fd−1 can be bounded from above, uniformly on the class Pr. The answer is positive, and it is given by Mc Mullen’s bound. In order to be more specific, let us denote by fk(P ) the number of k-faces of a given polytope P ∈ P. The question addressed in [McM70] consists in maximizing fk(P ) on the class Pr, i.e., finding the maximal number – if not infinite – of k-faces of a d-dimensional polytope having r vertices. Let fk(r, d) be the number of k-faces of a cyclic polytope with r vertices. A cyclic polytope with r vertices is the convex hull of r distinct points on the moment curve {(t, t2, . . . , td) : t ∈ R}. Gr¨unbaum [Gru67, Section 4.7] proved that the combinatorial structure of a cyclic polytope with r vertices does not depend on the choice of the vertices on the moment curve. The upper bound conjecture, proved by McMullen in [McM70], states that max fk(P ) = fk(r, d), ∀1 ≤ k < d < r. (1.5) P ∈Pr
This equation states that among all d-dimensional polytopes with r vertices, the cyclic polytopes are those with the largest number of k-faces, for all k = 1, . . . , d − 1. The number of k-faces of a cyclic polytope with r vertices is kr for k = 0, . . . , ⌊d/2⌋, and is given by the Dehn-Sommerville formula for the other values of k, see [Bro83, Zie95].
Probabilistic and statistical notation and definitions Let (A, B) be a measurable space, and C a class of sets in Rd. If {PG : G ∈ C} is a family of probability measures on (A, B), we denote by EG the expectation operator associated with PG, and by VG the corresponding variance operator, for G ∈ C. We keep the same notation PG and EG for the corresponding outer probability and expectation when necessary, to avoid measurability issues.
If f is a density, we denote by Pf and Ef the corresponding probability measure and expec-tation operator.
Let P be a probability measure and E the corresponding expectation operator. For any pos-itive integer n, we denote by P⊗n and E⊗n the n-product of P and its corresponding expectation operator. In particular, if X1, . . . , Xn are n independent, identically distributed (i.i.d.) random variables in Rd with probability distribution P, then P⊗n is the probability distribution of the vector (X1, . . . , Xn). When there is no ambiguity, we may omit the superscript ⊗n.
Let P and Q be two probability measures defined on the same measurable space. Let p and q be their respective densities with respect to a common σ-finite dominating measure ν. The Hellinger distance between P and Q is denoted by H(P, Q), and is defined as H(P, Q) = (√p − √q)2dν 1/2 , Note that this definition does not depends on the choice of the dominating measure ν.

READ Algebraic Hierarchy

The minimax setup

Let (Ω, F, P) be a probability space, and (A, B) a measurable space. Let C a class of measurable subsets of Rd, and {PG : G ∈ C} a family of probability measures on (A, B). Let n be a positive integer, and Z1, . . . , Zn be i.i.d. random variables defined on (Ω, F), taking values in (A, B). Assume that the common probability distribution of Z1, . . . , Zn is PG, for some G ∈ C. This means that for all B ∈ B, P[Z1 ∈ B] = P [{ω ∈ Ω : Z1(ω) ∈ B}] = PG[B]. The vector (Z1, . . . , Zn) is interpreted as the observation of n independent realizations of a random variable Z, having PG as its probability distribution, and the set G is unknown to the statistician. Let Gn be an estimator of G, i.e., a set-valued function of (Z1, . . . , Zn).
We measure the accuracy of Gn in a minimax framework, and we will always use the Nikodym distance. The pointwise error of Gn is its distance to the target, namely |Gn△G|. Its pointwise G | Gˆ n△ G | . The uniform risk, or simply risk is the expectation of its pointwise error, i.e., E⊗n the risk of Gn on the class C, is defined as Rn(Gn; C) = sup EG[|G△Gn|].
The rate – a sequence depending on n – of the estimator Gn on the class is the speed at which its risk on C converges to zero, when the number n of available observations tends to infinity. For all estimators defined in the sequel we will be interested in upper bounds on their risk, in order to get information about their rates.
The minimax risk on C, when n observations are available, is defined as Rn(C) = ˆ Rn ˆ n C ), inf (G ; Gn where the infimum is taken over all set estimators depending on n observations. If Rn(C) converges to zero, we call a minimax rate of convergence on the class C the speed at which Rn(C) tends to zero, i.e., a positive sequence ψn converging to zero and such that the positive sequence ψn−1Rn(C) is both bounded from above, and bounded from below by a positive constant.
It is interesting to provide a lower bound for Rn(C): By definition, no estimator can achieve a better rate on C than that of the lower bound. This bound gives also information on how close the risk of a given estimator is to the minimax risk. If the rate of the upper bound on the risk of an estimator matches the rate of the lower bound on the minimax risk on the class C, then the estimator is said to have the minimax rate of convergence on this class.
Similarly to the minimax risk, we define the weighted minimax risk as follows. Let C be a class of measurable subsets of R d ˆ , of positive volume. For an estimator Gn, we define its weighted risk on the class C as Qn (Gˆ n ; C ) = sup E G |G△Gn| , G C | G | and the minimax weighted risk on the class C, when n observations are available, is defined as Qn(C) = ˆ Qn ˆ n C ). inf (G ; Gn
Let us mention an alternative to this minimax setup. The minimax risk on a class C is defined for each number n of available observations. The supremum is taken over all G ∈ C, and therefore G is allowed to depend on n. In particular, usual methods for proving lower bounds on the minimax risk are based on hypotheses testing, and the hypotheses often depend on n in the proofs. This might seem unrealistic, since adding new observations should not change the unknown distribution measure from which they are generated. The alternative to this minimax setup consists in defining individual lower rates of convergence on a class C. A sequence of positive numbers (an) is called an individual lower rate of convergence on the class C if and only EG G Gn inf sup limsup > 0, (1.6)
a ˆ n→∞ n (Gn)n∈N G C where the infimum is taken over all sequences (Gn)n∈N of estimators. This alternative is pro-posed in [GKKW02]. It will not be much discussed in this thesis, in which we prefer to focus on the classical minimax setup. This preference will be motivated when we compare estimation and testing, in Section 3.5.

Adaptation and misspecification

In our minimax setup, the target set G that we aim to estimate is seen as a member of a class C, and the risk of estimation is defined uniformly on this class. This class can be very big.
Assume it can be divided into several subclasses: C= Cτ, τ ∈T where T is a set of parameters. This decomposition may be interesting, for example, if the subclasses Cτ are more simple than the whole class C, i.e., much smaller. If so, the minimax rate on a subclass Cτ might be much smaller than that on the whole class. The target set G belongs to Cτ for some τ ∈ T. If the value of τ is known, a natural approach consists in constructing an estimator which knows the value of τ , and in bounding from above its risk on the class Cτ . However, the value of τ may be unknown. As a consequence, the estimation procedure cannot take into account the value of this parameter. Yet, we might be interested in achieving or, at least, approaching, the minimax rate Rn(Cτ ), by constructing an estimator which does not depend on τ . Such an estimator is said to be adaptive with respect to τ , or to adapt to τ . It would have exactly, or approximately, the minimax rate of convergence, simultaneously on all subclasses Cτ , for τ ranging in T.
Let Gn be an estimator, based on n random observations, of the target set G belonging to a class C = τ ∈T Cτ . For τ ∈ T, let ψn,τ be the minimax rate of convergence on Cτ . The estimator Gn is said to be adaptive minimax with respect to τ if and only if ≤ −1RˆC≤ c1 sup ψn,τ n(Gn; τ ) c2 τ ∈T for n large enough, where c1 and c2 are two positive constants. It is not always possible to construct an adaptive minimax estimator. In particular, not knowing in advance the value of τ may unavoidably lead to a loss in the rate of convergence. Such loss often occurs when one aims to estimate a function at a given point, e.g. a density or a regression function, whose smoothness is characterized by the parameter τ (see [BL92] for instance).
In this thesis, we will do adaptation when estimating a polytope in Pr – or in Pr(1) – without knowing the true value of r, i.e., the number of vertices of the target polytope. In this case, C = P = P or C = P(1) = P(1), and T = {d + 1, d + 2, . . .}. r≥d+1 r r≥d+1 r
Misspecification Let us now add a point, denoted by ∞, to the set T, and consider a new class C∞, which is not contained in C. Denote by C′ = C ∪ C∞ = τ ∈T∪{∞} Cτ . Assume now that the target to be estimated belongs to this bigger class C′, and not necessarily to C. This situation corresponds to possible misspecification of the model. Adaptive estimation with respect to τ ∈ T ∪ {∞} allows one to bypass misspecification, if we are able to construct an estimator which adapts to τ ∈ T ∪ {∞}, and which is minimax simultaneously on all subclasses Cτ , τ ∈ T, and, on the class C∞.
Typically, in this thesis, T is the set of all integers greater than the dimension d, and Cτ = Pτ or Pτ(1), while C∞ = Kd or Kd(1): The class C∞ contains convex bodies which are not polytopes. The class C is Kd itself, or Kd(1). In this framework, misspecification consists in mistakenly believing that the target – a convex body – is a polytope. As we will see, there is a big gap between the minimax risks on the classes of polytopes with given number of vertices, and on the class of all convex bodies.

The statistical models

The density support model (DS) The density support model that we consider in this thesis consists of the observation of a sample of n i.i.d. random variables Xi, i = 1, . . . , n, with uniform distribution on some compact subset G of Rd. In this setup, PG stands for the uniform probability measure on G.
The regression model (RM) Consider the following regression model: Yi = 1(Xi ∈ G) + ξi, i = 1, . . . , n, where G is an unknown set in [0, 1]d. The set of points Xi, i = 1, . . . , n is called the design, and it is observed. Unless we mention otherwise, the design is assumed to be i.i.d. uniformly distributed in the hypercube [0, 1]d. The errors ξi are i.i.d. random variables, independent of the design. In other words, some points in [0, 1]d are labeled, with label 1 if the point belongs to some set G and 0 otherwise. Instead of observing directly the correct labels, one has access to their noisy version. From these observations, one wishes to recover G as accurately as possible. This model can be interpreted as a partial and noisy observation of an image, [0, 1]d, in which some unknown object G is to be recovered. If there is no noise, i.e., ξi = 0, i = 1, . . . , n, the observed image has only black and white pixels, the black ones belonging to G. If the signal is noisy, the image has diﬀerent levels of gray. In this setup, PG stands for the probability measure associated to the random couple (X1, Y1) of (1.3).
We assume, along this thesis, that the errors ξi are i.i.d. and subgaussian, i.e., we make the following assumption.
Assumption A. The random variables ξi, i = 1, . . . , n, are i.i.d. and satisfy the following exponential inequality.

Table of contents :

1 Introduction
1.1 Why set estimation ?
1.2 Definitions and Notation
1.2.1 Set-valued estimators
1.2.2 Notation and first definitions
1.2.3 The minimax setup
1.2.4 Adaptation and misspecification
1.3 The statistical models
2 Estimation of the convex or polytopal support of a uniform density
2.1 Random polytopes
2.1.1 The wet part and the floating convex body
2.1.2 Convergence of the random convex hull
2.1.3 Asymptotic facial structure
2.2 Minimax estimation
2.2.1 Estimation of polytopes
2.2.2 Estimation of convex bodies
2.2.3 Adaptation and misspecification
2.3 Further results and open problems
2.3.1 A universal deviation inequality for the random convex hull
2.3.2 Moments of the number vertices of the convex hull estimator
2.3.3 Efron’s identity revisited
2.3.4 Non uniform distributions
2.3.5 Estimation of log-concave densities
2.4 Proofs
2.5 Appendix to Chapter 2
2.5.1 Lemmata
2.5.2 Proofs of Lemmata and others
3 A nonparametric regression model
3.1 Estimation of polytopes
3.1.1 Upper bound
3.1.2 Lower bound
3.2 Estimation of convex bodies
3.3 Adaptation and misspecification
3.4 Discussion
3.5 The one-dimensional case and the change point problem
3.5.1 Detection of a segment
3.5.2 Estimation of a segment
3.5.3 Conclusion and discussion
3.6 Proofs
3.7 Appendix to Chapter 3
3.7.1 Lemmata
3.7.2 Proof of the lemmata
4 Estimation of functionals
4.1 The density support model
4.1.1 Estimation of the volume of a convex body
4.1.2 Estimation of the area of a triangle
4.1.3 Estimation of intrinsic volumes
4.2 The regression model
5 Conclusion, contributions