Random projections and Johnson-Lindenstrauss lemma


Johnson-Lindenstrauss lemma

One of the main motivations for the development of random projections is the so-called Johnson-Lindenstrauss lemma (JLL), established by William B. Johnson and Joram Lindenstrauss in their 1984 seminal paper [14]. The lemma asserts that any finite subset of $\mathbb{R}^m$ can be embedded into a low-dimensional space $\mathbb{R}^k$ ($k \ll m$), whilst keeping the Euclidean distances between any two points of the set almost the same. Formally, it is stated as follows:
Theorem 2.1.1 (Johnson-Lindenstrauss Lemma [14]). Let $\varepsilon \in (0,1)$ and let $A = \{a_1, \ldots, a_n\}$ be a set of $n$ points in $\mathbb{R}^m$. Then there exists a mapping $T : \mathbb{R}^m \to \mathbb{R}^k$, where $k = O(\varepsilon^{-2}\log n)$, such that
$$(1-\varepsilon)\|a_i - a_j\|^2 \le \|T(a_i) - T(a_j)\|^2 \le (1+\varepsilon)\|a_i - a_j\|^2 \qquad (2.1)$$
for all $1 \le i, j \le n$.
To see why this theorem is important, imagine we have a billion points in $\mathbb{R}^m$. According to the JLL, we can compress these points by projecting them into $\mathbb{R}^k$, with $k = O(\varepsilon^{-2}\log n) \approx O(20\,\varepsilon^{-2})$, since $\log(10^9) \approx 20$. For reasonable choices of the error $\varepsilon$, the projected dimension $k$ might be much smaller than $m$ (for example, when $m = 10^6$ and $\varepsilon = 0.01$). The effect is even more significant for larger instances, mostly due to the slow growth of the logarithm.
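To get a feel for the numbers, the following short Python sketch (added for illustration, not from the thesis) tabulates admissible projected dimensions $k$ for a few values of $n$ and $\varepsilon$. It instantiates the $O(\varepsilon^{-2}\log n)$ bound with one concrete explicit constant, namely $k \ge 4\ln n / (\varepsilon^2/2 - \varepsilon^3/3)$ from the Dasgupta-Gupta proof of the JLL; other constants are possible.

```python
import math

def jl_dimension(n_points: int, eps: float) -> int:
    """Projected dimension sufficient for the JLL with distortion eps,
    using the explicit constant from the Dasgupta-Gupta proof:
    k >= 4 * ln(n) / (eps^2/2 - eps^3/3)."""
    return math.ceil(4.0 * math.log(n_points) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))

if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 6, 10 ** 9):
        for eps in (0.5, 0.2, 0.1):
            print(f"n = {n:>10}, eps = {eps:4.2f}  ->  k = {jl_dimension(n, eps):>8}")
    # Note: k depends only on n and eps, never on the original dimension m.
```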
Note that in the JLL, the magnitude of the projected dimension $k$ depends only on the number of data points $n$ and a predetermined error $\varepsilon$, but not on the original dimension $m$. Therefore, the JLL is most meaningful in "big-data" settings, i.e. when $m$ and $n$ are huge. In contrast, if $m$ is small, then small values of $\varepsilon$ will result in dimensions $k$ that are larger than $m$, in which case applying the JLL is not useful.
The existence of the map $T$ in the JLL is shown by probabilistic methods. In particular, $T$ is drawn from some well-structured class of random maps, in such a way that one can prove that the inequalities (2.1) hold for all $1 \le i, j \le n$ with some positive probability. This can be done if the random map $T$ satisfies, for all $x \in \mathbb{R}^m$,
$$\mathbb{P}\big(\,(1-\varepsilon)\|x\|^2 \le \|T(x)\|^2 \le (1+\varepsilon)\|x\|^2\,\big) > 1 - \frac{2}{(n-1)n}. \qquad (2.2)$$
Indeed, if this is the case, we have
$$\mathbb{P}\big(\,(1-\varepsilon)\|x_i - x_j\|^2 \le \|T(x_i - x_j)\|^2 \le (1+\varepsilon)\|x_i - x_j\|^2\,\big) > 1 - \frac{2}{(n-1)n}$$
for all $1 \le i < j \le n$. It only remains to apply the union bound over the $\binom{n}{2}$ pairs of points $(x_i, x_j)$.
There are several ways to construct a random map $T$ satisfying requirement (2.2). In the original paper of Johnson and Lindenstrauss [14], $T$ is constructed as the orthogonal projection onto a $k$-dimensional random subspace of $\mathbb{R}^m$. Later on, P. Indyk and R. Motwani [13] noticed that $T$ can be defined simply as a random matrix whose entries are i.i.d. Gaussian random variables. Another breakthrough was made by Achlioptas [1], in which the entries of $T$ are greatly simplified to i.i.d. random variables taking the values $\pm 1$, each with probability $\frac{1}{2}$. He also gave another interesting construction, in which each entry independently takes the value $+\sqrt{3}$ with probability $\frac{1}{6}$, $0$ with probability $\frac{2}{3}$, and $-\sqrt{3}$ with probability $\frac{1}{6}$.
The latter construction is particularly useful because it only requires $\frac{1}{3}$ of the entries to be non-zero, so the matrix $T$ becomes relatively sparse. The sparsity of $T$ leads to faster matrix-vector multiplications, which are useful in many applications.
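As an illustration (not part of the thesis; the dimensions, the variance-$1/k$ normalization and the random seed are arbitrary choices), the sketch below samples a matrix with i.i.d. entries following this sparse distribution, checks that roughly two thirds of the entries are zero, and measures the distortion on one pair of random points.

```python
import numpy as np

def achlioptas_matrix(k: int, m: int, rng: np.random.Generator) -> np.ndarray:
    """k x m matrix with i.i.d. entries taking the values
    +sqrt(3/k) w.p. 1/6, 0 w.p. 2/3, -sqrt(3/k) w.p. 1/6,
    so that each entry has mean 0 and variance 1/k."""
    values = np.array([np.sqrt(3.0 / k), 0.0, -np.sqrt(3.0 / k)])
    return rng.choice(values, size=(k, m), p=[1 / 6, 2 / 3, 1 / 6])

rng = np.random.default_rng(0)
m, k = 10_000, 800
T = achlioptas_matrix(k, m, rng)
print("fraction of zero entries:", np.mean(T == 0.0))   # ~ 2/3

x, y = rng.standard_normal(m), rng.standard_normal(m)
orig = np.linalg.norm(x - y) ** 2
proj = np.linalg.norm(T @ (x - y)) ** 2
print("squared-distance ratio ||T(x-y)||^2 / ||x-y||^2 =", proj / orig)
```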
After the discovery of these results, many researchers continued to find more sophisticated constructions of the random map $T$ suitable for specific problems. This research can be divided into two main branches: faster constructions (i.e. those that yield fast matrix-vector multiplications) and sparser constructions (i.e. those that yield sparse random matrices). Contrary to common intuition, many of the fastest random matrices are dense; on the other hand, sparse matrices are obviously relatively "fast".
In the next section, we will give a formal definition of random projections. Several popular constructions of random projections will be briefly introduced in Section 2.3.
For simplicity of notation, in the rest of the thesis we will use $T$ to denote a random map. Moreover, since we will only work with linear maps, $T$ is also treated as a random matrix. Thus, the expression $T(x)$ is written as the matrix-vector product $Tx$, and $T_{ij}$ stands for the $(i,j)$ entry of $T$. The terms "random projection", "random mapping" and "random matrix" can therefore be used interchangeably.

Definition of random projections

 Normalized random projections

It might be useful and easier to follow if we first give a formal definition of random projections (RPs). However, there is no general agreement on what an RP is. Naturally, we first state the properties that are desired and then look for structures satisfying them. Since we will focus on applications of existing random projections rather than on constructing new ones, for convenience we adopt the following definition (motivated by the unpublished manuscript of Jiří Matoušek [19]). It contains the property that we will mostly deal with in this thesis.
Definition 2.2.1. A random linear map $T : \mathbb{R}^m \to \mathbb{R}^k$ is called a random projection (or random matrix) if for all $\varepsilon \in (0,1)$ and all vectors $x \in \mathbb{R}^m$ we have
$$\mathbb{P}\big(\,(1-\varepsilon)\|x\|^2 \le \|T(x)\|^2 \le (1+\varepsilon)\|x\|^2\,\big) \ge 1 - 2e^{-C\varepsilon^2 k} \qquad (2.3)$$
for some universal constant $C > 0$ (independent of $m, k, \varepsilon$).
It should be noted that, given an RP, one can show the existence of a map $T$ satisfying the conditions of the Johnson-Lindenstrauss lemma. Indeed, to obtain a positive probability, it is sufficient to choose $k$ such that $1 - 2e^{-C\varepsilon^2 k} \ge 1 - \frac{2}{(n-1)n}$, and this can be done with any $k > \frac{2}{C\varepsilon^2}\log n$. More interestingly, the probability that we successfully find such a map (by sampling) is very high. For example, if we want this probability to be at least, say, $99.9\%$, then by the union bound we can simply choose any $k$ such that $1 - n(n-1)e^{-C\varepsilon^2 k} > 1 - \frac{1}{1000}$. This means $k$ can be chosen as $k = \big\lceil \frac{\ln(1000) + 2\ln(n)}{C\varepsilon^2} \big\rceil \le \big\lceil \frac{7 + 2\ln(n)}{C\varepsilon^2} \big\rceil$. Therefore, by slightly increasing the projected dimension, we almost always obtain a "good" mapping without having to re-sample.
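The following Monte Carlo sketch (an added illustration, not from the thesis; the point set, dimensions, distortion level and number of trials are arbitrary, and a Gaussian construction with the explicit Dasgupta-Gupta dimension bound stands in for a generic RP) estimates how often a freshly sampled projection preserves all pairwise squared distances of a fixed point set up to the factor $1 \pm \varepsilon$.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, m, eps = 60, 2_000, 0.25
k = int(np.ceil(4 * np.log(n) / (eps ** 2 / 2 - eps ** 3 / 3)))  # explicit JL bound
points = rng.standard_normal((n, m))

successes, trials = 0, 100
for _ in range(trials):
    T = rng.standard_normal((k, m)) / np.sqrt(k)   # Gaussian RP, entries N(0, 1/k)
    proj = points @ T.T
    ok = True
    for i, j in combinations(range(n), 2):
        d_orig = np.sum((points[i] - points[j]) ** 2)
        d_proj = np.sum((proj[i] - proj[j]) ** 2)
        if not ((1 - eps) * d_orig <= d_proj <= (1 + eps) * d_orig):
            ok = False
            break
    successes += ok

print(f"k = {k}; all-pairs preservation succeeded in {successes}/{trials} samples")
```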

Preservation of scalar products

From the definition, we can immediately see that an RP also preserves the scalar product with high probability. Indeed, given any $x, y \in \mathbb{R}^m$, by applying the definition of an RP to the two vectors $x+y$ and $x-y$ and using the union bound together with the polarization identity, we have
$$\big|\langle Tx, Ty\rangle - \langle x, y\rangle\big| = \tfrac{1}{4}\,\big|\,\|T(x+y)\|^2 - \|T(x-y)\|^2 - \|x+y\|^2 + \|x-y\|^2\,\big| \le \tfrac{\varepsilon}{4}\big(\|x+y\|^2 + \|x-y\|^2\big) = \tfrac{\varepsilon}{2}\big(\|x\|^2 + \|y\|^2\big)$$
with probability at least $1 - 4e^{-C\varepsilon^2 k}$. We can strengthen this to obtain the following useful lemma:
Lemma 2.2.2. Let $T : \mathbb{R}^m \to \mathbb{R}^k$ be a random projection and $0 < \varepsilon < 1$. Then there is a universal constant $C$ such that, for any $x, y \in \mathbb{R}^m$,
$$\big|\langle Tx, Ty\rangle - \langle x, y\rangle\big| \le \varepsilon\,\|x\|\cdot\|y\|$$
with probability at least $1 - 4e^{-C\varepsilon^2 k}$.
Proof. Apply the above bound to the unit vectors $u = \frac{x}{\|x\|}$ and $v = \frac{y}{\|y\|}$.
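As a quick numerical illustration of Lemma 2.2.2 (added here, not from the thesis; the Gaussian construction, the dimensions and the seed are my choices), the sketch below projects two fixed vectors many times and records the largest observed error $|\langle Tx, Ty\rangle - \langle x, y\rangle|$ relative to $\|x\|\,\|y\|$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, trials = 2_000, 500, 100
x, y = rng.standard_normal(m), rng.standard_normal(m)

worst = 0.0
for _ in range(trials):
    T = rng.standard_normal((k, m)) / np.sqrt(k)      # Gaussian random projection
    err = abs(np.dot(T @ x, T @ y) - np.dot(x, y))
    worst = max(worst, err / (np.linalg.norm(x) * np.linalg.norm(y)))

# By the lemma, this relative error exceeds eps with probability at most 4 e^{-C eps^2 k}.
print("largest relative scalar-product error over all trials:", worst)
```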


Lower bounds for projected dimension

It is interesting to ask whether we can obtain a lower value for the dimension $k$ by using a smarter construction of the map $T$ (such as a nonlinear one). It turns out that the answer is negative, i.e. the value $k = O(\varepsilon^{-2}\log n)$ is almost optimal. Indeed, Noga Alon shows in [5] that there exists a set of $n$ points for which the dimension $k$ has to be at least $\Omega\!\big(\tfrac{\log n}{\varepsilon^{2}\log(1/\varepsilon)}\big)$ in order to preserve the distances between all pairs of points. Moreover, when the mapping $T$ is required to be linear, Larsen and Nelson [16] are able to prove that $k = O(\varepsilon^{-2}\log n)$ is actually the best possible. Therefore, the linearity requirement on $T$ in Definition 2.2.1 is quite natural.

Constructions of random projections

In this section, we introduce several popular constructions of random projections. We first consider sub-Gaussian random matrices, the simplest case, which contains many of the well-known constructions mentioned in the previous section. Then we move to fast constructions in Subsection 2.3.2. For simplicity, we will discard the scaling factor $\sqrt{n/k}$ in the random matrix $T$. This does not affect the main ideas we present but makes them clearer and more concise.

Sub-gaussian random projections

Perhaps the simplest construction of a random projection is a matrix with i.i.d. entries drawn from a certain "good" distribution. Some good distributions are: the normal $\mathcal{N}(0,1)$ (Indyk and Motwani [13]), the Rademacher $\pm 1$ distribution (Achlioptas [1]), the uniform $\mathcal{U}(-a,a)$, and so on.
In [18], Matoušek shows that all these distributions turn out to be special cases of a general class of so-called sub-Gaussian distributions. He proves that an RP can still be obtained by sampling from this general class.
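The sketch below (an added illustration; the variance-$1/k$ normalization, the dimensions and the seed are my choices) draws projection matrices whose i.i.d. entries follow three of the sub-Gaussian distributions mentioned above (Gaussian, Rademacher and uniform) and compares how tightly $\|Tx\|^2$ concentrates around $1$ for a fixed unit vector $x$.

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 2_000, 500
x = rng.standard_normal(m)
x /= np.linalg.norm(x)          # unit vector, so ||Tx||^2 should concentrate around 1

def gaussian(k, m):      # entries N(0, 1/k)
    return rng.standard_normal((k, m)) / np.sqrt(k)

def rademacher(k, m):    # entries +-1/sqrt(k) with probability 1/2 each
    return rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)

def uniform(k, m):       # entries U(-a, a) with a = sqrt(3/k), hence variance 1/k
    a = np.sqrt(3.0 / k)
    return rng.uniform(-a, a, size=(k, m))

for name, sample in [("Gaussian", gaussian), ("Rademacher", rademacher), ("Uniform", uniform)]:
    norms = [np.linalg.norm(sample(k, m) @ x) ** 2 for _ in range(50)]
    print(f"{name:>10}: mean ||Tx||^2 = {np.mean(norms):.4f}, std = {np.std(norms):.4f}")
```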
Definition 2.3.1. A random variable $X$ is called sub-Gaussian (or is said to have a sub-Gaussian tail) if there are constants $C, a > 0$ such that for any $t > 0$: $\mathbb{P}(|X| > t) \le Ce^{-at^2}$.
A random variable $X$ is called sub-Gaussian up to $t_0$ if there are constants $C, a > 0$ such that for any $t \ge t_0$: $\mathbb{P}(|X| > t) \le Ce^{-at^2}$.
As the name suggests, this family of distributions is closely related to the Gaussian distribution. Intuitively, a sub-Gaussian random variable has a strong tail-decay property, similar to that of a Gaussian. One useful property is that a linear combination of sub-Gaussian random variables (with uniform constants $C, a$) is again sub-Gaussian. Moreover,
Lemma 2.3.2. Let $X$ be a random variable with $\mathbb{E}(X) = 0$. If $\mathbb{E}(e^{uX}) \le e^{Cu^2}$ for some constant $C$ and any $u > 0$, then $X$ has a sub-Gaussian tail. Conversely, if $\operatorname{Var}(X) = 1$ and $X$ has a sub-Gaussian tail, then $\mathbb{E}(e^{uX}) \le e^{Cu^2}$ for all $u > 0$ and some constant $C$.
The following lemma states that the normalized sum of squares of sub-Gaussian random variables also has a sub-Gaussian tail, up to a certain threshold.
Lemma 2.3.3 (Matoušek [18]). Let $Y_1, \ldots, Y_k$ be independent random variables with $\mathbb{E}(Y_j) = 0$, $\operatorname{Var}(Y_j) = 1$ and a uniform sub-Gaussian tail. Then the random variable $Z = \frac{1}{\sqrt{k}}\big(Y_1^2 + \cdots + Y_k^2 - k\big)$ has a sub-Gaussian tail up to $\sqrt{k}$.
Now let $T \in \mathbb{R}^{k \times m}$ be a matrix whose entries are independent random variables with expectation $0$, variance $\frac{1}{k}$ and a uniform sub-Gaussian tail. Then, for any unit vector $x \in \mathbb{R}^m$,
$$\|Tx\|^2 - 1 = (T_1 x)^2 + \cdots + (T_k x)^2 - 1$$
(where $T_i$ denotes the $i$-th row of $T$) has the same distribution as the variable $\frac{1}{\sqrt{k}} Z$ in the above lemma, with $Y_j = \sqrt{k}\,T_j x$. By the definition of a sub-Gaussian tail, we have
$$\mathbb{P}\big(\,\big|\|Tx\|^2 - 1\big| \ge \varepsilon\,\big) = \mathbb{P}\big(|Z| \ge \varepsilon\sqrt{k}\big) \le Ce^{-a\varepsilon^2 k},$$
which then implies that $T$ is an RP.
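To see the exponential dependence on $k$ numerically (an added sketch, not from the thesis; the Rademacher entries and all parameters are arbitrary choices), the code below estimates $\mathbb{P}(|\|Tx\|^2 - 1| \ge \varepsilon)$ by Monte Carlo for several values of $k$; the empirical failure frequency should drop roughly exponentially as $k$ grows.

```python
import numpy as np

rng = np.random.default_rng(4)
m, eps, trials = 500, 0.2, 1_000
x = rng.standard_normal(m)
x /= np.linalg.norm(x)                      # unit vector

for k in (50, 100, 200, 400):
    failures = 0
    for _ in range(trials):
        # Rademacher entries scaled to have variance 1/k
        T = rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)
        if abs(np.linalg.norm(T @ x) ** 2 - 1.0) >= eps:
            failures += 1
    print(f"k = {k:>3}: empirical P(| ||Tx||^2 - 1 | >= {eps}) ~ {failures / trials:.4f}")
```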

Fast Johnson-Lindenstrauss transforms

In many applications, the bottleneck in applying random projection techniques is the cost of matrix-vector multiplications. Indeed, the complexity of multiplying a $k \times m$ matrix $T$ by a vector is of order $O(km)$. Even with Achlioptas' sparse construction, the computation time only decreases by a factor of $3$. Therefore, it is important to construct the random matrix $T$ so that the product $Tx$ can be computed as fast as possible.
It is natural to expect that the sparser the matrix $T$ we construct, the faster the product $Tx$ becomes. However, due to the uncertainty principle in analysis, if the vector $x$ is itself sparse, then its image under a sparse matrix $T$ can be heavily distorted. Therefore, a random projection satisfying the Johnson-Lindenstrauss lemma cannot be too sparse.

