The multivariate Hawkes model and the least-squares functional

Get Complete Project Material File(s) Now! »

Sparse and low-rank multivariate Hawkes processes

We consider the problem of unveiling the implicit network structure of user inter-actions in a social network, based only on high-frequency timestamps. Recently, an approach built on Hawkes processes has increased in popularity [CS08, BBH12, ZZS13a, YZ13]. It uses Hawkes structure to retrieve the direct influence of a specific user’s action on all the future actions of all the users (including himself). Hawkes pro-cesses model simultaneously decay of the influence over time with the shape of the kernels ’i j , the levels of interaction between nodes with the weighted asymmetrical adjacency matrix ‘ and the baseline intensity, that measures the level of exogeneity of a user, namely the spontaneous apparition of an action, with no influence from other nodes of the network.
Penalization (also called regularization) is a very common technique in machine learn-ing. It consists in minimizing the opposite goodness-of-fit function plus a penalty term which can be designed to enforce a specific structure on the problem minimizer. Lasso or ‘1 penalty [Tib96] is one of the best known and most used. In a problem where F : Rd ! R is the opposite goodness-of-fit function, its ‘1-penalized version consists in solving min f (w) ¯‚kwk1, w 2Rd
where ‚ ¨ 0 and kwk1 ˘ jw j j. This penalty is known to output solutions with sparse support, that is a vector of coeﬃcients w with many zeros entries. In fact, if we consider a more general penalty of the form of ‘p penalization, Pdj˘1 jw j jp , then the lasso is the ‘1 penalization and the ‘0 penalization, which gives the number of nonzero entries of a vector, is the limiting case when p ! 0. In this setting, lasso uses the smallest value of p that leads to a convex formulation of an ‘p -penalized problem. Hence, lasso can also be viewed as a convex relaxation of the best subset selection problem [Tib11].
The trace-norm penalization (also known as nuclear norm) is used when the coef-ficients › 2 Rd£d are matrices instead of vectors. It consists in adding the trace-norm of the matrix k›k⁄ to the opposite goodness-of-fit, given by where ¾1(›) ‚ ¢¢¢ ‚ ¾d (›) ‚ 0 are the singular to an ‘1-penalization of the vector [¾1(›) ¢¢¢ values of ›. Thus, this is equivalent ¾d (›)] and tends to enforce the apparition of zero singular values. Finally, as the rank of matrix equals the number of non-zero singular values, trace-norm penalization is a convex relaxation of the low-rank problem [CT10], just like ‘1 penalization is for the best subset selection problem. Enforcing low-rank is now of common use in in collaborative filtering prob-lems [CT04, CT10, RRS11] to describe the network structure with a limited number of parameters. Popularized by the Netflix prize [BL07], this tends to make appear communities of individuals with similar behaviors.

Sparse and low rank priors

We apply these penalizations technique on a modeling with Hawkes process of d nodes, each of them representing a user. For more simplicity, we consider in this chapter that, for all j, j 0 ˘ 1, . . . ,d the kernels ’j, j 0 can be decomposed into ’ j, j 0 (t) ˘
P K a h (t) where a ˘ 0 and h : R¯ R¯ are known decay functions
k˘1 j, j 0 1 ,k k j, j 0,k k1 j, j 0 ,k ! ˘
,k j, j 0 j, j 0 ,k ‚
with fixed ‘ norm, h 1. Hence, for each node j 1, . . . ,d, the conditional intensity writes
d K ‚ j,„,A(t) ˘ „j ¯ X X a j, j 0,k h j, j 0,k (t ¡ s) dNsj , j 0˘1 k˘1
where „ 2 Rd is the exogeneous intensity. Typically, choosing h j, j 0,k ˘ flk exp(¡flk t) where flk ¨ 0, reverts to the sum of exponentials kernel parametrization (see Sec-tion 1.2). The parameter of interest is the self-excitement tensor A that simply rewrites here A ˘ [a j, j 0,k ]1•j, j 0•d,1•k •K . Then, we combine convex proxies for sparsity and low-rank of self-excitement tensor and the baseline intensities to obtain the desired network structure. Our prior assumptions on „ and A are the following.
Sparsity of „. Some nodes are basically inactive and react only if stimulated.
Hence, we assume that the baseline vector „ is sparse.
Sparsity of A. A node interacts only with a fraction of other nodes, meaning that for a fixed node j , only a few a j,j 0,k are non-zero. Moreover, a node might react at specific time scales only, namely a j, j 0,k is non-zero for some k only for fixed j, j 0. Thus, we assume that A is an entrywise sparse tensor.
Low-rank of A. We assume that there exist latent factors that explain the way nodes impact other nodes through the diﬀerent scales k ˘ 1, . . . ,K . Rewriting the intensity naturally leads to penalize the rank of d £ K d matrix hstack(A) ˘ £A†,†,1 ¢¢¢A†,†,K ⁄ where A†,†,k stands for the d £d matrix with entries (A†,†,k )j, j 0 ˘ Aj, j 0,k .
We induce these proxies with penalty terms added to the objective. Namely, we minimize R(„, A) ¯ k„k1,wˆ ¯ kAk1, ˆ ¯¿ˆkhstack(A)k⁄, (2)
where R(„, A) is the least squares goodness-of-fit (1), and the penalty terms are trace-norm and weighted ‘1 penalizations.
Section 4 of Chapter II. The choice of these weights lead to a data-driven scaling of the variability of information available for each nodes and comes from a sharp analysis of the noise terms presented in the section below.

Sharp oracle inequality

With a wise choice of the weights parameters wˆ, ˆ and ¿ˆ, we derive, in Theorem 1, W an oracle inequality. This inequality bounds the estimation error of the intensity ‚„ˆ, ˆ , A obtained by minimizing Problem (2), given the best possible estimator, with the same parametrization, that would rely on perfect information. We fix some confidence level x ¨ 0, which can be safely chosen as x ˘ logT for instance, and denote by k ¢ kF the Frobenius norm to formulate the following theorem where no assumption is made on the ground truth intensity ‚.
The constant •(„,A) is given by Definition 1 of Chapter II. It comes from the necessity to have a restricted eigenvalue condition on the Gram matrix of the problem to obtain an oracle inequality with a fast rate [BRT09, Kol11]. Roughly, it requires that for any set of parameters {„0,A0} that has a support close to the one of {„,A}, we have that the L2 norm of {„0,A0} in the support of {„,A} can be bounded by the L2 norm of the intensity given by k‚„0,A0 kT .

Numerical experiments

To measure the performances of the choice of the data-driven weighting of the penalizations {wˆ, ˆ ,¿ˆ}, we conduct experiments on synthetic datasets and compare our method to non-weighted penalizations [ZZS13a]. We perform these experiments on Hawkes processes with d ˘ 30 nodes and K ˘ 3 basis kernels where the self-excitement tensor contains square overlapping boxes (corresponding to overlapping communities) to respect the sparse and low rank priors. We consider four estimation procedures that minimize the least square goodness-of-fit plus one of the following penalization:
• L1: non-weighted ‘1 penalization of A
• wL1: weighted ‘1 penalization of A
• L1Nuclear: non-weighted ‘1 penalization and trace-norm penalization of hstack(A) (same as [ZZS13a])
• wL1Nuclear: weighted ‘1 penalization and trace-norm penalization of hstack(A)
Then, for each procedure, we train the model on the generated data, restricting it on a growing time intervals, and assessing their performance each time with the three following metrics:
• Estimation error: the relative ‘2 estimation error of A, given by kA¡Ak2 /kAk2
• AUC: we compute the AUC (area under the ROC curve) between the binarized ground truth matrix and the solution ˆ with entries scaled in [0, 1]. This allows us to quantify the ability of the procedure to detect the support of the connectivity structure between nodes.
• Kendall: we compute Kendall’s tau-b between all entries of the ground truth matrix and the solution ˆ . This correlation coeﬃcient takes value between ¡1 and 1 and compare the number of concordant and discordant pairs. This allows us to quantify the ability of the procedure to rank correctly the intensity of the connectivity between nodes.
Figure 2 confirms that weighted penalizations systematically leads to an improve-ment, both for L1 and L1Nuclear, in terms of error, AUC and Kendall coeﬃcient.
Studying the optimization techniques used to minimize the objective (2) was beyond the scope of this section. However, optimization is a crucial part of the procedure on which we will focus in the two following sections.
A wide variety of machine learning tasks consists in optimizing the following problem w2Rd with 1 n
where the function fi corresponds to a loss computed at a sample i of the dataset and the convex function g is a penalization term. This framework includes classification with logistic regression with fi (w) ˘ log(1 ¯exp(¡yi w>xi )), least square regression with fi (w) ˘ (yi ¡ w>xi )2 among many others. It is common to assume that function
f is gradient-Lipschitz, namely krf (x) ¡ rf (y)k • Lkx ¡ yk for any x, y 2 Rd where k.k stands for the Euclidean norm on Rd , and L ¨ 0 is the Lipschitz constant. With this property, the descent lemma [Ber99, Proposition A.24] holds, f (w ¯¢w) • f (w) ¯¢w>rf (w) ¯ L2 k¢wk2
for any w,¢w 2 Rd . Most first order algorithms ensue from this lemma and firstly the gradient descent algorithm where at each step ¢w t¯1 is set to the optimal value ¡ L1 rf (w t ). The penalization term is then handled with proximal operators [CP11]. This leads to ISTA algorithm and its accelerated version FISTA [BT09] whose rate is optimal [Nes83].
Stochastic gradient descent. However, with its structure, this problem can also be considered as an accumulation of smaller problems fi that have a common behavior. Stochastic gradient descent (SGD) [RM51] exploits this and instead of computing the full gradient rf (w t ) at each step, uses a random variable `t 2 Rd such that E[`t ] ˘ rf (w t ). The descent iteration becomes ¢w t¯1 ˘ ¡•t `t where `t is set to rfi (w t ) and •t ¨ 0 is a step size. When all rfi (w t ) are as expensive to compute, SGD performs iterations that are n times quicker than the batch algorithms previously introduced. But, this method does not converge easily to a precise solution because rfi (w t ) does not approach zero when w t is close from the optimal value w⁄. Hence, the sequence of step size (•t )t‚0 must be decreasing which eventually aﬀects the convergence speed.
Variance reduced stochastic algorithms. Recently, stochastic solvers based on a combination of SGD and the Monte-Carlo technique of variance reduction [SLRB17], [SSZ13], [JZ13], [DBLJ14] turn out to be both very eﬃcient numerically (each update has a complexity comparable to vanilla SGD) and very sound theoretically. To reduce its variance, the random variable `t is set to rf i (w t ) ¯ Y ¡ E[Y ] where Y 2 Rd is a random variable that is expected to be correlated with rfi (w t ). Hence, `t remains an unbiased estimator of rf (w t ) and its variance is decreased. In [SLRB17], [SSZ13], [JZ13] and [DBLJ14], `t converges to 0 at the optimum, and the sequence of step sizes (•t )t‚0 does not need to be decreasing as in SGD. Theses algorithms obtain linear convergence rate, that is they reach an iterate w t such that F (w t ) • F (w⁄) ¯ » in less than O(log(1/ »)) iterations.
Thus, modern optimization methods obtain high precision solutions with few passes on the data. However, the first order algorithms that we have just introduced rely on the assumption that f is gradient-Lipschitz which is not verified by Hawkes processes log likelihood. The lack of fast, scalable and robust method to solve this problem motivates the following question.

Table of contents :

Introduction
1 Background on Hawkes processes
2 Sparse and low-rank multivariate Hawkes processes
3 Background on composite sum minimization with rst order methods
4 Dual optimization without the gradient-Lipschitz assumption
5 tick: a Python library for statistical learning
Chapter I Background on Hawkes processes
1 Temporal point processes
1.1 Denition
1.2 Goodness-of-t
2 Hawkes processes
2.1 Multivariate Hawkes processes
2.2 Simulation
2.3 Estimation
2.4 Exponential kernels
Chapter II Sparse and low-rank multivariate Hawkes processes
1 Introduction
2 The multivariate Hawkes model and the least-squares functional
3 A new data-driven matrix martingale Bernstein’s inequality
3.1 Notations
3.2 A non-observable matrix martingale Bernstein’s inequality
3.3 Data-driven matrix martingale Bernstein’s inequalities
4 The procedure
5 A sharp oracle inequality
6 Numerical experiments
7 Conclusion
8 Proofs
Chapter III Background on rst order composite sum minimization
1 Composite sum minimization
2 Batch gradient descent
3 Stochastic gradient descent
4 Variance reduced stochastic gradient descent
5 Numerical comparison
Chapter IV Dual optimization without the gradient-Lipschitz assumption
1 Introduction
2 A tighter smoothness assumption
3 The Shifted SDCA algorithm
3.1 Proximal algorithm
3.2 Importance sampling
4 Applications to Poisson regression and Hawkes processes
4.1 Linear Poisson regression
4.2 Hawkes processes
4.3 Closed form solution and bounds on dual variables
5 Experiments
5.1 Poisson regression
5.2 Hawkes processes
5.3 Heuristic initialization
5.4 Using mini batches
5.5 About the pessimistic upper bounds
6 Conclusion
7 Proofs
Chapter V tick: a Python library for statistical learning
1 Introduction
2 Existing Libraries
3 Package Architecture
4 Hawkes
5 Benchmarks
6 Examples
6.1 Estimate Hawkes intensity
6.2 Fit Hawkes on nance data
6.3 SVRG with an adaptive step size
6.4 Lower precision to accelerate algorithms
7 Hawkes with non constant exogenous intensity
8 Asynchronous stochastic solvers
Bibliography