Maximum Likelihood Estimation in Latent Data Models

Get Complete Project Material File(s) Now! »

Incremental Method for Non-smooth Non-convex Optimization

Abstract: Many constrained, non-convex optimization problems can be tackled using the Majorization-Minimization (MM) method which al-ternates between constructing a surrogate function which upper bounds the objective function, and then minimizing this surrogate. For prob-lems which minimize a finite sum of functions, a stochastic version of the MM method selects a batch of functions at random at each itera-tion and optimizes the accumulated surrogate. However, in many cases of interest such as variational inference for latent variable models, the surrogate functions are expressed as an expectation. In this contribu-tion, we propose a doubly stochastic MM method based on Monte Carlo approximation of these stochastic surrogates. We establish asymptotic and non-asymptotic convergence of our scheme in a constrained, non-convex, non-smooth optimization setting. We apply our new framework for inference of logistic regression model with missing covariates and for variational inference of autoencoder on the MNIST dataset. This chap-ter corresponds to the article [Karimi et al., 2019b].

Introduction

We consider the constrained minimization problem of a finite sum of functions:
min ( ):= 1 n , (3.1.1)
where Θ is a convex, compact, and closed subset of Rp, and for any i ∈ J1, nK, the function Li : Rp → R is bounded from below and is (possibly) non-convex and non-smooth.
To tackle the optimization problem (3.1.1), a popular approach is to apply the majorization-minimization (MM) method which iteratively minimizes a majorizing sur-rogate function. A large number of existing procedures fall into this general framework, for instance gradient-based or proximal methods or the Expectation-Maximization (EM) algorithm [McLachlan and Krishnan, 2008] and some variational Bayes inference tech-niques [Jordan et al., 1999]; see for example [Razaviyayn et al., 2013] and [Lange, 2016] and the references therein. When the number of terms n in (3.1.1) is large, the vanilla MM method may be intractable because it requires to construct a surrogate function for all the n terms Li at each iteration. Here, a remedy is to apply the Minimization by Incremental Surrogate Optimization (MISO) method proposed by Mairal [2015b], where the surrogate functions are updated incrementally. The MISO method can be interpreted as a combination of MM and ideas which have emerged for variance reduction in stochastic gradient methods [Schmidt et al., 2017].
The success of the MISO method rests upon the eﬃcient minimization of surrogates such as convex functions, see [Mairal, 2015b, Section 2.3]. In many applications of interest, the natural surrogate functions are intractable, yet they are defined as expectation of tractable functions. This for example the case for inference in latent variable models. Another application is variational inference, [Ghahramani, 2015], in which the goal is to approximate the posterior distribution of parameters given the observations; see for example [Blundell et al., 2015, Li and Gal, 2017, Neal, 2012, Polson et al., 2017, Rezende et al., 2014].
This paper fills the gap in the literature by proposing a new method called Minimization by Incremental Stochastic Surrogate Optimization (MISSO) which is designed for the finite sum optimization with a finite-time convergence guarantee. Our contributions can be summarized as follows. t→0+

Functions

• We propose a unifying framework of analysis for incremental stochastic surrogate optimization when the surrogates are defined by expectations of tractable functions. The proposed MISSO method is built on the Monte Carlo integration of the in-tractable surrogate function, i.e., a doubly stochastic surrogate optimization scheme. In addition, we present an incremental variational inference and Monte-Carlo EM methods as two special cases of this framework.
• We establish both asymptotic and non-asymptotic convergence for the MISSO method. In particular, the MISSO method converges almost surely to a station-ary point and in O(n/ ) iterations to an -stationary point.
In Section 3.2, we review the techniques for incremental minimization of finite sum func-tions based on the MM principle; specifically, we review the MISO method as introduced in [Mairal, 2015b], and present a class of surrogate functions expressed as an expectation over a latent space. The MISSO method is then introduced for the latter class of surrogate functions. In Section 4.2.1, we provide the asymptotic and non-asymptotic convergence analysis for the MISSO method. Finally, Section 3.4 presents numerical applications to illustrate our findings including parameter inference for logistic regression with missing covariates and variational inference for Bayesian neural network. Notations We denote J1, nK = {1, . . . , n}. Unless otherwise specified, k • k denotes the standard Euclidean norm and h• | •i is the inner product in Euclidean space. For any function f : Θ → R, f0( , d) is the directional derivative of f at along the direction d, i.e.,
f0( , d) := lim f( + td) − f( ) . (3.1.2)
The directional derivative is assumed to exist for the functions introduced throughout this paper.

Functions

The objective function in (3.1.1) is composed of a finite sum of possibly non-smooth and non-convex functions. A popular approach here is to apply the MM method. The MM method tackles (3.1.1) through alternating between two steps — (i) minimizing a surrogate function which upper bounds the original objective function; and (ii) updating the surrogate function to tighten the upper bound.
Example 1: Maximum Likelihood Estimation for Latent Variable Model La-tent variable models [Bishop, 2006] are constructed by introducing unobserved (latent) variables which help explain the observed data. We consider n independent observations ((yi, zi), i ∈ JnK), that can be non identically distributed, where yi is observed and zi is latent. In this incomplete data framework, define {fi(zi, yi, ), ∈ Θ} to be the complete data likelihood models, i.e., joint likelihood of the observations and latent variables. Let gi(yi, ) := ZZ fi(zi, yi, )µi(dzi), i ∈ 1, n K (3.2.7) denote the incomplete data likelihood, i.e., the marginal likelihood of the observations. For ease of notations, the dependence on the observations is made implicit. The maximum likelihood (ML) estimation problem takes Li( ) to be the ith negated incomplete data
Algorithm 3.2 MISSO method
1: Input: initialization (0); a sequence of non-negative numbers {M(k)}∞k=0.
2: For all i ∈ J1, nK, draw M(0) Monte-Carlo samples with the stationary distribution pi(•; (0)).
Example 2: Variational Inference Let ((xi, yi), i ∈ J1, nK) be i.i.d. input-output pairs and w ∈ W ⊆ Rd be a latent variable. When conditioned on the input x = (xi, i ∈ J1, nK), the joint distribution of y = (yi, i ∈ J1, nK) and w is given by: p(y, w|x) = π(w) Qin=1 p(yi|xi, w) . (3.2.9)
Our goal is to compute the posterior distribution p(w|y, x). In most cases, the posterior distribution p(w|y, x) is intractable and is approximated using a family of parametric distributions, {q(w, ), ∈ Θ}. The variational inference (VI) problem [Blei et al., 2017a] boils down to minimizing the KL divergence between q(w, ) and the posterior distribution

Application to Logistic Regression and Bayesian Deep Learning

Binary logistic regression with missing values

This application follows Example 1 described in Section 3.2. We consider a binary regression setup, ((yi, zi), i ∈ JnK) where yi ∈ {0, 1} is a binary response and zi = (zi,j ∈ R, j ∈ JpK) is a covariate vector. The vector of covariates zi = [zi,mis, zi,obs] is not fully observed where we denote by zi,mis the missing values and zi,obs the observed covariate.
It is assumed that (zi, i ∈ JnK) are i.i.d. and marginally distributed according to N ( , ) where β ∈ Rp and Ω is a positive definite p × p matrix.
Fitting a logistic regression model on the TraumaBase dataset We apply the MISSO method to fit a logistic regression model on the TraumaBase (http:// traumabase.eu) dataset, which consists of data collected from 15 trauma centers in France, covering measurements on patients from the initial to last stage of trauma.
Similar to [Jiang et al., 2018], we select p = 16 influential quantitative measurements, described in Appendix 3.9.1, on n = 6384 patients, and we adopt the logistic regression model with missing covariates in (3.4.1) to predict the risk of a severe hemorrhage which is one of the main cause of death after a major trauma. Note as the dataset considered is heterogeneous – coming from multiple sources with frequently missed entries – we apply the latent data model described in the above. For the Monte-Carlo sampling of zi,mis, we run a Metropolis Hastings algorithm with the target distribution p(•|zi,obs, yi; (k)) whose procedure is detailed in Appendix 3.9.1.
Figure 3.1 – Convergence of first component of the vector of parameters and for the SAEM, the MCEM and the MISSO methods. The convergence is plotted against the number of passes over the data.
We compare in Figure 3.1 the convergence behavior of the estimated parameters using SAEM [Delyon et al., 1999a] (with stepsize γk = 1/k), MCEM [Wei and Tanner, 1990a] and the proposed MISSO method. For the MISSO method, we set the batch size to M(k) = 10 + k2 and we examine with selecting diﬀerent number of functions in Line 5 in the method – the default settings with 1 function (MISSO), 10% (MISSO10) and 50% (MISSO50) of the functions per iteration. From Figure 3.1, the MISSO method converges to a static value with less number of epochs than the MCEM, SAEM methods. It is worth noting that the diﬀerence among the MISSO runs for diﬀerent number of selected functions demonstrates a variance-cost tradeoﬀ.

Fitting Bayesian LeNet-5 on MNIST

This application follows Example 2 described in Section 3.2. We apply the MISSO method to fit a Bayesian variant of LeNet-5 [LeCun et al., 1998] (see Appendix 3.9.2). We train this network on the MNIST dataset [LeCun, 1998]. The training set is composed of N = 55 000 handwritten digits, 28×28 images. Each image is labelled with its correspond-ing number (from zero to nine). Under the prior distribution π, see (3.2.9), the weights are assumed independent and identically distributed according to N (0, 1). We also assume that q(•; ) ≡ N (µ, σ2I). The variational posterior parameters are thus = (µ, σ) where µ = (µ`, ` ∈ JdK) where d is the number of weights in the neural network. We use the re-parametrization as w = t( , z) = µ + σz with z ∼ N (0, I).

Table of contents :

Contributions and thesis outline
1 Introduction
1.1 Statistical Learning
1.2 Non-convex Optimization
1.3 Maximum Likelihood Estimation in Latent Data Models
1.4 Mixed Effects Modeling and Population Approach
2 Introduction en Français
2.1 Apprentissage Statistique
2.2 Optimisation Non-convexe
2.3 Maximum de Vraisemblance Dans Des Modèles à Données Latentes
2.4 Modèles à Effets Mixtes et Approche de Population
3 Incremental Method for Non-smooth Non-convex Optimization
3.1 Introduction
3.2 Incremental Minimization of Finite Sum Non-convex Functions
3.3 Convergence Analysis
3.4 Application to Logistic Regression and Bayesian Deep Learning
3.5 Conclusions
Appendices to Incr. Method for Non-smooth Non-convex Optimization
3.6 Proof of Theorem 1
3.7 Proof of Theorem 2
3.8 Proof of Lemma 1
3.9 Details about the Numerical Experiments
4 Online Optimization of Non-convex Problems
4.1 Introduction
12 Contents
4.2 Stochastic Approximation Schemes and Their Convergence
4.3 Application to Online and Reinforcement Learning
4.4 Conclusion
Appendices to Online Optimization of Non-convex Problems
4.5 Proofs of Section 4.2.1
4.6 Proofs for the ro-EM Method
4.7 Proofs for the Policy Gradient Algorithm
4.8 Existence and Regularity of the Solutions of Poisson Equations
5 Fast Incremental EM Methods
5.1 Introduction
5.2 Stochastic Optimization Techniques for EM Methods
5.3 Global Convergence of Stochastic EM Methods
5.4 Application to Mixture and Topic Modeling
5.5 Conclusion
Appendices to Fast Incremental EM Methods
5.6 Proof of Lemma 9
5.7 Proof of Theorem 5
5.8 Proof of Lemma 10
5.9 Proof of Lemma 11
5.10 Proof of Lemma 12
5.11 Proof of Theorem 6
5.12 Local Linear Convergence of fiEM
5.13 Practical Applications of Stochastic EM Methods
6 Fast Stochastic Approximation of the EM
6.1 Introduction
6.2 Mixed Effect Models
6.3 Sampling from Conditional Distributions
6.4 The nlme-IMH and the f-SAEM
6.5 Application to Pharmacology
6.6 Conclusion
Appendices to Fast Stochastic Approximation of the EM
6.7 Mathematical Details
6.8 Supplementary Experiments
7 Incremental Stochastic Approximation of the EM
7.1 Introduction
7.2 Maximum Likelihood Estimation: the SAEM Algorithm
7.3 Numerical Applications
7.4 Conclusion
Appendices to Incremental Stochastic Approximation of the EM
7.5 Proof of Lemma 15
7.6 Proof of Theorem 8
7.7 Bias-Variance Tradeoff in Incremental EM and SAEM
8 R Tutorial: MLE for Noncontinuous Data Models
8.1 Introduction
8.2 Noncontinuous Data Models
8.3 A Repeated Time-To-Event Data Model
8.4 A Categorical Data Model with Regression Variables
9 Conclusion 223
9.1 Summary of the Thesis
9.2 Perspectives
Bibliography