Improved recurrent neural network training
While we dedicated Part I to introducing new optimization algorithms, in Part II we focus on novel methods for complex tasks. The excellent performance obtained on simple problems (e.g. binary classification) has prompted researchers to turn to more challenging issues. Notable amongst these are structured prediction tasks, which consist in predicting several random variables at once (taking into account their dependencies). Examples of such tasks include image segmentation, dependency parsing and machine translation.
As these problems can often be solved one variable at a time, we dedicate Part II of this thesis to sequential prediction methods. More specifically, we focus on recurrent neural networks (RNNs), which have proven quite successful for structured prediction applications in the last few years. RNNs aim at representing sequential data. To do so, they recursively apply the same transformation (or cell) $f$ to the sequential data and their own previous predictions, outputting a sequence of hidden states: $h_t = f(h_{t-1}, y_{t-1}, x)$, with $h_0$ an initial state and $x$ an optional input.
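To make the recursion concrete, here is a minimal sketch of such a cell in Python; the scalar `tanh` transition and the weight values are illustrative assumptions, not the architecture studied in this thesis.

```python
import math

# Hypothetical minimal cell: the names and the tanh transition are
# illustrative choices, not the thesis's actual architecture.
def cell(h_prev, y_prev, x, w=0.5, u=0.3, v=0.2):
    # h_t = f(h_{t-1}, y_{t-1}, x): the same weights are reused at every step
    return math.tanh(w * h_prev + u * y_prev + v * x)

def unroll(y, x=1.0, h0=0.0):
    h, states = h0, []
    y_prev = 0.0  # conventional start-of-sequence value
    for y_t in y:
        h = cell(h, y_prev, x)
        states.append(h)
        y_prev = y_t
    return states

states = unroll([1.0, -1.0, 1.0])
print(len(states))  # one hidden state per sequence position
```

The key point is weight sharing: the same `cell` is applied at every time step, so the number of parameters does not grow with the sequence length.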
Maximum likelihood training for recurrent neural networks
Our work concerns conditional RNNs, which have a natural probabilistic interpretation. Each cell is supposed to output the conditional probability of a token, given all previous predictions in the sequence: $p(y_t \mid y_1, \ldots, y_{t-1}, x)$. Multiplying the outputs along the sequence thus yields the joint probability of the whole sequence $y_1 \ldots y_T$, thanks to the chain rule.
Provided we replace the outputs of the model with the ground truth tokens before feeding them back to the following cell – this is known as teacher forcing – we can obtain the conditional probability of the ground truth sequences $p(y \mid x)$. This allows us to use maximum likelihood estimation (MLE) to derive a surrogate loss: we train the model by maximizing the probability of the ground truth.
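As a toy illustration of teacher forcing, the sketch below computes the log-likelihood of a ground-truth sequence by feeding the true previous token at each step; the bigram probability table and token names are invented for the example.

```python
import math

# Toy sketch of teacher forcing: `cond_prob` is a stand-in for the RNN
# cell's softmax output p(y_t | y_{t-1}, x); the bigram table is invented.
BIGRAM = {("a", "b"): 0.9, ("b", "a"): 0.8}

def cond_prob(y_t, y_prev):
    return BIGRAM.get((y_prev, y_t), 0.1)

def log_likelihood_teacher_forced(ground_truth, start="a"):
    # Chain rule: log p(y) = sum_t log p(y_t | y_{<t}).
    # Teacher forcing feeds the *ground-truth* previous token at each step,
    # never the model's own (possibly wrong) prediction.
    ll, y_prev = 0.0, start
    for y_t in ground_truth:
        ll += math.log(cond_prob(y_t, y_prev))
        y_prev = y_t  # ground truth, not the model's prediction
    return ll

ll = log_likelihood_teacher_forced(["b", "a", "b"])
```

Maximizing this quantity over the model parameters is exactly the MLE surrogate described above; at test time the ground-truth `y_prev` is unavailable and must be replaced by the model's own predictions.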
Issues with MLE
While this “universal” loss has led to many successes, it does suffer from a number of limitations. This comes as no surprise as RNNs are used for a wide variety of tasks, each with their own specific validation metrics, such as BLEU score, edit distance etc.
First, as it completely ignores the actual validation metric we are interested in optimizing for, it fails to exploit the wealth of data that such structured losses can offer. MLE training only concerns itself with maximizing the probability of the ground truth, and does not distinguish a poor candidate sequence from a strong one.
Second, it introduces discrepancies between the train and test phases, such as the exposure or exploration bias (Ranzato et al., 2016). Due to teacher forcing, the model is trained to make predictions, assuming that it has not made a single mistake before. At test time though, the model does not have access to the ground truth, and thus feeds its own previous predictions to its next cell for prediction instead. Of course, errors do occur. The model then finds itself in a situation it has not encountered during training and is thus more likely to make additional mistakes.
Alternative training methods. Improving RNN training thus appears as a relevant endeavor, which has received much attention recently. In particular, ideas coming from reinforcement learning (RL), such as the REINFORCE and ACTOR-CRITIC algorithms (Ranzato et al., 2016; Bahdanau et al., 2017), have been adapted to derive training losses that are more closely related to the test error that we actually want to minimize.
In order to address the issues of MLE training, we propose instead to use ideas from the structured prediction field, in particular from the “learning to search” (L2S) approach pioneered by Daumé et al. (2009).
A more thorough introduction to Part II can be found in Chapter 6.
In Part II, we aim at proposing new training algorithms for recurrent neural networks that do not suffer from the limitations of MLE, and thus lead to improved model performance.
Traditional MLE training and its limitations. In the introductory Chapter 6, we explain the importance of recurrent neural networks for tackling complex machine learning tasks. We describe their most common training algorithm, MLE, and detail its limitations, clarifying the claims of the related literature. We explore its alternatives in-depth, and finally state the goals of Part II of this thesis, as well as our contributions.
SEARNN: a new training algorithm that leverages the validation metric. In Chapter 7, we propose SEARNN, a novel training algorithm inspired by the L2S approach to structured prediction. SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error than the MLE surrogate.
We provide the rationale for using L2S as a basis for improved RNN training by underlining surprisingly strong similarities between this algorithm and RNNs. We investigate scaling schemes to allow SEARNN to handle tasks with large vocabulary sizes and long sequences. We demonstrate the usefulness of our method through comprehensive experiments on three challenging datasets.
Finally, we contrast our novel approach both empirically and theoretically to MLE and the related L2S- and RL-inspired methods.
Part I is organized in five chapters.
In Chapter 1, we detail the reasons why optimizing (1.1) is important in many machine learning applications. We then provide some context on the new challenges that practitioners are faced with in the era of “big data”, and give some insight into the new methods that are developed accordingly. Finally, we state the goals of this thesis, as well as our contributions.
In Chapter 2, we introduce a new framework of analysis, enabling correct and easy convergence proofs for asynchronous parallel incremental algorithms. After explaining how this framework alleviates the issues previous analyses ran into, we demonstrate its usefulness by analyzing HOGWILD, obtaining better bounds under more realistic assumptions than in the related literature.
This chapter covers the first half of the conference paper Leblond et al. (2017) and its extended journal version Leblond et al. (2018b).
In Chapter 3, we focus on asynchronous parallel variants of high-performing variance-reduced algorithms. We introduce a sparse variant of the incremental SAGA algorithm, as well as its asynchronous parallel adaptation, ASAGA. Using the framework introduced in Chapter 2, we show that ASAGA converges linearly and obtains a linear speedup over its sequential counterpart (under assumptions).
We also conduct the analysis of KROMAGNON, improving the known bounds for this algorithm while removing problematic assumptions. We prove that this algorithm enjoys similar speedup properties as ASAGA.
This chapter is based on the second half of the journal paper Leblond et al. (2018b).
In Chapter 4, we extend ASAGA to the composite optimization objective, where the regularizer term can be non-smooth. We introduce a variant of SAGA with sparse proximal updates, which is amenable to adaptation to the asynchronous parallel setting. We perform the convergence analysis for this algorithm, and show that it can obtain linear speedups under sparsity assumptions. Empirical benchmarks show that PROXASAGA significantly outperforms other state-of-the-art methods on large-scale datasets.
This chapter is based on the conference paper Pedregosa et al. (2017).
Finally, in Chapter 5 we provide a summary of our contributions, discuss the limitations of our approach and how to address them in future work.
Part II is organized in three chapters.
Chapter 6 is dedicated to the formal description of the traditional training algorithm for RNNs. We explore its issues in-depth and discuss existing alternative approaches.
In Chapter 7, we introduce and analyze the behavior and performance of a novel RNN training method called SEARNN, which leverages exploration to derive global-local losses that are closer to the validation metric than the MLE surrogate. We propose schemes to improve the algorithm’s scaling, and we demonstrate improved performance on three challenging structured prediction tasks.
This chapter covers the conference paper Leblond et al. (2018a).
Finally, we conclude Part II in Chapter 8. We summarize our findings, list the limitations of the novel algorithm we have proposed, and offer several promising new directions of research for advancing RNN training.
The chapters of Part I are based on three papers, all written with Fabian Pedregosa and Simon Lacoste-Julien. Part II is based on the conference paper Leblond et al. (2018a), written jointly with Jean-Baptiste Alayrac, Anton Osokin and Simon Lacoste-Julien.
Leblond et al. (2017) was published as a conference paper at AISTATS in 2017. It introduces our novel framework of analysis for asynchronous optimization, as well as ASAGA.
Pedregosa et al. (2017) was published as a conference paper at NIPS in 2017. In this work we focus on the non-smooth objective setting and introduce and analyze PROXASAGA.
Leblond et al. (2018b) is to be published as a journal paper at JMLR in 2018. It is an extended version of Leblond et al. (2017), to which we have added the analysis of both HOGWILD and KROMAGNON.
Leblond et al. (2018a) was published as a conference paper at ICLR in 2018. In this publication, we propose SEARNN, our novel RNN training algorithm, and explore its properties and performance on complex structured prediction tasks.
Modern optimization for machine learning
In the first part of this dissertation, we will concentrate on methods to efficiently solve an optimization objective with a specific finite-sum structure:
$$\min_{x \in \mathbb{R}^d} \; f(x) + h(x), \qquad f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x). \tag{1.1}$$
This objective is particularly relevant for machine learning applications where it arises naturally from various tasks.
Empirical risk minimization (ERM). One salient example is the empirical risk minimization process, which is a standard approach in the supervised learning setup. Broadly speaking, we want to find a mapping $\phi$ between inputs $z_i$ and outputs $y_i$ such that the discrepancy between the mapped inputs $\phi(z_i)$ and the outputs is small. To keep computational costs under control, we specify a family of parametric functions $\phi(\cdot, x)$, and then we try to find the best candidate in this family, i.e. the candidate that minimizes the average discrepancy between its mapping of the inputs $\phi(z_i, x)$ and the actual outputs $y_i$. The resulting objective function is $f(x)$, also called the data-fitting term. By adding a regularization term $h(x)$ we obtain (1.1).
More formally, we are given an input set $\mathcal{Z}$, an output set $\mathcal{Y}$ and a real-valued loss function $L : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ which measures the discrepancy between two elements of $\mathcal{Y}$.
Typically, $\mathcal{Z}$ is $\mathbb{R}^d$ and $\mathcal{Y}$ is either $\{-1, 1\}$ (in the case of a classification task) or $\mathbb{R}$ (for regression tasks). We also assume that there exists a joint probability distribution $P(z, y)$ over $\mathcal{Z}$ and $\mathcal{Y}$. We aim to learn a mapping $\phi$ between the inputs and the outputs such that the risk associated with $\phi$ is minimized, where the risk is the expected loss $\mathbb{E}\, L(\phi(z), y)$. Ideally, we could look for $\phi$ in the whole set of mappings between $\mathcal{Z}$ and $\mathcal{Y}$. Unfortunately, such an unconstrained search would be prohibitively expensive in most cases. To circumvent this issue, we arbitrarily impose constraints on $\phi$. Typically, we assume that $\phi(\cdot, x)$ is one element of a parametric family of promising candidate functions. We then seek to find the best candidate in this family, i.e. the candidate that minimizes the expected loss.
To achieve this goal, we have at our disposal a training set, which consists of a set of inputs $(z_i)_{i=1}^n$, each associated with an output $y_i$. The instances $(z_i, y_i)$ are assumed to be drawn independently and identically from the distribution $P(z, y)$. This training set is crucial because we usually do not have access to the joint distribution $P(z, y)$, which implies that the expected loss we seek to minimize cannot be computed. The training set gives us an estimator of the expected loss, which we can minimize instead: this is the empirical risk minimization principle. Intuitively, for $\phi(\cdot, x)$ to be a good mapping from $\mathcal{Z}$ to $\mathcal{Y}$, we need $\phi(z_i, x)$ to be a good approximation of $y_i$, i.e. $L(\phi(z_i, x), y_i)$ to be small.
If we define $f_i(x) := L(\phi(z_i, x), y_i)$, then minimizing the empirical risk amounts to solving the optimization objective (1.1) (in the specific case where $h(x) = 0$).
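As a concrete instance, the following sketch evaluates the empirical risk for a hypothetical one-dimensional linear model $\phi(z, x) = xz$ under the squared loss; the data points are synthetic and chosen for illustration only.

```python
# A minimal ERM instance under stated assumptions: phi(z, x) = x * z
# (a 1-d linear model) and squared loss L(a, b) = (a - b)^2.
def phi(z, x):
    return x * z

def squared_loss(pred, target):
    return (pred - target) ** 2

def empirical_risk(x, zs, ys):
    # f(x) = (1/n) sum_i L(phi(z_i, x), y_i): the data-fitting term of (1.1)
    n = len(zs)
    return sum(squared_loss(phi(z, x), y) for z, y in zip(zs, ys)) / n

zs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]  # synthetic pairs, roughly y = 2z
print(empirical_risk(2.0, zs, ys))          # small: x = 2 fits the data well
print(empirical_risk(0.0, zs, ys))          # large: x = 0 ignores the data
```

Each data point contributes one factor $f_i$, so the finite-sum structure of (1.1) arises directly from the training set.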
Background. In this dissertation, we consider the unconstrained optimization problem, i.e. the domain is $\mathbb{R}^d$. In order to design efficient methods to solve (1.1), we also impose some restrictions on $f$ and the individual $f_i$:
1. each $f_i$ is differentiable;
2. each $f_i$ is convex, i.e.: $\forall x_1, x_2 \in \mathbb{R}^d,\ f_i(x_1) \geq f_i(x_2) + \langle f_i'(x_2), x_1 - x_2 \rangle;$ (1.2)
3. $f$ is $\mu$-strongly convex, i.e. $f - \frac{\mu}{2}\|x\|^2$ is convex, or equivalently: $\forall x_1, x_2 \in \mathbb{R}^d,\ f(x_1) \geq f(x_2) + \langle f'(x_2), x_1 - x_2 \rangle + \frac{\mu}{2}\|x_1 - x_2\|^2;$ (1.3)
4. each $f_i$ has $L$-Lipschitz continuous gradients: $\forall x_1, x_2 \in \mathbb{R}^d,\ \|f_i'(x_1) - f_i'(x_2)\| \leq L \|x_1 - x_2\|.$ (1.4)
We define $\kappa := L/\mu$ as the condition number of $f$, and $x^*$ as the global minimizer of the objective (as the objective is strongly convex, $x^*$ exists and is unique).
Historical methods. In the general case, there is no closed form solution for the optimization problem (1.1), even when the additional term $h(x)$ is set to 0 (for clarity’s sake, we assume $h = 0$ throughout the rest of Section 1.1). Practitioners then turn to numerical optimization methods to obtain approximate solutions.
One of the oldest optimization methods which can be used to solve (1.1) is gradient descent, also called batch gradient descent or steepest descent, dating back to the 19th century (Cauchy, 1847). This iterative method consists in repeatedly subtracting from the current iterate the gradient of the objective computed at this iterate, until convergence. It follows the general scheme: $x_{t+1} = x_t - \gamma_t f'(x_t),$ (1.5)
where $\gamma_t$ is a hyper-parameter called the step size. Intuitively, this corresponds to solving to optimality the optimization problem for a surrogate objective function of the following quadratic form:
$$s_t(x) = f(x_t) + f'(x_t)^\top (x - x_t) + \frac{1}{2\gamma_t} (x - x_t)^\top (x - x_t). \tag{1.6}$$
$s_t$ can be seen as the first-order Taylor expansion of the objective, regularized by a quadratic proximity term.
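The scheme (1.5) can be sketched in a few lines; the quadratic objective below is an illustrative assumption, chosen so that the minimizer is known in closed form.

```python
# Gradient descent (1.5) on a toy strongly convex quadratic; the objective
# is an invented example, not from the thesis.
def grad_descent(grad, x0, step, iters):
    x = x0
    for _ in range(iters):
        x = x - step * grad(x)  # x_{t+1} = x_t - gamma * f'(x_t)
    return x

# f(x) = (x - 3)^2 / 2, so f'(x) = x - 3 and the minimizer is x* = 3
x = grad_descent(lambda x: x - 3.0, x0=0.0, step=0.5, iters=50)
print(x)  # close to 3
```

With this constant step size the error is halved at each iteration, a concrete instance of the linear (geometric) convergence discussed next.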
Gradient descent behaves very well on (1.1), where $f$ is $\mu$-strongly convex and $L$-smooth (i.e., its gradient is $L$-Lipschitz continuous). Indeed, when using a constant step size, the convergence of the algorithm is linear (Nesterov, 2004), i.e. there exists a constant $0 < \rho < 1$ such that $f(x_t) - f(x^*) = O(\rho^t)$. As the suboptimality term is decreased by a constant fraction at every iteration, the convergence is also sometimes referred to as exponential or geometric.
The number of iterations required to reach an accuracy of an arbitrary $\epsilon > 0$ is thus $O(\log(1/\epsilon))$.¹ As the computational cost of each iteration is dominated by the gradient computation cost, $O(nd)$, the overall complexity of the algorithm is $O(nd \log(1/\epsilon))$.
Gradient descent is part of the family of first-order methods, as it only exploits information about the gradient $f'(x_t)$. This is in contrast to second-order or Newton-type methods, which also require computing the Hessian (or second derivative) of the objective.
1. $\kappa$ appears in the constant. As it is problem-dependent, we make that dependency explicit here.
Newton’s method is also an iterative procedure where the parameters are updated at each step, following a slightly different scheme: $x_{t+1} = x_t - [f''(x_t)]^{-1} f'(x_t),$ (1.7) assuming that $f$ is twice differentiable and that its Hessian matrix $f''(x_t)$ is invertible. Intuitively, this corresponds to solving to optimality the optimization problem for a different quadratic surrogate objective function:
$$S_t(x) = f(x_t) + f'(x_t)^\top (x - x_t) + \frac{1}{2} (x - x_t)^\top f''(x_t)(x - x_t). \tag{1.8}$$
$S_t$ can be seen as the second-order Taylor expansion of $f$ at $x_t$. Newton’s method converges significantly faster than gradient descent, although it can only be used on a more restricted set of problems. Indeed, when the Hessian is invertible and Lipschitz continuous, there exists for each local optimum a neighborhood such that Newton’s method converges quadratically with step size 1. This implies that there exists a constant $\rho > 0$ such that $f(x_t) - f(x^*) = O(e^{-\rho 2^t})$.
To reach $\epsilon$-accuracy, only $O(\log\log(1/\epsilon))$ iterations are needed. However, the iterations of Newton’s method are more costly than those of gradient descent: their complexity is $O(nd^2 + d^3)$. Consequently, which method is fastest is problem-dependent.
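A one-dimensional sketch of the scheme (1.7), where the Hessian inverse reduces to a scalar division; the quartic objective is an invented example chosen so the minimizer is known.

```python
# Newton's method (1.7) in one dimension: f''(x) is a scalar, so the
# Hessian inverse is a division. Toy objective, not from the thesis.
def newton(grad, hess, x0, iters):
    x = x0
    for _ in range(iters):
        x = x - grad(x) / hess(x)  # x_{t+1} = x_t - [f''(x_t)]^{-1} f'(x_t)
    return x

# f(x) = x^4/4 - x, so f'(x) = x^3 - 1, f''(x) = 3x^2; minimizer x* = 1
x = newton(lambda x: x**3 - 1.0, lambda x: 3.0 * x**2, x0=2.0, iters=8)
print(x)  # the number of correct digits roughly doubles per iteration
```

Eight iterations already reach machine precision here, illustrating the $O(\log\log(1/\epsilon))$ iteration count.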
To reduce the complexity of each iteration, a number of algorithms use an approximation of the Hessian matrix. This is the case of quasi-Newton methods, such as BFGS (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970).
These historical methods have encountered great success in the past. However, the last few years have seen significant changes in the characteristics of problem (1.1), rendering gradient descent or Newton’s method increasingly inefficient.
The big data era
Over the last few years, both the quantity and the dimensionality of data in machine learning tasks have increased dramatically. This implies that $d$, the dimensionality of the optimization problem (1.1),² as well as $n$, the number of data points, can nowadays easily grow to millions, and beyond. In such a setting, the computational cost of very fast, Newton-like optimization routines – which require inverting a $d \times d$ matrix – becomes prohibitive. This has prompted a renewal of interest in first-order methods, whose dependency on $d$ is linear. However, as $n$ grows very large, even standard gradient descent updates can become too expensive, as the cost of computing a full gradient grows linearly with the number of data points.

2. Throughout this section, we discuss methods for solving (1.1) in the specific case where $h$ is smooth. We discuss extensions to non-smooth regularizers in Section 1.2.4.

Stochastic gradient descent. In order to alleviate this issue, practitioners have turned to stochastic methods where cheap unbiased random estimators are used as an approximation to the true gradient. When the objective has a finite-sum structure, a very natural unbiased estimator is the gradient $f_i'(x)$ of a single random factor. The resulting algorithm is called stochastic gradient descent (SGD), and was introduced by Robbins and Monro (1951). Its update rule is: $x_{t+1} = x_t - \gamma_t f_{i_t}'(x_t),$ (1.9) where the $(i_t)_{t \geq 0}$ are sampled independently and uniformly at random. One key advantage of this update scheme is that the computational cost of each iteration, $O(d)$, is independent of $n$.
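A minimal sketch of the update (1.9) on an invented finite sum of one-dimensional quadratics, whose full objective is minimized at the mean of the $c_i$.

```python
import random

# SGD (1.9) on a finite sum of quadratics f_i(x) = (x - c_i)^2 / 2,
# so f(x) is minimized at the mean of the c_i. Toy data, constant step.
def sgd(cs, x0, step, iters, seed=0):
    rng = random.Random(seed)
    x = x0
    for _ in range(iters):
        i = rng.randrange(len(cs))  # uniformly sampled random factor
        x = x - step * (x - cs[i])  # x_{t+1} = x_t - gamma * f_i'(x_t)
    return x

cs = [1.0, 2.0, 3.0, 6.0]
x = sgd(cs, x0=0.0, step=0.05, iters=2000)
print(x)  # hovers near the minimizer 3.0, within a noise ball
```

With a constant step size the iterates approach the minimizer but keep fluctuating around it, which is precisely the variance issue discussed next.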
SGD has been very successful in practice and has been proven to converge in expectation in a variety of settings, including the one we focus on (see e.g. Schmidt (2014)). However, the switch to an approximate estimator comes at a cost: using random estimates introduces non-diminishing variance in the algorithm.
The cost of randomness. One consequence of this variance is that the algorithm is not stable, i.e. the iterates do not stay at the global minimizer $x^*$ once they reach it. A simple way of explaining this phenomenon is to remark that although the individual $f_i'(x^*)$ sum to 0 (since $f'(x^*) = 0$), they are not equal to 0 themselves in the general case. Another consequence is that in the strongly convex setting, SGD with a constant step size does not converge linearly to the optimum as gradient descent does. Instead, it converges linearly up to a ball around the optimum, because the variance of the estimator does not vanish when nearing $x^*$.
In order for SGD to converge all the way to the optimum, the variance of the gradient estimator needs to be handled. The traditional approach has been to use vanishing step sizes. As the step size multiplies the update, the variance of the update itself goes to 0 as the number of iterations increases. Although this does allow SGD to converge to the optimum, unfortunately using a vanishing step size also implies losing linear convergence. Typical decreasing step size schedules result in a $O(1/t)$ convergence rate.
In the last few years, novel methods to reduce the variance of the gradient estimator – that do not require using a vanishing step size – have been introduced. The resulting algorithms enjoy linear convergence on finite-sum strongly convex objectives with a constant step size. The main idea behind these methods is to use a smarter gradient estimator than that of SGD, often by adding a correction term, such that its variance vanishes at the optimum $x^*$. They are called variance reduction algorithms. This family includes examples such as SAG (Le Roux et al., 2012), SDCA (Shalev-Shwartz and Zhang, 2013), SAGA (Defazio et al., 2014a) and SVRG (Johnson and Zhang, 2013), among others.
General principle. Suppose we have an estimator $U$ with high variance, and that we have access to another random variable $V$, which is positively correlated with $U$ and whose expectation can be computed efficiently. We can then introduce a new estimator $W := \alpha(U - V) + \mathbb{E}V$. We have that $\mathbb{E}W = \alpha\,\mathbb{E}U + (1 - \alpha)\,\mathbb{E}V$, and that $\mathrm{var}(W) = \alpha^2[\mathrm{var}(U) + \mathrm{var}(V) - 2\,\mathrm{cov}(U, V)]$. Provided that the covariance term is big enough, the variance of the new estimator is reduced compared to the variance of the initial one. This mechanism falls into the category of variance reduction.
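A quick empirical check of this identity with $\alpha = 1$, using invented Gaussian variables whose shared component makes $U$ and $V$ strongly correlated:

```python
import random

# Empirical check of the control-variate construction W = alpha*(U - V) + E[V]:
# with U and V positively correlated, var(W) drops well below var(U).
# The distributions here are invented for illustration.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

rng = random.Random(0)
alpha = 1.0
samples_u, samples_w = [], []
for _ in range(20000):
    base = rng.gauss(0.0, 1.0)
    u = base + rng.gauss(0.0, 0.3)  # noisy estimator, E[U] = 0
    v = base                        # correlated variable with known E[V] = 0
    samples_u.append(u)
    samples_w.append(alpha * (u - v) + 0.0)  # + E[V], which is 0 here
print(variance(samples_u), variance(samples_w))
```

Since $U$ and $V$ share the `base` component, their covariance cancels most of the variance, leaving only the small residual noise.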
Application to incremental gradient methods. The estimator with high variance is $f_i'(x_t)$ in this case. The usual random variable with positive correlation is the same individual gradient, taken at an earlier iterate: $f_i'(x_s)$, where the definition of $s < t$ depends on the specific algorithm. We now illustrate this principle by detailing two algorithms, SAGA (Defazio et al., 2014a) and SVRG (Johnson and Zhang, 2013).
Borrowing our notation from Hofmann et al. (2015), we first present SAGA. SAGA maintains two moving quantities to optimize (1.1): the current iterate $x$ and a table (memory) of historical gradients $(\alpha_i)_{i=1}^n$. At every iteration, the SAGA algorithm samples uniformly at random an index $i \in \{1, \ldots, n\}$, and then executes the following update on $x$ and $\alpha$ (for the unconstrained optimization version): $x^+ = x - \gamma\big(f_i'(x) - \alpha_i + \bar{\alpha}\big), \qquad \alpha_i^+ = f_i'(x),$ (1.10) where $\gamma$ is the step size and $\bar{\alpha} := \frac{1}{n}\sum_{i=1}^n \alpha_i$ can be updated efficiently in an online fashion. Let us define $\mathbb{E}$ as the conditional expectation of a random $i$ given all the past. By construction, $\mathbb{E}\alpha_i = \bar{\alpha}$ and thus the update direction is unbiased ($\mathbb{E}x^+ = x - \gamma f'(x)$). It can be proven (see Defazio et al. (2014a)) that under a reasonable condition on $\gamma$, the update has vanishing variance, which enables the algorithm to converge linearly with a constant step size.
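The update (1.10) can be sketched as follows on a toy quadratic sum ($f_i(x) = (x - c_i)^2/2$, an invented example); note the online update of the gradient average.

```python
import random

# SAGA sketch (1.10) on f_i(x) = (x - c_i)^2 / 2, so f_i'(x) = x - c_i.
# The memory alpha[i] stores the last gradient seen for factor i.
def saga(cs, x0, step, iters, seed=0):
    rng = random.Random(seed)
    n = len(cs)
    x = x0
    alpha = [0.0] * n                    # table of historical gradients
    alpha_bar = 0.0                      # running average of the table
    for _ in range(iters):
        i = rng.randrange(n)
        g = x - cs[i]                    # fresh gradient f_i'(x)
        x = x - step * (g - alpha[i] + alpha_bar)
        alpha_bar += (g - alpha[i]) / n  # online update of the average
        alpha[i] = g
    return x

cs = [1.0, 2.0, 3.0, 6.0]
x = saga(cs, x0=0.0, step=0.2, iters=500)
print(x)  # converges to the minimizer 3.0 with a constant step size
```

At the optimum, `g - alpha[i] + alpha_bar` vanishes once the memory holds the gradients at $x^*$, so unlike SGD the iterates stay put.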
The standard SVRG algorithm (Johnson and Zhang, 2013) is very similar to SAGA. The main difference is that instead of maintaining a table of historical gradients, SVRG uses a “reference” batch gradient $f'(\tilde{x})$, updated at regular intervals (typically every $m$ iterations, where $m$ is a hyper-parameter). SVRG is thus an epoch-based algorithm. At the beginning of every epoch, a reference iterate $\tilde{x}$ is chosen and its gradient is computed. Then, at every iteration in the epoch, the algorithm samples uniformly at random an index $i \in \{1, \ldots, n\}$, and then executes the following update on $x$: $x^+ = x - \gamma\big(f_i'(x) - f_i'(\tilde{x}) + f'(\tilde{x})\big).$ (1.11)
As for SAGA, the update direction is unbiased ($\mathbb{E}x^+ = x - \gamma f'(x)$), and it can be proven (see Johnson and Zhang (2013)) that under a reasonable condition on $\gamma$ and $m$ (the epoch size), the update has vanishing variance, again enabling the algorithm to converge linearly with a constant step size. Compared to SAGA, SVRG trades memory for computation time: instead of storing the past gradients, the algorithm recomputes them at a fixed reference point $\tilde{x}$. However, the convergence rates of both methods are the same, as is their overall computational complexity: $O((n + \kappa) d \log(1/\epsilon))$.
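A matching sketch of the SVRG update (1.11) on an invented quadratic sum; for these identical-curvature factors the correction term makes each inner update coincide with a full gradient step, a degenerate but convenient illustration of vanishing variance.

```python
import random

# SVRG sketch (1.11) on the toy sum f_i(x) = (x - c_i)^2 / 2.
# Every epoch recomputes a full "reference" gradient at x_tilde.
def svrg(cs, x0, step, epochs, m, seed=0):
    rng = random.Random(seed)
    n = len(cs)
    x = x0
    for _ in range(epochs):
        x_tilde = x
        full_grad = sum(x_tilde - c for c in cs) / n  # f'(x_tilde)
        for _ in range(m):                            # epoch of m inner steps
            i = rng.randrange(n)
            # x+ = x - gamma * (f_i'(x) - f_i'(x_tilde) + f'(x_tilde))
            x = x - step * ((x - cs[i]) - (x_tilde - cs[i]) + full_grad)
    return x

cs = [1.0, 2.0, 3.0, 6.0]
x = svrg(cs, x0=0.0, step=0.2, epochs=30, m=10)
print(x)  # converges to the minimizer 3.0 with a constant step size
```

Nothing is stored per factor: the memory cost is $O(d)$, at the price of one full gradient computation per epoch.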
Related methods. SAGA and SVRG are only two examples of variance-reduced incremental optimization methods. This family contains numerous other instances, such as its precursor SAG (Le Roux et al., 2012), the stochastic dual coordinate ascent method (Shalev-Shwartz and Zhang, 2013, SDCA), MISO (Mairal, 2015) and FINITO (Defazio et al., 2014b), among many others. All these methods have similar convergence rates and iterations. Their differences lie in their interpretation and proof technique. For a more detailed comparison, we recommend Defazio et al. (2014a).
All the methods we have presented up until this point share a key characteristic: they are all inherently sequential procedures, where the next iterate is defined as a simple function of the current one. This implies that these algorithms typically only use a single CPU core. Unfortunately, while the size of datasets continues to grow very fast, the clock speed of CPU cores stopped its dramatic increase a few years ago. Performance gains now rely on efficient parallelization over multiple cores which have access to shared memory resources. Thus, to take advantage of the multi-core architecture of modern computers, the aforementioned optimization algorithms need to be adapted to the parallel setting, where multiple threads work concurrently.

Synchronous approaches. The easiest way to take advantage of parallel resources is through synchronous approaches, where all cores get assigned a task, and wait for the last of them to finish before starting a new one. This suits one specific type of problem very well: embarrassingly parallel problems, which decompose naturally into parallel subtasks. One relevant example is the computation of a full gradient of an objective with a finite-sum structure: each core can be assigned a subset of data points and compute the associated gradients; a simple average of all sub-results then yields the desired quantity.
In the context of stochastic gradient descent, the workload at each iteration cannot be easily split. Nevertheless, one way to leverage synchronous parallelism is to compute the gradient of a random mini-batch of data points, thus decreasing the variance of the estimator.
However, despite these possible improvements, synchronous approaches are not well adapted to the incremental optimization setup. The main reason is that, since cores have to wait for each other to finish a task before they can start a new one, the overall process is only as fast as the slowest core (assuming the workload is evenly spread out). In potentially heterogeneous architectures, where cores do not have the same clock speed and random latency issues are common, this implies that a lot of computing power is wasted because cores are idle a large fraction of the time. In the worst case where one core gets stuck, the whole process simply stops; even under relatively mild assumptions, the slowdown can be significant. For instance, Hannah and Yin (2017) show that under reasonable assumptions the penalty for choosing a synchronous approach over a method without synchronization grows at least as $\log(p)$, where $p$ is the amount of computing resources. This is obviously suboptimal.
Table of contents :
I Asynchronous Optimization for Machine Learning
1 Modern optimization for machine learning
1.2 Modern challenges
1.2.1 The big data era
1.2.2 Variance reduction
1.2.3 Asynchronous optimization
1.2.4 Non-smooth regularization
1.3 Goal and contributions
2 Improved asynchronous parallel optimization analysis for stochastic methods
2.2 Revisiting the perturbed iterate framework for asynchronous analysis
2.2.1 Perturbed Iterate Framework
2.2.2 On the difficulty of labeling the iterates
2.3 HOGWILD analysis
2.3.1 Useful properties and assumptions
2.3.2 Convergence and speedup results
2.3.3 Proof outlines
2.3.4 Key Lemmas
3 Asynchronous parallel variance reduction
3.2 Asynchronous Parallel Sparse SAGA
3.2.1 Sparse SAGA
3.2.2 Asynchronous Parallel Sparse SAGA
3.2.3 Convergence and speedup results
3.2.4 Key Lemmas
3.3 Asynchronous Parallel SVRG with the “After Read” labeling
3.3.1 SVRG algorithms
3.3.2 Extension to the SVRG variant from Hofmann et al. (2015)
3.3.3 Fast convergence and speedup rates for KROMAGNON
3.3.4 Proof outline
3.4 Empirical results
3.4.1 Experimental setup
3.4.2 Implementation details
3.4.3 Comparison of sequential algorithms: Sparse SAGA vs Lagged updates
3.4.4 ASAGA vs. KROMAGNON vs. HOGWILD
3.4.5 Effect of sparsity
3.4.6 Theoretical speedups
3.4.7 A closer look at the constant
3.5 Conclusion and future work
4 Asynchronous composite optimization
4.1.1 Related work
4.1.2 Definitions and notations
4.2 Sparse Proximal SAGA
4.3 Asynchronous Sparse Proximal SAGA
4.3.1 Analysis framework
4.3.2 Properties and assumptions
4.3.3 Theoretical results
4.3.4 Proof outline
4.3.5 Comparison to related work
4.4.1 Implementation details
4.4.2 Comparison of Sparse Proximal SAGA with sequential methods
4.4.3 Comparison of PROXASAGA with asynchronous methods
4.4.4 Theoretical speedups
4.4.5 Timing benchmarks
4.5 Conclusion and future work
II Improving RNNs Training through Global-Local Losses
6 A brief introduction to recurrent neural networks
6.1 What is a recurrent neural network (RNN)?
6.1.1 A concrete example
6.1.2 Interpretation as a graphical model
6.1.3 The encoder-decoder architecture
6.2 Traditional RNN training and its limitations
6.2.1 Maximum likelihood training (MLE)
6.3 Alternative training approaches
6.3.1 Imitation learning approaches
6.3.2 RL-inspired approaches
6.3.3 Other methods
6.4 Goal and contributions
7.2 Links between RNNs and learning to search
7.3 Improving RNN training with L2S
7.3.1 The SEARNN Algorithm
7.3.2 Adaptation to RNNs
7.3.3 Expected and empirical benefits
7.4 Scaling up SEARNN
7.5 Neural Machine Translation
7.5.1 Experimental results
7.5.2 In-depth analysis
7.6 Model confidence and beam search
7.7 Scaling SEARNN up further
7.8 Discussion and related work
7.8.1 Traditional L2S approaches
7.8.2 L2S-inspired approaches
7.8.3 RL-inspired approaches
7.8.4 Other methods
7.8.5 An unexpected cousin: AlphaGo Zero
8 Conclusion and future work
A HOGWILD analysis using the “After read” framework
A.1 Initial recursive inequality derivation
A.2 Proof of Lemma 11 (inequality in gt := g(^xt; it))
A.3 Proof of Lemma 14 (suboptimality bound on Ekgtk2)
A.4 Proof of Theorem 9 (convergence guarantee and rate of HOGWILD)
A.5 Proof of Theorem 8 (convergence result for serial SGD)
A.6 Proof of Corollary 10 (speedup regimes for HOGWILD)
B Asynchronous variance reduction
B.1 Sparse SAGA
B.1.1 Proof of Theorem 15
B.1.2 Proof of Theorem 16
B.2 ASAGA – Proof of Theorem 18 and Corollary
B.2.1 Proof of Lemma 20 (suboptimality bound on Ekgtk2)
B.2.2 Lemma 20 for AHSVRG
B.2.3 Master inequality derivation
B.2.4 Lyapunov function and associated recursive inequality
B.2.5 Proof of Lemma 21 (convergence condition for ASAGA)
B.2.6 Proof of Theorem 18 (convergence guarantee and rate of ASAGA)
B.2.7 Proof of Corollary 19 (speedup regimes for ASAGA)
B.3 KROMAGNON – Proof of Theorem 22 and Corollary 25
B.3.1 Proof of Lemma 26 (suboptimality bound on Ekgtk2)
B.3.2 Proof of Theorem 22 (convergence rate of KROMAGNON)
B.3.3 Proof of Corollary 23, 24 and 25 (speedup regimes)
B.4 On the difficulty of parallel lagged updates
B.5 Additional empirical details
B.5.1 Detailed description of datasets
B.5.2 Implementation details
B.5.3 Biased update in the implementation
C Extension to non-smooth objectives
C.1 Basic properties
C.2 Sparse Proximal SAGA
C.4 Comparison of bounds with Liu and Wright (2015)
D.2 Design decisions
D.3 Additional machine translation experimental details