Hyper-parameter Optimization of Neural Networks 

Grid search

The most basic method is usually called grid search. It is very easy to implement: it simply tests every possible combination of values (typically obtained by uniform sampling for the continuous hyper-parameters). With only a handful of hyper-parameters to optimize and a function f that is fast to evaluate, it can be advisable to use it as a first step to get sensible boundaries on each hyper-parameter, i.e. by testing a handful of values over a very wide range to find a smaller but relevant range. But in the context of deep learning, hyper-parameters are too numerous, meaning there are too many combinations to evaluate, and each evaluation is costly. Even a restrained choice of hyper-parameters still leaves us with at least 5 or 6, and it is easy to end up with over 30. Assuming an average of 4 values per hyper-parameter, this implies 4^5 = 1024 combinations for a space of 5 hyper-parameters, or 4^30 ≈ 10^18 combinations with 30 hyper-parameters. Grid search does not scale well, which makes it unsuitable for deep learning. Even for a reasonably sized hyper-parameter space, a typical implementation in nested for-loops goes from one corner of the hyper-parameter space to the opposite corner in a fixed order. It is unlikely that the corner the search starts from happens to be an area filled with good models, since combinations in corners are extreme values of the hyper-parameters and build atypical neural networks. Grid search can still be used with a wide enough sampling of the hyper-parameters to get an idea of where the interesting models are located and to refine the boundaries, but even for that purpose other methods are preferable.
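
To make the nested-loop picture concrete, here is a minimal sketch of grid search in Python. The search space and the `train_and_evaluate` function (with its toy score) are hypothetical placeholders standing in for a real training run of the network.

```python
from itertools import product

# Hypothetical search space, for illustration only.
grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "batch_size": [16, 32, 64, 128],
    "dropout": [0.0, 0.25, 0.5],
}

def train_and_evaluate(**hp):
    # Hypothetical stand-in: in practice this trains the network with the
    # given hyper-parameters and returns its validation score.
    return -abs(hp["learning_rate"] - 1e-2) - abs(hp["dropout"] - 0.25)

best_score, best_params = float("-inf"), None
for values in product(*grid.values()):      # every combination, in a fixed order
    params = dict(zip(grid.keys(), values))
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

Even this small space already requires 4 x 4 x 3 = 48 full training runs, which illustrates how quickly the number of combinations grows with the number of hyper-parameters.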

Random search

One step above grid search is random search. A big limitation of grid search is that the order in which it goes through the hyper-parameter space is very dependent on the implementation, and it always selects the same limited set of values for each hyper-parameter. Random search instead draws the value of each hyper-parameter from a uniform distribution, allowing for a much wider range of explored values. Figure 2.1 illustrates this. Given a hyper-parameter space of two continuous hyper-parameters, and for an equal number of evaluated combinations, random search finds a better solution. It works because hyper-parameters are not equally relevant: in practice, for deep learning, only a few have a high impact on the performance (Bergstra and Bengio (2012)). In the figure, grid search only tested three values of each hyper-parameter while random search tested nine for the same cost! In terms of implementation, random search requires only the ability to draw uniformly from an interval or a list, giving it the same computational cost as grid search. The slight increase in implementation cost is largely compensated by the gain in performance.
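
A correspondingly minimal sketch of random search is shown below. The search space and the `train_and_evaluate` placeholder are again hypothetical stand-ins; the learning rate is drawn log-uniformly since it spans several orders of magnitude.

```python
import random

def train_and_evaluate(**hp):
    # Same hypothetical placeholder as in the grid-search sketch.
    return -abs(hp["learning_rate"] - 1e-2) - abs(hp["dropout"] - 0.25)

def sample_hyper_parameters(rng):
    # Each hyper-parameter is drawn independently, so every trial explores
    # a new value instead of reusing the same few grid points.
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),   # log-uniform in [1e-4, 1e-1]
        "batch_size": rng.choice([16, 32, 64, 128]),
        "dropout": rng.uniform(0.0, 0.5),
    }

rng = random.Random(0)
budget = 48                                  # same number of evaluations as the grid above
best_score, best_params = float("-inf"), None
for _ in range(budget):
    params = sample_hyper_parameters(rng)
    score = train_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

For the same budget of 48 evaluations, each continuous hyper-parameter is now tested at 48 distinct values rather than 3 or 4, which is exactly the advantage illustrated in Figure 2.1.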

Reinforcement learning

Recent advances in reinforcement learning have made it possible to use it to design efficient neural networks. The idea is to train an agent, called the controller, which builds neural networks for a specific task. Baker et al. (2017) developed a controller that chooses each layer of the network sequentially and is trained using Q-learning. B. Zoph and Le (2017) created a string representation of neural networks, and their controller is an RNN outputting valid strings. In both cases the created networks must be fully trained to be evaluated, and the controller takes thousands of iterations to converge, making those approaches extremely costly. To address this problem, Barret Zoph et al. (2017) proposed testing on a smaller dataset as an approximation of the performance on the true dataset. Another suggestion is to only partially train each network (Li et al. (2017), Zela et al. (2018)), allowing longer training times as the search is refined.
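
As a rough illustration of the controller idea (not a reproduction of any of the cited methods), the sketch below uses a REINFORCE-style policy over a few discrete per-layer choices. The search space and `evaluate_architecture` are hypothetical; the latter is a toy proxy reward standing in for the costly step of training the candidate network and measuring its validation accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discrete search space: the number of filters of each of 3 layers.
choices = [16, 32, 64, 128]
n_layers, n_choices = 3, len(choices)
logits = np.zeros((n_layers, n_choices))      # controller parameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def evaluate_architecture(arch):
    # Hypothetical proxy reward; in practice this is the validation accuracy
    # of the (fully or partially) trained candidate network.
    return 1.0 / (1.0 + abs(sum(arch) - 160))

baseline, lr = 0.0, 0.1
for step in range(200):
    probs = [softmax(l) for l in logits]      # one categorical policy per layer
    actions = [rng.choice(n_choices, p=p) for p in probs]
    arch = [choices[a] for a in actions]      # the sampled architecture
    reward = evaluate_architecture(arch)
    baseline = 0.9 * baseline + 0.1 * reward  # moving-average baseline
    for i, (a, p) in enumerate(zip(actions, probs)):
        grad = -p
        grad[a] += 1.0                        # d log pi(a_i) / d logits_i
        logits[i] += lr * (reward - baseline) * grad  # policy-gradient update
```

Each iteration corresponds to building and evaluating one network, which is why the thousands of iterations needed for the controller to converge make these approaches so expensive.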

Table of contents:

1 Introduction 
1.1 Context
1.1.1 Medical Imaging
1.1.2 Deep Learning
1.2 Contributions and Outline
2 Hyper-parameter Optimization of Neural Networks 
2.1 Defining the Problem
2.1.1 Notations
2.1.2 Black-box optimization
2.1.3 Evolutionary algorithms
2.1.4 Reinforcement learning
2.1.5 Other approaches
2.1.6 Synthesis
2.1.7 Conclusion
2.2 Bayesian Optimization
2.2.1 Gaussian processes
2.2.2 Acquisition functions
2.2.3 Bayesian optimization algorithm
2.3 Incremental Cholesky decomposition
2.3.1 Motivation
2.3.2 The incremental formulas
2.3.3 Complexity improvement
2.4 Comparing Random Search and Bayesian Optimization
2.4.1 Random search efficiency
2.4.2 Bayesian optimization efficiency
2.4.3 Experiments on CIFAR-10
2.4.4 Conclusion
2.5 Combining Bayesian Optimization and Hyperband
2.5.1 Hyperband
2.5.2 Combining the methods
2.5.3 Experiments and results
2.5.4 Discussion
2.6 Application: Classification of MRI Field-of-View
2.6.1 Dataset and problem description
2.6.2 Baseline results
2.6.3 Hyper-parameter optimization
2.6.4 From probabilities to a decision
2.6.5 Conclusion
3 Transfer Learning 
3.1 The many different faces of transfer learning
3.1.1 Inductive Transfer Learning
3.1.2 Transductive Transfer Learning
3.1.3 Unsupervised Transfer Learning
3.1.4 Transfer Learning and Deep Learning
3.2 Kidney Segmentation in 3D Ultrasound
3.2.1 Introduction
3.2.2 Related Work
3.2.3 Dataset
3.3 Baseline
3.3.1 3D U-Net
3.3.2 Training the baseline
3.3.3 Fine-tuning
3.4 Transformation Layers
3.4.1 Geometric Transformation Layer
3.4.2 Intensity Layer
3.5 Results and Discussion
3.5.1 Comparing transfer methods
3.5.2 Examining common failures
3.6 Conclusion
4 Template Deformation and Deep Learning
4.1 Introduction
4.2 Template Deformation
4.2.1 Constructing the template
4.2.2 Finding the transformation
4.3 Segmentation by Implicit Template Deformation
4.4 Template Deformation via Deep Learning
4.4.1 Global transformation
4.4.2 Local deformation
4.4.3 Training and loss function
4.5 Kidney Segmentation in 3D Ultrasound
4.6 Results and Discussion
4.7 Conclusion
5 Conclusion
5.1 Summary of the contributions
5.2 Future Work
A Incremental Cholesky Decomposition Proofs 
A.1 Formula for the Cholesky decomposition
A.2 Formula for the inverse Cholesky decomposition
Publications
Bibliography
