
## Structured output learning

As mentioned at the beginning of this chapter, structured prediction is a general learning problem of estimating the mapping between a high-dimensional input x ∈ X and output y ∈ Y. In many cases, we are only interested in the structure (i.e., the conditional independences) within the output y rather than the structure within the input x, since the latter is often hard to interpret from a human perspective. Thus, for structured prediction, we estimate the parameters of the conditional distribution p(y|x) rather than the joint distribution p(x, y). The associated PGM is known as a conditional random field (CRF) (Lafferty et al. 2001). It should be noted that the CRF is not the only model for structured prediction. Since our focus is on graphical models, we assume the structured output space decomposes according to the graph given by the CRF, that is, Y := Y1 × · · · × Yn. The potential functions of p(y|x) are usually parameterized in log-linear form, i.e., as exponentials of linear functions of the model parameters.
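To make the decomposition Y := Y1 × · · · × Yn concrete, the following sketch computes p(y|x) for a toy log-linear chain CRF by brute-force enumeration. The unary scores `u` and pairwise scores `v` are hypothetical stand-ins for the potential functions (in practice `u` would depend on the input x), and enumeration is only feasible for tiny chains:

```python
import numpy as np
from itertools import product

# Toy log-linear chain CRF: p(y|x) ∝ exp(Σ_i u[i, y_i] + Σ_i v[y_i, y_{i+1}]).
# u, v are illustrative scores; in practice they are functions of x and w.
rng = np.random.default_rng(0)
n, k = 4, 3                      # chain length, labels per node
u = rng.normal(size=(n, k))      # unary scores
v = rng.normal(size=(k, k))      # pairwise scores between neighbours

def score(y):
    return sum(u[i, y[i]] for i in range(n)) + \
           sum(v[y[i], y[i + 1]] for i in range(n - 1))

# Brute-force normalization over Y = Y_1 x ... x Y_n (k^n assignments)
ys = list(product(range(k), repeat=n))
logits = np.array([score(y) for y in ys])
p = np.exp(logits - logits.max())
p /= p.sum()
print(p.sum())               # 1.0 up to floating point
print(ys[int(p.argmax())])   # MAP assignment by enumeration
```

Marginal and MAP inference, discussed later in the chapter, replace this exponential-cost enumeration with message passing over the graph.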

### Derivations of dual, and relaxed primal and dual objectives

In this section, we derive the dual objective D(μ) of P(w). Given the augmented Lagrangian D(μ, η), we first introduce a relaxed primal P̃(w, λ, η) involving a new primal variable λ whose components can be interpreted as messages exchanged between cliques, in the context of marginal inference via message-passing algorithms. The partial minimization with respect to λ then yields the corresponding primal of D(μ, η) with respect to μ for a fixed η: P(w, η) := min_λ P̃(w, λ, η), which can be interpreted as a Moreau-Yosida smoothing of the original objective P(w).
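For reference, the Moreau-Yosida smoothing mentioned above has a standard closed form. The following is a minimal statement under the usual Euclidean proximity term; the specific norm and parameterization used in this chapter's derivation may differ:

```latex
% Moreau-Yosida envelope of a convex function f with parameter \eta > 0:
M_\eta f(x) \;=\; \min_{z}\Big\{\, f(z) + \tfrac{1}{2\eta}\,\|z - x\|_2^2 \,\Big\}.
% M_\eta f is (1/\eta)-smooth, and it has the same set of minimizers as f.
```

Smoothing the objective in this way is what makes fast (linearly convergent) first-order methods applicable while preserving the minimizers of the original problem.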

**Deep Network Architectures**

A neural network is a special composite function of the input signal, in the sense that its constituent functions are connected in a way resembling a network of biological neurons. A function in the neural network is sometimes referred to as a (hidden) layer; the input and the output are the first and the last layers, respectively.

The computation from the input to the output is called the forward pass. Accordingly, the backward pass occurs when we compute the derivatives with respect to the layers as well as the parameters associated with the functions; this is also known as backpropagation, an application of the chain rule. The first neural network, called threshold logic, was proposed by McCulloch and Pitts (1943). Around the time backpropagation first gained recognition (Werbos 1974, Rumelhart et al. 1986), the multilayer perceptron (MLP) was one of the first neural networks considered in supervised learning (Cybenko 1989a); it alternates linear functions with non-linear functions such as the sigmoid, tanh, rectified linear unit (ReLU) and so on. For specific purposes, different types of architecture have been proposed. The most common type is the feedforward neural network, such as the aforementioned MLP. On the other hand, if some function in the network takes its own previous outputs as inputs, the network is called a recurrent neural network (Rumelhart et al. 1986), which has an advantage in modeling arbitrarily long sequences.
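The forward and backward passes described above can be sketched for a two-layer MLP with ReLU activations. The layer sizes and random weights below are illustrative only; the backward pass applies the chain rule to a squared-norm loss:

```python
import numpy as np

# Minimal two-layer perceptron: alternating linear maps and non-linearities.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # input dim 3 -> hidden dim 5
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # hidden dim 5 -> output dim 2

relu = lambda z: np.maximum(z, 0.0)             # rectified linear unit

def forward(x):
    h = relu(W1 @ x + b1)    # first (hidden) layer
    return W2 @ h + b2       # last (output) layer

x = rng.normal(size=3)
y = forward(x)

# Backward pass (chain rule) for the loss L = 0.5 * ||forward(x)||^2
h_pre = W1 @ x + b1
h = relu(h_pre)
g_y = W2 @ h + b2                # dL/dy = y for this loss
g_h = (W2.T @ g_y) * (h_pre > 0) # chain rule through the ReLU
g_W2 = np.outer(g_y, h)          # dL/dW2
g_W1 = np.outer(g_h, x)          # dL/dW1
```

Each gradient has the same shape as the parameter it differentiates, which is exactly what a gradient-descent update requires.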

#### Memory and Energy Issues with Over-Parameterized DNNs

From Figure 1 and Table 1, we can see that all commonly used deep neural networks contain millions of parameters and require billions of FLOPs (floating-point operations) for a single forward pass. This situation has been mitigated in recent architectures by replacing fully connected layers with convolutional layers and by using convolutions with small kernels. In general, fully connected layers introduce more parameters but require fewer FLOPs than convolutional layers. However, no neural network design can avoid over-parameterization completely, even with a good balance between the number of parameters and the FLOPs. Inevitably, over-parameterization leads to inefficiency in memory and energy; moreover, the resulting model is hard to interpret. Embedded systems, Internet-of-Things devices and mobile phones are examples of applications where computers have limited processors, memory and batteries. The goal of such resource-limited applications is quite different from that of achieving human-level performance. Deploying deep neural networks on embedded systems (Venieris et al. 2018) has become a recent trend towards intelligent mobile computing. However, it is infeasible to design an architecture for each embedded system from scratch. One valid solution is to adaptively compress a universal model: according to the memory and computation requirements of a given application, we reduce the storage of the network parameters or eliminate less important parameters. This is generally called model compression in deep learning.
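One simple instance of eliminating less important parameters is magnitude pruning: zero out the smallest-magnitude weights so the matrix can be stored sparsely. This is an illustrative heuristic, not the specific compression method surveyed here; the 90% sparsity target is an arbitrary example:

```python
import numpy as np

def prune(w, sparsity):
    """Return a copy of w with the smallest `sparsity` fraction of entries zeroed."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value serves as the pruning threshold
    thresh = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    out = w.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))        # a dense weight matrix
w_pruned = prune(w, 0.9)               # keep only the largest 10% of weights
print(np.count_nonzero(w_pruned) / w.size)  # fraction of weights kept, ≈ 0.1
```

A sparse storage format (indices plus surviving values) then trades a small indexing overhead for a roughly tenfold reduction in parameter storage at this sparsity level.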

**Table of contents:**

Acknowledgments

Abstract

Résumé

Contents

Introduction

**1 Background: Convex Optimization and Information Theory**

1.1 Convex Optimization

1.2 Information Theory

1.3 Rate-distortion theory

**2 Introduction to Probabilistic Graphical Models**

2.1 Introduction

2.2 Models

2.3 Inference

2.3.1 Marginal inference

2.3.2 MAP inference

2.4 Learning

2.4.1 Maximum likelihood estimation of exponential family

2.4.2 Structured output learning

2.5 Conclusion

**3 SDCA-Powered Inexact Dual Augmented Lagrangian Method for Fast CRF Learning**

3.1 Introduction

3.2 Related Work

3.3 CRF Learning

3.3.1 CRF as exponential family

3.4 Relaxed Formulations

3.4.1 Classical local polytope relaxation

3.4.2 A dual augmented Lagrangian

3.4.3 Gini entropy surrogate

3.5 Algorithm

3.6 Convergence Analysis

3.6.1 Conditions for global linear convergence

3.6.2 Convergence results with SDCA

3.6.3 Discussion

3.7 Experiments

3.7.1 Setup

3.7.2 Results

3.8 Conclusion

Appendices

3.A Loss-Augmented CRF

3.B Derivations of dual, and relaxed primal and dual objectives

3.B.1 Derivation of the dual objective D(μ)

3.B.2 Derivation of an extended primal P̃(w, λ, η)

3.B.3 Interpretation as Moreau-Yosida smoothing

3.B.4 Duality gaps and representer theorem

3.B.5 Comparison with State-of-the-Art Structured Learning Methods

3.C Gini Oriented Tree-Reweighted Entropy

3.D Proof of Theorem 3.6.1 and associated Corollaries

3.D.1 Smoothness of d(·)

3.D.2 Associated lemmas for Theorem 3.6.1

3.D.3 Proof of Theorem 3.6.1

3.D.4 Proofs of Corollary 3.6.2 and Corollary 3.6.3

3.D.5 Proofs of Corollaries 3.6.4 and 3.6.5

3.E Convergence results with SDCA

3.E.1 Proof of Propositions 3.6.7 and 3.6.8

3.F Notation summary

**4 A Survey on Over-Parameterization in Deep Learning: Compression and Generalization**

4.1 Introduction

4.2 Deep Network Architectures

4.3 Memory and Energy Issues with Over-Parameterized DNNs

4.4 Model Compression Approaches

4.5 Towards Understanding Generalization via Compression

4.5.1 The generalization puzzle

4.5.2 Sharpness: the bridge between compressibility and generalization

4.5.3 MDL: a lossless-compression-induced supervised learning framework

4.5.4 Information bottleneck: a lossy-compression-induced supervised learning framework

4.6 Transferring Knowledge from Over-Parameterized Models

4.7 Conclusion

**5 β-BNN: A Rate-Distortion Perspective on Bayesian Neural Networks**

5.1 Supervised learning via lossy compression

5.2 Approximate Blahut-Arimoto Algorithm

5.3 Experiments

5.4 Discussion

**6 Empirical Bayes Transductive Meta-Learning with Synthetic Gradients**

6.1 Introduction

6.2 Meta-learning with transductive inference

6.2.1 Empirical Bayes model

6.2.2 Amortized inference with transduction

6.3 Unrolling exact inference with synthetic gradients

6.4 Generalization analysis of empirical Bayes via the connection to information bottleneck

6.5 Experiments

6.5.1 Few-shot classification

6.5.2 Zero-shot regression: spinning lines

6.5.3 Zero-shot classification: unsupervised multi-source domain adaptation

6.6 Conclusion

Appendices

6.A Proofs

6.B Importance of synthetic gradients

6.C Varying the size of the query set

**7 Variational Information Distillation for Knowledge Transfer**

7.1 Introduction

7.2 Variational information distillation (VID)

7.2.1 Algorithm formulation

7.2.2 Connections to existing works

7.3 Experiments

7.3.1 Knowledge distillation

7.3.2 Transfer learning

7.3.3 Knowledge transfer from CNN to MLP

7.4 Conclusion

**8 Exploring Weight Symmetry in Deep Neural Networks**

8.1 Introduction

8.2 Symmetric reparameterizations

8.2.1 Motivation

8.2.2 Soft constraints

8.2.3 Hard constraints

8.2.4 Combining with other methods

8.3 Implementations of block symmetry

8.3.1 Imposing symmetry in convolutional neural networks

8.3.2 Imposing symmetry in recurrent neural networks

8.4 Experiments

8.4.1 CIFAR experiments

8.4.2 ImageNet experiments

8.4.3 Language modeling

8.5 Conclusion

Conclusion

**Bibliography**