A Hardware Platform for Real-time Image Understanding 


Dataflow Computing

The third contribution of this thesis is a custom dataflow computer architecture optimized for the computation of convolutional networks (such as the model presented in Chapter 2). Dataflow computers are a particular type of processing architecture that aims to maximize the number of effective operations per instruction, which in turn maximizes the number of operations achievable per second and per watt consumed. They are particularly well suited to the computation of convolutional networks, which involve very little branching logic but tremendous quantities of basic, redundant arithmetic operations.
In this section I provide a very quick primer on dataflow computing and architectures. As I suspect most readers of this thesis will come from a software background, this section is rather high-level, with an emphasis on the compute model rather than on the specific details of implementation. Chapter 3 extensively describes our custom dataflow architecture.
Dataflow architectures are a particular type of computer architecture that stands in direct contrast to the traditional von Neumann, or control flow, architecture. Dataflow architectures have no program counter: at least conceptually, the execution of instructions is determined solely by the availability of input data to the compute elements.
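To make the firing rule concrete, here is a minimal, purely illustrative Python sketch of data-driven execution: each node fires as soon as all of its operands have arrived, and no program counter imposes an order. The `Node` class, the `run` helper, and the toy (a + b) * (a - b) graph are hypothetical names invented for this example, not part of the architecture described in Chapter 3.

```python
# Toy illustration of the dataflow firing rule: a node executes ("fires")
# as soon as all of its input tokens are available; there is no program
# counter imposing an order.  All names here are hypothetical.

class Node:
    def __init__(self, name, op, n_inputs):
        self.name = name
        self.op = op                  # function applied to the input tokens
        self.n_inputs = n_inputs
        self.inputs = {}              # slot index -> value received so far
        self.consumers = []           # (node, slot) pairs fed by this node

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        if len(self.inputs) == self.n_inputs:   # all operands present
            ready.append(self)                  # node becomes fireable

def run(sources):
    ready = []
    # Inject source tokens, then keep firing whichever node is ready.
    for node, slot, value in sources:
        node.receive(slot, value, ready)
    while ready:
        node = ready.pop()
        result = node.op(*(node.inputs[i] for i in range(node.n_inputs)))
        print(f"fired {node.name} -> {result}")
        for consumer, slot in node.consumers:
            consumer.receive(slot, result, ready)

# Graph for (a + b) * (a - b): execution order is driven by data availability.
add = Node("add", lambda x, y: x + y, 2)
sub = Node("sub", lambda x, y: x - y, 2)
mul = Node("mul", lambda x, y: x * y, 2)
add.consumers = [(mul, 0)]
sub.consumers = [(mul, 1)]
run([(add, 0, 3), (add, 1, 2), (sub, 0, 3), (sub, 1, 2)])
```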
Dataflow architectures have been successfully implemented in specialized hardware, for example in digital signal processing (3, 85), network routing (5), and graphics processing (71, 92). The dataflow model is also very relevant to many software architectures today, including database engine designs and parallel computing frameworks (9, 10).
Before getting into the details of the dataflow architecture, let us look at the von Neumann architecture, which should help highlight the fundamental shortcomings of traditional control flow for highly data-driven applications.
In this type of architecture, the control unit, which decodes instructions and executes them, is the central point of the system. A program (a sequence of instructions) is typically stored in external memory and read sequentially into the control unit. Certain types of instructions involve branching, while others involve reading data from memory into the arithmetic logic unit (ALU), transforming it, and writing it back to external memory.
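For contrast, here is a minimal sketch of the von Neumann execution model on the same toy computation: a single control unit repeatedly fetches the instruction addressed by the program counter, decodes it, and executes it, shuttling one operand at a time between memory and the ALU's accumulator. The instruction set and program below are invented for illustration only.

```python
# Toy fetch-decode-execute loop: the program counter, not data availability,
# drives execution.  The instruction set and program are purely illustrative.

memory = {"a": 3, "b": 2, "t1": 0, "t2": 0, "out": 0}
program = [
    ("LOAD", "a"), ("ADD", "b"), ("STORE", "t1"),    # t1 = a + b
    ("LOAD", "a"), ("SUB", "b"), ("STORE", "t2"),    # t2 = a - b
    ("LOAD", "t1"), ("MUL", "t2"), ("STORE", "out"), # out = t1 * t2
    ("HALT", None),
]

acc, pc = 0, 0                     # accumulator (ALU register) and program counter
while True:
    opcode, operand = program[pc]  # fetch + decode
    pc += 1                        # sequential control flow by default
    if opcode == "LOAD":
        acc = memory[operand]
    elif opcode == "ADD":
        acc += memory[operand]
    elif opcode == "SUB":
        acc -= memory[operand]
    elif opcode == "MUL":
        acc *= memory[operand]
    elif opcode == "STORE":
        memory[operand] = acc
    elif opcode == "HALT":
        break

print(memory["out"])               # (3 + 2) * (3 - 2) = 5
```

Every operation here is serialized behind the fetch/decode step; the dataflow approach instead lets data availability drive many arithmetic units in parallel.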

Multiscale feature extraction for scene parsing

The model proposed in this chapter, depicted in Figure 2.1, relies on two complementary image representations. In the first representation, an image patch is seen as a point in R^P, and we seek a transform f : R^P → R^Q that maps each patch into R^Q, a space where it can be classified linearly. This first representation typically suffers from two main problems when using a classical convolutional network, where the image is divided along a regular grid: (1) the window considered rarely contains an object that is properly centered and scaled, and therefore offers a poor observation basis for predicting the class of the underlying object; (2) integrating a large context involves increasing the grid size, and therefore the dimensionality P of the input; given a finite amount of training data, it is then necessary to enforce some invariance in the function f itself. This is usually achieved with pooling/subsampling layers, which in turn degrade the ability of the model to precisely locate and delineate objects. In this chapter, f is implemented by a multiscale convolutional network, which allows large contexts (as large as the complete scene) to be integrated into local decisions while remaining manageable in terms of parameters/dimensionality. This multiscale model, in which weights are shared across scales, captures long-range interactions without the penalty of extra parameters to train. It is described in Section 2.2.2.1.

In the second representation, the image is seen as an edge-weighted graph, on which one or several over-segmentations can be constructed. The resulting components are spatially accurate and naturally delineate the underlying objects, as this representation preserves pixel-level precision. Section 2.2.3 describes multiple strategies for combining both representations. In particular, Section 2.2.3.3 describes a method for analyzing a family of segmentations (at multiple levels). It can be used as a solution to the first problem exposed above: assuming we can assess the quality of every component in this family of segmentations, a system can automatically choose its components so as to produce the best set of predictions.
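As a rough illustration of the multiscale extractor with shared weights, the sketch below (assuming PyTorch, which is not the framework used in this work) applies one small convolutional feature extractor to every level of an image pyramid, upsamples the resulting maps back to the original resolution, and concatenates them, so that each pixel's feature vector mixes local detail with scene-level context. The layer sizes, pyramid scales, and module names are placeholders, not the actual architecture.

```python
import torch
import torch.nn.functional as F
from torch import nn

class SharedScaleFeatures(nn.Module):
    """Apply one convolutional feature extractor (shared weights) to an
    image pyramid, then upsample and concatenate the per-scale maps.
    Layer sizes and scales are placeholders, not the thesis configuration."""

    def __init__(self, in_channels=3, n_features=16, scales=(1.0, 0.5, 0.25)):
        super().__init__()
        self.scales = scales
        self.features = nn.Sequential(            # shared across all scales
            nn.Conv2d(in_channels, n_features, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, image):                     # image: (N, C, H, W)
        h, w = image.shape[-2:]
        maps = []
        for s in self.scales:
            scaled = image if s == 1.0 else F.interpolate(
                image, scale_factor=s, mode="bilinear", align_corners=False)
            feat = self.features(scaled)          # same weights at every scale
            feat = F.interpolate(feat, size=(h, w), mode="bilinear",
                                 align_corners=False)
            maps.append(feat)
        return torch.cat(maps, dim=1)             # per-pixel multiscale features

x = torch.randn(1, 3, 64, 64)
print(SharedScaleFeatures()(x).shape)             # -> torch.Size([1, 48, 64, 64])
```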


Scale-invariant, scene-level feature extraction

Good internal representations are hierarchical. In vision, pixels are assembled into edgelets, edgelets into motifs, motifs into parts, parts into objects, and objects into scenes. This suggests that recognition architectures for vision (and for other modalities such as audio and natural language) should have multiple trainable stages stacked on top of each other, one for each level in the feature hierarchy. Convolutional Networks (ConvNets) provide a simple framework to learn such hierarchies of features.
Convolutional Networks (64, 65) are trainable architectures composed of multiple stages. The input and output of each stage are sets of arrays called feature maps. For example, if the input is a color image, each feature map is a 2D array containing one color channel of the input image (for an audio input each feature map would be a 1D array, and for a video or volumetric image, a 3D array). At the output, each feature map represents a particular feature extracted at all locations on the input. Each stage is composed of three layers: a filter bank layer, a non-linearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two or three such 3-layer stages, followed by a classification module. Because all stages are trainable, such architectures can model arbitrary input modalities beyond natural images.
Our feature extractor is a three-stage convolutional network. The first two stages each contain a bank of filters producing multiple feature maps, a point-wise non-linear mapping, and a spatial pooling followed by subsampling of each feature map. The last stage contains only a bank of filters. The filters (convolution kernels) are subject to training. Each filter is applied to the input feature maps through a 2D convolution operation, which detects local features at all locations on the input. Each filter bank of a convolutional network produces features that are equivariant under shifts, i.e. if the input is shifted, the output is also shifted but otherwise unchanged.
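The sketch below gives one possible (and deliberately simplified) reading of this three-stage structure, again assuming PyTorch: the first two stages combine a filter bank, a point-wise tanh non-linearity, and max pooling with subsampling, while the last stage contains only a filter bank. Channel counts and kernel sizes are hypothetical placeholders rather than the trained configuration reported here.

```python
import torch
from torch import nn

def conv_stage(in_ch, out_ch, k=5):
    """One full ConvNet stage: filter bank, point-wise non-linearity,
    then spatial max pooling followed by 2x subsampling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2),  # filter bank
        nn.Tanh(),                                                # non-linearity
        nn.MaxPool2d(kernel_size=2, stride=2),                    # pooling/subsampling
    )

# Hypothetical channel counts and kernel sizes; the trained configuration
# in the thesis may differ.
feature_extractor = nn.Sequential(
    conv_stage(3, 16),                                # stage 1
    conv_stage(16, 64),                               # stage 2
    nn.Conv2d(64, 256, kernel_size=5, padding=2),     # stage 3: filters only
)

x = torch.randn(1, 3, 64, 64)
print(feature_extractor(x).shape)                     # -> torch.Size([1, 256, 16, 16])
```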

Table of contents:

List of Figures
List of Tables
1 Introduction 
1.1 Representation Learning with Deep Networks
1.1.1 Deep Network Architectures
1.1.1.1 Multilayer Perceptrons
1.1.1.2 Convolutional Networks
1.1.1.3 Encoders + Decoders = Auto-encoders
1.1.2 Learning: Parameter Estimation
1.1.2.1 Loss Function, Objective
1.1.2.2 Optimization
1.2 Hierarchical Segmentations, Structured Prediction
1.2.1 Hierarchical Segmentations
1.2.1.1 Graph Representation
1.2.1.2 Minimum Spanning Trees
1.2.1.3 Dendrograms
1.2.1.4 Segmentations
1.2.2 Structured Prediction
1.2.2.1 Graphical Models
1.2.2.2 Learning: Parameter Estimation
1.3 Dataflow Computing
2 Image Understanding: Scene Parsing 
2.1 Introduction
2.2 A Model for Scene Understanding
2.2.1 Introduction
2.2.2 Multiscale feature extraction for scene parsing
2.2.2.1 Scale-invariant, scene-level feature extraction
2.2.2.2 Learning discriminative scale-invariant features
2.2.3 Scene labeling strategies
2.2.3.1 Superpixels
2.2.3.2 Conditional Random Fields
2.2.3.3 Parameter-free multilevel parsing
2.2.4 Experiments
2.2.4.1 Multiscale feature extraction
2.2.4.2 Parsing with superpixels
2.2.4.3 Multilevel parsing
2.2.4.4 Conditional random field
2.2.4.5 Some comments on the learned features
2.2.4.6 Some comments on real-world generalization
2.2.5 Discussion and Conclusions
3 A Hardware Platform for Real-time Image Understanding 
3.1 Introduction
3.2 Learning Internal Representations
3.2.1 Convolutional Networks
3.2.2 Unsupervised Learning of ConvNets
3.2.2.1 Unsupervised Training with Predictive Sparse Decomposition
3.2.2.2 Results on Object Recognition
3.2.2.3 Connection with Other Approaches in Object Recognition
3.3 A Dedicated Digital Hardware Architecture
3.3.1 A Data-Flow Approach
3.3.1.1 On Runtime Reconfiguration
3.3.2 An FPGA-Based ConvNet Processor
3.3.2.1 Specialized Processing Tiles
3.3.2.2 Smart DMA Implementation
3.3.3 Compiling ConvNets for the ConvNet Processor
3.3.4 Application to Scene Understanding
3.3.5 Performance
3.3.6 Precision
4 Discussion 
References
