Gaze Inference Using the Siamese Neural Network


Appearance-Based Unconstrained Gaze Estimation

An appearance-based model takes some image frame(s) of the subject as input and produces a gaze direction as output. The task of gaze estimation is described as "unconstrained" when no specific demands are placed on the input image. This means that the inference model should be capable of estimating the gaze for any user, in any natural setting. In particular, there are no requirements on head pose, distance, angle, and so on. The goal is to make an eye tracker that works reliably on any mobile phone, using no special equipment apart from the front camera.

Problem Description

The gaze estimation task may be formulated as gaze point regression, where the goal is to minimize the Euclidean distance between the predicted gaze point and the true gaze point. The input is a normalized face image $I \in [0, 1]^{D_W \times D_H \times D_C}$, taken with a mobile device front camera, where $D_W$ is the image width, $D_H$ is the image height and $D_C$ is the number of color channels. In our case, $I$ is an RGB image with dimensions $(D_W, D_H, D_C) = (224, 224, 3)$. Given $I$, the goal is to predict a gaze point $g_p(I) \in \mathbb{R}^2$ on the plane of the mobile device screen, where $g_p(I)$ is the coordinate distance in centimeters from the device front camera.
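As a concrete sketch of this objective, the snippet below computes the mean Euclidean distance between predicted and true gaze points. The array shapes and names are illustrative assumptions, not taken from any released code.

```python
# A minimal sketch of the gaze-point regression objective, assuming
# predictions and labels as (N, 2) arrays of (x, y) offsets in cm
# from the front camera on the screen plane.
import numpy as np

def euclidean_loss(pred: np.ndarray, true: np.ndarray) -> float:
    """Mean Euclidean distance (cm) between predicted and true gaze points."""
    return float(np.mean(np.linalg.norm(pred - true, axis=1)))

# Example: a batch of two predictions, each 1 cm off along one axis.
pred = np.array([[1.0, 2.0], [0.0, -3.0]])
true = np.array([[1.0, 3.0], [1.0, -3.0]])
print(euclidean_loss(pred, true))  # 1.0
```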

Accuracy Metrics

While all appearance-based models have some image frame(s) as input and gaze direction as output, there is no standard way to represent gaze direction and accuracy [1]. The common accuracy metrics are angular error (in degrees), absolute distance (in pixels or mm) and relative distance (a percentage). A survey of 200 research articles on gaze-based algorithms and applications (see Table 1) showed that angular error is more popular in general, but absolute error is more popular for mobile devices.
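To make the difference between the two metrics concrete, the sketch below computes both for the same prediction. The camera-to-eye distance `d` used for the angular conversion is a hypothetical value introduced here for illustration, not a quantity from the survey.

```python
# Hedged sketch contrasting absolute distance with angular error. The
# angular conversion treats gaze points as offsets on a plane at an
# assumed distance d (cm) from the eye.
import numpy as np

def absolute_error_cm(pred, true):
    """On-screen Euclidean distance (cm) between two gaze points of shape (2,)."""
    return float(np.linalg.norm(np.asarray(pred) - np.asarray(true)))

def angular_error_deg(pred, true, d=30.0):
    """Approximate angular error, assuming the eye sits d cm from the screen."""
    v1 = np.array([pred[0], pred[1], d])  # ray from eye to predicted point
    v2 = np.array([true[0], true[1], d])  # ray from eye to true point
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

print(absolute_error_cm([0, 0], [2, 0]))          # 2.0 cm
print(angular_error_deg([0, 0], [2, 0], d=30.0))  # ~3.8 degrees
```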

The GazeCapture Dataset

There are many publicly available datasets for the task of gaze estimation. GazeCapture is one such dataset, containing almost 2,500,000 datapoints from 1,450 different people [12]. This dataset was gathered on various iOS mobile devices in an unconstrained setting. The datapoints include the image frame taken by the front camera, valid face and eye crops, as well as the corresponding gaze point labels. These labels are given as the x and y distance from the camera in centimeters. It needs to be taken into consideration, though, that, as in the real world, not all of these frames are valid (no face detected, closed eyes, etc.). After filtering for valid frames containing a detected face and open eyes, the dataset contains 1,490,959 datapoints.
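A sketch of this filtering step is given below. It assumes the file and field names of the public GazeCapture release (per-session appleFace.json, appleLeftEye.json and appleRightEye.json with per-frame "IsValid" flags, and dotInfo.json with "XCam"/"YCam" labels in centimeters); these names are assumptions about the dataset layout, not code from this thesis.

```python
# Hedged sketch: iterate over the frames of one GazeCapture session and
# yield only those where the face and both eyes were detected as valid.
import json
from pathlib import Path

def valid_frames(session_dir: Path):
    face = json.loads((session_dir / "appleFace.json").read_text())
    left = json.loads((session_dir / "appleLeftEye.json").read_text())
    right = json.loads((session_dir / "appleRightEye.json").read_text())
    dots = json.loads((session_dir / "dotInfo.json").read_text())
    for i, (f, l, r) in enumerate(zip(face["IsValid"],
                                      left["IsValid"],
                                      right["IsValid"])):
        if f and l and r:  # face detected and both eyes visible
            # label: (x, y) gaze point in cm from the front camera
            yield i, (dots["XCam"][i], dots["YCam"][i])
```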

Artificial Neural Networks

Supervised learning algorithms build statistical models from data in order to perform certain tasks. A popular learning algorithm is the artificial neural network. Gaining in popularity as computational power keeps increasing, neural networks are a powerful tool for solving various problems. As the name suggests, neural networks are inspired by the structure of the brain.

Artificial Neuron

An artificial neural network is composed of a collection of connected nodes called artificial neurons. Each one of these artificial neurons takes some inputs $x_1, x_2, \ldots, x_n$ as well as a bias input $x_0$. The neuron has a set of trainable parameters (weights) $w_0, w_1, \ldots, w_n$. When the neuron is fed inputs, it produces an output by applying an activation function to the sum of all the weighted inputs. With the bias input always being $x_0 = 1$, the output can be formulated as

$$\text{Output} = \text{Activation}\Big(\sum_{i=0}^{n} w_i x_i\Big). \qquad (2.1)$$
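A minimal sketch of Equation (2.1) follows; the sigmoid activation is an illustrative choice, since the text does not prescribe a particular activation function.

```python
# One artificial neuron: prepend the constant bias input x0 = 1, take the
# weighted sum, and pass it through an activation (here: sigmoid).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x: np.ndarray, w: np.ndarray) -> float:
    """x: inputs x1..xn; w: weights w0..wn, where w0 multiplies the bias."""
    x = np.concatenate(([1.0], x))       # bias input x0 = 1
    return float(sigmoid(np.dot(w, x)))  # Activation(sum_i w_i * x_i)

print(neuron(np.array([0.5, -1.0]), np.array([0.1, 0.8, 0.3])))  # ~0.55
```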

Neural Networks and Connected Layers

As mentioned, a complete neural network can be constructed by connecting a multitude of artificial neurons. Rather than adding them individually, neurons are typically aggregated in layers. The first layer is often referred to as the input layer. The final layer is called the output layer, and the intermediate layers are called hidden layers. Input is fed to the input layer which transforms and propagates the information forward through the rest of the neural network. The output is the final activation produced by the output layer.
A layer $L_i$ is considered fully connected when every one of its inputs is connected to every output of the previous layer $L_{i-1}$. The fully connected neural network is the simplest type of neural network, consisting purely of fully connected layers of artificial neurons. Figure 2.3 illustrates such a fully connected neural network, with 2 hidden layers and 2 neurons in the output layer.
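The forward pass of such a network is a short loop over its layers. The sketch below mirrors the shape of the network in Figure 2.3 (two hidden layers, a 2-neuron output layer); the layer widths and the ReLU activation are illustrative assumptions.

```python
# Forward pass through a small fully connected network: each layer
# multiplies by its weight matrix, adds a bias, and applies an activation.
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]  # input layer, two hidden layers, output layer
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(x):
    for W, b in zip(weights, biases):
        x = np.maximum(0.0, W @ x + b)  # ReLU on every layer, for simplicity
    return x

print(forward(rng.normal(size=4)))  # 2-dimensional output
```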
Since the number of parameters in fully connected layers grows quickly with the number of neurons, such naive neural networks are not commonly used for complex tasks such as image recognition. Fortunately, there are other layers that can reduce the number of parameters of a neural network. Such layers include pooling layers and convolutional layers, which will be described in later subsections.


Learning and Backpropagation

The previous sections described how inputs are forward propagated through the neural network, transforming inputs into outputs. The neural network also needs to be able to learn from data in order to produce useful outputs. This is done by adjusting the weights in such a way that the produced output will match the data. The weights are adjusted using backpropagation. Backpropagation is a gradient-based optimization algorithm which exploits the chain rule to efficiently adjust the weights layer by layer. After feeding the neural network labeled data, e.g. some input image vector $x$ and its corresponding label $l$, and producing an output $y$, we compute some loss $L(y, l) = E$ between the neural network output and the data label. Then, by using the chain rule, the gradient can be calculated layer by layer.
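A minimal sketch of this procedure for a single linear layer follows. It is meant only to show the chain rule producing weight gradients and a gradient descent update, not to reproduce the training setup used in this thesis.

```python
# Backpropagation through one linear layer with a squared-error loss:
# forward pass, chain rule for the gradient, then a gradient descent step.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))  # one layer: 3 inputs -> 2 outputs
x = rng.normal(size=3)       # input vector
l = np.array([1.0, 0.0])     # target label

for step in range(100):
    y = W @ x                      # forward pass (linear layer)
    E = 0.5 * np.sum((y - l)**2)   # loss L(y, l) = E
    dE_dy = y - l                  # chain rule: dE/dy
    dE_dW = np.outer(dE_dy, x)     # chain rule: dE/dW = dE/dy * dy/dW
    W -= 0.1 * dE_dW               # gradient descent step
print(E)  # close to 0 after training
```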

Convolutional Layers

Convolutional layers are a vital component of neural networks for learning tasks related to images. Neural networks that utilize convolutional layers are often referred to as CNNs (Convolutional Neural Networks). Pixels in an image are most useful in the context of neighbouring pixels, so the idea is to save parameters by looking at smaller parts of the image instead of all the pixels at once (in contrast to fully connected layers). A convolutional layer is a set of filters, which one can think of as 2D matrices. The convolution operation "slides" these filters across a feature map, performing element-wise multiplications at each position. The feature map can be the input image or any intermediate output at any layer of the neural network. The sums of these multiplications form the output feature map, as illustrated in Figures 2.5 and 2.6. Convolution is a differentiable operation, and therefore these filters are trainable parameters.
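The sliding-window operation can be written directly as two loops. The following naive sketch (no padding, stride 1) is purely illustrative, not an efficient implementation.

```python
# Naive 2D convolution: slide a filter across a feature map and sum the
# element-wise products at each position, as in Figures 2.5 and 2.6.
import numpy as np

def conv2d(feature_map: np.ndarray, filt: np.ndarray) -> np.ndarray:
    """Valid (no padding), stride-1 convolution of a 2D map with a 2D filter."""
    H, W = feature_map.shape
    kH, kW = filt.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # element-wise multiply the window with the filter, then sum
            out[i, j] = np.sum(feature_map[i:i+kH, j:j+kW] * filt)
    return out

edge_filter = np.array([[1.0, -1.0]])     # a simple horizontal-edge filter
image = np.array([[0.0, 0.0, 1.0, 1.0]])  # 1x4 "image" with one edge
print(conv2d(image, edge_filter))         # [[ 0. -1.  0.]]
```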
Since images typically have color channels for red, green and blue as well as the spatial width and height, the initial input and the feature maps produced at any point in the neural network are usually 3D rather than 2D. The filters need to match the number of dimensions. Each convolutional layer has some number of these trainable filters, producing a new channel for each filter. For example, applying 6 filters of size $(5 \times 5 \times 3)$ to a $(32 \times 32 \times 3)$ image produces a $(28 \times 28 \times 6)$ output, as seen in Figure 2.7.
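This shape arithmetic can be verified with any deep learning framework; the check below uses PyTorch's `torch.nn.Conv2d` purely for illustration, as the thesis does not tie this example to a specific framework.

```python
# Shape check for the example above: 6 filters of size 5x5x3 applied to a
# 32x32x3 image (valid convolution, stride 1) yield a 28x28x6 output.
import torch

conv = torch.nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5)
x = torch.zeros(1, 3, 32, 32)  # (batch, channels, height, width)
print(conv(x).shape)           # torch.Size([1, 6, 28, 28])
```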

Table of contents:

1 Introduction 
1.1 Research Questions
1.2 Scope
1.3 Outline
2 Background 
2.1 Appearance-Based Unconstrained Gaze Estimation
2.1.1 Problem Description
2.1.2 Accuracy Metrics
2.1.3 The GazeCapture Dataset
2.2 Artificial Neural Networks
2.2.1 Artificial Neuron
2.2.2 Neural Networks and Connected Layers
2.2.3 Learning and Backpropagation
2.2.4 Convolutional Layers
2.2.5 Depthwise Separable Convolutions
2.2.6 Pooling Layers
2.3 Transfer Learning
2.4 Similarity Learning
2.4.1 Siamese Neural Network
2.5 Calibration
3 Related Work 
3.1 Eye Tracking for Everyone
3.1.1 iTracker: a Deep Neural Network for Eye Tracking
3.1.2 Calibration with SVR
3.2 MobileNets
3.3 A Differential Approach for Gaze Estimation with Calibration
3.4 It’s Written All Over Your Face
4 Siamese Regression for GazeCapture 
4.1 The Siamese Neural Network for Regression
4.1.1 Neural Network Architecture
4.1.2 Training the Siamese Neural Network
4.1.3 Gaze Inference Using the Siamese Neural Network
4.2 Intermediate Experiments
5 Results 
5.1 Siamese Neural Network and Calibration Points for Gaze Estimation
5.1.1 Inference Time
5.2 Miniature Models
6 Discussion 
6.1 Mobile Phones vs Tablets
6.2 Effect of Increasing Calibration Points
6.3 Even Spread or Random
6.4 The Efficacy of Siamese Neural Networks for Gaze Estimation with Calibration
6.5 Inference Time
6.6 Transfer Learning from ImageNet to GazeCapture
6.7 Fine-tuning for Specific Device and Orientation
6.8 Increased Data Quantity for Gaze Difference
6.9 iTracker vs MobileNet
6.10 Depthwise Separable Convolutions
7 Conclusions 
7.1 Transfer Learning
7.2 Calibration Points with Siamese Neural Networks
7.3 Future Work
Bibliography 
