Machine learning basics
Machine learning (ML) is often considered as a measure to realize Artificial Intelligence (AI). It focus on designing computer models (al-gorithms) which automatically learn from data to solve specific tasks. Compared to rule-based models which require large amount of human experts’ investigation for each encountered problem, learning-based models are typically adapted to a type of problems and require only enough data with moderate human supervision. ML is the core of various applications such as spam detection (Dada et al., 2019), image classification (Krizhevsky, Sutskever, and Geoffrey E Hinton, 2012; K. He et al., 2015), product recommendation (Linden, Smith, and York, 2003; Q. Chen et al., 2019), language translation (Bahdanau, Cho, and Yoshua Bengio, 2016; Johnson et al., 2016; Y. Wu et al., 2016), autonomous driving (Grigorescu et al., 2019). Taking the spam detec-tion task as an example, the objective is to design an algorithm that automatically identify whether an email is a spam or not. The main questions to be answered here are: How the model learns from data? Can we confirm the model has learned something? What is the actual performance of the learned model? In the following sections, we introduce the main concepts in machine learning by trying to answer these questions.
Machine learning categorization
In machine learning domain, the data is typically presented in form of dataset, which is a collection of samples (examples). As computers can only take numeric inputs, each sample is typically represented by a multi-dimensional tensor where each entry represents a qualitative or quantitative measure of its characteristic (feature). Without loss of generality, we suppose each input is a vector, denoted as x 2 Rp where p is the dimension of the feature space. A dataset with n examples is denoted as a set fxigin=1 or a matrix X 2 Rp n as the feature dimension p is the same for all samples.
In the spam detection task, for example, we can represent an email using two features: the number of special characters in the email and the domain of the sender’s email address. An email which contains 100 special characters and sent from email@example.com can be repre-sented by a 2-dimensional vector [100, 4]> where 100 is the number of special characters in the email and 4 is the index of gmail.com in the vocabulary of domains.
Depending on the types of data that machine learning algorithms are allowed to use, most of them can be included in two categories (Hastie, Robert Tibshirani, and J. Friedman, 2001; Bishop, 2006): supervised learning and unsupervised learning.
supervised learning Given an annotated dataset f(xi, yi)gin=1, the goal of supervised learning is to learn how to correctly predict y knowing x. The y is referred as target, with a role to guide and supervise the learning process. Actually, depending on the type of the target y, we can generally separate supervised learning tasks to regression and classification. In regression tasks, y is a scalar, i. e. , y 2 R. For example, predicting the salary of an employee given his gender, years of experience and job title is a regression task. In classification tasks, y is a class (category), i. e. , y 2 f0, . . . , k 1g, k > 2 where k is the number of predefined classes. In practice, classification can be further divided into three settings: 1) binary classification: each example can only be labeled with one of two classes (k = 2, e. g. , spam or non-spam), 2) multi-class classification: similar to the binary one, but with more possible classes (k > 2), 3) multi-label classification: each example can belong to multiple classes at the same time (y 2 f0, 1gk).
unsupervised learning For unsupervised learning, we are given an unannotated dataset fxigin=1. Without the guidance of the labels fyigin=1, the goal is to learn the underlying properties or struc-tures of fxigin=1 (Pearson, 1901; Goodfellow et al., 2014; Kingma and Welling, 2014) or to classify them into different groups (MacQueen, 1967; Ester et al., 1996; Rokach and Maimon, 2005). Unsupervised models often involve comparing the similarity between two samples (e. g. , if two samples are close, they are clustered together) where the similarity measure has a heavy impact on the model’s performance, which should be coincident with the nature of the data.
Although we are in the era of Big Data where huge amounts of data are produced, processed and stored at every second, annotated data’s quantity is still limited, due to the fact that high quality data labeling requires large amount of careful human work. On the other side, unan-notated data such as images, text corpus, videos, seem to be ‘unlimited’ and can be collected easily. To take profit of these unlimited resources, self-supervised learning (sometimes considered as a specific type of unsupervised learning) attracts more and more attention (Arora et al., 2019; Xiao Liu et al., 2021), especially in Natural Language Processing (NLP) domain (Mikolov, K. Chen, et al., 2013; Devlin et al., 2018). The general idea is to create proper labels directly from the unannotated dataset, then run supervised learning models on top of it. Instead of focusing on the performance of the auxiliary supervised models, self-supervised learning focus on learning meaningful intermediate representations. The learned representations will be used directly or be fine-tuned to serve downstream tasks. For example, in NLP domain, a common practice is to mask (hold) a word from a sentence and let the model to predict the masked word (e. g. , masked language model in Section 2.4.3). By doing this, the model is forced to learn semantically meaningful representation of words which is used later in downstream task to improve the task’s performance (Hendrycks et al., 2019; J. D. Lee et al., 2020).
Other categories of machine learning models also exist. Semi-supervised learning (Xiaojin Zhu, 2005) is a mixture of supervised learning and unsupervised learning where the algorithms are given a small amount of labeled data and much larger amount of unlabeled data. In Reinforcement learning (Sutton and Barto, 2018), the algorithm (referred as the ‘agent’) interacts with an environment instead of a fixed dataset and interactively gets feedback (reward) from the envi-ronment according to its actual state and next action, with the goal to maximize the total reward. These two categories will not be covered in this dissertation.
Learning as optimization
In this section, we introduce the learning (training) procedure of machine learning algorithms (models) focusing on supervised learning under a probabilistic setting.
Machine learning tasks typically involve three components: dataset, model and loss function. In supervised learning, we are given an annotated dataset f(xi, yi) gin=1 consisting of n input-target pairs where each pair (x, y) is drawn independently from a fixed but unknown joint distribution P. A machine learning model (algorithm), considered as a family of functions f f j f 2 Fg that it can realize where each f corresponds to a specific setting of it, models the dependency of y on a given x. f (x) (often denoted as yˆ) is an estimator of y where f is referred as predictor. The discrepancy between the target y and its estimator yˆ is measured by ‘(yˆ, y) where ‘ is the loss function.
We denote the expected risk (also known as expected loss, generalization error) of a given predictor f as L( f ) = E[‘( f (x), y)] = ‘( f (x), y)dP(x, y). (2.14)
It measures the general approximation quality of f under the data distribution P. Indeed, ideally we would like to learn (find) the oracle best predictor f ? which minimizes L( f ) over all possible functions, i. e. , to learn f ? = argmin L( f ). (2.15)
However, we have two main issues here: First, we don’t have access to the underlying distribution P, we only have its samples in the dataset. Second, it’s not possible to test all the functions. For the first issue, in practice, we use the empirical risk (also known as empirical error, training error) Ln( f ), calculated based on the dataset
We expect the learned predictor fˆ (Eq. 2.17) to have a good per-formance in practice, expressed by a low generalization error L( fˆ). This error is closely related to the capacity (complexity) of the model F which is typically measured by Vapnik–Chervonenkis dimension (VC-dimension, Vapnik, 1995). To be more precise, a classification model with VC-dimension n means there exists a set of n points that can be shattered by it, and no set of n + 1 points can be shattered. The ‘shatter’ here means that for any label assignment (with label 0 or 1) of each data point, their exists at least one function inside the function set, that separates the data points with no mistake. For example, the set of hyperplanes in 2-d space (i. e. , lines) is of VC-dimension 3 because any label assignment of 3 data points that are not colinear can be separated perfectly (i. e. , shattered) by it, and no set of 4 points can be shattered.
In order to prevent overfitting and ill-posed optimization problem during the learning procedure, regularization is needed. Different strategies exist, such as early stopping (Yao, Rosasco, and Caponnetto, 2007), Lp-norm penalties (A. N. Tikhonov, 1963), bagging (Breiman, 1996). Quite often, regularization could be expressed as adding an penalty term on the objective function which can be seen as a preference or prior imposed on the solution that the algorithm should find. The regularized version of the optimization problem, corresponding to the Equation 2.17, is: fˆ = argmin Ln( f ) + lR( f ), (2.27)
where R is the penalty term on f and l is a hyper-parameter which controls the degree of regularization. When f is parameterized by w, we have the regularized estimator of w as: wˆ = argmin Ln(w) + lR(w). (2.28)
For Lp-norm regularization, two common used terms are L2 and L1 regularization, where we penalize large weights of parameters. For the L2 regularization (also known as weight decay, Hanson and Pratt, 1989), R(w) = kwk22. In the linear regression case, adding this term (Ridge regression, Hoerl and Kennard, 1970) resolves the ill-posed optimization problem when the features are highly linear correlated to each other. For L1 regularization, R(w) = kwk1. In linear regression case, adding this term (Lasso, R. Tibshirani, 1996) shrinks the weights of some parameters to zero which implies implicitly a feature selection and introduces sparsity in the model.
Hyper-parameters, such as the degree of regularization (l of the Equa-tion. 2.27), control the high-level behavior of the learning algorithms. They are set before the learning procedure starts. In order to choose the best one, we need to access the generalization error L( fˆ) of the learned predictor fˆ under each hyper-parameter. However, as the un-derlying distribution P is unknown, we use the empirical error Ln( fˆ) on unseen dataset (validation set) as a proxy. Meanwhile, in order to have an idea about the performance of the final chosen predictor (learned under the best hyper-parameter), another unseen dataset (test set) is required. So in the end, during the whole machine learning process, we need three datasets: training set, validation set and test set. All of them should come from the same underlying distribution P and be independent from each other.
In practice, instead of three independently collected datasets, we of-ten have one whole dataset at hand. To create aforementioned datasets, we randomly split the data into three disjoint partitions, usually with a train-validation-test cut equals to a%-b%-c% where a > b > c > 0. However, this one-shot split may potentially cause the model’s perfor-mance to depend on a particular random choice of train-validation split and it reduces the amount of data that the training set can use. A cross-validation strategy (Stone, 1974) is adapted to solve these issues where the test set is still held out for final evaluation, but the training and validation set are created repeatedly by randomly taking parts of the rest data. In a common k-fold cross-validation approach, the data is randomly split to k equal-sized subsets (folds). Given a hyperparameters setting, for each one of the k folds, we take the rest k 1 folds as training set to train the algorithm and this fold as evaluation set to evaluate the algorithm. The performance of the given hyper-parameter setting is judged by the average evaluation performance of k rounds. We then take the best hyper-parameter setting, retrain the algorithm using the whole k folds data to get our final predictor fˆ. In the end, the actual performance of the final predictor fˆ is evaluated on the test set.
Evaluation metrics are the measures that the model is eventually evaluated with, which can be different from the objective function used in the learning procedure. Theoretically, we can chose any measure we want, but an ideal one should reflect the real world needs of the task on which we apply the model. The common reason that we do not directly optimize our model using the actual evaluation metrics is that quite often, they are non-convex or not even smooth, e. g. , the error rate (the average of 0-1 loss on all samples).
Taking the logistic regression for example, as we have seen pre-viously, we use the log loss as loss function (Eq. 2.21). The actual evaluation metrics can be accuracy, F1-score, Receiver operating char-acteristic (ROC) curve, Precision-Recall (PR) curve, etc. , depending on the practical needs. In the following, we first introduce several basic measure components and then use them to present ROC curve and PR curve which we’ll use in our experiments, focusing on the binary classification case.
Table of contents :
1.1 Display advertising & Real-Time Bidding
1.1.1 Programmatic media buying
1.1.2 Real-Time Bidding
1.2 Auction mechanism
1.3 Performance metrics of advertising campaigns
1.4 Cookie-based user identification & privacy
1.5 Data Quality
1.6 Thesis contributions and related work
1.6.1 URL embedding
1.6.2 User response prediction
1.6.3 Audience expansion
1.7 Thesis organization
2 preliminaries and background
2.1 Basic math
2.1.3 Activation functions
2.2 Machine learning basics
2.2.1 Machine learning categorization
2.2.2 Learning as optimization
2.2.3 Error decomposition
2.2.5 Evaluation protocol
2.2.6 Evaluation metrics
2.3 Neural networks
2.3.1 Basic concepts
2.3.3 Type of layers
2.3.5 Success reasons
2.4 Word representation
2.4.1 One-hot encoding
2.4.2 Word embedding
2.4.3 Contextual word embedding
2.5 Document embedding
2.5.1 Aggregate pre-trained word embeddings
2.5.2 Produce directly the document embedding
3 predicting conversions in display advertising based on url embeddings
3.1 Related work
3.2 Proposed conversion prediction architecture
3.3 URL representation schemes
3.5 Conclusions and future directions
4 audience extension
4.1 Related work
4.2 The proposed audience expansion methods
4.2.1 Audience expansion based on set similarity metrics
4.2.2 Audience expansion based on URL embeddings
4.2.3 Audience expansion based on User2Vec model
4.2.4 Audience expansion based on User2VecC model
4.3 Empirical analysis
4.3.2 Ablation study
4.4 Conclusions and future directions
5 concluding remarks
5.1 Summary of contributions
5.2 Future directions