Graph Inference and Semi-Supervised Learning with Auto-Encoding Variational Bayes


Graph-based Semi-Supervised Learning

Semi-supervised Learning (SSL) is a learning paradigm that exploits both labeled and unlabeled data. SSL algorithms have been successfully applied to problems where labeled data is very scarce but a relatively large amount of unlabeled data is available.
Formally, we assume we have access to a dataset of the form D = Dsup ∪ Dunsup, where Dsup = {(xi, yi)}_{i=1}^{k} is the supervised set and Dunsup = {xi}_{i=k+1}^{n} is the unlabeled set. We assume each observation (xi, yi) comes from a distribution p(x, y), where x ∈ X and y ∈ Y. We also assume the supervised set is relatively small with respect to the unlabeled set. The goal is to find a mapping g : X → Y that fits the observed data as well as possible. SSL algorithms work by making assumptions about the distribution of the input data so that one can leverage the information provided by the unlabeled set. A few popular assumptions are described below.

Smoothness assumption. The smoothness or continuity assumption states that points lying in a dense region have outputs that are likely to be close.

Cluster assumption. The cluster assumption is a discrete version of the smoothness assumption: points form clusters, and if two elements belong to the same cluster then they are likely to belong to the same class. For a classification task this implies that an optimal decision boundary will pass through a low-density region.

Manifold assumption. High-dimensional data lies on a low-dimensional manifold. This assumption is key to avoiding the "curse of dimensionality".

SSL methods can be classified into inductive and transductive. The former aim at learning a classifier that is able to handle data from the whole domain X. The latter are only concerned with finding the labels corresponding to the unlabeled elements of the training set.

Graph-based semi-supervised learning (GSSL) algorithms are a class of semi-supervised methods that additionally rely on a graph structure associated with the data. In the taxonomy described above they are usually placed among the transductive methods. These methods have received a lot of attention over the last two decades for several reasons [61]. In the first place, in many applications the data is naturally endowed with a graph. Such a graph is assumed to represent the underlying topology of the data, and it thus provides grounds to develop algorithms that comply with the three assumptions presented above; moreover, the massive use of the Internet has made graph-structured datasets increasingly widespread. In the second place, many problems can be formulated as convex programs in a straightforward manner [61]. Finally, graphs are expressive objects and can encode rich information about the data.

At a high level, GSSL algorithms can be classified into methods that introduce an explicit form of regularization to the objective based on the topology of the data, and more modern methods based on graph neural networks. We describe these different approaches below. In what follows we assume the data is endowed with a graph G = (V, E, ω) with associated adjacency matrix W ∈ R^{n×n}.
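In practice the graph is not always given with the data; a common construction is a k-nearest-neighbour similarity graph with Gaussian edge weights. The following is a minimal sketch of that construction, assuming Euclidean feature vectors; the neighbourhood size k and bandwidth sigma are illustrative defaults, not values prescribed in this chapter.

```python
import numpy as np

def knn_gaussian_graph(X, k=10, sigma=1.0):
    """Build a symmetric k-NN adjacency matrix W with Gaussian edge weights.

    X     : (n, d) array of data points.
    k     : number of neighbours per node (illustrative default).
    sigma : kernel bandwidth (illustrative default).
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq_norms = (X ** 2).sum(axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)  # no self-loops

    W = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argsort(d2[i])[:k]  # k closest points
        W[i, neighbours] = np.exp(-d2[i, neighbours] / (2.0 * sigma ** 2))
    # Symmetrize so W is the adjacency matrix of an undirected graph.
    W = np.maximum(W, W.T)
    return W
```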

Algorithms based on Manifold Regularization

Algorithms in this category are characterized by an objective function that combines a supervised term with a graph regularization term [61]. The first term is a standard supervised loss, while the second term ensures that the solution is consistent with the topology of the data. A widely used graph regularization criterion is smoothness: the solution has to be as smooth as possible with respect to the graph, in the sense of Equation (2.1).
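To make this concrete, below is a minimal sketch of such an objective, assuming the smoothness criterion is the Laplacian quadratic form f⊤Lf (the usual way of writing graph smoothness): a squared supervised loss over labeled nodes plus λ f⊤Lf, whose minimizer is given by a linear system. The closed-form solve and the weight lam are illustrative choices, not the exact formulation of Equation (2.1).

```python
import numpy as np

def laplacian_regularized_fit(W, y, labeled_mask, lam=1.0):
    """Minimize sum_{i labeled} (f_i - y_i)^2 + lam * f^T L f over f in R^n.

    W            : (n, n) adjacency matrix.
    y            : (n,) target vector (arbitrary values on unlabeled nodes).
    labeled_mask : (n,) boolean array, True where the label is observed.
    lam          : regularization weight (illustrative value).
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                                # combinatorial graph Laplacian
    S = np.diag(labeled_mask.astype(float))  # selects the supervised residuals
    # Setting the gradient to zero gives the linear system (S + lam * L) f = S y.
    # (A small ridge can be added if the system is singular, e.g. for
    # disconnected graphs with unlabeled components.)
    f = np.linalg.solve(S + lam * L, S @ y)
    return f
```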
Some methods use the graph as a propagation operator that diffuses labels according to how elements are connected. Examples of these are Label Propagation [75] and Label Spreading [72]. Considering binary labels for simplicity, Label Propagation takes the random-walk normalized operator Lrw = D−1W, initializes y(0) = [y1, . . . , yk, 0, . . . , 0] and iteratively updates y as:
y(t+1) = Lrw y(t)    (2.7)
yi(t+1) = yi for i = 1, . . . , k.
When normalized in this way, Lrw can be seen as a random-walk matrix, that is, a matrix whose entries represent transition probabilities. With that in mind, Equation (2.7) iteratively visits the neighborhood of each node and diffuses the labels according to the strength of the edges. Label Spreading differs in two main respects: it uses the symmetrically normalized operator Lsym = D−1/2W D−1/2 as the propagation operator, and it allows the predicted labels on the training set to change:

y(t+1) = αLsym y(t) + (1 − α)y(0).
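A minimal sketch of both update schemes, in the binary-label setting described above, is given below; the number of iterations and the value of α are illustrative choices.

```python
import numpy as np

def label_propagation(W, y0, labeled_mask, n_iter=100):
    """Iterate y <- D^{-1} W y, clamping the labeled entries (Equation 2.7)."""
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic propagation operator
    y = y0.copy()
    for _ in range(n_iter):
        y = P @ y
        y[labeled_mask] = y0[labeled_mask]  # keep observed labels fixed
    return y

def label_spreading(W, y0, alpha=0.9, n_iter=100):
    """Iterate y <- alpha * S y + (1 - alpha) * y0 with S = D^{-1/2} W D^{-1/2}."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    y = y0.copy()
    for _ in range(n_iter):
        y = alpha * (S @ y) + (1.0 - alpha) * y0
    return y
```

In both cases the final prediction for an unlabeled node is obtained by thresholding (or, in the multi-class case, taking the arg-max over) the converged scores.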

Algorithms based on Graph Neural Networks

In the more recent literature the focus is on methods that use the Graph Neural Network (GNN) architecture [56]. GNNs are neural network models that exploit a given graph structure. In a broad sense, they compute node features by iteratively aggregating neighborhood information. A CNN can be deemed a particular case of a GNN in which the data sits on a grid-type graph. Given an n-dimensional input f, a CNN applies a series of convolutional layers, each followed by a non-linearity; that is, it produces an output γ(x) of the form

γl(x) = σ( ∑_{i=1}^{n} (fi ∗ gl,i)(x) ),
where the gl,i are learnable filters and l = 1, . . . , L indexes the output feature maps. CNNs are well known for having been used in ground-breaking work in the field of computer vision [23, 43, 24]. Their success is due to the fact that they capture meaningful spatial and temporal dependencies, which makes their outputs very good feature maps. As mentioned in Section 2.1.2, Graph Signal Processing provides the necessary tools to generalize convolutional layers to any type of graph domain (not only grids). The Graph Convolutional Network (GCN) is a type of GNN, first introduced by Kipf and Welling [39] in the context of semi-supervised classification. Since computing spectral filters involves the potentially costly spectral decomposition of the graph Laplacian (as explained in Section 2.1.2), the authors exploited the fact that a graph filter can be approximated by a truncated expansion in Chebyshev polynomials of the Laplacian, avoiding the explicit eigendecomposition.
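As an illustration, the sketch below implements the widely used first-order propagation rule H′ = σ(D̃^{-1/2} Ã D̃^{-1/2} H Θ) with Ã = W + I, which is how the approximation of [39] is usually written; the NumPy implementation and the ReLU non-linearity are illustrative choices, not the exact implementation used in this thesis.

```python
import numpy as np

def gcn_layer(W, H, Theta):
    """One graph convolutional layer in the first-order approximation of [39].

    W     : (n, n) adjacency matrix.
    H     : (n, d_in) node features.
    Theta : (d_in, d_out) learnable weight matrix.
    """
    n = W.shape[0]
    A_tilde = W + np.eye(n)                   # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ H @ Theta, 0.0)  # ReLU non-linearity
```

Stacking two such layers and applying a softmax with a cross-entropy loss restricted to the labeled nodes recovers the semi-supervised classifier of [39].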


Latent Variable Models and Variational Inference

When studying a phenomenon, the practitioner will usually observe a set of features or variables that produce a certain response. However, it is often the case that the proposed model is partial or incomplete, in the sense that the observed response is also affected by a set of unobserved or latent variables. A real-world example from the field of sociology is racial prejudice: one sometimes cannot measure such an attribute explicitly, but one can infer it from other pieces of information, such as political stances or whether the person approves of specific legislation [19].
Let us denote random variables in bold. In latent variable models we assume a phenomenon that is observed through a set of known variables X = [X1, . . . , Xd] can be explained in terms of a set of latent variables Z = [Z1, . . . , Zm]. Mathematically, let pθ be the probability distribution of X with parameters θ. Then,

pθ(X) = ∫ pθ(X|Z) p(Z) dZ,    (2.16)
where p(Z) is some prior over Z. We are interested in learning the parameters of the likelihood pθ(X|Z), often referred to as the generative model of X, and we will assume a specific prior over the latent variables. Furthermore, we want to parameterize our models with neural networks, something that quickly renders the inference problem intractable.
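One way to see the difficulty is to estimate the marginal likelihood in (2.16) by naive Monte Carlo over the prior; the sketch below assumes a standard normal prior and an arbitrary log_likelihood function standing in for a neural decoder (both are illustrative assumptions, not part of the model defined here).

```python
import numpy as np

def naive_marginal_likelihood(x, log_likelihood, m, n_samples=1000,
                              rng=np.random.default_rng(0)):
    """Monte Carlo estimate of log p_theta(x) = log E_{Z ~ p(Z)}[p_theta(x|Z)].

    log_likelihood : function (x, Z) -> array of log p_theta(x|Z) values,
                     one per latent sample (assumed interface, for illustration).
    m              : latent dimensionality; the prior is taken to be N(0, I).
    """
    Z = rng.standard_normal((n_samples, m))  # Z_s ~ p(Z)
    log_p = log_likelihood(x, Z)             # shape (n_samples,)
    # log-mean-exp for numerical stability.
    log_p_max = log_p.max()
    return log_p_max + np.log(np.mean(np.exp(log_p - log_p_max)))
```

Such an estimator needs an enormous number of prior samples whenever the posterior concentrates in a small region of the latent space, which is one way of seeing why a good approximation of pθ(Z|X) is needed.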
Fitting the parameters of the model in (2.16) amounts to maximizing the log-likelihood, that is, solving maxθ log pθ(X). Provided we have access to a dataset D = {X(1), . . . , X(N)}, we would have to compute the posterior distribution pθ(Z|X). The challenge is that in many cases the true posterior is intractable. In fact, unless we pick very simple models for the prior and likelihood, such as Gaussians or conjugate distributions, Equation (2.16) will be very hard or impossible to compute analytically. A well-known method to alleviate this issue is Expectation-Maximization, introduced by Dempster et al. [15], an iterative algorithm for finding maximum likelihood estimates in models with latent variables. The algorithm alternates between an "E step" and an "M step" until convergence. In the E step we fix the parameters θ and compute the expected complete-data log-likelihood under the current posterior over the latent variables. In the M step we maximize this expectation with respect to θ.
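For reference, the decomposition of the log-likelihood that underlies both EM and the variational methods used later in this thesis can be written as follows (standard textbook material, stated here without the thesis's own equation numbering):

```latex
\log p_\theta(X)
  = \underbrace{\mathbb{E}_{q(Z)}\!\left[ \log \frac{p_\theta(X, Z)}{q(Z)} \right]}_{\text{lower bound } \mathcal{L}(q,\theta)}
  \;+\; \underbrace{\mathrm{KL}\!\left( q(Z) \,\|\, p_\theta(Z \mid X) \right)}_{\geq\, 0}
```

The E step sets q(Z) = pθ(Z|X) at the current parameters, which makes the KL term vanish and tightens the bound; the M step then maximizes L(q, θ) with respect to θ. When the posterior itself is intractable, as with neural-network likelihoods, q is restricted to a tractable family and the lower bound is maximized directly, which is the starting point of variational inference.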

Table of contents:

List of figures
List of tables
1 Introduction 
1.1 Published Work
1.2 Outline
2 Background 
2.1 Graphs
2.1.1 Spectral Graph Theory
2.1.2 Graph Signal Processing
2.2 Graph-based Semi-Supervised Learning
2.2.1 Algorithms based on Manifold Regularization
2.2.2 Algorithms based on Graph Neural Networks
2.2.3 Graph Construction
2.3 Metric Learning
2.4 Latent Variable Models and Variational Inference
2.5 Unsupervised Graph Learning
3 Joint Learning of the Graph and the Data Representation 
3.1 Introduction
3.2 Model
3.2.1 Problem Formulation
3.2.2 Choices of Representation Functions
3.2.3 Optimization
3.3 Experiments and Results
3.3.1 Synthetic Data
3.3.2 Real Data
3.4 Conclusion
4 Graph Inference and Semi-Supervised Learning with Auto-Encoding Variational Bayes 
4.1 Introduction
4.2 Model
4.2.1 Architecture
4.2.2 Encoder
4.2.3 Decoder
4.2.4 Choice of Prior
4.2.5 End-to-End Training Algorithm
4.3 Experiments and Results
4.3.1 Datasets
4.3.2 Competing Approaches and Setup
4.3.3 Results in Transductive Setting
4.3.4 Discussion
4.4 Concurrent Work in Joint Models for GSSL
4.5 Conclusion
5 Conclusion 
5.1 Summary
5.2 Future directions
References
