Learning Representations in Graphs
In our contributions, the proposed models use transductive (semi-)supervised approaches. Those notions can be source of confusion, so let us firstly clarify some definitions. Since this thesis focuses on relational data, we give an explanation of what it means to work under such frameworks when dealing with graphs.
Inductive and Transductive Algorithms
In this section, we compare two different philosophical concepts related to inference in ma-chine learning: induction and transduction. The works presented in this thesis use the transductive approach, but it is important to differentiate the two approaches since, based on the data, both of them or only one of them could be employed.
We take a simple example for the classification task where we have instances (xi)i2[1::n] divided in training set, and testing set, as well as their respective labels. Without loss of generality we assume the task of binary classification.
Inductive algorithms aim at learning a general rule f from observed examples, which is then applied to unseen examples. It learns a predictive model that only depends upon the features of a sample to classify.
More precisely, for classification an inductive approach aims at learning a classifier f that maps the instance space to the label space, based only on the training instances. In practice, we need input features for all xi. Generally input features are represented as vec-tors in Rd0. Thus, in our example, the classifier f maps xi 2 Rd0 into f 1; 1g. During the inductive training phase only the labeled training set is used. Inference is then performed by applying the learned classifier f directly on the test set or to any new datum.
The fact that input features are mandatory is important since not all datasets provide readily available input features. As we will see in this thesis, sometimes only the graph is given as input, with no additional feature (e.g. only a weight matrix indicating a relation or a similarity between graph nodes is provided). So, in the case where no input feature is available, the inductive approach is not applicable.
Transductive approaches are introduced in (Gammerman et al., 1998). The main difference between inductive and transductive approaches is that the transductive one does not aim at learning a general predictive model but directly classifies test data based on training data: it is an inference from particular to particular. Said otherwise, the output of the transductive algorithm is a vector of labels instead of a classifier f. The goal is to transfer the information from labeled examples to unlabeled ones. The strength of transduction lies in that it is solving a simpler task than induction. Thus during the training phase both the training set (with its respective labels) and the test set (without its respective labels) are used. Since test data is used during the training phase, if we want to label a new test point xn+1 the transductive algorithm needs in principle to be run again whereas its inductive counterpart can label it by applying directly the learned classifier f.
Transductive learning assumes that we can however provide a relation between exam-ples (e.g. a way to compare them). This is the case in graphs where relations between nodes give us a prior knowledge on which nodes should be close to each other even if there is no input feature. For example, concerning the classification task, it is very common to assume that neighbors in the graph are more likely to have similar labels.
Moreover, speaking about relational data, since most often the main goal is not to return a classifier, the lack of node input features is not crippling because the input graph can be used as the only source of information.
Supervised, Unsupervised and Semi-Supervised Learning
In this section we compare three different technical ideas in machine learning: unsupervised, supervised and semi-supervised learning. The works presented in this thesis use semi-supervised techniques.
Unsupervised learning techniques consist in learning hidden structures (e.g. data densities) from unlabeled data only, i.e. there are no labels yi provided. All these techniques rely on some pre-defined relation between the samples. A very common example in unsupervised learning is the task of clustering, and more specifically the k-means algorithm (MacQueen, 1967). It aims at finding k points (the center of the cluster) in the vectorial space that mini-mize the sum of the distances of all instances to the nearest cluster center.
In unsupervised learning, there is no direct evaluation of the algorithm, although the output of the algorithm (clusters, representations, …) can be used for other tasks.
Supervised learning techniques aim at learning a predictor for a task dealing with labels, ratings, values, etc. In our case, it corresponds to learning a classifier f. During the training phase, the goal of the model is, given (xi)i2[1::m] and (yi)i2[1::m], to try to match (xi; f(xi)) with (xi; yi) as well as possible. This learned classifier can then be used for mapping new examples. All the difficulty is to have a classifier that generalizes well from labeled to unlabeled data.
Semi-supervised learning is a class of supervised learning methods that makes use of both labeled and unlabeled data. The main assumption of semi-supervised learning is that even non labeled data should add relevant information to the model. Actually, unlabeled data give interesting information about the prior distribution of the data P(x).
In our case it aims at learning a classifier f, when not all the instances of the training set are labeled. Unlabeled training instance are used together with labeled one to represent local relational constraints on the problem by forcing linked nodes to have close represen-tations.
Transductive methods are always semi-supervised since they use information contained within the test points. The difference is more on the reason why they have been proposed. The main difference is that semi-supervised learning does not need specific test instances, whereas transductive algorithms need the test set that will be used in the end to evaluate its performance (Chapelle et al., 2009).
Concerning our contributions for the classification task (Chapters 4 and 5), using both la-beled and unlabeled nodes make our proposed models semi-supervised ones.
Concerning our contributions, for the forecasting task (Chapter 6), target are directly provided since we forecast at time T +n learning from the past known data. For the ranking task (Chapter 7), we also learn to rank from triplets (user, item, ratings). Thus those two contributions belong to the supervised learning framework1.
Learning Deterministic Representations
In recent years, there has been a surge of interest in learning deterministic (vector) repre-sentations for different machine learning tasks, such as collaborative filtering (Koren et al., 2009), image retrieval (Weston et al., 2011), relation extraction (Riedel et al., 2013), word semantics and language modeling (Bengio et al., 2003; Mnih et al., 2009; Mikolov et al., 2013).
1 Note that prediction could be considered as unsupervised learning since there is no need for human anno-tation. However technically the data are also used as targets and this framework include supervised training methods.
In this section we present a state of the art concerning the models learning deterministic representations, focusing on relational data which is the main topic of this thesis.
In the following, we distinguish unsupervised models and supervised/semi-supervised ones since they rely on different sets of techniques. In our experiments, unsupervised mod-els performed worse than supervised one, yet, a lot of (semi-)supervised models are in-spired from unsupervised ones.
Recent years have seen an increasing enthusiasm for using neural networks to learn supervised representations. For example, (Zhou et al., 2015; Zhang et al., 2014; Che et al., 2015; Wu et al., 2015; Niepert et al., 2016) propose neural network based models to learn deep representations from raw input data. But neural networks models need rich infor-mative raw input features to learn good representations. In our case, no node specific rich input data are available. Note that, we do not cover the models that learn a representation of the whole graph since we focused on the nodes of the graph.
We first study unsupervised representation learning methods. In the context of graphs, this corresponds to a graph, where nodes may (or not) be associated with raw data like e.g. images or text. These learned representations can then be used for different supervised tasks such as link prediction or classification.
Learning from Context
Learning representations from context often comes to learning entity representations in the context of surrounding words in a text or surrounding objects in an image. Firstly intro-duced to learn word embeddings (Mikolov et al., 2013), this methodology was extended to other types of high-dimensional data like images. (Mikolov et al., 2013) learn two latent vectors for each word (entity): a representation and a context vector. These two vectors are used to compute the conditional probability of observing a word given its context (an-other word for instance). The representations are learned by maximizing this conditional likelihood over a corpus.
Most learning from context models can be framed within the class of methods called the exponential family embeddings (Rudolph et al., 2016). Such a family defines a context function, a conditional probability function (generally modeled from the exponential family) and an embedding structure to form the objective function:
The context function defines, given an entity instance (e.g. an occurrence of a word in a text), which are the related entities (e.g. surrounding words).
The conditional probability function is a function of both the representation of the entity and the corresponding context vectors. It is in charge of modeling how repre-sentations (mathematically) interact with their corresponding context.
The embedding structure determines how the model shares the vector representations across the data, i.e. the possible relations between the context vector for a given entity and the representation of the same entity (e.g. same vector for the context and the target representation).
The models proposed in (Mikolov et al., 2013), namely skip-gram and C-BOW, could be set within this framework. We introduce the skip-gram model since it has been adapted to learn representations for relational data.
(Mikolov et al., 2013) proposed an unsupervised model to learn word representations in a vector space, with the goal of using them in natural language processing tasks, by grouping semantically similar words in the same region of the vector space. This is now the basic representation for many neural network models manipulating text.
The goal of the skip-gram model is to learn word representations from which we can infer the context, i.e. surrounding words. Thus words that appear in similar contexts have similar representations. More formally, given a phrase composed of the series of words x1, x2, x3, …, xT , the training objective of the model consists in maximizing the log-probability: log p(xjjxi) (2.1)
where xt is a specific word and xc is a word that appears in the context of xt. The choice of xc depends on the choice of the context function. For example, we can take a window of a given size centered on xt and consider that all the nodes xi in this window are part of the context.
Various conditional probability functions may be used. The basic formulation consists in using a softmax function and learning two representations, zi and zi0 respectively the « target » and « context » vector representation for xi:
p(xjjxi) = exp(zj0 zi) (2.2)
W exp(z0 zi)
W k=1 k
where is the total number of words in the vocabulary.
An interesting property of the learned representations for this model is that some basic algebraic operations on word representation were found to be meaningful and understand-able from a natural language point of view (Mikolov et al., 2013). For example, a simple operation such as z(« France ») + z(« capital ») is close to z(« Paris »). Even complex operations can be done, for example, the closest word representation z(« Lisbon ») z(« Portugal ») + z(« France ») is z(« Paris »). However up to now, there is no real exploitation of these proper-ties.
These ideas have been extended to graphs. (Tang et al., 2015; Perozzi et al., 2014; Cao et al., 2015) learn deterministic representations for relational data with graph nodes instead of words and a list of neighboring nodes instead of window.
The difference between these contributions lies in the way each model constructs the list of neighboring nodes, i.e. the context function. There are various ways to define a sequence of nodes based on a graph structure. (Perozzi et al., 2014; Cao et al., 2015) both use a random walk based approach to generate sequences of nodes. Concerning (Tang et al., 2015), the main difference with (Mikolov et al., 2013) is that the size of the context window is variable depending on the considered node. To define a sequence of nodes, they randomly sort the neighbors of the considered node. Furthermore, in the case of weighted graphs, they assume that the joint probability of two nodes depends on the corresponding weight. Thus, they weight the conditional probability loss of two nodes accordingly to the edge weight. The objective function maximized by their model is then:
wij log p(xjjxi) (2.3) (i;j)2E
where E is the set of edges of the graph and wij the weight of the (directed) edge (i; j). The conditional probability of neighboring nodes is, as in (Mikolov et al., 2013), modeled as a function of the inner product of their representations.
Similarly, (Grover et al., 2016) proposed an unsupervised learning model that maxi-mizes the likelihood of the network neighborhood (context) knowing a node (target). In practice, this neighborhood is defined through a random walk which can, for example, take into account homophily.
(Tang et al., 2015) have shown on various tasks and datasets that their model was the best competitor over different unsupervised models, including (Perozzi et al., 2014). It is thus used as our unsupervised baseline for the classification tasks.
Supervised and Semi-Supervised Models
Learning Nodes and Relationship Representations in Knowledge Bases
In the context of relational data some graphs distinguish different types of nodes and re-lations (Chapter 1). For example, in knowledge bases (KB), different types of entities are linked together with different types of relationships. A concrete example of such knowl-edge base is the Google Knowledge Graph that had, in 2016, more than 70 billions facts and relations about people, places, things.
When multiple types of relations are present, we need more complex models than the simple unsupervised models presented above.
Here we describe a class of supervised models that learn directly the representations without input features. This class of models uses geometric constraints, between two node representations, that depend on the relation type.
Most of these works are concerned with modeling knowledge bases. Information is then provided under the form of triplets (subject (s), relation (r), object(o)). The goal is to learn representations for items (s and o) and relations (r) so as to maximize some loss function (e.g. triplet likelihood). This topic has motivated a series of works (Bordes et al., 2013; Nickel et al., 2011; Bordes et al., 2011; Jenatton et al., 2012; Socher et al., 2013; Socher et al., 2013; Bordes et al., 2014). The main difference between all these approaches lies in how they model the relations between the (s; r; o) elements, and how they calculate the score associ-ated to a triplet (s; r; o). In these models, representations interact with each other through different types of transformations. In the following we describe some representative models of this class. We differentiate them based on which transformation (of the representations) they use: translation, linear or bilinear.
Translation We first present the model introduced in (Bordes et al., 2013), namely TransE, which is a simple but representative example of how these models work.
For each node xs and for each type of relation r, the model learns latent representations, respectively zs and zr, both in Rd. They model relations as a translation in the latent space.
Thus, if there is a directed relation of type r between xs and xo (denoted xs ! xo) the vector zs + zr should be close to zo in the latent space.
The constraint on the latent space could be written as: if xs and xo are linked through the relation r, zs + zr zo, and zs + zr should be far away from zo otherwise. The idea of this model was somehow inspired by the interesting property of representations (coinci-dentally) found in (Mikolov et al., 2013), where, in the example we give in Section 2.3.1, the relation « capital of » was represented by the model as a translation in the embedding space.
To learn such representations, (Bordes et al., 2013) minimize a margin-based ranking loss over the training set: X X max(0; 1 + jjzs + zr zojj2 jj zk + zr zljj2) (2.4) LTransE = (s;o;r)2Er (k;l;r)2Enegr where Er represents the set of edges for the relation type r and Enegr corresponds to the set of edges in Er where the origin node or the destination node of the relation was changed by a random node of the graph. Here, the score of an edge xs ! xo is modeled as S(s; r; o) = jjzs + zr zojj2, but other models have been proposed using different constraints.
Linear Transformation (Nickel et al., 2011) represent the graph as a 3D tensor where each slice represents a given relationship. They then use tensor (as a matter of fact, matrix) fac-torization to predict new relations. With X being a tensor such as Xsro = 1 if the relation (s; r; o) exists and zero otherwise. They factorize each slice Xr as Xr ARrA>, where A is an n m matrix and Rr is an m m matrix modeling the interactions between the elements according to the relation r.
(Bordes et al., 2011) represent each relationship r with two matrices one for outgoing edges Rr1 and the other for incoming ones Rr2 . The score of the triplet used in this model is:
S(s; r; o) = jjRr1 zs Rr2 zojj (2.5)
(Jenatton et al., 2012) proposed to represent nodes as vectors (z) and relations as matrices (R). They learn the representations of nodes and relations using probabilistic models. They then compute the probability of a triplet through a dot product.
S(s; r; o) = (< zs; Rrzo >) (2.6)
At last, (Bordes et al., 2014) introduced two models where nodes and relations are rep-resented in the same latent space but where relations are modeled as tensors instead of translations. In both models, nodes (zs, zo) and relations (zr) are represented in Rd. The first variant they propose for calculating the score of a triplet uses a linear transformation. Giving the representations (zs, zo and zr), the score of the triplet is: Slinear(s; r; o) = (W1zr + W2zs + bs)>(W3zr + W4zo + bo) (2.7) where W are matrices.
Bilinear Transformation In the second model introduced in (Bordes et al., 2014) nodes (zs, zo) and relations (zr) are also represented in Rd, but they use for calculating the score of a triplet a bilinear transformation. Giving the representations (zs, zo and zr), the score of the triplet is:
Sbilinear(s; r; o) = ((W1 zr)zs + bs)>((W2 zr)Eo + bo) (2.8)
where W are tensors of dimension 3 and denotes the n-mode vector-tensor product along the 3rd mode (see (Kolda et al., 2009) for more details).
In (Socher et al., 2013), they also use a bilinear model where they compute the score of a triplet (s; r; o) by: S(s; r; o) = ur>f zs>Wrzo + Vr zo + br (2.9) where Wr is a tensor of dimension k, br and ur are in Rd all representing the relation r.
In practice there is no best model among the ones presented here, since the performances of these models depends on the KB and the task we evaluate the models on.
In this thesis, we proposed models that learn nodes and differentiate relationships based on their type.
All the previous presented learned representations can be used as inputs to learn, for example, a classifier. As seen previously, another way to learn classifier or predictors is to use directly the labels when learning the representations. This is what we discuss in the next section focusing on transductive approaches.
From Unlabeled to Labeled Data in Classification
For graphs, supervised learning models learn representations and classifiers at the same time. Compared with the unsupervised works presented in Section 2.3.1 or the supervised ones on KB presented in Section 2.3.2, the classification task has some specificities. For example, even if labels can be seen as a new type of node (with each nodes linked to its corresponding labels), the node-label relation is very particular and handling them sepa-rately in the model performs better. That is why the node-label relation needs to be treated differently in the loss function.
Furthermore, learning a classifier at the same time as the node representation brings in-formation to the learned representation, and makes a huge difference on the learned vector latent space. In this section, we review such graph transductive models.
(Jacob et al., 2014) introduced a supervised model for classification in the context of heterogeneous graphs. The idea developed in this work is to learn, along with classifiers, a latent representation for each node, whatever their type is, into the same latent space.
Table of contents :
1.1.2 Types of Graphs
1.2 Tasks Studied during the thesis
1.2.1 The Classification Task
1.2.2 The Forecasting Task
1.2.3 The Ranking Task
1.3 Learning Representations
1.4.1 Learning Deterministic Representations for Heterogeneous Graph Node Classification (Chapter 4)
1.4.2 Learning Gaussian Representations
Learning Gaussian Representations for Heterogeneous Graph Node
Classification (Chapter 5)
Learning Gaussian Representations for Relational Time Series Forecasting
Learning Gaussian Representations for Ranking (Chapter 7)
1.5 Thesis Organization
2 Learning Representations for Relational Data
2.2 Learning Representations in Graphs
2.2.1 Inductive and Transductive Algorithms
2.2.2 Supervised, Unsupervised and Semi-Supervised Learning
2.3 Learning Deterministic Representations
2.3.1 Unsupervised Models
Learning from Context
2.3.2 Supervised and Semi-Supervised Models
Learning Nodes and Relationship Representations in Knowledge Bases 22
From Unlabeled to Labeled Data in Classification
2.3.3 Other Applications of Representation Learning
2.4 Capturing uncertainties in representations
I Learning Deterministic Representations and Application to Classification
3 State of the Art: Graph Node Classification
3.2 Graph Node Classification
3.2.1 Simple Transductive Model for Graph Node Classification
3.2.2 Other Graph Node ClassificationWorks
4 Learning Deterministic Representations for Classification
4.2 Learning Representation for Graph Node Classification
Transductive graph model
4.2.2 Prior Parameters and Learning Algorithms
Learned Relation Specific Parameters
Construction of the LastFM Datasets
Evidence For Heterogeneous Node Reciprocal Influence
ComparisonWith Other Models
Evaluation Measures And Protocol
Importance Of The Relations’Weights
Label correlation on the LastFM2 Dataset
II Learning Gaussian Representations and Applications
5 Learning Gaussian Representations for Classification
5.2 Graph Node Classification with Gaussian Embeddings
Prior Parameters and Learned Relation Specific Parameters
6 Learning Gaussian Representations for Relational Time Series Forecasting
6.1.1 Relational Time Series
6.3 Relational Time Series Forecasting with Gaussian Embeddings
Notations and Task Formal Description
6.3.2 Informal description
Impact of minimizing the KL-divergence on predicted values
Inference and Time Complexity
7 Learning Gaussian Representations for Collaborative Filtering
7.1.1 Recommender Systems and Uncertainty
7.2 Learning Gaussian Embeddings for Collaborative Filtering
The Gaussian Embeddings Ranking Model
8 Conclusion et perspectives
Graph Node Classification
Relational Time Series Forecasting
8.1.2 Learning Hyperparameters
8.1.3 From Deterministic to Gaussian Representations
8.2.1 Classification Task
8.2.2 Forecasting Task
8.2.3 Ranking Task
8.2.4 Learning Gaussian Embeddings for Knowledge Bases