The Kendall and Mallows Kernels for Permutations


Introduction

A permutation is a one-to-one mapping from a finite set onto itself. Assuming the finite set is ordered, a permutation can equivalently be represented by a total ranking of the elements of the set. Permutations are ubiquitous in many applications involving preferences, rankings or matching, such as modeling and analyzing data describing the preferences or votes of a population [Diaconis 1988, Marden 1996], learning or tracking correspondences between sets of objects [Huang 2009], or estimating a consensus ranking that best represents a collection of individual rankings [Dwork 2001, Ailon 2008, Arrow 2012]. Another potentially rich source of rank data comes from real-valued vectors in which the relative ordering of the values of multiple features is more important than their absolute magnitude. For example, in the case of high-dimensional gene expression data, [Geman 2004] showed that simple classifiers based on binary comparisons between the expression of different genes in a sample achieve prediction accuracy competitive with much more complex classifiers built on quantitative gene expression levels, a line of thought that has been further investigated by [Tan 2005, Xu 2005, Lin 2009]. In these approaches, an $n$-dimensional feature vector is first transformed into a vector of ranks by sorting its entries, and this rank vector is then presented as input for training a classifier.
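To make the rank transformation concrete, here is a minimal illustration in R (the language used for this chapter's experiments), relying only on the base-R rank() function; the numeric values are arbitrary:

    # Transform a real-valued feature vector into a vector of ranks:
    # each entry is replaced by its position in the sorted order.
    x <- c(5.2, 0.1, 3.7, 2.9)
    rank(x)
    # [1] 4 1 3 2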
Working with permutations is, however, computationally challenging. There are $n!$ permutations over $n$ items, suggesting that various simplifications or approximations are necessary in pursuit of efficient algorithms to analyze or learn permutations. Such simplifications include, for example, reducing ranks to a series of binary decisions [Ailon 2008, Balcan 2008], or estimating a parametric distribution over permutations [Lebanon 2008, Helmbold 2009, Huang 2009].
Kernel algorithms form a class of methods that have proved successful in numerous applications and enjoy great popularity in the machine learning community [Cortes 1995, Vapnik 1998, Schölkopf 2002, Shawe-Taylor 2004]. The essential idea behind these methods is to define a symmetric positive definite kernel $K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ over an input space $\mathcal{X}$, which expresses our belief about the similarity between pairs of points in the input space, and which implicitly defines an embedding $\Phi : \mathcal{X} \to \mathcal{F}$ of the input space $\mathcal{X}$ to a Hilbert space $\mathcal{F}$ in which the kernel becomes an inner product: $\forall x, x' \in \mathcal{X},\ K(x, x') = \langle \Phi(x), \Phi(x') \rangle_{\mathcal{F}}$.
Key to kernel methods is the fact that kernel algorithms only manipulate data through evaluation of the kernel function, allowing one to work implicitly in the potentially high- or even infinite-dimensional space $\mathcal{F}$. This kernel trick is particularly interesting when $K(x, x')$ is inexpensive to evaluate, compared to $\Phi(x)$ and $\Phi(x')$. In particular, kernel methods have found many applications where the input data are discrete or structured, such as strings or graphs, thanks to the development of numerous kernels for these data [Haussler 1999, Kashima 2003, Gärtner 2004, Shawe-Taylor 2004, Schölkopf 2004, Vishwanathan 2009].
In this context, it is surprising that relatively little attention has been paid to the problem of defining positive definite kernels between permutations, which could pave the way to benefiting from computationally efficient kernel methods in problems involving permutations. A notable exception is the work of [Kondor 2008, Kondor 2010], who exploit the fact that the right-invariant positive definite kernels on the symmetric group are fully characterized by Bochner's theorem [Kondor 2008, Fukumizu 2008]. They derive interesting kernels, such as a diffusion kernel for rankings or partial rankings, and demonstrate that kernel methods are flexible enough to handle rank data of diverse types. However, the kernels proposed in their papers typically have a computational complexity that grows exponentially with the number of items to rank, and remain prohibitive to compute for more than a few items.
Here we study new computationally attractive positive definite kernels for permutations and rankings. Our main contribution is to show that two widely-used and computationally efficient measures of similarity between permutations, the Kendall tau correlation coefficient and the Mallows kernel, are positive definite. Although these measures compare two permutations of $n$ items in terms of $O(n^2)$ pairwise comparisons, they can be computed in $O(n \log n)$, which allows us to use kernel methods for problems involving rank data over a large number of items. We show how these kernels can be extended to partial rankings, multivariate rankings, and uncertain rankings, which are particularly relevant when the rankings are obtained by sorting a real-valued vector in which ties or near-ties occur. We illustrate the benefit of kernel learning with the new kernels on two applications: one concerning the unsupervised clustering of rank data with kernel $k$-means, the other focusing on the supervised classification of genomic data with Support Vector Machines (SVMs), reaching state-of-the-art performance in both cases.
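To make the two measures concrete, here is a minimal R sketch evaluating both kernels between two total rankings. It is not the interface of the kernrank package: it relies on base R's cor(), which computes Kendall's tau from all pairs in $O(n^2)$ rather than with the $O(n \log n)$ algorithm discussed here, and it assumes the Mallows kernel takes the form $e^{-\lambda n_d}$ with $n_d$ the number of discordant pairs:

    # Kendall kernel: the Kendall tau correlation between two rank vectors.
    kendall_kernel <- function(x, y) {
      cor(x, y, method = "kendall")
    }

    # Mallows kernel: exp(-lambda * n_d), where n_d is the number of
    # discordant pairs, recovered here from the tau correlation via
    # n_d = (1 - tau) * choose(n, 2) / 2 (valid in the absence of ties).
    mallows_kernel <- function(x, y, lambda = 1) {
      n_d <- (1 - kendall_kernel(x, y)) * choose(length(x), 2) / 2
      exp(-lambda * n_d)
    }

    sigma1 <- c(1, 2, 3, 4)          # identity ranking
    sigma2 <- c(2, 1, 3, 4)          # one adjacent transposition
    kendall_kernel(sigma1, sigma2)   # 2/3: one discordant pair out of six
    mallows_kernel(sigma1, sigma2)   # exp(-1)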
The chapter is organized as follows. In Section 2.2, we prove our main theorem showing that the Kendall and Mallows kernels are positive definite. We extend them to partial, multivariate and uncertain rankings in Sections 2.3.1, 2.3.2 and 2.3.3 respectively. We highlight the relation to the diffusion kernel of [Kondor 2010] in Section 2.4. Finally, we illustrate the relevance of kernel methods for unsupervised (Section 2.5) and supervised (Section 2.6) tasks. Data and R code for reproducing the experiments in this chapter are available via https://github.com/YunlongJiao/kendallkernel_demo. I have also developed kernrank, an R package implementing kernel functions and kernel methods for analyzing rank data [Jiao 2016a].

Extension to Partial Rankings

In this section we show how the Kendall and Mallows kernels can efficiently be adapted to partial rankings, a situation frequently encountered in practice. For example, in a movie recommender system, each user grades only the few movies that they have watched, based on personal interest. As another example, in a chess tournament, each game results in a relative ordering between two contestants, and one would typically like to find a single ranking of all players that globally best represents the large collection of binary outcomes.
As opposed to a total ranking (2.1), partial rankings arise in diverse forms, which can generally be described by
$$X_1 \succ X_2 \succ \cdots \succ X_k,$$
where $X_1, \dots, X_k$ are $k$ disjoint subsets of the $n$ items $\{x_1, \dots, x_n\}$. For example, $\{x_2, x_4\} \succ x_6 \succ \{x_3, x_8\}$ in a social survey could represent the fact that items 2 and 4 are ranked higher by an interviewee than item 6, which itself is ranked higher than items 3 and 8. Note that it carries no information about the relative order of items 2 and 4, nor about how item 1 is rated. For ease of analysis, a partial ranking is often associated with the subset $R \subset S_n$ of permutations that are compatible with all partial orders described by the partial ranking. In this study, two particularly interesting types are:
(i) Interleaving partial rankings. Such a partial ranking is of the form $x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_k}$, $k \le n$, where we have a total ranking for $k$ out of the $n$ items. This type of partial ranking is frequently encountered in real life; for example, in a social survey an interviewee may be unable to rank all of the listed items, so that unranked items are interleaved among the ranked ones. The interleaving partial ranking corresponds to the set of permutations compatible with it:
$$A_{i_1,\dots,i_k} = \{\sigma \in S_n \mid \sigma(i_a) > \sigma(i_b) \text{ if } a < b,\ a, b \in [1, k]\}. \tag{2.5}$$
(ii) Top-$k$ partial rankings. Such a partial ranking is of the form $x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_k} \succ X_{\text{rest}}$, $k \le n$, where we have a total ranking for $k$ out of the $n$ items and also know that these $k$ items are ranked higher than all the other items. For example, the top $k$ hits returned by a search engine form a top-$k$ partial ranking; under a preferential voting system, voters express their vote by ranking some (or all) of the candidates in order of preference. The top-$k$ partial ranking corresponds to the set of compatible permutations, namely those in which the $k$ observed items occupy the $k$ highest ranks in the given order:
$$B_{i_1,\dots,i_k} = \{\sigma \in S_n \mid \sigma(i_a) = n + 1 - a \text{ for } a \in [1, k]\}. \tag{2.6}$$
A toy enumeration of both compatible sets is sketched below.
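The following R sketch (with helper names of our own, not from the chapter's code) enumerates the compatible subsets of (2.5) and (2.6) by brute force for small $n$, encoding a permutation $\sigma$ as a rank vector sigma with sigma[i] = $\sigma(i)$, a higher rank meaning more preferred:

    # All permutations of a vector, built recursively (fine for tiny n).
    perms <- function(v) {
      if (length(v) <= 1) return(list(v))
      out <- list()
      for (i in seq_along(v)) {
        for (p in perms(v[-i])) out <- c(out, list(c(v[i], p)))
      }
      out
    }

    # Interleaving partial ranking (2.5): I lists the observed items
    # i_1, i_2, ... in decreasing order of preference.
    compatible_interleaving <- function(n, I) {
      Filter(function(sigma) all(diff(sigma[I]) < 0), perms(1:n))
    }

    # Top-k partial ranking (2.6): the observed items take the k
    # highest ranks n, n-1, ..., n-k+1 in the given order.
    compatible_topk <- function(n, I) {
      Filter(function(sigma) all(sigma[I] == n + 1 - seq_along(I)),
             perms(1:n))
    }

    length(compatible_interleaving(4, c(2, 3)))  # 4!/2! = 12
    length(compatible_topk(4, c(2, 3)))          # (4-2)! = 2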
To extend any kernel $K$ over $S_n$ to a kernel over the set of partial rankings, we propose to represent a partial ranking by its compatible subset $R \subset S_n$ of permutations, and to define a kernel between two partial rankings $R$ and $R'$ by adopting the convolution kernel, written with a slight abuse of notation as
$$K(R, R') = \frac{1}{|R|\,|R'|} \sum_{\sigma \in R} \sum_{\sigma' \in R'} K(\sigma, \sigma'). \tag{2.7}$$
As a convolution kernel, it is positive definite as long as $K$ is positive definite [Haussler 1999]. However, a naive implementation of (2.7) typically requires $O((n-k)!\,(n-k')!)$ operations when the numbers of observed items in the partial rankings $R, R'$ are respectively $k, k' < n$, which can quickly become prohibitive. Fortunately, Theorem 2.2 guarantees that we can circumvent the computational burden of naively implementing (2.7) with the Kendall kernel $K_\tau$ in at least the two particular cases of partial rankings (2.5) and (2.6).
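For intuition, the brute-force evaluation of (2.7) that Theorem 2.2 circumvents can be written directly in R, reusing compatible_interleaving from the sketch above; it is only feasible for toy values of $n$:

    # Naive convolution kernel (2.7) with the Kendall kernel as base
    # kernel: average the base kernel over both compatible subsets.
    convolution_kendall <- function(R1, R2) {
      vals <- sapply(R1, function(s1)
        sapply(R2, function(s2) cor(s1, s2, method = "kendall")))
      mean(vals)
    }

    R1 <- compatible_interleaving(4, c(2, 3))  # 12 permutations
    R2 <- compatible_interleaving(4, c(1, 4))  # 12 permutations
    convolution_kendall(R1, R2)                # 12 * 12 = 144 kernel evaluations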
Theorem 2.2. The Kendall kernel $K_\tau$ of form (2.7) between two interleaving partial rankings with respectively $k$ and $m$ observed items, or between a top-$k$ partial ranking and a top-$m$ partial ranking, can be computed in $O(k \log k + m \log m)$ operations.
Proof. The proof is constructive. We show here explicitly how to compute the Kendall kernel between two interleaving partial rankings; the idea remains similar for the case of top-$k$ partial rankings. Denote by $[\![n]\!]$ the set of items to be ranked and by $A_{i_1,\dots,i_k}, A_{j_1,\dots,j_m} \subset S_n$ two interleaving partial rankings of size $k$ and $m$ respectively, whose subsets of item indices are denoted by $I := \{i_1, \dots, i_k\}$ and $J := \{j_1, \dots, j_m\}$. We lighten the notation by writing $A_I := A_{i_1,\dots,i_k}$ and $A_J := A_{j_1,\dots,j_m}$, and recall that by definition,
$$A_I = \{\pi \in S_n \mid \pi(i_a) > \pi(i_b) \text{ if } a < b,\ a, b \in [1, k]\},$$
$$A_J = \{\pi' \in S_n \mid \pi'(j_a) > \pi'(j_b) \text{ if } a < b,\ a, b \in [1, m]\}$$
are the subsets of $S_n$ compatible with the two partial rankings respectively. In particular, $|A_I| = n!/k!$ and $|A_J| = n!/m!$. Note that every item that does not appear in the partial ranking corresponding to $A_I$ (or $A_J$) can be interleaved in any possible order with the other items by some permutation in that set.
The key observation in our proof is the "symmetry" of $A_I$ (or $A_J$), in the sense that (i) for every item pair $\{i, j\}$ such that $i, j \in I$, all permutations in $A_I$ are identical on the relative order of items $i$ and $j$; (ii) for every item pair $\{i, j\}$ such that $i, j \in I^c$, for each $\pi \in A_I$ there exists a unique permutation $\rho = \pi \circ (i\,j) \in A_I$, obtained by swapping the ranks of items $i$ and $j$ in $\pi$, such that $(\pi(i) - \pi(j))(\rho(i) - \rho(j)) < 0$ and $\rho$ is identical to $\pi$ on the absolute ranks of all the other items.
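Observation (ii) can be checked numerically on a toy case; the following sketch (again with our own helper names, reusing compatible_interleaving from above) verifies that swapping the ranks of two unobserved items maps $A_I$ onto itself:

    # Check observation (ii): for items i, j outside I, swapping their
    # ranks in every permutation is a bijection of A_I onto itself.
    swap_ranks <- function(sigma, i, j) {
      tmp <- sigma[i]; sigma[i] <- sigma[j]; sigma[j] <- tmp
      sigma
    }

    AI <- compatible_interleaving(4, c(2, 3))           # I = {2, 3}
    AI_swapped <- lapply(AI, swap_ranks, i = 1, j = 4)  # items 1, 4 are in I^c
    setequal(sapply(AI, paste, collapse = " "),
             sapply(AI_swapped, paste, collapse = " ")) # TRUE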


Table of contents:

1 Introduction
1.1 General Background of Breast Cancer
1.2 Towards Molecular Prognosis
1.3 Genomic Data Analysis: Topics, Prospects and Challenges
1.4 Contribution of the Thesis
2 The Kendall and Mallows Kernels for Permutations
2.1 Introduction
2.2 The Kendall and Mallows Kernels for Permutations
2.3 Extensions of the Kendall Kernel to Rank Data
2.3.1 Extension to Partial Rankings
2.3.2 Extension to Multivariate Rankings
2.3.3 Extension to Uncertain Rankings
2.4 Relation of the Mallows Kernel and the Diffusion Kernel on Sn
2.5 Application: Clustering and Modeling Rank Data
2.5.1 Clustering with Kernel k-means
2.5.2 Mallows Mixture Model with Kernel Trick
2.5.3 Experiments
2.6 Application: Supervised Classification of Biomedical Data
2.7 Discussion
3 Network-based Wavelet Smoothing for Analysis of Genomic Data
3.1 Introduction
3.2 Methods
3.2.1 Feature Selection Under Predictive Modeling Framework
3.2.2 Network-guided Feature Selection: A Review of Related Work
3.2.3 Network-based Wavelet Smoothing for Feature Selection
3.2.4 Implementation
3.3 Results
3.3.1 Experiment Set-ups: Data, Network and Methods
3.3.2 Simulation Studies
3.3.3 Breast Cancer Survival Analysis
3.4 Discussion
4 Signaling Pathway Activities Improve Prognosis for Breast Cancer
4.1 Introduction
4.2 Methods
4.2.1 Data Source and Processing
4.2.2 Modeling Framework for Signaling Pathways
4.2.3 Cancer Prognosis with Inferred Signaling Pathway Activity
4.3 Results
4.3.1 Signaling Pathway Activities Lead to Improved Prognosis for Breast Tumor Samples
4.3.2 Signaling Circuits Selected as Features Relevant for Cancer Prognosis Account for Cancer Hallmarks
4.3.3 The Classification Algorithm Suggests Additional Prognostic Genes That Do Not Code for Signaling Proteins
4.4 Discussion
5 Conclusion and Perspectives
A A Tractable Bound on Approximating Kemeny Aggregation
A.1 Introduction
A.2 Kemeny Aggregation Problem
A.3 Geometric Analysis of Kemeny Aggregation
A.4 Controlling the Distance to Kemeny Consensus
A.5 Geometric Interpretation Revisit and Proof of Theorem A.1
A.5.1 Interpretation of the Condition in Theorem A.1
A.5.2 Proof of Theorem A.1
A.6 Numerical Experiments
A.6.1 Tightness of the Bound
A.6.2 Applicability of the Method
A.7 Discussion
Bibliography
