Probabilistic Graphical Models for Playlist Generation

Get Complete Project Material File(s) Now! »

Collaborative Filtering

The goal of recommender systems based on collaborative filtering is to infer a user’s ranking of previously unrated items and present the user with items corresponding to high inferred ratings. In a formalized way collaborative filtering can be said to present a user u ∈ U with products p ∈ P that the user has not yet rated such that the ratings Rup for those products are maximized[7].
Collaborative filtering focuses on user’s past behaviour. From the past behaviour of a specific user and past behaviour of similar users the ranking for the specific user for a certain item is inferred[32][33]. In other words, a user gets recommen-dations of items that other users with similar taste like[1]. Finding similar users mean that some type of similarity metric must be used. Often items and users are mapped into vector spaces and distance metrics are used as similarity metrics. A distance metric could be the euclidean distance, but also the cosine value of the angle between vectors, also known as the cosine similarity or cosine distance, is used for collaborative filtering[32]. Collaborative filtering has the advantage that it typically relies on past user behaviour without the need of explicit user profiles[21]. This means that only the past behaviour of a user is needed to use collaborative filtering and the creation of specific profiles that model a specific userÂ´s taste is not needed. As collaborative filtering, in the general case, only looks at user data it is also domain free, i.e. the model is not dependent on whether users have rated books, movies, music or a combination thereof[15].
Collaborative filtering can be used with explicit user feedback. One example of explicit user feedback are the movie ratings used by Netflix or the thumbs up or thumbs down method used by the radio streaming feature from Spotify. Collabora-tive filtering can also be used with implicit user feedback[15]. In the music context implicit feedback could be whether a song has been played or skipped and in the movie context implicit feedback could be if a movie is watched for more than fifteen minutes or not.
Collaborative filtering methods can be divided into two categories, memory-based and model-based. Memory-based collaborative filtering algorithms can be seen as user based as they try to find recommendations based on what other users like. The model-based algorithms can be seen as item based, as they often seek to find similar items to the ones for which a user has a liking[32]. Memory-based collaborative filtering algorithms operate on the entire user-item matrix where the full user history data set is used[7]. The user-item matrix could for example consist of one user per row and one item per column, see figure 2.1. This data set is used to predict a preference for a previously unseen item for a specific user. To perform predictions first the rows most similar to the row corresponding to the specific user are found. The ratings of the users corresponding to these rows for the unseen item are then used to predict the rating for the specific user. As similar user’s ratings are used to predict a specific user’s rating memory-based models can also be thought of as neighbourhood models as similar users are treated as the near-est neighbouring users in user space[15]. There are various ways of implementing a memory-based model, but a naive way could be to find similar rows by using the cosine similarity and then simply averaging the rating of the top-n similar users for a specific item. This naive approach has a O(M N2) complexity where M is the number of users and N the number of items. One downside of this approach is that it does not scale very well when the user-item matrix is large. Another downside is that the user-item matrix is likely to be very sparse. Imagine a online retailer that has millions of products, one percent of a million is ten thousand and a typical user is likely to have rated less than ten thousand products. Using a nearest neighbour approach in this sparse setting can lead to poor performance[32][33]. The reason for this poor performance is that in a high-dimensional space the distance to the closest point, for any given point, approaches the distance of the furthest point as the dimensionality increases for many distributions[5]. This would mean that for a given user the inferred ratings for previously unseen items would have equal weights from users similar and diﬀerent to the given user.
Model-based collaborative filtering means that historical data of users is used to create a probabilistic model for ratings. At run time the model, rather than the entire user history data set, is used to make predictions of items for users[7]. Model-based approaches are likely to scale better than memory-based ones[32]. One approach to model-based collaborative filtering is to use latent factors. This means that each user would be associated with a user-factors vector xu ∈ Rf and each item with an item-factors vector yi ∈ Rf , where the number of latent factors f is less than the number of items or users. The predicted value of a user for an item would then be the inner product between the corresponding user and item vectors, i.e. rˆui = xTu yi. To avoid overfitting the model can be regularized, which means penalizing complex models. A cost function as follows is then obtained: minx∗,y∗ X(rui − xuT yi)2 + λ(||xu||2 + ||yi||2) (2.1)
In this equation rui is the explicit user feedback, for example a rating on the scale one to five. The latter part of the equation is the added penalization, which can be seen to penalize large values for x and y.
The problem with equation 2.1 is that it assumes knowledge of explicit feedback. In the context of music recommendation the case is rather that implicit feedback is available than explicit. It is for example much easier to collect information about which songs a user has listened to than to collect ratings. Even if ratings are collected, then the number of songs a user has streamed is likely to be much larger than the number of rated songs.. What can be done in this case is to use binary labels expressing whether a user has preference for an item or not. Having preference for an item could mean that the user has streamed that song and not skipped it for example. Therefore the binary variable pui is used to describe user preference.
There is however an uncertainty to the preference a user has. Has a user re-ally preference for a song that is selected on Spotify Radio while the user was in another room? What can be done is to create confidence variables, variables that gives a describes how certain we are that a user has a preference for an item. Con-fidence variables could for example depend on the number of times a song has been streamed. What can be done here is to use another variable cui = 1 + αrui where rui is the number of times user u has streamed item i. The α term is to weight the number of times a song has been streamed, as we do not necessarily want to increase the confidence linearly with the number of times a song has been played.
The resulting cost function then becomes: minx∗,y∗ X cui(pui − xuT yi)2 + λ(||xu||2 + ||yi||2) (2.2)
Problems still remain as users and items can contain bias. The remedy is to enter bias terms, the resulting cost function is then: minx∗,y∗ X cui(pui − xuT yi − bu − bi)2 + λ(||xu||2 + ||yi||2) (2.3)
Where bu is the user bias term and bi is the item bias term.
The resulting problem is a non-convex optimization problem, but by fixing either the user or item vectors the problem becomes convex and can be solved by the use of alternating least squares, where the cost function is guaranteed to get a lower value with each iteration[15]. This approach is similar to the collaborative filtering method used in production for recommending songs, artists and albums at Spotify[16].
The downside with collaborative filtering is that it suﬀers from something called the cold start problem or first rater problem. If a user yet have not rated any items collaborative filtering will fail to give that user good recommendations. The same things applies to items that no one has yet rated or songs that no one has played yet, if neither explicit or implicit feedback for items has been given then they cannot be recommended[13][21]. In the context of Spotify where twenty thousand songs are added each day this will pose a problem.

Content Based Approaches

In contrast to collaborative filtering approaches content based recommendation sug-gest items that are similar to items a user has had preference for in the past. This can be done by either comparing items to items or to create a user profile based on a users preferred items[1]. Content based recommendation requires a representation of each item and or user, typically by mapping items and users into vector spaces. An rudimentary example of user profiles created from preferred items would be to represent a user by an initially empty vector and each time the user presents prefer-ence for an item the item vector is added to the user vector which is then normalized. Content based approaches look at discrete features of items and tries to infer a sim-ilarity between two items given their similarity of features. Similar to the approach of finding similar users in memory-based collaborative filtering distance metrics can be used as similarity measures in content based recommendation. A parallel can be drawn between content based recommendation and information retrieval. In the context of content based recommendation the utility function that is maximized is the similarity between items, which typically is the inverse of the items distance[1].
One of the main features of content based recommenders are that they are able to provide recommendations even when little user history data is available, something that is one of the major drawbacks of collaborative filtering[11].
Diﬀerent approaches can be used to create the features used in content based rec-ommendation in the music domain. One approach is to simply have human experts annotating tracks with information[25][35]. Other approaches could be to extract properties from the audio signal. One such example is the use of MFCCs, Mel-Frequency Cepstral Coeﬃcients, which creates features from short time intervals of each track[19] and another is to use Deep Belief Networks[12].
An interesting property of content based recommendation is that it allows for relevance feedback, for example with use of the Rocchio algorithm. The Rocchio algorithm allows for a user to select a subset of recommended items as relevant and move the recommendations displayed towards the direction of those items in the vector space items are represented in[26].
Downsides with content based recommendation are that a user can never be rec-ommended something that is not similar to what the user has expressed preferences for in the past. Further, content based recommendation is limited to the features of items. If the features used to describe items are poor a content based recommender system is likely to perform poorly. Lastly, content based recommenders do not take sequential information into account. Thus a well written news article is seen identical to the same article written backwards as they contain exactly the same words[1].

READ Neuroanatomy of Multimodal Associative Areas

Hybrid Systems

Hybrid systems are recommenders that combine both the techniques of collabora-tive filering and content based filtering, with the purpose of thus obtaining better recommendations. The underlying assumption is that a combination of content based recommenders and recommenders using collaborative filtering can redeem the weaknesses these methods face on their own[11].
Hybrid recommenders can be made by combining the results of collaborative filering methods with content based methods, by incorporating properties of one method into the other or by creating a model that incorporates properties of both types of systems[1].
A simple example of a hybrid system could be to create an ensemble model that weight scores from content based and collaborative filtering based recommenders. Given an item i ∈ I that has an inferred rating rcollaborativefiltering from a collabora-tive filtering recommender and an inferred rating rcontentbased from a content based recommender a way of combining thesis approaches could be to simple set the rating of the hybrid model to rhybrid = α1rcollaborativefiltering + α2rcontentbased Here α1 and α2 are the weights to each of the individual recommenders.
One way of combining collaborative filtering and content based recommendation into one single method is to first create user ratings for all items that are completely lacking ratings by a content based method. For example, for all users that rated a product similar to the unrated item a content inferred rating can be created by taking a user’s rating of the item closest to the unrated item and multiplying it with one minus the cosine similarity of the items runrated = (1 − (cos(irated, iunrated )) ∗ rrated). This matrix can then be used for traditional collaborative filtering and the cold start problem for unrated products will no longer be an issue. Another way of tackling the same problem would be to infer ratings of all unrated items for all users in a similar manner and then use collaborative filtering to weight the vote of each previously unrated item for a specific user. That is an approach that have been used to find improvement in recommendation over a pure content based approach[21].

Playlist Generation

The approaches of recommendation presented gave a general overview to the field of recommender systems. However, these approaches where either based on item to item recommendation or recommendations of the type item to user. The problem this thesis intends to solve regards music or more specifically playlists, why previous work related to playlist generation is presented here.

Probabilistic Graphical Models for Playlist Generation

Earlier attempts of playlist generation has been made by Microsoft Research[28]. Ragno, Burges and Herley has made a model for playlist generation that employs probabilistic graphical models. Probabilistic graphical models can be said to be graph-based models that give a compact way of describing the structure in a dis-tribution. A probabilistic graphical model describes the conditional dependence between random variables in the form of a graph[17]. In a music setting this could mean that given that a song A is spotted in a playlist we can get the probabil-ity for song B being in the same playlist. The probabilistic graphical model used for playlist generation by Ragno, Burges and Herley can take any type of ordered playlist material, such as curated playlists or albums, as training data, and use this input data to generate new playlists as output data. The initial step in the model is to create an undirected graph.
In this graph each song will be represented by a node. Each edge in the graph corresponds to a first to nth ordered Markov property. What this means is that for a first order Markov property edges between songs adjacent to each other in a playlist or album will be present in the graph. For a second order Markov property songs that are one or two steps before or behind any given songs will have edges between the corresponding nodes. The weights corresponding to the edges in the graph depends on how many times the songs fulfills the nth order Markov property in the input data. For example, if a song A is adjacent to a song B twenty times in the input data playlists the weights on the edge between node A and B will be twenty.
Once the undirected graph is created for all songs in the music library a directed graph is created from the undirected graph.
When the undirected graph is converted into a directed graph the weights of edges, the transition probabilities, are normalized by the sum of outgoing weights from each node. That is if the edge between nodes A and B in the graph has a value of five and node B has outgoing edges which sum to ten and node A has outgoing edges that sum to fiften then the value of the directed edge from A to be will be edgeAB = 5/10 and the edge between B and A will have a value of edgeBA = 5/15. Once the directed graph is created a playlist can be generated by selecting an initial seed song, choosing a node as starting point, and simply performing a random walk in the graph. This model assumes that the connectivity between songs is independent of the order in input data. From one perspective this is a bad assumption as DJs, playlist curators and artists spend a large amount of time perfecting the order of songs, information that is lost in this model. On the other hand, consider the following example: song A is followed by song B five thousand times in input data, but song B is never followed by song A in. Then songs B and C both follow each other five times each. First creating an undirected graph and then a directed graph would not rank the connectivity between B and C as being higher, which is a reasonable behaviour. Lastly, a problem that might occur is something called playlist drifting. Playlist drifting would mean that the generated playlist would consists of songs in the following order: A, B, C, D. Where each song is realted to the song before, but song D might be completely unrelated to song A. The solution to this problem would be to use higher order Markov properties, that is adding taking songs further away from each song into considerationwhen creating the graphical model[28].
Problems with the directed graphical model approach taken to playlist genera-tion by Rango, Burges and Herley is that if you generate a playlist from a random walk you cannot chose the playlist theme for the generated playlist on before hand. Another problem with the probabilistic graphical model approach to playlist gen-eration is that the graph created during training phase only works for songs that are in the training data set. This model is not generalizable in the sense that as new songs are added updates are needed for the graph in which the random walk is performed to create a playlist. This is not an ideal situation for Spotify as a high quantity of songs are added to the Spotify music library every day[34].

Gaussian Processes for Playlist Generation

An approach for playlist generation that is generalizable to data outside the data used to train the model is to employ the use of gaussian processes[27]. A gaussian process is characterized by its mean and covariance functions[20]. A gaussian pro-cess is a model that associates each data point with a stochastic normally distributed variable and provides a joint gaussian distribution over the data[29]. Gaussian pro-cesses are useful when the number of dimensions is large compared to the number of data points. The setting of few data point but many dimensions is relevant in the playlist generation case. A playlist can be about ten, twenty or thirty songs long, while the number of dimensions for each song can be on the magnitude of music genres. As there are hundreds of music genres, the number of features for each song in a playlist might then very well exceed the number of songs in that playlist. A gaussian process does not provide a direct function that reflects the data but rather a distribution over functions[30]. To employ a gaussian process a prior assumption of distribution is needed and as data is seen the prior is updated according to data. This new distribution that incorporates data is called a posterior distribution. The posterior can then be used similarly as the prior for the next data point, then called the predictive posterior.
In the work taken by Platt et al, also at Microsoft Research the authors try to learn a gaussian process prior from training data[27]. This prior is then used together with a set of songs, for which a user has expressed preference, to gener-ate a playlist given an initial seed song. In the training phase a blend of linear kernels is used to learn the relationship of meta data features among songs that come in sequence. A blend of linear kernels is the same things as weighting sev-eral linear predictions where weights are learned from data. The coeﬃcients for each linear kernel is learned by finding the coeﬃcient that minimizes the diﬀerence between the empirical covariance between songs and the value given by the lin-ear kernel. Empirical covariance is in this case as simple as whether the training data songs belong to the same album or not. Once the training phase is done the playlist generation phase consists of predicting the user preference for each song in a set of candidate songs. This is equal to calculating the posterior probability p(pref erenceF orN ewSong|knownP ref erences). This probability is calculated by weighing the blend of linear kernels between a seed song and each candidate song with a factor. This factor is the sum of similarity between the initial seed song and each user preference song weighted by how central each user preference song is in the preference space. Playlist generation is then done by simply choosing the songs with highest probability for user preference[27].
This model generalizes to new songs, but the user preference space is seen as one single space. This is a simplification of reality where a user preference space is probably divided into several categories, for example a workout preference space and a chill-out preference space, something the model provided does not take into account, which can be claimed as a weakness in terms of generating a playlist adapted for a specific theme or context. Neither does the model take the ordering of songs into account, but the authors argue that the ordering obtained from listing the songs by their predicted user preference value creates a good order of songs[27].

Table of contents :

1 Introduction
1.1 Project Introduction
1.2 Project Aim
2 Background
2.1 Collaborative Filtering
2.2 Content Based Approaches
2.3 Hybrid Systems
2.4 Playlist Generation
2.4.1 Probabilistic Graphical Models for Playlist Generation
2.4.2 Gaussian Processes for Playlist Generation
3 Representation Learning
3.1 Representation Learning
3.2 Principal Component Analysis
4 Representing Playlists
4.1 Assumptions
4.2 Data
4.3 Exploratory Data Analysis
4.4 Learning Playlist Characteristics
4.4.1 Handling zero variance terms
4.5 Selecting candidate songs for a playlist
4.5.1 Subspace method
4.6 Playlist Comparison
4.7 Approximate Nearest Neighbours
5 Results
5.1 Precision
5.2 Comparing Model to Baseline
5.3 Confusions
5.4 Qualitative Evaluations
6 Discussion and Future Work
6.1 Discussion
6.1.1 Complexity Analysis
6.2 Future Work
Bibliography