A classification of web content popularity prediction methods
To structure the different popularity prediction methods, we propose a classification that groups the methods according to the type and granularity of the information used in the prediction process (Figure 2.2). We further organize the subsequent chapters based on this classification.
Single domain. We define a domain as the web site where an individual item resides, regardless of whether it was created there or shared from an external source (e.g., news shared on social bookmarking sites). Methods in this category make forecasts about popularity using only the local information about the web item.
Before publication. One of the most challenging objectives is to predict the popularity of web content before its publication, relying only on content metadata or the social structure of the publisher.
After publication. The alternative is to include in the prediction model data about the attention that an item receives after its publication.
Aggregate behavior. A common approach is to infer future content popularity from the aggregate reactions of early users. These methods can be further separated into three main categories:
• Study the cumulative growth of attention, i.e., the amount of attention that a web item receives from the moment it was published until the prediction moment.
• Perform a temporal analysis of how content popularity evolved in time until the prediction moment.
• Use clustering methods to find popularity evolution trends.
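The first two views of aggregate behavior listed above can be illustrated with a minimal Python sketch. The function names, the sample timestamps, and the one-hour bin width are assumptions made for this example and do not come from any of the surveyed methods.

```python
import math


def cumulative_attention(event_times, prediction_time):
    """Count user actions (views, votes, comments) observed between
    publication (time 0) and the prediction moment."""
    return sum(1 for t in event_times if 0 <= t < prediction_time)


def attention_series(event_times, prediction_time, bin_width):
    """Count user actions per time interval up to the prediction moment,
    which keeps the temporal evolution instead of a single total."""
    n_bins = math.ceil(prediction_time / bin_width)
    counts = [0] * n_bins
    for t in event_times:
        if 0 <= t < prediction_time:
            counts[int(t // bin_width)] += 1
    return counts


# Hypothetical event times, in hours since publication.
events = [0.2, 0.5, 1.1, 1.7, 1.9, 3.4, 6.0]
total = cumulative_attention(events, prediction_time=4)
series = attention_series(events, prediction_time=4, bin_width=1)
print(total, series)
```

The cumulative count is the sum of the per-interval series, so the two representations differ only in how much temporal detail they keep.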
Individual behavior. Instead of treating each user action equally, one may further refine the prediction model by taking into account individual user behavior.
Cross domain. Explaining popularity from the perspective of a single domain is limited because users interact in diverse and complex ways across platforms. Methods in this class draw conclusions by extracting and transferring information across web domains.
A survey on popularity prediction methods
Several popularity prediction methods have been proposed in the last decade, from simple linear regression functions to complex frameworks that mine and build knowledge from social media. We describe these methods according to the proposed classification and explain how these methods perform on different types of web content. A summary of these methods is then presented in Table 2.1.
Single domain
In the vast majority of cases, prediction methods rely entirely on the information available on the site where the content has been published.
Before publication
Predicting the popularity of an item before publication is particularly useful for web items with a short lifespan. News articles, which are time-sensitive by nature, fall under this category and have been analyzed in two studies [64, 65].
Tsagkias et al. addressed this problem as a two-step classification problem: predict whether a news article will receive comments and, if so, whether the number of comments will be high or low. The prediction method was a random forest classifier trained on a large number of features (textual, semantic, and real-world). Using several online news sources, the authors showed that one can accurately predict which articles will receive comments, but observed that performance degrades significantly when predicting whether the volume of comments will be high or low.
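The control flow of such a two-step formulation can be sketched in a few lines of Python. This is only an illustration of the cascade: the two toy rules below are hypothetical stand-ins for trained classifiers (such as the random forest used in the study), and the feature names and thresholds are invented.

```python
def predict_comment_volume(article, will_receive_comments, is_high_volume):
    """Two-step classification of an article before publication."""
    # Step 1: will the article receive any comments at all?
    if not will_receive_comments(article):
        return "no comments"
    # Step 2: only for commented articles, is the volume high or low?
    return "high" if is_high_volume(article) else "low"


# Hypothetical stand-ins for trained classifiers; not the study's features.
def toy_receives_comments(article):
    return article["named_entities"] >= 2


def toy_high_volume(article):
    return article["word_count"] >= 800


articles = [
    {"named_entities": 0, "word_count": 300},
    {"named_entities": 3, "word_count": 450},
    {"named_entities": 5, "word_count": 1200},
]
labels = [
    predict_comment_volume(a, toy_receives_comments, toy_high_volume)
    for a in articles
]
print(labels)
```

Because the second classifier only sees articles that pass the first step, the two steps can be trained and evaluated separately, which matches how the results are reported (good accuracy on the first step, weaker on the second).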
Bandari et al., using the number of tweets as an indicator of news popularity, formulated the prediction task both as a numeric prediction problem and as a classification problem. Predicting the exact popularity of news articles gave modest results, even with several regression methods (linear, k-nearest neighbors, and support vector machine (SVM) regression): the models explained only 34% of the variability in the observed popularity (R² = 0.34).
Predicting ranges of popularity has proved to be more effective, with an accuracy of 84% when identifying articles that would receive a small, medium, or large number of tweets.
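The two evaluation views used here, explained variance for numeric prediction and accuracy for range prediction, can be made concrete with a short Python sketch. The tweet counts and the cut-off values separating small, medium, and large popularity are hypothetical and are not the ones used in the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: share of the variance in the
    observed values that the predictions account for."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def to_range(tweets, low_cut=20, high_cut=100):
    """Map a tweet count to a popularity range (cut-offs are assumptions)."""
    if tweets < low_cut:
        return "small"
    if tweets < high_cut:
        return "medium"
    return "large"


def accuracy(true_labels, predicted_labels):
    """Fraction of items whose predicted range matches the true range."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return hits / len(true_labels)


# Hypothetical observed and predicted tweet counts for four articles.
observed = [5, 30, 150, 12]
predicted = [10, 20, 90, 15]

r2 = r_squared(observed, predicted)
acc = accuracy([to_range(o) for o in observed],
               [to_range(p) for p in predicted])
print(round(r2, 3), acc)
```

In this toy data the numeric errors are sizeable, yet three of the four articles still land in the correct range, which illustrates why range prediction can be the easier task.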
After publication – Aggregate behavior
The methods in this category predict the popularity of web items from the aggregate user attention received shortly after publication.
Cumulative growth. One of the first solutions, used to model the popularity of Slashdot stories, was proposed by Kaltenbrunner et al. The model, which we will refer to as growth profile (adopting terminology from prior work), assumes that, depending on the time of publication, news stories follow a constant growth that can be described by the following function: $N_c(t_i, t_r) = N_c(t_i) P(t_i, t_r)$.
After publication – Individual behavior
Instead of treating each user reaction equally in the prediction process, models under this category draw conclusions based on individual user behavior.
Social dynamics, the model proposed by Lerman and Hogg, describes the temporal evolution of the popularity of web content as a stochastic process of user behavior during a browsing session on a social media site. In its original form, it was designed according to the characteristics of the social bookmarking site Digg: stories can be found in three sections of the site (front, upcoming, and friend list pages), users can express their opinions through votes, and stories are arranged in pages (or promoted to different sections of the site) based on the dynamics of votes.
User behavior is modeled through a set of states that describe the possible actions that one can take on a site: browse through the different sections, read news stories, and cast votes to further recommend them to the Digg community. Browsing sessions are dynamic as stories circulate through the site (e.g., they may appear on different sections of the site or change position on the page) depending on the voting results. Individual user behavior is thus linked to the collective behavior, which in the end explains how stories receive votes over time. More specifically, the number of votes a story receives depends on its visibility and general interest. Visibility is expressed as the probability of finding a story in different sections of the site and the interest is linked to the quality of the story estimated by the voting dynamics.
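The interplay between visibility and interest can be illustrated with a toy simulation in Python. This is not the authors' model: it only assumes that the expected number of new votes per time step equals interest multiplied by the number of users who see the story, that visibility decays while the story sinks in the upcoming list, and that it jumps once the story is promoted. All parameter names and values are invented for the example.

```python
def simulate_votes(interest, steps, upcoming_views=50.0, front_views=500.0,
                   decay=0.8, promotion_threshold=20.0):
    """Toy vote dynamics: new votes = interest * visibility at each step."""
    votes = 0.0
    history = []
    for step in range(steps):
        if votes >= promotion_threshold:
            # Promoted story: exposed to the larger front-page audience.
            views = front_views
        else:
            # Unpromoted story: visibility shrinks as it sinks in the list.
            views = upcoming_views * decay ** step
        votes += interest * views  # expected new votes in this step
        history.append(votes)
    return history


high = simulate_votes(interest=0.2, steps=10)
low = simulate_votes(interest=0.02, steps=10)
print(high[-1], low[-1])
```

With these made-up values, the more interesting story crosses the promotion threshold after a few steps and then grows quickly, while the less interesting one loses visibility before it gathers enough votes. The sketch shows only the feedback loop between individual votes and collective visibility, not the state-based browsing model itself.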
The authors validated the model on a small sample of Digg stories by studying user reactions to the publication of stories and by taking into account the relationships between Digg users. Using this model, the authors showed that they can predict, in 95% of cases, which stories will become popular enough to reach Digg's front page. For numeric prediction of the number of votes, the results indicate that the first twenty votes are strong predictors of the final Digg score (RMSE = 593, compared to 610 with a log-linear model).
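For readers unfamiliar with the error measure quoted above, the root mean squared error (RMSE) can be computed as in the following Python sketch. The vote counts are hypothetical and unrelated to the Digg data.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error, expressed in the same unit as the
    observations (here, number of votes)."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical final vote counts and model predictions for three stories.
observed_votes = [10, 20, 30]
predicted_votes = [15, 15, 35]
error = rmse(observed_votes, predicted_votes)
print(error)
```

Because errors are squared before averaging, a few stories with very large prediction errors dominate the score, which matters for heavy-tailed popularity data.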
Factors that influence content popularity
Research on web content popularity has evolved from describing popularity characteristics to understanding their temporal evolution and designing models to predict future popularity. However, during this process, little has been said about the factors that can drive web content to success. We report the main findings on the factors known to have an important impact on popularity growth.
The amount of attention that a web item generates depends on various content-related and content-agnostic factors. In general, the content itself explains much of its popularity. Content that is of high quality [80, 81], generates strong emotions, and has broad geographic relevance [82, 83] is more likely to attract a larger audience. The topic of the content is also important, as popularity is susceptible to bursts of attention in response to real-world events. On the other hand, some elements have a negative impact on content popularity. One of them is the presence of multiple versions of the same content, which tends to limit the popularity of each individual copy.
There are also several content-agnostic factors that have a strong impact on popularity growth. Popular Internet services, such as search tools, recommendation systems, and social sharing applications, can extend the visibility of a web item and increase its popularity. Taking the example of YouTube (one of the most active platforms for this kind of study), the internal search engine accounts for most of the views, followed by the recommendation system and the social sharing tools [17, 85]. The output of these services can also play an important role in popularity. For example, it has been observed that videos have a higher chance of becoming popular if they are placed in the related list of popular videos [20, 86], and that the higher the position of a video in the list, the greater its number of views. The recommendation system thus creates a strong linked structure between similar videos, which influence each other in terms of popularity. This information can be extremely valuable for new videos, which can gain a big advantage by creating relationships with similar popular videos through the choice of a relevant title, description, or keyword set.
Table of contents:
1.1 Context and motivation
1.2 Global scenario and research challenges
1.3 Contributions of this thesis
1.3.1 A survey on predicting the popularity of web content
1.3.2 Predicting the popularity of online news
1.3.3 Proactive seeding based on content popularity prediction
1.3.4 Predicting κ-contact opportunities between mobile users
1.4 Thesis outline
2 A survey on predicting the popularity of web content
2.3 Performance measures
2.3.1 Numeric prediction
2.4 A classification of web content popularity prediction methods
2.4.1 Single domain
2.4.1.1 Before publication
2.4.1.2 After publication
2.4.2 Cross domain
2.5 A survey on popularity prediction methods
2.5.1 Single domain
2.5.1.1 Before publication
2.5.1.2 After publication – Aggregate behavior
2.5.1.3 After publication – Individual behavior
2.5.2 Cross domain
2.6 Selecting the right features
2.7 Factors that influence content popularity
2.8 Predictive proactive seeding: an application of web content popularity prediction
3 Predicting the popularity of online news articles
3.3 Global statistics
3.3.1 Online news data collections
3.3.2 News articles lifetime
3.3.3 Distribution of popularity
3.4 Predicting the popularity of online news articles
3.4.1 Popularity predictions methods
3.4.2 Popularity prediction accuracy
3.5 Ranking news articles based on popularity prediction
3.5.2 Ranking methods
3.5.3 Ranking performance
3.5.4 An alternative to learning to rank algorithms
4 Predictive proactive seeding for mobile opportunistic data offloading
4.3 Global scenario
4.4 Proactive seeding in mobile opportunistic networks
4.4.1 Premise for effective proactive seeding
4.4.2 Proactive seeding strategies
4.5.1 Simulating user behavior
4.5.2 Simulation scenario
5 Beyond contact predictions in mobile opportunistic networks
5.3 Vicinity and data sets
5.3.1 Beyond contact relationships
5.3.2 κ-vicinity, κ-contact, and κ-intercontact
5.3.3 Data sets
5.4 Pairwise relationships under the κ-contact case
5.4.1 Pairwise minimum distance
5.4.2 Analyzing the distribution of pairwise distance
5.4.3 The stability of κ-contact relationships
5.5 Predicting κ-contact encounters
5.5.1 Dynamic graph representation
5.5.2 κ-contact prediction problem
5.5.3 The effect of time-window duration and past data
5.5.4 κ-contact prediction results
5.6 Practical implications
6 Conclusions and future work
6.2 Looking ahead
6.2.1 Improving the quality of the prediction
6.2.2 Smart proactive seeding
6.2.3 Predicting spatiotemporal contacts
6.2.4 Mobile opportunistic data offloading engine
A Summary in French
A.1 Context and motivation
A.2 Problem statement
A.3 Contributions of this thesis
A.3.1 A survey of algorithms for predicting the popularity of web content
A.3.2 Predicting the popularity of articles
A.3.3 Proactive seeding based on content popularity prediction
A.3.4 Predicting κ-contact events between mobile users
A.5.1 Improving the quality of popularity prediction
A.5.2 Smart proactive seeding
A.5.3 Predicting spatiotemporal contacts
A.5.4 Mobile opportunistic data offloading engine
B List of Publications