Lack of User Input: Cold Start and Sparsity of Data
In this section, we present work on open problems in the design of information filtering systems. In particular, we consider two open issues: cold start and sparsity of data. The cold start problem occurs when a system is too young to have enough data to infer what content to propose. To solve this problem, Schein et al. combine collaborative filtering with content information to recommend unrated items in a system based on collaborative filtering. Lam et al. approach the problem using information about the users. They base their work on the idea that people with the same features also share the same interests. In the long term, the lack of content with user feedback is called sparsity of data. For example, in a system that requires users to evaluate items explicitly, most users can free-ride on the system and never evaluate any item. To solve this problem, Pazzani et al. augment the rating information with personal information about the user (e.g. age and gender). The scientific community refers to this kind of approach as “demographic filtering”. Huang et al. model the problem as a neural network and use spreading activation algorithms to explore transitive associations among users. Billsus et al. and Sarwar et al. take another approach: they reduce the dimensionality of the original feedback matrix using Singular Value Decomposition. Note that, as WeBrowse does not require user engagement, it is affected neither by the cold start problem nor by sparsity of data.
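To make the dimensionality reduction idea concrete, the following is a minimal sketch, not the method of Billsus et al. or Sarwar et al. It computes a rank-1 truncated SVD of a toy feedback matrix with power iteration, in pure Python. The matrix values, the convention that 0 stands for "no feedback", and all function names are assumptions made for illustration.

```python
import math


def top_singular_triplet(matrix, iterations=200):
    """Return (sigma, u, v) for the largest singular value of `matrix`.

    Uses power iteration on M^T M; intended for small dense matrices only.
    """
    rows, cols = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(cols)] * cols  # arbitrary unit-length start vector
    for _ in range(iterations):
        # mv = M v
        mv = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = M^T (M v)
        w = [sum(matrix[i][j] * mv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in mv))
    u = [x / sigma for x in mv] if sigma else [0.0] * rows
    return sigma, u, v


def rank_one_approximation(matrix):
    """Best rank-1 approximation sigma * u * v^T of `matrix`."""
    sigma, u, v = top_singular_triplet(matrix)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Toy user-by-item feedback matrix (rows: users, columns: items).
# 0.0 stands for "no feedback", which is an assumption of this sketch.
ratings = [
    [5.0, 4.0, 0.0],
    [4.0, 0.0, 4.0],
    [0.0, 5.0, 5.0],
]
approx = rank_one_approximation(ratings)
```

In the low-rank reconstruction `approx`, cells without feedback receive non-zero scores inferred from the dominant pattern in the observed ratings. A real system would typically use a numerical library and keep more than one singular value.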
Finally, some works propose to augment the input of information filtering systems with information about the communities of the users. Notable examples of community-based systems are [30, 31, 32, 33, 29]. They demonstrate that, in general, social-based recommendation leads to better results than recommendation based on the similarity of users’ preferences, across a heterogeneous set of taste-related domains (e.g. movie catalogues, club ratings, webpages and blogs).
Characterizing Existing Tools
Much work has been devoted to understanding and characterizing the information filtering systems that we presented and analyzed. When users share content on social networks, they act as a filter for other users. Bakshy et al. study this effect on Facebook. They relate the probability of sharing to other metrics, such as the number of comments an item has and the number of messages a user received. Similarly, Kwak et al. evaluate Twitter’s efficiency as a news propagation platform and propose advanced criteria for generating user list recommendations. Bakshy et al. predict the influence of users on Twitter. The features of their predictor include the number of followers, the number of tweets and the date of joining. The scientific community has produced plenty of studies about Reddit. Singer et al. report the growing popularity of this website. They study five years of data and observe exponential growth in submissions.
Gilbert exposes the problem of underprovision on Reddit. In particular, this work shows that 52% of the most popular links are ignored the first time they are submitted. Other works are more general: instead of focusing on a single system, they survey systems similarly to what we do in Sec. 2.1.1. Adomavicius et al. classify information filtering systems by their suggestion method. They divide the systems into content-based, collaborative and hybrid. For each of these families, they provide examples of two types: heuristic-based and model-based. Su et al. survey existing collaborative filtering techniques. They group the systems into memory-based, model-based and hybrid recommenders. For each group, they cite representative techniques and list the main advantages and disadvantages. Some of the existing tools (e.g. Reddit, Storify, Pearltrees) are crowdsourced: their users share content on the tool, and the tool uses community votes to promote content. Although these are powerful tools, prior work has highlighted some of their weaknesses.
First, most users participating in platforms like Reddit are not active. This slows down information propagation in the community. Second, although they target the web at large, such communities tend to become closed and self-referencing, with mostly user-generated content. Third, they suffer heavily from “freeloaders”, i.e. passive users who rely on others to actively share content, and from a large amount of submitted content that is never viewed. Other tools like Facebook’s Paper and Twitter Trends overcome these issues, as they build on huge social networks to discover novel and popular content. Similarly, social bookmarking tools like Delicious and Pinterest promote content based on users who organize collections of webpages.
Tools for Communities
All the tools described in the previous section share the problem of requiring users to actively share content, which is difficult to obtain in general [43, 70]. In contrast, WeBrowse is a system for content sharing in communities of a place based solely on the passive observation of network traffic and, to the best of our knowledge, it is the first system of its kind. The system most similar to WeBrowse is perhaps Google Trends, which passively observes users’ activity in Google Search (i.e., search queries) to extract trends about the most searched topics. The result, however, differs from the one produced by WeBrowse, as Google Trends returns the keywords used in search queries instead of links to content.
Other proposals aim at personalizing the recommended content to match users’ interests, or at offering news on a regional basis. WeBrowse, in its current design, does not offer personalized recommendations beyond the community-based aspect, but we plan to explore this aspect in future work.
Information filtering systems for communities have also been deployed in workplaces. Zhao et al. show the benefits of using Twitter at work: professional benefits (e.g. workers are aware of the work of colleagues who are not located nearby) and personal benefits (e.g. workers develop a “sense of community” inside the enterprise). DiMicco et al. show similar results. They design a social network for IBM and find that people use it to discover colleagues who worked on similar projects and whom they did not know before. Other systems have been designed specifically for content promotion in communities-of-a-place scenarios such as campuses and neighbourhoods. For example, several enterprises and universities deploy internal social networks [37, 38, 39], create blogs [40, 36] and wikis. Steinfield et al. find that using a social network inside an enterprise increases the value of social capital. The term social capital refers to the resources that derive from the relationships among people in varying social contexts, and enterprises consider it an asset.
Finally, in systems that work in communities of a place, privacy is a sensitive issue. Indeed, the set of users is in general small, so it can be easy to link data present in the system to its owner. A notable amount of work has investigated the relation between privacy and the usage of tools for communities. Some of this work is related to the design and deployment of WeBrowse. In fact, in designing the system, we took into account the requirements of the privacy boards of the institutions hosting it (i.e. Inria and Politecnico di Torino). The seminal work on privacy perception is by Acquisti et al. They use both surveys and information retrieved from the network to investigate users’ perception of privacy, taking a university campus as the population of their study. They find that undergraduate students underestimate the impact of social network usage on their privacy, and that some users have an unrealistic conception of the size of the network they share their information with. Similar surveys have also been conducted more recently [75, 76, 77].
Evaluating Suggestions
The traditional metric to evaluate suggestions is accuracy. For example, Netflix offered a prize of one million dollars to the BellKor’s Pragmatic Chaos team, which improved the accuracy of Netflix’s algorithm by 10%. Measuring accuracy means measuring how the ranking proposed by a system differs from the ranking made by the user. Shani et al. give an extensive description of prediction accuracy techniques and divide them into three classes: the accuracy of rating predictions, the accuracy of usage predictions, and the accuracy of item rankings. Other works also address the accuracy of suggestions. Steck tackles the problem of popularity bias, which arises when users are more likely to provide feedback on popular items. In the case of WeBrowse, popularity bias arises because users click more on popular content, generating a rich-get-richer effect. The author creates a model that gives more weight to less popular items in order to correct the bias.
Accuracy is not the only evaluation metric available. McNee et al. claim that using accuracy as the only evaluation metric can hurt the systems. They propose other metrics such as the diversity of suggestion lists, serendipity, and user needs and expectations. With a lack of diversity in suggestion lists, once a user consumes an item, the suggestions they receive are too similar to it. For example, a user buying a book on Amazon may receive as suggestions only books by the same author. To solve this problem, Ziegler et al. design a metric to measure the similarity of the proposed items. Serendipity is the experience of receiving unexpected input. Ge et al. evaluate the relation between serendipity and accuracy. They show that an increase in serendipity means a decrease in accuracy and propose “explained suggestions” to mitigate this effect. Knijnenburg et al. design a framework to evaluate the user experience of information filtering systems. They use questionnaires to correlate user experience with objective metrics.
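As an illustration of two of the metric families discussed above, the sketch below computes the root mean squared error (RMSE), a common measure of rating prediction accuracy, and an average pairwise similarity of a suggestion list as a simple diversity indicator. It is a hedged example only: the Jaccard similarity over item attribute sets and the function names are assumptions, and the second function is not claimed to reproduce the exact metric of Ziegler et al.

```python
import math
from itertools import combinations


def rmse(predicted, actual):
    """Root mean squared error between predicted and observed ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def intra_list_similarity(items):
    """Average pairwise similarity of a suggestion list.

    Higher values mean a less diverse list. Each item is represented by a
    set of attributes (hypothetical representation chosen for this sketch).
    """
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical suggestion lists: each item is a set of (made-up) attributes.
same_author_list = [{"author:A", "genre:crime"}, {"author:A", "genre:crime"}]
mixed_list = [{"author:A", "genre:crime"}, {"author:B", "genre:poetry"}]
```

With these toy lists, `intra_list_similarity(same_author_list)` is higher than `intra_list_similarity(mixed_list)`, which matches the intuition of the book example: a list made only of books by the same author is less diverse.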
Mining of Web Logs
WeBrowse is not the first system to mine HTTP logs for URL extraction. For instance, this idea was already proposed by Jia et al. However, the result obtained with their approach, i.e., the number of clicks to web portals, is no different from what services like Alexa already provide. Indeed, the main challenge behind this task resides in the detection of what we call user-URLs from the set of all URLs, and only a few other research studies address this problem [85, 86, 87, 88]. Among them, only StreamStructure and ReSurf are still adequate to investigate the complex structure of modern web pages. Neither of them, however, is appropriate for WeBrowse, as they rely on the HTTP Content-Type, which is often unreliable; relying on it also makes traffic extraction more complex (because requests must be matched with responses) and unsuitable for an online system such as WeBrowse. Our experiments in Sec. 3.4 confirm this observation. Eirinaki et al. survey the existing methods for web log mining. However, being dated 2003, this work is not up to date. Yang et al. use some of the techniques we use in this work for web log mining. For example, they detect the objects composing a web page. However, as their goal is to improve caching and prefetching, they do not need to reconstruct the page. More recently, Neasbitt et al. proposed ClickMiner. ClickMiner has similarities with the click extraction part of WeBrowse. For example, they both reconstruct the pages using the referrer field of the HTTP request message. However, ClickMiner is based on a two-step procedure that makes it impossible to use online. Neasbitt et al. also take a completely different approach with WebCapsule, a browser extension that is able to log all user activity. Naylor et al. study the diffusion of the HTTPS protocol. They find that it accounts for 50% of total connections. They also discuss how its increasing adoption will impact the network and web log mining. In particular, they foresee the end of caching, compression, content optimization and filtering. Recently, Nelms et al. proposed WebWitness. This tool tracks users’ browsing activity, which they call web paths. However, their goal differs from ours: WebWitness is a tool that identifies threats (i.e. malware) to understand attack trends. Vassio et al. propose an approach similar to ours for click extraction from HTTP logs. However, they focus on offline data and on improving recall and precision.
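The use of the referrer field to reconstruct pages can be sketched as follows. This is a simplified illustration and not the actual algorithm of ClickMiner or WeBrowse: it only groups requests under the URL named in their `Referer` header. The log entries, hostnames and function names are hypothetical.

```python
from collections import defaultdict


def build_referrer_tree(requests):
    """Map each URL to the list of URLs requested with it as Referer.

    `requests` is an iterable of (url, referer) pairs in arrival order;
    `referer` is None when the request carried no Referer header.
    """
    children = defaultdict(list)
    for url, referer in requests:
        if referer:
            children[referer].append(url)
    return dict(children)


def requests_without_referrer(requests):
    """URLs requested with no Referer, e.g. typed or bookmarked pages."""
    return [url for url, referer in requests if not referer]


# Hypothetical HTTP request log: (requested URL, Referer header value).
log = [
    ("http://news.example/story", None),
    ("http://news.example/style.css", "http://news.example/story"),
    ("http://news.example/photo.jpg", "http://news.example/story"),
    ("http://blog.example/post", "http://news.example/story"),
    ("http://blog.example/logo.png", "http://blog.example/post"),
]
tree = build_referrer_tree(log)
```

In this toy log, the stylesheet, the photo and the blog post all name the same story as their referrer. The referrer alone therefore does not separate embedded objects from pages the user actually clicked on, which is why detecting user-URLs requires further processing.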
Content-URLs versus Portal-URLs
This section describes how WeBrowse distinguishes candidate-URLs pointing to web portals from those pointing to specific content. We use the term web portal (or portal-URL) to refer to the front page of a content provider, which generally has links to different pieces of content (e.g., nytimes.com/ and wikipedia.org/), whereas a content-URL refers to the webpage of, e.g., a single news story or a Wikipedia article. We first design an offline classifier. We then engineer an online version.
Given the heterogeneity of URL characteristics, we opt for a supervised machine learning approach to build the classifier. We choose the Naive Bayes classifier since it is simple and fast and thus suitable for an online implementation. As we will see, it achieves good performance, so more advanced classifiers are not needed.
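To show why Naive Bayes is cheap enough for online use, here is a minimal sketch of a Gaussian Naive Bayes classifier in pure Python. The Gaussian likelihood, the variance floor, the toy training data and the two illustrative features (a URL length and a 0/1 flag) are all assumptions of this example; the text does not specify which Naive Bayes variant WeBrowse uses.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples, labels, var_floor=1e-3):
    """Estimate a class prior and per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The floor avoids zero variance, e.g. for constant binary features.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, var_floor)
            for col, m in zip(columns, means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    def log_posterior(prior, means, variances):
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score

    return max(model, key=lambda y: log_posterior(*model[y]))


# Made-up training set: [URL length, 1 if the URL has no path else 0].
train_x = [[12, 1], [15, 1], [18, 1], [60, 0], [85, 0], [72, 0]]
train_y = ["portal", "portal", "portal", "content", "content", "content"]
model = train_gaussian_nb(train_x, train_y)
```

Training is a single pass that accumulates per-class statistics, and prediction costs one small sum per class, which is consistent with the simplicity and speed argument above.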
We use five features to capture both URL characteristics and the arrival process of the visits that users generate.
URL Length. It is the number of characters in the URL. Intuitively, portal-URLs tend to be shorter than content-URLs.
Hostname. This is a binary feature. It is set to one if the resource in the URL has no path (i.e., it is equal to “/”); and to zero, otherwise. Usually, requests to portal-URLs have no path in the resource field.
Frequency as Hostname. This feature counts the number of times a URL appears as root of other candidate-URLs. The higher the frequency, the higher the chances that the URL is a portal.
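The three features described above could be computed along the following lines. This is a sketch under stated assumptions rather than the WeBrowse implementation: URLs without a scheme are given `http://` before parsing, "no path" is read as an empty or `/` path with no query string, and "appears as root of other candidate-URLs" is read as "other candidate-URLs share the hostname of a hostname-only URL". All function names are hypothetical.

```python
from urllib.parse import urlsplit


def _split(url):
    """Parse a URL, adding a scheme if the log stores it without one."""
    return urlsplit(url if "://" in url else "http://" + url)


def url_length(url):
    """URL Length: number of characters in the URL as logged."""
    return len(url)


def hostname_flag(url):
    """Hostname: 1 if the URL has no path beyond '/', 0 otherwise."""
    parts = _split(url)
    return 1 if parts.path in ("", "/") and not parts.query else 0


def frequency_as_hostname(url, candidates):
    """Frequency as Hostname: how many other candidates have `url` as root."""
    if not hostname_flag(url):
        return 0
    host = _split(url).hostname
    return sum(
        1 for other in candidates
        if other != url and _split(other).hostname == host
    )


def extract_features(url, candidates):
    """Return (URL length, hostname flag, frequency as hostname)."""
    return (
        url_length(url),
        hostname_flag(url),
        frequency_as_hostname(url, candidates),
    )


# Hypothetical candidate-URLs observed in the traffic.
candidates = [
    "nytimes.com/",
    "nytimes.com/2015/politics/story.html",
    "nytimes.com/2015/sports/match.html",
    "wikipedia.org/wiki/HTTP",
]
portal_features = extract_features("nytimes.com/", candidates)
content_features = extract_features("wikipedia.org/wiki/HTTP", candidates)
```

On this toy input, the portal URL is short, has the hostname flag set and is the root of two other candidates, while the article URL is longer and scores zero on the other two features.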
Table of contents:
1.1 Existing Information Filtering Systems
1.2 Lack of User Input in Communities of a Place
1.3 Implicit Input from HTTP Logs
1.4.1 Identifying Users’ Clicks
1.4.2 A Passive Crowdsourced Information Filtering System
1.4.3 News Consumption Characterization
2.1 Information Filtering Systems for the Web
2.1.1 Categories of Information Filtering Systems
2.1.2 Functional Taxonomy
2.1.3 Lack of User Input: Cold Start and Sparsity of Data
2.1.4 Characterizing Existing Tools
2.1.5 Tools for Communities
2.1.6 Evaluating Suggestions
2.2 Mining of Web Logs
2.3 News Characterization
3 Identifying Users’ Clicks
3.1 Detection of user-URLs
3.2 Online Algorithm
4 WeBrowse: A Passive Information Filtering System for Communities of a Place
4.2.1 Detection of Candidate-URLs
4.2.2 Content-URLs versus Portal-URLs
4.2.3 Privacy-Preserving Promotion
4.3.1 Performance Evaluation
4.4.1 Effect of the crowd size
4.4.2 User study
4.4.3 Community of a Place Effect
4.4.4 Complementing Existing Systems
4.4.5 Speed of Content Discovery
4.4.6 Ethical and Legal Issues
5 News Consumption Characterization
5.1.1 Data collection: Method and Vantage Points
5.1.2 From HTTP Traces to News Visits and Articles
5.1.3 Identifying Users: Households and Surfers
5.1.4 Dataset Description
5.2 News Referrals
5.3 News Categories
5.4 News Locality
5.4.1 Method: Text and Keywords Extraction
5.4.2 Method: Measuring interest
5.5 Visited News versus Headlines
5.6 Visited News Articles per Day
5.7 Number and Sizes of News Stories
5.7.1 Method: Story Identification
5.8 Visited News per User and News Addiction
5.9 News Freshness
5.10 Online Journals Clustering
6.1 Lessons from Users’ Clicks Extraction
6.2 Lessons from WeBrowse
6.3 Lessons from News Characterization
A Extended Performance Description