ON THE MOBILITY AND CONTENT ANALYSIS
“We are like dwarfs on the shoulders of giants, so that we can see more than they, and things at a greater distance, not by virtue of any sharpness of sight on our part, or any physical distinction, but because we are carried high and raised up by their giant size.”
— Bernard De Chartres
Building on the problems stated in the previous chapter, we present the basis of mobility and traffic dataset analysis and position our work with respect to the literature.
dataset knowledge extraction
Datasets are of enormous importance to the analysis carried out in many scientific fields. They provide the convenience of non-real-time analysis, i.e., one can analyze the phenomena of interest after their parameters have been collected and logged. In the context of large-scale mobility and networks, where real-time analysis is arduous due to the enormous number of elements (e.g., subscribers) and parameters (e.g., position), datasets are widely used as the primary source of information.
Datasets containing large-scale mobility or network measurements are a precious resource, and few are available in the literature. Moreover, to the best of our knowledge, no freely available dataset combines large-scale, fine-grained mobility information with network traffic. Experiments to collect human mobility data generally involve people carrying GPS-capable devices that regularly record their precise position. Due to their complexity, those experiments tend to be limited in number of participants (e.g., up to 35), duration (i.e., up to a few weeks), and space, as in university campuses [15, 16, 17, 18, 19], conference rooms [20, 21], or shopping malls. The Lausanne campaign and GeoLife are two of the few relatively large experiments, with around 200 participants, that attempt to collect fine-grained human mobility. The dataset collected by the former is not publicly available, while the one from the latter is.
Aside from that, human mobility datasets covering large areas tend to rely only on automobile transportation, which is outside the scope of this thesis. Examples include taxi cabs in San Francisco and Rome, vans inside the Microsoft headquarters in Redmond, and buses inside a university campus area or in the metropolitan area of Seattle.
Another source of human mobility data is the Call Detail Record (CDR). A CDR is a metadata record that describes a phone communication using a series of data fields, e.g., the identification of callee and caller, the call type (voice call or SMS), the start time, end time, and duration of the call, and the GPS location of the caller’s cell tower. CDR datasets are usually released by telecommunication operators to a limited number of partners under a non-disclosure agreement and with limited access. As both mobility and network traffic data are liable to give away subscribers’ private information, the entities responsible for such data are careful about providing it to third parties.
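As an illustration only, the fields listed above could be held in a small record type such as the one sketched below. The class name, the field names, and the choice of types are assumptions made for this example; actual CDR schemas differ from one operator to another.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class CallDetailRecord:
    """Hypothetical CDR layout; real operator schemas vary."""
    caller_id: str
    callee_id: str
    call_type: str      # "voice" or "sms"
    start: datetime
    end: datetime
    tower_lat: float    # latitude of the caller's cell tower
    tower_lon: float    # longitude of the caller's cell tower

    @property
    def duration_seconds(self) -> float:
        # Duration is derived from the start and end timestamps.
        return (self.end - self.start).total_seconds()
```

Deriving the duration from the two timestamps, rather than storing it, is a design choice of this sketch that keeps the three time fields consistent.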
Besides, it is important to understand the limitations of each dataset. For instance, when modeling mobility through CDRs, one has to be aware of their two biggest limitations: they are sparse in time and coarse in space. Time sparseness occurs because records are generated only when a subscriber sends or receives a call or an SMS, which makes him or her invisible at all other times. Space coarseness is due to the granularity of a cell tower sector for the subscribers’ positioning, which leads to a location uncertainty of about 1 square mile. Those two characteristics are not uniformly distributed in time, because subscribers tend to place their calls in bursts and then stay inactive for long periods, around 70% of the total time. To overcome this limitation in mobility analysis, a threshold may be set to remove from the dataset the subscribers that have a low call frequency. Nonetheless, the possible caveats of using CDRs to model human mobility patterns have been studied, and this kind of data has been shown to perform well in certain scenarios, such as the estimation of subscribers’ home and work places.
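The thresholding step mentioned above can be sketched as follows. This is a minimal example under stated assumptions: records are simplified to hypothetical (subscriber_id, timestamp) pairs, activity is measured as the average number of records per observed day, and the cutoff value is arbitrary rather than taken from the cited works.

```python
from collections import defaultdict
from datetime import datetime


def filter_active_subscribers(records, min_events_per_day=1.0):
    """Keep only the records of subscribers with enough activity.

    records: iterable of (subscriber_id, timestamp) tuples (assumed layout).
    min_events_per_day: illustrative cutoff, not a value from the literature.
    """
    records = list(records)
    if not records:
        return []
    counts = defaultdict(int)
    days = set()
    for subscriber_id, timestamp in records:
        counts[subscriber_id] += 1
        days.add(timestamp.date())
    n_days = len(days)  # number of distinct days present in the dataset
    active = {s for s, c in counts.items() if c / n_days >= min_events_per_day}
    return [r for r in records if r[0] in active]
```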
Similarly, large-scale network traffic datasets are rarely available in the literature. Like CDR datasets, they are generally released by telecommunication operators as xDRs, an extension of CDRs that includes network traffic usage information, e.g., generated by the subscriber’s browsing activities and background applications. In some cases, more specific information is available, such as every single URL recorded during subscribers’ browsing sessions.
Getting access to a dataset is the first step towards proposing a system model. Some effort has to be made to extract mobility and traffic knowledge from raw datasets due to their singular characteristics. For example, CDR mobility datasets generally contain, per user and per day, a set of geolocated points sorted by ascending timestamp. Assumptions must be made in order to represent the disjoint trajectories that a user performs during the day; otherwise, every user will have one single day-long trajectory, which might not accurately represent his or her displacement. In contrast, GeoLife provides a precise set of trajectories per user, which eases the mobility analysis.
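One possible assumption of the kind discussed above is that a long silence between two consecutive points marks the end of a trajectory. The sketch below applies that rule; the point layout (timestamp, latitude, longitude), the function name, and the 2-hour gap are hypothetical choices made for illustration, not the method used later in this thesis.

```python
from datetime import datetime, timedelta


def split_trajectories(points, max_gap=timedelta(hours=2)):
    """Split one user's (timestamp, lat, lon) points into trajectories.

    A new trajectory starts whenever two consecutive points are more than
    max_gap apart. The 2-hour default is an arbitrary illustrative value.
    """
    trajectories = []
    current = []
    for point in sorted(points, key=lambda p: p[0]):
        if current and point[0] - current[-1][0] > max_gap:
            trajectories.append(current)
            current = []
        current.append(point)
    if current:
        trajectories.append(current)
    return trajectories
```

Other splitting criteria, such as a minimum stay duration at a location, would fit the same structure.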
mobility insights
The understanding of mobility and its modeling started with animals such as monkeys, jackals, and albatrosses. Such works indicated that animal mobility follows a random walk for which the displacement is power-law distributed, i.e., a Lévy flight. Early human mobility studies used proxy tracking methods such as the dispersal of bank notes. Later, the lower cost of GPS devices increased the possibility of collecting mobility datasets. One such study evaluates GPS traces of 44 volunteers in various outdoor scenarios, including two different college campuses, a metropolitan area, a theme park, and a state fair. The analysis shows that human mobility resembles a Lévy flight within a scale of less than 10 km, which corroborates earlier findings. The authors then create a Lévy flight model that captures the mobility of those individuals. More recently, easier methods for collecting human mobility data at large scale, such as mobile phones, opened new horizons for deeper human mobility investigations.
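To make the notion of a Lévy flight concrete, the following sketch generates a two-dimensional random walk whose step lengths are power-law distributed. It is a generic textbook construction, not the model built in the cited works: the exponent, the minimum step, and the seed are illustrative assumptions.

```python
import math
import random


def levy_flight(n_steps, alpha=1.5, min_step=1.0, seed=0):
    """Generate a 2D random walk with power-law distributed step lengths.

    Step lengths follow a Pareto law, p(l) ~ l^-(1 + alpha) for l >= min_step,
    sampled by inverse transform; directions are uniform in [0, 2*pi).
    alpha, min_step and seed are illustrative values, not fitted ones.
    """
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        u = rng.random()  # uniform in [0, 1), so 1 - u is never zero
        length = min_step * (1.0 - u) ** (-1.0 / alpha)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        path.append((x, y))
    return path
```

Most steps are short, but the heavy tail occasionally produces a very long jump, which is the signature that distinguishes this walk from a Brownian one.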
Through extensive analysis, a seminal study on human mobility uses a CDR dataset of 100,000 subscribers. Its authors show that human trajectories exhibit a high degree of temporal and spatial regularity, in disagreement with the random trajectories predicted by the prevailing Lévy flight and random walk models. Besides, each individual is characterized by a specific travel distance that is time-independent and by a significant probability of returning to a few highly frequented locations. The return to a previously visited location occurs with a frequency that follows the popularity rank of that location relative to other locations, so that the most visited locations are also the most likely to be revisited. This means that humans have a strong tendency to return to locations they visited before, due to the recurrence and temporal periodicity inherent to human mobility. An extension of this work uses two CDR datasets totaling 3 million subscribers and focuses on the visiting time, i.e., the period of time spent at one location. The resulting distribution is a truncated power law with a cutoff of 17 hours, which the authors link to the typical awake period of humans.
Another work analyzes CDR datasets of 97,000 subscribers in Los Angeles and 71,000 in New York, aiming to identify important locations in people’s lives. Using ground-truth data on the home and work locations of 19 subscribers, the authors were able to identify home and work locations with about 1 and 21 miles of error, respectively. A further study evaluates a CDR dataset with information about 450,000 subscribers to capture city dynamics. More specifically, its authors want to discover two main groups in a city, one active during the day (laborshed) and another during the night (partyshed). Their grouping strategy relies on a set of fixed rules, e.g., a subscriber is assigned to the laborshed group if he or she makes 4 calls (or sends 4 SMSs) during business hours using city cell towers at least twice per week. An 81% correlation is shown between the groups identified by their algorithm and the US Census dataset used as ground truth.
A subset of the Lausanne dataset with 38 participants has been analyzed in order to understand how temporal and personal factors, e.g., occupation and age, affect individual mobility patterns. From the temporal analysis, the authors concluded that people are less active during workdays and at night than during weekends and daytime. The occupational analysis shows that, between full-time workers and students, the former are more prone to shorter displacements during the day, due to the stricter time rules imposed by companies when compared to universities. Finally, the age analysis shows higher nightly mobility of younger people compared to their older counterparts, which the authors attribute to nightlife attractions being more interesting for younger people. A study using a CDR dataset with information on 180 subscribers presented similar temporal findings.
Due to this routine behavior, human mobility is highly predictable. A study using a CDR dataset with 50,000 subscribers aims to measure how predictable human mobility is. Its authors measure the predictability of subscribers’ next whereabouts using three entropy metrics, which assume that the next location is (1) chosen uniformly at random among all the locations the subscriber has already visited, (2) chosen based on the visit frequency of each location, or (3) chosen taking into consideration the frequency, the time spent, and the order of the visits. As a result, for the typical subscriber, the uncertainty about the next location (i.e., the cell tower the subscriber will be connected to) resides, on average, in a set of fewer than two locations. Moreover, it has been shown that individuals are found at their first two preferred locations 40% of the time.
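The first two entropy metrics described above can be sketched in a few lines. The function names and the representation of a subscriber's history as a non-empty list of location labels are assumptions of this example; the third metric, which accounts for the order of visits, needs a sequence-based estimator and is deliberately left out.

```python
import math
from collections import Counter


def random_entropy(visits):
    """Entropy (in bits) if every distinct visited location were equally likely."""
    return math.log2(len(set(visits)))


def uncorrelated_entropy(visits):
    """Shannon entropy (in bits) of visit frequencies, ignoring visit order."""
    total = len(visits)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(visits).values()
    )
```

An entropy of S bits corresponds to an uncertainty spread over 2^S equally likely locations, which is how an entropy value translates into a number of candidate locations.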
Besides the efforts above, human mobility has been widely studied from several points of view, especially with regard to the inter-contact and contact times between people, i.e., the time gap separating two contacts and the duration of a contact, considering the same pair of people. The importance of those studies comes from a specific problem in intermittently connected networks: as messages are transmitted among nodes when they come into contact with each other, the contact time between pairs of nodes is a key factor in the end-to-end communication delay. In the context of human mobility, people carrying mobile phones are nodes, and a contact between devices signifies that the respective people are getting closer to each other. The longer they stay close, i.e., the contact duration, the larger the amount of data that can be exchanged.
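The two quantities defined above can be computed from a contact log as sketched below. The input layout, a list of (start, end) times in seconds for one pair of devices with non-overlapping intervals, is an assumption of this example.

```python
def contact_statistics(contacts):
    """Return (durations, inter_contact_times) for one pair of devices.

    contacts: list of (start, end) times in seconds for the same pair
    (assumed layout); intervals are assumed not to overlap.
    """
    ordered = sorted(contacts)
    # Contact duration: how long each encounter lasts.
    durations = [end - start for start, end in ordered]
    # Inter-contact time: gap between the end of one contact and the next start.
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return durations, gaps
```

For instance, `contact_statistics([(0, 60), (100, 130)])` yields durations of 60 and 30 seconds and a single inter-contact time of 40 seconds.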
In [47, 48], the authors show that empirical distributions of inter-contact times present two characteristics. First, they are well fitted by log-normal curves, with exponential curves also fitting a significant portion of the distributions. Second, they can be well approximated by a power law over some specific time ranges, from a few minutes to 12 hours. The works in [21, 49, 50] conducted experiments involving Bluetooth contacts between people carrying devices: one studies data from 41 participants at the Infocom 2005 conference rooms, another analyzes 9 participants in a campus scenario, and a third assesses data collected from 16 undergraduate students. Those works present similar results regarding contact duration: it is power-law distributed, with variations in the power-law coefficient k inherent to the specificities of the scenarios where the experiments were carried out. For instance, the contact duration distribution presented in one of them decays more slowly than the ones from [17, 49]. The authors associate this behavior with students, who tend to stay for longer periods of time in the vicinity of each other as they may attend the same classes.
The aforementioned works have mostly studied some aspects of human mobility, unveiling characteristics of people’s displacement such as travel distance, the high probability of revisiting a few locations, and the dynamics of encounters. Their conclusions indicate that temporal and spatial factors recurrently impact human mobility. However, our intuition is that people’s mobility presents other characteristics, such as a tendency to use shortest paths. Besides, no large-scale evaluation of fine-grained datasets has been performed to verify either this intuition or the aspects previously assessed in the literature.
data traffic insights
The understanding of users’ content consumption has attracted significant attention from the networking community. An improved understanding is of fundamental importance when looking for solutions to manage the increased data usage and to improve the quality of the communication service provided. The resulting knowledge can help to design more adaptable networking protocols or services, as well as to determine, for instance, where to deploy networking infrastructure, how to reduce traffic congestion, or how to fill the gap between the capacity granted by the infrastructure technology and the traffic load generated by mobile users.
The earliest analyses of cellular network traffic mostly focused on the traffic generated by micro-browsing using the Wireless Application Protocol (WAP), PDAs, and CDMA2000 network technology. Although those studies were made more than a decade ago, certain findings remain similar to the current state of the literature, e.g., a daily cyclicity was detected in the micro-browsing access behavior of mobile phone users. Besides, most users were shown to have short network usage sessions when accessing web sites. Compared to those first studies, current network traffic data collection and analysis involve an enormous amount of data, which raises the difficulty of understanding, characterizing, and classifying network demands and subscribers.
Within network traffic analysis, considerable effort has been made towards the classification of subscribers’ behavior based on their similarities. This is a challenging task due to the heterogeneity of human behavior: while some subscribers rarely make voice calls, others make them thousands of times per month. Clustering techniques are frequently used in this context and have been employed since the earliest analyses of metropolitan-area wireless network traffic. Clustering primarily relies on the concept of similarity between elements in a collection, as determined by the distance between them in a multidimensional space. Two elements belong to the same cluster if the distance between them is small enough. Relatively simple clustering algorithms are capable of grouping a large number of elements into clusters, which generally correspond to classes with a semantic meaning.
Distinguishing between changes that are merely due to the dynamic nature of the system and anomalies is a difficult problem. One method has been developed to identify anomalous behavior from subscribers’ call records. Its authors apply a clustering technique to detect voice call patterns that are anomalous relative to the historical pattern established for a phone. Their evaluation makes use of a CDR dataset comprising 500 subscribers. Their approach uses an artificial neural network model and focuses only on the voice call duration parameter. A voice call is considered abnormal if it is much longer than the subscriber’s usual voice call length. Similarly, another work applies clustering techniques in order to find anomalous network behavior. This approach combines the Leader and K-means clustering algorithms into a hybrid scheme that considers voice call duration and starting time, number of sent/received SMSs, and data traffic. The authors evaluate the framework on a 12-day CDR dataset with an unspecified number of subscribers. Resulting clusters with few members were considered to contain records of anomalous voice calls. The authors evaluate several configurations of the clustering algorithms’ parameters and conclude that a hybrid approach composed of Leader and K-means has very limited success in anomaly detection.
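For readers unfamiliar with the Leader algorithm mentioned above, a minimal single-pass version is sketched below. It covers only the Leader step, not the hybrid scheme of the cited work; the Euclidean distance, the tuple representation of records, and the threshold value are assumptions of this example.

```python
import math


def leader_clustering(points, threshold):
    """Single-pass Leader clustering.

    Each point joins the first leader within `threshold` (Euclidean distance);
    otherwise it becomes the leader of a new cluster.
    Returns a list of (leader, members) pairs.
    """
    clusters = []
    for point in points:
        for leader, members in clusters:
            if math.dist(point, leader) <= threshold:
                members.append(point)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            clusters.append((point, [point]))
    return clusters
```

In an anomaly-detection setting such as the one described above, clusters that end up with very few members would be the candidates to inspect. Note that the result depends on the order in which points are presented, a known property of this algorithm.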
Another study uses a CDR dataset with information about 475,000 subscribers, aiming to classify subscribers into usage groups. In this study, voice call duration and number of SMSs are the only parameters that represent a subscriber’s usage. In order to create the groups, K-means clustering was applied to both parameters with k = 7, i.e., a fixed number of seven usage groups. Further investigation of one of the clusters shows predominant activity before and after working hours, which the authors link to a cluster made up mostly of commuter subscribers. In another cluster, the authors found a higher number of SMSs than calls, which is a characteristic of younger people, and peaks of SMS activity corresponding to the start of the school day, open lunch, and dismissal. The authors were able to establish that this cluster was mostly composed of students by analyzing the activity on the antenna that covers a school in the studied area and finding the same behavior.
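The K-means step described above can be illustrated with a plain implementation of Lloyd's algorithm. This is a generic sketch, not the procedure of the cited study: the random initialization, the seed, the iteration cap, and the idea of feeding it hypothetical (voice call duration, SMS count) pairs are all assumptions.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels).

    points: list of equal-length numeric tuples, e.g. hypothetical
    (voice_call_minutes, sms_count) pairs, ideally rescaled beforehand.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # k input points picked at random
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels
```

With k = 7, as in the study above, the call would be `kmeans(features, k=7)`, where `features` holds one tuple per subscriber. Because the two parameters have different units, rescaling them before clustering is usually advisable so that neither dominates the distance.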
A framework has also been proposed to characterize network-wide usage profiles, and it is evaluated using a CDR dataset of 5 million subscribers. The goal is to build categories of network usage from a raw CDR dataset, which, in this case, provides the hourly number of calls per antenna. The framework creates a node from each hourly measurement and structures all nodes as a dendrogram. A set of network usage profiles results from hierarchically merging similar nodes into clusters until a certain threshold, defined by two stopping metrics, Beale and C-Index, is reached. The similarity between nodes relies on two proposed metrics, traffic volume similarity and traffic distribution similarity, which take into consideration absolute and relative call volumes on all antennas inside a geographical area. The authors show that, beyond finding a coherent set of voice call usage categories, their framework identifies anomalous call usage profiles as a by-product.
The literature also presents efforts on the categorization of users by their network activities using other parameters, e.g., visited websites and WiFi usage. Additionally, [64, 65] present network-wide analyses for special days such as a World Cup match, Easter, Christmas, and Carnival, or for special areas of a city. Besides, with a focus on urban planning, the usage of urban areas has been characterized based on the network activities for typical and special days. Still towards better urban planning, a framework has been proposed that merges city information from network activity, vehicle traffic, and events to provide a better understanding of how cities function and to develop more efficient urban policies.
Most studies in the literature on network utilization are based on the analysis of calling patterns (i.e., records generated only when a voice call or a short message occurs), usually described in CDRs, with no regard to the actual data traffic generated by smartphone applications (e.g., email checks, synchronization, etc.). Such analyses may provide an idea of the activity of mobile network customers but do not describe realistic data traffic demand patterns.
The importance of datasets in network and mobility analysis is undeniable: they enable an offline investigation of events that would otherwise be very hard to observe and study in real time. The main drawback is the lack of publicly available datasets that are rich in fine-grained mobility and network traffic information. As a consequence, this thesis tackles mobility and network traffic analysis using different datasets, which will be presented in the following chapters. Our analysis of both mobility and network traffic differs from the previously mentioned works in several aspects. We:
• focus on trajectory analysis and how people interact with the urban scenario
• provide a large-scale, fine-grained mobility study focusing on routine behavior
• analyze several cities at once, avoiding the bias of a specific city context
• extensively analyze real cellular network data traffic and its cyclicity
With that in mind, we envision opportunities for better network management and planning. Therefore, in Chapter 4 we propose a framework to automatically classify subscribers into a set of profiles based on their network usage, as well as a synthetic traffic generator that mimics the behavior of the original network demands. Another opportunity that we explore is the deployment of a supportive wireless network in metropolitan areas, which we evaluate as a case study in Chapter 5.