Mobile Data Traffic Prediction

To what degree is Internet traffic predictable? This question has raised a number of interesting issues and has been investigated continuously since the invention of the Internet [50]. In this section, we review the state of the art in mobile data traffic prediction. Our discussion is organized around two perspectives:
Aggregated mobile data traffic. In this perspective, we consider mobile data traffic from the viewpoint of a mobile network operator. Such traffic is aggregated over many mobile devices within the same cell, the same local geographical area, or the same service/application.
Individual mobile data traffic. In this perspective, we take the viewpoint of an individual, i.e., we consider the mobile data traffic generated by a single mobile device.
For each perspective, we briefly introduce the characterization of the data traffic and then present the practical prediction techniques. Note that in this section we focus on studies of Internet traffic and exclude those on other traffic (e.g., voice calls).

Literature on Aggregated Mobile Data Traffic

The investigation of aggregated mobile data traffic is mainly driven by analyses of large-scale operator-collected datasets from around the world. For instance, datasets covering nationwide populations are mined in depth in the studies by Paul et al. [27] (USA), Hoteit et al. [51] (France), and Xu et al. [37, 38] (China).

Characterization

The characterization has two major aspects: temporal dynamics and spatiotemporal correlation.
There is general agreement on the regularity of the temporal variation of aggregated mobile data traffic [23]. At almost the same time, Paul et al. [27] and Shafiq et al. [33] separately investigated the temporal evolution of the aggregated mobile data traffic of cell towers and popular applications. Both find that such traffic follows a daily repetitive pattern over weekdays: in general, demand is low during nighttime and high during daytime. The same repetitive pattern is also observed by Xu et al. [37, 38]. It is also remarked in [27, 33, 52] that weekday and weekend traffic have different repetitive patterns and demands; data traffic demand is larger on weekdays than on weekends. Interestingly, the temporal variations observed by Paul et al. [27] and Shafiq et al. [33] have different peak hours, which is also observed in other network traffic [23]. A possible explanation is that, at a finer temporal resolution, such temporal variation partially depends on the area of study.
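The daily pattern and the weekday/weekend difference described above can be illustrated with a minimal sketch. It assumes a trace of (timestamp, volume) samples; the function names, the synthetic trace, and its volume levels are hypothetical and are not taken from the cited studies.

```python
from datetime import datetime, timedelta


def hourly_profiles(samples):
    """Return (weekday, weekend) lists of the mean volume per hour of day.

    `samples` is an iterable of (timestamp, volume) pairs; the name and the
    unit of `volume` are assumptions made for this sketch.
    Hours without any sample are reported as None.
    """
    sums = {"weekday": [0.0] * 24, "weekend": [0.0] * 24}
    counts = {"weekday": [0] * 24, "weekend": [0] * 24}
    for ts, volume in samples:
        # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if ts.weekday() >= 5 else "weekday"
        sums[kind][ts.hour] += volume
        counts[kind][ts.hour] += 1

    def averages(kind):
        return [s / c if c else None for s, c in zip(sums[kind], counts[kind])]

    return averages("weekday"), averages("weekend")


def synthetic_trace(days=14):
    """Hypothetical hourly trace: low at night, high by day, lower on weekends."""
    start = datetime(2024, 1, 1)  # a Monday
    trace = []
    for h in range(days * 24):
        ts = start + timedelta(hours=h)
        base = 100.0 if 8 <= ts.hour < 22 else 10.0
        scale = 0.6 if ts.weekday() >= 5 else 1.0
        trace.append((ts, base * scale))
    return trace


weekday, weekend = hourly_profiles(synthetic_trace())
print("weekday 03:00 vs 15:00:", weekday[3], weekday[15])
print("weekend 03:00 vs 15:00:", weekend[3], weekend[15])
```

Comparing the two 24-value profiles hour by hour is one simple way to expose both the day/night cycle and the weekday/weekend gap; peak hours can then be read off as the hours with the largest averages.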
A spatiotemporal correlation exists among the data traffic of cell towers, each aggregated over many users in the same area. From a purely spatial perspective, the distribution of data traffic is heterogeneous: it varies across regions, as revealed by Paul et al. [27] and Xu et al. [37, 38]. Further, Xu et al. find that cell towers have similar data traffic profiles according to their regions (i.e., resident, transport, office, and entertainment) and that the profiles of adjacent cell towers are correlated. From a spatiotemporal perspective, both studies show that this spatial heterogeneity also varies over time: peak hours depend on the region. Paul et al. use a quantitative measure (the Moran’s I statistic) to evaluate the spatiotemporal diversity of the data traffic. They find that, in general, the imminent loads of adjacent cell towers are more correlated when these loads are high, but the correlation is relatively weak and almost disappears around midnight. Recently, [53] further investigates the spatiotemporal correlation and proposes an approach to infer the hidden spatial and temporal structures of aggregated mobile data traffic.
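For reference, Moran's I for loads x_1..x_N and spatial weights w_ij is (N / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights. The sketch below computes it in plain Python. The binary adjacency weights and the four-tower chain are assumptions made for illustration; the weighting scheme actually used in [27] is not stated here.

```python
def morans_i(values, weights):
    """Moran's I spatial autocorrelation of `values` under `weights`.

    `weights[i][j]` is the spatial weight between sites i and j (diagonal
    entries are ignored). Values near +1 indicate that neighbours carry
    similar loads, values near -1 that neighbours carry dissimilar loads.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    w_total = sum(weights[i][j] for i, j in pairs)
    if denom == 0 or w_total == 0:
        raise ValueError("Moran's I is undefined for constant values or zero weights")
    num = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    return (n / w_total) * num / denom


# Hypothetical chain of four towers: 0 - 1 - 2 - 3 (binary adjacency).
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print("clustered loads:  ", morans_i([1, 1, 5, 5], chain))
print("alternating loads:", morans_i([1, 5, 1, 5], chain))
```

Evaluating the statistic separately for each time slot gives a time series of spatial correlation, which is one way to see a correlation that strengthens at high load and fades around midnight.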
Several studies also reveal spatial heterogeneity in traffic aggregated over applications. Earlier work by Trestian et al. [40] already shows that Internet traffic over services and applications is consumed differently at home and work locations. Hoteit et al. [51] find that the data traffic loads of cell towers exhibit different inner diversities among TCP- and UDP-based services. Later, the extended analysis by Shafiq et al. [39] finds that the data traffic aggregated by popular applications is strongly heterogeneous across regions. This makes it possible to categorize cell towers into four classes (web browsing, email, audio, and mixed traffic) according to the major applications in their data traffic loads.

Prediction

Some effort has been devoted to the prediction of aggregated mobile data traffic, aiming to turn the dynamics and correlations observed above into practical prediction techniques. In the following, we review the proposed techniques according to the level of aggregation.
Cell-level data traffic. It is commonly observed that the data traffic of cell towers has a high degree of both theoretical and practical predictability. From the theoretical viewpoint, Zhang et al. [31, 32] investigate the limits of predictability by observing the traffic of 7,000 cell towers in China. They find that, at a temporal resolution of 30 minutes, aggregated traffic (voice, text, and data) can be well predicted from the historical demand of the preceding 15 hours; the theoretical predictability of data traffic is lower than that of voice calls or text messages. They also find that knowledge of the traffic demands of adjacent cell towers can enhance the theoretical predictability, but to a lesser degree for data traffic than for the other two types, which supports the quantitative evaluation of the spatiotemporal correlation by Paul et al. [27]. Their results support the use of time series prediction techniques for such traffic.
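One common way to express such a theoretical limit is to convert an entropy estimate S (in bits) of a traffic series quantized into N levels into an upper bound on prediction accuracy through Fano's inequality, S = H_b(P) + (1 - P) log2(N - 1), where H_b is the binary entropy. The sketch below solves this equation for P by bisection. Whether [31, 32] follow exactly this procedure is an assumption; the function names are hypothetical, and the plain Shannon entropy used here ignores temporal order, so it yields a conservative (lower) value than an entropy-rate estimate would.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of `symbols`."""
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in Counter(symbols).values())


def binary_entropy(p):
    """Entropy (bits) of a Bernoulli(p) variable, with 0 at the endpoints."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def max_predictability(entropy_bits, n_states, tol=1e-9):
    """Solve Fano's equality for the largest achievable accuracy P."""
    if n_states < 2:
        return 1.0  # a single state is trivially predictable
    if entropy_bits < 0 or entropy_bits > math.log2(n_states) + 1e-9:
        raise ValueError("entropy must lie in [0, log2(n_states)]")

    def fano(p):
        return binary_entropy(p) + (1.0 - p) * math.log2(n_states - 1) - entropy_bits

    # fano() decreases from >= 0 at p = 1/N to <= 0 at p = 1, so bisect.
    lo, hi = 1.0 / n_states, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if fano(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical series of quantized traffic levels.
levels = ["low", "low", "low", "mid", "high", "low", "low", "mid"]
s = shannon_entropy(levels)
print("entropy (bits):", s)
print("predictability bound:", max_predictability(s, len(set(levels))))
```

A lower entropy gives a higher bound, which matches the intuition that a more regular series is more predictable.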
Regarding practical prediction techniques, Xu et al. [37, 38] show that cell-level data traffic is predictable via a linear combination of four primary components corresponding to human activities. Zang et al. [54] propose a mixed machine learning approach composed of K-means clustering, an Elman neural network, and wavelet decomposition. An alternative approach is proposed by Yi et al. [55]: it builds a complex network among cell towers, measures the traffic on the most important ones, and predicts the traffic of the others using Support Vector Regression, another machine learning method. It can recover the whole picture of the traffic demand from only 8% of the cell towers. From the opposite viewpoint, Nika et al. [36] perform an empirical study of data hotspots using a large-scale operator-collected dataset of 5,327 cell towers, and show that standard machine learning methods are applicable to predicting future hotspots (cell towers) of traffic demand from past history.
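The idea of expressing a tower's traffic as a linear combination of a few primary components can be sketched as an ordinary least-squares fit. The sketch assumes the component profiles are already known and fits only the per-tower weights; how [37, 38] actually extract the components is not reproduced here, and all names and the toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("components are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_weights(components, observed):
    """Least-squares weights w minimising ||sum_k w_k * component_k - observed||."""
    k = len(components)
    gram = [[sum(p * q for p, q in zip(components[i], components[j]))
             for j in range(k)] for i in range(k)]
    rhs = [sum(p * y for p, y in zip(components[i], observed)) for i in range(k)]
    return solve_linear_system(gram, rhs)  # normal equations


def combine(components, weights):
    """Reconstruct a traffic profile from components and weights."""
    length = len(components[0])
    return [sum(w * c[t] for w, c in zip(weights, components)) for t in range(length)]


# Hypothetical component profiles over six time slots.
components = [
    [1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 0],
]
observed = combine(components, [2.0, 0.5, 0.0, 3.0])
print("recovered weights:", fit_weights(components, observed))
```

Once the weights of a tower are fitted on past data, its future profile can be approximated by combining the same components with those weights.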
Application-level data traffic. The early paper by Keralapura et al. [56] proposes a technique to cluster users and their browsing profiles. The authors find that user behavior in terms of Internet surfing can be captured using a small number of clusters. Such heterogeneity of aggregated mobile data traffic is also investigated by Ying et al. [57]. Later, Shafiq et al. [33] use a Zipf-like model to capture the distribution of application-level mobile data traffic and find that this regularity makes the temporal variation of the traffic highly predictable from the history of past demand using a simple Markovian method. Recently, Zhang et al. [58] design a mixed application-level traffic prediction framework that leverages the α-stable model and dictionary learning to deal separately with the temporal variation and the spatial sparsity of the traffic. Marquez et al. [59] extend the analysis in [33] and reveal a strong heterogeneity in the demands of different mobile services using correlation and clustering. They show that the temporal usage patterns differ considerably from service to service. In addition, several works focus on the traffic generated by specific services, such as chatting (e.g., WhatsApp [60] and WeChat [61]), video streaming [62], and mobile cloud [63].
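A simple Markovian predictor of the kind mentioned above can be sketched as a first-order Markov chain over discretized traffic levels. The discretization into named levels, the first-order assumption, and the function names are choices made for this illustration; they are not the exact method of [33].

```python
from collections import Counter, defaultdict


def fit_markov(states):
    """Count first-order transitions between consecutive states."""
    transitions = defaultdict(Counter)
    for current, following in zip(states, states[1:]):
        transitions[current][following] += 1
    return transitions


def predict_next(transitions, current):
    """Return the most frequently observed successor of `current`.

    Returns None when `current` was never seen with a successor.
    """
    if current not in transitions:
        return None
    return transitions[current].most_common(1)[0][0]


# Hypothetical history of quantized demand levels for one application.
history = ["low", "mid", "high", "low", "mid", "high", "low"]
model = fit_markov(history)
print("after 'low'  ->", predict_next(model, "low"))
print("after 'high' ->", predict_next(model, "high"))
```

The more regular the history, the more the transition counts concentrate on a single successor per state, which is why a strongly repetitive demand makes such a simple model effective.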
In summary, the proposed techniques extend the technical boundary of mobile data traffic prediction: they not only leverage legacy tools that were used for analyzing wired network traffic (e.g., entropy, the Markov property, the α-stable model) to capture the temporal variation, but also bring in several state-of-the-art machine learning tools to exploit the spatiotemporal correlation.

Literature on Individual Mobile Data Traffic

A relatively small body of literature investigates individual mobile data traffic; it is also driven by data mining. In contrast to the aggregated case, the relevant studies use both large-scale operator-collected datasets, e.g., Paul et al. [27] and Oliveira et al. [34, 52], and small-scale mobile crowdsensing datasets, e.g., Jo et al. [35].

Characterization

The characterization from the individual viewpoint is performed by Paul et al. [27], Jo et al. [35], Li et al. [42], Oliveira et al. [34, 52], among others.
There is general agreement on the heterogeneity of the data traffic with respect to both the user population and time. This view is shared by Paul et al. [27] and Oliveira et al. [34, 52], who show that most of the total data traffic is generated by a small group of "heavy" users.
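The concentration of traffic among heavy users can be quantified as the share of the total volume generated by the top fraction of users. The sketch below is a minimal illustration; the function name and the per-user volumes are hypothetical and do not come from the cited datasets.

```python
import math


def top_share(volumes, fraction):
    """Share of total traffic generated by the top `fraction` of users.

    `volumes` holds one non-negative total per user; `fraction` is in (0, 1].
    At least one user is always counted.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(volumes, reverse=True)
    total = sum(ranked)
    if total == 0:
        return 0.0
    k = max(1, math.ceil(fraction * len(ranked)))
    return sum(ranked[:k]) / total


# Hypothetical per-user volumes (arbitrary units).
volumes = [90, 2, 2, 2, 1, 1, 1, 1, 0, 0]
print("share of top 20% of users:", top_share(volumes, 0.2))
```

A share close to 1 for a small fraction indicates a strongly skewed population, whereas a share close to the fraction itself indicates evenly distributed usage.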
Regarding the temporal variation, both groups of authors find that, in general, each user is highly active for only a few hours per day and that, as with aggregated mobile data traffic, the temporal variation differs between weekdays and weekends. Oliveira et al. [34, 52] find that individual mobile data traffic also follows daily repetitive patterns and that users also have peak and non-peak hours in terms of data traffic. In particular, they find that the variation across different hours within the same day is stronger than that of the same hour across different days.
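The last comparison can be made concrete by contrasting two spreads for one user: the standard deviation across the hours of each day, and the standard deviation of each hour across days. Using the population standard deviation as the measure of variation is an assumption of this sketch; the metric used in [34, 52] is not specified here, and the sample matrix is hypothetical.

```python
from statistics import mean, pstdev


def variation_summary(daily):
    """Return (within_day, across_days) average standard deviations.

    `daily` is a list of days, each a list of per-hour volumes of equal length.
    within_day:  spread across the hours of one day, averaged over days.
    across_days: spread of one hour across days, averaged over hours.
    """
    within_day = mean(pstdev(day) for day in daily)
    across_days = mean(pstdev(hour_values) for hour_values in zip(*daily))
    return within_day, across_days


# Hypothetical user with four time slots per day over two days.
daily = [
    [0, 10, 0, 10],
    [1, 11, 1, 11],
]
within_day, across_days = variation_summary(daily)
print("within-day variation: ", within_day)
print("across-days variation:", across_days)
```

A within-day value well above the across-days value corresponds to a user whose traffic swings strongly over the day but repeats similarly from one day to the next.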
As for the spatiotemporal correlation, Paul et al. [27] point out that a user is usually active at only a few of their common locations. Jo et al. [35] mine a small dataset of the locations and services of 124 users over 16 months and identify spatiotemporal correlations in service usage patterns.
Other dynamics related to social features are also revealed. For instance, Oliveira et al. [34, 52] find that the distribution of individual mobile data traffic is slightly heterogeneous with respect to age and gender; Li et al. [42] focus on the major smartphone operating systems and discuss the traffic dynamics and major applications of each system.

Table of contents :

1 Introduction
1.1 Predicting Per-user Mobile Data Traffic
1.2 Utilizing Operator-collected Datasets
1.3 Contributions and Thesis Outline
2 Background
2.1 Mobile Data Traffic Prediction
2.2 Operator-collected Mobility Data Utilization
2.3 Summary
3 Datasets: Characteristics and Challenges
3.1 Human Behavior Collection
3.2 Operator-collected Large-scale Datasets
3.3 Application-based Mobility Datasets
3.4 Challenge of Completeness
3.5 Challenge of Mobility Measurement
3.6 Challenge of Data Processing
3.7 Summary
4 CDR-based Trajectory Completion
4.1 Terminology
4.2 Completing Instant CDR-based Trajectories
4.3 Completing Slotted CDR-based Trajectories
4.4 Summary
5 Per-User Mobile Data Traffic Prediction
5.1 Terminology and Definitions
5.2 Characterizing Individual Mobile Data Traffic
5.3 Constructing Per-user Spatiotemporal Behavioral Data
5.4 Investigation through Temporal Dynamics
5.5 Investigation through Spatiotemporal Dynamics
5.6 Additional Investigation of Human Mobility
5.7 Summary
6 Conclusion and Outlook
6.1 Summary of the Thesis
6.2 Limitation and Outlook
6.3 Concluding Remarks
Bibliography