Get Complete Project Material File(s) Now! »

## Document–pivot methods

Simple document–pivot approaches cluster documents by leveraging some simi-larity metric between them. A recent work [19] follows this direction to provide a method for “breaking news” detection in Twitter. Tweets retrieved using targeted queries or hashtags are converted into a bag-of-words representation weighted with boosted tf-idf (term frequency–inverse document frequency) emphasizing important entities such as names of countries or public figures. Bag-of-words of a text is its representation as the set of its words, disregarding grammar and even word order but keeping multiplicity. Tweets are then incrementally merged in clusters by considering the textual similarity between incoming tweets and existing clusters. Similar approaches based on textual similarity and tf-idf can be found in literature [20, 21]. Among them, the method discussed in [21] classifies tweets as referring to real–world events or not. The classifier is trained on a vast variety of features including social aspects (e.g., number of mentions) and other Twitter–specific features. An important drawback of the method is the need for manual annotation of training and test samples. Dimensions other than text can also be used to improve the quality of clustering. TwitterStand [22] uses a “leader–follower” clustering algorithm that takes into account both textual similarity and temporal proximity. Each cluster center is represented using a centroid tf-idf vector and the average post–time of the tweet in the cluster. A similarity metric based on both elements and on the number of shared hashtags allows incremental merging of new tweets with existing clusters. The main disadvantages of this method is the sensitivity to noise (which is a known problem for document–pivot methods [23] ) and fragmentation of clusters. It needs a manual selection of trusted information providers and periodic defragmentation runs to mitigate such eﬀects. The goal in a large corpus is to detect the first document discussing a given topic like in [24]. A new story is created by a document having low similarity with all previously detected clusters. Locality sensitive hashing is used for fast retrieval of nearest neighbors for the incoming document.

In conclusion, document-pivot methods have two main problems, which are: (a) segmentation of classification groups, and (b) depend on arbitrary thresholds for the classification of an incoming tweet.

### Feature–pivot methods

Feature–pivot methods are closely related to topic models in natural language processing, namely statistical models used to build sets of terms which are representative of the topics occurring in a corpus of documents. Most of the state-of-the-art static topic models are based on the Latent Dirichlet allocation (LDA) [25], which is described in Section 2.4.3. Even though LDA extensions for dynamic data have been proposed [26], alternative approaches trying to capture topics through the detection of term burstiness have been studied [27], mainly in the context of news media mining. The idea behind those methods is that “breaking news”, unlike other discussion topics, happen to reach a fast peak of attention from routine users as soon as they are tweeted or posted [28, 29]. Accordingly, the common framework which underlies most of the approaches in this category first identifies bursty terms and then clusters them together to produce topic definitions.

The diﬀusion of the services over social media and detection of bursty events had been studied in generic document sets. The method presented in [23], for instance, detects bursty terms by looking their frequency in a given time window. Once the bursty terms are found, they are clustered using a probabilistic model of cooccurrence. The need for such a global topic term distribution restricts this approach to a batch mode of computation. Similar methods were tested for topic detection in social media, such as in the Twitter, but with additional emphasis on the enrichment of the obtained topics with non–bursty but relevant terms, URLs and locations [30].

Graph–based approaches detect term clusters based on their pairwise sim-ilarities. The algorithm proposed in [31] builds a term cooccurrence graph, whose nodes are clustered using a community detection algorithm based on betweenness centrality, which is an indicator of a node’s centrality in a network and is equal to the number of shortest paths from all vertices to all others that pass through that node. Additionally, the topic description is enriched with the documents which are most relevant to the identified terms. Graphs of short phrases, rather than of single terms, connected by edges representing similarity have also been used [32]. Graph–based approaches have also been used in the context of collaborative tagging systems with the goal of discovering groups of tags pertaining to topics of social interest [33].

Signal processing approaches have also been explored in [34], which com-pute df-idf (a variant of tf-idf) for each term in each considered time slot, and then apply wavelet analysis on consecutive blocks. The diﬀerence between the normalised entropy of consecutive blocks is used to construct the final signal. Relevant terms which are bursty are extracted by computing the autocorre-lation of the signal and heuristically learning and determining a threshold to detect new bursty terms. Also in this case, a graph between selected terms is built based on their cross–correlation and it is then clustered to obtain event definitions. The Discrete Fourier Transform is used in [35], where the signal for each term is classified according to its power and periodicity. Depending on the identified class, the distribution of appearance of a term in time is modeled using Gaussian distributions. The Kullback–Leibler divergence (as a relative entropy) between the distributions is then used to determine clusters and increase the computational complexity of the method.

The knowledge of the community leads to even more sophisticated ap-proaches. In a recent work [36] a PageRank–like measure is used to identify important users on the Twitter social network. Such centrality score is combined with a measure of term frequency to obtain a measure for each term. Then, clustering on a correlation graph of bursty terms delineates topic boundaries.

These methods are based on the analysis of similarities between terms and often give wrong correlation of topics, with their main disadvantage being the use of dictionaries and stemming processes.

#### Related Work

In general the topic detection methods use preprocessing techniques. Next, we define all the components of the topic detection process. In Section 2.4.1, we present the problem statement and define some basic terminology. Then in Sec-tion 2.4.2, we describe the data preprocessing and in the following sections we present six methods that take in as input the preprocessed data and output the detected topics.

**Problem Definition**

We address the task of detecting topics in real–time from social media streams. To keep our approach general, we consider that the stream is made of short pieces of text generated by social media users, e.g. posts, messages, or tweets in the specific case of Twitter. Posts are formed by a sequence of words or terms, and each one is marked with the timestamp of creation. A plethora of methods have a user–centered scenario in which the user starts up the detection system by providing a set of seed terms that are used as initial filter to narrow down the analysis only to the posts containing at least one of the seed terms. Additionally, there exists an assumption that the time frame of interest (can be indefinitely long) and a desired update rate are provided (e.g. detect new trending topics every 15 minutes). The expected output of the algorithm is a topic, defined as a headline and a list of terms, delivered at the end of each time slot determined by the update rate. This setup fits well many real–world scenarios in which an expert of some domain has to monitor specific topics or events being discussed in social media [3, 37]. For instance, this is the case for computational journalism in which the media inquirer is supposed to have enough knowledge of the domain of interest to provide initial terms to perform an initial filtering of the data stream. Even if it requires an initial human input, this framework still remains generic and suitable to any type of topic or event.

**Data Preprocessing**

The content of user generated messages could be unpredictably noisy. In many works, in order to reduce the amount of noise before the proper topic detection is executed, the raw data extracted through the seed terms filter is subjected to three preprocessing steps.

Tokenization: In a raw post, terms can be combined with any sort of punc-tuation and hyphenation and can contain abbreviations, typos, or conven-tional word variations. The Twokenizer tool [20] is used to extract bags of cleaner terms from the original messages by removing stopwords and punctuation, compressing redundant character repetitions, and removing mentions, i.e., IDs or names of other users included in the text for messag-ing purposes.

Stemming: In information retrieval, stemming is the process of reducing inflected words to their root (or stem), so that related words map to the same stem. This process naturally reduces the number of words associated to each document, thus simplifying the feature space. Most techniques use an implementation of the Porter stemming algorithm [38].

Aggregation: Topic detection methods based on word or n–grams cooccur-rences, or any other type of statistical inference, suﬀer in the absence of long documents. This is the case of social media, where user–generated content is typically in the form of short posts. In information retrieval it is a common practice to partially address this problem by concatenating diﬀerent messages together to produce documents of larger size. Large doc-uments construction is based on two strategies. The first strategy involves temporal aggregation that concatenates together N messages, whose gen-eration date is contiguous. The second strategy involves a similarity–based aggregation which attaches to a message all the near–duplicate messages posted in the same time slot, identified through an eﬃcient document clus-tering method [24], which is also also used by one of the examined topic detection algorithms presented in Section 2.4.4.

Determining the eﬀect of such preprocessing algorithms on the quality of the final topic detection is diﬃcult to assess, and it has not been much investigated so far. For instance, the aggregation of posts in super–documents could on the one hand help to improve the word cooccurrence statistic but on the other hand introduces the risk of grouping terms related to diﬀerent topics, and to reveal false cooccurrence.

#### Latent Dirichlet Allocation

Topic extraction in textual corpus can be addressed through probabilistic topic models. In general, a topic model is a Bayesian model which associates with each document a probability distribution over the topics, where each topic is in turn a probability distribution. The Latent Dirichlet Allocation (LDA) [25] is the best known and most widely used topic model. According to LDA, every document is considered as a bag of terms, which are the only observed variables in the model. The topic distribution per document and the term distribution per topic are instead hidden variable and have to be estimated through Bayesian inference. The Collapsed Variational Bayesian inference algorithm [39], which is an LDA variant, is computationally eﬃcient, more accurate than standard variational Bayesian inference for LDA, and has given rise to many independent implementations already available in the literature. LDA requires the expected number of topics k as a input. The estimation of the optimal k, although possible through the use of non-parametric methods [40], falls beyond the scope of this thesis.

**Table of contents :**

**1 Introduction **

1.1 Dynamic Social Networks

1.1.1 The Twitter Social Network

1.2 Research and Technical Challenges

1.3 Problem Statement and Objectives

1.4 Scope and Plan of the Thesis

**2 Background and Related Work **

2.1 Introduction

2.2 Document–pivot methods

2.3 Feature–pivot methods

2.4 Related Work

2.4.1 Problem Definition

2.4.2 Data Preprocessing

2.4.3 Latent Dirichlet Allocation

2.4.4 Document–Pivot Topic Detection

2.4.5 Graph–Based Feature–Pivot Topic Detection

2.4.6 Frequent Pattern Mining

2.4.7 Soft Frequent Pattern Mining

2.4.8 BNgram

2.5 Chapter Summary

**3 Joint Sequence Complexity: Introduction and Theory **

3.1 Introduction

3.2 Sequence Complexity

3.3 Joint Complexity

3.4 Main Results

3.4.1 Models and Notations

3.4.2 Summary of Main Results

3.5 Proof of Main Results

3.5.1 An important asymptotic equivalence

3.5.2 Functional Equations

3.5.3 Double DePoissonization

3.5.4 Same Markov sources

3.5.5 Different Markov Sources

3.6 Expending Asymptotics and Periodic Terms

3.7 Numerical Experiments in Twitter

3.8 Suffix Trees

3.8.1 Examples of Suffix Trees

3.9 Snow Data Challenge

3.9.1 Topic Detection Method

3.9.2 Headlines

3.9.3 Keywords Extraction

3.9.4 Media URLs

3.9.5 Evaluation of Topic Detection

3.10 Tweet Classification

3.10.1 Tweet augmentation

3.10.2 Training Phase

3.10.3 Run Phase

3.10.4 Experimental Results on Tweet Classification

3.11 Chapter Summary

**4 Text Classification via Compressive Sensing **

4.1 Introduction

4.2 Compressive Sensing Theory

4.3 Compressive Sensing Classification

4.3.1 Training Phase

4.3.2 Run Phase

4.4 Tracking via Kalman Filter

4.5 Experimental Results

4.5.1 Classification Performance based on Ground Truth

4.6 Chapter Summary

**5 Extension of Joint Complexity and Compressive Sensing **

5.1 Introduction

5.2 Indoor Path-Tracking Using Compressive RSS Measurements

5.2.1 Prior Work on RSS-based Path Tracking

5.2.2 CS-based Location Estimation

5.2.3 CS-Kalman Filter

5.2.4 Experimental Results

5.3 Encryption System based on Compressive Sensing Measurements

5.3.1 SecLoc System Description

5.3.2 Possible Attacks from Malicious Users

5.3.3 Experimental Results

5.4 Stealth Encryption based on Eulerian Circuits

5.4.1 Background

5.4.2 Motivation and Algorithm Description

5.4.3 Performance in Markov Models

5.4.4 Experimental Results

5.5 Chapter Summary

**6 Conclusions and Perspectives **

**Bibliography**