
## Feature–pivot methods

Feature–pivot methods are closely related to topic models in natural language processing: statistical models used to build sets of terms that are representative of the topics occurring in a corpus of documents. Most state-of-the-art static topic models are based on Latent Dirichlet Allocation (LDA) [25], which is described in Section 2.4.3. Although LDA extensions for dynamic data have been proposed [26], alternative approaches that capture topics through the detection of term burstiness have also been studied [27], mainly in the context of news media mining. The idea behind these methods is that “breaking news”, unlike other discussion topics, reaches a fast peak of attention from routine users as soon as it is tweeted or posted [28, 29]. Accordingly, the common framework underlying most approaches in this category first identifies bursty terms and then clusters them together to produce topic definitions.

Before the diffusion of social media services, the detection of bursty events had already been studied in generic document sets. The method presented in [23], for instance, detects bursty terms by examining their frequency in a given time window. Once the bursty terms are found, they are clustered using a probabilistic model of cooccurrence. The need for such a global topic term distribution restricts this approach to a batch mode of computation. Similar methods have been tested for topic detection in social media, such as Twitter, but with additional emphasis on enriching the obtained topics with non–bursty but relevant terms, URLs, and locations [30].

### Data Preprocessing

The content of user-generated messages can be unpredictably noisy. In many works, in order to reduce the amount of noise before topic detection proper is executed, the raw data extracted through the seed terms filter is subjected to three preprocessing steps.

Tokenization: In a raw post, terms can be combined with any sort of punctuation and hyphenation and can contain abbreviations, typos, or conventional word variations. The Twokenizer tool [20] is used to extract bags of cleaner terms from the original messages by removing stopwords and punctuation, compressing redundant character repetitions, and removing mentions, i.e., IDs or names of other users included in the text for messaging purposes.
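The cleaning steps described above can be sketched as follows. This is a minimal illustration in Python; the stopword list and the regular expressions are simplified placeholders, not the actual rules of the Twokenizer tool.

```python
import re

# Illustrative stopword list; real systems use a much larger one
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "rt"}

def clean_tokens(post: str) -> list[str]:
    # Drop user mentions (@name) included for messaging purposes
    post = re.sub(r"@\w+", " ", post)
    # Compress redundant character repetitions, e.g. "soooo" -> "soo"
    post = re.sub(r"(.)\1{2,}", r"\1\1", post)
    # Split on anything that is not a letter, digit, or apostrophe
    tokens = re.findall(r"[a-z0-9']+", post.lower())
    # Remove stopwords, leaving a bag of cleaner terms
    return [t for t in tokens if t not in STOPWORDS]

print(clean_tokens("RT @bob: This is soooo cool!!! #topic-detection"))
# → ['this', 'soo', 'cool', 'topic', 'detection']
```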

Stemming: In information retrieval, stemming is the process of reducing inflected words to their root (or stem), so that related words map to the same stem. This process naturally reduces the number of words associated with each document, thus simplifying the feature space. Most techniques use an implementation of the Porter stemming algorithm [38].
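The suffix-stripping idea can be illustrated with a toy stemmer. This is a deliberate simplification: the actual Porter algorithm applies several ordered rule phases with conditions on the measure of the remaining stem.

```python
def simple_stem(word: str) -> str:
    """Toy suffix-stripping stemmer illustrating the idea behind
    Porter's algorithm (not the algorithm itself): strip a known
    suffix when a reasonably long stem remains."""
    for suffix in ("ational", "ization", "ing", "edly", "ed", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

# Related inflected forms map to a common stem, shrinking the feature space
print(simple_stem("connected"), simple_stem("connecting"))
# → connect connect
```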

Aggregation: Topic detection methods based on word or n–gram cooccurrences, or any other type of statistical inference, suffer in the absence of long documents. This is the case in social media, where user–generated content typically takes the form of short posts. In information retrieval it is a common practice to partially address this problem by concatenating different messages together to produce documents of larger size. The construction of large documents is based on two strategies. The first is temporal aggregation, which concatenates N messages with contiguous generation dates. The second is similarity–based aggregation, which attaches to a message all the near–duplicate messages posted in the same time slot, identified through an efficient document clustering method [24], which is also used by one of the examined topic detection algorithms presented in Section 2.4.4.
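The temporal aggregation strategy, for instance, can be sketched as follows, assuming the posts are already sorted by creation date; N = 2 is an arbitrary illustrative choice.

```python
def temporal_aggregate(messages: list[str], n: int) -> list[str]:
    """Concatenate every n consecutive posts (assumed sorted by
    creation date) into one larger pseudo-document."""
    return [" ".join(messages[i:i + n]) for i in range(0, len(messages), n)]

posts = ["earthquake downtown", "felt a strong tremor",
         "power is out", "aftershock just now"]
print(temporal_aggregate(posts, 2))
# → ['earthquake downtown felt a strong tremor', 'power is out aftershock just now']
```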

#### Latent Dirichlet Allocation

Topic extraction from a textual corpus can be addressed through probabilistic topic models. In general, a topic model is a Bayesian model which associates with each document a probability distribution over topics, where each topic is in turn a probability distribution over terms. Latent Dirichlet Allocation (LDA) [25] is the best known and most widely used topic model. According to LDA, every document is considered a bag of terms, which are the only observed variables in the model. The topic distribution per document and the term distribution per topic are hidden variables and have to be estimated through Bayesian inference. The Collapsed Variational Bayesian inference algorithm [39], an approximate inference scheme for LDA, is computationally efficient, more accurate than standard variational Bayesian inference for LDA, and has given rise to many independent implementations already available in the literature. LDA requires the expected number of topics k as an input. The estimation of the optimal k, although possible through the use of non-parametric methods [40], falls beyond the scope of this thesis.
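To make the hidden variables concrete, the following is a toy collapsed Gibbs sampler for LDA, a simpler relative of the collapsed variational inference mentioned above, not the algorithm of [39]. The hyperparameters alpha and beta, the iteration count, and the example corpus are arbitrary illustrative choices.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token
    lists; returns the per-topic word counts, from which the term
    distribution per topic can be read off."""
    rng = random.Random(seed)
    v = len({w for d in docs for w in d})          # vocabulary size
    z = [[rng.randrange(k) for _ in d] for d in docs]   # topic per token
    ndk = [[0] * k for _ in docs]                  # topic counts per document
    nkw = [defaultdict(int) for _ in range(k)]     # word counts per topic
    nk = [0] * k                                   # total tokens per topic
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = z[di][wi]
            ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    for _ in range(iters):
        for di, d in enumerate(docs):
            for wi, w in enumerate(d):
                t = z[di][wi]
                # Remove the token's current assignment from the counts
                ndk[di][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1
                # Resample a topic proportional to (doc-topic) * (topic-word)
                weights = [(ndk[di][j] + alpha) * (nkw[j][w] + beta)
                           / (nk[j] + v * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[di][wi] = t
                ndk[di][t] += 1; nkw[t][w] += 1; nk[t] += 1
    return nkw

docs = [["goal", "match", "team"], ["match", "team", "goal"],
        ["vote", "party", "election"], ["election", "vote", "party"]]
topics = lda_gibbs(docs, k=2)
```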

**Document–Pivot Topic Detection**

The second method discussed here is an instance of a classical Topic Detection and Tracking method which uses a document–pivot approach (Doc-p). It works as follows:

First, the method performs online clustering of posts. It computes the cosine similarity between the tf-idf [41] representation of an incoming post and all posts processed so far. If the best cosine similarity is above a threshold θ_tfidf, it assigns the item to the same cluster as its best match; otherwise it creates a new cluster with the new post as its only item. The best matching tweet is efficiently retrieved by Locality Sensitive Hashing (LSH).

Then, it filters out clusters with item count smaller than n.

For each cluster c, it computes a score as follows:

score(c) = Σ_{doc ∈ c} Σ_{word ∈ doc} exp(−p(word)).
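The online clustering step of Doc-p can be sketched as follows. In this simplified version, exhaustive search stands in for the LSH lookup, raw term frequencies stand in for tf-idf weights, and the threshold value is an arbitrary illustrative choice.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def online_cluster(posts: list[str], theta: float = 0.5) -> list[int]:
    """Assign each incoming post to the cluster of its best match when
    the cosine similarity exceeds theta; otherwise open a new cluster."""
    vecs, labels, next_label = [], [], 0
    for post in posts:
        vec = Counter(post.split())
        best, best_sim = None, 0.0
        for i, other in enumerate(vecs):       # exhaustive search in place of LSH
            sim = cosine(vec, other)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim > theta:
            labels.append(labels[best])        # join the best match's cluster
        else:
            labels.append(next_label)          # start a new cluster
            next_label += 1
        vecs.append(vec)
    return labels

labels = online_cluster(["big match today", "big match tonight",
                         "election results out"])
print(labels)
# → [0, 0, 1]
```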

**Graph–Based Feature–Pivot Topic Detection**

The Graph–Based Feature–Pivot Topic Detection method (GFeat-p) is the first of a series of feature–pivot methods in which the clustering step is performed via the Structural Clustering Algorithm for Networks (SCAN) [42], an algorithm generally applied to network analysis. Apart from detecting communities of nodes, SCAN provides a list of hubs (vertices that bridge many clusters), each of which may be connected to a set of communities. When SCAN is applied to topic detection, the nodes in the network correspond to terms and the communities correspond to topics. The detected hubs are terms related to more than one topic, and they effectively provide an explicit link between topics, which would not be possible to achieve with a common partition clustering algorithm. The terms to be clustered are a subset of the terms present in the corpus, selected by comparison against an independent reference corpus of randomly collected tweets [20]. For each term w in the reference corpus, the likelihood of appearance p(w|corpus) is estimated as:

p(w|corpus) = N_w(corpus) / N(corpus),

where N_w(corpus) is the number of appearances of term w in the corpus and N(corpus) is the total number of term appearances.
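The term selection and graph construction steps can be sketched as follows. The selection criterion (a likelihood ratio against the reference corpus) and the smoothing constant are illustrative stand-ins, and SCAN itself, which would then cluster the resulting graph into topic communities and hubs, is omitted.

```python
from collections import Counter
from itertools import combinations

def select_terms(docs, ref_docs, top=3):
    """Estimate p(w|corpus) = N_w / N in both the target and the
    reference corpus, and keep the terms whose likelihood grows the
    most relative to the reference."""
    def probs(ds):
        c = Counter(w for d in ds for w in d)
        n = sum(c.values())
        return {w: c[w] / n for w in c}
    p, q = probs(docs), probs(ref_docs)
    unseen = 1 / (1 + sum(len(d) for d in ref_docs))   # crude smoothing
    ratio = {w: p[w] / q.get(w, unseen) for w in p}
    return sorted(ratio, key=ratio.get, reverse=True)[:top]

def cooccurrence_graph(docs, terms):
    """Edges weighted by the number of documents each pair of selected
    terms shares; SCAN would be run on this graph."""
    edges = Counter()
    for d in docs:
        present = [t for t in set(d) if t in terms]
        for a, b in combinations(sorted(present), 2):
            edges[(a, b)] += 1
    return edges

docs = [["goal", "match"], ["goal", "match"], ["vote", "goal"]]
ref = [["the", "a"], ["goal", "the"]]
bursty = select_terms(docs, ref, top=2)
graph = cooccurrence_graph(docs, set(bursty))
```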

**Soft Frequent Pattern Mining**

In Section 2.4.6 a frequent pattern mining approach for topic detection was described. It provided an elegant solution to a problem of feature–pivot methods that take into account only pairwise cooccurrences between terms: such methods struggle with corpora of densely interconnected topics. The method of Section 2.4.5 examines only pairwise cooccurrences, whereas frequent pattern mining examines cooccurrences between any number of terms, typically more than two. A question that naturally arises is whether it is possible to formulate a method that lies between these two extremes. Such a method would examine cooccurrence patterns between sets of terms with cardinality larger than two, as frequent pattern mining does, but it would be less strict in that it would not require all terms in these sets to cooccur frequently. Instead, in order to ensure topic cohesiveness, it would require that large subsets of the grouped terms, but not necessarily all of them, cooccur frequently, resulting in a “soft” version of frequent pattern mining.

The proposed approach (SFPM) works by maintaining a set of terms S, to which new terms are added in a greedy manner, according to how often they cooccur with the terms already in S. In order to quantify the cooccurrence match between a set S and a candidate term t, a vector D_S for S and a vector D_t for t are maintained, both with dimension n, where n is the number of documents in the collection. The i-th element of D_S denotes how many of the terms in S cooccur in the i-th document, whereas the i-th element of D_t is a binary indicator of whether term t occurs in the i-th document. Thus, for a term t that frequently cooccurs with the terms in S, the vector D_t will have a high cosine similarity to the corresponding vector D_S.
Note that some elements of D_S may have the value |S|, meaning that all terms in S occur in the corresponding documents, whereas others may have a smaller value, indicating that only a subset of the terms in S cooccur in the corresponding documents. For a term examined for expansion of S, it is clear that the documents in which not all terms cooccur will also contribute to the similarity score, albeit somewhat less than the documents in which all terms cooccur. This achieves the “soft” matching between S and a term considered for its expansion. Finding the best matching term can be done either by exhaustive search or by some approximate nearest neighbour scheme such as LSH. Since a greedy approach that expands S with the best matching term is used, a criterion is needed to terminate the expansion process. The termination criterion clearly has to deal with the cohesiveness of the generated topics: if not properly set, the resulting topics may either end up having too few terms or really being a mixture of topics (many terms related to possibly irrelevant topics). To deal with this, a cosine similarity threshold θ(S) between S and the next best matching term is used. If the similarity is above the threshold, the term is added; otherwise the expansion process stops. This threshold is the only parameter of the proposed algorithm and is set to be a function of the cardinality of S. In particular, a sigmoid function of the following form is used:

θ(S) = 1 − 1 / (1 + exp((|S| − b) / c)).
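The expansion loop described above can be sketched as follows. The parameters b and c of the sigmoid and the example corpus are arbitrary illustrative values, and exhaustive search stands in for LSH.

```python
import math

def sigmoid_threshold(size: int, b: float = 4, c: float = 1.0) -> float:
    """theta(S) = 1 - 1/(1 + exp((|S| - b)/c)): the bar for admitting
    a new term rises as the set grows."""
    return 1 - 1 / (1 + math.exp((size - b) / c))

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def sfpm_expand(seed, candidates, docs, b=4, c=1.0):
    """Greedy SFPM-style expansion: D_S[i] counts how many terms of S
    occur in document i; D_t[i] flags whether t occurs in document i.
    docs is a list of term sets."""
    s = [seed]
    ds = [int(seed in d) for d in docs]            # vector D_S
    pool = set(candidates) - {seed}
    while pool:
        dt = {t: [int(t in d) for d in docs] for t in pool}  # vectors D_t
        best = max(pool, key=lambda t: cosine(ds, dt[t]))
        if cosine(ds, dt[best]) <= sigmoid_threshold(len(s), b, c):
            break                                  # cohesiveness criterion met
        s.append(best)
        ds = [x + y for x, y in zip(ds, dt[best])]
        pool.remove(best)
    return s

docs = [{"goal", "match"}, {"goal", "match"}, {"vote"}]
topic = sfpm_expand("goal", {"match", "vote"}, docs)
print(topic)
# → ['goal', 'match']
```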

**Table of contents:**

**1 Introduction**

1.1 Dynamic Social Networks

1.1.1 The Twitter Social Network

1.2 Research and Technical Challenges

1.3 Problem Statement and Objectives

1.4 Scope and Plan of the Thesis

**2 Background and Related Work**

2.1 Introduction

2.2 Document–pivot methods

2.3 Feature–pivot methods

2.4 Related Work

2.4.1 Problem Definition

2.4.2 Data Preprocessing

2.4.3 Latent Dirichlet Allocation

2.4.4 Document–Pivot Topic Detection

2.4.5 Graph–Based Feature–Pivot Topic Detection

2.4.6 Frequent Pattern Mining

2.4.7 Soft Frequent Pattern Mining

2.4.8 BNgram

2.5 Chapter Summary

**3 Joint Sequence Complexity: Introduction and Theory**

3.1 Introduction

3.2 Sequence Complexity

3.3 Joint Complexity

3.4 Main Results

3.4.1 Models and Notations

3.4.2 Summary of Main Results

3.5 Proof of Main Results

3.5.1 An important asymptotic equivalence

3.5.2 Functional Equations

3.5.3 Double DePoissonization

3.5.4 Same Markov sources

3.5.5 Different Markov Sources

3.6 Expanding Asymptotics and Periodic Terms

3.7 Numerical Experiments in Twitter

3.8 Suffix Trees

3.8.1 Examples of Suffix Trees

3.9 Snow Data Challenge

3.9.1 Topic Detection Method

3.9.2 Headlines

3.9.3 Keywords Extraction

3.9.4 Media URLs

3.9.5 Evaluation of Topic Detection

3.10 Tweet Classification

3.10.1 Tweet augmentation

3.10.2 Training Phase

3.10.3 Run Phase

3.10.4 Experimental Results on Tweet Classification

3.11 Chapter Summary

**4 Text Classification via Compressive Sensing**

4.1 Introduction

4.2 Compressive Sensing Theory

4.3 Compressive Sensing Classification

4.3.1 Training Phase

4.3.2 Run Phase

4.4 Tracking via Kalman Filter

4.5 Experimental Results

4.5.1 Classification Performance based on Ground Truth

4.6 Chapter Summary

**5 Extension of Joint Complexity and Compressive Sensing**

5.1 Introduction

5.2 Indoor Path-Tracking Using Compressive RSS Measurements

5.2.1 Prior Work on RSS-based Path Tracking

5.2.2 CS-based Location Estimation

5.2.3 CS-Kalman Filter

5.2.4 Experimental Results

5.3 Encryption System based on Compressive Sensing Measurements

5.3.1 SecLoc System Description

5.3.2 Possible Attacks from Malicious Users

5.3.3 Experimental Results

5.4 Stealth Encryption based on Eulerian Circuits

5.4.1 Background

5.4.2 Motivation and Algorithm Description

5.4.3 Performance in Markov Models

5.4.4 Experimental Results

5.5 Chapter Summary

**6 Conclusions and Perspectives**

**Bibliography**

**A Suffix Trees**

A.1 Suffix Tree Construction

A.2 Suffix Trees Superposition