
## Feature–pivot methods

Feature–pivot methods are closely related to topic models in natural language processing, namely statistical models used to build sets of terms that are representative of the topics occurring in a corpus of documents. Most state-of-the-art static topic models are based on Latent Dirichlet Allocation (LDA) [25], which is described in Section 2.4.3. Even though LDA extensions for dynamic data have been proposed [26], alternative approaches that capture topics through the detection of term burstiness have also been studied [27], mainly in the context of news media mining. The idea behind these methods is that "breaking news", unlike other discussion topics, quickly reaches a peak of attention from users as soon as it is tweeted or posted [28, 29]. Accordingly, the common framework underlying most approaches in this category first identifies bursty terms and then clusters them together to produce topic definitions. Before the diffusion of social media services, the detection of bursty events had been studied in generic document sets. The method presented in [23], for instance, detects bursty terms by examining their frequency in a given time window. Once the bursty terms are found, they are clustered using a probabilistic model of cooccurrence. The need for such a global topic term distribution restricts this approach to a batch mode of computation. Similar methods have been tested for topic detection in social media such as Twitter, with additional emphasis on enriching the obtained topics with non-bursty but relevant terms, URLs and locations [30].
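To make the burstiness idea concrete, the sketch below flags terms whose relative frequency in the current time window greatly exceeds the frequency expected from a reference corpus. This is a minimal illustration rather than the method of [23]: the function name, the add-one smoothing, and the `min_count` and `ratio` thresholds are all illustrative choices.

```python
from collections import Counter

def bursty_terms(window_docs, ref_counts, ref_total, min_count=3, ratio=3.0):
    """Flag terms whose frequency in the current time window greatly
    exceeds the frequency expected from a reference corpus."""
    counts = Counter(w for doc in window_docs for w in doc)
    window_total = sum(counts.values())
    bursty = []
    for term, c in counts.items():
        if c < min_count:
            continue  # ignore terms too rare to judge
        # reference probability with add-one smoothing for unseen terms
        p_ref = (ref_counts.get(term, 0) + 1) / (ref_total + 1)
        if (c / window_total) / p_ref >= ratio:
            bursty.append(term)
    return bursty
```

A term like "earthquake", absent from the reference corpus but frequent in the current window, would be flagged, while routine high-frequency terms would not.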

### Latent Dirichlet Allocation

Topic extraction in textual corpora can be addressed through probabilistic topic models. In general, a topic model is a Bayesian model which associates with each document a probability distribution over topics, where each topic is in turn a probability distribution over terms. Latent Dirichlet Allocation (LDA) [25] is the best known and most widely used topic model. According to LDA, every document is considered a bag of terms, which are the only observed variables in the model. The topic distribution per document and the term distribution per topic are hidden variables and have to be estimated through Bayesian inference. The Collapsed Variational Bayesian inference algorithm [39], a variant of variational inference for LDA, is computationally efficient, more accurate than standard variational Bayesian inference, and has given rise to many independent implementations already available in the literature. LDA requires the expected number of topics k as an input. The estimation of the optimal k, although possible through the use of non-parametric methods [40], falls beyond the scope of this thesis.

#### Document–Pivot Topic Detection

The second method discussed here is an instance of a classical Topic Detection and Tracking method that uses a document–pivot approach (Doc-p). It works as follows:

First, the method performs online clustering of posts: it computes the cosine similarity of the tf-idf [41] representation of an incoming post with all posts processed so far. If the best cosine similarity exceeds a threshold θ_tfidf, it assigns the item to the same cluster as its best match; otherwise, it creates a new cluster with the new post as its only item. The best matching tweet is efficiently retrieved by Locality Sensitive Hashing (LSH).

Then, it filters out clusters with fewer than n items.

For each cluster c, it computes a score as follows:

$$\mathrm{score}(c) = \sum_{doc \in c} \; \sum_{word \in doc} \exp\big(-p(word)\big)$$

The probability of appearance of a single term p(word) is estimated from a reference dataset that has been collected from Twitter, mentioned in Section 2.4.5. Thus, less frequent terms contribute more to the score of the cluster.
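This scoring step can be sketched as below, assuming `p` is a dictionary of term probabilities estimated from the reference dataset; the negative sign in the exponent follows the remark that less frequent terms must contribute more.

```python
import math

def cluster_score(cluster, p):
    """Sum exp(-p(word)) over every word occurrence in every document
    of the cluster; rare terms (small p) contribute more."""
    return sum(math.exp(-p.get(w, 0.0)) for doc in cluster for w in doc)
```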

Clusters are sorted according to their score and the top clusters are returned. LSH can rapidly provide the nearest neighbours with respect to cosine similarity in a large collection of documents. An alternative is to use inverted indices on the terms appearing in the tweets and then compute the cosine similarity between the incoming document and the set of documents that have a significant term overlap with it; LSH, however, is more efficient, as it retrieves the nearest neighbours directly.
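The online clustering step above can be sketched as follows. For brevity this sketch uses exact nearest-neighbour search in place of LSH and recomputes idf incrementally from the documents seen so far; the threshold `theta` and all names are illustrative.

```python
import math
from collections import Counter, defaultdict

def tfidf(doc, df, n_docs):
    # tf-idf vector of a bag-of-words post (idf from documents seen so far)
    tf = Counter(doc)
    return {w: c * math.log(1 + n_docs / df[w]) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def doc_pivot(posts, theta=0.5, min_size=2):
    """Assign each incoming post to its best-matching cluster if the best
    cosine similarity reaches theta; otherwise open a new cluster."""
    df = defaultdict(int)          # document frequencies seen so far
    seen = []                      # (tf-idf vector, cluster id) per post
    clusters = []
    for n, post in enumerate(posts, start=1):
        for w in set(post):
            df[w] += 1
        v = tfidf(post, df, n)
        best_cid, best_sim = None, 0.0
        for u, cid in seen:        # exact search; the method uses LSH here
            s = cosine(v, u)
            if s > best_sim:
                best_cid, best_sim = cid, s
        if best_cid is not None and best_sim >= theta:
            clusters[best_cid].append(post)
            seen.append((v, best_cid))
        else:
            clusters.append([post])
            seen.append((v, len(clusters) - 1))
    return [c for c in clusters if len(c) >= min_size]
```

The final filter implements the cluster-size threshold n (`min_size` here); ranking the surviving clusters by score is then a separate step.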

#### Graph–Based Feature–Pivot Topic Detection

The Graph–Based Feature–Pivot Topic Detection method (GFeat-p) is the first of a series of feature–pivot methods in which the clustering step is performed via the Structural Clustering Algorithm for Networks (SCAN) [42], an algorithm generally applied to network analysis. Apart from detecting communities of nodes, SCAN provides a list of hubs (vertices that bridge many clusters), each of which may be connected to a set of communities. When SCAN is applied to topic detection, the nodes in the network correspond to terms and the communities correspond to topics. The detected hubs are terms related to more than one topic, and effectively provide an explicit link between topics; this would not be possible to achieve with a common partitional clustering algorithm. The terms to be clustered are a subset of the terms present in the corpus, selected by comparison against an independent reference corpus of randomly collected tweets [20]. For each of the terms in the reference corpus, the likelihood of appearance p(w|corpus) is estimated from the reference corpus.
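A compact sketch of the SCAN idea on a term graph follows, assuming the graph is given as an adjacency dictionary of sets. `eps` (structural-similarity threshold) and `mu` (minimum neighbourhood size) are SCAN's standard parameters; the hub rule used here (an unclustered node adjacent to at least two clusters) is a simplification of the algorithm in [42].

```python
import math

def structural_similarity(adj, u, v):
    # sigma(u, v) = |G(u) & G(v)| / sqrt(|G(u)| * |G(v)|),
    # where G(x) is the closed neighbourhood of x
    gu, gv = adj[u] | {u}, adj[v] | {v}
    return len(gu & gv) / math.sqrt(len(gu) * len(gv))

def scan(adj, eps=0.7, mu=2):
    """Cluster term nodes; unclustered nodes adjacent to two or more
    clusters are reported as hubs (terms bridging topics)."""
    cluster, cid = {}, 0
    for n in adj:
        if n in cluster:
            continue
        # eps-neighbourhood of n (nodes with similarity >= eps)
        neigh = [m for m in adj[n] | {n}
                 if structural_similarity(adj, n, m) >= eps]
        if len(neigh) >= mu:                 # n is a core node
            queue = [n]
            while queue:
                x = queue.pop()
                xn = [m for m in adj[x] | {x}
                      if structural_similarity(adj, x, m) >= eps]
                if len(xn) >= mu:            # expand from core nodes only
                    for m in xn:
                        if m not in cluster:
                            cluster[m] = cid
                            queue.append(m)
            cid += 1
    hubs = [n for n in adj if n not in cluster
            and len({cluster[m] for m in adj[n] if m in cluster}) >= 2]
    return cluster, hubs
```

On a term graph of two tight cooccurrence communities joined by a shared term, SCAN assigns each community its own cluster and reports the shared term as a hub linking the two topics.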

#### Soft Frequent Pattern Mining

In Section 2.4.6 a frequent pattern mining approach for topic detection was described. It provides an elegant solution to a limitation of feature–pivot methods, which take into account only pairwise cooccurrences between terms and therefore struggle with corpora of densely interconnected topics. Whereas the method of Section 2.4.5 examines only pairwise cooccurrences, frequent pattern mining examines cooccurrences between any number of terms, typically larger than two. A question that naturally arises is whether it is possible to formulate a method that lies between these two extremes. Such a method would examine cooccurrence patterns between sets of terms with cardinality larger than two, like frequent pattern mining does, but it would be less strict by not requiring that all terms in these sets cooccur frequently. Instead, in order to ensure topic cohesiveness, it would require that large subsets of the grouped terms, but not necessarily all of them, cooccur frequently, resulting in a "soft" version of frequent pattern mining.
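The "soft" idea can be sketched as a greedy set expansion: starting from a seed term, repeatedly add the term whose cooccurrence vector is most similar to the cooccurrence vector of the set as a whole, so no single pairwise constraint has to hold for every member. The function names and the threshold `theta` below are illustrative, not the exact algorithm of Section 2.4.7.

```python
import math
from collections import Counter

def cooccurrences(docs, vocab):
    # co[t][u]: number of documents in which terms t and u cooccur
    co = {t: Counter() for t in vocab}
    for doc in docs:
        uniq = set(doc) & set(vocab)
        for t in uniq:
            for u in uniq:
                if u != t:
                    co[t][u] += 1
    return co

def expand_topic(seed, co, vocab, theta=0.3):
    """Greedily grow a term set: add the term whose cooccurrence vector
    has the highest cosine similarity to the vector of the whole set."""
    S = {seed}
    vec = Counter(co[seed])                  # cooccurrence vector of S
    while True:
        best, best_sim = None, theta
        for t in vocab:
            if t in S:
                continue
            dot = sum(x * co[t][w] for w, x in vec.items())
            nv = math.sqrt(sum(x * x for x in vec.values()))
            nt = math.sqrt(sum(x * x for x in co[t].values()))
            sim = dot / (nv * nt) if nv and nt else 0.0
            if sim > best_sim:
                best, best_sim = t, sim
        if best is None:
            return S
        S.add(best)
        vec += co[best]   # soft: the set vector, not every pair, matters
```

Because the similarity is measured against the aggregated vector of the set, a term may be added even if it cooccurs strongly with only a large subset of the current members.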

**Table of contents:**

**1 Introduction**

1.1 Dynamic Social Networks

1.1.1 The Twitter Social Network

1.2 Research and Technical Challenges

1.3 Problem Statement and Objectives

1.4 Scope and Plan of the Thesis

**2 Background and Related Work**

2.1 Introduction

2.2 Document–pivot methods

2.3 Feature–pivot methods

2.4 Related Work

2.4.1 Problem Definition

2.4.2 Data Preprocessing

2.4.3 Latent Dirichlet Allocation

2.4.4 Document–Pivot Topic Detection

2.4.5 Graph–Based Feature–Pivot Topic Detection

2.4.6 Frequent Pattern Mining

2.4.7 Soft Frequent Pattern Mining

2.4.8 BNgram

2.5 Chapter Summary

**3 Joint Sequence Complexity: Introduction and Theory**

3.1 Introduction

3.2 Sequence Complexity

3.3 Joint Complexity

3.4 Main Results

3.4.1 Models and Notations

3.4.2 Summary of Main Results

3.5 Proof of Main Results

3.5.1 An important asymptotic equivalence

3.5.2 Functional Equations

3.5.3 Double DePoissonization

3.5.4 Same Markov sources

3.5.5 Different Markov Sources

3.6 Expanding Asymptotics and Periodic Terms

3.7 Numerical Experiments in Twitter

3.8 Suffix Trees

3.8.1 Examples of Suffix Trees

3.9 Snow Data Challenge

3.9.1 Topic Detection Method

3.9.2 Headlines

3.9.3 Keywords Extraction

3.9.4 Media URLs

3.9.5 Evaluation of Topic Detection

3.10 Tweet Classification

3.10.1 Tweet augmentation

3.10.2 Training Phase

3.10.3 Run Phase

3.10.4 Experimental Results on Tweet Classification

3.11 Chapter Summary

**4 Text Classification via Compressive Sensing**

4.1 Introduction

4.2 Compressive Sensing Theory

4.3 Compressive Sensing Classification

4.3.1 Training Phase

4.3.2 Run Phase

4.4 Tracking via Kalman Filter

4.5 Experimental Results

4.5.1 Classification Performance based on Ground Truth

4.6 Chapter Summary

**5 Extension of Joint Complexity and Compressive Sensing**

5.1 Introduction

5.2 Indoor Path-Tracking Using Compressive RSS Measurements

5.2.1 Prior Work on RSS-based Path Tracking

5.2.2 CS-based Location Estimation

5.2.3 CS-Kalman Filter

5.2.4 Experimental Results

5.3 Encryption System based on Compressive Sensing Measurements

5.3.1 SecLoc System Description

5.3.2 Possible Attacks from Malicious Users

5.3.3 Experimental Results

5.4 Stealth Encryption based on Eulerian Circuits

5.4.1 Background

5.4.2 Motivation and Algorithm Description

5.4.3 Performance in Markov Models

5.4.4 Experimental Results

5.5 Chapter Summary

**6 Conclusions and Perspectives**

**Bibliography**