Dynamic Social Networks
Social networks have undergone a dramatic growth in recent years. Such networks provide an extremely suitable space to instantly share multimedia information between individuals and their neighbours in the social graph. Social networks provide a powerful reflection of the structure, the dynamics of the society and the interaction of the Internet generation with both people and technology. Indeed, the dramatic growth of social multimedia and user generated content is revolutionising all phases of the content value chain includ-ing production, processing, distribution and consumption. It also originated and brought to the multimedia sector a new underestimated and now critical aspect of science and technology, which is social interaction and networking. The importance of this new rapidly evolving research field is clearly evidenced by the many associated emerging technologies and applications, including (a) online content sharing services and communities, (b) multimedia communication over the Internet, (c) social multimedia search, (d) interactive services and entertainment, (e) health care and (f) security applications. It has generated a new research area called social multimedia computing, in which well established computing and multimedia networking technologies are brought together with emerging social media research.
Social networking services are changing the way we communicate with others, entertain and actually live. Social networking is one of the primary reasons why more people have become avid Internet users, people who until the emergence of social networks could not find interests in the Web. This is a very robust indicator of what is really happening online. Nowadays, users both produce and consume significant quantities of multimedia content. Moreover, their behaviour through online communities, is forming a new Internet era where multimedia content sharing through Social Networking Sites (SNSs) is an everyday practice. More than 200 SNSs of worldwide impact are known today and this number is growing quickly. Many of the existing top web sites are either SNSs or oﬀer some social networking capabilities.
Except for the major social networks with hundreds of millions of users that span in the entire world, there are also many smaller SNSs which are equally as popular as the major social networks within the more limited geographical scope of their membership, e.g. within a city or a country. There are also many vertically oriented communities that gather users around a specific topic and have many dedicated members on all continents.
Facebook is ranked among the most visited sites in the world, with over than 1:3 billion subscribed users to date. Moreover, Friendster is popular in Asia, Orkut in Brazil and Vkon-takte in Russia. On top of that, there are dozens of other social networks with vibrant communities, such as Vznet, Xing, Badoo, Netlog, Tuenti, Barrabes, Hyves, Nasza Klasa, LunarStorm, Zoo, Sapo, Daily-Motion and so on. There are also many vertically oriented communities which gather users around a specific topic, such as music, books, etc. LinkedIn with over 330 million users or Viadeo with 50 million users and Xing with 13:5 million users are mostly oriented in establishing professional connections between their users and initiate potential business collaborations.
The rapid growth in popularity of social networks has enabled large num-bers of users to communicate, create and share content, give and receive recommendations, and, at the same time, it opened new challenging problems. The unbounded growth of content and users pushes the Internet technologies to its limits and demands for new solutions. Such challenges are present in all SNSs to a greater or lesser extent. Considerable amount of eﬀort has already been devoted worldwide for problems such as content management in large scale collections, context awareness, multimedia search and retrieval, social graph modelling analysis and mining, etc.
The Twitter Social Network
Twitter is an online social networking service that enables users to send and read short messages of up to 140 characters called “tweets”. Registered users can read and post tweets, but unregistered users can only read them. Users access Twit-ter through the website interface, SMS, or through a mobile device application. Twitter is one of the most popular social networks and micro-blogging service in the world, and according to its website it has more than 640 million users con-nected by 24 billion links. In Twitter, “following” someone means that a user will have in his personal timeline other people’s tweets (Twitter updates). “Followers” are people who receive other people’s Twitter updates. Approximately 99.89% of the Twitter accounts have less than 3; 500 followers and followings. There are approximately 40 million accounts with less than 10 followers and followings, that is between 6% to 7% of all Twitter accounts. It is a social trend to ask followed accounts to follow back in Twitter. There is a limit at 2; 000 followings that starts growing after 1; 800 followers, which is the number of followings set by Twitter to prevent users monitoring too many accounts whereas they have no active role in Twitter. Approximately 40% of accounts have no followers and 25% have no followings. Twitter is interesting to be studied because it allows the information spread between people, groups and advertisers, and since the relation between its users is unidirectional, the information propagation within the network is similar to the way that the information propagates in real life.
Research and Technical Challenges
This section lists the main research challenges in social networks, which are currently being investigated by the research community.
The analysis of relations and communications between members of a com-munity can reveal the most influential users from a social point of view.
As social networks will continue to evolve, the discovery of communities, users’ interests , and the construction of specific social graphs from large scale social networks will continue to be a dynamic research challenge . Research in dynamics and trends in social networks may provide valuable tools for information extraction that may be used for epidemic predictions or recommender systems [3, 4, 5].
The information extracted from social networks proved to be a useful tool towards security. One example of an application related to security is the terrorism analysis, e.g. the analysis of the 9-11 terrorist network . This study was done by gathering public information from major newspapers on the Internet and analyzed it by means of social networks . Therefore, cyber surveillance for critical infrastructure protection is another major research challenge on social network analysis.
Searching in blogs, tweets and other social media is still an open issue since posts are very small in size but numerous, with little contextual informa-tion. Moreover, diﬀerent users have diﬀerent needs when it comes to the consumption of social media. Real time search has to balance between quality, authority, relevance and timeliness of the content .
Crowdsourcing systems gave promising solutions to problems that were unsolved for years. The research community nowadays is working by lever-aging human intelligence to solve critical problems [9, 10], since social net-works contain immense knowledge through their users. However, it is not trivial to extract that knowledge .
Traﬃc prediction based on media consumption may be correlated between groups of users. This information can be used to dimension media servers and network resources to avoid congestion and improve the quality of ex-perience and service.
Content sharing and distribution needs will continue to increase. Mobile phones, digital cameras and other pervasive devices produce huge amounts of data which users want to distribute if possible in real time .
Since users population and data production increase, spam and advertise-ments will continue growing . In addition, the importance of social networks to influence the opinions of the users has to be protected with a mechanism that promotes trustworthy opinions that are relevant to busi-nesses.
As in every human community, online social communities face also criti-cal social and ethical issues that need special care and delicate handling. Protection of personal information and many other problems need special attention .
In order to address these challenges, we need to extract the relevant infor-mation from online social media in real time.
Problem Statement and Objectives
Topic detection and trend sensing is the problem of automatically detecting topics from large text documents and extracting the main opinion without any human intervention. This problem is of great practical importance given the massive volume of documents available online in news feeds, electronic mail, digital libraries, and social networks.
Text classification is the task of assigning predefined labels or categories to texts or documents. It can provide conceptual views of document collections and has important applications in real world problems. Nowadays, the documents which can be found online are typically organized by categories according to their subject, e.g. topics. Some widespread applications of topic detection and text classification is community detection, traﬃc prediction, dimensioning media consumption, privacy, and spam filtering, as mentioned in Section 1.2.
By performing topic detection on social network communities, we can re-group users in teams and find the most influential ones, which can be used to build specific and strategic plans. Public information in social networks can be extracted by topic detection and classification and used for cyber surveillance in an automatic way in order to avoid the overload. Extracting an opinion from social networks is diﬃcult, because users are writing in a way which does not have correct syntax or grammar and contains many abbreviations. Therefore, mining opinions in social networks can benefit by on automatic topic detection on really short and tremendous posts. By grouping users and adding labels to discussions or communities we are able to find their interests and tag people that share information very often. This information can be used to dimension media servers and network resources to avoid congestion and improve the quality of experience and service. Finally, by performing topic classification we can find similarities between posts of users that spread irrele-vant information into the network and enable a spam filter to defend against that.
In this thesis we present a novel method to perform topic detection, classification and trend sensing in short texts. The importance of the proposed method comes from the fact that up to now, the main methods used for text classification are based on keywords detection and machine learning techniques. By using keywords or bag-of-words in tweets will often fail because of the wrongly or distorted usage of the words – which also needs lists of keywords for every language to be built – or because of implicit references to previous texts or messages. In general, machine learning techniques are heavy and complex and therefore are not good candidates for real time text classification, especially in the case of Twitter where we have natural language and thousands of tweets per second to process. Furthermore machine learning processes have to be manually initiated by tuning parameters, and it is one of the main drawbacks for that kind of application. Some other methods are using information extracted by visiting the specific URLs on the text, which makes them a heavy procedure, since one may have limited or no access to the information, e.g. because of access rights, or data size. In this thesis we are trying to address the discussed challenges and problems of other state-of-the-art methods and propose a method which is not based on keywords, language, grammar or dictionaries, in order to perform topic detection, classification and trend sensing.
Instead of relying on words as most other existing methods which use bag-of-words or n-gram techniques, we introduce Joint Complexity (JC), which is defined as the cardinality of a set of all distinct common factors, subsequences of characters, of two given strings. Each short sequence of text is decomposed in linear time into a memory eﬃcient structure called Suﬃx Tree and by overlapping two trees, in linear or sublinear average time, we obtain the JC defined as the cardinality of factors that are common in both trees. The method has been extensively tested for text generation by Markov sources of finite order for a finite alphabet. The Markovian generation of text gives a good approximation for natural text generation and is a good candidate for language discrimination. One key take-away from this approach is that JC is language-agnostic since we can detect similarities between two texts without being based on grammar and vocabulary. Therefore there is no need to build any specific dictionary or stemming process. JC can also be used to capture a change in topic within a conversation, as well as a change in the style of a specific writer of a text.
On the other hand, the inherent sparsity of the data space motivated us in a natural fashion the use of the recently introduced theory of Compressive Sensing (CS) [15, 16] driven by the problem of target localization . More specifically, the problem of estimating the unknown class of a message is reduced to a problem of recovering a sparse position-indicator vector, with all of its components being zero except for the component corresponding to the unknown class where the message is placed. CS states that signals which are sparse or compressible in a suitable transform basis can be recovered from a highly reduced number of incoherent random projections, in contrast to the traditional methods dominated by the well-established Nyquist-Shannon sampling theory. The method works in conjunction with a Kalman filter to update the states of a dynamical system as a refinement step.
In this thesis we exploit datasets collected by using the Twitter streaming API, getting tweets in various languages and we obtain very promising results when comparing to state-of-the-art methods.
Scope and Plan of the Thesis
In this thesis, a novel method for topic detection, classification and trend sensing in Dynamic Social Networks is proposed and implemented. Such system is able to address the research and technical challenges mentioned in Section 1.2. The structure of this thesis is organized as follows.
First, Chapter 2 overviews the state-of-the-art of topic detection, classifi-cation and trend sensing techniques for online social networks. First, it describes the document–pivot and feature–pivot methods, along with a brief overview of the pre-processing stage of these techniques. Six state-of-the-art methods: LDA, Doc-p, GFeat-p, FPM, SFPM, BNgram are described in details, as they serve as the performance benchmarks to the proposed system.
In Chapter 3, we introduce the Joint Sequence Complexity method. This chapter describes the mathematical concept of the complexity of a sequence, which is defined as the number of distinct subsequences of the given sequence. The analysis of a sequence in subcomponents is done by suﬃx trees, which is a simple, fast, and low complexity method to store and recall subsequences from the memory. We define and use Joint Complexity for evaluating the similarity between sequences generated by diﬀerent sources. Markov models well describe the generation of natural text, and their performance can be predicted via linear algebra, combinatorics, and asymptotic analysis. We exploit Markov sources trained on diﬀerent natural language datasets, for short and long sequences, and perform automated online sequence analysis on information streams in Twitter.
Then, Chapter 4 introduces the Compressive Sensing based classification method. Driven by the methodology of indoor localization, the algorithm converts the classification problem into a signal recovery problem, so that CS theory can be applied. First we employ Joint Complexity to perform topic detection and build signal vectors. Then we apply the theory of CS to perform topic classification by recovering an indicator vector, while reducing significantly the amount of information from tweets. Kalman filter is introduced as a refinement step for the update of the process, and perform users and topics tracking.
Table of contents :
1.1 Dynamic Social Networks
1.1.1 The Twitter Social Network
1.2 Research and Technical Challenges
1.3 Problem Statement and Objectives
1.4 Scope and Plan of the Thesis
2 Background and Related Work
2.2 Document–pivot methods
2.3 Feature–pivot methods
2.4 Related Work
2.4.1 Problem Definition
2.4.2 Data Preprocessing
2.4.3 Latent Dirichlet Allocation
2.4.4 Document–Pivot Topic Detection
2.4.5 Graph–Based Feature–Pivot Topic Detection
2.4.6 Frequent Pattern Mining
2.4.7 Soft Frequent Pattern Mining
2.5 Chapter Summary
3 Joint Sequence Complexity: Introduction and Theory
3.2 Sequence Complexity
3.3 Joint Complexity
3.4 Main Results
3.4.1 Models and Notations
3.4.2 Summary of Main Results
3.5 Proof of Main Results
3.5.1 An important asymptotic equivalence
3.5.2 Functional Equations
3.5.3 Double DePoissonization
3.5.4 Same Markov sources
3.5.5 Different Markov Sources
3.6 Expending Asymptotics and Periodic Terms
3.7 Numerical Experiments in Twitter
3.8 Suffix Trees
3.8.1 Examples of Suffix Trees
3.9 Snow Data Challenge
3.9.1 Topic Detection Method
3.9.3 Keywords Extraction
3.9.4 Media URLs
3.9.5 Evaluation of Topic Detection
3.10 Tweet Classification
3.10.1 Tweet augmentation
3.10.2 Training Phase
3.10.3 Run Phase
3.10.4 Experimental Results on Tweet Classification
3.11 Chapter Summary
4 Text Classification via Compressive Sensing
4.2 Compressive Sensing Theory
4.3 Compressive Sensing Classification
4.3.1 Training Phase
4.3.2 Run Phase
4.4 Tracking via Kalman Filter
4.5 Experimental Results
4.5.1 Classification Performance based on Ground Truth
4.6 Chapter Summary
5 Extension of Joint Complexity and Compressive Sensing
5.2 Indoor Path-Tracking Using Compressive RSS Measurements
5.2.1 Prior Work on RSS-based Path Tracking
5.2.2 CS-based Location Estimation
5.2.3 CS-Kalman Filter
5.2.4 Experimental Results
5.3 Encryption System based on Compressive Sensing Measurements
5.3.1 SecLoc System Description
5.3.2 Possible Attacks from Malicious Users
5.3.3 Experimental Results
5.4 Stealth Encryption based on Eulerian Circuits
5.4.2 Motivation and Algorithm Description
5.4.3 Performance in Markov Models
5.4.4 Experimental Results
5.5 Chapter Summary
6 Conclusions and Perspectives