Detection and characterization of impersonators
It is well known that attackers impersonate accounts in social networks. Apart from anecdotal evidence, however, there has been no in-depth characterization of impersonating accounts in today’s social networks. We propose a two-step technique to detect impersonating accounts. The first step returns accounts that portray the same person in a social network. The second step detects which account is the impersonator. Traditional methods to detect fake accounts perform poorly at detecting impersonators. We show that detecting impersonating accounts requires methods that exploit features characterizing pairs of accounts rather than features characterizing single accounts, as has been done so far. We conduct a characterization study of about 5,693 cases of impersonation attacks that we catch on Twitter. We find that impersonation attacks target not only celebrities but also less popular Twitter users. Furthermore, their main goal is to evade Twitter’s fake account detection rather than to use the accounts for social engineering attacks. Our findings reveal a new type of impersonation attack that can negatively impact the online image of any user, not just that of celebrities.
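To make the contrast between single-account and pair features concrete, the following minimal Python sketch mirrors the two steps described above. Everything in it is an illustrative assumption rather than the method used in the thesis: the `Account` fields, the string-similarity threshold, and the three pair features are hypothetical choices.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass
class Account:
    # Hypothetical fields chosen for illustration only.
    account_id: str
    name: str
    created_year: int
    followers: int


def name_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(accounts, threshold=0.9):
    """Step 1 (sketch): pairs of accounts that seem to portray the same person."""
    return [
        (a, b)
        for a, b in combinations(accounts, 2)
        if name_similarity(a.name, b.name) >= threshold
    ]


def pair_features(a: Account, b: Account) -> dict:
    """Step 2 input (sketch): features that describe the pair, not one account."""
    return {
        "name_similarity": name_similarity(a.name, b.name),
        "creation_gap_years": abs(a.created_year - b.created_year),
        "follower_ratio": min(a.followers, b.followers)
        / max(a.followers, b.followers, 1),
    }


accounts = [
    Account("1", "Jane Doe", 2009, 5000),
    Account("2", "Jane Doe", 2013, 40),
    Account("3", "John Smith", 2010, 100),
]
pairs = candidate_pairs(accounts)
features = [pair_features(a, b) for a, b in pairs]
```

A classifier for the second step would take dictionaries like these as input. The point of the sketch is only that each feature is defined on two accounts at once.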
1.5 Organization of the thesis
This dissertation is organized as follows. Chapter 2 presents related work on analyzing social data, privacy and security threats in social networks, and methods to match entities. Chapter 3 presents the account matching framework and the ground truth data for matching accounts. Chapter 4 presents our method to reliably and scalably match accounts. Chapter 5 shows how we can match accounts using only innocuous user activities. Chapter 6 characterizes impersonators and presents ways to automatically detect them. We conclude in Chapter 7.
The problem of matching accounts across social networks is related to problems tackled in different research communities, ranging from security and privacy to databases and data mining. In this chapter, we first review the state of the art on measuring and characterizing the information users leave on different social networks. We then give an overview of the privacy implications of sharing content online. Next, we review the main matching techniques proposed in the database, information retrieval, and data mining communities. Although few studies focus on the particular problem of matching accounts across social networks, these communities have worked on closely related problems such as matching records across databases and anonymizing or de-anonymizing databases. Finally, we review some of the current efforts to detect fake social network identities.
This section reviews work on characterizing the amount of information we can learn about users from their posts and provides the motivation for investigating techniques to match accounts across social networks. We first discuss related work that measured how much and what kind of information users leave online. We then give an overview of studies that showed how this information can be exploited to infer information that is missing or private. Finally, we review studies that measured users’ footprints across different social networks.
Measurement and characterization of social networks
Many studies have analyzed the content, the structure, and the evolution of social networks [7, 10, 11, 18, 65, 96, 98, 107, 109, 112, 129, 190]. We first review studies that measured the type and amount of content users share online and then studies that analyzed the interactions and the graphs of social networks.
Gross and Acquisti were the first to study what kind of information users share on Facebook and what its privacy implications are. Back in 2005, they observed that users willingly provide many kinds of information, ranging from their names, locations and photos to interests (books, music, movies), political views and sexual orientation, and even dates of birth, phone numbers and email addresses. Similarly, Humphreys et al. analyzed the personal information users provide on Twitter. A quarter of tweets include information about where users are or what they are doing, and most of this information is publicly accessible. Lampe et al. studied how the types of profile attributes users provide on Facebook affect the number of friends they have. They found that the presence of profile attributes that help users share common references (e.g., school, employer) is strongly associated with the number of friends. The association is weaker for attributes related to user interests. Almost 10 years later, the well-publicized debate about users’ online privacy has caused users to limit access to some of their personal data online. However, large parts of personal data still remain accessible to the public. So far, most studies have looked at social networks separately and have not considered them in aggregate. In contrast, in Chapter 4, we investigate what kind of information users provide on different social networks and how consistent this information is across social networks. Our study helps us understand the extent to which we can match accounts across social networks and gives insight into the behavior of users on different social networks.
A number of studies have looked at the structure of social network graphs and patterns in users’ activities [10, 18, 98, 129]. More recently, there has been increased interest in augmenting the social graph with user attributes to form a Social Attribute Network (SAN). Such augmented networks are useful for link prediction, attribute inference, community detection and potentially account matching. In this thesis, we only exploit one-hop friendship links to match accounts, without exploiting the full social graph. We leave such a study for future work.
Inferring additional information about users in social networks
Social networking sites allow users to hide parts of their personal profiles from the public; however, some pieces of information always remain public. This mix of private and public information can be exploited to infer private attributes of users. In this section, we first review studies that inferred different kinds of information about users by exploiting friendship links or other kinds of data in social networks, and we then discuss studies that inferred the location of posts and photos.
Several studies exploited the social network graph to infer private information [71, 83, 115, 116, 130]. These studies rely on the homophily assumption that users tend to relate to other users who share similar traits. For example, friends tend to have similar political affiliations, and students tend to have other students as friends. Gayo-Avello showed that we can determine the political orientation, religious affiliation, race, ethnicity and sexual orientation of Twitter users with 95% confidence by exploiting the network neighborhood. Similarly, Backstrom et al. showed that the location of users can be inferred from the location of their friends. Zheleva et al. showed that the inference accuracy can be improved if we consider group membership in addition to network neighbors. Finally, Gong et al. proposed to jointly predict links (friendships between users) and infer attributes in social networks. Combining the two problems increases both the accuracy of link prediction and the accuracy of attribute inference.
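The homophily assumption can be illustrated with a deliberately simple sketch: infer a hidden attribute as the most common value among a user's friends who disclose it. This is not the algorithm of any of the cited studies. The function name `infer_attribute` and the toy data are assumptions made for illustration.

```python
from collections import Counter


def infer_attribute(user, friends, known):
    """Guess a hidden attribute of `user` by majority vote over friends.

    friends: dict mapping a user to a list of friend ids (hypothetical format)
    known:   dict mapping a user to a publicly visible attribute value
    Returns None when no friend discloses the attribute.
    """
    votes = Counter(known[f] for f in friends.get(user, ()) if f in known)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


friends = {"alice": ["bob", "carol", "dave"], "erin": []}
known = {"bob": "Paris", "carol": "Paris", "dave": "Berlin"}

guess_alice = infer_attribute("alice", friends, known)
guess_erin = infer_attribute("erin", friends, known)
```

The more a trait is shared among friends, the better such a vote works, which is why even users who hide an attribute can be exposed by the friends who do not.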
ONLINE PRIVACY 11
Besides the social graph, researchers have also exploited other types of information present inside social networks. Chaabane et al. leveraged interests and likes on Facebook to infer otherwise hidden information about users, such as gender, relationship status, and age. Calandrino et al. exploited the outputs of collaborative filtering to infer customer transactions. Popescu et al. used Flickr tags to find the gender and the home location of users. Lieberman et al. determined the location of Wikipedia users from the articles they edit. Cheng et al. used the content of tweets to determine the city-level location of a user. Finally, Staddon showed that we can learn the hidden connections of a LinkedIn user with a very simple attack using a fake account. More generally, researchers have inferred extra information from publicly available records. Griffith et al. showed that it is possible to infer the mother’s maiden name from public records. Farkas et al. discussed the inference problem across databases. While this body of work is not directly related to our matching method, it shows that, once we know one piece of information about a user, we can often infer more. This explains why matching accounts is appealing to both industry and attackers.
In another line of work, researchers used publicly available information from a social network site and other external sources to infer the location of users’ posts and photos. Hecht et al. derived user locations from tweets using basic machine learning techniques that associate tweets with geotagged articles on Wikipedia. Similarly, Kinsella et al. leveraged tweets with geotags to build language models for specific places; they found that their model can predict country, state, and city with performance similar to IP geolocation, and zip code with much higher accuracy. Crandall et al. located Flickr photos by identifying landmarks via visual, temporal and textual features. In Chapter 5, we will show that we can match accounts across social networks by exploiting the location of the users’ posts. So far, our method only uses posts that have geotags; however, we could potentially extend it to posts without geotags if we can infer their location.
Aggregating users’ data across social networks
Prior work also studied the aggregate footprint users leave across multiple social networks [31, 77]. Irani et al. showed that, on average, users reveal four personal information fields (e.g., names, location, school) in one social network. However, users reveal different attributes on different social networks; thus, if we know their accounts on multiple social networks, we can learn more about them. To create aggregate profiles of users, Pontual et al. proposed to build a crawler that, given a name, collects information from different social networks and sites. While inferring and aggregating information across social networks is appealing for building applications and services, it can also breach the privacy of individual users. We review possible privacy threats in the next section.
In this section, we start with an overview of privacy research in two areas: tracking users online and the privacy of online services. These research areas are not directly related to the privacy implications of sharing content online, but they expose different privacy threats users encounter online. We then focus on the privacy implications of sharing location data, which is related to our work in Chapter 5 that exploits location data to match accounts. Finally, we give an overview of privacy threats caused by matching data about users across different sources.
Online tracking and advertising
Discussions of online tracking and advertising have recently become very popular in both the media and research communities, because users are generally bothered that they have no control over what data companies collect and aggregate about them.
Websites such as lemonde.fr or nytimes.com authorize third-party websites such as Google Analytics or DoubleClick to track their users through cookies. Third-party websites allow first-party websites to easily implement advertising, provide site analytics or integrate with social networks. While third-party websites provide tremendous benefits for first-party websites, they can also severely affect the privacy of users. Third-party sites aggregate the browsing activities of users across unrelated first-party sites to create aggregate browsing profiles for better targeted advertising. Even if the aggregate browsing profiles are not directly linked to the users’ real identities, many users consider them a privacy breach. The creation of aggregate browsing profiles has been criticized by consumer advocates, policymakers and even marketers themselves. Numerous research efforts have measured and analyzed the ecosystem of online advertising and tracking [67, 69, 99–102, 124, 131, 194]. While technology researchers have provided tools to block such tracking, policymakers have proposed laws to limit or disclose tracking. If companies start to massively aggregate and exploit the data users share online, users and policymakers may react in the same way and claim their privacy rights.
Because online advertising supports the free content on the Internet, blocking all tracking will significantly affect the economics of the ecosystem. To overcome this issue, researchers are working on privacy-preserving tracking [49, 68, 179].
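The aggregation step described above can be sketched in a few lines of Python. The sketch assumes a hypothetical log format in which a third party sees one (cookie identifier, first-party site) pair per page view. The host names and the function `build_profiles` are invented for illustration and do not describe any real tracker.

```python
from collections import defaultdict


def build_profiles(requests):
    """Group the first-party sites seen for each third-party cookie id.

    requests: iterable of (cookie_id, first_party_site) pairs, the
    hypothetical view a tracker has of page loads that embed its content.
    """
    profiles = defaultdict(set)
    for cookie_id, first_party_site in requests:
        profiles[cookie_id].add(first_party_site)
    return dict(profiles)


# Toy log: the same cookie shows up on unrelated first-party sites.
requests = [
    ("cookie-A", "news.example"),
    ("cookie-A", "health.example"),
    ("cookie-B", "news.example"),
    ("cookie-A", "travel.example"),
]
profiles = build_profiles(requests)
```

Even though no real identity appears in the log, the profile attached to `cookie-A` already spans news, health and travel sites, which is the kind of aggregate view many users object to.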
Privacy of online services
We look at the privacy of online services in two situations. First, we assume that users trust and use a service (e.g., Facebook, Yelp); however, the trusted service may leak personal information to other untrusted services, either intentionally or unintentionally. Second, we assume users have the ‘big brother syndrome’ and do not trust the service, i.e., they do not feel comfortable giving their private data to service providers such as Facebook or Twitter. Personal information can leak from a trusted party to an untrusted party in different ways: through Wi-Fi networks, from applications (e.g., mobile applications, software, or browsers) and from social networks. We first review studies that measure such leaks and then review studies that propose privacy-preserving services.
Whenever we connect to a Wi-Fi network, we risk leaking private information. Large portions of the network traffic generated by a computer are unencrypted, and someone connected to the same Wi-Fi network can see what information is transmitted.
Consolvo et al.  tried to increase the awareness of this type of leakage by proposing a tool that alerts a user whenever private information such as email address or credit card number leaves the computer unencrypted. Cunche et al.  showed that it is possible to infer that two users live together by exploiting messages containing the SSIDs of users’ preferred wireless networks in the active service discovery phase. To protect the privacy of users in wireless networks, Greenstein et al.  proposed a link layer protocol that obfuscates MAC addresses.
Users can grant mobile applications access to many kinds of private information, ranging from their exact location to the contacts in their address book or their IMEI (International Mobile Station Equipment Identity). After giving permission, users lose control over what the application does with this information. Several studies analyzed the application and network traces generated by smartphones to quantify such leakage [45, 87, 158, 189]. For example, Cox et al. found that half of the applications they tested were sending location information to advertisers. Browsers are also susceptible to information leakage. Studies showed that the browsing history of a user can be sniffed through side channel attacks or caching [79, 187].
Private information can leak inside a social network or from a social network to third parties. Thomas et al. showed that information made private by a user can be inadvertently made public through conflicting privacy policies of users. Krishnamurthy and Willis showed that social networks can leak information such as screen names, IDs and locations to third parties through referral headers or request-URLs [103, 104]. This has important privacy implications, since advertisers could potentially link the anonymous tracking profiles they hold to social network identities. This line of research is complementary to ours and shows alternate ways in which the privacy of users can be compromised.
A few research efforts proposed distributed social networks to avoid giving up private information to companies [8, 14, 23, 40, 42]. While these solutions will protect the personal data of users from big companies, it is still possible for a third party to match the accounts users have on different ‘private’ social networks. Furthermore, creating aggregate profiles of users can potentially be easier, as there is no central service that detects information harvesting.
Privacy of location data
Real-time access to users’ exact location has led to many innovative and useful services. Fine-grained location information has enabled applications that recommend restaurants around the user’s current location, call taxis or map the photos taken during a trip. Sharing location information, however, raises serious privacy concerns. In Chapter 5, we will show that we can match accounts belonging to the same individual by exploiting only the location information that comes with posts and photos. In this section, we review studies that looked at different aspects of location privacy: (1) how to share location data with service providers in a privacy-preserving way; and (2) how to anonymize or de-anonymize mobility traces.
To guarantee privacy while providing a reasonable level of service when sharing location information with sites or applications, researchers have proposed both cryptographic [50, 57] and non-cryptographic [19, 66, 88, 123, 138] techniques. Non-cryptographic techniques include sharing the location at a coarser grain, cloaking, or using a trusted third-party service that provides k-anonymity. For a comprehensive picture, we refer the reader to existing surveys [82, 106, 133, 157]. Finally, Shokri et al. [163, 164] proposed a framework to formally quantify location privacy (in terms of how well an adversary can predict where a user is at a particular moment) in order to compare the accuracy of different privacy-preserving location sharing techniques. While these techniques aim to protect users’ privacy with respect to service providers, such obfuscation techniques could also be used to protect users’ privacy online when publishing posts or photos with geotags. Such obfuscation techniques will not have a strong impact on our matching scheme because we only need coarse-grain location data.
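Two of the non-cryptographic ideas mentioned above, sharing locations at a coarser grain and requiring k-anonymity, can be sketched as follows. This is a simplified illustration under stated assumptions, not any of the cited schemes: the grid size `cell_deg`, the function names, and the toy coordinates are all hypothetical.

```python
import math
from collections import defaultdict


def coarsen(lat, lon, cell_deg=0.1):
    """Map an exact coordinate to the index of a square grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def k_anonymous_cells(reports, k, cell_deg=0.1):
    """Return the grid cells that contain reports from at least k users.

    reports: iterable of (user_id, lat, lon) tuples (hypothetical format).
    """
    users_per_cell = defaultdict(set)
    for user_id, lat, lon in reports:
        users_per_cell[coarsen(lat, lon, cell_deg)].add(user_id)
    return {cell for cell, users in users_per_cell.items() if len(users) >= k}


reports = [
    ("u1", 48.8566, 2.3522),   # two users in the same 0.1-degree cell
    ("u2", 48.8601, 2.3376),
    ("u3", 52.5200, 13.4050),  # a user alone in another cell
]
safe_cells = k_anonymous_cells(reports, k=2)
```

A trusted intermediary could release a location only when its cell is in `safe_cells`. Coarsening keeps the approximate area, which is consistent with the remark above that a matching scheme needing only coarse-grain locations is largely unaffected.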
A different line of research focuses on the privacy implications of publishing anonymized location datasets. Sharing such datasets is tremendously useful because it allows researchers to study human mobility and behavior. Consequently, many researchers have gathered such datasets and made them public. Usually, the mobility of a user is represented as a series of location-time pairs. The granularity of the location can be either cell level or exact latitude and longitude, depending on how the data is collected. Researchers have shown that we can use these datasets to build a location profile for each user in the form of a random walk or a Markovian model, and that we can use this location profile to de-anonymize mobility traces very easily with the help of some auxiliary location data [43, 121]. The techniques proposed to de-anonymize mobility traces do not apply in our case. Mobility traces contain thousands of location-time pairs, while the median number of such pairs on Twitter, for example, is less than 10. The lack of data makes it impossible to model location profiles with random walks or Markovian models. Consequently, identifying location profiles that correspond to the same user in social networks is more challenging than de-anonymizing mobility traces. Other studies showed that we can learn the home and work address of a user from these mobility traces [51, 105]. The home and work location of a user can then be matched with census databases to find the real identity of the user. Our matching scheme that exploits location data is based on a similar concept: we identify the main locations from which a user posts and we match them across social networks. Closer to our work is the study of Zang et al., who studied the k-anonymity of the top locations from which users make phone calls. They found that, at zip code level, the top three locations are unique for most people. Our results in Chapter 5 confirm their findings for the locations from which users post. Zang et al. did not further explore how these top three locations could be matched with social identities. Finally, Srivatsa et al. explored how mobility traces can be de-anonymized by correlating their contact graph with the friendship graph of a social network. Rather than correlating the location profiles of users in mobility traces with location profiles of users in social networks, Srivatsa et al. chose to exploit mobility traces in a different way, by creating a contact graph. We cannot use contact graphs to match accounts because we only have the locations where users post, which are a small subset of the locations a user passes by.
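The idea of top locations as a quasi-identifier can be made concrete with a short sketch. It computes, for each user, the set of the n most frequent locations and then the number of users sharing that exact set (the anonymity set size). The data layout, the function names, and the toy zip codes are assumptions for illustration and are not taken from Zang et al. or from Chapter 5.

```python
from collections import Counter


def top_locations(visits, n=3):
    """Return the set of the n most frequent locations in a list of visits."""
    return frozenset(loc for loc, _ in Counter(visits).most_common(n))


def anonymity_set_sizes(visits_by_user, n=3):
    """For each user, count how many users share the same top-n location set."""
    fingerprints = {u: top_locations(v, n) for u, v in visits_by_user.items()}
    sizes = Counter(fingerprints.values())
    return {u: sizes[fp] for u, fp in fingerprints.items()}


# Toy data: zip codes observed for three hypothetical users.
visits_by_user = {
    "a": ["10001"] * 3 + ["10002"] * 2 + ["10003"],
    "b": ["10001"] * 4 + ["10002"] * 3 + ["10003"] * 2,
    "c": ["94105"] * 2 + ["94107"],
}
sizes = anonymity_set_sizes(visits_by_user)
```

A user whose anonymity set has size 1 is uniquely identified by the top locations alone. In this toy data that is user `c`, while `a` and `b` hide each other.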
Privacy implications of matching and exploiting users’ data
Prior work has investigated the privacy implications of exploiting users’ data. We first review work that exposed the privacy implications of matching data across different data sources (e.g., public records, health records) and across different social networks. We then discuss work that showed different privacy and security implications of sharing content online.
Researchers realized long ago that matching data across different data sources can pose privacy problems. For example, matching different health databases could lead to the conclusion that a celebrity has a contagious disease. Traditionally, data matching has been used for “connecting the dots”, i.e., for identifying terrorist threats. While this has clear benefits, falsely matched pairs of records might have severe privacy implications. If individuals are falsely identified as being involved in a crime or terrorist attack, both their lives and their credit worthiness could suffer. Furthermore, criminals can match records to collect enough identifying data to commit identity fraud [89, 143]. In this context, researchers raised a series of concerns about the privacy of public administrative and medical records [47, 55, 113, 165].
Focusing on social network data, Friedland et al. [52, 54] showed that a malicious user can use data that is publicly available on the Internet to mount social engineering attacks (cybercasing). They showed that it is technically possible to use seemingly innocuous information to create correlation chains that reveal much more about individuals than they realize. A possible attack could first identify, on Craigslist, photos of valuable objects that have geotags attached to them. From the geotags we can infer the address of the owner, while on Facebook we can detect when the owner is on vacation. An attacker could use this online information to mount a real-world robbery. Similar techniques can be used for economic profiling, espionage targeting, cyberframing or cyberstalking.
Several websites highlight the privacy risks of sharing data on social networks. Sleeptime.org estimates sleep patterns of Twitter users from their posts. Stolencamerafinder.co.uk crawls for digital camera serial numbers in online photos in order to find pictures taken with stolen cameras. Icanstalku.com publishes geotags found in tweets, and pleaserobme.com uses status updates from social networks to locate users who are currently not at home but have published their home address. The cree.py application uses geolocation data from social networks and media hosting services to track a person’s movements.
Finally, researchers showed that personal data can help to crack passwords. Irani et al. suggested using personal data collected across multiple social networks to gather answers to password recovery questions. Castelluccia et al. showed that personal data can also be used to reduce the number of attempts needed for brute force password cracking. In addition, Jagatic et al. showed that phishing attacks have a significantly higher success rate when they consider the victim’s social context.
Table of contents:
1.1 Personal data sharing in social networks
1.2 Current techniques to match accounts
1.3 Privacy and security threats in social networks
1.4.1 Reliable and scalable account matching across social networks
1.4.2 Account matching by exploiting innocuous user activities
1.4.3 Detection and characterization of impersonators
1.5 Organization of the thesis
2 State of the Art
2.1 Social networks
2.1.1 Measurement and characterization of social networks
2.1.2 Inferring additional information about users in social networks
2.1.3 Aggregating users’ data across social networks
2.2 Online privacy
2.2.1 Online tracking and advertising
2.2.2 Privacy of online services
2.2.3 Privacy of location data
2.2.4 Privacy implications of matching and exploiting users’ data
2.3 Matching entities across data sources
2.3.1 Entity resolution
2.3.2 Anonymization and de-anonymization of user identities
2.4 Matching accounts across social networks
2.4.1 Matching accounts using private user data
2.4.2 Matching accounts using public user data
2.5 Security threats in social networks
3 Account Matching Framework
3.1 The matching problem: definition
3.2 The matching problem: challenges
3.3 The matching problem: scalability and reliability
3.3.1 The ACID test for scalable and reliable matching
3.4 Ground truth data
3.5 Evaluation method
3.5.1 Evaluation metrics
3.5.2 Evaluation over a small dataset
3.5.3 Evaluation at scale
3.5.4 Evaluation against human workers
4 Matching Accounts using Public Profile Attributes
4.1 ACID test of public attributes
4.1.1 Attribute availability
4.1.2 Attribute consistency
4.1.3 Attribute discriminability
4.1.4 Attribute impersonability
4.2 One-step matching scheme
4.2.1 Design of the scheme
4.2.2 The Linker
4.2.3 Evaluation over a small dataset
4.2.4 Evaluation at scale
4.2.5 Evaluation against human workers
4.3 Three-step matching scheme
4.3.1 Design of the scheme
4.3.2 The Filter
4.3.3 The Disambiguator
4.3.4 The Guard
4.3.5 Evaluation at scale
4.3.6 Evaluation against human workers
4.4 Testing the reliability of the three-step matching scheme
4.4.1 Reliability in the absence of a matching account
4.4.2 Reliability to impersonation
4.5 Matching in the wild
5 Matching Accounts using Public Innocuous Information
5.1 Features of innocuous activity
5.2 Location fingerprint
5.2.1 Building the fingerprint
5.2.2 Similarity metrics
5.2.3 Evaluation at scale
5.3 Timing fingerprint
5.4 Language fingerprint
5.5 Combining features
5.5.2 Evaluation at scale
5.5.3 Comparison with screen name matching
6 Detection and Characterization of Impersonators
6.1 A framework to detect impersonating accounts
6.1.1 Problem definition and approach
6.1.3 Features
6.2 Detection of accounts that portray the same person
6.2.1 Naive approach
6.2.2 Single-site matching scheme
6.2.3 Evaluation using AMT workers
6.3 Detection of impersonating accounts
6.3.1 Ground truth
6.3.2 Methods to detect impersonating accounts
6.3.3 Evaluation over unlabeled pairs of accounts
6.4 Detecting impersonating accounts using humans
6.5 Characterization of impersonation attacks
7 Conclusions and Perspectives
7.1 Summary of contributions
7.2 Future work
7.2.1 Improving matching schemes
7.2.2 Applications of matching
7.2.3 Protecting user privacy