Link disclosure strategy with certainty
We have designed an on-line link disclosure attack strategy (with certainty). The proposed strategy is passive: the adversary does not have to interact with his target. Our attack is performed on (1,2 or 1)-lookahead social network and has been tested on-line with volunteer profiles. This is in contrast to oﬀ-line link inference (with uncertainty) attacks proposed in [Wang et al., 2015b, Gao et al., 2015] and the active link disclosure attack in [Jin et al., 2013] performed on 1-lookahead social network (Linkedin) that discloses friendship through mutual friend query but was not tested on-line. Furthermore, by eﬀectively exploring the target group network, our proposed attack strategy is able to perform group-membership, friendship and mutual-friend attacks along a strategy that minimizes the number of queries. The results of attacks performed on active Facebook profiles show that 5 diﬀerent friendship links are disclosed in average for each single legitimate query in the best cases. Let us note that we can apply these on-line link disclosure attacks (with certainty) to prepare attribute inference attacks (with uncertainty) in the tool described below.
A tool to prevent attribute inference attacks
To increase user awareness about privacy threats we have designed a tool, SONSAI, for Facebook users to audit their own profiles. The system first crawls the network around the user while performing link disclosure attack. Then it predicts the values of sensitive attributes using a machine learning engine. The results provided by SONSAI, quantify the correlation between attributes and shows the public attributes of the user that have oriented the learning algorithm towards a particular sensitive attribute value. The tool is fully interfaced for Facebook, however it can be adapted to many other social networks. The system has been tested by several volunteer users for auditing their Facebook profiles. In each case a dataset was built from real profiles collected in the user neighbourhood network.
The whole analysis process in SONSAI does not exceed 5 minutes when analysing the target local network (containing a few hundreds of profiles and attributes). We notice that inference accuracy (measured by the Area Under the Curve “AUC”) changes according to the sensitive attribute. It is about 0.83 for gender inference, 0.7 for relationship status inference and 0.79 for politicians inference.
Analysing responses and defining sensitive subjects
In this section, we analyse 18 diﬀerent subjects discussed on social media. We classify these subjects according to four criteria: discussion on both forums and social networks, avoidance and anonymity. Moreover, we identify the favourite topics of Net surfers by gender and level of study. Besides, we propose a formal definition of sensitive subjects and we identify the sensitive subjects among those studied in our survey. Finally, we define a sensitivity coeﬃcient to classify subjects by sensitivity level.
Analysing the survey responses
We have analysed the social media discussions of 18 various subjects studied in the question-naire to identify Net surfers’ favourite subjects by gender and level of study. We classify these subjects according to four criteria: (i) rate of discussion of subjects on social networks, (ii) rate of discussion of subjects on forums and websites, (iii) rate of anonymity of publication and (iv) rate of avoidance of subjects.
Rate of discussion of subjects on social networks. Table 2.3 classifies the various subjects proposed in question 35 (see Annex B) in increasing order of discussion according to the type and level of study. Subjects that have global chat frequencies below the average frequency minus the standard deviation are: “Money, Shopping Religion and Dating”.
Men’s favourite subjects are “News and Technologies”. As for women, they discuss more about “News, Going out and Travel”. Men talk about “Money, Religion, Technology and Dating” twice as much as women. The latter talk about “Fashion” twice as much as men.
Net surfers who have been to graduate school are more inclined to discuss “News and Going out”. The subjects “Family, Travel and Going out” are the most discussed by Internet users who have not completed higher education. They talk about “Money and Shopping” eight times as much and “TV shows” twice as much as those with higher education. On the other hand, the latter discuss “Politics and Technology” twice more and “Studies” thrice more than those who have not followed a higher education.
Rate of discussion of subjects on forums and websites. Table 2.4 ranks the forums and websites proposed in Question 12 (see Annex B) in increasing order of activity according to the gender and level of study. The forums and websites that have global activities rates below the average rate minus the standard deviation are “Going out, Dating and Chat” and “Philosophy, Religion and Free Thinking” sites and forums.
Men prefer forums and sites about “Computer Science and Technology”. Conversely, women prefer forums and sites about“Health, Shopping, Kitchen, House and Tip” and “Travel, Transport, Holidays, Insurance”. Although the overall activity on forums and sites of “Philosophy, Religion, Free Thinking” does not exceed 27%, 30.86% of men are active on these sites.
Participant Net surfers who have not received higher education are more present on the forums and sites of “Health, Shopping, Kitchen, House and Tip”, “Travel, Transport, Holidays and Insurance” and “Games, Music, Movie, Humour, Art and Book”. The forums and favourite sites of participant Net surfers who have received higher education are “Computer Science and Technology” and “Health, Shopping, Kitchen, House and Tip”. Even if the forums and sites of “Going out, Dating and Chat” are globally less visited, the activity of participant Net surfers who have not followed higher studies on these sites exceeds the threshold of the average of all activities minus the standard deviation.
Rate of anonymity of publication. We distinguish three types of activity on forums and websites: search without publication, anonymous publications and non-anonymous publications. Table 2.5 details the rates for each type of activity on the forums and websites proposed in Question 12 (see Annex B).
In this section, we use a normalization function to adjust the sets of values for sensitive subjects calculated in Tables 2.3, 2.4, 2.5 and 2.6. This function converts values to a common standard to make them comparable. It refers to the data transformation by dividing each value by the mean of the set.
Matrix M (6 4) summarizes the calculated percentages in the previous section for the sensitive subjects (see Matrix 2.6). The lines represent the sensitive subjects and the columns represent the four sensitivity criteria presented in the previous section. The normalization func-tion is defined by Equation 2.7, where xij denotes the value on the line i and the column j. Matrix M0 summarizes the normalized values of M (see Matrix 2.8).
Possible attack vectors according to the behaviour of partic-ipants
In this section we discuss several scenarios of attack and information leakage. Each scenario exploits a vulnerability. Vulnerabilities are detected through the responses to the questionnaire.
Reversed parental monitoring
The responses of participant parents Net surfers show that only 1.44% of them do not know whether their children use social networks. Questions 30 and 31 focus on parental monitoring on social networks (see Annex B). These questions are only displayed to participants over 34-year-old and whose children use social networks. 71.05% of parents say that their children can view their publications against only 51.3% who can view their children’s publication. In addition 5.26% of parents do not know if their publications are visible to their children. These results show that social networks have eﬀective parental monitoring tools, but they are also within the reach of children to monitor their parents. About 76% of parents are vulnerable to the scenario detailed by Figure 2.6: Alice and Bob are friends on a social network; Alice has published a photo featuring Bob; Bob made a dirty joke about the photo in a comment; but he does not want his children to become aware of this photo or read his comment, and he does not know if they can do it. Although Alice is aware that her children can see all her publications, she did not realize Bob’s inappropriate comment that is visible to her daughter too.
Modelling social network for on-line link disclosure attacks
A social network can be defined as a website that allows users to create personal pages in order to share information with their friends and acquaintances. These pages are usually called profiles and contain personal information. Profiles are connected to each other through friendship links that can be either symmetric or asymmetric, depending on the network’s policy.
In order to mimic real (i.e., non-cybernetical) societal interactions, some social networks such as Facebook, Linkedin and Viadeo support the creation of groups besides the profile creation. Accordingly, social networks can be modelled by two types of graphs –(i) friendship graph and (ii) group membership graph– as depicted by Figure 3.1.
The friendship graph (a) is unipartite and models the friendship links between users while membership graph (b) is bipartite and models the membership links between users and groups. Some of these links can be masked by users or group administrators. We call a friendship (resp. membership) attack a sequence of actions (e.g., queries) leading to disclose a masked friendship (resp. membership) link. Both kind of attacks are called link disclosure attacks. A mutual-friend attack discloses common friends to a target and other users. We call group uncovering attack a sequence of queries that disclose the membership network of the target and his acquaintances. In this work the attacker is limited to the usage of legitimate and minimal queries provided by the social networks APIs. Therefore the attacker model can be viewed as a passive one. We believe that these constraints are the cornerstones of successful real-world attacks that are diﬃcult to detect because the traﬃc appears to be legitimate at first.
In [Dougnon et al., 2015] researchers propose a Partial Graph Profile Inference (PGPI) algo-rithm that exploit group memberships to infer profiles attributes. In [Zheleva and Getoor, 2009], relational learning approaches and group memberships are used to infer sensitive attribute of users such as locations.
Problematics and objectives
Problematics. In on-line attacks, the attacker is constrained by the network dynamicity and the time needed to scrap it. In fact, the dynamical network structure, with the addition/deletion of new links and nodes will ensure that the sampled graph does not reflect a real on-line social network at any given time. Therefore, crawling tasks for on-line attacks must be highly selective to collect only useful profiles and information and be as fast as possible.
For instance, [Elkabani and Khachfeh, 2015] show that homophilic attributes have signifi-cant influence on predicting friendship between users of Facebook. Thus, an attacker may be tempted to sweep the network for similar profiles to his target. He can also consider the friends of the target friends as potential friends and check these links. Although these general solu-tions may seem eﬀective to gather many potential friends, they have major shortcomings. To understand these shortcomings let us recall the “six degrees of separation” phenomenon, that is the possibility to connect any two people in a maximum of six relationship steps. For example, the authors of [Bakhshandeh et al., 2011] show that the average degree of separation between Twitter users is 3:43 while the degree of separation on Facebook is between 2:9 and 4:2 for the majority of users [Bhagat et al., 2016]. Hence, considering friends of friends as potential friends is equivalent to considering at least tens of thousands users as potential friends for each single target [Ugander et al., 2011]. This is clearly impossible to handle and scale for real-world eﬃcient attacks.
Objectives. Link disclosure attacks in on-line social networks aim to disclose hidden links by performing authorized requests.The attacks either reveal existing links or potential ones according to the employed method. We aim to disclose numerous links without having to verify a huge number of potential friends. In other words, we attempt to gather many potential friends but only those who have high probability to be friend with the target. The best way to achieve our objectives is to disclose the vicinity network of the target. To that end, we analyse groups’ properties on on-line social networks since they reflect the way users are gathering within a network and uncover its structure. To keep our discussion simple, we aim to answer two questions in this work: Which groups leak useful information to meet previously detailed objectives? How to find and use them?
Social networks group properties
In this section we analyse some properties of Facebook groups. This analysis will guide crawling tasks in order to collect only data that leak more information about the target. Exploiting such data will increase the accuracy of link disclosure attacks and maximize the number of disclosed links. We stress that all experiments in this work were carried out on-line with real Facebook profiles. We have crawled 1,100 Facebook groups and all their members. Then, we have sorted the groups by declared size in sets. Each set contains at least 30 groups. Each group in the first set S0 gathers between 2 and 10 members and each group in the set Si gathers between 10i and 10(i + 1) members.
Table of contents :
Chapter 1 Introduction
1.1 Research context
1.2 Structure of social networks
1.3 Research on social network privacy analysis
1.4 Motivations and challenges of this work
1.5 Contributions of the thesis
Chapter 2 Defining sensitive subjects
2.2 Conducting a survey on the behaviour of French users of social media
2.3 Analysing responses and defining sensitive subjects
2.4 Possible attack vectors according to the behaviour of participants
Chapter 3 Disclosing friendship and group membership links
3.2 Modelling social network for on-line link disclosure attacks
3.3 Problematics and objectives
3.4 Social networks group properties
3.5 Link disclosure attacks
Chapter 4 Overview of our implemented prediction system
4.3 SONSAI user guide
4.4 Examples of inference scenarios
Chapter 5 Sampling and modelling social networks
5.3 Sampling social network around a target user profile
5.4 Modelling discovered links and nodes by graphs
5.5 Anonymizing the social network graph models
Chapter 6 Cleansing the collected data
6.3 Motivations for cleansing data
6.4 Cleansing the sensitive graph
6.5 Cleansing the learning graphs
6.6 Cleansing results
Chapter 7 Analysing cleansed data and inferring target sensitive values
7.2 Translating social attributed network to a text document
7.3 Applying Word2Vec to compute node embeddings
7.4 Inferring hidden sensitive values of the target user profile
7.5 Measuring inference accuracy
7.6 Experiments and results
Chapter 8 Conclusions and perspectives
Appendix A Datasets
A.1 Dataset 1 (D1)
A.2 Dataset 2 (D2)
Appendix B Questionnaire