
## Attacks on Rich Graphs

The possible re-identification methods on a graph are as diverse as the graph representations themselves. Depending on the complexity of the dataset, the data to be anonymized can be represented as a simple graph, a directed graph, a multigraph, a hypergraph, etc. Recent papers show how re-identification can be performed using minimal knowledge about the victims, especially when the anonymized graph structure is rich.

Li and Shen [2011a] introduce an attack based on hyperedge rank and propose an anonymization method that tackles the hyperedge rank anonymization problem. In Li and Shen [2010], the weights present in a graph are used for identity disclosure (both the weights of the edges adjacent to a vertex and their sum). A weighted graph carries much more information than a simple graph, making anonymization even harder; Li and Shen [2010] propose a solution to this weight anonymization problem.
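
The weight-based disclosure can be sketched as follows (a minimal illustration, not the authors' algorithm; all names and data are hypothetical): a vertex's "fingerprint" is the sorted multiset of its adjacent edge weights together with their sum, and the adversary searches the anonymized graph for matching vertices.

```python
def weight_fingerprint(graph, v):
    """graph: dict mapping vertex -> {neighbor: weight}."""
    weights = sorted(graph[v].values())
    return tuple(weights), sum(weights)

def reidentify(anon_graph, known_fingerprint):
    """Return the anonymized vertices matching the adversary's
    background knowledge about the victim's edge weights."""
    return [v for v in anon_graph
            if weight_fingerprint(anon_graph, v) == known_fingerprint]

# The victim is uniquely re-identified when exactly one vertex matches.
anon = {
    "n1": {"n2": 2.0, "n3": 5.0},
    "n2": {"n1": 2.0},
    "n3": {"n1": 5.0},
}
victim = weight_fingerprint(anon, "n1")   # ((2.0, 5.0), 7.0)
print(reidentify(anon, victim))           # -> ['n1']
```

Weight anonymization must therefore ensure that several vertices share each fingerprint.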

### Dynamic Data Based Attacks

Bhagat et al. [2010] and Tai et al. [2011b] address the special case of dynamic data. Indeed, when two instances of a network taken at times t1 and t2 are published, an adversary holding some knowledge about the evolution of the network between the two releases is able to re-identify nodes in the anonymized graph data.

As an example described in Tai et al. [2011b], Figure 2.3 illustrates two sequential releases of a network of patients in a hospital. Knowing that Bob went to the hospital for the first time between the two releases, the adversary is able to re-identify Bob in the anonymized data.
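
The core of this attack can be sketched in a few lines (an illustrative simplification, not the method of Tai et al. [2011b]; node names are hypothetical): an adversary who knows the victim joined between the two releases intersects the releases and keeps only the newly appeared nodes.

```python
def new_node_candidates(release1_nodes, release2_nodes):
    """Nodes present only in the second release: the candidate set
    for a victim known to have joined between the two releases."""
    return sorted(set(release2_nodes) - set(release1_nodes))

v1 = {"p1", "p2", "p3"}          # first anonymized release
v2 = {"p1", "p2", "p3", "p4"}    # second anonymized release
print(new_node_candidates(v1, v2))  # -> ['p4']
```

When the candidate set is a singleton, as here, the victim is uniquely re-identified despite both releases being individually anonymized.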

#### Learning Missing Information in Partially Anonymized Graph

This type of attack targets released networks containing only partial information. Indeed, in most social networks some profiles are public and others are private, and the network could be released with only the private profiles hidden. Zheleva and Getoor [2009] describe a method to infer these sensitive attributes and to disclose hidden group memberships in a partially anonymized network. The disclosure mainly relies on learning methods and classification models to determine the missing labels in the partially anonymized network.
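
A minimal relational classifier in this spirit (a toy stand-in for the models of Zheleva and Getoor [2009]; names and labels are illustrative) predicts a private profile's hidden attribute as the majority label among its public friends:

```python
from collections import Counter

def infer_private_label(friends_of, public_labels, private_node):
    """Majority vote over the labels of the node's public friends."""
    neighbor_labels = [public_labels[f] for f in friends_of[private_node]
                       if f in public_labels]
    if not neighbor_labels:
        return None
    return Counter(neighbor_labels).most_common(1)[0][0]

friends = {"alice": ["bob", "carol", "dave"]}
public = {"bob": "smoker", "carol": "smoker", "dave": "non-smoker"}
print(infer_private_label(friends, public, "alice"))  # -> 'smoker'
```

Even this trivial baseline shows why hiding a profile is not enough when its neighborhood stays public.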

**Graph Spectral Properties**

Another way to quantify the loss in data utility is to evaluate the changes in the spectrum of the graph. Ying and Wu [2008] show the close link between the spectrum of a graph and its structural characteristics.

The spectrum of a graph is defined as the set of eigenvalues of its adjacency matrix or of a related matrix (such as the Laplacian). Ying and Wu [2008] describe an anonymization method based on edge additions, deletions, or switches, guided by the spectrum of the modified graph: the idea is to add, delete, or switch edges while modifying the associated spectrum as little as possible.
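
A heavily simplified sketch of such spectrum-guided perturbation (restricted to edge additions and to the largest adjacency eigenvalue, which is only one of the quantities Ying and Wu [2008] consider): among candidate edits, apply the one that moves the spectrum the least.

```python
import numpy as np

def largest_eigenvalue(adj):
    return max(np.linalg.eigvalsh(adj))

def best_edge_addition(adj, candidates):
    """Pick the candidate edge whose addition least changes
    the largest eigenvalue of the adjacency matrix."""
    base = largest_eigenvalue(adj)
    def spectral_shift(edge):
        i, j = edge
        perturbed = adj.copy()
        perturbed[i, j] = perturbed[j, i] = 1
        return abs(largest_eigenvalue(perturbed) - base)
    return min(candidates, key=spectral_shift)

# 4-node path graph 0-1-2-3; choose between adding edge (0, 2) or (0, 3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(best_edge_addition(A, [(0, 2), (0, 3)]))  # -> (0, 3)
```

Closing the path into a cycle, edge (0, 3), shifts the largest eigenvalue less than creating a triangle with edge (0, 2), so it is the preferred perturbation under this criterion.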

In Ying and Wu [2009], the main idea is to generate a new synthetic graph matching a list of properties of the real social network, among which are properties correlated with the graph spectrum.

**Network Queries Aggregation**

In the case of network query aggregation, the utility loss is measured as the difference between the answers to a set of queries on the original graph G and on the anonymized graph G′.

In Cormode et al. [2008], the evaluation method consists in defining a set of query types (SQL-like queries) and comparing the answers before and after anonymization on a bipartite graph. The queries are grouped into three categories:

1. Queries on the graph structure only (if the anonymized graph is isomorphic to the original graph, the answers to these queries are perfectly accurate).

2. Attribute predicates on one side only, computing aggregates over the nodes satisfying certain properties.

3. Attribute predicates on both sides, computing aggregates over the nodes on both sides and the relations between them that satisfy the required properties.
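
The evaluation step can be sketched as follows (a hypothetical category-2 example; the data, predicate, and loss measure are illustrative, not those of Cormode et al. [2008]): run the same aggregate query on G and G′ and take the difference of the answers as the utility loss.

```python
def count_matching_nodes(nodes, predicate):
    """Category-2 style query: count nodes satisfying a predicate."""
    return sum(1 for attrs in nodes.values() if predicate(attrs))

def utility_loss(answer_g, answer_g_prime):
    return abs(answer_g - answer_g_prime)

g = {"a": {"age": 34}, "b": {"age": 61}, "c": {"age": 45}}
g_prime = {"a": {"age": 30}, "b": {"age": 60}, "c": {"age": 38}}  # perturbed
over_40 = lambda attrs: attrs["age"] > 40
print(utility_loss(count_matching_nodes(g, over_40),
                   count_matching_nodes(g_prime, over_40)))  # -> 1
```

Here the perturbation pushed one node below the threshold, so this query loses one unit of accuracy while a structure-only query would be unaffected.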

**Machine Learning used for Data De-Anonymization**

Machine learning techniques have mainly been used in the past for de-anonymization purposes. Learning missing information in partially anonymized graphs has been described by Zheleva and Getoor [2009]: the disclosure relies on learning methods and classification models to determine the missing labels in the partially anonymized graph.

In the work of Sharad and Danezis [2014], machine learning (a decision forest) is used to evaluate anonymization techniques. The anonymization technique is modeled as a black box, and the de-anonymization process is cast as a learning problem. The paper proposes to replace manual de-anonymization techniques by a generic learning algorithm, trained on examples of node pairs that should normally be unlinkable. The conclusion is that good de-anonymization attacks can be produced even by an automated algorithm based on machine learning.
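
A simplified stand-in for this attack (replacing the decision forest with a nearest-match rule; graphs and names are illustrative): describe each node by a structural feature vector (its degree plus its sorted neighbor degrees) and link an original node to the anonymized node with the closest features.

```python
def features(adj, v, width=4):
    """Structural feature vector: degree plus top neighbor degrees,
    zero-padded to a fixed length."""
    neigh_degs = sorted((len(adj[u]) for u in adj[v]), reverse=True)
    vec = [len(adj[v])] + neigh_degs[:width]
    return vec + [0] * (1 + width - len(vec))

def link(orig_adj, anon_adj, v):
    """Guess which anonymized node corresponds to original node v."""
    target = features(orig_adj, v)
    def dist(u):
        return sum((a - b) ** 2 for a, b in zip(target, features(anon_adj, u)))
    return min(anon_adj, key=dist)

orig = {"bob": ["x", "y", "z"], "x": ["bob"], "y": ["bob"], "z": ["bob"]}
anon = {"n1": ["n2", "n3", "n4"], "n2": ["n1"], "n3": ["n1"], "n4": ["n1"]}
print(link(orig, anon, "bob"))  # -> 'n1'
```

The actual attack learns which feature differences indicate a match rather than hand-picking a distance, which is what makes it generic across anonymization schemes.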

**Machine Learning used for Data Anonymization**

Machine learning techniques have also been used for data anonymization. In Szarvas et al. [2007], machine learning is applied to medical record anonymization: the main idea is to use machine-learned named entity recognition on semi-structured documents in order to identify the personal health information to be anonymized.
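
As a toy illustration of the idea only (Szarvas et al. [2007] use trained recognizers, not hand-written patterns; the record text and patterns here are invented), even a pair of regular expressions shows the redaction step that follows entity recognition:

```python
import re

# Hypothetical recognizers for two entity classes in a medical record.
PATTERNS = {
    "DATE": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{4}\b"),
}

def redact(text):
    """Replace each recognized entity with its class label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Admitted on 12/03/2019, contact 555-0199."
print(redact(record))  # -> 'Admitted on [DATE], contact [PHONE].'
```

The learned recognizer takes the place of `PATTERNS`, generalizing to entities that no fixed pattern can enumerate.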

**Exploring the Privacy-Utility Tradeoff**

The anonymization problem can be modeled as the search for a tradeoff between privacy and utility, and machine learning techniques have been used to find or approximate this tradeoff. Li and Li [2009] propose a framework that takes the privacy-utility tradeoff into consideration, using concepts similar to those evaluating the risk-return tradeoff in financial investment. Privacy is treated as an individual concept and is measured separately for every individual, while utility is treated as an aggregated concept and is measured cumulatively.

In Vinterbo [2004], the privacy problem for health data is formalized as an optimization problem balancing privacy and data utility requirements; minimizing information loss while satisfying a set of privacy requirements is proved to be NP-hard. A theoretical analysis of the relationship between data utility for machine learning and disclosure control by generalization is provided. Mivule and Turner [2013] propose to use KNN classification accuracy as an indicator of an optimal balance between privacy and utility.
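
The KNN-as-indicator idea can be sketched with a plain 1-NN classifier (the data, the noise perturbation, and the split are illustrative, not from Mivule and Turner [2013]): if accuracy on the anonymized data stays close to accuracy on the original, utility is considered preserved for this task.

```python
def one_nn_predict(train, query):
    """1-NN on one-dimensional records (value, label)."""
    return min(train, key=lambda rec: (rec[0] - query) ** 2)[1]

def accuracy(train, test):
    hits = sum(one_nn_predict(train, x) == label for x, label in test)
    return hits / len(test)

original = [(1.0, "a"), (2.0, "a"), (8.0, "b"), (9.0, "b")]
anonymized = [(1.5, "a"), (2.5, "a"), (7.5, "b"), (8.5, "b")]  # noise added
test = [(1.2, "a"), (8.8, "b")]
print(accuracy(original, test), accuracy(anonymized, test))  # -> 1.0 1.0
```

Here the perturbation leaves classification accuracy unchanged, so by this indicator the anonymization preserved utility; a larger noise level would degrade the second score and signal the tradeoff.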

**Table of contents:**

List of Figures

**1 Introduction**

1.1 General Context

1.2 Graph Data Anonymization Issue

1.3 Contributions

1.4 Detailed Content

**2 State of the Art**

2.1 Introduction

2.2 Privacy Protection for Graph Data

2.2.1 What to protect?

2.2.2 De-anonymization Techniques

2.3 Anonymization Techniques for Graphs

2.3.1 Structural Modification Based on k-Anonymity Concept

2.3.2 Randomization Techniques

2.3.3 Generalization Techniques

2.4 Utility Loss Evaluation

2.4.1 Graph Topological Properties

2.4.2 Graph Spectral Properties

2.4.3 Network Queries Aggregation

2.5 Complex Graphs Anonymization

2.5.1 Hypergraphs

2.5.2 Temporal Graphs

2.6 Differential Privacy for Graph Data Anonymization

2.6.1 Principle

2.6.2 Differential Privacy for Data Release

2.7 Machine Learning in Data Anonymization Process

2.7.1 Machine Learning used for Data De-Anonymization

2.7.2 Machine Learning used for Data Anonymization

2.7.3 Exploring the Privacy-Utility Tradeoff

2.8 Conclusion

**3 Temporal Graphs Anonymization Issue**

3.1 Introduction

3.2 Graphs Risk for De-Anonymization based on Subgraphs Partitioning

3.3 Anonymization by Data Partitioning

3.4 System Architecture

3.5 Conclusion

**4 Anonymization Methodology Based on Machine Learning**

4.1 Introduction

4.2 Methodology

4.3 Notations and Definitions

4.3.1 Anonymization Function

4.3.2 Utility Loss

4.3.3 Privacy Risk

4.4 Optimization Problem: Balance between Utility Loss and Privacy Risk

4.5 Summary

4.6 Optimization Methods

4.6.1 Estimation of Distribution Algorithm

4.6.2 Genetic Algorithms

4.7 Conclusion

**5 Simple Graphs Anonymization**

5.1 Introduction

5.2 Optimization Problem for Simple Graphs

5.3 Anonymization Method

5.4 Privacy Risks

5.4.1 k-Degree Anonymity

5.4.2 k-Neighborhood Anonymity

5.5 Utility Loss Evaluation

5.5.1 Clustering Coefficient Based Utility Loss (CC)

5.5.2 Page Rank Based Utility Loss (PR)

5.5.3 Two-Hop Neighborhood Based Utility Loss (THN)

5.6 Experiments

5.6.1 Baseline (BL)

5.6.2 Datasets

5.6.3 Results

5.7 Conclusion

**6 Adaptive Temporal Graphs Anonymization for Data Publishing**

6.1 Introduction

6.2 Machine Learning for Call Detail Records Anonymization

6.2.1 Notations and Definitions

6.2.2 Learning Problem

6.3 Anonymization Method

6.4 Privacy Risks in Call Logs

6.4.1 Privacy Attack by Communication Sequence Generation (CSG)

6.4.2 Privacy Attack by Neighborhood Degree Distribution (NDD)

6.5 Utility Loss Evaluation

6.5.1 Changes Performed by the Anonymization Algorithm in the Graph (CHG)

6.5.2 Query Based Measures: Call Distribution Distance (CDD)

6.5.3 Graph Topological Properties: Vertices In/Out Degrees (DE)

6.6 Experiments

6.6.1 Baseline: Random data perturbation

6.6.2 Datasets

6.6.3 Results

6.7 Conclusion

**7 Conclusion**

7.1 Contributions

7.2 Perspectives

**Bibliography**