Location Privacy through reduced Accuracy


State of the Art

In this chapter we introduce some popular notions of privacy and location privacy that are closely related to our work. We start by introducing two notions of privacy for statistical databases, k-anonymity and differential privacy, that preceded location privacy and largely influenced it. We then present a Bayesian approach from the area of Quantitative Information Flow, which can be used as an alternative characterization of the leakage of a privacy mechanism. We then introduce location privacy, describing in more detail the data produced and collected and the existing attacks. Finally we introduce the privacy notions and mechanisms used for location privacy and explain their connection to the techniques introduced for statistical databases.

k-anonymity

In the context of statistical databases, we have a dataset populated with individuals and their personal (possibly sensitive) information. An analyst, or curator, wants to query the database to extract some statistical properties that are of interest for the general population. At the same time we would like to keep the sensitive information of a single participant safe, no matter what query or combination of queries is asked. For example, in a hospital database we would like to learn whether there is a larger incidence of cancer in a particular region, but at the same time we want to hide the fact that a specific individual living in that region has cancer.
A widely used notion of privacy is k-anonymity, first introduced in 1998 by Samarati and Sweeney [SS98] and further developed over the years [Sam01, Swe02]. In order to protect the identities of the users present in the database, which may contain sensitive information about them, explicit identifiers such as name and surname are typically suppressed. However there are several other attributes that could help re-identify users with the help of background knowledge. For example, birth date and address could be present in another dataset together with the name of the user. For this reason these attributes are called quasi-identifiers and need to be sanitized. k-anonymity proposes to generalize or suppress the quasi-identifiers in order to ensure that each entry in the database is part of a group of at least k individuals. In the example above, instead of providing the complete birth date, we could suppress the day of birth, assuming that there are at least k other users born in the same month. The intuition behind this technique is to increase the attacker's uncertainty by increasing the number of possible entries related to a certain user.
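To make the grouping requirement concrete, the following is a minimal sketch (in Python, with a hypothetical table and column names of our choosing) of a check that a generalized table satisfies k-anonymity with respect to its quasi-identifiers.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values appears
    in at least k records of the (already generalized) table."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Hypothetical table: birth dates already generalized to year-month.
records = [
    {"birth": "1980-05", "zip": "750**", "disease": "flu"},
    {"birth": "1980-05", "zip": "750**", "disease": "cancer"},
    {"birth": "1979-11", "zip": "750**", "disease": "flu"},
    {"birth": "1979-11", "zip": "750**", "disease": "flu"},
]
print(is_k_anonymous(records, ["birth", "zip"], k=2))  # True: each group has size 2
```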

l-diversity and t-closeness

k-anonymity should provide a measure of privacy for the user, with a larger k corresponding to higher privacy. However, the size of the set in which the user is hidden does not necessarily capture all that the set can reveal about the user. Imagine a group of k users who all have cancer: in this case, despite the fact that k-anonymity is satisfied, the attacker learns a piece of sensitive information that is uniform in the group. l-diversity [MKGV07] was introduced to overcome this issue by requiring that each group has a certain amount l of diversity among its sensitive attributes. A refinement of this technique was proposed in t-closeness [LLV07], where the distribution of sensitive values in each group is required to be within distance t of the distribution describing the general population.
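The following sketch, again with a hypothetical table, shows the simplest (distinct) form of the l-diversity check: every quasi-identifier group must contain at least l distinct sensitive values, which rules out the uniform group described above.

```python
def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Distinct l-diversity: each quasi-identifier group must contain
    at least l distinct values of the sensitive attribute."""
    groups = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(r[sensitive])
    return all(len(values) >= l for values in groups.values())

# A 2-anonymous table whose second group is uniform in the sensitive value.
records = [
    {"birth": "1980-05", "disease": "flu"},
    {"birth": "1980-05", "disease": "cancer"},
    {"birth": "1979-11", "disease": "flu"},
    {"birth": "1979-11", "disease": "flu"},
]
print(is_l_diverse(records, ["birth"], "disease", l=2))  # False: the uniform group fails
```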

Background knowledge

The major problem with k-anonymity and its derivatives is their weakness with respect to background knowledge, or conjunction attacks. The reason is that even if a specific database has been sanitized to guarantee a certain k, the attacker can use knowledge from another source to reduce his uncertainty and effectively lower k. In other words, the privacy measure k is valid only under strong assumptions on the background knowledge of the attacker.

Differential Privacy

Differential privacy is a property meant to guarantee that participation in a database does not constitute a threat for the privacy of an individual. It was introduced in 2006 by Dwork et al. [Dwo06] in the context of statistical disclosure and found widespread acceptance in several fields. More precisely, the idea is that a (randomized) query satisfies differential privacy if two databases that differ only in the addition of one record are almost indistinguishable with respect to the results of the query. Two databases $D, D' \in \mathcal{D}$ that differ only in the addition of one record are called adjacent, denoted by $D \sim D'$.
Definition 1 (Differential privacy). A randomized function $K : \mathcal{D} \to \mathcal{Z}$ satisfies $\epsilon$-differential privacy if for all pairs $D, D' \in \mathcal{D}$, with $D \sim D'$, and all $Y \subseteq \mathcal{Z}$, we have that:
$$\Pr[K(D) \in Y] \;\leq\; e^{\epsilon}\, \Pr[K(D') \in Y]$$
where $\Pr[E]$ represents the probability of the event $E$.
The definition requires the distributions of reported values to be very "close" in the two scenarios where the user participates or not. How close is measured by the parameter $\epsilon$, which indicates the risk of distinguishing the two cases.
Notice that the adjacency relation can also be expressed as the Hamming metric $d_H$ over databases. In this case the definition should hold for all databases at distance 1.
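Iterating the definition along a path of adjacent databases gives an equivalent metric formulation, the same shape later generalized to $d_\mathcal{X}$-privacy in Section 3.2.1 (this restatement is ours, using the $d_H$ notation above):
$$\Pr[K(D) \in Y] \;\leq\; e^{\epsilon\, d_H(D, D')}\, \Pr[K(D') \in Y] \qquad \text{for all } D, D' \in \mathcal{D},\ Y \subseteq \mathcal{Z}$$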
The typical mechanism used to achieve differential privacy is the Laplace mechanism, discussed in more detail in the next chapter (Section 3.3.1).
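As an illustration of how random noise provides the indistinguishability required by Definition 1, here is a minimal sketch (not the construction used later in the thesis; function names and example data are ours) of the Laplace mechanism applied to a counting query, whose sensitivity is 1.

```python
import numpy as np

def laplace_count(database, predicate, epsilon):
    """epsilon-differentially private counting query.

    Adding or removing one record changes the true count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    true_count = sum(1 for record in database if predicate(record))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical database: how many patients have cancer, with epsilon = 0.1.
db = [{"disease": "cancer"}, {"disease": "flu"}, {"disease": "cancer"}]
print(laplace_count(db, lambda r: r["disease"] == "cancer", epsilon=0.1))
```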

Compositionality and Budget

One of the main advantages of differential privacy is its compositionality. It is easy to prove the privacy provided by the sequential application of several differentially private mechanisms.
Theorem 1 (Sequential Composition [McS09]). Let $K_i$ each provide $\epsilon_i$-differential privacy. Performing all the $K_i$ queries on the database $D$ provides $(\sum_i \epsilon_i)$-differential privacy.
This theorem shows that the $\epsilon_i$ accumulate with the number of accesses to the database. $\epsilon$ is a measure of the risk for the user's privacy, and its growth under continuous access to the database matches the intuition. In the design of a mechanism to deploy in a realistic setting, where the analyst performs several queries over a period of time, this risk accumulation has to be taken into account. Usually we refer to the global sum of all the $\epsilon_i$ as the privacy budget; every access has to be accounted for and a limit has to be set, after which the system halts.
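A minimal sketch of this bookkeeping (the class and its interface are hypothetical, not one of the budget managers of Section 4.3.2) could look as follows.

```python
class PrivacyBudget:
    """Track the total epsilon spent by sequentially composed queries
    (Theorem 1) and halt once the global budget is exhausted."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon_i):
        if self.spent + epsilon_i > self.total:
            raise RuntimeError("privacy budget exhausted: the system halts")
        self.spent += epsilon_i

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.3)   # first query
budget.charge(0.3)   # second query
# budget.charge(0.5) would exceed the budget and raise an error
```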
Having a finite amount of privacy to spend presents two main challenges to the design of a realistic system. First, the budget has to be managed following a strategy in order to achieve a goal. In Section 4.3.2 for example, we describe two possible budget managers to maximize either the number of queries or the accuracy of each query.
Second, the accumulation described in Theorem 1 models a worst-case scenario where only the number of accesses is considered, not the content of the queries. Imagine for example that several analysts independently perform the same query: if we naively run them all, Theorem 1 applies. However, if we run the query only once, cache the result and then return it to the other analysts, only one access needs to be accounted for. A large body of work has been developed to optimize the set of queries so as to minimize the direct accesses to the database. The median mechanism [RR10] is based on the idea of exploiting the correlation between queries to improve the budget usage. The mechanism uses a concept similar to our prediction to determine the answer to the next query using only past answers. An analogous work is the multiplicative weights mechanism [HR10]. The mechanism keeps a parallel version of the database, built from past queries, which is used to predict the next answer and, in case of failure, is updated with a multiplicative weights technique.
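The caching idea in its simplest form can be sketched as follows (the wrapper class and its interface are ours, a drastic simplification of the mechanisms cited above): repeated identical queries are answered from a cache, so only the first one is charged against the budget.

```python
class CachingCurator:
    """Answer repeated identical queries from a cache so that only the
    first occurrence consumes privacy budget."""

    def __init__(self, run_private_query, total_epsilon):
        self.run = run_private_query   # callable: (query, epsilon) -> noisy answer
        self.remaining = total_epsilon
        self.cache = {}

    def answer(self, query, epsilon):
        if query in self.cache:        # repeated query: no budget is spent
            return self.cache[query]
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= epsilon
        self.cache[query] = self.run(query, epsilon)
        return self.cache[query]
```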
In Section 4.2 we equip a privacy mechanism with a prediction function that tries to guess future answers from past reported values; a correct guess comes for free with respect to the privacy budget. A key difference between statistical databases and our context is that in the former, several queries are performed against the same database. In our setting, however, the secret (the position of the user) is always changing and the query is just the identity query asking for the exact location. A similar scenario is explored in [DNPR10], where the authors consider the case of an evolving secret and develop a differentially private counter.


Background Knowledge

Among the advantages of this definition is the independence from the prior distribution on the secrets. There are no assumptions on the database or on the capabilities of the attacker, making differential privacy a very robust property with respect to background knowledge. This is also due to the fact that, contrary to k-anonymity, where the uncertainty of the attacker comes from the size of the groups, in differential privacy the uncertainty is introduced through random noise. Indeed the probabilistic nature of the definition is the key difference with k-anonymity, and it is shared with another line of work developed by [STBH11] in the context of location privacy, which we present in detail starting from the next Section 2.3.

Bayesian Approach

An alternative point of view on privacy is provided by the field of Quantitative Information Flow. The system under analysis is modeled as an information-theoretic channel: the input of the channel represents the secret and the output the observable. This model is adapted from the work of Shannon [Sha48] on communication channels, and indeed a number of works use Shannon's entropy as a measure of the attacker's uncertainty about the secret [Mal07, CHM07, CPP08a].
Imagine for example a password-based authentication system: the system is the channel, the password is the secret input and the observable is either success or failure. At first the entropy of the secret depends only on the size of the password space; each time the attacker tries an incorrect password, this entropy decreases slightly.
The mutual information between the input and the output of the channel is the information leakage caused by the system and the maximum possible leakage is represented by the capacity of the channel. More recently a number of other measures have been introduced, more specific to information hiding: guessing entropy [Mas94], Bayes risk [CPP08b], min-entropy leakage [Smi09], g-leakage [ACPS12].
In general we are interested in measuring the uncertainty that the attacker has about the secret $X$ and especially whether, and by how much, it is reduced after the attacker sees the output of the system $Z$. Independently of the measure of choice, the general principle proposed by Smith in [Smi09] is:
$$\text{Information Leakage} = \text{Initial Uncertainty} - \text{Final Uncertainty}$$
The idea is that, following a Bayesian approach, the attacker updates his information about the secret after observing the output of the system, possibly refining it.

min-entropy

We define two finite sets $\mathcal{X}$ and $\mathcal{Z}$, respectively of secret input values and observable output values. A channel is a triple $(\mathcal{X}, \mathcal{Z}, C)$, where $C$ is a channel matrix, an $|\mathcal{X}| \times |\mathcal{Z}|$ matrix with entries between 0 and 1 and rows that sum to 1. Each element $C[x, z]$ is the conditional probability of obtaining the output $z$ from the input $x$. We define a prior distribution $\pi$ on $\mathcal{X}$, which represents the knowledge of the attacker about the secret before observing the system.
In min-entropy, rather than the information content of a random variable as in Shannon's entropy, we measure the uncertainty as the maximum probability that the attacker has of guessing the secret in one try. We call this the vulnerability of the secret. For example, the vulnerability before running the system is simply the probability of the most likely value in the prior distribution, its most vulnerable secret:
$$V(\pi) = \max_{x \in \mathcal{X}} \pi(x)$$
We can then obtain min-entropy with a simple change of scale to bits: $$H_\infty(\pi) = -\log_2 V(\pi)$$
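Following [Smi09], the same reasoning applies after the observation: the posterior vulnerability averages, over the observables, the attacker's best guess given each output, and min-entropy leakage is the logarithm of the ratio of the two vulnerabilities. The sketch below computes these quantities for a toy channel of our choosing.

```python
import numpy as np

def vulnerability(prior):
    """Prior vulnerability V(pi): probability of the best one-try guess."""
    return max(prior)

def posterior_vulnerability(prior, channel):
    """Posterior vulnerability, following [Smi09]: for each observable z the
    attacker guesses the x maximizing pi(x) * C[x, z], averaged over z."""
    joint = np.array(prior)[:, None] * np.array(channel)   # pi(x) * C[x, z]
    return joint.max(axis=0).sum()

def min_entropy_leakage(prior, channel):
    """Information leakage = initial uncertainty - final uncertainty (in bits)."""
    return np.log2(posterior_vulnerability(prior, channel) / vulnerability(prior))

# Toy channel: two secrets, two observables, output mildly correlated with the secret.
prior = [0.5, 0.5]
channel = [[0.75, 0.25],
           [0.25, 0.75]]
print(min_entropy_leakage(prior, channel))  # about 0.58 bits leaked
```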
However we are interested in enriching this measure with a notion of gain, in order to model different attackers.

g-leakage

For the purposes of this thesis we focus on g-leakage [ACPS12], which extends min-entropy leakage [Smi09] with gain functions, allowing us to model different types of attacker. We could imagine, for example, that in the authentication system an attacker may be interested in learning the credentials of a specific user, or alternatively it may be enough to recover the password of any user. We introduce the concept of a gain function to represent the goal of the attacker; note that in the literature this is sometimes achieved using a loss function, which is simply the complement of a gain function.
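For reference, the prior g-vulnerability of [ACPS12] takes the following form, where $\mathcal{W}$ is a set of allowed guesses and $g : \mathcal{W} \times \mathcal{X} \to [0,1]$ measures the gain of guessing $w$ when the secret is $x$; the vulnerability $V(\pi)$ above is recovered by taking $\mathcal{W} = \mathcal{X}$ and the identity gain function ($g(w,x) = 1$ if $w = x$ and $0$ otherwise):
$$V_g(\pi) = \max_{w \in \mathcal{W}} \sum_{x \in \mathcal{X}} \pi(x)\, g(w, x)$$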

Table of contents:

1 Introduction 
1.1 The key role of privacy research
1.2 Location Privacy
1.3 State of the Art
1.4 Contributions
1.5 Publications
1.6 Plan of the thesis
2 State of the Art 
2.1 k-anonymity
2.1.1 l-diversity and t-closeness
2.1.2 Background knowledge
2.2 Differential Privacy
2.2.1 Compositionality and Budget
2.2.2 Background Knowledge
2.3 Bayesian Approach
2.3.1 min-entropy
2.3.2 g-leakage
2.3.3 Relation with Differential Privacy
2.4 Location Privacy
2.4.1 Location Data
2.4.2 Attacks
2.4.3 k-anonymity
2.4.4 Differential Privacy
2.4.5 Bayesian metrics for Location Privacy
2.4.6 Other metrics
3 Preliminaries 
3.1 Location Privacy through reduced Accuracy
3.2 Privacy Definitions
3.2.1 dX -privacy
3.2.2 Geo-indistinguishability
3.2.3 Distinguishability level and
3.2.4 Traces
3.3 Privacy Mechanisms
3.3.1 Laplace Mechanism
3.3.2 Planar Laplace Mechanism
3.3.3 Exponential Mechanism
3.3.4 Independent Mechanism
3.4 Utility
3.4.1 Expected Error
3.4.2 ()-accuracy
4 Repeated Use over Time 
4.1 Introduction
4.2 A Predictive dX -private Mechanism
4.2.1 Budget management
4.2.2 The mechanism
4.2.3 Privacy
4.2.4 Utility
4.2.5 Skipping the test
4.3 Predictive mechanism for location privacy
4.3.1 Prediction Function
4.3.2 Budget Managers
4.3.3 Configuration of the mechanism
4.3.4 Evaluation
4.3.5 Future Work
4.4 Incorporating fences in the metric
4.4.1 Fences
4.4.2 Mechanism
4.4.3 Future work
4.5 Conclusion
5 Flexible Use over Space 
5.1 Introduction
5.2 An elastic distinguishability metric
5.2.1 Privacy mass
5.2.2 Requirement
5.2.3 Extracting location quality
5.2.4 An efficient algorithm to build elastic metrics
5.3 Evaluation
5.3.1 Metric construction for Paris’ metropolitan area
5.3.2 Evaluation using the Gowalla and Brightkite datasets
5.4 Tiled Mechanism
5.5 Conclusion
6 Location Guard
6.1 A web browser extension
6.2 Desktop and mobile
6.3 Operation
6.4 Adoption
6.5 Future Work
7 Conclusion
