The Geometry of Hyper-Distributions


Differential privacy and its applications

Differential privacy is now arguably the gold standard for data privacy, primarily due to its mathematical properties (it behaves well under composition), its robustness to background knowledge (its guarantee holds for all priors) and its simplicity of implementation (statistical noise can be added independently of the value of a data element) [7]. These properties set it apart from anonymisation-based definitions, which are susceptible to adversarial attacks when combined with other anonymised datasets [8]. The Netflix data breach describes one such attack; in this case, researchers made use of publicly available datasets which were correlated with the anonymised Netflix data. These correlations allowed researchers – with high confidence – to identify individuals and their sensitive data in the Netflix dataset, even though the dataset itself contained no obviously identifying information. Differential privacy claims immunity to such attacks [7].
An important but understated property of differential privacy is its ability to promise the release of useful information while maintaining some notion of privacy, a property that its early proponents explored [7]. This property sets differential privacy apart from privacy definitions designed to limit leakage. Dwork et al. [7] argued that leakage is an essential property of any useful data release, and therefore defining privacy as “disclosure limitation” – as desired by Dalenius – unnecessarily limits all leakages, even ones which may not constitute a privacy breach at all. As an example, differential privacy would say that a dataset which teaches that smoking causes cancer is not breaching the privacy of individuals in it, if it could later be inferred that one of them has cancer, since this information would have been learned regardless of whether the individual had participated in the study or not.
Instead, differential privacy proposes a notion of “indistinguishability”, in which similar inputs produce similar noisy outputs, creating some uncertainty in an attacker’s ability to guess the secret with confidence, while still permitting the release of statistically useful information. In this way, differential privacy finds an in-between, one which has been accepted by the community as the ‘best available’ compromise.
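Concretely, the guarantee can be stated as the standard inequality (presumably the one referred to as (1.1) later in this chapter): a mechanism M is epsilon-differentially private if, for every pair of adjacent datasets D, D' (differing in the record of a single individual) and every set S of possible outputs,

    Pr[ M(D) ∈ S ]  ≤  e^ε · Pr[ M(D') ∈ S ] .

The smaller epsilon is, the closer the two output distributions must be, and hence the more plausibly any individual can deny the contents of their own record.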
Because of its dominance and widespread acceptance in the literature, researchers in other areas of privacy have been tempted to adopt differential privacy in their own domains. Differential privacy has been applied to domains as diverse as images [9, 10], text documents [11, 12] and shipping trajectories [13]. However, this diversity comes at a cost – differential privacy was not designed with these sorts of domains in mind, and its definition is couched in terminology specific to statistical datasets. Re-interpreting the ‘plausible deniability’ style guarantee for datasets in which the secrets are not individuals in rows but some more complex data is not trivial: in an image, how does one protect the individual whose image is displayed? In a document, how would one protect the writing style of the author to prevent re-identification? In these examples, what exactly is plausibly denied by the application of differential privacy? How is privacy achieved and how might it be breached? And how might one ensure that some useful information is retained in the data release?

Metric differential privacy

In 2012, researchers working in geo-location privacy introduced the notion of ‘metric differential privacy’ in order to enable differential privacy-style reasoning in their domain [14, 15]. Metric differential privacy provides a generalisation (and re-interpretation) of differential privacy for arbitrary domains endowed with a metric. The researchers’ insight was to notice that some of the main properties of differential privacy rely on the metric properties of the Hamming distance between datasets; and that this metric can be safely changed without breaking the strong properties that differential privacy offers (robustness to priors, compositionality etc). Plausible deniability can be re-interpreted as ‘privacy within a radius’, permitting differential privacy to be applied to new (metric) domains. Their example application to geo-location privacy – the problem of hiding users’ precise locations while preserving their approximate locations – provided a blueprint for how to re-interpret “adjacency of secrets” and the meaning of epsilon for a novel domain, as well as a new method for reasoning about utility when noise is added directly to the secret values themselves, rather than to the output of a “query” as is done in the statistical database setting.
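In its usual formulation (with epsilon sometimes absorbed into the metric itself), a mechanism M satisfies metric differential privacy on a domain X equipped with a metric d if, for all inputs x, x' in X and every set S of outputs,

    Pr[ M(x) ∈ S ]  ≤  e^(ε·d(x, x')) · Pr[ M(x') ∈ S ] .

Standard differential privacy is recovered when d is (a scaling of) the Hamming distance between datasets, while geo-indistinguishability takes d to be the Euclidean distance between locations, so that points within a small radius of one another remain nearly indistinguishable.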
Metrics are used in a surprising array of privacy domains, from spatio-temporal trajectories to social network graphs, recommender systems, text documents and images. However, despite its potential, metric differential privacy has found few applications beyond the geo-location privacy scenario for which it was designed. Privacy practitioners in new domains often resort to the reasoning designed for individuals in statistical databases, even when this corresponds neither to the privacy issue at hand nor to the utility desired from the data release (see for example [13] and [11]). This suggests that privacy practitioners are still ill-equipped to reason about privacy and utility in domains which do not match the “blueprints” provided by either standard differential privacy (for statistical datasets) or geo-location privacy.

Key themes in the practice of differential privacy

Since its introduction in 2006, researchers have been stretching differential privacy in various ways to respond to the needs of different domains. While theoretical research into privacy has thrown up an avalanche of new definitions [16], privacy practitioners are left with many questions: which definition should I use for my dataset? How do these definitions compare? How do I interpret what privacy means? And – surprisingly, still – how do I choose epsilon?

Privacy breaches as inference attacks

The separation between the theory and practice of differential privacy is perhaps most notable in the way that experimental researchers investigate privacy breaches: they are usually described as inference attacks [17–21]. Inference attacks describe real adversarial threats to a system, providing an operational scenario against which privacy mechanisms can be examined and (hopefully) shown to defend. For example, privacy researchers have found that the list of apps on a user’s smartphone can reveal sensitive attributes such as religion or relationship status [22] and can even be used to identify particular individuals [23]. The need for privacy-preserving techniques in these areas is precisely to protect against these sorts of inferences. Plausible deniability does not have this goal: inferences describe the potential for re-identification; plausible deniability simply ensures that there is no certainty of it. Differential privacy is careful not to guarantee protection against inference attacks – and indeed, is known to be susceptible to them [24] – despite this being a major concern for practitioners. Even if plausible deniability is the best we can hope for, reasoning about vulnerability to inferences is essential for communicating what privacy means and what sort of privacy is guaranteed for individuals whose data is entrusted to a curator. While there has been some work on incorporating inference attack models into the differential privacy guarantee, the lack of significant research into the relationship between inferences and differential privacy means that there is still debate about what sort of protection differential privacy affords in practical adversarial scenarios [26].
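To make the distinction concrete, the following minimal sketch (in Python, with purely illustrative numbers) models an inference attack as a Bayesian update: an attacker who knows the mechanism's channel and holds a prior over the secrets revises that belief after seeing an output.

import numpy as np

# Sketch only: the attacker knows the channel matrix C (rows: secrets,
# columns: observations) and a prior over secrets, and applies Bayes' rule.
def posterior(prior, C, y):
    """Attacker's posterior over secrets after observing output index y."""
    joint = prior * C[:, y]        # P(x) * P(y | x) for each secret x
    return joint / joint.sum()     # normalise to obtain P(x | y)

# Example: two secrets and a randomised-response style channel with
# epsilon = ln 3, so the true value is reported with probability 3/4.
prior = np.array([0.5, 0.5])
C = np.array([[0.75, 0.25],
              [0.25, 0.75]])
print(posterior(prior, C, y=0))    # approximately [0.75, 0.25]

With the uniform prior above, observing output 0 raises the attacker's belief in the first secret from 0.5 to 0.75: plausible deniability survives, since the posterior is not certain, yet an inference has clearly been made – exactly the kind of reasoning that the differential privacy guarantee does not, by itself, address.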

How to choose epsilon

In the theoretical literature on differential privacy, the epsilon parameter in (1.1) is a privacy control; it describes both the “indistinguishability” level between secrets, and also the “privacy budget” available for subsequent data releases. However, in practical scenarios epsilon is usually seen as a parameter for tweaking utility without associating any (specific) meaning to the resulting privacy guarantee [27]. Since epsilon is not meaningful to an end user, privacy practitioners can more easily brush aside concerns about its particular value. Indeed, Apple were lambasted for their cavalier use of differential privacy to provide some notional privacy to end users, without revealing that their chosen epsilon in fact provided next-to-no privacy. Even in situations for which differential privacy was explicitly designed, the question of how to choose epsilon remains problematic [27–29]. Notably, this was an issue in the deployment of differential privacy for the 2020 US Census; researchers remarked that the “literature of differential privacy is very sparse” regarding policies for choosing a value for epsilon [29]. In more complex domains, the relationship between epsilon and the privacy afforded is potentially very tenuous: in a text privacy scenario, how would the value of epsilon correspond to the privacy-protection afforded to the author of a document? And how does changing epsilon correspond to changes in the useful information released? A clearer understanding of the relationship between epsilon and the privacy-utility balance is necessary to ensure the proper use of differential privacy in new and complex domains.
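As a rough illustration of what a particular value of epsilon buys (a hedged sketch, not tied to any specific deployment), consider binary randomised response, in which an epsilon-differentially private report tells the truth with probability e^ε/(e^ε + 1):

import math

# Sketch only: the probability that binary randomised response reveals
# the true answer, as a function of epsilon.
def prob_truthful(eps):
    return math.exp(eps) / (math.exp(eps) + 1)

for eps in [0.1, 0.5, 1.0, 2.0, 8.0]:
    print(f"epsilon = {eps:4.1f} -> true answer revealed with probability {prob_truthful(eps):.4f}")

At epsilon = 0.1 the report is barely better than a coin flip (about 0.52), while at epsilon = 8 the true answer is revealed with probability above 0.999 – formally still “differentially private”, but with next-to-no plausible deniability.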

Managing the privacy-utility balance

Some of the early adopters of differential privacy were researchers in machine learning, hoping to ensure privacy-protection for individuals whose data was used to train machine learning models. One of the research questions which arose out of this context was where to add the noise. Researchers found that they could squeeze more utility from a dataset by the judicious application of noise to different parts of the machine learning workflow [30, 31]. Despite these early insights, the methods for determining where to add noise in a workflow remain ad hoc and largely experimental.
In more complex domains, the question of how to add the noise also becomes relevant. For example, in the text privacy domain, the addition of noise to words in a document might seem like a reasonable approach – but how should the noise be added to guarantee some privacy and utility? In geo-location privacy and standard differential privacy, there is a clear relationship between privacy and utility which determines the nature of the noise-adding mechanism that should be applied. In text privacy, by contrast, the relationship between privacy and utility is unclear, which complicates the decision of how to add noise.
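The following sketch (illustrative values and a deliberately simple sum query, not a recipe from the literature) shows why the placement of noise matters: spending the same epsilon on output perturbation (the central model) versus input perturbation (the local model) yields very different utility.

import numpy as np

rng = np.random.default_rng(0)

# Sketch only: n values in [0, 1]; the analyst wants their sum.
n, eps = 1000, 1.0
data = rng.random(n)
true_sum = data.sum()

# Central model: noise is added once, to the query output.
# The sum has sensitivity 1, so Laplace noise of scale 1/eps suffices.
central = true_sum + rng.laplace(scale=1.0 / eps)

# Local model: noise is added to every record before aggregation,
# each record receiving its own Laplace noise of scale 1/eps.
local = (data + rng.laplace(scale=1.0 / eps, size=n)).sum()

print(f"true sum      : {true_sum:8.2f}")
print(f"central model : {central:8.2f}   (error on the order of 1/eps)")
print(f"local model   : {local:8.2f}   (error on the order of sqrt(n)/eps)")

For the same epsilon, the centrally perturbed release has expected error on the order of 1/ε, whereas perturbing every record inflates the error to roughly √n/ε; the placement of noise is therefore itself a design decision with direct consequences for utility.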
These questions underlie the need for a more principled approach to understanding the nature of the privacy-utility balance and how to compare systems with respect to their privacy and utility properties.


Reasoning about utility

A related issue is the lack of tools for reasoning about the utility of a differentially private data release, thus requiring practitioners to resort to manipulating epsilon to an experimentally determined value – a practice neatly documented by Dwork et al. [28] in a survey on the use of epsilon across a range of businesses. This sort of problem is not restricted to privacy practitioners working in new domains. In 2020, the US Census Bureau chose to apply differential privacy to their statistical data release mechanisms – precisely the domain for which differential privacy was designed.
Even in this scenario, the practitioners reported fundamental difficulties, particularly around the design of mechanisms for utility:
“Differential privacy lacks a well-developed theory for measuring the relative impact of added noise on the utility of different data products, tuning equity trade-offs, and presenting the impact of such decisions.” [29]
If utility is indeed the sole purpose of a data release, then the absence of a sound methodology for reasoning about utility in differentially private data releases remains a critical weakness. And differential privacy’s claim of independence from the particulars of a dataset is a moot point if utility can only be evaluated through dataset-specific experimentation.

Goals of this thesis

The aforementioned concerns constitute foundational issues which act as obstacles to the more widespread practical application of differential privacy. This motivates a more general study of privacy and utility for new contexts, as well as a sound methodology for applying differential privacy in complex scenarios.
The goal of this thesis is to address these fundamental questions: How do I design mechanisms which provide some level of privacy and some level of accuracy for a domain in which the privacy or the utility requirements are not well understood? How do I decide how and where to add noise in order to achieve my privacy and utility goals? How do I interpret what privacy means in my domain? And how safe is my system against adversarial threats?
While the above concerns are not new – indeed many works in the literature have made contributions to our understanding of these areas – our goal is to approach this study under a unifying framework for reasoning about privacy and utility and to situate our work in the metric differential privacy context. In this way, we aim to equip privacy practitioners with the tools to reason about privacy and utility, and thereby to assist in the application of differential privacy to novel domains.

Contributions

In this thesis we examine some of the foundational questions around privacy and utility posed in the previous section with the overall goal of advancing the understanding and adoption of differential privacy in new domains. In particular, this work makes the following contributions:
We use the framework of Quantitative Information Flow (QIF) to examine questions around privacy and utility in metric differential privacy.
We explore the relationship between the epsilon parameter of metric differential privacy mechanisms and the safety of systems against adversarial threats. We show how the metric on the domain of inputs relates to how privacy is applied, which in turn affects the types of attacks that the system can defend against.
We model inference attacks against differential privacy and show that differential privacy’s protection against such attacks is dependent on the particulars of the dataset. We show that where the noise is added in a privacy workflow determines how the trade-off between privacy and utility can be managed, and what the setting of epsilon means in these scenarios.
We introduce a framework for reasoning about optimality based on a Bayesian model of utility. We show how the metric determines the classes of consumers for which optimal mechanisms exist. We use our model to re-evaluate impossibility results on optimality, discovering a new characterisation of optimal mechanisms and demonstrating that optimal mechanisms do exist contra existing results in the literature. Finally, we prove a new result on the universal optimality of the Laplace mechanism for continuous domains.
We present three new applications of metric differential privacy to problems of interest. We show why an understanding of the metrics involved is important for determining the type of privacy to apply. We also show how to reason about privacy and utility in each situation using the understandings developed in this thesis.

Table of contents:

1 Overview 
1.1 Background to differential privacy
1.1.1 Differential privacy and its applications
1.1.2 Metric differential privacy
1.2 Key themes in the practice of differential privacy
1.2.1 Privacy breaches as inference attacks
1.2.2 How to choose epsilon
1.2.3 Managing the privacy-utility balance
1.2.4 Reasoning about utility
1.3 Goals of this thesis
1.3.1 Contributions
1.3.2 Synopsis
1.3.3 Publications
I Foundations 
2 Quantitative Information Flow 
2.1 Modelling Systems as Channels
2.2 Modelling Adversaries
2.2.1 Vulnerability and Gain Functions
2.2.2 Posterior Vulnerability
2.2.3 Leakage
2.3 Comparing Channels: Refinement
2.4 Hyper-Distributions
2.4.1 The Geometry of Hyper-Distributions
2.4.2 Refinement of Hypers
2.5 Chapter Notes
3 Metric Differential Privacy 
3.1 Definition and Properties
3.1.1 Technical Preliminaries
3.1.2 Definition of d-Privacy
3.1.3 Properties of Metric Differential Privacy
3.2 Geo-Indistinguishability – A Canonical Example
3.2.1 Utility Goals
3.3 Reasoning with Metrics
3.3.1 Oblivious Mechanisms
3.3.2 Local Differential Privacy
3.3.3 Reasoning about Privacy and Utility
3.4 Constructing d-Private Mechanisms
3.4.1 Other Metrics
3.5 Discussion and Concluding Remarks
3.6 Chapter Notes
II Analysis 
4 Comparing Privacy Mechanisms 
4.1 Motivating Examples
4.1.1 Technical Preliminaries
4.2 Average-Case Refinement
4.3 Max-Case Refinement
4.4 Privacy-Based Refinement
4.4.1 Comparing mechanisms by their “smallest ε” (for fixed d)
4.4.2 Privacy-based leakage and refinement orders
4.4.3 Application to oblivious mechanisms
4.4.4 Privacy as max-case capacity
4.4.5 Revisiting the Data Processing Inequality for d-Privacy
4.5 Verifying the Refinement Orders
4.5.1 Average-case refinement
4.5.2 Max-case refinement
4.5.3 Privacy-based refinement
4.6 Lattice properties
4.6.1 Average-case refinement
4.6.2 Max-case refinement
4.6.3 Privacy-based refinement
4.6.4 Constructing d-private mechanisms for any metric d
4.7 Comparing d-Privacy Mechanisms
4.7.1 Technical Preliminaries
4.7.2 Refinement order within families of mechanisms
4.7.3 Refinement order between families of mechanisms
4.7.4 Asymptotic behaviour
4.8 Conclusion
4.9 Chapter Notes
5 Inference Attacks 
5.1 Motivation
5.2 Informal example: does it matter how to randomise?
5.3 Modelling Inferences
5.3.1 Quantitative Information Flow
5.3.2 Reasoning about Inferences
5.4 The Privacy-Utility Trade Off
5.5 Application to Differential Privacy
5.5.1 Utility-Focussed Privacy
5.5.2 Secrecy-Focussed Privacy
5.6 Experimental Results
5.7 Concluding Remarks
5.8 Chapter Notes
6 Optimality I: Discrete Mechanisms 
6.1 Introduction
6.1.1 Motivation and Contribution
6.2 Modelling Utility
6.2.1 Universal optimality
6.2.2 The privacy context
6.2.3 Relationship to “counting” and “sum” queries
6.2.4 Our results on optimality
6.3 The Geometry of d-Privacy
6.3.1 The relationship between channels and hypers
6.3.2 The space of d-private hypers
6.3.3 Kernel and Vertex Mechanisms
6.4 Characterising Optimal Mechanisms
6.4.1 Existence of universally optimal mechanisms
6.4.2 Characterising universal L-optimality
6.5 Reasoning about Optimality
6.5.1 Loss Function Algebra
6.5.2 Classes of Loss Functions
6.5.3 Properties of Loss Function Classes
6.5.4 Identifying Optimal Mechanisms
6.6 Application to Oblivious Mechanisms
6.7 Optimality for d2-Private Mechanisms on {0…N}
6.7.1 Kernel mechanisms
6.7.2 The geometric mechanism
6.7.3 Other optimal mechanisms
6.8 Optimality for dD-Private Mechanisms
6.8.1 Kernel mechanisms
6.8.2 The randomised response mechanism
6.8.3 Impossibility of universal optimality
6.8.4 Reframing the impossibility result for monotonic loss functions
6.9 Optimality for Other Metric Spaces
6.10 Conclusion
6.11 Chapter Notes
7 Optimality II: Continuous Mechanisms 
7.1 Introduction
7.2 Preliminaries: loss functions; hypers; refinement
7.2.1 The relevance of hyper-distributions
7.2.2 Refinement of hypers and mechanisms
7.3 Discrete and continuous optimality
7.3.1 The optimality result for the geometric mechanism
7.3.2 The geometric mechanism is never εd-private on dense continuous inputs, e.g. when d on X is Euclidean
7.3.3 Our result — continuous optimality
7.3.4 An outline of the proof
7.4 Measures on continuous X and Y
7.4.1 Measures via probability density functions
7.4.2 Continuous mechanisms over continuous priors
7.4.3 The truncated Laplace mechanism
7.5 Approximating Continuity for X
7.5.1 Connecting continuous and discrete
7.5.2 Approximations of continuous priors
7.5.3 N-step mechanisms and loss functions
7.5.4 Approximating continuous εd-private mechanisms
7.6 The Laplace and Geometric mechanisms
7.6.1 The stability of optimality for geometric mechanisms
7.6.2 Bounding the loss for continuous mechanisms
7.6.3 Refinement between N-Geometric and T,N-Laplace mechanisms
7.6.4 The Laplace approximates the Geometric
7.6.5 Approximating monotonic functions
7.7 Universal optimality for the Laplace Mechanism
7.8 Conclusion and Future Work
7.9 Chapter Notes
III Applications 
8 Statistical Utility for Local Differential Privacy 
8.1 Scenario: Estimating a Distribution
8.2 Comparing Mechanisms for Privacy
8.3 Comparing Mechanisms for Utility
8.3.1 Reconstructing the Original Distribution
8.3.2 Maximum Likelihood Estimation with IBU
8.3.3 Measuring Utility
8.4 Experimental Results
8.5 Conclusion
8.6 Chapter Notes
9 Text Document Privacy 
9.1 Introduction
9.2 Documents, Topic Classification and Earth Moving
9.2.1 Word embeddings
9.3 Privacy, Utility and the Earth Mover’s Metric
9.3.1 Utility and Earth Moving
9.3.2 Implementing Earth Mover’s Privacy
9.3.3 Application to Text Documents
9.3.4 Properties of Earth Mover’s Privacy
9.4 Earth Mover’s Privacy for bags of vectors in R^n
9.4.1 Earth Mover’s Privacy in BR^n
9.4.2 Utility Bounds
9.5 Text Document Privacy
9.5.1 Privacy as Indistinguishability
9.5.2 Privacy as Protection from Inferences
9.6 Experimental Results
9.7 Conclusions
9.8 Chapter Notes
10 Locality Sensitive Hashing with Differential Privacy 
10.1 Introduction
10.1.1 Friend-Matching: Problem Setup
10.1.2 Our contributions
10.2 Locality Sensitive Hashing (LSH)
10.2.1 Random-Projection-Based Hashing
10.2.2 Privacy with LSH
10.2.3 Is LSH private?
10.3 LSH-based Privacy Mechanisms
10.3.1 Privacy Measures
10.3.2 Construction of LSHRR
10.3.3 Construction of LapLSH
10.4 Analyses of the Privacy Mechanisms
10.4.1 LSHRR: Privacy Guarantees for Hash Values
10.4.2 LSHRR: Privacy Guarantees for Inputs
10.4.3 LapLSH: Privacy Guarantees
10.5 Experimental Evaluation
10.5.1 Experimental Setup
10.5.2 Results
10.5.3 Discussion
10.6 Conclusion
10.7 Chapter Notes
11 Conclusions and Future Work 
References
Appendices
