DIFFERENTIAL PRIVACY and some of its relatives
BETTER DEFINITIONS
or: Why should I answer your survey?
THE PROBLEM
- Statistical databases provide some utility,
- ...often at odds with privacy concerns.
- Utility for whom?
- Government, researchers, health authorities (Dwork: Tier 1)
- Commercial: movie recommendations, ad targeting (Tier 2)
- How do we protect privacy but maintain utility?
ONLINE METHODS
- Bart showed us offline methods, preparing data beforehand,
and some of the problems. Instead, consider the online case:
- Data curators host data and control access to it.
- We assume the curator is trusted, as is the confidentiality of
the raw data itself.
- If the data contains private information, how does a curator
keep it private?
BLEND IN THE CROWD
- Suggestion: Only answer
questions about large datasets.
- Fails:
?> How many Swedish PhD students like Barbie? => 834
?> How many Swedish PhD students, who are not Icelandic or do not study Computer Security, like Barbie? => 833
Willard likes Barbie!
(because, obviously, I do not!)
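The failure above can be reproduced in a few lines. A minimal sketch with hypothetical data (names and counts taken from the slide); refusing small result sets does not help, because two large queries can differ in exactly one person:

```python
# Hypothetical data matching the slide: 834 students like Barbie, and
# Willard is the only one who is both Icelandic and studies Computer Security.
records = (
    [{"name": "Willard", "icelandic": True, "security": True, "likes_barbie": True}]
    + [{"name": f"s{i}", "icelandic": False, "security": False, "likes_barbie": True}
       for i in range(833)]
)

def count(pred):
    """Answer a counting query, but refuse 'small' result sets."""
    n = sum(1 for r in records if pred(r))
    if n < 100:
        raise ValueError("query refused: result set too small")
    return n

# Both queries are large, so both are answered...
broad = count(lambda r: r["likes_barbie"])                       # 834
narrow = count(lambda r: r["likes_barbie"]
               and not (r["icelandic"] and r["security"]))       # 833
# ...yet their difference pinpoints one individual.
willard_likes_barbie = (broad - narrow == 1)
```

The threshold (100) is arbitrary: any cut-off fails the same way, since the attack only needs two answerable queries whose populations differ by one person.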
RANDOMIZED RESPONSE
- Suggestion: Fuzz the answers so that
individual differences are lost in the noise.
- Not easy, but on the right track!
- E.g. averaging many answers to the same query cancels the
noise, and detecting equivalent queries may be undecidable.
A LITTLE REMINDER
- Linkage attacks particularly tricky to prevent: The AOL
debacle, Netflix Prize and Sweeney’s use of voting records.
- Dalenius’ desideratum & Dwork’s “impossibility proof”:
Known fact: Height(Peter) = Average Swedish height + 10
WHAT IS PRIVACY ANYWAYS
- Dalenius: Any private information that can be deduced from a
database can just as well be deduced without any access to it at all.
- Great notion of privacy preserving, but impossible (unless we
sacrifice utility)
- Sweeney’s k-anonymity: Each value of a QID appears at least k
times.
- Easy to check or enforce, but what does it actually mean?
Ad-hoc and hurts utility (e.g. correlations)
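The "easy to check" part is genuinely easy. A minimal sketch of a k-anonymity checker over hypothetical generalized records:

```python
from collections import Counter

def is_k_anonymous(rows, qid_columns, k):
    """Check that every combination of quasi-identifier (QID) values
    occurring in the table occurs at least k times."""
    counts = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return all(n >= k for n in counts.values())

# Hypothetical generalized records (zip codes and ages already coarsened):
rows = [
    {"zip": "213**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "213**", "age": "20-29", "diagnosis": "cold"},
    {"zip": "214**", "age": "30-39", "diagnosis": "flu"},
]
ok = is_k_anonymous(rows, ["zip", "age"], k=2)  # False: ("214**", "30-39") is unique
```

The check says nothing about what an adversary actually learns: all k rows in a group may share the same sensitive value, and the generalization needed to pass the check destroys correlations.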
TWO INSIGHTS
- Dalenius’ definition is good,
but doesn’t make sense unless considering all the information in the universe.
- Auxiliary information caused
Peter’s height to be revealed, whether he was in the database or not.
A MORE TANGIBLE PROPERTY
Controlled access to a database preserves the privacy of a person if it makes no difference whether that person’s data is included in the database or not.
- So: I should participate in that toy product survey, because it
makes very little difference for me (privacy), but the statistical results may lead to better Barbie dolls (utility).
DIFFERENTIAL PRIVACY
I’m in your database, but nobody knows! (Dwork)
BACK TO RANDOMIZING
- Given a database D, we run a randomized query f, one
that may make random coin tosses in addition to inspecting data.
- The result f(D) is a probability distribution over the random
coin tosses of f; the data itself is not a random variable.
- We can now measure the probability of an answer being in a
certain range: Pr[ f(D) ∈ S ]
For a particular query f, what happens to those probabilities when one person is removed from D?
EXAMPLE: COUNT
[Figure: the distribution Pr[f(D)] of a noisy count, peaked over the outcome 833, with a shaded outcome set S; adding one row gives D’ = D + {Arnar} and a slightly shifted distribution Pr[f(D’)]]
Pr[ f(D) ∈ S ] ≈ Pr[ f(D’) ∈ S ]
DIFFERENTIAL PRIVACY
- Let D and D’ be databases differing in one row.
- A randomized query f is ε-differentially private iff
for any set S ⊆ Range(f)
Pr[ f(D) ∈ S ] ≤ exp(ε) ⋅ Pr[ f(D’) ∈ S ]
DIFFERENTIAL PRIVACY
- Swapping D and D’ and rearranging gives an equivalent:
exp(-ε) ≤ Pr[ f(D) ∈ S ] / Pr[ f(D’) ∈ S ] ≤ exp(ε)
- For small ε, this means the ratio is very close to one.
HOW MUCH NOISE?
- The sensitivity Δg of a (non-randomized) query g is the maximum
effect of adding or removing one row, over all databases:
Δg = max | g(D) - g(D’) |
      D, D’
- Then g can be made ε-differentially private by adding noise
drawn from the Laplace distribution Lap(b), with density
P(z) = 1/(2b) ⋅ exp(-|z| / b)
and b = Δg/ε
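A minimal, stdlib-only sketch of the Laplace mechanism with scale b = Δg/ε (for a count query Δg = 1), plus an empirical check of the differential-privacy inequality for the slide's count example:

```python
import math
import random

def laplace_mechanism(true_answer, sensitivity, eps, rng):
    """Release true_answer + Lap(b) with b = sensitivity / eps,
    sampled by inverting the Laplace CDF (stdlib only)."""
    b = sensitivity / eps
    u = rng.random() - 0.5
    return true_answer - b * math.copysign(math.log(1 - 2 * abs(u)), u)

# Empirical check on a count query (sensitivity 1):
# D gives 833, D' = D + {Arnar} gives 834; take S = [834, ∞).
rng = random.Random(2012)
eps, N = 0.5, 100_000
p_D = sum(laplace_mechanism(833, 1, eps, rng) >= 834 for _ in range(N)) / N
p_D2 = sum(laplace_mechanism(834, 1, eps, rng) >= 834 for _ in range(N)) / N
# Up to sampling error, p_D2 / p_D stays below exp(eps) ≈ 1.65.
```

Because Lap(b) is symmetric, the sign convention in the inverse-CDF step does not affect the released distribution.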
LAPLACIAN NOISE
[Figure: Laplace densities centered at g(D) and g(D’), drawn for a high and a low ε]
- High ε gives a clear difference. More utility, less privacy.
- Low ε gives less difference. More privacy, less utility.
- Why Laplace: Symmetric and “memoryless”.
MULTIPLE QUERIES
- Differential privacy mechanisms generally allocate a privacy
budget to each user.
- A user runs a query with a specified ε, which is then deducted
from her budget.
- Running the same query twice (and averaging) costs the same
privacy budget as running it once with twice the ε.
- Benefit: No need for semantic analysis on queries.
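The budget bookkeeping itself is simple. A minimal sketch (class and interface hypothetical), which refuses further queries once the budget runs out, much like PINQ issuing an error:

```python
class PrivacyBudget:
    """Per-user privacy budget accounting (interface hypothetical)."""

    def __init__(self, total_eps):
        self.remaining = total_eps

    def spend(self, eps):
        # Deduct the query's eps; refuse once the budget is exhausted.
        if eps > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= eps

budget = PrivacyBudget(total_eps=1.0)
budget.spend(0.3)
budget.spend(0.3)   # 0.4 remains; a further spend(0.5) would raise
```

Note that no inspection of the query text is needed: the accountant only sees the ε each query declares, which is exactly the "no semantic analysis" benefit above.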
AUXILIARY INFORMATION
- Differential privacy is not affected by auxiliary information,
because the definition only considers whether one should participate in a particular database or not.
- Note: Differential privacy gives the same guarantees for those
that do not appear in the database, as for those who do!
DP MECHANISMS
How do we actually use this?
BY HAND
- Sensitivity of an algorithm may sometimes be approximated
mechanically, but often one can do better. Conservative estimates lead to high noise and reduced utility.
- Often there are non-trivial ways to give a differentially private
implementation of an algorithm that requires much less noise.
- Generally parametrized on an ε. Not always Laplace noised.
- Many publications here, as DP originated from the algorithms
community.
EXAMPLE: K-MEANS
- Finding k clusters in a collection of N points:
1. Select k random “master” points
2. Sort points into buckets by their closest master point
3. Choose means of buckets as new master points and repeat.
- “Bucketing” has a low sensitivity - removing one point only
affects one bucket.
- Calculating the means also has a low sensitivity.
- Tricky: How many iterations and how to split the ε ?
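One iteration of the bucketing-and-means step can be sketched in one dimension. This is a simplified illustration, not an algorithm from the literature: points are assumed to lie in [0, 1] so that both the per-bucket count and per-bucket sum have sensitivity 1, and the per-iteration ε is split evenly between the two noisy releases:

```python
import math
import random

def lap(b, rng):
    # Inverse-CDF sample from Lap(b).
    u = rng.random() - 0.5
    return -b * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_kmeans_step(points, masters, eps, rng):
    """One differentially private k-means iteration on 1-D points in [0, 1]."""
    buckets = [[] for _ in masters]
    for p in points:
        i = min(range(len(masters)), key=lambda j: abs(p - masters[j]))
        buckets[i].append(p)
    new_masters = []
    for pts in buckets:
        noisy_count = len(pts) + lap(2 / eps, rng)   # b = 1 / (eps/2)
        noisy_sum = sum(pts) + lap(2 / eps, rng)
        # Guard against tiny or negative noisy counts.
        new_masters.append(noisy_sum / max(noisy_count, 1.0))
    return new_masters

rng = random.Random(0)
points = ([rng.uniform(0.0, 0.2) for _ in range(200)]
          + [rng.uniform(0.8, 1.0) for _ in range(200)])
masters = dp_kmeans_step(points, [0.3, 0.7], eps=1.0, rng=rng)
```

Only noisy counts and noisy sums ever leave the raw data, so each iteration costs ε; the open questions on the slide remain: how many iterations, and how to split the total budget among them.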
PINQ
- LINQ (Language Integrated Query) is an embedded query
language for .NET languages.
- Privacy Integrated Queries [McSherry] adds a layer on top
that automatically adds Laplacian noise to the result.
var data = new PINQueryable<SearchRecord>( ... ... );
var users = from record in data
            where record.Query == argv[0]
            groupby record.IPAddress;
return users.NoisyCount(0.1);
PINQ
- Sequential composition of differentially private computations is
differentially private, with ε the sum of the components’ ε’s.
- Parallel composition (over disjoint data) of differentially private
computations is differentially private, with ε the maximum of the components’ ε’s.
- PINQ over-approximates badly for some algorithms, e.g.
where the privacy cost depends on control flow, but works well for many others - for example k-means.
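The two composition rules can be captured in a toy ε accountant (function names are illustrative, not PINQ's API):

```python
def sequential_cost(epsilons):
    # Queries over the same data: the costs add up.
    return sum(epsilons)

def parallel_cost(epsilons):
    # Queries over disjoint partitions of the data: the maximum dominates.
    return max(epsilons)

total_seq = sequential_cost([0.1, 0.1, 0.3])   # 0.5
total_par = parallel_cost([0.1, 0.1, 0.3])     # 0.3
```

Parallel composition is what makes per-group aggregates (like the groupby in the PINQ example) affordable: each record lands in exactly one group, so the whole groupby costs one ε, not one per group.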
AIRAVAT
- MapReduce, introduced by Google, decomposes computations
into two phases (map and reduce) that can easily be distributed.
- Airavat [Shmatikov et al.] implements differential privacy on
top of MapReduce.
- Mandatory access control
and isolation allow untrusted mappers.
LINEAR TYPES
- Pierce and Reed provided a linear type system that guarantees
differential privacy.
- Value types form metric spaces, so sensitivity of operations
can be inferred (conservatively).
- Does not deal well with redundant computations and over-
estimates sensitivity due to control flow.
- Works only for specific ways of adding random noise. More
recent work by Barthe et al. (POPL 2012) aims to improve on this.
NOT PERFECT, OF COURSE
- Issuing and managing privacy budgets (to be spent on epsilons)
is far from trivial. Not really a technical problem.
- One may leak information through the use of the budget. E.g.
PINQ issues an error when the budget is spent, providing a limited side channel.
- Traditional covert channels, such as timing, are not addressed
by PINQ/Airavat.
- Proposed solutions: “Differential Privacy Under Fire” [Haeberlen, Pierce, Narayan]
RELATIVES OF DP
Wait, there’s more?
ADVERSARIAL PRIVACY
- Some analyses, e.g. of relational data such as connections in a
social network, do not fit the DP model well.
- Rastogi et al. propose a weaker notion, adversarial privacy,
based on Bayesian inference about the knowledge of an adversary.
- Knowledge is represented as a class of probability
distributions, which can be updated based on observations.
- DP is obtained as a special case for certain classes.
P(t | O) ≤ exp(ε) ⋅ P(t) + γ
QUANTIFIED INFO-FLOW
- Information-flow control may choose to enforce limits on the
amount of leaked information.
- E.g. the number of bits leaked, discrete models of attacker
knowledge or probabilistic models.
- Probabilistic models correspond somewhat with DP, but tend
to apply to individual records rather than aggregates.
- See e.g. work by Köpf et al. and Hicks (CSF 2011)
PAN-PRIVATE ALGORITHMS
- Traditional DP algorithms have full access to non-randomized
data, and store it in memory.
- Recent research by Dwork discusses pan-private algorithms,
where even the internal state of the algorithm satisfies DP.
- Example: What is the proportion of IP addresses I’ve seen on
a particular network link?
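A simplified sketch of how such an estimator can work: the only in-memory state is a table of randomized-response bits, so even an intruder who inspects the state mid-stream learns little about whether any particular IP has appeared. The class and parameters are illustrative, not the exact algorithm from the pan-privacy paper:

```python
import math
import random

class PanPrivateDensity:
    """Estimate what fraction of a (hashed) universe of IPs has been seen,
    while keeping only differentially private state in memory."""

    def __init__(self, universe_size, eps, rng):
        self.rng = rng
        self.q = math.exp(eps) / (1 + math.exp(eps))  # Pr[bit stored faithfully]
        # Initialise every entry as a randomized encoding of "not seen yet".
        self.bits = [self._encode(0) for _ in range(universe_size)]

    def _encode(self, bit):
        return bit if self.rng.random() < self.q else 1 - bit

    def see(self, item):
        # Overwrite the entry with a fresh randomized encoding of "seen";
        # re-seeing the same item does not change the state's distribution.
        self.bits[item] = self._encode(1)

    def estimate(self):
        # E[mean(bits)] = (1 - q) + d * (2q - 1) for true density d; invert.
        m = sum(self.bits) / len(self.bits)
        return (m - (1 - self.q)) / (2 * self.q - 1)

rng = random.Random(7)
sketch = PanPrivateDensity(universe_size=1000, eps=1.0, rng=rng)
for ip in range(300):            # 30% of the universe appears on the link
    sketch.see(ip)
est = sketch.estimate()          # close to 0.3, up to randomized-response noise
```

Since every stored bit is itself a randomized response, a snapshot of `self.bits` at any moment reveals little about any single IP, yet the aggregate density remains estimable.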
TAKE HOME POINTS
- Online methods for privacy in statistical databases have more
room to avoid linkage attacks.
- Differential Privacy comes from re-evaluating what makes
database accesses private: If I’m in or out does not matter.
- DP has nice composition properties, and makes few
assumptions about prior knowledge (aux. information).
- DP is no panacea - doesn’t always fit.