DIFFERENTIAL PRIVACY and some of its relatives
BETTER DEFINITIONS
or: Why should I answer your survey?
THE PROBLEM
- Statistical databases provide some utility,
- ...often at odds with privacy concerns.
- Utility for whom?
- Government, researchers, health authorities (Dwork: Tier 1)
- Commercial: movie recommendations, ad targeting (Tier 2)
- How do we protect privacy but maintain utility?
ONLINE METHODS
- Bart showed us offline methods, preparing data beforehand,
and some of the problems. Instead, consider the online case:
- Data curators host data and control access to it.
- We assume the curator is trusted, as is the confidentiality of
the raw data itself.
- If the data contains private information, how does a curator
keep it private?
BLEND IN THE CROWD
- Suggestion: Only answer
questions about large datasets.
- Fails:
?> How many Swedish PhD students like Barbie? => 834
?> How many Swedish PhD students, who are not Icelandic or do not study Computer Security, like Barbie? => 833
Willard likes Barbie!
(because, obviously, I do not!)
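The failure above can be reproduced in a few lines. A minimal sketch with hypothetical data (names and counts taken from the slide); refusing small result sets does not help, because two large queries can differ in exactly one person:

```python
# Hypothetical data matching the slide: 834 students like Barbie, and
# Willard is the only one who is both Icelandic and studies Computer Security.
records = (
    [{"name": "Willard", "icelandic": True, "security": True, "likes_barbie": True}]
    + [{"name": f"s{i}", "icelandic": False, "security": False, "likes_barbie": True}
       for i in range(833)]
)

def count(pred):
    """Answer a counting query, but refuse 'small' result sets."""
    n = sum(1 for r in records if pred(r))
    if n < 100:
        raise ValueError("query refused: result set too small")
    return n

# Both queries are large, so both are answered...
broad = count(lambda r: r["likes_barbie"])                       # 834
narrow = count(lambda r: r["likes_barbie"]
               and not (r["icelandic"] and r["security"]))       # 833
# ...yet their difference pinpoints one individual.
willard_likes_barbie = (broad - narrow == 1)
```

The threshold (100) is arbitrary: any cut-off fails the same way, since the attack only needs two answerable queries whose populations differ by one person.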
RANDOMIZED RESPONSE
- Suggestion: Fuzz the answers so that
individual differences are lost in the noise.
- Not easy, but on the right track!
- E.g. averaging many answers to the same query cancels the
noise, and detecting equivalent queries may be undecidable.
A LITTLE REMINDER
- Linkage attacks particularly tricky to prevent: The AOL
debacle, Netflix Prize and Sweeney’s use of voting records.
- Dalenius’ desideratum & Dwork’s “impossibility proof”:
Known fact: Height(Peter) = Average Swedish height + 10
WHAT IS PRIVACY ANYWAYS
- Dalenius: Any private information that can be deduced from a
database can just as well be deduced without any access to it at all.
- Great notion of privacy preserving, but impossible (unless we
sacrifice utility)
- Sweeney’s k-anonymity: Each value of a QID appears at least k
times.
- Easy to check or enforce, but what does it actually mean?
Ad-hoc and hurts utility (e.g. correlations)
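The "easy to check" part is genuinely easy. A minimal sketch of a k-anonymity checker over hypothetical generalized records:

```python
from collections import Counter

def is_k_anonymous(rows, qid_columns, k):
    """Check that every combination of quasi-identifier (QID) values
    occurring in the table occurs at least k times."""
    counts = Counter(tuple(row[c] for c in qid_columns) for row in rows)
    return all(n >= k for n in counts.values())

# Hypothetical generalized records (zip codes and ages already coarsened):
rows = [
    {"zip": "213**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "213**", "age": "20-29", "diagnosis": "cold"},
    {"zip": "214**", "age": "30-39", "diagnosis": "flu"},
]
ok = is_k_anonymous(rows, ["zip", "age"], k=2)  # False: ("214**", "30-39") is unique
```

The check says nothing about what an adversary actually learns: all k rows in a group may share the same sensitive value, and the generalization needed to pass the check destroys correlations.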
TWO INSIGHTS
- Dalenius’ definition is good,
but doesn’t make sense unless considering all the information in the universe.
- Auxiliary information caused
Peter’s height to be revealed, whether he was in the database or not.
A MORE TANGIBLE PROPERTY
Controlled access to a database preserves the privacy of a person if it makes no difference whether that person’s data is included in the database or not.
- So: I should participate in that toy product survey, because it
makes very little difference for me (privacy), but the statistical results may lead to better Barbie dolls (utility).
DIFFERENTIAL PRIVACY
I’m in your database, but nobody knows! (Dwork)
BACK TO RANDOMIZING
- Given a database D, we run a randomized query f, one
that may make random coin tosses in addition to inspecting data.
- The result f(D) is a probability distribution over the random
coin tosses of f; the data itself is not a random variable.
- We can now measure the probability of an answer being in a
certain range: Pr[ f(D) ∈ S ]
For a particular query f, what happens to those probabilities when one person is removed from D?
EXAMPLE: COUNT
[Figure: the distribution Pr[f(D)] of a noisy count, peaked over the outcome 833, with a shaded outcome set S; adding one row gives D’ = D + {Arnar} and a slightly shifted distribution Pr[f(D’)]]
Pr[ f(D) ∈ S ] ≈ Pr[ f(D’) ∈ S ]
DIFFERENTIAL PRIVACY
- Let D and D’ be databases differing in one row.
- A randomized query f is ε-differentially private iff
for any set S ⊆ Range(f)
Pr[ f(D) ∈ S ] ≤ exp(ε) ⋅ Pr[ f(D’) ∈ S ]
DIFFERENTIAL PRIVACY
- Swapping D and D’ and rearranging gives an equivalent:
exp(-ε) ≤ Pr[ f(D) ∈ S ] / Pr[ f(D’) ∈ S ] ≤ exp(ε)
- For small ε, this means the ratio is very close to one.
HOW MUCH NOISE?
- The sensitivity Δg of a (non-randomized) query g is the maximum
effect of adding or removing one row, over all databases:
Δg = max | g(D) - g(D’) |
      D, D’
- Then g can be made ε-differentially private by adding noise
drawn from the Laplace distribution Lap(b), with density
P(z) = 1/(2b) ⋅ exp(-|z| / b)
and b = Δg/ε
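A minimal, stdlib-only sketch of the Laplace mechanism with scale b = Δg/ε (for a count query Δg = 1), plus an empirical check of the differential-privacy inequality for the slide's count example:

```python
import math
import random

def laplace_mechanism(true_answer, sensitivity, eps, rng):
    """Release true_answer + Lap(b) with b = sensitivity / eps,
    sampled by inverting the Laplace CDF (stdlib only)."""
    b = sensitivity / eps
    u = rng.random() - 0.5
    return true_answer - b * math.copysign(math.log(1 - 2 * abs(u)), u)

# Empirical check on a count query (sensitivity 1):
# D gives 833, D' = D + {Arnar} gives 834; take S = [834, ∞).
rng = random.Random(2012)
eps, N = 0.5, 100_000
p_D = sum(laplace_mechanism(833, 1, eps, rng) >= 834 for _ in range(N)) / N
p_D2 = sum(laplace_mechanism(834, 1, eps, rng) >= 834 for _ in range(N)) / N
# Up to sampling error, p_D2 / p_D stays below exp(eps) ≈ 1.65.
```

Because Lap(b) is symmetric, the sign convention in the inverse-CDF step does not affect the released distribution.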
LAPLACIAN NOISE
[Figure: Laplace densities centered at g(D) and g(D’), drawn for a high and a low ε]
- High ε gives a clear difference. More utility, less privacy.
- Low ε gives less difference. More privacy, less utility.
- Why Laplace: Symmetric and “memoryless”.
MULTIPLE QUERIES
- Differential privacy mechanisms generally allocate a privacy
budget to each user.
- A user runs a query with a specified ε, which is then deducted
from her budget.
- Running the same query twice (and averaging) costs the same
privacy budget as running it once with twice the ε.
- Benefit: No need for semantic analysis on queries.
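The budget bookkeeping itself is simple. A minimal sketch (class and interface hypothetical), which refuses further queries once the budget runs out, much like PINQ issuing an error:

```python
class PrivacyBudget:
    """Per-user privacy budget accounting (interface hypothetical)."""

    def __init__(self, total_eps):
        self.remaining = total_eps

    def spend(self, eps):
        # Deduct the query's eps; refuse once the budget is exhausted.
        if eps > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= eps

budget = PrivacyBudget(total_eps=1.0)
budget.spend(0.3)
budget.spend(0.3)   # 0.4 remains; a further spend(0.5) would raise
```

Note that no inspection of the query text is needed: the accountant only sees the ε each query declares, which is exactly the "no semantic analysis" benefit above.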
AUXILIARY INFORMATION
- Differential privacy is not affected by auxiliary information,
because the definition only considers whether one should participate in a particular database or not.
- Note: Differential privacy gives the same guarantees for those
that do not appear in the database, as for those who do!
DP MECHANISMS
How do we actually use this?
BY HAND
- Sensitivity of an algorithm may sometimes be approximated
mechanically, but often one can do better. Conservative estimates lead to high noise and reduced utility.
- Often there are non-trivial ways to give a differentially private
implementation of an algorithm that requires much less noise.
- Generally parametrized on an ε. Not always Laplace noised.
- Many publications here, as DP originated from the algorithms
community.
EXAMPLE: K-MEANS
- Finding k clusters in a collection of N points:
1. Select k random “master” points
2. Sort points into buckets by their closest master point
3. Choose means of buckets as new master points and repeat.
- “Bucketing” has a low sensitivity - removing one point only
affects one bucket.
- Calculating the means also has a low sensitivity.
- Tricky: How many iterations and how to split the ε ?
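One iteration of the bucketing-and-means step can be sketched in one dimension. This is a simplified illustration, not an algorithm from the literature: points are assumed to lie in [0, 1] so that both the per-bucket count and per-bucket sum have sensitivity 1, and the per-iteration ε is split evenly between the two noisy releases:

```python
import math
import random

def lap(b, rng):
    # Inverse-CDF sample from Lap(b).
    u = rng.random() - 0.5
    return -b * math.copysign(math.log(1 - 2 * abs(u)), u)

def dp_kmeans_step(points, masters, eps, rng):
    """One differentially private k-means iteration on 1-D points in [0, 1]."""
    buckets = [[] for _ in masters]
    for p in points:
        i = min(range(len(masters)), key=lambda j: abs(p - masters[j]))
        buckets[i].append(p)
    new_masters = []
    for pts in buckets:
        noisy_count = len(pts) + lap(2 / eps, rng)   # b = 1 / (eps/2)
        noisy_sum = sum(pts) + lap(2 / eps, rng)
        # Guard against tiny or negative noisy counts.
        new_masters.append(noisy_sum / max(noisy_count, 1.0))
    return new_masters

rng = random.Random(0)
points = ([rng.uniform(0.0, 0.2) for _ in range(200)]
          + [rng.uniform(0.8, 1.0) for _ in range(200)])
masters = dp_kmeans_step(points, [0.3, 0.7], eps=1.0, rng=rng)
```

Only noisy counts and noisy sums ever leave the raw data, so each iteration costs ε; the open questions on the slide remain: how many iterations, and how to split the total budget among them.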
PINQ
- LINQ (Language Integrated Query) is an embedded query
language for .NET languages.
- Privacy Integrated Queries [McSherry] adds a layer on top
that automatically adds Laplacian noise to the result.
var data = new PINQueryable<SearchRecord>( ... ... );
var users = from record in data
            where record.Query == argv[0]
            groupby record.IPAddress;
return users.NoisyCount(0.1);
PINQ
- Sequential composition of differentially private computations is
differentially private, with ε the sum of the components’ ε’s.
- Parallel composition (over disjoint data) of differentially private
computations is differentially private, with ε the maximum of the components’ ε’s.
- PINQ over-approximates badly for some algorithms, e.g.
where the privacy cost depends on control flow, but works well for many others - for example k-means.
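The two composition rules can be captured in a toy ε accountant (function names are illustrative, not PINQ's API):

```python
def sequential_cost(epsilons):
    # Queries over the same data: the costs add up.
    return sum(epsilons)

def parallel_cost(epsilons):
    # Queries over disjoint partitions of the data: the maximum dominates.
    return max(epsilons)

total_seq = sequential_cost([0.1, 0.1, 0.3])   # 0.5
total_par = parallel_cost([0.1, 0.1, 0.3])     # 0.3
```

Parallel composition is what makes per-group aggregates (like the groupby in the PINQ example) affordable: each record lands in exactly one group, so the whole groupby costs one ε, not one per group.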
AIRAVAT
- MapReduce, introduced by Google, decomposes computations
into two phases (map and reduce) that can easily be distributed.
- Airavat [Shmatikov et al.] implements differential privacy on
top of MapReduce.
- Mandatory access control
and isolation allow untrusted mappers.
LINEAR TYPES
- Pierce and Reed provided a linear type system that guarantees
differential privacy.
- Value types form metric spaces, so sensitivity of operations
can be inferred (conservatively).
- Does not deal well with redundant computations and over-
estimates sensitivity due to control flow.
- Works only for specific ways of adding random noise. More
recent work by Barthe et al. (POPL 2012) aims to improve on this.
NOT PERFECT, OF COURSE
- Issuing and managing privacy budgets (to be spent on epsilons)
is far from trivial. Not really a technical problem.
- One may leak information through the use of the budget. E.g.
PINQ issues an error when the budget is spent, providing a limited side channel.
- Traditional covert channels, such as timing, are not addressed
by PINQ/Airavat.
- Proposed solutions: “Differential Privacy Under Fire” [Haeberlen, Pierce, Narayan]
RELATIVES OF DP
Wait, there’s more?
ADVERSARIAL PRIVACY
- Some analyses, e.g. of relational data such as connections in a
social network, do not fit the DP model well.
- Rastogi et al. propose a weaker notion, adversarial privacy,
based on Bayesian inference about the knowledge of an adversary.
- Knowledge is represented as a class of probability
distributions, which can be updated based on observations.
- DP is obtained as a special case for certain classes.
P(t | O) ≤ exp(ε) ⋅ P(t) + γ
QUANTIFIED INFO-FLOW
- Information-flow control may choose to enforce limits on the
amount of leaked information.
- E.g. the number of bits leaked, discrete models of attacker
knowledge or probabilistic models.
- Probabilistic models correspond somewhat with DP, but tend
to apply to individual records rather than aggregates.
- See e.g. work by Köpf et al. and Hicks (CSF 2011)
PAN-PRIVATE ALGORITHMS
- Traditional DP algorithms have full access to non-randomized
data, and store it in memory.
- Recent research by Dwork discusses pan-private algorithms,
where even the internal state of the algorithm satisfies DP.
- Example: What is the proportion of IP addresses I’ve seen on
a particular network link?
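A simplified sketch of how such an estimator can work: the only in-memory state is a table of randomized-response bits, so even an intruder who inspects the state mid-stream learns little about whether any particular IP has appeared. The class and parameters are illustrative, not the exact algorithm from the pan-privacy paper:

```python
import math
import random

class PanPrivateDensity:
    """Estimate what fraction of a (hashed) universe of IPs has been seen,
    while keeping only differentially private state in memory."""

    def __init__(self, universe_size, eps, rng):
        self.rng = rng
        self.q = math.exp(eps) / (1 + math.exp(eps))  # Pr[bit stored faithfully]
        # Initialise every entry as a randomized encoding of "not seen yet".
        self.bits = [self._encode(0) for _ in range(universe_size)]

    def _encode(self, bit):
        return bit if self.rng.random() < self.q else 1 - bit

    def see(self, item):
        # Overwrite the entry with a fresh randomized encoding of "seen";
        # re-seeing the same item does not change the state's distribution.
        self.bits[item] = self._encode(1)

    def estimate(self):
        # E[mean(bits)] = (1 - q) + d * (2q - 1) for true density d; invert.
        m = sum(self.bits) / len(self.bits)
        return (m - (1 - self.q)) / (2 * self.q - 1)

rng = random.Random(7)
sketch = PanPrivateDensity(universe_size=1000, eps=1.0, rng=rng)
for ip in range(300):            # 30% of the universe appears on the link
    sketch.see(ip)
est = sketch.estimate()          # close to 0.3, up to randomized-response noise
```

Since every stored bit is itself a randomized response, a snapshot of `self.bits` at any moment reveals little about any single IP, yet the aggregate density remains estimable.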
TAKE HOME POINTS
- Online methods for privacy in statistical databases have more
room to avoid linkage attacks.
- Differential Privacy comes from re-evaluating what makes
database accesses private: If I’m in or out does not matter.
- DP has nice composition properties, and makes few
assumptions about prior knowledge (aux. information).
- DP is no panacea - doesn’t always fit.