SLIDE 1

DIFFERENTIAL PRIVACY

and some of its relatives

SLIDE 2

BETTER DEFINITIONS

  • or: Why should I answer your survey?

SLIDE 3

THE PROBLEM

  • Statistical databases provide some utility,
  • ...often at odds with privacy concerns.
  • Utility for whom?
  • Government, researchers, health authorities (Dwork: Tier 1)
  • Commercial, movie recommendations, ad targeting (Tier 2)
  • How do we protect privacy but maintain utility?

SLIDE 4

ONLINE METHODS

  • Bart showed us offline methods, preparing data beforehand, and some of the problems. Instead, consider the online case:
  • Data curators host data and control access to it.
  • We assume the curator is trusted, as is the confidentiality of the raw data itself.
  • If the data contains private information, how does a curator keep it private?

SLIDE 5

BLEND IN THE CROWD

  • Suggestion: Only answer questions about large datasets.
  • Fails (the attack is spelled out in code below):

?> How many Swedish PhD students like Barbie?
=> 834
?> How many Swedish PhD students, who are not Icelandic or do not study Computer Security, like Barbie?
=> 833
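The differencing attack, as a toy sketch (the Student record and its fields are hypothetical, chosen to mirror the two queries above):

    using System;
    using System.Linq;

    class Differencing
    {
        record Student(bool Icelandic, bool CompSec, bool LikesBarbie);

        static void Main()
        {
            Student[] db = { /* ... the PhD students ... */ };

            int all  = db.Count(s => s.LikesBarbie);                  // => 834
            int rest = db.Count(s => s.LikesBarbie
                                     && !(s.Icelandic && s.CompSec)); // => 833
            // all - rest = 1 pins the answer on a single person,
            // even though both queries cover a large population.
            Console.WriteLine(all - rest);
        }
    }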

SLIDE 6

Willard likes Barbie!

(because, obviously, I do not!)

SLIDE 7

RANDOMIZED RESPONSE

  • Suggestion: Fuzz the answers so that individual differences are lost in the noise (sketched below).
  • Not easy, but on the right track!
  • E.g. averaging many queries cancels noise, and it may not be decidable to detect equivalent queries.
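The classic instance of this idea is Warner’s randomized response, where each respondent flips coins before answering. A minimal sketch (the 50/50 coin probabilities are the textbook choice, not from the slides):

    using System;

    class RandomizedResponse
    {
        static readonly Random Rng = new Random();

        // First coin: heads, answer truthfully. Tails: answer at random.
        static bool Respond(bool truth) =>
            Rng.NextDouble() < 0.5 ? truth : Rng.NextDouble() < 0.5;

        static void Main()
        {
            // Pr[yes] = 0.5·p + 0.25, so the true rate p = 2·Pr[yes] - 0.5.
            int n = 100_000, yes = 0;
            for (int i = 0; i < n; i++)
                if (Respond(Rng.NextDouble() < 0.3)) yes++;   // true rate p = 0.3
            Console.WriteLine($"estimated p = {2.0 * yes / n - 0.5:F3}");
        }
    }

Individual answers are deniable, yet the aggregate estimate converges: noise that cancels in the right direction.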

SLIDE 8

A LITTLE REMINDER

  • Linkage attacks are particularly tricky to prevent: the AOL debacle, the Netflix Prize and Sweeney’s use of voting records.
  • Dalenius’ desideratum & Dwork’s “impossibility proof”:

Known fact: Height(Peter) = Average Swedish height + 10

SLIDE 9

WHAT IS PRIVACY ANYWAY?

  • Dalenius: Any private information that can be deduced from a database can just as well be deduced without any access to it at all.
  • A great notion of privacy preservation, but impossible (unless we sacrifice utility).
  • Sweeney’s k-anonymity: Each value of a QID appears at least k times.
  • Easy to check or enforce, but what does it actually mean? Ad-hoc, and it hurts utility (e.g. correlations).

SLIDE 10

TWO INSIGHTS

  • Dalenius’ definition is good, but doesn’t make sense unless considering all the information in the universe.
  • Auxiliary information caused Peter’s height to be revealed, whether he was in the database or not.

SLIDES 11-12

A MORE TANGIBLE PROPERTY

Controlled access to a database preserves the privacy of a person if it makes no difference whether that person’s data is included in the database or not.

So: I should participate in that toy product survey, because it makes very little difference for me (privacy), but the statistical results may lead to better Barbie dolls (utility).

SLIDE 13

DIFFERENTIAL PRIVACY

I’m in your database, but nobody knows! (Dwork)

SLIDE 14

BACK TO RANDOMIZING

  • Given a database D, we allow running a randomized query f, one that may make random coin tosses in addition to inspecting data.
  • The result f(D) is a probability distribution over the random coin tosses of f, i.e. the data itself is not a random variable.
  • We can now measure the probability of an answer being in a certain range: Pr[ f(D) ∈ S ]

SLIDE 15

For a particular query f, what happens to those probabilities when one person is removed from D?

SLIDES 16-21

EXAMPLE: COUNT

[Figure: the distribution Pr[f(D)] of the noisy count over the outcomes, peaked at the true answer 833, with a set S of outcomes marked. Adding one row gives D’ = D + {Arnar}; its distribution Pr[f(D’)] is shifted only slightly.]

Pr[ f(D) ∈ S ] ≈ Pr[ f(D’) ∈ S ]

SLIDE 22

DIFFERENTIAL PRIVACY

  • Let D and D’ be databases differing in one row.
  • A randomized query f is ε-differentially private iff for any set S ⊆ range(f):

Pr[ f(D) ∈ S ] ≤ exp(ε) ⋅ Pr[ f(D’) ∈ S ]
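For concreteness (a worked number, not on the slide): with ε = 0.1 we get exp(ε) ≈ 1.105, so adding or removing one row changes the probability of any outcome set by at most about 10%.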

SLIDE 23

DIFFERENTIAL PRIVACY

  • Swapping D and D’ and rearranging gives an equivalent form:

exp(-ε) ≤ Pr[ f(D) ∈ S ] / Pr[ f(D’) ∈ S ] ≤ exp(ε)

  • For small ε, this means the ratio is very close to one.

SLIDE 24

HOW MUCH NOISE?

  • The sensitivity Δg of a (non-randomized) query g is the maximum effect of adding or removing one row, over all databases:

Δg = max over D, D’ of | g(D) - g(D’) |

  • Then g can be made ε-differentially private by adding noise drawn from the Laplace distribution Lap(b), with density

p(z) = 1/(2b) ⋅ exp(-|z| / b)

with b = Δg / ε.
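To see why this scale suffices (a standard one-line argument, not spelled out in the deck): for any output z, the ratio of the two noise densities is

exp(-|z - g(D)| / b) / exp(-|z - g(D’)| / b) ≤ exp( |g(D) - g(D’)| / b ) ≤ exp( Δg / b ) = exp(ε)

by the triangle inequality.

As a concrete sketch, here is the mechanism in C# (chosen to match the PINQ snippet later in the deck; the Laplace sampler and the NoisyCount helper are our illustration, not any library’s API):

    using System;

    class LaplaceMechanism
    {
        static readonly Random Rng = new Random();

        // Sample Lap(b) by inverse CDF: u is uniform on (-1/2, 1/2).
        static double Laplace(double b)
        {
            double u = Rng.NextDouble() - 0.5;
            return -b * Math.Sign(u) * Math.Log(1 - 2 * Math.Abs(u));
        }

        // A counting query has sensitivity Δg = 1: adding or removing
        // one row changes the count by at most 1, so b = 1/ε.
        static double NoisyCount(long trueCount, double epsilon)
        {
            return trueCount + Laplace(1.0 / epsilon);
        }

        static void Main()
        {
            // The Barbie count from slide 5, released with ε = 0.1.
            Console.WriteLine(NoisyCount(833, 0.1));
        }
    }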

SLIDES 25-27

LAPLACIAN NOISE

[Figure: the two noise distributions, centered at g(D) and g(D’), drawn once for a high ε and once for a low ε.]

  • High ε gives a clear difference. More utility, less privacy.
  • Low ε gives less difference. More privacy, less utility.
  • Why Laplace: Symmetric and “memoryless”.

SLIDE 28

MULTIPLE QUERIES

  • Differential privacy mechanisms generally allocate a privacy budget to each user.
  • A user runs a query with a specified ε, which is then deducted from her budget (a bookkeeping sketch follows this list).
  • Running the same query twice and averaging the results costs as much budget as running it once with twice the ε.
  • Benefit: No need for semantic analysis of queries.
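A minimal sketch of that bookkeeping (the PrivacyBudget class is our invention, not PINQ’s actual agent API):

    using System;

    class PrivacyBudget
    {
        double remaining;
        public PrivacyBudget(double total) { remaining = total; }

        // Deduct ε before a query runs; refuse once the budget is gone.
        public void Spend(double epsilon)
        {
            if (epsilon > remaining)
                throw new InvalidOperationException("privacy budget exhausted");
            remaining -= epsilon;
        }
    }

Each query calls Spend(ε) before touching the data. Note that the refusal itself is observable - exactly the side channel slide 37 warns about.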
SLIDE 29

AUXILIARY INFORMATION

  • Differential privacy is not affected by auxiliary information, because the definition only considers whether one should participate in a particular database or not.
  • Note: Differential privacy gives the same guarantees for those that do not appear in the database as for those who do!

SLIDE 30

DP MECHANISMS

How do we actually use this?

SLIDE 31

BY HAND

  • The sensitivity of an algorithm may sometimes be approximated mechanically, but often one can do better. Conservative estimates lead to high noise and reduced utility.
  • Often there are non-trivial ways to give a differentially private implementation of an algorithm that requires much less noise.
  • Generally parametrized on an ε. Not always Laplace noise.
  • Many publications here, as DP originated in the algorithms community.

SLIDE 32

EXAMPLE: K-MEANS

  • Finding k clusters in a collection of N points:
    1. Select k random “master” points.
    2. Sort points into buckets by their closest master point.
    3. Choose the means of the buckets as new master points and repeat.
  • “Bucketing” has a low sensitivity - removing one point only affects one bucket.
  • Calculating the means also has a low sensitivity.
  • Tricky: How many iterations, and how to split the ε? (See the sketch below.)
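One noisy iteration as a sketch, for 1-D points scaled into [0,1] (the even split of ε between sums and counts is our illustrative choice, not from the slides):

    using System;
    using System.Linq;

    class DpKMeans
    {
        static readonly Random Rng = new Random();

        static double Laplace(double b)
        {
            double u = Rng.NextDouble() - 0.5;
            return -b * Math.Sign(u) * Math.Log(1 - 2 * Math.Abs(u));
        }

        // Bucket points by nearest center, then recompute each center as a
        // noisy sum / noisy count. For points in [0,1], moving one point
        // changes one bucket sum by at most 1 and one bucket count by 1.
        static double[] Iterate(double[] points, double[] centers, double epsilon)
        {
            double eps = epsilon / 2;               // half for sums, half for counts
            var sums = new double[centers.Length];
            var counts = new double[centers.Length];
            foreach (var p in points)
            {
                int nearest = Enumerable.Range(0, centers.Length)
                                        .OrderBy(i => Math.Abs(p - centers[i]))
                                        .First();
                sums[nearest] += p;
                counts[nearest] += 1;
            }
            return Enumerable.Range(0, centers.Length)
                             .Select(i => (sums[i] + Laplace(1 / eps))
                                        / Math.Max(1, counts[i] + Laplace(1 / eps)))
                             .ToArray();
        }

        static void Main()
        {
            var points  = new[] { 0.1, 0.15, 0.8, 0.85, 0.9 };
            var centers = new[] { 0.2, 0.7 };
            Console.WriteLine(string.Join(", ", Iterate(points, centers, 0.5)));
        }
    }

Each full iteration spends ε, so T iterations cost T·ε of the budget - which is exactly the “how to split the ε” question above.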
SLIDE 33

PINQ

  • LINQ, Language Integrated Queries, is an embedded language for queries in .NET languages.
  • Privacy Integrated Queries, PINQ [McSherry], adds a layer on top that automatically adds Laplacian noise to the result.

var data  = new PINQueryable<SearchRecord>( ... ... );
var users = from record in data
            where record.Query == argv[0]
            group record by record.IPAddress;
return users.NoisyCount(0.1);

SLIDE 34

PINQ

  • Sequential composition of differentially private computations is differentially private, with an overall ε equal to the sum of the components’ ε’s.
  • Parallel composition (over disjoint data) of differentially private computations is differentially private, with an overall ε equal to the maximum of the components’ ε’s (worked numbers after this list).
  • PINQ over-approximates heavily for some algorithms, e.g. where the privacy factor depends on control flow, but works well for many others - for example k-means.
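For example (numbers ours): two NoisyCount(0.1) queries over the same records compose sequentially to ε = 0.2, while the per-group counts produced by the groupby above touch disjoint records, so together they cost only max(0.1, 0.1) = 0.1.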

SLIDE 35

AIRAVAT

  • MapReduce computations, introduced by Google, are decomposed into two phases that can easily be distributed.
  • Airavat [Shmatikov et al.] implements differential privacy on top of MapReduce.
  • Mandatory access control and isolation allow untrusted mappers.

SLIDE 36

LINEAR TYPES

  • Pierce and Reed provided a linear type system that guarantees differential privacy.
  • Value types form metric spaces, so the sensitivity of operations can be inferred (conservatively).
  • Does not deal well with redundant computations, and over-estimates sensitivity due to control flow.
  • Works only for specific ways of adding random noise. Recent work by Barthe et al. (POPL 2012) aims to improve this.

SLIDE 37

NOT PERFECT, OF COURSE

  • Issuing and managing privacy budgets (to be spent on epsilons) is far from trivial. Not really a technical problem.
  • One may leak information through the use of the budget. E.g. PINQ issues an error when the budget is spent, providing a limited side channel.
  • Traditional covert channels, such as timing, are not addressed by PINQ/Airavat.
  • Proposed solutions: see “Differential Privacy Under Fire” [Haeberlen, Pierce, Narayan].

SLIDE 38

RELATIVES OF DP

Wait, there’s more?

SLIDE 39

ADVERSARIAL PRIVACY

  • Some analyses, e.g. of relational data such as connections in a social network, do not fit the DP model well.
  • Rastogi et al. propose a weaker notion, adversarial privacy, based on Bayesian inference about the knowledge of an adversary.
  • Knowledge is represented as a class of probability distributions, which can be updated based on observations.
  • DP is obtained as a special case for certain classes.

Pr[ t | O ] ≤ exp(ε) ⋅ Pr[ t ] + γ

SLIDE 40

QUANTIFIED INFO-FLOW

  • Information-flow control may choose to enforce limits on the amount of leaked information.
  • E.g. the number of bits leaked, discrete models of attacker knowledge, or probabilistic models.
  • Probabilistic models correspond somewhat with DP, but tend to apply to individual records rather than aggregates.
  • See e.g. work by Köpf et al. and Hicks (CSF 2011).

SLIDE 41

PAN-PRIVATE ALGORITHMS

  • Traditional DP algorithms have full access to the non-randomized data, and store it in memory.
  • Recent research by Dwork discusses pan-private algorithms, where even the internal state of the algorithm satisfies DP.
  • Example: What proportion of IP addresses have I seen on a particular network link? (Sketched below.)
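A sketch of the flavor of such an algorithm, in the spirit of randomized response (the bucket count and the ε/4 bias are our illustrative choices, not the exact construction from the pan-privacy paper). Every bit is random at all times, so a snapshot of memory reveals little about any single IP, yet the aggregate bias estimates the density:

    using System;

    class PanPrivateDensity
    {
        static readonly Random Rng = new Random();
        readonly bool[] bits;
        readonly double bias;   // how strongly a sighting pushes a bit toward 1

        public PanPrivateDensity(int buckets, double epsilon)
        {
            bias = epsilon / 4;                    // illustrative choice
            bits = new bool[buckets];
            for (int i = 0; i < buckets; i++)      // state starts uniformly random,
                bits[i] = Rng.NextDouble() < 0.5;  // carrying no information
        }

        public void See(string ip)
        {
            int i = (ip.GetHashCode() & 0x7fffffff) % bits.Length;
            bits[i] = Rng.NextDouble() < 0.5 + bias;   // redraw with a slight bias
        }

        // Seen buckets have E[bit] = 0.5 + bias, unseen ones 0.5; invert that.
        public double Estimate()
        {
            int ones = 0;
            foreach (var b in bits) if (b) ones++;
            return ((double)ones / bits.Length - 0.5) / bias;
        }

        static void Main()
        {
            var sketch = new PanPrivateDensity(buckets: 10000, epsilon: 0.5);
            for (int i = 0; i < 500; i++) sketch.See("10.0.0." + i);
            Console.WriteLine(sketch.Estimate());  // ≈ 0.05 in expectation
        }
    }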

SLIDE 42

TAKE HOME POINTS

  • Online methods for privacy in statistical databases have more room to avoid linkage attacks.
  • Differential Privacy comes from re-evaluating what makes database accesses private: whether I’m in or out does not matter.
  • DP has nice composition properties, and makes few assumptions about prior knowledge (aux. information).
  • DP is no panacea - it doesn’t always fit.