Differential Privacy and Applications. Marco Gaboardi, Boston University. (PowerPoint PPT presentation)



SLIDE 1

Marco Gaboardi

Boston University


Differential Privacy and Applications

SLIDE 2

Data

SLIDE 3

Private Queries?

SLIDE 4

Private Queries?

medical correlation?

query → answer

SLIDE 5

Private Queries?

query → answer

Does Joe have cancer?

SLIDE 6

Private Queries?

Does Joe have cancer?

SLIDE 7

Anonymization?

SLIDE 8

Anonymization?

medical correlation

query → answer

SLIDE 9

Anonymization?

query → answer

?!?

SLIDE 10

Anonymous Data

Attacks on Anonymization


(Narayanan, Shmatikov: Robust De-anonymization of Large Sparse Datasets. IEEE Symposium on Security and Privacy 2008)

Additional Data

correlations

SLIDE 11

A Possible Solution: randomization

SLIDE 12

Adding noise

Noise

SLIDE 13

Noise

query → answer + noise

medical correlation?

Adding noise

SLIDE 14

Noise

query → answer + noise

?!?

Adding noise

SLIDE 15

Noise

?!?

Adding noise

SLIDE 16

Data analyst

SLIDE 17

Privacy vs Utility

Utility Privacy

SLIDE 18

Differential privacy: understanding the mathematical and computational meaning of this trade-off.

[Dwork, McSherry, Nissim, Smith, TCC06]

SLIDE 19
  • US Census Bureau - OnTheMap, new releases in 2020
  • Google - RAPPOR tool for Chrome
  • Apple - typing statistics reports in devices
  • Facebook - social science data release
  • Uber / Amazon / Mozilla / Snapchat
  • Many startups

Some Official Users

SLIDE 20
  • Today: Fundamentals of reconstruction attacks and the definition of differential privacy.
  • Tuesday: Basic mechanisms and basic properties of differential privacy, and how to support them in programming languages.
  • Thursday: More advanced mechanisms and their verification.
  • Friday: Other models and applications.

The rest of the class

SLIDE 21

Today: Fundamentals of reconstruction attacks and the definition of differential privacy

SLIDE 22

Data

Statistics over Data

SLIDE 23

Is this data private?

[Table: a 15 × 15 binary matrix; rows I1 - I15 are individuals, columns D1 - D15 are diseases, and a 1 marks that the individual has that disease.]

SLIDE 24

How about if we also have this data?

ID / Name: I1 Alice, I2 Bob, I3 Cynthia, I4 Dan, I5 Eve, I6 Frank, I7 Guy, I8 Hannah, I9 Ivan, I10 Jon, I11 Ken, I12 Lou, I13 Mike, I14 Noa, I15 Omer

ID / Disease: D1 AMAN, D2 Behcet, D3 Celiac, D4 Dermatitis, D5 Evans synd., D6 Fibrosis, D7 Graves' dis., D8 Henoch-Schonlein, D9 IGA Neph., D10 Juv. Diabetes, D11 Kawasaki dis., D12 Lichen planus, D13 Myositis, D14 Narcolepsy, D15 Optic Neuritis

SLIDE 25

How about this?

D1 D2 D3 D4 D5 D6 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15
 6  7  6  7  8  7  9  7  5  5   1   6   6   5   7   5

ID / Disease: D1 AMAN, D2 Behcet, D3 Celiac, D4 Dermatitis, D5 Evans synd., D6 Fibrosis, D7 Graves' dis., D8 Henoch-Schonlein, D9 IGA Neph., D10 Juv. Diabetes, D11 Kawasaki dis., D12 Lichen planus, D13 Myositis, D14 Narcolepsy, D15 Optic Neuritis

SLIDE 26

The answers to these kinds of questions depend on the additional information we have available.

SLIDE 27
Database

  • We can think of a database as a list of records from some universe set: D ∈ Xn = DB.
  • Sometimes we will think of a database as a function, writing D[i] ∈ X for its i-th record,
  • and sometimes we will write its elements explicitly: (x1,…,xn) ∈ DB.

SLIDE 28

(Normalized) Counting Queries

  • A counting query q : Xn → [0,1] is a function counting the proportion of elements in a dataset satisfying a predicate q : X → {0,1}.

  • In symbols:

q(D) = (1/n) ∑i q(D[i])
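The definition above can be written directly as code. This is a minimal sketch; the function names and the toy dataset are my own, not from the slides:

```python
def counting_query(predicate, D):
    """Normalized counting query: the fraction of records in D
    satisfying the {0,1}-valued predicate."""
    return sum(1 for x in D if predicate(x)) / len(D)

# Toy dataset over the universe X = {0,1}^3.
D = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1)]
assert counting_query(lambda x: x == (1, 0, 1), D) == 0.5
```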

SLIDE 29

Example 1

Let's consider an arbitrary universe domain X and the following predicate for y ∈ X:

qy(x) = 1 if y = x, 0 otherwise

We call a point function the associated counting query qy : Xn → [0,1].

Question: Suppose that we answer all the point function queries for y ∈ X. What well-known statistic do we obtain?

SLIDE 30

Example 1

[Table: D ∈ X10 with X = {0,1}3, a 10 × 3 binary matrix with rows I1 - I10 and columns D1, D2, D3.]

q000(D) = .3   q001(D) = .1   q010(D) = .2   q011(D) = 0
q100(D) = 0    q101(D) = .3   q110(D) = .1   q111(D) = 0

[Bar chart: the point-query answers over 000 - 111.]

SLIDE 31

Example 1

Question: Suppose that we answer all the point function queries for y ∈ X. What well-known statistic do we obtain? Answer: The histogram of the database.
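Answering every point query indeed recovers the histogram; a small sketch (names and data are my own):

```python
from itertools import product

def point_query(y, D):
    # q_y(D): fraction of records exactly equal to y
    return sum(1 for x in D if x == y) / len(D)

# Answering all the point queries over X = {0,1}^3 yields the histogram of D.
D = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0)]
histogram = {y: point_query(y, D) for y in product([0, 1], repeat=3)}
assert histogram[(0, 0, 0)] == 0.4
assert abs(sum(histogram.values()) - 1.0) < 1e-9
```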

SLIDE 32

Example II

Let's consider an arbitrary ordered universe domain X and the following predicate for y ∈ X:

qy(x) = 1 if x ≤ y, 0 otherwise

We call a threshold function the associated counting query qy : Xn → [0,1].

Question: Suppose that we answer all the threshold function queries for y ∈ X. What well-known statistic do we obtain?

SLIDE 33

Example II

D ∈ X10 with X = {0,1}3, ordered by the corresponding binary encoding.

q000(D) = .3   q001(D) = .4   q010(D) = .6   q011(D) = .6
q100(D) = .6   q101(D) = .9   q110(D) = 1    q111(D) = 1

[Bar chart: the threshold-query answers over 000 - 111.]

[Table: the same 10 × 3 binary matrix D as in Example 1.]

SLIDE 34

Example II

Question: Suppose that we answer all the threshold function queries for y ∈ X. What well-known statistic do we obtain? Answer: The CDF of the database.
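The same construction in code, with thresholds instead of points (a sketch with my own names; the data is the toy dataset from the previous example):

```python
from itertools import product

def threshold_query(y, D):
    # q_y(D): fraction of records x with x <= y in the universe order
    return sum(1 for x in D if x <= y) / len(D)

# Over X = {0,1}^3 ordered by binary encoding (tuple comparison agrees with
# it for fixed-length 0/1 tuples), all threshold queries give the CDF of D.
D = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0)]
cdf = {y: threshold_query(y, D) for y in product([0, 1], repeat=3)}
assert cdf[(0, 0, 0)] == 0.4 and cdf[(1, 1, 1)] == 1.0
```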

SLIDE 35

Example III

Let's consider the universe domain X = {0,1}d and the following predicate for an index 1 ≤ j ≤ d:

qj(x) = x[j]

We call an attribute counting function the associated counting query qj : {0,1}n*d → [0,1].

Question: Which statistic corresponds to releasing all the attribute counting functions?

SLIDE 36

Example III

D ∈ X10 with X = {0,1}3

q1(D) = .4   q2(D) = .3   q3(D) = .4

[Table: the same 10 × 3 binary matrix D, shown with its column margins 4, 3, 4.]

SLIDE 37

Example III

Question: Which statistic corresponds to releasing all the attribute counting functions? Answer: The (1-way) marginals of the database.

SLIDE 38

Example IV

Let's consider the universe domain X = {0,1}d and let's consider

⃗v ∈ List[k]{1, ¬1, 2, ¬2, …, d, ¬d}

where qj(x) = xj and q¬j(x) = ¬xj. We call a conjunction or k-way marginal the associated counting query

q⃗v(x) = qv1(x) ∧ qv2(x) ∧ ⋯ ∧ qvk(x)

with q⃗v : {0,1}n*d → [0,1].

Question: Which statistic corresponds to releasing conjunctions?

SLIDE 39

Example IV

D ∈ X10 with X = {0,1}3

[Table: the same 10 × 3 binary matrix D.]

k = 2:
q12(D) = .1    q1¬2(D) = .3    q13(D) = .3    q1¬3(D) = .1
q¬12(D) = .2   q¬13(D) = .1    q¬1¬2(D) = .4   q¬1¬3(D) = .5

        D1     ¬D1
 D2     0.1    0.2
¬D2     0.3    0.4

SLIDE 40

Example IV

Question: Which statistic corresponds to releasing conjunctions? Answer: The contingency tables of the database.
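A conjunction query can be sketched as follows (my own encoding of literals; a pair (j, True) stands for the literal j and (j, False) for ¬j):

```python
from itertools import product

def conjunction_query(literals, D):
    """k-way marginal: fraction of records satisfying every literal.
    A literal (j, True) requires x[j] == 1; (j, False) requires x[j] == 0."""
    holds = lambda x: all(x[j] == int(pos) for j, pos in literals)
    return sum(1 for x in D if holds(x)) / len(D)

# The four 2-way marginals over attributes 0 and 1 form a contingency table.
D = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0)]
table = {(a, b): conjunction_query([(0, a), (1, b)], D)
         for a, b in product([True, False], repeat=2)}
assert table[(True, False)] == 0.4   # records with x[0]=1 and x[1]=0
```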

SLIDE 41

Linear Queries

  • A linear query q : Xn → [0,1] is a function averaging the value of a function q : X → [0,1] over the elements of the dataset.

  • In symbols:

q(D) = (1/n) ∑i q(D[i])

SLIDE 42

Sum queries

  • Let's denote by I ⊆ ℕ a subset of the rows of the dataset.

  • A sum query qI : List(X) → ℕ is defined as:

qI(D) = ∑i∈I D[i]

SLIDE 43

Example

[Table: the same 10 × 3 binary matrix D, here with X = List[3]{0,1}.]

q{1,2,3}(D) = (1,1,1)
q{1,2,4}(D) = (2,0,2)
q{5,8}(D) = (0,0,0)
q{2,4,7,10}(D) = (4,1,3)
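A sum query over tuple-valued records can be sketched as below (my own names; indices are 1-based to match the slides):

```python
def sum_query(I, D):
    """Sum query over a set of row indices I (1-based, as on the slides);
    records are tuples, so the sums are taken coordinate-wise."""
    totals = [0] * len(D[0])
    for i in I:
        for j, v in enumerate(D[i - 1]):
            totals[j] += v
    return tuple(totals)

D = [(0, 0, 0), (1, 0, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0)]
assert sum_query({1, 2, 3}, D) == (1, 1, 1)
assert sum_query({2, 4}, D) == (2, 0, 2)
```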

SLIDE 44

Question: Is releasing the result of sum (counting or linear) queries private?

SLIDE 45

Question: How can we make statistical queries private?

SLIDE 46

Private Statistical database

statistical query → answer + noise

Noise

Question: What kind of noise?

SLIDE 47

Additive Noise Perturbation

  • We say that M is a privacy mechanism obtained by adding noise if for every query q, M creates a new randomized query:

q*(D) = q(D) + Y

  • We say that a mechanism M adds noise within perturbation E iff for every q and every D:

|q*(D) - q(D)| ≤ E
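A minimal sketch of such a mechanism. The slides do not fix a noise distribution, only the perturbation bound; uniform noise on [-E, E] is one arbitrary choice that satisfies it, and the names are my own:

```python
import random

def add_noise_within(q, E):
    """Turn q into a randomized query q* with |q*(D) - q(D)| <= E.
    Uniform noise on [-E, E] is one arbitrary choice satisfying the bound."""
    return lambda D: q(D) + random.uniform(-E, E)

q = lambda D: sum(D) / len(D)          # a counting query over bits
q_star = add_noise_within(q, E=0.05)
D = [0, 1, 1, 0, 1]
assert all(abs(q_star(D) - q(D)) <= 0.05 for _ in range(100))
```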

SLIDE 48

Question: Does this approach protect privacy?

SLIDE 49

Noise

D

q1 q2 … qk

Reconstruction attack

D’

Attacker

SLIDE 50

D

Reconstruction attack

D’

We say that the attacker wins if d(D, D') ≈ 0.

In this case we can use the Hamming distance.

SLIDE 51

Blatant non-privacy

A privacy mechanism M is blatantly non-private if an adversary can build a candidate database D' that agrees with the private database D of size n in all but o(n) entries: dH(D,D') ∈ o(n).

SLIDE 52

Little o-notation

f(n) = o(g(n))  iff  limn→∞ f(n)/g(n) = 0

SLIDE 53

An exponential reconstruction attack

Let M : {0,1}n → R be a privacy mechanism adding noise within perturbation E. Then there is an adversary that can reconstruct the database within 4E positions.

[DinurNissim’02]

With E=n/401 we reconstruct 99% of the entries.

SLIDE 54

Proof

Query phase: For each subset of elements S, let aS* = qS*(D).

Rule-out phase: For each D' ∈ List[n]{0,1}: if there exists S such that |qS(D') - aS*| > E then rule out D'.

Output phase: Output a database D' that was not ruled out.

Notice that since for the real database we clearly have |qS(D) - qS*(D)| ≤ E, the procedure always returns a candidate output, in an exponential number of steps. We now want to show that dH(D,D') ≤ 4E.
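The three phases can be run literally on a small database. This is a sketch with my own names, exponential in n as the proof says, so only feasible for tiny n:

```python
import random
from itertools import combinations, product

def subset_query(S, D):
    # q_S(D): sum of the bits of D at the indices in S
    return sum(D[i] for i in S)

def reconstruct(answers, n, E):
    """Rule-out phase: return any candidate database that is within E of
    every noisy subset-sum answer (the real D is never ruled out)."""
    for cand in product([0, 1], repeat=n):
        if all(abs(subset_query(S, cand) - a) <= E for S, a in answers):
            return cand

random.seed(1)
n, E = 8, 1
D = tuple(random.randint(0, 1) for _ in range(n))
# Query phase: every subset, each answered with noise bounded by E.
subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
answers = [(S, subset_query(S, D) + random.uniform(-E, E)) for S in subsets]
# Output phase.
D_prime = reconstruct(answers, n, E)
hamming = sum(a != b for a, b in zip(D, D_prime))
assert hamming <= 4 * E   # the bound from the proof
```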

SLIDE 55

Proof

Let's consider D to be the real dataset and D' to be the output. Consider the sets of indices R = { i | D(i)=0 } and T = { i | D(i)=1 }. Since D' was not ruled out we have |qS*(D) - qS(D')| ≤ E, but by definition we also have |qS*(D) - qS(D)| ≤ E, so by the triangle inequality |qS(D) - qS(D')| ≤ 2E. Since qR(D) = 0, this means that on the indices in R the Hamming distance between D and D' is at most 2E. We can apply a similar reasoning to T. So overall D and D' differ in at most 4E positions.

SLIDE 56

Exponential reconstruction attack

Let M : {0,1}n → R be a privacy mechanism adding noise within E=o(n) perturbation. Then M is blatantly non-private against an adversary A asking an exponential number of queries.

[DinurNissim’02]

Question: Can we have a more realistic noise perturbation?

SLIDE 57

Sample error

  • Suppose that a database contains n individuals drawn uniformly at random from a population of size N >> n.

  • Suppose we are interested in a property that holds for a fraction p of the population.

  • Then we expect the number of individuals in the dataset with the property to be np ± Θ(√n).

  • The sampling error is of order √n.

We would like the noise we introduce for privacy to be smaller than (or at most as big as) the sampling error.
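The √n sampling error is easy to see empirically. A quick simulation (the parameters are my own choices, not from the slides):

```python
import random
import statistics

# With n individuals sampled from a large population in which the property
# holds with probability p, the count in the sample concentrates around n*p
# with standard deviation of order sqrt(n) (precisely sqrt(n*p*(1-p))).
random.seed(0)
n, p = 10_000, 0.3
counts = [sum(random.random() < p for _ in range(n)) for _ in range(200)]
assert abs(statistics.mean(counts) - n * p) < 20
assert 30 < statistics.stdev(counts) < 65   # sqrt(n*p*(1-p)) is about 46
```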

SLIDE 58

Polynomial reconstruction attack

Let M : {0,1}n → R be a privacy mechanism adding noise within perturbation E = o(√n). Then we can show M is blatantly non-private against an adversary A running in polynomial time and asking a number of queries linear in n.

[DworkYekhanin’08,DinurNissim’02]

SLIDE 59

Number of queries

A privacy mechanism can answer with perturbation √n at most a number of queries sublinear in n.

SLIDE 60

Fundamental Law of Information Reconstruction

The release of too many overly accurate statistics leads to privacy violations.

SLIDE 61

Privacy vs Utility

Utility Privacy

SLIDE 62

Quantitative notions of Privacy

  • The impossibility results discussed above suggest a quantitative notion of privacy,
  • a notion where the privacy loss depends on the number of queries that are allowed,
  • and on the accuracy with which we answer them.
SLIDE 63

Privacy-preserving data analysis?

  • The analyst knows no more about me after the analysis than what she knew before the analysis.

Noise

q1 q2

qk

SLIDE 64

Privacy-preserving data analysis?

Prior Knowledge ~ Posterior Knowledge

SLIDE 65

Question: What is the problem with this requirement?

Privacy-preserving data analysis?

SLIDE 66

Privacy-preserving data analysis?

If nothing can be learned about an individual, then nothing can be learned at all!

Utility

Privacy

[DworkNaor10]

SLIDE 67

Privacy-preserving data analysis?

  • The analyst learns the same about me after the analysis as what she would have learnt if I hadn't contributed my data.

Noise

q1 q2

qk

SLIDE 68

Adjacent databases

  • We can formalize the concept of contributing my data or not in terms of a notion of distance between datasets.

  • Given two datasets D, D' ∈ DB, their distance is defined as:

DΔD' = |{k ≤ n | D(k) ≠ D'(k)}|

  • We will call two datasets adjacent when DΔD' = 1, and we will write D ~ D'.

SLIDE 69

(ε,δ)-Differential Privacy

Definition. Given ε,δ ≥ 0, a probabilistic query Q: Xn→R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S] + δ

SLIDE 70

A query returning a probability distribution

(ε,δ)-Differential Privacy

Definition. Given ε,δ ≥ 0, a probabilistic query Q: Xn→R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S] + δ

SLIDE 71

Privacy parameters

(ε,δ)-Differential Privacy

Definition. Given ε,δ ≥ 0, a probabilistic query Q: Xn→R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S] + δ

SLIDE 72

a quantification over all the databases

(ε,δ)-Differential Privacy

Definition. Given ε,δ ≥ 0, a probabilistic query Q: Xn→R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S] + δ

SLIDE 73

a notion of adjacency or distance

(ε,δ)-Differential Privacy

Definition. Given ε,δ ≥ 0, a probabilistic query Q: Xn→R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S] + δ

SLIDE 74

and over all the possible outcomes

(ε,δ)-Differential Privacy

Definition. Given ε,δ ≥ 0, a probabilistic query Q: Xn→R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S] + δ

SLIDE 75

ε-Differential Privacy

Definition. Given ε ≥ 0, a probabilistic query Q: Xn → R is ε-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S]

SLIDE 76

ε-Differential Privacy

Let’s substitute a concrete instance:

Pr[Q(b∪{x})∈ S] ≤ exp(ε)Pr[Q(b∪{y})∈ S]

Definition. Given ε ≥ 0, a probabilistic query Q: Xn → R is ε-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S]

SLIDE 77

ε-Differential Privacy

Let's substitute a concrete instance:

Pr[Q(b∪{x})∈ S] ≤ exp(ε)·Pr[Q(b∪{y})∈ S]

Let's use the two quantifiers:

exp(-ε)·Pr[Q(b∪{y})∈ S] ≤ Pr[Q(b∪{x})∈ S] ≤ exp(ε)·Pr[Q(b∪{y})∈ S]

Definition. Given ε ≥ 0, a probabilistic query Q: Xn → R is ε-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S]

SLIDE 78

ε-Differential Privacy

Let's substitute a concrete instance:

Pr[Q(b∪{x})∈ S] ≤ exp(ε)·Pr[Q(b∪{y})∈ S]

Let's use the two quantifiers:

exp(-ε)·Pr[Q(b∪{y})∈ S] ≤ Pr[Q(b∪{x})∈ S] ≤ exp(ε)·Pr[Q(b∪{y})∈ S]

And for ε → 0:

(1-ε)·Pr[Q(b∪{y})∈ S] ≤ Pr[Q(b∪{x})∈ S] ≤ (1+ε)·Pr[Q(b∪{y})∈ S]

Definition. Given ε ≥ 0, a probabilistic query Q: Xn → R is ε-differentially private iff for all adjacent databases b1, b2 and for every S⊆R: Pr[Q(b1)∈ S] ≤ exp(ε)·Pr[Q(b2)∈ S]
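The definition can be checked exactly on a mechanism whose output probabilities we can enumerate. Randomized response on a single bit is a standard such mechanism (it is not covered on these slides; the function names are my own):

```python
import math

def rr_prob(true_bit, reported_bit, eps):
    """Output distribution of randomized response on one bit: report the
    true bit with probability e^eps / (1 + e^eps), flip it otherwise.
    This is a standard eps-DP mechanism, used here only as an illustration."""
    keep = math.exp(eps) / (1 + math.exp(eps))
    return keep if reported_bit == true_bit else 1 - keep

# Check Pr[Q(b1) = r] <= e^eps * Pr[Q(b2) = r] for the adjacent one-record
# databases b1 = {0}, b2 = {1}, for both outputs r and in both directions.
eps = math.log(3)   # keep probability 3/4
for r in (0, 1):
    assert rr_prob(0, r, eps) <= math.exp(eps) * rr_prob(1, r, eps) + 1e-12
    assert rr_prob(1, r, eps) <= math.exp(eps) * rr_prob(0, r, eps) + 1e-12
```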

SLIDE 79

Differential Privacy

In general we can think of the following quantity as the privacy loss incurred by observing r on the databases b and b':

Lb,b'(r) = log( Pr[Q(b)=r] / Pr[Q(b')=r] )

SLIDE 80

Differential Privacy

Q : DB → R probabilistic

Q(b∪{x})   Q(b∪{y})

SLIDE 81

Differential Privacy

d(Q(b∪{x}), Q(b∪{y})) ≤ ε   with probability 1-δ

SLIDE 82

(ε,δ)-Differential Privacy

-ε ≤ log( Pr[Q(b1)=r] / Pr[Q(b2)=r] ) ≤ ε   with probability 1-δ

SLIDE 83

The rest of the class

Understanding some basic methods to guarantee differential privacy and how they provide an answer for the privacy vs utility trade-off.

SLIDE 84

Summary

  • Statistical queries and databases,
  • Additive noise perturbation,
  • Reconstruction attack,
  • Fundamental Law of Information Reconstruction,
  • Privacy preserving data analysis
  • Differential privacy