Marco Gaboardi
Boston University
Data
Private Queries?
[Diagram: an analyst sends the query "medical correlation?" to the data curator and receives an answer.]
Private Queries?
[Diagram: the analyst asks "Does Joe have cancer?" and receives an answer.]
Anonymization?
[Diagram: the analyst queries "medical correlation" on anonymized data and receives an answer.]
Anonymization?
[Diagram: the analyst asks "Does Joe have cancer?" of the anonymous data (?!?).]
Attacks on Anonymization
(Narayanan, Shmatikov: Robust De-anonymization of Large Sparse Datasets, IEEE S&P 2008)
[Diagram: anonymized data combined with additional data reveals correlations that re-identify individuals.]
A Possible Solution: randomization
Adding noise
[Diagram: the curator answers the query "medical correlation?" with answer + noise.]
Adding noise
[Diagram: the analyst asks "Does Joe have cancer?"; the noisy answer reveals nothing useful (?!?).]
Adding noise
[Diagram: the data analyst interacts with the data only through noisy answers.]
Privacy vs Utility
[Diagram: a scale balancing utility against privacy.]
Differential privacy: understanding the mathematical and computational meaning of this trade-off.
[Dwork, McSherry, Nissim, Smith, TCC06]
Some Official Users
For example, the US Census Bureau adopted differential privacy for its 2020 Census data releases.
The rest of the class
- Today: fundamentals of reconstruction attacks and the definition of differential privacy.
- Basic methods to guarantee differential privacy and how to support them in programming languages.
- Verification.
Data
Is this data private?
[Table: a 15×15 binary matrix; rows I1–I15 (individuals), columns D1–D15 (diseases); a 1 marks that an individual has that disease.]
How about if we also have this data?
ID    Name        ID    Disease
I1    Alice       D1    AMAN
I2    Bob         D2    Behcet
I3    Cynthia     D3    Celiac
I4    Dan         D4    Dermatitis
I5    Eve         D5    Evans synd.
I6    Frank       D6    Fibrosis
I7    Guy         D7    Graves' dis.
I8    Hannah      D8    Henoch-Schonlein
I9    Ivan        D9    IGA Neph.
I10   Jon         D10   …
I11   Ken         D11   Kawasaki dis.
I12   Lou         D12   Lichen planus
I13   Mike        D13   Myositis
I14   Noa         D14   Narcolepsy
I15   Omer        D15   Optic Neuritis
How about this?
[Table: aggregate counts of individuals per disease, in the source's column order: D1: 6, D2: 7, D3: 6, D4: 7, D5: 8, D6: 7, D6: 9 [sic], D7: 7, D8: 5, D9: 5, D10: 1, D11: 6, D12: 6, D13: 5, D14: 7, D15: 5.]
The answers to these kinds of questions depend on the additional information we have available.
Database
A database D is a collection of n records from some universe set X:
D ∈ Xⁿ = DB,  D[i] ∈ X
We will often write its records explicitly: (x1, …, xn) ∈ DB.
(Normalized) Counting Queries
A (normalized) counting query is given by a predicate q : X → {0,1} and counts the proportion of elements in a dataset satisfying the predicate:
q(D) = (1/n) ∑ᵢ q(D[i])
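As a concrete illustration, here is a minimal Python sketch (the dataset and predicate below are invented for illustration):

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")

def counting_query(q: Callable[[X], int], D: Sequence[X]) -> float:
    """Normalized counting query: the fraction of records satisfying predicate q."""
    return sum(q(x) for x in D) / len(D)

# Example: the fraction of records equal to (1, 0, 1) in a toy dataset over {0,1}^3.
D = [(0, 0, 0), (1, 0, 1), (1, 0, 1), (0, 1, 0)]
print(counting_query(lambda x: int(x == (1, 0, 1)), D))  # 0.5
```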
Example 1
Let's consider an arbitrary universe domain X and, for y ∈ X, the following predicate:
qy(x) = 1 if y = x, and 0 otherwise.
We call the associated counting query qy : Xⁿ → [0,1] a point function.
Question: Suppose that we answer all the point function queries for y ∈ X. What well-known statistic do we obtain?
Example 1
D ∈ X¹⁰, X = {0,1}³
[Table: the example database D; rows I1–I10, attributes D1–D3.]
q000(D) = .3   q001(D) = .1   q010(D) = .2   q011(D) = 0
q100(D) = 0    q101(D) = .3   q110(D) = .1   q111(D) = 0
[Bar chart: the histogram of D over {0,1}³.]
Example 1
Question: Suppose that we answer all the point function queries for y ∈ X. What well-known statistic do we obtain? Answer: the histogram of the database.
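A minimal sketch of how point queries yield the histogram; the dataset below is one (assumed) arrangement consistent with the slide's query answers:

```python
from itertools import product

def point_query(y, D):
    # q_y(D): the fraction of records equal to y.
    return sum(x == y for x in D) / len(D)

# A 10-record dataset over {0,1}^3 consistent with the slide's answers.
D = [(0,0,0), (1,0,1), (0,1,0), (1,0,1), (0,0,0),
     (0,1,0), (1,1,0), (0,0,0), (0,0,1), (1,0,1)]

# Answering all point queries gives the histogram of the database.
histogram = {y: point_query(y, D) for y in product([0, 1], repeat=3)}
print(histogram[(0, 0, 0)], histogram[(1, 0, 1)])  # 0.3 0.3
```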
Example II
Let's consider an arbitrary ordered universe domain X and, for y ∈ X, the following predicate:
qy(x) = 1 if x ≤ y, and 0 otherwise.
We call the associated counting query qy : Xⁿ → [0,1] a threshold function.
Question: Suppose that we answer all the threshold function queries for y ∈ X. What well-known statistic do we obtain?
Example II
D ∈ X¹⁰, X = {0,1}³, with the order given by the corresponding binary encoding.
q000(D) = .3   q001(D) = .4   q010(D) = .6   q011(D) = .6
q100(D) = .6   q101(D) = .9   q110(D) = 1    q111(D) = 1
[Plot: the cumulative distribution of D over {0,1}³.]
[Table: the same example database D as before.]
Example II
Question: Suppose that we answer all the threshold function queries for y ∈ X. What well-known statistic do we obtain? Answer: the CDF of the database.
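The same assumed toy dataset, with threshold queries recovering the CDF (lexicographic order on bit tuples coincides with the binary-encoding order):

```python
from itertools import product

def threshold_query(y, D):
    # q_y(D): the fraction of records x with x <= y.
    return sum(x <= y for x in D) / len(D)

D = [(0,0,0), (1,0,1), (0,1,0), (1,0,1), (0,0,0),
     (0,1,0), (1,1,0), (0,0,0), (0,0,1), (1,0,1)]

# Answering all threshold queries gives the CDF of the database.
cdf = {y: threshold_query(y, D) for y in product([0, 1], repeat=3)}
print(cdf[(0, 1, 1)], cdf[(1, 0, 1)])  # 0.6 0.9
```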
Example III
Let's consider the universe domain X = {0,1}ᵈ and, for an index 1 ≤ j ≤ d, the following predicate:
qj(x) = x[j]
We call the associated counting query qj : ({0,1}ᵈ)ⁿ → [0,1] an attribute counting function.
Question: Which statistic corresponds to releasing all the attribute counting functions?
Example III
D ∈ X¹⁰, X = {0,1}³
q1(D) = .4   q2(D) = .3   q3(D) = .4
[Table: the example database D with its column sums (margins) 4, 3, 4.]
Example III
Question: Which statistic corresponds to releasing all the attribute counting functions? Answer: the (1-way) marginals of the database.
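A sketch of attribute counting queries recovering the 1-way marginals (same assumed toy dataset as above):

```python
def attribute_query(j, D):
    # q_j(D): the fraction of records whose j-th attribute is 1 (0-indexed here).
    return sum(x[j] for x in D) / len(D)

D = [(0,0,0), (1,0,1), (0,1,0), (1,0,1), (0,0,0),
     (0,1,0), (1,1,0), (0,0,0), (0,0,1), (1,0,1)]

print([attribute_query(j, D) for j in range(3)])  # [0.4, 0.3, 0.4]
```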
Example IV
Let's consider the universe domain X = {0,1}ᵈ and a vector of (possibly negated) attribute indices
v⃗ ∈ List[k]{1, /1, 2, /2, …, d, /d}
where qj(x) = xj and q/j(x) = ¬xj (we write /j for the negated attribute). Define the conjunction
qv⃗(x) = qv1(x) ∧ qv2(x) ∧ ⋯ ∧ qvk(x)
We call the associated counting query qv⃗ : ({0,1}ᵈ)ⁿ → [0,1] a conjunction or k-way marginal.
Question: Which statistic corresponds to releasing all the conjunctions?
Example IV
D ∈ X¹⁰, X = {0,1}³
[Table: the same example database D as before.]
k = 2:  q12(D) = .1,  q1/2(D) = .3,  q13(D) = .3,  q1/3(D) = .1,
        q/12(D) = .2,  q/13(D) = .1,  q/1/2(D) = .4,  q/1/3(D) = .5

         D1     /D1
  D2     0.1    0.2
  /D2    0.3    0.4
Example IV
Question: Which statistic corresponds to releasing all the conjunctions? Answer: the contingency tables of the database.
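A sketch of 2-way marginals reproducing the slide's contingency table (same assumed toy dataset; literals are (attribute, value) pairs):

```python
def conjunction_query(literals, D):
    # q_v(D): the fraction of records satisfying every literal (j, b), i.e. x[j] == b.
    return sum(all(x[j] == b for j, b in literals) for x in D) / len(D)

D = [(0,0,0), (1,0,1), (0,1,0), (1,0,1), (0,0,0),
     (0,1,0), (1,1,0), (0,0,0), (0,0,1), (1,0,1)]

# The 2x2 contingency table for attributes D1 and D2.
for b1 in (1, 0):
    for b2 in (1, 0):
        print(b1, b2, conjunction_query([(0, b1), (1, b2)], D))
# prints: 1 1 0.1 / 1 0 0.3 / 0 1 0.2 / 0 0 0.4
```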
Linear Queries
A linear query averages the value of a function q : X → [0,1] over the elements of the dataset:
q(D) = (1/n) ∑ᵢ q(D[i])
Sum queries
A sum query sums the records of the dataset whose indices belong to a subset I ⊆ {1, …, n}:
qI(D) = ∑_{i ∈ I} D[i]
Example
D ∈ X¹⁰, X = List[3]{0,1}
[Table: the example database D; rows I1–I10, attributes D1–D3.]
q{1,2,3}(D) = (1,1,1)    q{1,2,4}(D) = (2,0,2)
q{5,8}(D) = (0,0,0)      q{2,4,7,10}(D) = (4,1,3)
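A sketch of the sum queries above on the same assumed toy dataset (indices are 1-based, as on the slide):

```python
def sum_query(I, D):
    # q_I(D): component-wise sum of the records whose (1-based) index is in I.
    return tuple(sum(vals) for vals in zip(*(D[i - 1] for i in I)))

D = [(0,0,0), (1,0,1), (0,1,0), (1,0,1), (0,0,0),
     (0,1,0), (1,1,0), (0,0,0), (0,0,1), (1,0,1)]

print(sum_query({1, 2, 3}, D))      # (1, 1, 1)
print(sum_query({2, 4, 7, 10}, D))  # (4, 1, 3)
```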
Question: Is releasing the result private?
Question: How can we make statistical queries private?
Private Statistical database
[Diagram: the curator answers each statistical query with answer + noise.]
Question: What kind of noise?
Additive Noise Perturbation
A mechanism M adds noise if, for every query q, M creates a new randomized query
q*(D) = q(D) + Y
for some random variable Y. M is within perturbation E iff for every q and every D:
|q*(D) − q(D)| ≤ E
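One simple way to realize a mechanism within perturbation E is bounded uniform noise (a minimal sketch; the choice of uniform noise is an assumption, not from the slides):

```python
import random

def bounded_noise_mechanism(q, D, E):
    # Answer q(D) plus noise drawn uniformly from [-E, E], so that
    # |q*(D) - q(D)| <= E holds for every query q and every database D.
    return q(D) + random.uniform(-E, E)

# Example: a noisy average over a small bit database.
ans = bounded_noise_mechanism(lambda D: sum(D) / len(D), [0, 1, 1, 0, 1], E=0.05)
```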
Question: Does this approach protect privacy?
Reconstruction attack
[Diagram: an attacker asks queries q1, q2, …, qk and receives noisy answers.]
Reconstruction attack
We say that the attacker wins if the candidate database D' is close to the real database D. In this case we can use the Hamming distance.
Blatant non-privacy
A privacy mechanism M is blatantly non-private if an adversary can build a candidate database D' that agrees with the private database D of size n in all but o(n) entries: dH(D,D') ∈ o(n).
Little o-notation
f(n) = o(g(n)) iff lim_{n→∞} f(n)/g(n) = 0
An exponential reconstruction attack
Let M : {0,1}ⁿ → R be a privacy mechanism adding noise within perturbation E. Then there is an adversary that can reconstruct the database up to 4E positions.
[DinurNissim’02]
With E=n/401 we reconstruct 99% of the entries.
Proof
Query phase: for each subset of indices S, let aS* = qS*(D).
Rule-out phase: for each D' ∈ List[n]{0,1}: if there exists S such that |qS(D') − aS*| > E, then rule out D'.
Output phase: output a database D' that was not ruled out.
Notice that, since for the real database we clearly have |qS(D) − qS*(D)| ≤ E, the procedure is guaranteed to return a candidate output (in an exponential number of steps).
We now want to show that dH(D,D') ≤ 4E.
Proof
Let D be the real dataset and D' the outputted one. Consider the sets of indices R = { i | D(i)=0 } and T = { i | D(i)=1 }.
Since D' was not ruled out, we have |qS*(D) − qS(D')| ≤ E, but by definition we also have |qS*(D) − qS(D)| ≤ E, so by the triangle inequality |qS(D) − qS(D')| ≤ 2E for every S.
Since qR(D) = 0, on the indices in R the Hamming distance between D and D' is at most 2E. We can apply a similar reasoning to T. So overall D and D' differ in at most 4E positions.
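A toy demonstration of the attack for very small n (it enumerates all 2^n candidate databases, so it is exponential, as in the theorem; names and parameters are illustrative):

```python
import itertools
import random

def reconstruction_attack(answers, E, n):
    # Rule-out phase: keep any candidate D' that is consistent, within E,
    # with every noisy subset-sum answer; the real D always survives.
    for cand in itertools.product([0, 1], repeat=n):
        if all(abs(sum(cand[i] for i in S) - a) <= E for S, a in answers.items()):
            return cand  # differs from the real D in at most 4E positions

n, E = 8, 0.45
D = [random.randint(0, 1) for _ in range(n)]
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
# Query phase: a*_S = q_S(D) + noise with |noise| <= E, for every subset S.
answers = {S: sum(D[i] for i in S) + random.uniform(-E, E) for S in subsets}
print(D, reconstruction_attack(answers, E, n))
```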
Exponential reconstruction attack
Let M : {0,1}ⁿ → R be a privacy mechanism adding noise within perturbation E = o(n). Then M is blatantly non-private against an adversary asking an exponential number of queries.
[DinurNissim’02]
Question: Can we have a more realistic noise perturbation?
Sample error
Suppose the dataset is a sample of size n drawn uniformly at random from a population of size N ≫ n, and consider a condition satisfied by a fraction p of the population.
Then the number of records in the dataset satisfying the condition is np ± Θ(√n).
We would like the noise we introduce for privacy to be smaller than (or at most as big as) the sampling error.
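A quick empirical check of the √n scaling (a sketch; since N ≫ n, the sampling is approximated here by i.i.d. draws):

```python
import random

p, trials = 0.3, 2000
for n in (100, 400, 1600):
    # Standard deviation of the count of sampled records satisfying the condition.
    counts = [sum(random.random() < p for _ in range(n)) for _ in range(trials)]
    mean = sum(counts) / trials
    sd = (sum((c - mean) ** 2 for c in counts) / trials) ** 0.5
    print(n, round(sd, 1), round((n * p * (1 - p)) ** 0.5, 1))  # sd ≈ sqrt(np(1-p))
```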
Polynomial reconstruction attack
Let M : {0,1}ⁿ → R be a privacy mechanism adding noise within perturbation E = o(√n). Then we can show M blatantly non-private against an adversary running in polynomial time and asking a number of queries linear in n.
[DworkYekhanin’08,DinurNissim’02]
Number of queries
Consequently, a privacy mechanism with perturbation √n can answer at most a number of queries sublinear in n.
Fundamental Law of Information Reconstruction
The release of too many overly accurate statistics leads to privacy violations.
Privacy vs Utility
[Diagram: a scale balancing utility against privacy.]
Quantitative notions of Privacy
We want a quantitative notion of privacy, one that accounts for the perturbation of the answers and the number of queries that are allowed.
Privacy-preserving data analysis?
First attempt: the data analyst should know no more about an individual after the analysis than what she knew before the analysis.
[Diagram: the analyst asks queries q1, q2, …, qk and receives noisy answers.]
Privacy-preserving data analysis?
Prior Knowledge ~ Posterior Knowledge
Question: What is the problem with this requirement?
Privacy-preserving data analysis?
If nothing can be learned about an individual, then nothing can be learned at all: this requirement is incompatible with utility. [DworkNaor10]
Privacy-preserving data analysis?
Revised goal: the data analyst should learn (almost) the same from the analysis as what she would have learnt if I didn't contribute my data.
[Diagram: the analyst asks queries q1, q2, …, qk and receives noisy answers.]
Adjacent databases
We formalize the idea of contributing one's data or not in terms of a notion of distance between datasets. The distance between D, D' ∈ Xⁿ is defined as:
DΔD' = |{ k ≤ n | D(k) ≠ D'(k) }|
When DΔD' ≤ 1 we say that D and D' are adjacent, and we will write D~D'.
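A tiny sketch of this distance on list-represented databases:

```python
def distance(D1, D2):
    # Hamming-style distance: the number of positions where the records differ.
    return sum(a != b for a, b in zip(D1, D2))

# D ~ D' iff distance(D, D') <= 1
print(distance([0, 1, 1, 0], [0, 1, 0, 0]))  # 1, so the two databases are adjacent
```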
(ε,δ)-Differential Privacy
Definition. Given ε, δ ≥ 0, a probabilistic query Q: Xⁿ → R is (ε,δ)-differentially private iff for all adjacent databases b1, b2 and for every S ⊆ R:
Pr[Q(b1) ∈ S] ≤ exp(ε)·Pr[Q(b2) ∈ S] + δ
The ingredients of the definition:
- Q is a probabilistic query: it returns a probability distribution over R.
- ε and δ are the privacy parameters.
- There is a quantification over all the databases, via a notion of adjacency (distance).
- There is also a quantification over all the possible sets of outcomes S ⊆ R.
ε-Differential Privacy
Definition. Given ε ≥ 0, a probabilistic query Q: Xⁿ → R is ε-differentially private iff for all adjacent databases b1, b2 and for every S ⊆ R:
Pr[Q(b1) ∈ S] ≤ exp(ε)·Pr[Q(b2) ∈ S]
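As a first concrete mechanism satisfying this definition, here is a sketch of classic randomized response on a single bit (an illustration not taken from these slides; adjacency here means changing one individual's bit):

```python
import math
import random

def randomized_response(bit, eps):
    # Report the true bit with probability e^eps / (1 + e^eps), otherwise flip it.
    # For adjacent inputs (bit = 0 vs bit = 1), the probability of any output
    # changes by a factor of at most e^eps, so this is eps-differentially private.
    p_true = math.exp(eps) / (1 + math.exp(eps))
    return bit if random.random() < p_true else 1 - bit

print(randomized_response(1, eps=0.5))
```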
ε-Differential Privacy
Let's substitute a concrete instance:
Pr[Q(b∪{x}) ∈ S] ≤ exp(ε)·Pr[Q(b∪{y}) ∈ S]
Using the quantification over all adjacent pairs in both directions:
exp(−ε)·Pr[Q(b∪{y}) ∈ S] ≤ Pr[Q(b∪{x}) ∈ S] ≤ exp(ε)·Pr[Q(b∪{y}) ∈ S]
ε-Differential Privacy
And for small ε, since exp(±ε) ≈ 1 ± ε:
(1−ε)·Pr[Q(b∪{y}) ∈ S] ≤ Pr[Q(b∪{x}) ∈ S] ≤ (1+ε)·Pr[Q(b∪{y}) ∈ S]
Differential Privacy
In general, we can think of the following quantity as the privacy loss incurred by observing r on the databases b and b':
Lb,b'(r) = log( Pr[Q(b)=r] / Pr[Q(b')=r] )
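A worked check of the privacy loss for the randomized response sketch above, on the adjacent single-bit databases b = 0 and b' = 1:

```python
import math

eps = 0.5
p = math.exp(eps) / (1 + math.exp(eps))  # probability of reporting the true bit
for r in (0, 1):
    pr_b  = p if r == 0 else 1 - p       # Pr[Q(0) = r]
    pr_b2 = p if r == 1 else 1 - p       # Pr[Q(1) = r]
    print(r, math.log(pr_b / pr_b2))     # privacy loss: +eps or -eps
```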
[Diagram: a probabilistic query Q : DB → R maps the adjacent databases b∪{x} and b∪{y} to two nearby output distributions.]
Differential Privacy
d(Q(b∪{x}),Q(b∪{y}))≤ ε
Differential Privacy
With probability 1−δ over the observed output r:
| log( Pr[Q(b1)=r] / Pr[Q(b2)=r] ) | ≤ ε
This is, roughly, the guarantee of (ε,δ)-Differential Privacy.
The rest of the class
Understanding some basic methods to guarantee differential privacy and how they provide an answer to the privacy vs utility trade-off.
Summary