Differential Privacy
Privacy & Fairness in Data Science CS848 Fall 2019
Outline
– Problem
– Differential Privacy
– Basic Algorithms

Statistical Databases
[Figure: Individuals with sensitive data (Person 1 with record r1, ..., Person N with record rN) contribute to data collectors (e.g., Census DB, Hospital DB), whose data is used by data analysts: doctors, medical researchers, economists, machine learning researchers, ranking algorithms.]
[Figure: Persons 1..N send records r1..rN to a server holding DB; the analyst provides a function f to be computed on DB.]
The output can disclose sensitive information about individuals.
Requirement 1: The released output must provide privacy for individuals (controlled by a parameter ε).
Requirement 2: The released output must provide utility for the analyst.
The tension: individuals do not want the server to infer their records, while the server wants to compute f.
Solution: perturb the records (DB → DB*) to ensure privacy for individuals and utility for the server.
Statistical Databases in real-world applications
Application             | Data Collector | Private Information   | Analyst                  | Function (utility)
Medical                 | Hospital       | Disease               | Epidemiologist           | Correlation between disease and geography
Genome analysis         | Hospital       | Genome                | Statistician/Researcher  | Correlation between genome and disease
Advertising             | Google/FB      | Clicks/Browsing       | Advertiser               | Number of clicks on an ad by age/region/gender
Social Recommendations  | Facebook       | Friend links/profile  | Another user             | Recommend other users or ads based on the social network
Statistical Databases in real-world applications (continued)
In these applications the analyst is the data collector itself, which may not be trusted (or may not want the liability …).

Application       | Data Collector            | Private Information | Function (utility)
Location Services | Verizon/AT&T              | Location            | Traffic prediction
Recommendations   | Amazon/Google             | Purchase history    | Recommendation model
Traffic Shaping   | Internet Service Provider | Browsing history    | Traffic pattern of groups of users
Contrast with encryption: Alice sends a message to Bob such that Trudy (attacker) does not learn the message, while Bob gets the correct message.
Our setting is different: Bob (attacker) can access the database, yet should learn nothing sensitive about the individuals in the database.
Contrast with computing on encrypted data: Alice's data resides on a server controlled by Bob (attacker), who computes on it for Alice, without Bob learning anything about the data. In our setting, by contrast, the analyst is allowed to learn aggregate properties of the database.
Contrast with secure multiparty computation: parties jointly compute a function and learn no other information than what can be inferred from their private input and the answer; the computation itself reveals nothing else about the inputs.
Contrast with access control: different users are allowed to access (or not access) certain resources (could be files or records in a database). Aggregate statistics may still need to be disclosed over the records without allowing access to individual records.
Each of these tools addresses a different setting in which privacy problems arise:
– Encryption: when communicating data across an unsecure channel
– Secure multiparty computation: when different parties want to compute a function on their private data without using a centralized third party
– Computing on encrypted data: when one wants to use an unsecure cloud for computation
– Access control: when different users own different parts of the data
Our problem is different: quantifying (and bounding) the amount of information disclosed about individual records by the output of a valid computation.
First attempt at defining a privacy breach: a mechanism M(D) breaches privacy if it allows an unauthorized party to learn sensitive information about some individual in D that could not have been learnt without access to M(D).
Example: from the output, the analyst infers "Alice has Cancer." Is this a privacy breach? NO — not if the same inference could have been drawn even without Alice's record.
Refined definition: a mechanism M(D) breaches privacy if it allows an unauthorized party to learn sensitive information about an individual Alice in D that could not have been learnt, even with access to M(D), if Alice were not in the dataset.
Differential Privacy [Dwork ICALP 2006]
For every pair of inputs D1, D2 that differ in one row, and for every output O, the adversary should not be able to distinguish between D1 and D2 based on O.
ln( Pr[A(D1) = O] / Pr[A(D2) = O] ) ≤ ε,   ε > 0
D1 and D2 differ in one row: this simulates the presence or absence of a single record. The guarantee holds for every output O.
The guarantee holds no matter what the other records are, for every pair of inputs differing in one row and every output O.
[Figure: the set of all outputs O1, ..., Ok, with the probabilities P[A(D1) = Oi] and P[A(D2) = Oi] compared side by side.]
One should not be able to distinguish whether the input was D1 or D2, no matter what the output; ε bounds the worst discrepancy between the two output distributions.
Definition (ε-differential privacy): for every pair of inputs D1, D2 that differ in one row and every output O,
Pr[A(D1) = O] ≤ e^ε · Pr[A(D2) = O]
ε controls the degree to which D1 and D2 can be distinguished: the smaller the ε, the more the privacy (and the worse the utility).
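As a quick sanity check of the definition, the following sketch (Python; the one-bit mechanism and parameter names are illustrative, not from the slides) verifies Pr[A(D1) = O] ≤ e^ε · Pr[A(D2) = O] for a mechanism that reports a bit truthfully with probability p:

```python
import math

def output_dist(bit, p):
    """One-bit mechanism: report the true bit with probability p,
    the flipped bit with probability 1 - p."""
    return {bit: p, 1 - bit: 1 - p}

p = 0.75
eps = math.log(p / (1 - p))  # this mechanism is eps-DP with eps = ln(p/(1-p))
d1, d2 = output_dist(1, p), output_dist(0, p)  # neighboring one-record inputs

# Check the DP inequality for every output o, in both directions.
for o in (0, 1):
    assert d1[o] <= math.exp(eps) * d2[o] + 1e-12
    assert d2[o] <= math.exp(eps) * d1[o] + 1e-12
```

With p = 0.75 the worst-case ratio is exactly 3 = e^ε, so the bound is tight.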
1. Resilience to background knowledge
– A privacy mechanism must be able to protect individuals’ privacy from attackers who may possess background knowledge
2. Privacy without obscurity
– Attacker must be assumed to know the algorithm used as well as all parameters [MK15]
3. Post-processing
– Post-processing the output of a privacy mechanism must not change the privacy guarantee [KL10, MK15]
4. Composition over multiple releases
– Allow a graceful degradation of privacy with multiple invocations
Equivalent statement over sets of outputs: for every subset Ω of outputs and neighboring D1, D2,
Pr[A(D1) ∈ Ω] ≤ e^ε · Pr[A(D2) ∈ Ω]
More generally (group privacy), for any inputs X and Y,
Pr[A(X) ∈ Ω] ≤ e^(ε·d(X,Y)) · Pr[A(Y) ∈ Ω]
where d(X, Y) is the number of row additions and deletions needed to change X to Y.
A mechanism is a randomized function from the space of all inputs to the space of all outputs (with at least 2 distinct outputs).
Which algorithms fail differential privacy?
– A deterministic algorithm that maps each input to a distinct output does not satisfy differential privacy: some output has Pr > 0 under one input and Pr = 0 under its neighbor.
– More generally, any algorithm with Pr[D1 → O] > 0 but Pr[D2 → O] = 0 for some neighboring D1, D2 also does not satisfy differential privacy: Pr[D2 → O] = 0 implies the ratio Pr[D1 → O] / Pr[D2 → O] is unbounded.
Example input: a table with a sensitive attribute Disease (Y/N), e.g., D = (Y, Y, N, Y, N, N).
Randomized response [W 65]: each individual reports the true value with probability p, and the flipped value with probability 1 − p. For example, D = (Y, Y, N, Y, N, N) may be reported as O = (Y, N, N, N, Y, N).
Neighboring databases D, D' differ in the jth value:
– D[j] ≠ D'[j], but D[i] = D'[i] for all i ≠ j
Estimating the true fraction π of Y's from the reported answers O = (O1, ..., On):
π̂ = ( (1/n)·#{i : Oi = Y} − (1 − p) ) / (2p − 1)
E[π̂] = π (the estimator is unbiased), and
Var(π̂) = π(1 − π)/n + p(1 − p) / (n·(2p − 1)²)
where the first term is the sampling variance and the second is the variance due to the coin flips.
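A small simulation of randomized response and the unbiased estimator above (Python sketch; n, π, and p are illustrative values):

```python
import random

def randomized_response(bits, p, rng):
    """Each bit is reported truthfully w.p. p, flipped w.p. 1 - p."""
    return [b if rng.random() < p else 1 - b for b in bits]

def estimate_fraction(reports, p):
    """Unbiased estimator: pi_hat = (mean(reports) - (1 - p)) / (2p - 1)."""
    y_bar = sum(reports) / len(reports)
    return (y_bar - (1 - p)) / (2 * p - 1)

rng = random.Random(0)
n, true_pi, p = 100_000, 0.3, 0.75
bits = [1] * int(n * true_pi) + [0] * (n - int(n * true_pi))
pi_hat = estimate_fraction(randomized_response(bits, p, rng), p)
# pi_hat should be close to true_pi = 0.3 (std. error ~ 0.003 at these settings)
```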
Randomized response for larger domains (e.g., locations): with what probability should one report the true location, and with what probability each false location?
With a domain of size d, report the true value with probability p and each false value with probability q, where:
p + (d − 1)·q = 1,  p ≤ e^ε · q  ⟹  p = e^ε / (e^ε + d − 1),  q = 1 / (e^ε + d − 1)
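A sketch of generalized randomized response with these probabilities (Python; function and variable names are illustrative):

```python
import math
import random

def grr_probs(eps, d):
    """p for the true value, q for each of the other d - 1 values."""
    q = 1 / (math.exp(eps) + d - 1)
    return math.exp(eps) * q, q

def grr_report(value, domain, eps, rng):
    """Report the true value w.p. p, else a uniformly random other value."""
    p, _ = grr_probs(eps, len(domain))
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

p, q = grr_probs(1.0, 4)
assert abs(p + 3 * q - 1) < 1e-12      # probabilities sum to 1
assert p <= math.exp(1.0) * q + 1e-12  # the eps-DP constraint p <= e^eps * q
```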
Output perturbation: the researcher sends a query to the database; the server adds noise to the true answer. Requirements:
– Each answer does not leak too much information about the database.
– Noisy answers are close to the original answers.
[Figure: density of the Laplace distribution Lap(λ).]
The researcher sends query q; the server returns q(D) + η, where the noise η has density h(η) ∝ exp(−|η| / λ), i.e., η ~ Lap(λ), with mean 0 and variance 2λ². Privacy depends on the parameter λ.
Laplace mechanism [DMNS 06]
Sensitivity: for a query q: I → R, S(q) is the smallest number such that for any neighboring tables D, D', |q(D) − q(D')| ≤ S(q).
Thm: If the sensitivity of the query is S(q), then releasing q(D) + η with η ~ Lap(λ), λ = S(q)/ε, guarantees ε-differential privacy.
For a count query (sensitivity 1): release q(D) + η, where η is drawn from Lap(1/ε)
– Mean = 0
– Variance = 2/ε²
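A minimal sketch of the Laplace mechanism (Python; the stdlib has no Laplace sampler, so the noise is built as the difference of two exponentials, which is distributed as Lap(scale)):

```python
import random

def laplace_mechanism(true_answer, sensitivity, eps, rng):
    """Release true_answer + Lap(sensitivity/eps)."""
    scale = sensitivity / eps
    # Difference of two Exp(rate = 1/scale) variables is Laplace(scale).
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_answer + noise

# Count query over a toy table: sensitivity is 1.
db = ["Y", "Y", "N", "Y", "N", "N"]
true_count = sum(1 for r in db if r == "Y")
rng = random.Random(0)
noisy_count = laplace_mechanism(true_count, sensitivity=1, eps=1.0, rng=rng)
```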
Example: the count query "how many records have Disease = Y?" on D = (Y, Y, N, Y, N, N).
Utility: for a query q that returns a real number, the expected squared error is Var( Lap(S(q)/ε) ) = 2·S(q)² / ε².
Tail bound: for any t > 0,
Pr[ |A(D) − q(D)| > t·λ ] = Pr[ |η| > t·λ ] = 2·∫_{tλ}^{∞} (1/(2λ))·e^(−y/λ) dy = e^(−t)
Thm: Pr[ |A(D) − q(D)| > (S(q)/ε)·ln(1/δ) ] ≤ δ
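The tail bound can be checked empirically (Python sketch; λ, t, and the sample size are illustrative):

```python
import math
import random

rng = random.Random(42)
lam, t, n = 2.0, 1.5, 200_000

exceed = 0
for _ in range(n):
    # Lap(lam) as a difference of two exponentials.
    eta = rng.expovariate(1 / lam) - rng.expovariate(1 / lam)
    exceed += abs(eta) > t * lam

empirical = exceed / n
# |Lap(lam)| is Exp(rate 1/lam), so Pr[|eta| > t*lam] = exp(-t) exactly.
assert abs(empirical - math.exp(-t)) < 0.01
```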
Laplace Mechanism vs Randomized Response
Privacy: both provide an ε-differential privacy guarantee.
– Laplace mechanism requires the data collector to be trusted.
– Randomized response does not require the data collector to be trusted: records are perturbed before being collected.
– Also called a Local Algorithm, since each record is perturbed locally.
Laplace Mechanism vs Randomized Response
Utility: estimating how many records have disease = Y.
– Laplace mechanism: error O(1/ε).
– Randomized response: error O(√N/ε).
Basic algorithms:
– Randomized Response
– Laplace Mechanism
– Exponential Mechanism
When does adding numeric noise not make sense?
– Categorical outputs, e.g., “what is the most common nationality in this room”: Chinese/Indian/American…
– When one must ensure integrality/non-negativity of the output
Consider some function f (can be deterministic or probabilistic): How to construct a differentially private version of f?
Exponential Mechanism [MT 07]
[Figure: mapping from the space of Inputs to the space of Outputs.]
Scoring function w: Inputs × Outputs → R; its sensitivity is Δ = max over outputs O and neighboring D, D' of |w(D, O) − w(D', O)|, where D, D' differ in one tuple.
Given an input D and a scoring function w, randomly sample an output O from Outputs with probability proportional to exp( ε·w(D, O) / (2Δ) ).
The highest probability is given to the best (highest-scoring) output.
Example: “What is the most common nationality?” w(D, nationality) = # people in D having that nationality. The sensitivity of w is 1.
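A sketch of the exponential mechanism for this example (Python; the nationality counts are made up for illustration):

```python
import math
import random

def exponential_mechanism(scores, eps, sensitivity, rng):
    """Sample output o with probability proportional to exp(eps*score(o)/(2*sensitivity))."""
    m = max(scores.values())  # shift scores for numerical stability (same distribution)
    weights = {o: math.exp(eps * (s - m) / (2 * sensitivity))
               for o, s in scores.items()}
    r = rng.random() * sum(weights.values())
    for o, w in weights.items():
        r -= w
        if r <= 0:
            return o
    return o  # guard against floating-point round-off

# w(D, nationality) = number of people in D with that nationality (sensitivity 1).
counts = {"Chinese": 20, "Indian": 18, "American": 5, "Greek": 2}
rng = random.Random(1)
winner = exponential_mechanism(counts, eps=1.0, sensitivity=1, rng=rng)
```

High-count nationalities are exponentially more likely to be released; with these counts, Chinese and Indian receive almost all the probability mass.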
Theorem (utility): with probability at least 1 − e^(−t), the exponential mechanism outputs an O with w(D, O) ≥ OPT − 2Δ(ln|Outputs| + t)/ε.
Example: suppose there are 4 nationalities, Outputs = {Chinese, Indian, American, Greek}. The exponential mechanism outputs some nationality shared by at least K people with probability 1 − e^(−3) (≈ 0.95), where K ≥ OPT − 2(ln(4) + 3)/ε ≈ OPT − 8.8/ε.
The Laplace mechanism is a special case of the exponential mechanism: take w(D, O) = −|f(D) − O|. Then Δw ≤ maxD,D' |f(D) − f(D')| = sensitivity of f, and sampling O with probability proportional to exp( −ε·|f(D) − O| / (2Δ) ) is exactly adding Laplace noise with parameter 2Δ/ε.
Randomized response is also a special case: with binary outputs and w(D, O) = 1 if O equals the true value and 0 otherwise, sampling with probability proportional to exp( ε·w(D, O)/2 ) outputs the true value with probability e^(ε/2) / (e^(ε/2) + 1): Randomized Response with parameter ε/2.
Back to randomized response for larger domains: what score should be assigned to reporting the true location, and what score to reporting a false location?
Different scoring functions give different algorithms:
– Flat score: report true position: 1; report a false position: 0.
– Distance-based score: report true position (i,j): 0; report false position (x,y): −(|i−x| + |j−y|).
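Both scores can be plugged into the exponential mechanism; a sketch over a small grid (Python; the grid size, ε, and the sensitivity value are illustrative, and the sensitivity of the distance score would need to be set for the actual grid diameter):

```python
import math
import random

def exp_mech(true_pos, outputs, eps, sens, score, rng):
    """Sample an output with probability proportional to exp(eps*score/(2*sens))."""
    weights = [(o, math.exp(eps * score(true_pos, o) / (2 * sens))) for o in outputs]
    r = rng.random() * sum(w for _, w in weights)
    for o, w in weights:
        r -= w
        if r <= 0:
            return o
    return o  # guard against floating-point round-off

flat_score = lambda t, c: 1.0 if c == t else 0.0               # option 1
dist_score = lambda t, c: -(abs(t[0]-c[0]) + abs(t[1]-c[1]))   # option 2

grid = [(i, j) for i in range(5) for j in range(5)]
rng = random.Random(7)
reported = exp_mech((2, 2), grid, eps=2.0, sens=1.0, score=dist_score, rng=rng)
```

Under the distance-based score, nearby cells get exponentially more probability mass than distant ones, so the reported location tends to be geographically close to the true one.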
Summary: the exponential mechanism is useful when adding numeric noise to the output does not make sense. Exponentiate the scores and sample from the resulting distribution. Many differentially private algorithms can be cast as instances of the exponential mechanism by choosing the appropriate score function.
The error of the exponential mechanism grows only with log(|Outputs|):
– It can work well even if the output space is exponential in the input.
– However, sampling may not be computationally efficient if the output space is large.
Summary of basic algorithms:
– Randomized Response
– Laplace Mechanism
– Exponential Mechanism