Building complex DP algorithms using composition
Privacy & Fairness in Data Science, CS848 Fall 2019
Outline
- Recap
– Laplace Mechanism
- Composition Theorems
- Optimizing accuracy of DP algorithms
– Utilizing Parallel Composition
– Postprocessing & Inference
– Strategy Selection
– Data dependent noise
Differential Privacy
For every pair of inputs D1 and D2 that differ in one row, and for every output O, the adversary should not be able to distinguish between D1 and D2 based on O.
[Dwork ICALP 2006]
$\forall \Omega \in \mathrm{range}(A):\quad \ln \frac{\Pr[A(D_1) \in \Omega]}{\Pr[A(D_2) \in \Omega]} \le \varepsilon, \quad \varepsilon > 0$
Laplace mechanism
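[Figure: an analyst sends an aggregate query q (e.g., COUNT) to the private database D and receives a noisy answer.]
$\tilde{q}(D) = q(D) + \mathrm{Lap}\left(\frac{S(q)}{\varepsilon}\right)$
where S(q) is the sensitivity of q.
[Figure: density of the Laplace noise distribution, horizontal axis from -10 to 10.]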
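A minimal sketch of the Laplace mechanism for a COUNT query, using only Python's standard library. The helper names `laplace_noise` and `laplace_mechanism` and the example heights are illustrative assumptions, not part of the slides. The noise is drawn as the difference of two exponential samples, which follows a Laplace distribution with the given scale.

```python
import random

def laplace_noise(scale, rng=random):
    """Sample Lap(scale) as the difference of two Exp(rate = 1/scale) draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Return q(D) + Lap(S(q) / epsilon)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data: heights in inches. A COUNT query has sensitivity 1.
heights_in_inches = [74, 63, 69, 63, 79]
true_count = sum(1 for h in heights_in_inches if h >= 69)
noisy_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.5)
```

A smaller epsilon gives a larger noise scale S(q)/ε, so the answer is more private and less accurate.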
Sequential Composition
- If M1, M2, ..., Mk are algorithms that access a private database D such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = ε1 + ... + εk
[Figure: the private database D is accessed by M1 (budget ε1), which outputs M1(D), then by M2 (budget ε2), which outputs M2(D, M1(D)), and so on.]
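As a small illustration of sequential composition, the sketch below runs two noisy counts on the same data, where the second query is chosen after seeing the first output, and adds up the budgets. The data, thresholds, and function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(rows, predicate, epsilon, rng):
    """COUNT has sensitivity 1, so Lap(1/epsilon) noise gives epsilon-DP."""
    return sum(1 for r in rows if predicate(r)) + laplace_noise(1.0 / epsilon, rng)

def sequential_epsilon(epsilons):
    """Total privacy cost when every mechanism reads the same database."""
    return sum(epsilons)

rng = random.Random(0)
ages = [23, 35, 41, 58, 62, 19, 44]  # hypothetical private database D
eps1, eps2 = 0.3, 0.7

first = noisy_count(ages, lambda a: a >= 40, eps1, rng)   # M1(D)
threshold = 60 if first > 3 else 30                       # M2 may depend on M1(D)
second = noisy_count(ages, lambda a: a >= threshold, eps2, rng)

total_epsilon = sequential_epsilon([eps1, eps2])
```

Both queries touch the same rows, so the two budgets add up to the total cost.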
Parallel Composition
- If M1, M2, ..., Mk are algorithms that access disjoint databases D1, D2, …, Dk such that each Mi satisfies εi-differential privacy, then the combination of their outputs satisfies ε-differential privacy with ε = max(ε1, ..., εk)
[Figure: the private database is split into disjoint parts D1, D2, …; M1 (budget ε1) runs on D1 and outputs M1(D1), M2 (budget ε2) runs on D2 and outputs M2(D2), and so on.]
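A sketch of parallel composition under the same assumptions (hypothetical rows and helper names): the rows are partitioned by sex, each partition is queried with its own ε-DP count, and the overall cost is the maximum of the budgets rather than their sum.

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def parallel_epsilon(epsilons):
    """Total privacy cost when each mechanism reads a disjoint part of the data."""
    return max(epsilons)

# Hypothetical rows: (sex, height in inches).
rows = [("M", 74), ("F", 63), ("F", 69), ("M", 63), ("M", 79)]

# Every row lands in exactly one partition, so the partitions are disjoint.
partitions = {}
for sex, height in rows:
    partitions.setdefault(sex, []).append(height)

rng = random.Random(1)
epsilon = 0.5
noisy_sizes = {
    sex: len(part) + laplace_noise(1.0 / epsilon, rng)
    for sex, part in partitions.items()
}
total_epsilon = parallel_epsilon([epsilon] * len(partitions))
```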
Postprocessing
- If M is an ε-differentially private algorithm, any additional post-processing A ∘ M also satisfies ε-differential privacy.
[Figure: the private database D is accessed by M (budget ε), which outputs M(D); A is then applied to that output to give A(M(D)).]
Transformations & Stability
- A transformation V maps the private database D to a transformed database V(D). The transformation itself need not satisfy DP.
- $\sigma_V$: Stability of the transformation
– Maximum number of rows in V(D) that can change due to changing a single row in D
- Executing an ε-differentially private algorithm M on a transformation of a database V(D) satisfies $(\varepsilon \cdot \sigma_V)$-differential privacy.
[Figure: private database D, transformation V, transformed database V(D), then M (budget ε) producing M(V(D)).]
Transformations & Stability
- V1: For each row (x1, x2, x3) → (x1, x2+x3). Stability = 1
- V2: Each row in D is a tweet (id, {words}). For each row in D, generate k rows with the first k words {(id, word1), …, (id, wordk)}. Stability = k
- V3: Sample each row with probability p. Stability = 1 … but can prove 2pε-differential privacy*
*Adam Smith, Differential Privacy and Secrecy of the Sample
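The V2 example can be sketched as follows. The function names and the sample tweets are hypothetical, and the words are stored in a list (rather than a set) so that "first k words" is well defined.

```python
def first_k_words(tweets, k):
    """V2: emit up to k (id, word) rows per tweet, so one tweet affects at most k rows."""
    return [(tweet_id, word) for tweet_id, words in tweets for word in words[:k]]

def composed_epsilon(epsilon, stability):
    """Cost of running an epsilon-DP algorithm on V(D) when V has the given stability."""
    return epsilon * stability

# Hypothetical tweets: (id, list of words).
tweets = [(1, ["dp", "is", "fun", "really"]), (2, ["hello", "world"])]
k = 3
transformed = first_k_words(tweets, k)

# An algorithm that is 0.1-DP on the transformed rows costs 0.1 * k on the tweets.
cost = composed_epsilon(0.1, k)
```

Capping the output at k rows per tweet is what bounds the stability. Without the cap, one long tweet could change arbitrarily many rows of V(D).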
Problem
- Design an ε-differentially private algorithm that can answer all these questions.
- What is the total error?
Sex  Height  Weight
M    6’2”    210
F    5’3”    190
F    5’9”    160
M    5’3”    180
M    6’7”    250
Queries:
- # Males with BMI < 25
- # Males
- # Females with BMI < 25
- # Females
Algorithm 1
Return:
- (# Males with BMI < 25) + Lap(4/ε)
- (# Males) + Lap(4/ε)
- (# Females with BMI < 25) + Lap(4/ε)
- (# Females) + Lap(4/ε)
Privacy
- BMI can be computed by transforming each row (s, h, w) → (s, bmi). This is stability 1.
- Sensitivity of count = 1. So each query is answered using an ε/4-DP algorithm.
- By sequential composition, we get ε-DP.
Utility
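Error: $\sum_{q} \mathbb{E}\left[\left(\tilde{q}(D) - q(D)\right)^2\right]$, summed over the returned answers
Total Error: $2\left(\frac{4}{\varepsilon}\right)^2 \times 4 = \frac{128}{\varepsilon^2}$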
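Algorithm 1 and its total error can be sketched as below. Two assumptions are mine and not from the slides: heights are converted to inches and weights are read as pounds, and BMI uses the imperial formula 703 · weight / height². The error formula relies on the fact that Lap(b) has variance 2b².

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def bmi(height_in, weight_lb):
    """Imperial BMI formula; reading the table as inches and pounds is an assumption."""
    return 703.0 * weight_lb / height_in ** 2

# (sex, height in inches, weight in pounds) from the example table.
ROWS = [("M", 74, 210), ("F", 63, 190), ("F", 69, 160), ("M", 63, 180), ("M", 79, 250)]

def true_answers(rows):
    """Answers to the four queries, after the stability-1 map (s, h, w) -> (s, bmi)."""
    people = [(s, bmi(h, w)) for s, h, w in rows]
    return [
        sum(1 for s, b in people if s == "M" and b < 25),
        sum(1 for s, b in people if s == "M"),
        sum(1 for s, b in people if s == "F" and b < 25),
        sum(1 for s, b in people if s == "F"),
    ]

def algorithm_1(rows, epsilon, rng):
    """Each of the 4 counts gets budget epsilon/4, i.e. Lap(4/epsilon) noise."""
    return [a + laplace_noise(4.0 / epsilon, rng) for a in true_answers(rows)]

def algorithm_1_total_error(epsilon):
    """Four answers, each with variance 2 * (4/epsilon)^2."""
    return 4 * 2 * (4.0 / epsilon) ** 2

answers = algorithm_1(ROWS, 1.0, random.Random(0))
```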
Algorithm 2
Compute:
- $\tilde{q}_1$ = (# Males with BMI < 25) + Lap(1/ε)
- $\tilde{q}_2$ = (# Males with BMI > 25) + Lap(1/ε)
- $\tilde{q}_3$ = (# Females with BMI < 25) + Lap(1/ε)
- $\tilde{q}_4$ = (# Females with BMI > 25) + Lap(1/ε)
Return:
- $\tilde{q}_1$, $\tilde{q}_1 + \tilde{q}_2$, $\tilde{q}_3$, $\tilde{q}_3 + \tilde{q}_4$
Privacy
- Sensitivity of count = 1. So each query is answered using an ε-DP algorithm.
- $q_1, q_2, q_3, q_4$ are counts on disjoint portions of the database. Thus by parallel composition releasing $\tilde{q}_1, \tilde{q}_2, \tilde{q}_3, \tilde{q}_4$ satisfies ε-DP.
- By the postprocessing theorem, releasing $\tilde{q}_1$, $\tilde{q}_1 + \tilde{q}_2$, $\tilde{q}_3$, $\tilde{q}_3 + \tilde{q}_4$ also satisfies ε-DP.
Utility
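Error: $\sum_{q} \mathbb{E}\left[\left(\tilde{q}(D) - q(D)\right)^2\right]$, summed over the returned answers
Total Error: $\underbrace{2\left(\frac{1}{\varepsilon}\right)^2}_{\tilde{q}_1} + \underbrace{2 \cdot 2\left(\frac{1}{\varepsilon}\right)^2}_{\tilde{q}_1 + \tilde{q}_2} + \underbrace{2\left(\frac{1}{\varepsilon}\right)^2}_{\tilde{q}_3} + \underbrace{2 \cdot 2\left(\frac{1}{\varepsilon}\right)^2}_{\tilde{q}_3 + \tilde{q}_4} = \frac{12}{\varepsilon^2}$
Tighter privacy analysis gives better accuracy for the same level of privacy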
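A sketch of Algorithm 2 with a Monte Carlo check of the 12/ε² figure. The function names, the example counts, and the number of trials are illustrative choices. Each noisy count has variance 2/ε², and a sum of two independent noisy counts has variance 2 · 2/ε².

```python
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def algorithm_2(disjoint_counts, epsilon, rng):
    """disjoint_counts = (q1, q2, q3, q4): counts over disjoint groups of rows."""
    n1, n2, n3, n4 = (c + laplace_noise(1.0 / epsilon, rng) for c in disjoint_counts)
    # Postprocessing: the sums are computed from noisy values only.
    return [n1, n1 + n2, n3, n3 + n4]

def algorithm_2_total_error(epsilon):
    """Analytic total squared error: (1 + 2 + 1 + 2) * 2 / epsilon^2."""
    var = 2 * (1.0 / epsilon) ** 2
    return var + 2 * var + var + 2 * var

def empirical_total_error(disjoint_counts, epsilon, trials, rng):
    """Average total squared error of the four released answers over many runs."""
    q1, q2, q3, q4 = disjoint_counts
    truth = [q1, q1 + q2, q3, q3 + q4]
    total = 0.0
    for _ in range(trials):
        noisy = algorithm_2(disjoint_counts, epsilon, rng)
        total += sum((n - t) ** 2 for n, t in zip(noisy, truth))
    return total / trials

counts = (0, 3, 1, 1)  # hypothetical true values of q1..q4
estimate = empirical_total_error(counts, 1.0, 20000, random.Random(0))
```

With ε = 1 the estimate should land close to 12, compared with 128 for four sequentially composed Lap(4/ε) answers.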
Generalized Sensitivity
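- Let f be a function on databases that outputs a vector of d real numbers, i.e., with values in $\mathbb{R}^d$. The sensitivity of f is given by:
$S(f) = \max_{D, D' :\ |D \Delta D'| = 1} \left\| f(D) - f(D') \right\|_1$
where $\|\mathbf{x} - \mathbf{y}\|_1 = \sum_i |x_i - y_i|$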
Generalized Sensitivity
- $q_1$ = # Males with BMI < 25
- $q_2$ = # Males with BMI > 25
- $q$ = # Males (with any BMI)
- Let f1 be a function that answers both $q_1$, $q_2$
- Let f2 be a function that answers both $q_1$, $q$
- Sensitivity of f1 = 1
- Sensitivity of f2 = 2
- An alternate privacy proof for Alg 2 is to show that the generalized sensitivity of the function answering $q_1, q_2, q_3, q_4$ is 1.
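The sensitivities of f1 and f2 can be checked by brute force on small databases. This is an illustration rather than a proof: the row encoding (sex plus a "low"/"high" BMI class) and the search bound `max_size` are assumptions made for the sketch.

```python
from itertools import product

# Each row is (sex, bmi_class); "low" stands for BMI < 25, "high" for BMI > 25.
ROW_TYPES = [("M", "low"), ("M", "high"), ("F", "low"), ("F", "high")]

def count(db, sex, bmi_class=None):
    """Number of rows with the given sex and, optionally, the given BMI class."""
    return sum(1 for s, b in db if s == sex and (bmi_class is None or b == bmi_class))

def f1(db):
    """Answers q1 and q2."""
    return (count(db, "M", "low"), count(db, "M", "high"))

def f2(db):
    """Answers q1 and q."""
    return (count(db, "M", "low"), count(db, "M"))

def f_alg2(db):
    """Answers q1, q2, q3, q4."""
    return (count(db, "M", "low"), count(db, "M", "high"),
            count(db, "F", "low"), count(db, "F", "high"))

def l1_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def sensitivity(f, max_size=3):
    """Max L1 change of f over small D and neighbors D' = D plus one row."""
    worst = 0
    for size in range(max_size + 1):
        for db in product(ROW_TYPES, repeat=size):
            for extra in ROW_TYPES:
                worst = max(worst, l1_distance(f(db), f(db + (extra,))))
    return worst
```

A male with BMI < 25 changes both coordinates of f2 but only one coordinate of f1, which is why the two sensitivities differ.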
Improving utility of Alg 2
Compute:
- $\tilde{q}_1$ = (# Males with BMI < 25) + Lap(1/ε)
- $\tilde{q}_2$ = (# Males with BMI > 25) + Lap(1/ε)
Return:
- $\tilde{q}_1$, $\tilde{q}_1 + \tilde{q}_2$
We know $q_1 \le q_1 + q_2$, but $P[\tilde{q}_1 > \tilde{q}_1 + \tilde{q}_2] > 0$
Constrained Inference
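[Figure: a three-step workflow (Step 1, Step 2, Step 3) between a data owner and an analyst. The data owner holds the private data I behind a differentially private interface; the true answer is Q(I) = q, the interface releases a noisy answer $\tilde{q}$, and constrained inference is then applied to it.]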
- Let $q_1, q_2, \dots, q_k$ be a set of queries
- Let $\tilde{q}_1, \tilde{q}_2, \dots, \tilde{q}_k$ be the noisy answers
- Constraint $C(q_1, q_2, \dots, q_k) = 1$ holds on true answers (for all typical databases), but need not hold on noisy answers.
- Goal: Find $\bar{q}_1, \bar{q}_2, \dots, \bar{q}_k$ that are:
– Close to $\tilde{q}_1, \tilde{q}_2, \dots, \tilde{q}_k$
– Satisfy the constraint $C(\bar{q}_1, \bar{q}_2, \dots, \bar{q}_k)$
Least Squares Optimization
$\min \sum_i \left(\tilde{q}_i - \bar{q}_i\right)^2 \quad \text{s.t.} \quad C(\bar{q}_1, \bar{q}_2, \dots, \bar{q}_k)$
Geometric Interpretation
[Figure: the noisy vector $\tilde{\mathbf{q}} = (\tilde{q}_1, \tilde{q}_2, \dots, \tilde{q}_k)$ is the true vector $\mathbf{q} = (q_1, q_2, \dots, q_k)$ plus noise. The least squares solution $\bar{\mathbf{q}} = (\bar{q}_1, \bar{q}_2, \dots, \bar{q}_k)$ is the projection of $\tilde{\mathbf{q}}$ onto the space of outputs satisfying the constraint.]
Theorem: $\|\mathbf{q} - \bar{\mathbf{q}}\|_2 \le \|\mathbf{q} - \tilde{\mathbf{q}}\|_2$ when the constraints form a convex space
Ordering Constraint
Isotonic Regression: $\min \sum_i \left(\tilde{q}_i - \bar{q}_i\right)^2 \quad \text{s.t.} \quad \bar{q}_1 \le \bar{q}_2 \le \dots \le \bar{q}_k$
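One standard way to solve this ordering-constrained least squares problem is the pool-adjacent-violators algorithm. The slides do not name a solver, so the sketch below is one possible choice, and the example values are hypothetical.

```python
def isotonic_regression(noisy):
    """Least squares fit of a non-decreasing sequence (pool adjacent violators)."""
    blocks = []  # each block is [sum of values, number of values]
    for value in noisy:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means violate the ordering.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, size = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += size
    fitted = []
    for total, size in blocks:
        fitted.extend([total / size] * size)
    return fitted

def squared_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

# Hypothetical example: true answers (q1, q1 + q2) = (2, 5) are ordered,
# but the noisy answers came out in the wrong order.
truth = [2.0, 5.0]
noisy = [4.0, 3.0]
fitted = isotonic_regression(noisy)
```

The fitted values depend only on the noisy answers, so this step is postprocessing and costs no extra privacy budget. In this example the fit is closer to the truth than the raw noisy answers, in line with the projection theorem.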
Problem
- Design an ε-differentially private algorithm that can answer all range queries.
- What is the total error?
Sex  Height  Weight
M    6’2”    210
F    5’3”    190
F    5’9”    160
M    5’3”    180
M    6’7”    250
Queries:
- # people with height in [5’1”, 6’2”]
- # people with height in [2’0”, 4’0”]
- # people with height in [3’3”, 7’0”]
- …
Problem
- Let {v1, …, vk} be the domain of an attribute
- Let {x1, …, xk} be the number of rows with values v1, …, vk
- Range Query: $q_{ij} = x_i + x_{i+1} + \dots + x_j$
- Goal: Answer all range queries
Strategy 1:
- Answer all range queries using Laplace mechanism
- Sensitivity: $O(k^2)$
- Error in each query: $O(k^4/\varepsilon^2)$
- Total Error over all $O(k^2)$ range queries: $O(k^6/\varepsilon^2)$
Strategy 2:
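- Estimate each individual $x_i$ using Laplace mechanism
- Answer: $\tilde{q}_{ij} = \tilde{x}_i + \tilde{x}_{i+1} + \dots + \tilde{x}_j$
- Error in each $\tilde{x}_i$: $O(1/\varepsilon^2)$
- Error in $\tilde{q}_{1k}$: $O(k/\varepsilon^2)$
- Total Error: $O(k^3/\varepsilon^2)$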
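To compare the first two strategies on a concrete domain size, the sketch below computes their exact total squared error over all k(k+1)/2 range queries. The function names are hypothetical. For Strategy 1 it uses the generalized sensitivity of the full set of range queries: a row with value v_i is counted by i · (k − i + 1) ranges, and each answer gets Laplace noise with variance 2(S/ε)².

```python
def all_ranges(k):
    """All index pairs (i, j) with 1 <= i <= j <= k."""
    return [(i, j) for i in range(1, k + 1) for j in range(i, k + 1)]

def strategy_1_total_error(k, epsilon):
    """Answer every range directly, with noise scaled to the joint sensitivity."""
    sensitivity = max(i * (k - i + 1) for i in range(1, k + 1))
    per_query = 2 * (sensitivity / epsilon) ** 2
    return len(all_ranges(k)) * per_query

def strategy_2_total_error(k, epsilon):
    """Noisy histogram (sensitivity 1); a range of length L sums L noisy cells."""
    per_cell = 2 * (1.0 / epsilon) ** 2
    return sum((j - i + 1) * per_cell for i, j in all_ranges(k))

k, epsilon = 8, 1.0
error_1 = strategy_1_total_error(k, epsilon)
error_2 = strategy_2_total_error(k, epsilon)
```

Even at k = 8 the noisy histogram is far more accurate in total, because its sensitivity does not grow with k.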
Strategy 3: Hierarchy
- Estimate all the counts in the tree below using Laplace mechanism
[Figure: binary tree over the counts. Leaves: x1, x2, …, x8. Next level: x12, x34, x56, x78. Then: x1234, x5678. Root: x1-8. Each internal node is the sum of the leaves below it, e.g., x5678 = x5 + x6 + x7 + x8.]
- Sensitivity: $\log k$
- Every range query can be answered by summing up at most $2 \log k$ nodes in the tree.
- Error in each node: $O((\log k)^2/\varepsilon^2)$
- Max error on a range query: $O((\log k)^3/\varepsilon^2)$
- Total Error: $O(k^2 (\log k)^3/\varepsilon^2)$
- Error can be further reduced using constrained inference
– Here the constraint is that parent counts should not be smaller than child counts.
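A minimal sketch of the hierarchical strategy, assuming k is a power of two. The function names and the example histogram are hypothetical. The noise scale uses the number of tree levels, log2(k) + 1, as the sensitivity, since a row is counted once per level including the root; this is my reading, while the slide states the sensitivity as log k.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def decompose(lo, hi, i, j):
    """Tree nodes (lo, hi) whose disjoint union is the range [i, j] (1-indexed)."""
    if j < lo or hi < i:
        return []
    if i <= lo and hi <= j:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return decompose(lo, mid, i, j) + decompose(mid + 1, hi, i, j)

def build_noisy_tree(x, epsilon, rng):
    """Noisy count for every node of a binary tree over x; len(x) is a power of 2."""
    k = len(x)
    levels = int(math.log2(k)) + 1  # assumption: root level included
    scale = levels / epsilon
    tree = {}

    def visit(lo, hi):
        tree[(lo, hi)] = sum(x[lo - 1:hi]) + laplace_noise(scale, rng)
        if lo < hi:
            mid = (lo + hi) // 2
            visit(lo, mid)
            visit(mid + 1, hi)

    visit(1, k)
    return tree

def answer_range(tree, k, i, j):
    """Answer q_ij by summing the noisy nodes that cover [i, j]."""
    return sum(tree[node] for node in decompose(1, k, i, j))

x = [3, 1, 4, 1, 5, 9, 2, 6]  # hypothetical histogram x1..x8
tree = build_noisy_tree(x, epsilon=1.0, rng=random.Random(0))
noisy_answer = answer_range(tree, len(x), 2, 7)
```

For example, the range [2, 7] is covered by the four nodes x2, x34, x56, and x7, so only four noise terms enter the answer instead of six.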
Strategy based mechanisms
- Can think of nodes in the tree as coefficients.
- Other algorithms use other transformations
– Wavelets, Fourier coefficients
- Should be able to losslessly reconstruct the original data/query answers.
- General Idea:
– Apply transform
– Add noise to the transformed space (based on sensitivity)
– Reconstruct original data/query answers from noisy coefficients
[Figure: pipeline with the labels Private Data, Transform, Coefficients, Noise, Noisy Coefficients, Reconstruct, Original Data.]
Data dependent noise mechanisms
[Figure: pipeline with the labels Private Data, Transform, Coefficients, Noise, Noisy Coefficients, Reconstruct, Original Data.]
- Transformation can be lossy
- Reconstruction is non-unique
[LHMY14] Li et al. A data- and workload-aware algorithm for range queries under differential privacy. In PVLDB, 2014.
Data dependent noise mechanisms
- Use a data dependent sensitivity measure called Smooth sensitivity.
K. Nissim, S. Raskhodnikova, A. Smith, “Smooth Sensitivity and sampling in private data analysis”, STOC 2007
Summary
- Composition theorems help build complex algorithms using simple building blocks
– Sequential composition
– Parallel composition
– Postprocessing
– There are more advanced forms of composition.
- For the same privacy budget, a better designed algorithm can extract more utility
– When possible use parallel composition
– Inference on constraints between queries can reduce error
– Answering a different strategy of queries can help reduce error