Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform
Mathias Lécuyer, with Riley Spahn, Kiran Vodrahalli, Roxana Geambasu, and Daniel Hsu
Machine Learning (ML) introduces a dangerous double standard for data protection

Example: messaging app
[Diagram: a growing database of messages, likes, and clicks feeds two paths. Traditional code accesses a user's messages through an API, per access-control restrictions. An ML platform (e.g. TFX) trains recommendation, auto-complete, and ad-targeting models, whose models and/or predictions are based on everyone's messages, likes, and clicks.]

ML should only capture general trends from the data, but it often captures specific information about individual entries in the dataset:
- Language models over users' emails leak secrets. (Carlini+ '18)
- Recommenders leak information across users. (Calandrino+ '11)
- Membership in a training set can be inferred through prediction APIs. (Shokri+ '17)

- Making individual training algorithms Differentially Private (DP) is good but insufficient, because old data is reused many times.
- No system exists for managing multiple DP training algorithms to enforce a global DP guarantee.
Can we make Differential Privacy practical for ML applications?
Sage:
- Enforces a global (εg, δg)-DP guarantee across all models ever released from a growing database.
- Tackles in practical ways two difficult DP challenges: 1. "running out of budget" and 2. the "privacy-utility tradeoff."

[Diagram: Sage access control mediates between the growing database (messages, likes, clicks...) and the ML platform (e.g. TFX), enforcing the global (εg, δg)-DP guarantee across all released (ε, δ)-DP models.]
Outline
- Motivation
- Differential Privacy
- Two practical challenges
- Sage design
- Evaluation
Differential Privacy (DP) (Dwork+ '06)
- Developed to allow privacy-preserving statistical analyses on sensitive datasets (e.g., census, drug purchases, …).
- The first (and only) rigorous definition of privacy suitable for this use case.
Definition
- A randomized computation f: D → O is (ε, δ)-DP if, for any pair of datasets D and D' differing in one entry, and for any output set S ⊆ O:

  P(f(D) ∈ S) ≤ e^ε · P(f(D') ∈ S) + δ

- DP is a stability constraint on computations running on datasets: it requires that no single data point in an input dataset has a significant influence on the output.
- To achieve stability, randomness is added into the computation.
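To make the definition concrete, here is a minimal sketch (illustrative, not Sage code) of the classic Laplace mechanism: a counting query has sensitivity 1 (changing one entry changes the count by at most 1), so adding Laplace(1/ε) noise satisfies the inequality above with δ = 0.

```python
import numpy as np

def dp_count(data, predicate, epsilon, rng):
    """(epsilon, 0)-DP count of the records matching `predicate`.

    Changing one entry changes the true count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon makes the output (epsilon, 0)-DP.
    """
    true_count = sum(1 for x in data if predicate(x))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> more noise -> more privacy, less utility.
rng = np.random.default_rng(0)
noisy = dp_count([0, 1, 1, 0, 1], lambda x: x == 1, epsilon=0.1, rng=rng)
```

The noisy answer is unbiased, so repeated runs average out to the true count of 3, while any single release hides whether a given user clicked.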
DP in ML
- Approach: make training algorithms DP.
- This prevents membership inference and reconstruction attacks (Steinke-Ullman '14; Dwork+ '15; Carlini+ '18).
- DP versions exist for most ML training algorithms:
  - Stochastic gradient descent (SGD) (Abadi+ '16, Yu+ '19).
  - Various regressions (Chaudhuri+ '08, Kifer+ '12, Nikolaenko+ '13, Talwar+ '15).
  - Collaborative filtering (McSherry+ '09).
  - Language models (McMahan+ '18).
  - Feature and model selection (Chaudhuri+ '13, Smith+ '13).
  - Model evaluation (Boyd+ '15).
- Tensorflow/privacy implements several of these algorithms (McMahan+ '19).
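The core of DP-SGD (Abadi+ '16) fits in a few lines. This is a minimal NumPy sketch of one update step (not the tensorflow/privacy implementation): clip each example's gradient to bound any single entry's influence, average, then add Gaussian noise calibrated to the clipping norm.

```python
import numpy as np

def dp_sgd_step(w, per_example_grads, lr, clip_norm, noise_mult, rng):
    """One DP-SGD update: per-example clipping + Gaussian noise.

    Clipping bounds each example's influence on the update to clip_norm;
    the Gaussian noise (scaled by noise_mult) is what yields the
    (eps, delta)-DP guarantee, accounted across steps.
    """
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    avg_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped), size=w.shape)
    return w - lr * (avg_grad + noise)
```

With noise_mult = 0 this reduces to ordinary clipped SGD; increasing noise_mult buys a tighter ε at the cost of utility.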
Challenge 1 - Running out of privacy budget

[Diagram: an ML platform (e.g. TFX) trains (ε, δ)-DP models on a fixed dataset under a global (εg, δg)-DP guarantee; the privacy loss accumulates over time as models are released.]

Most DP work focuses on a fixed-database model:
- Each model consumes some privacy budget.
- When the budget is exhausted, the data cannot be used anymore: the system can "run out of budget".
Challenge 2 - Privacy/utility trade-off

[Plots: MSE (x10^-3, 2.0-4.0) vs. training samples (10,000 to 100,000,000) for a linear regression and a deep neural network, each trained non-DP, DP (ε=1.0), and DP (ε=0.1). More privacy (smaller ε) means less utility at a given number of training samples; more data recovers utility.]
Sage block composition (challenge 1)

Key realization: ML platforms operate on a growing database.

[Diagram: the growing database is split into blocks D1, D2, ..., Dk; models 1-3 make adaptive choices of data blocks and privacy parameters (ε1, ε2, ε3) under Sage access control and the global (εg, δg)-DP guarantee.]

Interaction model:
- Split the growing database into time-based blocks.
- Models can adaptively combine blocks to form larger datasets.
- Account for privacy loss only against the blocks used by each model.
- Models can influence future data and privacy budgets.

Theorem: |PrivacyLoss(stream)| ≤ max_k |PrivacyLoss(Dk)|

Why is this important?
- Controlling each block's privacy loss controls the global privacy loss.
- New blocks arrive with zero loss and constantly renew the budget.
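A minimal sketch of block-level privacy accounting (class and method names are hypothetical, not Sage's actual API): each block tracks its own spent budget, a training run charges only the blocks it reads, and by the theorem above the global guarantee holds as long as no single block exceeds (εg, δg).

```python
class BlockAccountant:
    """Per-block privacy accounting under a global (eps_g, delta_g) cap.

    Hypothetical sketch of Sage-style block composition: privacy loss is
    tracked per time-based block, and a training run is admitted only if
    every block it touches stays within the global budget (basic
    composition: losses charged to the same block add up).
    """

    def __init__(self, eps_g, delta_g):
        self.eps_g, self.delta_g = eps_g, delta_g
        self.spent = {}  # block id -> [eps_spent, delta_spent]

    def new_block(self, block_id):
        # New blocks arrive with zero privacy loss, renewing the budget.
        self.spent[block_id] = [0.0, 0.0]

    def try_charge(self, block_ids, eps, delta):
        """Charge (eps, delta) against every requested block, or refuse."""
        for b in block_ids:
            e, d = self.spent[b]
            if e + eps > self.eps_g or d + delta > self.delta_g:
                return False  # this block would run out of budget
        for b in block_ids:
            self.spent[b][0] += eps
            self.spent[b][1] += delta
        return True
```

Note the key property: a model that uses only recent blocks can still train after older blocks are exhausted, which is exactly what a traditional whole-database accountant cannot offer.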
Iterative training (challenge 2)

[Diagram: under Sage access control and the global (εg, δg)-DP guarantee, the ML platform (e.g. TFX) trains an (ε, δ)-DP model, asks "Good?", and either releases the model or retries with more data and/or privacy budget.]

- Adaptively trains on growing data and/or privacy budgets.
- Releases a model when, with high probability, its accuracy surpasses a target.
- Accounts for the impact of DP noise in TFX-evaluate to give a high-probability assessment of model accuracy.

Statistical test for evaluation: P(acc < τ) ≤ η over the sampling of the test set and the DP noise.

Each iteration splits its budget between (ε/2, δ/2)-DP model training and (ε/2, δ/2)-DP model validation.
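A minimal sketch of a DP-aware release test in the spirit of the statistical test above (function and parameter names are hypothetical, not Sage's actual API): lower-bound the true accuracy from a noisy DP estimate by budgeting the error probability η across both the test-set sampling and the Laplace noise, and release only if the lower bound clears the target τ.

```python
import math, random

def dp_release_test(correct, n_test, epsilon, tau, eta, rng=None):
    """Release only if P(true accuracy < tau) <= eta, accounting for
    both test-set sampling error and the DP noise on the estimate.

    Hypothetical sketch: eta is split evenly between a Laplace-noise
    tail bound and a Hoeffding sampling bound.
    """
    rng = rng or random.Random(0)
    # (epsilon, 0)-DP accuracy estimate: the correct-prediction count has
    # sensitivity 1; a difference of two Exp(epsilon) draws is Laplace(1/epsilon).
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    noisy_acc = (correct + noise) / n_test
    # Laplace tail: P(|noise| > t) = exp(-epsilon * t), solved at level eta/2.
    noise_err = math.log(2.0 / eta) / (epsilon * n_test)
    # Hoeffding bound on test-set sampling, also at level eta/2.
    sample_err = math.sqrt(math.log(2.0 / eta) / (2.0 * n_test))
    return noisy_acc - noise_err - sample_err >= tau
```

Because both error terms shrink with n_test, a model that narrowly misses the target can simply be retried later on a larger (grown) test set, which is what makes the retry loop converge.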
Sage Architecture

[Diagram: a (dataset, ε, δ) request passes through traditional access control to fetch the dataset; the ML platform (e.g. TFX) runs (ε/2, δ/2)-DP model training and (ε/2, δ/2)-DP model validation under Sage access control and the global (εg, δg)-DP guarantee; models that meet the quality goal w.h.p. are released, others are retried or rejected on timeout.]
Evaluation:
1. Benefits of block composition versus traditional DP composition.
2. Importance of iterative training and DP-aware performance tests.
3. Continuous operation on multiple models and a growing database.
- 1. Benefits of block composition versus traditional DP composition

[Plot: required sample size (data points used to reach target, 10,000 to 100,000,000) vs. MSE target (x10^-3, 2-7; lower is a better model), for traditional DP composition and Sage.]
- 2. Importance of iterative training and DP-aware performance tests

Test methodology:           Non DP   DP + UB   Sage
Failure rate at 1% proba.:  0.2%     1.7%      0.3%

[Plot: required sample size (10,000 to 100,000,000) vs. MSE target (x10^-3, 2-7), for the non-DP, DP + UB, and Sage test methodologies.]
- 3. Continuous operation on multiple models and a growing database

[Plot: average model release time (in blocks, 25-100) vs. arrival rate per block (0.1-0.7), for traditional DP composition and Sage.]
Summary
- The DP literature has mostly focused on individual ML algorithms running on static databases (which don't incorporate new data).
- ML workloads operate on growing databases: models incorporate new data and (adaptively) reuse old data.
- Sage is the first to adapt DP theory and practice to ML workloads on growing databases, for data protection.
- This opens an exciting design space for efficient privacy-resource allocation!