Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform (PowerPoint PPT Presentation)



SLIDE 1

Privacy Accounting and Quality Control in the Sage Differentially Private ML Platform

Mathias Lécuyer With: Riley Spahn, Kiran Vodrahalli, Roxana Geambasu, and Daniel Hsu

SLIDE 2

Machine Learning (ML) introduces a dangerous double standard for data protection.

Example: messaging app

SLIDE 3

Example: messaging app

[Diagram: a growing database of messages, likes, clicks... feeds traditional code and an ML platform (e.g. TFX), which trains recommendation, auto-complete, and ad targeting models.]

SLIDE 4

Example: messaging app

[Diagram: the growing database now sits behind access control; an API serves each user's messages (per access control restrictions), while the ML platform (e.g. TFX) trains recommendation, auto-complete, and ad targeting models.]

SLIDE 5

Example: messaging app

[Diagram: the API now also exposes the recommendation, auto-complete, and ad targeting models and/or their predictions, which are based on everyone's messages, likes, clicks... and bypass per-user access control.]

SLIDE 6

Example: messaging app

[Diagram as on the previous slides: growing database behind access control; ML platform (e.g. TFX); recommendation, auto-complete, and ad targeting models; API.]

ML should capture only general trends from the data, but it often captures specific information about individual entries in the dataset.

SLIDE 7

Example: messaging app

[Diagram as on the previous slides.]

Language models over users' emails leak secrets. (Carlini+ '18)

SLIDE 8

Example: messaging app

[Diagram as on the previous slides.]

Recommenders leak information across users. (Calandrino+ '11)

Membership in a training set can be inferred through prediction APIs. (Shokri+ '17)

SLIDE 9

Example: messaging app

[Diagram as on the previous slides.]

Language models over users' emails leak secrets. (Carlini+ '18)

Recommenders leak information across users. (Calandrino+ '11)

SLIDE 10

Example: messaging app

[Diagram as on the previous slides.]

  • Making individual training algorithms Differentially Private (DP) is good but insufficient, because old data is reused many times.
  • No system exists for managing multiple DP training algorithms to enforce a global DP guarantee.

SLIDE 11

Example: messaging app

[Diagram: each released model is now individually DP: an (ε, δ)-DP recommendation model, an (ε, δ)-DP auto-complete model, and an (ε, δ)-DP ad targeting model.]

  • Making individual training algorithms Differentially Private (DP) is good but insufficient, because old data is reused many times.
  • No system exists for managing multiple DP training algorithms to enforce a global DP guarantee.

SLIDE 12

Can we make Differential Privacy practical for ML applications?

SLIDE 13

Sage:

  • Enforces a global (εg, δg)-DP guarantee across all models ever released from a growing database.
  • Tackles in practical ways two difficult DP challenges:
    1. "Running out of budget."
    2. "Privacy-utility tradeoff."

[Diagram: Sage access control wraps the growing database of messages, likes, clicks...; individually (ε, δ)-DP recommendation, auto-complete, and ad targeting models are released through the API under the global (εg, δg)-DP guarantee.]

SLIDE 14

Outline

  • Motivation
  • Differential Privacy
  • Two practical challenges
  • Sage design
  • Evaluation

SLIDE 15

Differential Privacy (DP) (Dwork+ '06)

  • Developed to allow privacy-preserving statistical analyses on sensitive datasets (e.g., census, drug purchases, ...).
  • First (and only) rigorous definition of privacy suitable for this use case.
SLIDE 16

Definition

  • DP is a stability constraint on computations running on datasets: it requires that no single data point in an input dataset has a significant influence on the output.
  • To achieve stability, randomness is added into the computation.
SLIDE 17

Definition

  • A randomized computation f: D → O is (ε, δ)-DP if for any pair of datasets D and D' differing in one entry, and for any output set S ⊂ O:

    P(f(D) ∈ S) ≤ e^ε · P(f(D') ∈ S) + δ

  • DP is a stability constraint on computations running on datasets: it requires that no single data point in an input dataset has a significant influence on the output.
  • To achieve stability, randomness is added into the computation.
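As an illustration of how randomness achieves this definition, the standard Laplace mechanism makes a counting query (ε, 0)-DP. This is a minimal sketch for intuition, not part of Sage; the function name is ours:

```python
import numpy as np

def dp_count(dataset, predicate, epsilon):
    """(epsilon, 0)-DP count of entries matching `predicate`.

    A counting query has sensitivity 1: changing one entry moves the
    true count by at most 1. Adding Laplace noise of scale 1/epsilon
    then satisfies P(f(D) in S) <= e^epsilon * P(f(D') in S).
    """
    true_count = sum(1 for x in dataset if predicate(x))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)
```

Smaller ε means larger noise: more privacy, less accuracy per query.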
SLIDE 18

DP in ML

  • Approach: make training algorithms DP.
  • It prevents membership inference and reconstruction attacks (Steinke-Ullman '14; Dwork+ '15; Carlini+ '18).
  • DP versions exist for most ML training algorithms:
    • Stochastic gradient descent (SGD) (Abadi+ '16, Yu+ '19).
    • Various regressions (Chaudhuri+ '08, Kifer+ '12, Nikolaenko+ '13, Talwar+ '15).
    • Collaborative filtering (McSherry+ '09).
    • Language models (McMahan+ '18).
    • Feature and model selection (Chaudhuri+ '13, Smith+ '13).
    • Model evaluation (Boyd+ '15).
  • tensorflow/privacy implements several of these algorithms (McMahan+ '19).
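The DP-SGD recipe cited above (Abadi+ '16) clips each example's gradient and adds Gaussian noise to the sum, bounding any one example's influence on the update. A hedged numpy sketch of one update step on a toy squared-loss problem (our own illustration, not the tensorflow/privacy implementation; names and defaults are ours):

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD step for linear regression with squared loss.

    Each per-example gradient is clipped to L2 norm <= `clip`; Gaussian
    noise with standard deviation noise_mult * clip is added to the sum.
    """
    grads = []
    for xi, yi in zip(X, y):
        g = 2 * (xi @ w - yi) * xi           # per-example gradient
        g = g / max(1.0, np.linalg.norm(g) / clip)  # clip to norm <= clip
        grads.append(g)
    noisy_sum = np.sum(grads, axis=0) + np.random.normal(
        0.0, noise_mult * clip, size=w.shape)
    return w - lr * noisy_sum / len(X)
```

The privacy cost of a full training run is then obtained by composing the per-step guarantees, e.g. with the moments accountant of Abadi+ '16.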
SLIDE 19

Outline

  • Motivation
  • Differential Privacy
  • Two practical challenges
  • Sage design
  • Evaluation

SLIDE 20

Challenge 1 - Running out of privacy budget

[Diagram: a fixed dataset feeds the ML platform (e.g. TFX); each released (ε, δ)-DP model pushes the privacy-loss curve up over time toward the global (εg, δg)-DP ceiling.]

Most DP work focuses on a fixed database model:

  • Each model consumes some privacy budget.
  • When the budget is exhausted, the data cannot be used anymore: the system can "run out of budget".
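Under basic sequential composition, releasing k models that are each (ε, δ)-DP is (kε, kδ)-DP overall, so a fixed global budget depletes. A minimal sketch of such an accountant (our own illustration of the "run out of budget" failure mode, not Sage's API):

```python
class FixedDatasetAccountant:
    """Tracks cumulative privacy loss under basic sequential composition."""

    def __init__(self, eps_global, delta_global):
        self.eps_global = eps_global
        self.delta_global = delta_global
        self.eps_spent = 0.0
        self.delta_spent = 0.0

    def charge(self, eps, delta):
        """Reserve (eps, delta) for one model; refuse once exhausted."""
        if (self.eps_spent + eps > self.eps_global or
                self.delta_spent + delta > self.delta_global):
            raise RuntimeError("out of privacy budget")
        self.eps_spent += eps
        self.delta_spent += delta
```

With εg = 1.0 and models each charged ε = 0.3, the fourth model is rejected and the fixed dataset can never be used for training again.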


SLIDE 24

Challenge 2 - Privacy/utility trade-off

SLIDE 25

Challenge 2 - Privacy/utility trade-off

[Figure: test MSE (x10^-3, 2.0 to 4.0) versus training samples (10,000 to 100,000,000) for a linear regression and a deep neural network, comparing non-DP, DP (ε=1.0), and DP (ε=0.1) training. More privacy (smaller ε) costs utility at a given sample size; more data buys the utility back.]

SLIDE 26

Outline

  • Motivation
  • Differential Privacy
  • Two practical challenges
  • Sage design
  • Evaluation

SLIDE 27

Sage block composition (challenge 1)

[Diagram: Sage access control mediates the growing database; privacy loss over time stays under the global (εg, δg)-DP bound.]

Key realization: ML platforms operate on a growing database.

SLIDE 28

Sage block composition (challenge 1)

[Diagram: the growing database is split into blocks D1, D2, ..., Dk under Sage access control, with the global (εg, δg)-DP guarantee.]

Interaction model:

  • Split the growing database into time-based blocks.
  • Models can adaptively combine blocks to form larger datasets.
  • Account for privacy loss only against the blocks used by each model.
  • Models can influence future data and privacy budgets.

SLIDE 29

Sage block composition (challenge 1)

[Diagram: models 1 and 2 each consume privacy budget only from the blocks among D1, D2, ..., Dk that they train on.]

SLIDE 30

Sage block composition (challenge 1)

[Diagram: models 1, 2, and 3 make adaptive choices of data blocks and privacy parameters ε1, ε2, ε3 over blocks D1, D2, ..., Dk.]

SLIDE 31

Sage block composition (challenge 1)

Theorem: PrivacyLoss(stream) ≤ max_k PrivacyLoss(D_k)

[Diagram: each model's privacy loss is charged only to the blocks it used.]

SLIDE 32

Sage block composition (challenge 1)

Theorem: PrivacyLoss(stream) ≤ max_k PrivacyLoss(D_k)

Why is this important?

  • Controlling each block's privacy loss controls the global privacy loss.
  • New blocks (Dk+1, ...) arrive with zero loss and constantly renew the budget.
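The theorem suggests a per-block accountant: charge each model's ε only to the blocks it reads, and admit it only if every touched block stays under the global budget. A hedged sketch under that reading (our own illustration; Sage's actual interface may differ):

```python
class BlockAccountant:
    """Per-block DP accounting: the stream's privacy loss is bounded by
    the maximum loss over blocks, so each block gets its own budget."""

    def __init__(self, eps_global):
        self.eps_global = eps_global
        self.eps_spent = {}  # block id -> epsilon consumed so far

    def charge(self, block_ids, eps):
        """Charge `eps` to every block the model trains on, atomically."""
        for b in block_ids:  # first pass: check all blocks before spending
            if self.eps_spent.get(b, 0.0) + eps > self.eps_global:
                return False  # some block is out of budget; reject
        for b in block_ids:
            self.eps_spent[b] = self.eps_spent.get(b, 0.0) + eps
        return True
```

New blocks start with zero spend, so fresh data renews the budget even when old blocks are exhausted.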

SLIDE 33

Iterative training (challenge 2)

SLIDE 34

Iterative training (challenge 2)

[Diagram: the ML platform (e.g. TFX) trains an (ε, δ)-DP model under Sage access control and the global (εg, δg)-DP budget; a "Good?" quality check gates release.]

[Figure: MSE (x10^-3, 2.0 to 4.0) versus training samples (10,000 to 100,000,000) for non-DP, DP (ε=1.0), and DP (ε=0.1) models: more utility, more privacy, more data.]

SLIDE 35

Iterative training (challenge 2)

[Diagram: as on the previous slide, with the "Good?" quality check gating release of the (ε, δ)-DP ML model.]

  • Adaptively trains on growing data and/or privacy budgets.
  • Releases when, w.h.p., model accuracy surpasses a target.
  • Accounts for the impact of DP noise in TFX-evaluate to give a high-probability assessment of model accuracy.

SLIDE 36

Iterative training (challenge 2)

[Diagram: as on the previous slide, now with a "retry" loop back into training when the quality check fails.]

SLIDE 37

Iterative training (challenge 2)

Statistical test for evaluation: P(acc < τ) ≤ η over sampling of the test set.

[Diagram: the budget is split between (ε/2, δ/2)-DP model training and (ε/2, δ/2)-DP model validation, with retry on failure.]
slide-38
SLIDE 38

ML platform (e.g. TFX)

38

Privacy loss Time Sage access control global (εg, δg)-DP release Good?

Iterative training (challenge 2)

P(acc < τ) ≤ η over sampling of test set and DP noise. Statistical test for evaluation:

  • Adaptively trains on growing data

and/or privacy budgets.

  • Release when w.h.p. model

accuracy surpasses a target.

  • Accounts for the impact of DP noise

in TFX-evaluate to give high- probability assessment of model accuracy.

retry

(ε, δ)-DP ML model

(ε/2, δ/2)-DP model training

(ε/2, δ/2)-DP model validation
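One way to realize such a test is a one-sided lower confidence bound on the DP-noised test accuracy, splitting the failure probability η between sampling error (a Hoeffding bound) and the Laplace noise tail. This is our own hedged construction of the idea; Sage's actual test may differ in its bounds and noise mechanism:

```python
import math
import numpy as np

def dp_release_test(correct, n, epsilon, tau, eta):
    """Return True if, with probability >= 1 - eta, true accuracy >= tau.

    correct/n is the test-set accuracy; the accuracy fraction has
    sensitivity 1/n, so Laplace noise of scale 1/(n*epsilon) makes the
    measurement epsilon-DP. We then subtract a Hoeffding sampling margin
    and a Laplace-tail noise margin, each sized for failure prob eta/2.
    """
    noisy_acc = correct / n + np.random.laplace(0.0, 1.0 / (n * epsilon))
    sampling_margin = math.sqrt(math.log(2.0 / eta) / (2.0 * n))
    # Laplace tail: P(|Lap(b)| > t) = exp(-t/b), so t = b*ln(2/eta)
    noise_margin = math.log(2.0 / eta) / (n * epsilon)
    return noisy_acc - sampling_margin - noise_margin >= tau
```

If the test fails, the model is not released; per the retry loop above, training can resume later with more data and/or budget rather than releasing an under-performing model.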

SLIDE 39

Sage Architecture

[Diagram: a request specifies (dataset, ε, δ); traditional access control plus Sage access control enforce the global (εg, δg)-DP guarantee over the privacy-loss timeline; the ML platform (e.g. TFX) runs (ε/2, δ/2)-DP model training and (ε/2, δ/2)-DP model validation, with retry and reject/timeout paths; released models meet the quality goal w.h.p.]

SLIDE 40

Outline

  • Motivation
  • Differential Privacy
  • Two practical challenges
  • Sage design
  • Evaluation

SLIDE 41

Evaluation:

1. Benefits of block composition versus traditional DP composition.
2. Importance of iterative training and DP-aware performance tests.
3. Continuous operation on multiple models and a growing database.

SLIDE 42

1. Benefits of block composition versus traditional DP composition

[Figure: required sample size (10,000 to 100,000,000 data points used to reach target) versus MSE target (x10^-3, 2 to 7; lower is a better model), for traditional DP composition versus Sage.]

SLIDE 43

2. Importance of iterative training and DP-aware performance tests

Test methodology:            Non DP    DP + UB    Sage
Failure rate at 1% proba.:   0.2%      1.7%       0.3%

[Figure: required sample size (10,000 to 100,000,000) versus MSE target (x10^-3, 2 to 7) for the Non DP, DP + UB, and Sage test methodologies.]

SLIDE 44

3. Continuous operation on multiple models and growing database

[Figure: average model release time (in blocks, 25 to 100) versus arrival rate per block (0.1 to 0.7), for traditional DP composition versus Sage.]

SLIDE 45

Summary

  • The DP literature has mostly focused on individual ML algorithms running on static databases (which don't incorporate new data).
  • ML workloads operate on growing databases: models incorporate new data and (adaptively) reuse old data.
  • Sage is the first system to adapt DP theory and practice to ML workloads on growing databases, for data protection.
  • This opens an exciting design space for efficient privacy resource allocation!