

SLIDE 1

Scalable PATE The Secret Sharer

Work by the Brain Privacy and Security team and collaborators at UC Berkeley. Presented by Ian Goodfellow.

SLIDE 2

PATE / PATE-G

  • Private / Papernot
  • Aggregation / Abadi
  • Teacher / Talwar
  • Ensembles / Erlingsson
  • Generative / Goodfellow
SLIDE 3

Threat Model

Types of adversaries and our threat model


In our work, the threat model assumes:

  • Adversary can make a potentially unbounded number of queries
  • Adversary has access to model internals

Model inspection (white-box adversary)

Zhang et al. (2017), Understanding deep learning requires rethinking generalization

Model querying (black-box adversary)

Shokri et al. (2016), Membership Inference Attacks against Machine Learning Models; Fredrikson et al. (2015), Model Inversion Attacks

SLIDE 4

A definition of privacy: differential privacy

[Diagram: the same randomized algorithm runs on two datasets that differ in one record; the two resulting distributions over answers (Answer 1 … Answer n) must be nearly indistinguishable.]
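Stated formally, a randomized algorithm M is ε-differentially private if, for every pair of datasets D and D′ differing in a single record and every set S of possible answers,

\[
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \, \Pr[M(D') \in S]
\]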

SLIDE 5

A tangent

  • Which other fields need their “differential privacy moment”?
  • Adversarial robustness needs a provable mechanism
  • Interpretability needs measurable / actionable definitions
  • Differential privacy is maybe the brightest spot in ML theory, especially in adversarial settings: real guarantees that hold in practice

SLIDE 6

Different teachers learn from different subsets

Private Aggregation of Teacher Ensembles (PATE)

[Diagram: the sensitive training data is split into n disjoint partitions; Partition i is used to train Teacher i.]

Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data [ICLR 2017 best paper], Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian Goodfellow, and Kunal Talwar
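A minimal sketch of this partitioning step, assuming scikit-learn-style models; make_model is a hypothetical factory, not from the paper:

    import numpy as np

    def train_teachers(X, y, n_teachers, make_model, seed=0):
        # Split the sensitive data into n disjoint partitions and
        # train one teacher per partition.
        rng = np.random.default_rng(seed)
        partitions = np.array_split(rng.permutation(len(X)), n_teachers)
        teachers = []
        for part in partitions:
            model = make_model()         # hypothetical model factory
            model.fit(X[part], y[part])  # assumes a scikit-learn-style API
            teachers.append(model)
        return teachers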

SLIDE 7

Aggregation

Count votes. Take maximum.
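In the ICLR 2017 paper the vote counts are perturbed with Laplace noise before taking the maximum. A minimal sketch of that noisy-max aggregation; gamma, the inverse noise scale, is the privacy parameter:

    import numpy as np

    def noisy_max(teacher_labels, num_classes, gamma, rng=None):
        # Count the teachers' votes per class, perturb each count with
        # Laplace noise of scale 1/gamma, and return the noisy winner.
        rng = rng or np.random.default_rng()
        votes = np.bincount(teacher_labels, minlength=num_classes)
        noisy_votes = votes + rng.laplace(0.0, 1.0 / gamma, size=num_classes)
        return int(np.argmax(noisy_votes))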

SLIDE 8

Intuitive privacy analysis

If most teachers agree on the label, it does not depend on specific partitions, so the privacy cost is small. If two classes have close vote counts, the disagreement may reveal private information.

slide-9
SLIDE 9

Student training

[Diagram: teachers trained on disjoint partitions of the sensitive data feed an aggregated teacher; a student model is trained on public data labeled by querying the aggregated teacher. At inference time only the student is available to the adversary; the sensitive data, teachers, and aggregation mechanism are not.]

SLIDE 10

Why train an additional “student” model?

The aggregated teacher violates our threat model:

1. Each prediction increases total privacy loss. Privacy budgets create a tension between accuracy and the number of predictions.
2. Inspection of internals may reveal private data. Privacy guarantees should hold in the face of white-box adversaries.

SLIDE 11

Label-efficient learning

  • More queries to the teacher while training the student = more privacy lost
  • Use a semi-supervised GAN (Salimans et al., 2016) to achieve high accuracy with few labels

SLIDE 12


Supervised Discriminator for Semi-Supervised Learning

[Diagram: a GAN discriminator extended for semi-supervised learning: instead of a single real/fake output, it predicts the real classes (e.g., real dog, real cat) plus a fake class.]

(Odena 2016; Salimans et al., 2016). Learn to read (MNIST) with 100 labels rather than 60,000.
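A hedged sketch of that discriminator's loss using the K-logit parametrization of Salimans et al. (2016), where p(real) = Z/(Z+1) with Z the sum of exponentiated class logits; function and tensor names are illustrative:

    import torch
    import torch.nn.functional as F

    def ssgan_discriminator_loss(logits_labeled, y, logits_real, logits_fake):
        # Supervised part: ordinary cross-entropy on the K real classes.
        supervised = F.cross_entropy(logits_labeled, y)
        # Unsupervised part: logsumexp of the class logits acts as a
        # "realness" score; push it up on real data, down on fakes.
        score_real = torch.logsumexp(logits_real, dim=1)
        score_fake = torch.logsumexp(logits_fake, dim=1)
        loss_real = -(score_real - F.softplus(score_real)).mean()  # -log D(x)
        loss_fake = F.softplus(score_fake).mean()                  # -log(1 - D(G(z)))
        return supervised + loss_real + loss_fake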

SLIDE 13

Trade-off between student accuracy and privacy

SLIDE 14

Scalable PATE

  • Nicolas Papernot*, Shuang Song*, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, Úlfar Erlingsson

SLIDE 15

Limitations of first PATE paper

  • Only on MNIST / SVHN
    • Very clean
    • 10 classes (easier to get consensus)
  • Scalable PATE:
    • More classes
    • Unbalanced classes
    • Mislabeled training examples
SLIDE 16

Improvements

  • Noisy votes use Gaussian rather than Laplace distribution
    • More likely to achieve consensus for a large number of classes
  • Selective teacher response
SLIDE 17

Selective Teacher Response

  • Check for overwhelming consensus
    • Use high-variance noise
    • Check if the noisy votes for the argmax exceed a threshold T
  • Consensus? Publish noisy votes with smaller variance
  • No consensus? Don’t publish anything; the student skips that query
  • Note: running the noisy consensus check still spends some of our privacy budget (a sketch follows below)
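A minimal sketch of this Confident-GNMax aggregator; T, sigma1, and sigma2 correspond to the threshold and the two Gaussian noise scales:

    import numpy as np

    def confident_gnmax(votes, T, sigma1, sigma2, rng=None):
        # votes: histogram of teacher votes over classes for one query.
        rng = rng or np.random.default_rng()
        # Step 1: privately check for consensus with high-variance noise.
        if np.max(votes + rng.normal(0.0, sigma1, votes.shape)) >= T:
            # Step 2: consensus; answer with a lower-variance noisy argmax.
            return int(np.argmax(votes + rng.normal(0.0, sigma2, votes.shape)))
        return None  # no consensus: nothing published, student skips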
SLIDE 18

Background: adversarial training

[Diagram: an image labeled as a bird receives an adversarial perturbation that decreases the predicted probability of the bird class; the perturbed image still has the same true label (bird).]
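A minimal sketch of such a perturbation as a single gradient-sign step (FGSM); model, x, y, and eps are placeholders:

    import torch
    import torch.nn.functional as F

    def fgsm_perturb(model, x, y, eps):
        # Move x a small step in the direction that increases the loss,
        # i.e., decreases the predicted probability of the true class y.
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        loss.backward()
        return (x + eps * x.grad.sign()).detach()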

SLIDE 19

Virtual Adversarial Training

[Diagram: an unlabeled image; the model guesses it is probably a bird, maybe a plane. An adversarial perturbation tries to change that guess; training requires the new guess to match the old guess (probably bird, maybe plane).]

(Miyato et al., 2015)
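A sketch of the VAT objective under the usual formulation (one power-iteration step to find the worst-case direction; xi and eps are the standard small-step and perturbation scales; this is not the authors' exact code):

    import torch
    import torch.nn.functional as F

    def vat_loss(model, x, xi=1e-6, eps=8.0):
        # Old guess: the model's current predictive distribution at x.
        with torch.no_grad():
            p = F.softmax(model(x), dim=1)
        # One power-iteration step to approximate the perturbation that
        # most changes the prediction.
        d = torch.randn_like(x)
        d = (xi * F.normalize(d.flatten(1), dim=1).view_as(x)).requires_grad_(True)
        kl = F.kl_div(F.log_softmax(model(x + d), dim=1), p, reduction="batchmean")
        d = torch.autograd.grad(kl, d)[0]
        r_adv = eps * F.normalize(d.flatten(1), dim=1).view_as(x)
        # New guess at x + r_adv should match the old guess at x.
        return F.kl_div(F.log_softmax(model(x + r_adv), dim=1), p,
                        reduction="batchmean")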

SLIDE 20

VAT performance

(Oliver+Odena+Raffel et al, 2018)

SLIDE 21

Scalable PATE: Improved Results

Synergy between utility and privacy

1. Check privately for consensus
2. Run noisy argmax only when consensus is sufficient

Scalable Private Learning with PATE [ICLR 2018], Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, Úlfar Erlingsson

(LNMax=PATE, Confident-GNMax=Scalable PATE)

SLIDE 22

Scalable PATE: Improved tradeoff

Trade-off between student accuracy and privacy

[Figure: trade-off between student accuracy and privacy; curve labeled “Selective PATE”.]

SLIDE 23

The Secret Sharer

  • Nicholas Carlini, Chang Liu, Jernej Kos, Úlfar Erlingsson, Dawn Song

SLIDE 24

Secret with format known to adversary

  • “My social security number is ___-__-____”
  • Measure memorization with exposure


SLIDE 25

Definitions

  • Suppose the model assigns probability p to the actual secret
  • The rank of the secret is the number of candidate strings assigned probability ≥ p (the secret itself included)
  • Minimum value is 1
  • Exposure: negative log of the probability that a uniformly sampled candidate string has probability at least p
  • Equivalently: exposure = log (# possible strings) − log rank
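In symbols, with R the set of candidate strings matching the format:

\[
\text{exposure}_\theta(s) \;=\; \log_2 |\mathcal{R}| \;-\; \log_2 \operatorname{rank}_\theta(s)
\]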
SLIDE 26

Practical Experiments

  • Can estimate exposure via sampling (a sketch follows below)
  • Can approximately find the most likely secret value with optimization (beam search)
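A minimal sketch of the sampling estimate, assuming you can score candidate strings under the model (secret_log_prob and sampled_log_probs are illustrative inputs):

    import numpy as np

    def estimate_exposure(secret_log_prob, sampled_log_probs):
        # The fraction of uniformly sampled candidates at least as likely
        # as the secret estimates rank / |R|; exposure = -log2 of it.
        samples = np.asarray(sampled_log_probs)
        frac = np.mean(samples >= secret_log_prob)
        frac = max(frac, 1.0 / len(samples))  # avoid log(0) when none match
        return -np.log2(frac)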
SLIDE 27

Memorization during learning

SLIDE 28

Observations

  • Exposure is high
  • Exposure rises early during learning
  • Exposure is not caused by overfitting
    • It peaks before overfitting occurs
SLIDE 29

Comparisons

  • Across architectures:
    • More accuracy -> more exposure
    • LSTM / GRU: high accuracy, high exposure
    • CNN: lower accuracy, lower exposure
  • Larger batch size -> more memorization
  • Larger model -> more memorization
  • Secret memorization happens even when the compressed model is smaller than the compressed dataset
  • Choice of optimizer: no significant difference
SLIDE 30

Defenses

  • Regularization does not work:
    • Weight decay
    • Dropout
    • Weight quantization
  • Differential privacy works, as guaranteed
    • Even for very large epsilon, where the theoretical guarantee is weak, the exposure measured in practice decreases significantly

SLIDE 31

Questions