Enhancing Privacy in Machine Learning, Mathias Humbert, INSA (PowerPoint PPT presentation)



SLIDE 1

Enhancing Privacy in Machine Learning

Mathias Humbert

INSA Toulouse/CNRS Toulouse, January 22, 2019

SLIDE 2

Mathias Humbert - Enhancing Privacy in Machine Learning

Enhancing Privacy in Machine Learning

[Diagram: data feeding into ML. What threat? What data? What ML?]

SLIDE 3

Different Attacks: Linkability

Ability to link at least two records concerning the same individual

[Diagram: records of Robert, Alice, Marius, and Eve linked across two datasets]

If one dataset is not anonymized → re-identification

SLIDE 4

Different Attacks: Membership Inference


Ability to infer that a certain target is in a specific dataset

Example: a study focusing on HIV patients

SLIDE 5

Trading Off Privacy

[Diagram: ML at the center of a Privacy / Efficiency / Utility trade-off. What defense? What threat? What data? What ML?]

SLIDE 6

Different Defense Mechanisms

  • Anonymization
  • Randomization
  • Differential privacy
  • Cryptography


SLIDE 7

Outline of the Talk

  • Attack - defense - data
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19

[Annotation: data in [0,1]^m reduced to ℝ^r, with r ≈ 10³ and m ≈ 10⁷]

SLIDE 8

Outline of the Talk

  • Attack - defense - data
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19
SLIDE 9

DNA versus MicroRNA

DNA:
  • contains the blueprint of what a cell potentially can do,
  • is (mostly) fixed over time,
  • can hint at risks of getting a disease.

miRNA:
  • regulates what a cell really does,
  • expression changes over time,
  • can tell whether you carry a disease.

Common belief: no privacy threats from miRNAs, because of temporal variability

SLIDE 10

Temporal Linkability Attack

  • Matching two datasets
  • E.g., a leaked database (incl. name) and public DB (excl. name)
  • Which sample from t1 corresponds to which sample from t2?


SLIDE 11

Data Pre-processing

  • High dimensionality: 1,189 miRNAs per sample
  • Possibly correlated and uninteresting components
  • PCA + whitening provides
    • Unit variance
    • Smaller dimensionality
    • Uncorrelated components
  • Condenses the data into a smaller set of dimensions with minimal information loss

PCA maps each raw profile r^{tj}_k to a whitened, lower-dimensional profile r̄^{tj}_k.
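The PCA-plus-whitening step above can be sketched in NumPy. This is a hedged illustration on synthetic data, not the talk's actual pipeline; the function name and dimensions are illustrative:

```python
import numpy as np

def pca_whiten(X, n_components):
    """Project X (samples x features) onto the top principal components
    and rescale each component to unit variance (whitening)."""
    Xc = X - X.mean(axis=0)                      # center the data
    # SVD of the centered data: Xc = U S Vt
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # PCA scores are U_k S_k; dividing by S_k and rescaling by sqrt(n-1)
    # leaves each retained component with unit sample variance
    return U[:, :n_components] * np.sqrt(X.shape[0] - 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(29, 1189))                  # 29 samples, 1,189 miRNAs
Z = pca_whiten(X, n_components=10)
print(Z.shape)                                   # (29, 10)
print(Z.var(axis=0, ddof=1).round(6))            # each component has unit variance
```

The retained components are also mutually uncorrelated, matching the three properties listed on the slide.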

SLIDE 12

Linkability Attack

Given the whitened profiles {r̄^{t1}_i}ⁿᵢ₌₁ and {r̄^{t2}_i}ⁿᵢ₌₁, the attacker compares every pair via the squared distance ‖ r̄^{t2}_i − r̄^{t1}_k ‖² and searches for the permutation

σ* = arg min_σ Σⁿᵢ₌₁ ‖ r̄^{t2}_{σ(i)} − r̄^{t1}_i ‖²

Which sample from t1 corresponds to which sample from t2?

SLIDE 13

Linkability Attack

σ* = arg min_σ Σⁿᵢ₌₁ ‖ r̄^{t2}_{σ(i)} − r̄^{t1}_i ‖²

Which sample from t1 corresponds to which sample from t2? Time complexity: O(n³)
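The arg-min over permutations is a minimum-weight bipartite matching, which is what makes the O(n³) complexity achievable. A sketch with `scipy.optimize.linear_sum_assignment` on synthetic whitened profiles (the data and noise level are made up, not the study's):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
n, d = 26, 10
R_t1 = rng.normal(size=(n, d))                   # whitened profiles at t1
true_perm = rng.permutation(n)
# Profiles at t2: same individuals, shuffled, plus small temporal noise
R_t2 = R_t1[true_perm] + 0.1 * rng.normal(size=(n, d))

# Cost matrix: squared Euclidean distance between every (t1, t2) pair
cost = ((R_t1[:, None, :] - R_t2[None, :, :]) ** 2).sum(axis=-1)
rows, cols = linear_sum_assignment(cost)         # O(n^3) optimal assignment
# cols[i] is the t2 sample matched to t1 sample i; the match is correct
# when the matched t2 sample really belongs to individual i
success_rate = np.mean(true_perm[cols] == np.arange(n))
print(success_rate)
```

With small temporal noise relative to the whitened scale, the matching recovers the true permutation; larger noise (as in real plasma samples) degrades it.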
SLIDE 14

Athletes Dataset

Participants: 29
Points in time: 2 (before and after exercising)
Time period: 1 week
Disease: none
1,189 miRNAs per sample, taken from blood and plasma
SLIDE 15

Lung Cancer Dataset

[Timeline: samples before surgery and at 3, 6, 9, 12, 15, and 18 months after surgery]

Participants: 26 (huge for a longitudinal study!)
Points in time: 8
Time period: 18 months
Disease: lung cancer
1,189 miRNAs per sample, taken from plasma
SLIDE 16

Linkability Attack – Results

[Plots: attack success vs. number of PCA dimensions; annotated success rates: 90%, 48%, 55%, 29%]

Success up to 90% for blood-based samples

SLIDE 17

Linkability Attack – Results

How does the success change with larger datasets? Success decreases sharply for plasma-based samples, but decreases linearly for blood-based samples.

SLIDE 18

Outline of the Talk

  • Attack - defense - data
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19
SLIDE 19

Defense Mechanisms

  • Hiding non-relevant miRNA expressions
    • Sometimes, randomization is not an option, e.g., for making a diagnosis in a hospital
    • Caution: correlations between miRNAs
  • Randomizing the miRNA expression profiles
    • E.g., for publishing a dataset used in a study
    • Adding noise in a fully distributed, differentially-private manner → providing epigeno-indistinguishability (inspired by [1])
    • Noise drawn according to a multivariate Laplacian mechanism

[1] Chatzikokolakis et al. Broadening the scope of differential privacy using metrics, PETS, 2013
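To make the randomization idea concrete, here is a deliberately simplified, per-coordinate Laplace mechanism in NumPy. Note this is an assumption-laden sketch: the talk's actual mechanism is a multivariate Laplacian that accounts for correlations between miRNAs, which this toy version ignores, and the sensitivity value is illustrative:

```python
import numpy as np

def sanitize(profile, epsilon, sensitivity=1.0, rng=None):
    """Simplified per-coordinate Laplace mechanism: adds independent
    noise with scale sensitivity/epsilon to each expression value.
    (The talk's defense uses a *multivariate* Laplacian that models
    correlations between miRNAs; this sketch does not.)"""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon          # smaller epsilon -> more noise
    return profile + rng.laplace(loc=0.0, scale=scale, size=profile.shape)

rng = np.random.default_rng(2)
profile = rng.uniform(size=1189)           # one expression profile in [0,1]^m
noisy = sanitize(profile, epsilon=0.025, rng=rng)
print(noisy.shape)
```

The epsilon values used here (0.025, 0.01) mirror the settings discussed in the results slides: lower epsilon gives stronger privacy at some utility cost.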


SLIDE 20

Privacy-Utility Trade-Off

Privacy: prevent linkability of samples
Utility: preserve accuracy of classification as diseased / healthy, usually using a radial SVM classifier

[Plot: radial SVM decision boundary over miRNA1 and miRNA2, separating diseased from healthy samples]
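The utility measure above, a radial (RBF-kernel) SVM classifying diseased vs. healthy, can be sketched with scikit-learn. The data here is a synthetic stand-in for miRNA features, not the study's datasets:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 400
# Synthetic stand-in for miRNA features: two classes with shifted means
X = rng.normal(size=(n, 20))
y = np.arange(n) % 2
X[y == 1] += 1.5                           # "diseased" class shifted

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # radial SVM, as in the talk
clf.fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(round(acc, 3))
```

Re-running this after hiding features or adding Laplace noise to X gives a direct, if toy, measurement of the privacy-utility trade-off the next slides quantify.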


SLIDE 22

Privacy-Utility Trade-Off


Another dataset for exploring utility: 1000+ participants, 19 diseases, 1 time point

SLIDE 23

Hiding miRNAs – Results

[Plot annotations: attack success below 80% when fewer than 100 miRNAs are shared]

SLIDE 24

Hiding miRNAs – Results

[Plot annotation: accuracy 99.2%]

SLIDE 25

Hiding miRNAs – Results

[Plot: attacker's success rate]

SLIDE 26

Hiding miRNAs – Results

[Plot annotation: 99.2%]

SLIDE 27

Hiding miRNAs – Results

Trade-off at 7 miRNAs:
  • Attack success decreased by 54% (relative to using all miRNAs)
  • SVM accuracy decreased by only 1% (relative to the maximum of 99.2%)

SLIDE 28

Hiding miRNAs – Results

[Plot annotation: accuracy 92.7%]

SLIDE 29

Hiding miRNAs – Results

Trade-off at 4 miRNAs:
  • Attack success decreases by 80% (relative to using all miRNAs)
  • Accuracy decreases by only 1% (relative to the maximum of 92.7%)

SLIDE 30

Probabilistic Sanitization – Results

[Plot annotation: accuracy 99.2%]

SLIDE 31

Probabilistic Sanitization – Results

[Plot annotations: accuracy 99.2%]

SLIDE 32

Probabilistic Sanitization – Results

Suitable balance at ε = 0.025:
  • Attack success decreased by 63% (relative to all)
  • SVM accuracy decreased by only 0.65% (relative to the maximum of 99.2%)

SLIDE 33

Probabilistic Sanitization – Results

[Plot annotation: accuracy 96.9%]

SLIDE 34

Probabilistic Sanitization – Results

Trade-off at ε = 0.01:
  • Attack success decreases by 70% (relative to all)
  • Accuracy decreases by only 0.2% (relative to the maximum of 96.9%)

SLIDE 35

Outline of the Talk

  • Attack - defense - data type
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19
SLIDE 36

DNA Methylation Data and Privacy

  • DNA methylation
    • Very well understood epigenetic mechanism
    • Associated with human health status
      • Hyper-/hypomethylation associated with cancer
      • Smoking mother → child with asthma
    • Sensitive data → privacy must be protected
  • Current privacy practice
    • Public release on databases such as the Gene Expression Omnibus
    • Privacy precautions:
      • Anonymized samples (removal of personal identifiers)
      • Corresponding genomic data not accessible, since the genome can be re-identified using various side channels [2,3]

[2] Gymrek et al., Identifying personal genomes by surname inference, Science, 2013
[3] Humbert et al., De-anonymizing genomic databases using phenotypic traits, PoPETS, 2015

SLIDE 37

Re-identifying DNA Methylation Profiles

  • Experimental results
    • Focusing on 293 methylation regions highly correlated with genotype
    • Between 97.5% and 100% matching accuracy for a genotype database of size greater than 2,500
    • Wrongly matched pairs always rejected by our statistical test

Pr(Gⁱⱼ = gⁱⱼ | Mⁱⱼ) = p(Mⁱⱼ | Gⁱⱼ = gⁱⱼ) Pr(Gⁱⱼ = gⁱⱼ) / Σ_{gⁱⱼ} p(Mⁱⱼ | Gⁱⱼ = gⁱⱼ) Pr(Gⁱⱼ = gⁱⱼ)    (1)
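Equation (1) is a direct Bayes update from an observed methylation value to a posterior over genotypes. A numeric sketch in NumPy; the prior and likelihood values are purely illustrative, not from the paper:

```python
import numpy as np

def genotype_posterior(likelihood, prior):
    """Pr(G = g | M) for each genotype value g, per Equation (1):
    likelihood[g] = p(M | G = g), prior[g] = Pr(G = g)."""
    joint = likelihood * prior                 # numerator of (1), per g
    return joint / joint.sum()                 # normalize over all g

prior = np.array([0.49, 0.42, 0.09])           # Pr(G = 0, 1, 2), illustrative
likelihood = np.array([0.05, 0.30, 0.90])      # p(M | G = g), illustrative
post = genotype_posterior(likelihood, prior)
print(post.round(3))
```

Repeating this per methylation region and multiplying (or summing log-posteriors) across the 293 regions yields the matching score used for re-identification.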

SLIDE 38

Outline of the Talk

  • Attack - defense - data type
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19
SLIDE 39

Defense Mechanism

  • Private classification of brain tumors [4]
  • Random forest classifier
  • Cryptographic mechanism: homomorphic encryption
  • Secure under the honest-but-curious adversary model
  • The machine-learning provider does not learn the patient’s data
  • The data owner (patient) does not learn the machine-learning model
  • Typical use case: clinical setting, or diagnosis by third-party provider
  • Implementation in C++
  • Classification based on 900 methylation regions
  • 9 tumor subtypes
  • Original random forest model: 1000 trees

[4] Danielsson et al. MethPed: A DNA methylation classifier tool for the identification of pediatric brain tumor subtypes, Clinical Epigenetics, 2015


SLIDE 40

Performance Evaluation

[Plot: accuracy and total time (all votes + plurality) vs. number of trees]
  • Very good accuracy already with < 100 trees
  • 100 trees take less than 2 hours; 1,000 trees take less than 12 hours

SLIDE 41

Outline of the Talk

  • Attack - defense - data type
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19
SLIDE 42

Membership Inference Against ML Models

[Diagram: a miRNA database feeds an MLaaS system, which returns an output distribution over classes (cat, dog, panda)]

Objective: determine whether a victim v is part of the training dataset, using the output distribution of the MLaaS system and having access to a data sample of v.

SLIDE 43

State-of-the-Art Attack (Shokri et al.)

[Diagram: the target model is trained on the target dataset; multiple shadow models are trained on a local dataset drawn from the same distribution, with ground-truth membership labels; multiple attack models are then trained on the shadow models' outputs to decide "in or not in"]

Shokri et al., Membership Inference Attacks against Machine Learning Models. IEEE S&P, 2017

SLIDE 44

More Realistic Attacks

  • Three new adversary models:
  • 1. A single shadow model, different structure than the target model
  • Less costly attack
SLIDE 45

Performance of the First New Adversary

[Bar charts: precision and recall of Shokri et al. vs. our approach on Adult, CIFAR-10, CIFAR-100, Face, Location, MNIST, News, and Purchase-{2,10,20,50,100}; annotated values: 95/95, 94/95, 88/89, 83/85]

SLIDE 46

More Realistic Attacks

  • Three new adversary models:
  • 1. A single shadow model, different structure than the target model
  • Less costly attack
  • Experimental results similar to Shokri et al.
  • 2. Same as 1. + different distribution than the original training set
  • "Data transferring attack": the shadow model is trained on a different dataset than the target's training set
SLIDE 47

Performance of the Second New Adversary

[Heatmaps: precision and recall of the data transferring attack for every (shadow dataset, target dataset) pair over Adult, CIFAR-10, CIFAR-100, Face, Location, MNIST, News, and Purchase-{2,10,20,50,100}; best values around 95/95 and 89/89]

SLIDE 48

More Realistic Attacks

  • Three new adversary models:
  • 1. A single shadow model, different structure than the target model
  • Less costly attack
  • Experimental results similar to Shokri et al.
  • 2. Same as 1. + different distribution than the original training set
  • "Data transferring attack": the shadow model is trained on a different dataset than the target's training set
  • Results decreasing by a few % only
  • 3. No shadow model, different distribution than the training set
  • No training phase
  • Attack based on the output distribution only
  • Statistics such as max or the entropy can be sufficient
  • Good results for about half of the tested datasets
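The third adversary above thresholds simple statistics of the target model's output distribution, such as the maximum posterior or the entropy. A hedged sketch on synthetic posteriors (the 0.9 threshold and the example distributions are illustrative, not the paper's):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of each output distribution (rows of p)."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def predict_member(posteriors, max_threshold=0.9):
    """Shadow-model-free attack sketch: flag a query as a training
    member when the top output probability is very high (overfitted
    models are more confident on their training samples).
    The threshold is illustrative, not taken from the paper."""
    return posteriors.max(axis=-1) >= max_threshold

# Synthetic outputs: confident posteriors for members, flatter for non-members
members = np.array([[0.97, 0.02, 0.01], [0.95, 0.04, 0.01]])
non_members = np.array([[0.40, 0.35, 0.25], [0.55, 0.30, 0.15]])
print(predict_member(members))                   # members flagged
print(predict_member(non_members))               # non-members not flagged
print(entropy(members) < entropy(non_members))   # members have lower entropy
```

No shadow model and no training phase are needed; the attacker only needs the output distribution for the victim's sample.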
SLIDE 49

Outline of the Talk

  • Attack - defense - data type
  • Temporal linkability - randomization - microRNA expression
  • USENIX Security’16
  • Re-identification - cryptography - DNA methylation
  • IEEE S&P’17
  • Membership inference - other defense - any data
  • NDSS'19
SLIDE 50

Defense Mechanisms

  • Main reason for the success of the attack
    • Overfitting of training samples
  • Two defenses
    • Dropout
    • Model stacking

[Bar charts: precision, recall, and accuracy of the inference attack on the ML classifier, for the original model vs. dropout vs. model stacking, across Adult, CIFAR-10, CIFAR-100, Face, Location, MNIST, News, and Purchase-{2,10,20,50,100}]
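Dropout counters overfitting by randomly zeroing hidden activations during training. A minimal, framework-agnostic NumPy illustration of an inverted-dropout mask (a sketch of the mechanism, not the paper's training code):

```python
import numpy as np

def dropout(activations, p_drop, rng):
    """Inverted dropout: zero each unit with probability p_drop and
    rescale survivors by 1/(1 - p_drop), so the expected activation
    is unchanged. Applied during training only; disabled at test time."""
    mask = rng.random(activations.shape) >= p_drop
    return activations * mask / (1.0 - p_drop)

rng = np.random.default_rng(4)
h = np.ones((1000, 128))                 # a batch of hidden activations
h_drop = dropout(h, p_drop=0.5, rng=rng)

dropped_frac = (h_drop == 0).mean()
print(round(dropped_frac, 3))            # close to 0.5
print(round(h_drop.mean(), 3))           # expectation preserved, close to 1.0
```

Because each training pass sees a different random sub-network, the model memorizes individual training samples less, which is exactly what the membership inference attack exploits.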

SLIDE 51

Conclusion

[Diagram: ML balancing Privacy, Efficiency, and Utility. What ML?]

  • Randomization/DP: applied at training & test time
  • Cryptography: applied at test time
  • Anti-overfitting: attack at test time; defense at training time (by modifying the model, not the data)

contact: mathias.humbert@ar.admin.ch