New Directions in Privacy-preserving Machine Learning
Kamalika Chaudhuri, University of California, San Diego
Sensitive Data
Medical Records, Genetic Data, Search Logs
AOL Violates Privacy
Netflix Violates Privacy [NS08]
(Figure: users × movies ratings matrix.)
Knowing 2-8 of Alice's movie ratings and their dates reveals: whether Alice is in the dataset, and Alice's other movie ratings
High-dimensional Data is Unique
Example: UCSD Employee Salary Table
Position | Gender | Department | Ethnicity | Salary
---------|--------|------------|-----------|-------
Faculty  | Female | CSE        | SE Asian  | …

One employee (Kamalika) fits this description!
Simply anonymizing data is unsafe!
Disease Association Studies [WLWTZ09]
(Figure: correlations (R² values) for the Cancer and Healthy groups.)
Given the published correlations, Alice's DNA reveals whether Alice is in the Cancer set or the Healthy set
Simply anonymizing data is unsafe! Statistics on small data sets are unsafe!
(Figure: the three-way trade-off between privacy, accuracy, and data size.)
Correlated Data
Examples: user information in social networks; physical activity monitoring
Why is Privacy Hard for Correlated Data?
A neighbor's information leaks information about the user
How do we learn from sensitive data while still preserving privacy?
Talk Agenda:
New Directions:
1. Privacy-preserving Bayesian Learning
2. Privacy-preserving statistics on correlated data
Talk Agenda:
1. Privacy for Uncorrelated Data
   - How to define privacy
Differential Privacy [DMNS06]
(Figure: two datasets that differ in one person, run through the same randomized algorithm, produce "similar" output distributions.)
Participation of a single person does not change the output
Differential Privacy: Attacker's View
Prior knowledge + algorithm output on data that includes a person ⇒ conclusion
Prior knowledge + algorithm output on data that excludes that person ⇒ (nearly) the same conclusion
Differential Privacy [DMNS06]
For all D1, D2 that differ in one person's value, and for any set S of outputs:
if A is an ε-differentially private randomized algorithm, then
Pr(A(D1) ∈ S) ≤ e^ε · Pr(A(D2) ∈ S)
Differential Privacy
- 1. Provably strong notion of privacy
- 2. Good approximations for many functions, e.g., means, histograms, etc.
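For instance, a private histogram needs only Laplace noise on the counts. Below is a minimal sketch (my own, not from the talk); the data and ε value are made up for illustration:

```python
import numpy as np

def private_histogram(data, bins, epsilon):
    """epsilon-DP histogram via the Laplace mechanism.

    Under add/remove-one-person neighbors, each count changes by at most 1,
    so the L1 sensitivity is 1 and Laplace noise of scale 1/epsilon suffices
    (use scale 2/epsilon for replace-one-value neighbors).
    """
    counts, _ = np.histogram(data, bins=bins)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise

# Example: ages of 1000 synthetic people, epsilon = 0.5
ages = np.random.randint(18, 90, size=1000)
print(private_histogram(ages, bins=np.arange(10, 100, 10), epsilon=0.5))
```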
Interpretation: Attacker's Hypothesis Test [WZ10, OV13]
H0: input to the algorithm = Data + one person; H1: input to the algorithm = Data + another person
Failure events: False Alarm (FA), Missed Detection (MD)
If the algorithm is ε-DP, then for any attacker's test:
Pr(FA) + e^ε · Pr(MD) ≥ 1
e^ε · Pr(FA) + Pr(MD) ≥ 1
(Figure: feasible (Pr(FA), Pr(MD)) region between (1, 0) and (0, 1), with corner point at (1/(1 + e^ε), 1/(1 + e^ε)).)
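These bounds can be checked by simulation. Here is a small sketch (mine, not from the talk) using randomized response on a single bit, an ε-DP algorithm that sits exactly at the corner point of the feasible region:

```python
import numpy as np

def randomized_response(bit, eps, n, rng):
    """Report the true bit w.p. e^eps/(1+e^eps), flip otherwise (eps-DP)."""
    keep = rng.random(n) < np.exp(eps) / (1 + np.exp(eps))
    return np.where(keep, bit, 1 - bit)

eps, n, rng = 1.0, 100_000, np.random.default_rng(0)
# Attacker tests H0: bit = 0 vs H1: bit = 1, guessing H1 when the output is 1.
fa = np.mean(randomized_response(0, eps, n, rng) == 1)   # false alarm rate
md = np.mean(randomized_response(1, eps, n, rng) == 0)   # missed detection rate
print(fa, md)                    # both near 1/(1 + e^eps) ~ 0.269
print(fa + np.exp(eps) * md)     # >= 1, with equality at the corner point
print(np.exp(eps) * fa + md)     # >= 1
```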
Talk Agenda:
1. Privacy for Uncorrelated Data
   - How to define privacy
   - Privacy-preserving Learning
Example 1: Flu Test
Predicts flu or not based on patient symptoms; trained on sensitive patient data
Example 2: Clustering Abortion Data
Given data on abortion locations, cluster by location while preserving privacy of individuals
Bayesian Learning
Data X = {x1, x2, …} and model class Θ, related through the likelihood p(x|θ)
Prior π(θ) + Data X ⇒ Posterior p(θ|X)
Goal: Output posterior (approx. or samples)
Example: Coin tosses
X = {H, T, H, H, …}, model class Θ = [0, 1]
Likelihood: p(x|θ) = θ^x (1 − θ)^(1−x)
Prior: π(θ) = 1
Data X (h heads, t tails) ⇒ Posterior: p(θ|X) ∝ θ^h (1 − θ)^t
In general, θ is more complex (classifiers, etc.)
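In code, this conjugate update is one line. A non-private sketch (my own; the 0.7 coin bias and sample sizes are made up):

```python
import numpy as np

# Coin-toss posterior under the uniform prior pi(theta) = 1 = Beta(1, 1):
# observing h heads and t tails gives p(theta | X) = Beta(1 + h, 1 + t),
# i.e. proportional to theta^h (1 - theta)^t as on the slide.
rng = np.random.default_rng(0)
X = rng.random(100) < 0.7                        # 100 synthetic tosses of a 0.7-coin
h, t = X.sum(), (~X).sum()
samples = rng.beta(1 + h, 1 + t, size=10_000)    # posterior samples
print(samples.mean())                            # close to the true bias 0.7
```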
Private Bayesian Learning
Same setup: data X = {x1, x2, …} and model class Θ with prior π(θ), related through the likelihood p(x|θ), giving the posterior p(θ|X)
Goal: Output private approx. to posterior
How to make the posterior private?
Option 1: Direct posterior sampling [Detal14]. Not private except under restrictive conditions: the posteriors p(θ|D) and p(θ|D′) on neighboring datasets can differ too much
Option 2: Sample from a truncated posterior at high temperature [WFS15]. Disadvantages: intractable (technically, privacy holds only at convergence); needs more data/subjects
Our Work: Exponential Families
Exponential family distributions: p(x|θ) = h(x) e^(θ^T T(x) − A(θ)), where T is a sufficient statistic
Includes many common distributions: Gaussian, Binomial, Dirichlet, Beta, etc.
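For instance, the Bernoulli likelihood from the coin-toss example fits this form with T(x) = x, natural parameter θ = log(q/(1−q)), and A(θ) = log(1 + e^θ). A quick sanity check (illustrative, not from the talk):

```python
import numpy as np

# Bernoulli(q) in exponential-family form: h(x) = 1, T(x) = x,
# theta = log(q / (1 - q)), A(theta) = log(1 + e^theta).
def expfam_bernoulli(x, theta):
    return np.exp(theta * x - np.log1p(np.exp(theta)))

q = 0.3
theta = np.log(q / (1 - q))
print(expfam_bernoulli(1, theta), q)        # both 0.3
print(expfam_bernoulli(0, theta), 1 - q)    # both 0.7
```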
Properties of Exponential Families
Exponential families have conjugate priors: the posterior p(θ|X) is in the same distribution class as the prior π(θ)
E.g., Gaussian-Gaussian, Beta-Binomial, etc.
Sampling from Exponential Families
Given data x1, x2, …, the (non-private) posterior comes from an exponential family:
p(θ|X) ∝ e^(η(θ)^T (Σᵢ T(xᵢ)) − B(θ))
Private Sampling:
1. If T is bounded, add noise to Σᵢ T(xᵢ) to get a private version T′
2. Sample from the perturbed posterior: p(θ|X) ∝ e^(η(θ)^T T′ − B(θ))
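A minimal sketch of this scheme for the Beta-Binomial coin-toss case (my own code, not the authors'; clamping the noised statistic to [0, n] is an illustrative choice to keep the perturbed posterior well defined):

```python
import numpy as np

def private_beta_posterior_sample(X, epsilon, rng):
    """Sample from a perturbed Beta posterior for coin tosses.

    T(x) = x lies in [0, 1], so changing one record moves the sufficient
    statistic sum_i T(x_i) by at most 1; Laplace(1/epsilon) noise makes it
    epsilon-DP, and sampling from the posterior built on the noised
    statistic is then private by post-processing.
    """
    n = len(X)
    stat = float(np.sum(X))                            # sum_i T(x_i) = #heads
    noised = stat + rng.laplace(scale=1.0 / epsilon)   # private version T'
    noised = min(max(noised, 0.0), float(n))           # clamp to a valid count
    return rng.beta(1 + noised, 1 + n - noised)        # perturbed posterior

rng = np.random.default_rng(0)
X = (rng.random(1000) < 0.7).astype(int)
print(private_beta_posterior_sample(X, epsilon=0.5, rng=rng))
```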
Performance
- Theoretical Guarantees
- Experiments
Theoretical Guarantees
Performance measure: Asymptotic Relative Efficiency (lower = more sample-efficient for large n)
Non-private: 2
Our Method: 2
[WFS15]: max(2, 1 + 1/ε)
Experiments - Task
Task: time-series clustering of events in the Wikileaks war logs while preserving event-level privacy
Data: war-log entries; Afghanistan (75K), Iraq (390K)
Goal: cluster entries in each region based on features (casualty counts, enemy/friendly fire, explosive hazards, etc.)
Experiments - Model
Hidden Markov Model for each region, with discrete hidden states h_t and observed features x_t
Transition parameters T: T_ij = P(h_{t+1} = i | h_t = j)
Emission parameters O: O_ij = P(x_t = i | h_t = j)
Goal: sample from the posterior P(O | data) (in the exponential family)
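The sketch below (my own simplification, not the experiment code) conveys the flavor for the emission parameters only: it assumes the per-state observation counts are already given (in the real experiments the hidden states are inferred too), noises the counts, and samples each emission row from the perturbed Dirichlet posterior:

```python
import numpy as np

def private_emission_posterior(counts, epsilon, rng):
    """Sketch: sample emission parameters O from a perturbed Dirichlet posterior.

    counts[j, i] = number of times observation i was emitted from state j
    (assumed given here). Adding or removing one log entry changes a single
    cell by 1, so Laplace(1/epsilon) noise per cell privatizes the counts.
    """
    noised = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    noised = np.maximum(noised, 0.0)             # Dirichlet needs nonneg. counts
    return np.vstack([rng.dirichlet(1.0 + row) for row in noised])

rng = np.random.default_rng(0)
toy_counts = rng.integers(0, 50, size=(2, 4)).astype(float)  # 2 states, 4 obs. types
print(private_emission_posterior(toy_counts, epsilon=1.0, rng=rng))
```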
Experiments - Results
(Figures: test-set log-likelihood vs. total epsilon for Iraq and Afghanistan; methods compared: non-private HMM, non-private naive Bayes, Laplace mechanism HMM, and OPS HMM (truncation multiplier = 100).)
Experiments - States
(Figure: emission distributions of two inferred Iraq states over event categories (criminal event, enemy action, explosive hazard, friendly action, friendly fire, …), incident types (IED explosion, direct fire, raid, …), and casualty types (friendly/host, civilian, enemy).)
Experiments - Clustering
(Figure: inferred state (1 or 2) for each region (MND-BAGHDAD, MND-C, MND-N, MND-SE, MNF-W) by month, Jan 2004 to Jan 2008, with peak-troops and surge-announcement markers.)
Conclusion
New method for private posterior sampling from exponential families
Open Problems:
- 1. Private sampling from more complex posteriors
- 2. Private versions of other Bayesian posterior approximation schemes (variational Bayes, etc.)
- 3. Combining Bayesian inference with more relaxed forms of DP (e.g., concentrated DP, distributional DP, etc.)
Talk Agenda:
1. Privacy for Uncorrelated Data
   - How to define privacy
   - Privacy-preserving Bayesian Learning
2. Privacy for Correlated Data
Example 1: Activity Monitoring
Share aggregate data on physical activity with a doctor or provider, while hiding the activity at each specific time
Example 2: Spread of Flu in Network
Publish aggregate statistics while preserving individual privacy. (Figure: interaction network.)
Why is Differential Privacy not Enough for Correlated Data?
Example: Activity Monitoring
D = (x1, …, xT), xt = activity at time t; the xt are linked through a correlation network
Goal: (1) publish the activity histogram; (2) prevent an adversary from learning the activity at any time t
1-DP: output the histogram of activities + noise with stdev 1
Not enough: activities across time are highly correlated!
1-Group DP: output the histogram of activities + noise with stdev T
Too much noise: no utility!
Talk Agenda:
1. Privacy for Uncorrelated Data
   - How to define privacy
   - Privacy-preserving Classification
2. Privacy for Correlated Data
   - How to define privacy
Pufferfish Privacy [KM12]
Secret Set S: information to be protected. E.g.: Alice's age is 25; Bob has a disease
Secret Pairs Set Q: pairs of secrets we want to be indistinguishable. E.g.: (Alice's age is 25, Alice's age is 40); (Bob is in the dataset, Bob is not in the dataset)
Distribution Class Θ: a set of distributions that plausibly generate the data; may be used to model correlation in the data. E.g.: (connection graph G, disease transmits w.p. in [0.1, 0.5]); (Markov chain with transition matrix in a set P)
An algorithm A is ε-Pufferfish private with parameters (S, Q, Θ) if for all (si, sj) in Q, all θ ∈ Θ with P(si|θ), P(sj|θ) > 0, all outputs t, and X ∼ θ:
Pr(A(X) = t | si, θ) ≤ e^ε · Pr(A(X) = t | sj, θ)
Pufferfish Generalizes DP [KM12]
Theorem: Pufferfish = Differential Privacy when:
S = { s_{i,a} := person i has value a, for all i, all a in domain X }
Q = { (s_{i,a}, s_{i,b}), for all i and all pairs (a, b) in X × X }
Θ = { distributions where each person i is independent }
Theorem: No utility is possible when Θ = { all possible distributions }
Talk Agenda:
1. Privacy for Uncorrelated Data
   - How to define privacy
   - Privacy-preserving Classification
2. Privacy for Correlated Data
   - How to define privacy
   - Privacy-preserving Statistics
How to get Pufferfish privacy?
Special-case mechanisms exist [KM12, HMD12]. Is there a more general Pufferfish mechanism for a large class of correlated data?
Our work: yes, the Markov Quilt Mechanism
Correlation Measure: Bayesian Networks
A directed acyclic graph; each node is a variable
Joint distribution of the variables: Pr(X1, X2, …, Xn) = ∏ᵢ Pr(Xᵢ | parents(Xᵢ))
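As a tiny illustration (numbers made up; the "stay" probability p = 0.8 matches the chain example that follows), here is the factorization for a three-node chain X1 → X2 → X3:

```python
import numpy as np

# Joint probability of a chain X1 -> X2 -> X3 via the factorization
# Pr(x1, x2, x3) = Pr(x1) * Pr(x2 | x1) * Pr(x3 | x2).
p = 0.8
prior = np.array([0.5, 0.5])            # Pr(X1)
trans = np.array([[p, 1 - p],           # trans[i, j] = Pr(X_{t+1}=j | X_t=i)
                  [1 - p, p]])

def joint(x1, x2, x3):
    return prior[x1] * trans[x1, x2] * trans[x2, x3]

print(joint(0, 0, 0))   # 0.5 * 0.8 * 0.8 = 0.32
# Sanity check: the joint sums to 1 over all 8 configurations.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```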
A Simple Example
Markov chain X1 → X2 → … → Xn with Xi ∈ {0, 1}
Model: stay in the same state with probability p, switch with probability 1 − p
Pr(X2 = 0 | X1 = 0) = p, Pr(X2 = 0 | X1 = 1) = 1 − p, ….
Influence of X1 diminishes with distance:
Pr(Xᵢ = 0 | X1 = 0) = 1/2 + (1/2)(2p − 1)^(i−1)
Pr(Xᵢ = 0 | X1 = 1) = 1/2 − (1/2)(2p − 1)^(i−1)
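A quick numerical check (mine, not from the talk) that this closed form matches direct matrix powering:

```python
import numpy as np

# Verify the influence-decay formula on the two-state chain:
# Pr(X_i = 0 | X_1 = 0) = 1/2 + (1/2)(2p - 1)^(i-1).
p = 0.8
P = np.array([[p, 1 - p],
              [1 - p, p]])               # P[a, b] = Pr(X_{t+1}=b | X_t=a)

for i in [2, 5, 10, 50]:
    by_matrix = np.linalg.matrix_power(P, i - 1)[0, 0]   # Pr(X_i=0 | X_1=0)
    closed_form = 0.5 + 0.5 * (2 * p - 1) ** (i - 1)
    print(i, by_matrix, closed_form)     # agree; both tend to 1/2 as i grows
```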
Algorithm: Main Idea
Goal: protect X1 in the chain X1, X2, X3, …, Xn
Split the other nodes into local nodes (high correlation with X1) and the rest (almost independent of X1)
Add noise to hide the local nodes, plus a small correction for the rest
Measuring "Independence"
Max-influence of Xi on a set of nodes XR:
e(XR|Xi) = max_{a,b} sup_{θ∈Θ} max_{xR} log [ Pr(XR = xR | Xi = a, θ) / Pr(XR = xR | Xi = b, θ) ]
To protect Xi, the correction term needed for XR is exp(e(XR|Xi))
Low e(XR|Xi) means XR is almost independent of Xi
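A brute-force check of this definition on the two-state chain from the earlier example (my own sketch; Θ is collapsed to a single known chain, so the sup over θ is trivial, and this only scales to tiny examples):

```python
import numpy as np

def max_influence(P, k):
    """Brute-force e(X_k | X_1) for the two-state chain with transition P:
    max over a, b, x of log [ Pr(X_k = x | X_1 = a) / Pr(X_k = x | X_1 = b) ].
    """
    Pk = np.linalg.matrix_power(P, k - 1)   # Pk[a, x] = Pr(X_k = x | X_1 = a)
    return max(np.log(Pk[a, x] / Pk[b, x])
               for a in (0, 1) for b in (0, 1) for x in (0, 1))

p = 0.8
P = np.array([[p, 1 - p], [1 - p, p]])
for k in [2, 4, 8, 16]:
    print(k, max_influence(P, k))   # decays to 0: distant nodes nearly independent
```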
How to find large “almost independent” sets
Brute-force search is expensive; instead, use structural properties of the Bayesian network
Markov Blanket
Markov Blanket(Xi) = set of nodes XS s.t. Xi is independent of X \ (Xi ∪ XS) given XS
(usually: parents, children, and other parents of Xi's children)
(Figure: Xi surrounded by its Markov Blanket XS.)
Define: Markov Quilt
XQ is a Markov Quilt of Xi if:
- 1. Deleting XQ breaks the graph into XN and XR
- 2. Xi lies in XN
- 3. XR is independent of Xi given XQ
(For the Markov Blanket, XN = {Xi}.)
Recall: Algorithm
Goal: protect X1
Split the other nodes into local nodes (high correlation) and the rest (almost independent)
Add noise to hide the local nodes, plus a small correction for the rest
Why do we need Markov Quilts?
Given a Markov Quilt XQ for Xi: XN = the local nodes for Xi, and XQ ∪ XR = the rest
We need to search over Markov Quilts XQ to find the one that needs the optimal (smallest) amount of noise
From Markov Quilts to the Amount of Noise
Let XQ be a Markov Quilt for Xi. The stdev of the noise needed to protect Xi is
Score(XQ) = card(XN) / (ε − e(XQ|Xi))
(numerator: noise due to XN; denominator: correction for XQ ∪ XR)
The Markov Quilt Mechanism
For each Xi: find the Markov Quilt XQ for Xi with minimum score si
Output F(D) + (maxᵢ sᵢ) · Z, where Z ∼ Lap(1)
Theorem: this preserves ε-Pufferfish privacy
Advantage: poly-time in special cases
Example: Activity Monitoring
D = (x1, …, xT), xt = activity at time t
(Minimal) Markov Quilts for Xi have the form XQ = {X_{i−a}, X_{i+b}}, with XN = the nodes between them and XR = the rest
Efficiently searchable
Example: Activity Monitoring
X: set of states; Pθ: transition matrix describing each θ ∈ Θ
Under some assumptions, the relevant parameters are:
π_Θ = min_{x∈X, θ∈Θ} π_θ(x)  (min probability of any x under the stationary distribution)
g_Θ = min_{θ∈Θ} min{ 1 − |λ| : P_θ x = λx, λ < 1 }  (min eigengap of any P_θ)
Max-influence of XQ = {X_{i−a}, X_{i+b}} for Xi:
e(XQ|Xi) ≤ log( (π_Θ + exp(−g_Θ b)) / (π_Θ − exp(−g_Θ b)) ) + 2 log( (π_Θ + exp(−g_Θ a)) / (π_Θ − exp(−g_Θ a)) )
Score(XQ) = (a + b − 1) / (ε − e(XQ|Xi))
Markov Quilt Mechanism for Activity Monitoring
For each Xi: find the Markov Quilt XQ = {X_{i−a}, X_{i+b}} with minimum score si
Output F(D) + (maxᵢ sᵢ) · Z, where Z ∼ Lap(1)
Running time: O(T³) (can be made O(T²))
Advantage: consistency
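To make the whole pipeline concrete, here is a rough sketch (my own, not the authors' code) of the mechanism for a Markov-chain activity trace, using the max-influence bound above; boundary effects near the ends of the chain are ignored, and the toy trace, π_Θ, and g_Θ values are made up:

```python
import numpy as np

def influence_bound(a, b, pi_min, gap):
    """Upper bound on the max-influence e(XQ | Xi) of the quilt
    XQ = {X_{i-a}, X_{i+b}}, using the formula from the slide."""
    def f(d):
        return np.log((pi_min + np.exp(-gap * d)) / (pi_min - np.exp(-gap * d)))
    return f(b) + 2 * f(a)

def noise_scale(T, eps, pi_min, gap):
    """Minimum Score(XQ) = (a + b - 1) / (eps - e(XQ | Xi)) over quilts;
    boundary effects are ignored, so the scale is the same for every node."""
    best = T / eps   # fallback: group DP over the whole chain
    for a in range(1, T):
        for b in range(1, T):
            if np.exp(-gap * min(a, b)) >= pi_min:
                continue   # bound is vacuous: quilt nodes too close to Xi
            e = influence_bound(a, b, pi_min, gap)
            if e < eps:
                best = min(best, (a + b - 1) / (eps - e))
    return best

def private_activity_histogram(acts, n_bins, eps, pi_min, gap, rng):
    """Markov Quilt mechanism sketch: activity histogram + scaled Laplace noise."""
    scale = noise_scale(len(acts), eps, pi_min, gap)
    hist = np.bincount(acts, minlength=n_bins).astype(float)
    return hist + scale * rng.laplace(size=n_bins)

rng = np.random.default_rng(0)
acts = rng.integers(0, 4, size=500)   # toy activity trace with 4 activity types
print(private_activity_histogram(acts, 4, eps=2.0, pi_min=0.2, gap=0.5, rng=rng))
```

The naive double loop over (a, b) is quadratic in T; the talk notes that for chains the search can be organized to run in O(T²) overall.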
Conclusion
New mechanism for computing statistics on correlated data
Open Problems:
- 1. Composing multiple releases on correlated data
- 2. Other correlation models (beyond Bayesian nets)
- 3. More mechanisms (for optimization)
- 4. Applications: activity recognition, location privacy
Conclusion
Learning with Privacy: learning from i.i.d. data, via convex optimization or Bayesian inference, is relatively well understood
New Directions: learning from correlated data
Acknowledgements
Shuang Song Mani Srivastava Yizhen Wang Joseph Geumlek James Foulds Max Welling