On the Theory and Practice of Privacy-Preserving Bayesian Data Analysis
James Foulds,* Joseph Geumlek,* Max Welling,+ Kamalika Chaudhuri*
+University of Amsterdam
*University of California, San Diego
Overview
– Bayesian data analysis
– Privacy-preserving data analysis
– Privacy-preserving Bayesian data analysis “for free” via posterior sampling (Dimitrakakis et al., 2014; Wang et al., 2015)
– Limitations: data inefficiency, approximate inference
– We consider a very simple alternative technique to resolve this
restaurants, email recipients
Source: http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/#b228dae34c62 , retrieved 6/16/2016
– Apple senior vice president of Software Engineering, June 13, 2016, WWDC16
Quote from http://appleinsider.com/articles/16/06/15/inside-ios-10-apple-doubles-down-on-security-with-cutting-edge-differential-privacy , retrieved 6/16/2016
[the Wikileaks disclosure] “puts the lives of United States and its partners’ service members and civilians at risk.”
– Text analysis (Blei et al., 2003; Goldwater and Griffiths, 2007)
– Personalized recommender systems (Salakhutdinov and Mnih, 2008)
– Medical informatics (Husmeier et al., 2006)
– MOOCs (Piech et al., 2013)
Alice, Bob, Claire, … (Narayanan and Shmatikov, 2008)
Anonymized Netflix data + public IMDB data = identified Netflix data
Source: https://www.buzzfeed.com/nathanwpyle/can-you-spot-all-26-letters-in-this-messy-room-369?utm_term=.gyRdVVvV5#.kkovLL1LE , retrieved 6/16/2016
– We can identify Prof. X’s salary
Differential privacy: the setting
Individuals’ data sit behind a privacy-preserving interface of randomized algorithms; untrusted users submit queries and receive answers.
Requirement: the output of the randomized algorithm should not depend too much on any one data point. If we run the algorithm on two datasets that differ in a single individual’s record, the two output distributions should not change too much: they must be similar!
Ratios of probabilities bounded by e^ε: a randomized algorithm M is ε-differentially private if, for all datasets D, D′ differing in one record and all sets of outcomes S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S].
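As an illustration of this bound (not from the slides), randomized response on a single bit satisfies it exactly: report the true bit with probability e^ε/(1+e^ε), otherwise flip it. A minimal sketch, with names of my own choosing:

```python
import math
import random

def randomized_response(bit, epsilon, rng=random):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it.
    This mechanism is exactly epsilon-differentially private."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit

# Worst-case probability ratio between neighboring inputs (bit = 0 vs. bit = 1):
epsilon = 1.0
p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
ratio = p / (1.0 - p)   # equals e^epsilon, meeting the bound with equality
```

Randomized response is the classic survey technique; it is shown here only to make the probability-ratio definition concrete.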
The Laplace mechanism: add Laplace(Δf/ε) noise to a query answer f(D) to achieve differential privacy, where the sensitivity Δf is the largest change in f from modifying one record. The Laplace density is two exponential distributions, back-to-back.
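A hedged sketch of this mechanism (function names are mine, not from the slides), using the fact that the difference of two independent exponential draws is Laplace-distributed, matching the "back-to-back" picture:

```python
import random

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=random):
    """Release a numeric query epsilon-privately by adding
    Laplace(sensitivity / epsilon) noise to the true answer."""
    scale = sensitivity / epsilon
    # Difference of two Exp(mean=scale) draws is Laplace(scale) noise.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_answer + noise

# Example: release a count privately. Adding or removing one person
# changes a count by at most 1, so the sensitivity is 1.
private_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5)
```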
The exponential mechanism: sample an output θ with probability proportional to exp(u(D, θ)/T), where u is a utility function. The temperature T = 2Δu/ε depends on the sensitivity Δu and on epsilon.
Privacy “for free” via posterior sampling (Dimitrakakis et al., 2014; Wang et al., 2015)
– Interpret posterior sampling as the exponential mechanism with the log joint probability as the utility function: sample θ with probability proportional to exp(log p(θ, D)/T)
– Setting T = 1 gives the privacy we get “for free” from posterior sampling
– For smaller ε, flatten the posterior by increasing the temperature
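For the Beta-Bernoulli model this tempered posterior stays in the Beta family, so the mechanism can be sampled exactly. A minimal sketch of my own (mapping T to a concrete ε requires the sensitivity analysis discussed later):

```python
import random

def tempered_posterior_sample(successes, n, alpha, beta, T, rng=random):
    """Sample theta with probability proportional to p(theta, D)^(1/T):
    the exponential mechanism whose utility is the log joint probability.
    T = 1 is ordinary posterior sampling; larger T flattens the posterior."""
    # Raising theta^(alpha-1+s) (1-theta)^(beta-1+n-s) to the 1/T power
    # yields another Beta distribution with these parameters:
    a = (alpha - 1.0 + successes) / T + 1.0
    b = (beta - 1.0 + n - successes) / T + 1.0
    return rng.betavariate(a, b)

theta = tempered_posterior_sample(successes=7, n=10, alpha=1, beta=1, T=1.0)
```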
The sensitivity here is a worst case over parameters as well as data. Example: the Beta-Bernoulli model.
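To see why the worst case over parameters matters, note that flipping one Bernoulli observation changes the log-likelihood by |log θ − log(1 − θ)|, which is unbounded as θ approaches 0 or 1. A small illustration of my own, assuming the parameter is restricted to [a, 1 − a]:

```python
import math

def bernoulli_loglik_sensitivity(a):
    """Worst-case change in the Bernoulli log-likelihood when one observation
    flips, maximized over theta in [a, 1 - a]; the maximum sits at the boundary."""
    return abs(math.log(a) - math.log(1.0 - a))

# The sensitivity grows without bound as the parameter space widens:
sens = {a: bernoulli_loglik_sensitivity(a) for a in (0.25, 0.1, 0.01)}
```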
Asymptotic normality, as in the Bernstein-von Mises theorem.
– In practice, the exponential mechanism typically must be approximated, e.g. with MCMC.
– If the chain converges, the privacy cost will be close to that of a true posterior sample (Wang et al., 2015); however, we cannot typically verify MCMC convergence.
– Wang et al. (2015) also analyze a private variant of stochastic gradient Langevin dynamics.
Our approach: for an exponential family likelihood, we only need to privatize the sufficient statistics, e.g. with the Laplace mechanism. Inference then uses only the noisy statistics, so we can run as many iterations as we’d like!
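A minimal sketch of this idea for the Beta-Bernoulli case (function and parameter names are mine): add Laplace noise to the sufficient statistic once, then sample from the perturbed posterior as often as desired, since post-processing consumes no further privacy budget.

```python
import random

def private_posterior_sampler(data_sum, n, alpha, beta, epsilon, rng=random):
    """Privatize the sufficient statistic (the count, sensitivity 1) with
    Laplace(1/epsilon) noise, then return a sampler for the noisy posterior."""
    scale = 1.0 / epsilon
    # Laplace noise as the difference of two exponential draws.
    noisy_sum = data_sum + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    # Clamp so the Beta parameters stay valid; this is post-processing and
    # does not weaken the privacy guarantee.
    noisy_sum = min(max(noisy_sum, 0.0), float(n))
    def sample():
        return rng.betavariate(alpha + noisy_sum, beta + n - noisy_sum)
    return sample

draw = private_posterior_sampler(data_sum=70, n=100, alpha=1, beta=1, epsilon=1.0)
samples = [draw() for _ in range(1000)]   # as many draws as we'd like
```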
– US military war logs from the wars in Iraq and Afghanistan disclosed by the Wikileaks organization.
– Multiple emissions per timestep (all logs in that month)
– ε varied from 10^-1 to 10 for held-out log-likelihood experiments (10% of timestep/region pairs held out, 10 train/test splits)
Type Category Casualties
Last 100 samples vs. last 1 sample (results figure).
– In the appendix, we analyze the privacy of Metropolis-Hastings and annealed importance sampling.
– Open problem: make better use of the privacy budget to make these practical.
– New preprint on privacy-preserving EM!
email data, genetic data…
context.
– Open problem: How large is the class of privacy preserving algorithms that are asymptotically efficient?
Joseph Geumlek, Max Welling, Kamalika Chaudhuri
Thanks for your attention!