

SLIDE 1

Rigorous Foundations for Statistical Data Privacy

Adam Smith Boston University

CWI, Amsterdam November 15, 2018

SLIDE 2

“Privacy” is changing

  • Data-driven systems guiding decisions in many areas
  • Models increasingly complex

[Diagram: weighing the benefits of data (better diagnoses, lower recidivism, …) against control, transparency, and privacy]

SLIDE 3

Privacy in Statistical Databases

Large collections of personal information

  • census data
  • medical/public health
  • online advertising
  • education


[Diagram: individuals contribute data to an “agency” (trusted curator); researchers send queries and receive answers; releases take the form of summaries, complex models, or synthetic data]

SLIDE 4

Two conflicting goals

  • Utility: release aggregate statistics
  • Privacy: individual information stays hidden

How do we define “privacy”?

  • Studied since the 1960s in:
Ø Statistics
Ø Databases & data mining
Ø Cryptography

  • This talk: Rigorous foundations and analysis



SLIDE 5

[Cartoon: “Relax – it can only see metadata.”]

SLIDE 6

This talk

  • Why is privacy challenging?

Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice

  • Differential Privacy [DMNS’06]

Ø “Privacy” as stability to small changes
Ø Widely studied and deployed

  • The “frontier” of research on statistical privacy

Ø Three topics

SLIDE 7

First attempt: Remove obvious identifiers

Everything is an identifier

Images: whitehouse.gov, genesandhealth.org, medium.com

[Ganta Kasiviswanathan S ’08]
“AI recognizes blurred faces” [McPherson Shokri Shmatikov ’16]
Name and ethnicity inferred from genomic data [Gymrek McGuire Golan Halperin Erlich ’13]
[Pandurangan ’14]

SLIDE 8

Is the problem granularity?

What if we only release aggregate information? Statistics released together may still encode individual data.

  • Example: average salary before/after a resignation (see the sketch below)
  • More generally:

Too many, “too accurate” statistics reveal individual information
Ø Reconstruction attacks [Dinur Nissim 2003, …, Cohen Nissim 2017]
Ø Membership attacks [next slide]

Cannot release everything everyone would want to know
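To make the salary example above concrete, here is a minimal Python sketch of such a “differencing” attack, with made-up numbers: each released average is harmless on its own, but the pair pins down one person’s salary exactly.

```python
# Hypothetical salaries; Carol is the person who resigns.
salaries = {"alice": 82_000, "bob": 65_000, "carol": 91_000}

avg_before = sum(salaries.values()) / len(salaries)  # released while Carol is present
del salaries["carol"]
avg_after = sum(salaries.values()) / len(salaries)   # released after Carol resigns

# n_before * avg_before - n_after * avg_after recovers Carol's salary exactly:
print(3 * avg_before - 2 * avg_after)  # 91000.0
```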

SLIDE 9

A Few Membership Attacks

  • [Homer et al. 2008]

Exact high-dimensional summaries allow an attacker to test membership in a data set

Ø Caused US NIH to change data sharing practices

  • [Dwork, S, Steinke, Ullman, Vadhan, FOCS ‘15]

Distorted high-dimensional summaries allow an attacker to test membership in a data set

  • [Shokri, Stronati, Song, Shmatikov, Oakland 2017]

Membership inference using ML as a service (from exact answers)


SLIDE 10

Membership Attacks

[Diagram: a binary data matrix with n people as rows and d attributes as columns, sampled from a larger population]

Suppose

  • We have a data set in which membership is sensitive
Ø Participants in a clinical trial
Ø Targeted ad audience
  • Data has many binary attributes for each person
Ø Genome-wide association studies: d = 1,000,000 attributes (“SNPs”), n < 2000

SLIDE 11

Membership Attacks

[Diagram: the data matrix and Alice’s row; released column averages .50 .75 .50 .50 .75 .50 .25 .25 .50; the attacker compares Alice’s data and population statistics with the averages and outputs “in” or “out”]

  • Release exact column averages
  • Attacker succeeds with high probability when there are more attributes than people (d ≫ n)
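A minimal simulation of this attack, under illustrative assumptions of my own (the talk does not give an implementation): the test statistic below is a standard inner-product score in the spirit of Homer et al., correlating Alice’s row with the released averages, both centered at the population frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10_000                            # more attributes than people
pop_freq = rng.uniform(0.2, 0.8, size=d)      # population frequency of each attribute

data = (rng.random((n, d)) < pop_freq).astype(float)   # the sensitive data set
col_avgs = data.mean(axis=0)                           # the released statistics

alice_in = data[0]                                     # someone in the data set
alice_out = (rng.random(d) < pop_freq).astype(float)   # someone from the population

def score(row):
    # Correlate the person's row with the released averages, both centered
    # at the population frequencies; members score systematically higher.
    return float(np.dot(row - pop_freq, col_avgs - pop_freq))

print(score(alice_in), score(alice_out))   # clearly separated when d >> n
```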

SLIDE 12

Membership Attacks

[Diagram: the same setup, but the released column averages .50 .75 .50 .50 .75 .50 .25 .25 .50 are distorted to .4 .7 .6 .5 .8 .4 .2 .3 .6]

  • Release distorted column averages: ±α in each coordinate, no matter how the distortion is performed
  • Attacker succeeds with high probability when there are more attributes than people and α ≪ √d / n

SLIDE 13

Machine Learning as a Service

[Diagram: sensitive DATA (transactions, preferences, online and offline behavior) feed a Training API that produces a Model; a Prediction API takes input from users and apps and returns classifications]
SLIDE 14

Exploiting Trained Models

[Diagram: the same pipeline; queries with inputs from the training set and inputs not from the training set both return classifications; can an attacker recognize the difference?]

SLIDE 15

Exploiting Trained Models

[Diagram: the same pipeline as the previous slide]

Train a model to recognize the difference between training-set and non-training-set inputs, without knowing the specifics of the actual model!
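As a simplified illustration, this sketch uses a plain confidence threshold rather than the shadow models of [Shokri, Stronati, Song, Shmatikov]; the data, model, and threshold are my own stand-ins. The point it demonstrates is the same: an overfit model answers more confidently on its training set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + rng.normal(scale=2.0, size=400) > 0).astype(int)  # noisy labels
X_train, y_train, X_out = X[:200], y[:200], X[200:]

# An overfit model memorizes its training set...
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# ...so the prediction API's confidence alone hints at membership:
conf_members = model.predict_proba(X_train).max(axis=1)
conf_nonmembers = model.predict_proba(X_out).max(axis=1)
print(conf_members.mean(), conf_nonmembers.mean())  # members look far more confident
```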

SLIDE 16

This talk

  • Why is privacy challenging?

Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice

  • Differential Privacy [DMNS’06]

Ø “Privacy” as stability to small changes
Ø Widely studied and deployed

  • The “frontier” of research on statistical privacy

Ø Three topics

SLIDE 17
Differential Privacy

  • Several current deployments: Apple, Google, US Census
  • Burgeoning field of research: algorithms; crypto & security; statistics & learning; game theory & economics; databases & programming languages; law & policy

SLIDE 18

Differential Privacy

  • Data set x
Ø Domain D can be numbers, categories, tax forms
Ø Think of x as fixed (not random)
  • A = randomized procedure
Ø A(x) is a random variable
Ø Randomness might come from adding noise, resampling, etc.

[Diagram: A takes the data set x and random bits, and outputs the random variable A(x)]

SLIDE 19
Differential Privacy

  • A thought experiment
Ø Change one person’s data (or add or remove them)
Ø Will the probabilities of outcomes change?

[Diagram: A run on x and on x′, each with its own random bits, yielding A(x) and A(x′)]

For any set of outcomes (e.g., “I get denied health insurance”): about the same probability in both worlds.

SLIDE 20

Differential Privacy

[Diagram: A run with local random coins on neighboring data sets x and x′, yielding A(x) and A(x′)]

x′ is a neighbor of x if they differ in one data point.

Definition: A is ε-differentially private if, for all neighbors x, x′ and all subsets S of outputs,

Pr[A(x) ∈ S] ≤ (1 + ε) · Pr[A(x′) ∈ S]

Neighboring databases induce close distributions on outputs.

ε is a leakage measure. (The standard definition uses the factor e^ε, which is ≈ 1 + ε for small ε.)

SLIDE 21

Randomized Response [Warner 1965]

  • Say we want to release the proportion of diabetics in a data set
Ø Each person’s data is 1 bit: xᵢ = 0 or xᵢ = 1
  • Randomized response: each individual rolls a die
Ø 1, 2, 3 or 4: report the true value xᵢ
Ø 5 or 6: report the opposite value 1 − xᵢ
  • Output is the list of reported values y₁, …, yₙ
Ø Satisfies our definition with ε ≈ 0.7
Ø Can estimate the fraction of xᵢ’s that are 1 when n is large

[Diagram: each individual randomizes locally; A(x) = (y₁, …, yₙ)]
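A minimal Python sketch of randomized response as just described (the die roll becomes a pseudorandom draw with probability 2/3):

```python
import random

def report(true_bit: int) -> int:
    """Die shows 1-4 (prob 2/3): tell the truth; 5 or 6 (prob 1/3): lie."""
    return true_bit if random.random() < 2 / 3 else 1 - true_bit

def estimate_fraction(reports) -> float:
    """E[report] = 1/3 + p/3, so p_hat = 3 * mean(reports) - 1 (unbiased)."""
    return 3 * sum(reports) / len(reports) - 1

# Each person's report satisfies the definition with e^eps = (2/3)/(1/3) = 2,
# i.e. eps = ln 2, which is about 0.7.
true_bits = [1] * 300 + [0] * 700            # true fraction is 0.3
reports = [report(b) for b in true_bits]
print(estimate_fraction(reports))            # close to 0.3 for large n
```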

SLIDE 22

Laplace Mechanism

  • Say we want to release a summary f(x) ∈ ℝᵈ
Ø e.g., proportion of diabetics: xᵢ ∈ {0,1} and f(x) = (1/n) Σᵢ xᵢ
  • Simple approach: add noise to f(x)
Ø How much noise is needed?
Ø Idea: calibrate the noise to some measure of f’s volatility

[Diagram: A(x) = f(x) + noise]

SLIDE 23

Laplace Mechanism

  • Global Sensitivity: GS_f = max over neighbors x, x′ of ‖f(x) − f(x′)‖₁
Ø Example: for the proportion f(x) = (1/n) Σᵢ xᵢ, GS_f = 1/n

[Diagram: neighboring data sets x, x′ map to nearby values f(x), f(x′); A(x) = f(x) + noise]

SLIDE 24

Laplace Mechanism

  • Global Sensitivity: GS_f = max over neighbors x, x′ of ‖f(x) − f(x′)‖₁
Ø Example: for the proportion, GS_f = 1/n
Ø The Laplace distribution Lap(λ) has density h(y) ∝ exp(−|y|/λ)
Ø Changing one point translates the curve (by at most GS_f)

[Diagram: A(x) = f(x) + Lap(GS_f/ε) noise]
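A minimal Python sketch of the Laplace mechanism (function and parameter names are mine): noise in each coordinate is drawn from Lap(GS_f/ε).

```python
import numpy as np

def laplace_mechanism(x, f, sensitivity, eps, rng):
    """Release f(x) + Lap(sensitivity/eps) noise in each coordinate (eps-DP)."""
    value = np.asarray(f(x), dtype=float)
    return value + rng.laplace(scale=sensitivity / eps, size=value.shape)

# Proportion of diabetics: one person's bit changes the mean by at most 1/n,
# so the global sensitivity is 1/n.
x = np.array([0, 1, 1, 0, 1, 0, 0, 0, 1, 1] * 100)    # n = 1000 bits, mean 0.5
rng = np.random.default_rng(0)
print(laplace_mechanism(x, np.mean, sensitivity=1 / len(x), eps=0.5, rng=rng))
```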

SLIDE 25
  • Can release d proportions with noise ≈ √d/(εn) per entry
  • Requires the “approximate” variant of DP

Attacks “match” differential privacy

[Diagram: per-entry error scale: reconstruction attacks succeed when noise ≪ 1/√n; sampling error is ≈ 1/√n; robust membership attacks succeed when noise ≪ √d/n; differential privacy is achievable at noise ≈ √d/n]

SLIDE 26

A rich algorithmic field

  • Noise addition: A(x) = f(x) + noise
  • Exponential sampling: sample y ∼ p(y) ∝ exp(ε · quality(x, y))
  • Local perturbation: each user randomizes their own data x₁, x₂, x₃ before it reaches an untrusted aggregator A
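A minimal sketch of exponential sampling, i.e. the exponential mechanism of [McSherry Talwar]; the factor of 2 in the exponent and the sensitivity-1 assumption come from the mechanism’s standard analysis rather than from the slide.

```python
import numpy as np

def exponential_mechanism(x, candidates, quality, eps, rng):
    """Sample a candidate y with probability proportional to exp(eps*q(x,y)/2)."""
    scores = np.array([quality(x, y) for y in candidates], dtype=float)
    weights = np.exp(eps * (scores - scores.max()) / 2)  # shift for numerical stability
    return candidates[rng.choice(len(candidates), p=weights / weights.sum())]

# Privately choose the most common item; quality = count has sensitivity 1.
x = ["a"] * 60 + ["b"] * 35 + ["c"] * 5
rng = np.random.default_rng(0)
print(exponential_mechanism(x, ["a", "b", "c"], lambda data, y: data.count(y),
                            eps=0.1, rng=rng))          # usually "a"
```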

SLIDE 27

Interpreting Differential Privacy

  • A naïve hope:

Your beliefs about me are the same after you see the output as they were before

  • Impossible:
Ø Suppose you know that I smoke
Ø Clinical study: “smoking and cancer correlated”
Ø You learn something about me, whether or not my data were used
  • Differential privacy implies:

No matter what you know ahead of time, you learn (almost) the same things about me whether or not my data are used

Ø Provably resists attacks mentioned earlier


SLIDE 28

Research on (differential) privacy

  • Definitions

Ø Pinning down “privacy”

  • Algorithms: what can we compute privately?

Ø Fundamental techniques
Ø Specific applications

  • Usable systems
  • Attacks: “Cryptanalysis” for data privacy
  • Protocols: Cryptographic tools for large-scale analysis
  • Implications for other areas

Ø Adaptive data analysis
Ø Law and policy


SLIDE 29

This talk

  • Why is privacy challenging?

Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice

  • Differential Privacy [DMNS’06]

Ø “Privacy” as stability to small changes
Ø Widely studied and deployed

  • The “frontier” of research on statistical privacy

Ø Three topics

SLIDE 30

Frontier 1: Deep Learning with DP

[Abadi et al 2016, …]

[Diagram: deep learning maps sensitive data to model parameters. Labels: “Revealed now, but should be hidden” and “Thought of as private now, but better to reason as if public”]
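A minimal numpy sketch of the core update in differentially private SGD from [Abadi et al 2016]: clip each per-example gradient, average, and add Gaussian noise. The privacy accountant that tracks ε across iterations is omitted here, and the parameter values below are illustrative.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, clip_norm, noise_mult, lr, rng):
    """One lot: clip each gradient to L2 norm <= clip_norm, average, add noise."""
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    # Gaussian noise with std sigma*C on the sum, i.e. sigma*C/L on the average.
    noise = rng.normal(scale=noise_mult * clip_norm / len(clipped), size=avg.shape)
    return params - lr * (avg + noise)

rng = np.random.default_rng(0)
grads = [rng.normal(size=5) for _ in range(32)]   # one lot of per-example gradients
w = dp_sgd_step(np.zeros(5), grads, clip_norm=1.0, noise_mult=1.1, lr=0.1, rng=rng)
print(w)
```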

SLIDE 31

Frontier 2: From Law to Technical Definitions

Two central challenges

  • 1. Given a body of law and regulation, what technical definitions comply with that law?
Ø E.g., what suffices to satisfy GDPR?
  • 2. How should we write laws and regulations so they make sense given evolving technology?
Ø E.g., surveillance ≠ physical wiretaps

  • Technical research must inform these questions

Ø E.g., “personally identifiable information” is meaningless

  • [Nissim et al. 2016] When tradeoffs are inherent, mathematical formulations play an important role
Ø E.g., a formal interpretation of FERPA (a US law) mirrors DP
Ø “Singling out” in GDPR is challenging to make sense of

SLIDE 32

Frontier 3: Privacy and overfitting

  • Problem: in modern data analysis, data are re-used across studies
Ø Choice of what analysis to perform can depend on outcomes of previous analyses
  • Differentially private algorithms help prevent overfitting due to adaptivity (see the sketch after the diagram below)

[Diagram: a sample X is drawn from the population; outcome 1 of the first analysis influences the choice of the second analysis, which produces outcome 2, and so on adaptively]
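A toy illustration of the problem (my own construction, not from the talk): attributes selected because they looked correlated with the outcome on a sample keep looking predictive on that same sample, even though the population has no signal at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 1000
X = rng.choice([-1.0, 1.0], size=(n, d))    # attributes: pure noise
y = rng.choice([-1.0, 1.0], size=n)         # labels: independent of the attributes

# "Study 1": measure every attribute's correlation with the label on the sample.
corr = X.T @ y / n
# Adaptive step: keep only attributes that looked significant in study 1.
keep = np.abs(corr) > 2 / np.sqrt(n)

# "Study 2", on the SAME sample: a vote over the kept attributes looks great...
preds = np.sign(X[:, keep] @ np.sign(corr[keep]))
print((preds == y).mean())   # well above 0.5, yet true accuracy is exactly 0.5
```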
SLIDE 33

This talk

  • Why is privacy challenging?

Ø Anonymization often fails
Ø Example: membership attacks, in theory and in practice

  • Differential Privacy [DMNS’06]

Ø “Privacy” as stability to small changes
Ø Widely studied and deployed

  • The “frontier” of research on statistical privacy

Ø Three topics

SLIDE 34

Beyond privacy

  • Data increasingly used to automate decisions
Ø E.g.: lending, health, education, policing, sentencing
  • Traditional security: controlling intrusion
  • Modern security must include trustworthiness of data-driven algorithmic systems
  • Differential privacy formalizes one piece of modern security
Ø What other areas need such scrutiny?

[Diagram: pieces of modern security: privacy, fairness, accountability, resistance to manipulation]