Challenges in Privacy-Preserving Analysis of Structured Data - PowerPoint PPT Presentation



slide-1
SLIDE 1

Challenges in Privacy-Preserving Analysis of Structured Data

Kamalika Chaudhuri

Computer Science and Engineering, University of California, San Diego

slide-2
SLIDE 2

Sensitive Structured Data

Medical Records, Search Logs, Social Networks

slide-3
SLIDE 3

This Talk: Two Case Studies

  • 1. Privacy-preserving HIV Epidemiology
  • 2. Privacy in Time-series data
slide-4
SLIDE 4

HIV Epidemiology

Goal: Understand how HIV spreads among people

slide-5
SLIDE 5

HIV Transmission Data

Patients A and B carry virus sequences Seq-A and Seq-B; a plausible HIV transmission between them is recorded when distance(Seq-A, Seq-B) < t.

slide-6
SLIDE 6

From Sequences to Transmission Graphs

Viral sequences → transmission graph: Node = Patient, Edge = Plausible transmission

slide-7
SLIDE 7

…Growing over Time

Node = Patient, Edge = Transmission. 2015

slide-8
SLIDE 8

…Growing over Time

Node = Patient, Edge = Transmission. 2015 → 2016

slide-9
SLIDE 9

…Growing over Time

Node = Patient, Edge = Transmission. 2015 → 2016 → 2017

slide-10
SLIDE 10

…Growing over Time

2015 → 2016 → 2017

Goal: Release properties of G across time, with privacy

slide-11
SLIDE 11

Problem: Continual Graph Statistics Release

Given: a growing graph G; at time t, new nodes and their adjacent edges (∂Vt, ∂Et) arrive.
Goal: At time t, release f(Gt), where f is a graph statistic and Gt = (∪_{s≤t} ∂Vs, ∪_{s≤t} ∂Es), while preserving patient privacy with high accuracy.

slide-12
SLIDE 12

What kind of Privacy?

Hide: Patient A is in the graph. Release: large-scale properties. (Node = Patient, Edge = Transmission)

slide-13
SLIDE 13

What kind of Privacy?

Hide: A particular patient has HIV. Release: statistical properties (degree distribution, clusters, does therapy help, etc.). Privacy notion: Node Differential Privacy. (Node = Patient, Edge = Transmission)

slide-14
SLIDE 14

Talk Outline

  • The Problem: Private HIV Epidemiology
  • Privacy Definition: Differential Privacy
slide-15
SLIDE 15

Differential Privacy [DMNS06]

Two datasets that differ in one person, run through the same randomized algorithm, produce "similar" output distributions.

Participation of a single person does not change the output

slide-16
SLIDE 16

Differential Privacy: Attacker’s View

Prior Knowledge + Algorithm Output on Data ⇒ Conclusion

Note:

  • a. The algorithm could draw personal conclusions about Alice
  • b. Alice has the agency to participate or not

slide-17
SLIDE 17

Differential Privacy [DMNS06]

For all D, D′ that differ in one person's value, if A is an ε-differentially private randomized algorithm, then:

sup_t | log ( p[A(D) = t] / p[A(D′) = t] ) | ≤ ε
slide-18
SLIDE 18

Differential Privacy

  • 1. Provably strong notion of privacy
  • 2. Good approximations for many functions

e.g., means, histograms, etc.
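As an aside (not from the slides): the standard way to release such approximations is the Laplace mechanism. A minimal sketch for a private histogram, where adding or removing one person changes the counts by at most 1 in L1 norm, so Laplace noise of scale 1/ε per bin suffices; the data here is made up:

```python
import numpy as np

rng = np.random.default_rng(0)

def private_histogram(values, bins, eps):
    """eps-DP histogram: one person moves the counts by at most 1 in L1,
    so Laplace noise with scale 1/eps per bin suffices."""
    counts, _ = np.histogram(values, bins=bins)
    return counts + rng.laplace(scale=1.0 / eps, size=counts.shape)

data = rng.integers(0, 4, size=1000)   # made-up category labels in {0,1,2,3}
noisy = private_histogram(data, bins=np.arange(5), eps=1.0)
```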

slide-19
SLIDE 19

Node Differential Privacy

Node = Patient Edge = Transmission

slide-20
SLIDE 20

Node Differential Privacy

Node = Patient, Edge = Transmission. One person's value = one node + its adjacent edges

slide-21
SLIDE 21

Talk Outline

  • The Problem: Private HIV Epidemiology
  • Privacy Definition: Node Differential Privacy
  • Challenges
slide-22
SLIDE 22

Problem: Continual Graph Statistics Release

Given: a growing graph G; at time t, new nodes and their adjacent edges (∂Vt, ∂Et) arrive.
Goal: At time t, release f(Gt), where f is a graph statistic and Gt = (∪_{s≤t} ∂Vs, ∪_{s≤t} ∂Es), with node differential privacy and high accuracy.

slide-23
SLIDE 23

Why is Continual Release of Graphs with Node Differential Privacy hard?

  • 1. Node DP challenging in static graphs [KNRS13, BBDS13]
  • 2. Continual release of graph data has extra challenges
slide-24
SLIDE 24

Challenge 1: Node DP

Removing one node can change properties by a lot, even for static graphs: in the example, #edges goes from 6 (the size of V) to 0 when one node is removed. Hiding one node therefore needs high noise, which means low accuracy.

slide-25
SLIDE 25

Prior Work: Node DP in Static Graphs

  • Approach 1 [BCS15]: Project to a low-degree graph G′ and use node DP on G′; the projection algorithm needs to be "smooth" and computationally efficient
  • Approach 2 [KNRS13, RS15]: Assume bounded max degree
slide-26
SLIDE 26

Challenge 2: Continual Release of Graphs

  • Methods for tabular data [DNPR10, CSS10] do not apply
  • Sequential composition gives poor utility
  • Graph projection methods are not “smooth” over time
slide-27
SLIDE 27

Talk Outline

  • The Problem: Private HIV Epidemiology
  • Privacy Definition: Node Differential Privacy
  • Challenges
  • Approach
slide-28
SLIDE 28

Algorithm: Main Ideas

Strategy 1: Assume bounded max degree of G (justified by the domain)
Strategy 2: Privately release the "difference sequence" of the statistic, instead of the statistic directly

slide-29
SLIDE 29

Difference Sequence

Graph Sequence: G1, G2, G3, …

Statistic Sequence: f(G1), f(G2), f(G3), …

Difference Sequence: f(G1), f(G2) − f(G1), f(G3) − f(G2), …

slide-30
SLIDE 30

Key Observation

Key Observation: For many graph statistics, when G is degree-bounded, the difference sequence has low sensitivity.

Example Theorem: If max degree(G) = D, then the sensitivity of the difference sequence for #high-degree nodes is at most 2D + 1.

slide-31
SLIDE 31

From Observation to Algorithm

Algorithm:

  • 1. Add noise to each item of the difference sequence (to hide the effect of a single node) and publish it
  • 2. Reconstruct the private statistic sequence from the private difference sequence
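A minimal sketch of this two-step release (not from the talk; the statistic values and degree bound are made up, and the per-item sensitivity 2D + 1 follows the #high-degree-nodes example theorem):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up statistic sequence f(G_1), ..., f(G_T); D is the assumed degree
# bound, and 2*D + 1 is the sensitivity bound from the example theorem.
stats = np.array([3.0, 5.0, 6.0, 9.0])
D, eps = 4, 1.0
sensitivity = 2 * D + 1

diffs = np.diff(stats, prepend=0.0)          # f(G_1), f(G_2)-f(G_1), ...
noisy_diffs = diffs + rng.laplace(scale=sensitivity / eps, size=diffs.shape)
private_stats = np.cumsum(noisy_diffs)       # reconstruct statistic sequence
```

Because each difference has low sensitivity, the noise per item stays small; the prefix sums then recover (a noisy version of) the full statistic sequence.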

slide-32
SLIDE 32

How does this work?

slide-33
SLIDE 33

Experiments - Privacy vs. Utility

Plots: privacy vs. utility for #high-degree nodes and #edges. Our Algorithm; Baselines: DP Composition 1, DP Composition 2.

slide-34
SLIDE 34

Experiments - #Releases vs. Utility

Plots: #releases vs. utility for #high-degree nodes and #edges. Our Algorithm; Baselines: DP Composition 1, DP Composition 2.

slide-35
SLIDE 35

Talk Agenda

Privacy is application-dependent! Two applications:

  • 1. HIV Epidemiology
  • 2. Privacy of time-series data: activity monitoring, power consumption, etc.

slide-36
SLIDE 36

Time Series Data

Physical Activity Monitoring, Location Traces

slide-37
SLIDE 37

Example: Activity Monitoring

Data: Activity trace of a subject
Hide: Activity at each time, against an adversary with prior knowledge
Release: (Approximate) aggregate activity

slide-38
SLIDE 38

Why is Differential Privacy not Right for Correlated Data?

slide-39
SLIDE 39

Example: Activity Monitoring

D = (x1, …, xT), xt = activity at time t; data from a single subject, with a correlation network across time

1-DP: Output a histogram of activities + noise with stdev T. Too much noise - no utility!

slide-40
SLIDE 40

Example: Activity Monitoring

D = (x1, …, xT), xt = activity at time t (correlation network across time)

1-entry-DP: Output an activity histogram + noise with stdev 1. Not enough noise - activities across time are correlated!

slide-41
SLIDE 41

Example: Activity Monitoring

D = (x1, …, xT), xt = activity at time t (correlation network across time)

1-entry-group-DP: Output an activity histogram + noise with stdev T. Too much noise - no utility!

slide-42
SLIDE 42

How to define privacy for Correlated Data?

slide-43
SLIDE 43

Pufferfish Privacy [KM12]

Secret Set S: information to be protected. E.g.: Alice's age is 25; Bob has a disease.

slide-44
SLIDE 44

Pufferfish Privacy [KM12]

Secret Pairs Set Q: pairs of secrets we want to be indistinguishable. E.g.: (Alice's age is 25, Alice's age is 40); (Bob is in the dataset, Bob is not in the dataset).

slide-45
SLIDE 45

Pufferfish Privacy [KM12]

Distribution Class Θ: a set of distributions that plausibly generate the data; may be used to model correlation in the data. E.g.: (connection graph G, disease transmits w.p. in [0.1, 0.5]); (Markov chain with transition matrix in a set P).

slide-46
SLIDE 46

Pufferfish Privacy [KM12]

An algorithm A is ε-Pufferfish private with parameters (S, Q, Θ) if, for all (si, sj) in Q, all θ ∈ Θ with P(si | θ) > 0 and P(sj | θ) > 0, all outputs t, and X ∼ θ:

p(A(X) = t | si, θ) ≤ e^ε · p(A(X) = t | sj, θ)

slide-47
SLIDE 47

Pufferfish Interpretation of DP

Theorem: Pufferfish = Differential Privacy when:
S = { s_{i,a} := "Person i has value a", for all i, all a in domain X }
Q = { (s_{i,a}, s_{i,b}), for all i and all pairs (a, b) in X × X }
Θ = { distributions where each person i is independent }

slide-48
SLIDE 48

Pufferfish Interpretation of DP

Theorem: Pufferfish = Differential Privacy when:
S = { s_{i,a} := "Person i has value a", for all i, all a in domain X }
Q = { (s_{i,a}, s_{i,b}), for all i and all pairs (a, b) in X × X }
Θ = { distributions where each person i is independent }

Theorem: No utility is possible when Θ = { all possible distributions }

slide-49
SLIDE 49

How to get Pufferfish privacy?

Special-case mechanisms exist [KM12, HMD12]. Is there a more general Pufferfish mechanism for a large class of correlated data? Our work: yes, the Markov Quilt Mechanism (also concurrent work [GK16]).

slide-50
SLIDE 50

Correlation Measure: Bayesian Networks

Directed acyclic graph; Node = variable

Joint distribution of variables:

Pr(X1, X2, …, Xn) = ∏_i Pr(Xi | parents(Xi))
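A tiny illustrative check of this factorization (assumed probabilities, not from the talk) for a chain X1 → X2 → X3:

```python
# Chain X1 -> X2 -> X3, all binary, with made-up conditional probabilities.
p_x1 = {0: 0.6, 1: 0.4}                                  # Pr(X1)
p_next = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.3, 1: 0.7}}      # Pr(child | parent)

def joint(x1, x2, x3):
    # Pr(X1, X2, X3) = Pr(X1) * Pr(X2 | X1) * Pr(X3 | X2)
    return p_x1[x1] * p_next[x1][x2] * p_next[x2][x3]

# The factored terms multiply out to a proper joint distribution.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```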

slide-51
SLIDE 51

A Simple Example

Model: a chain X1 → X2 → X3 → … → Xn, with Xi in {0, 1}. State transition probabilities: stay in the current state with probability p, switch with probability 1 − p.

slide-52
SLIDE 52

A Simple Example

Model as on the previous slide. Pr(X2 = 0 | X1 = 0) = p, Pr(X2 = 0 | X1 = 1) = 1 − p, …

slide-53
SLIDE 53

A Simple Example

Model as on the previous slide. Pr(X2 = 0 | X1 = 0) = p, Pr(X2 = 0 | X1 = 1) = 1 − p, …

The influence of X1 diminishes with distance:

Pr(Xi = 0 | X1 = 0) = 1/2 + (1/2)(2p − 1)^(i−1)
Pr(Xi = 0 | X1 = 1) = 1/2 − (1/2)(2p − 1)^(i−1)
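The closed form can be checked numerically against the (i−1)-step transition matrix; a sketch with an arbitrary p (the value 0.7 is not from the talk):

```python
import numpy as np

# Symmetric two-state chain with stay-probability p (value is arbitrary).
p = 0.7
P = np.array([[p, 1 - p],
              [1 - p, p]])

def pr_xi_given_x1(i, start):
    """Pr(X_i = 0 | X_1 = start), via the (i-1)-step transition matrix."""
    return np.linalg.matrix_power(P, i - 1)[start, 0]

for i in (2, 5, 10):
    closed = 0.5 + 0.5 * (2 * p - 1) ** (i - 1)   # closed form from the slide
    assert abs(pr_xi_given_x1(i, 0) - closed) < 1e-12
    assert abs(pr_xi_given_x1(i, 1) - (1 - closed)) < 1e-12
```

Since |2p − 1| < 1, both conditionals converge to 1/2, which is the diminishing-influence claim.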

slide-54
SLIDE 54

Algorithm: Main Idea

Goal: Protect X1

X1 → X2 → X3 → … → Xn

slide-55
SLIDE 55

Algorithm: Main Idea

Goal: Protect X1

X1 → X2 → X3 → … → Xn

Local nodes (high correlation) | Rest (almost independent)

slide-56
SLIDE 56

Algorithm: Main Idea

Goal: Protect X1

X1 → X2 → X3 → … → Xn

Local nodes (high correlation): add noise to hide them
Rest (almost independent): apply a small correction

slide-57
SLIDE 57

Measuring “Independence”

Max-influence of Xi on a set of nodes XR:

e(XR | Xi) = max_{a,b} sup_{θ∈Θ} max_{xR} log [ Pr(XR = xR | Xi = a, θ) / Pr(XR = xR | Xi = b, θ) ]

Low e(XR | Xi) means XR is almost independent of Xi. To protect Xi, the correction term needed for XR is exp(e(XR | Xi)).
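For a concrete feel, the max-influence of X1 on a single node k steps away can be computed from powers of the transition matrix. This is a sketch with one fixed chain (so only the inner max, not the sup over Θ), and the value of p is made up:

```python
import math
import numpy as np

# Same symmetric two-state chain; theta is fixed here, so this illustrates
# only the inner max in the definition of e(X_R | X_i).
p = 0.7
P = np.array([[p, 1 - p], [1 - p, p]])

def max_influence(k):
    """e(X_R | X_i) when X_R is the single node k steps away from X_i."""
    Pk = np.linalg.matrix_power(P, k)
    return max(abs(math.log(Pk[a, x] / Pk[b, x]))
               for a in (0, 1) for b in (0, 1) for x in (0, 1))

# Influence decays with distance, matching the slides' intuition.
vals = [max_influence(k) for k in (1, 3, 6, 10)]
```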

slide-58
SLIDE 58

How to find large “almost independent” sets

Brute-force search is expensive; use structural properties of the Bayesian network.

slide-59
SLIDE 59

Markov Blanket

Markov Blanket(Xi) = set of nodes XS such that Xi is independent of X \ (Xi ∪ XS) given XS (usually: the parents, children, and other parents of children of Xi)

slide-60
SLIDE 60

Define: Markov Quilt

XQ is a Markov Quilt of Xi if:

  • 1. Deleting XQ breaks the graph into XN and XR
  • 2. Xi lies in XN
  • 3. XR is independent of Xi given XQ

(For the Markov Blanket, XN = {Xi})

slide-61
SLIDE 61

Why do we need Markov Quilts?

Given a Markov Quilt: XN = local nodes for Xi, and XQ ∪ XR = the rest.

slide-62
SLIDE 62

From Markov Quilts to Amount of Noise

Let XQ be a Markov Quilt for Xi. The stdev of noise needed to protect Xi is

Score(XQ) = card(XN) / (ε − e(XQ | Xi))

(noise due to XN, with a correction for XQ ∪ XR).

Search over all Markov Quilts to find the one that needs minimum noise.
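A hypothetical sketch of this search for a binary chain, where each candidate quilt is a single node X_k and the local set is X_N = {X_1, …, X_{k−1}} (simplified: one fixed θ rather than a sup over Θ, one-node quilts only, and made-up parameter values):

```python
import math
import numpy as np

# Protect X_1 in a chain of length n; candidate quilt = single node X_k,
# with local set X_N = {X_1, ..., X_{k-1}} of size k - 1.
p, eps, n = 0.7, 1.0, 20
P = np.array([[p, 1 - p], [1 - p, p]])

def influence(k):
    """e(X_k | X_1): max log-ratio over start states and values of X_k."""
    Pk = np.linalg.matrix_power(P, k - 1)
    return max(abs(math.log(Pk[a, x] / Pk[b, x]))
               for a in (0, 1) for b in (0, 1) for x in (0, 1))

def score(k):
    # Score(X_Q) = card(X_N) / (eps - e(X_Q | X_1)); infeasible if e >= eps
    e = influence(k)
    return (k - 1) / (eps - e) if e < eps else float("inf")

best_k = min(range(2, n + 1), key=score)
sigma = score(best_k)   # stdev of the noise used to protect X_1
```

The trade-off is visible in the score: a nearby quilt has a small local set but high influence (large correction), a faraway quilt has low influence but a large local set; the search picks the balance point.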

slide-63
SLIDE 63

Privacy Properties

Privacy: MQM is ε-Pufferfish private

slide-64
SLIDE 64

Graceful Composition

MQM for Markov Chains has:

  • Additive sequential composition
  • Parallel composition with a correction term

X1 → X2 → X3 → … → Xn

slide-65
SLIDE 65

Simulations - Task

Model: a chain X1 → X2 → X3 → … → Xn, Xi in {0, 1}, with state transition probabilities p (stay in state 0) and q (stay in state 1).

Model class: Θ = { chains with p, q ∈ [ℓ, 1 − ℓ] } (i.e., p and q can lie anywhere in [ℓ, 1 − ℓ])

Sequence length = 100

slide-66
SLIDE 66

Simulations - Results

Methods:

  • Two versions of Markov Quilt Mechanism (MQMExact, MQMApprox)
  • GK16

Plots: L1 error vs. ℓ for GK16, MQMApprox, and MQMExact, at ε = 0.2 and ε = 1.

slide-67
SLIDE 67

Real Data - Activity Measurement

Dataset on physical activity by three groups of subjects: 40 cyclists, 16 older women, and 36 overweight women. 4 states (active, standing still, standing moving, sedentary); over 9,000 observations per subject.

Methods: MQMExact, MQMApprox, and GroupDP (GK16 does not apply)

Θ = { empirical data-generating distribution }

slide-68
SLIDE 68

Real Data - Activity Measurement

Plots (ε = 1): relative frequency of each activity state (Active, Stand Still, Stand Moving, Sedentary) under Group-DP, MQMApprox, and MQMExact, shown for the Cyclists, Older, and Overweight groups, with results aggregated over groups.

slide-69
SLIDE 69

Real Data - Power Consumption

Dataset on power consumption in a single household. Power consumption discretized to 51 levels (51 states); over 1 million observations.

Methods: MQMExact vs. MQMApprox (GK16 does not apply; GroupDP has too little utility)

Θ = { empirical data-generating distribution }

slide-70
SLIDE 70

Real Data - Power Consumption

Methods: two versions of the Markov Quilt Mechanism (MQMExact, MQMApprox); results shown at ε = 0.2 and ε = 1.

slide-71
SLIDE 71

Conclusion

  • Real problems have complex privacy challenges
  • Rigorous privacy definitions are available
  • For any privacy problem, important to think:
  • What do we need to hide?
  • What do we need to reveal?
slide-72
SLIDE 72

References

  • "Differentially Private Continual Release of Graph Statistics", S. Song, S. Mehta, S. Vinterbo, S. Little, and K. Chaudhuri, arXiv, 2018.
  • "Pufferfish Privacy Mechanisms for Correlated Data", S. Song, Y. Wang, and K. Chaudhuri, SIGMOD 2018.
  • "Composition Properties of Inferential Privacy on Time-Series Data", S. Song and K. Chaudhuri, Allerton 2018.
slide-73
SLIDE 73

Thanks!