Sequence Data – CS573 Data Privacy and Security: Differential Privacy (Li Xiong)
SLIDE 1

Sequence Data

  • Continuous Aggregates
  • Distance-based sampling
  • Transformation-based
  • Model-based filtering and sampling
  • Frequent sequential patterns
SLIDE 2

CS573 Data Privacy and Security Differential Privacy – Sequence Data

Li Xiong

SLIDE 3

Continuous Aggregates with Differential Privacy

Example: category counts at successive time points

       t1   t2   t3
  a   100   90  100
  b    20   50   20
  c    20   10   20

SLIDE 4

Baseline

  • Compute a perturbed histogram at each time point – large perturbation error
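A minimal sketch of this baseline, assuming Python with NumPy and an even per-timestamp budget split of ε/T under sequential composition (the function name and the even split are illustrative, not taken from the slides):

```python
import numpy as np

def perturbed_histogram_series(hists, total_eps):
    """Baseline sketch: perturb every histogram independently.

    hists: (T, bins) array of true counts, one histogram per time point.
    Splitting the budget evenly gives eps_t = total_eps / T per time point,
    so the Laplace scale 1/eps_t grows linearly with T -- the "large
    perturbation error" noted above.
    """
    hists = np.asarray(hists, dtype=float)
    eps_t = total_eps / len(hists)
    return hists + np.random.laplace(scale=1.0 / eps_t, size=hists.shape)

# The 3-category series from slide 3 over t1..t3:
series = [[100, 20, 20], [90, 50, 10], [100, 20, 20]]
print(perturbed_histogram_series(series, total_eps=1.0))
```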

SLIDE 5

Baseline

  • Compute one perturbed histogram at a sampled time point – large sampling error

SLIDE 6

Distance-based Adaptive Sampling

  • Only generate a new histogram when the change is significant (see the sketch below)
  • Need to tune the distance threshold

Haoran Li, Li Xiong, Xiaoqian Jiang, Jinfei Liu. Differentially Private Histogram Publication for Dynamic Datasets: An Adaptive Sampling Approach. CIKM 2015
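A simplified sketch of distance-based sampling in Python/NumPy; the budget split between the distance test and publication, the mean-L1 distance, and all names are illustrative choices, not the CIKM 2015 algorithm's exact design:

```python
import numpy as np

def adaptive_release(hists, eps_dist, eps_pub, threshold):
    """Publish a fresh noisy histogram only when the noisy distance from
    the last release exceeds the threshold; otherwise re-publish it."""
    released, last = [], None
    for h in np.asarray(hists, dtype=float):
        if last is None:
            last = h + np.random.laplace(scale=1.0 / eps_pub, size=h.shape)
        else:
            # noisy distance test between current data and the last release
            dist = np.abs(h - last).mean()
            dist += np.random.laplace(scale=1.0 / eps_dist)
            if dist > threshold:   # change is significant: sample again
                last = h + np.random.laplace(scale=1.0 / eps_pub, size=h.shape)
        released.append(last.copy())
    return np.array(released)
```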

SLIDE 7

Sequence Data

  • Continuous Aggregates
  • Distance-based sampling
  • Transformation-based
  • Model-based filtering and sampling
  • Frequent sequential patterns
SLIDE 8

Transformation-Based

  • Represent the original counts over time as a time series
  • Baseline: Laplace perturbation at each time point

SLIDE 9

Discrete Fourier Transform [Rastogi and Nath, SIGMOD 10]

Pipeline: aggregate time series X → discrete Fourier transform → retain only the first d coefficients (reduces sensitivity) → Laplace perturbation → inverse DFT → released series R

  • Higher accuracy (perturbation error O(d) + reconstruction error)
  • Offline or batch processing only
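A sketch of this pipeline in Python/NumPy; the noise calibration below (scale √d/ε on each retained coefficient) is a simplification of the paper's L2-sensitivity analysis, and the function name is illustrative:

```python
import numpy as np

def dft_perturb(x, d, eps):
    """Fourier perturbation sketch: DFT -> keep the first d coefficients ->
    Laplace noise -> inverse DFT."""
    F = np.fft.fft(x)
    kept = np.zeros_like(F)
    kept[:d] = F[:d]                       # retain first d coefficients
    scale = np.sqrt(d) / eps               # assumed calibration (see lead-in)
    kept[:d] += (np.random.laplace(scale=scale, size=d)
                 + 1j * np.random.laplace(scale=scale, size=d))
    return np.fft.ifft(kept).real          # reconstruct the released series R
```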
SLIDE 10

Sequence Data

  • Continuous Aggregates
  • Distance-based sampling
  • Transformation-based
  • Model-based filtering and sampling
  • Frequent sequential patterns
SLIDE 11

Model-Based Filtering and Sampling

  • Represent the original counts over time as a time series
  • Use model-based prediction to de-noise

SLIDE 12

State-Space Model

[Diagram: hidden states x_k, x_{k+1} (the original series, linked by the process model) emit the noisy measurements z_k, z_{k+1} (the perturbed series)]

SLIDE 13

Filtering: State-Space Model

  • Process model:
    x_{k+1} = x_k + ω,  ω ~ N(0, Q)   (process noise)
  • Measurement model:
    z_k = x_k + ν,  ν ~ Lap(λ)   (measurement noise from the perturbation)
  • Given the noisy measurement z_k, how to estimate the true state x_k?

SLIDE 14

Posterior Estimation

  • Denote Z_k = {z_1, …, z_k}, the noisy observations up to time k
  • Posterior estimate:
    x̂_k = E(x_k | Z_k)
  • Posterior distribution (Bayes' rule):
    f(x_k | Z_k) = f(x_k | Z_{k−1}) · f(z_k | x_k) / f(z_k | Z_{k−1})
  • Challenge: f(z_k | Z_{k−1}) and f(x_k | Z_{k−1}) are difficult to compute when f(z_k | x_k) = f(ν) is not Gaussian

SLIDE 15

Filtering: Solutions

  • Option 1: approximate the measurement noise with a Gaussian
    ν ~ N(0, R)  →  the Kalman filter
  • Option 2: estimate the posterior density by a Monte Carlo method
    f(x_k | Z_k) ≈ Σ_{j=1..N} π_k^j · δ(x_k − x_k^j)
    where {x_k^j, π_k^j}_{j=1..N} is a set of weighted samples (particles)
    →  particle filters

Liyue Fan, Li Xiong. Adaptively Sharing Real-Time Aggregates with Differential Privacy. IEEE TKDE, 2013
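A minimal sketch of Option 1 in Python/NumPy: a scalar Kalman filter with identity dynamics, where the Laplace noise Lap(1/ε) is approximated as Gaussian. Pairing R with the Laplace variance 2/ε² is an assumption consistent with the model above, not a quote of the paper's exact filter:

```python
import numpy as np

def kalman_denoise(z, Q, R):
    """Scalar Kalman filter for x_{k+1} = x_k + w, z_k = x_k + v,
    with w ~ N(0, Q) and v ~ N(0, R) (Gaussian approximation of Laplace)."""
    x_est, P = z[0], R                       # initialize from first measurement
    out = [x_est]
    for zk in z[1:]:
        P = P + Q                            # predict: prior variance grows by Q
        K = P / (P + R)                      # Kalman gain
        x_est = x_est + K * (zk - x_est)     # update with the measurement
        P = (1 - K) * P
        out.append(x_est)
    return np.array(out)

# e.g. for an eps-DP series z = x_true + Lap(1/eps) noise, a natural
# choice is R = 2 / eps**2, the variance of Lap(1/eps).
```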

SLIDE 16

Adaptive Sampling

  • Adaptive sampling: adjust the sampling rate based on feedback (the error between the posterior and prior estimates)

SLIDE 17

Adaptive Sampling: PID Control

  • Feedback error: measures how well the data model describes the current trend
  • PID error (Δ): a compound of proportional, integral, and derivative errors
  • Proportional: the current error
  • Integral: the integral of errors over a recent time window
  • Derivative: the rate of change of errors
  • Determines a new sampling interval:
    I′ = I + θ(1 − e^((Δ−ξ)/ξ))
    where θ represents the magnitude of change and ξ is the set point for the sampling process
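A sketch of the interval update in Python; the PID gains and the error window below are illustrative placeholders, since the slide does not fix them:

```python
import math

def next_interval(errors, I, theta, xi, cp=0.8, ci=0.1, cd=0.1):
    """Compute the PID error Delta from recent feedback errors (newest
    last) and return the new sampling interval I'."""
    prop = errors[-1]                                   # proportional term
    integ = sum(errors) / len(errors)                   # integral over window
    deriv = errors[-1] - errors[-2] if len(errors) > 1 else 0.0  # derivative
    delta = cp * prop + ci * integ + cd * deriv         # compound PID error
    # error above the set point xi -> shorter interval (sample more often)
    return max(1, round(I + theta * (1 - math.exp((delta - xi) / xi))))
```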

SLIDE 18

Evaluation: Data Sets

  • Synthetic data with 1000 data points each:
  • Linear: the process model
  • Logistic: x_k = B(1 + e^(−k))^(−1)
  • Sinusoidal: x_k = B · sin(ωk + φ)
  • Flu: CDC flu data 2006–2010, 209 data points
  • Traffic: UW intelligent transportation systems research 2003–2004, 540 data points
  • Unemployment: St. Louis Federal Reserve Bank, 478 data points

[Plots: the Flu and Traffic series]

SLIDE 19

Illustration: Original data stream vs. released data stream

  • FAST provides lower data volume and higher data utility/integrity with a formal privacy guarantee

SLIDE 20

Results: Utility vs. Privacy

[Plots: utility vs. privacy budget on Flu, Traffic, and Unemployment]

SLIDE 21

Multi-dimensional time-series or spatial-temporal data: Traffic monitoring

  • Goal: release a private series R_c for each cell c
  • Approaches: temporal modeling and spatial partitioning

Liyue Fan, Li Xiong, Vaidy Sunderam. Differentially Private Multi-Dimensional Time-Series Release for Traffic Monitoring. DBSec, 2013 (best student paper award)

SLIDE 22

Temporal modeling and estimation

  • Univariate time-series modeling for each cell:
    x_{k+1} = x_k + ω,  ω ~ N(0, Q)
  • Road-network-based process variance modeling: small Q for sparse cells, large Q for dense cells

SLIDE 23

Spatial Partitioning and Estimation

  • Data has sparse and uniform regions and is dynamically changing
  • Failed attempt: dynamic feedback-driven partitioning
  • Approach: road-network-density-based quad-tree partitioning

SLIDE 24

Results on Brinkhoff moving objects data

Dataset: 500K objects at the beginning; 25K new objects at every timestamp; 100 timestamps

SLIDE 25

Sequence Data

  • Continuous Aggregates
  • Distance-based sampling
  • Transformation-based
  • Model-based filtering and sampling
  • Frequent sequential patterns
SLIDE 26

Frequent Sequential Patterns with Differential Privacy

S Xu, S Su, X Cheng, Z Li, L Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015

SLIDE 27

Frequent sequential pattern mining with medical data

  • Longitudinal observations
  • Genome sequences
SLIDE 28

Non-private frequent sequence mining (FSM) – An Example

Database D (minimum support 3):
  ID   Record
  100  a→c→d
  200  b→c→d
  300  a→b→c→e→d
  400  d→b
  500  a→d→c→d

Scan D → C1 (candidate 1-seqs): {a}:3, {b}:3, {c}:4, {d}:4, {e}:1
F1 (frequent 1-seqs): {a}:3, {b}:3, {c}:4, {d}:4

C2 (candidate 2-seqs): all 16 pairs of frequent items; scan D for supports, e.g. {a→b}:1, {a→c}:3, {a→d}:3, {b→c}:2, {b→d}:2, {c→d}:4, {d→b}:1, {d→c}:1, {d→d}:1
F2 (frequent 2-seqs): {a→c}:3, {a→d}:3, {c→d}:4

C3 (candidate 3-seqs): {a→c→d}; scan D
F3 (frequent 3-seqs): {a→c→d}:3
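A compact sketch of this Apriori-style mining loop in Python (names illustrative); run on the example database, it reproduces F1–F3 above:

```python
from itertools import product

def is_subseq(pat, rec):
    """True if pat occurs in rec in order (not necessarily contiguously)."""
    it = iter(rec)
    return all(item in it for item in pat)

def fsm(db, theta):
    """Apriori-style FSM: generate candidates, scan D, keep frequent k-seqs."""
    freq, cands = {}, [(i,) for i in sorted({i for r in db for i in r})]
    while cands:
        sups = {c: sum(is_subseq(c, r) for r in db) for c in cands}
        fk = {c: s for c, s in sups.items() if s >= theta}
        freq.update(fk)
        # join frequent k-seqs whose length-(k-1) suffix and prefix match
        cands = [a + (b[-1],) for a, b in product(fk, fk) if a[1:] == b[:-1]]
    return freq

db = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]
print(fsm(db, theta=3))
# {('a',): 3, ('b',): 3, ('c',): 4, ('d',): 4,
#  ('a','c'): 3, ('a','d'): 3, ('c','d'): 4, ('a','c','d'): 3}
```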

SLIDE 29

Naïve Private FSM

Same database D; at each pass k, every candidate's support is perturbed with Laplace noise before thresholding:

Scan D → C1 (candidate 1-seqs) with noise, e.g. {a}: 3+0.2, {b}: 3−0.4, {c}: 4+0.4, {d}: 4−0.5, {e}: 1+0.8
F1 (noisy sup ≥ 3): {a}:3.2, {c}:4.4, {d}:3.5

C2 (from F1): {a→a}, {a→c}, {a→d}, {c→a}, {c→c}, {c→d}, {d→a}, {d→c}, {d→d}; scan D and add noise, e.g. {a→c}: 3+0.3, {a→d}: 3+0.2, {c→d}: 4+0.2, {d→c}: 1+2.1
F2: {a→c}:3.3, {a→d}:3.2, {c→d}:4.2, {d→c}:3.1 (a false positive caused by the noise)

C3 (from F2): {a→c→d}, {a→d→c}; scan D and add noise, e.g. {a→c→d}: 3+0.3
F3: {a→c→d}; {a→d→c} (true support 1) is pruned

Noise scales: Lap(|C1| / ε1), Lap(|C2| / ε2), Lap(|C3| / ε3)
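A sketch of one perturbation pass in Python/NumPy (the function name is illustrative). The Lap(|Ck|/εk) scale is what makes a large candidate set costly, which motivates the pruning that follows:

```python
import numpy as np

def noisy_pass(sups, eps_k, theta):
    """Perturb each candidate's support with Lap(|Ck|/eps_k) noise and
    keep the candidates whose noisy support clears the threshold."""
    scale = len(sups) / eps_k          # noise grows with the candidate count
    noisy = {c: s + np.random.laplace(scale=scale) for c, s in sups.items()}
    return {c: ns for c, ns in noisy.items() if ns >= theta}
```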

SLIDE 30

Challenges

  • The large number of generated candidate sequences leads to a large amount of perturbation noise required by differential privacy

SLIDE 31

Observations

  • Observation 1: the number of real frequent sequences is much smaller than the number of candidate sequences
  • In MSNBC, for θ = 0.02, n(freq 2-seq) = 32 vs. n(cand 2-seq) = 225
  • In BIBLE, for θ = 0.15, n(freq 2-seq) = 21 vs. n(cand 2-seq) = 254
  • Observation 2: the frequencies of most patterns in a small part of the database are approximately equal to their frequencies in the original database
  • Reason: the frequency of a pattern can be considered as the probability of a record containing the pattern

SLIDE 32

PFS2 Algorithm

  • PFS2: Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning
  • Basic idea:
  • Mine frequent sequences in order of increasing length
  • Use the kth sample database to prune candidate k-sequences
  • Partition the original database by random sampling (a sketch follows below)

[Diagram: the original database is partitioned into the 1st, 2nd, …, mth sample databases]
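A minimal sketch of the partitioning step in Python; the seeded shuffle and round-robin split are illustrative choices for producing a disjoint random partition:

```python
import random

def partition_db(db, m, seed=0):
    """Randomly partition db into m disjoint sample databases; the k-th
    one (1-indexed) is later used to prune candidate k-sequences."""
    rng = random.Random(seed)
    recs = list(db)
    rng.shuffle(recs)
    return [recs[i::m] for i in range(m)]
```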

SLIDE 33

Overview

  • Mining frequent k-sequences:
  1. Generate candidate k-sequences Ck, using the frequent (k−1)-sequences
  2. Prune the candidate k-sequences, utilizing the kth sample database
  3. Compute noisy supports of the remaining candidates and output the frequent k-sequences Fk

[Flow: Ck → prune with the kth sample database → Ck′ → compute noisy supports on the original database (Laplace mechanism) → Fk]

SLIDE 34

Candidate Pruning

  • Privacy requirement
  • Add noise to the local supports
  • The length of records affects sensitivity
  • How to reduce the length of sequence records?
  • Misestimation
  • Sequences in the sample databases are randomly drawn
  • How to reduce the probability of misestimating frequent sequences as infrequent?


SLIDE 35

Candidate Pruning

  • Sequence Shrinking / Threshold Relaxation

[Flow: kth sample database → sequence shrinking → transformed sample database → local noisy supports (Laplace mechanism) → threshold relaxation → prune Ck to Ck′]

  • Sequence shrinking enforces the length constraint in the sample database

SLIDE 36

Candidate Pruning

  • Sequence Shrinking / Threshold Relaxation

[Flow diagram as on slide 35]

  • Threshold relaxation relaxes the frequency threshold for the sample database

SLIDE 37

Sequence Shrinking

  • Input
  • A sequence record S; maximal length constraint lmax; the candidate k-sequences
  • Goal
  • Construct a new sequence S′ (|S′| = lmax)
  • The number of common candidate k-sequences contained in both S and S′ is maximized
  • Three schemes (a sketch of the first follows below):
  • Irrelevant Item Deletion
  • Consecutive Pattern Compression
  • Sequence Reconstruction
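A sketch of the first scheme in Python, under the assumption that "irrelevant" means the item occurs in no current candidate k-sequence (the helper name is illustrative):

```python
def delete_irrelevant_items(record, candidates):
    """Irrelevant Item Deletion: drop items that occur in no candidate
    k-sequence, since they cannot affect any candidate's support."""
    relevant = {item for cand in candidates for item in cand}
    return [item for item in record if item in relevant]

# e.g. with candidates {('a','c'), ('c','d')}, record a→b→c→e→d
# shrinks to a→c→d, preserving the supports of both candidates.
print(delete_irrelevant_items(list("abced"), {('a', 'c'), ('c', 'd')}))
```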
SLIDE 38

Experiments

  • Datasets: MSNBC, BIBLE, House_Power (see the following slides)
  • Comparison
  • Prefix ("Differentially private transit data publication: a case study on the Montreal transportation system", KDD '12)
  • n-gram ("Differentially private sequential data publication via variable-length n-grams", CCS '12)
  • Metrics (a sketch follows below)
  • F-score = 2 · precision · recall / (precision + recall)
  • Relative Error: RE = median_{x ∈ X′} |sup′(x) − sup(x)| / sup(x), where sup′ is the noisy support
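The two metrics in Python, as a direct transcription of the formulas above (function names illustrative):

```python
import statistics

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def relative_error(true_sup, noisy_sup):
    """Median of |sup'(x) - sup(x)| / sup(x) over the released patterns."""
    return statistics.median(
        abs(noisy_sup[x] - true_sup[x]) / true_sup[x] for x in noisy_sup)
```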

SLIDE 39

Experiments

  • Mining results vs. threshold

[Plots: F-score and RE vs. threshold θ on MSNBC, BIBLE, and House_Power]

SLIDE 40

Experiments

  • Mining results vs. privacy budget
  • Relative error on MSNBC with θ = 0.015
  • Relative error on House_Power with θ = 0.34

[Plots: results vs. privacy budget on MSNBC and House_Power]

SLIDE 41

Sequence Data

  • Continuous Aggregates
  • Distance-based sampling
  • Transformation-based
  • Model-based filtering and sampling
  • Frequent sequential patterns