 
              Sequence Data • Continuous Aggregates • Distance-based sampling • Transformation-based • Model-based filtering and sampling • Frequent sequential patterns
CS573 Data Privacy and Security Differential Privacy – Sequence Data Li Xiong
Continuous Aggregates with Differential Privacy
     t1    t2    t3
a   100    90   100
b    20    50    20
c    20    10    20
Baseline • Compute a perturbed histogram at each time point – large perturbation error
Baseline • Compute one perturbed histogram at a sampled time point – large sampling error
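The two baselines can be contrasted in a short sketch. It assumes that one record changes a single histogram bin by at most 1 at each time point, and that the total budget ε is split evenly over the released time points. The function names, the `laplace` helper, and the budget split are illustrative assumptions rather than part of the lecture; the example counts are the ones from the table on the earlier slide.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_every_time_point(histograms, epsilon, rng):
    """Baseline 1: release a noisy histogram at each of the T time points.

    Assumption: the budget is split evenly, so each release gets epsilon / T
    and the Laplace scale grows to T / epsilon (large perturbation error).
    """
    scale = len(histograms) / epsilon
    return [[count + laplace(scale, rng) for count in hist] for hist in histograms]


def perturb_one_sample(histograms, epsilon, sample_index, rng):
    """Baseline 2: perturb one sampled histogram and reuse it for every time point.

    The whole budget goes to a single release (scale 1 / epsilon), but all
    other time points inherit the sampled values (large sampling error).
    """
    scale = 1.0 / epsilon
    noisy = [count + laplace(scale, rng) for count in histograms[sample_index]]
    return [list(noisy) for _ in histograms]


rng = random.Random(0)
histograms = [[100, 20, 20], [90, 50, 10], [100, 20, 20]]  # rows: t1, t2, t3
every_step = perturb_every_time_point(histograms, epsilon=1.0, rng=rng)
one_sample = perturb_one_sample(histograms, epsilon=1.0, sample_index=0, rng=rng)
```

With three time points the first baseline uses a noise scale three times larger than the second, while the second misses the change at t2 entirely.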
Distance-based Adaptive Sampling • Only generate a new histogram when the change is significant • Need to tune the distance threshold • Haoran Li, Li Xiong, Xiaoqian Jiang, Jinfei Liu. Differentially Private Histogram Publication for Dynamic Datasets: An Adaptive Sampling Approach. CIKM 2015
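A minimal sketch of the idea, not a reproduction of the cited algorithm: at each time point a noisy L1 distance between the current histogram and the last released one is compared with a threshold, and a new noisy histogram is published only when the distance is large. The budget split (half for distance tests, half for at most `max_releases` publications), the parameter names, and the `laplace` helper are all assumptions made for illustration.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def distance_based_release(histograms, epsilon, threshold, max_releases, rng):
    """Publish a new noisy histogram only when the noisy change exceeds threshold.

    Assumed budget accounting: epsilon / 2 is spread over the T distance tests
    and epsilon / 2 over at most max_releases (>= 1) publications. One record
    is assumed to change one bin by at most 1, so both the histogram and the
    L1 distance to the last release have sensitivity 1.
    """
    eps_test = (epsilon / 2) / len(histograms)
    eps_release = (epsilon / 2) / max_releases
    released, last, n_releases = [], None, 0
    for hist in histograms:
        publish = last is None  # always publish the first time point
        if not publish and n_releases < max_releases:
            distance = sum(abs(c - r) for c, r in zip(hist, last))
            publish = distance + laplace(1.0 / eps_test, rng) > threshold
        if publish:
            last = [c + laplace(1.0 / eps_release, rng) for c in hist]
            n_releases += 1
        released.append(list(last))  # otherwise reuse the previous release
    return released, n_releases
```

The threshold trades the two error sources against each other: a low threshold publishes often and spends the release budget quickly, a high threshold reuses stale histograms for longer.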
Transformation-Based • Represent the original counts over time as a time series • Baseline: Laplace perturbation at each time point
Discrete Fourier Transform [Rastogi and Nath, SIGMOD 10] • Pipeline: aggregate time series X → Discrete Fourier Transform → retain only the first d coefficients to reduce sensitivity → Laplace perturbation → inverse DFT → released series R • Higher accuracy (perturbation error O(d) + reconstruction error) • Offline or batch processing only
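The pipeline can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the exact algorithm of the cited paper: it uses an orthonormal DFT so that, by Parseval's identity, the retained coefficients have L2 sensitivity at most that of the series; it adds Laplace noise to the real and imaginary parts of the first d coefficients with scale sqrt(2d)·Δ2/ε; and it takes the L2 sensitivity Δ2 of the series as an input (`l2_sensitivity`). All function and parameter names are hypothetical.

```python
import cmath
import math
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dft_release(series, d, epsilon, l2_sensitivity, rng):
    """Perturb the first d DFT coefficients and invert to a released series."""
    n = len(series)
    if not 1 <= d <= (n + 1) // 2:
        raise ValueError("d must satisfy 1 <= d <= ceil(n / 2)")
    # 2d real numbers are perturbed; their L1 sensitivity is at most
    # sqrt(2d) times the L2 sensitivity of the series (orthonormal DFT).
    scale = math.sqrt(2 * d) * l2_sensitivity / epsilon
    noisy = []
    for k in range(d):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)
        ) / math.sqrt(n)
        noisy.append(
            complex(coeff.real + laplace(scale, rng), coeff.imag + laplace(scale, rng))
        )
    # Inverse DFT with all dropped coefficients set to zero. For a real series
    # coefficient n - k is the conjugate of coefficient k, so each retained
    # k >= 1 contributes twice its real part.
    released = []
    for t in range(n):
        value = noisy[0].real
        for k in range(1, d):
            value += 2 * (noisy[k] * cmath.exp(2j * math.pi * k * t / n)).real
        released.append(value / math.sqrt(n))
    return released
```

A smaller d lowers the perturbation error but raises the reconstruction error, because more of the high-frequency content is discarded. The whole series must be available before the transform, which is why the approach is limited to offline or batch processing.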
Model-Based Filtering and Sampling • Represent the original counts over time as a time series • Use model-based prediction to de-noise
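State-Space Model [Figure: the process evolves the original state $x_k$ into $x_{k+1}$; perturbation turns each state into the noisy observations $z_k$ and $z_{k+1}$]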
Filtering: State-Space Model • Process model: $x_{k+1} = x_k + \omega$, with process noise $\omega \sim \mathcal{N}(0, Q)$ • Measurement model: $z_k = x_k + \nu$, with measurement noise $\nu \sim \mathrm{Lap}(\lambda)$ • Given the noisy measurement $z_k$, how do we estimate the true state $x_k$?
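Posterior Estimation • Denote by $\mathbb{Z}_k = \{z_1, \ldots, z_k\}$ the noisy observations up to time step $k$ • Posterior estimate: $\hat{x}_k = E(x_k \mid \mathbb{Z}_k)$ • Posterior distribution: $f(x_k \mid \mathbb{Z}_k) = \dfrac{f(x_k \mid \mathbb{Z}_{k-1})\, f(z_k \mid x_k)}{f(z_k \mid \mathbb{Z}_{k-1})}$ • Challenge: $f(z_k \mid \mathbb{Z}_{k-1})$ and $f(x_k \mid \mathbb{Z}_{k-1})$ are difficult to compute when $f(z_k \mid x_k) = f(\nu)$ is not Gaussian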
Filtering: Solutions • Option 1: Approximate the measurement noise with a Gaussian, $\nu \sim \mathcal{N}(0, R)$ → the Kalman filter • Option 2: Estimate the posterior density by a Monte Carlo method, $f(x_k \mid \mathbb{Z}_k) = \sum_{i=1}^{N} \pi_k^i\, \delta(x_k - x_k^i)$, where $\{x_k^i, \pi_k^i\}_1^N$ is a set of weighted samples/particles → particle filters • Liyue Fan, Li Xiong. Adaptively Sharing Real-Time Aggregates with Differential Privacy. IEEE TKDE, 2013
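Option 1 can be illustrated with a scalar Kalman filter for the random-walk process model: the state is x, the noisy measurement is z, the process variance is Q (`q`), and the Gaussian stand-in for the Laplace measurement noise has variance R (`r`). This is a sketch under assumptions: the function name and defaults are hypothetical, and setting R to the variance of Lap(λ), which is 2λ², is one plausible choice rather than the calibration used in the cited paper.

```python
def kalman_filter(measurements, q, r, x0=0.0, p0=1.0):
    """Return posterior estimates for a scalar random-walk state.

    q: process noise variance Q, r: measurement noise variance R,
    x0, p0: initial state estimate and its variance (assumed values).
    """
    estimates = []
    x_post, p_post = x0, p0
    for z in measurements:
        # Predict: the random walk keeps the state and inflates the variance.
        x_prior = x_post
        p_prior = p_post + q
        # Correct: blend prediction and measurement using the Kalman gain.
        gain = p_prior / (p_prior + r)
        x_post = x_prior + gain * (z - x_prior)
        p_post = (1.0 - gain) * p_prior
        estimates.append(x_post)
    return estimates


lam = 2.0                # Laplace scale of the perturbation (example value)
r = 2.0 * lam ** 2       # assumption: match the variance of Lap(lam)
noisy_counts = [101.3, 97.8, 104.1, 99.2]  # made-up perturbed measurements
smoothed = kalman_filter(noisy_counts, q=1.0, r=r, x0=100.0, p0=r)
```

Because the filter only post-processes values that were already perturbed, it does not consume additional privacy budget.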
Adaptive Sampling • Adjust the sampling rate based on feedback (the error between the posterior and prior estimates)
Adaptive Sampling: PID Control • Feedback error: measures how well the data model describes the current trend • PID error ($\Delta$): compound of proportional, integral, and derivative errors • Proportional: current error • Integral: integral of errors in a recent time window • Derivative: change rate of errors • Determines a new sampling interval: $I' = I + \theta\,(1 - e^{\frac{\Delta - \xi}{\xi}})$, where $\theta$ represents the magnitude of change and $\xi$ is the set point for the sampling process.
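The controller can be sketched as two small functions. The interval update follows the formula on the slide; the way the three error terms are combined (gains `cp`, `ci`, `cd`, a windowed average for the integral term, a finite difference for the derivative term) and the clamp that keeps the interval at 1 or more are assumptions added for illustration.

```python
import math


def pid_error(errors, times, cp, ci, cd, window):
    """Combine proportional, integral, and derivative feedback errors.

    errors: feedback errors at past sampling points (most recent last)
    times:  the time indices of those sampling points
    cp, ci, cd, window: hypothetical gains and integral window length
    """
    proportional = errors[-1]
    recent = errors[-window:]
    integral = sum(recent) / len(recent)
    if len(errors) < 2:
        derivative = 0.0
    else:
        derivative = (errors[-1] - errors[-2]) / (times[-1] - times[-2])
    return cp * proportional + ci * integral + cd * derivative


def next_interval(interval, delta, theta, xi):
    """Slide formula I' = I + theta * (1 - exp((delta - xi) / xi)).

    The lower bound of 1 is an added safeguard so the interval stays positive.
    """
    return max(1.0, interval + theta * (1.0 - math.exp((delta - xi) / xi)))
```

When the PID error equals the set point the interval is unchanged; a larger error shortens the interval (sample more often) and a smaller error lengthens it.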
Evaluation: Data Sets • Synthetic data with 1000 data points: • Linear: process model • Logistic: $x_k = A(1 + e^{-k})^{-1}$ • Sinusoidal: $x_k = A \sin(\omega k + \varphi)$ • Flu: CDC flu data 2006-2010, 209 data points • Traffic: UW/intelligent transportation systems research 2003-2004, 540 data points • Unemployment: St. Louis Federal Reserve Bank, 478 data points [Figure panels: Traffic, Flu]
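The three synthetic series can be generated with a few lines of Python. This is a sketch: the slide does not give the parameter values, so the amplitude, frequency, phase, process variance, starting value, and the choice of k starting at 0 are all assumptions.

```python
import math
import random


def linear_series(n, x0, q, rng):
    """Random walk from the process model: x_{k+1} = x_k + w, w ~ N(0, q)."""
    xs = [x0]
    for _ in range(n - 1):
        xs.append(xs[-1] + rng.gauss(0.0, math.sqrt(q)))
    return xs


def logistic_series(n, amplitude):
    """x_k = A * (1 + e^{-k})^{-1} for k = 0, ..., n - 1."""
    return [amplitude / (1.0 + math.exp(-k)) for k in range(n)]


def sinusoidal_series(n, amplitude, omega, phi):
    """x_k = A * sin(omega * k + phi) for k = 0, ..., n - 1."""
    return [amplitude * math.sin(omega * k + phi) for k in range(n)]


rng = random.Random(0)
linear = linear_series(1000, x0=100.0, q=1.0, rng=rng)
logistic = logistic_series(1000, amplitude=5000.0)
sinusoidal = sinusoidal_series(1000, amplitude=5000.0, omega=0.01, phi=0.0)
```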
Illustration: Original data stream vs. released data stream • FAST provides less data volume and higher data utility/integrity with a formal privacy guarantee
Results: Utility vs. Privacy [Figure panels: Flu, Traffic, Unemployment]
Multi-dimensional time-series or spatial-temporal data: Traffic monitoring • Goal: release $\mathbf{R}_c$ for each cell $c$ • Approaches: temporal modeling and spatial partitioning • Liyue Fan, Li Xiong, Vaidy Sunderam. Differentially Private Multi-Dimensional Time-Series Release for Traffic Monitoring. DBSec, 2013 (best student paper award)
Temporal modeling and estimation • Univariate time-series modeling for each cell: $x_{k+1} = x_k + \omega$, $\omega \sim \mathcal{N}(0, Q)$ • Road network based process variance modeling: small value of $Q$ for sparse cells; large value for dense cells.
Spatial Partitioning and Estimation • Data has sparse and uniform regions and is dynamically changing • Failed attempt: dynamic feedback-driven partitioning • Approach: road network density based Quad-tree partitioning
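A density-driven quad-tree can be sketched as a short recursive function. This is an illustration of the general idea only: the grid representation, the splitting rule (split a region while its total road density exceeds a threshold), and all names are assumptions, not details taken from the cited paper. The side length of the grid is assumed to be a power of two.

```python
def quadtree_partition(density, x, y, size, threshold):
    """Return leaf regions (x, y, size) of a density-driven quad-tree.

    density:   square grid (list of rows) of road-network density per cell
    x, y:      column and row of the region's top-left cell
    size:      side length of the region (assumed to be a power of two)
    threshold: hypothetical density above which a region is split further
    """
    total = sum(
        density[row][col]
        for row in range(y, y + size)
        for col in range(x, x + size)
    )
    if size == 1 or total <= threshold:
        return [(x, y, size)]  # sparse or atomic region: keep as one partition
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves.extend(quadtree_partition(density, x + dx, y + dy, half, threshold))
    return leaves


grid = [
    [5, 5, 0, 0],
    [5, 5, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
partitions = quadtree_partition(grid, 0, 0, 4, threshold=4)
```

Because the tree depends only on the road network, which is treated here as public, building it does not touch the private counts: dense areas end up with fine cells, and sparse areas are merged into larger partitions.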
Results on Brinkhoff moving objects data Dataset: 500K objects at the beginning; 25K new objects at every timestamp; 100 timestamps
Frequent Sequential Patterns with Differential Privacy • S Xu, S Su, X Cheng, Z Li, L Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015
Frequent sequential pattern mining with medical data • Longitudinal observations • Genome sequences
Non-private frequent sequence mining (FSM) – An Example

Database D
ID    Record
100   a → c → d
200   b → c → d
300   a → b → c → e → d
400   d → b
500   a → d → c → d

Scan D → C1: cand 1-seqs
Sequence   Sup.
{ a }      3
{ b }      3
{ c }      4
{ d }      4
{ e }      1

F1: freq 1-seqs
Sequence   Sup.
{ a }      3
{ b }      3
{ c }      4
{ d }      4

Scan D → C2: cand 2-seqs
Sequence     Sup.
{ a → a }    0
{ a → b }    1
{ a → c }    3
{ a → d }    3
{ b → a }    0
{ b → b }    2
{ b → c }    2
{ b → d }    1
{ c → a }    0
{ c → b }    0
{ c → c }    0
{ c → d }    4
{ d → a }    0
{ d → b }    1
{ d → c }    1
{ d → d }    0

F2: freq 2-seqs
Sequence     Sup.
{ a → c }    3
{ a → d }    3
{ c → d }    4

Scan D → C3: cand 3-seqs
Sequence
{ a → b → c }

F3: freq 3-seqs
Sequence         Sup.
{ a → b → c }    3
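The level-wise procedure in the example can be sketched as follows. The sketch makes three assumptions that the slide does not state explicitly: support is the number of records that contain the pattern as a (not necessarily contiguous) subsequence, the minimum support is 3, and candidate k-sequences are formed by joining two frequent (k-1)-sequences that overlap in k-2 items. Function names are illustrative.

```python
def is_subsequence(pattern, record):
    """True if pattern occurs in record in order (gaps allowed)."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of records that contain the pattern."""
    return sum(is_subsequence(pattern, record) for record in database)


def generate_candidates(frequent, k, items):
    """Candidate k-sequences: all items for k = 1, otherwise a join of
    frequent (k-1)-sequences p and q with p[1:] == q[:-1]."""
    if k == 1:
        return [(item,) for item in items]
    return [p + q[-1:] for p in frequent for q in frequent if p[1:] == q[:-1]]


def mine_frequent_sequences(database, min_support):
    """Return {sequence: support} for all frequent sequences, level by level."""
    items = sorted({item for record in database for item in record})
    result, frequent, k = {}, [], 1
    while True:
        candidates = generate_candidates(frequent, k, items)
        counts = {c: support(c, database) for c in candidates}
        frequent = [c for c in candidates if counts[c] >= min_support]
        if not frequent:
            return result
        result.update({c: counts[c] for c in frequent})
        k += 1


database = [
    ("a", "c", "d"),
    ("b", "c", "d"),
    ("a", "b", "c", "e", "d"),
    ("d", "b"),
    ("a", "d", "c", "d"),
]
frequent_sequences = mine_frequent_sequences(database, min_support=3)
```

Each level requires one scan of the database, and the number of candidates at level 2 is the square of the number of frequent items, which is what makes the private variant expensive in noise.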
Naïve Private FSM

Database D
ID    Record
100   a → c → d
200   b → c → d
300   a → b → c → e → d
400   d → b
500   a → d → c → d

Scan D → C1: cand 1-seqs, noise Lap(|C1| / ε1)
Sequence   Sup.   Noise
{ a }      3       0.2
{ b }      3      -0.4
{ c }      4       0.4
{ d }      4      -0.5
{ e }      1       0.8

F1: freq 1-seqs
Sequence   Noisy Sup.
{ a }      3.2
{ c }      4.4
{ d }      3.5

Scan D → C2: cand 2-seqs, noise Lap(|C2| / ε2)
Sequence     Sup.   Noise
{ a → a }    0       0.2
{ a → c }    3       0.3
{ a → d }    3       0.2
{ c → a }    0      -0.5
{ c → c }    0       0.8
{ c → d }    4       0.2
{ d → a }    0       0.3
{ d → c }    1       2.1
{ d → d }    0      -0.5

F2: freq 2-seqs
Sequence     Noisy Sup.
{ a → c }    3.3
{ a → d }    3.2
{ c → d }    4.2
{ d → c }    3.1

Scan D → C3: cand 3-seqs, noise Lap(|C3| / ε3)
Sequence         Sup.   Noise
{ a → c → d }    3      0
{ a → d → c }    1      0.3

F3: freq 3-seqs
Sequence         Noisy Sup.
{ a → c → d }    3
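A sketch of the naive approach: at each level k, every candidate's support receives Laplace noise of scale |C_k| / ε_k, since one record can change the support of every candidate by 1, and the noisy supports are compared with the threshold. The per-level budgets `epsilons`, the join-based candidate generation, the subsequence definition of support, and the item alphabet derived from the data (a real deployment would need a public alphabet) are assumptions made for illustration.

```python
import random


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def is_subsequence(pattern, record):
    """True if pattern occurs in record in order (gaps allowed)."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def naive_private_fsm(database, threshold, epsilons, rng):
    """Return {sequence: noisy support} using one budget epsilons[k-1] per level k."""
    # Assumption: the item alphabet is public; here it is read off the data.
    items = sorted({item for record in database for item in record})
    result, frequent = {}, []
    for k, eps_k in enumerate(epsilons, start=1):
        if k == 1:
            candidates = [(item,) for item in items]
        else:
            candidates = [
                p + q[-1:] for p in frequent for q in frequent if p[1:] == q[:-1]
            ]
        if not candidates:
            break
        scale = len(candidates) / eps_k  # noise grows with the candidate count
        noisy = {
            c: sum(is_subsequence(c, r) for r in database) + laplace(scale, rng)
            for c in candidates
        }
        frequent = [c for c in candidates if noisy[c] >= threshold]
        result.update({c: noisy[c] for c in frequent})
    return result
```

With realistic budgets the noise at level 2 already scales with the square of the number of surviving items, so true frequent sequences can be dropped and infrequent ones kept, as with { b } and { d → c } in the table.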
Challenges • The large number of generated candidate sequences • It leads to a large amount of perturbation noise required by differential privacy
Observations • Observation 1 • The number of real frequent sequences is much smaller than the number of candidate sequences • In MSNBC, for θ = 0.02, n(2-seq) = 32 vs. n(cand 2-seq) = 225 • In BIBLE, for θ = 0.15, n(2-seq) = 21 vs. n(cand 2-seq) = 254 • Observation 2 • The frequencies of most patterns in a small part of the database are approximately equal to their frequencies in the original database • Reason: the frequency of a pattern can be considered as the probability of a record containing this pattern
PFS² Algorithm • Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning • Basic Idea • Mining frequent sequences in order of increasing length • Using the k-th sample database for pruning candidate k-sequences • Partitioning the original database by random sampling [Figure: the original database is partitioned into the 1st, 2nd, ..., m-th sample databases]
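The partitioning and pruning steps can be sketched as below. This is deliberately incomplete and is not differentially private as written: it omits the noise and any threshold adjustment that the published algorithm applies when estimating frequencies on a sample. The function names, the round-robin split after shuffling, and the `min_frequency` parameter are assumptions made for illustration.

```python
import random


def is_subsequence(pattern, record):
    """True if pattern occurs in record in order (gaps allowed)."""
    remaining = iter(record)
    return all(item in remaining for item in pattern)


def partition_database(database, m, rng):
    """Randomly split the database into m disjoint sample databases."""
    shuffled = list(database)
    rng.shuffle(shuffled)
    return [shuffled[i::m] for i in range(m)]


def prune_candidates(candidates, sample, min_frequency):
    """Keep candidates whose frequency in the sample reaches min_frequency.

    NOTE: no noise is added here, so this step alone gives no privacy guarantee.
    """
    if not sample:
        return list(candidates)
    kept = []
    for candidate in candidates:
        frequency = sum(is_subsequence(candidate, r) for r in sample) / len(sample)
        if frequency >= min_frequency:
            kept.append(candidate)
    return kept
```

The k-th sample database would be used only to prune candidate k-sequences, so that far fewer candidates remain when their supports are computed on the original database. Since the samples are disjoint, each record influences the pruning at only one level.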