

  1. Sequence Data • Continuous Aggregates • Distance-based sampling • Transformation-based • Model-based filtering and sampling • Frequent sequential patterns

  2. CS573 Data Privacy and Security Differential Privacy – Sequence Data Li Xiong

  3. Continuous Aggregates with Differential Privacy
          t1   t2   t3
      a  100   90  100
      b   20   50   20
      c   20   10   20

  4. Baseline • Compute a perturbed histogram at each time point – large perturbation error

  5. Baseline • Compute one perturbed histogram at a sampled time point – large sampling error
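The two baselines can be contrasted in a short sketch (a minimal illustration assuming histogram sensitivity 1 and an even split of the budget ε across time points; the toy counts are the table from slide 3):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_histogram(hist, epsilon):
    """Perturb one histogram with Laplace noise calibrated to sensitivity 1."""
    return hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)

# Toy stream: counts for bins a, b, c at times t1..t3 (from slide 3).
stream = np.array([[100, 20, 20], [90, 50, 10], [100, 20, 20]], dtype=float)

epsilon = 1.0
T = len(stream)

# Baseline 1: perturb every time point, splitting the budget across T
# releases -> noise scale T/epsilon per release (large perturbation error).
all_points = [laplace_histogram(h, epsilon / T) for h in stream]

# Baseline 2: spend the whole budget at one sampled time point and
# republish it at every step -> low noise but large sampling error.
sampled = laplace_histogram(stream[0], epsilon)
one_point = [sampled for _ in stream]
```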

  6. Distance-based Adaptive Sampling • Only generate a new histogram when change is significant • Need to tune distance threshold Haoran Li, Li Xiong, Xiaoqian Jiang, Jinfei Liu. Differentially Private Histogram Publication for Dynamic Datasets: An Adaptive Sampling Approach. CIKM 2015
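The distance-based idea can be sketched as follows (a minimal illustration, not the CIKM 2015 algorithm itself, which also adapts the threshold and manages the budget more carefully; the fixed `threshold` and per-release budget here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def adaptive_release(stream, epsilon, threshold):
    """Publish a fresh noisy histogram only when it drifts far enough
    from the last release; otherwise re-emit the last release.
    Budget handling is simplified: each fresh release spends `epsilon`."""
    last = None
    releases = []
    for hist in stream:
        noisy = hist + rng.laplace(scale=1.0 / epsilon, size=hist.shape)
        # L1 distance between the candidate release and the last one.
        if last is None or np.abs(noisy - last).sum() > threshold:
            last = noisy
        releases.append(last)
    return releases

stream = [np.array([100., 20., 20.]),
          np.array([90., 50., 10.]),
          np.array([100., 20., 20.])]
releases = adaptive_release(stream, epsilon=1.0, threshold=30.0)
```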

  7. Sequence Data • Continuous Aggregates • Distance-based sampling • Transformation-based • Model-based filtering and sampling • Frequent sequential patterns

  8. Transformation-Based • Represent the original counts over time as a time series • Baseline: Laplace perturbation at each time point

  9. Discrete Fourier Transform [Rastogi and Nath, SIGMOD 10] • Pipeline: aggregate time series X → Discrete Fourier Transform → retain only the first d coefficients (to reduce sensitivity) → Laplace perturbation → inverse DFT → released series R • Higher accuracy: perturbation error O(d) plus reconstruction error • Offline or batch processing only
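The pipeline above can be sketched as follows (a rough stand-in, not Rastogi and Nath's exact mechanism: the noise scale `sqrt(d)/epsilon` is an illustrative assumption for the reduced sensitivity of the retained coefficients):

```python
import numpy as np

rng = np.random.default_rng(2)

def dft_perturb(series, d, epsilon):
    """Keep the first d Fourier coefficients, perturb them, invert.
    Perturbation error grows with d; reconstruction error shrinks with d."""
    n = len(series)
    coeffs = np.fft.rfft(series)
    kept = coeffs[:d]
    # Illustrative noise calibration: the retained coefficients have
    # sensitivity on the order of sqrt(d), so we use scale sqrt(d)/epsilon
    # on both the real and imaginary parts.
    scale = np.sqrt(d) / epsilon
    noisy = (kept
             + rng.laplace(scale=scale, size=d)
             + 1j * rng.laplace(scale=scale, size=d))
    padded = np.zeros_like(coeffs)
    padded[:d] = noisy            # remaining coefficients are zeroed
    return np.fft.irfft(padded, n=n)

x = np.sin(np.linspace(0, 6, 64))
released = dft_perturb(x, d=8, epsilon=1.0)
```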

  10. Sequence Data • Continuous Aggregates • Distance-based sampling • Transformation-based • Model-based filtering and sampling • Frequent sequential patterns

  11. Model-Based Filtering and Sampling • Represent the original counts over time as a time series • Use model-based prediction to de-noise

  12. State-Space Model • Process (original states): x_k → x_{k+1} • Measurements (perturbed): z_k, z_{k+1}

  13. Filtering: State-Space Model • Process model: x_{k+1} = x_k + ω, ω ~ N(0, Q) (process noise) • Measurement model (perturbation): z_k = x_k + ν, ν ~ Lap(λ) (measurement noise) • Given the noisy measurement z_k, how to estimate the true state x_k?

  14. Posterior Estimation • Denote Z_k = {z_1, …, z_k} – the noisy observations up to time k • Posterior estimate: x̂_k = E(x_k | Z_k) • Posterior distribution: f(x_k | Z_k) = f(x_k | Z_{k−1}) f(z_k | x_k) / f(z_k | Z_{k−1}) • Challenge: f(z_k | Z_{k−1}) and f(x_k | Z_{k−1}) are difficult to compute when f(z_k | x_k) = f(ν) is not Gaussian

  15. Filtering: Solutions • Option 1: approximate the measurement noise with a Gaussian, ν ~ N(0, R) → the Kalman filter • Option 2: estimate the posterior density by a Monte-Carlo method, f(x_k | Z_k) ≈ Σ_{j=1}^{N} ω_k^j δ(x_k − x_k^j), where {x_k^j, ω_k^j}_{j=1}^{N} is a set of weighted samples/particles → particle filters • Liyue Fan, Li Xiong. Adaptively Sharing Real-Time Aggregates with Differential Privacy. IEEE TKDE, 2013
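Option 1 can be sketched with a one-dimensional Kalman filter (the Gaussian variance `R` is set to the variance of Lap(1/ε), i.e. 2/ε²; the random-walk data and the `Q` value are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def kalman_denoise(z, Q, R):
    """1-D Kalman filter for x_{k+1} = x_k + w, z_k = x_k + v,
    approximating the Laplace measurement noise as Gaussian N(0, R)."""
    x_hat = z[0]        # start from the first noisy measurement
    P = R               # initial error covariance
    estimates = []
    for zk in z:
        # Predict: random-walk process model inflates the covariance.
        P = P + Q
        # Correct: blend prediction and measurement via the Kalman gain.
        K = P / (P + R)
        x_hat = x_hat + K * (zk - x_hat)
        P = (1 - K) * P
        estimates.append(x_hat)
    return np.array(estimates)

# Usage: de-noise a Laplace-perturbed count series.
true = np.cumsum(rng.normal(0, 1, 100)) + 50
epsilon = 1.0
z = true + rng.laplace(scale=1 / epsilon, size=100)
est = kalman_denoise(z, Q=1.0, R=2 / epsilon**2)  # Var(Lap(1/eps)) = 2/eps^2
```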

  16. Adaptive Sampling • Adjust the sampling rate based on feedback (the error between the posterior and prior estimates)

  17. Adaptive Sampling: PID Control • Feedback error: measures how well the data model describes the current trend • PID error (Δ): a compound of proportional, integral, and derivative errors • Proportional: the current error • Integral: the integral of errors in a recent time window • Derivative: the change rate of errors • Determines a new sampling interval: I′ = I + θ(1 − e^{(Δ−ξ)/ξ}), where θ represents the magnitude of change and ξ is the set point of the sampling process
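A sketch of the interval update (the gains `ki`, `kd` and the error bookkeeping are illustrative assumptions, not the paper's constants):

```python
import math

def next_interval(I, errors, xi, theta, ki=0.2, kd=0.1):
    """PID-style update of the sampling interval. `errors` holds the
    recent feedback errors; xi is the set point, theta the step magnitude."""
    proportional = errors[-1]
    integral = sum(errors) / len(errors)
    derivative = errors[-1] - errors[-2] if len(errors) > 1 else 0.0
    delta = proportional + ki * integral + kd * derivative
    # Grow the interval when the PID error is below the set point xi
    # (the model tracks well); shrink it when the error exceeds xi.
    return max(1, round(I + theta * (1 - math.exp((delta - xi) / xi))))
```

For example, with a small error the interval grows (fewer samples spent), while a large error collapses it back toward sampling every step.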

  18. Evaluation: Data Sets • Synthetic data with 1000 data points each: • Linear: the process model itself • Logistic: x_k = A(1 + e^{−k})^{−1} • Sinusoidal: x_k = A · sin(ωk + φ) • Flu: CDC flu data 2006–2010, 209 data points • Traffic: UW intelligent transportation systems research 2003–2004, 540 data points • Unemployment: St. Louis Federal Reserve Bank, 478 data points

  19. Illustration: Original data stream vs. released data stream • FAST releases fewer data points while achieving higher data utility/integrity, under a formal privacy guarantee

  20. Results: Utility vs. Privacy (plots for the Flu, Traffic, and Unemployment data sets)

  21. Multi-dimensional time-series or spatio-temporal data: Traffic monitoring • Goal: release the count series for each cell c • Approaches: temporal modeling and spatial partitioning Liyue Fan, Li Xiong, Vaidy Sunderam. Differentially Private Multi-Dimensional Time-Series Release for Traffic Monitoring. DBSec, 2013 (best student paper award)

  22. Temporal modeling and estimation • Univariate time-series modeling for each cell: x_{k+1} = x_k + ω, ω ~ N(0, Q) • Road-network-based process variance modeling: small Q for sparse cells, large Q for dense cells

  23. Spatial Partitioning and Estimation • Data has sparse and uniform regions and changes dynamically • Failed attempt: dynamic feedback-driven partitioning • Approach: road-network-density-based quad-tree partitioning

  24. Results on Brinkhoff moving objects data Dataset: 500K objects at the beginning; 25K new objects at every timestamp; 100 timestamps

  25. Sequence Data • Continuous Aggregates • Distance-based sampling • Transformation-based • Model-based filtering and sampling • Frequent sequential patterns

  26. Frequent Sequential Patterns with Differential Privacy S Xu, S Su, X Cheng, Z Li, L Xiong. Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning. ICDE 2015

  27. Frequent sequential pattern mining with medical data • Longitudinal observations • Genome sequences

  28. Non-private frequent sequence mining (FSM) – An Example
      • Database D (min support = 3):
        100: a → c → d
        200: b → c → d
        300: a → b → c → e → d
        400: d → b
        500: a → d → c → d
      • Scan D for C1 (candidate 1-seqs): {a}: 3, {b}: 3, {c}: 4, {d}: 4, {e}: 1
        → F1 (frequent 1-seqs): {a}: 3, {b}: 3, {c}: 4, {d}: 4
      • C2 (candidate 2-seqs): all 16 pairs over {a, b, c, d}; scan D
        → F2 (frequent 2-seqs): {a→c}: 3, {a→d}: 3, {c→d}: 4
      • C3 (candidate 3-seqs, joined from F2): {a→c→d}; scan D
        → F3 (frequent 3-seqs): {a→c→d}: 3

  29. Naïve Private FSM
      • Database D: as in slide 28 (min support = 3)
      • Scan D for C1 and add Lap(|C1|/ε1) noise to each support:
        {a}: 3 + 0.2, {b}: 3 − 0.4, {c}: 4 + 0.4, {d}: 4 − 0.5, {e}: 1 + 0.8
        → F1 (noisy frequent 1-seqs): {a}: 3.2, {c}: 4.4, {d}: 3.5
      • Scan D for C2 (pairs over {a, c, d}) and add Lap(|C2|/ε2) noise:
        {a→a}: 0 + 0.2, {a→c}: 3 + 0.3, {a→d}: 3 + 0.2, {c→a}: 0 − 0.5,
        {c→c}: 0 + 0.8, {c→d}: 4 + 0.2, {d→a}: 0 + 0.3, {d→c}: 1 + 2.1,
        {d→d}: 0 − 0.5
        → F2 (noisy frequent 2-seqs): {a→c}: 3.3, {a→d}: 3.2, {c→d}: 4.2,
          {d→c}: 3.1 (a false positive introduced by the noise)
      • Scan D for C3 and add Lap(|C3|/ε3) noise:
        {a→c→d}: 3 + 0, {a→d→c}: 1 + 0.3
        → F3 (noisy frequent 3-seqs): {a→c→d}: 3
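The naïve private scheme can be sketched end-to-end (the candidate generation and the even budget split across levels are simplifications; `theta` and `epsilon` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy database from the slides: each record is a sequence of items.
DB = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]

def support(db, pattern):
    """Count records containing `pattern` as a (non-contiguous) subsequence."""
    def contains(rec, pat):
        it = iter(rec)
        return all(any(x == p for x in it) for p in pat)
    return sum(contains(rec, pattern) for rec in db)

def naive_private_fsm(db, items, max_len, theta, epsilon):
    """Naive private FSM: at level k, add Lap(|C_k|/eps_k) noise to every
    candidate's support and keep those above theta. The budget epsilon is
    split evenly across the max_len levels."""
    eps_k = epsilon / max_len
    frequent, prev = [], [(i,) for i in items]
    for k in range(1, max_len + 1):
        cands = prev if k == 1 else [p + (i,) for p in prev for i in items]
        noisy = {c: support(db, c) + rng.laplace(scale=len(cands) / eps_k)
                 for c in cands}
        prev = [c for c, s in noisy.items() if s >= theta]
        frequent += prev
    return frequent

patterns = naive_private_fsm(DB, list("abcde"), max_len=3, theta=3, epsilon=30)
```

The `len(cands)/eps_k` noise scale makes the core problem visible: the more candidates a level generates, the noisier every support becomes.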

  30. Challenges • The large number of generated candidate sequences leads to a large amount of perturbation noise, since differential privacy requires noise proportional to the candidate count

  31. Observations • Observation 1: the number of real frequent sequences is much smaller than the number of candidate sequences • In MSNBC, for θ = 0.02: n(freq 2-seq) = 32 vs. n(cand 2-seq) = 225 • In BIBLE, for θ = 0.15: n(freq 2-seq) = 21 vs. n(cand 2-seq) = 254 • Observation 2: the frequencies of most patterns in a small part of the database are approximately equal to their frequencies in the whole database • Reason: the frequency of a pattern can be seen as the probability that a record contains the pattern

  32. PFS² Algorithm • PFS²: Differentially Private Frequent Sequence Mining via Sampling-based Candidate Pruning • Basic idea: • Mine frequent sequences in order of increasing length • Use the k-th sample database to prune candidate k-sequences • Partition the original database into sample databases (1st, 2nd, …, m-th) by random sampling
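The partition-and-prune idea can be sketched as follows (a heavily simplified illustration: the `slack` parameter standing in for the paper's statistically derived pruning bound is an assumption):

```python
import random

random.seed(5)

# Toy database from the slides: each record is a sequence of items.
DB = [list("acd"), list("bcd"), list("abced"), list("db"), list("adcd")]

def support(db, pattern):
    """Count records containing `pattern` as a (non-contiguous) subsequence."""
    def contains(rec, pat):
        it = iter(rec)
        return all(any(x == p for x in it) for p in pat)
    return sum(contains(rec, pattern) for rec in db)

def partition(db, m):
    """Randomly split the database into m disjoint sample databases, one
    per candidate length (any remainder records are dropped for simplicity)."""
    shuffled = db[:]
    random.shuffle(shuffled)
    size = max(1, len(shuffled) // m)
    return [shuffled[i * size:(i + 1) * size] for i in range(m)]

def prune_candidates(cands, sample_db, theta, slack):
    """Keep only candidates whose relative frequency in the sample is not
    clearly below the threshold theta; `slack` absorbs sampling error."""
    n = max(1, len(sample_db))
    return [c for c in cands if support(sample_db, c) / n >= theta - slack]

samples = partition(DB, 2)
kept = prune_candidates([('a', 'c'), ('c', 'a')], samples[0],
                        theta=0.6, slack=0.2)
```

Because each level is pruned on its own small sample database, far fewer candidates survive to the noisy scan of the full database, shrinking the Lap(|C_k|/ε_k) noise.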
