Time Series Compressibility and Privacy



slide-1
SLIDE 1

Time Series Compressibility and Privacy

Spiros Papadimitriou* Feifei Li+ George Kollios+ Philip S. Yu*

*IBM TJ Watson

+Boston University

slide-2
SLIDE 2

2

Intuition / Motivation

Introduce uncertainty about individual values,

while still allowing interesting pattern mining

[Figure: speed vs. time; 55 mph on the highway, 35 mph in the city]

slide-3
SLIDE 3

3

Intuition / Motivation

Introduce uncertainty about individual values,

while still allowing interesting pattern mining

[Figure: speed vs. time; 55 mph on the highway, 35 mph in the city]

Need to publish some value within the band: which one?

slide-4
SLIDE 4

4

Random (white noise)?

[Figure: speed vs. time]

Completely random perturbation?

Cars (typically) don’t drive like this

⇒ Noise can be filtered out
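To make the filtering argument concrete, here is a minimal sketch, not taken from the slides. It assumes a hypothetical speed trace (55 mph, then 35 mph), IID Gaussian noise with a standard deviation of 3 mph, and a plain moving average standing in for the attacker's filter. All names and parameters are illustrative.

```python
import math
import random


def moving_average(values, window):
    """Centered moving average; near the edges, use the part of the window that exists."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def rms(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


rng = random.Random(0)

# Hypothetical trace: 200 highway samples at 55 mph, then 200 city samples at 35 mph.
true_speed = [55.0] * 200 + [35.0] * 200

# Publish the trace with IID (white) noise added to every value.
published = [v + rng.gauss(0.0, 3.0) for v in true_speed]

# An attacker smooths the published values to strip most of the noise.
recovered = moving_average(published, window=15)

print("uncertainty before filtering:", rms(published, true_speed))
print("uncertainty after filtering: ", rms(recovered, true_speed))
```

Because the true trace is smooth almost everywhere, averaging neighbouring samples cancels most of the independent noise, so the second number comes out smaller than the first.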

slide-5
SLIDE 5

5

Deterministic?

[Figure: speed vs. time, with δ marked]

Completely “deterministic” perturbation?

True value leaks
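The leak risk can be illustrated with a small sketch. It assumes, purely for illustration, that the "deterministic" perturbation is a constant offset δ added to every value; the slides do not state the exact form. The trace and the leaked index are hypothetical.

```python
# Hypothetical true trace and offset; `delta` plays the role of δ.
true_speed = [55.0, 54.0, 56.0, 35.0, 36.0, 34.0]
delta = 4.0

# Deterministic perturbation (assumed form): shift every value by the same amount.
published = [v + delta for v in true_speed]

# The attacker learns a single true value, here the one at index 2,
# and infers the offset from it.
leaked_index = 2
estimated_delta = published[leaked_index] - true_speed[leaked_index]

# Subtracting the inferred offset recovers the entire series.
recovered = [p - estimated_delta for p in published]

print(recovered == true_speed)  # True
```

Under this assumption one leaked value is enough to undo the whole perturbation, which is the opposite failure mode to white noise.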

slide-6
SLIDE 6

6

First extreme case

White noise

Completely random

slide-7
SLIDE 7

7

Summary of extreme cases

Completely “deterministic”

?

Completely random

slide-8
SLIDE 8

8

Summary of extreme cases

Completely “deterministic”

Adaptively combine completely random and completely “deterministic”?

Completely random

slide-9
SLIDE 9

9

Main challenge

Completely random

Completely “deterministic”

Combining both

Knowledge of an arbitrary number of true values

Knowledge of signal’s subspace (“shape”) with arbitrary precision

slide-10
SLIDE 10

10

Goals

Partial “information hiding” via data perturbation, for time series

Perturbation adapts to data properties

Automatically combines “random” and “deterministic” at appropriate scales

Evaluate against both: filtering and true value leaks

Suitable for on-the-fly, streaming perturbation

slide-11
SLIDE 11

11

Overview

Definitions
Method
Experiments
Conclusion

slide-12
SLIDE 12

12

Utility = discord

Published values are (in expectation) within σ of the true values.

[Figure: values plotted against time]

slide-13
SLIDE 13

13

Privacy = final uncertainty

Recovered values are (in expectation) within the final uncertainty of the true values.

[Figure: values plotted against time]
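The two definitions can be sketched numerically. The formulas themselves did not survive on these slides, so the sketch assumes both quantities are root-mean-square distances, in line with the "% RMS" axis labels used in the experiments. The names `x` (true), `y` (published) and `x_hat` (attacker's reconstruction) and all values are hypothetical.

```python
import math


def rms_distance(a, b):
    """Root-mean-square distance between two equal-length sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


x = [1.0, 2.0, 3.0, 4.0]          # true values
y = [1.5, 1.5, 3.5, 3.5]          # published (perturbed) values
x_hat = [1.25, 1.75, 3.25, 3.75]  # values recovered by an attacker (made up)

discord = rms_distance(y, x)          # utility: how far published is from true
uncertainty = rms_distance(x_hat, x)  # privacy: how far recovered is from true

print(discord)      # 0.5
print(uncertainty)  # 0.25
```

In this toy case the attacker has removed half of the injected uncertainty: the published values are 0.5 away from the truth, but the recovered values are only 0.25 away.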

slide-14
SLIDE 14

14

Goal

Recovery of true values is based on assumptions about the attack model, with specific background knowledge:

Linear filtering

Linear reconstruction (based on true values)

Goal:

slide-15
SLIDE 15

15

Overview

Definitions
Method
Experiments
Conclusion

slide-16
SLIDE 16

16

Wavelet and Fourier representations

One-slide refresher

[Figure: Fourier representation (time and frequency axes) and wavelet representation (scale, i.e. frequency, and time axes)]
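As a companion to the refresher, the following sketch contrasts the two representations on a toy signal. It assumes NumPy, and it uses the Haar wavelet for simplicity; the slides do not say which wavelet family the authors use. `haar_dwt` is an illustrative helper, not a library function.

```python
import numpy as np


def haar_dwt(x):
    """Full orthonormal Haar decomposition of a length-2^k array.

    Returns [approximation, coarsest detail, ..., finest detail].
    """
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # local differences
        approx = (even + odd) / np.sqrt(2.0)         # local averages
    return [approx] + details[::-1]


t = np.arange(64)
signal = np.sin(2 * np.pi * 4 * t / 64)  # 4 cycles over the window
signal[40] += 3.0                        # plus one localized spike

fourier = np.fft.rfft(signal, norm="ortho")  # localized in frequency
wavelet = haar_dwt(signal)                   # localized in scale and time

print("dominant Fourier bin:", int(np.argmax(np.abs(fourier))))
print("finest-scale position of largest wavelet coefficient:",
      int(np.argmax(np.abs(wavelet[-1]))))
```

The sinusoid shows up as a single dominant Fourier bin (bin 4), while the spike shows up as a single large finest-scale Haar coefficient at position 20, which covers samples 40 and 41.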

slide-17
SLIDE 17

17

Our work

Fourier-based perturbation

Batch

Wavelet-based perturbation

Batch Streaming

slide-18
SLIDE 18

18

Fourier-based perturbation

Intuition

[Figure: original series + perturbation = perturbed series, shown in the time domain and in the frequency domain. Annotations: 100 on the original series, ≈ σ on the perturbation, ≈ 100 ± σ and ≈ σ on the perturbed series.]

Energy concentrated in a few coefficients: high compression
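One way to read this intuition is sketched below, under stated assumptions rather than as the authors' algorithm. The noise is placed only on the few largest-magnitude Fourier coefficients, with equal allocation among them, and then rescaled so that the time-domain discord is exactly σ. The function name `fourier_perturb` and the `keep` parameter are hypothetical.

```python
import numpy as np


def fourier_perturb(x, sigma, rng, keep=4):
    """Add noise only on the `keep` largest Fourier bins; RMS discord equals sigma.

    Assumption: equal noise allocation among the chosen bins.
    """
    x = np.asarray(x, dtype=float)
    coeffs = np.fft.rfft(x)

    # Rank bins by magnitude, ignoring the DC term so the mean is untouched.
    mags = np.abs(coeffs)
    mags[0] = 0.0
    chosen = np.argsort(mags)[-keep:]

    # Random complex noise on the chosen bins only.
    noise_coeffs = np.zeros_like(coeffs)
    noise_coeffs[chosen] = rng.normal(size=keep) + 1j * rng.normal(size=keep)

    # Back to the time domain, then rescale to the requested discord.
    noise = np.fft.irfft(noise_coeffs, n=len(x))
    noise *= sigma / np.sqrt(np.mean(noise ** 2))
    return x + noise


rng = np.random.default_rng(0)
t = np.arange(256)
x = np.sin(2 * np.pi * 5 * t / 256) + 0.5 * np.sin(2 * np.pi * 12 * t / 256)
y = fourier_perturb(x, sigma=0.1, rng=rng, keep=2)

print("discord (RMS):", np.sqrt(np.mean((y - x) ** 2)))
```

Because the noise occupies the same frequencies as the signal, a filter that keeps the signal's dominant frequencies also keeps the noise, unlike in the white-noise case.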

slide-19
SLIDE 19

19

Fourier-based perturbation

Intuition & Summary

[Figure: time and frequency axes]

slide-20
SLIDE 20

20

Wavelet-based perturbation

Intuition & Summary

[Figure: time and scale (frequency) axes]
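A batch version of the wavelet-based idea might look like the sketch below. This is an illustration under assumptions, not the authors' allocation rule: it uses the Haar wavelet, perturbs only detail coefficients whose magnitude is at least σ, and rescales the result so the discord equals σ. All function names are hypothetical.

```python
import numpy as np


def haar_forward(x):
    """Orthonormal Haar transform; returns (approximation, details), finest first."""
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details


def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level to the finest."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx


def wavelet_perturb(x, sigma, rng):
    """Add noise only where detail coefficients exceed sigma; RMS discord = sigma."""
    x = np.asarray(x, dtype=float)
    approx, details = haar_forward(x)
    noisy = [np.where(np.abs(d) >= sigma, rng.normal(size=d.shape), 0.0)
             for d in details]
    noise = haar_inverse(np.zeros_like(approx), noisy)
    scale = np.sqrt(np.mean(noise ** 2))
    if scale == 0.0:  # nothing exceeded the threshold
        return x.copy()
    return x + noise * (sigma / scale)


rng = np.random.default_rng(1)
t = np.arange(64)
x = 45 + 10 * np.sin(2 * np.pi * t / 64)
y = wavelet_perturb(x, sigma=0.5, rng=rng)

print("discord (RMS):", np.sqrt(np.mean((y - x) ** 2)))
```

Since wavelet coefficients are indexed by both scale and time, the noise lands only at the scales and time spans where the signal itself has energy.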

Next: How to do this online? (1) Wavelet transform; (2) Noise allocation

slide-21
SLIDE 21

21

Streaming perturbation

(1) Wavelet transform: summary

Forward transform:

post-order traversal

O(lg N) space, O(1) time (amortized)

[Figure: items numbered 1 to 7]
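The space and time bounds above can be illustrated with a streaming Haar transform. This is a minimal sketch assuming the Haar wavelet; the class name `StreamingHaar` and its interface are hypothetical. It keeps at most one unpaired approximation coefficient per level and emits each detail coefficient as soon as both of its children are available, that is, in post-order over the coefficient tree.

```python
import math


class StreamingHaar:
    """Incremental orthonormal Haar transform with one pending value per level."""

    def __init__(self):
        # pending[level] holds an unpaired approximation coefficient, or None.
        self.pending = []

    def push(self, value):
        """Consume one value; return the (level, detail) pairs it completes.

        Level 1 is the finest scale.
        """
        emitted = []
        level, approx = 0, float(value)
        while True:
            if level == len(self.pending):
                self.pending.append(None)
            if self.pending[level] is None:
                # No partner yet: park the coefficient and stop.
                self.pending[level] = approx
                return emitted
            # A partner exists: combine the pair and carry to the next level.
            left = self.pending[level]
            self.pending[level] = None
            emitted.append((level + 1, (left - approx) / math.sqrt(2.0)))
            approx = (left + approx) / math.sqrt(2.0)
            level += 1


transform = StreamingHaar()
for value in [1.0, 2.0, 3.0, 4.0]:
    print(value, transform.push(value))
```

The carry pattern is the same as incrementing a binary counter, which is why the work per value is constant on average. The `pending` list grows by one slot each time the stream length doubles, giving the logarithmic space bound.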

slide-22
SLIDE 22

22

Streaming perturbation

(1) Wavelet transform: summary

Inverse transform:

pre-order traversal

O(lg N) space, O(1) time (amortized)

[Figure: two sequences, each numbered 1 to 7]

slide-23
SLIDE 23

23

Streaming perturbation

(2) Noise allocation: summary

Challenge:

Knowing only the wavelet coefficients up to the current time, how can we allocate the noise online so that it is as close as possible to the batch allocation?

Indefinite publication delay?

[Figure label: current value]

slide-24
SLIDE 24

24

Streaming perturbation

(2) Noise allocation: summary

Batch

Per-band lookahead [see paper for details]

[Figure legend: exceeds threshold; perturbed]

slide-25
SLIDE 25

25

Overview

Definitions
Method
Experiments
Conclusion

slide-26
SLIDE 26

26

Experimental overview

Datasets:

Chlorine: chlorine concentration in a drinking-water distribution network

Light: light intensity measurements (Intel Berkeley)

SP500: Standard & Poor's 500 index

[Figure: time series plots of the Chlorine, Light and SP500 datasets]

slide-27
SLIDE 27

27

Experimental overview

Varying discord levels and perturbation methods:

IID

Fourier-based (FFT)

Batch wavelet-based (DWT)

Streaming wavelet-based (str. DWT)

Filter: wavelet shrinkage [Donoho / TOIT95]

True values: linear regression
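The two attacks named here can be sketched as follows. These are simplified stand-ins, not the experimental code: the filter soft-thresholds Haar detail coefficients using the universal threshold σ√(2 ln N), and the true-value attack fits a straight line from published to true values on a handful of leaked points. The function names, the Haar choice and the toy data are assumptions.

```python
import numpy as np


def haar_forward(x):
    """Orthonormal Haar transform; returns (approximation, details), finest first."""
    approx = np.asarray(x, dtype=float)
    details = []
    while len(approx) > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return approx, details


def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level to the finest."""
    approx = np.asarray(approx, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * len(approx))
        out[0::2] = (approx + d) / np.sqrt(2.0)
        out[1::2] = (approx - d) / np.sqrt(2.0)
        approx = out
    return approx


def shrinkage_filter(y, noise_std):
    """Filtering attack: soft-threshold detail coefficients (universal threshold)."""
    approx, details = haar_forward(y)
    thr = noise_std * np.sqrt(2.0 * np.log(len(y)))
    shrunk = [np.sign(d) * np.maximum(np.abs(d) - thr, 0.0) for d in details]
    return haar_inverse(approx, shrunk)


def regression_attack(y, known_idx, known_true):
    """Leak attack: fit true ~ a * published + b on leaked pairs, apply everywhere."""
    a, b = np.polyfit(y[known_idx], known_true, deg=1)
    return a * y + b


def rms(a, b):
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


rng = np.random.default_rng(0)
x = np.concatenate([np.full(128, 55.0), np.full(128, 35.0)])  # toy true series
y = x + rng.normal(0.0, 3.0, size=x.size)                     # IID perturbation

filtered = shrinkage_filter(y, noise_std=3.0)
leaked = np.array([0, 50, 130, 200])                          # hypothetical leaks
regressed = regression_attack(y, leaked, x[leaked])

print("discord:                 ", rms(y, x))
print("remaining after filter:  ", rms(filtered, x))
print("remaining after leaks:   ", rms(regressed, x))
```

Comparing the remaining uncertainty after each attack with the original discord gives the kind of "removed noise" and "remaining noise" figures reported in the experiment slides.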

slide-28
SLIDE 28

28

Removed uncertainty

[Figure: removed noise (%) vs. discord σ (% RMS), by perturbation method]

slide-29
SLIDE 29

29

Removed uncertainty

Average (over ten runs):

IID noise: excellent resilience to leaks, very poor for filtering

Other methods: comparable

slide-30
SLIDE 30

30

Removed uncertainty

Maximum (over ten runs):

Fourier may perform poorly for “non-smooth” signals

slide-31
SLIDE 31

31

Removed uncertainty

Maximum (over ten runs):

Fourier may perform poorly for “non-smooth” signals

[Figure: Light dataset, with accompanying panels]

slide-32
SLIDE 32

32

“True” uncertainty

[Figure: remaining noise (% RMS) vs. discord σ (% RMS)]

slide-33
SLIDE 33

33

“True” uncertainty

Average (over ten runs):

IID noise: very poor overall

Other methods: comparable

slide-34
SLIDE 34

34

“True” uncertainty

Maximum (over ten runs):

Fourier may perform poorly for “non-smooth” signals

slide-35
SLIDE 35

35

Scalability

Constant processing time per measurement

slide-36
SLIDE 36

36

Overview

Definitions
Method
Experiments
Conclusion

slide-37
SLIDE 37

37

Related work (1/2)

Privacy-preserving data mining

SMC

[Lindell & Pinkas / CRYPTO00], [Vaidya & Clifton / KDD02]

Partial information hiding

Perturbation

[Agrawal & Srikant / SIGMOD00], [Du & Zhan / KDD03], [Kargupta, Datta, Wang & Sivakumar / ICDM03], [Agrawal & Aggarwal / EDBT04], [Chen & Liu / ICDM05], [Huang, Du & Chen / SIGMOD05], [Liu, Ryan & Kargupta / TKDE05], [Li et al. / ICDE07]

k-anonymity

[Sweeney / IJUFKS02], [Aggarwal & Yu / EDBT04], [Bertino, Ooi, Yang & Deng / ICDE05], [Kifer & Gehrke / SIGMOD06], [Machanavajjhala, Gehrke & Kifer / ICDE06], [Xiao & Tao / SIGMOD06]

Interactive privacy

[Blum, Dwork, McSherry & Nissim / PODS05], [Dwork, McSherry, Nissim & Smith / TCC06]

SSDBs [Denning / TODS80]

Wavelets in DM

[Gilbert, Kotidis, Muthukrishnan & Strauss / VLDB01], [Garofalakis & Gibbons / SIGMOD02], [Bulut & Singh / ICDE03], [Papadimitriou, Brockwell & Faloutsos / VLDB04], [Lin, Vlachos, Keogh & Gunopulos / EDBT04], [Karras & Mamoulis / VLDB05]

Compression and DM

[Keogh, Lonardi & Ratanamahatana / KDD04]

slide-38
SLIDE 38

38

Related work (2/2)

Correlated perturbation [Kargupta, Datta, Wang & Sivakumar / ICDM03], [Huang, Du & Chen / SIGMOD05], for streams [Li et al. / ICDE07]

L-diversity [Machanavajjhala, Gehrke & Kifer / ICDE06] and personalized privacy [Xiao & Tao / SIGMOD06]

Dimensionality curse and privacy [Aggarwal / VLDB05]

Watermarking [Sion, Atallah & Prabhakar / TKDE06]

Compressed sensing [Donoho / TOIT06], [Candès, Romberg & Tao / TOIT06]

slide-39
SLIDE 39

39

Conclusion

Partial information hiding via data perturbation

User-defined discord (utility)

Adapts to data properties

Automatically combines “random” and “deterministic” at appropriate scales

Additionally preserves spectral properties

Evaluate against both: filtering and true value leaks

Suitable for on-the-fly, streaming perturbation

Perturbing data objects with any “structure” is non-trivial, even under fixed attack model(s)

slide-40
SLIDE 40

Time Series Compressibility and Privacy

Spiros Papadimitriou* Feifei Li+ George Kollios+ Philip S. Yu*

*IBM TJ Watson

+Boston University

Thank you

slide-41
SLIDE 41

41

Per-band allocation

Fourier equal alloc.: “spreads” noise if signal is non-smooth

Wavelets: time-adaptive anyway

BACKUP

slide-42
SLIDE 42

42

Per-band allocation

BACKUP

slide-43
SLIDE 43

43

Marginals

[Figure: Light, CDF P(z) of z ≡ |y_t - x_t| for IID, Fourier and Wavelet perturbation]

[Figure: Chlorine, CDF P(z) of z ≡ |y_t - x_t| for IID, Fourier and Wavelet perturbation]

BACKUP