Creating Probabilistic Databases from Imprecise Time-Series Data



slide-1
SLIDE 1

Creating Probabilistic Databases from Imprecise Time-Series Data

Saket Sathe, Hoyoung Jeung, Karl Aberer EPFL, Switzerland

13th April, 2011

  • S. Sathe, H. Jeung, K. Aberer (2011)

EPFL, Switzerland 1 / 15

slide-2
SLIDE 2

Outline

[Figure: a floor plan with rooms 1 to 4 and a probability distribution p(R), centered at μ, showing Alice's position. The 3σ area serves as a reasonable boundary, and the probability that Alice is in a room, for example room 4, is the integral of p(R) dR over room4 ∩ 3σ area.]

raw_values:

time   x     y
1      1.1   2.3
2      1.3   2.1
:      :     :

prob_view (how do we obtain it from raw_values?):

time   room   probability
1      1      0.5
1      2      0.1
1      3      0.3
1      4      0.1
2      1      0.2
2      2      0.4
2      3      0.1
2      4      0.3


slide-3
SLIDE 3

Outline


  • Dynamic Density Metrics
  • Measure of Quality
  • Efficiently creating probabilistic views
  • Approximating Gaussian distributions using σ-cache
  • Parameter setting under provable guarantees
  • Experiments


slide-4
SLIDE 4

Problem Setting

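[Figure: a time series of values plotted against time, with ticks at t-H-1, t-1 and t. The window $S^H_{t-1}$ ends at $r_{t-1}$, and the density $p_t(R_t)$ with spread σ is drawn at time t.]

Dynamic Density Metric

Given $S^H_{t-1}$, the dynamic density metric infers time-dependent probability distributions $p_t(R_t)$ at time t, where $R_t$ is a random variable associated with $r_t$.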


slide-5
SLIDE 5

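GARCH Metric

[Figure: a time series of values against time with the window $S^H_{t-1}$. At time t the density $p_t(R_t) \sim N(\hat{r}_t, \hat{\sigma}_t^2)$ is centered at $\hat{r}_t$, and the density values $p_t(R_t = r_t)$ and $p_t(R_t = \hat{r}_t)$ are marked.]

  • $\hat{r}_t$ is modeled using an ARMA model
  • $\hat{\sigma}_t^2$ is modeled using a GARCH model

Thus $p_t(R_t)$ is $N(\hat{r}_t, \hat{\sigma}_t^2)$. We refer to this approach as ARMA-GARCH.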
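The following is a minimal sketch of how such a density could be produced, not the authors' implementation. It assumes an AR(1) mean model (a special case of ARMA) and a GARCH(1,1) variance recursion, with hand-picked coefficients instead of coefficients fitted on the window. The function name and all parameter values are hypothetical.

```python
def arma_garch_density(window, phi0=0.0, phi1=1.0,
                       alpha0=0.01, alpha1=0.1, beta1=0.8):
    """Return (r_hat, sigma2_hat) for the value that follows `window`.

    Assumptions (not from the slides):
      mean:     r_hat_t  = phi0 + phi1 * r_{t-1}                 (AR(1))
      variance: sigma2_t = alpha0 + alpha1 * a_{t-1}^2
                           + beta1 * sigma2_{t-1}                (GARCH(1,1))
    where a_t is the one-step prediction error. Coefficients are fixed,
    hand-picked values; a real system would estimate them from the data.
    """
    if len(window) < 2:
        raise ValueError("need at least two values in the window")
    # Start the recursion from the unconditional GARCH(1,1) variance.
    sigma2 = alpha0 / (1.0 - alpha1 - beta1)
    for prev, cur in zip(window, window[1:]):
        residual = cur - (phi0 + phi1 * prev)  # prediction error a_t
        sigma2 = alpha0 + alpha1 * residual ** 2 + beta1 * sigma2
    r_hat = phi0 + phi1 * window[-1]
    return r_hat, sigma2


# Example: a short window of sensor readings.
r_hat, sigma2_hat = arma_garch_density([4.2, 5.9, 7.1, 7.9])
print(r_hat, sigma2_hat)
```

The returned pair plays the role of $(\hat{r}_t, \hat{\sigma}_t^2)$, so the inferred density at time t is the Gaussian with that mean and variance.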


slide-6
SLIDE 6

Quality of Dynamic Density Metrics

Metric                        $\hat{r}_t$      $\hat{\sigma}_t^2$
ARMA-GARCH                    ARMA             GARCH
Uniform Thresholding (UT)     ARMA             u (user-specified)
Variable Thresholding (VT)    ARMA             sample variance of $S^H_{t-1}$
Kalman-GARCH                  Kalman Filter    GARCH

slide-7
SLIDE 7

Quality of Dynamic Density Metrics

Problem: The true density $\hat{p}_t(R_t)$ is not observable.

slide-8
SLIDE 8

Quality of Dynamic Density Metrics

Indirect Method

Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on (0, 1) when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \left(U_Z(x) - Q_Z(x)\right)^2}, \qquad (1)$$

where $U_Z(z)$ is the ideal uniform cdf on (0, 1) and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
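The indirect method can be sketched as follows, assuming Gaussian inferred densities and a sum over an evenly spaced grid on [0, 1]. The grid size of 101 points and the function names are assumptions made for illustration; the slides do not state how x is discretized.

```python
import math


def normal_cdf(x, mean, std):
    """Cdf of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def density_distance(observations, means, stds, grid_points=101):
    """Euclidean distance between the uniform cdf and the observed cdf of z_t."""
    # Probability integral transform: z_t = P(R_t <= r_t) under p_t(R_t).
    z = [normal_cdf(r, m, s) for r, m, s in zip(observations, means, stds)]
    total = 0.0
    for k in range(grid_points):
        x = k / (grid_points - 1)                 # grid over [0, 1] (assumed)
        q = sum(1 for v in z if v <= x) / len(z)  # observed cdf Q_Z(x)
        total += (x - q) ** 2                     # ideal uniform cdf: U_Z(x) = x
    return math.sqrt(total)


# Observations that all sit on the predicted mean give z_t = 0.5 everywhere,
# which is far from uniform; spread-out observations score lower.
clustered = density_distance([0.0] * 4, [0.0] * 4, [1.0] * 4)
spread = density_distance([-1.15, -0.32, 0.32, 1.15], [0.0] * 4, [1.0] * 4)
print(clustered, spread)
```

A lower density distance means the inferred densities are closer to the unobservable true densities in the sense of the uniformity test above.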


slide-9
SLIDE 9

Probabilistic View Generation

CREATE VIEW prob_view AS
  DENSITY r OVER t
  OMEGA delta=2, n=2
  FROM raw_values
  WHERE t >= 1 AND t <= 3

[Figure: the framework. The user issues the probabilistic view generation query; the dynamic density metrics turn the sensor readings in raw_values into $\hat{r}_t$ and $\hat{\sigma}_t$, which define $p_t(R_t)$; the Ω-View builder, backed by the σ-cache, produces prob_view.]

raw_values and the inferred density parameters:

t    r     $\hat{r}_t$   $\hat{\sigma}_t$
1    4.2   4.0           0.3
2    5.9   6.0           3.2
3    7.1   7.0           2.9
4    7.9   7.7           0.2

prob_view (ranges ω1 and ω2 with their probabilities):

       ω1      probability   ω2      probability
r1     [2:4]   0.50          [0:2]   0.01
r2     [4:6]   0.23          [2:4]   0.08
r3     [5:7]   0.25          [3:5]   0.16
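The work of the view builder can be sketched as below: for every time t it integrates the Gaussian $p_t(R_t)$ over ranges of width Δ placed relative to $\hat{r}_t$. This is an illustrative sketch only. The slides do not spell out which offsets λ are used, so the function takes them as a parameter; the example passes λ = -1 and λ = -2 because these reproduce the ranges [2:4] and [0:2] shown for $\hat{r}_1 = 4.0$. All names are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """Cdf of N(mean, std^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def build_prob_view(densities, delta, lambdas):
    """densities: {t: (r_hat, sigma_hat)} -> list of (t, low, high, probability)."""
    rows = []
    for t, (r_hat, sigma_hat) in sorted(densities.items()):
        for lam in lambdas:
            low = r_hat + lam * delta          # a = r_hat + lambda * delta
            high = r_hat + (lam + 1) * delta   # b = r_hat + (lambda + 1) * delta
            prob = (normal_cdf(high, r_hat, sigma_hat)
                    - normal_cdf(low, r_hat, sigma_hat))
            rows.append((t, low, high, prob))
    return rows


# Density parameters for t = 1..3, taken from the example tables.
densities = {1: (4.0, 0.3), 2: (6.0, 3.2), 3: (7.0, 2.9)}
for row in build_prob_view(densities, delta=2, lambdas=(-1, -2)):
    print(row)
```

Each row needs two cdf evaluations, so the cost grows with the number of matching tuples times the number of ranges.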

slide-10
SLIDE 10

Probabilistic View Generation


Problem: Large computational cost when time interval and n are large and ∆ is small (finer granularity)


slide-11
SLIDE 11

Probabilistic View Generation


Idea: Cache and reuse computation of probability values from earlier times


slide-12
SLIDE 12

Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with parameters $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.

Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

slide-13
SLIDE 13

Constraint-Aware Caching

  • Distance constraint: guarantees that the maximum approximation error is upper bounded by the distance constraint when the cache is used.
  • Memory constraint: guarantees that the cache does not use more memory than that specified by the memory constraint.


slide-15
SLIDE 15

Constraint-Aware Caching

[Figure: the densities $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_{t'}^2)$, centered at $\hat{r}_t$ and $\hat{r}_{t'}$. The figure marks the range [a, b] with $a = \hat{r}_t + \lambda\Delta$ and $b = \hat{r}_t + (\lambda+1)\Delta$ on the first density and the range [a', b'] with $a' = \hat{r}_{t'} + \lambda\Delta$ and $b' = \hat{r}_{t'} + (\lambda+1)\Delta$ on the second, with the note that $\rho_\lambda$ remains unchanged.]

slide-16
SLIDE 16

Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as the distance measure; $0 \le H \le 1$.

Theorem: Distance Constraint

Given a user-defined distance constraint $H'$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H'$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$d_s \le \frac{1 + \sqrt{1 - (1 - H'^2)^4}}{(1 - H'^2)^2}.$$

We call $d_s$ the ratio threshold.

Example

Suppose $H' = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$; then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, we have $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
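The bound can be checked numerically with the standard closed form for the Hellinger distance between two Gaussians. The sketch below assumes the two densities are compared with their means aligned, which the slide does not state explicitly; function names are hypothetical.

```python
import math


def ratio_threshold(h_max):
    """Largest ratio threshold d_s permitted by the distance constraint H'."""
    c = (1.0 - h_max ** 2) ** 2
    return (1.0 + math.sqrt(1.0 - c ** 2)) / c


def hellinger_gaussians(mean1, std1, mean2, std2):
    """Hellinger distance between N(mean1, std1^2) and N(mean2, std2^2)."""
    var_sum = std1 ** 2 + std2 ** 2
    # Bhattacharyya coefficient of two univariate Gaussians.
    bc = (math.sqrt(2.0 * std1 * std2 / var_sum)
          * math.exp(-((mean1 - mean2) ** 2) / (4.0 * var_sum)))
    return math.sqrt(max(0.0, 1.0 - bc))


d_max = ratio_threshold(0.2)
print(d_max)                                     # upper bound on d_s for H' = 0.2
print(hellinger_gaussians(0.0, 1.0, 0.0, 1.4))   # distance for the choice d_s = 1.4
```

With equal means, a standard-deviation ratio exactly equal to the bound yields a Hellinger distance of exactly $H'$, and any smaller ratio yields a smaller distance.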


slide-17
SLIDE 17

Initializing the σ–cache

  • Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.
  • Compute Q such that $\max(\hat{\sigma}_t) = d_s^Q \cdot \min(\hat{\sigma}_t)$.
  • $\lceil Q \rceil$ gives the maximum number of distributions that we should cache.

slide-18
SLIDE 18

Initializing the σ–cache

[Figure: the σ-cache in memory. It holds cached values for the distributions with standard deviations $d_s^1 \cdot \min(\hat{\sigma}_t)$, $d_s^2 \cdot \min(\hat{\sigma}_t)$, ..., $d_s^Q \cdot \min(\hat{\sigma}_t)$; the figure is labeled with n and Δ.]

Lookup: find $d_s^q \cdot \min(\hat{\sigma}_t)$ such that $d_s^q \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.
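Both steps reduce to logarithms with base $d_s$. The sketch below shows one way to compute $\lceil Q \rceil$ and the lookup index q; the function names are hypothetical, and the example values (standard deviations 0.2 and 3.2, ratio threshold 1.4) are borrowed from the earlier examples purely for illustration.

```python
import math


def cache_capacity(sigma_min, sigma_max, d_s):
    """Return ceil(Q), where sigma_max = d_s**Q * sigma_min."""
    q_exact = math.log(sigma_max / sigma_min) / math.log(d_s)
    return math.ceil(q_exact)


def cache_index(sigma, sigma_min, d_s):
    """Return q with d_s**q * sigma_min <= sigma < d_s**(q + 1) * sigma_min."""
    return math.floor(math.log(sigma / sigma_min) / math.log(d_s))


capacity = cache_capacity(sigma_min=0.2, sigma_max=3.2, d_s=1.4)
q = cache_index(sigma=2.9, sigma_min=0.2, d_s=1.4)
print(capacity, q)
```

Because the capacity depends only on the ratio of the largest to the smallest standard deviation, widening the query's time interval adds cache entries only if it widens that ratio. Floating-point rounding can matter when a standard deviation falls exactly on a power of $d_s$; a production version would need to handle that boundary explicitly.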

slide-19
SLIDE 19

σ–cache: Features

CREATE VIEW prob_view AS
  DENSITY r OVER t
  OMEGA delta=2, n=2
  FROM raw_values
  WHERE t >= 1 AND t <= 3

  • The rate at which the memory requirement grows is $O\left(\log \frac{\max(\hat{\sigma}_t)}{\min(\hat{\sigma}_t)}\right)$
  • The number of distributions cached does not depend on
      • the number of tuples that match the WHERE clause
      • Δ or n

slide-20
SLIDE 20

Experimental Evaluation

  • campus-data: ambient temperature values for over sixty-five hours
  • car-data: more than one hour of GPS data

[Figure: density distance against window size (H) for UT, VT, ARMA-GARCH and Kalman-GARCH on (a) campus-data and (b) car-data.]

ARMA-GARCH and Kalman-GARCH give up to 12 to 20 times lower density distances.

slide-21
SLIDE 21

Experimental Evaluation

[Figure: (a) Efficiency: time (milliseconds) against database size (tuples) for the naive approach and the σ-cache. (b) Scaling Characteristics: cache size (kilobytes) and max. ratio threshold ($D_s$).]

Using Δ = 0.05, n = 300 and Hellinger distance H = 0.01.

An order of magnitude improvement in performance!

slide-22
SLIDE 22

Conclusions

  • Proposed time-series based models can be used for creating probabilistic databases
  • Introduced the concept of density distance for measuring quality
  • Proved useful and practical guarantees for using the σ-cache
  • Caching and reusing distributions significantly increases the efficiency of creating probabilistic databases

slide-23
SLIDE 23

Thank You.

Questions?


Karl Aberer: karl.aberer@epfl.ch
Hoyoung Jeung: hoyoung.jeung@epfl.ch
Saket Sathe: saket.sathe@epfl.ch