Persistent Data Sketching Ge Luo Ke Yi Zhewei Wei The Hong Kong - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Persistent Data Sketching

Zhewei Wei

Renmin University of China

Ge Luo

The Hong Kong University of Science and Technology

Ke Yi

The Hong Kong University of Science and Technology

Xiaoyong Du

Renmin University of China

Ji-Rong Wen

Renmin University of China

slide-2
SLIDE 2
Streaming Algorithms

  • A data stream is a (massive) sequence of data

[Diagram: a data stream flows into a stream processing engine that keeps a summary in memory; a query receives an (approximate) answer]

slide-3
SLIDE 3
Streaming Algorithms

  • A data stream is a (massive) sequence of data
  • Single Pass: Each record is examined at most once

[Diagram: a data stream flows into a stream processing engine that keeps a summary in memory; a query receives an (approximate) answer]

slide-4
SLIDE 4
Streaming Algorithms

  • A data stream is a (massive) sequence of data
  • Single Pass: Each record is examined at most once
  • Small Space: Log or polylog in data stream size

[Diagram: a data stream flows into a stream processing engine that keeps a summary in memory; a query receives an (approximate) answer]

slide-5
SLIDE 5
Streaming Algorithms

  • A data stream is a (massive) sequence of data
  • Single Pass: Each record is examined at most once
  • Small Space: Log or polylog in data stream size
  • Small time: Low per-record processing time (O(1) to polylog N)

[Diagram: a data stream flows into a stream processing engine that keeps a summary in memory; a query receives an (approximate) answer]

slide-6
SLIDE 6

Sketches

  • Sub-linear space
  • Fast update and query time
  • Answer queries approximately
  • Linear transformation of the data frequencies
slide-7
SLIDE 7

Sketches

  • Count-Min Sketch [Cormode and Muthukrishnan 2005]
  • Point queries, heavy hitters (frequent items)
  • AMS Sketch [Alon et al. 1999]
  • Frequency moments
  • Count Sketch [Charikar et al. 2002]
  • Join size queries, self join size queries [Rusu and Dobra 2007]

slide-8
SLIDE 8

Sketches

  • Sub-linear space
  • Fast update and query time
  • Answer queries approximately
  • Linear transformation of the data frequencies
slide-9
SLIDE 9

Sketches

  • Sub-linear space
  • Fast update and query time
  • Answer queries approximately
  • Linear transformation of the data frequencies
  • Ephemeral
  • Answer queries on current version of data stream
slide-10
SLIDE 10

Query Back in Time

  • The ability to query historical data is necessary for analyzing trends and change patterns in data

slide-11
SLIDE 11

Persistent Database/ Data Structure

  • Answer queries on the past version of the database
slide-12
SLIDE 12

Persistent Database/ Data Structure

  • Answer queries on the past version of the database
  • General technique to make data structures persistent [Driscoll et al. 1989], Multi-version B-tree [Becker et al. 1996, Brodal et al. 2012], Time-Split B-tree [Lomet and Salzberg 1989]

slide-13
SLIDE 13

Persistent Database/ Data Structure

  • Answer queries on the past version of the database
  • General technique to make data structures persistent [Driscoll et al. 1989], Multi-version B-tree [Becker et al. 1996, Brodal et al. 2012], Time-Split B-tree [Lomet and Salzberg 1989]
  • Microsoft Immortal DB [Lomet et al. 2005], SNAP [Shrira and Xu 2005], Ganymed [Plattner et al. 2006], Skippy [Shaull et al. 2008] and LIVE [Sarma et al. 2010]

slide-14
SLIDE 14

Persistent Database/ Data Structure

  • Answer queries on the past version of the database
  • General technique to make data structures persistent [Driscoll et al. 1989], Multi-version B-tree [Becker et al. 1996, Brodal et al. 2012], Time-Split B-tree [Lomet and Salzberg 1989]
  • Microsoft Immortal DB [Lomet et al. 2005], SNAP [Shrira and Xu 2005], Ganymed [Plattner et al. 2006], Skippy [Shaull et al. 2008] and LIVE [Sarma et al. 2010]

  • Space linear in # of updates
  • Large storage
  • Storage on disk (not in streaming setting)
slide-15
SLIDE 15

Persistent Database: query on historical data, linear space
Sketch: query on current data, sub-linear space

slide-16
SLIDE 16

Persistent Database: query on historical data, linear space
Sketch: query on current data, sub-linear space
Persistent Sketch: query on historical data, sub-linear space

slide-17
SLIDE 17

Persistent Sketch

  • Historical window query

[Diagram: a stream along a time axis]

slide-18
SLIDE 18

Persistent Sketch

  • Historical window query

[Diagram: a stream along a time axis, with start time s and end time t marked]

slide-19
SLIDE 19

Persistent Sketch

  • Historical window query
  • Given a time interval (s, t], return a sketch for substream f(s, t)
  • What is the top-k/frequency moment/join size of the stream between s and t?

[Diagram: a stream along a time axis, with start time s and end time t marked]

slide-20
SLIDE 20

High Level Ideas & Our Results

slide-21
SLIDE 21
Count-Min Sketch [Cormode and Muthukrishnan 2005]

  • Given an error parameter 𝜁
  • Choose a hash function h: [n] ➝ [2/𝜁] and build a hash table of size 2/𝜁

slide-22
SLIDE 22
Count-Min Sketch [Cormode and Muthukrishnan 2005]

  • Given an error parameter 𝜁
  • Choose a hash function h: [n] ➝ [2/𝜁] and build a hash table of size 2/𝜁

slide-23
SLIDE 23

Count-Min Sketch [Cormode and Muthukrishnan 2005]

  • Given an error parameter 𝜁
  • Choose a hash function h: [n] ➝ [2/𝜁] and build a hash table of size 2/𝜁

[Diagram: item i is hashed to position h(i) of the table, which holds counter C[h(i)]]

slide-24
SLIDE 24

Count-Min Sketch [Cormode and Muthukrishnan 2005]

  • Given an error parameter 𝜁
  • Choose a hash function h: [n] ➝ [2/𝜁] and build a hash table of size 2/𝜁

[Diagram: item i is hashed to position h(i) of the table, which holds counter C[h(i)]]

Update: C[h(i)] = C[h(i)] + 1
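The update rule C[h(i)] = C[h(i)] + 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it models only the single hash table shown on the slide (a full Count-Min sketch keeps several such rows and returns the minimum over them), and the class name, the parameter `eps` (standing in for the error parameter) and the particular hash family are assumptions made for the example.

```python
import math
import random


class SingleRowCountMin:
    """One hash row of a Count-Min sketch with ceil(2/eps) counters."""

    def __init__(self, eps, seed=0):
        self.width = math.ceil(2 / eps)
        self.counters = [0] * self.width
        # Assumed hash family: h(i) = ((a*i + b) mod p) mod width.
        rng = random.Random(seed)
        self.p = 2_147_483_647  # the Mersenne prime 2^31 - 1
        self.a = rng.randrange(1, self.p)
        self.b = rng.randrange(0, self.p)

    def h(self, item):
        """Map an integer item to a table position."""
        return ((self.a * item + self.b) % self.p) % self.width

    def update(self, item):
        # The slide's rule: C[h(i)] = C[h(i)] + 1
        self.counters[self.h(item)] += 1

    def query(self, item):
        # Colliding items only add to the counter, so this never
        # underestimates the true frequency of `item`.
        return self.counters[self.h(item)]
```

For example, after `cm = SingleRowCountMin(eps=0.25)` and three calls to `cm.update(7)`, `cm.query(7)` returns at least 3.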

slide-25
SLIDE 25

Linear Transformation

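[Diagram: the counter array is a linear transformation of the frequency vector, C = M · (f1, f2, f3, …, fi, …, fN)ᵀ, where M is a 0/1 matrix whose column i has its single 1 in row h(i), so that fi is added to C[h(i)]]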

slide-26
SLIDE 26

Linear Transformation

[Diagram: a stream with start time s and end time t marked]

slide-27
SLIDE 27

Linear Transformation

[Diagram: a stream with start time s and end time t marked; Cs and Ct are the sketches at times s and t]

slide-28
SLIDE 28

Linear Transformation

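[Diagram: a stream with start time s and end time t marked; Cs and Ct are the sketches at times s and t, and the sketch of the window is Ct - Cs]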

slide-29
SLIDE 29

Linear Transformation

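[Diagram: a stream with start time s and end time t marked; Cs and Ct are the sketches at times s and t, and the sketch of the window is Ct - Cs]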

  • Linear Space
slide-30
SLIDE 30

Linear Transformation

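[Diagram: a stream with start time s and end time t marked; Cs and Ct are the sketches at times s and t, and the sketch of the window is Ct - Cs]

  • Linear Space
  • Sketch is already an approximation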
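Because the sketch is a linear transformation, subtracting the sketch at time s from the sketch at time t gives exactly the sketch of the items that arrived in (s, t]. The following toy example checks this; the hash function, table width and stream are illustrative assumptions.

```python
def sketch(stream, width, h):
    """Build a single-row counter array from a list of items."""
    counters = [0] * width
    for item in stream:
        counters[h(item)] += 1
    return counters


def window_sketch(c_s, c_t):
    """Sketch of the substream in (s, t], obtained as C_t - C_s."""
    return [ct - cs for cs, ct in zip(c_s, c_t)]


# Toy hash and stream, chosen only for illustration.
WIDTH = 4


def h(item):
    return item % WIDTH


stream = [1, 5, 2, 2, 7, 1, 3, 2]
s, t = 3, 7

c_s = sketch(stream[:s], WIDTH, h)      # sketch of the first s items
c_t = sketch(stream[:t], WIDTH, h)      # sketch of the first t items
direct = sketch(stream[s:t], WIDTH, h)  # sketch built from the window alone

print(window_sketch(c_s, c_t) == direct)
```

Answering arbitrary windows this way requires a stored sketch for every time point, which is the linear-space cost noted on the slide.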

slide-31
SLIDE 31

Baseline Solution

Ephemeral sketch: C[i]

slide-32
SLIDE 32

Baseline Solution

Ephemeral sketch: C[i]

Historical lists: C[i, t1], C[i, t2], C[i, t3], where C[i, tk] is the value of C[i] at time tk

Consecutive entries differ by about Δ: C[i, t2] - C[i, t1] ≈ Δ

slide-33
SLIDE 33

Baseline Solution

Ephemeral sketch: C[i]

Historical lists: C[i, t1], C[i, t2], C[i, t3], where C[i, tk] is the value of C[i] at time tk

Consecutive entries differ by about Δ: C[i, t2] - C[i, t1] ≈ Δ

Query time t

slide-34
SLIDE 34

Baseline Solution

Ephemeral sketch: C[i]

Historical lists: C[i, t1], C[i, t2], C[i, t3], where C[i, tk] is the value of C[i] at time tk

Consecutive entries differ by about Δ: C[i, t2] - C[i, t1] ≈ Δ

Query time t
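One way to read the historical lists is sketched below for a single counter: a new entry is appended once the live counter has moved Δ away from the last recorded value, and a query at time t returns the latest entry at or before t. The class and method names, the initial entry (0, 0) and the exact recording rule are assumptions made for this illustration; timestamps are assumed to be positive and non-decreasing.

```python
import bisect


class PersistentCounter:
    """One counter C[i] plus its historical list of (time, value) entries."""

    def __init__(self, delta):
        self.delta = delta
        self.value = 0      # the ephemeral counter C[i]
        self.times = [0]    # entry timestamps, non-decreasing
        self.values = [0]   # recorded values C[i, t_k]

    def increment(self, t):
        """Apply one update arriving at time t."""
        self.value += 1
        # Record a new version once the counter has drifted by delta.
        if self.value - self.values[-1] >= self.delta:
            self.times.append(t)
            self.values.append(self.value)

    def query(self, t):
        """Approximate C[i] at time t from the latest entry at or before t."""
        k = bisect.bisect_right(self.times, t) - 1
        return self.values[k]

    def window(self, s, t):
        """Approximate growth of the counter over the window (s, t]."""
        return self.query(t) - self.query(s)
```

Under this rule each `query` result lags the true counter value by less than Δ, and the list grows by one entry per Δ updates to the counter.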

slide-35
SLIDE 35

Baseline Solution

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?

slide-36
SLIDE 36

Baseline Solution

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?

  • Error: 𝜁||f(s,t)||1 (ephemeral error) + Δ (persistent error)
slide-37
SLIDE 37

Baseline Solution

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?
  • Error: 𝜁||f(s,t)||1 (ephemeral error) + Δ (persistent error), where ||f(s,t)||1 is the size of the stream between s and t

slide-38
SLIDE 38

Baseline Solution

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?
  • Error: 𝜁||f(s,t)||1 (ephemeral error) + Δ (persistent error), where ||f(s,t)||1 is the size of the stream between s and t
  • Space: proportional to (1/𝜁 + m/Δ)

slide-39
SLIDE 39

Baseline Solution

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?
  • Error: 𝜁||f(s,t)||1 (ephemeral error) + Δ (persistent error), where ||f(s,t)||1 is the size of the stream between s and t
  • Space: proportional to (1/𝜁 + m/Δ)
  • Cannot handle (self) join size queries

slide-40
SLIDE 40

Piece-wise Linear Approximation

  • Counter changes by at most 1 at each timestamp
  • Each counter is a discrete function of the timestamp

[Plot: counter value v(t) against time t]

slide-41
SLIDE 41

PLA-based Persistent Sketch

Ephemeral sketch: C[i]

PLA generator: [Plot: counter value v(t) against time t]

slide-42
SLIDE 42

PLA-based Persistent Sketch

Ephemeral sketch: C[i]

PLA generator: [Plot: counter value v(t) against time t]

Query time t

slide-43
SLIDE 43
PLA-based Persistent Sketch

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?
  • Error: 𝜁||f(s,t)||1 (ephemeral error) + Δ (persistent error)

slide-44
SLIDE 44
PLA-based Persistent Sketch

  • Historical window point/heavy hitters query:
  • What is the frequency of “/images/space.gif” between day 34 and day 37?
  • What are the most requested URLs between day 34 and day 37?
  • Error: 𝜁||f(s,t)||1 (ephemeral error) + Δ (persistent error)
  • Space: proportional to (1/𝜁 + m/Δ^2) in the random stream model
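The slides do not say which piece-wise linear approximation algorithm the PLA generator uses. As a rough illustration only, the sketch below uses a simple greedy variant: each segment is anchored at its first point and keeps the interval of slopes that stay within Δ of every point seen so far; when that interval becomes empty, the segment is closed and a new one starts. The class name and this particular greedy rule are assumptions, not the authors' method, and the counter is assumed to be observed once per strictly increasing timestamp.

```python
class PLACounter:
    """Greedy online piece-wise linear approximation of one counter."""

    def __init__(self, delta):
        self.delta = delta
        self.segments = []        # closed segments: (t0, v0, slope, t_end)
        self.t0 = self.v0 = None  # anchor of the open segment
        self.lo, self.hi = float("-inf"), float("inf")
        self.t_last = None

    def add(self, t, v):
        """Feed the counter value v observed at time t."""
        if self.t0 is None:
            self._open(t, v)
            return
        dt = t - self.t0
        lo = max(self.lo, (v - self.delta - self.v0) / dt)
        hi = min(self.hi, (v + self.delta - self.v0) / dt)
        if lo > hi:               # no single line fits: close and restart
            self.segments.append(self._current())
            self._open(t, v)
        else:
            self.lo, self.hi, self.t_last = lo, hi, t

    def _open(self, t, v):
        self.t0, self.v0, self.t_last = t, v, t
        self.lo, self.hi = float("-inf"), float("inf")

    def _current(self):
        # A segment holding a single point gets slope 0.
        slope = 0.0 if self.lo == float("-inf") else (self.lo + self.hi) / 2
        return (self.t0, self.v0, slope, self.t_last)

    def query(self, t):
        """Approximate counter value at an observed time t."""
        if self.t0 is None:
            raise ValueError("no data observed yet")
        for t0, v0, slope, t_end in self.segments + [self._current()]:
            if t0 <= t <= t_end:
                return v0 + slope * (t - t0)
        raise ValueError("t is outside the observed time range")
```

Every observed point lies within Δ of its segment, because the chosen slope belongs to the feasible interval of each point in that segment. Only one tuple per segment is stored, rather than one entry per recorded version.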

slide-45
SLIDE 45

Estimating Join Size

  • Estimating (self) join size in an ephemeral sketch:

Σ_i C[i]^2
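As a small illustration of the ephemeral estimate Σ_i C[i]^2, the sketch below fills one row of counters and sums their squares. The slide does not show the update rule; the signed update used here follows the AMS / Count Sketch style and, like the toy hash and sign functions in the usage note, is an assumption made for the example.

```python
def self_join_size(stream, width, h, sign):
    """Estimate sum_i f_i^2 from one row of signed counters."""
    counters = [0] * width
    for item in stream:
        counters[h(item)] += sign(item)  # signed update, +1 or -1 per item
    return sum(c * c for c in counters)


def toy_hash(item):
    return item % 8


def toy_sign(item):
    return 1 if item % 2 == 0 else -1
```

With `stream = [1, 1, 2, 3, 3, 3]` no two distinct items collide under `toy_hash`, so `self_join_size(stream, 8, toy_hash, toy_sign)` equals the exact self-join size 2^2 + 1^2 + 3^2 = 14.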

slide-46
SLIDE 46

Estimating Join Size

  • Estimating (self) join size in an ephemeral sketch:

Σ_i C[i]^2

  • Estimating (self) join size in a persistent sketch:
slide-47
SLIDE 47

Estimating Join Size

  • Estimating (self) join size in an ephemeral sketch:

Σ_i C[i]^2

  • Estimating (self) join size in a persistent sketch:

Σ_i (C[i] + error of Δ)^2

slide-48
SLIDE 48

Estimating Join Size

  • Estimating (self) join size in an ephemeral sketch:

Σ_i C[i]^2

  • Estimating (self) join size in a persistent sketch:

Σ_i (C[i] + error of Δ)^2

  • Bias will amplify error significantly
slide-49
SLIDE 49

Estimating Join Size

  • Estimating (self) join size in an ephemeral sketch:

Σ_i C[i]^2

  • Estimating (self) join size in a persistent sketch:

Σ_i (C[i] + error of Δ)^2

  • Bias will amplify error significantly
  • Need unbiased estimator of the counter
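To see why bias matters here, write each persistent counter as C[i] + e_i with |e_i| ≤ Δ (the symbol e_i is introduced only for this illustration) and expand the square:

```latex
\[
\sum_i \bigl(C[i] + e_i\bigr)^2
  = \sum_i C[i]^2 \;+\; 2\sum_i C[i]\,e_i \;+\; \sum_i e_i^2 .
\]
```

If the errors e_i all lean in the same direction, the cross term can be as large as 2Δ Σ_i |C[i]|. If instead E[e_i] = 0 for the fixed true value C[i], the cross term vanishes in expectation and only Σ_i E[e_i^2] remains.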
slide-50
SLIDE 50

Sampling Based Persistent Sketch

Ephemeral sketch: C[i], reached through h(i)

Historical lists: C[i, t1], C[i, t2], C[i, t3], where C[i, tk] is the value of C[i] at time tk

slide-51
SLIDE 51

Sampling Based Persistent Sketch

Ephemeral sketch: C[i], reached through h(i)

Historical lists: C[i, t1], C[i, t2], C[i, t3], where C[i, tk] is the value of C[i] at time tk

Sample with probability 1/Δ

slide-52
SLIDE 52

Sampling Based Persistent Sketch

Ephemeral sketch: C[i], reached through h(i)

Historical lists: C[i, t1], C[i, t2], C[i, t3], where C[i, tk] is the value of C[i] at time tk

Sample with probability 1/Δ

Query time t: C[i, t3] + Δ - 1
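The sampling idea can be sketched for a single counter as follows: each update is recorded in the historical list with probability 1/Δ, and a query returns the last recorded value plus Δ - 1. Roughly, looking back from the query time, the number of unrecorded updates since the last sample is geometric with mean Δ - 1 (ignoring the boundary at the start of the stream), which is what the correction accounts for. The class name, the injectable `rng` and the choice to return 0 when no sample precedes t are assumptions made for this illustration.

```python
import bisect
import random


class SampledCounter:
    """Counter whose history keeps each update with probability 1/delta."""

    def __init__(self, delta, rng=None):
        self.delta = delta
        self.rng = rng or random.Random()
        self.value = 0
        self.times = []    # timestamps of sampled updates
        self.values = []   # counter value right after each sampled update

    def increment(self, t):
        """Apply one update arriving at time t (non-decreasing)."""
        self.value += 1
        if self.rng.random() < 1.0 / self.delta:
            self.times.append(t)
            self.values.append(self.value)

    def query(self, t):
        """Estimate the counter value at time t."""
        k = bisect.bisect_right(self.times, t) - 1
        if k < 0:
            return 0       # assumption: no sample yet, so report 0
        # The slide's correction: last sampled value plus delta - 1.
        return self.values[k] + self.delta - 1
```

Passing a seeded `random.Random` as `rng` makes runs reproducible, and the expected list length is the number of updates divided by Δ.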

slide-53
SLIDE 53

Sampling based AMS Sketch

  • Unbiased Estimator
  • Historical window join size query:
  • What is the join size of stream 1 and stream 2 between day 34 and day 37?

  • Error: $\varepsilon \sqrt{\left(\|f_{s,t}\|_2^2 + (\Delta_f/\varepsilon)^2\right)\left(\|g_{s,t}\|_2^2 + (\Delta_g/\varepsilon)^2\right)}$
  • Space: proportional to (1/𝜁 + m/Δ)

slide-54
SLIDE 54
Experimental Study

  • 7,000,000 requests from the 1998 World Cup web site access log
  • Built sketches on two attributes

[Figure: two panels, "Requested URL" and "IP address of the request". x-axis: error parameter ∆ (2000 to 10000); y-axis: sketch size (log scale). Series: Sample, PWC_AMS, PLA, PWC_CountMin, Sample_Theory]

slide-55
SLIDE 55

Experimental Study

[Figure: two panels, "Requested URL" and "IP address of the request". x-axis: sketch size (log scale); y-axis: absolute error (log scale). Series: PWC_AMS, PLA, PWC_CountMin]

[Figure: two panels, "Point Query" and "Self Join Size Query", for query range (0.2N, 0.6N]. x-axis: sketch size (log scale); y-axis: relative error (log scale). Series: Sample, PWC_AMS, PWC_CountMin]

slide-56
SLIDE 56

Conclusion

  • Persistent sketch
  • Query on historical data
  • Sub-linear space
  • Support point/heavy hitters/join size queries
  • Provable error and space bounds
  • Performs well in practice
slide-57
SLIDE 57

Thanks!