Mergeable Summaries, Jeff M. Phillips (University of Utah): presentation slides


slide-1
SLIDE 1

Mergeable Summaries

Jeff M. Phillips

University of Utah

Joint work with Pankaj K. Agarwal (Duke), Graham Cormode (AT&T), Zengfeng Huang (HKUST), Zhewei Wei (HKUST), and Ke Yi (HKUST)

[Figure: data sets P and Q are summarized as S(P, ε) and S(Q, ε), which merge into S(P ∪ Q, ε); the size of S(X, ε) is always m. Inset: array CM[i,j] of width w and depth d.]

slide-2
SLIDE 2

Summaries for MASSIVE Data

Allows approximate computation with guarantees and small space.

coreset: small summary, proxy for full data set with approx guarantees:

  • ε-samples of (P, R): approx density
  • ε-kernel: approx convex shape

sketch: (random) (linear) combination of full data, recover functions with approx guarantees:

  • Euclidean distance: Johnson-Lindenstrauss random projection
  • Count-Min sketch: approx item counts
  • Greenwald-Khanna sketch: approx quantiles
  • Misra-Gries sketch: approx frequent items

[Figure: array CM[i,j] of width w and depth d]

slide-4
SLIDE 4

Massive Distributed Computation

  • data centers
  • sensor networks
  • multi-core

slide-13
SLIDE 13

Massive Distributed Computation

  • data centers
  • sensor networks
  • multi-core

[Figure: P and Q are summarized as S(P, ε) and S(Q, ε), which merge into S(P ∪ Q, ε), a summary of P ∪ Q]

size of S(X, ε) is always m

  • similar to: MUD, Dremel
    more restrictive, “natural”
  • generalizes streaming
  • archiving summaries
slide-14
SLIDE 14

Random Sample

P (each value paired with a random number):

val   15   17   20    1    8   42    7   10   14    3
ran  .99  .42  .53  .01  .02  .23  .82  .75  .61  .14

slide-21
SLIDE 21

Random Sample

size of S(X, ε) is always m

P, sorted by random number, with summary S(P, ε):

val   15    7   10   14   20   17   42    3    8    1
ran  .99  .82  .75  .61  .53  .42  .23  .14  .02  .01

Q, sorted by random number, with summary S(Q, ε):

val   31    9   16   11   14    7    2   13   21    4
ran  .90  .85  .80  .57  .50  .37  .31  .12  .10  .08

S(P ∪ Q, ε):

val   15   31    9    7   16   10
ran  .99  .90  .85  .82  .80  .75

Works for the max element or the top k elements (by random number).
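
The merge shown in these tables can be sketched in Python. This is only an illustration under stated assumptions, not code from the talk: the names `summarize` and `merge` are hypothetical, and k = 6 is read off the six-entry merged row.

```python
import random

def summarize(values, k, rng):
    """Tag each value with a uniform random number; keep the k largest tags."""
    tagged = [(rng.random(), v) for v in values]
    return sorted(tagged, reverse=True)[:k]

def merge(s1, s2, k):
    """Merge two tagged samples by keeping the k pairs with the largest tags."""
    return sorted(s1 + s2, reverse=True)[:k]

# (ran, val) pairs copied from the slide tables.
P = list(zip([.99, .82, .75, .61, .53, .42, .23, .14, .02, .01],
             [15, 7, 10, 14, 20, 17, 42, 3, 8, 1]))
Q = list(zip([.90, .85, .80, .57, .50, .37, .31, .12, .10, .08],
             [31, 9, 16, 11, 14, 7, 2, 13, 21, 4]))

k = 6  # assumption: inferred from the merged row on the slide
s_p = sorted(P, reverse=True)[:k]   # S(P, eps)
s_q = sorted(Q, reverse=True)[:k]   # S(Q, eps)
merged = merge(s_p, s_q, k)         # S(P u Q, eps)
merged_values = [v for _, v in merged]
print(merged_values)
```

Because the k largest tags of P ∪ Q must already be among the k largest tags of P and of Q, merging the two summaries gives the same result as summarizing the union directly.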

slide-22
SLIDE 22

Linear Sketches

Count-Min sketch of vector P[1...U]:

  • Linear sketch as array of size w × d
  • Use d hash functions h_1, ..., h_d to map each index i to [1...w]
  • Estimate P[i] = min_j CM[h_j(i), j]

Mergeable: CM(P + Q) = CM(P) + CM(Q)

[Figure: array CM[i,j] of width w and depth d]
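
A minimal Python sketch of the Count-Min structure and its merge follows. It is an illustration under assumptions, not the talk's implementation: the class name `CountMin`, the `seed` parameter, and the salted use of Python's built-in `hash` (a stand-in for a proper pairwise-independent hash family) are all choices made here.

```python
import random

class CountMin:
    def __init__(self, w, d, seed=0):
        self.w, self.d, self.seed = w, d, seed
        rng = random.Random(seed)
        # One salt per row; hash((salt, x)) % w plays the role of h_j(x).
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _col(self, j, x):
        return hash((self.salts[j], x)) % self.w

    def add(self, x, count=1):
        """Add `count` occurrences of x to one cell in each of the d rows."""
        for j in range(self.d):
            self.table[j][self._col(j, x)] += count

    def estimate(self, x):
        """Return the minimum over rows; never below the true count."""
        return min(self.table[j][self._col(j, x)] for j in range(self.d))

    def merge(self, other):
        """Cell-wise sum; both sketches must share w, d, and hash functions."""
        if (self.w, self.d, self.seed) != (other.w, other.d, other.seed):
            raise ValueError("sketches are not compatible")
        out = CountMin(self.w, self.d, self.seed)
        out.table = [[a + b for a, b in zip(r1, r2)]
                     for r1, r2 in zip(self.table, other.table)]
        return out

stream_p = [1, 2, 2, 3, 3, 3]
stream_q = [3, 4, 4, 1]
cm_p, cm_q, cm_pq = CountMin(8, 3), CountMin(8, 3), CountMin(8, 3)
for x in stream_p:
    cm_p.add(x)
for x in stream_q:
    cm_q.add(x)
for x in stream_p + stream_q:
    cm_pq.add(x)
merged = cm_p.merge(cm_q)
```

Since every update is an addition into fixed cells, the merged table is identical to the table built over the combined stream, which is the linearity property CM(P + Q) = CM(P) + CM(Q).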

slide-25
SLIDE 25

Heavy Hitters Summaries

Misra-Gries (MG) sketch of P[1...U]:

  • Keep k (index,count) pairs
  • If existing index arrives, update count
  • If new index arrives, make new pair, or decrement all counts

Mergeable: Stack MG(P) + MG(Q), then decrement all counts by C_{k+1} (the (k+1)-th largest count)

(1,5) (3,6) (8,1) (11,1) (14,3)

slide-27
SLIDE 27

Heavy Hitters Summaries

Misra-Gries (MG) sketch, count of index 11 updated:

(1,5) (3,6) (8,1) (11,2) (14,3)

slide-29
SLIDE 29

Heavy Hitters Summaries

Misra-Gries (MG) sketch, all counts decremented:

(1,4) (3,5) (11,1) (14,2)

slide-30
SLIDE 30

Heavy Hitters Summaries

Misra-Gries (MG) sketch S(P, ε) of P:

(1,4) (3,5) (11,1) (14,2)

|P[i] − MG[i]| ≤ ε = m̂/(k + 1)

slide-31
SLIDE 31

Heavy Hitters Summaries

Misra-Gries (MG) sketches of P and Q:

S(P, ε): (1,3) (3,4) (11,1) (14,2)
S(Q, ε): (1,2) (3,2) (5,1) (9,5) (14,4)

slide-32
SLIDE 32

Heavy Hitters Summaries

Mergeable: Stack MG(P) + MG(Q):

(1,6) (3,6) (5,2) (9,5) (11,1) (14,6)

slide-34
SLIDE 34

Heavy Hitters Summaries

Mergeable: after stacking MG(P) + MG(Q), decrement all counts by C_{k+1} (the (k+1)-th largest count):

(1,5) (3,5) (5,1) (9,4) (14,5)

This is S(P ∪ Q, ε); the size of S(X, ε) is always m.
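
The update and merge rules can be sketched in Python as follows. This is a hedged illustration: the function names `mg_update`, `mg_prune`, and `mg_merge` are hypothetical, and k = 5 is an assumption inferred from the five pairs that remain after the merge. The pruning step is checked against the stacked counters and the result shown on the slides.

```python
from collections import Counter

def mg_update(counters, item, k):
    """Process one stream item, keeping at most k (index, count) pairs."""
    counters[item] = counters.get(item, 0) + 1
    if len(counters) > k:
        # k+1 indices are tracked: decrement all counts and drop zeros.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def mg_prune(counters, k):
    """Subtract the (k+1)-th largest count from all counts; drop non-positive."""
    if len(counters) <= k:
        return dict(counters)
    c = sorted(counters.values(), reverse=True)[k]
    return {key: v - c for key, v in counters.items() if v - c > 0}

def mg_merge(a, b, k):
    """Stack two MG summaries (add counts per index), then prune to k pairs."""
    stacked = Counter(a) + Counter(b)
    return mg_prune(dict(stacked), k)

# Stacked counters as shown on the slide, pruned with k = 5 (assumed).
stacked = {1: 6, 3: 6, 5: 2, 9: 5, 11: 1, 14: 6}
pruned = mg_prune(stacked, 5)
print(pruned)
```

Each decrement removes at least k + 1 units of count in total, which is why the per-index undercount stays within a 1/(k + 1) fraction of the combined stream length after merging.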

slide-35
SLIDE 35

ε-Samples (Intervals)

P:

val 15 17 20 1 8 42 7 10 14 3

slide-38
SLIDE 38

ε-Samples (Intervals)

P, sorted, with summary S(P, ε):

val 1 3 7 8 10 14 15 17 20 42

An ε-sample of an ε-sample is a 2ε-sample.

slide-43
SLIDE 43

ε-Samples (Intervals)

P:

val 1 3 7 8 10 14 15 17 20 42

Q:

val 2 4 7 9 11 13 14 16 21 31

S(P, ε) and S(Q, ε), combined in sorted order (S(P ∪ Q, ε) is drawn from this list):

val 3 4 10 11 16 17

Summary sizes:

  • Random Sample: (1/ε²) log(1/δ)
  • Even-Weight Merge: (1/ε) √(log(1/εδ))

40 samples
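
The slides do not spell out the even-weight merge rule, so the following Python sketch rests on an assumption: the two equal-size summaries are merged in sorted order and then either the even or the odd positions are kept, chosen by a fair coin. The function name `even_weight_merge` is hypothetical. The inputs are the values 3, 10, 17 (from P) and 4, 11, 16 (from Q) that make up the combined list on the slide.

```python
import random

def even_weight_merge(s1, s2, rng):
    """Merge two equal-size samples; keep every other point of the sorted union.

    Each kept point stands for twice the weight of an input point, so the
    output has the same size as each input.
    """
    if len(s1) != len(s2):
        raise ValueError("even-weight merge needs equal-size summaries")
    combined = sorted(s1 + s2)
    offset = rng.randrange(2)  # fair coin: 0 keeps positions 0,2,4,...; 1 keeps 1,3,5,...
    return combined[offset::2]

s_p = [3, 10, 17]   # S(P, eps), values taken from the slide
s_q = [4, 11, 16]   # S(Q, eps), values taken from the slide
out = even_weight_merge(s_p, s_q, random.Random(0))
print(out)
```

For any interval of the form (−∞, t], twice the number of kept points differs from the number of combined points by at most one, and the random offset makes that error zero in expectation.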

slide-46
SLIDE 46

ε-Samples (Intervals)

Let $E_{i,j}$ be the $j$-th merge error at level $i$. Then $\mathbb{E}[E_{i,j}] = 0$ and $|E_{i,j}| \le 2^i = \Delta_i$.

Chernoff-Hoeffding Bound:

$$\Pr[\mathrm{Err} > \varepsilon] \le 2 \exp\left( \frac{-2\varepsilon^2}{\sum_i \sum_j \Delta_i^2} \right) \le \delta$$
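
As a worked step, the bound can be rearranged to show how small the accumulated squared error must be for a target failure probability δ. Here $S$ is shorthand introduced only for this illustration; it stands for the double sum in the denominator of the exponent, and δ < 1 is assumed so that the logarithm is positive.

```latex
S = \sum_i \sum_j \Delta_i^2, \qquad
2\exp\!\left(-\frac{2\varepsilon^2}{S}\right) \le \delta
\;\iff\; \frac{2\varepsilon^2}{S} \ge \ln\frac{2}{\delta}
\;\iff\; S \le \frac{2\varepsilon^2}{\ln(2/\delta)}
```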

slide-47
SLIDE 47

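ε-Samples (Intervals)

[Figure: hierarchy of summaries for $2^{j-1} k_\varepsilon \le n < 2^{j} k_\varepsilon$, with levels labeled $2^{j-1} k_\varepsilon$, $2^{j-2} k_\varepsilon$, $2^{j-3} k_\varepsilon$, $2^{j-4} k_\varepsilon$, ..., $2^{i} k_\varepsilon$; O(log(n)) levels]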

slide-49
SLIDE 49

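ε-Samples (Intervals)

Random Buffer:

  • input: m0 points
  • $p_i \in P$ → random $u_i \in [0, 1]$
  • $B = \{p_i\}_i$ with top $k_\varepsilon$ values of $u_i$
  • output: $k_\varepsilon$ points

[Figure: random buffer B over P, next to a hierarchy of summaries for $2^{j-1} k_\varepsilon \le n < 2^{j} k_\varepsilon$, with levels labeled $2^{j-1} k_\varepsilon$, $2^{j-2} k_\varepsilon$, $2^{j-3} k_\varepsilon$, $2^{j-4} k_\varepsilon$, ..., $2^{i} k_\varepsilon$; O(log(1/ε)) levels]

$m = O((1/\varepsilon) \log^{1.5}(1/\varepsilon) \log(1/\delta))$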

slide-50
SLIDE 50

Mergeable Summaries for MASSIVE Data

Allows approximate computation with guarantees and small space.

coreset: small summary, proxy for full data set with approx guarantees:

  • ε-samples of (P, R): approx density [Mergeable]
  • ε-kernel: approx convex shape [Mergeable (restricted)]

sketch: (random) (linear) combination of full data, recover functions with approx guarantees:

  • Euclidean distance: Johnson-Lindenstrauss random projection [Mergeable]
  • Count-Min sketch: approx item counts [Mergeable]
  • Greenwald-Khanna sketch: approx quantiles [One-way Mergeable]
  • Misra-Gries sketch: approx frequent items [Mergeable]

[Figure: array CM[i,j] of width w and depth d]

slide-51
SLIDE 51

Open Questions

  • Mergeable ε-kernels without restrictions
  • Mergeable summaries for clustering
  • Mergeable summaries for PCA
  • Mergeable summaries for graphs [next talk]
  • Lower bounds for mergeable summaries (deterministic)
  • Implementation Studies