slide-1
SLIDE 1

Summaries of Streaming Data

Martin J. Strauss University of Michigan

slide-2
SLIDE 2

Sparse Approximation

National retailer sees a stream of transactions:

  • 2 Thomas sold, 1 Thomas returned, 1 TSP sold ...

Implies vector x of item frequencies:

  • 40 Thomas, 2 Lego, −30 TSP, ...

Goal: Track items with large-magnitude counts

1

slide-3
SLIDE 3

Example Algorithm

[Matrix figure: measurements = measurement matrix Φ · signal x. The signal has a single spike, 5.3; Φ is a 0/1 bit-test matrix; every nonzero measurement equals 5.3.]

Recover the position and coefficient of the single spike in the signal.

2
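
To make the bit-testing picture concrete, here is a minimal sketch in Python (function names are mine; it assumes d is a power of 2 and exactly one spike): one measurement sums everything, and one measurement per bit position sums the entries whose index has that bit set.

    import numpy as np

    def bit_test_matrix(d):
        """Rows: one all-ones row, then one row per bit position,
        with Phi[b+1, i] = b'th bit of i."""
        bits = int(np.log2(d))
        Phi = np.zeros((bits + 1, d))
        Phi[0, :] = 1.0                      # total sum
        for b in range(bits):
            for i in range(d):
                if (i >> b) & 1:
                    Phi[b + 1, i] = 1.0      # entries whose index has bit b set
        return Phi

    def recover_single_spike(y):
        """Given y = Phi @ x for a 1-spike x, return (position, coefficient)."""
        total = y[0]
        pos = 0
        for b, yb in enumerate(y[1:]):
            if abs(yb) > abs(total) / 2:     # bit b of the spike's position is 1
                pos |= 1 << b
        return pos, total

    d = 16
    x = np.zeros(d); x[11] = 5.3             # single spike at position 11
    y = bit_test_matrix(d) @ x
    print(recover_single_spike(y))           # -> (11, 5.3)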

slide-4
SLIDE 4

Algorithmic Constraints

  • Little time per item
  • Little storage space
  • Little time to answer queries

3

slide-5
SLIDE 5

Fundamental Queries

Identification: Output a set that

  • contains all “heavy” indices
  • contains no “light” indices
  • (medium weight: no constraint)

Estimation

  • estimate large coefficients reliably.

4

slide-6
SLIDE 6

Summaries

Fundamental queries can be used to build summaries:

  • Fourier/Wavelet summaries
  • Piecewise-constant, piecewise-linear summaries
  • ...

Other user queries can be answered from summary

5

slide-7
SLIDE 7

Overview of Summaries

  • Heavy Hitters
  • Weak greedy sparse recovery
  • Orthonormal change of basis
  • Haar Wavelets
  • Histograms (piecewise constant)
  • Multi-dimensional (hierarchical)
  • Piecewise-linear
  • Range queries

6

slide-8
SLIDE 8

Setup

Design

  • a matrix Φ and decoding algo D that work together.

Process Stream:

  • Track y = Φx.

Answer queries:

  • Output D(Φx).

7

slide-9
SLIDE 9

Processing Items

  • See “add v to xi”
  • Read as “add vector vei to x”

Maintain y = Φx: when x ← x + v·ei, update y ← y + v·Φei.

8
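
A toy sketch of this update rule (the class name is mine, and Φ is stored explicitly here only for illustration; slides 40–44 discuss how to avoid storing it):

    import numpy as np

    class LinearSketch:
        def __init__(self, Phi):
            self.Phi = Phi                    # measurement matrix
            self.y = np.zeros(Phi.shape[0])   # y = Phi @ x, maintained incrementally

        def update(self, i, v):
            """Process stream item "add v to x_i": y += v * (Phi e_i)."""
            self.y += v * self.Phi[:, i]

    # x itself is never stored; only y is.
    sketch = LinearSketch(np.random.randn(8, 100))
    sketch.update(17, 2.0)    # "2 of item 17 sold"
    sketch.update(17, -1.0)   # "1 of item 17 returned"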

slide-10
SLIDE 10

Some Costs

Space:

  • |y| plus space to store Φ.

Time per item:

  • generate Φei
  • Usually about proportional to |y|
  • Sometimes much less if Φ is sparse

(Still need to analyze time for queries. Depends a lot on Φ and D.)

9

slide-11
SLIDE 11

Warmup: One Spike, Low Noise

          5.6 · · · 0.2 5.5           =           1 1 1 1 1 1 1 1 · · · · · · · · · · · · · · · · · · · · · · · · 1 1 1 1 1 1 1 1 1 1 1 1           ·                   0.1 5.3 0.2                   d columns and log(d) rows. (Deterministic and efficient) If bℓ is ℓ’th row of matrix, and spike is at i, need

10

slide-12
SLIDE 12

  |xi| ≥ 2.01 Σ_{j≠i} |xj|,  or (weaker)  ∀ℓ: |xi| > 2.01 | Σ_{j≠i} bℓj xj |.

11

slide-13
SLIDE 13

Many Spikes? Group Testing

Example:

  • 150 soldiers; 3 have syphilis
  • Pool specimens into 6 random groups.
  • “Many” groups have
    – exactly one sick soldier
    – about 1/6 of the dilution from healthy soldiers
  • Perform 6 tests
    – clear ≥ 3 groups—75 soldiers

12
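
A small simulation of the pooling example, under the natural assumption that a pool tests positive iff it contains a sick specimen (parameters from the slide; the code layout is mine):

    import random

    random.seed(0)
    n, sick, groups = 150, {3, 77, 140}, 6

    assignment = [random.randrange(groups) for _ in range(n)]   # pool the specimens
    positive = {assignment[s] for s in sick}                    # a test is positive
                                                                # iff its pool has a sick soldier
    cleared = [s for s in range(n) if assignment[s] not in positive]
    print(f"{groups} tests clear {len(cleared)} of {n} soldiers")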

slide-14
SLIDE 14

Warmup II: L1 significance

Problem:

  • Suppose |xi| > (1/k) Σ_{j≠i} |xj|. Find i.

Solution: Hash...

  • Keep a 1/(12k) fraction of positions at random
    – i.e., consider x·r, where r is 0/1-valued
  • With prob ≥ 1/(12k), we keep i; i.e., ri = 1.
  • For each j ≠ i, E[|rj xj|] = (1/(12k)) |xj|.

13

slide-15
SLIDE 15

Warmup II: L1 significance

So

  E[ Σ_{j≠i} |rj xj| ] = Σ_{j≠i} E[|rj xj|] = (1/(12k)) Σ_{j≠i} |xj|.

So, with prob ≥ 3/4 (independently of whether ri = 1),

  Σ_{j≠i} |rj xj| ≤ (1/(3k)) Σ_{j≠i} |xj| < (1/3) |xi ri|.

Repeat, and proceed as above!

14
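
A sketch of one subsampling trial (constants from the slides; the trial inspects x directly rather than a sketch, so it only illustrates the probabilistic claim):

    import numpy as np

    rng = np.random.default_rng(1)

    def one_trial(x, i, k):
        """One subsampling round: keep positions with prob 1/(12k).
        Succeeds if i is kept and the kept noise is < |x_i|/3."""
        r = rng.random(len(x)) < 1.0 / (12 * k)       # 0/1 mask
        noise = np.sum(np.abs(r * x)) - (abs(x[i]) if r[i] else 0.0)
        return r[i] and noise < abs(x[i]) / 3

    d, k = 10_000, 10
    x = rng.standard_normal(d) * 0.01
    i = 42
    x[i] = np.abs(x).sum() / k + 1.0                  # now |x_i| > (1/k) sum_{j!=i} |x_j|
    trials = 2000
    print(sum(one_trial(x, i, k) for _ in range(trials)) / trials)
    # success rate roughly (3/4) * 1/(12k) per trial; repeat to amplify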

slide-16
SLIDE 16

Digression: Linearity of Expectation

Recall that a random variable is a function on a sample space:

  X : Ω → R,  ω ↦ X(ω).

Then E[X] = Σ_{ω∈Ω} X(ω) Pr(ω), and so

  E[X + Y] = Σ_{ω∈Ω} (X(ω) + Y(ω)) Pr(ω) = Σ_{ω∈Ω} X(ω) Pr(ω) + Σ_{ω∈Ω} Y(ω) Pr(ω) = E[X] + E[Y].

15

slide-17
SLIDE 17

Digression: Markov

Theorem: If X is a non-negative random variable and a > 0, then Pr(X ≥ a) ≤ E[X]/a. Proof:

  E[X] = Σ_x x Pr(X = x) ≥ Σ_{x≥a} a Pr(X = x) = a Pr(X ≥ a).

E.g., Pr(X ≥ 4E[X]) ≤ 1/4.

16

slide-18
SLIDE 18

Repeat

  Pr(success) ≥ (3/4) · (1/(4k)) = 3/(16k) > 1/(6k),  so  Pr(failure) < 1 − 1/(6k).

Repeat 6k times, independently:

  Pr(all failures) < (1 − 1/(6k))^{6k} ≈ 1/e ≈ .37 < .5.

Repeat a total of 6km times:

  • Modest cost.
  • Pr(all failures) < 2^{−m}.

17

slide-19
SLIDE 19

Putting it together

Collect repeated r’s into matrix, R. Take row tensor product R ⊗r B with bit testing matrix, B:

  • rows are {r·b : r is a row of R, b is a row of B}, where r·b is the entrywise product

18

slide-20
SLIDE 20

Row Tensor Product, E.g.

[Matrix figure: a one-row mask R and a bit-testing matrix B; each row of B ⊗r R is the entrywise product of a row of B with the row of R, so it keeps only the positions that R keeps.]

19
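
The row tensor product is easy to state in code (helper name mine; the slide's exact example matrices are not recoverable from the transcript, so R and B below are illustrative):

    import numpy as np

    def row_tensor(B, R):
        """Row tensor product B (x)_r R: one row b*r (entrywise product)
        for each row b of B and each row r of R."""
        return np.array([b * r for b in B for r in R])

    R = np.array([[1, 0, 1, 0, 0, 1, 0, 0]])          # a random 0/1 mask row
    B = np.array([[1, 1, 1, 1, 1, 1, 1, 1],           # bit-test rows
                  [0, 1, 0, 1, 0, 1, 0, 1],
                  [0, 0, 1, 1, 0, 0, 1, 1],
                  [0, 0, 0, 0, 1, 1, 1, 1]])
    print(row_tensor(B, R))   # each row keeps only the positions R keeps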

slide-21
SLIDE 21

Warmup III: L2 significance

Problem: Suppose now that xi² > (1/k′) Σ_{j≠i} xj²; want to find i.

  • Note: stronger statement than before.

Solution:

  • Multiply each xi by a random ±1 first
  • Keep a 1/(36k′) fraction, at random
  • i.e., consider r·s·x, where
    – s has random signs
    – r is a random mask

20

slide-22
SLIDE 22

Warmup III: L2 significance

Still keep i with prob'y 1/(36k′). (Assume this.)

  E[ ( Σ_{j≠i} bj rj sj xj )² ]
    = E[ Σ_{j,ℓ≠i} bj bℓ rj rℓ sj sℓ xj xℓ ]
    = E_r[ Σ_{j,ℓ≠i} E_s[sj sℓ] rj rℓ bj bℓ xj xℓ ]
    = E_r[ Σ_{j≠i, bj=1} rj xj² ]
    = Σ_{j≠i, bj=1} E[rj] xj²
    = (1/(36k′)) Σ_{j≠i, bj=1} xj²
    < (1/36) xi².

21

slide-23
SLIDE 23

Warmup III: L2 significance

With prob ≥ 3/4,

  ( Σ_{j≠i} bj rj sj xj )² < (1/9) xi²,

or

  | Σ_{j≠i} bj rj sj xj | < (1/3) |ri si xi|.

(Extra repetitions are needed to make all the bℓ work simultaneously.) Proceed as above.

22

slide-24
SLIDE 24

Digression: Expectation of a product

Theorem: If X and Y are independent, then E[XY] = E[X]E[Y]. Proof:

  E[XY] = Σ_{x,y} xy Pr(X = x and Y = y) = Σ_{x,y} xy Pr(X = x) Pr(Y = y) = E[X]E[Y].

23

slide-25
SLIDE 25

Digression: Cauchy-Schwarz Inequality

Theorem:

  (1/d) ( Σ_{i=1}^d |xi| )² ≤ Σ_{i=1}^d xi² ≤ ( Σ_{i=1}^d |xi| )²;

either equality is possible.

24

slide-26
SLIDE 26

Cauchy-Schwarz Inequality: Implication

Thus, if |xi| > Σ_{j≠i} |xj|, then

  xi² > ( Σ_{j≠i} |xj| )² ≥ Σ_{j≠i} xj².

But, if xi² > Σ_{j≠i} xj², then all we know is

  |xi| > √( Σ_{j≠i} xj² ) ≥ (1/√d) Σ_{j≠i} |xj|.

Weaker by the large factor √d.

25

slide-27
SLIDE 27

Cauchy-Schwarz Inequality: Proof

For Σ_i xi² ≤ ( Σ_i |xi| )²:

  Σ_i xi² ≤ Σ_{i,j} |xi||xj| = ( Σ_i |xi| )².

Pick out the diagonal; equality if there is only one (nonzero) term.

26

slide-28
SLIDE 28

Cauchy-Schwarz Inequality: Proof

For (1/d) ( Σ |xi| )² ≤ Σ xi², need

  Σ_{i=1}^d |xi| = ⟨|x|, 1⟩ ≤ ‖x‖ · ‖1‖ = ‖x‖ · √d.

We'll show ⟨x, y⟩ ≤ ‖x‖‖y‖. Can normalize; assume ‖x‖ = ‖y‖ = 1. Then

  0 ≤ ⟨x − y, x − y⟩ = ‖x‖² + ‖y‖² − 2⟨x, y⟩.

So ⟨x, y⟩ ≤ (‖x‖² + ‖y‖²)/2 = 1 = ‖x‖ · ‖y‖. Equality if (and only if) x and y are proportional.

27

slide-29
SLIDE 29

On to Estimation

Let s be a random ±1-valued vector. The atomic estimator for xi is X = si ⟨x, s⟩. Then

  X = si Σ_j sj xj = Σ_j si sj xj,

so E[X] = Σ_j E[si sj] xj = xi. Need to bound the variance.

28

slide-30
SLIDE 30

Estimation: Variance

Also

  var(X) = E[X²] − xi² = E[ Σ_{j,ℓ} sj sℓ xj xℓ ] − xi² = Σ_{j,ℓ} E[sj sℓ] xj xℓ − xi² = Σ_{j≠i} xj².

Standard deviation small/bounded in terms of target value.

29

slide-31
SLIDE 31

Markov/Chebyshev

Theorem: For a > 0, Pr(|X − E[X]| ≥ a) ≤ var(X)/a². Proof: Pr((X − E[X])² ≥ a²) ≤ var(X)/a², by Markov. Get Pr(|X − xi| ≥ 3‖x‖) ≤ 1/9.

30

slide-32
SLIDE 32

Better distortion

Let Y be the average of m copies of X. Then E[Y] = E[X] and var(Y) = (1/m) var(X).

Get

  Pr( |Y − xi| ≥ (3/√m) ‖x‖ ) ≤ 1/9.

31

slide-33
SLIDE 33

Digression: Improving Variance

Theorem: Let Y be the average of m copies of X. Then var(Y) = (1/m) var(X). Proof:

Let µ = E[X] = E[Y]. Then E[X − µ] = 0 and var(X − µ) = E[(X − µ − 0)²] = var(X).

32

slide-34
SLIDE 34

Digression: Improving Variance

So assume E[X] = E[Y] = 0. Then

  var(Y) = E[Y²] = E[ ( (1/m) Σ_i Xi )² ] = (1/m²) Σ_{i,j} E[Xi Xj] = (1/m²) Σ_i E[Xi²]   (using independence)
         = (1/m) E[X²].

33

slide-35
SLIDE 35

Better failure probability.

Theorem: Suppose Pr(Y is bad) < 1/9. Let Z be the median of l independent copies of Y. Then Pr(Z is bad) < 2^{−Ω(l)}. Proof: Z is bad only if at least half of the Y's are bad. Apply Chernoff.

34
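
Putting slides 29–35 together, here is a sketch of the estimator (fully random signs for simplicity, rather than the pairwise-independent ones of slides 40–44; names are mine): average m atomic estimates to shrink the variance, then take a median of l averages to shrink the failure probability.

    import numpy as np

    rng = np.random.default_rng(0)

    def estimate_xi(x, i, m=64, l=9):
        """Median of l copies, each an average of m atomic estimators
        X = s_i <x, s> with random signs s."""
        copies = []
        for _ in range(l):
            s = rng.choice([-1.0, 1.0], size=(m, len(x)))   # m sign vectors
            atoms = s[:, i] * (s @ x)                        # m atomic estimates
            copies.append(atoms.mean())                      # averaging: variance / m
        return np.median(copies)                             # median: failure 2^-Omega(l)

    x = rng.standard_normal(1000)
    print(x[7], estimate_xi(x, 7))   # within ~(3/sqrt(m))*||x||, w.h.p.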

slide-36
SLIDE 36

Digression: Chernoff Bounds

Theorem: Suppose each of n Yi's is independent, with Yi = 1 − p with probability p, and Yi = −p with probability 1 − p. Let Y = Σ_i Yi. If a > 0, then

  Pr(Y > a) < e^{−2a²/n}.

35
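
A quick numerical sanity check of the bound for the p = 1/2 case (the experiment setup is mine):

    import numpy as np

    rng = np.random.default_rng(0)
    n, a, trials = 100, 10, 100_000
    # Y_i = +-1/2 uniformly (the p = 1/2 case); Y = sum_i Y_i
    Y = 0.5 * rng.choice([-1, 1], size=(trials, n)).sum(axis=1)
    print("empirical:", (Y > a).mean(), " bound:", np.exp(-2 * a**2 / n))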

slide-37
SLIDE 37

Chernoff: Proof

(Just for p = 1/2, so Yi is ±1/2, uniformly.) Lemma: For λ > 0, (e^λ + e^{−λ})/2 < e^{λ²/2}. (Proof: Taylor.) Then

  E[e^{2λ Σ Yi}] = Π_i E[e^{2λ Yi}] = ( (e^λ + e^{−λ})/2 )^n < e^{λ²n/2}.

36

slide-38
SLIDE 38

Chernoff, cont’d

  Pr(Y > a) = Pr( e^{2λY} > e^{2λa} ) ≤ E[e^{2λY}] / e^{2λa} ≤ e^{λ²n/2 − 2λa}.

Put λ = 2a/n; get Pr(Y > a) < e^{−2a²/n}.

37

slide-39
SLIDE 39

To this point

Find all i such that xi² > (1/k) Σ_{j≠i} xj², with failure probability 2^{−ℓ}.

  • Need poly(k, ℓ) rows in the matrix B ⊗r S ⊗r R; comparable runtimes.

Estimate each xi up to ±ǫ‖x‖ with failure probability 2^{−ℓ}.

  • Need poly(ℓ/ǫ) rows; comparable runtimes.

38

slide-40
SLIDE 40

Space

To this point, fully random matrices.

  • Expensive to store!

But...

  • Need only pairwise independence within each row
  • (sometimes need full independence from row to row, but this is usually OK)
  • i.e., two entries rj and rℓ in the same row need to be independent, but three entries may be dependent
  • This can cut down on needed space.

39

slide-41
SLIDE 41

Pairwise Independence: Construction

Random vector s in {±1}^d (equivalently, Z2^d).

Index i is a 0/1 vector of length log(d), i.e., i ∈ Z2^{log(d)}. Pick a vector q ∈ Z2^{log(d)} and a bit c ∈ Z2. Define si = c + ⟨q, i⟩ (mod 2). Then, if i ≠ j, (si, sj) takes all four possibilities with equal probability.

40
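
A sketch of the construction (function names mine): the whole seed is q and c, i.e., log(d) + 1 bits, and each si is computed on demand. For r (slide 43), one would concatenate log(k) independent such bits to get a bucket label.

    import random

    def make_pairwise_signs(log_d, rng=random):
        """Seed: random q in Z_2^{log d} and a bit c. s_i = c + <q, i> mod 2."""
        q = rng.getrandbits(log_d)
        c = rng.getrandbits(1)
        def s(i):
            dot = bin(q & i).count("1") % 2      # <q, i> over Z_2
            return 1 if (c + dot) % 2 == 0 else -1
        return s

    s = make_pairwise_signs(10)                   # d = 1024; seed is only 11 bits
    print([s(i) for i in range(8)])
    # For i != j, (s(i), s(j)) is uniform over the four sign pairs.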

slide-42
SLIDE 42

Pairwise Independence: Proof

si is uniform because c is random. Conditioned on si, sj is uniform:

  • Sufficient to show that si + sj is uniform.
  • si + sj = (c + ⟨q, i⟩) + (c + ⟨q, j⟩) = ⟨q, i + j⟩
  • i ≠ j, so they differ in some bit, say the ℓ'th.
  • As qℓ varies, si + sj varies uniformly over Z2.

41

slide-43
SLIDE 43

Pairwise independence, for r

Hashing into one of k buckets: take log(k) independent hashes into two buckets each, and read off the bucket label bit-by-bit.

42

slide-44
SLIDE 44

Space, again

For each row s, need only store q and c: log(d) + 1 bits. For each row r, need only log(k) copies of q and c: O(log(d) log(k)) bits. (Many other constructions are possible.)

43

slide-45
SLIDE 45

All Together—Heavy Hitters

  • Find all i such that xi² > (1/k) Σ_{j≠i} xj², with failure probability 2^{−ℓ}.
  • Estimate each xi up to ±ǫ‖x‖ with failure probability 2^{−ℓ}.
  • Space, time per item, and query time are poly(k, ℓ, log(d), 1/ǫ).

44

slide-46
SLIDE 46

Sparse Recovery

Next topic: Sparse Recovery. Fix k and ǫ. Want x̂ such that

  ‖x̂ − x‖₂ ≤ (1 + ǫ) ‖x(k) − x‖₂.

Here x(k) is the best k-term approximation to x. Will build on heavy hitters.

45

slide-47
SLIDE 47

Sparse Recovery: Issue

Suppose k = 10 and the coefficient magnitudes are 1, 1/2, 1/4, 1/8, 1/16, ... Want to find the top k terms in time poly(k), not time 2^k. The Heavy Hitters algorithm only guarantees that we find and estimate well the terms with magnitude around 1/k—about log(k) terms.

46

slide-48
SLIDE 48

Weak Greedy Algorithm

  • Find indices of heavy terms in x
  • Estimate their coefs, getting intermediate rep’n r.

– iterative subroutine here

  • Recurse on x − r.

47

slide-49
SLIDE 49

Weak Greedy Algorithm

After removing the top few terms, the others become relatively larger. Can get the sketch Φ(x − r) as Φx − Φr. At this point, the representation may have more than k terms (to be fixed). Weak greedy—it may not find the heaviest term.

48
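
An idealized sketch of the weak greedy loop (it operates on x directly instead of on the sketch Φx, and "estimation" is exact, so it only illustrates the control flow; names and the threshold are mine):

    import numpy as np

    def weak_greedy(x, k, rounds=3):
        """Maintain representation r; each round, find heavy terms of the
        residual x - r, estimate them, and fold them into r."""
        r = np.zeros_like(x)
        for _ in range(rounds):
            resid = x - r                        # in the sketch world: Phi x - Phi r
            thresh = np.linalg.norm(resid) / (2 * np.sqrt(k))
            heavy = np.nonzero(np.abs(resid) > thresh)[0]
            r[heavy] += resid[heavy]             # "estimation" is exact here
        return r

    x = np.zeros(1000); x[:20] = 2.0 ** -np.arange(20)
    print(np.nonzero(weak_greedy(x, k=10))[0])   # more terms found each round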

slide-50
SLIDE 50

Iterative Estimation

Have: a set I of k indices, parameter ǫ. Want: coefficient estimates so that the resulting approximation x̂ satisfies

  ‖x − x̂‖ ≤ (1 + ǫ) ‖x − xI‖.

Define

  • Ic to be the complement of I.
  • EI = Σ_{i∈I} |xi|² to be the original energy in I.
  • ÊI = Σ_{i∈I} |xi − x̂i|² to be the energy in I after one round of estimation.
  • ∆ = EI/EIc to be the dynamic range.

49

slide-51
SLIDE 51

Iterative Estimation: Algorithm

Have: a set I of k indices, parameter ǫ. Want: coefficient estimates so that the resulting approximation x̂ satisfies

  ‖x − x̂‖ ≤ (1 + ǫ) ‖x − xI‖.

Repeat log(∆/ǫ) times:

  • 1. Estimate each xi, for i ∈ I, by x̂i with |xi − x̂i|² < (ǫ/(2k(1 + ǫ))) (ÊI + EIc).
  • 2. Update x̂.

50

slide-52
SLIDE 52

Iterative Estimation: Proof

Get: after one round, Ê′I ≤ (ǫ/(2(1 + ǫ))) (ÊI + EIc).

Case ÊI > ǫ · EIc:

  Ê′I ≤ (ǫ/(2(1 + ǫ))) (ÊI + EIc) ≤ (ǫ/(2(1 + ǫ))) ÊI + (1/(2(1 + ǫ))) ÊI = (1/2) ÊI.

Geometric improvement. Get down to ǫEIc if this case holds for all iterations.

51

slide-53
SLIDE 53

Iterative Estimation: Proof

Case ÊI ≤ ǫ · EIc:

  Ê′I ≤ (ǫ/(2(1 + ǫ))) (ÊI + EIc) ≤ (ǫ/2) EIc.

ÊI fluctuates only in the range 0 to (ǫ/2)EIc after dropping below ǫEIc.

52

slide-54
SLIDE 54

Iterative Identification

Similar to estimation. Repeat log(∆/ǫ) times:

  • 1. Identify indices i with |xi|² > (ǫ/(4k(1 + ǫ))) EIc.
  • 2. Estimate each xi, for i ∈ I, by x̂i with ÊI ≤ EIc.
  • 3. Update x̂.

Final estimation:

  ÊI ≤ (ǫ/3) EIc.

53

slide-55
SLIDE 55

Iterative Identification: Proof

First: Estimation errors do not substantially affect Identification. Issue:

  • Have a set I of indices for intermediate r.
  • We’ll identify positions in x − r.
  • Values in (x − r)I are based on estimates and may be larger than xI
  • ...contribute extra noise; an obstacle to identification.

Identify i if |xi|² is large compared with EIc; so we get i if |xi|² is large compared with ÊI > (1 − ǫ) E > (1 − ǫ) EIc.

54

slide-56
SLIDE 56

Iterative Identification: Proof

Among the top k, we miss a total of at most

  E_{K\I} ≤ (ǫ/(2(1 + ǫ))) E = (ǫ/(2(1 + ǫ))) (EK + EKc).

Case EK > ǫEKc:

  E_{K\I} ≤ (ǫ/(2(1 + ǫ))) (EK + EKc) < (ǫ/(2(1 + ǫ))) EK + (1/(2(1 + ǫ))) EK = (1/2) EK.

55

slide-57
SLIDE 57

Iterative Identification: Proof

Case EK ≤ ǫEKc:

  E_{K\I} ≤ (ǫ/(2(1 + ǫ))) (EK + EKc) ≤ (ǫ/2) EKc.

In either case, we identify enough.

56

slide-58
SLIDE 58

Iterative Identification—proof

Three sources of error:

  • 1. outside top k—excusable.
  • 2. inside top k, but not found—small compared with excusable.
  • 3. found, and estimated incorrectly—small compared with excusable.

57

slide-59
SLIDE 59

Exactly k Terms Output

Algorithm:

  • 1. Get x̂ with ‖x̂ − x‖₂ ≤ (1 + ǫ) ‖x(k) − x‖₂.
  • 2. Estimate each xi by x̂i with |xi − x̂i|² ≤ (ǫ²/k) EKc.
  • 3. Output the top k terms of x̂, i.e., x̂(k).

58

slide-60
SLIDE 60

Exactly k Terms Output: Proof

Sources of error:

  • 1. Terms in K \ I (small; already shown)
  • 2. Error in terms we do take (small; already shown)
  • 3. Error from mis-ranking: if k + 1 terms are about equally good, we won't know for sure which are the k biggest.

59

slide-61
SLIDE 61

Exactly k Terms Output: Misranking

Idea: we only displace one term for another if their magnitudes are close. Some care is needed to keep the quadratic dependence on ǫ.

Let y be the vector of terms in the top k that are displaced by an equal number of terms not in the top k, the vector z. Both y and z have length at most k; yi is displaced by zi. Assume we have found and estimated all terms in y (else we don't care; these terms are small).

60

slide-62
SLIDE 62

Exactly k Terms Output: Proof

By the triangle inequality,

  |yi| ≤ |ŷi| + |yi − ŷi|  and  |zi| ≥ |ẑi| − |zi − ẑi|.

Thus

  |yi| − |zi| ≤ |ŷi| − |ẑi| + |yi − ŷi| + |zi − ẑi| ≤ |yi − ŷi| + |zi − ẑi| ≤ 2ǫ √(EKc/k),

so, summing over at most k coordinates,

  ‖ |y| − |z| ‖ ≤ 2ǫ √EKc.

61

slide-63
SLIDE 63

Exactly k Terms Output: Proof

Continuing:

  ‖ |z| ‖ = ‖z‖ ≤ √EKc
  ‖ |y| ‖ = ‖y‖ ≤ ‖z‖ + ‖ |y| − |z| ‖,

so

  ‖ |y| ‖ + ‖ |z| ‖ ≤ 2‖z‖ + ‖ |y| − |z| ‖ ≤ 2√EKc + 2ǫ√EKc ≤ 3√EKc,

62

slide-64
SLIDE 64

Exactly k Terms Output: Proof

so, finally,

  ‖y‖² − ‖z‖² = ‖ |y| ‖² − ‖ |z| ‖² = ⟨ |y| + |z|, |y| − |z| ⟩ ≤ ‖ |y| + |z| ‖ · ‖ |y| − |z| ‖ ≤ 3√EKc · 2ǫ√EKc ≤ 6ǫ EKc.

63

slide-65
SLIDE 65

Overview of Summaries

  • Heavy Hitters
  • Weak greedy sparse recovery
  • Orthonormal change of basis
  • Haar Wavelets
  • Histograms (piecewise constant)
  • Multi-dimensional (hierarchical)
  • Piecewise-linear
  • Range queries

64

slide-66
SLIDE 66

Finding Other Heavy Things

E.g., Fourier coefficients. Important by themselves; useful toward other kinds of summaries.

65

slide-67
SLIDE 67

Orthonormal bases

The columns {ψj} of U form an ONB if they are pairwise perpendicular and of unit Euclidean length. Thus

  ⟨ψj, ψk⟩ = 1 if j = k, and 0 otherwise.

E.g.:

  • Fourier basis
  • Haar wavelet basis

66

slide-68
SLIDE 68

Decompositions and Parseval

Let {ψj} be an ONB. Then, for any x,

  x = Σ_j ⟨x, ψj⟩ ψj,

and

  Σ_j ⟨x, ψj⟩² = Σ_i xi².

67

slide-69
SLIDE 69

Haar Wavelets, Graphically

68

slide-70
SLIDE 70

E.g. (rows, unnormalized):

  +1 +1 +1 +1 +1 +1 +1 +1
  −1 −1 −1 −1 +1 +1 +1 +1
  −1 −1 +1 +1 −1 −1 +1 +1
  −1 +1 −1 +1 −1 +1 −1 +1

69

slide-71
SLIDE 71

Heavy Hitters under Orthonormal Change of Basis

Have a vector x = U x̂, where x̂ is sparse. Process the stream by transforming Φ:

  • Collect Φ x̂ = Φ(U⁻¹U) x̂ = (ΦU⁻¹) x.

Answer queries:

  • Recover heavy hitters in x̂
  • Implicitly recover heavy U-coefficients of x.

Alternatively, transform the updates...

70

slide-72
SLIDE 72

Haar Wavelets—per-Item Time

See “add v to xi”. Want to simulate the changes to x̂ = U⁻¹x. Regard “add v to xi” as “add v·ei to x”. Decompose v·ei into its Haar wavelet components:

  v·ei = Σ_j v ⟨ei, ψj⟩ ψj.

Key: ⟨ei, ψj⟩ = 0 unless i ∈ supp(ψj).

  • Just O(log(d)) such j's—so only O(log(d)) of the x̂j's change.

71
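
A sketch of the per-item update (my own indexing of levels, using the standard normalized Haar wavelets; d a power of 2): only the one wavelet per scale whose support contains i gets a nonzero increment.

    import numpy as np
    from collections import defaultdict

    def haar_update(coef, i, v, d):
        """Apply "add v to x_i" to the Haar coefficients of x.
        Touches 1 scaling coefficient + log2(d) wavelet coefficients."""
        L = int(np.log2(d))
        coef[("scaling", 0)] += v / np.sqrt(d)
        for level in range(L):
            block = d >> level                       # support size at this level
            p = i // block                           # which wavelet contains i
            sign = +1 if i % block < block // 2 else -1
            coef[("wavelet", level, p)] += sign * v * np.sqrt(2 ** level / d)

    d = 8
    coef = defaultdict(float)
    haar_update(coef, i=5, v=1.0, d=d)               # x = e_5
    print(dict(coef))                                # only 1 + log2(8) = 4 entries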

slide-73
SLIDE 73

Overview of Summaries

  • Heavy Hitters
  • Weak greedy sparse recovery
  • Orthonormal change of basis
  • Haar Wavelets
  • Histograms (piecewise constant)
  • Multi-dimensional (hierarchical)
  • Piecewise-linear
  • Range queries

72

slide-74
SLIDE 74

Histograms

Still see a stream of additive updates: “add v to xi”. Want a B-piece piecewise-constant representation, h, with ‖h − x‖ ≤ (1 + ǫ) ‖hopt − x‖. We optimize both boundary positions and heights.

73

slide-75
SLIDE 75

[Figure: example histogram—number of employees vs. salary.]

74

slide-76
SLIDE 76

Histograms–Algorithm Overview

Key idea: Haar wavelets and histograms simulate each other efficiently.

  • t-term wavelet is O(t)-bucket histogram
  • B-bucket histogram is O(B log(d))-term wavelet rep’n

Next, class of algorithms with varying costs and guarantees:

  • Get good Haar representation
  • Modify it into a histogram

75

slide-77
SLIDE 77

Simulation

Histograms simulate Haar wavelets:

  • Each Haar wavelet is piecewise constant with 4 pieces (3 breaks), so t terms have 3t breaks, i.e., (3t + 1) pieces.

Haar wavelets simulate histograms:

  • If h is a B-bucket histogram and the ψj's are wavelets, then
    ✸ h = Σ_j ⟨h, ψj⟩ ψj.
    ✸ ⟨h, ψj⟩ = 0 unless supp(ψj) intersects a boundary of h.
    ✸ There are ≤ O(log(d)) such wavelets per boundary, so ≤ O(B log(d)) terms in a B-bucket histogram.

76

slide-78
SLIDE 78

Algorithm 1

  • 1. Get an O(B log(d))-term wavelet rep'n w with ‖w − x‖ ≤ (1 + ǫ) ‖hopt − x‖.
  • 2. Return w as an O(B log(d))-bucket histogram.

Compared with optimal: O(log(d)) times more buckets and (1 + ǫ) times more error—an (O(log(d)), 1 + ǫ)-approximation. We can do better...

77

slide-79
SLIDE 79

Algorithm 2

  • 1. Get an O(B log(d))-term wavelet rep'n w with ‖w − x‖ ≤ (1 + ǫ) ‖hopt − x‖.
  • 2. Return the best B-bucket histogram h to w. (How? Soon.)

Get a (1, 3 + o(1))-approximation:

  ‖h − x‖ ≤ ‖h − w‖ + ‖w − x‖ ≤ ‖hopt − w‖ + ‖w − x‖ ≤ ‖hopt − x‖ + 2‖w − x‖ ≤ (3 + 2ǫ) ‖hopt − x‖.

78

slide-80
SLIDE 80

Algorithm 3

  • 1. Get an O(B log(d) log(1/ǫ)/ǫ²)-term wavelet rep'n w with ‖w − x‖ ≤ (1 + ǫ) ‖hopt − x‖.

  • 2. Possibly discard some terms, getting a robust wrob.
  • 3. Output best B-bucket histogram h to wrob.

Get a (1, 1 + ǫ)-approximation. Next:

  • What is “robust?”
  • Proof of correctness.
  • How to find h from wrob.

79

slide-81
SLIDE 81

Robust Representations

Assume exact estimation. (We've shown estimation error is dominated by other error.) Have an O(B log(d) log(1/ǫ)/ǫ²)-term rep'n, w. Let B′ = 3B log(d) (the histogram-to-wavelet simulation expression). Consider w(B′), w(2B′), . . . Let wrob be

  wrob = w(jB′)  for the first j with ‖w(jB′..(j+1)B′)‖² ≤ ǫ² ‖w((j+1)B′..)‖²;
  wrob = w       otherwise.

“Take terms from the top until there is little progress.”

80

slide-82
SLIDE 82

Robust Representation, Continued Progress

Continued progress on w implies w is very close to x.

‖w(jB′..(j+1)B′)‖² drops exponentially in j:

  • 1. Group terms, 2/ǫ² per group.
  • 2. Each group has twice the energy of the remaining terms, i.e., twice the energy of the remaining groups, so at least twice the energy of the next group.

81

slide-83
SLIDE 83

Robust Representation, Continued Progress

Terms drop off exponentially. Thus

  ‖x − wrob‖² = ‖x − w‖² ≤ d ‖w(last)‖² ≤ ǫ² ‖w(B′..2B′)‖² ≤ ǫ² ‖x − w(1..B′)‖² ≤ ǫ²(1 + ǫ) ‖x − hopt‖².

Need T = (1/ǫ²) log(d/ǫ²) repetitions, so (1 − ǫ²)^T = ǫ²/d.

82

slide-84
SLIDE 84

Robust Representation, Continued Progress

Note:

  • ‖x − w(B′)‖ ≤ (1 + ǫ) ‖x − hopt‖, i.e., w(B′) is accurate enough. (It has too many terms.)

Final guarantee:

  ‖h − x‖ ≤ ‖h − wrob‖ + ‖wrob − x‖ ≤ ‖hopt − wrob‖ + ‖wrob − x‖ ≤ ‖hopt − x‖ + 2‖wrob − x‖ ≤ (1 + 3ǫ) ‖hopt − x‖.

Adjust ǫ, and we're done.

83

slide-85
SLIDE 85

Robust Representation, No Progress

No progress on w implies no progress on x:

  ‖w(jB′..(j+1)B′)‖² ≤ ǫ² ‖w((j+1)B′..)‖²

implies

  ‖w(jB′..(j+1)B′)‖² ≤ ǫ² ‖x((j+1)B′..)‖² ≤ ǫ² ‖x − hopt‖².

So the best linear combination, r, of wrob and any B-bucket histogram isn't much better than wrob.

84

slide-86
SLIDE 86

Robust Representation, No Progress

[Figure: geometric picture of x, wrob ≈ r, h, and hopt.]

Approximately: ‖h − r‖ ≤ ‖hopt − r‖, so ‖h − x‖ ≤ ‖hopt − x‖.

85

slide-87
SLIDE 87

Robust Representation, No Progress

‖x − wrob‖ and ‖wrob − hopt‖ are bounded:

  ‖x − wrob‖ ≤ (1 + ǫ) ‖x − hopt‖
  ‖wrob − hopt‖ ≤ (3 + ǫ) ‖x − hopt‖.

Also, ‖r − wrob‖ ≤ ǫ ‖x − hopt‖.

86

slide-88
SLIDE 88

Robust Representation, No Progress

We have

  ‖h − x‖² = ‖h − r‖² + ‖r − x‖²
           ≤ (‖h − wrob‖ + ‖wrob − r‖)² + (‖x − wrob‖ − ‖wrob − r‖)²
           ≤ ‖h − wrob‖² + ‖wrob − r‖² + ‖x − wrob‖² + ‖wrob − r‖² + 2 ‖h − wrob‖ · ‖wrob − r‖
           ≤ ‖hopt − wrob‖² + ‖wrob − r‖² + ‖x − wrob‖² + ‖wrob − r‖² + 2 ‖hopt − wrob‖ · ‖wrob − r‖
           ≤ ‖hopt − wrob‖² + ‖x − wrob‖² + 9 · ǫ · ‖x − hopt‖²,

87

slide-89
SLIDE 89

Robust Representation, No Progress

...and, similarly,

  ‖hopt − x‖² = ‖hopt − r′‖² + ‖r′ − x‖²
             ≥ (‖hopt − wrob‖ − ‖wrob − r′‖)² + (‖x − wrob‖ − ‖wrob − r′‖)²
             ≥ ‖hopt − wrob‖² + 2‖wrob − r′‖² + ‖x − wrob‖² − 2 ‖hopt − wrob‖ · ‖wrob − r′‖ − 2 ‖x − wrob‖ · ‖wrob − r′‖
             ≥ ‖hopt − wrob‖² + ‖x − wrob‖² − 9 · ǫ · ‖x − hopt‖².

88

slide-90
SLIDE 90

Robust Representation, No Progress

So

  ‖h − x‖² − ‖hopt − x‖² ≤ 18 · ǫ · ‖x − hopt‖²,

or

  ‖h − x‖² ≤ (1 + 18ǫ) ‖hopt − x‖².

89

slide-91
SLIDE 91

Warmup: Best Histogram, Full Space

Want the best B-bucket histogram to x. Use dynamic programming, based on the following recursion. Define

  • Err[j, k] = error of the best k-bucket histogram to x on [0, j).
  • Cost[ℓ, j] = error of the best 1-bucket histogram to x on [ℓ, j).

So:

  Err[j, k] = min_{ℓ<j} ( Err[ℓ, k − 1] + Cost[ℓ, j] ).

“k − 1 buckets on [0, ℓ) and one bucket on [ℓ, j). Take the best ℓ.”

Runtime: j < d, k < B, ℓ < d; total O(d²B). Can construct the actual histogram (not just its error) as we go (keep the ℓ's that witness the minimization).

90
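
A direct transcription of the recursion into Python (full-space warmup, O(d²B) time; cost() is computed naively here, and the prefix-array speedup of the next slide replaces it):

    import numpy as np

    def best_histogram_error(x, B):
        """Err[j][k] = error of the best k-bucket histogram to x on [0, j)."""
        d = len(x)

        def cost(l, j):                      # best 1-bucket (constant) fit on [l, j)
            seg = x[l:j]
            return ((seg - seg.mean()) ** 2).sum()

        INF = float("inf")
        Err = [[INF] * (B + 1) for _ in range(d + 1)]
        Err[0][0] = 0.0
        for j in range(1, d + 1):
            for k in range(1, B + 1):
                # k-1 buckets on [0, l) and one bucket on [l, j); take the best l
                Err[j][k] = min(Err[l][k - 1] + cost(l, j) for l in range(j))
        return Err[d][B]

    x = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8, 2.0, 2.1])
    print(best_histogram_error(x, B=3))      # small: x is nearly 3-piece constant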

slide-92
SLIDE 92

Prefix array

From x, construct the prefix array Px: x0, x0 + x1, x0 + x1 + x2, . . . Also P(x²). Can then get Cost[ℓ, j] from ℓ and j in constant time:

  • xℓ + xℓ+1 + · · · + xj−1 = (Px)j − (Px)ℓ.
  • The best height is the average µ = (1/(j − ℓ)) ((Px)j − (Px)ℓ).
  • The error is Σ_{ℓ≤i<j} (xi − µ)² = Σ_{ℓ≤i<j} xi² − 2µ Σ_{ℓ≤i<j} xi + (j − ℓ)µ².

91
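
And a sketch of the constant-time Cost[ℓ, j] via the prefix arrays (function names mine):

    import numpy as np

    def make_cost(x):
        """Precompute Px and P(x^2); then the best constant fit on [l, j)
        costs sum x_i^2 - 2*mu*sum x_i + (j - l)*mu^2, in O(1) per query."""
        P = np.concatenate(([0.0], np.cumsum(x)))        # (Px)_m = x_0 + ... + x_{m-1}
        P2 = np.concatenate(([0.0], np.cumsum(x ** 2)))

        def cost(l, j):
            s, s2 = P[j] - P[l], P2[j] - P2[l]
            mu = s / (j - l)                             # best height: the average
            return s2 - 2 * mu * s + (j - l) * mu * mu
        return cost

    x = np.array([1.0, 1.1, 0.9, 5.0, 5.2, 4.8])
    cost = make_cost(x)
    print(cost(0, 3), cost(3, 6))                        # 0.02, 0.08 (up to rounding)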

slide-93
SLIDE 93

Best Histogram to Robust Representation

Want best B-bucket histogram h to wrob. wlog, boundaries of h are among boundaries of wrob. Dynamic programming takes time O(|wrob|2 · B), where |wrob| is the number of boundaries in wrob.

92

slide-94
SLIDE 94

Overview of Summaries

  • Heavy Hitters
  • Weak greedy sparse recovery
  • Orthonormal change of basis
  • Haar Wavelets
  • Histograms (piecewise constant)
  • Multi-dimensional (hierarchical)
  • Piecewise-linear
  • Range queries

93

slide-95
SLIDE 95

Two-Dimensional Histograms

Approximation is constant on rectangles. Hierarchical (recursively split an existing rectangle) or general. Theorem: Any B-bucket (general) partition can be refined into a (4B)-bucket hierarchical partition. Proof omitted; not needed for the algorithm. Aim: a (1, 1 + ǫ)-approximate hierarchical histogram, which is a (4, 1 + ǫ)-approx general histogram.

[Figure: a square split into numbered rectangular buckets.]

94

slide-96
SLIDE 96

2-D Histograms–Overall Strategy

Same overall strategy as 1-D:

  • Find the best B′-term rep'n over a “tensor-product of Haar wavelets.”
  • Cull back to a robust representation, wrob.
  • Output the best hierarchical histogram to wrob.

Next:

  • What is tensor-product of Haar wavelets?
  • How to find best B-bucket hierarchical histogram.

95

slide-97
SLIDE 97

Tensor products

Need ONB that simulates and is simulated by 1-bucket histograms. Generally: (α ⊗ β)(x, y) = α(x)β(y). Use tensor product of Haar wavelets: ψj,k(x, y) = ψj(x) · ψk(y). Tensor product of ONBs is ONB.

96

slide-98
SLIDE 98

Processing Updates

An update to x leads to updates to O(log²(d)) tensor-product Haar wavelet coefficients. (The algorithm is exponential in the dimension, here 2.)

97

slide-99
SLIDE 99

Dynamic Programming

Want best hierarchical h to wrob. Boundaries of h can be taken from boundaries of wrob. Best j-cut hierarchical h has:

  • a full cut (horiz or vert, say vert)
  • a k-cut partition on the left
  • a (j − 1 − k)-cut partition on the right.

Runtime: polynomial in boundaries of wrob and desired number of buckets.

98

slide-100
SLIDE 100

Overview of Summaries

  • Heavy Hitters
  • Weak greedy sparse recovery
  • Orthonormal change of basis
  • Haar Wavelets
  • Histograms (piecewise constant)
  • Multi-dimensional (hierarchical)
  • Piecewise-linear
  • Range queries

99

slide-101
SLIDE 101

Piecewise-linear representations

Want best B-bucket pw-linear approx to x. Same overall strategy:

  • Find best “linear multiwavelet” representation
  • Cull back to a robust representation, wrob
  • Output best B-bucket piecewise-linear representation to wrob.

Next:

  • What are linear multiwavelets?
  • How to find best B-bucket piecewise-linear representation.

100

slide-102
SLIDE 102

Linear Multiwavelets, Graphical

[Figure: the three local shapes—constant, slope, and vee.]

101

slide-103
SLIDE 103

Linear Multiwavelets

E.g. (rows, unnormalized):

  +1 +1 +1 +1 +1 +1 +1 +1
  −7 −5 −3 −1 +1 +3 +5 +7
  +3 +1 −1 −3 −3 −1 +1 +3
  +7 −1 −9 −17 +17 +9 +1 −7
  +1 −1 −1 +1 −1 +3 −3 +1
  +1 −1 −1 +1 −1 +3 −3 +1

102

slide-104
SLIDE 104

Linear Multiwavelets: Properties

  • ONB
  • Linear multiwavelets and pw-linear representations simulate each other with an O(log(d))-factor blowup

103

slide-105
SLIDE 105

Best Piecewise-Linear Representation

Have wrob (a pw-linear rep'n with B′ ≈ B · log(d)/ǫ pieces). Want the best B-bucket pw-linear rep'n h to wrob. Recall the best 1-bucket rep'n to x is ⟨x, ψ⟩ψ + ⟨x, φ⟩φ, where ψ is constant and φ is slant. Need

  • New prefix arrays
  • “Dual dynamic programming;” cost polynomial in B log(d)/ǫ.

104

slide-106
SLIDE 106

Prefix arrays:

  • Get ⟨x, ψ⟩ from Px
  • Get ⟨x, φ⟩ from P(x · φ) and Px
  • The error of a · ψ + b · φ to x is
    ‖x − (a · ψ + b · φ)‖² = ⟨x − (a · ψ + b · φ), x − (a · ψ + b · φ)⟩.

Also need P(x²).

105

slide-107
SLIDE 107

Dual Dynamic Programming

Define Far[k, m] as the biggest j such that there's a k-bucket histogram on [0, j) with error at most m (in appropriate units). Assume we know E with (1/2)E ≤ Eopt ≤ E.

Consider m = 0, ǫE/B, 2ǫE/B, . . ., 2E. (B/ǫ possibilities for m; the coarse granularity leads to ǫE/B extra error per boundary—ǫE in all.) Thus:

  Far[k, m] = max_n { j : n + Cost[Far[k − 1, n], j] < m }.

“Go as far as we can with k − 1 buckets and error n, then add 1 bucket. Try all n.”

Runtime: k < B, m < B/ǫ, n < B/ǫ, find j by binary search: O(B³ log(d)/ǫ²).

106

slide-108
SLIDE 108

Rangesum histograms

Given x, want a pw-constant h that optimizes range queries to x:

  Σ_{ℓ,r} ( Σ_{ℓ≤i<r} (hi − xi) )².

The height of a bucket affects many non-local queries. Foils previous tricks. Instead, transform to the prefix domain.

107

slide-109
SLIDE 109

Transform to Prefix domain

  Σ_{ℓ,r} ( Σ_{ℓ≤i<r} (hi − xi) )²
    = Σ_{ℓ,r} ( (P(h − x))r − (P(h − x))ℓ )²
    = Σ_{ℓ,r} ( (P(h − x))r² + (P(h − x))ℓ² − 2 (P(h − x))r (P(h − x))ℓ )
    = 2d Σ_ℓ ((Ph)ℓ − (Px)ℓ)²   (we'll make Σ_ℓ (P(h − x))ℓ = 0)
    = 2d ‖Ph − Px‖².

Get a point-query problem.

108

slide-110
SLIDE 110

Prefix array of histograms

If h is pw-constant, then Ph is piecewise-linear and connected. We do not know how to find a near-best pwlc approximation to a given Px (equivalent to the original problem). Instead, find a near-best B-bucket pw-linear (disconnected) approximation to Px under point queries. This leads to a (2B)-bucket pw-constant rep'n for range queries to x.

109

slide-111
SLIDE 111

Simulate/Invert Prefix Array

When reading x, simulate reading Px:

  • “add 5 to x3” becomes “add 5 to (Px)3, (Px)4, (Px)5, . . .”
  • Affects only O(log(d)) linear multiwavelets (whose support includes 3).

From Ph, recover hi = (∆(Ph))i = (Ph)i+1 − (Ph)i.

110

slide-112
SLIDE 112

Overall algorithm

  • When reading x, simulate reading Px.
  • Find the best (2B)-bucket pw-linear approx ℓ to Px under point queries.
  • Make sure avg(ℓ) = avg(Px). (Approximately enforced automatically by optimality.)
  • Output ∆ℓ as a (2, 1 + ǫ)-approximation, i.e., 2B buckets, (1 + ǫ) times the best error under range queries.

111

slide-113
SLIDE 113

Overview of Summaries

  • Heavy Hitters
  • Weak greedy sparse recovery
  • Orthonormal change of basis
  • Haar Wavelets
  • Histograms (piecewise constant)
  • Multi-dimensional (hierarchical)
  • Piecewise-linear
  • Range queries

112