Summaries of Streaming Data
Martin J. Strauss University of Michigan
National retailer sees a stream of transactions: 2 Thomas sold, 1 Thomas returned, 1 TSP sold, ...
Implies vector x of item frequencies: 40 Thomas, ...
Goal: Track items with large-magnitude counts
Measurements: signal x, measurement matrix Φ; the sketch is y = Φx.
E.g., x is a single spike of value 5.3 and Φ has 0/1 entries, so each measurement is either 5.3 or 0.
Recover position and coefficient of the single spike in the signal.
Identification: output a set that contains every large-magnitude index (and not too many others).
Estimation: estimate the coefficient of each identified index.
Fundamental queries can be used to build summaries:
Other user queries can be answered from the summary.
Design: choose the measurement matrix Φ.
Process stream: maintain the sketch y = Φx. On update "add v to x_i", i.e., x ← x + v·e_i, update y ← y + v·Φe_i.
Answer queries: from y alone.
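A minimal sketch of this update rule in Python, assuming a dense ±1 matrix Φ purely for illustration (any fixed measurement matrix works the same way):

    import numpy as np

    d, m = 1024, 64                               # signal length, sketch size
    rng = np.random.default_rng(0)
    Phi = rng.choice([-1.0, 1.0], size=(m, d))    # illustrative dense matrix
    y = np.zeros(m)                               # sketch of the all-zeros x

    stream = [(3, 2.0), (3, -1.0), (17, 5.3)]     # updates "add v to x_i"
    for i, v in stream:
        y += v * Phi[:, i]                        # invariant: y = Phi x

Linearity is the whole point: the sketch never stores x, and deletions (v < 0) cost the same single column operation as insertions.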
Space: the number of measurements, i.e., the number of rows of Φ.
Time per item: the time to compute and add the column v·Φe_i.
(Still need to analyze time for queries. Depends a lot on Φ and the decoding procedure.)
The bit-testing matrix has d columns and log(d) rows. (Deterministic and efficient.) E.g., with a spike 5.3 plus small noise 0.1, 0.2, ..., the measurements 5.6, ..., 0.2, 5.5 still reveal the spike. If b_ℓ is the ℓ'th row of the matrix and the spike is at i, need
|x_i| ≥ 2.01 Σ_{j≠i} |x_j|, or (weaker) ∀ℓ: |x_i| > 2.01 |Σ_{j≠i} b_{ℓ,j} x_j|.
Example (group testing):
– exactly one sick soldier
– about 1/6 of the dilution from healthy soldiers
– clear ≥ 3 groups: 75 soldiers
Problem: find i such that |x_i| > (1/k) Σ_{j≠i} |x_j|.
Solution: hash. Keep each position with probability 1/(12k), at random:
– i.e., consider x∘r, where r is a 0/1-valued random vector.
– With probability 1/(12k) we keep i, i.e., r_i = 1.
So E[Σ_{j≠i} |r_j x_j|] = Σ_{j≠i} E[|r_j x_j|] = (1/(12k)) Σ_{j≠i} |x_j|.
So, with probability ≥ 3/4 (independently of whether r_i = 1),
Σ_{j≠i} |r_j x_j| ≤ (1/(3k)) Σ_{j≠i} |x_j| < (1/3) |x_i r_i|.
Repeat, and proceed as above!
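A small empirical check of this subsampling step, in Python; the 1/(12k) rate and the 1/3 threshold come from the slides, while the vector and trial count are illustrative:

    import random

    def isolation_rate(x, i_star, k, trials=10000):
        # Fraction of random 0/1 masks r (rate 1/(12k)) for which the surviving
        # noise sum_{j != i_star} |r_j x_j| stays below |x_{i_star}| / 3.
        rate, good = 1.0 / (12 * k), 0
        for _ in range(trials):
            noise = sum(abs(v) for j, v in enumerate(x)
                        if j != i_star and random.random() < rate)
            good += noise < abs(x[i_star]) / 3
        return good / trials

For x satisfying |x_i| > (1/k) Σ_{j≠i} |x_j|, the returned fraction should be at least about 3/4, matching the bound above.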
Recall that a random variable is a function on a sample space: X : Ω → ℝ, ω ↦ X(ω). Then E[X] = Σ_{ω∈Ω} X(ω) Pr(ω), and so
E[X + Y] = Σ_ω (X(ω) + Y(ω)) Pr(ω) = Σ_ω X(ω) Pr(ω) + Σ_ω Y(ω) Pr(ω) = E[X] + E[Y].
Theorem (Markov): If X is a non-negative random variable and a > 0, then Pr(X ≥ a) ≤ E[X]/a.
Proof: E[X] = Σ_x x Pr(X = x) ≥ Σ_{x≥a} a Pr(X = x) = a Pr(X ≥ a).
E.g., Pr(X ≥ 4E[X]) ≤ 1/4.
Pr(success) ≥ (3/4)·(1/(4k)) = 3/(16k) > 1/(6k), so Pr(failure) < 1 − 1/(6k).
Repeat 6k times, independently: Pr(all failures) < (1 − 1/(6k))^{6k} ≈ 1/e ≈ .37 < .5.
Repeat a total of 6km times, for failure probability 2^{−Ω(m)}.
Collect repeated r’s into matrix, R. Take row tensor product R ⊗r B with bit testing matrix, B:
E.g., if R has 0/1 mask rows and B is the bit-testing matrix, each row of B ⊗r R is the entrywise product of one row of R with one row of B: a bit test restricted to the randomly kept positions.
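A sketch of the row tensor product in Python; the particular R and B below are hypothetical stand-ins for one random mask and the bit-testing matrix at d = 8:

    import numpy as np

    def row_tensor(A, B):
        # Each row of the result is the entrywise product of one row of A
        # with one row of B (all pairs of rows).
        return np.array([a * b for a in A for b in B])

    R = np.array([[1, 0, 0, 1, 0, 1, 0, 0]])       # hypothetical 0/1 mask
    B = np.array([[1, 1, 1, 1, 1, 1, 1, 1],        # "count everything" row
                  [0, 0, 0, 0, 1, 1, 1, 1],        # bit 2 test
                  [0, 0, 1, 1, 0, 0, 1, 1],        # bit 1 test
                  [0, 1, 0, 1, 0, 1, 0, 1]])       # bit 0 test
    M = row_tensor(R, B)                           # bit tests on kept positions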
Problem: Suppose now that x_i² > (1/k′) Σ_{j≠i} x_j²; want to find i.
Solution: keep each position with probability 1/(36k′), at random:
– s has random signs
– r is a random mask
Still keep i with probability 1/(12k′). (Assume this.)
E[(Σ_{j≠i} b_j r_j s_j x_j)²]
= E[Σ_{j,ℓ≠i} b_j b_ℓ r_j r_ℓ s_j s_ℓ x_j x_ℓ]
= E_r[Σ_{j,ℓ≠i} E_s[s_j s_ℓ] r_j r_ℓ b_j b_ℓ x_j x_ℓ]
= E_r[Σ_{j≠i} r_j x_j²]
= Σ_{j≠i} E[r_j] x_j²
= (1/(12k′)) Σ_{j≠i} x_j²
< (1/12) x_i².
With probability ≥ 3/4,
(Σ_{j≠i} b_j r_j s_j x_j)² < (1/9) x_i², i.e., |Σ_{j≠i} b_j r_j s_j x_j| < (1/3) |r_i s_i x_i|.
(Extra repetitions are needed to make all b_ℓ work simultaneously.) Proceed as above.
Theorem: If X and Y are independent, then E[XY] = E[X]E[Y].
Proof: E[XY] = Σ_{x,y} xy Pr(X = x and Y = y) = Σ_{x,y} xy Pr(X = x) Pr(Y = y) = E[X]E[Y].
Theorem: (1/d) (Σ_{i<d} |x_i|)² ≤ Σ_{i<d} x_i² ≤ (Σ_{i<d} |x_i|)²; either equality is possible.
Thus, if |x_i| > Σ_{j≠i} |x_j|, then
x_i² > (Σ_{j≠i} |x_j|)² > Σ_{j≠i} x_j².
But if x_i² > Σ_{j≠i} x_j², then all we know is
|x_i| > (Σ_{j≠i} x_j²)^{1/2} > (1/√d) Σ_{j≠i} |x_j|.
Weaker by the large factor √d.
For Σ_i x_i² ≤ (Σ_i |x_i|)²:
Σ_i x_i² ≤ Σ_{i,j} |x_i||x_j| = (Σ_i |x_i|)².
Pick out the diagonal; equality if there is only one nonzero term.
For (1/d) (Σ_i |x_i|)² ≤ Σ_i x_i², need Σ_i |x_i| ≤ √d·‖x‖:
Σ_i |x_i| = ⟨|x|, 1⟩ ≤ ‖x‖ · ‖1‖ = ‖x‖ · √d.
We'll show ⟨x, y⟩ ≤ ‖x‖ ‖y‖. Can normalize; assume ‖x‖ = ‖y‖ = 1. Then 0 ≤ ⟨x − y, x − y⟩ = ‖x‖² + ‖y‖² − 2⟨x, y⟩. So ⟨x, y⟩ ≤ (‖x‖² + ‖y‖²)/2 = 1 = ‖x‖ · ‖y‖. Equality if (and only if) x and y are proportional.
Let s be a random ±1-valued vector. The atomic estimator for x_i is X = s_i ⟨x, s⟩. Then X = s_i Σ_j s_j x_j = Σ_j s_i s_j x_j, so E[X] = Σ_j E[s_i s_j] x_j = x_i. Need to bound the variance.
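The estimator in a few lines of Python, as a sanity check; the vector and repetition count are illustrative:

    import random

    def atomic_estimator(x, i, rng):
        # X = s_i <x, s> for a random +/-1 vector s; E[X] = x_i.
        s = [rng.choice((-1, 1)) for _ in x]
        return s[i] * sum(sj * xj for sj, xj in zip(s, x))

    rng = random.Random(42)
    x = [5.3, 0.1, -0.2, 0.0, 1.0]
    estimates = [atomic_estimator(x, 0, rng) for _ in range(10000)]
    print(sum(estimates) / len(estimates))    # concentrates near x_0 = 5.3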
Also var(X) = E[X²] − x_i², and
E[X²] = E[Σ_{j,ℓ} s_j s_ℓ x_j x_ℓ] = Σ_{j,ℓ} E[s_j s_ℓ] x_j x_ℓ = Σ_j x_j²,
so var(X) = Σ_{j≠i} x_j² ≤ ‖x‖².
Standard deviation small/bounded in terms of the target value.
Theorem (Chebyshev): For a > 0, Pr(|X − E[X]| ≥ a) ≤ var(X)/a².
Proof: Pr((X − E[X])² ≥ a²) ≤ var(X)/a², by Markov applied to (X − E[X])².
Get Pr(|X − x_i| ≥ 3‖x‖) ≤ 1/9.
Let Y be the average of m copies of X. Then E[Y] = E[X] and var(Y) = (1/m) var(X).
Get Pr(|Y − x_i| ≥ (3/√m) ‖x‖) ≤ 1/9.
Theorem: Let Y be the average of m copies of X. Then var(Y) = (1/m) var(X).
Proof: Let µ = E[X] = E[Y]. Then E[X − µ] = 0 and var(X − µ) = E[(X − µ − 0)²] = var(X).
So assume E[X] = E[Y] = 0. Then
var(Y) = E[Y²] = E[((1/m) Σ_i X_i)²] = (1/m²) Σ_{i,j} E[X_i X_j] = (1/m²) Σ_i E[X_i²] (using independence) = (1/m) E[X²].
Theorem: Suppose Pr(Y is bad) < 1/9. Let Z be the median of ℓ independent copies of Y. Then Pr(Z is bad) < 2^{−Ω(ℓ)}.
Proof: Z is bad only if at least half of the Y's are bad. Apply Chernoff.
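The averaging and median steps combine into a compact Python helper (the grouping of samples is the caller's choice):

    import statistics

    def median_of_means(samples, m, l):
        # Average m atomic estimates to shrink the variance, then take the
        # median of l independent averages: failure probability 2^-Omega(l).
        assert len(samples) >= m * l
        means = [sum(samples[g * m:(g + 1) * m]) / m for g in range(l)]
        return statistics.median(means)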
Theorem (Chernoff): Suppose each of n Y_i's is independent, with Y_i = 1 − p with probability p and Y_i = −p with probability 1 − p. Let Y = Σ_i Y_i. If a > 0, then
Pr(Y > a) < e^{−2a²/n}.
(Just for p = 1/2, so Y_i is ±1/2, uniformly.)
Lemma: For λ > 0, (e^λ + e^{−λ})/2 < e^{λ²/2}. (Proof: Taylor.)
E[e^{2λ Σ Y_i}] = Π_i E[e^{2λ Y_i}] = ((e^λ + e^{−λ})/2)^n < e^{λ²n/2}.
Pr(Y > a) = Pr(e^{2λY} > e^{2λa}) ≤ E[e^{2λY}]/e^{2λa} ≤ e^{λ²n/2 − 2λa}.
Put λ = 2a/n; get Pr(Y > a) < e^{−2a²/n}.
Find all i such that x_i² > (1/k) Σ_j x_j², with failure probability 2^{−ℓ}.
Estimate each x_i up to ±ε‖x‖, with failure probability 2^{−ℓ}.
(With corresponding space and runtimes.)
To this point, fully random matrices. But storing full randomness costs too much (limited independence is usually OK): we need any two entries to be independent, but three entries may be dependent.
Random vector s in {±1}^d (equivalently, Z_2^d).
An index i is a 0/1 vector of length log(d), i.e., i ∈ Z_2^{log(d)}. Pick a vector q ∈ Z_2^{log(d)} and a bit c ∈ Z_2. Define s_i = c + ⟨q, i⟩ (mod 2). Then, if i ≠ j, the pair (s_i, s_j) takes all four possibilities with equal probability.
s_i is uniform because c is random. Conditioned on s_i, s_j is uniform: s_j = s_i + ⟨q, i ⊕ j⟩, and since i ⊕ j ≠ 0 the bit ⟨q, i ⊕ j⟩ is uniform and independent of s_i.
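The construction in Python; q and c would be drawn fresh for each row of the sketch, and d is assumed to be a power of 2:

    import random

    def pairwise_signs(d, rng):
        # s_i = c + <q, i> (mod 2), mapped to +/-1. Storing s costs only the
        # log(d) + 1 bits of (q, c), yet any two s_i, s_j are independent.
        q = rng.randrange(d)                      # random log(d)-bit vector
        c = rng.randrange(2)                      # random bit
        def s(i):
            parity = bin(i & q).count("1") % 2    # <q, i> over Z_2
            return 1 - 2 * (c ^ parity)           # bit 0 -> +1, bit 1 -> -1
        return s

    s = pairwise_signs(1024, random.Random(7))
    row = [s(i) for i in range(1024)]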
Hashing into one of k buckets. Take log(k) independent hashes into two buckets. Get bucket label bit-by-bit.
For each row s, need only store q and c: log(d) + 1 bits. For each row r, need only log(k) copies of q and c: O(log(d) log(k)) bits. (Many other constructions are possible.)
Find all i such that x_i² > (1/k) Σ_{j≠i} x_j², with failure probability 2^{−ℓ}.
Next topic: Sparse Recovery. Fix k and ε. Want x̂ such that
‖x − x̂‖ ≤ (1 + ε) ‖x − x_{(k)}‖.
Here x_{(k)} is the best k-term approximation to x. Will build on heavy hitters.
Suppose k = 10 and the coefficient magnitudes are 1, 1/2, 1/4, 1/8, 1/16, .... Want to find the top k terms in time poly(k), not time 2^k. The Heavy Hitters algorithm only guarantees that we find and estimate well terms with magnitude around 1/k (about log(k) terms).
– an iterative subroutine, described next
After removing the top few terms, the others become relatively larger. Can get the sketch Φ(x − r) as Φx − Φr. At this point, the approximation r may have more than k terms (to be fixed). Weak greedy: may not find the heaviest term.
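A schematic of the loop in Python; find_heavy and column are stand-ins for the heavy-hitters decoder and the map i ↦ Φe_i, neither of which is pinned down here:

    def iterative_recovery(y, find_heavy, column, rounds):
        # Weak greedy: decode heavy terms from the sketch of the residual,
        # fold them into the approximation r, and subtract v * (Phi e_i)
        # so that y remains a sketch of x - r.
        r = {}
        for _ in range(rounds):
            for i, v in find_heavy(y):
                r[i] = r.get(i, 0.0) + v
                y = [yj - v * cj for yj, cj in zip(y, column(i))]
        return r

The subtraction step is legal precisely because the sketch is linear: Φ(x − r) = Φx − Φr.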
Have: a set I of k indices; parameter ε. Want: coefficient estimates so that the resulting approximation x̂ satisfies the (1 + ε) guarantee.
Define Σ_{i∈I} |x_i|² to be the original energy in I, and
E_I = Σ_{i∈I} |x_i − x̂_i|²
to be the energy in I after one round of estimation.
Have: a set I of k indices; parameter ε. Want: coefficient estimates so that the resulting approximation x̂ satisfies the (1 + ε) guarantee.
Repeat log(∆/ε) times: estimate each x̂_i, i ∈ I, with
|x̂_i − x_i|² < (ε/(2k(1+ε))) E_{i^c},
where E_{i^c} ≤ E_I + E_{I^c} is the energy away from i.
Get: E_I ≤ (ε/(2(1+ε))) (E_I + E_{I^c}) (new E_I in terms of old).
Case E_I > ε·E_{I^c}:
E_I ≤ (ε/(2(1+ε))) (E_I + E_{I^c}) ≤ (ε/(2(1+ε))) E_I + (1/(2(1+ε))) E_I = (1/2) E_I.
Geometric improvement. Get down to ε E_{I^c} if this case holds for all iterations.
Case E_I ≤ ε·E_{I^c}:
E_I ≤ (ε/(2(1+ε))) (E_I + E_{I^c}) ≤ (ε/2) E_{I^c}.
E_I fluctuates only in the range 0 to (ε/2) E_{I^c} after dropping below ε E_{I^c}.
Similar to estimation. Repeat log(∆/ε) times: identify and estimate, with per-term error at most (ε/(4k(1+ε))) E_{i^c}, obtaining x̂_i with E_I ≤ E_{I^c}.
Final estimation: E_I ≤ (ε/3) E_{I^c}.
First: estimation errors do not substantially affect identification.
Issue: identification runs against the estimated residual rather than x_I itself.
Identify i if |x_i|² is large compared with E_{i^c}; so we get i if |x_i|² is large compared with E_I > (1 − ε) E > (1 − ε) E_{i^c}.
Among the top k, we miss a total of at most
E_{K\I} ≤ (ε/(2(1+ε))) E = (ε/(2(1+ε))) (E_K + E_{K^c}).
Case E_K > ε E_{K^c}:
E_{K\I} ≤ (ε/(2(1+ε))) (E_K + E_{K^c}) < (ε/(2(1+ε))) E_K + (1/(2(1+ε))) E_K = (1/2) E_K.
Case E_K ≤ ε E_{K^c}:
E_{K\I} ≤ (ε/(2(1+ε))) (E_K + E_{K^c}) ≤ (ε/2) E_{K^c}.
In either case, we identify enough.
Three sources of error: terms missed in identification, estimation error on the terms found, and confusion among terms of nearly equal magnitude. Each is excusable.
Algorithm: output x̂ with ‖x − x̂‖² ≤ (1 + ε) ‖x − x_{(k)}‖²:
– find x̂_i with |x_i − x̂_i|² ≤ (ε²/k) E_{K^c};
– keep the top k terms of x̂, i.e., x̂_{(k)}.
Sources of error: estimation perturbs magnitudes, so we cannot be sure which are the k biggest terms.
Idea: we only displace one term for another if their magnitudes are close.
Let y be the vector of terms in the top k that are displaced by an equal number of terms not in the top k, the vector z. Both y and z have length at most k; y_i is displaced by z_i. Assume we have found and estimated all terms in y (else we don't care; those terms are small).
By the triangle inequality,
|y_i| ≤ |ŷ_i| + |y_i − ŷ_i| and |z_i| ≥ |ẑ_i| − |z_i − ẑ_i|.
Thus |y_i| − |z_i| ≤ |ŷ_i| − |ẑ_i| + |y_i − ŷ_i| + |z_i − ẑ_i| ≤ |y_i − ŷ_i| + |z_i − ẑ_i| ≤ 2ε √(E_{K^c}/k),
since |ŷ_i| ≤ |ẑ_i| (z_i displaced y_i) and each estimate errs by at most ε √(E_{K^c}/k).
Thus ‖ |y| − |z| ‖ ≤ 2ε √(E_{K^c}) (summing over ≤ k coordinates).
Continuing...
‖z‖ ≤ √(E_{K^c}), and ‖y‖ ≤ ‖z‖ + ‖ |y| − |z| ‖, so
‖y‖ + ‖z‖ ≤ 2‖z‖ + ‖ |y| − |z| ‖ ≤ 2√(E_{K^c}) + 2ε√(E_{K^c}) ≤ 3√(E_{K^c}).
So, finally,
‖y‖² − ‖z‖² = ⟨|y| + |z|, |y| − |z|⟩ ≤ ‖ |y| + |z| ‖ · ‖ |y| − |z| ‖ ≤ 3√(E_{K^c}) · 2ε√(E_{K^c}) = 6ε E_{K^c}.
E.g., Fourier coefficients: important by themselves, and useful toward other kinds of summaries.
The columns {ψ_j} of U form an orthonormal basis (ONB) if they are pairwise perpendicular and have unit Euclidean length. Thus
⟨ψ_j, ψ_k⟩ = 1 if j = k, and 0 otherwise.
Let {ψ_j} be an ONB. Then, for any x,
x = Σ_j ⟨x, ψ_j⟩ ψ_j and Σ_j ⟨x, ψ_j⟩² = Σ_i x_i² = ‖x‖².
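A quick numeric check of both identities in Python, using an arbitrary ONB from a QR factorization (the specific basis is immaterial):

    import numpy as np

    rng = np.random.default_rng(1)
    U, _ = np.linalg.qr(rng.standard_normal((8, 8)))   # columns form an ONB
    x = rng.standard_normal(8)
    c = U.T @ x                                        # c_j = <x, psi_j>
    assert np.allclose(U @ c, x)                       # x = sum_j <x, psi_j> psi_j
    assert np.isclose(np.sum(c ** 2), np.sum(x ** 2))  # Parseval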
E.g. (four rows, unnormalized):
+1 +1 +1 +1 +1 +1 +1 +1
−1 −1 −1 −1 +1 +1 +1 +1
−1 −1 +1 +1 −1 −1 +1 +1
−1 +1 −1 +1 −1 +1 −1 +1
Have vector x = U x̂, where x̂ is sparse. Process the stream by transforming Φ:
Φ x̂ = Φ(U^{−1} U) x̂ = (Φ U^{−1}) x.
Answer queries: recover the sparse x̂.
Alternatively, transform the updates...
See "add v to x_i". Want to simulate changes to x̂ = U^{−1} x. Regard "add v to x_i" as "add v·e_i to x", and decompose v·e_i into its Haar wavelet components:
v·e_i = Σ_j v ⟨e_i, ψ_j⟩ ψ_j.
Key: ⟨e_i, ψ_j⟩ = 0 unless i ∈ supp(ψ_j), so only O(log(d)) of the x̂_j's change.
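A sketch of this update rule for the Haar basis in Python; the sign convention on the two halves is a choice, and d is assumed to be a power of 2:

    import math

    def haar_updates(i, v, d):
        # "Add v to x_i" changes the overall-average coefficient plus one
        # wavelet coefficient per dyadic scale: O(log d) changes in all.
        deltas = {("const",): v / math.sqrt(d)}
        length = d
        while length >= 2:
            start = (i // length) * length            # dyadic interval holding i
            sign = -1.0 if i < start + length // 2 else 1.0
            deltas[(length, start)] = sign * v / math.sqrt(length)
            length //= 2
        return deltas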
Still see a stream of additive updates, "add v to x_i". Want a B-piece piecewise-constant representation, h, with
‖h − x‖ ≤ (1 + ε) ‖h_opt − x‖.
We optimize boundary positions and heights.
(Figure: example histogram of salary data, number of employees vs. salary.)
Key idea: Haar wavelets and histograms simulate each other efficiently.
Next, a class of algorithms with varying costs and guarantees:
Histograms simulate Haar wavelets:
✸ Each wavelet term is piecewise constant (≤ 3 breaks), so t terms have 3t breaks (3t + 1 pieces).
Haar wavelets simulate histograms:
✸ h = Σ_j ⟨h, ψ_j⟩ ψ_j.
✸ ⟨h, ψ_j⟩ = 0 unless supp(ψ_j) intersects a boundary of h.
✸ ≤ O(log(d)) such wavelets per boundary; ≤ O(B log(d)) terms for a B-bucket histogram.
‖w − x‖ ≤ (1 + ε) ‖h_opt − x‖.
Compared with optimal: O(log(d)) times more buckets and (1 + ε) times more error, an (O(log(d)), 1 + ε)-approximation. We can do better...
‖w − x‖ ≤ (1 + ε) ‖h_opt − x‖.
Get a (1, 3 + o(1))-approximation:
‖h − x‖ ≤ ‖h − w‖ + ‖w − x‖ ≤ ‖h_opt − w‖ + ‖w − x‖ ≤ ‖h_opt − x‖ + 2‖w − x‖ ≤ (3 + 2ε) ‖h_opt − x‖.
‖w − x‖ ≤ (1 + ε) ‖h_opt − x‖.
Get a (1, 1 + ε)-approximation. Next:
Assume exact estimation. (We've shown estimation error is dominated by other error.)
Have an O(B log(d) log(1/ε)/ε²)-term representation, w. Let B′ = 3B log(d) (the histogram-to-wavelet simulation expression). Consider w_{(B′)}, w_{(2B′)}, .... Let w_rob = w_{(jB′)} for the first j at which the next group of B′ terms makes little progress toward w:
"Take terms from the top until there is little progress."
Continued progress on w implies w_rob is very close to x: each accepted group has at least twice the energy of the remaining groups, so at least twice the energy of the next group.
Terms drop off exponentially. Thus
‖x − w_rob‖² = ‖x − w‖² ≤ ε² ‖w_{(B′..2B′)}‖² ≤ ε² ‖x − w_{(1..B′)}‖² ≤ ε²(1 + ε) ‖x − h_opt‖².
Need T = (1/ε²) log(d/ε²) repetitions, so that (1 − ε²)^T = ε²/d.
Final guarantee:
‖h − x‖ ≤ ‖h − w_rob‖ + ‖w_rob − x‖ ≤ ‖h_opt − w_rob‖ + ‖w_rob − x‖ ≤ ‖h_opt − x‖ + 2‖w_rob − x‖ ≤ (1 + 3ε) ‖h_opt − x‖.
Adjust ε, and we're done.
No progress on w implies no progress on x: little progress at w_{((j+1)B′..)} implies that the energy neglected by w_rob satisfies
≤ ε² ‖x_{((j+1)B′..)}‖² ≤ ε² ‖x − h_opt‖².
So the best linear combination, r, of w_rob and any B-bucket histogram isn't much better than w_rob.
(Figure: right-triangle picture relating x, h, h_opt, w_rob, and the projection r; w_rob ≈ r.)
Approximately: ‖h − r‖ ≤ ‖h_opt − r‖, so ‖h − x‖ ≤ ‖h_opt − x‖.
‖x − w_rob‖ and ‖w_rob − h_opt‖ are bounded:
‖x − w_rob‖ ≤ (1 + ε) ‖x − h_opt‖,
‖w_rob − h_opt‖ ≤ (3 + ε) ‖x − h_opt‖.
Also, ‖r − w_rob‖ ≤ ε ‖x − h_opt‖.
We have
‖h − x‖² = ‖h − r‖² + ‖r − x‖²
≤ (‖h − w_rob‖ + ‖w_rob − r‖)² + (‖x − w_rob‖ + ‖w_rob − r‖)²
≤ ‖h − w_rob‖² + ‖x − w_rob‖² + 2‖w_rob − r‖² + 2(‖h − w_rob‖ + ‖x − w_rob‖)·‖w_rob − r‖
≤ ‖h_opt − w_rob‖² + ‖x − w_rob‖² + 2‖w_rob − r‖² + 2(‖h_opt − w_rob‖ + ‖x − w_rob‖)·‖w_rob − r‖
≤ ‖h_opt − w_rob‖² + ‖x − w_rob‖² + 9·ε·‖x − h_opt‖²,
...and, similarly,
‖h_opt − x‖² = ‖h_opt − r′‖² + ‖r′ − x‖²
≥ (‖h_opt − w_rob‖ − ‖w_rob − r′‖)² + (‖x − w_rob‖ − ‖w_rob − r′‖)²
≥ ‖h_opt − w_rob‖² + ‖x − w_rob‖² + 2‖w_rob − r′‖² − 2‖h_opt − w_rob‖·‖w_rob − r′‖ − 2‖x − w_rob‖·‖w_rob − r′‖
≥ ‖h_opt − w_rob‖² + ‖x − w_rob‖² − 9·ε·‖x − h_opt‖².
So ‖h − x‖² − ‖h_opt − x‖² ≤ 18·ε·‖x − h_opt‖², i.e.,
‖h − x‖² ≤ (1 + 18ε) ‖h_opt − x‖².
Want the best B-bucket histogram for x. Use dynamic programming, based on the following recursion. Define Err[j, k] as the least error of a k-bucket histogram on [0, j), and Cost[ℓ, j) as the error of the best single bucket on [ℓ, j). So:
Err[j, k] = min_{ℓ<j} (Err[ℓ, k − 1] + Cost[ℓ, j)).
"k − 1 buckets on [0, ℓ) and one bucket on [ℓ, j). Take the best ℓ."
Runtime: j < d, k < B, ℓ < d; total O(d²B). Can construct the actual histogram (not just its error) as we go (keep the ℓ's that witness the minimization).
From x, construct the prefix sums Px: x_0, x_0 + x_1, x_0 + x_1 + x_2, .... Also P(x²). Can get Cost[ℓ, j) from ℓ and j in constant time: the best bucket height is the mean
µ = (1/(j−ℓ)) ((Px)_j − (Px)_ℓ),
and Σ_{ℓ≤i<j} (x_i − µ)² = Σ x_i² − 2µ Σ x_i + (j−ℓ)µ².
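The whole dynamic program fits in a short Python function; only the optimal error is returned here, though keeping the witnessing ℓ's would yield the histogram itself:

    def best_histogram_error(x, B):
        # O(d^2 B) DP; Cost[l, j) in O(1) via prefix sums of x and x^2, using
        # sum_{l<=i<j} (x_i - mu)^2 = sum x_i^2 - (sum x_i)^2 / (j - l).
        d = len(x)
        P, P2 = [0.0], [0.0]
        for v in x:
            P.append(P[-1] + v)
            P2.append(P2[-1] + v * v)

        def cost(l, j):                    # best single bucket on [l, j)
            s = P[j] - P[l]
            return (P2[j] - P2[l]) - s * s / (j - l)

        INF = float("inf")
        err = [[INF] * (B + 1) for _ in range(d + 1)]
        err[0][0] = 0.0
        for k in range(1, B + 1):
            for j in range(1, d + 1):
                err[j][k] = min(err[l][k - 1] + cost(l, j) for l in range(j))
        return err[d][B]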
Want best B-bucket histogram h to wrob. wlog, boundaries of h are among boundaries of wrob. Dynamic programming takes time O(|wrob|2 · B), where |wrob| is the number of boundaries in wrob.
Approximation is constant on rectangles. Hierarchical (recursively split an existing rectangle) or general.
Theorem: Any B-bucket (general) partition can be refined into a (4B)-bucket hierarchical partition. (Proof omitted; not needed for the algorithm.)
Aim: a (1, 1 + ε)-approximate hierarchical histogram, which is a (4, 1 + ε)-approximate general histogram.
(Figure: an example partition of a square into numbered rectangles.)
Same overall strategy as 1-D: histograms simulate and are simulated by wavelets. Next:
Need an ONB that simulates and is simulated by 1-bucket histograms. Generally: (α ⊗ β)(x, y) = α(x)β(y). Use the tensor product of Haar wavelets: ψ_{j,k}(x, y) = ψ_j(x)·ψ_k(y). The tensor product of ONBs is an ONB.
An update to x leads to updates to O(log²(d)) tensor products of Haar wavelets. (The algorithm is exponential in the dimension, here 2.)
Want the best hierarchical h for w_rob. Boundaries of h can be taken from the boundaries of w_rob. The best j-cut hierarchical h has a top-level cut plus best sub-partitions of the two sides; recurse by dynamic programming.
Runtime: polynomial in the number of boundaries of w_rob and the desired number of buckets.
Want the best B-bucket piecewise-linear approximation to x. Same overall strategy. Next:
(Figure: the local shapes, constant, slope, and "vee".)
E.g. (six rows, unnormalized):
+1 +1 +1 +1 +1 +1 +1 +1
−7 −5 −3 −1 +1 +3 +5 +7
+3 +1 −1 −3 −3 −1 +1 +3
+7 −1 −9 −17 +17 +9 +1 −7
+1 −1 −1 +1 −1 +3 −3 +1
+1 −1 −1 +1 −1 +3 −3 +1
These wavelets and piecewise-linear representations simulate each other with O(log(d))-factor blowup.
Have w_rob (a piecewise-linear representation with B′ ≈ B·log(d)/ε pieces). Want the best B-bucket piecewise-linear representation h for w_rob. Recall the best 1-bucket representation to x is ⟨x, ψ⟩ψ + ⟨x, φ⟩φ, where ψ is constant and φ is slant. Need
‖x − (a·ψ + b·φ)‖² = ⟨x − (a·ψ + b·φ), x − (a·ψ + b·φ)⟩.
Also need P(x²).
Define Far[k, m] as the biggest j such that there's a k-bucket histogram on [0, j) with error at most m (in appropriate units). Assume we know E with (1/2)E ≤ E_opt ≤ E.
Consider m = 0, εE/B, 2εE/B, ..., 2E. (B/ε possibilities for m; coarse granularity leads to εE/B extra error per boundary, εE in all.) Thus:
Far[k, m] = max_n {j : n + Cost[Far[k − 1, n], j) < m}.
"Go as far as we can with k − 1 buckets and error n, then add one more bucket."
Runtime: k < B, m < B/ε, n < B/ε, find j by binary search: O(B³ log(d)/ε²).
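A sketch of the table in Python, under the stated assumption (1/2)E ≤ E_opt ≤ E; a linear scan stands in for the binary search over j:

    def far_table(x, B, eps, E):
        # Far[k][m]: furthest j such that some k-bucket histogram on [0, j)
        # has error at most m units, where one unit = eps * E / B.
        d = len(x)
        P, P2 = [0.0], [0.0]
        for v in x:
            P.append(P[-1] + v)
            P2.append(P2[-1] + v * v)

        def cost(l, j):                    # best single bucket on [l, j)
            s = P[j] - P[l]
            return (P2[j] - P2[l]) - s * s / (j - l) if j > l else 0.0

        unit = eps * E / B
        M = int(2 * B / eps)               # error budgets 0 .. 2E
        far = [[0] * (M + 1) for _ in range(B + 1)]
        for k in range(1, B + 1):
            for m in range(M + 1):
                best = 0
                for n in range(m + 1):     # budget spent on first k-1 buckets
                    l = far[k - 1][n]
                    j = l
                    while j < d and cost(l, j + 1) <= (m - n) * unit:
                        j += 1             # cost is monotone in j
                    best = max(best, j)
                far[k][m] = best
        return far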
Given x, want a piecewise-constant h to optimize range queries to x:
minimize Σ_{ℓ,r} (Σ_{ℓ≤i<r} (h_i − x_i))².
The height of a bucket affects many non-local queries; this foils the previous tricks. Instead, transform to the prefix domain.
Σ_{ℓ,r} (Σ_{ℓ≤i<r} (h_i − x_i))²
= Σ_{ℓ,r} ((P(h − x))_r − (P(h − x))_ℓ)²
= Σ_{ℓ,r} ((P(h − x))_r² + (P(h − x))_ℓ² − 2 (P(h − x))_r (P(h − x))_ℓ)
= 2d Σ_ℓ ((Ph)_ℓ − (Px)_ℓ)²   (we'll make Σ_ℓ P(h − x)_ℓ = 0)
= 2d ‖Ph − Px‖².
Get a point-query problem.
If h is piecewise-constant, then Ph is piecewise-linear and connected. We do not know how to find a near-best piecewise-linear connected approximation to a given Px (equivalent to the original problem). Instead, find a near-best B-bucket piecewise-linear (disconnected) approximation to Px under point queries. This leads to a (2B)-bucket piecewise-constant representation for range queries to x.
When reading x, simulate reading Px: an update to x_i affects every prefix whose range includes i (e.g., an update to x_3 affects each prefix that includes 3). From Ph, recover h_i = (∆(Ph))_i = (Ph)_{i+1} − (Ph)_i.
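The prefix/difference bookkeeping as a small Python sketch; the leading 0 entry is a convention chosen so that h_i = (Ph)_{i+1} − (Ph)_i:

    def prefix(x):
        # Px with (Px)_0 = 0, so x_i = (Px)_{i+1} - (Px)_i.
        out = [0.0]
        for v in x:
            out.append(out[-1] + v)
        return out

    def update_in_prefix_domain(Px, i, v):
        # "Add v to x_i" becomes a range update: every prefix past i grows by v.
        for j in range(i + 1, len(Px)):
            Px[j] += v

    def recover(Ph):
        # h_i = (Delta(Ph))_i = (Ph)_{i+1} - (Ph)_i
        return [Ph[i + 1] - Ph[i] for i in range(len(Ph) - 1)]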
Answer point queries to Px. (The centering Σ_ℓ P(h − x)_ℓ = 0 holds automatically by optimality.) Final guarantee: at most (1 + ε) times the best error under range queries.