 
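Digression: Improving Variance
So assume $E[X] = E[Y] = 0$. Then
$$\operatorname{var}(Y) = E[Y^2] = E\Big[\Big(\frac{1}{m}\sum_i X_i\Big)^2\Big] = \frac{1}{m^2}\sum_{i,j} E[X_i X_j] = \frac{1}{m^2}\sum_i E[X_i^2] = \frac{1}{m} E[X^2],$$
where the cross terms $E[X_i X_j]$, $i \ne j$, vanish using independence.
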
Better failure probability
Theorem: Suppose $\Pr(Y \text{ is bad}) < 1/9$. Let $Z$ be the median of $\ell$ independent copies of $Y$. Then $\Pr(Z \text{ is bad}) < 2^{-\Omega(\ell)}$.
Proof: $Z$ is bad only if at least half of the $Y$'s are bad. Apply Chernoff.

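A minimal simulation sketch of the median trick. The estimator `noisy_estimate` is hypothetical (it is bad with probability 1/10, which is below 1/9), and all names here are illustrative assumptions rather than anything from the slides.

```python
import random
import statistics

def noisy_estimate(truth, rng, p_bad=0.1):
    """Hypothetical estimator: exact, except wildly off with probability p_bad."""
    return truth + 1000.0 if rng.random() < p_bad else truth

def median_of_copies(truth, copies, rng):
    """Median of independent copies; bad only if at least half the copies are bad."""
    return statistics.median(noisy_estimate(truth, rng) for _ in range(copies))

def failure_rate(copies, trials=2000, seed=0):
    """Empirical fraction of trials in which the median is bad."""
    rng = random.Random(seed)
    bad = sum(median_of_copies(5.0, copies, rng) != 5.0 for _ in range(trials))
    return bad / trials
```

Comparing `failure_rate(1)` with `failure_rate(15)` shows the failure probability dropping sharply as the number of copies grows.
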
Digression: Chernoff Bounds
Theorem: Suppose each of $n$ $Y_i$'s is independent with
$$Y_i = \begin{cases} 1 - p, & \text{with probability } p; \\ -p, & \text{with probability } 1 - p. \end{cases}$$
Let $Y = \sum_i Y_i$. If $a > 0$, then $\Pr(Y > a) < e^{-2a^2/n}$.

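Chernoff: Proof
(Just for $p = 1/2$, so $Y_i$ is $\pm 1/2$, uniformly.)
Lemma: For $\lambda > 0$, $\frac{e^{\lambda} + e^{-\lambda}}{2} < e^{\lambda^2/2}$. (Proof: Taylor.)
$$E\big[e^{2\lambda \sum_i Y_i}\big] = \prod_i E\big[e^{2\lambda Y_i}\big] = \Big(\frac{e^{\lambda} + e^{-\lambda}}{2}\Big)^n < e^{\lambda^2 n/2}.$$
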
Chernoff, cont'd
$$\Pr(Y > a) = \Pr\big(e^{2\lambda Y} > e^{2\lambda a}\big) \le \frac{E[e^{2\lambda Y}]}{e^{2\lambda a}} \le e^{\lambda^2 n/2 - 2\lambda a}.$$
Put $\lambda = 2a/n$; get $\Pr(Y > a) < e^{-2a^2/n}$.

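As a sanity check on the bound for $p = 1/2$, the exact tail can be compared against $e^{-2a^2/n}$. This is an illustrative sketch; the function names are assumptions, not part of the slides.

```python
import math

def exact_tail(n, a):
    """Pr(Y > a) where Y is a sum of n independent uniform +-1/2 variables."""
    # Y = H - n/2 with H ~ Binomial(n, 1/2), so Y > a iff H - n/2 > a.
    return sum(math.comb(n, h) for h in range(n + 1) if h - n / 2 > a) / 2 ** n

def chernoff_bound(n, a):
    """The bound e^{-2 a^2 / n} from the theorem."""
    return math.exp(-2 * a * a / n)

def lemma_holds(lam):
    """Check (e^lam + e^-lam)/2 < e^{lam^2/2} for one value lam > 0."""
    return math.cosh(lam) < math.exp(lam * lam / 2)
```

For example, `exact_tail(10, 1)` is 176/1024, well under `chernoff_bound(10, 1)`.
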
To this point
Find all $i$ such that $x_i^2 > \frac{1}{k}\sum_{j \ne i} x_j^2$, with failure probability $2^{-\ell}$.
• Need $\mathrm{poly}(k, \ell)$ rows in the matrix $B \otimes_r S \otimes_r R$; comparable runtimes.
Estimate each $x_i$ up to $\pm\epsilon\|x\|$ with failure probability $2^{-\ell}$.
• Need $\mathrm{poly}(\ell/\epsilon)$ rows; comparable runtimes.

Space
To this point, fully random matrices.
• Expensive to store!
But...
• Need only pairwise independence within each row
• (sometimes need full independence from row to row, but this is usually ok).
• i.e., two entries $r_j$ and $r_\ell$ in the same row need to be independent, but three entries may be dependent.
• This can cut down on needed space.

Pairwise Independence: Construction
Random vector $s$ in $\{\pm 1\}^d$ (equivalently, $\mathbb{Z}_2^d$).
Index $i$ is a 0/1 vector of length $\log(d)$, i.e., $i \in \mathbb{Z}_2^{\log(d)}$.
Pick a random vector $q \in \mathbb{Z}_2^{\log(d)}$ and a random bit $c \in \mathbb{Z}_2$.
Define $s_i = c + \langle q, i\rangle \pmod 2$.
Then, if $i \ne j$, $(s_i, s_j)$ takes all four possibilities with equal probability.

Pairwise Independence: Proof
$s_i$ is uniform because $c$ is random.
Conditioned on $s_i$, $s_j$ is uniform:
• Sufficient to show that $s_i + s_j$ is uniform.
• $s_i + s_j = (c + \langle q, i\rangle) + (c + \langle q, j\rangle) = \langle q, i + j\rangle$
• $i \ne j$, so they differ on some bit, the $\ell$'th.
• As $q_\ell$ varies, $s_i + s_j$ varies uniformly over $\mathbb{Z}_2$.

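The claim can be checked exhaustively for small $d$. The sketch below enumerates every choice of $(q, c)$ and counts the resulting values of $(s_i, s_j)$; the helper names are illustrative assumptions.

```python
from itertools import product

def sign_bit(q, c, i):
    """s_i = c + <q, i> (mod 2), with q and i given as tuples of bits."""
    return (c + sum(qb * ib for qb, ib in zip(q, i))) % 2

def pair_counts(i, j, nbits):
    """Count how often (s_i, s_j) takes each value over all choices of (q, c)."""
    counts = {}
    for q in product((0, 1), repeat=nbits):
        for c in (0, 1):
            key = (sign_bit(q, c, i), sign_bit(q, c, j))
            counts[key] = counts.get(key, 0) + 1
    return counts
```

With `nbits = 3` there are 16 choices of $(q, c)$, so each of the four outcomes should appear 4 times whenever $i \ne j$.
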
Pairwise independence, for $r$
Hashing into one of $k$ buckets.
Take $\log(k)$ independent hashes into two buckets.
Get bucket label bit-by-bit.

Space, again
For each row $s$, need only store $q$ and $c$: $\log(d) + 1$ bits.
For each row $r$, need only $\log(k)$ copies of $q$ and $c$: $O(\log(d)\log(k))$ bits.
(Many other constructions are possible.)

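A sketch of the bit-by-bit bucket hash, storing only $\log(k)$ pairs $(q, c)$. Indices and $q$ are packed into Python integers, and the function names are assumptions made for illustration.

```python
import random

def random_bit_hash(nbits, rng):
    """Draw (q, c): one pairwise-independent hash of nbits-bit indices into two buckets."""
    return rng.getrandbits(nbits), rng.getrandbits(1)

def bit_of(h, i):
    """c + <q, i> (mod 2); <q, i> mod 2 is the parity of the bitwise AND."""
    q, c = h
    return (c + bin(q & i).count("1")) % 2

def bucket(hashes, i):
    """Bucket label in [0, k), k = 2**len(hashes), assembled one bit per hash."""
    label = 0
    for h in hashes:
        label = (label << 1) | bit_of(h, i)
    return label
```

For example, three hashes drawn with `random_bit_hash(4, random.Random(1))` map each index in `range(16)` to one of 8 buckets.
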
All Together: Heavy Hitters
• Find all $i$ such that $x_i^2 > (1/k)\sum_{j \ne i} x_j^2$, with failure probability $2^{-\ell}$.
• Estimate each $x_i$ up to $\pm\epsilon\|x\|$ with failure probability $2^{-\ell}$.
• Space, time per item, and query time are $\mathrm{poly}(k, \ell, \log(d), 1/\epsilon)$.

Sparse Recovery
Next topic: Sparse Recovery.
Fix $k$ and $\epsilon$. Want $\tilde{x}$ such that
$$\|\tilde{x} - x\|_2 \le (1 + \epsilon)\,\|x_{(k)} - x\|_2.$$
Here $x_{(k)}$ is the best $k$-term approximation to $x$.
Will build on heavy hitters.

Sparse Recovery: Issue
Suppose $k = 10$ and coefficient magnitudes are $1, 1/2, 1/4, 1/8, 1/16, \ldots$
Want to find top $k$ terms in time $\mathrm{poly}(k)$, not time $2^k$.
Heavy Hitters algorithm only guarantees that we find and estimate well terms with magnitude around $1/k$: about $\log(k)$ terms.

Weak Greedy Algorithm
• Find indices of heavy terms in $x$
• Estimate their coefs, getting intermediate rep'n $r$.
  – iterative subroutine here
• Recurse on $x - r$.

Weak Greedy Algorithm
After removing top few terms, others become relatively larger.
Can get sketch $\Phi(x - r)$ as $\Phi x - \Phi r$.
At this point, $\tilde{x}$ may have more than $k$ terms (to be fixed).
Weak greedy: may not find the heaviest term.

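The step "get $\Phi(x - r)$ as $\Phi x - \Phi r$" relies only on linearity of the sketch. A minimal illustration, assuming a hypothetical random $\pm 1$ matrix for $\Phi$ (the slides' actual matrix is structured differently):

```python
import random

def make_sketch_matrix(rows, d, seed=0):
    """Illustrative random +-1 sketch matrix with the given number of rows."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(rows)]

def sketch(phi, x):
    """Compute the matrix-vector product phi x."""
    return [sum(p * xi for p, xi in zip(row, x)) for row in phi]

def residual_sketch(phi, sketch_x, r):
    """Sketch of x - r, computed from the stored sketch of x without access to x."""
    return [sx - sr for sx, sr in zip(sketch_x, sketch(phi, r))]
```

Only the stored sketch of $x$ and the current representation $r$ are needed to recurse on the residual.
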
Iterative Estimation
Have: a set $I$ of $k$ indices, parameter $\epsilon$
Want: coefficient estimates so that the resulting approximation $\tilde{x}$ satisfies $\|\tilde{x} - x\| \le (1 + \epsilon)\|x - x_I\|$.
Define
• $I^c$ to be the complement of $I$.
• $E_I = \sum_{i \in I} |x_i|^2$ to be the original energy in $I$.
• $\tilde{E}_I = \sum_{i \in I} |x_i - \tilde{x}_i|^2$ to be the energy in $I$ after one round of estimation.
• $\Delta = E_I / E_{I^c}$ to be the dynamic range.

Iterative Estimation: Algorithm
Have: a set $I$ of $k$ indices, parameter $\epsilon$
Want: coefficient estimates so that the resulting approximation $\tilde{x}$ satisfies $\|\tilde{x} - x\| \le (1 + \epsilon)\|x - x_I\|$.
Repeat $\log(\Delta/\epsilon)$ times
1. estimate each $x_i$ for $i \in I$, by $\tilde{x}_i$ with $|\tilde{x}_i - x_i|^2 < \frac{\epsilon}{2k(1+\epsilon)}\,\tilde{E}_{i^c}$.
2. update $x$.

Iterative Estimation: Proof
Get: $\tilde{E}_I \le \frac{\epsilon}{2(1+\epsilon)}(E_I + E_{I^c})$.
Case $E_I > \epsilon \cdot E_{I^c}$:
$$\tilde{E}_I \le \frac{\epsilon}{2(1+\epsilon)}(E_I + E_{I^c}) \le \frac{\epsilon}{2(1+\epsilon)}E_I + \frac{1}{2(1+\epsilon)}E_I = \frac{1}{2}E_I.$$
Geometric improvement. Get down to $\epsilon E_{I^c}$ if this case holds for all iterations.

Iterative Estimation: Proof
Case $E_I \le \epsilon \cdot E_{I^c}$:
$$\tilde{E}_I \le \frac{\epsilon}{2(1+\epsilon)}(E_I + E_{I^c}) \le \frac{\epsilon}{2}E_{I^c}.$$
$E_I$ fluctuates only in the range $0$ to $\frac{\epsilon}{2}E_{I^c}$ after dropping below $\epsilon E_{I^c}$.

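The two cases can be illustrated numerically by iterating the worst case of the per-round bound $\tilde{E}_I \le \frac{\epsilon}{2(1+\epsilon)}(E_I + E_{I^c})$. This is only a sketch of the recurrence, not of the estimation procedure itself; the function name and the use of base-2 logarithms are assumptions.

```python
import math

def worst_case_rounds(e_in, e_out, eps):
    """Iterate E_I <- eps/(2(1+eps)) * (E_I + E_Ic) for ceil(log2(Delta/eps)) rounds.

    Returns the list of E_I values, starting with the original energy e_in.
    """
    delta = e_in / e_out                      # dynamic range Delta = E_I / E_Ic
    rounds = max(1, math.ceil(math.log2(delta / eps)))
    history = [e_in]
    for _ in range(rounds):
        e_in = eps / (2 * (1 + eps)) * (e_in + e_out)
        history.append(e_in)
    return history
```

With `worst_case_rounds(1000.0, 1.0, 0.1)`, the energy at least halves each round while it exceeds $\epsilon E_{I^c}$ and then stays below $\epsilon E_{I^c}$.
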
Iterative Identification
Similar to estimation
Repeat $\log(\Delta/\epsilon)$ times
1. Identify indices $i$ with $|x_i|^2 > \frac{\epsilon}{4k(1+\epsilon)}\,\tilde{E}_{i^c}$.
2. Estimate each $x_i$, for $i \in I$, by $\tilde{x}_i$ with $\tilde{E}_I \le E_{I^c}$.
3. update $x$.
Final estimation:
• $\tilde{E}_I \le \frac{\epsilon}{3}E_{I^c}$.

Iterative Identification: Proof
First: Estimation errors do not substantially affect Identification.
Issue:
• Have a set $I$ of indices for intermediate $r$.
• We'll identify positions in $x - r$.
• Values in $(x - r)_I$ are based on estimates and may be larger than $x_I$
• ...contribute extra noise; obstacle to identification.
Identify $i$ if $|x_i|^2$ is large compared with $\tilde{E}_{i^c}$, so get $i$ if $|x_i|^2$ is large compared with $E > (1 - \epsilon)\tilde{E} > (1 - \epsilon)\tilde{E}_{i^c}$.

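Iterative Identification: Proof
Among top $k$, miss a total of at most
$$E_{K \setminus I} \le \frac{\epsilon}{2(1+\epsilon)}E = \frac{\epsilon}{2(1+\epsilon)}(E_K + E_{K^c}).$$
Case $E_K > \epsilon E_{K^c}$:
$$E_{K \setminus I} \le \frac{\epsilon}{2(1+\epsilon)}(E_K + E_{K^c}) < \frac{\epsilon}{2(1+\epsilon)}E_K + \frac{1}{2(1+\epsilon)}E_K = \frac{1}{2}E_K.$$
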
Iterative Identification: Proof
Case $E_K \le \epsilon E_{K^c}$:
$$E_{K \setminus I} \le \frac{\epsilon}{2(1+\epsilon)}(E_K + E_{K^c}) \le \frac{\epsilon}{2}E_{K^c}.$$
Either case, identify enough.

Iterative Identification: Proof
Three sources of error:
1. outside top $k$: excusable.
2. inside top $k$, but not found: small compared with excusable.
3. found, and estimated incorrectly: small compared with excusable.

Exactly $k$ Terms Output
Algorithm:
1. Get $\tilde{x}$ with $\|\tilde{x} - x\|_2 \le (1 + \epsilon)\|x_{(k)} - x\|_2$.
2. Estimate each $x_i$ by $\tilde{x}_i$ with $|x_i - \tilde{x}_i|^2 \le \frac{\epsilon^2}{k}E_{K^c}$.
3. Output top $k$ terms of $\tilde{x}$, i.e., $\tilde{x}_{(k)}$.

Exactly $k$ Terms Output: Proof
Sources of error:
1. Terms in $K \setminus I$ (small; already shown)
2. Error in terms we do take (small; already shown)
3. Error from mis-ranking
• if $k + 1$ terms are about equally good, we won't know for sure which are the $k$ biggest.

Exactly $k$ Terms Output: Misranking
Idea: only displace one term for another if their magnitudes are close.
Some care needed to keep quadratic dependence on $\epsilon$.
Let $y$ be a vector of terms in the top $k$ that are displaced by an equal number of terms not in the top $k$, the vector $z$.
Both $y$ and $z$ have length at most $k$. $y_i$ is displaced by $z_i$.
Assume we have found and estimated all terms in $y$ (else don't care; these terms are small).

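Exactly $k$ Terms Output: Proof
By the triangle inequality,
$$|y_i| \le |\tilde{y}_i| + |y_i - \tilde{y}_i|, \qquad |z_i| \ge |\tilde{z}_i| - |z_i - \tilde{z}_i|.$$
Thus, since $z_i$ displaced $y_i$ (so $|\tilde{y}_i| \le |\tilde{z}_i|$),
$$|y_i| - |z_i| \le |\tilde{y}_i| - |\tilde{z}_i| + |y_i - \tilde{y}_i| + |z_i - \tilde{z}_i| \le |y_i - \tilde{y}_i| + |z_i - \tilde{z}_i| \le 2\epsilon\sqrt{E_{K^c}/k}.$$
Thus $\big\||y| - |z|\big\| \le 2\epsilon\sqrt{E_{K^c}}$.
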
Exactly $k$ Terms Output: Proof
Continuing...
$$\big\||z|\big\| = \|z\| \le \sqrt{E_{K^c}}, \qquad \big\||y|\big\| = \|y\| \le \|z\| + \big\||y| - |z|\big\|,$$
so
$$\big\||y| + |z|\big\| \le 2\|z\| + \big\||y| - |z|\big\| \le 2\sqrt{E_{K^c}} + 2\epsilon\sqrt{E_{K^c}} \le 3\sqrt{E_{K^c}},$$

Exactly $k$ Terms Output: Proof
so, finally,
$$\|y\|^2 - \|z\|^2 = \big\||y|\big\|^2 - \big\||z|\big\|^2 = \big\langle |y| + |z|,\ |y| - |z| \big\rangle \le \big\||y| + |z|\big\| \cdot \big\||y| - |z|\big\| \le 3\sqrt{E_{K^c}} \cdot 2\epsilon\sqrt{E_{K^c}} \le 6\epsilon E_{K^c}.$$

Overview of Summaries
• Heavy Hitters
• Weak greedy sparse recovery
• Orthonormal change of basis
• Haar Wavelets
• Histograms (piecewise constant)
• Multi-dimensional (hierarchical)
• Piecewise-linear
• Range queries

Finding Other Heavy Things
E.g., Fourier coefficients.
Important by themselves
Useful toward other kinds of summaries

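Orthonormal bases
The columns $\psi_j$ of $U$ form an orthonormal basis (ONB) if they are mutually perpendicular and have unit Euclidean length. Thus
$$\langle \psi_j, \psi_k \rangle = \begin{cases} 1, & j = k \\ 0, & \text{otherwise.} \end{cases}$$
E.g.:
• Fourier basis
• Haar wavelet basis
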
Decompositions and Parseval
Let $\{\psi_j\}$ be an ONB. Then, for any $x$,
$$x = \sum_j \langle x, \psi_j \rangle \psi_j$$
and
$$\sum_j \langle x, \psi_j \rangle^2 = \sum_i x_i^2.$$

Haar Wavelets, Graphically
[Figure: graphs of the Haar wavelets.]
E.g., for $d = 8$ (one wavelet per row, shown up to normalization):
+1 +1 +1 +1 +1 +1 +1 +1
−1 −1 −1 −1 +1 +1 +1 +1
−1 −1 +1 +1  0  0  0  0
 0  0  0  0 −1 −1 +1 +1
−1 +1  0  0  0  0  0  0
 0  0 −1 +1  0  0  0  0
 0  0  0  0 −1 +1  0  0
 0  0  0  0  0  0 −1 +1

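A small sketch that builds the normalized Haar basis in the same row order and sign convention as the $d = 8$ example, so that orthonormality and Parseval can be checked numerically. The function names are illustrative assumptions.

```python
import math

def haar_basis(d):
    """Orthonormal Haar basis of R^d (d a power of two), as a list of d vectors."""
    basis = [[1 / math.sqrt(d)] * d]          # constant vector
    size = d
    while size >= 2:
        half = size // 2
        for start in range(0, d, size):       # one wavelet per dyadic block
            psi = [0.0] * d
            for t in range(half):
                psi[start + t] = -1 / math.sqrt(size)
                psi[start + half + t] = 1 / math.sqrt(size)
            basis.append(psi)
        size = half
    return basis

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))
```

For any vector `x` of length 8, the squared coefficients `dot(x, psi) ** 2` summed over `haar_basis(8)` should match the sum of the `x[i] ** 2` up to rounding.
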
Heavy Hitters under Orthonormal Change of Basis
Have vector $x = U\hat{x}$, where $\hat{x}$ is sparse.
Process stream by transforming $\Phi$:
• Collect $\Phi\hat{x} = \Phi(U^{-1}U)\hat{x} = (\Phi U^{-1})x$.
Answer queries:
• Recover heavy hitters in $\hat{x}$
• Implicitly recover heavy $U$-coefficients of $x$.
Alternatively, transform updates...

Haar Wavelets: per-Item Time
See "add $v$ to $x_i$".
Want to simulate changes to $\hat{x} = U^{-1}x$.
Regard "add $v$ to $x_i$" as "add $v e_i$ to $x$".
Decompose $v e_i$ into its Haar wavelet components,
$$v e_i = v \sum_j \langle e_i, \psi_j \rangle \psi_j.$$
Key: $\langle e_i, \psi_j \rangle = 0$ unless $i \in \mathrm{supp}(\psi_j)$.
• Just $O(\log(d))$ such $j$'s, so only $O(\log(d))$ of the $\hat{x}_j$'s change.

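The per-item update can be sketched directly: for each dyadic block size, exactly one wavelet has $i$ in its support. The key format `(size, start)` and the sign convention (negative on the first half of the support) are assumptions made for this illustration.

```python
import math

def haar_update(d, i, v):
    """Changes to the Haar coefficients caused by "add v to x_i" (d a power of two).

    Returns {key: delta}. Key ("const", 0) is the constant vector; key
    (size, start) is the wavelet supported on [start, start + size).
    """
    deltas = {("const", 0): v / math.sqrt(d)}
    size = d
    while size >= 2:
        start = (i // size) * size            # the one block of this size containing i
        sign = -1 if i < start + size // 2 else 1
        deltas[(size, start)] = sign * v / math.sqrt(size)
        size //= 2
    return deltas
```

For $d = 1024$ this touches 11 coefficients per update, i.e. $\log_2(d) + 1$, rather than all 1024.
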
Histograms
Still see stream of additive updates: "add $v$ to $x_i$"
Want $B$-piece piecewise-constant representation, $h$, with
$$\|h - x\| \le (1 + \epsilon)\|h_{\mathrm{opt}} - x\|.$$
We optimize boundary positions and heights.

[Figure: example histogram, with axes labeled "Number of employees" and "Salary".]

Histograms: Algorithm Overview
Key idea: Haar wavelets and histograms simulate each other efficiently.
• $t$-term wavelet is $O(t)$-bucket histogram
• $B$-bucket histogram is $O(B\log(d))$-term wavelet rep'n
Next, class of algorithms with varying costs and guarantees:
• Get good Haar representation
• Modify it into a histogram

Simulation
Histograms simulate Haar wavelets:
• Each Haar wavelet is piecewise constant with 4 pieces (3 breaks), so $t$ terms have $3t$ breaks ($3t + 1$ pieces).
Haar wavelets simulate histograms:
• If $h$ is a $B$-bucket histogram and $\psi_j$'s are wavelets, then
  ✸ $h = \sum_j \langle h, \psi_j \rangle \psi_j$.
  ✸ $\langle h, \psi_j \rangle = 0$ unless $\mathrm{supp}(\psi_j)$ intersects a boundary of $h$.
  ✸ At most $O(\log(d))$ such wavelets per boundary, so at most $O(B\log(d))$ terms in a $B$-bucket histogram.

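The second direction can be seen numerically: a histogram with few buckets has few nonzero Haar coefficients. The sketch below uses a standard pairwise average/difference form of the orthonormal Haar transform; the function names and the example histogram are assumptions for illustration.

```python
import math

def haar_coeffs(x):
    """Orthonormal Haar transform of x (length a power of two)."""
    coeffs = []
    cur = list(x)
    s = math.sqrt(2)
    while len(cur) > 1:
        half = len(cur) // 2
        details = [(cur[2 * t + 1] - cur[2 * t]) / s for t in range(half)]
        cur = [(cur[2 * t] + cur[2 * t + 1]) / s for t in range(half)]
        coeffs = details + coeffs             # coarser levels go in front
    return cur + coeffs

def count_nonzero(coeffs, tol=1e-12):
    """Number of coefficients that are not (numerically) zero."""
    return sum(1 for c in coeffs if abs(c) > tol)
```

For a 3-bucket histogram on $d = 16$ points, at most $1 + 2\log_2(16) = 9$ of the 16 coefficients can be nonzero, since each of the two interior boundaries meets one wavelet per level.
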
Algorithm 1
1. Get $O(B\log(d))$-term wavelet rep'n $w$ with $\|w - x\| \le (1 + \epsilon)\|h_{\mathrm{opt}} - x\|$.
2. Return $w$ as an $O(B\log(d))$-bucket histogram.
Compared with optimal, $O(\log(d))$ times more buckets and $(1 + \epsilon)$ times more error: an $(O(\log(d)), 1 + \epsilon)$-approximation.
We can do better...

Algorithm 2
1. Get $O(B\log(d))$-term wavelet rep'n $w$ with $\|w - x\| \le (1 + \epsilon)\|h_{\mathrm{opt}} - x\|$.
2. Return best $B$-bucket histogram $h$ to $w$. (How? soon.)
Get a $(1, 3 + o(1))$-approximation:
$$\|h - x\| \le \|h - w\| + \|w - x\| \le \|h_{\mathrm{opt}} - w\| + \|w - x\| \le \|h_{\mathrm{opt}} - x\| + 2\|w - x\| \le (3 + 2\epsilon)\|h_{\mathrm{opt}} - x\|.$$

Algorithm 3
1. Get $O(B\log(d)\log(1/\epsilon)/\epsilon^2)$-term wavelet rep'n $w$ with $\|w - x\| \le (1 + \epsilon)\|h_{\mathrm{opt}} - x\|$.
2. Possibly discard some terms, getting a robust $w_{\mathrm{rob}}$.
3. Output best $B$-bucket histogram $h$ to $w_{\mathrm{rob}}$.
Get a $(1, 1 + \epsilon)$-approximation.
Next:
• What is "robust?"
• Proof of correctness.
• How to find $h$ from $w_{\mathrm{rob}}$.

Robust Representations
Assume exact estimation. (We've shown estimation error is dominated by other error.)
Have $O(B\log(d)\log(1/\epsilon)/\epsilon^2)$-term rep'n, $w$.
Let $B' = 3B\log(d)$ (from the histogram-to-wavelet simulation).
Consider $w_{(B')}, w_{(2B')}, \ldots$ Let $w_{\mathrm{rob}}$ be
$$w_{\mathrm{rob}} = \begin{cases} w_{(jB')}, & \big\|w_{(jB'..(j+1)B')}\big\|^2 \le \epsilon^2 \big\|w_{((j+1)B'..)}\big\|^2 \\ w, & \text{otherwise.} \end{cases}$$
"Take terms from top until there is little progress."

Robust Representation, Continued Progress
Continued progress on $w$ implies very close to $x$.
$\big\|w_{(jB'..(j+1)B')}\big\|^2$ drops exponentially in $j$:
1. Group terms, $2/\epsilon^2$ per group.
2. Each group has twice the energy of the remaining terms, i.e., twice the energy of the remaining groups, so at least twice the energy of the next group.

Robust Representation, Continued Progress
Terms drop off exponentially. Thus
$$\|x - w_{\mathrm{rob}}\|^2 = \|x - w\|^2 \le d\,\big\|w_{(\mathrm{last})}\big\|^2 \le \epsilon^2 \big\|w_{(B'..2B')}\big\|^2 \le \epsilon^2 \big\|x - w_{(1..B')}\big\|^2 \le \epsilon^2(1 + \epsilon)\|x - h_{\mathrm{opt}}\|^2.$$
Need $T = (1/\epsilon)^2\log(d/\epsilon^2)$ repetitions, so $(1 - \epsilon^2)^T \approx \epsilon^2/d$.

Robust Representation, Continued Progress
Note: $\big\|x - w_{(B')}\big\| \le (1 + \epsilon)\|x - h_{\mathrm{opt}}\|$, i.e., $w_{(B')}$ is accurate enough. (It has too many terms.)
Final guarantee:
$$\|h - x\| \le \|h - w_{\mathrm{rob}}\| + \|w_{\mathrm{rob}} - x\| \le \|h_{\mathrm{opt}} - w_{\mathrm{rob}}\| + \|w_{\mathrm{rob}} - x\| \le \|h_{\mathrm{opt}} - x\| + 2\|w_{\mathrm{rob}} - x\| \le (1 + 3\epsilon)\|h_{\mathrm{opt}} - x\|.$$
Adjust $\epsilon$, and we're done.

Robust Representation, No Progress
No progress on $w$ implies no progress on $x$:
$$\big\|w_{(jB'..(j+1)B')}\big\|^2 \le \epsilon^2 \big\|w_{((j+1)B'..)}\big\|^2$$
implies
$$\big\|w_{(jB'..(j+1)B')}\big\|^2 \le \epsilon^2 \big\|x_{((j+1)B'..)}\big\|^2 \le \epsilon^2 \|x - h_{\mathrm{opt}}\|^2.$$
So, the best linear combination, $r$, of $w_{\mathrm{rob}}$ and any $B$-bucket histogram isn't much better than $w_{\mathrm{rob}}$.

Robust Representation, No Progress
[Figure: geometric picture relating $x$, $w_{\mathrm{rob}} \approx r$, $h$, and $h_{\mathrm{opt}}$.]
Approximately: $\|h - r\| \le \|h_{\mathrm{opt}} - r\|$, so $\|h - x\| \le \|h_{\mathrm{opt}} - x\|$.

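Robust Representation, No Progress
$\|x - w_{\mathrm{rob}}\|$ and $\|w_{\mathrm{rob}} - h_{\mathrm{opt}}\|$ are bounded.
$$\|x - w_{\mathrm{rob}}\| \le (1 + \epsilon)\|x - h_{\mathrm{opt}}\|, \qquad \|w_{\mathrm{rob}} - h_{\mathrm{opt}}\| \le (3 + \epsilon)\|x - h_{\mathrm{opt}}\|.$$
Also, $\|r - w_{\mathrm{rob}}\| \le \epsilon\|x - h_{\mathrm{opt}}\|$.
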
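Robust Representation, No Progress
We have
$$\begin{aligned}
\|h - x\|^2 &= \|h - r\|^2 + \|r - x\|^2 \\
&\le (\|h - w_{\mathrm{rob}}\| + \|w_{\mathrm{rob}} - r\|)^2 + (\|x - w_{\mathrm{rob}}\| - \|w_{\mathrm{rob}} - r\|)^2 \\
&\le \|h - w_{\mathrm{rob}}\|^2 + \|w_{\mathrm{rob}} - r\|^2 + \|x - w_{\mathrm{rob}}\|^2 + \|w_{\mathrm{rob}} - r\|^2 + 2\|h - w_{\mathrm{rob}}\| \cdot \|w_{\mathrm{rob}} - r\| \\
&\le \|h_{\mathrm{opt}} - w_{\mathrm{rob}}\|^2 + \|w_{\mathrm{rob}} - r\|^2 + \|x - w_{\mathrm{rob}}\|^2 + \|w_{\mathrm{rob}} - r\|^2 + 2\|h_{\mathrm{opt}} - w_{\mathrm{rob}}\| \cdot \|w_{\mathrm{rob}} - r\| \\
&\le \|h_{\mathrm{opt}} - w_{\mathrm{rob}}\|^2 + \|x - w_{\mathrm{rob}}\|^2 + 9\epsilon\|x - h_{\mathrm{opt}}\|^2,
\end{aligned}$$
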
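Robust Representation, No Progress
...and, similarly,
$$\begin{aligned}
\|h_{\mathrm{opt}} - x\|^2 &= \|h_{\mathrm{opt}} - r'\|^2 + \|r' - x\|^2 \\
&\ge (\|h_{\mathrm{opt}} - w_{\mathrm{rob}}\| - \|w_{\mathrm{rob}} - r'\|)^2 + (\|x - w_{\mathrm{rob}}\| - \|w_{\mathrm{rob}} - r'\|)^2 \\
&\ge \|h_{\mathrm{opt}} - w_{\mathrm{rob}}\|^2 + 2\|w_{\mathrm{rob}} - r'\|^2 + \|x - w_{\mathrm{rob}}\|^2 - 2\|h_{\mathrm{opt}} - w_{\mathrm{rob}}\| \cdot \|w_{\mathrm{rob}} - r'\| - 2\|x - w_{\mathrm{rob}}\| \cdot \|w_{\mathrm{rob}} - r'\| \\
&\ge \|h_{\mathrm{opt}} - w_{\mathrm{rob}}\|^2 + \|x - w_{\mathrm{rob}}\|^2 - 9\epsilon\|x - h_{\mathrm{opt}}\|^2.
\end{aligned}$$
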
Robust Representation, No Progress
So
$$\|h - x\|^2 - \|h_{\mathrm{opt}} - x\|^2 \le 18\epsilon\|x - h_{\mathrm{opt}}\|^2,$$
or
$$\|h - x\|^2 \le (1 + 18\epsilon)\|h_{\mathrm{opt}} - x\|^2.$$

Warmup: Best Histogram, Full Space
Want best $B$-bucket histogram to $x$. Use dynamic programming, based on the following recursion.
Define
• $\mathrm{Err}[j, k]$ = error of best $k$-bucket histogram to $x$ on $[0, j)$.
• $\mathrm{Cost}[j, j']$ = error of best 1-bucket histogram to $x$ on $[j, j')$.
So:
$$\mathrm{Err}[j, k] = \min_{\ell < j}\ \mathrm{Err}[\ell, k - 1] + \mathrm{Cost}[\ell, j].$$
"$k - 1$ buckets on $[0, \ell)$ and one bucket on $[\ell, j)$. Take best $\ell$."
Runtime: $j < d$, $k < B$, $\ell < d$; total $O(d^2 B)$.
Can construct actual histogram (not just error) as we go (keep the $\ell$'s that witness the minimization).

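A minimal sketch of this dynamic program under squared error, with exactly `num_buckets` nonempty buckets (so it assumes `num_buckets <= len(x)`). The single-bucket cost is computed naively here for clarity, which makes this version slower than the $O(d^2 B)$ stated above; names are illustrative.

```python
def best_histogram(x, num_buckets):
    """Return (error, boundaries) for the best histogram with num_buckets buckets.

    boundaries lists the right endpoint of each bucket, in increasing order.
    """
    d = len(x)

    def cost(lo, hi):                         # best 1-bucket squared error on [lo, hi)
        seg = x[lo:hi]
        mu = sum(seg) / len(seg)
        return sum((v - mu) ** 2 for v in seg)

    inf = float("inf")
    # err[k][j]: error of best k-bucket histogram on [0, j); arg[k][j]: witness l.
    err = [[inf] * (d + 1) for _ in range(num_buckets + 1)]
    arg = [[0] * (d + 1) for _ in range(num_buckets + 1)]
    err[0][0] = 0.0
    for k in range(1, num_buckets + 1):
        for j in range(1, d + 1):
            for l in range(j):
                cand = err[k - 1][l] + cost(l, j)
                if cand < err[k][j]:
                    err[k][j], arg[k][j] = cand, l
    bounds, j = [], d
    for k in range(num_buckets, 0, -1):       # walk the witnesses backwards
        bounds.append(j)
        j = arg[k][j]
    return err[num_buckets][d], bounds[::-1]
```

For example, `best_histogram([1, 1, 1, 5, 5, 9, 9, 9], 3)` fits the three constant runs exactly, with buckets ending at positions 3, 5 and 8.
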
Prefix array
From $x$, construct $Px$ with $(Px)_j = x_0 + x_1 + \cdots + x_{j-1}$, i.e., $0,\ x_0,\ x_0 + x_1,\ x_0 + x_1 + x_2, \ldots$ Also $Px^2$.
Can get $\mathrm{Cost}[\ell, j]$ from $\ell$ and $j$ in constant time:
• $x_\ell + x_{\ell+1} + \cdots + x_{j-1} = (Px)_j - (Px)_\ell$.
• Best height is the average $\mu = \frac{1}{j - \ell}\big((Px)_j - (Px)_\ell\big)$.
• Error is $\sum_{\ell \le i < j} (x_i - \mu)^2 = \sum x_i^2 - 2\mu \sum x_i + (j - \ell)\mu^2$.

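A sketch of the constant-time cost lookup, using prefix sums that start with 0 so that $(Px)_j - (Px)_\ell$ covers exactly $[\ell, j)$. The name `make_cost` is an assumption for illustration.

```python
def make_cost(x):
    """Return cost(lo, hi): squared error of the best single bucket on [lo, hi)."""
    p1, p2 = [0.0], [0.0]                     # prefix sums of x and of x^2
    for v in x:
        p1.append(p1[-1] + v)
        p2.append(p2[-1] + v * v)

    def cost(lo, hi):
        n = hi - lo
        s = p1[hi] - p1[lo]                   # x_lo + ... + x_{hi-1}
        mu = s / n                            # best height is the average
        return (p2[hi] - p2[lo]) - 2 * mu * s + n * mu * mu

    return cost
```

After the linear-time setup, each `cost(lo, hi)` call does a constant number of arithmetic operations.
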
Best Histogram to Robust Representation
Want best $B$-bucket histogram $h$ to $w_{\mathrm{rob}}$.
wlog, boundaries of $h$ are among boundaries of $w_{\mathrm{rob}}$.
Dynamic programming takes time $O(|w_{\mathrm{rob}}|^2 \cdot B)$, where $|w_{\mathrm{rob}}|$ is the number of boundaries in $w_{\mathrm{rob}}$.

Two-Dimensional Histograms
Approximation is constant on rectangles.
Hierarchical (recursively split an existing rectangle) or general.
Theorem: Any $B$-bucket (general) partition can be refined into a $(4B)$-bucket hierarchical partition.
Proof omitted; not needed for algorithm.
Aim: $(1, 1 + \epsilon)$-approximate hierarchical histogram, which is a $(4, 1 + \epsilon)$-approx general histogram.
[Figure: example partitions of a rectangle into numbered regions.]

2-D Histograms: Overall Strategy
Same overall strategy as 1-D:
• Find best $B'$-term rep'n over "tensor-product of Haar wavelets."
• Cull back to a robust representation, $w_{\mathrm{rob}}$
• Output best hierarchical histogram to $w_{\mathrm{rob}}$.
Next:
• What is tensor-product of Haar wavelets?
• How to find best $B$-bucket hierarchical histogram.

Tensor products
Need ONB that simulates and is simulated by 1-bucket histograms.
Generally: $(\alpha \otimes \beta)(x, y) = \alpha(x)\beta(y)$.
Use tensor product of Haar wavelets:
$$\psi_{j,k}(x, y) = \psi_j(x) \cdot \psi_k(y).$$
Tensor product of ONBs is ONB.

Processing Updates
Update to $x$ leads to updates to $O(\log^2(d))$ tensor products of Haar wavelets.
(Algorithm is exponential in the dimension, 2.)

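A small check of both claims for $d = 4$: the tensor products of a 1-D Haar basis form an orthonormal basis of the $4 \times 4$ grid, and a single point update touches $(\log_2(d) + 1)^2$ of them. The flat row-major storage and all names are assumptions for illustration.

```python
import math

def normalize(v):
    """Scale v to unit Euclidean length."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v]

# 1-D Haar basis for d = 4, normalized.
HAAR4 = [normalize(v) for v in
         ([1, 1, 1, 1], [-1, -1, 1, 1], [-1, 1, 0, 0], [0, 0, -1, 1])]

def tensor(alpha, beta):
    """(alpha tensor beta)(x, y) = alpha(x) * beta(y), stored row-major as a flat list."""
    return [a * b for a in alpha for b in beta]

def inner(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

BASIS_2D = [tensor(a, b) for a in HAAR4 for b in HAAR4]

def touched(x, y, d=4):
    """Number of 2-D basis vectors that are nonzero at grid position (x, y)."""
    return sum(1 for psi in BASIS_2D if psi[x * d + y] != 0)
```

Here `touched(1, 2)` counts 3 one-dimensional wavelets in each coordinate, giving 9 of the 16 tensor-product wavelets.
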
Dynamic Programming
Want best hierarchical $h$ to $w_{\mathrm{rob}}$.
Boundaries of $h$ can be taken from boundaries of $w_{\mathrm{rob}}$.
Best $j$-cut hierarchical $h$ has:
• a full cut (horiz or vert, say vert)
• a $k$-cut partition on the left
• a $(j - 1 - k)$-cut partition on the right.
Runtime: polynomial in boundaries of $w_{\mathrm{rob}}$ and desired number of buckets.
