B669 Sublinear Algorithms for Big Data


  1. B669 Sublinear Algorithms for Big Data (Qin Zhang)

  2. Part 1: Sublinear in Space

  3. The model and challenge. The data stream model (Alon, Matias and Szegedy 1996). [Figure: the items $a_n, \ldots, a_2, a_1$ stream into a CPU with limited RAM.] Why hard? Cannot store everything. Applications: Internet routers, stock data, ad auctions, flight logs on tape, etc.

  4. §1.1 Point Queries (part I). [Figure: the stream 9 7 6 3 3 9 arriving at a CPU with limited RAM.] Which items are most frequent? Approximation allowed.

  7. More about the streaming model. Denote the stream by $A = a_1, \ldots, a_m$, where $m = n^{O(1)}$ is the length of the stream, which is unknown at the beginning. Let $[n]$ be the item universe and let $x_j$ be the frequency of item $j$ in the stream. Each $a_i = (j, \Delta)$ denotes the update $x_j \leftarrow x_j + \Delta$. We call an algorithm insertion-only if it only works for $\Delta = 1$. The stream can be represented as a vector $x = (x_1, \ldots, x_n)$: when $a_i = (j, \Delta)$ arrives, $x_j \leftarrow x_j + \Delta$. For insertion-only streams, $m = \|x\|_1$.
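
The update model can be sketched in a few lines of Python. This is only an illustration that stores $x$ exactly, which a streaming algorithm cannot afford; the function name `frequency_vector` and the sample stream are assumptions, not part of the slides.

```python
from collections import defaultdict

def frequency_vector(stream):
    """Apply each update (j, delta) as x_j <- x_j + delta and return x as a dict."""
    x = defaultdict(int)
    for j, delta in stream:
        x[j] += delta
    return dict(x)

# Insertion-only stream: every update has delta = 1.
stream = [(9, 1), (7, 1), (6, 1), (3, 1), (3, 1), (9, 1)]
x = frequency_vector(stream)

# For an insertion-only stream the length m equals the L1 norm of x.
l1_norm = sum(abs(v) for v in x.values())
```

Only the nonzero coordinates of $x$ are kept in the dictionary; absent keys stand for $x_j = 0$.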

  8. The MAJORITY problem. MAJORITY: if $\exists j : f_j > m/2$, then output $j$; otherwise, output $\bot$. (Here $f_j$ is the frequency of item $j$, i.e. $x_j$ in the notation above.)

  11. Heavy hitters and point queries. $L_p$ heavy hitter set: $HH^p_\phi(x) = \{ i : |x_i| \ge \phi \|x\|_p \}$. $L_p$ Heavy Hitter Problem: given $\phi, \phi'$ (often $\phi' = \phi - \epsilon$), return a set $S$ such that $HH^p_\phi(x) \subseteq S \subseteq HH^p_{\phi'}(x)$. $L_p$ Point Query Problem: given $\epsilon$, after reading the whole stream, given $i$, report $\tilde{x}_i = x_i \pm \epsilon \|x\|_p$.
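
To make the definition concrete, the following sketch computes $HH^p_\phi(x)$ exactly from a fully stored vector. It is an offline reference computation, not a streaming algorithm; the names `lp_norm` and `heavy_hitters` and the sample vector are illustrative assumptions.

```python
def lp_norm(x, p):
    """L_p norm of a sparse vector given as {index: value}."""
    return sum(abs(v) ** p for v in x.values()) ** (1.0 / p)

def heavy_hitters(x, phi, p):
    """Return the set {i : |x_i| >= phi * ||x||_p}."""
    threshold = phi * lp_norm(x, p)
    return {i for i, v in x.items() if abs(v) >= threshold}

x = {9: 2, 7: 1, 6: 1, 3: 2}          # frequencies of the stream 9 7 6 3 3 9
hh_strict = heavy_hitters(x, 0.3, 1)  # threshold is about 1.8
hh_loose = heavy_hitters(x, 0.1, 1)   # threshold is about 0.6
```

Any set $S$ with `hh_strict` $\subseteq S \subseteq$ `hh_loose` would be a valid answer to the $L_1$ heavy hitter problem with $\phi = 0.3$ and $\phi' = 0.1$.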

  13. The Misra-Gries algorithm. The algorithm (Misra-Gries ’82):
      1. Maintain a set $A$; each element is a counter pair $(i, \tilde{x}_i)$. Initially $A \leftarrow \emptyset$.
      2. For each newly arriving item $e$:
         (a) if $e \in A$, set $(e, \tilde{x}_e) \leftarrow (e, \tilde{x}_e + 1)$;
         (b) else if $|A| < 1/\epsilon$, add $(e, 1)$ to $A$;
         (c) else, for each $e' \in A$, set $(e', \tilde{x}_{e'}) \leftarrow (e', \tilde{x}_{e'} - 1)$, and if the new value is 0, remove $(e', 0)$ from $A$.
      3. On query $i$: if $i \in A$, return $\tilde{x}_i$; otherwise return 0.

      Analysis (on board). Theorem: Misra-Gries uses $O(1/\epsilon \cdot \log n)$ bits and, for any $j$, produces an estimate $\tilde{x}_j$ satisfying $x_j - \epsilon m \le \tilde{x}_j \le x_j$.
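
The steps of the Misra-Gries slide can be sketched in Python as follows. This is a minimal illustration under the assumption that the counter set is kept in a dictionary; the function names and the sample stream are not from the slides.

```python
def misra_gries(stream, eps):
    """Process an insertion-only stream; return the surviving counters."""
    counters = {}
    for e in stream:
        if e in counters:                  # step 2(a): item already tracked
            counters[e] += 1
        elif len(counters) < 1 / eps:      # step 2(b): room for a new counter
            counters[e] = 1
        else:                              # step 2(c): decrement every counter
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def query(counters, i):
    """Step 3: the stored counter if i is tracked, otherwise 0."""
    return counters.get(i, 0)

stream = [9, 7, 6, 3, 3, 9, 9, 9]
counters = misra_gries(stream, eps=0.5)
```

On this stream with $\epsilon = 0.5$, item 9 has true frequency 4 and is reported as 3, which is within the guaranteed underestimate of at most $\epsilon m = 4$.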

  15. Space-saving: an algorithm for insertion only. Algorithm Space-saving [Metwally et al. ’05]:
      When a new item $e$ arrives, there are two cases.
      1. If $e$ is already in the array, increment $\tilde{x}_e$ by 1 and reinsert $(e, \tilde{x}_e)$ into the array.
      2. If $e$ is not in the array, create a new tuple $(e, \mathrm{MIN} + 1)$, where $\mathrm{MIN} = \min\{ \tilde{x}_{e'} : e' \text{ is in the array} \}$.

      The array is always kept sorted by $\tilde{x}_e$ in decreasing order, so MIN is just the estimated frequency of the last item. If the length of the array exceeds $1/\epsilon$, delete the last tuple. On a query for $e$, report $\tilde{x}_e$ if $e$ is in the array; otherwise report MIN.

      Theorem (analysis on board): Space-saving uses $O(1/\epsilon \cdot \log n)$ bits and, for any $j$, produces an estimate $\tilde{x}_j$ satisfying $x_j \le \tilde{x}_j \le x_j + \epsilon m$.
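
A possible Python sketch of Space-saving is given below. It makes two assumptions that the slide does not spell out: MIN is treated as 0 while the array holds fewer than $\lceil 1/\epsilon \rceil$ tuples, and a dictionary with a linear-time minimum search stands in for the sorted array. Replacing the minimum tuple by $(e, \mathrm{MIN} + 1)$ has the same effect as inserting the new tuple and then deleting the last one.

```python
import math

def space_saving(stream, eps):
    """Process an insertion-only stream; return (estimates, capacity)."""
    capacity = math.ceil(1 / eps)
    est = {}
    for e in stream:
        if e in est:                        # case 1: already tracked
            est[e] += 1
        elif len(est) < capacity:           # not full yet: MIN taken as 0 (assumption)
            est[e] = 1
        else:                               # case 2: evict the minimum, reuse its count
            victim = min(est, key=est.get)
            min_count = est.pop(victim)
            est[e] = min_count + 1
    return est, capacity

def query(est, capacity, e):
    """Stored estimate if e is tracked, otherwise MIN (0 while not full)."""
    if e in est:
        return est[e]
    return min(est.values()) if len(est) == capacity else 0

stream = [9, 7, 6, 3, 3, 9, 9, 9]
est, capacity = space_saving(stream, eps=0.5)
```

In contrast to Misra-Gries, the estimates here never fall below the true frequencies; on this stream item 9 (true frequency 4) is reported as 5.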

  16. §1.2 Distinct Elements. [Figure: the stream 9 7 6 3 3 9 arriving at a CPU with limited RAM.] How many distinct elements? Approximation needed.

  17. Universal hash function. A family $H \subseteq \{ h : X \to Y \}$ is said to be 2-universal if the following property holds, with $h \in_R H$ picked uniformly at random: $\forall x, x' \in X,\ \forall y, y' \in Y:\ x \ne x' \Rightarrow \Pr_h[h(x) = y \wedge h(x') = y'] = \frac{1}{|Y|^2}$.
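
One standard family with this property, not given on the slide and used here only as an assumed example, is $h_{a,b}(x) = (ax + b) \bmod p$ over $X = Y = \mathbb{Z}_p$ for a prime $p$, with $a, b$ drawn uniformly from $\mathbb{Z}_p$. The sketch below checks the definition by brute force for a small prime; the helper names are illustrative.

```python
from itertools import product

def make_hash(a, b, p):
    """Member h_{a,b}(x) = (a*x + b) mod p of the assumed linear family."""
    return lambda x: (a * x + b) % p

def joint_probability(p, x1, x2, y1, y2):
    """Fraction of pairs (a, b) in Z_p x Z_p with h(x1) = y1 and h(x2) = y2."""
    hits = 0
    for a, b in product(range(p), repeat=2):
        h = make_hash(a, b, p)
        if h(x1) == y1 and h(x2) == y2:
            hits += 1
    return hits / p**2

p = 5
prob = joint_probability(p, x1=1, x2=3, y1=4, y2=0)
```

For $x \ne x'$ the two linear equations determine $(a, b)$ uniquely, so exactly one of the $p^2$ functions matches, which is the $1/|Y|^2$ in the definition.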

  20. The Flajolet-Martin algorithm. The algorithm (Flajolet and Martin ’83):
      1. Choose a random hash function $h : [n] \to [n]$ from a 2-universal family. Set $z = 0$. Let $\mathrm{zeros}(h(e))$ be the number of trailing zeros in the binary representation of $h(e)$.
      2. For each newly arriving item $e$, if $\mathrm{zeros}(h(e)) > z$, set $z = \mathrm{zeros}(h(e))$.
      3. Output $2^{z + 0.5}$.

      Analysis (on board). Theorem: the number of distinct elements can be $O(1)$-approximated with probability $2/3$ using $O(\log n)$ bits.
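
The three steps can be sketched in Python as below. Several details are assumptions made for the illustration: items are integers, the hash is $h(e) = (ae + b) \bmod P$ with the Mersenne prime $P = 2^{61} - 1$ (so the hash range is not a power of two and the trailing-zero counts are only approximately geometric), and $\mathrm{zeros}(0)$ is set to 61 by convention.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as hash modulus (assumption)

def trailing_zeros(v):
    """Number of trailing zero bits of v; 61 for v == 0 by convention."""
    if v == 0:
        return 61
    return (v & -v).bit_length() - 1

def fm_estimate(stream, seed=0):
    """Estimate the number of distinct integers in the stream as 2^(z + 0.5)."""
    rng = random.Random(seed)
    a, b = rng.randrange(P), rng.randrange(P)   # pick h from the linear family
    z = 0
    for e in stream:
        z = max(z, trailing_zeros((a * e + b) % P))
    return 2 ** (z + 0.5)

stream = [9, 7, 6, 3, 3, 9]
estimate = fm_estimate(stream, seed=1)
```

The state is the single integer $z$, and repeating an item never changes it, so the output depends only on the set of distinct items and on the chosen hash function.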

  21. Probability amplification. Can we boost the success probability to $1 - \delta$? The idea is to run $k = \Theta(\log(1/\delta))$ copies of this algorithm in parallel, using mutually independent random hash functions, and output the median of the $k$ answers.
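
The median trick can be sketched generically. In the code below, `run_once(seed)` stands for one independent copy of the base estimator, the constant `c` hidden in $\Theta(\log(1/\delta))$ is a hypothetical choice, and `toy_estimator` is a deterministic stand-in that is wrong on one third of the seeds; none of these names come from the slides.

```python
import math
import statistics

def amplify(run_once, delta, c=4):
    """Run k = Theta(log(1/delta)) copies and return the median answer."""
    k = math.ceil(c * math.log(1 / delta))   # c is a hypothetical constant
    if k % 2 == 0:
        k += 1                               # odd k gives a unique middle value
    answers = [run_once(seed) for seed in range(k)]
    return statistics.median(answers)

def toy_estimator(seed):
    """Stand-in estimator: returns 100, except a wild value on every third seed."""
    return 10**6 if seed % 3 == 0 else 100

result = amplify(toy_estimator, delta=0.1)
```

The median is off only if at least half of the copies fail at once, which is why a constant success probability above $1/2$ per copy suffices.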

  24. An improved algorithm. Idea: two-level hashing. The algorithm (Bar-Yossef et al. ’02):
      1. Choose a random hash function $h : [n] \to [n]$ from a 2-universal family. Set $z = 0$, $B = \emptyset$. Choose a secondary 2-universal hash function $g : [n] \to [(\log n / \epsilon)^{O(1)}]$.
      2. For each newly arriving item $e$, if $\mathrm{zeros}(h(e)) \ge z$, then
         (a) set $B \leftarrow B \cup \{ (g(e), \mathrm{zeros}(h(e))) \}$;
         (b) if $|B| > c/\epsilon^2$, set $z \leftarrow z + 1$ and remove all $(\alpha, \beta)$ from $B$ with $\beta < z$.
      3. Output $|B| \cdot 2^z$.

      Analysis (on board). Theorem: the number of distinct elements can be $(1 + \epsilon)$-approximated with probability $2/3$ using $O(\log n + 1/\epsilon^2 \cdot (\log(1/\epsilon) + \log\log n))$ bits.
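
A rough Python sketch of the two-level scheme follows. The concrete choices are assumptions for illustration only: integer items, linear hashes modulo the Mersenne prime $2^{61} - 1$, a range of $(\log_2 n / \epsilon)^3$ for $g$ (the slide only says $(\log n/\epsilon)^{O(1)}$), and the constant $c = 24$. Reducing the hash of $g$ modulo a smaller range is only approximately 2-universal.

```python
import math
import random

P = (1 << 61) - 1  # Mersenne prime used as hash modulus (assumption)

def trailing_zeros(v):
    """Number of trailing zero bits of v; 61 for v == 0 by convention."""
    return 61 if v == 0 else (v & -v).bit_length() - 1

def bjkst_estimate(stream, eps, n, c=24, seed=0):
    """Estimate the number of distinct integers as |B| * 2^z."""
    rng = random.Random(seed)
    a1, b1, a2, b2 = (rng.randrange(P) for _ in range(4))
    g_range = max(2, math.ceil((math.log2(n) / eps) ** 3))  # exponent 3 is hypothetical
    cap = c / eps**2                                        # c is hypothetical
    z, B = 0, set()
    for e in stream:
        t = trailing_zeros((a1 * e + b1) % P)               # zeros(h(e))
        if t >= z:
            B.add((((a2 * e + b2) % P) % g_range, t))       # step 2(a)
            if len(B) > cap:                                # step 2(b)
                z += 1
                B = {(alpha, beta) for alpha, beta in B if beta >= z}
    return len(B) * 2**z

stream = [9, 7, 6, 3, 3, 9]
estimate = bjkst_estimate(stream, eps=0.5, n=16)
```

While the number of distinct items stays below $c/\epsilon^2$, the level $z$ remains 0 and the output is simply the number of distinct pairs stored in $B$.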

  25. §1.3 Linear Sketches

  28. We have seen Misra-Gries, Space-saving, Flajolet-Martin and its improvement. These are nice algorithms, but they only work for insertion-only sequences. Can we handle deletions? A popular way is to use linear sketches.

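  29. Linear sketch. A random linear projection $M : \mathbb{R}^n \to \mathbb{R}^k$, with $k \ll n$, that preserves properties of any $v \in \mathbb{R}^n$ with high probability. [Figure: the matrix $M$ multiplies the vector $v$ to give the sketch $Mv$, from which the answer is computed.]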
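
The reason linearity helps with deletions can be shown in a small sketch: an update $(j, \Delta)$ changes $Mv$ by $\Delta$ times column $j$ of $M$, whatever the sign of $\Delta$. The random $\pm 1$ matrix and the helper names below are assumptions chosen for illustration; the slide does not fix a particular $M$, and a real sketch would generate entries from hash functions instead of storing $M$.

```python
import random

def make_sketch_matrix(k, n, seed=0):
    """Random k x n matrix with +1/-1 entries (an assumed choice of M)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def apply_update(sketch, M, j, delta):
    """Update the stored sketch Mv in place for the stream update x_j <- x_j + delta."""
    for row in range(len(M)):
        sketch[row] += delta * M[row][j]

def sketch_of(M, v):
    """Compute Mv directly from a fully stored vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

n, k = 8, 3
M = make_sketch_matrix(k, n, seed=1)
sketch = [0] * k
updates = [(2, 1), (5, 1), (2, 1), (5, -1), (7, 3), (2, -1)]  # includes deletions
for j, delta in updates:
    apply_update(sketch, M, j, delta)

net = [0] * n
for j, delta in updates:
    net[j] += delta
```

After all updates, `sketch` equals `sketch_of(M, net)`: the sketch depends only on the final vector, not on the order or sign of the individual updates.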
