

  1. Algorithms for Big Data (IV). Chihao Zhang, Shanghai Jiao Tong University. Oct. 11, 2019. (1/19)

  2. Review of the Last Lecture
     Last time, we introduced the AMS algorithm for counting distinct elements in the streaming model.
     We are given a sequence of numbers ⟨a_1, …, a_m⟩ where each a_i ∈ [n].
     It defines a frequency vector f = (f_1, …, f_n) where f_i = |{k ∈ [m] : a_k = i}|.
     We want to compute the number d = |{i ∈ [n] : f_i > 0}|.
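For reference, the frequency vector f and the count d can be computed exactly when Θ(n) memory is acceptable; the helper name and the sample stream below are our own illustration, not part of the slides:

```python
from collections import Counter

def distinct_count_exact(stream, n):
    """Exact baseline: build the frequency vector f and count the i with f_i > 0.

    Uses Theta(n) memory -- exactly the cost that streaming algorithms avoid.
    """
    f = Counter(stream)                            # f[i] = |{k : a_k = i}|
    return sum(1 for i in range(1, n + 1) if f[i] > 0)

stream = [3, 1, 4, 1, 5, 3, 3]                     # a_1, ..., a_m with each a_i in [n]
print(distinct_count_exact(stream, n=5))           # → 4 (distinct elements 1, 3, 4, 5)
```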

  8. Algorithm: AMS Algorithm for Counting Distinct Elements
     Init: a random hash function h : [n] → [n] from a 2-universal family; Z ← 0
     On input y: if zeros(h(y)) > Z then Z ← zeros(h(y))
     Output: d̂ = 2^(Z + 1/2)
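The AMS pseudocode above can be sketched in Python; the hash construction h(y) = ((a·y + b) mod p) mod n and the prime p are standard choices we supply for illustration, not details fixed by the slides:

```python
import random

def zeros(x):
    """Number of trailing zeros in the binary expansion of x (we treat zeros(0) as 0)."""
    z = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        z += 1
    return z

def ams_distinct(stream, n, p=2**61 - 1):
    """One run of the AMS estimator.

    h(y) = ((a*y + b) mod p) mod n is a standard 2-universal-style
    construction; it is our choice here, not the slides'.
    """
    a, b = random.randrange(1, p), random.randrange(0, p)
    Z = 0
    for y in stream:
        Z = max(Z, zeros(((a * y + b) % p) % n))
    return 2 ** (Z + 0.5)          # d-hat = 2^(Z + 1/2)
```

A single run only achieves a constant-factor guarantee; as the review slide notes, repetitions are needed to push the failure probability down to δ.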

  9. We also introduced the BJKST algorithm, a refinement of the AMS algorithm.
     Using O(log(1/δ) · log n) bits of memory, we can obtain Pr[d/3 ≤ d̂ ≤ 3d] ≥ 1 − δ.
     We will show today that the BJKST algorithm can produce d̂ which is a (1 ± ε)-approximation of d, for any ε > 0.

  12. The BJKST Algorithm
      The following refinement is due to Bar-Yossef, Jayram, Kumar, Sivakumar and Trevisan.
      Algorithm: BJKST Algorithm for Counting Distinct Elements
      Init: random hash functions h : [n] → [n] and g : [n] → [b·ε⁻⁴ log² n], both from 2-universal families; Z ← 0, B ← ∅
      On input y:
        if zeros(h(y)) ≥ Z then
          B ← B ∪ {(g(y), zeros(h(y)))}
          while |B| ≥ c/ε² do
            Z ← Z + 1
            remove all (α, β) with β < Z from B
          end while
        end if
      Output: d̂ = |B| · 2^Z
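The BJKST pseudocode can be transcribed directly, taking g to be the identity (the simplification the analysis also uses); the constant c = 12 and the hash construction are illustrative assumptions of ours:

```python
import math
import random

def zeros(x):
    """Number of trailing zeros in the binary expansion of x (zeros(0) treated as 0)."""
    z = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        z += 1
    return z

def bjkst_distinct(stream, n, eps=0.25, c=12, p=2**61 - 1):
    """Sketch of BJKST with g = identity.

    The constant c = 12 and the hash h(y) = ((a*y + b) mod p) mod n
    are illustrative choices, not values fixed by the slides.
    """
    a, b = random.randrange(1, p), random.randrange(0, p)
    h = lambda y: ((a * y + b) % p) % n
    cap = math.ceil(c / eps ** 2)          # the size cap c / eps^2 on B
    Z, B = 0, {}                           # B maps g(y) = y to zeros(h(y))
    for y in stream:
        zy = zeros(h(y))
        if zy >= Z:
            B[y] = zy
            while len(B) >= cap:           # bucket overflow: raise Z, prune B
                Z += 1
                B = {k: v for k, v in B.items() if v >= Z}
    return len(B) * 2 ** Z
```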

  13. The algorithm maintains a bucket B, which stores those y whose zeros(h(y)) is at least the current Z.
      We set a cap L = c/ε² for the size of B:
      ▶ if L = ∞, B stores all entries, and the algorithm is exact;
      ▶ if L = 2, the algorithm is equivalent to AMS.
      Therefore, the size of B is a trade-off between the memory consumption and the accuracy of the algorithm.

  17. Analysis
      To analyze the algorithm, we first assume that g is simply the identity function from [n] to [n], namely g(y) = y for all y ∈ [n].
      We need to store the whole B, whose size is O(ε⁻²).
      Similar to AMS, for every k ∈ [n], X_{k,r} is the indicator that h(k) has at least r trailing zeros.
      Define Y_r = ∑_{k ∈ [n] : f_k > 0} X_{k,r} as the number of h(a_i) with at least r trailing zeros.
      We already know from the last lecture that E[Y_r] = d/2^r and Var[Y_r] ≤ d/2^r.
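The identity E[Y_r] = d/2^r can be checked empirically: the simulation below averages Y_r over fresh random hash functions (the hash construction, n, and trial count are illustrative choices of ours):

```python
import random

def zeros(x):
    """Number of trailing zeros in the binary expansion of x (zeros(0) treated as 0)."""
    z = 0
    while x > 0 and x % 2 == 0:
        x //= 2
        z += 1
    return z

def mean_Y_r(d, r, n=2**16, trials=2000, p=2**61 - 1):
    """Empirical average of Y_r over fresh hash functions.

    Y_r counts the distinct elements k (here simply 1..d) whose h(k) has
    at least r trailing zeros; we expect the average to be close to d / 2^r.
    """
    total = 0
    for _ in range(trials):
        a, b = random.randrange(1, p), random.randrange(0, p)
        total += sum(1 for k in range(1, d + 1)
                     if zeros(((a * k + b) % p) % n) >= r)
    return total / trials

random.seed(1)
print(round(mean_Y_r(64, 3), 2), "vs", 64 / 2**3)   # empirical mean vs d/2^r = 8.0
```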

  23. If Z = t at the end of the algorithm, then Y_t = |B| and d̂ = Y_t · 2^t.
      We use A to denote the bad event that |Y_t − d/2^t| > εd/2^t, or equivalently d̂ ∉ [(1 − ε)d, (1 + ε)d].
      We will bound the probability of A using the following argument:
      ▶ the value t is unlikely to be very large;
      ▶ if t is small, then E[Y_t] = d/2^t is large, so we can apply concentration inequalities.
      We let s be the threshold for small/large values mentioned above.
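One way to make the concentration step concrete, using the mean and variance bounds from the analysis slide, is Chebyshev's inequality (a sketch; the choice of inequality and the way the threshold s enters are our reading of the argument):

```latex
\Pr\bigl[\,\lvert Y_t - \mathbf{E}[Y_t]\rvert \ge \varepsilon d / 2^t\,\bigr]
  \;\le\; \frac{\mathbf{Var}[Y_t]}{(\varepsilon d / 2^t)^2}
  \;\le\; \frac{d/2^t}{\varepsilon^2 d^2 / 4^t}
  \;=\; \frac{2^t}{\varepsilon^2 d}.
```

The bound degrades as t grows, which is why the argument pairs it with the observation that large values of t are unlikely.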
