Algorithms for Big Data (IV)
Chihao Zhang
Shanghai Jiao Tong University
Oct. 11, 2019
Review of the Last Lecture

Last time, we introduced the AMS algorithm for counting distinct elements in the streaming model.

We are given a sequence of numbers ⟨a₁, …, a_m⟩ where each a_i ∈ [n].

It defines a frequency vector f = (f₁, …, f_n) where f_i = |{k ∈ [m] : a_k = i}|.

We want to compute the number d = |{i ∈ [n] : f_i > 0}|.
Algorithm: AMS Algorithm for Counting Distinct Elements

Init: a random hash function h : [n] → [n] from a 2-universal family; Z ← 0.
On input y:
    if zeros(h(y)) > Z then
        Z ← zeros(h(y))
    end if
Output: d̂ = 2^(Z + 1/2).
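As a concrete sketch of the AMS algorithm above (not from the lecture: the particular 2-universal family ((a·x + b) mod p) mod n and the convention zeros(0) = 0 are my assumptions):

```python
import random

def make_2universal_hash(n, p=2_147_483_647):
    # A random member of the (approximately) 2-universal family
    # x -> ((a*x + b) mod p) mod n, with p a large prime (here 2^31 - 1).
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

def trailing_zeros(x):
    # zeros(x): the number of trailing zero bits of x.
    # Convention assumed here: zeros(0) = 0.
    return (x & -x).bit_length() - 1 if x else 0

def ams_distinct(stream, n):
    # One pass: keep only the largest number of trailing zeros seen.
    h = make_2universal_hash(n)
    z = 0
    for y in stream:
        z = max(z, trailing_zeros(h(y)))
    return 2 ** (z + 0.5)  # the estimate d_hat = 2^(Z + 1/2)
```

The sketch stores only Z and the description of h, i.e. O(log n) bits, which is the point of the algorithm.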
Using O(log(1/δ) · log n) bits of memory, we can obtain

    Pr[ d/3 ≤ d̂ ≤ 3d ] ≥ 1 − δ.

We also introduced the BJKST algorithm, a refinement of the AMS algorithm.

We will show today that the BJKST algorithm can produce d̂ which is a (1 ± ε)-approximation of d for any ε > 0.
The BJKST Algorithm

The following refinement is due to Bar-Yossef, Jayram, Kumar, Sivakumar and Trevisan.

Algorithm: BJKST Algorithm for Counting Distinct Elements

Init: random hash functions h : [n] → [n] and g : [n] → [b·ε⁻⁴ log² n], both from 2-universal families; Z ← 0, B ← ∅.
On input y:
    if zeros(h(y)) ≥ Z then
        B ← B ∪ {(g(y), zeros(h(y)))}
    end if
    while |B| ≥ c/ε² do
        Z ← Z + 1
        remove all (α, β) with β < Z from B
    end while
Output: d̂ = |B| · 2^Z.
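A runnable sketch of the BJKST algorithm, under assumptions I am adding: the hash family, the convention zeros(0) = 0, the constants b = 1 and c = 64, and storing B as a dictionary keyed by the fingerprint g(y):

```python
import math
import random

def make_hash(m, p=2_147_483_647):
    # A random member of the 2-universal family x -> ((a*x + b) mod p) mod m.
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % m

def trailing_zeros(x):
    # zeros(x): trailing zero bits of x; zeros(0) = 0 assumed.
    return (x & -x).bit_length() - 1 if x else 0

def bjkst_distinct(stream, n, eps=0.25, c=64):
    # c is the constant from the analysis; its value here is an assumption.
    h = make_hash(n)
    # g compresses stored elements; range ~ b * eps^-4 * log^2 n with b = 1.
    g = make_hash(int(eps ** -4 * math.log2(n) ** 2))
    z = 0
    B = {}  # fingerprint g(y) -> zeros(h(y))
    for y in stream:
        zy = trailing_zeros(h(y))
        if zy >= z:
            B[g(y)] = max(B.get(g(y), 0), zy)
            # Cap |B| at c/eps^2: raise the level Z and evict small entries.
            while len(B) >= c / eps ** 2:
                z += 1
                B = {fp: r for fp, r in B.items() if r >= z}
    return len(B) * 2 ** z
```

Storing fingerprints g(y) instead of the elements y themselves is what brings the space down; the dictionary mirrors the set B of (g(y), zeros(h(y))) pairs in the pseudocode.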
The algorithm maintains a bucket B, which stores those y whose zeros(h(y)) is larger than the current Z.

We set a cap L = c/ε² for the size of B:
▶ if L = ∞, B stores all entries, and the algorithm is exact;
▶ if L = 2, B keeps at most one entry, and the algorithm is equivalent to AMS.

Therefore, the size of B is a trade-off between the memory consumption and the accuracy of the algorithm.
Analysis

To analyze the algorithm, we first assume that g is simply the identity function from [n] to [n], namely g(y) = y for all y ∈ [n]. We need to store the whole B, whose size is O(ε⁻²).

Similar to AMS, for every k ∈ [n], X_{k,r} is the indicator that h(k) has at least r trailing zeros.

Define Y_r = ∑_{k ∈ [n]: f_k > 0} X_{k,r} as the number of h(a_i) with at least r trailing zeros.

We already know from the last lecture that E[Y_r] = d/2^r and Var[Y_r] ≤ d/2^r.
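The identities E[Y_r] = d/2^r and Var[Y_r] ≤ d/2^r can be checked empirically. A small simulation (the hash family and parameters are my choices, not from the lecture) averages Y_r over many fresh hash functions:

```python
import random

def trailing_zeros(x):
    # zeros(x): trailing zero bits of x; zeros(0) = 0 assumed.
    return (x & -x).bit_length() - 1 if x else 0

def mean_Yr(d, r, trials=2000, n=2**20, p=2_147_483_647):
    # Empirical mean of Y_r = |{k in [d] : h(k) has >= r trailing zeros}|,
    # averaged over fresh 2-universal hashes h(x) = ((a*x + b) mod p) mod n.
    total = 0
    for _ in range(trials):
        a = random.randrange(1, p)
        b = random.randrange(0, p)
        total += sum(1 for k in range(d)
                     if trailing_zeros(((a * k + b) % p) % n) >= r)
    return total / trials

# By linearity of expectation, the result should be close to d / 2^r.
```

For d = 256 and r = 3, the average should land near 256/8 = 32, in line with E[Y_r] = d/2^r.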
If Z = t at the end of the algorithm, then Y_t = |B| and d̂ = Y_t · 2^t.

We use A to denote the bad event that |Y_t · 2^t − d| > εd, or equivalently |Y_t − d/2^t| > εd/2^t.

We will bound the probability of A using the following argument:
▶ the value t is unlikely to be very large;
▶ if t is small, then E[Y_t] = d/2^t is large, so we can apply concentration inequalities.

We let s be the threshold for small/large values mentioned above.
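For the "if t is small" step, the concentration inequality in question can be taken to be Chebyshev's. A sketch of the computation, using E[Y_t] = d/2^t and Var[Y_t] ≤ d/2^t (the lecture's exact constants and case split may differ):

```latex
\Pr\left[\,\left|Y_t - \frac{d}{2^t}\right| \ge \frac{\varepsilon d}{2^t}\,\right]
\;\le\; \frac{\operatorname{Var}[Y_t]}{(\varepsilon d / 2^t)^2}
\;\le\; \frac{d/2^t}{\varepsilon^2 d^2 / 4^t}
\;=\; \frac{2^t}{\varepsilon^2 d}.
```

So the bound degrades as 2^t grows, which is exactly why the argument must separately show that t is unlikely to exceed the threshold s.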