  1. Advanced Algorithms

  2. Count Distinct Elements
Input: a sequence $x_1, x_2, \ldots, x_n \in \Omega$
Output: an estimation of $z = |\{x_1, x_2, \ldots, x_n\}|$
• data stream: the input comes one item at a time
• naive algorithm: store everything, with $O(n)$ space
• the algorithm outputs $\hat{Z} = f(x_1, \ldots, x_n)$, an estimation of $z$
• $(\epsilon, \delta)$-estimator: $\Pr\left[(1-\epsilon)z \le \hat{Z} \le (1+\epsilon)z\right] \ge 1-\delta$
"Using only memory equivalent to 5 lines of printed text, you can estimate with a typical accuracy of 5% and in a single pass the total vocabulary of Shakespeare." (Flajolet)

  3. Count Distinct Elements (cont.)
Input: a sequence $x_1, x_2, \ldots, x_n \in \Omega$
Output: an estimation of $z = |\{x_1, x_2, \ldots, x_n\}|$
$(\epsilon, \delta)$-estimator: $\Pr\left[(1-\epsilon)z \le \hat{Z} \le (1+\epsilon)z\right] \ge 1-\delta$
• uniform hash function $h: \Omega \to [0,1]$
• $h(x_1), \ldots, h(x_n)$ are $z$ uniform independent values in $[0,1]$, which partition $[0,1]$ into $z+1$ subintervals
• by symmetry, $E\left[\min_{1\le i\le n} h(x_i)\right] = E[\text{length of a subinterval}] = \frac{1}{z+1}$
• estimator: $\hat{Z} = \frac{1}{\min_i h(x_i)} - 1$?
• But $\mathrm{Var}\left[\min_i h(x_i)\right]$ is too large! (think of $z = 1$)
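The slides carry no code, but both claims are easy to see in a minimal Python simulation (an added illustration; under the UHA the $z$ hash values are i.i.d. uniforms, so we draw them directly). The minimum itself behaves as predicted, yet $\frac{1}{\min} - 1$ is heavy-tailed:

```python
import random

# Minimal simulation of the single-hash estimator Z = 1/min - 1.
# Under the Uniform Hash Assumption the z distinct hash values are
# i.i.d. uniform on [0,1], so we can sample them directly.
def single_min_estimate(z: int) -> float:
    m = min(random.random() for _ in range(z))
    return 1.0 / m - 1.0

random.seed(0)
z = 1
samples = [single_min_estimate(z) for _ in range(10_000)]
# E[min] = 1/(z+1) = 0.5, but 1/min - 1 is heavy-tailed:
# one tiny minimum blows the estimate up.
print(sorted(samples)[5_000])   # median is near z = 1
print(max(samples))             # occasionally astronomically large
```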

  4. The Flajolet-Martin estimator
Input: a sequence $x_1, x_2, \ldots, x_n \in \Omega$
Output: an estimation of $z = |\{x_1, x_2, \ldots, x_n\}|$
• uniform independent hash functions $h_1, h_2, \ldots, h_k: \Omega \to [0,1]$ (UHA: Uniform Hash Assumption)
• $Y_j = \min_{1\le i\le n} h_j(x_i)$
• average-min: $Y = \frac{1}{k}\sum_{j=1}^{k} Y_j$
• Flajolet-Martin estimator: $\hat{Z} = \frac{1}{Y} - 1$
• $Y$ is an unbiased estimator of $\frac{1}{z+1}$: $E[Y] = E[Y_j] = \frac{1}{z+1}$
• Deviation: $\Pr\left[\hat{Z} < (1-\epsilon)z \text{ or } \hat{Z} > (1+\epsilon)z\right] < {}?$
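A runnable sketch of this average-min estimator, with salted SHA-256 standing in for the ideal uniform hash functions of the UHA (the names `_hash01` and `fm_estimate` are illustrative, not the deck's):

```python
import hashlib

def _hash01(x: str, salt: int) -> float:
    """Map item x to a pseudo-uniform value in [0,1); a stand-in
    for the ideal uniform hash function h_salt of the UHA."""
    h = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64

def fm_estimate(stream, k: int) -> float:
    """Flajolet-Martin average-min estimator: Z = 1/Y - 1, where Y
    averages the minimum hash value over k hash functions."""
    mins = [1.0] * k                   # running minimum per hash function
    for x in stream:
        for j in range(k):
            v = _hash01(x, j)
            if v < mins[j]:
                mins[j] = v
    y = sum(mins) / k                  # average-min Y
    return 1.0 / y - 1.0

print(fm_estimate(["a", "b", "c", "a", "b", "a"], k=100))  # close to z = 3
```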

  5. Analysis of the deviation
For $j = 1, 2, \ldots, k$, the hash values of $h_j$ on the $z = |\{x_1, x_2, \ldots, x_n\}|$ distinct items are uniform and independent: $X_{j1}, X_{j2}, \ldots, X_{jz} \in [0,1]$, and $Y_j = \min_{1\le i\le z} X_{ji}$.
By symmetry, $E[Y] = E[Y_j] = \frac{1}{z+1}$, where $Y = \frac{1}{k}\sum_{j=1}^{k} Y_j$, and the F-M estimator is $\hat{Z} = \frac{1}{Y} - 1$.
Goal: $\Pr\left[\hat{Z} > (1+\epsilon)z \text{ or } \hat{Z} < (1-\epsilon)z\right] < \delta$.
For $\epsilon \le 1/2$, this bad event implies $\left|Y - \frac{1}{z+1}\right| > \frac{\epsilon/2}{z+1}$, i.e. $|Y - E[Y]| > \frac{\epsilon/2}{z+1}$, so
$$\Pr\left[\hat{Z} > (1+\epsilon)z \text{ or } \hat{Z} < (1-\epsilon)z\right] \le \Pr\left[\,|Y - E[Y]| > \frac{\epsilon/2}{z+1}\right] \le \frac{4(z+1)^2}{\epsilon^2}\,\mathrm{Var}[Y]$$
by Chebyshev's inequality.
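Filling in the step the slide compresses (an added derivation, not on the original slide): since $\hat{Z} = \frac{1}{Y} - 1$ is monotone in $Y$,
$$\hat{Z} > (1+\epsilon)z \iff Y < \frac{1}{(1+\epsilon)z+1}, \qquad \hat{Z} < (1-\epsilon)z \iff Y > \frac{1}{(1-\epsilon)z+1},$$
and both thresholds lie at distance at least $\frac{\epsilon/2}{z+1}$ from $E[Y] = \frac{1}{z+1}$. For the first one,
$$\frac{1}{z+1} - \frac{1}{(1+\epsilon)z+1} = \frac{\epsilon z}{(z+1)\left((1+\epsilon)z+1\right)} \ge \frac{\epsilon/2}{z+1} \iff (1-\epsilon)z \ge 1,$$
which holds for $\epsilon \le 1/2$ once $z \ge 2$ (the slides gloss over the trivial case $z = 1$).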

  6. Markov's Inequality
Markov's Inequality: for nonnegative $X$ and any $t > 0$,
$$\Pr[X \ge t] \le \frac{E[X]}{t}.$$
Proof: let $Y = \begin{cases}1 & \text{if } X \ge t,\\ 0 & \text{otherwise,}\end{cases}$ so that $Y \le \frac{X}{t}$. Then
$$\Pr[X \ge t] = E[Y] \le E\!\left[\frac{X}{t}\right] = \frac{E[X]}{t}.$$
The bound is tight if we only know the expectation of $X$.
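A quick worked instance (an added example, not on the slide): if $X$ is the number of heads in $n$ fair coin flips, then $E[X] = \frac{n}{2}$ and Markov gives
$$\Pr\left[X \ge \tfrac{3n}{4}\right] \le \frac{n/2}{3n/4} = \frac{2}{3}$$
for every $n$: correct, but it does not improve as $n$ grows, which is all one can hope for from the expectation alone.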

  7. A Generalization of Markov's Inequality
Theorem: for any $X$, any $h: \mathcal{X} \to \mathbb{R}^{+}$, and any $t > 0$,
$$\Pr[h(X) \ge t] \le \frac{E[h(X)]}{t}.$$

  8. Chebyshev's Inequality
Chebyshev's Inequality: for any $t > 0$,
$$\Pr\left[\,|X - E[X]| \ge t\,\right] \le \frac{\mathrm{Var}[X]}{t^2}.$$
Variance: $\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - (E[X])^2$
• $\mathrm{Var}[cX] = c^2\,\mathrm{Var}[X]$
• for pairwise independent $X_i$: $\mathrm{Var}\left[\sum_i X_i\right] = \sum_i \mathrm{Var}[X_i]$

  9. Chebyshev's Inequality (proof)
Proof: apply Markov's inequality to $(X - E[X])^2$:
$$\Pr\left[\,|X - E[X]| \ge t\,\right] = \Pr\left[(X - E[X])^2 \ge t^2\right] \le \frac{E\left[(X - E[X])^2\right]}{t^2} = \frac{\mathrm{Var}[X]}{t^2}.$$
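Continuing the coin-flip example from above (again an added illustration): $\mathrm{Var}[X] = \frac{n}{4}$, so Chebyshev gives
$$\Pr\left[X \ge \tfrac{3n}{4}\right] \le \Pr\left[\,|X - \tfrac{n}{2}| \ge \tfrac{n}{4}\right] \le \frac{n/4}{(n/4)^2} = \frac{4}{n},$$
which, unlike the Markov bound of $\frac{2}{3}$, vanishes as $n$ grows.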

  10. Variance of the minimum
$z = |\{x_1, x_2, \ldots, x_n\}|$; for $j = 1, 2, \ldots, k$, the hash values of $h_j$ on the $z$ distinct items are uniform independent $X_{j1}, X_{j2}, \ldots, X_{jz} \in [0,1]$, and $Y_j = \min_{1\le i\le z} X_{ji}$.
• by symmetry, $E[Y] = E[Y_j] = \frac{1}{z+1}$, where $Y = \frac{1}{k}\sum_{j=1}^{k} Y_j$
• $\Pr[Y_j \ge y] = (1-y)^z$, so $Y_j$ has pdf $z(1-y)^{z-1}$
• $E[Y_j^2] = \int_0^1 y^2\, z(1-y)^{z-1}\,dy = \frac{2}{(z+1)(z+2)}$
• $\mathrm{Var}[Y_j] = E[Y_j^2] - E[Y_j]^2 \le \frac{1}{(z+1)^2}$
• $\mathrm{Var}[Y] = \frac{1}{k^2}\sum_{j=1}^{k}\mathrm{Var}[Y_j] = \frac{1}{k}\,\mathrm{Var}[Y_j] \le \frac{1}{k(z+1)^2}$ (only 2-wise independence of the $Y_j$ is needed here)
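A quick empirical check of these formulas (an added sketch; the hash values are again drawn directly as i.i.d. uniforms, as the UHA permits):

```python
import random

# Check E[Y_j] = 1/(z+1) and Var[Y_j] <= 1/(z+1)^2 by simulation.
random.seed(1)
z, trials = 9, 100_000
ys = [min(random.random() for _ in range(z)) for _ in range(trials)]
mean = sum(ys) / trials
var = sum((y - mean) ** 2 for y in ys) / trials
print(mean, 1 / (z + 1))        # both close to 0.1
print(var, 1 / (z + 1) ** 2)    # about 0.0082 <= 0.01
```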

  11. Putting it together
F-M estimator: $\hat{Z} = \frac{1}{Y} - 1$, where $Y = \frac{1}{k}\sum_{j=1}^{k} Y_j$ and $E[Y] = \frac{1}{z+1}$.
For $\epsilon \le 1/2$, using Chebyshev and $\mathrm{Var}[Y] \le \frac{1}{k(z+1)^2}$:
$$\Pr\left[\hat{Z} > (1+\epsilon)z \text{ or } \hat{Z} < (1-\epsilon)z\right] \le \Pr\left[\,|Y - E[Y]| > \frac{\epsilon/2}{z+1}\right] \le \frac{4(z+1)^2}{\epsilon^2}\,\mathrm{Var}[Y] \le \frac{4}{\epsilon^2 k}.$$

  12. The Flajolet-Martin algorithm, summarized
Input: a sequence $x_1, x_2, \ldots, x_n \in \Omega$; Output: an estimation of $z = |\{x_1, x_2, \ldots, x_n\}|$
• uniform independent hash functions $h_1, h_2, \ldots, h_k: \Omega \to [0,1]$ (UHA: Uniform Hash Assumption)
• $Y_j = \min_{1\le i\le n} h_j(x_i)$; average-min $Y = \frac{1}{k}\sum_{j=1}^{k} Y_j$
• Flajolet-Martin estimator: $\hat{Z} = \frac{1}{Y} - 1$
$$\Pr\left[\hat{Z} > (1+\epsilon)z \text{ or } \hat{Z} < (1-\epsilon)z\right] \le \frac{4}{\epsilon^2 k} \le \delta \quad \text{by choosing } k = \frac{4}{\epsilon^2\delta}.$$
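To get a feel for the constants (an added example): a $\pm 10\%$ estimate that fails at most $5\%$ of the time, i.e. $\epsilon = 0.1$ and $\delta = 0.05$, needs
$$k = \frac{4}{(0.1)^2 \cdot 0.05} = 8000$$
hash functions, since then $\frac{4}{\epsilon^2 k} = \delta$; the linear dependence on $\frac{1}{\delta}$ is what Chebyshev alone buys.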

  13. Frequency Estimation
Data: a sequence $x_1, x_2, \ldots, x_n \in \Omega$
Query: an item $x \in \Omega$
Estimate the frequency $f_x = |\{i : x_i = x\}|$ of item $x$ within additive error $\epsilon n$.
• data stream: the input comes one item at a time
• the algorithm answers a query $x$ with an estimation $\hat{f}_x$ of the frequency $f_x$ such that
$$\Pr\left[\,|\hat{f}_x - f_x| \ge \epsilon n\,\right] \le \delta$$
• heavy hitters: items that appear $> \epsilon n$ times
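These slides only pose the problem. One standard structure achieving exactly this guarantee is the count-min sketch of Cormode and Muthukrishnan, sketched below as an illustration under the same salted-hashing assumption; it is not presented on the slides themselves. Its estimate is one-sided: it never underestimates $f_x$.

```python
import hashlib, math

class CountMin:
    """Count-min sketch: estimates f_x within additive error eps*n
    with probability >= 1 - delta (never underestimates)."""
    def __init__(self, eps: float, delta: float):
        self.w = math.ceil(math.e / eps)          # counters per row
        self.d = math.ceil(math.log(1 / delta))   # number of rows
        self.table = [[0] * self.w for _ in range(self.d)]

    def _idx(self, x: str, j: int) -> int:
        h = hashlib.sha256(f"{j}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w

    def add(self, x: str) -> None:
        for j in range(self.d):                   # one counter per row
            self.table[j][self._idx(x, j)] += 1

    def estimate(self, x: str) -> int:
        return min(self.table[j][self._idx(x, j)] for j in range(self.d))

cm = CountMin(eps=0.01, delta=0.01)
for x in ["a", "b", "a", "c", "a"]:
    cm.add(x)
print(cm.estimate("a"))   # >= 3, and equal to 3 with high probability
```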

  14. Data Structure for Set
Data: a set $S$ of $n$ items $x_1, x_2, \ldots, x_n \in \Omega$
Query: an item $x \in \Omega$; determine whether $x \in S$.
• space cost: size of the data structure (in bits); the entropy of a set is $O(n\log|\Omega|)$ bits
• time cost: time to answer a query
• balanced tree: $O(n\log|\Omega|)$ space, $O(\log n)$ time
• perfect hashing: $O(n\log|\Omega|)$ space, $O(1)$ time
• using less than the entropy? then we can only store a sketch of the set (an approximate representation)

  15. Approximate a Set
Data: a set $S$ of $n$ items $x_1, x_2, \ldots, x_n \in \Omega$
Query: an item $x \in \Omega$; determine whether $x \in S$.
• uniform hash function $h: \Omega \to [m]$
• data structure: an $m$-bit vector $v \in \{0,1\}^m$
• initially $v$ is all-$0$; set $v[h(x_i)] = 1$ for each $x_i \in S$
• query $x$: answer "yes" iff $v[h(x)] = 1$
• $x \in S$: always correct
• $x \notin S$: a false positive, with probability $\Pr[v[h(x)] = 1] = 1 - (1 - 1/m)^n \approx 1 - e^{-n/m}$

  16. Bloom Filters (Bloom 1970)
Data: a set $S$ of $n$ items $x_1, x_2, \ldots, x_n \in \Omega$
Query: an item $x \in \Omega$; determine whether $x \in S$.
• uniform independent hash functions $h_1, h_2, \ldots, h_k: \Omega \to [m]$
• data structure: an $m$-bit vector $v \in \{0,1\}^m$, initially all-$0$
• for each $x_i \in S$: set $v[h_j(x_i)] = 1$ for all $j = 1, \ldots, k$
• query $x$: answer "yes" iff $v[h_j(x)] = 1$ for all $j = 1, \ldots, k$
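A runnable sketch of the filter, again with salted SHA-256 standing in for the ideal uniform hashes (class and method names are illustrative, and one byte is spent per bit for clarity rather than bit-packing):

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k salted hash functions."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per bit, for clarity

    def _idx(self, x: str, j: int) -> int:
        h = hashlib.sha256(f"{j}:{x}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.m

    def add(self, x: str) -> None:
        for j in range(self.k):           # set all k positions to 1
            self.bits[self._idx(x, j)] = 1

    def query(self, x: str) -> bool:
        return all(self.bits[self._idx(x, j)] for j in range(self.k))

bf = BloomFilter(m=8 * 1000, k=6)         # c = 8 bits/item for n = 1000
for word in ["to", "be", "or", "not"]:
    bf.add(word)
print(bf.query("be"), bf.query("banana"))  # True, almost surely False
```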

  17. Bloom Filters: a false positive
(Figure: items $x$, $y$, $z$ are inserted by setting the bits $v[h_1(\cdot)]$, $v[h_2(\cdot)]$, $v[h_3(\cdot)]$ of the vector $v$; a queried item $w \notin S$ happens to land only on bits set by the other items, so it is reported as a member: a false positive.)

  18. Bloom Filters: false positive probability
Data: a set $S \subseteq \Omega$ of size $|S| = n$; query: $x \in \Omega$. (UHA: Uniform Hash Assumption)
For $x \notin S$, a false positive occurs with probability
$$\Pr\left[\forall\, 1\le j\le k:\; v[h_j(x)] = 1\right] = \left(\Pr[v[h_j(x)] = 1]\right)^k = \left(1 - \Pr[v[h_j(x)] = 0]\right)^k = \left(1 - (1 - 1/m)^{kn}\right)^k \approx \left(1 - e^{-kn/m}\right)^k.$$
Choosing $k = \frac{m}{n}\ln 2$ and writing $m = cn$, this is $\approx (0.6185)^c$.
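Why $k = \frac{m}{n}\ln 2$? A short derivation filling in the optimization the slide states without proof: write $t = e^{-kn/m}$, so $k = -\frac{m}{n}\ln t$ and the false positive rate is
$$f = (1-t)^k, \qquad \ln f = k\ln(1-t) = -\frac{m}{n}\,\ln t\,\ln(1-t).$$
The product $\ln t \cdot \ln(1-t)$ is positive on $(0,1)$, symmetric under $t \leftrightarrow 1-t$, and maximized at $t = \frac{1}{2}$, so $f$ is minimized when $e^{-kn/m} = \frac{1}{2}$, i.e. $k = \frac{m}{n}\ln 2$: each bit of $v$ is then equally likely to be $0$ or $1$. At this optimum, $f = 2^{-k} = 2^{-(m/n)\ln 2} = (0.6185\ldots)^{m/n}$.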

  19. Bloom Filters, summarized
Data: a set $S \subseteq \Omega$ of size $|S| = n$; query: $x \in \Omega$.
• uniform independent hash functions $h_1, h_2, \ldots, h_k: \Omega \to [m]$; an $m$-bit vector $v$, initially all-$0$
• insert $x_i \in S$: set $v[h_j(x_i)] = 1$ for all $j = 1, \ldots, k$; query $x$: "yes" iff $v[h_j(x)] = 1$ for all $j$
• choose $k = \frac{m}{n}\ln 2 = c\ln 2$, where $m = cn$
• space cost: $cn$ bits; time cost: $c\ln 2$ hash evaluations per query
• false positive probability: $< (0.6185)^c$
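To put numbers on this (an added example): with $c = 8$ bits per item the false positive rate is about $0.6185^8 \approx 2.1\%$ at $k = \lceil 8\ln 2\rceil = 6$ hash evaluations per operation; doubling the space to $c = 16$ drops the rate to roughly $0.05\%$.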
