Algorithms for Big Data (V)
Chihao Zhang, Shanghai Jiao Tong University, Oct. 18, 2019

Review of the Last Lecture


Last time, we learnt the Misra-Gries and Count Sketch algorithms for frequency estimation. The latter has the advantage of being a linear sketch, and it also generalizes to the turnstile model.

Count Sketch

Algorithm Count Sketch

Init: an array C[j] for j ∈ [k], where k = 3/ε². Choose a random hash function h: [n] → [k] and a random hash function g: [n] → {−1, 1}, each from a 2-universal family.

On input (y, Δ): C[h(y)] ← C[h(y)] + Δ · g(y).

Output: on query a, output f̂_a = g(a) · C[h(a)].
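The update and query rules above can be sketched in a few lines of Python. This is a minimal single-row illustration, not a full implementation: the 2-universal hash functions h and g are simulated by lazily drawing independent random values per item (an assumption; a real implementation would use e.g. a(x) + b mod p).

```python
import random
from collections import defaultdict

class CountSketch:
    """A minimal single-row Count Sketch (illustrative sketch)."""

    def __init__(self, k, seed=0):
        rng = random.Random(seed)
        self.C = [0] * k                                     # counter array C[j], j in [k]
        self._h = defaultdict(lambda: rng.randrange(k))      # stand-in for h: [n] -> [k]
        self._g = defaultdict(lambda: rng.choice((-1, 1)))   # stand-in for g: [n] -> {-1, 1}

    def update(self, y, delta):
        # On input (y, Δ): C[h(y)] <- C[h(y)] + Δ · g(y)
        self.C[self._h[y]] += delta * self._g[y]

    def query(self, a):
        # On query a: output g(a) · C[h(a)]
        return self._g[a] * self.C[self._h[a]]
```

Since g(a)² = 1, a stream containing a single distinct item is recovered exactly; items colliding with a contribute zero-mean noise. Negative Δ is handled as well, which is what makes the sketch work in the turnstile model.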


The Performance

We can apply the median trick to obtain:

▶ Pr[|f̂_a − f_a| ⩾ ε∥f∥₂] ⩽ δ;
▶ it costs O((1/ε²) log(1/δ) · (log m + log n)) bits of memory.

Today we will see another simple sketch algorithm.
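The median trick mentioned above can be illustrated by a small simulation. Here `noisy_estimate` is a hypothetical estimator that is within ±1 of the truth with probability 3/4 (an assumption chosen for illustration); the median of t independent copies fails only when at least half of them fail, which by a Chernoff bound happens with probability exponentially small in t.

```python
import random
import statistics

def noisy_estimate(truth, rng):
    """Hypothetical estimator: within ±1 of the truth w.p. 3/4, far off otherwise."""
    if rng.random() < 0.75:
        return truth + rng.uniform(-1, 1)
    return truth + rng.uniform(10, 100)

def median_trick(truth, t, rng):
    # Median of t independent copies: bad only if at least half the
    # copies are bad, which happens with probability 2^(-Ω(t)).
    return statistics.median(noisy_estimate(truth, rng) for _ in range(t))
```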

Count-Min


We assume that for each entry (y, Δ), it holds that Δ ⩾ 0.

Algorithm Count-Min

Init: an array C[i][j] for i ∈ [t] and j ∈ [k], where t = log(1/δ) and k = 2/ε. Choose t independent random hash functions h₁, …, h_t: [n] → [k] from a 2-universal family.

On input (y, Δ): for each i ∈ [t], C[i][h_i(y)] ← C[i][h_i(y)] + Δ.

Output: on query a, output f̂_a = min_{1⩽i⩽t} C[i][h_i(a)].
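A short Python sketch of the algorithm above, with t = ⌈log₂(1/δ)⌉ rows of width k = ⌈2/ε⌉. The t hash functions are simulated with Python's built-in `hash` salted per row (an assumption standing in for a 2-universal family).

```python
import math
import random

class CountMin:
    """Illustrative Count-Min sketch in the cash-register model (Δ ⩾ 0)."""

    def __init__(self, eps, delta, seed=0):
        self.t = max(1, math.ceil(math.log2(1 / delta)))  # number of rows
        self.k = math.ceil(2 / eps)                        # row width
        self.C = [[0] * self.k for _ in range(self.t)]
        rng = random.Random(seed)
        self._salts = [rng.getrandbits(64) for _ in range(self.t)]

    def _h(self, i, y):
        # Stand-in for h_i: [n] -> [k].
        return hash((self._salts[i], y)) % self.k

    def update(self, y, delta):
        # On input (y, Δ): add Δ to one counter per row.
        for i in range(self.t):
            self.C[i][self._h(i, y)] += delta

    def query(self, a):
        # f_a-hat = min over rows; each row can only overestimate f_a.
        return min(self.C[i][self._h(i, a)] for i in range(self.t))
```

Because every update adds a nonnegative amount to each row, every row's counter is at least f_a, so the minimum never underestimates.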

Analysis


Obviously we have f_a ⩽ f̂_a. Our algorithm overestimates only if for some b ≠ a, h_i(b) = h_i(a). Let Y_{i,b} be the indicator of this event, and let X_i be C[i][h_i(a)]. Then

  E[X_i] = f_a + Σ_{b ∈ [n]: b ≠ a} f_b · E[Y_{i,b}] = f_a + (Σ_{b ∈ [n]: b ≠ a} f_b) / k ⩽ f_a + ∥f∥₁ / k.

Thus, by Markov's inequality applied to X_i − f_a ⩾ 0, and since k = 2/ε,

  Pr[X_i − f_a ⩾ ε∥f∥₁] ⩽ (∥f∥₁ / k) / (ε∥f∥₁) = 1/(εk) = 1/2.
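The bound E[X_i] ⩽ f_a + ∥f∥₁/k can be checked empirically on a toy frequency vector (the vector and k below are illustrative assumptions), hashing items uniformly at random into k buckets:

```python
import random

# Toy frequency vector (illustrative assumption).
f = {"a": 5, "b": 3, "c": 2, "d": 1}
k = 4  # number of buckets in one row

def one_row_estimate(f, k, rng, a="a"):
    # Draw a fresh uniform hash and return C[h(a)]: f_a plus the
    # frequencies of all items that collide with a.
    h = {x: rng.randrange(k) for x in f}
    return sum(fx for x, fx in f.items() if h[x] == h[a])

rng = random.Random(0)
trials = 20000
avg = sum(one_row_estimate(f, k, rng) for _ in range(trials)) / trials
# Expect avg ≈ f_a + Σ_{b≠a} f_b / k = 5 + 6/4 = 6.5, below f_a + ∥f∥₁/k = 7.75.
```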


Since our output is the minimum of t independent X_i's,

  Pr[f̂_a − f_a ⩾ ε∥f∥₁] = Pr[min{X₁, …, X_t} − f_a ⩾ ε∥f∥₁]
    = Pr[⋀_{i=1}^{t} (X_i − f_a ⩾ ε∥f∥₁)]
    = ∏_{i=1}^{t} Pr[X_i − f_a ⩾ ε∥f∥₁]
    ⩽ 2^{−t} = δ.

The algorithm computes a linear sketch using O((1/ε) log(1/δ) · (log m + log n)) bits of memory. It can be generalized to the turnstile model (Exercise).

Frequency Moments


The k-th frequency moment of a stream is

  F_k ≜ Σ_{j ∈ [n]} f_j^k = ∥f∥_k^k.

For example, F₂ is the size of the self-join of a relation r. Many problems we met before can be viewed as estimating F_k for some special k.
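For reference, F_k is trivial to compute exactly offline; the point of the streaming algorithms is to approximate it in small space. A minimal helper:

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact (offline) F_k = sum over j of f_j^k."""
    return sum(f ** k for f in Counter(stream).values())
```

On the stream a, a, b, b, b, c: F₁ = 6 is the stream length, F₂ = 2² + 3² + 1² = 14, and F₀ = 3 counts the distinct elements, illustrating how earlier problems are special cases of F_k.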

AMS Estimator for F_k


Given ⟨a₁, …, a_m⟩, the algorithm first samples a uniform index J ∈ [m]. It then counts the number r of entries a_j with a_j = a_J and j ⩾ J.

Algorithm AMS Estimator for F_k

Init: (m, r, a) ← (0, 0, 0).

On input y: m ← m + 1; draw β ∼ Ber(1/m); if β = 1 then a ← y, r ← 0; if y = a then r ← r + 1.

Output: m · (r^k − (r−1)^k).
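One run of the estimator above can be sketched as follows. Sampling the uniform index J happens online via reservoir sampling: the m-th item replaces the tracked one with probability 1/m, which leaves each index equally likely to be the tracked one at the end.

```python
import random

def ams_estimate(stream, k, rng):
    """One run of the AMS estimator for F_k (illustrative sketch)."""
    m, r, a = 0, 0, None
    for y in stream:
        m += 1
        if rng.random() < 1.0 / m:   # β ~ Ber(1/m): reservoir-sample index J
            a, r = y, 0
        if y == a:                   # count matches from position J onward
            r += 1
    return m * (r ** k - (r - 1) ** k)
```

A single run is an unbiased estimator of F_k but has high variance; averaging independent runs and then applying the median trick concentrates the estimate.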

Analysis

We first compute the expectation of the output X.
