
Lecture 4. Barna Saha, AT&T Labs Research. September 19, 2013.



  1. Lecture 4. Barna Saha, AT&T Labs Research. September 19, 2013.

  2. Outline
     - Heavy Hitter (continued)
     - Frequency Moment Estimation
     - Dimensionality Reduction

  3. Heavy Hitter
     - Heavy Hitter Problem: for 0 < ε < φ < 1, find a set S of elements that includes every i with f_i > φm, and that contains no element with frequency ≤ (φ − ε)m.
     - The Count-Min sketch guarantees f_i ≤ f̂_i ≤ f_i + εm with probability ≥ 1 − δ, using space (e/ε) log(1/((φ − ε)δ)).
     - Insert-only streams: maintain a min-heap of size k = 1/(φ − ε). When an item arrives, estimate its frequency; if the estimate exceeds φm, include the item in the heap. If the heap grows beyond size k, discard its minimum-frequency element.
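The insert-only procedure on this slide can be sketched in Python. This is a minimal illustration, not the lecture's exact data structure: the independent hash functions are simulated with Python's built-in `hash` plus random salts, the sketch width and depth are arbitrary demo choices, and the heap is rebuilt on each eviction for simplicity.

```python
import heapq
import random


class CountMin:
    """Count-Min sketch: true f_i <= estimate(i) <= f_i + eps*m w.h.p."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width, self.depth = width, depth
        # Random salts stand in for independent hash functions (a simplification).
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cell(self, j, item):
        return hash((self.salts[j], item)) % self.width

    def update(self, item, count=1):
        for j in range(self.depth):
            self.table[j][self._cell(j, item)] += count

    def estimate(self, item):
        return min(self.table[j][self._cell(j, item)] for j in range(self.depth))


def heavy_hitters_insert_only(stream, phi, eps):
    """Keep items with estimated frequency above phi*m, capped at k = 1/(phi - eps)."""
    k = int(1 / (phi - eps))
    cm = CountMin(width=int(2 / eps), depth=8)
    candidates = {}          # item -> latest frequency estimate
    m = 0
    for item in stream:
        m += 1
        cm.update(item)
        est = cm.estimate(item)
        if item in candidates or est > phi * m:
            candidates[item] = est
            if len(candidates) > k:
                # Rebuild the min-heap and evict the lowest-estimate item.
                heap = [(e, i) for i, e in candidates.items()]
                heapq.heapify(heap)
                _, evicted = heapq.heappop(heap)
                del candidates[evicted]
    return set(candidates)
```

On a stream of 100 items where item 1 appears 60 times, only item 1 exceeds φm for φ = 0.4.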

  4. Heavy Hitter
     - Turnstile model:
       - Maintain dyadic intervals over a binary search tree, with one Count-Min sketch per level (log n levels), each using space (e/ε) log((2 log n)/(δ(φ − ε))).
       - At every level there are at most 1/φ heavy hitters.
       - Estimate the frequencies of the children of heavy-hitter nodes, descending until the leaf level is reached.
       - Return all leaves with estimated frequency above φm.
     - Analysis:
       - At most 2/(φ − ε) nodes are examined at every level.
       - Each returned element has true frequency > (φ − ε)m with probability at least 1 − δ(φ − ε)/(2 log n).
       - By a union bound, all returned elements have true frequency above (φ − ε)m with probability at least 1 − δ.
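The dyadic-interval descent can be illustrated as follows. To keep the sketch short, exact per-level counters stand in for the per-level Count-Min sketches; this is an assumption for illustration only, since the space savings come precisely from using sketches instead of exact counts.

```python
from collections import Counter


def dyadic_heavy_hitters(stream, phi, universe_bits):
    """Turnstile heavy hitters by descending a tree of dyadic intervals.
    Items are ints in [0, 2**universe_bits); the stream yields (item, delta)
    pairs. Exact Counters replace the per-level Count-Min sketches."""
    # levels[l][p] = total weight of items whose top l bits equal p.
    levels = [Counter() for _ in range(universe_bits + 1)]
    m = 0
    for item, delta in stream:
        m += delta
        for l in range(universe_bits + 1):
            levels[l][item >> (universe_bits - l)] += delta
    # Descend from the root, expanding only the children of heavy prefixes.
    frontier = [0]
    for l in range(1, universe_bits + 1):
        frontier = [c for p in frontier for c in (2 * p, 2 * p + 1)
                    if levels[l][c] > phi * m]
    return frontier
```

Note the turnstile behavior: insertions of item 7 below are mostly cancelled by deletions, so it is not reported.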

  5. l_2 frequency estimation
     - Count-Sketch guarantee: |f̂_i − f_i| ≤ ε √(f_1² + f_2² + ... + f_n²).
     - F_2 = f_1² + f_2² + ... + f_n².
     - How do we estimate F_2 in small space?

  6. AMS F_2 Estimation
     - H = { h : [n] → {+1, −1} }, a family of four-wise independent hash functions.
     - On arrival of (i, a), update Z_j ← Z_j + a·h_j(i) for j = 1, ..., t, where t = c/ε².
     - Return Y = (1/t) Σ_{j=1}^t Z_j².
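The update and estimate translate directly to code. The slide only requires a four-wise independent family; the random-degree-3-polynomial construction below is one standard way to get it, and taking the low-order bit of the polynomial value as the sign is a common simplification.

```python
import random

MERSENNE_P = (1 << 61) - 1  # prime modulus for the hash family


class FourWiseSign:
    """h : [n] -> {+1, -1} from a random cubic polynomial over a prime field."""

    def __init__(self, rng):
        self.coeffs = [rng.randrange(MERSENNE_P) for _ in range(4)]

    def __call__(self, i):
        v = 0
        for c in self.coeffs:        # Horner evaluation of the polynomial at i
            v = (v * i + c) % MERSENNE_P
        return 1 if v & 1 else -1


def ams_f2(stream, t, seed=0):
    """Y = (1/t) * sum_j Z_j^2, where Z_j = sum_i f_i * h_j(i)."""
    rng = random.Random(seed)
    hashes = [FourWiseSign(rng) for _ in range(t)]
    z = [0] * t
    for item, delta in stream:       # turnstile update (i, a)
        for j in range(t):
            z[j] += delta * hashes[j](item)
    return sum(zj * zj for zj in z) / t
```

For frequencies f = (3, 4), F_2 = 25, and with t reasonably large the estimate Y concentrates around 25.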

  7. Analysis
     - Z_j = Σ_{i=1}^n f_i h_j(i).
     - E[Z_j] = 0 and E[Z_j²] = F_2.
     - Var(Z_j²) = E[Z_j⁴] − (E[Z_j²])² ≤ 4 F_2².
     - E[Y] = F_2, and Var(Y) = (1/t²) Σ_{j=1}^t Var(Z_j²) ≤ 4 F_2²/t = (4ε²/c) F_2².
     - By Chebyshev's inequality, Pr[ |Y − E[Y]| > ε F_2 ] ≤ 4/c.

  8. Boosting by Median
     - Keep Y_1, Y_2, ..., Y_s, with s = O(log(1/δ)).
     - Return A = median(Y_1, Y_2, ..., Y_s).
     - By a Chernoff bound, Pr[ |A − F_2| > ε F_2 ] < δ.
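The median trick is generic: any estimator that lands within the error bound with probability comfortably above 1/2 can be boosted this way. A toy illustration (the `noisy` estimator below is invented for the demo and is not part of the lecture):

```python
import random
from statistics import median


def median_boost(estimator, s):
    """Run s = O(log(1/delta)) independent copies and return the median.
    By a Chernoff bound, the median fails only if about half of the copies
    fail, which happens with probability below delta."""
    return median(estimator(seed) for seed in range(s))


def noisy(seed):
    """Toy estimator of the value 25: wildly wrong about 10% of the time."""
    rng = random.Random(seed)
    return 25 if rng.random() > 0.1 else rng.uniform(50, 100)
```

Since a clear majority of runs return exactly 25, the median is 25 even though individual runs occasionally fail badly.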

  9. Linear Sketch
     - The algorithm maintains a linear sketch [Z_1, Z_2, ..., Z_t] = Rx, where R is a t × n random matrix with entries in {+1, −1}.
     - Use Y = ||Rx||_2² to estimate t·||x||_2², with t = O(1/ε²).
     - A streaming algorithm operating in the sketch model can be viewed as a dimensionality-reduction technique.
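This matrix view can be simulated directly by materializing R row by row; the entries of R play the role of the hash values h_j(i) from the AMS estimator. The vector and parameters below are arbitrary demo choices.

```python
import random


def sign_sketch(x, t, seed=0):
    """Compute z = Rx for a random t x n matrix R with {+1, -1} entries."""
    rng = random.Random(seed)
    z = []
    for _ in range(t):
        row = [rng.choice((1, -1)) for _ in x]
        z.append(sum(r * xi for r, xi in zip(row, x)))
    return z


x = [3.0, 4.0, 0.0, 0.0]                 # ||x||_2^2 = 25
z = sign_sketch(x, t=500)
estimate = sum(v * v for v in z) / 500   # ||Rx||_2^2 / t, concentrates near 25
```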

  10. Dimensionality Reduction
     - A streaming algorithm operating in the sketch model can be viewed as a dimensionality-reduction technique.
     - The stream S is a point in n-dimensional space; we want to compute l_2(S).
     - The sketch operator can be viewed as an approximate embedding of l_2^n into a sketch space C such that:
       1. each point of C can be described using only a small number (say m) of numbers, so C ⊂ R^m, and
       2. the value of l_2(S) is approximately equal to F(C(S)).
     - Here F(Y_1, Y_2, ..., Y_t) = median(Y_1, Y_2, ..., Y_t).

  11. Dimensionality Reduction
     - F(Y_1, Y_2, ..., Y_t) = median(Y_1, Y_2, ..., Y_t).
     - Disadvantage: F is not a norm, so performing any nontrivial operation in the sketch space (e.g., clustering, similarity search, regression) becomes difficult.
     - Can we embed l_2^n into l_2^m, m << n, while approximately preserving distances? Yes: the Johnson-Lindenstrauss Lemma.

  12. Interlude: the Normal Distribution
     The normal distribution N(0, 1):
     - range (−∞, ∞)
     - density f(x) = e^{−x²/2} / √(2π)
     - mean 0, variance 1
     Basic facts:
     - If X and Y are independent random variables with normal distributions, then so is X + Y.
     - If X and Y are independent with mean 0, then E[(X + Y)²] = E[X²] + E[Y²].
     - E[cX] = c·E[X] and Var(cX) = c²·Var(X).

  13. A Different Linear Sketch
     Instead of ±1 entries, let the r_i be i.i.d. random variables drawn from N(0, 1).
     - Consider Z = Σ_i r_i x_i.
     - E[Z²] = E[(Σ_i r_i x_i)²] = Σ_i x_i² E[r_i²] = Σ_i x_i² Var(r_i) = Σ_i x_i² = ||x||_2².
     - As before, maintain Z = [Z_1, Z_2, ..., Z_t] and define Y = ||Z||_2²; then E[Y] = t·||x||_2².
     - We show there exists a constant C > 0 such that, for small enough ε > 0 (the JL lemma),
       Pr[ |Y − t·||x||_2²| > ε·t·||x||_2² ] ≤ e^{−Cε²t}.
     - Set t = O((1/ε²) log(1/δ)).
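The Gaussian variant, as a small simulation; the vector and the value of t below are arbitrary demo choices.

```python
import random


def gaussian_sketch(x, t, seed=0):
    """Z_j = sum_i r_i x_i, with fresh r_i ~ N(0, 1) for each coordinate j."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) * xi for xi in x) for _ in range(t)]


x = [1.0, 2.0, 2.0]                 # ||x||_2^2 = 9
t = 2000
y = sum(z * z for z in gaussian_sketch(x, t))
estimate = y / t                    # E[Y] = t * ||x||_2^2, so y/t is near 9
```

Because Y/||x||_2² is a chi-square variable with t degrees of freedom, the relative error shrinks like 1/√t, matching the concentration bound on the slide.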

  14. Johnson-Lindenstrauss Lemma
     Lemma. For any 0 < ε < 1 and any integer m, let t be a positive integer such that
       t > 4 ln m / (ε²/2 − ε³/3).
     Then for any set V of m points in R^n, there is a map f : R^n → R^t such that for all u, v ∈ V,
       (1 − ε) ||u − v||_2² ≤ ||f(u) − f(v)||_2² ≤ (1 + ε) ||u − v||_2².
     Furthermore, this map can be found in randomized polynomial time.
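One standard construction realizing the lemma is a random Gaussian matrix scaled by 1/√t, which is a sketch of the idea rather than the lemma's full proof; the points, dimensions, and t below are invented for the demo.

```python
import math
import random


def jl_embed(points, t, seed=0):
    """Map points in R^n to R^t via a random Gaussian matrix scaled by
    1/sqrt(t); pairwise squared distances are preserved up to 1 +/- eps."""
    rng = random.Random(seed)
    n = len(points[0])
    r = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(t)]
    scale = 1.0 / math.sqrt(t)
    return [[scale * sum(r[j][i] * p[i] for i in range(n)) for j in range(t)]
            for p in points]


def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Four random points in R^50, embedded into R^1000.
prng = random.Random(1)
pts = [[prng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(4)]
emb = jl_embed(pts, t=1000)
```

All pairwise squared distances between the embedded points should match the originals to within a small relative error.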

