CS 498ABD: Algorithms for Big Data, Spring 2019: Dimensionality Reduction and JL Lemma (Lecture 12, February 21, 2019)

  1. CS 498ABD: Algorithms for Big Data, Spring 2019. Dimensionality Reduction and JL Lemma. Lecture 12, February 21, 2019. Chandra (UIUC).

  2. F2 estimation in turnstile setting.
     AMS-ℓ2-Estimate: Let Y_1, Y_2, ..., Y_n be {−1, +1} random variables that are 4-wise independent.
       z ← 0
       While (stream is not empty) do
         a_j = (i_j, Δ_j) is the current update
         z ← z + Δ_j · Y_{i_j}
       endWhile
       Output z².
     Claim: The output estimates ‖x‖₂², where x is the vector at the end of the stream of updates.
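
     A minimal Python sketch of one run of AMS-ℓ2-Estimate, under simplifying assumptions rather than the lecture's exact implementation: the signs Y_i are drawn fully independently here instead of from a 4-wise independent family, and the stream is a hypothetical list of (i_j, Δ_j) updates.

       import numpy as np

       def ams_l2_estimate(stream, n, rng):
           # Y[i] in {-1, +1}; fully independent here for simplicity,
           # 4-wise independence suffices for the actual algorithm.
           Y = rng.choice([-1.0, 1.0], size=n)
           z = 0.0
           for i, delta in stream:      # update a_j = (i_j, Delta_j)
               z += delta * Y[i]
           return z * z                 # single estimate of ||x||_2^2

       rng = np.random.default_rng(0)
       stream = [(0, 3.0), (2, -1.0), (0, 2.0), (1, 4.0)]   # x = (5, 4, -1)
       # one high-variance, unbiased estimate of ||x||_2^2 = 42
       print(ams_l2_estimate(stream, n=3, rng=rng))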

  3. Analysis. Let Z = Σ_{i=1}^n x_i Y_i, so the output is Z².
     Z² = Σ_i x_i² Y_i² + 2 Σ_{i<j} x_i x_j Y_i Y_j, and hence E[Z²] = Σ_i x_i² = ‖x‖₂².
     One can show that Var(Z²) ≤ 2 (E[Z²])².
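
     The variance bound is stated without proof on the slide; the following short calculation, which uses only the 4-wise independence of the Y_i, is one way to fill it in. Expanding, E[Z⁴] = Σ_{i,j,k,l} x_i x_j x_k x_l E[Y_i Y_j Y_k Y_l]. By 4-wise independence, E[Y_i Y_j Y_k Y_l] is nonzero only when the indices pair up: either all four are equal (contributing Σ_i x_i⁴) or they form two distinct pairs (contributing 3 Σ_{i≠j} x_i² x_j²). Hence
       Var(Z²) = E[Z⁴] − (E[Z²])² = Σ_i x_i⁴ + 3 Σ_{i≠j} x_i² x_j² − (Σ_i x_i²)² = 2 Σ_{i≠j} x_i² x_j² ≤ 2 ‖x‖₂⁴ = 2 (E[Z²])².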

  4. Linear Sketching View. Recall that we take the average of independent estimators and then the median to reduce error. Can we view all of this as a sketch?
     AMS-ℓ2-Sketch: k = c log(1/δ)/ε².
     Let M be a k × n matrix with entries in {−1, +1} such that (i) the rows are independent and (ii) within each row the entries are 4-wise independent.
       z is a k × 1 vector initialized to 0
       While (stream is not empty) do
         a_j = (i_j, Δ_j) is the current update
         z ← z + Δ_j · M e_{i_j}
       endWhile
       Output the vector z as the sketch.
     M is compactly represented via k hash functions, one per row, independently chosen from a 4-wise independent hash family.
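
     A small Python sketch of the AMS-ℓ2-Sketch viewed as a linear map, assuming a fully random ±1 matrix M stored explicitly; the slide instead represents M implicitly via k hash functions from a 4-wise independent family. The stream below is the same hypothetical example as before.

       import numpy as np

       def ams_l2_sketch(stream, n, k, rng):
           # M: k x n matrix with +/-1 entries (fully random here; the slide
           # only needs independent rows with 4-wise independent entries).
           M = rng.choice([-1.0, 1.0], size=(k, n))
           z = np.zeros(k)
           for i, delta in stream:      # update a_j = (i_j, Delta_j)
               z += delta * M[:, i]     # z <- z + Delta_j * M e_{i_j}
           return z, M

       rng = np.random.default_rng(1)
       stream = [(0, 3.0), (2, -1.0), (0, 2.0), (1, 4.0)]   # x = (5, 4, -1)
       z, M = ams_l2_sketch(stream, n=3, k=400, rng=rng)
       # z is the k-dimensional linear sketch of x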

  5. Geometric Interpretation. Given a vector x ∈ R^n, let M be the random map above. Then z = Mx has the following features:
     E[z_i] = 0 and E[z_i²] = ‖x‖₂² for each 1 ≤ i ≤ k, where k is the number of rows of M.
     Thus each z_i² is an unbiased estimate of the squared Euclidean length of x.
     When k = Θ((1/ε²) log(1/δ)), one can obtain a (1 ± ε) estimate of ‖x‖₂ via the averaging and median ideas.
     Thus we are able to compress x into a k-dimensional vector z such that z contains enough information to estimate ‖x‖₂ accurately.

  6. Geometric Interpretation (continued). Question: Do we need the median trick, or will averaging alone do?
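
     To make the averaging-and-median recovery concrete, here is a rough Python sketch that turns the sketch vector z from the example above into an estimate of ‖x‖₂². The group count below is an arbitrary illustrative choice, standing in for the Θ(log(1/δ)) groups of Θ(1/ε²) estimators used in the earlier lectures.

       import numpy as np

       def estimate_from_sketch(z, groups=20):
           # Each z_i^2 is an unbiased estimate of ||x||_2^2.
           # Average within groups to cut variance, then take the median
           # of the group averages to boost the success probability.
           blocks = np.array_split(z ** 2, groups)
           return float(np.median([b.mean() for b in blocks]))

       # continuing the previous example, where x = (5, 4, -1):
       print(estimate_from_sketch(z))   # close to ||x||_2^2 = 42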

  7. Distributional JL Lemma.
     Lemma (Distributional JL Lemma). Fix a vector x ∈ R^d and let Π ∈ R^{k×d} be a matrix whose entries Π_ij are chosen independently from the standard normal distribution N(0, 1). If k = Ω((1/ε²) log(1/δ)), then with probability (1 − δ),
       ‖(1/√k) Π x‖₂ = (1 ± ε) ‖x‖₂.
     One can choose the entries from {−1, +1} as well. Note: unlike the ℓ2-estimation sketch, the entries of Π are fully independent.
     Letting z = (1/√k) Π x, we have projected x from d dimensions down to k = O((1/ε²) log(1/δ)) dimensions while preserving its length to within a (1 ± ε) factor.
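
     A quick numerical illustration of the lemma using only numpy; the dimension d, the accuracy ε, and the failure probability δ below are arbitrary choices, and the constant in k is taken to be 1 for simplicity.

       import numpy as np

       rng = np.random.default_rng(2)
       d, eps, delta = 10_000, 0.1, 0.01
       k = int(np.ceil(np.log(1 / delta) / eps**2))   # k = Theta(eps^-2 log(1/delta))

       x = rng.standard_normal(d)                     # an arbitrary fixed vector
       Pi = rng.standard_normal((k, d))               # entries ~ N(0, 1), independent
       z = Pi @ x / np.sqrt(k)

       print(np.linalg.norm(z) / np.linalg.norm(x))   # typically within 1 +/- eps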

  8. Dimensionality reduction.
     Theorem (Metric JL Lemma). Let v_1, v_2, ..., v_n be any n points/vectors in R^d. For any ε ∈ (0, 1/2), there is a linear map f : R^d → R^k with k ≤ 8 ln n / ε² such that for all 1 ≤ i < j ≤ n,
       (1 − ε) ‖v_i − v_j‖₂ ≤ ‖f(v_i) − f(v_j)‖₂ ≤ ‖v_i − v_j‖₂.
     Moreover, f can be obtained in randomized polynomial time.
     The linear map f is simply given by a random matrix Π: f(v) = Π v.
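
     A small experiment in the spirit of the theorem, again assuming only numpy: project n random points with a scaled Gaussian Π and look at how much the pairwise distances move. The 1/√k scaling and the symmetric (1 ± ε) check are illustrative and do not reproduce the exact one-sided guarantee stated above.

       import numpy as np

       rng = np.random.default_rng(3)
       n, d, eps = 200, 1_000, 0.25
       k = int(np.ceil(8 * np.log(n) / eps**2))   # k <= 8 ln n / eps^2

       V = rng.standard_normal((n, d))            # n points in R^d
       Pi = rng.standard_normal((k, d)) / np.sqrt(k)
       W = V @ Pi.T                               # f(v) = Pi v, applied to all points

       ratios = [np.linalg.norm(W[i] - W[j]) / np.linalg.norm(V[i] - V[j])
                 for i in range(n) for j in range(i + 1, n)]
       print(min(ratios), max(ratios))            # typically within [1 - eps, 1 + eps]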

  9. Normal Distribution. Density function: f(x) = (1/√(2πσ²)) e^{−(x − µ)²/(2σ²)}. The standard normal N(0, 1) is the case µ = 0, σ = 1.

  10. Normal Distribution. Cumulative distribution function of the standard normal: Φ(t) = (1/√(2π)) ∫_{−∞}^{t} e^{−s²/2} ds (no closed form).

  11. Sum of independent normally distributed variables.
     Lemma. Let X and Y be independent random variables with X ∼ N(µ_X, σ_X²) and Y ∼ N(µ_Y, σ_Y²), and let Z = X + Y. Then Z ∼ N(µ_X + µ_Y, σ_X² + σ_Y²).

  12. Sum of independent normally distributed variables (continued).
     Corollary. Let X and Y be independent random variables with X ∼ N(0, 1) and Y ∼ N(0, 1), and let Z = aX + bY. Then Z ∼ N(0, a² + b²).
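
     The proof of the DJL Lemma below uses the many-variable form of this corollary, which follows by induction and is recorded here for convenience (it is not stated explicitly on the slide): if X_1, ..., X_m are independent N(0, 1) random variables and a_1, ..., a_m are reals, then Σ_j a_j X_j ∼ N(0, Σ_j a_j²). In particular, if ‖a‖₂ = 1 then Σ_j a_j X_j ∼ N(0, 1).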

  13. Concentration of the sum of squares of normally distributed variables.
     Lemma. Let Z_1, Z_2, ..., Z_k be independent N(0, 1) random variables and let Y = Σ_{i=1}^k Z_i². Then, for ε ∈ (0, 1/2), there is a constant c such that
       Pr[(1 − ε)² k ≤ Y ≤ (1 + ε)² k] ≥ 1 − 2 e^{−c ε² k}.
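
     A quick Monte Carlo check of the lemma, assuming only numpy; the values of k, ε, and the number of trials are arbitrary.

       import numpy as np

       rng = np.random.default_rng(4)
       k, eps, trials = 500, 0.2, 10_000

       Z = rng.standard_normal((trials, k))
       Y = (Z ** 2).sum(axis=1)           # chi-squared with k degrees of freedom

       inside = ((1 - eps) ** 2 * k <= Y) & (Y <= (1 + eps) ** 2 * k)
       print(inside.mean())               # empirical probability, close to 1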

  14. χ² distribution: density function (figure not reproduced).

  15. χ² distribution: cumulative distribution function (figure not reproduced).

  16. Proof of DJL Lemma. Without loss of generality assume ‖x‖₂ = 1 (a unit vector). Let Z_i = Σ_{j=1}^d Π_ij x_j. Then Z_i ∼ N(0, 1), since Z_i is a linear combination of independent N(0, 1) entries with coefficient vector x of unit length.

  17. Proof of DJL Lemma (continued). Let Y = Σ_{i=1}^k Z_i². Then Y has the χ² distribution with k degrees of freedom, since Z_1, ..., Z_k are iid N(0, 1).

  18. Proof of DJL Lemma (continued). Hence Pr[(1 − ε)² k ≤ Y ≤ (1 + ε)² k] ≥ 1 − 2 e^{−c ε² k}.

  19. Proof of DJL Lemma (continued). Since k = Ω((1/ε²) log(1/δ)), we have Pr[(1 − ε)² k ≤ Y ≤ (1 + ε)² k] ≥ 1 − δ.

  20. Proof of DJL Lemma (continued). Therefore ‖z‖₂ = √(Y/k) has the property that with probability (1 − δ), ‖z‖₂ = (1 ± ε) ‖x‖₂.
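
     Two small steps are implicit in this chain and are spelled out here (they are not on the slide in this form). First, taking k ≥ (1/(c ε²)) ln(2/δ) makes 2 e^{−c ε² k} ≤ δ, which is why k = Ω((1/ε²) log(1/δ)) suffices. Second, since z = (1/√k) Π x, we have ‖z‖₂² = (1/k) Σ_{i=1}^k Z_i² = Y/k, so the event (1 − ε)² k ≤ Y ≤ (1 + ε)² k is exactly the event (1 − ε) ≤ ‖z‖₂ ≤ (1 + ε), i.e., ‖z‖₂ = (1 ± ε) ‖x‖₂ for the unit vector x.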

  21. JL lower bounds. Question: Are the bounds achieved by the lemmas tight, or can we do better? How about non-linear maps?
     The bounds are essentially optimal modulo constant factors for worst-case point sets.

  22. Fast JL and Sparse JL. The projection matrix Π is dense, and hence computing Π x takes Θ(kd) time.
     Question: Can we choose Π to improve the time bound? Two scenarios: x is dense; x is sparse.

  23. Fast JL and Sparse JL (continued). Main ideas:
     Choose Π_ij from {−1, 0, +1} with probabilities 1/6, 2/3, 1/6 respectively (scaled by √3 so each entry has unit variance). This also works, and roughly 2/3 of the entries are 0.
     Fast JL: Choose Π in a dependent, structured way so that Π x can be computed in O(d log d) time.
     Sparse JL: Choose Π such that each column is s-sparse. The best known bound is s = O((1/ε) log(1/δ)).
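
     An illustrative Python sketch of the sparse {−1, 0, +1} construction above, assuming the 1/6, 2/3, 1/6 probabilities and the √3 scaling; this is not an implementation of the s-sparse-column Sparse JL transform or of Fast JL, and the dimensions are arbitrary.

       import numpy as np

       rng = np.random.default_rng(5)
       d, eps, delta = 2_000, 0.1, 0.01
       k = int(np.ceil(np.log(1 / delta) / eps**2))

       # Entries are sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6},
       # so each entry has mean 0 and variance 1; about 2/3 of them are 0.
       vals = np.array([-1.0, 0.0, 1.0])
       probs = [1/6, 2/3, 1/6]
       Pi = np.sqrt(3) * rng.choice(vals, size=(k, d), p=probs)

       x = rng.standard_normal(d)
       z = Pi @ x / np.sqrt(k)
       print(np.linalg.norm(z) / np.linalg.norm(x))   # typically within 1 +/- eps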

  24. Subspace Embedding. Question: Suppose we have a linear subspace E of R^d of dimension ℓ. Can we find a projection Π : R^d → R^k such that for every x ∈ E, ‖Π x‖₂ = (1 ± ε) ‖x‖₂?

  25. Subspace Embedding (continued). This is not possible if k < ℓ. Why?

  26. Subspace Embedding (continued). Π maps E into a space of lower dimension, which implies that some non-zero vector x ∈ E is mapped to 0.
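
     A short numpy illustration of this obstruction, assuming E is spanned by ℓ random directions and Π is a random k × d Gaussian matrix with k < ℓ: the restriction of Π to E then has a nontrivial kernel, and the SVD exhibits a non-zero x ∈ E with Π x = 0.

       import numpy as np

       rng = np.random.default_rng(6)
       d, ell, k = 50, 10, 5                     # k < ell

       B = np.linalg.qr(rng.standard_normal((d, ell)))[0]   # orthonormal basis of E
       Pi = rng.standard_normal((k, d))

       # Pi restricted to E is the k x ell matrix Pi @ B; since k < ell it has
       # a nontrivial null space, i.e. some nonzero x in E with Pi x = 0.
       _, _, Vt = np.linalg.svd(Pi @ B)
       coeffs = Vt[-1]                           # right singular vector for sigma = 0
       x = B @ coeffs                            # non-zero vector in E
       print(np.linalg.norm(x), np.linalg.norm(Pi @ x))     # ~1 and ~0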
