Sparser Johnson-Lindenstrauss Transforms

  1. Sparser Johnson-Lindenstrauss Transforms. Jelani Nelson (Princeton). February 16, 2012. Joint work with Daniel Kane (Stanford).

  2. Random Projections
     • x ∈ ℝ^d, d huge
     • store y = Sx, where S is a k × d matrix (compression)
     • compressed sensing (recover x from y when x is (near-)sparse)
     • group testing (as above, but Sx is Boolean multiplication)
     • recover properties of x (entropy, heavy hitters, ...)
     • approximate norm preservation (want ‖y‖ ≈ ‖x‖)
     • motif discovery (slightly different: randomly project discrete x onto a subset of its coordinates) [Buhler-Tompa]
     • In many of these applications, a random S is either required or obtains better parameters than deterministic constructions.

  3. Metric Johnson-Lindenstrauss lemma
     Metric JL (MJL) Lemma, 1984: Every set of n points in Euclidean space can be embedded into O(ε^{-2} log n)-dimensional Euclidean space so that all pairwise distances are preserved up to a 1 ± ε factor.
     Uses:
     • Speed up geometric algorithms by first reducing the dimension of the input [Indyk-Motwani, 1998], [Indyk, 2001]
     • Low-memory streaming algorithms for linear algebra problems [Sarlós, 2006], [LWMRT, 2007], [Clarkson-Woodruff, 2009]
     • Essentially equivalent to RIP matrices from compressed sensing [Baraniuk et al., 2008], [Krahmer-Ward, 2011] (used for recovery of sparse signals)

  4. How to prove the JL lemma
     Distributional JL (DJL) Lemma: For any 0 < ε, δ < 1/2 there exists a distribution D_{ε,δ} on ℝ^{k×d} with k = O(ε^{-2} log(1/δ)) so that for any x of unit norm,
         Pr_{S ∼ D_{ε,δ}} [ |‖Sx‖_2^2 − 1| > ε ] < δ.
     Proof of MJL: Set δ = 1/n^2 in DJL and let x range over the (normalized) difference vectors of pairs of points. Union bound over the (n choose 2) pairs.
     Theorem (Alon, 2003): For every n, there exists a set of n points requiring target dimension k = Ω((ε^{-2}/log(1/ε)) log n).
     Theorem (Jayram-Woodruff, 2011; Kane-Meka-N., 2011): For DJL, k = Θ(ε^{-2} log(1/δ)) is optimal.
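To make the union-bound step in the MJL proof explicit, here is the calculation it relies on, spelled out for completeness:

```latex
% Apply DJL with \delta = 1/n^2 to each normalized difference vector
% x_{uv} = (u - v)/\|u - v\|_2, one for each of the \binom{n}{2} pairs:
\Pr\Big[\exists\,\{u,v\}:\ \big|\,\|S x_{uv}\|_2^2 - 1\,\big| > \varepsilon\Big]
  \;\le\; \binom{n}{2}\cdot\frac{1}{n^2}
  \;=\; \frac{n-1}{2n} \;<\; \frac{1}{2},
```

so a single draw of S works for all pairs simultaneously with probability greater than 1/2, with k = O(ε^{-2} log(n^2)) = O(ε^{-2} log n).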

  5. Proving the JL lemma
     Older proofs:
     • [Johnson-Lindenstrauss, 1984], [Frankl-Maehara, 1988]: Random rotation, then projection onto the first k coordinates.
     • [Indyk-Motwani, 1998], [Dasgupta-Gupta, 2003]: Random matrix with independent Gaussian entries.
     • [Achlioptas, 2001]: Independent ±1 entries.
     • [Clarkson-Woodruff, 2009]: O(log(1/δ))-wise independent ±1 entries.
     • [Arriaga-Vempala, 1999], [Matousek, 2008]: Independent entries having mean 0, variance 1/k, and sub-Gaussian tails.
     Downside: Performing the embedding is a dense matrix-vector multiplication, taking O(k · ‖x‖_0) time.
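As a concrete illustration of the dense constructions above, here is a minimal numpy sketch (not from the slides; the parameter choices are illustrative) that draws S with independent mean-0, variance-1/k Gaussian entries and checks the norm distortion on one unit vector:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5_000
eps, delta = 0.2, 1e-3
k = int(np.ceil(4 * eps**-2 * np.log(1 / delta)))   # k = O(eps^-2 log(1/delta)); constant illustrative

# Dense JL matrix: independent entries with mean 0 and variance 1/k (Gaussian case).
S = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, d))

x = rng.normal(size=d)
x /= np.linalg.norm(x)                      # unit norm, as in the DJL statement

y = S @ x                                   # dense embedding: Theta(k * d) work
print(k, abs(np.linalg.norm(y) ** 2 - 1))   # distortion; typically well below eps
```

The multiplication touches every nonzero coordinate of x in every row, which is exactly the O(k · ‖x‖_0) cost the next slides try to beat.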

  6. Fast JL Transforms
     • [Ailon-Chazelle, 2006]: x ↦ PHDx, O(d log d + k^3) time. P is a random sparse matrix, H is the Hadamard transform, D has random ±1 entries on its diagonal.
     • [Ailon-Liberty, 2008]: O(d log k + k^2) time, also based on the fast Hadamard transform.
     • [Ailon-Liberty, 2011] and [Krahmer-Ward, 2011]: O(d log d) time for MJL, but with suboptimal k = O(ε^{-2} log n log^4 d).
     Downside: Slow to embed sparse vectors: running time is Ω(min{k · ‖x‖_0, d log d}).
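For intuition, here is a simplified sketch in the spirit of x ↦ PHDx. It is not the exact [Ailon-Chazelle] construction: their P is a sparse matrix with Gaussian entries, while this sketch simply subsamples k coordinates after the randomized Hadamard transform (a common simplification). The fast Walsh-Hadamard transform supplies the O(d log d) part; d must be a power of two.

```python
import numpy as np

def fwht(a: np.ndarray) -> np.ndarray:
    """In-place fast Walsh-Hadamard transform; len(a) must be a power of two. O(d log d) additions."""
    n, h = len(a), 1
    while h < n:
        for i in range(0, n, 2 * h):
            x = a[i:i + h].copy()
            y = a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

rng = np.random.default_rng(1)
d, k = 1 << 12, 256                        # d a power of two; k illustrative

x = rng.normal(size=d)
x /= np.linalg.norm(x)

D = rng.choice([-1.0, 1.0], size=d)        # random +/-1 diagonal
z = fwht(D * x) / np.sqrt(d)               # (1/sqrt(d)) H D x, so ||z|| = ||x||
rows = rng.choice(d, size=k, replace=False)
y = np.sqrt(d / k) * z[rows]               # simplified "P": subsample and rescale

print(abs(np.linalg.norm(y) ** 2 - 1))     # small for most draws
```

Note that even this fast variant reads all d coordinates, which is the Ω(min{k · ‖x‖_0, d log d}) downside stated above.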

  7. Where Do Sparse Vectors Show Up?
     • Documents as bags of words: x_i = number of occurrences of word i. Compare documents using cosine similarity. d = lexicon size; most documents aren't dictionaries.
     • Network traffic: x_{i,j} = number of bytes sent from i to j. d = 2^64 (2^256 in IPv6); most servers don't talk to each other.
     • User ratings: x_i is a user's score for movie i on Netflix. d = number of movies; most people haven't rated all movies.
     • Streaming: x receives a stream of updates of the form "add v to x_i". Maintaining Sx requires calculating v · Se_i (see the sketch after this list).
     • ...
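The streaming bullet is the setting this talk targets. Under an update "add v to x_i", the maintained sketch changes by v · Se_i, i.e. by v times column i of S, so the update cost is the number of nonzeros in that column. A minimal sketch of that update rule, with S stored column-sparse (class and field names are illustrative):

```python
import numpy as np

class LinearSketch:
    """Maintain y = Sx under turnstile updates "add v to x_i", with S stored column-sparse."""

    def __init__(self, rows_per_col, vals_per_col, k):
        self.rows = rows_per_col   # rows[i]: row indices of the nonzeros in column i of S
        self.vals = vals_per_col   # vals[i]: the corresponding values (e.g. +/- 1/sqrt(s))
        self.y = np.zeros(k)       # the current sketch Sx

    def update(self, i, v):
        # y += v * S e_i: touches only the nonzeros of column i,
        # so the cost per update is the column sparsity s rather than k.
        self.y[self.rows[i]] += v * self.vals[i]

# Example: an 8 x 100 sketch with s = 2 nonzeros per column (values chosen at random).
rng = np.random.default_rng(5)
k, d, s = 8, 100, 2
rows = [rng.choice(k, size=s, replace=False) for _ in range(d)]
vals = [rng.choice([-1.0, 1.0], size=s) / np.sqrt(s) for _ in range(d)]
sketch = LinearSketch(rows, vals, k)
sketch.update(i=3, v=2.5)          # stream update: x_3 += 2.5
```

With a dense S every update costs k; with a column-sparse S it costs s, which is the quantity the rest of the talk tries to minimize.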

  8. Sparse JL transforms
     One way to embed sparse vectors faster: use sparse matrices.
     Let s = number of non-zero entries per column of the embedding matrix (so the embedding time is s · ‖x‖_0).
     reference                     | value of s              | type
     [JL84], [FM88], [IM98], ...   | k ≈ 4ε^{-2} log(1/δ)    | dense
     [Achlioptas01]                | k/3                     | sparse Bernoulli
     [WDALS09]                     | no proof                | hashing
     [DKS10]                       | Õ(ε^{-1} log^3(1/δ))    | hashing
     [KN10a], [BOR10]              | Õ(ε^{-1} log^2(1/δ))    | hashing
     [KN12]                        | O(ε^{-1} log(1/δ))      | hashing (random codes)

  9. Other related work
     • The CountSketch of [Charikar-Chen-Farach-Colton] gives s = O(log(1/δ)) (see [Thorup-Zhang]).
     • It can recover (1 ± ε)‖x‖_2 from Sx, but not as ‖Sx‖_2 (so it is not an embedding into ℓ_2).
     • Not applicable in certain situations, e.g. in some nearest-neighbor data structures, and when learning classifiers over projected vectors via stochastic gradient descent.
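For contrast with a true embedding, here is a minimal CountSketch-style estimator in the spirit of [Charikar-Chen-Farach-Colton] and [Thorup-Zhang] (bucket and repetition counts are illustrative). Each repetition hashes coordinates into buckets with random signs; the sum of squared bucket values is an unbiased estimate of ‖x‖_2^2, and the median over repetitions boosts confidence. The output is a median over repetitions, not ‖Sx‖_2^2 for a single linear map, which is why this does not give an embedding into ℓ_2.

```python
import numpy as np

rng = np.random.default_rng(2)

def countsketch_norm_estimate(x, buckets=400, reps=9):
    """Median-of-repetitions estimate of ||x||_2^2 from CountSketch-style linear measurements."""
    d = len(x)
    estimates = []
    for _ in range(reps):
        h = rng.integers(0, buckets, size=d)        # bucket of each coordinate
        sigma = rng.choice([-1.0, 1.0], size=d)     # random sign of each coordinate
        counters = np.zeros(buckets)
        np.add.at(counters, h, sigma * x)           # counters[b] = sum over {i: h(i)=b} of sigma(i) x_i
        estimates.append(np.sum(counters ** 2))     # unbiased estimator of ||x||_2^2
    return np.median(estimates)

x = rng.normal(size=10_000)
print(countsketch_norm_estimate(x), np.linalg.norm(x) ** 2)
```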

  10. Sparse JL Constructions
     • [DKS, 2010]: s = Θ̃(ε^{-1} log^2(1/δ))
     • [this work]: s = Θ(ε^{-1} log(1/δ))
     (Figure: the construction, with block height k/s labeled.)

  11. Sparse JL Constructions (in matrix form)
     (Figure: the two k × d embedding matrices, one with full column height k and one divided into blocks of height k/s.)
     Each black cell is ±1/√s at random.

  12. Sparse JL Constructions (nicknames)
     The "Graph" construction and the "Block" construction (blocks of height k/s).

  13. Sparse JL notation (block construction)
     • Let h(j, r), σ(j, r) be the random hash location and random sign for the copy of x_j in the r-th block.
     • (Sx)_i = (1/√s) · Σ_{h(j,r)=i} x_j · σ(j, r)
     • ‖Sx‖_2^2 = ‖x‖_2^2 + (1/s) · Σ_{r=1}^{s} Σ_{i≠j} x_i x_j σ(i, r) σ(j, r) · 1_{h(i,r)=h(j,r)}
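A direct numpy transcription of this notation (a sketch, not code from the paper; hashes and signs are drawn fully at random for simplicity): for each of the s blocks, every coordinate j hashes to one of the k/s rows of that block with an independent random sign, and the nonzero entries are ±1/√s.

```python
import numpy as np

rng = np.random.default_rng(3)

def block_sparse_jl(x, k, s):
    """Block construction: s blocks of k/s rows; each column gets one +/-1/sqrt(s) entry per block."""
    assert k % s == 0
    d, b = len(x), k // s                         # b = k/s rows per block
    y = np.zeros(k)
    for r in range(s):                            # block r
        h = rng.integers(0, b, size=d)            # h(j, r): row within block r for coordinate j
        sigma = rng.choice([-1.0, 1.0], size=d)   # sigma(j, r)
        np.add.at(y, r * b + h, sigma * x)        # (Sx)_{r*b + h(j,r)} += sigma(j,r) * x_j
    return y / np.sqrt(s)                         # overall 1/sqrt(s) scaling

x = rng.normal(size=5_000)
x /= np.linalg.norm(x)
y = block_sparse_jl(x, k=800, s=20)               # illustrative parameters
print(abs(np.linalg.norm(y) ** 2 - 1))
```

The embedding touches s entries per nonzero coordinate of x, matching the s · ‖x‖_0 embedding time from the table.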

  14. Sparse JL via Codes
     (Figure: the graph and block constructions in matrix form.)
     • Graph construction: constant-weight binary code of weight s.
     • Block construction: code over a q-ary alphabet, q = k/s.
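For completeness, the graph construction in the same style (again with fully random choices for simplicity): each column's support is a uniformly random set of exactly s distinct rows, i.e. a constant-weight-s binary codeword, with ±1/√s values.

```python
import numpy as np

rng = np.random.default_rng(4)

def graph_sparse_jl(x, k, s):
    """Graph construction: each column has exactly s nonzeros, placed in s distinct rows."""
    d = len(x)
    y = np.zeros(k)
    for j in range(d):
        rows = rng.choice(k, size=s, replace=False)   # support of column j: s distinct rows
        signs = rng.choice([-1.0, 1.0], size=s)
        y[rows] += signs * x[j] / np.sqrt(s)          # entries are +/- 1/sqrt(s)
    return y

x = rng.normal(size=2_000)
x /= np.linalg.norm(x)
print(abs(np.linalg.norm(graph_sparse_jl(x, k=800, s=20)) ** 2 - 1))
```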
