  1. Subspace Embeddings and ℓp-Regression Using Exponential Random Variables
  David P. Woodruff and Qin Zhang, IBM Research Almaden
  COLT'13, June 12, 2013

  3. Subspace embeddings
  Subspace embeddings: a distribution over linear maps Π : R^n → R^m such that, for any fixed d-dimensional subspace of R^n (denoted by M), with probability 0.99,
  ‖Mx‖_p ≤ ‖ΠMx‖_q ≤ κ · ‖Mx‖_p
  simultaneously for all vectors x ∈ R^d.
  Goal: minimize
  1. m: the dimension of the subspace embedding.
  2. κ: the distortion of the embedding.
  3. t: the time to compute ΠM.
  Applications: ℓp-regression (next slide), low-rank approximation, quantile regression, ...
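As a quick sanity check of this definition in the simplest setting p = q = 2, the following snippet uses a dense Gaussian sketch (a classical ℓ2 subspace embedding, not the paper's construction); the sizes n, d, m are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 5, 400

M = rng.standard_normal((n, d))                # basis of a d-dimensional subspace of R^n
Pi = rng.standard_normal((m, n)) / np.sqrt(m)  # dense Gaussian sketch: a classical l2-SE

PiM = Pi @ M
ratios = []
for _ in range(100):                           # random directions x in R^d
    x = rng.standard_normal(d)
    ratios.append(np.linalg.norm(PiM @ x) / np.linalg.norm(M @ x))

# With m much larger than d, every ratio ||Pi M x|| / ||M x|| concentrates near 1,
# i.e. the distortion kappa is close to 1 simultaneously for all x.
print(min(ratios), max(ratios))
```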

  5. All matter: embedding time, dimension and distortion
  Using an ℓp subspace embedding (SE) to solve ℓp regression: min_{x ∈ R^d} ‖M̄x − b‖_p.
  For convenience, let M̄ ∈ R^{n×(d−1)} and M = [M̄, −b] ∈ R^{n×d}, with n ≫ d. Let Π be an SE with dimension m, distortion κ and embedding time t.
  1. Compute ΠM. (cost t)
  2. Use ΠM to compute a change-of-basis matrix R ∈ R^{d×d} such that MR has some good properties. (cost ↑ if m ↑)
  3. Given R, find a sampling matrix Π₁ ∈ R^{m′×n}. (m′ ↑ if κ ↑)
  4. Compute the solution x̂ of the sub-sampled problem min_{x ∈ R^d} ‖Π₁M̄x − Π₁b‖_p. (cost ↑ if m′ ↑ or κ ↑)
  Total running time ↑ if m ↑, κ ↑ or t ↑.
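For intuition, the four steps collapse in the p = 2 case, where one can solve the sketched problem directly. The snippet below is that simplified sketch-and-solve variant with a Gaussian Π and made-up sizes, not the paper's ℓp pipeline with conditioning and sampling.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 5000, 10, 300

Mbar = rng.standard_normal((n, d))
b = Mbar @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)

# Step 1: compute Pi*Mbar and Pi*b (for p = 2 a Gaussian sketch suffices; cost t)
Pi = rng.standard_normal((m, n)) / np.sqrt(m)

# Steps 2-4 reduce to solving the small m x d least-squares problem
x_hat, *_ = np.linalg.lstsq(Pi @ Mbar, Pi @ b, rcond=None)

# Compare against the exact minimizer of ||Mbar x - b||_2
x_opt, *_ = np.linalg.lstsq(Mbar, b, rcond=None)
res_hat = np.linalg.norm(Mbar @ x_hat - b)
res_opt = np.linalg.norm(Mbar @ x_opt - b)
print(res_hat / res_opt)  # close to 1
```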

  7. ℓ1 regression
  ℓ1 regression: min_{x ∈ R^d} ‖M̄x − b‖_1 (M̄ ∈ R^{n×(d−1)}).
  • Can be solved by linear programming, in time superlinear in n.
  • Clarkson 2005 gave an n · poly(d) solution.
  • ...
  Allowing a (1 + ε)-approximation:
  • Sohler & Woodruff 2011 used an ℓ1 subspace embedding (SE), giving O(nd^{ω−1}) + poly(d/ε). (ω < 3 is the exponent of matrix multiplication)
  • Clarkson et al. 2012 used a more structured ℓ1 SE, giving O(nd log n) + poly(d/ε).
  • Clarkson & Woodruff / Meng & Mahoney 2012 used other ℓ1 SEs, giving O(nnz(M) log n) + poly(d/ε), where nnz(M) is the number of non-zero entries of M.
  This paper further improves the ℓ1 SE, and thus also ℓ1 regression.

  10. Our results
  ℓp subspace embeddings: improved all previous results for every p ∈ [1, ∞) \ {2}. (p = 2 has already been made optimal by Clarkson and Woodruff '12.)
  In particular, for p = 1:

    Reference    Time                  Distortion              Dimension
    SW           nd^(ω−1)              Õ(d)                    Õ(d)
    C+           nd log d              Õ(d^(2+γ))              Õ(d^5)
    MM           nnz(M)                Õ(d^3)                  Õ(d^5)
    This paper   nnz(M) + Õ(d^(2+γ))  Õ(d)                    Õ(d^2)
    This paper   nnz(M) + Õ(d^(2+γ))  O(d^(3/2) log^(1/2) n)  Õ(d)

  SW: Sohler & Woodruff '11; C+: Clarkson et al. '12; MM: Meng & Mahoney '12; ω < 3 is the exponent of matrix multiplication; γ = 0.0000001.
  ℓp regression: improved all previous results for every p ∈ [1, ∞) \ {2}; these have efficient distributed implementations.

  12. Our subspace embedding matrices
  (m, s)-ℓ2-SE (oblivious subspace embedding for the ℓ2 norm): a distribution over linear maps S : R^n → R^m such that, for any fixed d-dimensional subspace of R^n (denoted by M), with probability 0.99,
  1/2 · ‖Mx‖_2 ≤ ‖SMx‖_2 ≤ 3/2 · ‖Mx‖_2   for all x ∈ R^d,
  where s = O(1) is the maximum number of non-zero entries in each column of S.
  Our ℓp subspace embedding matrix: Π = S · D ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D ∈ R^{n×n} is diagonal with D_ii = 1/u_i^{1/p}, for u_1, ..., u_n i.i.d. exponentials.
  Use different ℓ2-SEs (from CW12, MM12, Nelson & Nguyen 12) for 1 ≤ p < 2 and p > 2. Can compute ΠM in O(nnz(M)) time.
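A rough sketch of assembling ΠM = S(DM): D scales row i of M by 1/u_i^(1/p), and S is taken here as a CountSketch-style ℓ2-SE with one nonzero per column, so the product costs O(nnz(M)). The sizes and the specific choice of S are illustrative assumptions, not the tuned parameters from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, m, p = 10000, 8, 500, 1.0

M = rng.standard_normal((n, d))

# D in R^{n x n}: diagonal with D_ii = 1 / u_i^{1/p}, u_i i.i.d. standard exponentials
u = rng.exponential(size=n)
DM = M / u[:, None] ** (1.0 / p)

# S in R^{m x n}: CountSketch-style l2-SE, one random +/-1 entry per column (s = 1)
rows = rng.integers(0, m, size=n)
signs = rng.choice([-1.0, 1.0], size=n)
PiM = np.zeros((m, d))
np.add.at(PiM, rows, signs[:, None] * DM)  # accumulate row i of DM into row rows[i]

print(PiM.shape)  # (500, 8)
```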

  16. Two distributions
  Exponential distribution: PDF f(x) = e^{−x}, CDF F(x) = 1 − e^{−x}. (Recently used by Andoni (2012) for approximating frequency moments.)
  (Max stability) If u_1, ..., u_n are i.i.d. exponentials and α = (α_1, ..., α_n) ∈ R_+^n, then max{α_1/u_1, ..., α_n/u_n} ≃ ‖α‖_1 / u, where u is exponential.
  p-stable distribution: the previous pet for subspace embeddings. D_p is p-stable if for any vector α = (α_1, ..., α_n) ∈ R^n and v_1, ..., v_n i.i.d. ∼ D_p, we have Σ_{i ∈ [n]} α_i v_i ≃ ‖α‖_p · v, where v ∼ D_p.
  E.g., for p = 2 it is the Gaussian distribution; for p = 1 it is the Cauchy distribution.
  Similar embedding matrix: Π = S · D′ ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D′ ∈ R^{n×n} is diagonal with entries v_1, ..., v_n that are i.i.d. p-stables.
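Both stability properties are easy to verify by Monte Carlo; the weight vector α and trial count below are arbitrary. For the max-stability check, note P(max_i α_i/u_i ≤ t) = Π_i e^{−α_i/t} = e^{−‖α‖_1/t}, which is exactly the CDF of ‖α‖_1/u.

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.array([0.5, 1.5, 2.0, 4.0])  # arbitrary positive weights, ||alpha||_1 = 8
trials = 200_000

# Max stability of exponentials: max_i alpha_i / u_i  ~  ||alpha||_1 / u
u = rng.exponential(size=(trials, alpha.size))
lhs = (alpha / u).max(axis=1)
rhs = alpha.sum() / rng.exponential(size=trials)
print(np.median(lhs), np.median(rhs))   # both medians ~ ||alpha||_1 / ln 2

# 1-stability of Cauchy: sum_i alpha_i v_i  ~  ||alpha||_1 * v
v = rng.standard_cauchy(size=(trials, alpha.size))
s = (alpha * v).sum(axis=1)
print(np.percentile(s, 75))             # ~ ||alpha||_1, since the 75th pctile of Cauchy is 1
```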

  19. Why the exponential distribution is superior to p-stables
  1. p-stables only exist for p ∈ [1, 2], while the exponential can be used for ℓp-SEs for all p ≥ 1.
  2. The lower tail of the reciprocal of an exponential decreases faster than that of a p-stable, while its upper tail is similar to that of p-stables.
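Point 2 can be made concrete with exact CDFs, taking p = 1 (Cauchy) as the stable comparison: for exponential u, P(1/u ≤ t) = P(u ≥ 1/t) = e^{−1/t}, while P(|v| ≤ t) = (2/π)·arctan(t) for Cauchy v. The thresholds below are arbitrary choices.

```python
import numpy as np

# Lower tail at a small threshold t: the reciprocal exponential is far lighter
t = 0.05
p_recip_exp = np.exp(-1.0 / t)               # P(1/u <= t) = e^{-1/t} = e^{-20}
p_abs_cauchy = (2.0 / np.pi) * np.arctan(t)  # P(|v| <= t), roughly (2/pi) * t
print(p_recip_exp, p_abs_cauchy)

# Upper tails are comparable: both P(X > T) decay like c / T
T = 100.0
print(1.0 - np.exp(-1.0 / T), 1.0 - (2.0 / np.pi) * np.arctan(T))
```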
