Subspace Embeddings and ℓp-Regression Using Exponential Random Variables
David P. Woodruff and Qin Zhang, IBM Research Almaden
COLT’13, June 12, 2013
Subspace embeddings: a distribution over linear maps Π : R^n → R^m, s.t. for any fixed d-dimensional subspace of R^n (denoted by M), w. pr. 0.99,

  ‖Mx‖_p ≤ ‖ΠMx‖_q ≤ κ·‖Mx‖_p   simultaneously for all vectors x ∈ R^d

(the norm q on the sketch side may differ from p; e.g., q = ∞ later).

Goal: to minimize the dimension m and the distortion κ.

Applications: ℓp-regression (next slide), low-rank approximation, quantile regression, …
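As a concrete (if classical) instance of this definition, a dense Gaussian map is an ℓ2 subspace embedding; here is a quick numerical sanity check, a minimal sketch assuming numpy (illustration only, not this paper's construction):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 10000, 10, 400

# A fixed d-dimensional subspace of R^n, given by a basis matrix M.
M = rng.standard_normal((n, d))

# Dense Gaussian sketch: a classical l2 subspace embedding.
Pi = rng.standard_normal((m, n)) / np.sqrt(m)
PiM = Pi @ M

# Distortion over many random directions x in R^d: all ratios close to 1.
X = rng.standard_normal((d, 1000))
ratios = np.linalg.norm(PiM @ X, axis=0) / np.linalg.norm(M @ X, axis=0)
print(ratios.min(), ratios.max())
```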
Using an ℓp subspace embedding (SE) to solve ℓp regression: min_x ‖M̄x − b‖_p.

For convenience, let M̄ ∈ R^{n×(d−1)} and let M = [M̄, −b] ∈ R^{n×d}; n ≫ d. Let Π be an SE with dimension m, distortion κ, and embedding time t.

– Compute ΠM and find R s.t. MR has some good properties. (cost ↑ if m ↑)
– Sample m′ rows (guided by MR) to form Π_1, and output the solution x̃ of the sub-sampled problem min_x ‖Π_1·M̄x − Π_1·b‖_p. (cost ↑ if m′ ↑, or κ ↑)

Total running time ↑ if m ↑ or κ ↑ or t ↑. (A toy sketch-and-solve version appears below.)
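For intuition, here is a minimal sketch-and-solve variant in Python (assuming numpy/scipy; `l1_regression` and `sketch_and_solve` are hypothetical helper names). It skips the conditioning and sampling steps and solves the sketched ℓ1 problem directly as an LP, which by itself yields only a κ-approximation; the sampling step above is what sharpens this to (1 + ε):

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Solve min_x ||Ax - b||_1 exactly as an LP over variables [x, t]."""
    n, d = A.shape
    c = np.concatenate([np.zeros(d), np.ones(n)])         # minimize sum_i t_i
    A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])  # -t <= Ax - b <= t
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * d + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]

def sketch_and_solve(Mbar, b, Pi):
    """kappa-approximate l1 regression: solve the sketched problem."""
    return l1_regression(Pi @ Mbar, Pi @ b)
```

Here `Pi` can be any ℓ1 SE, e.g. the Π = S·D construction introduced a few slides below.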
Previous work on ℓ1 regression: min_x ‖M̄x − b‖_1 (M̄ ∈ R^{n×(d−1)}). Allowing a (1 + ε)-approximation:

– Sohler & Woodruff '11: O(nd^{ω−1}) + poly(d/ε).
– Clarkson et al. '12: O(nd log n) + poly(d/ε).
– Meng & Mahoney '12: O(nnz(M) log n) + poly(d/ε), where nnz(M) is the number of non-zero entries of M.

(ω < 3 is the exponent of matrix multiplication.)

This paper: further improves the ℓ1 SE, and thus also ℓ1 regression.
ℓp subspace embeddings: improved all previous results for every p ∈ [1, ∞) \ {2}.
(p = 2 has already been made optimal by Clarkson and Woodruff '12.)

In particular, for p = 1:

              Time                   Distortion               Dimension
  SW          nd^{ω−1}               Õ(d)                     Õ(d)
  C+          nd log d               Õ(d^{2+γ})               Õ(d^5)
  MM          nnz(M)                 Õ(d^3)                   Õ(d^5)
  This paper  nnz(M) + Õ(d^{2+γ})    Õ(d^2)                   Õ(d)
  This paper  nnz(M) + Õ(d^{2+γ})    Õ(d^{3/2} log^{1/2} n)   Õ(d)

  SW: Sohler & Woodruff '11; C+: Clarkson et al. '12; MM: Meng & Mahoney '12.
  ω < 3 is the exponent of matrix multiplication; γ = 0.0000001.

ℓp regression: improved all previous results for every p ∈ [1, ∞) \ {2}, with efficient distributed implementations.
(m, s)-ℓ2-SE (oblivious subspace embedding for the ℓ2 norm): a distribution over linear maps S : R^n → R^m, s.t. for any fixed d-dimensional subspace of R^n, w. pr. 0.99,

  1/2 · ‖Mx‖_2 ≤ ‖SMx‖_2 ≤ 3/2 · ‖Mx‖_2, ∀x ∈ R^d.

Here s = O(1) is the maximum number of non-zero entries in each column of S.

Our ℓp subspace embedding matrix: Π = S·D ∈ R^{m×n}, where S ∈ R^{m×n} is an ℓ2-SE and D = diag(1/u_1^{1/p}, …, 1/u_n^{1/p}) ∈ R^{n×n}, with the u_i i.i.d. exponentials. We use different ℓ2-SEs (from CW12, MM12, Nelson & Nguyen '12) for 1 ≤ p < 2 and for p > 2. ΠM can be computed in O(nnz(M)) time.
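A minimal sketch of this construction in Python (assuming numpy; `lp_embed` is a hypothetical name), using the simplest s = 1 CountSketch-style S. Applying Π touches each row of M exactly once, which is the O(nnz(M)) claim for dense rows:

```python
import numpy as np

def lp_embed(M, m, p, rng):
    """Compute Pi @ M for Pi = S . D: an s = 1 CountSketch-style S times
    D = diag(1/u_1^{1/p}, ..., 1/u_n^{1/p}) with u_i i.i.d. exponentials."""
    n, d = M.shape
    bucket = rng.integers(0, m, size=n)         # hash each row to one bucket
    sign = rng.choice([-1.0, 1.0], size=n)      # random signs of S
    u = rng.standard_exponential(n)             # exponentials for D
    scale = sign / u ** (1.0 / p)               # S's sign times D's diagonal
    PiM = np.zeros((m, d))
    np.add.at(PiM, bucket, scale[:, None] * M)  # one pass over M's rows
    return PiM

rng = np.random.default_rng(1)
M = rng.standard_normal((100000, 8))
print(lp_embed(M, m=2000, p=1.0, rng=rng).shape)   # (2000, 8)
```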
Exponential distribution: PDF f(x) = e^{−x}, CDF F(x) = 1 − e^{−x}.

(Max stability) If u_1, …, u_n are i.i.d. exponentials and α = (α_1, …, α_n) ∈ R_+^n, then

  max{α_1/u_1, …, α_n/u_n} ≃ ‖α‖_1/u, where u is exponential.

(Recently used by Andoni (2012) for approximating frequency moments.)

p-stable distributions: the previous tool of choice for subspace embeddings.
D_p is p-stable if, for any vector α = (α_1, …, α_n) ∈ R^n and v_1, …, v_n ~ D_p i.i.d., we have

  Σ_{i∈[n]} α_i·v_i ≃ ‖α‖_p·v, where v ~ D_p.

E.g., for p = 2 it is the Gaussian distribution; for p = 1 it is the Cauchy distribution.

Similar embedding matrix: Π = S·D′, where D′ = diag(v_1, …, v_n) ∈ R^{n×n} with the v_i i.i.d. p-stables.
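Max stability is an exact distributional identity (both sides have CDF e^{−‖α‖_1/t}), so it is easy to check empirically; a quick Monte Carlo comparison of quantiles, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = np.array([0.5, 2.0, 3.5, 1.0])     # arbitrary positive weights
trials = 200000

# Left side: max_i alpha_i / u_i, with u_1, ..., u_n i.i.d. exponential.
U = rng.standard_exponential((trials, alpha.size))
lhs = (alpha / U).max(axis=1)

# Right side: ||alpha||_1 / u, with a single exponential u.
rhs = alpha.sum() / rng.standard_exponential(trials)

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print(np.quantile(lhs, qs))
print(np.quantile(rhs, qs))   # agrees up to sampling noise
```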
Why is the exponential distribution better?

– p-stable distributions exist only for p ∈ (0, 2], while exponentials give an ℓp-SE for all p ≥ 1.
– The lower tail of the reciprocal of an exponential is much thinner than that of a p-stable, while its upper tail is similar to p-stables.

[Figure: densities of the reciprocal of an exponential vs. a Cauchy (1-stable), comparing lower tails and upper tails.]
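Both tail claims follow directly from the CDFs: Pr[1/u > t] = 1 − e^{−1/t} ≈ 1/t, matching the Cauchy's ≈ 2/(πt) upper tail up to a constant, while Pr[1/u < s] = e^{−1/s} vanishes far faster than the Cauchy's ≈ 2s/π lower tail. A small check of these closed forms in plain Python:

```python
import math

def recip_exp_upper(t):   # Pr[1/u > t] = Pr[u < 1/t]
    return 1 - math.exp(-1 / t)

def cauchy_upper(t):      # Pr[|v| > t] for standard Cauchy v
    return 1 - 2 / math.pi * math.atan(t)

def recip_exp_lower(s):   # Pr[1/u < s] = Pr[u > 1/s]
    return math.exp(-1 / s)

def cauchy_lower(s):      # Pr[|v| < s]
    return 2 / math.pi * math.atan(s)

for t in [10, 100, 1000]:    # upper tails: both behave like c/t
    print(t, recip_exp_upper(t), cauchy_upper(t))
for s in [0.1, 0.05, 0.01]:  # lower tails: e^{-1/s} << 2s/pi
    print(s, recip_exp_lower(s), cauchy_lower(s))
```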
No underestimation. For each x ∈ R^d, let y = Mx ∈ R^n. Recall Π = S·D, where S ∈ R^{m×n} is an ℓ2-SE and D = diag(1/u_1, …, 1/u_n) ∈ R^{n×n} with the u_i exponential:

  ‖Πy‖_1 = ‖SDy‖_1 ≥ ‖SDy‖_2
         ≥ 1/2 · ‖Dy‖_2                    (property of the ℓ2-SE)
         ≥ 1/2 · ‖Dy‖_∞ ∼ 1/2 · ‖y‖_1/u   (u exponential, by max stability)
         ≥ Ω(1/(d log d)) · ‖y‖_1.        (w. pr. 1 − e^{−d log d}: lower tail of the reciprocal of an exponential)

This proves the bound "for each y in the subspace" w.h.p. To show it for all y simultaneously, we employ a standard net argument + a union bound. Similar arguments work for general 1 ≤ p < 2.

For d ≥ log n, the distortion can be improved to Õ(√(d log n)).
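The probability in the last step is just the exponential's tail Pr[u > t] = e^{−t} evaluated at t = d log d (constants suppressed, as on the slide):

```latex
\Pr[u > t] = e^{-t}
\quad\Longrightarrow\quad
\Pr\!\left[\frac{\|y\|_1}{u} \ge \frac{\|y\|_1}{d\log d}\right]
= \Pr[u \le d\log d] = 1 - e^{-d\log d} = 1 - d^{-d}.
```

A per-vector failure probability of d^{−d} is what leaves room for the union bound over the net.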
No overestimation. For each x ∈ R^d, let y = Mx ∈ R^n. Recall Π = S·D; compare D = diag(1/u_1, …, 1/u_n) (u_i exponential) against D′ = diag(v_1, …, v_n) ∈ R^{n×n} with the v_i i.i.d. Cauchy:

  ‖Πy‖_1 = ‖SDy‖_1 ≤ O(1) · ‖Dy‖_1   (an ℓ2-SE can only contract the ℓ1-norm)
         ≼ γ · ‖D′y‖_1               (stochastically, for a constant γ: the upper tails of the reciprocal of an exponential and of a Cauchy are similar)
         ≤ O(d log d) · ‖y‖_1.       (holds for all y = Mx w. pr. 0.99; previously known for Cauchy embeddings)

Similar arguments work for general 1 ≤ p < 2.
We actually can embed the subspace into ℓ∞ (this is how we handle p > 2). Recall Π = S·D ∈ R^{m×n}, with D = diag(1/u_1^{1/p}, …, 1/u_n^{1/p}), the u_i exponential, and dimension m = O(n^{1−2/p}·d^{1+2/p}) + poly(d):

  Ω(1/(d log d)^{1/p}) · ‖Mx‖_p ≤ ‖ΠMx‖_∞ ≤ O((d log d)^{1/p}) · ‖Mx‖_p.

Good news: ℓ∞-regression can be solved efficiently by LP (see the sketch below).

Main technical ingredient: bound the coordinates of the vectors in a subspace (an idea previously used in CW12).
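The LP behind the "good news" is the standard one: minimize t subject to −t ≤ (Ax − b)_i ≤ t. A minimal version assuming scipy (`linf_regression` is a hypothetical name); on the sketched problem it runs over the m embedded rows instead of the n original ones:

```python
import numpy as np
from scipy.optimize import linprog

def linf_regression(A, b):
    """Solve min_x ||Ax - b||_inf as an LP over variables [x, t]."""
    n, d = A.shape
    c = np.zeros(d + 1)
    c[-1] = 1.0                                  # minimize t
    ones = np.ones((n, 1))
    A_ub = np.block([[A, -ones], [-A, -ones]])   # Ax - b <= t, -(Ax - b) <= t
    b_ub = np.concatenate([b, -b])
    bounds = [(None, None)] * d + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d], res.x[-1]                  # minimizer and its l-inf error
```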
The distributed model: we have k machines and one central server.
– Each machine has a 2-way communication channel with the server.
– Each machine holds a subset of the rows of M̄ ∈ R^{n×(d−1)} and of b ∈ R^n.
– The goal is to solve the ℓp-regression problem min_x ‖M̄x − b‖_p.
Recall the ℓp regression framework: each machine applies the embedding to its submatrix, the small sketches are combined by the server, and the sub-sampled problem min_x ‖Π_1·M̄x − Π_1·b‖_p is solved by the server.

Running time on the server + total communication is sublinear in n: poly(d) for 1 ≤ p < 2, and n^{1−2/p}·poly(d) for p > 2.
– Previous results either have n/poly(d) communication, or a larger total running time of the system.
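Why the sketching step distributes cheaply: Π acts row-wise, so machine j's contribution to ΠM is Π restricted to j's rows times its block, and the server just sums k small m×d matrices. A toy simulation assuming numpy (`machine_sketch` is a hypothetical name), with each machine drawing the Π entries for its own rows independently:

```python
import numpy as np

def machine_sketch(M_block, m, p, rng):
    """One machine's contribution to Pi @ M for its own block of rows."""
    n_j, d = M_block.shape
    bucket = rng.integers(0, m, size=n_j)      # CountSketch bucket per row
    sign = rng.choice([-1.0, 1.0], size=n_j)   # random sign per row
    u = rng.standard_exponential(n_j)          # exponential per row
    out = np.zeros((m, d))
    np.add.at(out, bucket, (sign / u ** (1.0 / p))[:, None] * M_block)
    return out                                 # the m x d message it sends

k, m, p, d = 4, 2000, 1.0, 8
data_rng = np.random.default_rng(3)
blocks = [data_rng.standard_normal((25000, d)) for _ in range(k)]

# Server sums the k sketches: communication is k*m*d numbers, independent of n.
PiM = sum(machine_sketch(B, m, p, np.random.default_rng(10 + j))
          for j, B in enumerate(blocks))
print(PiM.shape)   # (2000, 8)
```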
We give ℓp subspace embeddings using exponential random variables, which improve all previous work on embedding distortions and dimensions in the O(nnz(M))-embedding-time setting.

Open question: what is the best distortion achievable with O(nnz(M)) + poly(d) embedding time and Õ(d) embedding dimension for ℓ1 subspace embeddings? Currently it is min{Õ(d^2), Õ(d^{3/2} log^{1/2} n)}. Is it possible to achieve Õ(d^{3/2}), or even Õ(d)? Can we prove any tradeoff lower bounds?
For the ℓ∞ embedding (p > 2), the high-level idea for showing no overestimation is similar to before: prove no overestimation of ‖ΠMx‖ for each x ∈ R^d w. pr. 1 − e^{−d log d}, then apply a standard net argument to extend this to all x ∈ R^d.

But the per-vector bound cannot be shown for arbitrary vectors! We should use the properties of a subspace.

Use the ℓp leverage scores of M: ℓ_i = ‖M_i‖_p, where M_i is the i-th row of M. (An idea in Clarkson & Woodruff '13.)

We can assume M is an Auerbach basis (since we prove the statement for all x ∈ R^d), which has the property Σ_{i∈[n]} ℓ_i^p ≤ d. Thus there are not many big ℓ_i's.

Also, for every x ∈ R^d and y = Mx (normalized so that ‖y‖_p = 1), we have |y_i| ≤ d^{1−1/p}·ℓ_i for all i ∈ [n]. Thus there are only a few big y_i's.

Using these properties, together with max stability, we can show that the embedding matrix Π works for all vectors in the subspace.
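A quick numerical illustration of the Σ_{i∈[n]} ℓ_i^p ≤ d property, assuming numpy: normalizing each column of M to unit ℓp norm already forces Σ_i ‖M_i‖_p^p = d exactly, since that sum is just Σ_{i,j} |M_ij|^p (a true Auerbach basis satisfies further dual-norm conditions not checked here):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, p = 5000, 6, 3.0

# Normalize each column of M to unit l_p norm.
M = rng.standard_normal((n, d))
M /= np.linalg.norm(M, ord=p, axis=0)

# Row scores l_i = ||M_i||_p; their p-th powers sum to exactly d.
l = np.linalg.norm(M, ord=p, axis=1)
print(np.sum(l ** p))                  # == d up to floating-point error

# Markov-style consequence: at most d/tau rows can have l_i^p > tau.
tau = 0.01
print(np.sum(l ** p > tau), "<=", d / tau)
```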