Information-theoretically Optimal Sparse PCA

Yash Deshpande and Andrea Montanari
Stanford University
July 3rd, 2014
Problem Definition

Y_λ = √(λ/n) xxᵀ + Z,   Y_λ, Z ∈ ℝ^{n×n}.

- Z symmetric: Z_ij = Z_ji, with Z_ij ∼ Normal(0, 1) independent for i ≤ j
- x_i ∼ Bernoulli(ε), i.i.d.
- Estimate X = xxᵀ from Y_λ
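To make the setup concrete, here is a minimal NumPy sketch of this observation model (the parameter values and the name sample_instance are illustrative choices of mine, not from the talk):

```python
import numpy as np

def sample_instance(n=2000, eps=0.2, lam=50.0, rng=None):
    """Sample (x, Y) from Y = sqrt(lam/n) * x x^T + Z, Z symmetric Gaussian."""
    rng = np.random.default_rng(rng)
    x = rng.binomial(1, eps, size=n).astype(float)  # x_i ~ Bernoulli(eps), i.i.d.
    G = rng.normal(size=(n, n))
    Z = np.triu(G) + np.triu(G, 1).T                # Z_ij = Z_ji, N(0,1) for i <= j
    Y = np.sqrt(lam / n) * np.outer(x, x) + Z
    return x, Y
```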
An example: gene expression data

[Baechler et al, 2003 PNAS]
- Genes × patients matrix
- Blue: lupus patients, aqua: healthy controls
- Black: a subset of immune-system-specific genes

This motivates a simple probabilistic model.
Related work

Detection and estimation: Y = X + noise.
- X ∈ S ⊂ {0, 1}ⁿ, a known set
- Goal: hypothesis testing, support recovery
- [Donoho, Jin 2004], [Addario-Berry et al. 2010], [Arias-Castro et al. 2011] …

Machine learning: maximize ⟨v, Y_λ v⟩ subject to ‖v‖₂ ≤ 1, v sparse.
- Goal: maximize "variance", support recovery
- [d'Aspremont et al. 2004], [Moghaddam et al. 2005], [Zou et al. 2006], [Amini, Wainwright 2009], [Papailiopoulos et al. 2013] …

Information theory: minimize ‖Y_λ − vvᵀ‖²_F + f(v).
- Probabilistic model for x, Y_λ
- Approximate message passing algorithms proposed
- [Rangan, Fletcher 2012], [Kabashima et al. 2014]
A first try: simple PCA

Y_λ = √(λ/n) xxᵀ + Z.

Estimate x using the scaled principal eigenvector x₁(Y_λ).
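A sketch of this baseline, reusing sample_instance from the first snippet (the √(nε) normalization of the overlap matches the next slide):

```python
import numpy as np

x, Y = sample_instance(n=2000, eps=0.2, lam=50.0, rng=0)   # lam * eps^2 = 2 > 1
vals, vecs = np.linalg.eigh(Y)
v1 = vecs[:, -1]                         # principal eigenvector x_1(Y_lambda)
overlap = abs(v1 @ x) / np.sqrt(len(x) * 0.2)
print(overlap)                           # bounded away from 0 iff lam * eps^2 > 1
```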
Limitations of PCA

If λε² > 1:   lim_{n→∞} |⟨x₁(Y_λ), x⟩| / √(nε) > 0 a.s.

If λε² < 1:   lim_{n→∞} |⟨x₁(Y_λ), x⟩| / √(nε) = 0 a.s.

[Knowles, Yin 2011]

[Figure: limiting spectral density of Y_λ/√n, supported approximately on [−2, 2], with and without an outlying spike eigenvalue.]
Our contributions

- Poly-time algorithm that exploits sparsity
- Provably optimal in terms of MSE when ε > εc
- "Single-letter" characterization of MMSE
Single letter characterization

Original high-dimensional problem:

Y_λ = √(λ/n) xxᵀ + Z,   M-mmse(λ, n) ≡ (1/n²) E{‖X − E{X|Y_λ}‖²_F}.

Scalar problem:

Y_λ = √λ X₀ + Z,   S-mmse(λ) ≡ E{(X₀ − E{X₀|Y_λ})²}.

Here X₀ ∼ Bernoulli(ε), Z ∼ Normal(0, 1).
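S-mmse(λ) reduces to a one-dimensional integral. The posterior-mean formula below is my own elementary derivation for the Bernoulli(ε) prior, used by the later sketches:

```python
import numpy as np

def s_mmse(lam, eps, grid=np.linspace(-15.0, 15.0, 6001)):
    """S-mmse(lam) = E{(X0 - E{X0 | sqrt(lam) X0 + Z})^2}, X0 ~ Bernoulli(eps)."""
    phi = lambda u: np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    a = eps * phi(grid - np.sqrt(lam))       # joint weight of {X0 = 1, Y = y}
    b = (1 - eps) * phi(grid)                # joint weight of {X0 = 0, Y = y}
    post = a / (a + b)                       # posterior mean E{X0 | Y = y}
    return eps - np.trapz(post ** 2 * (a + b), grid)   # eps - E{E{X0|Y}^2}
```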
Main result

Theorem (Deshpande, Montanari 2014)
There exists an εc < 1 such that the following happens. For every ε > εc,

lim_{n→∞} M-mmse(λ, n) = ε² − τ∗²,

where τ∗ = ε − S-mmse(λτ∗). Further, there exists a polynomial-time algorithm that achieves this MSE.

εc ≈ 0.05 (solution to a scalar non-linear equation)
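Given s_mmse from the previous sketch, the fixed point and the predicted limiting MMSE are easy to compute numerically. For ε > εc the fixed point is unique, so plain iteration suffices (a shortcut of mine, not the paper's method):

```python
def tau_star(lam, eps, iters=100):
    """Solve tau = eps - S-mmse(lam * tau) by fixed-point iteration."""
    tau = eps                                # start at the largest possible value
    for _ in range(iters):
        tau = eps - s_mmse(lam * tau, eps)
    return tau

t = tau_star(lam=50.0, eps=0.2)
print("limiting M-mmse:", 0.2 ** 2 - t ** 2)   # = eps^2 - tau_*^2
```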
Making use of sparsity

The power iteration with A = Y_λ/√n:   x^{t+1} = A x^t.

Improvement:   x^{t+1} = A F_t(x^t), where F_t(x^t) = (f_t(x^t_1), …, f_t(x^t_n))ᵀ.

Choose f_t to exploit sparsity; one concrete choice is sketched below.
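The talk leaves f_t open at this point. One natural sparsity-exploiting choice, anticipating the Bayes-optimal analysis that follows, is the scalar posterior mean for the effective channel y = μ x_i + √τ z (my instantiation; names and the closed form are mine):

```python
import numpy as np

def posterior_mean_denoiser(y, mu, tau, eps):
    """E{X0 | mu * X0 + sqrt(tau) * Z = y} for X0 ~ Bernoulli(eps).
    A smooth sparsity-aware shrinker: entries with weak evidence go to ~0."""
    log_odds = np.log(eps / (1 - eps)) + (mu * y - mu ** 2 / 2) / tau
    return 1.0 / (1.0 + np.exp(-log_odds))
```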
A heuristic analysis

Expanding the ith entry of x^{t+1}:

x^{t+1}_i = [√λ ⟨x, F_t(x^t)⟩ / n] x_i + (1/√n) Σ_j Z_ij f_t(x^t_j),

where the bracketed coefficient ≈ μ_t and the noise term ≈ Normal(0, τ_t).

Thus x^{t+1} ≈d μ_t x + √τ_t z, where z ∼ Normal(0, I_n).
Approximate Message Passing (AMP)

This analysis is obviously wrong, but… it is asymptotically exact for the modified iteration:

x^{t+1} = A x̃^t − b_t x̃^{t−1},   x̃^t = F_t(x^t).

[Donoho, Maleki, Montanari 2009], [Bayati, Montanari 2011], [Rangan, Fletcher 2012].
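A compact AMP sketch for this model, reusing the denoiser above. The Onsager coefficient b_t = (1/n) Σ_i f_t′(x^t_i) is the standard choice from the cited AMP papers; tracking (μ_t, τ_t) via τ = mean(f²) and μ = √λ·τ is a simplification justified by the state-evolution slides that follow, not the authors' reference code:

```python
import numpy as np

def amp(Y, eps, lam, iters=25):
    """AMP iteration x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}), A = Y / sqrt(n)."""
    n = Y.shape[0]
    A = Y / np.sqrt(n)
    f = np.full(n, eps)                      # f_0: prior mean of x (informative since x >= 0)
    f_old, b = np.zeros(n), 0.0
    for _ in range(iters):
        xt = A @ f - b * f_old               # effective observation ~ mu_t * x + sqrt(tau_t) * z
        tau = np.mean(f ** 2)                # empirical estimate of tau_t
        mu = np.sqrt(lam) * tau              # holds for the posterior-mean f_t
        f_old, f = f, posterior_mean_denoiser(xt, mu, tau, eps)
        b = np.mean(f * (1 - f)) * mu / tau  # Onsager term: average of f_t'
    return f                                 # entrywise estimate of x
```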
Asymptotic behavior

[Figures: histograms of x^t_i − μ_t x_i at t = 2, 4, 8, 12, 16, for the power method (left) and AMP (right). The power-method residuals spread out as t grows, from a range of roughly [−2, 3] at t = 2 to hundreds by t = 16, while the AMP residuals stay Gaussian-shaped and concentrate, shrinking to roughly [−0.1, 0.15] by t = 16.]
Asymptotic behavior: a lemma

Lemma
Let f_t be a sequence of Lipschitz functions. For every fixed t and uniformly random i, the empirical distribution of (x_i, x^t_i) converges to that of (X₀, μ_t X₀ + √τ_t Z) almost surely.
State evolution

Deterministic recursions:

μ_{t+1} = √λ E{X₀ f_t(μ_t X₀ + √τ_t Z)},
τ_{t+1} = E{f_t(μ_t X₀ + √τ_t Z)²}.

With the optimal (posterior-mean) f_t:

μ_{t+1} = √λ τ_{t+1},
τ_{t+1} = ε − S-mmse(λτ_t).
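With s_mmse from before, the optimal-f_t state evolution is a two-line recursion. The start τ₀ = ε² below matches the all-ε initialization used in the AMP sketch; it is my choice, not stated on the slide:

```python
def state_evolution(lam, eps, t_max=30):
    """Track tau_t under tau_{t+1} = eps - S-mmse(lam * tau_t)."""
    taus = [eps ** 2]                        # tau_0 from initializing at the prior mean
    for _ in range(t_max):
        taus.append(eps - s_mmse(lam * taus[-1], eps))
    return taus                              # taus[-1] approximates tau_*
```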
State evolution: an illustration

[Figure: cobweb plot of the map τ_t ↦ ε − S-mmse(λτ_t) against the diagonal τ_{t+1} = τ_t; the iterates τ₁, τ₂, τ₃, … converge to the fixed point τ∗, where M-mmse(λ) = ε² − τ∗².]
Proof sketch: MSE expression

Using the estimator X̂^t = x̃^t (x̃^t)ᵀ:

mse(X̂^t, λ) = (1/n²) E{‖x̃^t (x̃^t)ᵀ − xxᵀ‖²_F}
            = (1/n²) E{‖x‖⁴} + (1/n²) E{‖x̃^t‖⁴} − (2/n²) E{⟨x̃^t, x⟩²}
            → ε² − τ²_{t+1}.

Thus mse_AMP(λ) = lim_{t→∞} lim_{n→∞} mse(X̂^t, λ) = ε² − τ∗².
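As a sanity check, the empirical matrix MSE of the AMP estimator can be compared against the prediction ε² − τ∗², tying the earlier sketches together (finite-n agreement is only approximate, and this inherits all the simplifications above):

```python
import numpy as np

eps, lam = 0.2, 50.0
x, Y = sample_instance(n=3000, eps=eps, lam=lam, rng=1)
f = amp(Y, eps, lam)
emp = np.sum((np.outer(f, f) - np.outer(x, x)) ** 2) / len(x) ** 2
print("empirical:", emp, "predicted:", eps ** 2 - tau_star(lam, eps) ** 2)
```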
Proof sketch: I-MMSE identity

Since AMP is one particular estimator, M-mmse(λ) ≤ mse_AMP(λ) for every λ. Integrating:

(1/4) ∫₀^∞ M-mmse(λ) dλ ≤ (1/4) ∫₀^∞ mse_AMP(λ) dλ.

By the I-MMSE identity, the left-hand side equals I(X; Y_∞) − I(X; Y_0) = h(ε) (the binary entropy), while the right-hand side is

(1/4) ∫₀^∞ (ε² − τ∗(ε, λ)²) dλ = h(ε).

The two integrals coincide, so the pointwise inequality is an equality for almost every λ: M-mmse(λ) = ε² − τ∗².
Conclusion

Some open problems…
- MMSE characterization with multiple fixed points
- General distributions for x