Low-rank Matrix Estimation via Approximate Message Passing


  1. Low-rank Matrix Estimation via Approximate Message Passing
     Andrea Montanari (Stanford University) and Ramji Venkataramanan (University of Cambridge)
     WoLA 2018

  2. The Spiked Model

     A = Σ_{i=1}^k λ_i v_i v_i^T + W ∈ R^{n×n}

     • λ_1 ≥ λ_2 ≥ ... ≥ λ_k are deterministic scalars
     • v_1, ..., v_k ∈ R^n are orthonormal vectors
     • W ∼ GOE(n), i.e. W is symmetric with (W_ii)_{i≤n} ∼ i.i.d. N(0, 2/n) and (W_ij)_{i<j≤n} ∼ i.i.d. N(0, 1/n)

     GOAL: estimate the vectors v_1, ..., v_k from A
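
As an illustrative sketch (not part of the slides), one way to sample an instance of this model with NumPy is shown below; the function name `spiked_matrix` and the argument names are choices made here for illustration.

```python
import numpy as np

def spiked_matrix(V, lams, rng):
    """Sample A = sum_i lams[i] * v_i v_i^T + W with W ~ GOE(n).

    V    : (n, k) array whose columns v_1, ..., v_k are orthonormal.
    lams : length-k array of deterministic scalars lambda_1 >= ... >= lambda_k.
    GOE(n): symmetric, off-diagonal entries N(0, 1/n), diagonal entries N(0, 2/n).
    """
    n = V.shape[0]
    G = rng.normal(size=(n, n)) / np.sqrt(n)
    W = (G + G.T) / np.sqrt(2)       # symmetric; variance 1/n off-diagonal, 2/n on the diagonal
    return (V * lams) @ V.T + W      # sum_i lambda_i v_i v_i^T + W
```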

  3. Spectrum of the spiked matrix

     A = Σ_{i=1}^k λ_i v_i v_i^T + W

     Random matrix theory and the 'BBP' phase transition:
     • The bulk of the eigenvalues of A lies in [-2, 2], distributed according to Wigner's semicircle law
     • Outlier eigenvalues appear for the |λ_i|'s greater than 1: z_i → λ_i + 1/λ_i > 2
     • The eigenvectors ϕ_i corresponding to the outliers z_i satisfy |⟨ϕ_i, v_i⟩| → √(1 - λ_i^{-2})

     [Baik, Ben Arous, Péché '05], [Baik, Silverstein '06], [Capitaine, Donati-Martin, Féral '09], [Benaych-Georges, Nadakuditi '11], ...
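
A quick, hypothetical numerical check of these predictions for a single spike, reusing the `spiked_matrix` sketch above; the values n = 4000 and λ = 1.5 are arbitrary illustrations.

```python
import numpy as np

n, lam = 4000, 1.5
rng = np.random.default_rng(0)
v = rng.choice([-1.0, 1.0], size=(n, 1)) / np.sqrt(n)    # single unit-norm spike
A = spiked_matrix(v, np.array([lam]), rng)

evals, evecs = np.linalg.eigh(A)
z1, phi1 = evals[-1], evecs[:, -1]
print(z1, lam + 1 / lam)                                  # outlier location vs. prediction
print(abs(phi1 @ v[:, 0]), np.sqrt(1 - lam ** -2))        # eigenvector overlap vs. prediction
```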

  4. Structural information

     A = Σ_{i=1}^k λ_i v_i v_i^T + W

     When the v_i's are unstructured, e.g., drawn uniformly at random from the unit sphere:
     • The best estimator of v_i is the i-th eigenvector ϕ_i
     • If |λ_i| ≥ 1, then |⟨v_i, ϕ_i⟩| → √(1 - 1/λ_i^2)

  5. Structural information (continued)

     But we often have structural information about the v_i's:
     • For example, the v_i's may be sparse, bounded, non-negative, etc.
     • Relevant for many applications: sparse PCA, non-negative PCA, community detection under the stochastic block model, ...
     • This structure can be used to improve on spectral methods

  6. Prior on the eigenvectors

     A = Σ_{i=1}^k λ_i v_i v_i^T + W ≡ V Λ V^T + W,    V = [v_1 v_2 ... v_k] ∈ R^{n×k}

     If each row of V is ∼ i.i.d. P_V, then the Bayes-optimal estimator (for squared error) is V̂_Bayes = E[V | A]
     • Generally not computable
     • Closed-form expressions are available for the asymptotic Bayes error: [Deshpande, Montanari '14], [Barbier et al. '16], [Lesieur et al. '17], [Miolane, Lelarge '16], ...

  7. Computable estimators

     A = Σ_{i=1}^k λ_i v_i v_i^T + W ≡ V Λ V^T + W

     • Convex relaxations generally do not achieve the Bayes-optimal error [Javanmard, Montanari, Ricci-Tersenghi '16]
     • MCMC can approximate the Bayes estimator, but it can have a very large mixing time and is hard to analyze

  8. Computable estimators (continued)

     In this talk: an Approximate Message Passing (AMP) algorithm to estimate V

  9. Rank-one spiked model

     A = (λ/n) v v^T + W,    v ∼ i.i.d. P_V,    E[V^2] = 1

     Power iteration for the principal eigenvector: x^{t+1} = A x^t, with x^0 chosen at random

  10. Rank-one spiked model (continued)

     AMP:
     x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}),    with b_t = (1/n) Σ_{i=1}^n f_t'(x_i^t)

     • The non-linear function f_t is chosen based on the structural information about v
     • The memory term ensures a nice distributional property for the iterates in high dimensions
     • The iteration can be derived via an approximation of the belief propagation equations
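
A minimal sketch of this iteration, as an illustration rather than the authors' reference implementation: `f(x, t)` and `df(x, t)` are user-supplied functions applying f_t and its derivative f_t' entrywise, and the memory term is dropped at the first step (i.e. f_{-1}(x^{-1}) is taken to be 0, an assumption of this sketch).

```python
import numpy as np

def amp(A, f, df, x0, T):
    """Run T steps of x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}),
    with Onsager coefficient b_t = (1/n) * sum_i f_t'(x_i^t)."""
    n = A.shape[0]
    x, fx_prev = x0, np.zeros(n)   # convention of this sketch: f_{-1}(x^{-1}) = 0
    for t in range(T):
        fx = f(x, t)               # f_t applied entrywise to x^t
        b = np.mean(df(x, t))      # memory (Onsager) coefficient b_t
        x, fx_prev = A @ fx - b * fx_prev, fx
    return x
```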

  11. State evolution

     x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}),    with b_t = (1/n) Σ_{i=1}^n f_t'(x_i^t)

     If we initialize with x^0 independent of A, then as n → ∞:
     x^t → μ_t v + σ_t g
     • g ∼ i.i.d. N(0, 1), independent of v ∼ i.i.d. P_V

     [Bayati, Montanari '11], [Rangan, Fletcher '12], [Deshpande, Montanari '14]

  12. State evolution (continued)

     • The scalars μ_t, σ_t^2 are recursively determined by
       μ_{t+1} = λ E[V f_t(μ_t V + σ_t G)],    σ_{t+1}^2 = E[f_t(μ_t V + σ_t G)^2]
     • Initialize with μ_0 = (1/n) |E⟨x^0, v⟩|
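
The recursion can be tracked numerically by Monte Carlo; a hedged sketch, with V ∼ uniform{+1, -1} used purely as an example prior and all names chosen here for illustration:

```python
import numpy as np

def state_evolution(lam, f, mu0, sigma0, T, n_mc=200_000, seed=0):
    """Monte Carlo sketch of the recursion
        mu_{t+1}      = lam * E[V f_t(mu_t V + sigma_t G)]
        sigma_{t+1}^2 =       E[f_t(mu_t V + sigma_t G)^2]
    with V ~ uniform{+1, -1} (an illustrative choice of P_V) and G ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    V = rng.choice([-1.0, 1.0], size=n_mc)
    G = rng.normal(size=n_mc)
    mu, sigma = mu0, sigma0
    for t in range(T):
        fY = f(mu * V + sigma * G, t)
        mu, sigma = lam * np.mean(V * fY), np.sqrt(np.mean(fY ** 2))
    return mu, sigma
```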

  13. Bayes-optimal AMP

     Assuming x^t = μ_t v + σ_t g, choose f_t(y) = E[V | μ_t V + σ_t G = y]

     The state evolution then becomes γ_{t+1} = λ^2 (1 - mmse(γ_t)), with μ_t = σ_t^2 = γ_t

     (Illustrative example: P_V ∼ uniform{+1, -1}, λ = √2.)

     The initial value is γ_0 ∝ (1/n) |E⟨x^0, v⟩|; what is lim_{t→∞} γ_t?
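
For the example prior P_V ∼ uniform{+1, -1}, the posterior-mean denoiser has the familiar tanh form; a small sketch (the function names are ours):

```python
import numpy as np

def f_bayes_pm1(y, mu, sigma):
    """Posterior mean E[V | mu*V + sigma*G = y] for V ~ uniform{+1, -1}:
    the Gaussian-channel calculation gives tanh(mu * y / sigma^2)."""
    return np.tanh(mu * y / sigma ** 2)

def df_bayes_pm1(y, mu, sigma):
    """Derivative of f_bayes_pm1 in y, as needed for the Onsager coefficient b_t."""
    return (mu / sigma ** 2) * (1.0 - np.tanh(mu * y / sigma ** 2) ** 2)
```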

  14. Fixed points of state evolution

     • If E⟨x^0, v⟩ = 0, then γ_t = 0 is an (unstable) fixed point.
     • This is the case in problems where v has zero mean, as x^0 is independent of v.

  15. Spectral initialization

     A = (λ/n) v v^T + W,    λ > 1

     • Compute ϕ_1, the principal eigenvector of A
     • Run AMP with the initialization x^0 = √n ϕ_1
     • Then γ_0 > 0, since (1/n) |E⟨x^0, v⟩| → √(1 - λ^{-2})
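
A hypothetical sketch of this initialization, assuming `A` is the observed spiked matrix (e.g. sampled as in the earlier sketch) with λ > 1; the resulting `x0` can then be passed to the `amp` sketch above.

```python
import numpy as np

evals, evecs = np.linalg.eigh(A)       # A: observed spiked matrix (assumed given)
phi1 = evecs[:, np.argmax(evals)]      # principal eigenvector of A
x0 = np.sqrt(A.shape[0]) * phi1        # x^0 = sqrt(n) * phi_1
```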

  16. AMP with spectral initialization

     A = (λ/n) v v^T + W,    x^0 = √n ϕ_1,    x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1})

     The existing AMP analysis does not apply when the initialization x^0 is correlated with v.

  17. AMP analysis with spectral initialization

     A = (λ/n) v v^T + W

     Let (ϕ_1, z_1) be the principal eigenvector and eigenvalue of A. Instead of A, we will analyze AMP on

     Ã = z_1 ϕ_1 ϕ_1^T + P_⊥ ((λ/n) v v^T + W̃) P_⊥

     • P_⊥ = I - ϕ_1 ϕ_1^T
     • W̃ ∼ GOE(n) is independent of W
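
The surrogate matrix Ã can be generated explicitly; a hedged sketch under illustrative choices (n, λ, the ±1 spike, and all variable names are ours, not from the slides):

```python
import numpy as np

n, lam = 4000, 1.5
rng = np.random.default_rng(0)
v = rng.choice([-1.0, 1.0], size=n)                # spike with E[V^2] = 1, so ||v||^2 ~ n
G = rng.normal(size=(n, n)) / np.sqrt(n)
W = (G + G.T) / np.sqrt(2)                         # W ~ GOE(n)
A = lam / n * np.outer(v, v) + W

evals, evecs = np.linalg.eigh(A)
z1, phi1 = evals[-1], evecs[:, -1]                 # principal eigenpair of A
P_perp = np.eye(n) - np.outer(phi1, phi1)          # P_perp = I - phi_1 phi_1^T

G2 = rng.normal(size=(n, n)) / np.sqrt(n)
W_tilde = (G2 + G2.T) / np.sqrt(2)                 # fresh GOE(n), independent of W
A_tilde = (z1 * np.outer(phi1, phi1)
           + P_perp @ (lam / n * np.outer(v, v) + W_tilde) @ P_perp)
```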

  18. True vs. conditional model

     A = (λ/n) v v^T + W
     Ã = z_1 ϕ_1 ϕ_1^T + P_⊥ ((λ/n) v v^T + W̃) P_⊥

     Lemma. Let E_ε = { (z_1, ϕ_1) : |z_1 - (λ + λ^{-1})| ≤ ε, ⟨ϕ_1, v/||v||⟩^2 ≥ 1 - λ^{-2} - ε }. Then

     sup_{(z_1, ϕ_1) ∈ E_ε} || P(A ∈ · | z_1, ϕ_1) - P(Ã ∈ · | z_1, ϕ_1) ||_TV ≤ c(ε) e^{-n c(ε)}

  19. AMP on the conditional model

     Ã = z_1 ϕ_1 ϕ_1^T + P_⊥ ((λ/n) v v^T + W̃) P_⊥

     AMP with Ã instead of A:
     x̃^0 = √n ϕ_1,    x̃^{t+1} = Ã f_t(x̃^t) - b_t f_{t-1}(x̃^{t-1})

     Analyze using the existing AMP analysis together with results from random matrix theory.

  20. Model assumptions

     A = (λ/n) v v^T + W

     Let v = v(n) ∈ R^n be a sequence of vectors such that the empirical distribution of the entries of v(n) converges weakly to P_V.

  21. Model assumptions (continued)

     The performance of any estimator v̂ is measured via a loss function ψ: R × R → R:
     ψ(v, v̂) = (1/n) Σ_{i=1}^n ψ(v_i, v̂_i).

     ψ is assumed to be pseudo-Lipschitz: |ψ(x) - ψ(y)| ≤ C ||x - y||_2 (1 + ||x||_2 + ||y||_2) for all x, y ∈ R^2.
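
For instance, the squared loss ψ(a, b) = (a - b)^2 is pseudo-Lipschitz; a trivial sketch of the empirical loss (names are ours):

```python
import numpy as np

def empirical_loss(v, v_hat, psi=lambda a, b: (a - b) ** 2):
    """(1/n) * sum_i psi(v_i, vhat_i); squared loss is one pseudo-Lipschitz example."""
    return float(np.mean([psi(a, b) for a, b in zip(v, v_hat)]))
```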

  22. Result for the rank-one case

     A = (λ/n) v v^T + W

     Theorem. Let λ > 1. Consider the AMP iteration x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1}).
     • Assume f_t: R → R is Lipschitz continuous
     • Initialize with x^0 = √n ϕ_1
     Then for any pseudo-Lipschitz loss function ψ and any t ≥ 0,

     lim_{n→∞} (1/n) Σ_{i=1}^n ψ(v_i, x_i^t) = E{ψ(V, μ_t V + σ_t G)}    a.s.

  23. Result for the rank-one case (continued)

     The state evolution parameters are recursively defined as
     μ_{t+1} = λ E[V f_t(μ_t V + σ_t G)],    σ_{t+1}^2 = E[f_t(μ_t V + σ_t G)^2],
     with initialization μ_0 = √(1 - λ^{-2}) and σ_0 = 1/λ.

  24. Bayes-optimal AMP

     A = (λ/n) v v^T + W,    x^{t+1} = A f_t(x^t) - b_t f_{t-1}(x^{t-1})

     • Bayes-optimal choice: f_t(y) = λ E(V | γ_t V + √γ_t G = y)
     • State evolution: γ_{t+1} = λ^2 (1 - mmse(γ_t)),    γ_0 = λ^2 - 1,
       where mmse(γ) = E[(V - E(V | √γ V + G))^2]
     • μ_t = σ_t^2 = γ_t
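
For the ±1 example prior, mmse(γ) can be estimated by Monte Carlo, which makes the γ recursion easy to track; a hedged sketch (function names and sample sizes are ours):

```python
import numpy as np

def mmse_pm1(g, n_mc=500_000, seed=0):
    """Monte Carlo estimate of mmse(g) = E[(V - E[V | sqrt(g) V + G])^2]
    for V ~ uniform{+1, -1}, where E[V | y] = tanh(sqrt(g) * y)."""
    rng = np.random.default_rng(seed)
    V = rng.choice([-1.0, 1.0], size=n_mc)
    Y = np.sqrt(g) * V + rng.normal(size=n_mc)
    return 1.0 - np.mean(np.tanh(np.sqrt(g) * Y) ** 2)

def gamma_trajectory(lam, T=20):
    """Iterate gamma_{t+1} = lam^2 * (1 - mmse(gamma_t)) from gamma_0 = lam^2 - 1."""
    gammas = [lam ** 2 - 1.0]
    for _ in range(T):
        gammas.append(lam ** 2 * (1.0 - mmse_pm1(gammas[-1])))
    return gammas
```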

  25. Bayes-optimal AMP

     Let γ_AMP(λ) be the smallest strictly positive solution of

     γ = λ^2 [1 - mmse(γ)].    (1)

     Then the AMP estimate x̂^t = f_t(x^t) achieves

     lim_{t→∞} lim_{n→∞} min_{s ∈ {+1, -1}} (1/n) ||x̂^t - s v||^2 = 1 - γ_AMP(λ)/λ^2
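
A hedged numerical sketch of equation (1) for the ±1 example prior, reusing the `mmse_pm1` helper sketched above; λ = 1.5 is an arbitrary illustration, and simple fixed-point iteration from γ_0 = λ^2 - 1 is used to locate a positive solution of (1).

```python
def gamma_amp(lam, iters=50):
    """Fixed-point iteration for gamma = lam^2 * [1 - mmse(gamma)], eq. (1)."""
    g = lam ** 2 - 1.0
    for _ in range(iters):
        g = lam ** 2 * (1.0 - mmse_pm1(g))
    return g

lam = 1.5
g_star = gamma_amp(lam)
print("asymptotic per-coordinate MSE:", 1.0 - g_star / lam ** 2)
```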

  26. Bayes-optimal AMP

     With γ_AMP(λ) as in (1), the AMP estimate x̂^t = f_t(x^t) also achieves

     Overlap:    lim_{t→∞} lim_{n→∞} |⟨x̂^t, v⟩| / (||x̂^t||_2 ||v||_2) = √(γ_AMP(λ)) / λ
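
Continuing the fixed-point sketch above, the predicted asymptotic overlap for the same illustrative λ is:

```python
print("predicted asymptotic overlap:", (g_star ** 0.5) / lam)   # sqrt(gamma_AMP(lam)) / lam
```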
