Understanding Statistical-vs-Computational Tradeoffs via the Low-Degree Likelihood Ratio
(PowerPoint presentation)

Alex Wein, Courant Institute, NYU
Joint work with: Afonso Bandeira (ETH Zurich), Yunzi Ding (NYU), Tim Kunisky (NYU)

Motivation


How to Show that a Problem is Hard?

We don't know how to prove that average-case problems are hard, but there are various forms of "rigorous evidence":
- Reductions (e.g. from planted clique) [Berthet, Rigollet '13; Brennan, Bresler, ...]
- Failure of MCMC [Jerrum '92]
- Shattering of the solution space [Achlioptas, Coja-Oghlan '08]
- Failure of local algorithms [Gamarnik, Sudan '13]
- Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová '11]
- Optimization landscape, Kac-Rice formula [Auffinger, Ben Arous, Černý '10]
- Statistical query lower bounds [Feldman, Grigorescu, Reyzin, Vempala, Xiao '12]
- Sum-of-squares lower bounds [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin '16]
- This talk: the "low-degree method" [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin '16; Hopkins, Steurer '17; Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer '17; Hopkins '18 (PhD thesis)]

The Low-Degree Method (e.g. [Hopkins, Steurer '17])

Suppose we want to hypothesis test, with error probability o(1), between two distributions:
- Null model: Y ∼ Q_n, e.g. G(n, 1/2)
- Planted model: Y ∼ P_n, e.g. G(n, 1/2) ∪ {random k-clique}

Look for a degree-D (multivariate) polynomial f : R^{n×n} → R that distinguishes P from Q. We want f(Y) to be big when Y ∼ P and small when Y ∼ Q. Compute

    max_{f deg ≤ D}  E_{Y∼P}[f(Y)] / sqrt(E_{Y∼Q}[f(Y)²])

(the numerator is the mean of f under P; the denominator measures its fluctuations under Q).
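To make this quantity concrete, here is a minimal Monte Carlo sketch (my own illustration, not from the slides) that estimates E_{Y∼P}[f(Y)] / sqrt(E_{Y∼Q}[f(Y)²]) for one hand-picked degree-1 polynomial, the centered edge count, in the planted-clique example. All parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_Q(n):
    """Adjacency matrix of G(n, 1/2), with edges encoded as +/-1 (so centered under Q)."""
    A = np.triu(rng.choice([-1.0, 1.0], size=(n, n)), k=1)
    return A + A.T

def sample_P(n, k):
    """G(n, 1/2) with a planted k-clique: clique edges forced to +1."""
    Y = sample_Q(n)
    S = rng.choice(n, size=k, replace=False)
    Y[np.ix_(S, S)] = 1.0
    np.fill_diagonal(Y, 0.0)
    return Y

def f(Y):
    """A degree-1 test polynomial: sum of the (centered) edge variables."""
    return np.triu(Y, k=1).sum()

n, k, trials = 200, 30, 500
mean_P = np.mean([f(sample_P(n, k)) for _ in range(trials)])
second_moment_Q = np.mean([f(sample_Q(n)) ** 2 for _ in range(trials)])
print("E_P[f] / sqrt(E_Q[f^2]) ~", mean_P / np.sqrt(second_moment_Q))
# Roughly (k choose 2) / sqrt(n choose 2): omega(1) when k >> sqrt(n), O(1) otherwise.
```

The low-degree method asks for the best such f over all polynomials of degree at most D, not just this single hand-picked one.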

The Low-Degree Method (e.g. [Hopkins, Steurer '17])

Define the inner product ⟨f, g⟩ = E_{Y∼Q}[f(Y) g(Y)] and norm ‖f‖ = sqrt(⟨f, f⟩), and let L(Y) = dP/dQ (Y) be the likelihood ratio. Then

    max_{f deg ≤ D}  E_{Y∼P}[f(Y)] / sqrt(E_{Y∼Q}[f(Y)²])
      = max_{f deg ≤ D}  E_{Y∼Q}[L(Y) f(Y)] / sqrt(E_{Y∼Q}[f(Y)²])
      = max_{f deg ≤ D}  ⟨L, f⟩ / ‖f‖
      = ‖L^{≤D}‖,

with maximizer f = L^{≤D} := the projection of L onto the subspace of polynomials of degree at most D. This is the norm of the low-degree likelihood ratio.
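As a sanity check on this chain of equalities, the following sketch (a toy example I added, not from the talk) maximizes E_P[f]/sqrt(E_Q[f²]) over degree-D polynomials in one dimension, with Q = N(0,1) and P = N(μ,1) for an arbitrarily chosen μ. Writing f(y) = Σ_d a_d y^d turns the objective into a generalized Rayleigh quotient a·b / sqrt(aᵀMa) with b_d = E_P[Y^d] and M_{de} = E_Q[Y^{d+e}], whose maximum is sqrt(bᵀM⁻¹b) = ‖L^{≤D}‖; this should match the closed form Σ_{d≤D} μ^{2d}/d! (a special case of the general formula given later in the talk, with X deterministically equal to μ).

```python
import numpy as np
from math import factorial

def gaussian_moment(mu, k):
    """E[Y^k] for Y ~ N(mu, 1), via the binomial/double-factorial formula."""
    return sum(
        (factorial(k) // (factorial(j) * factorial(k - 2 * j) * 2 ** j)) * mu ** (k - 2 * j)
        for j in range(k // 2 + 1)
    )

mu, D = 0.7, 5
b = np.array([gaussian_moment(mu, d) for d in range(D + 1)], dtype=float)
M = np.array([[gaussian_moment(0.0, d + e) for e in range(D + 1)]
              for d in range(D + 1)], dtype=float)
best_ratio = np.sqrt(b @ np.linalg.solve(M, b))          # = ||L^{<=D}||
closed_form = np.sqrt(sum(mu ** (2 * d) / factorial(d) for d in range(D + 1)))
print(best_ratio, closed_form)                           # agree up to floating-point error
```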

The Low-Degree Method

Conclusion:

    max_{f deg ≤ D}  E_{Y∼P}[f(Y)] / sqrt(E_{Y∼Q}[f(Y)²]) = ‖L^{≤D}‖

Heuristically:
- ‖L^{≤D}‖ = ω(1): some degree-D polynomial can distinguish Q, P
- ‖L^{≤D}‖ = O(1): degree-D polynomials fail

Conjecture (informal variant of [Hopkins '18]): For "nice" Q, P, if ‖L^{≤D}‖ = O(1) for some D = ω(log n), then no polynomial-time algorithm can distinguish Q, P with success probability 1 − o(1).

Degree-O(log n) polynomials ⇔ polynomial-time algorithms

Formal Consequences of the Low-Degree Method

The case D = ∞: if ‖L‖ = O(1) (as n → ∞), then no test can distinguish Q from P (with success probability 1 − o(1)).
- Classical second moment method.

If ‖L^{≤D}‖ = O(1) for some D = ω(log n), then no spectral method can distinguish Q from P (in a particular sense) [Kunisky, W, Bandeira '19].
- Spectral method: threshold the top eigenvalue of a poly-size matrix M = M(Y) whose entries are O(1)-degree polynomials in Y (a concrete instance is sketched below).
- Proof idea: consider the polynomial f(Y) = Tr(M^q) with q = Θ(log n).
- Spectral methods are believed to be as powerful as sum-of-squares for average-case problems [HKPRSS '17].
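For intuition about what "spectral method" means here, the sketch below (my construction, with arbitrary parameters) instantiates it on the planted-clique example: M(Y) is the ±1 adjacency matrix itself (entries are degree-1 polynomials in Y), the test thresholds its top eigenvalue, and f(Y) = Tr(M^{2q}) with q of order log n is the low-degree polynomial that mimics it, as in the proof idea above.

```python
import numpy as np

rng = np.random.default_rng(1)

def adjacency(n, k=0):
    """+/-1 adjacency matrix of G(n, 1/2), optionally with a planted k-clique."""
    A = np.triu(rng.choice([-1.0, 1.0], size=(n, n)), k=1)
    A = A + A.T
    if k:
        S = rng.choice(n, size=k, replace=False)
        A[np.ix_(S, S)] = 1.0
        np.fill_diagonal(A, 0.0)
    return A

n, q = 400, 6                                    # q of order log n
for k in (0, 60):                                # null vs a clique of size k >> sqrt(n)
    M = adjacency(n, k)
    top_eig = np.linalg.eigvalsh(M)[-1]
    poly_stat = np.trace(np.linalg.matrix_power(M, 2 * q))   # a degree-(2q) polynomial in Y
    print(k, round(top_eig, 1), f"{poly_stat:.2e}")
# Null: top eigenvalue ~ 2*sqrt(n) ~ 40. Planted: the clique contributes an
# eigenvalue around k = 60, and the polynomial statistic Tr(M^12) jumps accordingly.
```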

Low-Degree Method: Recap

Given a hypothesis testing question Q_n vs P_n:

Take D ≈ log n. Compute/bound ‖L^{≤D}‖ in the limit n → ∞.
- If ‖L^{≤D}‖ = ω(1), this suggests that the problem is poly-time solvable.
- If ‖L^{≤D}‖ = O(1), this suggests that the problem is NOT poly-time solvable (and gives rigorous evidence: spectral methods fail).

Advantages of the Low-Degree Method

- Possible to calculate/bound ‖L^{≤D}‖ for many problems
- Predictions seem "correct"!
- Planted clique, sparse PCA, stochastic block model, ...
- (Relatively) simple
- Much simpler than sum-of-squares lower bounds
- Detection vs certification
- General: no assumptions on Q, P
- Captures sharp thresholds [Hopkins, Steurer '17]
- By varying the degree D, can explore runtimes other than polynomial
- Conjecture (Hopkins '18): degree-D polynomials ⇔ time-n^{Θ̃(D)} algorithms
- No ingenuity required
- Interpretable

How to Compute ‖L^{≤D}‖

Additive Gaussian noise: P: Y = X + Z vs Q: Y = Z, where X is drawn from an arbitrary prior over R^N and Z has i.i.d. N(0,1) entries.

    L(Y) = dP/dQ (Y) = E_X exp(−½‖Y − X‖²) / exp(−½‖Y‖²) = E_X exp(⟨Y, X⟩ − ½‖X‖²)

Expand L = Σ_α c_α h_α, where {h_α} are the Hermite polynomials (an orthonormal basis w.r.t. Q). Then

    ‖L^{≤D}‖² = Σ_{|α| ≤ D} c_α²,  where c_α = ⟨L, h_α⟩ = E_{Y∼Q}[L(Y) h_α(Y)].

Carrying out this calculation gives the result:

    ‖L^{≤D}‖² = Σ_{d=0}^{D} (1/d!) E_{X,X'}[⟨X, X'⟩^d],   where X' is an independent copy of X.
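Below is a hedged Monte Carlo sketch (my own example, not from the slides) that plugs one illustrative prior, X = θ·x with x having i.i.d. ±1/√N entries, into this final formula. The specific parameter values carry no significance; the point is only that the estimated norm stays O(1) for weak signals and blows up for strong ones.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(2)

def low_degree_norm_sq(N, theta, D, trials=10000):
    """Monte Carlo estimate of sum_{d<=D} E[<X, X'>^d] / d! for X = theta * Rademacher / sqrt(N)."""
    X = theta * rng.choice([-1.0, 1.0], size=(trials, N)) / np.sqrt(N)
    Xp = theta * rng.choice([-1.0, 1.0], size=(trials, N)) / np.sqrt(N)
    overlaps = np.einsum("ij,ij->i", X, Xp)      # one <X, X'> per independent pair
    return sum(np.mean(overlaps ** d) / factorial(d) for d in range(D + 1))

N, D = 300, 8
for theta in (1.0, 3.0, 10.0, 30.0):
    print(theta, round(low_degree_norm_sq(N, theta, D), 2))
# The estimate stays near 1 for small theta and explodes once theta is large,
# matching the "O(1) = hard, omega(1) = easy" reading of ||L^{<=D}||.
```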

References

For more on the low-degree method:
- Samuel B. Hopkins, PhD thesis '18: "Statistical Inference and the Sum of Squares Method" (connection to SoS)
- Survey article: Kunisky, W, Bandeira, "Notes on Computational Hardness of Hypothesis Testing: Predictions using the Low-Degree Likelihood Ratio", arXiv:1907.11636

Part II: Sparse PCA

Based on: Ding, Kunisky, W., Bandeira, "Subexponential-Time Algorithms for Sparse PCA", arXiv:1907.11635

Spiked Wigner Model

Observe the n × n matrix Y = λ xxᵀ + W.
- Signal: x ∈ R^n, ‖x‖ = 1
- Noise: W ∈ R^{n×n} symmetric with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d.
- λ > 0: signal-to-noise ratio

Goal: given Y, estimate the signal x. Or, even simpler: distinguish (w.h.p.) Y from pure noise W.

Structure: suppose x is drawn from some prior, e.g.
- spherical (uniform on the unit sphere)
- Rademacher (i.i.d. ±1/√n)
- sparse
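As a reference point, here is a small sampler (my own helper names, consistent with the model above) for Y = λ xxᵀ + W under each of the three example priors.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_x(n, prior="rademacher", k=None):
    """Draw a unit-norm signal x from one of the example priors."""
    if prior == "spherical":                     # uniform on the unit sphere
        x = rng.normal(size=n)
    elif prior == "rademacher":                  # i.i.d. +/- 1/sqrt(n) after normalization
        x = rng.choice([-1.0, 1.0], size=n)
    elif prior == "sparse":                      # k nonzero +/- entries, rest zero
        x = np.zeros(n)
        support = rng.choice(n, size=k, replace=False)
        x[support] = rng.choice([-1.0, 1.0], size=k)
    else:
        raise ValueError(prior)
    return x / np.linalg.norm(x)

def spiked_wigner(n, lam, **prior_kwargs):
    """Return (Y, x) with Y = lam * x x^T + W, W symmetric with N(0, 1/n) entries."""
    x = sample_x(n, **prior_kwargs)
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2 * n)               # symmetric; off-diagonal variance 1/n
    return lam * np.outer(x, x) + W, x

Y, x = spiked_wigner(1000, lam=1.5, prior="sparse", k=50)
```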

PCA (Principal Component Analysis)

Y = λ xxᵀ + W. PCA: take the top eigenvalue λ_1(Y) and the (unit-norm) top eigenvector v_1.

Theorem (BBP '05, FP '06). Almost surely, as n → ∞:
- If λ ≤ 1: λ_1(Y) → 2 and ⟨x, v_1⟩ → 0
- If λ > 1: λ_1(Y) → λ + 1/λ > 2 and ⟨x, v_1⟩² → 1 − 1/λ² > 0

Sharp threshold: PCA can detect and recover the signal iff λ > 1.

[J. Baik, G. Ben Arous, S. Péché, Ann. Probab. 2005; D. Féral, S. Péché, Comm. Math. Phys. 2006]
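A quick numerical check of this picture is easy to run. The sketch below (illustrative only, with a Rademacher prior and arbitrary parameters) compares λ_1(Y) and ⟨x, v_1⟩² against the limits in the theorem.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000

for lam in (0.5, 0.9, 1.5, 3.0):
    x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)    # unit-norm Rademacher spike
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2 * n)                      # symmetric noise, entries ~ N(0, 1/n)
    Y = lam * np.outer(x, x) + W
    evals, evecs = np.linalg.eigh(Y)
    lam1, v1 = evals[-1], evecs[:, -1]
    pred_eig = lam + 1 / lam if lam > 1 else 2.0
    pred_overlap_sq = max(0.0, 1 - 1 / lam ** 2)
    print(f"lam={lam}: lam1={lam1:.2f} (limit {pred_eig:.2f}), "
          f"<x,v1>^2={(x @ v1) ** 2:.2f} (limit {pred_overlap_sq:.2f})")
```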

Is PCA Optimal?

PCA does not exploit the structure of the signal x. Is the PCA threshold (λ = 1) optimal?
- Is it statistically possible to detect/recover when λ < 1?
