
Edgeworth and confidence interval correction in spiked PCA
Iain Johnstone & Jeha Yang
Statistics & Biomedical Data Science, Stanford & Two Sigma
Shanghai, December 10, 2019

Viral protein mutations and spiked models
Quadeer et al., PLOS Comp. Bio. 2018

A suggestive simulation on correlation matrices [David Morales, Matt McKay]
2nd eigenvalue; ρ_1 = 0.2, ρ_2 = 0.1
[Figure: two histograms of the sample eigenvalue (2nd leading eigenvalue).
Left panel: c = 0.2, N = 300, N1 = 10, simple spks = [2.8, 1.9], deg. spks = [0.8, 0.9]; mean = 2.305 [2.322], std = 0.053 [0.054] (0.89 xPaul).
Right panel: c = 0.2, N = 300, N1 = 30, simple spks = [6.8, 3.9], deg. spks = [0.8, 0.9]; mean = 4.145 [4.169], std = 0.126 [0.125] (0.89 xPaul).]
Theoretical variance is pretty accurate, but there seems to be a shift in the mean (similar to what we've seen before in the eigenvector projections of sample covariance when spikes were close to each other).

Outline
◮ Background on spiked covariance model
◮ Edgeworth correction - single spike
◮ Edgeworth for multiple spikes
◮ Explaining the repulsion correction
◮ Confidence intervals after selection

High dimensional spiked PCA model
◮ Data: $X = [x_1 \cdots x_n]'$ with i.i.d. $x_1, \dots, x_n \sim N_{p+1}(0, \Sigma)$
◮ Large dimensional asymptotic regime: as $n \to \infty$, $\gamma_n := p/n \to \gamma \in (0, \infty)$
◮ Spiked eigenstructure of $\Sigma$: for a fixed $r$, $\underbrace{\ell_1 > \cdots > \ell_r}_{\text{Spikes}} > 1 = \ell_{r+1} = \cdots = \ell_{p+1}$
◮ Statistics: eigenvalues of sample covariance matrix $X'X/n$, $\hat\rho_1 \ge \cdots \ge \hat\rho_{p+1}$
→ w.l.o.g. $\Sigma$ is diagonal
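The model above can be simulated directly. The following is a minimal sketch, assuming NumPy is available; the function name `sample_spiked_eigenvalues`, the seed, and the spike value 3.0 are illustrative choices, not taken from the slides.

```python
import numpy as np

def sample_spiked_eigenvalues(n, p, spikes, rng):
    """Eigenvalues (descending) of X'X/n for rows x_i ~ N_{p+1}(0, Sigma).

    Sigma is taken diagonal (w.l.o.g.) with leading entries `spikes`
    and all remaining entries equal to 1.
    """
    dim = p + 1
    pop = np.ones(dim)
    pop[: len(spikes)] = spikes  # ell_1 > ... > ell_r > 1
    # Scale standard normal columns by sqrt(ell_j) to get covariance diag(pop).
    X = rng.standard_normal((n, dim)) * np.sqrt(pop)
    S = X.T @ X / n
    # eigvalsh returns ascending order; reverse to get rho_hat_1 >= ... >= rho_hat_{p+1}.
    return np.linalg.eigvalsh(S)[::-1]

rng = np.random.default_rng(0)
n, p = 800, 200
gamma_n = p / n
eigs = sample_spiked_eigenvalues(n, p, spikes=[3.0], rng=rng)
print("largest sample eigenvalue:", eigs[0])
print("bulk edge (1 + sqrt(gamma_n))^2:", (1 + np.sqrt(gamma_n)) ** 2)
```

With a spike this far above the threshold $1 + \sqrt{\gamma}$, the largest sample eigenvalue should sit clearly above the bulk edge, while the second one stays near it.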

Largest eigenvalue $\hat\rho_1$: Numerical illustration
$p = 200$, $n = 800$ [i.e. $\gamma_n = p/n = 0.25$]
Spike $h = \ell - 1$: 0, 0.25 (subcritical), $h_+ = 0.5$ (critical), 0.75, 1 (supercritical).
[Figure: densities of $\hat\rho_1$ for each spike value; horizontal axis from 2.1 to 2.9.]

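Finite rank model, K = 1: phase transition
$\Sigma = \mathrm{diag}(\ell_1, 1, \dots, 1)$, $p/n \to \gamma$.
Interior point transition at $\ell_1 = 1 + \sqrt{\gamma}$ [Baik–Ben Arous–Péché, 05]
[Figure: largest sample eigenvalue against $\ell_1$. Below the critical point $\ell_1 = 1 + \sqrt{\gamma}$ it stays at the bulk edge $(1 + \sqrt{\gamma})^2$, with Tracy-Widom fluctuations of order $n^{-2/3}$.]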

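Finite rank model, K = 1: phase transition
$\Sigma = \mathrm{diag}(\ell_1, 1, \dots, 1)$, $p/n \to \gamma$.
Interior point transition at $\ell_1 = 1 + \sqrt{\gamma}$ [Baik–Ben Arous–Péché, 05]
[Figure: largest sample eigenvalue against $\ell_1$. Above the critical point $\ell_1 = 1 + \sqrt{\gamma}$ it separates from the bulk edge $(1 + \sqrt{\gamma})^2$, with Gaussian fluctuations of order $n^{-1/2}$ and a bias relative to $\ell_1$.]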

Largest eigenvalue $\hat\rho_1$: Numerical illustration
$p = 200$, $n = 800$ [i.e. $\gamma_n = p/n = 0.25$]
Spike $h$ = 0, 0.25 (subcritical), $h_+ = 0.5$ (critical), 0.75, 1 (supercritical).
Edge: $(1 + \sqrt{\gamma_n})^2 = 2.25$
[Figure: densities of $\hat\rho_1$ for each spike value; horizontal axis from 2.1 to 2.9.]


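Largest eigenvalue: Phase transition
Different rates, limit distributions:
For $h < \sqrt{\gamma}$: $n^{2/3}\,\dfrac{\hat\rho_1 - \mu(\gamma_n)}{\tau(\gamma_n)} \overset{D}{\Rightarrow} TW_\beta$
For $h > \sqrt{\gamma}$: $n^{1/2}\,\dfrac{\hat\rho_1 - \rho(h, \gamma_n)}{\sigma(h, \gamma_n)} \overset{D}{\Rightarrow} N(0, 1)$
with
$\rho(h, \gamma) = (1 + h)\left(1 + \dfrac{\gamma}{h}\right)$, $\sigma^2(h, \gamma) = 2(1 + h)^2\left(1 - \dfrac{\gamma}{h^2}\right)$
[Figure: limit $\rho(\ell)$ of the largest sample eigenvalue against $\ell = 1 + h$, showing the bias, the bulk edge $(1 + \sqrt{\gamma})^2$ and the critical point $1 + \sqrt{\gamma}$.]
Statistical physics lit, 94-; Baik-Ben Arous-Péché (05); Paul (07); Baik-Silverstein (06); Bloemendal-Virág (11); Mo (11); Wang (12); Benaych-Georges-Guionnet-Maida (11)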
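The centering and scaling above are easy to evaluate numerically. The sketch below uses only the standard library; the helper `standardize` and the observed value 2.6 are hypothetical, introduced for illustration.

```python
import math

def rho(h, gamma):
    """Supercritical centering rho(h, gamma) = (1 + h)(1 + gamma / h)."""
    return (1.0 + h) * (1.0 + gamma / h)

def sigma(h, gamma):
    """Supercritical scale: sigma^2(h, gamma) = 2 (1 + h)^2 (1 - gamma / h^2)."""
    if h <= math.sqrt(gamma):
        raise ValueError("the Gaussian limit is stated only for h > sqrt(gamma)")
    return math.sqrt(2.0 * (1.0 + h) ** 2 * (1.0 - gamma / h ** 2))

def standardize(rho_hat, n, h, gamma):
    """z = n^{1/2} (rho_hat - rho(h, gamma)) / sigma(h, gamma)."""
    return math.sqrt(n) * (rho_hat - rho(h, gamma)) / sigma(h, gamma)

gamma_n = 0.25  # p = 200, n = 800, as in the numerical illustration
for h in (0.75, 1.0):
    print(h, rho(h, gamma_n), sigma(h, gamma_n))

# Hypothetical observed eigenvalue, standardized under h = 1.
print(standardize(2.6, 800, 1.0, gamma_n))
```

Two checks follow from the formulas: at the critical value $h = \sqrt{\gamma}$ the centering equals the bulk edge $(1 + \sqrt{\gamma})^2$, and for $h > \sqrt{\gamma}$ the bias is $\rho(h, \gamma) - (1 + h) = (1 + h)\gamma/h > 0$.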

Normal approximation – multiple spikes
◮ Assume that all spikes are simple, supercritical: $\ell_1 > \cdots > \ell_r > 1 + \sqrt{\gamma}$
◮ Asymptotic mutual independence: with $\rho_{kn} := \rho(\ell_k, \gamma_n)$, $\sigma_{kn} := \sigma(\ell_k, \gamma_n)$,
$(\hat z_{kn})_{k=1,\dots,r} := \left( \dfrac{n^{1/2}(\hat\rho_k - \rho_{kn})}{\sigma_{kn}} \right)_{k=1,\dots,r} \Rightarrow N(0, I_r)$
Shi (2013)

Edgeworth approximations

Inaccuracy of approximations: $\hat z_{kn}$ associated with $\ell_k = 2.7$
[Figure: four panels, each showing the density of $\hat z$ against the Normal density.
Panels $(n, \gamma_n, \ell)$: (400, 1, (2.7)), $\hat z_{1n}$; (400, 1, (2.7, 2.2)), $\hat z_{1n}$; (400, 1, (3.2, 2.7)), $\hat z_{2n}$; (400, 1, (2.7, 2.4)), $\hat z_{1n}$.]

Traditional Edgeworth
(Smooth function of) means model: Petrov, 1975, Hall, 1992
$S_n = \dfrac{1}{\sqrt{n \kappa_{2n}}} \sum_{i=1}^{n} X_{ni}$, with $X_{ni}$ indep, mean 0, $\in \mathbb{R}^d$, $d$ fixed
$\kappa_{jn} = \dfrac{1}{n} \sum_{i=1}^{n} E X_{ni}^j$ (moments)
First order expansion:
$P(S_n \le x) = \Phi(x) + n^{-1/2} p(x) \phi(x) + o(n^{-1/2})$
$p(x) = -\dfrac{\kappa_{3n}}{6 \kappa_{2n}^{3/2}} H_2(x)$, $H_2(x) = x^2 - 1$ (skewness correction)
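As a sanity check on the first-order expansion, the sketch below compares it with an exact CDF in one case where the latter is available in closed form. The test case (a standardized sum of i.i.d. centered Exp(1) variables, for which $\kappa_2 = 1$ and $\kappa_3 = 2$) is an illustrative assumption, not an example from the slides.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def edgeworth_cdf(x, n, kappa2, kappa3):
    """Phi(x) + n^{-1/2} p(x) phi(x), with p(x) = -kappa3 / (6 kappa2^{3/2}) * (x^2 - 1)."""
    p = -kappa3 / (6.0 * kappa2 ** 1.5) * (x * x - 1.0)
    return norm_cdf(x) + p * norm_pdf(x) / math.sqrt(n)

def exact_cdf_exp_sum(x, n):
    """P(S_n <= x) for S_n = sum_{i<=n} (E_i - 1) / sqrt(n), E_i i.i.d. Exp(1).

    The sum of the E_i is Gamma(n, 1), so this is an Erlang CDF at t = n + x sqrt(n).
    """
    t = n + x * math.sqrt(n)
    if t <= 0:
        return 0.0
    term, total = 1.0, 1.0  # running t^k / k! and its partial sum over k = 0..n-1
    for k in range(1, n):
        term *= t / k
        total += term
    return 1.0 - math.exp(-t) * total

n = 20
for x in (-1.0, 0.0, 1.0):
    exact = exact_cdf_exp_sum(x, n)
    print(x, exact, norm_cdf(x), edgeworth_cdf(x, n, kappa2=1.0, kappa3=2.0))
```

At $x = 0$ the normal approximation returns exactly 0.5 regardless of skewness, so any asymmetry in the exact distribution shows up only through the correction term.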

Single spike, first order expansion for $\hat\rho_1$
$\hat z_{1n} = n^{1/2} (\hat\rho_1 - \rho_{1n}) / \sigma_{1n}$
Theorem. In spiked model, $h_1 = \ell_1 - 1 > \sqrt{\gamma}$, $\gamma_n = p/n$,
$P(\hat z_{1n} \le x) = \Phi(x) + n^{-1/2} p_{1n}(x) \phi(x) + o(n^{-1/2})$,
uniformly in $x \in \mathbb{R}$, with
$p_{1n}(x) = -\alpha_{2n} H_2(x) - \alpha_{0n}$
$\alpha_{2n} = \alpha_2(h_1, \gamma_n) = \dfrac{\sqrt{2}}{3} \dfrac{h_1^3 + \gamma_n}{(h_1^2 - \gamma_n)^{3/2}}$, $\alpha_{0n} = \alpha_0(h_1, \gamma_n) = \dfrac{1}{\sqrt{2}} \dfrac{\gamma_n (h_1 + 1)}{(h_1^2 - \gamma_n)^{3/2}}$

Coefficients of Edgeworth expansion for single-spike
$\alpha_2(h_1, \gamma_n) = \dfrac{\sqrt{2}}{3} \dfrac{h_1^3 + \gamma_n}{(h_1^2 - \gamma_n)^{3/2}}$, $\alpha_0(h_1, \gamma_n) = \dfrac{1}{\sqrt{2}} \dfrac{\gamma_n (h_1 + 1)}{(h_1^2 - \gamma_n)^{3/2}}$
◮ Larger for "harder" cases, i.e. larger $\gamma$ and smaller $h$ ($> \sqrt{\gamma}$)
◮ Larger than the fixed $p$ case, i.e. $\gamma = 0$, $\alpha_2 = \sqrt{2}/3$, $\alpha_0 = 0$ [Muirhead-Chikuse (1975)]
◮ Empirically reasonable if $\dfrac{\alpha_2^2}{n} = \dfrac{2}{9} \dfrac{(h_1^3 + \gamma)^2}{n (h_1^2 - \gamma)^3} \le 0.2$
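The coefficients and the single-spike expansion can be coded in a few lines. This is a sketch under the formulas as reconstructed above; the function names and the values (n, γ, h) = (400, 1, 1.7), taken from the ℓ = 2.7 panels, are illustrative choices.

```python
import math

def alpha2(h, gamma):
    """alpha_2(h, gamma) = (sqrt(2) / 3) (h^3 + gamma) / (h^2 - gamma)^{3/2}."""
    return math.sqrt(2.0) / 3.0 * (h ** 3 + gamma) / (h * h - gamma) ** 1.5

def alpha0(h, gamma):
    """alpha_0(h, gamma) = gamma (h + 1) / (sqrt(2) (h^2 - gamma)^{3/2})."""
    return gamma * (h + 1.0) / (math.sqrt(2.0) * (h * h - gamma) ** 1.5)

def edgeworth_cdf_single(x, n, h, gamma):
    """First-order approximation to P(z_hat_1n <= x) for a supercritical spike h."""
    Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    p = -alpha2(h, gamma) * (x * x - 1.0) - alpha0(h, gamma)
    return Phi + p * phi / math.sqrt(n)

def rule_of_thumb(n, h, gamma):
    """The empirical criterion from the slide: alpha_2^2 / n <= 0.2."""
    return alpha2(h, gamma) ** 2 / n <= 0.2

n, gamma, h = 400, 1.0, 1.7  # ell = 2.7
print(alpha2(h, gamma), alpha0(h, gamma), rule_of_thumb(n, h, gamma))
print(edgeworth_cdf_single(0.0, n, h, gamma))
```

Setting `gamma = 0.0` recovers the fixed $p$ values $\alpha_2 = \sqrt{2}/3$ and $\alpha_0 = 0$, and moving $h$ toward $\sqrt{\gamma}$ makes both coefficients grow, in line with the "harder cases" remark.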

Single Spike Simulation
[Figure: four panels showing the density of $\hat\ell$ with the Edgeworth and Normal approximations and the Upper Edge marked.
Panels (n, γ, l-factor): (50, 0.1, 0.3); (50, 1, 0.3); (100, 0.1, 0.3); (100, 1, 0.3).]

Edgeworth for multiple spikes

Eigenvalues are repulsive!
[Figure: three panels, each showing the density of $\hat z$ against the Normal density.
Panels $(n, \gamma_n, \ell)$: (400, 1, (2.7, 2.2)), $\hat z_{1n}$; (400, 1, (2.7, 2.4)), $\hat z_{1n}$; (400, 1, (3.2, 2.7)), $\hat z_{2n}$.]
◮ joint density of $(\hat\rho_1, \dots, \hat\rho_{n \wedge (p+1)})$ has a Jacobian factor $\prod_{i<j} |\hat\rho_i - \hat\rho_j|$ → pushes eigenvalues apart
◮ But, not visible at leading order (for supercritical spikes): $(\hat z_{kn})_{k=1,\dots,r} \Rightarrow N(0, I_r)$

Multi spike, first order expansion for $\hat\rho_k$
$\hat z_{kn} = n^{1/2} (\hat\rho_k - \rho_{kn}) / \sigma_{kn}$
Theorem. In spiked model, $h_k = \ell_k - 1 > \sqrt{\gamma}$, $\gamma_n = p/n$,
$P(\hat z_{kn} \le x) = \Phi(x) + n^{-1/2} p_{kn}(x) \phi(x) + o(n^{-1/2})$,
uniformly in $x \in \mathbb{R}$, with
$p_{kn}(x) = -\alpha_2(h_k, \gamma_n) H_2(x) - \alpha_{0,k}(h, \gamma_n)$
$\alpha_2(h_k, \gamma_n) = \dfrac{\sqrt{2}}{3} \dfrac{h_k^3 + \gamma_n}{(h_k^2 - \gamma_n)^{3/2}}$,
$\alpha_{0,k}(h, \gamma) = \dfrac{1}{\sqrt{2}} \dfrac{h_k + 1}{(h_k^2 - \gamma)^{1/2}} \left( \dfrac{\gamma}{h_k^2 - \gamma} + \sum_{j \ne k} \dfrac{h_j}{h_k - h_j} \right)$

Interpretation
Edgeworth corrected density: $\phi + n^{-1/2} (\alpha_2 H_3 + \alpha_0 H_1) \phi$
Relative to single spike case: $\alpha_2$ unchanged, but
$\Delta\alpha_0 = \alpha_{0,k}(h, \gamma_n) - \alpha_0(h_k, \gamma_n) = \dfrac{1}{\sqrt{2}} \dfrac{h_k + 1}{(h_k^2 - \gamma_n)^{1/2}} \sum_{j \ne k} \dfrac{h_j}{h_k - h_j}$
◮ $\Delta\alpha_0 > 0$, e.g. smaller spikes $h_j < h_k$, push density to right, conversely for $\Delta\alpha_0 < 0$
◮ closer spikes ⇒ larger effect
◮ additive in $\ell_j$, $j \ne k$
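The repulsion term can be checked numerically on the spike configurations used in the figures. The sketch below is illustrative: function names are hypothetical, the index `k` is 0-based (unlike the slides), and `h` is the list of $h_j = \ell_j - 1$. Since $\int x H_3 \phi = 0$ and $\int x H_1 \phi = 1$, the corrected density has mean $\alpha_0 / \sqrt{n}$, so the sign of $\Delta\alpha_0$ gives the direction of the first-order shift.

```python
import math

def alpha0_single(h_k, gamma):
    """Single-spike alpha_0(h_k, gamma)."""
    return gamma * (h_k + 1.0) / (math.sqrt(2.0) * (h_k ** 2 - gamma) ** 1.5)

def delta_alpha0(k, h, gamma):
    """Repulsion term: (h_k + 1) / (sqrt(2) (h_k^2 - gamma)^{1/2}) * sum_{j != k} h_j / (h_k - h_j)."""
    h_k = h[k]
    repulsion = sum(h_j / (h_k - h_j) for j, h_j in enumerate(h) if j != k)
    return (h_k + 1.0) / (math.sqrt(2.0) * math.sqrt(h_k ** 2 - gamma)) * repulsion

def alpha0_multi(k, h, gamma):
    """alpha_{0,k}(h, gamma), written as in the multi-spike theorem."""
    h_k = h[k]
    factor = (h_k + 1.0) / (math.sqrt(2.0) * math.sqrt(h_k ** 2 - gamma))
    repulsion = sum(h_j / (h_k - h_j) for j, h_j in enumerate(h) if j != k)
    return factor * (gamma / (h_k ** 2 - gamma) + repulsion)

def corrected_density(x, n, a2, a0):
    """phi(x) * (1 + n^{-1/2} (a2 H_3(x) + a0 H_1(x))), H_3 = x^3 - 3x, H_1 = x."""
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return phi * (1.0 + (a2 * (x ** 3 - 3.0 * x) + a0 * x) / math.sqrt(n))

gamma, n = 1.0, 400
# ell = (2.7, 2.2): the spike of interest (2.7) is the larger one.
print(delta_alpha0(0, [1.7, 1.2], gamma))
# ell = (3.2, 2.7): the spike of interest (2.7) is the smaller one.
print(delta_alpha0(1, [2.2, 1.7], gamma))
# First-order mean shift of z_hat for ell = (2.7, 2.2).
print(alpha0_multi(0, [1.7, 1.2], gamma) / math.sqrt(n))
```

The first value is positive and the second negative, matching the first bullet; replacing 1.2 by 1.4 in the first call increases the term, matching the second.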

Repulsion example 1: $\hat z_{kn}$ associated with $\ell_k = 2.7$
[Figure: four panels, each showing the density of $\hat z$ with the Normal and Edgeworth approximations.
Panels $(n, \gamma_n, \ell)$: (400, 1, (2.7)), $\hat z_{1n}$; (400, 1, (2.7, 2.2)), $\hat z_{1n}$; (400, 1, (3.2, 2.7)), $\hat z_{2n}$; (400, 1, (2.7, 2.4)), $\hat z_{1n}$.]
Figure: Density of $\hat z_{kn}$ associated with $\ell_k = 2.7$
