 
OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS
Philippe Rigollet (joint with Quentin Berthet)
High dimensional data
Cloud of n points in $\mathbb{R}^p$
Principal component
Principal component = direction of largest variance
Principal component analysis (PCA)
• Tool for dimension reduction
• Spectrum of covariance matrix
• Main tool for exploratory data analysis
We study only the first principal component.
This talk: high-dimensional ($p \gg n$), finite sample framework.
Testing for sphericity under rank-one alternative
$H_0 : \Sigma = I_p$ (isotropic)
$H_1 : \Sigma = I_p + \theta v v^\top$, $|v|_2 = 1$ (principal component)
The model
• Observations: i.i.d. $X_1, \ldots, X_n \sim \mathcal{N}_p(0, \Sigma)$
• Estimator: empirical covariance matrix $\hat\Sigma = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top$
If $n \gg p$, it is a consistent estimator.
If $n \simeq cp$, it is inconsistent (Nadler, Paul, Onatski, ...); eigenvectors are orthogonal.
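The estimator on this slide can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available and that the observations are already centered (as in the model above); the function name `empirical_covariance` and the sample sizes are hypothetical choices, not from the talk.

```python
import numpy as np

def empirical_covariance(X):
    """Return (1/n) * sum_i X_i X_i^T for an (n, p) array of centered observations."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Row i of X is the observation X_i, so X.T @ X is the sum of the outer products.
    return X.T @ X / n

# Illustration with hypothetical sizes: n = 2000 draws from N_p(0, I_p) with p = 5.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
Sigma_hat = empirical_covariance(X)
```

With n much larger than p, as here, `Sigma_hat` should be close to the identity; the regime of interest in the talk is the opposite one.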
Empirical spectrum under the null
$H_0 : \Sigma = I_p$
[Figure: histogram of the spectrum of $\hat\Sigma$; Marcenko-Pastur distribution]
Empirical spectrum under the alternative
$H_1 : \Sigma = I_p + \theta v v^\top$, $|v|_2 = 1$
The BBP (Baik, Ben Arous, Péché) transition, $p/n \to \alpha > 0$:
• $\theta \le \sqrt{\alpha}$: indistinguishable from the null
• $\theta > \sqrt{\alpha}$: detection possible
Detection is possible if $\theta > \sqrt{p/n}$: very strong signal!
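To make the size of this threshold concrete, here is a small sketch that evaluates $\sqrt{p/n}$ for given dimensions. The function names are hypothetical, and the comparison is only a finite-sample stand-in for the asymptotic statement on the slide.

```python
import math

def bbp_threshold(p, n):
    """Spike strength sqrt(p/n) from the slide, used as a proxy for sqrt(alpha)."""
    return math.sqrt(p / n)

def detectable_by_top_eigenvalue(theta, p, n):
    """True when theta exceeds the threshold, i.e. the 'detection possible' regime."""
    return theta > bbp_threshold(p, n)

# When p is comparable to or larger than n, the required theta is of order 1 or more.
threshold = bbp_threshold(p=1000, n=250)
```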
Testing for sparse principal component
$H_0 : \Sigma = I_p$ (isotropic)
$H_1 : \Sigma = I_p + \theta v v^\top$, $|v|_2 = 1$, $|v|_0 \le k$ (sparse principal direction)
Testing for sparse principal component
$H_0 : \Sigma = I_p$
$H_1 : \Sigma = I_p + \theta v v^\top$, $|v|_2 = 1$, $|v|_0 \le k$
Minimum detection level $\theta$?
Goal: find a statistic $\varphi : S_p^+ \to \mathbb{R}$ such that
• $\varphi(\hat\Sigma)$ is small under $H_0$: $P_{H_0}(\varphi(\hat\Sigma) < \tau_0) \ge 1 - \delta$
• $\varphi(\hat\Sigma)$ is large under $H_1$: $P_{H_1}(\varphi(\hat\Sigma) > \tau_1) \ge 1 - \delta$
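$P_{H_0}(\varphi(\hat\Sigma) < \tau_0) \ge 1 - \delta$ ($\varphi(\hat\Sigma)$ small under $H_0$)
$P_{H_1}(\varphi(\hat\Sigma) > \tau_1) \ge 1 - \delta$ ($\varphi(\hat\Sigma)$ large under $H_1$)
For $\tau_0 \le \tau \le \tau_1$, take the test $\psi(\hat\Sigma) = \mathbf{1}\{\varphi(\hat\Sigma) > \tau\}$.
It satisfies: $P_{H_0}(\psi = 1) \vee \max_{|v|_2 = 1, |v|_0 \le k} P_{H_1}(\psi = 0) \le \delta$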
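The thresholding test can be written generically. The sketch below is only an illustration: `make_test` is a hypothetical helper, and the statistic passed to it is left to the caller.

```python
def make_test(statistic, tau):
    """Build psi(Sigma_hat) = 1{statistic(Sigma_hat) > tau}."""
    def psi(sigma_hat):
        # Return 1 to reject H_0 (statistic above the threshold), 0 otherwise.
        return 1 if statistic(sigma_hat) > tau else 0
    return psi

# Toy usage with a scalar stand-in for the statistic; any callable works.
psi = make_test(statistic=lambda value: value, tau=1.2)
```

Any threshold between $\tau_0$ and $\tau_1$ gives both error probabilities at most $\delta$, which is why the test only needs the single value `tau`.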
Sparse eigenvalue
k-sparse eigenvalue:
$\varphi(\hat\Sigma) = \lambda^k_{\max}(\hat\Sigma) = \max_{|x|_2 = 1, |x|_0 \le k} x^\top \hat\Sigma x = \max_{|S| = k} \lambda_{\max}(\hat\Sigma_S)$
Note that $\lambda^k_{\max}(I_p) = 1$ and $\lambda^k_{\max}(I_p + \theta v v^\top) = 1 + \theta$.
Smaller fluctuations than the largest eigenvalue $\lambda_{\max}(\hat\Sigma)$.
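The second expression for the k-sparse eigenvalue, a maximum over all index sets of size k, translates directly into a brute-force computation. This is a sketch for small p only, assuming NumPy; the function name and the toy dimensions are hypothetical.

```python
from itertools import combinations

import numpy as np

def k_sparse_eigenvalue(A, k):
    """Brute-force max over |S| = k of the largest eigenvalue of the submatrix A_S."""
    A = np.asarray(A, dtype=float)
    p = A.shape[0]
    best = -np.inf
    for S in combinations(range(p), k):
        idx = np.array(S)
        sub = A[np.ix_(idx, idx)]  # principal submatrix indexed by S
        # eigvalsh returns eigenvalues in ascending order, so the last one is the largest.
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return float(best)

# Toy check of the two identities on the slide, with hypothetical p, k, theta.
p, k, theta = 6, 2, 0.5
v = np.zeros(p)
v[:k] = 1 / np.sqrt(k)  # unit vector with k nonzero entries
spiked = np.eye(p) + theta * np.outer(v, v)
null_value = k_sparse_eigenvalue(np.eye(p), k)
spiked_value = k_sparse_eigenvalue(spiked, k)
```

The loop visits every subset of size k, so the cost grows combinatorially with p and k.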
Upper bounds w.p. $1 - \delta$
Under the null hypothesis: $\lambda^k_{\max}(\hat\Sigma) \le 1 + 8 \sqrt{\frac{k \log(9ep/k) + \log(1/\delta)}{n}} =: \tau_0$
Under the alternative hypothesis: $\lambda^k_{\max}(\hat\Sigma) \ge 1 + \theta - 2(1 + \theta) \sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$
Can detect as soon as $\tau_0 < \tau_1$, which yields $\theta \ge C \sqrt{\frac{k \log(p/k)}{n}}$
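The two thresholds are explicit, so the condition $\tau_0 < \tau_1$ can be checked numerically for given parameters. The sketch below simply evaluates the formulas as stated on the slide; the function names are hypothetical.

```python
import math

def tau0(n, p, k, delta):
    """Null threshold: 1 + 8 * sqrt((k log(9ep/k) + log(1/delta)) / n)."""
    return 1 + 8 * math.sqrt((k * math.log(9 * math.e * p / k) + math.log(1 / delta)) / n)

def tau1(n, theta, delta):
    """Alternative threshold: 1 + theta - 2 (1 + theta) sqrt(log(1/delta) / n)."""
    return 1 + theta - 2 * (1 + theta) * math.sqrt(math.log(1 / delta) / n)

def can_separate(n, p, k, theta, delta):
    """True when tau0 < tau1, i.e. some tau between them gives both errors at most delta."""
    return tau0(n, p, k, delta) < tau1(n, theta, delta)
```

Because of the constant 8 in $\tau_0$, the condition is conservative: it only holds once n is large relative to $k \log(p/k) / \theta^2$.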
Minimax lower bound
Fix $\nu > 0$ (small). Then there exists a constant $C_\nu > 0$ such that if
$\theta < \bar\theta := \sqrt{\frac{k \log(C_\nu p/k^2 + 1)}{n}} \wedge \frac{1}{\sqrt{2}}$
then
$\inf_\psi \left\{ P_0^n(\psi = 1) \vee \max_{|v|_2 = 1, |v|_0 \le k} P_v^n(\psi = 0) \right\} \ge \frac{1}{2} - \nu$
See also Arias-Castro, Bubeck and Lugosi (12)
Computational issues
To compute $\lambda^k_{\max}(\hat\Sigma)$, need to compute $\binom{p}{k}$ eigenvalues.
Can be used to find cliques in graphs: NP-complete problem.
Need an approximation...
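To get a feel for how quickly the number of subsets grows, the count can be evaluated with the standard library. The helper name and the example values of p and k are hypothetical.

```python
import math

def num_submatrices(p, k):
    """Number of k-by-k principal submatrices, i.e. the binomial coefficient C(p, k)."""
    return math.comb(p, k)

# Hypothetical sizes: p = 500 variables and k = 22 (roughly sqrt(p)).
count = num_submatrices(500, 22)
```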
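Semidefinite relaxation 101
$\lambda^k_{\max}(A) = \max x^\top A x$ subject to $x^\top x = 1$, $|x|_0 \le k$
$\mathrm{SDP}_k(A) = \max \mathrm{Tr}(AZ)$ subject to $\mathrm{Tr}(Z) = 1$, $|Z|_1 \le k$, $Z \succeq 0$
With $Z = xx^\top$: $x^\top A x = \mathrm{Tr}(AZ)$, $x^\top x = \mathrm{Tr}(Z)$, and $|x|_0 \le k$ gives $|Z|_1 \le k$ by Cauchy-Schwarz. The constraint $\mathrm{rank}(Z) = 1$ is dropped.
Semidefinite program (SDP) introduced by d'Aspremont, El Ghaoui, Jordan and Lanckriet (2004).
Testing procedure: $\mathbf{1}\{\mathrm{SDP}_k(\hat\Sigma) > \tau\}$
Defined even if solution of SDP has rank > 1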
Performance of SDP
For the alternative: $\mathrm{SDP}_k$ is a relaxation of $\lambda^k_{\max}$, so $\mathrm{SDP}_k(\hat\Sigma) \ge \lambda^k_{\max}(\hat\Sigma)$
For the null: use the dual (Bach et al. 2010)
$\mathrm{SDP}_k(A) = \min_{U \in S_p^+} \{ \lambda_{\max}(A + U) + k |U|_\infty \}$
For any $U \in S_p^+$ this gives an upper bound on $\mathrm{SDP}_k(\hat\Sigma)$.
Enough to look only at minimum dual perturbation
$\mathrm{MDP}_k(\hat\Sigma) = \min_{z \ge 0} \{ \lambda_{\max}(\mathrm{st}_z(\hat\Sigma)) + kz \}$
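The minimum dual perturbation only involves a one-dimensional minimization, which the following sketch approximates on a grid. Two assumptions are made here that the slide does not spell out: $\mathrm{st}_z$ is taken to be entrywise soft-thresholding at level z, and the exact minimum over z is replaced by a grid search (which can only overestimate it). The function names and the grid size are hypothetical; NumPy is assumed.

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding: sign(a) * max(|a| - z, 0) (assumed meaning of st_z)."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp_k(A, k, num_grid=200):
    """Grid approximation of min over z >= 0 of lambda_max(st_z(A)) + k * z."""
    A = np.asarray(A, dtype=float)
    # For z >= max|A_ij| the thresholded matrix is zero and the objective k * z only grows,
    # so the search can be restricted to [0, max|A_ij|].
    z_max = np.abs(A).max()
    best = np.inf
    for z in np.linspace(0.0, z_max, num_grid):
        value = np.linalg.eigvalsh(soft_threshold(A, z))[-1] + k * z
        best = min(best, value)
    return float(best)

# Toy check with hypothetical sizes: identity and a 2-sparse spike with theta = 0.5.
v = np.array([1.0, 1.0, 0.0, 0.0]) / np.sqrt(2.0)
spiked = np.eye(4) + 0.5 * np.outer(v, v)
null_value = mdp_k(np.eye(4), k=2)
spiked_value = mdp_k(spiked, k=2)
```

Each grid point costs one eigenvalue computation of a p-by-p matrix, in contrast with the combinatorial search needed for $\lambda^k_{\max}$.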
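Upper bounds w.p. $1 - \delta$
For $*\mathrm{DP} \in \{\mathrm{SDP}, \mathrm{MDP}\}$:
Under the null hypothesis: $*\mathrm{DP}_k(\hat\Sigma) \le 1 + 10 \sqrt{\frac{k^2 \log(ep/\delta)}{n}} =: \tau_0$
Under the alternative hypothesis: $*\mathrm{DP}_k(\hat\Sigma) \ge 1 + \theta - 2(1 + \theta) \sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$
Can detect as soon as $\tau_0 < \tau_1$, which yields $\theta \ge C \sqrt{\frac{k^2 \log(p/k)}{n}}$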
[Figure: curves for $\mathrm{SDP}_k$, $\mathrm{MDP}_k$ and $\lambda_{\max}(\cdot)$; y-axis $H_1 / H_0$, x-axis $\theta$]
Ratio of 5% quantile under $H_1$ over 95% quantile under $H_0$, versus signal strength $\theta$. When this ratio is larger than one, both type I and type II errors are below 5%.
A. d'Aspremont, Soutenance HDR, ENS Cachan, Nov. 2012.
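Summary
Along the $\theta$ axis:
• No detection below $\sqrt{\frac{k}{n} \log\left(\frac{p}{k}\right)}$
• Detection with $\lambda^k_{\max}$ above $\sqrt{\frac{k}{n} \log\left(\frac{p}{k}\right)}$
• Detection with $*\mathrm{DP}_k$ above $\sqrt{\frac{k^2}{n} \log\left(\frac{p}{k}\right)}$
Can we tighten the gap?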
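The size of the gap between the two detection levels can be read off by evaluating both rates. The sketch below drops the unspecified constants C, so only the ratio of the two values is meaningful; the function names and parameter values are hypothetical.

```python
import math

def rate_sparse_eigenvalue(n, p, k):
    """sqrt(k log(p/k) / n): detection level for the k-sparse eigenvalue, up to a constant."""
    return math.sqrt(k * math.log(p / k) / n)

def rate_convex_relaxation(n, p, k):
    """sqrt(k^2 log(p/k) / n): detection level proved for SDP_k and MDP_k, up to a constant."""
    return math.sqrt(k ** 2 * math.log(p / k) / n)

# Hypothetical sizes; the ratio of the two rates is sqrt(k) whatever n and p are.
n, p, k = 1000, 500, 16
gap = rate_convex_relaxation(n, p, k) / rate_sparse_eigenvalue(n, p, k)
```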
Numerical evidence
Fix type I error at 1%, plot type II error of $\mathrm{MDP}_k$; $p \in \{50, 100, 200, 500\}$, $k = \sqrt{p}$
[Figure: two panels of type II error curves, one under the scaling $\sqrt{\frac{k}{n} \log\left(\frac{p}{k}\right)}$ (minimax optimal scaling) and one under $\sqrt{\frac{k^2}{n} \log\left(\frac{p}{k}\right)}$ (proved scaling)]
Random graphs
A random (Erdős-Rényi) graph on N vertices is obtained by drawing edges at random with probability 1/2.
N = 50: largest clique is of size $2 \log N = 7.8$ asymptotically almost surely
Hidden clique
We can hide a clique (here of size 10) in this graph.
Choose points arbitrarily and draw a clique.
Hidden clique
Embed it in the original random graph.
Hidden clique
Question: is there a hidden clique in this graph?
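The construction on these slides (draw a random graph, choose vertices, add all edges among them) can be sketched with the standard library alone. The function names, the seed and the edge representation as a set of two-element frozensets are hypothetical choices; N = 50 and clique size 10 follow the slides.

```python
import random
from itertools import combinations

def random_graph(N, rng, prob=0.5):
    """Edge set of a random graph on N vertices: each pair is an edge with probability prob."""
    return {frozenset(pair) for pair in combinations(range(N), 2) if rng.random() < prob}

def plant_clique(edges, vertices):
    """Return the edge set with every pair of the chosen vertices connected."""
    return edges | {frozenset(pair) for pair in combinations(vertices, 2)}

def is_clique(edges, vertices):
    """Check that every pair among the given vertices is an edge."""
    return all(frozenset(pair) in edges for pair in combinations(vertices, 2))

rng = random.Random(0)
N, clique_size = 50, 10
G = random_graph(N, rng)
hidden = rng.sample(range(N), clique_size)  # vertices chosen arbitrarily
G_planted = plant_clique(G, hidden)
```

Verifying a given clique with `is_clique` is cheap; the testing question on the slide is whether one can tell `G_planted` from `G` without knowing `hidden`.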
Hidden clique problem
It is believed that it is hard to find/test the presence of a clique in a random graph (Alon, Arora, Feige, Hazan, Krauthgamer, ... Cryptosystems are based on this fact!)
Conjecture: It is hard to find cliques of size between $2 \log N$ and $\sqrt{N}$
• Alon, Krivelevich, Sudakov 98
• Feige and Krauthgamer 00
• Dekel et al. 10
• Feige and Ron 10
• Ames and Vavasis 11
Canonical example of average case complexity
Hidden clique problem
It seems related to our problem but not trivially (the randomness structure is very fragile).
Note that all our results extend to sub-Gaussian r.v.
Theorem. If we could prove that there exists $C > 0$ such that under the null hypothesis it holds
$\mathrm{SDP}_k(\hat\Sigma) \le 1 + C \sqrt{\frac{k^\alpha \log(ep/\delta)}{n}}$
for some $\alpha \in (1, 2)$, then it can be used to test the presence of a clique of size $\mathrm{polylog}(N) \, N^{\frac{1}{4 - \alpha}}$
Remarks
Unlike usual hardness results, this one is for one (actually two) method only (not for all methods).
In progress: we can remove this limitation using bicliques (need to carefully deal with independence)
Conclusion
• Optimal rates for sparse detection
• Computationally efficient methods with suboptimal rate
• First(?) link between sparse detection and average case complexity
• Opens the door to new statistical lower bounds: complexity theoretic lower bounds
• Evidence that heuristics cannot be optimal