OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS Philippe Rigollet - - PowerPoint PPT Presentation



slide-1
SLIDE 1

OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS

Philippe Rigollet (joint with Quentin Berthet)

slide-2
SLIDE 2

High dimensional data

Cloud of points in $\mathbb{R}^p$

slide-3
SLIDE 3

High dimensional data

Cloud of points in $\mathbb{R}^p$

slide-4
SLIDE 4

High dimensional data

Cloud of $n$ points in $\mathbb{R}^p$

slide-5
SLIDE 5

Principal component

Principal component = direction of largest variance
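As a rough illustration of "direction of largest variance", the sketch below computes the leading eigenvector of an empirical covariance matrix with NumPy. The function name `first_principal_component` and the synthetic two-dimensional cloud are assumptions made for this example only; the data are centered explicitly because the sample mean is not assumed to be zero here.

```python
import numpy as np

def first_principal_component(X):
    """Return (direction, variance) of the leading principal component.

    X is an (n, p) array whose rows are the observations.
    """
    Xc = X - X.mean(axis=0)                  # center the cloud of points
    cov = Xc.T @ Xc / X.shape[0]             # p x p empirical covariance
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, -1], eigvals[-1]

rng = np.random.default_rng(0)
# Hypothetical 2-D cloud, stretched along the first coordinate axis.
X = rng.standard_normal((2000, 2)) * np.array([3.0, 1.0])
direction, variance = first_principal_component(X)
```

On this cloud the returned direction should be close to the first coordinate axis (up to sign), with variance close to 9.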

slide-6
SLIDE 6

Principal component analysis (PCA)

  • Tool for dimension reduction
  • Spectrum of covariance matrix
  • Main tool for exploratory data analysis.

We study only the first principal component.

This talk: high-dimensional ($p \gg n$), finite sample framework.

slide-7
SLIDE 7

Testing for sphericity under rank-one alternative

$H_0 : \Sigma = I_p$ (isotropic)

$H_1 : \Sigma = I_p + \theta v v^\top, \quad |v|_2 = 1$ (principal component)

slide-8
SLIDE 8

The model

  • Observations: $X_1, \dots, X_n \sim \mathcal{N}_p(0, \Sigma)$ i.i.d.
  • Estimator: empirical covariance matrix $\hat\Sigma = \frac{1}{n} \sum_{i=1}^{n} X_i X_i^\top$

If $n \gg p$, it is a consistent estimator.

If $n \simeq cp$, it is inconsistent: eigenvectors are orthogonal (Nadler, Paul, Onatski, ...).
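The contrast between the two regimes can be checked numerically. The following sketch draws Gaussian data with $\Sigma = I_p$ and compares the largest eigenvalue of the empirical covariance matrix when $n$ is much larger than $p$ and when $n$ is proportional to $p$. The helper names and the sample sizes are illustrative assumptions, not values from the talk.

```python
import numpy as np

def empirical_covariance(X):
    """Sigma_hat = (1/n) * sum_i X_i X_i^T for mean-zero rows X_i of X."""
    return X.T @ X / X.shape[0]

def top_eigenvalue(S):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(S)[-1]

rng = np.random.default_rng(0)

# Regime n >> p (n = 20000, p = 20): Sigma_hat is close to I_p.
lam_classical = top_eigenvalue(
    empirical_covariance(rng.standard_normal((20000, 20))))

# Regime n ~ c p (n = 400, p = 200): the top eigenvalue stays far from 1.
lam_proportional = top_eigenvalue(
    empirical_covariance(rng.standard_normal((400, 200))))
```

In the first regime the largest eigenvalue is close to the population value 1, while in the second it sits near the upper edge $(1 + \sqrt{p/n})^2$ of the Marcenko-Pastur bulk even though $\Sigma = I_p$.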

slide-9
SLIDE 9

Empirical spectrum under the null

[Figure: spectrum of $\hat\Sigma$ under $H_0 : \Sigma = I_p$ (Marcenko-Pastur distribution).]

slide-10
SLIDE 10

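Empirical spectrum under the alternative

$H_1 : \Sigma = I_p + \theta v v^\top, \quad |v|_2 = 1$

The BBP (Baik, Ben Arous, Péché) transition, as $p/n \to \alpha > 0$:

  • $\theta \le \sqrt{\alpha}$: indistinguishable from the null
  • $\theta > \sqrt{\alpha}$: detection possible

Detection requires $\theta > \sqrt{p/n}$: very strong signal!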
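A small simulation can make the transition visible. The sketch below samples from $\mathcal{N}_p(0, I_p + \theta v v^\top)$ and returns the largest eigenvalue of the empirical covariance matrix, for one value of $\theta$ below $\sqrt{p/n}$ and one above. The function name, the choice $v = e_1$ and the sizes $n = 800$, $p = 400$ are assumptions made for illustration.

```python
import numpy as np

def top_eigenvalue_of_sample(n, p, theta, rng):
    """lambda_max of Sigma_hat for X_i ~ N_p(0, I_p + theta * v v^T)."""
    v = np.zeros(p)
    v[0] = 1.0                                # a unit vector (hypothetical choice)
    X = rng.standard_normal((n, p))
    # Add a rank-one component so that Cov(X_i) = I_p + theta * v v^T.
    X += np.sqrt(theta) * rng.standard_normal((n, 1)) * v
    return np.linalg.eigvalsh(X.T @ X / n)[-1]

rng = np.random.default_rng(0)
n, p = 800, 400                               # alpha = p / n = 0.5, sqrt(alpha) ~ 0.71
lam_null = top_eigenvalue_of_sample(n, p, 0.0, rng)
lam_weak = top_eigenvalue_of_sample(n, p, 0.2, rng)    # theta below sqrt(alpha)
lam_strong = top_eigenvalue_of_sample(n, p, 3.0, rng)  # theta above sqrt(alpha)
```

With the weak signal the top eigenvalue is expected to stay at the edge of the null spectrum, whereas with the strong signal it separates clearly from it.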

slide-11
SLIDE 11

Testing for sparse principal component

$H_0 : \Sigma = I_p$ (isotropic)

$H_1 : \Sigma = I_p + \theta v v^\top, \quad |v|_2 = 1, \quad |v|_0 \le k$ (sparse principal direction)

slide-12
SLIDE 12

Testing for sparse principal component

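$H_0 : \Sigma = I_p$

$H_1 : \Sigma = I_p + \theta v v^\top, \quad |v|_2 = 1, \quad |v|_0 \le k$

Minimum detection level $\theta$?

Goal: find a statistic $\varphi : \mathbf{S}_p^+ \to \mathbb{R}$ such that $\varphi(\hat\Sigma)$ is

  • small under $H_0$: $P_{H_0}(\varphi(\hat\Sigma) < \tau_0) \ge 1 - \delta$
  • large under $H_1$: $P_{H_1}(\varphi(\hat\Sigma) > \tau_1) \ge 1 - \delta$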

slide-13
SLIDE 13

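  • small under $H_0$: $P_{H_0}(\varphi(\hat\Sigma) < \tau_0) \ge 1 - \delta$
  • large under $H_1$: $P_{H_1}(\varphi(\hat\Sigma) > \tau_1) \ge 1 - \delta$

Take $\tau_0 \le \tau \le \tau_1$ and the test $\psi(\hat\Sigma) = \mathbf{1}\{\varphi(\hat\Sigma) > \tau\}$. It satisfies:

$$P_{H_0}(\psi = 1) \vee \max_{|v|_2 = 1,\ |v|_0 \le k} P_{H_1}(\psi = 0) \le \delta$$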
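The error guarantee of the thresholded test follows in one line from each of the two probability bounds on the statistic. The derivation below is a sketch of that step, assuming only the two bounds stated on the slide and any threshold $\tau$ with $\tau_0 \le \tau \le \tau_1$; it holds for every admissible $v$, hence for the maximum over $v$.

```latex
\begin{aligned}
P_{H_0}(\psi = 1) &= P_{H_0}\big(\varphi(\hat\Sigma) > \tau\big)
  \le P_{H_0}\big(\varphi(\hat\Sigma) \ge \tau_0\big) \le \delta
  && \text{since } \tau \ge \tau_0, \\
P_{H_1}(\psi = 0) &= P_{H_1}\big(\varphi(\hat\Sigma) \le \tau\big)
  \le P_{H_1}\big(\varphi(\hat\Sigma) \le \tau_1\big) \le \delta
  && \text{since } \tau \le \tau_1.
\end{aligned}
```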

slide-14
SLIDE 14

Sparse eigenvalue

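k-sparse eigenvalue:

$$\varphi(\hat\Sigma) = \lambda^k_{\max}(\hat\Sigma) = \max_{|x|_2 = 1,\ |x|_0 \le k} x^\top \hat\Sigma x = \max_{|S| = k} \lambda_{\max}(\hat\Sigma_S)$$

Note that: $\lambda^k_{\max}(I_p) = 1$ and $\lambda^k_{\max}(I_p + \theta v v^\top) = 1 + \theta$

Smaller fluctuations than the largest eigenvalue $\lambda_{\max}(\hat\Sigma)$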
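The last expression of the definition, a maximum over supports of size $k$, translates directly into a brute-force computation. The sketch below is only meant for very small $p$; the function name `sparse_eigenvalue` and the example values of $p$, $k$, $\theta$ and $v$ are assumptions for illustration.

```python
from itertools import combinations

import numpy as np

def sparse_eigenvalue(A, k):
    """k-sparse largest eigenvalue of a symmetric matrix A, by brute force.

    Scans every support S of size k and keeps the largest lambda_max(A_S).
    Only feasible for small p: there are (p choose k) supports.
    """
    p = A.shape[0]
    best = -np.inf
    for S in combinations(range(p), k):
        idx = np.array(S)
        sub = A[np.ix_(idx, idx)]             # principal submatrix A_S
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best

p, k, theta = 8, 3, 0.5
v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)                      # hypothetical k-sparse unit vector
lam_null = sparse_eigenvalue(np.eye(p), k)
lam_alt = sparse_eigenvalue(np.eye(p) + theta * np.outer(v, v), k)
```

On these population matrices the function should return 1 for the identity and $1 + \theta$ for the spiked matrix, matching the two identities stated on the slide.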

slide-15
SLIDE 15

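Upper bounds w.p. $1 - \delta$

Under the null hypothesis:

$$\lambda^k_{\max}(\hat\Sigma) \le 1 + 8 \sqrt{\frac{k \log(9ep/k) + \log(1/\delta)}{n}} =: \tau_0$$

Under the alternative hypothesis:

$$\lambda^k_{\max}(\hat\Sigma) \ge 1 + \theta - 2(1 + \theta) \sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$$

Can detect as soon as $\tau_0 < \tau_1$, which yields

$$\theta \ge C \sqrt{\frac{k \log(p/k)}{n}}$$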
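To see how the condition $\tau_0 < \tau_1$ turns into a lower bound on $\theta$, note that $\tau_1$ is linear in $\theta$, so the crossing point can be solved in closed form. The sketch below evaluates both thresholds with the constants shown on the slide; the function names and the numerical values of $n$, $p$, $k$, $\delta$ are illustrative assumptions.

```python
import math

def tau0(n, p, k, delta):
    """Upper bound on the k-sparse eigenvalue under H0 (holds w.p. 1 - delta)."""
    return 1 + 8 * math.sqrt(
        (k * math.log(9 * math.e * p / k) + math.log(1 / delta)) / n)

def tau1(theta, n, delta):
    """Lower bound on the k-sparse eigenvalue under H1 (holds w.p. 1 - delta)."""
    return 1 + theta - 2 * (1 + theta) * math.sqrt(math.log(1 / delta) / n)

def crossing_theta(n, p, k, delta):
    """Value of theta at which tau1 = tau0; any larger theta gives tau0 < tau1."""
    eps = math.sqrt(math.log(1 / delta) / n)
    if eps >= 0.5:
        raise ValueError("n is too small: tau1 does not increase with theta")
    # tau1 = (1 - 2 eps) * (1 + theta), so solve (1 - 2 eps)(1 + theta) = tau0.
    return (tau0(n, p, k, delta) - 1 + 2 * eps) / (1 - 2 * eps)

n, p, k, delta = 10_000, 500, 10, 0.05        # hypothetical problem sizes
theta_star = crossing_theta(n, p, k, delta)
```

For fixed $\delta$ and $n$ large enough, the crossing point is of order $\sqrt{k \log(p/k)/n}$, which is where the constant $C$ in the slide comes from.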

slide-16
SLIDE 16

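Minimax lower bound

Fix $\nu > 0$ (small). Then there exists a constant $C_\nu > 0$ such that if

$$\theta < \bar\theta := \sqrt{\frac{k \log(C_\nu p/k^2 + 1)}{n}} \wedge \frac{1}{\sqrt{2}}$$

Then

$$\inf_{\psi} \Big\{ P_0^n(\psi = 1) \vee \max_{|v|_2 = 1,\ |v|_0 \le k} P_v^n(\psi = 0) \Big\} \ge \frac{1}{2} - \nu$$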

See also Arias-Castro, Bubeck and Lugosi (12)

slide-17
SLIDE 17

Computational issues

To compute $\lambda^k_{\max}(\hat\Sigma)$, need to compute $\binom{p}{k}$ eigenvalues.

Can be used to find cliques in graphs: NP-complete problem. Need an approximation...
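The size of the search space can be tabulated quickly. The sketch below counts the $k \times k$ principal submatrices of a $p \times p$ matrix; the function name and the choice $k = \lfloor \sqrt{p} \rfloor$ for the printed values are assumptions for illustration.

```python
import math

def number_of_supports(p, k):
    """Number of k x k principal submatrices of a p x p matrix."""
    return math.comb(p, k)

for p in (50, 100, 200, 500):
    k = math.isqrt(p)                         # assumption: k = floor(sqrt(p))
    print(p, k, number_of_supports(p, k))
```

The counts grow far too fast for exhaustive enumeration at these sizes, which is the motivation for a relaxation.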

slide-18
SLIDE 18

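Semidefinite relaxation 101

$$\lambda^k_{\max}(A) = \max.\ x^\top A x \quad \text{subject to} \quad x^\top x = 1, \quad |x|_0 \le k$$

With $Z = x x^\top$: $x^\top A x = \mathrm{Tr}(AZ)$, $x^\top x = \mathrm{Tr}(Z)$, $\mathrm{rank}(Z) = 1$, $Z \succeq 0$, and $|Z|_1 \le k$ (Cauchy-Schwarz). Dropping the rank constraint gives

$$\mathrm{SDP}_k(A) = \max.\ \mathrm{Tr}(AZ) \quad \text{subject to} \quad \mathrm{Tr}(Z) = 1, \quad |Z|_1 \le k, \quad Z \succeq 0$$

Semidefinite program (SDP) introduced by d’Aspremont, El Ghaoui, Jordan and Lanckriet (2004).

Testing procedure: $\mathbf{1}\{\mathrm{SDP}_k(\hat\Sigma) > \tau\}$

Defined even if solution of SDP has rank > 1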

slide-19
SLIDE 19

Performance of SDP

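For the null: use dual (Bach et al. 2010)

$$\mathrm{SDP}_k(A) = \min_{U \in \mathbf{S}_p^+} \{\lambda_{\max}(A + U) + k|U|_\infty\}$$

For any $U \in \mathbf{S}_p^+$ this gives an upper bound on $\mathrm{SDP}_k(\hat\Sigma)$. Enough to look only at minimum dual perturbation:

$$\mathrm{MDP}_k(\hat\Sigma) = \min_{z \ge 0} \{\lambda_{\max}(\mathrm{st}_z(\hat\Sigma)) + kz\}$$

For the alternative: $\mathrm{SDP}_k(\hat\Sigma)$ is a relaxation of $\lambda^k_{\max}(\hat\Sigma)$, so $\mathrm{SDP}_k(\hat\Sigma) \ge \lambda^k_{\max}(\hat\Sigma)$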
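The minimum dual perturbation statistic only needs a one-dimensional minimization and one largest-eigenvalue computation per candidate $z$. The sketch below approximates it by a grid search. Two assumptions are made here that the slide does not spell out: $\mathrm{st}_z$ is taken to be entrywise soft-thresholding at level $z$, and the minimization over $z \ge 0$ is replaced by a finite grid, so the returned value is an upper bound on the exact minimum. The function names are hypothetical.

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding: sign(a) * max(|a| - z, 0) (assumed form of st_z)."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp(A, k, num_grid=200):
    """Approximate MDP_k(A) by a grid search over z in [0, max |A_ij|].

    For z >= max |A_ij| the thresholded matrix is zero and the objective k * z
    only grows, so the grid covers the relevant range.
    """
    z_grid = np.linspace(0.0, np.abs(A).max(), num_grid)
    values = [np.linalg.eigvalsh(soft_threshold(A, z))[-1] + k * z for z in z_grid]
    return min(values)

p, k, theta = 8, 3, 0.5
v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)                      # hypothetical k-sparse unit vector
mdp_null = mdp(np.eye(p), k)
mdp_alt = mdp(np.eye(p) + theta * np.outer(v, v), k)
```

On the population matrices $I_p$ and $I_p + \theta v v^\top$ the statistic should return 1 and $1 + \theta$ respectively: the value at $z = 0$ is the largest eigenvalue, and no other $z$ can go below the k-sparse eigenvalue.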

slide-20
SLIDE 20

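Upper bounds w.p. $1 - \delta$, for $\ast\mathrm{DP} \in \{\mathrm{SDP}, \mathrm{MDP}\}$

Under the null hypothesis:

$$\ast\mathrm{DP}_k(\hat\Sigma) \le 1 + 10 \sqrt{\frac{k^2 \log(ep/\delta)}{n}} =: \tau_0$$

Under the alternative hypothesis:

$$\ast\mathrm{DP}_k(\hat\Sigma) \ge 1 + \theta - 2(1 + \theta) \sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$$

Can detect as soon as $\tau_0 < \tau_1$, which yields

$$\theta \ge C \sqrt{\frac{k^2 \log(p/k)}{n}}$$
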
slide-21
SLIDE 21

[Figure: ratio $H_1/H_0$ versus $\theta$ for $\mathrm{SDP}_k$, $\mathrm{MDP}_k$ and $\lambda_{\max}(\cdot)$.]

Ratio of 5% quantile under H1 over 95% quantile under H0, versus signal strength θ. When this ratio is larger than one, both type I and type II errors are below 5%.

A. d’Aspremont, Soutenance HDR, ENS Cachan, Nov. 2012.

slide-22
SLIDE 22

Summary

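Detection regimes as a function of $\theta$ (constants omitted):

  • $\theta$ below $\sqrt{\frac{k}{n} \log\left(\frac{p}{k}\right)}$: no detection
  • $\theta$ above $\sqrt{\frac{k}{n} \log\left(\frac{p}{k}\right)}$: detection with $\lambda^k_{\max}$
  • $\theta$ above $\sqrt{\frac{k^2}{n} \log\left(\frac{p}{k}\right)}$: detection with $\ast\mathrm{DP}_k$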

Can we tighten the gap?

slide-23
SLIDE 23

Numerical evidence

Fix type I error at 1%, plot type II error of $\mathrm{MDP}_k$ for $p \in \{50, 100, 200, 500\}$, $k = \sqrt{p}$.

[Figure: two panels of type II error under two scalings: $\frac{k^2}{n} \log\left(\frac{p}{k}\right)$ (proved scaling) and $\frac{k}{n} \log\left(\frac{p}{k}\right)$ (minimax optimal scaling).]

slide-24
SLIDE 24

Random graphs

A random (Erdős-Rényi) graph on $N$ vertices is obtained by drawing edges at random with probability 1/2.

$N = 50$: largest clique is of size $2 \log N = 7.8$ asymptotically almost surely.

slide-25
SLIDE 25

Hidden clique

We can hide a clique (here of size 10) in this graph.

Choose points arbitrarily and draw a clique.

slide-26
SLIDE 26

Hidden clique

Embed it in the original random graph.
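The construction of a hidden clique instance can be sketched in a few lines: draw a random graph with edge probability 1/2, pick a vertex subset, and connect all pairs inside it. The function names `erdos_renyi` and `plant_clique` are hypothetical, and the sizes $N = 50$ and clique size 10 follow the example in the slides.

```python
import numpy as np

def erdos_renyi(N, rng, prob=0.5):
    """Symmetric 0/1 adjacency matrix of a random graph, without self-loops."""
    upper = np.triu(rng.random((N, N)) < prob, k=1)   # independent edges above the diagonal
    return (upper | upper.T).astype(int)

def plant_clique(adj, members):
    """Return a copy of adj in which all vertices in `members` are pairwise connected."""
    planted = adj.copy()
    idx = np.array(members)
    planted[np.ix_(idx, idx)] = 1
    planted[idx, idx] = 0                              # keep the diagonal empty
    return planted

rng = np.random.default_rng(0)
N, clique_size = 50, 10
adj = erdos_renyi(N, rng)
members = rng.choice(N, size=clique_size, replace=False)
graph = plant_clique(adj, members)
```

The detection question is then whether `graph` can be told apart from a fresh draw of `erdos_renyi(N, rng)` without knowing `members`.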

slide-27
SLIDE 27

Hidden clique

Question: is there a hidden clique in this graph?

slide-28
SLIDE 28

Hidden clique problem

It is believed that it is hard to find/test the presence of a clique in a random graph (Alon, Arora, Feige, Hazan, Krauthgamer, ...). Cryptosystems are based on this fact!

Conjecture: It is hard to find cliques of size between $2 \log N$ and $\sqrt{N}$.

  • Alon, Krivelevich, Sudakov 98
  • Feige and Krauthgamer 00
  • Dekel et al. 10
  • Feige and Ron 10
  • Ames and Vavasis 11

Canonical example of average case complexity.

slide-29
SLIDE 29

Hidden clique problem

It seems related to our problem but not trivially (the randomness structure is very fragile).

Note that all our results extend to sub-Gaussian r.v.

  • Theorem. If we could prove that there exists $C > 0$ such that under the null hypothesis it holds

$$\mathrm{SDP}_k(\hat\Sigma) \le 1 + C \sqrt{\frac{k^\alpha \log(ep/\delta)}{n}}$$

for some $\alpha \in (1, 2)$, then it can be used to test the presence of a clique of size $\mathrm{polylog}(N)\, N^{\frac{1}{4 - \alpha}}$.

slide-30
SLIDE 30

Remarks

Unlike usual hardness results, this one is for one (actually two) method only (not for all methods).

In progress: we can remove this limitation using bi-cliques (need to carefully deal with independence).

slide-31
SLIDE 31

Conclusion

  • Optimal rates for sparse detection
  • Computationally efficient methods with suboptimal rate
  • First(?) link between sparse detection and average case complexity
  • Opens the door to new statistical lower bounds: complexity theoretic lower bounds
  • Evidence that heuristics cannot be optimal