Understanding Statistical-vs-Computational Tradeoffs via the Low-Degree Likelihood Ratio

slide-1
SLIDE 1

Understanding Statistical-vs-Computational Tradeoffs via the Low-Degree Likelihood Ratio Alex Wein

Courant Institute, NYU Joint work with: Afonso Bandeira (ETH Zurich) Yunzi Ding (NYU) Tim Kunisky (NYU)

1 / 27

slide-2
SLIDE 2

Motivation

2 / 27

slide-3
SLIDE 3

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”

2 / 27

slide-4
SLIDE 4

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease

2 / 27

slide-5
SLIDE 5

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network

2 / 27

slide-6
SLIDE 6

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network ◮ predict which users will click on which ads

2 / 27

slide-7
SLIDE 7

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network ◮ predict which users will click on which ads ◮ etc.

2 / 27

slide-8
SLIDE 8

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network ◮ predict which users will click on which ads ◮ etc.

There are many potential solutions

2 / 27

slide-9
SLIDE 9

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network ◮ predict which users will click on which ads ◮ etc.

There are many potential solutions. The naïve algorithm would check all possibilities: too slow!

◮ “curse of dimensionality”

2 / 27

slide-10
SLIDE 10

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network ◮ predict which users will click on which ads ◮ etc.

There are many potential solutions. The naïve algorithm would check all possibilities: too slow!

◮ “curse of dimensionality”

Is there a “smarter” algorithm that can find the solution efficiently?

2 / 27

slide-11
SLIDE 11

Motivation

Imagine we have a large noisy dataset and want to extract some kind of hidden “signal”, e.g.,

◮ determine which combination of genes cause a certain disease ◮ find “communities” in a social network ◮ predict which users will click on which ads ◮ etc.

There are many potential solutions. The naïve algorithm would check all possibilities: too slow!

◮ “curse of dimensionality”

Is there a “smarter” algorithm that can find the solution efficiently? Goal: develop a theory to understand which statistical tasks can be solved efficiently (and which ones cannot)

2 / 27

slide-12
SLIDE 12

Part I: Statistical-to-Computational Gaps and the “Low-Degree Method”

3 / 27

slide-13
SLIDE 13

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

4 / 27

slide-14
SLIDE 14

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ n vertices 4 / 27

slide-15
SLIDE 15

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ n vertices ◮ Each of the (n choose 2) edges occurs with probability 1/2

4 / 27

slide-16
SLIDE 16

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ n vertices ◮ Each of the (n choose 2) edges occurs with probability 1/2

◮ Planted clique on k vertices 4 / 27

slide-17
SLIDE 17

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ n vertices ◮ Each of the (n choose 2) edges occurs with probability 1/2

◮ Planted clique on k vertices 4 / 27

slide-18
SLIDE 18

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ n vertices ◮ Each of the (n choose 2) edges occurs with probability 1/2

◮ Planted clique on k vertices ◮ Goal: find the clique 4 / 27
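To make the model concrete, here is a minimal numpy sketch (illustrative, not from the talk; function name and parameters are my own) of sampling a planted-clique instance G(n, 1/2) ∪ {k-clique}:

```python
import numpy as np

def sample_planted_clique(n, k, seed=None):
    """Adjacency matrix of G(n, 1/2) with a k-clique planted on random vertices."""
    rng = np.random.default_rng(seed)
    A = np.triu(rng.random((n, n)) < 0.5, 1)   # each possible edge present with prob 1/2
    A = (A | A.T).astype(int)
    clique = rng.choice(n, size=k, replace=False)
    A[np.ix_(clique, clique)] = 1              # plant the clique
    np.fill_diagonal(A, 0)
    return A, clique

A, clique = sample_planted_clique(n=200, k=20, seed=0)
```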

slide-19
SLIDE 19

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

5 / 27

slide-20
SLIDE 20

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n 5 / 27

slide-21
SLIDE 21

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

5 / 27

slide-22
SLIDE 22

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

5 / 27

slide-23
SLIDE 23

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

5 / 27

slide-24
SLIDE 24

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA 5 / 27

slide-25
SLIDE 25

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA ◮ Community detection in graphs (stochastic block model) 5 / 27

slide-26
SLIDE 26

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA ◮ Community detection in graphs (stochastic block model) ◮ Random constraint satisfaction problems (e.g. 3-SAT) 5 / 27

slide-27
SLIDE 27

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA ◮ Community detection in graphs (stochastic block model) ◮ Random constraint satisfaction problems (e.g. 3-SAT) ◮ Tensor PCA 5 / 27

slide-28
SLIDE 28

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA ◮ Community detection in graphs (stochastic block model) ◮ Random constraint satisfaction problems (e.g. 3-SAT) ◮ Tensor PCA ◮ Tensor decomposition 5 / 27

slide-29
SLIDE 29

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA ◮ Community detection in graphs (stochastic block model) ◮ Random constraint satisfaction problems (e.g. 3-SAT) ◮ Tensor PCA ◮ Tensor decomposition

Different from theory of NP-hardness: average-case

5 / 27

slide-30
SLIDE 30

Statistical-to-Computational Gaps

◮ Planted clique: G(n, 1/2) ∪ {k-clique}

◮ Statistically, can find planted clique of size (2 + ε) log2 n ◮ In polynomial time, we only know how to find clique of size

Ω(√n) [Alon, Krivelevich, Sudakov ’98]

◮ Other examples of stat-comp gaps

◮ Sparse PCA ◮ Community detection in graphs (stochastic block model) ◮ Random constraint satisfaction problems (e.g. 3-SAT) ◮ Tensor PCA ◮ Tensor decomposition

Different from theory of NP-hardness: average-case Q: What fundamentally makes a problem easy or hard?

5 / 27

slide-31
SLIDE 31

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

6 / 27

slide-32
SLIDE 32

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...]

6 / 27

slide-33
SLIDE 33

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92]

6 / 27

slide-34
SLIDE 34

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08]

6 / 27

slide-35
SLIDE 35

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08] ◮ Failure of local algorithms [Gamarnik, Sudan ’13]

6 / 27

slide-36
SLIDE 36

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08] ◮ Failure of local algorithms [Gamarnik, Sudan ’13] ◮ Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová ’11]

6 / 27

slide-37
SLIDE 37

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08] ◮ Failure of local algorithms [Gamarnik, Sudan ’13] ◮ Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová ’11]

◮ Optimization landscape, Kac-Rice formula [Auffinger, Ben Arous, Černý ’10]

6 / 27

slide-38
SLIDE 38

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08] ◮ Failure of local algorithms [Gamarnik, Sudan ’13] ◮ Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová ’11]

◮ Optimization landscape, Kac-Rice formula [Auffinger, Ben Arous, Černý ’10]

◮ Statistical query lower bounds [Feldman, Grigorescu, Reyzin, Vempala, Xiao ’12]

6 / 27

slide-39
SLIDE 39

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08] ◮ Failure of local algorithms [Gamarnik, Sudan ’13] ◮ Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová ’11]

◮ Optimization landscape, Kac-Rice formula [Auffinger, Ben Arous, Černý ’10]

◮ Statistical query lower bounds [Feldman, Grigorescu, Reyzin, Vempala, Xiao ’12] ◮ Sum-of-squares lower bounds [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin ’16]

6 / 27

slide-40
SLIDE 40

How to Show that a Problem is Hard?

We don’t know how to prove that average-case problems are hard, but various forms of “rigorous evidence”:

◮ Reductions (e.g. from planted clique) [Berthet, Rigollet ’13; Brennan, Bresler,...] ◮ Failure of MCMC [Jerrum ’92] ◮ Shattering of solution space [Achlioptas, Coja-Oghlan ’08] ◮ Failure of local algorithms [Gamarnik, Sudan ’13] ◮ Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová ’11]

◮ Optimization landscape, Kac-Rice formula [Auffinger, Ben Arous, Černý ’10]

◮ Statistical query lower bounds [Feldman, Grigorescu, Reyzin, Vempala, Xiao ’12] ◮ Sum-of-squares lower bounds [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin ’16] ◮ This talk: “low-degree method”

[Barak, Hopkins, Kelner, Kothari, Moitra, Potechin ’16; Hopkins, Steurer ’17; Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer ’17; Hopkins ’18 (PhD thesis)] 6 / 27

slide-41
SLIDE 41

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:

7 / 27

slide-42
SLIDE 42

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:

◮ Null model Y ∼ Qn

e.g. G(n, 1/2)

7 / 27

slide-43
SLIDE 43

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:

◮ Null model Y ∼ Qn

e.g. G(n, 1/2)

◮ Planted model Y ∼ Pn

e.g. G(n, 1/2) ∪ {random k-clique}

7 / 27

slide-44
SLIDE 44

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:

◮ Null model Y ∼ Qn

e.g. G(n, 1/2)

◮ Planted model Y ∼ Pn

e.g. G(n, 1/2) ∪ {random k-clique}

Look for a degree-D (multivariate) polynomial f : Rn×n → R that distinguishes P from Q:

7 / 27

slide-45
SLIDE 45

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:

◮ Null model Y ∼ Qn

e.g. G(n, 1/2)

◮ Planted model Y ∼ Pn

e.g. G(n, 1/2) ∪ {random k-clique}

Look for a degree-D (multivariate) polynomial f : Rn×n → R that distinguishes P from Q: Want f (Y ) to be big when Y ∼ P and small when Y ∼ Q

7 / 27

slide-46
SLIDE 46

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:

◮ Null model Y ∼ Qn

e.g. G(n, 1/2)

◮ Planted model Y ∼ Pn

e.g. G(n, 1/2) ∪ {random k-clique}

Look for a degree-D (multivariate) polynomial f : R^(n×n) → R that distinguishes P from Q: want f(Y) to be big when Y ∼ P and small when Y ∼ Q

Compute   max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])   (numerator: mean in P; denominator: fluctuations in Q)

7 / 27
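As a toy illustration (my own example, not from the talk), this ratio can be estimated by Monte Carlo for a simple degree-1 statistic in the planted clique model, the centered edge count; the ratio scales roughly like k²/n, hinting at why √n-size cliques are where simple statistics start to work.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials = 400, 40, 200

def sample_adjacency(planted):
    """0/1 adjacency matrix of G(n, 1/2), optionally with a planted k-clique."""
    A = np.triu(rng.random((n, n)) < 0.5, 1)
    A = (A | A.T).astype(float)
    if planted:
        S = rng.choice(n, size=k, replace=False)
        A[np.ix_(S, S)] = 1.0
        np.fill_diagonal(A, 0.0)
    return A

def edge_count_stat(A):
    """A degree-1 polynomial in the edge indicators: the centered edge count."""
    iu = np.triu_indices(A.shape[0], 1)
    return np.sum(A[iu] - 0.5)

f_P = np.array([edge_count_stat(sample_adjacency(True)) for _ in range(trials)])
f_Q = np.array([edge_count_stat(sample_adjacency(False)) for _ in range(trials)])
# advantage ~ E_P[f] / sqrt(E_Q[f^2]); values >> 1 mean f separates P from Q
print(f_P.mean() / np.sqrt(np.mean(f_Q ** 2)))
```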

slide-47
SLIDE 47

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])

8 / 27

slide-48
SLIDE 48

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])

⟨f, g⟩ = E_{Y∼Q}[f(Y)g(Y)],   ‖f‖ = √⟨f, f⟩

8 / 27

slide-49
SLIDE 49

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  E_{Y∼Q}[L(Y)f(Y)] / √(E_{Y∼Q}[f(Y)²])

⟨f, g⟩ = E_{Y∼Q}[f(Y)g(Y)],   ‖f‖ = √⟨f, f⟩

Likelihood ratio: L(Y) = dP/dQ(Y)

8 / 27

slide-50
SLIDE 50

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  E_{Y∼Q}[L(Y)f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  ⟨L, f⟩ / ‖f‖

⟨f, g⟩ = E_{Y∼Q}[f(Y)g(Y)],   ‖f‖ = √⟨f, f⟩

Likelihood ratio: L(Y) = dP/dQ(Y)

8 / 27

slide-51
SLIDE 51

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  E_{Y∼Q}[L(Y)f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  ⟨L, f⟩ / ‖f‖

⟨f, g⟩ = E_{Y∼Q}[f(Y)g(Y)],   ‖f‖ = √⟨f, f⟩

Likelihood ratio: L(Y) = dP/dQ(Y)

Maximizer: f = L≤D := projection of L onto degree-D subspace

8 / 27

slide-52
SLIDE 52

The Low-Degree Method (e.g. [Hopkins, Steurer ’17])

max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  E_{Y∼Q}[L(Y)f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  max_{deg f ≤ D}  ⟨L, f⟩ / ‖f‖  =  ‖L≤D‖

⟨f, g⟩ = E_{Y∼Q}[f(Y)g(Y)],   ‖f‖ = √⟨f, f⟩

Likelihood ratio: L(Y) = dP/dQ(Y)

Maximizer: f = L≤D := projection of L onto degree-D subspace

‖L≤D‖ = norm of the low-degree likelihood ratio

8 / 27

slide-53
SLIDE 53

The Low-Degree Method

Conclusion:   max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  ‖L≤D‖

9 / 27

slide-54
SLIDE 54

The Low-Degree Method

Conclusion:   max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  ‖L≤D‖

Heuristically: ‖L≤D‖ = ω(1) ⇒ some degree-D polynomial can distinguish Q, P;  ‖L≤D‖ = O(1) ⇒ degree-D polynomials fail

9 / 27

slide-55
SLIDE 55

The Low-Degree Method

Conclusion:   max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  ‖L≤D‖

Heuristically: ‖L≤D‖ = ω(1) ⇒ some degree-D polynomial can distinguish Q, P;  ‖L≤D‖ = O(1) ⇒ degree-D polynomials fail

Conjecture (informal variant of [Hopkins ’18])

For “nice” Q, P, if ‖L≤D‖ = O(1) for some D = ω(log n) then no polynomial-time algorithm can distinguish Q, P with success probability 1 − o(1).

9 / 27

slide-56
SLIDE 56

The Low-Degree Method

Conclusion:   max_{deg f ≤ D}  E_{Y∼P}[f(Y)] / √(E_{Y∼Q}[f(Y)²])  =  ‖L≤D‖

Heuristically: ‖L≤D‖ = ω(1) ⇒ some degree-D polynomial can distinguish Q, P;  ‖L≤D‖ = O(1) ⇒ degree-D polynomials fail

Conjecture (informal variant of [Hopkins ’18])

For “nice” Q, P, if ‖L≤D‖ = O(1) for some D = ω(log n) then no polynomial-time algorithm can distinguish Q, P with success probability 1 − o(1).

Degree-O(log n) polynomials ⇔ Polynomial-time algorithms

9 / 27

slide-57
SLIDE 57

Formal Consequences of the Low-Degree Method

The case D = ∞: If L = O(1) (as n → ∞) then no test can distinguish Q from P (with success probability 1 − o(1))

◮ Classical second moment method

10 / 27

slide-58
SLIDE 58

Formal Consequences of the Low-Degree Method

The case D = ∞: If L = O(1) (as n → ∞) then no test can distinguish Q from P (with success probability 1 − o(1))

◮ Classical second moment method

If L≤D = O(1) for some D = ω(log n) then no spectral method can distinguish Q from P (in a particular sense) [Kunisky, W, Bandeira ’19]

◮ Spectral method: threshold top eigenvalue of poly-size matrix

M = M(Y ) whose entries are O(1)-degree polynomials in Y

10 / 27

slide-59
SLIDE 59

Formal Consequences of the Low-Degree Method

The case D = ∞: If L = O(1) (as n → ∞) then no test can distinguish Q from P (with success probability 1 − o(1))

◮ Classical second moment method

If L≤D = O(1) for some D = ω(log n) then no spectral method can distinguish Q from P (in a particular sense) [Kunisky, W, Bandeira ’19]

◮ Spectral method: threshold top eigenvalue of poly-size matrix

M = M(Y ) whose entries are O(1)-degree polynomials in Y

◮ Proof: consider polynomial f (Y ) = Tr(Mq) with q = Θ(log n)

10 / 27

slide-60
SLIDE 60

Formal Consequences of the Low-Degree Method

The case D = ∞: If L = O(1) (as n → ∞) then no test can distinguish Q from P (with success probability 1 − o(1))

◮ Classical second moment method

If L≤D = O(1) for some D = ω(log n) then no spectral method can distinguish Q from P (in a particular sense) [Kunisky, W, Bandeira ’19]

◮ Spectral method: threshold top eigenvalue of poly-size matrix

M = M(Y ) whose entries are O(1)-degree polynomials in Y

◮ Proof: consider polynomial f (Y ) = Tr(Mq) with q = Θ(log n) ◮ Spectral methods are believed to be as powerful as

sum-of-squares for average-case problems [HKPRSS ’17]

10 / 27
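A schematic of what “spectral method” means here, as a sketch under my own simplifications (M(Y) = Y rather than a higher-degree matrix polynomial), using the spiked Wigner model that appears later in the talk:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300

def wigner(n):
    """Symmetric noise matrix with entries ~ N(0, 1/n)."""
    W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    return (W + W.T) / np.sqrt(2)

def spectral_test(M, q=8):
    """Generic spectral method: threshold the top eigenvalue of a matrix M = M(Y).
    Tr(M^q) for even q ~ log n is a polynomial (of degree q * deg M in Y) that
    tracks lambda_max(M)^q, which is how such tests fit the low-degree framework."""
    lam_max = np.linalg.eigvalsh(M)[-1]
    tr_pow = np.trace(np.linalg.matrix_power(M, q))
    return lam_max, tr_pow

x = rng.normal(size=n)
x /= np.linalg.norm(x)
Y_null = wigner(n)                               # Q: pure noise
Y_spiked = 2.0 * np.outer(x, x) + wigner(n)      # P: planted rank-one signal
print(spectral_test(Y_null)[0], spectral_test(Y_spiked)[0])   # ~2.0 vs ~2.5
```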

slide-61
SLIDE 61

Low-Degree Method: Recap

Given a hypothesis testing question Qn vs Pn

11 / 27

slide-62
SLIDE 62

Low-Degree Method: Recap

Given a hypothesis testing question Qn vs Pn Take D ≈ log n

11 / 27

slide-63
SLIDE 63

Low-Degree Method: Recap

Given a hypothesis testing question Qn vs Pn Take D ≈ log n Compute/bound L≤D in the limit n → ∞

11 / 27

slide-64
SLIDE 64

Low-Degree Method: Recap

Given a hypothesis testing question Qn vs Pn Take D ≈ log n Compute/bound L≤D in the limit n → ∞

◮ If L≤D = ω(1), suggests that the problem is poly-time

solvable

11 / 27

slide-65
SLIDE 65

Low-Degree Method: Recap

Given a hypothesis testing question Qn vs Pn Take D ≈ log n Compute/bound L≤D in the limit n → ∞

◮ If L≤D = ω(1), suggests that the problem is poly-time

solvable

◮ If L≤D = O(1), suggests that the problem is NOT poly-time

solvable (and gives rigorous evidence: spectral methods fail)

11 / 27

slide-66
SLIDE 66

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems

12 / 27

slide-67
SLIDE 67

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

12 / 27

slide-68
SLIDE 68

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ... 12 / 27

slide-69
SLIDE 69

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

12 / 27

slide-70
SLIDE 70

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds 12 / 27

slide-71
SLIDE 71

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds

◮ Detection vs certification

12 / 27

slide-72
SLIDE 72

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds

◮ Detection vs certification ◮ General: no assumptions on Q, P

12 / 27

slide-73
SLIDE 73

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds

◮ Detection vs certification ◮ General: no assumptions on Q, P ◮ Captures sharp thresholds [Hopkins, Steurer ’17]

12 / 27

slide-74
SLIDE 74

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds

◮ Detection vs certification ◮ General: no assumptions on Q, P ◮ Captures sharp thresholds [Hopkins, Steurer ’17] ◮ By varying degree D, can explore runtimes other than

polynomial

◮ Conjecture (Hopkins ’18): degree-D polynomials ⇔ time-n^Θ̃(D) algorithms

12 / 27

slide-75
SLIDE 75

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds

◮ Detection vs certification ◮ General: no assumptions on Q, P ◮ Captures sharp thresholds [Hopkins, Steurer ’17] ◮ By varying degree D, can explore runtimes other than

polynomial

◮ Conjecture (Hopkins ’18): degree-D polynomials ⇔ time-n^Θ̃(D) algorithms

◮ No ingenuity required

12 / 27

slide-76
SLIDE 76

Advantages of the Low-Degree Method

◮ Possible to calculate/bound L≤D for many problems ◮ Predictions seem “correct”!

◮ Planted clique, sparse PCA, stochastic block model, ...

◮ (Relatively) simple

◮ Much simpler than sum-of-squares lower bounds

◮ Detection vs certification ◮ General: no assumptions on Q, P ◮ Captures sharp thresholds [Hopkins, Steurer ’17] ◮ By varying degree D, can explore runtimes other than

polynomial

◮ Conjecture (Hopkins ’18): degree-D polynomials ⇔ time-n^Θ̃(D) algorithms

◮ No ingenuity required ◮ Interpretable

12 / 27

slide-77
SLIDE 77

How to Compute L≤D

Additive Gaussian noise: P : Y = X + Z vs Q : Y = Z, where X is drawn from an arbitrary prior over R^N and Z is i.i.d. N(0, 1)

13 / 27

slide-78
SLIDE 78

How to Compute L≤D

Additive Gaussian noise: P : Y = X + Z vs Q : Y = Z, where X is drawn from an arbitrary prior over R^N and Z is i.i.d. N(0, 1)

L(Y) = dP/dQ(Y) = E_X[exp(−½‖Y − X‖²)] / exp(−½‖Y‖²) = E_X exp(⟨Y, X⟩ − ½‖X‖²)

13 / 27

slide-79
SLIDE 79

How to Compute L≤D

Additive Gaussian noise: P : Y = X + Z vs Q : Y = Z, where X is drawn from an arbitrary prior over R^N and Z is i.i.d. N(0, 1)

L(Y) = dP/dQ(Y) = E_X[exp(−½‖Y − X‖²)] / exp(−½‖Y‖²) = E_X exp(⟨Y, X⟩ − ½‖X‖²)

Expand L = Σ_α c_α h_α where {h_α} are Hermite polynomials (orthonormal basis w.r.t. Q)

13 / 27

slide-80
SLIDE 80

How to Compute L≤D

Additive Gaussian noise: P : Y = X + Z vs Q : Y = Z, where X is drawn from an arbitrary prior over R^N and Z is i.i.d. N(0, 1)

L(Y) = dP/dQ(Y) = E_X[exp(−½‖Y − X‖²)] / exp(−½‖Y‖²) = E_X exp(⟨Y, X⟩ − ½‖X‖²)

Expand L = Σ_α c_α h_α where {h_α} are Hermite polynomials (orthonormal basis w.r.t. Q)

‖L≤D‖² = Σ_{|α|≤D} c_α²  where  c_α = ⟨L, h_α⟩ = E_{Y∼Q}[L(Y) h_α(Y)]

13 / 27

slide-81
SLIDE 81

How to Compute L≤D

Additive Gaussian noise: P : Y = X + Z vs Q : Y = Z, where X is drawn from an arbitrary prior over R^N and Z is i.i.d. N(0, 1)

L(Y) = dP/dQ(Y) = E_X[exp(−½‖Y − X‖²)] / exp(−½‖Y‖²) = E_X exp(⟨Y, X⟩ − ½‖X‖²)

Expand L = Σ_α c_α h_α where {h_α} are Hermite polynomials (orthonormal basis w.r.t. Q)

‖L≤D‖² = Σ_{|α|≤D} c_α²  where  c_α = ⟨L, h_α⟩ = E_{Y∼Q}[L(Y) h_α(Y)]

· · ·

13 / 27

slide-82
SLIDE 82

How to Compute L≤D

Additive Gaussian noise: P : Y = X + Z vs Q : Y = Z, where X is drawn from an arbitrary prior over R^N and Z is i.i.d. N(0, 1)

L(Y) = dP/dQ(Y) = E_X[exp(−½‖Y − X‖²)] / exp(−½‖Y‖²) = E_X exp(⟨Y, X⟩ − ½‖X‖²)

Expand L = Σ_α c_α h_α where {h_α} are Hermite polynomials (orthonormal basis w.r.t. Q)

‖L≤D‖² = Σ_{|α|≤D} c_α²  where  c_α = ⟨L, h_α⟩ = E_{Y∼Q}[L(Y) h_α(Y)]

· · ·

Result:  ‖L≤D‖² = Σ_{d=0}^{D} (1/d!) E_{X,X′}[⟨X, X′⟩^d]

13 / 27
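The boxed result can be evaluated numerically. Below is a hedged sketch (the function names and the toy prior are mine) that Monte Carlo approximates ‖L≤D‖² = Σ_{d≤D} E[⟨X, X′⟩^d]/d! for an additive Gaussian model Y = X + Z with a vector-valued signal; mapping the rank-one matrix problems onto this formula takes extra bookkeeping and is not attempted here.

```python
import numpy as np
from math import factorial

def low_degree_norm_sq(sample_X, D, trials=20000, seed=0):
    """Monte Carlo estimate of ||L^{<=D}||^2 = sum_{d=0}^{D} E[<X, X'>^d] / d!
    for an additive Gaussian model Y = X + Z (the formula on the slide above)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        inner = float(np.dot(sample_X(rng), sample_X(rng)))   # <X, X'>, independent copies
        total += sum(inner ** d / factorial(d) for d in range(D + 1))
    return total / trials

def toy_sparse_signal(rng, n=200, k=10, theta=1.0):
    """Toy vector-valued prior: theta times a k-sparse Rademacher unit vector."""
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return theta * x

print(low_degree_norm_sq(toy_sparse_signal, D=4))
```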

slide-83
SLIDE 83

References

For more on the low-degree method...

◮ Samuel B. Hopkins, PhD thesis ’18: “Statistical Inference and

the Sum of Squares Method”

◮ Connection to SoS

◮ Survey article: Kunisky, W, Bandeira, “Notes on

Computational Hardness of Hypothesis Testing: Predictions using the Low-Degree Likelihood Ratio”, arxiv:1907.11636

14 / 27

slide-84
SLIDE 84

Part II: Sparse PCA

Based on: Ding, Kunisky, W., Bandeira, “Subexponential-Time Algorithms for Sparse PCA”, arxiv:1907.11635

15 / 27

slide-85
SLIDE 85

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

16 / 27

slide-86
SLIDE 86

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

Goal: given Y, estimate the signal x

16 / 27

slide-87
SLIDE 87

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

Goal: given Y, estimate the signal x

Or, even simpler: distinguish (w.h.p.) Y from pure noise W

16 / 27

slide-88
SLIDE 88

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

Goal: given Y, estimate the signal x

Or, even simpler: distinguish (w.h.p.) Y from pure noise W

Structure: suppose x is drawn from some prior, e.g.

16 / 27

slide-89
SLIDE 89

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

Goal: given Y, estimate the signal x

Or, even simpler: distinguish (w.h.p.) Y from pure noise W

Structure: suppose x is drawn from some prior, e.g.

◮ spherical (uniform on unit sphere)

16 / 27

slide-90
SLIDE 90

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

Goal: given Y, estimate the signal x

Or, even simpler: distinguish (w.h.p.) Y from pure noise W

Structure: suppose x is drawn from some prior, e.g.

◮ spherical (uniform on unit sphere) ◮ Rademacher (i.i.d. ±1/√n)

16 / 27

slide-91
SLIDE 91

Spiked Wigner Model

Observe n × n matrix Y = λxx⊤ + W. Signal: x ∈ R^n, ‖x‖ = 1. Noise: W ∈ R^(n×n) with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d. λ > 0: signal-to-noise ratio.

Goal: given Y, estimate the signal x

Or, even simpler: distinguish (w.h.p.) Y from pure noise W

Structure: suppose x is drawn from some prior, e.g.

◮ spherical (uniform on unit sphere) ◮ Rademacher (i.i.d. ±1/√n) ◮ sparse

16 / 27

slide-92
SLIDE 92

PCA (Principal Component Analysis)

Y = λxxT + W

  • J. Baik, G. Ben Arous, S. Peche, AoP 2005.
  • D. Feral, S. Peche, CMP 2006.

17 / 27

slide-93
SLIDE 93

PCA (Principal Component Analysis)

Y = λxxT + W PCA: top eigenvalue λ1(Y ) and (unit-norm) eigenvector v1

  • J. Baik, G. Ben Arous, S. Peche, AoP 2005.
  • D. Feral, S. Peche, CMP 2006.

17 / 27

slide-94
SLIDE 94

PCA (Principal Component Analysis)

Y = λxxT + W PCA: top eigenvalue λ1(Y ) and (unit-norm) eigenvector v1

Theorem (BBP’05, FP’06)

Almost surely, as n → ∞,

  • J. Baik, G. Ben Arous, S. Peche, AoP 2005.
  • D. Feral, S. Peche, CMP 2006.

17 / 27

slide-95
SLIDE 95

PCA (Principal Component Analysis)

Y = λxxT + W PCA: top eigenvalue λ1(Y ) and (unit-norm) eigenvector v1

Theorem (BBP’05, FP’06)

Almost surely, as n → ∞,

◮ If λ ≤ 1: λ1(Y) → 2 and ⟨x, v1⟩ → 0

  • J. Baik, G. Ben Arous, S. Peche, AoP 2005.
  • D. Feral, S. Peche, CMP 2006.

17 / 27

slide-96
SLIDE 96

PCA (Principal Component Analysis)

Y = λxxT + W PCA: top eigenvalue λ1(Y ) and (unit-norm) eigenvector v1

Theorem (BBP’05, FP’06)

Almost surely, as n → ∞,

◮ If λ ≤ 1: λ1(Y) → 2 and ⟨x, v1⟩ → 0 ◮ If λ > 1: λ1(Y) → λ + 1/λ > 2 and ⟨x, v1⟩² → 1 − 1/λ² > 0

  • J. Baik, G. Ben Arous, S. Peche, AoP 2005.
  • D. Feral, S. Peche, CMP 2006.

17 / 27

slide-97
SLIDE 97

PCA (Principal Component Analysis)

Y = λxxT + W PCA: top eigenvalue λ1(Y ) and (unit-norm) eigenvector v1

Theorem (BBP’05, FP’06)

Almost surely, as n → ∞,

◮ If λ ≤ 1: λ1(Y) → 2 and ⟨x, v1⟩ → 0 ◮ If λ > 1: λ1(Y) → λ + 1/λ > 2 and ⟨x, v1⟩² → 1 − 1/λ² > 0

Sharp threshold: PCA can detect and recover the signal iff λ > 1

  • J. Baik, G. Ben Arous, S. Peche, AoP 2005.
  • D. Feral, S. Peche, CMP 2006.

17 / 27
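The BBP transition is easy to see numerically; a small sketch (parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

def spiked_wigner(lam):
    """Y = lam * x x^T + W with ||x|| = 1 and symmetric noise entries ~ N(0, 1/n)."""
    x = rng.normal(size=n)
    x /= np.linalg.norm(x)
    W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    W = (W + W.T) / np.sqrt(2)
    return lam * np.outer(x, x) + W, x

for lam in [0.5, 1.5, 3.0]:
    Y, x = spiked_wigner(lam)
    vals, vecs = np.linalg.eigh(Y)
    lam1, v1 = vals[-1], vecs[:, -1]
    # BBP: below lam = 1, lam1 ~ 2 and <x, v1>^2 ~ 0;
    # above, lam1 ~ lam + 1/lam and <x, v1>^2 ~ 1 - 1/lam^2
    print(lam, round(float(lam1), 3), round(float(np.dot(x, v1)) ** 2, 3))
```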

slide-98
SLIDE 98

Is PCA Optimal?

18 / 27

slide-99
SLIDE 99

Is PCA Optimal?

PCA does not exploit structure of signal x

18 / 27

slide-100
SLIDE 100

Is PCA Optimal?

PCA does not exploit structure of signal x Is the PCA threshold (λ = 1) optimal?

◮ Is it statistically possible to detect/recover when λ < 1?

18 / 27

slide-101
SLIDE 101

Is PCA Optimal?

PCA does not exploit structure of signal x Is the PCA threshold (λ = 1) optimal?

◮ Is it statistically possible to detect/recover when λ < 1?

Answer: it depends on the prior for x

18 / 27

slide-102
SLIDE 102

Is PCA Optimal?

PCA does not exploit structure of signal x Is the PCA threshold (λ = 1) optimal?

◮ Is it statistically possible to detect/recover when λ < 1?

Answer: it depends on the prior for x For some priors (e.g. spherical, Rademacher), detection and recovery are statistically impossible when λ < 1 [MRZ’14, DAM’15, PWBM’18]

18 / 27

slide-103
SLIDE 103

Is PCA Optimal?

PCA does not exploit structure of signal x Is the PCA threshold (λ = 1) optimal?

◮ Is it statistically possible to detect/recover when λ < 1?

Answer: it depends on the prior for x For some priors (e.g. spherical, Rademacher), detection and recovery are statistically impossible when λ < 1 [MRZ’14, DAM’15, PWBM’18] But what if x is sparse?

18 / 27

slide-104
SLIDE 104

Sparse PCA

Suppose x ∈ Rn is drawn from the k-sparse Rademacher prior:

◮ k random entries of x are nonzero ◮ the nonzero entries are drawn uniformly from {±1/√k}

Johnstone, Lu ’04, ’09 19 / 27

slide-105
SLIDE 105

Sparse PCA

Suppose x ∈ Rn is drawn from the k-sparse Rademacher prior:

◮ k random entries of x are nonzero ◮ the nonzero entries are drawn uniformly from {±1/√k}

Normalization: ‖x‖ = 1

Johnstone, Lu ’04, ’09 19 / 27

slide-106
SLIDE 106

Sparse PCA

Suppose x ∈ Rn is drawn from the k-sparse Rademacher prior:

◮ k random entries of x are nonzero ◮ the nonzero entries are drawn uniformly from {±1/√k}

Normalization: ‖x‖ = 1

As before, Y = λxx⊤ + W

Johnstone, Lu ’04, ’09 19 / 27

slide-107
SLIDE 107

Sparse PCA

Suppose x ∈ Rn is drawn from the k-sparse Rademacher prior:

◮ k random entries of x are nonzero ◮ the nonzero entries are drawn uniformly from {±1/√k}

Normalization: ‖x‖ = 1

As before, Y = λxx⊤ + W

Assume λ < 1 is a constant

◮ PCA fails

Johnstone, Lu ’04, ’09 19 / 27
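For reference, a minimal sketch (function name is mine, not the paper's) of sampling the k-sparse Rademacher spike and forming Y = λxx⊤ + W:

```python
import numpy as np

def sparse_pca_instance(n, k, lam, seed=None):
    """Sample x from the k-sparse Rademacher prior (||x|| = 1) and return
    Y = lam * x x^T + W with symmetric noise entries ~ N(0, 1/n)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    support = rng.choice(n, size=k, replace=False)
    x[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
    W = (W + W.T) / np.sqrt(2)
    return lam * np.outer(x, x) + W, x

Y, x = sparse_pca_instance(n=500, k=50, lam=0.8, seed=0)
```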

slide-108
SLIDE 108

Maximum Likelihood Estimator

Let Sk := {v ∈ {0, ±1/√k}^n : ‖v‖₀ = k} (set of k-sparse Rademacher vectors)

20 / 27

slide-109
SLIDE 109

Maximum Likelihood Estimator

Let Sk := {v ∈ {0, ±1/√k}^n : ‖v‖₀ = k} (set of k-sparse Rademacher vectors)

MLE: x̂ = argmax_{v∈Sk} v⊤Yv

20 / 27

slide-110
SLIDE 110

Maximum Likelihood Estimator

Let Sk := {v ∈ {0, ±1/√k}^n : ‖v‖₀ = k} (set of k-sparse Rademacher vectors)

MLE: x̂ = argmax_{v∈Sk} v⊤Yv

Succeeds (x̂ = x with high probability) provided k ≲ n/log n

[PJ’12, VL’12, CMW’13] 20 / 27

slide-111
SLIDE 111

Maximum Likelihood Estimator

Let Sk := {v ∈ {0, ±1/√k}^n : ‖v‖₀ = k} (set of k-sparse Rademacher vectors)

MLE: x̂ = argmax_{v∈Sk} v⊤Yv

Succeeds (x̂ = x with high probability) provided k ≲ n/log n

[PJ’12, VL’12, CMW’13]

◮ For weak recovery, k < ρ∗n ≈ 0.09n

[LKZ’15, KXZ’16, DMK+’16, LM’19, EKJ’17] 20 / 27

slide-112
SLIDE 112

Maximum Likelihood Estimator

Let Sk := {v ∈ {0, ±1/√k}^n : ‖v‖₀ = k} (set of k-sparse Rademacher vectors)

MLE: x̂ = argmax_{v∈Sk} v⊤Yv

Succeeds (x̂ = x with high probability) provided k ≲ n/log n

[PJ’12, VL’12, CMW’13]

◮ For weak recovery, k < ρ∗n ≈ 0.09n

[LKZ’15, KXZ’16, DMK+’16, LM’19, EKJ’17]

Runtime: (n choose k) ≈ n^k ≈ exp(k)

20 / 27
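For very small instances the MLE can be run literally as stated, by enumerating all supports and sign patterns; a brute-force sketch (only feasible for tiny n and k, names are mine) follows.

```python
import numpy as np
from itertools import combinations, product

def mle_sparse_pca(Y, k):
    """Exhaustive-search MLE: maximize v^T Y v over all k-sparse Rademacher
    vectors v. Runtime ~ (n choose k) * 2^k, so only usable for tiny n, k."""
    n = Y.shape[0]
    best_val, best = -np.inf, None
    for support in combinations(range(n), k):
        Ys = Y[np.ix_(support, support)]
        for signs in product([-1.0, 1.0], repeat=k):
            s = np.array(signs)
            val = float(s @ Ys @ s)      # same argmax as for v = s / sqrt(k)
            if val > best_val:
                best_val, best = val, (support, s)
    x_hat = np.zeros(n)
    x_hat[list(best[0])] = best[1] / np.sqrt(k)
    return x_hat
```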

slide-113
SLIDE 113

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

21 / 27

slide-114
SLIDE 114

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

◮ Identify the largest k diagonal entries Yii

21 / 27

slide-115
SLIDE 115

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

◮ Identify the largest k diagonal entries Yii ◮ Report these indices i as the support of x

21 / 27

slide-116
SLIDE 116

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

◮ Identify the largest k diagonal entries Yii ◮ Report these indices i as the support of x ◮ (Easy to then recover x once you know the support)

21 / 27

slide-117
SLIDE 117

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

◮ Identify the largest k diagonal entries Yii ◮ Report these indices i as the support of x ◮ (Easy to then recover x once you know the support)

Succeeds (exact recovery) provided k ≲ √(n/log n) [Amini, Wainwright ’08]

21 / 27

slide-118
SLIDE 118

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

◮ Identify the largest k diagonal entries Yii ◮ Report these indices i as the support of x ◮ (Easy to then recover x once you know the support)

Succeeds (exact recovery) provided k ≲ √(n/log n) [Amini, Wainwright ’08]

Runtime: polynomial

21 / 27

slide-119
SLIDE 119

Diagonal Thresholding

Diagonal thresholding algorithm [Johnstone, Lu ’09]:

◮ Identify the largest k diagonal entries Yii ◮ Report these indices i as the support of x ◮ (Easy to then recover x once you know the support)

Succeeds (exact recovery) provided k ≲ √(n/log n) [Amini, Wainwright ’08]

Runtime: polynomial

Variant: covariance thresholding is poly-time and succeeds when k ≲ √n (removes log factor) [Krauthgamer, Nadler, Vilenchik ’15, Deshpande, Montanari ’14]

21 / 27
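Diagonal thresholding itself is a one-liner; a sketch with a quick synthetic check (the inline instance generator is illustrative, not from the paper):

```python
import numpy as np

def diagonal_thresholding_support(Y, k):
    """Johnstone-Lu-style diagonal thresholding: return the indices of the k
    largest diagonal entries of Y as the estimated support of the spike."""
    return set(int(i) for i in np.argsort(np.diag(Y))[-k:])

rng = np.random.default_rng(0)
n, k, lam = 2000, 5, 0.9
x = np.zeros(n)
S = rng.choice(n, size=k, replace=False)
x[S] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
W = rng.normal(0.0, 1.0 / np.sqrt(n), (n, n))
Y = lam * np.outer(x, x) + (W + W.T) / np.sqrt(2)
found = diagonal_thresholding_support(Y, k)
print(len(found & set(S.tolist())), "of", k, "support indices recovered")
```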

slide-120
SLIDE 120

Hard Regime

To summarize:

22 / 27

slide-121
SLIDE 121

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

22 / 27

slide-122
SLIDE 122

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n

22 / 27

slide-123
SLIDE 123

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n Believed “hard” when √n ≪ k ≪ n

22 / 27

slide-124
SLIDE 124

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n Believed “hard” when √n ≪ k ≪ n

◮ Reduction from planted clique [BR’13, WBS’16, BBH’18, BB’19] ◮ Sum-of-squares lower bounds [MW’15, HKP+’17]

22 / 27

slide-125
SLIDE 125

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n Believed “hard” when √n ≪ k ≪ n

◮ Reduction from planted clique [BR’13, WBS’16, BBH’18, BB’19] ◮ Sum-of-squares lower bounds [MW’15, HKP+’17]

Question: exactly how hard is the “hard” regime?

22 / 27

slide-126
SLIDE 126

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n Believed “hard” when √n ≪ k ≪ n

◮ Reduction from planted clique [BR’13, WBS’16, BBH’18, BB’19] ◮ Sum-of-squares lower bounds [MW’15, HKP+’17]

Question: exactly how hard is the “hard” regime?

◮ Can you do better than exp(k)?

22 / 27

slide-127
SLIDE 127

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n Believed “hard” when √n ≪ k ≪ n

◮ Reduction from planted clique [BR’13, WBS’16, BBH’18, BB’19] ◮ Sum-of-squares lower bounds [MW’15, HKP+’17]

Question: exactly how hard is the “hard” regime?

◮ Can you do better than exp(k)? ◮ Reduction from planted clique doesn’t rule out

quasipolynomial time nO(log n)

22 / 27

slide-128
SLIDE 128

Hard Regime

To summarize: Statistically possible when k ≪ n

◮ Runtime exp(k)

Poly-time solvable when k ≪ √n Believed “hard” when √n ≪ k ≪ n

◮ Reduction from planted clique [BR’13, WBS’16, BBH’18, BB’19] ◮ Sum-of-squares lower bounds [MW’15, HKP+’17]

Question: exactly how hard is the “hard” regime?

◮ Can you do better than exp(k)? Yes: exp(k2/n) ◮ Reduction from planted clique doesn’t rule out

quasipolynomial time nO(log n)

22 / 27

slide-129
SLIDE 129

Low-Degree Prediction

Hypothesis testing between:

◮ P : Y = λxx⊤ + W with x drawn from k-sparse prior ◮ Q : Y = W

23 / 27

slide-130
SLIDE 130

Low-Degree Prediction

Hypothesis testing between:

◮ P : Y = λxx⊤ + W with x drawn from k-sparse prior ◮ Q : Y = W

Theorem (Ding, Kunisky, W., Bandeira ’19)

Suppose λ = Θ(1).

◮ If λ < 1 and D ≪ k2/n then L≤D = O(1) (“hard”) ◮ If λ > 1 or D ≫ k2/n then L≤D = ω(1) (“easy”)

23 / 27

slide-131
SLIDE 131

Low-Degree Prediction

Hypothesis testing between:

◮ P : Y = λxx⊤ + W with x drawn from k-sparse prior ◮ Q : Y = W

Theorem (Ding, Kunisky, W., Bandeira ’19)

Suppose λ = Θ(1).

◮ If λ < 1 and D ≪ k2/n then L≤D = O(1) (“hard”) ◮ If λ > 1 or D ≫ k2/n then L≤D = ω(1) (“easy”)

So degree-D polynomials can distinguish iff λ > 1 or D ≫ k2/n

23 / 27

slide-132
SLIDE 132

Low-Degree Prediction

Hypothesis testing between:

◮ P : Y = λxx⊤ + W with x drawn from k-sparse prior ◮ Q : Y = W

Theorem (Ding, Kunisky, W., Bandeira ’19)

Suppose λ = Θ(1).

◮ If λ < 1 and D ≪ k2/n then L≤D = O(1) (“hard”) ◮ If λ > 1 or D ≫ k2/n then L≤D = ω(1) (“easy”)

So degree-D polynomials can distinguish iff λ > 1 or D ≫ k2/n. Suggests an algorithm of runtime n^(k2/n) ≈ exp(k2/n) (and no better)

23 / 27

slide-133
SLIDE 133

Low-Degree Prediction

Hypothesis testing between:

◮ P : Y = λxx⊤ + W with x drawn from k-sparse prior ◮ Q : Y = W

Theorem (Ding, Kunisky, W., Bandeira ’19)

Suppose λ = Θ(1).

◮ If λ < 1 and D ≪ k2/n then L≤D = O(1) (“hard”) ◮ If λ > 1 or D ≫ k2/n then L≤D = ω(1) (“easy”)

So degree-D polynomials can distinguish iff λ > 1 or D ≫ k2/n. Suggests an algorithm of runtime n^(k2/n) ≈ exp(k2/n) (and no better)

◮ Subexponential time: exp(nδ) with δ ∈ (0, 1)

23 / 27

slide-134
SLIDE 134

Low-Degree Prediction

Hypothesis testing between:

◮ P : Y = λxx⊤ + W with x drawn from k-sparse prior ◮ Q : Y = W

Theorem (Ding, Kunisky, W., Bandeira ’19)

Suppose λ = Θ(1).

◮ If λ < 1 and D ≪ k2/n then L≤D = O(1) (“hard”) ◮ If λ > 1 or D ≫ k2/n then L≤D = ω(1) (“easy”)

So degree-D polynomials can distinguish iff λ > 1 or D ≫ k2/n. Suggests an algorithm of runtime n^(k2/n) ≈ exp(k2/n) (and no better)

◮ Subexponential time: exp(nδ) with δ ∈ (0, 1)

And indeed we will find such an algorithm...

23 / 27

slide-135
SLIDE 135

The Algorithm

For now, consider the detection problem (P vs Q)

24 / 27

slide-136
SLIDE 136

The Algorithm

For now, consider the detection problem (P vs Q) Choose a parameter 1 ≤ ℓ ≤ k

24 / 27

slide-137
SLIDE 137

The Algorithm

For now, consider the detection problem (P vs Q). Choose a parameter 1 ≤ ℓ ≤ k. Let Sℓ := {v ∈ {0, ±1}^n : ‖v‖₀ = ℓ}

24 / 27

slide-138
SLIDE 138

The Algorithm

For now, consider the detection problem (P vs Q). Choose a parameter 1 ≤ ℓ ≤ k. Let Sℓ := {v ∈ {0, ±1}^n : ‖v‖₀ = ℓ}

Let T := max_{v∈Sℓ} v⊤Yv

24 / 27

slide-139
SLIDE 139

The Algorithm

For now, consider the detection problem (P vs Q). Choose a parameter 1 ≤ ℓ ≤ k. Let Sℓ := {v ∈ {0, ±1}^n : ‖v‖₀ = ℓ}

Let T := max_{v∈Sℓ} v⊤Yv

Algorithm: compute T and threshold it (large ⇒ P)

24 / 27

slide-140
SLIDE 140

The Algorithm

For now, consider the detection problem (P vs Q). Choose a parameter 1 ≤ ℓ ≤ k. Let Sℓ := {v ∈ {0, ±1}^n : ‖v‖₀ = ℓ}

Let T := max_{v∈Sℓ} v⊤Yv

Algorithm: compute T and threshold it (large ⇒ P)

◮ ℓ = k ⇒ exhaustive search (MLE) ◮ ℓ = 1 ⇒ diagonal thresholding max_i Yii

24 / 27

slide-141
SLIDE 141

The Algorithm

For now, consider the detection problem (P vs Q). Choose a parameter 1 ≤ ℓ ≤ k. Let Sℓ := {v ∈ {0, ±1}^n : ‖v‖₀ = ℓ}

Let T := max_{v∈Sℓ} v⊤Yv

Algorithm: compute T and threshold it (large ⇒ P)

◮ ℓ = k ⇒ exhaustive search (MLE) ◮ ℓ = 1 ⇒ diagonal thresholding max_i Yii

Runtime: (n choose ℓ) ≈ n^ℓ ≈ exp(ℓ)

24 / 27
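A brute-force sketch of the test statistic T (only practical for very small ℓ, matching the (n choose ℓ) runtime above up to the 2^ℓ sign enumeration; the function name is mine):

```python
import numpy as np
from itertools import combinations

def sparse_max_quadratic(Y, ell):
    """T = max of v^T Y v over ell-sparse vectors v with entries in {0, +1, -1},
    by brute force over supports and sign patterns; runtime ~ (n choose ell) * 2^ell."""
    n = Y.shape[0]
    best = -np.inf
    for support in combinations(range(n), ell):
        Ys = Y[np.ix_(support, support)]
        for bits in range(2 ** ell):
            s = np.array([1.0 if (bits >> i) & 1 else -1.0 for i in range(ell)])
            best = max(best, float(s @ Ys @ s))
    return best
```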

slide-142
SLIDE 142

Analysis of the Algorithm

Recall: algorithm thresholds T := max_{v∈Sℓ} v⊤Yv

25 / 27

slide-143
SLIDE 143

Analysis of the Algorithm

Recall: algorithm thresholds T := max_{v∈Sℓ} v⊤Yv

Analysis:

◮ Under P, Y = λxx⊤ + W , show T is large by considering a

‘good’ v (contained in x)

25 / 27

slide-144
SLIDE 144

Analysis of the Algorithm

Recall: algorithm thresholds T := max_{v∈Sℓ} v⊤Yv

Analysis:

◮ Under P, Y = λxx⊤ + W , show T is large by considering a

‘good’ v (contained in x)

◮ Under Q, Y = W , show T is small by Chernoff bound +

union bound over Sℓ

25 / 27

slide-145
SLIDE 145

Analysis of the Algorithm

Recall: algorithm thresholds T := max_{v∈Sℓ} v⊤Yv

Analysis:

◮ Under P, Y = λxx⊤ + W , show T is large by considering a

‘good’ v (contained in x)

◮ Under Q, Y = W , show T is small by Chernoff bound +

union bound over Sℓ Theorem (Ding, Kunisky, W., Bandeira ’19): algorithm succeeds if ℓ ≫ k2/n

25 / 27

slide-146
SLIDE 146

Analysis of the Algorithm

Recall: algorithm thresholds T := max_{v∈Sℓ} v⊤Yv

Analysis:

◮ Under P, Y = λxx⊤ + W , show T is large by considering a

‘good’ v (contained in x)

◮ Under Q, Y = W , show T is small by Chernoff bound +

union bound over Sℓ Theorem (Ding, Kunisky, W., Bandeira ’19): algorithm succeeds if ℓ ≫ k2/n For any given k, choose ℓ ≈ k2/n, get runtime exp(k2/n)

25 / 27

slide-147
SLIDE 147

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

26 / 27

slide-148
SLIDE 148

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

  • 1. Compute initial guess: u = argmax_{v∈Sℓ} v⊤Yv

26 / 27

slide-149
SLIDE 149

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

  • 1. Compute initial guess: u = argmax_{v∈Sℓ} v⊤Yv

But u is too sparse...

26 / 27

slide-150
SLIDE 150

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

  • 1. Compute initial guess: u = argmax_{v∈Sℓ} v⊤Yv

But u is too sparse...

  • 2. Let w = Yu

26 / 27

slide-151
SLIDE 151

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

  • 1. Compute initial guess: u = argmax_{v∈Sℓ} v⊤Yv

But u is too sparse...

  • 2. Let w = Yu
  • 3. Construct x̂ ∈ {0, ±1/√k}^n by thresholding entries of w

26 / 27

slide-152
SLIDE 152

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

  • 1. Compute initial guess: u = argmax_{v∈Sℓ} v⊤Yv

But u is too sparse...

  • 2. Let w = Yu
  • 3. Construct x̂ ∈ {0, ±1/√k}^n by thresholding entries of w

Theorem (Ding, Kunisky, W., Bandeira ’19): x̂ = x with high probability, provided ℓ ≫ k2/n (same as detection)

26 / 27

slide-153
SLIDE 153

From Detection to Recovery

Algorithm for recovering x from Y = λxx⊤ + W :

  • 1. Compute initial guess: u = argmax_{v∈Sℓ} v⊤Yv

But u is too sparse...

  • 2. Let w = Yu
  • 3. Construct x̂ ∈ {0, ±1/√k}^n by thresholding entries of w

Theorem (Ding, Kunisky, W., Bandeira ’19): x̂ = x with high probability, provided ℓ ≫ k2/n (same as detection)

Technically, need independent copies of Y for steps 1 & 2

◮ Y + W ′ and Y − W ′ where W ′ is independent copy of W

26 / 27
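Steps 2-3 can be sketched directly; this simplified version (my own, omitting the independent-copies trick with Y + W′ and Y − W′ mentioned above) thresholds w = Yu to the k largest-magnitude coordinates:

```python
import numpy as np

def recover_from_initial_guess(Y, u, k):
    """Steps 2-3 from the slide: w = Y u, then keep the k largest-magnitude
    entries of w, setting them to +/- 1/sqrt(k) according to their signs.
    (The paper uses independent copies of Y for the two steps; omitted here.)"""
    w = Y @ u
    support = np.argsort(np.abs(w))[-k:]
    x_hat = np.zeros_like(w)
    x_hat[support] = np.sign(w[support]) / np.sqrt(k)
    return x_hat
```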

slide-154
SLIDE 154

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA

27 / 27

slide-155
SLIDE 155

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

27 / 27

slide-156
SLIDE 156

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n)

27 / 27

slide-157
SLIDE 157

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n) ◮ Extensions:

27 / 27

slide-158
SLIDE 158

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n) ◮ Extensions:

◮ Allow λ ≪ 1; runtime exp(k2/(λ2n)) 27 / 27

slide-159
SLIDE 159

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n) ◮ Extensions:

◮ Allow λ ≪ 1; runtime exp(k2/(λ2n)) ◮ Spiked Wishart model 27 / 27

slide-160
SLIDE 160

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n) ◮ Extensions:

◮ Allow λ ≪ 1; runtime exp(k2/(λ2n)) ◮ Spiked Wishart model ◮ More general assumptions on x 27 / 27

slide-161
SLIDE 161

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n) ◮ Extensions:

◮ Allow λ ≪ 1; runtime exp(k2/(λ2n)) ◮ Spiked Wishart model ◮ More general assumptions on x

◮ Optimal: for a given k, the low-degree likelihood ratio

suggests that no better runtime is possible

27 / 27

slide-162
SLIDE 162

Summary

◮ Continuum of subexponential-time algorithms for sparse PCA ◮ Smooth interpolation between diagonal thresholding and

exhaustive search

◮ Smooth tradeoff between sparsity and runtime: exp(k2/n) ◮ Extensions:

◮ Allow λ ≪ 1; runtime exp(k2/(λ2n)) ◮ Spiked Wishart model ◮ More general assumptions on x

◮ Optimal: for a given k, the low-degree likelihood ratio

suggests that no better runtime is possible Thanks!

27 / 27