
The Power of Low-Degree Polynomials for Solving Statistical Problems

Alex Wein, Courant Institute, New York University

Based on joint work with: David Gamarnik (MIT), Aukosh Jagannath (Waterloo), Tselil Schramm (Stanford)

1 / 10

Problems in High-Dimensional Statistics

Example: finding a large clique in a random graph

◮ Detection: distinguish between a random graph and a graph with a planted clique
◮ Recovery: given a graph with a planted clique, find the clique
◮ Optimization: given a random graph (with no planted clique), find as large a clique as possible

It is common to have information-computation gaps, e.g. for planted k-clique (either detection or recovery). What makes problems easy vs. hard?

2 / 10
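To make the three tasks concrete, here is a minimal Python sketch (my illustration, not from the talk) of sampling the null and planted models for planted k-clique; the function names and parameters are hypothetical.

```python
# Hedged sketch: sampling the two models in the planted k-clique problem.
import numpy as np

rng = np.random.default_rng(0)

def sample_null(n):
    """Adjacency matrix of G(n, 1/2): each edge present independently w.p. 1/2."""
    A = np.triu(rng.integers(0, 2, size=(n, n)), 1)  # strict upper triangle
    return A + A.T                                   # symmetric, zero diagonal

def sample_planted(n, k):
    """G(n, 1/2) with a clique planted on k uniformly random vertices."""
    A = sample_null(n)
    S = rng.choice(n, size=k, replace=False)
    A[np.ix_(S, S)] = 1                              # force all edges inside S
    np.fill_diagonal(A, 0)
    return A, S

# Detection: given A alone, decide which sampler produced it.
# Recovery: given A from sample_planted, estimate the hidden set S.
A0 = sample_null(200)
A1, S = sample_planted(200, 20)
```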

The Low-Degree Polynomial Method

A framework for understanding computational complexity. Originated in the sum-of-squares literature (for detection) [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin '16; Hopkins, Steurer '17; Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer '17; Hopkins '18 (PhD thesis)].

Study a restricted class of algorithms: low-degree polynomials.

◮ Multivariate polynomial f : R^N → R^M
◮ "Low" means degree O(log n), where n is the dimension

Some low-degree algorithms:

◮ Spectral methods (power iteration)
◮ Approximate message passing (AMP) [DMM09]

For many problems, low-degree algorithms are as powerful as the best known polynomial-time algorithms: planted clique, sparse PCA, community detection, tensor PCA, constraint satisfaction, spiked matrix [BHKKMP16, HS17, HKPRSS17, Hop18, BKW19, KWB19, DKWB19].

3 / 10
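As a concrete instance of a low-degree algorithm (my example under the setup above, not one the slides single out): the signed triangle count is a degree-3 polynomial of the centered edge variables and is a classic test statistic for planted clique detection.

```python
# Hedged sketch: a degree-3 polynomial of the input, the signed triangle count
#   sum_{i<j<k} Y_ij * Y_jk * Y_ik   with   Y_ij = A_ij - 1/2.
# Under G(n, 1/2) it has mean 0; a planted clique inflates it.
import numpy as np

def signed_triangle_count(A):
    Y = A - 0.5
    np.fill_diagonal(Y, 0.0)
    # For symmetric Y with zero diagonal, trace(Y^3) counts each unordered
    # triangle 6 times, so divide by 6.
    return np.trace(Y @ Y @ Y) / 6.0
```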

Overview

This talk: techniques for proving that all low-degree polynomials fail. Such a proof constitutes evidence for computational hardness.

Settings:

◮ Detection (prior work)
◮ Recovery: Schramm, W. "Computational Barriers to Estimation from Low-Degree Polynomials", arXiv, 2020
◮ Optimization: Gamarnik, Jagannath, W. "Low-Degree Hardness of Random Optimization Problems", FOCS 2020

4 / 10

Detection (e.g. [Hopkins, Steurer ’17])

Goal: hypothesis test with error probability o(1) between:

◮ Null model Y ∼ Q_n, e.g. G(n, 1/2)
◮ Planted model Y ∼ P_n, e.g. G(n, 1/2) ∪ {random k-clique}

Look for a degree-D (multivariate) polynomial f : R^{n×n} → R that distinguishes P from Q, in the sense that f(Y) is "big" when Y ∼ P and "small" when Y ∼ Q.

Compute the ratio of the mean in P to the fluctuations in Q:

$$\max_{f \,\text{deg}\, D} \frac{\mathbb{E}_{Y \sim P}[f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}} \quad \begin{cases} = \omega(1): & \text{degree-}D \text{ polynomials succeed} \\ = O(1): & \text{degree-}D \text{ polynomials fail} \end{cases}$$

5 / 10
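For intuition, here is a hedged Monte Carlo sketch of the ratio above for one fixed (non-optimized) degree-1 polynomial, the centered edge count; it reuses the hypothetical samplers from the earlier sketch, and since it does not maximize over f it only lower-bounds the benchmark.

```python
# Hedged sketch: empirical E_P[f] / sqrt(E_Q[f^2]) for f = centered edge count
# (degree 1). The true benchmark maximizes over all degree-D f; this is only
# one candidate polynomial.
import numpy as np

def f_edges(A):
    iu = np.triu_indices_from(A, k=1)
    return np.sum(A[iu] - 0.5)

n, k, trials = 100, 20, 200
fP = [f_edges(sample_planted(n, k)[0]) for _ in range(trials)]
fQ = [f_edges(sample_null(n)) for _ in range(trials)]
print(np.mean(fP) / np.sqrt(np.mean(np.square(fQ))))  # grows with clique size k
```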

Detection (e.g. [Hopkins, Steurer ’17])

Define the inner product $\langle f, g \rangle = \mathbb{E}_{Y \sim Q}[f(Y)g(Y)]$, the norm $\|f\| = \sqrt{\langle f, f \rangle}$, and the likelihood ratio $L(Y) = \frac{dP}{dQ}(Y)$. Then

$$\max_{f \,\text{deg}\, D} \frac{\mathbb{E}_{Y \sim P}[f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}} = \max_{f \,\text{deg}\, D} \frac{\mathbb{E}_{Y \sim Q}[L(Y) f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}} = \max_{f \,\text{deg}\, D} \frac{\langle L, f \rangle}{\|f\|} = \|L^{\le D}\|.$$

Maximizer: f = L^{≤D} := projection of L onto the degree-D subspace; the optimal value is the norm of the low-degree likelihood ratio.

To project: expand L in orthogonal polynomials w.r.t. Q.

◮ Works if Q has independent entries

6 / 10
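Spelling out the step the slide leaves implicit (a standard computation in this literature, stated here under the assumption that {φ_α} is an orthonormal polynomial basis w.r.t. Q): since ⟨L, φ_α⟩ = E_{Y∼Q}[L(Y) φ_α(Y)] = E_{Y∼P}[φ_α(Y)], the norm of the projection has the explicit form

$$\|L^{\le D}\|^2 = \sum_{\deg \varphi_\alpha \le D} \langle L, \varphi_\alpha \rangle^2 = \sum_{\deg \varphi_\alpha \le D} \Big( \mathbb{E}_{Y \sim P}\big[\varphi_\alpha(Y)\big] \Big)^2,$$

which is why independent entries under Q (and hence a product basis of orthogonal polynomials) make the computation tractable.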

Recovery [Schramm, W. ’20]

Example (planted submatrix): observe an n × n matrix Y = X + Z.

◮ Signal: X = λ v v^⊤, where λ > 0 and v_i ∼ Bernoulli(ρ) i.i.d.
◮ Noise: Z has i.i.d. N(0, 1) entries

Goal: given Y, estimate v_1 via a polynomial f : R^{n×n} → R.

Low-degree minimum mean squared error:

$$\mathrm{MMSE}_{\le D} = \min_{f \,\text{deg}\, D} \mathbb{E}\big[(f(Y) - v_1)^2\big]$$

Equivalent to low-degree maximum correlation:

$$\mathrm{Corr}_{\le D} = \max_{f \,\text{deg}\, D} \frac{\mathbb{E}[f(Y) \cdot v_1]}{\sqrt{\mathbb{E}[f(Y)^2]}}$$

Fact: $\mathrm{MMSE}_{\le D} = \mathbb{E}[v_1^2] - \mathrm{Corr}_{\le D}^2$.

7 / 10
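A minimal Python sketch of this model (my illustration; the naive degree-1 estimator below is for orientation only, not the optimal polynomial):

```python
# Hedged sketch: the planted submatrix observation Y = lam * v v^T + Z,
# plus a crude degree-1 guess for v_1 from the first row sum.
import numpy as np

rng = np.random.default_rng(1)

def sample_planted_submatrix(n, lam, rho):
    v = (rng.random(n) < rho).astype(float)   # v_i ~ Bernoulli(rho), i.i.d.
    Z = rng.standard_normal((n, n))           # noise: i.i.d. N(0, 1)
    return lam * np.outer(v, v) + Z, v

n, lam, rho = 500, 1.0, 0.1
Y, v = sample_planted_submatrix(n, lam, rho)

# E[sum_j Y_1j] = lam * v_1 * sum_j v_j, roughly lam * v_1 * rho * n, so rescale:
v1_hat = Y[0].sum() / (lam * rho * n)
print(v1_hat, v[0])
```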

Recovery [Schramm, W. ’20]

For hardness, we want an upper bound on

$$\mathrm{Corr}_{\le D} = \max_{f \,\text{deg}\, D} \frac{\mathbb{E}[f(Y) \cdot v_1]}{\sqrt{\mathbb{E}[f(Y)^2]}}.$$

Same proof as detection? Issue: we would need orthogonal polynomials for the planted distribution.

Trick: bound the denominator via Jensen's inequality:

$$\mathbb{E}[f(Y)^2] = \mathbb{E}_Z \mathbb{E}_X\big[f(X+Z)^2\big] \ge \mathbb{E}_Z \big(\mathbb{E}_X f(X+Z)\big)^2$$

◮ This simplifies the expression enough to find a closed form
◮ Yields tight bounds for the planted submatrix problem

8 / 10
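To spell out why the trick helps (my paraphrase of the step, in the notation of the slide): writing g(Z) := E_X f(X + Z), Jensen's inequality yields

$$\mathrm{Corr}_{\le D} \le \max_{f \,\text{deg}\, D} \frac{\mathbb{E}[f(Y) \cdot v_1]}{\sqrt{\mathbb{E}_Z\big[g(Z)^2\big]}},$$

and g is a polynomial of degree at most D in Z alone. Since Z has i.i.d. Gaussian entries, the denominator can now be handled with Hermite (orthogonal) polynomials, sidestepping the need for orthogonal polynomials under the planted distribution.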

Optimization [Gamarnik, Jagannath, W. ’20]

Example (spherical spin glass): for a tensor Y ∈ R^{n×n×n} with i.i.d. N(0, 1) entries, find a unit vector v maximizing

$$H(v) = \frac{1}{\sqrt{n}} \langle Y, v^{\otimes 3} \rangle.$$

Optimum value: $\mathrm{OPT} = \max_{\|v\| = 1} H(v) = \Theta(1)$.

Our result: no constant-degree polynomial can achieve value OPT − ε.

◮ Best known algorithms are constant-degree [Sub18, Mon18, EMS20]
◮ Proof ingredients: low-degree polynomials are stable; the overlap gap property [GS13, CGPR17, GJ19]
◮ Open: show that no low-degree polynomial can achieve the precise objective value achieved by [Sub18]

Theorem (GJW'20). For some ε > 0, no degree-O(1) polynomial f : R^{n×n×n} → R^n achieves both of the following with probability 1 − o(1):

◮ Objective: H(f(Y)) ≥ OPT − ε
◮ Normalization: ‖f(Y)‖ ≈ 1

9 / 10
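For concreteness, a hedged Python sketch of evaluating the objective H(v) on a random tensor (my illustration of the setup; it does not attempt the optimization itself):

```python
# Hedged sketch: H(v) = <Y, v (x) v (x) v> / sqrt(n) for Y with i.i.d. N(0,1) entries.
import numpy as np

rng = np.random.default_rng(2)
n = 30
Y = rng.standard_normal((n, n, n))

def H(Y, v):
    v = v / np.linalg.norm(v)                 # enforce the unit-norm constraint
    n = v.shape[0]
    return np.einsum('ijk,i,j,k->', Y, v, v, v) / np.sqrt(n)

# A random unit vector typically achieves only H = O(n^{-1/2});
# OPT = Theta(1), and approaching it is where the hardness lives.
print(H(Y, rng.standard_normal(n)))
```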



References

◮ Detection (survey article): Kunisky, W., Bandeira. "Notes on Computational Hardness of Hypothesis Testing: Predictions using the Low-Degree Likelihood Ratio", arXiv:1907.11636
◮ Recovery: Schramm, W. "Computational Barriers to Estimation from Low-Degree Polynomials", arXiv:2008.02269
◮ Optimization: Gamarnik, Jagannath, W. "Low-Degree Hardness of Random Optimization Problems", arXiv:2004.12063

10 / 10