

SLIDE 1

Linearized two-layers neural network in high dimensions

Song Mei

Stanford University

May 26, 2019

Joint work with Andrea Montanari, Theodor Misiakiewicz, Behrooz Ghorbani



SLIDE 3

Training a deep neural network: minimize the risk

min_Θ R(Θ), where R(Θ) = E[ℓ(y, W_1 σ(W_2 σ(· · · σ(W_k x))))].

Empirical surprises of neural networks [Zhang et al., 2016]:

◮ Over-parameterized regime.
◮ Optimization surprise: efficiently fit all the data.
◮ Generalization surprise: generalize well.


SLIDE 4

Two-layers neural network

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), Θ = (a_1, w_1, . . . , a_N, w_N).

◮ Feature x ∈ R^d.
◮ Bottom layer weights w_i ∈ R^d, i = 1, 2, . . . , N.
◮ Top layer weights a_i ∈ R, i = 1, 2, . . . , N.
◮ Over-parametrization: N large.
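For concreteness, a minimal NumPy sketch of this parameterization; σ = tanh and the initialization scales are illustrative assumptions here (the scalings that actually matter appear on the regime slides below).

```python
import numpy as np

def init_params(N, d, rng):
    """Theta = (a_1, w_1, ..., a_N, w_N); the scales here are illustrative."""
    a = rng.normal(0, 1 / N, size=N)                # top layer, a_i in R
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))  # bottom layer, w_i in R^d
    return a, W

def f_hat(X, a, W, sigma=np.tanh):
    """f_hat_N(x; Theta) = sum_i a_i sigma(<w_i, x>), one value per row of X."""
    return sigma(X @ W.T) @ a

rng = np.random.default_rng(0)
a, W = init_params(N=400, d=50, rng=rng)
X = rng.normal(size=(10, 50))
print(f_hat(X, a, W).shape)  # (10,)
```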



SLIDE 9

Two-layers neural network

[Figure: network diagram — input layer x, hidden layer with weights w_1, . . . , w_4, output layer with weights a_1, . . . , a_4.]


SLIDE 10

Gradient flow with random initialization

Empirical risk (n: # data; N: # neurons):

R_{n,N}(Θ) = Ê_{x,n}[(y − f̂_N(x; Θ))²].

Gradient flow on the empirical risk, with random initialization:

d/dt Θ(t) = −∇R_{n,N}(Θ(t)), (a_i(0), w_i(0)) ~_{i.i.d.} P_{a,w}.
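A discrete-time analogue of this gradient flow is plain gradient descent on R_{n,N}; the sketch below assumes σ = tanh, squared loss, and an illustrative step size and initialization scale.

```python
import numpy as np

def gd_two_layer(X, y, N=200, eta=0.1, steps=2000, seed=0):
    """Gradient descent on R_{n,N}(Theta) = mean_i (y_i - f_hat(x_i))^2,
    a discretization of the gradient flow dTheta/dt = -grad R_{n,N}."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(0, 1 / N, size=N)                # illustrative init scale
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))  # w_i(0) ~ N(0, I_d/d)
    for _ in range(steps):
        H = np.tanh(X @ W.T)        # sigma(<w_i, x>)
        r = H @ a - y               # residuals f_hat - y
        g = 2 * r / n               # dR/df_hat
        grad_a = H.T @ g
        grad_W = (np.outer(g, a) * (1 - H**2)).T @ X
        a -= eta * grad_a
        W -= eta * grad_W
    return a, W
```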



SLIDE 14

Convergence guarantees

Lemma (Global min. Not a surprise.)

For N > n, we have inf_Θ R_{n,N}(Θ) = 0.

There are many global minimizers with empirical risk 0. But there are also local minimizers with non-zero risk.

Theorem (The optimization surprise.)

For N ≳ n^{1+c}, we have

lim_{t→∞} R_{n,N}(Θ(t)) = 0,

i.e., the training loss converges to 0. Under what assumptions?


SLIDE 15

Three variants of the convergence theorem

Gradient flow (n: # data; N: # neurons):

d/dt Θ(t) = −∇ Ê_{x,n}[(y − f̂_N(x; Θ(t)))²], w_i(0) ~_{i.i.d.} N(0, I_d/d).

Theorem: for N large enough, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0.

Random feature (RF) regime

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i(0) ~_{i.i.d.} N(0, 1/N²).

[Andoni et al., 2014], [Daniely, 2017], [Yehudai and Shamir, 2019] ...


SLIDE 16

Three variants of the convergence theorem

Gradient flow (n: # data; N: # neurons):

d/dt Θ(t) = −∇ Ê_{x,n}[(y − f̂_N(x; Θ(t)))²], w_i(0) ~_{i.i.d.} N(0, I_d/d).

Theorem: for N large enough, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0.

Neural tangent (NT) regime

f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i(0) ~_{i.i.d.} N(0, 1).

[Jacot et al., 2018], [Du et al., 2018a], [Du et al., 2018b], [Allen-Zhu et al., 2018], [Zou et al., 2018] ...


SLIDE 17

Three variants of the convergence theorem

Gradient flow (n: # data; N: # neurons):

d/dt Θ(t) = −∇ Ê_{x,n}[(y − f̂_N(x; Θ(t)))²], w_i(0) ~_{i.i.d.} N(0, I_d/d).

Theorem: for N large enough, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0.

Mean field (MF) regime

f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i(0) ~_{i.i.d.} N(0, 1).

[Mei et al., 2018], [Rotskoff and Vanden-Eijnden, 2018], [Chizat and Bach, 2018] ...


SLIDE 18

Three variants of the convergence theorem

Random feature (RF) regime

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1/N²).

Neural tangent (NT) regime

f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).

Mean field (MF) regime

f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).
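The three regimes differ only in the output scaling and in the initialization of a. A compact sketch of the three parameterizations (σ = tanh is an illustrative choice):

```python
import numpy as np

def init(regime, N, d, rng):
    """Initialization of a and output scaling in the RF / NT / MF regimes."""
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))   # w_i(0) ~ N(0, I_d/d)
    if regime == "RF":                               # a_i ~ N(0, 1/N^2)
        return rng.normal(0, 1 / N, size=N), W, 1.0
    if regime == "NT":                               # a_i ~ N(0, 1), scale 1/sqrt(N)
        return rng.normal(0, 1, size=N), W, 1 / np.sqrt(N)
    if regime == "MF":                               # a_i ~ N(0, 1), scale 1/N
        return rng.normal(0, 1, size=N), W, 1 / N
    raise ValueError(regime)

def f_hat(X, a, W, scale, sigma=np.tanh):
    """f_hat_N(x; Theta) = scale * sum_i a_i sigma(<w_i, x>)."""
    return scale * (sigma(X @ W.T) @ a)
```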


SLIDE 19

... but different behavior of dynamics

Random feature (RF) regime

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1/N²).

◮ The limiting dynamics is linear (effectively only a is updated).
◮ Prediction function: kernel ridge regression with the kernel

k_RF(x, z) = Ê_{w,N}[σ(⟨w, x⟩) σ(⟨w, z⟩)].

[Andoni et al., 2014], [Daniely, 2017], [Yehudai and Shamir, 2019] ...
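A minimal sketch of RF prediction, assuming σ = tanh and interpreting Ê_{w,N} as the average over the N sampled weights:

```python
import numpy as np

def rf_kernel(X, Z, W, sigma=np.tanh):
    """Empirical RF kernel k_RF(x,z) = (1/N) sum_i sigma(<w_i,x>) sigma(<w_i,z>)."""
    return sigma(X @ W.T) @ sigma(Z @ W.T).T / W.shape[0]

def rf_krr_predict(Xtest, Xtrain, y, W, lam=1e-3):
    """Kernel ridge regression with the empirical RF kernel."""
    K = rf_kernel(Xtrain, Xtrain, W)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return rf_kernel(Xtest, Xtrain, W) @ alpha
```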


SLIDE 20

... but different behavior of dynamics

Neural tangent (NT) regime

f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).

◮ The limiting dynamics is linear (the change of Θ is small).
◮ Prediction function: kernel ridge regression with the kernel

k_NT(x, z) = Ê_{w,N}[σ′(⟨w, x⟩) σ′(⟨w, z⟩)] ⟨x, z⟩ + k_RF(x, z).

[Jacot et al., 2018], [Du et al., 2018a], [Du et al., 2018b], [Allen-Zhu et al., 2018], [Zou et al., 2018] ...
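The same sketch for the NT kernel, again with σ = tanh (so σ′(t) = 1 − tanh²(t)) and averaging over the N sampled weights:

```python
import numpy as np

def nt_kernel(X, Z, W):
    """Empirical NT kernel:
    k_NT(x,z) = (1/N) sum_i sigma'(<w_i,x>) sigma'(<w_i,z>) <x,z> + k_RF(x,z),
    with sigma = tanh, so sigma'(t) = 1 - tanh(t)^2."""
    N = W.shape[0]
    Sx, Sz = np.tanh(X @ W.T), np.tanh(Z @ W.T)
    dSx, dSz = 1 - Sx**2, 1 - Sz**2
    return (dSx @ dSz.T / N) * (X @ Z.T) + Sx @ Sz.T / N
```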


SLIDE 21

... but different behavior of dynamics

Mean field (MF) regime

f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).

◮ The limiting dynamics is non-linear (both a and W are updated).
◮ Distributional dynamics:

∂_t ρ_t(a, w) = ∇ · (ρ_t ∇Ψ(a, w; ρ_t)) + β^{−1} Δρ_t.

◮ Prediction function:

f̂(x; ρ_∞) = ∫ a σ(⟨w, x⟩) ρ_∞(da, dw).

[Mei et al., 2018], [Rotskoff and Vanden-Eijnden, 2018], [Chizat and Bach, 2018], [Sirignano and Spiliopoulos, 2018] ...
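In this regime the N neurons behave as interacting particles, and noisy gradient descent on the MF-scaled risk is the standard particle approximation of the distributional dynamics. A rough sketch (the step size, temperature β, and scalings are illustrative assumptions):

```python
import numpy as np

def mf_noisy_gd(X, y, N=500, eta=0.05, beta=1e4, steps=2000, seed=0):
    """Noisy gradient descent on the MF-scaled risk: the empirical measure of
    the particles (a_i, w_i) approximates the distributional dynamics
    d/dt rho_t = div(rho_t grad Psi) + (1/beta) Laplacian(rho_t)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(0, 1, size=N)
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))
    for _ in range(steps):
        H = np.tanh(X @ W.T)
        g = 2 * (H @ a / N - y) / n            # dR/df_hat, with MF scaling 1/N
        grad_a = H.T @ g                       # = N * dR/da_i (O(1) per particle)
        grad_W = (np.outer(g, a) * (1 - H**2)).T @ X
        a += -eta * grad_a + np.sqrt(2 * eta / beta) * rng.normal(size=N)
        W += -eta * grad_W + np.sqrt(2 * eta / beta) * rng.normal(size=(N, d))
    return a, W
```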


SLIDE 22

Optimization: 0 training loss.

Test risk = training loss + generalization gap. Today: generalization.



SLIDE 24

Generalization theory for kernel methods

◮ Traditional theory: assume f* ∈ RKHS; then kernel ridge regression generalizes well.
◮ Problem: in high dimensions, the RKHS is a very small space.

Today: in high dimensions, kernel methods (RF and NT) don't generalize well.



SLIDE 27

Setting 1: N finite, n infinite

Distribution: x ~ Unif(S^{d−1}(√d)), y = f*(x), f* ∈ L²(S^{d−1}(√d)).

Two classes of linearized neural networks (w_i ~ Unif(S^{d−1})):

F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] },

F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨a_i, x⟩ : a_i ∈ R^d, i ∈ [N] }.

Mild assumptions on σ (universal approximation, growth not too fast).
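Both classes are linear in their coefficients, so the best approximation within each class reduces to least squares over explicit feature maps. A sketch, assuming σ = tanh:

```python
import numpy as np

def rf_features(X, W, sigma=np.tanh):
    """Feature map for F_RF: f(x) = sum_i a_i sigma(<w_i, x>), a_i in R."""
    return sigma(X @ W.T)                                  # shape (n, N)

def nt_features(X, W):
    """Feature map for F_NT: f(x) = sum_i sigma'(<w_i, x>) <a_i, x>, a_i in R^d."""
    dS = 1 - np.tanh(X @ W.T) ** 2                         # sigma'(<w_i, x>)
    return (dS[:, :, None] * X[:, None, :]).reshape(len(X), -1)  # (n, N*d)

def best_fit(features, y):
    """Least-squares coefficients: each class is linear in its coefficients,
    so the optimal element on a dataset is an ordinary least-squares problem."""
    return np.linalg.lstsq(features, y, rcond=None)[0]
```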


SLIDE 28

Lower bound: N finite, n infinite

F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] }.

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume N = O_d(d^{ℓ−δ}) for some δ > 0, and (w_i)_{i∈[N]} ~ Unif(S^{d−1}). Then

inf_{f ∈ F_{RF,N}(W)} E_x[(f*(x) − f(x))²] ≥ ‖P_{≥ℓ} f*‖²_{L²} − o_{d,P}(‖f*‖²_{L²}),

where P_{≥ℓ} is the projection operator orthogonal to the space of polynomials of degree < ℓ.

Example: for f*(x) = x₁² − 1, we have P_{≥2} f* ≈ f*. Then random feature regression with N = O_d(d^{2−δ}) neurons achieves the trivial risk, which is ‖f*‖²_{L²}.
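A small numerical experiment in the spirit of the figure on the next slide: RF ridge regression for f*(x) = x₁² − 1. The activation (ReLU, which has a nonzero degree-2 component), ridge parameter, and sample sizes are illustrative choices:

```python
import numpy as np

def sphere(n, d, rng):
    """n i.i.d. samples from Unif(S^{d-1}(sqrt(d)))."""
    X = rng.normal(size=(n, d))
    return X * (np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True))

def rf_test_risk(d=50, N=400, n=5000, lam=1e-6, seed=0):
    """RF ridge regression for f*(x) = x_1^2 - 1. Returns test risk / ||f*||^2;
    values near 1 mean trivial risk, as predicted when N = O_d(d^{2-delta})."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(N, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)      # w_i ~ Unif(S^{d-1})
    f_star = lambda X: X[:, 0] ** 2 - 1
    Xtr, Xte = sphere(n, d, rng), sphere(2000, d, rng)
    Phi = np.maximum(Xtr @ W.T, 0)                     # ReLU random features
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ f_star(Xtr))
    err = f_star(Xte) - np.maximum(Xte @ W.T, 0) @ a
    return np.mean(err ** 2) / np.mean(f_star(Xte) ** 2)

print(rf_test_risk())  # close to 1: N = 400 << d^2 features cannot fit f*
```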



SLIDE 30

[Figure: test risk R/R₀ versus n/d for learning f*(x) = x₁² − 1; left panel d = 50 with N ∈ {100, 200, 300, 400}, right panel d = 100 with N ∈ {200, 400, 600, 800}.]


SLIDE 31

Similar result for NT

F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨a_i, x⟩ : a_i ∈ R^d, i ∈ [N] }.

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume N = O_d(d^{ℓ−δ}) for some δ > 0, and (w_i)_{i∈[N]} ~ Unif(S^{d−1}). Then

inf_{f ∈ F_{NT,N}(W)} E_x[(f*(x) − f(x))²] ≥ ‖P_{≥ℓ+1} f*‖²_{L²} − o_{d,P}(‖f*‖²_{L²}),

where P_{≥ℓ+1} is the projection operator orthogonal to the space of polynomials of degree < ℓ + 1.

Example: for f*(x) = x₁³ − 3x₁, we have P_{≥3} f* ≈ f*. Then neural tangent regression with N = O_d(d^{2−δ}) neurons achieves the trivial risk, which is ‖f*‖²_{L²}.



SLIDE 33

Setting 2: N infinite, n finite

Distribution: x_i ~ Unif(S^{d−1}(√d)), y_i = f*(x_i), f* ∈ L²(S^{d−1}(√d)).

Prediction via kernel ridge regression:

f̂_λ(x) = k(x, X) (k(X, X) + λI)^{−1} y, y = (f*(x_1), . . . , f*(x_n)),

where k(x_i, x_j) = E_{w ~ Unif(S^{d−1})}[σ(⟨w, x_i⟩) σ(⟨w, x_j⟩)].
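A minimal sketch of this predictor, with the kernel estimated by Monte Carlo over w (for many activations the expectation also has a closed form in ⟨x, z⟩, which this sketch does not use):

```python
import numpy as np

def mc_kernel(X, Z, n_mc=20_000, seed=0, sigma=np.tanh):
    """Monte Carlo estimate of k(x,z) = E_{w ~ Unif(S^{d-1})}[sigma(<w,x>) sigma(<w,z>)]."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_mc, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    return sigma(X @ W.T) @ sigma(Z @ W.T).T / n_mc

def krr_predict(Xtest, Xtrain, y, lam=1e-3):
    """Kernel ridge regression: f_hat(x) = k(x, X)(k(X, X) + lam*I)^{-1} y."""
    K = mc_kernel(Xtrain, Xtrain)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return mc_kernel(Xtest, Xtrain) @ alpha
```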


SLIDE 34

Lower bound: N infinite, n finite

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume n = O_d(d^{ℓ−δ}) for some δ > 0. Then

inf_λ E_x[(f*(x) − f̂_λ(x))²] ≥ ‖P_{≥ℓ} f*‖²_{L²} − o_{d,P}(‖f*‖²_{L²}),

where P_{≥ℓ} is the projection operator orthogonal to the space of polynomials of degree < ℓ.


SLIDE 35

Intuition behind these results

In high dimensions, the correlation between a degree-k Hermite polynomial and a random feature is very small:

E_w[ E_x[He_k(x₁) σ(⟨w, x⟩)]² ] = O_d(1/d^k).

Also observed in [Daniely, 2016], [Bach, 2017].
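A quick Monte Carlo check of this decay; σ = ReLU, the dimensions, and the sample sizes are illustrative choices, and the estimate of the squared correlation is noisy:

```python
import numpy as np

def hermite(k, t):
    """Probabilist's Hermite polynomials He_k, k = 1, 2, 3."""
    return {1: t, 2: t**2 - 1, 3: t**3 - 3 * t}[k]

def sq_correlation(d, k=2, n_x=100_000, n_w=100, seed=0):
    """Monte Carlo estimate of E_w[ E_x[He_k(x_1) sigma(<w, x>)]^2 ] with
    sigma = ReLU, x ~ Unif(S^{d-1}(sqrt(d))), w ~ Unif(S^{d-1})."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_x, d))
    X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
    W = rng.normal(size=(n_w, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    corr = hermite(k, X[:, 0]) @ np.maximum(X @ W.T, 0) / n_x  # E_x[...] per w
    return np.mean(corr ** 2)

for d in (10, 20, 40):
    print(d, sq_correlation(d))  # shrinks roughly like 1/d^2 for k = 2
```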


SLIDE 36

Implications & Conclusions

◮ In high dimension, even for simple function ❢✭x✮ ❂ ①❦ ✶, it takes

♥❀ ◆ ❂ ❖❞✭❞❦✮ to learn it well using linearized neural network (kernel methods);

◮ ... while a neural network can learn it (conjecture to be efficiently)

using ♥❀ ◆ ❂ ❖❞✭✶✮.

◮ Neural network is more powerful than kernel methods. ◮ Future work: what class of functions neural network can learn

efficiently.

