SLIDE 1

Generalization of linearized neural networks: staircase decay and double descent

Song Mei

UC Berkeley

July 23, 2020

Department of Mathematics, HKUST

SLIDES 2-3

Deep Learning Revolution

Deep learning now powers gaming, healthcare, autonomous vehicles, finance, machine translation, robotics, and communication.

"ACM named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM A.M. Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing."

SLIDES 4-5

But theoretically? WHEN and WHY does deep learning work?

SLIDES 6-7

Call for theoretical understanding

From "alchemy" to science: mathematical theories, physical laws, reproducible experiments.

SLIDES 8-10

What don't we understand?

Empirical surprises [Zhang et al., 2016]:
- Over-parameterization: # parameters ≫ # training samples.
- Non-convexity.
- Yet SGD efficiently fits all the training samples.
- And the fitted models generalize well on test samples.

Mathematical challenges:
- Non-convexity: why is optimization efficient?
- Over-parameterization: why is generalization effective?

SLIDE 11

A gentle introduction to the linearization theory of neural networks

SLIDES 12-14

Linearized neural networks (neural tangent model)

- Multi-layer neural network $f(x, \theta)$, with $x \in \mathbb{R}^d$, $\theta \in \mathbb{R}^N$:
  $$f(x, \theta) = W_L \sigma(\cdots W_2 \sigma(W_1 x)).$$
- Linearization around a (random) initialization $\theta_0$:
  $$f(x, \theta) = f(x, \theta_0) + \langle \theta - \theta_0, \nabla_\theta f(x, \theta_0) \rangle + o(\|\theta - \theta_0\|_2).$$
- Neural tangent model: the linear part of $f$,
  $$f_{NT}(x; \beta, \theta_0) = \langle \beta, \nabla_\theta f(x, \theta_0) \rangle.$$

[Jacot, Gabriel, Hongler, 2018] [Chizat, Bach, 2018b]
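To make the NT feature map concrete, here is a minimal NumPy sketch (my own illustration, not code from the talk) for a two-layer ReLU network, where the gradient $\nabla_\theta f(x, \theta_0)$ has a closed form:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 20, 100                                  # input dimension, # neurons

# Two-layer network f(x, theta) = sum_i a_i relu(<w_i, x>), theta = (a, W)
W0 = rng.standard_normal((N, d)) / np.sqrt(d)   # random bottom-layer init
a0 = rng.standard_normal(N) / np.sqrt(N)        # random top-layer init

def nt_features(x):
    """NT features: the gradient of f w.r.t. theta = (a, W) at theta_0.
    df/da_i = relu(<w_i, x>);  df/dw_i = a_i * 1{<w_i, x> > 0} * x."""
    z = W0 @ x
    grad_a = np.maximum(z, 0.0)                    # shape (N,)
    grad_W = (a0 * (z > 0))[:, None] * x[None, :]  # shape (N, d)
    return np.concatenate([grad_a, grad_W.ravel()])

# NT model: f_NT(x; beta) = <beta, grad_theta f(x, theta_0)>
x = rng.standard_normal(d)
beta = 1e-3 * rng.standard_normal(N + N * d)
print(beta @ nt_features(x))
```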

SLIDES 15-19

Linear regression over random features

- NT model: the linear part of $f$,
  $$f_{NT}(x; \beta, \theta_0) = \langle \beta, \phi(x) \rangle = \langle \beta, \nabla_\theta f(x, \theta_0) \rangle.$$
- (Random) feature map: $\phi(\cdot) = \nabla_\theta f(\cdot, \theta_0) : \mathbb{R}^d \to \mathbb{R}^N$.
- Training data: $(X, Y) = (x_i, y_i)_{i \in [n]}$.
- Gradient flow dynamics:
  $$\frac{d}{dt}\beta_t = -\nabla_\beta \hat{\mathbb{E}}\big[(y - f_{NT}(x; \beta_t, \theta_0))^2\big], \qquad \beta_0 = 0.$$
- Linear convergence: $\beta_t \to \hat{\beta} = \phi(X)^\dagger Y$ (the minimum-$\ell_2$-norm least-squares solution, since $\beta_0 = 0$).
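A quick numerical check of the last point (my own sketch, with made-up sizes): gradient descent on the square loss, started from zero, converges to the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 50, 200                            # n samples, N random features
Phi = rng.standard_normal((n, N))         # stands in for phi(X)
Y = rng.standard_normal(n)

beta = np.zeros(N)                        # beta_0 = 0
lr = 0.5 / np.linalg.norm(Phi, 2) ** 2    # conservative step size
for _ in range(20000):
    beta -= lr * (2 / n) * Phi.T @ (Phi @ beta - Y)

beta_pinv = np.linalg.pinv(Phi) @ Y       # phi(X)^+ Y
print(np.linalg.norm(beta - beta_pinv))   # ~0 up to numerical error
```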

SLIDE 20

Neural network ≈ neural tangent

Theorem [Jacot, Gabriel, Hongler, 2018] (informal)

Consider neural networks $f_N(x, \theta)$ with $N$ neurons, and the two gradient flows
$$\frac{d}{dt}\theta_t = -\nabla_\theta \hat{\mathbb{E}}\big[(y - f_N(x, \theta_t))^2\big], \qquad \theta_0 = \theta^0,$$
$$\frac{d}{dt}\beta_t = -\nabla_\beta \hat{\mathbb{E}}\big[(y - f^N_{NT}(x; \beta_t, \theta^0))^2\big], \qquad \beta_0 = 0.$$
Under proper (random) initialization, we have almost surely
$$\lim_{N \to \infty} \big| f_N(x, \theta_t) - f^N_{NT}(x; \beta_t) \big| = 0.$$

SLIDES 21-23

Optimization success

Gradient flow on the training loss of a neural network converges to a global minimum, provided the network is over-parameterized and properly initialized.

[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Du, Lee, Li, Wang, Zhai, 2018], [Allen-Zhu, Li, Song, 2018], [Zou, Cao, Zhou, Gu, 2018], [Oymak, Soltanolkotabi, 2018], [Chizat, Bach, 2018b], ...

Does linearization fully explain the success of neural networks? Our answer is No.

SLIDE 24

Generalization

Empirically, the generalization of NT models is not as good as that of neural networks.

Table: CIFAR-10 experiments

    Architecture                  Classification error
    CNN                           4%
    CNTK (1)                      23%
    CNTK (2)                      11%
    Compositional kernel (3)      10%

(1) [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019]; (2) [Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora, 2019]; (3) [Shankar, Fang, Guo, Fridovich-Keil, Schmidt, Ragan-Kelley, Recht, 2020].

SLIDE 25

Performance gap: NN versus NT

SLIDE 26

Two-layer neural network

$$f_N(x, \Theta) = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle), \qquad \Theta = (a_1, w_1, \ldots, a_N, w_N).$$

- Input vector $x \in \mathbb{R}^d$.
- Bottom-layer weights $w_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, N$.
- Top-layer weights $a_i \in \mathbb{R}$, $i = 1, 2, \ldots, N$.

SLIDE 27

Linearization around initialization

$$f_N(x, \Theta) = f_N(x, \Theta_0) + \underbrace{\sum_{i=1}^{N} \Delta a_i\, \sigma(\langle w_i^0, x \rangle)}_{\text{top-layer linearization}} + \underbrace{\sum_{i=1}^{N} a_i^0\, \sigma'(\langle w_i^0, x \rangle) \langle \Delta w_i, x \rangle}_{\text{bottom-layer linearization}} + o(\Delta).$$

Linearized neural networks, with $w_i \sim \mathrm{Unif}(\mathbb{S}^{d-1})$ random and fixed, and $a_i$ (resp. $b_i$) the parameters to be optimized:
$$\mathcal{F}_{RF,N}(W) = \Big\{ f = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle) : a_i \in \mathbb{R},\ i \in [N] \Big\},$$
$$\mathcal{F}_{NT,N}(W) = \Big\{ f = \sum_{i=1}^{N} \sigma'(\langle w_i, x \rangle) \langle b_i, x \rangle : b_i \in \mathbb{R}^d,\ i \in [N] \Big\}.$$

[Rahimi, Recht, 2008] [Jacot, Gabriel, Hongler, 2018]
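Both classes are linear models over fixed random features; a small sketch (mine, with hypothetical sizes) makes the parameter counts explicit: RF has $N$ trainable weights, NT has $N \cdot d$.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, n = 10, 30, 200
W = rng.standard_normal((N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)    # w_i ~ Unif(S^{d-1}), fixed
X = rng.standard_normal((n, d))

# RF features: sigma(<w_i, x>)  ->  N parameters (a_i in R)
Phi_RF = np.maximum(X @ W.T, 0.0)                # shape (n, N), sigma = relu

# NT features: sigma'(<w_i, x>) x  ->  N*d parameters (b_i in R^d)
act = (X @ W.T > 0).astype(float)                # sigma'(<w_i, x>) for relu
Phi_NT = (act[:, :, None] * X[:, None, :]).reshape(n, N * d)

print(Phi_RF.shape, Phi_NT.shape)                # (200, 30) vs (200, 300)
```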

SLIDE 28

Approximation error

Data distribution: $x \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, target $f_\star \in L^2(\mathbb{S}^{d-1}(\sqrt{d}))$.

Minimum risk (approximation error):
$$R_{M,N}(f_\star) = \inf_{f \in \mathcal{F}_{M,N}(W)} \mathbb{E}_x\big[(f_\star(x) - f(x))^2\big], \qquad M \in \{RF, NT\}.$$

SLIDE 29

Staircase decay

SLIDES 30-31

Random features regression

$$\mathcal{F}_{RF,N}(W) = \Big\{ f = \sum_{i=1}^{N} a_i \sigma(\langle w_i, x \rangle) : a_i \in \mathbb{R},\ i \in [N] \Big\}, \qquad W = (w_i)_{i \in [N]} \sim_{i.i.d.} \mathrm{Unif}(\mathbb{S}^{d-1}).$$

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume $d^{\ell + \delta} \le N \le d^{\ell + 1 - \delta}$ and that $\sigma$ satisfies a "generic condition". Then
$$\inf_{f \in \mathcal{F}_{RF,N}(W)} \mathbb{E}_x\big[(f_\star(x) - f(x))^2\big] = \|P_{>\ell} f_\star\|_{L^2}^2 + o_{d,\mathbb{P}}\big(\|f_\star\|_{L^2}^2\big).$$

$P_{>\ell}$: the projection orthogonal to the space of degree-$\ell$ polynomials. With $d^{\ell}$ parameters, RF only fits a degree-$\ell$ polynomial.

SLIDES 32-33

Similar result for NT

$$\mathcal{F}_{NT,N}(W) = \Big\{ f = \sum_{i=1}^{N} \sigma'(\langle w_i, x \rangle) \langle b_i, x \rangle : b_i \in \mathbb{R}^d,\ i \in [N] \Big\}, \qquad W = (w_i)_{i \in [N]} \sim_{i.i.d.} \mathrm{Unif}(\mathbb{S}^{d-1}).$$

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume $d^{\ell + \delta} \le N \le d^{\ell + 1 - \delta}$ and that $\sigma$ satisfies a "generic condition". Then
$$\inf_{f \in \mathcal{F}_{NT,N}(W)} \mathbb{E}_x\big[(f_\star(x) - f(x))^2\big] = \|P_{>\ell+1} f_\star\|_{L^2}^2 + o_{d,\mathbb{P}}\big(\|f_\star\|_{L^2}^2\big).$$

$P_{>\ell+1}$: the projection orthogonal to the space of degree-$(\ell+1)$ polynomials. With $d^{\ell+1}$ parameters, NT only fits a degree-$(\ell+1)$ polynomial.

SLIDE 34

The staircase decay (a cartoon)

$$f = P_0 f + P_1 f + P_2 f + P_3 f + \cdots$$

[Figure: cartoon of the staircase decay of the approximation error.]

SLIDE 35

Approximation gap

Consider $f : \mathbb{S}^{d-1} \to \mathbb{R}$, $f(x) = Q_k(x_1)$, with $Q_k$ a degree-$k$ polynomial. (The counting is spelled out after this list.)
- NT: $N = \Theta_d(d^{k-1})$.
- NN: $N = \Theta_d(1)$.
- A separation of approximation power.
- Neural networks can potentially learn features adaptively.
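The parameter counting behind the first two items, reconstructed from the theorems above (the NN step assumes a generic activation):

```latex
% NT with N neurons trains the b_i in R^d, i.e. N*d parameters. By the NT
% theorem, capturing the degree-k component of f_star requires roughly
%   N = Theta_d(d^{k-1})  neurons,  i.e.  N*d = Theta_d(d^k)  parameters.
% A two-layer NN can instead align its first-layer weights with e_1, since
% Q_k(x_1) depends on one direction only, and approximate a univariate
% degree-k polynomial with O_d(1) neurons:
\[
  \underbrace{N_{\mathrm{NT}}\, d = \Theta_d(d^{k})}_{\text{NT parameters}}
  \qquad \text{vs.} \qquad
  \underbrace{N_{\mathrm{NN}} = \Theta_d(1)}_{\text{NN neurons}}.
\]
```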

SLIDES 36-39

Related work

Approximation error of two-layer NN and RF: [Barron, 1993], [Mhaskar, 1996], [Maiorov, 1999], [Caponnetto, De Vito, 2007], [Rahimi, Recht, 2009], [Bach, 2017], [E, Ma, Wu, 2018], ...

    Approx. bound    $f_\star$ with bounded norm          $f_\star \in L^2(\mathbb{R}^d)$ ($\cap$ $d_\star$-sparse)
    RF               $\|f_\star\|_{\mathcal{H}}^2 / N$    $\Theta_N(1/N^{1/d})$
    NN               $\|f_\star\|_{\mathcal{B}}^2 / N$    $\Theta_N(1/N^{1/d_\star})$

New results vs. classical results:
- $N = d^k$ as $d \to \infty$, vs. fixed $d$ as $N \to \infty$.
- Constant asymptotic error, vs. vanishing upper bound.

Which asymptotics makes more sense?
- $d = 100$, $N = 10{,}000{,}000$: then $N = d^{3.5}$ and $1/N^{1/d} = 0.85$.
- $d_\star = 10$, $N = 10{,}000{,}000$: then $N = d_\star^{7}$ and $1/N^{1/d_\star} = 0.20$.

SLIDE 40

Double descent

SLIDES 41-42

The motivating experiment

- MNIST: $(x_i, y_i) \in \mathbb{R}^{784} \times [10]$, $i \in [50{,}000]$.
- Two-layer neural network: $f_N(x, \theta) = \sum_{j=1}^{N} a_j \sigma(\langle w_j, x \rangle)$.
- Square loss without regularization.
- Find a local minimizer; report training and test error.
- Repeat for a sequence of widths $N$, and plot training and test error vs. $N$ (a toy version of this protocol is sketched below).
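A self-contained stand-in for the protocol (my own sketch: synthetic data instead of MNIST, and min-norm random-features fits instead of SGD-trained networks):

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, n_test = 20, 100, 2000
X, X_test = rng.standard_normal((n, d)), rng.standard_normal((n_test, d))
f_star = lambda Z: Z[:, 0] + 0.5 * Z[:, 1] ** 2
y = f_star(X) + 0.3 * rng.standard_normal(n)
y_test = f_star(X_test)

for N in [10, 50, 90, 100, 110, 200, 1000]:        # sweep width through N = n
    W = rng.standard_normal((N, d)) / np.sqrt(d)   # random first layer, fixed
    Phi = np.maximum(X @ W.T, 0.0)                 # relu features
    Phi_test = np.maximum(X_test @ W.T, 0.0)
    a = np.linalg.pinv(Phi) @ y                    # min-norm least-squares fit
    tr = np.mean((Phi @ a - y) ** 2)
    te = np.mean((Phi_test @ a - y_test) ** 2)
    print(f"N={N:5d}  train={tr:.3f}  test={te:.3f}")  # test error peaks near N = n
```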

SLIDES 43-44

Increasing # parameters

[Figure: training and test error vs. # parameters / # samples, on MNIST. Left: [Belkin, Hsu, Ma, Mandal, 2018]. Right: [Spigler, Geiger, d'Ascoli, Sagun, Biroli, Wyart, 2018].]

Similar phenomena appeared in the earlier literature: [LeCun, Kanter, Solla, 1991], [Krogh, Hertz, 1992], [Opper, Kinzel, 1995], [Neyshabur, Tomioka, Srebro, 2014], [Advani, Saxe, 2017].

SLIDE 45

U-shaped curve

[Figure: the classical U-shaped test error curve, from [Belkin, Hsu, Ma, Mandal, 2018].]

SLIDE 46

Double descent

[Figure: the double descent cartoon of [Belkin, Hsu, Ma, Mandal, 2018].]

- Peak at the interpolation threshold.
- Monotone decreasing in the overparameterized regime.
- Global minimum as the number of parameters tends to infinity.

SLIDES 47-48

Complementary instead of contradictory

U-shaped curve: test error vs. a model complexity measure that tightly controls generalization. Examples: the $\ell_2$ norm in a linear model; "$k$" in $k$-nearest-neighbors.

Double descent: test error vs. the number of parameters (e.g., # parameters in a NN). In a NN, # parameters ≠ the model complexity that tightly controls generalization.

[Bartlett, 1997], [Bartlett, Mendelson, 2002]

SLIDE 49

Linear model with random covariates

[Figure: test error vs. # parameters / # samples, by [Hastie, Montanari, Rosset, Tibshirani, 2019]. See also [Belkin, Hsu, Xu, 2019].]

- Under-parameterized: $\hat{\beta} = \arg\min_\beta \sum_i (y_i - \langle x_i, \beta \rangle)^2$.
- Over-parameterized (minimum-norm interpolation): $\hat{\beta} = \arg\min_\beta \|\beta\|_2$ subject to $y_i = \langle x_i, \beta \rangle$, $i \in [n]$.
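Both regimes reduce to one pseudoinverse call (a minimal sketch of mine): for $n \ge d$ it returns the least-squares solution, for $n < d$ the minimum-$\ell_2$-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(4)

def coefficient_norm(n, d):
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)       # pure-noise responses
    beta = np.linalg.pinv(X) @ y     # least squares (n >= d) or min-norm interpolant (n < d)
    return np.linalg.norm(beta)

print(coefficient_norm(200, 50))     # under-parameterized
print(coefficient_norm(50, 200))     # over-parameterized
print(coefficient_norm(50, 50))      # near the interpolation threshold: blows up
```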

SLIDES 50-55

Why singularity?

- Model: $x_i \sim \mathsf{N}(0, I_d)$, $y_i = \langle 0, x_i \rangle + \varepsilon_i$, $\varepsilon_i \sim \mathsf{N}(0, 1)$, $i \in [n]$ (pure noise).
- Test risk $\propto \mathbb{E}\big[\|\hat{\beta} - 0\|_2^2\big] \propto \mathbb{E}\big[\|X^\dagger y\|_2^2\big] \propto \mathbb{E}\big[\mathrm{tr}\big((X^\top X)^\dagger\big)\big]$, where $X \in \mathbb{R}^{n \times d}$.
- When $n$ is far from $d$, $X$ is well conditioned.
- When $n \approx d$, $X$ is infinitely ill conditioned.
- The model has marginally enough parameters to interpolate all the data, hence it interpolates in an awkward way.
- To fit the noise, the coefficient norm $\|\hat{\beta}\|_2^2 = \|X^\dagger y\|_2^2$ blows up.

[Bartlett, Long, Lugosi, Tsigler, 2019], [Muthukumar, Vodrahalli, Sahai, 2019]
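A numerical sketch of the conditioning claim (sizes mine; near $n = d$ the statistic is heavy-tailed, so expect large, noisy values):

```python
import numpy as np

rng = np.random.default_rng(5)
d, trials = 50, 200

for n in [25, 40, 48, 50, 52, 60, 100]:
    vals = [np.trace(np.linalg.pinv(X.T @ X))
            for X in rng.standard_normal((trials, n, d))]
    print(f"n={n:3d}  mean tr((X^T X)^+) ~ {np.mean(vals):10.2f}")  # spikes at n ~ d
```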

SLIDE 56

Comparison

[Figure: test error vs. # parameters / # samples; neural networks [Spigler et al., 2018] vs. the linear model [Hastie et al., 2019].]

Which features does the linear model reproduce?
- Peak at the interpolation threshold.
- Monotone decreasing in the overparameterized regime?
- Global minimum as the number of parameters tends to infinity?

SLIDE 57

Goal: find a tractable model that exhibits all the features of the double descent curve.

[Figure: the double descent cartoon of [Belkin, Hsu, Ma, Mandal, 2018].]

SLIDES 58-59

A simple model

The random features model:
$$f_{RF}(x; a) = \sum_{j=1}^{N} a_j \sigma(\langle w_j, x \rangle).$$

Random weights $(w_j)_{j \in [N]}$: $w_j \sim_{i.i.d.} \mathrm{Unif}(\mathbb{S}^{d-1})$.

Data $(x_i, y_i)_{i \in [n]}$: $x_i \sim \mathrm{Unif}(\mathbb{S}^{d-1}(\sqrt{d}))$, $y_i = f_\star(x_i) + \varepsilon_i$.

SLIDE 60

A simple model

Random features ridge regression: $\hat{a}_\lambda = \arg\min_a L_\lambda(a)$, where
$$L_\lambda(a) = \frac{1}{n} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{N} a_j \sigma(\langle x_i, w_j \rangle) \Big)^2 + \frac{\lambda N}{d} \|a\|_2^2, \qquad \text{(train)}$$
$$R(a, f_\star) = \mathbb{E}_x \Big[ \Big( f_\star(x) - \sum_{j=1}^{N} a_j \sigma(\langle x, w_j \rangle) \Big)^2 \Big]. \qquad \text{(test)}$$

Assumptions
- $n$ data points, $N$ features, dimension $d$, with $N/d \to \psi_1$ and $n/d \to \psi_2$ as $d \to \infty$.
- Technical assumptions on $f_\star$ and $\sigma$ (they apply to almost every $f_\star$ and $\sigma$).

(A small simulation of this estimator follows.)
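A direct simulation (my own choices of $\sigma$, $f_\star$, and sizes), using the closed-form ridge minimizer $(Z^\top Z + (n \lambda N / d)\, I)^{-1} Z^\top y$:

```python
import numpy as np

rng = np.random.default_rng(6)
d, lam, tau = 40, 1e-3, 0.5
n = 4 * d                                       # psi_2 = n/d = 4

def sphere(m, dim, radius):
    Z = rng.standard_normal((m, dim))
    return radius * Z / np.linalg.norm(Z, axis=1, keepdims=True)

def rf_ridge_test_error(psi1):
    N = int(psi1 * d)
    W = sphere(N, d, 1.0)                       # w_j ~ Unif(S^{d-1})
    X = sphere(n, d, np.sqrt(d))                # x_i ~ Unif(S^{d-1}(sqrt(d)))
    f_star = lambda Z: Z[:, 0]                  # a simple (linear) target
    y = f_star(X) + tau * rng.standard_normal(n)
    Z = np.tanh(X @ W.T)                        # sigma = tanh
    a = np.linalg.solve(Z.T @ Z + n * lam * N / d * np.eye(N), Z.T @ y)
    Xt = sphere(4000, d, np.sqrt(d))
    return np.mean((f_star(Xt) - np.tanh(Xt @ W.T) @ a) ** 2)

for psi1 in [0.5, 1.0, 2.0, 4.0, 6.0, 8.0, 16.0]:   # N/d sweeps through N = n
    print(f"N/d={psi1:5.1f}  test error ~ {rf_ridge_test_error(psi1):.3f}")
```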
SLIDE 61

Precise asymptotics

Theorem (Mei, Montanari, 2019)

Under the above assumptions, the test error of the RF model satisfies
$$R(\hat{a}_\lambda, f_\star) = \|\beta\|_2^2 \cdot B(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + \tau^2 \cdot V(\zeta, \psi_1, \psi_2, \lambda/\mu_\star^2) + o_{d,\mathbb{P}}(1),$$
where $\beta$ is the linear part of $f_\star$, $\tau^2$ is the (effective) noise level, and the functions $B$ and $V$ are given explicitly below.

SLIDE 62

Explicit formulae

Let the functions $\nu_1, \nu_2 : \mathbb{C}_+ \to \mathbb{C}_+$ be the unique solutions of
$$\nu_1 = \psi_1 \Big( -\xi - \nu_2 - \frac{\zeta^2 \nu_2}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}, \qquad \nu_2 = \psi_2 \Big( -\xi - \nu_1 - \frac{\zeta^2 \nu_1}{1 - \zeta^2 \nu_1 \nu_2} \Big)^{-1}.$$
Let $\chi \equiv \nu_1\big(i(\psi_1\psi_2\lambda)^{1/2}\big) \cdot \nu_2\big(i(\psi_1\psi_2\lambda)^{1/2}\big)$, and
$$E_0(\zeta, \psi_1, \psi_2, \lambda) \equiv -\chi^5\zeta^6 + 3\chi^4\zeta^4 + (\psi_1\psi_2 - \psi_2 - \psi_1 + 1)\chi^3\zeta^6 - 2\chi^3\zeta^4 - 3\chi^3\zeta^2 + (\psi_1 + \psi_2 - 3\psi_1\psi_2 + 1)\chi^2\zeta^4 + 2\chi^2\zeta^2 + \chi^2 + 3\psi_1\psi_2\chi\zeta^2 - \psi_1\psi_2,$$
$$E_1(\zeta, \psi_1, \psi_2, \lambda) \equiv \psi_2\chi^3\zeta^4 - \psi_2\chi^2\zeta^2 + \psi_1\psi_2\chi\zeta^2 - \psi_1\psi_2,$$
$$E_2(\zeta, \psi_1, \psi_2, \lambda) \equiv \chi^5\zeta^6 - 3\chi^4\zeta^4 + (\psi_1 - 1)\chi^3\zeta^6 + 2\chi^3\zeta^4 + 3\chi^3\zeta^2 - (\psi_1 + 1)\chi^2\zeta^4 - 2\chi^2\zeta^2 - \chi^2.$$
We then have
$$B(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_1(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}, \qquad V(\zeta, \psi_1, \psi_2, \lambda) \equiv \frac{E_2(\zeta, \psi_1, \psi_2, \lambda)}{E_0(\zeta, \psi_1, \psi_2, \lambda)}.$$
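For completeness, a naive numerical evaluation of $B$ and $V$ (my own damped fixed-point iteration; convergence is not guaranteed and may be slow for small $\lambda$):

```python
import numpy as np

def B_V(zeta, psi1, psi2, lam, iters=50000, damp=0.9):
    """Solve the (nu1, nu2) fixed point at xi = i*sqrt(psi1*psi2*lam),
    then evaluate E0, E1, E2 and return (B, V) = (E1/E0, E2/E0)."""
    xi = 1j * np.sqrt(psi1 * psi2 * lam)
    nu1 = nu2 = 1j
    for _ in range(iters):
        denom = 1 - zeta**2 * nu1 * nu2
        new1 = psi1 / (-xi - nu2 - zeta**2 * nu2 / denom)
        new2 = psi2 / (-xi - nu1 - zeta**2 * nu1 / denom)
        nu1 = damp * nu1 + (1 - damp) * new1      # damped update
        nu2 = damp * nu2 + (1 - damp) * new2
    chi = nu1 * nu2
    z2, z4, z6 = zeta**2, zeta**4, zeta**6
    E0 = (-chi**5 * z6 + 3 * chi**4 * z4
          + (psi1 * psi2 - psi2 - psi1 + 1) * chi**3 * z6
          - 2 * chi**3 * z4 - 3 * chi**3 * z2
          + (psi1 + psi2 - 3 * psi1 * psi2 + 1) * chi**2 * z4
          + 2 * chi**2 * z2 + chi**2
          + 3 * psi1 * psi2 * chi * z2 - psi1 * psi2)
    E1 = (psi2 * chi**3 * z4 - psi2 * chi**2 * z2
          + psi1 * psi2 * chi * z2 - psi1 * psi2)
    E2 = (chi**5 * z6 - 3 * chi**4 * z4 + (psi1 - 1) * chi**3 * z6
          + 2 * chi**3 * z4 + 3 * chi**3 * z2 - (psi1 + 1) * chi**2 * z4
          - 2 * chi**2 * z2 - chi**2)
    return (E1 / E0).real, (E2 / E0).real

print(B_V(zeta=0.5, psi1=2.0, psi2=3.0, lam=0.1))
```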

SLIDE 63

Proof strategy

Random matrix theory for random kernel inner-product matrices
$$Z = \big( \sigma(\langle w_i, x_j \rangle) \big)_{i \in [N],\, j \in [n]}.$$

[El Karoui, 2010], [Cheng, Singer, 2013], [Do, Vu, 2013], [Fan, Montanari, 2019], [Hastie, Montanari, Rosset, Tibshirani, 2019]

SLIDE 64

Analytical prediction

[Figure: analytic test error curves vs. $N/d$, for $\lambda = 0^+$ and $\lambda = 3 \times 10^{-4}$.]

- Peak at the interpolation threshold.
- Monotone decreasing in the overparameterized regime.
- Global minimum as the number of parameters tends to infinity.

SLIDE 65

Insights

[Figure: test error vs. $N/n$ for several values of $\lambda$.]

- For any $\lambda$, the minimum prediction error is achieved as $N/n \to \infty$.
- For the optimal $\lambda$, the prediction error is monotonically decreasing.

SLIDE 66

Insights

[Figure: test error vs. $\lambda$, at SNR = 5 (left) and SNR = 1/10 (right).]

- High SNR: minimum at $\lambda = 0^+$.
- Low SNR: minimum at $\lambda > 0$.

SLIDES 67-68

Summary of linearization of neural networks

[Figure: the double descent curve and the staircase decay curve.]

- # parameters ≠ the model complexity that controls generalization.
- Double descent also exists in linearized neural networks.
- There is a gap between NN and NT: NT models cannot fully explain the generalization efficacy of NN.

SLIDE 69

Going beyond linearization?

SLIDE 70

Mean field theory

- SGD for two-layer neural networks:
  $$\theta_i^{k+1} = \theta_i^k - \varepsilon\, \nabla_{\theta_i} \ell\Big( y_k,\ \frac{1}{N} \sum_{i=1}^{N} \sigma_\star(x_k, \theta_i^k) \Big).$$
- Consider the empirical distribution of the weights:
  $$\hat{\rho}_{N, k\varepsilon} = \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i^k}.$$
- Then $\hat{\rho}_{N,t} \to \rho_t$ as $N \to \infty$ and $\varepsilon \to 0$, where $\rho_t$ satisfies
  $$\partial_t \rho_t = \nabla \cdot \big( \nabla \Psi(\theta, \rho_t)\, \rho_t \big).$$
- Difference from the linearization theory: a different scaling limit (note the $1/N$ factor in front of the sum).

[Mei, Montanari, Nguyen, 2018], [Rotskoff, Vanden-Eijnden, 2018]
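A minimal simulation of this scaling (my own sketch; the teacher, activation, and sizes are made up, and the $1/N$ gradient factor is absorbed into the step size, as is conventional in the mean-field time scaling):

```python
import numpy as np

rng = np.random.default_rng(7)
d, N, eps, steps = 5, 500, 0.05, 5000

W = rng.standard_normal((N, d))            # theta_i^0 iid: rho_0 = N(0, I_d)
w_teacher = rng.standard_normal(d)         # a one-neuron teacher

for k in range(steps):
    x = rng.standard_normal(d)
    y = max(w_teacher @ x, 0.0)
    z = W @ x
    pred = np.mean(np.maximum(z, 0.0))     # f = (1/N) sum_i relu(<theta_i, x>)
    # Square loss l(y, f) = (y - f)^2; per-particle gradient (the 1/N is
    # dropped here, absorbed into eps):
    #   d l / d theta_i ~ -2 (y - f) 1{<theta_i, x> > 0} x
    W -= eps * (-2.0 * (y - pred)) * (z > 0)[:, None] * x[None, :]

# The object of interest is the empirical distribution of the rows of W:
print("weight distribution moments:", W.mean(), W.std())
```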

SLIDE 71

Future directions

- The distribution of the features $x$ matters.
  Images → convolutional neural networks. Graphs → graph neural networks. Exploring data and network invariance.
- Neural networks as function/distribution approximators?
  Generative modeling. Reinforcement learning.
- Uncertainty quantification in neural network systems.
  Robustness and adversarial examples. Approximate inference for Bayesian neural networks. Predictive inference.

SLIDE 72

Thanks!