SLIDE 1 Generalization of linearized neural networks: staircase decay and double descent
Song Mei
UC Berkeley
July 23, 2020
Department of Mathematics, HKUST
SLIDE 3 Deep Learning Revolution
Application areas: gaming, healthcare, autonomous vehicles, finance, machine translation, robotics, communication.
“ACM named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM A.M. Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.”
SLIDE 5
But theoretically? WHEN and WHY does deep learning work?
SLIDE 7
Call for theoretical understanding
From “alchemy” to science: mathematical theories, physical laws, reproducible experiments.
SLIDE 10
What don’t we understand?
Empirical surprises [Zhang et al., 2015]:
◮ Over-parameterization: # parameters ≫ # training samples.
◮ Non-convexity.
◮ Yet SGD efficiently fits all the training samples.
◮ And the trained network generalizes well on test samples.
Mathematical challenges
◮ Non-convexity: why is optimization efficient?
◮ Over-parameterization: why is generalization effective?
SLIDE 11
A gentle introduction to
Linearization theory of neural networks
SLIDE 12
Linearized neural networks (neural tangent model)
◮ Multi-layer neural network f(x; θ), x ∈ R^d, θ ∈ R^N:
  f(x; θ) = W_L σ(· · · W_2 σ(W_1 x)).
◮ Linearization around a (random) initialization θ_0:
  f(x; θ) = f(x; θ_0) + ⟨θ − θ_0, ∇_θ f(x; θ_0)⟩ + o(‖θ − θ_0‖_2).
◮ Neural tangent model: the linear part of f,
  f_NT(x; β; θ_0) = ⟨β, ∇_θ f(x; θ_0)⟩.
[Jacot, Gabriel, Hongler, 2018] [Chizat, Bach, 2018b]
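The linearization can be checked numerically. Below is a minimal numpy sketch (purely illustrative; the two-layer ReLU architecture, the sizes, and the perturbation scale are assumptions made for the sketch): it compares f(x; θ_0 + δ) with the first-order expansion f(x; θ_0) + ⟨δ, ∇_θ f(x; θ_0)⟩ for a small perturbation δ of a random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 64                       # input dimension, number of hidden units

def f(x, W, a):
    """Two-layer network f(x; theta) = a^T relu(W x), theta = (W, a)."""
    return a @ np.maximum(W @ x, 0.0)

def grad_theta(x, W, a):
    """Gradient of f with respect to (W, a) at a fixed input x."""
    h = W @ x
    dW = ((h > 0).astype(float) * a)[:, None] * x[None, :]   # df/dW_ij
    da = np.maximum(h, 0.0)                                   # df/da_i
    return dW, da

# Random initialization theta_0 and a small perturbation delta
W0 = rng.normal(size=(N, d)) / np.sqrt(d)
a0 = rng.normal(size=N) / np.sqrt(N)
dW, da = 1e-3 * rng.normal(size=(N, d)), 1e-3 * rng.normal(size=N)

x = rng.normal(size=d)
gW, ga = grad_theta(x, W0, a0)

exact = f(x, W0 + dW, a0 + da)
linearized = f(x, W0, a0) + np.sum(gW * dW) + ga @ da   # neural-tangent approximation
print("exact:", exact, "linearized:", linearized, "gap:", abs(exact - linearized))
```

The gap is higher order in the perturbation, which is why the neural tangent model describes very wide networks whose parameters barely move during training.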
SLIDE 15
Linear regression over random features
◮ NT model: the linear part of f,  f_NT(x; β; θ_0) = ⟨β, φ(x)⟩ = ⟨β, ∇_θ f(x; θ_0)⟩.
◮ (Random) feature map: φ(·) = ∇_θ f(·; θ_0) : R^d → R^N.
◮ Training dataset: (X, Y) = (x_i, y_i)_{i∈[n]}.
◮ Gradient flow dynamics: (d/dt) β_t = −∇_β Ê[(y − f_NT(x; β_t; θ_0))²],  β_0 = 0.
◮ Linear convergence: β_t → β̂ = φ(X)^† Y.
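The last claim — gradient descent on the squared loss, started at β_0 = 0, converges to the minimum-norm interpolant β̂ = φ(X)^† Y — can be verified in a few lines. A small sketch (the Gaussian stand-in for the feature matrix and the step size are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 40, 200                       # samples, features (over-parameterized)

Phi = rng.normal(size=(n, N))        # stand-in for the feature matrix Phi(X)
y = rng.normal(size=n)

# Gradient descent on the squared loss, started at beta = 0
beta = np.zeros(N)
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2      # step size below 1 / ||Phi||_op^2
for _ in range(5000):
    beta -= lr * Phi.T @ (Phi @ beta - y)

beta_pinv = np.linalg.pinv(Phi) @ y          # minimum-norm interpolant Phi^+ y
print("distance to pinv solution:", np.linalg.norm(beta - beta_pinv))
print("training residual        :", np.linalg.norm(Phi @ beta - y))
```

Because the iterates stay in the row space of the feature matrix, the limit of gradient descent from zero is the pseudoinverse (minimum ℓ_2-norm) solution.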
SLIDE 20 Neural network ≈ neural tangent
Theorem [Jacot, Gabriel, Hongler, 2018] (informal)
Consider neural networks f_N(x; θ) with N neurons, and compare the two dynamics
  (d/dt) θ_t = −∇_θ Ê[(y − f_N(x; θ_t))²],  θ_{t=0} = θ_0,
  (d/dt) β_t = −∇_β Ê[(y − f^N_NT(x; β_t; θ_0))²],  β_{t=0} = 0.
Under proper (random) initialization, we have almost surely
  lim_{N→∞} |f_N(x; θ_t) − f^N_NT(x; β_t)| = 0.
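A hedged numerical illustration of this statement (toy sizes and an NTK-style 1/√N scaling are assumptions; the theorem itself concerns the N → ∞ limit): train a moderately wide two-layer ReLU network and its neural tangent linearization with the same full-batch gradient descent, and compare their outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, n = 5, 1000, 50
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sin(X @ rng.normal(size=d))
relu = lambda z: np.maximum(z, 0.0)

# NTK-style parameterization: f_N(x; W, a) = (1/sqrt(N)) a^T relu(W x)
W0, a0 = rng.normal(size=(N, d)), rng.normal(size=N)
def net(X_, W_, a_):
    return relu(X_ @ W_.T) @ a_ / np.sqrt(N)

# Neural tangent features at theta_0 (gradients wrt a and W, flattened per sample)
H0 = X @ W0.T
feat_a = relu(H0) / np.sqrt(N)
feat_W = ((H0 > 0) * a0 / np.sqrt(N))[:, :, None] * X[:, None, :]
Phi = np.concatenate([feat_a, feat_W.reshape(n, -1)], axis=1)
f0 = net(X, W0, a0)

W, a = W0.copy(), a0.copy()
beta = np.zeros(Phi.shape[1])
lr = 0.1
for _ in range(1000):
    # neural network: gradient step on (1/2n) sum_i (f(x_i) - y_i)^2
    H = X @ W.T
    r = relu(H) @ a / np.sqrt(N) - y
    grad_a = relu(H).T @ r / (np.sqrt(N) * n)
    grad_W = ((H > 0) * a / np.sqrt(N) * r[:, None]).T @ X / n
    a, W = a - lr * grad_a, W - lr * grad_W
    # neural tangent model: same loss, same step size, linear in beta
    beta -= lr * Phi.T @ (f0 + Phi @ beta - y) / n

print("max |f_N - f_NT| on the training inputs:",
      np.max(np.abs(net(X, W, a) - (f0 + Phi @ beta))))
```

As the width N grows, the two trajectories stay closer and closer together.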
SLIDE 23
Optimization success
Gradient flow on the training loss of a neural network converges to a global minimum ... with over-parameterization and proper initialization.
[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Du, Lee, Li, Wang, Zhai, 2018], [Allen-Zhu, Li, Song, 2018], [Zou, Cao, Zhou, Gu, 2018], [Oymak, Soltanolkotabi, 2018], [Chizat, Bach, 2018b], ...
Does linearization fully explain the success of neural networks? Our answer is: No.
SLIDE 24
Generalization
Empirically, NT models do not generalize as well as neural networks.
Table: CIFAR-10 experiments
Architecture | Classification error
CNN | ~4%
CNTK (1) | 23%
CNTK (2) | 11%
Compositional kernel (3) | 10%
(1) [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019], (2) [Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora, 2019], (3) [Shankar, Fang, Guo, Fridovich-Keil, Schmidt, Ragan-Kelley, Recht, 2020].
SLIDE 25
Performance gap: NN versus NT
SLIDE 26 Two-layer neural network
  f_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩),  Θ = (a_1, w_1, . . . , a_N, w_N).
◮ Input vector x ∈ R^d.
◮ Bottom-layer weights w_i ∈ R^d, i = 1, 2, . . . , N.
◮ Top-layer weights a_i ∈ R, i = 1, 2, . . . , N.
SLIDE 27 Linearization around initialization
Linearization:
  f_N(x; Θ) = f_N(x; Θ_0) + Σ_{i=1}^N Δa_i σ(⟨w_i^0, x⟩)   [top-layer linearization]
            + Σ_{i=1}^N a_i^0 σ′(⟨w_i^0, x⟩) ⟨Δw_i, x⟩   [bottom-layer linearization]
            + o(Δ).
Linearized neural networks (w_i ~ Unif(S^{d−1})):
  F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] },
  F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨b_i, x⟩ : b_i ∈ R^d, i ∈ [N] }.
Blue: random and fixed. Red: parameters to be optimized.
[Rahimi, Recht, 2008] [Jacot, Gabriel, Hongler, 2018]
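Concretely, both classes are linear models over fixed random features. A minimal sketch of the two feature maps (the ReLU activation and the sizes are assumptions for the sketch); RF has N trainable parameters, NT has N·d:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, n = 20, 100, 500

# Random first-layer weights, fixed: w_i ~ Unif(S^{d-1})
W = rng.normal(size=(N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # x ~ Unif(S^{d-1}(sqrt(d)))

relu = lambda z: np.maximum(z, 0.0)
step = lambda z: (z > 0).astype(float)            # sigma'(z) for ReLU

H = X @ W.T                                       # (n, N) inner products <w_i, x>

# RF class: f(x) = sum_i a_i sigma(<w_i, x>)  -> linear in a, N parameters
Phi_RF = relu(H)

# NT class: f(x) = sum_i sigma'(<w_i, x>) <b_i, x>  -> linear in b, N*d parameters
Phi_NT = (step(H)[:, :, None] * X[:, None, :]).reshape(n, N * d)

print("RF features:", Phi_RF.shape, "  NT features:", Phi_NT.shape)
```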
SLIDE 28 Approximation error
Data distribution: x ~ Unif(S^{d−1}(√d)),  f_* ∈ L²(S^{d−1}(√d)).
Minimum risk (approximation error):
  R_{M,N}(f_*) = inf_{f ∈ F_{M,N}(W)} E_x[ (f_*(x) − f(x))² ],  M ∈ {RF, NT}.
SLIDE 29
Staircase decay
SLIDE 30 Random features regression
F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] },  W = (w_i)_{i∈[N]} ~_{i.i.d.} Unif(S^{d−1}).
Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)
Assume d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} and that σ satisfies a “generic condition”. Then
  inf_{f ∈ F_{RF,N}(W)} E_x[(f_*(x) − f(x))²] = ‖P_{>ℓ} f_*‖²_{L²} + o_{d,P}(‖f_*‖²_{L²}).
P_{>ℓ}: projection orthogonal to the space of degree-ℓ polynomials. With d^ℓ parameters, RF only fits a degree-ℓ polynomial.
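A small numerical illustration of this statement (far from the theorem's asymptotic regime; the ReLU activation, the sizes, and the target f_*(x) = x_1 + (x_1² − 1) are all assumptions made for the sketch): with N of order d, the best RF fit leaves an error close to the energy of the degree-≥2 part of f_*.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, m = 50, 150, 20000           # N is roughly d^{1.3}, so degree <= 1 should be learnable

W = rng.normal(size=(N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)                # w_i ~ Unif(S^{d-1})
X = rng.normal(size=(m, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # x ~ Unif(S^{d-1}(sqrt(d)))

# Target with a degree-1 part and a degree-2 part: f_*(x) = x_1 + (x_1^2 - 1)
lin, quad = X[:, 0], X[:, 0] ** 2 - 1
f_star = lin + quad

# Best RF fit of f_* on a large sample (a proxy for the population infimum, since m >> N)
Phi = np.maximum(X @ W.T, 0.0)
a, *_ = np.linalg.lstsq(Phi, f_star, rcond=None)
print("RF approximation error   :", np.mean((f_star - Phi @ a) ** 2))
print("||P_{>1} f_*||^2 (approx):", np.mean(quad ** 2))
```

At these finite sizes the two numbers only roughly agree; the theorem is a statement about the d → ∞ limit.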
SLIDE 32 Similar result for NT
F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨b_i, x⟩ : b_i ∈ R^d, i ∈ [N] },  W = (w_i)_{i∈[N]} ~_{i.i.d.} Unif(S^{d−1}).
Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)
Assume d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} and that σ satisfies a “generic condition”. Then
  inf_{f ∈ F_{NT,N}(W)} E_x[(f_*(x) − f(x))²] = ‖P_{>ℓ+1} f_*‖²_{L²} + o_{d,P}(‖f_*‖²_{L²}).
P_{>ℓ+1}: projection orthogonal to the space of degree-(ℓ+1) polynomials. With d^{ℓ+1} parameters, NT only fits a degree-(ℓ+1) polynomial.
SLIDE 34 The staircase decay (a cartoon)
  f = P_0 f + P_1 f + P_2 f + P_3 f + · · ·
Figure: cartoon of the approximation error decreasing in steps (a staircase) as the number of parameters grows through powers of d.
SLIDE 35
Approximation gap
Function f : S^{d−1} → R, f(x) = Q_k(x_1), with Q_k a degree-k polynomial.
◮ NT: N = Θ_d(d^{k−1}) neurons (i.e., Nd = Θ_d(d^k) parameters) are needed.
◮ NN: N = Θ_d(1) neurons suffice.
◮ A separation in approximation power.
◮ Neural networks can potentially learn features adaptively.
SLIDE 36 Related work
Approximation error of two-layer NN and RF: [Barron, 1993], [Mhaskar, 1996], [Maiorov, 1999], [Caponnetto, de Vito, 2007], [Rahimi, Recht, 2009], [Bach, 2017], [E, Ma, Wu, 2018], ...
Approx. bound | f_* of bounded norm | f_* ∈ L²(R^d) ∩ (d_*-sparse)
RF | ‖f_*‖²_H / N | Θ_N(1/N^{1/d})
NN | ‖f_*‖²_B / N | Θ_N(1/N^{1/d_*})
SLIDE 37 Related work
Approximation error of two-layer NN and RF (same references and table as on the previous slide).
Difference between the new results and the classical results:
◮ N = d^k as d → ∞, vs. fixed d as N → ∞.
◮ Constant asymptotic error, vs. vanishing upper bound.
SLIDE 38 Related work
Approximation error of two-layer NN and RF (same table as above).
N = d^k as d → ∞, vs. fixed d as N → ∞: which asymptotics makes more sense?
Example: d = 100, N = 10,000,000. Then N = d^{3.5}, while 1/N^{1/d} = 0.85.
SLIDE 39 Related work
Approximation error of two-layer NN and RF (same table as above).
N = d^k as d → ∞, vs. fixed d as N → ∞: which asymptotics makes more sense?
Example: d_* = 10, N = 10,000,000. Then N = d_*^7, while 1/N^{1/d_*} = 0.20.
SLIDE 40
Double descent
SLIDE 41 The motivating experiment
◮ MNIST: (x_i, y_i) ∈ R^{784} × [10], i ∈ [50,000].
◮ Two-layer neural networks f_N:  f_N(x; θ) = Σ_{j=1}^N a_j σ(⟨w_j, x⟩).
◮ Square loss without regularization.
◮ Find a local minimizer; report training and test error.
◮ Perform a sequence of experiments for different N.
◮ Plot training and test error vs. N. (A runnable sketch of this protocol follows below.)
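A sketch of this experimental protocol on a small synthetic stand-in for MNIST (the data model, sizes, learning rate, and number of steps are all assumptions; reproducing the actual curves requires the real data and much longer training):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_train, n_test = 20, 200, 2000

# Synthetic stand-in for the dataset
w_true = rng.normal(size=d) / np.sqrt(d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr, y_te = np.tanh(X_tr @ w_true), np.tanh(X_te @ w_true)
relu = lambda z: np.maximum(z, 0.0)

def train_two_layer(N, steps=2000, lr=0.1):
    """Fit f_N(x) = (1/sqrt(N)) sum_j a_j relu(<w_j, x>) by full-batch gradient descent."""
    W = rng.normal(size=(N, d)) / np.sqrt(d)
    a = rng.normal(size=N)
    f = lambda X_: relu(X_ @ W.T) @ a / np.sqrt(N)
    for _ in range(steps):
        H = X_tr @ W.T
        r = relu(H) @ a / np.sqrt(N) - y_tr
        ga = relu(H).T @ r / (np.sqrt(N) * n_train)
        gW = ((H > 0) * a * r[:, None]).T @ X_tr / (np.sqrt(N) * n_train)
        a, W = a - lr * ga, W - lr * gW
    return np.mean((f(X_tr) - y_tr) ** 2), np.mean((f(X_te) - y_te) ** 2)

# Sweep the number of neurons N and record train / test error
for N in [2, 5, 10, 20, 50, 100, 200, 400]:
    tr, te = train_two_layer(N)
    print(f"N = {N:4d}   train = {tr:.4f}   test = {te:.4f}")
```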
SLIDE 44 Increasing # parameters
Figure: training and test error on MNIST vs. # parameters / # samples. Left: [Belkin, Hsu, Ma, Mandal, 2018]. Right: [Spigler, Geiger, Ascoli, Sagun, Biroli, Wyart, 2018].
Similar phenomena appeared earlier in the literature: [LeCun, Kanter, Solla, 1991], [Krogh, Hertz, 1992], [Opper, Kinzel, 1995], [Neyshabur, Tomioka, Srebro, 2014], [Advani, Saxe, 2017].
SLIDE 45
U-shaped curve
[Belkin, Hsu, Ma, Mandal, 2018]
SLIDE 46
Double descent
Figure: A cartoon by [Belkin, Hsu, Ma, Mandal, 2018].
Peak at the interpolation threshold. Monotone decreasing in the overparameterized regime. Global minimum when the number of parameters is infinity.
SLIDE 47
Complementary instead of contradictory
U-shaped curve
Test error vs. a model-complexity measure that tightly controls generalization. Examples: the ℓ_2 norm in a linear model, “k” in k-nearest neighbors.
Double descent
Test error vs. the number of parameters. Example: # parameters in a NN. In a NN, # parameters ≠ a model-complexity measure that tightly controls generalization.
[Bartlett, 1997], [Bartlett and Mendelson, 2002]
SLIDE 49 Linear model with random covariates
Figure: test error vs. # parameters / # samples, by [Hastie, Montanari, Rosset, Tibshirani, 2019]. See also [Belkin, Hsu, Xu, 2019].
◮ Under-parameterized: β̂ = arg min_β Σ_i (y_i − ⟨x_i, β⟩)².
◮ Over-parameterized: β̂ = arg min_β ‖β‖_2  s.t.  y_i = ⟨x_i, β⟩, i ∈ [n].
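A minimal sketch reproducing this double-descent curve for the linear model (the Gaussian covariates, isotropic signal, and noise level are assumptions for the sketch); np.linalg.pinv gives the least-squares solution when d ≤ n and the minimum-norm interpolant when d > n, so one formula covers both regimes:

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_test, sigma = 100, 2000, 0.5

def risk(d, trials=20):
    errs = []
    for _ in range(trials):
        beta = rng.normal(size=d) / np.sqrt(d)          # signal with ||beta||_2 = O(1)
        X = rng.normal(size=(n, d))
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat = np.linalg.pinv(X) @ y                # OLS if d <= n, min-norm interpolant if d > n
        X_te = rng.normal(size=(n_test, d))
        errs.append(np.mean((X_te @ (beta_hat - beta)) ** 2))
    return np.mean(errs)

for d in [20, 50, 80, 95, 100, 105, 120, 200, 400, 1000]:
    print(f"d/n = {d/n:5.2f}   test risk = {risk(d):.3f}")
```

The test risk spikes near d/n = 1 and decreases again in the over-parameterized regime.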
SLIDE 50 Why singularity?
◮ Model: x_i ~ N(0, I_d), y_i = ⟨0, x_i⟩ + ε_i with ε_i ~ N(0, 1), i ∈ [n] (pure-noise labels).
◮ Test risk ∝ E[‖β̂ − 0‖²_2] ∝ E[‖X^† y‖²_2] ∝ E[tr((X^T X)^†)], where X ∈ R^{n×d}.
◮ When n is far from d, X is well conditioned.
◮ When n ≈ d, X is nearly singular: its condition number diverges.
◮ The model has marginally enough parameters to interpolate all the data, hence it interpolates in an awkward way.
◮ To fit the noise, the coefficient norm ‖β̂‖²_2 = ‖X^† y‖²_2 blows up.
[Bartlett, Long, Lugosi, Tsigler, 2019], [Muthukumar, Vodrahalli, Sahai, 2019]
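The blow-up of E[tr((X^T X)^†)] near n ≈ d can be seen directly; a tiny sketch (the Gaussian design and the specific sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 100
for n in [50, 80, 95, 100, 105, 120, 200]:
    vals = []
    for _ in range(20):
        X = rng.normal(size=(n, d))
        vals.append(np.trace(np.linalg.pinv(X.T @ X)))   # sum of 1 / (nonzero eigenvalues)
    print(f"n = {n:3d}, d = {d}:  tr((X^T X)^+) ~ {np.mean(vals):.2f}")
```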
SLIDE 56 Comparison
Neural networks [Spigler et al., 2018] vs. linear model [Hastie et al., 2019].
Figure: test error vs. # parameters / # samples for the two models.
◮ Peak at the interpolation threshold.
◮ Monotone decreasing in the over-parameterized regime.
◮ Global minimum when the number of parameters is infinite.
SLIDE 57 Goal: find a tractable model that exhibits all the features of the double descent curve.
Figure: By [Belkin, Hsu, Ma, Mandal, 2018].
SLIDE 59 A simple model
The random features model:
  f_RF(x; a) = Σ_{j=1}^N a_j σ(⟨w_j, x⟩).
Random weights (w_j)_{j∈[N]}:  w_j ~_{i.i.d.} Unif(S^{d−1}).
Data (x_i, y_i)_{i∈[n]}:  x_i ~ Unif(S^{d−1}(√d)),  y_i = f_*(x_i) + ε_i.
SLIDE 60 A simple model
Random features regression: â_λ = arg min_a L_λ(a), with
  L_λ(a) = (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^N a_j σ(⟨x_i, w_j⟩) )² + (λN/d) ‖a‖²_2.   (Train)
  R(a; f_*) = E_{x,y}[ ( f_*(x) − Σ_{j=1}^N a_j σ(⟨x, w_j⟩) )² ].   (Test)
Assumptions
◮ n data points, N features, dimension d, with N/d → ψ_1 and n/d → ψ_2 as d → ∞.
◮ Technical assumptions on f_* and σ (they apply to almost every f_* and σ). (A small simulation of this setup is sketched below.)
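A simulation sketch of this estimator for one pair (ψ_1, ψ_2) (the linear target f_*, the ReLU activation, the noise level, and λ are assumptions for the sketch; â_λ is computed from the ridge normal equations rather than by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(8)
d, psi1, psi2, lam, tau = 100, 2.0, 3.0, 1e-3, 0.5
N, n, m = int(psi1 * d), int(psi2 * d), 5000

def sphere(k, dim):
    Z = rng.normal(size=(k, dim))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

W = sphere(N, d)                                   # w_j ~ Unif(S^{d-1})
X = np.sqrt(d) * sphere(n, d)                      # x_i ~ Unif(S^{d-1}(sqrt(d)))
X_te = np.sqrt(d) * sphere(m, d)

beta = rng.normal(size=d) / np.sqrt(d)
f_star = lambda Z: Z @ beta                        # a simple linear target (an assumption)
y = f_star(X) + tau * rng.normal(size=n)

relu = lambda z: np.maximum(z, 0.0)
Phi, Phi_te = relu(X @ W.T), relu(X_te @ W.T)

# Ridge regression: argmin_a (1/n)||y - Phi a||^2 + (lam * N / d) ||a||^2
A = Phi.T @ Phi / n + (lam * N / d) * np.eye(N)
a_hat = np.linalg.solve(A, Phi.T @ y / n)

test_err = np.mean((f_star(X_te) - Phi_te @ a_hat) ** 2)
print("test error R(a_hat; f_*) ~", test_err)
```

Sweeping ψ_1 = N/d at fixed ψ_2 and λ traces out the curves that the asymptotic theory predicts.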
SLIDE 61 Precise asymptotics
Theorem (Mei and Montanari, 2019)
Under the above assumptions, the test error of the RF model is given by
  R(â_λ; f_*) = ‖β‖²_2 · B(ζ, ψ_1, ψ_2, λ/μ_*²) + τ² · V(ζ, ψ_1, ψ_2, λ/μ_*²) + o_{d,P}(1),
where the functions B and V are given explicitly below.
SLIDE 62 Explicit formulae
Let the functions ν_1, ν_2 : C_+ → C_+ be the unique solution of
  ν_1 = ψ_1 ( −ξ − ν_2 − ζ²ν_2 / (1 − ζ²ν_1ν_2) )^{−1},
  ν_2 = ψ_2 ( −ξ − ν_1 − ζ²ν_1 / (1 − ζ²ν_1ν_2) )^{−1}.
Let χ ≡ ν_1(i(ψ_1ψ_2λ)^{1/2}) · ν_2(i(ψ_1ψ_2λ)^{1/2}), and
  E_0(ζ, ψ_1, ψ_2, λ) ≡ −χ⁵ζ⁶ + 3χ⁴ζ⁴ + (ψ_1ψ_2 − ψ_2 − ψ_1 + 1)χ³ζ⁶ − 2χ³ζ⁴ − 3χ³ζ² + (ψ_1 + ψ_2 − 3ψ_1ψ_2 + 1)χ²ζ⁴ + 2χ²ζ² + χ² + 3ψ_1ψ_2χζ² − ψ_1ψ_2,
  E_1(ζ, ψ_1, ψ_2, λ) ≡ ψ_2χ³ζ⁴ − ψ_2χ²ζ² + ψ_1ψ_2χζ² − ψ_1ψ_2,
  E_2(ζ, ψ_1, ψ_2, λ) ≡ χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ_1 − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² + (ψ_1 − 1)χ²ζ⁴ − 2χ²ζ² − χ².
We then have
  B(ζ, ψ_1, ψ_2, λ) ≡ E_1(ζ, ψ_1, ψ_2, λ) / E_0(ζ, ψ_1, ψ_2, λ),  V(ζ, ψ_1, ψ_2, λ) ≡ E_2(ζ, ψ_1, ψ_2, λ) / E_0(ζ, ψ_1, ψ_2, λ).
SLIDE 63 Proof strategy
Random matrix theory for the random kernel inner-product matrix
  Z = ( σ(⟨w_i, x_j⟩) )_{i∈[N], j∈[n]}.
[El Karoui, 2010], [Cheng, Singer, 2013], [Do, Vu, 2013], [Fan, Montanari, 2019], [Hastie, Montanari, Rosset, Tibshirani, 2019].
SLIDE 64 Analytical prediction
Figure: analytic prediction of the test error as a function of N/n, for λ = 0⁺ (left) and a small ridge λ = 3 × 10⁻⁴ (right).
◮ Peak at the interpolation threshold.
◮ Monotone decreasing in the over-parameterized regime.
◮ Global minimum when the number of parameters is infinite.
SLIDE 65 Insights
Figure: test error as a function of N/n for several values of the regularization λ.
◮ For any λ, the minimum prediction error is achieved as N/n → ∞.
◮ For the optimal λ, the prediction error is monotonically decreasing.
SLIDE 66 Insights
Figure: test error at SNR = 5 (left) and SNR = 1/10 (right), for several values of the regularization λ.
◮ High SNR: minimum at λ = 0⁺.
◮ Low SNR: minimum at λ > 0.
SLIDE 68 Summary of linearization of neural networks
Figure: double-descent test error curve (left) and staircase approximation-error cartoon (right).
◮ # parameters ≠ the model complexity that controls generalization.
◮ Double descent also exists in linearized neural networks.
◮ There remains a gap between NN and NT: NT models cannot fully explain the generalization efficacy of NN.
SLIDE 69
Going beyond linearization?
SLIDE 70 Mean field theory
◮ SGD for two-layer neural networks:
  θ_i^{k+1} = θ_i^k − ε ∇_{θ_i} ℓ( y_k, (1/N) Σ_{i=1}^N σ_*(x_k; θ_i^k) ).
◮ Consider the empirical distribution of the weights:
  ρ̂_{N,kε} = (1/N) Σ_{i=1}^N δ_{θ_i^k}.
◮ Then ρ̂_{N,t} → ρ_t as N → ∞ and ε → 0, and ρ_t satisfies
  ∂_t ρ_t = ∇ · ( ∇Ψ(θ; ρ_t) ρ_t ).
◮ Difference from the linearization theory: a different scaling limit.
[Mei, Montanari, Nguyen, 2018], [Rotskoff, Vanden-Eijnden, 2018]
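A hedged simulation sketch of this viewpoint (one-pass SGD on a toy teacher model; the teacher, the step size, and absorbing the 1/N factor into ε are assumptions for the sketch): run SGD on a two-layer network in the mean-field 1/N scaling and summarize the empirical distribution ρ̂_N of the neuron parameters by their alignment with the teacher direction.

```python
import numpy as np

rng = np.random.default_rng(9)
d, N, steps, eps = 10, 1000, 30000, 0.05

w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
relu = lambda z: np.maximum(z, 0.0)

# Mean-field scaling: f(x; theta) = (1/N) sum_i a_i relu(<w_i, x>), theta_i = (a_i, w_i)
a = rng.normal(size=N)
W = rng.normal(size=(N, d))

for _ in range(steps):
    x = rng.normal(size=d)
    y = relu(x @ w_true)                         # noiseless teacher (an assumption)
    h = W @ x
    g = np.mean(a * relu(h)) - y                 # prediction error on this sample
    # One SGD step on the squared loss; the 1/N of the scaling is absorbed into eps
    ga = g * relu(h)
    gW = g * (a * (h > 0))[:, None] * x[None, :]
    a, W = a - eps * ga, W - eps * gW

# Summarize the empirical distribution rho_hat_N of the neurons after training
align = (W @ w_true) / np.linalg.norm(W, axis=1)
errs = []
for _ in range(200):
    x = rng.normal(size=d)
    errs.append((np.mean(a * relu(W @ x)) - relu(x @ w_true)) ** 2)
print("prediction error on fresh samples  :", np.mean(errs))
print("mean |cos(w_i, w_true)| over neurons:", np.mean(np.abs(align)))
```

Unlike the neural tangent picture, here the individual weights move by an O(1) amount, which is what the evolving distribution ρ_t captures.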
SLIDE 71
Future directions
◮ The distribution of the features x matters.
  Images → convolutional neural networks. Graphs → graph neural networks. Exploring data and network invariances.
◮ Neural networks as function/distribution approximation?
Generative modeling. Reinforcement learning.
◮ Uncertainty quantification in neural network systems.
Robustness and adversarial examples. Approximate inference for Bayesian neural networks. Predictive inference.
SLIDE 72
Thanks!