SLIDE 1 Generalization of linearized neural networks: staircase decay and double descent
Song Mei
UC Berkeley
July 23, 2020
Department of Mathematics, HKUST
SLIDE 3 Deep Learning Revolution
Application areas: gaming, healthcare, autonomous vehicles, finance, machine translation, robotics, communication.
“ACM named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM A.M. Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.”
SLIDE 5
But theoretically? WHEN and WHY does deep learning work?
SLIDE 7
Call for theoretical understanding
From “alchemy” to science: mathematical theories, physical laws, reproducible experiments.
SLIDE 10
What don’t we understand?
Empirical surprises [Zhang et al., 2015]:
◮ Over-parameterization: # parameters ≫ # training samples.
◮ Non-convexity.
◮ Yet SGD efficiently fits all the training samples.
◮ And the trained network generalizes well on test samples.
Mathematical challenges
◮ Non-convexity: why is optimization efficient?
◮ Over-parameterization: why is generalization effective?
SLIDE 11
A gentle introduction to
Linearization theory of neural networks
SLIDE 12
Linearized neural networks (neural tangent model)
◮ Multi-layer neural network f(x; θ), x ∈ R^d, θ ∈ R^N:
  f(x; θ) = W_L σ(· · · W_2 σ(W_1 x)).
◮ Linearization around a (random) initialization θ_0:
  f(x; θ) = f(x; θ_0) + ⟨θ − θ_0, ∇_θ f(x; θ_0)⟩ + o(‖θ − θ_0‖_2).
◮ Neural tangent model: the linear part of f,
  f_NT(x; β; θ_0) = ⟨β, ∇_θ f(x; θ_0)⟩.
[Jacot, Gabriel, Hongler, 2018] [Chizat, Bach, 2018b]
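The linearization can be checked numerically. Below is a minimal numpy sketch (purely illustrative; the two-layer ReLU architecture, the sizes, and the perturbation scale are assumptions made for the sketch): it compares f(x; θ_0 + δ) with the first-order expansion f(x; θ_0) + ⟨δ, ∇_θ f(x; θ_0)⟩ for a small perturbation δ of a random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N = 5, 64                       # input dimension, number of hidden units

def f(x, W, a):
    """Two-layer network f(x; theta) = a^T relu(W x), theta = (W, a)."""
    return a @ np.maximum(W @ x, 0.0)

def grad_theta(x, W, a):
    """Gradient of f with respect to (W, a) at a fixed input x."""
    h = W @ x
    dW = ((h > 0).astype(float) * a)[:, None] * x[None, :]   # df/dW_ij
    da = np.maximum(h, 0.0)                                   # df/da_i
    return dW, da

# Random initialization theta_0 and a small perturbation delta
W0 = rng.normal(size=(N, d)) / np.sqrt(d)
a0 = rng.normal(size=N) / np.sqrt(N)
dW, da = 1e-3 * rng.normal(size=(N, d)), 1e-3 * rng.normal(size=N)

x = rng.normal(size=d)
gW, ga = grad_theta(x, W0, a0)

exact = f(x, W0 + dW, a0 + da)
linearized = f(x, W0, a0) + np.sum(gW * dW) + ga @ da   # neural-tangent approximation
print("exact:", exact, "linearized:", linearized, "gap:", abs(exact - linearized))
```

The gap is higher order in the perturbation, which is why the neural tangent model describes very wide networks whose parameters barely move during training.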
SLIDE 15
Linear regression over random features
◮ NT model: the linear part of f,  f_NT(x; β; θ_0) = ⟨β, φ(x)⟩ = ⟨β, ∇_θ f(x; θ_0)⟩.
◮ (Random) feature map: φ(·) = ∇_θ f(·; θ_0) : R^d → R^N.
◮ Training dataset: (X, Y) = (x_i, y_i)_{i∈[n]}.
◮ Gradient flow dynamics: (d/dt) β_t = −∇_β Ê[(y − f_NT(x; β_t; θ_0))²],  β_0 = 0.
◮ Linear convergence: β_t → β̂ = φ(X)^† Y.
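The last claim — gradient descent on the squared loss, started at β_0 = 0, converges to the minimum-norm interpolant β̂ = φ(X)^† Y — can be verified in a few lines. A small sketch (the Gaussian stand-in for the feature matrix and the step size are assumptions for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 40, 200                       # samples, features (over-parameterized)

Phi = rng.normal(size=(n, N))        # stand-in for the feature matrix Phi(X)
y = rng.normal(size=n)

# Gradient descent on the squared loss, started at beta = 0
beta = np.zeros(N)
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2      # step size below 1 / ||Phi||_op^2
for _ in range(5000):
    beta -= lr * Phi.T @ (Phi @ beta - y)

beta_pinv = np.linalg.pinv(Phi) @ y          # minimum-norm interpolant Phi^+ y
print("distance to pinv solution:", np.linalg.norm(beta - beta_pinv))
print("training residual        :", np.linalg.norm(Phi @ beta - y))
```

Because the iterates stay in the row space of the feature matrix, the limit of gradient descent from zero is the pseudoinverse (minimum ℓ_2-norm) solution.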
SLIDE 20 Neural network ≈ neural tangent
Theorem [Jacot, Gabriel, Hongler, 2018] (informal)
Consider neural networks f_N(x; θ) with N neurons, and compare the two dynamics
  (d/dt) θ_t = −∇_θ Ê[(y − f_N(x; θ_t))²],  θ_{t=0} = θ_0,
  (d/dt) β_t = −∇_β Ê[(y − f^N_NT(x; β_t; θ_0))²],  β_{t=0} = 0.
Under proper (random) initialization, we have almost surely
  lim_{N→∞} |f_N(x; θ_t) − f^N_NT(x; β_t)| = 0.
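A hedged numerical illustration of this statement (toy sizes and an NTK-style 1/√N scaling are assumptions; the theorem itself concerns the N → ∞ limit): train a moderately wide two-layer ReLU network and its neural tangent linearization with the same full-batch gradient descent, and compare their outputs.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N, n = 5, 1000, 50
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = np.sin(X @ rng.normal(size=d))
relu = lambda z: np.maximum(z, 0.0)

# NTK-style parameterization: f_N(x; W, a) = (1/sqrt(N)) a^T relu(W x)
W0, a0 = rng.normal(size=(N, d)), rng.normal(size=N)
def net(X_, W_, a_):
    return relu(X_ @ W_.T) @ a_ / np.sqrt(N)

# Neural tangent features at theta_0 (gradients wrt a and W, flattened per sample)
H0 = X @ W0.T
feat_a = relu(H0) / np.sqrt(N)
feat_W = ((H0 > 0) * a0 / np.sqrt(N))[:, :, None] * X[:, None, :]
Phi = np.concatenate([feat_a, feat_W.reshape(n, -1)], axis=1)
f0 = net(X, W0, a0)

W, a = W0.copy(), a0.copy()
beta = np.zeros(Phi.shape[1])
lr = 0.1
for _ in range(1000):
    # neural network: gradient step on (1/2n) sum_i (f(x_i) - y_i)^2
    H = X @ W.T
    r = relu(H) @ a / np.sqrt(N) - y
    grad_a = relu(H).T @ r / (np.sqrt(N) * n)
    grad_W = ((H > 0) * a / np.sqrt(N) * r[:, None]).T @ X / n
    a, W = a - lr * grad_a, W - lr * grad_W
    # neural tangent model: same loss, same step size, linear in beta
    beta -= lr * Phi.T @ (f0 + Phi @ beta - y) / n

print("max |f_N - f_NT| on the training inputs:",
      np.max(np.abs(net(X, W, a) - (f0 + Phi @ beta))))
```

As the width N grows, the two trajectories stay closer and closer together.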
SLIDE 23
Optimization success
Gradient flow on the training loss of a neural network converges to a global minimum ... with over-parameterization and proper initialization.
[Jacot, Gabriel, Hongler, 2018], [Du, Zhai, Poczos, Singh, 2018], [Du, Lee, Li, Wang, Zhai, 2018], [Allen-Zhu, Li, Song, 2018], [Zou, Cao, Zhou, Gu, 2018], [Oymak, Soltanolkotabi, 2018], [Chizat, Bach, 2018b], ...
Does linearization fully explain the success of neural networks? Our answer is: No.
SLIDE 24
Generalization
Empirically, NT models do not generalize as well as neural networks.
Table: CIFAR-10 experiments
Architecture | Classification error
CNN | ~4%
CNTK (1) | 23%
CNTK (2) | 11%
Compositional kernel (3) | 10%
(1) [Arora, Du, Hu, Li, Salakhutdinov, Wang, 2019], (2) [Li, Wang, Yu, Du, Hu, Salakhutdinov, Arora, 2019], (3) [Shankar, Fang, Guo, Fridovich-Keil, Schmidt, Ragan-Kelley, Recht, 2020].
SLIDE 25
Performance gap: NN versus NT
SLIDE 26 Two-layer neural network
  f_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩),  Θ = (a_1, w_1, . . . , a_N, w_N).
◮ Input vector x ∈ R^d.
◮ Bottom-layer weights w_i ∈ R^d, i = 1, 2, . . . , N.
◮ Top-layer weights a_i ∈ R, i = 1, 2, . . . , N.
SLIDE 27 Linearization around initialization
Linearization:
  f_N(x; Θ) = f_N(x; Θ_0) + Σ_{i=1}^N Δa_i σ(⟨w_i^0, x⟩)   [top-layer linearization]
            + Σ_{i=1}^N a_i^0 σ′(⟨w_i^0, x⟩) ⟨Δw_i, x⟩   [bottom-layer linearization]
            + o(Δ).
Linearized neural networks (w_i ~ Unif(S^{d−1})):
  F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] },
  F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨b_i, x⟩ : b_i ∈ R^d, i ∈ [N] }.
Blue: random and fixed. Red: parameters to be optimized.
[Rahimi, Recht, 2008] [Jacot, Gabriel, Hongler, 2018]
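Concretely, both classes are linear models over fixed random features. A minimal sketch of the two feature maps (the ReLU activation and the sizes are assumptions for the sketch); RF has N trainable parameters, NT has N·d:

```python
import numpy as np

rng = np.random.default_rng(3)
d, N, n = 20, 100, 500

# Random first-layer weights, fixed: w_i ~ Unif(S^{d-1})
W = rng.normal(size=(N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)

X = rng.normal(size=(n, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # x ~ Unif(S^{d-1}(sqrt(d)))

relu = lambda z: np.maximum(z, 0.0)
step = lambda z: (z > 0).astype(float)            # sigma'(z) for ReLU

H = X @ W.T                                       # (n, N) inner products <w_i, x>

# RF class: f(x) = sum_i a_i sigma(<w_i, x>)  -> linear in a, N parameters
Phi_RF = relu(H)

# NT class: f(x) = sum_i sigma'(<w_i, x>) <b_i, x>  -> linear in b, N*d parameters
Phi_NT = (step(H)[:, :, None] * X[:, None, :]).reshape(n, N * d)

print("RF features:", Phi_RF.shape, "  NT features:", Phi_NT.shape)
```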
SLIDE 28 Approximation error
Data distribution: x ~ Unif(S^{d−1}(√d)),  f_* ∈ L²(S^{d−1}(√d)).
Minimum risk (approximation error):
  R_{M,N}(f_*) = inf_{f ∈ F_{M,N}(W)} E_x[ (f_*(x) − f(x))² ],  M ∈ {RF, NT}.
SLIDE 29
Staircase decay
SLIDE 30 Random features regression
F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] },  W = (w_i)_{i∈[N]} ~_{i.i.d.} Unif(S^{d−1}).
Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)
Assume d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} and that σ satisfies a “generic condition”. Then
  inf_{f ∈ F_{RF,N}(W)} E_x[(f_*(x) − f(x))²] = ‖P_{>ℓ} f_*‖²_{L²} + o_{d,P}(‖f_*‖²_{L²}).
P_{>ℓ}: projection orthogonal to the space of degree-ℓ polynomials. With d^ℓ parameters, RF only fits a degree-ℓ polynomial.
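A small numerical illustration of this statement (far from the theorem's asymptotic regime; the ReLU activation, the sizes, and the target f_*(x) = x_1 + (x_1² − 1) are all assumptions made for the sketch): with N of order d, the best RF fit leaves an error close to the energy of the degree-≥2 part of f_*.

```python
import numpy as np

rng = np.random.default_rng(4)
d, N, m = 50, 150, 20000           # N is roughly d^{1.3}, so degree <= 1 should be learnable

W = rng.normal(size=(N, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)                # w_i ~ Unif(S^{d-1})
X = rng.normal(size=(m, d))
X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)   # x ~ Unif(S^{d-1}(sqrt(d)))

# Target with a degree-1 part and a degree-2 part: f_*(x) = x_1 + (x_1^2 - 1)
lin, quad = X[:, 0], X[:, 0] ** 2 - 1
f_star = lin + quad

# Best RF fit of f_* on a large sample (a proxy for the population infimum, since m >> N)
Phi = np.maximum(X @ W.T, 0.0)
a, *_ = np.linalg.lstsq(Phi, f_star, rcond=None)
print("RF approximation error   :", np.mean((f_star - Phi @ a) ** 2))
print("||P_{>1} f_*||^2 (approx):", np.mean(quad ** 2))
```

At these finite sizes the two numbers only roughly agree; the theorem is a statement about the d → ∞ limit.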
SLIDE 32 Similar result for NT
F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨b_i, x⟩ : b_i ∈ R^d, i ∈ [N] },  W = (w_i)_{i∈[N]} ~_{i.i.d.} Unif(S^{d−1}).
Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)
Assume d^{ℓ+δ} ≤ N ≤ d^{ℓ+1−δ} and that σ satisfies a “generic condition”. Then
  inf_{f ∈ F_{NT,N}(W)} E_x[(f_*(x) − f(x))²] = ‖P_{>ℓ+1} f_*‖²_{L²} + o_{d,P}(‖f_*‖²_{L²}).
P_{>ℓ+1}: projection orthogonal to the space of degree-(ℓ+1) polynomials. With d^{ℓ+1} parameters, NT only fits a degree-(ℓ+1) polynomial.
SLIDE 34 The staircase decay (a cartoon)
  f = P_0 f + P_1 f + P_2 f + P_3 f + · · ·
Figure: cartoon of the approximation error decreasing in steps (a staircase) as the number of parameters grows through powers of d.
SLIDE 35
Approximation gap
Function f : S^{d−1} → R, f(x) = Q_k(x_1), with Q_k a degree-k polynomial.
◮ NT: N = Θ_d(d^{k−1}) neurons (i.e., Nd = Θ_d(d^k) parameters) are needed.
◮ NN: N = Θ_d(1) neurons suffice.
◮ A separation in approximation power.
◮ Neural networks can potentially learn features adaptively.
SLIDE 36 Related work
Approximation error of two-layer NN and RF: [Barron, 1993], [Mhaskar, 1996], [Maiorov, 1999], [Caponnetto, de Vito, 2007], [Rahimi, Recht, 2009], [Bach, 2017], [E, Ma, Wu, 2018], ...
Approx. bound | f_* of bounded norm | f_* ∈ L²(R^d) ∩ (d_*-sparse)
RF | ‖f_*‖²_H / N | Θ_N(1/N^{1/d})
NN | ‖f_*‖²_B / N | Θ_N(1/N^{1/d_*})
SLIDE 37 Related work
Approximation error of two-layer NN and RF (same references and table as on the previous slide).
Difference between the new results and the classical results:
◮ N = d^k as d → ∞, vs. fixed d as N → ∞.
◮ Constant asymptotic error, vs. vanishing upper bound.
SLIDE 38 Related work
Approximation error of two-layer NN and RF (same table as above).
N = d^k as d → ∞, vs. fixed d as N → ∞: which asymptotics makes more sense?
Example: d = 100, N = 10,000,000. Then N = d^{3.5}, while 1/N^{1/d} = 0.85.
SLIDE 39 Related work
Approximation error of two-layer NN and RF (same table as above).
N = d^k as d → ∞, vs. fixed d as N → ∞: which asymptotics makes more sense?
Example: d_* = 10, N = 10,000,000. Then N = d_*^7, while 1/N^{1/d_*} = 0.20.
SLIDE 40
Double descent
SLIDE 41 The motivating experiment
◮ MNIST: (x_i, y_i) ∈ R^{784} × [10], i ∈ [50,000].
◮ Two-layer neural networks f_N:  f_N(x; θ) = Σ_{j=1}^N a_j σ(⟨w_j, x⟩).
◮ Square loss without regularization.
◮ Find a local minimizer; report training and test error.
◮ Perform a sequence of experiments for different N.
◮ Plot training and test error vs. N. (A runnable sketch of this protocol follows below.)
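A sketch of this experimental protocol on a small synthetic stand-in for MNIST (the data model, sizes, learning rate, and number of steps are all assumptions; reproducing the actual curves requires the real data and much longer training):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_train, n_test = 20, 200, 2000

# Synthetic stand-in for the dataset
w_true = rng.normal(size=d) / np.sqrt(d)
X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
y_tr, y_te = np.tanh(X_tr @ w_true), np.tanh(X_te @ w_true)
relu = lambda z: np.maximum(z, 0.0)

def train_two_layer(N, steps=2000, lr=0.1):
    """Fit f_N(x) = (1/sqrt(N)) sum_j a_j relu(<w_j, x>) by full-batch gradient descent."""
    W = rng.normal(size=(N, d)) / np.sqrt(d)
    a = rng.normal(size=N)
    f = lambda X_: relu(X_ @ W.T) @ a / np.sqrt(N)
    for _ in range(steps):
        H = X_tr @ W.T
        r = relu(H) @ a / np.sqrt(N) - y_tr
        ga = relu(H).T @ r / (np.sqrt(N) * n_train)
        gW = ((H > 0) * a * r[:, None]).T @ X_tr / (np.sqrt(N) * n_train)
        a, W = a - lr * ga, W - lr * gW
    return np.mean((f(X_tr) - y_tr) ** 2), np.mean((f(X_te) - y_te) ** 2)

# Sweep the number of neurons N and record train / test error
for N in [2, 5, 10, 20, 50, 100, 200, 400]:
    tr, te = train_two_layer(N)
    print(f"N = {N:4d}   train = {tr:.4f}   test = {te:.4f}")
```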
SLIDE 44 Increasing # parameters
Figure: training and test error on MNIST vs. # parameters / # samples. Left: [Belkin, Hsu, Ma, Mandal, 2018]. Right: [Spigler, Geiger, Ascoli, Sagun, Biroli, Wyart, 2018].
Similar phenomena appeared earlier in the literature: [LeCun, Kanter, Solla, 1991], [Krogh, Hertz, 1992], [Opper, Kinzel, 1995], [Neyshabur, Tomioka, Srebro, 2014], [Advani, Saxe, 2017].
SLIDE 45
U-shaped curve
[Belkin, Hsu, Ma, Mandal, 2018]
SLIDE 46
Double descent
Figure: A cartoon by [Belkin, Hsu, Ma, Mandal, 2018].
Peak at the interpolation threshold. Monotone decreasing in the overparameterized regime. Global minimum when the number of parameters is infinity.
SLIDE 47
Complementary instead of contradictory
U-shaped curve
Test error vs. a model-complexity measure that tightly controls generalization. Examples: the ℓ_2 norm in a linear model, “k” in k-nearest neighbors.
Double descent
Test error vs. the number of parameters. Example: # parameters in a NN. In a NN, # parameters ≠ a model-complexity measure that tightly controls generalization.
[Bartlett, 1997], [Bartlett and Mendelson, 2002]
SLIDE 49 Linear model with random covariates
Figure: test error vs. # parameters / # samples, by [Hastie, Montanari, Rosset, Tibshirani, 2019]. See also [Belkin, Hsu, Xu, 2019].
◮ Under-parameterized: β̂ = arg min_β Σ_i (y_i − ⟨x_i, β⟩)².
◮ Over-parameterized: β̂ = arg min_β ‖β‖_2  s.t.  y_i = ⟨x_i, β⟩, i ∈ [n].
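A minimal sketch reproducing this double-descent curve for the linear model (the Gaussian covariates, isotropic signal, and noise level are assumptions for the sketch); np.linalg.pinv gives the least-squares solution when d ≤ n and the minimum-norm interpolant when d > n, so one formula covers both regimes:

```python
import numpy as np

rng = np.random.default_rng(6)
n, n_test, sigma = 100, 2000, 0.5

def risk(d, trials=20):
    errs = []
    for _ in range(trials):
        beta = rng.normal(size=d) / np.sqrt(d)          # signal with ||beta||_2 = O(1)
        X = rng.normal(size=(n, d))
        y = X @ beta + sigma * rng.normal(size=n)
        beta_hat = np.linalg.pinv(X) @ y                # OLS if d <= n, min-norm interpolant if d > n
        X_te = rng.normal(size=(n_test, d))
        errs.append(np.mean((X_te @ (beta_hat - beta)) ** 2))
    return np.mean(errs)

for d in [20, 50, 80, 95, 100, 105, 120, 200, 400, 1000]:
    print(f"d/n = {d/n:5.2f}   test risk = {risk(d):.3f}")
```

The test risk spikes near d/n = 1 and decreases again in the over-parameterized regime.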
SLIDE 50 Why singularity?
◮ Model: x_i ~ N(0, I_d), y_i = ⟨0, x_i⟩ + ε_i with ε_i ~ N(0, 1), i ∈ [n] (pure-noise labels).
◮ Test risk ∝ E[‖β̂ − 0‖²_2] ∝ E[‖X^† y‖²_2] ∝ E[tr((X^T X)^†)], where X ∈ R^{n×d}.
◮ When n is far from d, X is well conditioned.
◮ When n ≈ d, X is nearly singular: its condition number diverges.
◮ The model has marginally enough parameters to interpolate all the data, hence it interpolates in an awkward way.
◮ To fit the noise, the coefficient norm ‖β̂‖²_2 = ‖X^† y‖²_2 blows up.
[Bartlett, Long, Lugosi, Tsigler, 2019], [Muthukumar, Vodrahalli, Sahai, 2019]
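The blow-up of E[tr((X^T X)^†)] near n ≈ d can be seen directly; a tiny sketch (the Gaussian design and the specific sizes are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 100
for n in [50, 80, 95, 100, 105, 120, 200]:
    vals = []
    for _ in range(20):
        X = rng.normal(size=(n, d))
        vals.append(np.trace(np.linalg.pinv(X.T @ X)))   # sum of 1 / (nonzero eigenvalues)
    print(f"n = {n:3d}, d = {d}:  tr((X^T X)^+) ~ {np.mean(vals):.2f}")
```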
SLIDE 56 Comparison
Neural networks [Spigler et al., 2018] vs. linear model [Hastie et al., 2019].
Figure: test error vs. # parameters / # samples for the two models.
◮ Peak at the interpolation threshold.
◮ Monotone decreasing in the over-parameterized regime.
◮ Global minimum when the number of parameters is infinite.
SLIDE 57 Goal: find a tractable model that exhibits all the features of the double descent curve.
Figure: By [Belkin, Hsu, Ma, Mandal, 2018].
SLIDE 59 A simple model
The random features model:
  f_RF(x; a) = Σ_{j=1}^N a_j σ(⟨w_j, x⟩).
Random weights (w_j)_{j∈[N]}:  w_j ~_{i.i.d.} Unif(S^{d−1}).
Data (x_i, y_i)_{i∈[n]}:  x_i ~ Unif(S^{d−1}(√d)),  y_i = f_*(x_i) + ε_i.
SLIDE 60 A simple model
Random features regression: â_λ = arg min_a L_λ(a), with
  L_λ(a) = (1/n) Σ_{i=1}^n ( y_i − Σ_{j=1}^N a_j σ(⟨x_i, w_j⟩) )² + (λN/d) ‖a‖²_2.   (Train)
  R(a; f_*) = E_{x,y}[ ( f_*(x) − Σ_{j=1}^N a_j σ(⟨x, w_j⟩) )² ].   (Test)
Assumptions
◮ n data points, N features, dimension d, with N/d → ψ_1 and n/d → ψ_2 as d → ∞.
◮ Technical assumptions on f_* and σ (they apply to almost every f_* and σ). (A small simulation of this setup is sketched below.)
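A simulation sketch of this estimator for one pair (ψ_1, ψ_2) (the linear target f_*, the ReLU activation, the noise level, and λ are assumptions for the sketch; â_λ is computed from the ridge normal equations rather than by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(8)
d, psi1, psi2, lam, tau = 100, 2.0, 3.0, 1e-3, 0.5
N, n, m = int(psi1 * d), int(psi2 * d), 5000

def sphere(k, dim):
    Z = rng.normal(size=(k, dim))
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)

W = sphere(N, d)                                   # w_j ~ Unif(S^{d-1})
X = np.sqrt(d) * sphere(n, d)                      # x_i ~ Unif(S^{d-1}(sqrt(d)))
X_te = np.sqrt(d) * sphere(m, d)

beta = rng.normal(size=d) / np.sqrt(d)
f_star = lambda Z: Z @ beta                        # a simple linear target (an assumption)
y = f_star(X) + tau * rng.normal(size=n)

relu = lambda z: np.maximum(z, 0.0)
Phi, Phi_te = relu(X @ W.T), relu(X_te @ W.T)

# Ridge regression: argmin_a (1/n)||y - Phi a||^2 + (lam * N / d) ||a||^2
A = Phi.T @ Phi / n + (lam * N / d) * np.eye(N)
a_hat = np.linalg.solve(A, Phi.T @ y / n)

test_err = np.mean((f_star(X_te) - Phi_te @ a_hat) ** 2)
print("test error R(a_hat; f_*) ~", test_err)
```

Sweeping ψ_1 = N/d at fixed ψ_2 and λ traces out the curves that the asymptotic theory predicts.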
SLIDE 61 Precise asymptotics
Theorem (Mei and Montanari, 2019)
Under the above assumptions, the test error of the RF model is given by
  R(â_λ; f_*) = ‖β‖²_2 · B(ζ, ψ_1, ψ_2, λ/μ_*²) + τ² · V(ζ, ψ_1, ψ_2, λ/μ_*²) + o_{d,P}(1),
where the functions B and V are given explicitly below.
SLIDE 62 Explicit formulae
Let the functions ν_1, ν_2 : C_+ → C_+ be the unique solution of
  ν_1 = ψ_1 ( −ξ − ν_2 − ζ²ν_2 / (1 − ζ²ν_1ν_2) )^{−1},
  ν_2 = ψ_2 ( −ξ − ν_1 − ζ²ν_1 / (1 − ζ²ν_1ν_2) )^{−1}.
Let χ ≡ ν_1(i(ψ_1ψ_2λ)^{1/2}) · ν_2(i(ψ_1ψ_2λ)^{1/2}), and
  E_0(ζ, ψ_1, ψ_2, λ) ≡ −χ⁵ζ⁶ + 3χ⁴ζ⁴ + (ψ_1ψ_2 − ψ_2 − ψ_1 + 1)χ³ζ⁶ − 2χ³ζ⁴ − 3χ³ζ² + (ψ_1 + ψ_2 − 3ψ_1ψ_2 + 1)χ²ζ⁴ + 2χ²ζ² + χ² + 3ψ_1ψ_2χζ² − ψ_1ψ_2,
  E_1(ζ, ψ_1, ψ_2, λ) ≡ ψ_2χ³ζ⁴ − ψ_2χ²ζ² + ψ_1ψ_2χζ² − ψ_1ψ_2,
  E_2(ζ, ψ_1, ψ_2, λ) ≡ χ⁵ζ⁶ − 3χ⁴ζ⁴ + (ψ_1 − 1)χ³ζ⁶ + 2χ³ζ⁴ + 3χ³ζ² + (ψ_1 − 1)χ²ζ⁴ − 2χ²ζ² − χ².
We then have
  B(ζ, ψ_1, ψ_2, λ) ≡ E_1(ζ, ψ_1, ψ_2, λ) / E_0(ζ, ψ_1, ψ_2, λ),  V(ζ, ψ_1, ψ_2, λ) ≡ E_2(ζ, ψ_1, ψ_2, λ) / E_0(ζ, ψ_1, ψ_2, λ).
SLIDE 63 Proof strategy
Random matrix theory for the random kernel inner-product matrix
  Z = ( σ(⟨w_i, x_j⟩) )_{i∈[N], j∈[n]}.
[El Karoui, 2010], [Cheng, Singer, 2013], [Do, Vu, 2013], [Fan, Montanari, 2019], [Hastie, Montanari, Rosset, Tibshirani, 2019].
SLIDE 64 Analytical prediction
Figure: analytic prediction of the test error as a function of N/n, for λ = 0⁺ (left) and a small ridge λ = 3 × 10⁻⁴ (right).
◮ Peak at the interpolation threshold.
◮ Monotone decreasing in the over-parameterized regime.
◮ Global minimum when the number of parameters is infinite.
SLIDE 65 Insights
Figure: test error as a function of N/n for several values of the regularization λ.
◮ For any λ, the minimum prediction error is achieved as N/n → ∞.
◮ For the optimal λ, the prediction error is monotonically decreasing.
SLIDE 66 Insights
Figure: test error at SNR = 5 (left) and SNR = 1/10 (right), for several values of the regularization λ.
◮ High SNR: minimum at λ = 0⁺.
◮ Low SNR: minimum at λ > 0.
SLIDE 68 Summary of linearization of neural networks
Figure: double-descent test error curve (left) and staircase approximation-error cartoon (right).
◮ # parameters ≠ the model complexity that controls generalization.
◮ Double descent also exists in linearized neural networks.
◮ There remains a gap between NN and NT: NT models cannot fully explain the generalization efficacy of NN.
SLIDE 69
Going beyond linearization?
SLIDE 70 Mean field theory
◮ SGD for two-layer neural networks:
  θ_i^{k+1} = θ_i^k − ε ∇_{θ_i} ℓ( y_k, (1/N) Σ_{i=1}^N σ_*(x_k; θ_i^k) ).
◮ Consider the empirical distribution of the weights:
  ρ̂_{N,kε} = (1/N) Σ_{i=1}^N δ_{θ_i^k}.
◮ Then ρ̂_{N,t} → ρ_t as N → ∞ and ε → 0, and ρ_t satisfies
  ∂_t ρ_t = ∇ · ( ∇Ψ(θ; ρ_t) ρ_t ).
◮ Difference from the linearization theory: a different scaling limit.
[Mei, Montanari, Nguyen, 2018], [Rotskoff, Vanden-Eijnden, 2018]
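A hedged simulation sketch of this viewpoint (one-pass SGD on a toy teacher model; the teacher, the step size, and absorbing the 1/N factor into ε are assumptions for the sketch): run SGD on a two-layer network in the mean-field 1/N scaling and summarize the empirical distribution ρ̂_N of the neuron parameters by their alignment with the teacher direction.

```python
import numpy as np

rng = np.random.default_rng(9)
d, N, steps, eps = 10, 1000, 30000, 0.05

w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
relu = lambda z: np.maximum(z, 0.0)

# Mean-field scaling: f(x; theta) = (1/N) sum_i a_i relu(<w_i, x>), theta_i = (a_i, w_i)
a = rng.normal(size=N)
W = rng.normal(size=(N, d))

for _ in range(steps):
    x = rng.normal(size=d)
    y = relu(x @ w_true)                         # noiseless teacher (an assumption)
    h = W @ x
    g = np.mean(a * relu(h)) - y                 # prediction error on this sample
    # One SGD step on the squared loss; the 1/N of the scaling is absorbed into eps
    ga = g * relu(h)
    gW = g * (a * (h > 0))[:, None] * x[None, :]
    a, W = a - eps * ga, W - eps * gW

# Summarize the empirical distribution rho_hat_N of the neurons after training
align = (W @ w_true) / np.linalg.norm(W, axis=1)
errs = []
for _ in range(200):
    x = rng.normal(size=d)
    errs.append((np.mean(a * relu(W @ x)) - relu(x @ w_true)) ** 2)
print("prediction error on fresh samples  :", np.mean(errs))
print("mean |cos(w_i, w_true)| over neurons:", np.mean(np.abs(align)))
```

Unlike the neural tangent picture, here the individual weights move by an O(1) amount, which is what the evolving distribution ρ_t captures.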
SLIDE 71
Future directions
◮ The distribution of the features x matters.
  Images → convolutional neural networks. Graphs → graph neural networks. Exploring data and network invariances.
◮ Neural networks as function/distribution approximation?
Generative modeling. Reinforcement learning.
◮ Uncertainty quantification in neural network systems.
Robustness and adversarial examples. Approximate inference for Bayesian neural networks. Predictive inference.
SLIDE 72
Thanks!