

SLIDE 1

Linearized two-layers neural network in high dimensions

Song Mei

Stanford University

May 26, 2019

Joint work with Andrea Montanari, Theodor Misiakiewicz, Behrooz Ghorbani



SLIDE 3

Training a deep neural network: minimize the risk

min_Θ R(Θ), where R(Θ) = E[ℓ(y, W_1 σ(W_2 σ(· · · σ(W_k x))))].

Empirical surprises of neural networks [Zhang et al., 2016]:

◮ Over-parameterized regime.
◮ Optimization surprise: efficiently fit all the data.
◮ Generalization surprise: generalize well.


SLIDE 4

Two-layers neural network

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), Θ = (a_1, w_1, . . . , a_N, w_N).

◮ Feature x ∈ R^d.
◮ Bottom layer weights w_i ∈ R^d, i = 1, 2, . . . , N.
◮ Top layer weights a_i ∈ R, i = 1, 2, . . . , N.
◮ Over-parametrization: N large.
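For concreteness, a minimal NumPy sketch of this parameterization; σ = tanh and the initialization scales are illustrative assumptions here (the scalings that actually matter appear on the regime slides below).

```python
import numpy as np

def init_params(N, d, rng):
    """Theta = (a_1, w_1, ..., a_N, w_N); the scales here are illustrative."""
    a = rng.normal(0, 1 / N, size=N)                # top layer, a_i in R
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))  # bottom layer, w_i in R^d
    return a, W

def f_hat(X, a, W, sigma=np.tanh):
    """f_hat_N(x; Theta) = sum_i a_i sigma(<w_i, x>), one value per row of X."""
    return sigma(X @ W.T) @ a

rng = np.random.default_rng(0)
a, W = init_params(N=400, d=50, rng=rng)
X = rng.normal(size=(10, 50))
print(f_hat(X, a, W).shape)  # (10,)
```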



SLIDE 9

Two-layers neural network

[Figure: network diagram — input layer x, hidden layer with weights w_1, . . . , w_4, output layer with weights a_1, . . . , a_4.]


SLIDE 10

Gradient flow with random initialization

Empirical risk (n: # data; N: # neurons):

R_{n,N}(Θ) = Ê_{x,n}[(y − f̂_N(x; Θ))²].

Gradient flow on the empirical risk, with random initialization:

d/dt Θ(t) = −∇R_{n,N}(Θ(t)), (a_i(0), w_i(0)) ~_{i.i.d.} P_{a,w}.
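A discrete-time analogue of this gradient flow is plain gradient descent on R_{n,N}; the sketch below assumes σ = tanh, squared loss, and an illustrative step size and initialization scale.

```python
import numpy as np

def gd_two_layer(X, y, N=200, eta=0.1, steps=2000, seed=0):
    """Gradient descent on R_{n,N}(Theta) = mean_i (y_i - f_hat(x_i))^2,
    a discretization of the gradient flow dTheta/dt = -grad R_{n,N}."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(0, 1 / N, size=N)                # illustrative init scale
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))  # w_i(0) ~ N(0, I_d/d)
    for _ in range(steps):
        H = np.tanh(X @ W.T)        # sigma(<w_i, x>)
        r = H @ a - y               # residuals f_hat - y
        g = 2 * r / n               # dR/df_hat
        grad_a = H.T @ g
        grad_W = (np.outer(g, a) * (1 - H**2)).T @ X
        a -= eta * grad_a
        W -= eta * grad_W
    return a, W
```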



SLIDE 14

Convergence guarantees

Lemma (Global min. Not a surprise.)

For N > n, we have inf_Θ R_{n,N}(Θ) = 0.

There are many global minimizers with empirical risk 0. But there are also local minimizers with non-zero risk.

Theorem (The optimization surprise.)

For N ≳ n^{1+c}, we have

lim_{t→∞} R_{n,N}(Θ(t)) = 0,

i.e., the training loss converges to 0. Under what assumptions?


SLIDE 15

Three variants of the convergence theorem

Gradient flow (n: # data; N: # neurons):

d/dt Θ(t) = −∇ Ê_{x,n}[(y − f̂_N(x; Θ(t)))²], w_i(0) ~_{i.i.d.} N(0, I_d/d).

Theorem: for N large enough, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0.

Random feature (RF) regime

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i(0) ~_{i.i.d.} N(0, 1/N²).

[Andoni et al., 2014], [Daniely, 2017], [Yehudai and Shamir, 2019] ...


SLIDE 16

Three variants of the convergence theorem

Gradient flow (n: # data; N: # neurons):

d/dt Θ(t) = −∇ Ê_{x,n}[(y − f̂_N(x; Θ(t)))²], w_i(0) ~_{i.i.d.} N(0, I_d/d).

Theorem: for N large enough, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0.

Neural tangent (NT) regime

f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i(0) ~_{i.i.d.} N(0, 1).

[Jacot et al., 2018], [Du et al., 2018a], [Du et al., 2018b], [Allen-Zhu et al., 2018], [Zou et al., 2018] ...


SLIDE 17

Three variants of the convergence theorem

Gradient flow (n: # data; N: # neurons):

d/dt Θ(t) = −∇ Ê_{x,n}[(y − f̂_N(x; Θ(t)))²], w_i(0) ~_{i.i.d.} N(0, I_d/d).

Theorem: for N large enough, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0.

Mean field (MF) regime

f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i(0) ~_{i.i.d.} N(0, 1).

[Mei et al., 2018], [Rotskoff and Vanden-Eijnden, 2018], [Chizat and Bach, 2018] ...


SLIDE 18

Three variants of the convergence theorem

Random feature (RF) regime

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1/N²).

Neural tangent (NT) regime

f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).

Mean field (MF) regime

f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).
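The three regimes differ only in the output scaling and in the initialization of a. A compact sketch of the three parameterizations (σ = tanh is an illustrative choice):

```python
import numpy as np

def init(regime, N, d, rng):
    """Initialization of a and output scaling in the RF / NT / MF regimes."""
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))   # w_i(0) ~ N(0, I_d/d)
    if regime == "RF":                               # a_i ~ N(0, 1/N^2)
        return rng.normal(0, 1 / N, size=N), W, 1.0
    if regime == "NT":                               # a_i ~ N(0, 1), scale 1/sqrt(N)
        return rng.normal(0, 1, size=N), W, 1 / np.sqrt(N)
    if regime == "MF":                               # a_i ~ N(0, 1), scale 1/N
        return rng.normal(0, 1, size=N), W, 1 / N
    raise ValueError(regime)

def f_hat(X, a, W, scale, sigma=np.tanh):
    """f_hat_N(x; Theta) = scale * sum_i a_i sigma(<w_i, x>)."""
    return scale * (sigma(X @ W.T) @ a)
```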


SLIDE 19

... but different behavior of dynamics

Random feature (RF) regime

f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1/N²).

◮ The limiting dynamics is linear (effectively only a is updated).
◮ Prediction function: kernel ridge regression with the kernel

k_RF(x, z) = Ê_{w,N}[σ(⟨w, x⟩) σ(⟨w, z⟩)].

[Andoni et al., 2014], [Daniely, 2017], [Yehudai and Shamir, 2019] ...
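A minimal sketch of RF prediction, assuming σ = tanh and interpreting Ê_{w,N} as the average over the N sampled weights:

```python
import numpy as np

def rf_kernel(X, Z, W, sigma=np.tanh):
    """Empirical RF kernel k_RF(x,z) = (1/N) sum_i sigma(<w_i,x>) sigma(<w_i,z>)."""
    return sigma(X @ W.T) @ sigma(Z @ W.T).T / W.shape[0]

def rf_krr_predict(Xtest, Xtrain, y, W, lam=1e-3):
    """Kernel ridge regression with the empirical RF kernel."""
    K = rf_kernel(Xtrain, Xtrain, W)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return rf_kernel(Xtest, Xtrain, W) @ alpha
```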


SLIDE 20

... but different behavior of dynamics

Neural tangent (NT) regime

f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).

◮ The limiting dynamics is linear (the change of Θ is small).
◮ Prediction function: kernel ridge regression with the kernel

k_NT(x, z) = Ê_{w,N}[σ′(⟨w, x⟩) σ′(⟨w, z⟩)] ⟨x, z⟩ + k_RF(x, z).

[Jacot et al., 2018], [Du et al., 2018a], [Du et al., 2018b], [Allen-Zhu et al., 2018], [Zou et al., 2018] ...
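The same sketch for the NT kernel, again with σ = tanh (so σ′(t) = 1 − tanh²(t)) and averaging over the N sampled weights:

```python
import numpy as np

def nt_kernel(X, Z, W):
    """Empirical NT kernel:
    k_NT(x,z) = (1/N) sum_i sigma'(<w_i,x>) sigma'(<w_i,z>) <x,z> + k_RF(x,z),
    with sigma = tanh, so sigma'(t) = 1 - tanh(t)^2."""
    N = W.shape[0]
    Sx, Sz = np.tanh(X @ W.T), np.tanh(Z @ W.T)
    dSx, dSz = 1 - Sx**2, 1 - Sz**2
    return (dSx @ dSz.T / N) * (X @ Z.T) + Sx @ Sz.T / N
```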


SLIDE 21

... but different behavior of dynamics

Mean field (MF) regime

f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩), a_i ~ N(0, 1).

◮ The limiting dynamics is non-linear (both a and W are updated).
◮ Distributional dynamics:

∂_t ρ_t(a, w) = ∇ · (ρ_t ∇Ψ(a, w; ρ_t)) + β^{−1} Δρ_t.

◮ Prediction function:

f̂(x; ρ_∞) = ∫ a σ(⟨w, x⟩) ρ_∞(da, dw).

[Mei et al., 2018], [Rotskoff and Vanden-Eijnden, 2018], [Chizat and Bach, 2018], [Sirignano and Spiliopoulos, 2018] ...
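In this regime the N neurons behave as interacting particles, and noisy gradient descent on the MF-scaled risk is the standard particle approximation of the distributional dynamics. A rough sketch (the step size, temperature β, and scalings are illustrative assumptions):

```python
import numpy as np

def mf_noisy_gd(X, y, N=500, eta=0.05, beta=1e4, steps=2000, seed=0):
    """Noisy gradient descent on the MF-scaled risk: the empirical measure of
    the particles (a_i, w_i) approximates the distributional dynamics
    d/dt rho_t = div(rho_t grad Psi) + (1/beta) Laplacian(rho_t)."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    a = rng.normal(0, 1, size=N)
    W = rng.normal(0, 1 / np.sqrt(d), size=(N, d))
    for _ in range(steps):
        H = np.tanh(X @ W.T)
        g = 2 * (H @ a / N - y) / n            # dR/df_hat, with MF scaling 1/N
        grad_a = H.T @ g                       # = N * dR/da_i (O(1) per particle)
        grad_W = (np.outer(g, a) * (1 - H**2)).T @ X
        a += -eta * grad_a + np.sqrt(2 * eta / beta) * rng.normal(size=N)
        W += -eta * grad_W + np.sqrt(2 * eta / beta) * rng.normal(size=(N, d))
    return a, W
```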


SLIDE 22

Optimization: 0 training loss.

Test risk = training loss + generalization gap. Today: generalization.



SLIDE 24

Generalization theory for kernel methods

◮ Traditional theory: assume f* ∈ RKHS; then kernel ridge regression generalizes well.
◮ Problem: in high dimensions, the RKHS is a very small space.

Today: in high dimensions, kernel methods (RF and NT) don't generalize well.



SLIDE 27

Setting 1: N finite, n infinite

Distribution: x ~ Unif(S^{d−1}(√d)), y = f*(x), f* ∈ L²(S^{d−1}(√d)).

Two classes of linearized neural networks (w_i ~ Unif(S^{d−1})):

F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] },

F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨a_i, x⟩ : a_i ∈ R^d, i ∈ [N] }.

Mild assumptions on σ (universal approximation, growth not too fast).
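Both classes are linear in their coefficients, so the best approximation within each class reduces to least squares over explicit feature maps. A sketch, assuming σ = tanh:

```python
import numpy as np

def rf_features(X, W, sigma=np.tanh):
    """Feature map for F_RF: f(x) = sum_i a_i sigma(<w_i, x>), a_i in R."""
    return sigma(X @ W.T)                                  # shape (n, N)

def nt_features(X, W):
    """Feature map for F_NT: f(x) = sum_i sigma'(<w_i, x>) <a_i, x>, a_i in R^d."""
    dS = 1 - np.tanh(X @ W.T) ** 2                         # sigma'(<w_i, x>)
    return (dS[:, :, None] * X[:, None, :]).reshape(len(X), -1)  # (n, N*d)

def best_fit(features, y):
    """Least-squares coefficients: each class is linear in its coefficients,
    so the optimal element on a dataset is an ordinary least-squares problem."""
    return np.linalg.lstsq(features, y, rcond=None)[0]
```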


SLIDE 28

Lower bound: N finite, n infinite

F_{RF,N}(W) = { f = Σ_{i=1}^N a_i σ(⟨w_i, x⟩) : a_i ∈ R, i ∈ [N] }.

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume N = O_d(d^{ℓ−δ}) for some δ > 0, and (w_i)_{i∈[N]} ~ Unif(S^{d−1}). Then

inf_{f ∈ F_{RF,N}(W)} E_x[(f*(x) − f(x))²] ≥ ‖P_{≥ℓ} f*‖²_{L²} − o_{d,P}(‖f*‖²_{L²}),

where P_{≥ℓ} is the projection operator orthogonal to the space of polynomials of degree < ℓ.

Example: for f*(x) = x₁² − 1, we have P_{≥2} f* ≈ f*. Then random feature regression with N = O_d(d^{2−δ}) neurons achieves the trivial risk, which is ‖f*‖²_{L²}.
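A small numerical experiment in the spirit of the figure on the next slide: RF ridge regression for f*(x) = x₁² − 1. The activation (ReLU, which has a nonzero degree-2 component), ridge parameter, and sample sizes are illustrative choices:

```python
import numpy as np

def sphere(n, d, rng):
    """n i.i.d. samples from Unif(S^{d-1}(sqrt(d)))."""
    X = rng.normal(size=(n, d))
    return X * (np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True))

def rf_test_risk(d=50, N=400, n=5000, lam=1e-6, seed=0):
    """RF ridge regression for f*(x) = x_1^2 - 1. Returns test risk / ||f*||^2;
    values near 1 mean trivial risk, as predicted when N = O_d(d^{2-delta})."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(N, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)      # w_i ~ Unif(S^{d-1})
    f_star = lambda X: X[:, 0] ** 2 - 1
    Xtr, Xte = sphere(n, d, rng), sphere(2000, d, rng)
    Phi = np.maximum(Xtr @ W.T, 0)                     # ReLU random features
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ f_star(Xtr))
    err = f_star(Xte) - np.maximum(Xte @ W.T, 0) @ a
    return np.mean(err ** 2) / np.mean(f_star(Xte) ** 2)

print(rf_test_risk())  # close to 1: N = 400 << d^2 features cannot fit f*
```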



SLIDE 30

[Figure: test risk R/R₀ versus n/d for learning f*(x) = x₁² − 1; left panel d = 50 with N ∈ {100, 200, 300, 400}, right panel d = 100 with N ∈ {200, 400, 600, 800}.]


SLIDE 31

Similar result for NT

F_{NT,N}(W) = { f = Σ_{i=1}^N σ′(⟨w_i, x⟩) ⟨a_i, x⟩ : a_i ∈ R^d, i ∈ [N] }.

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume N = O_d(d^{ℓ−δ}) for some δ > 0, and (w_i)_{i∈[N]} ~ Unif(S^{d−1}). Then

inf_{f ∈ F_{NT,N}(W)} E_x[(f*(x) − f(x))²] ≥ ‖P_{≥ℓ+1} f*‖²_{L²} − o_{d,P}(‖f*‖²_{L²}),

where P_{≥ℓ+1} is the projection operator orthogonal to the space of polynomials of degree < ℓ + 1.

Example: for f*(x) = x₁³ − 3x₁, we have P_{≥3} f* ≈ f*. Then neural tangent regression with N = O_d(d^{2−δ}) neurons achieves the trivial risk, which is ‖f*‖²_{L²}.



SLIDE 33

Setting 2: N infinite, n finite

Distribution: x_i ~ Unif(S^{d−1}(√d)), y_i = f*(x_i), f* ∈ L²(S^{d−1}(√d)).

Prediction via kernel ridge regression:

f̂_λ(x) = k(x, X) (k(X, X) + λI)^{−1} y, y = (f*(x_1), . . . , f*(x_n)),

where k(x_i, x_j) = E_{w ~ Unif(S^{d−1})}[σ(⟨w, x_i⟩) σ(⟨w, x_j⟩)].
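A minimal sketch of this predictor, with the kernel estimated by Monte Carlo over w (for many activations the expectation also has a closed form in ⟨x, z⟩, which this sketch does not use):

```python
import numpy as np

def mc_kernel(X, Z, n_mc=20_000, seed=0, sigma=np.tanh):
    """Monte Carlo estimate of k(x,z) = E_{w ~ Unif(S^{d-1})}[sigma(<w,x>) sigma(<w,z>)]."""
    d = X.shape[1]
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_mc, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    return sigma(X @ W.T) @ sigma(Z @ W.T).T / n_mc

def krr_predict(Xtest, Xtrain, y, lam=1e-3):
    """Kernel ridge regression: f_hat(x) = k(x, X)(k(X, X) + lam*I)^{-1} y."""
    K = mc_kernel(Xtrain, Xtrain)
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return mc_kernel(Xtest, Xtrain) @ alpha
```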


SLIDE 34

Lower bound: N infinite, n finite

Theorem (Ghorbani, Mei, Misiakiewicz, Montanari, 2019)

Assume n = O_d(d^{ℓ−δ}) for some δ > 0. Then

inf_λ E_x[(f*(x) − f̂_λ(x))²] ≥ ‖P_{≥ℓ} f*‖²_{L²} − o_{d,P}(‖f*‖²_{L²}),

where P_{≥ℓ} is the projection operator orthogonal to the space of polynomials of degree < ℓ.


SLIDE 35

Intuition behind these results

In high dimensions, the correlation between a degree-k Hermite polynomial and a random feature is very small:

E_w[ E_x[He_k(x₁) σ(⟨w, x⟩)]² ] = O_d(1/d^k).

Also observed in [Daniely, 2016], [Bach, 2017].
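A quick Monte Carlo check of this decay; σ = ReLU, the dimensions, and the sample sizes are illustrative choices, and the estimate of the squared correlation is noisy:

```python
import numpy as np

def hermite(k, t):
    """Probabilist's Hermite polynomials He_k, k = 1, 2, 3."""
    return {1: t, 2: t**2 - 1, 3: t**3 - 3 * t}[k]

def sq_correlation(d, k=2, n_x=100_000, n_w=100, seed=0):
    """Monte Carlo estimate of E_w[ E_x[He_k(x_1) sigma(<w, x>)]^2 ] with
    sigma = ReLU, x ~ Unif(S^{d-1}(sqrt(d))), w ~ Unif(S^{d-1})."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_x, d))
    X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)
    W = rng.normal(size=(n_w, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    corr = hermite(k, X[:, 0]) @ np.maximum(X @ W.T, 0) / n_x  # E_x[...] per w
    return np.mean(corr ** 2)

for d in (10, 20, 40):
    print(d, sq_correlation(d))  # shrinks roughly like 1/d^2 for k = 2
```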


SLIDE 36

Implications & Conclusions

◮ In high dimension, even for simple function ❢✭x✮ ❂ ①❦ ✶, it takes

♥❀ ◆ ❂ ❖❞✭❞❦✮ to learn it well using linearized neural network (kernel methods);

◮ ... while a neural network can learn it (conjecture to be efficiently)

using ♥❀ ◆ ❂ ❖❞✭✶✮.

◮ Neural network is more powerful than kernel methods. ◮ Future work: what class of functions neural network can learn

efficiently.

