Linearized two-layers neural network in high dimensions

Song Mei (Stanford University), May 26, 2019.
Joint work with Andrea Montanari, Theodor Misiakiewicz, and Behrooz Ghorbani.


  1. Linearized two-layers neural network in high dimensions. Song Mei, Stanford University. May 26, 2019. Joint work with Andrea Montanari, Theodor Misiakiewicz, and Behrooz Ghorbani.


  3. Supervised learning with a multi-layer network: minimize over Θ the risk
     R(Θ) = E[ℓ(y, W_1 σ(W_2 σ(··· σ(W_k x))))].
     Empirical surprises of neural networks [Zhang et al., 2016]:
     - Over-parameterized regime.
     - Optimization surprise: efficiently fit all the data.
     - Generalization surprise: generalize well.

  4. Two-layers neural network:
     f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   Θ = (a_1, w_1, ..., a_N, w_N).
     - Feature x ∈ R^d.
     - Bottom layer weights w_i ∈ R^d, i = 1, 2, ..., N.
     - Top layer weights a_i ∈ R, i = 1, 2, ..., N.
     - Over-parametrization: N large.
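For concreteness, here is a minimal sketch of this two-layers network in JAX; the ReLU activation, the array shapes, and the initialization scales are illustrative assumptions, not choices made on the slide.

```python
# Sketch (not the talk's code): f_N(x; Theta) = sum_{i=1}^N a_i * sigma(<w_i, x>).
import jax.numpy as jnp
from jax import random

def f_N(params, x):
    """params = (a, W): top-layer weights a of shape (N,), bottom-layer weights W of shape (N, d)."""
    a, W = params
    return jnp.dot(a, jnp.maximum(W @ x, 0.0))  # sigma = ReLU (assumed)

# Random initialization, e.g. w_i ~ N(0, I_d / d); the 1/N top-layer scale is illustrative.
key = random.PRNGKey(0)
N, d = 1000, 20
key_a, key_w, key_x = random.split(key, 3)
a0 = random.normal(key_a, (N,)) / N
W0 = random.normal(key_w, (N, d)) / jnp.sqrt(d)
x = random.normal(key_x, (d,))
print(f_N((a0, W0), x))
```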

  9. Two-layers neural network (diagram): input layer, hidden layer with weights w_1, ..., w_4, output layer with weights a_1, ..., a_4.

  10. Gradient flow with random initialization.
     Empirical risk (n: # data; N: # neurons):
     R_{n,N}(Θ) = Ê_n[(y − f̂_N(x; Θ))²].
     Gradient flow on the empirical risk, with random initialization:
     dΘ(t)/dt = −∇R_{n,N}(Θ(t)),   (a_i(0), w_i(0)) ~ i.i.d. P_{a,w}.
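As a concrete stand-in for this gradient flow, the sketch below runs plain gradient descent on the empirical squared loss with jax.grad; the ReLU activation, step size, synthetic data, and the Gaussian stand-in for the initialization law P_{a,w} are all illustrative assumptions.

```python
# Sketch (not the talk's code): gradient descent on
#   R_{n,N}(Theta) = (1/n) * sum_j (y_j - f_N(x_j; Theta))^2,
# a discrete-time stand-in for the gradient flow on the slide.
import jax
import jax.numpy as jnp
from jax import random

def f_N(params, X):
    a, W = params                               # a: (N,), W: (N, d), X: (n, d)
    return jnp.maximum(X @ W.T, 0.0) @ a        # sigma = ReLU (assumed); output shape (n,)

def empirical_risk(params, X, y):
    return jnp.mean((y - f_N(params, X)) ** 2)

grad_risk = jax.grad(empirical_risk)

def train(params, X, y, lr=1e-2, steps=2000):
    """Theta(t+1) = Theta(t) - lr * grad R_{n,N}(Theta(t))."""
    for _ in range(steps):
        g_a, g_W = grad_risk(params, X, y)
        a, W = params
        params = (a - lr * g_a, W - lr * g_W)
    return params

# Synthetic data and a Gaussian stand-in for the initialization law P_{a,w}.
key = random.PRNGKey(0)
n, d, N = 200, 20, 1000
kx, ky, ka, kw = random.split(key, 4)
X = random.normal(kx, (n, d))
y = random.normal(ky, (n,))
params0 = (random.normal(ka, (N,)) / N, random.normal(kw, (N, d)) / jnp.sqrt(d))
params = train(params0, X, y)
print(empirical_risk(params, X, y))
```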

  11. Convergence guarantees.
     Lemma (global minimum; not a surprise). For N > n, we have inf_Θ R_{n,N}(Θ) = 0: there are many global minimizers with empirical risk 0. But there are also local minimizers with non-zero risk.
     Theorem (the optimization surprise). For N ≫ n^{1+c}, we have lim_{t→∞} R_{n,N}(Θ(t)) = 0, i.e., the training loss converges to 0.
     Under what assumptions?

  15. Three variants of the convergence theorem.
     Gradient flow (n: # data; N: # neurons):
     dΘ(t)/dt = −∇ Ê_n[(y − f̂_N(x; Θ(t)))²],   w_i(0) ~ i.i.d. N(0, I_d/d).
     Theorem: for N large enough, lim_{t→∞} R_{n,N}(Θ(t)) = 0.
     Random features (RF) regime:
     f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i(0) ~ i.i.d. N(0, 1/N²).
     [Andoni et al., 2014], [Daniely, 2017], [Yehudai and Shamir, 2019], ...

  16. Three variants of the convergence theorem.
     Gradient flow (n: # data; N: # neurons):
     dΘ(t)/dt = −∇ Ê_n[(y − f̂_N(x; Θ(t)))²],   w_i(0) ~ i.i.d. N(0, I_d/d).
     Theorem: for N large enough, lim_{t→∞} R_{n,N}(Θ(t)) = 0.
     Neural tangent (NT) regime:
     f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i(0) ~ i.i.d. N(0, 1).
     [Jacot et al., 2018], [Du et al., 2018], [Allen-Zhu et al., 2018], [Zou et al., 2018], ...

  17. Three variants of the convergence theorem.
     Gradient flow (n: # data; N: # neurons):
     dΘ(t)/dt = −∇ Ê_n[(y − f̂_N(x; Θ(t)))²],   w_i(0) ~ i.i.d. N(0, I_d/d).
     Theorem: for N large enough, lim_{t→∞} R_{n,N}(Θ(t)) = 0.
     Mean field (MF) regime:
     f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i(0) ~ i.i.d. N(0, 1).
     [Mei et al., 2018], [Rotskoff and Vanden-Eijnden, 2018], [Chizat and Bach, 2018], ...

  18. Three variants of the convergence theorem.
     Random features (RF) regime: f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i ~ N(0, 1/N²).
     Neural tangent (NT) regime: f̂_N(x; Θ) = (1/√N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i ~ N(0, 1).
     Mean field (MF) regime: f̂_N(x; Θ) = (1/N) Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i ~ N(0, 1).
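The three regimes use the same hidden-layer features and differ only in the output-layer scaling and the law of the a_i; a minimal sketch of the three prediction functions (ReLU activation and array shapes are assumptions):

```python
# Sketch of the three parameterizations on the slide; only the output scaling and the
# distribution of a_i differ. ReLU and the shapes are illustrative assumptions.
import jax.numpy as jnp
from jax import random

def features(W, x):
    return jnp.maximum(W @ x, 0.0)                  # sigma(<w_i, x>), i = 1..N

def f_rf(a, W, x):                                  # RF: a_i ~ N(0, 1/N^2)
    return jnp.dot(a, features(W, x))

def f_nt(a, W, x):                                  # NT: a_i ~ N(0, 1), 1/sqrt(N) scaling
    return jnp.dot(a, features(W, x)) / jnp.sqrt(a.shape[0])

def f_mf(a, W, x):                                  # MF: a_i ~ N(0, 1), 1/N scaling
    return jnp.dot(a, features(W, x)) / a.shape[0]

key = random.PRNGKey(0)
N, d = 1000, 20
kw, kx, ka_rf, ka = random.split(key, 4)
W = random.normal(kw, (N, d)) / jnp.sqrt(d)         # w_i ~ N(0, I_d / d)
x = random.normal(kx, (d,))
a_rf = random.normal(ka_rf, (N,)) / N               # N(0, 1/N^2)
a_std = random.normal(ka, (N,))                     # N(0, 1), used for NT and MF
print(f_rf(a_rf, W, x), f_nt(a_std, W, x), f_mf(a_std, W, x))
```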

  19. ... but different behavior of the dynamics.
     Random features (RF) regime: f̂_N(x; Θ) = Σ_{i=1}^N a_i σ(⟨w_i, x⟩),   a_i ~ N(0, 1/N²).
     - The limiting dynamics is linear (effectively, only a is updated).
     - Prediction function: kernel ridge regression with kernel K_RF(x, z) = Ê_{w,N}[σ(⟨w, x⟩) σ(⟨w, z⟩)].
     [Andoni et al., 2014], [Daniely, 2017], [Yehudai and Shamir, 2019], ...
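The RF prediction function described on this slide can be emulated by kernel ridge regression with the finite-N kernel K_RF; the minimal sketch below assumes a ReLU activation and an illustrative ridge parameter lam, neither of which is specified on the slide.

```python
# Sketch: kernel ridge regression with the random-features kernel
#   K_RF(x, z) = (1/N) * sum_i sigma(<w_i, x>) * sigma(<w_i, z>)   (average over the fixed w_i).
# Ridge parameter lam, ReLU activation, and synthetic data are illustrative assumptions.
import jax.numpy as jnp
from jax import random

def phi(W, X):
    """Feature map sigma(<w_i, x_j>) / sqrt(N); X: (n, d), W: (N, d) -> (n, N)."""
    return jnp.maximum(X @ W.T, 0.0) / jnp.sqrt(W.shape[0])

def krr_fit(W, X, y, lam=1e-3):
    F = phi(W, X)                                   # (n, N) feature matrix
    K = F @ F.T                                     # Gram matrix K_RF on the training data
    return jnp.linalg.solve(K + lam * jnp.eye(len(y)), y)   # dual coefficients alpha

def krr_predict(W, X_train, alpha, X_test):
    K_test = phi(W, X_test) @ phi(W, X_train).T     # K_RF(x_test, x_train)
    return K_test @ alpha

key = random.PRNGKey(0)
n, d, N = 100, 10, 500
kx, ky, kw = random.split(key, 3)
X = random.normal(kx, (n, d))
y = random.normal(ky, (n,))
W = random.normal(kw, (N, d)) / jnp.sqrt(d)         # w_i ~ N(0, I_d / d)
alpha = krr_fit(W, X, y)
preds = krr_predict(W, X, alpha, X)
print(jnp.mean((y - preds) ** 2))
```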
