A Mean Field View of the Landscape of Two-Layers Neural Networks

A Mean Field View of the Landscape of Two-Layers Neural Networks Song Mei Stanford University November 14, 2018 Joint work with Andrea Montanari and Phan-Minh Nguyen Song Mei (Stanford University) Mean Field Dynamics for Neural Network


SLIDE 1

A Mean Field View of the Landscape of Two-Layers Neural Networks

Song Mei

Stanford University

November 14, 2018

Joint work with Andrea Montanari and Phan-Minh Nguyen

Song Mei (Stanford University) Mean Field Dynamics for Neural Network November 14, 2018 1 / 37

SLIDE 3

Empirical surprise [Zhang, Bengio, Hardt, Recht, Vinyals, 2016]

◮ Overparameterized regime.
◮ Efficiently fit all the data.
◮ Generalize well.

Questions

◮ Why can complex neural networks be optimized efficiently?
◮ Why does overparameterization not harm generalization?

SLIDE 4

Two-layers neural networks

[Figure: network diagram with an input layer, a hidden layer, and an output layer.]

SLIDE 9

Two-layers neural networks

◮ Parameter: $\theta = (\theta_1, \dots, \theta_N) \in \mathbb{R}^{N \times D}$.
◮ Prediction:

$$\hat{y}(x; \theta) = \frac{1}{N} \sum_{i=1}^{N} \sigma_*(x; \theta_i).$$

◮ An example: $\theta_i = (a_i, w_i)$, $\sigma_*(x; \theta_i) = a_i \, \sigma(\langle x, w_i \rangle)$.
◮ Data distribution: $(x, y) \sim P_{x,y}$.
◮ Risk function:

$$R_N(\theta) = \mathbb{E}_{x,y}\Big[\Big(y - \frac{1}{N} \sum_{i=1}^{N} \sigma_*(x; \theta_i)\Big)^2\Big].$$
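The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the activation is taken to be `tanh`, the expectation in $R_N$ is replaced by an average over a synthetic sample, and the names `sigma_star`, `predict`, and `empirical_risk` are hypothetical.

```python
import math
import random

def sigma_star(x, theta, sigma=math.tanh):
    """Example unit sigma_*(x; theta_i) = a_i * sigma(<x, w_i>), with theta_i = (a_i, w_i)."""
    a, w = theta
    return a * sigma(sum(xj * wj for xj, wj in zip(x, w)))

def predict(x, thetas):
    """y_hat(x; theta) = (1/N) * sum_i sigma_*(x; theta_i)."""
    return sum(sigma_star(x, th) for th in thetas) / len(thetas)

def empirical_risk(data, thetas):
    """Sample average of (y - y_hat)^2, a stand-in for the expectation in R_N."""
    return sum((y - predict(x, thetas)) ** 2 for x, y in data) / len(data)

# Synthetic setup (assumption): Gaussian weights and inputs, labels in {-1, +1}.
rng = random.Random(0)
d, N = 3, 5
thetas = [(rng.gauss(0, 1), [rng.gauss(0, 1) for _ in range(d)]) for _ in range(N)]
data = [([rng.gauss(0, 1) for _ in range(d)], rng.choice([-1.0, 1.0]))
        for _ in range(200)]
risk = empirical_risk(data, thetas)
```

With all parameters set to zero the prediction is identically zero, so the empirical risk reduces to the sample mean of $y^2$.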

SLIDE 10

Two-layers neural networks

[Figure: network diagram with an input layer, a hidden layer, and an output layer; edges into the hidden layer are labeled $w_1, \dots, w_4$ and edges into the output layer are labeled $a_1, \dots, a_4$.]

Figure: $\theta_i = (a_i, w_i)$.

SLIDE 13

Related literature (before 2018)

$$R_N(\theta) = \mathbb{E}_{x,y}\Big[\Big(y - \frac{1}{N} \sum_{j=1}^{N} \sigma_*(x; \theta_j)\Big)^2\Big]$$

◮ Landscape analysis: [Soudry, Carmon, 2016], [Freeman, Bruna, 2016], [Ge, Lee, Ma, 2017], [Soltanolkotabi, Javanmard, Lee, 2017], [Zhong, Song, Jain, Bartlett, Dhillon, 2017]...
◮ Optimization dynamics: [Tian, 2017], [Soltanolkotabi, 2017], [Li, Yuan, 2017]...
◮ Generalization: [Bartlett, Foster, Telgarsky, 2017], [Neyshabur, Bhojanapalli, McAllester, Srebro, 2017]...

SLIDE 16

Overparameterization: what happens for large $N$?

[Bengio et al., 2006]. Expand the square:

$$R_N(\theta) = \mathbb{E}[y^2] + \frac{2}{N} \sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2} \sum_{i,j=1}^{N} U(\theta_i, \theta_j),$$

where

$$V(\theta_i) = -\mathbb{E}[y \, \sigma_*(x; \theta_i)], \qquad U(\theta_i, \theta_j) = \mathbb{E}[\sigma_*(x; \theta_i) \, \sigma_*(x; \theta_j)].$$

◮ $R_N$ depends on $(\theta_i)_{i \le N}$ through $\rho_N = (1/N) \sum_{i=1}^{N} \delta_{\theta_i}$.
◮ This motivates defining $R(\rho)$ for $\rho \in \mathcal{P}(\mathbb{R}^D)$:

$$R(\rho) = \mathbb{E}[y^2] + 2 \int V(\theta) \, \rho(d\theta) + \iint U(\theta_1, \theta_2) \, \rho(d\theta_1) \, \rho(d\theta_2).$$
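As a sanity check on the expansion, the following sketch compares the squared-error form of the risk with the $V$/$U$ form on a small synthetic sample. It assumes a `tanh` unit and replaces every expectation by an average over the same finite data set, for which the identity holds exactly; all names are hypothetical.

```python
import math
import random

def sigma_star(x, theta):
    """Assumed unit: a * tanh(<x, w>) with theta = (a, w)."""
    a, w = theta
    return a * math.tanh(sum(xj * wj for xj, wj in zip(x, w)))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

rng = random.Random(1)
d, N = 2, 4
thetas = [(rng.gauss(0, 1), [rng.gauss(0, 1) for _ in range(d)]) for _ in range(N)]
data = [([rng.gauss(0, 1) for _ in range(d)], rng.choice([-1.0, 1.0]))
        for _ in range(50)]

def V(theta):
    """V(theta) = -E[y * sigma_*(x; theta)], expectation over the data set."""
    return -mean(y * sigma_star(x, theta) for x, y in data)

def U(t1, t2):
    """U(theta_1, theta_2) = E[sigma_*(x; theta_1) * sigma_*(x; theta_2)]."""
    return mean(sigma_star(x, t1) * sigma_star(x, t2) for x, _ in data)

# Risk written as a mean squared error.
direct = mean((y - mean(sigma_star(x, th) for th in thetas)) ** 2 for x, y in data)

# Risk written through V and U after expanding the square.
expanded = (mean(y * y for _, y in data)
            + 2.0 / N * sum(V(th) for th in thetas)
            + sum(U(s, t) for s in thetas for t in thetas) / N ** 2)
```

The two numbers agree up to floating-point rounding, and `expanded` only touches the parameters through sums over $i$ and over pairs $(i, j)$, which is why the risk depends on $\theta$ through the empirical measure.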

SLIDE 19

Overparameterization: what happens for large $N$?

Correspondence: $R_N(\theta) = R\big((1/N) \sum_{i=1}^{N} \delta_{\theta_i}\big)$, where

$$R_N(\theta) = \mathbb{E}[y^2] + \frac{2}{N} \sum_{i=1}^{N} V(\theta_i) + \frac{1}{N^2} \sum_{i,j=1}^{N} U(\theta_i, \theta_j),$$

$$R(\rho) = \mathbb{E}[y^2] + 2 \int V(\theta) \, \rho(d\theta) + \iint U(\theta_1, \theta_2) \, \rho(d\theta_1) \, \rho(d\theta_2).$$

What is the relationship between the minimum values of $R_N$ and $R$?

Lemma

If $\int U(\theta, \theta) \, \rho_{\mathrm{opt}}(d\theta) < K$, then

$$\inf_{\rho} R(\rho) \le \inf_{\theta} R_N(\theta) \le \inf_{\rho} R(\rho) + \frac{K}{N}.$$
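One standard way to see the upper bound (the slide does not give the argument, so this is a sketch rather than the authors' proof) is to draw $\theta_1, \dots, \theta_N$ i.i.d. from $\rho$: then $\mathbb{E}[R_N(\theta)] = R(\rho) + \frac{1}{N}\big(\int U(\theta,\theta)\rho(d\theta) - \iint U \, d\rho \, d\rho\big)$, and the double integral is non-negative. The lower bound holds because every empirical measure is itself a probability measure. The toy check below verifies the identity exactly by enumeration; the atoms, data points, and `tanh` unit are hypothetical.

```python
import itertools
import math

# Hypothetical toy setup: rho is uniform on three atoms theta = (a, w);
# expectations over (x, y) are averages over four scalar data points.
atoms = [(1.0, 0.5), (-0.5, 1.5), (2.0, -1.0)]
data = [(-1.0, -1.0), (-0.3, 1.0), (0.4, 1.0), (1.2, -1.0)]

def unit(x, theta):
    a, w = theta
    return a * math.tanh(w * x)

def avg(values):
    values = list(values)
    return sum(values) / len(values)

Ey2 = avg(y * y for _, y in data)

def V(t):
    return -avg(y * unit(x, t) for x, y in data)

def U(s, t):
    return avg(unit(x, s) * unit(x, t) for x, _ in data)

def R_N(thetas):
    n = len(thetas)
    return (Ey2 + 2.0 / n * sum(V(t) for t in thetas)
            + sum(U(s, t) for s in thetas for t in thetas) / n ** 2)

# R(rho) for the uniform measure on the atoms.
diag = avg(U(t, t) for t in atoms)                  # integral of U(theta, theta)
off = avg(U(s, t) for s in atoms for t in atoms)    # double integral of U
R_rho = Ey2 + 2.0 * avg(V(t) for t in atoms) + off

# Exact expectation of R_N over i.i.d. draws theta_1..theta_N from rho.
N = 2
tuples = list(itertools.product(atoms, repeat=N))
expected_R_N = avg(R_N(ths) for ths in tuples)
gap = expected_R_N - R_rho
```

Since the minimum over the enumerated tuples is at most the average, `min(R_N(ths) for ths in tuples)` is bounded by `R_rho + diag / N`, which mirrors the $K/N$ term in the lemma.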

SLIDE 21

How to optimize $R(\rho)$?

[Bengio et al., 2006] proposed to optimize over $\rho$:

$$R(\rho) = \mathbb{E}[y^2] + 2 \int V(\theta) \, \rho(d\theta) + \iint U(\theta_1, \theta_2) \, \rho(d\theta_1) \, \rho(d\theta_2).$$

Discretizing $\rho$ requires exponentially many basis functions!

[This work]: run SGD on $\theta$ and derive the scaling-limit dynamics for $\rho$.

SLIDE 24

SGD and distributional dynamics (DD)

◮ SGD for $\theta^k$, with $(x_k, y_k) \sim P_{x,y}$, $i \in [N]$:

$$\theta_i^{k+1} = \theta_i^k - 2 s_k \nabla_{\theta_i} \ell(x_k, y_k; \theta^k). \qquad \text{(SGD)}$$

◮ Claim: for $s_k = \varepsilon \, \xi(k\varepsilon)$, $k = t/\varepsilon$, $N \to \infty$, $\varepsilon \to 0$:

$$\hat{\rho}^{(N)}_k \equiv \frac{1}{N} \sum_{i=1}^{N} \delta_{\theta_i^k} \Rightarrow \rho_t.$$

◮ Distributional dynamics (DD) for $\rho_t$:

$$\partial_t \rho_t(\theta) = 2 \xi(t) \, \nabla_\theta \cdot \big(\rho_t(\theta) \nabla_\theta \Psi(\theta; \rho_t)\big), \qquad \text{(DD)}$$

where

$$\Psi(\theta; \rho) = \frac{\delta R(\rho)}{\delta \rho(\theta)} = V(\theta) + \int U(\theta, \theta') \, \rho(d\theta').$$
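A minimal sketch of the one-pass SGD iteration, under assumptions that the slide leaves implicit: the unit is $a_i \tanh(\langle x, w_i \rangle)$, each step uses a fresh sample, and the per-neuron update is taken in the form $\theta_i \leftarrow \theta_i + 2 s_k (y_k - \hat{y}_k) \nabla_{\theta_i} \sigma_*(x_k; \theta_i)$, with the $1/N$ factor from $\hat{y}$ absorbed into the step-size scaling. The function name and the teacher used in the demo are hypothetical.

```python
import math
import random

def sgd_one_pass(thetas, sample, eps, steps, xi=lambda t: 1.0):
    """One-pass SGD on units a_i * tanh(<x, w_i>); returns updated (a_i, w_i) pairs.

    Assumed update: theta_i <- theta_i + 2 s_k (y_k - y_hat_k) grad sigma_*(x_k; theta_i),
    with step size s_k = eps * xi(k * eps).
    """
    thetas = [(a, list(w)) for a, w in thetas]
    n = len(thetas)
    for k in range(steps):
        s_k = eps * xi(k * eps)
        x, y = sample()  # fresh sample at every step (one-pass SGD)
        acts = [math.tanh(sum(xj * wj for xj, wj in zip(x, w))) for _, w in thetas]
        y_hat = sum(a * h for (a, _), h in zip(thetas, acts)) / n
        resid = y - y_hat
        updated = []
        for (a, w), h in zip(thetas, acts):
            grad_a = h                                     # d sigma_* / d a
            grad_w = [a * (1.0 - h * h) * xj for xj in x]  # d sigma_* / d w
            updated.append((a + 2.0 * s_k * resid * grad_a,
                            [wj + 2.0 * s_k * resid * gj
                             for wj, gj in zip(w, grad_w)]))
        thetas = updated
    return thetas

# Demo with a hypothetical noiseless teacher y = tanh(2 x) in one dimension.
rng = random.Random(0)

def sample():
    x = [rng.gauss(0.0, 1.0)]
    return x, math.tanh(2.0 * x[0])

init = [(rng.gauss(0, 1), [rng.gauss(0, 1)]) for _ in range(20)]
trained = sgd_one_pass(init, sample, eps=0.01, steps=500)
```

The returned list plays the role of the empirical measure $\hat{\rho}^{(N)}_k$: a histogram of `trained` at step $k$ is what the claim compares with $\rho_t$ at $t = k\varepsilon$.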

SLIDE 26

More precisely

Assumption: (i) $\sigma_*$ bounded; (ii) $\nabla_\theta \sigma_*(x; \theta)$ sub-Gaussian; (iii) $\nabla V$, $\nabla U$ bounded Lipschitz.

Theorem (M., Montanari, Nguyen, 2018)

Let $(\theta_i^0)_{i \le N} \sim_{iid} \rho_0$. Then, $\forall f$ bounded Lipschitz:

$$\sup_{t \le T} \Big| \frac{1}{N} \sum_{i=1}^{N} f\big(\theta_i^{\lfloor t/\varepsilon \rfloor}\big) - \int f(\theta) \, \rho_t(d\theta) \Big| \le K e^{KT} \, \mathrm{err}_{N,D}(z),$$

where

$$\mathrm{err}_{N,D}(z) \equiv \sqrt{\frac{1}{N} \vee \varepsilon} \cdot \Big[ \sqrt{D \vee \log\frac{N}{\varepsilon}} + z \Big],$$

with probability at least $1 - 4 e^{-z^2/2}$.

SLIDE 28

Number of neurons, sample size, and dimensions

$$\mathrm{err}_{N,D} \asymp \sqrt{\frac{D}{N} \vee (D\varepsilon)}.$$

$N$: number of neurons; $D$: feature dimension; $\varepsilon$: step size.

◮ Small if $N \gg D$, $\varepsilon \ll 1/D$. This is very practical!
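To get a feel for the scaling, the simplified rate can be evaluated directly. The helper below is a hypothetical convenience function that drops the constants and logarithmic factors, exactly as the $\asymp$ on the slide does, so its output indicates an order of magnitude only.

```python
import math

def err_rate(N, D, eps):
    """Simplified rate sqrt(max(D / N, D * eps)); constants and log factors are dropped."""
    return math.sqrt(max(D / N, D * eps))

# N >> D and eps << 1/D: the rate is small.
small = err_rate(N=10**6, D=100, eps=1e-6)

# N comparable to D: the bound is of order one and carries no information.
uninformative = err_rate(N=100, D=100, eps=1e-6)
```

With $D = 100$, taking $N = 10^6$ neurons and step size $\varepsilon = 10^{-6}$ balances the two terms at $10^{-4}$, giving a rate of about $0.01$.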

SLIDE 31

An animation

◮ A specific model (classifying two Gaussians), fix dimension $d = 80$, $N = 200$.
◮ Run SGD and solve PDE.
◮ Track the norm statistics of the weights,

$$\hat{\mu}_k(s) = \frac{1}{N} \sum_{i=1}^{N} \delta_{\|\theta_i^k\|_2} \quad \text{versus} \quad \mu_t(s) = \rho_t(\{\|\theta\|_2 = s\}).$$

SLIDE 34

Message

Approximately $(1/N) \sum_{i=1}^{N} \delta_{\theta_i^k} \approx \rho_t$, where

$$\theta_i^{k+1} = \theta_i^k - 2 s_k \nabla_{\theta_i} \ell(x_k, y_k; \theta^k), \quad i \in [N], \qquad \text{(SGD)}$$

$$\partial_t \rho_t(\theta) = 2 \xi(t) \, \nabla_\theta \cdot \big(\rho_t(\theta) \nabla_\theta \Psi(\theta; \rho_t)\big). \qquad \text{(DD)}$$

Overparameterization ($N \to \infty$) does not affect the limiting dynamics, and therefore

◮ Overparameterization does not slow down convergence!
◮ Overparameterization does not affect generalization!

SLIDE 39

What is this?

$$\partial_t \rho_t = \nabla_\theta \cdot \big(\rho_t \nabla_\theta \Psi(\theta; \rho_t)\big).$$

Existence and uniqueness: [Sznitman, 1991].

◮ Physics: nonlinear transport equation describing motions of particles with pairwise interaction (mean field approach).
◮ Math: gradient flow of $R(\rho)$ ...
◮ ... in the metric space $(\mathcal{P}(\mathbb{R}^D), W_2)$.
◮ [Jordan, Kinderlehrer, Otto, 1998; Ambrosio, Gigli, Savaré, 2006; Carrillo, McCann, Villani, 2013; ...]

SLIDE 42

What is gradient flow in a metric space?

Euclidean space gradient descent:

$$x^{k+1} = x^k - \varepsilon \nabla F(x^k).$$

More insightfully:

$$x^{k+1} = \arg\min_{x} \Big\{ F(x^k) + \langle \nabla F(x^k), x - x^k \rangle + \frac{1}{2\varepsilon} \|x - x^k\|^2 \Big\} \approx \arg\min_{x} \Big\{ F(x) + \frac{1}{2\varepsilon} \|x - x^k\|^2 \Big\}.$$

Use this as definition!
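The two variational forms can be compared on a one-dimensional quadratic, where both minimizers are available in closed form. This is a sketch under the assumption $F(z) = \tfrac{c}{2} z^2$ with $c \ge 0$; the function names are hypothetical.

```python
def gradient_step(x, grad, eps):
    """Explicit step x - eps * grad F(x): the minimizer of the linearized objective."""
    return x - eps * grad(x)

def linearized_objective(z, x, F, grad, eps):
    """F(x) + <grad F(x), z - x> + |z - x|^2 / (2 eps), in one dimension."""
    return F(x) + grad(x) * (z - x) + (z - x) ** 2 / (2.0 * eps)

def implicit_step_quadratic(x, c, eps):
    """Minimizer of (c/2) z^2 + (z - x)^2 / (2 eps): set the derivative to zero."""
    return x / (1.0 + eps * c)

c = 3.0
F = lambda z: 0.5 * c * z * z
grad = lambda z: c * z
x, eps = 1.0, 0.01

z_explicit = gradient_step(x, grad, eps)           # linearized problem
z_implicit = implicit_step_quadratic(x, c, eps)    # full proximal problem
difference = abs(z_explicit - z_implicit)          # of order (eps * c)^2
```

For this quadratic the two steps differ by $(\varepsilon c)^2 / (1 + \varepsilon c)$, which is second order in the step size, so both schemes trace the same gradient flow as $\varepsilon \to 0$. Only the second form refers to nothing beyond $F$ and a distance, which is why it extends to metric spaces.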

SLIDE 43

Metric space $(M, d)$

$$x^{k+1} = \arg\min_{x} \Big\{ F(x) + \frac{1}{2\varepsilon} d(x, x^k)^2 \Big\}.$$

SLIDE 44

Wasserstein space $(\mathcal{P}(\mathbb{R}^D), W_2)$

$$\rho_{t+\varepsilon} = \arg\min_{\rho \in \mathcal{P}(\mathbb{R}^D)} \Big\{ R(\rho) + \frac{1}{2\varepsilon} W_2(\rho, \rho_t)^2 \Big\}$$

$$\partial_t \rho_t = \nabla_\theta \cdot \big(\rho_t \nabla_\theta \Psi(\theta; \rho_t)\big), \qquad \Psi(\theta; \rho) \equiv \frac{\delta R}{\delta \rho}(\theta).$$

SLIDE 45

Related work

Last few months

◮ Rotskoff, Vanden-Eijnden, arXiv:1805.00915
◮ Sirignano, Spiliopoulos, arXiv:1805.01053
◮ Chizat, Bach, arXiv:1805.09545

SLIDE 48

Does the distributional dynamics converge?

Gradient flow minimizing $R(\rho)$:

$$\partial_t \rho_t(\theta) = 2 \xi(t) \, \nabla_\theta \cdot \big(\rho_t(\theta) \nabla_\theta \Psi(\theta; \rho_t)\big). \qquad \text{(DD)}$$

◮ Does the distributional dynamics converge to minimizers?
◮ In general, no; sometimes, yes.

In the following

◮ Concrete examples with convergence.
◮ A general convergence result for noisy SGD.

SLIDE 49

Concrete examples

SLIDE 50

Simplest example requiring more than one neuron

With probability $1/2$: $y = +1$, $x \sim N(0, \Sigma_+)$.
With probability $1/2$: $y = -1$, $x \sim N(0, \Sigma_-)$.

$$\Sigma_\pm = \begin{pmatrix} \tau_\pm^2 I_{s_0} & 0 \\ 0 & I_{d - s_0} \end{pmatrix}.$$

Invariant under $O(s_0) \times O(d - s_0)$ $\Rightarrow$ reduced PDE.
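Sampling from this data model is straightforward because both covariances are diagonal. The sketch below assumes independent coordinates with variance $\tau_\pm^2$ on the first $s_0$ coordinates and variance $1$ on the rest; `sample_xy` is a hypothetical helper name.

```python
import math
import random

def sample_xy(d, s0, tau_plus_sq, tau_minus_sq, rng):
    """Draw (x, y): y = +1 or -1 with probability 1/2 each, x ~ N(0, Sigma_y).

    Sigma_y is diagonal: variance tau_y^2 on the first s0 coordinates, 1 on the rest.
    """
    y = rng.choice([1.0, -1.0])
    tau = math.sqrt(tau_plus_sq if y > 0 else tau_minus_sq)
    x = [(tau if j < s0 else 1.0) * rng.gauss(0.0, 1.0) for j in range(d)]
    return x, y

rng = random.Random(0)
batch = [sample_xy(d=40, s0=40, tau_plus_sq=1.8, tau_minus_sq=0.2, rng=rng)
         for _ in range(1000)]
```

Both classes have mean zero, so a single linear unit cannot separate them; the label is carried by the norm of the first $s_0$ coordinates.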

SLIDE 51

Activation functions

Simple: $\theta_i = w_i \in \mathbb{R}^d$ (no offset, no scaling weights),

$$\sigma_*(x; \theta_i) = \sigma(\langle x, w_i \rangle).$$

ReLU: $\theta_i = (a_i, b_i, w_i) \in \mathbb{R}^{d+2}$,

$$\sigma_*(x; \theta_i) = a_i \max(\langle x, w_i \rangle + b_i, 0).$$
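Both parameterizations translate directly into code. In this sketch the scalar activation $\sigma$ of the simple unit is passed in as an argument, since the slide does not specify it; the function names are hypothetical.

```python
import math

def dot(x, w):
    return sum(xj * wj for xj, wj in zip(x, w))

def simple_unit(x, w, sigma):
    """sigma_*(x; theta_i) = sigma(<x, w_i>), with theta_i = w_i only."""
    return sigma(dot(x, w))

def relu_unit(x, theta):
    """sigma_*(x; theta_i) = a_i * max(<x, w_i> + b_i, 0), with theta_i = (a_i, b_i, w_i)."""
    a, b, w = theta
    return a * max(dot(x, w) + b, 0.0)

x = [1.0, 2.0]
active = relu_unit(x, (2.0, -1.0, [1.0, 1.0]))     # pre-activation 3 - 1 = 2
inactive = relu_unit(x, (2.0, -5.0, [1.0, 1.0]))   # pre-activation 3 - 5 < 0
centered = simple_unit(x, [0.0, 0.0], math.tanh)   # tanh is only a placeholder sigma
```

The simple unit has $d$ parameters per neuron while the ReLU unit has $d + 2$, matching the dimensions stated above.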

SLIDE 52

Distributional dynamics

◮ $s_0 = d = 40$, $N = 800$, $\tau_+^2 = 1.8$, $\tau_-^2 = 0.2$.
◮ Simple activation.
◮ Histogram: empirical results. Continuous lines: PDE solutions.

SLIDE 53

Evolution of the risk

[Figure, first panel: risk versus iteration (log scale, $10^0$ to $10^7$), with PDE and SGD curves for $\Delta = 0.2, 0.4, 0.6$. Second panel: $a$ (mean), $b$ (mean), and $r_1$ (mean), with PDE and SGD curves for the same values of $\Delta$.]

◮ $d = 320$, $s_0 = 60$, $N = 800$, $\tau_\pm^2 = 1 \pm \Delta$.
◮ ReLU activation.

SLIDE 56

Classifying anisotropic Gaussians: analysis

Assumption: (i) $\sigma : \mathbb{R} \to \mathbb{R}$ truncated ReLU; (ii) $s_0 = \gamma d$, $\gamma \in (0, 1)$ fixed; (iii) $\bar{\rho}_0 \in \mathcal{P}(\mathbb{R}_+)$ has bounded density and $R(\rho_0) < 1$.

Theorem (M., Montanari, Nguyen, 2018)

For $T \ge T_0$, $d \ge d_0$, $N \ge C_0 d \log d$ ($T_0, d_0, C_0$ depend on $(\eta, \bar{\rho}_0, \Delta)$), consider SGD initialized with $(\theta_i^0)_{i \le N} \sim_{iid} \bar{\rho}_0 \times \mathrm{Unif}(S^{d-1})$ and step size $\varepsilon \le 1/(C_0 d)$. Then, for any $k \in [T/\varepsilon, 10T/\varepsilon]$, with high probability,

$$R_N(\theta^k) \le \inf_{\theta \in \mathbb{R}^{d \times N}} R_N(\theta) + \eta.$$

◮ Learning from $k = O(1/\varepsilon) = O(d)$ samples.
◮ Independent of the number of neurons $N \ge O(d \log d)$.

SLIDE 57

Predicting failure

[Figure: risk versus iteration and $r$ versus iteration (log scale, $10^0$ to $10^7$), with PDE and SGD curves for $\kappa = 0.1$ and $\kappa = 0.4$.]

◮ $s_0 = d = 320$, $N = 800$, $\tau_+^2 = 1.5$, $\tau_-^2 = 0.5$.
◮ Non-monotone activation.
◮ Two different initializations ($\kappa$ = initialization variance).

SLIDE 58

Predicting failure

◮ SGD does not necessarily converge to global min.
◮ Can we fix it?

SLIDE 59

Noisy stochastic gradient descent

SLIDE 62

Regularized noisy SGD

SGD with $(g_i^k)_{i \le N, k \ge 0} \sim_{iid} N(0, I)$:

$$\theta_i^{k+1} = (1 - 2 \lambda s_k) \, \theta_i^k - 2 s_k \nabla_{\theta_i} \ell(x_k, y_k; \theta^k) + \sqrt{s_k / \beta} \, g_i^k.$$

Distributional dynamics with diffusion term:

$$\partial_t \rho_t(\theta) = 2 \xi(t) \, \nabla_\theta \cdot \big(\rho_t(\theta) \nabla_\theta \Psi_\lambda(\theta; \rho_t)\big) + \beta^{-1} \Delta_\theta \rho_t(\theta).$$

Theorem

Same approximation theorem: noisy SGD $\Leftrightarrow$ PDE.
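A single noisy, regularized update for one neuron can be sketched as follows. The gradient of the loss is passed in precomputed, since the slide does not spell out $\ell$; `noisy_sgd_step` and its argument names are hypothetical.

```python
import math
import random

def noisy_sgd_step(theta, grad_loss, s_k, lam, beta, rng):
    """theta <- (1 - 2 lam s_k) theta - 2 s_k grad_loss + sqrt(s_k / beta) g, g ~ N(0, I).

    theta and grad_loss are lists of equal length (one neuron's parameters).
    """
    noise_scale = math.sqrt(s_k / beta)
    shrink = 1.0 - 2.0 * lam * s_k   # weight decay from the regularizer
    return [shrink * t - 2.0 * s_k * g + noise_scale * rng.gauss(0.0, 1.0)
            for t, g in zip(theta, grad_loss)]

rng = random.Random(0)
theta = [1.0, -2.0]
grad_loss = [0.5, 0.0]

# beta = infinity switches the noise off and leaves the deterministic part.
noiseless = noisy_sgd_step(theta, grad_loss, s_k=0.1, lam=0.5, beta=math.inf, rng=rng)

# Finite beta injects Gaussian noise with standard deviation sqrt(s_k / beta).
noisy = noisy_sgd_step(theta, grad_loss, s_k=0.1, lam=0.5, beta=10.0, rng=rng)
```

The three terms map onto the PDE: the shrinkage contributes the $\lambda$-dependent part of $\Psi_\lambda$, the gradient term drives the transport, and the injected noise produces the diffusion term $\beta^{-1} \Delta_\theta \rho_t$.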

SLIDE 63

Gradient flow interpretation

$$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \frac{\lambda}{2} \int \|\theta\|_2^2 \, \rho(d\theta) - \beta^{-1} \mathrm{Ent}(\rho),$$

$$\mathrm{Ent}(\rho) = -\int \rho(\theta) \log \rho(\theta) \, d\theta.$$

◮ Distributional dynamics is the gradient flow of $F_{\beta,\lambda}(\rho)$ ...
◮ ... in Wasserstein metric space.

[Jordan, Kinderlehrer, Otto, 1998]

SLIDE 64

Convergence of DD

Theorem (M., Montanari, Nguyen, 2018)

Assume $V$, $U$, $\rho_0$ are "sufficiently" regular. If $\rho_t$ is a solution of DD, then $F_{\beta,\lambda}(\rho_t)$ is non-increasing:

$$\partial_t F_{\beta,\lambda}(\rho_t) = -\int \Big\| \nabla_\theta \Big( \Psi_\lambda(\theta; \rho_t) + \frac{1}{\beta} \log \rho_t(\theta) \Big) \Big\|_2^2 \, \rho_t(d\theta) \le 0.$$

In particular, there exists a unique fixed point $\rho_*$ of $F_{\beta,\lambda}$, which satisfies

$$\rho_*(\theta) = \frac{1}{Z_*(\beta, \lambda)} \exp\{-\beta \Psi_\lambda(\theta; \rho_*)\}.$$

Moreover, as $t \to \infty$, $\rho_t \to \rho_*$.

This generalizes the analysis of [Carrillo, McCann, Villani, 2013].

SLIDE 65

Key remark

$$\rho_*(\theta) = \frac{1}{Z_*(\beta, \lambda)} \exp\{-\beta \Psi_\lambda(\theta; \rho_*)\}$$

is the stationarity equation for

$$F_{\beta,\lambda}(\rho) = \frac{1}{2} R(\rho) + \frac{\lambda}{2} \int \|\theta\|_2^2 \, \rho(d\theta) - \beta^{-1} \mathrm{Ent}(\rho).$$

◮ $F_{\beta,\lambda}(\cdot)$ is strongly convex.
◮ The fixed point is unique!
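The Gibbs form of the stationary point can be checked in a stripped-down setting. The sketch below is a finite-state analogue under a strong simplifying assumption: the potential does not depend on $\rho$ (no interaction term), so the free energy is a linear energy minus $\beta^{-1}$ times the entropy, and its minimizer is the Gibbs distribution with minimum value $-\beta^{-1} \log Z$. The energies and function names are hypothetical.

```python
import math

def free_energy(p, energies, beta):
    """sum_i p_i E_i + (1/beta) sum_i p_i log p_i (energy minus entropy / beta)."""
    return sum(pi * (e + math.log(pi) / beta)
               for pi, e in zip(p, energies) if pi > 0)

def gibbs(energies, beta):
    """Gibbs distribution p_i = exp(-beta E_i) / Z and its normalizer Z."""
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)
    return [w / Z for w in weights], Z

energies = [0.3, 1.2, 0.7, 2.0]   # hypothetical potential values on a 4-point grid
beta = 2.5
p_star, Z = gibbs(energies, beta)
uniform = [0.25] * 4

f_star = free_energy(p_star, energies, beta)
f_uniform = free_energy(uniform, energies, beta)
```

Here `f_star` equals $-\log Z / \beta$ and lies below `f_uniform`. In the setting of the slide the potential $\Psi_\lambda(\theta; \rho_*)$ itself depends on $\rho_*$, so the Gibbs formula becomes a fixed-point equation rather than an explicit solution.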

SLIDE 67

General convergence for noisy SGD

Theorem (M., Montanari, Nguyen, 2018)

Assumptions of previous theorem. Initialization $(\theta_i^0)_{i \le N} \sim_{iid} \rho_0$.

Then there exists $\beta_0 = \beta_0(D, U, V, \eta)$ such that, for $\beta \ge \beta_0$, there exists $T = T(D, U, V, \beta, \eta)$ such that for any $k \in [T/\varepsilon, 10T/\varepsilon]$, $N \ge C_0 D \log D$, $\varepsilon \le 1/(C_0 D)$, we have, with high probability,

$$R_{\lambda,N}(\theta^k) \le \inf_{\theta \in \mathbb{R}^{D \times N}} R_{\lambda,N}(\theta) + \eta.$$

◮ For general distribution $(x, y) \sim P_{x,y}$!
◮ Convergence time depends on $D$, but not on $N$!

SLIDE 70

Conclusion

Correspondence

◮ Two-layer neural networks.
◮ Dynamics of particles with pairwise interactions.
◮ Gradient flow in measure spaces.

This partially explains the optimization/generalization surprise.