
SLIDE 1

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer NNs

Wei Hu

Princeton

Simon S. Du

CMU

Sanjeev Arora

Princeton & IAS

Zhiyuan Li

Princeton

Ruosong Wang

CMU

SLIDE 2

“Rethinking generalization” Experiment [Zhang et al ‘17]

[Figure: training examples shown with true labels (2, 1, 3, 1, 4) vs. randomly assigned labels (5, 1, 7, 8)]

SLIDE 3

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels

“Rethinking generalization” Experiment [Zhang et al ‘17]

SLIDE 4

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels

No good explanation in existing generalization theory:

generalization gap ≤ model complexity / # training samples

“Rethinking generalization” Experiment [Zhang et al ‘17]

SLIDE 5

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels

No good explanation in existing generalization theory:

generalization gap ≤ model complexity / # training samples

“Rethinking generalization” Experiment [Zhang et al ‘17]

This paper: Theoretical explanation for
  • overparametrized 2-layer nets, using label properties

SLIDE 6

Setting: Overparam Two-Layer ReLU Neural Nets

[Figure: two-layer ReLU network with input x, first-layer weights W, output f(W, x)]

Overparam: # hidden nodes is large
Training obj: ℓ2 loss, binary classification
Init: i.i.d. Gaussian
Opt algo: GD on the first-layer weights W
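To make the setting concrete, here is a minimal numpy sketch (my own variable names and hyperparameters, not the authors' code) of a width-m two-layer ReLU net f(W, x) = (1/√m) Σ_r a_r relu(w_rᵀx), with the second-layer signs a fixed and full-batch gradient descent run on the first-layer weights W only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 10, 2048                      # m >> n: overparametrized
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.choice([-1.0, 1.0], size=n)          # binary labels (true or random)

W = rng.standard_normal((m, d))              # first layer: i.i.d. Gaussian init
a = rng.choice([-1.0, 1.0], size=m)          # second layer: fixed random signs

eta = 0.2
for t in range(2000):                        # full-batch GD on W only
    Z = X @ W.T                              # n x m pre-activations
    u = np.maximum(Z, 0) @ a / np.sqrt(m)    # predictions f(W, x_i)
    r = u - y                                # residuals of the 1/2 * l2 loss
    act = (Z > 0).astype(float)              # ReLU activation pattern
    grad = ((r[:, None] * act) * a).T @ X / np.sqrt(m)   # dL/dW, shape m x d
    W -= eta * grad

print(0.5 * np.sum((np.maximum(X @ W.T, 0) @ a / np.sqrt(m) - y) ** 2))
# training loss should be close to 0 when m is large, even for random labels
```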

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels

SLIDE 7

Setting: Overparam Two-Layer ReLU Neural Nets

[Figure: two-layer ReLU network with input x, first-layer weights W, output f(W, x)]

Overparam: # hidden nodes is large
Training obj: ℓ2 loss, binary classification
Init: i.i.d. Gaussian
Opt algo: GD on the first-layer weights W

[Du et al., ICLR’19]:

GD converges to 0 training loss

Explains phenomenon ①, but not ② or ③

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels

SLIDE 8

Setting: Overparam Two-Layer ReLU Neural Nets

[Figure: two-layer ReLU network with input x, first-layer weights W, output f(W, x)]

Overparam: # hidden nodes is large
Training obj: ℓ2 loss, binary classification
Init: i.i.d. Gaussian
Opt algo: GD on the first-layer weights W

[Du et al., ICLR’19]:

GD converges to 0 training loss

Explains phenomenon ①, but not ② or ③

This paper: for ② and ③
  • Faster convergence with true labels
  • A data-dependent generalization bound (distinguishes random labels from true labels)

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels

SLIDE 9

Training Speed

Theorem: loss at iteration t ≈ ‖(I − η H^∞)^t y‖₂

  • y: vector of labels
  • H^∞: kernel matrix ("Neural Tangent Kernel"),
    H^∞_ij = E_{w~N(0,I)} ⟨∇_w f(w, x_i), ∇_w f(w, x_j)⟩ = (π − arccos(x_iᵀx_j)) · x_iᵀx_j / (2π)
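As a sanity check on the closed form above, here is a small numpy sketch (function names are mine, not from the paper) comparing it with a Monte Carlo estimate of E_w⟨∇_w relu(wᵀx_i), ∇_w relu(wᵀx_j)⟩ over w ~ N(0, I), i.e. the per-hidden-unit contribution to the kernel, assuming unit-norm inputs:

```python
import numpy as np

def ntk_gram(X):
    """Closed-form H^inf for unit-norm rows of X:
    H_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2*pi)."""
    S = np.clip(X @ X.T, -1.0, 1.0)          # pairwise inner products
    return S * (np.pi - np.arccos(S)) / (2 * np.pi)

def ntk_gram_mc(X, num_w=200_000, seed=0):
    """Monte Carlo estimate of E_w[<grad_w relu(w^T x_i), grad_w relu(w^T x_j)>],
    w ~ N(0, I); the gradient of one ReLU unit is x * 1{w^T x > 0}."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_w, X.shape[1]))
    A = (X @ W.T > 0).astype(float)          # n x num_w indicators 1{w^T x_i > 0}
    return (X @ X.T) * (A @ A.T) / num_w

X = np.random.default_rng(1).standard_normal((5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize inputs to the unit sphere
print(np.abs(ntk_gram(X) - ntk_gram_mc(X)).max())   # small -> the two agree
```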

SLIDE 10

Training Speed

Theorem: loss at iteration t ≈ ‖(I − η H^∞)^t y‖₂

  • y: vector of labels
  • H^∞: kernel matrix ("Neural Tangent Kernel"),
    H^∞_ij = E_{w~N(0,I)} ⟨∇_w f(w, x_i), ∇_w f(w, x_j)⟩ = (π − arccos(x_iᵀx_j)) · x_iᵀx_j / (2π)

Implication:
  • Training speed is determined by the projections of y onto the eigenvectors of H^∞: ⟨y, v₁⟩, ⟨y, v₂⟩, ⟨y, v₃⟩, …
  • Components on top eigenvectors converge to 0 faster than components on bottom eigenvectors
  ⇒ Explains the different training speeds on correct vs. random labels

[Figures: label projections ⟨y, vᵢ⟩ sorted by eigenvalue; training loss over time]
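A small numpy sketch of this implication (synthetic data and my own names, not the paper's experiment): after eigendecomposing H^∞ = Σᵢ λᵢ vᵢvᵢᵀ, the idealized loss ‖(I − ηH^∞)^t y‖₂ equals √(Σᵢ (1 − ηλᵢ)^{2t} ⟨vᵢ, y⟩²), so labels whose mass sits on top eigenvectors should decay much faster than random labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = np.clip(X @ X.T, -1.0, 1.0)
H = S * (np.pi - np.arccos(S)) / (2 * np.pi)     # NTK Gram matrix H^inf

lam, V = np.linalg.eigh(H)                       # eigenvalues ascending, eigenvectors in columns
eta = 0.5 / lam[-1]                              # step size scaled by the top eigenvalue

y_top = V[:, -1]                                 # labels aligned with the top eigenvector
y_rand = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # random labels, same l2 norm

def ideal_loss(y, T=50):
    # ||(I - eta*H)^t y||_2 = sqrt(sum_i (1 - eta*lam_i)^(2t) * <v_i, y>^2)
    proj2 = (V.T @ y) ** 2
    return [np.sqrt(((1 - eta * lam) ** (2 * t) * proj2).sum()) for t in range(T)]

print(ideal_loss(y_top)[:5])    # drops quickly: all mass on the largest eigenvalue
print(ideal_loss(y_rand)[:5])   # drops slowly: mass spread over small eigenvalues
```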

SLIDE 11

Explaining Generalization despite vast overparametrization

Theorem: For 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H^∞)⁻¹y / # training samples ) + small terms

Corollary: Simple functions are provably learnable (e.g., linear functions and even-degree polynomials).

"Data-dependent complexity": the quantity yᵀ(H^∞)⁻¹y
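A minimal numpy sketch of how this bound can separate structured from random labels (my own synthetic setup, not the paper's experiment): compute √(2 yᵀ(H^∞)⁻¹y / n) for labels produced by a linear target vs. random ±1 labels of the same norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 20
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = np.clip(X @ X.T, -1.0, 1.0)
H = S * (np.pi - np.arccos(S)) / (2 * np.pi)     # NTK Gram matrix H^inf

def complexity(y):
    # data-dependent complexity sqrt(2 * y^T (H^inf)^{-1} y / n)
    return np.sqrt(2.0 * y @ np.linalg.solve(H, y) / n)

beta = rng.standard_normal(d)
y_lin = X @ beta                                 # labels from a linear target
y_lin *= np.sqrt(n) / np.linalg.norm(y_lin)      # rescale to match the random labels' norm
y_rand = rng.choice([-1.0, 1.0], size=n)

print(complexity(y_lin), complexity(y_rand))     # linear labels typically give the smaller bound
```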

SLIDE 12

Explaining Generalization despite vast overparametrization

Theorem: For 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H^∞)⁻¹y / # training samples ) + small terms

Corollary: Simple functions are provably learnable (e.g., linear functions and even-degree polynomials).

"Data-dependent complexity": the quantity yᵀ(H^∞)⁻¹y

Poster #75 tonight

SLIDE 13

Explaining Generalization despite vast overparametrization

Theorem: For 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H^∞)⁻¹y / # training samples ) + small terms

Corollary: Simple functions are provably learnable (e.g., linear functions and even-degree polynomials).

"Data-dependent complexity": the quantity yᵀ(H^∞)⁻¹y

Poster #75 tonight

Two interpretations of the data-dependent complexity:
  • "Distance to Init"
  • "Min RKHS norm for training labels"