Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer NNs
Wei Hu (Princeton), Simon S. Du (CMU), Sanjeev Arora (Princeton & IAS), Zhiyuan Li (Princeton), Ruosong Wang (CMU)
Rethinking generalization
[Figure: the same training images shown with true labels (2, 1, 3, 1, 4) and with random labels (5, 1, 7, 8)]
Unexplained phenomena
① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels
No good explanation in existing generalization theory:
generalization gap ≤ (model complexity) / (# training samples)
For overparameterized nets that can fit even random labels, the model-complexity term is huge, so bounds of this form are vacuous and cannot distinguish true labels from random labels.
This paper: theoretical explanations for overparameterized two-layer nets, using properties of the labels
[Figure: two-layer ReLU network with input x = (x₁, …, x_d) and output g(W, x)]
Setting:
• Overparametrization: # hidden nodes m is large
• Training objective: ℓ₂ loss, binary classification
• Initialization: i.i.d. Gaussian
• Optimization: GD on the first-layer weights W (as in the sketch below)
[Du et al., ICLR’19]: GD converges to 0 training loss.
This explains phenomenon ①, but not ② or ③.
This paper: explanations for ② and ③:
• convergence analysis: faster convergence with true labels
• generalization bound that distinguishes random labels from true labels
Theorem (convergence): the training loss at iteration k satisfies
  ‖y − u(k)‖ ≈ ‖(I − ηH∞)ᵏ y‖,
where y is the vector of training labels, u(k) the network’s predictions at iteration k, η the step size, and H∞ the Gram matrix
  H∞ᵢⱼ = E_w ⟨∇_w g(W, xᵢ), ∇_w g(W, xⱼ)⟩ = xᵢᵀxⱼ (π − arccos(xᵢᵀxⱼ)) / (2π).
Implication: decompose H∞ = Σᵢ λᵢ vᵢvᵢᵀ with λ₁ ≥ λ₂ ≥ …. The projections ⟨y, v₁⟩, ⟨y, v₂⟩, ⟨y, v₃⟩, … decay at rates (1 − ηλᵢ)ᵏ, so components of y on top eigenvectors converge faster than components on bottom eigenvectors. True labels put most of their mass on the top eigenvectors, while random labels spread across all of them. This explains the different training speeds on correct vs. random labels.
[Figures: projections of labels onto eigenvectors of H∞, sorted by eigenvalue; training loss over time for true vs. random labels]
Theorem (generalization): for any 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H∞)⁻¹y / n ) + small terms,
where n = # training samples. The quantity yᵀ(H∞)⁻¹y is a “data-dependent complexity” measure: it is small for true labels and large for random labels, so the bound distinguishes the two.
Corollary: simple functions are provably learnable (e.g., linear functions and even-degree polynomials).
Two interpretations of the complexity measure yᵀ(H∞)⁻¹y: the “distance to initialization” travelled by the weights, and the “minimum RKHS norm” of a function fitting the training labels.