
SLIDE 1

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer NNs

Wei Hu

Princeton

Simon S. Du

CMU

Sanjeev Arora

Princeton & IAS

Zhiyuan Li

Princeton

Ruosong Wang

CMU

SLIDE 2

“Rethinking generalization” Experiment [Zhang et al ‘17]

[Figure: training examples shown with true labels (2, 1, 3, 1, 4) vs. randomly assigned labels (5, 1, 7, 8)]

SLIDE 3

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels

“Rethinking generalization” Experiment [Zhang et al ‘17]

SLIDE 4

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels

No good explanation in existing generalization theory:

generalization gap ≤ model complexity / # training samples

“Rethinking generalization” Experiment [Zhang et al ‘17]

SLIDE 5

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels than with random labels

No good explanation in existing generalization theory:

generalization gap ≤ model complexity / # training samples

“Rethinking generalization” Experiment [Zhang et al ‘17]

This paper: Theoretical explanation for
  • overparametrized 2-layer nets, using label properties

SLIDE 6

Setting: Overparam Two-Layer ReLU Neural Nets

[Figure: two-layer ReLU network with input x, first-layer weights W, output f(W, x)]

Overparam: # hidden nodes is large
Training obj: ℓ2 loss, binary classification
Init: i.i.d. Gaussian
Opt algo: GD on the first-layer weights W
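To make the setting concrete, here is a minimal numpy sketch (my own variable names and hyperparameters, not the authors' code) of a width-m two-layer ReLU net f(W, x) = (1/√m) Σ_r a_r relu(w_rᵀx), with the second-layer signs a fixed and full-batch gradient descent run on the first-layer weights W only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 100, 10, 2048                      # m >> n: overparametrized
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on the unit sphere
y = rng.choice([-1.0, 1.0], size=n)          # binary labels (true or random)

W = rng.standard_normal((m, d))              # first layer: i.i.d. Gaussian init
a = rng.choice([-1.0, 1.0], size=m)          # second layer: fixed random signs

eta = 0.2
for t in range(2000):                        # full-batch GD on W only
    Z = X @ W.T                              # n x m pre-activations
    u = np.maximum(Z, 0) @ a / np.sqrt(m)    # predictions f(W, x_i)
    r = u - y                                # residuals of the 1/2 * l2 loss
    act = (Z > 0).astype(float)              # ReLU activation pattern
    grad = ((r[:, None] * act) * a).T @ X / np.sqrt(m)   # dL/dW, shape m x d
    W -= eta * grad

print(0.5 * np.sum((np.maximum(X @ W.T, 0) @ a / np.sqrt(m) - y) ** 2))
# training loss should be close to 0 when m is large, even for random labels
```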

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels

SLIDE 7

Setting: Overparam Two-Layer ReLU Neural Nets

[Figure: two-layer ReLU network with input x, first-layer weights W, output f(W, x)]

Overparam: # hidden nodes is large
Training obj: ℓ2 loss, binary classification
Init: i.i.d. Gaussian
Opt algo: GD on the first-layer weights W

[Du et al., ICLR’19]:

GD converges to 0 training loss

Explains phenomenon ①, but not ② or ③

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels

SLIDE 8

Setting: Overparam Two-Layer ReLU Neural Nets

[Figure: two-layer ReLU network with input x, first-layer weights W, output f(W, x)]

Overparam: # hidden nodes is large
Training obj: ℓ2 loss, binary classification
Init: i.i.d. Gaussian
Opt algo: GD on the first-layer weights W

[Du et al., ICLR’19]:

GD converges to 0 training loss

Explains phenomenon ①, but not ② or ③

This paper: for ② and ③
  • Faster convergence with true labels
  • A data-dependent generalization bound (distinguishes random labels from true labels)

Unexplained phenomena

① SGD achieves nearly 0 training loss for both correct and random labels (overparametrization!)
② Good generalization with correct labels
③ Faster convergence with correct labels

SLIDE 9

Training Speed

Theorem: loss at iteration t ≈ ‖(I − η H^∞)^t y‖₂

  • y: vector of labels
  • H^∞: kernel matrix ("Neural Tangent Kernel"),
    H^∞_ij = E_{w~N(0,I)} ⟨∇_w f(w, x_i), ∇_w f(w, x_j)⟩ = (π − arccos(x_iᵀx_j)) · x_iᵀx_j / (2π)
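As a sanity check on the closed form above, here is a small numpy sketch (function names are mine, not from the paper) comparing it with a Monte Carlo estimate of E_w⟨∇_w relu(wᵀx_i), ∇_w relu(wᵀx_j)⟩ over w ~ N(0, I), i.e. the per-hidden-unit contribution to the kernel, assuming unit-norm inputs:

```python
import numpy as np

def ntk_gram(X):
    """Closed-form H^inf for unit-norm rows of X:
    H_ij = x_i^T x_j * (pi - arccos(x_i^T x_j)) / (2*pi)."""
    S = np.clip(X @ X.T, -1.0, 1.0)          # pairwise inner products
    return S * (np.pi - np.arccos(S)) / (2 * np.pi)

def ntk_gram_mc(X, num_w=200_000, seed=0):
    """Monte Carlo estimate of E_w[<grad_w relu(w^T x_i), grad_w relu(w^T x_j)>],
    w ~ N(0, I); the gradient of one ReLU unit is x * 1{w^T x > 0}."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_w, X.shape[1]))
    A = (X @ W.T > 0).astype(float)          # n x num_w indicators 1{w^T x_i > 0}
    return (X @ X.T) * (A @ A.T) / num_w

X = np.random.default_rng(1).standard_normal((5, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize inputs to the unit sphere
print(np.abs(ntk_gram(X) - ntk_gram_mc(X)).max())   # small -> the two agree
```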

SLIDE 10

Training Speed

Theorem: loss at iteration t ≈ ‖(I − η H^∞)^t y‖₂

  • y: vector of labels
  • H^∞: kernel matrix ("Neural Tangent Kernel"),
    H^∞_ij = E_{w~N(0,I)} ⟨∇_w f(w, x_i), ∇_w f(w, x_j)⟩ = (π − arccos(x_iᵀx_j)) · x_iᵀx_j / (2π)

Implication:
  • Training speed is determined by the projections of y onto the eigenvectors of H^∞: ⟨y, v₁⟩, ⟨y, v₂⟩, ⟨y, v₃⟩, …
  • Components on top eigenvectors converge to 0 faster than components on bottom eigenvectors
  ⇒ Explains the different training speeds on correct vs. random labels

[Figures: label projections ⟨y, vᵢ⟩ sorted by eigenvalue; training loss over time]
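A small numpy sketch of this implication (synthetic data and my own names, not the paper's experiment): after eigendecomposing H^∞ = Σᵢ λᵢ vᵢvᵢᵀ, the idealized loss ‖(I − ηH^∞)^t y‖₂ equals √(Σᵢ (1 − ηλᵢ)^{2t} ⟨vᵢ, y⟩²), so labels whose mass sits on top eigenvectors should decay much faster than random labels:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 20))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = np.clip(X @ X.T, -1.0, 1.0)
H = S * (np.pi - np.arccos(S)) / (2 * np.pi)     # NTK Gram matrix H^inf

lam, V = np.linalg.eigh(H)                       # eigenvalues ascending, eigenvectors in columns
eta = 0.5 / lam[-1]                              # step size scaled by the top eigenvalue

y_top = V[:, -1]                                 # labels aligned with the top eigenvector
y_rand = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)   # random labels, same l2 norm

def ideal_loss(y, T=50):
    # ||(I - eta*H)^t y||_2 = sqrt(sum_i (1 - eta*lam_i)^(2t) * <v_i, y>^2)
    proj2 = (V.T @ y) ** 2
    return [np.sqrt(((1 - eta * lam) ** (2 * t) * proj2).sum()) for t in range(T)]

print(ideal_loss(y_top)[:5])    # drops quickly: all mass on the largest eigenvalue
print(ideal_loss(y_rand)[:5])   # drops slowly: mass spread over small eigenvalues
```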

SLIDE 11

Explaining Generalization despite vast overparametrization

Theorem: For 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H^∞)⁻¹y / # training samples ) + small terms

Corollary: Simple functions are provably learnable (e.g., linear functions and even-degree polynomials).

"Data-dependent complexity": the quantity yᵀ(H^∞)⁻¹y
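A minimal numpy sketch of how this bound can separate structured from random labels (my own synthetic setup, not the paper's experiment): compute √(2 yᵀ(H^∞)⁻¹y / n) for labels produced by a linear target vs. random ±1 labels of the same norm:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 300, 20
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
S = np.clip(X @ X.T, -1.0, 1.0)
H = S * (np.pi - np.arccos(S)) / (2 * np.pi)     # NTK Gram matrix H^inf

def complexity(y):
    # data-dependent complexity sqrt(2 * y^T (H^inf)^{-1} y / n)
    return np.sqrt(2.0 * y @ np.linalg.solve(H, y) / n)

beta = rng.standard_normal(d)
y_lin = X @ beta                                 # labels from a linear target
y_lin *= np.sqrt(n) / np.linalg.norm(y_lin)      # rescale to match the random labels' norm
y_rand = rng.choice([-1.0, 1.0], size=n)

print(complexity(y_lin), complexity(y_rand))     # linear labels typically give the smaller bound
```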

SLIDE 12

Explaining Generalization despite vast overparametrization

Theorem: For 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H^∞)⁻¹y / # training samples ) + small terms

Corollary: Simple functions are provably learnable (e.g., linear functions and even-degree polynomials).

"Data-dependent complexity": the quantity yᵀ(H^∞)⁻¹y

Poster #75 tonight

SLIDE 13

Explaining Generalization despite vast overparametrization

Theorem: For 1-Lipschitz loss,
  test error ≤ √( 2 yᵀ(H^∞)⁻¹y / # training samples ) + small terms

Corollary: Simple functions are provably learnable (e.g., linear functions and even-degree polynomials).

"Data-dependent complexity": the quantity yᵀ(H^∞)⁻¹y

Poster #75 tonight

Two interpretations of the data-dependent complexity:
  • "Distance to Init"
  • "Min RKHS norm for training labels"