
Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets? Zhiyuan Li, joint work with Sanjeev Arora and Yi Zhang. Princeton University. August 19, 2020 @ IJTCS.


  1. Why Are Convolutional Nets More Sample-Efficient than Fully-Connected Nets? Zhiyuan Li, joint work with Sanjeev Arora and Yi Zhang. Princeton University. August 19, 2020 @ IJTCS.

  2. Table of Contents: 1. Introduction; 2. Intuition and Warm-up Example; 3. Identifying Algorithmic Equivariance; 4. Lower Bound for Equivariant Algorithms.

  3. Introduction. CNNs (convolutional neural networks) often perform better than their fully connected counterparts, FC nets, especially on vision tasks.

  4. Introduction. CNNs (convolutional neural networks) often perform better than their fully connected counterparts, FC nets, especially on vision tasks. Not an issue of expressiveness: FC nets easily get full training accuracy but still generalize poorly.

  5. Introduction. CNNs (convolutional neural networks) often perform better than their fully connected counterparts, FC nets, especially on vision tasks. Not an issue of expressiveness: FC nets easily get full training accuracy but still generalize poorly. Often explained by a "better inductive bias". Example: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm.

  6. Introduction. CNNs (convolutional neural networks) often perform better than their fully connected counterparts, FC nets, especially on vision tasks. Not an issue of expressiveness: FC nets easily get full training accuracy but still generalize poorly. Often explained by a "better inductive bias". Example: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm. Question: can we justify this rigorously by showing a sample-complexity separation?

  7. Introduction. CNNs often perform better than FC nets, especially on vision tasks. Often explained by a "better inductive bias". Example: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm. Question: can we justify this rigorously by showing a sample-complexity separation? Since ultra-wide FC nets can simulate any CNN, the hurdle is to show that (S)GD on an FC net does not learn those CNNs with good generalization.

  8. Introduction. CNNs often perform better than FC nets, especially on vision tasks. Often explained by a "better inductive bias". Example: over-parametrized linear regression has multiple solutions, and GD (gradient descent) initialized from 0 picks the one with minimum ℓ2 norm. Question: can we justify this rigorously by showing a sample-complexity separation? Since ultra-wide FC nets can simulate any CNN, the hurdle is to show that (S)GD on an FC net does not learn those CNNs with good generalization. This Work: a single distribution plus a single target function that a CNN can learn with O(1) samples, but that SGD on FC nets of any depth and width requires Ω(d²) samples to learn.
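To make the inductive-bias example above concrete, here is a minimal numerical sketch (an editor's illustration, not from the slides) checking that gradient descent initialized at 0 on an over-parametrized least-squares problem converges to the minimum-ℓ2-norm interpolating solution, i.e. the pseudo-inverse solution:

```python
# Minimal sketch (not from the slides): GD initialized at 0 on an
# over-parametrized least-squares problem converges to the minimum l2-norm
# interpolating solution, i.e. the pseudo-inverse solution pinv(X) @ y.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than dimensions: many interpolating solutions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                      # initializing at 0 is what selects the min-norm solution
lr = 1e-2
for _ in range(50_000):
    w -= lr * X.T @ (X @ w - y) / n  # gradient of (1/2n) * ||X w - y||^2

w_min_norm = np.linalg.pinv(X) @ y   # minimum l2-norm interpolant
print("train residual:", np.max(np.abs(X @ w - y)))                      # ~ 0
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0
```

Because the iterates start at 0 and every gradient lies in the row space of X, GD can only converge to the interpolant of minimum ℓ2 norm; that is the "inductive bias" the slide refers to.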

  9. Introduction: Setting. Binary classification: Y = {−1, 1}; data domain X = R^d.

  10. Introduction: Setting. Binary classification: Y = {−1, 1}; data domain X = R^d. Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk P_{Y|X} is always a deterministic function h*: R^d → {−1, 1}, i.e. P = P_X ⋄ h*.

  11. Introduction: Setting. Binary classification: Y = {−1, 1}; data domain X = R^d. Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk P_{Y|X} is always a deterministic function h*: R^d → {−1, 1}, i.e. P = P_X ⋄ h*. A learning algorithm A maps a sequence of training data {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n to a hypothesis A({(x_i, y_i)}_{i=1}^n) ∈ Y^X. A could also be random.

  13. Introduction: Setting. Binary classification: Y = {−1, 1}; data domain X = R^d. Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk P_{Y|X} is always a deterministic function h*: R^d → {−1, 1}, i.e. P = P_X ⋄ h*. A learning algorithm A maps a sequence of training data {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n to a hypothesis A({(x_i, y_i)}_{i=1}^n) ∈ Y^X. A could also be random. Two examples: kernel regression and ERM (empirical risk minimization).

  14. Introduction: Setting. Binary classification: Y = {−1, 1}; data domain X = R^d. Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk P_{Y|X} is always a deterministic function h*: R^d → {−1, 1}, i.e. P = P_X ⋄ h*. A learning algorithm A maps a sequence of training data {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n to a hypothesis A({(x_i, y_i)}_{i=1}^n) ∈ Y^X. A could also be random. Two examples, kernel regression and ERM (empirical risk minimization): REG_K({(x_i, y_i)}_{i=1}^n)(x) := 1[K(x, X_n) · K(X_n, X_n)^† y ≥ 0].

  15. Introduction: Setting. Binary classification: Y = {−1, 1}; data domain X = R^d. Joint distribution P supported on X × Y = R^d × {−1, 1}. In this talk P_{Y|X} is always a deterministic function h*: R^d → {−1, 1}, i.e. P = P_X ⋄ h*. A learning algorithm A maps a sequence of training data {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n to a hypothesis A({(x_i, y_i)}_{i=1}^n) ∈ Y^X. A could also be random. Two examples, kernel regression and ERM (empirical risk minimization): REG_K({(x_i, y_i)}_{i=1}^n)(x) := 1[K(x, X_n) · K(X_n, X_n)^† y ≥ 0]; ERM_H({(x_i, y_i)}_{i=1}^n) = argmin_{h ∈ H} Σ_{i=1}^n 1[h(x_i) ≠ y_i]. (Strictly speaking, ERM_H is not a well-defined algorithm; in this talk we consider the worst performance over all empirical risk minimizers in H.)
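As a concrete reading of the REG_K definition above, here is a small sketch (an editor's illustration, not code from the talk; the RBF kernel and the mapping of the indicator to {−1, +1} are assumptions):

```python
# Sketch (not from the talk) of the REG_K classifier defined above:
# kernel "ridgeless" regression on the training labels via a pseudo-inverse,
# thresholded at 0. Kernel choice and {-1, +1} output convention are assumptions.
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2); any positive semi-definite kernel works here
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq_dists)

def reg_k(X_n, y, x, kernel=rbf_kernel):
    # REG_K({(x_i, y_i)}_{i=1}^n)(x) = 1[ K(x, X_n) · K(X_n, X_n)^† y >= 0 ], mapped to {-1, +1}
    alpha = np.linalg.pinv(kernel(X_n, X_n)) @ y
    score = kernel(x[None, :], X_n) @ alpha
    return 1 if score.item() >= 0 else -1

# Toy usage: the target h* is the sign of the first coordinate.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 5))
y_train = np.sign(X_train[:, 0])
x_test = rng.standard_normal(5)
print(reg_k(X_train, y_train, x_test), "vs true label", int(np.sign(x_test[0])))
```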

  16. Introduction: Setting. err_P(h) = Pr_{(X,Y)∼P}[h(X) ≠ Y]. Sample complexity for a single joint distribution P: the (ε, δ)-sample complexity, denoted N(A, P, ε, δ), is the smallest number n such that with probability 1 − δ over the randomness of {(x_i, y_i)}_{i=1}^n, err_P(A({(x_i, y_i)}_{i=1}^n)) ≤ ε. We also define the ε-expected sample complexity, N*(A, P, ε), as the smallest number n such that E_{(x_i, y_i)∼P}[err_P(A({(x_i, y_i)}_{i=1}^n))] ≤ ε.

  17. Introduction: Setting. err_P(h) = Pr_{(X,Y)∼P}[h(X) ≠ Y]. Sample complexity for a single joint distribution P: the (ε, δ)-sample complexity, denoted N(A, P, ε, δ), is the smallest number n such that with probability 1 − δ over the randomness of {(x_i, y_i)}_{i=1}^n, err_P(A({(x_i, y_i)}_{i=1}^n)) ≤ ε. We also define the ε-expected sample complexity, N*(A, P, ε), as the smallest number n such that E_{(x_i, y_i)∼P}[err_P(A({(x_i, y_i)}_{i=1}^n))] ≤ ε. Sample complexity for a family of distributions 𝒫: N(A, 𝒫, ε, δ) = max_{P ∈ 𝒫} N(A, P, ε, δ) and N*(A, 𝒫, ε) = max_{P ∈ 𝒫} N*(A, P, ε). Fact: N*(A, 𝒫, ε + δ) ≤ N(A, 𝒫, ε, δ) ≤ N*(A, 𝒫, εδ), for all ε, δ ∈ [0, 1].
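To illustrate how these definitions can be read operationally, here is a toy sketch (entirely an editor's construction: the distribution P, the stand-in learner A, and the grid of n values are all assumptions, not from the talk) that Monte Carlo-estimates E[err_P(A({(x_i, y_i)}_{i=1}^n))] as a function of n, giving an empirical proxy for the expected sample complexity N*(A, P, ε):

```python
# Toy sketch (all choices are assumptions): Monte Carlo estimate of
# E[err_P(A({(x_i, y_i)}_{i=1}^n))] as a function of n, an empirical proxy
# for the expected sample complexity N*(A, P, eps).
import numpy as np

rng = np.random.default_rng(0)
d = 5
h_star = lambda X: np.sign(X[:, 0])              # deterministic labels, so P = P_X ⋄ h*

def sample_P(n):
    X = rng.standard_normal((n, d))              # P_X = N(0, I_d) (an assumption)
    return X, h_star(X)

def A(X_train, y_train):
    # Stand-in learner: 1-nearest-neighbor classifier (just to have a concrete A)
    def h(X):
        nearest = np.argmin(((X[:, None, :] - X_train[None, :, :]) ** 2).sum(-1), axis=1)
        return y_train[nearest]
    return h

def expected_err(n, trials=50, test_size=2000):
    errs = []
    for _ in range(trials):
        X_tr, y_tr = sample_P(n)
        X_te, y_te = sample_P(test_size)         # fresh samples to estimate err_P
        errs.append(np.mean(A(X_tr, y_tr)(X_te) != y_te))
    return float(np.mean(errs))

eps = 0.1
for n in [5, 10, 20, 40, 80, 160, 320]:
    e = expected_err(n)
    print(f"n = {n:4d}   estimated E[err_P] = {e:.3f}")
    if e <= eps:
        print(f"empirical proxy for N*(A, P, {eps}): about {n}")
        break
```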

  18. Introduction: Parametric Models. A parametric model M: W → Y^X is a functional mapping from a weight W ∈ W to a hypothesis M(W): X → Y.
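For concreteness, a minimal sketch (an editor's illustration, not the talk's) of a parametric model in this sense: a one-hidden-layer fully connected net, where a fixed weight W determines a single hypothesis M(W): R^d → {−1, +1}.

```python
# Minimal sketch (assumption): a parametric model M : W -> Y^X, where W is the
# weight space of a one-hidden-layer ReLU net; a fixed weight W yields one
# hypothesis M(W) : R^d -> {-1, +1}.
import numpy as np

def M(W):
    W1, b1, w2 = W                                # shapes: (width, d), (width,), (width,)
    def h(x):
        hidden = np.maximum(W1 @ x + b1, 0.0)     # ReLU layer
        return 1 if w2 @ hidden >= 0 else -1      # sign readout into Y = {-1, +1}
    return h

d, width = 10, 32
rng = np.random.default_rng(0)
W = (rng.standard_normal((width, d)),
     rng.standard_normal(width),
     rng.standard_normal(width))
h = M(W)                                          # a concrete hypothesis X -> Y
print(h(rng.standard_normal(d)))
```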
