Disentangling Trainability and Generalization In Deep Neural Networks
Lechao Xiao, Jeffrey Pennington and Samuel S. Schoenholz Google Brain Team, Google Research
Colab Tutorial
Two Fundamental Theoretical Questions in Deep Learning
A trade-off between trainability and generalization for very deep and very wide neural networks.
Deep Neural Networks

A deep network computes a function f(x; θ) of an input x with parameters θ. Under gradient descent with mean squared error, its training dynamics are tracked through training by the Neural Tangent Kernel (NTK; Jacot et al., 2018).
Neural Tangent Kernel (NTK): Function Space

Training dynamics (on the training set, MSE loss, learning rate η):
f_t(X_train) = (I − e^{−ηΘt}) Y_train + e^{−ηΘt} f_0(X_train)

Learning dynamics (mean prediction at a test point x):
μ_t(x) = Θ(x, X_train) Θ(X_train, X_train)^{−1} (I − e^{−ηΘt}) Y_train

(Credit: Roman Novak.) Finite-width networks closely agree with these infinite-width dynamics.
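The closed-form NTK training and learning dynamics can be sketched numerically. A minimal NumPy illustration, assuming a toy 1-D dataset and an RBF kernel as a stand-in for the NTK Θ (the real kernel would come from a library such as neural-tangents):

```python
import numpy as np

# Toy 1-D dataset; an RBF kernel stands in for the NTK Theta (an assumption for
# illustration only; it is symmetric PSD, which is all the dynamics require).
X = np.linspace(-1.0, 1.0, 8)[:, None]
Y = np.sin(np.pi * X)

def kernel(A, B):
    """Placeholder symmetric PSD kernel playing the role of Theta."""
    return np.exp(-((A - B.T) ** 2) / 0.1)

Theta = kernel(X, X)
eta, t = 1.0, 50.0

# expm(-eta * Theta * t) via the eigendecomposition of the symmetric Theta.
lam, Q = np.linalg.eigh(Theta)
decay = Q @ np.diag(np.exp(-eta * lam * t)) @ Q.T

# Training dynamics: f_t(X_train) = (I - decay) Y_train + decay f_0(X_train).
f0 = np.zeros_like(Y)
f_train = (np.eye(len(X)) - decay) @ Y + decay @ f0

# Learning dynamics (mean prediction) at a test point x.
x = np.array([[0.3]])
mu_t = kernel(x, X) @ np.linalg.solve(Theta, (np.eye(len(X)) - decay) @ Y)
```

Each eigenmode of Θ decays independently, which is why the matrix exponential reduces to scalar exponentials of the eigenvalues.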
Training Dynamics: Eigendecomposition

Write Θ = Σ_i λ_i q_i q_iᵀ. The error along eigenvector q_i decays at rate e^{−ηλ_i t}, so the mode with the smallest eigenvalue λ_min converges slowest.

Trainability metric: the condition number κ = λ_max / λ_min.
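The trainability metric is easy to compute for any kernel Gram matrix. A sketch with two toy kernels (hypothetical values, chosen only to contrast the well- and ill-conditioned cases):

```python
import numpy as np

def condition_number(Theta):
    """Trainability metric kappa = lambda_max / lambda_min of an NTK Gram matrix."""
    lam = np.linalg.eigvalsh(Theta)   # ascending eigenvalues of a symmetric matrix
    return lam[-1] / lam[0]

def mode_decay(Theta, eta, t):
    """Per-mode residual factors exp(-eta * lam_i * t); smallest lam_i is slowest."""
    lam = np.linalg.eigvalsh(Theta)
    return np.exp(-eta * lam * t)

# Near-identity kernel (well conditioned) vs. near-constant kernel (ill conditioned).
near_identity = np.eye(4) + 0.01 * np.ones((4, 4))
near_constant = np.ones((4, 4)) + 0.01 * np.eye(4)

kappa_id = condition_number(near_identity)     # ~1.04: all modes learn quickly
kappa_const = condition_number(near_constant)  # 401: the slowest mode barely moves
```

With κ near 1 every mode converges at essentially the same rate; a large κ means some directions of the training error take far longer to fit than others.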
[Figure: an 8-layer finite-width FCN on CIFAR-10, comparing σ_w² = 25 and σ_w² = 0.5.]
Mean Prediction

Learning dynamics: as t → ∞, the mean prediction at a test point converges to P(Θ) Y_train, where P(Θ) = Θ(x_test, X_train) Θ(X_train, X_train)^{−1}.

Generalization metric: P(Θ) Y_train. The network cannot generalize if P(Θ) Y_train becomes completely independent of the inputs.
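A sketch of the generalization metric, using hypothetical kernel blocks in place of a real NTK; it also shows the failure mode where identical kernel rows make the predictor input-independent:

```python
import numpy as np

def mean_prediction(theta_test_train, theta_train_train, y_train):
    """P(Theta) Y_train = Theta(x_test, X_train) Theta(X_train, X_train)^{-1} Y_train."""
    return theta_test_train @ np.linalg.solve(theta_train_train, y_train)

# A generic PSD train-train kernel block (hypothetical values, for illustration).
rng = np.random.default_rng(0)
A = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
theta_tt = A @ A.T
y = np.array([[1.0], [-1.0], [1.0]])

# If every test input produces the same kernel row, P(Theta) Y_train is constant:
# the predictor no longer depends on its input and cannot generalize.
rows = np.tile([[0.2, 0.3, 0.1]], (2, 1))   # two "different" test points, same row
preds = mean_prediction(rows, theta_tt, y)  # identical predictions for both
```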
NTK Condition Number and Mean Prediction in Deep Neural Networks

The convergence of Θ^(l) with depth l is determined by χ₁, a bivariate function defined on the (σ_w², σ_b²)-plane. Two limiting regimes emerge:

Ordered phase (χ₁ < 1): Θ^(l) → Θ* = C 1 1ᵀ, so κ^(l) → ∞ while P(Θ^(l)) Y_train → C_test. Difficult to train, but generalizable.

Chaotic phase (χ₁ > 1): Θ^(l) → ∞ with the diagonal dominating, so κ^(l) → 1 while P(Θ^(l)) Y_train → 0. Easy to train, but cannot generalize.
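The two depth limits can be sketched with toy stand-ins for Θ^(l) (assumptions: a rank-one constant matrix with a tiny ridge eps for the ordered limit, a scaled identity for the chaotic limit, and arbitrary label values):

```python
import numpy as np

n = 4
Y = np.array([[1.0], [-1.0], [1.0], [-1.0]])

# Ordered-phase limit: Theta* = C * 1 1^T (a tiny ridge keeps it invertible here).
C, eps = 2.0, 1e-6
ordered = C * np.ones((n, n)) + eps * np.eye(n)
lam = np.linalg.eigvalsh(ordered)
kappa_ordered = lam[-1] / lam[0]        # blows up as eps -> 0: hard to train

# Chaotic-phase limit: the diagonal dominates, Theta ~ c * I.
chaotic = 5.0 * np.eye(n)
lam = np.linalg.eigvalsh(chaotic)
kappa_chaotic = lam[-1] / lam[0]        # exactly 1: every mode trains at one rate

# In the chaotic limit the test-train block vanishes relative to the diagonal,
# so the mean prediction P(Theta) Y_train -> 0, independent of the data.
theta_test_train = np.zeros((1, n))
pred_chaotic = theta_test_train @ np.linalg.solve(chaotic, Y)
```

This makes the trade-off concrete: the constant kernel is maximally ill-conditioned but keeps data dependence, while the identity-like kernel is perfectly conditioned but predicts zero everywhere.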
Easy to Train, but Not Generalizable
[Figure: chaotic phase, σ_w² = 25, σ_b² = 0, l = 8.]

Difficult to Train, but Generalizable
[Figure: ordered phase, σ_w² = 0.5.]
Summary: a trade-off between trainability and generalization for very deep and wide networks.
Colab Tutorial