Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate
Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu Johns Hopkins University & UCLA
November 2020
Overview
CIFAR-10, ResNet-18, w/o weight decay, w/o data augmentation
Wu, Jingfeng, et al. "On the Noisy Gradient Descent that Generalizes as SGD." ICML 2020.
SGD update: $w \leftarrow w - \eta \nabla \ell_k(w)$; GD update: $w \leftarrow w - \eta \nabla L_S(w)$.
[Figure: test accuracy (%) vs. iteration; GD (66.96), GLD const (66.66), GLD dynamic (69.25), GLD diag (67.96), SGD (75.21)]
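A toy side-by-side of the two update rules (a sketch, not the paper's CIFAR-10 experiment; the least-squares objective and all hyperparameters below are illustrative):

```python
# Toy comparison of the GD and SGD update rules on least squares.
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, T = 20, 100, 0.01, 5000
X = rng.normal(size=(d, n))                   # columns x_1, ..., x_n
w_star = rng.normal(size=d)
y = X.T @ w_star                              # labels

def loss(w):                                  # L_S(w) = (1/2n) ||X^T w - y||^2
    return 0.5 * np.mean((X.T @ w - y) ** 2)

w_gd, w_sgd = np.zeros(d), np.zeros(d)
for t in range(T):
    w_gd -= eta * X @ (X.T @ w_gd - y) / n                # GD:  w <- w - eta * grad L_S(w)
    k = rng.integers(n)                                   # SGD: draw one index k per step
    w_sgd -= eta * X[:, k] * (X[:, k] @ w_sgd - y[k])     #      w <- w - eta * grad l_k(w)

print("GD  loss:", loss(w_gd))                # both drive the training loss down; the paths differ
print("SGD loss:", loss(w_sgd))
```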
Wilson, Ashia C., et al. "The marginal value of adaptive gradient methods in machine learning." NIPS 2017. Zhu, Zhanxing, et al. "The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects." ICML 2019.
Gradient Flow (GF): $dW_t = -\nabla L_S(W_t)\,dt$
Stochastic Modified Equation (SME): $dW_t = -\nabla L_S(W_t)\,dt + (\eta\,\Sigma(W_t))^{1/2}\,dB_t$, where the diffusion term is a higher-order correction that vanishes as $\eta \to 0$.
Heuristic: one SGD step is $w \leftarrow w - \eta \nabla L_S(w) + \eta\,(\nabla L_S(w) - \nabla \ell_k(w))$; accumulated over $1/\eta$ steps, the noise has variance $O(\eta)$, i.e. standard deviation $O(\sqrt{\eta})$.
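A worked version of the scaling heuristic (the unit-time horizon of $1/\eta$ steps, the independence of the noise terms, and the covariance symbol $\Sigma$ are conventions assumed here for illustration):

```latex
\begin{align*}
w_{t+1} &= w_t - \eta \nabla L_S(w_t) + \eta\,\epsilon_t,
\qquad \epsilon_t := \nabla L_S(w_t) - \nabla \ell_{k_t}(w_t),
\quad \mathbb{E}[\epsilon_t] = 0,\ \operatorname{Var}[\epsilon_t] = \Sigma, \\
\operatorname{Var}\!\Big[\eta \textstyle\sum_{t=1}^{1/\eta} \epsilon_t\Big]
  &= \eta^2 \cdot \tfrac{1}{\eta} \cdot \Sigma = \eta\,\Sigma
  \;\Longrightarrow\; \text{noise per unit time} = O(\sqrt{\eta}), \\
dW_t &= -\nabla L_S(W_t)\,dt + \big(\eta\,\Sigma(W_t)\big)^{1/2} dB_t
\qquad \text{(SME)}.
\end{align*}
```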
For a quadratic $L_S(w) = 0.5\, w^\top H w$, GD gives $w_{t+1} = (I - \eta H)\, w_t$: it fails to converge if $\eta \ge 2/\lambda_{\max}(H)$ and converges if $\eta < 2/\lambda_{\max}(H)$.
Two-sample example: $\ell_1(w) = 0.5\, w^\top H_1 w$ with $H_1 = \mathrm{diag}(2\lambda, 0)$, and $\ell_2(w) = 0.5\, w^\top H_2 w$ with $H_2 = \mathrm{diag}(0, 2)$, so $L_S = (\ell_1 + \ell_2)/2 = 0.5\, w^\top H w$ with $H = \mathrm{diag}(\lambda, 1)$, where $\lambda > 2$.
Moderate LR schedule: $\eta_t = 1.1/\lambda$ for $t = 1, \dots, T_1$, then $\eta_t = 0.1/\lambda$ for $t = T_1 + 1, \dots, T_2$.
Small LR schedule: $\eta_t = 0.1/\lambda$ for all $t = 1, \dots, T_2$.
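A quick simulation of this example, as a sketch ($\lambda = 10$ and the phase lengths $T_1 = 200$, $T_2 = 500$ are illustrative choices, not the slide's exact values):

```python
# Simulate SGD on the 2D example and track from which direction it approaches w* = 0.
import numpy as np

rng = np.random.default_rng(0)
lam, T1, T2 = 10.0, 200, 500
H1 = np.diag([2 * lam, 0.0])              # Hessian of l_1
H2 = np.diag([0.0, 2.0])                  # Hessian of l_2; L_S = (l_1 + l_2)/2 has H = diag(lam, 1)

def run_sgd(lr_schedule):
    w = np.array([1.0, 1.0])
    for t in range(T2):
        Hk = H1 if rng.integers(2) == 0 else H2        # pick l_1 or l_2 uniformly
        w = w - lr_schedule(t) * (Hk @ w)
    return w / np.linalg.norm(w)                       # direction of the final residual

moderate = lambda t: 1.1 / lam if t < T1 else 0.1 / lam   # moderate LR, then annealed
small = lambda t: 0.1 / lam                               # small LR throughout

print("moderate LR direction:", run_sgd(moderate))     # ~ (+-1, 0): large-eigenvalue direction
print("small LR direction:   ", run_sgd(small))        # ~ (0, 1):   small-eigenvalue direction
```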
Setups
Over-parameterized linear regression: data matrix $X = (x_1, \dots, x_n)$ with WLOG $\|x_i\|_2^2 \in (0, 1]$; starting from zero initialization, the GD/SGD iterates stay in $\mathcal{H}_S$, the span of $X$.
Theorem 0: whenever they converge, GD and SGD reach the same point, the minimizer of $(w - w^*)^\top X X^\top (w - w^*)$ over the span of $X$.
Remark: this is also known as the "minimal-norm solution", since the initialization is usually zero.
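A compact check of this remark (toy sizes; `numpy.linalg.pinv` returns the minimal-norm solution directly, so it serves as the reference point):

```python
# GD from zero initialization on over-parameterized least squares ends at the minimal-norm solution.
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 20                                   # over-parameterized: d > n
X = rng.normal(size=(d, n))                     # columns x_1, ..., x_n
y = rng.normal(size=n)

w_minnorm = np.linalg.pinv(X.T) @ y             # minimal-norm solution of X^T w = y

w = np.zeros(d)                                 # GD from zero initialization
eta = 1.0 / np.linalg.norm(X, 2) ** 2           # stable step for L_S(w) = 0.5 * ||X^T w - y||^2
for _ in range(2000):
    w -= eta * X @ (X.T @ w - y)

print(np.linalg.norm(w - w_minnorm))            # ~ 0: GD recovers the minimal-norm solution
```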
Theorem 1 (informal): Consider SGD with moderate LR,
$\eta_t = \bar{\eta} \in \big(\tfrac{1}{\mu_1} + o(1),\ \tfrac{1}{\mu_2} - o(1)\big)$ for $t = 1, \dots, T_1$, and $\eta_t = o(1)$ for $t = T_1 + 1, \dots, T_2$;
then $\frac{P(w_t - w^*)}{\|P(w_t - w^*)\|_2} \to v_1 \pm o(1)$, where $P$ denotes the projection onto the span of $X$.
Theorem 2 (informal): Consider GD with moderate or small LR,
$\eta_t \in \big(0,\ \tfrac{2}{\mu_1} - o(1)\big)$ for $t = 1, \dots, T_2$;
then $\frac{P(w_t - w^*)}{\|P(w_t - w^*)\|_2} \to v_n \pm o(1)$.
Rayleigh quotient: $R(XX^\top, v) = \frac{v^\top XX^\top v}{v^\top v}$; equivalently, the Rayleigh quotient of the normalized residual tends to $\mu_1$ (Theorem 1) or $\mu_n$ (Theorem 2).
Remark: $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_n$ are the (nonzero) eigenvalues of $XX^\top$, and $v_1$ ($v_n$) is the eigenvector associated with the largest (smallest) of them.
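A numerical illustration of the GD statement (Theorem 2); the SGD statement depends on the paper's assumptions and is not attempted here. The spectrum of $XX^\top$ is fixed to $1, \dots, 20$, and the sizes, seed, and step count are illustrative choices:

```python
# GD with a small learning rate approaches the minimizer along the smallest eigendirection of XX^T.
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 20
U, _ = np.linalg.qr(rng.normal(size=(d, n)))        # columns: eigenvectors v_n, ..., v_1 of XX^T
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
mu = np.arange(1.0, n + 1.0)                        # nonzero eigenvalues mu_n = 1 < ... < mu_1 = 20
X = U @ np.diag(np.sqrt(mu)) @ V.T                  # data matrix with columns x_1, ..., x_n

w_star = rng.normal(size=d)
y = X.T @ w_star                                    # noiseless labels
w_tilde = np.linalg.pinv(X.T) @ y                   # minimal-norm interpolator (Theorem 0)

w, eta = np.zeros(d), 0.9 / mu[-1]                  # GD with a small LR (< 2 / mu_1)
for _ in range(400):
    w -= eta * X @ (X.T @ w - y)                    # gradient of 0.5 * ||X^T w - y||^2

r = w - w_tilde                                     # equals the projection of w - w^* onto span(X)
r /= np.linalg.norm(r)
print("alignment with v_n:", abs(U[:, 0] @ r))      # ~ 1: GD approaches along v_n
print("alignment with v_1:", abs(U[:, -1] @ r))     # ~ 0
print("Rayleigh quotient :", r @ (X @ (X.T @ r)))   # ~ mu_n = 1
```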
Excess risk decomposition:
$L_D(w_{\mathrm{alg}}) - \inf_w L_D(w) = \big[\,L_D(w_{\mathrm{alg}}) - \inf_{w_0 \in \mathcal{H}_S} L_D(w_0)\,\big] + \big[\,\inf_{w_0 \in \mathcal{H}_S} L_D(w_0) - \inf_w L_D(w)\,\big]$
The first bracket is $\Delta(w_{\mathrm{alg}})$, the estimation error, determined by the algorithm and its hyperparameters; the second bracket is the approximation error, an intrinsic error that is not improvable.
Best achievable estimation error: $\Delta^* = \min_{w \in \mathcal{H}_S} \Delta(w)$.
Theorem 3 (informal): under the setup and learning-rate schedules above,
$\Delta(w_{\mathrm{SGD}}) \le (1 + o(1)) \cdot \Delta^*$, whereas $\Delta(w_{\mathrm{GD}}) \ge \big(\tfrac{\mu_1}{\mu_n} - o(1)\big) \cdot \Delta^*$.
Remark: $\mu_1$ ($\mu_n$) is the largest (smallest) eigenvalue of $XX^\top$.
Proof sketch (GD): with $w^* = 0$ WLOG, $L_S(w) = w^\top X X^\top w$, so
$w_{t+1} = (I - 2\eta\, X X^\top)\, w_t \;\Rightarrow\; w_t = (I - 2\eta\, X X^\top)^t\, w_0$;
in the eigenbasis of $XX^\top$ the coordinates evolve as $c_t = (I - 2\eta \Lambda)^t\, c_0$, with
$(1 - 2\eta\mu_1)^t \ll \cdots \ll (1 - 2\eta\mu_n)^t \ll 1$,
so the component along $v_n$ decays slowest and the normalized residual aligns with $v_n$.
Proof sketch (SGD): each step multiplies the residual by a random rank-one perturbation of the identity,
$w_{t+1} = \big(I - 2\eta\, \tilde{x}_t \tilde{x}_t^\top\big)\, w_t \;\Rightarrow\; w_{t+1} = \prod_{s \le t} \big(I - 2\eta\, \tilde{x}_s \tilde{x}_s^\top\big)\, w_1$,
where $\tilde{x}_s$ is the sample drawn at step $s$; analyze this random matrix product as projected onto $X = (x_1, \dots, x_n)$ and its orthogonal complement.
Remark: with moderate LR, the random matrix product concentrates.
Henriksen, Amelia, and Rachel Ward. "Concentration inequalities for random matrix products." Linear Algebra and its Applications 594 (2020): 81-94.
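A small sketch of the projection step above (toy sizes; the unit-norm columns match the WLOG normalization in the setup, and $\eta$ is an illustrative choice):

```python
# Each factor (I - 2*eta*x x^T) acts as the identity on the orthogonal complement of span(X),
# so the random product only needs to be analyzed on the n-dimensional span.
import numpy as np

rng = np.random.default_rng(3)
d, n, eta, steps = 30, 5, 0.3, 200
X = rng.normal(size=(d, n))
X /= np.linalg.norm(X, axis=0)                    # columns x_1, ..., x_n with ||x_i|| = 1

P = np.eye(d)                                     # accumulate the random matrix product
for _ in range(steps):
    x = X[:, rng.integers(n)]
    P = (np.eye(d) - 2 * eta * np.outer(x, x)) @ P

Q, _ = np.linalg.qr(X, mode="complete")
span, comp = Q[:, :n], Q[:, n:]                   # orthonormal bases of span(X) and its complement
print("restricted to complement ~ identity:", np.allclose(comp.T @ P @ comp, np.eye(d - n)))
print("cross block ~ 0:", np.allclose(comp.T @ P @ span, 0))
```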
Get the paper ->