Direction Matters: On the Implicit Regularization Effect of - PowerPoint PPT Presentation

Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate Jingfeng Wu , Difan Zou, Vladimir Braverman, Quanquan Gu Johns Hopkins University & UCLA November 2020

Overview • Background • SGD vs. GD: Different Convergence Directions • Small Learning Rate • Moderate Learning Rate • Direction Matters: SGD + Moderate LR is Good • Proof Sketches

<latexit sha1_base64="KXCQlbNA4tNvFg4knZExHyj1hdI=">ACL3icbVDLSgMxFM34rPVdekmWIS6KTOC6KZQdKELFxXtAzp1yKSZNjTJDElGKeP8h9/gzh9wq38gbsSV4F+YabvQ6oHA4dxzOTfHjxhV2rbfrJnZufmFxdxSfnldW29sLHZUGEsManjkIWy5SNFGBWkrqlmpBVJgrjPSNMfnGTz5g2RiobiSg8j0uGoJ2hAMdJG8gr757iRGDpds9WIFuIBFOnDQRKXRVzL2EVpz0WkCXMObRzJX3CkW7bI8A/xJnQorVU/jgene9mlf4dLshjkRGjOkVNuxI91JkNQUM5Lm3ViRCOEB6pG2oQJxojrJ6G8p3DVKFwahNE9oOFJ/biSIKzXkvnFypPtqepaJ/83asQ6OgkVUayJwOgIGZQhzArCnapJFizoSEIS2puhbiPTD3a1PkrhQ9HIVkxznQNf0ljv+wclO0L09AxGCMHtsEOKAEHIqOAM1UAcY3IMn8AxerEfr1Xq3PsbWGWuyswV+wfr6Bk6UqsE=</latexit> Implicit Regularization: SGD >> GD n L S ( w ) = 1 X Loss ` i ( w ) n i =1 GD <latexit sha1_base64="EarS2g+hOyhHCn1+yqIviVW4JU=">ACMnicbVA9bxNBEN1LCARDiIGSZoUVyS5s3UWyQhkBRQqKIOIPyWdZc3tz9sp7H9qdw7JO/iX8Avr8gbSJ+AEgGoRERZue9TkFtjPSrp7em5k3ekGmpCHX/e7s7D7Ye/ho/3HlydODZ4fV5y+6Js21wI5IVar7ARhUMsEOSVLYzRCHCjsBdN3S73GbWRaXJB8wyHMYwTGUkBZKlRtT3jvsKIQOt0xme8yQu/3FpoDBfcRwL7jzWE/MPI/yRA8fqsMarW3JZbFt8G3h2onTZuvzXf/x6Pqr+8cNU5DEmJBQYM/DcjIYFaJC4aLi5wYzEFMY48DCBGI0w6I8ZMGPLBPyKNX2JcRL9v+JAmJj5nFgO2OgidnUluR92iCn6M2wkEmWEyZiZRTlilPKl1nxUGoUpOYWgNDS3srFBDQIsomucTz0qRig/E2Y9gG3eOW1265H21Cb9mq9tkr9prVmcdO2Ck7Y+eswT7wq7YNbtxLp0fzi/n96p1x7mbecnWyvn7DzPzrjI=</latexit> w w � η r L S ( w ) SGD <latexit sha1_base64="NEVHOYBfO6Mb1VrYie/t9hYiKGM=">ACMXicbVBNT9tAEF0HaCGlJaXHXlaJKgWhRjaiao9RufRIJRIixVE0Xo+TVdYf2h03sqz8C078jf4BrvAPcqt6KV/go3DgQAj7erpvZl5oxdkShpy3aVT29refV6d6/+Zv/tu4PG+8O+SXMtsCdSlepBAaVTLBHkhQOMo0QBwovg9nZSr/8hdrINLmgIsNRDJNERlIAWrcOJ1zX2FEoHU653P+mZd+tbXUGC64jwT2n2gILVZqPGvPj/i40XI7blX8OfAeQKvb9I+vlt3ifNz454epyGNMSCgwZui5GY1K0CSFwkXdzw1mIGYwaGFCcRoRmV1x4J/skzIo1TblxCv2McTJcTGFHFgO2OgqXmqrciXtGFO0bdRKZMsJ0zE2ijKFaeUr6LiodQoSBUWgNDS3srFDQIsoFuMRFZVK3wXhPY3gO+icd70vH/WkT+s7Wtcs+siZrM49ZV32g52zHhPsmt2wW3bn/HaWzh/n7q15jzMfGAb5fy/BzjzrO8=</latexit> w w � ⌘ r ` k ( w ) CIFAR-10, ResNet-18, w/o weight decay, w/o data augmentation Wu, Jingfeng, et al. "On the Noisy Gradient Descent that Generalizes as SGD." ICML 2020.

Two More Figures about SGD (Less Relevant) 70 test DccurDcy (%) 60 50 40 GD (66.96) GLD const (66.66) 30 GLD GynDmLc (69.25) 20 GLD GLDg (67.96) 6GD (75.21) 10 0 2500 5000 7500 10000 12500 15000 17500 LterDtLon Wilson, Ashia C., et al. "The marginal value of adaptive gradient methods in machine learning." NIPS 2017. Zhu, Zhanxing, et al. "The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects." ICML 2019.

SGD vs. GD: Learning Rate Matters! Small LR Moderate LR GD L L SGD L J Q1: Small LR, SGD ≈ GD? Q2: Moderate LR, SGD >> GD? Q3: GD is bad anyhow?

In Theory, SGD ≈ GD or SGD ≠ GD ?? GD <latexit sha1_base64="EarS2g+hOyhHCn1+yqIviVW4JU=">ACMnicbVA9bxNBEN1LCARDiIGSZoUVyS5s3UWyQhkBRQqKIOIPyWdZc3tz9sp7H9qdw7JO/iX8Avr8gbSJ+AEgGoRERZue9TkFtjPSrp7em5k3ekGmpCHX/e7s7D7Ye/ho/3HlydODZ4fV5y+6Js21wI5IVar7ARhUMsEOSVLYzRCHCjsBdN3S73GbWRaXJB8wyHMYwTGUkBZKlRtT3jvsKIQOt0xme8yQu/3FpoDBfcRwL7jzWE/MPI/yRA8fqsMarW3JZbFt8G3h2onTZuvzXf/x6Pqr+8cNU5DEmJBQYM/DcjIYFaJC4aLi5wYzEFMY48DCBGI0w6I8ZMGPLBPyKNX2JcRL9v+JAmJj5nFgO2OgidnUluR92iCn6M2wkEmWEyZiZRTlilPKl1nxUGoUpOYWgNDS3srFBDQIsomucTz0qRig/E2Y9gG3eOW1265H21Cb9mq9tkr9prVmcdO2Ck7Y+eswT7wq7YNbtxLp0fzi/n96p1x7mbecnWyvn7DzPzrjI=</latexit> w w � η r L S ( w ) SGD <latexit sha1_base64="H8W0j+nYlbjAbB3JxTavAukChE=">ACeHicbVFNb9NAEF2bj5bwUReOXEZUiEYokV0JwbECDhw4FEHaSnFkjdfjdJX1h3bHRJGVH8SP4Qcgceml/4ALJzZODrTNSLt6eu/NzuhtWmtlOQx/ef6du/fu7+w+6D189PjJXrD/9NRWjZE0kpWuzHmKlrQqacSKNZ3XhrBINZ2lsw8r/ew7Gauq8hsvapoUOC1VriSyo5KgnUOsKWc0prDHAbQxt2raFsCTExuntqMIPSfxVobDeR9ew1ZfalDOiNstHYMNF5PWyWzFLJPgIByGXcFtEG3AwXH/z8/Bx98/TpLgKs4q2RUstRo7TgKa560aFhJTcte3Fiq3QY4pbGDJRZkJ235hJeOiaDvDLulAwd+39Hi4W1iyJ1zgL5wt7UVuQ2bdxw/m7SqrJumEq5HpQ3GriCVeKQKUOS9cIBlEa5XUFeoEuK3b9cm1IsuiE9F0x0M4b4PRoGL0Zhl9cQu/FunbFc/FCHIpIvBXH4pM4ESMhxaW34wXevfXB/+V319bfW/T80xcK/oH9F/wao=</latexit> w w � ⌘ r L S ( w ) + ⌘ ( r L S ( w ) � r ` k ( w )) Unbiased noise (scales with 𝜃 ) Theory disagrees with practice L “Easy” to prove SGD ≈ GD by concentration <= e.g., small LR • “Hard” to prove an inverse result <= no concentration! •

<latexit sha1_base64="gE32TAgCiehRkp6UP49CDdg+CSs=">ACcnicbVFNb9NAEF2bj7Yp0BRu5TIQIbWCRnalqlyQqsKBA4eikrZSHKzxep2sumu7u2OqyPIP4sfwA1B75hfAnY3TA0kZaW3b97MW71NSiUtBcFPz793/8HDldW1zvqjx082uptPT21RGS4GvFCFOU/QCiVzMSBJSpyXRqBOlDhLt7P+mfhLGyL/QtBQjeNcZpIjOSruVlEqM7iCd7AL0dhgCp/i6ISjgu2rHaghai1qI9IGWik18BrqBdpeGqojQdi4y4kca3SzX6MJqmxJOVtwFMTd3tBP2gL7oLwFvQOd/782P1w8/047v6K0oJXWuTEFVo7DIOSRjUaklyJphNVpTIL3Ashg7mqIUd1a1A68ck0JWGHdygpb9d6JGbe1UJ06pkSZ2uTcj/9cbVpS9HdUyLysSOZ8bZUCKmCWNaTSCE5q6gByI91bgU/QICf3IwsuetqadFw4XIMd8HpXj/c7wefXUJHbF6r7Dl7ybZyA7YIfvIjtmAcXbteV7HW/d+1v+C783l/re7cwztlD+m79aJMCi</latexit> Small Learning Rate: SGD ≈ GD 𝜃 = 𝑒𝑢 → 0 GD <latexit sha1_base64="EarS2g+hOyhHCn1+yqIviVW4JU=">ACMnicbVA9bxNBEN1LCARDiIGSZoUVyS5s3UWyQhkBRQqKIOIPyWdZc3tz9sp7H9qdw7JO/iX8Avr8gbSJ+AEgGoRERZue9TkFtjPSrp7em5k3ekGmpCHX/e7s7D7Ye/ho/3HlydODZ4fV5y+6Js21wI5IVar7ARhUMsEOSVLYzRCHCjsBdN3S73GbWRaXJB8wyHMYwTGUkBZKlRtT3jvsKIQOt0xme8yQu/3FpoDBfcRwL7jzWE/MPI/yRA8fqsMarW3JZbFt8G3h2onTZuvzXf/x6Pqr+8cNU5DEmJBQYM/DcjIYFaJC4aLi5wYzEFMY48DCBGI0w6I8ZMGPLBPyKNX2JcRL9v+JAmJj5nFgO2OgidnUluR92iCn6M2wkEmWEyZiZRTlilPKl1nxUGoUpOYWgNDS3srFBDQIsomucTz0qRig/E2Y9gG3eOW1265H21Cb9mq9tkr9prVmcdO2Ck7Y+eswT7wq7YNbtxLp0fzi/n96p1x7mbecnWyvn7DzPzrjI=</latexit> w w � η r L S ( w ) <latexit sha1_base64="pkyzSQktJPiPlv6NuoczmDRe2JA=">ACL3icbVC9SgNBGNzN8a/qKXNhyLEwnAXEG2EoBYWFhFNDORC2Nvbi0v2ftjdM4Qj7+ErWPoCtpYiyBiJfgA9m4uKUziwMIwMx+zjBNxJpVpvhlT0zOzc/OZhezi0vLKam5tvSrDWBaISEPRc3BknIW0IpitNaJCj2HU6vnfZJ37+pUKyMLhS3Yg2fNwKmMcIVlpq5oq2yzowBHsgd0S2IXzpn1JMId8ZxcSsNOKRFC3B2lU9Zq5bNgpoBJYg3Jdmn352Xv9P2+3Mx92W5IYp8GinAsZd0yI9VIsFCMcNrL2rGkESZt3KJ1TQPsU9lI0uIe7GjFBS8U+gUKUvXvRYJ9Kbu+o5M+Vjdy3OuL/3n1WHmHjYQFUaxoQAZFXsxBhdAfClwmKFG8qwkmgum/ArnBAhOl5xp8btpSVYPY43PMEmqxYK1XzAv9ELHaIAM2kRbKI8sdIBK6AyVUQURdIce0RN6Nh6MV+PD+BxEp4zhzQYagfH9C87Dq+Q=</latexit> d w = � r L S ( w )d t Gradient Flow (GF) d w = � r L S ( w )d t + p η Σ ( w ) 1 2 d B t Stochastic Modified Equation (SME) 𝜃 = 𝑒𝑢 → 0 Higher order term SGD <latexit sha1_base64="H8W0j+nYlbjAbB3JxTavAukChE=">ACeHicbVFNb9NAEF2bj5bwUReOXEZUiEYokV0JwbECDhw4FEHaSnFkjdfjdJX1h3bHRJGVH8SP4Qcgceml/4ALJzZODrTNSLt6eu/NzuhtWmtlOQx/ef6du/fu7+w+6D189PjJXrD/9NRWjZE0kpWuzHmKlrQqacSKNZ3XhrBINZ2lsw8r/ew7Gauq8hsvapoUOC1VriSyo5KgnUOsKWc0prDHAbQxt2raFsCTExuntqMIPSfxVobDeR9ew1ZfalDOiNstHYMNF5PWyWzFLJPgIByGXcFtEG3AwXH/z8/Bx98/TpLgKs4q2RUstRo7TgKa560aFhJTcte3Fiq3QY4pbGDJRZkJ235hJeOiaDvDLulAwd+39Hi4W1iyJ1zgL5wt7UVuQ2bdxw/m7SqrJumEq5HpQ3GriCVeKQKUOS9cIBlEa5XUFeoEuK3b9cm1IsuiE9F0x0M4b4PRoGL0Zhl9cQu/FunbFc/FCHIpIvBXH4pM4ESMhxaW34wXevfXB/+V319bfW/T80xcK/oH9F/wao=</latexit> w w � ⌘ r L S ( w ) + ⌘ ( r L S ( w ) � r ` k ( w )) <latexit sha1_base64="2YkqCO3ZcVtUvasN3QP5Gwy0bRM=">AC43icbVLihNBFK1uX2N8RWenm8JhQBCd0DGjTDMbFwIjmIyA6meUF190ymHj1V1Uo+wvcyWz9MH/Anf9gdSc+knih4HDuvecbndeCW5dknyP4mvXb9y8tXO7d+fuvfsP+g8fja2uDYMR0Kbs5xaEFzByHEn4KwyQGUu4DS/OG7px/BWK7VB7eoIJO0VHzGXWBmvZ/kBxKrjwLGrbpkTE1eEKgslxoleFXmFheSno+xIRsdKcpfo4JK7SzLfjNetKl8gaKhoCjDW5ltm/wpZLTHJevVrY7idOx82OMQCVfyJ+J6Xc0eN0Z9WGm3Kbn19214a5zuvpn295JB0hXeBukK7B3uP/65e5X6k/YwhWa1BOWYoNZO0qRymafGcSYgpKgtVJRd0BImASoqwWa+s2/wfmAKPNMmPOVwx/674am0diHzMCmpm9vNXkv+rzep3exl5rmqageKLY1mtcBO4/bj4oIbYE4sAqDM8JAVszk1lLnwC6y5yEVn0guHSTfPsA3Gw0H6YpC8Cxc6QsvaQU/QU/QMpegAHaLX6ASNEIveRCby0ecY4i/x1/hqORpHq51dtFbxt1/se1a</latexit> ( Var[ ✏ ] = � 2 ⇒ � = O ( √ ⌘ ) Var[ ✏ 1 + · · · + ✏ η ] = ⌘� 2 ∼ O � ⌘ 2 �

Direction Matters: On the Implicit Regularization Effect of - PowerPoint PPT Presentation

Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate Jingfeng Wu , Difan Zou, Vladimir Braverman, Quanquan Gu Johns Hopkins University & UCLA November 2020 Overview

Implicit Guarantees and Risk Taking: Implicit Guarantees and Risk Taking: Implicit Guarantees and

Implicit Bias Implicit bias Implicit bias refers to attitudes or stereotypes that affect our

Implicit Surfaces Implicit Surfaces An implicit surface is simply an iso-contour CIS 781 of a

S et the Bar Low. Be a WINNER every time. Public Power Matters Public Power Matters Innovation

10. Regularization More on tradeoffs Regularization Effect of using different norms

Regularization Overview Regularization Overview Problems & Multicollinearity We will

Introduction CSCE 970 CSCE 970 Lecture 3: Lecture 3: Regularization Regularization CSCE 970

Regularization Regularization is a general approach to add a complexity parameter to a

WINE & DINE Direction Munich Rotkreuz Direction Frankfurt Direction Basle Vorarlberg

Implicit Bias: Transcript Inclusive Teaching Series: Implicit Bias Welcome to the third module of

Implicit Extremes and Implicit MaxStable Laws Stilian Stoev ( sstoev@umich.edu ) University of

Multi-core Programming: Implicit Parallelism Tuukka Haapasalo April 16, 2009 Tuukka Haapasalo

Implicit Surfaces CPSC 599.86 / 601.86 Sonny Chan University of Calgary (some board work happened

Rational Phosphorus Rational Phosphorus Management in Biosolids Management in Biosolids

Implicit Bias and Race Mikah K. Thompson, Esq. Director of Affirmative Action & Adjunct

CS7015 (Deep Learning) : Lecture 8 Regularization: Bias Variance Tradeoff, l2 regularization,

Cognitive Control Signals for Neural Prosthetics S. Musallam, B. D. Coneil, B. Greger, H.

Wiener filtering 6.011, Spring 2018 Lec 20 1 Unconstrained Wiener filter structure - m x m y y

Lecture 21: Wiener Filter Mark Hasegawa-Johnson All content CC-SA 4.0 unless otherwise specified.

Wiener filtering illustrations 6.011, Spring 2018 Lec 21 1 Unconstrained Wiener filter

Online Assessment & Feedback: How to square the circle Dr Tara Brendle & Dr Andrew Wilson

MQXF Long Coil Fabrication Status and Plans at Fermilab Fred Nobrega HiLumi - LARP Collaboration

The Long and Winding Path to Secure Implementation of GlobalPlatform SCP10 Daniel De Almeida

A Series of Serendipitous Events: The Winding Path Toward Digital Literacy Richard E. Snow Award

Direction Matters: On the Implicit Regularization Effect of - PowerPoint PPT Presentation

Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate Jingfeng Wu , Difan Zou, Vladimir Braverman, Quanquan Gu Johns Hopkins University & UCLA November 2020 Overview

Implicit Guarantees and Risk Taking: Implicit Guarantees and Risk Taking: Implicit Guarantees and

Implicit Bias Implicit bias Implicit bias refers to attitudes or stereotypes that affect our

Implicit Surfaces Implicit Surfaces An implicit surface is simply an iso-contour CIS 781 of a

S et the Bar Low. Be a WINNER every time. Public Power Matters Public Power Matters Innovation

10. Regularization More on tradeoffs Regularization Effect of using different norms

Regularization Overview Regularization Overview Problems &amp; Multicollinearity We will

Introduction CSCE 970 CSCE 970 Lecture 3: Lecture 3: Regularization Regularization CSCE 970

Regularization Regularization is a general approach to add a complexity parameter to a

WINE &amp; DINE Direction Munich Rotkreuz Direction Frankfurt Direction Basle Vorarlberg

Implicit Bias: Transcript Inclusive Teaching Series: Implicit Bias Welcome to the third module of

Implicit Extremes and Implicit MaxStable Laws Stilian Stoev ( sstoev@umich.edu ) University of

Multi-core Programming: Implicit Parallelism Tuukka Haapasalo April 16, 2009 Tuukka Haapasalo

Implicit Surfaces CPSC 599.86 / 601.86 Sonny Chan University of Calgary (some board work happened

Rational Phosphorus Rational Phosphorus Management in Biosolids Management in Biosolids

Implicit Bias and Race Mikah K. Thompson, Esq. Director of Affirmative Action &amp; Adjunct

CS7015 (Deep Learning) : Lecture 8 Regularization: Bias Variance Tradeoff, l2 regularization,

Cognitive Control Signals for Neural Prosthetics S. Musallam, B. D. Coneil, B. Greger, H.

Wiener filtering 6.011, Spring 2018 Lec 20 1 Unconstrained Wiener filter structure - m x m y y

Lecture 21: Wiener Filter Mark Hasegawa-Johnson All content CC-SA 4.0 unless otherwise specified.

Wiener filtering illustrations 6.011, Spring 2018 Lec 21 1 Unconstrained Wiener filter

Online Assessment &amp; Feedback: How to square the circle Dr Tara Brendle &amp; Dr Andrew Wilson

MQXF Long Coil Fabrication Status and Plans at Fermilab Fred Nobrega HiLumi - LARP Collaboration

The Long and Winding Path to Secure Implementation of GlobalPlatform SCP10 Daniel De Almeida

A Series of Serendipitous Events: The Winding Path Toward Digital Literacy Richard E. Snow Award

Regularization Overview Regularization Overview Problems & Multicollinearity We will

WINE & DINE Direction Munich Rotkreuz Direction Frankfurt Direction Basle Vorarlberg

Implicit Bias and Race Mikah K. Thompson, Esq. Director of Affirmative Action & Adjunct

Online Assessment & Feedback: How to square the circle Dr Tara Brendle & Dr Andrew Wilson