On the Noisy Gradient Descent that Generalizes as SGD - PowerPoint PPT Presentation

  1. On the Noisy Gradient Descent that Generalizes as SGD. Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, Zhanxing Zhu. Johns Hopkins University, Missouri University of Science and Technology, Baidu Research, Styling AI, Peking University

  2. Stochastic gradient descent (SGD)
     Loss function: L(θ) = (1/n) Σ_{i=1}^{n} ℓ(x_i; θ)
     Minibatch gradient: g̃(θ_t) = (1/b) Σ_{i ∈ B_t} ∇_θ ℓ(x_i; θ_t)
     SGD update: θ_{t+1} = θ_t − η g̃(θ_t)
                        = θ_t − η ∇_θ L(θ_t)  [GD]  − η (g̃(θ_t) − ∇_θ L(θ_t))  [v_sgd(θ_t): (unbiased) gradient noise]
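The update on this slide can be sketched numerically. A minimal example, assuming an illustrative least-squares loss and toy sizes (n, d, b, and the data below are not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: l(x_i; theta) = 0.5 * (a_i . theta - y_i)^2
n, d, b = 100, 5, 10          # dataset size, dimension, batch size
A = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def per_example_grads(theta):
    """n x d matrix whose i-th row is grad_theta l(x_i; theta)."""
    residuals = A @ theta - y            # shape (n,)
    return residuals[:, None] * A        # shape (n, d)

def sgd_step(theta, eta=0.1):
    """One SGD step; returns new theta and the gradient noise v_sgd(theta)."""
    G = per_example_grads(theta)
    full_grad = G.mean(axis=0)                       # grad_theta L(theta)
    batch = rng.choice(n, size=b, replace=False)     # minibatch B_t
    g_tilde = G[batch].mean(axis=0)                  # minibatch gradient
    noise = g_tilde - full_grad                      # v_sgd(theta_t)
    return theta - eta * g_tilde, noise

theta = np.zeros(d)
theta, noise = sgd_step(theta)
# Averaged over many draws of B_t, the noise is zero: v_sgd is unbiased.
```

Sampling the minibatch without replacement matches the slide's decomposition: the SGD step is exactly the GD step plus the mean-zero term v_sgd(θ_t).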

  3. Noise matters! SGD >> GD
     • How? <= Still open…
     • Which? <= This work!
     [Figure: CIFAR-10, ResNet-18, w/o weight decay, w/o data augmentation]

  4. Which noise matters? v_sgd(θ) = g̃(θ) − ∇_θ L(θ)
     1. Magnitude <= YES! (e.g., Jastrzębski et al. 2017)
     2. Covariance structure <= YES! (e.g., Zhu et al. 2018)
     3. Distribution class (Bernoulli? Gaussian? Lévy? …) <= No! (this work)

  5. Intuition. For a quadratic loss, the generalization error E_{x, θ_T}[ℓ(x; θ_T) − ℓ(x; θ*)] only depends on the first two moments of θ_T, which in turn only depend on the first two moments of the noise v(θ). This is because the update θ_{t+1} = θ_t − η ∇_θ L(θ_t) − η v(θ_t) is linear in θ and v when L is quadratic. Noise matters! But the noise class does not!!!
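The linearity argument can be checked by simulation: run the noisy update with two different noise classes that share the same first two moments and compare the resulting moments of θ_T. The quadratic loss, step size, and horizon below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Quadratic loss L(theta) = 0.5 * theta^T H theta, so grad L(theta) = H theta.
# The update theta_{t+1} = theta_t - eta*H*theta_t - eta*v_t is linear in both
# theta and the noise, so the first two moments of theta_T are determined by
# the first two moments of v alone.
d, eta, T, trials = 3, 0.1, 50, 20000
H = np.diag([1.0, 0.5, 0.2])
theta0 = np.ones(d)

def run(sample_noise):
    """Empirical mean and covariance of theta_T over many independent runs."""
    thetas = np.tile(theta0, (trials, 1))
    for _ in range(T):
        v = sample_noise((trials, d))
        thetas = thetas - eta * (thetas @ H) - eta * v
    return thetas.mean(axis=0), np.cov(thetas.T)

gauss = lambda shape: rng.standard_normal(shape)           # N(0, I)
rademacher = lambda shape: rng.choice([-1.0, 1.0], shape)  # also mean 0, cov I

mean_g, cov_g = run(gauss)
mean_r, cov_r = run(rademacher)
# mean_g ~ mean_r and cov_g ~ cov_r: only the first two moments of the
# noise show up in the first two moments of theta_T, not the noise class.
```

Gaussian and Rademacher noise are very different distributions, yet the simulated moments of θ_T agree up to sampling error, which is exactly the slide's point.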

  6. A closer look at the noise of SGD
     v_sgd(θ) = g̃(θ) − ∇_θ L(θ) = ∇L(θ)·W_sgd − ∇L(θ)·(1/n)1 = ∇L(θ)·V_sgd
     [g̃(θ) = ∇L(θ)·W_sgd: sampling noise enters through W_sgd; ∇_θ L(θ) = ∇L(θ)·(1/n)1]
     • Gradient matrix: ∇L(θ) = (∇_θ ℓ(x_1; θ), …, ∇_θ ℓ(x_n; θ))
     • Sampling vector W_sgd: b entries equal to 1/b, n − b entries equal to 0
     • Sampling noise: V_sgd = W_sgd − (1/n)1
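The factorization of the SGD noise into a gradient matrix times a data-independent sampling vector can be verified in a few lines. The sizes and the random gradient matrix below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

n, d, b = 20, 4, 5
G = rng.standard_normal((d, n))    # gradient matrix: column i is grad l(x_i; theta)

# Sampling vector W_sgd: b entries equal to 1/b (the minibatch), n - b zeros.
batch = rng.choice(n, size=b, replace=False)
W = np.zeros(n)
W[batch] = 1.0 / b

full_grad = G @ (np.ones(n) / n)   # grad L(theta) = G . (1/n) 1
g_tilde = G @ W                    # minibatch gradient = G . W_sgd
V = W - np.ones(n) / n             # sampling noise V_sgd = W_sgd - (1/n) 1

v_sgd = g_tilde - full_grad
# v_sgd equals G @ V exactly: the SGD noise is the gradient matrix applied
# to a sampling-noise vector that does not depend on the data at all.
assert np.allclose(v_sgd, G @ V)
```

This separation is what lets the randomness of minibatching (V_sgd) be studied independently of the loss landscape (the columns of ∇L(θ)).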
