On the Noisy Gradient Descent that Generalizes as SGD (PowerPoint presentation)

SLIDE 1

On the Noisy Gradient Descent that Generalizes as SGD

Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, Zhanxing Zhu

Johns Hopkins University, Missouri University of Science and Technology, Baidu Research, Styling AI, Peking University

SLIDE 2

Stochastic gradient descent (SGD)

Loss function:

$$L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(x_i; \theta)$$

GD:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

SGD, with a mini-batch $B_t$ of size $b$:

$$\tilde{g}(\theta_t) = \frac{1}{b} \sum_{i \in B_t} \nabla_\theta \ell(x_i; \theta_t)$$

The SGD update is the GD update plus an (unbiased) gradient noise $v_{\mathrm{sgd}}(\theta_t)$:

$$\theta_{t+1} = \theta_t - \eta \, \tilde{g}(\theta_t) = \theta_t - \eta \, \nabla_\theta L(\theta_t) - \eta \, \underbrace{\left( \tilde{g}(\theta_t) - \nabla_\theta L(\theta_t) \right)}_{v_{\mathrm{sgd}}(\theta_t)}$$
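This GD-plus-noise decomposition can be sketched in NumPy on a toy least-squares problem (the data, sizes, and function names below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, eta = 100, 5, 10, 0.1

# Toy least-squares problem: l(x_i; theta) = 0.5 * (x_i @ theta - y_i)**2
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def full_grad(theta):
    # nabla_theta L(theta) = (1/n) sum_i nabla_theta l(x_i; theta)
    return X.T @ (X @ theta - y) / n

def minibatch_grad(theta):
    # g_tilde(theta) = (1/b) sum_{i in B_t} nabla_theta l(x_i; theta)
    idx = rng.choice(n, size=b, replace=False)
    return X[idx].T @ (X[idx] @ theta - y[idx]) / b

theta = np.zeros(d)
g = minibatch_grad(theta)
noise = g - full_grad(theta)       # v_sgd(theta): the gradient noise
theta_next = theta - eta * g       # SGD step = GD step plus a noise term

# The noise is unbiased: averaging many mini-batch gradients
# recovers the full gradient.
avg = np.mean([minibatch_grad(theta) for _ in range(20000)], axis=0)
```

By construction, one SGD step equals one GD step perturbed by `noise`, and the perturbation averages out to zero over mini-batch draws.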
SLIDE 3

Noise matters!

CIFAR-10, ResNet-18, w/o weight decay, w/o data augmentation

SGD >> GD

  • How? <= Still open…
  • Which? <= This work!

SLIDE 4

Which noise matters?

  • 1. Magnitude <= YES! (e.g., Jastrzębski et al. 2017)
  • 2. Covariance structure <= YES! (e.g., Zhu et al. 2018)
  • 3. Distribution class (Bernoulli? Gaussian? Lévy?…) <= No!!! (this work)

$$v_{\mathrm{sgd}}(\theta) = \tilde{g}(\theta) - \nabla_\theta L(\theta)$$
SLIDE 5

Intuition

With injected noise $v(\theta)$, the update

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t) - \eta \, v(\theta_t)$$

is linear in the noise. For quadratic loss, the generalization error

$$\mathbb{E}_{x, \theta_T}\left[ \ell(x; \theta_T) - \ell(x; \theta^*) \right]$$

only depends on the first two moments of $\theta_T$, which in turn only depend on the first two moments of $v(\theta)$.

Noise matters! But noise class does not!!!
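The two-moment claim can be checked numerically in one dimension. The sketch below (a scalar quadratic loss with arbitrarily chosen constants, not the paper's experiment) runs noisy GD with Gaussian noise and with sign (Rademacher) noise of the same mean and variance, and compares the resulting expected losses by Monte Carlo:

```python
import numpy as np

# Scalar quadratic loss l(theta) = 0.5 * a * theta**2, noisy GD:
#   theta_{t+1} = theta_t - eta * a * theta_t - eta * v_t
# Two zero-mean noise classes with equal variance give theta_T the same
# first two moments, hence the same expected loss.
rng = np.random.default_rng(0)
a, eta, T, N = 1.0, 0.1, 20, 200_000

def run(sample_noise):
    theta = np.ones(N)                  # N independent runs in parallel
    for _ in range(T):
        theta = theta - eta * a * theta - eta * sample_noise(N)
    return 0.5 * a * np.mean(theta**2)  # Monte Carlo estimate of E[l(theta_T)]

loss_gauss = run(lambda N: rng.normal(0.0, 1.0, size=N))   # Gaussian noise
loss_rad = run(lambda N: rng.choice([-1.0, 1.0], size=N))  # Rademacher noise
print(loss_gauss, loss_rad)  # nearly identical
```

The two estimates agree up to Monte Carlo error, even though the noise distributions are very different.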

SLIDE 6

A closer look at the noise of SGD

Both the full gradient and the mini-batch gradient can be written as the gradient matrix times a sampling vector:

GD: $\nabla_\theta L(\theta) \cdot \frac{1}{n} \mathbb{1}$

SGD: $\nabla_\theta L(\theta) \cdot W_{\mathrm{sgd}}$

  • Gradient matrix: $\nabla_\theta L(\theta) = \left( \nabla_\theta \ell(x_1; \theta), \ldots, \nabla_\theta \ell(x_n; \theta) \right)$
  • Sampling vector: $W_{\mathrm{sgd}}$, with $b$ entries equal to $\frac{1}{b}$ and $n - b$ entries equal to $0$
  • Sampling noise: $V_{\mathrm{sgd}} = W_{\mathrm{sgd}} - \frac{1}{n} \mathbb{1}$

So the gradient noise factorizes:

$$\underbrace{v_{\mathrm{sgd}}(\theta)}_{\text{gradient noise}} = \tilde{g}(\theta) - \nabla_\theta L(\theta) = \nabla_\theta L(\theta) \cdot \underbrace{V_{\mathrm{sgd}}}_{\text{sampling noise}}$$
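The factorization can be verified directly with a random gradient matrix (all sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 8, 3, 2

# Gradient matrix: column i stands for nabla_theta l(x_i; theta)
# (random placeholders here).
G = rng.normal(size=(d, n))

# Sampling vector W_sgd: b entries equal to 1/b, the other n-b are 0.
W = np.zeros(n)
W[rng.choice(n, size=b, replace=False)] = 1.0 / b

full_grad = G @ (np.ones(n) / n)   # GD: uniform sampling vector
minibatch_grad = G @ W             # SGD: random sampling vector
V = W - np.ones(n) / n             # sampling noise
noise = G @ V                      # gradient noise = gradient matrix @ sampling noise
```

Since $W_{\mathrm{sgd}}$ sums to one, the sampling noise $V$ sums to zero, and `noise` equals `minibatch_grad - full_grad` exactly.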

Gradient noise

SLIDE 7

Gradient noise vs. sampling noise

Gradient noise

  • State-dependent
  • Noise of gradient

Gradient matrix

  • State-dependent
  • Deterministic

Sampling noise

  • State-independent
  • Noise of mini-batch sampling

$$\underbrace{v(\theta)}_{\text{gradient noise}} = \underbrace{\nabla_\theta L(\theta)}_{\text{gradient matrix}} \cdot \underbrace{V}_{\text{sampling noise}}$$
SLIDE 8

Noisy gradient descent

Option 1: use gradient noise $v(\theta) = \nabla_\theta L(\theta) \cdot V$

Option 2: use sampling noise $V$

  • in the same magnitude/covariance
  • from different classes

Either way, the result is GD with injected noise:

$$\theta_{t+1} = \theta_t - \eta \, \underbrace{\nabla_\theta L(\theta_t)}_{\text{GD}} - \eta \, \underbrace{v(\theta)}_{\text{noise}}$$

SLIDE 9

Multiplicative SGD (MSGD)

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t) \cdot W, \qquad W = \frac{1}{n} \mathbb{1} + V$$

Algorithm:

  • 1. Generate sampling vector $W = \frac{1}{n} \mathbb{1} + V$
  • 2. Compute randomized loss $\tilde{L}(\theta) = L(\theta) \cdot W$ (with $L(\theta)$ viewed as the vector of per-sample losses)
  • 3. Compute stochastic gradient $\nabla_\theta \tilde{L}(\theta)$
  • 4. Update parameters $\theta \leftarrow \theta - \eta \, \nabla_\theta \tilde{L}(\theta)$
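A minimal sketch of this loop on a toy least-squares problem (the data, step count, and helper names are illustrative; `sample_V` is a hypothetical hook standing in for whichever noise class is being injected):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b, eta = 100, 5, 10, 0.1

# Toy least-squares data with a planted solution.
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad_matrix(theta):
    # Column i is nabla_theta l(x_i; theta) for l = 0.5 * (x_i @ theta - y_i)**2.
    return X.T * (X @ theta - y)

def sgd_sampling_noise():
    # V_sgd = W_sgd - (1/n) 1, where b random entries of W_sgd equal 1/b.
    W = np.zeros(n)
    W[rng.choice(n, size=b, replace=False)] = 1.0 / b
    return W - 1.0 / n

def msgd_step(theta, sample_V):
    W = np.ones(n) / n + sample_V()       # 1. generate sampling vector
    grad_tilde = grad_matrix(theta) @ W   # 2.-3. gradient of randomized loss
    return theta - eta * grad_tilde       # 4. parameter update

theta = np.zeros(d)
for _ in range(50):
    theta = msgd_step(theta, sgd_sampling_noise)
```

With `sgd_sampling_noise` this reduces exactly to mini-batch SGD; swapping in a Gaussian or Bernoulli sampler changes only the noise class while keeping the same update rule.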
SLIDE 10

Injecting noise by MSGD

  • 1. SGD class
  • 2. Gaussian class
  • 3. “Bernoulli” class
  • 4. “Sparse Gaussian” class

  • SGD class: $W_{\mathrm{sgd}}$ has $b$ entries equal to $\frac{1}{b}$ and $n - b$ entries equal to $0$
  • Gaussian class: $W_G \sim \mathcal{N}\left( \frac{1}{n} \mathbb{1}, \mathrm{Var}[W_{\mathrm{sgd}}] \right)$
  • "Bernoulli" class: $\mathbb{P}\left( W_B^{(i)} = \frac{1}{b} \right) = \frac{b}{n}, \quad \mathbb{P}\left( W_B^{(i)} = 0 \right) = 1 - \frac{b}{n}$
  • "Sparse Gaussian" class: mini-batch + Gaussian noise

All are injected through the MSGD update:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t) \cdot W, \qquad W = \frac{1}{n} \mathbb{1} + V$$
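Samplers for the SGD, Bernoulli, and Gaussian classes can be sketched as follows (toy sizes; as a simplification, the Gaussian class below reuses the empirical covariance of $W_{\mathrm{sgd}}$ rather than the analytic one):

```python
import numpy as np

rng = np.random.default_rng(0)
n, b = 20, 4

def w_sgd():
    # SGD class: b entries equal to 1/b, the rest 0; mean is (1/n) 1.
    W = np.zeros(n)
    W[rng.choice(n, size=b, replace=False)] = 1.0 / b
    return W

def w_bernoulli():
    # Bernoulli class: each coordinate independently equals 1/b
    # with probability b/n, matching the marginals of W_sgd.
    return (rng.random(n) < b / n) / b

samples = np.array([w_sgd() for _ in range(50_000)])
cov_sgd = np.cov(samples.T)  # empirical Var[W_sgd]

def w_gaussian():
    # Gaussian class: same mean and (empirical) covariance as W_sgd.
    return rng.multivariate_normal(np.ones(n) / n, cov_sgd)

W_G = w_gaussian()
```

Each sampler plugs into the same MSGD update, so only the distribution class of the sampling vector changes between runs.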
SLIDE 11

Experiments

Small SVHN. More results are available in the paper!

(Figure legend: Gaussian noise, sparse Gaussian noise, Bernoulli noise, SGD noise)

SLIDE 12

Take Home

  • 1. Noise class is not crucial
  • 2. Multiplicative SGD algorithm
  • 3. Sampling noise perspective

Get the paper!

Join our poster session for more details!