Obtaining Adjustable Regularization for Free via Iterate Averaging (PowerPoint PPT Presentation)




SLIDE 1

Obtaining Adjustable Regularization for Free via Iterate Averaging

Jingfeng Wu, Vladimir Braverman, Lin F. Yang Johns Hopkins University & UCLA

June 2020

SLIDE 2

Searching for the optimal hyperparameter

$$\min_w \; L(w) + \lambda R(w)$$

Here $L$ is the main loss, $R$ the regularizer, and $\lambda$ the regularization hyperparameter.

The ML/optimization problem is solved with GD/SGD, whose iterates converge, $w_k \to w^*$:

$$w_{k+1} = w_k - \eta\,\nabla\big(L(w_k) + \lambda R(w_k)\big)$$

where $\eta$ is the learning rate (step size).
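To make the cost concrete, here is a minimal sketch of this workflow on toy least-squares data (the sizes, seed, step size, and candidate values are all hypothetical, chosen only for illustration): every candidate $\lambda$ triggers a full optimization run.

```python
import numpy as np

def train(lam, X, y, eta=0.1, steps=500):
    """Plain GD on L(w) + lam * R(w), with L the least-squares loss
    ||Xw - y||^2 / (2n) and R(w) = ||w||^2 / 2, so the gradient is
    grad L(w) + lam * w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        w -= eta * (X.T @ (X @ w - y) / n + lam * w)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

# Tuning lambda this way re-runs the optimizer once per candidate value.
solutions = {lam: train(lam, X, y) for lam in [0.01, 0.1, 1.0]}
```

The point of the next slides is to avoid exactly this per-hyperparameter retraining loop.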

SLIDE 3

Re-running the optimizer is expensive! :(

  • A single round of training takes about 3 days.
  • Almost a year to try a hundred different hyperparameters.

ResNet-50 + ImageNet + 8 GPUs

Can we obtain adjustable regularization for free?

SLIDE 4

Iterate averaging => regularization (Neu et al.)

[Figure: the SGD path for solving min M(x), its geometric average, and the solution of min M(x) + μS(x), drawn over a contour plot of M(x).]

SLIDE 5

Iterate averaging protocol

  • Require: a stored optimization path
  • Input: a hyperparameter μ
  • Compute a weighting scheme
  • Average the path with these weights
  • Output: the regularized solution

Iterate averaging is cheap :) But Neu et al.'s result is limited :(
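The protocol above amounts to one cheap pass over a stored path. The sketch below is hypothetical: `iterate_average` is not from the paper, the weighting uses the geometric scheme p = 1/(1 + μη) quoted elsewhere in this deck, and the "path" is a random stand-in rather than a real optimizer trace.

```python
import numpy as np

def iterate_average(path, mu, eta):
    """Given a stored optimization path and a regularization strength mu,
    compute geometric weights and return the weighted average of the path.
    Cost is one pass over the stored iterates -- no retraining."""
    p = 1.0 / (1.0 + mu * eta)                    # geometric ratio
    weights = (1.0 - p) * p ** np.arange(len(path))
    weights /= weights.sum()                      # renormalize truncated series
    return np.tensordot(weights, np.asarray(path), axes=1)

# Stand-in for a stored GD/SGD path (a random walk; purely illustrative).
rng = np.random.default_rng(0)
path = np.cumsum(rng.normal(size=(50, 3)), axis=0)

# Many regularization levels from the *same* stored path:
averaged = {mu: iterate_average(path, mu, eta=0.1) for mu in [0.1, 1.0, 10.0]}
```

Each additional hyperparameter costs only a weighted sum over the stored iterates, versus a full training run in the naive loop.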

SLIDE 6

Formally, Neu et al. show that for

  • linear regression, $L(w) = \frac{1}{n}\sum_{i=1}^{n} (w^\top x_i - y_i)^2$
  • ℓ₂-regularization, $R(w) = \frac{1}{2}\|w\|_2^2$
  • a GD/SGD path, $w_{k+1} = w_k - \eta\,\nabla L(w_k)$

the geometric average

$$p_1 w_1 + p_2 w_2 + \cdots + p_k w_k, \qquad p_k = (1-p)\,p^k, \quad p = \frac{1}{1+\lambda\eta},$$

solves $\min_w L(w) + \lambda R(w)$.
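This identity can be checked numerically. The sketch below assumes the loss $L(w) = \|Xw - y\|^2/(2n)$ (a constant rescaling of the slide's definition, which only rescales the effective $\lambda$), zero initialization, and arbitrary toy sizes: plain GD is run on the *unregularized* loss, and its geometric average is compared to the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
H, b = X.T @ X / n, X.T @ y / n   # L(w) = w'Hw/2 - b'w + const

# GD on the unregularized least-squares loss, storing the whole path.
eta, K = 0.1, 2000
w = np.zeros(d)
path = [w.copy()]
for _ in range(K):
    w = w - eta * (H @ w - b)
    path.append(w.copy())

# Geometric averaging: p = 1/(1 + lam*eta), weights p_k = (1 - p) p^k.
lam = 1.0
p = 1.0 / (1.0 + lam * eta)
weights = (1.0 - p) * p ** np.arange(K + 1)
w_avg = weights @ np.asarray(path)

# Closed-form solution of min_w L(w) + (lam/2)||w||^2 (ridge regression).
w_ridge = np.linalg.solve(H + lam * np.eye(d), b)
print(np.max(np.abs(w_avg - w_ridge)))   # ~0: the average matches ridge
```

No regularized training run ever happens, yet the weighted average of the unregularized path lands on the ridge solution.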

SLIDE 7

Our contributions: iterate averaging works for more general

  • 1. regularizers <= generalized ℓ₂-regularizer
  • 2. optimizers <= Nesterov's acceleration
  • 3. objectives <= strongly convex and smooth losses
  • 4. deep neural networks! (empirically)

SLIDE 8
  • 1. Generalized ℓ₂-regularization, $R(w) = \frac{1}{2} w^\top Q\, w$

Use a preconditioned GD/SGD path instead:

$$w_{k+1} = w_k - \eta\, Q^{-1} \nabla L(w_k)$$

Then the geometric average $p_1 w_1 + p_2 w_2 + \cdots + p_k w_k$ solves $\min_w L(w) + \lambda R(w)$.
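A numeric sanity check of the preconditioned variant, under the same assumptions as the plain case (loss $\|Xw - y\|^2/(2n)$, zero initialization, toy data); the diagonal positive-definite $Q$ here is an arbitrary illustrative choice, not one from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
H, b = X.T @ X / n, X.T @ y / n

# Hypothetical positive-definite Q defining R(w) = w'Qw / 2.
Q = np.diag([1.0, 2.0, 3.0, 4.0])
Qinv = np.linalg.inv(Q)

# Preconditioned GD on the unregularized loss, storing the path.
eta, K = 0.1, 2000
w = np.zeros(d)
path = [w.copy()]
for _ in range(K):
    w = w - eta * Qinv @ (H @ w - b)
    path.append(w.copy())

# Same geometric weights as in the plain-l2 case.
lam = 0.5
p = 1.0 / (1.0 + lam * eta)
weights = (1.0 - p) * p ** np.arange(K + 1)
w_avg = weights @ np.asarray(path)

# Closed-form solution of min_w ||Xw - y||^2/(2n) + (lam/2) w'Qw.
w_reg = np.linalg.solve(H + lam * Q, b)
```

Swapping the Euclidean gradient step for a $Q^{-1}$-preconditioned one is the only change needed to trade the ℓ₂ penalty for the quadratic penalty $w^\top Q w / 2$.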

SLIDE 9
  • 2. Nesterov's acceleration

For the ℓ₂-regularizer, the geometric average $p_1 w_1 + p_2 w_2 + \cdots + p_k w_k$ of the Nesterov-accelerated path solves $\min_w L(w) + \lambda R(w)$, with the weighting scheme

$$p_k = \frac{\gamma}{\eta}\left(\frac{\sqrt{\gamma(\alpha+\lambda)} - \sqrt{\eta\alpha}}{1 - \sqrt{\eta\alpha}}\right)\left(\frac{1 - \sqrt{\gamma(\alpha+\lambda)}}{1 - \sqrt{\eta\alpha}}\right)^{k-2}, \qquad \gamma = \frac{\eta}{1+\lambda\eta}.$$

SLIDE 10
  • 3. Strongly convex and smooth objectives

Yes! But only approximately: with a geometric weighting scheme $q_k = (1-q)\,q^k$, the averaged path is sandwiched between two ℓ₂-regularized solutions,

$$\hat{w}_{\lambda_1} \;\lesssim\; \sum_{k} q_k w_k \;\lesssim\; \hat{w}_{\lambda_2}, \qquad \text{where } \hat{w}_{\lambda} = \arg\min_w \; L(w) + \lambda R(w).$$
SLIDE 11
  • 4. Deep neural networks :)

  Dataset                        CIFAR-10      CIFAR-10      CIFAR-100
  Model                          VGG-16        ResNet-18     ResNet-18
  Accuracy after training (%)    92.54 ± 0.22  94.54 ± 0.04  75.62 ± 0.16
  Accuracy after averaging (%)   93.18 ± 0.06  94.72 ± 0.04  76.24 ± 0.05
  Time of training               ~4.5 h        ~8.3 h        ~8.3 h
  Time of averaging              ~47 s         ~56 s         ~58 s

  (Averaging times are measured on a single K80 GPU.)

Iterate averaging is effective and efficient!

SLIDE 12

Take Home

Iterate averaging => adjustable regularization for free

  • For ℓ₂-type regularization
  • For SGD/NSGD optimizers
  • For quadratic/strongly convex and smooth objectives
  • Regularizing deep neural networks

Join our poster session for more details!