

SLIDE 1

Learning What and Where to Transfer

Yunhun Jang*1,2, Hankook Lee*1, Sung Ju Hwang3,4,5, Jinwoo Shin1,4,5

* Equal contribution

1 School of Electrical Engineering, KAIST 2 OMNIOUS 3 School of Computing, KAIST 4 Graduate School of AI, KAIST 5 AITRICS

SLIDE 4

Transfer Learning

  • DNNs require large labeled datasets to train
  • Transfer learning is a popular method to mitigate the lack of samples
  • It improves the performance of a model on a new task by utilizing the knowledge of pre-trained source models
  • Limitations of previous methods
  • Require the same architecture for the source and target models (e.g., fine-tuning)
  • Require exhaustive hand-crafted tuning (e.g., attention transfer [1], Jacobian matching [2])

[Figure: pre-training on ImageNet and fine-tuning on a new task, vs. attention transfer / Jacobian matching between a source and a target network]

[1] Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017. [2] Srinivas, S. and Fleuret, F. Knowledge transfer with Jacobian matching. In ICML, 2018.

SLIDE 7

Learning What/Where to Transfer

  • Propose meta-networks 𝑔 and ℎ: learning what/where to transfer (L2T-ww)
  • Learn the learning rules to transfer the source knowledge

Where to transfer

  • A meta-network ℎ decides useful pairs of source/target layers to transfer

What to transfer

  • A meta-network 𝑔 decides useful channels to transfer

[Figure: previous methods vs. Learning What and Where to Transfer (L2T-ww)]

SLIDE 8

L2T-ww: Learning What to Transfer

  • Transfer by making target features similar to those of the source [3]

\mathcal{L}^{m,n}_{\mathrm{fm}}(\theta \mid x) = \frac{1}{CHW} \sum_{c,i,j} \left( r_\theta\big(T^{n}_{\theta}(x)\big)_{c,i,j} - S^{m}(x)_{c,i,j} \right)^{2}

where T^{n}_{\theta}(x) is the n-th target feature map, S^{m}(x) is the m-th source feature map, and r_\theta is a transformation for channel-dimension matching (e.g., a 1×1 convolution).

[3] Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. In ICLR, 2015.
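As a concrete illustration, here is a minimal NumPy sketch of this feature-matching loss on toy tensors. The shapes, variable names, and the modelling of r_θ as a channel-mixing matrix (a 1×1 convolution) are illustrative assumptions, not the paper's code:

```python
import numpy as np

def feature_matching_loss(target_feat, source_feat, r_weight):
    """L_fm^{m,n}: squared distance between the transformed target
    feature map r_theta(T_theta^n(x)) and the source map S^m(x),
    averaged over channels and spatial positions (the 1/CHW factor)."""
    # A 1x1 convolution is a channel-mixing matrix applied at
    # every spatial location.
    transformed = np.einsum('ck,khw->chw', r_weight, target_feat)
    C, H, W = source_feat.shape
    return ((transformed - source_feat) ** 2).sum() / (C * H * W)

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 4, 4))    # T_theta^n(x): C_t = 8 channels
source = rng.normal(size=(16, 4, 4))   # S^m(x): C = 16 channels
r = rng.normal(size=(16, 8))           # r_theta: maps 8 -> 16 channels
loss = feature_matching_loss(target, source, r)
```

With `r` set to the identity and identical feature maps the loss is zero, so minimizing it pulls the transformed target features toward the source features.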

SLIDE 10

L2T-ww: Learning What to Transfer

  • Learn what to transfer

\mathcal{L}^{m,n}_{\mathrm{wfm}}(\theta \mid x, w^{m,n}) = \frac{1}{HW} \sum_{c} w^{m,n}_{c} \sum_{i,j} \left( r_\theta\big(T^{n}_{\theta}(x)\big)_{c,i,j} - S^{m}(x)_{c,i,j} \right)^{2}, \qquad w^{m,n}_{c} \ge 0, \quad \sum_{c} w^{m,n}_{c} = 1

Choose important channels for learning a target task
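A sketch of the weighted variant, with the channel weights w^{m,n} treated as given. In the method they are produced by the meta-network 𝑔; the softmax over random logits here is only a stand-in that satisfies the nonnegativity and sum-to-one constraints:

```python
import numpy as np

def weighted_fm_loss(target_feat, source_feat, r_weight, w):
    """L_wfm^{m,n}: feature matching with per-channel weights w,
    where w_c >= 0 and sum_c w_c = 1."""
    transformed = np.einsum('ck,khw->chw', r_weight, target_feat)
    C, H, W = source_feat.shape
    # Sum the squared error per channel, then weight each channel by w_c.
    per_channel = ((transformed - source_feat) ** 2).sum(axis=(1, 2))
    return float(w @ per_channel) / (H * W)

rng = np.random.default_rng(0)
target = rng.normal(size=(8, 4, 4))
source = rng.normal(size=(16, 4, 4))
r = rng.normal(size=(16, 8))

# Stand-in for the meta-network g: softmax over channel logits.
logits = rng.normal(size=16)
w = np.exp(logits - logits.max())
w /= w.sum()
loss = weighted_fm_loss(target, source, r, w)
```

With uniform weights w_c = 1/C this reduces exactly to the unweighted 1/(CHW) loss above, so the unweighted loss is a special case.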

SLIDE 12

L2T-ww: Learning Where to Transfer

  • Learn where to transfer
  • Meta-networks choose important matching pairs to transfer
  • Given all possible candidate matching pairs 𝒞

\lambda^{m,n}: importance weight for the candidate pair (m, n)
SLIDE 14

L2T-ww: Learning Where to Transfer

  • Learn where to transfer
  • Choose pairs of feature-matched layers among all the possible pairs

\mathcal{L}_{\mathrm{wfm}}(\theta \mid x, \phi) = \sum_{(m,n) \in \mathcal{C}} \lambda^{m,n} \, \mathcal{L}^{m,n}_{\mathrm{wfm}}(\theta \mid x, w^{m,n})
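The aggregation over candidate pairs can be sketched as follows. The candidate set, the per-pair loss values, and the softplus parameterization of λ^{m,n} are toy assumptions for illustration; in the method λ^{m,n} is produced by the meta-network ℎ:

```python
import numpy as np

rng = np.random.default_rng(1)

# Candidate set C: every (source layer m, target layer n) pair,
# e.g. 3 source layers x 3 target layers.
candidates = [(m, n) for m in range(3) for n in range(3)]

# Stand-ins for the per-pair weighted feature-matching losses L_wfm^{m,n}.
pair_losses = {p: float(rng.uniform(0.1, 2.0)) for p in candidates}

# Stand-in for the meta-network h: a nonnegative weight per pair
# (softplus keeps lambda^{m,n} >= 0).
raw = rng.normal(size=len(candidates))
lambdas = dict(zip(candidates, np.log1p(np.exp(raw))))

# L_wfm(theta | x, phi) = sum over (m, n) in C of lambda^{m,n} * L_wfm^{m,n}
transfer_loss = sum(lambdas[p] * pair_losses[p] for p in candidates)
```

Pairs whose λ^{m,n} is driven toward zero effectively drop out of the transfer objective, which is how the meta-network selects where to transfer.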
SLIDE 19

L2T-ww: Training Meta-Networks

  • Total loss for the target model:

\mathcal{L}_{\mathrm{total}}(\theta \mid x, y, \phi) = \mathcal{L}_{\mathrm{org}}(\theta \mid x, y) + \beta \, \mathcal{L}_{\mathrm{wfm}}(\theta \mid x, \phi)

  • A popular bilevel scheme [4,5] for training the meta-parameters \phi:
  • 1. Training simulation: for t = 1, \dots, T,

\theta_{t+1} = \theta_{t} - \alpha \nabla_{\theta} \mathcal{L}_{\mathrm{total}}(\theta_{t} \mid x_{t}, y_{t}, \phi)

  • 2. Evaluation:

\mathcal{L}_{\mathrm{meta}}(\phi) = \mathcal{L}_{\mathrm{org}}(\theta_{T+1} \mid x_{\mathrm{val}}, y_{\mathrm{val}})

  • 3. Update \phi based on \nabla_{\phi} \mathcal{L}_{\mathrm{meta}}(\phi) using second-order gradients
  • The transfer loss acts as a regularization, so a large number of simulation steps T is required to obtain meaningful gradients, which is time-consuming

[4] Colson, B., Marcotte, P., and Savard, G. An overview of bilevel optimization. Annals of Operations Research, 2007. [5] Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. Bilevel programming for hyperparameter optimization and meta-learning. In ICML, 2018.
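To make the bilevel structure concrete, here is a self-contained toy with one scalar parameter θ and one meta-parameter φ. The quadratic losses and the finite-difference meta-gradient are illustrative stand-ins; the actual method differentiates through the simulation with second-order gradients:

```python
def l_org(theta):
    # Stand-in for the original task loss L_org; optimum at theta = 1.
    return (theta - 1.0) ** 2

def inner_train(phi, theta0=0.0, lr=0.1, T=20):
    # 1. Training simulation: T gradient steps on
    #    L_total = L_org + L_wfm, with the toy transfer loss
    #    L_wfm(theta, phi) = phi * theta**2.
    theta = theta0
    for _ in range(T):
        grad = 2.0 * (theta - 1.0) + 2.0 * phi * theta
        theta -= lr * grad
    return theta

def meta_loss(phi):
    # 2. Evaluation: judge the simulated model by the task loss alone.
    return l_org(inner_train(phi))

# 3. Update phi; a central finite difference replaces the
#    second-order gradient used in practice.
phi, eps, meta_lr = 0.5, 1e-4, 0.5
for _ in range(30):
    g = (meta_loss(phi + eps) - meta_loss(phi - eps)) / (2 * eps)
    phi -= meta_lr * g
```

In this toy the transfer term only hurts the task, so the meta-updates drive φ toward 0. Note that every meta-step reruns the entire T-step simulation, which is exactly the cost the slide calls time-consuming.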

SLIDE 22

L2T-ww: Training Meta-Networks

  • Total loss for the target model:

\mathcal{L}_{\mathrm{total}}(\theta \mid x, y, \phi) = \mathcal{L}_{\mathrm{org}}(\theta \mid x, y) + \beta \, \mathcal{L}_{\mathrm{wfm}}(\theta \mid x, \phi)

  • The proposed bilevel scheme for training the meta-parameters \phi:
  • 1. Knowledge transfer: for t = 1, \dots, T,

\theta_{t+1} = \theta_{t} - \alpha \nabla_{\theta} \mathcal{L}_{\mathrm{wfm}}(\theta_{t} \mid x, \phi)

  • 2. One-step adaptation:

\theta_{T+2} = \theta_{T+1} - \alpha \nabla_{\theta} \mathcal{L}_{\mathrm{org}}(\theta_{T+1} \mid x, y)
slide-23
SLIDE 23

L2T-ww: Training Meta-Networks

  • Total loss for target model:

L_total(θ | x, y, φ) = L_org(θ | x, y) + β · L_wfm(θ | x, φ)

  • The proposed bilevel scheme for training meta-parameters φ:
  • 1. Knowledge transfer: for t = 1, …, T,

θ_{t+1} = θ_t − α ∇_θ L_wfm(θ_t | x, φ)

  • 2. One-step adaptation:

θ_{T+2} = θ_{T+1} − α ∇_θ L_org(θ_{T+1} | x, y)

  • 3. Evaluation:

L_meta(φ) = L_org(θ_{T+2} | x, y)

  • 4. Update φ based on ∇_φ L_meta(φ) using second-order gradients
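The four-step bilevel scheme can be sketched end to end on a toy problem. Everything below is illustrative: θ and φ are scalars, quadratics stand in for L_org and L_wfm, and finite differences stand in for the second-order gradients the paper obtains by differentiating through the unrolled updates.

```python
# Toy sketch of the bilevel scheme (illustration only, not the paper's losses).

ALPHA = 0.1   # inner-loop learning rate (alpha in the slides)
EPS = 1e-5    # finite-difference step

def l_org(theta):
    # target-task loss (stand-in): optimum at theta = 3
    return (theta - 3.0) ** 2

def l_wfm(theta, phi):
    # weighted feature-matching loss (stand-in): phi scales how strongly
    # the source pulls theta toward its own optimum at 1
    return phi * (theta - 1.0) ** 2

def unroll(theta, phi, T=2):
    """Steps 1-3: knowledge transfer, one-step adaptation, evaluation."""
    # 1. Knowledge transfer: T inner gradient steps on L_wfm
    for _ in range(T):
        g = (l_wfm(theta + EPS, phi) - l_wfm(theta - EPS, phi)) / (2 * EPS)
        theta = theta - ALPHA * g
    # 2. One-step adaptation on the original loss L_org
    g = (l_org(theta + EPS) - l_org(theta - EPS)) / (2 * EPS)
    theta = theta - ALPHA * g
    # 3. Evaluation: L_meta(phi) = L_org after adaptation
    return l_org(theta)

def meta_step(phi, lr=0.5, theta0=0.0):
    """Step 4: descend L_meta w.r.t. phi; finite differences replace the
    second-order gradients through the unrolled inner updates."""
    g = (unroll(theta0, phi + 1e-4) - unroll(theta0, phi - 1e-4)) / 2e-4
    return phi - lr * g

phi = 1.0
for _ in range(50):
    phi = meta_step(phi)
```

In this toy setup the source optimum (1) lies between the initial θ (0) and the target optimum (3), so transferring helps, and meta-training drives φ up until the meta-loss after adaptation is lower than with the initial φ.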
slide-24
SLIDE 24

L2T-ww: Training Meta-Networks

9

  • Total loss for target model:
  • The proposed bilevel scheme for training meta-parameters 𝜚:
  • Ours is effective for learning 𝜚 with a small number of steps 𝑈
  • Ours learns 𝜄 and 𝜚 jointly without separate meta-learning phase
  • 1. Knowledge transfer: for 𝑢 = 1, … , 𝑈,
  • 2. One-step adaption:
  • 3. Evaluation:
  • 4. Update 𝜚 based on using second-order gradients

rφLmeta(φ)

<latexit sha1_base64="kHE1cbe6e7GA4SbWM8iyuthH+8=">AEbnicfVPbhMxEN1uA5RwaQsSLwhEVKRFplWyQUqRKfaAPBTRm1QnK8dxElPvRevZXGT8hXwBfwE8wguzu0nbNAFLlsZzPOecGcudWEkNjcb3FXe1dOfuvbX75QcPHz1e39h8cqjNOHihEcqSs47TAslQ3ECEpQ4jxPBgo4SZ53Lgw/G4pEyg8hksWgHrh7InOQNM+Ztud4sGDAacKfPRtk1QD61vKAZ9QJbpTAQwL6O6MCqpEmob2EceNZc3hmCdVp4BuOAYgx5H5MRzF+ac1oxoaoLW8VN2X9i6Vz16pJpfJ2Opx2xTyxXFcq2F1Pash2+Qz8mW5WarW3s1o/6W5gFGFY+mytmr7+Xod/LOqbxQObt5r6rSFujMrwqOrBzPX9IhAhvaGQu7FKFAlucLBrKU76B15tTg+wTZmKB4yGDNvwi+wCL0TA1DWnD/UJ7ryDJX0G+XBztLkARkl/RuSbY3SCdDkwRAWkvYpr5ZknJCL/EymX/Y1KY6eRL7IYeNOg4kzXkb/xi3YjngYiBK6Y1hdeI4aWYQlIroQt01SLGF+V9cUFhiELhG6Z/Cks2cJMl/SiBHcIJM/erDAs0HoSdPBm5lrfxrLkMuwihd67lpFhnIeSHUSxWOn2R/i3RlIjioCQaMJxK9Ej5g+FsAf+AcU3coYz1PS5sZ0Pybo9kMTjd3fH2dnY/vansv5+Oa8157rxyqo7nvHX2nUPnyDlxuPvN/en+dv+s/ig9K70ovSyuivTmqfO3CpV/wIbNYJU</latexit>

Lmeta(φ) = Lorg(θT +2|x, y).

<latexit sha1_base64="gz4O2xprL+sjrYqECZuo3uCkOMc=">AGb3icpVRdb9MwFM1GV0b52uCByRkMQ01WlY1HRIadKkPbAHobYlzS3keO6rZnzocTZWhn/Q/4AP4O9wgO2k2VrGybELEW6udf3nHOPnfgxoylvt38sLN6rLdXvLz9oPHz0+MnTldVnR2mUJZgc4ohFyYmPUsJoSA45YycxAlBgc/IsX+2q+vH5yRJaRQe8ElMugEahnRAMeIq5a3WyDoMEB9hxMQn2ROBE0pPQM7BxSCQTchHhKNvY+ciL9lgG8BgrBwpdg7lgCmWeAJrAJOxtzoET5D+EyKiys0VZWN9Xwndb5KOLWtmWg+TSObBz2R0+evY9tW3Y7uAZvgi8LTuauU3eto2L9xztUgU7b0U9sOVumej131cQwHlEzrtHdVLA2pGHZtCunZv6YEBLe4NAqZCVDXpt3VgkyKU/wDVduFy98EyIWjxAMkRrDy7NzuDziF1jetyZqMdMUDFnYMw1e25YpQMr4A8caCUKDhTOFcMCraMNXIhSiGB21gqNMwodiaF32/ArZL0ThtsAOhX2VBxgK3yFiSkr86nqkXexfobd8AwCINYypgtj2V5LKXHnZLSGP5ftNqjW0lnUhN1fne6GB35jzSthlreylq71TYLzAduEaxZxdr3Vi5hP8JZQEKOGUrTU7cd865ACaeYEdmAWUpi9VmjITlVYgCknaFYZgXWX6YBAl6gk5MNmbHQIFaToJfLVT5nO1nSyqna8cH7rqBhnHES4pxokDF1m4H+uYI+TQjmbKIChBOqtAI8Qup3ydUveAqpf07jtFA9zmVrk9xZS+aDo07L3Wp1Pr9d2/lQ2LVsvbReW03Ltd5ZO9aetW8dWrj2vXZ+1X7vfSz/qL+qg7yrYsLRc9za2rV7T/ckUnB</latexit>

Ltotal(θ|x, y, φ) = Lorg(θ|x, y) + βLwfm(θ|x, φ)

<latexit sha1_base64="WtNJ/Z5PgEnVWi9Guy7jysdH7YA=">AGdHicpVRdb9MwFM1GW0b52uARHiyqolbLqZDAh4mTexlDwgNsS9pbiPHdtszocSp2tk/CN5G/wAK/YTpatbRiIWYp0c6/vOeOHZC6sas2/2+snqvUq3dX3tQf/jo8ZOn6xvPjuMgiTA5wgENolMHxYS6PjliLqPkNIwI8hxKTpyLPVU/mZIodgP/kKUh6Xto7LsjFyMmU/ZG5bwJPcQmGFH+UQy4Z/rC5pAxcDnyRAuyCWHo68y8zEptsAPgKEKYW4LvnwgA48SzOZYBIzOm9XCHInwh+OUVmqyKejPb6ZrnAs5ta0WKT9GI1uGAZ/TZ6zdlt2m6gFb4IvEU7mrVHvQU7B/4ryufULT9ANiOCQSmeGaMC3zW294Xr0sqFhOH1xFp6SyK3oesXTXtibuwSGqVFlJktWV/pSadsjnbtMRO/sK2IKLhBEfyUHtLuEywKG6DWmzcxUPnqIklE9bGu7iwVg2h8BWTzQ6lEwunCVDJI2CJWyLkoiQRuY6n/TbCZ5o6/BrcqUjvbYBNAp8yFk9YnISIDOXKesQdzH+xiHQDFwjFioWyzNRyCoc7hWU2u7/olUW3Uq6kErFXc9FT/wjT6cul73e6Ha6eoHlwMqDhpGvA3v9BxwGOPGIzBFcXxmdUPW5yhiLqZE1GESk1D+92hMzmToI4/Efa6ZBWjKzBCMgkg+PgM6e7ODIy+OU8+RO9WU8WJNJctqZwkbvetz1w8TRnycEY0SKk8zUDcsGLoRwYymMkA4cqVWgCdI3plM3sNzSMOpG8a56lkmW5lkLVqyHBz3OtZ2p/f5TWP3fW7XmvHCeGW0DMt4a+wa+8aBcWTgyrfKr6pRXan+rL2sNWrNbOvqSt7z3Jhbtc5vRk5KJA=</latexit>

θt+1 = θt αrθLwfm(θt|x, φ)

<latexit sha1_base64="10awWVMlxi4PK8ototjnx3oaAT4=">AHPXicrVXLbtNAFHULCaU82sKSzYiqKFHdKE6QEiVKrpAqEi+pLqxBpPJs3Q8UP2OI01zI/wM7CFP+AD2CEWbNgyM3adJnErRDuSpet75z7pmR7YaUxKzZ/D43f+t2pXpn4e7ivfsPHi4trzw6iIMkQngfBTSIjlwY0p8vM8Io/gojD0XIoP3dNtVT8c4igmgb/H0hB3PHjikz5BkMmUs1Jpr9keZAMEKX8jutwzfeFwmzFw1vdEzWYDzODHkXmWlepgE9j9CJuCb5zKIAdJ57DkQwYHjGth7sUolPBz87RZFUsrmU7iflB2BPbapHiUzSitflGX32OqrXZbepesAGeC/xVO48Ve+2FOxlnOPaWzhMX9MEC25T6UwPdnbOsN49HLhrbDAdETa+k1iVy3iV80bYuJsUtolBZRSpLVZv2VmnTK4WzdEpv5C9uwIQ0H0PahHNTJsjO4LGCQjEdZqby0UOUjOpi3V1c6YRCfnQA7fk0oknC4MJYOELWKFnIuSOAqlhINU4rNLf8GbhSktpZB+vAdstsKDnD4ipEuCePp6xDXMf5C7dAM3CNWKiYLo9EIauwuFVQar/i1ZdCXpVCoV170YLfGPI0bu9gOm7raN3ds+tTGNJcem7O82mw09QKzgZUHq0a+dp3l3YvQImHfYojONjqxmyDocRI4hisWgnMQ7lBwye4GMZ+tDcYdragHWZKYH+kEkH58Bnb3YwaEXx6nyp1qwHi6pJlteOE9V92OPHDhGEfZUT9hEq7gfpVgB6JMGI0lQFEZFaARpA+fFn8ocygdQbkjDOVY8y2coka9qS2eCg1bDajda756tbr3K7FownxlOjZljGC2PL2DF2jX0DVT5VvlS+Vr5VP1d/VH9Wf2Vb5+fynsfGxKr+QvSlZ2M</latexit>

θT +2 = θT +1 αrθLorg(θT +1|x, y)

<latexit sha1_base64="8vWYW0uwoE5ZLUcQBH1ct0xRL0=">AHxXictVdT9swFA1stKz7ALbHvVgplaEqimTNk1CQuNhPEwTbHxJuI0c16UezocSpzTyrP2w/ZL9jf2COU5IaRsQGsxSpJt7XPOPdtnIDRiLdav+fmHz1eqFQXn9SePnv+Yml5eVx5MchJkfYZ3546qCIMOqRI045I6dBSJDrMHLiXOym9ZMhCSPqe4c8CUjHRece7VOMuErZKwu/1qGL+AjJj7LrnBNT9oCcg4u+6sQz4gHP0YmZdZqQG2AeyHCAtLir0TCWAUu7bAKuBkxLUe4TCEL6S4vEJTVlbz3ZS87uE9vqYcqX0sj6YVdk9NnrqNFQp830DNgE3xRemrtKNbrtFPYmznHtCxomH1lMpIBMOdNDXbFlbukN49bLmobBgOqOtfS6Qm5A6hWHduVE2yU0qRZSpLVZv1VmnTKFnzDktv5C9+EiAUDBD2kGrWz7Awu9zliY0ybm4l6dBMlrbraYl3dnin64fkVkC0OlRIFpwtDxaBgizhFzkUpJHAbS4mGKcVmklv+BtwqKd3ZABsAOmU2lMywuAoh6anxlJ2Q93H+2i3QDEIjFiqmyNZyCosbheU2u9/ok0tupV0KpXI+16MtrwjT/PBLrbNZ672g81Nj23Mc+Pc/t/Y8l/b3Uy1l9dazZeYDaw8mDNyNe+vfwH9nwcu8TjmKEoOrNaAe8IFHKGZE1GEckUH+k6JycqdBDLok6QpNKsK4yPdD3Q/V4HOjs9RMCuVGUuI7amXYcTdfSZFntLOb9x1BvSDmxMZUT9mauwg/WSBHg0J5ixRAcIhVoBHiD1EeLqwzaB1BvSIMpVjzLZNWSNW3JbHDcblpbzfbB27WdD7ldi8ZrY9WoG5bxztgx9ox948jAldXKXuWg8rX6qepWeXWYbZ2fy8+8MiZW9edfoGfQFQ=</latexit>
slide-25
SLIDE 25
  • Learning what and where to transfer gives consistent improvements
  • The proposed method works well across various tasks and architectures

Experiments

[1] Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
[3] Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. FitNets: Hints for thin deep nets. In ICLR, 2015.
[6] Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

slide-26
SLIDE 26
  • Learning what and where to transfer gives consistent improvements
  • The proposed method works well across various tasks and architectures
  • Learning what to transfer (channel importance) improves all the baselines

Experiments

Up to +15% relative improvement

slide-27
SLIDE 27
  • Learning what and where to transfer gives consistent improvements
  • The proposed method works well across various tasks and architectures
  • Learning what to transfer (channel importance) improves all the baselines
  • Learning where to transfer (pair importance) gives more improvements

Experiments

Up to +15% relative improvement (learning what)
Up to +25% relative improvement (learning what and where)
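The two knobs above can be made concrete with a small sketch: per-channel weights ("what" to transfer) and per-pair weights ("where" to transfer) inside a feature-matching loss. Names and shapes here are illustrative stand-ins, not the paper's implementation, which produces these weights with meta-networks.

```python
# Illustrative sketch: a feature-matching loss with per-channel weights w
# ("what" to transfer) and per-pair weights lam ("where" to transfer).
# Toy scalar channel activations stand in for real feature maps.

def wfm_loss(pairs, w, lam):
    """pairs[p] = (target_feats, source_feats), each a list of C channel
    activations; w[p][c] weights channel c of pair p; lam[p] weights pair p."""
    total = 0.0
    for p, (ft, fs) in enumerate(pairs):
        # per-channel squared distance between target and source features
        dists = [(a - b) ** 2 for a, b in zip(ft, fs)]
        total += lam[p] * sum(wc * d for wc, d in zip(w[p], dists))
    return total

# One candidate (target-layer, source-layer) pair with two channels;
# the second channel disagrees strongly (distances 0 and 25).
pairs = [([1.0, 0.0], [1.0, 5.0])]

full = wfm_loss(pairs, w=[[1.0, 1.0]], lam=[1.0])    # penalize everything: 25.0
masked = wfm_loss(pairs, w=[[1.0, 0.0]], lam=[1.0])  # drop channel 2: 0.0
```

Driving a channel weight (or a whole pair weight lam[p] for an unhelpful layer pair) toward zero is the kind of selective behavior the meta-networks are trained to produce from the meta-loss.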

slide-28
SLIDE 28
  • Multi-source experiments

Experiments

11

[1] Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In ICLR 2017 [3] Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., and Bengio, Y. Fitnets: Hints for thin deep nets. In ICLR, 2015. [6] Li, Z. and Hoiem, D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

slide-29
SLIDE 29
  • Multi-source experiments: different architectures

Experiments

+2.45% relative improvement

slide-30
SLIDE 30
  • Multi-source experiments: different architectures, initialization

Experiments

+2.45% relative improvement
+2.72% relative improvement

slide-31
SLIDE 31
  • Multi-source experiments: different architectures, initialization and datasets

Experiments

+2.45% relative improvement
+2.72% relative improvement
+4.37% relative improvement

slide-32
SLIDE 32
  • Multi-source experiments
  • Limited-data regime experiments
  • The smaller the target dataset → the larger the relative gain of our method
  • Ours efficiently boosts the performance of a target model

Experiments

slide-34
SLIDE 34

Conclusion

  • Meta-learning-based transfer method
  • Selective transfer depending on the relation between source and target tasks
  • Effective training scheme that learns meta-networks and the target model jointly
  • Applicable across heterogeneous and/or multiple network architectures and tasks

Poster #186 Thursday Jun 13th 6:30 – 9:00 PM @ Pacific Ballroom
