CS 330
Optimization-Based Meta-Learning
Course Reminders: HW1 due next Wednesday (9/30). Project guidelines posted — start forming groups & formulating ideas. Guest lecture by Matt Johnson on Monday!

Plan for Today:
Recap
Optimization Meta-Learning
Part of Homework 2!
Goals for the end of lecture:
Multi-Task Learning: Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once:

min_θ ∑_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Transfer Learning: Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a.

The Meta-Learning Problem: Given data from tasks 𝒯_1, …, 𝒯_n, quickly solve a new task 𝒯_test.

In all settings: tasks must share structure. In transfer learning and meta-learning: generally impractical to access prior tasks.
Given 1 example of each of 5 classes, classify new examples: 5-way, 1-shot image classification (MiniImagenet).
[Figure: meta-training on training classes, meta-testing on held-out classes.]
Can replace image classification with regression, language generation, skill learning, or any ML problem.
Black-box adaptation, general form:

φ_i = f_θ(D^tr_i)
y^ts = f_black-box(D^tr_i, x^ts)

+ expressive
How else can we represent φ_i? What if we treat it as an optimization procedure?
[Computation graph: D^tr_i is fed through a gradient operator ∇_θℒ, rather than a black-box network f_θ, to produce φ_i, which predicts y^ts from x^ts.]
Key idea: embed optimization inside the inner learning process. Why might this make sense?
Fine-tuning: start from pre-trained parameters and take gradient steps on the training data for the new task (typically many gradient steps). (Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18)
Fine-tuning is less effective with very small datasets.
Key idea: Over many tasks, learn a parameter vector θ that transfers via fine-tuning [at test time].
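The fine-tuning recipe can be sketched in a few lines. Everything here (the linear model, synthetic data, step size) is an illustrative assumption, not the lecture's setup:

```python
import numpy as np

# Minimal sketch of fine-tuning: start from pre-trained parameters and
# take gradient steps on the new task's (very small) training set.

rng = np.random.default_rng(0)

def loss(w, X, y):
    return np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return 2 * X.T @ (X @ w - y) / len(y)

# "New task": targets come from a weight vector near the pre-trained one
w_pretrained = np.array([1.0, -2.0, 0.5])
w_task = w_pretrained + np.array([0.1, -0.1, 0.2])

X = rng.normal(size=(4, 3))   # only 4 examples: a very small dataset
y = X @ w_task

# Fine-tune: phi <- phi - alpha * grad L(phi, D_tr), repeated
phi = w_pretrained.copy()
alpha = 0.05
for _ in range(100):
    phi = phi - alpha * grad(phi, X, y)

print(loss(phi, X, y) < loss(w_pretrained, X, y))  # fine-tuning reduced the loss
```

The catch the slide points at: with so few examples, plain fine-tuning can overfit or barely move; meta-learning asks how to choose the initialization so that these few steps work well.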
Finn, Abbeel, Levine. Model-Agnostic Meta-Learning. ICML 2017
φ_i: parameter vector for task i. θ: parameter vector being meta-learned.

General Algorithm:
1. Sample task 𝒯_i (or a mini-batch of tasks)
2. Sample disjoint datasets D^tr_i, D^test_i from 𝒟_i
3. Optimize φ_i ← θ − α∇_θℒ(θ, D^tr_i)
4. Update θ using ∇_θℒ(φ_i, D^test_i)
—> brings up second-order derivatives

Black-box approach: compute φ_i with a network. Optimization-based approach, key idea: acquire φ_i through optimization.

Do we get higher-order derivatives with more inner gradient steps? Do we need to compute the full Hessian?
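To make the second-order term concrete, here is a sketch of one exact MAML meta-gradient on a toy scalar model, where every derivative has a closed form (the data, model, and step size are illustrative assumptions), checked against finite differences:

```python
import numpy as np

# Toy model f(x) = theta * x with squared error: all derivatives are closed-form.

def loss(theta, x, y):
    return np.mean((theta * x - y) ** 2)

def grad(theta, x, y):
    return np.mean(2 * x * (theta * x - y))

def hess(x):
    # second derivative of the inner loss; constant for this model
    return np.mean(2 * x ** 2)

def maml_meta_grad(theta, alpha, x_tr, y_tr, x_ts, y_ts):
    """d/dtheta of L(phi, D_test) with phi = theta - alpha * grad L(theta, D_tr).
    The chain rule multiplies by dphi/dtheta = 1 - alpha * Hessian:
    this is the second-order term the slides refer to."""
    phi = theta - alpha * grad(theta, x_tr, y_tr)
    return grad(phi, x_ts, y_ts) * (1.0 - alpha * hess(x_tr))

rng = np.random.default_rng(0)
x_tr, x_ts = rng.normal(size=10), rng.normal(size=10)
y_tr, y_ts = 3.0 * x_tr, 3.0 * x_ts   # the task: regress onto slope 3
theta, alpha = 0.5, 0.1

def outer(t):  # meta-objective as a plain function of theta, for checking
    phi = t - alpha * grad(t, x_tr, y_tr)
    return loss(phi, x_ts, y_ts)

eps = 1e-5
fd = (outer(theta + eps) - outer(theta - eps)) / (2 * eps)
exact = maml_meta_grad(theta, alpha, x_tr, y_tr, x_ts, y_ts)
print(abs(exact - fd) < 1e-6)  # analytic meta-gradient matches finite differences
```

Dropping the `(1 - alpha * hess)` factor gives the first-order approximation discussed later; with more inner steps the factor becomes a product of such terms, but no derivatives beyond second order appear.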
MAML can be viewed as a computation graph with an embedded gradient operator, matching the black-box general form: y^ts = f_black-box(D^tr_i, x^ts).
Note: Can mix & match components of the computation graph.
E.g., learn the initialization but replace the gradient update with a learned network, φ_i = f(θ, D^tr_i, ∇_θℒ). (Ravi & Larochelle, ICLR '17; actually precedes MAML)

This computation-graph view of meta-learning will come back again!
Finn & Levine, ICLR '18
Does this structure come at a cost?
For a sufficiently deep network, the MAML function can approximate any function of (D^tr_i, x^ts), under mild assumptions.
Why is this interesting? MAML has the benefit of inductive bias without losing expressive power.
Idea: Automatically learn the inner vector learning rate, tune the outer learning rate. (Li et al. Meta-SGD, Behl et al. AlphaMAML)
Idea: Decouple the inner learning rate and BN statistics per step. (Antoniou et al. MAML++)
Idea: Optimize only a subset of the parameters in the inner loop. (Zhou et al. DEML, Zintgraf et al. CAVIA)
Idea: Introduce context variables for increased expressive power. (Finn et al. bias transformation, Zintgraf et al. CAVIA)
Takeaway: a range of simple tricks can help optimization significantly.
Challenge: Backpropagating through the inner optimization is compute- & memory-intensive.

Idea: [Crudely] approximate dφ_i/dθ as the identity. (Finn et al. first-order MAML '17, Nichol et al. Reptile '18)
Surprisingly works for simple few-shot problems, but (anecdotally) not for more complex meta-learning problems.

Idea: Only optimize the last layer of weights: ridge regression, logistic regression (Bertinetto et al. R2-D2 '19), support vector machine (Lee et al. MetaOptNet '19)
—> leads to a closed form or convex optimization on top of meta-learned features

Idea: Derive the meta-gradient using the implicit function theorem. (Rajeswaran, Finn, Kakade, Levine. Implicit MAML '19)
—> compute the full meta-gradient without differentiating through the optimization path
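As a sketch of the closed-form last-layer idea (R2-D2-style; the feature map, regularizer, and data here are illustrative assumptions, not the papers' architectures), ridge regression on top of fixed features solves the inner loop exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
W_backbone = rng.normal(size=(5, 16))  # stand-in for a meta-learned backbone

def features(x):
    return np.tanh(x @ W_backbone)

lam = 0.1  # ridge regularizer

# Few-shot task: tiny support set, larger query set
X_tr, y_tr = rng.normal(size=(4, 5)), rng.normal(size=4)
X_ts = rng.normal(size=(10, 5))

Phi = features(X_tr)                                      # (4, 16)
# Closed-form inner loop: w = (Phi^T Phi + lam I)^{-1} Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(16), Phi.T @ y_tr)
y_pred = features(X_ts) @ w                               # query predictions

# Sanity check: the ridge objective's gradient vanishes at the solution,
# so no iterative inner optimization (and no unrolled backprop) is needed
residual = Phi.T @ (Phi @ w - y_tr) + lam * w
print(np.allclose(residual, 0))
```

Because `w` is a differentiable closed-form function of the features, the meta-gradient flows through one linear solve instead of an unrolled optimization path.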
Can we compute the meta-gradient without differentiating through the optimization path?
Idea: Derive the meta-gradient using the implicit function theorem. (Rajeswaran, Finn, Kakade, Levine. Implicit MAML)
Trades off memory and computation; allows second-order optimizers in the inner loop. A recent development (NeurIPS '19), thus all the typical caveats with recent work.
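Sketching the implicit-MAML result (using λ for the proximal regularization strength): the inner loop solves a regularized problem, and the implicit function theorem yields a meta-gradient that depends only on the solution φ*, not on the path taken to reach it:

```latex
\phi^\star(\theta) = \arg\min_{\phi}\; \mathcal{L}(\phi, \mathcal{D}^{tr}_i) + \frac{\lambda}{2}\,\lVert\phi - \theta\rVert^2
\qquad\Rightarrow\qquad
\frac{d\phi^\star}{d\theta} = \Big(I + \tfrac{1}{\lambda}\,\nabla^2_{\phi}\mathcal{L}(\phi^\star, \mathcal{D}^{tr}_i)\Big)^{-1}
```

The inverse is never formed explicitly; it is applied approximately (e.g., via conjugate gradient), which is where the memory/computation trade-off comes from.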
Idea: Progressive neural architecture search + MAML. (Kim et al. Auto-Meta)
MAML with the basic architecture: 63.11% on MiniImagenet 5-way 5-shot; MAML + Auto-Meta: 74.65%.
Key idea: Acquire φ_i through optimization.
Takeaways: Construct a bi-level optimization problem. Model-agnostic (applies to any architecture).
Link: https://arxiv.org/abs/2004.13390 (CVPR 2020 EarthVision Workshop)
Applications: global urban planning, climate change research.
Challenges: Labeling data is expensive; different regions look different & have different land use proportions.
Datasets: DeepGlobe (Demir et al. 2018), SEN12MS (Schmitt et al. 2019).

Different tasks: different regions of the world (e.g., croplands from four countries).
Goal: Segment/classify images from a new region with a small amount of data.

SEN12MS (Schmitt et al. 2019): geographic metadata provided; example 2-way 2-shot classification task.
Without geographic metadata, clustering was used to guess the region; example 1-shot segmentation task.
Setup. Meta-training data: {𝒟_1, …, 𝒟_T}. At meta-test time: a small amount of data 𝒟^tr_j from the new region (the meta-test training set / meta-test support set).

Compare:
- MAML on the meta-training data {𝒟_1, …, 𝒟_T}, adapt with 𝒟^tr_j
- Pre-train on the pooled meta-training data 𝒟_1 ∪ … ∪ 𝒟_T, fine-tune on 𝒟^tr_j
- Random init: train from scratch on 𝒟^tr_j

Results on the SEN12MS and DeepGlobe datasets; more visualizations and analysis in the paper!
Goals for the end of lecture:
Monday: Guest lecture from Matt Johnson on automatic differentiation
Wednesday: Non-parametric few-shot learners, comparison of approaches
Week 4: Advanced (but important!) meta-learning topics
Week 5: Start of reinforcement learning topics [project proposals due]