The Meta-Learning Problem & Black-Box Meta-Learning
CS 330
Logistics
Homework 1 posted today, due Wednesday, September 30. Project guidelines will be posted by tomorrow.
Plan for Today
Transfer Learning
- Problem formulation
- Fine-tuning
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)
Goals by the end of lecture:
- Differences between multi-task learning, transfer learning, and meta-learning problems
- Basics of transfer learning via fine-tuning
- Training set-up for few-shot meta-learning algorithms
- How to implement black-box meta-learning techniques
The last two are the topic of Homework 1!
Multi-Task Learning vs. Transfer Learning
Multi-task learning: solve multiple tasks 𝒰_1, ⋯, 𝒰_T at once:

min_θ ∑_{i=1}^T ℒ_i(θ, 𝒟_i)

Transfer learning: solve target task 𝒰_b after solving source task 𝒰_a, by transferring knowledge learned from 𝒰_a.

Key assumption: cannot access data 𝒟_a during transfer.
Transfer learning is a valid solution to multi-task learning.
(but not vice versa)
Question: What are some problems/applications where transfer learning might make sense?
(answer in chat or raise hand)
- when 𝒟_a is very large (don't want to retain & retrain on 𝒟_a)
- when you don't care about solving 𝒰_a & 𝒰_b simultaneously

Transfer learning via fine-tuning

Fine-tune parameters θ pre-trained on 𝒟_a with the training data 𝒟^tr for the new task 𝒰_b:

φ ← θ − α ∇_θ ℒ(θ, 𝒟^tr)

(typically for many gradient steps)
Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have
Some common practices:
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize the last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g. ResNets)
Pre-trained models often available online.
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. ‘16
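To make these practices concrete, here is a minimal PyTorch-style sketch of fine-tuning with a reinitialized last layer and a smaller learning rate for earlier layers. The model choice, class count, and `target_task_loader` are hypothetical stand-ins, not from the lecture:

```python
import torch
import torchvision

# Parameters pre-trained on a large, diverse dataset (here: ImageNet).
model = torchvision.models.resnet18(pretrained=True)

# Reinitialize the last layer for the new task's label space.
num_target_classes = 10  # hypothetical: depends on the target task
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)

# Smaller learning rate for the earlier, pre-trained layers;
# a larger one for the freshly initialized head.
optimizer = torch.optim.Adam([
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc")], "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])

loss_fn = torch.nn.CrossEntropyLoss()
for x, y in target_task_loader:  # hypothetical dataloader over D^tr
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```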
Fine-tuning doesn't work well with small target task datasets (Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. '18). This is where meta-learning can help.
Plan for Today
Transfer Learning
- Problem formulation
- Fine-tuning
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)
The Meta-Learning Problem Statement
(that we will consider in this class)
Two ways to view meta-learning algorithms
Mechanistic view
➢ Deep network that can read in an entire dataset and make predictions for new datapoints
➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task
Probabilistic view
➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
➢ Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters
Today: Focus primarily on the mechanistic view.
(Bayes will come back later)
How does meta-learning work? An example.
Given 1 example of 5 classes (the training data): classify new examples (the test set).

[Figure: meta-training on many few-shot tasks built from the training classes; meta-testing on a held-out task 𝒯_test]

Can replace image classification with: regression, language generation, skill learning, any ML problem.
The Meta-Learning Problem

Given data from tasks 𝒰_1, …, 𝒰_n, quickly solve new task 𝒰_test.

Key assumption: meta-training tasks and the meta-test task are drawn i.i.d. from the same task distribution: 𝒰_1, …, 𝒰_n ∼ p(𝒰), 𝒰_test ∼ p(𝒰)
What do the tasks correspond to?
- recognizing handwritten digits from different languages (see homework 1!)
- spam filter for different users
- classifying species in different regions of the world
- a robot performing different tasks
How many tasks do you need? The more the better (analogous to more data in ML). Like before, tasks must share structure.
Some terminology:
𝒟_i^tr: the task training set (the "support set")
𝒟_i^test: the task test dataset (the "query set")
k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes
Question: What are k and N for the above example? (answer in chat)
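As a concrete illustration of this setup, here is a minimal sketch of sampling one N-way, k-shot task; `data_by_class` (a dict from class name to a list of examples) and the split sizes are illustrative assumptions, not the homework's API:

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, query_size=5):
    """Sample one N-way, k-shot task: a support set and a query set."""
    classes = random.sample(list(data_by_class), n_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + query_size)
        # The support set (D_i^tr) and query set (D_i^test) must be disjoint.
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query
```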
Problem Settings Recap

Multi-task learning: solve multiple tasks 𝒰_1, ⋯, 𝒰_T at once:

min_θ ∑_{i=1}^T ℒ_i(θ, 𝒟_i)

Transfer learning: solve target task 𝒰_b after solving source task 𝒰_a, by transferring knowledge learned from 𝒰_a.

Meta-learning: given data from 𝒰_1, …, 𝒰_n, quickly solve new task 𝒰_test.

In all settings: tasks must share structure. In transfer learning and meta-learning: generally impractical to access prior tasks.
Plan for Today
Transfer Learning
- Problem formulation
- Fine-tuning
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)
General recipe

How to evaluate a meta-learning algorithm: the Omniglot dataset (Lake et al. Science 2015). 1,623 characters from 50 different alphabets, 20 instances of each character. The "transpose" of MNIST: many classes, few examples, with statistics more reflective of the real world. Proposes both few-shot discriminative & few-shot generative problems. Initial few-shot learning approaches used Bayesian models & non-parametrics (Fei-Fei et al. '03; Lake et al. '11; Salakhutdinov et al. '12; Lake et al. '13).

Other datasets used for few-shot image recognition: tieredImageNet, CIFAR, CUB, CelebA, others. Other benchmarks: molecular property prediction (Nguyen et al. '20), object pose prediction (Yin et al. ICLR '20).
Another View on the Meta-Learning Problem

Supervised learning:
Inputs: x. Outputs: y. Data: 𝒟 = {(x, y)_j}

Meta supervised learning:
Inputs: 𝒟^tr, x^ts. Outputs: y^ts. Data: {𝒟_i}, where 𝒟_i = {(x, y)_j}

Why is this view useful? It reduces the meta-learning problem to the design & optimization of f_θ, where y^ts = f_θ(𝒟^tr, x^ts) (see the sketch below).
- Finn. Learning to Learn with Gradients. PhD Thesis 2018
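A minimal way to see this reduction in code: the only difference is that the meta-learner takes the task training set as an additional input (types are placeholders, for illustration only):

```python
from typing import Callable, List, Tuple

Input, Label = float, int  # placeholder types, for illustration only

# Supervised learning: f(x) -> y, trained on a single dataset D = {(x, y)_j}.
SupervisedModel = Callable[[Input], Label]

# Meta supervised learning: f(D_tr, x_ts) -> y_ts, trained on a
# meta-dataset {D_i}; the task training set is part of the input.
MetaModel = Callable[[List[Tuple[Input, Label]], Input], Label]
```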
General recipe
How to design a meta-learning algorithm:
1. Choose a form of p(φ_i | 𝒟_i^tr, θ), where θ are the meta-parameters
2. Choose how to optimize θ w.r.t. the max-likelihood objective using the meta-training data
Plan for Today
Transfer Learning
- Problem formulation
- Fine-tuning
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)
Black-Box Adaptation

Key idea: train a neural network f_θ to represent φ_i = f_θ(𝒟_i^tr), and predict test points with y^ts = g_{φ_i}(x^ts). Train with standard supervised learning on the loss ℒ(φ_i, 𝒟_i^test):

max_θ ∑_{𝒯_i} ℒ(f_θ(𝒟_i^tr), 𝒟_i^test) = max_θ ∑_{𝒯_i} ∑_{(x,y)∼𝒟_i^test} log g_{φ_i}(y | x), where φ_i = f_θ(𝒟_i^tr)

f_θ is the "learner".
Black-Box Adaptation

Key idea: train a neural network to represent φ_i = f_θ(𝒟_i^tr).

1. Sample task 𝒯_i (or a mini-batch of tasks)
2. Sample disjoint datasets 𝒟_i^tr, 𝒟_i^test from 𝒟_i
3. Compute φ_i ← f_θ(𝒟_i^tr)
4. Update θ using ∇_θ ℒ(φ_i, 𝒟_i^test)

(This loop is sketched in code below.)
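A minimal PyTorch-style sketch of this meta-training loop, reusing the `sample_episode` helper sketched earlier; `f` (support set → φ_i) and `g` (φ_i, query → predictions) are assumed to be `torch.nn.Module`s, and all names are illustrative rather than the homework's reference code:

```python
import torch
import torch.nn.functional as F

# theta = parameters of f and g; both are optimized jointly.
optimizer = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-3)

for step in range(num_meta_train_steps):
    # 1./2. Sample a task and disjoint D_i^tr, D_i^test.
    support, query = sample_episode(data_by_class)
    x_tr = torch.stack([x for x, _ in support])
    y_tr = torch.tensor([y for _, y in support])
    x_ts = torch.stack([x for x, _ in query])
    y_ts = torch.tensor([y for _, y in query])

    phi = f(x_tr, y_tr)        # 3. phi_i <- f_theta(D_i^tr)
    logits = g(phi, x_ts)      #    predict y^ts = g_phi(x^ts)
    loss = F.cross_entropy(logits, y_ts)

    optimizer.zero_grad()      # 4. update theta using grad of L(phi_i, D_i^test)
    loss.backward()
    optimizer.step()
```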
Black-Box Adaptation

Challenge: outputting all neural net parameters does not seem scalable.

Idea: we do not need to output all parameters of the neural net, only sufficient statistics (Santoro et al. MANN, Mishra et al. SNAIL). f_θ outputs a low-dimensional vector h_i that represents contextual task information, so φ_i = {h_i, θ_g}, where θ_g are shared parameters.

General form: y^ts = f_θ(𝒟_i^tr, x^ts)
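A sketch of this sufficient-statistics idea (module names and shapes are assumptions for illustration): f_θ averages support-set embeddings into a low-dimensional context h_i, and g keeps its own shared weights θ_g, conditioning on h_i by concatenation instead of receiving a full set of per-task weights:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """f_theta: averages support-set embeddings into a low-dim context h_i."""
    def __init__(self, in_dim, n_classes, h_dim=64):
        super().__init__()
        self.n_classes = n_classes
        self.enc = nn.Linear(in_dim + n_classes, h_dim)

    def forward(self, x_tr, y_tr):
        y = nn.functional.one_hot(y_tr, self.n_classes).float()
        z = self.enc(torch.cat([x_tr, y], dim=-1))
        return z.mean(dim=0)  # h_i: low-dim sufficient statistics, not weights

class ContextPredictor(nn.Module):
    """g: shared parameters theta_g, conditioned on h_i by concatenation."""
    def __init__(self, in_dim, h_dim, n_classes):
        super().__init__()
        self.net = nn.Linear(in_dim + h_dim, n_classes)

    def forward(self, h, x_ts):
        h_rep = h.expand(x_ts.shape[0], -1)  # broadcast h_i over query points
        return self.net(torch.cat([x_ts, h_rep], dim=-1))
```

Note the mean over support examples: the resulting context does not depend on the order of the support set.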
Black-Box Adaptation

What architecture should we use for f_θ?
- LSTMs or Neural Turing Machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML '16
- Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML '18
- Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML '17
- Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR '18
Question: Why might feedforward+average be better than a recurrent model?
(raise your hand)
HW 1:
- implement data processing
- implement simple black-box meta-learner
- train few-shot Omniglot classifier
Black-Box Adaptation

Key idea: train a neural network to represent φ_i = f_θ(𝒟_i^tr).

+ expressive
+ easy to combine with a variety of learning problems (e.g. SL, RL)
- complex model w/ complex task: challenging optimization problem
- often data-inefficient

How else can we represent φ_i = f_θ(𝒟_i^tr)? Next time (Wednesday): what if we treat it as an optimization procedure?
Plan for Today
Transfer Learning
- Problem formulation
- Fine-tuning
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)
Case Study: GPT-3
May 2020
What is GPT-3?

A language model: a black-box meta-learner trained on language generation tasks.

Architecture: giant "Transformer" network with 175 billion parameters, 96 layers, 3.2M batch size.

𝒟_i^tr: a sequence of characters
𝒟_i^ts: the following sequence of characters

[Meta-training] dataset: crawled data from the internet, English-language Wikipedia, and two books corpora.
What do different tasks correspond to? Spelling correction, simple math problems, translating between languages, and a variety of other tasks.

How can those tasks all be solved by a single architecture? Put them all in the form of text! Why is that a good idea? Very easy to get a lot of meta-training data.
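For example, a few-shot translation task can be posed entirely as text: the task training set 𝒟^tr becomes in-context examples in the prompt, and the model's continuation is the prediction (an illustrative prompt in the style of the GPT-3 paper, not one from the lecture):

```python
# D^tr as in-context examples; the final line is the query x^ts, and the
# language model's continuation of the prompt is the prediction y^ts.
prompt = """Translate English to French:
sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
```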
Some Results
Examples shown: one-shot learning from dictionary definitions, few-shot language editing, and non-few-shot learning tasks. [Result figures omitted]
Other Cool Use-Cases
English to LaTeX
Source: https://twitter.com/sh_reya/status/1284746918959239168
General Notes & Takeaways
The results are extremely impressive. The model is far from perfect. The model fails in unintuitive ways.
Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md
The choice of 𝒟_i^tr at test time is important.
Plan for Today
Transfer Learning
- Problem formulation
- Fine-tuning
Meta-Learning
- Problem formulation
- General recipe of meta-learning algorithms
- Black-box adaptation approaches
- Case study of GPT-3 (time-permitting)
Goals by the end of lecture:
- Differences between multi-task learning, transfer learning, and meta-learning problems
- Basics of transfer learning via fine-tuning
- Training set-up for few-shot meta-learning algorithms
- How to implement black-box meta-learning techniques