SLIDE 1

CS 330

The Meta-Learning Problem & Black-Box Meta-Learning

SLIDE 2

Logistics

Homework 1 posted today, due Wednesday, September 30. Project guidelines will be posted by tomorrow.

SLIDE 3

Plan for Today

Transfer Learning

  • Problem formulation
  • Fine-tuning

Meta-Learning

  • Problem formulation
  • General recipe of meta-learning algorithms
  • Black-box adaptation approaches
  • Case study of GPT-3 (time-permitting)

Goals for the end of lecture:

  • Differences between multi-task learning, transfer learning, and meta-learning problems
  • Basics of transfer learning via fine-tuning
  • Training set-up for few-shot meta-learning algorithms
  • How to implement black-box meta-learning techniques

Topic of Homework 1!


SLIDE 4

Multi-Task Learning vs. Transfer Learning

Multi-Task Learning: Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once:

  min_θ Σ_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Transfer Learning: Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a. Key assumption: cannot access data 𝒟_a during transfer.

Transfer learning is a valid solution to multi-task learning (but not vice versa).

Question: What are some problems/applications where transfer learning might make sense? (answer in chat or raise hand)

  • when 𝒟_a is very large (don’t want to retain & retrain on 𝒟_a)
  • when you don’t care about solving 𝒯_a & 𝒯_b simultaneously

SLIDE 5

Transfer learning via fine-tuning

Fine-tune parameters θ pre-trained on 𝒟_a using training data 𝒟^tr for new task 𝒯_b:

  φ ← θ − α ∇_θ ℒ(θ, 𝒟^tr)

(typically for many gradient steps)

Where do you get the pre-trained parameters?

  • ImageNet classification
  • Models trained on large language corpora (BERT, LMs)
  • Other unsupervised learning techniques
  • Whatever large, diverse dataset you might have

Pre-trained models often available online.

Some common practices

  • Fine-tune with a smaller learning rate
  • Smaller learning rate for earlier layers
  • Freeze earlier layers, gradually unfreeze
  • Reinitialize last layer
  • Search over hyperparameters via cross-val
  • Architecture choices matter (e.g. ResNets)

What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. ‘16
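To make the update and the common practices above concrete, here is a minimal PyTorch sketch. Assumptions not from the slides: `train_loader` yields batches from the small target-task dataset 𝒟^tr, and the target task has 10 classes.

```python
# A minimal sketch of transfer learning via fine-tuning in PyTorch.
import torch
import torch.nn as nn
import torchvision.models as models

# Parameters theta pre-trained on D_a (here: ImageNet classification)
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Common practice: reinitialize the last layer for the new task
model.fc = nn.Linear(model.fc.in_features, 10)

# Common practice: smaller learning rate for the earlier, pre-trained layers
optimizer = torch.optim.SGD([
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc")], "lr": 1e-4},
    {"params": model.fc.parameters(), "lr": 1e-2},
], momentum=0.9)

loss_fn = nn.CrossEntropyLoss()
for x, y in train_loader:           # D^tr for the new task T_b (assumed)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)     # L(theta, D^tr)
    loss.backward()
    optimizer.step()                # phi <- theta - alpha * grad
```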

SLIDE 6

Fine-tuning doesn’t work well with small target task datasets. This is where meta-learning can help.

Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. ‘18

SLIDE 7

Plan for Today

Transfer Learning

  • Problem formulation
  • Fine-tuning

Meta-Learning

  • Problem formulation
  • General recipe of meta-learning algorithms
  • Black-box adaptation approaches
  • Case study of GPT-3 (time-permitting)
SLIDE 8

The Meta-Learning Problem Statement

(that we will consider in this class)

SLIDE 9

Two ways to view meta-learning algorithms

Mechanistic view

  ➢ Deep network that can read in an entire dataset and make predictions for new datapoints
  ➢ Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task

Probabilistic view

  ➢ Extract prior knowledge from a set of tasks that allows efficient learning of new tasks
  ➢ Learning a new task uses this prior and (small) training set to infer most likely posterior parameters

Today: Focus primarily on the mechanistic view.

(Bayes will come back later)

SLIDE 10

How does meta-learning work? An example.

Given 1 example of 5 classes (training data): classify new examples (test set)

SLIDE 11

How does meta-learning work? An example.

[Figure: meta-training on many tasks built from the training classes, then meta-testing on a held-out task 𝒯_test]

Given 1 example of 5 classes (training data): classify new examples (test set)

Can replace image classification with: regression, language generation, skill learning, any ML problem

SLIDE 12

The Meta-Learning Problem: Given data from 𝒯_1, …, 𝒯_n, quickly solve new task 𝒯_test.

Key assumption: meta-training tasks and meta-test task drawn i.i.d. from same task distribution: 𝒯_1, …, 𝒯_n ∼ p(𝒯), 𝒯_test ∼ p(𝒯)

What do the tasks correspond to?

  • recognizing handwritten digits from different languages (see homework 1!)
  • spam filter for different users
  • classifying species in different regions of the world
  • a robot performing different tasks

How many tasks do you need? The more the better (analogous to more data in ML). Like before, tasks must share structure.

SLIDE 13

Some terminology

  • 𝒟^tr_i: task training set (“support set”)
  • 𝒟^test_i: task test dataset (“query set”)

k-shot learning: learning with k examples per class (or k examples total for regression)
N-way classification: choosing between N classes

Question: What are k and N for the above example? (answer in chat)
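To make the episode construction concrete, here is a minimal sketch of sampling one N-way, k-shot task. The dict `images_by_class` (class label → list of examples) is a hypothetical stand-in for a dataset such as Omniglot.

```python
# A minimal sketch of building one N-way, k-shot episode.
import random

def sample_episode(images_by_class, n_way=5, k_shot=1, q_queries=1):
    classes = random.sample(list(images_by_class), n_way)
    support, query = [], []  # D^tr_i ("support set"), D^test_i ("query set")
    for label, cls in enumerate(classes):
        examples = random.sample(images_by_class[cls], k_shot + q_queries)
        support += [(x, label) for x in examples[:k_shot]]
        query += [(x, label) for x in examples[k_shot:]]
    return support, query
```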

SLIDE 14

Problem Settings Recap

Multi-Task Learning: Solve multiple tasks 𝒯_1, ⋯, 𝒯_T at once:

  min_θ Σ_{i=1}^{T} ℒ_i(θ, 𝒟_i)

Transfer Learning: Solve target task 𝒯_b after solving source task 𝒯_a, by transferring knowledge learned from 𝒯_a.

The Meta-Learning Problem: Given data from 𝒯_1, …, 𝒯_n, quickly solve new task 𝒯_test.

In all settings: tasks must share structure. In transfer learning and meta-learning: generally impractical to access prior tasks.

SLIDE 15

Plan for Today

Transfer Learning

  • Problem formulation
  • Fine-tuning

Meta-Learning

  • Problem formulation
  • General recipe of meta-learning algorithms
  • Black-box adaptation approaches
  • Case study of GPT-3 (time-permitting)
SLIDE 16

General recipe: How to evaluate a meta-learning algorithm

The Omniglot dataset (Lake et al. Science 2015): 1623 characters from 50 different alphabets, 20 instances of each character. It is the “transpose” of MNIST (many classes, few examples), with statistics more reflective of the real world.

Omniglot proposes both few-shot discriminative & few-shot generative problems. Initial few-shot learning approaches used Bayesian models, non-parametrics: Fei-Fei et al. ‘03, Lake et al. ‘11, Salakhutdinov et al. ‘12, Lake et al. ‘13

Other datasets used for few-shot image recognition: tieredImageNet, CIFAR, CUB, CelebA, others. Other benchmarks: molecular property prediction (Nguyen et al. ’20), object pose prediction (Yin et al. ICLR ’20)

SLIDE 17

Another View on the Meta-Learning Problem

Supervised Learning:
  Inputs: x   Outputs: y   Data: 𝒟 = {(x, y)_j}

Meta Supervised Learning:
  Inputs: 𝒟^tr, x^ts   Outputs: y^ts   Data: {𝒟_i}

Why is this view useful? Reduces the meta-learning problem to the design & optimization of f.

  • Finn. Learning to Learn with Gradients. PhD Thesis 2018
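To make this view concrete, here is a toy instance of such a function: a hand-coded 1-nearest-neighbor rule with the meta-learning signature y^ts = f(𝒟^tr, x^ts). It is purely illustrative; black-box meta-learning replaces this rule with a trained neural network.

```python
# A toy function that reads an entire (tiny) dataset D^tr plus a query
# x^ts and outputs y^ts: a 1-nearest-neighbor rule, for illustration only.
from typing import List, Tuple

def f(support: List[Tuple[List[float], int]], x_ts: List[float]) -> int:
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # predict the label of the closest support example
    return min(support, key=lambda xy: dist(xy[0], x_ts))[1]

# usage: a 2-way, 1-shot toy episode
support = [([0.0, 0.0], 0), ([1.0, 1.0], 1)]
print(f(support, [0.9, 0.8]))  # -> 1
```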

SLIDE 18

General recipe: How to design a meta-learning algorithm

  1. Choose a form of p(φ_i | 𝒟^tr_i, θ)  (θ: the meta-parameters)
  2. Choose how to optimize θ w.r.t. the max-likelihood objective using meta-training data

SLIDE 19

Plan for Today

Transfer Learning

  • Problem formulation
  • Fine-tuning

Meta-Learning

  • Problem formulation
  • General recipe of meta-learning algorithms
  • Black-box adaptation approaches
  • Case study of GPT-3 (time-permitting)
SLIDE 20

Black-Box Adaptation

Key idea: Train a neural network f_θ (the “learner”) to represent φ_i = f_θ(𝒟^tr_i). Predict test points with y^ts = g_{φ_i}(x^ts).

Train with standard supervised learning on ℒ(φ_i, 𝒟^test_i):

  max_θ Σ_{𝒯_i} ℒ(f_θ(𝒟^tr_i), 𝒟^test_i)

or, written as a log-likelihood:

  max_θ Σ_{𝒯_i} Σ_{(x,y)∼𝒟^test_i} log g_{φ_i}(y | x),  where φ_i = f_θ(𝒟^tr_i)
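A minimal PyTorch sketch of this objective for one task, where f_θ outputs all the parameters φ_i of a linear classifier g_{φ_i}. The sizes and architecture are illustrative assumptions, not the exact HW1 setup.

```python
# Sketch: f_theta reads the support set D^tr_i and outputs phi_i, here
# the full weights of a linear classifier g_{phi_i} for the query points.
import torch
import torch.nn as nn
import torch.nn.functional as F

n_way, dim, hidden = 5, 64, 128              # 5-way tasks, 64-dim features

f_theta = nn.Sequential(                     # D^tr_i -> phi_i
    nn.Linear(n_way * (dim + n_way), hidden), nn.ReLU(),
    nn.Linear(hidden, n_way * dim + n_way),  # all parameters of g_{phi_i}
)

def task_loss(support_x, support_y_onehot, query_x, query_y):
    # phi_i = f_theta(D^tr_i): encode the labeled support set as one vector
    pairs = torch.cat([support_x, support_y_onehot], dim=-1).reshape(1, -1)
    phi = f_theta(pairs).squeeze(0)
    W, b = phi[:n_way * dim].reshape(n_way, dim), phi[n_way * dim:]
    logits = query_x @ W.T + b               # y^ts = g_{phi_i}(x^ts)
    # maximizing log g_{phi_i}(y|x) over D^test_i = minimizing cross-entropy
    return F.cross_entropy(logits, query_y)
```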

SLIDE 21

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟^tr_i).

  1. Sample task 𝒯_i (or mini batch of tasks)
  2. Sample disjoint datasets 𝒟^tr_i, 𝒟^test_i from 𝒟_i

SLIDE 22

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟^tr_i).

  1. Sample task 𝒯_i (or mini batch of tasks)
  2. Sample disjoint datasets 𝒟^tr_i, 𝒟^test_i from 𝒟_i
  3. Compute φ_i ← f_θ(𝒟^tr_i)
  4. Update θ using ∇_θ ℒ(φ_i, 𝒟^test_i)
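The four steps become a short training loop. This continues the sketches above: it reuses the hypothetical `f_theta` / `task_loss` and the `sample_episode` helper, and assumes `images_by_class` maps each class to a list of feature tensors.

```python
# Meta-training loop for the black-box learner (steps 1-4), continuing
# the previous sketches (f_theta, task_loss, sample_episode assumed).
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

for step in range(10000):
    # 1.-2. Sample task T_i and disjoint D^tr_i, D^test_i from D_i
    support, query = sample_episode(images_by_class, n_way=5, k_shot=1)
    sx = torch.stack([x for x, _ in support])
    sy = F.one_hot(torch.tensor([y for _, y in support]), n_way).float()
    qx = torch.stack([x for x, _ in query])
    qy = torch.tensor([y for _, y in query])
    # 3. Compute phi_i = f_theta(D^tr_i)  (happens inside task_loss)
    loss = task_loss(sx, sy, qx, qy)
    # 4. Update theta using grad_theta L(phi_i, D^test_i)
    opt.zero_grad(); loss.backward(); opt.step()
```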

SLIDE 23

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟^tr_i).

Challenge: Outputting all neural net parameters does not seem scalable.

Idea: Do not need to output all parameters of the neural net, only sufficient statistics: φ_i = {h_i, θ_g}, where h_i is a low-dimensional vector representing contextual task information.

  general form: y^ts = f_θ(𝒟^tr_i, x^ts)

(Santoro et al. MANN, Mishra et al. SNAIL)

What architecture should we use for f_θ?
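A sketch of this variant: f_θ compresses 𝒟^tr_i into a low-dimensional context vector h_i, and a shared head θ_g conditions on it, so φ_i = {h_i, θ_g}. This happens to use the "feedforward + average" architecture of the next slide; sizes are illustrative assumptions.

```python
# Sketch: output only sufficient statistics h_i instead of all weights.
import torch
import torch.nn as nn

n_way, dim, h_dim = 5, 64, 32

encoder = nn.Linear(dim + n_way, h_dim)   # embeds each (x, y) support pair
g = nn.Sequential(                        # theta_g, shared across tasks
    nn.Linear(h_dim + dim, 128), nn.ReLU(), nn.Linear(128, n_way),
)

def predict(support_x, support_y_onehot, query_x):
    pairs = torch.cat([support_x, support_y_onehot], dim=-1)
    h_i = encoder(pairs).mean(dim=0)           # low-dim task context h_i
    h = h_i.expand(query_x.shape[0], -1)       # condition each query on h_i
    return g(torch.cat([h, query_x], dim=-1))  # y^ts = f_theta(D^tr_i, x^ts)
```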

SLIDE 24

Black-Box Adaptation

  • LSTMs or Neural Turing machine (NTM): Meta-Learning with Memory-Augmented Neural Networks. Santoro, Bartunov, Botvinick, Wierstra, Lillicrap. ICML ‘16 (see the recurrent sketch below)
  • Other external memory mechanisms: Meta Networks. Munkhdalai, Yu. ICML ‘17
  • Feedforward + average: Conditional Neural Processes. Garnelo, Rosenbaum, Maddison, Ramalho, Saxton, Shanahan, Teh, Rezende, Eslami. ICML ‘18
  • Convolutions & attention: A Simple Neural Attentive Meta-Learner. Mishra, Rohaninejad, Chen, Abbeel. ICLR ‘18

Question: Why might feedforward+average be better than a recurrent model? (raise your hand)
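For contrast with the feedforward + average sketch above, here is a minimal recurrent variant in the spirit of the LSTM-based approaches: the support pairs are fed through an LSTM and its final hidden state serves as the task context. Sizes are illustrative assumptions. Note that, unlike the averaged version, the output here depends on the order in which the support examples are fed in; keep that in mind for the question above.

```python
# Sketch: a recurrent black-box meta-learner that summarizes D^tr_i
# with the final hidden state of an LSTM.
import torch
import torch.nn as nn

n_way, dim, h_dim = 5, 64, 128
lstm = nn.LSTM(input_size=dim + n_way, hidden_size=h_dim, batch_first=True)
head = nn.Linear(h_dim + dim, n_way)

def predict(support_x, support_y_onehot, query_x):
    pairs = torch.cat([support_x, support_y_onehot], dim=-1).unsqueeze(0)
    _, (h_n, _) = lstm(pairs)                 # final state summarizes D^tr_i
    h = h_n[-1].expand(query_x.shape[0], -1)  # broadcast over queries
    return head(torch.cat([h, query_x], dim=-1))
```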

HW 1:

  • implement data processing
  • implement simple black-box meta-learner
  • train few-shot Omniglot classifier
SLIDE 25

Black-Box Adaptation

Key idea: Train a neural network to represent φ_i = f_θ(𝒟^tr_i).

  + expressive
  + easy to combine with variety of learning problems (e.g. SL, RL)
  − complex model w/ complex task: challenging optimization problem
  − often data-inefficient

How else can we represent φ_i = f_θ(𝒟^tr_i)? Next time (Wednesday): What if we treat it as an optimization procedure?

SLIDE 26

Plan for Today

Transfer Learning

  • Problem formulation
  • Fine-tuning

Meta-Learning

  • Problem formulation
  • General recipe of meta-learning algorithms
  • Black-box adaptation approaches
  • Case study of GPT-3 (time-permitting)
SLIDE 27

Case Study: GPT-3

May 2020

SLIDE 28

What is GPT-3?

A language model: a black-box meta-learner trained on language generation tasks.

architecture: giant “Transformer” network; 175 billion parameters, 96 layers, 3.2M batch size

𝒟^tr_i: sequence of characters
𝒟^ts_i: the following sequence of characters

[meta-training] dataset: crawled data from the internet, English-language Wikipedia, two books corpora

What do different tasks correspond to? spelling correction, simple math problems, translating between languages, a variety of other tasks

How can those tasks all be solved by a single architecture?

SLIDE 29

How can those tasks all be solved by a single architecture? Put them all in the form of text: spelling correction, simple math problems, translating between languages.

Why is that a good idea? Very easy to get a lot of meta-training data.
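As an illustration of "put them all in the form of text": the support set 𝒟^tr_i becomes a few worked examples concatenated into a prompt, and the model's continuation of the prompt is its prediction for the query. The exact format below is illustrative, loosely following the paper's translation demos.

```python
# Sketch: a few-shot translation task expressed as a single text prompt.
support = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
query = "plush giraffe"

prompt = "Translate English to French.\n"
prompt += "".join(f"{en} => {fr}\n" for en, fr in support)  # D^tr_i as text
prompt += f"{query} =>"    # the model completes the French translation
print(prompt)
```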

SLIDE 30

Some Results

  • One-shot learning from dictionary definitions
  • Few-shot language editing
  • Non-few-shot learning tasks

[Figures showing model outputs for each]

SLIDE 31

Other Cool Use-Cases

English to LaTeX

Source: https://twitter.com/sh_reya/status/1284746918959239168

SLIDE 32

General Notes & Takeaways

The results are extremely impressive. The model is far from perfect. The model fails in unintuitive ways.

The choice of 𝒟^tr_i at test time is important.

Source: https://lacker.io/ai/2020/07/06/giving-gpt-3-a-turing-test.html
Source: https://github.com/shreyashankar/gpt3-sandbox/blob/master/docs/priming.md

SLIDE 33

Plan for Today

Transfer Learning

  • Problem formulation
  • Fine-tuning

Meta-Learning

  • Problem formulation
  • General recipe of meta-learning algorithms
  • Black-box adaptation approaches
  • Case study of GPT-3 (time-permitting)

Goals for the end of lecture:

  • Differences between multi-task learning, transfer learning, and meta-learning problems
  • Basics of transfer learning via fine-tuning
  • Training set-up for few-shot meta-learning algorithms
  • How to implement black-box meta-learning techniques

Topic of Homework 1!


SLIDE 34

Reminders

Homework 1 posted today, due Wednesday, September 30. Project guidelines will be posted by tomorrow.

Next time: Optimization-based meta-learning