A few meta learning papers (Guy Gur-Ari, Machine Learning Journal Club, September 2017)



SLIDE 1

A few meta learning papers

Machine Learning Journal Club, September 2017 Guy Gur-Ari

SLIDE 2

Meta Learning

  • Mechanisms for faster, better adaptation to new tasks
  • ‘Integrate prior experience with a small amount of new information’
  • Examples: image classifier applied to new classes, game player applied to new games, …
  • Related: single-shot learning, catastrophic forgetting
  • Learning how to learn (instead of designing by hand)

SLIDE 3

Meta Learning

  • Mechanisms for faster, better adaptation to new tasks
  • Learning how to learn (instead of designing by hand)
  • Each task is a single training sample
  • Performance metric: Generalization to new tasks
  • Higher derivatives show up, but first-order approximations sometimes work well

SLIDE 4

Transfer Learning (ad-hoc meta-learning)

SLIDE 5

Learning to learn by gradient descent by gradient descent (Andrychowicz et al.)

1606.04474

SLIDE 6

Basic idea

  • Target parameters: θ
  • Optimizer parameters: ɸ
  • Target (‘optimizee’) loss function: f
  • Optimizer: a recurrent neural network m with parameters ɸ

[1606.04474]

SLIDE 7

Vanilla RNN refresher

h_t = tanh(W_h h_{t−1} + W_x x_t),   y_t = W_y h_t

[Karpathy]


Backpropagation through time
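The refresher above can be made concrete. A minimal numpy sketch of the vanilla RNN step; the dimensions and weight scales are illustrative, not from any paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: hidden state H, input X, output Y
H, X, Y = 4, 3, 2
Wh = rng.normal(scale=0.1, size=(H, H))
Wx = rng.normal(scale=0.1, size=(H, X))
Wy = rng.normal(scale=0.1, size=(Y, H))

def rnn_step(h_prev, x):
    """One vanilla RNN step: h_t = tanh(Wh h_{t-1} + Wx x_t), y_t = Wy h_t."""
    h = np.tanh(Wh @ h_prev + Wx @ x)
    y = Wy @ h
    return h, y

# Unroll over a short sequence; BPTT backpropagates through this chain.
h = np.zeros(H)
for t in range(5):
    h, y = rnn_step(h, rng.normal(size=X))
print(y.shape)  # (2,)
```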

SLIDE 8

Meta loss function

Ideal: L(ɸ) = E_f [ f(θ*(f, ɸ)) ], where θ* are the optimal target parameters for the given optimizer.

In practice: L(ɸ) = E_f [ Σ_t w_t f(θ_t) ] with w_t ≡ 1, where the trajectory is generated by

∇_t = ∇_θ f(θ_t),   [g_t, h_{t+1}] = m(∇_t, h_t, ɸ),   θ_{t+1} = θ_t + g_t

m is the optimizer RNN (a 2-layer LSTM); h_t is its hidden state.

[1606.04474]

SLIDE 9

Meta loss function

  • The recurrent network can use trajectory information (the gradients ∇_t = ∇_θ f(θ_t)), similar to momentum
  • Including historical losses (w_t ≡ 1) also helps with backpropagation through time

[1606.04474]

SLIDE 10

Training protocol

  • Sample a random task f
  • Train the optimizer on f by gradient descent (100 steps, unrolled 20 at a time)
  • Repeat

[1606.04474]
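The protocol can be sketched in miniature. This is a hedged toy, not the paper's setup: the "optimizer" is reduced to a single learned log-learning-rate ɸ (rather than an LSTM), the tasks are random 1-D quadratics, and the meta-gradient is taken by finite differences instead of backpropagation through the unrolled graph:

```python
import numpy as np

rng = np.random.default_rng(0)

def unrolled_meta_loss(phi, a, theta0, steps=20):
    """Run the learned update g_t = -exp(phi) * grad f(theta_t) on the task
    f(x) = 0.5*a*x^2 and sum the losses along the trajectory (w_t = 1)."""
    theta, total = theta0, 0.0
    for _ in range(steps):
        grad = a * theta                      # grad f(theta)
        theta = theta - np.exp(phi) * grad
        total += 0.5 * a * theta ** 2
    return total

def meta_grad(phi, tasks, eps=1e-4):
    # The paper backpropagates through the unrolled graph (truncated BPTT);
    # a central finite difference keeps this sketch dependency-free.
    L = lambda p: sum(unrolled_meta_loss(p, a, t0) for a, t0 in tasks)
    return (L(phi + eps) - L(phi - eps)) / (2 * eps)

tasks = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(8)]
phi = -4.0                                    # start with a tiny learning rate
before = sum(unrolled_meta_loss(phi, a, t0) for a, t0 in tasks)
for _ in range(150):
    phi -= 0.05 * np.clip(meta_grad(phi, tasks), -2.0, 2.0)
after = sum(unrolled_meta_loss(phi, a, t0) for a, t0 in tasks)
print(after < before)
```

Meta-training drives the learned learning rate up from its tiny initial value until the unrolled trajectories converge quickly, reducing the summed loss.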

SLIDE 11

Test optimizer performance

  • Sample new tasks
  • Apply optimizer for some steps, compute average loss
  • Compare with existing optimizers (ADAM, RMSProp)

[1606.04474]

SLIDE 12

Computational graph

Graph used for computing the gradient of the optimizer (with respect to ɸ)

[1606.04474]

SLIDE 13

Simplifying assumptions

  • No 2nd-order derivatives: ∇_ɸ ∇_θ f = 0
  • RNN weights shared between target parameters
  • Result is independent of parameter ordering
  • Each parameter has a separate hidden state

[1606.04474]
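The sharing assumptions can be illustrated with a toy coordinatewise cell (scalar weights chosen arbitrarily, not the paper's LSTM): the same weights act on every target parameter, each parameter keeps its own hidden state, so permuting the parameters permutes the updates identically:

```python
import numpy as np

# Shared (scalar) cell weights, applied to every coordinate; values are illustrative.
w_h, w_g, w_o = 0.5, 1.0, -0.1

def step(grads, hidden):
    """Apply the same cell independently to each coordinate (elementwise)."""
    hidden = np.tanh(w_h * hidden + w_g * grads)
    return w_o * hidden, hidden          # (updates, new hidden states)

grads = np.array([0.3, -1.2, 0.7])       # one gradient per target parameter
h0 = np.zeros(3)                         # separate hidden state per parameter

upd, h1 = step(grads, h0)

# Permuting the parameters permutes the result identically:
perm = np.array([2, 0, 1])
upd_p, _ = step(grads[perm], h0[perm])
print(np.allclose(upd_p, upd[perm]))  # True
```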

SLIDE 14

Experiments

[1606.04474]

Variability is in initial target parameters and choice of mini-batches

SLIDE 15

Experiments

[1606.04474]

Separate optimizers for convolutional and fully-connected layers

SLIDE 16

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn, Abbeel, Levine)

1703.03400

SLIDE 17

Basic idea

  • Start with a class of tasks with distribution p(T)
  • Train one model θ that can be quickly fine-tuned to new tasks (‘few-shot learning’)
  • How? Explicitly require that a single training step will significantly improve the loss
  • Meta loss function, optimized over θ:

min_θ Σ_{T_i ∼ p(T)} L_{T_i}( θ − α ∇_θ L_{T_i}(θ) )

[1703.03400]
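A minimal sketch of this objective, assuming toy 1-D quadratic tasks f_i(θ) = ½(θ − c_i)² rather than the paper's experiments: one inner gradient step adapts θ toward c_i, and differentiating through that step brings a (1 − α)² factor into the meta-gradient (the second-derivative term; the first-order approximation would drop one factor of (1 − α)):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.4                      # inner-loop step size (toy value)
c = rng.normal(size=8)           # per-task optima for f_i(t) = 0.5*(t - c_i)^2

def inner_adapt(theta, ci):
    """One inner gradient step on task i: theta - alpha * grad f_i(theta)."""
    return theta - alpha * (theta - ci)

def meta_loss(theta):
    """MAML objective: task losses evaluated *after* one adaptation step."""
    return sum(0.5 * (inner_adapt(theta, ci) - ci) ** 2 for ci in c)

theta = 5.0
for _ in range(100):
    # Exact meta-gradient: differentiating through the inner step gives the
    # factor (1 - alpha)^2; here the second derivative of f_i is just 1.
    grad = (1 - alpha) ** 2 * float(np.sum(theta - c))
    theta -= 0.05 * grad

# After meta-training, a single inner step on a task lands near its optimum.
adapted = inner_adapt(theta, c[0])
print(abs(adapted - c[0]) < abs(theta - c[0]))
```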

SLIDE 18

(to avoid overfitting?)

[1703.03400]

SLIDE 19

Comments

  • Can be adapted to any scenario that uses gradient descent (e.g. regression, reinforcement learning)
  • Involves taking a second derivative
  • A first-order approximation still works well

[1703.03400]

SLIDE 20

Regression experiment

Single task = compute a sine with a given underlying amplitude and phase
Pretrained = a single model trained on many tasks simultaneously

Model is an FC network with 2 hidden layers

[1703.03400]

SLIDE 21

Classification experiment

Each classification class is a single task

[1703.03400]

SLIDE 22

RL experiment

Reward = negative squared distance from the goal position. For each task, the goal is placed randomly.

[1703.03400]

SLIDE 23

Overcoming catastrophic forgetting in neural networks (Kirkpatrick et al.)

1612.00796

SLIDE 24

Basic idea

  • Catastrophic forgetting: when a model is trained on task A followed by task B, it typically forgets A
  • Idea: after training on A, effectively freeze the parameters that are important for A by penalizing changes to them:

L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²

where λ is a hyperparameter, F_i is the diagonal of the Fisher information matrix, θ*_A are the optimal parameters for task A, and F_i ≈ ∂²L_A / ∂θ_i².

[1612.00796]
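A hedged numpy sketch of the penalty (toy 2-D quadratics; FA below is a stand-in for the diagonal Fisher, not computed from data): the high-Fisher coordinate is held near task A's optimum while the low-Fisher coordinate is free to move toward task B's optimum.

```python
import numpy as np

# Toy setup: task A cares strongly about theta[0], weakly about theta[1].
FA = np.array([10.0, 0.1])            # stand-in for the diagonal Fisher of task A
theta_A = np.array([1.0, 1.0])        # optimum after training on A
theta_B_opt = np.array([-1.0, -1.0])  # task B pulls both coordinates away

lam, lr = 1.0, 0.05
theta = theta_A.copy()
for _ in range(500):
    grad_B = theta - theta_B_opt            # gradient of L_B = 0.5*|theta - theta_B_opt|^2
    penalty = lam * FA * (theta - theta_A)  # gradient of sum_i (lam/2) F_i (theta_i - theta*_A,i)^2
    theta -= lr * (grad_B + penalty)

# The coordinate important for A barely moves; the unimportant one moves
# almost all the way to B's optimum.
print(abs(theta[0] - theta_A[0]) < abs(theta[1] - theta_A[1]))  # True
```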

SLIDE 25

Why Fisher information?

L(θ) = −log p(θ|D_A, D_B)
     = −log p(D_B|θ) − log p(θ) − log p(D_A|θ) + log p(D_A, D_B)
     ∼ L_B(θ) − log p(D_A|θ)

−log p(D_A|θ) = −Σ_i log p_θ(x_i) ∼ −Σ_x p_A(x) log p_θ(x)

Now suppose p_{θ*} = p_A. Then

−Σ_x p_{θ*}(x) log p_{θ*+dθ}(x) = S(p_{θ*}) + ½ dθᵀ F dθ + ···

with F_ij = E_{x∼p_θ}[ ∇_{θ_i} log p_θ(x) ∇_{θ_j} log p_θ(x) ]
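The final expansion on this slide can be spelled out (a standard step, not written out in the deck). Taylor-expanding the cross-entropy around θ* to second order in dθ:

```latex
-\sum_x p_{\theta^*}(x)\,\log p_{\theta^*+d\theta}(x)
  = S(p_{\theta^*})
  - \sum_x p_{\theta^*}(x)\,\partial_{\theta_i}\log p_{\theta}(x)\Big|_{\theta^*}\, d\theta_i
  - \frac{1}{2}\sum_x p_{\theta^*}(x)\,\partial_{\theta_i}\partial_{\theta_j}\log p_{\theta}(x)\Big|_{\theta^*}\, d\theta_i\, d\theta_j
  + \cdots
```

The first-order term vanishes because Σ_x p_{θ*}(x) ∂ log p_θ(x) = Σ_x ∂ p_θ(x) = ∂(1) = 0 at θ = θ*, and the second-order coefficient is the Fisher matrix via the standard identity −E_{p_{θ*}}[∂_i ∂_j log p_θ] = F_ij (equivalent at θ* to the outer-product form above), leaving S(p_{θ*}) + ½ dθᵀ F dθ.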

SLIDE 26

Why Fisher information?

L(θ) ∼ L_B(θ) + ½ dθᵀ F dθ,   dθ = θ − θ*_A

SLIDE 27

MNIST experiment

SLIDE 28