
  1. A few meta learning papers Guy Gur-Ari Machine Learning Journal Club, September 2017

  2. Meta Learning
  • Mechanisms for faster, better adaptation to new tasks
  • 'Integrate prior experience with a small amount of new information'
  • Examples: an image classifier applied to new classes, a game player applied to new games, …
  • Related: one-shot learning, catastrophic forgetting
  • Learning how to learn (instead of designing by hand)

  3. Meta Learning
  • Mechanisms for faster, better adaptation to new tasks
  • Learning how to learn (instead of designing by hand)
  • Each task is a single training sample
  • Performance metric: generalization to new tasks
  • Higher derivatives show up, but first-order approximations sometimes work well

  4. Transfer Learning (ad-hoc meta-learning)

  5. Learning to learn by gradient descent by gradient descent
  Andrychowicz et al. 1606.04474

  6. Basic idea
  The target ('optimizee') is a loss function f(θ). The optimizer is a recurrent neural network m with parameters φ that reads the target's gradient and proposes parameter updates:
  θ_{t+1} = θ_t + g_t,  [g_t, h_{t+1}] = m(∇f(θ_t), h_t, φ)
  [1606.04474]

  7. Vanilla RNN refresher
  h_t = tanh(W_h h_{t−1} + W_x x_t)
  y_t = W_y h_t
  Backpropagation through time unrolls the recurrence over steps t−1, t, t+1, …
  [Karpathy]
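The refresher above fits in a few lines of code; the layer sizes and random weights here are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
H, X, Y = 4, 3, 2                      # hidden, input, output sizes
Wh = rng.normal(scale=0.1, size=(H, H))
Wx = rng.normal(scale=0.1, size=(H, X))
Wy = rng.normal(scale=0.1, size=(Y, H))

def rnn_forward(xs):
    """Run the vanilla RNN over a sequence of inputs, returning all outputs."""
    h = np.zeros(H)
    ys = []
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x)   # h_t = tanh(W_h h_{t-1} + W_x x_t)
        ys.append(Wy @ h)              # y_t = W_y h_t
    return ys

ys = rnn_forward([rng.normal(size=X) for _ in range(5)])
```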

  8. Meta loss function
  Ideal: L(φ) = E_f[ f(θ*(f, φ)) ], where θ* are the optimal target parameters for a given optimizer.
  In practice: L(φ) = E_f[ Σ_t w_t f(θ_t) ] with w_t ≡ 1, where the RNN (a 2-layer LSTM) is fed ∇_t = ∇_θ f(θ_t) together with its hidden state.
  [1606.04474]

  9. Meta loss function
  ∇_t = ∇_θ f(θ_t),  w_t ≡ 1
  • The recurrent network can use trajectory information, similar to momentum
  • Including historical losses also helps with backpropagation through time
  [1606.04474]

  10. Training protocol
  • Sample a random task f
  • Train the optimizer on f by gradient descent (100 steps, unrolled 20 at a time)
  • Repeat
  [1606.04474]
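A drastically simplified sketch of this protocol: the LSTM optimizer is replaced by a single learnable gain φ (my simplification, not the paper's architecture), the tasks are random 1-D quadratics, and the meta-gradient through the unroll is estimated by finite differences instead of backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """A random 1-D quadratic optimizee f(theta) = a * theta**2."""
    return rng.uniform(0.9, 1.1)

def meta_loss(phi, a, theta0=1.0, steps=10):
    """L(phi) = sum_t f(theta_t), unrolling the learned update
    g_t = -phi * grad f(theta_t); phi stands in for the RNN m."""
    theta, total = theta0, 0.0
    for _ in range(steps):
        theta -= phi * 2.0 * a * theta   # g_t uses the gradient of a*theta^2
        total += a * theta ** 2          # w_t = 1 for every step
    return total

phi, eps, lr = 0.05, 1e-4, 5e-3
for _ in range(400):                     # sample a task, unroll, update, repeat
    a = sample_task()
    g = (meta_loss(phi + eps, a) - meta_loss(phi - eps, a)) / (2 * eps)
    phi -= lr * np.clip(g, -10.0, 10.0)  # clipping keeps the toy loop stable
```

After meta-training, φ settles near the step size that makes the contraction factor 1 − 2aφ small across the task family, i.e. the "optimizer" has learned a good learning rate for these quadratics.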

  11. Test optimizer performance
  • Sample new tasks
  • Apply the optimizer for some number of steps and compute the average loss
  • Compare with existing optimizers (Adam, RMSProp)
  [1606.04474]

  12. Computational graph
  Graph used for computing the gradient of the optimizer (with respect to φ)
  [1606.04474]

  13. Simplifying assumptions
  • No 2nd-order derivatives: ∇_φ ∇_θ f = 0
  • RNN weights are shared between target parameters
  • Result is independent of parameter ordering
  • Each parameter has a separate hidden state
  [1606.04474]
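A sketch of this sharing scheme, with a plain tanh cell standing in for the paper's LSTM (my simplification): the cell weights are shared across coordinates, but each coordinate carries its own hidden state, so the output is invariant to parameter ordering.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 4                                  # hidden size of the shared cell
Wh = rng.normal(scale=0.1, size=(H, H))
Wg = rng.normal(scale=0.1, size=(H, 1))
Wo = rng.normal(scale=0.1, size=(1, H))

def coordinatewise_update(grads, hidden):
    """Apply one shared recurrent cell to every parameter independently.

    Wh, Wg, Wo are shared across coordinates; each coordinate keeps its
    own hidden state, so the result does not depend on parameter order."""
    updates, new_hidden = [], []
    for g, h in zip(grads, hidden):
        h = np.tanh(Wh @ h + Wg @ np.array([g]))   # per-coordinate state
        updates.append(float(Wo @ h))
        new_hidden.append(h)
    return np.array(updates), new_hidden

grads = rng.normal(size=3)
hidden = [np.zeros(H) for _ in grads]
updates, hidden = coordinatewise_update(grads, hidden)
```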

  14. Experiments
  Variability is in the initial target parameters and the choice of mini-batches
  [1606.04474]

  15. Experiments
  Separate optimizers for convolutional and fully-connected layers
  [1606.04474]

  16. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks Finn, Abbeel, Levine 1703.03400

  17. Basic idea
  • Start with a class of tasks T_i drawn from a distribution p(T)
  • Train one model θ that can be quickly fine-tuned to new tasks ('few-shot learning')
  • How? Explicitly require that a single training step will significantly improve the loss
  • Meta loss function, optimized over θ:
  min_θ Σ_{T_i ∼ p(T)} L_{T_i}( θ − α ∇_θ L_{T_i}(θ) )
  [1703.03400]

  18. (to avoid overfitting?) [1703.03400]

  19. Comments
  • Can be adapted to any scenario that uses gradient descent (e.g. regression, reinforcement learning)
  • Involves taking a second derivative
  • A first-order approximation still works well
  [1703.03400]
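The full meta-gradient versus the first-order approximation can be made concrete on toy tasks. Here the tasks are 1-D quadratics f_c(θ) = (θ − c)² (my toy stand-in, not the paper's networks), so the second-derivative factor in the chain rule is just the constant 1 − 2α:

```python
import numpy as np

def task_grad(theta, c):
    """Gradient of f_c(theta) = (theta - c)**2."""
    return 2.0 * (theta - c)

def maml_meta_grad(theta, tasks, alpha, first_order=False):
    """Meta-gradient of mean_c f_c(theta - alpha * f_c'(theta))."""
    g = 0.0
    for c in tasks:
        theta_prime = theta - alpha * task_grad(theta, c)  # inner step
        outer = task_grad(theta_prime, c)                  # d f / d theta'
        if not first_order:
            # chain rule through the inner update:
            # d theta' / d theta = 1 - alpha * f''(theta) = 1 - 2*alpha
            outer *= (1.0 - 2.0 * alpha)
        g += outer
    return g / len(tasks)

rng = np.random.default_rng(0)
tasks = rng.uniform(-1, 1, size=50)
theta, alpha, beta = 5.0, 0.1, 0.2
for _ in range(200):
    theta -= beta * maml_meta_grad(theta, tasks, alpha)
# theta ends up near the task mean: from there, one inner
# gradient step reaches a low loss on any sampled task.
```

Setting first_order=True drops the (1 − 2α) factor; on these quadratics it reaches the same point with a slightly different effective step size, which mirrors the comment that the first-order approximation still works well.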

  20. Regression experiment
  Single task = regress a sine with a given underlying amplitude and phase
  Model is an FC network with 2 hidden layers
  Pretrained baseline = train a single model on many tasks simultaneously
  [1703.03400]

  21. Classification experiment Each classification class is a single task [1703.03400]

  22. RL experiment
  Reward = negative squared distance from the goal position. For each task, the goal is placed randomly.
  [1703.03400]

  23. Overcoming catastrophic forgetting in neural networks Kirkpatrick et al. 1612.00796

  24. Basic idea
  • Catastrophic forgetting: when a model is trained on task A followed by task B, it typically forgets A
  • Idea: after training on A, penalize changes to the parameters that are important for A (an elastic penalty rather than a hard freeze)
  L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²
  where θ*_A are the optimal parameters for task A, λ is a hyperparameter, and F_i ≈ ∂²L_A/∂θ_i² is the diagonal of the Fisher information matrix
  [1612.00796]
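The penalty above is cheap to implement: estimate the diagonal Fisher from per-sample gradients of the log-likelihood at the task-A optimum, then add a weighted quadratic term to the task-B loss. A minimal sketch (function names and the toy numbers are mine):

```python
import numpy as np

def fisher_diagonal(sample_grads):
    """Diagonal of the Fisher information, estimated as the mean squared
    per-sample gradient of the log-likelihood at the task-A optimum."""
    return np.mean(np.square(sample_grads), axis=0)

def ewc_loss(loss_b, theta, theta_star_a, fisher_diag, lam):
    """L(theta) = L_B(theta) + (lam/2) * sum_i F_i (theta_i - theta*_A,i)^2"""
    penalty = 0.5 * lam * np.sum(fisher_diag * (theta - theta_star_a) ** 2)
    return loss_b + penalty

# Toy usage: two per-sample gradients at theta*_A, then the penalized loss.
F = fisher_diagonal(np.array([[1.0, 0.0], [0.0, 2.0]]))
total = ewc_loss(1.0, np.array([1.0, 1.0]), np.zeros(2), F, lam=2.0)
```

Parameters with large F_i (curvature of L_A) are held close to θ*_A, while parameters that task A does not care about remain free to move for task B.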

  25. Why Fisher information?
  L(θ) = −log p(θ | D_A, D_B) = −log p(D_B | θ) − log p(θ) − log p(D_A | θ) + log p(D_A, D_B) ∼ L_B(θ) − log p(D_A | θ)
  −log p(D_A | θ) = −Σ_i log p_θ(x_i) ∼ −Σ_x p_A(x) log p_θ(x)
  Now suppose p_{θ*} = p_A. Then
  −Σ_x p_{θ*}(x) log p_{θ*+dθ}(x) = S(p_{θ*}) + ½ dθᵀ F dθ + ⋯
  F_ij = E_{x∼p_θ}[ ∇_{θ_i} log p_θ(x) ∇_{θ_j} log p_θ(x) ]

  26. Why Fisher information?
  L(θ) ∼ L_B(θ) + ½ dθᵀ F dθ,  where dθ = θ − θ*_A

  27. MNIST experiment
