A few meta learning papers (Guy Gur-Ari, Machine Learning Journal Club, September 2017)



SLIDE 1

A few meta learning papers

Machine Learning Journal Club, September 2017 Guy Gur-Ari

SLIDE 2

Meta Learning

  • Mechanisms for faster, better adaptation to new tasks
  • ‘Integrate prior experience with a small amount of new information’
  • Examples: image classifier applied to new classes, game player applied to new games, …
  • Related: single-shot learning, catastrophic forgetting
  • Learning how to learn (instead of designing by hand)

SLIDE 3

Meta Learning

  • Mechanisms for faster, better adaptation to new tasks
  • Learning how to learn (instead of designing by hand)
  • Each task is a single training sample
  • Performance metric: Generalization to new tasks
  • Higher derivatives show up, but first-order approximations sometimes work well

SLIDE 4

Transfer Learning (ad-hoc meta-learning)

SLIDE 5

Learning to learn by gradient descent by gradient descent (Andrychowicz et al.)

1606.04474

SLIDE 6

Basic idea

  • Target parameters: θ
  • Optimizer parameters: ɸ
  • Target (‘optimizee’) loss function: f
  • Optimizer: a recurrent neural network m with parameters ɸ

[1606.04474]

SLIDE 7

Vanilla RNN refresher

h_t = tanh(W_h h_{t−1} + W_x x_t),   y_t = W_y h_t

[Karpathy]


Backpropagation through time
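The refresher above can be made concrete. A minimal numpy sketch of the vanilla RNN step; the dimensions and weight scales are illustrative, not from any paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: hidden state H, input X, output Y
H, X, Y = 4, 3, 2
Wh = rng.normal(scale=0.1, size=(H, H))
Wx = rng.normal(scale=0.1, size=(H, X))
Wy = rng.normal(scale=0.1, size=(Y, H))

def rnn_step(h_prev, x):
    """One vanilla RNN step: h_t = tanh(Wh h_{t-1} + Wx x_t), y_t = Wy h_t."""
    h = np.tanh(Wh @ h_prev + Wx @ x)
    y = Wy @ h
    return h, y

# Unroll over a short sequence; BPTT backpropagates through this chain.
h = np.zeros(H)
for t in range(5):
    h, y = rnn_step(h, rng.normal(size=X))
print(y.shape)  # (2,)
```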

SLIDE 8

Meta loss function

Ideal: L(ɸ) = E_f [ f(θ*(f, ɸ)) ], where θ* are the optimal target parameters for the given optimizer.

In practice: L(ɸ) = E_f [ Σ_t w_t f(θ_t) ] with w_t ≡ 1, where the trajectory is generated by

∇_t = ∇_θ f(θ_t),   [g_t, h_{t+1}] = m(∇_t, h_t, ɸ),   θ_{t+1} = θ_t + g_t

m is the optimizer RNN (a 2-layer LSTM); h_t is its hidden state.

[1606.04474]

SLIDE 9

Meta loss function

  • The recurrent network can use trajectory information (the gradients ∇_t = ∇_θ f(θ_t)), similar to momentum
  • Including historical losses (w_t ≡ 1) also helps with backpropagation through time

[1606.04474]

SLIDE 10

Training protocol

  • Sample a random task f
  • Train the optimizer on f by gradient descent (100 steps, unrolled 20 at a time)
  • Repeat

[1606.04474]
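The protocol can be sketched in miniature. This is a hedged toy, not the paper's setup: the "optimizer" is reduced to a single learned log-learning-rate ɸ (rather than an LSTM), the tasks are random 1-D quadratics, and the meta-gradient is taken by finite differences instead of backpropagation through the unrolled graph:

```python
import numpy as np

rng = np.random.default_rng(0)

def unrolled_meta_loss(phi, a, theta0, steps=20):
    """Run the learned update g_t = -exp(phi) * grad f(theta_t) on the task
    f(x) = 0.5*a*x^2 and sum the losses along the trajectory (w_t = 1)."""
    theta, total = theta0, 0.0
    for _ in range(steps):
        grad = a * theta                      # grad f(theta)
        theta = theta - np.exp(phi) * grad
        total += 0.5 * a * theta ** 2
    return total

def meta_grad(phi, tasks, eps=1e-4):
    # The paper backpropagates through the unrolled graph (truncated BPTT);
    # a central finite difference keeps this sketch dependency-free.
    L = lambda p: sum(unrolled_meta_loss(p, a, t0) for a, t0 in tasks)
    return (L(phi + eps) - L(phi - eps)) / (2 * eps)

tasks = [(rng.uniform(0.5, 1.0), rng.normal()) for _ in range(8)]
phi = -4.0                                    # start with a tiny learning rate
before = sum(unrolled_meta_loss(phi, a, t0) for a, t0 in tasks)
for _ in range(150):
    phi -= 0.05 * np.clip(meta_grad(phi, tasks), -2.0, 2.0)
after = sum(unrolled_meta_loss(phi, a, t0) for a, t0 in tasks)
print(after < before)
```

Meta-training drives the learned learning rate up from its tiny initial value until the unrolled trajectories converge quickly, reducing the summed loss.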

SLIDE 11

Test optimizer performance

  • Sample new tasks
  • Apply optimizer for some steps, compute average loss
  • Compare with existing optimizers (ADAM, RMSProp)

[1606.04474]

SLIDE 12

Computational graph

Graph used for computing the gradient of the optimizer (with respect to ɸ)

[1606.04474]

SLIDE 13

Simplifying assumptions

  • No 2nd-order derivatives: ∇_ɸ ∇_θ f = 0
  • RNN weights shared between target parameters
  • Result is independent of parameter ordering
  • Each parameter has a separate hidden state

[1606.04474]
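The sharing assumptions can be illustrated with a toy coordinatewise cell (scalar weights chosen arbitrarily, not the paper's LSTM): the same weights act on every target parameter, each parameter keeps its own hidden state, so permuting the parameters permutes the updates identically:

```python
import numpy as np

# Shared (scalar) cell weights, applied to every coordinate; values are illustrative.
w_h, w_g, w_o = 0.5, 1.0, -0.1

def step(grads, hidden):
    """Apply the same cell independently to each coordinate (elementwise)."""
    hidden = np.tanh(w_h * hidden + w_g * grads)
    return w_o * hidden, hidden          # (updates, new hidden states)

grads = np.array([0.3, -1.2, 0.7])       # one gradient per target parameter
h0 = np.zeros(3)                         # separate hidden state per parameter

upd, h1 = step(grads, h0)

# Permuting the parameters permutes the result identically:
perm = np.array([2, 0, 1])
upd_p, _ = step(grads[perm], h0[perm])
print(np.allclose(upd_p, upd[perm]))  # True
```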

SLIDE 14

Experiments

[1606.04474]

Variability is in initial target parameters and choice of mini-batches

SLIDE 15

Experiments

[1606.04474]

Separate optimizers for convolutional and fully-connected layers

SLIDE 16

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks (Finn, Abbeel, Levine)

1703.03400

SLIDE 17

Basic idea

  • Start with a class of tasks with distribution p(T)
  • Train one model θ that can be quickly fine-tuned to new tasks (‘few-shot learning’)
  • How? Explicitly require that a single training step will significantly improve the loss
  • Meta loss function, optimized over θ:

min_θ Σ_{T_i ∼ p(T)} L_{T_i}( θ − α ∇_θ L_{T_i}(θ) )

[1703.03400]
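A minimal sketch of this objective, assuming toy 1-D quadratic tasks f_i(θ) = ½(θ − c_i)² rather than the paper's experiments: one inner gradient step adapts θ toward c_i, and differentiating through that step brings a (1 − α)² factor into the meta-gradient (the second-derivative term; the first-order approximation would drop one factor of (1 − α)):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.4                      # inner-loop step size (toy value)
c = rng.normal(size=8)           # per-task optima for f_i(t) = 0.5*(t - c_i)^2

def inner_adapt(theta, ci):
    """One inner gradient step on task i: theta - alpha * grad f_i(theta)."""
    return theta - alpha * (theta - ci)

def meta_loss(theta):
    """MAML objective: task losses evaluated *after* one adaptation step."""
    return sum(0.5 * (inner_adapt(theta, ci) - ci) ** 2 for ci in c)

theta = 5.0
for _ in range(100):
    # Exact meta-gradient: differentiating through the inner step gives the
    # factor (1 - alpha)^2; here the second derivative of f_i is just 1.
    grad = (1 - alpha) ** 2 * float(np.sum(theta - c))
    theta -= 0.05 * grad

# After meta-training, a single inner step on a task lands near its optimum.
adapted = inner_adapt(theta, c[0])
print(abs(adapted - c[0]) < abs(theta - c[0]))
```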

SLIDE 18

(to avoid overfitting?)

[1703.03400]

SLIDE 19

Comments

  • Can be adapted to any scenario that uses gradient descent (e.g. regression, reinforcement learning)
  • Involves taking a second derivative
  • A first-order approximation still works well

[1703.03400]

SLIDE 20

Regression experiment

Single task = compute a sine with a given underlying amplitude and phase
Pretrained = a single model trained on many tasks simultaneously

Model is an FC network with 2 hidden layers

[1703.03400]

SLIDE 21

Classification experiment

Each classification class is a single task

[1703.03400]

SLIDE 22

RL experiment

Reward = negative squared distance from the goal position. For each task, the goal is placed randomly.

[1703.03400]

SLIDE 23

Overcoming catastrophic forgetting in neural networks (Kirkpatrick et al.)

1612.00796

SLIDE 24

Basic idea

  • Catastrophic forgetting: when a model is trained on task A followed by task B, it typically forgets A
  • Idea: after training on A, effectively freeze the parameters that are important for A by penalizing changes to them:

L(θ) = L_B(θ) + Σ_i (λ/2) F_i (θ_i − θ*_{A,i})²

where λ is a hyperparameter, F_i is the diagonal of the Fisher information matrix, θ*_A are the optimal parameters for task A, and F_i ≈ ∂²L_A / ∂θ_i².

[1612.00796]
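A hedged numpy sketch of the penalty (toy 2-D quadratics; FA below is a stand-in for the diagonal Fisher, not computed from data): the high-Fisher coordinate is held near task A's optimum while the low-Fisher coordinate is free to move toward task B's optimum.

```python
import numpy as np

# Toy setup: task A cares strongly about theta[0], weakly about theta[1].
FA = np.array([10.0, 0.1])            # stand-in for the diagonal Fisher of task A
theta_A = np.array([1.0, 1.0])        # optimum after training on A
theta_B_opt = np.array([-1.0, -1.0])  # task B pulls both coordinates away

lam, lr = 1.0, 0.05
theta = theta_A.copy()
for _ in range(500):
    grad_B = theta - theta_B_opt            # gradient of L_B = 0.5*|theta - theta_B_opt|^2
    penalty = lam * FA * (theta - theta_A)  # gradient of sum_i (lam/2) F_i (theta_i - theta*_A,i)^2
    theta -= lr * (grad_B + penalty)

# The coordinate important for A barely moves; the unimportant one moves
# almost all the way to B's optimum.
print(abs(theta[0] - theta_A[0]) < abs(theta[1] - theta_A[1]))  # True
```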

SLIDE 25

Why Fisher information?

L(θ) = −log p(θ|D_A, D_B)
     = −log p(D_B|θ) − log p(θ) − log p(D_A|θ) + log p(D_A, D_B)
     ∼ L_B(θ) − log p(D_A|θ)

−log p(D_A|θ) = −Σ_i log p_θ(x_i) ∼ −Σ_x p_A(x) log p_θ(x)

Now suppose p_{θ*} = p_A. Then

−Σ_x p_{θ*}(x) log p_{θ*+dθ}(x) = S(p_{θ*}) + ½ dθᵀ F dθ + ···

with F_ij = E_{x∼p_θ}[ ∇_{θ_i} log p_θ(x) ∇_{θ_j} log p_θ(x) ]
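The final expansion on this slide can be spelled out (a standard step, not written out in the deck). Taylor-expanding the cross-entropy around θ* to second order in dθ:

```latex
-\sum_x p_{\theta^*}(x)\,\log p_{\theta^*+d\theta}(x)
  = S(p_{\theta^*})
  - \sum_x p_{\theta^*}(x)\,\partial_{\theta_i}\log p_{\theta}(x)\Big|_{\theta^*}\, d\theta_i
  - \frac{1}{2}\sum_x p_{\theta^*}(x)\,\partial_{\theta_i}\partial_{\theta_j}\log p_{\theta}(x)\Big|_{\theta^*}\, d\theta_i\, d\theta_j
  + \cdots
```

The first-order term vanishes because Σ_x p_{θ*}(x) ∂ log p_θ(x) = Σ_x ∂ p_θ(x) = ∂(1) = 0 at θ = θ*, and the second-order coefficient is the Fisher matrix via the standard identity −E_{p_{θ*}}[∂_i ∂_j log p_θ] = F_ij (equivalent at θ* to the outer-product form above), leaving S(p_{θ*}) + ½ dθᵀ F dθ.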

SLIDE 26

Why Fisher information?

L(θ) ∼ L_B(θ) + ½ dθᵀ F dθ,   dθ = θ − θ*_A

SLIDE 27

MNIST experiment

SLIDE 28