Model-Agnostic Meta-Learning: Universality, Inductive Bias, and Weak Supervision (PowerPoint PPT Presentation)



SLIDE 1

Chelsea Finn

Model-Agnostic Meta-Learning

Universality, Inductive Bias, and Weak Supervision

SLIDE 2

Why Learn to Learn?

  • effectively reuse data on other tasks
  • replace manual engineering of architecture, hyperparameters, etc.
  • learn to quickly adapt to unexpected scenarios (inevitable failures, long tail)
  • learn how to learn with weak supervision

Chelsea Finn, UC Berkeley

SLIDE 3

Problem Domains:

  • few-shot classification & generation
  • hyperparameter optimization
  • architecture search
  • faster reinforcement learning
  • domain generalization
  • learning structure

Approaches:

  • recurrent networks
  • learning optimizers or update rules
  • learning initial parameters & architecture

  • acquiring metric spaces
  • Bayesian models

What is the meta-learning problem statement?

SLIDE 4

The Meta-Learning Problem

Supervised Learning:
  Inputs: x   Outputs: y   Data: {(x_i, y_i)}   y = f(x)

Meta-Supervised Learning:
  Inputs: D_train, x_test   Outputs: y_test   Data: {D_i}   y_test = f(D_train, x_test)

Why is this view useful? Reduces the problem to the design & optimization of f.
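To make the two problem statements concrete, here is a minimal sketch; the 1-nearest-neighbour rule is only a hypothetical stand-in for a learned f that consumes an entire dataset:

```python
import numpy as np

def f_supervised(theta, x):
    # ordinary supervised learning: a fixed parametric model maps x -> y
    return theta @ x

def f_meta(D_train, x_test):
    # meta-supervised learning: f consumes a whole training set plus a test
    # input; a 1-nearest-neighbour rule stands in for a learned model here
    X, y = D_train
    i = np.argmin(np.linalg.norm(X - x_test, axis=1))
    return y[i]

D_train = (np.array([[0.0, 0.0], [1.0, 1.0]]), np.array([0, 1]))
pred = f_meta(D_train, np.array([0.9, 0.9]))  # query near the class-1 example
```

The point of the signature f(D_train, x_test) is exactly the "design & optimization of f" framing above: the whole learning procedure becomes one function to optimize.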

SLIDE 5

Design of f ?

Recurrent network

(LSTM, NTM, Conv)

Santoro et al. ’16, Duan et al. ’17, Wang et al. ’17, Munkhdalai & Yu ’17, Mishra et al. ’17, …

SLIDE 6

Design of f ?

Learned optimizer

(often uses recurrence)

Schmidhuber et al. ’87, Bengio et al. ’90, Hochreiter et al. ’01, Li & Malik ’16, Andrychowicz et al. ’16, Ha et al. ’17, Ravi & Larochelle ’17, …

SLIDE 7

Design of f ?

Impose Structure

Bergstra et al. ’11, Snoek et al. ’12, Koch ’15, Maclaurin et al. ’15, Vinyals et al. ’16, Zoph & Le ’17, Snell et al. ’17, …

These approaches are general and quite powerful. But what happens when the task is very different, or when there is very little meta-training data? Can we build a general meta-learning algorithm that interpolates between learning from scratch and few-shot learning?

SLIDE 8

Model-Agnostic Meta-Learning (MAML)

Finn, Abbeel, Levine ICML ‘17

Key idea: train over many tasks to learn an initial parameter vector θ that transfers.

Fine-tuning [test-time]: adapt the pretrained parameters θ to the test task with gradient descent:
  φ = θ − α ∇_θ L(θ, D^train)

Model-Agnostic Meta-Learning: optimize the initialization so that fine-tuning succeeds:
  min_θ Σ_i L(θ − α ∇_θ L(θ, D_i^train), D_i^test)

In-distribution task: k-shot learning. Base case: learning from scratch. Related but out-of-distribution task: somewhere in between.
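The key idea above can be sketched end-to-end for a toy linear-regression task family. Everything here (the task distribution, step sizes, loop lengths) is an illustrative assumption, not the paper's experimental setup; for a linear model the outer gradient through the inner step is analytic via the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_grad_hess(theta, X, y):
    # squared-error loss, its gradient, and Hessian for a linear model
    r = X @ theta - y
    return np.mean(r ** 2), 2 * X.T @ r / len(y), 2 * X.T @ X / len(y)

def sample_task(n=10, d=3):
    # assumed task family: linear regression with random true weights
    w = rng.normal(size=d)
    X = rng.normal(size=(2 * n, d))
    y = X @ w + 0.01 * rng.normal(size=2 * n)
    return (X[:n], y[:n]), (X[n:], y[n:])   # (D_train, D_test) split

alpha, beta, d = 0.1, 0.05, 3               # inner/outer step sizes (assumed)
theta = np.zeros(d)

for _ in range(500):                        # meta-training over many tasks
    (Xtr, ytr), (Xte, yte) = sample_task()
    _, gtr, Htr = loss_grad_hess(theta, Xtr, ytr)
    theta_prime = theta - alpha * gtr       # fine-tuning: one inner step
    _, gte, _ = loss_grad_hess(theta_prime, Xte, yte)
    outer_grad = (np.eye(d) - alpha * Htr) @ gte  # chain rule through the step
    theta -= beta * outer_grad              # MAML outer update
```

After meta-training, one inner gradient step from θ should reduce the test loss on a fresh task, which is exactly the objective being optimized.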

SLIDE 9

Design of f ?

MAML

(learned initialization)

Finn et al. ’17, Grant et al. ’17, Reed et al. ’17, Li et al. ’17, …


SLIDE 10

Theoretical & Empirical Questions

  1. What happens when MAML faces out-of-distribution tasks?
  2. How expressive are deep representations + gradient descent?
  3. Can we interpret MAML in a probabilistic framework?
  4. Can we use MAML to learn from weak supervision?
SLIDE 11

How well can methods generalize to similar, but extrapolated tasks?

The world is non-stationary.

[Plot: performance vs. task variability on Omniglot image classification, comparing MAML against TCML and MetaNetworks]

Finn & Levine ’17 (under review)

SLIDE 12

[Plot: error vs. task variability on sinusoid curve regression, comparing MAML against TCML]

Finn & Levine ’17 (under review)

Takeaway: strategies learned with MAML consistently generalize better to out-of-distribution tasks.

SLIDE 13

Theoretical & Empirical Questions

  1. What happens when MAML faces out-of-distribution tasks?
  2. How expressive are deep representations + gradient descent?
  3. Can we interpret MAML in a probabilistic framework?
  4. Can we use MAML to learn from weak supervision?
SLIDE 14

Universal Function Approximation Theorem

A neural network with one hidden layer of finite width can approximate any continuous function. [Hornik et al. ’89, Cybenko ’89, Funahashi ’89]

How can we define a notion of universality / expressive power for meta-learning? The “universal function approximator” suggests an analogous notion: a “universal learning procedure approximator.”

With sufficient depth, both the recurrent network and the learned optimizer are universal learning procedure approximators. Are we losing expressive power when using MAML?

Finn & Levine ’17 (under review)
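The classical theorem referenced above can be written formally (standard one-hidden-layer statement, e.g. for a sigmoidal activation σ):

```latex
% Universal approximation: for any continuous g on a compact set K and any
% eps > 0, some finite-width network is uniformly eps-close to g on K.
\forall g \in C(K),\ K \subset \mathbb{R}^{d}\ \text{compact},\ \forall \varepsilon > 0,\;
\exists N,\, \{c_i, w_i, b_i\}_{i=1}^{N} :\quad
\sup_{x \in K} \Big|\, g(x) - \sum_{i=1}^{N} c_i\, \sigma\!\big(w_i^{\top} x + b_i\big) \Big| < \varepsilon
```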

SLIDE 15

How expressive is MAML?

Result: For a sufficiently deep network f, the MAML-adapted model is a universal learning procedure approximator. [It can approximate any function of the training dataset and the test input.]

Assumptions:
  • cross-entropy or mean-squared error loss
  • datapoints x_i in the training dataset are unique

Why is this interesting? MAML has both the benefits of inductive bias and expressive power.

Finn & Levine ’17 (under review)
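Spelled out, the learning procedure whose universality is claimed is the post-adaptation model (one inner gradient step shown here):

```latex
f_{\mathrm{MAML}}\big(\mathcal{D}^{\mathrm{train}}, x^{\star}\big)
  = f\Big(\theta - \alpha \nabla_{\theta}
      \sum_{(x,\,y) \in \mathcal{D}^{\mathrm{train}}} \mathcal{L}\big(f(\theta;\, x),\, y\big)
      ;\; x^{\star}\Big)
```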

SLIDE 16

Theoretical & Empirical Questions

  1. What happens when MAML faces out-of-distribution tasks?
  2. How expressive are deep representations + gradient descent?
  3. Can we interpret MAML in a probabilistic framework?
  4. Can we use MAML to learn from weak supervision?
SLIDE 17

Can we interpret MAML in a probabilistic framework?

Bayesian concept learning [Tenenbaum ’99, Fei-Fei et al. ’03, Lawrence & Platt ’04, …]
  + formulates few-shot learning as a probabilistic inference problem
  + can effectively generalize from limited evidence
  - hard to scale to complex models

meta-learning ≈ learning a prior

SLIDE 18

Can we interpret MAML in a probabilistic framework?

Bayesian meta-learning approach (empirical Bayes): meta-parameters are shared across tasks; adaptation computes a MAP estimate of the task-specific parameters.

How to compute the MAP estimate? Gradient descent with early stopping = MAP inference under a Gaussian prior with mean at the initial parameters [Santos ’96] (exact in the linear case, approximate in the nonlinear case).

MAML approximates hierarchical Bayesian inference. [Grant et al. ’17]

with Erin Grant
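The Santos-style equivalence is easy to see numerically in the linear case. The sketch below (arbitrary synthetic data, assumed step size) shows early-stopped gradient descent interpolating between the initialization, which plays the role of the Gaussian prior's mean, and the unregularized least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ rng.normal(size=4) + 0.1 * rng.normal(size=50)

theta0 = rng.normal(size=4)                   # "prior mean": initial parameters
ols = np.linalg.lstsq(X, y, rcond=None)[0]    # unregularized solution

def gd(theta0, steps, lr=0.01):
    # plain gradient descent on mean-squared error, stopped after `steps`
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * 2 * X.T @ (X @ theta - y) / len(y)
    return theta

# Few steps stay near the prior mean; many steps approach the least-squares
# fit, mirroring a MAP estimate whose prior weakens as training continues.
few, many = gd(theta0, 5), gd(theta0, 5000)
```

The number of steps thus acts like the strength of the prior: stopping early means trusting the initialization more.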

SLIDE 19

Theoretical & Empirical Questions

  1. What happens when MAML faces out-of-distribution tasks?
  2. How expressive are deep representations + gradient descent?
  3. Can we interpret MAML in a probabilistic framework?
  4. Can we use MAML to learn from weak supervision?
SLIDE 20

Learning to Learn from Weak Supervision

During meta-training: access full supervision for each task.
During meta-testing: only use weakly-supervised datapoints.

Meta-Supervised Learning: Inputs: D_train (weakly supervised), x_test   Outputs: y_test   Data: fully supervised meta-training datasets

With MAML, the key insight: the inner loss can be different from the outer loss.
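The key insight can be sketched in a toy linear setup; the sign-collapsed labels below are a hypothetical stand-in for weak supervision. The inner, adaptation-time loss touches only the weak labels, while the outer, meta-training loss scores the adapted model with the full labels:

```python
import numpy as np

rng = np.random.default_rng(2)

def inner_loss_grad(theta, X, y_weak):
    # inner (adaptation) loss: defined only on weakly supervised labels
    r = X @ theta - y_weak
    return np.mean(r ** 2), 2 * X.T @ r / len(y_weak)

def outer_loss(theta, X, y_full):
    # outer (meta) loss: scored with full supervision
    return np.mean((X @ theta - y_full) ** 2)

X = rng.normal(size=(20, 3))
w = rng.normal(size=3)
y_full = X @ w                  # fully supervised targets
y_weak = np.sign(y_full)        # weak labels: only the sign survives

theta, alpha = np.zeros(3), 0.1
_, g = inner_loss_grad(theta, X, y_weak)
theta_adapted = theta - alpha * g             # adapt using the weak loss
score = outer_loss(theta_adapted, X, y_full)  # evaluate with the full loss
```

Meta-training would then tune the initialization (and any inner-loss parameters) so that adapting on weak labels minimizes the fully supervised outer loss.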

SLIDE 21

Weak Supervision Results

  • Learning from positive examples


Grant, Finn, Peterson, Abbott, Levine, Darrell, Griffiths, NIPS ‘17 CIAI workshop

  • One-shot Imitation from human video


(in preparation, with Yu, Abbeel, Levine)

SLIDE 22

Typical Objective of Few-Shot Learning (image recognition): given 1 example of each of 5 classes, classify new examples.

Human Concept Learning: given 1 positive example, classify new examples.

Beyond matching how humans learn, this setting is also more interesting.

Grant et al. ’17 (NIPS CIAI workshop)

SLIDE 23

Concept Acquisition through Meta-Learning (CAML)

Human Concept Learning: given 1 positive example, classify new examples.

  • inner loop adapts on only positive examples
  • outer loop trains on both positives & negatives

Why does this make sense? MAML approximates hierarchical Bayesian inference.

Grant et al. ’17 (NIPS CIAI workshop)

SLIDE 24

Few-Shot Image Classification from Positive Examples

MiniImagenet dataset

Grant et al. ’17 (NIPS CIAI workshop)

SLIDE 25

One-Shot Visual Imitation Learning

Goal: given one visual demonstration of a new task, learn a policy. Collecting visual demonstrations is expensive.

Behavior cloning / supervised learning learns from raw pixels, but requires many demonstrations. [Zhang et al. ’17, Rahmatizadeh et al. ’17]

There is no direct supervision signal in a video of a human. Through meta-learning: reuse data from other tasks/objects/environments.

Yu*, Finn*, et al. (in prep.)

SLIDE 26

One-Shot Visual Imitation from Humans

[Diagram: during meta-training, each task pairs a training demo (video of a human) with a validation demo (robot demo), linked by an imitation loss; at meta-test time, the new task is specified by a demo video of a human]

Yu*, Finn*, et al. (in prep.)

SLIDE 27

On-going work: one-shot imitation from human video

[Video: input human demo and the resulting policy]

Yu*, Finn*, et al. (in prep.)

SLIDE 28

Takeaways

  • Meta-learning can be seen as learning a function
  • Embedding gradient descent provides beneficial inductive bias while maintaining universality
  • MAML is equivalent to empirical Bayes
  • Can learn how to learn from “weak” supervision

[Examples: from 1 positive example; from a video of a human]

SLIDE 29

Collaborators

Sergey Levine, Pieter Abbeel, Tianhe Yu, Trevor Darrell, Tom Griffiths, Erin Grant, Tianhao Zhang, Josh Abbott, Josh Peterson

Questions?

Blog post, code, and papers: eecs.berkeley.edu/~cbfinn
