The Peculiar Optimization and Regularization Challenges in Multi-Task Learning and Meta-Learning


SLIDE 1

Chelsea Finn

The Peculiar Optimization and Regularization Challenges in Multi-Task Learning and Meta-Learning

Stanford

SLIDE 2

Cezanne Braque By Braque or Cezanne?

training data test datapoint

SLIDE 3

How did you accomplish this?

Through previous experience.

SLIDE 4

How might you get a machine to accomplish this task?

  • Fine-tuning from ImageNet features
  • SIFT features, HOG features + SVM
  • Modeling image formation & geometry
  • ???

Can we explicitly learn priors from previous experience that lead to efficient downstream learning?

Domain adaptation from other painters

Fewer human priors, more data-driven priors: greater success.

Can we learn to learn?

SLIDE 5

Outline

  • 1. Brief overview of meta-learning
  • 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away)
  • 3. Can we scale meta-learning to broad task distributions?

SLIDE 6

How does meta-learning work? An example.

Given 1 example of each of 5 classes (training data): classify new examples (test set).

SLIDE 7

How does meta-learning work? An example.

[Figure: meta-training over many training classes; meta-testing on a held-out task Ttest]

Given 1 example of each of 5 classes (training data): classify new examples (test set).

SLIDE 8

How does meta-learning work?

One approach: parameterize the learner by a neural network.

(Hochreiter et al. ’01, Santoro et al. ’16, many others)

y^ts = f(𝒟^tr, x^ts; θ)
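As an illustrative sketch of this black-box approach (a hypothetical tiny architecture, not the model from the cited papers): condition a network on a permutation-invariant embedding of the training set 𝒟^tr together with the query x^ts.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_support(D_tr):
    """Permutation-invariant embedding of the support set: mean of [x; y] features."""
    return np.mean([np.concatenate([x, y]) for x, y in D_tr], axis=0)

def f_blackbox(D_tr, x_ts, theta):
    """y^ts = f(D^tr, x^ts; theta): one linear layer over [support embedding; query]."""
    z = np.concatenate([embed_support(D_tr), x_ts])
    return theta @ z  # logits for the query point

# Toy usage: a 2-way task with 3-dim inputs and one-hot labels.
D_tr = [(rng.normal(size=3), np.array([1.0, 0.0])),
        (rng.normal(size=3), np.array([0.0, 1.0]))]
theta = rng.normal(size=(2, 8))  # 2 logits; input dim = 5 (embedding) + 3 (query)
logits = f_blackbox(D_tr, rng.normal(size=3), theta)
```

Here θ is meta-learned across tasks; the network must read the support set to predict, since labels vary per task.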

SLIDE 9

How does meta-learning work?

Another approach: embed optimization inside the learning process.

(Maclaurin et al. ’15, Finn et al. ’17, many others)

y^ts = f(𝒟^tr, x^ts; θ), where f adapts θ via an inner gradient step ∇θℒ on 𝒟^tr
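A sketch of this optimization-embedded approach in the style of MAML, using a toy linear-regression inner loss (the loss, step size, and names below are illustrative assumptions, not the cited implementations):

```python
import numpy as np

def inner_loss(theta, D_tr):
    """Toy per-task objective: mean squared error of a linear model on the support set."""
    X, y = D_tr
    return np.mean((X @ theta - y) ** 2)

def inner_grad(theta, D_tr):
    """Gradient of the toy loss with respect to theta."""
    X, y = D_tr
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def adapt(theta, D_tr, alpha=0.1):
    """One inner gradient step: phi = theta - alpha * grad_theta L(theta, D_tr)."""
    return theta - alpha * inner_grad(theta, D_tr)

def predict(D_tr, x_ts, theta):
    """y^ts = f(D^tr, x^ts; theta), where f adapts theta before predicting."""
    phi = adapt(theta, D_tr)
    return x_ts @ phi
```

Meta-training would then optimize θ so that the post-adaptation predictions are accurate on each task's test set.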

SLIDE 10

Can we learn a representation under which RL is fast and efficient?

[Video frames: policy after MAML training; after 1 gradient step (forward reward); after 1 gradient step (backward reward)]

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML ‘17

SLIDE 11

Can we learn a representation under which imitation is fast and efficient?

[Video: input demo (via teleoperation) and resulting policy, executed in real time; trained on a subset of training objects, evaluated on held-out test objects]

Finn*, Yu*, Zhang, Abbeel, Levine. One-Shot Visual Imitation Learning via Meta-Learning. CoRL ‘17

SLIDE 12

The Bayesian perspective

(Grant et al. ’18, Gordon et al. ’18, many others)

meta-learning ↔ learning priors from data: a prior p(ϕ | θ) over task parameters ϕ

SLIDE 13

Outline

  • 1. Brief overview of meta-learning
  • 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away)
  • 3. Can we scale meta-learning to broad task distributions?
SLIDE 14

How we construct tasks for meta-learning.

[Figure: per-task random label assignments across tasks T1…T3]

Randomly assign class labels to image classes for each task. Algorithms must use the training data 𝒟^tr to infer the label ordering before predicting on x^ts.

—> Tasks are mutually exclusive.
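The task construction described above can be sketched as follows (function and variable names are hypothetical):

```python
import random

def make_task(class_pool, n_way=5, k_shot=1, seed=None):
    """Build one few-shot task: sample n_way classes and randomly shuffle their labels.
    class_pool: dict mapping class name -> list of examples."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(class_pool), n_way)
    labels = list(range(n_way))
    rng.shuffle(labels)  # random label order per task => tasks are mutually exclusive
    support = [(x, lbl) for c, lbl in zip(classes, labels)
               for x in class_pool[c][:k_shot]]
    return support, dict(zip(classes, labels))
```

Because the label assignment differs per task, no single input-to-label function can solve every task; the learner is forced to consult the support set.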

SLIDE 15

What if label order is consistent?

The network can simply learn to classify inputs, irrespective of 𝒟^tr.

[Figure: consistent label assignments across tasks T1…T3; 𝒟^tr, x^ts]

Tasks are non-mutually exclusive: a single function can solve all tasks.

SLIDE 16

The network can simply learn to classify inputs, irrespective of 𝒟^tr, bypassing the inner gradient step ∇θℒ. [Figure: memorization pathway]

SLIDE 17

What if label order is consistent?

[Figure: consistent label assignments across tasks T1…T3; at meta-test time, task Ttest has new image classes with its own training data and test set]

For new image classes: can’t make predictions without 𝒟^tr.

SLIDE 18

Is this a problem?

  • No: for image classification, we can just shuffle labels*
  • No, if we see the same image classes as training (& don’t need to adapt at meta-test time)
  • But, yes, if we want to be able to adapt with data for new tasks.
SLIDE 19

Another example

If you tell the robot the task goal, the robot can ignore the trials.

[Figure: meta-training tasks “close drawer”, “hammer”, “stack” (up to T50); meta-test task Ttest: “close box”]

T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL ‘19

SLIDE 20

Another example

Model can memorize the canonical orientations of the training objects.

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19

SLIDE 21

Can we do something about it?

SLIDE 22

If tasks are mutually exclusive (e.g. due to label shuffling, which hides information): a single function cannot solve all tasks.

If tasks are non-mutually exclusive: a single function can solve all tasks, so there are multiple solutions to the meta-learning problem y^ts = f_θ(𝒟^tr_i, x^ts). One solution: memorize canonical pose info in θ & ignore 𝒟^tr_i. Another solution: carry no info about canonical pose in θ, and acquire it from 𝒟^tr_i. An entire spectrum of solutions exists, based on how information flows.

Suggests a potential approach: control information flow.

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19

SLIDE 23

If tasks are non-mutually exclusive, a single function can solve all tasks, so there are multiple solutions to the meta-learning problem y^ts = f_θ(𝒟^tr_i, x^ts). One solution: memorize canonical pose info in θ & ignore 𝒟^tr_i. Another solution: carry no info about canonical pose in θ, and acquire it from 𝒟^tr_i. An entire spectrum of solutions exists, based on how information flows.

Meta-regularization: minimize the meta-training loss plus the information in θ:

ℒ(θ, 𝒟^meta-train) + β D_KL(q(θ; θ_μ, θ_σ) ∥ p(θ))

This places precedence on using information from 𝒟^tr over storing info in θ. It can be combined with your favorite meta-learning algorithm. One option: max I(ŷ^ts; 𝒟^tr | x^ts).

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19
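A minimal sketch of this objective, assuming a diagonal Gaussian q(θ; θ_μ, θ_σ) and a standard normal prior p(θ) = 𝒩(0, I); the training loss and β are placeholders, not the paper’s exact implementation:

```python
import numpy as np

def kl_gaussian_std_normal(mu, sigma):
    """D_KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))

def meta_regularized_loss(train_loss, mu, sigma, beta):
    """Meta-training loss plus beta * KL(q(theta; mu, sigma) || p(theta))."""
    return train_loss + beta * kl_gaussian_std_normal(mu, sigma)
```

The KL term penalizes the number of bits the weight distribution stores beyond the prior, which is how the objective limits memorization in θ.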

SLIDE 24

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19

(and it’s not just as simple as standard regularization)

[Results: the pose prediction task, and “non-mutually-exclusive” Omniglot, i.e. Omniglot without label shuffling]

TAML: Jamal & Qi. Task-Agnostic Meta-Learning for Few-Shot Learning. CVPR ‘19

SLIDE 25

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19

Does meta-regularization lead to better generalization?

Let P(θ) be an arbitrary distribution over θ that doesn’t depend on the meta-training data (e.g. P(θ) = 𝒩(θ; 0, I)).

For MAML, with probability at least 1 − δ, for all θ_μ, θ_σ:

generalization error ≤ error on the meta-training set + meta-regularization term

With a Taylor expansion of the RHS and a particular value of β, we recover the MR-MAML objective.

Proof: draws heavily on Amit & Meir ‘18
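The exact bound is not recoverable from this transcript; a standard PAC-Bayes form in the style of Amit & Meir ’18 (with n meta-training tasks, Q = q(θ; θ_μ, θ_σ), and prior P = p(θ)) reads roughly:

```latex
er(Q) \;\le\; \underbrace{\widehat{er}(Q)}_{\text{meta-training error}}
\;+\; \underbrace{\sqrt{\frac{D_{KL}(Q \,\|\, P) + \log\frac{n}{\delta}}{2(n-1)}}}_{\text{meta-regularization}}
```

A Taylor expansion of the square-root term, at a particular β, yields a loss of the form ℒ + β D_KL(Q ∥ P), matching the MR-MAML objective above.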

SLIDE 26
Intermediate Takeaways

  • 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away)

standard overfitting: memorize the training datapoints (x_i, y_i) in your training dataset
meta overfitting: memorize the training functions f_i corresponding to the tasks in your meta-training dataset

standard regularization: regularize the hypothesis class (though not always for DNNs)
meta regularization: regularizes the description length of meta-parameters; controls information flow

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19

SLIDE 27

Outline

  • 1. Brief overview of meta-learning
  • 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away)
  • 3. Can we scale meta-learning to broad task distributions?

SLIDE 28

Has meta-learning accomplished our goal of making adaptation fast? Sort of… Can adapt to:

  • new objects
  • new goal velocities
  • new object categories

Can we adapt to entirely new tasks or datasets?

SLIDE 29

Can we adapt to entirely new tasks or datasets?

—> Need a broad distribution of tasks for meta-training: meta-test task distribution = meta-train task distribution.

Can we look to RL benchmarks?

Fan et al. SURREAL: Open-Source Reinforcement Learning Framework and Robot Manipulation Benchmark. CoRL 2018. Brockman et al. OpenAI Gym. 2016. Bellemare et al. The Arcade Learning Environment. 2013.

SLIDE 30

T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL ‘19

Meta-World Benchmark

Our desiderata:
  • 50+ qualitatively distinct tasks
  • All tasks individually solvable (to allow us to focus on the multi-task / meta-RL component)
  • Shaped reward functions & success metrics
  • Unified state & action space and environment (to facilitate transfer)

SLIDE 31

Results: Meta-learning algorithms seem to struggle… …even on the 45 meta-training tasks! Multi-task RL algorithms also struggle…

T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL ‘19

SLIDE 32

Why the poor results?

  • Exploration challenge? All tasks are individually solvable.
  • Data scarcity? All methods are given a budget with plenty of samples.
  • Limited model capacity? All methods have plenty of capacity, and training models independently performs the best.

Our conclusion: it must be an optimization challenge.

SLIDE 33

Prior literature on multi-task learning

Architectural solutions:
  • Multi-head architectures
  • Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert ‘16
  • Multi-Task Attention Network. Liu, Johns, Davison ‘18
  • Deep Relation Networks. Long, Wang ‘15
  • Sluice Networks. Ruder, Bingel, Augenstein, Sogaard ‘17
  • FiLM: Visual Reasoning with a General Conditioning Layer. Perez et al. ‘17

Task weighting solutions:
  • GradNorm. Chen et al. ‘18
  • Multi-Task Learning as Multi-Objective Optimization. Sener & Koltun ‘18

SLIDE 34

Hypothesis 1: Gradients from different tasks often conflict. If so, we would see a negative inner product between gradients.

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

Hypothesis 2: When they do conflict, they cause more damage than expected, i.e. due to high curvature and differences in gradient magnitude.

[Figure: loss landscape ℒ(θ); a step along ∇θℒ from θ_t can land at a worse θ_{t+1}]

SLIDE 35

Idea: when taking a gradient step, try to avoid making other tasks worse.

Algorithm (“PCGrad”, i.e. project conflicting gradients):
  • If two gradients conflict: project each onto the normal plane of the other.
  • Else: leave them alone.

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19
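The projection step can be sketched in a few lines. This is a simplified, deterministic version; details of the paper’s implementation, such as random task ordering, are omitted:

```python
import numpy as np

def pcgrad(grads):
    """Project conflicting gradients (PCGrad sketch): for each pair of per-task
    gradients with a negative inner product, remove from one the component
    along the other, then average the projected gradients."""
    projected = [g.astype(float).copy() for g in grads]
    for i, gi in enumerate(projected):
        for j, gj in enumerate(grads):
            if i == j:
                continue
            dot = gi @ gj
            if dot < 0.0:  # conflicting: project gi onto the normal plane of gj
                gi -= (dot / (gj @ gj)) * gj
    return np.mean(projected, axis=0)
```

After projection, each modified gradient is orthogonal to (rather than opposed to) the gradients it conflicted with, so the averaged update no longer fights itself.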

SLIDE 36

Multi-Task RL on Meta-World:

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

SLIDE 37

Multi-Task CIFAR-100 & Multi-Task NYUv2: PCGrad also helps multi-task supervised learning, and is complementary to multi-task architectures.

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

SLIDE 38

Why does it work?

(Part 1)

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

SLIDE 39

Why does it work?

(Part 2)

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

  • 1. conflicting gradients
  • 2. large positive curvature
  • 3. difference in gradient magnitude

“tragic triad”

Recap of the hypotheses: (1) gradients from different tasks often conflict (negative inner product); (2) when they conflict, they cause more damage than expected, due to high curvature and differences in gradient magnitude.

Is PCGrad provably better under these three conditions? Are these three conditions actually why we see improvements on large-scale problems?

SLIDE 40

Why does it work?

(Part 2)

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

  • 1. conflicting gradients
  • 2. large positive curvature
  • 3. difference in gradient magnitude

“tragic triad”

Is PCGrad provably better under these three conditions? Short answer: yes (for two tasks), if the conflict, curvature, and gradient magnitude difference are large enough.

Are these three conditions actually why we see improvements on large-scale problems?

SLIDE 41

Scaling to broad task distributions is hard, can’t be taken for granted

  • 3. Can we scale meta-learning to broad task distributions?

Lack of good benchmarks —> Meta-World, with a broad, dense task distribution

Optimization challenges —> three conditions seem to plague MTL & MTRL; scaling is primarily hindered by optimization challenges in MTL

A solution: project conflicting gradients (PCGrad)

Remaining questions:

Does this solution translate back to meta-learning? Is this problem unique to multi-task learning?

SLIDE 42
Takeaways

  • 2. A peculiar yet ubiquitous problem in meta-learning (and how we might regularize it away): meta overfitting means memorizing the training functions f_i corresponding to tasks in your meta-training dataset; meta regularization regularizes the description length of meta-parameters and controls information flow.

  • 3. Can we scale meta-learning to broad task distributions? Lack of good benchmarks —> Meta-World with a broad, dense task distribution. Optimization challenges —> three conditions seem to plague MTL & MTRL; a solution: project conflicting gradients (PCGrad).

SLIDE 43

Yin, Tucker, Yuan, Levine, Finn. Meta-Learning without Memorization. ICLR ‘19
T Yu, D Quillen, Z He, R Julian, K Hausman, C Finn, S Levine. Meta-World. CoRL ‘19

Want to Learn More? CS330: Deep Multi-Task & Meta-Learning. Lecture videos online!

Working on Meta-RL? Try out the Meta-World benchmark.

T Yu, S Kumar, A Gupta, S Levine, K Hausman, C Finn. Gradient Surgery for Multi-Task Learning. ‘19

Collaborators

IRIS - RAIL retreat in Sonoma, CA