

SLIDE 1

Imitation Learning

Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Spring 2019, CMU 10-403

SLIDE 2

Reinforcement learning

[Diagram: the agent-environment loop. At each step the agent in state S_t emits action A_t; the environment returns reward R_{t+1} and next state S_{t+1}.]

Agent and environment interact at discrete time steps $t = 0, 1, 2, 3, \dots$:

  • the agent observes state at step $t$: $S_t \in \mathcal{S}$
  • produces action at step $t$: $A_t \in \mathcal{A}(S_t)$
  • gets resulting reward: $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$
  • and resulting next state: $S_{t+1} \in \mathcal{S}$

yielding a trajectory $S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}, R_{t+2}, S_{t+2}, A_{t+2}, R_{t+3}, \dots$
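As a concrete illustration of this protocol, here is the interaction loop as a short Python sketch; the gym-style `env` interface (`reset()`, `step()`) and the `policy` callable are assumptions of the sketch, not part of the slides.

```python
def run_episode(env, policy):
    """Agent-environment loop: observe S_t, act A_t, receive R_{t+1} and S_{t+1}.

    Assumes a gym-style interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done, info).
    """
    state = env.reset()                            # S_0
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)                     # A_t, chosen from A(S_t)
        state, reward, done, _ = env.step(action)  # S_{t+1}, R_{t+1}
        total_reward += reward
    return total_reward
```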

SLIDE 3

Limitations of Learning by Interaction

  • The agent must have the chance to try (and fail) MANY times.
  • This is impossible when safety is a concern: we cannot afford to fail.
  • It is also generally impractical in real life, where each interaction takes time (in contrast to simulation).

Learning from Demonstration for Autonomous Navigation in Complex Unstructured Terrain, Silver et al. 2010

Crusher robot

SLIDE 4

Imitation Learning (a.k.a. Learning from Demonstrations)

Kinesthetic imitation

  • The teacher takes over the end-effectors of the agent.
  • Demonstrated actions can be imitated directly (cloned).
  • A.k.a. behavior cloning (this lecture!)

Visual imitation

The actions of the teacher need to be inferred from visual sensory input and mapped to the end-effectors of the agent. Two challenges: 1) visual understanding, 2) action mapping, especially when the agent and the teacher do not have the same action space. We will come back to this in a later lecture.

SLIDE 5

Imitating Controllers


  • Experts do not need to be humans.
  • The machinery we develop in this lecture can be used for imitating expert policies found through (easier) optimization in a constrained, smaller part of the state space.
  • Imitation then means distilling the knowledge of expert constrained policies into a general policy that does well in all scenarios in which the simpler policies do well.

SLIDE 6

Notation

Reinforcement learning notation vs. optimal control notation:

  actions:        a_t                      u_t
  states:         s_t                      x_t
  rewards/costs:  r_t                      c(x_t, u_t)
  dynamics:       p(s_{t+1} | s_t, a_t)    p(x_{t+1} | x_t, u_t)
  observations:   o_t

Diagram from Sergey Levine

SLIDE 7

Imitation learning vs. sequence labelling

Imitation learning training data: expert trajectories of observation-action pairs over horizon T,

$o^1_1, u^1_1, o^1_2, u^1_2, o^1_3, u^1_3, \dots$
$o^2_1, u^2_1, o^2_2, u^2_2, o^2_3, u^2_3, \dots$
$o^3_1, u^3_1, o^3_2, u^3_2, o^3_3, u^3_3, \dots$

Sequence labelling

$y_1, y_2, y_3, \dots$

y: which product was purchased, if any


Action interdependence in imitation learning: the actions we predict influence the data we will see next, and thus our future predictions. Label interdependence is present in any structured prediction task, e.g., text generation: the words we predict influence the words we need to predict further down the sentence.

SLIDE 10

Imitation Learning for Driving


Driving policy: a mapping from observations to steering wheel angles

End to End Learning for Self-Driving Cars, Bojarski et al. 2016

SLIDE 11

Imitation Learning as Supervised Learning

training data → supervised learning

  • Assume actions in the expert trajectories are i.i.d.
  • Train a function approximator to map observations to actions at each time step of the trajectory.

Driving policy: a mapping from observations to steering wheel angles

End to End Learning for Self-Driving Cars, Bojarski et al. 2016
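A minimal sketch of this supervised setup, assuming a PyTorch-style regression from camera images to steering angles; the architecture, loss, and hyperparameters are illustrative stand-ins, not the DAVE-2 network from the paper.

```python
import torch
import torch.nn as nn

def behavior_cloning(obs, actions, epochs=20, lr=1e-4):
    """Fit a policy to (observation, expert action) pairs treated as i.i.d.

    obs: (N, 3, H, W) float tensor of camera images;
    actions: (N, 1) float tensor of steering angles.
    """
    policy = nn.Sequential(                   # illustrative architecture
        nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
        nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(100), nn.ReLU(),        # infers the flattened input size
        nn.Linear(100, 1),                    # regress a single steering angle
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        loss = nn.functional.mse_loss(policy(obs), actions)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

The MSE loss here is exactly the choice that breaks under multimodal expert actions, one of the failure modes discussed next.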

SLIDE 12

What can go wrong?

  • Compounding errors
Fix: data augmentation

  • Stochastic expert actions
Fix: stochastic latent variable models, action discretization, Gaussian mixture networks

  • Non-Markovian observations
Fix: observation concatenation or recurrent models

End to End Learning for Self-Driving Cars, Bojarski et al. 2016


SLIDE 14

Errors independent in time

Suppose the policy errs at each time step with probability ε, independently. Then E[total errors] ≲ εT. This corresponds to a setting where, at each time step t, the agent wakes up in a state drawn from the data distribution of the expert trajectories and executes an action.

SLIDE 15

Compounding Errors

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al. 2011

If instead the agent errs at time t with probability ε and then lands in states the expert never visited, errors compound: E[total errors] ≲ ε(T + (T−1) + (T−2) + ⋯ + 1) ∝ εT². This corresponds to the realistic setting where, at each time step t, the agent wakes up in the state that resulted from executing the action its learned policy suggested at the previous time step.
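The quadratic bound comes from a short counting argument: once the learner first errs at step t (probability ≤ ε), it can be off the expert's distribution for the rest of the episode, so step t contributes up to T − t + 1 expected errors:

```latex
\mathbb{E}[\text{total errors}]
  \;\lesssim\; \varepsilon \sum_{t=1}^{T} (T - t + 1)
  \;=\; \varepsilon \, \frac{T(T+1)}{2}
  \;\in\; O(\varepsilon T^2)
```

compared with O(εT) in the independent-errors setting of the previous slide.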

SLIDE 16

Data Distribution Mismatch!

[Figure: an expert trajectory vs. the learned policy's trajectory; once the learned policy drifts, there is no data on how to recover.]

$p_{\pi^*}(o_t) \neq p_{\pi_\theta}(o_t)$

SLIDE 17

Data Distribution Mismatch!

            supervised learning    supervised learning + control (naive)
  train:    (x, y) ~ D             s ~ d_{π*}
  test:     (x, y) ~ D             s ~ d_π

Supervised learning succeeds when the training and test data distributions match; that is a fundamental assumption.

SLIDE 18

Solution: demonstration augmentation. Change the training data distribution by adding examples to the expert demonstration trajectories that cover the states/observations where the agent will land when trying out its own policy. How?

  • Synthetically, in simulation or with clever hardware
  • Interactively, with experts in the loop (DAgger)


SLIDE 20

Demonstration Augmentation: ALVINN 1989

“In addition, the network must not solely be shown examples of accurate driving, but also how to recover (i.e. return to the road center) once a mistake has been made. Partial initial training on a variety of simulated road images should help eliminate these difficulties and facilitate better performance.”

“ALVINN: An Autonomous Land Vehicle in a Neural Network”, Pomerleau 1989

  • Uses a graphics simulator to generate road images with corresponding steering-angle ground truth
  • Online adaptation to the human driver's steering-angle control
  • 3 fully connected layers, very low resolution camera input

Road follower

SLIDE 21

Demonstration Augmentation: NVIDIA 2016

“DAVE-2 was inspired by the pioneering work of Pomerleau [6] who in 1989 built the Autonomous Land Vehicle in a Neural Network (ALVINN) system. Training with data from only the human driver is not sufficient. The network must learn how to recover from mistakes. …”

End to End Learning for Self-Driving Cars , Bojarski et al. 2016 et al. ‘16, NVIDIA

Additional left and right cameras provide automatic ground-truth labels for recovering from mistakes.
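A sketch of how such side-camera frames can be turned into recovery training data: off-center frames are relabeled with a corrective steering offset. The constant `CORRECTION` and the function name are illustrative assumptions; the paper derives its adjustment from the camera shift rather than using a fixed constant.

```python
CORRECTION = 0.2  # illustrative steering offset, in the same units as the labels

def augment_with_side_cameras(samples):
    """samples: iterable of (center_img, left_img, right_img, steering)."""
    out = []
    for center, left, right, steering in samples:
        out.append((center, steering))
        out.append((left, steering + CORRECTION))   # shifted view: label steers back toward center
        out.append((right, steering - CORRECTION))  # opposite shift: opposite correction
    return out
```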


SLIDE 23

Data Augmentation (3): Trails 2015

A Machine Learning Approach to Visual Perception of Forest Trails for Mobile Robots, Giusti et al. 2015


SLIDE 25

Dataset AGGregation (DAgger): bring the learner's and the expert's trajectory distributions closer by asking human experts to label additional data points that result from applying the current policy.

DAGGER (in simulation)

[Diagram: execute current policy and query expert → new data (steering from expert) → aggregate dataset with all previous data → supervised learning → new policy.]

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al. 2011

SLIDE 26

DAGGER (in simulation)

  • 1. Train $\pi_\theta(u_t|o_t)$ from human data $D_{\pi^*} = \{o_1, u_1, \dots, o_N, u_N\}$
  • 2. Run $\pi_\theta(u_t|o_t)$ to get dataset $D_\pi = \{o_1, \dots, o_M\}$
  • 3. Ask a human to label $D_\pi$ with actions $u_t$
  • 4. Aggregate: $D_{\pi^*} \leftarrow D_{\pi^*} \cup D_\pi$
  • 5. GOTO step 1.

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al. 2011

Problems:

  • execute an unsafe/partially trained policy
  • repeatedly query the expert
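A minimal sketch of this loop, assuming a gym-style `env`, a queryable `expert`, and a supervised `fit` routine; all three interfaces are stand-ins, not from the paper.

```python
def rollout(env, actor, horizon):
    """Collect (observation, actor's action) pairs with a gym-style env."""
    pairs, obs = [], env.reset()
    for _ in range(horizon):
        act = actor(obs)
        pairs.append((obs, act))
        obs, _, done, _ = env.step(act)
        if done:
            break
    return pairs

def dagger(env, expert, fit, n_iters=10, horizon=1000):
    data = rollout(env, expert, horizon)       # step 1: initial human demonstrations
    policy = fit(data)                         # supervised learning of pi_theta(u|o)
    for _ in range(n_iters):
        visited = [obs for obs, _ in rollout(env, policy, horizon)]  # step 2
        data += [(obs, expert(obs)) for obs in visited]  # steps 3-4: label, aggregate
        policy = fit(data)                     # step 5: retrain and repeat
    return policy
```

Note that `rollout(env, policy, ...)` is exactly where the two problems above bite: the partially trained policy is executed on the real system, and `expert(obs)` is queried on every visited observation.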
SLIDE 27

Application on drones: given RGB from the drone camera, predict steering angles

Learning monocular reactive UAV control in cluttered natural environments, Ross et al. 2013

DAGGER (on a real platform)


SLIDE 29

Caveats:

  • 1. It is hard for the expert to provide the right magnitude for a turn without feedback on their own actions. Solution: provide the expert with visual feedback.

  • 2. The expert's reaction time to the drone's behavior is large, which causes imperfect actions to be commanded. Solution: play back the flights offline in slow motion and record the expert's actions.

  • 3. Executing an imperfect policy causes accidents and crashes into obstacles. Solution: safety measures, which again make the data distribution match between train and test imperfect, but good enough.

DAGGER (on a real platform)

Learning monocular reactive UAV control in cluttered natural environments, Ross et al. 2013

SLIDE 30

What can go wrong? (recap)

  • Compounding errors. Fix: data augmentation (just covered).
  • Stochastic expert actions. Fix: stochastic latent variable models, action discretization, Gaussian mixture networks.
  • Non-Markovian observations. Fix: observation concatenation or recurrent models. Up next.

SLIDE 31

Non-Markovian observations

  • Markovian: behavior depends only on the current observation, $\pi_\theta(u_t|o_t)$
  • Non-Markovian: behavior depends on all past observations, $\pi_\theta(u_t|o_1, \dots, o_t)$

SLIDE 32

Fix 1: concatenate observations. Feed the policy a window of recent frames. Caveat: this handles a fixed, not variable, number of frames.
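A sketch of the fixed window of the k most recent frames that this fix amounts to; the class and method names are illustrative.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Concatenate the k most recent observations into one policy input."""

    def __init__(self, k):
        self.k = k
        self.frames = deque(maxlen=k)   # oldest frame is evicted automatically

    def reset(self, obs):
        for _ in range(self.k):         # pad with the first frame at episode start
            self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=0)

    def step(self, obs):
        self.frames.append(obs)
        return np.concatenate(list(self.frames), axis=0)
```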

SLIDE 33

Fix 2: use recurrent networks

[Diagram: the network unrolled over time, with the RNN state carried across steps and shared weights. Typically, LSTM cells work better here.]

Diagram from Sergey Levine

SLIDE 34

Recurrent Neural Networks (RNNs)

  • RNNs tie the weights at each time step.
  • They condition the network on all previous inputs.
  • In principle, any interdependencies across time steps can be modeled.
  • In practice, there are limitations from SGD training, capacity, initialization, etc.

[Diagram: unrolled RNN with inputs x_{t−1}, x_t, x_{t+1}, hidden states h_{t−1}, h_t, h_{t+1} tied by shared weights W, and outputs y_{t−1}, y_t, y_{t+1}.]

Diagram from Richard Socher

SLIDE 35

Recurrent Neural Network (single hidden layer)

  • Given a list of vectors: $x_1, \dots, x_{t-1}, x_t, x_{t+1}, \dots, x_T$
  • At a single time step:

$h_t = \sigma\left( W^{(hh)} h_{t-1} + W^{(hx)} x_t \right)$

$\hat{y}_t = \mathrm{softmax}\left( W^{(S)} h_t \right)$ (in the case of discrete labels)
Diagram from Richard Socher
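The two formulas in a few lines of NumPy; `tanh` stands in for the generic nonlinearity σ, and the weight shapes follow the equations above.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_hx, W_S):
    """One step: h_t = sigma(W_hh h_{t-1} + W_hx x_t), y_hat = softmax(W_S h_t)."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)   # sigma chosen here as tanh
    scores = W_S @ h_t
    exp_s = np.exp(scores - scores.max())       # numerically stable softmax
    return h_t, exp_s / exp_s.sum()
```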

SLIDE 36

Recurrent Neural Networks

For sequence labelling problems, the actions of the labelling policy are the output labels $y_t$, e.g., part-of-speech tags. For sequence generation, the actions are the next inputs, $y_t = x_{t+1}$, e.g., the next word in answer generation:

$\hat{P}(x_{t+1} = v_j \mid x_t, \dots, x_1) = \hat{y}_{t,j}$


SLIDE 38

What can go wrong? (recap)

  • Compounding errors. Fix: data augmentation (covered).
  • Non-Markovian observations. Fix: observation concatenation or recurrent models (covered).
  • Stochastic expert actions. Fix: stochastic latent variable models, action discretization, Gaussian mixture networks. Up next.

SLIDE 39

Regression fails under multimodality

The answer that minimizes the mean squared error is the average of the expert actions, which is not necessarily a valid prediction itself. [Figure: ground-truth steering angles vs. predicted steering angles.]

SLIDE 40

Stochastic expert actions: Fixes

  • Discretize the action space and use a classifier (e.g., softmax output and cross-entropy loss).
  • Use a Gaussian mixture model as the output layer: the mixture component weights, means, and variances are parametrized by the output of a neural net; minimize the GMM loss (e.g., handwriting generation, Graves 2013). A sketch follows below.
  • Stochastic neural networks (later lecture).

Diagram from Sergey Levine
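A sketch of the Gaussian-mixture output layer for a 1-D action such as a steering angle: the network emits K mixture logits, means, and log-standard-deviations, and training minimizes the mixture negative log-likelihood. Shapes and names are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def gmm_nll(alpha_logits, mu, log_sigma, u):
    """Negative log-likelihood of expert actions under a Gaussian mixture.

    alpha_logits, mu, log_sigma: (batch, K) heads of the policy network.
    u: (batch, 1) expert actions (e.g., steering angles).
    """
    log_alpha = F.log_softmax(alpha_logits, dim=-1)   # mixture weights
    z = (u - mu) / log_sigma.exp()                    # standardized residuals
    log_normal = -0.5 * z**2 - log_sigma - 0.5 * math.log(2 * math.pi)
    return -torch.logsumexp(log_alpha + log_normal, dim=-1).mean()
```

At test time one samples a mixture component from the alphas and then an action from that component, which preserves multimodality where plain MSE regression would average the modes.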

SLIDE 43

Structured prediction

Structured prediction: a learner makes predictions over a set of interdependent output variables and observes a joint loss.

NER (named entity recognition): x = Yesterday I traveled to Lille, y = - PER - - LOC

Part-of-speech tagging: x = the monster ate the sandwich, y = Dt Nn Vb Dt Nn

Other examples: tracking, captioning (“A blue monster is eating a cookie”), machine translation.

A few images from Hal Daumé III

SLIDE 44

Recurrent Neural Networks

The regular training procedure of RNNs treats the true labels as actions during the forward pass. Hence, the learning agent follows trajectories generated by the reference (expert) policy rather than the learned policy. In other words, it learns:

$\hat{\theta}_{\text{sup}} = \arg\min_\theta \; \mathbb{E}_{h \sim d_{\pi^*}}\left[\ell_\theta(h)\right]$

However, our true goal is to learn a policy that minimizes the error under its own induced state distribution:

$\hat{\theta} = \arg\min_\theta \; \mathbb{E}_{h \sim d_\theta}\left[\ell_\theta(h)\right]$

Imitation Learning with Recurrent Neural Networks, Nguyen 2016

SLIDE 45

Mocap generation

[Architecture diagram: encoder (encode1) → stacked LSTMs (lstm1, lstm2, lstm3) → decoder (decode1), unrolled over time, with noise injected at the inputs.]

SLIDE 46

DAGGER for sequence labelling/generation

Imitation Learning with Recurrent Neural Networks, Nguyen 2016; Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Bengio (Samy) et al. 2015

Q: should we be feeding the ground-truth x, y or the predicted x, y during training? Feeding the ground truth is known as teacher forcing.
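A sketch of the scheduled-sampling idea from Bengio et al.: during training, feed the ground-truth input with probability p and the model's own previous prediction otherwise, annealing p from 1 (pure teacher forcing) toward 0 over training; the `model` interface here is an assumption.

```python
import random

def scheduled_sampling_unroll(model, h, inputs, p_truth):
    """Unroll a recurrent model, mixing teacher forcing and self-generated inputs.

    model(x, h) -> (prediction, next_hidden); inputs: ground-truth sequence.
    p_truth = 1.0 recovers teacher forcing; annealing it toward 0.0 exposes
    the model to its own predictions, in the spirit of DAgger.
    """
    x, predictions = inputs[0], []
    for t in range(1, len(inputs)):
        y_hat, h = model(x, h)
        predictions.append(y_hat)
        x = inputs[t] if random.random() < p_truth else y_hat  # coin flip per step
    return predictions
```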

SLIDE 47

Mocap generation

  • Right: no augmentation; only ground-truth states are used as input.
  • Left: augmentation by adding Gaussian noise to the input state (not to the prediction target).

When noise is added to the input, the per-frame prediction error is larger, but the long-term prediction error is lower.

Learning human dynamics with recurrent neural networks, Fragkiadaki et al. 2015
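A sketch of that augmentation: corrupt only the input states with Gaussian noise while leaving the prediction targets clean, so the model learns to steer back toward the data manifold from slightly perturbed states. The noise scale is an illustrative assumption.

```python
import numpy as np

def corrupt_inputs(states, sigma=0.05):
    """Add Gaussian noise to the input states only; targets stay clean.

    Training pairs become (states + noise, targets), which teaches the model
    to recover from the off-manifold states it reaches during long rollouts.
    """
    return states + np.random.normal(0.0, sigma, size=states.shape)
```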

SLIDE 48

Case study: learning from virtual demonstrations

Learning real manipulation tasks from virtual demonstrations using LSTM, Rahmatizadeh et al. 2016

  • Two tasks considered: pick-and-place, and move to a desired pose.
  • State representation x: the poses (rotations, translations) of all the objects in the scene and the pose of the end effector.
  • Output y: the desired next pose of the end effector.
  • Supervision: expert trajectories in the simulator.
  • Demonstration augmentation: generate multiple trajectories by subsampling the expert ones in time, and by translating the end effector in space.

SLIDE 50
  • Multimodality of actions -> GMM loss! Predict mixture weights (alphas) over a Gaussian mixture model at the output, together with the means and variances of the mixture components. Minimize a GMM loss.

Learning real manipulation tasks from virtual demonstrations using LSTM, Rahmatizadeh et al. 2016

Case study: learning from virtual demonstrations

SLIDE 51

Learning real manipulation tasks from virtual demonstrations using LSTM, Rahmatizadeh et al. 2016

Case study: learning from virtual demonstrations

SLIDE 52

https://www.youtube.com/watch?v=9vYlIG2ozaM

Learning real manipulation tasks from virtual demonstrations using LSTM, Rahmatizadeh et al. 2016

Case study: learning from virtual demonstrations