SLIDE 1

Lecture 7: Imitation Learning

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018

With slides from Katerina Fragkiadaki and Pieter Abbeel

SLIDE 2

Table of Contents

1. Behavioral Cloning
2. Inverse Reinforcement Learning
3. Apprenticeship Learning
4. Max Entropy Inverse RL

SLIDE 3

Recall: Reinforcement Learning Involves

Optimization
Delayed consequences
Exploration
Generalization

SLIDE 4

Deep Reinforcement Learning

Hessel, Matteo, et al. "Rainbow: Combining Improvements in Deep Reinforcement Learning."

SLIDE 5

We want RL Algorithms that Perform

Optimization
Delayed consequences
Exploration
Generalization
And do it all statistically and computationally efficiently

SLIDE 6

Generalization and Efficiency

We will discuss efficient exploration in more depth later in the class
But there exist hardness results showing that learning in a generic MDP can require a large number of samples to learn a good policy
This number is generally infeasible
Alternate idea: use structure and additional knowledge to help constrain and speed up reinforcement learning
Today: Imitation learning
Later:
  Policy search (can encode domain knowledge in the form of the policy class used)
  Strategic exploration
  Incorporating human help (in the form of teaching, reward specification, action specification, ...)

SLIDE 7

Class Structure

Last time: CNNs and Deep Reinforcement Learning
This time: Imitation Learning
Next time: Policy Search

SLIDE 8

Consider Montezuma’s revenge

Bellemare et al., "Unifying Count-Based Exploration and Intrinsic Motivation"
Vs: https://www.youtube.com/watch?v=JR6wmLaYuu4

SLIDE 9

So Far in this Course

Reinforcement Learning: learning policies guided by (often sparse) rewards (e.g. win the game or not)
Good: simple, cheap form of supervision
Bad: high sample complexity
Where is it successful? In simulation, where data is cheap and parallelization is easy
Not when:
  Execution of actions is slow
  It is very expensive or not tolerable to fail
  We want to be safe

SLIDE 10

Reward Shaping

Rewards that are dense in time closely guide the agent
How can we supply these rewards?
  Manually design them: often brittle
  Implicitly specify them through demonstrations

Learning from Demonstration for Autonomous Navigation in Complex Unstructured Terrain, Silver et al. 2010

SLIDE 11

Examples

Simulated highway driving
  Abbeel and Ng, ICML 2004
  Syed and Schapire, NIPS 2007
  Majumdar et al., RSS 2017
Aerial imagery-based navigation
  Ratliff, Bagnell, and Zinkevich, ICML 2006
Parking lot navigation
  Abbeel, Dolgov, Ng, and Thrun, IROS 2008

SLIDE 12

Examples

Human path planning
  Mombaur, Truong, and Laumond, AURO 2009
Human goal inference
  Baker, Saxe, and Tenenbaum, Cognition 2009
Quadruped locomotion
  Ratliff, Bradley, Bagnell, and Chestnutt, NIPS 2007
  Kolter, Abbeel, and Ng, NIPS 2008

SLIDE 13

Learning from Demonstrations

Expert provides a set of demonstration trajectories: sequences of states and actions
Imitation learning is useful when it is easier for the expert to demonstrate the desired behavior than to:
  come up with a reward that would generate such behavior
  code up the desired policy directly

SLIDE 14

Problem Setup

Input:

State space, action space
Transition model $P(s' \mid s, a)$
No reward function $R$
Set of one or more teacher's demonstrations $(s_0, a_0, s_1, a_1, \ldots)$ (actions drawn from the teacher's policy $\pi^*$)

Behavioral Cloning:

Can we directly learn the teacher’s policy using supervised learning?

Inverse RL:

Can we recover R?

Apprenticeship learning via Inverse RL:

Can we use R to generate a good policy?

SLIDE 15

Table of Contents

1. Behavioral Cloning
2. Inverse Reinforcement Learning
3. Apprenticeship Learning
4. Max Entropy Inverse RL

SLIDE 16

Behavioral Cloning

Formulate problem as a standard machine learning problem:

Fix a policy class (e.g. neural network, decision tree, etc.)
Estimate a policy from training examples $(s_0, a_0), (s_1, a_1), (s_2, a_2), \ldots$
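To make the supervised-learning view concrete, here is a minimal behavioral cloning sketch in Python/PyTorch, assuming fixed-length state feature vectors and discrete actions; the placeholder data, network size, and hyperparameters are illustrative choices, not from the lecture.

```python
import torch
import torch.nn as nn

# Placeholder demonstration data: states (N, state_dim) and expert action labels (N,).
# In practice these come from the recorded (s, a) pairs above.
state_dim, n_actions = 8, 4
states = torch.randn(1000, state_dim)
actions = torch.randint(0, n_actions, (1000,))

# Policy class: a small neural network mapping states to action logits.
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Plain supervised learning: maximize the likelihood of the expert's actions.
for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(policy(states), actions)
    loss.backward()
    optimizer.step()
```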

Two notable success stories:

Pomerleau, NIPS 1989: ALVINN
Sammut et al., ICML 1992: Learning to fly in a flight simulator

SLIDE 17

ALVINN

SLIDE 18

Problem: Compounding Errors

Independent-in-time errors: error at time $t$ with probability $\epsilon$
$\mathbb{E}[\text{Total errors}] \leq \epsilon T$

SLIDE 19

Problem: Compounding Errors

Error at time $t$ with probability $\epsilon$
$\mathbb{E}[\text{Total errors}] \leq \epsilon (T + (T-1) + (T-2) + \ldots + 1) \propto \epsilon T^2$
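A quick sanity check on the quadratic bound: once the cloned policy errs at time $t$, it can drift off the expert's state distribution and keep erring for the remaining steps, so the worst case sums to

$\mathbb{E}[\text{Total errors}] \leq \epsilon \sum_{t=1}^{T} (T - t + 1) = \epsilon \, \frac{T(T+1)}{2} = O(\epsilon T^2)$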

A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning, Ross et al. 2011

SLIDE 20

Problem: Compounding Errors

Data distribution mismatch!
In supervised learning, $(x, y) \sim \mathcal{D}$ during both train and test
In MDPs:
  Train: $s_t \sim D_{\pi^*}$
  Test: $s_t \sim D_{\pi_\theta}$

SLIDE 21

DAGGER: Dataset Aggregation

Idea: get more labels of the right action along the path taken by the policy computed by behavioral cloning
Obtains a stationary deterministic policy with good performance under its induced state distribution
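A minimal sketch of the DAgger loop (Ross et al. 2011), assuming the expert can be queried online; `expert_policy`, `train_policy`, and `rollout` are hypothetical callables standing in for whatever expert, supervised learner, and environment are actually used.

```python
def dagger(expert_policy, train_policy, rollout, n_iters=10):
    """DAgger sketch: iteratively relabel the learner's own states with expert actions.

    expert_policy(s) -> expert action for state s
    train_policy(D)  -> policy fit by supervised learning on dataset D of (s, a) pairs
    rollout(policy)  -> list of states visited when executing the policy
    """
    dataset = []
    policy = None
    for i in range(n_iters):
        # Roll out the current policy (the expert on the first iteration).
        states = rollout(expert_policy if policy is None else policy)
        # Ask the expert to label every visited state with the correct action.
        dataset += [(s, expert_policy(s)) for s in states]
        # Retrain on the aggregated dataset.
        policy = train_policy(dataset)
    return policy
```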

SLIDE 22

Table of Contents

1. Behavioral Cloning
2. Inverse Reinforcement Learning
3. Apprenticeship Learning
4. Max Entropy Inverse RL

SLIDE 23

Feature Based Reward Function

Given: state space, action space, transition model $P(s' \mid s, a)$
No reward function $R$
Set of one or more teacher's demonstrations $(s_0, a_0, s_1, a_1, \ldots)$ (actions drawn from the teacher's policy $\pi$)
Goal: infer the reward function $R$
With no assumptions on the optimality of the teacher's policy, what can be inferred about $R$?
Now assume that the teacher's policy is optimal. What can be inferred about $R$?

SLIDE 24

Linear Feature Reward Inverse RL

Recall linear value function approximation
Similarly, here consider the case where the reward is linear over features:

$R(s) = w^T x(s)$ where $w \in \mathbb{R}^n$, $x : S \to \mathbb{R}^n$

Goal: identify the weight vector $w$ given a set of demonstrations
The resulting value function for a policy $\pi$ can be expressed as

$V^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right] \quad (1)$

SLIDE 25

Linear Feature Reward Inverse RL

Recall linear value function approximation
Similarly, here consider the case where the reward is linear over features:

$R(s) = w^T x(s)$ where $w \in \mathbb{R}^n$, $x : S \to \mathbb{R}^n$

Goal: identify the weight vector $w$ given a set of demonstrations
The resulting value function for a policy $\pi$ can be expressed as

$V^\pi = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi\right] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t w^T x(s_t) \mid \pi\right] \quad (2)$
$\phantom{V^\pi} = w^T \, \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t x(s_t) \mid \pi\right] \quad (3)$
$\phantom{V^\pi} = w^T \mu(\pi) \quad (4)$

where $\mu(\pi)(s)$ is defined as the discounted weighted frequency of state $s$ under policy $\pi$.
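Since $\mu(\pi)$ is just a discounted expectation of feature vectors, it can be estimated from rollouts; a small sketch, where `rollout` (returning one state sequence under the policy) and the per-state `features` array are assumed inputs rather than anything defined in the lecture.

```python
import numpy as np

def estimate_mu(policy, rollout, features, gamma=0.99, n_rollouts=100):
    """Monte Carlo estimate of mu(pi) = E[ sum_t gamma^t x(s_t) | pi ]."""
    mu = np.zeros(features.shape[1])
    for _ in range(n_rollouts):
        for t, s in enumerate(rollout(policy)):
            mu += (gamma ** t) * features[s]
    return mu / n_rollouts
```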

SLIDE 26

Table of Contents

1. Behavioral Cloning
2. Inverse Reinforcement Learning
3. Apprenticeship Learning
4. Max Entropy Inverse RL

SLIDE 27

Linear Feature Reward Inverse RL

Recall linear value function approximation
Similarly, here consider the case where the reward is linear over features:

$R(s) = w^T x(s)$ where $w \in \mathbb{R}^n$, $x : S \to \mathbb{R}^n$

Goal: identify the weight vector $w$ given a set of demonstrations
The resulting value function for a policy $\pi$ can be expressed as $V^\pi = w^T \mu(\pi)$   (5)
where $\mu(\pi)(s)$ is defined as the discounted weighted frequency of state $s$ under policy $\pi$.
Note that

$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*\right] \geq \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi\right] \quad \forall \pi$

Therefore, if the expert's demonstrations are from the optimal policy, to identify $w$ it is sufficient to find $w^*$ such that

$w^{*T} \mu(\pi^*) \geq w^{*T} \mu(\pi) \quad \forall \pi \neq \pi^* \quad (6)$

SLIDE 28

Feature Matching

Want to find a reward function such that the expert policy outperforms other policies
For a policy $\pi$ to be guaranteed to perform as well as the expert policy $\pi^*$, it suffices to have a policy whose discounted summed feature expectations match the expert policy's (Abbeel and Ng, 2004).
More precisely, if

$\|\mu(\pi) - \mu(\pi^*)\|_1 \leq \epsilon \quad (7)$

then for all $w$ with $\|w\|_\infty \leq 1$:

$|w^T \mu(\pi) - w^T \mu(\pi^*)| \leq \epsilon$
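The bound follows from Hölder's inequality applied to the difference in values:

$|w^T \mu(\pi) - w^T \mu(\pi^*)| = |w^T (\mu(\pi) - \mu(\pi^*))| \leq \|w\|_\infty \, \|\mu(\pi) - \mu(\pi^*)\|_1 \leq \epsilon$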

SLIDE 29

Apprenticeship Learning

This observation leads to the following algorithm for learning a policy that is as good as the expert policy
Assumption: $R(s) = w^T x(s)$
Initialize policy $\pi_0$
For $i = 1, 2, \ldots$:
  Find a reward function such that the teacher maximally outperforms all previous controllers:

  $\max_{\gamma, w} \; \gamma \quad \text{s.t.} \quad w^T \mu(\pi^*) \geq w^T \mu(\pi) + \gamma \;\; \forall \pi \in \{\pi_0, \pi_1, \ldots, \pi_{i-1}\}, \quad \|w\|_2 \leq 1 \quad (8)$

  Find the optimal control policy $\pi_i$ for the current $w$
  Exit if $\gamma \leq \epsilon / 2$
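A sketch of one way to implement this loop: the projection variant from Abbeel and Ng (2004), which replaces the max-margin optimization in (8) with a closed-form geometric update but targets the same feature-matching guarantee. `solve_mdp` and `estimate_mu` are assumed callables for planning under a candidate reward and estimating feature expectations; they are not defined in the lecture.

```python
import numpy as np

def apprenticeship_learning(mu_expert, solve_mdp, estimate_mu, eps=1e-3, max_iters=50):
    """Projection variant of apprenticeship learning via inverse RL (Abbeel & Ng 2004).

    mu_expert        : feature expectations of the expert, estimated from demonstrations
    solve_mdp(w)     -> policy that is optimal for reward R(s) = w^T x(s)
    estimate_mu(pi)  -> feature expectations mu(pi) of a policy
    """
    # Start from an arbitrary policy (here: the optimizer of a random reward).
    w = np.random.randn(len(mu_expert))
    mu_bar = estimate_mu(solve_mdp(w))

    for _ in range(max_iters):
        w = mu_expert - mu_bar                   # reward weights separating expert from current mix
        if np.linalg.norm(w) <= eps:             # expert's feature counts are (nearly) matched
            break
        mu_i = estimate_mu(solve_mdp(w))         # feature expectations of the new optimal policy
        # Orthogonal projection of mu_expert onto the segment between mu_bar and mu_i.
        d = mu_i - mu_bar
        mu_bar = mu_bar + (d @ (mu_expert - mu_bar)) / (d @ d) * d
    return w
```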

SLIDE 30

Feature Expectation Matching

If the expert policy is suboptimal, then the resulting policy is a mixture of somewhat arbitrary policies that have the expert in their convex hull
In practice: pick the best policy in this set and the corresponding reward function

SLIDE 31

Ambiguity

There is an infinite number of reward functions with the same optimal policy
There are infinitely many stochastic policies that can match the feature counts
Which one should be chosen?

SLIDE 32

Table of Contents

1. Behavioral Cloning
2. Inverse Reinforcement Learning
3. Apprenticeship Learning
4. Max Entropy Inverse RL

SLIDE 33

Max Entropy Inverse RL

Again assume a linear reward function $R(s) = w^T x(s)$
Define the total feature counts for a single trajectory $\tau_j$ as:

$\mu_{\tau_j} = \sum_{s_i \in \tau_j} x(s_i)$

Note that this is a slightly different definition than the one we saw earlier
The average feature counts over $m$ trajectories is:

$\tilde{\mu} = \frac{1}{m} \sum_{j=1}^{m} \mu_{\tau_j}$
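Both quantities are straightforward to compute from recorded demonstrations; a small sketch, assuming each trajectory is a list of state indices and `features` holds the per-state feature vectors (these names are illustrative).

```python
import numpy as np

def traj_feature_counts(traj, features):
    """mu_tau: total (undiscounted) feature counts along one trajectory of state indices."""
    return features[traj].sum(axis=0)

def avg_feature_counts(trajs, features):
    """tilde-mu: average feature counts over the m demonstration trajectories."""
    return np.mean([traj_feature_counts(t, features) for t in trajs], axis=0)
```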

SLIDE 34

Deterministic MDP Path Distributions

Consider all possible H-step trajectories in a deterministic MDP
For a linear reward model, a policy is completely specified by its distribution over trajectories
Which policy/distribution should we choose given a set of m demonstrations?

SLIDE 35

Principle of Max Entropy

Principle of max entropy: choose the distribution with no additional preferences beyond matching the feature expectations in the demonstration dataset

$\max_{P} \; -\sum_{\tau} P(\tau) \log P(\tau) \quad \text{s.t.} \quad \sum_{\tau} P(\tau) \mu_\tau = \tilde{\mu}, \quad \sum_{\tau} P(\tau) = 1 \quad (9)$

In the linear reward case, this is equivalent to specifying the weights $w$ that yield a policy with the max entropy constrained to matching the feature expectations

Ziebart et al., 2008
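A reasoning step connecting this constrained problem to the exponential-family form on the next slide: introducing multipliers $w$ (feature constraint) and $\lambda$ (normalization) and setting the derivative of the Lagrangian with respect to each $P(\tau)$ to zero gives

$-\log P(\tau) - 1 + w^T \mu_\tau + \lambda = 0 \;\;\Rightarrow\;\; P(\tau) \propto \exp\left(w^T \mu_\tau\right)$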

SLIDE 36

Max Entropy Principle

Maximizing the entropy of the distribution over paths, subject to the feature constraints from the observed data, implies that we maximize the likelihood of the observed data under the maximum entropy (exponential family) distribution (Jaynes 1957):

$P(\tau_j \mid w) = \frac{1}{Z(w)} \exp\left(w^T \mu_{\tau_j}\right) = \frac{1}{Z(w)} \exp\left(\sum_{s_i \in \tau_j} w^T x(s_i)\right)$

$Z(w, s) = \sum_{\tau_s} \exp\left(w^T \mu_{\tau_s}\right)$

Strong preference for low cost paths; equal cost paths are equally probable.

SLIDE 37

Stochastic MDPs

Many MDPs of interest are stochastic
For these, the distribution over paths depends both on the reward weights and on the stochastic dynamics:

$P(\tau_j \mid w, P(s' \mid s, a)) \approx \frac{\exp\left(w^T \mu_{\tau_j}\right)}{Z(w, P(s' \mid s, a))} \prod_{s_i, a_i \in \tau_j} P(s_{i+1} \mid s_i, a_i)$

SLIDE 38

Learning w

Select $w$ to maximize the likelihood of the data:

$w^* = \arg\max_{w} L(w) = \arg\max_{w} \sum_{\text{examples}} \log P(\tau \mid w)$

The gradient is the difference between expected empirical feature counts and the learner's expected feature counts, which can be expressed in terms of expected state visitation frequencies:

$\nabla L(w) = \tilde{\mu} - \sum_{\tau} P(\tau \mid w) \mu_\tau = \tilde{\mu} - \sum_{s_i} D(s_i) x(s_i)$

$D(s_i)$: state visitation frequency
Do we need to know the transition model to compute the above?

SLIDE 39

MaxEnt IRL Algorithm
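A rough sketch of the standard MaxEnt IRL loop (Ziebart et al. 2008): a backward pass (soft value iteration) for the MaxEnt policy, a forward pass for the expected state visitation frequencies $D(s)$, and gradient ascent on $w$ using the gradient from the previous slide. This is a generic tabular sketch under assumed array shapes and hyperparameters, not necessarily the exact pseudocode shown in lecture.

```python
import numpy as np
from scipy.special import logsumexp

def maxent_irl(P, features, demo_mu, horizon=50, lr=0.05, iters=200, p0=None):
    """MaxEnt IRL sketch for a small tabular MDP.

    P        : (A, S, S) array, P[a, s, s'] = transition probability
    features : (S, n) array, feature vector x(s) for each state
    demo_mu  : (n,) array, average (undiscounted) feature counts of the demonstrations
    """
    A, S, _ = P.shape
    w = np.zeros(features.shape[1])
    p0 = np.full(S, 1.0 / S) if p0 is None else p0    # initial state distribution

    for _ in range(iters):
        r = features @ w                               # R(s) = w^T x(s)

        # Backward pass: soft value iteration gives the MaxEnt stochastic policy.
        v = np.zeros(S)
        for _ in range(horizon):
            q = r[None, :] + P @ v                     # q[a, s] = R(s) + E[v(s') | s, a]
            v = logsumexp(q, axis=0)
        pi = np.exp(q - v[None, :])                    # pi[a, s] = P(a | s)

        # Forward pass: expected state visitation frequencies D(s).
        d, D = p0.copy(), np.zeros(S)
        for _ in range(horizon):
            D += d
            d = np.einsum('s,as,ast->t', d, pi, P)

        # Gradient ascent on the log likelihood: empirical minus expected feature counts.
        w += lr * (demo_mu - features.T @ D)

    return w
```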

SLIDE 40

Max Entropy IRL

The max entropy approach has been hugely influential
Provides a principled way for selecting among the (many) possible reward functions
The original formulation requires knowledge of the transition model, or the ability to simulate/act in the world to gather samples of the transition model

Check your understanding: was this needed in behavioral cloning?

SLIDE 41

From IRL to Policies

Inverse RL approaches provide a way to learn a reward function
Generally we are interested in using this reward function to compute a policy whose performance equals or exceeds the expert policy's
One approach: given the learned reward function, use it with regular RL
Can we more directly learn the desired policy?

SLIDE 42

Guided Cost Learning

Finn et al., 2016

SLIDE 43

Generative Adversarial Imitation Learning

Formulate imitation learning as a generative adversarial network: a policy trained with TRPO generates trajectories that a discriminator compares with the expert's demonstrations.

Ho and Ermon, 2016

SLIDE 44

Table of Contents

1. Behavioral Cloning
2. Inverse Reinforcement Learning
3. Apprenticeship Learning
4. Max Entropy Inverse RL

SLIDE 45

Class Structure

Last time: Deep reinforcement learning
This time: Imitation Learning
Next time: Policy Search
