
Lecture 8: Policy Gradient I
Emma Brunskill
CS234 Reinforcement Learning, Winter 2020
Additional reading: Sutton and Barto 2018, Chp. 13
With many slides from or derived from David Silver and John Schulman and Pieter Abbeel

Refresh Your Knowledge. Imitation Learning and DRL
Behavior cloning (select all):
1. Involves using supervised learning to predict actions given states using expert demonstrations
2. If the expert demonstrates an action in all states in a tabular domain, behavior cloning will find an optimal expert policy
3. If the expert demonstrates an action in all states visited under the expert's policy, behavior cloning will find an optimal expert policy
4. DAGGER improves behavior cloning and only requires the expert to demonstrate successful trajectories
5. Not sure

Last Time: We want RL Algorithms that Perform
- Optimization
- Delayed consequences
- Exploration
- Generalization
And do it statistically and computationally efficiently

Last Time: Generalization and Efficiency
Can use structure and additional knowledge to help constrain and speed reinforcement learning

Class Structure
- Last time: Imitation Learning in Large State Spaces
- This time: Policy Search
- Next time: Policy Search Cont.

Table of Contents
1. Introduction
2. Policy Gradient
3. Score Function and Policy Gradient Theorem
4. Policy Gradient Algorithms and Reducing Variance

Policy-Based Reinforcement Learning
- In the last lecture we approximated the value or action-value function using parameters $w$:
  $V_w(s) \approx V^\pi(s)$
  $Q_w(s, a) \approx Q^\pi(s, a)$
- A policy was generated directly from the value function, e.g. using $\epsilon$-greedy
- In this lecture we will directly parametrize the policy, and will typically use $\theta$ to show parameterization: $\pi_\theta(s, a) = P[a \mid s; \theta]$
- Goal is to find a policy $\pi$ with the highest value function $V^\pi$
- We will focus again on model-free reinforcement learning
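
To make the parameterization $\pi_\theta(s, a) = P[a \mid s; \theta]$ concrete, here is a minimal sketch of one possible stochastic policy class: a softmax over linear action preferences. The softmax form, the function name `softmax_policy`, and the one-hot features are assumptions chosen for illustration; the slide does not commit to a particular parameterization.

```python
import math

def softmax_policy(theta, features):
    """Return action probabilities pi_theta(s, .) for a single state.

    theta:    list of floats, the policy parameters
    features: one feature vector phi(s, a) per action (hypothetical features)
    """
    # Linear preference for each action: theta . phi(s, a)
    prefs = [sum(t * f for t, f in zip(theta, phi)) for phi in features]
    # Subtract the largest preference before exponentiating, for numerical stability
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

# Two actions with one-hot features: theta = 0 gives the uniform policy,
# and raising the first parameter shifts probability toward the first action.
one_hot = [[1.0, 0.0], [0.0, 1.0]]
uniform_probs = softmax_policy([0.0, 0.0], one_hot)
skewed_probs = softmax_policy([2.0, 0.0], one_hot)
```

Changing $\theta$ changes the action probabilities smoothly, which is what later makes it possible to take gradients with respect to $\theta$.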

Value-Based and Policy-Based RL
- Value Based: Learnt Value Function; Implicit policy (e.g. $\epsilon$-greedy)
- Policy Based: No Value Function; Learnt Policy
- Actor-Critic: Learnt Value Function; Learnt Policy

Types of Policies to Search Over
- So far have focused on deterministic policies (why?)
- Now we are thinking about direct policy search in RL, will focus heavily on stochastic policies

Example: Rock-Paper-Scissors
- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Let state be history of prior actions (rock, paper and scissors) and if won or lost
- Is deterministic policy optimal? Why or why not?

Example: Rock-Paper-Scissors, Vote
- Two-player game of rock-paper-scissors
  - Scissors beats paper
  - Rock beats scissors
  - Paper beats rock
- Let state be history of prior actions (rock, paper and scissors) and if won or lost
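
One way to explore the question is to compare a fixed deterministic choice with the uniform random policy against an opponent who knows our policy and picks the best reply. The sketch below is an illustration under simplifying assumptions: it ignores the history-based state and looks at a single round, and the payoff values (+1 win, -1 loss, 0 tie) and helper names are hypothetical.

```python
ACTIONS = ["rock", "paper", "scissors"]
# BEATS[a] is the action that a defeats
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}

def payoff(ours, theirs):
    """+1 if our action wins, -1 if it loses, 0 on a tie (assumed payoffs)."""
    if ours == theirs:
        return 0
    return 1 if BEATS[ours] == theirs else -1

def expected_payoff(policy, opponent_action):
    """Expected single-round payoff of a stochastic policy against a fixed reply."""
    return sum(p * payoff(a, opponent_action) for a, p in zip(ACTIONS, policy))

def worst_case(policy):
    """Payoff against an opponent who knows the policy and picks the best reply."""
    return min(expected_payoff(policy, b) for b in ACTIONS)

always_rock = [1.0, 0.0, 0.0]   # a deterministic policy
uniform = [1 / 3, 1 / 3, 1 / 3]  # a stochastic policy

worst_deterministic = worst_case(always_rock)
worst_uniform = worst_case(uniform)
```

Under these assumptions the deterministic policy can be exploited on every round, while the uniform policy cannot be exploited at all, which is the motivation for searching over stochastic policies.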

Example: Aliased Gridworld (1)
- The agent cannot differentiate the grey states
- Consider features of the following form (for all N, E, S, W):
  $\phi(s, a) = \mathbb{1}(\text{wall to N}, a = \text{move E})$
- Compare value-based RL, using an approximate value function
  $Q_\theta(s, a) = f(\phi(s, a); \theta)$
- To policy-based RL, using a parametrized policy
  $\pi_\theta(s, a) = g(\phi(s, a); \theta)$

Example: Aliased Gridworld (2)
- Under aliasing, an optimal deterministic policy will either
  - move W in both grey states (shown by red arrows), or
  - move E in both grey states
- Either way, it can get stuck and never reach the money
- Value-based RL learns a near-deterministic policy, e.g. greedy or $\epsilon$-greedy
- So it will traverse the corridor for a long time

Example: Aliased Gridworld (3)
- An optimal stochastic policy will randomly move E or W in grey states
  $\pi_\theta(\text{wall to N and S}, \text{move E}) = 0.5$
  $\pi_\theta(\text{wall to N and S}, \text{move W}) = 0.5$
- It will reach the goal state in a few steps with high probability
- Policy-based RL can learn the optimal stochastic policy
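
The aliasing argument can be checked with a small simulation. The layout below is a simplified stand-in for the figure, which is not reproduced in this text, so treat it as an assumption: a corridor of five cells numbered 0 to 4, grey (aliased) cells at 1 and 3, and a goal reached by moving S from the middle cell 2. The end cells and the middle cell are assumed to be distinguishable, so the policy acts sensibly there.

```python
import random

GOAL = "goal"

def step(pos, action):
    """Corridor cells 0..4; moving S from the middle cell 2 reaches the goal."""
    if pos == 2 and action == "S":
        return GOAL
    if action == "E":
        return min(pos + 1, 4)
    if action == "W":
        return max(pos - 1, 0)
    return pos

def run_episode(grey_policy, start, rng, max_steps=1000):
    """Return the number of steps taken to reach the goal, or None if never reached."""
    pos = start
    for t in range(1, max_steps + 1):
        if pos == 0:
            action = "E"
        elif pos == 4:
            action = "W"
        elif pos == 2:
            action = "S"
        else:
            # Cells 1 and 3 look identical, so one rule must serve both
            action = grey_policy(rng)
        pos = step(pos, action)
        if pos == GOAL:
            return t
    return None

def always_west(rng):
    return "W"

def coin_flip(rng):
    return rng.choice(["E", "W"])

rng = random.Random(0)
stuck = run_episode(always_west, start=0, rng=rng)      # bounces between cells 0 and 1
lucky = run_episode(always_west, start=3, rng=rng)      # happens to start on the good side
stochastic = [run_episode(coin_flip, start=0, rng=rng) for _ in range(200)]
```

In this sketch the deterministic rule succeeds only from one side of the corridor, while the coin-flip rule reaches the goal from either side.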

Policy Objective Functions
- Goal: given a policy $\pi_\theta(s, a)$ with parameters $\theta$, find best $\theta$
- But how do we measure the quality for a policy $\pi_\theta$?
- In episodic environments can use policy value at start state $V(s_0, \theta)$
- For simplicity, today will mostly discuss the episodic case, but can easily extend to the continuing / infinite horizon case


Policy optimization
- Policy based reinforcement learning is an optimization problem
- Find policy parameters $\theta$ that maximize $V(s_0, \theta)$
- Can use gradient free optimization
  - Hill climbing
  - Simplex / amoeba / Nelder Mead
  - Genetic algorithms
  - Cross-Entropy method (CEM)
  - Covariance Matrix Adaptation (CMA)
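
As an illustration of one method from this list, the following is a minimal sketch of the cross-entropy method with a diagonal Gaussian over parameters. Everything specific here is an assumption for the example: the function names, the hyperparameters, and in particular `toy_value`, a made-up quadratic that stands in for $V(s_0, \theta)$. In an RL setting the objective would instead be an estimate of the return obtained by running the policy with parameters $\theta$.

```python
import random
import statistics

def cem(objective, dim, iterations=30, population=50, elite_frac=0.2,
        init_std=2.0, seed=0):
    """Cross-entropy method: sample parameters, keep the best, refit the sampler."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [init_std] * dim
    n_elite = max(2, int(population * elite_frac))
    for _ in range(iterations):
        # Sample candidate parameter vectors from the current Gaussian
        samples = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                   for _ in range(population)]
        # Rank by objective value only; no gradient of the objective is used
        samples.sort(key=objective, reverse=True)
        elite = samples[:n_elite]
        # Refit the mean and standard deviation to the elite set
        for i in range(dim):
            column = [theta[i] for theta in elite]
            mean[i] = statistics.mean(column)
            std[i] = statistics.pstdev(column) + 1e-3  # small floor keeps sampling alive
    return mean

def toy_value(theta):
    """Hypothetical stand-in for V(s_0, theta), maximized at theta = (1, -2)."""
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)

best_theta = cem(toy_value, dim=2)
```

The objective is only ever evaluated, never differentiated, so the same loop would work with a non-differentiable policy parameterization.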

Human-in-the-Loop Exoskeleton Optimization (Zhang et al. Science 2017)
Figure: Zhang et al. Science 2017
Optimization was done using CMA-ES, a variant of covariance matrix adaptation

Gradient Free Policy Optimization
Can often work embarrassingly well: "discovered that evolution strategies (ES), an optimization technique that's been known for decades, rivals the performance of standard reinforcement learning (RL) techniques on modern RL benchmarks (e.g. Atari/MuJoCo)" (https://blog.openai.com/evolution-strategies/)

Gradient Free Policy Optimization
- Often a great simple baseline to try
- Benefits
  - Can work with any policy parameterizations, including non-differentiable
  - Frequently very easy to parallelize
- Limitations
  - Typically not very sample efficient because it ignores temporal structure

Policy optimization
- Policy based reinforcement learning is an optimization problem
- Find policy parameters $\theta$ that maximize $V(s_0, \theta)$
- Can use gradient free optimization
- Greater efficiency often possible using gradient
  - Gradient descent
  - Conjugate gradient
  - Quasi-newton
- We focus on gradient descent, many extensions possible
- And on methods that exploit sequential structure


Policy Gradient
- Define $V(\theta) = V(s_0, \theta)$ to make explicit the dependence of the value on the policy parameters [but don't confuse with value function approximation, where parameterized value function]
- Assume episodic MDPs (easy to extend to related objectives, like average reward)

Policy Gradient
- Define $V^{\pi_\theta} = V(s_0, \theta)$ to make explicit the dependence of the value on the policy parameters
- Assume episodic MDPs
- Policy gradient algorithms search for a local maximum in $V(s_0, \theta)$ by ascending the gradient of the policy, w.r.t. parameters $\theta$:
  $\Delta\theta = \alpha \nabla_\theta V(s_0, \theta)$
- Where $\nabla_\theta V(s_0, \theta)$ is the policy gradient
  $\nabla_\theta V(s_0, \theta) = \begin{pmatrix} \frac{\partial V(s_0, \theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial V(s_0, \theta)}{\partial \theta_n} \end{pmatrix}$
  and $\alpha$ is a step-size parameter
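
The update $\Delta\theta = \alpha \nabla_\theta V(s_0, \theta)$ can be sketched as a short loop. Two assumptions are made for illustration only: the gradient is approximated by central finite differences, which is just one simple way to obtain each partial derivative, and `toy_value` is a made-up smooth function standing in for $V(s_0, \theta)$. All names and step sizes are hypothetical.

```python
def finite_difference_gradient(value_fn, theta, eps=1e-5):
    """Approximate each partial derivative dV/dtheta_i by a central difference."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((value_fn(up) - value_fn(down)) / (2 * eps))
    return grad

def gradient_ascent(value_fn, theta, alpha=0.1, steps=200):
    """Repeatedly apply theta <- theta + alpha * grad V(theta)."""
    theta = list(theta)
    for _ in range(steps):
        grad = finite_difference_gradient(value_fn, theta)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

def toy_value(theta):
    """Hypothetical stand-in for V(s_0, theta), maximized at theta = (1, -2)."""
    return -(theta[0] - 1.0) ** 2 - (theta[1] + 2.0) ** 2

theta_star = gradient_ascent(toy_value, [0.0, 0.0])
```

Note the plus sign in the update: the parameters move up the gradient because the goal is to maximize the value. On this toy function there is a single maximum; in general the loop only finds a local maximum, as the slide states.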
