Class Feedback Thanks to those that participated! Of 70 responses, - PowerPoint PPT Presentation

Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver and John Schulman and Pieter Abbeel Lecture 9: Policy Gradient II (Post lecture) 3 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 1 / 54

Class Feedback Thanks to those that participated! Of 70 responses, 54% thought too fast, 43% just right Lecture 9: Policy Gradient II (Post lecture) 4 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 2 / 54

Class Feedback Thanks to those that participated! Of 70 responses, 54% thought too fast, 43% just right Multiple request to: repeat questions for those watching later on; have more worked examples; have more conceptual emphasis; minimize notation errors Lecture 9: Policy Gradient II (Post lecture) 5 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 3 / 54

Class Feedback Thanks to4 those that participated! Of 70 responses, 54% thought too fast, 43% just right Multiple request to: repeat questions for those watching later on; have more worked examples; have more conceptual emphasis; minimize notation errors Common things people find are helping them learn: assignments, mathematical derivations, checking your understanding/talking to a neighbor Lecture 9: Policy Gradient II (Post lecture) 6 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 4 / 54

Class Structure Last time: Policy Search This time: Policy Search Next time: Midterm review Lecture 9: Policy Gradient II (Post lecture) 7 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 5 / 54

Recall: Policy-Based RL Policy search: directly parametrize the policy π θ ( s , a ) = P [ a | s , θ ] Goal is to find a policy π with the highest value function V π (Pure) Policy based methods No Value Function Learned Policy Actor-Critic methods Learned Value Function Learned Policy Lecture 9: Policy Gradient II (Post lecture) 8 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 6 / 54

Recall: Advantages of Policy-Based RL Advantages: Better convergence properties Effective in high-dimensional or continuous action spaces Can learn stochastic policies Disadvantages: Typically converge to a local rather than global optimum Evaluating a policy is typically inefficient and high variance Lecture 9: Policy Gradient II (Post lecture) 9 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 7 / 54

Recall: Policy Gradient Defined V ( θ ) = V π θ to make explicit the dependence of the value on the policy parameters Assumed episodic MDPs Policy gradient algorithms search for a local maximum in V ( θ ) by ascending the gradient of the policy, w.r.t parameters θ ∆ θ = α ∇ θ V ( θ ) Where ∇ θ V ( θ ) is the policy gradient   δ V ( θ ) δθ 1 .  .  ∇ θ V ( θ ) = .     δ V ( θ ) δθ n and α is a step-size parameter Lecture 9: Policy Gradient II (Post lecture) 10 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 8 / 54

Desired Properties of a Policy Gradient RL Algorithm Goal: Converge as quickly as possible to a local optima Incurring reward / cost as execute policy, so want to minimize number of iterations / time steps until reach a good policy Lecture 9: Policy Gradient II (Post lecture) 11 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 9 / 54

Desired Properties of a Policy Gradient RL Algorithm Goal: Converge as quickly as possible to a local optima Incurring reward / cost as execute policy, so want to minimize number of iterations / time steps until reach a good policy During policy search alternating between evaluating policy and changing (improving) policy (just like in policy iteration) Would like each policy update to be a monotonic improvement Only guaranteed to reach a local optima with gradient descent Monotonic improvement will achieve this And in the real world, monotonic improvement is often beneficial Lecture 9: Policy Gradient II (Post lecture) 12 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 10 / 54

Desired Properties of a Policy Gradient RL Algorithm Goal: Obtain large monotonic improvements to policy at each update Techniques to try to achieve this: Last time and today: Get a better estimate of the gradient (intuition: should improve updating policy parameters) Today: Change how to update the policy parameters given the gradient Lecture 9: Policy Gradient II (Post lecture) 13 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 11 / 54

Table of Contents Better Gradient Estimates 1 Policy Gradient Algorithms and Reducing Variance 2 Updating the Parameters Given the Gradient: Motivation 3 Need for Automatic Step Size Tuning 4 Updating the Parameters Given the Gradient: Local Approximation 5 Updating the Parameters Given the Gradient: Trust Regions 6 Updating the Parameters Given the Gradient: TRPO Algorithm 7 Lecture 9: Policy Gradient II (Post lecture) 14 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 12 / 54

Likelihood Ratio / Score Function Policy Gradient Recall last time ( m is a set of trajectories): T − 1 m ∇ θ log π θ ( a ( i ) t | s ( i ) � � R ( τ ( i ) ) ∇ θ V ( θ ) ≈ (1 / m ) t ) i =1 t =0 Unbiased estimate of gradient but very noisy Fixes that can make it practical Temporal structure (discussed last time) Baseline Alternatives to using Monte Carlo returns R ( τ ( i ) ) as targets Lecture 9: Policy Gradient II (Post lecture) 15 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 13 / 54

Policy Gradient: Introduce Baseline reduce variance by introducing a baseline b ( s ) � T − 1 � T − 1 �� r t ′ − b ( s t ) ∇ θ E τ [ R ] = E τ ∇ θ log π ( a t | s t , θ ) t ′ = t t =0 For any choice of b , gradient estimator is unbiased. Near optimal choice is expected return, b ( s t ) ≈ E [ r t + r t +1 + · · · + r T − 1 ] Interpretation: increase logprob of action a t proportionally to how much returns � T − 1 t ′ = t r t ′ are better than expected Lecture 9: Policy Gradient II (Post lecture) 16 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 14 / 54

Baseline b ( s ) Does Not Introduce Bias–Derivation E τ [ ∇ θ log π ( a t | s t , θ ) b ( s t )] � � = E s 0: t , a 0:( t − 1) E s ( t +1): T , a t :( T − 1) [ ∇ θ log π ( a t | s t , θ ) b ( s t )] Lecture 9: Policy Gradient II (Post lecture) 17 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 15 / 54

Baseline b ( s ) Does Not Introduce Bias–Derivation E τ [ ∇ θ log π ( a t | s t , θ ) b ( s t )] � � = E s 0: t , a 0:( t − 1) E s ( t +1): T , a t :( T − 1) [ ∇ θ log π ( a t | s t , θ ) b ( s t )] (break up expectation) � � = E s 0: t , a 0:( t − 1) b ( s t ) E s ( t +1): T , a t :( T − 1) [ ∇ θ log π ( a t | s t , θ )] (pull baseline term out) = E s 0: t , a 0:( t − 1) [ b ( s t ) E a t [ ∇ θ log π ( a t | s t , θ )]] (remove irrelevant variables) � � π θ ( a t | s t ) ∇ θ π ( a t | s t , θ ) � = E s 0: t , a 0:( t − 1) b ( s t ) (likelihood ratio) π θ ( a t | s t ) a � � � = E s 0: t , a 0:( t − 1) b ( s t ) ∇ θ π ( a t | s t , θ ) a � � � = E s 0: t , a 0:( t − 1) b ( s t ) ∇ θ π ( a t | s t , θ ) a = E s 0: t , a 0:( t − 1) [ b ( s t ) ∇ θ 1] = E s 0 : t , a 0 :( t − 1) [ b ( s t ) · 0] = 0 Lecture 9: Policy Gradient II (Post lecture) 18 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 16 / 54

”Vanilla” Policy Gradient Algorithm Initialize policy parameter θ , baseline b for iteration=1 , 2 , · · · do Collect a set of trajectories by executing the current policy At each timestep in each trajectory, compute the return R t ( s t ) = � T − 1 t ′ = t r t ′ , and the advantage estimate ˆ A t = R t − b ( s t ). Re-fit the baseline, by minimizing || b ( s t ) − R t || 2 , summed over all trajectories and timesteps. Update the policy, using a policy gradient estimate ˆ g , which is a sum of terms ∇ θ log π ( a t | s t , θ ) ˆ A t . (Plug ˆ g into SGD or ADAM) endfor Lecture 9: Policy Gradient II (Post lecture) 19 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 17 / 54

Practical Implementation with Autodiff t ∇ θ log π ( a t | s t ; θ ) ˆ Usual formula � A t is inifficient–want to batch data Define ”surrogate” function using data from current batch � log π ( a t | s t ; θ ) ˆ L ( θ ) = A t t Then policy gradient estimator ˆ g = ∇ θ L ( θ ) Can also include value function fit error � R t || 2 � � log π ( a t | s t ; θ ) ˆ A t − || V ( s t ) − ˆ L ( θ ) = t Lecture 9: Policy Gradient II (Post lecture) 20 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 18 / 54

Other choices for baseline? Initialize policy parameter θ , baseline b for iteration=1 , 2 , · · · do Collect a set of trajectories by executing the current policy At each timestep in each trajectory, compute the return R t = � T − 1 t ′ = t r t ′ , and the advantage estimate ˆ A t = R t − b ( s t ). Re-fit the baseline, by minimizing || b ( s t ) − R t || 2 , summed over all trajectories and timesteps. Update the policy, using a policy gradient estimate ˆ g , which is a sum of terms ∇ θ log π ( a t | s t , θ ) ˆ A t . (Plug ˆ g into SGD or ADAM) endfor Lecture 9: Policy Gradient II (Post lecture) 21 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 19 / 54

Class Feedback Thanks to those that participated! Of 70 responses, - PowerPoint PPT Presentation

Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver and John Schulman and Pieter Abbeel

Baby Got Feedback: How to Give and Take Feedback Like A Boss Sarah Hagan @thesarahhagan Sarah

Programming Abstraction in C++ Eric S. Roberts and Julie Zelenski Stanford University 2010

BIBLICAL SURVEY Introductory Class Introductory Class BIBLICAL SURVEY Introductory Class

Feedback By Daniels group (Mathew, Kamal and Daniel) Lisa needs feedback What is good

Feedback John Settle October 16, 2014 Distinctions: A Mediators use of feedback

Feedback Control Theory a Computer System s Perspective Introduction Introduction

Feedback EEG Brain Feedback EEG Brain Feedback Tends to be Supposed

Almost over... Feedback Website schedule for talk feedback Booklet (infodesk) feedback@

Nonlinear Control Lecture # 10 State Feedback Stabilization and Robust State Feedback

Haptics Haptics Haptic : Haptic : Haptic and Tactile Feedback Haptic and Tactile Feedback

Relevance Feedback Relevance Feedback Relevance Feedback Prof. Paolo Ciaccia Prof. Paolo

4303 Ed/Serious Feedback and Reinforcement Feedback and Reinforcement Both of them are

Improving Automated Feedback Building a Rule Feedback Generator Eric Bouwers September 27, 2007

Curriculum on The Cadet Corps Uniform Class A Uniform Class A Uniform Agenda C1. Class A

Electing Your Membership Class Class TG, Class TH, or Class DC As a school employee who

TwissOptics Class Joschua Dilly TwissOptics Class 2 The TwissOptics Class Resonance Driving

AI NTPU Dr. Chih-Chien Wang Dr. Min-Yuh Day Mr. Wei-Jin Gao Mr. Yen-Cheng Chiu Ms. Chun-Lian

Lecture 13 Gaussian Process Models - Part 2 3/06/2018 1 EDA and GPs 2 Variogram When fitting

Restoring & Sustaining Small Employer Health Insurance Coverage in Post- Pandemic New Jersey

ECO 317 Economics of Uncertainty Fall Term 2009 Slides to accompany 14. Information

"IQCP for POCT in the Post-Analytic Stage: The Results are In, Now What Will Become of Them?

ECE 222 Signals & Systems I Syllabus Miscellaneous Notes ece.pdx.edu/ ece2xx/ECE222

Using iRODS as an entry point to VITAM for long-term data preservation IRODS UGM 2020

Indications, evaluation and treatment Rajabrata Sarkar M.D. Ph.D. Barbara Baur Dunlap Professor

Class Feedback Thanks to those that participated! Of 70 responses, - PowerPoint PPT Presentation

Lecture 9: Policy Gradient II (Post lecture) 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 Additional reading: Sutton and Barto 2018 Chp. 13 2 With many slides from or derived from David Silver and John Schulman and Pieter Abbeel

Baby Got Feedback: How to Give and Take Feedback Like A Boss Sarah Hagan @thesarahhagan Sarah

Programming Abstraction in C++ Eric S. Roberts and Julie Zelenski Stanford University 2010

BIBLICAL SURVEY Introductory Class Introductory Class BIBLICAL SURVEY Introductory Class

Feedback By Daniels group (Mathew, Kamal and Daniel) Lisa needs feedback What is good

Feedback John Settle October 16, 2014 Distinctions: A Mediators use of feedback

Feedback Control Theory a Computer System s Perspective Introduction Introduction

Feedback EEG Brain Feedback EEG Brain Feedback Tends to be Supposed

Almost over... Feedback Website schedule for talk feedback Booklet (infodesk) feedback@

Nonlinear Control Lecture # 10 State Feedback Stabilization and Robust State Feedback

Haptics Haptics Haptic : Haptic : Haptic and Tactile Feedback Haptic and Tactile Feedback

Relevance Feedback Relevance Feedback Relevance Feedback Prof. Paolo Ciaccia Prof. Paolo

4303 Ed/Serious Feedback and Reinforcement Feedback and Reinforcement Both of them are

Improving Automated Feedback Building a Rule Feedback Generator Eric Bouwers September 27, 2007

Curriculum on The Cadet Corps Uniform Class A Uniform Class A Uniform Agenda C1. Class A

Electing Your Membership Class Class TG, Class TH, or Class DC As a school employee who

TwissOptics Class Joschua Dilly TwissOptics Class 2 The TwissOptics Class Resonance Driving

AI NTPU Dr. Chih-Chien Wang Dr. Min-Yuh Day Mr. Wei-Jin Gao Mr. Yen-Cheng Chiu Ms. Chun-Lian

Lecture 13 Gaussian Process Models - Part 2 3/06/2018 1 EDA and GPs 2 Variogram When fitting

Restoring &amp; Sustaining Small Employer Health Insurance Coverage in Post- Pandemic New Jersey

ECO 317 Economics of Uncertainty Fall Term 2009 Slides to accompany 14. Information

&quot;IQCP for POCT in the Post-Analytic Stage: The Results are In, Now What Will Become of Them?

ECE 222 Signals &amp; Systems I Syllabus Miscellaneous Notes ece.pdx.edu/ ece2xx/ECE222

Using iRODS as an entry point to VITAM for long-term data preservation IRODS UGM 2020

Indications, evaluation and treatment Rajabrata Sarkar M.D. Ph.D. Barbara Baur Dunlap Professor

Restoring & Sustaining Small Employer Health Insurance Coverage in Post- Pandemic New Jersey

"IQCP for POCT in the Post-Analytic Stage: The Results are In, Now What Will Become of Them?

ECE 222 Signals & Systems I Syllabus Miscellaneous Notes ece.pdx.edu/ ece2xx/ECE222