

1. Lecture 9: Policy Gradient II (Post lecture). Emma Brunskill, CS234 Reinforcement Learning, Winter 2018. Additional reading: Sutton and Barto (2018), Chapter 13. With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel.

2. Class Feedback. Thanks to those who participated! Of 70 responses, 54% thought the pace was too fast, 43% thought it was just right.

3. Class Feedback. Thanks to those who participated! Of 70 responses, 54% thought the pace was too fast, 43% thought it was just right. Multiple requests to: repeat questions for those watching later on; have more worked examples; have more conceptual emphasis; minimize notation errors.

4. Class Feedback. Thanks to those who participated! Of 70 responses, 54% thought the pace was too fast, 43% thought it was just right. Multiple requests to: repeat questions for those watching later on; have more worked examples; have more conceptual emphasis; minimize notation errors. Common things people find are helping them learn: assignments, mathematical derivations, and checking your understanding / talking to a neighbor.

5. Class Structure. Last time: Policy Search. This time: Policy Search. Next time: Midterm review.

6. Recall: Policy-Based RL. Policy search: directly parametrize the policy, $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$. The goal is to find a policy $\pi$ with the highest value function $V^\pi$. (Pure) policy-based methods: no value function, learned policy. Actor-critic methods: learned value function, learned policy.

7. Recall: Advantages of Policy-Based RL. Advantages: better convergence properties; effective in high-dimensional or continuous action spaces; can learn stochastic policies. Disadvantages: typically converges to a local rather than a global optimum; evaluating a policy is typically inefficient and high variance.

8. Recall: Policy Gradient Defined. Define $V(\theta) = V^{\pi_\theta}$ to make explicit the dependence of the value on the policy parameters. Assume episodic MDPs. Policy gradient algorithms search for a local maximum in $V(\theta)$ by ascending the gradient of the policy with respect to the parameters $\theta$:
$$\Delta\theta = \alpha \nabla_\theta V(\theta)$$
where $\nabla_\theta V(\theta)$ is the policy gradient
$$\nabla_\theta V(\theta) = \begin{pmatrix} \frac{\partial V(\theta)}{\partial \theta_1} \\ \vdots \\ \frac{\partial V(\theta)}{\partial \theta_n} \end{pmatrix}$$
and $\alpha$ is a step-size parameter.
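As a concrete illustration of the update rule above, here is a minimal sketch of a single gradient-ascent step on the policy parameters. The function name, step size, and placeholder gradient values are illustrative assumptions; in practice the gradient estimate would come from an estimator such as the likelihood-ratio formula on the following slides.

```python
import numpy as np

def gradient_ascent_step(theta, grad_v, alpha=0.01):
    """One policy-gradient update: theta <- theta + alpha * grad V(theta)."""
    return theta + alpha * grad_v

# Hypothetical parameter vector and gradient estimate, purely for illustration.
theta = np.zeros(4)
grad_estimate = np.array([0.5, -0.2, 0.1, 0.0])
theta = gradient_ascent_step(theta, grad_estimate)
```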

9. Desired Properties of a Policy Gradient RL Algorithm. Goal: converge as quickly as possible to a local optimum. We incur reward/cost as we execute the policy, so we want to minimize the number of iterations/time steps until we reach a good policy.

10. Desired Properties of a Policy Gradient RL Algorithm. Goal: converge as quickly as possible to a local optimum. We incur reward/cost as we execute the policy, so we want to minimize the number of iterations/time steps until we reach a good policy. During policy search we alternate between evaluating the policy and changing (improving) the policy, just like in policy iteration. We would like each policy update to be a monotonic improvement. Gradient descent is only guaranteed to reach a local optimum; monotonic improvement will achieve this. And in the real world, monotonic improvement is often beneficial.

11. Desired Properties of a Policy Gradient RL Algorithm. Goal: obtain large monotonic improvements to the policy at each update. Techniques to try to achieve this: (last time and today) get a better estimate of the gradient (intuition: this should improve the policy parameter updates); (today) change how to update the policy parameters given the gradient.

12. Table of Contents
1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Updating the Parameters Given the Gradient: Motivation
4. Need for Automatic Step Size Tuning
5. Updating the Parameters Given the Gradient: Local Approximation
6. Updating the Parameters Given the Gradient: Trust Regions
7. Updating the Parameters Given the Gradient: TRPO Algorithm

13. Likelihood Ratio / Score Function Policy Gradient. Recall from last time (using $m$ sampled trajectories):
$$\nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})$$
This is an unbiased estimate of the gradient, but it is very noisy. Fixes that can make it practical: temporal structure (discussed last time); a baseline; alternatives to using Monte Carlo returns $R(\tau^{(i)})$ as targets.
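A minimal sketch of this estimator for a linear-softmax policy over discrete actions is below. The policy class, feature representation, and trajectory format are assumptions made for illustration; they are not specified in the slides.

```python
import numpy as np

def grad_log_softmax(theta, s_feat, a):
    """Score function grad_theta log pi_theta(a|s) for a linear-softmax policy.

    theta has shape (n_actions, n_features); s_feat is the feature vector of s.
    """
    logits = theta @ s_feat
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = -np.outer(probs, s_feat)   # -pi(a'|s) * phi(s) for every action a'
    grad[a] += s_feat                 # +phi(s) for the action actually taken
    return grad

def likelihood_ratio_gradient(theta, trajectories):
    """(1/m) sum_i R(tau_i) sum_t grad_theta log pi_theta(a_t|s_t).

    Each trajectory is a list of (state_features, action, reward) tuples.
    """
    grad = np.zeros_like(theta)
    for traj in trajectories:
        ret = sum(r for _, _, r in traj)                        # R(tau_i)
        score = sum(grad_log_softmax(theta, s, a) for s, a, _ in traj)
        grad += ret * score
    return grad / len(trajectories)
```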

14. Policy Gradient: Introduce Baseline. Reduce variance by introducing a baseline $b(s)$:
$$\nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right)\right]$$
For any choice of $b$, the gradient estimator is unbiased. A near-optimal choice is the expected return, $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + \cdots + r_{T-1}]$. Interpretation: increase the log probability of action $a_t$ proportionally to how much the return $\sum_{t'=t}^{T-1} r_{t'}$ is better than expected.
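The sketch below shows how the reward-to-go return and the baseline-adjusted term inside the bracket might be computed. The constant mean-return baseline used here is only a stand-in for the expected-return baseline described above, and the reward values are invented for illustration.

```python
import numpy as np

def rewards_to_go(rewards):
    """R_t = sum_{t'=t}^{T-1} r_{t'} for every timestep t."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])   # illustrative trajectory rewards
returns = rewards_to_go(rewards)           # array([4., 3., 3., 1.])
baseline = returns.mean()                  # crude constant stand-in for b(s_t)
advantages = returns - baseline            # the term multiplying grad log pi(a_t|s_t, theta)
```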

15. Baseline $b(s)$ Does Not Introduce Bias: Derivation.
$$\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)] = \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]\right]$$

16. Baseline $b(s)$ Does Not Introduce Bias: Derivation.
$$\begin{aligned}
\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)]\right] && \text{(break up expectation)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)]\right] && \text{(pull baseline term out)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)]\right] && \text{(remove irrelevant variables)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \sum_a \pi_\theta(a \mid s_t)\, \frac{\nabla_\theta \pi(a \mid s_t, \theta)}{\pi_\theta(a \mid s_t)}\right] && \text{(likelihood ratio)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \sum_a \nabla_\theta \pi(a \mid s_t, \theta)\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t)\, \nabla_\theta \sum_a \pi(a \mid s_t, \theta)\right] && \text{(swap sum and gradient)} \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}[b(s_t)\, \nabla_\theta 1] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}[b(s_t) \cdot 0] = 0
\end{aligned}$$

17. "Vanilla" Policy Gradient Algorithm.
Initialize policy parameter $\theta$ and baseline $b$.
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy.
    At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$.
    Re-fit the baseline by minimizing $\|b(s_t) - R_t\|^2$, summed over all trajectories and timesteps.
    Update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$. (Plug $\hat{g}$ into SGD or ADAM.)
end for
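A sketch of this loop is below, reusing the illustrative helpers `grad_log_softmax` and `rewards_to_go` from the earlier sketches. The `collect_trajectories` callable and the tabular running-average baseline are additional assumptions, since the slides do not fix how trajectories are gathered or how the baseline is represented.

```python
import numpy as np

def vanilla_policy_gradient(theta, collect_trajectories, n_iterations=100, alpha=1e-2):
    """Sketch of the loop above; each trajectory is a list of (state_key, state_features, action, reward)."""
    baseline = {}                                              # b(s), stored per discrete state key
    for _ in range(n_iterations):
        trajectories = collect_trajectories(theta)             # run the current policy
        grad = np.zeros_like(theta)
        for traj in trajectories:
            returns = rewards_to_go(np.array([r for *_, r in traj]))
            for (s_key, s_feat, a, _), R_t in zip(traj, returns):
                A_t = R_t - baseline.get(s_key, 0.0)           # advantage estimate A_hat_t
                grad += grad_log_softmax(theta, s_feat, a) * A_t
                old = baseline.get(s_key, 0.0)
                baseline[s_key] = old + 0.1 * (R_t - old)      # re-fit baseline toward observed return
        theta = theta + alpha * grad / len(trajectories)       # plain SGD step on g_hat
    return theta
```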

18. Practical Implementation with Autodiff. The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$ is inefficient: we want to batch data. Define a "surrogate" function using data from the current batch:
$$L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$$
Then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$. We can also include the value function fit error:
$$L(\theta) = \sum_t \left( \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t - \|V(s_t) - \hat{R}_t\|^2 \right)$$
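Here is a minimal sketch of the surrogate-loss trick using automatic differentiation. PyTorch is used as an example framework (an assumption: the slide does not name one), and `policy` is assumed to return per-action log probabilities for a batch of states.

```python
import torch

def surrogate_loss(policy, states, actions, advantages, values=None, returns=None):
    """-L(theta): minimizing this with an optimizer ascends the surrogate L(theta)."""
    log_probs = policy(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # advantages are treated as constants (detach them beforehand);
    # only log pi carries the policy gradient
    loss = -(log_probs * advantages).sum()
    if values is not None:
        loss = loss + ((values - returns) ** 2).sum()          # optional value-function fit error
    return loss

# Usage sketch: loss = surrogate_loss(...); loss.backward() yields g_hat = grad_theta L(theta),
# which an optimizer such as SGD or Adam then applies to the policy parameters.
```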

19. Other Choices for Baseline?
Initialize policy parameter $\theta$ and baseline $b$.
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy.
    At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$.
    Re-fit the baseline by minimizing $\|b(s_t) - R_t\|^2$, summed over all trajectories and timesteps.
    Update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$. (Plug $\hat{g}$ into SGD or ADAM.)
end for
