 
Lecture 9: Policy Gradient II (Post lecture)
Emma Brunskill
CS234 Reinforcement Learning, Winter 2018
Additional reading: Sutton and Barto 2018, Chp. 13
With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel
Class Feedback
- Thanks to those that participated!
- Of 70 responses, 54% thought too fast, 43% just right
- Multiple requests to: repeat questions for those watching later on; have more worked examples; have more conceptual emphasis; minimize notation errors
- Common things people find are helping them learn: assignments, mathematical derivations, checking your understanding / talking to a neighbor
Class Structure
- Last time: Policy Search
- This time: Policy Search
- Next time: Midterm review
Recall: Policy-Based RL
- Policy search: directly parametrize the policy
  $\pi_\theta(s, a) = \mathbb{P}[a \mid s, \theta]$
- Goal is to find a policy $\pi$ with the highest value function $V^\pi$
- (Pure) policy-based methods: no value function, learned policy
- Actor-critic methods: learned value function, learned policy
Recall: Advantages of Policy-Based RL
- Advantages:
  - Better convergence properties
  - Effective in high-dimensional or continuous action spaces
  - Can learn stochastic policies
- Disadvantages:
  - Typically converge to a local rather than global optimum
  - Evaluating a policy is typically inefficient and high variance
Recall: Policy Gradient
- Defined $V(\theta) = V^{\pi_\theta}$ to make explicit the dependence of the value on the policy parameters
- Assumed episodic MDPs
- Policy gradient algorithms search for a local maximum in $V(\theta)$ by ascending the gradient of the policy w.r.t. parameters $\theta$:
  $\Delta\theta = \alpha \nabla_\theta V(\theta)$
- where $\nabla_\theta V(\theta)$ is the policy gradient
  $$
  \nabla_\theta V(\theta) = \begin{pmatrix} \dfrac{\partial V(\theta)}{\partial \theta_1} \\ \vdots \\ \dfrac{\partial V(\theta)}{\partial \theta_n} \end{pmatrix}
  $$
  and $\alpha$ is a step-size parameter
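The update $\Delta\theta = \alpha \nabla_\theta V(\theta)$ can be sketched on a toy problem. Everything below is an illustrative assumption, not from the lecture: a one-step problem with two actions, a softmax policy, and the names `REWARDS`, `policy`, `value`, `value_gradient`, and `gradient_ascent`. The gradient here is exact, which is only possible because the problem is tiny.

```python
import math

# Hypothetical one-step problem: two actions with fixed rewards.
REWARDS = [1.0, 0.0]

def policy(theta):
    """Softmax policy over the actions, with one preference per action."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def value(theta):
    """V(theta): expected reward under the softmax policy."""
    return sum(p * r for p, r in zip(policy(theta), REWARDS))

def value_gradient(theta):
    """Exact gradient for a softmax policy: dV/dtheta_k = pi_k * (r_k - V)."""
    pi = policy(theta)
    v = value(theta)
    return [p * (r - v) for p, r in zip(pi, REWARDS)]

def gradient_ascent(theta, alpha=0.5, iterations=200):
    """Repeatedly apply delta_theta = alpha * grad V(theta)."""
    for _ in range(iterations):
        grad = value_gradient(theta)
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta

theta_final = gradient_ascent([0.0, 0.0])
```

Starting from the uniform policy, `value(theta_final)` should be close to the best achievable reward of 1; the step size `alpha` and iteration count are arbitrary choices.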
Desired Properties of a Policy Gradient RL Algorithm
- Goal: converge as quickly as possible to a local optimum
  - We incur reward / cost as we execute the policy, so we want to minimize the number of iterations / time steps until reaching a good policy
- During policy search, we alternate between evaluating the policy and changing (improving) the policy (just like in policy iteration)
- Would like each policy update to be a monotonic improvement
  - Only guaranteed to reach a local optimum with gradient ascent
  - Monotonic improvement will achieve this
  - And in the real world, monotonic improvement is often beneficial
Desired Properties of a Policy Gradient RL Algorithm
- Goal: obtain large monotonic improvements to the policy at each update
- Techniques to try to achieve this:
  - Last time and today: get a better estimate of the gradient (intuition: should improve updating policy parameters)
  - Today: change how to update the policy parameters given the gradient
Table of Contents
1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Updating the Parameters Given the Gradient: Motivation
4. Need for Automatic Step Size Tuning
5. Updating the Parameters Given the Gradient: Local Approximation
6. Updating the Parameters Given the Gradient: Trust Regions
7. Updating the Parameters Given the Gradient: TRPO Algorithm
Likelihood Ratio / Score Function Policy Gradient
- Recall last time ($m$ is the number of sampled trajectories):
  $$
  \nabla_\theta V(\theta) \approx \frac{1}{m} \sum_{i=1}^{m} R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})
  $$
- Unbiased estimate of the gradient, but very noisy
- Fixes that can make it practical:
  - Temporal structure (discussed last time)
  - Baseline
  - Alternatives to using Monte Carlo returns $R(\tau^{(i)})$ as targets
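A minimal sketch of the likelihood ratio estimator, assuming a toy setting that is not from the lecture: a state-independent softmax policy over two actions, episodes of `HORIZON` steps, and a fixed reward per action. All names (`REWARDS`, `HORIZON`, `grad_log_policy`, `score_function_gradient`) are hypothetical.

```python
import math
import random

REWARDS = [1.0, 0.0]   # hypothetical reward for each action
HORIZON = 2            # T: number of steps per episode

def policy(theta):
    """Softmax policy over the actions (no state dependence in this toy)."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_policy(theta, action):
    """Softmax score function: d/dtheta_k log pi(a) = 1[k == a] - pi_k."""
    pi = policy(theta)
    return [(1.0 if k == action else 0.0) - pi[k] for k in range(len(theta))]

def sample_trajectory(theta, rng):
    """Sample one episode; returns the actions taken and rewards received."""
    pi = policy(theta)
    actions = rng.choices(range(len(pi)), weights=pi, k=HORIZON)
    return actions, [REWARDS[a] for a in actions]

def score_function_gradient(theta, m, rng):
    """(1/m) * sum_i R(tau_i) * sum_t grad log pi(a_t) over m sampled trajectories."""
    grad = [0.0] * len(theta)
    for _ in range(m):
        actions, rewards = sample_trajectory(theta, rng)
        total_return = sum(rewards)          # R(tau): full-episode return
        for a in actions:
            score = grad_log_policy(theta, a)
            for k in range(len(theta)):
                grad[k] += total_return * score[k] / m
    return grad

rng = random.Random(0)
theta = [0.0, 0.0]
estimate = score_function_gradient(theta, m=20000, rng=rng)
```

For this toy problem the true gradient at `theta = [0, 0]` is `[0.5, -0.5]`, so the estimate can be compared against it; with far fewer trajectories the noise mentioned above becomes visible.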
Policy Gradient: Introduce Baseline
- Reduce variance by introducing a baseline $b(s)$:
  $$
  \nabla_\theta \mathbb{E}_\tau[R] = \mathbb{E}_\tau\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t \mid s_t, \theta) \left( \sum_{t'=t}^{T-1} r_{t'} - b(s_t) \right) \right]
  $$
- For any choice of $b$, the gradient estimator is unbiased
- Near-optimal choice is the expected return, $b(s_t) \approx \mathbb{E}[r_t + r_{t+1} + \cdots + r_{T-1}]$
- Interpretation: increase the log-probability of action $a_t$ proportionally to how much the returns $\sum_{t'=t}^{T-1} r_{t'}$ are better than expected
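To illustrate why a baseline helps, the sketch below computes the exact mean and variance of a single-sample gradient term with and without a baseline. The setting is an assumption for illustration only: one step, two actions, a softmax policy, and hypothetical rewards of 10 and 9. Enumerating the actions avoids sampling noise.

```python
import math

REWARDS = [10.0, 9.0]  # hypothetical rewards for a one-step, two-action problem

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def estimator_stats(theta, baseline):
    """Exact mean and variance of d/dtheta_0 log pi(a) * (r(a) - baseline)."""
    pi = softmax(theta)
    samples = []
    for a, p in enumerate(pi):
        score = (1.0 if a == 0 else 0.0) - pi[0]   # derivative w.r.t. theta[0]
        samples.append((p, score * (REWARDS[a] - baseline)))
    mean = sum(p * x for p, x in samples)
    var = sum(p * (x - mean) ** 2 for p, x in samples)
    return mean, var

theta = [0.0, 0.0]
expected_return = sum(p * r for p, r in zip(softmax(theta), REWARDS))
mean_plain, var_plain = estimator_stats(theta, baseline=0.0)
mean_base, var_base = estimator_stats(theta, baseline=expected_return)
```

Both estimators have the same mean, while the variance with the expected-return baseline is much smaller, because only the difference between rewards matters for the gradient.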
Baseline b(s) Does Not Introduce Bias–Derivation
$$
\begin{aligned}
&\mathbb{E}_\tau[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)] \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\Big[ \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, b(s_t)] \Big] && \text{(break up expectation)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\Big[ b(s_t)\, \mathbb{E}_{s_{(t+1):T},\, a_{t:(T-1)}}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] \Big] && \text{(pull baseline term out)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\Big[ b(s_t)\, \mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] \Big] && \text{(remove irrelevant variables)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\Big[ b(s_t) \sum_a \pi_\theta(a_t \mid s_t)\, \frac{\nabla_\theta \pi(a_t \mid s_t, \theta)}{\pi_\theta(a_t \mid s_t)} \Big] && \text{(likelihood ratio)} \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\Big[ b(s_t) \sum_a \nabla_\theta \pi(a_t \mid s_t, \theta) \Big] \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}\Big[ b(s_t)\, \nabla_\theta \sum_a \pi(a_t \mid s_t, \theta) \Big] \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}[\, b(s_t)\, \nabla_\theta 1 \,] \\
&= \mathbb{E}_{s_{0:t},\, a_{0:(t-1)}}[\, b(s_t) \cdot 0 \,] = 0
\end{aligned}
$$
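The key step in the derivation, $\mathbb{E}_{a_t}[\nabla_\theta \log \pi(a_t \mid s_t, \theta)] = 0$, can be checked numerically. The sketch below assumes a softmax policy over three actions in a single state; `theta` and `baseline_value` are arbitrary illustrative numbers, not values from the lecture.

```python
import math

def policy(theta):
    """Softmax policy over the actions for a single fixed state."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]

def expected_score(theta):
    """E_a[grad log pi(a)] computed exactly by summing over all actions."""
    pi = policy(theta)
    n = len(theta)
    expected = [0.0] * n
    for a in range(n):
        for k in range(n):
            score = (1.0 if k == a else 0.0) - pi[k]   # d/dtheta_k log pi(a)
            expected[k] += pi[a] * score
    return expected

theta = [0.3, -1.2, 2.0]
baseline_value = 7.5   # arbitrary b(s_t): a constant once s_t is fixed
baseline_term = [baseline_value * g for g in expected_score(theta)]
```

Every component of `baseline_term` is zero up to floating-point rounding, whatever value is chosen for the baseline.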
"Vanilla" Policy Gradient Algorithm
Initialize policy parameter $\theta$, baseline $b$
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$
    Re-fit the baseline by minimizing $\| b(s_t) - R_t \|^2$, summed over all trajectories and timesteps
    Update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$ (plug $\hat{g}$ into SGD or ADAM)
endfor
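The loop above can be sketched in plain Python under strong simplifying assumptions that are not part of the lecture: a hypothetical three-state chain where the state index equals the timestep, a tabular softmax policy, a tabular baseline, and a plain gradient ascent step standing in for SGD or ADAM. All names and hyperparameters are illustrative.

```python
import math
import random

N_STATES = 3           # hypothetical chain: state index equals the timestep
N_ACTIONS = 2
REWARDS = [1.0, 0.0]   # hypothetical reward per action, same in every state

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def collect_trajectory(theta, rng):
    """Run the current policy for one episode: list of (state, action, reward)."""
    steps = []
    for s in range(N_STATES):
        a = rng.choices(range(N_ACTIONS), weights=softmax(theta[s]))[0]
        steps.append((s, a, REWARDS[a]))
    return steps

def returns_to_go(steps):
    """R_t: sum of rewards from timestep t to the end of the episode."""
    out, running = [], 0.0
    for _, _, r in reversed(steps):
        running += r
        out.append(running)
    return out[::-1]

def vanilla_policy_gradient(iterations=200, batch_size=20, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    baseline = [0.0] * N_STATES
    for _ in range(iterations):
        batch = []  # (state, action, return) for every timestep in the batch
        for _ in range(batch_size):
            steps = collect_trajectory(theta, rng)
            for (s, a, _), ret in zip(steps, returns_to_go(steps)):
                batch.append((s, a, ret))
        # Gradient estimate: grad log pi * advantage, averaged over trajectories.
        grad = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        for s, a, ret in batch:
            advantage = ret - baseline[s]
            pi = softmax(theta[s])
            for k in range(N_ACTIONS):
                grad[s][k] += ((1.0 if k == a else 0.0) - pi[k]) * advantage / batch_size
        # Re-fit the baseline: for a table, the least-squares fit is the mean return.
        for s in range(N_STATES):
            rets = [ret for s2, _, ret in batch if s2 == s]
            baseline[s] = sum(rets) / len(rets)
        # Plain gradient ascent step (stand-in for SGD or ADAM).
        for s in range(N_STATES):
            for k in range(N_ACTIONS):
                theta[s][k] += alpha * grad[s][k]
    return theta, baseline

theta, baseline = vanilla_policy_gradient()
```

After training, the policy in every state should strongly prefer the rewarding action, and the fitted baseline should approximate the expected return-to-go from each state.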
Practical Implementation with Autodiff
- The usual formula $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t$ is inefficient; we want to batch data
- Define a "surrogate" function using data from the current batch:
  $$
  L(\theta) = \sum_t \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t
  $$
- Then the policy gradient estimator is $\hat{g} = \nabla_\theta L(\theta)$
- Can also include the value function fit error:
  $$
  L(\theta) = \sum_t \left( \log \pi(a_t \mid s_t; \theta)\, \hat{A}_t - \| V(s_t) - \hat{R}_t \|^2 \right)
  $$
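A small check that differentiating the surrogate recovers the usual sum of score-function terms. This sketch uses central finite differences as a stand-in for an autodiff library, and a tabular softmax policy; the batch of `(state, action, advantage)` triples and all function names are made up for illustration.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def surrogate(theta, batch):
    """L(theta) = sum_t log pi(a_t | s_t; theta) * A_hat_t, advantages held fixed."""
    return sum(math.log(softmax(theta[s])[a]) * adv for s, a, adv in batch)

def direct_gradient(theta, batch):
    """Sum over the batch of grad log pi(a_t | s_t; theta) * A_hat_t."""
    grad = [[0.0] * len(row) for row in theta]
    for s, a, adv in batch:
        pi = softmax(theta[s])
        for k in range(len(pi)):
            grad[s][k] += ((1.0 if k == a else 0.0) - pi[k]) * adv
    return grad

def numerical_gradient(f, theta, eps=1e-6):
    """Central finite differences; stands in for autodiff in this sketch."""
    grad = [[0.0] * len(row) for row in theta]
    for s in range(len(theta)):
        for k in range(len(theta[s])):
            original = theta[s][k]
            theta[s][k] = original + eps
            up = f(theta)
            theta[s][k] = original - eps
            down = f(theta)
            theta[s][k] = original       # restore the parameter
            grad[s][k] = (up - down) / (2 * eps)
    return grad

theta = [[0.2, -0.4], [1.0, 0.5]]
batch = [(0, 0, 1.5), (0, 1, -0.5), (1, 1, 2.0)]  # hypothetical (state, action, advantage)
g_surrogate = numerical_gradient(lambda th: surrogate(th, batch), theta)
g_direct = direct_gradient(theta, batch)
```

The two gradients agree up to finite-difference error. In practice the surrogate would be written once over the whole batch and handed to an autodiff framework, which is the efficiency point made above.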
Other choices for baseline?
Initialize policy parameter $\theta$, baseline $b$
for iteration = 1, 2, ... do
    Collect a set of trajectories by executing the current policy
    At each timestep in each trajectory, compute the return $R_t = \sum_{t'=t}^{T-1} r_{t'}$ and the advantage estimate $\hat{A}_t = R_t - b(s_t)$
    Re-fit the baseline by minimizing $\| b(s_t) - R_t \|^2$, summed over all trajectories and timesteps
    Update the policy using a policy gradient estimate $\hat{g}$, which is a sum of terms $\nabla_\theta \log \pi(a_t \mid s_t, \theta)\, \hat{A}_t$ (plug $\hat{g}$ into SGD or ADAM)
endfor