SLIDE 1

Lecture 9: Policy Gradient II¹

Emma Brunskill

CS234 Reinforcement Learning

Winter 2020

Additional reading: Sutton and Barto 2018, Chapter 13

¹ With many slides from or derived from David Silver, John Schulman, and Pieter Abbeel

Emma Brunskill (CS234 Reinforcement Learning), Lecture 9: Policy Gradient II, Winter 2020
SLIDE 2

Refresh Your Knowledge 7

Select all that are true about policy gradients:

1. ∇_θ V(θ) = E_{π_θ}[∇_θ log π_θ(s, a) Q^{π_θ}(s, a)]

2. θ is always increased in the direction of ∇_θ ln π(S_t, A_t, θ)

3. State-action pairs with higher estimated Q values will increase in probability on average

4. Policy gradient methods are guaranteed to converge to the global optimum of the policy class

5. Not sure
SLIDE 3

Class Structure

Last time: Policy Search
This time: Policy Search
Next time: Midterm
SLIDE 4

Midterm

Covers material from all lectures before the midterm. To prepare, we encourage you to (1) take past midterms, (2) review the slides and the Refresh / Check Your Understanding exercises, and (3) review the homeworks. We will have office hours this weekend for midterm prep; see the Piazza post for details.
SLIDE 5

Recall: Policy-Based RL

Policy search: directly parametrize the policy, π_θ(s, a) = P[a|s; θ]

Goal: find a policy π with the highest value function V^π

(Pure) policy-based methods: no learned value function, learned policy

Actor-critic methods: learned value function, learned policy
SLIDE 6

Recall: Advantages of Policy-Based RL

Advantages:
  Better convergence properties
  Effective in high-dimensional or continuous action spaces
  Can learn stochastic policies

Disadvantages:
  Typically converges to a local rather than global optimum
  Evaluating a policy is typically inefficient and high variance
SLIDE 7

Recall: Policy Gradient

Defined V(θ) = V^{π_θ}(s_0) to make explicit the dependence of the value on the policy parameters

Assumed episodic MDPs

Policy gradient algorithms search for a local maximum of V(θ) by ascending the gradient of the value with respect to the policy parameters θ:

Δθ = α ∇_θ V(θ)

where ∇_θ V(θ) = (∂V(θ)/∂θ_1, . . . , ∂V(θ)/∂θ_n)^T is the policy gradient and α is a step-size hyperparameter
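As a minimal sketch of the ascent rule Δθ = α ∇_θ V(θ), the snippet below runs gradient ascent on a made-up one-parameter objective standing in for V(θ); the objective, step size, and iteration count are hypothetical choices for illustration, not part of the lecture.

```python
import math

def V(theta):
    # Toy differentiable objective standing in for the (unknown) true value
    # function; hypothetical, chosen so the maximum is at theta = 2.0.
    return -(theta - 2.0) ** 2

def grad_V(theta, eps=1e-5):
    # Central finite-difference estimate of dV/dtheta.
    return (V(theta + eps) - V(theta - eps)) / (2 * eps)

theta, alpha = 0.0, 0.1       # initial parameter and step size
for _ in range(200):          # gradient ascent: theta <- theta + alpha * grad
    theta += alpha * grad_V(theta)

print(round(theta, 3))        # converges near the maximizer 2.0
```

Note this only finds a local maximum in general; that limitation is exactly what motivates the monotonic-improvement machinery later in the lecture.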

SLIDE 8

Desired Properties of a Policy Gradient RL Algorithm

Goal: converge as quickly as possible to a local optimum

We incur reward / cost as we execute the policy, so we want to minimize the number of iterations / time steps until we reach a good policy
SLIDE 9

Desired Properties of a Policy Gradient RL Algorithm

Goal: converge as quickly as possible to a local optimum

We incur reward / cost as we execute the policy, so we want to minimize the number of iterations / time steps until we reach a good policy

During policy search we alternate between evaluating the policy and changing (improving) it, just like in policy iteration. We would like each policy update to be a monotonic improvement.

Gradient descent is only guaranteed to reach a local optimum; monotonic improvement will achieve this. And in the real world, monotonic improvement is often beneficial.
SLIDE 10

Desired Properties of a Policy Gradient RL Algorithm

Goal: obtain large monotonic improvements to the policy at each update

Techniques to try to achieve this:
  Last time and today: get a better estimate of the gradient (intuition: this should improve the policy parameter updates)
  Today: change how we update the policy parameters given the gradient
SLIDE 11

Table of Contents

1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Need for Automatic Step Size Tuning
4. Updating the Parameters Given the Gradient: Local Approximation
5. Updating the Parameters Given the Gradient: Trust Regions
6. Updating the Parameters Given the Gradient: TRPO Algorithm
SLIDE 12

Likelihood Ratio / Score Function Policy Gradient

Recall from last time (m is the number of sampled trajectories):

∇_θ V(s_0, θ) ≈ (1/m) Σ_{i=1}^m R(τ^{(i)}) Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t^{(i)} | s_t^{(i)})

This is an unbiased estimate of the gradient, but it is very noisy. Fixes that can make it practical:
  Temporal structure (discussed last time)
  Baselines
  Alternatives to using the Monte Carlo return R(τ^{(i)}) as the target
SLIDE 13

Policy Gradient: Introduce Baseline

Reduce variance by introducing a baseline b(s):

∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t|s_t; θ) ( Σ_{t′=t}^{T−1} r_{t′} − b(s_t) ) ]

For any choice of b, the gradient estimator is unbiased.

A near-optimal choice is the expected return, b(s_t) ≈ E[r_t + r_{t+1} + · · · + r_{T−1}]

Interpretation: increase the log probability of action a_t proportionally to how much the return Σ_{t′=t}^{T−1} r_{t′} is better than expected.
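The variance reduction can be seen on a toy problem. The sketch below uses a hypothetical two-armed bandit with a uniform softmax policy and deterministic rewards; with the expected return as the baseline, both estimators have the same mean, but the baselined one here has zero variance.

```python
import random

random.seed(0)

# Hypothetical setup: two-armed bandit, uniform policy (pi_0 = 0.5),
# deterministic rewards r(a0) = 1, r(a1) = 0. For a softmax policy,
# d/d theta_0 log pi(a) = 1[a = 0] - pi_0.
pi0, rewards = 0.5, [1.0, 0.0]

def grad_sample(baseline):
    # One Monte Carlo sample of the theta_0 component of the gradient.
    a = 0 if random.random() < pi0 else 1
    score = (1.0 if a == 0 else 0.0) - pi0
    return score * (rewards[a] - baseline)

no_base = [grad_sample(0.0) for _ in range(10000)]
with_base = [grad_sample(0.5) for _ in range(10000)]  # b = expected return

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Both estimators target the same true gradient component (0.25), but the
# baseline shrinks the variance (to zero here, since rewards are deterministic).
print(mean(no_base), mean(with_base), variance(no_base), variance(with_base))
```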

SLIDE 14

Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇_θ log π(a_t|s_t; θ) b(s_t)] = E_{s_{0:t}, a_{0:(t−1)}}[ E_{s_{(t+1):T}, a_{t:(T−1)}}[∇_θ log π(a_t|s_t; θ) b(s_t)] ]
SLIDE 15

Baseline b(s) Does Not Introduce Bias–Derivation

E_τ[∇_θ log π(a_t|s_t; θ) b(s_t)]
  = E_{s_{0:t}, a_{0:(t−1)}}[ E_{s_{(t+1):T}, a_{t:(T−1)}}[∇_θ log π(a_t|s_t; θ) b(s_t)] ]   (break up expectation)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{s_{(t+1):T}, a_{t:(T−1)}}[∇_θ log π(a_t|s_t; θ)] ]   (pull baseline term out)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) E_{a_t}[∇_θ log π(a_t|s_t; θ)] ]   (remove irrelevant variables)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) Σ_a π_θ(a_t|s_t) ( ∇_θ π(a_t|s_t; θ) / π_θ(a_t|s_t) ) ]   (likelihood ratio)
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) Σ_a ∇_θ π(a_t|s_t; θ) ]
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) ∇_θ Σ_a π(a_t|s_t; θ) ]
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) ∇_θ 1 ]
  = E_{s_{0:t}, a_{0:(t−1)}}[ b(s_t) · 0 ]
  = 0
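The crux of the derivation is that E_{a_t}[∇_θ log π(a_t|s_t; θ)] = 0, which can be checked numerically for a softmax policy; the logits below are arbitrary made-up numbers.

```python
import math

# For a softmax policy, d/d theta_k log pi(a) = 1[a = k] - pi_k, so
# E_{a~pi}[score] = sum_a pi(a) (1[a = k] - pi_k) = pi_k - pi_k = 0.
# Any state-dependent baseline therefore multiplies an expectation of zero.
theta = [0.3, -1.2, 0.8]        # hypothetical logits for one state
z = sum(math.exp(t) for t in theta)
pi = [math.exp(t) / z for t in theta]

k = 0                            # check the component for theta_0
expected_score = sum(p * ((1.0 if a == k else 0.0) - pi[k])
                     for a, p in enumerate(pi))
print(expected_score)            # ~0 up to floating point error
```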

SLIDE 16

”Vanilla” Policy Gradient Algorithm

Initialize policy parameters θ and baseline b
for iteration = 1, 2, . . . do
  Collect a set of trajectories by executing the current policy
  At each timestep t in each trajectory τ^i, compute
    the return G_t^i = Σ_{t′=t}^{T−1} r_{t′}^i, and
    the advantage estimate Â_t^i = G_t^i − b(s_t)
  Re-fit the baseline by minimizing Σ_i Σ_t ||b(s_t) − G_t^i||²
  Update the policy using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t|s_t; θ) Â_t (plug ĝ into SGD or ADAM)
end for
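A minimal runnable sketch of this loop, shrunk to a three-armed bandit (one-step episodes) with a softmax policy and a running-mean baseline; the rewards, learning rates, and iteration count are all hypothetical choices.

```python
import math, random

random.seed(1)

# Hypothetical setup: 3-armed bandit (one-step episodes), softmax policy.
true_rewards = [0.2, 0.9, 0.4]   # arm 1 is best
theta = [0.0, 0.0, 0.0]          # policy parameters
baseline = 0.0                   # scalar baseline b
alpha, beta = 0.1, 0.1           # actor / baseline step sizes

def policy(logits):
    z = sum(math.exp(t) for t in logits)
    return [math.exp(t) / z for t in logits]

def sample(probs):
    u, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1

for _ in range(5000):
    pi = policy(theta)
    a = sample(pi)                                 # "collect a trajectory"
    G = true_rewards[a] + random.gauss(0.0, 0.1)   # noisy return G_t
    adv = G - baseline                             # advantage estimate
    baseline += beta * (G - baseline)              # re-fit baseline (running mean)
    for k in range(3):                             # grad log softmax: 1[k = a] - pi_k
        theta[k] += alpha * ((1.0 if k == a else 0.0) - pi[k]) * adv

pi = policy(theta)
print(pi)   # probabilities should now concentrate on arm 1 (reward 0.9)
```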

SLIDE 17

Practical Implementation with Autodifferentiation

The usual formula Σ_t ∇_θ log π(a_t|s_t; θ) Â_t is inefficient; we want to batch the data

Define a "surrogate" function using the data from the current batch:

L(θ) = Σ_t log π(a_t|s_t; θ) Â_t

Then the policy gradient estimate is ĝ = ∇_θ L(θ)

Can also include the value function fit error:

L(θ) = Σ_t ( log π(a_t|s_t; θ) Â_t − ||V(s_t) − Ĝ_t||² )
SLIDE 18

Other Choices for Baseline?

Initialize policy parameters θ and baseline b
for iteration = 1, 2, . . . do
  Collect a set of trajectories by executing the current policy
  At each timestep t in each trajectory τ^i, compute
    the return G_t^i = Σ_{t′=t}^{T−1} r_{t′}^i, and
    the advantage estimate Â_t^i = G_t^i − b(s_t)
  Re-fit the baseline by minimizing Σ_i Σ_t ||b(s_t) − G_t^i||²
  Update the policy using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t|s_t; θ) Â_t (plug ĝ into SGD or ADAM)
end for
SLIDE 19

Choosing the Baseline: Value Functions

Recall the Q-function / state-action value function:

Q^π(s, a) = E_π[r_0 + γ r_1 + γ² r_2 + · · · | s_0 = s, a_0 = a]

The state-value function can serve as a great baseline:

V^π(s) = E_π[r_0 + γ r_1 + γ² r_2 + · · · | s_0 = s] = E_{a∼π}[Q^π(s, a)]

Advantage function: combining Q with baseline V:

A^π(s, a) = Q^π(s, a) − V^π(s)
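A quick numeric check of these definitions: with V(s) = E_{a∼π}[Q^π(s, a)], the advantage has zero mean under the policy. The probabilities and Q values below are made-up numbers.

```python
# Small numeric check: with V(s) = E_{a~pi}[Q(s, a)], the advantage
# A = Q - V averages to zero under the policy. Values are hypothetical.
pi = [0.5, 0.3, 0.2]                    # action probabilities in some state s
Q = [1.0, -2.0, 4.0]                    # Q(s, a) for each action

V = sum(p * q for p, q in zip(pi, Q))   # V(s) = E_{a~pi}[Q(s, a)] = 0.7
A = [q - V for q in Q]                  # advantage per action
print(V, sum(p * a for p, a in zip(pi, A)))   # second value is 0
```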

SLIDE 20

Table of Contents

1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Need for Automatic Step Size Tuning
4. Updating the Parameters Given the Gradient: Local Approximation
5. Updating the Parameters Given the Gradient: Trust Regions
6. Updating the Parameters Given the Gradient: TRPO Algorithm
SLIDE 21

Likelihood Ratio / Score Function Policy Gradient

Recall from last time:

∇_θ V(θ) ≈ (1/m) Σ_{i=1}^m R(τ^{(i)}) Σ_{t=0}^{T−1} ∇_θ log π_θ(a_t^{(i)} | s_t^{(i)})

This is an unbiased estimate of the gradient, but it is very noisy. Fixes that can make it practical:
  Temporal structure (discussed last time)
  Baselines
  Alternatives to using the Monte Carlo return G_t^i as the target
SLIDE 22

Choosing the Target

G_t^i is an estimate of the value function at s_t from a single rollout

It is unbiased but high variance; we can reduce variance by introducing bias, using bootstrapping and function approximation

Just like we saw for TD vs. MC, and for value function approximation

The estimate of V / Q is done by a critic

Actor-critic methods maintain an explicit representation of both the policy and the value function, and update both

A3C (Mnih et al., ICML 2016) is a very popular actor-critic method
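A one-step actor-critic can be sketched in a few lines. The toy MDP below (two states, reward 1 when the action index matches the state index, uniform next state) is hypothetical; the critic is a tabular V updated by TD(0), and the TD error serves as the advantage estimate for the actor.

```python
import math, random

random.seed(2)

# Hypothetical toy MDP: two states, two actions, reward 1 when the action
# index matches the state index, next state uniform at random.
theta = [[0.0, 0.0], [0.0, 0.0]]   # actor logits, one row per state
V = [0.0, 0.0]                     # tabular critic
gamma, a_actor, a_critic = 0.9, 0.1, 0.1

def pi(s):
    z = sum(math.exp(t) for t in theta[s])
    return [math.exp(t) / z for t in theta[s]]

s = 0
for _ in range(5000):
    p = pi(s)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if a == s else 0.0
    s2 = random.randrange(2)
    td_error = r + gamma * V[s2] - V[s]   # one-step TD error ~ advantage
    V[s] += a_critic * td_error           # critic update (TD(0))
    for k in range(2):                    # actor update using the TD error
        theta[s][k] += a_actor * ((1.0 if k == a else 0.0) - p[k]) * td_error
    s = s2

print(pi(0)[0], pi(1)[1])   # both should be close to 1 (matching action preferred)
```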

SLIDE 23

Policy Gradient Formulas with Value Functions

Recall:

∇_θ E_τ[R] = E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t|s_t; θ) ( Σ_{t′=t}^{T−1} r_{t′} − b(s_t) ) ]

Substituting a critic's estimate Q(s_t; w) for the return:

∇_θ E_τ[R] ≈ E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t|s_t; θ) ( Q(s_t; w) − b(s_t) ) ]

Letting the baseline be an estimate of the value V, we can represent the gradient in terms of the state-action advantage function:

∇_θ E_τ[R] ≈ E_τ[ Σ_{t=0}^{T−1} ∇_θ log π(a_t|s_t; θ) Â^π(s_t, a_t) ]
SLIDE 24

Choosing the Target: N-step estimators

∇_θ V(θ) ≈ (1/m) Σ_{i=1}^m Σ_{t=0}^{T−1} R_t^i ∇_θ log π_θ(a_t^{(i)} | s_t^{(i)})

Note that the critic can select any blend between TD and MC estimators for the target to substitute for the true state-action value function.
SLIDE 25

Choosing the Target: N-step estimators

∇_θ V(θ) ≈ (1/m) Σ_{i=1}^m Σ_{t=0}^{T−1} R_t^i ∇_θ log π_θ(a_t^{(i)} | s_t^{(i)})

Note that the critic can select any blend between TD and MC estimators for the target to substitute for the true state-action value function:

R̂_t^{(1)} = r_t + γ V(s_{t+1})
R̂_t^{(2)} = r_t + γ r_{t+1} + γ² V(s_{t+2})
· · ·
R̂_t^{(∞)} = r_t + γ r_{t+1} + γ² r_{t+2} + · · ·

If we subtract baselines from the above, we get advantage estimators:

Â_t^{(1)} = r_t + γ V(s_{t+1}) − V(s_t)
Â_t^{(∞)} = r_t + γ r_{t+1} + γ² r_{t+2} + · · · − V(s_t)

Â_t^{(1)} has low variance and high bias; Â_t^{(∞)} has high variance but low bias.
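The n-step targets above can be computed directly from a trajectory and a fixed critic. The rewards, values, and discount below are made-up numbers for illustration.

```python
# Sketch of the n-step targets, assuming a fixed critic V; all numbers
# are hypothetical. The target truncates at episode end, where it
# coincides with the Monte Carlo return.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]   # r_0 .. r_3 (episode ends after r_3)
V = [0.5, 1.5, 2.5, 1.0, 0.0]    # V(s_0) .. V(s_4); terminal value is 0

def n_step_target(t, n):
    # R_t^(n) = r_t + ... + gamma^{n-1} r_{t+n-1} + gamma^n V(s_{t+n})
    T = len(rewards)
    steps = min(n, T - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(steps))
    if t + steps < T:
        g += gamma ** steps * V[t + steps]   # bootstrap from the critic
    return g

print(n_step_target(0, 1))      # r_0 + gamma * V(s_1) = 1 + 0.9 * 1.5 = 2.35
print(n_step_target(0, 10))     # full Monte Carlo return = 3.349
adv1 = n_step_target(0, 1) - V[0]   # A^(1)_0: low variance, high bias
```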

SLIDE 26

”Vanilla” Policy Gradient Algorithm

Initialize policy parameters θ and baseline b
for iteration = 1, 2, . . . do
  Collect a set of trajectories by executing the current policy
  At each timestep t in each trajectory τ^i, compute
    the target R̂_t^i, and
    the advantage estimate Â_t^i = R̂_t^i − b(s_t)
  Re-fit the baseline by minimizing Σ_i Σ_t ||b(s_t) − R̂_t^i||²
  Update the policy using a policy gradient estimate ĝ, which is a sum of terms ∇_θ log π(a_t|s_t; θ) Â_t (plug ĝ into SGD or ADAM)
end for
SLIDE 27

Table of Contents

1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Need for Automatic Step Size Tuning
4. Updating the Parameters Given the Gradient: Local Approximation
5. Updating the Parameters Given the Gradient: Trust Regions
6. Updating the Parameters Given the Gradient: TRPO Algorithm
SLIDE 28

Policy Gradient and Step Sizes

Goal: each step of policy gradient yields an updated policy π′ whose value is greater than or equal to that of the prior policy π: V^{π′} ≥ V^π

Gradient descent approaches update the weights by a small step in the direction of the gradient

This is a first-order / linear approximation of the value function's dependence on the policy parameterization

Locally it is a good approximation; further away it is less good
SLIDE 29

Why are step sizes a big deal in RL?

Step size is important in any problem involving finding the optimum of a function

Supervised learning: step too far → the next updates will fix it

Reinforcement learning:
  Step too far → bad policy
  The next batch is collected under the bad policy
  The policy is determining data collection! It essentially controls the exploration-exploitation trade-off, through the particular policy parameters and the stochasticity of the policy
  We may not be able to recover from a bad choice: a collapse in performance!
SLIDE 30

Simple step-sizing: Line search in direction of gradient

Simple but expensive (requires evaluations along the line)

Naive: it ignores where the first-order approximation is good or bad
SLIDE 31

Policy Gradient Methods with Auto-Step-Size Selection

Can we automatically ensure that the updated policy π′ has value greater than or equal to the prior policy π, i.e. V^{π′} ≥ V^π?

We consider this for the policy gradient setting, and hope to address it by modifying the step size
SLIDE 32

Objective Function

Goal: find policy parameters that maximize the value function¹

V(θ) = E_{π_θ}[ Σ_{t=0}^∞ γ^t R(s_t, a_t) ; π_θ ]

where s_0 ∼ P(s_0), a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|s_t, a_t)

We have access to samples from the current policy π_θ (parameterized by θ)

We want to predict the value of a different policy (off-policy learning!)

¹ For today we will primarily consider discounted value functions
SLIDE 33

Objective Function

Goal: find policy parameters that maximize the value function¹

V(θ) = E_{π_θ}[ Σ_{t=0}^∞ γ^t R(s_t, a_t) ; π_θ ]

where s_0 ∼ P(s_0), a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|s_t, a_t)

Express the value of π̃ in terms of the advantage over π:

V(θ̃) = V(θ) + E_{π_θ̃}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]          (1)
      = V(θ) + Σ_s µ_π̃(s) Σ_a π̃(a|s) A^π(s, a)                (2)

µ_π̃(s) = E_π̃[ Σ_{t=0}^∞ γ^t I(s_t = s) ]                      (3)

In words, µ_π̃(s) is the discounted weighted frequency of state s under policy π̃ (similar to how we defined a discounted weighted frequency of state features in Lecture 7, Imitation Learning)

¹ For today we will primarily consider discounted value functions
SLIDE 34

Objective Function

Goal: find policy parameters that maximize the value function¹

V(θ) = E_{π_θ}[ Σ_{t=0}^∞ γ^t R(s_t, a_t) ; π_θ ]

where s_0 ∼ µ(s_0), a_t ∼ π(a_t|s_t), s_{t+1} ∼ P(s_{t+1}|s_t, a_t)

Express the expected return of another policy in terms of the advantage over the original policy:

V(θ̃) = V(θ) + E_{π_θ̃}[ Σ_{t=0}^∞ γ^t A^π(s_t, a_t) ]
      = V(θ) + Σ_s µ_π̃(s) Σ_a π̃(a|s) A^π(s, a)

where µ_π̃(s) is defined as the discounted weighted frequency of state s under policy π̃ (similar to the Imitation Learning lecture)

We know the advantage A^π and π̃

But we can't compute the above because we don't know µ_π̃, the state distribution under the new proposed policy

¹ For today we will primarily consider discounted value functions
SLIDE 35

Table of Contents

1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Need for Automatic Step Size Tuning
4. Updating the Parameters Given the Gradient: Local Approximation
5. Updating the Parameters Given the Gradient: Trust Regions
6. Updating the Parameters Given the Gradient: TRPO Algorithm
SLIDE 36

Local approximation

Can we remove the dependency on the discounted visitation frequencies under the new policy?

Substitute in the discounted visitation frequencies under the current policy to define a new objective function:

L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)

Note that L_{π_θ0}(π_θ0) = V(θ_0)

The gradient of L is identical to the gradient of the value function, evaluated at θ_0:

∇_θ L_{π_θ0}(π_θ)|_{θ=θ0} = ∇_θ V(θ)|_{θ=θ0}
SLIDE 37

Conservative Policy Iteration

Is there a bound on the performance of a new policy obtained by optimizing the surrogate objective?

Consider mixture policies that blend between an old policy and a different policy: π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)

In this case we can guarantee a lower bound on the value of the new policy π_new:

V^{π_new} ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²

where ε = max_s |E_{a∼π′(·|s)}[A^π(s, a)]|
SLIDE 38

Check Your Understanding: Conservative Policy Iteration

Is there a bound on the performance of a new policy obtained by optimizing the surrogate objective?

Consider mixture policies that blend between an old policy and a different policy: π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)

In this case we can guarantee a lower bound on the value of the new policy π_new:

V^{π_new} ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²

where ε = max_s |E_{a∼π′(·|s)}[A^π(s, a)]|

What can we say about this lower bound? (Select all)

1. It is tight if π_new = π_old
2. It is most loose if α = 1
3. It is most tight if α = 1
4. It is most tight if α = 0
5. Not sure
SLIDE 39

Conservative Policy Iteration

Is there a bound on the performance of a new policy obtained by optimizing the surrogate objective?

Consider mixture policies that blend between an old policy and a different policy: π_new(a|s) = (1 − α) π_old(a|s) + α π′(a|s)

In this case we can guarantee a lower bound on the value of the new policy π_new:

V^{π_new} ≥ L_{π_old}(π_new) − (2εγ / (1 − γ)²) α²

where ε = max_s |E_{a∼π′(·|s)}[A^π(s, a)]|

Can we remove the dependency on the discounted visitation frequencies under the new policy?
SLIDE 40

Find the Lower-Bound in General Stochastic Policies

We would like to similarly obtain a lower bound on the potential performance for general stochastic policies (not just mixture policies)

Recall L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)

Theorem. Let D_TV^max(π_1, π_2) = max_s D_TV(π_1(·|s), π_2(·|s)). Then

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) (D_TV^max(π_old, π_new))²

where ε = max_{s,a} |A^π(s, a)|.
SLIDE 41

Find the Lower-Bound in General Stochastic Policies

We would like to similarly obtain a lower bound on the potential performance for general stochastic policies (not just mixture policies)

Recall L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)

Theorem. Let D_TV^max(π_1, π_2) = max_s D_TV(π_1(·|s), π_2(·|s)). Then

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) (D_TV^max(π_old, π_new))²

where ε = max_{s,a} |A^π(s, a)|.

Note that D_TV(p, q)² ≤ D_KL(p, q) for probability distributions p and q. The theorem above then immediately implies

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) D_KL^max(π_old, π_new)

where D_KL^max(π_1, π_2) = max_s D_KL(π_1(·|s), π_2(·|s))
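The inequality D_TV(p, q)² ≤ D_KL(p, q) can be spot-checked numerically. Below we use the convention D_TV(p, q) = ½ Σ_i |p_i − q_i|, under which Pinsker's inequality gives D_TV² ≤ D_KL/2 ≤ D_KL; the two distributions are arbitrary made-up examples.

```python
import math

# Numeric spot-check of D_TV(p, q)^2 <= D_KL(p, q) on two
# made-up categorical distributions.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]

d_tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))        # total variation
d_kl = sum(a * math.log(a / b) for a, b in zip(p, q))     # KL divergence

print(d_tv ** 2 <= d_kl)   # True
```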

SLIDE 42

Guaranteed Improvement1

The goal is to compute a policy that maximizes the objective function defining the lower bound

¹ L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)
SLIDE 43

Guaranteed Improvement1

The goal is to compute a policy that maximizes the objective function defining the lower bound:

M_i(π) = L_{π_i}(π) − (4εγ / (1 − γ)²) D_KL^max(π_i, π)

V^{π_{i+1}} ≥ L_{π_i}(π_{i+1}) − (4εγ / (1 − γ)²) D_KL^max(π_i, π_{i+1}) = M_i(π_{i+1})

V^{π_i} = M_i(π_i)

V^{π_{i+1}} − V^{π_i} ≥ M_i(π_{i+1}) − M_i(π_i)

So as long as the new policy π_{i+1} is equal to or an improvement over the old policy π_i with respect to the lower bound, we are guaranteed to monotonically improve!

The above is a type of Minorization-Maximization (MM) algorithm

¹ L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)
SLIDE 44

Guaranteed Improvement1

V^{π_new} ≥ L_{π_old}(π_new) − (4εγ / (1 − γ)²) D_KL^max(π_old, π_new)

Figure: Source: John Schulman, Deep Reinforcement Learning, 2014

¹ L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)
SLIDE 45

Table of Contents

1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Need for Automatic Step Size Tuning
4. Updating the Parameters Given the Gradient: Local Approximation
5. Updating the Parameters Given the Gradient: Trust Regions
6. Updating the Parameters Given the Gradient: TRPO Algorithm
SLIDE 46

Optimization of Parameterized Policies1

The goal is to optimize

max_θ L_{θ_old}(θ_new) − (4εγ / (1 − γ)²) D_KL^max(θ_old, θ_new) = L_{θ_old}(θ_new) − C · D_KL^max(θ_old, θ_new)

where C is the penalty coefficient

In practice, if we used the penalty coefficient recommended by the theory above, C = 4εγ / (1 − γ)², the step sizes would be very small

New idea: use a trust region constraint on step sizes, by imposing a constraint on the KL divergence between the new and old policy:

max_θ L_{θ_old}(θ)  subject to  D_KL^{s∼µ_{θ_old}}(θ_old, θ) ≤ δ

This uses the average KL instead of the max (the max requires that the KL be bounded at all states and yields an impractical number of constraints)

¹ L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)
SLIDE 47

From Theory to Practice

Prior objective:

max_θ L_{θ_old}(θ)  subject to  D_KL^{s∼µ_{θ_old}}(θ_old, θ) ≤ δ

where L_π(π̃) = V(θ) + Σ_s µ_π(s) Σ_a π̃(a|s) A^π(s, a)

We don't know the visitation weights nor the true advantage function, so we make a series of substitutions. First:

Σ_s µ_π(s) → (1 / (1 − γ)) E_{s∼µ_{θ_old}}[. . .]
SLIDE 48

From Theory to Practice

Next substitution:

Σ_a π_θ(a|s_n) A_{θ_old}(s_n, a) → E_{a∼q}[ (π_θ(a|s_n) / q(a|s_n)) A_{θ_old}(s_n, a) ]

where q is some sampling distribution over the actions and s_n is a particular sampled state. This second substitution uses importance sampling to estimate the desired sum, enabling the use of an alternative sampling distribution q (other than the new policy π_θ).

Third substitution: A_{θ_old} → Q_{θ_old}

Note that the above substitutions do not change the solution to the optimization problem
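Because the action space is discrete, the importance-sampling substitution can be verified exactly by enumeration; the policy, sampling distribution, and advantages below are made-up numbers.

```python
# Exact check of the importance-sampling substitution: the reweighted
# expectation under q equals the expectation under pi. All values are
# hypothetical illustration numbers.
pi = [0.7, 0.2, 0.1]        # target policy pi_theta(a | s_n)
q = [1 / 3, 1 / 3, 1 / 3]   # sampling distribution q(a | s_n)
A = [0.5, -1.0, 2.0]        # A_{theta_old}(s_n, a)

direct = sum(p * a for p, a in zip(pi, A))
reweighted = sum(qa * (pa / qa) * a for pa, qa, a in zip(pi, q, A))

print(direct, reweighted)   # identical: 0.35 and 0.35
```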

SLIDE 49

Selecting the Sampling Policy

Optimize

max_θ E_{s∼µ_{θ_old}, a∼q}[ (π_θ(a|s) / q(a|s)) Q_{θ_old}(s, a) ]

subject to E_{s∼µ_{θ_old}}[ D_KL(π_{θ_old}(·|s), π_θ(·|s)) ] ≤ δ

Standard approach: the sampling distribution q(a|s) is simply π_old(a|s)

For the vine procedure, see the paper

Figure: Trust Region Policy Optimization, Schulman et al., 2015
SLIDE 50

Searching for the Next Parameter

Use a linear approximation to the objective function and a quadratic approximation to the constraint

This is a constrained optimization problem; use conjugate gradient descent
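With a linearized objective gᵀs and quadratic constraint ½ sᵀFs ≤ δ, the optimal step is along F⁻¹g, scaled so the quadratic KL estimate hits δ. The sketch below uses a hypothetical diagonal Fisher matrix and made-up numbers; a real implementation would use Fisher-vector products instead of an explicit F.

```python
import math

# Maximize g^T s subject to (1/2) s^T F s <= delta: step along F^{-1} g,
# scaled so the quadratic KL estimate equals delta. F is diagonal here
# for simplicity; all numbers are hypothetical.
g = [0.5, -1.0, 0.25]          # policy gradient
F_diag = [2.0, 4.0, 1.0]       # diagonal Fisher information matrix
delta = 0.01                   # trust-region radius (KL bound)

direction = [gi / fi for gi, fi in zip(g, F_diag)]   # F^{-1} g
gFg = sum(gi * di for gi, di in zip(g, direction))   # g^T F^{-1} g
beta = math.sqrt(2 * delta / gFg)                    # step length
step = [beta * di for di in direction]

kl_quad = 0.5 * sum(fi * si * si for fi, si in zip(F_diag, step))
print(round(kl_quad, 10))      # equals delta = 0.01 by construction
```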

SLIDE 51

Table of Contents

1. Better Gradient Estimates
2. Policy Gradient Algorithms and Reducing Variance
3. Need for Automatic Step Size Tuning
4. Updating the Parameters Given the Gradient: Local Approximation
5. Updating the Parameters Given the Gradient: Trust Regions
6. Updating the Parameters Given the Gradient: TRPO Algorithm
SLIDE 52

Practical Algorithm: TRPO

1: for iteration = 1, 2, . . . do
2:   Run the policy for T timesteps or N trajectories
3:   Estimate the advantage function at all timesteps
4:   Compute the policy gradient g
5:   Use CG (with Hessian-vector products) to compute F⁻¹g, where F is the Fisher information matrix
6:   Do a line search on the surrogate loss and the KL constraint
7: end for
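Step 5 only needs products Fv, never F itself. Below is a minimal conjugate-gradient solver written against a matrix-vector product; the small SPD matrix is a stand-in for a real Fisher Hessian-vector product routine, and all numbers are hypothetical.

```python
# Solve F x = g using only matrix-vector products with F, as in TRPO's
# step 5 (there, mat_vec would be a Fisher Hessian-vector product).
F = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]          # small symmetric positive definite stand-in
g = [1.0, 2.0, 3.0]

def mat_vec(v):
    return [sum(Fij * vj for Fij, vj in zip(row, v)) for row in F]

def conjugate_gradient(mv, b, iters=10, tol=1e-10):
    x = [0.0] * len(b)
    r = list(b)                # residual b - A x (x = 0 initially)
    p = list(b)                # search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(iters):
        Ap = mv(p)
        alpha = rs / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

x = conjugate_gradient(mat_vec, g)
residual = [bi - fxi for bi, fxi in zip(g, mat_vec(x))]
print(max(abs(ri) for ri in residual))   # essentially zero
```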

SLIDE 53

Practical Algorithm: TRPO

Applied to locomotion controllers in 2D

Figure: Trust Region Policy Optimization, Schulman et al., 2015

Also applied to Atari games with pixel input
SLIDE 54

TRPO Results

Figure: Trust Region Policy Optimization, Schulman et al, 2015

SLIDE 55

TRPO Results

Figure: Trust Region Policy Optimization, Schulman et al, 2015

SLIDE 56

TRPO Summary

A policy gradient approach

Uses a surrogate optimization function

Automatically constrains the weight update to a trusted region, to approximate where the first-order approximation is valid

Empirically, it consistently does well

Very influential: over 350 citations since it was introduced a few years ago
SLIDE 57

Common Template of Policy Gradient Algorithms

1: for iteration = 1, 2, . . . do
2:   Run the policy for T timesteps or N trajectories
3:   At each timestep in each trajectory, compute the target Q^π(s_t, a_t) and the baseline b(s_t)
4:   Compute the estimated policy gradient ĝ
5:   Update the policy using ĝ, potentially constrained to a local region
6: end for
SLIDE 58

Policy Gradient Summary

Policy gradient methods are an extremely popular and useful set of approaches

You can incorporate prior knowledge by choosing the policy parameterization

You should be very familiar with REINFORCE and the policy gradient template on the prior slide

Understand where different estimators can be slotted in (and the implications for bias and variance)

You don't have to be able to derive or remember the specific formulas in TRPO for approximating the objectives and constraints

You will have the opportunity to practice with these ideas in homework 3
SLIDE 59

Class Structure

Last time: Policy Search
This time: Policy Search
Next time: Midterm