CS234 Notes - Lecture 9 Advanced Policy Gradient

Patrick Cho, Emma Brunskill
February 11, 2019

1 Policy Gradient Objective

Recall that in Policy Gradient, we parameterize the policy $\pi_\theta$ and directly optimize it using experience in the environment. We first define the probability of a trajectory $\tau$ under our current policy $\pi_\theta$, which we denote $\pi_\theta(\tau)$:

$$\pi_\theta(\tau) = \pi_\theta(s_1, a_1, \ldots, s_T, a_T) = P(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)$$

Parsing the expression above, $P(s_1)$ is the probability of starting at state $s_1$, $\pi_\theta(a_t \mid s_t)$ is the probability of our current policy selecting action $a_t$ given that we are in state $s_t$, and $P(s_{t+1} \mid s_t, a_t)$ is the probability of the environment's dynamics transitioning us to state $s_{t+1}$ given that we start at $s_t$ and take action $a_t$. Note that we overload the notation $\pi_\theta$ here to mean either the probability of a trajectory ($\pi_\theta(\tau)$) or the probability of an action given a state ($\pi_\theta(a \mid s)$). The goal of Policy Gradient, like most other RL objectives we have discussed thus far, is to maximize the discounted sum of rewards:

$$\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right]$$

We denote our objective function by $J(\theta)$, which can be estimated using Monte Carlo. We also use $r(\tau)$ to denote the discounted sum of rewards over trajectory $\tau$:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_t \gamma^t r(s_t, a_t)\right] = \int \pi_\theta(\tau) r(\tau) \, d\tau \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \gamma^t r(s_{i,t}, a_{i,t})$$

$$\theta^* = \arg\max_\theta J(\theta)$$

We define $P_\theta(s, a)$ to be the probability of seeing the pair $(s, a)$ in our trajectory. Note that in the infinite-horizon case, where a stationary distribution of states exists, we can write $P_\theta(s, a) = d^{\pi_\theta}(s)\pi_\theta(a \mid s)$, where $d^{\pi_\theta}(s)$ is the stationary state distribution under policy $\pi_\theta$. In the infinite-horizon case, we have

$$\theta^* = \arg\max_\theta \sum_{t=1}^{\infty} \mathbb{E}_{(s,a) \sim P_\theta(s,a)}[\gamma^t r(s, a)] = \arg\max_\theta \frac{1}{1 - \gamma} \mathbb{E}_{(s,a) \sim P_\theta(s,a)}[r(s, a)] = \arg\max_\theta \mathbb{E}_{(s,a) \sim P_\theta(s,a)}[r(s, a)]$$


In the finite-horizon case, we have

$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim P_\theta(s_t, a_t)}[\gamma^t r(s_t, a_t)]$$

We can use gradient-based methods to perform the above optimization. In particular, we need the gradient of $J(\theta)$ with respect to $\theta$:

$$\nabla_\theta J(\theta) = \nabla_\theta \int \pi_\theta(\tau) r(\tau) \, d\tau = \int \nabla_\theta \pi_\theta(\tau) r(\tau) \, d\tau = \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} r(\tau) \, d\tau = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) r(\tau)]$$

As seen above, we have moved the gradient from outside the expectation to inside the expectation. This is commonly known as the log derivative trick. The advantage of doing so is that we no longer need to take a gradient through the dynamics function, as seen below:

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) r(\tau)] \\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \left(\log P(s_1) + \sum_{t=1}^{T} \left(\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t)\right)\right) r(\tau)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \left(\sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)\right) r(\tau)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} \gamma^t r(s_t, a_t)\right)\right] \\
&\approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right)\left(\sum_{t=1}^{T} \gamma^t r(s_{i,t}, a_{i,t})\right)
\end{aligned}$$

In the third equality, the $\log P(s_1)$ and $\log P(s_{t+1} \mid s_t, a_t)$ terms drop out because they do not involve $\theta$. In the last step, we use Monte Carlo estimates from rollout trajectories. Note that there are many similarities between the above formulation and Maximum Likelihood Estimation (MLE) in the supervised learning setting. For MLE in supervised learning, we have the likelihood $J'(\theta)$ and log-likelihood $J(\theta)$:

$$J'(\theta) = \prod_{i=1}^{N} P(y_i \mid x_i), \qquad J(\theta) = \log J'(\theta) = \sum_{i=1}^{N} \log P(y_i \mid x_i), \qquad \nabla_\theta J(\theta) = \sum_{i=1}^{N} \nabla_\theta \log P(y_i \mid x_i)$$

Comparing with the Policy Gradient derivation, the key difference is the sum of rewards. We can even view MLE as policy gradient with a return of 1 for all examples. Although this difference may seem minor, it can make the problem much harder. In particular, the summation of rewards drastically increases variance. Hence, in the next section, we discuss two methods to reduce variance.
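The Monte Carlo estimator above can be sketched in NumPy for a tabular softmax policy. The environment interface and trajectory format here are illustrative assumptions, and we index timesteps from $t = 0$ for convenience:

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    """∇_θ log π_θ(a|s) for a tabular softmax policy with logits theta[s, :]."""
    probs = np.exp(theta[s] - theta[s].max())
    probs /= probs.sum()
    g = np.zeros_like(theta)
    g[s] = -probs          # -π_θ(·|s) for every action in state s
    g[s, a] += 1.0         # +1 for the action actually taken
    return g

def policy_gradient_estimate(theta, trajectories, gamma=0.99):
    """(1/N) Σ_i (Σ_t ∇_θ log π_θ(a_t|s_t)) (Σ_t γ^t r_t)."""
    grad = np.zeros_like(theta)
    for states, actions, rewards in trajectories:
        score = sum(grad_log_softmax(theta, s, a) for s, a in zip(states, actions))
        ret = sum((gamma ** t) * r for t, r in enumerate(rewards))
        grad += score * ret
    return grad / len(trajectories)
```

Each trajectory is a `(states, actions, rewards)` tuple of equal-length lists, standing in for a rollout collected under $\pi_\theta$.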


2 Reducing Variance in Policy Gradient

2.1 Causality

We first note that the action taken at time $t'$ cannot affect the reward at time $t$ for all $t < t'$. This is known as causality, since what we do now cannot affect the past. Hence, we can replace the full summation of rewards, $\sum_{t=1}^{T} \gamma^t r(s_{i,t}, a_{i,t})$, with the reward-to-go, $\hat{Q}_{i,t} = \sum_{t'=t}^{T} \gamma^{t'} r(s_{i,t'}, a_{i,t'})$. We use $\hat{Q}$ here to denote that this is a Monte Carlo estimate of $Q$. Doing so helps to reduce variance, since we effectively remove the noise from rewards earned before time $t$. In particular, our gradient estimate becomes:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \sum_{t'=t}^{T} \gamma^{t'} r(s_{i,t'}, a_{i,t'}) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}$$
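The reward-to-go $\hat{Q}_{i,t}$ can be computed for a whole trajectory in one backward pass, following the notes' convention that the discount $\gamma^{t'}$ is indexed from the start of the trajectory (a sketch, indexing from $t = 0$):

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Q̂_t = Σ_{t'=t}^{T} γ^{t'} r_{t'}, computed right-to-left in O(T)."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = (gamma ** t) * rewards[t] + running
        q[t] = running
    return q
```

For example, `reward_to_go([1.0, 1.0, 1.0], gamma=0.5)` returns the per-timestep tails of the discounted return rather than the single trajectory-level sum.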

  • 2.2

Baselines

Now, we consider subtracting a baseline from the reward-to-go. That is, we change our gradient estimate into the following form:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \left(\sum_{t'=t}^{T} \gamma^{t'} r(s_{i,t'}, a_{i,t'}) - b\right)$$

We first note that subtracting a constant baseline $b$ keeps the estimator unbiased. That is, under the expectation over trajectories from our current policy $\pi_\theta$, the term we have just introduced is 0:

$$\mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau) b] = \int \pi_\theta(\tau) \nabla_\theta \log \pi_\theta(\tau) b \, d\tau = \int \pi_\theta(\tau) \frac{\nabla_\theta \pi_\theta(\tau)}{\pi_\theta(\tau)} b \, d\tau = \int \nabla_\theta \pi_\theta(\tau) b \, d\tau = b \nabla_\theta \int \pi_\theta(\tau) \, d\tau = b \nabla_\theta 1 = 0$$

In the last equality, the integral of the probability of a trajectory over all trajectories is 1. In the second-to-last equality, we are able to take $b$ out of the integral since $b$ is a constant (e.g. the average return, $b = \frac{1}{N} \sum_{i=1}^{N} r(\tau_i)$). We can also show that the estimator remains unbiased if $b$ is a function of the state $s$:

$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) b(s_t)] &= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[\mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi_\theta(a_t \mid s_t) b(s_t)]\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \, \mathbb{E}_{s_{(t+1):T}, a_{t:(T-1)}}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}\left[b(s_t) \, \mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)]\right] \\
&= \mathbb{E}_{s_{0:t}, a_{0:(t-1)}}[b(s_t) \cdot 0] = 0
\end{aligned}$$

As seen above, if no assumptions on the policy are made, the baseline cannot be a function of actions, since the proof depends on being able to factor out $b(s_t)$. Exceptions exist if we make some assumptions; see [3] for an example of action-dependent baselines.


One common baseline is the value function $V^{\pi_\theta}(s)$. Since the reward-to-go estimates the state-action value function $Q^{\pi_\theta}(s, a)$, by subtracting this baseline from $\hat{Q}$ we are essentially computing the advantage, $A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s)$. In terms of implementation, this means training a separate value function $V_\phi(s)$. As a side note, instead of using actual returns from the environment to estimate $Q^{\pi_\theta}(s, a)$, we can train another state-action value function $Q_w(s, a)$ to approximate the policy gradient. This approach is known as actor-critic, where the $Q_w$ function is the critic. Essentially, the critic does policy evaluation and the actor does policy improvement.

One can ask: what is the optimal baseline to subtract in order to minimize variance? The optimal baseline is in fact the expected reward weighted by the squared gradient, as shown below. Recall that $\mathrm{Var}[X] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$ and $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[\nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b)]$, so

$$\mathrm{Var} = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[(\nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b))^2\right] - \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b)\right]^2 = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[(\nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b))^2\right] - \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\nabla_\theta \log \pi_\theta(\tau) r(\tau)\right]^2$$

In the equation above, we are able to drop $b$ in the second term, since we have proven that the baseline term is 0 in expectation. To minimize variance, we set the derivative with respect to $b$ to 0. The second term does not depend on $b$ and therefore disappears:

$$\frac{d \mathrm{Var}}{db} = \frac{d}{db} \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[(\nabla_\theta \log \pi_\theta(\tau)(r(\tau) - b))^2\right] = \frac{d}{db}\left(-2\mathbb{E}[(\nabla_\theta \log \pi_\theta(\tau))^2 r(\tau) b] + \mathbb{E}[(\nabla_\theta \log \pi_\theta(\tau))^2 b^2]\right) = -2\mathbb{E}[(\nabla_\theta \log \pi_\theta(\tau))^2 r(\tau)] + 2\mathbb{E}[(\nabla_\theta \log \pi_\theta(\tau))^2 b] = 0$$

$$b = \frac{\mathbb{E}[(\nabla_\theta \log \pi_\theta(\tau))^2 r(\tau)]}{\mathbb{E}[(\nabla_\theta \log \pi_\theta(\tau))^2]}$$
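The optimal baseline can be estimated from rollout statistics. A minimal sketch with a scalar parameter $\theta$ (so the squared gradient is a scalar) and invented sample data: $b^*$ exactly minimizes the $b$-dependent second-moment term of the variance, since that term is a quadratic in $b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-trajectory statistics: score g_i = ∇_θ log π_θ(τ_i)
# (scalar θ for illustration) and return r_i.
g = rng.normal(size=1000)
r = 1.0 + rng.normal(size=1000)

# Optimal constant baseline from the derivation: b* = E[g² r] / E[g²].
b_star = np.mean(g**2 * r) / np.mean(g**2)

def second_moment(b):
    """E[(g (r − b))²], the b-dependent term of the variance."""
    return np.mean((g * (r - b)) ** 2)
```

Comparing `second_moment(b_star)` against `second_moment(0.0)` or `second_moment(r.mean())` shows that the weighted baseline does at least as well as the naive choices.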

3 Off Policy Policy Gradient

In the analysis above, our objective involves taking an expectation over trajectories drawn from $\pi_\theta(\tau)$. This means that Policy Gradient with the above objective is an on-policy algorithm: whenever we change our parameters $\theta$, our policy changes, and all our old trajectories can no longer be reused. Compare this with DQN, which is able to store prior experience for reuse since it is an off-policy algorithm. Hence, Policy Gradient is in general less sample efficient than Q-learning. To resolve this, we discuss the use of Importance Sampling to derive an Off Policy Policy Gradient algorithm. In particular, we consider the changes needed if we estimate $J(\theta')$ using trajectories drawn from a prior policy $\pi_\theta$ instead of the current policy $\pi_{\theta'}$.


$$\begin{aligned}
\theta^* = \arg\max_{\theta'} J(\theta') &= \arg\max_{\theta'} \mathbb{E}_{\tau \sim \pi_{\theta'}(\tau)}[r(\tau)] \\
&= \arg\max_{\theta'} \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} r(\tau)\right] \\
&= \arg\max_{\theta'} \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{P(s_1) \prod_{t=1}^{T} \pi_{\theta'}(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)}{P(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t) P(s_{t+1} \mid s_t, a_t)} r(\tau)\right] \\
&= \arg\max_{\theta'} \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\prod_{t=1}^{T} \pi_{\theta'}(a_t \mid s_t)}{\prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)} r(\tau)\right]
\end{aligned}$$

Hence, for the old parameters $\theta$ we have $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}[r(\tau)]$, and for the new parameters $\theta'$:

$$J(\theta') = \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} r(\tau)\right]$$

$$\begin{aligned}
\nabla_{\theta'} J(\theta') &= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\nabla_{\theta'} \pi_{\theta'}(\tau)}{\pi_\theta(\tau)} r(\tau)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\frac{\pi_{\theta'}(\tau)}{\pi_\theta(\tau)} \nabla_{\theta'} \log \pi_{\theta'}(\tau) r(\tau)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\left(\prod_{t=1}^{T} \frac{\pi_{\theta'}(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)}\right)\left(\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t)\right)\left(\sum_{t=1}^{T} \gamma^t r(s_t, a_t)\right)\right] \\
&= \mathbb{E}_{\tau \sim \pi_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_{\theta'} \log \pi_{\theta'}(a_t \mid s_t) \left(\prod_{t'=1}^{t} \frac{\pi_{\theta'}(a_{t'} \mid s_{t'})}{\pi_\theta(a_{t'} \mid s_{t'})}\right)\left(\sum_{t'=t}^{T} \gamma^{t'} r(s_{t'}, a_{t'})\right)\right]
\end{aligned}$$

In the last equality, we invoke causality. In particular, the probability of the first $k$ transitions depends only on the first $k$ actions and not on future actions.
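The per-timestep importance weights $\prod_{t'=1}^{t} \pi_{\theta'}(a_{t'} \mid s_{t'})/\pi_\theta(a_{t'} \mid s_{t'})$ reduce to a cumulative product, which is best computed in log space for numerical stability (a sketch; the per-step log-probability arrays are assumed inputs):

```python
import numpy as np

def per_step_importance_weights(logp_new, logp_old):
    """ρ_t = Π_{t'=1..t} π_θ'(a_t'|s_t') / π_θ(a_t'|s_t'),
    from per-step log-probabilities under the new and old policies."""
    return np.exp(np.cumsum(logp_new - logp_old))
```

Note that these products can explode or vanish as $t$ grows, which is one motivation for the surrogate objectives discussed in the following sections.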

4 Relative Policy Performance Identity

One issue with directly taking gradient steps along the gradient of the objective $J(\theta)$ with respect to the parameters $\theta$ is that moving in parameter space is not the same as moving in policy space. This creates a problem in the choice of step size: small step sizes make learning slow, but large step sizes may make the policy bad. In supervised learning this is usually fine, since subsequent updates will usually fix the problem. In the context of reinforcement learning, however, a bad policy causes the next batch of data to be collected under that bad policy, so stepping to a bad policy may cause a collapse in performance from which the algorithm cannot recover. A simple line search in the direction of the gradient could mitigate this issue: for example, we could try multiple learning rates for each update and choose the one that gives the best performance. However, doing so is naive and will converge slowly in cases where the first-order approximation (the gradient) is bad. Trust Region Policy Optimization, discussed in the next section, is an algorithm that tries to resolve this issue. Building towards it, we first derive an identity for the relative policy performance, $J(\pi') - J(\pi)$. Here we use the following notation: $J(\pi') = J(\theta')$, $J(\pi) = J(\theta)$, $\pi' = \pi_{\theta'}$, and $\pi = \pi_\theta$.

Lemma 4.1.
$$J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right]$$

Proof.
$$\begin{aligned}
\mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right] &= \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t \left[r(s_t, a_t) + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right]\right] \\
&= \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right] + \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t \left[\gamma V^\pi(s_{t+1}) - V^\pi(s_t)\right]\right] \\
&= J(\pi') - \mathbb{E}_{\tau \sim \pi'}[V^\pi(s_0)] \\
&= J(\pi') - J(\pi)
\end{aligned}$$

In the third equality, the second sum telescopes, leaving only $-V^\pi(s_0)$. Hence, we have

$$\max_{\pi'} J(\pi') = \max_{\pi'} J(\pi') - J(\pi) = \max_{\pi'} \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right]$$
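Lemma 4.1 can be checked exactly on a small random MDP, computing both sides in closed form; the sizes, reward shape $r(s, a)$, and variable names below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9

# A small random MDP: P[s, a, s'] dynamics, R[s, a] rewards, start dist mu.
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
mu = np.full(S, 1.0 / S)

def value(pi):
    """Exact V^π by solving (I − γ P_π) V = R_π."""
    P_pi = np.einsum("sap,sa->sp", P, pi)
    R_pi = np.einsum("sa,sa->s", R, pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def J(pi):
    return mu @ value(pi)

def advantage(pi):
    V = value(pi)
    return R + gamma * P @ V - V[:, None]      # A^π(s, a) = Q^π − V^π

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)       # π
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)    # π'

# Unnormalized discounted visitation under π': d(s) = Σ_t γ^t Pr(s_t = s).
P_pi2 = np.einsum("sap,sa->sp", P, pi2)
d = np.linalg.solve(np.eye(S) - gamma * P_pi2.T, mu)

# E_{τ∼π'}[Σ_t γ^t A^π(s_t, a_t)], which the lemma says equals J(π') − J(π).
lhs = np.einsum("s,sa,sa->", d, pi2, advantage(pi))
```

The identity holds to machine precision for any pair of policies, not just nearby ones.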

The issue with the above expression is that we require trajectories from $\pi'$. This makes the optimization impossible as stated, since we have not yet found $\pi'$ but need to draw samples from it. Once again, we use likelihood ratios to circumvent this issue:

$$J(\pi') - J(\pi) = \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right] = \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi'}}[A^\pi(s, a)] = \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi}}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)\right] \approx \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s \sim d^{\pi} \\ a \sim \pi}}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)\right] = \frac{1}{1 - \gamma} L_\pi(\pi')$$

We call $L_\pi(\pi')$ the surrogate objective. One key question is when we can make the above approximation. Clearly, when $\pi = \pi'$, the approximation holds with equality; however, this alone is not useful, since we want to improve our current policy $\pi$ to a better policy $\pi'$. In the derivation of Trust Region Policy Optimization (TRPO) below, we give bounds on the approximation error.
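The surrogate objective is convenient precisely because it is an expectation over data collected under the current policy $\pi$, so it can be estimated from stored log-probabilities and advantage estimates. A minimal sketch (the input arrays are assumed rollout statistics):

```python
import numpy as np

def surrogate_objective(logp_new, logp_old, advantages):
    """Sample estimate of L_π(π') = E_{s∼dπ, a∼π}[(π'(a|s)/π(a|s)) A^π(s,a)]
    from per-sample log-probs under both policies and advantage estimates."""
    ratios = np.exp(logp_new - logp_old)
    return np.mean(ratios * advantages)
```

When `logp_new == logp_old`, the ratios are all 1 and the estimate reduces to the mean advantage, which is 0 in expectation under $\pi$, consistent with $L_\pi(\pi) = 0$.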


5 Trust Region Policy Optimization

The key idea in TRPO [5] is to define a trust region that constrains updates to the policy. This constraint is in the policy space rather than in the parameter space, and it becomes the new "step size" of the algorithm. In this way, we can approximately ensure that the new policy after each update performs better than the old policy.

5.1 Problem Setup

Consider a finite state and action MDP $M = (S, A, M, R, \gamma)$, where $M$ is the transition function. In this section, we assume that $|S|$ and $|A|$ are both finite and that $0 < \gamma < 1$. Although the derivation is for finite states and actions, the algorithm works for continuous states and actions as well. We define

$$d^\pi(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t M(s_t = s \mid \pi) \tag{1}$$

to be the discounted state visitation distribution when following policy $\pi$ under dynamics $M$, and

$$V^\pi = \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[R(s, a, s')] \tag{2}$$

to be the discounted expected sum of rewards when following policy $\pi$ with transition dynamics $M$. Note that $V^\pi$ here has the same definition as $J(\theta)$ from the previous sections.

Let $\rho^t_\pi \in \mathbb{R}^{|S|}$ where $\rho^t_\pi(s) = M(s_t = s \mid \pi)$. This is the probability of being in state $s$ at timestep $t$ when following policy $\pi$ under dynamics $M$. Let $P_\pi \in \mathbb{R}^{|S| \times |S|}$ where $P_\pi(s' \mid s) = \sum_a M(s' \mid s, a)\pi(a \mid s)$. This is the probability of transitioning from state $s$ to next state $s'$ in one step when actions are taken according to policy $\pi$ under dynamics $M$. Let $\mu$ be the starting state distribution of $M$. Then we have:

$$d^\pi = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \rho^t_\pi = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t P_\pi^t \mu = (1 - \gamma)(I - \gamma P_\pi)^{-1} \mu \tag{3}$$

where the second equality holds because $\rho^t_\pi = P_\pi \rho^{t-1}_\pi$ and the third follows from the geometric series. The goal of our proof is to give a lower bound on $V^{\pi'} - V^\pi$. We start with a lemma on reward shaping.

Lemma 5.1. For any function $f : S \to \mathbb{R}$ and any policy $\pi$, we have:

$$(1 - \gamma) \mathbb{E}_{s \sim \mu}[f(s)] + \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\gamma f(s')] - \mathbb{E}_{s \sim d^\pi}[f(s)] = 0 \tag{4}$$

The proof can be found in [4] and is reproduced in Section A.1 of the Appendix.
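Identity (3) and Lemma 5.1 can both be verified numerically on a small random MDP; the sizes and names below are illustrative assumptions, with $\mu$, $d^\pi$ treated as vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 2, 0.8

# Random dynamics M[s, a, s'], policy pi[s, a], start distribution mu.
M = rng.random((S, A, S)); M /= M.sum(axis=2, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
mu = rng.random(S); mu /= mu.sum()

# One-step state transition matrix P_π(s'|s) = Σ_a M(s'|s,a) π(a|s).
P_pi = np.einsum("sap,sa->sp", M, pi)

# Equation (3) in closed form: d^π = (1 − γ)(I − γ P_π)^{-1} μ.
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

# The same distribution via a (truncated) geometric series.
d_series = (1 - gamma) * sum(
    gamma**t * np.linalg.matrix_power(P_pi.T, t) @ mu for t in range(500))

# Lemma 5.1: (1 − γ) E_μ[f] + E_{d^π}[γ f(s')] − E_{d^π}[f] = 0 for any f.
f = rng.random(S)
lemma = (1 - gamma) * mu @ f + d @ (gamma * P_pi @ f) - d @ f
```

Because each $\rho^t_\pi$ sums to 1, the resulting $d^\pi$ is a proper probability distribution.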


We can add this term to the R.H.S. of equation (2). Doing so, we get

$$V^\pi = \frac{1}{1 - \gamma} \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[R(s, a, s') + \gamma f(s') - f(s)] + \mathbb{E}_{s \sim \mu}[f(s)] \tag{5}$$

This can be seen as a form of reward shaping, where the shaping function depends only on states and not on actions. Notice that if we substitute $f(s) = V^\pi(s)$, the term inside the first expectation becomes the advantage function.

5.2 Bounding Difference in State Distributions

When we update $\pi \to \pi'$, we obtain different discounted state visitation distributions, $d^\pi$ and $d^{\pi'}$ respectively. Let us now bound their difference.

Lemma 5.2.
$$\|d^{\pi'} - d^\pi\|_1 \leq \frac{2\gamma}{1 - \gamma} \mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right]$$

The proof can be found in [4] and is reproduced in Section A.2 of the Appendix.
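Lemma 5.2 can be checked numerically using the closed form for $d^\pi$ from equation (3); the MDP below is a random illustrative example, and $D_{TV}(\pi' \| \pi)[s] = \frac{1}{2}\sum_a |\pi'(a \mid s) - \pi(a \mid s)|$:

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 5, 3, 0.8

# Random dynamics M[s, a, s'] and start distribution mu (illustrative).
M = rng.random((S, A, S)); M /= M.sum(axis=2, keepdims=True)
mu = rng.random(S); mu /= mu.sum()

def visitation(pi):
    """d^π = (1 − γ)(I − γ P_π)^{-1} μ, equation (3) in vector form."""
    P_pi = np.einsum("sap,sa->sp", M, pi)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)

lhs = np.abs(visitation(pi2) - visitation(pi)).sum()    # ‖d^π' − d^π‖₁
tv = 0.5 * np.abs(pi2 - pi).sum(axis=1)                 # D_TV(π'‖π)[s]
rhs = 2 * gamma / (1 - gamma) * (visitation(pi) @ tv)   # Lemma 5.2 bound
```

The inequality `lhs <= rhs` should hold for any pair of policies, not just nearby ones.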

5.3 Bounding Difference in Returns

Now, we bound the difference in value for the update $\pi \to \pi'$, namely $V^{\pi'} - V^\pi$.

Lemma 5.3. Defining the terms

$$L_\pi(\pi') = \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s)}}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)\right], \qquad \epsilon^{\pi'}_f = \max_s \left|\mathbb{E}_{\substack{a \sim \pi'(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[R(s, a, s') + \gamma f(s') - f(s)]\right|$$

we get the following upper bound

$$V^{\pi'} - V^\pi \leq \frac{1}{1 - \gamma}\left(L_\pi(\pi') + \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f\right) \tag{6}$$

and the following lower bound

$$V^{\pi'} - V^\pi \geq \frac{1}{1 - \gamma}\left(L_\pi(\pi') - \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f\right) \tag{7}$$

The proof can be found in [4] and is reproduced in Section A.3 of the Appendix. We remind the reader that an upper bound on $\|d^{\pi'} - d^\pi\|_1$ is given in Lemma 5.2, which can be substituted into (6) and (7).


5.4 Bounding Maximum Advantage

Now we need to consider the term

$$\epsilon^{\pi'}_f = \max_s \left|\mathbb{E}_{\substack{a \sim \pi'(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[R(s, a, s') + \gamma f(s') - f(s)]\right| \tag{8}$$

We set $f(s) = V^\pi(s)$, the value function at state $s$ under policy $\pi$. Hence, we have

$$\epsilon^{\pi'}_{V^\pi} = \max_s \left|\mathbb{E}_{a \sim \pi'(\cdot \mid s)}[A^\pi(s, a)]\right| \tag{9}$$

This allows us to formulate the following:

Lemma 5.4.
$$\epsilon^{\pi'}_{V^\pi} \leq 2 \max_s D_{TV}(\pi \| \pi')[s] \max_{s,a} |A^\pi(s, a)|$$

The proof can be found in [5] and is reproduced in Section A.4 of the Appendix.

5.5 TRPO

To recap, setting $f = V^\pi$, we have

$$\frac{1}{1 - \gamma} L_\pi(\pi') - \frac{4\epsilon\gamma}{(1 - \gamma)^2}\alpha^2 \leq V^{\pi'} - V^\pi \leq \frac{1}{1 - \gamma} L_\pi(\pi') + \frac{4\epsilon\gamma}{(1 - \gamma)^2}\alpha^2 \tag{10}$$

where

$$L_\pi(\pi') = \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s)}}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)\right], \qquad \epsilon = \max_{s,a} |A^\pi(s, a)|, \qquad \alpha = \max_s D_{TV}(\pi \| \pi')[s]$$

Comparing the above with the equation from the previous section on the Relative Policy Performance Identity, we now have lower and upper bounds rather than just an approximation. By optimizing the lower bound on $V^{\pi'} - V^\pi$, we get an optimization problem that guarantees improvement to our policy. Concretely, we solve the following optimization problem:

$$\max_{\pi'} L_\pi(\pi') - \frac{4\epsilon\gamma}{1 - \gamma}\alpha^2$$

Unfortunately, solving this optimization problem results in very small step sizes. In [5], the authors convert it into a constrained optimization problem to get larger step sizes when implementing a practical algorithm. Concretely, this results in the following optimization problem:

$$\max_{\pi'} L_\pi(\pi') \quad \text{s.t.} \quad \alpha^2 \leq \delta$$

where $\delta$ is a hyperparameter.
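The sandwich bound (10) can be checked numerically on a small random MDP, computing every quantity exactly; the MDP, the $r(s, a)$ reward shape, and the nearby policy $\pi'$ below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 4, 2, 0.7

# Small random MDP: dynamics M[s, a, s'], rewards R[s, a], start dist mu.
M = rng.random((S, A, S)); M /= M.sum(axis=2, keepdims=True)
R = rng.random((S, A))
mu = rng.random(S); mu /= mu.sum()

def P_of(pi):
    return np.einsum("sap,sa->sp", M, pi)

def V_of(pi):
    return np.linalg.solve(np.eye(S) - gamma * P_of(pi),
                           np.einsum("sa,sa->s", R, pi))

def d_of(pi):
    """Normalized discounted visitation, equation (3)."""
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_of(pi).T, mu)

pi = rng.random((S, A)); pi /= pi.sum(axis=1, keepdims=True)
pi2 = 0.9 * pi + 0.1 * rng.random((S, A))      # a nearby policy π'
pi2 /= pi2.sum(axis=1, keepdims=True)

V = V_of(pi)
Adv = R + gamma * M @ V - V[:, None]                  # A^π(s, a)
L = d_of(pi) @ np.einsum("sa,sa->s", pi2, Adv)        # L_π(π')
eps = np.abs(Adv).max()                               # ε = max_{s,a} |A^π|
alpha = 0.5 * np.abs(pi2 - pi).sum(axis=1).max()      # α = max_s D_TV
gap = mu @ V_of(pi2) - mu @ V                         # V^π' − V^π
penalty = 4 * eps * gamma / (1 - gamma) ** 2 * alpha ** 2
```

With these definitions, `gap` lands inside `[L/(1-gamma) - penalty, L/(1-gamma) + penalty]`, and the interval tightens as $\pi'$ approaches $\pi$.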


The max over states in the constraint on $\alpha$ is impractical to compute due to the large number of states. Hence, in [5], the authors use a heuristic approximation which considers only the average KL divergence. This approximation is useful because we can approximate an expectation with samples, but we cannot approximate a max with samples. Hence, we have

$$\max_{\pi'} L_\pi(\pi') \quad \text{s.t.} \quad \bar{D}_{KL}(\pi, \pi') \leq \delta, \qquad \text{where} \quad \bar{D}_{KL}(\pi, \pi') = \mathbb{E}_{s \sim d^\pi}\left[D_{KL}(\pi \| \pi')[s]\right]$$
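For tabular policies, the averaged KL constraint can be evaluated directly; in practice the state distribution is replaced by sampled states, but the sketch below uses an explicit distribution (all names are illustrative):

```python
import numpy as np

def mean_kl(pi_old, pi_new, d):
    """D̄_KL(π, π') = E_{s∼dπ}[ Σ_a π(a|s) log(π(a|s)/π'(a|s)) ]
    for tabular policies pi[s, a] with strictly positive entries
    and a state distribution d[s]."""
    kl_per_state = np.sum(pi_old * np.log(pi_old / pi_new), axis=1)
    return d @ kl_per_state
```

The quantity is 0 exactly when the two policies agree on every state with positive weight, and grows as they diverge, which is what makes it a usable "step size" in policy space.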

6 Exercises

Exercise 6.1. In the lecture slides, for episodic environments, the objective function is given as $J(\theta) = V^{\pi_\theta}(s_{start})$. What assumption is made in this objective function?

Solution. We assume that there is a single start state, $s_{start}$. In general, there can be a distribution of start states, in which case there should be an expectation over the start state distribution $\mu$. Hence, the more general objective function is $J(\theta) = \mathbb{E}_{s_{start} \sim \mu}[V^{\pi_\theta}(s_{start})]$.

Exercise 6.2. In the infinite-horizon setting, we discussed in this set of lecture notes that a possible objective function is $J(\theta) = \mathbb{E}_{(s,a) \sim P_\theta(s,a)}[r(s, a)]$. In the lecture slides, we discuss two different objectives. We could either use the Average Value, given by $J_{avV}(\theta) = \sum_s d^{\pi_\theta}(s) V^{\pi_\theta}(s)$, or the Average Reward per Time Step, given by $J_{avR}(\theta) = \sum_s d^{\pi_\theta}(s) \sum_a \pi_\theta(a \mid s) r(s, a)$. Is $J(\theta)$ equivalent to $J_{avV}(\theta)$, $J_{avR}(\theta)$, or neither?

Solution. $J(\theta)$ is equivalent to the Average Reward per Time Step. In particular, the expectation over $(s, a)$ drawn from $P_\theta(s, a)$ can be expanded into an expectation over $s$ drawn from the stationary state distribution $d^{\pi_\theta}(s)$ and an expectation over $a$ drawn from the policy $\pi_\theta(a \mid s)$.

Exercise 6.3. What is the key advantage of using finite differences to estimate policy gradients?

Solution. This method works for arbitrary policies, even if the policy is not differentiable.

Exercise 6.4. What is the point of the log derivative trick in policy gradients?

Solution. The log derivative trick makes the gradient estimate independent of the dynamics model, which in general is unknown.

Exercise 6.5. In the derivation showing that a baseline that is a function of state is unbiased, we used the fact that $\mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)] = 0$. Provide steps to show this result.

Solution.
$$\mathbb{E}_{a_t}[\nabla_\theta \log \pi_\theta(a_t \mid s_t)] = \int_a \pi_\theta(a_t \mid s_t) \frac{\nabla_\theta \pi_\theta(a_t \mid s_t)}{\pi_\theta(a_t \mid s_t)} \, da = \nabla_\theta \int_a \pi_\theta(a_t \mid s_t) \, da = \nabla_\theta 1 = 0$$

Exercise 6.6. Why can't we perform the following optimization directly?

$$\max_{\pi'} J(\pi') = \max_{\pi'} J(\pi') - J(\pi) = \max_{\pi'} \mathbb{E}_{\tau \sim \pi'}\left[\sum_{t=0}^{\infty} \gamma^t A^\pi(s_t, a_t)\right]$$

Solution. We want to find $\pi'$, but to evaluate this objective we need to do rollouts using $\pi'$, which we do not yet have. This process is too slow, so we need to use Importance Sampling.


Exercise 6.7. Here is some pseudocode to perform Maximum Likelihood Estimation using automatic differentiation for a discrete action space.

    logits = policy.predictions(states)
    negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    loss = tf.reduce_mean(negative_likelihoods)
    gradients = tf.gradients(loss, variables)

Given that we roll out N episodes, each with a horizon of T, and there are $d_a$ distinct actions and $d_s$ state dimensions, what are the shapes of actions and states? We are also given a tensor for q_values. What should the shape of this tensor be? Given q_values, how would you change the above pseudocode to do policy gradient training?

Solution. The shape of actions should be $(N * T, d_a)$ (one-hot encoded). The shape of states should be $(N * T, d_s)$. The shape of q_values should be $(N * T, 1)$.

    logits = policy.predictions(states)
    negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(
        labels=actions, logits=logits)
    weighted_negative_likelihoods = tf.multiply(
        negative_likelihoods, tf.squeeze(q_values, axis=1))
    loss = tf.reduce_mean(weighted_negative_likelihoods)
    gradients = tf.gradients(loss, variables)

Note that negative_likelihoods has shape $(N * T,)$, so q_values is squeezed to match before the elementwise multiply. Hence, Policy Gradient can be viewed as a form of weighted MLE, where the weight is the expected discounted return of taking action $a$ from state $s$. The higher the expected discounted return, the higher the weight, the larger the gradient, and hence the bigger the update.

References

[1] http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf

[2] http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-9.pdf

[3] Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A. M., Kakade, S., Abbeel, P. (2018). Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246.

[4] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. arXiv preprint arXiv:1705.10528, 2017.

[5] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

A TRPO Proofs

A.1 Reward Shaping

Here we provide a proof for Lemma 5.1.


Proof.
$$d^\pi = (1 - \gamma)(I - \gamma P_\pi)^{-1}\mu \quad \Rightarrow \quad (I - \gamma P_\pi) d^\pi = (1 - \gamma)\mu$$

Now, by taking a dot product with $f$, we have

$$\mathbb{E}_{s \sim d^\pi}[f(s)] - \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\gamma f(s')] = (1 - \gamma)\mathbb{E}_{s \sim \mu}[f(s)] \tag{11}$$

Rearranging the terms completes the proof.

A.2 Bounding Difference in State Distributions

Here we provide a proof for Lemma 5.2.

Proof. Recall from (3) that $d^\pi = (1 - \gamma)(I - \gamma P_\pi)^{-1}\mu$. Define $G = (I - \gamma P_\pi)^{-1}$, $\bar{G} = (I - \gamma P_{\pi'})^{-1}$, and $\Delta = P_{\pi'} - P_\pi$. We have

$$G^{-1} - \bar{G}^{-1} = (I - \gamma P_\pi) - (I - \gamma P_{\pi'}) = \gamma(P_{\pi'} - P_\pi) = \gamma\Delta$$
$$\Rightarrow \quad \bar{G} - G = \bar{G}(G^{-1} - \bar{G}^{-1})G = \gamma\bar{G}\Delta G$$

This allows us to derive

$$d^{\pi'} - d^\pi = (1 - \gamma)(\bar{G} - G)\mu = (1 - \gamma)\gamma\bar{G}\Delta G\mu = \gamma\bar{G}\Delta(1 - \gamma)G\mu = \gamma\bar{G}\Delta d^\pi \tag{12}$$

Taking the $\ell_1$-norm of (12), we have, by the submultiplicativity of the operator norm,

$$\|d^{\pi'} - d^\pi\|_1 = \gamma\|\bar{G}\Delta d^\pi\|_1 \leq \gamma\|\bar{G}\|_1 \|\Delta d^\pi\|_1 \tag{13}$$

Let us first bound $\|\bar{G}\|_1$:

$$\|\bar{G}\|_1 = \|(I - \gamma P_{\pi'})^{-1}\|_1 = \left\|\sum_{t=0}^{\infty}\gamma^t P_{\pi'}^t\right\|_1 \leq \sum_{t=0}^{\infty}\gamma^t \|P_{\pi'}\|_1^t = \frac{1}{1 - \gamma} \tag{14}$$

We are left with bounding $\|\Delta d^\pi\|_1$.


$$\begin{aligned}
\|\Delta d^\pi\|_1 &= \sum_{s'}\left|\sum_s \Delta(s' \mid s) d^\pi(s)\right| \\
&\leq \sum_{s', s} |\Delta(s' \mid s)| \, d^\pi(s) \\
&= \sum_{s', s} \left|\sum_a \left(M(s' \mid s, a)\pi'(a \mid s) - M(s' \mid s, a)\pi(a \mid s)\right)\right| d^\pi(s) \\
&= \sum_{s', s} \left|\sum_a M(s' \mid s, a)\left(\pi'(a \mid s) - \pi(a \mid s)\right)\right| d^\pi(s) \\
&\leq \sum_{s', a, s} M(s' \mid s, a) \left|\pi'(a \mid s) - \pi(a \mid s)\right| d^\pi(s) \\
&= \sum_{s, a} \left|\pi'(a \mid s) - \pi(a \mid s)\right| d^\pi(s) \\
&= \sum_s d^\pi(s) \sum_a \left|\pi'(a \mid s) - \pi(a \mid s)\right| \\
&= 2\,\mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right]
\end{aligned} \tag{15}$$

Combining (13), (14), and (15), we have:

$$\|d^{\pi'} - d^\pi\|_1 \leq \frac{2\gamma}{1 - \gamma}\mathbb{E}_{s \sim d^\pi}\left[D_{TV}(\pi' \| \pi)[s]\right] \tag{16}$$

as desired.

A.3 Bounding Difference in Returns

Here we provide a proof for Lemma 5.3.

Proof. Define $\delta_f(s, a, s') = R(s, a, s') + \gamma f(s') - f(s)$. Since

$$V^\pi = \frac{1}{1 - \gamma}\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')] + \mathbb{E}_{s \sim \mu}[f(s)],$$

we have:

$$V^{\pi'} - V^\pi = \frac{1}{1 - \gamma}\left(\mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi'(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')] - \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')]\right) \tag{17}$$

Let us first focus on the first term. Let $\bar{\delta}_f^{\pi'} \in \mathbb{R}^{|S|}$ where $\bar{\delta}_f^{\pi'}(s) = \mathbb{E}_{\substack{a \sim \pi'(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')]$. Now we derive an upper bound.


$$\begin{aligned}
\mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi' \\ s' \sim M}}[\delta_f(s, a, s')] &= \langle d^{\pi'}, \bar{\delta}_f^{\pi'} \rangle = \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle + \langle d^{\pi'} - d^\pi, \bar{\delta}_f^{\pi'} \rangle \\
&\leq \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle + \|d^{\pi'} - d^\pi\|_1 \|\bar{\delta}_f^{\pi'}\|_\infty = \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle + \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f
\end{aligned} \tag{18}$$

where $\epsilon^{\pi'}_f = \max_s \left|\mathbb{E}_{\substack{a \sim \pi'(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[R(s, a, s') + \gamma f(s') - f(s)]\right|$.

We also have a lower bound:

$$\begin{aligned}
\mathbb{E}_{\substack{s \sim d^{\pi'} \\ a \sim \pi' \\ s' \sim M}}[\delta_f(s, a, s')] &= \langle d^{\pi'}, \bar{\delta}_f^{\pi'} \rangle = \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle - \langle d^\pi - d^{\pi'}, \bar{\delta}_f^{\pi'} \rangle \\
&\geq \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle - \|d^\pi - d^{\pi'}\|_1 \|\bar{\delta}_f^{\pi'}\|_\infty = \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle - \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f
\end{aligned} \tag{19}$$

Now, we apply (18) to (17) to get an upper bound:

$$\begin{aligned}
(1 - \gamma)(V^{\pi'} - V^\pi) &\leq \langle d^\pi, \bar{\delta}_f^{\pi'} \rangle + \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f - \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')] \\
&= \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi'(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')] - \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}[\delta_f(s, a, s')] + \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f \\
&= \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}\left[\left(\frac{\pi'(a \mid s)}{\pi(a \mid s)} - 1\right)\delta_f(s, a, s')\right] + \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f
\end{aligned} \tag{20}$$

Let us define

$$L_\pi(\pi') = \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}\left[\left(\frac{\pi'(a \mid s)}{\pi(a \mid s)} - 1\right)(R(s, a, s') + \gamma f(s') - f(s))\right]$$

Note that if we select $f = V^\pi$, then

$$L_\pi(\pi') = \mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s)}}\left[\frac{\pi'(a \mid s)}{\pi(a \mid s)} A^\pi(s, a)\right]$$

This is because the advantage of actions chosen from the same policy is 0 in expectation:

$$\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s)}}[A^\pi(s, a)] = 0$$


We complete (20) to obtain an upper bound:

$$V^{\pi'} - V^\pi \leq \frac{1}{1 - \gamma}\left(L_\pi(\pi') + \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f\right) \tag{21}$$

Similarly, we derive a lower bound from (17) and (19):

$$V^{\pi'} - V^\pi \geq \frac{1}{1 - \gamma}\left(\mathbb{E}_{\substack{s \sim d^\pi \\ a \sim \pi(\cdot \mid s) \\ s' \sim M(\cdot \mid s, a)}}\left[\left(\frac{\pi'(a \mid s)}{\pi(a \mid s)} - 1\right)\delta_f(s, a, s')\right] - \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f\right) = \frac{1}{1 - \gamma}\left(L_\pi(\pi') - \|d^{\pi'} - d^\pi\|_1 \, \epsilon^{\pi'}_f\right) \tag{22}$$

A.4 Maximum Advantage

Here we provide a proof of Lemma 5.4.

Proof. As in Section A of [5], we say $(\pi, \pi')$ is an $\alpha$-coupled policy pair if it defines a joint distribution $(a, \hat{a}) \mid s$ such that $\Pr[a \neq \hat{a} \mid s] \leq \alpha$ for all $s$. Define $\alpha_{\pi,\pi'}$ such that $(\pi, \pi')$ is $\alpha_{\pi,\pi'}$-coupled. Let $\bar{A}(s)$ be the expected value of $A^\pi(s, \hat{a})$ when $\hat{a}$ is drawn from policy $\pi'$ given state $s$:

$$\bar{A}(s) = \mathbb{E}_{\hat{a} \sim \pi'(\cdot \mid s)}[A^\pi(s, \hat{a})] \tag{23}$$

Because $\mathbb{E}_{a \sim \pi(\cdot \mid s)}[A^\pi(s, a) \mid s] = 0$ from the definition of the advantage, we can rewrite (23) as

$$\begin{aligned}
\bar{A}(s) &= \mathbb{E}_{(a, \hat{a}) \sim (\pi, \pi')}[A^\pi(s, \hat{a}) - A^\pi(s, a)] \\
&= \Pr[a \neq \hat{a} \mid s]\, \mathbb{E}_{(a, \hat{a}) \sim (\pi, \pi')}[A^\pi(s, \hat{a}) - A^\pi(s, a) \mid a \neq \hat{a}] + \Pr[a = \hat{a} \mid s]\, \mathbb{E}_{(a, \hat{a}) \sim (\pi, \pi')}[A^\pi(s, a) - A^\pi(s, a)] \\
&\leq \alpha_{\pi,\pi'}\, \mathbb{E}_{(a, \hat{a}) \sim (\pi, \pi')}[A^\pi(s, \hat{a}) - A^\pi(s, a) \mid a \neq \hat{a}] + \Pr[a = \hat{a} \mid s] \cdot 0 \\
&\leq 2\alpha_{\pi,\pi'} \max_a |A^\pi(s, a)|
\end{aligned} \tag{24}$$

Therefore, we have

$$\epsilon^{\pi'}_{V^\pi} = \max_s \left|\mathbb{E}_{a \sim \pi'(\cdot \mid s)}[A^\pi(s, a)]\right| \leq \max_s 2\alpha_{\pi,\pi'} \max_a |A^\pi(s, a)| \leq 2\alpha_{\pi,\pi'} \max_{s,a} |A^\pi(s, a)| \tag{25}$$

Suppose $p_X$ and $p_Y$ are distributions with $D_{TV}(p_X \| p_Y) = \alpha$. Then there exists a joint distribution $(X, Y)$ whose marginals are $p_X$ and $p_Y$, and for which $X = Y$ with probability $1 - \alpha$ [5]. Taking

$$\alpha_{\pi,\pi'} = \max_s D_{TV}(\pi \| \pi')[s]$$

we have


$$\epsilon^{\pi'}_{V^\pi} \leq 2 \max_s D_{TV}(\pi \| \pi')[s] \max_{s,a} |A^\pi(s, a)| \tag{26}$$

as desired.