
Policy Gradients | CS60077: Reinforcement Learning | Abir Das, IIT Kharagpur



  1. Policy Gradients CS60077: Reinforcement Learning Abir Das IIT Kharagpur Nov 09, 10, 2020

  2. Agenda
  § Get started with policy gradient methods.
  § Get familiar with the naive REINFORCE algorithm and its advantages and disadvantages.
  § Get familiar with different variance reduction techniques.
  § Actor-Critic methods.

  3. Resources
  § Deep Reinforcement Learning by Sergey Levine [Link]
  § OpenAI Spinning Up [Link]

  4-5. Reinforcement Learning Setting
  [Figures: agent-environment interaction loop; credits: [SB] and Sergey Levine, UC Berkeley]

  6. Reinforcement Learning Setting [Figure credit: Sergey Levine, UC Berkeley]
  § In the middle of the figure is the 'policy network', which directly learns a parameterized policy $\pi_\theta(a \mid s)$ (sometimes denoted $\pi(a \mid s; \theta)$) and outputs a probability distribution over all actions given the state $s$, parameterized by $\theta$.
  § The notation $\theta$ is used to distinguish the policy parameters from the parameter vector $w$ of a value function approximator $\hat{v}(s; w)$.
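To make the 'policy network' box concrete, here is a minimal sketch (my own illustration, not from the slides) of a discrete-action policy $\pi_\theta(a \mid s)$ as a small neural network. PyTorch, the hidden size, and the class name PolicyNetwork are assumed purely for illustration.

```python
# A minimal discrete-action policy network: state in, distribution over actions out.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        # A softmax over the logits gives pi_theta(a | s).
        logits = self.net(state)
        return torch.distributions.Categorical(logits=logits)

policy = PolicyNetwork(state_dim=4, num_actions=2)
dist = policy(torch.zeros(4))       # pi_theta(. | s) for a dummy state
action = dist.sample()              # a ~ pi_theta(. | s)
log_prob = dist.log_prob(action)    # log pi_theta(a | s), useful later for gradients
```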

  7. Reinforcement Learning Setting [Figure credit: Sergey Levine, UC Berkeley]
  § The goal in an RL problem is to maximize the total reward "in expectation" over the long run.
  § A trajectory $\tau$ is defined as $\tau = (s_1, a_1, s_2, a_2, s_3, a_3, \cdots)$.
  § The probability of a trajectory is the joint probability of its state-action pairs:
  $p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$  (1)

  8. Reinforcement Learning Setting
  § Proof of the above relation (using the Markov property of the dynamics and the fact that the policy depends only on the current state):
  $p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)$
  $= p(s_{T+1} \mid s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)$
  $= p(s_{T+1} \mid s_T, a_T)\, p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)$
  $= p(s_{T+1} \mid s_T, a_T)\, p(a_T \mid s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)$
  $= p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)$  (2)
  § The last factor (boxed on the slide) has the same form as the left-hand side, so applying the same argument repeatedly gives
  $p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) = p(s_{T+1} \mid s_T, a_T)\, \pi_\theta(a_T \mid s_T)\, p(s_T \mid s_{T-1}, a_{T-1})\, \pi_\theta(a_{T-1} \mid s_{T-1})\, p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$  (3)
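As a sanity check of the factorization in eqns. (1)-(3), the following sketch (my own, not from the lecture) multiplies out $p(s_1)\prod_t p(s_{t+1}\mid s_t, a_t)\,\pi_\theta(a_t\mid s_t)$ for a short trajectory in a hypothetical two-state, two-action MDP; the tables p_s1, P, pi and the trajectory itself are made up for illustration.

```python
# Illustrating eq. (1): p_theta(tau) = p(s_1) * prod_t p(s_{t+1}|s_t,a_t) * pi_theta(a_t|s_t)
import numpy as np

p_s1 = np.array([0.8, 0.2])                 # initial state distribution p(s_1)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s'] = p(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
pi = np.array([[0.7, 0.3],                  # pi[s, a] = pi_theta(a | s)
               [0.4, 0.6]])

states = [0, 1, 1]                          # tau = (s_1, a_1, s_2, a_2, s_3)
actions = [1, 0]

prob = p_s1[states[0]]
for t in range(len(actions)):
    s, a, s_next = states[t], actions[t], states[t + 1]
    prob *= P[s, a, s_next] * pi[s, a]      # p(s_{t+1}|s_t,a_t) * pi_theta(a_t|s_t)

print(f"p_theta(tau) = {prob:.4f}")
```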

  9. The Goal of Reinforcement Learning [Figure credit: Sergey Levine, UC Berkeley]
  § We will sometimes denote this probability as $p_\theta(\tau)$, i.e.,
  $p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)$
  § The goal can be written as
  $\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}$
  § Note that, for the time being, we are not considering discounting. We will come back to that.

  10. The Goal of Reinforcement Learning
  § Goal for a finite-horizon setting:
  $\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big]$
  § The same goal for the infinite-horizon setting:
  $\theta^* = \arg\max_\theta \mathbb{E}_{(s, a) \sim p_\theta(s, a)}\big[r(s, a)\big]$
  § We will consider only the finite-horizon case in this topic.
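For completeness, the step connecting the trajectory form of the objective on slide 9 to the per-timestep form above is just linearity of expectation followed by marginalization; this short derivation is my addition, not on the slide.

```latex
% Linearity of expectation, then marginalize p_theta(tau) onto (s_t, a_t):
\[
  \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)\Big]
  = \sum_{t=1}^{T} \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(s_t, a_t)\big]
  = \sum_{t=1}^{T} \mathbb{E}_{(s_t, a_t) \sim p_\theta(s_t, a_t)}\big[r(s_t, a_t)\big],
\]
% where p_theta(s_t, a_t) denotes the marginal of p_theta(tau) at time t.
```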

  11-12. Evaluating the Objective
  § We will see how we can optimize this objective: the expected value of the total reward under the trajectory distribution induced by the policy $\pi_\theta$.
  § But before that, let us see how we can evaluate the objective in a model-free setting, by averaging the returns of $N$ sampled trajectories:
  $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N} \sum_i \sum_t r(s_{i,t}, a_{i,t})$  (4)
  [Figure credit: Sergey Levine, UC Berkeley]
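In code, the Monte Carlo estimate of eq. (4) is just "roll out $N$ trajectories under the current policy and average the total rewards". A minimal sketch, assuming the gymnasium package and a generic sample_action callable standing in for $\pi_\theta$ (here a random policy, purely for illustration):

```python
# Estimate J(theta) as in eq. (4): average total reward over N sampled trajectories.
import gymnasium as gym
import numpy as np

def estimate_J(env, sample_action, N=100, T=200):
    returns = []
    for _ in range(N):
        state, _ = env.reset()
        total = 0.0
        for _ in range(T):
            action = sample_action(state)
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return np.mean(returns)                  # (1/N) sum_i sum_t r(s_{i,t}, a_{i,t})

env = gym.make("CartPole-v1")
J_hat = estimate_J(env, sample_action=lambda s: env.action_space.sample())
print(f"Estimated J(theta) under a random policy: {J_hat:.1f}")
```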

  13-17. Maximizing the Objective
  § Now that we have seen how to evaluate the objective, the next step is to maximize it.
  § Compute the gradient and take steps in the direction of the gradient.
  § Writing $r(\tau) = \sum_t r(s_t, a_t)$, the objective is
  $\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big]}_{J(\theta)}, \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big] = \int p_\theta(\tau)\, r(\tau)\, d\tau$
  § Its gradient is
  $\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau$
  § How do we compute this complicated-looking gradient? The log-derivative trick comes to our rescue.

  18-20. Log Derivative Trick
  § The log-derivative trick:
  $\nabla_\theta \log p_\theta(\tau) = \dfrac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) = \dfrac{1}{p_\theta(\tau)}\, \nabla_\theta p_\theta(\tau) \;\Longrightarrow\; \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)$  (5)
  § So, using eqn. (5), we get the gradient of the objective as
  $\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]$  (6)
  § Remember that $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big] = \int p_\theta(\tau)\, r(\tau)\, d\tau$
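Eqn. (6) can be turned directly into a sampling-based estimator: draw trajectories from $p_\theta(\tau)$ and average $\nabla_\theta \log p_\theta(\tau)\, r(\tau)$. The sketch below is my own illustration on a made-up tabular MDP with a softmax policy over a parameter table theta; it builds $\log p_\theta(\tau)$ exactly as in eq. (1) and lets autograd supply $\nabla_\theta \log p_\theta(\tau)$ (the dynamics terms simply contribute zero gradient, since they do not depend on $\theta$).

```python
# Sample-based estimate of eq. (6): grad J(theta) ~ (1/N) sum_i grad log p_theta(tau_i) * r(tau_i)
import torch

torch.manual_seed(0)
num_states, num_actions, T, N = 2, 2, 5, 500

p_s1 = torch.tensor([0.8, 0.2])                     # p(s_1)
P = torch.tensor([[[0.9, 0.1], [0.2, 0.8]],         # P[s, a, s'] = p(s' | s, a)
                  [[0.5, 0.5], [0.1, 0.9]]])
R = torch.tensor([[1.0, 0.0], [0.0, 2.0]])          # r(s, a), illustrative values
theta = torch.zeros(num_states, num_actions, requires_grad=True)

def pi_theta(s):                                    # softmax policy pi_theta(. | s)
    return torch.softmax(theta[s], dim=-1)

grad_estimate = torch.zeros_like(theta)
for _ in range(N):
    s = torch.multinomial(p_s1, 1).item()
    log_p_tau = torch.log(p_s1[s])                  # log p_theta(tau), built up term by term
    r_tau = 0.0
    for _ in range(T):
        probs = pi_theta(s)
        a = torch.multinomial(probs, 1).item()
        s_next = torch.multinomial(P[s, a], 1).item()
        log_p_tau = log_p_tau + torch.log(P[s, a, s_next]) + torch.log(probs[a])
        r_tau += R[s, a].item()
        s = s_next
    grad_tau, = torch.autograd.grad(log_p_tau, theta)   # grad_theta log p_theta(tau)
    grad_estimate += grad_tau * r_tau                   # weight by the trajectory return r(tau)

grad_estimate /= N
print(grad_estimate)                                # Monte Carlo estimate of grad_theta J(theta)
```

Note that only the $\log \pi_\theta(a_t \mid s_t)$ terms carry gradient with respect to $\theta$, so in practice the estimator does not need the dynamics model; the full $\log p_\theta(\tau)$ is written out here only to mirror eq. (6) literally.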

  21. Log Derivative Trick
  § Till now we have the following:
  $\theta^* = \arg\max_\theta J(\theta); \qquad J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big]$
  $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big]$
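The "take steps in the direction of the gradient" recipe from slide 13 onward then becomes the following update, with the expectation in (6) approximated by sampling as in (4); the step size $\alpha$ and the number of sampled trajectories $N$ are assumed hyperparameters, and this sampled estimator is the starting point for the REINFORCE algorithm listed in the agenda.

```latex
\[
  \nabla_\theta J(\theta) \;\approx\; \frac{1}{N}\sum_{i=1}^{N}
      \nabla_\theta \log p_\theta(\tau_i)\, r(\tau_i),
  \qquad
  \theta \;\leftarrow\; \theta + \alpha\, \nabla_\theta J(\theta).
\]
```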
