Policy Gradients CS60077: Reinforcement Learning Abir Das IIT - - PowerPoint PPT Presentation



SLIDE 1

Policy Gradients

CS60077: Reinforcement Learning Abir Das

IIT Kharagpur

Nov 09, 10, 2020

SLIDE 2

Agenda Introduction REINFORCE Bias/Variance

Agenda

§ Get started with policy gradient methods.
§ Get familiar with the naive REINFORCE algorithm and its advantages and disadvantages.
§ Get familiar with different variance reduction techniques.
§ Actor-critic methods.

Abir Das (IIT Kharagpur) CS60077 Nov 09, 10, 2020 2 / 39

SLIDE 3

Resources

§ Deep Reinforcement Learning by Sergey Levine [Link]
§ OpenAI Spinning Up [Link]

SLIDE 5

Reinforcement Learning Setting

Figure credit: [SB]
Figure credit: [Sergey Levine, UC Berkeley]

SLIDE 6

Reinforcement Learning Setting

Figure credit: [Sergey Levine, UC Berkeley]

§ In the middle is the 'policy network', which directly learns a parameterized policy $\pi_\theta(a|s)$ (sometimes denoted $\pi(a|s;\theta)$). It provides the probability distribution over all actions given the state $s$, parameterized by $\theta$.
§ The notation $\theta$ is used to distinguish it from the parameter vector $w$ of the value function approximator $\hat{v}(s; w)$.

SLIDE 7

Reinforcement Learning Setting

Figure credit: [Sergey Levine, UC Berkeley]

§ The goal in an RL problem is to maximize the total reward "in expectation" over the long run.
§ A trajectory $\tau$ is defined as $\tau = (s_1, a_1, s_2, a_2, s_3, a_3, \cdots)$.
§ The probability of a trajectory is given by the joint probability of the state-action pairs:

$$p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1)\prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t) \tag{1}$$

SLIDE 8

Reinforcement Learning Setting

§ Proof of the above relation:

$$\begin{aligned}
p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
&= p(s_{T+1}|s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\,p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1}|s_T, a_T)\,p(s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) && \text{(Markov property)} \\
&= p(s_{T+1}|s_T, a_T)\,p(a_T|s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)\,p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1) \\
&= p(s_{T+1}|s_T, a_T)\,\pi_\theta(a_T|s_T)\,\boxed{p(s_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)} && \text{(the policy depends only on } s_T\text{)}
\end{aligned} \tag{2}$$

§ The boxed part of the equation is very similar to the left-hand side. So, applying the same argument repeatedly, we get

$$\begin{aligned}
p(s_{T+1}, s_T, a_T, s_{T-1}, a_{T-1}, \cdots, s_1, a_1)
&= p(s_{T+1}|s_T, a_T)\,\pi_\theta(a_T|s_T)\,p(s_T|s_{T-1}, a_{T-1})\,\pi_\theta(a_{T-1}|s_{T-1})\,p(s_{T-1}, s_{T-2}, a_{T-2}, \cdots, s_1, a_1) \\
&= p(s_1)\prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t)
\end{aligned} \tag{3}$$
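The factorization of the trajectory probability can be evaluated directly for a small tabular problem. The sketch below is only an illustration: the arrays `p0`, `P`, `pi` and the helper `trajectory_log_prob` are hypothetical names, and the 2-state, 2-action MDP is made up for the example.

```python
import numpy as np

def trajectory_log_prob(p0, P, pi, states, actions):
    """log p(tau) = log p(s_1) + sum_t [log p(s_{t+1}|s_t,a_t) + log pi(a_t|s_t)].

    p0: (S,) initial-state distribution; P: (S, A, S) transition probabilities;
    pi: (S, A) policy table; states has length T+1, actions has length T.
    """
    logp = np.log(p0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += np.log(P[s, a, s_next]) + np.log(pi[s, a])
    return logp

# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
p0 = np.array([0.5, 0.5])
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4], [0.1, 0.9]])

# tau = (s1=0, a1=1, s2=1, a2=1, s3=1)
states, actions = [0, 1, 1], [1, 1]
log_p = trajectory_log_prob(p0, P, pi, states, actions)
# p(tau) = 0.5 * (0.8 * 0.4) * (0.7 * 0.9) = 0.1008
```

Working in log space avoids underflow when the horizon T is long, since the product of many small probabilities becomes a sum.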

SLIDE 9

The Goal of Reinforcement Learning

Figure credit: [Sergey Levine, UC Berkeley]

§ We will sometimes denote the trajectory probability as $p_\theta(\tau)$, i.e.,

$$p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1)\prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t)$$

§ The goal can be written as

$$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau\sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big]}_{J(\theta)}$$

§ Note that, for the time being, we are not considering discounting. We will come back to that.

SLIDE 10

The Goal of Reinforcement Learning

§ Goal for the finite horizon setting:

$$\theta^* = \arg\max_\theta \sum_{t=1}^{T} \mathbb{E}_{(s_t,a_t)\sim p_\theta(s_t,a_t)}\big[r(s_t, a_t)\big]$$

§ The same for the infinite horizon setting:

$$\theta^* = \arg\max_\theta\, \mathbb{E}_{(s,a)\sim p_\theta(s,a)}\big[r(s, a)\big]$$

§ We will consider only the finite horizon case in this topic.

SLIDE 12

Evaluating the Objective

§ We will see how we can optimize this objective: the expected value of the total reward under the trajectory distribution induced by the policy $\pi_\theta$.
§ But before that, let us see how we can evaluate the objective in the model-free setting, by averaging over $N$ sampled trajectories, where $(s_{i,t}, a_{i,t})$ is the state-action pair at time $t$ of the $i$-th trajectory:

$$J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\Big[\sum_t r(s_t, a_t)\Big] \approx \frac{1}{N}\sum_i \sum_t r(s_{i,t}, a_{i,t}) \tag{4}$$

Figure credit: [Sergey Levine, UC Berkeley]
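The sample average in eqn. (4) can be sketched in a few lines. This is a toy illustration under stated assumptions: `rollout` is a hypothetical stand-in for "run the policy in the environment", and the reward model (reward 1 with probability `p_good` at each step) is invented so that the true objective is known.

```python
import numpy as np

def estimate_objective(rollout, n_trajectories, rng):
    """Monte Carlo estimate of J(theta): average total reward over sampled trajectories."""
    returns = [sum(rollout(rng)) for _ in range(n_trajectories)]
    return float(np.mean(returns))

def rollout(rng, horizon=5, p_good=0.7):
    """Hypothetical environment + policy: at each step the policy picks the
    'good' action with probability p_good (reward 1), otherwise reward 0."""
    return [float(rng.random() < p_good) for _ in range(horizon)]

rng = np.random.default_rng(0)
J_hat = estimate_objective(rollout, n_trajectories=20_000, rng=rng)
# For this toy setup the true objective is horizon * p_good = 3.5,
# so J_hat should land close to that value.
```

No model of the dynamics is needed: only sampled rewards enter the estimate, which is what "model-free" means here.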

SLIDE 17

Maximizing the Objective

§ Now that we have seen how to evaluate the objective, the next step is to maximize it.
§ Compute the gradient and take steps in the direction of the gradient.

$$\theta^* = \arg\max_\theta \underbrace{\mathbb{E}_{\tau\sim p_\theta(\tau)}\Big[\underbrace{\sum_t r(s_t, a_t)}_{r(\tau)}\Big]}_{J(\theta)}$$

$$J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[r(\tau)\big] = \int p_\theta(\tau)\,r(\tau)\,d\tau$$

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\,r(\tau)\,d\tau$$

§ How do we compute this complicated-looking gradient? The log-derivative trick comes to our rescue.

SLIDE 20

Log Derivative Trick

$$\nabla_\theta \log p_\theta(\tau) = \frac{\partial \log p_\theta(\tau)}{\partial p_\theta(\tau)}\,\nabla_\theta p_\theta(\tau) = \frac{1}{p_\theta(\tau)}\,\nabla_\theta p_\theta(\tau) \implies \nabla_\theta p_\theta(\tau) = p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau) \tag{5}$$

§ So, using eqn. (5), we get the gradient of the objective as

$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\,r(\tau)\,d\tau = \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\,r(\tau)\,d\tau = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,r(\tau)\big] \tag{6}$$

§ Remember that $J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}[r(\tau)] = \int p_\theta(\tau)\,r(\tau)\,d\tau$.
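The identity in eqn. (6) can be verified exactly on a small discrete distribution, where the integral becomes a finite sum. The sketch below assumes a three-outcome softmax distribution standing in for $p_\theta(\tau)$; `theta`, `f` and the helper `softmax` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

theta = np.array([0.2, -0.5, 1.0])   # hypothetical parameters
f = np.array([1.0, 3.0, -2.0])       # hypothetical "reward" r(x) of each outcome x

p = softmax(theta)
# For a softmax, grad_theta log p(x) = one_hot(x) - p, stored as row x.
grad_log_p = np.eye(3) - p
# Score-function form E_x[grad log p(x) * f(x)], summed exactly over x.
score_grad = (p[:, None] * grad_log_p * f[:, None]).sum(axis=0)

# Central finite differences of J(theta) = sum_x p(x) f(x), for comparison.
eps = 1e-6
fd_grad = np.array([
    (softmax(theta + eps * e) @ f - softmax(theta - eps * e) @ f) / (2 * eps)
    for e in np.eye(3)
])
```

The two gradients should agree up to finite-difference error, which is the content of the log-derivative trick.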

SLIDE 27

Log Derivative Trick

§ So far we have the following:

$$\theta^* = \arg\max_\theta J(\theta); \qquad J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[r(\tau)\big]$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,r(\tau)\big]$$

§ We have also seen

$$p_\theta(\tau) = p_\theta(s_1, a_1, s_2, a_2, \cdots, s_T, a_T, s_{T+1}) = p(s_1)\prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\,\pi_\theta(a_t|s_t)$$

§ Taking the log of both sides,

$$\log p_\theta(\tau) = \log p(s_1) + \sum_{t=1}^{T} \log p(s_{t+1}|s_t, a_t) + \sum_{t=1}^{T} \log \pi_\theta(a_t|s_t)$$

§ Taking $\nabla_\theta$ of both sides,

$$\nabla_\theta \log p_\theta(\tau) = \underbrace{\nabla_\theta \log p(s_1)}_{0} + \sum_{t=1}^{T} \underbrace{\nabla_\theta \log p(s_{t+1}|s_t, a_t)}_{0} + \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)$$

§ The first two terms vanish because the initial-state distribution and the dynamics do not depend on $\theta$.
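Since only the policy terms survive, the gradient of the trajectory log-probability needs no knowledge of the dynamics. The sketch below assumes a tabular softmax policy, which the slides do not specify; `policy`, `grad_log_traj` and the parameter values are hypothetical.

```python
import numpy as np

def policy(theta, s):
    """Tabular softmax policy (an assumption): pi_theta(.|s) = softmax(theta[s])."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def grad_log_traj(theta, states, actions):
    """sum_t grad_theta log pi_theta(a_t|s_t); the dynamics terms contribute nothing."""
    grad = np.zeros_like(theta)
    for s, a in zip(states, actions):
        grad[s] -= policy(theta, s)   # -pi(.|s) for every action in row s
        grad[s, a] += 1.0             # +1 for the action actually taken
    return grad

theta = np.array([[0.1, -0.3], [0.7, 0.2]])   # hypothetical: 2 states, 2 actions
states, actions = [0, 1, 1], [1, 0, 1]        # visited states and chosen actions
g = grad_log_traj(theta, states, actions)
```

For a neural-network policy the same quantity would typically come from automatic differentiation of the summed log-probabilities rather than a closed form.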

SLIDE 30

Log Derivative Trick

§ Thus,

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\,r(\tau)\big] = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$

§ So, to estimate the gradient, we sample $N$ trajectories and average, over the samples, the product of each trajectory's summed log-policy gradients and its summed rewards.

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)\right]$$

§ The last bit is to update $\theta$ along the gradient direction.

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta) \tag{7}$$

SLIDE 32

Fitting in Generic RL Pipeline

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)\right], \qquad \theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

Figure credit: [Sergey Levine, UC Berkeley]

REINFORCE Algorithm

1. Sample trajectories $\{\tau^i\}$ from $\pi_\theta(a_t|s_t)$ (run the policy).
2. $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)\right]$
3. $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$
4. Repeat.
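The four steps of the algorithm can be sketched on a deliberately trivial problem. Everything specific here is an assumption made for illustration: a state-independent softmax policy over two actions, a made-up reward of 1 for action 1 and 0 for action 0, and the hyperparameter values. It is a minimal sketch of naive REINFORCE, not the course's reference implementation.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def reinforce(n_iters=200, n_traj=20, horizon=5, alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)               # policy parameters (state-independent policy)
    rewards = np.array([0.0, 1.0])    # hypothetical reward of each action
    for _ in range(n_iters):          # step 4: repeat
        grad = np.zeros_like(theta)
        p = softmax(theta)
        for _ in range(n_traj):
            # Step 1: run the policy to sample one trajectory of actions.
            actions = rng.choice(2, size=horizon, p=p)
            # Step 2: (sum_t grad log pi(a_t)) * (sum_t r_t) for this trajectory.
            sum_grad_log = np.zeros_like(theta)
            for a in actions:
                sum_grad_log += np.eye(2)[a] - p
            grad += sum_grad_log * rewards[actions].sum()
        # Step 3: gradient ascent with the sample-average gradient.
        theta = theta + alpha * grad / n_traj
    return theta

theta = reinforce()
p_final = softmax(theta)   # should end up strongly preferring action 1
```

The estimate uses the full trajectory return as the weight for every timestep, which is exactly the naive form whose variance the later slides set out to reduce.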

SLIDE 34

Taking a Closer Look

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\xcancel{\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)}\right]$$

(The reward term is set aside for the moment.)

§ What is given by $\log \pi_\theta(a_{i,t}|s_{i,t})$? It is the log of the probability of action $a_{i,t}$ at state $s_{i,t}$ under the distribution parameterized by $\theta$.
§ This gives the likelihood, i.e., how likely we are to see $a_{i,t}$ as the action if our policy is defined by the current $\theta$.
§ Computing the gradient and taking a step along its direction changes $\theta$ in such a way that the likelihood of the action $a_{i,t}$ increases.

$$\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\right)\left(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\right)\right]$$

§ Now consider the case when it is multiplied by $\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})$.
§ The actions on trajectories with high total reward are made more likely.

SLIDE 35

Taking a Closer Look

§ Good stuff is made more likely.
§ Bad stuff is made less likely.
§ Formalizes 'trial and error' learning.

Figure credit: [Sergey Levine, UC Berkeley]

SLIDE 39

Bias and Variance in Estimation

§ One way to work with values we do not know is to estimate them by experimenting repeatedly.
§ Monte Carlo methods provide an estimate of the true value. We have used Monte Carlo methods to estimate the value functions and to estimate the gradient of the expected return.
§ The estimator is a function of the data, which are themselves random variables. So the estimated value is subject to many possible outcomes if employed repeatedly, i.e., if you conduct the experiment multiple times, the estimator will in general provide different values.
§ An estimator is good if:
◮ On average, the estimated values are close to the true value across trials (bias).
◮ The estimates do not vary much from trial to trial (variance).

SLIDE 42

Unbiased Estimators

§ An unbiased estimator is one that yields the true value of the variable being estimated on average. With $\theta$ denoting the true value and $\hat{\theta}$ denoting the estimated value, an unbiased estimator is one with $\mathbb{E}[\hat{\theta}] = \theta$.
§ Naturally, bias is defined as $b = \mathbb{E}[\hat{\theta}] - \theta$.
§ Let us consider estimating a constant value (say, the temperature of this room) with sensors that are not perfect. Consider the observations

$$x[n] = \theta + w[n], \qquad n = 0, 1, \cdots, N-1,$$

where $w[n]$ is zero-mean white Gaussian noise (WGN) with variance $\sigma^2$.
§ A reasonable estimator is the average value of $x[n]$, i.e.,

$$\hat{\theta} = \frac{1}{N}\sum_{n=0}^{N-1} x[n]$$

SLIDE 45

Estimator Bias

§ The sample mean estimator is unbiased.

$$\mathbb{E}[\hat{\theta}] = \mathbb{E}\left[\frac{1}{N}\sum_{n=0}^{N-1} x[n]\right] = \frac{1}{N}\sum_{n=0}^{N-1} \mathbb{E}\big[x[n]\big] = \frac{1}{N}\sum_{n=0}^{N-1} \mathbb{E}\big[\theta + w[n]\big] = \frac{1}{N}\sum_{n=0}^{N-1} \Big(\mathbb{E}[\theta] + \mathbb{E}\big[w[n]\big]\Big) = \frac{1}{N}\sum_{n=0}^{N-1} (\theta + 0) = \theta$$

§ Let us see what happens with a modified estimator,

$$\check{\theta} = \frac{1}{2N}\sum_{n=0}^{N-1} x[n]$$

§ It is easy to see that $\mathbb{E}[\check{\theta}] = \frac{1}{2}\theta$.
§ So the bias is $b = \mathbb{E}[\check{\theta}] - \theta = -\frac{1}{2}\theta$.
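The two bias results can be checked by simulating the noisy-sensor experiment many times. The values of `theta_true`, `sigma`, `N` and `trials` below are hypothetical; the sketch only illustrates the definitions above.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, N, trials = 20.0, 2.0, 10, 100_000   # hypothetical values

# Each row is one experiment: x[n] = theta + w[n], with w[n] ~ N(0, sigma^2).
x = theta_true + sigma * rng.standard_normal((trials, N))

theta_hat = x.mean(axis=1)              # sample mean, (1/N) sum_n x[n]
theta_check = x.sum(axis=1) / (2 * N)   # modified estimator, (1/2N) sum_n x[n]

# Empirical bias: average estimate over all experiments minus the true value.
bias_hat = theta_hat.mean() - theta_true       # should be close to 0
bias_check = theta_check.mean() - theta_true   # should be close to -theta/2 = -10
```

Averaging the estimator over many repeated experiments approximates $\mathbb{E}[\hat{\theta}]$, which is why the empirical bias approaches the analytical one.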

SLIDE 46

Estimator Variance

§ That an estimator is unbiased does not necessarily mean that it is a good estimator. It is reasonable to check, by repeating the experiment, how the results differ in successive trials.
§ Thus the variance of the estimate is another measure of the goodness of the estimator, and the aim is to see how small we can make $\mathrm{var}(\hat{\theta})$.
§ Let us take the following 3 estimators for $\theta$ and look at the variance of each.

$$\begin{aligned}
\hat{\theta}_a &= 0, & \mathbb{E}(\hat{\theta}_a) &= 0, & \mathrm{var}(\hat{\theta}_a) &= 0 \\
\hat{\theta}_b &= x[0], & \mathbb{E}(\hat{\theta}_b) &= \mathbb{E}(x[0]) = \mathbb{E}(\theta + w[0]) = \theta + 0 = \theta, & \mathrm{var}(\hat{\theta}_b) &= \mathrm{var}(x[0]) = \sigma^2 \\
\hat{\theta}_c &= \frac{1}{N}\sum_{n=0}^{N-1} x[n], & \mathbb{E}(\hat{\theta}_c) &= \theta \ \text{(already seen)}, & \mathrm{var}(\hat{\theta}_c) &= \mathbb{E}\big[(\hat{\theta}_c - \mathbb{E}[\hat{\theta}_c])^2\big]
\end{aligned}$$

(Continued on next slide.)

SLIDE 49

Estimator Variance

$$\begin{aligned}
\mathrm{var}(\hat{\theta}_c) &= \mathbb{E}\big[(\hat{\theta}_c - \mathbb{E}[\hat{\theta}_c])^2\big] = \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=0}^{N-1} x[n] - \mathbb{E}[\hat{\theta}_c]\right)^2\right] \\
&= \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=0}^{N-1} \big(\theta + w[n]\big) - \theta\right)^2\right] = \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=0}^{N-1} w[n]\right)^2\right] = \frac{1}{N^2}\,\mathbb{E}\left[\left(\sum_{n=0}^{N-1} w[n]\right)^2\right]
\end{aligned} \tag{8}$$

§ Now,

$$\mathrm{var}\left(\sum_{n=0}^{N-1} w[n]\right) = \mathbb{E}\left[\left(\sum_{n=0}^{N-1} w[n] - \mathbb{E}\left[\sum_{n=0}^{N-1} w[n]\right]\right)^2\right] = \mathbb{E}\left[\left(\sum_{n=0}^{N-1} w[n] - \sum_{n=0}^{N-1} \mathbb{E}\big[w[n]\big]\right)^2\right] = \mathbb{E}\left[\left(\sum_{n=0}^{N-1} w[n]\right)^2\right]$$

§ Using the above in eqn. (8),

$$\mathrm{var}(\hat{\theta}_c) = \frac{1}{N^2}\,\mathrm{var}\left(\sum_{n=0}^{N-1} w[n]\right) = \frac{1}{N^2}\sum_{n=0}^{N-1} \mathrm{var}\big(w[n]\big) \ \text{(WGN)} = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}$$
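The variances of the three estimators $\hat{\theta}_a$, $\hat{\theta}_b$ and $\hat{\theta}_c$ can be compared empirically. The parameter values are again hypothetical, and the sketch assumes independent Gaussian noise samples as in the WGN model.

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true, sigma, N, trials = 20.0, 2.0, 25, 50_000   # hypothetical values
x = theta_true + sigma * rng.standard_normal((trials, N))

var_a = np.zeros(trials).var()   # theta_a = 0: zero variance, but badly biased
var_b = x[:, 0].var()            # theta_b = x[0]: variance about sigma^2 = 4
var_c = x.mean(axis=1).var()     # theta_c = sample mean: about sigma^2 / N = 0.16
```

The constant estimator shows that low variance alone is not enough, and the single-sample estimator shows that unbiasedness alone is not enough either.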

SLIDE 56

Estimator Mean Square Error

§ The mean square error of estimation is

$$\begin{aligned}
\mathrm{mse}(\hat{\theta}) &= \mathbb{E}\big[(\hat{\theta} - \theta)^2\big] = \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}] + \mathbb{E}[\hat{\theta}] - \theta)^2\big] \\
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] + \mathbb{E}\big[(\mathbb{E}[\hat{\theta}] - \theta)^2\big] + 2\,\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])(\mathbb{E}[\hat{\theta}] - \theta)\big] \\
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] + (\mathbb{E}[\hat{\theta}] - \theta)^2 + 2\,(\mathbb{E}[\hat{\theta}] - \theta)\,\mathbb{E}\big[\hat{\theta} - \mathbb{E}[\hat{\theta}]\big] && \text{(why? Hint: what is random here?)} \\
&= \mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big] + (\mathbb{E}[\hat{\theta}] - \theta)^2 + 2\,(\mathbb{E}[\hat{\theta}] - \theta)\,\underbrace{(\mathbb{E}[\hat{\theta}] - \mathbb{E}[\hat{\theta}])}_{0} \\
&= \mathrm{var}(\hat{\theta}) + \mathrm{bias}^2(\hat{\theta})
\end{aligned}$$

§ So the mean square error of estimation is composed of errors due to the variance of the estimator as well as its bias.
§ Recall MC evaluation:

$$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \quad \text{and} \quad v_\pi(s) = \mathbb{E}\big[G_t \,|\, S_t = s\big]$$

$$\hat{v}_\pi(s) = \frac{1}{N}\sum_{i=1}^{N} G_t^{(i)} \quad (S_t = s)$$

§ So $\hat{v}_\pi(s)$ is an unbiased estimator, but it has variance (inversely proportional to the number of samples $N$).
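The decomposition of the mean square error into variance plus squared bias can be checked on simulated estimates. The sketch below reuses the halved sample-mean estimator as an example of a biased estimator; the parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
theta_true, sigma, N, trials = 20.0, 2.0, 10, 50_000   # hypothetical values
x = theta_true + sigma * rng.standard_normal((trials, N))

est = x.sum(axis=1) / (2 * N)            # biased estimator (1/2N) sum_n x[n]

mse = np.mean((est - theta_true) ** 2)   # empirical mean square error
var = est.var()                          # empirical variance (ddof=0)
bias = est.mean() - theta_true           # empirical bias, about -theta/2

# With ddof=0 the identity mse = var + bias^2 holds for the sample moments
# themselves, so the gap is zero up to floating-point rounding.
gap = mse - (var + bias ** 2)
```

Here almost all of the error comes from the bias term (about 100) rather than the variance term (about 0.1), which shows how a low-variance estimator can still be poor.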

SLIDE 57

Bias and Variance of MC and TD

§ One key contribution to the variance in MC evaluation comes from the randomness at each timestep.
§ This is not the case in TD, as $G_t$ is estimated by bootstrapping: $\hat{G}_t = R_{t+1} + \gamma \hat{V}(S_{t+1})$.
§ This makes the estimator suffer less from variance, as the randomness comes from only one random step. The rest is deterministic.
§ But this introduces bias. The estimate always has the deterministic additive component $\gamma \hat{V}(S_{t+1})$, which in general differs from the true value $\gamma v_\pi(S_{t+1})$.
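The two targets can be put side by side in code. The reward sequence, discount factor and value estimate below are hypothetical numbers chosen for illustration; `mc_return` and `td_target` are illustrative helper names.

```python
def mc_return(rewards, gamma):
    """Monte Carlo target: G_t = R_{t+1} + gamma R_{t+2} + ... until the episode ends."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td_target(reward, v_next, gamma):
    """TD(0) target: R_{t+1} + gamma * V_hat(S_{t+1}); only one random step enters."""
    return reward + gamma * v_next

rewards = [1.0, 0.0, 2.0]   # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.9
v_hat_next = 1.5            # hypothetical current estimate V_hat(S_{t+1})

g_mc = mc_return(rewards, gamma)                 # 1 + 0.9*0 + 0.81*2 = 2.62
g_td = td_target(rewards[0], v_hat_next, gamma)  # 1 + 0.9*1.5 = 2.35
```

The MC target depends on every sampled reward in the episode, while the TD target replaces everything after the first reward with the current estimate, trading variance for bias.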

SLIDE 59

Reducing Variance in Policy Gradient Estimate

§ We have seen

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]$$

§ There is a lot of randomness inside each trajectory.
§ We can derive versions of this formula that eliminate terms to reduce variance.
§ Let us apply the log-derivative trick ($\nabla_\theta \log p_\theta(\tau) = \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)$) to compute the gradient for a single reward term.

$$\nabla_\theta \mathbb{E}_\tau\big[r(s_t, a_t)\big] = \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'}|s_{t'})\right) r(s_t, a_t)\right] \tag{9}$$

§ Note that the sum goes only up to $t$. Why? The reward at timestep $t$ depends only on the actions taken at timesteps $t' \le t$ (causality).

SLIDE 60

Reducing Variance in Policy Gradient Estimate

§ Summing over time, we get (with some reordering of the sums in the last step)

$$\begin{aligned}
\nabla_\theta \mathbb{E}_\tau\big[r(\tau)\big] &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t) \sum_{t'=1}^{t} \nabla_\theta \log \pi_\theta(a_{t'}|s_{t'})\right] \\
&= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \sum_{t'=t}^{T} r(s_{t'}, a_{t'})\right]
\end{aligned} \tag{10}$$

§ With less randomness inside each trajectory the variance is lower, but what about bias?
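The inner sum $\sum_{t'=t}^{T} r(s_{t'}, a_{t'})$ in eqn. (10) is often called the reward-to-go, and it can be computed with a reverse cumulative sum. The helpers `reward_to_go` and `pg_estimate` below are hypothetical names in a minimal sketch; the inputs are made-up arrays.

```python
import numpy as np

def reward_to_go(rewards):
    """rtg[t] = sum_{t' >= t} r_{t'}: reverse cumulative sum of the rewards."""
    return np.cumsum(rewards[::-1])[::-1]

def pg_estimate(grad_log_pi, rewards, causal=True):
    """Single-trajectory policy gradient estimate.

    grad_log_pi: (T, D) array whose row t is grad_theta log pi(a_t|s_t).
    rewards:     (T,) array of r(s_t, a_t).
    """
    if causal:
        weights = reward_to_go(rewards)                   # form of eqn. (10)
    else:
        weights = np.full(len(rewards), rewards.sum())    # full-return form
    return (grad_log_pi * weights[:, None]).sum(axis=0)

rewards = np.array([1.0, 0.0, 2.0])
rtg = reward_to_go(rewards)   # [3., 2., 2.]
```

With `causal=True`, each log-probability gradient is weighted only by rewards that come at or after its own timestep, so the rewards earned before an action no longer add noise to that action's term.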

SLIDE 62

Reducing Variance in Policy Gradient Estimate

$$\begin{aligned}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right] \\
&= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t'=1}^{T} r(s_{t'}, a_{t'})\right)\right] \\
&= \mathbb{E}_{\tau\sim p_\theta(\tau)}\left[\sum_{t=1}^{T}\sum_{t'=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, r(s_{t'}, a_{t'})\right] \\
&= \sum_{t=1}^{T}\sum_{t'=1}^{T} \mathbb{E}_{\tau\sim p_\theta(\tau)}\Big[\underbrace{\nabla_\theta \log \pi_\theta(a_t|s_t)\, r(s_{t'}, a_{t'})}_{f(t,t')}\Big]
\end{aligned} \tag{11}$$

§ Let us consider the term

$$\mathbb{E}_{\tau\sim p_\theta(\tau)}\big[f(t, t')\big] = \mathbb{E}_{\tau\sim p_\theta(\tau)}\big[\nabla_\theta \log \pi_\theta(a_t|s_t)\, r(s_{t'}, a_{t'})\big] \tag{12}$$

§ We will show that for the case of $t' < t$ (the reward comes before the action is performed) the above term is zero.

SLIDE 63

Reducing Variance in Policy Gradient Estimate

$$\begin{aligned}
\mathbb{E}_{\tau\sim p_\theta(\tau)}\big[f(t, t')\big] &= \int p(\tau)\, f(t, t')\, d\tau \\
&= \int p(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots)\, f(t, t')\, d(s_1, a_1, \cdots, s_t, a_t, \cdots, s_{t'}, a_{t'}, \cdots) \\
&= \int p(s_t, a_t, s_{t'}, a_{t'})\, f(t, t')\, d(s_t, a_t, s_{t'}, a_{t'})
\end{aligned} \tag{13}$$

§ The above comes from the property below.

$$\begin{aligned}
\int_X \int_Y f(X)\, P(X, Y)\, dY\, dX &= \int_X \int_Y f(X)\, P(X)\, P(Y|X)\, dY\, dX \\
&= \int_X f(X)\, P(X) \underbrace{\int_Y P(Y|X)\, dY}_{1}\, dX = \int_X f(X)\, P(X)\, dX
\end{aligned} \tag{14}$$

§ Take $X = \{s_t, a_t, s_{t'}, a_{t'}\}$ and $Y$ the rest.

slide-64
SLIDE 64

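Reducing Variance in Policy Gradient Estimate

§ Till now we have,

$$
\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right] = \int p(s_t, a_t, s_{t'}, a_{t'})\, f(t,t')\, d(s_t, a_t, s_{t'}, a_{t'})
\tag{15}
$$

§ We will now use a variation of iterated expectation.

$$
\begin{aligned}
\mathbb{E}_{A,B}\left[f(A,B)\right]
&= \int\!\!\int P(A,B)\, f(A,B)\, dB\, dA \\
&= \int\!\!\int P(B|A)\, P(A)\, f(A,B)\, dB\, dA \\
&= \int P(A) \left[\int P(B|A)\, f(A,B)\, dB\right] dA \\
&= \int P(A)\, \mathbb{E}_{B}\left[f(A,B) \mid A\right] dA \\
&= \mathbb{E}_{A}\Big[\mathbb{E}_{B}\left[f(A,B) \mid A\right]\Big]
\end{aligned}
$$

§ Taking $A = (s_{t'}, a_{t'})$ and $B = (s_t, a_t)$, eqn. (15) can be written as,

$$
\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right] = \mathbb{E}_{s_{t'}, a_{t'}}\Big[\mathbb{E}_{s_t, a_t}\left[f(t,t') \mid s_{t'}, a_{t'}\right]\Big]
\tag{16}
$$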
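The iterated-expectation identity used to obtain eqn. (16) can be verified the same way on a discrete joint distribution. This is a minimal sketch with made-up numbers; `P`, `f` and `inner` are illustrative names, not part of the lecture.

```python
# Made-up joint distribution P(A, B) with A in {0, 1} and B in {0, 1, 2}.
P = {
    (0, 0): 0.10, (0, 1): 0.20, (0, 2): 0.10,
    (1, 0): 0.25, (1, 1): 0.05, (1, 2): 0.30,
}


def f(a, b):
    """An arbitrary function of both variables (illustrative)."""
    return (a + 1) * (b - 0.5)


# Direct expectation E_{A,B}[f(A, B)] under the joint.
direct = sum(p * f(a, b) for (a, b), p in P.items())

# Marginal P(A), needed for the conditional P(B | A) = P(A, B) / P(A).
P_a = {}
for (a, b), p in P.items():
    P_a[a] = P_a.get(a, 0.0) + p


def inner(a):
    """Inner expectation E_B[f(a, B) | A = a]."""
    return sum(p / P_a[a] * f(a, b) for (a2, b), p in P.items() if a2 == a)


# Outer expectation over A of the inner conditional expectation.
iterated = sum(P_a[a] * inner(a) for a in P_a)

print(direct, iterated)
```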

slide-65
SLIDE 65

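Reducing Variance in Policy Gradient Estimate

§ Putting the value of $f(t,t')$ back in eqn. (16), we get,

$$
\begin{aligned}
\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right]
&= \mathbb{E}_{s_{t'}, a_{t'}}\Big[\mathbb{E}_{s_t, a_t}\left[f(t,t') \mid s_{t'}, a_{t'}\right]\Big] \\
&= \mathbb{E}_{s_{t'}, a_{t'}}\Big[\mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t|s_t)\, r(s_{t'}, a_{t'}) \mid s_{t'}, a_{t'}\right]\Big] \\
&= \mathbb{E}_{s_{t'}, a_{t'}}\Big[r(s_{t'}, a_{t'})\, \mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \mid s_{t'}, a_{t'}\right]\Big]
\end{aligned}
\tag{17}
$$

§ Let us take a closer look at the inner expectation,

$$
\mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \mid s_{t'}, a_{t'}\right] = \int P(s_t, a_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, d(a_t, s_t)
\tag{18}
$$

§ Now, let us consider the timestep $t$ to be greater than $t'$, i.e., the action occurs after the reward. In such a case, $P(s_t, a_t \mid s_{t'}, a_{t'})$ can be broken down to $P(a_t|s_t)\, P(s_t \mid s_{t'}, a_{t'})$, where $P(a_t|s_t)$ is the policy $\pi_\theta(a_t|s_t)$. Thus eqn. (18) becomes,

$$
\begin{aligned}
\mathbb{E}_{s_t, a_t}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \mid s_{t'}, a_{t'}\right]
&= \int\!\!\int P(a_t|s_t)\, P(s_t \mid s_{t'}, a_{t'})\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, da_t\, ds_t \\
&= \int P(s_t \mid s_{t'}, a_{t'}) \left[\int P(a_t|s_t)\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, da_t\right] ds_t \\
&= \mathbb{E}_{s_t}\Big[\mathbb{E}_{a_t}\left[\nabla_\theta \log \pi_\theta(a_t|s_t) \mid s_t\right] \,\Big|\, s_{t'}, a_{t'}\Big]
\end{aligned}
\tag{19}
$$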

slide-66
SLIDE 66

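Reducing Variance in Policy Gradient Estimate

§ Now we will use a neat trick known as the ‘Expected Grad Log Probability’ (EGLP) lemma, which says $\mathbb{E}\left[\nabla_\theta \log p_\theta(x)\right] = 0$.

$$
\begin{aligned}
\mathbb{E}_{x \sim p_\theta(x)}\left[\nabla_\theta \log p_\theta(x)\right]
&= \int p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx
= \int p_\theta(x)\, \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}\, dx \\
&= \int \nabla_\theta p_\theta(x)\, dx
= \nabla_\theta \int p_\theta(x)\, dx
= \nabla_\theta 1 = 0
\end{aligned}
$$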
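For a discrete distribution the EGLP lemma can be checked exactly by summing over all outcomes. The sketch below assumes a softmax distribution over three outcomes with arbitrary logits; the softmax parameterisation and the names `softmax`, `grad_log_softmax` and `theta` are illustrative choices, not taken from the slides.

```python
import math


def softmax(theta):
    """Numerically stable softmax over a list of logits."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(theta, a):
    """Gradient of log p_theta(a) w.r.t. the logits: 1[k == a] - p_theta(k)."""
    p = softmax(theta)
    return [(1.0 if k == a else 0.0) - p[k] for k in range(len(theta))]


theta = [0.3, -1.2, 2.0]  # arbitrary logits (an assumption for illustration)
p = softmax(theta)

# E_{x ~ p_theta}[grad log p_theta(x)], computed exactly as a weighted sum.
expected_score = [
    sum(p[a] * grad_log_softmax(theta, a)[k] for a in range(len(theta)))
    for k in range(len(theta))
]

print(expected_score)
```

Every component comes out as zero up to rounding error, for any choice of logits.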

§ Thus the inner expectation in eqn. (19) is 0. This, in turn, means eqns. (17), (16) and (15) are all 0.

§ That is, $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right] = 0$ for $t > t'$.

§ Now for $t \le t'$, $P(s_t, a_t \mid s_{t'}, a_{t'})$ cannot be broken down to $P(a_t|s_t)\, P(s_t \mid s_{t'}, a_{t'})$, as the past state and action $(s_t, a_t)$ would be conditioned on the future state and action $(s_{t'}, a_{t'})$. The Markov property does not make $a_t$ independent of the future given $s_t$, because $a_t$ influences what happens later.

§ So, in general, $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right] \neq 0$ for $t \le t'$.
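The causality argument can be checked by brute force on a tiny MDP: enumerate every trajectory, weight it by its probability, and compute the expectation of $f(t,t')$ exactly. Everything in the sketch below (two states, two actions, horizon 3, the transition table `P`, rewards `r`, and the tabular softmax policy with logits `theta`) is a made-up example, and time steps are 0-based in the code while the slides count from 1.

```python
import itertools
import math

S, A, T = 2, 2, 3  # number of states, actions, and time steps (0 .. T-1)

theta = [[0.2, -0.4], [1.0, 0.3]]  # policy logits theta[s][a] (illustrative)
p0 = [0.6, 0.4]                    # initial state distribution
P = [[[0.9, 0.1], [0.2, 0.8]],     # transition probabilities P[s][a][s_next]
     [[0.5, 0.5], [0.3, 0.7]]]
r = [[1.0, 0.0], [2.0, 5.0]]       # reward r[s][a]


def pi(s):
    """Softmax policy pi_theta(. | s)."""
    m = max(theta[s])
    e = [math.exp(x - m) for x in theta[s]]
    z = sum(e)
    return [x / z for x in e]


def grad_log_pi(s, a):
    """Gradient of log pi_theta(a | s) w.r.t. all S*A logits, flattened."""
    g = [0.0] * (S * A)
    probs = pi(s)
    for k in range(A):
        g[s * A + k] = (1.0 if k == a else 0.0) - probs[k]
    return g


def expected_f(t, t_prime):
    """Exact E_tau[grad log pi(a_t|s_t) * r(s_t', a_t')] by enumeration."""
    total = [0.0] * (S * A)
    # Each trajectory is (s_0, a_0, s_1, a_1, ..., s_{T-1}, a_{T-1}).
    for traj in itertools.product(range(S), range(A), repeat=T):
        states, actions = traj[0::2], traj[1::2]
        prob = p0[states[0]]
        for k in range(T):
            prob *= pi(states[k])[actions[k]]
            if k + 1 < T:
                prob *= P[states[k]][actions[k]][states[k + 1]]
        g = grad_log_pi(states[t], actions[t])
        reward = r[states[t_prime]][actions[t_prime]]
        for i in range(S * A):
            total[i] += prob * reward * g[i]
    return total


# Largest absolute gradient component for every (t, t') pair.
max_abs = {
    (t, tp): max(abs(x) for x in expected_f(t, tp))
    for t in range(T)
    for tp in range(T)
}

for (t, tp), m in sorted(max_abs.items()):
    note = "reward precedes action" if tp < t else ""
    print(f"t={t} t'={tp} max|E[f]|={m:.6f} {note}")
```

In this example the pairs with $t' < t$ vanish up to rounding error, while the pairs with $t' \ge t$ do not, which is the pattern the derivation predicts.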


slide-69
SLIDE 69

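Reducing Variance in Policy Gradient Estimate

§ So we began with,

$$
\nabla_\theta J(\theta) = \sum_{t=1}^{T}\sum_{t'=1}^{T} \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right]
\tag{20}
$$

and have shown that

$$
\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right]
\begin{cases}
= 0 & \text{if } t' < t \\
\neq 0 & \text{if } t' \ge t
\end{cases}
$$

§ So, the gradient of the total expected return can be written as,

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \sum_{t=1}^{T}\sum_{t'=t}^{T} \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[f(t,t')\right]
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T}\sum_{t'=t}^{T} f(t,t')\right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\right)\right]
\end{aligned}
\tag{21}
$$

§ This is the ‘reward to go’ formulation we have seen earlier, which has lower variance. Eqn. (21) is also exactly equal to the original expression for the gradient of the total expected reward, whose sample estimate is unbiased. So the reward-to-go estimator is an unbiased, lower-variance estimator of the gradient of the total expected reward.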
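In an implementation, the inner sum over $t' \ge t$ is usually computed for all $t$ at once with a single backward pass over the rewards of a trajectory. A minimal sketch follows; the function name `reward_to_go` and the sample rewards are illustrative, and no discounting is applied, matching the undiscounted sums on these slides.

```python
def reward_to_go(rewards):
    """Return a list whose entry t is the sum of rewards[t], rewards[t+1], ..."""
    out = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each entry reuses the suffix sum of the next one.
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        out[t] = running
    return out


rewards = [1.0, 0.0, 2.0, 3.0]  # made-up rewards of one trajectory
rtg = reward_to_go(rewards)
print(rtg)  # [6.0, 5.0, 5.0, 3.0]
```

Each $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is then weighted by `rtg[t]` instead of by the full return `sum(rewards)`.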


slide-72
SLIDE 72

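Baselines

§ Good stuff is made more likely.
§ Bad stuff is made less likely.
§ What if all have high reward?

Figure credit: [Sergey Levine, UC Berkeley]

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\right)\left(\sum_{t=1}^{T} r(s_t, a_t)\right)\right]
$$

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left[r(\tau) - b\right]\right]
$$

§ Will it remain unbiased?
§ Only if $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, b\right] = b\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\right] = 0$
§ And $\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\right] = 0$ by the EGLP lemma.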
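That a constant baseline leaves the gradient unchanged can be checked exactly in the simplest possible setting. The sketch below assumes one-step trajectories (a three-armed bandit) under a softmax policy, with returns that are all large and positive; the logits `theta`, the `returns` values and the helper `exact_gradient` are made up for illustration.

```python
import math

theta = [0.5, -0.3, 1.1]        # softmax logits (illustrative)
returns = [101.0, 100.0, 103.0]  # r(tau) per arm: all high, as in the question above

m = max(theta)
exps = [math.exp(x - m) for x in theta]
z = sum(exps)
pi = [e / z for e in exps]


def exact_gradient(b):
    """sum_a pi(a) * grad log pi(a) * (r(a) - b), evaluated exactly."""
    grad = [0.0] * len(theta)
    for a, p in enumerate(pi):
        for k in range(len(theta)):
            score = (1.0 if k == a else 0.0) - pi[k]  # d log pi(a) / d theta_k
            grad[k] += p * score * (returns[a] - b)
    return grad


mean_return = sum(p * r for p, r in zip(pi, returns))
g_no_baseline = exact_gradient(0.0)
g_with_baseline = exact_gradient(mean_return)

print(g_no_baseline)
print(g_with_baseline)
```

The two gradients match up to rounding. What changes with the baseline is the size of the individual sample terms: without it every sampled arm is pushed up by roughly 100, with it only the deviation from the mean return matters.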

slide-73
SLIDE 73

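Baselines

§ So subtracting a constant baseline keeps the estimate unbiased.
§ A reasonable choice of baseline is the average reward across the trajectories, $b = \frac{1}{N}\sum_{i=1}^{N} r(\tau_i)$.
§ What about variance?

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left[r(\tau) - b\right]\right]
$$

$$
\begin{aligned}
\mathrm{var}
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\Big(\nabla_\theta \log p_\theta(\tau)\left[r(\tau) - b\right]\Big)^2\right] - \Big(\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\left[r(\tau) - b\right]\right]\Big)^2 \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\Big(\nabla_\theta \log p_\theta(\tau)\left[r(\tau) - b\right]\Big)^2\right] - \Big(\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right]\Big)^2
\end{aligned}
$$

§ The second term does not depend on $b$ (the baseline leaves the expectation unchanged), so its derivative with respect to $b$ is 0.

$$
\frac{\partial\, \mathrm{var}}{\partial b}
= \frac{\partial}{\partial b}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\Big(\nabla_\theta \log p_\theta(\tau)\left[r(\tau) - b\right]\Big)^2\right] - 0
= \frac{\partial}{\partial b}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \left(r^2(\tau) - 2 r(\tau) b + b^2\right)\right]
$$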

slide-74
SLIDE 74

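Baselines

$$
\begin{aligned}
\frac{\partial\, \mathrm{var}}{\partial b}
&= \frac{\partial}{\partial b}\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 \left(r^2(\tau) - 2 r(\tau) b + b^2\right)\right] \\
&= 0 - 2\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\right] + 2b\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\right]
\end{aligned}
$$

§ For minimum variance,

$$
\frac{\partial\, \mathrm{var}}{\partial b} = 0
\;\Rightarrow\;
-\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\right] + b\, \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\right] = 0
$$

$$
b = \frac{\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2 r(\tau)\right]}{\mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\big(\nabla_\theta \log p_\theta(\tau)\big)^2\right]}
$$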
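The optimal-baseline formula can be compared against other baselines exactly in a toy problem. The sketch below again assumes one-step trajectories (a three-armed bandit) with a softmax policy, and it treats each gradient component separately, since the derivation squares the gradient as if it were a scalar. The logits, returns and function names are illustrative assumptions.

```python
import math

theta = [0.5, -0.3, 1.1]   # softmax logits (illustrative)
returns = [1.0, 2.0, 4.0]  # r(tau) for each one-step trajectory

m = max(theta)
exps = [math.exp(x - m) for x in theta]
z = sum(exps)
pi = [e / z for e in exps]


def score(a, k):
    """k-th component of grad_theta log pi(a) for a softmax policy."""
    return (1.0 if k == a else 0.0) - pi[k]


def variance(b, k):
    """Exact variance of the k-th gradient component with baseline b."""
    second = sum(p * (score(a, k) * (returns[a] - b)) ** 2 for a, p in enumerate(pi))
    first = sum(p * score(a, k) * (returns[a] - b) for a, p in enumerate(pi))
    return second - first ** 2


def optimal_baseline(k):
    """b = E[g^2 r] / E[g^2] for gradient component k."""
    num = sum(p * score(a, k) ** 2 * returns[a] for a, p in enumerate(pi))
    den = sum(p * score(a, k) ** 2 for a, p in enumerate(pi))
    return num / den


mean_return = sum(p * r for p, r in zip(pi, returns))
for k in range(len(theta)):
    b_star = optimal_baseline(k)
    print(k, variance(0.0, k), variance(mean_return, k), variance(b_star, k))
```

For every component, the variance at the optimal baseline is no larger than with no baseline or with the mean return. The mean return is a convenient approximation, not the minimiser.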


slide-78
SLIDE 78

Advantage Function

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Bigg[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \underbrace{\left(\sum_{t'=t}^{T} r(s_{t'}, a_{t'})\right)}_{Q^\theta(s_t, a_t)}\Bigg]
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, Q^\theta(s_t, a_t)\right]
$$

§ It would be good to use the true value of $Q$ in this equation.
§ But that is not available to us.
§ The alternative is to estimate this value using methods that we have seen earlier: MC evaluation, bootstrapped evaluation (TD), or either of these with function approximation.


slide-82
SLIDE 82

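Advantage Function

§ We can also use a baseline version of this.

$$
\begin{aligned}
\nabla_\theta J(\theta)
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \Big(Q^\theta(s_t, a_t) - \mathbb{E}_{a_t}\left[Q^\theta(s_t, a_t)\right]\Big)\right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \Big(Q^\theta(s_t, a_t) - V^\theta(s_t)\Big)\right] \\
&= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^\theta(s_t, a_t)\right]
\end{aligned}
$$

§ $A^\theta(s_t, a_t) = Q^\theta(s_t, a_t) - V^\theta(s_t)$ is called the ‘Advantage function’.
§ $A^\theta(s_t, a_t)$ can be approximated following the methods we used earlier (single sample backup or bootstrapping).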
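A small worked example may help fix the definitions: for a single state, $V$ is the policy-weighted average of $Q$, and the advantage measures how much better or worse each action is than that average. The action names and all numbers below are made up for illustration.

```python
# Q-values and policy probabilities for one state (illustrative numbers).
Q = {"left": 1.0, "right": 3.0, "stay": 2.0}
pi = {"left": 0.5, "right": 0.25, "stay": 0.25}

# V(s) = E_{a ~ pi}[Q(s, a)]
V = sum(pi[a] * Q[a] for a in Q)

# A(s, a) = Q(s, a) - V(s)
advantage = {a: Q[a] - V for a in Q}

# The advantage averages to zero under the policy.
mean_advantage = sum(pi[a] * advantage[a] for a in Q)

print(V)               # 1.75
print(advantage)       # {'left': -0.75, 'right': 1.25, 'stay': 0.25}
print(mean_advantage)  # 0.0
```

Actions with positive advantage are made more likely by the gradient step and actions with negative advantage less likely, even if every Q-value is positive.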

slide-83
SLIDE 83

Advantage Function

$$
\nabla_\theta J(\theta)
= \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\, A^\theta(s_t, a_t)\right]
\approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})\, A^\theta(s_{i,t}, a_{i,t})
$$

§ $Q^\theta(s_t, a_t) \approx r(s_t, a_t) + V^\theta(s_{t+1})$
§ $A^\theta(s_t, a_t) \approx r(s_t, a_t) + V^\theta(s_{t+1}) - V^\theta(s_t)$
§ So we can use a neural network which learns to produce $V(s)$.
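The one-step advantage estimate needs only the rewards of a trajectory and the value estimates of the visited states. The sketch below assumes the values come from some learned critic and are simply given as a list; it also assumes the value after the last step is 0 because the episode has terminated. The function name `td_advantages` and the numbers are illustrative, and there is no discount factor, as on the slide.

```python
def td_advantages(rewards, values):
    """One-step advantage estimates A_t = r_t + V(s_{t+1}) - V(s_t).

    `values` must have one more entry than `rewards`: values[t] is V(s_t)
    and values[-1] is the value of the state reached after the last action.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must be one element longer than rewards")
    return [r + values[t + 1] - values[t] for t, r in enumerate(rewards)]


rewards = [1.0, 0.0, 2.0]       # made-up rewards of one trajectory
values = [2.5, 1.5, 2.0, 0.0]   # hypothetical critic outputs; terminal value 0
adv = td_advantages(rewards, values)
print(adv)  # [0.0, 0.5, 0.0]
```

These numbers then replace $A^\theta(s_{i,t}, a_{i,t})$ as the weights on $\nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$ in the sample average.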

slide-84
SLIDE 84

Actor-Critic

Figure credit: [Sergey Levine, UC Berkeley]