SLIDE 1

Deep Reinforcement Learning

  • M. Soleymani

Sharif University of Technology, Spring 2020. Most slides are based on Bhiksha Raj, 11-785, CMU 2019; some slides from Fei-Fei Li and colleagues' lectures, cs231n, Stanford 2018; and some from Sergey Levine's lectures, cs294-112, Berkeley 2016.

SLIDE 2

Q-Learning

  • Currently most-popular RL algorithm
  • Topics not covered:

– Value function approximation
– Continuous state spaces
– Deep Q-learning

SLIDE 3

Scaling up the problem..

  • We’ve assumed a discrete set of states
  • And a discrete set of actions
  • Value functions can be stored as a table

– One entry per state

  • Action value functions can be stored as a table

– One entry per state-action combination

  • Policy can be stored as a table

– One probability entry per state-action combination

  • None of this is feasible if

– The state space grows too large (e.g. chess)
– Or the states are continuous-valued

SLIDE 4

Problem

  • Not scalable.

– Must compute Q(s,a) for every state-action pair.

  • It is computationally infeasible to compute them for the entire state space!
  • Solution: use a function approximator to estimate Q(s,a).

– E.g. a neural network!

SLIDE 5

Continuous State Space

  • Tabular methods won’t work if our state space is infinite or huge
  • E.g. position on a [0, 5] x [0, 5] square, instead of a 5x5 grid.

[Figure: the graphs show the negative value function over the grid and over the continuous square.]

SLIDE 6

Parameterized Functions

  • Instead of using a table of Q-values, we use a parameterized function:

$$Q(s, a; \theta)$$

  • If the function approximator is a deep network => Deep RL
  • Instead of writing values into the table, we fit the parameters to minimize the prediction error of the "Q-function":

$$\theta_{k+1} \leftarrow \theta_k - \eta \, \nabla_\theta \, \mathrm{Loss}\big(Q(s, a; \theta_k),\, Q^{\text{target}}_{s,a}\big)$$
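The update above can be read as one supervised-style gradient step. Below is a minimal hedged sketch in PyTorch (not the lecture's code); the network sizes, the chosen action, and the target value are illustrative assumptions.

```python
# Minimal sketch: a parameterized Q-function Q(s, a; theta) and one gradient
# step of theta toward a given target value (toy, assumed numbers throughout).
import torch
import torch.nn as nn

state_dim, n_actions = 4, 2                                # toy sizes (assumption)
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))            # Q(s, .; theta)
opt = torch.optim.SGD(q_net.parameters(), lr=1e-3)         # learning rate eta

s = torch.randn(1, state_dim)                              # a state s
a = torch.tensor([[0]])                                    # the action a taken
q_target = torch.tensor([1.5])                             # Q^target_{s,a}, however it was computed

q_sa = q_net(s).gather(1, a).squeeze(1)                    # Q(s, a; theta_k)
loss = nn.functional.mse_loss(q_sa, q_target)              # Loss(Q(s,a;theta_k), Q^target)
opt.zero_grad(); loss.backward(); opt.step()               # theta_{k+1} = theta_k - eta * grad
```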

SLIDE 7

Parameterized Functions

SLIDE 8

Case Study: Playing Atari Games (seen before)

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

SLIDE 9

Q-network Architecture

Last FC layer has a 4-d output (if 4 actions): Q(s_t, a_1), Q(s_t, a_2), Q(s_t, a_3), Q(s_t, a_4)
The number of actions is between 4 and 18, depending on the Atari game.
A single feedforward pass computes the Q-values for all actions from the current state => efficient!
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
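For reference, a hedged sketch of such a Q-network in PyTorch; the layer sizes follow the Nature 2015 DQN (32/64/64 conv filters, 512-unit FC on 4 stacked 84×84 frames) and should be treated as an assumption, not a transcript of the slide's figure.

```python
# Sketch of a DQN-style convolutional Q-network for Atari.
import torch
import torch.nn as nn

class AtariQNet(nn.Module):
    def __init__(self, n_actions: int):
        super().__init__()
        self.features = nn.Sequential(            # input: 4 stacked 84x84 grayscale frames
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),            # one Q-value per action: Q(s_t, a_1..a_K)
        )

    def forward(self, s):                         # one forward pass -> all action values
        return self.head(self.features(s / 255.0))

q = AtariQNet(n_actions=4)
print(q(torch.zeros(1, 4, 84, 84)).shape)         # torch.Size([1, 4])
```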

SLIDE 10

Solving for the optimal policy: Q-learning

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

SLIDE 11

Solving for the optimal policy: Q-learning

Iteratively try to make the Q-value close to the target value (y_i) it should have (according to the Bellman equation).
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

SLIDE 13

Target Q

$$\theta_{k+1} \leftarrow \theta_k - \eta \, \nabla_\theta \, \mathrm{Loss}\big(Q(s, a; \theta_k),\, Q^{\text{target}}_{s,a}\big)$$

→ What is $Q^{\text{target}}_{s,a}$?

As in TD learning, use bootstrapping for the target:

$$Q^{\text{target}}_{s,a} = r + \gamma \max_{a' \in \mathcal{A}} Q(s', a'; \theta_k)$$

and $\mathrm{Loss}$ can be the L2 distance.

SLIDE 14

DQN (v0)

  • Initialize $\theta_1$
  • For each episode $e$
    – Initialize $s_1$
    – For $t = 1 \dots$ Termination
      • Choose action $a_t$ using the $\epsilon$-greedy policy obtained from $\theta_t$
      • Observe $r_t, s_{t+1}$
      • $Q_t^{\text{target}} = r_t + \gamma \max_a Q(s_{t+1}, a \mid \theta_t)$
      • $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \big\| Q_t^{\text{target}} - Q(s_t, a_t \mid \theta_t) \big\|_2^2$

(A code sketch of this loop is given below.)
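A hedged sketch of the loop in PyTorch; the `env` interface (`reset()`, `step(a)` returning `(s_next, r, done)`), network sizes and hyperparameters are assumptions for illustration, not code from the lecture.

```python
# "DQN (v0)": online Q-learning with a neural Q-function, epsilon-greedy actions,
# and a bootstrapped target (no replay buffer, no frozen target network yet).
import random
import torch
import torch.nn as nn

def dqn_v0(env, state_dim, n_actions, episodes=100, gamma=0.99, eps=0.1, lr=1e-3):
    q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
    opt = torch.optim.SGD(q_net.parameters(), lr=lr)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:                                   # until Termination
            s_t = torch.as_tensor(s, dtype=torch.float32)
            if random.random() < eps:                     # epsilon-greedy action
                a = random.randrange(n_actions)
            else:
                a = int(q_net(s_t).argmax())
            s_next, r, done = env.step(a)                 # observe r_t, s_{t+1}
            with torch.no_grad():                         # bootstrapped target
                q_target = r + (0.0 if done else
                                gamma * q_net(torch.as_tensor(s_next, dtype=torch.float32)).max())
            loss = (q_target - q_net(s_t)[a]) ** 2        # squared TD error
            opt.zero_grad(); loss.backward(); opt.step()  # semi-gradient step
            s = s_next
    return q_net
```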

SLIDE 15

Deep Q Network

  • Note: $\nabla_\theta \big\| Q_t^{\text{target}} - Q(s_t, a_t \mid \theta_t) \big\|_2^2$ does not treat $Q_t^{\text{target}}$ as depending on $\theta_t$ (although it does). Therefore this is semi-gradient descent.
SLIDE 16

Training the Q-network: Experience Replay

  • Learning from batches of consecutive samples is problematic:
    – Samples are correlated => inefficient learning
    – The current Q-network parameters determine the next training samples
      • can lead to bad feedback loops
      • e.g. if the maximizing action is to move left, training samples will be dominated by samples from the left-hand side => can lead to bad feedback loops
  • Address these problems using experience replay
    – Continually update a replay memory table of transitions $(s_t, a_t, r_t, s_{t+1})$
    – Train the Q-network on random minibatches of transitions from the replay memory
      ✓ Each transition can contribute to multiple weight updates => greater data efficiency
      ✓ Smooths out learning and avoids oscillations or divergence in the parameters
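A hedged sketch of a replay memory with the two operations described above (store a transition, sample a random minibatch); the fixed capacity and uniform sampling are common choices assumed here, not prescribed by the slide.

```python
# Minimal replay memory sketch: store transitions, sample uniform random minibatches.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity: int = 100_000):
        self.memory = deque(maxlen=capacity)           # oldest transitions evicted when full

    def add(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))    # (s_t, a_t, r_t, s_{t+1}) plus terminal flag

    def sample(self, batch_size: int):
        return random.sample(self.memory, batch_size)  # breaks correlation between samples

    def __len__(self):
        return len(self.memory)
```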

SLIDE 17

Parameterized Functions

  • Fundamental issue: limited capacity
    – A table of Q-values will never forget any values that you write into it
    – But modifying the parameters of a Q-function will affect its overall behavior
  • Fitting the parameters to match one $(s, a)$ pair can change the function's output at $(s', a')$.
  • If we don't visit $(s', a')$ for a long time, the function's output can diverge considerably from the values previously stored there.
SLIDE 18

Tables have full capacity

  • Q-learning works well with Q-tables

    – The sample data is going to be heavily biased toward optimal actions $(s, \pi^*(s))$, or close approximations thereof.
    – But still, an $\epsilon$-greedy policy will ensure that we visit all state-action pairs arbitrarily many times if we explore long enough.
    – The action-values for uncommon inputs will still converge, just more slowly.
SLIDE 19

Limited Capacity of $Q(s, a; \theta)$

  • The Q-function will fit more closely to more common inputs, even at the expense of lower accuracy for less common inputs.
  • Just exploring the whole state-action space isn't enough. We also need to visit those states often enough so that the function computes accurate Q-values before they are "forgotten".
SLIDE 20

Experience Replay

  • The raw data obtained from Q-learning is:
    – Highly correlated: current data can look very different from data from several episodes ago if the policy has changed significantly.
    – Very unevenly distributed: there is only an $\epsilon$ probability of choosing suboptimal actions.
  • Instead, create a replay buffer holding past experiences, so we can train the Q-function using this data.
SLIDE 21

Experience Replay

  • We have control over how experiences are added, sampled, and deleted.
    – Can make the samples look independent
    – Can emphasize old experiences more
    – Can change frequency depending on accuracy
  • What is the best way to sample? (A trade-off!)
    – On the one hand, our function has limited capacity, so we should let it optimize more strongly for the common case
    – On the other hand, our function needs to explore uncommon examples just enough to compute accurate action-values, so it can avoid missing out on better policies
SLIDE 22

DQN (with Experience Replay )

  • Initialize $\theta_0$
  • Initialize the buffer with some random episodes
  • For each episode $e$
    – Initialize $s_1, a_1$
    – For $t = 1 \dots$ Termination
      • Choose action $a_t$ using the $\epsilon$-greedy policy obtained from $\theta_t$
      • Observe $r_t, s_{t+1}$
      • Add $(s_t, a_t, r_t, s_{t+1})$ to the buffer
      • Sample from the buffer a batch of tuples $(s, a, r, s_{new})$
      • $Q_t^{\text{target}} = r + \gamma \max_a Q(s_{new}, a \mid \theta_t)$
      • $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \big\| Q_t^{\text{target}} - Q(s, a \mid \theta_t) \big\|_2^2$
SLIDE 23

Moving target

  • We already have moving targets in Q-learning itself
  • The problem is much worse with Q-functions though: optimizing the function at one state-action pair affects all other state-action pairs.
    – The target value is fluctuating at all inputs in the function's domain, and every update shifts the target value across the entire domain.
SLIDE 24

Frozen target function

  • Solution: create two copies of the Q-function.
    – The "target copy" is frozen and used to compute the target Q-values.
    – The "learner copy" will be trained on the targets.

$$Q^{\text{learner}}(s_t, a_t) \xleftarrow{\text{fit}} r_t + \gamma \max_a Q^{\text{target}}(s_{t+1}, a)$$

  • We just need to periodically update the target copy to match the learner copy.
SLIDE 25

Fixed target DQN

  • Initialize $\theta_0$, $\theta^* = \theta_0$
  • Initialize the buffer with some random episodes
  • For each episode $e$
    – Initialize $s_1, a_1$
    – For $t = 1 \dots$ Termination
      • If $t \,\%\, k = 0$ then update $\theta^* = \theta_t$
      • Choose action $a_t$ using the $\epsilon$-greedy policy obtained from $\theta_t$
      • Observe $r_t, s_{t+1}$
      • Add $(s_t, a_t, r_t, s_{t+1})$ to the buffer
      • Sample from the buffer a batch of tuples $(s, a, r, s_{new})$
      • $Q_t^{\text{target}} = r + \gamma \max_a Q(s_{new}, a \mid \theta^*)$
      • $\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \big\| Q_t^{\text{target}} - Q(s, a \mid \theta_t) \big\|_2^2$

(A code sketch of the inner update is given below.)
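A hedged sketch of that inner update in PyTorch: sample a minibatch, compute targets with the frozen copy $\theta^*$, take a gradient step on the learner, and sync periodically. The buffer format (`(s, a, r, s_next, done)` tuples), network sizes and optimizer are illustrative assumptions.

```python
# One fixed-target DQN update from a replay buffer, plus the periodic sync theta* <- theta.
import random
import copy
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99                  # toy sizes (assumption)
q_learner = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
q_target = copy.deepcopy(q_learner)                       # frozen copy, theta*
opt = torch.optim.Adam(q_learner.parameters(), lr=1e-3)

def dqn_update(buffer, batch_size=32):
    # buffer: list of (s, a, r, s_next, done) with states as lists of floats (assumption)
    batch = random.sample(buffer, batch_size)             # random minibatch
    s, a, r, s_next, done = map(torch.as_tensor, zip(*batch))
    s, s_next = s.float(), s_next.float()
    q_sa = q_learner(s).gather(1, a.long().view(-1, 1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                  # target uses frozen theta*
        y = r.float() + gamma * (1 - done.float()) * q_target(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, y)
    opt.zero_grad(); loss.backward(); opt.step()

def sync_target():                                         # every k steps: theta* <- theta
    q_target.load_state_dict(q_learner.state_dict())
```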
SLIDE 26

Putting it together: Deep Q-Learning with Experience Replay

[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]

SLIDE 27

Putting it together: Deep Q-Learning with Experience Replay

Initialize replay memory, Q-network
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 28

Putting it together: Deep Q-Learning with Experience Replay

Play M episodes (full games)
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 29

Putting it together: Deep Q-Learning with Experience Replay

Initialize state (starting game screen pixels) at the beginning of each episode
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 30

Putting it together: Deep Q-Learning with Experience Replay

For each time-step of the game
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 31

Putting it together: Deep Q-Learning with Experience Replay

With small probability, select a random action (explore); otherwise select the greedy action from the current policy
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 32

Putting it together: Deep Q-Learning with Experience Replay

Take the selected action, observe the reward and next state
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 33

Putting it together: Deep Q-Learning with Experience Replay

Store transition in replay memory
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 34

Putting it together: Deep Q-Learning with Experience Replay

Sample a random minibatch of transitions and perform a gradient descent step
[Mnih et al., Playing Atari with Deep Reinforcement Learning, NIPS Workshop 2013; Nature 2015]
SLIDE 35

Performance

SLIDE 36

https://www.youtube.com/watch?v=V1eYniJ0Rnk

SLIDE 37

Results on 49 Games

  • The architecture and hyperparameter values were the same for all 49 games.
  • DQN achieved performance comparable to or better than an experienced human on 29 out of 49 games.

[V. Mnih et al., Human-level control through deep reinforcement learning, Nature 2015]
SLIDE 38

Policy Gradients

  • What is a problem with Q-learning?
    – The Q-function can be very complicated!
      • Hard to learn the exact value of every (state, action) pair
      • But the policy can be much simpler
  • Can we learn a policy directly, e.g. by finding the best policy from a collection of policies?
SLIDE 39

Direct Policy Estimation

  • It's also possible to make a deep neural network that directly produces a distribution over actions given a state
    – Also known as a policy network, or the policy gradient method
    – Useful when the action space is also large or continuous
SLIDE 40

Policy Network

  • Train a neural network to prescribe actions at each state:

$$\pi(a \mid s; \theta)$$

    – Input is $s$, output is a probability distribution over $a$
    – Could be deterministic
  • Problem: how to train such a network?
  • No ground truth
    – Unlike value functions, where there is a target value for the value at each state
      • Against which we can compute a loss
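A hedged sketch of such a policy network for a discrete action space in PyTorch; the MLP sizes are assumptions, and `torch.distributions.Categorical` supplies the sampling and log-probability used later for training.

```python
# A policy network pi(a | s; theta): an MLP whose softmax output is a distribution over actions.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),               # logits, one per action
        )

    def forward(self, s):
        return Categorical(logits=self.net(s))      # distribution over actions

policy = PolicyNet(state_dim=4, n_actions=2)
s = torch.randn(4)
dist = policy(s)
a = dist.sample()                                    # a ~ pi(a | s; theta)
log_prob = dist.log_prob(a)                          # log pi(a | s; theta), used for training
```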
SLIDE 41

Maximizing return

  • Learn the policy to maximize the expected return!
  • Problem: for a discrete action space, the return is not differentiable with respect to the policy function parameters
    – Selection is not a differentiable operation

[Figure: a state $s$ is fed to $a \sim \pi(a \mid s; \theta)$; among the actions $a_1, a_2, \dots, a_K$ and their Q-values, the "select" operation is not differentiable.]
SLIDE 42

How to choose policy

  • In any run starting at a state $s$ we get
    – $(s_1 = s),\ a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • The trajectory $\tau$ associated with the run is
    – $\tau = s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • The total return over the run (at $t = 1$) is
    – $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \dots$
  • The choice of $\theta$ in $\pi(a \mid s; \theta)$ will modify the trajectory and thereby the return
SLIDE 43

Policy Gradients

SLIDE 44

The goal of RL

[Figure: the RL objective, annotated with "the policy that must be learnt" and the discount factors $\gamma^t$.]
SLIDE 45

The objective

  • The probability of a trajectory $\tau$ is a function of $\pi(a \mid s; \theta)$ and hence of $\theta$
    – $\tau \sim P(\tau; \theta)$
  • The return $G$ is a function of the trajectory $\tau$
    – $G(\tau)$
  • Objective: maximize the expected return

$$\arg\max_\theta J(\theta) = \arg\max_\theta \sum_\tau P(\tau; \theta)\, G(\tau)$$
SLIDE 46

REINFORCE algorithm

SLIDE 47

Solution

  • Recast differentiation as an expectation operation
    – Can now be approximated by sampling
    – Policy gradient methods
  • Compute expected returns using an action-value function approximator
    – Actor-critic methods
SLIDE 48

Gradient of the objective

$$J(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\big[G(\tau)\big] = \sum_\tau P(\tau; \theta)\, G(\tau)$$

$$\nabla_\theta J(\theta) = \sum_\tau \nabla_\theta P(\tau; \theta)\, G(\tau)$$

  • A simple trick:

$$\nabla_\theta P(\tau; \theta) = P(\tau; \theta)\, \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} = P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)$$

$$\nabla_\theta J(\theta) = \sum_\tau P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, G(\tau) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\big[\nabla_\theta \log P(\tau; \theta)\, G(\tau)\big]$$

We can estimate this with Monte Carlo sampling.
SLIDE 49

The trajectory

  • The trajectory: $\tau = s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • The probability of $\tau$ under the policy function $\pi(a \mid s; \theta)$ is

$$P(\tau; \theta) = \prod_{t \ge 1} P(s_t \mid s_{t-1}, a_{t-1})\, \pi(a_t \mid s_t; \theta) = P(s_1)\, \pi(a_1 \mid s_1; \theta)\, P(s_2 \mid s_1, a_1)\, \pi(a_2 \mid s_2; \theta) \cdots$$

  • Taking logs:

$$\log P(\tau; \theta) = \sum_t \log P(s_{t+1} \mid s_t, a_t) + \sum_t \log \pi(a_t \mid s_t; \theta)$$

  • Giving us the derivative

$$\nabla_\theta \log P(\tau; \theta) = \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)$$

which does not depend on the transition probabilities.
SLIDE 50

Gradient of the objective

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\Big[\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(\tau)\Big]$$

  • This is a simple expectation that can be approximated by sampling!
SLIDE 51

Policy Gradients

  • Record an episode (or episodes)
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient
  • Update the network parameters
    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

(A code sketch of these steps is given below.)
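A hedged sketch of these steps (vanilla REINFORCE on a single episode) in PyTorch; `env` and `policy` (returning a `Categorical`) are assumed stand-ins, and the update ascends $\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(\tau)$ by minimizing its negative.

```python
# Vanilla REINFORCE for one recorded episode.
import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:                                   # record an episode
        dist = policy(torch.as_tensor(s, dtype=torch.float32))
        a = dist.sample()
        log_probs.append(dist.log_prob(a))            # log pi(a_t | s_t; theta)
        s, r, done = env.step(int(a))
        rewards.append(r)
    G = sum(gamma ** t * r for t, r in enumerate(rewards))   # return G(tau)
    loss = -G * torch.stack(log_probs).sum()          # minimizing -J ascends J
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```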
SLIDE 52

Policy Gradients

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient
  • Update the network parameters
    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

$G(\tau)$
SLIDE 53

Policy Gradients

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient
  • Update the network parameters
    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

$\log \pi(a_1 \mid s_1; \theta) \quad \log \pi(a_2 \mid s_2; \theta) \quad \log \pi(a_3 \mid s_3; \theta) \qquad G(\tau)$
SLIDE 54

Policy Gradients

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient
  • Update the network parameters
    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(\tau)$$

$\log \pi(a_1 \mid s_1; \theta) \quad \log \pi(a_2 \mid s_2; \theta) \quad \log \pi(a_3 \mid s_3; \theta) \qquad G(\tau)$
SLIDE 55

Policy Gradients

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(\tau)$$

  • Update the network parameters

$$\theta \leftarrow \theta + \eta \, \nabla_\theta J(\theta)$$

    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes
SLIDE 56

It's like Maximum Likelihood

  • The gradient actually looks like the derivative of a log-likelihood function:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\big[\nabla_\theta \log P(\tau; \theta)\, G(\tau)\big]$$

  • It can be read as the maximum-likelihood gradient weighted by the return $G(\tau)$
  • Maximization increases the probability of trajectories with greater return
    – If you see a trajectory, you increase its probability
SLIDE 57

It's like Maximum Likelihood

  • The gradient actually looks like the derivative of a log-likelihood function:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(\tau)$$

  • Maximization increases the probability of all seen actions
    – At the cost of the probability of unseen actions
    – The usual ML estimator
SLIDE 58

Evaluating the policy gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_t \gamma^t\, r\big(s_t^{(n)}, a_t^{(n)}\big)\Big) \Big(\sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)\Big)$$

(The first factor is the sampled return $G(\tau^{(n)})$; the second collects the log-policy terms $\pi_\theta(a_t \mid s_t)$ along the trajectory.)
SLIDE 59
  • Policy gradient:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} G\big(\tau^{(n)}\big) \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

  • Maximum Likelihood:

$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$
SLIDE 60

What did we just do?

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} G\big(\tau^{(n)}\big)\, \nabla_\theta \log p_\theta\big(\tau^{(n)}\big), \qquad \nabla_\theta \log p_\theta\big(\tau^{(n)}\big) = \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

$$\nabla_\theta J_{\mathrm{ML}}(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta \log p_\theta\big(\tau^{(n)}\big)$$
SLIDE 61

Intuition

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} G\big(\tau^{(n)}\big) \sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

  • However, this estimator suffers from high variance, because credit assignment is really hard.
SLIDE 62

A simple extension

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\Big[\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(\tau)\Big]$$

  • Better to compute the above instead as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P(\tau; \theta)}\Big[\sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(t)\Big]$$

where $G(t)$ is the return from time $t$ onward.

  • This too can be estimated by sampling
SLIDE 63

Policy Gradients: A simple extension

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient
  • Update the network parameters
    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

$G(1) \quad G(2) \quad G(3) \qquad \log \pi(a_1 \mid s_1; \theta) \quad \log \pi(a_2 \mid s_2; \theta) \quad \log \pi(a_3 \mid s_3; \theta)$
SLIDE 64

Policy Gradients: A simple extension

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient
  • Update the network parameters
    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

$$\nabla_\theta J(\theta) \approx \frac{1}{|\tau|} \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(t)$$

$G(1) \quad G(2) \quad G(3) \qquad \log \pi(a_1 \mid s_1; \theta) \quad \log \pi(a_2 \mid s_2; \theta) \quad \log \pi(a_3 \mid s_3; \theta)$
SLIDE 65

Policy Gradients: A simple extension

  • Episode
    $s\, a_1\, r_1\, s_2\, a_2\, r_2\, s_3\, a_3\, r_3 \dots$
  • Compute returns at each time
  • Compute the log policy at each time
  • Compute the gradient

$$\nabla_\theta J(\theta) \approx \frac{1}{|\tau|} \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, G(t)$$

  • Update the network parameters

$$\theta \leftarrow \theta + \eta \, \nabla_\theta J(\theta)$$

    – Ideally $\nabla_\theta J(\theta)$ is averaged over many episodes

$G(1) \quad G(2) \quad G(3) \qquad \log \pi(a_1 \mid s_1; \theta) \quad \log \pi(a_2 \mid s_2; \theta) \quad \log \pi(a_3 \mid s_3; \theta)$
SLIDE 66

Reducing variance

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_t \gamma^t\, r\big(s_t^{(n)}, a_t^{(n)}\big)\Big) \Big(\sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)\Big)$$

  • Causality: the action at time $t$ cannot affect rewards received before time $t$, so

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \Big(\sum_{t' \ge t} \gamma^{t'}\, r\big(s_{t'}^{(n)}, a_{t'}^{(n)}\big)\Big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$
SLIDE 67

Variance reduction

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \Big(\sum_t r\big(s_t^{(n)}, a_t^{(n)}\big)\Big) \Big(\sum_t \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)\Big)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \Big(\sum_{t' \ge t} r\big(s_{t'}^{(n)}, a_{t'}^{(n)}\big)\Big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \Big(\sum_{t' \ge t} \gamma^{t'-t}\, r\big(s_{t'}^{(n)}, a_{t'}^{(n)}\big)\Big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$
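A small sketch of the inner "reward-to-go" term $\sum_{t' \ge t} \gamma^{t'-t} r_{t'}$ from the last estimator, computed in one backward pass over a list of sampled rewards (a common implementation choice, assumed here rather than taken from the slides).

```python
# Discounted reward-to-go for every time step of one sampled episode.
def discounted_rewards_to_go(rewards, gamma=0.99):
    """rewards: list of per-step rewards r_1..r_T from one sampled episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):      # G_t = r_t + gamma * G_{t+1}
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(discounted_rewards_to_go([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```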
SLIDE 68

Merely seeing a trajectory isn’t good

  • We want to emphasize trajectories with high return and reduce the probability of low-return trajectories
  • If an action results in a higher return than the current average return for the state, we must increase its probability
    – If it results in a lower one, we must decrease it
SLIDE 69

Variance reduction: Baseline

  • Problem: the raw value of a trajectory isn't necessarily meaningful.
    – For example, if rewards are all positive, you keep pushing up the probabilities of actions.
  • What is important then?
    – Whether a reward is better or worse than what you expect to get
  • Idea: introduce a baseline function dependent on the state:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \Big(\sum_{k=0}^{T-t} \gamma^{k}\, r\big(s_{t+k}^{(n)}, a_{t+k}^{(n)}\big) - b\big(s_t^{(n)}\big)\Big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

A simple baseline: a constant moving average of the rewards experienced so far from all trajectories.
SLIDE 70

It's like Maximum Likelihood

  • Subtract the expected return for the current state:

$$\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi(a_t \mid s_t; \theta)\, \big(G(t) - V(s_t)\big)$$

  • $A(t) = G(t) - V(s_t)$ is the advantage function
    – How much advantage the current action has over the average
  • Train $\pi(a_t \mid s_t; \theta)$ to maximize the advantage
SLIDE 71

Reinforce

  • Initialize $\theta$
  • For each episode $e$
    – Initialize $s_1$
    – For $t = 1 \dots$ Termination
      • Choose action $a_t$ using the policy obtained from $\theta$
      • Observe $r_t, s_{t+1}$
    – Compute the returns $G(s_t)$, then the advantages $A_t$
    – Compute $J(\theta) = \frac{1}{T} \sum_t \log \pi_\theta(a_t \mid s_t)\, A_t$
    – $\theta \leftarrow \theta + \eta \, \nabla_\theta J(\theta)$

(A code sketch of this update is given below.)
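A hedged sketch of this update for one recorded episode: discounted returns-to-go, a crude running-average baseline (one simple choice of $b$, assumed here), and one ascent step on $\frac{1}{T}\sum_t \log \pi_\theta(a_t \mid s_t)\, A_t$. The `log_probs` list is assumed to hold the $\log \pi_\theta(a_t \mid s_t)$ tensors collected while acting.

```python
# REINFORCE with a simple moving-average baseline, applied after one episode.
import torch

baseline = {"mean": 0.0, "count": 0}          # running average of returns (simple baseline b)

def reinforce_with_baseline(log_probs, rewards, optimizer, gamma=0.99):
    # returns-to-go G_t for each step
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    # advantages A_t = G_t - baseline
    for g in returns:
        baseline["count"] += 1
        baseline["mean"] += (g - baseline["mean"]) / baseline["count"]
    adv = torch.tensor(returns) - baseline["mean"]
    # ascend J(theta) = (1/T) sum_t log pi(a_t|s_t) * A_t by minimizing its negative
    loss = -(torch.stack(log_probs) * adv).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```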
SLIDE 72

REINFORCE algorithm: Summary

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \Big(G\big(s_t^{(n)}\big) - b\big(s_t^{(n)}\big)\Big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} G\big(\tau^{(n)}\big)\, \nabla_\theta \log p\big(\tau^{(n)}; \theta\big)$$
SLIDE 73

Instability

  • In REINFORCE, the estimator of the expected return has high variance: the rewards of one episode act as estimates of the state-action value function.

$$G(s_t) = \sum_{t' \ge t} \gamma^{t'-t}\, r_{t'}$$

  • It also requires entire runs of episodes
    – Not online
  • It can be made more stable through function approximation of the value function
SLIDE 74

Actor-Critic

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \Big(\sum_{k=0}^{T-t} \gamma^{k}\, r\big(s_{t+k}^{(n)}, a_{t+k}^{(n)}\big)\Big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

Instead of this sampled sum, we can use the temporal difference to estimate this average, and also the baseline.
SLIDE 75

Actor-Critic

  • In actor-critic methods, two networks are used:
  • The actor is the policy network $\pi(a \mid s; \theta) = \pi_\theta(a \mid s)$ and is used to predict the next action
  • The critic is a state-value network $V(s \mid \phi) = V_\phi(s)$ and is used to guide the optimization direction of the actor
  • To estimate the expected return based on an episode, we use a one-step lookahead:

$$G(s_t) = r_t + \gamma\, V_\phi(s_{t+1})$$

  • Or an M-step lookahead:

$$G(s_t) = \sum_{0 \le k \le M-1} \gamma^{k}\, r_{t+k} + \gamma^{M}\, V_\phi(s_{t+M})$$
SLIDE 76

Advantage Actor Critic (A2C)

Rethink the advantages: the critic can also be used as the "baseline" when computing the advantages:

$$A_t = G(s_t) - V_\phi(s_t)$$

The trajectory's probability is increased if it is better than the trajectories previously followed. The critic is trained on how well it predicted the return.
SLIDE 77

Another view

  • $\mathbb{E}[G_t] = Q(s_t, a_t)$
  • To push up the probability of an action from a state:
    – if this action was better than the expected value of what we should get from that state: $Q(s_t, a_t) - V(s_t)$
  • We are happy with an action $a_t$ in a state $s_t$ if this quantity is large
  • We are unhappy with an action if it is small
SLIDE 78

Another view

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \underbrace{\Big(\sum_{k=0}^{T-t} \gamma^{k}\, r\big(s_{t+k}^{(n)}, a_{t+k}^{(n)}\big)\Big)}_{\hat{Q}_t^{(n)}:\ \text{reward to go}}\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$$

  • $\hat{Q}_t^{(n)}$: estimate of the expected reward if we take action $a_t^{(n)}$ in state $s_t^{(n)}$
  • $Q(s_t, a_t) = \sum_{k=0}^{T-t} \mathbb{E}_{\pi_\theta}\big[\gamma^{k}\, r(s_{t+k}, a_{t+k}) \mid s_t, a_t\big]$
    – True expected reward to go
  • $V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[Q(s_t, a_t)\big]$
  • $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t \big(Q(s_t^{(n)}, a_t^{(n)}) - V(s_t^{(n)})\big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$
SLIDE 79

Another view

  • $Q(s_t, a_t) = \sum_{k=0}^{T-t} \mathbb{E}_{\pi_\theta}\big[\gamma^{k}\, r(s_{t+k}, a_{t+k}) \mid s_t, a_t\big]$
    – True expected reward to go
  • $V(s_t) = \mathbb{E}_{a_t \sim \pi_\theta(a_t \mid s_t)}\big[Q(s_t, a_t)\big]$
    – Total reward from $s_t$
  • $A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$
    – How much better $a_t$ is
  • $\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{n=1}^{N} \sum_t A\big(s_t^{(n)}, a_t^{(n)}\big)\, \nabla_\theta \log \pi_\theta\big(a_t^{(n)} \mid s_t^{(n)}\big)$

Remark: the advantage function expresses how much an action was better than expected.
SLIDE 80

Value function fitting

$$Q(s_t, a_t) = \mathbb{E}\big[r_{t+1} + \gamma\, V(s_{t+1})\big] \approx r(s_t, a_t) + \gamma\, V(s_{t+1})$$
SLIDE 81

An actor-critic algorithm

$$y_t \approx r(s_t, a_t) + \gamma\, \hat{V}_\phi(s_{t+1})$$

$$L_c(\phi) = \sum_t \big(\hat{V}_\phi(s_t) - y_t\big)^2$$

[Algorithm box: repeat — fit $\hat{V}_\phi$ to the targets $y_t$, then update the actor.]
SLIDE 82

A2C

  • Initialize $\theta$ (actor), $\phi$ (critic)
  • For each episode $e$
    – Initialize $s_1$
    – For $t = 1 \dots$ Termination
      • Choose action $a_t$ using the policy obtained from $\theta$
      • Observe $r_t, s_{t+1}$
    – Compute the returns ($N$-step lookahead):
      $G(s_t) = \sum_{0 \le k \le N-1} \gamma^{k} r_{t+k} + \gamma^{N} V_\phi(s_{t+N})$ if $t + N < T$, else $\sum_{0 \le k \le T-t-1} \gamma^{k} r_{t+k}$
    – Compute the advantages $A_t = G(s_t) - V_\phi(s_t)$
    – Compute $L_a(\theta) = \frac{1}{T} \sum_t \log \pi_\theta(a_t \mid s_t)\, A_t$ and $L_c(\phi) = \frac{1}{T} \sum_t \big(G(s_t) - V_\phi(s_t)\big)^2$
    – $\theta \leftarrow \theta + \eta_a \nabla_\theta L_a(\theta)$, $\quad \phi \leftarrow \phi - \eta_c \nabla_\phi L_c(\phi)$

(A code sketch of this update is given below.)
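A hedged sketch of one such update for a recorded episode in PyTorch, using a full Monte-Carlo return in place of the slide's $N$-step bootstrapped return for brevity; `actor` (returning a `Categorical`) and `critic` (returning $V_\phi(s)$) are assumed stand-ins rather than the lecture's code.

```python
# Advantage actor-critic update from one recorded episode.
import torch

def a2c_update(states, actions, rewards, actor, critic, opt_actor, opt_critic, gamma=0.99):
    # returns G_t (Monte-Carlo here; the N-step bootstrapped version is analogous)
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.tensor(returns, dtype=torch.float32)

    s = torch.stack([torch.as_tensor(x, dtype=torch.float32) for x in states])
    a = torch.as_tensor(actions)
    values = critic(s).squeeze(-1)                         # V_phi(s_t)
    advantages = returns - values.detach()                 # A_t = G_t - V_phi(s_t)

    actor_loss = -(actor(s).log_prob(a) * advantages).mean()  # maximize (1/T) sum log pi * A
    critic_loss = ((returns - values) ** 2).mean()             # fit V_phi to the returns

    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
```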
SLIDE 83

Extensions

  • A2C can be applied in a multi-threaded environment on several episodes simultaneously, with a final mini-batch update
  • Asynchronous Advantage Actor-Critic (A3C) (DeepMind, 2016): each thread performs its updates without waiting for the others to end → each thread keeps its own version of the parameters. They upload their gradients asynchronously to a master server that performs batch updates
  • Experience replay can be adapted to A2C → the ACER algorithm (DeepMind, 2017)
SLIDE 84

Policy gradient in practice

  • Remember that the gradient has high variance
    – This isn't the same as supervised learning!
    – Gradients will be really noisy!
  • Consider using much larger batches
  • Tweaking learning rates is very hard
    – Adaptive step-size rules like ADAM can be OK-ish
    – There are policy-gradient-specific learning-rate adjustment methods
SLIDE 85

Continuous action space

  • Action probabilities $\pi_\theta(a_t \mid s_t)$: we have seen the discrete action space case ($n$ labels + softmax) → what about a very large or continuous space?
  • You can use a network that predicts the parameters of a distribution and sample an action from it. E.g. $a_t \sim \mathcal{N}(\mu, \sigma)$ with $\mu, \sigma = \pi(s_t \mid \theta)$ (similar to the encoder of a VAE) → REINFORCE/A2C can be used (with the reparameterization trick).
  • Most general case: $f(s_t \mid \theta) = a_t$. What algorithm can we use?
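A hedged sketch of such a distribution-parameter network (a Gaussian policy head) in PyTorch; the layer sizes and the state-independent log-std are common choices assumed for illustration, and `rsample()` provides the reparameterized, differentiable sample mentioned above.

```python
# Gaussian policy for a continuous action space: the network predicts mu (and a learned sigma),
# and actions are sampled from N(mu, sigma).
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mu = nn.Linear(64, action_dim)                      # mean of the action distribution
        self.log_sigma = nn.Parameter(torch.zeros(action_dim))   # state-independent std (assumption)

    def forward(self, s):
        h = self.body(s)
        return Normal(self.mu(h), self.log_sigma.exp())

policy = GaussianPolicy(state_dim=4, action_dim=2)
dist = policy(torch.randn(4))
a = dist.rsample()                                  # reparameterized sample, differentiable w.r.t. theta
log_prob = dist.log_prob(a).sum()                   # usable in REINFORCE/A2C-style updates
```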
SLIDE 86

Actor-critic methods

[Figure: value-based optimization and policy-based optimization as two families of methods, with actor-critic methods at their intersection.]
SLIDE 87

Advantages of Policy-based RL

  • Advantages
    – Better convergence properties
    – Effective in high-dimensional or continuous action spaces
    – Can learn stochastic policies
  • Disadvantages
    – Typically converges to a local rather than a global optimum
    – Evaluating a policy is typically inefficient and high-variance
SLIDE 88

Example: RL in Other ML Problems

  • Hard Attention
    – Observation: current image window
    – Action: where to look
    – Reward: classification
  • V. Mnih et al., "Recurrent models of visual attention", NIPS 2014.
SLIDE 89

REINFORCE in action: Recurrent Attention Model (RAM)

  • Objective: Image Classification
  • Take a sequence of "glimpses" selectively focusing on regions of the image, to predict the class
    – Inspiration from human perception and eye movements
    – Saves computational resources => scalability
    – Able to ignore clutter / irrelevant parts of the image
  • V. Mnih et al., "Recurrent models of visual attention", NIPS 2014.
SLIDE 90

REINFORCE in action: Recurrent Attention Model (RAM)

  • Objective: Image Classification
  • State: glimpses seen so far
  • Action: (x, y) coordinates (center of glimpse) of where to look next in the image
  • Reward: 1 at the final timestep if the image is correctly classified, 0 otherwise
  • Glimpsing is a non-differentiable operation => learn a policy for how to take glimpse actions using REINFORCE
  • V. Mnih et al., "Recurrent models of visual attention", NIPS 2014.
SLIDE 91

REINFORCE in action: Recurrent Attention Model (RAM)

  • Given the state of glimpses seen so far, use an RNN to model the state and output the next action
  • V. Mnih et al., "Recurrent models of visual attention", NIPS 2014.
SLIDE 92

REINFORCE in action: Recurrent Attention Model (RAM)

  • Given the state of glimpses seen so far, use an RNN to model the state and output the next action
  • V. Mnih et al., "Recurrent models of visual attention", NIPS 2014.
SLIDE 93

REINFORCE in action: Recurrent Attention Model (RAM)

  • Given the state of glimpses seen so far, use an RNN to model the state and output the next action
SLIDE 94

REINFORCE in action: Recurrent Attention Model (RAM)

SLIDE 95

More policy gradients: AlphaGo Zero

https://deepmind.com/blog/article/alphago-zero-starting-scratch

Silver et al., Mastering the game of Go without human knowledge, Nature, 2017.

SLIDE 96

Summary

  • Policy gradients: very general, but suffer from high variance, so they require a lot of samples.
    – Challenge: sample-efficiency
  • Q-learning: does not always work, but when it works it is usually more sample-efficient.
    – Challenge: exploration
  • Guarantees:
    – Policy gradients: converge to a local optimum of $J(\theta)$, often good enough!
    – Q-learning: no guarantees, since you are approximating the Bellman equation with a complicated function approximator

Summary

  • Parameterized Functions
  • Deep Q Networks (DQNs)

    – Experience replay
    – Target functions

  • Policy gradients

    – REINFORCE
    – Actor-Critic