SLIDE 1

Bonus Lecture: Introduction to Reinforcement Learning

Garima Lalwani, Karan Ganju and Unnat Jain

Credits: These slides and images are borrowed from slides by David Silver and Pieter Abbeel

SLIDE 2

Outline

1. RL Problem Formulation
2. Model-based Prediction and Control
3. Model-free Prediction
4. Model-free Control
5. Summary

SLIDE 3

Part 1: RL Problem Formulation

SLIDE 4

Characteristics of Reinforcement Learning

What makes reinforcement learning different from other machine learning paradigms?

• There is no supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters (correlated, non-i.i.d. data)
• The agent's actions affect the subsequent data it receives

SLIDE 5

Agent and Environment

[Diagram: the agent–environment loop — at each step t the agent observes state S_t and reward R_t from the environment and takes action A_t]

SLIDE 6

Rewards

• A reward R_t is a scalar feedback signal
• It indicates how well the agent is doing at step t
• The agent's job is to maximise cumulative reward

SLIDE 7

Rod Balancing Demo

https://www.youtube.com/watch?v=Lt-KLtkDlh8

Learn to swing up and balance a real pole based on raw visual input data, ICONIP 2012

SLIDE 8

RL based visual control

https://www.youtube.com/watch?v=CE6fBDHPbP8
End-to-end training of deep visuomotor policies, JMLR 2016

SLIDE 9

RL based visual control

Source: https://68.media.tumblr.com/ Link: https://goo.gl/kY4RmS

SLIDE 10

Examples of Rewards

• Fly stunt manoeuvres in a helicopter: +ve reward for following the desired trajectory, −ve reward for crashing
• Defeat the world champion at Go: +/−ve reward for winning/losing a game
• Play many Atari games better than humans: +/−ve reward for increasing/decreasing the score

Sources: https://deepmind.com/research/alphago/, Stanford autonomous helicopter (Abbeel et al.), https://gym.openai.com/

SLIDE 11

Sample model of an RL problem

[Diagram: a student's day as an MDP — states (Home, Murphy's, Group Disc., Arun's OH, Project Complete), actions (Study, Pubbing, Leave, Submit project, Take Arun's Quiz), rewards ranging from −2 to +10, and one stochastic transition with probabilities 0.2/0.4/0.4]

SLIDE 12

States

[Diagram: the student MDP with its states highlighted: Home, Murphy's, Group Disc., Arun's OH, Project Complete]

SLIDE 13

Actions

[Diagram: the student MDP with its actions highlighted: Study, Pubbing, Leave, Submit project, Take Arun's Quiz]

SLIDE 14

Rewards

[Diagram: the student MDP with its rewards highlighted, ranging from −2 to +10]

SLIDE 15

Transition probabilities

[Diagram: the student MDP with its transition probabilities highlighted — one stochastic action fans out to three states with probabilities 0.2, 0.4 and 0.4]

SLIDE 16

Markov Decision Process

A Markov decision process (MDP) is an environment in which all states are Markov:

  P[S_{t+1} | S_t, A_t = a] = P[S_{t+1} | S_1, ..., S_t, A_t = a]

An MDP is a tuple ⟨S, A, P, R, γ⟩:

• S is a finite set of states
• A is a finite set of actions
• P is a state transition probability matrix, P^a_{ss′} = P[S_{t+1} = s′ | S_t = s, A_t = a]
• R is a reward function, R^a_s = E[R_{t+1} | S_t = s, A_t = a]
• γ is a discount factor, γ ∈ [0, 1]
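To make the tuple concrete, here is a minimal sketch of an MDP as plain Python data. The two-state layout and all the numbers below are illustrative inventions, not the student example from the slides:

```python
gamma = 0.9  # discount factor γ

states = ["s0", "s1"]
actions = ["stay", "move"]

# P[(s, a)] is a list of (probability, next_state) pairs; each row sums to 1.
P = {
    ("s0", "stay"): [(1.0, "s0")],
    ("s0", "move"): [(0.8, "s1"), (0.2, "s0")],  # a stochastic transition
    ("s1", "stay"): [(1.0, "s1")],
    ("s1", "move"): [(1.0, "s0")],
}

# R[(s, a)] is the expected immediate reward E[R_{t+1} | S_t = s, A_t = a].
R = {
    ("s0", "stay"): 0.0,
    ("s0", "move"): -1.0,
    ("s1", "stay"): 1.0,
    ("s1", "move"): 0.0,
}
```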

SLIDE 17

Major Components of an RL Agent

An RL agent may include one or more of these components:

• Policy: the agent's behaviour function
• Model: the agent's representation of the environment
• Value function: how good each state and/or action is

SLIDE 18

Policy

A policy is the agent's behaviour. It is a map from state to action, e.g.:

• Deterministic policy: a = π(s)
• Stochastic policy: π(a|s) = P[A_t = a | S_t = s]

SLIDE 19

Actions

[Diagram: the student MDP again, illustrating the actions a policy chooses between at each state]

SLIDE 20

Model

A model predicts what the environment will do next:

• P: transition probabilities, P^a_{ss′} = P[S_{t+1} = s′ | S_t = s, A_t = a]
• R: expected rewards, R^a_s = E[R_{t+1} | S_t = s, A_t = a]

SLIDE 21

Beyond Rewards

[Diagram: the student MDP once more — immediate rewards alone are not enough to pick good actions; we need a notion of long-term value]

SLIDE 22

Value function – Concept of Return

Return G_t: the cumulative discounted reward from time-step t:

  G_t = R_{t+1} + γR_{t+2} + ... = Σ_{k=0}^{∞} γ^k R_{t+k+1}

• The discount γ ∈ [0, 1] is the present value of future rewards
• This values immediate reward above delayed reward
• It avoids infinite returns in cyclic Markov processes
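As a quick sketch, the return of a sampled reward sequence can be computed directly from this definition (the reward numbers below are made up):

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ...  (rewards[k] is R_{t+k+1})."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

print(discounted_return([0.0, -2.0, 10.0], gamma=0.9))  # 0 - 1.8 + 8.1 = 6.3
```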

SLIDE 23

Value Function

State value function v_π(s): the expected return starting from state s and then following policy π:

  v_π(s) = E_π[G_t | S_t = s]

Action value function q_π(s, a): the expected return starting from state s, taking action a, and then following policy π:

  q_π(s, a) = E_π[G_t | S_t = s, A_t = a]

SLIDE 24

Subproblems in RL

• Prediction: evaluate the future, given a policy
• Control: optimise the future, i.e. find the best policy

Each subproblem comes in a model-based and a model-free flavour.

SLIDE 25

Part 2: Model-based Prediction and Control

SLIDE 26

Connecting v(s) and q(s,a): Bellman equations

v in terms of q:

  v_π(s) = Σ_{a∈A} π(a|s) q_π(s, a)

q in terms of v:

  q_π(s, a) = R^a_s + γ Σ_{s′∈S} P^a_{ss′} v_π(s′)

[Backup diagrams: v_π(s) averages q_π(s, a) over actions weighted by π(a|s); q_π(s, a) averages r + γv_π(s′) over successor states weighted by P^a_{ss′}]

SLIDE 27

Connecting v(s) and q(s,a): Bellman equations (2)

q in terms of other q:

  q_π(s, a) = R^a_s + γ Σ_{s′∈S} P^a_{ss′} Σ_{a′∈A} π(a′|s′) q_π(s′, a′)

v in terms of other v:

  v_π(s) = Σ_{a∈A} π(a|s) [ R^a_s + γ Σ_{s′∈S} P^a_{ss′} v_π(s′) ]
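A minimal sketch of these interconversions, reusing the P, R, gamma, states and actions data from the illustrative MDP sketch above, with a policy given as a function pi(a, s):

```python
def q_from_v(v, s, a):
    """q_π(s, a) = R^a_s + γ Σ_{s'} P^a_{ss'} v_π(s')."""
    return R[(s, a)] + gamma * sum(p * v[s2] for p, s2 in P[(s, a)])

def v_from_q(q, pi, s):
    """v_π(s) = Σ_a π(a|s) q_π(s, a)."""
    return sum(pi(a, s) * q[(s, a)] for a in actions)

# Example: a uniform random policy over the two illustrative actions.
pi_uniform = lambda a, s: 0.5
v0 = {s: 0.0 for s in states}
print(q_from_v(v0, "s0", "move"))  # equals R[("s0", "move")] = -1.0 when v ≡ 0
```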

SLIDE 28

Example: vπ(s)

[Diagram: the student MDP annotated with state values −1.3, 7.4 and −2.3; the value of Group Disc. is still blank]

SLIDE 29

Example: vπ(s)

[Diagram: as before, with values −1.3, 7.4, −2.3]

v_π(s) for π(a|s) = 0.5, γ = 1:

  v_π(GD) = 0.5·(R + v_π(Submitted)) + 0.5·(R + v_π(Arun's OH))
          = 0.5·(0 + 0) + 0.5·(−2 + 7.4)

SLIDE 30

Example: vπ(s)

[Diagram: as before, now with v_π(Group Disc.) = 2.7 filled in]

v_π(s) for π(a|s) = 0.5, γ = 1:

  v_π(GD) = 0.5·(R + v_π(Submitted)) + 0.5·(R + v_π(Arun's OH))
          = 0.5·(0 + 0) + 0.5·(−2 + 7.4) = 2.7

SLIDE 31

Example: qπ(s,a)

[Diagram: the student MDP annotated with action values for π(a|s) = 0.5, γ = 1: q = −3.3, −1.3, −3.3, 0.7, 5.4, 0, 10, 3.78]

SLIDE 32

Example: qπ(s,a)

[Diagram: the same q_π(s, a) annotations as on the previous slide]

SLIDE 33

Example: Policy improvement

[Diagram: the student MDP with its q values — the starting point for improving the policy]

SLIDE 34

Example: Policy improvement – Greedy

[Diagram: the student MDP with its q values; greedy improvement picks the highest-q action in each state]

  π_new(a|s) = 1 if a = argmax_{a∈A} q_old(s, a), 0 otherwise

SLIDE 35

Policy Iteration

• Policy evaluation: estimate v_π (iterative policy evaluation)
• Policy improvement: generate π′ ≥ π (greedy policy improvement)

SLIDE 36

Iterative Policy Evaluation in Small Gridworld

States: 14 cells plus 2 terminal corner cells. Actions: 4 directions. Reward: −1 per time step.

[Figure: v_k for the uniform random policy, alongside the greedy policy w.r.t. v_k, for k = 0, 1, 2, 3, 10 and ∞. The greedy policy is already optimal at k = 3, long before the values converge. v_∞:

    0  −14  −20  −22
  −14  −18  −20  −20
  −20  −20  −18  −14
  −22  −20  −14    0 ]
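A short sketch that reproduces this evaluation: synchronous Bellman expectation backups for the uniform random policy on the 4×4 grid (γ = 1, reward −1 per step, and moving off the grid leaves the state unchanged):

```python
import numpy as np

N = 4
terminals = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(s, m):
    """Deterministic transition; bumping into a wall keeps you in place."""
    r, c = s[0] + m[0], s[1] + m[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else s

v = np.zeros((N, N))
for k in range(1000):  # enough sweeps to approximate k -> infinity
    new_v = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            if (r, c) not in terminals:
                new_v[r, c] = np.mean([-1 + v[step((r, c), m)] for m in moves])
    v = new_v

print(np.round(v))  # matches the v_infinity grid above: 0, -14, -20, -22, ...
```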

SLIDE 37

Iterative Policy Evaluation in Small Gridworld (2)

[Figure: v_k for k = 3, 10 and ∞ with the corresponding greedy policies — the greedy policy has saturated (stopped changing) by k = 3]

SLIDE 38

Policy Iteration

• Policy evaluation: estimate v_π (iterative policy evaluation)
• Policy improvement: generate π′ ≥ π (greedy policy improvement)

SLIDE 39

Modified Policy Iteration – Value Iteration

• The policy often converges before the value function does
• In the small gridworld, k = 3 was sufficient to achieve the optimal policy
• Why not update the policy every iteration, i.e. stop evaluation after k = 1?
• This is value iteration

[Figure: v_k for k = 3, 10 and ∞ — the greedy policy is saturated from k = 3 onward]

SLIDE 40

Modified Policy Iteration – Value Iteration

• The policy often converges before the value function does
• In the small gridworld, k = 3 was sufficient to achieve the optimal policy
• Why not update the policy every iteration, i.e. stop evaluation after k = 1?
• This is value iteration

[Diagram: the generalised policy iteration loop — starting V; alternate π = greedy(V) and V = v_π, converging to V*, π*]
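Collapsing evaluation to a single sweep turns the loop above into value iteration. A sketch on the same gridworld, reusing N, terminals, moves and step from the earlier snippet — the only change from policy evaluation is a max over actions instead of a mean:

```python
v = np.zeros((N, N))
for _ in range(100):
    new_v = np.zeros((N, N))
    for r in range(N):
        for c in range(N):
            if (r, c) not in terminals:
                # Bellman optimality backup: v(s) = max_a [ r + γ·v(s') ]
                new_v[r, c] = max(-1 + v[step((r, c), m)] for m in moves)
    v = new_v
# v now approximates v*; acting greedily w.r.t. v gives an optimal policy.
```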

SLIDE 41

Part 3: Model-Free Prediction

SLIDE 42

Bellman Equation Estimate

[Backup diagram: a full one-step lookahead from s_t, over every action and every successor state s_{t+1} with reward r_{t+1}]

v in terms of other v:

  v_π(s) = Σ_{a∈A} π(a|s) [ R^a_s + γ Σ_{s′∈S} P^a_{ss′} v_π(s′) ]
SLIDE 43

Monte-Carlo Sampling

[Backup diagram: Monte-Carlo — a single sampled trajectory from s_t all the way to a terminal state T]

SLIDE 44

Monte-Carlo Estimate

[Backup diagram: one sampled trajectory from s_t to termination]

  v_π(s) = E[R_{t+1} + γR_{t+2} + ... | S_t = s]  [actual]
  V(S_t)  [estimate]

Almost! But we can't evaluate this expectation directly without a model, so we move the estimate towards sampled returns instead.

SLIDE 45

Monte-Carlo Estimate

[Backup diagram: one sampled trajectory from s_t to termination]

  v_π(s) = E[R_{t+1} + γR_{t+2} + ... | S_t = s]

  V(S_t) := V(S_t) + α (R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... − V(S_t))

SLIDE 46

Monte-Carlo Estimate

[Backup diagram: one sampled trajectory from s_t to termination]

  V(S_t) := V(S_t) + α (R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... − V(S_t))

SLIDE 47

Temporal-Difference Estimate

MC:  V(S_t) := V(S_t) + α (R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... − V(S_t))
TD:  V(S_t) := V(S_t) + α (R_{t+1} + γV(S_{t+1}) − V(S_t))

[Backup diagram: TD looks only one step ahead, from s_t via r_{t+1} to s_{t+1}]

SLIDE 48

Temporal-Difference Estimate

TD:  V(S_t) := V(S_t) + α (R_{t+1} + γV(S_{t+1}) − V(S_t))

[Backup diagram: one-step lookahead from s_t via r_{t+1} to s_{t+1}]

Updating a guess towards a guess.

SLIDE 49

MC vs. TD

MC:  V(S_t) := V(S_t) + α (R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... − V(S_t))
TD:  V(S_t) := V(S_t) + α (R_{t+1} + γV(S_{t+1}) − V(S_t))

• TD can learn before knowing the final outcome
• The TD target R_{t+1} + γV(S_{t+1}) is a biased estimate of R_{t+1} + γv_π(S_{t+1})
• The TD target has much lower variance than the MC target
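A sketch of both tabular updates, where V is a dict from states to value estimates:

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit Monte-Carlo: move V(S_t) towards the full sampled return G_t.
    `episode` is a list of (S_t, R_{t+1}) pairs from one completed episode."""
    g = 0.0
    for s, r in reversed(episode):
        g = r + gamma * g            # G_t = R_{t+1} + γ·G_{t+1}
        V[s] += alpha * (g - V[s])

def td_update(V, s, r, s_next, alpha, gamma):
    """TD(0): move V(S_t) towards the bootstrapped target R_{t+1} + γ·V(S_{t+1});
    applicable after every single step, before the episode ends."""
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
```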

SLIDE 50

Part 4: Model-Free Control

SLIDE 51

Today's takeaways

• MDP: states, actions
• Environment: transitions and rewards
• Agent: policy over actions
• Policy iteration
• Policy evaluation
• Policy improvement
• Value iteration
• Model-free policy evaluation
• Model-free policy control
SLIDE 52

Today's takeaways

(Takeaways list as above.)

[Diagram: the student MDP — states, actions, rewards and transition probabilities]


SLIDE 54

Today's takeaways

(Takeaways list as above.)

  π(a|s) = P[A_t = a | S_t = s]

SLIDE 55

Today's takeaways

(Takeaways list as above.)

Bellman equations: q in terms of other q; v in terms of other v.

SLIDE 56

Today's takeaways

(Takeaways list as above.)

  π_new(a|s) = 1 if a = argmax_{a∈A} q_old(s, a), 0 otherwise
SLIDE 57

Today's takeaways

(Takeaways list as above.)

[Diagram: the GPI loop (π = greedy(V), V = v_π → V*, π*) and the small-gridworld value grids]

SLIDE 58

Today's takeaways

(Takeaways list as above.)

  V(S_t) := V(S_t) + α (R_{t+1} + γV(S_{t+1}) − V(S_t))

SLIDE 59

Today's takeaways

(Takeaways list as above.)
SLIDE 60

Generalised Policy Iteration (Refresher)

• Policy evaluation: estimate v_π (model-based: iterative policy evaluation)
• Policy improvement: generate π′ ≥ π (model-based: greedy policy improvement)

SLIDE 61

Generalised Policy Iteration

• Policy evaluation: estimate v_π (model-free: TD policy evaluation)
• Policy improvement: generate π′ ≥ π (model-free: greedy policy improvement)

SLIDE 62

Model-Free Policy Improvement

Greedy policy improvement from V values:

  π′(s) = argmax_{a∈A} [ R^a_s + γ Σ_{s′∈S} P^a_{ss′} V(s′) ]

Greedy policy improvement from Q values:

  π′(s) = argmax_{a∈A} Q(s, a)


SLIDE 64

Model-Free Policy Improvement

Greedy policy improvement over V(s) requires a model of the MDP:

  π′(s) = argmax_{a∈A} [ R^a_s + γ Σ_{s′∈S} P^a_{ss′} V(s′) ]

Greedy policy improvement over Q(s, a) is model-free:

  π′(s) = argmax_{a∈A} Q(s, a)

SLIDE 65

Generalised Policy Iteration with Q values

[Diagram: GPI loop with action values — starting Q, π; alternate π = greedy(Q) and Q = q_π, converging to q*, π*]

• Policy evaluation: TD policy evaluation, Q = q_π
• Policy improvement: greedy policy improvement?

SLIDE 66

Thinking beyond Greedy – Exploration vs. Exploitation

[Figure: "What we hoped we had" vs. "what we have" — acting purely greedily keeps revisiting seen states and never reaches the unseen ones]

SLIDE 67

ǫ-Greedy Exploration

• Simplest idea for ensuring continual exploration
• With probability 1 − ǫ, choose the greedy action
• With probability ǫ, choose an action at random
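A sketch of ǫ-greedy selection over a tabular Q, stored as a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)               # explore
    return max(actions, key=lambda a: Q[(s, a)])    # exploit the greedy action
```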

SLIDE 68

TD Policy Iteration

[Diagram: GPI loop — starting Q, π; alternate π = ǫ-greedy(Q) and Q = q_π, converging to q*, π*]

• Policy evaluation: TD policy evaluation, Q = q_π
• Policy improvement: ǫ-greedy policy improvement

SLIDE 69

SARSA: TD Value Iteration

[Diagram: GPI loop — starting Q; π = ǫ-greedy(Q), Q ≈ q_π, converging to q*, π*]

One step of evaluation:
• Policy evaluation: TD policy evaluation, Q ≈ q_π
• Policy improvement: ǫ-greedy policy improvement

[Thumbnail: the small-gridworld figure — a reminder that the greedy policy saturated long before the values converged]

SLIDE 70

SARSA: Step by Step

[Diagram: agent–environment trace — from S_1 take A_1, observe R_2 and S_2, then pick A_2]

  Q(S_1, A_1) := Q(S_1, A_1) + α (R_2 + γQ(S_2, A_2) − Q(S_1, A_1)),  with A_2 ~ ǫ-greedy(·|S_2)

SLIDE 71

SARSA: Step by Step

[Diagram: the trace extends — from S_2 take A_2, observe R_3 and S_3, then pick A_3]

  Q(S_2, A_2) := Q(S_2, A_2) + α (R_3 + γQ(S_3, A_3) − Q(S_2, A_2)),  with A_3 ~ ǫ-greedy(·|S_3)
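Putting the updates together, a minimal tabular SARSA loop. The environment interface here (reset() returning a state, step(a) returning (next_state, reward, done)) is an assumption made for the sketch, and epsilon_greedy is the helper from the earlier snippet:

```python
from collections import defaultdict

def sarsa(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)                      # Q[(s, a)], zero-initialised
    for _ in range(episodes):
        s = env.reset()                         # assumed interface
        a = epsilon_greedy(Q, s, actions, eps)
        done = False
        while not done:
            s2, r, done = env.step(a)           # assumed interface
            a2 = epsilon_greedy(Q, s2, actions, eps)   # on-policy next action
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q
```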

SLIDE 72

Q Learning

Learn about the optimal policy while following an exploratory policy:
• Target policy: greedy [optimal]
• Behaviour policy: ǫ-greedy [exploratory]

SLIDE 73

Q Learning

(As above — the figure labels the greedy target policy "Optimal" and the ǫ-greedy behaviour policy "Exploratory".)

SLIDE 74

Q-Learning Control Algorithm

Sarsa:       Q(S_1, A_1) := Q(S_1, A_1) + α (R_2 + γQ(S_2, A_2) − Q(S_1, A_1))
Q-learning:  Q(S_1, A_1) := Q(S_1, A_1) + α (R_2 + γ max_{a_2′} Q(S_2, a_2′) − Q(S_1, A_1))

with A_2 ~ ǫ-greedy(·|S_2) and a_2′ ~ greedy(·|S_2)

[Diagram: trace S_1, A_1 → S_2, A_2, with the alternative greedy action S_2, a_2′ used only in the target]

SLIDE 75

Q-Learning Control Algorithm

  Q(S_2, A_2) := Q(S_2, A_2) + α (R_3 + γ max_{a_3′} Q(S_3, a_3′) − Q(S_2, A_2))

with A_3 ~ ǫ-greedy(·|S_3) and a_3′ ~ greedy(·|S_3)

[Diagram: trace S_1, A_1 → S_2, A_2 → S_3, A_3, with greedy alternatives S_2, a_2′ and S_3, a_3′]

SLIDE 76

Q-Learning Control Algorithm

  Q(S_3, A_3) := Q(S_3, A_3) + α (R_4 + γ max_{a_4′} Q(S_4, a_4′) − Q(S_3, A_3))

with A_4 ~ ǫ-greedy(·|S_4) and a_4′ ~ greedy(·|S_4)

[Diagram: the trace continues one step further]
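The full loop, as a sketch under the same assumed environment interface as the SARSA snippet (reusing defaultdict and epsilon_greedy from above):

```python
def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(float)
    for _ in range(episodes):
        s = env.reset()                                   # assumed interface
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, eps)        # behaviour: ǫ-greedy
            s2, r, done = env.step(a)                     # assumed interface
            best = max(Q[(s2, a2)] for a2 in actions)     # target: greedy
            target = r if done else r + gamma * best
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q
```

The only difference from SARSA is the target: a max over next actions rather than the action the behaviour policy actually takes, which is what makes Q-learning off-policy.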

SLIDE 77

SARSA and Q-Learning example

https://studywolf.wordpress.com/2013/07/01/reinforcement-learning-sarsa-vs-q-learning/

SLIDE 78

What's in store for Lec 13?

https://memegenerator.net/instance/73854727

SLIDE 79

What's in store for Lec 13?

https://www.youtube.com/watch?v=60pwnLB0DqY https://www.youtube.com/watch?v=V1eYniJ0Rnk

SLIDE 80

Questions?

The only stupid question is the one you were afraid to ask but never did.
— Rich Sutton
SLIDE 81

References

• Introduction to RL by David Silver (UCL & DeepMind): www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html [Lectures 1–5], https://youtu.be/2pWv7GOvuf0
• Artificial Intelligence by Pieter Abbeel (UC Berkeley): https://edge.edx.org/courses/BerkeleyX/CS188x-SP15/SP15/20021a0a32d14a31b087db8d4bb582fd/
• Artificial Intelligence by Svetlana Lazebnik (UIUC): http://slazebni.cs.illinois.edu/fall16/

SLIDE 82

Appendix

SLIDE 83

Incremental Monte-Carlo Updates

Idea: update V(s) incrementally after each episode S_1, A_1, R_2, ..., S_T. For each state S_t with return G_t:

  N(S_t) := N(S_t) + 1
  V(S_t) := V(S_t) + (1/N(S_t)) (G_t − V(S_t))

In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:

  V(S_t) := (1 − α)V(S_t) + αG_t = V(S_t) + α (G_t − V(S_t))
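A sketch of the two update styles side by side (V and N are dicts, g is a sampled return):

```python
def running_mean_update(V, N, s, g):
    """Exact running mean of all the returns seen for state s."""
    N[s] = N.get(s, 0) + 1
    V.setdefault(s, 0.0)
    V[s] += (g - V[s]) / N[s]

def constant_alpha_update(V, s, g, alpha=0.05):
    """Exponential recency-weighted mean: old episodes are gradually forgotten."""
    V.setdefault(s, 0.0)
    V[s] += alpha * (g - V[s])
```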

SLIDE 84

GLIE

Definition (Greedy in the Limit with Infinite Exploration, GLIE):

• All state-action pairs are explored infinitely many times: lim_{k→∞} N_k(s, a) = ∞
• The policy converges on a greedy policy: lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈A} Q_k(s, a′))

For example, ǫ-greedy is GLIE if ǫ is reduced to zero as ǫ_k = 1/k.

SLIDE 85

Convergence of Sarsa

Theorem: Sarsa converges to the optimal action-value function, Q(s, a) → q∗(s, a), under the following conditions:

• GLIE sequence of policies π_t(a|s)
• Robbins–Monro sequence of step sizes α_t:  Σ_{t=1}^{∞} α_t = ∞  and  Σ_{t=1}^{∞} α_t² < ∞

SLIDE 86

Monte-Carlo Control

Sample the k-th episode using π: {S_1, A_1, R_2, ..., S_T} ~ π. For each state S_t and action A_t in the episode:

  N(S_t, A_t) := N(S_t, A_t) + 1
  Q(S_t, A_t) := Q(S_t, A_t) + (1/N(S_t, A_t)) (G_t − Q(S_t, A_t))

Improve the policy based on the new action-value function:

  ǫ := 1/k,  π := ǫ-greedy(Q)

Theorem: decaying-ǫ Monte-Carlo control converges to the optimal action-value function, Q(s, a) → q∗(s, a).

SLIDE 87

Sarsa Algorithm for On-Policy Control

SLIDE 88

Q-Learning Algorithm for Off-Policy Control