


Chapter 6: Temporal Difference Learning

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Objectives of this chapter:
  • Introduce Temporal Difference (TD) learning
  • Focus first on policy evaluation, or prediction, methods
  • Compare the efficiency of TD learning with MC learning
  • Then extend to control methods


cf. Dynamic Programming

V(St) ← Eπ[Rt+1 + γV(St+1)]
      = Σa π(a|St) Σs',r p(s', r | St, a)[r + γV(s')]

(Figure: DP backup diagram — from St the backup branches over every action a, and for each action over every reward r and successor state s'.)


Simple Monte Carlo

V(St) ← V(St) + α[Gt − V(St)]

(Figure: MC backup diagram — the update at St backs up the full sampled return Gt from one complete episode, through to the terminal state T.)


Simplest TD Method

V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)]

(Figure: TD backup diagram — the update at St backs up a single sampled transition, St → Rt+1, St+1.)


TD methods bootstrap and sample

Bootstrapping: the update involves an estimate
  • MC does not bootstrap
  • DP bootstraps
  • TD bootstraps

Sampling: the update does not involve an expected value
  • MC samples
  • DP does not sample
  • TD samples


TD Prediction

Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function vπ

Recall: Simple every-visit Monte Carlo method:

V(St) ← V(St) + α[Gt − V(St)]

target: the actual return after time t

The simplest temporal-difference method, TD(0):

V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)]

target: an estimate of the return
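For concreteness, a minimal Python sketch of tabular TD(0) prediction. The env object with reset() and step() returning (state, reward, done), and the policy callable, are assumed interfaces of my own, not something from the slides:

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
        # Tabular TD(0): V(St) <- V(St) + alpha * (R(t+1) + gamma*V(S(t+1)) - V(St))
        V = {}                                   # value table, 0.0 for unseen states
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)                    # sample an action from the given policy
                s_next, r, done = env.step(a)
                v_next = 0.0 if done else V.get(s_next, 0.0)   # V(terminal) = 0
                V[s] = V.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
                s = s_next
        return V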


Example: Driving Home

State                          Elapsed Time   Predicted    Predicted
                               (minutes)      Time to Go   Total Time
leaving office, friday at 6         0             30           30
reach car, raining                  5             35           40
exiting highway                    20             15           35
2ndary road, behind truck          30             10           40
entering home street               40              3           43
arrive home                        43              0           43


Driving Home

(Figure, two panels: changes recommended by Monte Carlo methods (α=1), which shift every prediction toward the actual final outcome, vs. changes recommended by TD methods (α=1), which shift each prediction toward the prediction that follows it.)


Advantages of TD Learning

TD methods do not require a model of the environment, only experience

TD, but not MC, methods can be fully incremental
  • You can learn before knowing the final outcome
  • Less memory
  • Less peak computation
You can learn without the final outcome
  • From incomplete sequences

Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?


Random Walk Example

Values learned by TD(0) after various numbers of episodes:

V(St) ← V(St) + α[Rt+1 + γV(St+1) − V(St)]
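A self-contained sketch of TD(0) on this example. The five-state layout (states A–E, start in the center, +1 reward for exiting on the right, true values 1/6 … 5/6) follows the book's random walk; the encoding is mine:

    import random

    STATES = (1, 2, 3, 4, 5)                    # 1..5 stand for A..E; 0 and 6 are terminal

    def td0_random_walk(num_episodes, alpha=0.1, gamma=1.0):
        V = {s: 0.5 for s in STATES}            # values initialized to 0.5, as in the book
        for _ in range(num_episodes):
            s = 3                               # every episode starts in the center state C
            while s not in (0, 6):
                s_next = s + random.choice((-1, 1))    # equiprobable left/right moves
                r = 1.0 if s_next == 6 else 0.0        # +1 only for exiting on the right
                v_next = 0.0 if s_next in (0, 6) else V[s_next]
                V[s] += alpha * (r + gamma * v_next - V[s])
                s = s_next
        return V

    print(td0_random_walk(100))   # estimates approach the true values 1/6, 2/6, ..., 5/6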


TD and MC on the Random Walk

(Figure: learning curves for TD and MC at various values of α. Data averaged over 100 sequences of episodes.)


Batch Updating in TD and MC methods

Batch Updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.

Compute updates according to TD or MC, but only update estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!
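A minimal sketch of batch TD(0), assuming episodes are stored as lists of (s, r, s_next) transitions with s_next = None at termination (an encoding of my own):

    def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
        V = {}
        while True:
            delta = {}                          # increments accumulated over one full pass
            for episode in episodes:
                for s, r, s_next in episode:
                    v_next = 0.0 if s_next is None else V.get(s_next, 0.0)
                    delta[s] = delta.get(s, 0.0) + alpha * (r + gamma * v_next - V.get(s, 0.0))
            for s, d in delta.items():          # apply updates only after the complete pass
                V[s] = V.get(s, 0.0) + d
            if max(abs(d) for d in delta.values()) < tol:
                return V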


Random Walk under Batch Updating

After each new episode, all previous episodes were treated as a batch, and the algorithm was trained on the batch until convergence. All repeated 100 times.


You are the Predictor

Suppose you observe the following 8 episodes:

  A, 0, B, 0
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 1
  B, 0

V(B)? 0.75
V(A)? 0?

Assume Markov states, no discounting (γ = 1)


You are the Predictor

V(A)? 0.75

(Figure: the best-fit Markov model of these data — A transitions to B with reward 0; from B, 75% of transitions terminate with reward 1 and 25% terminate with reward 0.)


You are the Predictor

The prediction that best matches the training data is V(A) = 0
  • This minimizes the mean-square error on the training set
  • This is what a batch Monte Carlo method gets

If we consider the sequentiality of the problem, then we would set V(A) = 0.75
  • This is correct for the maximum-likelihood estimate of a Markov model generating the data
  • i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
  • This is called the certainty-equivalence estimate
  • This is what TD gets
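As a hedged worked example, the sketch below recomputes both answers directly from the eight episodes; the episode encoding is my own:

    # Each episode is a list of (state, reward) steps, per the slide's data.
    episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

    # Batch Monte Carlo: V(s) = average return observed after visiting s (gamma = 1).
    returns = {"A": [], "B": []}
    for ep in episodes:
        rewards = [r for _, r in ep]
        for t, (s, _) in enumerate(ep):
            returns[s].append(sum(rewards[t:]))
    V_mc = {s: sum(g) / len(g) for s, g in returns.items()}
    print(V_mc)                        # {'A': 0.0, 'B': 0.75}

    # Certainty equivalence: fit a Markov model and solve it exactly.
    # Every A-transition goes to B with reward 0, so V(A) = 0 + V(B).
    V_b = V_mc["B"]                    # B is followed only by termination
    print({"A": 0 + V_b, "B": V_b})    # {'A': 0.75, 'B': 0.75}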


Summary so far

  • Introduced one-step tabular model-free TD methods
  • These methods bootstrap and sample, combining aspects of DP and MC methods
  • TD methods are computationally congenial
  • If the world is truly Markov, then TD methods will learn faster than MC methods
  • MC methods have lower error on past data, but higher error on future data

Learning An Action-Value Function

Estimate qπ for the current policy π

(Figure: the trajectory alternates between states and state–action pairs: St, At → Rt+1, St+1; St+1, At+1 → Rt+2, St+2; and so on.)

After every transition from a nonterminal state St, do this:

Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)]

If St+1 is terminal, then define Q(St+1, At+1) = 0


Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:

Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
        S ← S'; A ← A'
    until S is terminal
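A minimal Python sketch of this algorithm. The env object with reset(), step() returning (state, reward, done), and an actions list is an assumed interface, not something from the slides:

    import random
    from collections import defaultdict

    def sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)                  # Q[(s, a)], zero-initialized

        def eps_greedy(s):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(s, a)])

        for _ in range(num_episodes):
            s = env.reset()
            a = eps_greedy(s)
            while True:
                s_next, r, done = env.step(a)
                if done:                        # Q(terminal, .) = 0
                    Q[(s, a)] += alpha * (r - Q[(s, a)])
                    break
                a_next = eps_greedy(s_next)     # on-policy: the action actually taken next
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
                s, a = s_next, a_next
        return Q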


Windy Gridworld

undiscounted, episodic, reward = −1 until goal

(Figure: a 7×10 gridworld with start and goal cells; the wind strength under each column — 0 0 0 1 1 1 2 2 1 0 — shifts the agent upward by that many cells on each move from that column.)
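As a hedged companion to the sarsa() sketch above, here is one way this environment could be coded; the grid size, start/goal cells, and wind strengths follow the book's figure, while the class interface is my assumption:

    class WindyGridworld:
        # 7x10 grid; wind pushes the agent upward; reward -1 per step until the goal.
        actions = ["up", "down", "left", "right"]
        WIND = [0, 0, 0, 1, 1, 1, 2, 2, 1, 0]
        MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

        def reset(self):
            self.pos = (3, 0)                         # start cell
            return self.pos

        def step(self, action):
            row, col = self.pos
            dr, dc = self.MOVES[action]
            row = min(max(row + dr - self.WIND[col], 0), 6)   # wind of the departed column
            col = min(max(col + dc, 0), 9)
            self.pos = (row, col)
            done = self.pos == (3, 7)                 # goal cell
            return self.pos, -1.0, done

Something like Q = sarsa(WindyGridworld(), 500) then trains on this task with the same α = 0.5 and ε = 0.1 used in the experiment.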


Results of Sarsa on the Windy Gridworld


Q-Learning: Off-Policy TD Control

One-step Q-learning:

Q(St, At) ← Q(St, At) + α[Rt+1 + γ maxa Q(St+1, a) − Q(St, At)]

Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α[R + γ maxa Q(S', a) − Q(S, A)]
        S ← S'
    until S is terminal
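A minimal sketch, using the same assumed env interface as the Sarsa sketch above. The only substantive change from Sarsa is the target: the max over next actions, regardless of which action is actually taken next:

    import random
    from collections import defaultdict

    def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:            # ε-greedy behavior policy
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda x: Q[(s, x)])
                s_next, r, done = env.step(a)
                best_next = 0.0 if done else max(Q[(s_next, x)] for x in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])   # greedy target
                s = s_next
        return Q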


Cliffwalking

ε-greedy, ε = 0.1

(Figure: the cliff-walking gridworld — stepping into the cliff gives a large negative reward and a return to the start; Sarsa learns the longer, safer path, while Q-learning learns the optimal path along the cliff edge.)


Expected Sarsa

Instead of the sample value-of-next-state, use the expectation!

Expected Sarsa performs better than Sarsa (but costs more)

Q(St, At) ← Q(St, At) + α[Rt+1 + γ E[Q(St+1, At+1) | St+1] − Q(St, At)]
          ← Q(St, At) + α[Rt+1 + γ Σa π(a|St+1) Q(St+1, a) − Q(St, At)]

(Figure: backup diagrams for Q-learning and Expected Sarsa.)
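A minimal sketch with an ε-greedy policy as both target and behavior policy, again on the assumed env interface; the expectation term is computed exactly from the ε-greedy probabilities:

    import random
    from collections import defaultdict

    def expected_sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
        Q = defaultdict(float)
        n = len(env.actions)
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda x: Q[(s, x)])
                s_next, r, done = env.step(a)
                if done:
                    expectation = 0.0
                else:
                    greedy = max(env.actions, key=lambda x: Q[(s_next, x)])
                    # E[Q(S', A') | S'] under the ε-greedy policy π
                    expectation = sum(
                        ((1 - epsilon) + epsilon / n if x == greedy else epsilon / n)
                        * Q[(s_next, x)]
                        for x in env.actions
                    )
                Q[(s, a)] += alpha * (r + gamma * expectation - Q[(s, a)])
                s = s_next
        return Q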


Performance on the Cliff-walking Task

(Figure: reward per episode vs. α ∈ {0.1, …, 1.0} on the cliff-walking task, for Sarsa, Q-learning, and Expected Sarsa — both interim performance (after 100 episodes, n = 100) and asymptotic performance (n = 1E5 episodes). Expected Sarsa performs best in both regimes.)

van Seijen, van Hasselt, Whiteson, & Wiering 2009


Off-policy Expected Sarsa

Expected Sarsa generalizes to arbitrary behavior policies, in which case it includes Q-learning as the special case in which π is the greedy policy. This idea seems to be new.

Q(St, At) ← Q(St, At) + α[Rt+1 + γ E[Q(St+1, At+1) | St+1] − Q(St, At)]
          ← Q(St, At) + α[Rt+1 + γ Σa π(a|St+1) Q(St+1, a) − Q(St, At)]

Nothing changes here — the expectation is still taken under the target policy π, whatever behavior policy generated the transition.


Maximization Bias Example

(Figure: a small MDP with states A and B. Episodes start at A (START); the "right" action terminates immediately with reward 0, while the "wrong" action leads to B, whose many actions each terminate with a reward drawn from N(−0.1, 1). The plot shows the % of wrong actions over the first 300 episodes for Q-learning and Double Q-learning, falling from 100% toward the 5% floor that is optimal under ε-greedy exploration.)

Tabular Q-learning:

Q(St, At) ← Q(St, At) + α[Rt+1 + γ maxa Q(St+1, a) − Q(St, At)]


Double Q-Learning

  • Train 2 action-value functions, Q1 and Q2
  • Do Q-learning on both, but
    • never on the same time steps (Q1 and Q2 are independent)
    • pick Q1 or Q2 at random to be updated on each step
  • If updating Q1, use Q2 for the value of the next state:

Q1(St, At) ← Q1(St, At) + α[Rt+1 + γQ2(St+1, argmaxa Q1(St+1, a)) − Q1(St, At)]

  • Action selections are (say) ε-greedy with respect to the sum of Q1 and Q2

Hado van Hasselt 2010


Double Q-Learning

Hado van Hasselt 2010

Initialize Q1(s, a) and Q2(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily
Initialize Q1(terminal-state, ·) = Q2(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q1 and Q2 (e.g., ε-greedy in Q1 + Q2)
        Take action A, observe R, S'
        With 0.5 probability:
            Q1(S, A) ← Q1(S, A) + α[R + γQ2(S', argmaxa Q1(S', a)) − Q1(S, A)]
        else:
            Q2(S, A) ← Q2(S, A) + α[R + γQ1(S', argmaxa Q2(S', a)) − Q2(S, A)]
        S ← S'
    until S is terminal
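A minimal Python sketch of this algorithm, on the same assumed env interface as the earlier sketches; one table selects the argmax action, the other evaluates it:

    import random
    from collections import defaultdict

    def double_q_learning(env, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
        Q1, Q2 = defaultdict(float), defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                if random.random() < epsilon:   # ε-greedy w.r.t. the sum Q1 + Q2
                    a = random.choice(env.actions)
                else:
                    a = max(env.actions, key=lambda x: Q1[(s, x)] + Q2[(s, x)])
                s_next, r, done = env.step(a)
                # Flip a coin: Qa is updated, Qb evaluates Qa's argmax action.
                Qa, Qb = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
                if done:
                    target = r
                else:
                    a_star = max(env.actions, key=lambda x: Qa[(s_next, x)])
                    target = r + gamma * Qb[(s_next, a_star)]
                Qa[(s, a)] += alpha * (target - Qa[(s, a)])
                s = s_next
        return Q1, Q2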


Example of Maximization Bias

(Figure: the same maximization-bias MDP and learning curves as before — % wrong actions from A over 300 episodes for Q-learning vs. Double Q-learning, approaching the 5% optimal floor.)

Double Q-learning:

Q1(St, At) ← Q1(St, At) + α[Rt+1 + γQ2(St+1, argmaxa Q1(St+1, a)) − Q1(St, At)]


Afterstates

Usually, a state-value function evaluates states in which the agent can take an action. But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe. Why is this useful? What is this in general?


Summary

  • Introduced one-step tabular model-free TD methods
  • These methods bootstrap and sample, combining aspects of DP and MC methods
  • TD methods are computationally congenial
  • If the world is truly Markov, then TD methods will learn faster than MC methods
  • MC methods have lower error on past data, but higher error on future data
  • Extend prediction to control by employing some form of GPI
    • On-policy control: Sarsa, Expected Sarsa
    • Off-policy control: Q-learning, Expected Sarsa
  • Avoiding maximization bias with Double Q-learning


Unified View

(Figure: the space of backups laid out along two dimensions — the width of the backup, from sample backups to full backups, and the height (depth) of the backup, from one-step bootstrapping to running to termination. Temporal-difference learning is narrow and shallow; dynamic programming is wide and shallow; Monte Carlo is narrow and deep; exhaustive search is wide and deep.)