SLIDE 1

New Temporal-Difference Methods Based on Gradient Descent

Rich Sutton, Hamid Maei, Doina Precup (McGill), Shalabh Bhatnagar (IISc Bangalore), Csaba Szepesvari, Eric Wiewiora, David Silver

SLIDE 2

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 3

What is temporal-difference learning?

  • The most important and distinctive idea in reinforcement learning
  • A way of learning to predict, from changes in your predictions, without waiting for the final outcome
  • A way of taking advantage of state in multi-step prediction problems
  • Learning a guess from a guess
SLIDE 4

Examples of TD learning opportunities

  • Learning to evaluate backgammon positions from changes in evaluation within a game
  • Learning where your tennis opponent will hit the ball from his approach
  • Learning what features of a market indicate that it will have a major decline
  • Learning to recognize your friend’s face
SLIDE 5

Function approximation

  • TD learning is sometimes done in a table-lookup context - where every state is distinct and treated totally separately
  • But really, to be powerful, we must generalize between states
  • The same state never occurs twice

For example, in Computer Go, we use 10^6 parameters to learn about 10^170 positions

SLIDE 6

Advantages of TD methods for prediction

  • 1. Data efficient. Learn much faster on Markov problems
  • 2. Cheap to implement. Require less memory and peak computation
  • 3. Able to learn from incomplete sequences. In particular, able to learn off-policy

SLIDE 7

Off-policy learning

  • Learning about a policy different than the one being used to generate actions
  • Most often used to learn optimal behavior from a given data set, or from more exploratory behavior
  • Key to ambitious theories of knowledge and perception as continual prediction about the outcomes of options
SLIDE 8

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 9

Value-function approximation from sample trajectories

[Diagram: sample trajectories through states, each ending in a numeric outcome]

  • True values: V(s) = E[outcome | s]
  • Estimated values: V_θ(s) ≈ V(s), θ ∈ ℝ^n
  • Linear approximation: V_θ(s) = θ⊤φ_s, φ_s ∈ ℝ^n

θ is the modifiable parameter vector; φ_s is the feature vector for state s

SLIDE 10

Value-function approximation from sample trajectories

  • True values: V(s) = E[outcome | s]
  • Estimated values: V_θ(s) ≈ V(s), θ ∈ ℝ^n
  • Linear approximation: V_θ(s) = θ⊤φ_s, φ_s ∈ ℝ^n

θ is the modifiable parameter vector; φ_s is the feature vector for state s

[Diagram: a worked example for one state s. The feature vector φ_s is binary with three active features; multiplying by the parameter vector θ (entries such as 0.1, -2, 0.5, 5, -.4, ...) gives V_θ(s) = -2 + 0 + 5 = 3]
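To make the linear form concrete, here is a minimal sketch (with hypothetical feature and parameter values in the spirit of the slide's worked example) of computing V_θ(s) = θ⊤φ_s for a sparse binary feature vector:

```python
import numpy as np

# Hypothetical parameter vector (one entry per feature).
theta = np.array([0.1, -2.0, 0.5, 5.0, -0.4, 0.0])

# Binary feature vector for some state s: three features are active.
phi_s = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])

# Linear value estimate: V_theta(s) = theta^T phi_s
v_s = theta @ phi_s   # -2.0 + 5.0 + 0.0 = 3.0
print(v_s)
```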

SLIDE 11

From terminal outcomes to per-step rewards

[Diagram: a state trajectory with a per-step reward on each transition; the target values (returns) are the sum of future rewards until the end of the episode, or until the discounting horizon]

  • True values:

V(s) = E[ Σ_{t=0}^∞ γ^t r_t | s_0 = s ]

where γ is the discount rate, 0 ≤ γ ≤ 1
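As a quick illustration of the return, here is a small sketch (made-up reward sequence and discount rate) that computes the discounted sum of future rewards:

```python
# Discounted return G = sum_t gamma^t * r_t for a made-up reward sequence.
gamma = 0.9
rewards = [1.0, 2.0, -1.0, 1.0]   # r_0, r_1, r_2, r_3 (episode ends here)

G = 0.0
for t, r in enumerate(rewards):
    G += (gamma ** t) * r
print(G)   # 1 + 0.9*2 - 0.81*1 + 0.729*1 = 2.719
```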

SLIDE 12

TD methods operate on individual transitions

[Diagram: trajectories are cut up into individual transitions]

Training set is now a bag of transitions. Select from them i.i.d. (independently, identically distributed).

  • d_s - distribution of first state s
  • r_s - expected reward given s
  • P_ss' - prob of next state s' given s

Sample transition: (s, r, s') or (φ, r, φ'). P and d are linked.

TD(0) algorithm:

θ ← θ + αδφ, where δ = r + γθ⊤φ' - θ⊤φ
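A minimal sketch of this linear TD(0) update on a single sampled transition (illustrative step size and made-up feature vectors, not the authors' code):

```python
import numpy as np

def td0_update(theta, phi, r, phi_next, alpha=0.1, gamma=0.99):
    """One linear TD(0) update for a sampled transition (phi, r, phi')."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)   # TD error
    return theta + alpha * delta * phi

# Example: a single transition with made-up feature vectors.
theta = np.zeros(3)
theta = td0_update(theta, np.array([1.0, 0.0, 0.0]), 1.0, np.array([0.0, 1.0, 0.0]))
print(theta)
```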

SLIDE 13

Off-policy training

[Diagram: the same bag of transitions, with d_s, r_s, and P_ss' as before]

P and d are no longer linked. TD(0) may diverge!

SLIDE 14

Baird’s counter-example

[Diagram: Baird's counter-example Markov chain with a terminal state and transition probabilities of 100%, 99%, and 1%. Approximate values: V_k(s) = θ(7) + 2θ(i) for states i = 1..5, and V_k(s) = 2θ(7) + θ(6) for the sixth state]

  • P and d are not linked
  • d is all states with equal probability
  • P is according to this Markov chain

α = 0.01, γ = 0.99
θ_0 = (1, 1, 1, 1, 1, 10, 1)
r = 0

SLIDE 15

TD can diverge: Baird’s counter-example

α = 0.01, γ = 0.99
θ_0 = (1, 1, 1, 1, 1, 10, 1)

[Plot: parameter values θ_k(i) versus iterations k (log scale, broken at ±1) under deterministic updates; the curves for θ_k(7), θ_k(1)-θ_k(5), and θ_k(6) all diverge]

SLIDE 16

TD(0) can diverge: A simple example

[Diagram: a two-state example in which the approximate values are θ and 2θ]

TD update:  δ = r + γθ⊤φ' - θ⊤φ = 0 + 2θ - θ = θ,  Δθ = αδφ = αθ

TD fixpoint:  θ* = 0

Diverges!
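A minimal numerical sketch of this divergence (assuming γ = 0.99, r = 0, and repeated sampling of the single θ → 2θ transition):

```python
# The "theta -> 2*theta" example: features are 1 and 2, reward 0.
alpha, gamma = 0.1, 0.99
phi, phi_next, r = 1.0, 2.0, 0.0

theta = 1.0
for k in range(100):
    delta = r + gamma * theta * phi_next - theta * phi   # = (2*gamma - 1) * theta
    theta += alpha * delta * phi
print(theta)   # grows without bound: each update multiplies theta by 1 + alpha*(2*gamma - 1)
```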

SLIDE 17

Previous attempts to solve the off-policy problem

  • Importance sampling
  • With recognizers
  • Least-squares methods, LSTD, LSPI, iLSTD
  • Averagers
  • Residual gradient methods
SLIDE 18

Desiderata: We want a TD algorithm that

  • Bootstraps (genuine TD)
  • Works with linear function approximation (stable, reliably convergent)
  • Is simple, like linear TD — O(n)
  • Learns fast, like linear TD
  • Can learn off-policy (arbitrary P and d)
  • Learns from online causal trajectories (no repeat sampling from the same state)

SLIDE 19

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 20

Gradient-descent learning methods - the recipe

  • 1. Pick an objective function J(θ), a parameterized function to be minimized
  • 2. Use calculus to analytically compute the gradient ∇_θ J(θ)
  • 3. Find a “sample gradient” ∇_θ J_t(θ) that you can sample on every time step and whose expected value equals the gradient
  • 4. Take small steps proportional to the sample gradient:

θ ← θ - α ∇_θ J_t(θ)
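One way to instantiate this recipe is the LMS example from the outline: take J(θ) = ½ E[(θ⊤φ - G)²] with a (Monte Carlo) return G as the target, so the sample gradient gives the familiar least-mean-squares update. A minimal sketch with made-up data, not the authors' code:

```python
import numpy as np

# LMS objective J(theta) = 1/2 E[(theta^T phi - G)^2], G = target return for the state.
alpha = 0.1
theta = np.zeros(3)

samples = [(np.array([1.0, 0.0, 1.0]), 2.0),   # (feature vector phi, target return G)
           (np.array([0.0, 1.0, 1.0]), 0.5)]

for phi, G in samples:
    error = (theta @ phi) - G        # sample gradient of J is error * phi
    theta -= alpha * error * phi     # step against the sample gradient
print(theta)
```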

SLIDE 21

Conventional TD is not the gradient of anything

TD(0) algorithm:  Δθ = αδφ,  where δ = r + γθ⊤φ' - θ⊤φ

Assume there is a J such that  ∂J/∂θ_i = δφ_i.  Then look at the second derivatives:

∂²J / ∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ'_j - φ_j)φ_i
∂²J / ∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ'_i - φ_i)φ_j

Real 2nd derivatives must be symmetric: ∂²J/∂θ_j∂θ_i = ∂²J/∂θ_i∂θ_j. But these two expressions are not equal in general. Contradiction!
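A quick numerical sketch (arbitrary made-up vectors) confirming that the implied "second derivative" matrix is not symmetric, so no such J can exist:

```python
import numpy as np

gamma = 0.9
phi      = np.array([1.0, 0.0, 2.0])   # features of the current state (made up)
phi_next = np.array([0.0, 1.0, 1.0])   # features of the next state (made up)

# (i, j) entry of the would-be Hessian: d(delta*phi_i)/d(theta_j) = (gamma*phi'_j - phi_j)*phi_i
H = np.outer(phi, gamma * phi_next - phi)
print(np.allclose(H, H.T))   # False: not symmetric, so TD(0) is not the gradient of any J
```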

SLIDE 22

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 23

Gradient descent for TD:

What should the objective function be?

  • Close to the true values (the true value function)?

Mean-Square Value Error:  MSE(θ) = Σ_s d_s (V_θ(s) - V(s))² = ‖V_θ - V‖²_D

  • Or close to satisfying the Bellman equation?

Mean-Square Bellman Error:  MSBE(θ) = ‖V_θ - TV_θ‖²_D

where T is the Bellman operator defined by  V = r + γPV = TV

SLIDE 24

Value function geometry

[Diagram: value-function geometry. V_θ lies in the space spanned by the feature vectors, weighted by the state visitation distribution (Φ, D). The Bellman operator T takes you outside the space; the projection Π takes you back into it. The distance ‖V_θ - TV_θ‖_D is the RMSBE; the distance ‖V_θ - ΠTV_θ‖_D is the RMSPBE]

D = diag(d)

Previous work on gradient methods for TD minimized the Bellman-error objective fn (Baird 1995, 1999).

V_θ = ΠTV_θ is the TD fix-point. Is the Mean Square Projected Bellman Error (MSPBE) a better objective fn?

SLIDE 25

A-split example (Dayan 1992)

[Diagram: from state A, 50% of trajectories terminate immediately and 50% go to state B; from B, 100% terminate with reward 1]

Clearly, the true values are V(A) = 0.5, V(B) = 1.

But if you minimize the naive objective fn J(θ) = E[δ²], then you get the solution V(A) = 1/3, V(B) = 2/3, even in the tabular case (no FA).

SLIDE 26

Split-A example

[Diagram: two states A1 and A2 that share a single feature; A1 leads (100%) to B, which terminates (100%) with reward 1, while A2 terminates (100%) with reward 0]

The two ‘A’ states look the same: they share a single feature and must be given the same approximate value. The data appear just like the previous example, and the minimum-MSBE solution is again

V(A) = 1/3, V(B) = 2/3

SLIDE 27

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 28

Three new algorithms

  • GTD, the original gradient TD algorithm (Sutton, Szepesvari & Maei, 2008)
  • GTD-2, a second-generation GTD
  • TD-C, TD with gradient correction
  • GTD(λ), GQ(λ)
SLIDE 29

First relate the geometry to the iid statistics

[Geometry diagram as on the earlier slide: V_θ, TV_θ, and ΠTV_θ in the space (Φ, D), with RMSBE and RMSPBE]

Key identities:

Φ⊤D(TV_θ - V_θ) = E[δφ]
Φ⊤DΦ = E[φφ⊤]

MSPBE(θ) = ‖V_θ - ΠTV_θ‖²_D
         = ‖Π(V_θ - TV_θ)‖²_D
         = (Π(V_θ - TV_θ))⊤ D (Π(V_θ - TV_θ))
         = (V_θ - TV_θ)⊤ Π⊤DΠ (V_θ - TV_θ)
         = (V_θ - TV_θ)⊤ D⊤Φ (Φ⊤DΦ)⁻¹ Φ⊤D (V_θ - TV_θ)
         = (Φ⊤D(TV_θ - V_θ))⊤ (Φ⊤DΦ)⁻¹ Φ⊤D(TV_θ - V_θ)
         = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]
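A minimal sketch (made-up transitions) of estimating the MSPBE from sampled statistics, using the identity MSPBE(θ) = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]:

```python
import numpy as np

def mspbe(theta, transitions, gamma=0.99):
    """Estimate MSPBE(theta) = E[delta*phi]^T E[phi phi^T]^-1 E[delta*phi] from samples."""
    E_dphi = np.zeros_like(theta)
    E_pp = np.zeros((len(theta), len(theta)))
    for phi, r, phi_next in transitions:
        delta = r + gamma * (theta @ phi_next) - (theta @ phi)
        E_dphi += delta * phi
        E_pp += np.outer(phi, phi)
    E_dphi /= len(transitions)
    E_pp /= len(transitions)
    return E_dphi @ np.linalg.solve(E_pp, E_dphi)

# Made-up data: two transitions (phi, r, phi').
data = [(np.array([1.0, 0.0]), 0.0, np.array([0.0, 1.0])),
        (np.array([0.0, 1.0]), 1.0, np.array([0.0, 0.0]))]
print(mspbe(np.zeros(2), data))
```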
SLIDE 30

Derivation of the GTD-2 algorithm as gradient descent in the MSPBE

-½ ∇_θ MSPBE(θ) = E[(φ - γφ')φ⊤] E[φφ⊤]⁻¹ E[δφ]
               ≈ E[(φ - γφ')φ⊤] w,

assuming  w ≈ E[φφ⊤]⁻¹ E[δφ].   (This is the main trick!)

Sampling the expectation yields the O(n) update (Gradient TD Algorithm #2):

θ ← θ + α (φ - γφ')(φ⊤w)
w ← w + β (δ - φ⊤w) φ

where  δ = r + γθ⊤φ' - θ⊤φ
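A minimal sketch (Python, illustrative step sizes) of the GTD-2 update on one sampled transition:

```python
import numpy as np

def gtd2_update(theta, w, phi, r, phi_next, alpha=0.05, beta=0.05, gamma=0.99):
    """One GTD-2 update for a sampled transition (phi, r, phi')."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w

theta, w = np.zeros(3), np.zeros(3)
theta, w = gtd2_update(theta, w, np.array([1.0, 0.0, 0.0]), 1.0, np.array([0.0, 1.0, 0.0]))
print(theta, w)
```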

SLIDE 31

Derivation of the original GTD algorithm as gradient descent in the NEU

NEU(θ) = E[δφ]⊤ E[δφ]   (the norm of the expected TD update)

-½ ∇_θ NEU(θ) = E[(φ - γφ')φ⊤] E[δφ] ≈ E[(φ - γφ')φ⊤] w,

assuming  w ≈ E[δφ].

Sampling the expectation yields the same θ update as GTD-2, but with a different w update:

w ← w + β (δφ - w)
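For contrast with GTD-2, a minimal sketch of the original GTD update on one transition (same θ step, but w tracks E[δφ] instead; illustrative step sizes):

```python
import numpy as np

def gtd_update(theta, w, phi, r, phi_next, alpha=0.05, beta=0.05, gamma=0.99):
    """One original-GTD update: same theta step as GTD-2, different w update."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w = w + beta * (delta * phi - w)   # w estimates E[delta * phi]
    return theta, w
```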

SLIDE 32

Derivation of the TD-C algorithm as gradient descent in the MSPBE

-½ ∇_θ MSPBE(θ) = E[(φ - γφ')φ⊤] E[φφ⊤]⁻¹ E[δφ]
               = (E[φφ⊤] - γE[φ'φ⊤]) E[φφ⊤]⁻¹ E[δφ]
               = E[δφ] - γE[φ'φ⊤] E[φφ⊤]⁻¹ E[δφ]
               ≈ E[δφ] - γE[φ'φ⊤] w,

assuming  w ≈ E[φφ⊤]⁻¹ E[δφ].

Sampling the expectation yields

θ ← θ + αδφ - αγφ'(φ⊤w)

(the conventional TD(0) update plus a gradient correction term), with w updated as in GTD-2.
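A minimal sketch (illustrative step sizes) of the TD-C update on one sampled transition:

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, alpha=0.05, beta=0.1, gamma=0.99):
    """One TD-with-gradient-correction (TD-C) update for a transition (phi, r, phi')."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi   # same w update as GTD-2
    return theta, w
```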

SLIDE 33

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence (sketch and remarks)
  • Empirical results
SLIDE 34

Convergence theorems

  • For arbitrary P and d
  • All algorithms converge w.p.1 to the TD fix-point: E[δφ] → 0
  • GTD and GTD-2 converge at one time scale (α = β → 0)
  • TD-C converges in a two-time-scale sense (α, β → 0 with α/β → 0)

SLIDE 35

Outline

  • The promise and problems of TD learning
  • Value-function approximation
  • Gradient-descent methods - LMS example
  • Objective functions for TD
  • GD derivation of new algorithms
  • Proofs of convergence
  • Empirical results
  • Conclusions
SLIDE 36

Random walk problem (on-policy)

[Diagram: a 5-state random walk A-B-C-D-E, starting in the middle, with a reward of 1 on termination at one end]

3 different feature representations:

  • 5 tabular features
  • 5 inverted-tabular features
  • 3 features (genuine FA)
SLIDE 37

Boyan chain problem (on-policy)

[Diagram: the Boyan chain of 13 states, with rewards of -3 per transition (-2 near the end) and 4-component feature vectors that interpolate across the states, e.g. [1, 0, 0, 0], [0.75, 0.25, 0, 0], [0.5, 0.5, 0, 0]]

13 states, 4 features. Exact solution possible.

Boyan 1999

SLIDE 38

Summary of empirical results

[Plots: RMSPBE learning curves (versus episodes) and step-size (α) sensitivity curves for GTD, GTD-2, TD-C, and TD on four problems: Random Walk with tabular features, Random Walk with inverted features, Random Walk with dependent features, and the Boyan chain]

  • On small problems: TD, TD-C > GTD-2 > GTD
  • Sometimes TD > TD-C

SLIDE 39

Computer Go experiment

  • Learn the value function (probability of winning) for 5x5 Go
  • Lots of features, linearly combined, then passed through a logistic non-linearity
  • An established experimental testbed
  • Tried the various algorithms
  • Results are still preliminary
SLIDE 40

Computer Go results

[Plot: NEU versus step size (alpha) for TD, GTD, TD-C, and GTD-2 on 5x5 Go]

TD-C, TD > GTD, GTD-2

SLIDE 41

Off-policy result: Baird’s counter-example

[Plots: learning curves on Baird's counter-example for TD and the gradient algorithms]

Gradient algorithms converge. TD diverges.

SLIDE 42

Conclusions

  • The first O(n) methods to work off-policy (and meet all the other desiderata)
  • New methods (GTD-2 and TD-C) are much faster than original GTD
  • Not clear yet whether or not TD-C is sufficiently close to TD speed on on-policy problems
  • But it is at least a major step closer. And it works off-policy