SLIDE 1

Off-policy methods with approximation

Chapter 11

SLIDE 2

Recall off-policy learning involves two policies

  • One policy π whose value function we are learning (the target policy)
  • Another policy µ that is used to select actions (the behavior policy)
SLIDE 3

Off-policy is much harder with Function Approximation

  • Even linear FA
  • Even for prediction (two fixed policies π and µ)
  • Even for Dynamic Programming
  • The deadly triad: FA, TD, off-policy
  • Any two are OK, but not all three
  • With all three, we may get instability (elements of θ may increase to ±∞)

SLIDE 4

There are really 2 off-policy problems: one we know how to solve, one we are not sure of; one about the future, one about the present

  • The easy problem is that of off-policy targets (future)
  • We have been correcting for that since Chapters 5 and 6, using importance sampling in the target
  • The hard problem is that of the distribution of states to update (present); we are no longer updating according to the on-policy distribution

SLIDE 5

Baird’s counterexample illustrates the instability

[Figure: Baird's counterexample. Seven states; the value of each of the six upper states is 2θᵢ + θ₈ (i = 1…6) and the value of the seventh is θ₇ + 2θ₈. The behavior policy takes the dashed action with probability µ(dashed|·) = 6/7 and the solid action with probability µ(solid|·) = 1/7; the target policy always takes the solid action, π(solid|·) = 1 (the figure also shows 99% and 1% transition labels). The plot shows components θ₁–θ₈ of the parameter vector diverging over episodes under semi-gradient off-policy TD(0) (similar for DP).]
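Below is a minimal simulation sketch of the counterexample in Python (the constants and initialization follow the textbook version of the figure; treat it as an illustration, not a definitive implementation):

```python
import numpy as np

# Baird's counterexample: 7 states, 8 parameters, all rewards zero.
# States 0-5 have value 2*theta[i] + theta[7]; state 6 has theta[6] + 2*theta[7].
n_states, n_params, gamma = 7, 8, 0.99
Phi = np.zeros((n_states, n_params))
for i in range(6):
    Phi[i, i], Phi[i, 7] = 2.0, 1.0
Phi[6, 6], Phi[6, 7] = 1.0, 2.0

theta = np.ones(n_params)
theta[6] = 10.0                      # customary initialization
alpha, rng = 0.01, np.random.default_rng(0)
s = int(rng.integers(n_states))

for step in range(1000):
    if rng.random() < 1/7:           # solid action: go to state 6
        s_next, rho = 6, 7.0         # rho = pi/mu = 1 / (1/7)
    else:                            # dashed action: uniform over states 0-5
        s_next, rho = int(rng.integers(6)), 0.0  # pi(dashed) = 0
    delta = 0.0 + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    theta += alpha * rho * delta * Phi[s]   # semi-gradient off-policy TD(0)
    s = s_next

print(np.abs(theta).max())   # keeps growing as steps increase: instability
```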

SLIDE 6

What causes the instability?

  • It has nothing to do with learning or sampling
  • Even dynamic programming suffers from divergence with FA
  • It has nothing to do with exploration, greedification, or control
  • Even prediction alone can diverge
  • It has nothing to do with local minima or complex non-linear approximators
  • Even simple linear approximators can produce instability
SLIDE 7

The deadly triad

  • The risk of divergence arises whenever we combine three things:
  • 1. Function approximation: significantly generalizing from large numbers of examples
  • 2. Bootstrapping: learning value estimates from other value estimates, as in dynamic programming and temporal-difference learning
  • 3. Off-policy learning: learning about a policy from data not due to that policy, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy

(Why is dynamic programming off-policy?)

Any 2 OK

SLIDE 8

TD(0) can diverge: A simple example

[Figure: two states whose approximate values are θ and 2θ, with reward 0 on the transition between them.]

TD update:

δ = r + γθᵀφ′ − θᵀφ = 0 + 2θ − θ = θ
Δθ = αδφ = αθ

TD fixpoint: θ* = 0

Diverges!
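A tiny numeric check of this example (my own sketch; φ = 1, φ′ = 2, r = 0, γ = 1):

```python
# delta = 0 + 2*theta - theta = theta, so each update is theta <- theta + alpha*theta
theta, alpha = 1.0, 0.1
for _ in range(100):
    delta = 0.0 + 2.0 * theta - theta   # r + gamma*theta*phi' - theta*phi
    theta += alpha * delta * 1.0        # alpha * delta * phi
print(theta)  # ~1.4e4 after 100 steps: geometric growth away from theta* = 0
```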

SLIDE 9

Geometric intuition

[Figure: the space of all value functions, containing the subspace of all value functions representable as v_θ (with coordinates θ₁, θ₂). The true value function v_π lies outside the subspace; its projection Πv_π = v_θ* is the minimum of the Value Error, min ‖VE‖. Applying the Bellman operator to v_θ gives B_π v_θ, generally off the subspace; the Bellman error (BE) is the distance from v_θ to B_π v_θ, with its own minimum min ‖BE‖, and projecting back gives ΠB_π v_θ. The Projected Bellman Error (PBE) is zero where v_θ = ΠB_π v_θ.]

where

(B_π v)(s) ≐ Σ_{a∈A} π(s, a) [ r(s, a) + γ Σ_{s′∈S} p(s′|s, a) v(s′) ]

and v_θ ≐ v̂(·, θ), viewed as a giant vector ∈ ℝ^|S| (VE = Value Error).

SLIDE 10

Can we do without bootstrapping?

  • Bootstrapping is critical to the computational efficiency of DP
  • Bootstrapping is critical to the data efficiency of TD methods
  • On the other hand, bootstrapping introduces bias, which harms the asymptotic performance of approximate methods
  • The degree of bootstrapping can be finely controlled via the λ parameter, from λ=0 (full bootstrapping) to λ=1 (no bootstrapping)

SLIDE 11

4 examples of the effect of bootstrapping suggest that λ=1 (no bootstrapping) is a very poor choice

[Figure: four performance-vs-λ plots, ranging from pure bootstrapping (λ=0) to no bootstrapping (λ=1). In all cases lower is better; red points mark the cases of no bootstrapping.]

We need bootstrapping!

SLIDE 12

Desiderata: We want a TD algorithm that

  • Bootstraps (genuine TD)
  • Works with linear function approximation (stable, reliably convergent)
  • Is simple, like linear TD: O(n)
  • Learns fast, like linear TD
  • Can learn off-policy
  • Learns from online causal trajectories (no repeat sampling from the same state)

SLIDE 13

4 easy steps to stochastic gradient descent

  • 1. Pick an objective function J(θ), a parameterized function to be minimized
  • 2. Use calculus to analytically compute the gradient ∇θJ(θ)
  • 3. Find a “sample gradient” ∇θJt(θ) that you can sample on every time step and whose expected value equals the gradient
  • 4. Take small steps in θ proportional to the sample gradient: θ ← θ − α∇θJt(θ)
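As an illustration of the four steps, here is a sketch using a Monte Carlo value-error objective (my choice of objective, not one from the slides): J(θ) = E[(G − θᵀφ)²], whose sample gradient on step t is −2(G_t − θᵀφ_t)φ_t.

```python
import numpy as np

# Step 4: move theta a small step against the sample gradient.
def sgd_step(theta, phi_t, G_t, alpha=0.01):
    # sample gradient of J_t(theta) = (G_t - theta @ phi_t)**2 is
    # -2 * (G_t - theta @ phi_t) * phi_t; the factor 2 is absorbed into alpha
    return theta + alpha * (G_t - theta @ phi_t) * phi_t
```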

SLIDE 14

Conventional TD is not the gradient of anything (Etienne Barnard 1993)

TD(0) algorithm:

δ = r + γθᵀφ′ − θᵀφ
Δθ = αδφ

Assume there is a J such that ∂J/∂θᵢ = δφᵢ. Then look at the second derivatives:

∂²J/∂θⱼ∂θᵢ = ∂(δφᵢ)/∂θⱼ = (γφ′ⱼ − φⱼ)φᵢ
∂²J/∂θᵢ∂θⱼ = ∂(δφⱼ)/∂θᵢ = (γφ′ᵢ − φᵢ)φⱼ

Real second derivatives must be symmetric, ∂²J/∂θⱼ∂θᵢ = ∂²J/∂θᵢ∂θⱼ, but these two expressions are not equal in general. Contradiction!
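A quick numeric illustration of the contradiction (my own sketch): the would-be second-derivative matrix H with entries H[j,i] = (γφ′ⱼ − φⱼ)φᵢ is not symmetric for generic feature vectors.

```python
import numpy as np

gamma = 0.9
phi = np.array([1.0, 0.0])       # features of the current state
phi_next = np.array([0.0, 1.0])  # features of the next state
H = np.outer(gamma * phi_next - phi, phi)  # H[j, i] = (gamma*phi'_j - phi_j) * phi_i
print(H)                         # [[-1.  0.] [ 0.9 0.]]
print(np.allclose(H, H.T))       # False: no objective J can have this gradient
```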

SLIDE 15

A-split example (Dayan 1992)

[Figure: from state A, 50% of transitions terminate with reward 0 and 50% go to state B; from B, 100% of transitions terminate with reward 1.]

Clearly, the true values are

V(A) = 0.5, V(B) = 1

But if you minimize the naive objective fn, J(θ) = E[δ²], then you get the solution V(A) = 1/3, V(B) = 2/3. Even in the tabular case (no FA)!
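The minimization behind the claimed solution can be worked out directly (a sketch with γ = 1 and tabular values; half of the episodes go A → terminal with reward 0, half go A → B → terminal with reward 1):

```latex
\begin{align*}
J &= \tfrac{1}{2}\big(0 - V(A)\big)^2
   + \tfrac{1}{2}\Big[\big(V(B) - V(A)\big)^2 + \big(1 - V(B)\big)^2\Big] \\
\frac{\partial J}{\partial V(A)} = 0 &\;\Rightarrow\; 2V(A) = V(B), \qquad
\frac{\partial J}{\partial V(B)} = 0 \;\Rightarrow\; 2V(B) - V(A) = 1 \\
&\;\Rightarrow\; V(A) = \tfrac{1}{3}, \quad V(B) = \tfrac{2}{3}
  \qquad \text{(not the true values } V(A)=\tfrac{1}{2},\ V(B)=1\text{)}
\end{align*}
```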

SLIDE 16

Indistinguishable pairs of MDPs

[Figure: two pairs of MDPs. Within each pair the two MDPs produce identical observable data (same features and rewards) yet are different MDPs.]

The first pair have different Value Errors, but the same Return Errors (both errors have the same minima):

J_RE(θ)² = J_VE(θ)² + E[ (v_π(S_t) − G_t)² | A_{t:∞} ∼ π ]

The second pair have different Bellman Errors, but the same Projected Bellman Errors (the errors have different minima).

SLIDE 17

Not all objectives can be estimated from data; not all minima can be found by learning

[Figure: MDP1 and MDP2 map to the same data distribution P_µ(ξ), yet their Bellman Errors BE1 and BE2 have different minima θ*₁ ≠ θ*₂, so the BE minimum is not determined by the data; the PBE and TDE minima θ*₃ and θ*₄ are determined by the data distribution alone. Similarly, VE1 and VE2 differ for MDP1 and MDP2 while the RE is determined by the data distribution.]

  • No learning algorithm can find the minimum of the Bellman Error
SLIDE 18

The Gradient-TD Family of Algorithms

  • True gradient-descent algorithms in the Projected Bellman Error
  • GTD(λ) and GQ(λ), for learning V and Q
  • Solve two open problems:
  • convergent linear-complexity off-policy TD learning
  • convergent non-linear TD
  • Extended to control variate, proximal forms by Mahadevan et al.
SLIDE 19

First relate the geometry to the iid statistics

[Figure: V_θ, the Bellman operator T applied to it giving TV_θ, and the projection Π onto the subspace spanned by the features (Φ, D); RMSBE is the distance from V_θ to TV_θ, RMSPBE the distance from V_θ to ΠTV_θ.]

With Φ the matrix of the feature vectors for all states and D the diagonal matrix of the state distribution:

ΦᵀD(TV_θ − V_θ) = E[δφ]
ΦᵀDΦ = E[φφᵀ]
Π = Φ(ΦᵀDΦ)⁻¹ΦᵀD

MSPBE(θ) = ‖V_θ − ΠTV_θ‖²_D
= ‖Π(V_θ − TV_θ)‖²_D
= (Π(V_θ − TV_θ))ᵀD(Π(V_θ − TV_θ))
= (V_θ − TV_θ)ᵀΠᵀDΠ(V_θ − TV_θ)
= (V_θ − TV_θ)ᵀDΦ(ΦᵀDΦ)⁻¹ΦᵀD(V_θ − TV_θ)
= (ΦᵀD(TV_θ − V_θ))ᵀ(ΦᵀDΦ)⁻¹ΦᵀD(TV_θ − V_θ)
= E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ]
SLIDE 20

Derivation of the TDC algorithm

On a transition s →(r) s′, with feature vectors φ and φ′, and with a second set of weights w ∈ ℝⁿ (this is the trick!):

Δθ = −½α∇θJ(θ) = −½α∇θ ‖V_θ − ΠTV_θ‖²_D
= −½α∇θ ( E[δφ]ᵀ E[φφᵀ]⁻¹ E[δφ] )
= −α (∇θE[δφ])ᵀ E[φφᵀ]⁻¹ E[δφ]
= −α E[ ∇θ[ φ(r + γφ′ᵀθ − φᵀθ) ] ]ᵀ E[φφᵀ]⁻¹ E[δφ]
= −α E[ φ(γφ′ − φ)ᵀ ]ᵀ E[φφᵀ]⁻¹ E[δφ]
= −α ( γE[φ′φᵀ] − E[φφᵀ] ) E[φφᵀ]⁻¹ E[δφ]
= αE[δφ] − αγE[φ′φᵀ] E[φφᵀ]⁻¹ E[δφ]
≈ αE[δφ] − αγE[φ′φᵀ] w     (w estimates E[φφᵀ]⁻¹E[δφ])
≈ αδφ − αγφ′(φᵀw)          (sampling)
SLIDE 21
TD with gradient correction (TDC) algorithm, aka GTD(0)

  • on each transition s →(r) s′, with feature vectors φ and φ′
  • update two parameters:

θ ← θ + αδφ − αγφ′(φᵀw)     (TD(0) with gradient correction)
w ← w + β(δ − φᵀw)φ

  • where, as usual, δ = r + γθᵀφ′ − θᵀφ
  • φᵀw is an estimate of the TD error (δ) for the current state
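A sketch of one TDC step as a function (my layout; the ρ argument, defaulting to 1, is my addition for off-policy data, following the gradient-TD papers):

```python
import numpy as np

def tdc_step(theta, w, phi, r, phi_next, rho=1.0, alpha=0.01, beta=0.1, gamma=0.99):
    """One TD-with-gradient-correction update on transition (phi, r, phi_next)."""
    delta = r + gamma * theta @ phi_next - theta @ phi
    theta = theta + alpha * rho * (delta * phi - gamma * phi_next * (phi @ w))
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w
```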

SLIDE 22

Convergence theorems

  • All algorithms converge w.p.1 to the TD fix-point: E[δφ] → 0
  • GTD, GTD-2 converge at one time scale: α = β → 0
  • TD-C converges in a two-time-scale sense: α, β → 0 with α/β → 0

SLIDE 23

Off-policy result: Baird’s counter-example

[Figure: learning curves on Baird's counterexample, showing parameter values over episodes for TD and error over steps (log scale) for the gradient algorithms.]

Gradient algorithms converge. TD diverges.

SLIDE 24

Computer Go experiment

  • Learn a linear value function (probability of winning) for 9x9 Go from self play
  • One million features, each corresponding to a template on a part of the Go board
  • An established experimental testbed

[Figure: RNEU (based on the expected TD update E[Δθ_TD]) versus step size α, from .000001 to .001, for TD, GTD, GTD2, and TDC.]

SLIDE 25

| ISSUE | TD(λ), Sarsa(λ) | Approx. DP | LSTD(λ), LSPE(λ) | Fitted-Q | Residual gradient | GDP | GTD(λ), GQ(λ) |
|---|---|---|---|---|---|---|---|
| Linear computation | ✓ | ✓ | ✖ | ✖ | ✓ | ✓ | ✓ |
| Nonlinear convergent | ✖ | ✖ | ✖ | ✓ | ✓ | ✓ | ✓ |
| Off-policy convergent | ✖ | ✖ | ✓ | ✖ | ✓ | ✓ | ✓ |
| Model-free, online | ✓ | ✖ | ✓ | ✖ | ✓ | ✖ | ✓ |
| Converges to PBE = 0 | ✓ | ✓ | ✓ | ✓ | ✖ | ✓ | ✓ |

SLIDE 26

Off-policy RL with FA and TD remains challenging; there are multiple ideas, plus combinations

  • Gradient TD, proximal gradient TD, and hybrids
  • Emphatic TD
  • Higher λ (less TD)
  • Better state representations (less FA)
  • Recognizers (less off-policy)
  • LSTD (O(n²) methods)

In conclusion: more work is needed on these novel algorithms!
SLIDE 27

Emphatic temporal-difference learning

Rupam Mahmood, Huizhen (Janey) Yu, Martha White, Rich Sutton

Reinforcement Learning and Artificial Intelligence Laboratory Department of Computing Science University of Alberta Canada


SLIDE 28

State weightings are important, powerful, even magical, when using “genuine function approximation” (i.e., when the optimal solution can’t be approached)

  • They are the difference between convergence and divergence in on-policy and off-policy TD learning
  • They are needed to make the problem well-defined
  • We can change the weighting by emphasizing some steps more than others in learning

SLIDE 29

Often some time steps are more important

  • Early time steps of an episode may be more important
  • Because of discounting
  • Because the control objective is to maximize the value of the starting state
  • In general, function approximation resources are limited
  • Not all states can be accurately valued
  • The accuracy of different states must be traded off!
  • You may want to control the tradeoff
SLIDE 30

Bootstrapping interacts with state importance

  • In the Monte Carlo case (λ=1) the values of different states (or time steps) are estimated independently, and their importances can be assigned independently
  • But with bootstrapping (λ<1) each state’s value is estimated based on the estimated values of later states; if the state is important, then it becomes important to accurately value the later states even if they are not important on their own

SLIDE 31

Two kinds of importance

  • Intrinsic and derived; primary and secondary
  • The one you specify, and the one that follows from it because of bootstrapping
  • Our terms: Interest and Emphasis
  • Interest: your intrinsic interest in valuing a time step accurately
  • Emphasis: the total resultant emphasis that you place on each time step

SLIDE 32
Problem

  • Data: ⋯ φ(S_t) A_t R_{t+1} φ(S_{t+1}) A_{t+1} R_{t+2} ⋯, with actions chosen by the behavior policy µ
  • φ : S → ℝⁿ is the feature function (φ_t = φ(S_t)); i : S → ℝ⁺ is the interest function; π is the target policy; v_π the true value function; θ the parameter vector
  • State distribution: d_µ(s) = lim_{t→∞} Pr{ S_t = s | A_{0:t−1} ∼ µ }
  • Objective to minimize: MSE(θ) = Σ_{s∈S} d_µ(s) i(s) ( v_π(s) − θᵀφ(s) )²

Solution

  • Emphatic TD(0): θ_{t+1} = θ_t + αM_tρ_t ( R_{t+1} + γθ_tᵀφ_{t+1} − θ_tᵀφ_t ) φ_t, with emphasis M_t > 0
  • Importance sampling ratio: ρ_t = π(A_t|S_t) / µ(A_t|S_t), E[ρ_t] = 1
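A sketch of the Emphatic TD(0) update as a one-step function (my layout; the emphasis M_t is taken as given here, and is defined by the emphasis algorithm two slides below):

```python
import numpy as np

def emphatic_td0_step(theta, phi, r, phi_next, rho, M, alpha=0.01, gamma=0.99):
    """theta update with emphasis M and importance sampling ratio rho."""
    delta = r + gamma * theta @ phi_next - theta @ phi  # TD error
    return theta + alpha * M * rho * delta * phi
```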

SLIDE 33

θt+1 = A−1

t bt

bt =

t

X

k=1

MkρkRkφk At =

t

X

k=0

Mkρkφk

  • φk − γφk+1

> · · · φ(St) At Rt+1 φ(St+1) At+1 Rt+2 · · ·

  • State distribution
  • Objective to minimize
  • Emphatic TD(0)
  • Emphatic LSTD(0)

feature function interest function

dµ(s) = lim

t→∞ Pr

⇥ St = s

  • A0:t−1 ∼ µ

⇤ MSE(θ) = X

s2S

dµ(s)i(s) ⇣ vπ(s) − θ>φ(s) ⌘2

target policy true value function transpose (inner product) behavior policy parameter vector

i : S ! <+

emphasis

θt+1 = θt + αMtρt

  • Rt+1 + γθ>

t φt+1 − θ> t φt

  • φt

Mt > 0

importance sampling ratio

ρt = π(At|St) µ(At|St)

E[ρt] = 1

Problem Solution

φt = φ(St)
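A sketch of Emphatic LSTD(0) (my layout; the reward indexing follows the TD(0) convention R_{k+1}, and a small ridge term is my addition for invertibility):

```python
import numpy as np

def emphatic_lstd0(transitions, n, gamma=0.99, reg=1e-8):
    """transitions: iterable of (M, rho, phi, r, phi_next) tuples."""
    A, b = np.zeros((n, n)), np.zeros(n)
    for M, rho, phi, r, phi_next in transitions:
        A += M * rho * np.outer(phi, phi - gamma * phi_next)  # accumulate A_t
        b += M * rho * r * phi                                # accumulate b_t
    return np.linalg.solve(A + reg * np.eye(n), b)            # theta = A^{-1} b
```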

SLIDE 34
Emphasis algorithm (Sutton, Mahmood & White 2015)

  • Derived from analysis of general bootstrapping relationships (Sutton, Mahmood, Precup & van Hasselt 2014)
  • Emphasis M_t ≥ 0 is a scalar signal
  • Defined from a new scalar followon trace F_t ≥ 0, with F_{−1} = 0:

F_t = ρ_{t−1}γ_tF_{t−1} + i(S_t)
M_t = λ_t i(S_t) + (1 − λ_t)F_t
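A sketch of computing the emphasis along a trajectory (my loop; constant λ and γ for simplicity):

```python
def emphasis_sequence(interests, rhos, lam=0.0, gamma=0.99):
    """interests[t] = i(S_t); rhos[t] = importance sampling ratio at step t."""
    F, rho_prev, Ms = 0.0, 1.0, []
    for i_t, rho_t in zip(interests, rhos):
        F = rho_prev * gamma * F + i_t          # followon trace (F_{-1} = 0)
        Ms.append(lam * i_t + (1.0 - lam) * F)  # emphasis M_t
        rho_prev = rho_t
    return Ms
```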

SLIDE 35

Off-policy implications

  • The emphasis weighting is stable under off-policy TD(λ) (like the on-policy weighting) (Sutton, Mahmood & White 2015)
  • It is the followon weighting, from the interest-weighted behavior distribution (d_µ(s)i(s)), under the target policy
  • Learning is convergent (though not necessarily of finite variance) under the emphasis weighting, for arbitrary target and behavior policies (with coverage) (Yu 2015)
  • There are error bounds analogous to those for on-policy TD(λ) (Munos)
  • Emphatic TD is the simplest convergent off-policy TD algorithm (one parameter, one learning rate)