Chapter 11: Off-policy Methods with Approximation

Recall that off-policy learning involves two policies:
- One policy whose value function we are learning: the target policy π
- Another policy that is used to select actions: the behavior policy µ
The parameter vector can diverge (elements of θ may increase to ±∞) because we are no longer updating according to the on-policy distribution.
[Figure: Baird's counterexample. The value of each of the six upper states is parameterized as 2θ_i + θ_8 (i = 1, ..., 6) and the seventh state as θ_7 + 2θ_8; transitions occur with probabilities 99% and 1%. The behavior policy takes the dashed action with probability 6/7 and the solid action with probability 1/7, while the target policy always takes the solid action: µ(dashed|·) = 6/7, µ(solid|·) = 1/7, π(solid|·) = 1. The plot shows the components θ_1–θ_6, θ_7, θ_8 of the parameter vector at the end of each episode, diverging over episodes.]
Divergence occurs under semi-gradient TD updating (and similarly for DP).
The deadly triad: the risk of divergence arises when we combine all three of
- function approximation,
- bootstrapping, as in dynamic programming and temporal-difference learning, and
- off-policy learning, as in Q-learning, where we learn about the greedy policy from data with a necessarily more exploratory policy.
(Why is dynamic programming off-policy?)
Any two of the three are OK; only the full combination is deadly.
A simpler illustration: a transition from a state with feature φ = 1 to a state with feature φ' = 2, with r = 0 and γ = 1.
TD update:   δ = r + γθ⊤φ' − θ⊤φ = 0 + 2θ − θ = θ,    Δθ = αδφ = αθ
TD fixpoint: θ* = 0
If this transition is updated repeatedly, as can happen under off-policy updating, θ moves away from the fixpoint and grows without bound.
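A minimal numeric sketch of this arithmetic (the step size and initial θ below are illustrative assumptions, not from the slides): each update multiplies θ by (1 + α), so any nonzero θ diverges while the fixpoint θ* = 0 is never approached.

```python
# Repeated semi-gradient TD(0) updates on the single transition above
# (phi = 1 -> phi' = 2, r = 0, gamma = 1). alpha and theta are illustrative.
alpha, gamma = 0.1, 1.0
phi, phi_next, r = 1.0, 2.0, 0.0
theta = 1.0  # any nonzero starting value

for step in range(50):
    delta = r + gamma * theta * phi_next - theta * phi   # = theta
    theta += alpha * delta * phi                          # theta <- (1 + alpha) * theta

print(theta)  # grows without bound; the TD fixpoint theta* = 0 is never reached
```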
[Figure: the geometry of value-function space. A plane represents the subspace of all value functions representable as v_θ (coordinates θ_1, θ_2), embedded in the space of all value functions, which contains the true value function v_π. Applying the Bellman operator B_π to v_θ leaves the subspace; projecting back gives ΠB_π v_θ. The Bellman error (BE) is the distance from v_θ to B_π v_θ, and the projected Bellman error (PBE) is its component within the subspace. Distinguished points in the subspace: Πv_π = v_θ*, which minimizes the value error (VE); the point achieving min ‖BE‖; and the TD fixpoint, where PBE = 0.]
Definitions. The Bellman operator for the target policy π:
(B_π v)(s) ≐ Σ_{a∈A} π(s,a) [ r(s,a) + γ Σ_{s'∈S} p(s'|s,a) v(s') ]
and v_θ ≐ v̂(·,θ), viewed as a giant vector ∈ ℝ^|S|. VE denotes the Value Error, BE the Bellman Error, and PBE the Projected Bellman Error.
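For concreteness, a small sketch of applying the Bellman operator to a tabular value vector (the array names and shapes are illustrative assumptions, not from the slides):

```python
import numpy as np

def bellman_operator(v, pi, r, P, gamma):
    """Apply (B_pi v)(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ].

    v:  (S,)      value vector
    pi: (S, A)    target-policy probabilities pi(a|s)
    r:  (S, A)    expected rewards r(s, a)
    P:  (S, A, S) transition probabilities p(s'|s, a)
    """
    q = r + gamma * (P @ v)        # (S, A): one-step lookahead for each (s, a)
    return (pi * q).sum(axis=1)    # (S,):   average over actions under pi
```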
Bootstrapping harms the asymptotic performance of approximate methods. The degree of bootstrapping is controlled by the λ parameter, from λ=0 (full bootstrapping) to λ=1 (no bootstrapping). Yet empirical results suggest that λ=1 (no bootstrapping) is a very poor choice.
[Figure: performance as a function of λ on several tasks, from pure bootstrapping (λ=0) to no bootstrapping (λ=1). In all cases lower is better; the red points are the λ=1 cases.]
We need bootstrapping!
The TD(0) update is not the gradient of any objective function J(θ) (Etienne Barnard, 1993):
δ = r + γθ⊤φ' − θ⊤φ,    Δθ = αδφ
Assume there is a J such that ∂J/∂θ_i = δφ_i. Then look at the second derivatives:
∂²J/∂θ_j∂θ_i = ∂(δφ_i)/∂θ_j = (γφ'_j − φ_j)φ_i
∂²J/∂θ_i∂θ_j = ∂(δφ_j)/∂θ_i = (γφ'_i − φ_i)φ_j
Real second derivatives must be symmetric: ∂²J/∂θ_j∂θ_i = ∂²J/∂θ_i∂θ_j. But in general these two expressions are not equal. Contradiction!
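A quick numeric check of the asymmetry (with illustrative feature vectors, not from the slides): take φ = (1, 0)⊤, φ' = (0, 1)⊤, γ = 0.9. Then
∂²J/∂θ_2∂θ_1 = (γφ'_2 − φ_2)φ_1 = (0.9 − 0)·1 = 0.9
∂²J/∂θ_1∂θ_2 = (γφ'_1 − φ_1)φ_2 = (0 − 1)·0 = 0
so the mixed partials disagree and no such J can exist.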
[Figure: a small episodic MRP with states A and B. From A, the process goes to B with probability 50% or terminates with probability 50%, both with reward 0; from B it terminates with reward 1 (100%).]
True values: V(A) = 0.5, V(B) = 1.
Minimizing the squared TD error J(θ) = E[δ²] instead gives V(A) = 1/3, V(B) = 2/3.
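A short check of these numbers, assuming the transition structure read off the figure above. The three transition types (A to B, A to termination, B to termination) occur equally often, so
J = E[δ²] ∝ (V(B) − V(A))² + (0 − V(A))² + (1 − V(B))²
Setting the partial derivatives to zero gives V(A) = V(B)/2 and 2V(B) − V(A) = 1, hence V(B) = 2/3 and V(A) = 1/3, rather than the true values V(B) = 1, V(A) = 0.5.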
[Figure: two small episodic MDPs whose states are all represented by the same feature weight θ_1.] These two have different Value Errors, but the same Return Errors (both errors have the same minima).
J_RE(θ) = J_VE(θ) + E[ (G_t − v_π(S_t))² ]
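A one-line justification of this identity, assuming returns G_t are generated according to the target policy so that E[G_t | S_t] = v_π(S_t): expanding
E[(G_t − v_θ(S_t))²] = E[(v_π(S_t) − v_θ(S_t))²] + E[(G_t − v_π(S_t))²] + 2E[(v_π(S_t) − v_θ(S_t))(G_t − v_π(S_t))]
the cross term vanishes by conditioning on S_t, and the remaining noise term does not depend on θ, so J_RE and J_VE differ by a constant and share the same minima.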
These two MDPs have different Bellman Errors, but the same Projected Bellman Errors (and the errors have different minima).
[Figure: MDP1 and MDP2 generate the same data distribution P_µ(ξ) over trajectories ξ, yet their Bellman errors BE1 and BE2 are minimized at different parameters θ*_1, ..., θ*_4. Likewise their Value Errors VE1 and VE2 differ, while the Return Error RE is determined by the data distribution alone.]
[Figure: V_θ, TV_θ, and ΠTV_θ in value-function space; RMSBE is the distance from V_θ to TV_θ, and RMSPBE the distance from V_θ to ΠTV_θ.]
Let Φ be the matrix of the feature vectors for all states (one row per state), D the diagonal matrix of the state distribution, T the Bellman operator, and
Π = Φ(Φ⊤DΦ)⁻¹Φ⊤D
the projection onto the representable subspace. Two useful identities:
Φ⊤D(TV_θ − V_θ) = E[δφ]        Φ⊤DΦ = E[φφ⊤]
Then
MSPBE(θ) = ‖V_θ − ΠTV_θ‖²_D
         = ‖Π(V_θ − TV_θ)‖²_D
         = (Π(V_θ − TV_θ))⊤ D (Π(V_θ − TV_θ))
         = (V_θ − TV_θ)⊤ Π⊤DΠ (V_θ − TV_θ)
         = (V_θ − TV_θ)⊤ DΦ(Φ⊤DΦ)⁻¹Φ⊤D (V_θ − TV_θ)
         = (Φ⊤D(TV_θ − V_θ))⊤ (Φ⊤DΦ)⁻¹ Φ⊤D(TV_θ − V_θ)
         = E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ]
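A small sketch of this computation for a known MRP (names and shapes below are illustrative assumptions): form E[δφ] and E[φφ⊤] from the feature matrix Φ, the state distribution d, and the Bellman operator, then combine them as in the last line above.

```python
import numpy as np

def mspbe(theta, Phi, d, P_pi, r_pi, gamma):
    """MSPBE(theta) = E[delta*phi]^T E[phi phi^T]^{-1} E[delta*phi].

    Phi:  (S, n) feature matrix (one row per state)
    d:    (S,)   state distribution (diagonal of D)
    P_pi: (S, S) state-to-state transition matrix under the target policy
    r_pi: (S,)   expected one-step reward under the target policy
    """
    v = Phi @ theta                      # V_theta
    Tv = r_pi + gamma * (P_pi @ v)       # T V_theta (Bellman operator)
    b = Phi.T @ (d * (Tv - v))           # Phi^T D (T V_theta - V_theta) = E[delta*phi]
    A = Phi.T @ (d[:, None] * Phi)       # Phi^T D Phi = E[phi phi^T]
    return b @ np.linalg.solve(A, b)
```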
Consider a transition from s to s' with reward r and feature vectors φ = φ(s), φ' = φ(s').
This is the trick: w ∈ ℝⁿ is a second set of weights.
Δθ = −½ α ∇_θ J(θ) = −½ α ∇_θ ‖V_θ − ΠTV_θ‖²_D
   = −½ α ∇_θ ( E[δφ]⊤ E[φφ⊤]⁻¹ E[δφ] )
   = −α (∇_θ E[δφ])⊤ E[φφ⊤]⁻¹ E[δφ]
   = −α E[ φ (γφ' − φ)⊤ ]⊤ E[φφ⊤]⁻¹ E[δφ]
   = α ( E[φφ⊤] − γ E[φ'φ⊤] ) E[φφ⊤]⁻¹ E[δφ]
   = α E[δφ] − αγ E[φ'φ⊤] E[φφ⊤]⁻¹ E[δφ]
   ≈ α E[δφ] − αγ E[φ'φ⊤] w            (w estimates E[φφ⊤]⁻¹ E[δφ])
   ≈ α δφ − αγ φ' (φ⊤w)                (sampling)
TDC (TD(0) with gradient correction):
θ ← θ + αδφ − αγφ'(φ⊤w)
w ← w + β(δ − φ⊤w)φ
δ = r + γθ⊤φ' − θ⊤φ
for a transition s → s' with reward r and features φ, φ'. Here φ⊤w is an estimate of the TD error (δ) for the current state (φ). Step-size conditions: α, β → 0 with α/β → 0, or α = β → 0.
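A minimal per-transition sketch of the TDC update above (the function name and arguments are illustrative; importance-sampling corrections for the off-policy case are omitted, as in the update as written):

```python
import numpy as np

def tdc_update(theta, w, phi, r, phi_next, gamma, alpha, beta):
    """One TDC (TD(0) with gradient correction) update for a transition
    with features phi -> phi_next and reward r."""
    delta = r + gamma * (theta @ phi_next) - (theta @ phi)
    theta = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ w)
    w = w + beta * (delta - phi @ w) * phi
    return theta, w
```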
! "! #! $! %! &!! &"! &#! &$! &%! "!! ! " # $ % &! '()*+, )-../0 123 234 123!"
! "!!! #!!! $!!! %!!! &!!! "!
!"!
"!
!&
"!
!
"!
&
"!
"!
'()(*+,+)-.!/01 23++45 ! !
"!
67!
&
Computer Go as an experimental testbed: learning the value function (probability of winning) for 9x9 Go from self play, with features each corresponding to a template on a part of the Go board.
[Figure: RNEU (the root norm of the expected TD update, E[Δθ_TD]) as a function of the step size, from .000001 to .001, for TD, GTD, GTD2, and TDC.]
Summary of algorithms versus issues:
Algorithms: TD(λ), Sarsa(λ); Approx. DP; LSTD(λ), LSPE(λ); Fitted-Q; Residual gradient; GDP; GTD(λ), GQ(λ)
Issues: linear computation; nonlinear convergent; off-policy convergent; model-free; converges to PBE = 0
In conclusion: more work is needed.

Rupam Mahmood, Huizhen (Janey) Yu, Martha White, Rich Sutton
Reinforcement Learning and Artificial Intelligence Laboratory, Department of Computing Science, University of Alberta, Canada
The idea of emphasis in on-policy and off-policy TD learning:
- With function approximation we may care about some states more than others in learning (i.e., when the optimal solution can't be approached exactly), for example about the value of the starting state.
- In Monte Carlo methods, states (or time steps) are estimated independently, and their importances can be assigned independently.
- With TD methods it is different because of bootstrapping: a state's value is estimated based on the estimated values of later states; if the state is important, then it becomes important to accurately value the later states even if they are not important on their own.
- The resulting degree of importance, the emphasis, is therefore assigned to each time step.
Emphatic TD: setting. The data stream
· · · φ(S_t), A_t, R_{t+1}, φ(S_{t+1}), A_{t+1}, R_{t+2}, · · ·
is generated by the behavior policy µ, while we learn about the target policy π.
Feature function φ : S → ℝⁿ, interest function i : S → ℝ⁺, parameter vector θ ∈ ℝⁿ.
The state distribution under the behavior policy is
d_µ(s) = lim_{t→∞} Pr{ S_t = s }
and the objective is
MSE(θ) = Σ_{s∈S} d_µ(s) i(s) ( v_π(s) − θ⊤φ(s) )²
where v_π is the true value function and ⊤ denotes transpose (θ⊤φ is an inner product).
The update is
θ_{t+1} = θ_t + α M_t ρ_t ( R_{t+1} + γθ_t⊤φ_{t+1} − θ_t⊤φ_t ) φ_t,    φ_t = φ(S_t)
where M_t > 0 is the emphasis and
ρ_t = π(A_t|S_t) / µ(A_t|S_t),    E[ρ_t] = 1
is the importance sampling ratio.
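A per-step sketch of this update (the emphasis M_t and the ratio ρ_t are supplied as inputs here; names are illustrative):

```python
import numpy as np

def emphatic_td_step(theta, phi, R, phi_next, rho, M, alpha, gamma):
    """theta_{t+1} = theta_t + alpha * M_t * rho_t * delta_t * phi_t."""
    delta = R + gamma * (theta @ phi_next) - (theta @ phi)
    return theta + alpha * M * rho * delta * phi
```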
In batch form, the corresponding least-squares solution is
θ_{t+1} = A_t⁻¹ b_t,    b_t = Σ_{k=0}^t M_k ρ_k R_{k+1} φ_k,    A_t = Σ_{k=0}^t M_k ρ_k φ_k (φ_k − γφ_{k+1})⊤
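A sketch of maintaining this least-squares solution incrementally, assuming the LSTD-style form of A_t written above (class and method names are illustrative):

```python
import numpy as np

class EmphaticLSTD:
    """Accumulate A_t and b_t as above and solve theta = A_t^{-1} b_t."""
    def __init__(self, n, gamma):
        self.A = np.zeros((n, n))
        self.b = np.zeros(n)
        self.gamma = gamma

    def update(self, phi, R, phi_next, rho, M):
        self.A += M * rho * np.outer(phi, phi - self.gamma * phi_next)
        self.b += M * rho * R * phi

    def theta(self):
        # Pseudo-inverse guards against A being singular early in learning
        return np.linalg.pinv(self.A) @ self.b
```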
The emphasis M_t is determined recursively from the interest (Sutton, Mahmood, Precup & van Hasselt 2014; Sutton, Mahmood & White 2015):
F_t = ρ_{t−1} γ_t F_{t−1} + i(S_t),    F_{−1} = 0,  F_t ≥ 0
M_t = λ_t i(S_t) + (1 − λ_t) F_t,      M_t ≥ 0
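The recursions translate directly into code (a minimal sketch; the state-dependent γ_t and λ_t are passed in as scalars here):

```python
def emphasis_step(F_prev, rho_prev, gamma_t, i_t, lambda_t):
    """Follow-on trace F_t and emphasis M_t, per the recursions above."""
    F = rho_prev * gamma_t * F_prev + i_t
    M = lambda_t * i_t + (1 - lambda_t) * F
    return F, M
```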
Key results:
- Emphatic TD(λ) is stable under the emphasis weighting, which plays the role of the on-policy weighting: it starts from the distribution d_µ(s)i(s) and follows states onward under the target policy (Sutton, Mahmood & White 2015).
- It converges under the emphasis weighting for arbitrary target and behavior policies (with coverage) (Yu 2015).
- It remains as simple as TD(λ) (one parameter, one learning rate).