MVA-RL Course
Approximate Dynamic Programming
(a.k.a. Batch Reinforcement Learning)
A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
Dec 2nd, 2014
◮ Dynamic programming algorithms require an explicit definition of
  ◮ the transition probabilities p(·|x, a)
  ◮ the reward function r(x, a)
◮ This knowledge is often unavailable (e.g., wind intensity, ...)
◮ Can we rely on samples?
◮ Dynamic programming algorithms require an exact representation of value functions and policies
◮ This is often impossible since their shape is too “complicated”
◮ Can we use approximations?
◮ Compute

Qk+1(x, a) = T Qk(x, a) = r(x, a) + γ Σ_y p(y|x, a) max_b Qk(y, b)

◮ Return the greedy policy

πK(x) ∈ arg max_{a∈A} QK(x, a)

◮ Problem: how can we approximate T Qk?
◮ Problem: if Qk+1 ≠ T Qk, does (approx.) value iteration still work?
◮ Linear approximation space (for action-value functions)

F := { fα(x, a) = Σ_{j=1}^d αj ϕj(x, a), α ∈ R^d }
Linear Fitted Q-iteration (LinearFQI)

Input: space F, iterations K, sampling distribution ρ, num of samples n
Initial function Q0 ∈ F
For k = 1, . . . , K
  ◮ Draw n samples (xi, ai) i.i.d. ∼ ρ
  ◮ Sample x′i ∼ p(·|xi, ai) and ri = r(xi, ai)
  ◮ Compute yi = ri + γ max_a Qk−1(x′i, a)
  ◮ Build the training set {((xi, ai), yi)}_{i=1}^n
  ◮ Solve the least-squares problem

    f̂αk = arg min_{fα∈F} (1/n) Σ_{i=1}^n (fα(xi, ai) − yi)²

  ◮ Qk = f̂αk (truncation may be needed)
Return πK(·) = arg max_a QK(·, a) (greedy policy)
Sampling step: draw n samples (xi, ai) i.i.d. ∼ ρ, then sample x′i ∼ p(·|xi, ai) and ri = r(xi, ai)
◮ In practice this can be done once, before running the algorithm
◮ The sampling distribution ρ should cover the state-action space in all relevant regions
◮ If it is not possible to choose ρ, a database of samples can be used
Regression targets: yi = ri + γ max_a Qk−1(x′i, a), with training set {((xi, ai), yi)}_{i=1}^n
◮ Each yi is an unbiased estimate of T Qk−1(xi, ai), since

E[yi|xi, ai] = E[ri + γ max_a Qk−1(x′i, a)] = r(xi, ai) + γ E[max_a Qk−1(x′i, a)]
            = r(xi, ai) + γ Σ_y p(y|xi, ai) max_a Qk−1(y, a) = T Qk−1(xi, ai)

◮ The problem “reduces” to standard regression
◮ The regression should be recomputed at each iteration
Least-squares step:

f̂αk = arg min_{fα∈F} (1/n) Σ_{i=1}^n (fα(xi, ai) − yi)²,   Qk = f̂αk (truncation may be needed)

◮ Thanks to the linear space we can solve it in closed form
◮ Build the feature matrix Φ = [ϕ(x1, a1)⊤; . . . ; ϕ(xn, an)⊤] ∈ R^{n×d}
◮ Compute α̂k = (Φ⊤Φ)−1Φ⊤y (least-squares solution)
◮ Truncate the solution to [−Vmax, Vmax] (with Vmax = Rmax/(1 − γ))
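To make the steps concrete, here is a minimal sketch of LinearFQI under the assumptions above. The names `phi`, `sample_transition`, `reward` and `sample_rho` are hypothetical stand-ins for the feature map ϕ, the generative model p(·|x, a), the reward r(x, a) and the sampling distribution ρ; none of them come from the slides.

```python
import numpy as np

def linear_fqi(phi, sample_transition, reward, sample_rho, actions, d,
               K=50, n=1000, gamma=0.95, r_max=1.0):
    """A sketch of linear fitted Q-iteration (LinearFQI).

    phi(x, a)               -> np.ndarray of shape (d,)  (feature map, assumed given)
    sample_transition(x, a) -> next state x'              (generative model, assumed given)
    reward(x, a)            -> float
    sample_rho()            -> (x, a) drawn from rho
    """
    v_max = r_max / (1.0 - gamma)

    def q_value(alpha, x, a):
        # Q(x, a) = phi(x, a)^T alpha, truncated to [-Vmax, Vmax]
        return float(np.clip(phi(x, a) @ alpha, -v_max, v_max))

    # Sampling is done once, before the iterations (see the slides above)
    data = []
    for _ in range(n):
        x, a = sample_rho()
        data.append((x, a, reward(x, a), sample_transition(x, a)))

    alpha = np.zeros(d)                                  # Q_0 = 0 (an element of F)
    Phi = np.array([phi(x, a) for x, a, _, _ in data])   # n x d feature matrix

    for _ in range(K):
        # Regression targets y_i = r_i + gamma * max_b Q_{k-1}(x'_i, b)
        y = np.array([r + gamma * max(q_value(alpha, x_next, b) for b in actions)
                      for _, _, r, x_next in data])
        # alpha_k = (Phi^T Phi)^{-1} Phi^T y, via a numerically stable solver
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    def pi_K(x):  # greedy policy w.r.t. Q_K
        return max(actions, key=lambda a: q_value(alpha, x, a))
    return pi_K
```

In this sketch the truncation to [−Vmax, Vmax] is applied when the fitted function is evaluated; where exactly to truncate is a design choice the slides leave open.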
[Figure: error propagation in fitted value iteration — each iteration introduces an approximation error ǫk between T Qk−1 and Qk, and the errors accumulate into the final error between Q∗ and QπK.]
◮ Desired solution: the target Qk = T Qk−1
◮ Best solution (w.r.t. the sampling distribution ρ): fα∗k, with

α∗k = arg inf_{fα∈F} ||fα − Qk||ρ

◮ Returned solution: f̂αk, with

α̂k = arg min_{fα∈F} (1/n) Σ_{i=1}^n (fα(xi, ai) − yi)²
Theorem (per-iteration error)
With probability 1 − δ (w.r.t. the random samples), the solution Q̃k returned at iteration k satisfies

||Q̃k − Qk||ρ ≤ 4||Qk − fα∗k||ρ + O( (Vmax + L||α∗k||) √(log(1/δ)/n) ) + O( Vmax √((d log(n/δ))/n) )

Tools: concentration-of-measure inequalities, covering numbers, linear algebra, union bounds, special tricks for linear spaces, ...
The first term is the approximation error 4||Qk − fα∗k||ρ:
◮ No algorithm can do better
◮ Constant 4
◮ Depends on the space F
◮ Changes with the iteration k

The second term is an estimation error:
◮ Vanishing to zero as O(n−1/2)
◮ Depends on the features (L) and on the best solution (||α∗k||)

The third term is an estimation error as well:
◮ Vanishing to zero as O(n−1/2)
◮ Depends on the dimensionality of the space (d) and the range of the functions (Vmax)
From the per-iteration error to a bound on the final performance:
◮ Problem 1: the test norm µ is different from the sampling norm ρ
◮ Problem 2: we have a bound on the prediction error Q̃k − Qk, not on the performance Q∗ − QπK
◮ Problem 3: we have bounds for one single iteration
Transition kernel for a fixed policy: Pπ.
◮ m-step (worst-case) concentration of the future state distribution:

c(m) = sup_{π1...πm} || d(µPπ1 · · · Pπm)/dρ ||∞ < ∞

◮ Average (discounted) concentration:

Cµ,ρ = (1 − γ)² Σ_{m≥1} m γ^{m−1} c(m) < +∞

Remark: relationship to the top-Lyapunov exponent

L+ = sup_{π1,π2,...} lim sup_{m→∞} (1/m) log+ ||ρPπ1Pπ2 · · · Pπm||

If L+ ≤ 0 (stable system), then c(m) has at most polynomial growth and Cµ,ρ is finite.
Proposition (error propagation)
Let ǫk = T Qk−1 − Qk denote the approximation error at iteration k. Then

||Q∗ − QπK||²µ ≤ [ 2γ/(1 − γ)² ]² Cµ,ρ max_k ||ǫk||²ρ + O( (γ^K/(1 − γ)³) V²max )
Combining the propagation proposition with the per-iteration error bound on ||Q̃k − Qk||ρ yields the final performance bound below.
Theorem (see e.g., Munos, ’03)
LinearFQI with a space F of d features, with n samples at each iteration, returns a policy πK after K iterations such that

||Q∗ − QπK||µ ≤ (2γ/(1 − γ)²) √Cµ,ρ ( 4 d(F, T F) + O( Vmax (L/√ω) √((d log(n/δ))/n) ) ) + O( (γ^{K/2}/(1 − γ)^{3/2}) Vmax )
◮ The bound depends on the concentrability coefficient Cµ,ρ ⇒ how do we choose the sampling distribution ρ?
The approximation error at each iteration satisfies (using Qk−1 = fαk−1 ∈ F):

||Qk − fα∗k||ρ = inf_{f∈F} ||Qk − f||ρ = inf_{f∈F} ||T Qk−1 − f||ρ
             ≤ sup_{g∈F} inf_{f∈F} ||T g − f||ρ = d(F, T F)

where d(F, T F) is the inherent Bellman error of the space F.
Question: how to design F to make it “compatible” with the Bellman operator T?
◮ The approximation error is bounded by the inherent Bellman error d(F, T F) ⇒ is it possible to avoid it?
◮ The last term decreases exponentially with the number of iterations ⇒ K ≈ log(1/ǫ)/(1 − γ) iterations are enough to reduce it below ǫ
◮ The estimation error depends on ω, the smallest eigenvalue of the Gram matrix of the features w.r.t. ρ ⇒ design the features so as to be orthogonal w.r.t. ρ
[Figure: summary — the final performance depends on the Markov decision process (range Vmax, concentrability Cµ,ρ), the approximation space (size d, features ω, distance d(F, T F)), the samples (number n, sampling distribution ρ), the approximation algorithm (per-iteration error Q̃k − Qk), and the error propagation through the dynamic programming process.]
Other regression schemes can be used in fitted Q-iteration:
◮ K-nearest neighbour
◮ Regularized linear regression with L1 or L2 regularisation
◮ Neural networks
◮ Support vector regression
◮ ...
Example: the optimal replacement problem (state x = accumulated wear; actions: replace (R) or keep (K))
◮ Costs: c(x, R) = C (replacement cost); c(x, K) = c(x) (maintenance plus extra costs)
◮ Dynamics: p(·|x, R) = exp(β), with density d(y) = β exp(−βy) I{y ≥ 0}; p(·|x, K) = x + exp(β), with density d(y − x)
Optimal value function:

V∗(x) = min{ c(x) + γ ∫_x^∞ d(y − x)V∗(y) dy,  C + γ ∫_0^∞ d(y)V∗(y) dy }

[Figure: the management cost c(x) as a function of the wear x, and the optimal value function with the regions where the optimal policy replaces (R) or keeps (K).]

Linear approximation space: F := { Vα(x) = Σ_{k=1}^d αk cos(kπ x/xmax) }
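As an illustration, a small sketch of this cosine feature space and of one projection step. The grid of states, d and xmax are arbitrary choices for the example, and `targets` stands in for the values {T V0(xn)} computed by the Bellman operator.

```python
import numpy as np

def cosine_features(x, d, x_max):
    """phi_k(x) = cos(k * pi * x / x_max), k = 1, ..., d."""
    k = np.arange(1, d + 1)
    return np.cos(k * np.pi * np.asarray(x, dtype=float)[..., None] / x_max)

x_max, d, N = 10.0, 20, 100
xs = np.linspace(0.0, x_max, N)        # states x_1, ..., x_N
Phi = cosine_features(xs, d, x_max)    # N x d feature matrix

# One fitted value-iteration step: given the target values {T V0(x_n)},
# V1 is the least-squares projection of the targets onto F.
targets = np.zeros(N)                  # placeholder for {T V0(x_n)}
alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
V1 = Phi @ alpha                       # approximation V1 in F
```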
Figure: Left: the target values computed as {T V0(xn)}1≤n≤N. Right: the approximation V1 ∈ F of the target function T V0.

Figure: Left: the target values computed as {T V1(xn)}1≤n≤N. Center: the approximation V2 ∈ F of T V1. Right: the approximation Vn ∈ F after n iterations.
Policy iteration:
◮ Policy evaluation: given πk, compute Vk = V πk
◮ Policy improvement: compute the greedy policy

πk+1(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V πk(y) ]

◮ Problem: how can we approximate V πk?
◮ Problem: if Vk ≠ V πk, does (approx.) policy iteration still work?
Problem: the algorithm is no longer guaranteed to converge.

[Figure: the performance loss V∗ − V πk oscillates within an asymptotic error region.]

Proposition
The asymptotic performance of the policies πk generated by the API algorithm is related to the approximation error as:

lim sup_{k→∞} ||V∗ − V πk||∞ ≤ (2γ/(1 − γ)²) lim sup_{k→∞} ||Vk − V πk||∞
◮ Linear space to approximate value functions*

F := { fα(x) = Σ_{j=1}^d αj ϕj(x), α ∈ R^d }

◮ Least-Squares Temporal Difference (LSTD) algorithm for policy evaluation

*In practice we use approximations of action-value functions.
◮ V π may not belong to F
◮ The best approximation of V π in F is

ΠV π = arg min_{f∈F} ||V π − f||

(Π is the projection onto F)

[Figure: V π and its projection ΠV π onto the space F.]
◮ V π is the fixed point of T π:

V π = T πV π = rπ + γPπV π

◮ LSTD searches for the fixed point of Π2,ρT π, where

Π2,ρ g = arg min_{f∈F} ||g − f||2,ρ

◮ When the fixed point of ΠρT π exists, we call it the LSTD solution:

VTD = ΠρT πVTD

[Figure: T π maps VTD outside of F; projecting back with Πρ returns VTD itself — the fixed point of ΠρT π.]
◮ At the fixed point, the Bellman residual T πVTD − VTD is orthogonal (in expectation w.r.t. ρ) to the space F spanned by the features ϕ1, . . . , ϕd:

⟨T πVTD − VTD, ϕi⟩ρ = 0,  i = 1, . . . , d

◮ By definition of the Bellman operator:

⟨rπ + γPπVTD − VTD, ϕi⟩ρ = 0
⟨rπ, ϕi⟩ρ − ⟨(I − γPπ)VTD, ϕi⟩ρ = 0

◮ Since VTD ∈ F, there exists αTD such that VTD(x) = φ(x)⊤αTD:

⟨rπ, ϕi⟩ρ − Σ_{j=1}^d ⟨(I − γPπ)ϕj, ϕi⟩ρ αTD,j = 0

◮ That is, a linear system of order d: A αTD = b, with [A]i,j = ⟨(I − γPπ)ϕj, ϕi⟩ρ and [b]i = ⟨rπ, ϕi⟩ρ
◮ Problem: In general, ΠρT π is not a contraction and does not have a fixed point.
◮ Solution: If ρ = ρπ (the stationary distribution of π), then ΠρπT π has a unique fixed point.
◮ Problem: In general, ΠρT π cannot be computed (because rπ and Pπ are unknown).
◮ Solution: Use samples coming from a “trajectory” of π.
Least-Squares Policy Iteration (LSPI, with LSTD evaluation)

Input: space F, iterations K, sampling distribution ρ, num of samples n
Initial policy π0
For k = 1, . . . , K
  ◮ Generate a trajectory of length n following πk:
    (x1, πk(x1), r1, x2, πk(x2), r2, . . . , xn−1, πk(xn−1), rn−1, xn)
  ◮ Build the matrix Ãk and the vector b̃k:

    [Ãk]i,j = (1/n) Σ_{t=1}^{n−1} ϕi(xt)(ϕj(xt) − γϕj(xt+1)) ≈ ⟨(I − γPπk)ϕj, ϕi⟩ρπk
    [b̃k]i = (1/n) Σ_{t=1}^{n−1} ϕi(xt) rt ≈ ⟨rπk, ϕi⟩ρπk

  ◮ Compute αk = (Ãk)−1 b̃k and set Vk = fαk
  ◮ Compute the greedy policy πk+1 w.r.t. Vk (policy improvement)
Return the last policy πK
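A minimal sketch of the LSTD step from a single trajectory; `features` is a hypothetical stand-in for the feature map ϕ, and `transitions` holds the (xt, rt, xt+1) triples collected by following πk.

```python
import numpy as np

def lstd(transitions, features, d, gamma=0.95):
    """LSTD estimate from one trajectory of the evaluated policy.

    transitions: list of (x_t, r_t, x_{t+1})
    features(x) -> np.ndarray of shape (d,)
    Returns alpha such that V_TD(x) = features(x) @ alpha.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    n = len(transitions)
    for x, r, x_next in transitions:
        phi, phi_next = features(x), features(x_next)
        # [A]_{i,j} += (1/n) * phi_i(x_t) * (phi_j(x_t) - gamma * phi_j(x_{t+1}))
        A += np.outer(phi, phi - gamma * phi_next) / n
        # [b]_i += (1/n) * phi_i(x_t) * r_t
        b += phi * r / n
    # alpha = A^{-1} b (np.linalg.solve is preferable to an explicit inverse)
    return np.linalg.solve(A, b)
```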
About the trajectory (x1, πk(x1), r1, . . . , xn−1, πk(xn−1), rn−1, xn):
◮ The first few samples may be discarded because they are not actually drawn from the stationary distribution ρπk
◮ Off-policy samples could be used with importance weighting
◮ In practice, i.i.d. states drawn from an arbitrary distribution (but with actions from πk) may be used
About the improvement step: computing the greedy policy from Vk is difficult (it requires the transition model), so move to LSTD-Q (the action-value variant) and compute

πk+1(x) = arg max_a Qk(x, a)
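With the action-value variant the improvement step is model-free; a sketch, where `features_sa` is a hypothetical feature map over state-action pairs:

```python
def greedy_policy(x, actions, features_sa, alpha):
    # pi_{k+1}(x) = arg max_a Q_k(x, a), with Q_k(x, a) = features_sa(x, a) @ alpha
    return max(actions, key=lambda a: float(features_sa(x, a) @ alpha))
```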
◮ Problem: this process may be unstable because πk does not cover the state space properly
Proposition (LSTD Performance)
If ρπ is the stationary distribution of π, then

||V π − VTD||ρπ ≤ (1/√(1 − γ²)) inf_{V∈F} ||V π − V||ρπ
Assumption: The Markov chain induced by the policy πk has a stationary distribution ρπk and it is ergodic and β-mixing.

Theorem (LSTD Error Bound)
At any iteration k, if LSTD uses n samples obtained from a single trajectory of πk and a d-dimensional space, then with probability 1 − δ,

||V πk − Vk||ρπk ≤ c inf_{f∈F} ||V πk − f||ρπk + O( √( (d log(d/δ)) / (n ν) ) )
||V π − V̂||ρπ ≤ c inf_{f∈F} ||V π − f||ρπ + O( √( (d log(d/δ)) / (n ν) ) )

◮ Approximation error: it depends on how well the function space F can approximate the value function V π
◮ Estimation error: it depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix ν, and the mixing properties of the Markov chain (hidden in the O(·))
||V πk − Vk||ρπk ≤ c inf_{f∈F} ||V πk − f||ρπk + O( √( (d log(d/δ)) / (n νk) ) )

◮ n = number of samples, d = dimensionality of F
◮ νk = the smallest eigenvalue of the Gram matrix ( ∫ ϕiϕj dρπk )i,j
(Assumption: the eigenvalues of the Gram matrix are strictly positive — existence of the model-based LSTD solution)
◮ The β-mixing coefficients are hidden in the O(·) notation
Theorem (LSPI Error Bound)
If LSPI is run over K iterations, then the performance loss of the policy πK is

||V∗ − V πK||µ ≤ (4γ/(1 − γ)²) √Cµ,ρ ( c E0(F) + O( √( (d log(d/δ)) / (n νρ) ) ) ) + O( γ^K ||V∗ − V π0||∞ )
◮ Approximation error: E0(F) = supπ∈G(F) inff∈F ||V π − f||ρπ
◮ Estimation error: depends on n, d, νρ, K
◮ Initialization error: error due to the choice of the initial value function or initial policy, |V∗ − V π0|
Assumption (on the sampling distribution): there exists a distribution ρ such that for any policy π ∈ G(F), ρ ≤ Cρπ, where C < ∞ is a constant and ρπ is the stationary distribution of π. Furthermore, we can define the concentrability coefficient Cµ,ρ as before.
◮ νρ = the smallest eigenvalue of the Gram matrix ( ∫ ϕiϕj dρ )i,j
Bellman Residual Minimization (BRM): instead of projecting, directly minimize the Bellman residual

VBR = arg min_{V∈F} ||T πV − V||2,µ

[Figure: V π, the space F, the best approximation arg min_{V∈F} ||V π − V||, and the Bellman residual T πVBR − VBR.]

◮ The mapping α → ||T πVα − Vα||²2,µ is quadratic in α, so the minimizer is obtained by solving a linear system of order d
Proposition

||V π − VBR|| ≤ (1 + γ||Pπ||) ||(I − γPπ)−1|| inf_{V∈F} ||V π − V||.

If µπ is the stationary distribution of π, then ||Pπ||µπ = 1 and ||(I − γPπ)−1||µπ ≤ 1/(1 − γ), thus

||V π − VBR||µπ ≤ (1 + γ)/(1 − γ) inf_{V∈F} ||V π − V||µπ.
Implementation with a generative model:
◮ Draw n states Xt ∼ µ
◮ Call the generative model on (Xt, At) (with At = π(Xt)) and obtain Rt and Yt ∼ p(·|Xt, At)
◮ Compute the empirical Bellman residual

B̂(V) = (1/n) Σ_{t=1}^n [ Rt + γV(Yt) − V(Xt) ]²

◮ Problem: this estimate is biased, since

E[B̂(V)] = ||T πV − V||²µ + γ² E[ Var(V(Yt) | Xt, At) ]
◮ Solution (double sampling): draw two independent next states Y′t, Y′′t ∼ p(·|Xt, At) and compute

B̃(V) = (1/n) Σ_{t=1}^n [ Rt + γV(Y′t) − V(Xt) ] [ Rt + γV(Y′′t) − V(Xt) ]

which is an unbiased estimate of ||T πV − V||²µ.
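A sketch of the unbiased empirical residual with the double-sampling correction; `generative_model` (returning a reward and one next state) and `features` are hypothetical helpers, and the reward is assumed deterministic as in the slides (ri = r(xi, ai)).

```python
import numpy as np

def bellman_residual(alpha, states, policy, generative_model, features, gamma=0.95):
    """Unbiased estimate of ||T^pi V - V||^2_mu via double sampling.

    generative_model(x, a) -> (reward, next_state); called twice per state
    to obtain two independent next states Y', Y''.
    features(x) -> np.ndarray; V(x) = features(x) @ alpha.
    """
    total = 0.0
    for x in states:
        a = policy(x)
        r, y1 = generative_model(x, a)   # first independent next state
        _, y2 = generative_model(x, a)   # second independent next state
        v_x = features(x) @ alpha
        d1 = r + gamma * (features(y1) @ alpha) - v_x
        d2 = r + gamma * (features(y2) @ alpha) - v_x
        # With deterministic rewards, E[d1 * d2 | x] = (T^pi V(x) - V(x))^2
        total += d1 * d2
    return total / len(states)
```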
Proof (of the Proposition). For any V,

V π − V = V π − T πV + T πV − V = γPπ(V π − V) + T πV − V
(I − γPπ)(V π − V) = T πV − V.

Taking the norm on both sides we obtain

||V π − VBR|| ≤ ||(I − γPπ)−1|| ||T πVBR − VBR||

and

||T πVBR − VBR|| = inf_{V∈F} ||T πV − V|| ≤ (1 + γ||Pπ||) inf_{V∈F} ||V π − V||.

The matrix (I − γPπ)−1 can be written as the power series Σ_t γ^t (Pπ)^t. Applying the norm we obtain

||(I − γPπ)−1||µπ ≤ Σ_t γ^t ||Pπ||^t_{µπ} ≤ 1/(1 − γ),

since ||Pπ||µπ = 1 when µπ is the stationary distribution of π.
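The power-series identity is easy to check numerically; a quick sketch with an arbitrary row-stochastic matrix standing in for Pπ:

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)   # a row-stochastic matrix standing in for P^pi
gamma = 0.9

inverse = np.linalg.inv(np.eye(5) - gamma * P)
series = sum(gamma**t * np.linalg.matrix_power(P, t) for t in range(500))
assert np.allclose(inverse, series)  # (I - gamma P)^{-1} = sum_t gamma^t P^t
```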
LSTD vs BRM:
◮ Different assumptions: BRM requires a generative model (for double sampling), while LSTD works with a single trajectory of π
◮ The performance is evaluated differently: BRM w.r.t. any sampling distribution µ, LSTD w.r.t. the stationary distribution ρπ
Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr