MVA-RL Course
Sample Complexity of ADP Algorithms
A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
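Sources of Error
◮ Approximation error. If X is large or continuous, value functions V cannot be represented exactly.
◮ Estimation error. If the reward r and dynamics p are unknown, they must be estimated from samples.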
Dec 3rd, 2013 - 2/82
◮ Infinite horizon setting with discount γ
◮ Study the impact of estimation error
◮ General guarantees
◮ Rates of convergence (not always available in asymptotic analysis)
◮ Explicit dependency on the design parameters
◮ Explicit dependency on the problem parameters
◮ First guess on how to tune parameters
◮ Better understanding of the algorithms
Sample Complexity of LSTD
Sample Complexity of LSTD The Algorithm
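◮ Linear function space $\mathcal{F} = \{ f_\alpha : f_\alpha(\cdot) = \sum_{j=1}^{d} \alpha_j \varphi_j(\cdot) \}$
◮ $V^\pi$ may not belong to $\mathcal{F}$
◮ Best approximation of $V^\pi$ in $\mathcal{F}$ is
$\Pi V^\pi = \arg\min_{f \in \mathcal{F}} \|V^\pi - f\|$ ($\Pi$ is the projection onto $\mathcal{F}$)
[Figure: the space $\mathcal{F}$, the value function $V^\pi$, the operator $T^\pi$, and the projection $\Pi V^\pi$]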
◮ LSTD searches for the fixed-point of $\Pi_? T^\pi$ instead ($\Pi_?$ is a projection onto $\mathcal{F}$ w.r.t. the $L_?$-norm)
◮ $\Pi_\infty T^\pi$ is a contraction in $L_\infty$-norm
◮ $L_\infty$-projection is numerically expensive when the number of states is large or infinite
◮ LSTD searches for the fixed-point of $\Pi_{2,\rho} T^\pi$, where
$\Pi_{2,\rho}\, g = \arg\min_{f \in \mathcal{F}} \|g - f\|_{2,\rho}$
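When the fixed-point of $\Pi_\rho T^\pi$ exists, we call it the LSTD solution
$V_{TD} = \Pi_\rho T^\pi V_{TD}$
[Figure: the space $\mathcal{F}$ with $V^\pi$, $\Pi_\rho V^\pi$, $T^\pi V_{TD}$ and $V_{TD} = \Pi_\rho T^\pi V_{TD}$]
$\langle T^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0, \quad i = 1, \dots, d$
$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0$
$\langle r^\pi, \varphi_i \rangle_\rho - \sum_{j=1}^{d} \langle \varphi_j - \gamma P^\pi \varphi_j, \varphi_i \rangle_\rho \cdot \alpha_{TD}^{(j)} = 0$
$\Longrightarrow A\, \alpha_{TD} = b$, with $A_{ij} = \langle \varphi_j - \gamma P^\pi \varphi_j, \varphi_i \rangle_\rho$ and $b_i = \langle r^\pi, \varphi_i \rangle_\rho$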
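The linear system above can be sketched numerically when the model is known. The following is a minimal illustration, assuming a finite state space so that the ρ-weighted inner product becomes a weighted sum; the function name `lstd_model_based`, the 3-state chain and the rewards are hypothetical and not taken from the slides.

```python
import numpy as np

def lstd_model_based(P, r, Phi, rho, gamma):
    """Solve A alpha = b with
    A_ij = <phi_j - gamma P phi_j, phi_i>_rho and b_i = <r, phi_i>_rho.

    P   : (S, S) transition matrix of the policy under evaluation
    r   : (S,)   reward vector r^pi
    Phi : (S, d) feature matrix, column j holds phi_j
    rho : (S,)   weights of the inner product
    """
    D = np.diag(rho)
    A = Phi.T @ D @ (Phi - gamma * P @ Phi)
    b = Phi.T @ D @ r
    return np.linalg.solve(A, b)

# Hypothetical 3-state chain (illustration only).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 1.0, 2.0])
gamma = 0.9
rho = np.full(3, 1.0 / 3.0)   # P is doubly stochastic, so uniform is stationary

# With tabular features, F contains V^pi and LSTD recovers it exactly.
alpha = lstd_model_based(P, r, np.eye(3), rho, gamma)
V_pi = np.linalg.solve(np.eye(3) - gamma * P, r)
print(alpha, V_pi)
```

With fewer features than states the solution is no longer $V^\pi$, but the Bellman residual $T^\pi V_{TD} - V_{TD}$ stays orthogonal to every feature under ρ, which is exactly the condition the system encodes.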
◮ In general, $\Pi_\rho T^\pi$ is not a contraction and does not have a fixed-point
◮ If $\rho = \rho^\pi$, the stationary distribution of $\pi$, then $\Pi_{\rho^\pi} T^\pi$ has a unique fixed-point
Proposition (LSTD Performance)
$\|V^\pi - V_{TD}\|_{\rho^\pi} \le \frac{1}{\sqrt{1-\gamma^2}} \inf_{V \in \mathcal{F}} \|V^\pi - V\|_{\rho^\pi}$
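◮ We observe a trajectory $(X_0, R_0, X_1, R_1, \dots, X_N)$ where $X_{t+1} \sim P(\cdot \mid X_t, \pi(X_t))$ and $R_t = r(X_t, \pi(X_t))$
◮ We build estimators of the matrix $A$ and the vector $b$
$\hat{A}_{ij} = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t) \left[ \varphi_j(X_t) - \gamma \varphi_j(X_{t+1}) \right], \qquad \hat{b}_i = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t) R_t$
◮ $\hat{A}\, \hat{\alpha}_{TD} = \hat{b}$, $\quad \hat{V}_{TD}(x) = \varphi(x)^\top \hat{\alpha}_{TD}$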
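A sample-based version of LSTD can be sketched as follows. This is an illustration under stated assumptions: it uses the standard empirical averages over $t = 0, \dots, N-1$ for the matrix and the vector, and the function name `lstd_from_trajectory`, the feature map and the simulated 3-state chain are hypothetical.

```python
import numpy as np

def lstd_from_trajectory(states, rewards, phi, gamma):
    """Build A_hat, b_hat from one trajectory and solve A_hat alpha = b_hat.

    states  : X_0, ..., X_N      (length N + 1)
    rewards : R_0, ..., R_{N-1}  (length N)
    phi     : hypothetical feature map, state -> vector in R^d
    """
    feats = np.array([phi(x) for x in states], dtype=float)
    cur, nxt = feats[:-1], feats[1:]          # phi(X_t) and phi(X_{t+1})
    n = len(rewards)
    A_hat = cur.T @ (cur - gamma * nxt) / n
    b_hat = cur.T @ np.asarray(rewards, dtype=float) / n
    return np.linalg.solve(A_hat, b_hat)

# Simulate a hypothetical 3-state chain (illustration only).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r = np.array([0.0, 1.0, 2.0])
gamma = 0.9
rng = np.random.default_rng(0)

states = [0]
for _ in range(5000):
    states.append(int(rng.choice(3, p=P[states[-1]])))
rewards = [r[x] for x in states[:-1]]

alpha_hat = lstd_from_trajectory(states, rewards, lambda x: np.eye(3)[x], gamma)
V_pi = np.linalg.solve(np.eye(3) - gamma * P, r)
print("LSTD estimate:", alpha_hat)
print("exact V^pi   :", V_pi)
```

The gap between the two printed vectors is pure estimation error here, since tabular features leave no approximation error.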
Sample Complexity of LSTD LSTD and LSPI Error Bounds
When the Markov chain induced by the policy under evaluation $\pi$ has a stationary distribution $\rho^\pi$ (the Markov chain is ergodic, e.g. β-mixing), then
Theorem (LSTD Error Bound)
Let $\widetilde{V}$ be the truncated LSTD solution computed using $n$ samples along a trajectory generated by following the policy $\pi$. Then with probability $1 - \delta$, we have
$\|V^\pi - \widetilde{V}\|_{\rho^\pi} \le \frac{c}{\sqrt{1-\gamma^2}} \inf_{f \in \mathcal{F}} \|V^\pi - f\|_{\rho^\pi} + O\left( \sqrt{\frac{d \log(d/\delta)}{n\, \nu}} \right)$
◮ $n$ = number of samples, $d$ = dimension of the linear function space $\mathcal{F}$
◮ $\nu$ = the smallest eigenvalue of the Gram matrix $\left( \int \varphi_i \varphi_j \, d\rho^\pi \right)_{i,j}$
(Assume: the eigenvalues of the Gram matrix are strictly positive, which guarantees the existence of the model-based LSTD solution)
◮ β-mixing coefficients are hidden in the $O(\cdot)$ notation
The two terms of the LSTD error bound:
◮ Approximation error (first term): it depends on how well the function space $\mathcal{F}$ can approximate the value function $V^\pi$
◮ Estimation error (second term): it depends on the number of samples $n$, the dimension $d$ of the function space, the smallest eigenvalue $\nu$ of the Gram matrix, and the mixing properties of the Markov chain (hidden in $O$)
Theorem (LSPI Error Bound)
Let $V_{-1} \in \mathcal{F}$ be an arbitrary initial value function, $V_0, \dots, V_{K-1}$ be the sequence of truncated value functions generated by LSPI after $K$ iterations, and $\pi_K$ be the greedy policy w.r.t. $V_{K-1}$. Then with probability $1 - \delta$, we have
$\|V^* - V^{\pi_K}\|_\mu \le \frac{4\gamma}{(1-\gamma)^2} \left\{ \sqrt{C\, C_{\mu,\rho}} \left[ c\, E_0(\mathcal{F}) + O\left( \sqrt{\frac{d \log(dK/\delta)}{n\, \nu_\rho}} \right) \right] + \gamma^{\frac{K-1}{2}} R_{\max} \right\}$
◮ Approximation error: $E_0(\mathcal{F}) = \sup_{\pi \in \mathcal{G}(\widetilde{\mathcal{F}})} \inf_{f \in \mathcal{F}} \|V^\pi - f\|_{\rho^\pi}$
◮ Estimation error: depends on $n$, $d$, $\nu_\rho$, $K$
◮ Initialization error: error due to the choice of the initial value function or initial policy, $|V^* - V^{\pi_0}|$
The LSPI error bound holds under the following assumption: there exists a distribution $\rho$ such that for any policy $\pi \in \mathcal{G}(\widetilde{\mathcal{F}})$, we have $\rho \le C \rho^\pi$, where $C < \infty$ is a constant and $\rho^\pi$ is the stationary distribution of $\pi$. Furthermore, we can define the concentrability coefficient $C_{\mu,\rho}$ as before.
◮ $\nu_\rho$ = the smallest eigenvalue of the Gram matrix $\left( \int \varphi_i \varphi_j \, d\rho \right)_{i,j}$
Sample Complexity of Fitted Q-iteration
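Input: space $\mathcal{F}$, iterations $K$, sampling distribution $\rho$, number of samples $n$
Initial function $\widetilde{Q}_0 \in \mathcal{F}$
For $k = 1, \dots, K$
  ◮ Draw $n$ samples $(x_i, a_i) \overset{i.i.d.}{\sim} \rho$
  ◮ Sample $x'_i \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$
  ◮ Compute $y_i = r_i + \gamma \max_a \widetilde{Q}_{k-1}(x'_i, a)$
  ◮ Build training set $\{ ((x_i, a_i), y_i) \}_{i=1}^{n}$
  ◮ Solve the least squares problem
    $f_{\hat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left( f_\alpha(x_i, a_i) - y_i \right)^2$
  ◮ Return $\widetilde{Q}_k = \mathrm{Trunc}(f_{\hat{\alpha}_k})$
Return $\pi_K(\cdot) = \arg\max_a \widetilde{Q}_K(\cdot, a)$ (greedy policy)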
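The loop above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the course's implementation: the callables `sample_sa`, `step` and `phi`, the truncation by clipping to $[-V_{\max}, V_{\max}]$, and the toy 2-state MDP are all assumptions made for the example.

```python
import numpy as np

def linear_fqi(sample_sa, step, phi, actions, gamma, v_max, K, n, rng):
    """Sketch of LinearFQI; returns the weights of the last regression."""
    d = len(phi(*sample_sa(rng)))
    alpha = np.zeros(d)                       # Q_0 = 0 belongs to F

    def q_trunc(x, a, w):
        # Trunc(f_w): clip the linear prediction to [-v_max, v_max]
        return float(np.clip(phi(x, a) @ w, -v_max, v_max))

    for _ in range(K):
        feats, targets = [], []
        for _ in range(n):
            x, a = sample_sa(rng)             # (x_i, a_i) ~ rho
            r, x_next = step(x, a, rng)       # r_i and x'_i ~ p(.|x_i, a_i)
            y = r + gamma * max(q_trunc(x_next, b, alpha) for b in actions)
            feats.append(phi(x, a))
            targets.append(y)
        # least-squares fit of the targets on the features
        alpha, *_ = np.linalg.lstsq(np.array(feats), np.array(targets), rcond=None)
    return alpha

# Hypothetical 2-state, 2-action MDP: action a moves to state a,
# reward 1 only for taking action 1 in state 1.
def sample_sa(rng):
    return int(rng.integers(2)), int(rng.integers(2))

def step(x, a, rng):
    return (1.0 if (x == 1 and a == 1) else 0.0), a

def phi(x, a):
    return np.eye(4)[2 * x + a]               # one-hot (tabular) features

gamma = 0.5
alpha = linear_fqi(sample_sa, step, phi, actions=(0, 1), gamma=gamma,
                   v_max=1.0 / (1.0 - gamma), K=30, n=200,
                   rng=np.random.default_rng(0))
Q = np.clip(alpha, -2.0, 2.0).reshape(2, 2)   # Q[x, a]
greedy_policy = Q.argmax(axis=1)
print(Q, greedy_policy)
```

With one-hot features and deterministic dynamics each regression reproduces the Bellman backup exactly on the sampled pairs, so the example isolates the iteration error; richer feature maps bring back the approximation and estimation terms analyzed in the rest of the lecture.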
Sample Complexity of Fitted Q-iteration Error at Each Iteration
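Objective: given the training set built on the samples $\{(x_i, a_i)\}_{i=1}^{n}$ (from $\rho$), bound the error of the function returned at iteration $k$
◮ on the samples: $\|Q_k - \widetilde{Q}_k\|_{\hat\rho} \le\ ???$
◮ w.r.t. the sampling distribution: $\|Q_k - \widetilde{Q}_k\|_{\rho} \le\ ???$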
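◮ Learned function: $f_{\hat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \left( f_\alpha(x_i, a_i) - y_i \right)^2$
◮ Best function in $\mathcal{F}$: $f_{\alpha^*_k} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - Q_k\|_\rho$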
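Given the samples $\{(x_i, a_i)\}_{i=1}^{n}$ drawn from $\rho$:
◮ Empirical norm: with $z_i = f(x_i, a_i)$, $\|f\|_{\hat\rho}^2 = \frac{1}{n} \sum_{i=1}^{n} f(x_i, a_i)^2 = \frac{1}{n} \sum_{i=1}^{n} z_i^2 = \|z\|_n^2$
◮ Vector space: $\mathcal{F}_n = \{ z \in \mathbb{R}^n : z_i = f_\alpha(x_i, a_i),\ f_\alpha \in \mathcal{F} \}$
◮ Empirical projection: $\hat\Pi y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n$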
◮ Target vector:
$q_i = Q_k(x_i, a_i) = T\widetilde{Q}_{k-1}(x_i, a_i) = r(x_i, a_i) + \gamma \int \max_a \widetilde{Q}_{k-1}(x', a)\, p(dx' \mid x_i, a_i)$
◮ Observed target vector:
$y_i = r_i + \gamma \max_a \widetilde{Q}_{k-1}(x'_i, a)$
◮ Noise vector (zero-mean and bounded):
$\xi_i = q_i - y_i, \qquad |\xi_i| \le V_{\max}, \qquad \mathbb{E}[\xi_i \mid x_i] = 0$
[Figure: the target vector $q$, the noise $\xi$, the observed vector $y$, and the subspace $\mathcal{F}_n$]
◮ Optimal solution in $\mathcal{F}_n$: $\hat\Pi q = \arg\min_{z \in \mathcal{F}_n} \|q - z\|_n$
◮ Returned vector: $\hat{q}_i = f_{\hat{\alpha}_k}(x_i, a_i)$, $\quad \hat{q} = \hat\Pi y = \arg\min_{z \in \mathcal{F}_n} \|y - z\|_n$
◮ Error to bound: $\|Q_k - f_{\hat{\alpha}_k}\|_{\hat\rho}^2 = \|q - \hat{q}\|_n^2$
◮ Prediction error: distance between the learned function and the target function
◮ Approximation error: distance between the best function in $\mathcal{F}$ and the target function
◮ Estimation error: distance between the best function in $\mathcal{F}$ and the learned function
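Bounding the estimation error (with $\langle u, v \rangle_n = \frac{1}{n} \sum_{i=1}^{n} u_i v_i$):
◮ The estimation error is the projected noise: $\hat\Pi q - \hat{q} = \hat\Pi q - \hat\Pi y = \hat\Pi \xi =: \hat\xi$
◮ With probability $1 - \delta$, for any $f_\alpha \in \mathcal{F}$,
$\left| \frac{1}{n} \sum_{i=1}^{n} f_\alpha(x_i, a_i)\, \xi_i \right| \le 4 V_{\max} \|f_\alpha\|_{\hat\rho} \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}$
◮ By definition of $\hat\xi \in \mathcal{F}_n$ $\Rightarrow$ $\langle \hat\xi, \xi \rangle_n \le 4 V_{\max} \|\hat\xi\|_n \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}$
◮ Orthogonal projection: $\|\hat\xi\|_n^2 = \langle \hat\xi, \xi \rangle_n$ $\Rightarrow$ $\|\hat\xi\|_n^2 \le 4 V_{\max} \|\hat\xi\|_n \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}$
◮ Hence $\|\hat\Pi q - \hat{q}\|_n \le 4 V_{\max} \sqrt{\frac{2}{n} \log \frac{3 (9 n e^2)^{d+1}}{\delta}}$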
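The geometric facts used in this argument (the estimation error is the projection of the noise, and its squared norm equals its inner product with the noise) can be checked numerically. The sketch below is only an illustration: the random design matrix `Phi`, the synthetic target `q` and the uniform noise `xi` are hypothetical stand-ins for the quantities defined on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, v_max = 200, 5, 1.0
Phi = rng.normal(size=(n, d))              # rows: features at the sampled (x_i, a_i)
q = rng.uniform(-v_max, v_max, size=n)     # hypothetical target vector
xi = rng.uniform(-v_max, v_max, size=n)    # zero-mean, bounded noise
y = q - xi                                 # observed targets, since xi = q - y

def project(v):
    """Empirical projection onto F_n = {Phi @ alpha}."""
    coef, *_ = np.linalg.lstsq(Phi, v, rcond=None)
    return Phi @ coef

def norm_n(v):
    """Empirical norm ||v||_n = sqrt(mean(v_i^2))."""
    return np.sqrt(np.mean(v ** 2))

q_hat, Pi_q, xi_hat = project(y), project(q), project(xi)

print("prediction error   :", norm_n(q - q_hat))
print("approximation error:", norm_n(q - Pi_q))
print("estimation error   :", norm_n(Pi_q - q_hat))
```

Because the projection is linear, `Pi_q - q_hat` coincides with `xi_hat`, so the estimation error depends only on the noise and the features, not on the target.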
Theorem (see e.g. Lazaric et al., '11)
At each iteration $k$ and given a set of state-action pairs $\{(x_i, a_i)\}$, LinearFQI returns an approximation $\hat{q}$ such that
$\|q - \hat{q}\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat{q}\|_n \le \|q - \hat\Pi q\|_n + O\left( V_{\max} \sqrt{\frac{d \log(n/\delta)}{n}} \right)$
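Moving back from vectors to functions:
◮ $\|q - \hat{q}\|_n = \|Q_k - f_{\hat{\alpha}_k}\|_{\hat\rho}$
◮ $\|q - \hat\Pi q\|_n \le \|Q_k - f_{\alpha^*_k}\|_{\hat\rho}$
$\Rightarrow \|Q_k - f_{\hat{\alpha}_k}\|_{\hat\rho} \le \|Q_k - f_{\alpha^*_k}\|_{\hat\rho} + O\left( V_{\max} \sqrt{\frac{d \log(n/\delta)}{n}} \right)$
By definition of truncation ($\widetilde{Q}_k = \mathrm{Trunc}(f_{\hat{\alpha}_k})$):
Theorem
$\|Q_k - \widetilde{Q}_k\|_{\hat\rho} \le \|Q_k - f_{\hat{\alpha}_k}\|_{\hat\rho} \le \|Q_k - f_{\alpha^*_k}\|_{\hat\rho} + O\left( V_{\max} \sqrt{\frac{d \log(n/\delta)}{n}} \right)$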
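From the empirical norm $\hat\rho$ to the norm $\rho$:
◮ $2 \|Q_k - \widetilde{Q}_k\|_{\hat\rho} \ge \|Q_k - \widetilde{Q}_k\|_{\rho} - O\left( V_{\max} \sqrt{\frac{d \log(n/\delta)}{n}} \right)$
◮ $f_{\alpha^*_k}$ is a fixed function in $\mathcal{F}$:
$\|Q_k - f_{\alpha^*_k}\|_{\hat\rho} \le 2 \|Q_k - f_{\alpha^*_k}\|_{\rho} + O\left( \left( V_{\max} + L \|\alpha^*_k\| \right) \sqrt{\frac{\log(1/\delta)}{n}} \right)$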
Theorem
$\|Q_k - \widetilde{Q}_k\|_\rho \le 4 \|Q_k - f_{\alpha^*_k}\|_\rho + O\left( \left( V_{\max} + L \|\alpha^*_k\| \right) \sqrt{\frac{\log(1/\delta)}{n}} \right) + O\left( V_{\max} \sqrt{\frac{d \log(n/\delta)}{n}} \right)$
First term, $4 \|Q_k - f_{\alpha^*_k}\|_\rho$ (approximation error):
◮ No algorithm can do better
◮ Constant 4
◮ Depends on the space $\mathcal{F}$
◮ Changes with the iteration $k$
Second term, $O\left( (V_{\max} + L \|\alpha^*_k\|) \sqrt{\log(1/\delta)/n} \right)$:
◮ Vanishing to zero as $O(n^{-1/2})$
◮ Depends on the features ($L$) and on the best solution ($\|\alpha^*_k\|$)
Third term, $O\left( V_{\max} \sqrt{d \log(n/\delta)/n} \right)$:
◮ Vanishing to zero as $O(n^{-1/2})$
◮ Depends on the dimensionality of the space ($d$) and the number of samples ($n$)
Sample Complexity of Fitted Q-iteration Error Propagation
◮ Problem 1: the test norm $\mu$ is different from the sampling norm $\rho$
◮ Problem 2: we have bounds for $\widetilde{Q}_k$, not for the performance of the corresponding greedy policy $\pi_k$
◮ Problem 3: we have bounds for one single iteration
◮ Bellman operators
$T Q(x, a) = r(x, a) + \gamma \int \max_{a'} Q(x', a')\, p(dx' \mid x, a)$
$T^\pi Q(x, a) = r(x, a) + \gamma \int Q(x', \pi(x'))\, p(dx' \mid x, a)$
◮ Optimal action-value function
$Q^* = T Q^*$
◮ Greedy policy
$\pi(x) = \arg\max_a Q(x, a), \qquad \pi^*(x) = \arg\max_a Q^*(x, a)$
◮ Prediction error
$\epsilon_k = Q_k - \widetilde{Q}_k$
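For a finite MDP the two Bellman operators and the greedy policy reduce to a few array operations. The sketch below is an illustration under that assumption; the array layout (`P[x, a, x']`, `R[x, a]`), the function names and the toy 2-state MDP are hypothetical.

```python
import numpy as np

def bellman_optimal(Q, P, R, gamma):
    """(T Q)(x, a) = R[x, a] + gamma * sum_x' P[x, a, x'] * max_a' Q[x', a']."""
    return R + gamma * P @ Q.max(axis=1)

def bellman_policy(Q, P, R, gamma, pi):
    """(T^pi Q)(x, a) = R[x, a] + gamma * sum_x' P[x, a, x'] * Q[x', pi[x']]."""
    return R + gamma * P @ Q[np.arange(Q.shape[0]), pi]

def greedy(Q):
    """Greedy policy pi(x) = argmax_a Q(x, a)."""
    return Q.argmax(axis=1)

# Hypothetical 2-state, 2-action MDP: action a moves to state a,
# reward 1 only for taking action 1 in state 1.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0
R = np.array([[0.0, 0.0], [0.0, 1.0]])
gamma = 0.5

Q = np.zeros((2, 2))
for _ in range(60):                 # value iteration converges to Q* = T Q*
    Q = bellman_optimal(Q, P, R, gamma)
pi_star = greedy(Q)
print(Q, pi_star)
```

At convergence `Q` is a fixed point of both `bellman_optimal` and `bellman_policy` with `pi_star`, which is the identity $Q^* = T Q^* = T^{\pi^*} Q^*$ used in the propagation argument.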
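Since $Q^* = T^{\pi^*} Q^*$ and $Q^{\pi_K} = T^{\pi_K} Q^{\pi_K}$ (fixed points), and $T^{\pi^*} \widetilde{Q}_K \le T^{\pi_K} \widetilde{Q}_K$ ($\pi_K$ is greedy w.r.t. $\widetilde{Q}_K$):
$Q^* - Q^{\pi_K} = T^{\pi^*} Q^* - T^{\pi^*} \widetilde{Q}_K + T^{\pi^*} \widetilde{Q}_K - T^{\pi_K} \widetilde{Q}_K + T^{\pi_K} \widetilde{Q}_K - T^{\pi_K} Q^{\pi_K}$
$Q^* - Q^{\pi_K} \le \gamma P^{\pi^*} (Q^* - \widetilde{Q}_K) + \gamma P^{\pi_K} (\widetilde{Q}_K - Q^* + Q^* - Q^{\pi_K})$
$(I - \gamma P^{\pi_K}) (Q^* - Q^{\pi_K}) \le \gamma (P^{\pi^*} - P^{\pi_K}) (Q^* - \widetilde{Q}_K)$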
$Q^* - Q^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1} \left\{ \sum_{k=0}^{K-1} \gamma^{K-k} \left[ (P^{\pi^*})^{K-k} - P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \right] \epsilon_k + [\,\cdots\,] (Q^* - \widetilde{Q}_0) \right\}$
$Q^* - Q^{\pi_K} \le \frac{2\gamma (1 - \gamma^{K+1})}{(1-\gamma)^2} \left[ \sum_{k=0}^{K-1} \alpha_k A_k |\epsilon_k| + \alpha_K A_K |Q^* - \widetilde{Q}_0| \right]$
◮ $\alpha_k$: weights (with $\sum_k \alpha_k = 1$)
◮ $A_k$: summarize the $P^{\pi_i}$ terms
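$\|Q^* - Q^{\pi_K}\|_\mu^2 = \int \mu(dx, da)\, (Q^* - Q^{\pi_K})^2 (x, a)$
$\le \left[ \frac{2\gamma (1 - \gamma^{K+1})}{(1-\gamma)^2} \right]^2 \int \mu(dx, da) \left[ \sum_{k=0}^{K-1} \alpha_k A_k |\epsilon_k| + \alpha_K A_K |Q^* - \widetilde{Q}_0| \right]^2 (x, a)$
$\le \left[ \frac{2\gamma (1 - \gamma^{K+1})}{(1-\gamma)^2} \right]^2 \int \mu(dx, da) \left[ \sum_{k=0}^{K-1} \alpha_k A_k \epsilon_k^2 + \alpha_K A_K (Q^* - \widetilde{Q}_0)^2 \right] (x, a)$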
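$\mu A_k = \frac{1-\gamma}{2}\, \mu (I - \gamma P^{\pi_K})^{-1} \left[ (P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \right]$
$= \frac{1-\gamma}{2} \sum_{m \ge 0} \gamma^m \mu (P^{\pi_K})^m \left[ (P^{\pi^*})^{K-k} + P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \right]$
$= \frac{1-\gamma}{2} \left[ \sum_{m \ge 0} \gamma^m \mu (P^{\pi_K})^m (P^{\pi^*})^{K-k} + \sum_{m \ge 0} \gamma^m \mu (P^{\pi_K})^m P^{\pi_K} P^{\pi_{K-1}} \cdots P^{\pi_{k+1}} \right]$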
Concentrability coefficients:
$c(m) = \sup_{\pi_1 \dots \pi_m} \left\| \frac{d(\mu P^{\pi_1} \cdots P^{\pi_m})}{d\rho} \right\|_\infty$
$C_{\mu,\rho} = (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m) < +\infty$
Remark: related to the top-Lyapunov exponent $\Rightarrow$ $C_{\mu,\rho} < \infty$ is a weak stability condition
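As a quick sanity check on the normalization, consider the special case where the coefficients are uniformly bounded, $c(m) \le c$ for all $m$ (an assumption made here only for illustration; the slides do not state it). Using $\sum_{m \ge 1} m \gamma^{m-1} = 1/(1-\gamma)^2$:

```latex
C_{\mu,\rho}
  = (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m)
  \le c \, (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1}
  = c \, (1-\gamma)^2 \cdot \frac{1}{(1-\gamma)^2}
  = c
```

So the factor $(1-\gamma)^2$ turns $C_{\mu,\rho}$ into a weighted average of the $c(m)$, and the coefficient stays finite even when $c(m)$ grows, as long as the growth is slower than $\gamma^{-m}$.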
$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \left[ \frac{2\gamma (1 - \gamma^{K+1})}{(1-\gamma)^2} \right]^2 \left[ \sum_{k=0}^{K-1} \alpha_k (1-\gamma) \sum_{m \ge 0} \gamma^m c(m + K - k)\, \|\epsilon_k\|_\rho^2 + \alpha_K (2 V_{\max})^2 \right]$
$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \left[ \frac{2\gamma}{(1-\gamma)^2} \right]^2 C_{\mu,\rho} \max_k \|\epsilon_k\|_\rho^2 + O\left( \frac{\gamma^K}{(1-\gamma)^3} V_{\max}^2 \right)$
Sample Complexity of Fitted Q-iteration The Final Bound
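Bringing everything together:
◮ Error propagation:
$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \left[ \frac{2\gamma}{(1-\gamma)^2} \right]^2 C_{\mu,\rho} \max_k \|\epsilon_k\|_\rho^2 + O\left( \frac{\gamma^K}{(1-\gamma)^3} V_{\max}^2 \right)$
◮ Error at each iteration:
$\|\epsilon_k\|_\rho = \|Q_k - \widetilde{Q}_k\|_\rho \le 4 \|Q_k - f_{\alpha^*_k}\|_\rho + O\left( \left( V_{\max} + L \|\alpha^*_k\| \right) \sqrt{\frac{\log(1/\delta)}{n}} \right) + O\left( V_{\max} \sqrt{\frac{d \log(n/\delta)}{n}} \right)$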
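The approximation term is bounded by the inherent Bellman error of the space:
$\|Q_k - f_{\alpha^*_k}\|_\rho = \inf_{f \in \mathcal{F}} \|Q_k - f\|_\rho = \inf_{f \in \mathcal{F}} \|T \widetilde{Q}_{k-1} - f\|_\rho \le \inf_{f \in \mathcal{F}} \|T f_{\hat{\alpha}_{k-1}} - f\|_\rho \le \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|T g - f\|_\rho = d(\mathcal{F}, T\mathcal{F})$
$f_{\alpha^*_k}$ is the orthogonal projection of $Q_k$ onto $\mathcal{F}$ w.r.t. $\rho$
$\Rightarrow \|f_{\alpha^*_k}\|_\rho \le \|Q_k\|_\rho = \|T \widetilde{Q}_{k-1}\|_\rho \le V_{\max}$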
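With $G$ the Gram matrix of the features w.r.t. $\rho$ and $\omega$ its smallest eigenvalue:
$\|f_\alpha\|_\rho^2 = \|\phi^\top \alpha\|_\rho^2 = \alpha^\top G \alpha \ge \omega\, \alpha^\top \alpha = \omega \|\alpha\|^2$
$\max_k \|\alpha^*_k\| \le \max_k \frac{\|f_{\alpha^*_k}\|_\rho}{\sqrt{\omega}} \le \frac{V_{\max}}{\sqrt{\omega}}$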
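The eigenvalue argument can be verified on a small finite example. The following is a sketch under assumptions: the state space is finite, and the feature matrix `Phi`, the distribution `rho` and the weight vector `alpha` are randomly generated placeholders rather than anything from the course.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d = 50, 4
Phi = rng.normal(size=(n_states, d))        # hypothetical features, one row per state
rho = rng.dirichlet(np.ones(n_states))      # hypothetical sampling distribution

# Gram matrix G_ij = sum_x rho(x) phi_i(x) phi_j(x) and its smallest eigenvalue
G = Phi.T @ (rho[:, None] * Phi)
omega = np.linalg.eigvalsh(G).min()

alpha = rng.normal(size=d)
f_norm_sq = np.sum(rho * (Phi @ alpha) ** 2)    # ||f_alpha||_rho^2

print("alpha^T G alpha   :", alpha @ G @ alpha)
print("||f_alpha||_rho^2 :", f_norm_sq)
print("omega * ||alpha||^2:", omega * (alpha @ alpha))
```

The last printed value lower-bounds the first two, which is what lets the norm of the weights be controlled by the norm of the function divided by $\sqrt{\omega}$.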
Theorem (see e.g., Munos, '03)
LinearFQI with a space $\mathcal{F}$ of $d$ features, with $n$ samples at each iteration, returns a policy $\pi_K$ after $K$ iterations such that
$\|Q^* - Q^{\pi_K}\|_\mu \le \frac{2\gamma}{(1-\gamma)^2} \sqrt{C_{\mu,\rho}} \left( 4\, d(\mathcal{F}, T\mathcal{F}) + O\left( V_{\max} \left( 1 + \frac{L}{\sqrt{\omega}} \right) \sqrt{\frac{d \log(n/\delta)}{n}} \right) \right) + O\left( \frac{\gamma^K}{(1-\gamma)^3} V_{\max}^2 \right)$
◮ Concentrability coefficient $C_{\mu,\rho}$ (propagation and change of norm) ⇒ how do we choose the sampling distribution?
◮ Inherent Bellman error $d(\mathcal{F}, T\mathcal{F})$ ⇒ how do we adapt the space $\mathcal{F}$ to the Bellman operator?
◮ Dependence on $\gamma$ through $1/(1-\gamma)^2$ ⇒ is it possible to avoid it?
◮ Iteration term, decreasing exponentially in $K$ ⇒ $K \approx \log(1/\epsilon)/(1-\gamma)$
◮ Smallest eigenvalue $\omega$ of the Gram matrix ⇒ design the features so as to be orthogonal w.r.t. $\rho$
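The last term of the bound decays like $\gamma^K$, so the number of iterations needed to push it below a tolerance $\epsilon$ can be computed directly. The helper below is a hypothetical illustration (the name `iterations_needed` and the numbers are not from the slides); it compares the exact count with the rule of thumb $\log(1/\epsilon)/(1-\gamma)$, which is an upper bound because $\log(1/\gamma) \ge 1 - \gamma$.

```python
import math

def iterations_needed(gamma, eps):
    """Smallest integer K such that gamma**K <= eps (0 < gamma < 1, 0 < eps < 1)."""
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / gamma))

gamma, eps = 0.9, 0.01
K = iterations_needed(gamma, eps)
rule_of_thumb = math.log(1.0 / eps) / (1.0 - gamma)
print("exact K       :", K)
print("rule of thumb :", rule_of_thumb)
```

The rule of thumb becomes tight as $\gamma$ approaches 1, which is the regime where the number of iterations matters most.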
◮ At each iteration FQI solves a regression problem
◮ The error is propagated through iterations
Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr