MVA-RL Course
Approximate Dynamic Programming
- A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
Value Iteration: the Idea

1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$:
   ◮ Compute $V_{k+1} = T V_k$.
3. Return the greedy policy w.r.t. the final estimate:
   $\pi_K(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y|x, a) V_K(y) \big]$.
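As a concrete illustration (my own sketch, not from the slides), value iteration on a finite MDP in NumPy; the transition tensor `P`, reward matrix `R`, and discount `gamma` are assumed inputs:

```python
import numpy as np

def value_iteration(P, R, gamma, K):
    """P: transitions, shape (A, N, N); R: rewards, shape (N, A)."""
    N = P.shape[1]
    V = np.zeros(N)                          # V_0: any vector in R^N
    for _ in range(K):
        # (T V)(x) = max_a [ r(x,a) + gamma * sum_y p(y|x,a) V(y) ]
        Q = R + gamma * np.einsum('axy,y->xa', P, V)
        V = Q.max(axis=1)                    # V_{k+1} = T V_k
    # Greedy policy w.r.t. the final estimate V_K
    Q = R + gamma * np.einsum('axy,y->xa', P, V)
    return V, Q.argmax(axis=1)
```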
◮ From the fixed-point property of $T$:
  $\lim_{k \to \infty} V_k = V^*$.
◮ From the contraction property of $T$:
  $\|V_{k+1} - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty \to 0$.
Policy Iteration: the Idea

◮ Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
◮ Policy improvement: compute the greedy policy
  $\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y|x, a) V^{\pi_k}(y) \big]$.
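The matching sketch for exact policy iteration (same assumed inputs as above; the evaluation step solves the linear system $(I - \gamma P^{\pi}) V = r^{\pi}$ directly):

```python
import numpy as np

def policy_iteration(P, R, gamma, max_iter=100):
    """P: transitions, shape (A, N, N); R: rewards, shape (N, A)."""
    N = P.shape[1]
    pi = np.zeros(N, dtype=int)              # arbitrary initial policy
    V = np.zeros(N)
    for _ in range(max_iter):
        # Policy evaluation: V^pi solves (I - gamma P^pi) V = r^pi
        P_pi = P[pi, np.arange(N), :]        # row x is p(.|x, pi(x))
        r_pi = R[np.arange(N), pi]
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. V^pi
        pi_new = (R + gamma * np.einsum('axy,y->xa', P, V)).argmax(axis=1)
        if np.array_equal(pi_new, pi):       # policy stable => optimal
            break
        pi = pi_new
    return V, pi
```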
Sources of error:
◮ Approximation error: if $X$ is large or continuous, value functions cannot be represented exactly and must be approximated within some function space $F$.
◮ Estimation error: if the reward $r$ and dynamics $p$ are unknown, the Bellman operators can only be estimated from samples.
In this lecture:
◮ Infinite horizon setting with discount $\gamma$.
◮ Study the impact of approximation error.
◮ The impact of estimation error is studied in the next lecture.
Performance Loss
Question: if we only have an approximation $V$ of $V^*$ and act with the greedy policy $\pi(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y|x, a) V(y) \big]$, how much performance do we lose?
Proposition 1
Let $V$ be an approximation of $V^*$ and $\pi$ the greedy policy w.r.t. $V$. Then
$$\|V^* - V^{\pi}\|_\infty \le \frac{2\gamma}{1 - \gamma} \|V^* - V\|_\infty.$$
The achievable quality is ultimately limited by the approximation space: the best error within $F$ is $\inf_{f \in F} \|V^* - f\|$.
Approximate Value Iteration
Approximate Value Iteration: the Idea

Let $\mathcal{A}$ be an approximation operator.
1. Let $V_0$ be any vector in $\mathbb{R}^N$.
2. At each iteration $k = 1, 2, \ldots, K$:
   ◮ Compute $V_{k+1} = \mathcal{A} T V_k$.
3. Return the greedy policy
   $\pi_K(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y|x, a) V_K(y) \big]$.
For instance, $\mathcal{A}$ can be the $L_\infty$ projection onto $F$: $\mathcal{A} T V_k = \arg\min_{V \in F} \|T V_k - V\|_\infty$.
Proposition (Bertsekas & Tsitsiklis, 1996)
Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
$$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty + \frac{2\gamma^{K+1}}{1-\gamma} \|V^* - V_0\|_\infty.$$
Proof. Let $\varepsilon = \max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty$. For any $0 \le k < K$ we have
$$\|V^* - V_{k+1}\|_\infty \le \|T V^* - T V_k\|_\infty + \|T V_k - V_{k+1}\|_\infty \le \gamma \|V^* - V_k\|_\infty + \varepsilon,$$
then
$$\|V^* - V_K\|_\infty \le (1 + \gamma + \cdots + \gamma^{K-1})\varepsilon + \gamma^K \|V^* - V_0\|_\infty \le \frac{1}{1-\gamma}\,\varepsilon + \gamma^K \|V^* - V_0\|_\infty.$$
Since from Proposition 1 we have $\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{1-\gamma} \|V^* - V_K\|_\infty$, we obtain
$$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2}\,\varepsilon + \frac{2\gamma^{K+1}}{1-\gamma} \|V^* - V_0\|_\infty. \qquad \square$$
Assumption: access to a generative model.
[Figure: the generative model takes a state $x$ and an action $a$ and returns the reward $r(x, a)$ and a next state $y \sim p(\cdot|x, a)$.]
Idea: work with Q-functions and linear spaces.
◮ $Q^*$ is the unique fixed point of $T$ defined over $X \times A$ as:
$$T Q(x, a) = \sum_y p(y|x, a)\big[ r(x, a, y) + \gamma \max_b Q(y, b) \big].$$
◮ $F$ is a space defined by $d$ features $\varphi_1, \ldots, \varphi_d : X \times A \to \mathbb{R}$ as:
$$F = \Big\{ Q_\alpha(x, a) = \sum_{j=1}^d \alpha_j \varphi_j(x, a),\ \alpha \in \mathbb{R}^d \Big\}.$$
$\Rightarrow$ At each iteration compute $Q_{k+1} = \Pi_\infty T Q_k$.
Problems:
◮ the $\Pi_\infty$ operator cannot be computed efficiently;
◮ the Bellman operator $T$ is often unknown.
Solution to the first problem: replace $\Pi_\infty$ with the projection in $L_{2,\mu}$-norm, i.e., compute $Q_{k+1} = \arg\min_{Q \in F} \|Q - T Q_k\|^2_\mu$.
Solution to the second problem: replace $T$ with samples from the generative model. Given state-action pairs $(X_i, A_i)$, $i = 1, \ldots, n$, with rewards $R_i = r(X_i, A_i)$ and next states $Y_i \sim p(\cdot|X_i, A_i)$, compute the empirical targets $Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$.
Then fit the next iterate to the targets by least squares:
$$Q_{k+1} = \arg\min_{Q_\alpha \in F} \frac{1}{n} \sum_{i=1}^n \big( Q_\alpha(X_i, A_i) - Z_i \big)^2.$$
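A sketch of one such iteration (linear fitted Q-iteration) under assumed helpers: `phi(x, a)` returns the feature vector and `actions` is the finite action set; neither name comes from the slides:

```python
import numpy as np

def fitted_q_iteration_step(alpha_k, samples, phi, actions, gamma):
    """One AVI step over F: regress the sampled Bellman targets.

    samples: list of (X_i, A_i, R_i, Y_i) with Y_i ~ p(.|X_i, A_i);
    alpha_k: current parameters, so Q_k(x, a) = alpha_k . phi(x, a)."""
    Phi, Z = [], []
    for X, A, R, Y in samples:
        # target Z_i = R_i + gamma * max_a Q_k(Y_i, a)
        Z.append(R + gamma * max(alpha_k @ phi(Y, a) for a in actions))
        Phi.append(phi(X, A))
    # least-squares fit: argmin_alpha (1/n) sum_i (alpha.phi(X_i,A_i) - Z_i)^2
    alpha_next, *_ = np.linalg.lstsq(np.array(Phi), np.array(Z), rcond=None)
    return alpha_next
```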
The fitting step can use any regression method, e.g.:
◮ K-nearest neighbour
◮ Regularized linear regression with L1 or L2 regularisation
◮ Neural network
◮ Support vector machine
Example: the optimal replacement problem. The state $x$ is the accumulated wear of a machine; at each step we either replace it ($R$) or keep it ($K$).
◮ Costs: $c(x, R) = C$, a fixed replacement cost; $c(x, K) = c(x)$, the maintenance plus extra costs.
◮ Dynamics: $p(\cdot|x, R) = \exp(\beta)$ with density $d(y) = \beta e^{-\beta y}\, \mathbb{I}\{y \ge 0\}$; $p(\cdot|x, K) = x + \exp(\beta)$ with density $d(y - x)$.
Optimal value function:
$$V^*(x) = \min\Big\{ c(x) + \gamma \int_x^\infty d(y - x)\, V^*(y)\, dy,\ \ C + \gamma \int_0^\infty d(y)\, V^*(y)\, dy \Big\}.$$
[Figure: left, the management cost $c(x)$ as a function of the wear $x$; right, the optimal value function over the state space, annotated with the optimal action in each region ($R$ = replace, $K$ = keep).]
Linear approximation space $F := \big\{ V_\alpha(x) = \sum_{k=1}^{d} \alpha_k \cos\big(k \pi \tfrac{x}{x_{max}}\big),\ \alpha \in \mathbb{R}^d \big\}$.
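To make the space concrete, a small sketch (my own, with an assumed dimension $d = 20$, $x_{max} = 10$, and stand-in target values) that computes the least-squares projection of sampled values onto these cosine features:

```python
import numpy as np

def cosine_features(x, d, x_max):
    """phi_k(x) = cos(k pi x / x_max), k = 1..d; x is a 1-D array of states."""
    k = np.arange(1, d + 1)
    return np.cos(np.outer(x, k) * np.pi / x_max)    # shape (len(x), d)

x = np.linspace(0.0, 10.0, 200)              # sampled states (x_max = 10 assumed)
targets = np.minimum(30.0 + 4.0 * x, 60.0)   # stand-in for target values {T V(x_n)}
Phi = cosine_features(x, d=20, x_max=10.0)
alpha, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
V_approx = Phi @ alpha                       # the function in F closest to the targets
```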
Figure: Left: the target values computed as $\{T V_0(x_n)\}_{1 \le n \le N}$. Right: the approximation $V_1 \in F$ of the target function $T V_0$.
Figure: Left: the target values computed as $\{T V_1(x_n)\}_{1 \le n \le N}$. Center: the approximation $V_2 \in F$ of $T V_1$. Right: the approximation $V_n \in F$ after $n$ iterations.
Approximate Policy Iteration
Let $\mathcal{A}$ be an approximation operator.
◮ Policy evaluation: given the current policy $\pi_k$, compute $V_k = \mathcal{A} V^{\pi_k}$.
◮ Policy improvement: given the approximated value of the current policy, compute the greedy policy w.r.t. $V_k$ as
$$\pi_{k+1}(x) \in \arg\max_{a \in A} \big[ r(x, a) + \gamma \sum_y p(y|x, a) V_k(y) \big].$$

Problem: the algorithm is no longer guaranteed to converge.
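Schematically, the loop looks as follows (a sketch; `approx_evaluate` is a hypothetical pluggable evaluation routine, e.g., TD($\lambda$) or LSTD below, and `P`, `R` are assumed inputs as before):

```python
import numpy as np

def approximate_policy_iteration(P, R, gamma, approx_evaluate, n_iter=20):
    """approx_evaluate(pi) returns V_k, an approximation of V^{pi_k}."""
    N = P.shape[1]
    pi = np.zeros(N, dtype=int)
    for _ in range(n_iter):                  # fixed budget: no convergence guarantee
        V = approx_evaluate(pi)              # policy evaluation: V_k = A V^{pi_k}
        # policy improvement: greedy w.r.t. V_k
        pi = (R + gamma * np.einsum('axy,y->xa', P, V)).argmax(axis=1)
    return pi
```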
[Figure: the loss $\|V^* - V^{\pi_k}\|$ plotted against the iteration $k$: instead of converging, it settles into an asymptotic error level.]
Proposition
The asymptotic performance of the policies $\pi_k$ generated by API is related to the asymptotic approximation error:
$$\limsup_{k \to \infty} \|V^* - V^{\pi_k}\|_\infty \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \|V_k - V^{\pi_k}\|_\infty.$$
Proof. Notation:
◮ Approximation error: $e_k = V_k - V^{\pi_k}$,
◮ Performance gain: $g_k = V^{\pi_{k+1}} - V^{\pi_k}$,
◮ Performance loss: $l_k = V^* - V^{\pi_k}$.
Proof (cont'd). Since $\pi_{k+1}$ is greedy w.r.t. $V_k$, we have $T^{\pi_{k+1}} V_k \ge T^{\pi_k} V_k$. Then
$$g_k = T^{\pi_{k+1}} V^{\pi_{k+1}} - T^{\pi_{k+1}} V^{\pi_k} + T^{\pi_{k+1}} V^{\pi_k} - T^{\pi_{k+1}} V_k + T^{\pi_{k+1}} V_k - T^{\pi_k} V_k + T^{\pi_k} V_k - T^{\pi_k} V^{\pi_k}$$
$$\ge \gamma P^{\pi_{k+1}} g_k - \gamma (P^{\pi_{k+1}} - P^{\pi_k})\, e_k, \qquad (a)$$
where (a) uses the greedy inequality for the third difference and $T^\pi V - T^\pi V' = \gamma P^\pi (V - V')$ for the others. Rearranging and multiplying by the positive matrix $(I - \gamma P^{\pi_{k+1}})^{-1} = \sum_{t \ge 0} \gamma^t (P^{\pi_{k+1}})^t$ gives
$$g_k \ge -\gamma (I - \gamma P^{\pi_{k+1}})^{-1} (P^{\pi_{k+1}} - P^{\pi_k})\, e_k. \qquad (1)$$
Proof (cont'd). Relationship between the performance at subsequent iterations: since $T^{\pi^*} V_k \le T^{\pi_{k+1}} V_k$, we have
$$l_{k+1} = T^{\pi^*} V^* - T^{\pi^*} V^{\pi_k} + T^{\pi^*} V^{\pi_k} - T^{\pi^*} V_k + T^{\pi^*} V_k - T^{\pi_{k+1}} V_k + T^{\pi_{k+1}} V_k - T^{\pi_{k+1}} V^{\pi_k} + T^{\pi_{k+1}} V^{\pi_k} - T^{\pi_{k+1}} V^{\pi_{k+1}}$$
$$\le \gamma \big[ P^{\pi^*} l_k - P^{\pi_{k+1}} g_k + (P^{\pi_{k+1}} - P^{\pi^*})\, e_k \big].$$
If we now plug in equation (1),
$$l_{k+1} \le \gamma P^{\pi^*} l_k + \gamma \big[ \gamma P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (P^{\pi_{k+1}} - P^{\pi_k}) + P^{\pi_{k+1}} - P^{\pi^*} \big] e_k$$
$$= \gamma P^{\pi^*} l_k + \gamma \big[ P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big] e_k,$$
which describes how the performance loss propagates through the iterations.
Proof (cont'd). Asymptotic regime: let $f_k = \gamma \big[ P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big] e_k$, so that $l_{k+1} \le \gamma P^{\pi^*} l_k + f_k$. Moving to the limsup,
$$(I - \gamma P^{\pi^*}) \limsup_{k \to \infty} l_k \le \limsup_{k \to \infty} f_k,$$
$$\limsup_{k \to \infty} l_k \le (I - \gamma P^{\pi^*})^{-1} \limsup_{k \to \infty} f_k,$$
since $I - \gamma P^{\pi^*}$ is invertible. Finally, taking the $L_\infty$-norm on both sides,
$$\limsup_{k \to \infty} \|l_k\|_\infty \le \frac{\gamma}{1-\gamma} \limsup_{k \to \infty} \big\| P^{\pi_{k+1}} (I - \gamma P^{\pi_{k+1}})^{-1} (I - \gamma P^{\pi_k}) - P^{\pi^*} \big\|_\infty \|e_k\|_\infty$$
$$\le \frac{\gamma}{1-\gamma} \Big( \frac{1+\gamma}{1-\gamma} + 1 \Big) \limsup_{k \to \infty} \|e_k\|_\infty = \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \|e_k\|_\infty. \qquad \square$$
Approximate Policy Iteration: Linear Temporal-Difference
Algorithm Definition
Given a linear space $F = \{ V_\alpha(x) = \sum_{i=1}^d \alpha_i \varphi_i(x),\ \alpha \in \mathbb{R}^d \}$, initialize the trace vector $z \in \mathbb{R}^d$ and the parameter vector $\alpha \in \mathbb{R}^d$ to zero, and generate a sequence of states $(x_0, x_1, x_2, \ldots)$ according to $\pi$. At each step $t$, the temporal difference is
$$d_t = r(x_t, \pi(x_t)) + \gamma V_{\alpha_t}(x_{t+1}) - V_{\alpha_t}(x_t)$$
and the parameters are updated as
$$\alpha_{t+1} = \alpha_t + \eta_t d_t z_t, \qquad z_{t+1} = \lambda \gamma z_t + \varphi(x_{t+1}),$$
where $\eta_t$ is the learning step.
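The update rule transcribes almost directly into NumPy (a sketch; `phi` maps a state to its feature vector, and the trajectory and rewards are assumed given):

```python
import numpy as np

def td_lambda(states, rewards, phi, gamma, lam, eta, d):
    """states: (x_0, x_1, ...) generated by pi; rewards[t] = r(x_t, pi(x_t));
    eta(t): learning step eta_t."""
    alpha = np.zeros(d)          # parameters: V_alpha(x) = alpha . phi(x)
    z = np.zeros(d)              # trace vector, initialized to zero
    for t in range(len(states) - 1):
        # d_t = r(x_t, pi(x_t)) + gamma V_{alpha_t}(x_{t+1}) - V_{alpha_t}(x_t)
        d_t = rewards[t] + gamma * alpha @ phi(states[t + 1]) - alpha @ phi(states[t])
        alpha = alpha + eta(t) * d_t * z          # alpha_{t+1} = alpha_t + eta_t d_t z_t
        z = lam * gamma * z + phi(states[t + 1])  # z_{t+1} = lambda gamma z_t + phi(x_{t+1})
    return alpha
```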
Proposition (Tsitsiklis & Van Roy, 1996)
Let the learning rate $\eta_t$ satisfy
$$\sum_t \eta_t = \infty \quad \text{and} \quad \sum_t \eta_t^2 < \infty.$$
Assume that $\pi$ admits a stationary distribution $\mu^\pi$ and that the features $(\varphi_i)_{1 \le i \le d}$ are linearly independent. Then there exists a fixed $\alpha^*$ such that $\lim_{t \to \infty} \alpha_t = \alpha^*$. Furthermore,
$$\|V_{\alpha^*} - V^\pi\|_{2,\mu^\pi} \le \frac{1 - \lambda\gamma}{1 - \gamma} \inf_\alpha \|V_\alpha - V^\pi\|_{2,\mu^\pi}.$$
◮ Pros: simple to implement, computational cost linear in $d$.
◮ Cons: very sample inefficient; many samples are needed to obtain an accurate estimate of $V^\pi$.
Approximate Policy Iteration: Least-Squares Temporal Difference
◮ $V^\pi$ may not belong to $F$; the best we can hope for is its projection onto $F$, where $\Pi_\mu$ denotes the projection in $L_{2,\mu}$:
$$\Pi_\mu g = \arg\min_{f \in F} \|f - g\|_\mu.$$
◮ LSTD searches for the fixed point of the projected Bellman operator instead:
$$V_{TD} = \Pi_\mu T^\pi V_{TD}.$$
[Figure: geometric view: $T^\pi$ maps $V_{TD}$ outside $F$ and $\Pi_\mu$ projects it back, so $V_{TD}$ is the fixed point of $\Pi_\mu T^\pi$; $\Pi_\mu V^\pi$ is the best approximation of $V^\pi$ in $F$.]
Since $V_{TD} \in F$, we can write $V_{TD} = V_{\alpha_{TD}}$ for some $\alpha_{TD} \in \mathbb{R}^d$, and the fixed-point condition reduces to a linear system of $d$ equations, $A \alpha_{TD} = b$, whose empirical version is given below.
Proposition
When the states are drawn from the stationary distribution $\mu^\pi$ of $\pi$,
$$\|V^\pi - V_{TD}\|_{\mu^\pi} \le \frac{1}{\sqrt{1 - \gamma^2}} \inf_{V \in F} \|V^\pi - V\|_{\mu^\pi}.$$
Proof. We first show that $\|P^\pi\|_{\mu^\pi} \le 1$:
$$\|P^\pi V\|^2_{\mu^\pi} = \sum_x \mu^\pi(x) \Big( \sum_y p(y|x, \pi(x))\, V(y) \Big)^2 \le \sum_x \sum_y \mu^\pi(x)\, p(y|x, \pi(x))\, V(y)^2 = \sum_y \mu^\pi(y)\, V(y)^2 = \|V\|^2_{\mu^\pi},$$
using Jensen's inequality and the stationarity of $\mu^\pi$. It follows that $T^\pi$ is a contraction in $L_{2,\mu^\pi}$, i.e.,
$$\|T^\pi V_1 - T^\pi V_2\|_{\mu^\pi} = \gamma \|P^\pi (V_1 - V_2)\|_{\mu^\pi} \le \gamma \|V_1 - V_2\|_{\mu^\pi}.$$
Thus $\Pi_{\mu^\pi} T^\pi$, the composition of a non-expansion and a contraction, is itself a contraction in $L_{2,\mu^\pi}$ and admits the unique fixed point $V_{TD} = \Pi_{\mu^\pi} T^\pi V_{TD}$.
Proof (cont'd). By the Pythagorean theorem,
$$\|V^\pi - V_{TD}\|^2_{\mu^\pi} = \|V^\pi - \Pi_{\mu^\pi} V^\pi\|^2_{\mu^\pi} + \|\Pi_{\mu^\pi} V^\pi - V_{TD}\|^2_{\mu^\pi},$$
but, using $V^\pi = T^\pi V^\pi$ and the fixed-point property of $V_{TD}$,
$$\|\Pi_{\mu^\pi} V^\pi - V_{TD}\|^2_{\mu^\pi} = \|\Pi_{\mu^\pi} V^\pi - \Pi_{\mu^\pi} T^\pi V_{TD}\|^2_{\mu^\pi} \le \|T^\pi V^\pi - T^\pi V_{TD}\|^2_{\mu^\pi} \le \gamma^2 \|V^\pi - V_{TD}\|^2_{\mu^\pi}.$$
Thus
$$\|V^\pi - V_{TD}\|^2_{\mu^\pi} \le \|V^\pi - \Pi_{\mu^\pi} V^\pi\|^2_{\mu^\pi} + \gamma^2 \|V^\pi - V_{TD}\|^2_{\mu^\pi},$$
which corresponds to the bound in the proposition after reordering. $\square$
Implementation from a single trajectory:
◮ Generate $(X_0, X_1, \ldots, X_n)$ by direct execution of $\pi$ and observe the rewards $R_t = r(X_t, \pi(X_t))$.
◮ Compute the estimates
$$\hat{A}_{ij} = \frac{1}{n} \sum_{t=1}^n \varphi_i(X_t) \big[ \varphi_j(X_t) - \gamma \varphi_j(X_{t+1}) \big], \qquad \hat{b}_i = \frac{1}{n} \sum_{t=1}^n \varphi_i(X_t)\, R_t.$$
◮ Solve $\hat{A} \alpha = \hat{b}$.
Remarks:
◮ No need for a generative model.
◮ If the chain is ergodic, $\hat{A} \to A$ and $\hat{b} \to b$ as $n \to \infty$.
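The estimates translate line-for-line (a sketch; `phi` returns the $d$-dimensional feature vector of a state):

```python
import numpy as np

def lstd(states, rewards, phi, gamma, d):
    """states: (X_0, ..., X_n) from executing pi; rewards[t] = r(X_t, pi(X_t))."""
    A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
    n = len(states) - 1
    for t in range(n):
        f, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f, f - gamma * f_next) / n   # hat A_ij
        b_hat += f * rewards[t] / n                    # hat b_i
    return np.linalg.solve(A_hat, b_hat)               # alpha_TD with V_alpha ~ V^pi
```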
Approximate Policy Iteration: Bellman Residual Minimization
Idea: instead of projecting, directly minimize the Bellman residual over $F$:
$$V_{BR} = \arg\min_{V \in F} \|T^\pi V - V\|_{2,\mu}.$$
[Figure: geometric view: $V_{BR}$ is the function in $F$ whose image $T^\pi V_{BR}$ stays closest to itself; it generally differs from the best approximation $\arg\min_{V \in F} \|V^\pi - V\|$.]
For a linear space $F$, the mapping $\alpha \mapsto \|T^\pi V_\alpha - V_\alpha\|^2_\mu$ is quadratic, so the minimizer is obtained by solving a linear system of dimension $d$.
Proposition
The Bellman residual controls the approximation error:
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \, (1 + \gamma \|P^\pi\|) \inf_{V \in F} \|V^\pi - V\|.$$
In particular, in the $\mu^\pi$-norm, $\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} \le \frac{1}{1-\gamma}$, thus
$$\|V^\pi - V_{BR}\|_{\mu^\pi} \le \frac{1 + \gamma}{1 - \gamma} \inf_{V \in F} \|V^\pi - V\|_{\mu^\pi}.$$
Proof. For any $V$,
$$V^\pi - V = V^\pi - T^\pi V + T^\pi V - V = \gamma P^\pi (V^\pi - V) + (T^\pi V - V),$$
so $(I - \gamma P^\pi)(V^\pi - V) = T^\pi V - V$. Taking the norm on both sides,
$$\|V^\pi - V_{BR}\| \le \|(I - \gamma P^\pi)^{-1}\| \, \|T^\pi V_{BR} - V_{BR}\|,$$
and
$$\|T^\pi V_{BR} - V_{BR}\| = \inf_{V \in F} \|T^\pi V - V\| \le (1 + \gamma \|P^\pi\|) \inf_{V \in F} \|V^\pi - V\|.$$
The inverse $(I - \gamma P^\pi)^{-1}$ can be written as the power series $\sum_{t \ge 0} \gamma^t (P^\pi)^t$. Applying the $\mu^\pi$-norm, in which $\|P^\pi\|_{\mu^\pi} \le 1$ as shown for LSTD, we obtain
$$\|(I - \gamma P^\pi)^{-1}\|_{\mu^\pi} \le \sum_{t \ge 0} \gamma^t \|P^\pi\|^t_{\mu^\pi} \le \frac{1}{1 - \gamma}.$$
Implementation with a generative model:
◮ Draw $n$ states $X_t \sim \mu$.
◮ Call the generative model on $(X_t, A_t)$ (with $A_t = \pi(X_t)$) and obtain $R_t = r(X_t, A_t)$ and $Y_t \sim p(\cdot|X_t, A_t)$.
◮ Compute the empirical residual
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \big[ \hat{T} V(X_t) - V(X_t) \big]^2, \quad \text{with } \hat{T} V(X_t) = R_t + \gamma V(Y_t).$$
Problem: this estimator is biased. With deterministic rewards, conditioning on $X_t$ gives
$$\mathbb{E}\big[ \hat{B}(V) \big] = \|T^\pi V - V\|^2_\mu + \gamma^2\, \mathbb{E}_{X \sim \mu}\big[ \mathrm{Var}\big( V(Y) \,|\, X \big) \big],$$
so minimizing $\hat{B}$ also penalizes functions with large variance under the transition noise.
Solution: double sampling. For each $X_t$, call the generative model twice to obtain two independent next states $Y_t, Y'_t \sim p(\cdot|X_t, A_t)$, and use
$$\hat{B}(V) = \frac{1}{n} \sum_{t=1}^n \big[ R_t + \gamma V(Y_t) - V(X_t) \big] \big[ R_t + \gamma V(Y'_t) - V(X_t) \big].$$
By the independence of $Y_t$ and $Y'_t$, this estimator is unbiased: $\mathbb{E}[\hat{B}(V)] = \|T^\pi V - V\|^2_\mu$.
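For a linear space the unbiased estimate stays quadratic in $\alpha$, so its minimizer still solves a $d$-dimensional linear system; a sketch (assumed helpers: `phi(x)`, `pi(x)`, `r(x, a)`, and `sample_next(x, a)` standing in for the generative model):

```python
import numpy as np

def brm_double_sampling(states, pi, r, sample_next, phi, gamma, d):
    """states: X_t ~ mu. Minimizes the unbiased two-sample Bellman residual."""
    M, c = np.zeros((d, d)), np.zeros(d)
    for x in states:
        a = pi(x)
        y1, y2 = sample_next(x, a), sample_next(x, a)  # two independent next states
        # residual_i(alpha) = r(x,a) - alpha . u_i, with u_i = phi(x) - gamma phi(y_i)
        u1 = phi(x) - gamma * phi(y1)
        u2 = phi(x) - gamma * phi(y2)
        # setting the gradient of sum_t (r - alpha.u1)(r - alpha.u2) to zero
        # accumulates the symmetric system M alpha = c
        M += 0.5 * (np.outer(u1, u2) + np.outer(u2, u1))
        c += 0.5 * r(x, a) * (u1 + u2)
    return np.linalg.solve(M, c)
```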
LSTD vs BRM:
◮ Different assumptions: BRM requires a generative model (for the double sampling), while LSTD only requires a single trajectory generated by $\pi$.
◮ The performance is evaluated differently: BRM can use any sampling distribution $\mu$, while the LSTD guarantee holds under the stationary distribution $\mu^\pi$.
Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr