EC-RL Course
Markov Decision Processes and Dynamic Programming
- A. LAZARIC (SequeL Team @INRIA-Lille)
Ecole Centrale - Option DAD
SequeL – INRIA Lille
In This Lecture
◮ How do we formalize the agent-environment interaction?
◮ How do we solve an MDP?
Mathematical Tools
Definition (Conditional probability)
For two events A and B with P(B) > 0, the probability of A conditioned on B is
P(A|B) = P(A ∩ B) / P(B).
Definition (Law of total expectation)
For any random variables X and Y,
E[X] = E[ E[X | Y] ].
Definition
A function f : V → [0, ∞) is a norm on a vector space V if
◮ If f(v) = 0 for some v ∈ V, then v = 0.
◮ For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v).
◮ Triangle inequality: For any v, u ∈ V, f(v + u) ≤ f(v) + f(u).
◮ L_p-norm: ||v||_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}.
◮ L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.
◮ L_{µ,p}-norm: ||v||_{µ,p} = ( Σ_{i=1}^d |v_i|^p µ_i )^{1/p}.
◮ L_{µ,∞}-norm: ||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.
◮ L_{2,P}-matrix norm (P is a positive definite matrix): ||v||_P² = v^⊤ P v.
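As a concrete illustration (a sketch not present in the slides; the vector v, the weights µ and the matrix P below are arbitrary examples), these norms can be computed with NumPy:

```python
import numpy as np

v = np.array([1.0, -2.0, 3.0])
mu = np.array([0.2, 0.3, 0.5])                     # positive weights
p = 2

l_p = (np.abs(v) ** p).sum() ** (1 / p)            # L_p-norm
l_inf = np.abs(v).max()                            # L_inf-norm
l_mu_p = ((np.abs(v) ** p) * mu).sum() ** (1 / p)  # L_{mu,p}-norm
l_mu_inf = (np.abs(v) / mu).max()                  # L_{mu,inf}-norm

P = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 3.0]])                    # a positive definite matrix
l_2P_sq = v @ P @ v                                # ||v||_P^2 = v' P v
```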
Definition
A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm ||·|| to v ∈ V if lim_{n→∞} ||v_n − v|| = 0.

Definition
A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition
A vector space V equipped with a norm ||·|| is complete (a Banach space) if every Cauchy sequence in V is convergent in the norm of the space.
Definition
An operator T : V → V is L-Lipschitz if for any v, u ∈ V
||T v − T u|| ≤ L ||u − v||.
If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is,
if v_n →_{||·||} v then T v_n →_{||·||} T v.

Definition
A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.
Proposition (Banach Fixed Point Theorem)
Let V be a complete vector space equipped with the norm ||·|| and T : V → V be a γ-contraction mapping. Then
◮ T admits a unique fixed point v, i.e. a unique v ∈ V such that T v = v;
◮ for any v_0 ∈ V, the sequence v_{n+1} = T v_n converges to v, with
convergence rate: ||v_n − v|| ≤ γ^n ||v_0 − v||.
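A minimal numerical illustration (a toy example, not from the slides): iterating an affine γ-contraction on R converges to its fixed point at rate γ^n.

```python
gamma = 0.5
T = lambda v: gamma * v + 1.0        # an affine gamma-contraction on R
v_star = 1.0 / (1.0 - gamma)         # its unique fixed point: T(v_star) = v_star

v, v0 = 0.0, 0.0
for n in range(10):
    v = T(v)
# The error after n iterations is bounded by gamma^n times the initial error:
assert abs(v - v_star) <= gamma ** 10 * abs(v0 - v_star) + 1e-12
```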
◮ Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and an eigenvalue of a matrix A if A v = λ v.
◮ Eigenvalues of a matrix (2). If A has eigenvalues {λ_i}_{i=1}^N, then the matrix (I − αA) has eigenvalues {1 − αλ_i}_{i=1}^N.
◮ Matrix inversion. A can be inverted if and only if ∀i, λ_i ≠ 0.
◮ Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if all its entries are non-negative and each of its rows sums to one: Σ_{j=1}^N [P]_{i,j} = 1 for all i.
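A quick sanity check (an illustrative snippet; the matrix is an arbitrary example): the rows of a stochastic matrix sum to one, so the all-ones vector is always an eigenvector with eigenvalue 1.

```python
import numpy as np

P = np.array([[0.2, 0.8, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.3, 0.7]])
assert np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)   # P is stochastic
print(np.linalg.eigvals(P))   # contains 1, since P @ ones = ones
```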
The Markov Decision Process
[Figure: the agent–environment interaction loop — the agent perceives the state and reward from the environment, a learning/critic module processes the reward, and the agent acts back on the environment through actuation.]
Definition (Markov chain)
A discrete-time dynamic system (x_t)_{t∈N} over a state space X is a Markov chain if it satisfies the Markov property
P(x_{t+1} = x | x_t, x_{t−1}, ..., x_0) = P(x_{t+1} = x | x_t),
so that the chain is fully determined by the transition probability p(y|x) = P(x_{t+1} = y | x_t = x).
Definition (Markov decision process [1, 4, 3, 5, 2])
A Markov decision process is defined as a tuple M = (X, A, p, r) where
◮ t is the time clock,
◮ X is the state space,
◮ A is the action space,
◮ p(y|x, a) is the transition probability with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
◮ r(x, a, y) is the reward of transition (x, a, y).
Examples:
◮ Park a car
◮ Find the shortest path from home to school
◮ Schedule a fleet of trucks
Definition (Policy)
A decision rule π_t can be
◮ Deterministic: π_t : X → A,
◮ Stochastic: π_t : X → ∆(A).
A policy (strategy, plan) can be
◮ Non-stationary: π = (π_0, π_1, π_2, . . . ),
◮ Stationary (Markovian): π = (π, π, π, . . . ).
Example: the retail store management problem. At each month t, a store contains x_t items of a specific good, and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that
◮ The cost of maintaining an inventory of x items is h(x).
◮ The cost to order a items is C(a).
◮ The income for selling q items is f(q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.
◮ State space: x ∈ X = {0, 1, . . . , M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, . . . , M − x}.
◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]^+.
Problem: the dynamics should be Markov and stationary!
◮ The demand D_t is stochastic and time-independent. Formally, D_t ∼ D i.i.d.
◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
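This model can be turned into explicit arrays. The sketch below builds p(y|x, a) and the expected reward; the capacity M, the demand distribution and the functions h, C, f are illustrative assumptions, not values fixed by the slides.

```python
import numpy as np

M = 10                                   # store capacity (assumed)
demand = np.array([0.3, 0.4, 0.2, 0.1])  # assumed P(D = 0, 1, 2, 3)

h = lambda x: 0.1 * x                    # inventory maintenance cost (assumed)
C = lambda a: 1.0 * a                    # ordering cost (assumed)
f = lambda q: 2.0 * q                    # income for selling q items (assumed)

# p[x, a, y] = P(x_{t+1} = y | x_t = x, a_t = a);  r[x, a] = E[r_t | x_t = x, a_t = a]
p = np.zeros((M + 1, M + 1, M + 1))
r = np.zeros((M + 1, M + 1))
for x in range(M + 1):
    for a in range(M + 1 - x):           # a in A(x) = {0, ..., M - x}
        for d, pd in enumerate(demand):
            y = max(x + a - d, 0)        # x_{t+1} = [x_t + a_t - D_t]^+
            p[x, a, y] += pd
            r[x, a] += pd * (-C(a) - h(x + a) + f(x + a - y))  # sold items = x + a - y
```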
A driver wants to park his car as close as possible to the restaurant.
[Figure: a line of parking places T, . . . , 2, 1 in front of the restaurant; place t is free with probability p(t), and the reward for parking grows as the distance to the restaurant shrinks.]
◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking place is to the restaurant, the higher the satisfaction.
◮ If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.
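One way to make this concrete is backward induction over the driver's position (a sketch; the number of places P, the availability probabilities q and the parking rewards below are assumptions, since the slides leave them generic):

```python
import numpy as np

P = 10                                   # number of places; place P - 1 is the closest
q = np.linspace(0.9, 0.2, P)             # assumed P(place i is free): scarcer near the restaurant
reward = np.linspace(0.1, 1.0, P)        # assumed satisfaction for parking at place i

# V[i] = value on arriving at place i, before observing whether it is free.
# If the place is free the driver picks the better of parking and moving on;
# if occupied he must move on. Passing the last place means leaving (value 0).
V = np.zeros(P + 1)                      # V[P] = 0: the driver leaves
for i in reversed(range(P)):
    V[i] = q[i] * max(reward[i], V[i + 1]) + (1 - q[i]) * V[i + 1]
print(V[0])                              # optimal rule: park at a free place i iff reward[i] >= V[i+1]
```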
◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T:
V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],
where R is a reward function for the final state.
◮ Infinite time horizon with discount: the problem never terminates, but rewards that are closer in time receive a higher importance:
V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ],
with discount factor 0 ≤ γ < 1:
◮ small γ = short-term rewards, big γ = long-term rewards,
◮ for any γ ∈ [0, 1) the series always converges (for bounded rewards).
◮ Infinite time horizon with terminal state: the problem never terminates, but the agent will eventually reach a termination state:
V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],
where T is the first (random) time when the terminal state is achieved.
◮ Infinite time horizon with average reward: the problem never terminates, but the agent only focuses on the (expected) average of the rewards:
V^π(x) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].
Definition (Optimal policy and optimal value function)
A policy π* is optimal if it satisfies
π* ∈ arg max_{π∈Π} V^π,
where Π is the set of policies; the optimal value function is V* = V^{π*}.
Remark: π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy.
[Figure: a graph over states x_1, . . . , x_7 connected by Work/Rest actions, with transition probabilities (p = 0.5, 0.4, 0.3, 0.7, 0.5, 0.6, 0.9, 0.1, 1, . . . ) and rewards (r = 1, −1000, 0, −10, 100, −10, −1).]
◮ Model: all the transitions are Markov; states x_5, x_6, x_7 are terminal (absorbing) states.
◮ Setting: infinite horizon with terminal states.
◮ Objective: find the policy that maximizes the expected sum of rewards collected before reaching a terminal state.
Definition
In discounted infinite horizon problems, for any policy π, the state-action value function Q^π : X × A → R is
Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t) ∀t ≥ 1 ],
and the optimal state-action value function is Q*(x, a) = max_π Q^π(x, a).
Bellman Equations for Discounted Infinite Horizon Problems
Proposition
For any stationary policy π = (π, π, . . . ), the state value function at a state x ∈ X satisfies the Bellman equation:
V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
Proof.
For any policy π,
V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
= r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).  □
Proposition
The optimal value function V* is the solution of the Bellman optimality equation:
V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].
Proof.
For any policy π = (a, π′) (possibly non-stationary),
V*(x) = max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= max_{(a,π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ]
= max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ]
= max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].  □
Definition
For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is
(T^π W)(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),
and the optimal Bellman operator (or dynamic programming operator) is
(T W)(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].
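For a finite MDP stored as arrays p[x, a, y] and r[x, a], both operators are short NumPy functions (a sketch under these array-layout assumptions):

```python
import numpy as np

def T_pi(W, p, r, pi, gamma):
    """(T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y | x, pi(x)) W(y)."""
    xs = np.arange(W.shape[0])
    return r[xs, pi] + gamma * p[xs, pi] @ W

def T(W, p, r, gamma):
    """(T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y | x, a) W(y) ]."""
    return (r + gamma * p @ W).max(axis=1)
```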
Proposition
The Bellman operators enjoy the following properties:
1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 componentwise, then T^π W_1 ≤ T^π W_2 and T W_1 ≤ T W_2.
2. Offset: for any scalar c ∈ R, T^π(W + c·1) = T^π W + γc·1 and T(W + c·1) = T W + γc·1, where 1 is the all-ones vector.
3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,
||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,
||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.
4. Fixed point: V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π,
lim_{k→∞} (T^π)^k W = V^π and lim_{k→∞} (T)^k W = V*.
Proof.
The contraction property (3) holds since for any x ∈ X we have
|T W_1(x) − T W_2(x)|
= | max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W_2(y) ] |
≤(a) γ max_a | Σ_y p(y|x, a) W_1(y) − Σ_y p(y|x, a) W_2(y) |
≤ γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)|
≤ γ ||W_1 − W_2||_∞ max_a Σ_y p(y|x, a) = γ ||W_1 − W_2||_∞,
where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)).  □
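The contraction can also be checked empirically on a randomly generated MDP (a sketch; T here is the optimal Bellman operator implemented above):

```python
import numpy as np

rng = np.random.default_rng(0)
N, A, gamma = 5, 3, 0.9
p = rng.random((N, A, N)); p /= p.sum(axis=2, keepdims=True)   # random transition kernel
r = rng.random((N, A))

T = lambda W: (r + gamma * p @ W).max(axis=1)                  # optimal Bellman operator
W1, W2 = rng.random(N), rng.random(N)
# ||T W1 - T W2||_inf <= gamma ||W1 - W2||_inf
assert np.abs(T(W1) - T(W2)).max() <= gamma * np.abs(W1 - W2).max() + 1e-12
```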
Bellman Equations for Undiscounted Infinite Horizon Problems
Definition
A stationary policy π is proper if there exists an integer n ∈ N such that from any initial state x ∈ X the probability of achieving the terminal state x̄ after n steps is strictly positive. That is,
ρ_π = max_x P(x_n ≠ x̄ | x_0 = x, π) < 1.
Proposition
For any proper policy π (with parameters n and ρ_π), the value function V^π is bounded:
||V^π||_∞ ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋} < ∞.
Proof.
By definition of a proper policy,
P(x_{2n} ≠ x̄ | x_0 = x, π) = P(x_{2n} ≠ x̄ | x_n ≠ x̄, π) × P(x_n ≠ x̄ | x_0 = x, π) ≤ ρ_π².
Then for any t ∈ N,
P(x_t ≠ x̄ | x_0 = x, π) ≤ ρ_π^{⌊t/n⌋},
which implies that the terminal state x̄ is eventually achieved with probability 1. Then
||V^π||_∞ = max_{x∈X} | E[ Σ_{t=0}^{∞} r(x_t, π(x_t)) | x_0 = x; π ] |
≤ r_max Σ_{t≥0} P(x_t ≠ x̄ | x_0 = x, π) ≤ n r_max + r_max Σ_{t≥n} ρ_π^{⌊t/n⌋}.  □
Proposition ([2])
Assume there exists at least one proper policy and that for every improper policy π there exists at least one state x where V^π(x) = −∞. Then the optimal value function V* is finite and it is the unique fixed point of the optimal Bellman operator
T W(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) W(y) ].
Proposition
If all policies are proper, then there exist a vector µ ∈ R^N with µ(x) ≥ 1 and a scalar β < 1 such that T and T^π are β-contractions in the weighted norm ||·||_{∞,µ}.
Proof.
Let µ(x) be the maximum (over policies) of the average time to the terminal state starting from x. This can easily be cast as an MDP where for any action and any state the rewards are 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, µ is finite and it is the solution of the dynamic programming equation
µ(x) = 1 + max_a Σ_y p(y|x, a) µ(y).
Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σ_y p(y|x, a) µ(y). Furthermore,
Σ_y p(y|x, a) µ(y) ≤ µ(x) − 1 ≤ β µ(x),
for β = max_x (µ(x) − 1) / µ(x) < 1.
Proof (cont'd).
From this definition of µ and β we obtain the contraction property of T (and similarly of T^π) in the norm L_{∞,µ}:
||T W_1 − T W_2||_{∞,µ} = max_x |T W_1(x) − T W_2(x)| / µ(x)
≤ max_{x,a} [ Σ_y p(y|x, a) |W_1(y) − W_2(y)| ] / µ(x)
≤ max_{x,a} [ Σ_y p(y|x, a) µ(y) ] / µ(x) · ||W_1 − W_2||_{∞,µ}
≤ β ||W_1 − W_2||_{∞,µ}.  □
Dynamic Programming
Value Iteration.
◮ Let V_0 be any vector in R^N.
◮ At each iteration k = 1, 2, . . . , K: compute V_{k+1} = T V_k.
◮ Return the greedy policy
π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].
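A compact implementation sketch (assuming the p[x, a, y] / r[x, a] array layout used earlier; the sup-norm stopping rule is one of several possible choices):

```python
import numpy as np

def value_iteration(p, r, gamma, eps=1e-8):
    V = np.zeros(r.shape[0])                 # V_0: any vector in R^N
    while True:
        Q = r + gamma * p @ V                # Q(x,a) = r(x,a) + gamma sum_y p(y|x,a) V(y)
        V_new = Q.max(axis=1)                # V_{k+1} = T V_k
        if np.abs(V_new - V).max() < eps:    # stop when the sup-norm update is tiny
            return V_new, Q.argmax(axis=1)   # value estimate and greedy policy pi_K
        V = V_new
```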
◮ From the fixed point property of T:
lim_{k→∞} V_k = V*.
◮ From the contraction property of T:
||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ γ^{k+1} ||V_0 − V*||_∞ → 0.
◮ Convergence rate. Let ε > 0 and ||r||_∞ ≤ r_max; then after at most
K = log(r_max/ε) / log(1/γ)
iterations, ||V_K − V*||_∞ ≤ ε.
Q-iteration.
◮ Compute Q_{k+1} = T Q_k.
◮ Return the greedy policy
π_K(x) ∈ arg max_{a∈A} Q_K(x, a).

Asynchronous VI.
◮ Choose a state x_k.
◮ Compute V_{k+1}(x_k) = T V_k(x_k), leaving V_{k+1}(x) = V_k(x) for all x ≠ x_k.
◮ Return the greedy policy
π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].
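A sketch of Q-iteration under the same array conventions; the Bellman operator on Q-functions, (T Q)(x, a) = r(x, a) + γ Σ_y p(y|x, a) max_b Q(y, b), is the standard one, stated here as an assumption since the slides do not spell it out:

```python
import numpy as np

def q_iteration(p, r, gamma, K):
    Q = np.zeros(r.shape)                    # Q_0: any vector in R^{N x A}
    for _ in range(K):
        Q = r + gamma * p @ Q.max(axis=1)    # Q_{k+1} = T Q_k
    return Q.argmax(axis=1)                  # greedy policy pi_K(x) = argmax_a Q_K(x, a)
```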
Policy Iteration.
◮ Let π_0 be any stationary policy.
◮ At each iteration k = 0, 1, 2, . . . :
◮ Policy evaluation: given π_k, compute V^{π_k}.
◮ Policy improvement: compute the greedy policy
π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].
◮ Stop when V^{π_{k+1}} = V^{π_k} and return the last policy.
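A sketch of PI with exact policy evaluation by a linear solve (the direct-computation method detailed a few slides below):

```python
import numpy as np

def policy_iteration(p, r, gamma):
    N = r.shape[0]
    pi = np.zeros(N, dtype=int)                  # pi_0: an arbitrary stationary policy
    while True:
        P_pi = p[np.arange(N), pi]               # [P^pi]_{x,y} = p(y | x, pi(x))
        r_pi = r[np.arange(N), pi]               # r^pi(x) = r(x, pi(x))
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)   # V^{pi_k}
        pi_new = (r + gamma * p @ V).argmax(axis=1)           # policy improvement
        if np.array_equal(pi_new, pi):           # termination: greedy policy unchanged
            return pi, V
        pi = pi_new
```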
Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance,
V^{π_{k+1}} ≥ V^{π_k},
and it converges to an optimal policy π* in a finite number of iterations.
Proof.
From the definition of the Bellman operators and of the greedy policy π_{k+1},
V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k},   (1)
and from the monotonicity property of T^{π_{k+1}} it follows that
V^{π_k} ≤ T^{π_{k+1}} V^{π_k},
T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})² V^{π_k},
. . .
(T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k},
. . .
Joining all the inequalities in the chain we obtain
V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.
Thus (V^{π_k})_k is a non-decreasing sequence.
Proof (cont'd).
Since a finite MDP admits a finite number of policies, the termination condition is eventually met for some k. Then eq. (1) holds with equality, so V^{π_k} = T V^{π_k}; by uniqueness of the fixed point of T, V^{π_k} = V*, which implies that π_k is an optimal policy.  □
◮ Direct computation. For any policy π compute
V^π = (I − γ P^π)^{−1} r^π,
where [P^π]_{x,y} = p(y|x, π(x)) and r^π(x) = r(x, π(x)).
Complexity: O(N³) (improvable to O(N^{2.807})). Exercise: prove the previous equality.
◮ Iterative policy evaluation. For any policy π,
lim_{n→∞} (T^π)^n V_0 = V^π.
Complexity: an ε-approximation of V^π requires O(N² log(1/ε) / log(1/γ)) steps.
◮ Monte-Carlo simulation. In each state x, simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π and compute
V̂^π(x) ≈ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)).
Complexity: in each state, the approximation error is O(1/√n).
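The first and third methods side by side (a sketch; the trajectory horizon is a truncation choice, justified because γ^t makes the tail of the return negligible):

```python
import numpy as np

def evaluate_direct(P_pi, r_pi, gamma):
    """V^pi = (I - gamma P^pi)^{-1} r^pi via a linear solve (O(N^3))."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

def evaluate_mc(P_pi, r_pi, gamma, x, n=1000, horizon=200, seed=0):
    """Monte-Carlo estimate of V^pi(x): average of n truncated discounted returns."""
    rng = np.random.default_rng(seed)
    N, total = len(r_pi), 0.0
    for _ in range(n):
        s, ret, disc = x, 0.0, 1.0
        for _ in range(horizon):
            ret += disc * r_pi[s]
            disc *= gamma
            s = rng.choice(N, p=P_pi[s])     # sample x_{t+1} ~ p(. | x_t, pi(x_t))
        total += ret
    return total / n                          # approximation error O(1/sqrt(n))
```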
◮ If the policy is evaluated with V, then the policy improvement requires the knowledge of the transition probabilities p:
π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].
◮ If the policy is evaluated with Q, then the policy improvement is model-free:
π_{k+1}(x) ∈ arg max_{a∈A} Q^{π_k}(x, a).
Value iteration:
◮ Pros: each iteration is very computationally efficient.
◮ Cons: convergence is only asymptotic.
Policy iteration:
◮ Pros: converges in a finite number of iterations (often small in practice).
◮ Cons: each iteration requires a full policy evaluation, which might be expensive.
◮ Modified Policy Iteration
◮ λ-Policy Iteration
◮ Linear Programming: a one-shot approach to computing V*.
Conclusions
◮ The Markov Decision Process framework
◮ The discounted infinite horizon setting
◮ State and state-action value functions
◮ Bellman equations and Bellman operators
◮ The value and policy iteration algorithms
References
[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
[2] D.P. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] W.H. Fleming and R.W. Rishel. Deterministic and Stochastic Optimal Control. Applications of Mathematics, 1, Springer-Verlag, Berlin, New York, 1975.
[4] R.A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
[5] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.
Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr