Markov Decision Processes and Dynamic Programming, A. LAZARIC (SequeL) - PowerPoint PPT Presentation


SLIDE 1

EC-RL Course

Markov Decision Processes and Dynamic Programming

  • A. LAZARIC (SequeL Team @INRIA-Lille)

Ecole Centrale - Option DAD

SequeL – INRIA Lille

SLIDE 2

In This Lecture

  • A. LAZARIC – Markov Decision Processes and Dynamic Programming

2/81

SLIDE 4

In This Lecture

◮ How do we formalize the agent-environment interaction?

⇒ Markov Decision Process (MDP)

◮ How do we solve an MDP?

⇒ Dynamic Programming

SLIDE 5

Mathematical Tools

Outline

Mathematical Tools
The Markov Decision Process
Bellman Equations for Discounted Infinite Horizon Problems
Bellman Equations for Undiscounted Infinite Horizon Problems
Dynamic Programming
Conclusions

SLIDE 7

Mathematical Tools

Probability Theory

Definition (Conditional probability)

Given two events A and B with P(B) > 0, the conditional probability of A given B is P(A|B) = P(A ∩ B) / P(B). Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), and B has positive measure, then the conditional probability is

P(X ∈ A | Y ∈ B) = ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy / ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy.
SLIDE 8

Mathematical Tools

Probability Theory

Definition (Law of total expectation)

Given a function f and two random variables X, Y we have that

E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].
SLIDE 9

Mathematical Tools

Norms and Contractions

Definition

Given a vector space V ⊆ R^d, a function f : V → R^+_0 is a norm if and only if

◮ If f(v) = 0 for some v ∈ V, then v = 0.
◮ For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v).
◮ Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u).

SLIDE 14

Mathematical Tools

Norms and Contractions

◮ L_p-norm: ||v||_p = ( ∑_{i=1}^d |v_i|^p )^{1/p}.

◮ L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.

◮ L_{µ,p}-norm: ||v||_{µ,p} = ( ∑_{i=1}^d |v_i|^p / µ_i )^{1/p}.

◮ L_{µ,∞}-norm: ||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.

◮ L_{2,P}-matrix norm (P is a positive definite matrix): ||v||²_P = v⊤ P v.
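A quick sketch of these norms in plain Python (no dependencies); the weighted variants divide by the weights µ_i, as in the definitions above.

```python
# Sketch of the norms defined above; the weighted norms divide by mu_i.
def lp_norm(v, p):
    return sum(abs(x) ** p for x in v) ** (1 / p)

def linf_norm(v):
    return max(abs(x) for x in v)

def weighted_lp_norm(v, mu, p):
    return sum(abs(x) ** p / m for x, m in zip(v, mu)) ** (1 / p)

def weighted_linf_norm(v, mu):
    return max(abs(x) / m for x, m in zip(v, mu))

v = [3.0, -4.0]
print(lp_norm(v, 2))   # Euclidean norm: 5.0
print(linf_norm(v))    # 4.0
```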

SLIDE 17

Mathematical Tools

Norms and Contractions

Definition
A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm || · || to v ∈ V if lim_{n→∞} ||v_n − v|| = 0.

Definition
A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition
A vector space V equipped with a norm || · || is complete if every Cauchy sequence in V is convergent in the norm of the space.

SLIDE 19

Mathematical Tools

Norms and Contractions

Definition
An operator T : V → V is L-Lipschitz if for any v, u ∈ V, ||T v − T u|| ≤ L ||u − v||. If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if v_n → v in || · || then T v_n → T v in || · ||.

Definition
A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.

SLIDE 20

Mathematical Tools

Norms and Contractions

Proposition (Banach Fixed Point Theorem)
Let V be a complete vector space equipped with the norm || · || and T : V → V be a γ-contraction mapping. Then
1. T admits a unique fixed point v.
2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n → v in || · || with a geometric convergence rate: ||v_n − v|| ≤ γ^n ||v_0 − v||.
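The theorem can be checked numerically on a toy contraction; the map T(v) = γv + 1 below is an illustrative choice, not one from the lecture.

```python
# Fixed-point iteration for a gamma-contraction on R: T(v) = gamma*v + 1.
# Its unique fixed point is v = 1/(1 - gamma); the error shrinks as gamma^n.
gamma = 0.5
T = lambda v: gamma * v + 1.0
v_star = 1.0 / (1.0 - gamma)  # = 2.0

v, errors = 0.0, []
for _ in range(10):
    v = T(v)
    errors.append(abs(v - v_star))

# Geometric convergence: each error is gamma times the previous one.
print(v)                      # close to 2.0
print(errors[1] / errors[0])  # 0.5
```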

SLIDE 23

Mathematical Tools

Linear Algebra

Given a square matrix A ∈ RN×N:

◮ Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and eigenvalue of A if Av = λv.

◮ Eigenvalues of a matrix (2). If A has eigenvalues {λ_i}_{i=1}^N, then B = (I − αA) has eigenvalues {µ_i} with µ_i = 1 − αλ_i.

◮ Matrix inversion. A can be inverted if and only if ∀i, λ_i ≠ 0.

SLIDE 24

Mathematical Tools

Linear Algebra

◮ Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if
1. all entries are non-negative, ∀i, j, [P]_{i,j} ≥ 0,
2. all the rows sum to one, ∀i, ∑_{j=1}^N [P]_{i,j} = 1.

All the eigenvalues of a stochastic matrix are bounded by 1, i.e., ∀i, |λ_i| ≤ 1.
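A consequence worth checking numerically: a row-stochastic matrix maps probability distributions to probability distributions. The 2×2 matrix below is an illustrative example, not one from the lecture.

```python
# A row-stochastic matrix maps distributions to distributions:
# mu' = mu P keeps entries non-negative and summing to one.
P = [[0.9, 0.1],
     [0.5, 0.5]]

def step(mu, P):
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

mu = [1.0, 0.0]
for _ in range(50):
    mu = step(mu, P)
# mu approaches the stationary distribution of P; its entries still sum to 1.
print([round(x, 3) for x in mu], sum(mu))
```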

SLIDE 25

The Markov Decision Process

Outline

Mathematical Tools
The Markov Decision Process
Bellman Equations for Discounted Infinite Horizon Problems
Bellman Equations for Undiscounted Infinite Horizon Problems
Dynamic Programming
Conclusions

SLIDE 26

The Markov Decision Process

The Reinforcement Learning Model

[Figure: the agent-environment loop, in which the agent sends an action (actuation) to the environment and receives a state (perception) and a reward (critic).]

SLIDE 27

The Markov Decision Process

Markov Chains

Definition (Markov chain)

Let the state space X be a bounded compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if it satisfies the Markov property

P(x_{t+1} = x | x_t, x_{t−1}, . . . , x_0) = P(x_{t+1} = x | x_t).

Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p with p(y|x) = P(x_{t+1} = y | x_t = x).

SLIDE 28

The Markov Decision Process

Example: Weather prediction

Informal definition: we want to describe how the weather evolves over time.

⇒ Board!
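A minimal simulation of such a weather chain; the states and transition probabilities below are illustrative assumptions (the lecture works the example on the board).

```python
# Simulating a two-state weather Markov chain with made-up probabilities.
import random

random.seed(0)
p = {"sunny": {"sunny": 0.8, "rainy": 0.2},
     "rainy": {"sunny": 0.4, "rainy": 0.6}}

def simulate(x0, steps):
    x, trajectory = x0, [x0]
    for _ in range(steps):
        # The next state depends only on the current one (Markov property).
        x = random.choices(list(p[x]), weights=list(p[x].values()))[0]
        trajectory.append(x)
    return trajectory

print(simulate("sunny", 5))
```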

SLIDE 34

The Markov Decision Process

Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2])

A Markov decision process is defined as a tuple M = (X, A, p, r) where

◮ t is the time clock,
◮ X is the state space,
◮ A is the action space,
◮ p(y|x, a) is the transition probability with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
◮ r(x, a, y) is the reward of transition (x, a, y).
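The tuple M = (X, A, p, r) can be written out directly as data; the two-state, two-action MDP below is an illustrative sketch, not an example from the lecture.

```python
# A finite MDP M = (X, A, p, r) as plain data (illustrative numbers).
X = [0, 1]
A = ["stay", "go"]
# p[(x, a)] is the distribution over next states y.
p = {(0, "stay"): {0: 1.0}, (0, "go"): {0: 0.2, 1: 0.8},
     (1, "stay"): {1: 1.0}, (1, "go"): {0: 0.8, 1: 0.2}}
# r[(x, a)] is the (expected) reward of taking action a in state x.
r = {(0, "stay"): 0.0, (0, "go"): 1.0, (1, "stay"): 2.0, (1, "go"): 0.0}

# Sanity check: every conditional distribution p(.|x, a) sums to one.
assert all(abs(sum(dist.values()) - 1.0) < 1e-9 for dist in p.values())
print("valid MDP over", len(X), "states and", len(A), "actions")
```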

SLIDE 35

The Markov Decision Process

Examples

◮ Park a car
◮ Find the shortest path from home to school
◮ Schedule a fleet of trucks

SLIDE 38

The Markov Decision Process

Policy

Definition (Policy)

A decision rule π_t can be
◮ Deterministic: π_t : X → A,
◮ Stochastic: π_t : X → ∆(A).

A policy (strategy, plan) can be
◮ Non-stationary: π = (π_0, π_1, π_2, . . . ),
◮ Stationary (Markovian): π = (π, π, π, . . . ).

Remark: an MDP M plus a stationary policy π gives a Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)).

SLIDE 39

The Markov Decision Process

Question

Is the MDP formalism powerful enough? ⇒ Let’s try!

SLIDE 40

The Markov Decision Process

Example: the Retail Store Management Problem

• Description. At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from the supplier. Furthermore we know that

◮ The cost of maintaining an inventory of x is h(x).
◮ The cost to order a items is C(a).
◮ The income for selling q items is f(q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.

SLIDE 45

The Markov Decision Process

Example: the Retail Store Management Problem

◮ State space: x ∈ X = {0, 1, . . . , M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, . . . , M − x}.
◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]^+.

Problem: the dynamics should be Markov and stationary!

◮ The demand D_t is stochastic and time-independent. Formally, D_t ∼ D i.i.d.
◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
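A sketch of one step of this dynamics in Python; the capacity, the cost and income functions, and the demand distribution are illustrative placeholders.

```python
# One step of the retail-store MDP: x' = [x + a - D]^+ with reward
# r = -C(a) - h(x + a) + f([x + a - x']^+). All numbers are placeholders.
import random

M = 20                         # store capacity (assumed)
C = lambda a: 2.0 * a          # ordering cost (assumed)
h = lambda x: 0.5 * x          # inventory cost (assumed)
f = lambda q: 5.0 * q          # income from q sold items (assumed)

def step(x, a, rng):
    D = rng.randint(0, 10)     # i.i.d. demand, time-independent
    x_next = max(x + a - D, 0)
    sold = x + a - x_next      # equals min(x + a, D)
    reward = -C(a) - h(x + a) + f(sold)
    return x_next, reward

rng = random.Random(0)
x_next, r_t = step(5, 3, rng)
print(x_next, r_t)
```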

SLIDE 47

The Markov Decision Process

Exercise: the Parking Problem

A driver wants to park his car as close as possible to the restaurant.

[Figure: a line of parking places T, . . . , 2, 1 in front of the restaurant; place t is free with probability p(t), with a reward for parking there and reward 0 for passing the restaurant without parking.]

◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking place is to the restaurant, the higher the satisfaction.
◮ If the driver doesn’t park anywhere, then he/she leaves the restaurant and has to find another one.

SLIDE 48

The Markov Decision Process

Question

How do we evaluate a policy and compare two policies? ⇒ Value function!

SLIDE 52

The Markov Decision Process

Optimization over Time Horizon

◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.

◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.

◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state.

◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards.

SLIDE 53

The Markov Decision Process

State Value Function

◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.

V^π(t, x) = E[ ∑_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],

where R is a value function for the final state.

SLIDE 54

The Markov Decision Process

State Value Function

◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.

V^π(x) = E[ ∑_{t=0}^∞ γ^t r(x_t, π(x_t)) | x_0 = x; π ],

with discount factor 0 ≤ γ < 1:
◮ small γ = short-term rewards, big γ = long-term rewards,
◮ for any γ ∈ [0, 1) the series always converges (for bounded rewards).
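Since rewards are bounded, truncating the series after T terms leaves a tail of at most γ^T r_max / (1 − γ), so the discounted return can be approximated with a finite sum. The constant-reward trajectory below is an arbitrary illustration.

```python
# Truncated discounted return on a constant reward-1 trajectory.
# The infinite series would sum to 1 / (1 - gamma) = 10; the tail after
# 200 terms is at most gamma**200 / (1 - gamma), which is negligible.
gamma = 0.9
rewards = [1.0] * 200

v = sum(gamma ** t * r for t, r in enumerate(rewards))
print(round(v, 3))  # close to 10.0
```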

SLIDE 55

The Markov Decision Process

State Value Function

◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state.

V^π(x) = E[ ∑_{t=0}^T r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the termination state is achieved.

SLIDE 56

The Markov Decision Process

State Value Function

◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards.

V^π(x) = lim_{T→∞} E[ (1/T) ∑_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].
SLIDE 58

The Markov Decision Process

State Value Function

Technical note: the expectations refer to all possible stochastic trajectories. A non-stationary policy π applied from state x_0 returns (x_0, r_0, x_1, r_1, x_2, r_2, . . .), where r_t = r(x_t, π_t(x_t)) and x_t ∼ p(·|x_{t−1}, a_{t−1} = π_{t−1}(x_{t−1})) are random realizations. The value function (discounted infinite horizon) is

V^π(x) = E_{(x_1,x_2,...)}[ ∑_{t=0}^∞ γ^t r(x_t, π_t(x_t)) | x_0 = x; π ].
SLIDE 61

The Markov Decision Process

Optimal Value Function

Definition (Optimal policy and optimal value function)

The solution to an MDP is an optimal policy π* satisfying π* ∈ arg max_{π∈Π} V^π in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.

Remark: π∗ ∈ arg max(·) and not π∗ = arg max(·) because an MDP may admit more than one optimal policy.

SLIDE 62

The Markov Decision Process

Example: the EC student dilemma

[Figure: transition diagram of the student dilemma over states 1 to 7, with a Work/Rest choice in each non-terminal state, transition probabilities p = 0.5, 0.4, 0.3, 0.7, 0.5, 0.6, 0.9, 0.1, 1, and rewards r = 1, 0, −1, −10, 100, −10, −1000.]

SLIDE 63

The Markov Decision Process

Example: the EC student dilemma

◮ Model: all the transitions are Markov; states x_5, x_6, x_7 are terminal.
◮ Setting: infinite horizon with terminal states.
◮ Objective: find the policy that maximizes the expected sum of rewards before achieving a terminal state.

SLIDE 64

The Markov Decision Process

Example: the EC student dilemma

[Figure: the same student-dilemma diagram annotated with the value of each state: V1 = 88.3, V2 = 88.3, V3 = 86.9, V4 = 88.9, V5 = −10, V6 = 100, V7 = −1000.]

SLIDE 65

The Markov Decision Process

Example: the EC student dilemma

V_7 = −1000
V_6 = 100
V_5 = −10
V_4 = −10 + 0.9 V_6 + 0.1 V_4 ≃ 88.9
V_3 = −1 + 0.5 V_4 + 0.5 V_3 ≃ 86.9
V_2 = 1 + 0.7 V_3 + 0.3 V_1
V_1 = max{0.5 V_2 + 0.5 V_1, 0.5 V_3 + 0.5 V_1}
V_1 = V_2 = 88.3
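These equations can be solved by simple fixed-point iteration, using the probabilities and rewards of the diagram:

```python
# Fixed-point iteration for the student-dilemma values; terminal states
# 5, 6, 7 have fixed values, and state 1 takes the max over Work/Rest.
V = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: -10.0, 6: 100.0, 7: -1000.0}
for _ in range(200):
    V[4] = -10 + 0.9 * V[6] + 0.1 * V[4]
    V[3] = -1 + 0.5 * V[4] + 0.5 * V[3]
    V[2] = 1 + 0.7 * V[3] + 0.3 * V[1]
    V[1] = max(0.5 * V[2] + 0.5 * V[1], 0.5 * V[3] + 0.5 * V[1])

# Matches the slide: V1 = V2 = 88.3, V3 = 86.9, V4 = 88.9.
print({s: round(v, 1) for s, v in sorted(V.items())})
```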

SLIDE 66

The Markov Decision Process

State-Action Value Function

Definition

In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is

Q^π(x, a) = E[ ∑_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is

Q*(x, a) = max_π Q^π(x, a).

SLIDE 67

The Markov Decision Process

State-Action Value Function

The relationships between the V-function and the Q-function are:

Q^π(x, a) = r(x, a) + γ ∑_{y∈X} p(y|x, a) V^π(y),
V^π(x) = Q^π(x, π(x)),
Q*(x, a) = r(x, a) + γ ∑_{y∈X} p(y|x, a) V*(y),
V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a).

SLIDE 68

Bellman Equations for Discounted Infinite Horizon Problems

Outline

Mathematical Tools
The Markov Decision Process
Bellman Equations for Discounted Infinite Horizon Problems
Bellman Equations for Undiscounted Infinite Horizon Problems
Dynamic Programming
Conclusions

SLIDE 69

Bellman Equations for Discounted Infinite Horizon Problems

Question

Is there any more compact way to describe a value function? ⇒ Bellman equations!

SLIDE 70

Bellman Equations for Discounted Infinite Horizon Problems

The Bellman Equation

Proposition

For any stationary policy π = (π, π, . . . ), the state value function at a state x ∈ X satisfies the Bellman equation:

V^π(x) = r(x, π(x)) + γ ∑_y p(y|x, π(x)) V^π(y).

SLIDE 71

Bellman Equations for Discounted Infinite Horizon Problems

The Bellman Equation

Proof. For any policy π,

V^π(x) = E[ ∑_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= r(x, π(x)) + E[ ∑_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
= r(x, π(x)) + γ ∑_y P(x_1 = y | x_0 = x; π(x_0)) E[ ∑_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
= r(x, π(x)) + γ ∑_y p(y|x, π(x)) V^π(y).

SLIDE 72

Bellman Equations for Discounted Infinite Horizon Problems

The Optimal Bellman Equation

Bellman’s Principle of Optimality [1]: “An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”

SLIDE 73

Bellman Equations for Discounted Infinite Horizon Problems

The Optimal Bellman Equation

Proposition

The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

V*(x) = max_{a∈A} [ r(x, a) + γ ∑_y p(y|x, a) V*(y) ],

and the optimal policy is

π*(x) = arg max_{a∈A} [ r(x, a) + γ ∑_y p(y|x, a) V*(y) ].
SLIDE 74

Bellman Equations for Discounted Infinite Horizon Problems

The Optimal Bellman Equation

Proof.

For any policy π = (a, π′) (possibly non-stationary),

V*(x) (a)= max_π E[ ∑_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
(b)= max_{(a,π′)} [ r(x, a) + γ ∑_y p(y|x, a) V^{π′}(y) ]
(c)= max_a [ r(x, a) + γ ∑_y p(y|x, a) max_{π′} V^{π′}(y) ]
(d)= max_a [ r(x, a) + γ ∑_y p(y|x, a) V*(y) ].
SLIDE 75

Bellman Equations for Discounted Infinite Horizon Problems

The Bellman Operators

• Notation. W.l.o.g. we consider a discrete state space |X| = N, so that V^π ∈ R^N.

Definition

For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is

T^π W(x) = r(x, π(x)) + γ ∑_y p(y|x, π(x)) W(y),

and the optimal Bellman operator (or dynamic programming operator) is

T W(x) = max_{a∈A} [ r(x, a) + γ ∑_y p(y|x, a) W(y) ].
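Both operators are direct to transcribe for a small finite MDP; the two-state example (p, r, γ = 0.9) below is illustrative, not one from the lecture.

```python
# The Bellman operators as functions on vectors W in R^N (N = 2 here).
gamma = 0.9
A = [0, 1]
# p[(x, a)] lists p(y|x, a) for y = 0, 1; r[(x, a)] is the reward.
p = {(0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8],
     (1, 0): [0.0, 1.0], (1, 1): [0.8, 0.2]}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}

def T_pi(W, pi):
    # T^pi W(x) = r(x, pi(x)) + gamma * sum_y p(y|x, pi(x)) W(y)
    return [r[x, pi[x]] + gamma * sum(q * w for q, w in zip(p[x, pi[x]], W))
            for x in range(2)]

def T(W):
    # T W(x) = max_a [ r(x, a) + gamma * sum_y p(y|x, a) W(y) ]
    return [max(r[x, a] + gamma * sum(q * w for q, w in zip(p[x, a], W))
                for a in A)
            for x in range(2)]

print(T([0.0, 0.0]))  # one application of the optimal Bellman operator
```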
SLIDE 77

Bellman Equations for Discounted Infinite Horizon Problems

The Bellman Operators

Proposition

Properties of the Bellman operators
1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then T^π W_1 ≤ T^π W_2 and T W_1 ≤ T W_2.
2. Offset: for any scalar c ∈ R, T^π(W + c I_N) = T^π W + γ c I_N and T(W + c I_N) = T W + γ c I_N.

SLIDE 80

Bellman Equations for Discounted Infinite Horizon Problems

The Bellman Operators

Proposition

3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,
||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,
||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.

4. Fixed point: for any policy π, V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π,
lim_{k→∞} (T^π)^k W = V^π,  lim_{k→∞} T^k W = V*.
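The fixed-point property suggests value iteration: apply T repeatedly from any W. A sketch on a two-state toy MDP (illustrative numbers, not from the lecture):

```python
# Value iteration: repeated application of the optimal Bellman operator T
# converges to its unique fixed point V* (contraction with factor gamma).
gamma = 0.9
A = [0, 1]
p = {(0, 0): [1.0, 0.0], (0, 1): [0.2, 0.8],
     (1, 0): [0.0, 1.0], (1, 1): [0.8, 0.2]}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0}

def T(W):
    return [max(r[x, a] + gamma * sum(q * w for q, w in zip(p[x, a], W))
                for a in A)
            for x in range(2)]

W = [0.0, 0.0]
for _ in range(500):
    W = T(W)

# W is now (numerically) the fixed point: T W = W.
assert max(abs(a - b) for a, b in zip(W, T(W))) < 1e-9
print([round(v, 2) for v in W])
```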

SLIDE 81

Bellman Equations for Discounted Infinite Horizon Problems

The Bellman Equation

Proof. The contraction property (3) holds since for any x ∈ X we have

|T W_1(x) − T W_2(x)|
= | max_a [ r(x, a) + γ ∑_y p(y|x, a) W_1(y) ] − max_{a′} [ r(x, a′) + γ ∑_y p(y|x, a′) W_2(y) ] |
(a)≤ max_a | [ r(x, a) + γ ∑_y p(y|x, a) W_1(y) ] − [ r(x, a) + γ ∑_y p(y|x, a) W_2(y) ] |
= γ max_a ∑_y p(y|x, a) |W_1(y) − W_2(y)|
≤ γ ||W_1 − W_2||_∞ max_a ∑_y p(y|x, a) = γ ||W_1 − W_2||_∞,

where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)).

SLIDE 82

Bellman Equations for Discounted Infinite Horizon Problems

Exercise: Fixed Point

Revise the Banach fixed point theorem and prove the fixed point property of the Bellman operator.

SLIDE 83

Bellman Equations for Undiscounted Infinite Horizon Problems

Outline

Mathematical Tools
The Markov Decision Process
Bellman Equations for Discounted Infinite Horizon Problems
Bellman Equations for Undiscounted Infinite Horizon Problems
Dynamic Programming
Conclusions

SLIDE 84

Bellman Equations for Undiscounted Infinite Horizon Problems

Question

Is there any more compact way to describe a value function when we consider an infinite horizon with no discount? ⇒ Proper policies and Bellman equations!

SLIDE 85

Bellman Equations for Undiscounted Infinite Horizon Problems

The Undiscounted Infinite Horizon Setting

The value function is

V π(x) = E[ Σ_{t=0}^{T} r(xt, π(xt)) | x0 = x; π ],

where T is the first random time when the agent reaches a terminal state.

SLIDE 86

Bellman Equations for Undiscounted Infinite Horizon Problems

Proper Policies

Definition

A stationary policy π is proper if ∃n ∈ N such that ∀x ∈ X the probability of reaching the terminal state ¯x within n steps is strictly positive. That is,

ρπ = maxx P(xn ≠ ¯x | x0 = x, π) < 1.

SLIDE 87

Bellman Equations for Undiscounted Infinite Horizon Problems

Bounded Value Function

Proposition

For any proper policy π with parameter ρπ after n steps, the value function is bounded as

||V π||∞ ≤ rmax Σ_{t≥0} ρπ^{⌊t/n⌋}.

SLIDE 88

Bellman Equations for Undiscounted Infinite Horizon Problems

The Undiscounted Infinite Horizon Setting

Proof. By definition of proper policy,

P(x2n ≠ ¯x | x0 = x, π) = P(x2n ≠ ¯x | xn ≠ ¯x, π) × P(xn ≠ ¯x | x0 = x, π) ≤ ρπ^2.

Then for any t ∈ N, P(xt ≠ ¯x | x0 = x, π) ≤ ρπ^{⌊t/n⌋}, which implies that the terminal state ¯x is eventually reached with probability 1. Then

||V π||∞ = maxx∈X |E[ Σ_{t≥0} r(xt, π(xt)) | x0 = x; π ]| ≤ rmax Σ_{t≥0} P(xt ≠ ¯x | x0 = x, π) ≤ nrmax + rmax Σ_{t≥n} ρπ^{⌊t/n⌋}.
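The geometric series in the bound collapses into n identical blocks, giving the closed form n·rmax/(1 − ρπ). A quick numerical check, with arbitrary illustrative values for ρπ, n and rmax:

```python
# Check that rmax * sum_{t>=0} rho^{floor(t/n)} equals n * rmax / (1 - rho):
# the sum groups into blocks of n equal terms, one block per power of rho.
# rho, n, rmax below are arbitrary illustrative values.
rho, n, rmax = 0.5, 3, 2.0
series = sum(rho ** (t // n) for t in range(10_000))  # truncated tail is negligible
closed_form = n * rmax / (1 - rho)
gap = abs(rmax * series - closed_form)
```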

SLIDE 89

Bellman Equations for Undiscounted Infinite Horizon Problems

Bellman Operator

  • Assumption. There exists at least one proper policy, and for any non-proper policy π there exists at least one state x where V π(x) = −∞ (cycles with only negative rewards).

Proposition ([2])

Under the previous assumption, the optimal value function is bounded, i.e., ||V ∗||∞ < ∞, and it is the unique fixed point of the optimal Bellman operator T defined, for any vector W ∈ RN, by

T W (x) = maxa∈A [r(x, a) + Σy p(y|x, a)W (y)].

Furthermore, V ∗ = lim_{k→∞} (T )^k W.

SLIDE 90

Bellman Equations for Undiscounted Infinite Horizon Problems

Bellman Operator

Proposition

Let all policies π be proper. Then there exist a vector µ ∈ RN with µ > 0 and a scalar β < 1 such that, ∀x ∈ X, ∀a ∈ A,

Σy p(y|x, a)µ(y) ≤ βµ(x).

Thus both operators T and T π are contractions in the weighted norm L∞,µ, that is,

||T W1 − T W2||∞,µ ≤ β||W1 − W2||∞,µ.

SLIDE 91

Bellman Equations for Undiscounted Infinite Horizon Problems

Bellman Operator

Proof. Let µ(x) be the maximum (over policies) of the average time to reach the termination state from x. This can be cast as the MDP in which every reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all policies are proper, µ is finite and it is the solution to the dynamic programming equation

µ(x) = 1 + maxa Σy p(y|x, a)µ(y).

Then µ(x) ≥ 1 and, for any a ∈ A, µ(x) ≥ 1 + Σy p(y|x, a)µ(y). Furthermore,

Σy p(y|x, a)µ(y) ≤ µ(x) − 1 ≤ βµ(x), for β = maxx (µ(x) − 1)/µ(x) < 1.

SLIDE 92

Bellman Equations for Undiscounted Infinite Horizon Problems

Bellman Operator

Proof (cont’d). From this definition of µ and β we obtain the contraction property of T (and similarly of T π) in the norm L∞,µ:

||T W1 − T W2||∞,µ = maxx |T W1(x) − T W2(x)|/µ(x)
≤ maxx,a Σy p(y|x, a)|W1(y) − W2(y)|/µ(x)
≤ maxx,a [Σy p(y|x, a)µ(y)/µ(x)] ||W1 − W2||∞,µ
≤ β||W1 − W2||∞,µ.

SLIDE 93

Dynamic Programming

Outline

Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions

SLIDE 94

Dynamic Programming

Question

How do we compute the value functions / solve an MDP? ⇒ Value/Policy Iteration algorithms!

SLIDE 96

Dynamic Programming

System of Equations

The Bellman equation

V π(x) = r(x, π(x)) + γ Σy p(y|x, π(x))V π(y)

is a linear system of equations with N unknowns and N linear constraints. The optimal Bellman equation

V ∗(x) = maxa∈A [r(x, a) + γ Σy p(y|x, a)V ∗(y)]

is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator).
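For a fixed policy the linear system can be solved directly. Below is a sketch on a 2-state chain, with made-up transition rows and rewards, using Cramer's rule for the 2×2 system (I − γPπ)V = rπ:

```python
# Solve the Bellman equation for a fixed policy as the 2x2 linear system
# (I - gamma * P_pi) V = r_pi.  P and rpi are illustrative toy numbers.
gamma = 0.9
P = [[0.8, 0.2],     # P[x][y] = p(y | x, pi(x)) for the chosen policy
     [0.0, 1.0]]
rpi = [1.0, 2.0]     # rpi[x] = r(x, pi(x))

# Coefficients of the 2x2 system (I - gamma P) V = rpi, solved by Cramer's rule.
a, b = 1 - gamma * P[0][0], -gamma * P[0][1]
c, d = -gamma * P[1][0], 1 - gamma * P[1][1]
det = a * d - b * c
V = [(rpi[0] * d - b * rpi[1]) / det,
     (a * rpi[1] - c * rpi[0]) / det]

# Check that V solves the Bellman equation, i.e., it is the fixed point of T^pi.
for x in range(2):
    assert abs(V[x] - (rpi[x] + gamma * sum(P[x][y] * V[y] for y in range(2)))) < 1e-10
```

For larger N one would use a general linear solver instead of Cramer's rule, which is exactly the "direct computation" option discussed later for policy evaluation.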

SLIDE 100

Dynamic Programming

Value Iteration: the Idea

  • 1. Let V0 be any vector in RN
  • 2. At each iteration k = 1, 2, . . . , K

◮ Compute Vk+1 = T Vk

  • 3. Return the greedy policy

πK(x) ∈ arg maxa∈A [r(x, a) + γ Σy p(y|x, a)VK(y)].
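The three steps above can be sketched in a few lines; the MDP below (`p`, `r`, `gamma`) is a made-up 2-state, 2-action example:

```python
# Minimal value iteration sketch on a toy 2-state, 2-action MDP
# (all transition probabilities and rewards are illustrative).
gamma = 0.9
p = [[[0.8, 0.2], [0.1, 0.9]],   # p[x][a][y] = p(y | x, a)
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0],                 # r[x][a]
     [0.5, 2.0]]

def bellman(V):
    """One application of the optimal Bellman operator T."""
    return [max(r[x][a] + gamma * sum(p[x][a][y] * V[y] for y in range(2))
                for a in range(2)) for x in range(2)]

# 1. start from any vector; 2. iterate V_{k+1} = T V_k
V = [0.0, 0.0]
for _ in range(200):
    V = bellman(V)

# 3. return the greedy policy w.r.t. the final value estimate
greedy = [max(range(2), key=lambda a: r[x][a] + gamma *
              sum(p[x][a][y] * V[y] for y in range(2)))
          for x in range(2)]
```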
SLIDE 103

Dynamic Programming

Value Iteration: the Guarantees

◮ From the fixed point property of T :

lim_{k→∞} Vk = V ∗

◮ From the contraction property of T :

||Vk+1 − V ∗||∞ = ||T Vk − T V ∗||∞ ≤ γ||Vk − V ∗||∞ ≤ γ^{k+1}||V0 − V ∗||∞ → 0

◮ Convergence rate. Let ǫ > 0 and ||r||∞ ≤ rmax. Then after at most K = log(rmax/ǫ)/log(1/γ) iterations, ||VK − V ∗||∞ ≤ ǫ.
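Plugging in illustrative values shows how K behaves; the check below assumes ||V0 − V ∗||∞ ≤ rmax, which is the hypothesis under which the formula above yields γ^K rmax ≤ ǫ:

```python
# Convergence-rate formula K = log(rmax/eps) / log(1/gamma),
# with illustrative values of gamma, rmax and eps.
import math

gamma, rmax, eps = 0.9, 1.0, 1e-3
K = math.ceil(math.log(rmax / eps) / math.log(1 / gamma))
# The contraction argument then guarantees gamma^K * rmax <= eps.
guarantee = gamma ** K * rmax
```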

SLIDE 104

Dynamic Programming

Value Iteration: the Complexity

One application of the optimal Bellman operator takes O(N^2|A|) operations.
SLIDE 106

Dynamic Programming

Value Iteration: Extensions and Implementations

Q-iteration.

  • 1. Let Q0 be any Q-function
  • 2. At each iteration k = 1, 2, . . . , K

◮ Compute Qk+1 = T Qk

  • 3. Return the greedy policy

πK(x) ∈ arg maxa∈A QK(x, a).

Asynchronous VI.

  • 1. Let V0 be any vector in RN
  • 2. At each iteration k = 1, 2, . . . , K

◮ Choose a state xk
◮ Compute Vk+1(xk) = T Vk(xk)

  • 3. Return the greedy policy

πK(x) ∈ arg maxa∈A [r(x, a) + γ Σy p(y|x, a)VK(y)].
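The Q-iteration variant runs the same fixed-point iteration on state-action values, after which the greedy step needs no model. A sketch on the same kind of made-up 2-state, 2-action MDP:

```python
# Q-iteration sketch: iterate the optimal Bellman operator on Q-values.
# The MDP numbers (p, r, gamma) are illustrative.
gamma = 0.9
p = [[[0.8, 0.2], [0.1, 0.9]],   # p[x][a][y] = p(y | x, a)
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0],                 # r[x][a]
     [0.5, 2.0]]

def TQ(Q):
    """(T Q)(x, a) = r(x, a) + gamma * sum_y p(y|x, a) * max_a' Q(y, a')."""
    return [[r[x][a] + gamma * sum(p[x][a][y] * max(Q[y]) for y in range(2))
             for a in range(2)] for x in range(2)]

Q = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    Q = TQ(Q)

# Greedy policy directly from Q, without touching the model again.
policy = [max(range(2), key=lambda a: Q[x][a]) for x in range(2)]
```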
SLIDE 112

Dynamic Programming

Policy Iteration: the Idea

  • 1. Let π0 be any stationary policy
  • 2. At each iteration k = 1, 2, . . . , K

◮ Policy evaluation given πk, compute V πk. ◮ Policy improvement: compute the greedy policy

πk+1(x) ∈ arg maxa∈A

  • r(x, a) + γ
  • y

p(y|x, a)V πk(y)

  • .
  • 3. Return the last policy πK

Remark: usually K is the smallest k such that V πk = V πk+1.
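The evaluation/improvement loop can be sketched as follows; evaluation is done here by iterating T π (one of several options discussed below), the stopping test compares successive policies, and the MDP numbers are illustrative:

```python
# Minimal policy iteration sketch on a toy 2-state, 2-action MDP
# (all transition probabilities and rewards are made up).
gamma = 0.9
p = [[[0.8, 0.2], [0.1, 0.9]],   # p[x][a][y] = p(y | x, a)
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0],                 # r[x][a]
     [0.5, 2.0]]

def evaluate(pi, iters=500):
    """Policy evaluation: V^pi = lim_k (T^pi)^k V0, approximated by iteration."""
    V = [0.0, 0.0]
    for _ in range(iters):
        V = [r[x][pi[x]] + gamma * sum(p[x][pi[x]][y] * V[y] for y in range(2))
             for x in range(2)]
    return V

def improve(V):
    """Policy improvement: the greedy policy w.r.t. V."""
    return [max(range(2), key=lambda a: r[x][a] + gamma *
                sum(p[x][a][y] * V[y] for y in range(2))) for x in range(2)]

pi = [0, 0]                      # any stationary policy
while True:
    V = evaluate(pi)
    new_pi = improve(V)
    if new_pi == pi:             # no further improvement: stop
        break
    pi = new_pi
```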

SLIDE 113

Dynamic Programming

Policy Iteration: the Guarantees

Proposition

The policy iteration algorithm generates a sequence of policies with non-decreasing performance, V πk+1 ≥ V πk, and it converges to π∗ in a finite number of iterations.

SLIDE 117

Dynamic Programming

Policy Iteration: the Guarantees

Proof. From the definition of the Bellman operators and of the greedy policy πk+1,

V πk = T πkV πk ≤ T V πk = T πk+1V πk, (1)

and from the monotonicity property of T πk+1 it follows that

V πk ≤ T πk+1V πk,
T πk+1V πk ≤ (T πk+1)^2 V πk,
. . .
(T πk+1)^{n−1} V πk ≤ (T πk+1)^n V πk, . . .

Joining all the inequalities in the chain we obtain

V πk ≤ lim_{n→∞} (T πk+1)^n V πk = V πk+1.

Then (V πk)k is a non-decreasing sequence.

SLIDE 118

Dynamic Programming

Policy Iteration: the Guarantees

Proof (cont’d). Since a finite MDP admits only a finite number of policies, the termination condition is eventually met at some iteration k. Then eq. (1) holds with equality, so V πk = T V πk; hence V πk = V ∗, which implies that πk is an optimal policy.

SLIDE 119

Dynamic Programming

Exercise: Convergence Rate

Read the more refined convergence rates in: “Improved and Generalized Upper Bounds on the Complexity of Policy Iteration” by B. Scherrer.

SLIDE 120

Dynamic Programming

Policy Iteration

  • Notation. For any policy π, the reward vector is rπ(x) = r(x, π(x)) and the transition matrix is [Pπ]x,y = p(y|x, π(x)).

SLIDE 124

Dynamic Programming

Policy Iteration: the Policy Evaluation Step

◮ Direct computation. For any policy π compute

V π = (I − γPπ)^{−1} rπ.

Complexity: O(N^3) (improvable to O(N^{2.807})). Exercise: prove the previous equality.

◮ Iterative policy evaluation. For any policy π,

lim_{n→∞} (T π)^n V0 = V π.

Complexity: an ǫ-approximation of V π requires O(N^2 log(1/ǫ)/log(1/γ)) steps.

◮ Monte-Carlo simulation. In each state x, simulate n trajectories ((x^i_t)t≥0)1≤i≤n following policy π and compute

V̂ π(x) ≃ (1/n) Σ_{i=1}^{n} Σ_{t≥0} γ^t r(x^i_t, π(x^i_t)).

Complexity: in each state, the approximation error is O(1/√n).
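A sketch of the Monte-Carlo option, with the trajectory truncated at a finite horizon since the discounted tail γ^t is negligible; the MDP numbers, the policy and the trajectory counts are all illustrative:

```python
# Monte-Carlo policy evaluation sketch: average discounted returns of
# simulated trajectories. The toy MDP and policy are made up; trajectories
# are truncated at a finite horizon because gamma^t decays geometrically.
import random

random.seed(0)
gamma = 0.9
p = [[[0.8, 0.2], [0.1, 0.9]],   # p[x][a][y] = p(y | x, a)
     [[0.5, 0.5], [0.0, 1.0]]]
r = [[1.0, 0.0],                 # r[x][a]
     [0.5, 2.0]]
pi = [1, 1]                      # the fixed policy to evaluate

def mc_estimate(x0, n=1000, horizon=150):
    """Estimate V^pi(x0) by averaging n truncated discounted returns."""
    total = 0.0
    for _ in range(n):
        x, ret, disc = x0, 0.0, 1.0
        for _ in range(horizon):
            a = pi[x]
            ret += disc * r[x][a]
            disc *= gamma
            x = random.choices([0, 1], weights=p[x][a])[0]
        total += ret
    return total / n

estimate = mc_estimate(0)   # for this chain the exact value is 16.2/0.91, about 17.8
```

The O(1/√n) error shows up directly: quadrupling `n` roughly halves the spread of the estimate.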

SLIDE 126

Dynamic Programming

Policy Iteration: the Policy Improvement Step

◮ If the policy is evaluated with V , then the policy improvement

has complexity O(N|A|) (computation of an expectation).

◮ If the policy is evaluated with Q, then the policy improvement

has complexity O(|A|) per state, corresponding to πk+1(x) ∈ arg maxa∈A Q(x, a).

SLIDE 127

Dynamic Programming

Comparison between Value and Policy Iteration

Value Iteration

◮ Pros: each iteration is computationally very efficient.
◮ Cons: convergence is only asymptotic.

Policy Iteration

◮ Pros: converges in a finite number of iterations (often small in practice).
◮ Cons: each iteration requires a full policy evaluation, which may be expensive.

SLIDE 128

Dynamic Programming

Exercise: Review Extensions to Standard DP Algorithms

◮ Modified Policy Iteration
◮ λ-Policy Iteration

SLIDE 129

Dynamic Programming

Exercise: Review Linear Programming

◮ Linear Programming: a one-shot approach to computing V ∗

SLIDE 130

Conclusions

Outline

Mathematical Tools The Markov Decision Process Bellman Equations for Discounted Infinite Horizon Problems Bellman Equations for Undiscounted Infinite Horizon Problems Dynamic Programming Conclusions

SLIDE 135

Conclusions

Things to Remember

◮ The Markov Decision Process framework ◮ The discounted infinite horizon setting ◮ State and state-action value function ◮ Bellman equations and Bellman operators ◮ The value and policy iteration algorithms

SLIDE 136

Conclusions

Bibliography I

[1] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.

[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[3] W. Fleming and R. Rishel. Deterministic and Stochastic Optimal Control. Applications of Mathematics 1, Springer-Verlag, Berlin/New York, 1975.

[4] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.

[5] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, 1994.

SLIDE 137

Conclusions

Reinforcement Learning

Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr