SLIDE 1

MVA-RL Course

Markov Decision Processes and Dynamic Programming

  • A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDES 2–3

How to model an RL problem

The Markov Decision Process

Tools · Model · Value Functions
SLIDE 4

Mathematical Tools

How to model an RL problem

The Markov Decision Process

Tools · Model · Value Functions
SLIDES 5–6

Mathematical Tools

Probability Theory

Definition (Conditional probability)

Given two events A and B with P(B) > 0, the conditional probability of A given B is

  P(A|B) = P(A ∩ B) / P(B).

Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), and B has positive measure, then the conditional probability is

  P(X ∈ A | Y ∈ B) = ∫_{y∈B} ∫_{x∈A} f_{X,Y}(x, y) dx dy / ∫_{y∈B} ∫_x f_{X,Y}(x, y) dx dy.
SLIDE 7

Mathematical Tools

Probability Theory

Definition (Law of total expectation)

Given a function f and two random variables X, Y, we have that

  E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].
SLIDE 8

Mathematical Tools

Norms and Contractions

Definition

Given a vector space V ⊆ R^d, a function f : V → R⁺₀ is a norm if and only if

◮ If f(v) = 0 for some v ∈ V, then v = 0.
◮ For any λ ∈ R, v ∈ V, f(λv) = |λ| f(v).
◮ Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u).
SLIDES 9–13

Mathematical Tools

Norms and Contractions

◮ L_p-norm: ||v||_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}.

◮ L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.

◮ L_{µ,p}-norm: ||v||_{µ,p} = ( Σ_{i=1}^d |v_i|^p µ_i )^{1/p}.

◮ L_{µ,∞}-norm: ||v||_{µ,∞} = max_{1≤i≤d} |v_i| / µ_i.

◮ L_{2,P}-matrix norm (P is a positive definite matrix): ||v||²_P = v⊤ P v.
SLIDES 14–16

Mathematical Tools

Norms and Contractions

Definition. A sequence of vectors v_n ∈ V (with n ∈ N) is said to converge in norm ||·|| to v ∈ V if lim_{n→∞} ||v_n − v|| = 0.

Definition. A sequence of vectors v_n ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition. A vector space V equipped with a norm ||·|| is complete if every Cauchy sequence in V is convergent in the norm of the space.
SLIDES 17–18

Mathematical Tools

Norms and Contractions

Definition. An operator T : V → V is L-Lipschitz if for any v, u ∈ V, ||T v − T u|| ≤ L ||u − v||. If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if v_n →_{||·||} v then T v_n →_{||·||} T v.

Definition. A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.
SLIDE 19

Mathematical Tools

Norms and Contractions

Proposition (Banach Fixed Point Theorem)

Let V be a complete vector space equipped with the norm ||·|| and T : V → V be a γ-contraction mapping. Then

1. T admits a unique fixed point v.
2. For any v_0 ∈ V, if v_{n+1} = T v_n, then v_n →_{||·||} v with a geometric convergence rate: ||v_n − v|| ≤ γ^n ||v_0 − v||.
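As a quick numerical illustration of the theorem (my own addition, not from the slides; the contraction T(v) = γv + 1 on R and γ = 0.9 are arbitrary choices):

gamma = 0.9
T = lambda v: gamma * v + 1.0          # a gamma-contraction on R
v_star = 1.0 / (1.0 - gamma)           # its unique fixed point

v, v0 = 0.0, 0.0                       # arbitrary starting point v_0
for n in range(1, 51):
    v = T(v)
    # geometric convergence rate: |v_n - v*| <= gamma^n |v_0 - v*|
    assert abs(v - v_star) <= gamma ** n * abs(v0 - v_star) + 1e-12
print(v, v_star)                       # v approaches v* = 10.0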
SLIDES 20–22

Mathematical Tools

Linear Algebra

Given a square matrix A ∈ R^{N×N}:

◮ Eigenvalues of a matrix (1). v ∈ R^N and λ ∈ R are an eigenvector and eigenvalue of A if Av = λv.

◮ Eigenvalues of a matrix (2). If A has eigenvalues {λ_i}_{i=1}^N, then B = (I − αA) has eigenvalues {µ_i} with µ_i = 1 − αλ_i.

◮ Matrix inversion. A can be inverted if and only if ∀i, λ_i ≠ 0.
SLIDE 23

Mathematical Tools

Linear Algebra

◮ Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if

1. all entries are non-negative, ∀i, j, [P]_{i,j} ≥ 0,
2. all the rows sum to one, ∀i, Σ_{j=1}^N [P]_{i,j} = 1.

All the eigenvalues of a stochastic matrix are bounded by 1, i.e., ∀i, |λ_i| ≤ 1.
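A quick numerical check of this last fact (my own illustration; the 3×3 stochastic matrix is arbitrary):

import numpy as np

# An arbitrary row-stochastic matrix: non-negative entries, rows summing to one.
P = np.array([[0.5, 0.5, 0.0],
              [0.3, 0.0, 0.7],
              [0.2, 0.4, 0.4]])
eigvals = np.linalg.eigvals(P)
print(np.abs(eigvals))                 # all moduli are <= 1
assert np.all(np.abs(eigvals) <= 1 + 1e-10)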
SLIDE 24

The Markov Decision Process

How to model an RL problem

The Markov Decision Process

Tools · Model · Value Functions
SLIDES 25–26

The Markov Decision Process

The Reinforcement Learning Model

[Figure: the agent–environment loop. The agent sends an action (actuation) to the environment; the environment returns a state (perception) and, through the critic, a reward to the learning agent.]
SLIDES 27–29

The Markov Decision Process

The Reinforcement Learning Model

The environment

◮ Controllability: fully (e.g., chess) or partially (e.g., portfolio optimization)
◮ Uncertainty: deterministic (e.g., chess) or stochastic (e.g., backgammon)
◮ Reactive: adversarial (e.g., chess) or fixed (e.g., Tetris)
◮ Observability: full (e.g., chess) or partial (e.g., robotics)
◮ Availability: known (e.g., chess) or unknown (e.g., robotics)

The critic

◮ Sparse (e.g., win or lose) vs informative (e.g., closer or further)
◮ Preference reward
◮ Frequent or sporadic
◮ Known or unknown

The agent

◮ Open-loop control
◮ Closed-loop control (i.e., adaptive)
◮ Non-stationary closed-loop control (i.e., learning)
SLIDE 30

The Markov Decision Process

Markov Chains

Definition (Markov chain)

Let the state space X be a bounded compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if it satisfies the Markov property

  P(x_{t+1} = x | x_t, x_{t−1}, . . . , x_0) = P(x_{t+1} = x | x_t).

Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p:

  p(y|x) = P(x_{t+1} = y | x_t = x).
SLIDES 31–35

The Markov Decision Process

Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2])

A Markov decision process is defined as a tuple M = (X, A, p, r) where

◮ X is the state space,
◮ A is the action space,
◮ p(y|x, a) is the transition probability with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
◮ r(x, a, y) is the reward of transition (x, a, y).
SLIDE 36

The Markov Decision Process

Markov Decision Process: the Assumptions

Time assumption: time is discrete, t → t + 1.

Possible relaxations

◮ Identify the proper time granularity
◮ Most of the MDP literature extends to continuous time
SLIDE 37

The Markov Decision Process

Markov Decision Process: the Assumptions

Markov assumption: the current state x and action a are a sufficient statistic for the next state y:

  p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a).

Possible relaxations

◮ Define a new state h_t = (x_t, x_{t−1}, x_{t−2}, . . .)
◮ Move to partially observable MDPs (POMDPs)
◮ Move to the predictive state representation (PSR) model
SLIDE 38

The Markov Decision Process

Markov Decision Process: the Assumptions

Reward assumption: the reward is uniquely defined by a transition (or part of it): r(x, a, y).

Possible relaxations

◮ Distinguish between the global goal and the reward function
◮ Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviors
SLIDE 39

The Markov Decision Process

Markov Decision Process: the Assumptions

Stationarity assumption: the dynamics and reward do not change over time:

  p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),  r(x, a, y).

Possible relaxations

◮ Identify and remove the non-stationary components (e.g., cyclo-stationary dynamics)
◮ Identify the time-scale of the changes
SLIDE 40

The Markov Decision Process

Question

Is the MDP formalism powerful enough? ⇒ Let's try!
SLIDE 41

The Markov Decision Process

Example: the Retail Store Management Problem

Description. At each month t, a store contains x_t items of a specific good and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore, we know that

◮ The cost of maintaining an inventory of x items is h(x).
◮ The cost to order a items is C(a).
◮ The income for selling q items is f(q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.
SLIDES 42–46

The Markov Decision Process

Example: the Retail Store Management Problem

◮ State space: x ∈ X = {0, 1, . . . , M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, . . . , M − x}.
◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]⁺.
  Problem: the dynamics should be Markov and stationary! The demand D_t is stochastic and time-independent. Formally, D_t ~ D, i.i.d.
◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]⁺).
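A minimal simulation sketch of this model (my own illustration; the cost and income functions and the Poisson demand are placeholder assumptions, since the slides leave h, C, f and the distribution of D unspecified):

import numpy as np

M = 20                                   # store capacity
rng = np.random.default_rng(0)

# Placeholder cost/income functions (h, C, f are unspecified on the slides).
h = lambda x: 0.5 * x                    # inventory maintenance cost
C = lambda a: 2.0 * a                    # ordering cost
f = lambda q: 5.0 * q                    # income from selling q items

def step(x, a):
    """One month of the retail store MDP: returns (next state, reward)."""
    assert 0 <= a <= M - x               # a in A(x) = {0, ..., M - x}
    D = rng.poisson(5)                   # i.i.d. demand (assumed Poisson)
    x_next = max(x + a - D, 0)           # x_{t+1} = [x_t + a_t - D_t]^+
    sold = max(x + a - x_next, 0)        # items actually sold, min(D, x + a)
    return x_next, -C(a) - h(x + a) + f(sold)   # r_t as on the slide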
SLIDES 47–48

The Markov Decision Process

Exercise: the Parking Problem

A driver wants to park his car as close as possible to the restaurant.

[Figure: a line of parking places T, . . . , 2, 1 leading to the restaurant; each place t is free with probability p(t), and the reward for parking grows as the place gets closer to the restaurant.]

◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking place is to the restaurant, the higher the satisfaction.
◮ If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.
SLIDES 49–51

The Markov Decision Process

Policy

Definition (Policy)

A decision rule π_t can be

◮ Deterministic: π_t : X → A,
◮ Stochastic: π_t : X → ∆(A).

A policy (strategy, plan) can be

◮ Non-stationary: π = (π_0, π_1, π_2, . . .),
◮ Stationary (Markovian): π = (π, π, π, . . .).

Remark: MDP M + stationary policy π ⇒ Markov chain with state space X and transition probability p(y|x) = p(y|x, π(x)).
SLIDE 52

The Markov Decision Process

Example: the Retail Store Management Problem

◮ Stationary policy 1:

  π(x) = M − x if x < M/4, 0 otherwise.

◮ Stationary policy 2:

  π(x) = max{(M − x)/2 − x; 0}.

◮ Non-stationary policy:

  π_t(x) = M − x if t < 6, ⌊(M − x)/5⌋ otherwise.
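The same three policies written as code (a direct transcription; as on the slide, policy 2 may return a fractional amount, and rounding to an integer order is left implicit):

import math

def policy1(x):
    # Stationary policy 1: refill to capacity while stock is below M/4.
    return M - x if x < M / 4 else 0

def policy2(x):
    # Stationary policy 2.
    return max((M - x) / 2 - x, 0)

def policy3(t, x):
    # Non-stationary policy: order aggressively only in the first months.
    return M - x if t < 6 else math.floor((M - x) / 5)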
SLIDE 53

The Markov Decision Process

How to model an RL problem

The Markov Decision Process

Tools · Model · Value Functions
SLIDE 54

The Markov Decision Process

Question

How do we evaluate a policy and compare two policies? ⇒ Value function!
SLIDES 55–58

The Markov Decision Process

Optimization over Time Horizon

◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.

◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.

◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state.

◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards.
SLIDES 59–60

The Markov Decision Process

State Value Function

◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T:

  V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],

where R is a value function for the final state.

◮ Used when: there is an intrinsic deadline to meet.
SLIDES 61–62

The Markov Decision Process

State Value Function

◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance:

  V^π(x) = E[ Σ_{t=0}^∞ γ^t r(x_t, π(x_t)) | x_0 = x; π ],

with discount factor 0 ≤ γ < 1:

◮ small γ = short-term rewards, big γ = long-term rewards
◮ for any γ ∈ [0, 1) the series always converges (for bounded rewards)

◮ Used when: there is uncertainty about the deadline and/or an intrinsic definition of discount.
SLIDES 63–64

The Markov Decision Process

State Value Function

◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a termination state:

  V^π(x) = E[ Σ_{t=0}^T r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the termination state is achieved.

◮ Used when: there is a known goal or a failure condition.
SLIDES 65–66

The Markov Decision Process

State Value Function

◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards:

  V^π(x) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].

◮ Used when: the system should be constantly controlled over time.
SLIDES 67–68

The Markov Decision Process

State Value Function

Technical note: the expectations refer to all possible stochastic trajectories. A non-stationary policy π applied from state x_0 returns (x_0, r_0, x_1, r_1, x_2, r_2, . . .) where r_t = r(x_t, π_t(x_t)) and x_{t+1} ~ p(·|x_t, a_t = π_t(x_t)) are random realizations. The value function (discounted infinite horizon) is

  V^π(x) = E_{(x_1,x_2,...)}[ Σ_{t=0}^∞ γ^t r(x_t, π(x_t)) | x_0 = x; π ].
SLIDE 69

The Markov Decision Process

Example: the Retail Store Management Problem

Simulation
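The deck shows a live simulation here; as a stand-in, a sketch of estimating V^π(x) for the retail store by Monte Carlo rollouts (my own illustration, reusing the step and policy1 placeholders above, with an assumed discount γ = 0.95 and truncated trajectories):

def mc_value(x0, policy, gamma=0.95, n_traj=1000, horizon=200):
    """Monte Carlo estimate of V^pi(x0): average discounted return."""
    total = 0.0
    for _ in range(n_traj):
        x, ret = x0, 0.0
        for t in range(horizon):       # truncate the infinite sum
            a = int(policy(x))
            x, r = step(x, a)
            ret += gamma ** t * r
        total += ret
    return total / n_traj

print(mc_value(10, policy1))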
SLIDES 70–71

The Markov Decision Process

Optimal Value Function

Definition (Optimal policy and optimal value function)

The solution to an MDP is an optimal policy π* satisfying

  π* ∈ arg max_{π∈Π} V^π

in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.
SLIDE 72

The Markov Decision Process

Optimal Value Function

Remarks

1. π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy
2. π* achieves the largest possible value function in every state
3. there always exists an optimal deterministic policy
4. except for problems with a finite horizon, there always exists an optimal stationary policy
SLIDE 73

The Markov Decision Process

Summary

1. The MDP is a powerful model for the interaction between an agent and a stochastic environment
2. The value function defines the objective to optimize
SLIDE 74

The Markov Decision Process

Limitations

1. All the previous value functions define an objective in expectation
2. Other utility functions may be used
3. Risk measures could be integrated but they may induce "weird" problems and make the solution more difficult
SLIDES 75–76

The Markov Decision Process

How to solve an MDP exactly

Dynamic Programming

Bellman Equations · Value Iteration · Policy Iteration
SLIDE 77

The Markov Decision Process

Notice: from now on we mostly work in the discounted infinite horizon setting. Most results smoothly extend to the other settings.
SLIDES 78–79

The Markov Decision Process

The Optimization Problem

  max_π V^π(x_0) = max_π E[ r(x_0, π(x_0)) + γ r(x_1, π(x_1)) + γ² r(x_2, π(x_2)) + . . . ]

This is very challenging (we should try as many as |A|^{|S|} policies!)
⇓
We need to leverage the structure of the MDP to simplify the optimization problem.
SLIDE 80

The Markov Decision Process

How to solve an MDP exactly

Dynamic Programming

Bellman Equations · Value Iteration · Policy Iteration
SLIDE 81

The Markov Decision Process

The Bellman Equation

Proposition

For any stationary policy π = (π, π, . . . ), the state value function at a state x ∈ X satisfies the Bellman equation: V π(x) = r(x, π(x)) + γ

  • y

p(y|x, π(x))V π(y).

  • A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 45/103
SLIDE 82

The Markov Decision Process

The Bellman Equation

Proof. For any policy π,

  V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
         = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
         = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
         = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
SLIDE 83

The Markov Decision Process

Example: the student dilemma

[Figure: the student dilemma MDP, a seven-state graph (states 1–7) with Work/Rest actions. Transition probabilities include p = 0.5, 0.4, 0.3, 0.7, 0.5, 0.5, 0.4, 0.6, 0.9, 0.1, 1; the per-state rewards are r = 0, 1, −1, −10 in states 1–4 and r = −10, 100, −1000 in states 5–7.]
SLIDE 84

The Markov Decision Process

Example: the student dilemma

◮ Model: all the transitions are Markov; states x_5, x_6, x_7 are terminal.
◮ Setting: infinite horizon with terminal states.
◮ Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.

Notice: this is not a discounted infinite horizon setting (here γ = 1)! But the Bellman equations hold unchanged.
SLIDE 85

The Markov Decision Process

Example: the student dilemma

[Figure: the same transition graph annotated with the value of each state under the fixed policy: V_1 = 88.3, V_2 = 88.3, V_3 = 86.9, V_4 = 88.9, V_5 = −10, V_6 = 100, V_7 = −1000.]
SLIDES 86–88

The Markov Decision Process

Example: the student dilemma

Computing V_4: with V_6 = 100,

  V_4 = −10 + (0.9 V_6 + 0.1 V_4)  ⇒  V_4 = (−10 + 0.9 V_6) / 0.9 ≈ 88.9.

Computing V_3: no need to consider all possible trajectories. With V_4 ≈ 88.9,

  V_3 = −1 + (0.5 V_4 + 0.5 V_3)  ⇒  V_3 = (−1 + 0.5 V_4) / 0.5 ≈ 86.9.

And so on for the rest...
SLIDE 89

The Markov Decision Process

The Optimal Bellman Equation

Bellman's Principle of Optimality [1]: "An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."
SLIDE 90

The Markov Decision Process

The Optimal Bellman Equation

Proposition

The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

  V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ],

and the optimal policy is

  π*(x) = arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].
SLIDE 91

The Markov Decision Process

The Optimal Bellman Equation

Proof. For any policy π = (a, π′) (possibly non-stationary),

  V*(x) (a)= max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
        (b)= max_{(a,π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ]
        (c)= max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ]
        (d)= max_a [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ].
SLIDE 92

The Markov Decision Process

System of Equations

The Bellman equation

  V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y)

is a linear system of equations with N unknowns and N linear constraints.
SLIDE 93

The Markov Decision Process

Example: the student dilemma

[Figure: the same annotated graph as in Slide 85, with V_1 = 88.3, V_2 = 88.3, V_3 = 86.9, V_4 = 88.9, V_5 = −10, V_6 = 100, V_7 = −1000.]
SLIDE 94

The Markov Decision Process

Example: the student dilemma

  V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y)

System of equations:

  V_1 = 0 + 0.5 V_1 + 0.5 V_2
  V_2 = 1 + 0.3 V_1 + 0.7 V_3
  V_3 = −1 + 0.5 V_4 + 0.5 V_3
  V_4 = −10 + 0.9 V_6 + 0.1 V_4
  V_5 = −10
  V_6 = 100
  V_7 = −1000

In vector form (V, R ∈ R^7, P ∈ R^{7×7}):

  V = R + PV  ⇒  V = (I − P)^{−1} R.
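A direct numerical check of this slide (my own transcription of the system into code; the terminal states 5–7 have no outgoing transitions, so their rows of P are zero):

import numpy as np

R = np.array([0.0, 1.0, -1.0, -10.0, -10.0, 100.0, -1000.0])
P = np.zeros((7, 7))
P[0, 0], P[0, 1] = 0.5, 0.5            # V1 = 0 + 0.5 V1 + 0.5 V2
P[1, 0], P[1, 2] = 0.3, 0.7            # V2 = 1 + 0.3 V1 + 0.7 V3
P[2, 3], P[2, 2] = 0.5, 0.5            # V3 = -1 + 0.5 V4 + 0.5 V3
P[3, 5], P[3, 3] = 0.9, 0.1            # V4 = -10 + 0.9 V6 + 0.1 V4

V = np.linalg.solve(np.eye(7) - P, R)  # V = (I - P)^{-1} R
print(V.round(1))                      # [88.3, 88.3, 86.9, 88.9, -10, 100, -1000]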
SLIDE 95

The Markov Decision Process

System of Equations

The optimal Bellman equation

  V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]

is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator).
SLIDE 96

The Markov Decision Process

Example: the student dilemma

[Figure: the student dilemma transition graph again (states 1–7, Work/Rest actions), now with both actions available in each non-terminal state.]
SLIDE 97

The Markov Decision Process

Example: the student dilemma

  V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]

System of equations:

  V_1 = max{ 0 + 0.5 V_1 + 0.5 V_2 ; 0 + 0.5 V_1 + 0.5 V_3 }
  V_2 = max{ 1 + 0.4 V_5 + 0.6 V_2 ; 1 + 0.3 V_1 + 0.7 V_3 }
  V_3 = max{ −1 + 0.4 V_2 + 0.6 V_3 ; −1 + 0.5 V_4 + 0.5 V_3 }
  V_4 = max{ −10 + 0.9 V_6 + 0.1 V_4 ; −10 + V_7 }
  V_5 = −10
  V_6 = 100
  V_7 = −1000

⇒ too complicated; we need to find an alternative solution.
SLIDE 98

The Markov Decision Process

The Bellman Operators

Notation: w.l.o.g. we consider a discrete state space with |X| = N, so that V^π ∈ R^N.

Definition

For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is

  T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),

and the optimal Bellman operator (or dynamic programming operator) is

  T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].
SLIDES 99–100

The Markov Decision Process

The Bellman Operators

Proposition (properties of the Bellman operators)

1. Monotonicity: for any W_1, W_2 ∈ R^N, if W_1 ≤ W_2 component-wise, then T^π W_1 ≤ T^π W_2 and T W_1 ≤ T W_2.

2. Offset: for any scalar c ∈ R,

  T^π(W + c I_N) = T^π W + γc I_N,  T(W + c I_N) = T W + γc I_N,

where I_N denotes the vector of all ones.
SLIDES 101–103

The Markov Decision Process

The Bellman Operators

Proposition (cont'd)

3. Contraction in L_∞-norm: for any W_1, W_2 ∈ R^N,

  ||T^π W_1 − T^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞,
  ||T W_1 − T W_2||_∞ ≤ γ ||W_1 − W_2||_∞.

4. Fixed point: for any policy π, V^π is the unique fixed point of T^π, and V* is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π,

  lim_{k→∞} (T^π)^k W = V^π,  lim_{k→∞} (T)^k W = V*.
SLIDE 104

The Markov Decision Process

The Bellman Operators

Proof. The contraction property (3) holds since for any x ∈ X we have

  |T W_1(x) − T W_2(x)|
    = | max_a [ r(x, a) + γ Σ_y p(y|x, a) W_1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W_2(y) ] |
  (a)≤ max_a | ( r(x, a) + γ Σ_y p(y|x, a) W_1(y) ) − ( r(x, a) + γ Σ_y p(y|x, a) W_2(y) ) |
    = γ max_a Σ_y p(y|x, a) |W_1(y) − W_2(y)|
    ≤ γ ||W_1 − W_2||_∞ max_a Σ_y p(y|x, a) = γ ||W_1 − W_2||_∞,

where in (a) we used max_a f(a) − max_{a′} g(a′) ≤ max_a (f(a) − g(a)).
SLIDE 105

The Markov Decision Process

Exercise: Fixed Point

Revise the Banach fixed point theorem and prove the fixed point property of the Bellman operator.
SLIDE 106

Dynamic Programming

How to solve an MDP exactly

Dynamic Programming

Bellman Equations · Value Iteration · Policy Iteration
SLIDE 107

Dynamic Programming

Question

How do we compute the value functions / solve an MDP? ⇒ Value/Policy Iteration algorithms!
SLIDES 108–109

Dynamic Programming

System of Equations (recap)

The Bellman equation

  V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y)

is a linear system of equations with N unknowns and N linear constraints. The optimal Bellman equation

  V*(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V*(y) ]

is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator).
SLIDES 110–113

Dynamic Programming

Value Iteration: the Idea

1. Let V_0 be any vector in R^N
2. At each iteration k = 1, 2, . . . , K
   ◮ Compute V_{k+1} = T V_k
3. Return the greedy policy

  π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].
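A compact implementation sketch (my own, not from the slides) for a finite MDP stored as arrays r[x, a] of shape (N, |A|) and p[x, a, y] of shape (N, |A|, N):

import numpy as np

def value_iteration(r, p, gamma, K):
    """Value iteration: repeatedly apply the optimal Bellman operator T."""
    N, A = r.shape
    V = np.zeros(N)                              # V0: any vector in R^N
    for _ in range(K):
        V = (r + gamma * p @ V).max(axis=1)      # V_{k+1} = T V_k
    greedy = (r + gamma * p @ V).argmax(axis=1)  # greedy policy pi_K
    return V, greedy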
SLIDES 114–116

Dynamic Programming

Value Iteration: the Guarantees

◮ From the fixed point property of T: lim_{k→∞} V_k = V*.

◮ From the contraction property of T:

  ||V_{k+1} − V*||_∞ = ||T V_k − T V*||_∞ ≤ γ ||V_k − V*||_∞ ≤ γ^{k+1} ||V_0 − V*||_∞ → 0.

◮ Convergence rate. Let ε > 0 and ||r||_∞ ≤ r_max; then after at most

  K = log(r_max/ε) / log(1/γ)

iterations, ||V_K − V*||_∞ ≤ ε.
SLIDE 117

Dynamic Programming

Value Iteration: the Complexity

Time complexity

◮ Each iteration and the computation of the greedy policy take O(N²|A|) operations:

  V_{k+1}(x) = T V_k(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_k(y) ]
  π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ]

◮ Total time complexity: O(KN²|A|)

Space complexity

◮ Storing the MDP: dynamics O(N²|A|) and reward O(N|A|).
◮ Storing the value function and the optimal policy: O(N).
SLIDE 118

Dynamic Programming

State-Action Value Function

Definition

In discounted infinite horizon problems, for any policy π, the state-action value function (or Q-function) Q^π : X × A → R is

  Q^π(x, a) = E[ Σ_{t≥0} γ^t r(x_t, a_t) | x_0 = x, a_0 = a, a_t = π(x_t), ∀t ≥ 1 ],

and the corresponding optimal Q-function is

  Q*(x, a) = max_π Q^π(x, a).
SLIDE 119

Dynamic Programming

State-Action Value Function

The relationships between the V-function and the Q-function are:

  Q^π(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V^π(y)
  V^π(x) = Q^π(x, π(x))
  Q*(x, a) = r(x, a) + γ Σ_{y∈X} p(y|x, a) V*(y)
  V*(x) = Q*(x, π*(x)) = max_{a∈A} Q*(x, a).
SLIDE 120

Dynamic Programming

Value Iteration: Extensions and Implementations

Q-iteration.

1. Let Q_0 be any Q-function
2. At each iteration k = 1, 2, . . . , K
   ◮ Compute Q_{k+1} = T Q_k
3. Return the greedy policy

  π_K(x) ∈ arg max_{a∈A} Q_K(x, a)

Comparison

◮ Increased space and time complexity: O(N|A|) and O(N²|A|²)
◮ Computing the greedy policy is cheaper: O(N|A|)
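A sketch of Q-iteration under the same array conventions as the value-iteration snippet above (again my own illustration; here T denotes the optimal Bellman operator acting on Q-functions):

def q_iteration(r, p, gamma, K):
    """Q-iteration: r has shape (N, A), p has shape (N, A, N)."""
    N, A = r.shape
    Q = np.zeros((N, A))                   # Q0: any Q-function
    for _ in range(K):
        Q = r + gamma * p @ Q.max(axis=1)  # Q_{k+1} = T Q_k
    return Q, Q.argmax(axis=1)             # greedy policy: row-wise argmax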
SLIDE 121

Dynamic Programming

Value Iteration: Extensions and Implementations

Asynchronous VI.

1. Let V_0 be any vector in R^N
2. At each iteration k = 1, 2, . . . , K
   ◮ Choose a state x_k
   ◮ Compute V_{k+1}(x_k) = T V_k(x_k)
3. Return the greedy policy

  π_K(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V_K(y) ].

Comparison

◮ Reduced time complexity per iteration: O(N|A|)
◮ Increased number of iterations, at most O(KN), but much smaller in practice if states are properly prioritized
◮ Convergence guarantees
SLIDE 122

Dynamic Programming

How to solve an MDP exactly

Dynamic Programming

Bellman Equations · Value Iteration · Policy Iteration
SLIDES 123–128

Dynamic Programming

Policy Iteration: the Idea

1. Let π_0 be any stationary policy
2. At each iteration k = 1, 2, . . . , K
   ◮ Policy evaluation: given π_k, compute V^{π_k}.
   ◮ Policy improvement: compute the greedy policy

     π_{k+1}(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{π_k}(y) ].

3. Return the last policy π_K

Remark: usually K is the smallest k such that V^{π_k} = V^{π_{k+1}}.
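A sketch of policy iteration with exact policy evaluation by a direct linear solve (my own illustration, same r/p array conventions as before; it stops when the policy is stable, which implies V^{π_k} = V^{π_{k+1}}):

import numpy as np

def policy_iteration(r, p, gamma):
    """Policy iteration: evaluation by linear solve, improvement by argmax."""
    N, A = r.shape
    pi = np.zeros(N, dtype=int)              # pi_0: any stationary policy
    while True:
        # Policy evaluation: V^pi = (I - gamma P^pi)^{-1} r^pi
        P_pi = p[np.arange(N), pi]           # (N, N) transition matrix
        r_pi = r[np.arange(N), pi]           # (N,) reward vector
        V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
        # Policy improvement: greedy policy w.r.t. V^pi
        pi_new = (r + gamma * p @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):       # policy stable => optimal
            return pi, V
        pi = pi_new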
SLIDE 129

Dynamic Programming

Policy Iteration: the Guarantees

Proposition

The policy iteration algorithm generates a sequence of policies with non-decreasing performance, V^{π_{k+1}} ≥ V^{π_k}, and it converges to π* in a finite number of iterations.
SLIDES 130–133

Dynamic Programming

Policy Iteration: the Guarantees

Proof. From the definition of the Bellman operators and the greedy policy π_{k+1},

  V^{π_k} = T^{π_k} V^{π_k} ≤ T V^{π_k} = T^{π_{k+1}} V^{π_k},   (1)

and from the monotonicity property of T^{π_{k+1}} it follows that

  V^{π_k} ≤ T^{π_{k+1}} V^{π_k},
  T^{π_{k+1}} V^{π_k} ≤ (T^{π_{k+1}})² V^{π_k},
  . . .
  (T^{π_{k+1}})^{n−1} V^{π_k} ≤ (T^{π_{k+1}})^n V^{π_k},
  . . .

Joining all the inequalities in the chain, we obtain

  V^{π_k} ≤ lim_{n→∞} (T^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}.

Thus (V^{π_k})_k is a non-decreasing sequence.
SLIDE 134

Dynamic Programming

Policy Iteration: the Guarantees

Proof (cont'd). Since a finite MDP admits a finite number of policies, the termination condition is eventually met for a specific k. Then eq. (1) holds with equality, so V^{π_k} = T V^{π_k}, hence V^{π_k} = V*, which implies that π_k is an optimal policy.
SLIDE 135

Dynamic Programming

Policy Iteration

Notation. For any policy π, the reward vector is r^π(x) = r(x, π(x)) and the transition matrix is [P^π]_{x,y} = p(y|x, π(x)).
SLIDES 136–138

Dynamic Programming

Policy Iteration: the Policy Evaluation Step

◮ Direct computation. For any policy π, compute

  V^π = (I − γ P^π)^{−1} r^π.

Complexity: O(N³) (improvable to O(N^2.807)).

◮ Iterative policy evaluation. For any policy π,

  lim_{n→∞} (T^π)^n V_0 = V^π.

Complexity: an ε-approximation of V^π requires O(N² log(1/ε) / log(1/γ)) steps.

◮ Monte-Carlo simulation. In each state x, simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π and compute

  V̂^π(x) ≈ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)).

Complexity: in each state, the approximation error is O(1/√n).
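The iterative variant as code (a sketch under the same conventions; successive applications of T^π converge to V^π at rate γ, and the loop stops at a small tolerance):

def iterative_policy_evaluation(r, p, pi, gamma, tol=1e-8):
    """Approximate V^pi by repeatedly applying the Bellman operator T^pi."""
    N = r.shape[0]
    P_pi = p[np.arange(N), pi]               # P^pi, shape (N, N)
    r_pi = r[np.arange(N), pi]               # r^pi, shape (N,)
    V = np.zeros(N)
    while True:
        V_new = r_pi + gamma * P_pi @ V      # V <- T^pi V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new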
SLIDES 139–140

Dynamic Programming

Policy Iteration: the Policy Improvement Step

◮ If the policy is evaluated with V, then the policy improvement has complexity O(N|A|) (computation of an expectation).
◮ If the policy is evaluated with Q, then the policy improvement has complexity O(|A|), corresponding to

  π_{k+1}(x) ∈ arg max_{a∈A} Q(x, a).
SLIDE 141

Dynamic Programming

Policy Iteration: Number of Iterations

◮ At most O( (N|A|)/(1−γ) · log(1/(1−γ)) ) iterations.
SLIDE 142

Dynamic Programming

Comparison between Value and Policy Iteration

Value Iteration

◮ Pros: each iteration is very computationally efficient.
◮ Cons: convergence is only asymptotic.

Policy Iteration

◮ Pros: converges in a finite number of iterations (often small in practice).
◮ Cons: each iteration requires a full policy evaluation, which might be expensive.
SLIDE 143

Dynamic Programming

The Grid-World Problem
SLIDE 144

Dynamic Programming

How to solve an MDP exactly

Dynamic Programming

Bellman Equations · Value Iteration · Policy Iteration
SLIDE 145

Dynamic Programming

Other Algorithms

◮ Modified Policy Iteration
◮ λ-Policy Iteration
◮ Linear programming
◮ Policy search
SLIDE 146

Dynamic Programming

Summary

◮ Bellman equations provide a compact formulation of value functions
◮ DP provides a general tool to solve MDPs
SLIDE 147

Dynamic Programming

Bibliography I

[1] R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, N.J., 1957.
[2] D. P. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] W. Fleming and R. Rishel. Deterministic and Stochastic Optimal Control. Applications of Mathematics 1, Springer-Verlag, Berlin/New York, 1975.
[4] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
[5] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, 1994.
SLIDE 148

Dynamic Programming

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr