MVA-RL Course
Markov Decision Processes and Dynamic Programming
- A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
SequeL – INRIA Lille
In This Lecture
◮ How do we formalize the agent-environment interaction?
◮ How do we solve an MDP?
Oct 1st, 2013 - 2/79
Mathematical Tools
Definition (Conditional probability)
For any two events A and B with P(B) > 0, the conditional probability of A given B is
P(A|B) = P(A ∩ B) / P(B).
Definition (Law of total expectation)
For any random variables X and Y,
E[X] = E[ E[X | Y] ].
Definition
A function f : V → [0, ∞) is a norm if
◮ If f (v) = 0 for some v ∈ V, then v = 0.
◮ For any λ ∈ R, v ∈ V, f (λv) = |λ|f (v).
◮ Triangle inequality: for any v, u ∈ V, f (v + u) ≤ f (v) + f (u).
◮ Lp-norm: ||v||p = ( Σ_{i=1}^d |vi|^p )^{1/p}.
◮ L∞-norm: ||v||∞ = max_{1≤i≤d} |vi|.
◮ Lμ,p-norm: ||v||μ,p = ( Σ_{i=1}^d μi |vi|^p )^{1/p}.
◮ Lμ,∞-norm: ||v||μ,∞ = max_{1≤i≤d} |vi| / μi.
◮ L2,P-matrix norm (P is a positive definite matrix): ||v||²_P = v⊤Pv.
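As a quick sanity check, these norms can be computed directly. A minimal sketch with NumPy; the weighting conventions for the μ-norms (μ inside the sum for the Lμ,p-norm, division by μ for the Lμ,∞-norm) are one common choice and should be read as an assumption:

```python
import numpy as np

v = np.array([1.0, -2.0, 3.0])
mu = np.array([0.5, 0.25, 0.25])   # positive weights (assumed convention)
p = 2

lp = (np.abs(v) ** p).sum() ** (1 / p)         # Lp-norm
linf = np.abs(v).max()                         # L-infinity norm
lmup = (mu * np.abs(v) ** p).sum() ** (1 / p)  # weighted L_{mu,p}-norm
lmuinf = (np.abs(v) / mu).max()                # weighted sup-norm

print(lp, linf, lmup, lmuinf)
```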
Definition
A sequence of vectors vn ∈ V (with n ∈ N) is said to converge in norm || · || to v ∈ V if lim_{n→∞} ||vn − v|| = 0.

Definition
A sequence of vectors vn ∈ V (with n ∈ N) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||vn − vm|| = 0.
Definition A vector space V equipped with a norm || · || is complete if every Cauchy sequence in V is convergent in the norm of the space.
Definition
An operator T : V → V is L-Lipschitz if for any v, u ∈ V,
||T v − T u|| ≤ L||u − v||.
If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if vn → v in || · || then T vn → T v in || · ||.

Definition
A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.
Proposition (Banach Fixed Point Theorem)
Let V be a complete vector space equipped with the norm || · || and T : V → V be a γ-contraction mapping. Then
◮ T admits a unique fixed point v ∈ V (i.e., T v = v),
◮ for any v0 ∈ V, the sequence vn+1 = T vn converges to v with a geometric
convergence rate: ||vn − v|| ≤ γ^n ||v0 − v||.
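A one-line illustration of the theorem: on V = R with the absolute value as norm, the affine map T(v) = γv + b is a γ-contraction with fixed point b/(1 − γ), and iterating it exhibits the γ^n rate (γ and b below are arbitrary illustrative values):

```python
gamma, b = 0.9, 1.0
T = lambda v: gamma * v + b        # |T(v) - T(u)| = gamma |v - u|: a gamma-contraction
v_star = b / (1 - gamma)           # its unique fixed point: T(v_star) = v_star

v = 0.0                            # v0 = 0
for n in range(200):
    v = T(v)                       # v_{n+1} = T v_n

# the error after n steps is at most gamma^n * |v0 - v_star|
assert abs(v - v_star) <= gamma ** 200 * abs(0.0 - v_star) + 1e-9
print(v)
```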
◮ Eigenvalues of a matrix (1). v ∈ R^N (v ≠ 0) and λ ∈ R are an eigenvector and the corresponding eigenvalue of a matrix A if Av = λv.
◮ Eigenvalues of a matrix (2). If A has eigenvalues {λi}_{i=1}^N, then for any α ∈ R the matrix (I − αA) has eigenvalues {1 − αλi}_{i=1}^N.
◮ Matrix inversion. A can be inverted if and only if ∀i, λi ≠ 0.
◮ Stochastic matrix. A square matrix P ∈ R^{N×N} is a stochastic matrix if all its elements are non-negative and each row sums to one: Σ_{j=1}^N [P]_{i,j} = 1 for all i. All the eigenvalues of a stochastic matrix satisfy |λ| ≤ 1.
The Markov Decision Process
[Figure: the agent–environment loop — the agent perceives the state and the reward from the environment, a learning/critic component evaluates them, and the agent acts back on the environment through actuation.]
Definition (Markov chain)
A stochastic process (xt)t≥0 on a state space X is a Markov chain if it satisfies the Markov property:
P(xt+1 = y | xt = x, xt−1, . . . , x0) = P(xt+1 = y | xt = x),
i.e., given the current state, the future is independent of the past.
Definition (Markov decision process [1, 4, 3, 5, 2])
A Markov decision process is a tuple M = (X, A, p, r) where
◮ X is the state space,
◮ A is the action space,
◮ p(y|x, a) is the transition probability with p(y|x, a) = P(xt+1 = y | xt = x, at = a),
◮ r(x, a, y) is the reward of transition (x, a, y).
Definition (Policy)
A decision rule πt can be
◮ Deterministic: πt : X → A,
◮ Stochastic: πt : X → ∆(A).
A policy (strategy, plan) can be
◮ Non-stationary: π = (π0, π1, π2, . . . ),
◮ Stationary (Markovian): π = (π, π, π, . . . ).
Example (retail store management): at each month t, a store contains xt items of a specific goods and the demand for that goods is Dt. At the end of each month the manager of the store can order at more items from the supplier. Furthermore we know that
◮ The cost of maintaining an inventory of x items is h(x).
◮ The cost to order a items is C(a).
◮ The income for selling q items is f (q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.
◮ State space: x ∈ X = {0, 1, . . . , M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, . . . , M − x}.
◮ Dynamics: xt+1 = [xt + at − Dt]+.
Problem: the dynamics should be Markov and stationary!
◮ The demand Dt is stochastic and time-independent. Formally, Dt ∼ D i.i.d.
◮ Reward: rt = −C(at) − h(xt + at) + f ([xt + at − xt+1]+).
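The dynamics and reward above can be sketched directly. The specific cost functions h, C, f, the Poisson demand distribution, and the threshold policy below are illustrative assumptions; only the update xt+1 = [xt + at − Dt]+ and the reward formula come from the example:

```python
import numpy as np

M = 20                              # store capacity
rng = np.random.default_rng(0)

h = lambda x: 0.1 * x               # holding cost (illustrative)
C = lambda a: 1.0 * a               # ordering cost (illustrative)
f = lambda q: 2.0 * q               # income for selling q items (illustrative)

def step(x, a, D):
    """One month of the inventory MDP: x' = [x + a - D]^+ and the associated reward."""
    assert 0 <= a <= M - x          # action constraint: a in A(x) = {0, ..., M - x}
    x_next = max(x + a - D, 0)
    sold = x + a - x_next           # equals [x + a - x']^+ since x' <= x + a
    r = -C(a) - h(x + a) + f(sold)
    return x_next, r

x, total = 5, 0.0
for t in range(12):                 # one year
    a = (M - x) if x < 5 else 0     # naive "refill when low" policy (illustrative)
    D = rng.poisson(8)              # i.i.d. demand D_t ~ D (Poisson is an assumption)
    x, r = step(x, a, D)
    total += r
print(x, total)
```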
A driver wants to park his car as close as possible to the restaurant.
[Figure: a line of parking places T, . . . , 2, 1 leading to the Restaurant; place t is free with probability p(t), and the reward for parking grows as the driver gets closer to the restaurant (reward 0 if he/she leaves without parking).]
◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking place is to the restaurant, the higher the satisfaction.
◮ If the driver doesn’t park anywhere, then he/she leaves and has to find another restaurant.
◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.
◮ Infinite time horizon with discount: the problem never terminates but rewards which are closer in time receive a higher importance.
◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a terminal state.
◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards.
◮ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T:
V(t, x) = E[ Σ_{s=t}^{T−1} r(xs, πs(xs)) + R(xT) | xt = x; π ],
where R is a reward function for the final state.
◮ Infinite time horizon with discount: the problem never
◮ small = short-term rewards, big = long-term rewards ◮ for any γ ∈ [0, 1) the series always converge (for bounded
rewards)
◮ Infinite time horizon with terminal state: the problem never terminates but the agent will eventually reach a terminal state:
V(x) = E[ Σ_{t=0}^{T} r(xt, π(xt)) | x0 = x; π ],
where T is the first (random) time when the terminal state is achieved.
◮ Infinite time horizon with average reward: the problem never terminates but the agent only focuses on the (expected) average of the rewards:
V(x) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(xt, π(xt)) | x0 = x; π ].
Definition (Optimal policy and optimal value function)
The optimal policy is π∗ ∈ arg max_{π∈Π} V^π, and the optimal value function is V∗ = V^{π∗}.
Remark: π∗ ∈ arg max(·) and not π∗ = arg max(·) because an MDP may admit more than one optimal policy.
[Figure: an MDP where in each state the agent chooses between Work and Rest; the edges are labeled with transition probabilities (p = 0.5, 0.4, 0.3, 0.7, . . . ) and rewards (r = 1, −1000, 0, −10, 100, −10, −1).]
◮ Model: all the transitions are Markov; states x5, x6, x7 are terminal (absorbing) states.
◮ Setting: infinite horizon with terminal states.
◮ Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.
[Figure: the same Work/Rest MDP with the states numbered 1–7; transition probabilities and rewards as above.]
Definition
The state value function of a policy π (in the discounted setting) is
V^π(x) = E[ Σ_{t≥0} γ^t r(xt, π(xt)) | x0 = x; π ].
Bellman Equations for Discounted Infinite Horizon Problems
Proposition (Bellman equation)
For any stationary policy π = (π, π, . . . ), the state value function satisfies, for all x ∈ X,
V^π(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
Proof.
V^π(x) = E[ Σ_{t≥0} γ^t r(xt, π(xt)) | x0 = x; π ]
= r(x, π(x)) + E[ Σ_{t≥1} γ^t r(xt, π(xt)) | x0 = x; π ]
= r(x, π(x)) + γ Σ_y P(x1 = y | x0 = x; π(x0)) E[ Σ_{t≥1} γ^{t−1} r(xt, π(xt)) | x1 = y; π ]
= r(x, π(x)) + γ Σ_y p(y|x, π(x)) V^π(y).
Proposition (Bellman optimality equation)
The optimal value function V∗ satisfies, for all x ∈ X,
V∗(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V∗(y) ].
Proof.
For any policy π = (a, π′) (possibly non-stationary),
V∗(x) = max_π E[ Σ_{t≥0} γ^t r(xt, π(xt)) | x0 = x; π ]
= max_{(a,π′)} [ r(x, a) + γ Σ_y p(y|x, a) V^{π′}(y) ]
= max_a [ r(x, a) + γ Σ_y p(y|x, a) max_{π′} V^{π′}(y) ]
= max_a [ r(x, a) + γ Σ_y p(y|x, a) V∗(y) ].
Definition
For any W ∈ R^N, the Bellman operator T^π : R^N → R^N of a stationary policy π is
T^π W(x) = r(x, π(x)) + γ Σ_y p(y|x, π(x)) W(y),
and the optimal Bellman operator T : R^N → R^N is
T W(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) W(y) ].
Proposition (Monotonicity)
For any W1, W2 ∈ R^N, if W1 ≤ W2 componentwise, then T^π W1 ≤ T^π W2 and T W1 ≤ T W2.
Proposition (Contraction)
The operators T^π and T are γ-contractions in the L∞-norm: for any W1, W2 ∈ R^N,
||T^π W1 − T^π W2||∞ ≤ γ||W1 − W2||∞,
||T W1 − T W2||∞ ≤ γ||W1 − W2||∞.
V^π is the unique fixed point of T^π, and V∗ is the unique fixed point of T. Furthermore, for any W ∈ R^N and any stationary policy π,
lim_{k→∞} (T^π)^k W = V^π and lim_{k→∞} T^k W = V∗.
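These properties are easy to check numerically on a small tabular MDP. A sketch with arrays r(x, a) and p(y|x, a); the MDP itself is randomly generated, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N, A, gamma = 5, 3, 0.9
P = rng.random((N, A, N))
P /= P.sum(axis=2, keepdims=True)   # p(y|x,a): each (x,a) row is a distribution over y
r = rng.random((N, A))              # r(x,a)

def T(W):
    """Optimal Bellman operator: (T W)(x) = max_a [r(x,a) + gamma * sum_y p(y|x,a) W(y)]."""
    return (r + gamma * P @ W).max(axis=1)

def T_pi(W, pi):
    """Bellman operator of a deterministic policy pi: X -> A."""
    idx = np.arange(N)
    return r[idx, pi] + gamma * P[idx, pi] @ W

# gamma-contraction in the sup norm
W1, W2 = rng.random(N), rng.random(N)
assert np.abs(T(W1) - T(W2)).max() <= gamma * np.abs(W1 - W2).max() + 1e-12

# iterating T from any starting point converges to the unique fixed point V*
V = np.zeros(N)
for _ in range(2000):
    V = T(V)
assert np.allclose(V, T(V))         # V* = T V*
print(V)
```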
Proof.
The contraction property holds since for any x ∈ X we have
|T W1(x) − T W2(x)|
= | max_a [ r(x, a) + γ Σ_y p(y|x, a) W1(y) ] − max_{a′} [ r(x, a′) + γ Σ_y p(y|x, a′) W2(y) ] |
(a)≤ γ max_a | Σ_y p(y|x, a) W1(y) − Σ_y p(y|x, a) W2(y) |
≤ γ max_a Σ_y p(y|x, a) |W1(y) − W2(y)|
≤ γ ||W1 − W2||∞ max_a Σ_y p(y|x, a) = γ ||W1 − W2||∞,
where in (a) we used max_a f (a) − max_{a′} g(a′) ≤ max_a (f (a) − g(a)).
Bellman Equations for Undiscounted Infinite Horizon Problems
Definition (Proper policy)
A stationary policy π is proper if there exists an integer n ∈ N such that, from any state x ∈ X, the probability of not having reached the terminal state x̄ after n steps is strictly smaller than 1:
ρπ = max_x P(xn ≠ x̄ | x0 = x, π) < 1.
Proposition
For any proper policy π (with parameters n and ρπ), the value function is bounded:
||V^π||∞ ≤ n rmax / (1 − ρπ) < ∞.
Proof.
By definition of a proper policy,
P(x2n ≠ x̄ | x0 = x, π) ≤ P(x2n ≠ x̄ | xn ≠ x̄, π) × P(xn ≠ x̄ | x0 = x, π) ≤ ρπ².
Then for any t ∈ N, P(xt ≠ x̄ | x0 = x, π) ≤ ρπ^⌊t/n⌋, which implies that eventually the terminal state x̄ is reached with probability 1. Then
||V^π||∞ = max_x | E[ Σ_{t≥0} r(xt, π(xt)) | x0 = x; π ] |
≤ rmax Σ_{t≥0} P(xt ≠ x̄ | x0 = x, π) ≤ n rmax + rmax Σ_{t≥n} ρπ^⌊t/n⌋ ≤ n rmax / (1 − ρπ).
Proposition ([2])
If all policies are proper, the optimal value function V∗ satisfies the (undiscounted) Bellman optimality equation: for all x ∈ X,
V∗(x) = max_{a∈A} [ r(x, a) + Σ_y p(y|x, a) V∗(y) ].
Proposition
If all policies are proper, then there exist a vector μ ∈ R^N with μ(x) ≥ 1 and a scalar β < 1 such that T and T^π are β-contractions in the weighted norm || · ||∞,μ.
Proof.
Let μ(x) be the maximum (over policies) of the average time to the terminal state starting from x. Computing μ can be cast as an MDP where for any action and any state the reward is 1 (i.e., for any x ∈ X and a ∈ A, r(x, a) = 1). Under the assumption that all the policies are proper, μ is finite and it is the solution to the dynamic programming equation
μ(x) = 1 + max_a Σ_y p(y|x, a) μ(y).
Then μ(x) ≥ 1 and for any a ∈ A, μ(x) ≥ 1 + Σ_y p(y|x, a) μ(y). Furthermore,
Σ_y p(y|x, a) μ(y) ≤ μ(x) − 1 ≤ β μ(x), for β = max_x (μ(x) − 1)/μ(x) < 1.
Proof (cont’d).
From this definition of μ and β we obtain the contraction property of T (similarly for T^π) in the norm L∞,μ:
||T W1 − T W2||∞,μ = max_x |T W1(x) − T W2(x)| / μ(x)
≤ max_{x,a} (1/μ(x)) Σ_y p(y|x, a) |W1(y) − W2(y)|
≤ max_{x,a} (1/μ(x)) Σ_y p(y|x, a) μ(y) ||W1 − W2||∞,μ
≤ β ||W1 − W2||∞,μ.
Dynamic Programming
Value Iteration (VI): given an initial V0 ∈ R^N, at each iteration k
◮ Compute Vk+1 = T Vk, i.e., for all x ∈ X,
Vk+1(x) = max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) Vk(y) ].
At the end, return the greedy policy
πK(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) VK(y) ].
◮ From the fixed point property of T:
lim_{k→∞} Vk = V∗.
◮ From the contraction property of T:
||Vk+1 − V∗||∞ = ||T Vk − T V∗||∞ ≤ γ||Vk − V∗||∞ ≤ γ^{k+1}||V0 − V∗||∞ → 0.
◮ Convergence rate. Let ε > 0 and ||r||∞ ≤ rmax; then after at most
K = log(rmax/ε) / log(1/γ) iterations, ||VK − V∗||∞ ≤ ε.
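A sketch of value iteration on a random tabular MDP, checking that the error ||Vk − V∗||∞ contracts by γ at every iteration (V∗ here is approximated by running the Bellman operator to numerical convergence; the MDP is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, A, gamma = 6, 2, 0.8
P = rng.random((N, A, N))
P /= P.sum(axis=2, keepdims=True)   # p(y|x,a)
r = rng.random((N, A))              # r(x,a)
T = lambda V: (r + gamma * P @ V).max(axis=1)   # optimal Bellman operator

# near-exact V*: run T until numerical convergence
V_star = np.zeros(N)
for _ in range(5000):
    V_star = T(V_star)

# value iteration from V0 = 0, tracking ||V_k - V*||_inf
V, errs = np.zeros(N), []
for k in range(50):
    V = T(V)
    errs.append(np.abs(V - V_star).max())

# the error contracts by at least gamma at every iteration
for k in range(1, len(errs)):
    assert errs[k] <= gamma * errs[k - 1] + 1e-12

pi = (r + gamma * P @ V).argmax(axis=1)          # greedy policy from the last iterate
print(pi, errs[-1])
```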
Q-iteration.
◮ Compute Qk+1 = T Qk, i.e., for all (x, a),
Qk+1(x, a) = r(x, a) + γ Σ_y p(y|x, a) max_{b∈A} Qk(y, b).
◮ Return the greedy policy πK(x) ∈ arg max_{a∈A} QK(x, a).

Asynchronous VI.
◮ Choose a state xk.
◮ Compute Vk+1(xk) = T Vk(xk) and leave Vk+1(x) = Vk(x) for x ≠ xk.
◮ Return the greedy policy
πK(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) VK(y) ].
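Both variants can be sketched on the same kind of random tabular MDP: Q-iteration applies the Bellman operator in Q-space (so the greedy policy needs no model), while asynchronous VI updates a single state per iteration (the random state schedule below is one possible choice):

```python
import numpy as np

rng = np.random.default_rng(2)
N, A, gamma = 5, 3, 0.9
P = rng.random((N, A, N))
P /= P.sum(axis=2, keepdims=True)   # p(y|x,a)
r = rng.random((N, A))              # r(x,a)

# Q-iteration: Q_{k+1}(x,a) = r(x,a) + gamma * sum_y p(y|x,a) max_b Q_k(y,b)
Q = np.zeros((N, A))
for _ in range(1000):
    Q = r + gamma * P @ Q.max(axis=1)
pi_q = Q.argmax(axis=1)             # greedy policy from Q: no model needed

# Asynchronous VI: apply T at a single (here random) state per iteration
V = np.zeros(N)
for k in range(5000):
    x = rng.integers(N)
    V[x] = (r[x] + gamma * P[x] @ V).max()
pi_v = (r + gamma * P @ V).argmax(axis=1)
print(pi_q, pi_v)
```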
Policy Iteration (PI): at each iteration k
◮ Policy evaluation: given πk, compute V^{πk}.
◮ Policy improvement: compute the greedy policy
πk+1(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V^{πk}(y) ].
Stop when V^{πk} = V^{πk+1}.
Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance, V^{πk+1} ≥ V^{πk}, and it converges to an optimal policy π∗ in a finite number of iterations.
Proof.
From the definition of the Bellman operators and the greedy policy πk+1,
V^{πk} = T^{πk} V^{πk} ≤ T V^{πk} = T^{πk+1} V^{πk},   (1)
and from the monotonicity property of T^{πk+1} it follows that
V^{πk} ≤ T^{πk+1} V^{πk},
T^{πk+1} V^{πk} ≤ (T^{πk+1})² V^{πk},
. . .
(T^{πk+1})^{n−1} V^{πk} ≤ (T^{πk+1})^n V^{πk}, . . .
Joining all the inequalities in the chain we obtain
V^{πk} ≤ lim_{n→∞} (T^{πk+1})^n V^{πk} = V^{πk+1}.
Then (V^{πk})_k is a non-decreasing sequence.
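A sketch of policy iteration on a random tabular MDP, with the evaluation step done exactly through the linear system V^π = (I − γP^π)^{-1} r^π (the MDP is illustrative; termination is detected when the greedy policy stops changing):

```python
import numpy as np

rng = np.random.default_rng(3)
N, A, gamma = 6, 3, 0.9
P = rng.random((N, A, N))
P /= P.sum(axis=2, keepdims=True)   # p(y|x,a)
r = rng.random((N, A))              # r(x,a)
idx = np.arange(N)

def evaluate(pi):
    """Exact policy evaluation: solve (I - gamma P_pi) V = r_pi."""
    return np.linalg.solve(np.eye(N) - gamma * P[idx, pi], r[idx, pi])

pi = np.zeros(N, dtype=int)                       # arbitrary initial policy
while True:
    V = evaluate(pi)                              # policy evaluation
    pi_new = (r + gamma * P @ V).argmax(axis=1)   # greedy policy improvement
    if np.array_equal(pi_new, pi):                # finite termination
        break
    pi = pi_new

# at termination V = T V, i.e. the Bellman optimality equation holds
assert np.allclose(V, (r + gamma * P @ V).max(axis=1))
print(pi)
```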
Proof (cont’d).
Since a finite MDP admits a finite number of policies, the termination condition is eventually met at some iteration k. Then eq. (1) holds with equality, so V^{πk} = T V^{πk}; hence V^{πk} = V∗, which implies that πk is an optimal policy.
◮ Direct computation. For any policy π compute
V^π = (I − γP^π)^{−1} r^π.
Complexity: O(N³) (improvable to O(N^{2.807})). Exercise: prove the previous equality.
◮ Iterative policy evaluation. For any policy π,
lim_{n→∞} (T^π)^n V0 = V^π.
Complexity: an ε-approximation of V^π requires O(N² log(1/ε) / log(1/γ)) steps.
◮ Monte-Carlo simulation. In each state x, simulate n trajectories ((x_t^i)_{t≥0})_{1≤i≤n} following policy π and compute
V̂^π(x) ≈ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(x_t^i, π(x_t^i)).
Complexity: in each state, the approximation error is O(1/√n).
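The three approaches can be compared on a small Markov chain induced by a fixed policy (a sketch: the transition matrix P^π and reward vector r^π are randomly generated, and the trajectory count and horizon of the Monte-Carlo estimate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
N, gamma = 4, 0.9
P = rng.random((N, N))
P /= P.sum(axis=1, keepdims=True)    # P_pi: the chain induced by a fixed policy
r = rng.random(N)                    # r_pi

# 1) direct computation: V = (I - gamma P)^(-1) r, O(N^3)
V_direct = np.linalg.solve(np.eye(N) - gamma * P, r)

# 2) iterative policy evaluation: V_{n+1} = T_pi V_n = r + gamma P V_n
V = np.zeros(N)
for _ in range(500):
    V = r + gamma * P @ V

# 3) Monte-Carlo: average truncated discounted returns from each state
n_traj, H = 300, 100                 # illustrative sample sizes
V_mc = np.zeros(N)
for x0 in range(N):
    for _ in range(n_traj):
        x, g, ret = x0, 1.0, 0.0
        for t in range(H):
            ret += g * r[x]
            g *= gamma
            x = rng.choice(N, p=P[x])
        V_mc[x0] += ret / n_traj

print(np.abs(V - V_direct).max(), np.abs(V_mc - V_direct).max())
```

The iterative estimate matches the direct solve to machine precision, while the Monte-Carlo estimate carries the O(1/√n) sampling error.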
◮ If the policy is evaluated with V, then the policy improvement requires the knowledge of the model:
π(x) ∈ arg max_{a∈A} [ r(x, a) + γ Σ_y p(y|x, a) V(y) ].
◮ If the policy is evaluated with Q, then the policy improvement is model-free:
π(x) ∈ arg max_{a∈A} Q(x, a).
Value Iteration
◮ Pros: each iteration is very computationally efficient.
◮ Cons: convergence is only asymptotic.
Policy Iteration
◮ Pros: converges in a finite number of iterations (often small in practice).
◮ Cons: each iteration requires a full policy evaluation, which may be expensive.
◮ Modified Policy Iteration
◮ λ-Policy Iteration
◮ Linear Programming: a one-shot approach to computing V ∗
Conclusions
◮ The Markov Decision Process framework
◮ The discounted infinite horizon setting
◮ State and state-action value function
◮ Bellman equations and Bellman operators
◮ The value and policy iteration algorithms
[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
[2] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] W.H. Fleming and R.W. Rishel. Deterministic and Stochastic Optimal Control. Applications of Mathematics, vol. 1, Springer-Verlag, Berlin/New York, 1975.
[4] R.A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
[5] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New York, 1994.
Alessandro Lazaric alessandro.lazaric@inria.fr sequel.lille.inria.fr