1. Markov Decision Processes and Dynamic Programming
A. LAZARIC (SequeL Team @ INRIA-Lille)
ENS Cachan – Master 2 MVA
SequeL – INRIA Lille, MVA-RL Course

2. In This Lecture
◮ How do we formalize the agent-environment interaction? ⇒ Markov Decision Process (MDP)
◮ How do we solve an MDP? ⇒ Dynamic Programming

3. Outline
Mathematical Tools
The Markov Decision Process
Bellman Equations for Discounted Infinite Horizon Problems
Bellman Equations for Undiscounted Infinite Horizon Problems
Dynamic Programming
Conclusions

4. Mathematical Tools – Probability Theory

Definition (Conditional probability). Given two events A and B with P(B) > 0, the conditional probability of A given B is

P(A | B) = P(A ∩ B) / P(B).

Similarly, if X and Y are non-degenerate and jointly continuous random variables with density f_{X,Y}(x, y), and B has positive measure, then the conditional probability is

P(X ∈ A | Y ∈ B) = ( ∫_{x∈A} ∫_{y∈B} f_{X,Y}(x, y) dx dy ) / ( ∫_x ∫_{y∈B} f_{X,Y}(x, y) dx dy ).
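A quick Monte Carlo check of a conditional probability of this form (a sketch; the bivariate normal density and the events A = {X ≥ 0}, B = {Y ≥ 1} are illustrative assumptions, not from the slides):

```python
import numpy as np

# Estimate P(X in A | Y in B) for jointly continuous (X, Y): here (X, Y) is
# standard bivariate normal with correlation 0.5 (an illustrative choice),
# A = [0, inf), B = [1, inf).
rng = np.random.default_rng(6)
cov = [[1.0, 0.5], [0.5, 1.0]]
x, y = rng.multivariate_normal([0, 0], cov, size=500_000).T

in_B = y >= 1.0
cond = (x >= 0)[in_B].mean()   # fraction of samples with X >= 0 among those with Y >= 1
print(cond)
```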

5. Mathematical Tools – Probability Theory

Definition (Law of total expectation). Given a function f and two random variables X, Y, we have that

E_{X,Y}[ f(X, Y) ] = E_X[ E_Y[ f(x, Y) | X = x ] ].
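A small numerical illustration of the law (a sketch; the choices Y | X = x ~ N(x, 1), X uniform on {0, 1, 2}, and f(x, y) = xy + y² are assumptions made for the example):

```python
import numpy as np

# Compare a direct Monte Carlo estimate of E[f(X, Y)] with the average of the
# closed-form inner expectation E[f(x, Y) | X = x] = x^2 + (x^2 + 1) = 2x^2 + 1.
rng = np.random.default_rng(0)

xs = rng.integers(0, 3, size=200_000)    # X uniform on {0, 1, 2}
ys = rng.normal(loc=xs, scale=1.0)       # Y | X = x ~ N(x, 1)

lhs = (xs * ys + ys**2).mean()           # direct estimate of E[f(X, Y)]
rhs = (2.0 * xs**2 + 1.0).mean()         # E_X of the conditional expectation
print(lhs, rhs)                          # agree up to Monte Carlo error
```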

6. Mathematical Tools – Norms and Contractions

Definition. Given a vector space V ⊆ ℝ^d, a function f : V → ℝ⁺₀ is a norm if and only if
◮ If f(v) = 0 for some v ∈ V, then v = 0.
◮ For any λ ∈ ℝ, v ∈ V: f(λv) = |λ| f(v).
◮ Triangle inequality: for any v, u ∈ V, f(v + u) ≤ f(v) + f(u).

7. Mathematical Tools – Norms and Contractions

◮ L_p-norm: ||v||_p = ( Σ_{i=1}^d |v_i|^p )^{1/p}.
◮ L_∞-norm: ||v||_∞ = max_{1≤i≤d} |v_i|.
◮ L_{μ,p}-norm: ||v||_{μ,p} = ( Σ_{i=1}^d |v_i|^p / μ_i )^{1/p}.
◮ L_{μ,∞}-norm: ||v||_{μ,∞} = max_{1≤i≤d} |v_i| / μ_i.
◮ L_{2,P}-matrix norm (P a positive definite matrix): ||v||²_P = v^⊤ P v.
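These norms are straightforward to evaluate; a minimal sketch, where the weights μ and the matrix P are illustrative choices:

```python
import numpy as np

# Evaluate each norm on a concrete vector. mu must be positive; P must be
# positive definite (both are assumptions of this example).
v = np.array([1.0, -2.0, 3.0])
mu = np.array([0.5, 1.0, 2.0])
P = np.diag([1.0, 2.0, 3.0])

p = 2
lp = (np.abs(v) ** p).sum() ** (1 / p)           # L_p norm
linf = np.abs(v).max()                           # L_inf norm
lmup = ((np.abs(v) ** p / mu).sum()) ** (1 / p)  # L_{mu,p} norm
lmuinf = (np.abs(v) / mu).max()                  # L_{mu,inf} norm
l2P = v @ P @ v                                  # squared L_{2,P} norm: v^T P v
print(lp, linf, lmup, lmuinf, l2P)
```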

8. Mathematical Tools – Norms and Contractions

Definition. A sequence of vectors v_n ∈ V (with n ∈ ℕ) is said to converge in norm ||·|| to v ∈ V if lim_{n→∞} ||v_n − v|| = 0.

Definition. A sequence of vectors v_n ∈ V (with n ∈ ℕ) is a Cauchy sequence if lim_{n→∞} sup_{m≥n} ||v_n − v_m|| = 0.

Definition. A vector space V equipped with a norm ||·|| is complete if every Cauchy sequence in V is convergent in the norm of the space.

9. Mathematical Tools – Norms and Contractions

Definition. An operator T : V → V is L-Lipschitz if for any v, u ∈ V,

||T v − T u|| ≤ L ||u − v||.

If L ≤ 1 then T is a non-expansion, while if L < 1 then T is an L-contraction. If T is Lipschitz then it is also continuous, that is, if v_n → v in ||·|| then T v_n → T v in ||·||.

Definition. A vector v ∈ V is a fixed point of the operator T : V → V if T v = v.

10. Mathematical Tools – Norms and Contractions

Proposition (Banach Fixed Point Theorem). Let V be a complete vector space equipped with the norm ||·|| and T : V → V a γ-contraction mapping. Then
1. T admits a unique fixed point v.
2. For any v_0 ∈ V, if v_{n+1} = T v_n then v_n → v in ||·|| with a geometric convergence rate:

||v_n − v|| ≤ γ^n ||v_0 − v||.
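A minimal demonstration of the theorem on an affine operator (a sketch; the matrix A, vector b, and γ are illustrative assumptions; T(v) = γAv + b is a γ-contraction in the sup norm because A is row-stochastic):

```python
import numpy as np

# Fixed-point iteration v_{n+1} = T(v_n) for T(v) = gamma * A v + b.
# Since ||A u||_inf <= ||u||_inf for a row-stochastic A, T is a gamma-contraction.
gamma = 0.9
A = np.array([[0.5, 0.5], [0.2, 0.8]])
b = np.array([1.0, 2.0])
T = lambda v: gamma * A @ v + b

v_star = np.linalg.solve(np.eye(2) - gamma * A, b)  # exact fixed point of T
v = np.zeros(2)                                     # v_0 = 0
for n in range(30):
    v = T(v)

# The error after 30 steps is below the geometric bound gamma^30 * ||v_0 - v*||.
print(np.abs(v - v_star).max(), gamma**30 * np.abs(v_star).max())
```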

11. Mathematical Tools – Linear Algebra

Given a square matrix A ∈ ℝ^{N×N}:
◮ Eigenvalues of a matrix (1): v ∈ ℝ^N and λ ∈ ℝ are an eigenvector and eigenvalue of A if Av = λv.
◮ Eigenvalues of a matrix (2): if A has eigenvalues {λ_i}_{i=1}^N, then B = (I − αA) has eigenvalues {μ_i}_{i=1}^N with μ_i = 1 − αλ_i.
◮ Matrix inversion: A can be inverted if and only if ∀i, λ_i ≠ 0.
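A quick numerical check of the second property (a sketch; A and α are arbitrary illustrative choices, with A made symmetric so its eigenvalues are real):

```python
import numpy as np

# Verify that B = I - alpha * A has eigenvalues 1 - alpha * lambda_i.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                 # symmetrize: real eigenvalues
alpha = 0.3

lam = np.linalg.eigvalsh(A)
mu = np.linalg.eigvalsh(np.eye(4) - alpha * A)
print(np.sort(mu))
print(np.sort(1 - alpha * lam))   # the two sorted lists coincide
```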

12. Mathematical Tools – Linear Algebra

◮ Stochastic matrix. A square matrix P ∈ ℝ^{N×N} is a stochastic matrix if
1. all entries are non-negative: ∀i, j, [P]_{i,j} ≥ 0,
2. all the rows sum to one: ∀i, Σ_{j=1}^N [P]_{i,j} = 1.
All the eigenvalues of a stochastic matrix are bounded by 1 in modulus, i.e., ∀i, |λ_i| ≤ 1.
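A sanity check of the eigenvalue bound on a randomly generated stochastic matrix (a sketch; the matrix itself is arbitrary):

```python
import numpy as np

# Build a random stochastic matrix (non-negative entries, rows summing to one)
# and check that every eigenvalue has modulus at most 1. Note that 1 is always
# an eigenvalue, since P @ ones = ones.
rng = np.random.default_rng(2)
P = rng.random((5, 5))
P /= P.sum(axis=1, keepdims=True)   # normalize each row

eigvals = np.linalg.eigvals(P)
print(np.abs(eigvals).max())        # <= 1 (equals 1 up to floating point)
```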

13. Outline
Mathematical Tools
The Markov Decision Process
Bellman Equations for Discounted Infinite Horizon Problems
Bellman Equations for Undiscounted Infinite Horizon Problems
Dynamic Programming
Conclusions

14. The Markov Decision Process – The Reinforcement Learning Model

[Diagram: the learning agent acts on the environment through actuation (actions) and receives perception (states) back; a critic maps the interaction to a reward signal.]

15. The Markov Decision Process – Markov Chains

Definition (Markov chain). Let the state space X be a bounded compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈ℕ} ∈ X is a Markov chain if it satisfies the Markov property

P(x_{t+1} = x | x_t, x_{t−1}, ..., x_0) = P(x_{t+1} = x | x_t).

Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p:

p(y|x) = P(x_{t+1} = y | x_t = x).
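A minimal simulator of a finite Markov chain, where the next state is drawn using only the current one (a sketch; the 3-state transition matrix is an illustrative assumption):

```python
import numpy as np

# Simulate a short trajectory of a finite Markov chain with transition
# probabilities p[x, y] = P(x_{t+1} = y | x_t = x).
rng = np.random.default_rng(3)
p = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

x = 0
trajectory = [x]
for t in range(10):
    x = rng.choice(3, p=p[x])   # Markov property: only the current state matters
    trajectory.append(x)
print(trajectory)
```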

16. The Markov Decision Process – Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2]). A Markov decision process is defined as a tuple M = (X, A, p, r) where
◮ X is the state space,
◮ A is the action space,
◮ p(y|x, a) is the transition probability, with p(y|x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
◮ r(x, a, y) is the reward of transition (x, a, y).
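One possible way to carry the tuple M = (X, A, p, r) around in code; the class layout and the toy instance below are illustrative assumptions, not a standard API:

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    states: Sequence   # X, the state space
    actions: Sequence  # A, the action space
    p: Callable        # p(y, x, a) = P(x_{t+1} = y | x_t = x, a_t = a)
    r: Callable        # r(x, a, y): reward of transition (x, a, y)

# A two-state, two-action toy instance: action 1 tends to flip the state,
# and reaching state 1 yields reward 1.
toy = MDP(
    states=[0, 1],
    actions=[0, 1],
    p=lambda y, x, a: 0.9 if y == (x ^ a) else 0.1,
    r=lambda x, a, y: 1.0 if y == 1 else 0.0,
)
print(toy.p(1, 0, 1), toy.r(0, 1, 1))
```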

17. The Markov Decision Process – Policy

Definition (Policy). A decision rule π_t can be
◮ Deterministic: π_t : X → A,
◮ Stochastic: π_t : X → Δ(A).
A policy (strategy, plan) can be
◮ Non-stationary: π = (π_0, π_1, π_2, ...),
◮ Stationary (Markovian): π = (π, π, π, ...).

Remark: an MDP M plus a stationary policy π induces a Markov chain on the state space X with transition probability p(y|x) = p(y|x, π(x)).
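A sketch of this remark: given tabular transition probabilities and a deterministic stationary policy, the transition matrix of the induced chain is read off row by row (the numbers are illustrative):

```python
import numpy as np

# P_mdp[a, x, y] = p(y | x, a); pi[x] is the action taken in state x.
P_mdp = np.array([
    [[0.8, 0.2], [0.3, 0.7]],   # transition matrix of action 0
    [[0.5, 0.5], [0.1, 0.9]],   # transition matrix of action 1
])
pi = np.array([1, 0])

# The induced Markov chain: chain[x, y] = p(y | x, pi(x)).
chain = np.array([P_mdp[pi[x], x] for x in range(2)])
print(chain)
```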

18. The Markov Decision Process – Question

Is the MDP formalism powerful enough? ⇒ Let's try!

19. The Markov Decision Process – Example: the Retail Store Management Problem

Description. At each month t, a store contains x_t items of a specific good, and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from his supplier. Furthermore we know that
◮ The cost of maintaining an inventory of x is h(x).
◮ The cost to order a items is C(a).
◮ The income for selling q items is f(q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.

20. The Markov Decision Process – Example: the Retail Store Management Problem

◮ State space: x ∈ X = {0, 1, ..., M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}.
◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]^+. Problem: the dynamics should be Markov and stationary! The demand D_t is stochastic and time-independent; formally, D_t ~ D i.i.d.
◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
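A sketch of one month of these dynamics; the cost and income functions h, C, f and the demand distribution are not specified in the slides, so the choices below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
M = 20                          # store capacity
h = lambda x: 0.5 * x           # holding cost (assumption)
C = lambda a: 2.0 * a           # ordering cost (assumption)
f = lambda q: 5.0 * q           # sales income (assumption)

def step(x, a):
    """One transition: order a items in state x, observe i.i.d. demand D."""
    D = rng.poisson(5)                     # demand distribution (assumption)
    x_next = max(x + a - D, 0)             # x_{t+1} = [x_t + a_t - D_t]^+
    reward = -C(a) - h(x + a) + f(max(x + a - x_next, 0))
    return x_next, reward

x, a = 10, 5                               # note a must satisfy a <= M - x
print(step(x, a))
```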

21. The Markov Decision Process – Exercise: the Parking Problem

A driver wants to park his car as close as possible to the restaurant.

[Figure: a street with places 1, 2, ..., T leading to the restaurant; place t is free with probability p(t), and the reward is 0 if the driver does not park.]

◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking space is to the restaurant, the higher the satisfaction.
◮ If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.
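For experimenting with the exercise, a bare-bones simulator of the street (a sketch; the occupancy probability and the reward shape are assumptions, and it does not solve the exercise):

```python
import numpy as np

rng = np.random.default_rng(5)
P = 10                                 # number of places
occupied = rng.random(P) < 0.6         # each place free w.p. 0.4 (assumption)

def drive(policy):
    """policy(i) -> True to park at place i if it is free, False to move on."""
    for i in range(P):
        if not occupied[i] and policy(i):
            return i + 1               # reward grows closer to the restaurant (assumption)
    return 0                           # never parked: zero reward

print(drive(lambda i: i >= P - 3))     # e.g. only try the last three places
```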
