Markov Decision Processes and Dynamic Programming
A. LAZARIC (SequeL Team @ INRIA-Lille)
ENS Cachan - Master 2 MVA (MVA-RL Course)

How to model an RL problem: the Markov Decision Process


The Reinforcement Learning Model

The environment
◮ Controllability: full (e.g., chess) or partial (e.g., portfolio optimization)
◮ Uncertainty: deterministic (e.g., chess) or stochastic (e.g., backgammon)
◮ Reactivity: adversarial (e.g., chess) or fixed (e.g., tetris)
◮ Observability: full (e.g., chess) or partial (e.g., robotics)
◮ Availability: known (e.g., chess) or unknown (e.g., robotics)

The critic
◮ Sparse (e.g., win or lose) vs. informative (e.g., closer or further)
◮ Preference-based reward
◮ Frequent or sporadic
◮ Known or unknown

The agent
◮ Open-loop control
◮ Closed-loop control (i.e., adaptive)
◮ Non-stationary closed-loop control (i.e., learning)

Markov Chains

Definition (Markov chain). Let the state space X be a bounded compact subset of a Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if it satisfies the Markov property

    P(x_{t+1} = x | x_t, x_{t−1}, ..., x_0) = P(x_{t+1} = x | x_t).

Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p:

    p(y | x) = P(x_{t+1} = y | x_t = x).
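As a concrete illustration (not part of the original slides), here is a minimal Python sketch of a finite Markov chain specified by a transition matrix; the three-state chain and its probabilities are made up for the example.

    import numpy as np

    # Hypothetical 3-state Markov chain; row x of P is p(. | x) and sums to 1.
    P = np.array([[0.9, 0.1, 0.0],
                  [0.2, 0.7, 0.1],
                  [0.0, 0.3, 0.7]])

    def simulate(P, x0, T, rng=np.random.default_rng(0)):
        """Sample a trajectory x_0, ..., x_T using only p(. | x_t) (Markov property)."""
        traj = [x0]
        for _ in range(T):
            traj.append(rng.choice(len(P), p=P[traj[-1]]))
        return traj

    print(simulate(P, x0=0, T=10))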

Markov Decision Process

Definition (Markov decision process [1, 4, 3, 5, 2]). A Markov decision process is defined as a tuple M = (X, A, p, r) where
◮ X is the state space,
◮ A is the action space,
◮ p(y | x, a) is the transition probability, with p(y | x, a) = P(x_{t+1} = y | x_t = x, a_t = a),
◮ r(x, a, y) is the reward of transition (x, a, y).
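To make the tuple concrete, here is a minimal Python sketch (not from the slides) of a tabular MDP container; the class and field names are hypothetical.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TabularMDP:
        # X = {0, ..., n_states-1}, A = {0, ..., n_actions-1}
        P: np.ndarray   # shape (n_states, n_actions, n_states), P[x, a, y] = p(y | x, a)
        R: np.ndarray   # shape (n_states, n_actions, n_states), R[x, a, y] = r(x, a, y)
        gamma: float    # discount factor, used later for the discounted criterion

        def step(self, x, a, rng):
            """Sample y ~ p(. | x, a) and return (y, r(x, a, y))."""
            y = rng.choice(self.P.shape[2], p=self.P[x, a])
            return y, self.R[x, a, y]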

Markov Decision Process: the Assumptions

Time assumption: time is discrete, t → t + 1.
Possible relaxations
◮ Identify the proper time granularity
◮ Most of the MDP literature extends to continuous time

Markov Decision Process: the Assumptions

Markov assumption: the current state x and action a are a sufficient statistic for the next state y,

    p(y | x, a) = P(x_{t+1} = y | x_t = x, a_t = a).

Possible relaxations
◮ Define a new state h_t = (x_t, x_{t−1}, x_{t−2}, ...)
◮ Move to a partially observable MDP (POMDP)
◮ Move to a predictive state representation (PSR) model

Markov Decision Process: the Assumptions

Reward assumption: the reward is uniquely defined by a transition (or part of it),

    r(x, a, y).

Possible relaxations
◮ Distinguish between the global goal and the reward function
◮ Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviors

Markov Decision Process: the Assumptions

Stationarity assumption: the dynamics and the reward do not change over time,

    p(y | x, a) = P(x_{t+1} = y | x_t = x, a_t = a),    r(x, a, y).

Possible relaxations
◮ Identify and remove the non-stationary components (e.g., cyclo-stationary dynamics)
◮ Identify the time scale of the changes

Question

Is the MDP formalism powerful enough? ⇒ Let's try!

Example: the Retail Store Management Problem

Description. At each month t, a store contains x_t items of a specific good, and the demand for that good is D_t. At the end of each month the manager of the store can order a_t more items from the supplier. Furthermore, we know that
◮ The cost of maintaining an inventory of x items is h(x).
◮ The cost of ordering a items is C(a).
◮ The income for selling q items is f(q).
◮ If the demand D is bigger than the available inventory x, customers that cannot be served leave.
◮ The value of the remaining inventory at the end of the year is g(x).
◮ Constraint: the store has a maximum capacity M.

Example: the Retail Store Management Problem (as an MDP)

◮ State space: x ∈ X = {0, 1, ..., M}.
◮ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, ..., M − x}.
◮ Dynamics: x_{t+1} = [x_t + a_t − D_t]^+.
  Problem: the dynamics should be Markov and stationary! The demand D_t is stochastic and time-independent; formally, D_t ∼ D i.i.d.
◮ Reward: r_t = −C(a_t) − h(x_t + a_t) + f([x_t + a_t − x_{t+1}]^+).
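A minimal Python sketch of one month of this MDP, assuming a Poisson demand distribution, a capacity M = 20, and linear cost/income functions purely for illustration (the slides leave h, C, f and the law of D unspecified).

    import numpy as np

    M = 20                                          # store capacity (illustrative)
    rng = np.random.default_rng(0)

    # Illustrative choices, not from the slides:
    h = lambda x: 0.1 * x                           # inventory maintenance cost
    C = lambda a: 2.0 + 0.5 * a if a > 0 else 0.0   # ordering cost
    f = lambda q: 1.5 * q                           # income for selling q items

    def month_step(x, a):
        """One transition of the retail-store MDP: x_{t+1} = [x_t + a_t - D_t]^+."""
        assert 0 <= a <= M - x                      # a in A(x) = {0, ..., M - x}
        D = rng.poisson(5)                          # assumed demand distribution
        x_next = max(x + a - D, 0)
        sold = x + a - x_next                       # [x_t + a_t - x_{t+1}]^+
        r = -C(a) - h(x + a) + f(sold)
        return x_next, r

    print(month_step(x=3, a=10))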

Exercise: the Parking Problem

A driver wants to park his car as close as possible to the restaurant.

[Figure: a line of parking places 1, 2, ..., T leading to the restaurant, each available with probability p(t); the reward grows as the place gets closer to the restaurant.]

◮ The driver cannot see whether a place is available unless he is in front of it.
◮ There are P places.
◮ At each place i the driver can either move to the next place or park (if the place is available).
◮ The closer the parking place is to the restaurant, the higher the satisfaction.
◮ If the driver doesn't park anywhere, then he/she leaves the restaurant and has to find another one.

Policy

Definition (Policy). A decision rule π_t can be
◮ Deterministic: π_t : X → A,
◮ Stochastic: π_t : X → Δ(A).

A policy (strategy, plan) can be
◮ Non-stationary: π = (π_0, π_1, π_2, ...),
◮ Stationary (Markovian): π = (π, π, π, ...).

Remark: MDP M + stationary policy π ⇒ Markov chain with state space X and transition probability p(y | x) = p(y | x, π(x)).

Example: the Retail Store Management Problem (policies)

◮ Stationary policy 1:
    π(x) = M − x   if x < M/4,
    π(x) = 0       otherwise.
◮ Stationary policy 2:
    π(x) = max{(M − x)/2 − x; 0}.
◮ Non-stationary policy:
    π_t(x) = M − x             if t < 6,
    π_t(x) = ⌊(M − x)/5⌋       otherwise.
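The same three policies written as Python functions (a sketch; the capacity M = 20 is an illustrative value, and orders are rounded down to whole items, which is an assumption not stated on the slide).

    import math

    M = 20  # store capacity (illustrative value)

    def policy_1(x):                     # stationary policy 1
        return M - x if x < M / 4 else 0

    def policy_2(x):                     # stationary policy 2
        return max((M - x) // 2 - x, 0)

    def policy_3(t, x):                  # non-stationary policy
        return M - x if t < 6 else math.floor((M - x) / 5)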

Outline: How to model an RL problem → The Markov Decision Process: the Model, Value Functions

Question

How do we evaluate a policy and compare two policies? ⇒ Value functions!

Optimization over a Time Horizon

◮ Finite time horizon T: there is a deadline at time T and the agent focuses on the sum of the rewards up to T.
◮ Infinite time horizon with discount: the problem never terminates, but rewards that are closer in time receive a higher importance.
◮ Infinite time horizon with terminal state: the problem never terminates, but the agent will eventually reach a terminal state.
◮ Infinite time horizon with average reward: the problem never terminates and the agent only focuses on the (expected) average of the rewards.

State Value Function (finite horizon)

◮ Finite time horizon T: the agent focuses on the sum of the rewards up to the deadline T,

    V^π(t, x) = E[ Σ_{s=t}^{T−1} r(x_s, π_s(x_s)) + R(x_T) | x_t = x; π ],

where R is a value function for the final state.
◮ Used when: there is an intrinsic deadline to meet.

State Value Function (discounted infinite horizon)

◮ Infinite time horizon with discount: the problem never terminates, but rewards that are closer in time receive a higher importance,

    V^π(x) = E[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ],

with discount factor 0 ≤ γ < 1:
◮ a small γ emphasizes short-term rewards, a large γ emphasizes long-term rewards,
◮ for any γ ∈ [0, 1) the series always converges (for bounded rewards).
◮ Used when: there is uncertainty about the deadline and/or an intrinsic definition of discount.
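As an illustration of this definition (not from the slides), a Monte Carlo sketch that estimates the discounted value of a fixed policy by averaging truncated returns over sampled trajectories; the arrays P and R are assumed to follow the tabular layout P[x, a, y], R[x, a, y] used in the earlier sketch.

    import numpy as np

    def mc_value(P, R, policy, x0, gamma=0.95, horizon=200, n_traj=1000, seed=0):
        """Estimate V^pi(x0) = E[ sum_t gamma^t r(x_t, pi(x_t)) | x_0 = x0 ] by simulation."""
        rng = np.random.default_rng(seed)
        n_states = P.shape[0]
        returns = []
        for _ in range(n_traj):
            x, ret, disc = x0, 0.0, 1.0
            for _ in range(horizon):           # truncation: gamma^horizon is negligible
                a = policy(x)
                y = rng.choice(n_states, p=P[x, a])
                ret += disc * R[x, a, y]
                disc *= gamma
                x = y
            returns.append(ret)
        return float(np.mean(returns))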

State Value Function (terminal state)

◮ Infinite time horizon with terminal state: the problem never terminates, but the agent will eventually reach a terminal state,

    V^π(x) = E[ Σ_{t=0}^{T} r(x_t, π(x_t)) | x_0 = x; π ],

where T is the first (random) time when the terminal state is reached.
◮ Used when: there is a known goal or a failure condition.

State Value Function (average reward)

◮ Infinite time horizon with average reward: the problem never terminates and the agent only focuses on the (expected) average of the rewards,

    V^π(x) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(x_t, π(x_t)) | x_0 = x; π ].

◮ Used when: the system should be constantly controlled over time.

State Value Function: Technical Note

The expectations refer to all possible stochastic trajectories. A (non-stationary) policy π applied from state x_0 returns a trajectory (x_0, r_0, x_1, r_1, x_2, r_2, ...) where r_t = r(x_t, π_t(x_t)) and x_t ∼ p(· | x_{t−1}, a_{t−1} = π_{t−1}(x_{t−1})) are random realizations. The value function (discounted infinite horizon) is

    V^π(x) = E_{(x_1, x_2, ...)}[ Σ_{t=0}^{∞} γ^t r(x_t, π(x_t)) | x_0 = x; π ].

Example: the Retail Store Management Problem (simulation demo).

Optimal Value Function

Definition (Optimal policy and optimal value function). The solution to an MDP is an optimal policy π* satisfying

    π* ∈ arg max_{π ∈ Π} V^π

in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V* = V^{π*}.

Optimal Value Function: Remarks

1. π* ∈ arg max(·) and not π* = arg max(·) because an MDP may admit more than one optimal policy.
2. π* achieves the largest possible value function in every state.
3. There always exists an optimal deterministic policy.
4. Except for problems with a finite horizon, there always exists an optimal stationary policy.

Summary

1. The MDP is a powerful model for the interaction between an agent and a stochastic environment.
2. The value function defines the objective to optimize.

Limitations

1. All the previous value functions define an objective in expectation.
2. Other utility functions may be used.
3. Risk measures could be integrated, but they may induce "weird" problems and make the solution more difficult.

How to solve an MDP exactly: Dynamic Programming

Outline: Bellman Equations, Value Iteration, Policy Iteration

Notice

From now on we mostly work in the discounted infinite-horizon setting. Most results extend smoothly to the other settings.

The Optimization Problem

    max_π V^π(x_0) = max_π E[ r(x_0, π(x_0)) + γ r(x_1, π(x_1)) + γ^2 r(x_2, π(x_2)) + ... ]

⇓ very challenging (a brute-force search would have to try as many as |A|^{|X|} policies!)
⇓ we need to leverage the structure of the MDP to simplify the optimization problem
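To see why exhaustive search over the |A|^{|X|} deterministic stationary policies blows up, here is a brute-force Python sketch (illustrative only, feasible for tiny tabular MDPs); it evaluates each candidate policy exactly by solving its Bellman linear system, which is derived on the following slides.

    import itertools
    import numpy as np

    def evaluate(P, R, gamma, pi):
        """Exact V^pi from the linear Bellman equation V = r_pi + gamma * P_pi V (see below)."""
        n = P.shape[0]
        P_pi = P[np.arange(n), pi]                    # P_pi[x, y] = p(y | x, pi(x))
        r_pi = (P_pi * R[np.arange(n), pi]).sum(1)    # r_pi[x] = E_y[ r(x, pi(x), y) ]
        return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

    def brute_force(P, R, gamma):
        """Try all |A|^|X| deterministic stationary policies; intractable beyond toy problems."""
        n_states, n_actions = P.shape[0], P.shape[1]
        best_pi, best_V = None, None
        for pi in itertools.product(range(n_actions), repeat=n_states):
            V = evaluate(P, R, gamma, np.array(pi))
            if best_V is None or np.all(V >= best_V):
                best_pi, best_V = pi, V
        return best_pi, best_V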

Dynamic Programming: the Bellman Equations

The Bellman Equation

Proposition. For any stationary policy π = (π, π, ...), the state value function at a state x ∈ X satisfies the Bellman equation:

    V^π(x) = r(x, π(x)) + γ Σ_y p(y | x, π(x)) V^π(y).

The Bellman Equation: Proof

For any stationary policy π,

    V^π(x) = E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
           = r(x, π(x)) + E[ Σ_{t≥1} γ^t r(x_t, π(x_t)) | x_0 = x; π ]
           = r(x, π(x)) + γ Σ_y P(x_1 = y | x_0 = x; π(x_0)) E[ Σ_{t≥1} γ^{t−1} r(x_t, π(x_t)) | x_1 = y; π ]
           = r(x, π(x)) + γ Σ_y p(y | x, π(x)) V^π(y).  ∎

Example: the Student Dilemma

[Figure: the student dilemma MDP with states x1 to x7; in states x1-x4 the student chooses Rest or Work, with transition probabilities such as 0.5/0.5, 0.4/0.6, 0.3/0.7, 0.9/0.1 and intermediate rewards r = 0, 1, −1, −10; states x5, x6, x7 are terminal with rewards r = −10, 100, −1000.]

Example: the Student Dilemma

◮ Model: all the transitions are Markov; states x5, x6, x7 are terminal.
◮ Setting: infinite horizon with terminal states.
◮ Objective: find the policy that maximizes the expected sum of rewards before reaching a terminal state.

Notice: this is not a discounted infinite-horizon setting, but the Bellman equations hold unchanged (with γ = 1).

Example: the Student Dilemma (policy evaluation)

[Figure: the same MDP annotated with the values of the evaluated policy: V1 = 88.3, V2 = 88.3, V3 = 86.9, V4 = 88.9, V5 = −10, V6 = 100, V7 = −1000.]

Example: the Student Dilemma

Computing V4 (using V6 = 100):

    V4 = −10 + (0.9 V6 + 0.1 V4)  ⇒  V4 = (−10 + 0.9 V6) / 0.9 ≈ 88.9

Example: the Student Dilemma

Computing V3 (no need to consider all possible trajectories; using V4 ≈ 88.9):

    V3 = −1 + (0.5 V4 + 0.5 V3)  ⇒  V3 = (−1 + 0.5 V4) / 0.5 ≈ 86.9

and so on for the rest...

The Optimal Bellman Equation

Bellman's Principle of Optimality [1]: "An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision."

The Optimal Bellman Equation

Proposition. The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:

    V*(x) = max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) V*(y) ],

and the optimal policy is

    π*(x) = arg max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) V*(y) ].
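A minimal Python sketch (illustrative, same tabular layout as the earlier sketches) of how the second formula turns a value function into a greedy policy.

    import numpy as np

    def greedy_policy(P, R, gamma, V):
        """pi(x) = argmax_a [ r(x, a) + gamma * sum_y p(y | x, a) V(y) ] for a tabular MDP."""
        # Q[x, a] = sum_y P[x, a, y] * (R[x, a, y] + gamma * V[y])
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
        return Q.argmax(axis=1)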

The Optimal Bellman Equation: Proof

For any policy π = (a, π′) (possibly non-stationary),

    V*(x) = max_π E[ Σ_{t≥0} γ^t r(x_t, π(x_t)) | x_0 = x; π ]                  (a)
          = max_{(a, π′)} [ r(x, a) + γ Σ_y p(y | x, a) V^{π′}(y) ]             (b)
          = max_a [ r(x, a) + γ Σ_y p(y | x, a) max_{π′} V^{π′}(y) ]            (c)
          = max_a [ r(x, a) + γ Σ_y p(y | x, a) V*(y) ].                        (d)  ∎

System of Equations

The Bellman equation

    V^π(x) = r(x, π(x)) + γ Σ_y p(y | x, π(x)) V^π(y)

is a linear system of equations with N unknowns and N linear constraints.


Example: the Student Dilemma (system of equations)

Applying V^π(x) = r(x, π(x)) + γ Σ_y p(y | x, π(x)) V^π(y) to the evaluated policy gives the system (with V, R ∈ R^7, P ∈ R^{7×7}):

    V1 = 0 + 0.5 V1 + 0.5 V2
    V2 = 1 + 0.3 V1 + 0.7 V3
    V3 = −1 + 0.5 V4 + 0.5 V3
    V4 = −10 + 0.9 V6 + 0.1 V4
    V5 = −10
    V6 = 100
    V7 = −1000

In matrix form, V = R + PV  ⇒  V = (I − P)^{−1} R.
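A NumPy sketch (illustrative) that solves exactly this 7-variable system; the matrix rows follow the equations above, with the terminal states x5, x6, x7 having no outgoing transitions, so their values equal their immediate rewards.

    import numpy as np

    # Indices 0..6 correspond to states x1..x7.
    R = np.array([0.0, 1.0, -1.0, -10.0, -10.0, 100.0, -1000.0])
    P = np.zeros((7, 7))
    P[0, [0, 1]] = [0.5, 0.5]      # V1 = 0   + 0.5 V1 + 0.5 V2
    P[1, [0, 2]] = [0.3, 0.7]      # V2 = 1   + 0.3 V1 + 0.7 V3
    P[2, [3, 2]] = [0.5, 0.5]      # V3 = -1  + 0.5 V4 + 0.5 V3
    P[3, [5, 3]] = [0.9, 0.1]      # V4 = -10 + 0.9 V6 + 0.1 V4
    # Terminal states x5, x6, x7: rows of P stay zero, so V equals the immediate reward.

    V = np.linalg.solve(np.eye(7) - P, R)
    print(np.round(V, 1))          # approx. [88.3, 88.3, 86.9, 88.9, -10, 100, -1000]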

System of Equations

The optimal Bellman equation

    V*(x) = max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) V*(y) ]

is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator).


Example: the Student Dilemma (optimal Bellman equations)

Applying V*(x) = max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) V*(y) ] gives the system

    V1 = max{ 0 + 0.5 V1 + 0.5 V2 ;   0 + 0.5 V1 + 0.5 V3 }
    V2 = max{ 1 + 0.4 V5 + 0.6 V2 ;   1 + 0.3 V1 + 0.7 V3 }
    V3 = max{ −1 + 0.4 V2 + 0.6 V3 ;  −1 + 0.5 V4 + 0.5 V3 }
    V4 = max{ −10 + 0.9 V6 + 0.1 V4 ; −10 + V7 }
    V5 = −10
    V6 = 100
    V7 = −1000

⇒ too complicated to solve directly; we need to find an alternative solution.

The Bellman Operators

Notation: w.l.o.g. we consider a discrete state space with |X| = N, so that V^π ∈ R^N.

Definition. For any W ∈ R^N, the Bellman operator T^π : R^N → R^N is

    T^π W(x) = r(x, π(x)) + γ Σ_y p(y | x, π(x)) W(y),

and the optimal Bellman operator (or dynamic programming operator) is

    T W(x) = max_{a ∈ A} [ r(x, a) + γ Σ_y p(y | x, a) W(y) ].
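A Python sketch (illustrative, same tabular layout as before) of the two operators; iterating T^π evaluates a fixed policy, while iterating T from any initial W is the value-iteration scheme listed in the course outline.

    import numpy as np

    def T_pi(W, P, R, gamma, pi):
        """Bellman operator for policy pi: (T^pi W)(x) = r(x, pi(x)) + gamma * sum_y p(y|x,pi(x)) W(y)."""
        n = P.shape[0]
        P_pi = P[np.arange(n), pi]                    # p(y | x, pi(x))
        r_pi = (P_pi * R[np.arange(n), pi]).sum(1)    # expected immediate reward under pi
        return r_pi + gamma * P_pi @ W

    def T(W, P, R, gamma):
        """Optimal Bellman operator: (T W)(x) = max_a [ r(x, a) + gamma * sum_y p(y|x,a) W(y) ]."""
        Q = (P * (R + gamma * W[None, None, :])).sum(axis=2)   # Q[x, a]
        return Q.max(axis=1)

    def value_iteration(P, R, gamma, tol=1e-8):
        """Iterate W <- T W until the update is below tol (value iteration)."""
        W = np.zeros(P.shape[0])
        while True:
            W_new = T(W, P, R, gamma)
            if np.max(np.abs(W_new - W)) < tol:
                return W_new
            W = W_new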

The Bellman Operators: Properties

Proposition. Properties of the Bellman operators:
1. Monotonicity: for any W1, W2 ∈ R^N, if W1 ≤ W2 component-wise, then

    T^π W1 ≤ T^π W2,    T W1 ≤ T W2.

2. Offset: for any scalar c ∈ R,

    T^π (W + c I_N) = T^π W + γ c I_N,    T (W + c I_N) = T W + γ c I_N.
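A quick numerical sanity check of the two properties on a randomly generated tabular MDP (illustrative; the check is self-contained, so the optimal operator is redefined here).

    import numpy as np

    rng = np.random.default_rng(0)
    n, A, gamma = 4, 2, 0.9
    P = rng.random((n, A, n)); P /= P.sum(axis=2, keepdims=True)   # random transition kernel
    R = rng.random((n, A, n))                                      # random rewards

    def T(W):
        return (P * (R + gamma * W[None, None, :])).sum(axis=2).max(axis=1)

    W1 = rng.random(n)
    W2 = W1 + rng.random(n)          # W1 <= W2 component-wise
    c = 3.0

    print(np.all(T(W1) <= T(W2)))                        # monotonicity: True
    print(np.allclose(T(W1 + c), T(W1) + gamma * c))     # offset: True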
