Markov Decision Processes and Dynamic Programming A. LAZARIC ( SequeL - PowerPoint PPT Presentation

The Markov Decision Process The Reinforcement Learning Model The environment ◮ Controllability : fully (e.g., chess) or partially (e.g., portfolio optimization) ◮ Uncertainty : deterministic (e.g., chess) or stochastic (e.g., backgammon) ◮ Reactive : adversarial (e.g., chess) or fixed (e.g., tetris) ◮ Observability : full (e.g., chess) or partial (e.g., robotics) ◮ Availability : known (e.g., chess) or unknown (e.g., robotics) The critic ◮ Sparse (e.g., win or loose) vs informative (e.g., closer or further) ◮ Preference reward ◮ Frequent or sporadic ◮ Known or unknown A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 15/103

The Markov Decision Process The Reinforcement Learning Model The environment ◮ Controllability : fully (e.g., chess) or partially (e.g., portfolio optimization) ◮ Uncertainty : deterministic (e.g., chess) or stochastic (e.g., backgammon) ◮ Reactive : adversarial (e.g., chess) or fixed (e.g., tetris) ◮ Observability : full (e.g., chess) or partial (e.g., robotics) ◮ Availability : known (e.g., chess) or unknown (e.g., robotics) The critic ◮ Sparse (e.g., win or loose) vs informative (e.g., closer or further) ◮ Preference reward ◮ Frequent or sporadic ◮ Known or unknown The agent ◮ Open loop control ◮ Close loop control (i.e., adaptive ) ◮ Non-stationary close loop control (i.e., learning ) A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 15/103

The Markov Decision Process Markov Chains Definition (Markov chain) Let the state space X be a bounded compact subset of the Euclidean space, the discrete-time dynamic system ( x t ) t ∈ N ∈ X is a Markov chain if it satisfies the Markov property P ( x t + 1 = x | x t , x t − 1 , . . . , x 0 ) = P ( x t + 1 = x | x t ) , Given an initial state x 0 ∈ X, a Markov chain is defined by the transition probability p p ( y | x ) = P ( x t + 1 = y | x t = x ) . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 16/103

The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = ( X , A , p , r ) where A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 17/103

The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = ( X , A , p , r ) where ◮ X is the state space, A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 17/103

The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = ( X , A , p , r ) where ◮ X is the state space, ◮ A is the action space, A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 17/103

The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = ( X , A , p , r ) where ◮ X is the state space, ◮ A is the action space, ◮ p ( y | x , a ) is the transition probability with p ( y | x , a ) = P ( x t + 1 = y | x t = x , a t = a ) , A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 17/103

The Markov Decision Process Markov Decision Process Definition (Markov decision process [1, 4, 3, 5, 2]) A Markov decision process is defined as a tuple M = ( X , A , p , r ) where ◮ X is the state space, ◮ A is the action space, ◮ p ( y | x , a ) is the transition probability with p ( y | x , a ) = P ( x t + 1 = y | x t = x , a t = a ) , ◮ r ( x , a , y ) is the reward of transition ( x , a , y ) . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 17/103

The Markov Decision Process Markov Decision Process: the Assumptions Time assumption : time is discrete t → t + 1 Possible relaxations ◮ Identify the proper time granularity ◮ Most of MDP literature extends to continuous time A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 18/103

The Markov Decision Process Markov Decision Process: the Assumptions Markov assumption : the current state x and action a are a sufficient statistics for the next state y p ( y | x , a ) = P ( x t + 1 = y | x t = x , a t = a ) Possible relaxations ◮ Define a new state h t = ( x t , x t − 1 , x t − 2 , . . . ) ◮ Move to partially observable MDP (PO-MDP) ◮ Move to predictive state representation (PSR) model A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 19/103

The Markov Decision Process Markov Decision Process: the Assumptions Reward assumption : the reward is uniquely defined by a transition (or part of it) r ( x , a , y ) Possible relaxations ◮ Distinguish between global goal and reward function ◮ Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviors A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 20/103

The Markov Decision Process Markov Decision Process: the Assumptions Stationarity assumption : the dynamics and reward do not change over time p ( y | x , a ) = P ( x t + 1 = y | x t = x , a t = a ) r ( x , a , y ) Possible relaxations ◮ Identify and remove the non-stationary components (e.g., cyclo-stationary dynamics) ◮ Identify the time-scale of the changes A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 21/103

The Markov Decision Process Question Is the MDP formalism powerful enough? ⇒ Let’s try! A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 22/103

The Markov Decision Process Example: the Retail Store Management Problem Description. At each month t , a store contains x t items of a specific goods and the demand for that goods is D t . At the end of each month the manager of the store can order a t more items from his supplier. Furthermore we know that ◮ The cost of maintaining an inventory of x is h ( x ) . ◮ The cost to order a items is C ( a ) . ◮ The income for selling q items is f ( q ) . ◮ If the demand D is bigger than the available inventory x , customers that cannot be served leave. ◮ The value of the remaining inventory at the end of the year is g ( x ) . ◮ Constraint : the store has a maximum capacity M . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 23/103

The Markov Decision Process Example: the Retail Store Management Problem ◮ State space : x ∈ X = { 0 , 1 , . . . , M } . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 24/103

The Markov Decision Process Example: the Retail Store Management Problem ◮ State space : x ∈ X = { 0 , 1 , . . . , M } . ◮ Action space : it is not possible to order more items that the capacity of the store, then the action space should depend on the current state. Formally, at state x , a ∈ A ( x ) = { 0 , 1 , . . . , M − x } . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 24/103

The Markov Decision Process Example: the Retail Store Management Problem ◮ State space : x ∈ X = { 0 , 1 , . . . , M } . ◮ Action space : it is not possible to order more items that the capacity of the store, then the action space should depend on the current state. Formally, at state x , a ∈ A ( x ) = { 0 , 1 , . . . , M − x } . ◮ Dynamics : x t + 1 = [ x t + a t − D t ] + . Problem : the dynamics should be Markov and stationary! A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 24/103

The Markov Decision Process Example: the Retail Store Management Problem ◮ State space : x ∈ X = { 0 , 1 , . . . , M } . ◮ Action space : it is not possible to order more items that the capacity of the store, then the action space should depend on the current state. Formally, at state x , a ∈ A ( x ) = { 0 , 1 , . . . , M − x } . ◮ Dynamics : x t + 1 = [ x t + a t − D t ] + . Problem : the dynamics should be Markov and stationary! ◮ The demand D t is stochastic and time-independent . Formally, i . i . d . D t ∼ D . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 24/103

The Markov Decision Process Example: the Retail Store Management Problem ◮ State space : x ∈ X = { 0 , 1 , . . . , M } . ◮ Action space : it is not possible to order more items that the capacity of the store, then the action space should depend on the current state. Formally, at state x , a ∈ A ( x ) = { 0 , 1 , . . . , M − x } . ◮ Dynamics : x t + 1 = [ x t + a t − D t ] + . Problem : the dynamics should be Markov and stationary! ◮ The demand D t is stochastic and time-independent . Formally, i . i . d . D t ∼ D . ◮ Reward : r t = − C ( a t ) − h ( x t + a t ) + f ([ x t + a t − x t + 1 ] + ) . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 24/103

The Markov Decision Process Exercise: the Parking Problem A driver wants to park his car as close as possible to the restaurant. Reward t 1 2 T p(t) Restaurant Reward 0 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 25/103

The Markov Decision Process Exercise: the Parking Problem A driver wants to park his car as close as possible to the restaurant. Reward t 1 2 T p(t) Restaurant Reward 0 ◮ The driver cannot see whether a place is available unless he is in front of it. ◮ There are P places. ◮ At each place i the driver can either move to the next place or park (if the place is available). ◮ The closer to the restaurant the parking, the higher the satisfaction. ◮ If the driver doesn’t park anywhere, then he/she leaves the restaurant and has to find another one. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 25/103

The Markov Decision Process Policy Definition (Policy) A decision rule π t can be ◮ Deterministic: π t : X → A, ◮ Stochastic: π t : X → ∆( A ) , A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 26/103

The Markov Decision Process Policy Definition (Policy) A decision rule π t can be ◮ Deterministic: π t : X → A, ◮ Stochastic: π t : X → ∆( A ) , A policy (strategy, plan) can be ◮ Non-stationary: π = ( π 0 , π 1 , π 2 , . . . ) , ◮ Stationary (Markovian): π = ( π, π, π, . . . ) . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 26/103

The Markov Decision Process Policy Definition (Policy) A decision rule π t can be ◮ Deterministic: π t : X → A, ◮ Stochastic: π t : X → ∆( A ) , A policy (strategy, plan) can be ◮ Non-stationary: π = ( π 0 , π 1 , π 2 , . . . ) , ◮ Stationary (Markovian): π = ( π, π, π, . . . ) . Remark : MDP M + stationary policy π ⇒ Markov chain of state X and transition probability p ( y | x ) = p ( y | x , π ( x )) . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 26/103

The Markov Decision Process Example: the Retail Store Management Problem ◮ Stationary policy 1 � M − x if x < M / 4 π ( x ) = 0 otherwise ◮ Stationary policy 2 π ( x ) = max { ( M − x ) / 2 − x ; 0 } ◮ Non-stationary policy � M − x if t < 6 π t ( x ) = ⌊ ( M − x ) / 5 ⌋ otherwise A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 27/103

The Markov Decision Process How to model an RL problem The Markov Decision Process The Model Value Functions A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 28/103

The Markov Decision Process Question How do we evaluate a policy and compare two policies? ⇒ Value function! A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 29/103

The Markov Decision Process Optimization over Time Horizon ◮ Finite time horizon T : deadline at time T , the agent focuses on the sum of the rewards up to T . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 30/103

The Markov Decision Process Optimization over Time Horizon ◮ Finite time horizon T : deadline at time T , the agent focuses on the sum of the rewards up to T . ◮ Infinite time horizon with discount : the problem never terminates but rewards which are closer in time receive a higher importance. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 30/103

The Markov Decision Process Optimization over Time Horizon ◮ Finite time horizon T : deadline at time T , the agent focuses on the sum of the rewards up to T . ◮ Infinite time horizon with discount : the problem never terminates but rewards which are closer in time receive a higher importance. ◮ Infinite time horizon with terminal state : the problem never terminates but the agent will eventually reach a termination state . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 30/103

The Markov Decision Process Optimization over Time Horizon ◮ Finite time horizon T : deadline at time T , the agent focuses on the sum of the rewards up to T . ◮ Infinite time horizon with discount : the problem never terminates but rewards which are closer in time receive a higher importance. ◮ Infinite time horizon with terminal state : the problem never terminates but the agent will eventually reach a termination state . ◮ Infinite time horizon with average reward : the problem never terminates but the agent only focuses on the (expected) average of the rewards . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 30/103

The Markov Decision Process State Value Function ◮ Finite time horizon T : deadline at time T , the agent focuses on the sum of the rewards up to T . � T − 1 � � V π ( t , x ) = E r ( x s , π s ( x s )) + R ( x T ) | x t = x ; π , s = t where R is a value function for the final state. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 31/103

The Markov Decision Process State Value Function ◮ Finite time horizon T : deadline at time T , the agent focuses on the sum of the rewards up to T . � T − 1 � � V π ( t , x ) = E r ( x s , π s ( x s )) + R ( x T ) | x t = x ; π , s = t where R is a value function for the final state. ◮ Used when: there is an intrinsic deadline to meet. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 31/103

The Markov Decision Process State Value Function ◮ Infinite time horizon with discount : the problem never terminates but rewards which are closer in time receive a higher importance. � ∞ � � γ t r ( x t , π ( x t )) | x 0 = x ; π V π ( x ) = E , t = 0 with discount factor 0 ≤ γ < 1: ◮ small = short-term rewards, big = long-term rewards ◮ for any γ ∈ [ 0 , 1 ) the series always converge (for bounded rewards) A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 32/103

The Markov Decision Process State Value Function ◮ Infinite time horizon with discount : the problem never terminates but rewards which are closer in time receive a higher importance. � ∞ � � γ t r ( x t , π ( x t )) | x 0 = x ; π V π ( x ) = E , t = 0 with discount factor 0 ≤ γ < 1: ◮ small = short-term rewards, big = long-term rewards ◮ for any γ ∈ [ 0 , 1 ) the series always converge (for bounded rewards) ◮ Used when: there is uncertainty about the deadline and/or an intrinsic definition of discount. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 32/103

The Markov Decision Process State Value Function ◮ Infinite time horizon with terminal state : the problem never terminates but the agent will eventually reach a termination state . � T � � V π ( x ) = E r ( x t , π ( x t )) | x 0 = x ; π , t = 0 where T is the first ( random ) time when the termination state is achieved. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 33/103

The Markov Decision Process State Value Function ◮ Infinite time horizon with terminal state : the problem never terminates but the agent will eventually reach a termination state . � T � � V π ( x ) = E r ( x t , π ( x t )) | x 0 = x ; π , t = 0 where T is the first ( random ) time when the termination state is achieved. ◮ Used when: there is a known goal or a failure condition. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 33/103

The Markov Decision Process State Value Function ◮ Infinite time horizon with average reward : the problem never terminates but the agent only focuses on the (expected) average of the rewards . � 1 T − 1 � � V π ( x ) = lim r ( x t , π ( x t )) | x 0 = x ; π . T →∞ E T t = 0 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 34/103

The Markov Decision Process State Value Function ◮ Infinite time horizon with average reward : the problem never terminates but the agent only focuses on the (expected) average of the rewards . � 1 T − 1 � � V π ( x ) = lim r ( x t , π ( x t )) | x 0 = x ; π . T →∞ E T t = 0 ◮ Used when: the system should be constantly controlled over time. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 34/103

The Markov Decision Process State Value Function Technical note : the expectations refer to all possible stochastic trajectories. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 35/103

The Markov Decision Process State Value Function Technical note : the expectations refer to all possible stochastic trajectories. A non-stationary policy π applied from state x 0 returns ( x 0 , r 0 , x 1 , r 1 , x 2 , r 2 , . . . ) where r t = r ( x t , π t ( x t )) and x t ∼ p ( ·| x t − 1 , a t = π ( x t )) are random realizations. The value function (discounted infinite horizon) is � ∞ � � γ t r ( x t , π ( x t )) | x 0 = x ; π V π ( x ) = E ( x 1 , x 2 ,... ) , t = 0 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 35/103

The Markov Decision Process Example: the Retail Store Management Problem Simulation A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 36/103

The Markov Decision Process Optimal Value Function Definition (Optimal policy and optimal value function) The solution to an MDP is an optimal policy π ∗ satisfying π ∗ ∈ arg max π ∈ Π V π in all the states x ∈ X, where Π is some policy set of interest. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 37/103

The Markov Decision Process Optimal Value Function Definition (Optimal policy and optimal value function) The solution to an MDP is an optimal policy π ∗ satisfying π ∗ ∈ arg max π ∈ Π V π in all the states x ∈ X, where Π is some policy set of interest. The corresponding value function is the optimal value function V ∗ = V π ∗ A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 37/103

The Markov Decision Process Optimal Value Function Remarks 1. π ∗ ∈ arg max ( · ) and not π ∗ = arg max ( · ) because an MDP may admit more than one optimal policy 2. π ∗ achieves the largest possible value function in every state 3. there always exists an optimal deterministic policy 4. expect for problems with a finite horizon, there always exists an optimal stationary policy A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 38/103

The Markov Decision Process Summary 1. MDP is a powerful model for interaction between an agent and a stochastic environment 2. The value function defines the objective to optimize A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 39/103

The Markov Decision Process Limitations 1. All the previous value functions define an objective in expectation 2. Other utility functions may be used 3. Risk measures could be integrated but they may induce “weird” problems and make the solution more difficult A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 40/103

The Markov Decision Process How to solve exactly an MDP Dynamic Programming A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 41/103

The Markov Decision Process How to solve exactly an MDP Dynamic Programming Bellman Equations Value Iteration Policy Iteration A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 41/103

The Markov Decision Process Notice From now on we mostly work on the discounted infinite horizon setting. Most results smoothly extend to other settings. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 42/103

The Markov Decision Process The Optimization Problem max V π ( x 0 ) = π r ( x 0 , π ( x 0 )) + γ r ( x 1 , π ( x 1 )) + γ 2 r ( x 2 , π ( x 2 )) + . . . � � max E π ⇓ very challenging (we should try as many as | A | | S | policies!) A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 43/103

The Markov Decision Process The Optimization Problem max V π ( x 0 ) = π r ( x 0 , π ( x 0 )) + γ r ( x 1 , π ( x 1 )) + γ 2 r ( x 2 , π ( x 2 )) + . . . � � max E π ⇓ very challenging (we should try as many as | A | | S | policies!) ⇓ we need to leverage the structure of the MDP to simplify the optimization problem A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 43/103

The Markov Decision Process How to solve exactly an MDP Dynamic Programming Bellman Equations Value Iteration Policy Iteration A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 44/103

The Markov Decision Process The Bellman Equation Proposition For any stationary policy π = ( π, π, . . . ) , the state value function at a state x ∈ X satisfies the Bellman equation : � V π ( x ) = r ( x , π ( x )) + γ p ( y | x , π ( x )) V π ( y ) . y A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 45/103

The Markov Decision Process The Bellman Equation Proof. For any policy π , V π ( x ) = E � � γ t r ( x t , π ( x t )) | x 0 = x ; π � t ≥ 0 � � γ t r ( x t , π ( x t )) | x 0 = x ; π � = r ( x , π ( x )) + E t ≥ 1 = r ( x , π ( x )) � � � γ t − 1 r ( x t , π ( x t )) | x 1 = y ; π � + γ P ( x 1 = y | x 0 = x ; π ( x 0 )) E y t ≥ 1 � p ( y | x , π ( x )) V π ( y ) . = r ( x , π ( x )) + γ y � A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 46/103

The Markov Decision Process Example: the student dilemma 2 r=1 p=0.5 1 Rest r=−10 0.4 Rest 0.5 5 Work r=0 0.6 Work 0.3 0.4 0.7 0.5 0.5 r=100 Rest 0.6 r=−1 0.9 6 Rest Work 3 r=−1000 0.1 0.5 1 0.5 Work r=−10 7 4 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 47/103

The Markov Decision Process Example: the student dilemma ◮ Model : all the transitions are Markov, states x 5 , x 6 , x 7 are terminal. ◮ Setting : infinite horizon with terminal states. ◮ Objective : find the policy that maximizes the expected sum of rewards before achieving a terminal state. Notice : not a discounted infinite horizon setting! But the Bellman equations hold unchanged. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 48/103

The Markov Decision Process Example: the student dilemma V = 88.3 2 p=0.5 r=1 Rest r=−10 0.4 Rest 0.5 V = 88.3 1 V = −10 5 Work r=0 0.6 Work 0.3 0.4 0.7 0.5 0.5 r=100 Rest 0.6 V = 100 6 r=−1 0.9 Rest V = 86.9 3 Work r=−1000 0.1 0.5 1 V = −1000 0.5 r=−10 7 Work V = 88.9 4 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 49/103

The Markov Decision Process Example: the student dilemma Computing V 4 : V 6 = 100 V 4 = − 10 + ( 0 . 9 V 6 + 0 . 1 V 4 ) ⇒ V 4 = − 10 + 0 . 9 V 6 = 88 . 8 0 . 9 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 50/103

The Markov Decision Process Example: the student dilemma Computing V 3 : no need to consider all possible trajectories V 4 = 88 . 8 V 3 = − 1 + ( 0 . 5 V 4 + 0 . 5 V 3 ) ⇒ V 3 = − 1 + 0 . 5 V 4 = 86 . 8 0 . 5 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 51/103

The Markov Decision Process Example: the student dilemma Computing V 3 : no need to consider all possible trajectories V 4 = 88 . 8 V 3 = − 1 + ( 0 . 5 V 4 + 0 . 5 V 3 ) ⇒ V 3 = − 1 + 0 . 5 V 4 = 86 . 8 0 . 5 and so on for the rest... A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 51/103

The Markov Decision Process The Optimal Bellman Equation Bellman’s Principle of Optimality [1]: “An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.” A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 52/103

The Markov Decision Process The Optimal Bellman Equation Proposition The optimal value function V ∗ (i.e., V ∗ = max π V π ) is the solution to the optimal Bellman equation : V ∗ ( x ) = max a ∈ A � p ( y | x , a ) V ∗ ( y ) � � r ( x , a ) + γ . y and the optimal policy is π ∗ ( x ) = arg max � p ( y | x , a ) V ∗ ( y ) � � r ( x , a ) + γ . a ∈ A y A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 53/103

The Markov Decision Process The Optimal Bellman Equation Proof. For any policy π = ( a , π ′ ) (possibly non-stationary), ( a ) � � V ∗ ( x ) γ t r ( x t , π ( x t )) | x 0 = x ; π � = max E π t ≥ 0 ( b ) � � � p ( y | x , a ) V π ′ ( y ) = max r ( x , a ) + γ ( a ,π ′ ) y ( c ) � � � π ′ V π ′ ( y ) = max r ( x , a ) + γ p ( y | x , a ) max a y ( d ) � � � p ( y | x , a ) V ∗ ( y ) = max r ( x , a ) + γ . a y � A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 54/103

The Markov Decision Process System of Equations The Bellman equation � V π ( x ) = r ( x , π ( x )) + γ p ( y | x , π ( x )) V π ( y ) . y is a linear system of equations with N unknowns and N linear constraints. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 55/103

The Markov Decision Process Example: the student dilemma V = 88.3 2 p=0.5 r=1 Rest r=−10 0.4 Rest 0.5 V = 88.3 1 V = −10 5 Work r=0 0.6 Work 0.3 0.4 0.7 0.5 0.5 r=100 Rest 0.6 V = 100 6 r=−1 0.9 Rest V = 86.9 3 Work r=−1000 0.1 0.5 1 V = −1000 0.5 r=−10 7 Work V = 88.9 4 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 56/103

The Markov Decision Process Example: the student dilemma V = 88.3 2 p=0.5 r=1 Rest 0.4 r=−10 Rest V = 88.3 0.5 1 V = −10 5 Work r=0 Work 0.3 0.6 0.4 V π ( x ) = r ( x , π ( x ))+ γ � y p ( y | x , π ( x )) V π ( y ) 0.7 0.5 0.5 r=100 Rest 0.6 V = 100 6 r=−1 0.9 V = 86.9 Rest 3 Work r=−1000 0.1 0.5 1 V = −1000 0.5 r=−10 7 Work V = 88.9 System of equations 4  V 1 = 0 + 0 . 5 V 1 + 0 . 5 V 2 ( V , R ∈ R 7 , P ∈ R 7 × 7 )    V 2 = 1 + 0 . 3 V 1 + 0 . 7 V 3     V = R + PV  V 3 = − 1 + 0 . 5 V 4 + 0 . 5 V 3    V 4 = − 10 + 0 . 9 V 6 + 0 . 1 V 4 ⇒ ⇓  V 5 = − 10    V = ( I − P ) − 1 R   V 6 = 100      V 7 = − 1000 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 57/103

The Markov Decision Process System of Equations The optimal Bellman equation V ∗ ( x ) = max a ∈ A � p ( y | x , a ) V ∗ ( y ) � � r ( x , a ) + γ . y is a (highly) non-linear system of equations with N unknowns and N non-linear constraints (i.e., the max operator). A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 58/103

The Markov Decision Process Example: the student dilemma 2 r=1 p=0.5 1 Rest r=−10 0.4 Rest 0.5 5 Work r=0 0.6 Work 0.3 0.4 0.7 0.5 0.5 r=100 Rest 0.6 r=−1 0.9 6 Rest Work 3 r=−1000 0.1 0.5 1 0.5 Work r=−10 7 4 A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 59/103

The Markov Decision Process Example: the student dilemma 2 r=1 p=0.5 1 Rest 0.4 r=−10 Rest 0.5 5 r=0 Work Work 0.3 0.6 0.4 V ∗ ( x ) = max y p ( y | x , a ) V ∗ ( y ) � r ( x , a ) + γ � � 0.7 0.5 0.5 r=100 Rest 0.6 a ∈ A r=−1 0.9 6 Rest Work 3 r=−1000 0.1 0.5 1 0.5 Work r=−10 7 System of equations 4 � �  V 1 = max 0 + 0 . 5 V 1 + 0 . 5 V 2 ; 0 + 0 . 5 V 1 + 0 . 5 V 3   � �  V 2 = max 1 + 0 . 4 V 5 + 0 . 6 V 2 ; 1 + 0 . 3 V 1 + 0 . 7 V 3     � � V 3 = max − 1 + 0 . 4 V 2 + 0 . 6 V 3 ; − 1 + 0 . 5 V 4 + 0 . 5 V 3     � � V 4 = max − 10 + 0 . 9 V 6 + 0 . 1 V 4 ; − 10 + V 7  V 5 = − 10      V 6 = 100      V 7 = − 1000 ⇒ too complicated, we need to find an alternative solution. A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 60/103

The Markov Decision Process The Bellman Operators Notation. w.l.o.g. a discrete state space | X | = N and V π ∈ R N . Definition For any W ∈ R N , the Bellman operator T π : R N → R N is � T π W ( x ) = r ( x , π ( x )) + γ p ( y | x , π ( x )) W ( y ) , y and the optimal Bellman operator (or dynamic programming operator) is � � � T W ( x ) = max a ∈ A r ( x , a ) + γ p ( y | x , a ) W ( y ) . y A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 61/103

The Markov Decision Process The Bellman Operators Proposition Properties of the Bellman operators 1. Monotonicity : for any W 1 , W 2 ∈ R N , if W 1 ≤ W 2 component-wise, then T π W 1 ≤ T π W 2 , T W 1 ≤ T W 2 . A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 62/103

The Markov Decision Process The Bellman Operators Proposition Properties of the Bellman operators 1. Monotonicity : for any W 1 , W 2 ∈ R N , if W 1 ≤ W 2 component-wise, then T π W 1 ≤ T π W 2 , T W 1 ≤ T W 2 . 2. Offset : for any scalar c ∈ R , T π ( W + cI N ) = T π W + γ cI N , T ( W + cI N ) = T W + γ cI N , A. LAZARIC – Markov Decision Process and Dynamic Programming Sept 29th, 2015 - 62/103

Markov Decision Processes and Dynamic Programming A. LAZARIC ( SequeL - PowerPoint PPT Presentation

Markov Decision Processes and Dynamic Programming A. LAZARIC ( SequeL Team @INRIA-Lille ) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course How to model an RL problem The Markov Decision Process A. LAZARIC Markov Decision

Markov Chains Markov Processes Discrete-time Markov Chains Continuous-time Markov Chains Dr

Hidden Markov Models Discrete Markov Processes 1 Hidden Markov Models Hidden Markov Models 2

Model Repair for Markov Decision Model Repair for Markov Decision Model Repair for Markov

Outline Md Md Markov Markov Decision Decision Processes Processes Grid World Example

Stochastic Processes Markov Processes Hamid R. Rabiee 1 Overview o Markov Property o Markov

Markov Decision Processes and Dynamic Programming A. LAZARIC ( SequeL Team @INRIA-Lille ) Ecole

Markov Systems, Markov Decision Processes, and Dynamic Programming Andrew W. Moore Note to

Markov decision processes and interval Markov chains: exploiting the connection Mingmei Teo

Hidden Markov Models Steven J Zeil Old Dominion Univ. Fall 2010 1 Discrete Markov Processes

Markov chains and Hidden Markov Models 9000 Markov chains and HMMs We will discuss: Markov

CSCE 471/871 Lecture 3: Markov Chains Markov Chains and and Hidden Markov Models Hidden

Introduction to Partially Observable Markov Decision Processes CS 886 Sequential Decision Making

Markov Decision Processes Philipp Koehn 7 April 2020 Philipp Koehn Artificial Intelligence:

The simplex method is strongly polynomial for deterministic Markov decision processes Ian Post

Discrete Time Markov Chains Discrete-Time Markov Chains Books - Introduction to Stochastic

Markov processes (Markov chains) Construct a Bayes net from these variables: parents? Markov

Deep Reinforcement Learning Introduction and State-of-the-art Arjun Chandra Research Scientist

Matthew Series Lesson #065 February 1, 2015 Dean Bible Ministries www.deanbibleministries.org

A REINFORCEMENT LEARNING PERSPECTIVE ON AGI Itamar Arel, Machine Intelligence Lab

100 Million Friends You Can Never Know Adding COPPA compliant social networking to Poptropica

Machine Learning for Trading Data: Economic reports, news, industry statistics Financial

Impact Measurement Working Group TRIS LUMLEY NEW PHILANTHROPY CAPITAL KELLY MCCARTHY GLOBAL

I HAVE NOTHING TO Category II FHRT- A DISCLOSE Standardized Approach Steven L. Clark, M.D.

Environmental Ethics: Anthropocentrism vs. Nonanthropocentrism Anthropocentric Worldview

Sambuz

Useful Links

Newsletter

Mail Us