SLIDE 1
Markov decision process (MDP)
Robert Platt Northeastern University
SLIDE 2 The RL Setting
On a single time step, agent does the following:
- 1. observe some information
- 2. select an action to execute
- 3. take note of any reward
Goal of agent: select actions that maximize cumulative reward in the long run
[Figure: agent–world loop; the agent sends actions to the world and receives observations and rewards]
SLIDES 3-5
Let’s turn this into an MDP
On a single time step, agent does the following:
- 1. observe the state
- 2. select an action to execute
- 3. take note of any reward
Goal of agent: select actions that maximize cumulative reward in the long run
[Figure: the agent–world loop again, with the observation now labeled as the state; the world/state/reward part of the loop is the MDP]
SLIDE 6
Example: Grid world
Grid world:
– agent lives on a grid
– always occupies a single cell
– can move left, right, up, or down
– gets zero reward unless in the “+1” or “-1” cells
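To make this concrete, here is a minimal sketch of the grid world as plain Python data; the grid size, cell coordinates, and reward placement are illustrative assumptions, not taken from the slide’s figure.

```python
# A minimal grid-world sketch. The grid size, cell coordinates, and
# reward placement are illustrative assumptions.
STATES = [(row, col) for row in range(3) for col in range(4)]
ACTIONS = ["left", "right", "up", "down"]

PLUS_CELL = (0, 3)    # assumed location of the "+1" cell
MINUS_CELL = (1, 3)   # assumed location of the "-1" cell

def reward(state):
    """Zero reward everywhere except the +1 and -1 cells."""
    if state == PLUS_CELL:
        return 1.0
    if state == MINUS_CELL:
        return -1.0
    return 0.0
```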
SLIDE 7
States and actions
State set: $\mathcal{S}$ = the set of grid cells the agent can occupy
Action set: $\mathcal{A} = \{\text{left}, \text{right}, \text{up}, \text{down}\}$
SLIDES 8-10
Reward function
Reward function for this grid world: $r = +1$ in the “+1” cell and $r = -1$ in the “-1” cell. Otherwise: $r = 0$.
In general: $r(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
– the expected reward on this time step given that the agent takes action $a$ from state $s$.
SLIDES 11-12
Transition function
Transition model: $p(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$ – the probability of this transition.
For example: taking the “right” action from a given cell might move the agent to the cell on its right with high probability, and elsewhere with the remaining probability.
This entire probability distribution can be written as a table over (state, action, next state).
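One way to realize such a table in code is a dictionary mapping (state, action) to a distribution over next states; a sketch below, where the 0.8/0.1/0.1 noise model is an illustrative assumption.

```python
import random

# Transition table: (state, action) -> {next_state: probability}.
# The 0.8 / 0.1 / 0.1 noise model below is an illustrative assumption.
P = {
    ((2, 0), "right"): {(2, 1): 0.8, (1, 0): 0.1, (2, 0): 0.1},
    # ...one entry per (state, action) pair
}

def sample_next_state(state, action):
    """Draw s' according to P(s' | s, a)."""
    dist = P[(state, action)]
    next_states, probs = zip(*dist.items())
    return random.choices(next_states, weights=probs, k=1)[0]
```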
SLIDE 13
Definition of an MDP
An MDP is a tuple $(\mathcal{S}, \mathcal{A}, r, p)$, where
State set: $\mathcal{S}$
Action set: $\mathcal{A}$
Reward function: $r(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
Transition model: $p(s' \mid s, a)$
SLIDE 14
Example: Frozen Lake
State set: the 16 cells of the 4x4 grid
Action set: {left, down, right, up}
Reward function: $r = 1$ if the agent reaches the goal cell; $r = 0$ otherwise
Transition model: only a one-third chance of going in the specified direction
– one-third chance of moving +90°
– one-third chance of moving -90°
Frozen Lake is this 4x4 grid
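As an aside, Frozen Lake ships with the Gymnasium library; a minimal rollout sketch follows, assuming the gymnasium package is installed (the course’s own environment may differ).

```python
import gymnasium as gym

# Slippery Frozen Lake: each action moves the agent in the intended
# direction with probability 1/3, and perpendicular to it (+90 / -90
# degrees) with probability 1/3 each.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

obs, info = env.reset(seed=0)
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # random policy for illustration
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
print("return:", total_reward)  # 1.0 if the goal was reached, else 0.0
```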
SLIDE 15 Example: Recycling Robot
Example 3.4 in Sutton & Barto, 2nd edition.
SLIDE 16
Think-pair-share
Mobile robot:
– the robot moves on a flat surface
– the robot can execute point turns, either left or right; it can also go forward or back with fixed velocity
– it must reach a goal while avoiding obstacles
Express the mobile robot control problem as an MDP.
SLIDES 17-20
Definition of an MDP
An MDP is a tuple $(\mathcal{S}, \mathcal{A}, r, p)$, where
State set: $\mathcal{S}$
Action set: $\mathcal{A}$
Reward function: $r(s,a)$
Transition model: $p(s' \mid s, a)$
Why is it called a Markov decision process? Because we’re making the following assumption:
$\Pr(S_{t+1} \mid S_t, A_t) = \Pr(S_{t+1} \mid S_t, A_t, S_{t-1}, A_{t-1}, \ldots, S_0, A_0)$
– this is called the “Markov” assumption: the next state depends only on the current state and action, not on the rest of the history.
SLIDES 21-24
The Markov Assumption
Suppose the agent starts in some cell and follows a path through the grid to reach state $s$. Notice that the probability of arriving in $s'$ if the agent executes the “right” action does not depend on the path taken to get to $s$:
$\Pr(S_{t+1} = s' \mid S_t = s, A_t = \text{right})$ is the same regardless of the history before $s$.
SLIDE 25 Think-pair-share
Cart-pole robot:
– the state is the position of the cart and the orientation of the pole
– the cart can execute a constant acceleration, either left or right
- 1. Is this system Markov?
- 2. Why / Why not?
- 3. If not, how do you change it to make it Markov?
SLIDES 26-29
Policy
A policy is a rule for selecting actions: $a = \pi(s)$ – if the agent is in this state, then take this action.
A policy can be stochastic: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$.
Question: why would we want to use a stochastic policy?
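In code, a deterministic policy can be a simple lookup table and a stochastic policy a distribution we sample from; the states and probabilities below are made up for illustration.

```python
import random

# Deterministic policy: a lookup table state -> action (entries illustrative).
pi_det = {(2, 0): "right", (2, 1): "right", (2, 2): "up"}

# Stochastic policy: state -> {action: probability} (entries illustrative).
pi_sto = {(2, 0): {"right": 0.9, "up": 0.1}}

def act_deterministic(state):
    return pi_det[state]

def act_stochastic(state):
    dist = pi_sto[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]
```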
SLIDE 30
Episodic vs Continuing Process
Episodic process: execution ends at some point and starts over
– after a fixed number of time steps, or
– upon reaching a terminal state
Example of an episodic task: execution ends upon reaching the terminal state OR after 15 time steps.
SLIDE 31
Episodic vs Continuing Process
Continuing process: execution goes on forever; the process doesn’t stop and the agent keeps getting rewards.
[Figure: example of a continuing task]
SLIDES 32-40
Rewards and Return
On each time step, the agent gets a reward $R_{t+1}$:
– could have positive reward at the goal, zero reward elsewhere
– could have negative reward on every time step
– could have an arbitrary reward function
Return can be a simple sum of rewards:
$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T$
But it is often a discounted sum of rewards:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$
What effect does gamma have? Reward received $k$ time steps in the future is only worth $\gamma^{k-1}$ of what it would have been worth immediately.
Return is often evaluated over an infinite horizon:
$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
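A minimal sketch of computing a finite-horizon discounted return in Python:

```python
def discounted_return(rewards, gamma=0.9):
    """G = r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    g = 0.0
    for r in reversed(rewards):  # fold from the back: g = r + gamma * g
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 0, 10], gamma=0.9))  # 0.9**3 * 10 = 7.29
```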
SLIDE 41
Think-pair-share
SLIDES 42-46
Value Function
Value of state $s$ when acting according to policy $\pi$:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
Value of a state == expected return from that state if the agent follows policy $\pi$.
Value of taking action $a$ from state $s$ when acting according to policy $\pi$:
$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Value of a state/action pair == expected return when taking action $a$ from state $s$ and following $\pi$ after that.
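Since $v_\pi(s)$ is an expected return, the most direct estimator is a Monte Carlo average over rollouts; a sketch below, relying on the illustrative `sample_next_state`, `reward`, and policy-table helpers from the earlier snippets.

```python
def mc_value(state, policy, n_episodes=1000, horizon=50, gamma=0.9):
    """Estimate v_pi(state) by averaging discounted returns over rollouts.
    Relies on the illustrative sample_next_state/reward helpers above."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = state, 0.0, 1.0
        for _ in range(horizon):
            s = sample_next_state(s, policy[s])
            g += discount * reward(s)
            discount *= gamma
        total += g
    return total / n_episodes
```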
SLIDE 47
Value function example 1
[Figure: policy arrows on the grid, discount factor, and the resulting value function]
Values: 10, 9, 8.1, 7.3, 6.6, 6.9
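A quick worked check, assuming (consistent with the listed values) a +10 reward at the goal and $\gamma = 0.9$: each step farther from the goal multiplies the value by $\gamma$:

$10 \cdot 0.9 = 9,\qquad 9 \cdot 0.9 = 8.1,\qquad 8.1 \cdot 0.9 = 7.29 \approx 7.3,\qquad 7.29 \cdot 0.9 \approx 6.6$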
SLIDES 48-49
Value function example 2
[Figure: policy arrows on the grid, discount factor, and the resulting value function]
Values: 10.66, 0.66, 0.73, 0.81, 0.9, 1
Notice that the value function can help us compare two different policies – how?
SLIDE 50
Value function example 3
[Figure: policy arrows on the grid, discount factor, and the resulting value function]
Values: 10, 10, 10, 10, 10, 11
SLIDE 51
Think-pair-share
[Figure: policy arrows on the grid and a discount factor; compute the value function]
Values: ?, ?, ?, ?, ?, ?
SLIDES 52-56
Value Function Revisited
Value of state $s$ when acting according to policy $\pi$:
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$
The tree of actions, transitions, and successor states over which this expectation is evaluated is called a “backup diagram”.
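The recursive form $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ suggests iterative policy evaluation: sweep the backup over all states until the values stop changing. A sketch below, assuming a deterministic policy and the transition-table/reward interface from the earlier snippets (reward taken to depend on the next state).

```python
def policy_evaluation(states, policy, P, reward, gamma=0.9, tol=1e-6):
    """Sweep the Bellman expectation backup over all states until the
    values stop changing (deterministic policy assumed; reward depends
    on the next state in this sketch)."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            # v(s) <- sum_{s'} p(s'|s,a) [ r(s') + gamma * v(s') ]
            v_new = sum(p * (reward(s2) + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```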
SLIDE 57 Think-pair-share 1
Value of state $s$ when acting according to policy $\pi$:
Write this expectation in terms of $p(s', r \mid s, a)$ for a deterministic policy $\pi(s)$.
SLIDE 58 Think-pair-share 2
Value of state $s$ when acting according to policy $\pi$:
Write this expectation in terms of $p(s', r \mid s, a)$ for a stochastic policy $\pi(a \mid s)$.
SLIDE 59
Think-pair-share
SLIDES 60-61
Value Function Revisited
Can we calculate Q in terms of V? Yes:
$q_\pi(s,a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$
SLIDE 62
Think-pair-share
Can we calculate Q in terms of V?
Write this expectation in terms of $p(s', r \mid s, a)$ and $v_\pi$.
SLIDES 63-67
Optimal policies
Given a policy $\pi$, we know how to compute the value function $v_\pi$. But how do we compute the optimal policy $\pi_*$?
Definition: $v_*(s) = \max_\pi v_\pi(s)$ – best out of all possible policies.
Definition: $q_*(s,a) = \max_\pi q_\pi(s,a)$
Bellman equation: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$
Bellman optimality condition: $v_*(s) = \max_a \sum_{s',r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$
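The optimality condition turns directly into value iteration: replace the expectation over the policy with a max over actions and iterate. A sketch under the same assumed transition-table/reward interface as the earlier snippets:

```python
def value_iteration(states, actions, P, reward, gamma=0.9, tol=1e-6):
    """Apply the Bellman optimality backup until convergence, then
    read off a greedy policy. Interface assumptions as in the
    policy_evaluation sketch above."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v_new = max(sum(p * (reward(s2) + gamma * V[s2])
                            for s2, p in P[(s, a)].items())
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    pi = {s: max(actions,
                 key=lambda a: sum(p * (reward(s2) + gamma * V[s2])
                                   for s2, p in P[(s, a)].items()))
          for s in states}
    return V, pi
```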
SLIDE 68
Think-pair-share
SLIDE 69
Value function example 3
[Figure: policy arrows on the grid, discount factor, and the resulting value function]
Values: 10, 9, 8, 7, 6, 7