

slide-1
SLIDE 1

Markov decision process (MDP)

Robert Platt Northeastern University

slide-2
SLIDE 2

The RL Setting

On a single time step, agent does the following:

  • 1. observe some information
  • 2. select an action to execute
  • 3. take note of any reward

Goal of agent: select actions that maximize cumulative reward in the long run

[Diagram: the agent–world loop – the agent sends an Action to the World and receives an Observation and a Reward]

slide-3
SLIDE 3

Let’s turn this into an MDP

On a single time step, agent does the following:

  • 1. observe some information
  • 2. select an action to execute
  • 3. take note of any reward

Goal of agent: select actions that maximize cumulative reward in the long run

[Diagram: the agent–world loop – the agent sends an Action to the World and receives an Observation and a Reward]

slide-4
SLIDE 4

Let’s turn this into an MDP

On a single time step, agent does the following:

  • 1. observe state
  • 2. select an action to execute
  • 3. take note of any reward

Goal of agent: select actions that maximize cumulative reward in the long run

[Diagram: the agent–world loop – the agent sends an Action to the World and receives a State and a Reward]

slide-5
SLIDE 5

Let’s turn this into an MDP

On a single time step, agent does the following:

  • 1. observe state
  • 2. select an action to execute
  • 3. take note of any reward

Goal of agent: select actions that maximize cumulative reward in the long run

[Diagram: the agent–world loop – the agent sends an Action to the World and receives a State and a Reward]

This part is the MDP

slide-6
SLIDE 6

Example: Grid world

Grid world:

  • agent lives on a grid
  • always occupies a single cell
  • can move left, right, up, down
  • gets zero reward unless in the “+1” or “-1” cells

slide-7
SLIDE 7

States and actions

State set: S = the set of grid cells the agent can occupy. Action set: A = {left, right, up, down}
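To make the state and action sets concrete, here is a small sketch (the grid dimensions W and H are assumed for illustration, not taken from the slide):

```python
# Sketch (assumed W x H grid, not the slide's exact grid): enumerating the
# grid world's state and action sets.
W, H = 4, 3
states = [(row, col) for row in range(H) for col in range(W)]   # one state per cell
actions = ["left", "right", "up", "down"]
print(len(states), "states,", len(actions), "actions")
```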

slide-8
SLIDE 8

Reward function

Reward function: r = +1 in the “+1” cell, r = -1 in the “-1” cell. Otherwise: r = 0

slide-9
SLIDE 9

Reward function

Reward function: r = +1 in the “+1” cell, r = -1 in the “-1” cell. Otherwise: r = 0. In general: R(s, a) = E[ r_{t+1} | s_t = s, a_t = a ]

slide-10
SLIDE 10

Reward function

Reward function: r = +1 in the “+1” cell, r = -1 in the “-1” cell. Otherwise: r = 0. In general: R(s, a) = E[ r_{t+1} | s_t = s, a_t = a ]

Expected reward on this time step given that the agent takes action a from state s
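A small sketch of how this expected reward can be computed from a transition table (the table entries below are hypothetical, not the slide's example):

```python
# Sketch (hypothetical table): the expected reward
# R(s, a) = sum over (s', r) of r * P(s', r | s, a).
P = {  # (s, a): list of (next_state, reward, probability)
    ("s1", "right"): [("s2", 1.0, 0.8), ("s1", 0.0, 0.2)],
}

def expected_reward(s, a):
    return sum(r * p for (_, r, p) in P[(s, a)])

print(expected_reward("s1", "right"))   # 0.8
```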

slide-11
SLIDE 11

Transition function

Transition model: T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a). For example: T(s, right, s') is the probability of arriving in cell s' when the agent executes “right” from cell s

slide-12
SLIDE 12

Transition function

Transition model: T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a). For example: T(s, right, s') is the probability of arriving in cell s' when the agent executes “right” from cell s

– This entire probability distribution can be written as a table over (state, action, next state); each entry is the probability of that transition.
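A sketch of such a table in code (the states and probabilities below are hypothetical, not the slide's example):

```python
# Hypothetical example: the transition distribution for one (state, action)
# pair, stored as one row of the table.
P = {
    # (state, action): {next_state: probability}
    ("s1", "right"): {"s2": 0.8, "s1_up": 0.1, "s1_down": 0.1},
}
assert abs(sum(P[("s1", "right")].values()) - 1.0) < 1e-9   # each row sums to 1
```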

slide-13
SLIDE 13

Definition of an MDP

An MDP is a tuple (S, A, R, T), where:

  • State set: S
  • Action set: A
  • Reward function: R(s, a) = E[ r_{t+1} | s_t = s, a_t = a ]
  • Transition model: T(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
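As a concrete illustration of the tuple (S, A, R, T), here is a minimal sketch in Python (the tiny 1x3 grid and all names are illustrative only, not from the slides):

```python
# Minimal sketch of an MDP as plain Python data structures.
from dataclasses import dataclass
from typing import Dict, Tuple, List

State = Tuple[int, int]   # grid cell (row, col)
Action = str              # "left" or "right"

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    reward: Dict[State, float]                                   # R(s)
    transition: Dict[Tuple[State, Action], Dict[State, float]]   # T(s, a) -> {s': prob}

# Tiny 1x3 grid with a +1 cell on the right and deterministic moves.
states = [(0, 0), (0, 1), (0, 2)]
actions = ["left", "right"]
reward = {(0, 0): 0.0, (0, 1): 0.0, (0, 2): 1.0}
transition = {
    ((0, c), a): {(0, max(0, min(2, c + (1 if a == "right" else -1)))): 1.0}
    for c in range(3) for a in actions
}
mdp = MDP(states, actions, reward, transition)
print(mdp.transition[((0, 1), "right")])   # {(0, 2): 1.0}
```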

slide-14
SLIDE 14

Example: Frozen Lake

State set: the 16 cells of the 4x4 grid
Action set: {left, right, up, down}
Reward function: r = 1 if the agent reaches the goal cell; r = 0 otherwise
Transition model: only one third chance of going in the specified direction
  • one third chance of moving +90 deg
  • one third chance of moving -90 deg

Frozen Lake is this 4x4 grid
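Frozen Lake is also available as a ready-made environment; a minimal sketch assuming the Gymnasium package and its FrozenLake-v1 environment (install with `pip install gymnasium`):

```python
# Sketch: run one episode of the slippery 4x4 Frozen Lake with a random policy.
import gymnasium as gym

env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()   # random policy, for illustration only
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
print("final reward:", reward)           # 1.0 only if the goal was reached
```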

slide-15
SLIDE 15

Example: Recycling Robot

Example 3.4 in SB, 2nd Ed.

slide-16
SLIDE 16

Think-pair-share

Mobile robot:

  • the robot moves on a flat surface
  • the robot can execute point turns either left or right; it can also go forward or back with fixed velocity
  • it must reach a goal while avoiding obstacles

Express the mobile robot control problem as an MDP.

slide-17
SLIDE 17

Definition of an MDP

An MDP is a tuple (S, A, R, T), where – State set: S – Action set: A – Reward function: R(s, a) – Transition model: T(s, a, s')

slide-18
SLIDE 18

Definition of an MDP

An MDP is a tuple (S, A, R, T), where – State set: S – Action set: A – Reward function: R(s, a) – Transition model: T(s, a, s')

Why is it called a Markov decision process?

slide-19
SLIDE 19

Definition of an MDP

An MDP is a tuple (S, A, R, T), where – State set: S – Action set: A – Reward function: R(s, a) – Transition model: T(s, a, s')

Why is it called a Markov decision process? Because we’re making the following assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t)

slide-20
SLIDE 20

Definition of an MDP

An MDP is a tuple (S, A, R, T), where – State set: S – Action set: A – Reward function: R(s, a) – Transition model: T(s, a, s')

Why is it called a Markov decision process? Because we’re making the following assumption: P(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, …, s_0, a_0) = P(s_{t+1} | s_t, a_t) – this is called the “Markov” assumption: the next state depends only on the current state and action, not on how the agent got there.

slide-21
SLIDE 21

The Markov Assumption

Suppose the agent starts in some cell and follows this path:

slide-22
SLIDE 22

The Markov Assumption

Suppose the agent starts in some cell and follows this path:

slide-23
SLIDE 23

The Markov Assumption

Suppose the agent starts in some cell and follows this path:

slide-24
SLIDE 24

The Markov Assumption

Suppose the agent starts in some cell and follows this path. Notice that the probability of arriving in the next cell when the agent executes the “right” action does not depend on the path taken to get to the current cell – it depends only on the current cell and the action.

slide-25
SLIDE 25

Think-pair-share

Cart-pole robot: – state is the position of the cart and the orientation of the pole – cart can execute a constant acceleration either left or right

  • 1. Is this system Markov?
  • 2. Why / Why not?
  • 3. If not, how do you change it to make it Markov?
slide-26
SLIDE 26

Policy

A policy π is a rule for selecting actions: a = π(s) – if the agent is in this state, then take this action

slide-27
SLIDE 27

Policy

A policy π is a rule for selecting actions: a = π(s) – if the agent is in this state, then take this action

slide-28
SLIDE 28

Policy

A policy π is a rule for selecting actions: a = π(s) – if the agent is in this state, then take this action. A policy can be stochastic: π(a|s) = probability of taking action a in state s
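A minimal sketch of both kinds of policy as lookup tables (the state and action names are made up for illustration):

```python
# Sketch: a deterministic policy as a lookup table, a stochastic policy as a
# per-state distribution over actions.
import random

deterministic_policy = {"s1": "right", "s2": "up"}       # a = pi(s)
stochastic_policy = {"s1": {"right": 0.9, "up": 0.1}}    # pi(a|s)

def act(policy, s):
    entry = policy[s]
    if isinstance(entry, dict):   # stochastic: sample an action from pi(.|s)
        actions, probs = zip(*entry.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return entry                  # deterministic: return the stored action

print(act(deterministic_policy, "s1"), act(stochastic_policy, "s1"))
```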

slide-29
SLIDE 29

Question

A policy π is a rule for selecting actions: a = π(s) – if the agent is in this state, then take this action. A policy can be stochastic: π(a|s) = probability of taking action a in state s

Why would we want to use a stochastic policy?

slide-30
SLIDE 30

Episodic vs Continuing Process

Episodic process: execution ends at some point and starts over:

  • after a fixed number of time steps
  • upon reaching a terminal state

Example of an episodic task (grid world with a terminal state): execution ends upon reaching the terminal state OR after 15 time steps

slide-31
SLIDE 31

Episodic vs Continuing Process

Continuing process: execution goes on forever. The process doesn’t stop – the agent keeps getting rewards. Example of a continuing task.

slide-32
SLIDE 32

Rewards and Return

On each time step, the agent gets a reward: r_t

slide-33
SLIDE 33

Rewards and Return

On each time step, the agent gets a reward: r_t

– could have positive reward at goal, zero reward elsewhere – could have negative reward on every time step – could have an arbitrary reward function

slide-34
SLIDE 34

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T

slide-35
SLIDE 35

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T

Return

slide-36
SLIDE 36

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T. But it is often a discounted sum of rewards: G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T-t-1} r_T

slide-37
SLIDE 37

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T. But it is often a discounted sum of rewards: G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T-t-1} r_T

What effect does gamma have?

slide-38
SLIDE 38

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T. But it is often a discounted sum of rewards: G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T-t-1} r_T

Reward received k time steps in the future is only worth γ^{k-1} of what it would have been worth immediately

slide-39
SLIDE 39

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T. But it is often a discounted sum of rewards: G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … + γ^{T-t-1} r_T

slide-40
SLIDE 40

Rewards and Return

On each time step, the agent gets a reward: r_t. Return can be a simple sum of rewards: G_t = r_{t+1} + r_{t+2} + … + r_T. But it is often a discounted sum of rewards: G_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + … Return is often evaluated over an infinite horizon: G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1}

Return
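A short sketch of computing the return from a reward sequence (the reward list is made up for illustration):

```python
# Sketch: the discounted return G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...
def discounted_return(rewards, gamma):
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, 0, 1]                    # e.g. zero reward until a +1 goal
print(discounted_return(rewards, 1.0))    # simple sum: 1.0
print(discounted_return(rewards, 0.9))    # discounted: 0.9**3 = 0.729
```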

slide-41
SLIDE 41

Think-pair-share

slide-42
SLIDE 42

Value Function

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]

slide-43
SLIDE 43

Value Function

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]

Value of a state == expected return from that state if agent follows policy

slide-44
SLIDE 44

Value Function

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]. Value of taking action a from state s when acting according to policy π: Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ]

Value of a state == expected return from that state if agent follows policy

slide-45
SLIDE 45

Value Function

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]. Value of taking action a from state s when acting according to policy π: Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ]

Value of a state == expected return from that state if the agent follows policy π. Value of a state/action pair == expected return when taking action a from state s and following π after that.
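One way to read “value == expected return” is to estimate it by averaging returns over many rollouts of the policy. A sketch on a made-up two-step chain MDP (not the slides’ grid world):

```python
# Sketch: Monte Carlo estimate of V_pi(s) as the average discounted return.
import random

T = {("s0", "go"): [("s1", 1.0)], ("s1", "go"): [("s2", 1.0)]}  # deterministic chain
R = {"s2": 1.0}                                                  # +1 on reaching s2
terminal = {"s2"}
gamma = 0.9

def rollout(s, policy):
    g, discount = 0.0, 1.0
    while s not in terminal:
        nexts, probs = zip(*T[(s, policy(s))])
        s = random.choices(nexts, weights=probs, k=1)[0]
        g += discount * R.get(s, 0.0)
        discount *= gamma
    return g

policy = lambda s: "go"
print(sum(rollout("s0", policy) for _ in range(1000)) / 1000)   # 0.9 for this chain
```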

slide-46
SLIDE 46

Value Function

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]. Value of taking action a from state s when acting according to policy π: Q^π(s, a) = E_π[ G_t | s_t = s, a_t = a ]

slide-47
SLIDE 47

Value function example 1

Policy: Discount factor: Value fn:

10 9 8.1 7.3 6.6 6.9

slide-48
SLIDE 48

Value function example 2

Policy: Discount factor: Value fn:

10.66 0.66 0.73 0.81 0.9 1

slide-49
SLIDE 49

Value function example 2

Policy: Discount factor: Value fn:

10.66 0.66 0.73 0.81 0.9 1

Notice that value function can help us compare two different policies – how?

slide-50
SLIDE 50

Value function example 3

Policy: Discount factor: Value fn:

10 10 10 10 10 11

slide-51
SLIDE 51

Think-pair-share

Policy: Discount factor: Value fn:

? ? ? ? ? ?

slide-52
SLIDE 52

Value Function Revisited

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]

slide-53
SLIDE 53

Value Function Revisited

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]

slide-54
SLIDE 54

Value Function Revisited

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ]

This is called a “backup diagram”

slide-55
SLIDE 55

Value Function Revisited

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]

slide-56
SLIDE 56

Value Function Revisited

Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | s_t = s ] = E_π[ r_{t+1} + γ V^π(s_{t+1}) | s_t = s ]
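This recursive form is the basis of iterative policy evaluation. A sketch on a made-up chain MDP (not the slides’ example) that repeatedly applies the backup:

```python
# Sketch: iterative policy evaluation, V(s) <- E_pi[ r + gamma * V(s') | s ].
gamma = 0.9
P = {   # (s, a): list of (next_state, reward, probability)
    ("s0", "go"): [("s1", 0.0, 1.0)],
    ("s1", "go"): [("goal", 1.0, 1.0)],
}
pi = {"s0": "go", "s1": "go"}             # deterministic policy
V = {"s0": 0.0, "s1": 0.0, "goal": 0.0}   # goal is terminal, value stays 0

for _ in range(50):
    for s, a in pi.items():
        V[s] = sum(p * (r + gamma * V[s2]) for (s2, r, p) in P[(s, a)])
print(V)   # {'s0': 0.9, 's1': 1.0, 'goal': 0.0}
```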

slide-57
SLIDE 57

Think-pair-share 1

Value of state when acting according to policy :

Write this expectation in terms of P(s', r | s, a) for a deterministic policy, a = π(s).
slide-58
SLIDE 58

Think-pair-share 2

Value of state when acting according to policy :

Write this expectation in terms of P(s', r | s, a) for a stochastic policy, π(a|s).
slide-59
SLIDE 59

Think-pair-share

slide-60
SLIDE 60

Value Function Revisited

Can we calculate Q in terms of V?

slide-61
SLIDE 61

Value Function Revisited

Can we calculate Q in terms of V?

slide-62
SLIDE 62

Think-pair-share

Can we calculate Q in terms of V?

Write this expectation in terms of P(s', r | s, a) and V^π(s').

slide-63
SLIDE 63

Optimal policies

Given a policy, π, we know how to compute its value function, V^π. But how do we compute the optimal policy, π*?

slide-64
SLIDE 64

Optimal policies

Given a policy, π, we know how to compute its value function, V^π. But how do we compute the optimal policy, π*? Definition: V*(s) = max_π V^π(s)

slide-65
SLIDE 65

Optimal policies

Given a policy, π, we know how to compute its value function, V^π. But how do we compute the optimal policy, π*? Definition: V*(s) = max_π V^π(s)

Best out of all possible policies

slide-66
SLIDE 66

Optimal policies

Given a policy, π, we know how to compute its value function, V^π. But how do we compute the optimal policy, π*? Definition: V*(s) = max_π V^π(s). Definition: Q*(s, a) = max_π Q^π(s, a)

slide-67
SLIDE 67

Optimal policies

Given a policy, π, we know how to compute its value function, V^π. But how do we compute the optimal policy, π*? Definition: V*(s) = max_π V^π(s). Definition: Q*(s, a) = max_π Q^π(s, a). Bellman Equation: V^π(s) = Σ_a π(a|s) Σ_{s',r} P(s', r | s, a) [ r + γ V^π(s') ]. Bellman optimality condition: V*(s) = max_a Σ_{s',r} P(s', r | s, a) [ r + γ V*(s') ]
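The Bellman optimality condition suggests value iteration: repeatedly apply the max-backup until V stops changing. A sketch on a made-up 1x3 chain (transitions here are deterministic, so the sum over next states collapses to a single term):

```python
# Sketch: value iteration, V(s) <- max_a [ R(s') + gamma * V(s') ] for the
# (deterministic) successor s' of (s, a).
gamma = 0.9
states = ["s0", "s1", "goal"]
actions = ["left", "right"]
R = {"s0": 0.0, "s1": 0.0, "goal": 1.0}
T = {  # deterministic moves along the chain; "goal" is terminal
    ("s0", "left"): "s0", ("s0", "right"): "s1",
    ("s1", "left"): "s0", ("s1", "right"): "goal",
}

V = {s: 0.0 for s in states}
for _ in range(100):
    for s in states:
        if s == "goal":
            continue
        V[s] = max(R[T[(s, a)]] + gamma * V[T[(s, a)]] for a in actions)
print(V)   # V["s1"] -> 1.0, V["s0"] -> 0.9
```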

slide-68
SLIDE 68

Think-pair-share

slide-69
SLIDE 69

Value function example 3

Policy: Discount factor: Value fn:

10 9 8 7 6 7