Markov decision process (MDP)

  1. Markov decision process (MDP). Robert Platt, Northeastern University.

  2. The RL Setting [Figure: agent-world interaction loop with Action, Observation, and Reward arrows] On a single time step, the agent does the following: 1. observe some information, 2. select an action to execute, 3. take note of any reward. Goal of the agent: select actions that maximize cumulative reward in the long run.

  3. Let’s turn this into an MDP [Same figure: agent-world loop with Action, Observation, and Reward] On a single time step, the agent does the following: 1. observe some information, 2. select an action to execute, 3. take note of any reward. Goal of the agent: select actions that maximize cumulative reward in the long run.

  4. Let’s turn this into an MDP [Figure: the observation is now a State signal from the world] On a single time step, the agent does the following: 1. observe the state, 2. select an action to execute, 3. take note of any reward. Goal of the agent: select actions that maximize cumulative reward in the long run.

  5. Let’s turn this into an MDP [Figure: the world, state, and reward are labeled “This part is the MDP”] On a single time step, the agent does the following: 1. observe the state, 2. select an action to execute, 3. take note of any reward. Goal of the agent: select actions that maximize cumulative reward in the long run.

  6. Example: Grid world Grid world: – the agent lives on a grid – always occupies a single cell – can move left, right, up, down – gets zero reward unless it is in the “+1” or “-1” cells.

  7. States and actions State set: S = the set of grid cells the agent can occupy. Action set: A = {left, right, up, down}.

  8. Reward function Reward function: r = +1 in the “+1” cell and r = -1 in the “-1” cell. Otherwise: r = 0.

  9. Reward function Reward function: +1 in the “+1” cell, -1 in the “-1” cell. Otherwise: 0. In general: r(s, a) = E[ R_{t+1} | S_t = s, A_t = a ].

  10. Reward function Reward function: +1 in the “+1” cell, -1 in the “-1” cell. Otherwise: 0. In general: r(s, a) = E[ R_{t+1} | S_t = s, A_t = a ], the expected reward on this time step given that the agent takes action a from state s.

  11. Transition function Transition model: p(s' | s, a) = P( S_{t+1} = s' | S_t = s, A_t = a ). For example: the probability of arriving in a particular neighboring cell when the agent executes one of the move actions.

  12. Transition function Transition model: p(s' | s, a) = P( S_{t+1} = s' | S_t = s, A_t = a ), the probability of this transition. This entire probability distribution can be written as a table over (state, action, next state).
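For concreteness, here is a minimal sketch (Python, not from the slides) of such a table; the states, actions, and probabilities below are made up purely for illustration:

```python
# Transition model p(s' | s, a) stored as a table:
# key = (state, action), value = {next_state: probability}.
# States, actions, and probabilities are illustrative only.
T = {
    ("s0", "right"): {"s1": 0.8, "s0": 0.2},
    ("s1", "right"): {"s2": 0.8, "s1": 0.2},
    ("s1", "left"):  {"s0": 0.8, "s1": 0.2},
}

def transition_prob(s, a, s_next):
    """Look up p(s' | s, a); unlisted transitions have probability 0."""
    return T.get((s, a), {}).get(s_next, 0.0)

print(transition_prob("s0", "right", "s1"))   # 0.8
```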

  13. Definition of an MDP An MDP is a tuple (S, A, r, T), where State set: S. Action set: A. Reward function: r(s, a). Transition model: T(s, a, s') = P( s' | s, a ).
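One possible way to package the tuple in code, again only a sketch with hypothetical names (the slides do not prescribe any particular data structure):

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                     # state set S
    actions: List[str]                                    # action set A
    reward: Callable[[str, str], float]                   # r(s, a)
    transitions: Dict[Tuple[str, str], Dict[str, float]]  # p(s' | s, a)

# A tiny illustrative instance (not the grid world from the slides):
toy = MDP(
    states=["s0", "s1"],
    actions=["stay", "go"],
    reward=lambda s, a: 1.0 if s == "s1" else 0.0,
    transitions={
        ("s0", "go"):   {"s1": 1.0},
        ("s0", "stay"): {"s0": 1.0},
        ("s1", "go"):   {"s0": 1.0},
        ("s1", "stay"): {"s1": 1.0},
    },
)
```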

  14. Example: Frozen Lake Frozen Lake is a 4x4 grid [shown on the slide]. State set: the 16 grid cells. Action set: {left, down, right, up}. Reward function: +1 if the agent reaches the goal cell, 0 otherwise. Transition model: only a one-third chance of going in the specified direction – one-third chance of moving +90 deg – one-third chance of moving -90 deg.
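Frozen Lake also exists as a ready-made environment in the Gymnasium library (not mentioned on the slide, but convenient for experimenting). Assuming `gymnasium` is installed, a minimal interaction loop looks roughly like this:

```python
import gymnasium as gym

# is_slippery=True gives the stochastic transitions described on the slide:
# the chosen direction is followed only about one third of the time.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)

obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()   # a random policy, just for illustration
    obs, reward, terminated, truncated, info = env.step(action)
print("final state:", obs, "reward:", reward)
```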

  15. Example: Recycling Robot Example 3.4 in Sutton &amp; Barto, 2nd Ed.

  16. Think-pair-share Mobile robot: – the robot moves on a flat surface – the robot can execute point turns either left or right; it can also go forward or back with fixed velocity – it must reach a goal while avoiding obstacles. Express the mobile robot control problem as an MDP.

  17. Definition of an MDP An MDP is a tuple (S, A, r, T), where State set: S. Action set: A. Reward function: r(s, a). Transition model: T(s, a, s') = P( s' | s, a ).

  18. Definition of an MDP An MDP is a tuple (S, A, r, T) as before. Why is it called a Markov decision process?

  19. Definition of an MDP An MDP is a tuple (S, A, r, T) as before. Why is it called a Markov decision process? Because we’re making the following assumption: P( S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, …, S_0, A_0 ) = P( S_{t+1} | S_t, A_t ).

  20. Definition of an MDP An MDP is a tuple (S, A, r, T) as before. Why is it called a Markov decision process? Because we’re making the following assumption: P( S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, …, S_0, A_0 ) = P( S_{t+1} | S_t, A_t ) – this is called the “Markov” assumption.

  21.-23. The Markov Assumption Suppose the agent starts in some state s and follows this path: [Figure: a path through the grid world, built up over these three slides]

  24. The Markov Assumption Suppose the agent starts in some state s and follows this path [same figure]. Notice that the probability of arriving in s' if the agent executes the “right” action does not depend on the path taken to get to s: P( S_{t+1} = s' | S_t = s, A_t = right, S_{t-1}, …, S_0 ) = P( S_{t+1} = s' | S_t = s, A_t = right ).

  25. Think-pair-share Cart-pole robot: – the state is the position of the cart and the orientation of the pole – the cart can execute a constant acceleration either left or right. 1. Is this system Markov? 2. Why / why not? 3. If not, how do you change it to make it Markov?

  26.-27. Policy A policy is a rule for selecting actions: π(s) = a, i.e. if the agent is in this state, then take this action.

  28. Policy A policy is a rule for selecting actions: π(s) = a, i.e. if the agent is in this state, then take this action. A policy can also be stochastic: π(a | s) = probability of taking action a in state s.
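A small sketch of sampling actions from a tabular stochastic policy π(a | s); the states, actions, and probabilities are hypothetical:

```python
import random

# pi[s] is a distribution over actions, i.e. pi(a | s); values are made up.
pi = {
    "s0": {"left": 0.5, "right": 0.5},   # uniformly random in s0
    "s1": {"left": 0.1, "right": 0.9},   # mostly "right" in s1
}

def sample_action(state):
    """Draw an action a with probability pi(a | state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(sample_action("s1"))   # "right" about 90% of the time
```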

  29. Question Why would we want to use a stochastic policy? (Recall: a policy π(s) is a rule for selecting actions; a policy can also be stochastic, π(a | s).)

  30. Episodic vs Continuing Process Episodic process: execution ends at some point and starts over – after a fixed number of time steps, or – upon reaching a terminal state. [Figure: grid world with the terminal state marked] Example of an episodic task: execution ends upon reaching the terminal state OR after 15 time steps.
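A sketch of the episodic loop just described (terminal state OR 15 steps), reusing the Gymnasium Frozen Lake environment from the earlier sketch purely for illustration:

```python
import gymnasium as gym

env = gym.make("FrozenLake-v1")

# Episode ends upon reaching a terminal state OR after 15 time steps,
# matching the example on the slide.
for episode in range(3):
    obs, info = env.reset()
    for t in range(15):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated or truncated:
            break
    print(f"episode {episode} ended after {t + 1} steps")
```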

  31. Episodic vs Continuing Process Continuing process: execution goes on forever. Example of a continuing task: the process doesn’t stop – the agent keeps getting rewards.

  32. Rewards and Return On each time step, the agent gets a reward: R_{t+1} ∈ ℝ.

  33. Rewards and Return On each time step, the agent gets a reward: R_{t+1} ∈ ℝ – could have positive reward at the goal, zero reward elsewhere – could have negative reward on every time step – could have an arbitrary reward function.

  34.-35. Rewards and Return On each time step, the agent gets a reward: R_{t+1} ∈ ℝ. The return can be a simple sum of rewards: G_t = R_{t+1} + R_{t+2} + … + R_T.

  36. Rewards and Return The return can be a simple sum of rewards: G_t = R_{t+1} + R_{t+2} + … + R_T. But it is often a discounted sum of rewards: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …, where γ ∈ [0, 1] is the discount factor.

  37. Rewards and Return But the return is often a discounted sum of rewards: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … What effect does γ have?

  38. Rewards and Return A reward received k time steps in the future is only worth γ^{k-1} of what it would have been worth immediately.

  39.-40. Rewards and Return The return is often evaluated over an infinite horizon: G_t = Σ_{k=0}^{∞} γ^k R_{t+k+1}.
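A short sketch of computing a discounted return from a list of sampled rewards; γ = 0.9 and the reward sequence are arbitrary example values:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = r_{t+1} + gamma * r_{t+2} + gamma**2 * r_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0, 0, 0, 1]                  # e.g. a +1 reward arrives on the 4th step
print(discounted_return(rewards))       # 0.9**3 = 0.729
print(discounted_return(rewards, 1.0))  # undiscounted simple sum = 1.0
```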

  41. Think-pair-share

  42.-43. Value Function Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | S_t = s ]. Value of a state == expected return from that state if the agent follows policy π.

  44.-46. Value Function Value of taking action a from state s when acting according to policy π: Q^π(s, a) = E_π[ G_t | S_t = s, A_t = a ]. Value of a state/action pair == expected return when taking action a from state s and following π after that.
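One way to make these definitions concrete is a Monte Carlo estimate of V^π(s): roll the policy out many times from s and average the sampled returns. The sketch below assumes the tabular transition/reward format used in the earlier sketches and a deterministic policy; all names and numbers are illustrative:

```python
import random

def rollout_return(s, policy, transitions, reward, gamma=0.9, horizon=50):
    """Sample one trajectory from state s under `policy`; return its discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]                          # deterministic policy: state -> action
        g += discount * reward(s, a)
        discount *= gamma
        next_dist = transitions[(s, a)]        # p(s' | s, a) as {next_state: prob}
        s = random.choices(list(next_dist), weights=list(next_dist.values()))[0]
    return g

def mc_value(s, policy, transitions, reward, episodes=2000):
    """V_pi(s) estimated as the average sampled return from s."""
    return sum(rollout_return(s, policy, transitions, reward)
               for _ in range(episodes)) / episodes

# Tiny illustrative inputs (same table format as the earlier sketches):
transitions = {("s0", "go"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}}
reward = lambda s, a: 1.0 if s == "s1" else 0.0
policy = {"s0": "go", "s1": "go"}
print(mc_value("s0", policy, transitions, reward))   # roughly gamma / (1 - gamma) = 9
```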

  47. Value function example 1 [Figure: grid world with the policy arrows and discount factor given on the slide] Value fn: 6.9, 6.6, 7.3, 8.1, 9, 10 (one value per cell).

  48. Value function example 2 [Figure: grid world with the policy arrows and discount factor given on the slide] Value fn: 1, 0.9, 0.81, 0.73, 0.66, 10.66 (one value per cell).

  49. Value function example 2 Notice that the value function can help us compare two different policies – how? [Same grid, policy, and discount factor as the previous slide] Value fn: 1, 0.9, 0.81, 0.73, 0.66, 10.66.

  50. Value function example 3 [Figure: grid world with the policy arrows and discount factor given on the slide] Value fn: 11, 10, 10, 10, 10, 10 (one value per cell).

  51. Think-pair-share [Figure: grid world with a policy and discount factor given on the slide] Value fn: ? ? ? ? ? ?

  52.-56. Value Function Revisited Value of state s when acting according to policy π: V^π(s) = E_π[ G_t | S_t = s ] = E_π[ R_{t+1} + γ G_{t+1} | S_t = s ] = E_π[ R_{t+1} + γ V^π(S_{t+1}) | S_t = s ]. The one-step look-ahead over actions, next states, and rewards used here is drawn as a “backup diagram”.
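Turning this recursion into code gives iterative policy evaluation. The sketch below uses the simpler r(s, a) reward and p(s' | s, a) tables from the earlier sketches rather than the joint P(s', r | s, a) form; it is an illustration, not part of the slides:

```python
def policy_evaluation(states, policy, transitions, reward, gamma=0.9, iters=200):
    """Repeatedly apply V(s) <- r(s, pi(s)) + gamma * sum_s' p(s' | s, pi(s)) * V(s')."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: reward(s, policy[s])
                + gamma * sum(p * V[s2]
                              for s2, p in transitions[(s, policy[s])].items())
             for s in states}
    return V

# Same tiny two-state example as before:
states = ["s0", "s1"]
transitions = {("s0", "go"): {"s1": 1.0}, ("s1", "go"): {"s1": 1.0}}
reward = lambda s, a: 1.0 if s == "s1" else 0.0
policy = {"s0": "go", "s1": "go"}
print(policy_evaluation(states, policy, transitions, reward))
# converges to V(s1) = 1 / (1 - gamma) = 10 and V(s0) = gamma * V(s1) = 9
```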

  57. Think-pair-share 1 Value of state s when acting according to policy π: V^π(s) = E_π[ R_{t+1} + γ V^π(S_{t+1}) | S_t = s ]. Write this expectation in terms of P( s’, r | s, a ) for a deterministic policy, a = π(s).

  58. Think-pair-share 2 Value of state s when acting according to policy π: V^π(s) = E_π[ R_{t+1} + γ V^π(S_{t+1}) | S_t = s ]. Write this expectation in terms of P( s’, r | s, a ) for a stochastic policy, π(a | s).

  59. Think-pair-share

  60.-61. Value Function Revisited Can we calculate Q in terms of V? Yes: Q^π(s, a) = E[ R_{t+1} + γ V^π(S_{t+1}) | S_t = s, A_t = a ].
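In the tabular format used in the earlier sketches (with r(s, a) and p(s' | s, a) instead of the joint P(s', r | s, a) that the next exercise asks about), the relationship can be written as:

```python
def q_from_v(s, a, V, transitions, reward, gamma=0.9):
    """Q(s, a) = r(s, a) + gamma * sum_s' p(s' | s, a) * V(s')."""
    return reward(s, a) + gamma * sum(p * V[s2]
                                      for s2, p in transitions[(s, a)].items())

# e.g. with the two-state tables above:
# q_from_v("s0", "go", {"s0": 9.0, "s1": 10.0}, transitions, reward) == 9.0
```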

  62. Think-pair-share Can we calculate Q in terms of V? Write this expectation in terms of P( s’, r | s, a ) and V^π(s’).
