10703 Deep Reinforcement Learning

Solving Known MDPs: Tom Mitchell, September 5, 2018 (lecture slide deck)



  1. Solving Known MDPs (many slides borrowed from Katerina Fragkiadaki and Russ Salakhutdinov)

 Markov Decision Process (MDP). A Markov Decision Process is a tuple (S, A, T, r, γ), where:
 • S is a finite set of states
 • A is a finite set of actions
 • T(s' | s, a) is a state transition probability function
 • r(s, a) is a reward function
 • γ ∈ [0, 1] is a discount factor
 (A minimal data-structure sketch of this tuple follows below.)
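The sketch below is not from the slides; it is a minimal illustration, under assumed conventions, of how the tuple (S, A, T, r, γ) can be stored as plain arrays. The 3-state, 2-action MDP, the shape conventions T[a, s, s2] and r[s, a], and all numbers are invented for illustration.

```python
import numpy as np

# A tiny, hypothetical MDP (S, A, T, r, gamma) with |S| = 3 and |A| = 2.
# T[a, s, s2] = probability of moving to s2 when taking action a in state s.
# r[s, a]     = expected immediate reward for taking action a in state s.
num_states, num_actions = 3, 2

T = np.zeros((num_actions, num_states, num_states))
T[0] = [[0.9, 0.1, 0.0],   # action 0: mostly stay, sometimes drift right
        [0.0, 0.9, 0.1],
        [0.0, 0.0, 1.0]]
T[1] = [[0.0, 1.0, 0.0],   # action 1: move right deterministically
        [0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0]]

r = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 1.0]])  # reward only in the last state

gamma = 0.9

# Sanity check: every row of T[a] is a probability distribution over next states.
assert np.allclose(T.sum(axis=2), 1.0)
```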

  2. Solving MDPs
 • Prediction: given an MDP (S, A, T, r, γ) and a policy π, predict the state and action value functions V^π and Q^π.
 • Optimal control: given an MDP, find the optimal policy π* (aka the planning/control problem).
 • Compare this to the learning problem, where information about the rewards/dynamics is missing.
 • Today we still consider finite MDPs (finite S and A) with known dynamics T and reward r.

 Outline
 • Policy evaluation
 • Policy iteration
 • Value iteration
 • Asynchronous DP

  3. First, a simple deterministic world…
 Reinforcement Learning Task for an Autonomous Agent: execute actions in the environment, observe the results, and learn a control policy π : S → A that maximizes the expected discounted sum of future rewards from every state s ∈ S.
 Example: robot grid world with deterministic actions, a deterministic policy, and reward r(s, a). (A small sketch of a policy and its discounted return follows below.)
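As a concrete (and entirely invented) illustration of a control policy π : S → A and of the discounted reward it is meant to maximize, the sketch below stores a deterministic policy as a dictionary and computes the discounted return of one observed reward sequence. The state names, actions, and rewards are hypothetical.

```python
# A deterministic policy pi: S -> A for a toy grid world (states/actions invented).
pi = {"s0": "right", "s1": "right", "s2": "up"}

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over an observed reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Rewards observed along one episode that ends when the goal is reached.
print(discounted_return([0.0, 0.0, 100.0]))  # 0 + 0.9*0 + 0.81*100 = 81.0
```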

  4. Value Function – what are the V^π(s) values? (Two grid-world figures showing the V^π(s) values.)

  5. Value Function – what are the V*(s) values? V*(s) is the value function for the optimal policy π*, with γ = 0.9. (Grid-world figure: state values V*(s) for the optimal policy.)

  6. Question: how can an agent who doesn't know r(s, a), V*(s), or π*(s) learn them while randomly roaming and observing (and getting reborn after reaching G)?
 [Deterministic actions, rewards, and policy; a single non-negative reward state.]
 Hint: initialize the estimate V(s) = 0 for all s. After each transition, update V(s) (update rule shown as an equation on the slide).

  7. Question, continued. Algorithm: initialize the estimate V(s) = 0 for all s; after each (s, a, r, s') transition, apply the update rule from the previous slide.
 True or false:
 • The V(s) estimate will always be non-negative for all s?
 • The V(s) estimate will always be less than or equal to 100 for all s?

  8. True or false, continued:
 • As the number of random actions and rebirths grows, will V(s) converge from below to V*(s) for the optimal policy π*(s)?
 Now, consider probabilistic actions, rewards, and policies. (A small roaming-and-updating sketch, under stated assumptions, follows below.)
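The slides' update equation did not survive extraction. One rule consistent with the true/false questions above (estimates stay non-negative, never exceed 100, and converge to V* from below in a deterministic world with a single reward of 100 for entering G) is V(s) ← max(V(s), r + γ V(s')). The sketch below is an assumption-laden illustration, not the slides' algorithm: the 1-D corridor world, the reward of 100, and the max-form update are all assumptions.

```python
import random

# Hypothetical deterministic world: a 1-D corridor with states 0..4, goal G = 4.
# Entering G yields reward 100; every other transition yields 0 (assumption).
GOAL, GAMMA = 4, 0.9

def step(s, a):                        # a in {-1, +1}; walls clamp the position
    s_next = min(max(s + a, 0), GOAL)
    return s_next, (100.0 if s_next == GOAL else 0.0)

V = {s: 0.0 for s in range(GOAL + 1)}  # initialize V(s) = 0 for all s

for episode in range(200):             # "reborn" at a random start after reaching G
    s = random.choice(range(GOAL))
    while s != GOAL:
        a = random.choice([-1, +1])    # roam randomly
        s_next, r = step(s, a)
        # Assumed update rule: keep the best backed-up estimate seen so far.
        V[s] = max(V[s], r + GAMMA * V[s_next])
        s = s_next

print({s: round(v, 1) for s, v in V.items()})
# Under these assumptions the estimates rise from below toward
# V*(3) = 100, V*(2) = 90, V*(1) = 81, V*(0) = 72.9, while V(G) stays 0.
```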

  9. Policy Evaluation. For a given policy π, compute the state value function V^π, where V^π is implicitly given by the Bellman expectation equation
 V^π(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} T(s'|s, a) V^π(s') ],
 a system of |S| simultaneous equations.

 MDPs to MRPs. An MDP under a fixed policy π becomes a Markov Reward Process (MRP) with
 r^π(s) = Σ_a π(a|s) r(s, a)   and   T^π(s'|s) = Σ_a π(a|s) T(s'|s, a).
 (A sketch of this reduction follows below.)
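To make the MDP-to-MRP reduction concrete, here is a minimal sketch under the same assumed array layout as the earlier toy-MDP example (T with shape (A, S, S), r with shape (S, A)); it collapses an MDP and a stochastic policy π(a|s) into the induced T^π and r^π.

```python
import numpy as np

def induced_mrp(T, r, pi):
    """Collapse an MDP under a fixed stochastic policy pi into an MRP.

    Assumed shapes: T (A, S, S) with T[a, s, s2] = P(s2 | s, a),
                    r (S, A), and pi (S, A) with pi[s, a] = pi(a | s).
    Returns T_pi (S, S) and r_pi (S,) with
    T_pi[s, s2] = sum_a pi(a|s) T(s2|s, a) and r_pi[s] = sum_a pi(a|s) r(s, a).
    """
    T_pi = np.einsum("sa,asp->sp", pi, T)
    r_pi = (pi * r).sum(axis=1)
    return T_pi, r_pi

# Example: an equiprobable random policy over 2 actions in a random 3-state MDP.
rng = np.random.default_rng(0)
num_states, num_actions = 3, 2
T = rng.dirichlet(np.ones(num_states), size=(num_actions, num_states))
r = rng.uniform(size=(num_states, num_actions))
pi_random = np.full((num_states, num_actions), 1.0 / num_actions)
T_pi, r_pi = induced_mrp(T, r, pi_random)
print(T_pi.shape, r_pi.shape)  # (3, 3) (3,)
```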

  10. Backup Diagram – MDP (two backup-diagram figures on these slides).

  11. Matrix Form. The Bellman expectation equation can be written concisely using the induced MRP form
 v^π = r^π + γ T^π v^π,
 with direct solution v^π = (I - γ T^π)^(-1) r^π of complexity O(|S|^3). Here
 • T^π is an |S| x |S| matrix whose (j, k) entry gives P(s_k | s_j, a = π(s_j)),
 • r^π is an |S|-dimensional vector whose j-th entry gives E[r | s_j, a = π(s_j)],
 • v^π is an |S|-dimensional vector whose j-th entry gives V^π(s_j),
 where |S| is the number of distinct states. (A linear-solve sketch follows below.)

 Iterative Methods: recall the Bellman equation above.
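Returning to the matrix form above: a minimal sketch of the direct solution under the induced-MRP layout, solving the linear system (I - γ T^π) v^π = r^π with a standard solver rather than forming the inverse explicitly.

```python
import numpy as np

def solve_v_pi(T_pi, r_pi, gamma):
    """Exact policy evaluation: solve (I - gamma * T_pi) v = r_pi.

    T_pi : (S, S) induced transition matrix under the fixed policy
    r_pi : (S,)   induced expected immediate rewards
    The linear solve costs O(|S|^3), which is fine for small state spaces.
    """
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)

# Tiny 2-state MRP example (numbers are invented).
T_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
r_pi = np.array([1.0, 0.0])
print(solve_v_pi(T_pi, r_pi, gamma=0.9))  # roughly [1.818, 0.0]
```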

  12. Iterative Methods: Backup Operation. Given an estimate V_k of the value function at iteration k, we back it up to obtain the estimate at iteration k+1:
 V_{k+1}(s) = Σ_a π(a|s) [ r(s, a) + γ Σ_{s'} T(s'|s, a) V_k(s') ].

 Iterative Methods: Sweep. A sweep consists of applying the backup operation to every state in S; iterative policy evaluation applies the backup operator sweep after sweep. (A sketch of one sweep follows below.)
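A sketch of one full synchronous sweep of the backup operation over all states, again under the assumed (A, S, S) / (S, A) array layout; repeating sweeps until the estimates stop changing gives iterative policy evaluation.

```python
import numpy as np

def sweep(V, T, r, pi, gamma):
    """One synchronous Bellman expectation backup applied to every state.

    V_{k+1}(s) = sum_a pi(a|s) [ r(s,a) + gamma * sum_s2 T(s2|s,a) V_k(s2) ]
    Assumed shapes: V (S,), T (A, S, S), r (S, A), pi (S, A).
    """
    q = r + gamma * np.einsum("asp,p->sa", T, V)  # q[s, a] = one-step backup
    return (pi * q).sum(axis=1)

def iterative_policy_evaluation(T, r, pi, gamma, tol=1e-8):
    """Apply sweeps until the value estimates change by less than tol."""
    V = np.zeros(T.shape[1])
    while True:
        V_new = sweep(V, T, r, pi, gamma)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```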

  13. A Small Grid World (γ = 1)
 • An undiscounted episodic task
 • Nonterminal states: 1, 2, …, 14
 • Terminal states: two, shown in shaded squares
 • Actions that would take the agent off the grid leave the state unchanged
 • Reward is -1 on every transition until a terminal state is reached

 Iterative Policy Evaluation for the random policy π, which takes an equiprobable random action in every state (same grid world and rewards as above).

  14. Iterative Policy Evaluation for the random policy (equiprobable random actions), continued: the slides show the value estimates for the same grid world after successive sweeps. (A runnable sketch of this computation follows below.)
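Putting the pieces together for the small grid world described above: the sketch below evaluates the equiprobable random policy by sweeping until convergence. The 4x4 layout with the two terminal states in opposite corners is an assumption (the slide figures did not survive extraction), chosen to match the classic Sutton and Barto grid-world example that these bullets describe.

```python
import numpy as np

# Assumed 4x4 grid: cells 0..15, terminals at 0 and 15, nonterminal states 1..14.
# gamma = 1 (undiscounted), reward -1 per step, off-grid moves leave the state unchanged.
N, TERMINALS = 4, {0, 15}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def next_state(s, a):
    if s in TERMINALS:
        return s                              # terminal states are absorbing
    row, col = divmod(s, N)
    nr, nc = row + a[0], col + a[1]
    if not (0 <= nr < N and 0 <= nc < N):     # off-grid move: state unchanged
        nr, nc = row, col
    return nr * N + nc

V = np.zeros(N * N)
while True:                                   # iterative policy evaluation
    V_new = np.zeros_like(V)
    for s in range(N * N):
        if s in TERMINALS:
            continue
        # Equiprobable random policy: average the one-step backups over 4 actions.
        V_new[s] = sum(-1.0 + V[next_state(s, a)] for a in ACTIONS) / len(ACTIONS)
    if np.max(np.abs(V_new - V)) < 1e-6:
        break
    V = V_new

print(np.round(V.reshape(N, N), 1))
# Under the assumed layout this converges to the familiar values
# 0, -14, -20, -22 in the first row (and symmetrically elsewhere).
```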
