
10703 Deep Reinforcement Learning

Tom Mitchell September 5, 2018

Solving known MDPs

Many slides borrowed from Katerina Fragkiadaki and Russ Salakhutdinov

Markov Decision Process (MDP)

A Markov Decision Process is a tuple ⟨S, A, T, r, γ⟩ where

  • S is a finite set of states
  • A is a finite set of actions
  • T(s′ | s, a) is a state transition probability function
  • r(s, a) is a reward function
  • γ ∈ [0, 1] is a discount factor
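To make these objects concrete, here is a minimal sketch (not from the slides) of one way to store a finite MDP with known dynamics as dense arrays; the class name FiniteMDP and the array layout are illustrative choices.

```python
import numpy as np

class FiniteMDP:
    """A finite MDP <S, A, T, r, gamma> with known dynamics, stored as dense arrays.

    T[s, a, s2] = probability of landing in s2 after taking action a in state s
    r[s, a]     = expected immediate reward for taking action a in state s
    """
    def __init__(self, T, r, gamma):
        self.T = np.asarray(T, dtype=float)    # shape (|S|, |A|, |S|)
        self.r = np.asarray(r, dtype=float)    # shape (|S|, |A|)
        self.gamma = float(gamma)              # discount factor in [0, 1]
        self.n_states, self.n_actions, _ = self.T.shape
        # every T[s, a, :] must be a probability distribution over next states
        assert np.allclose(self.T.sum(axis=2), 1.0)
```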


Solving MDPs

  • Prediction: Given an MDP ⟨S, A, T, r, γ⟩ and a policy π, predict the state and action value functions.

  • Optimal control: Given an MDP ⟨S, A, T, r, γ⟩, find the optimal policy π* (aka the planning/control problem).

  • Compare this to the learning problem with missing information about rewards/dynamics.

  • Today we still consider finite MDPs (finite S and A) with known dynamics T and r.

Outline

  • Policy evaluation
  • Policy iteration
  • Value iteration
  • Asynchronous DP

First, a simple deterministic world…

Reinforcement Learning Task for an Autonomous Agent

Execute actions in the environment, observe the results, and

  • Learn a control policy π: S → A that maximizes the expected discounted sum of future rewards E[r_t + γ r_{t+1} + γ² r_{t+2} + …] from every state s ∈ S

Example: robot grid world, deterministic actions, policy, and reward r(s,a)


Value Function – what are the Vπ(s) values?


Value Function – what are the V*(s) values?

V*(s) is the value function for the optimal policy π*, with discount factor γ = 0.9.

State values V*(s) for optimal policy
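As a quick check on the numbers, assuming the usual version of this example where the only nonzero reward is 100 for entering G and all other rewards are 0: a state one move from G has V*(s) = 100, a state two moves away has V*(s) = 0 + 0.9 · 100 = 90, three moves away gives 0 + 0.9 · 90 = 81, and in general V*(s) = 100 · 0.9^(d-1), where d is the number of moves to G under the optimal policy.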


Question

How can an agent who doesn't know r(s,a), V*(s), or π*(s) learn them while randomly roaming and observing (and getting reborn after reaching G)?
 [Deterministic actions, rewards, and policy; a single non-negative reward state.]

Hint: initialize the estimate V(s) = 0 for all s, and update it after each observed transition (s, a, r, s′).


Question

Algorithm: initialize the estimate V(s) = 0 for all s, and update it after each observed (s, a, r, s′) transition (see the sketch below).

True or false:

  • The V(s) estimate will always be non-negative for all s?
  • The V(s) estimate will always be less than or equal to 100 for all s?
  • As the number of random actions and rebirths grows, V(s) will converge from below to V*(s) for the optimal policy π*(s)?
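The update rule itself is not reproduced in this extraction. Below is a minimal runnable sketch of one plausible rule for a deterministic world, V(s) ← max(V(s), r + γ·V(s′)); the rule, the 2×4 grid, the goal cell, and the reward of 100 for entering G are all illustrative assumptions, not the slide's code.

```python
import random

GAMMA = 0.9
GRID = [(row, col) for row in range(2) for col in range(4)]   # tiny 2x4 grid, made up for illustration
GOAL = (0, 3)                                                 # hypothetical goal cell G
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]                  # up, down, left, right

def step(state, action):
    """Deterministic dynamics: move if the target cell exists, otherwise stay put."""
    target = (state[0] + action[0], state[1] + action[1])
    nxt = target if target in GRID else state
    reward = 100.0 if nxt == GOAL and state != GOAL else 0.0   # assumed: 100 for entering G, else 0
    return nxt, reward

V = {s: 0.0 for s in GRID}            # initialize estimate V(s) = 0 for all s
s = random.choice(GRID)
for _ in range(10_000):               # roam randomly; reborn after reaching G
    if s == GOAL:
        s = random.choice(GRID)
        continue
    a = random.choice(ACTIONS)
    s2, r = step(s, a)
    V[s] = max(V[s], r + GAMMA * V[s2])   # assumed update rule (not from the slide)
    s = s2
```

Under this assumed rule the estimate never decreases, stays within [0, 100], and approaches V*(s) from below as random roaming eventually backs up the goal reward along every path, which is one way to reason about the true/false questions above.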

Now, consider probabilistic actions, rewards, and policies


Policy Evaluation

Policy evaluation: for a given policy π, compute the state value function

  Vπ(s) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s ]

where Vπ is implicitly given by the Bellman expectation equation

  Vπ(s) = Σ_a π(a|s) [ r(s,a) + γ Σ_{s′} T(s′|s,a) Vπ(s′) ],

a system of |S| simultaneous equations.

MDPs to MRPs

An MDP under a fixed policy π becomes a Markov Reward Process (MRP) with

  rπ(s) = Σ_a π(a|s) r(s,a)   and   Tπ(s′|s) = Σ_a π(a|s) T(s′|s,a).
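A minimal sketch of this collapse, assuming the array conventions from the earlier MDP sketch (T of shape (S, A, S), r of shape (S, A)) and a policy given as a row-stochastic matrix pi of shape (S, A); the names are illustrative.

```python
import numpy as np

def induced_mrp(T, r, pi):
    """Collapse an MDP under a fixed policy pi into an MRP (T_pi, r_pi).

    T_pi[s, s2] = sum_a pi[s, a] * T[s, a, s2]
    r_pi[s]     = sum_a pi[s, a] * r[s, a]
    """
    T_pi = np.einsum("sa,saz->sz", pi, T)   # policy-averaged transition matrix
    r_pi = np.einsum("sa,sa->s", pi, r)     # policy-averaged reward vector
    return T_pi, r_pi
```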


Backup Diagram (MDP)


Matrix Form

The Bellman expectation equation can be written concisely using the induced MRP form

  vπ = rπ + γ Tπ vπ

with direct solution

  vπ = (I − γ Tπ)⁻¹ rπ

  • O(|S|³) complexity for the matrix inversion

Here Tπ is an |S|×|S| matrix whose (j,k) entry gives P(sk | sj, a = π(sj)),
rπ is an |S|-dimensional vector whose jth entry gives E[r | sj, a = π(sj)],
and vπ is an |S|-dimensional vector whose jth entry gives Vπ(sj),
where |S| is the number of distinct states.
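For illustration, the direct solution is a single linear solve (cheaper and numerically safer than forming the inverse explicitly); this continues the same assumed array conventions rather than any course-provided code.

```python
import numpy as np

def policy_evaluation_direct(T_pi, r_pi, gamma):
    """Solve v_pi = r_pi + gamma * T_pi @ v_pi, i.e. (I - gamma * T_pi) v_pi = r_pi.

    Needs gamma < 1 (or an episodic structure that keeps the system nonsingular).
    Cost is O(|S|^3), which is why iterative methods are preferred for large state spaces.
    """
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
```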

Iterative Methods: Recall the Bellman Equation


Iterative Methods: Backup Operation

Given the value function estimate at iteration k, we back it up to obtain the estimate at iteration k+1:

  Vk+1(s) = Σ_a π(a|s) [ r(s,a) + γ Σ_{s′} T(s′|s,a) Vk(s′) ]

Iterative Methods: Sweep

A sweep consists of applying the backup operation to all the states in S. Iterative policy evaluation applies the backup operator sweep after sweep until the value function converges.
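A minimal sketch of iterative policy evaluation with full synchronous sweeps, again under the assumed array conventions (T of shape (S, A, S), r of shape (S, A), pi of shape (S, A)); the stopping threshold theta is an illustrative choice.

```python
import numpy as np

def iterative_policy_evaluation(T, r, pi, gamma, theta=1e-8):
    """Apply the Bellman expectation backup to every state, sweep after sweep, until convergence."""
    V = np.zeros(T.shape[0])                      # V_0 = 0
    while True:
        # one sweep: back up every state from the current estimate V_k
        q = r + gamma * (T @ V)                   # q[s, a] = r(s,a) + gamma * sum_s' T(s'|s,a) V_k(s')
        V_new = np.sum(pi * q, axis=1)            # average over the policy's action probabilities
        if np.max(np.abs(V_new - V)) < theta:     # stop once a sweep barely changes V
            return V_new
        V = V_new
```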


A Small-Grid World

  • An undiscounted episodic task (γ = 1)
  • Nonterminal states: 1, 2, …, 14
  • Terminal states: two, shown in shaded squares
  • Actions that would take the agent off the grid leave the state unchanged
  • Reward is −1 on every transition until the terminal state is reached
  • Policy π: equiprobable random action

Iterative Policy Evaluation

Iterative policy evaluation for the random policy on the small grid world; a runnable sketch follows below.