
10703 Deep Reinforcement Learning

Tom Mitchell September 5, 2018

Solving known MDPs

Many slides borrowed from Katerina Fragkiadaki and Russ Salakhutdinov

Markov Decision Process (MDP)

A Markov Decision Process is a tuple ⟨S, A, T, r, γ⟩ where

  • S is a finite set of states
  • A is a finite set of actions
  • T(s′ | s, a) is a state transition probability function
  • r(s, a) is a reward function
  • γ ∈ [0, 1] is a discount factor
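To make these objects concrete, here is a minimal sketch (not from the slides) of one way to store a finite MDP with known dynamics as dense arrays; the class name FiniteMDP and the array layout are illustrative choices.

```python
import numpy as np

class FiniteMDP:
    """A finite MDP <S, A, T, r, gamma> with known dynamics, stored as dense arrays.

    T[s, a, s2] = probability of landing in s2 after taking action a in state s
    r[s, a]     = expected immediate reward for taking action a in state s
    """
    def __init__(self, T, r, gamma):
        self.T = np.asarray(T, dtype=float)    # shape (|S|, |A|, |S|)
        self.r = np.asarray(r, dtype=float)    # shape (|S|, |A|)
        self.gamma = float(gamma)              # discount factor in [0, 1]
        self.n_states, self.n_actions, _ = self.T.shape
        # every T[s, a, :] must be a probability distribution over next states
        assert np.allclose(self.T.sum(axis=2), 1.0)
```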


Solving MDPs

  • Prediction: Given an MDP ⟨S, A, T, r, γ⟩ and a policy π, predict the state and action value functions.

  • Optimal control: Given an MDP ⟨S, A, T, r, γ⟩, find the optimal policy π* (aka the planning/control problem).

  • Compare this to the learning problem with missing information about rewards/dynamics.

  • Today we still consider finite MDPs (finite S and A) with known dynamics T and r.

Outline

  • Policy evaluation
  • Policy iteration
  • Value iteration
  • Asynchronous DP

First, a simple deterministic world…

Reinforcement Learning Task for an Autonomous Agent

Execute actions in the environment, observe the results, and

  • Learn a control policy π: S → A that maximizes the expected discounted sum of future rewards E[r_t + γ r_{t+1} + γ² r_{t+2} + …] from every state s ∈ S

Example: robot grid world, deterministic actions, policy, and reward r(s,a)


Value Function – what are the Vπ(s) values?


Value Function – what are the V*(s) values?

V*(s) is the value function for the optimal policy π*, with discount factor γ = 0.9.

State values V*(s) for optimal policy
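As a quick check on the numbers, assuming the usual version of this example where the only nonzero reward is 100 for entering G and all other rewards are 0: a state one move from G has V*(s) = 100, a state two moves away has V*(s) = 0 + 0.9 · 100 = 90, three moves away gives 0 + 0.9 · 90 = 81, and in general V*(s) = 100 · 0.9^(d-1), where d is the number of moves to G under the optimal policy.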


Question

How can an agent who doesn't know r(s,a), V*(s), or π*(s) learn them while randomly roaming and observing (and getting reborn after reaching G)?
 [Deterministic actions, rewards, and policy; a single non-negative reward state.]

Hint: initialize the estimate V(s) = 0 for all s, and update it after each observed transition (s, a, r, s′).


Question

Algorithm: initialize the estimate V(s) = 0 for all s, and update it after each observed (s, a, r, s′) transition (see the sketch below).

True or false:

  • The V(s) estimate will always be non-negative for all s?
  • The V(s) estimate will always be less than or equal to 100 for all s?
  • As the number of random actions and rebirths grows, V(s) will converge from below to V*(s) for the optimal policy π*(s)?
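The update rule itself is not reproduced in this extraction. Below is a minimal runnable sketch of one plausible rule for a deterministic world, V(s) ← max(V(s), r + γ·V(s′)); the rule, the 2×4 grid, the goal cell, and the reward of 100 for entering G are all illustrative assumptions, not the slide's code.

```python
import random

GAMMA = 0.9
GRID = [(row, col) for row in range(2) for col in range(4)]   # tiny 2x4 grid, made up for illustration
GOAL = (0, 3)                                                 # hypothetical goal cell G
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]                  # up, down, left, right

def step(state, action):
    """Deterministic dynamics: move if the target cell exists, otherwise stay put."""
    target = (state[0] + action[0], state[1] + action[1])
    nxt = target if target in GRID else state
    reward = 100.0 if nxt == GOAL and state != GOAL else 0.0   # assumed: 100 for entering G, else 0
    return nxt, reward

V = {s: 0.0 for s in GRID}            # initialize estimate V(s) = 0 for all s
s = random.choice(GRID)
for _ in range(10_000):               # roam randomly; reborn after reaching G
    if s == GOAL:
        s = random.choice(GRID)
        continue
    a = random.choice(ACTIONS)
    s2, r = step(s, a)
    V[s] = max(V[s], r + GAMMA * V[s2])   # assumed update rule (not from the slide)
    s = s2
```

Under this assumed rule the estimate never decreases, stays within [0, 100], and approaches V*(s) from below as random roaming eventually backs up the goal reward along every path, which is one way to reason about the true/false questions above.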

Now, consider probabilistic actions, rewards, and policies


Policy Evaluation

Policy evaluation: for a given policy π, compute the state value function

  Vπ(s) = E_π[ r_t + γ r_{t+1} + γ² r_{t+2} + … | s_t = s ]

where Vπ is implicitly given by the Bellman expectation equation

  Vπ(s) = Σ_a π(a|s) [ r(s,a) + γ Σ_{s′} T(s′|s,a) Vπ(s′) ],

a system of |S| simultaneous equations.

MDPs to MRPs

An MDP under a fixed policy π becomes a Markov Reward Process (MRP) with

  rπ(s) = Σ_a π(a|s) r(s,a)   and   Tπ(s′|s) = Σ_a π(a|s) T(s′|s,a).
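A minimal sketch of this collapse, assuming the array conventions from the earlier MDP sketch (T of shape (S, A, S), r of shape (S, A)) and a policy given as a row-stochastic matrix pi of shape (S, A); the names are illustrative.

```python
import numpy as np

def induced_mrp(T, r, pi):
    """Collapse an MDP under a fixed policy pi into an MRP (T_pi, r_pi).

    T_pi[s, s2] = sum_a pi[s, a] * T[s, a, s2]
    r_pi[s]     = sum_a pi[s, a] * r[s, a]
    """
    T_pi = np.einsum("sa,saz->sz", pi, T)   # policy-averaged transition matrix
    r_pi = np.einsum("sa,sa->s", pi, r)     # policy-averaged reward vector
    return T_pi, r_pi
```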


Backup Diagram (MDP)


Matrix Form

The Bellman expectation equation can be written concisely using the induced MRP form

  vπ = rπ + γ Tπ vπ

with direct solution

  vπ = (I − γ Tπ)⁻¹ rπ

  • O(|S|³) complexity for the matrix inversion

Here Tπ is an |S|×|S| matrix whose (j,k) entry gives P(sk | sj, a = π(sj)),
rπ is an |S|-dimensional vector whose jth entry gives E[r | sj, a = π(sj)],
and vπ is an |S|-dimensional vector whose jth entry gives Vπ(sj),
where |S| is the number of distinct states.
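For illustration, the direct solution is a single linear solve (cheaper and numerically safer than forming the inverse explicitly); this continues the same assumed array conventions rather than any course-provided code.

```python
import numpy as np

def policy_evaluation_direct(T_pi, r_pi, gamma):
    """Solve v_pi = r_pi + gamma * T_pi @ v_pi, i.e. (I - gamma * T_pi) v_pi = r_pi.

    Needs gamma < 1 (or an episodic structure that keeps the system nonsingular).
    Cost is O(|S|^3), which is why iterative methods are preferred for large state spaces.
    """
    n = T_pi.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * T_pi, r_pi)
```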

Iterative Methods: Recall the Bellman Equation


Iterative Methods: Backup Operation

Given the value function estimate at iteration k, we back it up to obtain the estimate at iteration k+1:

  Vk+1(s) = Σ_a π(a|s) [ r(s,a) + γ Σ_{s′} T(s′|s,a) Vk(s′) ]

Iterative Methods: Sweep

A sweep consists of applying the backup operation to all the states in S. Iterative policy evaluation applies the backup operator sweep after sweep until the value function converges.
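A minimal sketch of iterative policy evaluation with full synchronous sweeps, again under the assumed array conventions (T of shape (S, A, S), r of shape (S, A), pi of shape (S, A)); the stopping threshold theta is an illustrative choice.

```python
import numpy as np

def iterative_policy_evaluation(T, r, pi, gamma, theta=1e-8):
    """Apply the Bellman expectation backup to every state, sweep after sweep, until convergence."""
    V = np.zeros(T.shape[0])                      # V_0 = 0
    while True:
        # one sweep: back up every state from the current estimate V_k
        q = r + gamma * (T @ V)                   # q[s, a] = r(s,a) + gamma * sum_s' T(s'|s,a) V_k(s')
        V_new = np.sum(pi * q, axis=1)            # average over the policy's action probabilities
        if np.max(np.abs(V_new - V)) < theta:     # stop once a sweep barely changes V
            return V_new
        V = V_new
```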


A Small-Grid World

  • An undiscounted episodic task (γ = 1)
  • Nonterminal states: 1, 2, …, 14
  • Terminal states: two, shown in shaded squares
  • Actions that would take the agent off the grid leave the state unchanged
  • Reward is −1 on every transition until the terminal state is reached
  • Policy π: equiprobable random action

Iterative Policy Evaluation

Iterative policy evaluation for the random policy on the small grid world; a runnable sketch follows below.