

SLIDE 1

Anatomy of an RL agent: model, policy, value function

Robert Platt Northeastern University

SLIDE 2

Running example: gridworld

Gridworld:
  • agent lives on a grid
  • always occupies a single cell
  • can move left, right, up, down
  • gets zero reward unless in the "+1" or "-1" cells
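To make the setup concrete, here is a minimal sketch of such a gridworld in Python. The grid size and the positions of the "+1" and "-1" cells are assumptions for illustration; the slides show them only in a figure.

```python
# Minimal gridworld sketch. Grid size and reward-cell positions are
# illustrative assumptions; the slides specify them only in a figure.
ROWS, COLS = 3, 4
REWARD_CELLS = {(0, 3): +1.0, (1, 3): -1.0}   # the "+1" and "-1" cells
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Move one cell in the chosen direction; stay put at the grid edge."""
    r, c = state
    dr, dc = MOVES[action]
    nxt = (min(max(r + dr, 0), ROWS - 1), min(max(c + dc, 0), COLS - 1))
    return nxt, REWARD_CELLS.get(nxt, 0.0)   # zero reward everywhere else
```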

SLIDE 3

States and actions

State set: $\mathcal{S}$, one state per grid cell
Action set: $\mathcal{A} = \{\text{left}, \text{right}, \text{up}, \text{down}\}$

SLIDE 4

Reward function

Reward function: $r(s, a) = +1$ in the "+1" cell, $-1$ in the "-1" cell
Otherwise: $r(s, a) = 0$

SLIDE 5

Reward function

Reward function: $r(s, a) = +1$ in the "+1" cell, $-1$ in the "-1" cell
Otherwise: $r(s, a) = 0$
In general: $r(s, a) = E[R_t \mid S_t = s, A_t = a]$

SLIDE 6

Reward function

Reward function: $r(s, a) = +1$ in the "+1" cell, $-1$ in the "-1" cell
Otherwise: $r(s, a) = 0$
In general: $r(s, a) = E[R_t \mid S_t = s, A_t = a]$

This is the expected reward on this time step, given that the agent takes action $a$ from state $s$.
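For a stochastic reward, this expectation is just a probability-weighted sum over outcomes. The reward distribution below is a made-up example, not from the slides:

```python
# E[R_t | S_t = s, A_t = a] as a probability-weighted sum over outcomes.
# The reward distribution here is hypothetical.
reward_dist = {+1.0: 0.8, 0.0: 0.2}   # P(R_t = r | s, a)
r_sa = sum(r * p for r, p in reward_dist.items())
print(r_sa)                            # 0.8
```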

SLIDE 7

Agent Model

Transition model: $p(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$
For example: $p(s' \mid s, \text{right})$ is the probability of ending up in cell $s'$ after taking the "right" action from cell $s$.

SLIDE 8

Agent Model

Transition model: $p(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$

This entire probability distribution can be written as a table over (state, action, next state), where each entry is the probability of that transition.
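As a sketch, this table can be stored as a 3-D array indexed by (state, action, next state); the state and action counts below are assumptions:

```python
import numpy as np

# Transition model as a table: T[s, a, s2] = p(s2 | s, a).
# The state/action counts are illustrative assumptions.
n_states, n_actions = 12, 4
T = np.zeros((n_states, n_actions, n_states))
T[0, 0, 1] = 1.0                        # e.g. action 0 in state 0 reaches state 1
assert np.isclose(T[0, 0].sum(), 1.0)   # each (s, a) row is a distribution
```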

SLIDE 9

Agent Model: Summary

State set: $\mathcal{S}$
Action set: $\mathcal{A}$
Reward function: $r(s, a) = E[R_t \mid S_t = s, A_t = a]$
Transition model: $p(s' \mid s, a)$
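One way to bundle these ingredients in code (a sketch, with array shapes chosen to match the table view above):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDP:
    """Tabular agent model: reward table R and transition table T."""
    n_states: int
    n_actions: int
    R: np.ndarray   # shape (n_states, n_actions): R[s, a] = r(s, a)
    T: np.ndarray   # shape (n_states, n_actions, n_states): T[s, a, s2] = p(s2 | s, a)
```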

SLIDE 10

Agent Model: Frozen Lake Example

State set: the 16 cells of the 4x4 grid
Action set: {left, down, right, up}
Reward function: $r(s) = 1$ if $s$ is the goal cell; $r(s) = 0$ otherwise
Transition model: only a one-third chance of going in the specified direction
  • one-third chance of moving +90 degrees
  • one-third chance of moving -90 degrees

(Figure: the Frozen Lake 4x4 grid.)
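Assuming the Gymnasium toy-text API, this slippery transition model can be inspected directly; `env.unwrapped.P[s][a]` lists (probability, next state, reward, terminated) tuples:

```python
import gymnasium as gym

# Frozen Lake with slippery ice: the agent moves in the chosen direction
# with probability 1/3 and perpendicular (+90/-90 degrees) with 1/3 each.
env = gym.make("FrozenLake-v1", is_slippery=True)

# Tabular transition model: (prob, next_state, reward, terminated) tuples.
for prob, next_state, reward, done in env.unwrapped.P[0][2]:  # state 0, action right
    print(prob, next_state, reward, done)
```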

SLIDE 11

Agent Model: Recycling Robot Example

Example 3.4 in Sutton & Barto (SB), 2nd edition: a recycling robot with battery levels {high, low} chooses among searching for cans, waiting, and recharging; searching risks depleting the battery.

SLIDE 12

Policy

A policy is a rule for selecting actions: $\pi(s) = a$ (if the agent is in this state, then take this action).


SLIDE 14

Policy

A policy is a rule for selecting actions: $\pi(s) = a$ (if the agent is in this state, then take this action). A policy can also be stochastic: $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$. Both forms are sketched below.
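A sketch of both forms as lookup tables; the state and action counts are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 12, 4

# Deterministic policy: pi(s) = a, one action per state.
pi_det = rng.integers(n_actions, size=n_states)

# Stochastic policy: pi(a | s), each row a distribution over actions.
pi_stoch = rng.random((n_states, n_actions))
pi_stoch /= pi_stoch.sum(axis=1, keepdims=True)

def act(s):
    return rng.choice(n_actions, p=pi_stoch[s])   # sample a ~ pi(. | s)
```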

SLIDE 15

Episodic vs Continuing Process

Episodic process: execution ends at some point and starts over,
  • after a fixed number of time steps, or
  • upon reaching a terminal state

Example of an episodic task: execution ends upon reaching the terminal state OR after 15 time steps (sketched below). (Figure: gridworld with a terminal state.)
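A sketch of one episode with both stopping conditions from the slide; `step`, `policy`, and `TERMINAL` are assumed helpers (e.g. from the gridworld sketch earlier):

```python
# One episode: stop on reaching a terminal state OR after 15 time steps.
# step(s, a), policy(s), and TERMINAL are assumed to be defined elsewhere.
def run_episode(start, max_steps=15):
    s, rewards = start, []
    for _ in range(max_steps):
        s, r = step(s, policy(s))
        rewards.append(r)
        if s in TERMINAL:          # terminal state ends the episode early
            break
    return rewards
```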

SLIDE 16

Episodic vs Continuing Process

Continuing process: execution goes on forever,
  • the process doesn't stop
  • the agent keeps getting rewards

(Figure: example of a continuing task.)

SLIDE 17

Value Function

$V^\pi(s)$: the value of state $s$ when acting according to policy $\pi$, i.e. the expected discounted future reward starting at state $s$ and acting according to policy $\pi$.

SLIDE 18

Value Function

$V^\pi(s)$: the expected discounted future reward starting at state $s$ and acting according to policy $\pi$. This function is called the value function.

SLIDE 19

Value Function

$V^\pi(s)$: the expected discounted future reward starting at state $s$ and acting according to policy $\pi$, the value function.

Why we care about the value function: because it helps us calculate a good policy; we'll see how shortly.


SLIDE 21

Value Function

A first attempt at a definition: the expected future reward starting at state $s$ and acting according to policy $\pi$,

$V^\pi(s) = E[\, R_0 + R_1 + R_2 + \cdots \mid S_0 = s, \pi \,]$

What's wrong with this? In a continuing process the sum never terminates, so it can be infinite.

SLIDE 22

Value Function

Two viable alternatives:
  • 1. maximize expected future reward over the next T timesteps (finite horizon): $V^\pi(s) = E[\sum_{t=0}^{T-1} R_t \mid S_0 = s, \pi]$
  • 2. maximize expected discounted future rewards: $V^\pi(s) = E[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi]$
SLIDE 23

Value Function

Two viable alternatives:
  • 1. maximize expected future reward over the next T timesteps (finite horizon): $V^\pi(s) = E[\sum_{t=0}^{T-1} R_t \mid S_0 = s, \pi]$
  • 2. maximize expected discounted future rewards: $V^\pi(s) = E[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi]$

$\gamma \in [0, 1)$ is the discount factor; 0.9 is a typical value.

SLIDE 24

Value Function

The discounted form is the standard formulation for the value function:

$V^\pi(s) = E[\sum_{t=0}^{\infty} \gamma^t R_t \mid S_0 = s, \pi]$

Notice this is a function over state. A sketch of computing it for a tabular policy follows below.
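A minimal sketch, assuming the tabular `R` and `T` arrays from earlier: for a fixed deterministic policy, the discounted value function solves the linear system $V = r_\pi + \gamma P_\pi V$.

```python
import numpy as np

# Exact policy evaluation for a tabular MDP: solve V = r_pi + gamma * P_pi @ V,
# i.e. V = (I - gamma * P_pi)^(-1) r_pi. R, T, pi use the shapes sketched above.
def evaluate_policy(T, R, pi, gamma=0.9):
    n = T.shape[0]
    P_pi = T[np.arange(n), pi]   # P_pi[s, s2] = p(s2 | s, pi(s))
    r_pi = R[np.arange(n), pi]   # r_pi[s] = r(s, pi(s))
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
```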

SLIDE 25

Optimal policy

Why we care about the value function: because $V^\pi$ can be used to calculate a good policy (a sketch of extracting such a policy follows below).

$V^\pi(s)$: the value of state $s$ when acting according to policy $\pi$, the expected discounted future reward starting at state $s$ and acting according to policy $\pi$.
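One standard way a value function yields a policy is to act greedily with respect to it. This is a sketch using the tabular arrays assumed earlier, not necessarily the exact construction on the slide:

```python
import numpy as np

# Greedy policy from a value function:
# pi(s) = argmax_a [ r(s, a) + gamma * sum_s2 p(s2 | s, a) * V(s2) ]
def greedy_policy(T, R, V, gamma=0.9):
    Q = R + gamma * (T @ V)   # Q[s, a]: one-step lookahead value
    return Q.argmax(axis=1)
```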

SLIDE 26

Value function example 1

Policy: (shown as arrows in the figure)
Discount factor: $\gamma = 0.9$
Value fn (per cell, from the figure): 10, 9, 8.1, 7.3, 6.6, 6.9
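The listed values along the path are consistent with $\gamma = 0.9$: each step away from the reward multiplies the value by $\gamma$. A quick check of the straight-line cells:

```python
# Cells k steps from the highest-value cell are worth 10 * 0.9**k,
# matching the figure's 10, 9, 8.1, 7.3, 6.6 after rounding.
gamma = 0.9
print([round(10 * gamma**k, 1) for k in range(5)])   # [10.0, 9.0, 8.1, 7.3, 6.6]
```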

SLIDE 27

Value function example 1

Policy: (as above) Discount factor: $\gamma = 0.9$ Value fn (per cell): 10, 9, 8.1, 7.3, 6.6, 6.9

Notice that the value function can help us compare two different policies. How? One policy is at least as good as another if its value is at least as high in every state.

SLIDE 28

Value function example 1

Policy: (shown in the figure)
Discount factor: (shown in the figure)
Value fn (per cell, from the figure): 10.66, 0.66, 0.73, 0.81, 0.9, 1

SLIDE 29

Value function example 1

Policy: (shown in the figure) Discount factor: $\gamma = 0.9$
Value fn (per cell, from the figure): 10, 9, 8.1, 7.3, 6.6, 6.9

SLIDE 30

Value function example 2

Policy: (shown in the figure)
Discount factor: (shown in the figure)
Value fn (per cell, from the figure): 10, 10, 10, 10, 10, 11

SLIDE 31

Value function example 3

Policy: (shown in the figure)
Discount factor: (shown in the figure)
Value fn (per cell, from the figure): 10, 9, 8, 7, 6, 7