SLIDE 1
Anatomy of an RL agent: model, policy, value function
Robert Platt, Northeastern University
Running example: gridworld
Gridworld:
– the agent lives on a grid
– it always occupies a single cell
– it can move left, right, up, or down
– it gets zero reward except upon reaching the goal
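As a minimal sketch in Python (the grid size, coordinate convention, and `GOAL` cell are illustrative assumptions, not taken from the slides), the gridworld states and actions can be represented directly:

```python
# Minimal gridworld sketch: a state is a (row, col) cell; four move actions.
ROWS, COLS = 3, 4          # illustrative grid size (assumption)
GOAL = (0, 3)              # illustrative goal cell (assumption)

STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]
ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

def step(state, action):
    """Deterministic move: shift one cell; bumping a wall leaves the agent in place."""
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state

print(step((0, 2), "right"))  # (0, 3): the agent moves into the goal cell
```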
SLIDE 2
SLIDE 3
States and actions
State set: $\mathcal{S}$ = the set of grid cells
Action set: $\mathcal{A} = \{\text{left}, \text{right}, \text{up}, \text{down}\}$
SLIDE 4
Reward function
Reward function: $r(s,a) = 10$ if taking action $a$ from state $s$ reaches the goal cell
Otherwise: $r(s,a) = 0$
SLIDE 5
Reward function
Reward function: $r(s,a) = 10$ if taking action $a$ from state $s$ reaches the goal cell; otherwise $r(s,a) = 0$
In general: $r(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
SLIDE 6
Reward function
Reward function: $r(s,a) = 10$ if taking action $a$ from state $s$ reaches the goal cell; otherwise $r(s,a) = 0$
In general: $r(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
– the expected reward on this time step, given that the agent takes action $a$ from state $s$
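As a sketch, the gridworld reward can be coded as a plain function. The +10 goal reward is an assumption inferred from the value-function example later in the deck:

```python
GOAL = (0, 3)        # illustrative goal cell (assumption)
GOAL_REWARD = 10.0   # assumed; inferred from the value-function example later on

def reward(state, action, next_state):
    """With deterministic moves this equals r(s, a) = E[R | s, a]:
    +10 for entering the goal cell, 0 otherwise."""
    return GOAL_REWARD if next_state == GOAL else 0.0

print(reward((0, 2), "right", (0, 3)))  # 10.0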
SLIDE 7
Agent Model
Transition model: $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
For example: $T(s, \text{left}, s')$ is the probability that the agent ends up in cell $s'$ after taking action left in cell $s$.
SLIDE 8
Agent Model
Transition model: $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
– this entire probability distribution can be written as a table over (state, action, next state); each entry is the probability of that transition
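A sketch of that table in Python, as a dictionary mapping (state, action) to a distribution over next states (state labels and probabilities here are illustrative):

```python
# Transition model as a table over (state, action, next state).
# T[(s, a)] maps each next state s' to P(s' | s, a).
T = {
    ("s1", "right"): {"s2": 1.0},             # a deterministic transition
    ("s1", "up"):    {"s1": 0.8, "s3": 0.2},  # an illustrative noisy transition
}

def transition_prob(s, a, s_next):
    """P(s' | s, a); transitions absent from the table have probability zero."""
    return T.get((s, a), {}).get(s_next, 0.0)

# Sanity check: every (s, a) row must sum to 1 over next states.
for dist in T.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9

print(transition_prob("s1", "up", "s3"))  # 0.2
```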
SLIDE 9
Agent Model: Summary
State set: $\mathcal{S}$
Action set: $\mathcal{A}$
Reward function: $r(s,a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$
Transition model: $T(s, a, s') = P(S_{t+1} = s' \mid S_t = s, A_t = a)$
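Bundling the four pieces into one model object is a natural way to code this summary; a sketch (the container choice is an assumption, not something the slides prescribe):

```python
from typing import Callable, Dict, Hashable, NamedTuple, Tuple

State = Hashable
Action = Hashable

class MDPModel(NamedTuple):
    """The agent's model: state set, action set, reward function, transition model."""
    states: Tuple[State, ...]
    actions: Tuple[Action, ...]
    reward: Callable[[State, Action], float]                    # r(s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # P(s' | s, a)
```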
SLIDE 10
Agent Model: Frozen Lake Example
State set: the 16 cells of the 4x4 grid
Action set: $\{\text{left}, \text{right}, \text{up}, \text{down}\}$
Reward function: $r(s,a) = 1$ if the action reaches the goal cell
Otherwise: $r(s,a) = 0$
Transition model: only a one-third chance of going in the specified direction
– one-third chance of moving +90°
– one-third chance of moving −90°
Frozen Lake is this 4x4 grid
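A sketch of the slippery transition rule in Python: the intended direction happens with probability 1/3, and each perpendicular (±90°) direction with probability 1/3 (direction names follow the slides; the encoding is an assumption). Gym's FrozenLake-v1 implements the same rule when `is_slippery=True`.

```python
# Frozen Lake slip rule: intended direction with prob 1/3,
# plus the two perpendicular directions with prob 1/3 each.
PERPENDICULAR = {
    "left": ("up", "down"), "right": ("up", "down"),
    "up": ("left", "right"), "down": ("left", "right"),
}

def slip_distribution(intended):
    """Return {actual move direction: probability} for an intended action."""
    p1, p2 = PERPENDICULAR[intended]
    return {intended: 1 / 3, p1: 1 / 3, p2: 1 / 3}

print(slip_distribution("up"))  # {'up': 1/3, 'left': 1/3, 'right': 1/3}
```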
SLIDE 11
Agent Model: Recycling Robot Example
Example 3.4 in Sutton & Barto (SB), 2nd Ed.
SLIDE 12
Policy
A policy $\pi$ is a rule for selecting actions: $\pi(s) = a$ means "if the agent is in state $s$, then take action $a$."
SLIDE 13
SLIDE 14
Policy
A policy can also be stochastic: $\pi(a \mid s)$ = the probability of taking action $a$ in state $s$.
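Both kinds of policy are easy to sketch in Python: a deterministic policy as a state-to-action table, a stochastic policy as per-state action probabilities (all states, actions, and probabilities below are illustrative):

```python
import random

# Deterministic policy: pi(s) = a
pi_det = {"s1": "right", "s2": "up"}

# Stochastic policy: pi(a | s) = probability of taking action a in state s
pi_stoch = {
    "s1": {"right": 0.9, "up": 0.1},
    "s2": {"up": 1.0},
}

def sample_action(pi, s):
    """Draw an action a ~ pi(. | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs)[0]

print(pi_det["s1"])                   # 'right'
print(sample_action(pi_stoch, "s1"))  # 'right' about 90% of the time
```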
SLIDE 15
Episodic vs Continuing Process
Episodic process: execution ends at some point and starts over
– after a fixed number of time steps, or
– upon reaching a terminal state
Example of an episodic task: execution ends upon reaching the terminal state OR after 15 time steps
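A sketch of that episodic loop (the environment and policy arguments are caller-supplied stand-ins; the 15-step cap matches the example above):

```python
MAX_STEPS = 15  # the episode also ends after this many steps, per the example

def run_episode(reset, step, policy, is_terminal):
    """Run one episode: stops on a terminal state OR after MAX_STEPS steps.
    reset() -> initial state; step(s, a) -> (next state, reward);
    policy(s) -> action; is_terminal(s) -> bool. All are stand-ins."""
    state = reset()
    rewards = []
    for _ in range(MAX_STEPS):
        action = policy(state)
        state, r = step(state, action)
        rewards.append(r)
        if is_terminal(state):
            break
    return rewards
```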
SLIDE 16
Episodic vs Continuing Process
Continuing process: execution goes on forever
– the process doesn’t stop; the agent keeps getting rewards
[figure: example of a continuing task]
SLIDE 17
Value Function
$V^\pi(s)$ = the value of state $s$ when acting according to policy $\pi$: the expected discounted future reward starting at state $s$ and acting according to $\pi$.
SLIDE 18
Value Function
This expectation, $V^\pi(s)$, is called the Value Function.
SLIDE 19
Value Function
Why we care about the value function: because it helps us calculate a good policy – we’ll see how shortly.
SLIDE 20
Value Function
A first attempt – just sum all future rewards: $V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + R_{t+2} + R_{t+3} + \cdots \mid S_t = s]$
SLIDE 21
Value Function
$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + R_{t+2} + R_{t+3} + \cdots \mid S_t = s]$
What’s wrong with this? For a continuing task, this sum can be infinite.
SLIDE 22
Value Function
Two viable alternatives (see the sketch after this list):
1. maximize expected future reward over the next T time steps (finite horizon): $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{T-1} R_{t+k+1} \,\middle|\, S_t = s\right]$
2. maximize expected discounted future rewards: $V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
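The two alternatives differ only in how a reward sequence is summed; a quick sketch with an illustrative reward sequence:

```python
def finite_horizon_return(rewards, T):
    """Alternative 1: sum of rewards over the next T time steps."""
    return sum(rewards[:T])

def discounted_return(rewards, gamma):
    """Alternative 2: sum of gamma**k * R_{t+k+1}; stays finite for gamma < 1."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                                  # illustrative sequence
print(finite_horizon_return(rewards, T=2))               # 0: reward lies past the horizon
print(round(discounted_return(rewards, gamma=0.9), 2))   # 7.29 = 0.9**3 * 10
```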
SLIDE 23
Value Function
Discount factor $\gamma \in [0, 1)$ – 0.9 is a typical value.
SLIDE 24
Value Function
Standard formulation for the value function – notice this is a function over states:
$V^\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
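Because $V^\pi(s)$ is an expectation, one way to approximate it is to average discounted returns over simulated rollouts; a sketch (the `policy` and `step_fn` arguments are stand-ins, and the infinite sum is truncated at a finite horizon):

```python
def mc_value_estimate(s0, policy, step_fn, gamma=0.9, n_rollouts=1000, horizon=100):
    """Monte Carlo estimate of V^pi(s0) = E[ sum_k gamma**k * R_{t+k+1} | S_t = s0 ].
    policy(s) -> action; step_fn(s, a) -> (next state, reward). Both are stand-ins."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):      # truncation: gamma**100 is negligible
            a = policy(s)
            s, r = step_fn(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_rollouts
```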
SLIDE 25
Optimal policy
Why we care about the value function: because it can be used to calculate a good policy.
$V^\pi(s)$ = the value of state $s$ when acting according to policy $\pi$: the expected discounted future reward starting at $s$ and following $\pi$.
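One concrete way a value function yields a good policy is one-step greedy lookahead (the policy-improvement step in Sutton & Barto); a sketch using the tabular reward and transition formats from the earlier sketches:

```python
def greedy_policy(V, states, actions, reward, T, gamma=0.9):
    """pi(s) = argmax_a [ r(s, a) + gamma * sum_{s'} P(s'|s,a) * V(s') ].
    V: dict state -> value; reward(s, a) -> float;
    T[(s, a)]: dict next state -> probability (as in the earlier table sketch)."""
    pi = {}
    for s in states:
        def lookahead(a, s=s):  # one-step lookahead value of taking a in s
            return reward(s, a) + gamma * sum(
                p * V[s2] for s2, p in T.get((s, a), {}).items()
            )
        pi[s] = max(actions, key=lookahead)
    return pi
```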
SLIDE 26
Value function example 1
Policy:
Discount factor: $\gamma = 0.9$
Value fn (one value per cell):
10 9 8.1 7.3 6.6 6.9
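The pattern in these numbers is $V(s_k) = \gamma^k \cdot 10$ along the path to the goal; a one-line check with $\gamma = 0.9$, rounded to the slide's precision:

```python
gamma = 0.9
print([round(10 * gamma ** k, 1) for k in range(5)])
# [10.0, 9.0, 8.1, 7.3, 6.6] – the first five cell values on the slide
```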
SLIDE 27
Value function example 1
Notice that the value function can help us compare two different policies – how?
SLIDE 28
Value function example 1
Policy:
Discount factor: $\gamma = 0.9$
Value fn (one value per cell):
10.66 0.66 0.73 0.81 0.9 1
SLIDE 29
SLIDE 30
Value function example 2
Policy:
Discount factor:
Value fn (one value per cell):
10 10 10 10 10 11
SLIDE 31