SLIDE 1

N-step bootstrapping

Robert Platt, Northeastern University

I’d love to use my experiences more efficiently...

SLIDE 2

Motivation

Problem: standard Q-learning/SARSA propagates reward information only one state back per time step.
- n-step bootstrapping is one way to address this problem
- we will see other ways in subsequent slide decks

Left: path taken by the agent in a grid world; it gets zero reward everywhere except the goal state, where it gets a positive reward. Middle: 1-step SARSA updates only the penultimate state/action pair.

SLIDE 3

TD and MC are two extremes of a continuum
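
The row of backup diagrams on this slide did not survive extraction. Assuming the standard Sutton & Barto notation, what it conveys is that update targets form a spectrum, from the 1-step TD return through intermediate n-step returns out to the full Monte Carlo return:

    G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1})                                              (1-step: TD)
    G_{t:t+2} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})                           (2-step)
    ...
    G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})   (n-step)
    ...
    G_t = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{T-t-1} R_T                            (Monte Carlo)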

SLIDE 4

TD and MC are two extremes of a continuum

What are the methods in between these two extremes?

SLIDE 6

TD and MC are two extremes of a continuum

Update equation:
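
The equation itself was an image; presumably it is the 1-step TD update (Sutton & Barto notation assumed):

    V(S_t) \leftarrow V(S_t) + \alpha \, [ G_{t:t+1} - V(S_t) ],   where   G_{t:t+1} = R_{t+1} + \gamma V(S_{t+1})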

SLIDE 7

TD and MC are two extremes of a continuum

Update equation: same form as on the previous slide; the return that the old estimate is moved toward is called the target of the update.

SLIDE 8

TD and MC are two extremes of a continuum

What’s the target for this one? Update equation:
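
The target being asked about was shown as an image; for an intermediate diagram in the spectrum it is the n-step return (notation assumed): truncate the reward sum after n steps and bootstrap from the current value estimate of the state reached:

    G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})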

SLIDE 9

TD and MC are two extremes of a continuum

Complete update equation: What’s the target for this one?
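
The complete update equation was likewise an image; presumably the n-step TD update (notation assumed):

    V_{t+n}(S_t) = V_{t+n-1}(S_t) + \alpha \, [ G_{t:t+n} - V_{t+n-1}(S_t) ],   for 0 <= t < T

with V_{t+n}(s) = V_{t+n-1}(s) for all s != S_t, i.e. every other value estimate is left unchanged.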

SLIDE 12

TD and MC are two extremes of a continuum

Complete update equation: What’s the target for this one?

Notice that you can't do this update until time step t+3.
- the TD update happens on the next time step
- the MC update happens at the end of the episode
- the n-step TD update for the state visited at time t happens at time step t+n
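
To ground the t+3 remark, the 3-step case (notation assumed) is

    G_{t:t+3} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \gamma^3 V_{t+2}(S_{t+3})

which needs R_{t+3} and S_{t+3}; those are only observed after the action taken at time t+2, i.e. at time step t+3, and only then can V(S_t) be updated.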

SLIDE 13

How well does this work?

This comparison is for:
- a 19-state random walk with a random policy
- n-step TD policy evaluation
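
The plot itself (RMS value error versus step size \alpha for several values of n, as in the corresponding figure in Sutton & Barto's n-step chapter) is not reproduced here. Below is a minimal Python sketch of the environment being evaluated; the conventions are assumptions (19 nonterminal states, episodes start in the center, reward -1 for stepping off the left end, +1 off the right end, 0 otherwise):

    import random

    N_STATES = 19                       # nonterminal states are 1..19

    def reset():
        # every episode starts in the center state
        return (N_STATES + 1) // 2

    def step(state):
        # equiprobable random-walk policy: move left or right with prob 1/2
        next_state = state + random.choice((-1, 1))
        if next_state == 0:             # stepped off the left end
            return -1.0, None           # reward -1, terminal
        if next_state == N_STATES + 1:  # stepped off the right end
            return 1.0, None            # reward +1, terminal
        return 0.0, next_state          # zero reward, episode continues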

SLIDE 14

n-step TD algorithm
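
The slide's pseudocode did not survive extraction; below is a sketch in Python following the standard n-step TD pseudocode for estimating V under a fixed policy. The function name and the step/reset interface are illustrative, matching the random-walk sketch above:

    def n_step_td(step, reset, n, alpha, gamma=1.0, num_episodes=100, num_states=19):
        # Tabular n-step TD policy evaluation.
        # step(s) -> (reward, next_state or None); reset() -> initial state.
        V = [0.0] * (num_states + 2)      # value table, indexed by state id
        for _ in range(num_episodes):
            S = [reset()]                 # S[t] = state visited at time t
            R = [0.0]                     # R[i] = reward R_i (R[0] unused)
            T = float('inf')              # episode length, unknown at first
            t = 0
            while True:
                if t < T:
                    r, s_next = step(S[t])
                    R.append(r)
                    if s_next is None:    # S_{t+1} is terminal
                        T = t + 1
                    else:
                        S.append(s_next)
                tau = t - n + 1           # time whose estimate gets updated
                if tau >= 0:
                    # n-step return: truncated reward sum plus bootstrap term
                    G = sum(gamma ** (i - tau - 1) * R[i]
                            for i in range(tau + 1, min(tau + n, T) + 1))
                    if tau + n < T:
                        G += gamma ** n * V[S[tau + n]]
                    V[S[tau]] += alpha * (G - V[S[tau]])
                if tau == T - 1:
                    break
                t += 1
        return V

    # e.g. V = n_step_td(step, reset, n=4, alpha=0.2)

Note the stored S and R lists: this is exactly the "can't update until t+n" point from the earlier slide; the update at time tau = t - n + 1 lags the agent's position by n steps.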

SLIDE 15

n-step SARSA

Same idea as in n-step TD.
- how is this backup diagram different from that of n-step TD?
- why is it different?

SLIDE 16

n-step SARSA

Same idea as in n-step TD.
- how is this backup diagram different from that of n-step TD?
- why is it different?
- why does the backup start with a dot rather than a circle?
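
The update itself did not survive extraction; in the standard formulation (assumed here, following Sutton & Barto), the n-step SARSA target and update mirror n-step TD with action values Q in place of V:

    G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^n Q_{t+n-1}(S_{t+n}, A_{t+n})

    Q_{t+n}(S_t, A_t) = Q_{t+n-1}(S_t, A_t) + \alpha \, [ G_{t:t+n} - Q_{t+n-1}(S_t, A_t) ]

As for the dot: in the usual backup-diagram convention, solid dots are state-action nodes and open circles are state nodes, and SARSA estimates action values, so the backup is rooted at an action node.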

SLIDE 17

n-step SARSA

Left: path taken by the agent in a grid world; it gets zero reward everywhere except the goal state, where it gets a positive reward. Middle: 1-step SARSA updates only the penultimate state/action pair. Right: 10-step SARSA updates the last 10 state/action pairs.