SLIDE 1

Efficient Planning

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

SLIDE 2

Tuesday class summary:

  • Planning: any computational process that uses a model to create or improve a policy
  • Dyna framework (a minimal sketch follows below)
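A minimal tabular Dyna-Q sketch, for reference (the environment interface env.reset()/env.step(a) and all hyperparameter values are illustrative assumptions, not from the slides):

import random
from collections import defaultdict

def dyna_q(env, actions, episodes=50, n_planning=5,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)   # Q[(s, a)]: action-value estimates
    model = {}               # model[(s, a)] = (r, s', done): deterministic model
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s2, done = env.step(a)
            # direct RL: one real sample backup
            target = r + gamma * max(Q[(s2, b)] for b in actions) * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # model learning
            model[(s, a)] = (r, s2, done)
            # planning: n simulated sample backups drawn from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps2, b)] for b in actions) * (not pdone)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q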

SLIDE 3

Questions during class:

  • “Why use simulated experience? Can’t you directly compute a solution based on the model?”
  • “Wouldn’t it be better to plan backwards from the goal?”

SLIDE 4

How to Achieve Efficient Planning?

  • What type of backup is better? Sample vs. full backups; incremental vs. less incremental backups
  • How to order the backups?

SLIDE 5

What is Efficient Planning?

Planning algorithm A is more efficient than planning algorithm B if:

  • it can compute the optimal policy (or value function) in less time, or
  • given the same amount of computation time, it improves the policy (or value function) more.

SLIDE 6

What backup type is best?

SLIDE 7

Full vs. Sample Backups

Value estimated | Full backup (DP)         | Sample backup (one-step TD)
----------------+--------------------------+----------------------------
vπ = Vπ(s)      | policy evaluation        | TD(0)
v* = V*(s)      | value iteration (max)    | (none)
qπ = Qπ(s,a)    | Q-policy evaluation      | Sarsa
q* = Q*(s,a)    | Q-value iteration (max)  | Q-learning (max)

[The original slide shows the corresponding one-step backup diagrams, each rooted at s or (s,a) and branching over actions a, successor states s', and rewards r.]
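As a concrete contrast, the two q* backups are (standard textbook forms; α is a step-size):

full backup (Q-value iteration):

    Q(s,a) \leftarrow \hat{r}(s,a) + \gamma \sum_{s'} p(s'|s,a) \max_{a'} Q(s',a')

sample backup (Q-learning), for one sampled reward r and successor s':

    Q(s,a) \leftarrow Q(s,a) + \alpha \big[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \big]

The full backup touches every successor once and leaves no sampling error; the sample backup costs O(1) per update but must be repeated to average out the noise.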

SLIDE 8

Full vs. Sample Backups

Setting: b equally likely successor states (branching factor); initial error = 1; all next states’ values are assumed correct.

[Figure: RMS error in the value estimate of Q(s₀,a₀) as a function of the number of computations (in units of b, from 0 to 2b), for branching factors b = 2, 10, 100, 1000, and 10,000, comparing sample backups against full backups.]
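The standard analysis behind this plot (assuming, as above, that the b successor values are correct): after t sample backups the expected error decays as

    \text{error} \;\propto\; \sqrt{\frac{b-1}{b\,t}}

so for large b most of the error is gone after a small fraction of the b computations that a single full backup requires.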

SLIDE 9

Small Backups

  • Small backups are single-successor backups based on the model
  • Small backups have the same computational complexity as sample backups
  • Small backups have no sampling error
  • Small backups require storage for ‘old’ values

SLIDE 10

Main Idea behind Small Backups

Consider an estimate A that is constructed from a weighted sum of estimates X_i:

    full backup:   A \leftarrow \sum_i w_i X_i

What can we do if we know that only a single successor, X_j, changed value since the last backup? Let x_j be the old value of X_j, used to construct the current value of A. The value A can then be updated for a single successor by adding the weighted difference between the new and the old value:

    small backup:  A \leftarrow A + w_j (X_j - x_j)
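A minimal sketch of this bookkeeping (names are illustrative; in planning, A plays the role of a value estimate and the weights are the model’s discounted transition probabilities):

def full_backup(w, X):
    # O(b): recompute the whole weighted sum
    return sum(wi * xi for wi, xi in zip(w, X))

def small_backup(A, w_j, x_new, x_old):
    # O(1): correct A for the single term j that changed,
    # using the stored 'old' value x_old
    return A + w_j * (x_new - x_old)

w = [0.5, 0.3, 0.2]
X = [1.0, 2.0, 4.0]
A = full_backup(w, X)                    # 1.9
x_old, X[1] = X[1], 3.0                  # only successor j = 1 changed
A = small_backup(A, w[1], X[1], x_old)   # 2.2, same as a fresh full backup
assert abs(A - full_backup(w, X)) < 1e-12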

SLIDE 11

Small vs. Sample Backups

[Figure: normalized RMS error as a function of step-size / step-size decay, comparing a sample backup (TD(0) with constant step-size α), a sample backup (TD(0) with decaying step-size), and a small backup. The test problems are small chains with random transitions and terminal rewards such as r_left = +1, r_right = -1 and r_left = +1, r_right = +1.]

SLIDE 12

Small vs. Sample Backups

[Figure: a three-state example (states A, B, C) showing the transition probabilities and the resulting values of states A and B.]

SLIDE 13

Backup Ordering

SLIDE 14

Backup Ordering

Asynchronous Value Iteration

Do forever:
  1) Select a state s ∈ S according to some selection strategy H
  2) Apply a full backup to s:

     V(s) \leftarrow \max_a \big[ \hat{r}(s,a) + \gamma \sum_{s'} p(s'|s,a) V(s') \big]

  • For every selection strategy H that selects each state infinitely often, the values V converge to the optimal value function V*
  • The rate of convergence depends strongly on the selection strategy H
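A compact sketch of this loop (the model representation P[s][a] = list of (probability, next state, reward) triples and the uniform-random selection strategy are illustrative assumptions):

import random

def async_value_iteration(P, gamma=0.95, n_backups=100000):
    # P[s][a] = list of (prob, next_state, reward) triples
    states = list(P.keys())
    V = {s: 0.0 for s in states}
    for _ in range(n_backups):
        s = random.choice(states)   # selection strategy H: uniform random
        # full backup of the selected state
        V[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
            for a in P[s]
        )
    return V

Uniform random selection visits every state infinitely often (with probability 1), so it satisfies the convergence condition; smarter orderings only change the rate of convergence.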

SLIDE 15

The Trade-Off

For any effective ordering strategy, the cost saved by performing fewer backups should outweigh the cost of maintaining the ordering:

    cost to maintain ordering  <  cost savings due to fewer backups

SLIDE 16

Prioritized Sweeping

Which states or state-action pairs should be generated during planning?

  • Work backwards from states whose values have just changed
  • Maintain a queue of state-action pairs whose values would change a lot if backed up, prioritized by the size of the change
  • When a new backup occurs, insert predecessors according to their priorities
  • Always perform backups from the first item in the queue (a minimal sketch follows below)

Moore & Atkeson 1993; Peng & Williams 1993; improved by McMahan & Gordon 2005; van Seijen 2013
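A minimal, simplified sketch of the planning loop over state values (the model interface, priority rule, and threshold θ are illustrative assumptions, not the exact Moore & Atkeson algorithm, which works with state-action pairs):

import heapq

def prioritized_sweeping(V, model, predecessors, queue, gamma=0.95,
                         theta=1e-4, n_backups=5):
    # model[s][a]     : list of (prob, next_state, reward) triples
    # predecessors[s] : states with a transition into s
    # queue           : heap of (-priority, state); highest priority popped first
    for _ in range(n_backups):
        if not queue:
            break
        _, s = heapq.heappop(queue)
        old = V[s]
        # full backup of the highest-priority state
        V[s] = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
                   for a in model[s])
        change = abs(V[s] - old)
        if change > theta:
            # crude priority bound: a predecessor's value can change by at
            # most gamma * change (duplicate heap entries are tolerated here)
            for sp in predecessors[s]:
                heapq.heappush(queue, (-gamma * change, sp))
    return V, queue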

SLIDE 17

Moore and Atkeson’s Prioritized Sweeping

Published in 1993.

SLIDE 18

Prioritized Sweeping vs. Dyna-Q

Both use n = 5 backups per environmental interaction.

SLIDE 19

Bellman Error Ordering

The Bellman error is a measure of the difference between the current value and the value after a full backup:

    BE(s) = \Big| V(s) - \max_a \big[ \hat{r}(s,a) + \gamma \sum_{s'} p(s'|s,a) V(s') \big] \Big|

SLIDE 20

Bellman Error Ordering

initialize V(s) arbitrarily for all s
compute BE(s) for all s
loop {until convergence}
    select the state s' with the largest Bellman error
    perform a full backup of s'
    BE(s') ← 0
    for all predecessor states s̄ of s' do
        recompute BE(s̄)
    end for
end loop

To get a positive trade-off:
    computation time of a Bellman error update << computation time of a full backup
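The selection step amounts to one Bellman-error evaluation per candidate state, e.g. (model interface as in the earlier sketches):

def bellman_error(V, model, s, gamma=0.95):
    # |V(s) - max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) V(s') ]|
    backed_up = max(sum(p * (r + gamma * V[s2]) for p, s2, r in model[s][a])
                    for a in model[s])
    return abs(V[s] - backed_up)

# greedy ordering: back up the state whose full backup would change V the most
s_next = max(V, key=lambda s: bellman_error(V, model, s))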
SLIDE 21

Prioritized Sweeping with Small Backups

initialize V(s) arbitrarily for all s
initialize U(s) = V(s) for all s
initialize Q(s,a) = V(s) for all s, a
initialize N(s,a) and N(s,a,s') to 0 for all s, a, s'
loop {over episodes}
    initialize s
    repeat {for each step in the episode}
        select action a, based on Q(s,·)
        take action a, observe r and s'
        N(s,a) ← N(s,a) + 1;  N(s,a,s') ← N(s,a,s') + 1
        Q(s,a) ← [ Q(s,a) (N(s,a) − 1) + r + γ V(s') ] / N(s,a)
        V(s) ← max_b Q(s,b)
        p ← |V(s) − U(s)|
        if s is on the queue, set its priority to p; otherwise, add it with priority p
        for a number of update cycles do
            remove the top state s̃ from the queue
            ΔU ← V(s̃) − U(s̃)
            U(s̃) ← V(s̃)
            for all (s̄, ā) pairs with N(s̄, ā, s̃) > 0 do
                Q(s̄, ā) ← Q(s̄, ā) + γ [ N(s̄, ā, s̃) / N(s̄, ā) ] ΔU    {small backup}
                V(s̄) ← max_b Q(s̄, b)
                p ← |V(s̄) − U(s̄)|
                if s̄ is on the queue, set its priority to p; otherwise, add it with priority p
            end for
        end for
        s ← s'
    until s is terminal
end loop
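A Python reading of the inner update cycle (data structures mirror the pseudocode; this is an illustrative sketch, not the authors’ reference implementation):

import heapq

def small_backup_cycles(V, U, Q, N_sa, N_sas, preds, actions, queue,
                        gamma=0.95, n_cycles=5):
    # N_sa[(s,a)]     : visit count of (s, a)
    # N_sas[(s,a,s2)] : count of observed transitions (s, a) -> s2
    # preds[s2]       : set of (s, a) pairs with N_sas[(s, a, s2)] > 0
    # queue           : heap of (-priority, state)
    for _ in range(n_cycles):
        if not queue:
            break
        _, s2 = heapq.heappop(queue)
        dU = V[s2] - U[s2]    # how stale the 'published' value U(s2) is
        U[s2] = V[s2]
        for (s, a) in preds[s2]:
            # O(1) small backup: adjust Q by the weighted change of one successor
            Q[(s, a)] += gamma * (N_sas[(s, a, s2)] / N_sa[(s, a)]) * dU
            V[s] = max(Q[(s, b)] for b in actions[s])
            p = abs(V[s] - U[s])
            if p > 0:
                heapq.heappush(queue, (-p, s))  # duplicates tolerated in this sketch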

SLIDE 22

Empirical Comparison

[Figure: RMS error (averaged over the first 10^5 observations) versus computation time per observation in seconds. Curves: PS (Moore & Atkeson), PS (Peng & Williams), PS (Wiering & Schmidhuber), value iteration, PS with small backups, and the initial error as a reference line.]

SLIDE 23

Trajectory Sampling

  • Trajectory sampling: perform backups along simulated trajectories (a minimal sketch follows below)
  • This samples from the on-policy distribution
  • Advantages when function approximation is used (Chapter 8)
  • Focusing of computation: can cause vast uninteresting parts of the state space to be (usefully) ignored
  • [Diagram: initial states, states reachable under optimal control, irrelevant states]
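A minimal sketch of planning by trajectory sampling (the sample-model interface model(s, a) -> (reward, next_state, done), the ε-greedy simulated policy, and the step-size are illustrative assumptions):

import random

def trajectory_sampling(Q, model, start_states, actions, gamma=1.0,
                        epsilon=0.1, alpha=0.1, n_trajectories=100,
                        max_steps=100):
    # Q[(s, a)]: action-value estimates (e.g. a defaultdict(float))
    for _ in range(n_trajectories):
        s = random.choice(start_states)
        for _ in range(max_steps):
            # simulate the current epsilon-greedy policy, so backups
            # concentrate on states that policy actually visits
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(s, b)])
            r, s2, done = model(s, a)
            # one-step sample backup along the simulated trajectory
            target = r + gamma * (0.0 if done else max(Q[(s2, b)] for b in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if done:
                break
            s = s2
    return Q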

SLIDE 24

Trajectory Sampling Experiment

  • One-step full tabular backups
  • uniform: cycled through all state-action pairs
  • on-policy: backed up along simulated trajectories
  • 200 randomly generated undiscounted episodic tasks
  • 2 actions for each state, each with b equally likely next states
  • 0.1 probability of transition to the terminal state
  • expected reward on each transition drawn from a Gaussian with mean 0 and variance 1

(A task-generator sketch follows below.)
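For concreteness, one way to generate a task with this structure (a hypothetical sketch; the experiment’s exact construction may differ in detail):

import random

def make_random_task(n_states=1000, b=3, n_actions=2, p_term=0.1, seed=0):
    # each (state, action) has b equally likely next states, a 0.1
    # probability of terminating, and Gaussian(0, 1) expected rewards
    rng = random.Random(seed)
    model = {}
    for s in range(n_states):
        for a in range(n_actions):
            successors = rng.sample(range(n_states), b)
            rewards = [rng.gauss(0.0, 1.0) for _ in successors]
            model[(s, a)] = (successors, rewards, p_term)
    return model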

SLIDE 25

Heuristic Search

  • Used for action selection, not for changing a value function (= heuristic evaluation function)
  • Backed-up values are computed, but typically discarded
  • Extension of the idea of a greedy policy, only deeper
  • Also suggests ways to select states to back up: smart focusing (a toy sketch follows below)
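A toy sketch of the idea: compute depth-limited backed-up values, use them only to pick an action, then discard them (generic d-step lookahead with the model interface used above, not a specific algorithm from the slides):

def lookahead(V, model, s, depth, gamma=0.95):
    # backed-up value of s from a depth-limited search tree;
    # leaves are scored by the heuristic evaluation function V
    if depth == 0 or s not in model:
        return V[s]
    return max(sum(p * (r + gamma * lookahead(V, model, s2, depth - 1, gamma))
                   for p, s2, r in model[s][a])
               for a in model[s])

def select_action(V, model, s, depth=3, gamma=0.95):
    # greedy with respect to the deeper backed-up values;
    # the computed values are thrown away after the choice
    return max(model[s],
               key=lambda a: sum(p * (r + gamma * lookahead(V, model, s2,
                                                            depth - 1, gamma))
                                 for p, s2, r in model[s][a]))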

SLIDE 26

Summary

Efficient planning is about trying to spend the available computation time in the most effective way.

  • Backup types: full / sample / small
  • Backup ordering: gain/loss trade-off; prioritized sweeping; prioritized sweeping with small backups: Bellman error ordering
  • Trajectory sampling: backup along trajectories
  • Heuristic search
