SLIDE 1

Sample-Based Methods for Continuous Action Markov Decision Processes

Chris Mansley, Ari Weinstein, Michael Littman
Rutgers University

SLIDE 4

From Learning to Planning

Bellman Equation

  • Continuous state space: standard machine learning approaches to function approximation have proven successful!
  • Continuous action space: very little work addresses how to evaluate the maximum over actions (the equation is reconstructed below)
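For reference, here is the Bellman optimality equation the slide builds on, in its standard form; the slides rendered it as an image, so this is a reconstruction rather than a transcription:

```latex
V^{*}(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \int_{S} T(s' \mid s,a) \, V^{*}(s') \, \mathrm{d}s' \Big]
```

With a continuous action set $A$, even evaluating the $\max_{a \in A}$ on the right-hand side is itself a nontrivial optimization problem, and that gap is what the rest of the talk addresses.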

slide-5
SLIDE 5

Sparse Sampling

[Kearns, et al 1999]

  • An ε-optimal planning algorithm for discounted MDPs
  • Number of samples independent of state space size!
  • Requires too many samples! (a sketch of why follows the figure)

[Figure: a sparse-sampling lookahead tree. The root state S0 expands both actions (A1, A2), and each action node draws several sampled successor states (S1 ... S12), recursively.]
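To make the sample-cost tradeoff concrete, here is a minimal sketch of the sparse-sampling recursion. The generative-model interface `sample(s, a) -> (next_state, reward)` and the parameter values are illustrative assumptions, not from the slides:

```python
def sparse_sample_value(s, depth, actions, sample, C=10, gamma=0.95):
    """Estimate V*(s) with a depth-limited lookahead tree
    [Kearns et al. 1999].  Every (state, action) pair is resampled
    C times, so the tree has (len(actions) * C) ** depth leaves:
    independent of the size of the state space, but exponential in
    depth, which is the "too many samples" problem."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        total = 0.0
        for _ in range(C):
            s2, r = sample(s, a)             # draw a successor and reward
            total += r + gamma * sparse_sample_value(
                s2, depth - 1, actions, sample, C, gamma)
        best = max(best, total / C)          # Bellman backup of the max
    return best
```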

SLIDE 6

Can we use ideas from the exploration/exploitation problem to better direct our search?

SLIDE 7

UCB

[Auer, et al 2002]

  • An algorithm for efficient learning in the bandit domain
  • Fixed number of discrete actions with bounded support
  • Choose an arm greedily according to the following rule:

$\mu_i + \sqrt{\dfrac{2 \ln n}{n_i}}$

where $\mu_i$ is the empirical mean reward of arm $i$, $n_i$ is the number of times arm $i$ has been pulled, and $n$ is the total number of pulls so far.

[Figure: a one-state bandit with three arms, each labeled with its empirical mean and most recent reward.]
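A minimal, self-contained implementation of the UCB1 rule above; the Bernoulli test arms at the bottom are hypothetical, just to make the sketch runnable:

```python
import math
import random

class UCB1:
    """UCB1 [Auer et al. 2002]: pull the arm maximizing
    mean_i + sqrt(2 * ln(n) / n_i)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms    # n_i: pulls of arm i
        self.means = [0.0] * num_arms   # empirical mean reward of arm i
        self.total = 0                  # n: total pulls

    def select(self):
        # Pull every arm once before applying the confidence rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + math.sqrt(2 * math.log(self.total) / self.counts[i]))

    def update(self, arm, reward):
        self.total += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Hypothetical usage: three Bernoulli arms with unknown payoff rates.
bandit = UCB1(3)
arm_p = [0.2, 0.8, 0.5]
for _ in range(1000):
    a = bandit.select()
    bandit.update(a, 1.0 if random.random() < arm_p[a] else 0.0)
```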

SLIDE 8

UCT

[Kocsis, Szepesvári 2006]

  • Upper Confidence applied to Trees
  • Takes the UCB algorithm and extends it to the full MDP domain
  • Builds a tree similar to Sparse Sampling, but instead of doing a breadth-first search, performs a depth-first search directed by a UCB rule at each node

SLIDE 9

UCT, cont...

[Kocsis, Szepesvári 2006]

[Figure: three rounds of UCT tree growth. Each rollout starts at the root S0, descends the existing tree guided by UCB at every node, and adds new nodes along the way (Round 1, Round 2, Round 3).]
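A compact sketch of that loop, with the UCB rule inlined. Real UCT typically adds just one new node per rollout and estimates the remainder with a rollout policy; for brevity this version memoizes statistics for every (depth, state) pair it visits. The generative model `sample(s, a) -> (next_state, reward)` and hashable states are assumptions:

```python
import math
import random
from collections import defaultdict

def uct_plan(root, actions, sample, rollouts=1000, depth=20, gamma=0.95):
    """UCT sketch [Kocsis & Szepesvari 2006]: grow a search tree one
    rollout at a time, picking actions at each visited node by UCB."""
    N = defaultdict(int)    # visits of (depth, state)
    Na = defaultdict(int)   # visits of (depth, state, action)
    Q = defaultdict(float)  # mean return of (depth, state, action)

    def search(s, d):
        if d == depth:
            return 0.0
        untried = [a for a in actions if Na[(d, s, a)] == 0]
        if untried:
            a = random.choice(untried)       # expand before exploiting
        else:
            a = max(actions, key=lambda a: Q[(d, s, a)]
                    + math.sqrt(2 * math.log(N[(d, s)]) / Na[(d, s, a)]))
        s2, r = sample(s, a)
        ret = r + gamma * search(s2, d + 1)
        N[(d, s)] += 1                       # back up statistics
        Na[(d, s, a)] += 1
        Q[(d, s, a)] += (ret - Q[(d, s, a)]) / Na[(d, s, a)]
        return ret

    for _ in range(rollouts):
        search(root, 0)
    return max(actions, key=lambda a: Q[(0, root, a)])  # greedy root action
```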

SLIDE 10

HOO

[Bubeck, et al 2008]

  • UCT is still restricted to discrete states and actions
  • HOO (hierarchical optimistic optimization) provides similar guarantees to UCB in “well-behaved” continuous bandit problems
  • The idea is simple: divide the action space up hierarchically (similar to a KD-tree), keep track of returns within these volumes, and grant exploration bonuses for both the number of samples and the size of each subdivision

SLIDE 11

HOO, cont...

[Bubeck, et al 2008]

  • Choose an arm greedily with respect to the following:

$\mu_i + \sqrt{\dfrac{2 \ln n}{n_i}} + v_1 \rho^{h}$

  • Very similar to UCB except for the spatial term at the end: $v_1 \rho^{h}$ is an upper bound on $\mathrm{diam}(i)$, the diameter of node $i$'s cell at depth $h$ of the partition tree
  • The intuition is that arms in large volumes with few samples are unknown, while arms in small volumes with many samples are well known

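A sketch of HOO over a one-dimensional action space normalized to [0, 1]. The smoothness parameters `v1` and `rho` (from the bonus $v_1 \rho^h$) are illustrative defaults, and, as a simplification, B-values are backed up only along the path just played:

```python
import math
import random

class HOONode:
    """A cell [lo, hi) at depth h of HOO's partition tree."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.n = 0               # samples that fell in this cell
        self.mean = 0.0          # their empirical mean reward
        self.B = float('inf')    # optimistic value; inf until sampled
        self.children = None

class HOO:
    """Sketch of hierarchical optimistic optimization [Bubeck et al. 2008]."""
    def __init__(self, v1=1.0, rho=0.5):
        self.root = HOONode(0.0, 1.0, 0)
        self.v1, self.rho = v1, rho
        self.total = 0           # n: total pulls

    def select(self):
        """Descend by B-value to a leaf, split it, play a point inside it."""
        path, node = [], self.root
        while node.children:
            path.append(node)
            node = max(node.children, key=lambda c: c.B)
        path.append(node)
        mid = (node.lo + node.hi) / 2
        node.children = [HOONode(node.lo, mid, node.depth + 1),
                         HOONode(mid, node.hi, node.depth + 1)]
        self._path = path
        return random.uniform(node.lo, node.hi)

    def update(self, reward):
        """Credit the reward to every cell on the played path, then
        back up B = min(U, max of children's B) from leaf to root."""
        self.total += 1
        for node in self._path:
            node.n += 1
            node.mean += (reward - node.mean) / node.n
        for node in reversed(self._path):
            U = (node.mean
                 + math.sqrt(2 * math.log(self.total) / node.n)
                 + self.v1 * self.rho ** node.depth)   # spatial bonus
            child_B = max(c.B for c in node.children)
            node.B = min(U, child_B)
```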

SLIDE 13

HOO, cont...

[Bubeck, et al 2008]

[Figure: successive refinements of HOO's partition of the action space. Illustration courtesy of Rémi Munos.]

SLIDE 14

UCB vs HOO

SLIDE 15

HOOT

  • Our idea is to replace UCB in UCT with HOO, so that we can work directly in the continuous action space
  • This leads to our algorithm, HOO applied to Trees (HOOT)
  • The algorithm is exactly the same as UCT, but instead of using UCB at each internal node, we maintain a HOO tree (a minimal sketch follows below)
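Putting the two sketches together: HOOT has exactly the UCT structure above, but every (depth, state) node owns a `HOO` tree (from the HOO sketch) instead of a UCB arm set. As before, `sample(s, a)`, hashable states, a 1-D action in [0, 1], and returns scaled to roughly [0, 1] are assumptions:

```python
from collections import defaultdict

def hoot_plan(root, sample, rollouts=1000, depth=20, gamma=0.95):
    """HOOT sketch: UCT's tree search with continuous action selection
    at each node delegated to a HOO tree over [0, 1]."""
    hoo_at = defaultdict(HOO)    # one HOO tree per visited (depth, state)

    def search(s, d):
        if d == depth:
            return 0.0
        node = hoo_at[(d, s)]
        a = node.select()        # continuous action chosen by HOO
        s2, r = sample(s, a)
        ret = r + gamma * search(s2, d + 1)
        node.update(ret)         # HOO is fed the discounted return
        return ret

    for _ in range(rollouts):
        search(root, 0)

    # Greedy root action: descend the root's HOO tree by empirical mean.
    cell = hoo_at[(0, root)].root
    while cell.children:
        visited = [c for c in cell.children if c.n > 0]
        if not visited:
            break
        cell = max(visited, key=lambda c: c.mean)
    return (cell.lo + cell.hi) / 2
```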

SLIDE 16

Empirical Results

[Figure: three panels of empirical results.
(a) Double Integrator, 1-D: total reward vs. samples per planning step (log scale), for UCT 5A, UCT 11A, UCT 15A, and HOOT.
(b) D-Double Integrator, 1-D: total reward vs. number of discrete actions (10 to 50), for HOOT and UCT.
(c) D-Double Integrator: total reward vs. number of action dimensions (1 to 4), for HOOT, UCT 5, UCT 10, and UCT 20.]

SLIDE 17

Empirical Results

[Figure: two panels of empirical results for the Bicycle domain (0.02cm).
(a) Total reward vs. samples per planning step (log scale), for UCT 5A, UCT 10A, UCT 20A, and HOOT.
(b) Total reward vs. number of discretizations per action dimension (3 to 15), for HOOT and UCT.]

SLIDE 18

Future Work

  • Using HOO to optimize the n-step sequence of actions as an n-dimensional space
  • Extend to continuous state spaces by a weighted interpolation between representative HOO trees

SLIDE 19

Summary

  • Choosing action discretizations is non-trivial!
  • If you have a distance metric and your value function is locally smooth, use HOOT, not vanilla UCT!

SLIDE 20

Thanks!