Planning and Optimization: G8. Trial-based Heuristic Tree Search (PowerPoint PPT Presentation)



SLIDE 1

Planning and Optimization

G8. Trial-based Heuristic Tree Search

Gabriele Röger and Thomas Keller

Universität Basel

December 17, 2018

SLIDE 2

Motivation THTS Framework THTS Algorithms Summary

Content of this Course

Planning:
• Classical: Tasks, Progression/Regression, Complexity, Heuristics
• Probabilistic: MDPs, Blind Methods, Heuristic Search, Monte-Carlo Methods

SLIDE 3

Motivation

SLIDE 4

AO∗ & LAO∗: Recap

• Iteratively build explicated graph
• Extend explicated graph by expanding a fringe node in the partial solution graph
• State-value estimates are initialized with an admissible heuristic
• Propagate information with Bellman backups in the partial solution graph
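The Bellman backup used here can be sketched in a few lines (a minimal illustration, not the lecture's code; the helper names `actions`, `cost`, `succ` and the toy SSP are assumptions):

```python
# Sketch of a Bellman backup for an SSP (hypothetical helper names):
#   Q(s, a) = c(s, a) + sum over successors s' of T(s, a, s') * V(s')
#   V(s)    = min over applicable actions a of Q(s, a)

def bellman_backup(s, actions, cost, succ, V):
    """Update V[s] in place and return the new estimate."""
    q_values = []
    for a in actions(s):
        # expected cost-to-go of action a under current estimates
        q = cost(s, a) + sum(p * V[t] for p, t in succ(s, a))
        q_values.append(q)
    V[s] = min(q_values)
    return V[s]

# Toy SSP: from s0, action 'a' reaches goal g (V[g] = 0) with
# probability 0.5 and stays in s0 with probability 0.5, at cost 1.
V = {"s0": 0.0, "g": 0.0}
new_v = bellman_backup(
    "s0",
    actions=lambda s: ["a"],
    cost=lambda s, a: 1.0,
    succ=lambda s, a: [(0.5, "g"), (0.5, "s0")],
    V=V,
)
# first backup: 1 + 0.5*0 + 0.5*0 = 1.0
```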

SLIDE 5

(Labeled) Real-Time Dynamic Programming: Recap

• Iteratively performs trials
• Simulates greedy policy in each trial
• Encountered states are updated with Bellman backup
• Admissible heuristic used if no state-value estimate available
• Labeling procedure marks states that have converged
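A single such trial can be sketched as follows (a toy illustration with hypothetical helper names; real LRTDP adds the labeling procedure on top of this loop):

```python
import random

def rtdp_trial(s0, goal, actions, cost, succ, V, max_steps=100):
    """One RTDP trial: follow the greedy policy, Bellman-update each visited state."""
    s = s0
    for _ in range(max_steps):
        if s in goal:
            return
        # Bellman backup of s using current state-value estimates
        qs = {a: cost(s, a) + sum(p * V[t] for p, t in succ(s, a))
              for a in actions(s)}
        best = min(qs, key=qs.get)
        V[s] = qs[best]
        # simulate the greedy policy: sample a successor of the greedy action
        r, acc = random.random(), 0.0
        for p, t in succ(s, best):
            acc += p
            if r <= acc:
                s = t
                break

# Toy chain: s0 reaches goal g deterministically at cost 2
V = {"s0": 0.0, "g": 0.0}
rtdp_trial("s0", {"g"}, lambda s: ["a"], lambda s, a: 2.0,
           lambda s, a: [(1.0, "g")], V)
# the trial updates V["s0"] to 2.0 and terminates in g
```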

SLIDE 6

Monte-Carlo Tree Search: Recap

• Iteratively explicates search tree in trials
• Uses tree policy to traverse the tree
• First encountered state not yet in tree is added to the search tree
• State-value estimates are initialized with default policy
• Propagates information with Monte-Carlo backups in reverse order through visited states
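A Monte-Carlo backup is a running average over sampled costs; a minimal sketch (the `Node` class and field names are assumptions for illustration):

```python
class Node:
    """Search-tree node annotated with a visit counter and a value estimate."""
    def __init__(self):
        self.visits = 0
        self.value = 0.0

def monte_carlo_backup(node, sampled_cost):
    """Incrementally update node.value to the mean of all sampled costs."""
    node.visits += 1
    node.value += (sampled_cost - node.value) / node.visits

n = Node()
for c in (4.0, 2.0, 6.0):
    monte_carlo_backup(n, c)
# n.value is now the mean of the three samples: 4.0
```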
SLIDE 7

Trial-based Heuristic Tree Search

• All are asymptotically optimal (or an asymptotically optimal version exists)
• In practice, all have complementary strengths
• There are significant differences between these algorithms, but they also have a lot in common
⇒ a common framework that allows us to describe all three: Trial-based Heuristic Tree Search (THTS)

SLIDE 8

Trial-based Heuristic Tree Search Framework

SLIDE 9

Trial-based Heuristic Tree Search

• Perform trials to explicate search tree:
  • decision (OR) nodes for states
  • chance (AND) nodes for actions
• Annotate nodes with:
  • state-/action-value estimate
  • visit counter
  • solved label
• Initialize search nodes with heuristic


SLIDE 13

Trial-based Heuristic Tree Search

• Perform trials to explicate search tree:
  • decision (OR) nodes for states
  • chance (AND) nodes for actions
• Annotate nodes with:
  • state-/action-value estimate
  • visit counter
  • solved label
• Initialize search nodes with heuristic

6 variable ingredients:
• action selection
• outcome selection
• initialization
• trial length
• backup function
• recommendation function

SLIDE 14

Trial-based Heuristic Tree Search

THTS for SSP T = ⟨S, L, c, T, s0, S⋆⟩:
    d0 := create root node associated with s0
    while time allows:
        visit decision node(d0, T)
    return recommend(d0)

SLIDE 15

THTS: Visit a Decision Node

visit decision node for decision node d, SSP T = ⟨S, L, c, T, s0, S⋆⟩:
    if s(d) ∈ S⋆:
        return 0
    a := select action(d)
    if a not explicated:
        cost := expand and initialize(d, a)
    if not trial length reached(d):
        let c be the node in children(d) with a(c) = a
        cost := visit chance node(c, T)
    backup(d, cost)
    return cost

SLIDE 16

THTS: Visit a Chance Node

visit chance node for chance node c, SSP T = ⟨S, L, c, T, s0, S⋆⟩:
    s′ := select outcome(s(c), a(c))
    if s′ not explicated:
        cost := expand and initialize(c, s′)
    if not trial length reached(c):
        let d be the node in children(c) with s(d) = s′
        cost := visit decision node(d, T)
    cost := cost + c(s(c), a(c))
    backup(c, cost)
    return cost
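Taken together, the two visit procedures can be made runnable. The sketch below is one possible instantiation of the framework on a toy SSP, not the pseudocode verbatim: all class and helper names are assumptions, unexplicated actions are expanded first, action selection is ε-greedy, outcomes are sampled, backups are Monte-Carlo, trials run until a goal, and the recommendation is the expected best arm.

```python
import random

class DecisionNode:
    def __init__(self, state):
        self.state, self.children = state, {}   # action -> ChanceNode
        self.visits, self.value = 0, 0.0

class ChanceNode:
    def __init__(self, state, action):
        self.state, self.action = state, action
        self.children = {}                      # successor state -> DecisionNode
        self.visits, self.value = 0, 0.0

def thts(ssp, trials=50, eps=0.2):
    root = DecisionNode(ssp.s0)
    for _ in range(trials):
        visit_decision(root, ssp, eps)
    # recommendation: expected best arm (lowest action-value estimate)
    return min(root.children.values(), key=lambda c: c.value).action

def visit_decision(d, ssp, eps):
    if d.state in ssp.goals:
        backup(d, 0.0)
        return 0.0
    untried = [a for a in ssp.actions(d.state) if a not in d.children]
    if untried:                                 # expand an unexplicated action first
        a = random.choice(untried)
    elif random.random() < eps:                 # epsilon-greedy exploration
        a = random.choice(ssp.actions(d.state))
    else:                                       # greedy action selection
        a = min(d.children.values(), key=lambda c: c.value).action
    c = d.children.setdefault(a, ChanceNode(d.state, a))
    cost = visit_chance(c, ssp, eps)
    backup(d, cost)
    return cost

def visit_chance(c, ssp, eps):
    s2 = ssp.sample(c.state, c.action)          # outcome selection: sample
    d = c.children.setdefault(s2, DecisionNode(s2))
    cost = visit_decision(d, ssp, eps) + ssp.cost(c.state, c.action)
    backup(c, cost)
    return cost

def backup(node, cost):                         # Monte-Carlo backup
    node.visits += 1
    node.value += (cost - node.value) / node.visits

# Toy SSP: two actions from s0, both reach the goal; 'good' is cheaper.
class ToySSP:
    s0, goals = "s0", {"g"}
    def actions(self, s): return ["good", "bad"]
    def cost(self, s, a): return 1.0 if a == "good" else 5.0
    def sample(self, s, a): return "g"

best = thts(ToySSP())
# both actions are deterministic, so their estimates converge to exactly
# 1.0 and 5.0, and the recommendation is "good"
```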

SLIDE 17

THTS Algorithms

SLIDE 18

MCTS in the THTS Framework

• Trial length: terminate trial when a node is explicated
• Action selection: tree policy
• Outcome selection: sample
• Initialization: add single node to the tree and initialize with heuristic that simulates the default policy
• Backup function: Monte-Carlo backups
• Recommendation function: expected best arm

SLIDE 19

AO∗ (Tree Search Version) in the THTS Framework

• Trial length: terminate trial when a node is expanded
• Action selection: greedy
• Outcome selection: depends on AO∗ version
• Initialization: expand decision node and all its chance node successors, then initialize all V̂ with admissible heuristic
• Backup function: Bellman backups & solved labels
• Recommendation function: expected best arm

SLIDE 20

LRTDP (Tree Search Version) in the THTS Framework

• Trial length: finish trials only in goal states
• Action selection: greedy
• Outcome selection: sample unsolved outcome
• Initialization: expand decision node and all its chance node successors, then initialize all V̂ with admissible heuristic
• Backup function: Bellman backups & solved labels
• Recommendation function: expected best arm

SLIDE 21

Further Ingredients from Literature

Recommendation function:
• Most played arm [Bubeck et al., 2009; Chaslot et al., 2008]
• Empirical distribution of plays [Bubeck et al., 2009]
• Secure arm [Chaslot et al., 2008]

Initialization:
• Expand decision node and initialize chance nodes with heuristic for state-action pairs [Keller & Eyerich, 2012]
• Any classical heuristic on any determinization
• Occupation measure heuristic [Trevizan et al., 2017]

SLIDE 22

Further Ingredients from Literature

Backup functions:
• Temporal Differences [Sutton & Barto, 1987]
• Q-Learning [Watkins, 1989]
• Selective Backups [Feldman & Domshlak, 2012; Keller, 2015]
• MaxMonte-Carlo [Keller & Helmert, 2013]
• Partial Bellman [Keller & Helmert, 2013]

SLIDE 23

Further Ingredients from Literature

Action selections:
• Uniform sampling (UNI)
• ε-greedy (ε-G)
• ε-G with decaying ε:
  • εLIN-G [Singh et al., 2000; Auer et al., 2002]
  • εRT-G [Keller, 2015]
  • εLOG-G [Keller, 2015]
• Boltzmann exploration (BE)
• BE with logarithmically decaying τ (BE-DT) [Singh et al., 2000]
• UCB1 [Auer et al., 2002]
• Root-valued UCB (RT-UCB) [Keller, 2015]
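For illustration, the classic UCB1 rule, here adapted to cost minimization as in an SSP setting (a sketch: the child objects with `visits`/`value` fields and the exploration constant are assumptions):

```python
import math
from collections import namedtuple

def ucb1_select(children, c=1.4):
    """UCB1 for cost minimization: prefer low value estimates,
    minus an exploration bonus that shrinks with the visit count."""
    # any never-visited child is tried first
    for ch in children:
        if ch.visits == 0:
            return ch
    total = sum(ch.visits for ch in children)
    def score(ch):
        return ch.value - c * math.sqrt(math.log(total) / ch.visits)
    return min(children, key=score)

Arm = namedtuple("Arm", "visits value")
arms = [Arm(10, 3.0), Arm(2, 3.5), Arm(0, 0.0)]
chosen = ucb1_select(arms)   # the unvisited arm is selected first
```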


SLIDE 25

Experimental Comparison

• THTS allows mixing and matching ingredients
• Not all combinations are asymptotically optimal
• Analysis based on properties of ingredients is possible
• In [Keller, 2015], comparison of:
  • 1 trial length, 1 outcome selection, 1 initialization
  • 2 different recommendation functions
  • 9 different backup functions
  • 9 different action selections

⇒ 162 different THTS algorithms, 115 shown to be asymptotically optimal
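The count of 162 follows directly from the Cartesian product of the ingredient choices; a quick check:

```python
from itertools import product

# number of options per ingredient, as in the comparison above
ingredients = {
    "trial length": 1, "outcome selection": 1, "initialization": 1,
    "recommendation function": 2, "backup function": 9, "action selection": 9,
}
combos = list(product(*(range(n) for n in ingredients.values())))
n_algorithms = len(combos)   # 1 * 1 * 1 * 2 * 9 * 9 = 162
```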

SLIDE 26

Asymptotic Optimality

[Table: asymptotic optimality of each combination of the 9 action selections (UNI, ε-G, εLOG-G, εRT-G, εLIN-G, BE, BE-DT, RT-UCB, UCB1) with the 9 backup functions (MC, LSMC, ESMC, TD, LSTD, ESTD, QL, MaxMC, PB)]

SLIDE 27

Experimental Evaluation

• Most played arm recommendation function often better than same configuration with expected best arm

Scores per domain, MC-UCB1 with most played arm (MPA) vs. Prost 2011:

Domain       MC-UCB1 MPA   Prost 2011
Academic          27            26
Crossing          65            62
Elevators         78            49
Game              86            84
Navigation        45            42
Recon             92            90
Skill             77            69
Sysadmin          89            88
Tamarisk          86            83
Traffic           71            60
Triangle          46            49
Wildfire          84            85
Total             70            66


SLIDE 30

Experimental Evaluation

• Most played arm recommendation function often better than same configuration with expected best arm
• Boltzmann exploration and root-valued UCB perform best in most domains
• Monte-Carlo and Partial Bellman backups perform best in most domains
• Almost all action selections and backup functions perform best in at least one domain

Best action selection per domain: 4 RT-UCB, 4 BE, 2 BE-DT, 1 UCB1, 1 ε-G, 1 εRT-G, 1 εLOG-G, 1 εLIN-G
Best backup function per domain: 6 MC, 4 PB, 2 TD, 2 MaxMC, 1 SMC, 1 QL

SLIDE 31

Implementation: Prost

• The Prost planner implements the THTS framework
• Mixing and matching of ingredients
• Very simple to add new ingredients: just inherit from the corresponding class
• https://bitbucket.org/tkeller/prost/

SLIDE 32

Summary

SLIDE 33

Summary

• MCTS, AO∗ and RTDP have complementary strengths, but also a similar structure
• THTS allows combining ideas from MCTS, heuristic search and dynamic programming
• Mixing and matching ingredients leads to novel and sometimes better algorithms