Metareasoning for Deliberation Time Distribution in the Prost Planner

SLIDE 1

Motivation Approaches for Metareasoners Results Summary

Metareasoning for Deliberation Time Distribution in the Prost Planner

Ferdinand Badenberg

University of Basel

Bachelor Thesis Presentation, 2017

SLIDE 2

Outline

1. Motivation
   - Why Metareasoning?
   - Metareasoning Problem
2. Approaches for Metareasoners
   - Hand Made Functions
   - Metareasoner of Lin et al.
   - Improvements for the Metareasoner
3. Results
   - Results for the Hand Made Functions
   - Results for the Formal Procedure

SLIDE 4

Cycle

[Figure: cycle between planner and environment. The planner thinks and acts; the environment returns a reward and the next state.]

SLIDE 5

Motivation

Why Metareasoning?
- Optimise the policy in the given time
- Allocate thinking time where it is needed
- Act if the decision is easy (clear best action)
- Think if the decision is difficult (multiple actions very close)

SLIDE 6

Metareasoning Problem

Metareasoning Problem
- Steps from a finite-horizon MDP
- Rounds
- Limited time
- Anytime search algorithm

Metareasoner
- Decision to think or act
- Based on specific values for these factors
- Made after one thinking cycle of the algorithm
- Goal: only think when necessary
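The think-or-act decision after each thinking cycle can be sketched as a small loop. This is a minimal illustration, not the Prost planner's code: the function name `deliberate` and the three callbacks (`think_once`, `should_act`, `time_left`) are hypothetical stand-ins for the anytime search algorithm, the metareasoner, and the remaining time budget.

```python
from typing import Callable


def deliberate(
    think_once: Callable[[], None],
    should_act: Callable[[], bool],
    time_left: Callable[[], float],
) -> int:
    """Think until the metareasoner says "act" or the time budget is used up.

    Returns the number of thinking cycles that were run for this step.
    """
    cycles = 0
    while time_left() > 0.0:
        think_once()        # one cycle of the anytime search algorithm
        cycles += 1
        if should_act():    # metareasoner is consulted after each cycle
            break
    return cycles
```

Because the search algorithm is anytime, stopping after any cycle still leaves a usable action estimate; the metareasoner only decides how many cycles that step receives.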

SLIDE 8

Hand Made Functions

Idea
- Allocate time for each step
- Think for as long as time is left
- State of the search algorithm not considered

Functions tested:
1. Uniform (standard)
2. First
3. Linear
4. Hyperbolic
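A sketch of how such a fixed, state-independent allocation could look. The slides name the four functions but do not give their formulas, so every weight shape below is an assumption made for illustration: "first" doubles the share of the first step, "linear" lets the share fall linearly with the step index, and "hyperbolic" uses 1/(i+1). The function name `allocate` is hypothetical.

```python
from typing import List


def allocate(total_time: float, steps: int, kind: str = "uniform") -> List[float]:
    """Split total_time over the steps using an assumed weight shape."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    if kind == "uniform":
        weights = [1.0] * steps
    elif kind == "first":
        # Assumption: the first step gets twice the share of every other step.
        weights = [2.0] + [1.0] * (steps - 1)
    elif kind == "linear":
        # Assumption: share decreases linearly from the first to the last step.
        weights = [float(steps - i) for i in range(steps)]
    elif kind == "hyperbolic":
        # Assumption: share proportional to 1 / (step index + 1).
        weights = [1.0 / (i + 1) for i in range(steps)]
    else:
        raise ValueError(f"unknown allocation kind: {kind}")
    scale = total_time / sum(weights)   # normalise so the shares sum to total_time
    return [w * scale for w in weights]
```

Whatever the shape, the shares are normalised to the total budget, so the functions differ only in where the time is spent, not in how much is spent overall.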

SLIDE 9

Example Time Distribution

SLIDE 10

Formal Metareasoner of Lin et al.

Metareasoner
- Idea: think if a change of policy is likely, act if it will stay the same
- Only considers the expected reward estimations (Q-values) of the search algorithm
- Act if Qact ≥ Qthink
- How are they calculated?

SLIDE 11

Formal Metareasoner: Qthink and Qact

Qthink
- Expected reward of the policy after another thinking cycle
- Simplification: only the best action is relevant
- Estimate the probability of action a being the best after the next thinking cycle
- Estimate the expected reward given that action a is chosen
- Needed: next Q-values for each action

Qact
- Intuitive idea: Q-value of the current best action
- But: average of the current Q-value and the next Q-value
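One way to turn these bullets into numbers is a Monte Carlo estimate over sampled next Q-values. The sketch below is an illustration under assumptions, not the procedure of Lin et al. or of the thesis: the function name `estimate_q_think_q_act`, the `sample_next` callback, and the sample count are hypothetical, and Qact is read as the average of the current and the expected next Q-value of the currently best action.

```python
from typing import Callable, List, Sequence, Tuple


def estimate_q_think_q_act(
    q_now: Sequence[float],
    sample_next: Callable[[Sequence[float]], List[float]],
    n_samples: int = 1000,
) -> Tuple[float, float]:
    """Monte Carlo estimate of (Qthink, Qact) from sampled next Q-values."""
    n_actions = len(q_now)
    best_now = max(range(n_actions), key=lambda a: q_now[a])
    wins = [0] * n_actions            # how often each action is best next
    reward_sum = [0.0] * n_actions    # summed next Q-value when it is best
    next_q_of_best_now = 0.0
    for _ in range(n_samples):
        q_next = sample_next(q_now)
        a = max(range(n_actions), key=lambda i: q_next[i])
        wins[a] += 1
        reward_sum[a] += q_next[a]
        next_q_of_best_now += q_next[best_now]
    # Qthink = sum over a of P(a is best next) * E[next Q(a) | a is best next]
    q_think = sum(
        (wins[a] / n_samples) * (reward_sum[a] / wins[a])
        for a in range(n_actions)
        if wins[a] > 0
    )
    # Qact = average of the current and the expected next Q-value
    # of the action that is best right now (assumed reading of the slide).
    q_act = 0.5 * (q_now[best_now] + next_q_of_best_now / n_samples)
    return q_think, q_act
```

If the sampled next Q-values never change, the two estimates coincide and the rule "act if Qact ≥ Qthink" says to act; Qthink exceeds Qact only when some sampled futures make a different action look better.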

SLIDE 12

Formal Metareasoner: Estimation of Next Q-values

Estimating Next Q-values
- Idea: base the next change in Q-values on the previous change in Q-values
- Assumption: the next ΔQ-value is no larger than the previous one
- Draw a random ρ between 0 and 1
- $\Delta Q(a) = \widehat{\Delta Q}(a) \cdot \rho$ for all actions a
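The sampling step can be sketched as follows. Several details are assumptions rather than statements from the slides: the previous change is taken as the difference between the current and the previous Q-value, the next Q-value is the current one plus the scaled change, and a single ρ is shared by all actions within one sample. The name `sample_next_q` is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_next_q(
    q_now: Sequence[float],
    q_prev: Sequence[float],
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Sample next Q-values by scaling each action's previous change with rho."""
    rng = rng or random.Random()
    rho = rng.random()  # uniform in [0, 1); shared by all actions in this sample
    # Assumed: previous change = q_now - q_prev, next Q = q_now + rho * change.
    return [qn + rho * (qn - qp) for qn, qp in zip(q_now, q_prev)]
```

Since ρ never exceeds 1, each sampled change is at most as large as the previous one, which is exactly the assumption stated on the slide.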

SLIDE 13

Line Segment Example: UCT

Qthink > Qact

[Figure: line segments for actions a1, a2 and a3, with marked points e1, e2 and b. Axes: Unit Interval, Scaled Values.]

SLIDE 14

Line Segment Example: UCT

Qthink = Qact

[Figure: line segments for actions a1, a2 and a3, with e1 and b marked at the same point. Axes: Unit Interval, Scaled Values.]

SLIDE 15

Improvements

Minimum Thinking Time
- Problem: the assumption is often not true early on
- Improvement: think for at least Tmin seconds
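A minimal sketch of this guard, assuming the metareasoner is simply not consulted until Tmin seconds of thinking have passed in the current step. The function and parameter names are hypothetical.

```python
def should_act_with_min_time(
    elapsed: float, t_min: float, q_act: float, q_think: float
) -> bool:
    """Never act before t_min seconds of thinking; afterwards use Qact >= Qthink."""
    if elapsed < t_min:
        return False          # keep thinking: early estimates are unreliable
    return q_act >= q_think
```

The guard only overrides "act" decisions; once Tmin has passed, the original comparison of Qact and Qthink applies unchanged.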

SLIDE 16

Cthink+

Cthink+
- Problem: the time left is not considered
- Improvement: subtract Cthink from Qthink

SLIDE 17

Cthink

Cthink
- Problem: stopping with time left is useless
- Improvement: allow a negative Cthink
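The effect of the cost of thinking on the decision rule can be sketched as below. The slides do not say how Cthink is computed from the time left, so it is passed in as a plain number here; the function name `should_act` is hypothetical.

```python
def should_act(q_act: float, q_think: float, c_think: float = 0.0) -> bool:
    """Act if Qact >= Qthink - Cthink.

    A positive c_think penalises thinking, so acting becomes more likely.
    A negative c_think rewards thinking, so thinking becomes more likely.
    """
    return q_act >= q_think - c_think
```

With `c_think = 0.0` this reduces to the original rule "act if Qact ≥ Qthink"; allowing negative values is what lets the metareasoner keep thinking when there is time to spare.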

SLIDE 19

Results Hand Made Functions

Results

Problem     Uniform  Hyperbolic  First  Linear
Wildfire         74          71     80      81
Triangle         72          65     72      75
Academic         37          37     34      45
Elevators        93          93     91      94
Tamarisk         93          94     92      91
Sysadmin         94          94     90      91
Recon            97          99     97      96
Game             97          93     94      93
Traffic          97          97     96      96
Crossing         87          89     91      99
Skill            91          91     88      93
Navigation       65          58     83      82
Total            83          82     84      86

SLIDE 20

Results Formal Procedure

Results

Problem     Uniform  Lin et al.  Minimum  Cthink+  Cthink
Wildfire         60          90       86       95      68
Triangle         78          67       62       59      68
Academic         39          32       36       35      38
Elevators        98          71       83       83      97
Tamarisk         96          68       86       90      92
Sysadmin        100          36       67       74      82
Recon            98          56       75       75      97
Game             97          64       82       86      96
Traffic          99          85       90       87      98
Crossing         88          58       78       83      89
Skill           100          25       71       69      86
Navigation       82          26       25       28      83
Total            86          56       70       72      83

SLIDE 21

Summary

Result Summary
- Hand made functions performed very well
- Default metareasoner severely underestimates thinking
- The improvements proved to be very useful

SLIDE 22

Summary

Outlook
- More general hand made functions
- Improve formal procedure:
  - Consider all previous ΔQ-values
  - Replace random ρ
- More sophisticated cost of thinking: combination of two approaches

SLIDE 23

Questions?

SLIDE 24

Appendix

BRTDP vs UCT

BRTDP
- Used in the original paper
- Cost setting
- Uses an upper bound of the actual Q-value
- Monotonically decreasing

UCT
- Used by the Prost planner
- Reward setting
- No guarantees

SLIDE 25

BRTDP vs UCT: Visualisation

SLIDE 26

Line Segment Example: BRTDP

Qthink < Qact

[Figure: line segments for actions a1, a2 and a3, with marked points b, e3 and e2. Axes: Unit Interval, Scaled Values.]

SLIDE 27

Wildfire Time per Step