SLIDE 1

Sample-Based Methods for Continuous Action Markov Decision Processes

Chris Mansley, Ari Weinstein, Michael Littman
Rutgers University

SLIDE 4

From Learning to Planning

Bellman Equation

  • Continuous state space: standard machine learning approaches to function approximation have proven successful!
  • Continuous action space: very little work addresses how to evaluate the maximum over actions (the equation is reconstructed below)
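For reference, here is the Bellman optimality equation the slide builds on, in its standard form; the slides rendered it as an image, so this is a reconstruction rather than a transcription:

```latex
V^{*}(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \int_{S} T(s' \mid s,a) \, V^{*}(s') \, \mathrm{d}s' \Big]
```

With a continuous action set $A$, even evaluating the $\max_{a \in A}$ on the right-hand side is itself a nontrivial optimization problem, and that gap is what the rest of the talk addresses.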

slide-5
SLIDE 5

Sparse Sampling

[Kearns, et al 1999]

  • An ε-optimal planning algorithm for discounted MDPs
  • Number of samples independent of state space size!
  • Requires too many samples! (a sketch of why follows the figure)

[Figure: a sparse-sampling lookahead tree. The root state S0 expands both actions (A1, A2), and each action node draws several sampled successor states (S1 ... S12), recursively.]
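To make the sample-cost tradeoff concrete, here is a minimal sketch of the sparse-sampling recursion. The generative-model interface `sample(s, a) -> (next_state, reward)` and the parameter values are illustrative assumptions, not from the slides:

```python
def sparse_sample_value(s, depth, actions, sample, C=10, gamma=0.95):
    """Estimate V*(s) with a depth-limited lookahead tree
    [Kearns et al. 1999].  Every (state, action) pair is resampled
    C times, so the tree has (len(actions) * C) ** depth leaves:
    independent of the size of the state space, but exponential in
    depth, which is the "too many samples" problem."""
    if depth == 0:
        return 0.0
    best = float('-inf')
    for a in actions:
        total = 0.0
        for _ in range(C):
            s2, r = sample(s, a)             # draw a successor and reward
            total += r + gamma * sparse_sample_value(
                s2, depth - 1, actions, sample, C, gamma)
        best = max(best, total / C)          # Bellman backup of the max
    return best
```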

SLIDE 6

Can we use ideas from the exploration/exploitation problem to better direct our search?

SLIDE 7

UCB

[Auer, et al 2002]

  • An algorithm for efficient learning in the bandit domain
  • Fixed number of discrete actions with bounded support
  • Choose an arm greedily according to the following rule:

$\mu_i + \sqrt{\dfrac{2 \ln n}{n_i}}$

where $\mu_i$ is the empirical mean reward of arm $i$, $n_i$ is the number of times arm $i$ has been pulled, and $n$ is the total number of pulls so far.

[Figure: a one-state bandit with three arms, each labeled with its empirical mean and most recent reward.]
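A minimal, self-contained implementation of the UCB1 rule above; the Bernoulli test arms at the bottom are hypothetical, just to make the sketch runnable:

```python
import math
import random

class UCB1:
    """UCB1 [Auer et al. 2002]: pull the arm maximizing
    mean_i + sqrt(2 * ln(n) / n_i)."""

    def __init__(self, num_arms):
        self.counts = [0] * num_arms    # n_i: pulls of arm i
        self.means = [0.0] * num_arms   # empirical mean reward of arm i
        self.total = 0                  # n: total pulls

    def select(self):
        # Pull every arm once before applying the confidence rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        return max(range(len(self.counts)),
                   key=lambda i: self.means[i]
                   + math.sqrt(2 * math.log(self.total) / self.counts[i]))

    def update(self, arm, reward):
        self.total += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Hypothetical usage: three Bernoulli arms with unknown payoff rates.
bandit = UCB1(3)
arm_p = [0.2, 0.8, 0.5]
for _ in range(1000):
    a = bandit.select()
    bandit.update(a, 1.0 if random.random() < arm_p[a] else 0.0)
```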

SLIDE 8

UCT

[Kocsis, Szepesvári 2006]

  • Upper Confidence applied to Trees
  • Takes the UCB algorithm and extends it to the full MDP domain
  • Builds a tree similar to Sparse Sampling, but instead of doing a breadth-first search, performs a depth-first search directed by a UCB rule at each node

SLIDE 9

UCT, cont...

[Kocsis, Szepesvári 2006]

[Figure: three rounds of UCT tree growth. Each rollout starts at the root S0, descends the existing tree guided by UCB at every node, and adds new nodes along the way (Round 1, Round 2, Round 3).]
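A compact sketch of that loop, with the UCB rule inlined. Real UCT typically adds just one new node per rollout and estimates the remainder with a rollout policy; for brevity this version memoizes statistics for every (depth, state) pair it visits. The generative model `sample(s, a) -> (next_state, reward)` and hashable states are assumptions:

```python
import math
import random
from collections import defaultdict

def uct_plan(root, actions, sample, rollouts=1000, depth=20, gamma=0.95):
    """UCT sketch [Kocsis & Szepesvari 2006]: grow a search tree one
    rollout at a time, picking actions at each visited node by UCB."""
    N = defaultdict(int)    # visits of (depth, state)
    Na = defaultdict(int)   # visits of (depth, state, action)
    Q = defaultdict(float)  # mean return of (depth, state, action)

    def search(s, d):
        if d == depth:
            return 0.0
        untried = [a for a in actions if Na[(d, s, a)] == 0]
        if untried:
            a = random.choice(untried)       # expand before exploiting
        else:
            a = max(actions, key=lambda a: Q[(d, s, a)]
                    + math.sqrt(2 * math.log(N[(d, s)]) / Na[(d, s, a)]))
        s2, r = sample(s, a)
        ret = r + gamma * search(s2, d + 1)
        N[(d, s)] += 1                       # back up statistics
        Na[(d, s, a)] += 1
        Q[(d, s, a)] += (ret - Q[(d, s, a)]) / Na[(d, s, a)]
        return ret

    for _ in range(rollouts):
        search(root, 0)
    return max(actions, key=lambda a: Q[(0, root, a)])  # greedy root action
```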

SLIDE 10

HOO

[Bubeck, et al 2008]

  • UCT is still restricted to discrete states and actions
  • HOO (hierarchical optimistic optimization) provides similar guarantees to UCB in “well-behaved” continuous bandit problems
  • The idea is simple: divide the action space up hierarchically (similar to a KD-tree), keep track of returns within these volumes, and grant exploration bonuses for both the number of samples and the size of each subdivision

SLIDE 11

HOO, cont...

[Bubeck, et al 2008]

  • Choose an arm greedily with respect to the following:

$\mu_i + \sqrt{\dfrac{2 \ln n}{n_i}} + v_1 \rho^{h}$

  • Very similar to UCB except for the spatial term at the end: $v_1 \rho^{h}$ is an upper bound on $\mathrm{diam}(i)$, the diameter of node $i$'s cell at depth $h$ of the partition tree
  • The intuition is that arms in large volumes with few samples are unknown, while arms in small volumes with many samples are well known

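A sketch of HOO over a one-dimensional action space normalized to [0, 1]. The smoothness parameters `v1` and `rho` (from the bonus $v_1 \rho^h$) are illustrative defaults, and, as a simplification, B-values are backed up only along the path just played:

```python
import math
import random

class HOONode:
    """A cell [lo, hi) at depth h of HOO's partition tree."""
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.n = 0               # samples that fell in this cell
        self.mean = 0.0          # their empirical mean reward
        self.B = float('inf')    # optimistic value; inf until sampled
        self.children = None

class HOO:
    """Sketch of hierarchical optimistic optimization [Bubeck et al. 2008]."""
    def __init__(self, v1=1.0, rho=0.5):
        self.root = HOONode(0.0, 1.0, 0)
        self.v1, self.rho = v1, rho
        self.total = 0           # n: total pulls

    def select(self):
        """Descend by B-value to a leaf, split it, play a point inside it."""
        path, node = [], self.root
        while node.children:
            path.append(node)
            node = max(node.children, key=lambda c: c.B)
        path.append(node)
        mid = (node.lo + node.hi) / 2
        node.children = [HOONode(node.lo, mid, node.depth + 1),
                         HOONode(mid, node.hi, node.depth + 1)]
        self._path = path
        return random.uniform(node.lo, node.hi)

    def update(self, reward):
        """Credit the reward to every cell on the played path, then
        back up B = min(U, max of children's B) from leaf to root."""
        self.total += 1
        for node in self._path:
            node.n += 1
            node.mean += (reward - node.mean) / node.n
        for node in reversed(self._path):
            U = (node.mean
                 + math.sqrt(2 * math.log(self.total) / node.n)
                 + self.v1 * self.rho ** node.depth)   # spatial bonus
            child_B = max(c.B for c in node.children)
            node.B = min(U, child_B)
```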

SLIDE 13

HOO, cont...

[Bubeck, et al 2008]

[Figure: successive refinements of HOO's partition of the action space. Illustration courtesy of Rémi Munos.]

SLIDE 14

UCB vs HOO

SLIDE 15

HOOT

  • Our idea is to replace UCB in UCT with HOO, so that we can work directly in the continuous action space
  • This leads to our algorithm, HOO applied to Trees (HOOT)
  • The algorithm is exactly the same as UCT, but instead of using UCB at each internal node, we maintain a HOO tree (a minimal sketch follows below)
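Putting the two sketches together: HOOT has exactly the UCT structure above, but every (depth, state) node owns a `HOO` tree (from the HOO sketch) instead of a UCB arm set. As before, `sample(s, a)`, hashable states, a 1-D action in [0, 1], and returns scaled to roughly [0, 1] are assumptions:

```python
from collections import defaultdict

def hoot_plan(root, sample, rollouts=1000, depth=20, gamma=0.95):
    """HOOT sketch: UCT's tree search with continuous action selection
    at each node delegated to a HOO tree over [0, 1]."""
    hoo_at = defaultdict(HOO)    # one HOO tree per visited (depth, state)

    def search(s, d):
        if d == depth:
            return 0.0
        node = hoo_at[(d, s)]
        a = node.select()        # continuous action chosen by HOO
        s2, r = sample(s, a)
        ret = r + gamma * search(s2, d + 1)
        node.update(ret)         # HOO is fed the discounted return
        return ret

    for _ in range(rollouts):
        search(root, 0)

    # Greedy root action: descend the root's HOO tree by empirical mean.
    cell = hoo_at[(0, root)].root
    while cell.children:
        visited = [c for c in cell.children if c.n > 0]
        if not visited:
            break
        cell = max(visited, key=lambda c: c.mean)
    return (cell.lo + cell.hi) / 2
```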

SLIDE 16

Empirical Results

[Figure: three panels of empirical results.
(a) Double Integrator, 1-D: total reward vs. samples per planning step (log scale), for UCT 5A, UCT 11A, UCT 15A, and HOOT.
(b) D-Double Integrator, 1-D: total reward vs. number of discrete actions (10 to 50), for HOOT and UCT.
(c) D-Double Integrator: total reward vs. number of action dimensions (1 to 4), for HOOT, UCT 5, UCT 10, and UCT 20.]

SLIDE 17

Empirical Results

[Figure: two panels of empirical results for the Bicycle domain (0.02cm).
(a) Total reward vs. samples per planning step (log scale), for UCT 5A, UCT 10A, UCT 20A, and HOOT.
(b) Total reward vs. number of discretizations per action dimension (3 to 15), for HOOT and UCT.]

SLIDE 18

Future Work

  • Using HOO to optimize the n-step sequence of actions as an n-dimensional space
  • Extend to continuous state spaces by a weighted interpolation between representative HOO trees

SLIDE 19

Summary

  • Choosing action discretizations is non-trivial!
  • If you have a distance metric and your value function is locally smooth, use HOOT, not vanilla UCT!

SLIDE 20

Thanks!