Monte-Carlo Planning: Basic Principles and Recent Progress Alan - PowerPoint PPT Presentation

Monte-Carlo Planning: Basic Principles and Recent Progress Alan Fern School of EECS Oregon State University 1

Outline  Preliminaries: Markov Decision Processes  What is Monte-Carlo Planning?  Uniform Monte-Carlo  Single State Case (PAC Bandit)  Policy rollout  Sparse Sampling  Adaptive Monte-Carlo  Single State Case (UCB Bandit)  UCT Monte-Carlo Tree Search 2

Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model Actions State + Reward World (possibly stochastic) ???? We will model the world as an MDP. 3

Markov Decision Processes  An MDP has four components: S, A, P R , P T :  finite state set S  finite action set A  Transition distribution P T (s’ | s, a)  Probability of going to state s’ after taking action a in state s  First-order Markov model  Bounded reward distribution P R (r | s, a)  Probability of receiving immediate reward r after taking action a in state s  First-order Markov model 4

Graphical View of MDP S t S t+1 S t+2 A t+1 A t A t+2 R t R t+2 R t+1  First-Order Markovian dynamics (history independence)  Next state only depends on current state and current action  First-Order Markovian reward process  Reward only depends on current state and action 5

Policies (“plans” for MDPs)  Given an MDP we wish to compute a policy  Could be computed offline or online.  A policy is a possibly stochastic mapping from states to actions  π : S → A  π (s) is action to do at state s π (s)  specifies a continuously reactive controller How to measure goodness of a policy? 6

Value Function of a Policy  We consider finite-horizon discounted reward, discount factor 0 ≤ β < 1  V π (s,h) denotes expected h-horizon discounted total reward of policy π at state s  Each run of π for h steps produces a random reward sequence: R 1 R 2 R 3 … R h  V π (s,h) is the expected discounted sum of this sequence h t ( , ) | , V s h E R s t 0 t  Optimal policy π * is policy that achieves maximum value across all states 7

Relation to Infinite Horizon Setting  Often value function V π (s) is defined over infinite horizons for a discount factor 0 ≤ β < 1 t t ( ) [ | , ] V s E R s 0 t  It is easy to show that difference between V π (s,h) and V π (s) shrinks exponentially fast as h grows R h max ( ) ( , ) V s V s h 1  h-horizon results apply to infinite horizon setting 8

Computing a Policy  Optimal policy maximizes value at each state  Optimal policies guaranteed to exist [Howard, 1960]  When state and action spaces are small and MDP is known we find optimal policy in poly-time via LP  Can also use value iteration or policy Iteration  We are interested in the case of exponentially large state spaces. 9

Large Worlds: Model-Based Approach 1. Define a language for compactly describing MDP model, for example:  Dynamic Bayesian Networks  Probabilistic STRIPS/PDDL 2. Design a planning algorithm for that language Problem: more often than not, the selected language is inadequate for a particular problem, e.g.  Problem size blows up  Fundamental representational shortcoming 10

Large Worlds: Monte-Carlo Approach  Often a simulator of a planning domain is available or can be learned from data  Even when domain can’t be expressed via MDP language Fire & Emergency Response Klondike Solitaire 11 11

Large Worlds: Monte-Carlo Approach  Often a simulator of a planning domain is available or can be learned from data  Even when domain can’t be expressed via MDP language  Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator action World Simulator Real World State + reward 12 12

Example Domains with Simulators  Traffic simulators  Robotics simulators  Military campaign simulators  Computer network simulators  Emergency planning simulators  large-scale disaster and municipal  Sports domains (Madden Football)  Board games / Video games  Go / RTS In many cases Monte-Carlo techniques yield state-of-the-art performance. Even in domains where model-based planner is applicable. 13

MDP: Simulation-Based Representation  A simulation-based representation gives: S, A, R, T:  finite state set S (generally very large)  finite action set A  Stochastic, real-valued, bounded reward function R(s,a) = r  Stochastically returns a reward r given input s and a  Can be implemented in arbitrary programming language  Stochastic transition function T(s,a) = s’ (i.e. a simulator)  Stochastically returns a state s’ given input s and a  Probability of returning s’ is dictated by Pr(s’ | s,a) of MDP  T can be implemented in an arbitrary programming language 14

Outline  Preliminaries: Markov Decision Processes  What is Monte-Carlo Planning?  Uniform Monte-Carlo  Single State Case (Uniform Bandit)  Policy rollout  Sparse Sampling  Adaptive Monte-Carlo  Single State Case (UCB Bandit)  UCT Monte-Carlo Tree Search 15

Single State Monte-Carlo Planning  Suppose MDP has a single state and k actions  Figure out which action has best expected reward  Can sample rewards of actions using calls to simulator  Sampling a is like pulling slot machine arm with random payoff function R(s,a) s a 1 a 2 a k … … R(s,a k ) R(s,a 2 ) R(s,a 1 ) Multi-Armed Bandit Problem 16

PAC Bandit Objective  Probably Approximately Correct (PAC)  Select an arm that probably (w/ high probability) has approximately the best expected reward  Use as few simulator calls (or pulls) as possible s a 1 a 2 a k … … R(s,a k ) R(s,a 2 ) R(s,a 1 ) Multi-Armed Bandit Problem 17

UniformBandit Algorithm NaiveBandit from [Even-Dar et. al., 2002] 1. Pull each arm w times (uniform pulling). 2. Return arm with best average reward. s a k a 1 a 2 … … r 11 r 12 … r 1w r 21 r 22 … r 2w r k1 r k2 … r kw How large must w be to provide a PAC guarantee? 18

Aside: Additive Chernoff Bound • Let R be a random variable with maximum absolute value Z. An let r i i=1,…,w be i.i.d. samples of R • The Chernoff bound gives a bound on the probability that the average of the r i are far from E[R] 2 w Chernoff 1 Pr [ ] exp E R r w Bound i w Z 1 i Equivalently: 1 With probability at least we have that, w 1 1 1 [ ] ln E R r Z i w w 1 i 19

UniformBandit Algorithm NaiveBandit from [Even-Dar et. al., 2002] 1. Pull each arm w times (uniform pulling). 2. Return arm with best average reward. s a k a 1 a 2 … … r 11 r 12 … r 1w r 21 r 22 … r 2w r k1 r k2 … r kw How large must w be to provide a PAC guarantee? 20

UniformBandit PAC Bound With a bit of algebra and Chernoff bound we get: 2 R max ln k If for all arms simultaneously w w 1 [ ( , )] E R s a r i ij w 1 j with probability at least 1  That is, estimates of all actions are ε – accurate with probability at least 1-  Thus selecting estimate with highest value is approximately optimal with high probability, or PAC 21

# Simulator Calls for UniformBandit s a 1 a k a 2 … … R(s,a k ) R(s,a 2 ) R(s,a 1 ) k  Total simulator calls for PAC: ln k k w O 2  Can get rid of ln(k) term with more complex algorithm [Even-Dar et. al., 2002]. 22

Outline  Preliminaries: Markov Decision Processes  What is Monte-Carlo Planning?  Non-Adaptive Monte-Carlo  Single State Case (PAC Bandit)  Policy rollout  Sparse Sampling  Adaptive Monte-Carlo  Single State Case (UCB Bandit)  UCT Monte-Carlo Tree Search 23

Policy Improvement via Monte-Carlo  Now consider a multi-state MDP.  Suppose we have a simulator and a non-optimal policy  E.g. policy could be a standard heuristic or based on intuition  Can we somehow compute an improved policy? World Simulator action + Real Base Policy World State + reward 24

Policy Improvement Theorem  The h-horizon Q-function Q π (s,a,h) is defined as: expected total discounted reward of starting in state s, taking action a, and then following policy π for h-1 steps ' ( ) arg max ( , , ) s Q s a h  Define: a  Theorem [Howard, 1960]: For any non-optimal policy π the policy π’ a strict improvement over π .  Computing π’ amounts to finding the action that maximizes the Q-function  Can we use the bandit idea to solve this? 25

Policy Improvement via Bandits s a k a 1 a 2 … SimQ(s,a 1 , π ,h) SimQ(s,a 2 , π ,h) SimQ(s,a k , π ,h)  Idea: define a stochastic function SimQ(s,a, π ,h) that we can implement and whose expected value is Q π (s,a,h)  Use Bandit algorithm to PAC select improved action How to implement SimQ? 26

Policy Improvement via Bandits SimQ(s,a, π ,h) r = R(s,a) simulate a in s s = T(s,a) for i = 1 to h-1 r = r + β i R(s, π (s)) simulate h-1 steps s = T(s, π (s)) of policy Return r  Simply simulate taking a in s and following policy for h-1 steps, returning discounted sum of rewards  Expected value of SimQ(s,a, π ,h) is Q π (s,a,h) 27

Monte-Carlo Planning: Basic Principles and Recent Progress Alan - PowerPoint PPT Presentation

Monte-Carlo Planning: Basic Principles and Recent Progress Alan Fern School of EECS Oregon State University 1 Outline Preliminaries: Markov Decision Processes What is Monte-Carlo Planning? Uniform Monte-Carlo Single State Case

Monte Carlo Generators Monte Carlo Generators Monte Carlo Generators QCD Lecture III P .

Monte Carlo Methods Guojin Chen Christopher Cprek Chris Rambicure Monte Carlo Methods 1.

Monte Carlo Approximation of Monte Carlo Filters Adam M. Johansen et al. Collaborators Include:

BROCHURE 2019 TETRA JUICES DEL MONTE DEL MONTE 6 x 1L GOLD PINEAPPLE 6 x 1L 6 x 1L 6 x 1L

Chapter 5: Monte Carlo Methods Monte Carlo methods are learning methods Experience

Draft Introduction to (randomized) quasi-Monte Carlo Pierre LEcuyer MCQMC Conference,

Monte Carlo Estimation 7 January 2019 OSU CSE 1 Monte Carlo Methods Class of computational

Monte Carlo Localization Ximing Yu March 24, 2009 Ximing Yu Monte Carlo Localization 1

Monte Carlo Control CMPUT 366: Intelligent Systems S&B 5.3-5.5, 5.7 Lecture Outline 1.

4. THE MONTE CARLO METHOD 4.1 I ntroduction This chapter is aimed at describing the Monte Carlo

Techniques in Artificial Intelligence - Part I Todd W. Neller Gettysburg College Monte Carlo

Introduction to Monte Carlo Method Andrzej Palczewski and Jan Palczewski Introduction to Monte

Draft 1 Density estimation by Monte Carlo and randomized quasi-Monte Carlo (RQMC) Pierre

Barrier Option Pricing Introduction Barrier Options and Monte Carlo Simulations The

Monte Carlo Integration Monte Carlo methods use random numbers for sampling Although

Monte-Carlo tree search for Monte-Carlo tree search for multi-player, no-limit multi-player,

A Monte-Carlo Option-Pricing Algorithm for Log-Uniform Jump-Diffusion Model Floyd B. Hanson

Monte-Carlo Game Tree Search: Basic Techniques Tsan-sheng Hsu tshsu@iis.sinica.edu.tw

Monte Carlo Methods Monte Carlo Methods I, at any rate, am convinced that He does not throw dice.

1 Simple example of simulation Slide 4 Evaluation of expected value of production E(V)

High-Dimensional and Multi-Failure- Region SRAM Yield Analysis Xiao Shi 1,2 , Hao Yan 3 , Jinxin

Statistical Methods and Monte Carlo simulation in High Energy Physics Dr. Leonid Serkin

Ch.8.1-8.3: Random numbers and Monte Carlo simulation Joakim Sundnes 1 , 2 Hans Petter Langtangen

Monte-Carlo Tree Search Mich` ele Sebag TAO: Theme Apprentissage & Optimization

Monte-Carlo Planning: Basic Principles and Recent Progress Alan - PowerPoint PPT Presentation

Monte-Carlo Planning: Basic Principles and Recent Progress Alan Fern School of EECS Oregon State University 1 Outline Preliminaries: Markov Decision Processes What is Monte-Carlo Planning? Uniform Monte-Carlo Single State Case

Monte Carlo Generators Monte Carlo Generators Monte Carlo Generators QCD Lecture III P .

Monte Carlo Methods Guojin Chen Christopher Cprek Chris Rambicure Monte Carlo Methods 1.

Monte Carlo Approximation of Monte Carlo Filters Adam M. Johansen et al. Collaborators Include:

BROCHURE 2019 TETRA JUICES DEL MONTE DEL MONTE 6 x 1L GOLD PINEAPPLE 6 x 1L 6 x 1L 6 x 1L

Chapter 5: Monte Carlo Methods Monte Carlo methods are learning methods Experience

Draft Introduction to (randomized) quasi-Monte Carlo Pierre LEcuyer MCQMC Conference,

Monte Carlo Estimation 7 January 2019 OSU CSE 1 Monte Carlo Methods Class of computational

Monte Carlo Localization Ximing Yu March 24, 2009 Ximing Yu Monte Carlo Localization 1

Monte Carlo Control CMPUT 366: Intelligent Systems S&amp;B 5.3-5.5, 5.7 Lecture Outline 1.

4. THE MONTE CARLO METHOD 4.1 I ntroduction This chapter is aimed at describing the Monte Carlo

Techniques in Artificial Intelligence - Part I Todd W. Neller Gettysburg College Monte Carlo

Introduction to Monte Carlo Method Andrzej Palczewski and Jan Palczewski Introduction to Monte

Draft 1 Density estimation by Monte Carlo and randomized quasi-Monte Carlo (RQMC) Pierre

Barrier Option Pricing Introduction Barrier Options and Monte Carlo Simulations The

Monte Carlo Integration Monte Carlo methods use random numbers for sampling Although

Monte-Carlo tree search for Monte-Carlo tree search for multi-player, no-limit multi-player,

A Monte-Carlo Option-Pricing Algorithm for Log-Uniform Jump-Diffusion Model Floyd B. Hanson

Monte-Carlo Game Tree Search: Basic Techniques Tsan-sheng Hsu tshsu@iis.sinica.edu.tw

Monte Carlo Methods Monte Carlo Methods I, at any rate, am convinced that He does not throw dice.

1 Simple example of simulation Slide 4 Evaluation of expected value of production E(V)

High-Dimensional and Multi-Failure- Region SRAM Yield Analysis Xiao Shi 1,2 , Hao Yan 3 , Jinxin

Statistical Methods and Monte Carlo simulation in High Energy Physics Dr. Leonid Serkin

Ch.8.1-8.3: Random numbers and Monte Carlo simulation Joakim Sundnes 1 , 2 Hans Petter Langtangen

Monte-Carlo Tree Search Mich` ele Sebag TAO: Theme Apprentissage &amp; Optimization

Monte Carlo Control CMPUT 366: Intelligent Systems S&B 5.3-5.5, 5.7 Lecture Outline 1.

Monte-Carlo Tree Search Mich` ele Sebag TAO: Theme Apprentissage & Optimization