SLIDE 1

MCTS for MDPs

3/7/18

SLIDE 2

Real-Time Dynamic Programming

Repeat while there’s time remaining:

  • state ← start state
  • repeat until terminal (or depth bound):
    • action ← optimal action in current state
    • V(state) ← R(state) + discount * Q(state, action)
    • Q(state, action) is calculated from V(s’) for all reachable s’. If s’ hasn’t been seen before, initialize V(s’) ← h(s’).
    • state ← result of taking action
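A minimal Python sketch of this loop, assuming a hypothetical `mdp` object with `actions(s)`, `transition_probs(s, a)` (returning a dict of successor states to probabilities), `reward(s)`, and `is_terminal(s)`, plus a heuristic `h(s)`; none of these names come from the slides.

```python
import random

def rtdp(mdp, start_state, h, discount=0.95, depth_bound=100, n_trials=10_000):
    """Real-Time Dynamic Programming: greedy rollouts with Bellman backups."""
    V = {}  # state -> current value estimate

    def value(s):
        if s not in V:
            V[s] = h(s)                 # first visit: initialize from the heuristic
        return V[s]

    def q(s, a):
        # Expected value of the successor state under action a.
        return sum(p * value(s2) for s2, p in mdp.transition_probs(s, a).items())

    for _ in range(n_trials):           # stand-in for "while there's time remaining"
        state, depth = start_state, 0
        while not mdp.is_terminal(state) and depth < depth_bound:
            # Optimal action with respect to the current value estimates.
            action = max(mdp.actions(state), key=lambda a: q(state, a))
            # Backup: V(state) <- R(state) + discount * Q(state, action)
            V[state] = mdp.reward(state) + discount * q(state, action)
            # Take the action: sample the next state from the transition model.
            succs, probs = zip(*mdp.transition_probs(state, action).items())
            state = random.choices(succs, weights=probs)[0]
            depth += 1
    return V
```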
SLIDE 3

RTDP does rollouts and backprop

Rollouts:

  • Repeatedly select actions until terminal.

Backpropagation:

  • Update policy/value for visited states/actions.

It’s not doing either of these things particularly well:

  • Greedy action selection means no exploration.
  • Updating every state means lots of storage.

MCTS is a better version of the same thing!

SLIDE 4

MCTS Review

  • Selection
    • Runs in the already-explored part of the state space.
    • Choose a random action, according to UCB weights.
  • Expansion
    • When we first encounter something unexplored.
    • Choose an unexplored action uniformly at random.
  • Simulation
    • After we’ve left the known region.
    • Select actions randomly according to the default policy.
  • Backpropagation
    • Update values for states visited in selection/expansion.
    • Average previous values with value on current rollout.
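A compact Python sketch of one MCTS rollout for an MDP, tying these four phases together. The `mdp` interface (`actions`, `transition_probs`, `reward`, `is_terminal`), `select_ucb`, and `default_policy` are assumed arguments, not names from the slides; the UCB rule and backprop formula are detailed on later slides.

```python
import random

def sample_next(mdp, state, action):
    """Sample a successor state from the transition distribution."""
    succs, probs = zip(*mdp.transition_probs(state, action).items())
    return random.choices(succs, weights=probs)[0]

def mcts_rollout(mdp, root, Q, n_s, n_sa, select_ucb, default_policy,
                 gamma=0.9, depth_bound=50):
    """One rollout: selection, expansion, simulation, then backpropagation."""
    path, rewards = [], []          # saved (state, action) pairs and per-step rewards
    state, depth = root, 0

    # Selection (plus one Expansion step): stay in the explored region via UCB,
    # and take the first unexplored action we encounter, chosen at random.
    while not mdp.is_terminal(state) and depth < depth_bound:
        untried = [a for a in mdp.actions(state) if (state, a) not in n_sa]
        action = random.choice(untried) if untried else select_ucb(state)
        path.append((state, action))
        rewards.append(mdp.reward(state))
        state = sample_next(mdp, state, action)
        depth += 1
        if untried:                 # we just expanded; switch to simulation
            break

    # Simulation: follow the default (e.g., uniformly random) policy.
    while not mdp.is_terminal(state) and depth < depth_bound:
        rewards.append(mdp.reward(state))
        state = sample_next(mdp, state, default_policy(state))
        depth += 1

    # Backpropagation: average each visited (s, a)'s discounted return into Q.
    ret = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        ret = rewards[t] + gamma * ret
        if t < len(path):
            s, a = path[t]
            n_s[s] = n_s.get(s, 0) + 1
            n_sa[(s, a)] = n_sa.get((s, a), 0) + 1
            Q[(s, a)] = (ret + (n_sa[(s, a)] - 1) * Q.get((s, a), 0.0)) / n_sa[(s, a)]
```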
SLIDE 5

Differences from game-playing MCTS

  • Learning state/action values instead of state values.
  • The next state is non-deterministic.
  • Simulation may never reach a terminal state.
  • There is no longer a tree-structure to the states.
  • Non-terminal states can have rewards.
  • Rewards in the future need to be discounted.
SLIDE 6

Online vs. Offline Planning

Offline: do a bunch of thinking before you start, to figure out a complete plan.

Online: do a little bit of thinking to come up with the next (few) action(s), then do more planning later.

Are RTDP and MCTS online or offline planners?

…or both?

SLIDE 7

UCB Exploration Policy

Formula from today’s reading:

  Policy(s) = argmin_{a ∈ A(s)} ( Q̂(s, a) − C · √( ln(n_s) / n_{s,a} ) )

where n_s is the number of visits to state s and n_{s,a} is the number of trials of action a in state s.

  • We now need to track visits for each state/action.

How does this differ from the UCB formula we saw two weeks ago?
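A small Python sketch of this selection rule, assuming cost-style Q̂ values as in the reading (lower is better) and plain dictionaries of visit counts; the function and variable names are illustrative, not from the slides.

```python
import math

def ucb_policy(state, actions, Q, n_s, n_sa, C=1.0):
    """argmin over actions of Q-hat(s, a) minus the exploration bonus."""
    def score(a):
        if n_sa.get((state, a), 0) == 0:
            return float("-inf")        # untried actions look maximally attractive
        bonus = C * math.sqrt(math.log(n_s[state]) / n_sa[(state, a)])
        return Q.get((state, a), 0.0) - bonus
    return min(actions, key=score)
```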

SLIDE 8

MCTS Backprop for MDPs

  • Track all rewards experienced during rollout.
  • At the end of a rollout, update Q-values for the states/actions experienced during that rollout.
  • Also update visits.

Update:

  R≥T = Σ_{t=T}^{end} γ^(t−T) · R(t)

  Q(s, a) ← ( R≥T + (n_{s,a} − 1) · Q(s, a) ) / n_{s,a}
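A Python sketch of this update, assuming the rollout has been stored as a list of per-step rewards plus a parallel list of (state, action) pairs for the selection/expansion prefix (simulation steps contribute only rewards).

```python
def backprop(path, rewards, Q, n_s, n_sa, gamma=0.9):
    """Update Q-values and visit counts from one completed rollout.

    path    -- [(s, a), ...] for the selection/expansion steps (a prefix of the rollout)
    rewards -- [R(0), R(1), ...] for every step of the rollout, simulation included
    """
    for T, (s, a) in enumerate(path):
        # R>=T: discounted sum of rewards from step T to the end of the rollout.
        ret = sum(gamma ** (t - T) * rewards[t] for t in range(T, len(rewards)))
        n_s[s] = n_s.get(s, 0) + 1
        n_sa[(s, a)] = n_sa.get((s, a), 0) + 1
        # Q(s, a) <- (R>=T + (n_{s,a} - 1) * Q(s, a)) / n_{s,a}
        Q[(s, a)] = (ret + (n_sa[(s, a)] - 1) * Q.get((s, a), 0.0)) / n_sa[(s, a)]
```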

SLIDE 9

Notes on the backpropagation step

  • This doesn’t depend on any other Q-value.
  • We’re no longer doing dynamic programming.
  • R≥T can be computed incrementally:

  R≥T = γ · R≥T+1 + R(T)

SLIDE 10

Online MCTS Value Backup

Observe sequence of (state, action) pairs and corresponding rewards.

  • Save (state, action, reward) during selection/expansion
  • Save only reward during simulation

Want to compute the value (on the current rollout) for each (s, a) pair, then average with the old values.

Example, with γ = 0.9:

  states:  [s0, s7, s3, s5]
  actions: [a0, a0, a2, a1]
  rewards: [ 0, -1, +2,  0,  0, +1, -1]

Compute the values for the current rollout using R≥T = γ · R≥T+1 + R(T).
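A short Python sketch working through this example; the state and action names are only labels, and the printed numbers are the per-step discounted returns for the saved (state, action) pairs, computed with the backwards incremental recursion from the previous slide.

```python
def rollout_values(rewards, gamma=0.9):
    """R>=T for every step, computed backwards: R>=T = R(T) + gamma * R>=T+1."""
    returns, ret = [], 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))

# Example data from the slide.
states  = ["s0", "s7", "s3", "s5"]
actions = ["a0", "a0", "a2", "a1"]
rewards = [0, -1, +2, 0, 0, +1, -1]   # first 4 steps saved (state, action); rest are simulation-only

values = rollout_values(rewards, gamma=0.9)
# Only the first len(states) returns pair up with saved (state, action) entries.
for s, a, v in zip(states, actions, values):
    print(f"rollout value for ({s}, {a}): {v:.3f}")
```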

SLIDE 11

When should a rollout end?

A rollout ends if a terminal state is reached.

  • Will we always reach a terminal state?
  • If not, what can we do about it?
  • As t grows, γ^t gets exponentially smaller.
  • Eventually γ^t will be small enough that rewards have a negligible effect on the start state’s values.
  • This means we can set a depth limit on our rollouts.
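As a rough sanity check (not from the slides): if per-step rewards are bounded by R_max, everything after depth d contributes at most γ^d · R_max / (1 − γ) to the start state’s value, so choosing d with γ^d · R_max / (1 − γ) < ε, i.e. d > log(ε · (1 − γ) / R_max) / log(γ), keeps the truncation error below ε. With γ = 0.9, R_max = 1, and ε = 0.01, that works out to a depth limit of roughly 66 steps.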
SLIDE 12

Heuristics

  • If we cut off rollouts at some depth, we may not have found any useful rewards.
  • If we have a heuristic that helps us estimate future value, we could evaluate it at the end of the rollout.
  • We could also change the agent’s rewards to give it intermediate goals.
  • This is called reward shaping, and is a topic of active research in reinforcement learning.
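A minimal sketch of the first idea (bootstrapping a truncated rollout with a heuristic), assuming a hypothetical `h(state)` that estimates discounted future value from the cutoff state; it simply seeds the backwards return computation from the earlier slides with h instead of 0.

```python
def truncated_rollout_values(rewards, cutoff_state, h, gamma=0.9):
    """R>=T for a depth-limited rollout, seeding the tail with a heuristic estimate.

    rewards      -- per-step rewards observed before the depth limit was hit
    cutoff_state -- the state where the rollout was truncated
    h            -- heuristic estimate of discounted future value from a state
    """
    returns, ret = [], h(cutoff_state)   # heuristic stands in for the missing tail
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.append(ret)
    return list(reversed(returns))
```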