Online Planning
3/1/17
Q-Learning vs MCTS

Q-learning (dynamic programming):
- Update depends on prior estimates for other states:
  Q(s, a) ← α [R + γ V(s′)] + (1 − α) [old Q(s, a)]
- The update can happen as soon as R and s′ are known.
- Q estimates are kept for every (s, a) pair we encounter.

MCTS (backpropagation):
- Update uses all rewards from a full rollout:
  Q(s, a) ← average of Σ_{t=T}^{end} γ^(t−T) R_t and old Q(s, a)
- The update can only happen once the rollout finishes.
- Q estimates are kept only for nodes already in the tree.

Both converge to correct Q(s, a) estimates! Which method should we use in MCTS for MDPs?
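As a minimal sketch, here are the two updates in Python, assuming tabular estimates stored in dictionaries; the step size alpha and the visit-count table N used for the running average are my assumptions, not from the slides.

```python
gamma = 0.9   # discount factor
alpha = 0.1   # Q-learning step size (assumed value)

def q_learning_update(Q, V, s, a, R, s_next):
    # Dynamic programming: bootstrap off the prior estimate V(s') for another state.
    Q[(s, a)] = alpha * (R + gamma * V[s_next]) + (1 - alpha) * Q[(s, a)]

def mcts_backprop_update(Q, N, s, a, rewards, T):
    # Backpropagation: use all rewards from step T to the end of the rollout.
    rollout_value = sum(gamma ** (t - T) * rewards[t] for t in range(T, len(rewards)))
    # Fold the new rollout value into the old Q(s, a) as a running average.
    N[(s, a)] += 1
    Q[(s, a)] += (rollout_value - Q[(s, a)]) / N[(s, a)]
```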
Our approach to MDPs so far: learn the value model completely, then pick optimal actions.

Alternative approach: learn the (local) value model well enough to find a good action for the current state, take that action, then continue learning.

When is online reasoning a good idea?

Note: online learning (taking actions while you’re still learning) comes up in many machine learning contexts.
So far, we’ve been blurring an important distinction. Does the agent:
- take actions in the real world and learn from their consequences, or
- simulate actions and their consequences while deciding how to act?

Q-learning can be applied in either case. For online learning, we care about the difference.
Offline planning works when we know the full model (and can fit the value table in memory). Online planning is attractive when we don’t know the full set of possible states in advance.
In the online planning setting, every time we need to choose an action, we stop and think about it first. “Thinking about it” means simulating future actions to learn from their consequences.
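A rough sketch of that loop, under assumed interfaces: env, simulator, and the helper estimate_q_by_simulation are hypothetical names standing in for whatever simulation-based planner (e.g., MCTS) does the "thinking".

```python
def act_with_online_planning(env, simulator, num_rollouts=100):
    s = env.reset()
    done = False
    while not done:
        # "Thinking about it": simulate future actions from s to get local Q estimates.
        Q = estimate_q_by_simulation(simulator, s, num_rollouts)  # hypothetical helper
        # Take the action that currently looks best, then replan from the next state.
        a = max(Q, key=Q.get)  # Q maps actions available in s to value estimates
        s, reward, done = env.step(a)
```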
Observe sequence of (state, action) pairs and corresponding rewards.
Want to compute value (on the current rollout) for each (s, a) pair, then average with old values.

states:  [s0, s7, s3, s5]
actions: [a0, a0, a2, a1]
rewards: [0, -1, +2, 0, 0, +1, -1]

Compute values for the current rollout, with γ = 0.9.
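A short script for this exercise; variable names are mine, and I assume the rollout continued past the last state shown, which is why the reward list is longer than the state list.

```python
gamma = 0.9
states  = ["s0", "s7", "s3", "s5"]
actions = ["a0", "a0", "a2", "a1"]
rewards = [0, -1, +2, 0, 0, +1, -1]

# Value from step T is the discounted sum of remaining rewards: sum_{t>=T} gamma^(t-T) * R_t.
for T, (s, a) in enumerate(zip(states, actions)):
    value = sum(gamma ** (t - T) * R for t, R in enumerate(rewards[T:], start=T))
    print(f"rollout value for ({s}, {a}): {value:.4f}")
    # Each printed value would then be averaged with the old Q(s, a) estimate.
```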