Class Structure Last time: Batch RL This Time: MCTS Next time: - PowerPoint PPT Presentation

Lecture 14: MCTS 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 2 With many slides from or derived from David Silver Lecture 14: MCTS 3 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 1 / 57

Class Structure Last time: Batch RL This Time: MCTS Next time: Human in the Loop RL Lecture 14: MCTS 4 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 2 / 57

Table of Contents Introduction 1 Model-Based Reinforcement Learning 2 Simulation-Based Search 3 Integrated Architectures 4 Lecture 14: MCTS 5 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 3 / 57

Model-Based Reinforcement Learning Previous lectures : learn value function or policy or directly from experience This lecture : learn model directly from experience and use planning to construct a value function or policy Integrate learning and planning into a single architecture Lecture 14: MCTS 6 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 4 / 57

Model-Based and Model-Free RL Model-Free RL No model Learn value function (and/or policy) from experience Lecture 14: MCTS 7 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 5 / 57

Model-Based and Model-Free RL Model-Free RL No model Learn value function (and/or policy) from experience Model-Based RL Learn a model from experience Plan value function (and/or policy) from model Lecture 14: MCTS 8 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 6 / 57

Model-Free RL Lecture 14: MCTS 9 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 7 / 57

Model-Based RL Lecture 14: MCTS 10 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 8 / 57

Model-Based RL Lecture 14: MCTS 12 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 10 / 57

Advantages of Model-Based RL Advantages: Can efficiently learn model by supervised learning methods Can reason about model uncertainty (like in upper confidence bound methods for exploration/exploitation trade offs) Disadvantages First learn a model, then construct a value function ⇒ two sources of approximation error Lecture 14: MCTS 13 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 11 / 57

MDP Model Refresher A model M is a representation of an MDP < S , A , P , R > , parametrized by η We will assume state space S and action space A are known So a model M = < P η , R η > represents state transitions P η ≈ P and rewards R η ≈ R S t +1 ∼ P η ( S t +1 | S t , A t ) R t +1 = R η ( R t +1 | S t , A t ) Typically assume conditional independence between state transitions and rewards P [ S t +1 , R t +1 | S t , A t ] = P [ S t +1 | S t , A t ] P [ R t +1 | S t , A t ] Lecture 14: MCTS 14 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 12 / 57

Model Learning Goal: estimate model M η from experience { S 1 , A 1 , R 2 , ..., S T } This is a supervised learning problem S 1 , A 1 → R 2 , S 2 S 2 A 2 → R 3 , S 3 . . . S T − 1 , A T − 1 → R T , S T Learning s , a → r is a regression problem Learning s , a → s ′ is a density estimation problem Pick loss function, e.g. mean-squared error, KL divergence, . . . Find parameters η that minimize empirical loss Lecture 14: MCTS 15 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 13 / 57

Examples of Models Table Lookup Model Linear Expectation Model Linear Gaussian Model Gaussian Process Model Deep Belief Network Model . . . Lecture 14: MCTS 16 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 14 / 57

Table Lookup Model Model is an explicit MDP, ˆ P , ˆ R Count visits N ( s , a ) to each state action pair T 1 P a ˆ � ✶ ( S t , A t , S t +1 = s , a , s ′ ) s , s ′ = N ( s , a ) t =1 T 1 ˆ � R a s = ✶ ( S t , A t = s , a ) N ( s , a ) t =1 Alternatively At each time-step t, record experience tuple < S t , A t , R t +1 , S t +1 > To sample model, randomly pick tuple matching < s , a , · , · > Lecture 14: MCTS 17 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 15 / 57

AB Example Two states A,B; no discounting; 8 episodes of experience We have constructed a table lookup model from the experience Recall: For a particular policy, TD with a tabular representation with infinite experience replay will converge to the same value as computed if construct a MLE model and do planning Check Your Memory: Will MC methods converge to the same solution? Lecture 14: MCTS 18 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 16 / 57

Planning with a Model Given a model M η = < P η , R η > Solve the MDP < S , A , P η , R η > Using favourite planning algorithm Value iteration Policy iteration Tree search · · · Lecture 14: MCTS 19 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 17 / 57

Sample-Based Planning A simple but powerful approach to planning Use the model only to generate samples Sample experience from model S t +1 ∼ P η ( S t +1 | S t , A t ) R t +1 = R η ( R t +1 | S t , A t ) Apply model-free RL to samples, e.g.: Monte-Carlo control Sarsa Q-learning Sample-based planning methods are often more efficient Lecture 14: MCTS 20 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 18 / 57

Back to the AB Example Construct a table-lookup model from real experience Apply model-free RL to sampled experience Real experience Sampled experience B, 1 A, 0, B, 0 B, 0 B, 1 B, 1 B, 1 A, 0 B, 1 B, 1 B, 1 B, 1 A, 0 B, 1 B, 1 B, 1 B, 0 B, 1 B, 0 e.g. Monte-Carlo learning: V (A) = 1, V (B) = 0.75 Check Your Memory: What would have MC on the original experience have converged to? Lecture 14: MCTS 21 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 19 / 57

Planning with an Inaccurate Model Given an imperfect model < P η , R η > � = < P , R > Performance of model-based RL is limited to optimal policy for approximate MDP < S , A , P η , R η > i.e. Model-based RL is only as good as the estimated model When the model is inaccurate, planning process will compute a sub-optimal policy Solution 1: when model is wrong, use model-free RL Solution 2: reason explicitly about model uncertainty (see Lectures on Exploration / Exploitation) Lecture 14: MCTS 22 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 20 / 57

Forward Search Forward search algorithms select the best action by lookahead They build a search tree with the current state st at the root Using a model of the MDP to look ahead No need to solve whole MDP, just sub-MDP starting from now Lecture 14: MCTS 24 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 22 / 57

Simulation-Based Search Forward search paradigm using sample-based planning Simulate episodes of experience from now with the model Apply model-free RL to simulated episodes Lecture 14: MCTS 25 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 23 / 57

Simulation-Based Search (2) Simulate episodes of experience from now with the model { S k t , A k t , R k t +1 , ..., S k T } K k =1 ∼ M v Apply model-free RL to simulated episodes Monte-Carlo control → Monte-Carlo search Sarsa → TD search Lecture 14: MCTS 26 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 24 / 57

Simple Monte-Carlo Search Given a model M v and a simulation policy π For each action a ∈ A Simulate K episodes from current (real) state s t { s t , a , R k t +1 , ..., S k T } K k =1 ∼ M v , π Evaluate actions by mean return (Monte-Carlo evaluation) K Q ( s t , a ) = 1 P � G t − → q π ( s t , a ) (1) K k =1 Select current (real) action with maximum value a t = argmin Q ( s t , a ) a ∈ A Lecture 14: MCTS 27 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 25 / 57

Recall Expectimax Tree If have a MDP model M v Can compute optimal q ( s , a ) values for current state by constructing an expectimax tree Limitations: Size of tree scales as Lecture 14: MCTS 28 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 26 / 57

Monte-Carlo Tree Search (MCTS) Given a model M v Build a search tree rooted at the current state s t Samples actions and next states Iteratively construct and update tree by performing K simulation episodes starting from the root state After search is finished, select current (real) action with maximum value in search tree a t = argmin Q ( s t , a ) a ∈ A Lecture 14: MCTS 29 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 27 / 57

Class Structure Last time: Batch RL This Time: MCTS Next time: - PowerPoint PPT Presentation

Lecture 14: MCTS 2 Emma Brunskill CS234 Reinforcement Learning. Winter 2018 2 With many slides from or derived from David Silver Lecture 14: MCTS 3 Emma Brunskill (CS234 Reinforcement Learning. ) Winter 2018 1 / 57 Class Structure Last time:

Programming Abstraction in C++ Eric S. Roberts and Julie Zelenski Stanford University 2010

BIBLICAL SURVEY Introductory Class Introductory Class BIBLICAL SURVEY Introductory Class

Curriculum on The Cadet Corps Uniform Class A Uniform Class A Uniform Agenda C1. Class A

Electing Your Membership Class Class TG, Class TH, or Class DC As a school employee who

TwissOptics Class Joschua Dilly TwissOptics Class 2 The TwissOptics Class Resonance Driving

3/14/16 Review Class/Object Type Class Keyword class class Point

61A Lecture 21 Announcements Binary Trees Binary Tree Class 4 Binary Tree Class class

STRUCTURE STRUCTURE Highlight the structure of Highlight the structure of material material

Part IV I/O System Chapter 12: Mass Storage Structure Chapter 12: Mass Storage Structure 1

Curriculum on The Cadet Corps Uniform Wear It WIth honor Class C Uniform Class C Uniform

Classroom Assessment Scoring System (CLASS) 104 B New Report Format Interpreting your CLASS

Essential Oils Class with Jami Borlik 1 Essential Oils Class with Jami Borlik 2 Essential Oils

Inheritance II Is-a versus has-a When an object of class A has a n object of class B, use

Inheritance A class can be a sub-type of another class The inheriting class contains all

UML Class Diagrams Steven Zeil February 25, 2013 UML Class Diagrams Outline Class

Multiple inheritance Multiple inheritance Can derive a class from more than one base class

Zoom Logistics When listening, please set your video off and mute your side Please feel free to

Deep Reinforcement Learning 1 Outline 1. Overview of Reinforcement Learning 2. Policy Search 3.

Mobile Experience Sampling Reaching the Parts of Facebook

Health Search From Consumers to Clinicians Slides available at

Simulation - Lectures Yee Whye Teh Part A Simulation TT 2013 Part A Simulation. TT 2013. Yee

Surveys, interviews, and diary studies Michelle Mazurek (some slides adapted from Blase Ur,

Planning and Learning Robert Platt Northeastern University (some slides/material borrowed from

Lecture 4/Chapter 4 How to Get a Good Sample Sampling Activity Study Designs; Focus on