SLIDE 1

Lecture 16: MCTS¹

Emma Brunskill

CS234 Reinforcement Learning

Winter 2020

¹ With many slides from or derived from David Silver

SLIDE 2

Zoom Logistics

When listening, please set your video off and mute your side. Please feel free to ask questions! To do so, at the bottom of your screen under Participants there should be an option to "raise your hand." That alerts me that you have a question. Note that in the chat session you can send a note to me, to everyone, or to a specific person in the session. The last one can be useful for discussing a "check your understanding" item.

This is our first time doing this, so thanks for your patience as we work through this together! We will be releasing details of the poster session tomorrow.

SLIDE 3

Refresh Your Understanding: Batch RL

Select all that are true:

1. Batch RL refers to when we have many agents acting in a batch.
2. In batch RL we generally care more about sample efficiency than computational efficiency.
3. Importance sampling can be used to get an unbiased estimate of policy performance.
4. Q-learning can be used in batch RL and will generally provide a better estimate than importance sampling in Markov environments, for any function approximator used for the Q.
5. Not sure

SLIDE 4

Quiz Results

SLIDE 5

Class Structure

Last time: Quiz
This time: MCTS
Next time: Poster session

SLIDE 6

Monte Carlo Tree Search

Why cover this as well?
Responsible in part for one of the greatest achievements in AI in the last decade: becoming a better Go player than any human
Brings in ideas of model-based RL and the benefits of planning

SLIDE 7

Table of Contents

1. Introduction
2. Model-Based Reinforcement Learning
3. Simulation-Based Search

SLIDE 8

Introduction: Model-Based Reinforcement Learning

Previous lectures: for online learning, learn a value function or policy directly from experience
This lecture: for online learning, learn a model directly from experience, and use planning to construct a value function or policy
Integrate learning and planning into a single architecture

SLIDE 9

Model-Based and Model-Free RL

Model-Free RL

No model
Learn value function (and/or policy) from experience

SLIDE 10

Model-Based and Model-Free RL

Model-Free RL

No model
Learn value function (and/or policy) from experience

Model-Based RL

Learn a model from experience
Plan value function (and/or policy) from model

SLIDE 11

Model-Free RL

SLIDE 12

Model-Based RL

SLIDE 13

Table of Contents

1. Introduction
2. Model-Based Reinforcement Learning
3. Simulation-Based Search

SLIDE 14

Model-Based RL

SLIDE 15

Advantages of Model-Based RL

Advantages:

Can efficiently learn the model by supervised learning methods
Can reason about model uncertainty (as in upper confidence bound methods for exploration/exploitation trade-offs)

Disadvantages:

First learn a model, then construct a value function ⇒ two sources of approximation error

SLIDE 16

MDP Model Refresher

A model $\mathcal{M}$ is a representation of an MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R} \rangle$, parametrized by $\eta$
We will assume the state space $\mathcal{S}$ and action space $\mathcal{A}$ are known
So a model $\mathcal{M} = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ represents state transitions $\mathcal{P}_\eta \approx \mathcal{P}$ and rewards $\mathcal{R}_\eta \approx \mathcal{R}$:
$$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$$
$$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$
Typically we assume conditional independence between state transitions and rewards:
$$\mathbb{P}[S_{t+1}, R_{t+1} \mid S_t, A_t] = \mathbb{P}[S_{t+1} \mid S_t, A_t]\,\mathbb{P}[R_{t+1} \mid S_t, A_t]$$
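To make the notation concrete, here is a minimal sketch (class name and interface are hypothetical, not from the lecture) of a tabular model $\mathcal{M}_\eta$ that supports sampling next states and rewards:

```python
import numpy as np

class TabularModel:
    """Minimal sketch of a parametrized model M_eta = <P_eta, R_eta>
    over known, finite S and A (names and interface are hypothetical)."""

    def __init__(self, n_states, n_actions, seed=0):
        # P[s, a] is a distribution over next states; R[s, a] is the
        # expected reward for taking a in s. Initialized arbitrarily.
        self.P = np.full((n_states, n_actions, n_states), 1.0 / n_states)
        self.R = np.zeros((n_states, n_actions))
        self.rng = np.random.default_rng(seed)

    def sample(self, s, a):
        """S_{t+1} ~ P_eta(. | s, a), with reward R_eta(s, a)."""
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        return int(s_next), float(self.R[s, a])
```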

SLIDE 17

Model Learning

Goal: estimate model $\mathcal{M}_\eta$ from experience $\{S_1, A_1, R_2, \ldots, S_T\}$
This is a supervised learning problem:
$$S_1, A_1 \rightarrow R_2, S_2$$
$$S_2, A_2 \rightarrow R_3, S_3$$
$$\vdots$$
$$S_{T-1}, A_{T-1} \rightarrow R_T, S_T$$
Learning $s, a \rightarrow r$ is a regression problem
Learning $s, a \rightarrow s'$ is a density estimation problem
Pick a loss function, e.g. mean-squared error, KL divergence, ...
Find parameters $\eta$ that minimize the empirical loss
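As a small illustration of the regression view, one might fit expected rewards by least squares over a feature map $\phi(s, a)$ (the feature map and function name here are hypothetical):

```python
import numpy as np

def fit_reward_weights(phi, rewards):
    """Least-squares regression s, a -> r: find w minimizing the
    empirical mean-squared error ||phi @ w - rewards||^2.
    phi: (T, d) features of the visited (S_t, A_t) pairs.
    rewards: (T,) observed next rewards R_{t+1}."""
    w, *_ = np.linalg.lstsq(phi, rewards, rcond=None)
    return w
```

Learning $s, a \rightarrow s'$ would analogously minimize e.g. a negative log-likelihood; the table-lookup counts two slides below are the simplest instance.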

SLIDE 18

Examples of Models

Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
...

SLIDE 19

Table Lookup Model

Model is an explicit MDP, $\hat{\mathcal{P}}$, $\hat{\mathcal{R}}$
Count visits $N(s, a)$ to each state-action pair:
$$\hat{\mathcal{P}}^a_{s,s'} = \frac{1}{N(s, a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t, S_{t+1} = s, a, s')$$
$$\hat{\mathcal{R}}^a_s = \frac{1}{N(s, a)} \sum_{t=1}^{T} \mathbf{1}(S_t, A_t = s, a)\, R_t$$
Alternatively:
At each time-step $t$, record experience tuple $\langle S_t, A_t, R_{t+1}, S_{t+1} \rangle$
To sample from the model, randomly pick a tuple matching $\langle s, a, \cdot, \cdot \rangle$
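A minimal sketch of the counting estimator above (the function name and the `(s, a, r, s_next)` tuple format are assumptions for illustration):

```python
import numpy as np

def table_lookup_model(transitions, n_states, n_actions):
    """Estimate P_hat and R_hat by counting, as in the equations above.
    transitions: iterable of (s, a, r, s_next) experience tuples."""
    N = np.zeros((n_states, n_actions))
    P_hat = np.zeros((n_states, n_actions, n_states))
    R_hat = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        N[s, a] += 1.0
        P_hat[s, a, s_next] += 1.0
        R_hat[s, a] += r
    visited = N > 0                      # avoid dividing by zero counts
    P_hat[visited] /= N[visited][:, None]
    R_hat[visited] /= N[visited]
    return P_hat, R_hat
```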

SLIDE 20

AB Example

Two states A, B; no discounting; 8 episodes of experience
We have constructed a table lookup model from the experience
Recall: for a particular policy, TD with a tabular representation and infinite experience replay will converge to the same value as computed by constructing an MLE model and doing planning
Check Your Memory: will MC methods converge to the same solution?

SLIDE 21

Planning with a Model

Given a model $\mathcal{M}_\eta = \langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
Solve the MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$ using your favourite planning algorithm:
Value iteration
Policy iteration
Tree search
...
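For instance, a sketch of value iteration run directly on the estimated model (array shapes and defaults are assumptions; this is not code from the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Plan in the estimated MDP <S, A, P_eta, R_eta>.
    P: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # Bellman optimality backup, (S, A)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # values and a greedy policy
        V = V_new
```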

SLIDE 22

Sample-Based Planning

A simple but powerful approach to planning
Use the model only to generate samples
Sample experience from the model:
$$S_{t+1} \sim \mathcal{P}_\eta(S_{t+1} \mid S_t, A_t)$$
$$R_{t+1} = \mathcal{R}_\eta(R_{t+1} \mid S_t, A_t)$$
Apply model-free RL to the samples, e.g.:
Monte-Carlo control
Sarsa
Q-learning
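A sketch of sample-based planning with Q-learning, using the hypothetical `model.sample(s, a)` interface from the model refresher above (all parameter values are illustrative):

```python
import numpy as np

def q_learning_on_model(model, n_states, n_actions, n_episodes=1000,
                        horizon=50, alpha=0.1, gamma=0.95, eps=0.1):
    """Run ordinary Q-learning, but on transitions sampled from the
    learned model rather than from the real environment."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = int(rng.integers(n_states))      # illustrative start-state dist.
        for _ in range(horizon):
            # epsilon-greedy over the current Q estimates
            a = int(rng.integers(n_actions)) if rng.random() < eps \
                else int(Q[s].argmax())
            s_next, r = model.sample(s, a)   # simulated, not real, experience
            Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
            s = s_next
    return Q
```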

SLIDE 23

Planning with an Inaccurate Model

Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$
Performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
i.e. model-based RL is only as good as the estimated model
When the model is inaccurate, the planning process will compute a sub-optimal policy

SLIDE 24

Back to the AB Example

Construct a table-lookup model from real experience
Apply model-free RL to sampled experience

Real experience:
A, 0, B, 0
B, 1
B, 1

What values will TD with the estimated model converge to? Is this correct?

SLIDE 25

Planning with an Inaccurate Model

Given an imperfect model $\langle \mathcal{P}_\eta, \mathcal{R}_\eta \rangle \neq \langle \mathcal{P}, \mathcal{R} \rangle$
Performance of model-based RL is limited to the optimal policy for the approximate MDP $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}_\eta, \mathcal{R}_\eta \rangle$
i.e. model-based RL is only as good as the estimated model
When the model is inaccurate, the planning process will compute a sub-optimal policy
Solution 1: when the model is wrong, use model-free RL
Solution 2: reason explicitly about model uncertainty (see the lectures on exploration/exploitation)

SLIDE 26

Table of Contents

1. Introduction
2. Model-Based Reinforcement Learning
3. Simulation-Based Search

SLIDE 27

Computing Action for Current State Only

Previously we would compute a policy for the whole state space

SLIDE 28

Simulation-Based Search

Simulate episodes of experience from now, using the model, starting from the current state $S_t$:
$$\{S_t^k, A_t^k, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu$$
Apply model-free RL to the simulated episodes:
Monte-Carlo control → Monte-Carlo search
Sarsa → TD search

SLIDE 29

Simple Monte-Carlo Search

Given a model $\mathcal{M}_\nu$ and a simulation policy $\pi$
For each action $a \in \mathcal{A}$:
Simulate $K$ episodes from the current (real) state $s_t$:
$$\{s_t, a, R_{t+1}^k, \ldots, S_T^k\}_{k=1}^{K} \sim \mathcal{M}_\nu, \pi$$
Evaluate each action by its mean return (Monte-Carlo evaluation):
$$Q(s_t, a) = \frac{1}{K} \sum_{k=1}^{K} G_t \xrightarrow{P} q_\pi(s_t, a) \quad (1)$$
Select the current (real) action with maximum value:
$$a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$$
This is essentially doing one step of policy improvement
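A sketch of simple Monte-Carlo search under these definitions (again using the hypothetical `model.sample(s, a)` interface; `pi` is the fixed simulation policy, and the finite horizon is an assumption):

```python
def simple_mc_search(model, s_t, actions, pi, K=100, horizon=50, gamma=1.0):
    """For each candidate action, simulate K episodes that start with that
    action and then follow pi; score the action by its mean return and act
    greedily (one step of policy improvement)."""
    Q = {}
    for a in actions:
        total = 0.0
        for _ in range(K):
            s, r = model.sample(s_t, a)      # first action is forced to a
            g, disc = r, gamma
            for _ in range(horizon - 1):     # then follow the sim. policy
                s, r = model.sample(s, pi(s))
                g += disc * r
                disc *= gamma
            total += g
        Q[a] = total / K                     # Q(s_t, a) -> q_pi(s_t, a)
    return max(Q, key=Q.get)                 # a_t = argmax_a Q(s_t, a)
```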

SLIDE 30

Simulation-Based Search

Simulate episodes of experience from now with the model
Apply model-free RL to the simulated episodes

SLIDE 31

Expectimax Tree

Can we do better than one step of policy improvement?
If we have an MDP model $\mathcal{M}_\nu$, we can compute optimal $q(s, a)$ values for the current state by constructing an expectimax tree

SLIDE 32

Forward Search Expectimax Tree

Forward search algorithms select the best action by lookahead
They build a search tree with the current state $s_t$ at the root
Using a model of the MDP to look ahead
No need to solve the whole MDP, just the sub-MDP starting from now

SLIDE 33

Expectimax Tree

Can we do better than one step of policy improvement?
If we have an MDP model $\mathcal{M}_\nu$, we can compute optimal $q(s, a)$ values for the current state by constructing an expectimax tree
Limitation: the size of the tree scales exponentially with the lookahead horizon

SLIDE 34

Monte-Carlo Tree Search (MCTS)

Given a model $\mathcal{M}_\nu$
Build a search tree rooted at the current state $s_t$
Sample actions and next states
Iteratively construct and update the tree by performing $K$ simulation episodes starting from the root state
After the search is finished, select the current (real) action with maximum value in the search tree:
$$a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a)$$

SLIDE 35

Monte-Carlo Tree Search

Simulating an episode involves two phases (in-tree, out-of-tree):
Tree policy: pick actions for tree nodes to maximize $Q(S, A)$
Rollout policy: e.g. pick actions randomly, or by another policy
To evaluate the value of a tree node $i$ at state-action pair $(s, a)$, average over all rewards received from that node onwards across the simulated episodes in which this tree node was reached:
$$Q(i) = \frac{1}{N(i)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(i \in epi.k)\, G_k(i) \xrightarrow{P} q(s, a)$$
Under mild conditions, this converges to the optimal search tree, $Q(S, A) \rightarrow q_*(S, A)$
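Putting the two phases together, here is a minimal MCTS sketch under stated assumptions (hashable states, `model.sample(s, a)` returning `(s_next, r)`, undiscounted fixed-horizon episodes, UCB as the tree policy; none of this code is from the lecture):

```python
import math
import random
from collections import defaultdict

def mcts(model, root, actions, n_episodes=1000, horizon=50, c=1.4):
    N = defaultdict(int)       # visit counts, keyed by s and by (s, a)
    Q = defaultdict(float)     # running mean return from each (s, a)
    in_tree = {root}

    def ucb(s, a):
        if N[(s, a)] == 0:
            return float("inf")              # untried arms get priority
        return Q[(s, a)] + c * math.sqrt(math.log(N[s]) / N[(s, a)])

    for _ in range(n_episodes):
        s, path, rewards, expanded = root, [], [], False
        for t in range(horizon):
            if s in in_tree or not expanded:
                if s not in in_tree:         # expand one new node/episode
                    in_tree.add(s)
                    expanded = True
                a = max(actions, key=lambda a: ucb(s, a))  # tree policy
                path.append((t, s, a))
            else:
                a = random.choice(actions)   # rollout policy: random
            s, r = model.sample(s, a)
            rewards.append(r)
        g, returns = 0.0, [0.0] * horizon
        for t in range(horizon - 1, -1, -1): # return from each step onward
            g += rewards[t]
            returns[t] = g
        for t, s_, a_ in path:               # back up mean returns
            N[s_] += 1
            N[(s_, a_)] += 1
            Q[(s_, a_)] += (returns[t] - Q[(s_, a_)]) / N[(s_, a_)]
    return max(actions, key=lambda a: Q[(root, a)])
```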

SLIDE 36

Check Your Understanding: MCTS

MCTS involves deciding on an action to take by doing tree search, where it picks actions to maximize $Q(S, A)$ and samples states. Select all that are true:

1. Given an MDP, MCTS may be a good choice for short horizon problems with a small number of states and actions.
2. Given an MDP, MCTS may be a good choice for long horizon problems with a large action space and a small state space.
3. Given an MDP, MCTS may be a good choice for long horizon problems with a large state space and a small action space.
4. Not sure

SLIDE 37

Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?

SLIDE 38

Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?
UCT: borrow an idea from the bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over the reward of each arm

SLIDE 39

Upper Confidence Tree (UCT) Search

How to select what action to take during a simulated episode?
UCT: borrow an idea from the bandit literature and treat each node where we can select actions as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over the reward of each arm:
$$Q(s, a, i) = \frac{1}{N(s, a, i)} \sum_{k=1}^{K} \sum_{u=t}^{T} \mathbf{1}(i \in epi.k)\, G_k(s, a, i) + c \sqrt{\frac{\ln n(s)}{n(s, a)}}$$
For simplicity we can treat each state node as a separate MAB
For simulated episode $k$ at node $i$, select the action/arm with the highest upper bound to simulate and expand (or evaluate) in the tree:
$$a_{ik} = \arg\max_a Q(s, a, i)$$
This implies that the policy used to simulate episodes (and expand/update the tree) can change across episodes
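The per-node arm selection implied by this bound, as a small sketch (the function name is hypothetical; counts and mean returns are assumed to be maintained elsewhere by the search):

```python
import math

def uct_select(actions, q, n_s, n_sa, c=1.4):
    """Pick the arm maximizing Q(s, a) + c * sqrt(ln n(s) / n(s, a)).
    q and n_sa are dicts keyed by action; n_s is the node's visit count.
    Untried arms score infinity, so each arm is tried at least once."""
    def score(a):
        if n_sa.get(a, 0) == 0:
            return float("inf")
        return q[a] + c * math.sqrt(math.log(n_s) / n_sa[a])
    return max(actions, key=score)
```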

SLIDE 40

Case Study: the Game of Go

Go is 2500 years old
The hardest classic board game
A grand challenge task (John McCarthy)
Traditional game-tree search has failed in Go
Check your understanding: does playing Go involve learning to make decisions in a world where the dynamics and reward model are unknown?

SLIDE 41

Rules of Go

Usually played on a 19x19 board, also 13x13 or 9x9
Simple rules, complex strategy
Black and white place stones alternately
Surrounded stones are captured and removed
The player with more territory wins the game

SLIDE 42

Position Evaluation in Go

How good is a position $s$?
Reward function (undiscounted): $R_t = 0$ for all non-terminal steps $t < T$
$$R_T = \begin{cases} 1, & \text{if Black wins} \\ 0, & \text{if White wins} \end{cases}$$
Policy $\pi = \langle \pi_B, \pi_W \rangle$ selects moves for both players
Value function (how good is position $s$):
$$v_\pi(s) = \mathbb{E}_\pi[R_T \mid S = s] = \mathbb{P}[\text{Black wins} \mid S = s]$$
$$v_*(s) = \max_{\pi_B} \min_{\pi_W} v_\pi(s)$$

SLIDE 43

Monte-Carlo Evaluation in Go

SLIDE 44

Applying Monte-Carlo Tree Search (1)

Go is a two-player game, so the tree is a minimax tree instead of an expectimax tree
White minimizes future reward and Black maximizes future reward when computing which action to simulate

SLIDE 45

Applying Monte-Carlo Tree Search (2)

SLIDE 46

Applying Monte-Carlo Tree Search (3)

SLIDE 47

Applying Monte-Carlo Tree Search (4)

SLIDE 48

Applying Monte-Carlo Tree Search (5)

SLIDE 49

Advantages of MC Tree Search

Highly selective best-first search
Evaluates states dynamically (unlike e.g. DP)
Uses sampling to break the curse of dimensionality
Works for "black-box" models (only requires samples)
Computationally efficient, anytime, parallelisable

SLIDE 50

In more depth: Upper Confidence Tree (UCT) Search

UCT: borrow an idea from the bandit literature and treat each tree node where we can select actions as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over the reward of each arm and select the best arm
Check your understanding: why is this slightly strange? Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes?¹

¹ Relates to metalevel reasoning (for an example related to Go, see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin and Shimony 2012)

SLIDE 51

Check Your Understanding: UCT Search

In Upper Confidence Tree (UCT) search we treat each tree node as a multi-armed bandit (MAB) problem, and use an upper confidence bound over the future value of each action to help select actions for later rollouts. Select all that are true:

1. This may be useful since it will prioritize actions that lead to later good rewards.
2. UCB minimizes regret. UCT is minimizing regret within rollouts of the tree. (If this is true, think about whether this is a good idea.)
3. Not sure

SLIDE 52

In more depth: Upper Confidence Tree (UCT) Search

UCT: borrow an idea from the bandit literature and treat each tree node where we can select actions as a multi-armed bandit (MAB) problem
Maintain an upper confidence bound over the reward of each arm and select the best arm
Hint: why were upper confidence bounds a good idea for exploration/exploitation? Is there an exploration/exploitation problem during simulated episodes?²

² Relates to metalevel reasoning (for an example related to Go, see "Selecting Computations: Theory and Applications", Hay, Russell, Tolpin and Shimony 2012)

SLIDE 53

AlphaGo

AlphaGo trailer link

SLIDE 54

Class Structure

Last time: Quiz
This time: MCTS
Next time: Poster session

SLIDE 55

End of Class Goals

Define the key features of reinforcement learning that distinguish it from AI and non-interactive machine learning
Given an application problem (e.g. from computer vision, robotics, etc.), decide if it should be formulated as an RL problem; if yes, be able to define it formally (in terms of the state space, action space, dynamics and reward model), state what algorithm (from class) is best suited to addressing it, and justify your answer
Implement (in code) common RL algorithms, including a deep RL algorithm
Describe (list and define) multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, empirical performance, convergence, etc.
Describe the exploration vs exploitation challenge and compare and contrast at least two approaches for addressing this challenge (in terms of performance, scalability, complexity of implementation, and theoretical guarantees)
Consider the implications of success

SLIDE 56

Learning more about RL

Sequential decision making under uncertainty:
CS238: Decision Making under Uncertainty
CS239: Advanced Topics in Sequential Decision Making
MS&E351: Dynamic Programming and Stochastic Control
MS&E338: Reinforcement Learning (advanced version)
CS332: Advanced Survey of Reinforcement Learning (current topics, project class)

SLIDE 57

Reinforcement learning

Already seeing incredible results in games and some terrific successes in robotics
Healthcare, education, consumer marketing...
Machines learning to help us, in safe, fair and accountable ways

SLIDE 58

Reinforcement learning

Please fill in the course evaluation survey. It helps me learn about what is helping you learn, and what I and the CS234 course staff can do to help future students even better. Thanks for all your questions, curiosity and enthusiasm this term. It's been a pleasure and I look forward to seeing you at the remote poster session!
