

  1. Lecture 14: Batch RL
Emma Brunskill, CS234 Reinforcement Learning, Winter 2020
Slides drawn from Philip Thomas with modifications
*Note: we only went carefully through slides before slide 34. The remaining slides are kept for those interested but will not be material required for the quiz. See the last slide for a summary of what you should know.

  2. Refresh Your Understanding: Fast RL III
Select all that are true:
• In Thompson sampling for MDPs, the posterior over the dynamics can be updated after each transition
• When using a Beta prior for a Bernoulli reward parameter for an (s, a) pair, the posterior after N samples of that pair can be the same as after N + 2 samples
• The optimism bonuses discussed for MBIE-EB depend on the maximum reward but not on the maximum value function
• In class we discussed adding a bonus term to the update for an (s, a, r, s′) tuple when using Q-learning with function approximation. Adding this bonus term will ensure all Q estimates used to make decisions online using DQN are optimistic with respect to Q*
• Not sure

  3. Class Structure
• Last time: Fast Reinforcement Learning
• This time: Batch RL
• Next time: Guest Lecture

  4. A Scientific Experiment

  5. A Scientific Experiment

  6. What Should We Do For a New Student?

  7. Involves Counterfactual Reasoning

  8. Involves Generalization

  9. Batch Reinforcement Learning

  10. Batch RL

  11. Batch RL

  12. The Problem
• If you apply an existing method, do you have confidence that it will work?

  13. A property of many real applications
• Deploying "bad" policies can be costly or dangerous

  14. What property should a safe batch reinforcement learning algorithm have?
• Given past experience from the current policy/policies, produce a new policy
• "Guarantee that with probability at least 1 − δ, the algorithm will not change your policy to one that is worse than the current policy."
• You get to choose δ
• The guarantee is not contingent on the tuning of any hyperparameters

  15. Table of Contents
1. Notation
2. Create a safe batch reinforcement learning algorithm
   • Off-policy policy evaluation (OPE)
   • Safe policy improvement (SPI)

  16. Notation
• Policy π: π(a | s) = P(a_t = a | s_t = s)
• Trajectory: T = (s_1, a_1, r_1, s_2, a_2, r_2, ..., s_L, a_L, r_L)
• Historical data: D = {T_1, T_2, ..., T_n}
• Historical data is generated from a behavior policy, π_b
• Objective: V^π = E[ Σ_{t=1}^{L} γ^t R_t | π ]  (a short data-structure sketch follows below)
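A minimal sketch, not from the slides, of how this notation might map onto Python data structures: a trajectory T is stored as a list of (state, action, reward) tuples and the historical data D is a list of such trajectories. The function names and tuple layout are assumptions made for illustration.

```python
def discounted_return(T, gamma):
    """sum_{t=1}^{L} gamma^t * r_t for one trajectory, using the slide's indexing."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(T, start=1))

def on_policy_value_estimate(D, gamma):
    """Monte Carlo estimate of V^{pi_b}: average the discounted returns of the n
    trajectories in D, all of which were generated by the behavior policy pi_b."""
    return sum(discounted_return(T, gamma) for T in D) / len(D)

# Example with two short, made-up trajectories:
D = [[("s1", "a1", 1.0), ("s2", "a2", 0.0)],
     [("s1", "a2", 0.0), ("s3", "a1", 2.0)]]
print(on_policy_value_estimate(D, gamma=0.9))
```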

  17. Safe batch reinforcement learning algorithm
• Reinforcement learning algorithm, A
• Historical data, D, which is a random variable
• Policy produced by the algorithm, A(D), which is a random variable
• A safe batch reinforcement learning algorithm A satisfies:
    Pr( V^{A(D)} ≥ V^{π_b} ) ≥ 1 − δ
  or, in general,
    Pr( V^{A(D)} ≥ V_min ) ≥ 1 − δ

  18. Table of Contents
1. Notation
2. Create a safe batch reinforcement learning algorithm
   • Off-policy policy evaluation (OPE)
   • Safe policy improvement (SPI)

  19. Create a safe batch reinforcement learning algorithm
• Off-policy policy evaluation (OPE): for any evaluation policy π_e, convert the historical data D into n independent and unbiased estimates of V^{π_e}
• High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of V^{π_e} into a 1 − δ confidence lower bound on V^{π_e}
• Safe policy improvement (SPI): use the HCOPE method to create a safe batch reinforcement learning algorithm (a rough sketch of this pipeline follows below)
• The methods today focus on work from Philip Thomas's UAI and ICML 2015 papers
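A minimal sketch, under stated assumptions, of how these three steps might fit together in code. Here `ope_estimate` is a hypothetical placeholder for any method that turns one trajectory into an unbiased per-trajectory estimate of V^{π_e} (for example, the importance-sampling estimator introduced later in this lecture), and Hoeffding's inequality is used only as a simple stand-in for the tighter concentration inequalities used in the Thomas et al. papers; it assumes each estimate lies in a known range [0, b].

```python
import math

def hcope_lower_bound(estimates, delta, b):
    """1 - delta confidence lower bound on V^{pi_e} from n independent, unbiased
    estimates, each assumed to lie in [0, b] (Hoeffding's inequality)."""
    n = len(estimates)
    mean = sum(estimates) / n
    return mean - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def safe_policy_improvement(D, pi_b, candidates, ope_estimate, delta, b, v_target):
    """Return the candidate policy with the highest HCOPE lower bound, but only if
    that bound beats v_target (e.g., an estimate of V^{pi_b}); otherwise keep pi_b.
    Note: screening many candidates at level delta would need a correction
    (e.g., delta / len(candidates)) for the overall guarantee to hold."""
    best_pi, best_bound = pi_b, v_target
    for pi_e in candidates:
        per_trajectory = [ope_estimate(T, pi_e, pi_b) for T in D]
        bound = hcope_lower_bound(per_trajectory, delta, b)
        if bound > best_bound:
            best_pi, best_bound = pi_e, bound
    return best_pi
```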

  20. Off-policy policy evaluation (OPE)

  21. High-confidence off-policy policy evaluation (HCOPE)

  22. Safe policy improvement (SPI)

  23. Create a safe batch reinforcement learning algorithm
• Off-policy policy evaluation (OPE): for any evaluation policy π_e, convert the historical data D into n independent and unbiased estimates of V^{π_e}
• High-confidence off-policy policy evaluation (HCOPE): use a concentration inequality to convert the n independent and unbiased estimates of V^{π_e} into a 1 − δ confidence lower bound on V^{π_e}
• Safe policy improvement (SPI): use the HCOPE method to create a safe batch reinforcement learning algorithm

  24. Monte Carlo (MC) Off Policy Evaluation
• Aim: estimate the value of policy π_1, V^{π_1}(s), given episodes generated under a behavior policy π_2
• s_1, a_1, r_1, s_2, a_2, r_2, ... where the actions are sampled from π_2
• G_t = r_t + γ r_{t+1} + γ^2 r_{t+2} + γ^3 r_{t+3} + ··· in MDP M under policy π (a short sketch of computing these returns follows below)
• V^π(s) = E_π[ G_t | s_t = s ]
• We have data from a different policy, the behavior policy π_2
• If π_2 is stochastic, we can often use its data to estimate the value of an alternative policy (formal conditions to follow)
• Again, there is no requirement to have a model, nor that the state is Markov
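A minimal sketch, assuming the rewards of one episode are stored as a plain Python list, of computing the Monte Carlo returns G_t via the backward recursion G_t = r_t + γ G_{t+1}. Averaging such returns over on-policy episodes estimates V^π(s); the following slides show how to reweight returns collected under π_2 to evaluate π_1.

```python
def episode_returns(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t,
    computed backwards as G_t = r_t + gamma * G_{t+1}."""
    G = 0.0
    out = [0.0] * len(rewards)
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        out[t] = G
    return out

print(episode_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```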

  25. Monte Carlo (MC) Off Policy Evaluation: Distribution Mismatch
• The distribution of episodes and the resulting returns differ between policies

  26. Importance Sampling
• Goal: estimate the expected value of a function f(x) under some probability distribution p(x), E_{x∼p}[ f(x) ]
• We have data x_1, x_2, ..., x_n sampled from a distribution q(x)
• Under a few assumptions, we can use these samples to obtain an unbiased estimate of E_{x∼p}[ f(x) ]

  27. Importance Sampling
E_{x∼q}[ f(x) ] = ∫_x q(x) f(x) dx
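Completing the derivation the slide begins: since E_{x∼p}[ f(x) ] = ∫_x p(x) f(x) dx = ∫_x q(x) (p(x)/q(x)) f(x) dx = E_{x∼q}[ (p(x)/q(x)) f(x) ] (assuming q(x) > 0 wherever p(x) f(x) ≠ 0), the sample average of the reweighted values is an unbiased estimate of E_{x∼p}[ f(x) ]. Below is a minimal sketch, not from the slides, of this ordinary importance-sampling estimator with a made-up coin-flip example.

```python
import random

def importance_sampling_estimate(samples, p, q, f):
    """(1/n) * sum_i (p(x_i)/q(x_i)) * f(x_i): unbiased for E_{x~p}[f(x)]
    when the x_i are drawn from q and q covers the support of p*f."""
    return sum((p(x) / q(x)) * f(x) for x in samples) / len(samples)

# Toy check: estimate the heads probability of a biased coin p (P(heads) = 0.7)
# using 10,000 flips of a fair coin q (P(heads) = 0.5).
xs = [random.random() < 0.5 for _ in range(10_000)]   # samples from q
p = lambda heads: 0.7 if heads else 0.3
q = lambda heads: 0.5
f = lambda heads: 1.0 if heads else 0.0
print(importance_sampling_estimate(xs, p, q, f))       # close to 0.7
```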
