Lecture 12: Batch RL (Emma Brunskill, CS234 Reinforcement Learning)

SLIDE 1

Lecture 12: Batch RL

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018. Slides drawn from Philip Thomas, with modifications.

SLIDE 2

Class Structure

  • Last time: Fast Reinforcement Learning / Exploration and Exploitation
  • This time: Batch RL
  • Next time: Monte Carlo Tree Search

SLIDE 3

Table of Contents

1. What makes an RL algorithm safe?
2. Notation
3. Create a safe batch reinforcement learning algorithm: off-policy policy evaluation (OPE), high-confidence off-policy policy evaluation (HCOPE), safe policy improvement (SPI)

SLIDE 4

What does it mean for a reinforcement learning algorithm to be safe?

SLIDE 5

SLIDE 6

SLIDE 7

Changing the objective

SLIDE 8

Changing the objective

  • Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = 10^9 with probability 1 − 0.999999 = 10^−6
  • Expected reward approximately 1000 (10^9 × 10^−6 = 1000)
  • Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1000 with probability 0.5
  • Expected reward 999.5

SLIDE 9

Another notion of safety

SLIDE 10

Another notion of safety (Munos et al.)

SLIDE 11

Another notion of safety

SLIDE 12

SLIDE 13

The Problem

  • If you apply an existing method, do you have confidence that it will work?

SLIDE 14

Reinforcement learning success

SLIDE 15

A property of many real applications

  • Deploying "bad" policies can be costly or dangerous

SLIDE 16

Deploying bad policies can be costly

SLIDE 17

Deploying bad policies can be dangerous

SLIDE 18

What property should a safe batch reinforcement learning algorithm have?

  • Given past experience from the current policy or policies, produce a new policy
  • “Guarantee that, with probability at least 1 − δ, the algorithm will not change your policy to one that is worse than the current policy.”
  • You get to choose δ
  • The guarantee is not contingent on the tuning of any hyperparameters

SLIDE 19

Table of Contents

1. What makes an RL algorithm safe?
2. Notation
3. Create a safe batch reinforcement learning algorithm: off-policy policy evaluation (OPE), high-confidence off-policy policy evaluation (HCOPE), safe policy improvement (SPI)

SLIDE 20

Notation

  • Policy π: π(a | s) = P(a_t = a | s_t = s)
  • History: H = (s_1, a_1, r_1, s_2, a_2, r_2, · · · , s_L, a_L, r_L)
  • Historical data: D = {H_1, H_2, · · · , H_n}
  • Historical data from behavior policy, πb
  • Objective: V^{π} = E[ Σ_{t=1}^{L} γ^t R_t | π ]
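To make the notation concrete, here is a minimal Python sketch (not from the lecture) of one way the histories H and the dataset D might be stored; the Step fields, including the logged behavior-policy probability used by later estimators, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """One (s, a, r) tuple plus the behavior-policy probability needed for importance sampling."""
    state: int
    action: int
    reward: float
    pi_b: float  # probability the behavior policy assigned to this action

# A history H is a list of steps; the historical data D is a list of histories.
History = List[Step]

def discounted_return(history: History, gamma: float = 1.0) -> float:
    """Return sum_{t=1}^{L} gamma^t * r_t for one trajectory (matching the slide's indexing)."""
    return sum(gamma ** (t + 1) * step.reward for t, step in enumerate(history))

if __name__ == "__main__":
    H1 = [Step(state=0, action=1, reward=0.5, pi_b=0.8),
          Step(state=2, action=0, reward=1.0, pi_b=0.6)]
    D = [H1]
    print(discounted_return(H1, gamma=0.9))
```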

SLIDE 21

Safe batch reinforcement learning algorithm

  • Reinforcement learning algorithm, A
  • Historical data, D, which is a random variable
  • Policy produced by the algorithm, A(D), which is a random variable
  • A safe batch reinforcement learning algorithm, A, satisfies:

Pr(V^{A(D)} ≥ V^{πb}) ≥ 1 − δ

  • or, in general,

Pr(V^{A(D)} ≥ V_min) ≥ 1 − δ

SLIDE 22

Table of Contents

1. What makes an RL algorithm safe?
2. Notation
3. Create a safe batch reinforcement learning algorithm: off-policy policy evaluation (OPE), high-confidence off-policy policy evaluation (HCOPE), safe policy improvement (SPI)

SLIDE 23

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

SLIDE 24

Off-policy policy evaluation (OPE)

SLIDE 25

Importance Sampling (Reminder)

IS(D) = (1/n) Σ_{i=1}^{n} ( Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t) ) Σ_{t=1}^{L} γ^t R^i_t

  • E[IS(D)] = V^{πe}
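Below is a minimal sketch of the ordinary importance sampling estimator above, assuming each trajectory is stored as a list of (πe probability, πb probability, reward) tuples; the function name and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def is_estimate(trajectories, gamma=1.0):
    """Ordinary importance sampling: average over trajectories of
    (product of per-step likelihood ratios) * (discounted return)."""
    vals = []
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for t, (p_e, p_b, r) in enumerate(traj):
            weight *= p_e / p_b            # prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)
            ret += (gamma ** (t + 1)) * r  # sum_t gamma^t R_t (t starts at 1 on the slide)
        vals.append(weight * ret)
    return float(np.mean(vals))

# Toy example: two short trajectories of (pi_e prob, pi_b prob, reward) steps.
D = [
    [(0.9, 0.5, 1.0), (0.2, 0.5, 0.0)],
    [(0.1, 0.5, 0.0), (0.8, 0.5, 1.0)],
]
print(is_estimate(D, gamma=0.95))
```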

SLIDE 26

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

SLIDE 27

High-confidence off-policy policy evaluation (HCOPE)

SLIDE 28

Hoeffding’s inequality

  • Let X_1, · · · , X_n be n independent, identically distributed random variables such that X_i ∈ [0, b]
  • Then with probability at least 1 − δ:

E[X_i] ≥ (1/n) Σ_{i=1}^{n} X_i − b √( ln(1/δ) / (2n) ),

where X_i = w_i Σ_{t=1}^{L} γ^t R^i_t (the importance-weighted return of the i-th trajectory) in our case.
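A small sketch of this bound, assuming the X_i (here, importance-weighted returns) have already been computed and that b is a known upper bound on them; the data below is synthetic.

```python
import math
import numpy as np

def hoeffding_lower_bound(x, b, delta):
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b]:
    mean(x) - b * sqrt(ln(1/delta) / (2n))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return float(x.mean() - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n)))

# Toy example: pretend these are importance-weighted returns bounded in [0, 10].
returns = np.random.default_rng(0).uniform(0.0, 1.0, size=1000)
print(hoeffding_lower_bound(returns, b=10.0, delta=0.05))
```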

SLIDE 29

Safe policy improvement (SPI)

SLIDE 30

Safe policy improvement (SPI)

SLIDE 31

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

WON’T WORK!

SLIDE 32

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

IS(D) = (1/n) Σ_{i=1}^{n} ( Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t) ) Σ_{t=1}^{L} γ^t R^i_t

  • Per-decision importance sampling (PDIS):

PDIS(D) = Σ_{t=1}^{L} γ^t (1/n) Σ_{i=1}^{n} ( Π_{τ=1}^{t} πe(a_τ | s_τ) / πb(a_τ | s_τ) ) R^i_t
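A minimal sketch of PDIS under the same illustrative trajectory format as before ((πe probability, πb probability, reward) per step); note that each reward is weighted only by the likelihood ratios of the actions taken up to that time.

```python
def pdis_estimate(trajectories, gamma=1.0):
    """Per-decision importance sampling: each reward R_t is weighted only by the
    likelihood ratios of the actions taken up to and including time t."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (p_e, p_b, r) in enumerate(traj):
            weight *= p_e / p_b                      # prod_{tau <= t} pi_e / pi_b
            total += (gamma ** (t + 1)) * weight * r
    return total / len(trajectories)

# Same toy trajectory format as before: (pi_e prob, pi_b prob, reward) per step.
D = [
    [(0.9, 0.5, 1.0), (0.2, 0.5, 0.0)],
    [(0.1, 0.5, 0.0), (0.8, 0.5, 1.0)],
]
print(pdis_estimate(D, gamma=0.95))
```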

SLIDE 33

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

IS(D) = (1/n) Σ_{i=1}^{n} w_i Σ_{t=1}^{L} γ^t R^i_t

  • Weighted importance sampling (WIS):

WIS(D) = ( 1 / Σ_{i=1}^{n} w_i ) Σ_{i=1}^{n} w_i Σ_{t=1}^{L} γ^t R^i_t
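A minimal sketch of WIS under the same illustrative trajectory format; the only change from ordinary IS is the normalization by the sum of the weights rather than by n.

```python
import numpy as np

def wis_estimate(trajectories, gamma=1.0):
    """Weighted importance sampling: normalize by the sum of the trajectory weights
    instead of by n (biased, but consistent and typically much lower variance)."""
    weights, returns = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (p_e, p_b, r) in enumerate(traj):
            w *= p_e / p_b
            g += (gamma ** (t + 1)) * r
        weights.append(w)
        returns.append(g)
    weights = np.asarray(weights)
    returns = np.asarray(returns)
    return float(np.sum(weights * returns) / np.sum(weights))

D = [
    [(0.9, 0.5, 1.0), (0.2, 0.5, 0.0)],
    [(0.1, 0.5, 0.0), (0.8, 0.5, 1.0)],
]
print(wis_estimate(D, gamma=0.95))
```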

SLIDE 34

Off-policy policy evaluation (revisited)

  • Weighted importance sampling (WIS):

WIS(D) = ( 1 / Σ_{i=1}^{n} w_i ) Σ_{i=1}^{n} w_i Σ_{t=1}^{L} γ^t R^i_t

  • NOT unbiased. When n = 1, E[WIS(D)] = V^{πb}
  • Strongly consistent estimator of V^{πe}
  • i.e., Pr( lim_{n→∞} WIS(D) = V^{πe} ) = 1
  • If:
  • Finite horizon
  • One behavior policy, or bounded rewards

SLIDE 35

Off-policy policy evaluation (revisited)

  • Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling
  • A fun exercise!

SLIDE 36

Control variates

  • Given: X
  • Estimate: µ = E[X]
  • µ̂ = X
  • Unbiased: E[µ̂] = E[X] = µ
  • Variance: Var(µ̂) = Var(X)

SLIDE 37

Control variates

  • Given: X, Y, E[Y]
  • Estimate: µ = E[X]
  • µ̂ = X − Y + E[Y]
  • Unbiased: E[µ̂] = E[X − Y + E[Y]] = E[X] − E[Y] + E[Y] = E[X] = µ
  • Variance: Var(µ̂) = Var(X − Y + E[Y]) = Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)
  • Lower variance if 2 Cov(X, Y) > Var(Y)
  • We call Y a control variate
  • We saw this idea before: the baseline term in policy gradient estimation
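A tiny numerical sketch of the variance-reduction argument above; the distributions are synthetic and chosen so that 2 Cov(X, Y) > Var(Y).

```python
import numpy as np

rng = np.random.default_rng(0)

# X is the quantity whose mean we want; Y is correlated with X and has a known mean E[Y] = 2.
n = 100_000
y = rng.normal(loc=2.0, scale=1.0, size=n)
x = y + rng.normal(loc=1.0, scale=0.5, size=n)   # E[X] = 3, strongly correlated with Y

mu_hat_plain = x            # estimator X
mu_hat_cv = x - y + 2.0     # estimator X - Y + E[Y]

print("both roughly unbiased:", mu_hat_plain.mean(), mu_hat_cv.mean())
print("variance without control variate:", mu_hat_plain.var())
print("variance with control variate:   ", mu_hat_cv.var())  # smaller, since 2 Cov(X, Y) > Var(Y)
```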

SLIDE 38

Off-policy policy evaluation (revisited)

  • Idea: add a control variate to importance sampling estimators
  • X is the importance sampling estimator
  • Y is a control variate built from an approximate model of the MDP
  • E[Y] = 0 in this case
  • PDIS^{CV}(D) = PDIS(D) − CV(D)
  • Called the doubly robust estimator (Jiang and Li, 2015)
  • Robust to (1) a poor approximate model, and (2) error in estimates of πb
  • If the model is poor, the estimates are still unbiased
  • If the sampling policy is unknown but the model is good, the MSE will still be low
  • DR(D) = PDIS^{CV}(D)
  • Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)

SLIDE 39

Off-policy policy evaluation (revisited)

DR(πe; D) = (1/n) Σ_{i=1}^{n} Σ_{t=0}^{L} γ^t [ w^i_t ( R^i_t − q̂^{πe}(S^i_t, A^i_t) ) + w^i_{t−1} v̂^{πe}(S^i_t) ],

where w^i_t = Π_{τ=0}^{t} πe(a^i_τ | s^i_τ) / πb(a^i_τ | s^i_τ) and w^i_{−1} = 1

  • Recall: we want the control variate Y to cancel with X: R − q̂(S, A) + γ v̂(S′)
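Below is a minimal sketch of the DR estimator as reconstructed above; the trajectory format and the constant q̂ / v̂ functions are illustrative stand-ins for an approximate model fit from data, not the lecture's implementation.

```python
def dr_estimate(trajectories, q_hat, v_hat, gamma=1.0):
    """Doubly robust OPE sketch: importance-weighted corrections around an
    approximate model's q_hat / v_hat, following the formula above.
    Each trajectory step is (state, action, reward, pi_e prob, pi_b prob)."""
    total = 0.0
    for traj in trajectories:
        w_prev = 1.0  # w^i_{t-1}, with the empty product taken to be 1
        for t, (s, a, r, p_e, p_b) in enumerate(traj):
            w = w_prev * (p_e / p_b)  # w^i_t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
    return total / len(trajectories)

# Illustrative (constant) approximate model; in practice q_hat and v_hat are fit from data.
q_hat = lambda s, a: 0.5
v_hat = lambda s: 0.5

D = [
    [(0, 1, 1.0, 0.9, 0.5), (2, 0, 0.0, 0.2, 0.5)],
    [(0, 0, 0.0, 0.1, 0.5), (1, 1, 1.0, 0.8, 0.5)],
]
print(dr_estimate(D, q_hat, v_hat, gamma=0.95))
```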

SLIDE 40

Empirical Results (Gridworld)

SLIDE 41

Empirical Results (Gridworld)

SLIDE 42

Empirical Results (Gridworld)

SLIDE 43

Empirical Results (Gridworld)

SLIDE 44

Empirical Results (Gridworld)

SLIDE 45

Off-policy policy evaluation (revisited): Blending

  • Importance sampling is unbiased but high variance
  • The model-based estimate is biased but low variance
  • Doubly robust is one way to combine the two
  • Can also trade off between importance sampling and the model-based estimate within a trajectory
  • MAGIC estimator (Thomas and Brunskill 2016)
  • Can be particularly useful when part of the world is non-Markovian in the given model, and other parts of the world are Markov

SLIDE 46

Off-policy policy evaluation (revisited)

  • What if supp(πe) ⊂ supp(πb)?
  • That is, there is a state-action pair, (s, a), such that πe(a | s) = 0, but πb(a | s) ≠ 0.
  • If we see a history where (s, a) occurs, what weight should we give it?

IS(D) = (1/n) Σ_{i=1}^{n} ( Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t) ) Σ_{t=1}^{L} γ^t R^i_t

SLIDE 47

Off-policy policy evaluation (revisited)

  • What if there are zero samples (n = 0)?
  • The importance sampling estimate is undefined
  • What if no samples are in supp(πe) (or supp(p) in general)?
  • Importance sampling says: the estimate is zero
  • Alternate approach: undefined
  • Importance sampling estimator is unbiased if n > 0
  • Alternate approach will be unbiased given that at least one sample is in the support of p
  • Alternate approach detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)

SLIDE 48

Off-policy policy evaluation (revisited)

SLIDE 49

Off-policy policy evaluation (revisited)

  • Thomas et al., Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)

SLIDE 50

Off-policy policy evaluation (revisited)

SLIDE 51

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

SLIDE 52

High-confidence off-policy policy evaluation (revisited)

  • Consider using IS + Hoeffding’s inequality for HCOPE on mountain car

SLIDE 53

High-confidence off-policy policy evaluation (revisited)

  • Using 100,000 trajectories
  • Evaluation policy’s true performance is 0.19 ∈ [0, 1]
  • We get a 95% confidence lower bound of: -58,310,000

SLIDE 54

What went wrong

w_i = Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t)

SLIDE 55

High-confidence off-policy policy evaluation (revisited)

  • Removing the upper tail only decreases the expected value.

SLIDE 56

High-confidence off-policy policy evaluation (revisited)

  • Thomas et al., High Confidence Off-Policy Evaluation, AAAI 2015

SLIDE 57

High-confidence off-policy policy evaluation (revisited)

SLIDE 58

High-confidence off-policy policy evaluation (revisited)

  • Use 20% of the data to optimize c
  • Use 80% to compute lower bound with optimized c
  • Mountain car results:

SLIDE 59

High-confidence off-policy policy evaluation (revisited)

Digital marketing:

SLIDE 60

High-confidence off-policy policy evaluation (revisited)

Cognitive dissonance: E[X_i] ≥ (1/n) Σ_{i=1}^{n} X_i − b √( ln(1/δ) / (2n) )

SLIDE 61

High-confidence off-policy policy evaluation (revisited)

  • Student’s t-test
  • Assumes that IS(D) is normally distributed
  • By the central limit theorem, it is (as n → ∞)
  • The resulting 1 − δ lower bound (a small sketch follows below):

Pr( E[X_i] ≥ (1/n) Σ_{i=1}^{n} X_i − ( √( (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄_n)^2 ) / √n ) t_{1−δ, n−1} ) ≥ 1 − δ

  • Efron’s bootstrap methods (e.g., BCa)
  • Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
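A small sketch of the t-test-based lower bound, assuming the X_i (importance-weighted returns) are given; the synthetic heavy-tailed data hints at why the normality assumption can make this bound optimistic.

```python
import numpy as np
from scipy import stats

def t_test_lower_bound(x, delta):
    """1 - delta confidence lower bound on the mean, assuming (approximate) normality:
    mean(x) - (sample std / sqrt(n)) * t_{1-delta, n-1}."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sample_std = x.std(ddof=1)                   # sqrt( (1/(n-1)) sum (x_i - mean)^2 )
    t_crit = stats.t.ppf(1.0 - delta, df=n - 1)  # t_{1-delta, n-1}
    return float(x.mean() - sample_std / np.sqrt(n) * t_crit)

# Synthetic importance-weighted returns; heavy-tailed data can make this bound optimistic.
x = np.random.default_rng(1).exponential(scale=0.2, size=500)
print(t_test_lower_bound(x, delta=0.05))
```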

SLIDE 62

High-confidence off-policy policy evaluation (revisited)

SLIDE 63

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A (an end-to-end sketch follows below)
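Putting the three steps together, here is a minimal end-to-end sketch; it uses a Hoeffding bound for the HCOPE step purely for brevity (the lecture notes that this naive choice is far too loose in practice, and tighter bounds such as the t-test are preferred). The trajectory format, candidate representation, and toy data are all illustrative assumptions.

```python
import math
import numpy as np

def is_returns(trajectories, pi_e_prob, gamma=1.0):
    """OPE: one unbiased importance-sampling estimate of V^{pi_e} per trajectory.
    Each step of a trajectory is (state, action, reward, pi_b probability)."""
    out = []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, r, p_b) in enumerate(traj):
            w *= pi_e_prob(s, a) / p_b
            g += (gamma ** (t + 1)) * r
        out.append(w * g)
    return np.asarray(out)

def hcope_lower_bound(x, b, delta):
    """HCOPE: Hoeffding-style 1 - delta confidence lower bound on the mean of x, with x in [0, b]."""
    return float(x.mean() - b * math.sqrt(math.log(1.0 / delta) / (2.0 * len(x))))

def safe_policy_improvement(candidate_policies, trajectories, v_min, b, delta, gamma=1.0):
    """SPI: return a candidate whose 1 - delta lower bound clears v_min, else None
    (i.e., decline to change the policy)."""
    for name, pi_e_prob in candidate_policies:
        x = is_returns(trajectories, pi_e_prob, gamma)
        if hcope_lower_bound(x, b, delta) >= v_min:
            return name
    return None

# Toy usage: logged steps are (state, action, reward, pi_b prob); one candidate policy.
D = [[(0, 1, 1.0, 0.5), (1, 0, 1.0, 0.5)] for _ in range(2000)]
candidates = [("candidate-1", lambda s, a: 0.9)]
print(safe_policy_improvement(candidates, D, v_min=0.5, b=10.0, delta=0.05, gamma=0.95))
```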

SLIDE 64

Safe policy improvement (revisited)

Thomas et al., ICML 2015

SLIDE 65

Empirical Results: Digital Marketing

[Figure: agent-environment loop, with the agent sending an action and the environment returning a state and reward]

SLIDE 66

Empirical Results: Digital Marketing

[Figure: expected normalized return for n = 10,000 to 100,000 trajectories, comparing None vs. k-Fold data reuse with CUT and BCa bounds; values roughly in the range 0.0027 to 0.0038]

SLIDE 67

Empirical Results: Digital Marketing

SLIDE 68

Empirical Results: Digital Marketing

SLIDE 69

Example Results: Diabetes Treatment

[Figure: blood glucose (sugar) rises when eating carbohydrates and falls when insulin is released]

SLIDE 70

Example Results: Diabetes Treatment

[Figure: blood glucose regulation, with hyperglycemia (blood glucose too high) marked]

SLIDE 71

Example Results: Diabetes Treatment

[Figure: blood glucose regulation, with hypoglycemia (too low) and hyperglycemia (too high) marked]

SLIDE 72

Example Results: Diabetes Treatment

injection = (blood glucose − target blood glucose) / DG + (meal size) / DS
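A one-line sketch of the parameterized treatment policy above; the parameter names DG and DS follow the slide, and the numeric values below are purely illustrative, not medical guidance.

```python
def insulin_injection(blood_glucose, target_blood_glucose, meal_size, DG, DS):
    """Bolus-calculator style policy from the slide:
    injection = (blood glucose - target) / DG + meal size / DS.
    DG and DS are the per-patient parameters a policy-improvement method would tune."""
    return (blood_glucose - target_blood_glucose) / DG + meal_size / DS

# Hypothetical numbers purely for illustration.
print(insulin_injection(blood_glucose=180, target_blood_glucose=120, meal_size=60, DG=30, DS=10))
```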

SLIDE 73

Example Results: Diabetes Treatment

Intelligent Diabetes Management

SLIDE 74

Example Results: Diabetes Treatment

[Figure: probability the policy was changed and probability the new policy was worse]

SLIDE 75

Other Relevant Work

  • How to deal with long horizons? (Guo, Thomas, Brunskill NIPS 2017)
  • How to deal with importance sampling being “unfair”? (Doroudi, Thomas and Brunskill, best paper UAI 2017)
  • What to do when the behavior policy is not known?
  • What to do when the behavior policy is deterministic?
  • What to do when we care about doing safe exploration?
  • What to do when we care about performance on a single trajectory?
  • For the last two, see great work by Marco Pavone’s group, Pieter Abbeel’s group, Shie Mannor’s group and Claire Tomlin’s group, amongst others

SLIDE 76

Off Policy Policy Evaluation and Selection

  • Very important topic: healthcare, education, marketing, ...
  • Insights are relevant to on policy learning
  • Big focus of my lab
  • A number of others on campus are also working in this area (e.g., Stefan Wager, Susan Athey, ...)

  • Very interesting area at the intersection of causality and control

SLIDE 77

What You Should Know: Off Policy Policy Evaluation and Selection

  • Be able to define and apply importance sampling for off-policy policy evaluation
  • Define some limitations of IS (variance)
  • List a couple of alternatives (weighted IS, doubly robust)
  • Define why we might want safe reinforcement learning
  • Define the scope of the guarantees implied by safe policy improvement, as defined in this lecture

SLIDE 78

Class Structure

  • Last time: Exploration and Exploitation
  • This time: Batch RL
  • Next time: Monte Carlo Tree Search
