Reinforcement learning
Fredrik D. Johansson Clinical ML @ MIT 6.S897/HST.956: Machine Learning for Healthcare, 2019
Reminder: Causal effects
► Potential outcomes under treatment and control, Y(1), Y(0)
► Covariates and treatment, X, T
► Conditional average treatment effect (CATE):
  CATE(x) = E[Y(1) − Y(0) ∣ X = x]
[Figure: features X, treatment T, potential outcomes Y(0), Y(1)]
► A policy π assigns treatments to patients
(typically depending on their medical history/state)
► Example: For a patient with medical history x,
  π(x) = 1[CATE(x) > 0]
  "Treat if effect is positive"
► Today we focus on policies guided by clinical outcomes
(as opposed to legislation, monetary cost or side-effects)
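To make the "treat if effect is positive" rule concrete, here is a minimal sketch (not from the lecture): a two-model (T-learner) CATE estimate plugged into π(x) = 1[CATE(x) > 0]. The synthetic data, the LinearRegression choice and the variable names are illustrative assumptions only.

```python
# Minimal sketch (assumed setup): estimate CATE(x) with one outcome model per
# treatment arm, then define the policy pi(x) = 1[CATE_hat(x) > 0].
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # covariates (synthetic)
T = rng.integers(0, 2, size=500)                 # observed treatments
Y = X[:, 0] * T + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)  # outcomes

m1 = LinearRegression().fit(X[T == 1], Y[T == 1])   # estimates E[Y | X, T = 1]
m0 = LinearRegression().fit(X[T == 0], Y[T == 0])   # estimates E[Y | X, T = 0]

def cate_hat(x):
    return m1.predict(x) - m0.predict(x)         # estimated CATE(x)

def policy(x):
    return (cate_hat(x) > 0).astype(int)         # "treat if effect is positive"

print(policy(X[:5]))
```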
► Sepsis is a complication of an infection which can lead to massive organ failure and death
► One of the leading causes of death in the ICU
► The primary treatment target is the infection
► Other symptoms need management: breathing difficulties, low blood pressure, …
[Timeline figure: a septic patient with breathing difficulties; decisions over time (mechanical ventilation? sedation? vasopressors?), observed decisions & responses, and unobserved responses Y(0), Y(1)]
► Many clinical decisions are made in sequence
► Choices early may rule out actions later
► Can we optimize the policy by which actions are made?
[Diagram: times t_1, …, t_T with states S_t, actions A_t, and rewards R_t]
[The timeline figure is built up over several slides, one decision at a time, for a septic patient with breathing difficulties:
– mechanical ventilation?
– sedation? (to alleviate discomfort due to mechanical ventilation)
– vasopressors: artificially raise blood pressure? (which may have dropped due to sedation)]
► How can we treat patients so that their outcomes are as good as possible?
► What are good outcomes?
► Which policies should we consider?
► AlphaStar  ► AlphaGo  ► DQN Atari  ► OpenAI Five
[Figure by Tim Wheeler, tim.hibal.org: game state S_t, possible actions A_t, next state S_{t+1}, reward R_{t+1} (loss)]
► Maximize reward!
► Patient state at time t is like the game board
► Medical treatments A_t are like the actions
► Outcomes R_t are the rewards in the game
► What could possibly go wrong?
[Diagram: times t_1, …, t_T with states S_t, actions A_t, and rewards R_t]
► An agent repeatedly, at times t, takes actions A_t to receive rewards R_t from an environment, the state S_t of which is (partially) observed
[Agent–environment loop: the agent sends action A_t, the environment returns state S_t and reward R_t; clinical decisions over time: mechanical ventilation? sedation? spontaneous breathing trial?]
► Reward composed of several terms: R_t = R_t^vitals + R_t^vent off + R_t^vent on
[Diagram: states S_1, S_2, S_3 and actions A_1, A_2, A_3 over time]
► State S_t includes demographics, physiological measurements, ventilator settings, level of consciousness, dosage of sedatives, time to ventilation, number of intubations
► Actions A_t include intubation and extubation, as well as administration and dosages of sedatives
► A decision process specifies how states S_t, actions A_t, and rewards R_t are distributed:
  p(S_1, …, S_T, A_1, …, A_T, R_1, …, R_T)
► The agent interacts with the environment according to a behavior policy μ = p(A_t ∣ ⋯)*
* The ⋯ depends on the type of agent
► Markov decision processes (MDPs) are a special case
► Markov transitions: p(S_t ∣ S_1, …, S_{t−1}, A_1, …, A_{t−1}) = p(S_t ∣ S_{t−1}, A_{t−1})
► Markov reward function: p(R_t ∣ S_1, …, S_t, A_1, …, A_t) = p(R_t ∣ S_t, A_t)
► Markov action policy: π = p(A_t ∣ S_t) = p(A_t ∣ S_1, …, S_t, A_1, …, A_{t−1})
► State transitions, actions and rewards depend only on the most recent state–action pair
[Graphical model: states S_1 → … → S_T with actions A_1, …, A_T and rewards R_1, …, R_T]
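As a small illustration of the Markov property (a sketch with a made-up two-state clinical MDP; the states, actions and probabilities are hypothetical), the next state is sampled from a distribution that depends only on the most recent state-action pair:

```python
# Minimal sketch: sampling a trajectory from a tabular MDP in which
# p(S_t | history) = p(S_t | S_{t-1}, A_{t-1}) and the reward depends on the current state.
import random

P = {  # P[(state, action)] = [(next_state, probability), ...]  (hypothetical numbers)
    ("sick", "treat"):    [("healthy", 0.7), ("sick", 0.3)],
    ("sick", "wait"):     [("healthy", 0.2), ("sick", 0.8)],
    ("healthy", "treat"): [("healthy", 0.9), ("sick", 0.1)],
    ("healthy", "wait"):  [("healthy", 0.8), ("sick", 0.2)],
}
R = {"healthy": 1.0, "sick": 0.0}   # reward of the current state

def sample_trajectory(policy, s0="sick", T=5):
    """Roll out (S_1, A_1, R_1), ..., (S_T, A_T, R_T) under a given policy."""
    s, traj = s0, []
    for _ in range(T):
        a = policy(s)
        traj.append((s, a, R[s]))
        next_states, probs = zip(*P[(s, a)])
        s = random.choices(next_states, weights=probs)[0]   # depends only on (s, a)
    return traj

print(sample_trajectory(lambda s: "treat" if s == "sick" else "wait"))
```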
► Contextual bandits*: states are independent, p(S_t ∣ S_{t−1}, A_{t−1}) = p(S_t)
► Equivalent to single-step case: potential outcomes!
[Graphical model: independent states S_1, …, S_T with actions A_1, …, A_T and rewards R_1, …, R_T]
* The term "contextual bandits" has connotations of efficient exploration, which is not addressed here
► Think of each state S_i as an i.i.d. patient, the actions A_i as the treatment group indicators and R_i as the outcomes
[Diagram: each (S_i, A_i, R_i) triple is an independent patient]
► Like previously with causal effect estimation, we are interested in the effects of actions A_t on future rewards
[Graphical model: states S_1, …, S_T, actions A_1, …, A_T, rewards R_1, …, R_T]
► The goal of most RL algorithms is to maximize the expected cumulative reward, the value V^π of the policy π
► Return (sum of future rewards): G_t = Σ_{s=t}^{T} R_s
► Value (expected sum of rewards under policy π): V^π = E_{τ∼π}[G_1]
► The expectation is taken with respect to scenarios (trajectories τ) acted out according to the learned policy π
► Let's say that we have data from a policy μ: three patients, each observed for three time steps
  Patient 1: R_1 = 1, R_2 = 0, R_3 = 1,  return G^(1) = R_1 + R_2 + R_3 = 2
  Patient 2: R_1 = 0, R_2 = 1, R_3 = 1,  return G^(2) = R_1 + R_2 + R_3 = 2
  Patient 3: R_1 = 0, R_2 = 0, R_3 = 0,  return G^(3) = R_1 + R_2 + R_3 = 0
► The value is estimated by averaging the observed returns: V̂^μ ≈ (1/n) Σ_{i=1}^{n} G^(i)
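The same calculation as a quick sketch (the reward sequences are the ones listed above; averaging returns is a plain Monte Carlo estimate with no discounting):

```python
# Minimal sketch: estimate the value of the data-generating policy by averaging
# the observed returns G^(i) over patients.
rewards = {
    "patient 1": [1, 0, 1],
    "patient 2": [0, 1, 1],
    "patient 3": [0, 0, 0],
}

returns = {p: sum(rs) for p, rs in rewards.items()}   # G^(i) = R_1 + R_2 + R_3
value_hat = sum(returns.values()) / len(returns)      # V_hat ≈ (1/n) Σ_i G^(i)

print(returns)     # {'patient 1': 2, 'patient 2': 2, 'patient 3': 0}
print(value_hat)   # (2 + 2 + 0) / 3 ≈ 1.33
```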
[Gridworld figure: a Start cell and +1 / −1 terminal cells]
► Stochastic actions: p(move up ∣ A = "up") = 0.8
  Available non-opposite moves have uniform probability
► Rewards: +1 at [4,3] (terminal state)
Slide from Peter Bodik
[Grid figure: the value of each state is initially unknown ('?')]
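For concreteness, one way the gridworld above might be coded (a sketch; the −1 terminal at [4,2], the −0.04 step reward and the 0.1/0.1 split over perpendicular moves are assumptions consistent with later slides, not stated here):

```python
# Minimal sketch of the 4x3 gridworld: 0.8 chance of the intended move, the rest
# split over the two perpendicular ("non-opposite") moves; bouncing off the walls.
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
TERMINAL = {(4, 3): +1.0, (4, 2): -1.0}   # [column, row] coordinates as on the slides
STEP_REWARD = -0.04                        # assumed; matches Q-values such as 0.96, 0.92

def perpendicular(action):
    return ["left", "right"] if action in ("up", "down") else ["up", "down"]

def step(state, action):
    """Sample (next_state, reward) under the stochastic transition model."""
    if state in TERMINAL:
        return state, 0.0
    actual = random.choices([action] + perpendicular(action), weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVES[actual]
    nx, ny = state[0] + dx, state[1] + dy
    if not (1 <= nx <= 4 and 1 <= ny <= 3):   # moves off the grid leave the state unchanged
        nx, ny = state
    return (nx, ny), TERMINAL.get((nx, ny), STEP_REWARD)

print(step((3, 3), "right"))   # reaches the +1 terminal state at (4, 3) with probability 0.8
```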
► The following is the optimal policy/trajectory under deterministic transitions
► Not achievable in our stochastic transition model
Slide from Peter Bodik
► Optimal policy
► How can we learn this?
Slide from Peter Bodik
► Model-based RL: models the transitions p(S_t ∣ S_{t−1}, A_{t−1}). Examples: G-computation, MDP estimation
► Value-based RL: models the value/return p(G_t ∣ S_t, A_t). Examples: Q-learning, G-estimation
► Policy-based RL: models the policy π(A_t ∣ S_t). Examples: REINFORCE, marginal structural models
* We focus on off-policy RL here
► Assume that we know how good a state-action pair is
► Q: Which end state is the best? A: [4,3]
► Q: What is the best way to get there? A: Only [3,1]
Slide from Peter Bodik
► [2,1] is slightly better than [3,2] because of the risk of transitioning to [4,2] from [3,2]
► Which is the best way to [2,1]?
Slide from Peter Bodik
► The idea of dynamic programming for reinforcement learning is to recursively learn the best action/value in a previous state given the best action/value in future states
► Next: How do we get the value of each state?
Slide from Peter Bodik
► Q-learning is a value-based reinforcement learning method
► Recall: The value of a state s under a policy π is
  V^π(s) = E_π[ Σ_{m≥0} γ^m R_{t+m} ∣ S_t = s ]
► The value of a state-action pair (s, a) is
  Q^π(s, a) = E_π[ Σ_{m≥0} γ^m R_{t+m} ∣ S_t = s, A_t = a ]
γ is the reward discount factor*
* Mathematical tool more than anything
► Q-learning attempts to estimate Q^π with a function Q̂(s, a) such that π is the deterministic policy π(s) = arg max_a Q̂(s, a)
► The best Q̂ is the best state-action value function:
  Q*(s, a) := max_π Q^π(s, a)
► For the optimal Q-function Q*, "Bellman optimality" holds*:
  Q*(s, a) = E[ R_t + γ max_{a′} Q*(S_{t+1}, a′) ∣ S_t = s, A_t = a ]
  (state-action value = immediate reward + future (discounted) rewards)
► Look for functions with this property!
* A necessary property for optimality of dynamic programming
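A standard one-line argument for why this recursion must hold (a sketch, not spelled out on the slides): split the return from (s, a) into the immediate reward plus the discounted return from S_{t+1}, and note that acting optimally from S_{t+1} onwards is the same as picking the maximizing next action:

```latex
% Sketch: Bellman optimality from the definition of Q* (standard argument)
\begin{aligned}
Q^*(s,a) &= \max_\pi \mathbb{E}_\pi\Big[\textstyle\sum_{m \ge 0} \gamma^m R_{t+m} \,\Big|\, S_t = s,\, A_t = a\Big] \\
         &= \mathbb{E}\Big[R_t + \gamma \max_\pi \mathbb{E}_\pi\big[\textstyle\sum_{m \ge 0} \gamma^m R_{t+1+m} \,\big|\, S_{t+1}\big] \,\Big|\, S_t = s,\, A_t = a\Big] \\
         &= \mathbb{E}\big[\,R_t + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\big|\, S_t = s,\, A_t = a\,\big].
\end{aligned}
```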
► If states are discrete, s ∈ {0, …, K}, Q-learning can be solved exactly using dynamic programming (for small enough K)*
► Initialize a table of Q̂(s, a)
► Repeat:
  Q̂(S_t, A_t) ← Q̂(S_t, A_t) + α [ R_t + γ max_a Q̂(S_{t+1}, a) − Q̂(S_t, A_t) ]   (α is the learning rate)
* Converges to the optimal Q* if all state-action pairs are visited over and over again
Assume that transitions are deterministic for now*, and let each state-action pair be visited over and over again:
1. Initialize Q̂(s, a) = 0, let α = 1, γ = 1
2. Repeat: Q̂(S_t, A_t) ← Q̂(S_t, A_t) + α [ R_t + γ max_a Q̂(S_{t+1}, a) − Q̂(S_t, A_t) ]
[Q-table figures: the values propagate backwards from the +1 terminal state at [4,3]; the state-action pair next to the goal reaches 0.96, and pairs successively further away reach 0.92, 0.88, 0.84, 0.80 and 0.76]
* We will come back to this
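The same dynamic-programming update as a compact sketch, on a 3-cell corridor instead of the full grid (the −0.04 step reward is an assumption chosen so that the values match the 0.96, 0.92, … pattern above):

```python
# Minimal sketch: tabular Q-learning with deterministic transitions, alpha = 1,
# gamma = 1, on a 3-cell corridor whose right end is a +1 terminal state.
STATES = [1, 2, 3, "goal"]
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic transition; -0.04 per step (assumed), +1 for reaching the goal."""
    if s == "goal":
        return s, 0.0
    if s == 3 and a == "right":
        return "goal", 1.0
    nxt = min(s + 1, 3) if a == "right" else max(s - 1, 1)
    return nxt, -0.04

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma = 1.0, 1.0

for _ in range(20):                     # visit every state-action pair over and over again
    for s in [1, 2, 3]:
        for a in ACTIONS:
            s_next, r = step(s, a)
            target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[(3, "right")], Q[(2, "right")], Q[(1, "right")])   # approx. 1.0, 0.96, 0.92
```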
► If the number of states K is large or S_t is not discrete, we cannot maintain a table for Q̂(s, a)
► Instead, we may represent Q̂(s, a) by a function Q_θ and minimize the risk
  ℓ(Q_θ) = E[ ( R + γ max_{a′} Q̄(S′, a′) − Q_θ(S, A) )² ]
  where Q_θ is the current estimate and Q̄ is an old estimate of Q
► In the one-step case (no future states),
  ℓ(Q_θ) = E[ ( R_t + γ max_{a′} Q̄(S′, a′) − Q_θ(S, A) )² ] = E[ ( R_t − Q_θ(S, A) )² ]
► Finding Q(s, a) is analogous to finding the expected potential outcomes, the control outcome E[Y(0) ∣ X] and the treated outcome E[Y(1) ∣ X], by regression adjustment:
  min_f (1/n) Σ_{i=1}^{n} ( y_i − f(x_i, t_i) )²
► Fitted Q-learning is like covariate adjustment (regression) with a moving target (which is updated during learning):
  ℓ(Q_θ) = E[ ( y(S, A, S′, R) − Q_θ(S, A) )² ],  with target y = R + γ max_{a′} Q̄(S′, a′)
  (Q_θ(S, A) is the prediction; the expectation is over transitions (s, a, s′, r); the squared loss is one choice of loss)
► Where does our data come from? How do we evaluate the expectation in
  ℓ(Q_θ) = E[ ( R + γ max_{a′} Q̄(S′, a′) − Q_θ(S, A) )² ]  ?
► "What are the inputs and outputs of our regression?"
► Alternate between updates of Q_θ and Q̄
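A rough sketch of fitted Q-learning as alternating regression on a batch of transitions; the synthetic data, the RandomForestRegressor choice and the number of iterations are illustrative assumptions, not the lecture's prescription:

```python
# Minimal sketch: fitted Q iteration on a batch of (s, a, r, s') tuples.
# Alternates between (1) building regression targets with the old estimate Q_bar
# and (2) refitting the current estimate Q_theta.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d, n_actions, gamma = 1000, 4, 2, 0.9

S = rng.normal(size=(n, d))                                # states
A = rng.integers(0, n_actions, size=n)                     # actions (behavior policy)
R = S[:, 0] * (A == 1) + rng.normal(scale=0.1, size=n)     # rewards (synthetic)
S_next = S + rng.normal(scale=0.5, size=(n, d))            # next states

def features(S, A):
    return np.column_stack([S, A])                         # simple (state, action) features

Q_theta = RandomForestRegressor(n_estimators=50, random_state=0)
Q_theta.fit(features(S, A), R)                             # initialize with one-step regression

for _ in range(5):                                         # alternate: targets from old Q, then refit
    Q_bar = Q_theta
    # y = r + gamma * max_a' Q_bar(s', a')
    q_next = np.column_stack([Q_bar.predict(features(S_next, np.full(n, a)))
                              for a in range(n_actions)])
    y = R + gamma * q_next.max(axis=1)
    Q_theta = RandomForestRegressor(n_estimators=50, random_state=0)
    Q_theta.fit(features(S, A), y)

# Greedy policy: pick the action with the largest estimated Q-value
q_vals = np.column_stack([Q_theta.predict(features(S, np.full(n, a)))
                          for a in range(n_actions)])
print((q_vals.argmax(axis=1) == 1).mean())   # fraction of states where action 1 looks best
```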
► Tuples (s, a, s′, r) may be obtained by:
  ► On-policy exploration: "playing the game" with the current policy
  ► Randomized trials: executing a sequentially random policy
  ► Off-policy (observational): e.g., healthcare records
► The latter is most relevant to us!
► Trajectories (s_1, a_1, r_1), …, (s_T, a_T, r_T) of states s_t, actions a_t, and rewards r_t, observed e.g. in medical records
► Actions are drawn according to a behavior policy μ, but we want to know the value of a new policy π
► Learning policies from this data is at least as hard as estimating treatment effects from observational data
► Sufficient conditions for identifying the value function
Single-step case:
  Strong ignorability: Y(0), Y(1) ⫫ T ∣ X  ("No hidden confounders")
  Overlap: ∀x, t: p(T = t ∣ X = x) > 0  ("All actions possible")
Sequential case:
  Sequential randomization: G_t ⫫ A_t ∣ H_t  ("Reward independent of policy given history")
  Positivity: ∀a, t: p(A_t = a ∣ H_t) > 0  ("All actions possible at all times")
[Recap figure: Anna's covariates (Age = 54, Gender = Female, Race = Asian, Blood pressure = 150/95, WBC count = 6.8×10⁹/L, Temperature = 36.7°C, Blood sugar = High); a choice between Medication A ("Control", T = 0) and Medication B ("Treated", T = 1); unknown potential outcomes Y(0), Y(1) ("Blood sugar = ?") at a later date (timeline marked Sep 15 and May 15)]
► We assumed a simple causal graph. This let us identify the causal effect
[Causal graph: state X, treatment A, outcome Y; Y(a) is the potential outcome under action a; ignorability: Y(a) ⫫ A ∣ X]
► Let's add a time point…
[Causal graph over t = 1, 2: states X_1, X_2, treatments A_1, A_2, outcomes Y_1, Y_2; ignorability: Y_t(a) ⫫ A_t ∣ X_t]
► What influences her state?
– It is likely that if Anna is diabetic, she will remain so; Anna's health status depends on how we treated her
– The outcome at a later time may depend on an earlier state; the outcome at a later time point may depend on earlier choices
– If we already tried a treatment, we might not try it again; if the last treatment was unsuccessful, it may change our next choice; if we know that a patient had a symptom previously, it may affect future decisions
(Ignorability: Y_t(a) ⫫ A_t ∣ X_t)
► To have sequential ignorability, we need to remember history!
[Causal graph with history nodes H_1, H_2 collecting past states, actions and outcomes; ignorability: Y_t(a) ⫫ A_t ∣ H_t]
► The difficulty with history is that its size grows with time
► A simple change of the standard MDP is to store the states and actions of a length-k window looking backwards
► Another alternative is to learn a summary function that maintains what is relevant for making optimal decisions, e.g., using an RNN
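One simple way to implement the fixed-length window idea (a sketch; the window length k and the zero-padding scheme are arbitrary choices for illustration):

```python
# Minimal sketch: turn a trajectory of (state, action) pairs into windowed
# "history states" H_t = (S_{t-k+1}, A_{t-k+1}, ..., S_t), padding at the start.
import numpy as np

def history_states(states, actions, k=3):
    """states: (T, d) array, actions: (T,) array. Returns (T, k*d + (k-1)) features."""
    T, d = states.shape
    rows = []
    for t in range(T):
        feats = []
        for j in range(t - k + 1, t + 1):
            if j < 0:                               # pad before the start of the trajectory
                feats.extend([np.zeros(d), np.zeros(1)])
            else:
                a = actions[j] if j < t else 0      # the action at time t is not yet taken
                feats.extend([states[j], np.array([a])])
        rows.append(np.concatenate(feats)[:-1])     # drop the placeholder action at time t
    return np.array(rows)

states = np.arange(10).reshape(5, 2)                # toy 5-step trajectory, 2 state features
actions = np.array([1, 0, 1, 1, 0])
print(history_states(states, actions, k=2).shape)   # (5, 5): 2*2 state dims + 1 past action
```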
► We cannot leave out unobserved confounders
[Causal graphs in which an unobserved confounder U affects both actions A_t and outcomes Y_t]
► Full observability: everything important to optimal action is observed
► Markov dynamics: history is unimportant given recent state(s)
► Limitless exploration & self-play through simulation: we can test "any" policy and observe the outcome
► Noise-less state/outcome (for games, specifically)