SLIDE 1

Reinforcement learning

Fredrik D. Johansson Clinical ML @ MIT 6.S897/HST.956: Machine Learning for Healthcare, 2019

SLIDE 2

Reminder: Causal effects

β–Ί Potential outcomes under treatment and control, $Y(1)$, $Y(0)$
β–Ί Covariates and treatment, $X$, $T$
β–Ί Conditional average treatment effect (CATE):

$\mathrm{CATE}(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]$

[Figure: features $X$, treatment $T$, potential outcomes $Y$]

SLIDE 3

Today: Treatment policies/regimes

β–Ί A policy $\pi$ assigns treatments to patients

(typically depending on their medical history/state)

β–Ί Example: For a patient with medical history $x$,

$\pi(x) = \mathbb{1}[\mathrm{CATE}(x) > 0]$

β–Ί Today we focus on policies guided by clinical outcomes

(as opposed to legislation, monetary cost or side-effects)

β€œTreat if effect is positive”
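
A minimal sketch (mine, not from the slides) of this thresholding rule as a plug-in policy; `cate_model` stands for a hypothetical fitted CATE estimator with a scikit-learn-style `predict`:

```python
import numpy as np

def policy(X, cate_model):
    """Plug-in policy pi(x) = 1[CATE_hat(x) > 0]: treat exactly those
    patients whose estimated effect is positive."""
    return (cate_model.predict(X) > 0).astype(int)
```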

SLIDE 4

Example: Sepsis management

β–Ί Sepsis is a complication of an infection which can lead to massive organ failure and death
β–Ί One of the leading causes of death in the ICU
β–Ί The primary treatment target is the infection
β–Ί Other symptoms need management: breathing difficulties, low blood pressure, …

SLIDE 5

Recall: Potential outcomes

[Figure: decision timeline for a septic patient with breathing difficulties: mechanical ventilation? sedation? vasopressors? Unobserved responses $Y(0)$, $Y(1)$ vs. observed decisions & response; covariates $X$, treatment $T$; outcome: blood oxygen]

1. Should the patient be put on mechanical ventilation?

SLIDE 6

Today: Sequential decision making

β–Ί Many clinical decisions are made in sequence
β–Ί Choices early may rule out actions later
β–Ί Can we optimize the policy by which actions are made?

[Diagram: timeline $t_0, t_1, \dots, t_T$ with states $S_0, S_1, \dots, S_T$, actions $A_0, A_1, \dots$, and rewards $R_0, R_1, \dots, R_T$]

SLIDE 7

Recall: Potential outcomes

[Figure: decision timeline for the septic patient with breathing difficulties, as above]

1. Should the patient be put on mechanical ventilation?

SLIDE 8

Example: Sepsis management

[Figure: decision timeline, as above]

2. Should the patient be sedated? (To alleviate discomfort due to mechanical ventilation)

SLIDE 9

Example: Sepsis management

[Figure: decision timeline, as above]

3. Should we artificially raise blood pressure? (Which may have dropped due to sedation)

SLIDE 10

Example: Sepsis management

[Figure: decision timeline with all observed decisions & responses for the septic patient]

SLIDE 11

Finding optimal policies

β–Ί How can we treat patients so that their outcomes are as good as possible?
β–Ί What are good outcomes?
β–Ί Which policies should we consider?

[Figure: decision timeline (mechanical ventilation? sedation? vasopressors?) leading to an outcome]

SLIDE 12

Success stories in popular press

β–Ί AlphaStar
β–Ί AlphaGo
β–Ί DQN Atari
β–Ί OpenAI Five

SLIDE 13

Reinforcement learning

[Figure: game state $S_0$ and possible actions $A_0$; taking an action yields next state $S_1$ and reward $R_1$ (loss). Figure by Tim Wheeler, tim.hibal.org]

β–Ί Maximize reward!

SLIDE 14

Great! Now let’s treat patients

β–Ί The patient state $S_t$ at time $t$ is like the game board
β–Ί Medical treatments $A_t$ are like the actions
β–Ί Outcomes $R_t$ are the rewards in the game

β–Ί What could possibly go wrong?

[Diagram: timeline $t_0, t_1, \dots, t_T$ with states $S_0, S_1, \dots, S_T$, actions $A_0, A_1, \dots$, and rewards $R_0, R_1, \dots, R_T$]

SLIDE 15
1. Decision processes
2. Reinforcement learning
3. Learning from batch (off-policy) data
4. Reinforcement learning in healthcare

SLIDE 16

Decision processes

β–Ί An agent repeatedly, at times $t$, takes actions $A_t$ to receive rewards $R_t$ from an environment, the state $S_t$ of which is (partially) observed

[Diagram: agent–environment loop: the agent sends action $A_t$; the environment returns reward $R_t$ and state $S_t$]

SLIDE 17

Decision process: Mechanical ventilation

[Figure: decision timeline (mechanical ventilation? sedation? spontaneous breathing trial?) with starting state $S_0$, states and rewards $(S_1, R_1), (S_2, R_2)$, actions $A_0, A_1, A_2$, and final reward $R_T$; agent–environment loop]

$R_t = R_t^{\text{vitals}} + R_t^{\text{vent off}} + R_t^{\text{vent on}}$

SLIDE 18

Decision process: Mechanical ventilation

[Figure: states $S_0, S_1, S_2$ along the timeline]

β–Ί The state $S_t$ includes demographics, physiological measurements, ventilator settings, level of consciousness, dosage of sedatives, time to ventilation, and number of intubations

SLIDE 19

Decision process: Mechanical ventilation

[Figure: actions $A_0, A_1, A_2$ along the timeline]

β–Ί Actions $A_t$ include intubation and extubation, as well as administration and dosages of sedatives

SLIDE 20

Decision processes

β–Ί A decision process specifies how states $S_t$, actions $A_t$, and rewards $R_t$ are distributed: $p(S_0, \dots, S_T, A_0, \dots, A_T, R_0, \dots, R_T)$

β–Ί The agent interacts with the environment according to a behavior policy $\mu = p(A_t \mid \cdots)$*

* The … depends on the type of agent

SLIDE 21

Markov Decision Processes

β–Ί Markov decision processes (MDPs) are a special case
β–Ί Markov transitions:

$p(S_t \mid S_0, \dots, S_{t-1}, A_0, \dots, A_{t-1}) = p(S_t \mid S_{t-1}, A_{t-1})$

β–Ί Markov reward function:

$p(R_t \mid S_0, \dots, S_t, A_0, \dots, A_t) = p(R_t \mid S_t, A_t)$

β–Ί Markov action policy:

$\mu = p(A_t \mid S_t) = p(A_t \mid S_0, \dots, S_t, A_0, \dots, A_{t-1})$

SLIDE 22

Markov assumption

β–Ί State transitions, actions, and rewards depend only on the most recent state-action pair

[Diagram: $S_0 \to S_1 \to \dots \to S_T$ with actions $A_0, \dots, A_T$ and rewards $R_0, \dots, R_T$ attached to each step]

SLIDE 23

Contextual bandits (special case)*

β–Ί States are independent: $p(S_t \mid S_{t-1}, A_{t-1}) = p(S_t)$

β–Ί Equivalent to single-step case: potential outcomes!

[Diagram: independent blocks $(S_t, A_t, R_t)$ with no arrows between time steps]

* The term β€œcontextual bandits” has connotations of efficient exploration, which is not addressed here

SLIDE 24

Contextual bandits & potential outcomes

β–Ί Think of each state $S_i$ as an i.i.d. patient, the action $A_i$ as the treatment group indicator, and $R_i$ as the outcome

[Diagram: independent patients $(S_i, A_i, R_i)$]

SLIDE 25

Goal of RL

β–Ί Like previously with causal effect estimation, we are interested in the effects of actions $A_t$ on future rewards

[Diagram: sequence $S_0, A_0, R_0, \dots, S_T, A_T, R_T$]

SLIDE 26

Value maximization

β–Ί The goal of most RL algorithms is to maximize the expected cumulative reward, the value $V^\pi$, of its policy $\pi$

β–Ί Return (sum of future rewards): $G_t = \sum_{u=t}^{T} R_u$

β–Ί Value (expected sum of rewards under policy $\pi$): $V^\pi = \mathbb{E}_\pi[G_0]$

β–Ί The expectation is taken with respect to scenarios acted out according to the learned policy $\pi$

SLIDE 27

Example

β–Ί Let’s say that we have data from a policy $\pi$

Per-patient returns:

$G^{(1)} = R_1^{(1)} + R_2^{(1)} + R_3^{(1)}$
$G^{(2)} = R_1^{(2)} + R_2^{(2)} + R_3^{(2)}$
$G^{(3)} = R_1^{(3)} + R_2^{(3)} + R_3^{(3)}$

Observed actions:

Patient 1: $a_1^{(1)} = 1$, $a_2^{(1)} = 0$, $a_3^{(1)} = 1$
Patient 2: $a_1^{(2)} = 0$, $a_2^{(2)} = 1$, $a_3^{(2)} = 1$
Patient 3: $a_1^{(3)} = 0$, $a_2^{(3)} = 0$, $a_3^{(3)} = 0$

Value estimate (average of observed returns):

$V^\pi \approx \frac{1}{n} \sum_{i=1}^{n} G^{(i)}$
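
A small sketch (mine) of this estimator: sum each patient’s rewards over time to get the return $G^{(i)}$, then average the returns. The reward values below are made up for illustration:

```python
import numpy as np

# Hypothetical rewards R_t^(i): rows = patients, columns = time steps.
rewards = np.array([[1.0, 0.0, 2.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 1.0]])

returns = rewards.sum(axis=1)        # G^(i) = R_1^(i) + R_2^(i) + R_3^(i)
value_estimate = returns.mean()      # V^pi ~= (1/n) * sum_i G^(i)
print(returns, value_estimate)       # [3.  1.5 1. ] 1.8333...
```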

SLIDE 28

Robot in a room

[Grid figure: 4Γ—3 room; +1 at [4,3], βˆ’1 at [4,2]; Start at the lower left]

β–Ί Stochastic actions:

$p(\text{move up} \mid A = \text{β€œup”}) = 0.8$; available non-opposite moves have uniform probability

β–Ί Rewards:

+1 at [4,3] (terminal state)
βˆ’1 at [4,2] (terminal)
βˆ’0.04 per step

Slide from Peter Bodik
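
For illustration, one step of this environment might be coded as below (the [column, row] coordinate convention and the wall at [2,2] are assumptions borrowed from the classic version of this example, not stated on the slide):

```python
import numpy as np

# "Robot in a room": 4 columns x 3 rows, coordinates (column, row).
TERMINAL = {(4, 3): +1.0, (4, 2): -1.0}
STEP_COST = -0.04
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
OPPOSITE = {"up": "down", "down": "up", "left": "right", "right": "left"}

def step(state, action, rng):
    """Intended move with prob. 0.8; otherwise a uniform non-opposite move."""
    others = [m for m in MOVES if m not in (action, OPPOSITE[action])]
    move = action if rng.random() < 0.8 else others[rng.integers(len(others))]
    dx, dy = MOVES[move]
    nxt = (state[0] + dx, state[1] + dy)
    if not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3) or nxt == (2, 2):
        nxt = state  # bumping into the edge or the wall leaves the robot in place
    return nxt, TERMINAL.get(nxt, STEP_COST), nxt in TERMINAL

rng = np.random.default_rng(0)
print(step((1, 1), "up", rng))  # e.g. ((1, 2), -0.04, False)
```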

SLIDE 29

Robot in a room

[Grid figure: each free cell marked with β€œ?”; +1 at [4,3], βˆ’1 at [4,2]]

β–Ί Stochastic actions: $p(\text{move up} \mid A = \text{β€œup”}) = 0.8$; available non-opposite moves have uniform probability
β–Ί Rewards: +1 at [4,3] (terminal state), βˆ’1 at [4,2] (terminal), βˆ’0.04 per step
β–Ί What is the optimal policy?

Slide from Peter Bodik

SLIDE 30

Robot in a room

[Grid figure: a single trajectory from Start to the +1 cell at [4,3]]

β–Ί This is the optimal policy/trajectory under deterministic transitions
β–Ί It is not achievable in our stochastic transition model

Slide from Peter Bodik

SLIDE 31

Robot in a room

[Grid figure: arrows showing the optimal action in every cell; +1 at [4,3], βˆ’1 at [4,2]]

β–Ί Optimal policy
β–Ί How can we learn this?

Slide from Peter Bodik

SLIDE 32
1. Decision processes
2. Reinforcement learning
3. Learning from batch (off-policy) data
4. Reinforcement learning in healthcare

SLIDE 33

Paradigms*

Model-based RL: model transitions $p(S_t \mid S_{t-1}, A_{t-1})$; examples: MDP estimation (RL), G-computation (causal inference)
Value-based RL: model the value/return $p(G_t \mid S_t, A_t)$; examples: Q-learning (RL), G-estimation (causal inference)
Policy-based RL: model the policy $p(A_t \mid S_t)$; examples: REINFORCE (RL), marginal structural models (causal inference)

*We focus on off-policy RL here

SLIDE 34

Paradigms*

Model-based RL: model transitions $p(S_t \mid S_{t-1}, A_{t-1})$; examples: MDP estimation (RL), G-computation (causal inference)
Value-based RL: model the value/return $p(G_t \mid S_t, A_t)$; examples: Q-learning (RL), G-estimation (causal inference)
Policy-based RL: model the policy $p(A_t \mid S_t)$; examples: REINFORCE (RL), marginal structural models (causal inference)

*We focus on off-policy RL here

SLIDE 35

Dynamic programming

[Grid figure: Start, [3,1], and the +1 cell [4,3] highlighted; βˆ’1 at [4,2]]

β–Ί Assume that we know how good a state-action pair is
β–Ί Q: Which end state is the best? A: [4,3]
β–Ί Q: What is the best way to get there? A: Only [3,1]

Slide from Peter Bodik

SLIDE 36

Dynamic programming

[Grid figure: Start, [2,1], [3,2], and [4,2] highlighted]

β–Ί [2,1] is slightly better than [3,2] because of the risk of transitioning to [4,2] from [3,2]
β–Ί Which is the best way to [2,1]?

Slide from Peter Bodik

SLIDE 37

Dynamic programming

β–Ί The idea of dynamic programming for reinforcement learning is to recursively learn the best action/value in a previous state given the best action/value in future states

[Grid figure: +1 at [4,3], βˆ’1 at [4,2]]

Slide from Peter Bodik

SLIDE 38

Dynamic programming

β–Ί Next: How do we get the value of each state?

[Grid figure: +1 at [4,3], βˆ’1 at [4,2]]

Slide from Peter Bodik

SLIDE 39

Q-learning

β–Ί Q-learning is a value-based reinforcement learning method
β–Ί Recall: The value of a state $s$ under a policy $\pi$ is

$v^\pi(s) := \mathbb{E}_\pi[G_t \mid S_t = s] := \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \;\middle|\; S_t = s\right]$

where $\gamma$ is the reward discount factor*

*Mathematical tool more than anything

SLIDE 40

Q-learning

β–Ί Q-learning is a value-based reinforcement learning method
β–Ί The value of a state $s$ under a policy $\pi$ is

$v^\pi(s) := \mathbb{E}_\pi[G_t \mid S_t = s] := \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k} \;\middle|\; S_t = s\right]$

β–Ί The value of a state-action pair $(s, a)$ is

$Q^\pi(s, a) := \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$

where $\gamma$ is the reward discount factor*

*Mathematical tool more than anything

SLIDE 41

Q-learning

β–Ί Q-learning attempts to estimate $Q^\pi$ with a function $\hat{Q}(s, a)$ such that $\pi$ is the deterministic policy $\pi(s) = \arg\max_a \hat{Q}(s, a)$

β–Ί The best $\hat{Q}$ is the optimal state-action value function

$\hat{Q}^*(s, a) = \max_\pi Q^\pi(s, a) =: Q^*(s, a)$

SLIDE 42

Bellman equation

β–Ί For the optimal Q-function $Q^*$, β€œBellman optimality” holds*

$Q^*(s, a) = \mathbb{E}\!\left[R_t + \gamma \max_{a'} Q^*(S_{t+1}, a') \;\middle|\; S_t = s, A_t = a\right]$

(state-action value = immediate reward + future (discounted) rewards)

β–Ί Look for functions with this property!

*A necessary property for optimality of dynamic programming

SLIDE 43

Q-learning with discrete states

β–Ί If states are discrete, $s \in \{0, \dots, K\}$, Q-learning can be solved exactly using dynamic programming (for small enough $K$)*
β–Ί Initialize a table of $\hat{Q}(s, a)$
β–Ί Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

where $\alpha$ is the learning rate

*Converges to the optimal $Q^*$ if all state-action pairs are visited over and over again
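
A runnable sketch of this tabular update (mine, not from the slides), applied to a toy three-state chain instead of the grid; the environment, the Ξ΅-greedy exploration, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy chain MDP (illustrative): states 0, 1, 2; state 2 is terminal.
# Action 1 moves right (+1 reward on reaching state 2); action 0 stays (0 reward).
def env_step(s, a):
    if a == 1:
        return s + 1, (1.0 if s + 1 == 2 else 0.0), s + 1 == 2
    return s, 0.0, False

K, n_actions = 3, 2
Q = np.zeros((K, n_actions))           # table of Q_hat(s, a)
alpha, gamma, eps = 0.5, 0.9, 0.1      # learning rate, discount, exploration

for episode in range(200):
    s = 0
    for _ in range(50):                # cap episode length
        # eps-greedy behavior keeps visiting all state-action pairs
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r, done = env_step(s, a)
        target = r + (0.0 if done else gamma * Q[s_next].max())
        Q[s, a] += alpha * (target - Q[s, a])   # the tabular update above
        s = s_next
        if done:
            break

print(np.round(Q, 2))  # greedy action (argmax per row) becomes "move right"
```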

SLIDE 44

Q-learning with discrete states

1. Initialize $\hat{Q}(s, a) = 0$; let $\alpha, \gamma = 1$
2. Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

[Q-table after the first sweep: 0.96 in the cell entering the +1 goal; βˆ’1.04 in cells entering the βˆ’1 state; βˆ’0.04 elsewhere]

Assume that transitions are deterministic for now. Let each state-action pair be visited in order, over and over*

* We will come back to this

SLIDE 45

Q-learning with discrete states

1. Initialize $\hat{Q}(s, a) = 0$; let $\alpha, \gamma = 1$
2. Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

[Q-table after the second sweep: 0.96 next to the goal and 0.92 one step further back; βˆ’1.04 entering the βˆ’1 state; βˆ’0.08 elsewhere]

SLIDE 46

Q-learning with discrete states

1. Initialize $\hat{Q}(s, a) = 0$; let $\alpha, \gamma = 1$
2. Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

[Q-table after the third sweep: values 0.88–0.96 spreading back from the goal; βˆ’1.04 entering the βˆ’1 state; around βˆ’0.12 elsewhere]

SLIDE 47

Q-learning with discrete states

1. Initialize $\hat{Q}(s, a) = 0$; let $\alpha, \gamma = 1$
2. Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

[Q-table after the fourth sweep: values 0.84–0.96; βˆ’1.04 entering the βˆ’1 state; βˆ’0.16 elsewhere]

SLIDE 48

Q-learning with discrete states

1. Initialize $\hat{Q}(s, a) = 0$; let $\alpha, \gamma = 1$
2. Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

[Q-table after the fifth sweep: values 0.80–0.96; βˆ’1.04 entering the βˆ’1 state; βˆ’0.18 in the cells farthest from the goal]

SLIDE 49

Q-learning with discrete states

1. Initialize $\hat{Q}(s, a) = 0$; let $\alpha, \gamma = 1$
2. Repeat:

$\hat{Q}(S_t, A_t) \leftarrow \hat{Q}(S_t, A_t) + \alpha\left(R_t + \gamma \max_a \hat{Q}(S_{t+1}, a) - \hat{Q}(S_t, A_t)\right)$

[Q-table after convergence: values 0.76–0.96, decreasing with distance from the +1 goal; βˆ’1.04 for actions entering the βˆ’1 state]

SLIDE 50

Fitted Q-learning (with function approximation)

β–Ί If the number of states $K$ is large, or $S_t$ is not discrete, we cannot maintain a table for $\hat{Q}(s, a)$
β–Ί Instead, we may represent $\hat{Q}(s, a)$ by a parameterized function $Q_\theta$ and minimize the risk

$\mathcal{L}(Q_\theta) = \mathbb{E}\!\left[\left(R + \gamma \max_{a'} \bar{Q}(S', a') - Q_\theta(S, A)\right)^2\right]$

where $Q_\theta$ is the current estimate and $\bar{Q}$ is an old estimate of $Q$
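
A sketch of this fitted variant under assumptions of mine: $Q_\theta$ is linear in one-hot (state, action) features, and a random batch stands in for real transitions. Each iteration freezes the old estimate $\bar{Q}$ to build regression targets, then refits $\theta$ by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in batch of transitions (s, a, r, s'); in practice these would
# come from logged data rather than a random generator.
n, K, n_actions, gamma = 500, 5, 2, 0.9
S = rng.integers(K, size=n)
A = rng.integers(n_actions, size=n)
R = rng.normal(size=n)
S_next = rng.integers(K, size=n)

def featurize(s, a):
    """One-hot (state, action) features, so Q_theta(s, a) = theta . phi(s, a)."""
    phi = np.zeros((len(s), K * n_actions))
    phi[np.arange(len(s)), s * n_actions + a] = 1.0
    return phi

theta = np.zeros(K * n_actions)
for _ in range(20):
    # Freeze the old estimate Q_bar (current theta) to build targets:
    # y = r + gamma * max_a' Q_bar(s', a')
    q_next = np.stack([featurize(S_next, np.full(n, a)) @ theta
                       for a in range(n_actions)])
    y = R + gamma * q_next.max(axis=0)
    # Refit Q_theta by minimizing the squared risk on the batch.
    theta, *_ = np.linalg.lstsq(featurize(S, A), y, rcond=None)
```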

SLIDE 51

Bellman equation (one step)

β–Ί In the one-step case (no future states),

$\mathcal{L}(Q_\theta) = \mathbb{E}\!\left[\left(R + \gamma \max_{a'} \bar{Q}(S', a') - Q_\theta(S, A)\right)^2\right] = \mathbb{E}\!\left[\left(R - Q_\theta(S, A)\right)^2\right]$

β–Ί Finding $Q(s, a)$ is analogous to finding the expected potential outcomes $\mathbb{E}[R(a) \mid S = s]$ in the one-step case!

SLIDE 52

Recall: Potential outcomes

[Figure: covariates $x$ vs. outcome $Y(t)$, with regression curves for the control outcome $\mathbb{E}[Y(0) \mid X]$ and the treated outcome $\mathbb{E}[Y(1) \mid X]$]

Regression adjustment:

$\min_{f_t} \frac{1}{n_t} \sum_{i : t_i = t} \left(f_t(x_i) - y_i\right)^2$

SLIDE 53

Fitted Q-learning as covariate adjustment

β–Ί Fitted Q-learning is like covariate adjustment (regression) with a moving target (which is updated during learning):

$\mathcal{L}(Q_\theta) = \mathbb{E}\!\left[\left(y(S, A, S', R) - Q_\theta(S, A)\right)^2\right], \qquad y(S, A, S', R) := R + \gamma \max_{a'} \bar{Q}(S', a')$

($Q_\theta(S, A)$ is the prediction and $y$ the target; the expectation is over transitions $(s, a, s', r)$; the squared loss is one choice among several)

SLIDE 54

Off-policy learning

β–Ί Where does our data come from? How do we evaluate this expectation?

$\mathcal{L}(Q_\theta) = \mathbb{E}\!\left[\left(R + \gamma \max_{a'} \bar{Q}(S', a') - Q_\theta(S, A)\right)^2\right]$

β–Ί β€œWhat are the inputs and outputs of our regression?”
β–Ί Alternate between updates of $\bar{Q}$ and $Q_\theta$

SLIDE 55

Exploration in RL

β–Ί Tuples $(s, a, s', r)$ may be obtained by:

β–Ί On-policy exploration: β€œplaying the game” with the current policy
β–Ί Randomized trials: executing a sequentially random policy
β–Ί Off-policy (observational) data: e.g., healthcare records

β–Ί The last of these is most relevant to us!

SLIDE 56
1. Decision processes
2. Reinforcement learning paradigms
3. Learning from batch (off-policy) data
4. Reinforcement learning in healthcare

SLIDE 57

Off-policy learning

β–Ί Trajectories $(s_1, a_1, r_1), \dots, (s_T, a_T, r_T)$ of states $s_t$, actions $a_t$, and rewards $r_t$ are observed in, e.g., medical records

β–Ί Actions are drawn according to a behavior policy $\mu$, but we want to know the value of a new policy $\pi$

β–Ί Learning policies from this data is at least as hard as estimating treatment effects from observational data

SLIDE 58

Assumptions for (off-policy) RL

β–Ί Sufficient conditions for identifying the value function

Single-step case:
Strong ignorability: $Y(0), Y(1) \perp\!\!\!\perp T \mid X$ (β€œno hidden confounders”)
Overlap: $\forall x, t: p(T = t \mid X = x) > 0$ (β€œall actions possible”)

Sequential case:
Sequential randomization: $G \perp\!\!\!\perp A_t \mid \bar{S}_t, \bar{A}_{t-1}$ (β€œreward independent of policy given history”)
Positivity: $\forall a, t: p(A_t = a \mid \bar{S}_t, \bar{A}_{t-1}) > 0$ (β€œall actions possible at all times”)

SLIDE 59

Assumptions for (off-policy) RL

β–Ί Sufficient conditions for identifying the value function

Single-step case:
Strong ignorability: $Y(0), Y(1) \perp\!\!\!\perp T \mid X$ (β€œno hidden confounders”)
Overlap: $\forall x, t: p(T = t \mid X = x) > 0$ (β€œall actions possible”)

Sequential case:
Sequential randomization: $G \perp\!\!\!\perp A_t \mid \bar{S}_t, \bar{A}_{t-1}$ (β€œreward independent of policy given history”)
Positivity: $\forall a, t: p(A_t = a \mid \bar{S}_t, \bar{A}_{t-1}) > 0$ (β€œall actions possible at all times”)

SLIDE 60

Recap: Learning potential outcomes

[Figure: Anna, observed Sep 15 with covariates: age = 54, gender = female, race = Asian, blood pressure = 150/95, WBC count = 6.8Β·10⁹/L, temperature = 36.7 Β°C, blood sugar = high. Treatment options: Medication A (β€œcontrol”, $T = 0$) vs. Medication B (β€œtreated”, $T = 1$). Potential outcomes at follow-up (May 15): blood sugar = ? under $Y(0)$ and blood sugar = ? under $Y(1)$]

SLIDE 61

Treating Anna once

β–Ί We assumed a simple causal graph. This let us identify the causal effect of treatment on outcome from observational data

[Causal graph: state $S$ affects both treatment $A$ and outcome $R$; the $A \to R$ edge is the effect of treatment]

Ignorability: $R(a) \perp\!\!\!\perp A \mid S$, where $R(a)$ is the potential outcome under action $a$

SLIDE 62

Treating Anna over time

β–Ί Let’s add a time point…

[Causal graph: $(S_1, A_1, R_1)$ at $t = 1$ and $(S_2, A_2, R_2)$ at $t = 2$]

Ignorability: $R_t(a) \perp\!\!\!\perp A_t \mid S_t$

SLIDE 63

Treating Anna over time

β–Ί What influences her state?

[Causal graph: as above, with added arrows $S_1 \to S_2$ and $A_1 \to S_2$]

It is likely that if Anna is diabetic, she will remain so. Anna’s health status depends on how we treated her.

Ignorability: $R_t(a) \perp\!\!\!\perp A_t \mid S_t$

SLIDE 64

Treating Anna over time

β–Ί What influences her state?

[Causal graph: as above, with added arrows $S_1 \to R_2$ and $A_1 \to R_2$]

The outcome at a later time may depend on an earlier state. The outcome at a later time point may depend on earlier choices.

Ignorability: $R_t(a) \perp\!\!\!\perp A_t \mid S_t$

SLIDE 65

Treating Anna over time

β–Ί What influences her state?

[Causal graph: as above, with added arrows from $A_1$ and $R_1$ to $A_2$]

If we already tried a treatment, we might not try it again. If the last treatment was unsuccessful, it may change our next choice. If we know that a patient had a symptom previously, it may affect future decisions.

Ignorability: $R_t(a) \perp\!\!\!\perp A_t \mid S_t$

SLIDE 66

State & ignorability

β–Ί To have sequential ignorability, we need to remember history!

[Causal graph: the per-time states are replaced by histories $H_1, H_2$, where $H_t$ collects the states and actions up to time $t$]

Ignorability: $R_t(a) \perp\!\!\!\perp A_t \mid H_t$

SLIDE 67

Summarizing history

β–Ί The difficulty with history is that its size grows with time
β–Ί A simple change to the standard MDP is to store the states and actions of a length-$l$ window looking backwards (see the sketch below)
β–Ί Another alternative is to learn a summary function that maintains what is relevant for making optimal decisions, e.g., using an RNN
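
A minimal sketch (assumptions mine) of the windowed-history idea: the effective state is the tuple of the last $l$ (state, action) pairs:

```python
from collections import deque

def window_state(history, l):
    """Effective state: the last l (state, action) pairs of the history.
    Returned as a tuple so it can index a Q-table / dict."""
    window = deque(history, maxlen=l)   # keeps only the last l items
    return tuple(window)

history = [("s1", "a1"), ("s2", "a2"), ("s3", "a3")]
print(window_state(history, l=2))       # (('s2', 'a2'), ('s3', 'a3'))
```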

SLIDE 68

State & ignorability

β–Ί We cannot leave out unobserved confounders

[Causal graph: histories $H_1, H_2$ with an unobserved confounder $U$ influencing both actions and outcomes]

SLIDE 69

What made success possible/easier?

β–Ί Full observability: everything important to optimal action is observed
β–Ί Markov dynamics: history is unimportant given recent state(s)
β–Ί Limitless exploration & self-play through simulation: we can test β€œany” policy and observe the outcome
β–Ί Noise-less state/outcome (for games, specifically)

SLIDE 70
1. Decision processes
2. Reinforcement learning paradigms
3. Learning from batch (off-policy) data
4. Reinforcement learning in healthcare. Tomorrow!