Reinforcement learning
Fredrik D. Johansson Clinical ML @ MIT 6.S897/HST.956: Machine Learning for Healthcare, 2019
Reminder: Causal effects
► Potential outcomes under treatment and control, Y(1), Y(0)
► Covariates and treatment, X, T
► Conditional average treatment effect (CATE):
  CATE(x) = E[Y(1) − Y(0) ∣ X = x]
[Figure: features X, treatment T, potential outcomes Y(0), Y(1)]
► A policy π assigns treatments to patients
(typically depending on their medical history/state)
► Example: For a patient with medical history x,
  π(x) = 1[CATE(x) > 0]
  "Treat if effect is positive"
► Today we focus on policies guided by clinical outcomes
(as opposed to legislation, monetary cost or side-effects)
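To make the "treat if effect is positive" rule concrete, here is a minimal sketch (not from the lecture): a two-model (T-learner) CATE estimate plugged into π(x) = 1[CATE(x) > 0]. The synthetic data, the LinearRegression choice and the variable names are illustrative assumptions only.

```python
# Minimal sketch (assumed setup): estimate CATE(x) with one outcome model per
# treatment arm, then define the policy pi(x) = 1[CATE_hat(x) > 0].
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                    # covariates (synthetic)
T = rng.integers(0, 2, size=500)                 # observed treatments
Y = X[:, 0] * T + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)  # outcomes

m1 = LinearRegression().fit(X[T == 1], Y[T == 1])   # estimates E[Y | X, T = 1]
m0 = LinearRegression().fit(X[T == 0], Y[T == 0])   # estimates E[Y | X, T = 0]

def cate_hat(x):
    return m1.predict(x) - m0.predict(x)         # estimated CATE(x)

def policy(x):
    return (cate_hat(x) > 0).astype(int)         # "treat if effect is positive"

print(policy(X[:5]))
```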
► Sepsis is a complication of an infection which can lead to massive organ failure and death
► One of the leading causes of death in the ICU
► The primary treatment target is the infection
► Other symptoms need management: breathing difficulties, low blood pressure, …
[Timeline figure: a septic patient with breathing difficulties; decisions over time (mechanical ventilation? sedation? vasopressors?), observed decisions & responses, and unobserved responses Y(0), Y(1)]
► Many clinical decisions are made in sequence
► Choices early may rule out actions later
► Can we optimize the policy by which actions are made?
[Diagram: times t_1, …, t_T with states S_t, actions A_t, and rewards R_t]
[The timeline figure is built up over several slides, one decision at a time, for a septic patient with breathing difficulties:
– mechanical ventilation?
– sedation? (to alleviate discomfort due to mechanical ventilation)
– vasopressors: artificially raise blood pressure? (which may have dropped due to sedation)]
► How can we treat patients so that their outcomes are as good as possible?
► What are good outcomes?
► Which policies should we consider?
► AlphaStar  ► AlphaGo  ► DQN Atari  ► OpenAI Five
[Figure by Tim Wheeler, tim.hibal.org: game state S_t, possible actions A_t, next state S_{t+1}, reward R_{t+1} (loss)]
► Maximize reward!
► Patient state at time t is like the game board
► Medical treatments A_t are like the actions
► Outcomes R_t are the rewards in the game
► What could possibly go wrong?
[Diagram: times t_1, …, t_T with states S_t, actions A_t, and rewards R_t]
► An agent repeatedly, at times t, takes actions A_t to receive rewards R_t from an environment, the state S_t of which is (partially) observed
[Agent–environment loop: the agent sends action A_t, the environment returns state S_t and reward R_t; clinical decisions over time: mechanical ventilation? sedation? spontaneous breathing trial?]
► Reward composed of several terms: R_t = R_t^vitals + R_t^vent off + R_t^vent on
[Diagram: states S_1, S_2, S_3 and actions A_1, A_2, A_3 over time]
► State S_t includes demographics, physiological measurements, ventilator settings, level of consciousness, dosage of sedatives, time to ventilation, number of intubations
► Actions A_t include intubation and extubation, as well as administration and dosages of sedatives
► A decision process specifies how states S_t, actions A_t, and rewards R_t are distributed:
  p(S_1, …, S_T, A_1, …, A_T, R_1, …, R_T)
► The agent interacts with the environment according to a behavior policy μ = p(A_t ∣ ⋯)*
* The ⋯ depends on the type of agent
► Markov decision processes (MDPs) are a special case
► Markov transitions: p(S_t ∣ S_1, …, S_{t−1}, A_1, …, A_{t−1}) = p(S_t ∣ S_{t−1}, A_{t−1})
► Markov reward function: p(R_t ∣ S_1, …, S_t, A_1, …, A_t) = p(R_t ∣ S_t, A_t)
► Markov action policy: π = p(A_t ∣ S_t) = p(A_t ∣ S_1, …, S_t, A_1, …, A_{t−1})
► State transitions, actions and rewards depend only on the most recent state–action pair
[Graphical model: states S_1 → … → S_T with actions A_1, …, A_T and rewards R_1, …, R_T]
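As a small illustration of the Markov property (a sketch with a made-up two-state clinical MDP; the states, actions and probabilities are hypothetical), the next state is sampled from a distribution that depends only on the most recent state-action pair:

```python
# Minimal sketch: sampling a trajectory from a tabular MDP in which
# p(S_t | history) = p(S_t | S_{t-1}, A_{t-1}) and the reward depends on the current state.
import random

P = {  # P[(state, action)] = [(next_state, probability), ...]  (hypothetical numbers)
    ("sick", "treat"):    [("healthy", 0.7), ("sick", 0.3)],
    ("sick", "wait"):     [("healthy", 0.2), ("sick", 0.8)],
    ("healthy", "treat"): [("healthy", 0.9), ("sick", 0.1)],
    ("healthy", "wait"):  [("healthy", 0.8), ("sick", 0.2)],
}
R = {"healthy": 1.0, "sick": 0.0}   # reward of the current state

def sample_trajectory(policy, s0="sick", T=5):
    """Roll out (S_1, A_1, R_1), ..., (S_T, A_T, R_T) under a given policy."""
    s, traj = s0, []
    for _ in range(T):
        a = policy(s)
        traj.append((s, a, R[s]))
        next_states, probs = zip(*P[(s, a)])
        s = random.choices(next_states, weights=probs)[0]   # depends only on (s, a)
    return traj

print(sample_trajectory(lambda s: "treat" if s == "sick" else "wait"))
```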
► Contextual bandits*: states are independent, p(S_t ∣ S_{t−1}, A_{t−1}) = p(S_t)
► Equivalent to single-step case: potential outcomes!
[Graphical model: independent states S_1, …, S_T with actions A_1, …, A_T and rewards R_1, …, R_T]
* The term "contextual bandits" has connotations of efficient exploration, which is not addressed here
► Think of each state S_i as an i.i.d. patient, the actions A_i as the treatment group indicators and R_i as the outcomes
[Diagram: each (S_i, A_i, R_i) triple is an independent patient]
► Like previously with causal effect estimation, we are interested in the effects of actions A_t on future rewards
[Graphical model: states S_1, …, S_T, actions A_1, …, A_T, rewards R_1, …, R_T]
► The goal of most RL algorithms is to maximize the expected cumulative reward, the value V^π of the policy π
► Return (sum of future rewards): G_t = Σ_{s=t}^{T} R_s
► Value (expected sum of rewards under policy π): V^π = E_{τ∼π}[G_1]
► The expectation is taken with respect to scenarios (trajectories τ) acted out according to the learned policy π
► Let's say that we have data from a policy μ: three patients, each observed for three time steps
  Patient 1: R_1 = 1, R_2 = 0, R_3 = 1,  return G^(1) = R_1 + R_2 + R_3 = 2
  Patient 2: R_1 = 0, R_2 = 1, R_3 = 1,  return G^(2) = R_1 + R_2 + R_3 = 2
  Patient 3: R_1 = 0, R_2 = 0, R_3 = 0,  return G^(3) = R_1 + R_2 + R_3 = 0
► The value is estimated by averaging the observed returns: V̂^μ ≈ (1/n) Σ_{i=1}^{n} G^(i)
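The same calculation as a quick sketch (the reward sequences are the ones listed above; averaging returns is a plain Monte Carlo estimate with no discounting):

```python
# Minimal sketch: estimate the value of the data-generating policy by averaging
# the observed returns G^(i) over patients.
rewards = {
    "patient 1": [1, 0, 1],
    "patient 2": [0, 1, 1],
    "patient 3": [0, 0, 0],
}

returns = {p: sum(rs) for p, rs in rewards.items()}   # G^(i) = R_1 + R_2 + R_3
value_hat = sum(returns.values()) / len(returns)      # V_hat ≈ (1/n) Σ_i G^(i)

print(returns)     # {'patient 1': 2, 'patient 2': 2, 'patient 3': 0}
print(value_hat)   # (2 + 2 + 0) / 3 ≈ 1.33
```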
[Gridworld figure: a Start cell and +1 / −1 terminal cells]
► Stochastic actions: p(move up ∣ A = "up") = 0.8
  Available non-opposite moves have uniform probability
► Rewards: +1 at [4,3] (terminal state)
Slide from Peter Bodik
[Grid figure: the value of each state is initially unknown ('?')]
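For concreteness, one way the gridworld above might be coded (a sketch; the −1 terminal at [4,2], the −0.04 step reward and the 0.1/0.1 split over perpendicular moves are assumptions consistent with later slides, not stated here):

```python
# Minimal sketch of the 4x3 gridworld: 0.8 chance of the intended move, the rest
# split over the two perpendicular ("non-opposite") moves; bouncing off the walls.
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
TERMINAL = {(4, 3): +1.0, (4, 2): -1.0}   # [column, row] coordinates as on the slides
STEP_REWARD = -0.04                        # assumed; matches Q-values such as 0.96, 0.92

def perpendicular(action):
    return ["left", "right"] if action in ("up", "down") else ["up", "down"]

def step(state, action):
    """Sample (next_state, reward) under the stochastic transition model."""
    if state in TERMINAL:
        return state, 0.0
    actual = random.choices([action] + perpendicular(action), weights=[0.8, 0.1, 0.1])[0]
    dx, dy = MOVES[actual]
    nx, ny = state[0] + dx, state[1] + dy
    if not (1 <= nx <= 4 and 1 <= ny <= 3):   # moves off the grid leave the state unchanged
        nx, ny = state
    return (nx, ny), TERMINAL.get((nx, ny), STEP_REWARD)

print(step((3, 3), "right"))   # reaches the +1 terminal state at (4, 3) with probability 0.8
```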
► The following is the optimal policy/trajectory under deterministic transitions
► Not achievable in our stochastic transition model
Slide from Peter Bodik
► Optimal policy
► How can we learn this?
Slide from Peter Bodik
► Model-based RL: models the transitions p(S_t ∣ S_{t−1}, A_{t−1}). Examples: G-computation, MDP estimation
► Value-based RL: models the value/return p(G_t ∣ S_t, A_t). Examples: Q-learning, G-estimation
► Policy-based RL: models the policy π(A_t ∣ S_t). Examples: REINFORCE, marginal structural models
* We focus on off-policy RL here
► Assume that we know how good a state-action pair is
► Q: Which end state is the best? A: [4,3]
► Q: What is the best way to get there? A: Only [3,1]
Slide from Peter Bodik
► [2,1] is slightly better than [3,2] because of the risk of transitioning to [4,2] from [3,2]
► Which is the best way to [2,1]?
Slide from Peter Bodik
► The idea of dynamic programming for reinforcement learning is to recursively learn the best action/value in a previous state given the best action/value in future states
► Next: How do we get the value of each state?
Slide from Peter Bodik
► Q-learning is a value-based reinforcement learning method
► Recall: The value of a state s under a policy π is
  V^π(s) = E_π[ Σ_{m≥0} γ^m R_{t+m} ∣ S_t = s ]
► The value of a state-action pair (s, a) is
  Q^π(s, a) = E_π[ Σ_{m≥0} γ^m R_{t+m} ∣ S_t = s, A_t = a ]
γ is the reward discount factor*
* Mathematical tool more than anything
► Q-learning attempts to estimate Q^π with a function Q̂(s, a) such that π is the deterministic policy π(s) = arg max_a Q̂(s, a)
► The best Q̂ is the best state-action value function:
  Q*(s, a) := max_π Q^π(s, a)
► For the optimal Q-function Q*, "Bellman optimality" holds*:
  Q*(s, a) = E[ R_t + γ max_{a′} Q*(S_{t+1}, a′) ∣ S_t = s, A_t = a ]
  (state-action value = immediate reward + future (discounted) rewards)
► Look for functions with this property!
* A necessary property for optimality of dynamic programming
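A standard one-line argument for why this recursion must hold (a sketch, not spelled out on the slides): split the return from (s, a) into the immediate reward plus the discounted return from S_{t+1}, and note that acting optimally from S_{t+1} onwards is the same as picking the maximizing next action:

```latex
% Sketch: Bellman optimality from the definition of Q* (standard argument)
\begin{aligned}
Q^*(s,a) &= \max_\pi \mathbb{E}_\pi\Big[\textstyle\sum_{m \ge 0} \gamma^m R_{t+m} \,\Big|\, S_t = s,\, A_t = a\Big] \\
         &= \mathbb{E}\Big[R_t + \gamma \max_\pi \mathbb{E}_\pi\big[\textstyle\sum_{m \ge 0} \gamma^m R_{t+1+m} \,\big|\, S_{t+1}\big] \,\Big|\, S_t = s,\, A_t = a\Big] \\
         &= \mathbb{E}\big[\,R_t + \gamma \max_{a'} Q^*(S_{t+1}, a') \,\big|\, S_t = s,\, A_t = a\,\big].
\end{aligned}
```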
► If states are discrete, s ∈ {0, …, K}, Q-learning can be solved exactly using dynamic programming (for small enough K)*
► Initialize a table of Q̂(s, a)
► Repeat:
  Q̂(S_t, A_t) ← Q̂(S_t, A_t) + α [ R_t + γ max_a Q̂(S_{t+1}, a) − Q̂(S_t, A_t) ]   (α is the learning rate)
* Converges to the optimal Q* if all state-action pairs are visited over and over again
Assume that transitions are deterministic for now*, and let each state-action pair be visited over and over again:
1. Initialize Q̂(s, a) = 0, let α = 1, γ = 1
2. Repeat: Q̂(S_t, A_t) ← Q̂(S_t, A_t) + α [ R_t + γ max_a Q̂(S_{t+1}, a) − Q̂(S_t, A_t) ]
[Q-table figures: the values propagate backwards from the +1 terminal state at [4,3]; the state-action pair next to the goal reaches 0.96, and pairs successively further away reach 0.92, 0.88, 0.84, 0.80 and 0.76]
* We will come back to this
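The same dynamic-programming update as a compact sketch, on a 3-cell corridor instead of the full grid (the −0.04 step reward is an assumption chosen so that the values match the 0.96, 0.92, … pattern above):

```python
# Minimal sketch: tabular Q-learning with deterministic transitions, alpha = 1,
# gamma = 1, on a 3-cell corridor whose right end is a +1 terminal state.
STATES = [1, 2, 3, "goal"]
ACTIONS = ["left", "right"]

def step(s, a):
    """Deterministic transition; -0.04 per step (assumed), +1 for reaching the goal."""
    if s == "goal":
        return s, 0.0
    if s == 3 and a == "right":
        return "goal", 1.0
    nxt = min(s + 1, 3) if a == "right" else max(s - 1, 1)
    return nxt, -0.04

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma = 1.0, 1.0

for _ in range(20):                     # visit every state-action pair over and over again
    for s in [1, 2, 3]:
        for a in ACTIONS:
            s_next, r = step(s, a)
            target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (target - Q[(s, a)])

print(Q[(3, "right")], Q[(2, "right")], Q[(1, "right")])   # approx. 1.0, 0.96, 0.92
```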
► If the number of states K is large or S_t is not discrete, we cannot maintain a table for Q̂(s, a)
► Instead, we may represent Q̂(s, a) by a function Q_θ and minimize the risk
  ℓ(Q_θ) = E[ ( R + γ max_{a′} Q̄(S′, a′) − Q_θ(S, A) )² ]
  where Q_θ is the current estimate and Q̄ is an old estimate of Q
► In the one-step case (no future states),
  ℓ(Q_θ) = E[ ( R_t + γ max_{a′} Q̄(S′, a′) − Q_θ(S, A) )² ] = E[ ( R_t − Q_θ(S, A) )² ]
► Finding Q(s, a) is analogous to finding the expected potential outcomes, the control outcome E[Y(0) ∣ X] and the treated outcome E[Y(1) ∣ X], by regression adjustment:
  min_f (1/n) Σ_{i=1}^{n} ( y_i − f(x_i, t_i) )²
► Fitted Q-learning is like covariate adjustment (regression) with a moving target (which is updated during learning):
  ℓ(Q_θ) = E[ ( y(S, A, S′, R) − Q_θ(S, A) )² ],  with target y = R + γ max_{a′} Q̄(S′, a′)
  (Q_θ(S, A) is the prediction; the expectation is over transitions (s, a, s′, r); the squared loss is one choice of loss)
► Where does our data come from? How do we evaluate the expectation in
  ℓ(Q_θ) = E[ ( R + γ max_{a′} Q̄(S′, a′) − Q_θ(S, A) )² ]  ?
► "What are the inputs and outputs of our regression?"
► Alternate between updates of Q_θ and Q̄
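A rough sketch of fitted Q-learning as alternating regression on a batch of transitions; the synthetic data, the RandomForestRegressor choice and the number of iterations are illustrative assumptions, not the lecture's prescription:

```python
# Minimal sketch: fitted Q iteration on a batch of (s, a, r, s') tuples.
# Alternates between (1) building regression targets with the old estimate Q_bar
# and (2) refitting the current estimate Q_theta.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, d, n_actions, gamma = 1000, 4, 2, 0.9

S = rng.normal(size=(n, d))                                # states
A = rng.integers(0, n_actions, size=n)                     # actions (behavior policy)
R = S[:, 0] * (A == 1) + rng.normal(scale=0.1, size=n)     # rewards (synthetic)
S_next = S + rng.normal(scale=0.5, size=(n, d))            # next states

def features(S, A):
    return np.column_stack([S, A])                         # simple (state, action) features

Q_theta = RandomForestRegressor(n_estimators=50, random_state=0)
Q_theta.fit(features(S, A), R)                             # initialize with one-step regression

for _ in range(5):                                         # alternate: targets from old Q, then refit
    Q_bar = Q_theta
    # y = r + gamma * max_a' Q_bar(s', a')
    q_next = np.column_stack([Q_bar.predict(features(S_next, np.full(n, a)))
                              for a in range(n_actions)])
    y = R + gamma * q_next.max(axis=1)
    Q_theta = RandomForestRegressor(n_estimators=50, random_state=0)
    Q_theta.fit(features(S, A), y)

# Greedy policy: pick the action with the largest estimated Q-value
q_vals = np.column_stack([Q_theta.predict(features(S, np.full(n, a)))
                          for a in range(n_actions)])
print((q_vals.argmax(axis=1) == 1).mean())   # fraction of states where action 1 looks best
```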
► Tuples (s, a, s′, r) may be obtained by:
  ► On-policy exploration: "playing the game" with the current policy
  ► Randomized trials: executing a sequentially random policy
  ► Off-policy (observational): e.g., healthcare records
► The latter is most relevant to us!
► Trajectories (s_1, a_1, r_1), …, (s_T, a_T, r_T) of states s_t, actions a_t, and rewards r_t, observed e.g. in medical records
► Actions are drawn according to a behavior policy μ, but we want to know the value of a new policy π
► Learning policies from this data is at least as hard as estimating treatment effects from observational data
► Sufficient conditions for identifying the value function
Single-step case:
  Strong ignorability: Y(0), Y(1) ⫫ T ∣ X  ("No hidden confounders")
  Overlap: ∀x, t: p(T = t ∣ X = x) > 0  ("All actions possible")
Sequential case:
  Sequential randomization: G_t ⫫ A_t ∣ H_t  ("Reward independent of policy given history")
  Positivity: ∀a, t: p(A_t = a ∣ H_t) > 0  ("All actions possible at all times")
[Recap figure: Anna's covariates (Age = 54, Gender = Female, Race = Asian, Blood pressure = 150/95, WBC count = 6.8×10⁹/L, Temperature = 36.7°C, Blood sugar = High); a choice between Medication A ("Control", T = 0) and Medication B ("Treated", T = 1); unknown potential outcomes Y(0), Y(1) ("Blood sugar = ?") at a later date (timeline marked Sep 15 and May 15)]
► We assumed a simple causal graph. This let us identify the causal effect
[Causal graph: state X, treatment A, outcome Y; Y(a) is the potential outcome under action a; ignorability: Y(a) ⫫ A ∣ X]
► Let's add a time point…
[Causal graph over t = 1, 2: states X_1, X_2, treatments A_1, A_2, outcomes Y_1, Y_2; ignorability: Y_t(a) ⫫ A_t ∣ X_t]
► What influences her state?
– It is likely that if Anna is diabetic, she will remain so; Anna's health status depends on how we treated her
– The outcome at a later time may depend on an earlier state; the outcome at a later time point may depend on earlier choices
– If we already tried a treatment, we might not try it again; if the last treatment was unsuccessful, it may change our next choice; if we know that a patient had a symptom previously, it may affect future decisions
(Ignorability: Y_t(a) ⫫ A_t ∣ X_t)
► To have sequential ignorability, we need to remember history!
[Causal graph with history nodes H_1, H_2 collecting past states, actions and outcomes; ignorability: Y_t(a) ⫫ A_t ∣ H_t]
► The difficulty with history is that its size grows with time
► A simple change of the standard MDP is to store the states and actions of a length-k window looking backwards
► Another alternative is to learn a summary function that maintains what is relevant for making optimal decisions, e.g., using an RNN
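One simple way to implement the fixed-length window idea (a sketch; the window length k and the zero-padding scheme are arbitrary choices for illustration):

```python
# Minimal sketch: turn a trajectory of (state, action) pairs into windowed
# "history states" H_t = (S_{t-k+1}, A_{t-k+1}, ..., S_t), padding at the start.
import numpy as np

def history_states(states, actions, k=3):
    """states: (T, d) array, actions: (T,) array. Returns (T, k*d + (k-1)) features."""
    T, d = states.shape
    rows = []
    for t in range(T):
        feats = []
        for j in range(t - k + 1, t + 1):
            if j < 0:                               # pad before the start of the trajectory
                feats.extend([np.zeros(d), np.zeros(1)])
            else:
                a = actions[j] if j < t else 0      # the action at time t is not yet taken
                feats.extend([states[j], np.array([a])])
        rows.append(np.concatenate(feats)[:-1])     # drop the placeholder action at time t
    return np.array(rows)

states = np.arange(10).reshape(5, 2)                # toy 5-step trajectory, 2 state features
actions = np.array([1, 0, 1, 1, 0])
print(history_states(states, actions, k=2).shape)   # (5, 5): 2*2 state dims + 1 past action
```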
► We cannot leave out unobserved confounders
[Causal graphs in which an unobserved confounder U affects both actions A_t and outcomes Y_t]
► Full observability: everything important to optimal action is observed
► Markov dynamics: history is unimportant given recent state(s)
► Limitless exploration & self-play through simulation: we can test "any" policy and observe the outcome
► Noise-less state/outcome (for games, specifically)