Lecture 12: Batch RL (Emma Brunskill, CS234 Reinforcement Learning)

SLIDE 1

Lecture 12: Batch RL

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2018. Slides drawn from Philip Thomas, with modifications.

SLIDE 2

Class Structure

  • Last time: Fast Reinforcement Learning / Exploration and Exploitation
  • This time: Batch RL
  • Next time: Monte Carlo Tree Search

SLIDE 3

Table of Contents

1. What makes an RL algorithm safe?
2. Notation
3. Create a safe batch reinforcement learning algorithm: off-policy policy evaluation (OPE), high-confidence off-policy policy evaluation (HCOPE), safe policy improvement (SPI)

SLIDE 4

What does it mean for a reinforcement learning algorithm to be safe?

SLIDE 5

SLIDE 6

SLIDE 7

Changing the objective

SLIDE 8

Changing the objective

  • Policy 1:
  • Reward = 0 with probability 0.999999
  • Reward = 10^9 with probability 1 − 0.999999 = 10^−6
  • Expected reward approximately 1000 (10^9 × 10^−6 = 1000)
  • Policy 2:
  • Reward = 999 with probability 0.5
  • Reward = 1000 with probability 0.5
  • Expected reward 999.5

SLIDE 9

Another notion of safety

SLIDE 10

Another notion of safety (Munos et al.)

SLIDE 11

Another notion of safety

SLIDE 12

SLIDE 13

The Problem

  • If you apply an existing method, do you have confidence that it will work?

SLIDE 14

Reinforcement learning success

SLIDE 15

A property of many real applications

  • Deploying "bad" policies can be costly or dangerous

SLIDE 16

Deploying bad policies can be costly

SLIDE 17

Deploying bad policies can be dangerous

SLIDE 18

What property should a safe batch reinforcement learning algorithm have?

  • Given past experience from the current policy or policies, produce a new policy
  • “Guarantee that, with probability at least 1 − δ, the algorithm will not change your policy to one that is worse than the current policy.”
  • You get to choose δ
  • The guarantee is not contingent on the tuning of any hyperparameters

SLIDE 19

Table of Contents

1. What makes an RL algorithm safe?
2. Notation
3. Create a safe batch reinforcement learning algorithm: off-policy policy evaluation (OPE), high-confidence off-policy policy evaluation (HCOPE), safe policy improvement (SPI)

SLIDE 20

Notation

  • Policy π: π(a | s) = P(a_t = a | s_t = s)
  • History: H = (s_1, a_1, r_1, s_2, a_2, r_2, · · · , s_L, a_L, r_L)
  • Historical data: D = {H_1, H_2, · · · , H_n}
  • Historical data from behavior policy, πb
  • Objective: V^{π} = E[ Σ_{t=1}^{L} γ^t R_t | π ]
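To make the notation concrete, here is a minimal Python sketch (not from the lecture) of one way the histories H and the dataset D might be stored; the Step fields, including the logged behavior-policy probability used by later estimators, are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    """One (s, a, r) tuple plus the behavior-policy probability needed for importance sampling."""
    state: int
    action: int
    reward: float
    pi_b: float  # probability the behavior policy assigned to this action

# A history H is a list of steps; the historical data D is a list of histories.
History = List[Step]

def discounted_return(history: History, gamma: float = 1.0) -> float:
    """Return sum_{t=1}^{L} gamma^t * r_t for one trajectory (matching the slide's indexing)."""
    return sum(gamma ** (t + 1) * step.reward for t, step in enumerate(history))

if __name__ == "__main__":
    H1 = [Step(state=0, action=1, reward=0.5, pi_b=0.8),
          Step(state=2, action=0, reward=1.0, pi_b=0.6)]
    D = [H1]
    print(discounted_return(H1, gamma=0.9))
```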

SLIDE 21

Safe batch reinforcement learning algorithm

  • Reinforcement learning algorithm, A
  • Historical data, D, which is a random variable
  • Policy produced by the algorithm, A(D), which is a random variable
  • A safe batch reinforcement learning algorithm, A, satisfies:

Pr(V^{A(D)} ≥ V^{πb}) ≥ 1 − δ

  • or, in general,

Pr(V^{A(D)} ≥ V_min) ≥ 1 − δ

SLIDE 22

Table of Contents

1. What makes an RL algorithm safe?
2. Notation
3. Create a safe batch reinforcement learning algorithm: off-policy policy evaluation (OPE), high-confidence off-policy policy evaluation (HCOPE), safe policy improvement (SPI)

SLIDE 23

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

SLIDE 24

Off-policy policy evaluation (OPE)

SLIDE 25

Importance Sampling (Reminder)

IS(D) = (1/n) Σ_{i=1}^{n} ( Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t) ) Σ_{t=1}^{L} γ^t R^i_t

  • E[IS(D)] = V^{πe}
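Below is a minimal sketch of the ordinary importance sampling estimator above, assuming each trajectory is stored as a list of (πe probability, πb probability, reward) tuples; the function name and the toy data are illustrative, not from the lecture.

```python
import numpy as np

def is_estimate(trajectories, gamma=1.0):
    """Ordinary importance sampling: average over trajectories of
    (product of per-step likelihood ratios) * (discounted return)."""
    vals = []
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for t, (p_e, p_b, r) in enumerate(traj):
            weight *= p_e / p_b            # prod_t pi_e(a_t|s_t) / pi_b(a_t|s_t)
            ret += (gamma ** (t + 1)) * r  # sum_t gamma^t R_t (t starts at 1 on the slide)
        vals.append(weight * ret)
    return float(np.mean(vals))

# Toy example: two short trajectories of (pi_e prob, pi_b prob, reward) steps.
D = [
    [(0.9, 0.5, 1.0), (0.2, 0.5, 0.0)],
    [(0.1, 0.5, 0.0), (0.8, 0.5, 1.0)],
]
print(is_estimate(D, gamma=0.95))
```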

SLIDE 26

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

SLIDE 27

High-confidence off-policy policy evaluation (HCOPE)

SLIDE 28

Hoeffding’s inequality

  • Let X_1, · · · , X_n be n independent, identically distributed random variables such that X_i ∈ [0, b]
  • Then with probability at least 1 − δ:

E[X_i] ≥ (1/n) Σ_{i=1}^{n} X_i − b √( ln(1/δ) / (2n) ),

where X_i = w_i Σ_{t=1}^{L} γ^t R^i_t (the importance-weighted return of the i-th trajectory) in our case.
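A small sketch of this bound, assuming the X_i (here, importance-weighted returns) have already been computed and that b is a known upper bound on them; the data below is synthetic.

```python
import math
import numpy as np

def hoeffding_lower_bound(x, b, delta):
    """1 - delta confidence lower bound on E[X_i] for i.i.d. X_i in [0, b]:
    mean(x) - b * sqrt(ln(1/delta) / (2n))."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    return float(x.mean() - b * math.sqrt(math.log(1.0 / delta) / (2.0 * n)))

# Toy example: pretend these are importance-weighted returns bounded in [0, 10].
returns = np.random.default_rng(0).uniform(0.0, 1.0, size=1000)
print(hoeffding_lower_bound(returns, b=10.0, delta=0.05))
```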

SLIDE 29

Safe policy improvement (SPI)

SLIDE 30

Safe policy improvement (SPI)

SLIDE 31

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

WON’T WORK!

SLIDE 32

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

IS(D) = (1/n) Σ_{i=1}^{n} ( Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t) ) Σ_{t=1}^{L} γ^t R^i_t

  • Per-decision importance sampling (PDIS):

PDIS(D) = Σ_{t=1}^{L} γ^t (1/n) Σ_{i=1}^{n} ( Π_{τ=1}^{t} πe(a_τ | s_τ) / πb(a_τ | s_τ) ) R^i_t
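A minimal sketch of PDIS under the same illustrative trajectory format as before ((πe probability, πb probability, reward) per step); note that each reward is weighted only by the likelihood ratios of the actions taken up to that time.

```python
def pdis_estimate(trajectories, gamma=1.0):
    """Per-decision importance sampling: each reward R_t is weighted only by the
    likelihood ratios of the actions taken up to and including time t."""
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        for t, (p_e, p_b, r) in enumerate(traj):
            weight *= p_e / p_b                      # prod_{tau <= t} pi_e / pi_b
            total += (gamma ** (t + 1)) * weight * r
    return total / len(trajectories)

# Same toy trajectory format as before: (pi_e prob, pi_b prob, reward) per step.
D = [
    [(0.9, 0.5, 1.0), (0.2, 0.5, 0.0)],
    [(0.1, 0.5, 0.0), (0.8, 0.5, 1.0)],
]
print(pdis_estimate(D, gamma=0.95))
```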

SLIDE 33

Off-policy policy evaluation (revisited)

  • Importance sampling (IS):

IS(D) = (1/n) Σ_{i=1}^{n} w_i Σ_{t=1}^{L} γ^t R^i_t

  • Weighted importance sampling (WIS):

WIS(D) = ( 1 / Σ_{i=1}^{n} w_i ) Σ_{i=1}^{n} w_i Σ_{t=1}^{L} γ^t R^i_t
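A minimal sketch of WIS under the same illustrative trajectory format; the only change from ordinary IS is the normalization by the sum of the weights rather than by n.

```python
import numpy as np

def wis_estimate(trajectories, gamma=1.0):
    """Weighted importance sampling: normalize by the sum of the trajectory weights
    instead of by n (biased, but consistent and typically much lower variance)."""
    weights, returns = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (p_e, p_b, r) in enumerate(traj):
            w *= p_e / p_b
            g += (gamma ** (t + 1)) * r
        weights.append(w)
        returns.append(g)
    weights = np.asarray(weights)
    returns = np.asarray(returns)
    return float(np.sum(weights * returns) / np.sum(weights))

D = [
    [(0.9, 0.5, 1.0), (0.2, 0.5, 0.0)],
    [(0.1, 0.5, 0.0), (0.8, 0.5, 1.0)],
]
print(wis_estimate(D, gamma=0.95))
```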

SLIDE 34

Off-policy policy evaluation (revisited)

  • Weighted importance sampling (WIS):

WIS(D) = ( 1 / Σ_{i=1}^{n} w_i ) Σ_{i=1}^{n} w_i Σ_{t=1}^{L} γ^t R^i_t

  • NOT unbiased. When n = 1, E[WIS(D)] = V^{πb}
  • Strongly consistent estimator of V^{πe}
  • i.e., Pr( lim_{n→∞} WIS(D) = V^{πe} ) = 1
  • If:
  • Finite horizon
  • One behavior policy, or bounded rewards

SLIDE 35

Off-policy policy evaluation (revisited)

  • Weighted per-decision importance sampling
  • Also called consistent weighted per-decision importance sampling
  • A fun exercise!

SLIDE 36

Control variates

  • Given: X
  • Estimate: µ = E[X]
  • µ̂ = X
  • Unbiased: E[µ̂] = E[X] = µ
  • Variance: Var(µ̂) = Var(X)

SLIDE 37

Control variates

  • Given: X, Y, E[Y]
  • Estimate: µ = E[X]
  • µ̂ = X − Y + E[Y]
  • Unbiased: E[µ̂] = E[X − Y + E[Y]] = E[X] − E[Y] + E[Y] = E[X] = µ
  • Variance: Var(µ̂) = Var(X − Y + E[Y]) = Var(X − Y) = Var(X) + Var(Y) − 2 Cov(X, Y)
  • Lower variance if 2 Cov(X, Y) > Var(Y)
  • We call Y a control variate
  • We saw this idea before: the baseline term in policy gradient estimation
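A tiny numerical sketch of the variance-reduction argument above; the distributions are synthetic and chosen so that 2 Cov(X, Y) > Var(Y).

```python
import numpy as np

rng = np.random.default_rng(0)

# X is the quantity whose mean we want; Y is correlated with X and has a known mean E[Y] = 2.
n = 100_000
y = rng.normal(loc=2.0, scale=1.0, size=n)
x = y + rng.normal(loc=1.0, scale=0.5, size=n)   # E[X] = 3, strongly correlated with Y

mu_hat_plain = x            # estimator X
mu_hat_cv = x - y + 2.0     # estimator X - Y + E[Y]

print("both roughly unbiased:", mu_hat_plain.mean(), mu_hat_cv.mean())
print("variance without control variate:", mu_hat_plain.var())
print("variance with control variate:   ", mu_hat_cv.var())  # smaller, since 2 Cov(X, Y) > Var(Y)
```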

SLIDE 38

Off-policy policy evaluation (revisited)

  • Idea: add a control variate to importance sampling estimators
  • X is the importance sampling estimator
  • Y is a control variate built from an approximate model of the MDP
  • E[Y] = 0 in this case
  • PDIS^{CV}(D) = PDIS(D) − CV(D)
  • Called the doubly robust estimator (Jiang and Li, 2015)
  • Robust to (1) a poor approximate model, and (2) error in estimates of πb
  • If the model is poor, the estimates are still unbiased
  • If the sampling policy is unknown but the model is good, the MSE will still be low
  • DR(D) = PDIS^{CV}(D)
  • Non-recursive and weighted forms, as well as the control variate view, are provided by Thomas and Brunskill (2016)

SLIDE 39

Off-policy policy evaluation (revisited)

DR(πe; D) = (1/n) Σ_{i=1}^{n} Σ_{t=0}^{L} γ^t [ w^i_t ( R^i_t − q̂^{πe}(S^i_t, A^i_t) ) + w^i_{t−1} v̂^{πe}(S^i_t) ],

where w^i_t = Π_{τ=0}^{t} πe(a^i_τ | s^i_τ) / πb(a^i_τ | s^i_τ) and w^i_{−1} = 1

  • Recall: we want the control variate Y to cancel with X: R − q̂(S, A) + γ v̂(S′)
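Below is a minimal sketch of the DR estimator as reconstructed above; the trajectory format and the constant q̂ / v̂ functions are illustrative stand-ins for an approximate model fit from data, not the lecture's implementation.

```python
def dr_estimate(trajectories, q_hat, v_hat, gamma=1.0):
    """Doubly robust OPE sketch: importance-weighted corrections around an
    approximate model's q_hat / v_hat, following the formula above.
    Each trajectory step is (state, action, reward, pi_e prob, pi_b prob)."""
    total = 0.0
    for traj in trajectories:
        w_prev = 1.0  # w^i_{t-1}, with the empty product taken to be 1
        for t, (s, a, r, p_e, p_b) in enumerate(traj):
            w = w_prev * (p_e / p_b)  # w^i_t
            total += (gamma ** t) * (w * (r - q_hat(s, a)) + w_prev * v_hat(s))
            w_prev = w
    return total / len(trajectories)

# Illustrative (constant) approximate model; in practice q_hat and v_hat are fit from data.
q_hat = lambda s, a: 0.5
v_hat = lambda s: 0.5

D = [
    [(0, 1, 1.0, 0.9, 0.5), (2, 0, 0.0, 0.2, 0.5)],
    [(0, 0, 0.0, 0.1, 0.5), (1, 1, 1.0, 0.8, 0.5)],
]
print(dr_estimate(D, q_hat, v_hat, gamma=0.95))
```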

SLIDE 40

Empirical Results (Gridworld)

SLIDE 41

Empirical Results (Gridworld)

SLIDE 42

Empirical Results (Gridworld)

SLIDE 43

Empirical Results (Gridworld)

SLIDE 44

Empirical Results (Gridworld)

SLIDE 45

Off-policy policy evaluation (revisited): Blending

  • Importance sampling is unbiased but high variance
  • The model-based estimate is biased but low variance
  • Doubly robust is one way to combine the two
  • Can also trade off between importance sampling and the model-based estimate within a trajectory
  • MAGIC estimator (Thomas and Brunskill 2016)
  • Can be particularly useful when part of the world is non-Markovian in the given model, and other parts of the world are Markov

SLIDE 46

Off-policy policy evaluation (revisited)

  • What if supp(πe) ⊂ supp(πb)?
  • That is, there is a state-action pair, (s, a), such that πe(a | s) = 0, but πb(a | s) ≠ 0.
  • If we see a history where (s, a) occurs, what weight should we give it?

IS(D) = (1/n) Σ_{i=1}^{n} ( Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t) ) Σ_{t=1}^{L} γ^t R^i_t

SLIDE 47

Off-policy policy evaluation (revisited)

  • What if there are zero samples (n = 0)?
  • The importance sampling estimate is undefined
  • What if no samples are in supp(πe) (or supp(p) in general)?
  • Importance sampling says: the estimate is zero
  • Alternate approach: undefined
  • Importance sampling estimator is unbiased if n > 0
  • Alternate approach will be unbiased given that at least one sample is in the support of p
  • Alternate approach detailed in Importance Sampling with Unequal Support (Thomas and Brunskill, AAAI 2017)

SLIDE 48

Off-policy policy evaluation (revisited)

SLIDE 49

Off-policy policy evaluation (revisited)

  • Thomas et al., Predictive Off-Policy Policy Evaluation for Nonstationary Decision Problems, with Applications to Digital Marketing (AAAI 2017)

SLIDE 50

Off-policy policy evaluation (revisited)

SLIDE 51

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A

SLIDE 52

High-confidence off-policy policy evaluation (revisited)

  • Consider using IS + Hoeffding’s inequality for HCOPE on mountain car

SLIDE 53

High-confidence off-policy policy evaluation (revisited)

  • Using 100,000 trajectories
  • Evaluation policy’s true performance is 0.19 ∈ [0, 1]
  • We get a 95% confidence lower bound of: -58,310,000

SLIDE 54

What went wrong

w_i = Π_{t=1}^{L} πe(a_t | s_t) / πb(a_t | s_t)

SLIDE 55

High-confidence off-policy policy evaluation (revisited)

  • Removing the upper tail only decreases the expected value.

SLIDE 56

High-confidence off-policy policy evaluation (revisited)

  • Thomas et al., High Confidence Off-Policy Evaluation, AAAI 2015

SLIDE 57

High-confidence off-policy policy evaluation (revisited)

SLIDE 58

High-confidence off-policy policy evaluation (revisited)

  • Use 20% of the data to optimize c
  • Use 80% to compute lower bound with optimized c
  • Mountain car results:

SLIDE 59

High-confidence off-policy policy evaluation (revisited)

Digital marketing:

SLIDE 60

High-confidence off-policy policy evaluation (revisited)

Cognitive dissonance: E[X_i] ≥ (1/n) Σ_{i=1}^{n} X_i − b √( ln(1/δ) / (2n) )

SLIDE 61

High-confidence off-policy policy evaluation (revisited)

  • Student’s t-test
  • Assumes that IS(D) is normally distributed
  • By the central limit theorem, it is (as n → ∞)
  • The resulting 1 − δ lower bound (a small sketch follows below):

Pr( E[X_i] ≥ (1/n) Σ_{i=1}^{n} X_i − ( √( (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄_n)^2 ) / √n ) t_{1−δ, n−1} ) ≥ 1 − δ

  • Efron’s bootstrap methods (e.g., BCa)
  • Also, without importance sampling: Hanna, Stone, and Niekum, AAMAS 2017
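A small sketch of the t-test-based lower bound, assuming the X_i (importance-weighted returns) are given; the synthetic heavy-tailed data hints at why the normality assumption can make this bound optimistic.

```python
import numpy as np
from scipy import stats

def t_test_lower_bound(x, delta):
    """1 - delta confidence lower bound on the mean, assuming (approximate) normality:
    mean(x) - (sample std / sqrt(n)) * t_{1-delta, n-1}."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sample_std = x.std(ddof=1)                   # sqrt( (1/(n-1)) sum (x_i - mean)^2 )
    t_crit = stats.t.ppf(1.0 - delta, df=n - 1)  # t_{1-delta, n-1}
    return float(x.mean() - sample_std / np.sqrt(n) * t_crit)

# Synthetic importance-weighted returns; heavy-tailed data can make this bound optimistic.
x = np.random.default_rng(1).exponential(scale=0.2, size=500)
print(t_test_lower_bound(x, delta=0.05))
```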

SLIDE 62

High-confidence off-policy policy evaluation (revisited)

SLIDE 63

Create a safe batch reinforcement learning algorithm

  • Off-policy policy evaluation (OPE)
  • For any evaluation policy, πe, convert historical data, D, into n independent and unbiased estimates of V^{πe}
  • High-confidence off-policy policy evaluation (HCOPE)
  • Use a concentration inequality to convert the n independent and unbiased estimates of V^{πe} into a 1 − δ confidence lower bound on V^{πe}
  • Safe policy improvement (SPI)
  • Use the HCOPE method to create a safe batch reinforcement learning algorithm, A (an end-to-end sketch follows below)
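Putting the three steps together, here is a minimal end-to-end sketch; it uses a Hoeffding bound for the HCOPE step purely for brevity (the lecture notes that this naive choice is far too loose in practice, and tighter bounds such as the t-test are preferred). The trajectory format, candidate representation, and toy data are all illustrative assumptions.

```python
import math
import numpy as np

def is_returns(trajectories, pi_e_prob, gamma=1.0):
    """OPE: one unbiased importance-sampling estimate of V^{pi_e} per trajectory.
    Each step of a trajectory is (state, action, reward, pi_b probability)."""
    out = []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (s, a, r, p_b) in enumerate(traj):
            w *= pi_e_prob(s, a) / p_b
            g += (gamma ** (t + 1)) * r
        out.append(w * g)
    return np.asarray(out)

def hcope_lower_bound(x, b, delta):
    """HCOPE: Hoeffding-style 1 - delta confidence lower bound on the mean of x, with x in [0, b]."""
    return float(x.mean() - b * math.sqrt(math.log(1.0 / delta) / (2.0 * len(x))))

def safe_policy_improvement(candidate_policies, trajectories, v_min, b, delta, gamma=1.0):
    """SPI: return a candidate whose 1 - delta lower bound clears v_min, else None
    (i.e., decline to change the policy)."""
    for name, pi_e_prob in candidate_policies:
        x = is_returns(trajectories, pi_e_prob, gamma)
        if hcope_lower_bound(x, b, delta) >= v_min:
            return name
    return None

# Toy usage: logged steps are (state, action, reward, pi_b prob); one candidate policy.
D = [[(0, 1, 1.0, 0.5), (1, 0, 1.0, 0.5)] for _ in range(2000)]
candidates = [("candidate-1", lambda s, a: 0.9)]
print(safe_policy_improvement(candidates, D, v_min=0.5, b=10.0, delta=0.05, gamma=0.95))
```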

SLIDE 64

Safe policy improvement (revisited)

Thomas et al., ICML 2015

SLIDE 65

Empirical Results: Digital Marketing

[Figure: agent-environment loop, with the agent sending an action and the environment returning a state and reward]

SLIDE 66

Empirical Results: Digital Marketing

[Figure: expected normalized return for n = 10,000 to 100,000 trajectories, comparing None vs. k-Fold data reuse with CUT and BCa bounds; values roughly in the range 0.0027 to 0.0038]

SLIDE 67

Empirical Results: Digital Marketing

SLIDE 68

Empirical Results: Digital Marketing

SLIDE 69

Example Results: Diabetes Treatment

[Figure: blood glucose (sugar) rises when eating carbohydrates and falls when insulin is released]

SLIDE 70

Example Results: Diabetes Treatment

[Figure: blood glucose regulation, with hyperglycemia (blood glucose too high) marked]

SLIDE 71

Example Results: Diabetes Treatment

[Figure: blood glucose regulation, with hypoglycemia (too low) and hyperglycemia (too high) marked]

SLIDE 72

Example Results: Diabetes Treatment

injection = (blood glucose − target blood glucose) / DG + (meal size) / DS
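A one-line sketch of the parameterized treatment policy above; the parameter names DG and DS follow the slide, and the numeric values below are purely illustrative, not medical guidance.

```python
def insulin_injection(blood_glucose, target_blood_glucose, meal_size, DG, DS):
    """Bolus-calculator style policy from the slide:
    injection = (blood glucose - target) / DG + meal size / DS.
    DG and DS are the per-patient parameters a policy-improvement method would tune."""
    return (blood_glucose - target_blood_glucose) / DG + meal_size / DS

# Hypothetical numbers purely for illustration.
print(insulin_injection(blood_glucose=180, target_blood_glucose=120, meal_size=60, DG=30, DS=10))
```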

SLIDE 73

Example Results: Diabetes Treatment

Intelligent Diabetes Management

SLIDE 74

Example Results: Diabetes Treatment

[Figure: probability the policy was changed and probability the new policy was worse]

SLIDE 75

Other Relevant Work

  • How to deal with long horizons? (Guo, Thomas, Brunskill NIPS 2017)
  • How to deal with importance sampling being “unfair”? (Doroudi, Thomas and Brunskill, best paper UAI 2017)
  • What to do when the behavior policy is not known?
  • What to do when the behavior policy is deterministic?
  • What to do when we care about doing safe exploration?
  • What to do when we care about performance on a single trajectory?
  • For the last two, see great work by Marco Pavone’s group, Pieter Abbeel’s group, Shie Mannor’s group and Claire Tomlin’s group, amongst others

SLIDE 76

Off Policy Policy Evaluation and Selection

  • Very important topic: healthcare, education, marketing, ...
  • Insights are relevant to on policy learning
  • Big focus of my lab
  • A number of others on campus are also working in this area (e.g., Stefan Wager, Susan Athey, ...)

  • Very interesting area at the intersection of causality and control

SLIDE 77

What You Should Know: Off Policy Policy Evaluation and Selection

  • Be able to define and apply importance sampling for off-policy policy evaluation
  • Define some limitations of IS (variance)
  • List a couple of alternatives (weighted IS, doubly robust)
  • Define why we might want safe reinforcement learning
  • Define the scope of the guarantees implied by safe policy improvement, as defined in this lecture

SLIDE 78

Class Structure

  • Last time: Exploration and Exploitation
  • This time: Batch RL
  • Next time: Monte Carlo Tree Search
