

SLIDE 1

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

Josiah Hanna¹, Peter Stone¹, Scott Niekum²

¹Learning Agents Research Group, UT Austin · ²Personal Autonomous Robotics Lab, UT Austin

May 10th, 2017

Josiah Hanna, Peter Stone, Scott Niekum UT Austin Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation 1

SLIDE 2

Motivation

Determine a lower bound on the expected performance of an autonomous control policy given data generated from a different policy.


SLIDE 5

Preliminaries

The agent samples actions from a policy, At ∼ π(·|St). The environment responds with St+1 ∼ P(·|St, At). The policy and environment determine a distribution over trajectories, H : S1, A1, S2, A2, ..., SL, AL, with H ∼ π.

V(π) := E[ Σ_{t=1}^{L} r(St, At) | H ∼ π ] is the expected return of π.
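When the environment can be sampled, the expected return defined above can be estimated by Monte Carlo rollouts. A minimal sketch on a hypothetical two-state MDP (the MDP, policy, and all names are illustrative, not from the talk):

```python
import numpy as np

def sample_trajectory(P, R, policy, s0, horizon, rng):
    """Roll out one trajectory H and return its (undiscounted) return."""
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = rng.choice(len(policy[s]), p=policy[s])   # At ~ pi(.|St)
        ret += R[s][a]                                # accumulate r(St, At)
        s = rng.choice(len(P[s][a]), p=P[s][a])       # St+1 ~ P(.|St, At)
    return ret

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.6, 0.4]]])   # P[s][a] -> next-state distribution
R = np.array([[0.0, 1.0], [1.0, 0.0]])     # r(s, a)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s] -> action distribution

rng = np.random.default_rng(0)
returns = [sample_trajectory(P, R, pi, 0, horizon=10, rng=rng)
           for _ in range(5000)]
v_hat = np.mean(returns)                   # Monte Carlo estimate of V(pi)
```

For this particular MDP every step earns reward 1 with probability 0.5 under the uniform policy, so the estimate concentrates near 5 over a horizon of 10.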


SLIDE 7

Confidence Intervals for Off-Policy Evaluation

Given: a set D of trajectories generated by a behavior policy, πb; an evaluation policy, πe; and a confidence level, δ ∈ [0, 1]. Determine a lower bound V̂lb(πe, D) such that V(πe) ≥ V̂lb(πe, D) with probability 1 − δ.


SLIDE 9

Existing Methods

  • Exact confidence intervals [Thomas et al., 2015a]
  • Clipped importance weights [Bottou et al., 2013]
  • Bootstrapped importance sampling [Thomas et al., 2015b]

Our work


SLIDE 11

Data-Efficient Confidence Intervals

We draw on two ideas to reduce the number of trajectories required for tight confidence bounds: replace exact confidence bounds with bootstrap confidence intervals, and use learned models of the environment’s transition function to reduce variance.

Contributions:

1 Two bootstrap methods that incorporate models for approximate high-confidence policy evaluation.
2 A theoretical bound on model bias.
3 An empirical evaluation of the proposed methods.

SLIDE 12

Bootstrap Confidence Intervals

[Diagram: sample from D with replacement to form resampled datasets D0, ..., Dm; estimate V(πe) on each resample to obtain V̂0, ..., V̂m.]
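In code, this resampling scheme can be sketched as a simple percentile bootstrap (the paper's bound may use a bias-corrected variant; the data here is synthetic and all names are hypothetical):

```python
import numpy as np

def bootstrap_lower_bound(returns, delta=0.05, m=2000, seed=0):
    """Percentile-bootstrap lower bound on the expected return.

    Resample the dataset with replacement m times, estimate the mean
    on each resample, and take the delta-quantile of those estimates.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    estimates = np.array([returns[rng.integers(0, n, n)].mean()
                          for _ in range(m)])
    return np.quantile(estimates, delta)   # holds with prob. ~ 1 - delta

# Hypothetical returns from 200 trajectories.
rng = np.random.default_rng(1)
rets = rng.normal(loc=10.0, scale=3.0, size=200)
v_lb = bootstrap_lower_bound(rets, delta=0.05)
```

The returned bound sits below the sample mean by roughly 1.645 standard errors for data this close to Gaussian; the bootstrap's appeal is that it needs no such distributional assumption.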




SLIDE 17

Model-Based Off-Policy Evaluation

Trajectories are generated from an MDP, M = ⟨S, A, P, r⟩. [Diagram: a chain s0, s1, s2 with true transition probabilities 0.5 / 0.5 at each state.]

The model-based off-policy estimator uses all trajectories to estimate the unknown transition function, P. [Diagram: the same chain with learned probabilities 0.45 / 0.55 and 0.35 / 0.65.]

Model-based off-policy estimator: V̂(πe) := V_M̂(πe), where M̂ = ⟨S, A, P̂, r⟩ and P̂ is the learned transition function.
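For the tabular case, this estimator can be sketched as count-based estimation of P̂ followed by finite-horizon dynamic programming in the learned MDP (a minimal illustration; the function names are our own, and the uniform fallback for unvisited pairs is one arbitrary choice among many):

```python
import numpy as np

def learn_model(transitions, n_states, n_actions):
    """Estimate P-hat from observed (s, a, s') triples by empirical counts."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Uniform fallback for unvisited (s, a) pairs -- one source of model bias.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

def model_based_value(P_hat, R, policy, s0, horizon):
    """V-hat(pi_e): value of pi_e in the learned MDP via finite-horizon DP."""
    v = np.zeros(P_hat.shape[0])
    for _ in range(horizon):
        # One-step reward plus expected next-state value under the model.
        q = R + P_hat @ v                  # q[s, a]
        v = (policy * q).sum(axis=1)       # expectation over pi_e(.|s)
    return v[s0]

# Hypothetical two-state chain: every observed transition leads to state 1.
P_hat = learn_model([(0, 0, 1), (1, 0, 1)], n_states=2, n_actions=1)
```

Note that all trajectories contribute to P̂ regardless of which policy generated them, which is the source of the estimator's data efficiency, and also of its bias when the model class is wrong.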


SLIDE 19

Model-Bias

Model-based approaches may have high bias.

1 Lack of data: when we lack data for a particular (S, A) pair, we must make assumptions about the transition probability, P(·|S, A).
2 Model representation: the true function P may be outside the class of models we consider.

We show theoretically that model bias depends on: the importance-sampled train/test error when building the model, the horizon length, and the maximum reward.

SLIDE 20

Model-Based Bootstrap

[Diagram: sample from D with replacement to form resampled datasets D0, ..., Dm; compute a model-based estimate V̂0, ..., V̂m on each resample.]

SLIDE 21

Existing Methods

[Diagram: mb-bootstrap (ours) positioned among importance-sampling-based methods and bootstrapped importance sampling.]

SLIDE 22

Doubly Robust Estimator [Jiang and Li, 2016, Thomas and Brunskill, 2016]

DR(D) := PDIS(D) − (1/n) Σ_{i=1}^{n} Σ_{t=0}^{L} [ w_t^i q̂^{πe}(S_t^i, A_t^i) − w_{t−1}^i v̂^{πe}(S_t^i) ]

PDIS(D) is an unbiased estimator; the correction term is zero in expectation.

v̂^π(S) := E_{A∼π, S′∼P̂(·|S,A)} [ r(S, A) + v̂(S′) ] is the state value function.

q̂^π(S, A) := r(S, A) + E_{S′∼P̂(·|S,A)} [ v̂(S′) ] is the state-action value function.

w_t is the importance weight of the first t time-steps.
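The estimator can be sketched in a few lines, with trajectories given as lists of (s, a, r) tuples and q̂, v̂ passed in as callables (a simplified per-decision DR without the weighted-importance normalization used by WDR; all names are our own):

```python
def doubly_robust(trajectories, pi_e, pi_b, q_hat, v_hat):
    """Per-decision doubly robust estimate of V(pi_e).

    DR(D) = PDIS(D) - (1/n) sum_i sum_t [w_t q_hat(s,a) - w_{t-1} v_hat(s)]

    trajectories: list of [(s, a, r), ...] generated under pi_b.
    pi_e, pi_b:   callables returning pi(a | s).
    q_hat, v_hat: model-derived value functions of pi_e (callables).
    """
    total = 0.0
    for traj in trajectories:
        w = 1.0                              # weight of the empty prefix
        for s, a, r in traj:
            w_prev = w
            w *= pi_e(a, s) / pi_b(a, s)     # w_t = w_{t-1} * pi_e / pi_b
            # PDIS term minus the control variate (zero in expectation).
            total += w * r - (w * q_hat(s, a) - w_prev * v_hat(s))
    return total / len(trajectories)
```

When q̂ and v̂ are accurate, the control variate cancels most of the variance of the importance-weighted return; when they are wrong, the estimate remains unbiased because the correction has zero expectation.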

SLIDE 23

Weighted Doubly Robust Bootstrap

[Diagram: sample from D with replacement to form resampled datasets D0, ..., Dm; compute a weighted doubly robust estimate V̂0, ..., V̂m on each resample.]

SLIDE 24

Bootstrapping with Models

MB-Bootstrap (Model-Based Bootstrap). Advantages: low variance. Disadvantages: potentially high bias.

WDR-Bootstrap (Weighted Doubly Robust Bootstrap). Advantages: low bias. Disadvantages: potentially higher variance.

SLIDE 25

Existing Methods

[Diagram: wdr-bootstrap (ours) and mb-bootstrap (ours) positioned among importance-sampling-based methods and bootstrapped importance sampling.]

SLIDE 26

Mountain Car Domain

State and action spaces are discretized. Models use a tabular representation.

SLIDE 27

Mountain Car Domain

[Results figure.]


SLIDE 31

Cliffworld Domain

The agent must cross a narrow path to reach a goal. State is Cartesian position and velocity. The agent moves by selecting acceleration. Dynamics are linear Gaussian. Models are learned with linear and polynomial regression.

SLIDE 32

Cliffworld Domain

[Results figure.]


SLIDE 36

Conclusion

1 Two bootstrap methods that incorporate models for approximate high-confidence policy evaluation.
2 A theoretical bound on model bias.
3 An empirical evaluation of the proposed methods.


SLIDE 38

Future Work

Investigate ways to “blend” MB-Bootstrap and WDR-Bootstrap for further improvements. Application to evaluating policies learned in simulation.

SLIDE 39

[Diagram: importance-sampling methods, wdr-bootstrap, and mb-bootstrap.]

Thanks for your attention! Questions?

SLIDE 40

References

Léon Bottou, Jonas Peters, Joaquin Quiñonero Candela, Denis Xavier Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Y. Simard, and Ed Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, 2000.

P. S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence off-policy evaluation. In Association for the Advancement of Artificial Intelligence, AAAI, 2015a.

SLIDE 41

P. S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015b.

P. S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

SLIDE 42

Lower Bound Error

[Results figures: Mountain Car and Cliffworld.]


SLIDE 44

Prior Work: Importance Sampling [Precup et al., 2000]

Re-weight returns according to their relative likelihood:

IS(πe, H, πb) := ( Π_{t=0}^{L−1} πe(At|St) / πb(At|St) ) · ( Σ_{t=0}^{L−1} r(St, At) )

The first factor is the importance weight; the second is the observed return. The mean of the re-weighted returns is an unbiased estimate of V(πe):

IS(D) := (1/|D|) Σ_{H∈D} IS(πe, H, πb)
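The estimator can be sketched as follows, with trajectories given as lists of (s, a, r) tuples and policies as callables returning π(a|s) (names are illustrative):

```python
import numpy as np

def importance_sampling(trajectories, pi_e, pi_b):
    """Ordinary importance sampling: weight each observed return by the
    product of per-step likelihood ratios pi_e(a|s) / pi_b(a|s)."""
    estimates = []
    for traj in trajectories:
        weight = np.prod([pi_e(a, s) / pi_b(a, s) for s, a, _ in traj])
        ret = sum(r for _, _, r in traj)       # observed return of H
        estimates.append(weight * ret)
    return float(np.mean(estimates))           # unbiased estimate of V(pi_e)
```

The product of ratios is what makes the estimator high-variance: a single step where πe and πb disagree sharply can blow the weight up or drive it to zero, which is the motivation for the model-based corrections above.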

SLIDE 45

Prior Work: Importance Sampling

[Results figure.]
