

SLIDE 1

Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation

Josiah Hanna¹, Peter Stone¹, Scott Niekum²

¹Learning Agents Research Group, UT Austin · ²Personal Autonomous Robotics Lab, UT Austin

May 10th, 2017

Josiah Hanna, Peter Stone, Scott Niekum UT Austin Bootstrapping with Models: Confidence Intervals for Off-Policy Evaluation 1

SLIDE 2

Motivation

Determine a lower bound on the expected performance of an autonomous control policy given data generated from a different policy.


SLIDE 5

Preliminaries

The agent samples actions from a policy, At ∼ π(·|St). The environment responds with St+1 ∼ P(·|St, At). The policy and environment determine a distribution over trajectories, H : S1, A1, S2, A2, ..., SL, AL, with H ∼ π.

V(π) := E[ Σ_{t=1}^{L} r(St, At) | H ∼ π ] is the expected return of π.
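When the environment can be sampled, the expected return defined above can be estimated by Monte Carlo rollouts. A minimal sketch on a hypothetical two-state MDP (the MDP, policy, and all names are illustrative, not from the talk):

```python
import numpy as np

def sample_trajectory(P, R, policy, s0, horizon, rng):
    """Roll out one trajectory H and return its (undiscounted) return."""
    s, ret = s0, 0.0
    for _ in range(horizon):
        a = rng.choice(len(policy[s]), p=policy[s])   # At ~ pi(.|St)
        ret += R[s][a]                                # accumulate r(St, At)
        s = rng.choice(len(P[s][a]), p=P[s][a])       # St+1 ~ P(.|St, At)
    return ret

# Hypothetical 2-state, 2-action MDP.
P = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.6, 0.4]]])   # P[s][a] -> next-state distribution
R = np.array([[0.0, 1.0], [1.0, 0.0]])     # r(s, a)
pi = np.array([[0.5, 0.5], [0.5, 0.5]])    # pi[s] -> action distribution

rng = np.random.default_rng(0)
returns = [sample_trajectory(P, R, pi, 0, horizon=10, rng=rng)
           for _ in range(5000)]
v_hat = np.mean(returns)                   # Monte Carlo estimate of V(pi)
```

For this particular MDP every step earns reward 1 with probability 0.5 under the uniform policy, so the estimate concentrates near 5 over a horizon of 10.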


SLIDE 7

Confidence Intervals for Off-Policy Evaluation

Given: a set D of trajectories generated by a behavior policy, πb; an evaluation policy, πe; and a confidence level, δ ∈ [0, 1]. Determine a lower bound V̂lb(πe, D) such that V(πe) ≥ V̂lb(πe, D) with probability 1 − δ.


SLIDE 9

Existing Methods

  • Exact confidence intervals [Thomas et al., 2015a]
  • Clipped importance weights [Bottou et al., 2013]
  • Bootstrapped importance sampling [Thomas et al., 2015b]

Our work


SLIDE 11

Data-Efficient Confidence Intervals

We draw on two ideas to reduce the number of trajectories required for tight confidence bounds: replace exact confidence bounds with bootstrap confidence intervals, and use learned models of the environment’s transition function to reduce variance.

Contributions:

1 Two bootstrap methods that incorporate models for approximate high-confidence policy evaluation.
2 A theoretical bound on model bias.
3 An empirical evaluation of the proposed methods.

SLIDE 12

Bootstrap Confidence Intervals

[Diagram: sample from D with replacement to form resampled datasets D0, ..., Dm; estimate V(πe) on each resample to obtain V̂0, ..., V̂m.]
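In code, this resampling scheme can be sketched as a simple percentile bootstrap (the paper's bound may use a bias-corrected variant; the data here is synthetic and all names are hypothetical):

```python
import numpy as np

def bootstrap_lower_bound(returns, delta=0.05, m=2000, seed=0):
    """Percentile-bootstrap lower bound on the expected return.

    Resample the dataset with replacement m times, estimate the mean
    on each resample, and take the delta-quantile of those estimates.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    estimates = np.array([returns[rng.integers(0, n, n)].mean()
                          for _ in range(m)])
    return np.quantile(estimates, delta)   # holds with prob. ~ 1 - delta

# Hypothetical returns from 200 trajectories.
rng = np.random.default_rng(1)
rets = rng.normal(loc=10.0, scale=3.0, size=200)
v_lb = bootstrap_lower_bound(rets, delta=0.05)
```

The returned bound sits below the sample mean by roughly 1.645 standard errors for data this close to Gaussian; the bootstrap's appeal is that it needs no such distributional assumption.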




SLIDE 17

Model-Based Off-Policy Evaluation

Trajectories are generated from an MDP, M = ⟨S, A, P, r⟩. [Diagram: a chain s0, s1, s2 with true transition probabilities 0.5 / 0.5 at each state.]

The model-based off-policy estimator uses all trajectories to estimate the unknown transition function, P. [Diagram: the same chain with learned probabilities 0.45 / 0.55 and 0.35 / 0.65.]

Model-based off-policy estimator: V̂(πe) := V_M̂(πe), where M̂ = ⟨S, A, P̂, r⟩ and P̂ is the learned transition function.
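For the tabular case, this estimator can be sketched as count-based estimation of P̂ followed by finite-horizon dynamic programming in the learned MDP (a minimal illustration; the function names are our own, and the uniform fallback for unvisited pairs is one arbitrary choice among many):

```python
import numpy as np

def learn_model(transitions, n_states, n_actions):
    """Estimate P-hat from observed (s, a, s') triples by empirical counts."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Uniform fallback for unvisited (s, a) pairs -- one source of model bias.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)

def model_based_value(P_hat, R, policy, s0, horizon):
    """V-hat(pi_e): value of pi_e in the learned MDP via finite-horizon DP."""
    v = np.zeros(P_hat.shape[0])
    for _ in range(horizon):
        # One-step reward plus expected next-state value under the model.
        q = R + P_hat @ v                  # q[s, a]
        v = (policy * q).sum(axis=1)       # expectation over pi_e(.|s)
    return v[s0]

# Hypothetical two-state chain: every observed transition leads to state 1.
P_hat = learn_model([(0, 0, 1), (1, 0, 1)], n_states=2, n_actions=1)
```

Note that all trajectories contribute to P̂ regardless of which policy generated them, which is the source of the estimator's data efficiency, and also of its bias when the model class is wrong.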


SLIDE 19

Model-Bias

Model-based approaches may have high bias.

1 Lack of data: when we lack data for a particular (S, A) pair, we must make assumptions about the transition probability, P(·|S, A).
2 Model representation: the true function P may be outside the class of models we consider.

We show theoretically that model bias depends on: the importance-sampled train/test error when building the model, the horizon length, and the maximum reward.

SLIDE 20

Model-Based Bootstrap

[Diagram: sample from D with replacement to form resampled datasets D0, ..., Dm; compute a model-based estimate V̂0, ..., V̂m on each resample.]

SLIDE 21

Existing Methods

[Diagram: mb-bootstrap (ours) positioned among importance-sampling-based methods and bootstrapped importance sampling.]

SLIDE 22

Doubly Robust Estimator [Jiang and Li, 2016, Thomas and Brunskill, 2016]

DR(D) := PDIS(D) − (1/n) Σ_{i=1}^{n} Σ_{t=0}^{L} [ w_t^i q̂^{πe}(S_t^i, A_t^i) − w_{t−1}^i v̂^{πe}(S_t^i) ]

PDIS(D) is an unbiased estimator; the correction term is zero in expectation.

v̂^π(S) := E_{A∼π, S′∼P̂(·|S,A)} [ r(S, A) + v̂(S′) ] is the state value function.

q̂^π(S, A) := r(S, A) + E_{S′∼P̂(·|S,A)} [ v̂(S′) ] is the state-action value function.

w_t is the importance weight of the first t time-steps.
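The estimator can be sketched in a few lines, with trajectories given as lists of (s, a, r) tuples and q̂, v̂ passed in as callables (a simplified per-decision DR without the weighted-importance normalization used by WDR; all names are our own):

```python
def doubly_robust(trajectories, pi_e, pi_b, q_hat, v_hat):
    """Per-decision doubly robust estimate of V(pi_e).

    DR(D) = PDIS(D) - (1/n) sum_i sum_t [w_t q_hat(s,a) - w_{t-1} v_hat(s)]

    trajectories: list of [(s, a, r), ...] generated under pi_b.
    pi_e, pi_b:   callables returning pi(a | s).
    q_hat, v_hat: model-derived value functions of pi_e (callables).
    """
    total = 0.0
    for traj in trajectories:
        w = 1.0                              # weight of the empty prefix
        for s, a, r in traj:
            w_prev = w
            w *= pi_e(a, s) / pi_b(a, s)     # w_t = w_{t-1} * pi_e / pi_b
            # PDIS term minus the control variate (zero in expectation).
            total += w * r - (w * q_hat(s, a) - w_prev * v_hat(s))
    return total / len(trajectories)
```

When q̂ and v̂ are accurate, the control variate cancels most of the variance of the importance-weighted return; when they are wrong, the estimate remains unbiased because the correction has zero expectation.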

SLIDE 23

Weighted Doubly Robust Bootstrap

[Diagram: sample from D with replacement to form resampled datasets D0, ..., Dm; compute a weighted doubly robust estimate V̂0, ..., V̂m on each resample.]

SLIDE 24

Bootstrapping with Models

MB-Bootstrap (Model-Based Bootstrap). Advantages: low variance. Disadvantages: potentially high bias.

WDR-Bootstrap (Weighted Doubly Robust Bootstrap). Advantages: low bias. Disadvantages: potentially higher variance.

SLIDE 25

Existing Methods

[Diagram: wdr-bootstrap (ours) and mb-bootstrap (ours) positioned among importance-sampling-based methods and bootstrapped importance sampling.]

SLIDE 26

Mountain Car Domain

State and action spaces are discretized. Models use a tabular representation.

SLIDE 27

Mountain Car Domain

[Results figure.]


SLIDE 31

Cliffworld Domain

The agent must cross a narrow path to reach a goal. State is Cartesian position and velocity. The agent moves by selecting acceleration. Dynamics are linear Gaussian. Models are learned with linear and polynomial regression.

SLIDE 32

Cliffworld Domain

[Results figure.]


SLIDE 36

Conclusion

1 Two bootstrap methods that incorporate models for approximate high-confidence policy evaluation.
2 A theoretical bound on model bias.
3 An empirical evaluation of the proposed methods.


SLIDE 38

Future Work

Investigate ways to “blend” MB-Bootstrap and WDR-Bootstrap for further improvements. Application to evaluating policies learned in simulation.

SLIDE 39

[Diagram: importance-sampling methods, wdr-bootstrap, and mb-bootstrap.]

Thanks for your attention! Questions?

SLIDE 40

References

Léon Bottou, Jonas Peters, Joaquin Quiñonero Candela, Denis Xavier Charles, Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Y. Simard, and Ed Snelson. Counterfactual reasoning and learning systems: the example of computational advertising. Journal of Machine Learning Research, 14(1):3207–3260, 2013.

Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, 2000.

P. S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence off-policy evaluation. In Association for the Advancement of Artificial Intelligence, AAAI, 2015a.

SLIDE 41

P. S. Thomas, Georgios Theocharous, and Mohammad Ghavamzadeh. High confidence policy improvement. In Proceedings of the 32nd International Conference on Machine Learning, ICML, 2015b.

P. S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

SLIDE 42

Lower Bound Error

[Results figures: Mountain Car and Cliffworld.]


SLIDE 44

Prior Work: Importance Sampling [Precup et al., 2000]

Re-weight returns according to their relative likelihood:

IS(πe, H, πb) := ( Π_{t=0}^{L−1} πe(At|St) / πb(At|St) ) · ( Σ_{t=0}^{L−1} r(St, At) )

The first factor is the importance weight; the second is the observed return. The mean of the re-weighted returns is an unbiased estimate of V(πe):

IS(D) := (1/|D|) Σ_{H∈D} IS(πe, H, πb)
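The estimator can be sketched as follows, with trajectories given as lists of (s, a, r) tuples and policies as callables returning π(a|s) (names are illustrative):

```python
import numpy as np

def importance_sampling(trajectories, pi_e, pi_b):
    """Ordinary importance sampling: weight each observed return by the
    product of per-step likelihood ratios pi_e(a|s) / pi_b(a|s)."""
    estimates = []
    for traj in trajectories:
        weight = np.prod([pi_e(a, s) / pi_b(a, s) for s, a, _ in traj])
        ret = sum(r for _, _, r in traj)       # observed return of H
        estimates.append(weight * ret)
    return float(np.mean(estimates))           # unbiased estimate of V(pi_e)
```

The product of ratios is what makes the estimator high-variance: a single step where πe and πb disagree sharply can blow the weight up or drive it to zero, which is the motivation for the model-based corrections above.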

SLIDE 45

Prior Work: Importance Sampling

[Results figure.]
