

SLIDE 1

Data-efficient Policy Evaluation through Behavior Policy Search

Josiah Hanna1 Philip Thomas2 Peter Stone1 Scott Niekum1

1University of Texas at Austin 2University of Massachusetts, Amherst

August 8th, 2017

Josiah Hanna, Philip Thomas, Peter Stone , Scott Niekum UT Austin Data-efficient Policy Evaluation through Behavior Policy Search 1

SLIDE 2

Policy Evaluation

SLIDE 5

Outline

1. Demonstrate that importance-sampling for policy evaluation can outperform on-policy policy evaluation.

2. Show how to improve the behavior policy for importance-sampling policy evaluation.

3. Empirically evaluate (1) and (2).
SLIDE 6

Background

Finite-horizon MDP. The agent selects actions with a stochastic policy, π. The policy and environment determine a distribution over trajectories, H := (S_0, A_0, R_0, S_1, A_1, R_1, ..., S_L, A_L, R_L).

SLIDE 9

Policy Evaluation

Policy performance:

ρ(π) := E[ ∑_{t=0}^{L} γ^t R_t | H ∼ π ]

Given a target policy, πe, estimate ρ(πe). Let πe ≡ π_θe.
SLIDE 10

Monte Carlo Policy Evaluation

Given a dataset D of trajectories where ∀H_i ∈ D, H_i ∼ πe:

MC(D) := (1/|D|) ∑_{H_i ∈ D} ∑_{t=0}^{L} γ^t R_t^(i)
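The Monte Carlo estimator above can be sketched in a few lines of Python. The trajectory representation (a list of per-step reward sequences, one per trajectory sampled from πe) is an assumption made for illustration, not a format from the talk.

```python
def mc_estimate(trajectories, gamma=1.0):
    """Monte Carlo estimate of rho(pi_e): the average discounted return.

    `trajectories` is assumed to be a list of reward sequences, one per
    trajectory sampled by running the target policy itself.
    """
    returns = []
    for rewards in trajectories:
        # Discounted return of one trajectory: sum_t gamma^t * R_t.
        g = sum(gamma ** t * r for t, r in enumerate(rewards))
        returns.append(g)
    return sum(returns) / len(returns)
```

Each trajectory contributes its discounted return, and the estimate is the sample mean over the dataset D.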

SLIDE 13

[Figure: a one-step example with two actions; Action 1 yields reward +100, Action 2 yields reward +1.]

Target policy πe samples the high-rewarding first action with probability 0.01. Monte Carlo evaluation of πe has high variance. Importance-sampling with a behavior policy that samples either action with equal probability gives a low-variance evaluation.
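The two-action example lends itself to a quick numerical check. The sketch below is illustrative (not code from the talk): it compares the empirical variance of the Monte Carlo estimator under πe with the importance-sampling estimator under a uniform behavior policy.

```python
import random

def mc_sample(p_high=0.01):
    # One Monte Carlo sample: act with the target policy itself.
    return 100.0 if random.random() < p_high else 1.0

def is_sample(p_high=0.01):
    # One importance-sampling sample: act uniformly, re-weight by pi_e / pi_b.
    if random.random() < 0.5:
        return (p_high / 0.5) * 100.0        # took Action 1
    return ((1.0 - p_high) / 0.5) * 1.0      # took Action 2

def variance(samples):
    m = sum(samples) / len(samples)
    return sum((x - m) ** 2 for x in samples) / len(samples)

random.seed(0)
mc = [mc_sample() for _ in range(100_000)]
is_ = [is_sample() for _ in range(100_000)]
# Both estimators are unbiased for rho(pi_e) = 0.01*100 + 0.99*1 = 1.99,
# but the uniform behavior policy gives far lower variance here.
```

The Monte Carlo samples are almost all 1.0 with a rare 100.0, while the importance-weighted samples are all close to 1.99, which is exactly the variance-reduction effect the slide describes.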

SLIDE 15

Importance-Sampling Policy Evaluation¹

Given a dataset D of trajectories where ∀H_i ∈ D, H_i is sampled from a behavior policy π_i:

IS(D) := (1/|D|) ∑_{H_i ∈ D} ( ∏_{t=0}^{L} πe(A_t|S_t) / π_i(A_t|S_t) ) ∑_{t=0}^{L} γ^t R_t^(i)

where the product of action-probability ratios is the re-weighting factor.

For convenience:

IS(H, π) := ( ∏_{t=0}^{L} πe(A_t|S_t) / π(A_t|S_t) ) ∑_{t=0}^{L} γ^t R_t

¹Precup, Sutton, and Singh (2000)
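A minimal sketch of the IS estimator, assuming each trajectory records, per step, the target-policy probability, the behavior-policy probability, and the reward (a hypothetical format chosen for illustration):

```python
def is_estimate(trajectories, gamma=1.0):
    """Ordinary importance-sampling estimate of rho(pi_e).

    Each trajectory is a list of (pi_e_prob, pi_b_prob, reward) triples,
    one triple per time step.
    """
    total = 0.0
    for traj in trajectories:
        weight = 1.0
        ret = 0.0
        for t, (pe, pb, r) in enumerate(traj):
            weight *= pe / pb       # product of per-step likelihood ratios
            ret += gamma ** t * r   # discounted return of the trajectory
        total += weight * ret
    return total / len(trajectories)
```

When the behavior policy equals the target policy, every weight is 1 and this reduces to the Monte Carlo estimator.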

SLIDE 19

The Optimal Behavior Policy

Importance-sampling can achieve zero mean-squared error policy evaluation with only a single trajectory! We cannot analytically determine this policy. Requires ρ(πe) be known! Requires the reward function be known. Requires deterministic transitions.

Josiah Hanna, Philip Thomas, Peter Stone , Scott Niekum UT Austin Data-efficient Policy Evaluation through Behavior Policy Search 9

SLIDE 23

Behavior Policy Search

Adapt the behavior policy towards the optimal behavior policy. At each iteration, i:

1. Choose behavior policy parameters, θ_i, based on all observed data D.
2. Sample m trajectories, H ∼ π_θi, and add them to the data set D.
3. Estimate ρ(πe) with the trajectories in D.
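The three-step loop above can be sketched generically. `choose_theta`, `sample_trajectories`, and `estimate_rho` are placeholder hooks (not names from the talk) standing in for steps 1-3; any concrete behavior policy search method plugs into them.

```python
def behavior_policy_search(theta0, iterations, m,
                           choose_theta, sample_trajectories, estimate_rho):
    """Generic behavior policy search loop (steps 1-3 of the slide)."""
    D = []                                      # all observed trajectories
    theta = theta0
    estimates = []
    for i in range(iterations):
        theta = choose_theta(theta, D)          # step 1: adapt behavior policy
        D.extend(sample_trajectories(theta, m)) # step 2: gather m trajectories
        estimates.append(estimate_rho(D))       # step 3: estimate rho(pi_e)
    return theta, estimates
```

Behavior Policy Gradient, introduced next, is one concrete choice for the `choose_theta` step.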

SLIDE 25

Behavior Policy Gradient

Key Idea: Adapt the behavior policy parameters, θ, with gradient descent on the mean squared error (MSE) of the importance-sampling estimator:

θ_{i+1} = θ_i − α ∂/∂θ MSE[IS(H, θ)] |_{θ=θ_i}

MSE[IS(H, θ)] itself is not computable, but its gradient, ∂/∂θ MSE[IS(H, θ)], is.
SLIDE 26

Behavior Policy Gradient Theorem

Theorem:

∂/∂θ MSE[IS(H, θ)] = E_{H∼π_θ}[ −IS(H, θ)² ∑_{t=0}^{L} ∂/∂θ log π_θ(A_t|S_t) ]
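The theorem means an unbiased gradient estimate needs only trajectories sampled from the current behavior policy. The sketch below is an illustrative instantiation on the earlier two-action example, not code from the talk; the sigmoid parameterization, starting point, and step size are all assumptions.

```python
import math
import random

P_E = 0.01              # target policy's probability of Action 1
REWARDS = (100.0, 1.0)  # rewards for Action 1 and Action 2

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bpg_step(theta, alpha=1e-4):
    """One stochastic BPG update on the one-step, two-action example."""
    p = sigmoid(theta)                 # behavior policy's prob. of Action 1
    if random.random() < p:            # sample an action from pi_theta
        is_est = (P_E / p) * REWARDS[0]
        dlogpi = 1.0 - p               # d/dtheta log sigmoid(theta)
    else:
        is_est = ((1.0 - P_E) / (1.0 - p)) * REWARDS[1]
        dlogpi = -p                    # d/dtheta log (1 - sigmoid(theta))
    grad = -(is_est ** 2) * dlogpi     # per-trajectory BPG gradient estimate
    return theta - alpha * grad        # gradient descent on the MSE

random.seed(1)
theta = -2.0                           # start far from the uniform policy
for _ in range(20_000):
    theta = bpg_step(theta)
# theta drifts so that sigmoid(theta) approaches the near-uniform,
# low-variance behavior policy for this example.
```

Each update plugs a single sampled trajectory into the expression inside the theorem's expectation, so the updates follow an unbiased estimate of the MSE gradient.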

SLIDE 27

Empirical Results

[Figures: empirical results on Cartpole Swing-up and Acrobot.]

SLIDE 29

GridWorld Results

[Figures: GridWorld results for a high-variance policy and a low-variance policy.]

SLIDE 31

Variance Reduction

SLIDE 32

Additional Work

• Investigated an extension to the doubly-robust off-policy estimator.²
• Investigated where BPG is most effective empirically.

²Jiang and Li (2016); Thomas and Brunskill (2016)
SLIDE 33

Conclusion

Behavior policy search makes off-policy evaluation more accurate than on-policy evaluation. Behavior Policy Gradient is an effective behavior policy search method.

SLIDE 36

Open Questions

1. Can behavior policy search improve policy improvement?
2. Are there better measures of a good behavior policy?
3. Is the final behavior policy found by BPG applicable to other target policies?

SLIDE 37

Thanks for your attention! Questions?

SLIDE 38

Nan Jiang and Lihong Li. Doubly robust off-policy evaluation for reinforcement learning. arXiv preprint arXiv:1511.03722, 2016.

P. S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. arXiv preprint arXiv:1604.00923, 2016.

SLIDE 39

Prior Work: Importance Sampling
