Reducing Sampling Error in Batch Temporal Difference Learning


SLIDE 1

Reducing Sampling Error in Batch Temporal Difference Learning

Brahma S. Pavse¹, Ishan Durugkar¹, Josiah Hanna², Peter Stone¹,³

¹The University of Texas at Austin  ²The University of Edinburgh  ³Sony AI

ICML, July 2020

brahmasp@cs.utexas.edu

SLIDE 2

Reinforcement Learning Successes

SLIDE 3

How can RL agents make the most of a finite amount of experience?

By learning an accurate estimate of the value function from a finite amount of data.

SLIDE 4

Spotlight Overview

  • With a finite batch of data, on-policy single-step temporal difference learning converges to the value function of the wrong policy.
  • We propose a more data-efficient estimator and prove that it converges to the value function of the true policy.

SLIDE 5

Spotlight Overview: Flaw in Batch TD(0)

[Figure: a two-action example. In state s, the true policy chooses between actions a1 and a2, which yield rewards +60 and +30 and lead to states s1 and s2. The true value function is shown next to what Batch TD(0) computes from a finite-sized batch.]

Batch TD(0) estimates the value function for the wrong policy!

Our estimator will estimate the value function for the true policy.
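To make the flaw concrete, here is a worked version of this example. The true action probabilities and the batch's action counts are not legible from the slide, so the 50/50 true policy and the 2-vs-1 counts below are illustrative assumptions; only the +60/+30 rewards come from the figure. If the finite batch happens to contain a1 twice and a2 once, the maximum-likelihood (empirical) policy $\hat\pi$ puts probability 2/3 on a1:

$$v^{\pi}(s) = 0.5(+60) + 0.5(+30) = 45, \qquad v^{\hat\pi}(s) = \tfrac{2}{3}(+60) + \tfrac{1}{3}(+30) = 50,$$

so batch TD(0) converges to 50 rather than the true value 45.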

SLIDE 6

Batch Linear* Value Function Learning

Policy and environment transition dynamics: π(a|s) and P(s′|s, a).

The policy generates a batch of m episodes, where each episode is a sequence of (state, action, reward, next state) transitions.

Goal: estimate the value function v_π. With linear function approximation, the estimate is v̂(s) = w⊤x(s) for weight vector w and state features x(s).

Assumptions:

  • 1. The policy π is known (it is the policy we want to learn about).
  • 2. The transition dynamics P are unknown (model-free).
  • 3. The reward function is unknown.
  • 4. On-policy setting (the focus of this talk).

*Empirical analysis also considers non-linear TD(0)
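The slide's equations did not survive extraction; for reference, the standard discounted value function being estimated is (notation assumed, not copied from the slide):

$$v_\pi(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t)\right].$$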
SLIDE 7

Batch Linear* TD(0)

Given a fixed finite batch as input, repeat until convergence (see the code sketch below):

  • For each transition, compute the TD error and accumulate the corresponding update.
  • Make an aggregated update to the weights.
  • Clear the aggregation.

*Empirical analysis also considers non-linear TD(0)
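The slide's pseudocode was an image; the following is a minimal sketch of the annotated loop, assuming a batch of (state, action, reward, next state, done) transitions, a feature function `x`, and a step size `alpha`. All names are illustrative, not from the paper.

```python
import numpy as np

def batch_linear_td0(batch, x, n_features, gamma=1.0, alpha=0.01, n_sweeps=1000):
    """Batch linear TD(0): sweep a fixed finite batch, accumulating TD-error
    updates, then apply one aggregated weight update per sweep."""
    w = np.zeros(n_features)
    for _ in range(n_sweeps):                        # until convergence (fixed budget here)
        update = np.zeros(n_features)
        for s, _a, r, s_next, done in batch:         # for each transition
            v_s = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)
            td_error = r + gamma * v_next - v_s      # computed TD error
            update += alpha * td_error * x(s)        # accumulate
        w += update                                  # aggregated update to weights
        # the accumulator is re-zeroed at the top of the next sweep (clear aggregation)
    return w
```

In practice one would sweep until the aggregated update's norm falls below a tolerance rather than for a fixed number of sweeps.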

SLIDE 8

Batch TD(0) Value Function

Batch TD(0) converges to the finite-sized certainty-equivalence estimate for the MDP* defined by the maximum-likelihood estimates (MLE) of the policy and transition dynamics computed from the batch.

Problem! Because the batch is finite, these MLEs suffer from policy and transition-dynamics sampling error.

*Sutton (1988) proved a similar result for a Markov reward process
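The certainty-equivalence estimate itself was shown as an image; in standard notation (an assumed reconstruction, not copied from the slide), it is the value function satisfying the Bellman equation under the batch MLE quantities:

$$\hat v(s) = \sum_{a} \hat\pi(a \mid s) \left[ \hat r(s, a) + \gamma \sum_{s'} \hat P(s' \mid s, a)\, \hat v(s') \right],$$

where $\hat\pi$, $\hat P$, and $\hat r$ are the maximum-likelihood policy, transition-dynamics, and reward estimates computed from the batch.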

SLIDE 9

Policy Sampling Error in Batch TD(0)

[Figure: the two-action example again. The action frequencies in the finite-sized batch define an MLE policy that differs from the true policy, so Batch TD(0) computes the value function of the MLE policy.]

Batch TD(0) estimates the value function for the wrong policy!
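For discrete states and actions, the MLE policy is the count-based empirical policy. This formula is a standard maximum-likelihood reconstruction, not copied from the slide:

$$\hat\pi(a \mid s) = \frac{n(s, a)}{n(s)},$$

where $n(s, a)$ is the number of times action $a$ was taken in state $s$ in the batch and $n(s) = \sum_a n(s, a)$.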

SLIDE 10

Policy Sampling Error Corrected-TD(0)

The true policy distribution is assumed to be known.

Idea: correct learning from the MLE policy distribution to the true policy distribution, i.e., an off-policy-styled correction for an on-policy algorithm.

PSEC ratio (importance sampling [Precup et al., 2000, Ghiassian et al., 2018]): each transition's TD(0) update is reweighted by the ratio of the true policy to the MLE policy, turning the on-policy TD(0) update into the on-policy PSEC-TD(0) update.
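A minimal sketch of the corrected update, building on the `batch_linear_td0` sketch above. The count-based MLE policy and all Python names are illustrative assumptions; the PSEC ratio π(a|s)/π̂(a|s) follows the description on this slide.

```python
import numpy as np
from collections import defaultdict

def mle_policy(batch):
    """Count-based MLE policy pi_hat(a|s) = n(s,a) / n(s), from a batch of
    (s, a, r, s_next, done) transitions with discrete states and actions."""
    n_sa, n_s = defaultdict(int), defaultdict(int)
    for s, a, _r, _s_next, _done in batch:
        n_sa[(s, a)] += 1
        n_s[s] += 1
    return lambda a, s: n_sa[(s, a)] / n_s[s]

def batch_psec_td0(batch, pi, x, n_features, gamma=1.0, alpha=0.01, n_sweeps=1000):
    """Batch linear PSEC-TD(0): batch TD(0) with every per-transition update
    reweighted by the PSEC ratio pi(a|s) / pi_hat(a|s)."""
    pi_hat = mle_policy(batch)
    w = np.zeros(n_features)
    for _ in range(n_sweeps):
        update = np.zeros(n_features)
        for s, a, r, s_next, done in batch:
            rho = pi(a, s) / pi_hat(a, s)            # PSEC importance ratio
            v_next = 0.0 if done else w @ x(s_next)
            td_error = r + gamma * v_next - w @ x(s)
            update += alpha * rho * td_error * x(s)  # corrected TD update
        w += update
    return w
```

With `rho = 1` this reduces to the batch TD(0) sketch above; the only change is reweighting by the known true policy over the empirical one.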

SLIDE 11

Batch PSEC-TD(0)

[Figure: the two-action example revisited. From the finite-sized batch, Batch TD(0) computes the value function of the MLE policy, while Batch PSEC-TD(0), applying the PSEC-TD(0) update, computes the value function of the true policy.]

SLIDE 12

Batch PSEC-TD(0) Value Function

[The slide's equations are not recoverable; per the spotlight overview, the result is that batch PSEC-TD(0) converges to the certainty-equivalence estimate computed with the true policy rather than the MLE policy.]

SLIDE 13

Experimental Results

Domains:

  • Gridworld [link]: discrete states and actions
  • CartPole [Brockman et al., 2016]: continuous states, discrete actions
  • InvertedPendulum [Brockman et al., 2016, Todorov et al., 2012]: continuous states and actions

Evaluation metric (weighted error): [formula shown on slide]
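The metric's formula was an image. A weighted mean squared value error of the following form seems the likely intent, with the weighting distribution d unspecified on the slide; this is an assumption, not a reconstruction:

$$\text{error}(\hat v) = \sum_{s} d(s) \left( \hat v(s) - v_\pi(s) \right)^2.$$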

SLIDE 14

Empirical Results: Deterministic Gridworld

[Plot: PSEC-TD(0) vs. TD(0); annotation marks unvisited (s, a, s′) tuples.]

SLIDE 15

Empirical Results: Function Approximation

[Plots: CartPole* and InvertedPendulum.]

* Statistically significant according to Welch's test [Welch, 1947]

SLIDE 16

Additional Results

  • Extend the certainty-equivalence proof of Sutton (1988) from Markov reward processes to discounted, per-step reward MDPs.
  • Answer the following questions (a subset of many):
    • Does the expressiveness of the value function impact performance?
    • Does the expressiveness of the PSEC MLE policy impact performance?
    • Does underfitting or overfitting the PSEC MLE policy to the batch impact performance?
    • Can PSEC be applied to off-policy TD(0)?

SLIDE 17

Related Work

  • Estimating the behavior policy from data [Li et al., 2015, Narita et al., 2018, Hirano et al., 2003].
  • Reducing sampling error in policy evaluation [Hanna et al., 2019] and policy gradient learning [Hanna and Stone, 2019].
  • Reducing sampling error in action-values [van Seijen et al., 2009, Precup et al., 2000].

SLIDE 18

Open Questions

  • Reduce sampling error in n-step TD and TD(λ).
  • Evaluate actor-critic algorithms with an improved value function estimate.
  • Extend batch PSEC to online TD(0).

SLIDE 19

Takeaways

  • With a finite batch of data, batch TD(0) converges to an inaccurate value function.
  • The mismatch between the MLE policy distribution and the true policy distribution can be viewed from an off-policy perspective.
  • PSEC-TD(0) is a more data-efficient estimator than TD(0).
  • PSEC-TD(0) brings benefits in discrete and continuous state and action spaces.
  • While primarily shown for on-policy TD(0), PSEC is also applicable to off-policy TD(0).

SLIDE 20

Thank You!

Brahma S. Pavse: brahmasp.github.io
Ishan Durugkar: idurugkar.github.io
Josiah Hanna: homepages.inf.ed.ac.uk/jhanna2/index.html
Peter Stone: cs.utexas.edu/~pstone/

Recent extension: On Sampling Error in Batch Action-Value Prediction Algorithms