Renewal Monte Carlo: Renewal theory based reinforcement learning (PowerPoint presentation)



SLIDE 1

Renewal Monte Carlo:


Renewal theory based reinforcement learning

Jayakumar Subramanian and Aditya Mahajan

57th IEEE Conference on Decision and Control, Miami Beach, FL, USA, December 17-19, 2018


SLIDE 5

RL has achieved considerable success…

Salient features:
⊕ Model-free method
⊕ Uses policy search

Limitation:
⊖ Learning is slow (takes ∼10^9 to 10^15 iterations to converge)

⊕ Can we exploit features of the model to make it learn faster?
⊕ Without sacrificing generality?

Image credits: Popular Science, MIT Technology Review, Towards Data Science


SLIDE 11

An RL problem can be formulated as…

An agent interacting with an environment, modeled as an infinite-horizon Markov decision process (MDP):
• state space
• action space
• model (transition probability and per-step reward), which is unknown in RL
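The agent–environment loop on this slide can be sketched minimally. The state/action counts and the randomly generated model below are illustrative assumptions for the sketch, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative infinite-horizon MDP (sizes and random model are assumptions):
n_states, n_actions = 5, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :]
r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # r[s, a]

def step(s, a):
    """One environment step: sample the next state, return the per-step reward."""
    s_next = int(rng.choice(n_states, p=P[a, s]))
    return s_next, r[s, a]

# The agent only observes (state, reward); P and r stay hidden -- the RL setting.
s, total_reward = 0, 0.0
for t in range(100):
    a = int(rng.integers(n_actions))   # placeholder uniform-random policy
    s, reward = step(s, a)
    total_reward += reward
```

Each Dirichlet draw is a valid probability row, so every `P[a, s]` sums to one by construction.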


SLIDE 15

Policy parametrization

π_θ is a parametrized policy, e.g.:
• Gibbs (softmax) policy
• Neural network (NN) policy, where θ is the weights of the NN
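The Gibbs (softmax) parametrization above can be sketched as follows; the tabular parameter shape is an illustrative assumption (an NN policy would produce the logits from a network with weights θ instead).

```python
import numpy as np

def gibbs_policy(theta, s):
    """Gibbs (softmax) policy: pi_theta(a | s) proportional to exp(theta[s, a])."""
    logits = theta[s]
    z = np.exp(logits - logits.max())   # subtract max for numerical stability
    return z / z.sum()

theta = np.zeros((5, 3))                # 5 states, 3 actions (illustrative)
p = gibbs_policy(theta, s=2)
# With all-zero parameters the policy is uniform over the 3 actions.
```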


SLIDE 23

Policy gradient

• Performance: J(θ), the expected return under π_θ
• Gradient estimate: ĥ, an estimate of ∇_θ J(θ)
• Stochastic gradient ascent: θ_{k+1} = θ_k + α_k ĥ_k, with step sizes satisfying Σ_k α_k = ∞ and Σ_k α_k^2 < ∞

How do we estimate this gradient?
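The equations on this slide were images and did not survive extraction; the standard policy-gradient quantities they correspond to are, as a reconstruction rather than a verbatim copy:

```latex
J(\theta) = \mathbb{E}^{\pi_\theta}\Bigl[\textstyle\sum_{t=0}^{\infty} \gamma^t R_t\Bigr],
\qquad
\theta_{k+1} = \theta_k + \alpha_k\, \widehat{\nabla_\theta J}(\theta_k),
\qquad
\sum_k \alpha_k = \infty,\quad \sum_k \alpha_k^2 < \infty .
```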


SLIDE 27

How to estimate ∇_θ J(θ)?

• Monte Carlo estimate (REINFORCE)
• Actor-critic estimate (temporal difference / SARSA)
• Actor-critic with eligibility traces (SARSA-λ)
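The Monte Carlo (REINFORCE) estimate in the list above can be sketched as below; the toy trajectory and the constant-score policy in the check are illustrative assumptions.

```python
def reinforce_gradient(episodes, grad_log_pi, gamma=0.9):
    """Monte Carlo (REINFORCE) policy-gradient estimate:
    average over episodes of  G * sum_t grad log pi(a_t | s_t),
    where G is the discounted return of the episode."""
    g = 0.0
    for traj in episodes:                                  # traj: [(s, a, r), ...]
        G = sum(gamma**t * r for t, (_, _, r) in enumerate(traj))
        score = sum(grad_log_pi(s, a) for s, a, _ in traj)
        g += G * score
    return g / len(episodes)

# Toy check with a (hypothetical) policy whose score is identically 1:
episodes = [[(0, 0, 1.0), (0, 0, 1.0)]]
g = reinforce_gradient(episodes, grad_log_pi=lambda s, a: 1.0, gamma=0.5)
# G = 1 + 0.5 = 1.5 and score = 2, so g = 3.0
```

This is the plain (non-causal) form; updates happen only at the end of each episode, which is exactly the MC limitation the next slide contrasts with TD.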


SLIDE 31

MC vs. TD

MC:
⊕ Unbiased
⊕ Simple and easy to implement
⊕ Handles discounted and average-reward cases
⊖ High variance
⊖ End-of-episode updates
⊖ Not asymptotically optimal for infinite horizon

TD:
⊕ Low variance
⊕ Per-step updates
⊕ Asymptotically optimal for infinite horizon
⊖ Biased
⊖ Often requires function approximation
⊖ Additional effort for average reward

Can we get the best of both worlds?


SLIDE 38

Renewal Monte Carlo

[Figure: sample state trajectory over times 1–7; returns to the start state split the trajectory into regenerative cycles, and the performance is estimated by averaging over these cycles.]
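The per-cycle estimate sketched in the figure follows the renewal-reward theorem; stated here under standard assumptions (the exact discounted-case bookkeeping is in the paper):

```latex
J(\theta) \;=\; \frac{\mathbb{E}^{\pi_\theta}[\mathsf{R}]}{\mathbb{E}^{\pi_\theta}[\mathsf{T}]}
\;\approx\;
\frac{\tfrac{1}{N}\sum_{n=1}^{N} \mathsf{R}^{(n)}}{\tfrac{1}{N}\sum_{n=1}^{N} \mathsf{T}^{(n)}},
```

where R^(n) and T^(n) are the cumulative reward and the length of the n-th regenerative cycle, and the cycles are i.i.d. because the state regenerates at each renewal.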


SLIDE 45

RMC based policy gradient

• Performance: J(θ) = R(θ) / T(θ), the ratio of the expected per-cycle reward R(θ) to the expected per-cycle time T(θ)
• Gradient estimate: built from estimates of R(θ), T(θ) and their gradients
• Stochastic gradient ascent on θ, as in the standard RL policy gradient

R and T (and their gradients) are estimated over renewal cycles using MC / TD.
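As a toy, hedged sketch of the update above for the ratio J = R/T: the two-action chain, the fixed renewal probability, the step size, and the particular gradient combination H = T̂·∇R̂ − R̂·∇T̂ (proportional to ∇J) are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def pi1(theta):
    """Probability of action 1 under a scalar Gibbs (softmax) policy."""
    return 1.0 / (1.0 + np.exp(-theta))

def run_cycle(theta):
    """One regenerative cycle: run from the start state until renewal.
    Returns (cycle reward R, cycle length T, score = sum_t d/dtheta log pi(a_t))."""
    R, T, score = 0.0, 0, 0.0
    while True:
        p = pi1(theta)
        a = int(rng.random() < p)
        score += (1.0 - p) if a == 1 else -p   # d/dtheta log pi_theta(a)
        R += float(a)                          # toy reward: 1 for action 1
        T += 1
        if rng.random() < 0.5:                 # renewal w.p. 1/2 each step (toy)
            return R, T, score

theta, alpha = 0.0, 0.05
for k in range(200):
    cycles = [run_cycle(theta) for _ in range(32)]
    R = np.array([c[0] for c in cycles])
    T = np.array([c[1] for c in cycles], dtype=float)
    S = np.array([c[2] for c in cycles])
    # Likelihood-ratio estimates of grad R and grad T across cycles, combined as
    # H = T_hat * grad_R_hat - R_hat * grad_T_hat  (proportional to grad J)
    H = T.mean() * (R * S).mean() - R.mean() * (T * S).mean()
    theta += alpha * H                         # stochastic gradient ascent

# After training, the policy should strongly favor the rewarded action.
```

Because updates happen once per cycle rather than once per episode, this keeps MC's unbiasedness while shortening the update interval, which is the point of the method.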


SLIDE 52

Convergence

Assume:
• the per-cycle averages are unbiased estimators of R(θ) and T(θ), and likewise for their gradients
• the combined estimate is an unbiased estimator of (a positive scaling of) ∇_θ J(θ)
• ∇_θ J is continuous, the estimate has bounded variance, and the step sizes satisfy the standard conditions
• ∇_θ J has locally asymptotically stable isolated limit points

Then the iteration for θ converges a.s. to a value where ∇_θ J(θ) = 0.
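These are the standard stochastic-approximation (ODE-method) conditions; a sketch of the shape of the result, with the symbols reconstructed rather than copied from the slide:

```latex
\theta_{k+1} = \theta_k + \alpha_k \widehat{H}_k,
\qquad
\mathbb{E}\bigl[\widehat{H}_k \,\big|\, \mathcal{F}_k\bigr] \propto \nabla_\theta J(\theta_k),
\qquad
\sum_k \alpha_k = \infty,\quad \sum_k \alpha_k^2 < \infty
\;\Longrightarrow\;
\theta_k \xrightarrow{\text{a.s.}} \{\theta : \nabla_\theta J(\theta) = 0\}.
```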


SLIDE 56

E.g. – Randomly generated MDP

[Plot: performance (50–300) vs. number of samples (0–2.00 ×10^5) for a randomly generated MDP; curves: Exact, S-0, S-0.25, S-0.5, S-0.75, S-1, RMC, RMC-B.]


SLIDE 60

Related work

• Simulation optimization [Glynn 1986, 1990]: assumes a known probability law of the primitive random variables and its weak derivative
• Sensitivity analysis for MDPs [Xi-Ren Cao, 1997]: average reward criterion; known and unknown system models
• Renewal theory for RL [Marbach & Tsitsiklis 2001, 2003]: average reward criterion; relative value function for average reward


SLIDE 65

Limitation of RMC

⊖ Renewal could take a long time
⊕ Two techniques to overcome this:
• Post-decision state model
• Approximate renewal model

[Figure: trajectory from start state s0 over times 1–6, with per-cycle rewards R0–R5 marked along the time axis.]


SLIDE 72

Post-decision state model

Renewals are defined in terms of post-decision states.

[Figure: trajectory over pre-decision states at times 1–6 and post-decision states 0+–6+.]


SLIDE 75

Approximate RMC

[Figure: sample state trajectory over times 1–7; approximate renewals, declared when the state returns close to the start state, split the trajectory into cycles, and the performance is estimated by averaging over these cycles.]
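The approximate-renewal idea in the figure can be sketched as a membership test; the Euclidean norm and the radius `rho` below are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def is_approximate_renewal(s, s0, rho=0.1):
    """Approximate renewal: declare a renewal when the state re-enters a ball
    of radius rho around the reference state s0 (rho = 0 recovers exact
    renewals to s0 itself)."""
    return float(np.linalg.norm(np.asarray(s) - np.asarray(s0))) <= rho

inside = is_approximate_renewal([0.05, 0.0], [0.0, 0.0])   # within the ball
outside = is_approximate_renewal([0.5, 0.5], [0.0, 0.0])   # well outside it
```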


SLIDE 79

Error bound

J(θ) is locally Lipschitz in the renewal state, so the approximation error is bounded in terms of the radius of approximation.


SLIDE 82

E.g. Inventory management

[Plot: total cost (200–280) vs. number of samples (0–4 ×10^6); curves: Exact, RMC.]


SLIDE 86

Conclusion

• RMC is useful in problems where:
  • the renewal time is small
  • the structure of the optimal policy is known
  • reset actions are present
• Not so useful in arbitrary high-dimensional problems
• In high-dimensional problems:
  • RMC can be used as a sub-component of the main scheme
  • in the presence of hierarchies, it can be used at a level with short renewals

SLIDE 87

Thank you