Renewal Monte Carlo: Renewal theory based reinforcement learning
Jayakumar Subramanian and Aditya Mahajan
57th IEEE Conference on Decision and Control, Miami Beach, FL, USA, December 17-19, 2018
RL has achieved considerable success…

Salient features:
⊕ Model-free method
⊕ Uses policy search

Limitation:
⊖ Learning is slow (takes ∼10^9 to 10^15 iterations to converge)

Can we exploit features of the model to make learning faster, without sacrificing generality?
An RL problem can be formulated as…

An agent interacting with an environment, modeled as an infinite-horizon Markov decision process (MDP) with:
- State space
- Action space
- Model: transition probability and per-step reward (unknown in RL)
Policy parametrization

A parametrized policy, e.g.:
- Gibbs (softmax) policy
- Neural network (NN) policy, with the parameters being the weights of the NN
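As a concrete illustration of the Gibbs (softmax) parametrization for a finite MDP, here is a minimal sketch; the class and variable names are my own, and nothing in the slides prescribes this implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

class GibbsPolicy:
    """Gibbs (softmax) policy over a finite action set.

    theta[s, a] is the preference for action a in state s;
    pi(a | s) = exp(theta[s, a]) / sum_b exp(theta[s, b]).
    """
    def __init__(self, num_states, num_actions, rng=None):
        self.theta = np.zeros((num_states, num_actions))
        self.rng = rng or np.random.default_rng(0)

    def probs(self, s):
        return softmax(self.theta[s])

    def sample(self, s):
        return self.rng.choice(self.theta.shape[1], p=self.probs(s))

    def grad_log_prob(self, s, a):
        # Score function d/dtheta log pi(a | s): nonzero only in row s.
        g = np.zeros_like(self.theta)
        g[s] = -self.probs(s)
        g[s, a] += 1.0
        return g
```

The score function `grad_log_prob` is what a policy-gradient method accumulates along a trajectory.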
Policy gradient

- Performance of the parametrized policy as a function of the parameters
- A gradient estimate, i.e. an estimate of the gradient of the performance with respect to the parameters
- Stochastic gradient ascent: update the parameters in the direction of the estimated gradient

How do we estimate this gradient?
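The stochastic-gradient-ascent step can be written in standard stochastic-approximation notation (the symbols J, θ, α_k are conventional reconstructions, not taken verbatim from the slides):

```latex
\theta_{k+1} \;=\; \theta_k + \alpha_k\,\widehat{\nabla J}(\theta_k),
\qquad
\mathbb{E}\!\left[\widehat{\nabla J}(\theta_k)\right] = \nabla_\theta J(\theta_k),
\qquad
\sum_k \alpha_k = \infty,\quad \sum_k \alpha_k^2 < \infty.
```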
How to estimate the gradient?

- Monte Carlo estimate (REINFORCE)
- Actor-critic estimate (temporal difference / SARSA)
- Actor-critic with eligibility-traces estimate (SARSA-λ)
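A minimal sketch of the Monte Carlo (REINFORCE) estimate, assuming episodic trajectories and a score function for the policy; the function and argument names are mine:

```python
import numpy as np

def reinforce_gradient(episodes, grad_log_prob, gamma=0.95):
    """Monte Carlo (REINFORCE) estimate of the policy gradient.

    episodes: list of trajectories, each a list of (s, a, r) tuples.
    grad_log_prob(s, a): score function of the current policy.
    Returns the average over episodes of sum_t G_t * grad log pi(a_t | s_t),
    where G_t is the discounted return from time t onward.
    """
    grad = None
    for traj in episodes:
        G = 0.0
        ep_grad = None
        # Walk backwards so the discounted return accumulates in O(T).
        for (s, a, r) in reversed(traj):
            G = r + gamma * G
            g = G * grad_log_prob(s, a)
            ep_grad = g if ep_grad is None else ep_grad + g
        grad = ep_grad if grad is None else grad + ep_grad
    return grad / len(episodes)
```

This is the unbiased but high-variance estimator on the MC side of the comparison that follows.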
MC vs. TD

MC:
⊕ Unbiased
⊕ Simple & easy to implement
⊕ Handles discounted & average reward cases
⊖ High variance
⊖ End-of-episode updates
⊖ Not asymptotically optimal for infinite horizon

TD:
⊕ Low variance
⊕ Per-step updates
⊕ Asymptotically optimal for infinite horizon
⊖ Biased
⊖ Often requires function approximation
⊖ Additional effort for average reward

Can we get the best of both worlds?
Renewal Monte Carlo

[Figure: a sample state trajectory over time, split into cycles at successive returns to a renewal state.]

The performance is estimated by the ratio of the sample-mean (discounted) reward accumulated per renewal cycle to the sample-mean (discounted) cycle length.
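A sketch of the renewal-ratio estimate, assuming the renewal relationship J = R/T with discounted per-cycle reward R and discounted per-cycle time T; the variable names and exact bookkeeping here are my own, not the paper's:

```python
import numpy as np

def renewal_estimate(states, rewards, s0, gamma=0.95):
    """Split a trajectory at visits to the renewal state s0 and estimate
    performance as (mean discounted per-cycle reward) /
    (mean discounted per-cycle time)."""
    R, T = [], []
    r_cyc, t_cyc, disc = 0.0, 0.0, 1.0
    for s, r in zip(states, rewards):
        if s == s0 and t_cyc > 0:          # a return to s0 closes a cycle
            R.append(r_cyc); T.append(t_cyc)
            r_cyc, t_cyc, disc = 0.0, 0.0, 1.0
        r_cyc += disc * r
        t_cyc += disc
        disc *= gamma
    if not R:
        raise ValueError("no completed renewal cycles in trajectory")
    return np.mean(R) / np.mean(T)
```

A sanity check: with a constant reward the ratio recovers that constant, whatever the cycle lengths.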
RMC based policy gradient

- Performance: the renewal ratio of expected per-cycle reward to expected per-cycle time
- Gradient estimate: formed from the per-cycle quantities and their gradients, which are estimated using MC / TD RL policy-gradient techniques within each renewal cycle
- Stochastic gradient ascent on the policy parameters using this estimate
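The gradient of the renewal ratio follows from the quotient rule (standard calculus; this reconstruction uses my own symbols R, T, H for the per-cycle reward, per-cycle time, and the combined estimate):

```latex
\nabla_\theta J(\theta)
 = \nabla_\theta \frac{R(\theta)}{T(\theta)}
 = \frac{T(\theta)\,\nabla_\theta R(\theta) - R(\theta)\,\nabla_\theta T(\theta)}{T(\theta)^2}
 \;\propto\; \underbrace{T\,\nabla_\theta R - R\,\nabla_\theta T}_{=:\,H(\theta)} .
```

Since the cycle length T(θ) is positive, ascending along an unbiased estimate of H(θ) ascends J(θ).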
Convergence

Assumptions:
- The estimators of the per-cycle gradients are unbiased, so the combined estimate H is an unbiased estimator of a positive multiple of the performance gradient
- The performance gradient is continuous and the estimator has bounded variance
- The limiting dynamics have locally asymptotically stable isolated limit points

Then the parameter iteration converges a.s. to a value where the performance gradient is zero.
E.g. – Randomly generated MDP

[Figure: performance vs. number of samples (×10^5, from 0 to 2); performance axis from 50 to 300. Curves: Exact, S-0, S-0.25, S-0.5, S-0.75, S-1, RMC, RMC-B.]
Related work

- Simulation optimization [Glynn 1986, 1990]:
  - Assumes a known probability law of the primitive random variables and its weak derivative
- Sensitivity analysis for MDPs [Xi-Ren Cao, 1997]:
  - Average reward criterion
  - Known and unknown system models
- Renewal theory for RL [Marbach & Tsitsiklis 2001, 2003]:
  - Average reward criterion
  - Relative value function for average reward
19/12/18 13
⊖ Renewal could take a long time ⊕ Two techniques to overcome this:
Post-decision state model Approximate renewal model
1 2 3 4 5 6
R3 R4 R5 R0 R1 R2
s0 ⇢
Time State
Post-decision state model

[Figure: a sample path showing pre-decision states 1…6 and post-decision states 0+…6+ over time.]

Renewals are defined in terms of post-decision states.
Approximate RMC

[Figure: a sample state trajectory over time; approximate renewals occur whenever the state returns close to the renewal state.]

The performance is estimated by the same renewal ratio, computed over these approximate renewal cycles.
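A sketch of how approximate renewal instants might be detected, assuming renewal is triggered when the state enters a ball around the renewal state; the exact trigger rule and function name are my assumptions, not the paper's:

```python
import numpy as np

def approx_renewal_times(states, s0, radius):
    """Indices where the state enters a ball of radius `radius` around s0,
    used as approximate renewal instants. Consecutive in-ball steps count
    as a single renewal."""
    times, inside = [], False
    for t, s in enumerate(states):
        near = np.linalg.norm(np.asarray(s) - np.asarray(s0)) <= radius
        if near and not inside:
            times.append(t)
        inside = near
    return times
```

The trajectory can then be split at these instants and fed to the same renewal-ratio estimator as in exact RMC.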
Error bound

If the performance gradient is locally Lipschitz in the renewal state, the approximation error is bounded by the radius of approximation.
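One way to formalize the statement above (a reconstruction; L and ρ are my labels for the local Lipschitz constant and the radius of approximation, and the precise constants in the paper may differ):

```latex
\big\| \nabla_\theta J(\tilde s_0) - \nabla_\theta J(s_0) \big\|
\;\le\; L\,\| \tilde s_0 - s_0 \|
\;\le\; L\,\rho ,
```

so the gradient evaluated at an approximate renewal state within distance ρ of the true renewal state s0 is off by at most Lρ.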
E.g. Inventory management

[Figure: total cost vs. number of samples (×10^6, from 1 to 4); cost axis from 200 to 280. Curves: Exact, RMC.]
Conclusion

- RMC is useful in problems where:
  - the renewal time is small
  - the structure of the optimal policy is known
  - reset actions are present
- Not so useful in arbitrary high-dimensional problems
- In high-dimensional problems:
  - RMC can be used as a sub-component of the main scheme
  - in the presence of hierarchies, it can be used at a level with short renewals