Renewal Monte Carlo: Renewal theory based reinforcement learning
Jayakumar Subramanian and Aditya Mahajan
57th IEEE Conference on Decision and Control, Miami Beach, FL, USA, December 17-19, 2018
RL has achieved considerable success…

Salient features:
⊕ Model-free method
⊕ Uses policy search

Limitation:
⊖ Learning is slow (takes ∼10^9 to 10^15 iterations to converge)

Can we exploit features of the model to make learning faster, without sacrificing generality?
An RL problem can be formulated as…

An agent interacting with an environment, modeled as an infinite-horizon Markov decision process (MDP) with:
- State space
- Action space
- Model: transition probability and per-step reward (unknown in RL)
Policy parametrization

A parametrized policy, e.g.:
- Gibbs (softmax) policy
- Neural network (NN) policy, with the parameters being the weights of the NN
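As a concrete illustration of the Gibbs (softmax) parametrization for a finite MDP, here is a minimal sketch; the class and variable names are my own, and nothing in the slides prescribes this implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

class GibbsPolicy:
    """Gibbs (softmax) policy over a finite action set.

    theta[s, a] is the preference for action a in state s;
    pi(a | s) = exp(theta[s, a]) / sum_b exp(theta[s, b]).
    """
    def __init__(self, num_states, num_actions, rng=None):
        self.theta = np.zeros((num_states, num_actions))
        self.rng = rng or np.random.default_rng(0)

    def probs(self, s):
        return softmax(self.theta[s])

    def sample(self, s):
        return self.rng.choice(self.theta.shape[1], p=self.probs(s))

    def grad_log_prob(self, s, a):
        # Score function d/dtheta log pi(a | s): nonzero only in row s.
        g = np.zeros_like(self.theta)
        g[s] = -self.probs(s)
        g[s, a] += 1.0
        return g
```

The score function `grad_log_prob` is what a policy-gradient method accumulates along a trajectory.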
Policy gradient

- Performance of the parametrized policy as a function of the parameters
- A gradient estimate, i.e. an estimate of the gradient of the performance with respect to the parameters
- Stochastic gradient ascent: update the parameters in the direction of the estimated gradient

How do we estimate this gradient?
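The stochastic-gradient-ascent step can be written in standard stochastic-approximation notation (the symbols J, θ, α_k are conventional reconstructions, not taken verbatim from the slides):

```latex
\theta_{k+1} \;=\; \theta_k + \alpha_k\,\widehat{\nabla J}(\theta_k),
\qquad
\mathbb{E}\!\left[\widehat{\nabla J}(\theta_k)\right] = \nabla_\theta J(\theta_k),
\qquad
\sum_k \alpha_k = \infty,\quad \sum_k \alpha_k^2 < \infty.
```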
How to estimate the gradient?

- Monte Carlo estimate (REINFORCE)
- Actor-critic estimate (temporal difference / SARSA)
- Actor-critic with eligibility-traces estimate (SARSA-λ)
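A minimal sketch of the Monte Carlo (REINFORCE) estimate, assuming episodic trajectories and a score function for the policy; the function and argument names are mine:

```python
import numpy as np

def reinforce_gradient(episodes, grad_log_prob, gamma=0.95):
    """Monte Carlo (REINFORCE) estimate of the policy gradient.

    episodes: list of trajectories, each a list of (s, a, r) tuples.
    grad_log_prob(s, a): score function of the current policy.
    Returns the average over episodes of sum_t G_t * grad log pi(a_t | s_t),
    where G_t is the discounted return from time t onward.
    """
    grad = None
    for traj in episodes:
        G = 0.0
        ep_grad = None
        # Walk backwards so the discounted return accumulates in O(T).
        for (s, a, r) in reversed(traj):
            G = r + gamma * G
            g = G * grad_log_prob(s, a)
            ep_grad = g if ep_grad is None else ep_grad + g
        grad = ep_grad if grad is None else grad + ep_grad
    return grad / len(episodes)
```

This is the unbiased but high-variance estimator on the MC side of the comparison that follows.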
MC vs. TD

MC:
⊕ Unbiased
⊕ Simple & easy to implement
⊕ Handles discounted & average reward cases
⊖ High variance
⊖ End-of-episode updates
⊖ Not asymptotically optimal for infinite horizon

TD:
⊕ Low variance
⊕ Per-step updates
⊕ Asymptotically optimal for infinite horizon
⊖ Biased
⊖ Often requires function approximation
⊖ Additional effort for average reward

Can we get the best of both worlds?
Renewal Monte Carlo

[Figure: a sample state trajectory over time, split into cycles at successive returns to a renewal state.]

The performance is estimated by the ratio of the sample-mean (discounted) reward accumulated per renewal cycle to the sample-mean (discounted) cycle length.
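A sketch of the renewal-ratio estimate, assuming the renewal relationship J = R/T with discounted per-cycle reward R and discounted per-cycle time T; the variable names and exact bookkeeping here are my own, not the paper's:

```python
import numpy as np

def renewal_estimate(states, rewards, s0, gamma=0.95):
    """Split a trajectory at visits to the renewal state s0 and estimate
    performance as (mean discounted per-cycle reward) /
    (mean discounted per-cycle time)."""
    R, T = [], []
    r_cyc, t_cyc, disc = 0.0, 0.0, 1.0
    for s, r in zip(states, rewards):
        if s == s0 and t_cyc > 0:          # a return to s0 closes a cycle
            R.append(r_cyc); T.append(t_cyc)
            r_cyc, t_cyc, disc = 0.0, 0.0, 1.0
        r_cyc += disc * r
        t_cyc += disc
        disc *= gamma
    if not R:
        raise ValueError("no completed renewal cycles in trajectory")
    return np.mean(R) / np.mean(T)
```

A sanity check: with a constant reward the ratio recovers that constant, whatever the cycle lengths.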
RMC based policy gradient

- Performance: the renewal ratio of expected per-cycle reward to expected per-cycle time
- Gradient estimate: formed from the per-cycle quantities and their gradients, which are estimated using MC / TD RL policy-gradient techniques within each renewal cycle
- Stochastic gradient ascent on the policy parameters using this estimate
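The gradient of the renewal ratio follows from the quotient rule (standard calculus; this reconstruction uses my own symbols R, T, H for the per-cycle reward, per-cycle time, and the combined estimate):

```latex
\nabla_\theta J(\theta)
 = \nabla_\theta \frac{R(\theta)}{T(\theta)}
 = \frac{T(\theta)\,\nabla_\theta R(\theta) - R(\theta)\,\nabla_\theta T(\theta)}{T(\theta)^2}
 \;\propto\; \underbrace{T\,\nabla_\theta R - R\,\nabla_\theta T}_{=:\,H(\theta)} .
```

Since the cycle length T(θ) is positive, ascending along an unbiased estimate of H(θ) ascends J(θ).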
Convergence

Assumptions:
- The estimators of the per-cycle gradients are unbiased, so the combined estimate H is an unbiased estimator of a positive multiple of the performance gradient
- The performance gradient is continuous and the estimator has bounded variance
- The limiting dynamics have locally asymptotically stable isolated limit points

Then the parameter iteration converges a.s. to a value where the performance gradient is zero.
E.g. – Randomly generated MDP

[Figure: performance vs. number of samples (×10^5, from 0 to 2); performance axis from 50 to 300. Curves: Exact, S-0, S-0.25, S-0.5, S-0.75, S-1, RMC, RMC-B.]
Related work

- Simulation optimization [Glynn 1986, 1990]:
  - Assumes a known probability law of the primitive random variables and its weak derivative
- Sensitivity analysis for MDPs [Xi-Ren Cao, 1997]:
  - Average reward criterion
  - Known and unknown system models
- Renewal theory for RL [Marbach & Tsitsiklis 2001, 2003]:
  - Average reward criterion
  - Relative value function for average reward
19/12/18 13
⊖ Renewal could take a long time ⊕ Two techniques to overcome this:
Post-decision state model Approximate renewal model
1 2 3 4 5 6
R3 R4 R5 R0 R1 R2
s0 ⇢
Time State
Post-decision state model

[Figure: a sample path showing pre-decision states 1…6 and post-decision states 0+…6+ over time.]

Renewals are defined in terms of post-decision states.
Approximate RMC

[Figure: a sample state trajectory over time; approximate renewals occur whenever the state returns close to the renewal state.]

The performance is estimated by the same renewal ratio, computed over these approximate renewal cycles.
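A sketch of how approximate renewal instants might be detected, assuming renewal is triggered when the state enters a ball around the renewal state; the exact trigger rule and function name are my assumptions, not the paper's:

```python
import numpy as np

def approx_renewal_times(states, s0, radius):
    """Indices where the state enters a ball of radius `radius` around s0,
    used as approximate renewal instants. Consecutive in-ball steps count
    as a single renewal."""
    times, inside = [], False
    for t, s in enumerate(states):
        near = np.linalg.norm(np.asarray(s) - np.asarray(s0)) <= radius
        if near and not inside:
            times.append(t)
        inside = near
    return times
```

The trajectory can then be split at these instants and fed to the same renewal-ratio estimator as in exact RMC.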
Error bound

If the performance gradient is locally Lipschitz in the renewal state, the approximation error is bounded by the radius of approximation.
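One way to formalize the statement above (a reconstruction; L and ρ are my labels for the local Lipschitz constant and the radius of approximation, and the precise constants in the paper may differ):

```latex
\big\| \nabla_\theta J(\tilde s_0) - \nabla_\theta J(s_0) \big\|
\;\le\; L\,\| \tilde s_0 - s_0 \|
\;\le\; L\,\rho ,
```

so the gradient evaluated at an approximate renewal state within distance ρ of the true renewal state s0 is off by at most Lρ.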
E.g. Inventory management

[Figure: total cost vs. number of samples (×10^6, from 1 to 4); cost axis from 200 to 280. Curves: Exact, RMC.]
Conclusion

- RMC is useful in problems where:
  - the renewal time is small
  - the structure of the optimal policy is known
  - reset actions are present
- Not so useful in arbitrary high-dimensional problems
- In high-dimensional problems:
  - RMC can be used as a sub-component of the main scheme
  - in the presence of hierarchies, it can be used at a level with short renewals