  1. Online Exploration in Least-Squares Policy Iteration
     Lihong Li, Michael L. Littman, and Christopher R. Mansley
     Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)
     AAMAS, Budapest, 5/14/2009

  2. Contributions
     Reinforcement learning poses two challenges:
     • Challenge I: the exploration/exploitation tradeoff
       – addressed by Rmax [Brafman & Tennenholtz 02] (provably efficient, finite)
     • Challenge II: value-function approximation
       – addressed by LSPI [Lagoudakis & Parr 03] (continuous, offline)
     • This work: LSPI-Rmax, combining the two

  3. Outline
     • Introduction
       – LSPI
       – Rmax
     • LSPI-Rmax
     • Experiments
     • Conclusions

  4. Basic Terminology
     • Markov decision process
       – States: S
       – Actions: A
       – Reward function: -1 ≤ R(s,a) ≤ 1
       – Transition probabilities: T(s'|s,a)
       – Discount factor: 0 < γ < 1
     • Optimal value function: Q*(s,a) (defined below)
     • Optimal policy: π*(s) (defined below)
     • Goal: approximate these quantities
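In this notation, the optimal value function and optimal policy referred to above are given by the standard Bellman optimality conditions:

```latex
% Bellman optimality equation for the action-value function
Q^*(s,a) = R(s,a) + \gamma \sum_{s'} T(s' \mid s, a)\, \max_{a'} Q^*(s', a')

% Optimal (greedy) policy
\pi^*(s) = \operatorname*{arg\,max}_{a}\, Q^*(s,a)
```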

  5. Linear Function Approximation
     • Features: φ(s,a) = (φ_1(s,a), …, φ_k(s,a))
       – a.k.a. "basis functions"; predefined
     • Weights: w = (w_1, …, w_k)
       – w_i measures the contribution of φ_i to approximating Q*
     • Approximation: Q̂(s,a) = w · φ(s,a) = Σ_i w_i φ_i(s,a)
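A minimal sketch of how such a linear approximation is evaluated and used to act greedily; the toy feature map `phi`, the action set, and the state encoding below are assumptions for illustration only:

```python
import numpy as np

def q_value(w, phi, s, a):
    """Linear approximation: Q(s,a) = w · φ(s,a)."""
    return float(np.dot(w, phi(s, a)))

def greedy_action(w, phi, s, actions):
    """Greedy policy: π(s) = argmax_a w · φ(s,a)."""
    return max(actions, key=lambda a: q_value(w, phi, s, a))

# Toy feature map (an assumption): one indicator block per action,
# filled with a simple encoding of the state.
def phi(s, a, num_actions=2, state_dim=2):
    feats = np.zeros(num_actions * state_dim)
    feats[a * state_dim:(a + 1) * state_dim] = s
    return feats

w = np.zeros(2 * 2)                  # one weight per feature
s = np.array([0.3, -0.7])            # a toy continuous state
print(greedy_action(w, phi, s, actions=[0, 1]))
```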

  6. LSPI [Lagoudakis & Parr 03]
     • Initialize policy π
     • Repeat:
       – Evaluate π: compute the weights w from a set of samples D
       – Improve π: π'(s) = argmax_a w·φ(s,a)
       – π ← π'
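The evaluation step ("compute w") is LSTDQ, which solves a linear system built from the samples. A minimal sketch, assuming D is a list of (s, a, r, s') transitions, `phi` maps a state-action pair to a length-k feature vector, `policy` maps a state to an action, and the small ridge term is an added assumption for numerical stability:

```python
import numpy as np

def lstdq(D, phi, policy, k, gamma=0.95, reg=1e-6):
    """Least-squares policy evaluation: solve A w = b for the weights of Q^π.

    A = Σ φ(s,a) (φ(s,a) - γ φ(s', π(s')))ᵀ
    b = Σ φ(s,a) r
    """
    A = reg * np.eye(k)              # small ridge term (an assumption, for stability)
    b = np.zeros(k)
    for (s, a, r, s_next) in D:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)
        b += f * r
    return np.linalg.solve(A, b)
```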

  7. LSPI [Lagoudakis & Parr 03]
     • The same evaluate/improve loop: compute w, then π'(s) = argmax_a w·φ(s,a)
     • But LSPI does not specify how to collect the samples D
       – a fundamental challenge in online reinforcement learning
       – an agent only collects samples in states it visits…

  8. Exploration/Exploitation Tradeoff
     [Figure: a chain MDP with states 1, 2, 3, …, 98, 99, 100, where rewards are 0 in most
     states, one state gives a small reward of 0.001, and another gives a reward of 1000;
     accompanied by a plot of total reward vs. time contrasting efficient exploration,
     inefficient exploration, and the optimal policy.]

  9. Rmax [Brafman & Tennenholtz 02]
     • Rmax is for finite-state, finite-action MDPs
     • Learns T and R by counting/averaging
     • Partitions the state-actions S × A into "known" and "unknown"
     • In s_t, takes the optimal action in an optimistic model
       ("optimism in the face of uncertainty")
       – either: explore the "unknown" region
       – or: exploit the "known" region
     • Theorem: Rmax is provably efficient
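A minimal sketch of the counting-based "known"/"unknown" distinction that Rmax relies on; the visit threshold `m` and the tabular dictionary are assumptions for illustration (and require hashable states and actions):

```python
from collections import defaultdict

class KnownnessTracker:
    """Counts visits to (s, a) pairs; a pair is 'known' after m visits (Rmax-style)."""

    def __init__(self, m=5):
        self.m = m
        self.counts = defaultdict(int)

    def update(self, s, a):
        self.counts[(s, a)] += 1

    def is_known(self, s, a):
        return self.counts[(s, a)] >= self.m
```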

  10. LSPI-Rmax
      • Similar to LSPI
      • But distinguishes known and unknown state-actions (s,a), based on the samples in D
        – unknown state-actions: treat their Q-value as Q_max (like Rmax)
      • Implemented via modifications of LSTDQ

  11. LSTDQ-Rmax
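One plausible way to realize this modification, sketched here under stated assumptions rather than as the paper's exact update: unknown (s,a) pairs are excluded from the least-squares fit and valued at the optimistic constant Q_max = Rmax / (1 - γ), and backups whose next state-action is unknown use Q_max in place of the linear estimate; `is_known` is the counting-style test sketched above.

```python
import numpy as np

def lstdq_rmax(D, phi, policy, k, is_known, gamma=0.95, r_max=1.0, reg=1e-6):
    """LSTDQ with an Rmax-style optimistic treatment of unknown (s, a) pairs (sketch)."""
    q_max = r_max / (1.0 - gamma)
    A = reg * np.eye(k)
    b = np.zeros(k)
    for (s, a, r, s_next) in D:
        if not is_known(s, a):
            continue                      # unknown pairs are valued at Q_max directly,
                                          # not fitted from data (handled at action selection)
        f = phi(s, a)
        a_next = policy(s_next)
        if is_known(s_next, a_next):
            A += np.outer(f, f - gamma * phi(s_next, a_next))   # standard LSTDQ term
            b += f * r
        else:
            A += np.outer(f, f)           # treat the unknown successor as terminal...
            b += f * (r + gamma * q_max)  # ...with optimistic value Q_max
    return np.linalg.solve(A, b)
```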

  12. LSPI-Rmax for Online RL
      • D = empty set
      • Initialize w
      • for t = 1, 2, 3, …
        – Take greedy action: a_t = argmax_a w·φ(s_t, a)
        – D = D ∪ {(s_t, a_t, r_t, s_{t+1})}
        – Run LSPI using LSTDQ-Rmax
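An end-to-end sketch of this online loop, using the helpers sketched above (`phi`, `lstdq_rmax`, `KnownnessTracker`); the environment interface (`env.reset()` returning a state, `env.step(a)` returning next state, reward, and a done flag) and the optimistic action selection for unknown pairs are assumptions for illustration:

```python
import numpy as np

def run_lspi_rmax(env, phi, actions, k, steps=10_000, gamma=0.95,
                  r_max=1.0, m=5, pi_iters=3):
    """Online LSPI-Rmax loop (sketch): act greedily w.r.t. an optimistic Q,
    store every transition, and re-run approximate policy iteration."""
    D = []
    w = np.zeros(k)
    tracker = KnownnessTracker(m)        # counting-based knownness (an assumption;
                                         # states/actions must be hashable, e.g. discretized)
    q_max = r_max / (1.0 - gamma)

    def q(s, a):
        # Optimism in the face of uncertainty: unknown pairs are worth Q_max.
        return q_max if not tracker.is_known(s, a) else float(np.dot(w, phi(s, a)))

    def policy(s):
        return max(actions, key=lambda a: q(s, a))

    s = env.reset()
    for t in range(steps):
        a = policy(s)                    # greedy action: argmax_a w·φ(s_t, a), optimistic
        s_next, r, done = env.step(a)    # assumed toy interface: (next_state, reward, done)
        D.append((s, a, r, s_next))
        tracker.update(s, a)
        for _ in range(pi_iters):        # run LSPI using LSTDQ-Rmax on all samples so far
            w = lstdq_rmax(D, phi, policy, k, tracker.is_known, gamma, r_max)
        s = env.reset() if done else s_next
    return w
```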

  13. Experiments
      • Problems
        – MountainCar
        – Bicycle
        – Continuous Combination Lock
        – ExpressWorld (a variant of PuddleWorld)
          • four actions, stochastic transitions
          • reward: -1 per step; -0.5 per step in the "expresslane"; a penalty for stepping into puddles
          • random start states

  14. Various Exploration Rules with LSPI
      [Plot comparing exploration rules used with LSPI; annotation: "converges to better policies".]

  15. A Closer Look
      [Figure: states visited in the first 3 episodes, contrasting efficient exploration with
      inefficient exploration; efficient exploration helps discovery of the goal and the
      expresslane.]

  16. More Experiments

  17. Effect of Rmax Threshold

  18. Conclusions
      • We proposed LSPI-Rmax
        – LSPI + Rmax
        – encourages active exploration
        – with linear function approximation
      • Future directions
        – applying a similar idea to Gaussian-process RL
        – comparison to model-based RL


  20. Where are features from?
      • Hand-crafted features
        – expert knowledge required
        – expensive and error-prone
      • Generic features
        – RBF, CMAC, polynomial, etc.
        – may not always work well
      • Automatic feature selection using
        – Bellman error [Parr et al. 07]
        – spectral graph analysis [Mahadevan & Maggioni 07]
        – TD approximation [Li, Williams & Balakrishnan 09]
        – L1 regularization for LSPI [Kolter & Ng 09]

  21. LSPI-Rmax vs. MBRL
      • Model-based RL (e.g., Rmax)
        – learns an MDP model
        – computes a policy with the approximate model
        – can use function approximation in model learning
          • Rmax with many compact representations [Li 09]
      • LSPI-Rmax is model-free RL
        – avoids the expensive "planning" step
        – has weaker theoretical guarantees
