
SLIDE 1

MVA-RL Course

Sample Complexity of ADP Algorithms

A. LAZARIC (SequeL Team @INRIA-Lille)

ENS Cachan - Master 2 MVA

SequeL – INRIA Lille

SLIDE 2

Sources of Error

◮ Approximation error. If X is large or continuous, value functions V cannot be represented correctly ⇒ use an approximation space F.

◮ Estimation error. If the reward r and dynamics p are unknown, the Bellman operators T and T^π cannot be computed exactly ⇒ estimate the Bellman operators from samples.

SLIDE 3

In This Lecture

◮ Infinite horizon setting with discount γ
◮ Study the impact of estimation error

SLIDE 4

In This Lecture: Warning!!

Problem: are these performance bounds accurate/useful?
Answer: of course not! :)
Reason: upper bounds, non-tight analysis, worst case.

SLIDE 5

In This Lecture: Warning!!

Chernoff-Hoeffding inequality:

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1]\right| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}}\right) \le \delta$$

⇒ worst-case w.r.t. all the distributions bounded in [a, b], loose for other distributions.
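To see how loose this worst-case bound can be, here is a small simulation, not from the original slides; the Bernoulli(0.01) distribution and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 1000, 0.05, 10_000
a, b = 0.0, 1.0

# Hoeffding deviation radius: (b - a) * sqrt(log(2/delta) / (2n))
radius = (b - a) * np.sqrt(np.log(2 / delta) / (2 * n))

# A low-variance distribution bounded in [0, 1]: Bernoulli(0.01)
p = 0.01
deviations = np.abs(rng.binomial(1, p, size=(trials, n)).mean(axis=1) - p)

print(f"Hoeffding radius          : {radius:.4f}")
print(f"empirical (1-delta)-radius: {np.quantile(deviations, 1 - delta):.4f}")
# The empirical quantile is several times smaller than the worst-case
# radius: the bound holds, but is loose for this distribution.
```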

SLIDE 6

In This Lecture: Warning!!

Question: so why should we derive/study these bounds?
Answer:

◮ General guarantees
◮ Rates of convergence (not always available in asymptotic analysis)
◮ Explicit dependency on the design parameters
◮ Explicit dependency on the problem parameters
◮ First guess on how to tune parameters
◮ Better understanding of the algorithms

SLIDE 7

Outline

Sample Complexity of LSTD
    The Algorithm
    LSTD and LSPI Error Bounds
Sample Complexity of Fitted Q-iteration

SLIDE 8

Outline

Sample Complexity of LSTD
    The Algorithm
    LSTD and LSPI Error Bounds
Sample Complexity of Fitted Q-iteration

SLIDE 9

Least-Squares Temporal-Difference Learning (LSTD)

◮ Linear function space F =

  • f : f (·) = d

j=1 αjϕj(·)

  • ◮ V π is the fixed-point of T π

V π = T πV π

◮ V π may not belong to F

V π / ∈ F

◮ Best approximation of V π in F is

ΠV π = arg min

f ∈F ||V π − f ||

(Π is the projection onto F)

F V π

T π

ΠV π

SLIDE 10

Least-Squares Temporal-Difference Learning (LSTD)

◮ LSTD searches for the fixed-point of Π∞T^π instead (Π∞ is the projection onto F w.r.t. the L∞-norm)
◮ Π∞T^π is a contraction in L∞-norm
◮ The L∞-projection is numerically expensive when the number of states is large or infinite
◮ LSTD thus searches for the fixed-point of Π_{2,ρ} T^π, where
$$\Pi_{2,\rho}\, g = \arg\min_{f\in\mathcal F} \|g - f\|_{2,\rho}$$

SLIDE 11

Least-Squares Temporal-Difference Learning (LSTD)

When the fixed-point of Π_ρ T^π exists, we call it the LSTD solution:
$$V_{TD} = \Pi_\rho T^\pi V_{TD}$$

[Figure: V_TD is the fixed-point of the projected Bellman operator Π_ρ T^π]

The fixed-point condition is equivalent to the orthogonality conditions
$$\langle T^\pi V_{TD} - V_{TD},\, \varphi_i\rangle_\rho = 0, \qquad i = 1,\dots,d$$
$$\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\, \varphi_i\rangle_\rho = 0$$
$$\underbrace{\langle r^\pi, \varphi_i\rangle_\rho}_{b_i} - \sum_{j=1}^{d} \underbrace{\langle \varphi_j - \gamma P^\pi \varphi_j,\, \varphi_i\rangle_\rho}_{A_{ij}}\, \alpha_{TD}^{(j)} = 0 \;\longrightarrow\; A\,\alpha_{TD} = b$$

SLIDE 12

LSTD Algorithm

◮ In general, Π_ρ T^π is not a contraction and does not have a fixed-point.
◮ If ρ = ρ^π, the stationary distribution of π, then Π_{ρ^π} T^π has a unique fixed-point.

Proposition (LSTD Performance)
$$\|V^\pi - V_{TD}\|_{\rho^\pi} \le \frac{1}{\sqrt{1-\gamma^2}}\, \inf_{V\in\mathcal F} \|V^\pi - V\|_{\rho^\pi}$$

SLIDE 13

LSTD Algorithm

Empirical LSTD
◮ We observe a trajectory (X_0, R_0, X_1, R_1, ..., X_N) where X_{t+1} ∼ P(·|X_t, π(X_t)) and R_t = r(X_t, π(X_t))
◮ We build estimators of the matrix A and vector b:
$$\hat A_{ij} = \frac{1}{N}\sum_{t=0}^{N-1} \varphi_i(X_t)\big[\varphi_j(X_t) - \gamma\,\varphi_j(X_{t+1})\big], \qquad \hat b_i = \frac{1}{N}\sum_{t=0}^{N-1} \varphi_i(X_t)\, R_t$$
$$\hat A\,\hat\alpha_{TD} = \hat b, \qquad \hat V_{TD}(\cdot) = \varphi(\cdot)^\top \hat\alpha_{TD}$$
When N → ∞, Â → A and b̂ → b, and thus α̂_TD → α_TD and V̂_TD → V_TD.
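To make the estimators concrete, here is a minimal Python sketch of empirical LSTD; the trajectory format and the feature map phi are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def lstd(states, rewards, phi, gamma):
    """Empirical LSTD: solve A_hat @ alpha = b_hat from a single trajectory.

    states  : sequence X_0, ..., X_N
    rewards : sequence R_0, ..., R_{N-1}
    phi     : feature map, phi(x) -> np.ndarray of shape (d,)
    """
    d = phi(states[0]).shape[0]
    A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
    N = len(rewards)
    for t in range(N):
        f_t, f_next = phi(states[t]), phi(states[t + 1])
        A_hat += np.outer(f_t, f_t - gamma * f_next) / N   # A_hat_ij
        b_hat += f_t * rewards[t] / N                      # b_hat_i
    alpha = np.linalg.solve(A_hat, b_hat)                  # A_hat alpha = b_hat
    return lambda x: phi(x) @ alpha                        # V_TD_hat(x)

# Hypothetical usage with polynomial features on a scalar state:
# V_hat = lstd(xs, rs, lambda x: np.array([1.0, x, x**2]), gamma=0.95)
```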
SLIDE 14

Outline

Sample Complexity of LSTD
    The Algorithm
    LSTD and LSPI Error Bounds
Sample Complexity of Fitted Q-iteration

SLIDE 15

LSTD Error Bound

When the Markov chain induced by the policy under evaluation π admits a stationary distribution ρ^π (e.g., the Markov chain is ergodic / β-mixing):

Theorem (LSTD Error Bound)
Let Ṽ be the truncated LSTD solution computed using n samples along a trajectory generated by following the policy π. Then with probability 1 − δ,
$$\|V^\pi - \tilde V\|_{\rho^\pi} \le \frac{c}{\sqrt{1-\gamma^2}}\, \inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi} + O\left(\sqrt{\frac{d\,\log(d/\delta)}{n\,\nu}}\right)$$
◮ n = number of samples, d = dimension of the linear function space F
◮ ν = smallest eigenvalue of the Gram matrix (∫ φ_i φ_j dρ^π)_{i,j} (assumption: the eigenvalues of the Gram matrix are strictly positive, so that the model-based LSTD solution exists)
◮ the β-mixing coefficients are hidden in the O(·) notation

SLIDE 16

LSTD Error Bound

$$\|V^\pi - \tilde V\|_{\rho^\pi} \le \underbrace{\frac{c}{\sqrt{1-\gamma^2}}\, \inf_{f\in\mathcal F}\|V^\pi - f\|_{\rho^\pi}}_{\text{approximation error}} + \underbrace{O\left(\sqrt{\frac{d\,\log(d/\delta)}{n\,\nu}}\right)}_{\text{estimation error}}$$

◮ Approximation error: depends on how well the function space F can approximate the value function V^π
◮ Estimation error: depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix ν, and the mixing properties of the Markov chain (hidden in the O(·))

SLIDE 17

LSPI Error Bound

Theorem (LSPI Error Bound)

Let V_{−1} ∈ F̃ be an arbitrary initial value function, Ṽ_0, ..., Ṽ_{K−1} be the sequence of truncated value functions generated by LSPI after K iterations, and π_K be the greedy policy w.r.t. Ṽ_{K−1}. Then with probability 1 − δ, we have
$$\|V^* - V^{\pi_K}\|_\mu \le \frac{4\gamma}{(1-\gamma)^2}\left\{\sqrt{C\,C_{\mu,\rho}}\,\Big[c\,E_0(\mathcal F) + O\Big(\sqrt{\tfrac{d\,\log(dK/\delta)}{n\,\nu_\rho}}\Big)\Big] + \gamma^{\frac{K-1}{2}}\, R_{\max}\right\}$$

SLIDE 18

LSPI Error Bound

Theorem (LSPI Error Bound)
(Statement and bound as on the previous slide.)

◮ Approximation error: E_0(F) = sup_{π∈G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}

SLIDE 19

LSPI Error Bound

Theorem (LSPI Error Bound)
(Statement and bound as above.)

◮ Approximation error: E_0(F) = sup_{π∈G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}
◮ Estimation error: depends on n, d, ν_ρ, K

SLIDE 20

LSPI Error Bound

Theorem (LSPI Error Bound)
(Statement and bound as above.)

◮ Approximation error: E_0(F) = sup_{π∈G(F̃)} inf_{f∈F} ||V^π − f||_{ρ^π}
◮ Estimation error: depends on n, d, ν_ρ, K
◮ Initialization error: error due to the choice of the initial value function or initial policy, |V^* − V^{π_0}|

SLIDE 21

LSPI Error Bound

(LSPI bound as above.)

Lower-Bounding Distribution
There exists a distribution ρ such that for any policy π ∈ G(F̃), we have ρ ≤ C ρ^π, where C < ∞ is a constant and ρ^π is the stationary distribution of π. Furthermore, we can define the concentrability coefficient C_{µ,ρ} as before.

SLIDE 22

LSPI Error Bound

(LSPI bound and Lower-Bounding Distribution assumption as above.)

◮ ν_ρ = the smallest eigenvalue of the Gram matrix (∫ φ_i φ_j dρ)_{i,j}
SLIDE 23

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 24

Linear Fitted Q-iteration

Input: space F, number of iterations K, sampling distribution ρ, number of samples n
Initial function Q̂_0 ∈ F
For k = 1, ..., K
◮ Draw n samples (x_i, a_i) i.i.d. from ρ
◮ Sample x'_i ∼ p(·|x_i, a_i) and r_i = r(x_i, a_i)
◮ Compute y_i = r_i + γ max_a Q̂_{k−1}(x'_i, a)
◮ Build the training set {((x_i, a_i), y_i)}_{i=1}^{n}
◮ Solve the least-squares problem
$$\hat\alpha_k = \arg\min_{f_\alpha\in\mathcal F}\; \frac{1}{n}\sum_{i=1}^{n}\big(f_\alpha(x_i,a_i) - y_i\big)^2$$
◮ Return Q̂_k = Trunc(f_{α̂_k})
Return π_K(·) = arg max_a Q̂_K(·, a) (greedy policy)
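A minimal Python sketch of this loop, under illustrative assumptions: a finite action set, helper samplers for ρ, p and r, and truncation at V_max. All helper names are hypothetical, not from the original slides.

```python
import numpy as np

def linear_fqi(phi, actions, sample_next, reward, rho_sampler,
               gamma, vmax, K, n):
    """Linear Fitted Q-iteration (sketch).

    phi         : feature map, phi(x, a) -> np.ndarray of shape (d,)
    actions     : finite list of actions
    sample_next : (x, a) -> x' drawn from p(.|x, a)
    reward      : (x, a) -> r(x, a)
    rho_sampler : n -> list of n state-action pairs drawn from rho
    """
    d = phi(*rho_sampler(1)[0]).shape[0]
    alpha = np.zeros(d)                                   # Q_hat_0 = 0 (in F)
    # Truncated Q_hat_k; reads the current alpha at call time
    q = lambda x, a: float(np.clip(phi(x, a) @ alpha, -vmax, vmax))
    for _ in range(K):
        xa = rho_sampler(n)
        Phi = np.array([phi(x, a) for x, a in xa])
        # Targets y_i = r_i + gamma * max_a Q_hat_{k-1}(x'_i, a)
        y = np.array([reward(x, a)
                      + gamma * max(q(sample_next(x, a), b) for b in actions)
                      for x, a in xa])
        # Least-squares fit of f_alpha to the targets
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # Greedy policy w.r.t. Q_hat_K
    return lambda x: max(actions, key=lambda a: q(x, a))
```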

SLIDE 25

Theoretical Objectives

Objective 1: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution µ:
$$\|Q^* - Q^{\pi_K}\|_\mu \le\ ???$$

SLIDE 26

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 27

Linear Fitted Q-iteration

(Algorithm repeated from Slide 24.)

SLIDE 28

Linear Fitted Q-iteration

(Sampling, target-building, least-squares, and truncation steps repeated from Slide 24.)

SLIDE 29

Theoretical Objectives

Target: at each iteration we want to approximate Q_k = T Q̂_{k−1}
Objective 2: derive an intermediate bound on the prediction error [random design]:
$$\|Q_k - \hat Q_k\|_\rho \le\ ???$$

SLIDE 30

Theoretical Objectives

Target: at each iteration we have samples {(x_i, a_i)}_{i=1}^{n} (drawn from ρ)
Objective 3: derive an intermediate bound on the prediction error on the samples [deterministic design]:
$$\frac{1}{n}\sum_{i=1}^{n}\big(Q_k(x_i,a_i) - \hat Q_k(x_i,a_i)\big)^2 = \|Q_k - \hat Q_k\|_{\hat\rho}^2 \le\ ???$$

SLIDE 31

Theoretical Objectives

$$\text{Obj 3: } \|Q_k - \hat Q_k\|_{\hat\rho} \le ??? \;\Rightarrow\; \text{Obj 2: } \|Q_k - \hat Q_k\|_\rho \le ??? \;\Rightarrow\; \text{Obj 1: } \|Q^* - Q^{\pi_K}\|_\mu \le ???$$

SLIDE 32

Theoretical Objectives

Returned solution:
$$f_{\hat\alpha_k} = \arg\min_{f_\alpha\in\mathcal F}\; \frac{1}{n}\sum_{i=1}^{n}\big(f_\alpha(x_i,a_i) - y_i\big)^2$$
Best solution:
$$f_{\alpha_k^*} = \arg\inf_{f_\alpha\in\mathcal F} \|f_\alpha - Q_k\|_\rho$$

SLIDE 33

Additional Notation

Given the set of inputs {(x_i, a_i)}_{i=1}^{n} drawn from ρ:
◮ Vector space F_n = {z ∈ R^n : z_i = f_α(x_i, a_i), f_α ∈ F} ⊂ R^n
◮ Empirical L_2-norm:
$$\|f_\alpha\|_{\hat\rho}^2 = \frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i,a_i)^2 = \frac{1}{n}\sum_{i=1}^{n} z_i^2 = \|z\|_n^2$$
◮ Empirical orthogonal projection:
$$\hat\Pi y = \arg\min_{z\in F_n} \|y - z\|_n$$
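In the linear case the empirical projection is just a least-squares fit on the feature matrix; a small sketch (variable names are illustrative):

```python
import numpy as np

def empirical_projection(Phi, y):
    """Project y onto F_n, the span of the columns of Phi, w.r.t. ||.||_n.

    Phi : (n, d) matrix with rows phi(x_i, a_i)
    y   : (n,) target vector
    """
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return Phi @ alpha  # the vector in F_n closest to y
```

The vector returned by the least-squares step of LinearFQI is exactly this projection applied to the observed targets y.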

SLIDE 34

Additional Notation

◮ Target vector:
$$q_i = Q_k(x_i,a_i) = \mathcal T\hat Q_{k-1}(x_i,a_i) = r(x_i,a_i) + \gamma \int_{\mathcal X} \max_{a'} \hat Q_{k-1}(x',a')\, p(dx'|x_i,a_i)$$
◮ Observed target vector:
$$y_i = r_i + \gamma \max_{a'} \hat Q_{k-1}(x'_i, a')$$
◮ Noise vector (zero-mean and bounded):
$$\xi_i = q_i - y_i, \qquad |\xi_i| \le V_{\max}, \qquad \mathbb E[\xi_i\,|\,x_i] = 0$$

SLIDE 35

Additional Notation

[Figure: the observed vector y, the target q, and the noise ξ = q − y relative to the subspace F_n]

SLIDE 36

Additional Notation

◮ Optimal solution in F_n:
$$\hat\Pi q = \arg\min_{z\in F_n} \|q - z\|_n$$
◮ Returned vector: q̂_i = f_{α̂_k}(x_i, a_i), i.e.
$$\hat q = \hat\Pi y = \arg\min_{z\in F_n} \|y - z\|_n$$

SLIDE 37

Additional Notation

[Figure: q, y, and their projections Π̂q and q̂ = Π̂y onto the subspace F_n]
SLIDE 38

Theoretical Analysis

$$\|Q_k - f_{\hat\alpha_k}\|_{\hat\rho}^2 = \|q - \hat q\|_n^2$$

[Figure: decomposition of q − q̂ via the projection Π̂q and the projected noise ξ̂]

$$\|q - \hat q\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat q\|_n = \|q - \hat\Pi q\|_n + \|\hat\xi\|_n$$

SLIDE 39

Theoretical Analysis

$$\underbrace{\|q - \hat q\|_n}_{\text{prediction err.}} \le \underbrace{\|q - \hat\Pi q\|_n}_{\text{approx. err.}} + \underbrace{\|\hat\xi\|_n}_{\text{estim. err.}}$$

◮ Prediction error: distance between the learned function and the target function
◮ Approximation error: distance between the best function in F and the target function ⇒ depends on F
◮ Estimation error: distance between the best function in F and the learned function ⇒ depends on the samples

SLIDE 40

Theoretical Analysis

The projected noise ξ̂ = Π̂ξ satisfies
$$\|\hat\xi\|_n^2 = \langle\hat\xi, \hat\xi\rangle = \langle\hat\xi, \xi\rangle$$
The projected noise belongs to F_n ⇒ ∃ f_β ∈ F such that f_β(x_i, a_i) = ξ̂_i for all (x_i, a_i). By definition of the inner product,
$$\|\hat\xi\|_n^2 = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i, a_i)\,\xi_i$$

SLIDE 41

Theoretical Analysis

The noise ξ has zero mean and is bounded in [−V_max, V_max]. Thus, for any fixed f_β ∈ F (the expectation is conditioned on the (x_i, a_i)):
$$\mathbb E_\xi\Big[\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)\,\xi_i\Big] = \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)\,\mathbb E_\xi[\xi_i] = 0$$
and, by the Cauchy-Schwarz inequality,
$$\Big(\frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)\,\xi_i\Big)^2 \le 4V_{\max}^2\; \frac{1}{n}\sum_{i=1}^{n} f_\beta(x_i,a_i)^2 = 4V_{\max}^2\, \|f_\beta\|_{\hat\rho}^2$$
⇒ we can use concentration inequalities

SLIDE 42

Theoretical Analysis

Problem: f_β is a random function.
Solution: we need functional concentration inequalities.

SLIDE 43

Theoretical Analysis

Define the space of normalized functions
$$\mathcal G = \Big\{ g(\cdot) = \frac{f_\alpha(\cdot)}{\|f_\alpha\|_{\hat\rho}} : f_\alpha \in \mathcal F \Big\}$$
[by definition] ⇒ ∀g ∈ G, ||g||_ρ̂ ≤ 1
[F is a linear space] ⇒ V(G) = d + 1

SLIDE 44

Theoretical Analysis

Applying Pollard's inequality to the space G: for any g ∈ G,
$$\Big|\frac{1}{n}\sum_{i=1}^{n} g(x_i,a_i)\,\xi_i\Big| \le 4V_{\max}\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$
with probability 1 − δ (w.r.t. the realization of the noise ξ).
SLIDE 45

Theoretical Analysis

By definition of g:
$$\Big|\frac{1}{n}\sum_{i=1}^{n} f_\alpha(x_i,a_i)\,\xi_i\Big| \le 4V_{\max}\,\|f_\alpha\|_{\hat\rho}\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$
For the specific f_β equivalent to ξ̂:
$$\langle\hat\xi, \xi\rangle \le 4V_{\max}\,\|\hat\xi\|_n\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$
Recalling the objective:
$$\|\hat\xi\|_n^2 \le 4V_{\max}\,\|\hat\xi\|_n\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}} \;\Rightarrow\; \|\hat\Pi q - \hat q\|_n \le 4V_{\max}\sqrt{\frac{2}{n}\log\frac{3(9ne^2)^{d+1}}{\delta}}$$

SLIDE 46

Theoretical Analysis

[Figure: decomposition of q − q̂ into the approximation term q − Π̂q and the projected noise ξ̂]

Theorem (see e.g. Lazaric et al., '11)
At each iteration k and given a set of state-action pairs {(x_i, a_i)}, LinearFQI returns an approximation q̂ such that
$$\|q - \hat q\|_n \le \|q - \hat\Pi q\|_n + \|\hat\Pi q - \hat q\|_n \le \|q - \hat\Pi q\|_n + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$

SLIDE 47

Theoretical Analysis

Moving back from vectors to functions:
$$\|q - \hat q\|_n = \|Q_k - f_{\hat\alpha_k}\|_{\hat\rho}, \qquad \|q - \hat\Pi q\|_n \le \|Q_k - f_{\alpha_k^*}\|_{\hat\rho}$$
$$\Rightarrow\quad \|Q_k - f_{\hat\alpha_k}\|_{\hat\rho} \le \|Q_k - f_{\alpha_k^*}\|_{\hat\rho} + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$

SLIDE 48

Theoretical Analysis

By definition of truncation (Q̂_k = Trunc(f_{α̂_k})):

Theorem
At each iteration k and given a set of state-action pairs {(x_i, a_i)}, LinearFQI returns an approximation Q̂_k such that (Objective 3)
$$\|Q_k - \hat Q_k\|_{\hat\rho} \le \|Q_k - f_{\hat\alpha_k}\|_{\hat\rho} \le \|Q_k - f_{\alpha_k^*}\|_{\hat\rho} + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$

SLIDE 49

Theoretical Analysis

Remark: in order to move from Objective 3 to Objective 2 we need to move from empirical to expected L_2-norms.
Since Q̂_k is truncated, it is bounded in [−V_max, V_max]:
$$2\|Q_k - \hat Q_k\|_{\hat\rho} \ge \|Q_k - \hat Q_k\|_\rho - O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$
The best solution f_{α*_k} is a fixed function in F:
$$\|Q_k - f_{\alpha_k^*}\|_{\hat\rho} \le 2\|Q_k - f_{\alpha_k^*}\|_\rho + O\left(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right)$$

SLIDE 50

Theoretical Analysis

Theorem
At each iteration k, LinearFQI returns an approximation Q̂_k such that (Objective 2)
$$\|Q_k - \hat Q_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\left(\big(V_{\max} + L\|\alpha_k^*\|\big)\sqrt{\frac{\log 1/\delta}{n}}\right) + O\left(V_{\max}\sqrt{\frac{d\,\log n/\delta}{n}}\right)$$
with probability 1 − δ.

SLIDE 51

Theoretical Analysis

(The Objective 2 bound is restated on the next slides, highlighting one term at a time.)

SLIDE 52

Theoretical Analysis

(Objective 2 bound as above; first term: the approximation error 4||Q_k − f_{α*_k}||_ρ.)

Remarks
◮ No algorithm can do better
◮ Constant 4
◮ Depends on the space F
◮ Changes with the iteration k

SLIDE 53

Theoretical Analysis

(Objective 2 bound as above; second term.)

Remarks
◮ Vanishing to zero as O(n^{−1/2})
◮ Depends on the features (L) and on the best solution (||α*_k||)

SLIDE 54

Theoretical Analysis

(Objective 2 bound as above; third term.)

Remarks
◮ Vanishing to zero as O(n^{−1/2})
◮ Depends on the dimensionality of the space (d) and the number of samples (n)

SLIDE 55

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 56

Theoretical Analysis

Objective 1: bound ||Q^* − Q^{π_K}||_µ
◮ Problem 1: the test norm µ is different from the sampling norm ρ
◮ Problem 2: we have bounds for Q̂_k, not for the performance of the corresponding π_k
◮ Problem 3: we have bounds for one single iteration
SLIDE 57

Propagation of Errors

◮ Bellman operators
$$\mathcal T Q(x,a) = r(x,a) + \gamma \int_{\mathcal X} \max_{a'} Q(x', a')\, p(dx'|x,a)$$
$$\mathcal T^\pi Q(x,a) = r(x,a) + \gamma \int_{\mathcal X} Q(x', \pi(x'))\, p(dx'|x,a)$$
◮ Optimal action-value function: Q^* = T Q^*
◮ Greedy policies: π(x) = arg max_a Q(x,a) and π^*(x) = arg max_a Q^*(x,a)
◮ Prediction error: ε_k = Q_k − Q̂_k
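For a finite MDP these operators are one-liners; a minimal sketch where the array shapes and names are illustrative assumptions:

```python
import numpy as np

def bellman_optimal(Q, P, R, gamma):
    """Optimal Bellman operator T for a finite MDP.

    Q : (S, A) action-value array
    P : (S, A, S) transition kernel, P[x, a, x2] = p(x2|x, a)
    R : (S, A) reward array
    """
    return R + gamma * P @ Q.max(axis=1)

def bellman_policy(Q, P, R, gamma, pi):
    """Policy Bellman operator T^pi; pi is an (S,) array of actions."""
    S = Q.shape[0]
    return R + gamma * P @ Q[np.arange(S), pi]
```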

SLIDE 58

Propagation of Errors

Step 1: upper bound on the propagation (problem 3)
By definition T Q̂_k ≥ T^{π*} Q̂_k, and Q̂_{k+1} = T Q̂_k − ε_k, so
$$Q^* - \hat Q_{k+1} = \underbrace{\mathcal T^{\pi^*} Q^*}_{\text{fixed point}} - \mathcal T^{\pi^*}\hat Q_k + \underbrace{\mathcal T^{\pi^*}\hat Q_k - \mathcal T\hat Q_k}_{\le 0} + \epsilon_k \le \gamma P^{\pi^*}(Q^* - \hat Q_k) + \epsilon_k$$
Unrolling the recursion,
$$Q^* - \hat Q_K \le \sum_{k=0}^{K-1}\gamma^{K-k-1}\,(P^{\pi^*})^{K-k-1}\,\epsilon_k + \gamma^K (P^{\pi^*})^K (Q^* - \hat Q_0)$$

SLIDE 59

Propagation of Errors

Step 2: lower bound on the propagation (problem 3)
By definition T Q^* ≥ T^{π_k} Q^*, and T Q̂_k = T^{π_k} Q̂_k (π_k is greedy w.r.t. Q̂_k), so
$$Q^* - \hat Q_{k+1} = \underbrace{\mathcal T Q^* - \mathcal T^{\pi_k} Q^*}_{\ge 0} + \underbrace{\mathcal T^{\pi_k} Q^* - \mathcal T\hat Q_k}_{\text{greedy policy}} + \epsilon_k \ge \mathcal T^{\pi_k}Q^* - \mathcal T^{\pi_k}\hat Q_k + \epsilon_k$$
$$Q^* - \hat Q_{k+1} \ge \gamma P^{\pi_k}(Q^* - \hat Q_k) + \epsilon_k$$
$$Q^* - \hat Q_K \ge \sum_{k=0}^{K-1}\gamma^{K-k-1}\,\big(P^{\pi_{K-1}} P^{\pi_{K-2}} \cdots P^{\pi_{k+1}}\big)\,\epsilon_k$$
SLIDE 60

Propagation of Errors

Step 3: from Q̂_K to π_K (problem 2)
By definition T^{π_K} Q̂_K = T Q̂_K ≥ T^{π*} Q̂_K, so
$$Q^* - Q^{\pi_K} = \underbrace{\mathcal T^{\pi^*}Q^*}_{\text{fixed point}} - \mathcal T^{\pi^*}\hat Q_K + \underbrace{\mathcal T^{\pi^*}\hat Q_K - \mathcal T^{\pi_K}\hat Q_K}_{\le 0} + \mathcal T^{\pi_K}\hat Q_K - \underbrace{\mathcal T^{\pi_K}Q^{\pi_K}}_{\text{fixed point}}$$
$$Q^* - Q^{\pi_K} \le \gamma P^{\pi^*}(Q^* - \hat Q_K) + \gamma P^{\pi_K}\big(\hat Q_K - Q^* + Q^* - Q^{\pi_K}\big)$$
$$(I - \gamma P^{\pi_K})(Q^* - Q^{\pi_K}) \le \gamma\big(P^{\pi^*} - P^{\pi_K}\big)(Q^* - \hat Q_K)$$

SLIDE 61

Propagation of Errors

Step 3 (continued): plugging in the error propagation (problem 2)
$$Q^* - Q^{\pi_K} \le (I - \gamma P^{\pi_K})^{-1}\Big\{\sum_{k=0}^{K-1}\gamma^{K-k}\Big[(P^{\pi^*})^{K-k} - P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]\epsilon_k + \gamma^{K+1}\Big[(P^{\pi^*})^{K+1} - P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_0}\Big](Q^* - \hat Q_0)\Big\}$$

SLIDE 62

Propagation of Errors

Step 4: rewrite in compact form
$$Q^* - Q^{\pi_K} \le \frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big[\sum_{k=0}^{K-1}\alpha_k A_k|\epsilon_k| + \alpha_K A_K |Q^* - \hat Q_0|\Big]$$
◮ α_k: weights (Σ_k α_k = 1)
◮ A_k: summarize the P^{π_i} terms

SLIDE 63

Propagation of Errors

Step 5: take the norm w.r.t. the test distribution µ
$$\|Q^* - Q^{\pi_K}\|_\mu^2 = \int \mu(dx,da)\,\big(Q^*(x,a) - Q^{\pi_K}(x,a)\big)^2$$
$$\le \Big[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big]^2 \int \mu(dx,da)\,\Big[\sum_{k=0}^{K-1}\alpha_k A_k|\epsilon_k| + \alpha_K A_K|Q^* - \hat Q_0|\Big]^2(x,a)$$
$$\le \Big[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big]^2 \int \mu(dx,da)\,\Big[\sum_{k=0}^{K-1}\alpha_k A_k\,\epsilon_k^2 + \alpha_K A_K\,(Q^* - \hat Q_0)^2\Big](x,a)$$
(the last step uses Jensen's inequality, since the weights α_k sum to 1)
SLIDE 64

Propagation of Errors

Focusing on one single term:
$$\mu A_k = \frac{1-\gamma}{2}\,\mu\,(I-\gamma P^{\pi_K})^{-1}\Big[(P^{\pi^*})^{K-k} + P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$
$$= \frac{1-\gamma}{2}\Big[\sum_{m\ge 0}\gamma^m\,\mu\,(P^{\pi_K})^m (P^{\pi^*})^{K-k} + \sum_{m\ge 0}\gamma^m\,\mu\,(P^{\pi_K})^m P^{\pi_K}P^{\pi_{K-1}}\cdots P^{\pi_{k+1}}\Big]$$

SLIDE 65

Propagation of Errors

Assumption: concentrability terms
$$c(m) = \sup_{\pi_1\dots\pi_m}\Big\|\frac{d(\mu P^{\pi_1}\cdots P^{\pi_m})}{d\rho}\Big\|_\infty, \qquad C_{\mu,\rho} = (1-\gamma)^2\sum_{m\ge 1} m\,\gamma^{m-1}\,c(m) < +\infty$$
Remark: related to the top-Lyapunov exponent ⇒ C_{µ,ρ} < ∞ is a weak stability condition.

SLIDE 66

Propagation of Errors

Step 5 (continued): take the norm w.r.t. the test distribution µ
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma(1-\gamma^{K+1})}{(1-\gamma)^2}\Big]^2\Big[\sum_{k=0}^{K-1}\alpha_k(1-\gamma)\sum_{m\ge 0}\gamma^m\,c(m+K-k)\,\|\epsilon_k\|_\rho^2 + \alpha_K (2V_{\max})^2\Big]$$

SLIDE 67

Propagation of Errors

Step 5 (conclusion): take the norm w.r.t. the test distribution µ (problem 1)
$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2\, C_{\mu,\rho}\,\max_k\|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\Big)$$

SLIDE 68

Outline

Sample Complexity of LSTD
Sample Complexity of Fitted Q-iteration
    Error at Each Iteration
    Error Propagation
    The Final Bound

SLIDE 69

Plugging the Per–Iteration Error

$$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[\frac{2\gamma}{(1-\gamma)^2}\Big]^2\, C_{\mu,\rho}\,\max_k\|\epsilon_k\|_\rho^2 + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\,V_{\max}^2\Big)$$
where
$$\|\epsilon_k\|_\rho = \|Q_k - \hat Q_k\|_\rho \le 4\|Q_k - f_{\alpha_k^*}\|_\rho + O\Big(\big(V_{\max}+L\|\alpha_k^*\|\big)\sqrt{\tfrac{\log 1/\delta}{n}}\Big) + O\Big(V_{\max}\sqrt{\tfrac{d\,\log n/\delta}{n}}\Big)$$

SLIDE 70

Plugging the Per–Iteration Error

The inherent Bellman error:
$$\|Q_k - f_{\alpha_k^*}\|_\rho = \inf_{f\in\mathcal F}\|Q_k - f\|_\rho = \inf_{f\in\mathcal F}\|\mathcal T\hat Q_{k-1} - f\|_\rho \le \inf_{f\in\mathcal F}\|\mathcal T f_{\alpha_{k-1}} - f\|_\rho \le \sup_{g\in\mathcal F}\,\inf_{f\in\mathcal F}\|\mathcal T g - f\|_\rho = d(\mathcal F, \mathcal T\mathcal F)$$

SLIDE 71

Plugging the Per–Iteration Error

f_{α*_k} is the orthogonal projection of Q_k onto F w.r.t. ρ:
$$\|f_{\alpha_k^*}\|_\rho \le \|Q_k\|_\rho = \|\mathcal T\hat Q_{k-1}\|_\rho \le \|\mathcal T\hat Q_{k-1}\|_\infty \le V_{\max}$$

SLIDE 72

Plugging the Per–Iteration Error

Gram matrix G_{ij} = E_{(x,a)∼ρ}[φ_i(x,a) φ_j(x,a)], with smallest eigenvalue ω:
$$\|f_\alpha\|_\rho^2 = \|\varphi^\top\alpha\|_\rho^2 = \alpha^\top G\,\alpha \ge \omega\,\alpha^\top\alpha = \omega\,\|\alpha\|^2$$
$$\max_k\|\alpha_k^*\| \le \max_k \frac{\|f_{\alpha_k^*}\|_\rho}{\sqrt\omega} \le \frac{V_{\max}}{\sqrt\omega}$$
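A quick numerical sanity check of the inequality α⊤Gα ≥ ω||α||²; the features and the sampling distribution are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 100_000
x = rng.normal(size=m)                       # states drawn from rho = N(0, 1)
Phi = np.stack([np.ones(m), np.sin(x), np.cos(x)], axis=1)   # d = 3 features
G = Phi.T @ Phi / m                          # Monte Carlo Gram matrix
omega = np.linalg.eigvalsh(G).min()          # smallest eigenvalue

alpha = rng.normal(size=3)
assert alpha @ G @ alpha >= omega * (alpha @ alpha)
print(f"omega = {omega:.3f}")
```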

SLIDE 73

The Final Bound

Theorem (see e.g., Munos, '03)
LinearFQI with a space F of d features and n samples at each iteration returns a policy π_K after K iterations such that
$$\|Q^* - Q^{\pi_K}\|_\mu \le \frac{2\gamma}{(1-\gamma)^2}\sqrt{C_{\mu,\rho}}\Big[4\, d(\mathcal F, \mathcal T\mathcal F) + O\Big(V_{\max}\big(1 + \tfrac{L}{\sqrt{\omega}}\big)\sqrt{\tfrac{d\,\log n/\delta}{n}}\Big)\Big] + O\Big(\frac{\gamma^K}{(1-\gamma)^3}\, V_{\max}^2\Big)$$

SLIDE 74

The Final Bound

Theorem (final bound, as on the previous slide)

The propagation (and the change of norms) makes the problem more complex
⇒ how do we choose the sampling distribution?

SLIDE 75

The Final Bound

Theorem (final bound, as above)

The approximation error is worse than in regression
⇒ how do we adapt to the Bellman operator?

SLIDE 76

The Final Bound

Theorem (final bound, as above)

The dependency on γ is worse than at each iteration
⇒ is it possible to avoid it?

SLIDE 77

The Final Bound

Theorem (final bound, as above)

The error decreases exponentially in K
⇒ K ≈ log(1/ǫ)/(1 − γ) iterations suffice for a target accuracy ǫ
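A one-line derivation of this choice of K, assuming the goal is to drive the last term of the bound below a target accuracy ǫ:

$$\gamma^K \le \epsilon \iff K \ge \frac{\log(1/\epsilon)}{\log(1/\gamma)} \approx \frac{\log(1/\epsilon)}{1-\gamma}, \qquad \text{since } \log(1/\gamma) \approx 1-\gamma \text{ for } \gamma \text{ close to } 1.$$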

SLIDE 78

The Final Bound

Theorem (final bound, as above)

The smallest eigenvalue ω of the Gram matrix enters the bound
⇒ design the features so as to be orthogonal w.r.t. ρ

SLIDE 79

The Final Bound

Theorem (final bound, as above)

The asymptotic rate O(d/n) is the same as for regression
SLIDE 80

Summary

◮ At each iteration FQI solves a regression problem ⇒ least-squares prediction error bound
◮ The error is propagated through iterations ⇒ propagation of any error

SLIDE 81

Bibliography I

SLIDE 82

Reinforcement Learning

Alessandro Lazaric
alessandro.lazaric@inria.fr
sequel.lille.inria.fr