  1. Sample Complexity of ADP Algorithms
     A. LAZARIC (SequeL Team @ INRIA-Lille)
     ENS Cachan – Master 2 MVA
     SequeL – INRIA Lille, MVA-RL Course

  2. Sources of Error
     ◮ Approximation error. If X is large or continuous, value functions V cannot be represented correctly ⇒ use an approximation space F.
     ◮ Estimation error. If the reward r and the dynamics p are unknown, the Bellman operators T and T^π cannot be computed exactly ⇒ estimate the Bellman operators from samples.

  3. In This Lecture
     ◮ Infinite-horizon setting with discount γ
     ◮ Study the impact of the estimation error

  4. In This Lecture: Warning!!
     Problem: are these performance bounds accurate/useful?
     Answer: of course not! :)
     Reason: upper bounds, non-tight analysis, worst case.

  5. In This Lecture: Warning!!
     Chernoff-Hoeffding inequality:
     \[ \mathbb{P}\Big( \Big| \frac{1}{n}\sum_{t=1}^{n} X_t - \mathbb{E}[X_1] \Big| > (b-a)\sqrt{\frac{\log(2/\delta)}{2n}} \Big) \le \delta \]
     ⇒ worst case w.r.t. all the distributions bounded in [a, b], loose for other distributions.
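A quick numerical illustration of the inequality (a minimal sketch; the Bernoulli distribution, the sample sizes, and δ below are arbitrary choices for illustration):

```python
import numpy as np

# Hoeffding width: with probability 1 - delta, |mean_n - E[X]| <= (b - a) * sqrt(log(2/delta) / (2n))
def hoeffding_width(n, delta, a=0.0, b=1.0):
    return (b - a) * np.sqrt(np.log(2.0 / delta) / (2.0 * n))

rng = np.random.default_rng(0)
delta, p = 0.05, 0.3                       # confidence level and (arbitrary) Bernoulli mean
for n in [100, 1_000, 10_000]:
    width = hoeffding_width(n, delta)
    # empirical check: the fraction of runs whose error exceeds the width should be <= delta
    errors = np.abs(rng.binomial(n, p, size=2_000) / n - p)
    print(n, width, (errors > width).mean())
```

The bound is distribution-free, which is exactly why it is loose: for a fixed distribution the observed violation frequency is typically far below δ.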

  6. In This Lecture: Warning!!
     Question: so why should we derive/study these bounds?
     Answer:
     ◮ General guarantees
     ◮ Rates of convergence (not always available in asymptotic analysis)
     ◮ Explicit dependency on the design parameters
     ◮ Explicit dependency on the problem parameters
     ◮ First guess on how to tune parameters
     ◮ Better understanding of the algorithms

  7. Outline
     Sample Complexity of LSTD
       – The Algorithm
       – LSTD and LSPI Error Bounds
     Sample Complexity of Fitted Q-iteration

  8. Outline (section: Sample Complexity of LSTD – The Algorithm)
     Sample Complexity of LSTD
       – The Algorithm
       – LSTD and LSPI Error Bounds
     Sample Complexity of Fitted Q-iteration

  9. Least-Squares Temporal-Difference Learning (LSTD)
     ◮ Linear function space \( \mathcal{F} = \big\{ f : f(\cdot) = \sum_{j=1}^{d} \alpha_j \varphi_j(\cdot) \big\} \)
     ◮ V^π is the fixed point of T^π: \( V^\pi = T^\pi V^\pi \)
     ◮ V^π may not belong to F: \( V^\pi \notin \mathcal{F} \)
     ◮ The best approximation of V^π in F is \( \Pi V^\pi = \arg\min_{f \in \mathcal{F}} \| V^\pi - f \| \) (Π is the projection onto F)
     [Figure: V^π lies outside F; ΠV^π is its projection onto F]
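To make the projection operator concrete, here is a minimal sketch of a ρ-weighted least-squares projection onto the span of a few features over a small finite state space (the features, the weights ρ, and the target vector are all illustrative choices):

```python
import numpy as np

# rho-weighted projection of a target vector v onto span{phi_1, ..., phi_d}:
#   Pi_rho v = Phi alpha*,  with  alpha* = argmin_alpha || v - Phi alpha ||_{2,rho}
def project(v, Phi, rho):
    W = np.diag(rho)                          # diagonal weighting by the distribution rho
    G = Phi.T @ W @ Phi                       # Gram matrix ((phi_i, phi_j)_rho)_{i,j}
    alpha = np.linalg.solve(G, Phi.T @ W @ v)
    return Phi @ alpha

# toy illustration: 5 states, 2 features (constant and linear)
Phi = np.array([[1., 0.], [1., 1.], [1., 2.], [1., 3.], [1., 4.]])
rho = np.full(5, 0.2)
v = np.array([0., 1., 4., 9., 16.])           # a target outside the span (quadratic in the state)
print(project(v, Phi, rho))                   # best linear fit of v under the rho-weighted L2 norm
```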

  10. Least-Squares Temporal-Difference Learning (LSTD)
     ◮ LSTD searches for the fixed point of Π_? T^π instead (Π_? is the projection onto F w.r.t. the L_?-norm)
     ◮ Π_∞ T^π is a contraction in L_∞-norm
     ◮ the L_∞-projection is numerically expensive when the number of states is large or infinite
     ◮ LSTD searches for the fixed point of Π_{2,ρ} T^π, with \( \Pi_{2,\rho}\, g = \arg\min_{f \in \mathcal{F}} \| g - f \|_{2,\rho} \)

  11. Least-Squares Temporal-Difference Learning (LSTD)
     When the fixed point of Π_ρ T^π exists, we call it the LSTD solution:
     \[ V_{TD} = \Pi_\rho T^\pi V_{TD} \]
     [Figure: V_{TD} is the fixed point of the projected Bellman operator Π_ρ T^π on F, in general different from Π_ρ V^π]
     Writing the fixed-point condition componentwise (orthogonality of the Bellman residual to the features) gives
     \[ \langle T^\pi V_{TD} - V_{TD},\, \varphi_i \rangle_\rho = 0, \qquad i = 1, \dots, d \]
     \[ \langle r^\pi + \gamma P^\pi V_{TD} - V_{TD},\, \varphi_i \rangle_\rho = 0 \]
     \[ \sum_{j=1}^{d} \underbrace{\langle \varphi_j - \gamma P^\pi \varphi_j,\, \varphi_i \rangle_\rho}_{A_{ij}}\, \alpha_{TD}(j) \;-\; \underbrace{\langle r^\pi,\, \varphi_i \rangle_\rho}_{b_i} = 0 \;\;\longrightarrow\;\; A\, \alpha_{TD} = b \]
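A minimal model-based sketch of this linear system, assuming a small finite MDP for which P^π, r^π, and the weighting distribution ρ are known (all numerical values are illustrative):

```python
import numpy as np

# Model-based LSTD on a toy finite chain:
#   A_ij = <phi_j - gamma P^pi phi_j, phi_i>_rho,  b_i = <r^pi, phi_i>_rho,  then solve A alpha = b
n_states, gamma = 5, 0.9
P = np.full((n_states, n_states), 1.0 / n_states)    # transition matrix of the evaluated policy (toy)
r = np.linspace(0.0, 1.0, n_states)                  # reward of the evaluated policy (toy)
Phi = np.stack([np.ones(n_states), np.arange(n_states, dtype=float)], axis=1)
rho = np.full(n_states, 1.0 / n_states)              # weighting distribution (uniform here)

W = np.diag(rho)
A = Phi.T @ W @ (Phi - gamma * P @ Phi)
b = Phi.T @ W @ r
alpha_td = np.linalg.solve(A, b)
V_td = Phi @ alpha_td                                # LSTD approximation of V^pi
print(V_td)
```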

  12. LSTD Algorithm
     ◮ In general, Π_ρ T^π is not a contraction and does not have a fixed point.
     ◮ If ρ = ρ^π, the stationary distribution of π, then Π_{ρ^π} T^π has a unique fixed point.
     Proposition (LSTD Performance)
     \[ \| V^\pi - V_{TD} \|_{\rho^\pi} \;\le\; \frac{1}{\sqrt{1 - \gamma^2}}\, \inf_{V \in \mathcal{F}} \| V^\pi - V \|_{\rho^\pi} \]

  13. LSTD Algorithm
     Empirical LSTD
     ◮ We observe a trajectory (X_0, R_0, X_1, R_1, ..., X_N) where X_{t+1} ∼ P(· | X_t, π(X_t)) and R_t = r(X_t, π(X_t))
     ◮ We build estimators of the matrix A and of the vector b:
     \[ \widehat{A}_{ij} = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t)\big[\varphi_j(X_t) - \gamma\, \varphi_j(X_{t+1})\big], \qquad \widehat{b}_i = \frac{1}{N} \sum_{t=0}^{N-1} \varphi_i(X_t)\, R_t \]
     ◮ \( \widehat{\alpha}_{TD} = \widehat{A}^{-1}\, \widehat{b} \) and \( \widehat{V}_{TD}(\cdot) = \varphi(\cdot)^\top \widehat{\alpha}_{TD} \). As N → ∞, \( \widehat{A} \to A \) and \( \widehat{b} \to b \), and thus \( \widehat{\alpha}_{TD} \to \alpha_{TD} \) and \( \widehat{V}_{TD} \to V_{TD} \).
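A corresponding sample-based sketch, using a single trajectory from the same toy chain as in the previous sketch (the trajectory length and sampling scheme are illustrative):

```python
import numpy as np

# Empirical LSTD: estimate A and b from one trajectory, then solve A_hat alpha = b_hat.
def lstd_from_trajectory(states, rewards, Phi, gamma):
    d = Phi.shape[1]
    A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
    N = len(rewards)
    for t in range(N):
        phi_t, phi_next = Phi[states[t]], Phi[states[t + 1]]
        A_hat += np.outer(phi_t, phi_t - gamma * phi_next) / N
        b_hat += phi_t * rewards[t] / N
    return np.linalg.solve(A_hat, b_hat)

rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
Phi = np.stack([np.ones(n_states), np.arange(n_states, dtype=float)], axis=1)
states, rewards = [0], []
for _ in range(5_000):                            # trajectory under the toy policy
    rewards.append(states[-1] / (n_states - 1))   # reward r(s) = s / (n_states - 1)
    states.append(int(rng.integers(n_states)))    # uniform transitions, as in the toy model
alpha_hat = lstd_from_trajectory(states, rewards, Phi, gamma)
print(Phi @ alpha_hat)                            # V_hat_TD, approaches the model-based V_TD as N grows
```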

  14. Outline (section: Sample Complexity of LSTD – LSTD and LSPI Error Bounds)
     Sample Complexity of LSTD
       – The Algorithm
       – LSTD and LSPI Error Bounds
     Sample Complexity of Fitted Q-iteration

  15. LSTD Error Bound
     Assume the Markov chain induced by the policy under evaluation π admits a stationary distribution ρ^π (the chain is ergodic, e.g. β-mixing). Then:
     Theorem (LSTD Error Bound)
     Let \( \widehat{V} \) be the truncated LSTD solution computed using n samples along a trajectory generated by following the policy π. Then, with probability 1 − δ,
     \[ \| V^\pi - \widehat{V} \|_{\rho^\pi} \;\le\; \frac{c}{\sqrt{1 - \gamma^2}}\, \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi} \;+\; O\Big( \sqrt{\frac{d\, \log(d/\delta)}{n\, \nu}} \Big) \]
     ◮ n = number of samples, d = dimension of the linear function space F
     ◮ ν = smallest eigenvalue of the Gram matrix \( \big( \int \varphi_i \varphi_j \, d\rho^\pi \big)_{i,j} \) (assume the eigenvalues of the Gram matrix are strictly positive, which guarantees the existence of the model-based LSTD solution)
     ◮ the β-mixing coefficients are hidden in the O(·) notation

  16. LSTD Error Bound
     \[ \| V^\pi - \widehat{V} \|_{\rho^\pi} \;\le\; \underbrace{\frac{c}{\sqrt{1 - \gamma^2}}\, \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi}}_{\text{approximation error}} \;+\; \underbrace{O\Big( \sqrt{\frac{d\, \log(d/\delta)}{n\, \nu}} \Big)}_{\text{estimation error}} \]
     ◮ Approximation error: depends on how well the function space F can approximate the value function V^π
     ◮ Estimation error: depends on the number of samples n, the dimension of the function space d, the smallest eigenvalue of the Gram matrix ν, and the mixing properties of the Markov chain (hidden in the O(·) notation)
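A rough way to read the estimation term is to plug in candidate values of n, d, δ, and ν; the constant hidden in the O(·) notation and the mixing factors are unknown, so they are set to 1 in this sketch:

```python
import numpy as np

# Back-of-the-envelope evaluation of the LSTD estimation term sqrt(d log(d/delta) / (n nu)).
# The hidden constant and the beta-mixing factors are unknown and taken to be 1 for illustration.
d, delta, nu = 10, 0.05, 0.1
for n in [1_000, 10_000, 100_000]:
    est = np.sqrt(d * np.log(d / delta) / (n * nu))
    print(f"n = {n:>7d}   estimation term ~ {est:.3f}")
```

The term shrinks as 1/√n, so halving it requires roughly four times as many samples.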

  17. LSPI Error Bound
     Theorem (LSPI Error Bound)
     Let \( V_{-1} \in \widetilde{\mathcal{F}} \) be an arbitrary initial value function, \( \widehat{V}_0, \dots, \widehat{V}_{K-1} \) be the sequence of truncated value functions generated by LSPI after K iterations, and π_K be the greedy policy w.r.t. \( \widehat{V}_{K-1} \). Then, with probability 1 − δ,
     \[ \| V^* - V^{\pi_K} \|_\mu \;\le\; \frac{4\gamma}{(1-\gamma)^2} \bigg[ \sqrt{CC_{\mu,\rho}} \Big( c\, E_0(\mathcal{F}) + O\Big( \sqrt{\frac{d\, \log(dK/\delta)}{n\, \nu_\rho}} \Big) \Big) + \gamma^{\frac{K-1}{2}}\, R_{\max} \bigg] \]

  18. LSPI Error Bound
     Theorem (LSPI Error Bound, restated): under the assumptions and notation of slide 17, with probability 1 − δ,
     \[ \| V^* - V^{\pi_K} \|_\mu \;\le\; \frac{4\gamma}{(1-\gamma)^2} \bigg[ \sqrt{CC_{\mu,\rho}} \Big( c\, E_0(\mathcal{F}) + O\Big( \sqrt{\frac{d\, \log(dK/\delta)}{n\, \nu_\rho}} \Big) \Big) + \gamma^{\frac{K-1}{2}}\, R_{\max} \bigg] \]
     ◮ Approximation error: \( E_0(\mathcal{F}) = \sup_{\pi \in \mathcal{G}(\widetilde{\mathcal{F}})} \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi} \)

  19. LSPI Error Bound
     Theorem (LSPI Error Bound, restated): under the assumptions and notation of slide 17, with probability 1 − δ,
     \[ \| V^* - V^{\pi_K} \|_\mu \;\le\; \frac{4\gamma}{(1-\gamma)^2} \bigg[ \sqrt{CC_{\mu,\rho}} \Big( c\, E_0(\mathcal{F}) + O\Big( \sqrt{\frac{d\, \log(dK/\delta)}{n\, \nu_\rho}} \Big) \Big) + \gamma^{\frac{K-1}{2}}\, R_{\max} \bigg] \]
     ◮ Approximation error: \( E_0(\mathcal{F}) = \sup_{\pi \in \mathcal{G}(\widetilde{\mathcal{F}})} \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi} \)
     ◮ Estimation error: depends on n, d, ν_ρ, and K

  20. LSPI Error Bound
     Theorem (LSPI Error Bound, restated): under the assumptions and notation of slide 17, with probability 1 − δ,
     \[ \| V^* - V^{\pi_K} \|_\mu \;\le\; \frac{4\gamma}{(1-\gamma)^2} \bigg[ \sqrt{CC_{\mu,\rho}} \Big( c\, E_0(\mathcal{F}) + O\Big( \sqrt{\frac{d\, \log(dK/\delta)}{n\, \nu_\rho}} \Big) \Big) + \gamma^{\frac{K-1}{2}}\, R_{\max} \bigg] \]
     ◮ Approximation error: \( E_0(\mathcal{F}) = \sup_{\pi \in \mathcal{G}(\widetilde{\mathcal{F}})} \inf_{f \in \mathcal{F}} \| V^\pi - f \|_{\rho^\pi} \)
     ◮ Estimation error: depends on n, d, ν_ρ, and K
     ◮ Initialization error: the error \( |V^* - V^{\pi_0}| \) due to the choice of the initial value function or initial policy
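For intuition about what LSPI iterates, here is a minimal model-based sketch of the loop (LSTD policy evaluation followed by greedy policy improvement) on a toy two-action MDP; in the sample-based setting of the theorem the evaluation step would instead use the empirical LSTD/LSTDQ estimates, and every MDP parameter below is an illustrative choice:

```python
import numpy as np

# Minimal LSPI-style loop on a toy 2-action MDP: evaluate the current policy with
# (model-based) LSTD, then act greedily w.r.t. the resulting value estimate.
n_states, n_actions, gamma, K = 5, 2, 0.9, 10
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, s'] (toy)
r = rng.random((n_actions, n_states))                              # r[a, s] (toy)
Phi = np.stack([np.ones(n_states), np.arange(n_states, dtype=float)], axis=1)
rho = np.full(n_states, 1.0 / n_states)

def lstd(policy):
    # model-based LSTD for the Markov chain induced by `policy`
    P_pi = P[policy, np.arange(n_states)]        # (n_states, n_states)
    r_pi = r[policy, np.arange(n_states)]
    W = np.diag(rho)
    A = Phi.T @ W @ (Phi - gamma * P_pi @ Phi)
    b = Phi.T @ W @ r_pi
    return Phi @ np.linalg.solve(A, b)           # approximate V^pi

policy = np.zeros(n_states, dtype=int)           # arbitrary initial policy
for _ in range(K):
    V = lstd(policy)
    policy = np.argmax(r + gamma * P @ V, axis=0)    # greedy improvement
print(policy, lstd(policy))
```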
