  1. Approximate Dynamic Programming
     A. LAZARIC (SequeL Team @INRIA-Lille)
     ENS Cachan - Master 2 MVA
     SequeL – INRIA Lille, MVA-RL Course

  2. Value Iteration: the Idea
     1. Let $V_0$ be any vector in $\mathbb{R}^N$
     2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = T V_k$
     3. Return the greedy policy
        $$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V_K(y) \Big]$$
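As a complement, here is a minimal numpy sketch of this loop on a small, randomly generated tabular MDP; the transition kernel `P`, the reward `r`, and the sizes are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch of tabular value iteration on a hypothetical random MDP.
import numpy as np

rng = np.random.default_rng(0)
N, n_actions, gamma, K = 10, 3, 0.9, 200             # states, actions, discount, iterations

P = rng.dirichlet(np.ones(N), size=(N, n_actions))   # P[x, a, y] = p(y | x, a)
r = rng.uniform(size=(N, n_actions))                  # r(x, a)

V = np.zeros(N)                                       # V_0: any vector in R^N
for k in range(K):
    Q = r + gamma * P @ V                             # Q[x, a] = r(x, a) + gamma * sum_y p(y | x, a) V(y)
    V = Q.max(axis=1)                                 # V_{k+1} = T V_k

pi_K = Q.argmax(axis=1)                               # greedy policy w.r.t. V_K
```

Because $T$ is a $\gamma$-contraction, the gap $\|V_k - V^*\|_\infty$ shrinks geometrically in this loop, which is exactly the guarantee stated on the next slide.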

  3. Value Iteration: the Guarantees
     - From the fixed point property of $T$: $\lim_{k \to \infty} V_k = V^*$
     - From the contraction property of $T$: $\|V_{k+1} - V^*\|_\infty \le \gamma^{k+1} \|V_0 - V^*\|_\infty \to 0$
     Problem: what if $V_{k+1} \neq T V_k$?

  4. Policy Iteration: the Idea
     1. Let $\pi_0$ be any stationary policy
     2. At each iteration $k = 1, 2, \ldots, K$:
        - Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$.
        - Policy improvement: compute the greedy policy
          $$\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^{\pi_k}(y) \Big]$$
     3. Return the last policy $\pi_K$
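A matching numpy sketch of this scheme on the same kind of hypothetical random MDP; the exact policy evaluation step solves the linear system $(I - \gamma P^{\pi}) V = r^{\pi}$.

```python
# Minimal sketch of tabular policy iteration on a hypothetical random MDP.
import numpy as np

rng = np.random.default_rng(1)
N, n_actions, gamma = 10, 3, 0.9
P = rng.dirichlet(np.ones(N), size=(N, n_actions))   # p(y | x, a)
r = rng.uniform(size=(N, n_actions))                  # r(x, a)

pi = np.zeros(N, dtype=int)                           # pi_0: any stationary policy
while True:
    # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
    P_pi = P[np.arange(N), pi]                        # (N, N) transition matrix under pi
    r_pi = r[np.arange(N), pi]
    V = np.linalg.solve(np.eye(N) - gamma * P_pi, r_pi)
    # Policy improvement: greedy policy w.r.t. V^{pi_k}.
    pi_next = (r + gamma * P @ V).argmax(axis=1)
    if np.array_equal(pi_next, pi):                   # no change: converged (finitely many policies)
        break
    pi = pi_next
```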

  5. Policy Iteration: the Guarantees
     The policy iteration algorithm generates a sequence of policies with non-decreasing performance, $V^{\pi_{k+1}} \ge V^{\pi_k}$, and it converges to $\pi^*$ in a finite number of iterations.
     Problem: what if $V_k \neq V^{\pi_k}$?

  6. Sources of Error
     - Approximation error: if $X$ is large or continuous, value functions $V$ cannot be represented exactly $\Rightarrow$ use an approximation space $\mathcal{F}$.
     - Estimation error: if the reward $r$ and dynamics $p$ are unknown, the Bellman operators $T$ and $T^\pi$ cannot be computed exactly $\Rightarrow$ estimate the Bellman operators from samples.

  7. In This Lecture
     - Infinite horizon setting with discount $\gamma$
     - Study the impact of the approximation error
     - Study the impact of the estimation error in the next lecture

  8. Performance Loss: Outline
     - Performance Loss
     - Approximate Value Iteration
     - Approximate Policy Iteration

  9. From Approximation Error to Performance Loss
     Question: if $V$ is an approximation of the optimal value function $V^*$ with an error
     $$\text{error} = \|V - V^*\|,$$
     how does it translate to the (loss of) performance of the greedy policy
     $$\pi(x) \in \arg\max_{a \in A} \sum_y p(y \mid x, a) \big[ r(x, a, y) + \gamma V(y) \big],$$
     i.e. what is the performance loss $\|V^* - V^\pi\|$?

  10. From Approximation Error to Performance Loss
      Proposition
      Let $V \in \mathbb{R}^N$ be an approximation of $V^*$ and $\pi$ its corresponding greedy policy. Then
      $$\underbrace{\|V^* - V^\pi\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{1 - \gamma} \underbrace{\|V^* - V\|_\infty}_{\text{approx. error}}.$$
      Furthermore, there exists $\epsilon > 0$ such that if $\|V - V^*\|_\infty \le \epsilon$, then $\pi$ is optimal.
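To make the $\frac{2\gamma}{1-\gamma}$ amplification concrete, here is a worked instance with arbitrarily chosen numbers ($\gamma = 0.9$ and an approximation error of $0.1$, not taken from the slides):
$$\|V^* - V^{\pi}\|_\infty \;\le\; \frac{2\gamma}{1-\gamma}\,\|V^* - V\|_\infty \;=\; \frac{2 \cdot 0.9}{1 - 0.9} \cdot 0.1 \;=\; 1.8 .$$
So, close to $\gamma = 1$, even a small approximation error can translate into a large performance loss.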

  11. From Approximation Error to Performance Loss
      Proof (using $V^* = T V^*$, $V^\pi = T^\pi V^\pi$, and $T^\pi V = T V$ since $\pi$ is greedy w.r.t. $V$):
      $$\begin{aligned}
      \|V^* - V^\pi\|_\infty &\le \|T V^* - T^\pi V\|_\infty + \|T^\pi V - T^\pi V^\pi\|_\infty \\
      &\le \|T V^* - T V\|_\infty + \gamma \|V - V^\pi\|_\infty \\
      &\le \gamma \|V^* - V\|_\infty + \gamma \big( \|V - V^*\|_\infty + \|V^* - V^\pi\|_\infty \big) \\
      &\le \frac{2\gamma}{1 - \gamma} \|V^* - V\|_\infty. \qquad \square
      \end{aligned}$$

  12. From Approximation Error to Performance Loss
      Question: how do we compute $V$?
      Problem: unlike in standard approximation scenarios (see supervised learning), we have only limited access to the target function $V^*$.
      Objective: given an approximation space $\mathcal{F}$, compute an approximation $V$ which is as close as possible to the best approximation of $V^*$ in $\mathcal{F}$, i.e.
      $$V \approx \arg\inf_{f \in \mathcal{F}} \|V^* - f\|.$$

  13. Approximate Value Iteration: Outline
      - Performance Loss
      - Approximate Value Iteration
      - Approximate Policy Iteration

  14. Approximate Value Iteration: the Idea
      Let $\mathcal{A}$ be an approximation operator.
      1. Let $V_0$ be any vector in $\mathbb{R}^N$
      2. At each iteration $k = 1, 2, \ldots, K$: compute $V_{k+1} = \mathcal{A} T V_k$
      3. Return the greedy policy
         $$\pi_K(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V_K(y) \Big]$$

  15. Approximate Value Iteration: the Idea
      Let $\mathcal{A} = \Pi_\infty$ be the projection operator in $L_\infty$-norm, which corresponds to
      $$V_{k+1} = \Pi_\infty T V_k = \arg\inf_{V \in \mathcal{F}} \|T V_k - V\|_\infty$$
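One case where $\Pi_\infty$ is actually computable is when $\mathcal{F}$ is the set of functions that are piecewise constant over a fixed state aggregation: on each group, the best constant in sup-norm is the midpoint between the minimum and the maximum. The sketch below uses this choice purely for illustration; the aggregation and the random MDP are assumptions, not part of the lecture.

```python
# Sketch of approximate value iteration V_{k+1} = A T V_k, where A = Pi_inf projects onto
# piecewise-constant functions over a fixed (hypothetical) state aggregation.
import numpy as np

rng = np.random.default_rng(2)
N, n_actions, gamma, K = 12, 3, 0.9, 100
P = rng.dirichlet(np.ones(N), size=(N, n_actions))   # p(y | x, a)
r = rng.uniform(size=(N, n_actions))                  # r(x, a)
groups = np.arange(N) // 3                            # 4 aggregated regions of 3 states each

def project_Linf(V, groups):
    """Best piecewise-constant fit in sup-norm: (min + max) / 2 on each group."""
    W = np.empty_like(V)
    for g in np.unique(groups):
        idx = groups == g
        W[idx] = 0.5 * (V[idx].min() + V[idx].max())
    return W

V = np.zeros(N)
for k in range(K):
    TV = (r + gamma * P @ V).max(axis=1)              # T V_k
    V = project_Linf(TV, groups)                      # V_{k+1} = Pi_inf T V_k
```

This projection is a non-expansion in sup-norm, which is what the next slide's convergence argument relies on.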

  16. Approximate Value Iteration: convergence
      Proposition
      The projection $\Pi_\infty$ is a non-expansion and the joint operator $\Pi_\infty T$ is a contraction. Then there exists a unique fixed point $\tilde{V} = \Pi_\infty T \tilde{V}$, which guarantees the convergence of AVI.

  17. Approximate Value Iteration: performance loss
      Proposition (Bertsekas & Tsitsiklis, 1996)
      Let $V_K$ be the function returned by AVI after $K$ iterations and $\pi_K$ its corresponding greedy policy. Then
      $$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1 - \gamma)^2} \underbrace{\max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty}_{\text{worst approx. error}} + \frac{2\gamma^{K+1}}{1 - \gamma} \underbrace{\|V^* - V_0\|_\infty}_{\text{initial error}}.$$

  18. Approximate Value Iteration: performance loss
      Proof.
      Let $\varepsilon = \max_{0 \le k < K} \|T V_k - \mathcal{A} T V_k\|_\infty$. For any $0 \le k < K$ we have
      $$\|V^* - V_{k+1}\|_\infty \le \|T V^* - T V_k\|_\infty + \|T V_k - V_{k+1}\|_\infty \le \gamma \|V^* - V_k\|_\infty + \varepsilon,$$
      then
      $$\|V^* - V_K\|_\infty \le (1 + \gamma + \cdots + \gamma^{K-1}) \varepsilon + \gamma^K \|V^* - V_0\|_\infty \le \frac{1}{1 - \gamma} \varepsilon + \gamma^K \|V^* - V_0\|_\infty.$$
      Since from Proposition 1 we have $\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{1 - \gamma} \|V^* - V_K\|_\infty$, we obtain
      $$\|V^* - V^{\pi_K}\|_\infty \le \frac{2\gamma}{(1 - \gamma)^2} \varepsilon + \frac{2\gamma^{K+1}}{1 - \gamma} \|V^* - V_0\|_\infty. \qquad \square$$

  19. Fitted Q-iteration with linear approximation
      Assumption: access to a generative model, which takes a state $x$ and an action $a$ and returns a reward $r(x, a)$ and a next state $y \sim p(\cdot \mid x, a)$.
      Idea: work with $Q$-functions and linear spaces.
      - $Q^*$ is the unique fixed point of $T$ defined over $X \times A$ as:
        $$T Q(x, a) = \sum_y p(y \mid x, a) \big[ r(x, a, y) + \gamma \max_b Q(y, b) \big].$$
      - $\mathcal{F}$ is a space defined by $d$ features $\phi_1, \ldots, \phi_d : X \times A \to \mathbb{R}$ as:
        $$\mathcal{F} = \Big\{ Q_\alpha(x, a) = \sum_{j=1}^d \alpha_j \phi_j(x, a), \; \alpha \in \mathbb{R}^d \Big\}.$$
      $\Rightarrow$ At each iteration compute $Q_{k+1} = \Pi_\infty T Q_k$
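A small sketch of how the linear space $\mathcal{F}$ might be represented in code; the particular feature map (the state features replicated in one block per discrete action) is just a common illustrative choice, not prescribed by the slides.

```python
# Sketch of a linear Q-function space F = { Q_alpha(x, a) = sum_j alpha_j * phi_j(x, a) }.
import numpy as np

n_actions, d_state = 3, 4
d = n_actions * d_state                          # total number of features

def phi(x, a):
    """Feature vector phi(x, a) in R^d: the block of action a holds the state features x."""
    feats = np.zeros(d)
    feats[a * d_state:(a + 1) * d_state] = x     # x is assumed to be a length-d_state vector
    return feats

def Q_alpha(alpha, x, a):
    """Linear Q-function Q_alpha(x, a) = phi(x, a)^T alpha."""
    return phi(x, a) @ alpha
```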

  20. Fitted Q-iteration with linear approximation
      $\Rightarrow$ At each iteration compute $Q_{k+1} = \Pi_\infty T Q_k$
      Problems:
      - the $\Pi_\infty$ operator cannot be computed efficiently
      - the Bellman operator $T$ is often unknown

  21. Fitted Q-iteration with linear approximation
      Problem: the $\Pi_\infty$ operator cannot be computed efficiently.
      Let $\mu$ be a distribution over $X$. We use a projection in $L_{2,\mu}$-norm onto the space $\mathcal{F}$:
      $$Q_{k+1} = \arg\min_{Q \in \mathcal{F}} \|Q - T Q_k\|_\mu^2.$$

  22. Fitted Q-iteration with linear approximation
      Problem: the Bellman operator $T$ is often unknown.
      1. Sample $n$ state-action pairs $(X_i, A_i)$ with $X_i \sim \mu$ and $A_i$ random,
      2. Simulate $Y_i \sim p(\cdot \mid X_i, A_i)$ and $R_i = r(X_i, A_i, Y_i)$ with the generative model,
      3. Estimate $T Q_k(X_i, A_i)$ with $Z_i = R_i + \gamma \max_{a \in A} Q_k(Y_i, a)$ (unbiased: $\mathbb{E}[Z_i \mid X_i, A_i] = T Q_k(X_i, A_i)$).
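A sketch of these three sampling steps, assuming a `generative_model(x, a)` that returns `(reward, next_state)` and a sampler `mu(rng)` for the distribution $\mu$; both names are hypothetical placeholders, not part of the slides.

```python
# Sketch of building the regression targets Z_i = R_i + gamma * max_a Q_k(Y_i, a).
import numpy as np

def sample_targets(n, mu, generative_model, Q_k, actions, gamma, rng):
    """Return states X, actions A and unbiased estimates Z of T Q_k(X_i, A_i)."""
    X = [mu(rng) for _ in range(n)]                       # X_i ~ mu
    A = [rng.choice(actions) for _ in range(n)]           # A_i chosen at random
    Z = np.empty(n)
    for i, (x, a) in enumerate(zip(X, A)):
        R_i, Y_i = generative_model(x, a)                 # Y_i ~ p(.|X_i, A_i), R_i = r(X_i, A_i, Y_i)
        Z[i] = R_i + gamma * max(Q_k(Y_i, b) for b in actions)
    return X, A, Z
```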

  23. Fitted Q-iteration with linear approximation
      At each iteration $k$ compute $Q_{k+1}$ as
      $$Q_{k+1} = \arg\min_{Q_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \big( Q_\alpha(X_i, A_i) - Z_i \big)^2$$
      $\Rightarrow$ Since $Q_\alpha$ is a linear function of $\alpha$, this is a simple quadratic minimization problem with a closed-form solution.
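Since $Q_\alpha(x, a) = \phi(x, a)^\top \alpha$, the empirical risk above is an ordinary least-squares problem in $\alpha$. A sketch of the update, reusing the hypothetical `phi` and `sample_targets` from the previous sketches:

```python
# Sketch of one fitted Q-iteration step: least-squares fit of alpha to the targets Z_i.
import numpy as np

def fqi_update(X, A, Z, phi):
    """Return alpha_{k+1} = argmin_alpha (1/n) sum_i (phi(X_i, A_i)^T alpha - Z_i)^2."""
    Phi = np.stack([phi(x, a) for x, a in zip(X, A)])     # n x d design matrix
    alpha, *_ = np.linalg.lstsq(Phi, Z, rcond=None)       # closed-form least-squares solution
    return alpha                                          # Q_{k+1} = Q_alpha
```

Iterating `sample_targets` and `fqi_update`, with $Q_k(y, b) = \phi(y, b)^\top \alpha_k$, would give the full fitted Q-iteration loop.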

  24. Other implementations
      - $K$-nearest neighbour
      - Regularized linear regression with $L_1$ or $L_2$ regularization
      - Neural networks
      - Support vector machines
