Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning)
A. LAZARIC (SequeL Team @ INRIA-Lille)
ENS Cachan - Master 2 MVA, MVA-RL Course


Linear Fitted Q-iteration

Input: space $\mathcal{F}$, number of iterations $K$, sampling distribution $\rho$, number of samples $n$
Initial function $\widehat{Q}_0 \in \mathcal{F}$
For $k = 1, \dots, K$:
1. Draw $n$ samples $(x_i, a_i) \overset{\text{i.i.d.}}{\sim} \rho$
2. Sample $x'_i \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$
3. Compute $y_i = r_i + \gamma \max_a \widehat{Q}_{k-1}(x'_i, a)$
4. Build the training set $\big\{ \big( (x_i, a_i), y_i \big) \big\}_{i=1}^{n}$
5. Solve the least-squares problem
   $\hat{\alpha}_k = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( f_\alpha(x_i, a_i) - y_i \big)^2$
6. Return $\widehat{Q}_k = f_{\hat{\alpha}_k}$ (truncation may be needed)
Return $\pi_K(\cdot) = \arg\max_a \widehat{Q}_K(\cdot, a)$ (greedy policy)
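
For concreteness, a minimal NumPy sketch of this loop; the generative-model interface (sample_sa for draws from $\rho$, step for rewards and next states) and the feature map phi are hypothetical placeholders, not part of the original slides.

```python
import numpy as np

def linear_fqi(phi, actions, sample_sa, step, n, K, gamma, v_max):
    """Linear Fitted Q-iteration (sketch).
    phi(x, a) -> feature vector of size d, sample_sa() -> (x, a) ~ rho,
    step(x, a) -> (r, x') from the generative model: all assumed interfaces."""
    d = phi(*sample_sa()).shape[0]
    alpha = np.zeros(d)                                   # Q_0 = 0 in F

    def q(alpha, x, a):                                   # truncated estimate
        return float(np.clip(phi(x, a) @ alpha, -v_max, v_max))

    for _ in range(K):
        # steps 1-2: draw (x_i, a_i) ~ rho and query the generative model
        data = [sample_sa() for _ in range(n)]
        outcomes = [step(x, a) for (x, a) in data]
        # step 3: bootstrapped targets y_i = r_i + gamma * max_a Q_{k-1}(x'_i, a)
        y = np.array([r + gamma * max(q(alpha, x_next, a) for a in actions)
                      for (r, x_next) in outcomes])
        # steps 4-5: least-squares regression on the training set
        Phi = np.array([phi(x, a) for (x, a) in data])
        alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    # greedy policy w.r.t. the last estimate Q_K
    return lambda x: max(actions, key=lambda a: q(alpha, x, a))
```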

Linear Fitted Q-iteration: Sampling

1. Draw $n$ samples $(x_i, a_i) \overset{\text{i.i.d.}}{\sim} \rho$
2. Sample $x'_i \sim p(\cdot \mid x_i, a_i)$ and $r_i = r(x_i, a_i)$

- In practice the sampling can be done once, before running the algorithm
- The sampling distribution $\rho$ should cover the state-action space in all the relevant regions
- If it is not possible to choose $\rho$, a database of samples can be used (as in the sketch below)
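
A sketch of the collect-once strategy, with the same hypothetical sample_sa/step interfaces as above; a database of logged transitions would replace this loop directly.

```python
def collect_batch(sample_sa, step, n):
    """Collect the batch once, before running FQI (sketch); sample_sa() -> (x, a) ~ rho
    and step(x, a) -> (r, x') are the same assumed interfaces as above."""
    batch = []
    for _ in range(n):
        x, a = sample_sa()
        r, x_next = step(x, a)
        batch.append((x, a, r, x_next))   # one logged transition
    return batch
```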

Linear Fitted Q-iteration: The Training Set

3. Compute $y_i = r_i + \gamma \max_a \widehat{Q}_{k-1}(x'_i, a)$
4. Build the training set $\big\{ \big( (x_i, a_i), y_i \big) \big\}_{i=1}^{n}$

- Each $y_i$ is an unbiased sample of $\mathcal{T}\widehat{Q}_{k-1}(x_i, a_i)$, since
  $\mathbb{E}[y_i \mid x_i, a_i] = \mathbb{E}\big[ r_i + \gamma \max_a \widehat{Q}_{k-1}(x'_i, a) \big] = r(x_i, a_i) + \gamma \int_{\mathcal{X}} \max_a \widehat{Q}_{k-1}(y, a)\, p(dy \mid x_i, a_i) = \mathcal{T}\widehat{Q}_{k-1}(x_i, a_i)$
- The problem "reduces" to standard regression
- The training set must be recomputed at each iteration: the samples can be reused, but the targets change (see the sketch below)
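
A sketch of the target recomputation over a fixed batch; the (x, a, r, x') record format matches the collection sketch above, and q_prev stands for the previous estimate $\widehat{Q}_{k-1}$.

```python
import numpy as np

def targets_from_batch(batch, q_prev, actions, gamma):
    """Recompute the regression targets y_i = r_i + gamma * max_a Q_{k-1}(x'_i, a)
    over a fixed batch of (x, a, r, x') transitions; the batch itself is reused,
    only the targets change from one iteration to the next."""
    return np.array([r + gamma * max(q_prev(x_next, a) for a in actions)
                     for (_, _, r, x_next) in batch])
```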

Linear Fitted Q-iteration: The Regression Problem

5. Solve the least-squares problem
   $\hat{\alpha}_k = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( f_\alpha(x_i, a_i) - y_i \big)^2$
6. Return $\widehat{Q}_k = f_{\hat{\alpha}_k}$ (truncation may be needed)

- Thanks to the linear space we can solve the problem in closed form
- Build the matrix $\Phi = \big[ \phi(x_1, a_1)^\top; \dots; \phi(x_n, a_n)^\top \big]$
- Compute $\hat{\alpha}_k = (\Phi^\top \Phi)^{-1} \Phi^\top y$ (least-squares solution)
- Truncate to $[-V_{\max}, V_{\max}]$ (with $V_{\max} = R_{\max}/(1-\gamma)$)
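
A NumPy sketch of the closed-form step; using the pseudo-inverse instead of $(\Phi^\top\Phi)^{-1}\Phi^\top$ directly is an implementation choice of this sketch (the two coincide when $\Phi^\top\Phi$ is invertible).

```python
import numpy as np

def solve_and_truncate(Phi, y, v_max):
    """Least-squares fit alpha = (Phi^T Phi)^{-1} Phi^T y, followed by
    truncation of the resulting Q-estimate to [-v_max, v_max]."""
    alpha = np.linalg.pinv(Phi) @ y               # numerically safer equivalent
    q_hat = lambda feats: np.clip(feats @ alpha, -v_max, v_max)  # truncation
    return alpha, q_hat
```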

Sketch of the Analysis

[Diagram: at each iteration $k$ the target $\mathcal{T}\widehat{Q}_{k-1}$ is approximated by $\widehat{Q}_k$, introducing an error $\epsilon_k$; the errors $\epsilon_1, \dots, \epsilon_K$ propagate through the iterations, up to the greedy policy $\pi_K$ and the final error between $Q^{\pi_K}$ and $Q^*$.]

Theoretical Objectives

Objective: derive a bound on the performance (quadratic) loss w.r.t. a testing distribution $\mu$:
$\|Q^* - Q^{\pi_K}\|_\mu \le\; ?$

Sub-Objective 1: derive an intermediate bound on the prediction error at any iteration $k$ w.r.t. the sampling distribution $\rho$:
$\|\mathcal{T}\widehat{Q}_{k-1} - \widehat{Q}_k\|_\rho \le\; ?$

Sub-Objective 2: analyze how the error at each iteration is propagated through the iterations:
$\|Q^* - Q^{\pi_K}\|_\mu \le \text{propagation}\big( \|\mathcal{T}\widehat{Q}_{k-1} - \widehat{Q}_k\|_\rho \big)$

The Sources of Error

- Desired solution: $Q_k = \mathcal{T}\widehat{Q}_{k-1}$
- Best solution (w.r.t. the sampling distribution $\rho$): $f_{\alpha_k^*} = \arg\inf_{f_\alpha \in \mathcal{F}} \|f_\alpha - Q_k\|_\rho$
  ⇒ error from the approximation space $\mathcal{F}$
- Returned solution: $f_{\hat{\alpha}_k} = \arg\min_{f_\alpha \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \big( f_\alpha(x_i, a_i) - y_i \big)^2$
  ⇒ error from the (random) samples

Per-Iteration Error

Theorem. At each iteration $k$, Linear-FQI returns an approximation $\widehat{Q}_k$ such that (Sub-Objective 1)
$\|Q_k - \widehat{Q}_k\|_\rho \le 4 \|Q_k - f_{\alpha_k^*}\|_\rho + O\Big( \big( V_{\max} + L \|\alpha_k^*\| \big) \sqrt{\frac{\log 1/\delta}{n}} \Big) + O\Big( V_{\max} \sqrt{\frac{d \log n/\delta}{n}} \Big)$
with probability $1 - \delta$.

Tools: concentration-of-measure inequalities, covering numbers, linear algebra, union bounds, special tricks for linear spaces, ...

Per-Iteration Error: Remarks

$\|Q_k - \widehat{Q}_k\|_\rho \le \underbrace{4 \|Q_k - f_{\alpha_k^*}\|_\rho}_{\text{approximation error}} + O\Big( \big( V_{\max} + L \|\alpha_k^*\| \big) \sqrt{\tfrac{\log 1/\delta}{n}} \Big) + O\Big( V_{\max} \sqrt{\tfrac{d \log n/\delta}{n}} \Big)$

On the first term (approximation error):
- No algorithm can do better
- Constant 4
- Depends on the space $\mathcal{F}$
- Changes with the iteration $k$

On the second term:
- Vanishes as $O(n^{-1/2})$
- Depends on the features ($L$) and on the best solution ($\|\alpha_k^*\|$)

On the third term:
- Vanishes as $O(n^{-1/2})$
- Depends on the dimensionality of the space ($d$) and the number of samples ($n$)

Error Propagation

Objective: bound $\|Q^* - Q^{\pi_K}\|_\mu$

- Problem 1: the test norm ($\mu$) is different from the sampling norm ($\rho$)
- Problem 2: we have bounds on $\widehat{Q}_k$, not on the performance of the corresponding policy $\pi_k$
- Problem 3: we have bounds for one single iteration

Error Propagation

Let $P^\pi$ denote the transition kernel for a fixed policy $\pi$.

- $m$-step (worst-case) concentration of the future state distribution:
  $c(m) = \sup_{\pi_1 \dots \pi_m} \Big\| \frac{d(\mu P^{\pi_1} \cdots P^{\pi_m})}{d\rho} \Big\|_\infty < \infty$
- Average (discounted) concentration:
  $C_{\mu,\rho} = (1-\gamma)^2 \sum_{m \ge 1} m \gamma^{m-1} c(m) < +\infty$

Error Propagation

Remark: relationship to the top-Lyapunov exponent
$L^+ = \sup_\pi \limsup_{m \to \infty} \frac{1}{m} \log^+ \big\| \rho P^{\pi_1} P^{\pi_2} \cdots P^{\pi_m} \big\|$
If $L^+ \le 0$ (stable system), then $c(m)$ has an (at most) polynomial growth rate and $C_{\mu,\rho} < \infty$ is finite.

Error Propagation

Proposition. Let $\epsilon_k = Q_k - \widehat{Q}_k$ be the error at each iteration. After $K$ iterations, the performance loss of the greedy policy $\pi_K$ satisfies
$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[ \frac{2\gamma}{(1-\gamma)^2} \Big]^2 \Big[ C_{\mu,\rho} \max_k \|\epsilon_k\|_\rho^2 + O\Big( \frac{\gamma^K}{(1-\gamma)^3} V_{\max}^2 \Big) \Big]$

The Final Bound

Bringing everything together:
$\|Q^* - Q^{\pi_K}\|_\mu^2 \le \Big[ \frac{2\gamma}{(1-\gamma)^2} \Big]^2 \Big[ C_{\mu,\rho} \max_k \|\epsilon_k\|_\rho^2 + O\Big( \frac{\gamma^K}{(1-\gamma)^3} V_{\max}^2 \Big) \Big]$
with
$\|\epsilon_k\|_\rho = \|Q_k - \widehat{Q}_k\|_\rho \le 4 \|Q_k - f_{\alpha_k^*}\|_\rho + O\Big( \big( V_{\max} + L \|\alpha_k^*\| \big) \sqrt{\frac{\log 1/\delta}{n}} \Big) + O\Big( V_{\max} \sqrt{\frac{d \log n/\delta}{n}} \Big)$

The Final Bound

Theorem (see e.g., Munos, '03). LinearFQI with a space $\mathcal{F}$ of $d$ features and $n$ samples at each iteration returns a policy $\pi_K$ after $K$ iterations such that
$\|Q^* - Q^{\pi_K}\|_\mu \le \frac{2\gamma}{(1-\gamma)^2} \sqrt{C_{\mu,\rho}} \Big( 4\, d(\mathcal{F}, \mathcal{T}\mathcal{F}) + O\Big( V_{\max} \big( 1 + \tfrac{L}{\sqrt{\omega}} \big) \sqrt{\tfrac{d \log n/\delta}{n}} \Big) \Big) + O\Big( \sqrt{\tfrac{\gamma^K}{(1-\gamma)^3}}\, V_{\max} \Big)$

Remarks:
- The propagation (and the change of norms) makes the problem more complex ⇒ how do we choose the sampling distribution $\rho$?
- The approximation error $d(\mathcal{F}, \mathcal{T}\mathcal{F})$ is worse than in regression (see below)

The Final Bound: the Inherent Bellman Error

$\|Q_k - f_{\alpha_k^*}\|_\rho = \inf_{f \in \mathcal{F}} \|Q_k - f\|_\rho = \inf_{f \in \mathcal{F}} \|\mathcal{T}\widehat{Q}_{k-1} - f\|_\rho \le \sup_{g \in \mathcal{F}} \inf_{f \in \mathcal{F}} \|\mathcal{T}g - f\|_\rho = d(\mathcal{F}, \mathcal{T}\mathcal{F})$

(the inequality holds because $\widehat{Q}_{k-1} = f_{\hat{\alpha}_{k-1}} \in \mathcal{F}$)

Question: how to design $\mathcal{F}$ to make it "compatible" with the Bellman operator?

More remarks:
- The dependency on $\gamma$ is worse than at each single iteration ⇒ is it possible to avoid it?
- The error decreases exponentially in $K$ ⇒ $K \approx \log(1/\epsilon)/(1-\gamma)$ iterations suffice to bring the $\gamma^K$ term below $\epsilon$
- $\omega$ is the smallest eigenvalue of the Gram matrix ⇒ design the features so as to be orthogonal w.r.t. $\rho$
- The asymptotic rate $O(d/n)$ is the same as for regression

Summary

The final performance depends on:
- the approximation algorithm, through the per-iteration error $Q_k - \widehat{Q}_k$
- the dynamic programming algorithm, through the error propagation
- the Markov decision process: concentrability $C_{\mu,\rho}$, range $V_{\max}$
- the approximation space: inherent Bellman error $d(\mathcal{F}, \mathcal{T}\mathcal{F})$, size $d$, features ($L$, $\omega$)
- the samples: sampling strategy (distribution $\rho$), number $n$

Other Implementations

Replace the regression step with:
- $K$-nearest neighbors
- regularized linear regression with $L^1$ or $L^2$ regularization
- neural networks
- support vector regression
- ... (one possible drop-in is sketched below)
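
The regression step is pluggable. A sketch of one FQI iteration with scikit-learn's ExtraTreesRegressor as the regressor (the choice behind tree-based Fitted Q-iteration); the batch format (x, a, r, x') matches the earlier sketches and is an assumption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fqi_step(batch, q_prev, actions, gamma):
    """One FQI iteration with a generic regressor (sketch).
    batch: list of (x, a, r, x') with x a 1-D array and a a scalar;
    q_prev(x, a) -> float is the previous estimate Q_{k-1}."""
    X = np.array([np.append(x, a) for (x, a, _, _) in batch])    # (state, action) inputs
    y = np.array([r + gamma * max(q_prev(x_next, b) for b in actions)
                  for (_, _, r, x_next) in batch])               # bootstrapped targets
    model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
    # the new Q-estimate, queried pointwise
    return lambda x, a: model.predict(np.append(x, a).reshape(1, -1))[0]
```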

Example: the Optimal Replacement Problem

State: level of wear of an object (e.g., a car).
Action: {(R)eplace, (K)eep}.
Cost:
- $c(x, R) = C$
- $c(x, K) = c(x)$: maintenance plus extra costs
Dynamics:
- $p(\cdot \mid x, R) = \exp(\beta)$, with density $d(y) = \beta e^{-\beta y} \mathbb{I}\{y \ge 0\}$
- $p(\cdot \mid x, K) = x + \exp(\beta)$, with density $d(y - x)$
Problem: minimize the discounted expected cost over an infinite horizon.
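
A sketch of this MDP as a generative model; the numerical values of $C$, $\beta$, $\gamma$ and the form of the maintenance cost $c(x)$ are illustrative assumptions, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
C, BETA, GAMMA = 30.0, 0.6, 0.6    # replacement cost, wear rate, discount (assumed)

def maintenance_cost(x):
    """c(x): maintenance plus extra costs, growing with the wear (assumed form)."""
    return 4.0 * x

def step(x, action):
    """Sample (cost, next wear level): 'R' resets the wear, 'K' accumulates it."""
    if action == "R":
        return C, rng.exponential(1.0 / BETA)                     # y ~ exp(beta)
    return maintenance_cost(x), x + rng.exponential(1.0 / BETA)   # y ~ x + exp(beta)
```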

Example: the Optimal Replacement Problem

Optimal value function:
$V^*(x) = \min\Big\{ c(x) + \gamma \int_0^\infty d(y - x) V^*(y)\, dy,\;\; C + \gamma \int_0^\infty d(y) V^*(y)\, dy \Big\}$

Optimal policy: the action that attains the minimum.

[Figure: the optimal value function and the management cost as functions of the wear $x \in [0, 10]$; the optimal policy alternates between (K)eep and (R)eplace regions.]

Linear approximation space: $\mathcal{F} := \big\{ V_n(x) = \sum_{k=1}^{20} \alpha_k \cos\big( k \pi \frac{x}{x_{\max}} \big) \big\}$.
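
A self-contained sketch of fitted value iteration for this problem with the cosine features above; the grid size, the number of Monte Carlo samples used for the integrals, and the cost constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
C, BETA, GAMMA, X_MAX, D = 30.0, 0.6, 0.6, 10.0, 20   # assumed constants

def c(x):                                   # assumed maintenance cost c(x)
    return 4.0 * x

def phi(x):
    """Cosine features phi_k(x) = cos(k * pi * x / x_max), k = 1, ..., 20."""
    k = np.arange(1, D + 1)
    return np.cos(k * np.pi * np.asarray(x, dtype=float)[..., None] / X_MAX)

grid = np.linspace(0.0, X_MAX, 100)         # states sampled on a uniform grid
noise = rng.exponential(1.0 / BETA, 200)    # wear increments ~ exp(beta)
alpha = np.zeros(D)
for _ in range(50):                         # fitted value-iteration sweeps
    V = lambda x: phi(x) @ alpha            # current estimate V_n
    targets = np.array([min(c(x) + GAMMA * V(x + noise).mean(),   # (K)eep
                            C + GAMMA * V(noise).mean())          # (R)eplace
                        for x in grid])     # Monte Carlo estimate of (T V_n)(x)
    alpha, *_ = np.linalg.lstsq(phi(grid), targets, rcond=None)   # V_{n+1} in F
```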

Example: the Optimal Replacement Problem

Collect $N$ samples on a uniform grid.

[Figure: the management cost evaluated at the $N$ grid points.]

[Figure: Left: the target values computed as $\{\mathcal{T}V_0(x_n)\}_{1 \le n \le N}$. Right: the approximation $V_1 \in \mathcal{F}$ of the target function $\mathcal{T}V_0$.]

[Figure: Left: the target values computed as $\{\mathcal{T}V_1(x_n)\}_{1 \le n \le N}$. Center: the approximation $V_2 \in \mathcal{F}$ of $\mathcal{T}V_1$. Right: the approximation $V_n \in \mathcal{F}$ after $n$ iterations.]

[Simulation demo.]

Approximate Dynamic Programming (a.k.a. Batch Reinforcement Learning)

- Approximate Value Iteration (above)
- Approximate Policy Iteration (next)

Policy Iteration: the Idea

1. Let $\pi_0$ be any stationary policy
2. At each iteration $k = 1, 2, \dots, K$:
   - Policy evaluation: given $\pi_k$, compute $V_k = V^{\pi_k}$
   - Policy improvement: compute the greedy policy
     $\pi_{k+1}(x) \in \arg\max_{a \in A} \Big[ r(x, a) + \gamma \sum_y p(y \mid x, a) V^{\pi_k}(y) \Big]$
3. Return the last policy $\pi_K$ (the exact scheme is sketched below)

- Problem: how can we approximate $V^{\pi_k}$?
- Problem: if $V_k \ne V^{\pi_k}$, does (approximate) policy iteration still work?
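
For reference, a sketch of the exact scheme on a finite MDP, before any approximation enters; the (A, S, S) tensor layout for the transition kernel is an assumption of this sketch.

```python
import numpy as np

def policy_iteration(P, r, gamma, K=50):
    """Exact policy iteration for a finite MDP (sketch).
    P: (A, S, S) transition tensor, r: (S, A) rewards; both assumed given."""
    A, S, _ = P.shape
    pi = np.zeros(S, dtype=int)                      # pi_0: arbitrary policy
    for _ in range(K):
        # policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[pi, np.arange(S), :]                # (S, S) kernel under pi
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # policy improvement: greedy policy w.r.t. V
        Q = r + gamma * np.einsum("asy,y->sa", P, V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):               # converged: pi is optimal
            break
        pi = new_pi
    return pi, V
```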

Approximate Policy Iteration: Performance Loss

Problem: the algorithm is no longer guaranteed to converge.

[Figure: $\|V^* - V^{\pi_k}\|$ oscillates around an asymptotic error as $k$ grows.]

Proposition. The asymptotic performance of the policies $\pi_k$ generated by the API algorithm is related to the approximation error as:
$\limsup_{k \to \infty} \underbrace{\|V^* - V^{\pi_k}\|_\infty}_{\text{performance loss}} \le \frac{2\gamma}{(1-\gamma)^2} \limsup_{k \to \infty} \underbrace{\|V_k - V^{\pi_k}\|_\infty}_{\text{approximation error}}$

Least-Squares Policy Iteration (LSPI)

LSPI uses:
- a linear space to approximate value functions*:
  $\mathcal{F} = \Big\{ f(x) = \sum_{j=1}^{d} \alpha_j \varphi_j(x),\; \alpha \in \mathbb{R}^d \Big\}$
- the Least-Squares Temporal-Difference (LSTD) algorithm for policy evaluation

*In practice we use approximations of action-value functions.

Least-Squares Temporal-Difference Learning (LSTD)

- $V^\pi$ may not belong to $\mathcal{F}$: $V^\pi \notin \mathcal{F}$
- The best approximation of $V^\pi$ in $\mathcal{F}$ is $\Pi V^\pi = \arg\min_{f \in \mathcal{F}} \|V^\pi - f\|$ ($\Pi$ is the projection onto $\mathcal{F}$)

[Figure: $V^\pi$ lies outside the plane $\mathcal{F}$; $\Pi V^\pi$ is its orthogonal projection onto $\mathcal{F}$.]

Least-Squares Temporal-Difference Learning (LSTD)

- $V^\pi$ is the fixed point of $\mathcal{T}^\pi$: $V^\pi = \mathcal{T}^\pi V^\pi = r^\pi + \gamma P^\pi V^\pi$
- LSTD searches for the fixed point of $\Pi_{2,\rho} \mathcal{T}^\pi$, where $\Pi_{2,\rho}\, g = \arg\min_{f \in \mathcal{F}} \|g - f\|_{2,\rho}$
- When the fixed point of $\Pi_\rho \mathcal{T}^\pi$ exists, we call it the LSTD solution: $V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$

[Figure: $\mathcal{T}^\pi$ maps $V_{TD}$ out of $\mathcal{F}$; projecting back with $\Pi_\rho$ returns the same point, $V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$.]

Least-Squares Temporal-Difference Learning (LSTD)

$V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$

- The projection $\Pi_\rho$ is orthogonal in expectation w.r.t. the space $\mathcal{F}$ spanned by the features $\varphi_1, \dots, \varphi_d$:
  $\mathbb{E}_{x \sim \rho}\big[ \big( \mathcal{T}^\pi V_{TD}(x) - V_{TD}(x) \big) \varphi_i(x) \big] = 0, \quad \forall i \in [1, d]$
  i.e. $\langle \mathcal{T}^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0$
- By the definition of the Bellman operator:
  $\langle r^\pi + \gamma P^\pi V_{TD} - V_{TD}, \varphi_i \rangle_\rho = 0$, i.e. $\langle r^\pi, \varphi_i \rangle_\rho - \langle (I - \gamma P^\pi) V_{TD}, \varphi_i \rangle_\rho = 0$
- Since $V_{TD} \in \mathcal{F}$, there exists $\alpha_{TD}$ such that $V_{TD}(x) = \phi(x)^\top \alpha_{TD}$, hence
  $\langle r^\pi, \varphi_i \rangle_\rho - \sum_{j=1}^{d} \langle (I - \gamma P^\pi) \varphi_j, \varphi_i \rangle_\rho\, \alpha_{TD,j} = 0$

Least-Squares Temporal-Difference Learning (LSTD)

$V_{TD} = \Pi_\rho \mathcal{T}^\pi V_{TD}$
$\Downarrow$
$\underbrace{\langle r^\pi, \varphi_i \rangle_\rho}_{b_i} - \sum_{j=1}^{d} \underbrace{\langle (I - \gamma P^\pi) \varphi_j, \varphi_i \rangle_\rho}_{A_{i,j}}\, \alpha_{TD,j} = 0$
$\Downarrow$
$A\, \alpha_{TD} = b$
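
In practice $A$ and $b$ are replaced by their empirical counterparts built from a trajectory generated by $\pi$; a sketch, where the (x, r, x') transition format and the feature map phi are assumed interfaces.

```python
import numpy as np

def lstd(trajectory, phi, gamma):
    """Empirical LSTD (sketch): estimate A and b from transitions under pi.
    trajectory: list of (x, r, x'); phi(x) -> feature vector of size d.
    A_hat[i, j] estimates <(I - gamma P^pi) phi_j, phi_i>_rho and
    b_hat[i] estimates <r^pi, phi_i>_rho (the common 1/n factor cancels)."""
    d = phi(trajectory[0][0]).shape[0]
    A_hat, b_hat = np.zeros((d, d)), np.zeros(d)
    for (x, r, x_next) in trajectory:
        f, f_next = phi(x), phi(x_next)
        A_hat += np.outer(f, f - gamma * f_next)   # rank-one update
        b_hat += r * f
    alpha = np.linalg.solve(A_hat, b_hat)          # alpha_TD (A_hat assumed invertible)
    return lambda x: phi(x) @ alpha                # V_TD(x) = phi(x)^T alpha_TD
```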

Least-Squares Temporal-Difference Learning (LSTD)

- Problem: in general, $\Pi_\rho \mathcal{T}^\pi$ is not a contraction and does not have a fixed point.
- Solution: if $\rho = \rho^\pi$ (the stationary distribution of $\pi$), then $\Pi_{\rho^\pi} \mathcal{T}^\pi$ has a unique fixed point.
