Stochastic approximation for speeding up LSTD/LSPI (and least squares regression/LinUCB). Prashanth L A, joint work with Nathaniel Korda and Rémi Munos. INRIA Lille - Team SequeL / MLRG - Oxford University. November 24, 2014.


1. Stochastic approximation for speeding up LSTD/LSPI (and least squares regression/LinUCB)
Prashanth L A†, joint work with Nathaniel Korda♯ and Rémi Munos†
† INRIA Lille - Team SequeL, ♯ MLRG - Oxford University
November 24, 2014

2. Outline
1 Fast LSTD using SA
2 Fast LSPI using SA
3 Experiments - Traffic Signal Control
4 Extension to Least Squares Regression
5 Experiments - News Recommendation
6 Proof outline

3. Background: MDP
Set of states $X$, set of actions $A$, rewards $r(x, a)$
Value function: $V^\pi(s) := \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s \right]$
Bellman operator: $T^\pi(V)(s) := r(s, \pi(s)) + \beta \sum_{s'} p(s, \pi(s), s')\, V(s')$
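
As a concrete illustration (not from the talk), here is a minimal numpy sketch of applying the Bellman operator $T^\pi$ for a small tabular MDP under a fixed policy; the transition matrix, rewards and discount factor below are hypothetical placeholders:

```python
import numpy as np

# Hypothetical 3-state MDP under a fixed policy pi:
# P_pi[s, s2] = p(s, pi(s), s2), r_pi[s] = r(s, pi(s)), beta = discount factor.
P_pi = np.array([[0.9, 0.1, 0.0],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.0, 0.9]])
r_pi = np.array([1.0, 0.0, 2.0])
beta = 0.95

def bellman_operator(V):
    """Apply (T^pi V)(s) = r(s, pi(s)) + beta * sum_{s'} p(s, pi(s), s') * V(s')."""
    return r_pi + beta * P_pi @ V

# V^pi is the fixed point of T^pi; repeatedly applying T^pi converges to it.
V = np.zeros(3)
for _ in range(500):
    V = bellman_operator(V)
print(V)  # approximately V^pi
```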

4. TD with Function Approximation
Linear function approximation: $V^\pi(s) \approx \theta^{\top} \phi(s)$, with parameter $\theta \in \mathbb{R}^d$ and feature $\phi(s) \in \mathbb{R}^d$
TD fixed point: $\Phi\theta = \Pi T^\pi(\Phi\theta)$, where $\Phi$ is the feature matrix with rows $\phi(s)^{\top}$, $\forall s \in S$, and $\Pi$ is the orthogonal projection onto $B = \{\Phi\theta \mid \theta \in \mathbb{R}^d\}$
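
For the record, a sketch of the standard derivation (not spelled out on the slide) of why this fixed point is a linear system in $\theta$: with $D$ the diagonal matrix of state weights, $P^\pi$ the transition matrix under $\pi$, and $r^\pi$ the reward vector (notation introduced here, not on the slide), writing $\Pi$ as the $D$-weighted least-squares projection onto the span of $\Phi$ gives

```latex
\Phi\theta = \Pi T^{\pi}(\Phi\theta)
\;\Longleftrightarrow\;
\Phi^{\top} D\bigl(r^{\pi} + \beta P^{\pi}\Phi\theta - \Phi\theta\bigr) = 0
\;\Longleftrightarrow\;
\underbrace{\Phi^{\top} D\,(\Phi - \beta P^{\pi}\Phi)}_{A}\,\theta
   = \underbrace{\Phi^{\top} D\, r^{\pi}}_{b}.
```

LSTD (next slide) is the sample-based estimate of this $A\theta = b$ system.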

5. LSTD - A Batch Algorithm
Given dataset $\mathcal{D} := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$, LSTD approximates the TD fixed point by
$\hat\theta_T = \bar A_T^{-1} \bar b_T$, where $\bar A_T = \frac{1}{T}\sum_{i=1}^{T} \phi(s_i)\left(\phi(s_i) - \beta\phi(s'_i)\right)^{\top}$ and $\bar b_T = \frac{1}{T}\sum_{i=1}^{T} r_i\,\phi(s_i)$
Complexity: $O(d^2 T)$
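
A minimal numpy sketch of this batch solution, assuming the dataset is stored as arrays of current-state features, next-state features and rewards (array and function names here are illustrative, and the small ridge term is an addition for numerical stability, not part of the slide):

```python
import numpy as np

def lstd(Phi, Phi_next, rewards, beta, ridge=1e-8):
    """Batch LSTD: theta_hat_T = A_bar_T^{-1} b_bar_T.

    Phi      : (T, d) array whose i-th row is phi(s_i)
    Phi_next : (T, d) array whose i-th row is phi(s'_i)
    rewards  : (T,) array of rewards r_i
    """
    T, d = Phi.shape
    A_bar = Phi.T @ (Phi - beta * Phi_next) / T   # (1/T) * sum_i phi(s_i) (phi(s_i) - beta phi(s'_i))^T
    b_bar = Phi.T @ rewards / T                   # (1/T) * sum_i r_i phi(s_i)
    return np.linalg.solve(A_bar + ridge * np.eye(d), b_bar)
```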

6. Complexity of LSTD [1]
Figure: LSPI - a batch-mode RL algorithm for control (policy evaluation computes the Q-value $Q^\pi$ of the current policy $\pi$; policy improvement updates the policy)
LSTD complexity: $O(d^2 T)$ using the Sherman-Morrison lemma, or $O(d^{2.807})$ using the Strassen algorithm, or $O(d^{2.375})$ using the Coppersmith-Winograd algorithm
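
The $O(d^2 T)$ figure corresponds to maintaining $\bar A_T^{-1}$ with rank-one Sherman-Morrison updates as samples arrive, rather than inverting once at the end. A hedged sketch of that update (the regularised initialisation and synthetic data stream are illustrative, not from the talk):

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}: an O(d^2) rank-one update."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

d, beta, eps = 4, 0.95, 1e-3
A_inv = np.eye(d) / eps            # inverse of the regularised running sum A = eps * I
b = np.zeros(d)
rng = np.random.default_rng(0)
for _ in range(1000):              # stream of synthetic (phi, r, phi') samples
    phi, phi_next, r = rng.normal(size=d), rng.normal(size=d), rng.normal()
    A_inv = sherman_morrison(A_inv, phi, phi - beta * phi_next)
    b += r * phi
theta_hat = A_inv @ b              # LSTD solution, built with O(d^2) work per sample
```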

7. Complexity of LSTD [2]
Problem: practical applications involve high-dimensional features (e.g. Computer Go: $d \sim 10^6$), so solving LSTD is computationally intensive. Related works: GTD¹, GTD2², iLSTD³
Solution: use stochastic approximation (SA). Complexity $O(dT)$, i.e. an $O(d)$-factor reduction in complexity
Theory: the SA variant of LSTD does not impact the overall rate of convergence
Experiments: on a traffic control application, the performance of SA-based LSTD is comparable to LSTD, while gaining in runtime!
¹ Sutton et al. (2009) A convergent O(n) algorithm for off-policy temporal-difference learning. In: NIPS
² Sutton et al. (2009) Fast gradient-descent methods for temporal-difference learning with linear function approximation. In: ICML
³ Geramifard et al. (2007) iLSTD: Eligibility traces and convergence analysis. In: NIPS

8. Fast LSTD using Stochastic Approximation
Random sampling: pick $i_n$ uniformly from $\{1, \ldots, T\}$
SA update: update $\theta_n$ using $(s_{i_n}, r_{i_n}, s'_{i_n})$
Update rule (a fixed-point iteration with step-sizes $\gamma_n$):
$\theta_n = \theta_{n-1} + \gamma_n \left( r_{i_n} + \beta\,\theta_{n-1}^{\top}\phi(s'_{i_n}) - \theta_{n-1}^{\top}\phi(s_{i_n}) \right) \phi(s_{i_n})$
Complexity: $O(d)$ per iteration
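
A minimal numpy sketch of this recursion; each iteration touches a single sampled transition and does $O(d)$ work (function and variable names are illustrative):

```python
import numpy as np

def fast_lstd_sa(Phi, Phi_next, rewards, beta, step_sizes, seed=0):
    """Fast LSTD via stochastic approximation.

    Phi, Phi_next : (T, d) arrays with rows phi(s_i) and phi(s'_i)
    rewards       : (T,) array of rewards r_i
    step_sizes    : iterable of step sizes gamma_n (its length sets the number of iterations)
    """
    T, d = Phi.shape
    theta = np.zeros(d)
    rng = np.random.default_rng(seed)
    for gamma_n in step_sizes:
        i = rng.integers(T)                        # pick i_n uniformly from {1, ..., T}
        delta = rewards[i] + beta * (theta @ Phi_next[i]) - theta @ Phi[i]
        theta = theta + gamma_n * delta * Phi[i]   # O(d) fixed-point (TD) update
    return theta
```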

9. Assumptions
Setting: given dataset $\mathcal{D} := \{(s_i, r_i, s'_i),\ i = 1, \ldots, T\}$
(A1) Bounded features: $\|\phi(s_i)\|_2 \le 1$
(A2) Bounded rewards: $|r_i| \le R_{\max} < \infty$
(A3) Covariance matrix has a minimum eigenvalue: $\lambda_{\min}\left(\frac{1}{T}\sum_{i=1}^{T} \phi(s_i)\phi(s_i)^{\top}\right) \ge \mu$
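
For completeness, a small hedged snippet that checks (A1)-(A3) on a given dataset (the function name and tolerance are mine):

```python
import numpy as np

def check_assumptions(Phi, rewards, R_max, tol=1e-12):
    """Check (A1) bounded features, (A2) bounded rewards, and compute mu for (A3)."""
    a1 = bool(np.all(np.linalg.norm(Phi, axis=1) <= 1.0 + tol))   # ||phi(s_i)||_2 <= 1
    a2 = bool(np.all(np.abs(rewards) <= R_max))                   # |r_i| <= R_max
    cov = Phi.T @ Phi / Phi.shape[0]                              # (1/T) sum_i phi(s_i) phi(s_i)^T
    mu = float(np.linalg.eigvalsh(cov)[0])                        # smallest eigenvalue
    return a1, a2, mu
```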

10. Convergence Rate
Step-size choice: $\gamma_n = \frac{(1-\beta)\,c}{2(c+n)}$, with $c$ such that $(1-\beta)^2 \mu\, c \in (1.33, 2)$
Bound in expectation: $\mathbb{E}\left\|\theta_n - \hat\theta_T\right\|_2 \le \frac{K_1}{\sqrt{n+c}}$
High-probability bound: $\mathbb{P}\left( \left\|\theta_n - \hat\theta_T\right\|_2 \le \frac{K_2}{\sqrt{n+c}} \right) \ge 1 - \delta$
By iterate averaging, the dependency of $c$ on $\mu$ can be removed
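
A brief sketch combining this step-size schedule with iterate (Polyak) averaging on top of the SA recursion from the earlier slide; the exact averaging scheme analysed in the paper may differ, so treat this as illustrative only:

```python
import numpy as np

def step_size(n, beta, c):
    """gamma_n = (1 - beta) * c / (2 * (c + n)), with c chosen so that (1 - beta)^2 * mu * c is in (1.33, 2)."""
    return (1.0 - beta) * c / (2.0 * (c + n))

def fast_lstd_sa_averaged(Phi, Phi_next, rewards, beta, c, n_iters, seed=0):
    """SA recursion plus a running average theta_bar_n = (1/n) * sum_{k<=n} theta_k."""
    T, d = Phi.shape
    theta, theta_bar = np.zeros(d), np.zeros(d)
    rng = np.random.default_rng(seed)
    for n in range(1, n_iters + 1):
        i = rng.integers(T)
        delta = rewards[i] + beta * (theta @ Phi_next[i]) - theta @ Phi[i]
        theta = theta + step_size(n, beta, c) * delta * Phi[i]
        theta_bar += (theta - theta_bar) / n        # O(d) running average of the iterates
    return theta, theta_bar
```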
