
Reinforcement Learning: Approximate Dynamic Programming

Decision Making Under Uncertainty, Chapter 10. Christos Dimitrakakis, Chalmers. November 21, 2013.


  1. Reinforcement Learning: Approximate Dynamic Programming
Decision Making Under Uncertainty, Chapter 10
Christos Dimitrakakis, Chalmers
November 21, 2013

  2. Outline
1 Introduction: Error bounds; Features
2 Approximate policy iteration: Estimation building blocks; The value estimation step; Policy estimation; Rollout-based policy iteration methods; Least Squares Methods
3 Approximate Value Iteration: Approximate backwards induction; State aggregation; Representative states

  3. Introduction

Definition 1 ($u$-greedy policy and value function)
$$\pi^*_u \in \arg\max_\pi L_\pi u, \qquad v^*_u = L u, \tag{1.1}$$
where $\pi : S \to D(A)$ maps from states to action distributions.

Parametric value function estimation. With $V_\Theta = \{ v_\theta \mid \theta \in \Theta \}$,
$$\theta^* \in \arg\min_{\theta \in \Theta} \| v_\theta - u \|_\phi, \tag{1.2}$$
where $\| \cdot \|_\phi \triangleq \int_S | \cdot | \, d\phi$.

Parametric policy estimation. With $\Pi_\Theta = \{ \pi_\theta \mid \theta \in \Theta \}$,
$$\theta^* \in \arg\min_{\theta \in \Theta} \| \pi_\theta - \pi^*_u \|_\phi, \tag{1.3}$$
where $\pi^*_u = \arg\max_{\pi \in \Pi} L_\pi u$.

  4. Introduction: Error bounds

Theorem 2
Consider a finite MDP $\mu$ with discount factor $\gamma < 1$ and a vector $u \in V$ such that $\| u - V^*_\mu \|_\infty = \epsilon$. If $\pi$ is the $u$-greedy policy, then
$$\| V^\pi_\mu - V^*_\mu \|_\infty \le \frac{2 \gamma \epsilon}{1 - \gamma}.$$
In addition, there exists $\epsilon_0 > 0$ such that if $\epsilon < \epsilon_0$, then $\pi$ is optimal.
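To make the bound concrete, here is a minimal numerical check (my own illustrative construction, not from the slides): build a small random MDP, perturb $V^*$ by noise of sup-norm at most $\epsilon$, act greedily with respect to the perturbed values, and compare the resulting loss to $2\gamma\epsilon/(1-\gamma)$.

```python
# Numerical check of Theorem 2 on a small random MDP (toy construction).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))                       # r(s, a)

# Exact optimal values by value iteration.
V = np.zeros(n_states)
for _ in range(2000):
    V = (r + gamma * P @ V).max(axis=1)

def policy_value(pi):
    """Exact V^pi by solving (I - gamma * P_pi) V = r_pi."""
    P_pi = P[np.arange(n_states), pi]
    r_pi = r[np.arange(n_states), pi]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

eps = 0.05
u = V + rng.uniform(-eps, eps, size=n_states)      # ||u - V*||_inf <= eps
pi_greedy = (r + gamma * P @ u).argmax(axis=1)     # the u-greedy policy
loss = np.max(np.abs(policy_value(pi_greedy) - V))
print(loss, "<=", 2 * gamma * eps / (1 - gamma))
```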

  5. Introduction: Features

Feature mapping $f : S \times A \to X$. For $X \subset \mathbb{R}^n$, the feature mapping can be written in vector form:
$$f(s, a) = \big( f_1(s, a), \ldots, f_n(s, a) \big)^\top. \tag{1.4}$$

Example 3 (Radial basis functions)
Let $d$ be a metric on $S \times A$ and let $\{ (s_i, a_i) \mid i = 1, \ldots, n \}$ be a set of centres. Then we define each element of $f$ as
$$f_i(s, a) \triangleq \exp\{ -d[(s, a), (s_i, a_i)] \}. \tag{1.5}$$
These functions are sometimes called kernels.

Example 4 (Tilings)
Let $G = \{ X_1, \ldots, X_n \}$ be a partition of $S \times A$ of size $n$. Then
$$f_i(s, a) \triangleq \mathbb{I}\{ (s, a) \in X_i \}. \tag{1.6}$$
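The two feature constructions of Examples 3 and 4 can be sketched as follows, assuming a one-dimensional continuous state space and a small finite action set; the centres, the metric, and the tile grid are illustrative choices, not part of the slides.

```python
# RBF and tiling feature maps on a toy state-action space.
import numpy as np

def rbf_features(s, a, centres, metric):
    """f_i(s, a) = exp(-d[(s, a), (s_i, a_i)]) for each centre (s_i, a_i)."""
    return np.array([np.exp(-metric((s, a), c)) for c in centres])

def tiling_features(s, a, state_bins, n_actions):
    """f_i(s, a) = 1 iff (s, a) falls in cell i of a grid partition of S x A."""
    s_idx = np.digitize(s, state_bins)        # which state interval
    f = np.zeros((len(state_bins) + 1) * n_actions)
    f[s_idx * n_actions + a] = 1.0            # exactly one active tile per (s, a)
    return f

# Example usage on S = [0, 1], A = {0, 1}.
centres = [(c, a) for c in (0.25, 0.75) for a in (0, 1)]
metric = lambda x, y: abs(x[0] - y[0]) + (x[1] != y[1])
print(rbf_features(0.3, 1, centres, metric))
print(tiling_features(0.3, 1, state_bins=np.array([0.5]), n_actions=2))
```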

  6. Approximate policy iteration

Algorithm 1: Generic approximate policy iteration
Input: initial value function $v_0$, approximate Bellman operator $\hat{L}$, approximate value estimator $\hat{V}$.
for $k = 1, \ldots$ do
    $\pi_k = \arg\min_{\pi \in \hat{\Pi}} \| \hat{L}_\pi v_{k-1} - L v_{k-1} \|$   // policy improvement
    $v_k = \arg\min_{v \in \hat{V}} \| v - V^{\pi_k}_\mu \|$   // policy evaluation
end for
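A minimal Python skeleton of Algorithm 1, assuming the two approximate steps are supplied as callables; the names `greedy_step` and `evaluate` are my own, standing in for the approximate improvement and evaluation operators.

```python
# Skeleton of generic approximate policy iteration.
def approximate_policy_iteration(v0, greedy_step, evaluate, n_iterations=50):
    v = v0
    policies, values = [], []
    for _ in range(n_iterations):
        pi = greedy_step(v)     # policy improvement: approx arg min ||L_pi v - L v||
        v = evaluate(pi)        # policy evaluation: approx arg min ||v - V^pi||
        policies.append(pi)
        values.append(v)
    return policies, values
```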

  7. Approximate policy iteration: Theoretical guarantees

Assumption 1
Consider a discounted problem with discount factor $\gamma$ and iterates $v_k, \pi_k$ such that, for all $k$,
$$\| v_k - V^{\pi_k} \|_\infty \le \epsilon, \tag{2.1}$$
$$\| L_{\pi_{k+1}} v_k - L v_k \|_\infty \le \delta. \tag{2.2}$$

Theorem 5 ([6], Proposition 6.2)
Under Assumption 1,
$$\limsup_{k \to \infty} \| V^{\pi_k} - V^* \|_\infty \le \frac{\delta + 2 \gamma \epsilon}{(1 - \gamma)^2}. \tag{2.3}$$

  8. Approximate policy iteration: Estimation building blocks. Lookahead policies

Single-step lookahead.
$$\pi_q(a \mid i) > 0 \text{ iff } a \in \arg\max_{a' \in A} q(i, a'), \tag{2.4}$$
$$q(i, a) \triangleq r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a) u(j). \tag{2.5}$$

$T$-step lookahead.
$$\pi(i; q_T) = \arg\max_{a \in A} q_T(i, a), \tag{2.6}$$
where $q_k$ and $u_k$ are defined recursively by
$$q_k(i, a) = r_\mu(i, a) + \gamma \sum_{j \in S} P_\mu(j \mid i, a) u_{k-1}(j), \tag{2.7}$$
$$u_k(i) = \max\{ q_k(i, a) \mid a \in A \}, \tag{2.8}$$
and $u_0 = u$.
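A sketch of $T$-step lookahead for a finite MDP with a known model, assuming the model is stored as numpy arrays `P[s, a, s']` and `r[s, a]`; the array layout and the random demo model are my own conventions.

```python
# T-step lookahead: repeat the one-step backup (2.7)-(2.8) T times from u_0 = u.
import numpy as np

def lookahead_policy(P, r, u, gamma, T=1):
    """Return the T-step lookahead policy (2.6) and the q_T factors."""
    u_k = u
    for _ in range(T):
        q_k = r + gamma * P @ u_k        # eq. (2.7): one-step backup of u_{k-1}
        u_k = q_k.max(axis=1)            # eq. (2.8): greedy value
    return q_k.argmax(axis=1), q_k       # eq. (2.6): greedy in q_T

# Example usage with a random model.
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4), size=(4, 2))
r = rng.uniform(size=(4, 2))
pi, qT = lookahead_policy(P, r, u=np.zeros(4), gamma=0.9, T=3)
```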

  9. Approximate policy iteration: Estimation building blocks. Rollout policies

Rollout estimate of the $q$-factor.
$$q(i, a) = \frac{1}{K_i} \sum_{k=1}^{K_i} \sum_{t=0}^{T_k - 1} r(s_{t,k}, a_{t,k}),$$
where $s_{t,k}, a_{t,k} \sim P^\pi_\mu(\cdot \mid s_0 = i, a_0 = a)$ and $T_k \sim \mathrm{Geom}(1 - \gamma)$.

Rollout policy estimation. Given a set of sample estimates $q(i, a)$ for $i \in \hat{S}$, we estimate
$$\min_\theta \| \pi_\theta - \pi^*_q \|_\phi$$
for some $\phi$ on $\hat{S}$.
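A sketch of the rollout estimate, assuming access to a simulator of a finite MDP (arrays `P[s, a, s']`, `r[s, a]`) and a deterministic policy `pi[s]`; this simulator interface is my own choice, not from the slides.

```python
# Monte Carlo rollout estimate of q(i, a) with a geometric horizon.
import numpy as np

def rollout_q(P, r, pi, state, action, gamma, n_rollouts=100, rng=None):
    """Average undiscounted return over geometric-horizon rollouts of pi."""
    rng = rng or np.random.default_rng()
    n_states = P.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        T = rng.geometric(1.0 - gamma)           # T_k ~ Geom(1 - gamma)
        s, a = state, action                     # first action is forced to `action`
        for _ in range(T):
            total += r[s, a]
            s = rng.choice(n_states, p=P[s, a])  # sample a transition
            a = pi[s]                            # then follow pi
    return total / n_rollouts
```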

  10. Approximate policy iteration: The value estimation step

Generalised linear model using features (or kernels). Feature mapping $f : S \to \mathbb{R}^n$, parameters $\theta \in \mathbb{R}^n$:
$$v_\theta(s) = \sum_{i=1}^n \theta_i f_i(s). \tag{2.9}$$

Fitting a value function.
$$c_s(\theta) = \phi(s) \| v_\theta(s) - v(s) \|^\kappa_p, \qquad c(\theta) = \sum_{s \in \hat{S}} c_s(\theta). \tag{2.10}$$

Example 6 (the case $p = 2$, $\kappa = 2$)
The gradient update is
$$\theta'_j = \theta_j - 2 \alpha \phi(s) [ v_\theta(s) - v(s) ] f_j(s). \tag{2.11}$$
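A sketch of the gradient update (2.11) for the case $p = \kappa = 2$, fitting a linear value function to targets $v(s)$ on a sample of states; the data and the state weights $\phi$ below are placeholders of my own.

```python
# Stochastic gradient fit of a linear value function v_theta(s) = f(s).theta.
import numpy as np

def fit_linear_value(features, targets, weights, alpha=0.05, n_epochs=200):
    """SGD on c(theta) = sum_s phi(s) (v_theta(s) - v(s))^2."""
    theta = np.zeros(features.shape[1])
    for _ in range(n_epochs):
        for f_s, v_s, phi_s in zip(features, targets, weights):
            error = f_s @ theta - v_s                   # v_theta(s) - v(s)
            theta -= 2.0 * alpha * phi_s * error * f_s  # eq. (2.11), all coordinates
    return theta

# Example usage: 3 sampled states with 2 features each.
F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
theta = fit_linear_value(F, targets=np.array([1.0, 2.0, 3.0]), weights=np.ones(3))
```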

  11. Approximate policy iteration: Policy estimation

Generalised linear model using features (or kernels). Feature mapping $f : S \times A \to \mathbb{R}^n$, parameters $\theta \in \mathbb{R}^n$:
$$\pi_\theta(a \mid s) = \frac{g(s, a)}{h(s)}, \qquad g(s, a) = \sum_{i=1}^n \theta_i f_i(s, a), \qquad h(s) = \sum_{b \in A} g(s, b). \tag{2.12}$$

Fitting a policy through a cost function.
$$c_s(\theta) = \phi(s) \| \pi_\theta(\cdot \mid s) - \pi(\cdot \mid s) \|^\kappa_p, \qquad c(\theta) = \sum_{s \in \hat{S}} c_s(\theta). \tag{2.13}$$

The case $p = 1$, $\kappa = 1$:
$$\theta'_j = \theta_j - \alpha \phi(s) \Big[ \pi_\theta(a \mid s) \sum_{b \in A} f_j(s, b) - f_j(s, a) \Big]. \tag{2.14}$$
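A sketch of the update (2.14), pushing $\pi_\theta(\cdot \mid s)$ towards a target action on each sampled state; the per-state feature arrays, target actions, weights, and the clipping step that keeps the ratio parameterisation valid are simplifications of my own.

```python
# Gradient-style fit of the linear ratio policy (2.12) towards target actions.
import numpy as np

def policy_probs(theta, f_sa):
    """pi_theta(.|s) for one state, with per-action features f_sa[a, :] (eq. 2.12)."""
    g = f_sa @ theta
    return g / g.sum()

def fit_linear_policy(feature_sets, target_actions, weights, alpha=0.01, n_epochs=50):
    theta = np.ones(feature_sets[0].shape[1])       # start from a near-uniform policy
    for _ in range(n_epochs):
        for f_sa, a, phi_s in zip(feature_sets, target_actions, weights):
            pi_a = policy_probs(theta, f_sa)[a]
            # eq. (2.14): move theta so that pi_theta(a|s) grows for the target action a
            theta -= alpha * phi_s * (pi_a * f_sa.sum(axis=0) - f_sa[a])
            # keep weights positive so g(s, .) defines valid probabilities in this toy model
            theta = np.maximum(theta, 1e-3)
    return theta

# Example: 2 states, 2 actions, 3 features per (s, a).
F = [np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]),
     np.array([[0.5, 0.5, 1.0], [1.0, 0.0, 0.0]])]
theta = fit_linear_policy(F, target_actions=[0, 1], weights=[1.0, 1.0])
```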

  12. Approximate policy iteration: Rollout-based policy iteration methods

Algorithm 2: Rollout Sampling Approximate Policy Iteration
for $k = 1, \ldots$ do
    Select a set of representative states $\hat{S}_k$.
    for $n = 1, \ldots$ do
        Select a state $s_n \in \hat{S}_k$ maximising $U_n(s)$ and perform a rollout.
        If $\hat{a}^*(s_n)$ is optimal w.p. $1 - \delta$, put $s_n$ in $\hat{S}_k(\delta)$ and remove it from $\hat{S}_k$.
    end for
    Calculate $q_k \approx Q^{\pi_k}$ from the rollouts.
    Train a classifier $\pi_{\theta_{k+1}}$ on the set of states $\hat{S}_k(\delta)$ with actions $\hat{a}^*(s)$.
end for

  13. Approximate policy iteration: Least Squares Methods. Least squares value estimation

Projection. Setting $v = \Phi \theta$, where $\Phi$ is a feature matrix and $\theta$ a parameter vector, we have
$$\Phi \theta = r + \gamma P_{\mu,\pi} \Phi \theta, \tag{2.15}$$
$$\theta = [ (I - \gamma P_{\mu,\pi}) \Phi ]^{-1} r. \tag{2.16}$$
Replacing the inverse with the pseudo-inverse, with $A = (I - \gamma P_{\mu,\pi}) \Phi$:
$$\tilde{A}^{-1} \triangleq A^\top (A A^\top)^{-1}.$$

Empirical construction. Given a set of data points $\{ (s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n \}$, which may not be consecutive, we define:
1. $r = (r_i)_i$,
2. $\Phi_i = f(s_i, a_i)$, $\Phi = (\Phi_i)_i$,
3. $P_{\mu,\pi} = P_\mu P_\pi$, with $P_{\mu,\pi}(i, j) = \mathbb{I}\{ j = i + 1 \}$.
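A sketch of the empirical construction, assuming the transitions were collected under the evaluation policy so that the sampled next-state features play the role of $P_{\mu,\pi} \Phi$; this simplification, and the use of a least-squares solve instead of the explicit pseudo-inverse, are my own choices.

```python
# Empirical least-squares value estimation: solve Phi theta = r + gamma P Phi theta.
import numpy as np

def lstd_theta(features, next_features, rewards, gamma):
    """Least-squares solution of ((I - gamma P) Phi) theta = r."""
    A = features - gamma * next_features     # rows of (I - gamma P_{mu,pi}) Phi
    theta, *_ = np.linalg.lstsq(A, rewards, rcond=None)
    return theta

# Example: Phi_i = f(s_i, a_i), next_features_i = f(s'_i, pi(s'_i)).
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Phi_next = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
theta = lstd_theta(Phi, Phi_next, rewards=np.array([0.0, 1.0, 0.5]), gamma=0.9)
```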

  14. Approximate policy iteration: Least Squares Methods

Algorithm 3: LSTDQ (Least Squares Temporal Differences on $q$-factors)
Input: data $D = \{ (s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n \}$, feature mapping $f$, policy $\pi$.
$$\theta = [ (I - \gamma P_{\mu,\pi}) \Phi ]^{-1} r$$

Algorithm 4: LSPI (Least Squares Policy Iteration)
Input: data $D = \{ (s_i, a_i, r_i, s'_i) \mid i = 1, \ldots, n \}$, feature mapping $f$.
Set $\pi_0$ arbitrarily.
for $k = 1, \ldots$ do
    $\theta_k = \mathrm{LSTDQ}(D, f, \pi_{k-1})$
    $\pi_k = \pi^*_{\Phi \theta_k}$, the greedy policy with respect to $q_k = \Phi \theta_k$
end for
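A compact sketch of LSTDQ/LSPI on a finite MDP with one-hot state-action features. It uses the common normal-equation form of LSTDQ rather than the direct inverse shown above, and the data layout and greedy-policy extraction are my own choices.

```python
# LSTDQ and LSPI on batch data D = (s, a, r, s') stored as integer arrays.
import numpy as np

def lstdq(S, A, R, S_next, pi, n_states, n_actions, gamma):
    """Fit q(s, a) = theta[s, a] with one-hot features under evaluation policy pi."""
    dim = n_states * n_actions
    M = np.zeros((dim, dim))
    b = np.zeros(dim)
    for s, a, r, s2 in zip(S, A, R, S_next):
        i, j = s * n_actions + a, s2 * n_actions + pi[s2]
        phi, phi_next = np.zeros(dim), np.zeros(dim)
        phi[i], phi_next[j] = 1.0, 1.0
        M += np.outer(phi, phi - gamma * phi_next)   # Phi^T (Phi - gamma P Phi)
        b += r * phi                                 # Phi^T r
    theta = np.linalg.lstsq(M, b, rcond=None)[0]
    return theta.reshape(n_states, n_actions)

def lspi(S, A, R, S_next, n_states, n_actions, gamma, n_iter=10):
    pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
    for _ in range(n_iter):
        q = lstdq(S, A, R, S_next, pi, n_states, n_actions, gamma)
        pi = q.argmax(axis=1)                        # greedy improvement
    return pi, q

# Example usage with a handful of transitions on 3 states, 2 actions.
S      = np.array([0, 1, 2, 0, 1])
A      = np.array([0, 1, 0, 1, 0])
R      = np.array([1.0, 0.0, 0.5, 0.2, 0.0])
S_next = np.array([1, 2, 0, 2, 1])
pi, q = lspi(S, A, R, S_next, n_states=3, n_actions=2, gamma=0.9)
```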

  15. Approximate Value Iteration: Approximate backwards induction

Exact backwards induction:
$$V^*_t(s) = \max_{a \in A} \{ r(s, a) + \gamma E_\mu( V^*_{t+1} \mid s_t = s, a_t = a ) \}. \tag{3.1}$$

Iterative approximation:
$$\hat{V}_t(s) = \max_{a \in A} \Big[ r(s, a) + \gamma \sum_{s'} P_\mu(s' \mid s, a) v_{t+1}(s') \Big], \tag{3.2}$$
$$v_t = \arg\min_{v \in V} \| v - \hat{V}_t \|. \tag{3.3}$$

Online gradient estimation:
$$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta \| v_t - \hat{V}_t \|. \tag{3.4}$$
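A sketch of fitted value iteration with a linear value function, combining the backup (3.2) with a gradient step on the squared error, i.e. a special case of (3.3) and (3.4); the model arrays, features, and step size are illustrative choices of my own.

```python
# Fitted value iteration with v_theta(s) = f(s).theta.
import numpy as np

def fitted_value_iteration(P, r, features, gamma, alpha=0.1, n_iter=500):
    n_states = P.shape[0]
    theta = np.zeros(features.shape[1])
    for _ in range(n_iter):
        v = features @ theta                              # current v_theta
        targets = (r + gamma * P @ v).max(axis=1)         # eq. (3.2): hat{V}(s)
        grad = features.T @ (features @ theta - targets)  # grad of 0.5 ||v_theta - hat{V}||^2
        theta -= alpha * grad / n_states                  # eq. (3.4)
    return theta

# Example usage on a random model with tabular features.
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(6), size=(6, 2))
r = rng.uniform(size=(6, 2))
theta = fitted_value_iteration(P, r, features=np.eye(6), gamma=0.9)
```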

  16. Approximate Value Iteration: State aggregation

Aggregated estimate. Let $G = \{ S_0, S_1, \ldots, S_n \}$ be a partition of $S$, with $S_0 = \emptyset$ and $\theta \in \mathbb{R}^n$, and let $f_k(s_t) = \mathbb{I}\{ s_t \in S_k \}$. Then the approximate value function is
$$v(s) = \theta(k), \quad \text{if } s \in S_k, \ k \ne 0. \tag{3.5}$$

Online gradient estimate. Consider the case $\| \cdot \| = \| \cdot \|_2^2$. For $s_t \in S_k$:
$$\theta_{t+1}(k) = (1 - \alpha) \theta_t(k) + \alpha \max_{a \in A} \Big[ r(s_t, a) + \gamma \sum_j P(j \mid s_t, a) v_t(j) \Big]. \tag{3.6}$$
For $s_t \notin S_k$:
$$\theta_{t+1}(k) = \theta_t(k). \tag{3.7}$$
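A sketch of the aggregated update (3.6) and (3.7): only the parameter of the cell containing the visited state moves towards its one-step backup. The model arrays and the partition below are illustrative choices of my own.

```python
# One online state-aggregation update.
import numpy as np

def aggregation_update(theta, cell_of, s_t, P, r, gamma, alpha):
    """Update the aggregated value estimate after visiting state s_t."""
    v = theta[cell_of]                                   # v(s) = theta(k) for s in S_k
    backup = (r[s_t] + gamma * P[s_t] @ v).max()         # eq. (3.6): max_a one-step backup
    k = cell_of[s_t]
    theta = theta.copy()
    theta[k] = (1 - alpha) * theta[k] + alpha * backup   # only cell k changes, eq. (3.7)
    return theta

# Example: 6 states aggregated into 3 cells.
rng = np.random.default_rng(3)
P = rng.dirichlet(np.ones(6), size=(6, 2))
r = rng.uniform(size=(6, 2))
cell_of = np.array([0, 0, 1, 1, 2, 2])
theta = aggregation_update(np.zeros(3), cell_of, s_t=4, P=P, r=r, gamma=0.9, alpha=0.2)
```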
