
Reinforcement Learning: Basic models and algorithms. Optimal decisions, Part VII. Christos Dimitrakakis, Chalmers. November 20, 2013.


  1. Reinforcement Learning: Basic models and algorithms. Optimal decisions, Part VII. Christos Dimitrakakis, Chalmers. November 20, 2013.

  2. Outline: Introduction; the reinforcement learning problem and MDPs (Markov decision processes, description of environments); solutions to bandit problems; algorithms for unknown MDPs (stochastic exact algorithms, stochastic estimation algorithms, online algorithms).

  3. Introduction: Bandit problems. The stochastic $n$-armed bandit problem is specified by a set of reward distributions $P = \{P_i \mid i = 1, \dots, n\}$, with $r_t \mid a_t = i \sim P_i$. The optimal action attains $\max\{\mathbb{E}(r_t \mid a_t = i) \mid i = 1, \dots, n\}$, i.e. $a^*_t \in \arg\max_i \mathbb{E}(r_t \mid a_t = i)$, and the expected utility of a policy $\pi$ is $\mathbb{E}^\pi U_t = \mathbb{E}^\pi \sum_{k=t}^{T} r_k$. When the reward distributions are unknown, they are assumed to belong to a parametric family
  $P = \{P_i(\cdot \mid \omega) \mid \omega \in \Omega\}$,  (1.1)
  rewards are generated as $r_t \mid a_t = i, \omega^* \sim P_i(r \mid \omega^*)$ for the true parameter $\omega^*$, and the expected utility under a belief $\xi$ on $\Omega$ is
  $\mathbb{E}^\pi_\xi U_t = \mathbb{E}^\pi_\xi \sum_{k=t}^{T} r_k$.  (1.2)
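
To make the setup concrete, here is a minimal Python sketch (not from the slides) of a stochastic n-armed bandit with Bernoulli arms; the class name and the particular means are illustrative assumptions.

import numpy as np

class BernoulliBandit:
    """Stochastic n-armed bandit: arm i returns a Bernoulli(mu_i) reward."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # unknown to the learner
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, i):
        # r_t | a_t = i  ~  P_i, here a Bernoulli distribution with mean mu_i
        return float(self.rng.random() < self.means[i])

# The optimal arm is the one with the largest mean reward.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(2), bandit.means.max())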

  4. Introduction: Estimation and Robbins-Monro approximation.

  Algorithm 1 (Robbins-Monro bandit algorithm)
  1: input: step sizes $(\alpha_t)_t$, initial estimates $(\mu_{i,0})_i$, policy $\pi$.
  2: for $t = 1, \dots, T$ do
  3:   Take action $a_t = i$ with probability $\pi(i \mid a_1, \dots, a_{t-1}, r_1, \dots, r_{t-1})$.
  4:   Observe reward $r_t$.
  5:   $\mu_{i,t} = \alpha_{i,t} r_t + (1 - \alpha_{i,t})\,\mu_{i,t-1}$   // estimation step for the chosen arm
  6:   $\mu_{j,t} = \mu_{j,t-1}$ for all $j \neq i$.
  7: end for
  8: return $\mu_T$

  Definition 1 ($\epsilon$-greedy action selection). With probability $1 - \epsilon$, select an apparently best action; otherwise select an action uniformly at random:
  $\hat{\pi}^*_\epsilon \triangleq (1 - \epsilon_t)\,\hat{\pi}^*_t + \epsilon_t \operatorname{Unif}(\mathcal{A})$,  (1.3)
  $\hat{\pi}^*_t(i) = \mathbb{I}\{i \in \hat{\mathcal{A}}^*_t\} / |\hat{\mathcal{A}}^*_t|$, where $\hat{\mathcal{A}}^*_t = \arg\max_{i \in \mathcal{A}} \hat{\mu}_{t,i}$.  (1.4)
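
As an illustration, a minimal Python sketch (assumptions: the BernoulliBandit interface above, a fixed step size and a fixed epsilon) combining the Robbins-Monro estimation step with epsilon-greedy selection:

import numpy as np

def epsilon_greedy_bandit(bandit, T=10000, alpha=0.1, eps=0.1, seed=0):
    """Run epsilon-greedy with Robbins-Monro (constant step size) mean estimates."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(bandit.n_arms)                 # initial estimates mu_{i,0} = 0
    rewards = np.zeros(T)
    for t in range(T):
        if rng.random() < eps:                   # explore with probability eps
            a = rng.integers(bandit.n_arms)
        else:                                    # otherwise pick an apparently best arm
            a = rng.choice(np.flatnonzero(mu == mu.max()))
        r = bandit.pull(a)
        mu[a] = alpha * r + (1 - alpha) * mu[a]  # update only the chosen arm's estimate
        rewards[t] = r
    return mu, rewards

mu_hat, rs = epsilon_greedy_bandit(BernoulliBandit([0.2, 0.5, 0.7]))
print(mu_hat, rs.mean())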

  5. Introduction: Estimation and Robbins-Monro approximation. [Figure: performance over 10,000 steps for $\epsilon_t = 0.1$ and step sizes $\alpha \in \{0.01, 0.1, 0.5\}$.]

  6. Introduction: Estimation and Robbins-Monro approximation. [Figure: performance over 10,000 steps for fixed $\epsilon_t = \epsilon$ with $\epsilon \in \{0.0, 0.1, 1.0\}$ and $\alpha = 0.1$.]

  7. Introduction: Estimation and Robbins-Monro approximation. Main idea of the algorithm: estimate the parameters, and act according to the estimates. Requirements: a good estimation procedure, and a balance between estimation and collecting rewards (exploration versus exploitation).

  8. Introduction: The theory of the approximation. Consider the algorithm
  $\mu_{t+1} = \mu_t + \alpha_t z_{t+1}$.  (1.5)
  Let $h_t = \{\mu_t, z_t, \alpha_t, \dots\}$ be the history of the algorithm.

  Assumption 1. There is a function $f : \mathbb{R}^n \to \mathbb{R}$ such that:
  (i) $f(x) \geq 0$ for all $x \in \mathbb{R}^n$.
  (ii) (Lipschitz derivative) $f$ is continuously differentiable and there exists $L > 0$ such that $\|\nabla f(x) - \nabla f(y)\| \leq L\,\|x - y\|$ for all $x, y \in \mathbb{R}^n$.
  (iii) (Pseudo-gradient) There exists $c > 0$ such that $c\,\|\nabla f(\mu_t)\|^2 \leq -\nabla f(\mu_t)^\top \mathbb{E}(z_{t+1} \mid h_t)$ for all $t$.
  (iv) There exist $K_1, K_2 > 0$ such that $\mathbb{E}(\|z_{t+1}\|^2 \mid h_t) \leq K_1 + K_2\,\|\nabla f(\mu_t)\|^2$.
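
As a worked check (not on the slide), mean estimation fits this framework: take $z_{t+1} = x_{t+1} - \mu_t$ with $x_{t+1}$ i.i.d. with mean $m$ and finite variance, and $f(\mu) = \tfrac{1}{2}\|\mu - m\|^2$. Then

\[
\nabla f(\mu) = \mu - m, \qquad \mathbb{E}(z_{t+1} \mid h_t) = m - \mu_t = -\nabla f(\mu_t),
\]

so (iii) holds with $c = 1$ and (ii) holds with $L = 1$, while

\[
\mathbb{E}(\|z_{t+1}\|^2 \mid h_t) = \operatorname{Var}(x_{t+1}) + \|\mu_t - m\|^2 = \operatorname{Var}(x_{t+1}) + \|\nabla f(\mu_t)\|^2,
\]

so (iv) holds with $K_1 = \operatorname{Var}(x_{t+1})$ and $K_2 = 1$.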

  9. Introduction: The theory of the approximation.

  Theorem 2. For the algorithm $\mu_{t+1} = \mu_t + \alpha_t z_{t+1}$, where the step sizes $\alpha_t \geq 0$ satisfy
  $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$,  (1.6)
  and under Assumption 1, the following hold with probability 1:
  1. The sequence $\{f(\mu_t)\}$ converges.
  2. $\lim_{t \to \infty} \nabla f(\mu_t) = 0$.
  3. Every limit point $\mu^*$ of $\mu_t$ satisfies $\nabla f(\mu^*) = 0$.

  10. Introduction: The theory of the approximation. A demonstration. [Figure: estimates $\mu_t$ over 1,000 steps when estimating the expectation of $x_t \sim \mathcal{N}(0.5, 1)$ with three step-size schedules, $\alpha_t = 1/t$, $\alpha_t = 1/\sqrt{t}$ and $\alpha_t = t^{-3/2}$.]
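
A short Python sketch (not from the slides) reproducing this comparison; note that $\alpha_t = t^{-3/2}$ violates $\sum_t \alpha_t = \infty$ in (1.6), so that schedule stalls near its initial value.

import numpy as np

def estimate_mean(schedule, T=1000, true_mean=0.5, seed=0):
    """Stochastic approximation mu_{t+1} = mu_t + alpha_t (x_{t+1} - mu_t)."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for t in range(1, T + 1):
        x = rng.normal(true_mean, 1.0)
        mu += schedule(t) * (x - mu)
    return mu

schedules = {"1/t": lambda t: 1.0 / t,
             "1/sqrt(t)": lambda t: 1.0 / np.sqrt(t),
             "t^(-3/2)": lambda t: t ** -1.5}
for name, sched in schedules.items():
    # The first two schedules approach 0.5; t^(-3/2) barely moves from 0.
    print(name, estimate_mean(sched))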

  11. Dynamic problems.

  Algorithm 2 (Generic reinforcement learning algorithm)
  1: input: update rule $f : \Theta \times \mathcal{S}^2 \times \mathcal{A} \times \mathbb{R} \to \Theta$, initial parameters $\theta_0 \in \Theta$, policy $\pi : \mathcal{S} \times \Theta \to \mathcal{D}(\mathcal{A})$.
  2: for $t = 1, \dots, T$ do
  3:   $a_t \sim \pi(\cdot \mid \theta_t, s_t)$   // take action
  4:   Observe reward $r_{t+1}$ and state $s_{t+1}$.
  5:   $\theta_{t+1} = f(\theta_t, s_t, a_t, r_{t+1}, s_{t+1})$   // update estimate
  6: end for

  Questions: What should we estimate? What policy should we use?
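
A minimal Python sketch of this loop, assuming a hypothetical environment with reset() and step(a) methods (the interface names are not from the slides):

def run_agent(env, policy, update, theta, T=1000):
    """Generic RL loop: act, observe (r, s'), update the parameter estimate."""
    s = env.reset()
    for t in range(T):
        a = policy(theta, s)                      # a_t ~ pi(. | theta_t, s_t)
        s_next, r = env.step(a)                   # observe reward r_{t+1} and state s_{t+1}
        theta = update(theta, s, a, r, s_next)    # theta_{t+1} = f(theta_t, s_t, a_t, r_{t+1}, s_{t+1})
        s = s_next
    return theta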

  12. Dynamic problems. [Figure: the chain task; the diagram marks the values 0.2, 0, 0, 0, 1.]

  Table: the chain task's value function for $\gamma = 0.95$.

      s          s1      s2      s3      s4      s5
      V*(s)      6.672   7.111   7.689   8.449   9.449
      Q*(s,1)    6.622   6.532   6.676   6.866   7.866
      Q*(s,2)    6.672   7.111   7.689   8.449   9.449
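
The slide does not state the chain task's transition model, so the following value-iteration sketch is generic; the small example MDP at the bottom is a hypothetical stand-in to show the call signature, not the task that produced the table above.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Compute V* and Q* for a finite MDP with P[s, a, s'] transition probabilities
    and R[s, a] expected rewards."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V               # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Hypothetical 2-state, 2-action MDP (illustrative only).
P = np.array([[[0.8, 0.2], [1.0, 0.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.2],
              [1.0, 0.2]])
V_star, Q_star = value_iteration(P, R, gamma=0.95)
print(V_star, Q_star)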

  13. Dynamic problems: Monte-Carlo policy evaluation and iteration.

  Algorithm 3 (Stochastic policy evaluation)
  1: input: initial parameters $v_0$, Markov policy $\pi$.
  2: for $s \in \mathcal{S}$ do
  3:   Set $s_1 = s$.
  4:   for $k = 1, \dots, K$ do
  5:     Run policy $\pi$ for $T$ steps.
  6:     Observe the utility $U_k = \sum_t r_t$.
  7:     Update the estimate: $v_{k+1}(s) = v_k(s) + \alpha_k (U_k - v_k(s))$.
  8:   end for
  9: end for
  10: return $v_K$

  For $\alpha_k = 1/k$ and iterating over all of $\mathcal{S}$, this is the same as Monte-Carlo policy evaluation.

  Algorithm 4 (Approximate policy iteration)
  1: input: initial parameters $v_0$, initial Markov policy $\pi_0$, stochastic estimator $f$.
  2: for $i = 1, \dots, N$ do
  3:   Obtain the estimate $v_i = f(v_{i-1}, \pi_{i-1})$.
  4:   Calculate the new policy $\pi_i = \arg\max_\pi \mathcal{L}_\pi v_i$.
  5: end for
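
A minimal Python sketch of Algorithm 3 (assumptions: the hypothetical env.reset()/env.step() interface from above, extended with a way to choose the start state, and a finite horizon T):

import numpy as np

def stochastic_policy_evaluation(env, policy, n_states, K=100, T=50, seed=0):
    """Estimate v(s) for each start state by averaging K sampled T-step utilities."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    for s0 in range(n_states):
        for k in range(1, K + 1):
            s = env.reset(state=s0)          # hypothetical: start the episode in state s0
            U = 0.0
            for _ in range(T):
                a = policy(s, rng)
                s, r = env.step(a)
                U += r                       # utility U_k = sum_t r_t
            v[s0] += (U - v[s0]) / k         # alpha_k = 1/k gives the Monte-Carlo average
    return v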

  14. Dynamic problems: Monte-Carlo policy evaluation and iteration.

  Monte Carlo update. Note that the trajectory $s_1, \dots, s_T$ contains the sub-trajectory $s_k, \dots, s_T$, so every visited state can be updated from the remaining rewards.

  Algorithm 5 (Every-visit Monte-Carlo update)
  1: input: initial parameters $v_1$, trajectory $s_1, \dots, s_T$, rewards $r_1, \dots, r_T$, visit counts $n$.
  2: for $t = 1, \dots, T$ do
  3:   $U_t = \sum_{k=t}^{T} r_k$
  4:   $n_t(s_t) = n_{t-1}(s_t) + 1$
  5:   $v_{t+1}(s_t) = v_t(s_t) + \alpha_{n_t(s_t)}(s_t)\,(U_t - v_t(s_t))$
  6:   $n_t(s) = n_{t-1}(s)$, $v_{t+1}(s) = v_t(s)$ for all $s \neq s_t$.
  7: end for
  8: return $v_{T+1}$

  Example 3. Consider a two-state chain with $P(s_{t+1} = 1 \mid s_t = 0) = \delta$ and $P(s_{t+1} = 1 \mid s_t = 1) = 1$, and rewards $r(1) = 1$, $r(0) = 0$. Then the every-visit estimate is biased.
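
A minimal Python sketch of the every-visit update (assumptions: undiscounted returns, and step size 1/n(s) unless a step-size function is supplied):

def every_visit_mc_update(v, n, states, rewards, alpha=None):
    """Every-visit Monte-Carlo update of the value estimates v, given one trajectory.
    states[t] and rewards[t] describe the trajectory; n[s] counts visits across trajectories."""
    T = len(states)
    # Returns U_t = sum_{k >= t} r_k, computed backwards.
    U, returns = 0.0, [0.0] * T
    for t in range(T - 1, -1, -1):
        U += rewards[t]
        returns[t] = U
    for t in range(T):                      # every visit of a state triggers an update
        s = states[t]
        n[s] += 1
        step = alpha(n[s]) if alpha else 1.0 / n[s]
        v[s] += step * (returns[t] - v[s])
    return v, n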

  15. Dynamic problems: Monte-Carlo policy evaluation and iteration.

  Unbiased Monte-Carlo update.

  Algorithm 6 (First-visit Monte-Carlo update)
  1: input: initial parameters $v_1$, trajectory $s_1, \dots, s_T$, rewards $r_1, \dots, r_T$, visit counts $n$.
  2: Let $m \in \mathbb{N}^{|\mathcal{S}|}$ be the within-trajectory visit counts.
  3: for $t = 1, \dots, T$ do
  4:   $U_t = \sum_{k=t}^{T} r_k$
  5:   $n_t(s_t) = n_{t-1}(s_t) + 1$, $m_t(s_t) = m_{t-1}(s_t) + 1$
  6:   $v_{t+1}(s_t) = v_t(s_t) + \alpha_{n_t(s_t)}(s_t)\,(U_t - v_t(s_t))$ if $m_t(s_t) = 1$,
  7:   $n_t(s) = n_{t-1}(s)$, $v_{t+1}(s) = v_t(s)$ otherwise.
  8: end for
  9: return $v_{T+1}$
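
The corresponding Python sketch only updates a state at its first occurrence within the trajectory (same assumptions as above):

def first_visit_mc_update(v, n, states, rewards, alpha=None):
    """First-visit Monte-Carlo update: each state is updated at most once per trajectory,
    using the return from its first occurrence, which makes the estimate unbiased."""
    T = len(states)
    U, returns = 0.0, [0.0] * T
    for t in range(T - 1, -1, -1):          # returns U_t = sum_{k >= t} r_k
        U += rewards[t]
        returns[t] = U
    seen = set()
    for t in range(T):
        s = states[t]
        if s in seen:                       # skip every visit after the first one
            continue
        seen.add(s)
        n[s] += 1
        step = alpha(n[s]) if alpha else 1.0 / n[s]
        v[s] += step * (returns[t] - v[s])
    return v, n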

  16. Dynamic problems: Monte-Carlo policy evaluation and iteration. [Figure: the error $\|v_t - V^\pi\|$ on a log scale against the number of iterations (up to 10,000), for first-visit and every-visit Monte Carlo estimation.]

  17. Dynamic problems: Temporal difference methods.

  Temporal differences. The full stochastic update is of the form $v_{k+1}(s) = v_k(s) + \alpha (U_k - v_k(s))$. Writing the temporal difference error as $d(s_t, s_{t+1}) \triangleq r(s_t) + \gamma v(s_{t+1}) - v(s_t)$ (the sign is chosen so that the errors telescope into $U_k - v_k(s)$), the update becomes
  $v_{k+1}(s) = v_k(s) + \alpha \sum_t \gamma^t d_t$, where $d_t \triangleq d(s_t, s_{t+1})$.  (2.1)
  The stochastic, incremental version of the update is
  $v_{t+1}(s) = v_t(s) + \alpha \gamma^t d_t$.  (2.2)

  TD($\lambda$). The temporal-difference operator is
  $v_{n+1}(i) = v_n(i) + \tau_n(i)$, with $\tau_n(i) \triangleq \sum_{t=0}^{\infty} \mathbb{E}^{\pi_n, \mu}\!\left[(\gamma\lambda)^t d_n(s_t, s_{t+1}) \mid s_0 = i\right]$,
  and its stochastic update is
  $v_{n+1}(s_t) = v_n(s_t) + \alpha \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t} d_k$.  (2.3)
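
A minimal Python sketch of the stochastic update (2.3), applied offline at the end of a recorded trajectory; it assumes the sign convention above and that states has one more entry than rewards (the state reached after the last reward):

import numpy as np

def td_lambda_offline(v, states, rewards, gamma=0.95, lam=0.9, alpha=0.1):
    """Forward-view TD(lambda): update each visited state with the discounted sum
    of subsequent TD errors, as in equation (2.3)."""
    v = np.array(v, dtype=float)
    T = len(rewards)
    # TD errors d_k = r_k + gamma * v(s_{k+1}) - v(s_k), computed with the current estimates.
    d = np.array([rewards[k] + gamma * v[states[k + 1]] - v[states[k]] for k in range(T)])
    for t in range(T):
        weights = (gamma * lam) ** np.arange(T - t)   # (gamma * lambda)^{k - t} for k = t, ..., T-1
        v[states[t]] += alpha * np.dot(weights, d[t:])
    return v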
