
Reinforcement Learning: Basic models and algorithms. Optimal decisions, Part VII. Christos Dimitrakakis, Chalmers. November 20, 2013.


  1. Reinforcement Learning: Basic models and algorithms. Optimal decisions, Part VII. Christos Dimitrakakis, Chalmers. November 20, 2013.

  2. Outline: Introduction; the reinforcement learning problem and MDPs (Markov decision processes, description of environments); solutions to bandit problems; algorithms for unknown MDPs (stochastic exact algorithms, stochastic estimation algorithms, online algorithms).

  3. Introduction: Bandit problems. The stochastic $n$-armed bandit problem is specified by a set of reward distributions $P = \{P_i \mid i = 1, \dots, n\}$, with $r_t \mid a_t = i \sim P_i$. The optimal action attains $\max\{\mathbb{E}(r_t \mid a_t = i) \mid i = 1, \dots, n\}$, i.e. $a^*_t \in \arg\max_i \mathbb{E}(r_t \mid a_t = i)$, and the expected utility of a policy $\pi$ is $\mathbb{E}^\pi U_t = \mathbb{E}^\pi \sum_{k=t}^{T} r_k$. When the reward distributions are unknown, they are assumed to belong to a parametric family
  $P = \{P_i(\cdot \mid \omega) \mid \omega \in \Omega\}$,  (1.1)
  rewards are generated as $r_t \mid a_t = i, \omega^* \sim P_i(r \mid \omega^*)$ for the true parameter $\omega^*$, and the expected utility under a belief $\xi$ on $\Omega$ is
  $\mathbb{E}^\pi_\xi U_t = \mathbb{E}^\pi_\xi \sum_{k=t}^{T} r_k$.  (1.2)
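
To make the setup concrete, here is a minimal Python sketch (not from the slides) of a stochastic n-armed bandit with Bernoulli arms; the class name and the particular means are illustrative assumptions.

import numpy as np

class BernoulliBandit:
    """Stochastic n-armed bandit: arm i returns a Bernoulli(mu_i) reward."""
    def __init__(self, means, seed=0):
        self.means = np.asarray(means)          # unknown to the learner
        self.rng = np.random.default_rng(seed)

    @property
    def n_arms(self):
        return len(self.means)

    def pull(self, i):
        # r_t | a_t = i  ~  P_i, here a Bernoulli distribution with mean mu_i
        return float(self.rng.random() < self.means[i])

# The optimal arm is the one with the largest mean reward.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
print(bandit.pull(2), bandit.means.max())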

  4. Introduction: Estimation and Robbins-Monro approximation.

  Algorithm 1 (Robbins-Monro bandit algorithm)
  1: input: step sizes $(\alpha_t)_t$, initial estimates $(\mu_{i,0})_i$, policy $\pi$.
  2: for $t = 1, \dots, T$ do
  3:   Take action $a_t = i$ with probability $\pi(i \mid a_1, \dots, a_{t-1}, r_1, \dots, r_{t-1})$.
  4:   Observe reward $r_t$.
  5:   $\mu_{i,t} = \alpha_{i,t} r_t + (1 - \alpha_{i,t})\,\mu_{i,t-1}$   // estimation step for the chosen arm
  6:   $\mu_{j,t} = \mu_{j,t-1}$ for all $j \neq i$.
  7: end for
  8: return $\mu_T$

  Definition 1 ($\epsilon$-greedy action selection). With probability $1 - \epsilon$, select an apparently best action; otherwise select an action uniformly at random:
  $\hat{\pi}^*_\epsilon \triangleq (1 - \epsilon_t)\,\hat{\pi}^*_t + \epsilon_t \operatorname{Unif}(\mathcal{A})$,  (1.3)
  $\hat{\pi}^*_t(i) = \mathbb{I}\{i \in \hat{\mathcal{A}}^*_t\} / |\hat{\mathcal{A}}^*_t|$, where $\hat{\mathcal{A}}^*_t = \arg\max_{i \in \mathcal{A}} \hat{\mu}_{t,i}$.  (1.4)
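
As an illustration, a minimal Python sketch (assumptions: the BernoulliBandit interface above, a fixed step size and a fixed epsilon) combining the Robbins-Monro estimation step with epsilon-greedy selection:

import numpy as np

def epsilon_greedy_bandit(bandit, T=10000, alpha=0.1, eps=0.1, seed=0):
    """Run epsilon-greedy with Robbins-Monro (constant step size) mean estimates."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(bandit.n_arms)                 # initial estimates mu_{i,0} = 0
    rewards = np.zeros(T)
    for t in range(T):
        if rng.random() < eps:                   # explore with probability eps
            a = rng.integers(bandit.n_arms)
        else:                                    # otherwise pick an apparently best arm
            a = rng.choice(np.flatnonzero(mu == mu.max()))
        r = bandit.pull(a)
        mu[a] = alpha * r + (1 - alpha) * mu[a]  # update only the chosen arm's estimate
        rewards[t] = r
    return mu, rewards

mu_hat, rs = epsilon_greedy_bandit(BernoulliBandit([0.2, 0.5, 0.7]))
print(mu_hat, rs.mean())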

  5. Introduction: Estimation and Robbins-Monro approximation. [Figure: performance over 10,000 steps for $\epsilon_t = 0.1$ and step sizes $\alpha \in \{0.01, 0.1, 0.5\}$.]

  6. Introduction: Estimation and Robbins-Monro approximation. [Figure: performance over 10,000 steps for fixed $\epsilon_t = \epsilon$ with $\epsilon \in \{0.0, 0.1, 1.0\}$ and $\alpha = 0.1$.]

  7. Introduction: Estimation and Robbins-Monro approximation. Main idea of the algorithm: estimate the parameters, and act according to the estimates. Requirements: a good estimation procedure, and a balance between estimation and collecting rewards (exploration versus exploitation).

  8. Introduction: The theory of the approximation. Consider the algorithm
  $\mu_{t+1} = \mu_t + \alpha_t z_{t+1}$.  (1.5)
  Let $h_t = \{\mu_t, z_t, \alpha_t, \dots\}$ be the history of the algorithm.

  Assumption 1. There is a function $f : \mathbb{R}^n \to \mathbb{R}$ such that:
  (i) $f(x) \geq 0$ for all $x \in \mathbb{R}^n$.
  (ii) (Lipschitz derivative) $f$ is continuously differentiable and there exists $L > 0$ such that $\|\nabla f(x) - \nabla f(y)\| \leq L\,\|x - y\|$ for all $x, y \in \mathbb{R}^n$.
  (iii) (Pseudo-gradient) There exists $c > 0$ such that $c\,\|\nabla f(\mu_t)\|^2 \leq -\nabla f(\mu_t)^\top \mathbb{E}(z_{t+1} \mid h_t)$ for all $t$.
  (iv) There exist $K_1, K_2 > 0$ such that $\mathbb{E}(\|z_{t+1}\|^2 \mid h_t) \leq K_1 + K_2\,\|\nabla f(\mu_t)\|^2$.
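
As a worked check (not on the slide), mean estimation fits this framework: take $z_{t+1} = x_{t+1} - \mu_t$ with $x_{t+1}$ i.i.d. with mean $m$ and finite variance, and $f(\mu) = \tfrac{1}{2}\|\mu - m\|^2$. Then

\[
\nabla f(\mu) = \mu - m, \qquad \mathbb{E}(z_{t+1} \mid h_t) = m - \mu_t = -\nabla f(\mu_t),
\]

so (iii) holds with $c = 1$ and (ii) holds with $L = 1$, while

\[
\mathbb{E}(\|z_{t+1}\|^2 \mid h_t) = \operatorname{Var}(x_{t+1}) + \|\mu_t - m\|^2 = \operatorname{Var}(x_{t+1}) + \|\nabla f(\mu_t)\|^2,
\]

so (iv) holds with $K_1 = \operatorname{Var}(x_{t+1})$ and $K_2 = 1$.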

  9. Introduction: The theory of the approximation.

  Theorem 2. For the algorithm $\mu_{t+1} = \mu_t + \alpha_t z_{t+1}$, where the step sizes $\alpha_t \geq 0$ satisfy
  $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$,  (1.6)
  and under Assumption 1, the following hold with probability 1:
  1. The sequence $\{f(\mu_t)\}$ converges.
  2. $\lim_{t \to \infty} \nabla f(\mu_t) = 0$.
  3. Every limit point $\mu^*$ of $\mu_t$ satisfies $\nabla f(\mu^*) = 0$.

  10. Introduction: The theory of the approximation. A demonstration. [Figure: estimates $\mu_t$ over 1,000 steps when estimating the expectation of $x_t \sim \mathcal{N}(0.5, 1)$ with three step-size schedules, $\alpha_t = 1/t$, $\alpha_t = 1/\sqrt{t}$ and $\alpha_t = t^{-3/2}$.]
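
A short Python sketch (not from the slides) reproducing this comparison; note that $\alpha_t = t^{-3/2}$ violates $\sum_t \alpha_t = \infty$ in (1.6), so that schedule stalls near its initial value.

import numpy as np

def estimate_mean(schedule, T=1000, true_mean=0.5, seed=0):
    """Stochastic approximation mu_{t+1} = mu_t + alpha_t (x_{t+1} - mu_t)."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    for t in range(1, T + 1):
        x = rng.normal(true_mean, 1.0)
        mu += schedule(t) * (x - mu)
    return mu

schedules = {"1/t": lambda t: 1.0 / t,
             "1/sqrt(t)": lambda t: 1.0 / np.sqrt(t),
             "t^(-3/2)": lambda t: t ** -1.5}
for name, sched in schedules.items():
    # The first two schedules approach 0.5; t^(-3/2) barely moves from 0.
    print(name, estimate_mean(sched))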

  11. Dynamic problems.

  Algorithm 2 (Generic reinforcement learning algorithm)
  1: input: update rule $f : \Theta \times \mathcal{S}^2 \times \mathcal{A} \times \mathbb{R} \to \Theta$, initial parameters $\theta_0 \in \Theta$, policy $\pi : \mathcal{S} \times \Theta \to \mathcal{D}(\mathcal{A})$.
  2: for $t = 1, \dots, T$ do
  3:   $a_t \sim \pi(\cdot \mid \theta_t, s_t)$   // take action
  4:   Observe reward $r_{t+1}$ and state $s_{t+1}$.
  5:   $\theta_{t+1} = f(\theta_t, s_t, a_t, r_{t+1}, s_{t+1})$   // update estimate
  6: end for

  Questions: What should we estimate? What policy should we use?
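
A minimal Python sketch of this loop, assuming a hypothetical environment with reset() and step(a) methods (the interface names are not from the slides):

def run_agent(env, policy, update, theta, T=1000):
    """Generic RL loop: act, observe (r, s'), update the parameter estimate."""
    s = env.reset()
    for t in range(T):
        a = policy(theta, s)                      # a_t ~ pi(. | theta_t, s_t)
        s_next, r = env.step(a)                   # observe reward r_{t+1} and state s_{t+1}
        theta = update(theta, s, a, r, s_next)    # theta_{t+1} = f(theta_t, s_t, a_t, r_{t+1}, s_{t+1})
        s = s_next
    return theta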

  12. Dynamic problems. [Figure: the chain task; the diagram marks the values 0.2, 0, 0, 0, 1.]

  Table: the chain task's value function for $\gamma = 0.95$.

      s          s1      s2      s3      s4      s5
      V*(s)      6.672   7.111   7.689   8.449   9.449
      Q*(s,1)    6.622   6.532   6.676   6.866   7.866
      Q*(s,2)    6.672   7.111   7.689   8.449   9.449
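
The slide does not state the chain task's transition model, so the following value-iteration sketch is generic; the small example MDP at the bottom is a hypothetical stand-in to show the call signature, not the task that produced the table above.

import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Compute V* and Q* for a finite MDP with P[s, a, s'] transition probabilities
    and R[s, a] expected rewards."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R + gamma * P @ V               # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# Hypothetical 2-state, 2-action MDP (illustrative only).
P = np.array([[[0.8, 0.2], [1.0, 0.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.2],
              [1.0, 0.2]])
V_star, Q_star = value_iteration(P, R, gamma=0.95)
print(V_star, Q_star)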

  13. Dynamic problems: Monte-Carlo policy evaluation and iteration.

  Algorithm 3 (Stochastic policy evaluation)
  1: input: initial parameters $v_0$, Markov policy $\pi$.
  2: for $s \in \mathcal{S}$ do
  3:   Set $s_1 = s$.
  4:   for $k = 1, \dots, K$ do
  5:     Run policy $\pi$ for $T$ steps.
  6:     Observe the utility $U_k = \sum_t r_t$.
  7:     Update the estimate: $v_{k+1}(s) = v_k(s) + \alpha_k (U_k - v_k(s))$.
  8:   end for
  9: end for
  10: return $v_K$

  For $\alpha_k = 1/k$ and iterating over all of $\mathcal{S}$, this is the same as Monte-Carlo policy evaluation.

  Algorithm 4 (Approximate policy iteration)
  1: input: initial parameters $v_0$, initial Markov policy $\pi_0$, stochastic estimator $f$.
  2: for $i = 1, \dots, N$ do
  3:   Obtain the estimate $v_i = f(v_{i-1}, \pi_{i-1})$.
  4:   Calculate the new policy $\pi_i = \arg\max_\pi \mathcal{L}_\pi v_i$.
  5: end for
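
A minimal Python sketch of Algorithm 3 (assumptions: the hypothetical env.reset()/env.step() interface from above, extended with a way to choose the start state, and a finite horizon T):

import numpy as np

def stochastic_policy_evaluation(env, policy, n_states, K=100, T=50, seed=0):
    """Estimate v(s) for each start state by averaging K sampled T-step utilities."""
    rng = np.random.default_rng(seed)
    v = np.zeros(n_states)
    for s0 in range(n_states):
        for k in range(1, K + 1):
            s = env.reset(state=s0)          # hypothetical: start the episode in state s0
            U = 0.0
            for _ in range(T):
                a = policy(s, rng)
                s, r = env.step(a)
                U += r                       # utility U_k = sum_t r_t
            v[s0] += (U - v[s0]) / k         # alpha_k = 1/k gives the Monte-Carlo average
    return v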

  14. Dynamic problems: Monte-Carlo policy evaluation and iteration.

  Monte Carlo update. Note that the trajectory $s_1, \dots, s_T$ contains the sub-trajectory $s_k, \dots, s_T$, so every visited state can be updated from the remaining rewards.

  Algorithm 5 (Every-visit Monte-Carlo update)
  1: input: initial parameters $v_1$, trajectory $s_1, \dots, s_T$, rewards $r_1, \dots, r_T$, visit counts $n$.
  2: for $t = 1, \dots, T$ do
  3:   $U_t = \sum_{k=t}^{T} r_k$
  4:   $n_t(s_t) = n_{t-1}(s_t) + 1$
  5:   $v_{t+1}(s_t) = v_t(s_t) + \alpha_{n_t(s_t)}(s_t)\,(U_t - v_t(s_t))$
  6:   $n_t(s) = n_{t-1}(s)$, $v_{t+1}(s) = v_t(s)$ for all $s \neq s_t$.
  7: end for
  8: return $v_{T+1}$

  Example 3. Consider a two-state chain with $P(s_{t+1} = 1 \mid s_t = 0) = \delta$ and $P(s_{t+1} = 1 \mid s_t = 1) = 1$, and rewards $r(1) = 1$, $r(0) = 0$. Then the every-visit estimate is biased.
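
A minimal Python sketch of the every-visit update (assumptions: undiscounted returns, and step size 1/n(s) unless a step-size function is supplied):

def every_visit_mc_update(v, n, states, rewards, alpha=None):
    """Every-visit Monte-Carlo update of the value estimates v, given one trajectory.
    states[t] and rewards[t] describe the trajectory; n[s] counts visits across trajectories."""
    T = len(states)
    # Returns U_t = sum_{k >= t} r_k, computed backwards.
    U, returns = 0.0, [0.0] * T
    for t in range(T - 1, -1, -1):
        U += rewards[t]
        returns[t] = U
    for t in range(T):                      # every visit of a state triggers an update
        s = states[t]
        n[s] += 1
        step = alpha(n[s]) if alpha else 1.0 / n[s]
        v[s] += step * (returns[t] - v[s])
    return v, n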

  15. Dynamic problems: Monte-Carlo policy evaluation and iteration.

  Unbiased Monte-Carlo update.

  Algorithm 6 (First-visit Monte-Carlo update)
  1: input: initial parameters $v_1$, trajectory $s_1, \dots, s_T$, rewards $r_1, \dots, r_T$, visit counts $n$.
  2: Let $m \in \mathbb{N}^{|\mathcal{S}|}$ be the within-trajectory visit counts.
  3: for $t = 1, \dots, T$ do
  4:   $U_t = \sum_{k=t}^{T} r_k$
  5:   $n_t(s_t) = n_{t-1}(s_t) + 1$, $m_t(s_t) = m_{t-1}(s_t) + 1$
  6:   $v_{t+1}(s_t) = v_t(s_t) + \alpha_{n_t(s_t)}(s_t)\,(U_t - v_t(s_t))$ if $m_t(s_t) = 1$,
  7:   $n_t(s) = n_{t-1}(s)$, $v_{t+1}(s) = v_t(s)$ otherwise.
  8: end for
  9: return $v_{T+1}$
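
The corresponding Python sketch only updates a state at its first occurrence within the trajectory (same assumptions as above):

def first_visit_mc_update(v, n, states, rewards, alpha=None):
    """First-visit Monte-Carlo update: each state is updated at most once per trajectory,
    using the return from its first occurrence, which makes the estimate unbiased."""
    T = len(states)
    U, returns = 0.0, [0.0] * T
    for t in range(T - 1, -1, -1):          # returns U_t = sum_{k >= t} r_k
        U += rewards[t]
        returns[t] = U
    seen = set()
    for t in range(T):
        s = states[t]
        if s in seen:                       # skip every visit after the first one
            continue
        seen.add(s)
        n[s] += 1
        step = alpha(n[s]) if alpha else 1.0 / n[s]
        v[s] += step * (returns[t] - v[s])
    return v, n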

  16. Dynamic problems: Monte-Carlo policy evaluation and iteration. [Figure: the error $\|v_t - V^\pi\|$ on a log scale against the number of iterations (up to 10,000), for first-visit and every-visit Monte Carlo estimation.]

  17. Dynamic problems: Temporal difference methods.

  Temporal differences. The full stochastic update is of the form $v_{k+1}(s) = v_k(s) + \alpha (U_k - v_k(s))$. Writing the temporal difference error as $d(s_t, s_{t+1}) \triangleq r(s_t) + \gamma v(s_{t+1}) - v(s_t)$ (the sign is chosen so that the errors telescope into $U_k - v_k(s)$), the update becomes
  $v_{k+1}(s) = v_k(s) + \alpha \sum_t \gamma^t d_t$, where $d_t \triangleq d(s_t, s_{t+1})$.  (2.1)
  The stochastic, incremental version of the update is
  $v_{t+1}(s) = v_t(s) + \alpha \gamma^t d_t$.  (2.2)

  TD($\lambda$). The temporal-difference operator is
  $v_{n+1}(i) = v_n(i) + \tau_n(i)$, with $\tau_n(i) \triangleq \sum_{t=0}^{\infty} \mathbb{E}^{\pi_n, \mu}\!\left[(\gamma\lambda)^t d_n(s_t, s_{t+1}) \mid s_0 = i\right]$,
  and its stochastic update is
  $v_{n+1}(s_t) = v_n(s_t) + \alpha \sum_{k=t}^{\infty} (\gamma\lambda)^{k-t} d_k$.  (2.3)
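
A minimal Python sketch of the stochastic update (2.3), applied offline at the end of a recorded trajectory; it assumes the sign convention above and that states has one more entry than rewards (the state reached after the last reward):

import numpy as np

def td_lambda_offline(v, states, rewards, gamma=0.95, lam=0.9, alpha=0.1):
    """Forward-view TD(lambda): update each visited state with the discounted sum
    of subsequent TD errors, as in equation (2.3)."""
    v = np.array(v, dtype=float)
    T = len(rewards)
    # TD errors d_k = r_k + gamma * v(s_{k+1}) - v(s_k), computed with the current estimates.
    d = np.array([rewards[k] + gamma * v[states[k + 1]] - v[states[k]] for k in range(T)])
    for t in range(T):
        weights = (gamma * lam) ** np.arange(T - t)   # (gamma * lambda)^{k - t} for k = t, ..., T-1
        v[states[t]] += alpha * np.dot(weights, d[t:])
    return v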
