Markov Decision Process and Reinforcement Learning
Zeqian (Chris) Li, Feb 28, 2019


  1. Markov Decision Process and Reinforcement Learning. Zeqian (Chris) Li, Feb 28, 2019

  2. Outline
     1 Introduction
     2 Markov decision process
     3 Statistical mechanics of MDP
     4 Reinforcement learning
     5 Discussion

  3. Introduction
     Hungry rat experiment, Yale, 1948. Modeling reinforcement: agent-based model.
     [Diagram: agent-environment loop. From state s, the agent picks an action a via the policy π(a|s); the environment returns the next state s' via p(s'|s,a) and a reward r(s,a,s').]
     s: state; a: action; r: reward
     p(s'|s,a): transition probability; r(s,a,s'): reward model; π(a|s): policy
     This is a dynamical process: s_t, a_t, r_t; s_{t+1}, a_{t+1}, r_{t+1}; ...
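
The s_t, a_t, r_t loop above is easy to make concrete in code. Below is a minimal Python sketch of one episode of agent-environment interaction; the reset/step interface and the policy function are illustrative assumptions, not something defined in the slides.

def run_episode(env, policy, max_steps=100):
    """Roll out one episode: s_t, a_t, r_t; s_{t+1}, a_{t+1}, r_{t+1}; ..."""
    trajectory = []
    s = env.reset()                       # initial state s_0
    for t in range(max_steps):
        a = policy(s)                     # sample a_t ~ pi(a|s_t)
        s_next, r, done = env.step(a)     # environment applies p(s'|s,a) and r(s,a,s')
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory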

  4. Examples: Atari games
     State: brick positions, board positions, ball coordinate and velocity
     Action: controller/keyboard inputs
     Reward: game score

  5. Examples: Go
     State: positions of stones
     Action: next move
     Reward: advantage evaluation

  6. Examples: robots (Boston Dynamics)
     State: positions, mass distribution, ...
     Action: adjusting forces on feet
     Reward: chance of falling

  7. Other examples
     Example in physics?

  8. Objective of reinforcement learning
     p(s'|s,a): transition probability; r(s,a,s'): reward model; π(a|s): policy
     Find the optimal policy π*(a|s) that maximizes the expected reward:
       π*(a|s) = argmax_π E[V] = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(t) ]
     (γ: discount factor, 0 ≤ γ < 1)
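
As a quick numeric illustration of the discounted objective (an added example; the reward sequence is chosen arbitrarily):

# Discounted return for a short reward sequence.
gamma = 0.9
rewards = [0, 0, 1, 0, 1]                                  # r(0), r(1), ...
V = sum(gamma**t * r for t, r in enumerate(rewards))
print(V)                                                   # 0.9**2 + 0.9**4 = 1.4661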

  9. Simplest example: one-armed bandits
     [Diagram: one state (s = 0) and two actions.
      Action 0: reward r = 0 with probability p = 1.
      Action 1: reward r = 1 with probability p = 0.9, reward r = 0 with probability p = 0.1.]
     Optimal policy: π*(0|0) = 0, π*(1|0) = 1
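
A tiny numeric check of this bandit in Python (the outcome table below is read off the diagram; purely illustrative):

# (probability, reward) outcomes for each of the two actions in state 0.
outcomes = {
    0: [(1.0, 0.0)],              # action 0: reward 0 with probability 1
    1: [(0.9, 1.0), (0.1, 0.0)],  # action 1: reward 1 w.p. 0.9, reward 0 w.p. 0.1
}
expected = {a: sum(p * r for p, r in outs) for a, outs in outcomes.items()}
print(expected)                    # {0: 0.0, 1: 0.9} -> always pick action 1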

  10. Markov decision process
      Suppose that I have full knowledge of p(s'|s,a) and r(s,a,s'). This is called a Markov decision process (MDP).
      Objective of MDP: compute
        π*(a|s) = argmax_π E[V] = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(t) ]
      This is a computing problem. No learning.

  11. Quality function Q(s,a)
      π*(a|s) = argmax_π E[V] = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(t) ]
      Define
        Q(s,a) = E_π*[ Σ_{t=0}^∞ γ^t r(t) | s_0 = s, a_0 = a ]
      Given the initial state s and the initial action a, Q is the maximum expected future reward.
      Recursive relationship:
        Q(s,a) = Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ max_{a'} Q(s',a') ]
               = E_{s'|s,a}[ r(s,a,s') + γ max_{a'} Q(s',a') ]

  12. Bellman equation
        Q(s,a) = E_{s'|s,a}[ r(s,a,s') + γ max_{a'} Q(s',a') ]
      Solve for Q(s,a) (or φ(s)) from the Bellman equation; the optimal policy is then (as ε → 0):
        π*(a|s) = 1 if a = a*(s) = argmax_a Q(s,a), and 0 otherwise.
      "Curse of dimensionality"

  13. Solve the Bellman equation: iterative method
        Q_{i+1}(s,a) = E_{s'|s,a}[ r(s,a,s') + γ max_{a'} Q_i(s',a') ] = B[Q_i]
      Start with Q_0, and update by Q_{i+1} = B[Q_i]. Convergence can be proved by calculating the Jacobian of B near the fixed point.
      Problem: only one entry (one (s,a) pair) is updated at each iteration; convergence is too slow.
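
A minimal tabular implementation of this iteration (a sketch; the array layout for p and r and the stopping tolerance are assumptions for illustration):

import numpy as np

def q_iteration(p, r, gamma=0.9, tol=1e-8):
    """Tabular Q-iteration: repeatedly apply Q <- B[Q].

    p[s, a, s2] : transition probabilities p(s'|s,a)
    r[s, a, s2] : rewards r(s,a,s')
    """
    Q = np.zeros(p.shape[:2])
    while True:
        # B[Q](s,a) = sum_s' p(s'|s,a) * ( r(s,a,s') + gamma * max_a' Q(s',a') )
        Q_new = (p * (r + gamma * Q.max(axis=1)[None, None, :])).sum(axis=2)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# One-armed bandit from slide 9: one state, two actions, expected rewards 0 and 0.9.
p = np.ones((1, 2, 1))
r = np.array([[[0.0], [0.9]]])
print(q_iteration(p, r))          # Q(0,1) > Q(0,0), so the optimal action is 1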

  14. Statistical mechanics of MDP
      Given s_t, a_t; p(s'|s,a), r(s,a,s'), π(a|s),
      find π*(a|s) = argmax_π E[V] = argmax_π E_π[ Σ_{t=0}^∞ γ^t r(t) ]
      Define ρ_t(s): probability in state s at time t
      Chapman-Kolmogorov equation:
        ρ_{t+1}(s') = Σ_{s,a} p(s'|s,a) π(a|s) ρ_t(s)
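
In matrix form the Chapman-Kolmogorov step is one multiplication by the policy-averaged transition matrix. A short numpy sketch (array names and shapes are assumptions for illustration):

import numpy as np

def propagate(rho0, p, pi, n_steps):
    """rho_{t+1}(s') = sum_{s,a} p(s'|s,a) pi(a|s) rho_t(s).

    rho0[s]      : initial state distribution rho_0
    p[s, a, s2]  : transition probabilities p(s'|s,a)
    pi[s, a]     : policy pi(a|s)
    """
    T = np.einsum('sa,sap->sp', pi, p)   # T[s, s'] = sum_a pi(a|s) p(s'|s,a)
    rho = np.asarray(rho0, dtype=float)
    for _ in range(n_steps):
        rho = rho @ T
    return rho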

  15. V_π = E_{π,ρ}[R] = Σ_{t=0}^∞ γ^t Σ_{s,a,s'} ρ_t(s) π(a|s) p(s'|s,a) r(s,a,s')
      (Let η(s) ≡ Σ_{t=0}^∞ γ^t ρ_t(s), the average residence time in s before death)
          = Σ_{s,a,s'} η(s) π(a|s) p(s'|s,a) r(s,a,s')
      Constraints:
      - η(s) depends on π:  η(s') = ρ_0(s') + γ Σ_{s,a} p(s'|s,a) π(a|s) η(s)
      - Σ_a π(a|s) = 1
      - introduce Lagrange multipliers

  16. F_{π,η} = V_{π,η} − Σ_{s'} φ(s') [ η(s') − ρ_0(s') − γ Σ_{s,a} p(s'|s,a) π(a|s) η(s) ]
                        − Σ_s λ(s) [ Σ_a π(a|s) − 1 ]
      Optimization: δF/δπ(a|s) = 0, δF/δη(s) = 0.
      Problem: F is linear in π → the derivative is constant → the extremum sits on the boundary → the optimal policy is deterministic (0 or 1).
      Introduce non-linearity: entropy
        H_s[π] = − Σ_a π(a|s) log π(a|s)
      (Similar to regularization.)

  17. F_{π,η} = Σ_{s,a,s'} η(s) π(a|s) p(s'|s,a) r(s,a,s')                                (V_{π,η})
                − Σ_{s'} φ(s') [ η(s') − ρ_0(s') − γ Σ_{s,a} p(s'|s,a) π(a|s) η(s) ]      (dynamical constraint)
                − Σ_s λ(s) [ Σ_a π(a|s) − 1 ]                                             (normalization)
                + ε Σ_s η(s) H_s[π]                                                       (entropy)
      Set δF/δπ(a|s) = 0, δF/δη(s) = 0.
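
To connect this to the Boltzmann form on the next slide, here is a sketch of the stationarity condition in π (an added step, using the notation above). Varying F with respect to π(a|s) gives

    η(s) Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ φ(s') ] − λ(s) − ε η(s) [ log π(a|s) + 1 ] = 0.

Defining Q(s,a) ≡ Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ φ(s') ], this reads log π(a|s) = Q(s,a)/ε + const(s), i.e. π(a|s) ∝ exp( Q(s,a)/ε ), and the normalization constraint Σ_a π(a|s) = 1 fixes the constant to the partition function Σ_b exp( Q(s,b)/ε ).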

  18. Results
        π*(a|s) = exp( Q(s,a)/ε ) / Σ_b exp( Q(s,b)/ε )
      Boltzmann distribution! ε: temperature; Q: quality function = (minus) energy!
        Q(s,a) = Σ_{s'} p(s'|s,a) [ r(s,a,s') + γ ε log Σ_{a'} exp( Q(s',a')/ε ) ]
               = E_{s'}[ r(s,a,s') + γ softmax_{a'; ε} Q(s',a') ]
               = E_{s'}[ r(s,a,s') + γ max_{a'} Q(s',a') ]   (ε → 0)
      Can show that
        Q(s,a) = E_π*[ Σ_t γ^t r(t) | s_0 = s, a_0 = a ]
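
In code, the only change relative to the hard-max iteration sketched under slide 13 is replacing max_{a'} with the ε-weighted log-sum-exp. A hedged numpy sketch (array layout as before):

import numpy as np
from scipy.special import logsumexp

def soft_q_iteration(p, r, gamma=0.9, eps=0.1, tol=1e-8):
    """Soft Q-iteration: Q(s,a) <- E_s'[ r + gamma * eps * log sum_a' exp(Q(s',a')/eps) ]."""
    Q = np.zeros(p.shape[:2])
    while True:
        soft_v = eps * logsumexp(Q / eps, axis=1)          # "softmax" value of each state
        Q_new = (p * (r + gamma * soft_v[None, None, :])).sum(axis=2)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

def boltzmann_policy(Q, eps=0.1):
    """pi*(a|s) = exp(Q(s,a)/eps) / sum_b exp(Q(s,b)/eps)."""
    z = Q / eps
    z = z - z.max(axis=1, keepdims=True)                   # for numerical stability
    pi = np.exp(z)
    return pi / pi.sum(axis=1, keepdims=True)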

  19. φ(s): value function = (minus) free energy!
        φ(s) = ε log Σ_a exp( Q(s,a)/ε )
             = softmax_{a; ε} Q(s,a)
             = max_a Q(s,a)   (ε → 0)
      Iterative equation:
        φ(s) = softmax_{a; ε} E_{s'}[ r(s,a,s') + γ φ(s') ]
             = max_a E_{s'}[ r(s,a,s') + γ φ(s') ]   (ε → 0)
      Physical meaning of φ(s): maximum expected future reward, given initial state s.

  20. Spectrum of reinforcement learning problems
      [Diagram: 2x2 chart with vertical axis "accuracy of observation of the state" and horizontal axis "knowledge about the environment p(s'|s,a), r(s,a)".
       Known model, fully observed state: Markov decision process (MDP).
       Unknown model, fully observed state: model-free reinforcement learning.
       Known model, partially observed state: partially observable Markov decision process (POMDP).
       Unknown model, partially observed state: full RL (very hard).]

  21. MDP Bellman equation (ε > 0):
        Q(s,a) = E_{s'|s,a}[ r(s,a,s') + γ softmax_{a'; ε} Q(s',a') ]
      Reinforcement learning: we do not know r(s,a,s') or p(s'|s,a); we only have samples
        (s_0, a_0, s_1; r_0), (s_1, a_1, s_2; r_1), ..., (s_t, a_t, s_{t+1}; r_t), ...
      Rewrite the Bellman equation:
        E_{samples of (·|s,a)}[ r(s,a,·) + γ softmax_{a'; ε} Q(·,a') − Q(s,a) ] = 0

  22. RL algorithm: soft Q-learning
        Q̂_{t+1}(s,a) = Q̂_t(s,a) + α_t [ r_{t+1} + γ softmax_{a'; ε} Q̂_t(s_{t+1}, a') − Q̂_t(s_t, a_t) ] δ_{s,s_t} δ_{a,a_t}
      (Update if s = s_t and a = a_t; otherwise Q̂_{t+1}(s,a) = Q̂_t(s,a))
        π_{t+1}(a|s) = exp( Q̂_{t+1}(s,a)/ε ) / Σ_b exp( Q̂_{t+1}(s,b)/ε )
      Problem: only one entry (one (s,a) pair) is updated at each iteration; convergence is too slow.
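
A compact tabular sketch of this update rule (the reset/step environment interface, the constant learning rate, and the step budget are illustrative assumptions, not from the slides):

import numpy as np
from scipy.special import logsumexp

def soft_q_learning(env, n_states, n_actions, n_steps=10000,
                    gamma=0.9, eps=0.1, alpha=0.1, seed=0):
    """Tabular soft Q-learning from sampled transitions (s_t, a_t, s_{t+1}; r_t)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):
        # Sample a_t from the current Boltzmann policy pi_t(a|s).
        logits = Q[s] / eps
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        a = rng.choice(n_actions, p=pi)
        s_next, r, done = env.step(a)
        # TD update: only the visited (s_t, a_t) entry changes.
        target = r + gamma * eps * logsumexp(Q[s_next] / eps)
        Q[s, a] += alpha * (target - Q[s, a])
        s = env.reset() if done else s_next
    return Q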

  23. Solution: parameterize Q(s,a) by Q(s,a; w), and update w in each iteration.
      Parameterize the function with a small number of parameters: a neural network.
      Deep reinforcement learning:
      Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.
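
To make the parameterized update concrete, here is a minimal semi-gradient sketch for a linear model Q(s,a; w) = w · feat(s,a) (the feature map, learning rate, and soft target are illustrative choices; the Nature paper cited above instead uses a deep convolutional network plus additional tricks such as experience replay and a target network):

import numpy as np
from scipy.special import logsumexp

def td_step(w, feat, transition, actions, gamma=0.9, eps=0.1, alpha=0.01):
    """One semi-gradient soft-TD update of the weights w.

    feat(s, a) -> 1-D feature vector; transition = (s, a, r, s_next).
    """
    s, a, r, s_next = transition
    q_next = np.array([w @ feat(s_next, b) for b in actions])
    target = r + gamma * eps * logsumexp(q_next / eps)     # soft Bellman target
    x = feat(s, a)
    return w + alpha * (target - w @ x) * x                # dQ/dw = feat(s,a) for a linear model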
