

  1. Foundations of Machine Learning: Reinforcement Learning

  2. Reinforcement Learning — An agent exploring an environment. Interactions with the environment: the agent takes an action, the environment returns a new state and a reward. [Diagram: Agent ⇄ Environment loop, with arrows labeled action, state, and reward.] Problem: find an action policy that maximizes the cumulative reward over the course of the interactions.

  3. Key Features — Contrast with supervised learning: no explicit labeled training data; the data distribution is defined by the actions taken. Rewards or penalties may be delayed. RL trade-off: exploration (of unknown states and actions) to gain more reward information, vs. exploitation (of known information) to optimize the reward.

  4. Applications — Robot control, e.g., RoboCup soccer teams (Stone et al., 1999). Board games, e.g., TD-Gammon (Tesauro, 1995). Elevator scheduling (Crites and Barto, 1996). Ads placement. Telecommunications. Inventory management. Dynamic radio channel assignment.

  5. This Lecture — Markov decision processes (MDPs); planning; learning; multi-armed bandit problem.

  6. Markov Decision Process (MDP) — Definition: a Markov decision process is defined by:
     • a set of decision epochs {0, ..., T};
     • a set of states S, possibly infinite;
     • a start (initial) state s₀;
     • a set of actions A, possibly infinite;
     • a transition probability Pr[s′ | s, a]: a distribution over destination states s′ = δ(s, a);
     • a reward probability Pr[r′ | s, a]: a distribution over the rewards returned, r′ = r(s, a).
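A minimal sketch of how such a finite MDP could be represented in code, using numpy arrays indexed by state and action; the class and attribute names (`FiniteMDP`, `transition`, `reward`) are illustrative choices, not from the lecture, and the reward is taken as a deterministic expected value for simplicity.

```python
import numpy as np

class FiniteMDP:
    """A finite MDP: transition[s, a, s'] = Pr[s' | s, a], reward[s, a] = E[r(s, a)]."""

    def __init__(self, transition, reward, gamma, start_state=0):
        self.transition = np.asarray(transition, dtype=float)  # shape (n_states, n_actions, n_states)
        self.reward = np.asarray(reward, dtype=float)           # shape (n_states, n_actions)
        self.gamma = gamma                                       # discount factor in [0, 1)
        self.start_state = start_state
        self.n_states, self.n_actions, _ = self.transition.shape
```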

  7. Model — State observed at time t: s_t ∈ S. Action taken at time t: a_t ∈ A. State reached: s_{t+1} = δ(s_t, a_t). Reward received: r_{t+1} = r(s_t, a_t). [Diagram: Agent ⇄ Environment loop; trajectory s_t → s_{t+1} → s_{t+2}, with transitions labeled a_t / r_{t+1} and a_{t+1} / r_{t+2}.]
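As a sketch of this interaction loop, assuming the `FiniteMDP` representation above and a policy given as an array mapping each state to an action, a sampled trajectory could be generated as follows (the function name and signature are illustrative):

```python
def rollout(mdp, policy, horizon, rng=None):
    """Sample a trajectory of (s_t, a_t, r_{t+1}) triples of length `horizon` under `policy`."""
    rng = np.random.default_rng() if rng is None else rng
    s = mdp.start_state
    trajectory = []
    for _ in range(horizon):
        a = policy[s]                                               # a_t = pi(s_t)
        s_next = rng.choice(mdp.n_states, p=mdp.transition[s, a])   # s_{t+1} ~ Pr[. | s_t, a_t]
        r = mdp.reward[s, a]                                        # r_{t+1} = r(s_t, a_t), deterministic here
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```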

  8. MDPs — Properties. Finite MDPs: A and S are finite sets. Finite horizon when T < ∞. Reward r(s, a): often a deterministic function.

  9. Example — Robot Picking Up Balls. [State diagram with states "start" and "other"; transitions labeled action/[probability, reward]: search/[.9, R1], search/[.1, R1], pickup/[1, R2], carry/[.5, R3], carry/[.5, −1].]

  10. Policy — Definition: a policy is a mapping π : S → A. Objective: find a policy π maximizing the expected return.
     • finite horizon return: ∑_{t=0}^{T−1} r(s_t, π(s_t)).
     • infinite horizon return: ∑_{t=0}^{+∞} γ^t r(s_t, π(s_t)).
     Theorem: there exists an optimal policy from any start state.
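For illustration (not part of the slides), the discounted return of a sampled trajectory, such as the one produced by the `rollout` sketch above, could be computed as:

```python
def discounted_return(trajectory, gamma):
    """Sum of gamma^t * r_{t+1} over the rewards of a trajectory of (s, a, r) triples."""
    return sum(gamma ** t * r for t, (_, _, r) in enumerate(trajectory))
```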

  11. Policy Value — Definition: the value of a policy π at state s is
     • finite horizon: V^π(s) = E[∑_{t=0}^{T−1} r(s_t, π(s_t)) | s₀ = s].
     • infinite horizon, with discount factor γ ∈ [0, 1): V^π(s) = E[∑_{t=0}^{+∞} γ^t r(s_t, π(s_t)) | s₀ = s].
     Problem: find a policy π with maximum value for all states.

  12. Policy Evaluation — Analysis of the policy value:
     V^π(s) = E[∑_{t=0}^{+∞} γ^t r(s_t, π(s_t)) | s₀ = s]
            = E[r(s, π(s))] + γ E[∑_{t=0}^{+∞} γ^t r(s_{t+1}, π(s_{t+1})) | s₀ = s]
            = E[r(s, π(s))] + γ E[V^π(δ(s, π(s)))].
     Bellman equations (system of linear equations):
     V^π(s) = E[r(s, π(s))] + γ ∑_{s′} Pr[s′ | s, π(s)] V^π(s′).
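A minimal sketch of evaluating a fixed policy by iterating this Bellman equation to its fixed point, assuming the `FiniteMDP` representation above; the tolerance argument is an illustrative choice.

```python
def policy_evaluation_iterative(mdp, policy, tol=1e-10):
    """Iterate V(s) <- E[r(s, pi(s))] + gamma * sum_{s'} Pr[s' | s, pi(s)] V(s') to convergence."""
    V = np.zeros(mdp.n_states)
    while True:
        V_new = np.array([
            mdp.reward[s, policy[s]] + mdp.gamma * mdp.transition[s, policy[s]] @ V
            for s in range(mdp.n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```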

  13. Bellman Equation — Existence and Uniqueness. Notation:
     • transition probability matrix: P_{s,s′} = Pr[s′ | s, π(s)].
     • value column matrix: V = (V^π(s))_s.
     • expected reward column matrix: R = (E[r(s, π(s))])_s.
     Theorem: for a finite MDP, Bellman's equation admits a unique solution, given by V = (I − γP)^{−1} R.
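This closed form translates directly into code; a sketch using numpy's linear solver, again assuming the `FiniteMDP` representation introduced above:

```python
def policy_evaluation_exact(mdp, policy):
    """Solve (I - gamma * P_pi) V = R_pi for the value of `policy`."""
    P_pi = mdp.transition[np.arange(mdp.n_states), policy]   # P_pi[s, s'] = Pr[s' | s, pi(s)]
    R_pi = mdp.reward[np.arange(mdp.n_states), policy]       # R_pi[s] = E[r(s, pi(s))]
    return np.linalg.solve(np.eye(mdp.n_states) - mdp.gamma * P_pi, R_pi)
```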

  14. Bellman Equation — Existence and Uniqueness. Proof: Bellman's equation can be rewritten as V = R + γPV.
     • P is a stochastic matrix, thus ‖P‖_∞ = max_s ∑_{s′} |P_{s,s′}| = max_s ∑_{s′} Pr[s′ | s, π(s)] = 1.
     • This implies that ‖γP‖_∞ = γ < 1. The eigenvalues of γP are all less than one in magnitude, and (I − γP) is invertible.
     Notes: general shortest-distance problem (MM, 2002).

  15. Optimal Policy — Definition: a policy π* with maximal value for all states s ∈ S.
     • value of π* (optimal value): ∀ s ∈ S, V^{π*}(s) = max_π V^π(s).
     • optimal state-action value function Q*: expected return for taking action a at state s and then following the optimal policy:
       Q*(s, a) = E[r(s, a)] + γ E[V*(δ(s, a))] = E[r(s, a)] + γ ∑_{s′ ∈ S} Pr[s′ | s, a] V*(s′).
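A sketch of how Q* and the greedy policy it induces can be computed once V* is available, under the same assumed `FiniteMDP` representation as above:

```python
def q_from_v(mdp, V):
    """Q[s, a] = E[r(s, a)] + gamma * sum_{s'} Pr[s' | s, a] * V[s']."""
    return mdp.reward + mdp.gamma * mdp.transition @ V

def greedy_policy(mdp, V):
    """Return the policy that picks, in each state, an action maximizing Q(s, a)."""
    return np.argmax(q_from_v(mdp, V), axis=1)
```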

  16. Optimal Values — Bellman Equations. Property: the following equalities hold: ∀ s ∈ S, V*(s) = max_{a ∈ A} Q*(s, a).
     Proof: by definition, for all s, V*(s) ≤ max_{a ∈ A} Q*(s, a).
     • If for some s we had V*(s) < max_{a ∈ A} Q*(s, a), then choosing the maximizing action at s would define a better policy. Thus,
       V*(s) = max_{a ∈ A} { E[r(s, a)] + γ ∑_{s′ ∈ S} Pr[s′ | s, a] V*(s′) }.

  17. This Lecture — Markov decision processes (MDPs); planning; learning; multi-armed bandit problem.

  18. Known Model — Setting: the environment model is known. Problem: find an optimal policy. Algorithms: value iteration; policy iteration; linear programming.

  19. Value Iteration Algorithm — Bellman operator:
     Φ(V)(s) = max_{a ∈ A} { E[r(s, a)] + γ ∑_{s′ ∈ S} Pr[s′ | s, a] V(s′) },   i.e.,   Φ(V) = max_π { R_π + γ P_π V }.
     ValueIteration(V₀):
       V ← V₀                               (V₀ arbitrary value)
       while ‖V − Φ(V)‖ ≥ ((1 − γ)/γ) ε do
         V ← Φ(V)
       return Φ(V)
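A sketch of this algorithm in code, using the stopping criterion from the pseudocode; it relies on the `FiniteMDP` representation and the `q_from_v` helper sketched above, which are assumptions of this illustration rather than part of the lecture.

```python
def value_iteration(mdp, epsilon=1e-6, V0=None):
    """Iterate V <- Phi(V) until ||V - Phi(V)||_inf < (1 - gamma) * epsilon / gamma."""
    V = np.zeros(mdp.n_states) if V0 is None else np.asarray(V0, dtype=float)
    threshold = (1.0 - mdp.gamma) * epsilon / mdp.gamma
    while True:
        V_next = q_from_v(mdp, V).max(axis=1)   # Phi(V)(s) = max_a Q(s, a)
        if np.max(np.abs(V_next - V)) < threshold:
            return V_next                       # this is Phi(V), as returned by the pseudocode
        V = V_next
```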

  20. VI Algorithm — Convergence. Theorem: for any initial value V₀, the sequence defined by V_{n+1} = Φ(V_n) converges to V*.
     Proof: we show that Φ is γ-contracting for ‖·‖_∞; existence and uniqueness of the fixed point of Φ then follow.
     • For any s ∈ S, let a*(s) be the maximizing action defining Φ(V)(s). Then, for any s ∈ S and any U,
       Φ(V)(s) − Φ(U)(s) ≤ Φ(V)(s) − (E[r(s, a*(s))] + γ ∑_{s′ ∈ S} Pr[s′ | s, a*(s)] U(s′))
                         = γ ∑_{s′ ∈ S} Pr[s′ | s, a*(s)] [V(s′) − U(s′)]
                         ≤ γ ∑_{s′ ∈ S} Pr[s′ | s, a*(s)] ‖V − U‖_∞ = γ ‖V − U‖_∞.

  21. Complexity and Optimality — Complexity: convergence in O(log(1/ε)) iterations. Observe that
     ‖V_{n+1} − V_n‖_∞ ≤ γ ‖V_n − V_{n−1}‖_∞ ≤ γ^n ‖Φ(V₀) − V₀‖_∞.
     Thus, γ^n ‖Φ(V₀) − V₀‖_∞ ≤ ((1 − γ)/γ) ε suffices, which gives n = O(log(1/ε)).
     ε-Optimality: let V_{n+1} be the value returned. Then,
     ‖V* − V_{n+1}‖_∞ ≤ ‖V* − Φ(V_{n+1})‖_∞ + ‖Φ(V_{n+1}) − V_{n+1}‖_∞ ≤ γ ‖V* − V_{n+1}‖_∞ + γ ‖V_{n+1} − V_n‖_∞.
     Thus, ‖V* − V_{n+1}‖_∞ ≤ (γ/(1 − γ)) ‖V_{n+1} − V_n‖_∞ ≤ ε.

  22. VI Algorithm — Example. [Two-state diagram with edges labeled action/[probability, reward]: from state 1, a/[3/4, 2] to itself and a/[1/4, 2] to state 2, b/[1, 2] to state 2; from state 2, c/[1, 2] to itself and d/[1, 3] to state 1.]
     V_{n+1}(1) = max{ 2 + γ(3/4 V_n(1) + 1/4 V_n(2)), 2 + γ V_n(2) },
     V_{n+1}(2) = max{ 3 + γ V_n(1), 2 + γ V_n(2) }.
     For V₀(1) = −1, V₀(2) = 1, γ = 1/2: V₁(1) = V₁(2) = 5/2. But V*(1) = 14/3, V*(2) = 16/3.
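These numbers can be checked with the `FiniteMDP` and `value_iteration` sketches above; the encoding below (state indices 0 and 1 for states 1 and 2, action indices 0 and 1 for a/d and b/c) is an assumption of this illustration.

```python
# Two-state example from the slide, encoded with the assumed FiniteMDP layout.
transition = np.array([
    [[3/4, 1/4],    # state 1, action a: stay with prob 3/4, move to 2 with prob 1/4
     [0.0, 1.0]],   # state 1, action b: move to 2
    [[1.0, 0.0],    # state 2, action d: move to 1
     [0.0, 1.0]],   # state 2, action c: stay
])
reward = np.array([
    [2.0, 2.0],     # r(1, a) = 2, r(1, b) = 2
    [3.0, 2.0],     # r(2, d) = 3, r(2, c) = 2
])
mdp = FiniteMDP(transition, reward, gamma=0.5)
print(value_iteration(mdp, epsilon=1e-9))   # approximately [14/3, 16/3] = [4.667, 5.333]
```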

  23. Policy Iteration Algorithm —
     PolicyIteration(π₀):
       π ← π₀                               (π₀ arbitrary policy)
       π′ ← nil
       while (π ≠ π′) do
         V ← V_π                            (policy evaluation: solve (I − γ P_π) V = R_π)
         π′ ← π
         π ← argmax_π { R_π + γ P_π V }     (greedy policy improvement)
       return π
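A sketch of the same loop in code, reusing the `policy_evaluation_exact` and `greedy_policy` sketches above (which are assumptions of this illustration):

```python
def policy_iteration(mdp, policy0=None):
    """Alternate exact policy evaluation and greedy improvement until the policy is stable."""
    policy = np.zeros(mdp.n_states, dtype=int) if policy0 is None else np.asarray(policy0)
    while True:
        V = policy_evaluation_exact(mdp, policy)   # solve (I - gamma * P_pi) V = R_pi
        new_policy = greedy_policy(mdp, V)         # pi <- argmax_a Q(s, a)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```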
