Foundations of Machine Learning Reinforcement Learning

Reinforcement Learning Agent exploring environment. Interactions with environment: action state Agent Environment reward Problem: find action policy that maximizes cumulative reward over the course of interactions. Mehryar Mohri - Foundations of Machine Learning page

Key Features Contrast with supervised learning: • no explicit labeled training data. • distribution defined by actions taken. Delayed rewards or penalties. RL trade-off: • exploration (of unknown states and actions) to gain more reward information; vs. • exploitation (of known information) to optimize reward. Mehryar Mohri - Foundations of Machine Learning page

Applications Robot control e.g., Robocup Soccer Teams (Stone et al., 1999) . Board games, e.g., TD-Gammon (Tesauro, 1995) . Elevator scheduling (Crites and Barto, 1996) . Ads placement. Telecommunications. Inventory management. Dynamic radio channel assignment. Mehryar Mohri - Foundations of Machine Learning page

This Lecture Markov Decision Processes (MDPs) Planning Learning Multi-armed bandit problem Mehryar Mohri - Foundations of Machine Learning page

Markov Decision Process (MDP) Definition: a Markov Decision Process is defined by: • a set of decision epochs . { 0 , . . . , T } • a set of states , possibly infinite. S • a start state or initial state ; s 0 • a set of actions , possibly infinite. A • a transition probability : distribution over Pr[ s � | s, a ] s � = δ ( s, a ) destination states . • a reward probability : distribution over Pr[ r � | s, a ] r � = r ( s, a ) rewards returned . Mehryar Mohri - Foundations of Machine Learning page 6

Model State observed at time : s t ∈ S. t action Action taken at time : a t ∈ A. t state Agent Environment State reached . s t +1 = δ ( s t , a t ) reward Reward received: . r t +1 = r ( s t , a t ) a t /r t +1 a t +1 /r t +2 s t +2 s t s t +1 Mehryar Mohri - Foundations of Machine Learning page 7

MDPs - Properties Finite MDPs: and finite sets. A S Finite horizon when . T < ∞ Reward : often deterministic function. r ( s, a ) Mehryar Mohri - Foundations of Machine Learning page 8

Example - Robot Picking up Balls start search/[.1, R1] search/[.9, R1] carry/[.5, R3] other carry/[.5, -1] pickup/[1, R2] Mehryar Mohri - Foundations of Machine Learning page

Policy Definition: a policy is a mapping π : S → A. Objective: find policy maximizing expected π return. • finite horizon return: . � T − 1 � � s t , π ( s t ) t =0 r • infinite horizon return: . � + ∞ � � t =0 γ t r s t , π ( s t ) Theorem: there exists an optimal policy from any start state. Mehryar Mohri - Foundations of Machine Learning page 10

Policy Value Definition: the value of a policy at state is s π • finite horizon: � T − 1 � �� V π ( s ) = E s t , π ( s t ) � s 0 = s r . � t =0 • infinite horizon: discount factor , γ ∈ [0 , 1) � + ∞ � �� γ t r � V π ( s ) = E s t , π ( s t ) � s 0 = s . � t =0 Problem: find policy with maximum value for all π states. Mehryar Mohri - Foundations of Machine Learning page 11

Policy Evaluation Analysis of policy value: � + ∞ � �� γ t r � V π ( s ) = E s t , π ( s t ) � s 0 = s . � t =0 � + ∞ � �� γ t r � = E[ r ( s, π ( s ))] + γ E s t +1 , π ( s t +1 ) � s 0 = s � t =0 = E[ r ( s, π ( s )] + γ E[ V π ( δ ( s, π ( s )))] . Bellman equations (system of linear equations): � Pr[ s � | s, π ( s )] V π ( s � ) . V π ( s ) = E[ r ( s, π ( s )] + γ s � Mehryar Mohri - Foundations of Machine Learning page 12

Bellman Equation - Existence and Uniqueness Notation: • transition probability matrix P s,s � =Pr[ s � | s, π ( s )] . • value column matrix V = V π ( s ) . • expected reward column matrix: R =E[ r ( s, π ( s )] . Theorem: for a finite MDP , Bellman’s equation admits a unique solution given by V 0 =( I − γ P ) − 1 R . Mehryar Mohri - Foundations of Machine Learning page 13

Bellman Equation - Existence and Uniqueness Proof: Bellman’s equation rewritten as V = R + γ PV . • is a stochastic matrix, thus, P � � Pr[ s � | s, π ( s )] = 1 . � P � � = max | P ss � | = max s s s � s � • This implies that The eigenvalues � γ P � ∞ = γ < 1 . of are all less than one and is ( I − γ P ) P invertible. Notes: general shortest distance problem (MM, 2002) . Mehryar Mohri - Foundations of Machine Learning page 14

Optimal Policy Definition: policy with maximal value for all π ∗ states s ∈ S. • value of (optimal value): π ∗ ∀ s ∈ S, V π ∗ ( s ) = max V π ( s ) . π • optimal state-action value function: expected return for taking action at state and then a s following optimal policy. Q ⇤ ( s, a ) = E[ r ( s, a )] + γ E[ V ⇤ ( δ ( s, a ))] Pr[ s 0 | s, a ] V ⇤ ( s 0 ) . X = E[ r ( s, a )] + γ s 0 2 S Mehryar Mohri - Foundations of Machine Learning page 15

Optimal Values - Bellman Equations Property: the following equalities hold: ∀ s ∈ S, V ∗ ( s ) = max a ∈ A Q ∗ ( s, a ) . Proof: by definition, for all , . s V ∗ ( s ) ≤ max a ∈ A Q ∗ ( s, a ) • If for some we had , then V ∗ ( s ) < max a ∈ A Q ∗ ( s, a ) s maximizing action would define a better policy. Thus, n o X V ⇤ ( s ) = max Pr[ s 0 | s, a ] V ⇤ ( s 0 ) E[ r ( s, a )] + γ . a 2 A s 0 2 S Mehryar Mohri - Foundations of Machine Learning page 16

This Lecture Markov Decision Processes (MDPs) Planning Learning Multi-armed bandit problem Mehryar Mohri - Foundations of Machine Learning page

Known Model Setting: environment model known. Problem: find optimal policy. Algorithms: • value iteration. • policy iteration. • linear programming. Mehryar Mohri - Foundations of Machine Learning page 18

Value Iteration Algorithm � � � Pr[ s � | s, a ] V ( s � ) Φ ( V )( s ) = max E[ r ( s, a )] + γ . a � A s � � S Φ ( V ) = max π { R π + γ P π V } . ValueIteration ( V 0 ) 1 V � V 0 � V 0 arbitrary value while � V � Φ ( V ) � � (1 − � ) � 2 do � 3 V � Φ ( V ) 4 return Φ ( V ) Mehryar Mohri - Foundations of Machine Learning page 19

VI Algorithm - Convergence Theorem: for any initial value , the sequence V 0 defined by converge to . V n +1 = Φ ( V n ) V ∗ Proof: we show that is -contracting for � · � ∞ Φ γ existence and uniqueness of fixed point for . Φ • for any , let be the maximizing action a ∗ ( s ) s ∈ S defining . Then, for and any , s ∈ S Φ ( V )( s ) U Pr[ s � | s, a � ( s )] U ( s � ) � � � E[ r ( s, a � ( s ))] + γ Φ ( V )( s ) � Φ ( U )( s ) � Φ ( V )( s ) � s � � S � Pr[ s � | s, a � ( s )][ V ( s � ) � U ( s � )] = γ s � � S � Pr[ s � | s, a � ( s )] � V � U � � = γ � V � U � � . � γ s � � S Mehryar Mohri - Foundations of Machine Learning page 20

Complexity and Optimality Complexity: convergence in . Observe that O (log 1 � ) � V n +1 � V n � ∞ � γ � V n � V n − 1 � ∞ � γ n � Φ ( V 0 ) � V 0 � ∞ . � n � Φ ( V 0 ) � V 0 � ∞ � (1 � � ) � log 1 � � Thus, � n = O . � � -Optimality: let be the value returned. Then, V n +1 � � V ∗ � V n +1 � ∞ � � V ∗ � Φ ( V n +1 ) � ∞ + � Φ ( V n +1 ) � V n +1 � ∞ � γ � V ∗ � V n +1 � ∞ + γ � V n +1 � V n � ∞ . Thus, � V ∗ � V n +1 � ∞ � � 1 � � � V n +1 � V n � ∞ � � . Mehryar Mohri - Foundations of Machine Learning page 21

VI Algorithm - Example a/[3/4, 2] c/[1, 2] a/[1/4, 2] b/[1, 2] 1 2 d/[1, 3] � 3 4 V n (1) + 1 � � � V n +1 (1) = max 2 + γ 4 V n (2) , 2 + γ V n (2) � � V n +1 (2) = max 3 + γ V n (1) , 2 + γ V n (2) . For , V 0 (1) = − 1 , V 0 (2) = 1 , γ = 1 / 2 V 1 (1) = V 1 (2) = 5 / 2 . But, V ∗ (1) = 14 / 3 , V ∗ (2) = 16 / 3 . , Mehryar Mohri - Foundations of Machine Learning page

Policy Iteration Algorithm PolicyIteration ( � 0 ) 1 � � 0 arbitrary policy � � � 0 � � � nil 2 3 while ( � � = � � ) do 4 � policy evaluation: solve ( I � � P π ) V = R π . V � V π � � � � 5 6 � � argmax π { R π + � P π V } � greedy policy improvement . 7 return � Mehryar Mohri - Foundations of Machine Learning page 23

Foundations of Machine Learning Reinforcement Learning - PowerPoint PPT Presentation

Foundations of Machine Learning Reinforcement Learning Reinforcement Learning Agent exploring environment. Interactions with environment: action state Agent Environment reward Problem: find action policy that maximizes cumulative reward

Reinforcement Learning AIMA Chapters: 21.1, 21.2, 21.3. Sutton and Barto, Reinforcement Learning:

Reinforcement Learning Timothy Chou Charlie Tong Vincent Zhuang April 19, 2016 Reinforcement

RL Overview of topics About Reinforcement Learning The Reinforcement Learning Problem

Reinforcement Learning UMaine COS 470/570 Introduction to AI Why reinforcement learning?

Reinforcement Learning and Simulation-Based Search David Silver Reinforcement Learning and

Reinforcement Learning Reinforcement Learning Reinforcement Learning in a nutshell g Imagine

Safe Reinforcement Learning Philip S. Thomas Stanford CS234: Reinforcement Learning, Guest

recap to this point foundations foundations foundations foundations genetics =

Boosting: Foundations and Algorithms Boosting: Foundations and Algorithms Boosting: Foundations

Reinforcement Learning for Continuous State and Action Spaces Gradient Methods 1 MACHINE LEARNING

Machine Learning for NLP Reinforcement learning Aurlie Herbelot 2019 Centre for Mind/Brain

Machine Learning, Reinforcement Learning Machine Learning: A quick retrospective AI Class 25

Reinforcement Learning Lecture 8 Reinforcement Learning November 24, 2015 1 Wentworth

CS885 Reinforcement Learning Module 2: June 6, 2020 Maximum Entropy Reinforcement Learning

Introduction to Reinforcement Learning Kevin Chen and Zack Khan Lecture 1: Introduction to

7. Motor Control and Reinforcement Learning Outline A. Action Selection and Reinforcement B.

Outreach at Warwick A Teachers Guide Opportunities for your students to engage with the

Decision trees PRISM - Nicolas Sutton-Charani 20/01/2020 N. Sutton-Charani Artificial

| 1 New gTLD Sub. Pro. PDP Work Track 5 on Geographic Names - Update 26 June 2018 ICANN65 -

Challenges and Successes in the Local Health Department Workforce Funded by the Robert Wood

Taxi Operational Performance Seminar 2 Unless otherwise stated the information contained in this

4/19/2018 From Play to Practice: Connecting Teachers Play to Childrens Learning Walter F.

Random matrices, operators and analytic functions Benedek Valk o (University of Wisconsin

Advertising, Innovation, and Economic Growth Laurent Cavenaile Pau Roldan-Blanco University of