SLIDE 1

Reinforcement Learning Lectures 4 and 5

Gillian Hayes 18th January 2007

SLIDE 2

Reinforcement Learning

  • Framework
  • Rewards, Returns
  • Environment Dynamics
  • Components of a Problem
  • Values and Action Values, V and Q
  • Optimal Policies
  • Bellman Optimality Equations

SLIDE 3

Framework Again

Where is the boundary between agent and environment?

(Diagram: the AGENT, containing the POLICY and VALUE FUNCTION, takes action a_t; the ENVIRONMENT returns the next situation/state s_{t+1} and reward r_{t+1}.)

Task: one instance of an RL problem, i.e. one problem set-up
Learning: how should the agent change its policy?
Overall goal: maximise the amount of reward received over time

SLIDE 4

Goals and Rewards

Goal: maximise total reward received. Immediate reward r at each step. We must maximise the expected cumulative reward:

Return = total reward:  R_t = r_{t+1} + r_{t+2} + r_{t+3} + · · · + r_τ,   τ = final time step (episodes/trials)

But what if τ = ∞?

Discounted Reward

R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · = Σ_{k=0}^∞ γ^k r_{t+k+1},   0 ≤ γ < 1 (discount factor)

→ the discounted reward is finite if the reward sequence {r_k} is bounded
γ = 0: agent is myopic
γ → 1: agent is far-sighted; future rewards count for more
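
To make the return concrete, here is a minimal sketch (not from the slides) that computes a discounted return for a finite reward sequence; the rewards and γ used below are made-up illustrative values.

```python
# A minimal sketch: the discounted return
# R_t = sum_{k>=0} gamma^k * r_{t+k+1} for a finite reward sequence.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**k * rewards[k] over k = 0, 1, 2, ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three steps of reward 1 followed by a final reward of 10:
print(discounted_return([1, 1, 1, 10], gamma=0.9))   # 1 + 0.9 + 0.81 + 7.29 ≈ 10.0
```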

SLIDE 5

Dynamics of Environment

Choose action a in situation s: what is the probability of ending up in state s′?

Transition probability:  P^a_{ss′} = Pr{s_{t+1} = s′ | s_t = s, a_t = a}

(Backup diagram: state s, action a, reward r, next state s′; the transition is stochastic.)
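
As an illustration (not from the slides), the transition probabilities P^a_{ss′} of a small MDP can be stored as a nested dictionary; the state and action names here are invented for the example.

```python
# Illustrative sketch: P[s][a][s'] = Pr{s_{t+1} = s' | s_t = s, a_t = a}.
P = {
    "s1": {"left":  {"s1": 0.1, "s2": 0.9},
           "right": {"s1": 0.8, "s2": 0.2}},
    "s2": {"left":  {"s1": 1.0},
           "right": {"s2": 1.0}},
}

# Each next-state distribution must sum to 1.
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
```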

SLIDE 6

Dynamics of Environment

If action a is chosen in state s and the subsequent state reached is s′, what is the expected reward?

R^a_{ss′} = E{r_{t+1} | s_t = s, a_t = a, s_{t+1} = s′}

If we know P and R then we have complete information about the environment; otherwise we may need to learn them.

SLIDE 7

R^a_{ss′} and ρ(s, a)

Reward functions:
R^a_{ss′}: expected next reward given current state s, action a, and next state s′
ρ(s, a): expected next reward given current state s and action a

ρ(s, a) = Σ_{s′} P^a_{ss′} R^a_{ss′}

Sometimes you will see ρ(s, a) in the literature, especially literature prior to 1998, when Sutton and Barto's book was published. Sometimes you'll also see ρ(s). This is the reward for being in state s and is equivalent to a "bag of treasure" sitting on a grid-world square (e.g. computer games: weapons, health).
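
A minimal sketch (not from the slides) of the relation ρ(s, a) = Σ_{s′} P^a_{ss′} R^a_{ss′}, assuming P and R are stored in the nested-dict form used in the earlier sketch.

```python
# rho(s, a) = sum over s' of P[s][a][s'] * R[s][a][s'].
def rho(P, R, s, a):
    """Expected next reward for taking action a in state s."""
    return sum(P[s][a][s_next] * R[s][a][s_next] for s_next in P[s][a])
```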

SLIDE 8

Sutton and Barto’s Recycling Robot 1

  • At each step, the robot has a choice of three actions:
    – go out and search for a can
    – wait till a human brings it a can
    – go to the charging station to recharge
  • Searching is better (higher reward), but runs down the battery. Running out of battery power is very bad and the robot needs to be rescued
  • Decision based on current state: is energy high or low?
  • Reward is the number of cans (expected to be) collected, with a negative reward for needing rescue

This slide and the next are based on an earlier version of Sutton and Barto's own slides from a previous Sutton web resource.

SLIDE 9

Sutton and Barto’s Recycling Robot 2

S = {high, low}
A(high) = {search, wait}
A(low) = {search, wait, recharge}
R^search: expected no. of cans when searching
R^wait: expected no. of cans when waiting
R^search > R^wait

Transition graph (reconstructed as a table of P^a_{ss′} and R^a_{ss′}):

  s     a         s′     P^a_{ss′}   R^a_{ss′}
  high  search    high   α           R^search
  high  search    low    1 − α       R^search
  high  wait      high   1           R^wait
  low   search    low    β           R^search
  low   search    high   1 − β       −3   (battery ran flat, robot rescued)
  low   wait      low    1           R^wait
  low   recharge  high   1           0
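
The table above can be encoded directly in the nested-dict format from the earlier sketches. This is an illustrative sketch, not part of the original slides; alpha, beta, r_search and r_wait are the free parameters of the example.

```python
def recycling_robot(alpha, beta, r_search, r_wait):
    """Dynamics P[s][a][s'] and rewards R[s][a][s'] for the recycling robot."""
    P = {
        "high": {"search":   {"high": alpha, "low": 1 - alpha},
                 "wait":     {"high": 1.0}},
        "low":  {"search":   {"low": beta, "high": 1 - beta},
                 "wait":     {"low": 1.0},
                 "recharge": {"high": 1.0}},
    }
    R = {
        "high": {"search":   {"high": r_search, "low": r_search},
                 "wait":     {"high": r_wait}},
        "low":  {"search":   {"low": r_search, "high": -3.0},  # battery ran flat, rescued
                 "wait":     {"low": r_wait},
                 "recharge": {"high": 0.0}},
    }
    return P, R
```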

SLIDE 10

Values V

Policy π maps situations s ∈ S to (a probability distribution over) actions a ∈ A(s).

V-Value of s under policy π is V^π(s) = expected return starting in s and following policy π:

V^π(s) = E_π{R_t | s_t = s} = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }

Convention in backup diagrams:
  • open circle = state
  • filled circle = action

(Backup diagram for V(s): from state s, the policy π(s, a) selects action a, then the transition P^a_{ss′} yields reward r and next state s′.)
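
One way to make the definition concrete is to estimate V^π(s) by sampling returns. The following is a rough Monte Carlo sketch (not from the slides), assuming the nested-dict P and R used earlier and a stochastic policy pi[s][a] over the legal actions in s; it truncates episodes at a fixed horizon, which is an approximation.

```python
import random

def sample_from(dist):
    """Draw a key from a {outcome: probability} dictionary."""
    r, cum = random.random(), 0.0
    for outcome, p in dist.items():
        cum += p
        if r < cum:
            return outcome
    return outcome  # guard against rounding error

def estimate_v(P, R, pi, s0, gamma=0.9, episodes=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s0): average truncated discounted return."""
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = sample_from(pi[s])
            s_next = sample_from(P[s][a])
            ret += discount * R[s][a][s_next]
            discount *= gamma
            s = s_next
        total += ret
    return total / episodes
```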

SLIDE 11

Action Values Q

Q-Action Value of taking action a in state s under policy π is Q^π(s, a) = expected return starting in s, taking a, and then following policy π:

Q^π(s, a) = E_π{R_t | s_t = s, a_t = a} = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }

What is the backup diagram?

SLIDE 12

Recursive Relationship for V

V^π(s) = E_π{R_t | s_t = s}
       = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s }
       = E_π{ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s }
       = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ E_π{ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_{t+1} = s′ } ]
       = Σ_a π(s, a) Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]

This is the BELLMAN EQUATION. How does it relate to backup diagram?
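
Turning the Bellman equation into an update rule is a standard way to solve for V^π numerically. The sketch below is illustrative (not from the slides), reusing the nested-dict P, R and a policy pi[s][a] from the earlier sketches.

```python
def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Solve for V^pi by repeatedly applying the Bellman equation as an update."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(pi[s][a] *
                        sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in P[s][a])
                        for a in pi[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```

Applied to the recycling-robot P and R from the earlier sketch (with some policy pi over the legal actions), this returns V^π for both states.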

SLIDE 13

Recursive Relationship for Q

Q^π(s, a) = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ Σ_{a′} π(s′, a′) Q^π(s′, a′) ]

Relate to the backup diagram.

SLIDE 14

Grid World Example

Check the V’s comply with Bellman Equation From Sutton and Barto P. 71, Fig. 3.5

3.3 8.8 4.4 1.5 3.0 2.3 1.9 0.5 0.1 0.7 0.7 0.4

  • 0.4
  • 1.0 -0.4 -0.4 -0.6 -1.2
  • 1.9 -1.3 -1.2 -1.4 -2.0

5.3 1.5

A A’ B B’ +10 +5
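
As a quick check of the Bellman equation at the special state A (assuming the Fig. 3.5 setup, where every action from A yields +10 and moves deterministically to A′, with γ = 0.9):

```python
gamma = 0.9
V_A_prime = -1.3                 # value of A' read off the grid above
V_A = 10 + gamma * V_A_prime     # = 8.83, matching the 8.8 shown to one decimal
print(V_A)
```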

SLIDE 15

Relating Q and V

Q^π(s, a) = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a }
          = E_π{ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s, a_t = a }
          = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ E_π{ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_{t+1} = s′ } ]
          = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V^π(s′) ]
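
In code this relation is a one-liner; a minimal sketch (not from the slides), using the nested-dict P and R from earlier:

```python
def q_from_v(P, R, V, s, a, gamma=0.9):
    """Q^pi(s, a) computed from V^pi via the relation above."""
    return sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in P[s][a])
```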

SLIDE 16

Relating V and Q

V^π(s) = E_π{ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s } = Σ_a π(s, a) Q^π(s, a)
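
And the converse direction, as a sketch, assuming Q is stored as Q[s][a] and the policy as pi[s][a]:

```python
def v_from_q(pi, Q, s):
    """V^pi(s) as the policy-weighted average of Q^pi(s, .)."""
    return sum(pi[s][a] * Q[s][a] for a in pi[s])
```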

SLIDE 17

Optimal Policies π∗

An optimal policy has the highest/optimal value function V*(s): it chooses the action in each state which will result in the highest return.

The optimal Q-value Q*(s, a) is the expected return from executing action a in state s and following the optimal policy π* thereafter.

V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)
Q*(s, a) = E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }

SLIDE 18

Bellman Optimality Equations 1

Bellman equations for the optimal values and Q-values:

V*(s) = max_a Q^{π*}(s, a)
      = max_a E_{π*}{ R_t | s_t = s, a_t = a }
      = max_a E_{π*}{ r_{t+1} + γ Σ_{k=0}^∞ γ^k r_{t+k+2} | s_t = s, a_t = a }
      = max_a E{ r_{t+1} + γ V*(s_{t+1}) | s_t = s, a_t = a }
      = max_a Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ V*(s′) ]

SLIDE 19

Bellman Optimality Equations 1

Q*(s, a) = E{ r_{t+1} + γ max_{a′} Q*(s_{t+1}, a′) | s_t = s, a_t = a }
         = Σ_{s′} P^a_{ss′} [ R^a_{ss′} + γ max_{a′} Q*(s′, a′) ]

Value under optimal policy = expected return for best action from that state.
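
Using the optimality equation for V* as an update rule gives value iteration. The sketch below is illustrative (not from the slides) and reuses the nested-dict P and R representation from earlier:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Solve the Bellman optimality equation for V* by repeated sweeps."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                            for s2 in P[s][a])
                        for a in P[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```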

SLIDE 20

Bellman Optimality Equations 2

If the dynamics of the environment, R^a_{ss′} and P^a_{ss′}, are known, then we can solve these equations for V* (or Q*).

Given V*, what then is the optimal policy? I.e. which action a do you pick in state s? The one which maximises the expected r_{t+1} + γ V*(s_{t+1}), i.e. the one which gives the biggest

Σ_{s′} P^a_{ss′} (instant reward + discounted future maximum reward)

So we need to do a one-step search.

SLIDE 21

Bellman Optimality Equations 2

There may be more than one action achieving this maximum; they are all OK. These are the GREEDY actions. Given Q*, what's the optimal policy? The one which gives the biggest Q*(s, a): in state s you have various Q values, one per action. Pick (an) action with the largest Q.
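
A minimal sketch (not from the slides) of both ways of acting greedily: a one-step search over V*, or a direct argmax over Q* stored as Q[s][a]:

```python
def greedy_from_v(P, R, V, s, gamma=0.9):
    """One-step search: pick the action maximising expected r + gamma * V*(s')."""
    return max(P[s], key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                       for s2 in P[s][a]))

def greedy_from_q(Q, s):
    """Pick (an) action with the largest Q*(s, a)."""
    return max(Q[s], key=Q[s].get)
```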

SLIDE 22

Assumptions for Solving Bellman Optimality Equations

  • 1. Know the dynamics of the environment, P^a_{ss′} and R^a_{ss′}
  • 2. Sufficient computational resources (time, memory)

BUT consider the example of Backgammon:

  • 1. OK
  • 2. ~10^20 states ⇒ 10^20 equations in 10^20 unknowns, and they are nonlinear equations (because of the max)

Often we use a neural network to approximate value functions, policies and models ⇒ a compact representation.

Optimal policy? It only needs to be optimal in situations we encounter; some states are very rarely or never encountered. So a policy that is only optimal in those states we do encounter may do.

SLIDE 23

Components of an RL Problem

Agent, task, environment
States, actions, rewards
Policy π(s, a) → probability of doing a in s
Value V(s) → a number: the value of a state
Action value Q(s, a) → the value of a state-action pair
Model P^a_{ss′} → probability of going from s to s′ if we do a
Reward function R^a_{ss′} → expected reward from doing a in s and reaching s′
Return R_t → sum of future rewards; total future discounted reward r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + · · · = Σ_{k=0}^∞ γ^k r_{t+k+1}

Learning strategy to learn... (continued)

SLIDE 24

Components of an RL Problem

  • value – V or Q
  • policy
  • model

sometimes subject to conditions, e.g. learn the best policy you can within a given time. Learn to maximise total future discounted reward.

SLIDE 25

RL Buzzwords

  • Agent, task, environment
  • Actions, situations/states, rewards
  • Policy
  • Environment dynamics and model
  • Return, total reward, discounted rewards
  • Value function V, action-value function Q
  • Optimal value functions and optimal policy
  • Complete and incomplete environment information
  • Transition probabilities and reward function
  • Model-based and model-free learning methods
  • Temporal and spatial credit assignment
  • Exploration/exploitation tradeoff
