

SLIDE 1

Introduction to Reinforcement Learning and Q-Learning

Skyler Seto (ss3349)
May 2, 2016


SLIDE 2

Outline

1. Reinforcement Learning and Markov Decision Process
2. Q-Learning
3. Q-Learning Convergence


SLIDE 3

Introduction

How does an agent behave?

1. An agent can be a passive learner, lounging around analyzing data, then constructing its model.
2. An agent can be an active learner, learning to act on the fly given sequences of the form (state, action, reward).

In this talk, we consider an agent of the second kind, one who actively learns from the environment.


SLIDE 4

Markov Decision Process (MDP)

Definition. The MDP framework consists of the four elements (S, A, R, P):

  • S is the finite set of possible states,
  • A is the finite set of possible actions,
  • R is the reward model R : S × A → ℝ,
  • P is the transition model P(s′|s, a), with Σ_{s′∈S} P(s′|s, a) = 1.
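To make the definition concrete, here is a minimal sketch of the four-element container in Python; the class name MDP and its field layout are illustrative choices, not notation from the talk.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite MDP (S, A, R, P) stored as plain dictionaries."""
    states: set        # S: finite set of states
    actions: set       # A: finite set of actions
    rewards: dict      # R: maps (s, a) to a real-valued reward
    transitions: dict  # P: maps (s, a) to {s_next: probability}

    def validate(self, tol=1e-9):
        # Each P(.|s, a) must be a probability distribution over S.
        for (s, a), dist in self.transitions.items():
            total = sum(dist.values())
            assert abs(total - 1.0) < tol, f"P(.|{s},{a}) sums to {total}"
```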


SLIDE 5

Robot Navigation

1. State space S is the set of all possible locations and directions.
2. Action space A is the set of possible motions: move forward, backward, etc.
3. Reward model R rewards the robot positively if it gets to the goal, and negatively if it hits an obstacle.
4. Transition probability accounts for some probability that the robot moves forward, doesn't move, or moves forward twice.
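A toy instance of this robot, reduced to a four-cell corridor, might look as follows; the cell layout and the specific probabilities 0.8/0.1/0.1 are illustrative, not taken from the talk.

```python
# A 4-cell corridor: the robot starts at cell 0, the goal is cell 3.
# "Forward" advances one cell w.p. 0.8, stays put w.p. 0.1, and
# overshoots two cells w.p. 0.1, clamped at the end of the corridor.
STATES = [0, 1, 2, 3]
ACTIONS = ["forward"]

def transition(s, a):
    """Return {next_state: prob} for P(.|s, a), merging clamped outcomes."""
    assert a == "forward"
    dist = {}
    for step, p in ((1, 0.8), (0, 0.1), (2, 0.1)):
        y = min(s + step, 3)                # clamp at the corridor's end
        dist[y] = dist.get(y, 0.0) + p
    return dist

def reward(s, a):
    """+1 when acting in the cell adjacent to the goal, 0 otherwise."""
    return 1.0 if s == 2 else 0.0
```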


SLIDE 6

Markov Decision Process Diagram

Figure: Two-Step Markov Decision Process


SLIDE 7

Properties of MDP

1. The reward function R(s, a) is deterministic and time-homogeneous.
2. P(s_{t+1}|s_t, a_t) is independent of t and thus time-homogeneous.
3. The transition model is Markovian: the next state depends only on the current state and action.


SLIDE 8

Reinforcement Learning in the MDP

1. Consider a partially known model (S, A, R, P) where S and A are known, but R and P must be learned as the agent acts.
2. Define the policy for the MDP, π_t : S → A, as the solution to the MDP.
3. What is the optimal policy π* the agent should learn in order to maximize its total expected discounted reward (with discount factor γ)?


SLIDE 9

Outline

1. Reinforcement Learning and Markov Decision Process
2. Q-Learning
3. Q-Learning Convergence


SLIDE 10

Value and Optimal Value

Definition. For a given policy π and discounted reward factor γ, the value of a state s is

    V^π(s) = R_s(π(s)) + γ Σ_{y∈S} P_{s,y}(π(s)) V^π(y)

and the optimal value is

    V*(s) = V^{π*}(s) = max_a [ R_s(a) + γ Σ_{y∈S} P_{s,y}(a) V^{π*}(y) ].
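These equations suggest a direct computation. Below is a minimal sketch of iterative policy evaluation for V^π under the tabular assumptions above; the function name and the data layout (R and P as dicts keyed by (s, a)) are illustrative.

```python
def evaluate_policy(states, policy, R, P, gamma, tol=1e-8):
    """Iterate V(s) <- R(s, pi(s)) + gamma * sum_y P(y|s, pi(s)) * V(y).

    R[(s, a)] is a scalar; P[(s, a)] maps next states y to probabilities.
    For gamma < 1 this contraction converges to V_pi.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = R[(s, a)] + gamma * sum(p * V[y] for y, p in P[(s, a)].items())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V
```

Replacing π(s) by a max over actions in the update gives value iteration for V*, i.e., the second equation.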


SLIDE 11

Q Function

Definition. For a policy π, define the Q values (action-values) as

    Q^π(s, a) = R_s(a) + γ Σ_{y∈S} P_{s,y}(a) V^π(y) = E_y [R_s(a) + γ V^π(y)].

The Q value is the expected discounted reward for executing action a at state s and following policy π thereafter.


SLIDE 12

Q Values for the Optimal Policy

1. Let Q*(s, a) = Q^{π*}(s, a) be the optimal action-values,
2. V*(s) = max_a Q*(s, a) be the optimal value,
3. π*(s) = arg max_a Q*(s, a) be the optimal policy.

If an agent learns the Q values, it can easily determine the optimal action.
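Item 3 is what makes Q-learning attractive: once Q* is known, the optimal action can be read off with a single argmax, with no model of R or P required. A one-line sketch with Q stored as a dict of dicts (the layout is illustrative):

```python
def greedy_action(Q, s):
    """pi*(s) = argmax_a Q*(s, a), read directly from the table Q[s][a]."""
    return max(Q[s], key=Q[s].get)
```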


SLIDE 13

Q-Learning

In Q-learning the agent experiences a sequence of stages. At the nth stage, the agent:

  • observes its current state x_n,
  • performs an action a_n,
  • observes the subsequent state transition to y_n,
  • receives reward r_n,


SLIDE 14

Q-Learning

  • updates its Q function with learning factor α_n according to:

    If s = x_n and a = a_n:
        Q_n(s, a) = (1 − α_n) Q_{n−1}(s, a) + α_n [r_n + γ V_{n−1}(y_n)]
    Otherwise:
        Q_n(s, a) = Q_{n−1}(s, a)

    where V_{n−1}(y) = max_b Q_{n−1}(y, b).
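Transcribed into code, the update touches only the visited pair (x_n, a_n), which matches the "otherwise" case leaving all other entries unchanged. A minimal tabular sketch, with Q as a dict of dicts and illustrative names:

```python
def q_learning_step(Q, x, a, r, y, alpha, gamma):
    """Q_n(x, a) = (1 - alpha) * Q_{n-1}(x, a) + alpha * (r + gamma * V_{n-1}(y))."""
    V_y = max(Q[y].values())  # V_{n-1}(y) = max_b Q_{n-1}(y, b)
    Q[x][a] = (1 - alpha) * Q[x][a] + alpha * (r + gamma * V_y)
```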


SLIDE 15

Outline

1. Reinforcement Learning and Markov Decision Process
2. Q-Learning
3. Q-Learning Convergence


SLIDE 16

Convergence Theorem

Let n_i(s, a) be the ith time that action a is tried in state s.

Theorem. Given bounded rewards |r_n| ≤ R, learning rates 0 ≤ α_n < 1, and

    Σ_{i=1}^∞ α_{n_i(s,a)} = ∞,    Σ_{i=1}^∞ [α_{n_i(s,a)}]² < ∞    ∀ s, a,

then Q_n(s, a) → Q*(s, a) almost surely as n → ∞.
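For example, the schedule α_{n_i(s,a)} = 1/i satisfies both conditions, since Σ_i 1/i diverges while Σ_i 1/i² converges. A constant learning rate violates the second condition, and the first condition implicitly requires that every state-action pair be tried infinitely often.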


SLIDE 17

Action Replay Process (ARP)

1. Let S = {s}, A = {a} be the sets of states and actions of the original MDP.
2. Create an infinite deck of cards, with the jth card from the bottom having (s_j, a_j, y_j, r_j, α_j) written on it.
3. Additionally, take the bottom card to have the value Q_0(s, a) for all s and a.
4. We define the ARP to have state space S′ = {(s, n)} and action space A′ = A = {a}.


SLIDE 18

State Transitions in the ARP

Given current state (s, n) and action a, we determine the next state by:

1. Removing all cards for stages after n.
2. Finding the first t, searching from the top of the (remaining) deck, where s_t = s and a_t = a.
3. Flipping a biased coin with heads probability α_t:
  • If the coin is heads, return reward r_t and transition to the state (y_t, t − 1); the process then continues on the remaining deck below card t.
  • If the coin is tails, find another card with s and a further down the deck.
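A minimal sketch of one such transition, assuming the deck is stored as a 1-indexed list of cards (s_j, a_j, y_j, r_j, α_j) and the bottom card's values in a dict q0 (both representations are illustrative):

```python
import random

def arp_transition(deck, q0, s, n, a, rng=random):
    """From ARP state (s, n), take action a: scan down from card n for a card
    matching (s, a); accept it with probability alpha_t, else keep scanning.
    Returns (reward, next_state) with next_state = (y_t, t - 1), or
    (q0[(s, a)], None) once the bottom card is reached (absorption)."""
    for t in range(n, 0, -1):          # deck[0] is a dummy; cards are 1-indexed
        s_t, a_t, y_t, r_t, alpha_t = deck[t]
        if s_t == s and a_t == a and rng.random() < alpha_t:
            return r_t, (y_t, t - 1)   # heads: take this card's transition
    return q0[(s, a)], None            # tails (or no match) all the way down
```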


SLIDE 19

Transition Probability for ARP

1. Define the expected reward of card n determined by the ARP as R^{(n)}_s(a).
2. Define the transition probability for the ARP as P^{ARP}_{(x,n),(y,m)}(a), with

    P^{(n)}_{x,y}(a) = Σ_{m=1}^{n−1} P^{ARP}_{(x,n),(y,m)}(a).


SLIDE 20

Lemma A: Q_n are Optimal for the ARP

Lemma. Q_n(s, a) = Q*_{ARP}((s, n), a); that is, the Q_n(s, a) are the optimal action-values for ARP states (s, n) and ARP actions a.


SLIDE 21

Lemma A: Q_n are Optimal for the ARP

The ARP was constructed to have this property. At n = 0, Q_0(s, a) is the optimal (and only possible) action-value of (s, 0) and a, so

    Q_0(s, a) = Q*_{ARP}((s, 0), a).

It is easy to see by induction that for all a and s, and for any n,

    Q_n(s, a) = Q*_{ARP}((s, n), a).


SLIDE 22

Lemma B: Convergence of Transitions and Rewards

Lemma. With probability 1, the probabilities P^{(n)}_{x,y}(a) and expected rewards R^{(n)}_x(a) in the ARP converge to the transition matrices and expected rewards of the true process as the card level n → ∞.


SLIDE 23

Lemma B: Convergence of Transitions and Rewards

It is a standard stochastic-approximation theorem that if X_n is updated according to

    X_{n+1} = X_n + β_n (ξ_n − X_n),

where 0 ≤ β_n < 1, Σ_{n=1}^∞ β_n = ∞, Σ_{n=1}^∞ β_n² < ∞, and ξ_n is a bounded random variable with mean Ξ, then X_n → Ξ almost surely.
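As a quick sanity check, the theorem can be simulated with β_n = 1/n, which satisfies both summability conditions, and a bounded ξ_n; this toy script (all choices illustrative) settles near Ξ = 1:

```python
import random

def stochastic_approximation(steps=100_000, seed=0):
    """Iterate X <- X + beta_n * (xi_n - X) with beta_n = 1/n and
    xi_n uniform on [0, 2], a bounded variable with mean Xi = 1."""
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, steps + 1):
        x += (1.0 / n) * (rng.uniform(0.0, 2.0) - x)
    return x

print(stochastic_approximation())  # ~1.0, approaching Xi as steps grows
```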


SLIDE 24

Lemma B: Convergence of Transitions and Rewards

  • Define n_i = n_i(x, a).
  • If R^{(n_i)}_x(a) is the expected immediate reward for performing action a from state x at card n_i, then

    R^{(n_{i+1})}_x(a) = R^{(n_i)}_x(a) + α_{n_{i+1}} (r_{n_{i+1}} − R^{(n_i)}_x(a)).

    Since R is written in the form above, Ξ = E[r_{n_{i+1}}] = R_x(a), and R^{(n_i)}_x(a) → R_x(a).
  • Similarly, P^{(n_i)}_{x,y}(a) → P_{x,y}(a), since

    P^{(n_{i+1})}_{x,y}(a) = P^{(n_i)}_{x,y}(a) + α_{n_{i+1}} (I{y_{n_{i+1}} = y} − P^{(n_i)}_{x,y}(a)).


SLIDE 25

Lemma C: Close Action-Values

Lemma. Consider executing a series of t actions in the ARP and in the real process. If the probabilities P^{(n)}_{x,y}(a) and expected rewards R^{(n)}_x(a) at the appropriate levels of the ARP for each of the actions are close to P_{x,y}(a) and R_x(a), respectively, then the value of the series of actions in the ARP, Q_{ARP}(x, a_1, . . . , a_t), will be close to its value in the true process, Q(x, a_1, . . . , a_t).


SLIDE 26

Finishing the Convergence Proof

  • Lemma B bounds the distance between P^{(n)}_{x,y}(a) and P_{x,y}(a), and between R^{(n)}_x(a) and R_x(a).
  • Lemma C shows that if the transition probabilities and rewards are close, then the values of the action sequences Q_{ARP}((s, n), a_1, . . . , a_t) and Q(s, a_1, . . . , a_t) must be close too.
  • By Lemmas B and C, the action-values are close, and so are the optimal action-values Q*_{ARP}((s, n), a_1, . . . , a_t) and Q*(s, a_1, . . . , a_t).
  • By Lemma A, the optimal Q value for the nth level of the ARP is Q_n, and so Q_n(s, a) → Q*(s, a).


SLIDE 27

References

1. Pfeffer, A., Parkes, D., Adams, R. "Markov Decision Processes", Harvard Extension School, CSCI E-181, 2014.
2. Pfeffer, A., Parkes, D., Adams, R. "Reinforcement Learning", Harvard Extension School, CSCI E-181, 2014.
3. Watkins, C., Dayan, P. "Technical Note: Q-Learning", Machine Learning, 8, 279-292. Kluwer Academic Publishers, Boston (1992).
