Reinforcement Learning
- Now that you know a little about Optimal Control Theory, you already have some knowledge of RL.
- RL shares its overall goal with OCT: solving for a control policy such that the cumulative cost is minimized. It is well suited to problems that involve a long-term versus short-term reward trade-off.
- But OCT assumes perfect knowledge of the system’s
description in the form of a model and ensures strong guarantees, while RL operates directly on measured data and rewards from interaction with the environment.
RL in Robotics
- Reinforcement learning (RL) enables a robot to autonomously
discover an optimal behavior through trial-and-error interactions with its environment.
- The designer of a control task provides feedback in terms of a
scalar objective function that measures the one-step performance of the robot.
- Problems are often high-dimensional with continuous states
and actions, and the state is often partially observable.
- Experience on a real physical system is tedious to obtain,
expensive and often hard to reproduce.
Problem Definition
- A reinforcement learning problem typically includes:
  - A set of states: $S$
  - A set of actions: $A$
  - Transition rules: $P^a_{ss'} : S \times S \times A \mapsto \mathbb{R}$, with $0 \le P^a_{ss'} \le 1$ and $\sum_{s'} P^a_{ss'} = 1$
  - Reward function: $r : S \mapsto \mathbb{R}$
- Here we assume full observability but with a stochastic transition model (see the sketch below).
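As a rough illustration (not from the slides), the components above can be written out for a toy two-state problem; the state names, action names, and numbers are invented for the example.

```python
# Hypothetical two-state, two-action MDP, to make S, A, P^a_{ss'} and r concrete.
S = ["s0", "s1"]
A = ["stay", "go"]

# Transition rules P[a][s][s'], each entry in [0, 1] and each row summing to 1.
P = {
    "stay": {"s0": {"s0": 0.9, "s1": 0.1}, "s1": {"s0": 0.0, "s1": 1.0}},
    "go":   {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.7, "s1": 0.3}},
}

# Reward function r : S -> R.
r = {"s0": 0.0, "s1": 1.0}

# Sanity check: every transition distribution sums to 1.
for a in A:
    for s in S:
        assert abs(sum(P[a][s].values()) - 1.0) < 1e-9
```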
Long-term Expected Return
- Finite-horizon expected return:
  $J = E\left[\sum_{k=0}^{H} r_k\right]$
- Infinite-horizon return with a discount factor $\gamma$:
  $J = E\left[\sum_{k=0}^{\infty} \gamma^k r_k\right]$
- In the limit as $\gamma$ approaches 1, the metric approaches what is known as the average-reward criterion:
  $J = \lim_{H \to \infty} E\left[\frac{1}{H}\sum_{k=0}^{H} r_k\right]$
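As a quick illustration (the reward sequence below is made up), the three criteria can be computed from one sampled reward sequence; the expectation $E$ is approximated here by a single trajectory.

```python
# Illustrative reward sequence r_0, ..., r_H from one sampled trajectory.
rewards = [1.0, 0.0, 2.0, 1.0, 0.5]
gamma = 0.9

# Finite-horizon expected return: plain sum up to the horizon H.
J_finite = sum(rewards)

# Infinite-horizon discounted return (truncated at the available samples).
J_discounted = sum(gamma**k * r_k for k, r_k in enumerate(rewards))

# Average-reward criterion: mean reward per step.
J_average = sum(rewards) / len(rewards)

print(J_finite, J_discounted, J_average)
```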
Value Function
- Recall from optimal control theory, v(x) = "minimal total cost for completing the task starting from state x".
- A value function that follows a particular policy $\Pi$, $V^\Pi(s) : S \mapsto \mathbb{R}$:
  $V^\Pi(s) = E_\Pi\{R_t \mid s_t = s\} = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$
  where $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ and $\gamma$ is the discount factor.
- The optimal value function:
  $V^*(s) = \max_\Pi V^\Pi(s)$
Policy
- Deterministic policy: $\Pi : S \mapsto A$
- Probabilistic policy: $\Pi : S \times A \mapsto \mathbb{R}$
- The optimal policy: $\Pi^* = \arg\max_\Pi V^\Pi(s)$
Exploration and Exploitation
- To gain information about the rewards and the behavior of the
system, the agent needs to explore by considering previously unused actions or actions it is uncertain about.
- The agent needs to decide whether to stick to well-known actions with high rewards or to try new actions in order to discover strategies with an even higher reward.
- This problem is commonly known as the exploration-
exploitation trade-off.
[Diagram: taxonomy of RL methods]
- Value Function Approach: Dynamic Programming (Value Iteration, Policy Iteration), Monte Carlo, Temporal Difference (TD(λ), SARSA, Q-learning)
- Policy Search Approach: Policy Gradient, Expectation–Maximization, Information-Theoretic, Path Integral
- Actor-Critic Approach
Bellman Equation
- The expected long-term reward of a policy can be expressed in
a recursive formulation.
$V^\Pi(s) = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s\right\}$
$\quad = \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma\, E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_{t+1} = s'\right\}\right)$
$\quad = \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma V^\Pi(s')\right)$
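Because this recursion is linear in $V^\Pi$ for a fixed policy, it can be solved exactly for small discrete problems. A minimal NumPy sketch, assuming a made-up 3-state chain with the policy already folded into the transition matrix:

```python
import numpy as np

gamma = 0.9

# P_pi[s, s'] = P^{Pi(s)}_{ss'} for a fixed (hypothetical) policy Pi on 3 states.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.1, 0.0, 0.9]])

# Reward received upon entering each successor state s'.
r = np.array([0.0, 1.0, 5.0])

# Bellman equation: V = P_pi (r + gamma V)  =>  (I - gamma P_pi) V = P_pi r
V = np.linalg.solve(np.eye(3) - gamma * P_pi, P_pi @ r)
print(V)
```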
Value Iteration
- Value iteration starts with a guess V(0) of the optimal value function and constructs a sequence of improved guesses.
- This process is guaranteed to converge to the optimal value function V* (see the sketch after the update rule below).
$V^{(i+1)}(s) = \max_\Pi \sum_{s'} P^{\Pi(s)}_{ss'}\left(r(s') + \gamma V^{(i)}(s')\right)$
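A minimal value-iteration sketch in NumPy, assuming a tabular toy model (all numbers, the tolerance, and the iteration cap are arbitrary):

```python
import numpy as np

gamma, tol = 0.9, 1e-8

# P[a, s, s']: transition probabilities; r[s']: reward on entering s'. Toy values.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],   # action 0
              [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]])  # action 1
r = np.array([0.0, 0.0, 1.0])

V = np.zeros(3)                              # initial guess V^(0)
for _ in range(10_000):
    # Q[a, s] = sum_{s'} P^a_{ss'} (r(s') + gamma V(s'))
    Q = P @ (r + gamma * V)
    V_new = Q.max(axis=0)                    # greedy backup over actions
    converged = np.max(np.abs(V_new - V)) < tol
    V = V_new
    if converged:
        break

print(V, Q.argmax(axis=0))                   # optimal values and greedy policy
```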
Policy Iteration
- Find the optimal policy by iterating two procedures until
convergence
- Policy Evaluation
- Policy Improvement
Policy Evaluation
Input: $\Pi$    Output: $V^\Pi$
Step 1: Arbitrarily initialize $V(s)$, $\forall s \in S$
Step 2: Repeat until convergence
          For each $s$:
            $a = \Pi(s)$
            $V(s) = \sum_{s'} P^a_{ss'}\left(r(s') + \gamma V(s')\right)$
Step 3: Output $V(s)$
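A small sketch of the evaluation loop above on a tabular toy model; the policy is a hypothetical fixed array mapping each state to one action.

```python
import numpy as np

gamma, tol = 0.9, 1e-8

# Toy model: P[a, s, s'] transition probabilities, r[s'] reward on entering s'.
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]])
r = np.array([0.0, 0.0, 1.0])
pi = np.array([1, 1, 0])                 # hypothetical deterministic policy Pi(s)

V = np.zeros(3)                          # Step 1: arbitrary initialization
while True:                              # Step 2: repeat until convergence
    V_new = np.array([P[pi[s], s] @ (r + gamma * V) for s in range(3)])
    if np.max(np.abs(V_new - V)) < tol:
        break
    V = V_new

print(V_new)                             # Step 3: output V^Pi
```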
Policy Improvement
Input: $V^\Pi$    Output: $\Pi'$
Step 1: For each $s$:
          $\Pi'(s) = \arg\max_a \sum_{s'} P^a_{ss'}\left(r(s') + \gamma V^\Pi(s')\right)$
Step 2: Output $\Pi'(s)$
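The greedy improvement step, sketched for the same toy model; alternating it with the evaluation loop above gives policy iteration.

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],
              [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8], [0.2, 0.0, 0.8]]])
r = np.array([0.0, 0.0, 1.0])

def improve(V_pi):
    """Pi'(s) = argmax_a sum_{s'} P^a_{ss'} (r(s') + gamma V^Pi(s'))."""
    Q = P @ (r + gamma * V_pi)       # Q[a, s]
    return Q.argmax(axis=0)

V_pi = np.array([0.0, 0.5, 10.0])    # pretend this came from policy evaluation
print(improve(V_pi))                 # improved policy Pi'
```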
Monte Carlo Approach
- Both value iteration and policy iteration use a dynamic programming approach.
- The dynamic programming approach requires a transition model, P, which is often unavailable in real-world problems.
- The Monte Carlo algorithm does not require a model to be known. Instead, it generates sample episodes to approximate the value function.
The Q function
- Introduce the Q function, $Q^\Pi(s, a) : S \times A \mapsto \mathbb{R}$:
  $Q^\Pi(s, a) = E_\Pi\{R_t \mid s_t = s, a_t = a\} = E_\Pi\left\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a\right\}$
- Use the Q(s, a) function instead of the value function V(s) because, in the absence of a transition model, the values with respect to all possible actions at s must be stored explicitly.
- The optimal Q function:
  $Q^*(s, a) = \max_\Pi Q^\Pi(s, a)$, with $V^*(s) = \max_a Q^*(s, a)$ and $\Pi^*(s) = \arg\max_a Q^*(s, a)$
Monte Carlo Policy Iteration
Step 1: Arbitrarily initialize $Q(s, a)$, set $\Pi(s, a) = \frac{1}{|A|}$ for all $s \in S$, $a \in A$, and create an empty list return(s, a).
Step 2 (policy evaluation): Repeat many times
          Generate an episode $s_0 \xrightarrow{a_0} s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} \cdots$ using $\Pi$
          For each pair $(s, a)$ in the episode:
            Compute the long-term return R from (s, a)
            Append R to return(s, a)
          Assign the average of return(s, a) to Q(s, a)
Continue...
Monte Carlo Policy Iteration
Policy improvement: For all $s$
          $a^* = \arg\max_a Q(s, a)$
          $\Pi(s, a) = \begin{cases} 1 - \epsilon & \text{if } a = a^* \\ \frac{\epsilon}{|A| - 1} & \text{if } a \ne a^* \end{cases}$
          ($\epsilon$ is a small number, which balances exploration and exploitation)
Step 3: Output $\Pi(s) = \arg\max_a Q(s, a)$
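A compact sketch of this Monte Carlo control loop (every-visit returns, ε-greedy improvement) on a made-up episodic chain environment; every environment detail and hyperparameter here is invented for illustration.

```python
import random
from collections import defaultdict

gamma, epsilon, n_actions = 0.95, 0.1, 2

def step(s, a):
    """Toy episodic environment: action 1 tends to move right; state 3 is terminal."""
    s2 = min(s + 1, 3) if (a == 1 or random.random() < 0.2) else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

Q = defaultdict(float)
returns = defaultdict(list)                          # return(s, a) lists
policy = defaultdict(lambda: [1.0 / n_actions] * n_actions)

for _ in range(2000):                                # Step 2: repeat for many episodes
    # Generate an episode using the current stochastic policy Pi.
    episode, s = [], 0
    while s != 3 and len(episode) < 50:
        a = random.choices(range(n_actions), weights=policy[s])[0]
        s2, rwd = step(s, a)
        episode.append((s, a, rwd))
        s = s2
    # Policy evaluation: append each observed return and average into Q.
    G = 0.0
    for (s, a, rwd) in reversed(episode):
        G = rwd + gamma * G
        returns[(s, a)].append(G)
        Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
    # Policy improvement: epsilon-greedy around a* = argmax_a Q(s, a).
    for s in {s for (s, _, _) in episode}:
        a_star = max(range(n_actions), key=lambda a: Q[(s, a)])
        policy[s] = [1 - epsilon if a == a_star else epsilon / (n_actions - 1)
                     for a in range(n_actions)]

# Step 3: output the greedy policy.
print({s: max(range(n_actions), key=lambda a: Q[(s, a)]) for s in range(3)})
```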
Temporal Difference Learning
- A problem with Monte Carlo learning is that it takes a lot of time to simulate/execute the episodes.
- Temporal Difference (TD) learning is a combination of Monte
Carlo and dynamic programming.
- Update the value function based on previously learned
estimates.
Policy Iteration in TD
Step 1: Arbitrarily initialize $Q(s, a)$ and set $\Pi(s, a) = \frac{1}{|A|}$
Step 2: Repeat for each episode
          s = initial state of the episode
          a = generate a sample from Π(s, a)
          Repeat for each step in the episode:
            s' = new state reached by taking action a from s
Continue...
Policy Iteration in TD
            $a^* = \arg\max_a Q(s', a)$
            $\Pi(s', a) = \begin{cases} 1 - \epsilon & \text{if } a = a^* \\ \frac{\epsilon}{|A| - 1} & \text{if } a \ne a^* \end{cases}$
            a' = generate a sample from Π(s', a)
            $Q(s, a) = Q(s, a) + \alpha\left[r(s') + \gamma Q(s', a') - Q(s, a)\right]$
            s = s',  a = a'
          until s is the terminal state
Step 3: Output $\Pi(s) = \arg\max_a Q(s, a)$
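The update above is the SARSA rule. A minimal tabular sketch on the same kind of made-up chain environment (all environment details, hyperparameters, and the step cap are invented for illustration):

```python
import random
from collections import defaultdict

gamma, alpha, epsilon, n_actions = 0.95, 0.1, 0.1, 2

def step(s, a):
    """Toy episodic environment: action 1 tends to move right; state 3 is terminal."""
    s2 = min(s + 1, 3) if (a == 1 or random.random() < 0.2) else max(s - 1, 0)
    return s2, (1.0 if s2 == 3 else 0.0)

def eps_greedy(Q, s):
    """Sample an action from the epsilon-greedy policy Pi(s, .)."""
    a_star = max(range(n_actions), key=lambda a: Q[(s, a)])
    if random.random() < epsilon:
        return random.choice([a for a in range(n_actions) if a != a_star])
    return a_star

Q = defaultdict(float)
for _ in range(2000):                        # Step 2: repeat for each episode
    s, steps = 0, 0
    a = eps_greedy(Q, s)
    while s != 3 and steps < 200:            # repeat for each step until terminal
        s2, rwd = step(s, a)
        a2 = eps_greedy(Q, s2)
        # TD update: Q(s,a) += alpha [ r(s') + gamma Q(s',a') - Q(s,a) ]
        Q[(s, a)] += alpha * (rwd + gamma * Q[(s2, a2)] - Q[(s, a)])
        s, a = s2, a2
        steps += 1

# Step 3: output the greedy policy Pi(s) = argmax_a Q(s, a)
print({s: max(range(n_actions), key=lambda a: Q[(s, a)]) for s in range(3)})
```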
n-step TD and Linear Combination
[Diagram: backup diagrams for 1-step TD, 2-step TD, ..., and Monte Carlo. TD(λ) combines the n-step returns with weights (1 − λ), (1 − λ)λ, ..., λ^{n−1}; λ = 0 recovers the 1-step TD method and λ = 1 recovers the Monte Carlo method.]