Introduction to Reinforcement Learning
Scott Sanner NICTA / ANU First.Last@nicta.com.au
Sense Learn Act
Lecture Goals
1) To understand formal models for decision- making under uncertainty and their properties
2) To understand efficient solution algorithms for these models
– Elevator: up/down/stay – 6 elevators: 3^6 actions
– Random arrivals (e.g., Poisson)
– Minimize total wait – (Requires being proactive about future arrivals)
– People might get annoyed if elevator reverses direction
– Othello: solved by Logistello! – Monte Carlo RL (self-play) + logistic regression + search
– Backgammon: solved by TD-Gammon! – Temporal difference learning (self-play) + artificial neural net + search
– Go: learning + search? – Unsolved!
– Opponent may abruptly change strategy – Might prefer best outcome for any opponent strategy
– Earlier actions may reveal information – Or they may not (bluff)
– Extremely complex task, requires expertise in vision, sensors, real-time operating systems
– e.g., only get noisy sensor readings
– e.g., steering response in different terrain
Observations State Actions
– Perceptions, e.g.,
– At any point in time, system is in some state, e.g.,
– State set description varies between problems
– Actions could be concurrent – If k actions, A = A1 × A2 × … × Ak
– All actions need not be under agent control
– Alternating turns: Poker, Othello – Concurrent turns: Highway Driving, Soccer
– Random arrival of person waiting for elevator – Random failure of equipment
– Observation function Z: S × A × O → [0,1]
– O = ∅ – e.g., heaven vs. hell » only get feedback once you meet St. Pete
– S ↔ O … the case we focus on! – e.g., many board games, » Othello, Backgammon, Go
– all remaining cases – also called incomplete information in game theory – e.g., driving a car, Poker
– Some properties
– Next state dependent only upon previous state / action – If not Markovian, can always augment state description » e.g., elevator traffic model differs throughout day; so encode time in S to make T Markovian!
– Assign any reward value s.t. R(success) > R(fail) – Can have negative costs C(a) for action a
– How to specify preferences? – R(s,a) assigns utilities to each state s and action a
… but how to trade off rewards over time?
– How to trade off immediate vs. future reward? – E.g., use discount factor γ (try γ=.9 vs. γ=.1)
(Diagram: example MDP with action a=stay available in each state.)
– Horizon
– How to trade off reward over time?
– Use discount factor γ » Reward t time steps in the future is discounted by γ^t – Many interpretations » Future reward worth less than immediate reward
» (1-γ) chance of termination at each time step
» cumulative reward finite
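To see the effect of γ concretely, here is a minimal sketch (the all-ones reward stream is made up purely for illustration) comparing cumulative discounted reward under γ = .9 vs. γ = .1:

```python
# Discounted return: reward t steps in the future is weighted by gamma**t.
def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1.0] * 50  # hypothetical stream of +1 reward per step

# gamma = .9 values the future heavily; gamma = .1 is nearly myopic.
print(discounted_return(rewards, 0.9))  # ~9.95, close to 1/(1-0.9) = 10
print(discounted_return(rewards, 0.1))  # ~1.11, close to 1/(1-0.1)
```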
– Know Z, T, R – Called: Planning (under uncertainty)
– At least one of Z, T, R unknown – Called: Reinforcement learning
– Permits hybrid planning and learning
Saves expensive interaction!
– Objective
– Model-based or model-free
– Markovian assumption on T frequently made (MDP)
– That’s what this lecture is about!
Can you provide this description for the five previous examples? Note: Don't worry about the solution just yet, just formalize the problem.
Sense Learn Act
– R(s=1,a=stay) = 2 – …
(Diagram: example MDP with deterministic transitions, each with P=1.0.)
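The formalization asked for above can be written down directly as tables. A minimal sketch in Python follows; apart from R(s=1, a=stay) = 2 and the P=1.0 transitions echoed from the example, every state, action, and number below is a hypothetical stand-in.

```python
# A tiny MDP written out explicitly: S, A, T(s,a,s'), R(s,a).
# The specific numbers are illustrative; only the shape of the model matters.
S = [1, 2]
A = ["stay", "go"]

# T[(s, a)] maps next state s' -> probability (deterministic here, P=1.0).
T = {
    (1, "stay"): {1: 1.0},
    (1, "go"):   {2: 1.0},
    (2, "stay"): {2: 1.0},
    (2, "go"):   {1: 1.0},
}

# R[(s, a)] is the immediate reward, e.g. R(s=1, a=stay) = 2 as in the example.
R = {
    (1, "stay"): 2.0,
    (1, "go"):   0.0,
    (2, "stay"): 0.0,
    (2, "go"):   0.0,
}
gamma = 0.9  # discount factor
```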
How do we act in an MDP? Define a policy π: S → A. Note: this assumes the fully observed case.
– Discount factor γ important (γ=.9 vs. γ=.1)
(Diagram: stochastic version of the example MDP, where a=stay succeeds with P=.9.)
– Value Vπ(s): expected discounted reward obtained by following π starting from state s
– Find optimal policy π* that maximizes value – Surprisingly: a single π* maximizes value in every state simultaneously – Furthermore: there is always a deterministic π*
– A greedy policy πV takes action in each state that maximizes expected value w.r.t. V: – If can act so as to obtain V after doing action a in state s, πV guarantees V(s) in expectation
πV guarantees at least that much value!
– Take action a then act so as to achieve Vt−1 thereafter
– Expected value of the best action a at decision stage t:
  Vt(s) = max_a [ R(s,a) + γ Σ_s' T(s,a,s') Vt−1(s') ]
– At the ∞ horizon, this converges to V*
– This value iteration solution is known as dynamic programming (DP)
can derive these equations from first principles!
deterministic greedy policy π*= πV* satisfying:
– Vt converges as t → ∞ … does this suggest a solution?
– Terminate when successive value functions change by less than a threshold (small Bellman error) – Guarantees ε-optimal value function
Precompute maximum number of steps for ε?
Same DP solution as before.
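A minimal value iteration sketch over a table-based MDP in the same dictionary format as the earlier sketch (S, A, T, R, gamma); the ε threshold and helper names are my own choices:

```python
def value_iteration(S, A, T, R, gamma, epsilon=1e-6):
    """Dynamic programming: repeat Bellman backups until values change by less than epsilon."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            # Bellman backup: best action value under a one-step lookahead into V.
            V_new[s] = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                for a in A
            )
        if max(abs(V_new[s] - V[s]) for s in S) < epsilon:
            return V_new
        V = V_new

def greedy_policy(S, A, T, R, gamma, V):
    """Extract the greedy policy pi_V with respect to a value function V."""
    return {
        s: max(A, key=lambda a: R[(s, a)] +
               gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
        for s in S
    }
```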
(Diagram: value iteration unrolled as a backup tree: at each decision stage, states s1, s2 branch on actions A1, A2; each MAX node combines successor values, producing V1(s), V2(s), V3(s), … as the horizon grows.)
Don’t need to update values synchronously with uniform depth. As long as each state updated with non-zero probability, convergence still guaranteed! Can you see intuition for error contraction?
– RTDP (real-time dynamic programming): only needs to converge over relevant states: states reachable from initial states under π* – may converge without visiting all states!
– Focus backups on high error states – Can use in conjunction with other focused methods, e.g., RTDP
– Record Bellman error of state – Push state onto queue with priority = Bellman error
– Withdraw maximal priority state from queue – Perform Bellman backup on state
Where do RTDP and PS each focus?
– Good when you need a policy for every state – OR transitions are dense
– Know best states to update
– Know how to order updates
– Policy iteration: initialize π0 arbitrarily, then alternate policy evaluation and greedy policy improvement
– Policy evaluation solves the linear system Vπ = Rπ + γ Tπ Vπ, e.g., directly when (I − γ Tπ) is invertible.
– Each iteration seen as doing 1-step of policy evaluation for current greedy policy – Bootstrap with value estimate of previous policy
– Each iteration is full evaluation of Vπ for current policy π – Then do greedy policy update
– Like policy iteration, but Vπi need only be closer to V* than Vπi−1 when bootstrapped with Vπi−1
– Typically faster than VI & PI in practice
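A sketch of how the value iteration / policy iteration / modified policy iteration spectrum can be realized; the parameter k_eval (number of evaluation sweeps per improvement step) and all names are illustrative assumptions:

```python
def modified_policy_iteration(S, A, T, R, gamma, k_eval=5, iters=100):
    """Sketch of (modified) policy iteration: k_eval sweeps of policy evaluation
    per iteration (k_eval=1 behaves like value iteration, a large k_eval like full
    policy iteration), followed by a greedy policy improvement step."""
    pi = {s: A[0] for s in S}   # arbitrary initial policy
    V = {s: 0.0 for s in S}
    for _ in range(iters):
        # Partial policy evaluation: a few backups of V under the current policy.
        for _ in range(k_eval):
            V = {s: R[(s, pi[s])] +
                    gamma * sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
                 for s in S}
        # Policy improvement: act greedily with respect to the current V.
        pi = {s: max(A, key=lambda a: R[(s, a)] +
                     gamma * sum(p * V[s2] for s2, p in T[(s, a)].items()))
              for s in S}
    return pi, V
```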
– Bellman equations from first principles – Solution via various algorithms
– Value Iteration
– (Modified) Policy Iteration
Sense Learn Act
– Sample from actual or simulated experience
– Only defined for episodic (terminating) tasks – On-line: Learn while acting
Slides from Rich Sutton’s course CMP499/609 Reinforcement Learning in AI
http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
Reinforcement Learning, Sutton & Barto, 1998. Online.
– Goal: learn Vπ(s), the expected return from each state given π
– Given: episodes under π which contain s; average the returns observed after visits to s
(Example: a 5-state corridor from Start to Goal; Monte Carlo updates each visited state with the final discounted return.)
– Blackjack example: the object is to have your card sum be greater than the dealer's without exceeding 21
– States: current sum (12-21), dealer's showing card (ace-10), do I have a useable ace?
– Actions: stick (stop receiving cards), hit (receive another card)
Assuming fixed policy for now.
– The estimate for one state does not build on the estimates of other states (unlike DP)
– The time required to estimate one state does not depend on the total number of states
– The backup spans an entire episode, ending only at the terminal state
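A minimal first-visit Monte Carlo evaluation sketch; the episode format (a list of (state, reward) pairs produced by following π, with the reward received after acting in that state) is an assumption of this illustration:

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma):
    """First-visit Monte Carlo evaluation of a fixed policy.
    `episodes` is assumed to be a list of episodes, each a list of
    (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the discounted return from every time step, backwards.
        G, Gs = 0.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = r + gamma * G
            Gs[t] = G
        # Record the return only for the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(Gs[t])
    # V(s) is the average observed return following first visits to s.
    return {s: sum(g) / len(g) for s, g in returns.items()}
```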
– Not just evaluate a given policy
– Cannot execute policy based on V(s) – Instead, want to learn Q*(s,a)
– Qπ(s,a): expected return after taking action a in state s and thereafter following π
– Policy evaluation with MC methods, followed by policy improvement
– Policy improvement: in each state, switch to the action a maximizing Qπ(s,a)
– Evaluation: π → Qπ.  Improvement: π → greedy(Q).
Instance of Generalized Policy Iteration.
– Convergence requires that every state-action pair be visited an infinite number of times
– Requires exploration, not just exploitation
– Need soft policies: π(s,a) > 0 for all s and a
– e.g., an ε-soft (ε-greedy) policy: the greedy action gets probability 1 − ε + ε/|A(s)|, and each non-max action gets probability ε/|A(s)|
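A minimal ε-greedy (ε-soft) selection sketch; the Q dictionary keyed by (state, action) is an assumed representation:

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """epsilon-soft action selection: with probability epsilon pick a uniformly
    random action, otherwise pick the greedy action for state s.
    Every action therefore has probability at least epsilon / len(actions)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```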
– Learn from direct interaction with environment – No need for full models – Less harmed by violations of the Markov assumption
Sense Learn Act
Slides from Rich Sutton’s course CMP499/609 Reinforcement Learning in AI
http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
Reinforcement Learning, Sutton & Barto, 1998. Online.
Simple every-visit Monte Carlo method: V(st) ← V(st) + α [Rt − V(st)]
Recall: Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ
The simplest TD method, TD(0): V(st) ← V(st) + α [rt+1 + γ V(st+1) − V(st)]
MC target: the actual return after time t.  TD target: an estimate of the return.
(Backup diagrams: simple Monte Carlo backs up the entire sampled episode from st to the terminal state T, using V(st) ← V(st) + α [Rt − V(st)], where Rt is the actual return following state st.  TD(0) backs up a single sampled transition from st to st+1, using V(st) ← V(st) + α [rt+1 + γ V(st+1) − V(st)].  cf. Dynamic Programming, which backs up a full one-step expectation over all successors: V(st) ← Eπ[rt+1 + γ V(st+1)].)
– MC does not bootstrap – DP bootstraps – TD bootstraps
– MC samples – DP does not sample – TD samples
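The TD(0) backup above, as a single update step; the dictionary representation of V is an assumption of this sketch:

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """One TD(0) backup: move V(s) toward the bootstrapped target r + gamma*V(s').
    Unlike MC, this can be applied online, before the episode's final outcome is known."""
    v_s = V.get(s, 0.0)
    target = r if terminal else r + gamma * V.get(s_next, 0.0)
    V[s] = v_s + alpha * (target - v_s)
```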
State                Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office       0                    30                     30
reach car, raining   5                    35                     40
exiting highway      20                   15                     35
behind truck         30                   10                     40
home street          40                   3                      43
arrive home          43                   0                      43
(Plot: predicted total travel time, 30-45 minutes, at each situation (leaving office, reach car, exiting highway, 2ndary road, home street, arrive home) compared against the actual outcome; one panel shows the changes recommended by Monte Carlo methods (α=1), the other the changes recommended by TD methods (α=1).)
– TD methods do not require a model of the environment, only experience
– TD, but not MC, methods can be fully incremental
– You can learn before knowing the final outcome
– You can learn without the final outcome
(Example: a 5-state random walk A-B-C-D-E; episodes start in the center state, and all rewards are 0 except a reward of 1 for terminating off the right end.)
Values learned by TD(0) after various numbers of episodes
Data averaged over 100 sequences of episodes
Batch Updating: train completely on a finite amount of data,
e.g., train repeatedly on 10 episodes until convergence. Only update estimates after complete pass through the data. For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!
(Plot: batch training on the random walk: RMS error averaged over states vs. number of walks/episodes (25, 50, 75, 100), for TD and MC.)
After each new episode, all previous episodes were treated as a batch, and algorithm was trained until convergence. All repeated 100 times.
Suppose you observe the following 8 episodes:  A,0,B,0;  B,1;  B,1;  B,1;  B,1;  B,1;  B,1;  B,0
(Diagram: the estimated Markov model: A goes to B with probability 100% and r = 0; from B, the episode terminates with r = 1 75% of the time and with r = 0 25% of the time.)
– The prediction that best matches the observed training data is V(A)=0
– This minimizes the mean-square-error – This is what a batch Monte Carlo method gets
– If we consider the sequential (Markov) structure of the problem, then we would set V(A)=.75
– This is correct for the maximum likelihood estimate
– This is what TD(0) gets
MC and TD results are same in ∞ limit of data. But what if data < ∞?
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:
SARSA = TD(0) for Q functions.
One-step Q-learning:  Q(st, at) ← Q(st, at) + α [ rt+1 + γ maxa Q(st+1, a) − Q(st, at) ]
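A minimal one-step Q-learning update matching the equation above; the (state, action)-keyed Q dictionary is an assumed representation, and the comment contrasts it with Sarsa's on-policy target:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma, terminal=False):
    """One-step Q-learning backup.  Off-policy: the target uses max_a' Q(s', a')
    regardless of which action the behaviour policy will actually take next.
    (Sarsa would instead use Q(s', a') for the action a' actually chosen.)"""
    q_sa = Q.get((s, a), 0.0)
    best_next = 0.0 if terminal else max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)
```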
ε−greedy, ε = 0.1 Optimal exploring policy. Optimal policy, but exploration hurts more here.
– Usually a state-value function evaluates states in which the agent can take an action.
– But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
– Such an afterstate is, in effect, just an action that looks like a state.
(Diagram: two different tic-tac-toe position/move pairs that lead to the same resulting position.)
– Introduced one-step, tabular, model-free TD methods
– On-policy control: Sarsa (instance of GPI) – Off-policy control: Q-learning
– These methods combine aspects of DP and MC: they sample like MC and build their targets from current value estimates like DP, a.k.a. bootstrapping
Sense Learn Act
Is there a hybrid of MC and TD?
– More estimators between two extremes
– Yields lower variance – Leads to faster learning
Slides from Rich Sutton’s course CMP499/609 Reinforcement Learning in AI
http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
Reinforcement Learning, Sutton & Barto, 1998. Online.
All of these estimate same value!
– Use V to estimate remaining return
– 2 step return: – n-step return:
Rt = rt+1 + γ rt+2 + γ² rt+3 + … + γ^(T−t−1) rT                        (full Monte Carlo return)
Rt^(1) = rt+1 + γ Vt(st+1)                                             (1-step return, the TD(0) target)
Rt^(2) = rt+1 + γ rt+2 + γ² Vt(st+2)                                   (2-step return)
Rt^(n) = rt+1 + γ rt+2 + γ² rt+3 + … + γ^(n−1) rt+n + γ^n Vt(st+n)     (n-step return)
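A sketch of computing Rt^(n) from a recorded trajectory; the trajectory layout (rewards[k] = r_{k+1}, states[k] = s_k) is an assumption of this illustration:

```python
def n_step_return(rewards, V, states, t, n, gamma):
    """Compute the n-step return R_t^(n): n sampled rewards, then bootstrap
    from the current value estimate V of the state reached after n steps."""
    T = len(rewards)                      # episode length
    steps = min(n, T - t)                 # truncate at the end of the episode
    G = sum((gamma ** k) * rewards[t + k] for k in range(steps))
    if t + n < T:                         # bootstrap only if the episode did not end first
        G += (gamma ** n) * V.get(states[t + n], 0.0)
    return G
```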
Hint: TD(0) is 1-step return… update previous state on each time step.
– n-step methods were introduced to help with TD(λ) understanding
– Idea: back up an average of several n-step returns
– e.g. backup half of 2-step & 4-step
– Draw each component – Label with the weights for that component
Rt^avg = (1/2) Rt^(2) + (1/2) Rt^(4)
– Forward view of TD(λ): a single complex backup that averages all n-step backups
– Weight the n-step return by λ^(n−1) (time since visitation)
– λ-return:
Rt^λ = (1 − λ) Σ_{n=1}^{∞} λ^(n−1) Rt^(n)
Backup using the λ-return:  ΔVt(st) = α [ Rt^λ − Vt(st) ]
What happens when λ=1, λ= 0?
For episodic tasks (termination at time T):
Rt^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^(n−1) Rt^(n)  +  λ^(T−t−1) Rt
(the first term collects the weights until termination; after termination, all remaining weight λ^(T−t−1) goes on the full return Rt)
δt = rt+1 + γ Vt(st+1) − Vt(st)
– On each step, decay all traces by γλ and increment the trace for the current state by 1 – Accumulating trace
et(s) ∈ ℝ+
et(s) = γλ et−1(s)         if s ≠ st
et(s) = γλ et−1(s) + 1     if s = st    (accumulating trace)
Initialize V(s) arbitrarily
Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γ V(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        s ← s′
    Until s is terminal
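The pseudocode above translates almost line for line into Python. The environment interface (env.reset() → s, env.step(a) → (s′, r, done)) and the policy(s) function are assumptions of this sketch:

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha, gamma, lam):
    """Tabular TD(lambda) with accumulating eligibility traces (backward view)."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)            # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
            e[s] += 1.0                   # accumulating trace for the current state
            for x in list(e.keys()):      # update every state with a non-zero trace
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam       # decay all traces
            s = s_next
    return V
```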
TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
Σ_{t=0}^{T−1} ΔVt^TD(s)  =  Σ_{t=0}^{T−1} α Is,st Σ_{k=t}^{T−1} (γλ)^(k−t) δk  =  Σ_{t=0}^{T−1} ΔVt^λ(st) Is,st
(backward updates on the left equal the forward updates on the right; Is,st is 1 if s = st and 0 otherwise; algebra shown in the book)
– On-line updating: updates are applied immediately, during the episode
– Off-line updating: save all updates for the end of the episode
– Sarsa(λ): keep eligibility traces for state-action pairs instead of just states
et(s,a) = γλ et−1(s,a) + 1    if s = st and a = at
et(s,a) = γλ et−1(s,a)         otherwise
δt = rt+1 + γ Qt(st+1, at+1) − Qt(st, at)
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        δ ← r + γ Q(s′, a′) − Q(s, a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            e(s,a) ← γλ e(s,a)
        s ← s′; a ← a′
    Until s is terminal
– With a single trial, eligibility traces give the agent much more information about how to get to the goal
– not necessarily the best way
– With accumulating traces, frequently visited states can have eligibilities greater than 1
– This can be a problem for convergence
– Replacing traces: instead of adding 1 when you visit a state, set that trace to 1
et(s) = γλ et−1(s)    if s ≠ st
et(s) = 1              if s = st    (replacing trace)
– Why is this task particularly problematic for accumulating traces?
– Replacing traces perform better than accumulating traces over more values of λ
– Averaging estimators
– Efficient implementation
– Advantage of backward view for continuing tasks?
– efficient, incremental way to interpolate between MC and TD
– Lower variance – Faster learning
Sense Learn Act
– Get convergence with deterministic policies
– Need exploration – Usually use stochastic policies for this
– Then get convergence to optimality
– Convergence requires all state/action values updated
– Update any state as needed
– Must be in a state to take a sample from it
– Must occasionally divert from exploiting best policy – Exploration ensures all reachable states/actions updated with non-zero probability
Key property, cannot guarantee convergence to π* otherwise!
– Select a random action a fraction ε of the time
– Another major dimension of RL methods
– TD(λ) interpolates between TD(0) and MC = TD(1)
– TD(λ) methods generally learn faster than MC…
– …but MC is more robust for non-Markovian models
– …and for partially observable problems
– Why partially observable? Because FA aliases states to achieve generalization; it has been proven that TD(λ) may not converge in this case
Sampling method →     TD(λ)                    MC
On-policy             Sarsa (GPI)              MC on-policy control (GPI)
Off-policy            Q-learning (if λ=0)      MC off-policy control
– Just use plain MC or TD(λ); always on-policy!
– Terminology for off- vs. on-policy…
– Where needed for RL.. which of above cases? – Why needed for convergence? – ε-greedy vs. softmax
– Differences in sampling approach? – (Dis)advantages of each?
– Have to learn Q-values, why? – On-policy vs. off-policy exploration methods
– This is the main web of RL methods; beyond this, it's largely just implementation tricks of the trade.
Sense Learn Act
– Can be linear, e.g., – Or non-linear, e.g., – Cover details in a moment…
– In order to train weights via gradient descent
– Consider a function f of the parameter vector θt = (θt(1), θt(2), …, θt(n))ᵀ
– Its gradient at any point θt in this space is:
  ∇θ f(θt) = ( ∂f(θt)/∂θ(1), ∂f(θt)/∂θ(2), …, ∂f(θt)/∂θ(n) )ᵀ
– Iteratively move down the gradient:
  θt+1 = θt − α ∇θ f(θt)
– Use mean squared error between the prediction Vt(st) and a target vt, where
– vt can be the MC return
– vt can be the TD(0) 1-step sample
– The resulting gradient-descent update is θt+1 = θt + α [vt − Vt(st)] ∇θ Vt(st) … you can derive this!
– So the eligibility vector has the same dimension as θ
– Eligibility is proportional to the gradient ∇θ Vt(st)
– TD error as usual, e.g., TD(0): δt = rt+1 + γ Vt(st+1) − Vt(st)
– Can you justify this?
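A minimal sketch of the linear case, where the gradient of Vθ(s) = θ·φ(s) with respect to θ is just φ(s); the numpy-array representation of θ and φ is an assumption:

```python
import numpy as np

def linear_td0_update(theta, phi_s, r, phi_s_next, alpha, gamma, terminal=False):
    """Gradient-descent TD(0) with a linear approximator V_theta(s) = theta . phi(s).
    For the linear case the gradient w.r.t. theta is just the feature vector phi(s),
    so the eligibility/gradient has the same dimension as theta."""
    v_s = theta @ phi_s
    v_next = 0.0 if terminal else theta @ phi_s_next
    delta = r + gamma * v_next - v_s        # TD error, as in the tabular case
    return theta + alpha * delta * phi_s    # move theta along the gradient, scaled by delta
```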
– Just have to learn weights (<< # states)
– May be too limited if don’t choose right features
– Initialize parameter to zero
– Can even use overlapping (or hierarchical) features
– Automatic bias-variance tradeoff!
– Means: add parameter penalty to error, e.g.,
– For MSE, the error surface is simple:
– Step size decreases appropriately – On-line sampling (states sampled from the on-policy distribution) – Converges to parameter vector with property:
– For linear function approximation, ∇θ Vt(s) = φs, the feature vector of s
– Gradient-descent TD(λ) then converges to a parameter vector θ∞ with
  MSE(θ∞) ≤ [ (1 − γλ) / (1 − γ) ] MSE(θ∗)
– where θ∗ is the best parameter vector
(Tsitsiklis & Van Roy, 1997) Slides from Rich Sutton’s course CMP499/609: http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
Not control!
– Approximate the value function with a linear combination of features fi
– Let each fi be "an O in square 1", "an X in square 9", etc.
– Will never learn the optimal value
– adapt or generate them as you go along – E.g., conjunctions of other features
– E.g., non-linear, such as a neural network – Latent variables learned at hidden nodes are expressive enough to represent boolean functions of the inputs » they encode new, complex features of the input space
(Diagram: basis functions with centers ci−1, ci, ci+1 and widths σi, combined in a weighted sum Σ with weights θt.)
(Diagram: the state is mapped into an expanded representation with many features, from which the approximation is computed.)
Slides from Rich Sutton’s course CMP499/609: http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
(Figure: learning a desired function with narrow, medium, and broad features; the approximation is shown after 10, 40, 160, 640, 2560, and 10240 examples for each feature width.)
Slides from Rich Sutton’s course CMP499/609: http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
(Figure: two overlapping tilings, tiling #1 and tiling #2, of a 2D state space.)
– Shape of tiles ⇒ generalization
– #Tilings ⇒ resolution of final approximation
Slides from Rich Sutton’s course CMP499/609: http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
– Tile coding: one binary feature per tile, so the number of features present at any one time is constant
– Binary features make the weighted sum easy to compute
– Easy to compute the indices of the features present
But if state continuous… use continuous FA, not discrete tiling!
(Figure: generalized tilings: a) irregular, b) log stripes, c) diagonal stripes; also irregular tilings and hashing.)
– CMAC: "Cerebellar Model Arithmetic Computer" (Albus 1971)
Slides from Rich Sutton’s course CMP499/609: http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
– Radial basis functions: each feature i has a center ci and width σi:
  φs(i) = exp( −‖s − ci‖² / (2σi²) )
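A sketch of computing the radial basis features above; the array shapes for centers and widths are assumptions of this illustration:

```python
import numpy as np

def rbf_features(s, centers, widths):
    """Radial basis features: phi_s(i) = exp(-||s - c_i||^2 / (2 * sigma_i^2)).
    `centers` is an (n, d) array of feature centers, `widths` an (n,) array of sigmas,
    and the state s is a length-d vector; all of these are whatever you choose."""
    s = np.asarray(s, dtype=float)
    diffs = np.asarray(centers, dtype=float) - s          # broadcast s against every center
    sq_dist = np.sum(diffs ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * np.asarray(widths, dtype=float) ** 2))
```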
Slides from Rich Sutton’s course CMP499/609: http://rlai.cs.ualberta.ca/RLAI/RLAIcourse/RLAIcourse2007.html
– Not convex, but good methods for training
– Good if you don’t know what features to specify
– Just need derivatives with respect to the parameters – Can compute them via backpropagation
TD-Gammon = TD(λ) + Function Approximation (a multi-layer neural network)
non-linear weighted combination of shared sub-functions
to minimize SSE, train weights using gradient descent and chain rule
(Diagram: a feedforward network with inputs x0=1, x1, …, xn, hidden units h0=1, h1, …, hk, and outputs y1, …, ym; every edge has a weight wj,i.)
– MC – TD(λ) – TD xyz with adaptive lambda, etc…
– But if features are inadequate – Or function approximation method is too restricted
– Primary: good features and approximation architecture – Secondary (but also important): rate of convergence. Note: TD(λ) may diverge for control! MC is robust for FA, PO, and Semi-MDPs.
– Too large to solve exactly
– Utilize power of generalization!
– Not just speed of convergence
– But also features and approximation architecture: an important issue to be resolved!
Sense Learn Act
1) To understand formal models for decision- making under uncertainty and their properties
2) To understand efficient solution algorithms for these models
– Modeling sequential decision making – Model-based solutions
– Model-free solutions
– Only the tip of the iceberg… but a large chunk of the tip!