Reinforcement Learning
Lecture 2
Gillian Hayes
11th January 2007

Reinforcement Learning: How Does It Work?
• We detect a state
• We choose an action
• We get a reward
Our aim is to learn a policy: what action to choose in what state to get maximum reward.
Maximum reward over the long term, not necessarily immediate maximum reward: watch TV now, panic over homework later vs. do homework now, watch TV while all your pals are panicking...

Bandit Problems
N-armed bandits, as in slot machines
– action selection
– evaluation
• Action-values Q: how good (in the long term) it is to do this action in this situation, Q(s,a)
• Estimating Q
• How to select an action
• Evaluation vs. instruction
– Evaluation tells you how well you did after choosing an action
– Instruction tells you what the right thing to do was: make your action more like that next time!

Evaluation vs Instruction
RL: training information evaluates the action. It doesn't say whether the action was best or correct. Relative to all other actions: must try them all and compare to see which is best.
Supervised: training instructs. It gives the correct answer regardless of the action chosen, so there is no search in the action space in supervised learning (though we may need to search parameters, e.g. neural network weights).
• So RL needs trial-and-error search
• must try all actions
• feedback is a scalar – other actions could be better (or worse)
• learning by selection – selectively choose those actions that prove to be better
What about GAGP?

What Is a Bandit Problem?
Just one state, always the same.
Non-associative: not mapping $S \to A$ (since there is just one $s \in S$).
N-armed bandit:
• N levers (actions) – choose one
• Each has a scalar reward (coins, or not), which is chosen from a probability distribution
Aim:
• Maximise expected total reward over time T, e.g. some number of plays
Which lever is best?

The Action Value Q
• Q = value of an action: the expected or mean reward from that action
• If the Q-value is known exactly, always choose the action with the highest Q
BUT we only have estimates of Q, built up from experience of rewards.
• Greedy action(s): have highest estimated Q: EXPLOITATION
• Other actions: lower estimated Qs: EXPLORATION
Maximise expected reward on 1 play vs. over long time?
Uncertainty in our estimates of values of Q
EXPLORATION VS. EXPLOITATION TRADEOFF
Can't exploit all the time; must sometimes explore to see if an action that currently looks bad eventually turns out to be good.

How Do We Estimate Q?
True value $Q^*(a)$ of action $a$
Estimated value $Q_t(a)$ at play/time $t$
Suppose we choose action $a$ $k_a$ times and observe a reward $r_i$ on play $i$. Then we can estimate $Q^*$ from the running mean:
$$Q_t(a) = \frac{r_1 + r_2 + r_3 + \cdots + r_{k_a}}{k_a}$$
If $k_a = 0$, $r_0 = 0$
As $k_a \to \infty$, $Q_t(a) \to Q^*(a)$
This is the sample-average method of calculating Q.
* in this case means "true value": $Q^*(a)$. Sometimes write $\hat{Q}$ as estimated value.

Action Selection
Greedy: select the action $a^*$ for which Q is highest:
$$Q_t(a^*) = \max_a Q_t(a)$$
So $a^* = \arg\max_a Q_t(a)$, and * means "best".
Example: 10-armed bandit
Snapshot at time $t$ for actions 1 to 10:
$Q_t(a)$ → 0 0.3 0.1 0.1 0.4 0.05 0 0 0.05 0
$Q_t(a^*) = 0.4$ and $a^* = ?$
Maximises reward
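The sample-average estimate and the greedy choice described in these slides can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sample_average` and `greedy_action` are assumptions made here, not part of the lecture, and ties are broken by taking the first maximal action, which the slides do not specify.

```python
def sample_average(rewards):
    """Sample-average estimate of Q for one action.

    `rewards` holds the rewards observed on the plays where this action
    was chosen. With no plays yet, the estimate defaults to 0.
    """
    if not rewards:
        return 0.0
    return sum(rewards) / len(rewards)


def greedy_action(q_values):
    """Return the 0-based index of the first action with the highest estimate."""
    best = max(q_values)
    return q_values.index(best)


# Snapshot from the 10-armed bandit example (actions 1 to 10).
q_snapshot = [0, 0.3, 0.1, 0.1, 0.4, 0.05, 0, 0, 0.05, 0]
best_index = greedy_action(q_snapshot)  # 0-based; add 1 for the slide's numbering
```

Note that `best_index` is 0-based, while the slide numbers the actions from 1.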

Action Selection
$\epsilon$-greedy: select a random action a fraction $\epsilon$ of the time, else select the greedy action.
Sample all actions infinitely many times, so as $k_a \to \infty$, the Qs converge to $Q^*$.
Can reduce $\epsilon$ over time.
NB: Difference between $Q^*(a)$ and $Q(a^*)$ (but we are following the Sutton and Barto notation)

$\epsilon$-Greedy vs. Greedy
• What if reward variance is larger?
• What if reward variance is very small, e.g. zero?
• What if task is nonstationary?
Which would be better in each of these cases?
Exploration and Exploitation again

Softmax Action Selection
$\epsilon$-greedy: even if the worst action is very bad, it will still be chosen with the same probability as the second-best. We may not want this. So:
Vary selection probability as a function of estimated goodness.
Choose $a$ at time $t$ from among the $n$ actions with probability
$$\frac{\exp(Q_t(a)/\tau)}{\sum_{b=1}^{n} \exp(Q_t(b)/\tau)}$$
Gibbs/Boltzmann distribution; $\tau$ is temperature (from physics).

Softmax Action Selection
Drawback of softmax? What if our estimate of the value of $Q(a^*)$ is initially very low?
Effect of $\tau$:
As $\tau \to \infty$, probability → $1/n$
As $\tau \to 0$, probability → greedy
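The two selection rules above, $\epsilon$-greedy and softmax, might be sketched as follows. The function names, the `rng` parameter, and the trick of subtracting the maximum Q before exponentiating are assumptions made for this sketch rather than anything stated in the lecture. The subtraction only guards against overflow and leaves the Gibbs/Boltzmann probabilities unchanged.

```python
import math
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def softmax_probabilities(q_values, tau):
    """Gibbs/Boltzmann selection probabilities at temperature tau > 0."""
    # Shifting by the max avoids overflow in exp(); the shift cancels
    # between numerator and denominator.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_action(q_values, tau, rng=random):
    """Sample a 0-based action index from the softmax probabilities."""
    probs = softmax_probabilities(q_values, tau)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

Calling `softmax_probabilities` with a very large `tau` gives probabilities close to $1/n$ for every action, and a very small `tau` puts almost all the probability on the greedy action, matching the two limits listed on the slide.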

Incremental Update Equations
Estimate $Q^*$ from the running mean:
$$Q(a) = \frac{r_1 + r_2 + r_3 + \cdots + r_{k_a}}{k_a}$$
if we've tried action $a$ $k_a$ times.
Incremental calculation:
$$Q_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} r_i \quad (1)$$
$$= Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right] \quad (2)$$
NewEstimate = OldEstimate + StepSize [Target − OldEstimate]

Incremental Update Equations
This general form will be met often:
NewEstimate = OldEstimate + StepSize [Target − OldEstimate]
Step size $\alpha$ depends on $k$ in the incremental equation: $\alpha_k = 1/k$
But it is often kept constant, e.g. $\alpha = 0.1$ (gives more weight to recent rewards; why might this be useful?)

Application
Drug trials. You have a limited number of trials, several drugs, and need to choose the best of them.
Bandit arm ≈ drug
Define a measure of success/failure: the reward
Measure how well the patients do on each drug: estimating the Q values
Example
Ethical clinical trials: how do we allocate patients to drug treatments? During the trial we may find that some drugs work better than others.
• Fixed allocation design: allocate $1/k$ of the patients to each of the $k$ drugs
• Adaptive allocation design: if the patients on one drug appear to be doing worse, switch them to the other drugs – equivalent to removing one of the arms of the bandit

Effect of Initial Values of Q
We arbitrarily set the initial values of Q to be zero.
Our estimates are biased by the initial estimate of Q.
Can use this to include domain knowledge.
Set all Q values very high: optimistic.
Initial actual rewards are disappointing compared to the estimate, so switch to another action: exploration.
Temporary effect.
Policy
Once we've learnt the Q values, our policy is the greedy one: choose the action with the highest Q.
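The general form NewEstimate = OldEstimate + StepSize [Target − OldEstimate] from the incremental update slides can be sketched as below, once with the sample-average step size and once with a constant step size. The function names and the `initial` parameter are assumptions made for this sketch. `initial` stands in for the initial Q value, so a high value plays the role of an optimistic start.

```python
def incremental_update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)


def running_mean(rewards, initial=0.0):
    """Sample average of the rewards, computed one reward at a time."""
    q = initial
    for k, r in enumerate(rewards, start=1):
        # Here k counts the rewards seen so far including this one, so the
        # step 1/k matches 1/(k+1) in equation (2), where k counts the
        # rewards seen before the new one.
        q = incremental_update(q, r, 1.0 / k)
    return q


def constant_step(rewards, alpha=0.1, initial=0.0):
    """Estimate with a constant step size alpha, weighting recent rewards more."""
    q = initial
    for r in rewards:
        q = incremental_update(q, r, alpha)
    return q
```

For example, `constant_step([0.0], alpha=0.1, initial=5.0)` moves an optimistic estimate of 5.0 down to 4.5 after one disappointing reward of 0.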

Application
See: http://www.eecs.umich.edu/~qstout/AdaptSample.html
And: J. Hardwick, R. Oehmke, Q. Stout: A program for sequential allocation of three Bernoulli populations, Computational Statistics and Data Analysis 31, 397–416, 1999. (just scan this one)

Reading: Sutton and Barto Chapter 2.
Next: Reinforcement Learning with more than one state.
