
Reinforcement Learning
Quentin Huys
Division of Psychiatry and Max Planck UCL Centre for Computational Psychiatry and Ageing Research, UCL
Complex Depression, Anxiety and Trauma Service, Camden and Islington NHS Foundation Trust


Solving the Bellman Equation

Option 1: turn it into an update equation:
V_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V_k(s') \right]

Option 2: linear solution (with absorbing states):
V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]
\Rightarrow v = R^\pi + T^\pi v \Rightarrow v^\pi = (I - T^\pi)^{-1} R^\pi
which costs O(|S|^3).
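Both options are easy to sketch in code. Below is a minimal NumPy sketch, assuming a small MDP stored as arrays pi[s, a], T[a, s, s'] and R[a, s, s']; these array conventions and function names are illustrative, not from the slides.

```python
import numpy as np

def evaluate_iterative(pi, T, R, n_sweeps=200):
    """Option 1: apply the Bellman equation repeatedly as an update.
    pi[s, a]   : policy probabilities
    T[a, s, s']: transition probabilities
    R[a, s, s']: reward for the transition s -(a)-> s'
    """
    V = np.zeros(pi.shape[0])
    for _ in range(n_sweeps):
        # V_{k+1}(s) = sum_a pi(a|s) sum_s' T^a_{ss'} [ R(s',a,s) + V_k(s') ]
        V = np.einsum('sa,asp,asp->s', pi, T, R + V[None, None, :])
    return V

def evaluate_linear(pi, T, R):
    """Option 2: v^pi = (I - T^pi)^{-1} R^pi, an O(|S|^3) linear solve.
    Assumes absorbing states have had their outgoing transitions zeroed
    so that (I - T^pi) is invertible."""
    T_pi = np.einsum('sa,asp->sp', pi, T)         # policy-averaged transitions T^pi(s, s')
    R_pi = np.einsum('sa,asp,asp->s', pi, T, R)   # expected one-step reward R^pi(s)
    return np.linalg.solve(np.eye(T_pi.shape[0]) - T_pi, R_pi)
```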

Policy update

Given the value function for a policy, say via the linear solution:
V^\pi(s) = \sum_a \pi(a|s) \underbrace{\sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V^\pi(s') \right]}_{Q^\pi(s, a)}

Given the values V for the policy, we can improve the policy by always choosing the best action:
\pi'(a|s) = 1 if a = \arg\max_a Q^\pi(s, a), and 0 otherwise

This is guaranteed to improve: for the deterministic greedy policy
Q^\pi(s, \pi'(s)) = \max_a Q^\pi(s, a) \geq Q^\pi(s, \pi(s)) = V^\pi(s)

Policy iteration

Policy evaluation: v^\pi = (I - T^\pi)^{-1} R^\pi

Greedy policy improvement:
\pi(a|s) = 1 if a = \arg\max_a \sum_{s'} T^a_{ss'} \left[ R^a_{ss'} + V^\pi(s') \right], and 0 otherwise

Value iteration:
V^*(s) = \max_a \sum_{s'} T^a_{ss'} \left[ R^a_{ss'} + V^*(s') \right]
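A sketch of how evaluation and greedy improvement combine into policy iteration, and of value iteration, using the same assumed T[a, s, s'] and R[a, s, s'] arrays as above (absorbing states are assumed to have their outgoing transitions zeroed so the linear solve is well defined).

```python
import numpy as np

def q_from_v(V, T, R):
    """Q(s, a) = sum_s' T[a,s,s'] (R[a,s,s'] + V[s']); returns an array of shape (S, A)."""
    return np.einsum('asp,asp->sa', T, R + V[None, None, :])

def policy_iteration(T, R, n_rounds=50):
    """Alternate exact policy evaluation with greedy policy improvement."""
    n_actions, n_states, _ = T.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)   # start from a uniform policy
    for _ in range(n_rounds):
        T_pi = np.einsum('sa,asp->sp', pi, T)
        R_pi = np.einsum('sa,asp,asp->s', pi, T, R)
        V = np.linalg.solve(np.eye(n_states) - T_pi, R_pi)  # policy evaluation
        greedy = np.argmax(q_from_v(V, T, R), axis=1)        # a* = argmax_a Q^pi(s, a)
        pi = np.eye(n_actions)[greedy]                       # deterministic greedy policy
    return pi, V

def value_iteration(T, R, n_sweeps=200):
    """V*(s) = max_a sum_s' T^a_{ss'} [ R^a_{ss'} + V*(s') ]."""
    V = np.zeros(T.shape[1])
    for _ in range(n_sweeps):
        V = q_from_v(V, T, R).max(axis=1)
    return V
```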

Model-free solutions

‣ So far we have assumed knowledge of R and T
  • R and T are the ‘model’ of the world, so we assume full knowledge of the dynamics and rewards in the environment
‣ What if we don’t know them?
‣ We can still learn from state-action-reward samples
  • we can learn R and T from them, and use our estimates to solve as above
  • alternatively, we can directly estimate V or Q

Solving the Bellman Equation

Option 3: sampling
V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

This is an expectation over policy and transition samples, so we can just draw some samples from the policy and the transitions and average over them:
a = \sum_k f(x_k) \, p(x_k) \quad\approx\quad \hat{a} = \frac{1}{N} \sum_i f(x^{(i)}), \quad x^{(i)} \sim p(x)

More about this later...
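A tiny illustration of the sampling idea with a toy discrete distribution; the distribution and the function f are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution p(x) over a few discrete outcomes, and an arbitrary function f
x_values = np.array([0.0, 1.0, 2.0, 3.0])
p = np.array([0.1, 0.4, 0.3, 0.2])
f = lambda x: x ** 2

# Exact expectation: a = sum_k f(x_k) p(x_k)
a_exact = np.sum(f(x_values) * p)

# Sampling estimate: a_hat = (1/N) sum_i f(x_i), with x_i ~ p(x)
samples = rng.choice(x_values, size=10_000, p=p)
a_hat = f(samples).mean()

print(a_exact, a_hat)   # the two agree up to sampling noise
```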

Learning from samples

[Figure: 10 x 10 gridworld]

A new problem: exploration versus exploitation
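The slide only names the exploration-exploitation problem; two standard ways of trading the two off are epsilon-greedy and softmax action selection. The sketch below is a generic illustration over assumed Q-values, not something spelled out on the slide.

```python
import numpy as np

rng = np.random.default_rng(1)

def epsilon_greedy(q, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit argmax Q."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return int(np.argmax(q))

def softmax_choice(q, beta=2.0):
    """Sample an action with probability proportional to exp(beta * Q)."""
    p = np.exp(beta * (q - q.max()))   # subtract the max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(q), p=p))
```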

The effect of bootstrapping

Example (after Sutton and Barto 1998): eight episodes; one goes A,0 -> B,0, the rest start in B, and B is rewarded on 3/4 of its visits.

Markov (every-visit) estimate: V(B) = 3/4, V(A) = 0
TD estimate: V(B) = 3/4, V(A) ≈ 3/4

‣ Average over various bootstrappings: TD(λ)

Monte Carlo

‣ First-visit MC
  • randomly start in all states, generate paths, average for the starting state only:
    V(s) = \frac{1}{N} \sum_i \sum_{t'=1}^{T} r^{(i)}_{t'} \quad \text{given } s_0 = s
‣ More efficient use of samples
  • every-visit MC
  • bootstrap: TD
  • Dyna
‣ Better samples
  • on-policy versus off-policy
  • stochastic search, UCT...
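A minimal sketch of the first bullet: start episodes in different states and average the undiscounted return for the starting state only. The representation of an episode as a list of (state, reward) pairs is an assumption made for illustration.

```python
from collections import defaultdict

def starting_state_mc(episodes):
    """V(s) = average over episodes started in s of the total undiscounted return.
    Each episode is a list of (state, reward) pairs; episode[0][0] is the start state."""
    returns = defaultdict(list)
    for episode in episodes:
        start_state = episode[0][0]
        returns[start_state].append(sum(r for _, r in episode))
    return {s: sum(g) / len(g) for s, g in returns.items()}

# e.g. two short episodes starting in state 'A' and one starting in 'B'
episodes = [[('A', 0), ('B', 1)], [('A', 0), ('B', 0)], [('B', 1)]]
print(starting_state_mc(episodes))   # {'A': 0.5, 'B': 1.0}
```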

Update equation: towards TD

Bellman equation:
V(s) = \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

Not yet converged, so it doesn't hold:
dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

And then use this to update:
V_{i+1}(s) = V_i(s) + dV(s)

TD learning

dV(s) = -V(s) + \sum_a \pi(a|s) \sum_{s'} T^a_{ss'} \left[ R(s', a, s) + V(s') \right]

Sample:
a_t \sim \pi(a|s_t)
s_{t+1} \sim T^{a_t}_{s_t s_{t+1}}
r_t = R(s_{t+1}, a_t, s_t)

\delta_t = -V_{t-1}(s_t) + r_t + V_{t-1}(s_{t+1})

V_{i+1}(s) = V_i(s) + dV(s) \quad\rightarrow\quad V_t(s_t) = V_{t-1}(s_t) + \alpha \delta_t

TD learning

a_t \sim \pi(a|s_t)
s_{t+1} \sim T^{a_t}_{s_t s_{t+1}}
r_t = R(s_{t+1}, a_t, s_t)

\delta_t = -V_t(s_t) + r_t + V_t(s_{t+1})
V_{t+1}(s_t) = V_t(s_t) + \alpha \delta_t
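A tabular TD(0) sketch of these updates. The environment and policy interfaces (reset, step, n_states) are hypothetical stand-ins, and the slide's undiscounted form is kept, so no discount factor appears.

```python
import numpy as np

def td0(env, policy, n_episodes=500, alpha=0.1):
    """Tabular TD(0): V(s_t) <- V(s_t) + alpha * (r_t + V(s_{t+1}) - V(s_t)).
    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done), and a policy(state) -> action."""
    V = np.zeros(env.n_states)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                      # a_t ~ pi(a | s_t)
            s_next, r, done = env.step(a)      # s_{t+1} ~ T, r_t = R(s_{t+1}, a_t, s_t)
            target = r + (0.0 if done else V[s_next])
            delta = target - V[s]              # prediction error delta_t
            V[s] += alpha * delta
            s = s_next
    return V
```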

Phasic dopamine neurone firing

‣ Pavlovian conditioning

V_{t+1}(s) = V_t(s) + \epsilon \underbrace{( R_t - V_t(s) )}_{\text{prediction error}}

Montague et al., 1996; Schultz et al., 1997
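The update on the slide is the Rescorla-Wagner / delta rule; a few lines make the behaviour of the prediction error explicit. The reward sequence below is an arbitrary example.

```python
def rescorla_wagner(rewards, epsilon=0.1, v0=0.0):
    """V_{t+1} = V_t + epsilon * (R_t - V_t); returns the value after each trial."""
    V, trace = v0, []
    for R in rewards:
        delta = R - V            # prediction error
        V = V + epsilon * delta
        trace.append(V)
    return trace

# With repeated rewards of 1, V climbs towards 1 and the prediction error shrinks,
# mirroring the decline of the phasic dopamine response to a fully predicted reward.
print(rescorla_wagner([1, 1, 1, 1, 1]))
```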

Phasic signals in humans

[Figure: effect sizes for unexpected reward, expected reward, and reward expected but not received]

D’Ardenne et al., 2008 Science; Zaghloul et al., 2009 Science

Blocking

‣ Are predictions and prediction errors really causally important in learning?
  • 1: A -> Reward
  • 2: A+B -> Reward
  • 3: A -> ?   B -> ?

[Figure panels A, B, C: response (approach) to the cues]

Kamin 1968

Causal role of phasic DA in learning

[Figure: blocking design (single cue A -> sucrose, then compound AX -> sucrose) with paired or unpaired optical stimulation of dopamine neurons during compound conditioning; time in port to the blocked cue X at test for PairedCre+, UnpairedCre+, PairedCre- and control groups]

Steinberg et al., 2013 Nat. Neurosci.

Markov Decision Problems

V(s_t) = E[ r_t + r_{t+1} + r_{t+2} + \ldots ]
       = E[ r_t ] + E[ r_{t+1} + r_{t+2} + r_{t+3} + \ldots ]
\Rightarrow V(s_t) = E[ r_t ] + V(s_{t+1})

“Cached” solutions to MDPs

‣ Learn from experience
‣ If we have the true values V, then this is true on every trial:
  V(s_t) = E[ r_t ] + V(s_{t+1})
‣ If it is not true (we don't know the true V), then we get an error:
  \delta = ( E[ r_t ] + V(s_{t+1}) ) - V(s_t) \neq 0
‣ So now we can update with our experience:
  V(s_t) \leftarrow V(s_t) + \epsilon \delta
‣ This is an average over past experience

SARSA

‣ Do TD for state-action values instead, using (s_t, a_t, r_t, s_{t+1}, a_{t+1}):
  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
‣ convergence guarantees - will estimate Q^\pi(s, a)
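A sketch of tabular SARSA using the same hypothetical env/policy interface as the TD(0) sketch above; the parameter values are arbitrary.

```python
import numpy as np

def sarsa(env, policy, n_episodes=500, alpha=0.1, gamma=0.97):
    """On-policy SARSA update from (s_t, a_t, r_t, s_{t+1}, a_{t+1}) tuples.
    Assumes env.reset() -> state, env.step(a) -> (next_state, reward, done),
    and policy(state, Q) -> action (e.g. epsilon-greedy on the current Q)."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        a = policy(s, Q)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = policy(s_next, Q)                      # the action actually taken next
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s_next, a_next
    return Q
```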

Q learning: off-policy

‣ Learn off-policy
  • draw from some policy
  • “only” require extensive sampling
  Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]
  (the update is towards the optimum)
‣ will estimate Q^*(s, a)
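The corresponding off-policy Q-learning sketch; the only substantive change from SARSA is that the bootstrap target uses max_a Q(s', a) rather than the action the behaviour policy actually takes next.

```python
import numpy as np

def q_learning(env, behaviour_policy, n_episodes=500, alpha=0.1, gamma=0.97):
    """Off-policy Q-learning with the same assumed env/policy interface as above."""
    Q = np.zeros((env.n_states, env.n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = behaviour_policy(s, Q)          # any sufficiently exploratory policy
            s_next, r, done = env.step(a)
            target = r if done else r + gamma * Q[s_next].max()   # update towards the optimum
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```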

MF and MB learning of V and Q values

                                      Model-free     Model-based
Pavlovian (state) values              V_MF(s)        V_MB(s)
Instrumental (state-action) values    Q_MF(s, a)     Q_MB(s, a)

There are both Pavlovian state and instrumental state-action values, and both of these can be either model-free (cached) or model-based.

Solutions

‣ “Cached” learning
  • average experience
  • do again what worked in the past
  • averages are cheap to compute - no computational curse
  • averages move slowly: an average over a large number of samples won't move much if you add one more
‣ “Goal-directed” or “Model-based” decisions
  • Think through possible options and choose the best
  • Requires a detailed model of the world
  • Requires huge computational resources
  • Learning = building the model, extracting structure

Pavlovian and instrumental

‣ Pavlovian model-free learning:
  V_t(s) = V_{t-1}(s) + \epsilon ( r_t - V_{t-1}(s) )
  p(a | s, V) \propto f(a, V(s)) \, p(a | s)
‣ Instrumental model-free learning:
  Q_t(a, s) = Q_{t-1}(a, s) + \epsilon ( r_t - Q_{t-1}(a, s) )
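A minimal sketch of the two updates side by side, with V a vector over states and Q a state-by-action array; the array layout and the learning-rate name are assumptions for illustration.

```python
import numpy as np

def pavlovian_update(V, s, r, epsilon=0.1):
    """Pavlovian (state) value: V_t(s) = V_{t-1}(s) + epsilon * (r_t - V_{t-1}(s))."""
    V[s] += epsilon * (r - V[s])
    return V

def instrumental_update(Q, s, a, r, epsilon=0.1):
    """Instrumental (state-action) value: Q_t(a,s) = Q_{t-1}(a,s) + epsilon * (r_t - Q_{t-1}(a,s))."""
    Q[s, a] += epsilon * (r - Q[s, a])
    return Q

# e.g. one rewarded trial in state 0 after taking action 1
V = pavlovian_update(np.zeros(3), s=0, r=1.0)
Q = instrumental_update(np.zeros((3, 2)), s=0, a=1, r=1.0)
```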

Innate evolutionary strategies are quite sophisticated...

[Figure: more survive / fewer survive]

Hirsch and Bolles 1980

Unconditioned responses

• powerful
• inflexible over short timescale
• adaptive on evolutionary scale

Hershberger 1986


Affective go/nogo task

Task (2 x 2 design): Go vs Nogo response, to gain reward or to avoid loss.

[Figure: probability correct for 'Go to win', 'Go to avoid', 'Nogo to win' and 'Nogo to avoid'; Probability(Go) over ~60 trials for each condition (Go rewarded, Nogo punished, Nogo rewarded, Go punished)]

Guitart-Masip et al., 2012 J Neurosci

Models

‣ Instrumental:
  p_t(a | s) \propto Q_t(s, a)
  Q_{t+1}(s, a) = Q_t(s, a) + \alpha ( r_t - Q_t(s, a) )

[Figure: fits to Probability(Go) in the four conditions]

Guitart-Masip et al., 2012 J Neurosci

Models

‣ Instrumental + bias

[Figure: fits to Probability(Go) in the four conditions]

Guitart-Masip et al., 2012 J Neurosci

Models

‣ Instrumental + bias + Pavlovian

[Figure: fits to Probability(Go) in the four conditions]

Guitart-Masip et al., 2012 J Neurosci
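The slide does not spell the model out, so the sketch below is only one common way of writing an "instrumental + bias + Pavlovian" action rule for a go/nogo choice: a fixed go bias plus a Pavlovian term that scales with the state value, passed through a softmax. The parameterisation and parameter names are assumptions, not necessarily those of Guitart-Masip et al. (2012).

```python
import numpy as np

def go_nogo_probabilities(Q_s, V_s, go_bias=0.2, pav_weight=0.5, beta=3.0):
    """Illustrative action weights for a go/nogo choice:
       W(go, s)   = Q(s, go) + go_bias + pav_weight * V(s)
       W(nogo, s) = Q(s, nogo)
    followed by a softmax. Q_s = [Q(s, go), Q(s, nogo)]; V_s = Pavlovian state value."""
    w = np.array([Q_s[0] + go_bias + pav_weight * V_s, Q_s[1]])
    p = np.exp(beta * (w - w.max()))        # numerically stable softmax
    return p / p.sum()

# A positive V_s pushes choices towards 'go' and a negative V_s towards 'nogo',
# which is the kind of asymmetry seen in the 'Nogo to win' and 'Go to avoid' conditions.
print(go_nogo_probabilities(Q_s=[0.0, 0.0], V_s=+0.5))
print(go_nogo_probabilities(Q_s=[0.0, 0.0], V_s=-0.5))
```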


Habitization

[Figure; annotations: "Get this pattern early if lesion prelimbic cortex"; "Get this pattern late if lesion infralimbic cortex"]

Two-step task

[Figure panels A, B, C]

Daw et al. 2011, Neuron
