Global plan: Reinforcement learning I (prediction; classical conditioning; dopamine)


SLIDE 1

Global plan

  • Reinforcement learning I:

– prediction
– classical conditioning
– dopamine

  • Reinforcement learning II:

– dynamic programming; action selection
– Pavlovian misbehaviour
– vigor

  • Chapter 9 of Theoretical Neuroscience

(thanks to Yael Niv)

SLIDE 2

Conditioning

  • Ethology

– optimality
– appropriateness

  • Computation

– dynamic programming
– Kalman filtering

  • Psychology

– classical/operant conditioning

  • Algorithm

– TD/delta rules
– simple weights

  • Neurobiology

– neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions

SLIDE 3

Animals learn predictions

Ivan Pavlov

CS = Conditioned Stimulus; US = Unconditioned Stimulus; UR = Unconditioned Response (reflex); CR = Conditioned Response (reflex)

SLIDE 4

Animals learn predictions

Ivan Pavlov

[Figure: acquisition and extinction curves, plotted over blocks of 10 trials]

very general across species, stimuli, behaviors

SLIDE 5

But do they really?

  • 1. Rescorla’s control

temporal contiguity is not enough - need contingency

P(food | light) > P(food | no light)
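Rescorla's contingency criterion is easy to make concrete. A minimal sketch in Python; the trial counts below are hypothetical, chosen only to illustrate the point:

```python
# Contingency = P(food | light) - P(food | no light).
def contingency(food_and_light, light_trials, food_no_light, no_light_trials):
    return food_and_light / light_trials - food_no_light / no_light_trials

# Many light-food pairings, but food is just as common without the light:
# temporal contiguity without contingency -> no conditioning expected.
print(contingency(40, 100, 40, 100))  # 0.0

# Same number of pairings, but now the light genuinely predicts food.
print(contingency(40, 100, 10, 100))  # ~0.3
```

With identical pairing counts, only the second case carries predictive information.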

SLIDE 6

But do they really?

  • 2. Kamin’s blocking

contingency is not enough either… need surprise

SLIDE 7

But do they really?

  • 3. Reynold’s overshadowing

seems like stimuli compete for learning

SLIDE 8

Theories of prediction learning: Goals

  • Explain how the CS acquires “value”
  • When (under what conditions) does this happen?
  • Basic phenomena: gradual learning and extinction curves
  • More elaborate behavioral phenomena
  • (Neural data)

P.S. Why are we looking at old-fashioned Pavlovian conditioning? → it is the perfect uncontaminated test case for examining prediction learning on its own

SLIDE 9

error-driven learning: the change in value is proportional to the difference between actual and predicted outcome

Rescorla & Wagner (1972)

        − = ∆

j CS US CS

j i

V r V η

Assumptions:

  • 1. learning is driven by error (formalizes notion of surprise)
  • 2. summations of predictors is linear

A simple model - but very powerful!

– explains: gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more…
– predicted overexpectation

note: US as “special stimulus”
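The shared error term is what lets Rescorla-Wagner explain blocking. A minimal simulation sketch (the learning rate, US magnitude, and trial counts are arbitrary choices, not from the slide):

```python
import numpy as np

eta, r = 0.2, 1.0          # learning rate and US magnitude (illustrative)
V = np.zeros(2)            # V[0]: value of CS A, V[1]: value of CS B

# Phase 1: A alone -> US. V_A climbs toward r.
for _ in range(100):
    x = np.array([1.0, 0.0])          # A present, B absent
    delta = r - V @ x                 # error shared by all present CSs
    V += eta * delta * x

# Phase 2: compound AB -> US. A already predicts the US, so the shared
# error is ~0 and B acquires almost no value: Kamin blocking.
for _ in range(100):
    x = np.array([1.0, 1.0])
    delta = r - V @ x
    V += eta * delta * x

print(V)   # V_A near 1, V_B near 0
```

Because learning is driven by the summed prediction error, a fully predicted US leaves nothing for the added stimulus to learn.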

SLIDE 10
  • how does this explain acquisition and extinction?
  • what would V look like with 50% reinforcement? e.g. 1 1 0 1 0 0 1 1 1 0 0

– what would V be on average after learning?
– what would the error term be on average after learning?

Rescorla-Wagner learning

$$V_{t+1} = V_t + \eta\,(r_t - V_t)$$
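The 50%-reinforcement question can be checked numerically. A sketch (η = 0.1 and the trial count are arbitrary choices): V settles around the average reward, 0.5, and the prediction error averages to zero after learning.

```python
import random

random.seed(0)             # reproducible illustration
eta, V = 0.1, 0.0
errors = []
for t in range(5000):
    r = 1.0 if random.random() < 0.5 else 0.0   # 50% reinforcement
    delta = r - V
    V += eta * delta
    errors.append(delta)

print(round(V, 2))                                        # hovers near 0.5
print(round(sum(errors[1000:]) / len(errors[1000:]), 3))  # near 0
```

V never converges exactly (each trial is all-or-none), but it fluctuates around the expected reward, and the average error after learning vanishes.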

SLIDE 11

how is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …?

Rescorla-Wagner learning

$$V_{t+1} = V_t + \eta\,(r_t - V_t) = (1-\eta)\,V_t + \eta\,r_t$$

unrolling the recursion:

$$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i}\, r_i$$

[Figure: the weights $\eta(1-\eta)^{t-i}$ decay exponentially over the last 10 trials]

recent rewards weigh more heavily. Why is this sensible? learning rate = forgetting rate! The R-W rule estimates expected reward using a weighted average of past rewards.
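The recursion and the weighted-average closed form can be checked against each other directly (η = 0.3 is arbitrary; the reward sequence reuses the 50%-reinforcement example):

```python
eta = 0.3                              # arbitrary learning rate
rewards = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]

# Recursive form: V <- (1 - eta) * V + eta * r, starting from V = 0.
V = 0.0
for r in rewards:
    V = (1 - eta) * V + eta * r

# Closed form: V_t = eta * sum_{i=1..t} (1 - eta)^(t - i) * r_i.
t = len(rewards)
closed = eta * sum((1 - eta) ** (t - i) * rewards[i - 1] for i in range(1, t + 1))

print(abs(V - closed) < 1e-12)         # True: the two forms agree
```

The exponentially decaying weights are exactly what "learning rate = forgetting rate" buys: old rewards are not dropped, just discounted.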

SLIDE 12

Summary so far

Predictions are useful for behavior. Animals (and people) learn predictions (Pavlovian conditioning = prediction learning). Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality.

Marr: the R-W rule is gradient descent on the squared prediction error:

$$E = \tfrac{1}{2}\left(r_{US} - \sum_i V_{CS_i}\right)^2, \qquad \Delta V_{CS_j} = -\alpha\,\frac{\partial E}{\partial V_{CS_j}} = \alpha\left(r_{US} - \sum_i V_{CS_i}\right) = \alpha\,\delta$$

SLIDE 13

But: second order conditioning

phase 1: …  phase 2: …

[Figure: conditioned responding vs. number of phase 2 pairings]

animals learn that a predictor of a predictor is also a predictor of reward! ⇒ not interested solely in predicting immediate reward

test:

?

what do you think will happen? what would Rescorla-Wagner learning predict here?

SLIDE 14

let's start over: this time from the top

Marr’s 3 levels:

  • The problem: optimal prediction of future reward

$$V_t = E\!\left[\sum_{i=t}^{T} r_i\right]$$

want to predict expected sum of future reward in a trial/episode (N.B. here t indexes time within a trial)

  • what's the obvious prediction error?

$$\delta_t = \sum_{i=t}^{T} r_i - V_t \qquad \text{(compare Rescorla-Wagner: } \delta_{RW} = r - V_{CS}\text{)}$$

  • what's the obvious problem with this?

SLIDE 15

let's start over: this time from the top

Marr’s 3 levels:

  • The problem: optimal prediction of future reward

$$V_t = E\!\left[\sum_{i=t}^{T} r_i\right]$$

want to predict expected sum of future reward in a trial/episode

$$V_t = E[r_t + r_{t+1} + r_{t+2} + \dots + r_T] = E[r_t] + E[r_{t+1} + r_{t+2} + \dots + r_T] = E[r_t] + V_{t+1}$$

Bellman eqn for policy evaluation

SLIDE 16

lets start over: this time from the top

Marr’s 3 levels:

  • The problem: optimal prediction of future reward
  • The algorithm: temporal difference learning

$$V_t = E[r_t] + V_{t+1}$$

$$V_t \leftarrow (1-\eta)\,V_t + \eta\,(r_t + V_{t+1})$$

$$V_t \leftarrow V_t + \eta\,(r_t + V_{t+1} - V_t)$$

temporal difference prediction error: $\delta_t = r_t + V_{t+1} - V_t$

compare to Rescorla-Wagner: $V_{t+1} \leftarrow V_t + \eta\,(r_t - V_t)$
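The TD rule can be run on a toy within-trial problem. A sketch under invented timings and parameters: a CS at t=2 is followed by reward at t=8; before CS onset no stimulus is present, so the prediction there is pinned at zero. Early in training the prediction error sits at the reward; after training it has migrated back to the CS.

```python
import numpy as np

T, eta = 10, 0.2                  # trial length and learning rate (illustrative)
cs_time, reward_time = 2, 8       # CS onset and reward delivery (illustrative)
V = np.zeros(T + 1)               # V[T] stays 0 (end of trial)

def value(V, t):
    # before CS onset no stimulus is present, so the prediction is zero
    return V[t] if t >= cs_time else 0.0

def run_trial(V):
    """One TD(0) pass; returns the prediction error at each timestep."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if t == reward_time else 0.0
        deltas[t] = r + value(V, t + 1) - value(V, t)
        if t >= cs_time:
            V[t] += eta * deltas[t]
    return deltas

first = run_trial(V)              # first trial: V is still all zeros
for _ in range(500):
    deltas = run_trial(V)         # train to convergence

print(int(np.argmax(first)))      # 8: error occurs at the reward
print(int(np.argmax(deltas)))     # 1: error has moved to the transition into the CS
```

This backward migration of δ across training is the signature behaviour the dopamine recordings are compared against.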

SLIDE 17

prediction error

TD error

$$\delta_t = r_t + V_{t+1} - V_t$$

[Figure (Schultz recordings): dopamine responses in three conditions: no prediction, reward; prediction, reward; prediction, no reward]

SLIDE 18

Summary so far

Temporal difference learning versus Rescorla-Wagner

  • derived from first principles about the future
  • explains everything that R-W does, and more (e.g. 2nd order conditioning)
  • a generalization of R-W to real time
SLIDE 19

Back to Marr’s 3 levels

  • The problem: optimal prediction of future reward
  • The algorithm: temporal difference learning
  • Neural implementation: does the brain use TD learning?
SLIDE 20

Dopamine

Dorsal Striatum (Caudate, Putamen); Nucleus Accumbens (Ventral Striatum); Prefrontal Cortex

Parkinson’s Disease → Motor control + initiation? Intracranial self-stimulation; Drug addiction; Natural rewards

Ventral Tegmental Area; Substantia Nigra; Amygdala

Natural rewards → Reward pathway? → Learning? Also involved in:

  • Working memory
  • Novel situations
  • ADHD
  • Schizophrenia
SLIDE 21

Role of dopamine: Many hypotheses

  • Anhedonia hypothesis
  • Prediction error (learning, action selection)
  • Salience/attention
  • Incentive salience
  • Uncertainty
  • Cost/benefit computation
  • Energizing/motivating behavior
SLIDE 22

dopamine and prediction error

TD error

$$\delta_t = r_t + V_{t+1} - V_t$$

[Figure (Schultz recordings): dopamine responses in three conditions: no prediction, reward; prediction, reward; prediction, no reward]

SLIDE 23

prediction error hypothesis of dopamine

The idea: Dopamine encodes a reward prediction error

Tobler et al, 2005 Fiorillo et al, 2003

SLIDE 24

prediction error hypothesis of dopamine

[Figure: measured firing rate vs. model prediction error]

Bayer & Glimcher (2005)

at end of trial: $\delta_t = r_t - V_t$ (just like R-W)

$$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i}\, r_i$$

SLIDE 25

what drives the dips?

  • why an effect of reward at all?

– Pavlovian influence

Matsumoto & Hikosaka (2007)

SLIDE 26

what drives the dips?

Matsumoto & Hikosaka (2007)

  • rHab -> rSTN
  • RMTg (predicted R/S)

Jhou et al, 2009

SLIDE 27

Where does dopamine project to? Basal ganglia

Several large subcortical nuclei (unfortunate anatomical names follow structure rather than function, eg caudate + putamen + nucleus accumbens are all relatively similar pieces of striatum; but globus pallidus & substantia nigra each comprise two different things)

SLIDE 28

Where does dopamine project to? Basal ganglia

inputs to BG are from all over the cortex (and topographically mapped)

Voorn et al, 2004

SLIDE 29

Corticostriatal synapses: 3 factor learning

[Diagram: cortical stimulus representation X1 X2 X3 … XN; adjustable corticostriatal synapses carry learned values V1 V2 V3 … VN in the striatum; VTA/SNc broadcast the prediction error δ (dopamine); reward signal R arrives via PPTN, habenula etc.]

but also amygdala; orbitofrontal cortex; ...
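The corticostriatal scheme above amounts to a three-factor update: presynaptic cortical activity x_i, a postsynaptic striatal value V = w·x, and a globally broadcast dopamine error δ gating the weight change. A minimal single-trial sketch (all numbers invented for illustration):

```python
import numpy as np

eta = 0.1
w = np.zeros(4)                       # corticostriatal weights (learned values)
x = np.array([1.0, 0.0, 1.0, 0.0])    # which cortical inputs are active this trial
r = 1.0                               # reward signal (PPTN, habenula, etc.)

V = w @ x                             # striatal value estimate
delta = r - V                         # dopamine prediction error at trial end
w += eta * delta * x                  # three factors: eta * delta * x_i

print(w)   # only the weights of active inputs move
```

Because δ is broadcast globally, only synapses with presynaptic activity (x_i > 0) change, which is what makes the dopamine signal usable as a shared teaching signal.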

SLIDE 30

striatal complexities

Cohen & Frank, 2009

SLIDE 31

Dopamine and plasticity

Prediction errors are for learning… Cortico-striatal synapses show complex dopamine-dependent plasticity

Wickens et al, 1996

SLIDE 32

Risk Experiment

[Trial timeline: stimulus < 1 sec; 0.5 sec; 5 sec ISI; outcome "You won 40 cents" for 0.5 sec; 2–5 sec ITI]

5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢

19 subjects (dropped 3 non-learners, N=16); 3T scanner, TR=2 sec, interleaved; 234 trials: 130 choice, 104 single stimulus, randomly ordered and counterbalanced

SLIDE 33

Neural results: Prediction Errors

what would a prediction error look like (in BOLD)?

SLIDE 34

Neural results: Prediction errors in NAC

unbiased anatomical ROI in nucleus accumbens (marked per subject; thanks to Laura deSouza)

raw BOLD (avg over all subjects)

can actually decide between different neuroeconomic models of risk

SLIDE 35

Prediction error

punishment prediction error

TD error: $\delta_t = r_t + V_{t+1} - V_t$

[Figure: value and TD-error traces for High Pain vs. Low Pain conditions (probabilities 0.8, 1.0, 0.2)]

SLIDE 36

TD model

A – B – HIGH C – D – LOW C – B – HIGH A – B – HIGH A – D – LOW C – D – LOW A – B – HIGH A – B – HIGH C – D – LOW C – B – HIGH

experimental sequence…..

MR scanner

punishment prediction error


Brain responses Prediction error

Ben Seymour; John O’Doherty

SLIDE 37

TD prediction error: ventral striatum Z=-4 R

punishment prediction error


SLIDE 38

punishment prediction


right anterior insula; dorsal raphe (5HT)?

SLIDE 39

punishment

  • dips below baseline in dopamine

– Frank: D2 receptors particularly sensitive
– Bayer & Glimcher: length of pause related to size of negative prediction error

  • but:

– can't afford to wait that long
– negative signal for such an important event
– opponency a more conventional solution:

  • serotonin…
SLIDE 40

generalization


SLIDE 41

generalization


SLIDE 42

random-dot discrimination

differential reward (0.16ml; 0.38ml) Sakagami (2010)

SLIDE 43
Other paradigms

  • inhibitory conditioning
  • transreinforcer blocking
  • motivational sensitivities
  • backwards blocking

– Kalman filtering

  • downwards unblocking
  • primacy as well as recency (highlighting)

– assumed density filtering

SLIDE 44

Summary of this part: prediction and RL

Prediction is important for action selection

  • The problem: prediction of future reward
  • The algorithm: temporal difference learning
  • Neural implementation: dopamine dependent learning in BG

⇒ A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
⇒ Precise (normative!) theory for generation of dopamine firing patterns
⇒ Explains anticipatory dopaminergic responding, second order conditioning
⇒ Compelling account for the role of dopamine in classical conditioning: prediction error acts as a signal driving learning in prediction areas

SLIDE 45

Striatum and learned values

Striatal neurons show ramping activity that precedes a reward (and changes with learning!)

[Figures: ramping activity between trial start and food delivery (Schultz; Daw)]

SLIDE 46

Phasic dopamine also responds to…

  • Novel stimuli
  • Especially salient (attention grabbing) stimuli
  • Aversive stimuli (??)
  • Reinforcers and appetitive stimuli induce approach behavior and

learning, but also have attention functions (elicit orienting response) learning, but also have attention functions (elicit orienting response) and disrupt ongoing behaviour. → Perhaps DA reports salience of stimuli (to attract attention; switching) and not a prediction error? (Horvitz, Redgrave)