Global plan: Reinforcement learning I (prediction; classical conditioning; dopamine)


SLIDE 1

Global plan

  • Reinforcement learning I:

– prediction
– classical conditioning
– dopamine

  • Reinforcement learning II:

– dynamic programming; action selection
– Pavlovian misbehaviour
– vigor

  • Chapter 9 of Theoretical Neuroscience

(thanks to Yael Niv)

SLIDE 2

Conditioning

  • Ethology

– optimality
– appropriateness

  • Computation

– dynamic programming
– Kalman filtering

  • Psychology

– classical/operant conditioning

  • Algorithm

– TD/delta rules
– simple weights

  • Neurobiology

– neuromodulators; amygdala; OFC; nucleus accumbens; dorsal striatum

prediction: of important events
control: in the light of those predictions

SLIDE 3

Animals learn predictions

Ivan Pavlov

CS = Conditioned Stimulus; US = Unconditioned Stimulus; UR = Unconditioned Response (reflex); CR = Conditioned Response (reflex)

SLIDE 4

Animals learn predictions

Ivan Pavlov

[Figure: acquisition and extinction curves, plotted over blocks of 10 trials]

very general across species, stimuli, behaviors

SLIDE 5

But do they really?

  • 1. Rescorla’s control

temporal contiguity is not enough - need contingency

P(food | light) > P(food | no light)
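Rescorla's contingency criterion is easy to make concrete. A minimal sketch in Python; the trial counts below are hypothetical, chosen only to illustrate the point:

```python
# Contingency = P(food | light) - P(food | no light).
def contingency(food_and_light, light_trials, food_no_light, no_light_trials):
    return food_and_light / light_trials - food_no_light / no_light_trials

# Many light-food pairings, but food is just as common without the light:
# temporal contiguity without contingency -> no conditioning expected.
print(contingency(40, 100, 40, 100))  # 0.0

# Same number of pairings, but now the light genuinely predicts food.
print(contingency(40, 100, 10, 100))  # ~0.3
```

With identical pairing counts, only the second case carries predictive information.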

SLIDE 6

But do they really?

  • 2. Kamin’s blocking

contingency is not enough either… need surprise

SLIDE 7

But do they really?

  • 3. Reynold’s overshadowing

seems like stimuli compete for learning

SLIDE 8

Theories of prediction learning: Goals

  • Explain how the CS acquires “value”
  • When (under what conditions) does this happen?
  • Basic phenomena: gradual learning and extinction curves
  • More elaborate behavioral phenomena
  • (Neural data)

P.S. Why are we looking at old-fashioned Pavlovian conditioning? → it is the perfect uncontaminated test case for examining prediction learning on its own

SLIDE 9

error-driven learning: the change in value is proportional to the difference between actual and predicted outcome

Rescorla & Wagner (1972)

        − = ∆

j CS US CS

j i

V r V η

Assumptions:

  • 1. learning is driven by error (formalizes notion of surprise)
  • 2. summations of predictors is linear

A simple model - but very powerful!

– explains: gradual acquisition & extinction, blocking, overshadowing, conditioned inhibition, and more…
– predicted overexpectation

note: US as “special stimulus”
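The shared error term is what lets Rescorla-Wagner explain blocking. A minimal simulation sketch (the learning rate, US magnitude, and trial counts are arbitrary choices, not from the slide):

```python
import numpy as np

eta, r = 0.2, 1.0          # learning rate and US magnitude (illustrative)
V = np.zeros(2)            # V[0]: value of CS A, V[1]: value of CS B

# Phase 1: A alone -> US. V_A climbs toward r.
for _ in range(100):
    x = np.array([1.0, 0.0])          # A present, B absent
    delta = r - V @ x                 # error shared by all present CSs
    V += eta * delta * x

# Phase 2: compound AB -> US. A already predicts the US, so the shared
# error is ~0 and B acquires almost no value: Kamin blocking.
for _ in range(100):
    x = np.array([1.0, 1.0])
    delta = r - V @ x
    V += eta * delta * x

print(V)   # V_A near 1, V_B near 0
```

Because learning is driven by the summed prediction error, a fully predicted US leaves nothing for the added stimulus to learn.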

SLIDE 10
  • how does this explain acquisition and extinction?
  • what would V look like with 50% reinforcement? e.g. 1 1 0 1 0 0 1 1 1 0 0

– what would V be on average after learning?
– what would the error term be on average after learning?

Rescorla-Wagner learning

$$V_{t+1} = V_t + \eta\,(r_t - V_t)$$
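The 50%-reinforcement question can be checked numerically. A sketch (η = 0.1 and the trial count are arbitrary choices): V settles around the average reward, 0.5, and the prediction error averages to zero after learning.

```python
import random

random.seed(0)             # reproducible illustration
eta, V = 0.1, 0.0
errors = []
for t in range(5000):
    r = 1.0 if random.random() < 0.5 else 0.0   # 50% reinforcement
    delta = r - V
    V += eta * delta
    errors.append(delta)

print(round(V, 2))                                        # hovers near 0.5
print(round(sum(errors[1000:]) / len(errors[1000:]), 3))  # near 0
```

V never converges exactly (each trial is all-or-none), but it fluctuates around the expected reward, and the average error after learning vanishes.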

SLIDE 11

how is the prediction on trial (t) influenced by rewards at times (t-1), (t-2), …?

Rescorla-Wagner learning

$$V_{t+1} = V_t + \eta\,(r_t - V_t) = (1-\eta)\,V_t + \eta\,r_t$$

unrolling the recursion:

$$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i}\, r_i$$

[Figure: the weights $\eta(1-\eta)^{t-i}$ decay exponentially over the last 10 trials]

recent rewards weigh more heavily. Why is this sensible? learning rate = forgetting rate! The R-W rule estimates expected reward using a weighted average of past rewards.
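The recursion and the weighted-average closed form can be checked against each other directly (η = 0.3 is arbitrary; the reward sequence reuses the 50%-reinforcement example):

```python
eta = 0.3                              # arbitrary learning rate
rewards = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]

# Recursive form: V <- (1 - eta) * V + eta * r, starting from V = 0.
V = 0.0
for r in rewards:
    V = (1 - eta) * V + eta * r

# Closed form: V_t = eta * sum_{i=1..t} (1 - eta)^(t - i) * r_i.
t = len(rewards)
closed = eta * sum((1 - eta) ** (t - i) * rewards[i - 1] for i in range(1, t + 1))

print(abs(V - closed) < 1e-12)         # True: the two forms agree
```

The exponentially decaying weights are exactly what "learning rate = forgetting rate" buys: old rewards are not dropped, just discounted.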

SLIDE 12

Summary so far

Predictions are useful for behavior. Animals (and people) learn predictions (Pavlovian conditioning = prediction learning). Prediction learning can be explained by an error-correcting learning rule (Rescorla-Wagner): predictions are learned from experiencing the world and comparing predictions to reality.

Marr: the R-W rule is gradient descent on the squared prediction error:

$$E = \tfrac{1}{2}\left(r_{US} - \sum_i V_{CS_i}\right)^2, \qquad \Delta V_{CS_j} = -\alpha\,\frac{\partial E}{\partial V_{CS_j}} = \alpha\left(r_{US} - \sum_i V_{CS_i}\right) = \alpha\,\delta$$

SLIDE 13

But: second order conditioning

phase 1: …  phase 2: …

[Figure: conditioned responding vs. number of phase 2 pairings]

animals learn that a predictor of a predictor is also a predictor of reward! ⇒ not interested solely in predicting immediate reward

test:

?

what do you think will happen? what would Rescorla-Wagner learning predict here?

SLIDE 14

let's start over: this time from the top

Marr’s 3 levels:

  • The problem: optimal prediction of future reward

$$V_t = E\!\left[\sum_{i=t}^{T} r_i\right]$$

want to predict expected sum of future reward in a trial/episode (N.B. here t indexes time within a trial)

  • what's the obvious prediction error?

$$\delta_t = \sum_{i=t}^{T} r_i - V_t \qquad \text{(compare Rescorla-Wagner: } \delta_{RW} = r - V_{CS}\text{)}$$

  • what's the obvious problem with this?

SLIDE 15

let's start over: this time from the top

Marr’s 3 levels:

  • The problem: optimal prediction of future reward

$$V_t = E\!\left[\sum_{i=t}^{T} r_i\right]$$

want to predict expected sum of future reward in a trial/episode

$$V_t = E[r_t + r_{t+1} + r_{t+2} + \dots + r_T] = E[r_t] + E[r_{t+1} + r_{t+2} + \dots + r_T] = E[r_t] + V_{t+1}$$

Bellman eqn for policy evaluation

SLIDE 16

lets start over: this time from the top

Marr’s 3 levels:

  • The problem: optimal prediction of future reward
  • The algorithm: temporal difference learning

$$V_t = E[r_t] + V_{t+1}$$

$$V_t \leftarrow (1-\eta)\,V_t + \eta\,(r_t + V_{t+1})$$

$$V_t \leftarrow V_t + \eta\,(r_t + V_{t+1} - V_t)$$

temporal difference prediction error: $\delta_t = r_t + V_{t+1} - V_t$

compare to Rescorla-Wagner: $V_{t+1} \leftarrow V_t + \eta\,(r_t - V_t)$
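The TD rule can be run on a toy within-trial problem. A sketch under invented timings and parameters: a CS at t=2 is followed by reward at t=8; before CS onset no stimulus is present, so the prediction there is pinned at zero. Early in training the prediction error sits at the reward; after training it has migrated back to the CS.

```python
import numpy as np

T, eta = 10, 0.2                  # trial length and learning rate (illustrative)
cs_time, reward_time = 2, 8       # CS onset and reward delivery (illustrative)
V = np.zeros(T + 1)               # V[T] stays 0 (end of trial)

def value(V, t):
    # before CS onset no stimulus is present, so the prediction is zero
    return V[t] if t >= cs_time else 0.0

def run_trial(V):
    """One TD(0) pass; returns the prediction error at each timestep."""
    deltas = np.zeros(T)
    for t in range(T):
        r = 1.0 if t == reward_time else 0.0
        deltas[t] = r + value(V, t + 1) - value(V, t)
        if t >= cs_time:
            V[t] += eta * deltas[t]
    return deltas

first = run_trial(V)              # first trial: V is still all zeros
for _ in range(500):
    deltas = run_trial(V)         # train to convergence

print(int(np.argmax(first)))      # 8: error occurs at the reward
print(int(np.argmax(deltas)))     # 1: error has moved to the transition into the CS
```

This backward migration of δ across training is the signature behaviour the dopamine recordings are compared against.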

SLIDE 17

prediction error

TD error

$$\delta_t = r_t + V_{t+1} - V_t$$

[Figure (Schultz recordings): dopamine responses in three conditions: no prediction, reward; prediction, reward; prediction, no reward]

SLIDE 18

Summary so far

Temporal difference learning versus Rescorla-Wagner

  • derived from first principles about the future
  • explains everything that R-W does, and more (e.g. 2nd order conditioning)
  • a generalization of R-W to real time
SLIDE 19

Back to Marr’s 3 levels

  • The problem: optimal prediction of future reward
  • The algorithm: temporal difference learning
  • Neural implementation: does the brain use TD learning?
SLIDE 20

Dopamine

Dorsal Striatum (Caudate, Putamen); Nucleus Accumbens (Ventral Striatum); Prefrontal Cortex

Parkinson’s Disease → Motor control + initiation? Intracranial self-stimulation; Drug addiction; Natural rewards

Ventral Tegmental Area; Substantia Nigra; Amygdala

Natural rewards → Reward pathway? → Learning? Also involved in:

  • Working memory
  • Novel situations
  • ADHD
  • Schizophrenia
SLIDE 21

Role of dopamine: Many hypotheses

  • Anhedonia hypothesis
  • Prediction error (learning, action selection)
  • Salience/attention
  • Incentive salience
  • Uncertainty
  • Cost/benefit computation
  • Energizing/motivating behavior
SLIDE 22

dopamine and prediction error

TD error

$$\delta_t = r_t + V_{t+1} - V_t$$

[Figure (Schultz recordings): dopamine responses in three conditions: no prediction, reward; prediction, reward; prediction, no reward]

SLIDE 23

prediction error hypothesis of dopamine

The idea: Dopamine encodes a reward prediction error

Tobler et al, 2005 Fiorillo et al, 2003

SLIDE 24

prediction error hypothesis of dopamine

[Figure: measured firing rate vs. model prediction error]

Bayer & Glimcher (2005)

at end of trial: $\delta_t = r_t - V_t$ (just like R-W)

$$V_t = \eta \sum_{i=1}^{t} (1-\eta)^{t-i}\, r_i$$

SLIDE 25

what drives the dips?

  • why an effect of reward at all?

– Pavlovian influence

Matsumoto & Hikosaka (2007)

SLIDE 26

what drives the dips?

Matsumoto & Hikosaka (2007)

  • rHab -> rSTN
  • RMTg (predicted R/S)

Jhou et al, 2009

SLIDE 27

Where does dopamine project to? Basal ganglia

Several large subcortical nuclei (unfortunate anatomical names follow structure rather than function, eg caudate + putamen + nucleus accumbens are all relatively similar pieces of striatum; but globus pallidus & substantia nigra each comprise two different things)

SLIDE 28

Where does dopamine project to? Basal ganglia

inputs to BG are from all over the cortex (and topographically mapped)

Voorn et al, 2004

SLIDE 29

Corticostriatal synapses: 3 factor learning

[Diagram: cortical stimulus representation X1 X2 X3 … XN; adjustable corticostriatal synapses carry learned values V1 V2 V3 … VN in the striatum; VTA/SNc broadcast the prediction error δ (dopamine); reward signal R arrives via PPTN, habenula etc.]

but also amygdala; orbitofrontal cortex; ...
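The corticostriatal scheme above amounts to a three-factor update: presynaptic cortical activity x_i, a postsynaptic striatal value V = w·x, and a globally broadcast dopamine error δ gating the weight change. A minimal single-trial sketch (all numbers invented for illustration):

```python
import numpy as np

eta = 0.1
w = np.zeros(4)                       # corticostriatal weights (learned values)
x = np.array([1.0, 0.0, 1.0, 0.0])    # which cortical inputs are active this trial
r = 1.0                               # reward signal (PPTN, habenula, etc.)

V = w @ x                             # striatal value estimate
delta = r - V                         # dopamine prediction error at trial end
w += eta * delta * x                  # three factors: eta * delta * x_i

print(w)   # only the weights of active inputs move
```

Because δ is broadcast globally, only synapses with presynaptic activity (x_i > 0) change, which is what makes the dopamine signal usable as a shared teaching signal.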

SLIDE 30

striatal complexities

Cohen & Frank, 2009

SLIDE 31

Dopamine and plasticity

Prediction errors are for learning… Cortico-striatal synapses show complex dopamine-dependent plasticity

Wickens et al, 1996

SLIDE 32

Risk Experiment

[Trial timeline: stimulus < 1 sec; 0.5 sec; 5 sec ISI; outcome "You won 40 cents" for 0.5 sec; 2–5 sec ITI]

5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢

19 subjects (dropped 3 non-learners, N=16); 3T scanner, TR=2 sec, interleaved; 234 trials: 130 choice, 104 single stimulus, randomly ordered and counterbalanced

SLIDE 33

Neural results: Prediction Errors

what would a prediction error look like (in BOLD)?

SLIDE 34

Neural results: Prediction errors in NAC

unbiased anatomical ROI in nucleus accumbens (marked per subject; thanks to Laura deSouza)

raw BOLD (avg over all subjects)

can actually decide between different neuroeconomic models of risk

SLIDE 35

Prediction error

punishment prediction error

TD error: $\delta_t = r_t + V_{t+1} - V_t$

[Figure: value and TD-error traces for High Pain vs. Low Pain conditions (probabilities 0.8, 1.0, 0.2)]

SLIDE 36

TD model

A – B – HIGH C – D – LOW C – B – HIGH A – B – HIGH A – D – LOW C – D – LOW A – B – HIGH A – B – HIGH C – D – LOW C – B – HIGH

experimental sequence…..

MR scanner

punishment prediction error


Brain responses Prediction error

Ben Seymour; John O’Doherty

SLIDE 37

TD prediction error: ventral striatum Z=-4 R

punishment prediction error


SLIDE 38

punishment prediction


right anterior insula; dorsal raphe (5HT)?

SLIDE 39

punishment

  • dips below baseline in dopamine

– Frank: D2 receptors particularly sensitive
– Bayer & Glimcher: length of pause related to size of negative prediction error

  • but:

– can't afford to wait that long
– negative signal for such an important event
– opponency a more conventional solution:

  • serotonin…
SLIDE 40

generalization


SLIDE 41

generalization


SLIDE 42

random-dot discrimination

differential reward (0.16ml; 0.38ml) Sakagami (2010)

SLIDE 43
Other paradigms

  • inhibitory conditioning
  • transreinforcer blocking
  • motivational sensitivities
  • backwards blocking

– Kalman filtering

  • downwards unblocking
  • primacy as well as recency (highlighting)

– assumed density filtering

SLIDE 44

Summary of this part: prediction and RL

Prediction is important for action selection

  • The problem: prediction of future reward
  • The algorithm: temporal difference learning
  • Neural implementation: dopamine dependent learning in BG

⇒ A precise computational model of learning allows one to look in the brain for "hidden variables" postulated by the model
⇒ Precise (normative!) theory for generation of dopamine firing patterns
⇒ Explains anticipatory dopaminergic responding, second order conditioning
⇒ Compelling account for the role of dopamine in classical conditioning: prediction error acts as a signal driving learning in prediction areas

SLIDE 45

Striatum and learned values

Striatal neurons show ramping activity that precedes a reward (and changes with learning!)

[Figures: ramping activity between trial start and food delivery (Schultz; Daw)]

SLIDE 46

Phasic dopamine also responds to…

  • Novel stimuli
  • Especially salient (attention grabbing) stimuli
  • Aversive stimuli (??)
  • Reinforcers and appetitive stimuli induce approach behavior and

learning, but also have attention functions (elicit orienting response) learning, but also have attention functions (elicit orienting response) and disrupt ongoing behaviour. → Perhaps DA reports salience of stimuli (to attract attention; switching) and not a prediction error? (Horvitz, Redgrave)