reinforcement learning in humans and animals
Nathaniel Daw, NYU (neuroscience; psychology; neuroeconomics)
cognition-centric roundtable, Stevens, May 13 2011
collaborators
NYU: Aaron Bornstein, Sara Constantino, Nick Gustafson, Jian Li, Seth Madlon-Kay, Dylan Simon, Bijan Pesaran
Columbia: Daphna Shohamy, Elliott Wimmer
UCL: Peter Dayan, Ben Seymour, Ray Dolan
Berkeley: Bianca Wittmann
U Chicago: Jeff Beeler, Xiaoji Zhuang
Princeton: Yael Niv, Sam Gershman
Trinity: John O’Doherty
Tel Aviv: Tom Schonberg, Daphna Joel
Montreal: Aaron Courville
CMU: David Touretzky
Austin: Ross Otto
funding: NIMH, NIDA, NARSAD, McKnight Endowment, HFSP
question
longstanding question in psychology: what information is learned from reward?
– law of effect (Thorndike): learn to repeat reinforced actions
  - dopamine
– cognitive maps (Tolman): learn a “map” of task structure; evaluate new actions online
  - even rats can do this
new leverage on this problem
draw on computer science and economics for methods and frameworks
1. new computational & neural tools
   – examine learning via trial-by-trial adjustments in behavior and neural signals
2. new computational theories
   – algorithmic view
   – dopamine associated with “model-free” RL
   – “model-based” RL as an account for cognitive maps (Daw, Niv & Dayan 2005, 2006)
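The “model-free” side can be made concrete with the temporal-difference update that these theories associate with dopamine; all numbers below are illustrative placeholders, not values fit to data:

```python
# TD prediction error, the quantity model-free RL theories associate
# with phasic dopamine: delta = r + gamma * V(s') - V(s).
# All values here are illustrative.
gamma = 0.9                      # discount factor (assumed)
V = {"s": 0.5, "s_next": 1.0}    # current state-value estimates
r = 0.0                          # reward received on this transition

delta = r + gamma * V["s_next"] - V["s"]  # prediction error (= 0.4 here)
alpha = 0.1                               # learning rate
V["s"] += alpha * delta                   # move V(s) toward the target
```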
learned decision making in humans
[figure: bandit task; fixation cross, then options whose reward probabilities (0–0.5) drift slowly across ~300 trials]
“bandit” tasks: Daw et al. 2006; Wittmann et al. 2008; Gershman et al. 2009; Schonberg et al. 2007, 2010; Glascher et al. 2010; Li & Daw 2011
trial-by-trial analysis
experience (past choices & outcomes, trials t-1, t-2, t-3, t-4, …)
→ model (RL algorithm + probabilistic choice rule: experience → choices)
→ predicted values, prediction errors, etc., and predicted choice probabilities
behavior: which model & parameters make observed choices most likely?
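The fitting logic above can be sketched as a Q-learning model with a softmax choice rule, scored by the likelihood it assigns to the observed choices. The two-armed setup and parameter names here are illustrative assumptions, not the papers’ exact models:

```python
import numpy as np

def neg_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of observed choices under a
    Q-learning + softmax model (illustrative sketch)."""
    alpha, beta = params          # learning rate, softmax inverse temperature
    q = np.zeros(2)               # action-value estimates for two arms
    nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * q) / np.exp(beta * q).sum()  # softmax choice rule
        nll -= np.log(p[c])       # credit probability of the observed choice
        q[c] += alpha * (r - q[c])  # prediction-error value update
    return nll
```

In practice one would minimize this over (alpha, beta), e.g. with `scipy.optimize.minimize`, and compare candidate models by their likelihoods.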
E[V(a)] = Σ_o P(o|a) V(o)
“model-free” vs. “model-based”
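The model-based expectation E[V(a)] = Σ_o P(o|a) V(o) can be computed directly from a learned transition model; the probabilities and outcome values below are illustrative:

```python
import numpy as np

# Model-based evaluation: weight outcome values by the learned
# transition probabilities (illustrative numbers).
P = np.array([[0.7, 0.3],   # P(outcome | action 0)
              [0.3, 0.7]])  # P(outcome | action 1)
V = np.array([1.0, 0.0])    # current values of the two outcomes

EV = P @ V                  # E[V(a)] per action: [0.7, 0.3]
# action 0 is favored because it usually leads to the valuable outcome
```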
rat version
[figure: lever presses (actions per minute) for valued vs. devalued outcomes, after moderate vs. extensive training] (Holland, 2004)
two behavioral modes:
- devaluation-sensitive (“goal-directed”)
- devaluation-insensitive (“habitual”)
- neurally dissociable with lesions (Dickinson, Balleine, Killcross)
dual-systems view
(Balleine, Daw & O’Doherty, 2009)
task
[figure: two-stage task; first-stage choices lead to second-stage states with probability 70% (vs. 30%); second-stage options are rewarded with probabilities 26%, 57%, 41%, 28%, all slowly changing] (Daw, Gershman, Seymour, et al. Neuron 2011)
question
does choice behavior respect sequential structure?
idea
How does bottom-stage feedback affect top-stage choices?
Example: a rare (30%) transition at the top level, followed by a win. Which top-stage action is now favored?
predictions
- direct reinforcement ignores transition structure
- model-based planning respects transition structure
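These predictions can be checked by simulating both agent types on a simplified two-stage task. Parameter values and the fixed (non-drifting) reward probabilities below are illustrative assumptions:

```python
import numpy as np

def simulate(agent, n_trials=20000, alpha=0.5, beta=3.0, seed=0):
    """Simulate a simplified two-stage task and return stay rates
    keyed by (previous reward, previous transition was rare)."""
    rng = np.random.default_rng(seed)
    p_reward = [0.8, 0.2]   # second-stage reward probabilities (fixed here)
    V = np.zeros(2)         # learned second-stage state values
    Q_mf = np.zeros(2)      # model-free first-stage action values
    stays = {(1, False): [], (1, True): [], (0, False): [], (0, True): []}
    prev = None
    for _ in range(n_trials):
        if agent == "mf":
            Q = Q_mf
        else:  # model-based: combine the 70/30 transition model with V
            Q = np.array([0.7 * V[0] + 0.3 * V[1],
                          0.3 * V[0] + 0.7 * V[1]])
        p = np.exp(beta * Q) / np.exp(beta * Q).sum()  # softmax choice rule
        a = int(rng.choice(2, p=p))
        rare = rng.random() < 0.3
        s = 1 - a if rare else a                # second-stage state reached
        r = float(rng.random() < p_reward[s])   # binary reward
        V[s] += alpha * (r - V[s])              # learn state values
        Q_mf[a] += alpha * (r - Q_mf[a])        # direct reinforcement
        if prev is not None:
            pr, pa, prare = prev
            stays[(pr, prare)].append(a == pa)
        prev = (int(r), a, rare)
    return {k: float(np.mean(v)) for k, v in stays.items()}
```

For the model-free agent, a win raises the chosen action’s value regardless of how it was reached; for the model-based agent, a win after a rare transition instead raises the value of the *other* first-stage action, so it switches more.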
data
[figure: observed choice data compared with reinforcement and planning predictions]
17 subjects × 201 trials each
reward: p < 1e-8; reward × rare: p < 5e-5 (mixed-effects logit)
results:
- reject pure reinforcement
- models suggest a mixture of planning and reinforcement processes
(Daw, Gershman, Seymour, et al Neuron 2011)
Otto, Gershman, Markman
dual task
single task vs. dual task; dual × reward: p < 5e-7; dual × reward × rare: p < .05
neural analysis
behavior incorporates model knowledge: not just TD
want to ask the same question neurally: can we dissociate multiple neural systems underlying behavior?
- in particular, can we show subcortical systems are “dumb”?
dopamine & RL
(Schultz et al. 1997) (Daw et al. 2011)
fMRI analysis
hypothesis: striatal “error” signals are solely reinforcement-driven
1. generate candidate error signals assuming TD
2. an additional regressor captures how this signal would change for errors relative to values computed by planning

net signal = TD error + β · (change due to forward planning); estimate β
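The regressor construction can be sketched as follows; the error values and β here are illustrative placeholders, not real fMRI estimates:

```python
import numpy as np

# net signal = TD error + beta * (how planning would change the error)
delta_mf = np.array([0.5, -0.2, 0.1])  # candidate TD errors (illustrative)
delta_mb = np.array([0.3,  0.1, 0.1])  # errors implied by planned values

correction = delta_mb - delta_mf       # additional regressor in the GLM
beta = 0.6                             # weight to be estimated per subject
net = delta_mf + beta * correction     # predicted striatal error signal
```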
contrary to theories: even striatal error signals incorporate knowledge of task structure
(P<.05 cluster) (Daw, Gershman, Seymour, et al Neuron 2011)
variation across subjects
subjects differ in degree of model usage

net signal = TD error + β · (change due to planning)
compare behavioral & neural estimates of β (p < .05 SVC)
average signal
R NAcc, start of trial:
- interaction not significant
- but size of interaction covaries with behavioral model usage (p = .02)
can distinguish multiple learned representations in humans
- neurally more intertwined than expected
related areas: self-control (drugs, dieting, savings, etc.); learning in multiplayer interactions (games)
- equilibrium vs equilibration
- do we learn about actions or about opponents?
thoughts
p-beauty context
- fast equilibration with repeated play; most subjects never reinforced
- (Singaporean undergrads; Ho et al. 1998)
RPS (rock-paper-scissors)
- do subjects learn by reinforcement?
- best respond to reinforcement?
- best respond to that?
(Hampton et al, 2008)
conclusions
- 0. use of computational models to quantify phenomena & distinctions for neural study
- 1. can leverage this to distinguish different sorts of learning, trial-by-trial
  – beginning to map neural substrates
- 2. implications for self-control, economic interactions