Prefrontal cortex as a meta-reinforcement learning system
Matthew Botvinick
DeepMind, London, UK; Gatsby Computational Neuroscience Unit, UCL
Mnih et al., Nature (2015)
Yamins & DiCarlo, 2016
Schultz et al, Science (1997)
Jaderberg et al., 2016
Mante et al., Nature, 2013; Song et al., eLife, 2017
Lake et al, BBS (2017)
“Learning to learn”
Harlow, Psychological Review, 1949
[Figure: performance curves across training episodes in Harlow's learning-set task]
Mnih et al, Nature (2015)
Jaderberg et al., 2016
https://deepmind.com/blog/impala-scalable-distributed-deeprl-dmlab-30/
[Figure: meta-RL architecture. A recurrent network (PFC) receives the previous action a_{t-1} and previous reward r_{t-1} as input and outputs the current action a_t and value estimate v_t; a dopamine-like (DA) signal carries the reward-prediction error δ that trains the network]
Wang et al., Nature Neuroscience (2018); Wang et al., Cog. Sci. (2016); Duan et al., arXiv (2016)
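The input scheme in the schematic can be made concrete. A minimal illustrative sketch (the function name and encoding details are my assumptions, not the paper's code), assuming a two-armed bandit where the recurrent network receives the previous action as a one-hot vector concatenated with the previous reward:

```python
import numpy as np

def meta_rl_input(a_prev, r_prev, n_actions=2):
    """Per-timestep input to the recurrent (PFC-like) network:
    one-hot previous action a_{t-1} concatenated with previous reward r_{t-1}."""
    x = np.zeros(n_actions + 1)
    if a_prev is not None:      # no previous action at the start of an episode
        x[a_prev] = 1.0
    x[-1] = r_prev
    return x

print(meta_rl_input(a_prev=1, r_prev=1.0))    # [0. 1. 1.]
print(meta_rl_input(a_prev=None, r_prev=0.0)) # [0. 0. 0.]
```

Feeding action and reward back as inputs is what lets the recurrent dynamics, rather than the slow weight updates, carry the fast within-episode learning.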
[Figure: example bandit reward probabilities across episodes: 0.7, 0.4, 0.6, 0.9, 0.3, 0.1, 0.8, 0.7]
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
[Figure: cumulative regret per trial within an episode, compared against Gittins indices, UCB, and Thompson sampling; example left/right choice sequences shown across episodes]
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
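As a point of reference for the regret comparison, Thompson sampling on a Bernoulli bandit can be sketched in a few lines. This is an illustrative baseline of my own, not the implementation behind the figure:

```python
import numpy as np

def thompson_regret(probs, n_trials, seed=0):
    """Run Thompson sampling on a Bernoulli bandit and return the
    cumulative-regret curve (best arm's probability minus chosen arm's)."""
    rng = np.random.default_rng(seed)
    n_arms = len(probs)
    alpha = np.ones(n_arms)   # Beta posterior: observed successes + 1
    beta = np.ones(n_arms)    # Beta posterior: observed failures + 1
    best = max(probs)
    curve, regret = [], 0.0
    for _ in range(n_trials):
        a = int(np.argmax(rng.beta(alpha, beta)))  # sample posterior, act greedily
        r = float(rng.random() < probs[a])
        alpha[a] += r
        beta[a] += 1.0 - r
        regret += best - probs[a]
        curve.append(regret)
    return curve

curve = thompson_regret([0.7, 0.3], n_trials=100)
```

The regret curve flattens as the posterior concentrates on the better arm; the trained meta-RL agent's within-episode behavior is compared against baselines of this kind.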
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
[Figure: per-episode arm reward probabilities: 0.7/0.3, 0.6/0.4, 0.3/0.7, 0.8/0.2]
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
[Figure: performance as a function of training episodes]
Wang et al., Nature Neuroscience (2018), Wang et al., Cog. Sci. (2016)
Volkmann et al., Nature Reviews Neurology, 2010
[Figure: choice log-odds log2(C_R/C_L) plotted against reward log-odds log2(R_R/R_L), axes from -4 to 4, for monkey behavior and the meta-RL agent]
Tsutsui et al., Nature Comms, 2016
Wang et al., Nature Neuroscience (2018)
Wang et al., Nature Neuroscience (2018)
[Figure: proportion of recorded units (Tsutsui et al., Nature Comms, 2016) and correlation of model units (Wang et al., Nature Neuroscience, 2018) coding the previous action a_{t-1}, previous reward r_{t-1}, their interaction a_{t-1} x r_{t-1}, and value v_t]
Wang et al., Nature Neuroscience (2018)
[Figure: panels A and B show reward probability, inferred/decoded volatility, learning rate, and action feedback over 200 steps]
Behrens et al., Nature Neuroscience, 2007; Wang et al., Nature Neuroscience (2018)
[Figure: dopamine responses during reversal]
Bromberg-Martin et al., J. Neurophys., 2010
Wang et al., Nature Neuroscience (2018)
[Figure: simulated reward-prediction errors when the left vs. right option is rewarded]
Wang et al., Nature Neuroscience (2018)
Miller, Botvinick & Brody, Nat. Neuro., 2017; Daw et al., Neuron, 2011
Model-based RL (from model-free RL)
[Figure: meta-RL RPE plotted against model-based RPE at stage 2 and at reward delivery, both axes from -1 to 1; r² = 0.89]
Wang et al., Nature Neuroscience (2018)
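The model-based RPEs on one axis of that comparison can be sketched for a two-step task. This is a simplified formulation of my own (the function name, the two-state layout, and the convention that action a commonly leads to state a are all assumptions, not the paper's code):

```python
import numpy as np

def model_based_rpes(p_common, v_stage2, a, s2, r):
    """Two RPEs a model-based learner emits on one two-step trial:
    at stage 2 (reached state's value vs. expected value of the stage-1
    action under the transition model) and at reward delivery."""
    probs = np.full(2, 1.0 - p_common)
    probs[a] = p_common                 # action a commonly leads to state a
    v1 = probs @ v_stage2               # model-based value of the stage-1 action
    rpe_stage2 = v_stage2[s2] - v1      # state-transition prediction error
    rpe_reward = r - v_stage2[s2]       # reward prediction error
    return rpe_stage2, rpe_reward

# common transition (a=0 reaches s2=0), reward delivered
print(model_based_rpes(0.8, np.array([0.9, 0.1]), a=0, s2=0, r=1.0))
```

A model-free learner's RPEs would ignore the transition structure, so the high correlation between meta-RL and model-based RPEs (r² = 0.89) is the signature being reported.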
Optogenetic manipulation of dopamine
- DA blocked upon food reward from the large/risky option
- DA blocked upon food reward from the small/certain option
- DA triggered upon food omission from the large/risky option
Stopper et al., Neuron, 2014; Wang et al., arXiv, 2018
Mnih et al, Nature (2015)
- Richer environments / abstractions (Espeholt et al., arXiv, 2018)
- Architectural biases (e.g., Raposo et al., NIPS, 2017)
- Complementary forms of meta-learning (e.g., Fernando et al., under review)
- Episodic reinstatement (Ritter et al., in press)