Clustering and generalization of abstract structures in reinforcement learning
Michael J. Frank
Laboratory for Neural Computation and Cognition Brown University
Reinforcement learning in neural nets and AI
Mnih et al, 2015, Nature
Kansky et al, 2017. See also Witty et al, 2018, arXiv
Figure: Breakout and Offset Paddle Breakout, trained on Asynchronous Advantage Actor-Critic (A3C)
learning & generalization)
‘you no longer need to support his head. When he’s on his stomach he can lift his head and chest. He can open and close his hands..’
which is initially inefficient – see Werchan et al 2016!
Collins & Frank 2013
CRP prior on TS in a new context:
$$P_0(TS = TS_j \mid C_{new}) = \frac{N(TS_j \mid C*)}{\alpha + \sum_i N(TS_i \mid C*)}$$
$$P_0(TS = \text{new} \mid C_{new}) = \frac{\alpha}{\alpha + \sum_i N(TS_i \mid C*)}$$
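A minimal sketch of the CRP prior above, assuming the counts N(TS_i | C*) are available as a plain list. The function name `crp_prior` and the example counts are illustrative, not taken from the slides.

```python
def crp_prior(ts_counts, alpha):
    """Return (prior for each existing TS, prior for a new TS) in a new context.

    ts_counts[j] plays the role of N(TS_j | C*) in the formula above.
    alpha is the CRP concentration parameter.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    denom = alpha + sum(ts_counts)
    p_old = [n / denom for n in ts_counts]  # reuse an existing task-set
    p_new = alpha / denom                   # create a new task-set
    return p_old, p_new


# Hypothetical example: two known task-sets with counts 2 and 1, alpha = 1.
p_old, p_new = crp_prior([2, 1], alpha=1.0)
print(p_old, p_new)
```

Popular task-sets get a larger prior in a new context, and alpha controls how readily a new task-set is created.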
Collins & Frank, Psych Review, 2013
Figure: CTS-model vs. neural network-model (labels: DA, BG TS_i, BG A_i)
Old TS vs. new TS: generalization & transfer
Collins & Frank 2013; 2016; Frank & Badre, 2012
Figure: C-PFC sparseness vs. fitted clustering prior
Old TS vs. new TS: generalization & transfer
“Mixture of Experts”
MIXTURE
Hierarchical task
Flat task
Figure: task design with contexts C0, C1, C2 (labels 1/4, 1/4, 1/2), task-sets TS1, TS2, and actions A1, A2, A3, A4
Figure: proportion correct vs. trial # per input pattern, for C0, C1 vs. C2. Panels: subjects (N=34) and model, in the initial phase and in Phase 2
Figure: transfer design. Contexts C0, C1, C2 with TS1, TS2 (actions A1, A2, A3, A4), plus new contexts C3 (TS old) and C4 (TS new: TS3, actions A1, A4)
Figure: proportion correct vs. trial # per input pattern, for C3 (TS old) vs. C4 (TS new). Panels: subjects (N = 34) and model
Collins & Frank (2016), Cognition
$$\mathrm{EEG}(\text{trial}) \sim \beta_0 + \beta_{PE}\,\mathrm{PE}(\text{trial}) + \beta_{Str}\,\mathrm{StructurePE}(\text{trial})$$
For each subject: β_PE(electrodes, time), β_Str(electrodes, time)
Figure: PE effect, average β_PE vs. time from feedback (ms)
Figure: unique effect of structure learning PE in ROI1 and ROI2 (**, *, ns)
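The regression above can be sketched for a single (electrode, time) point as an ordinary least-squares fit. Repeating it over all electrodes and time points would give the β_PE and β_Str maps. This is an illustration on simulated data, not the authors' analysis pipeline: the helper names, the simulated coefficients, and the noise level are all assumptions.

```python
import random


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def ols(y, pe, structure_pe):
    """Fit y ~ b0 + b_pe * PE + b_str * StructurePE; return [b0, b_pe, b_str]."""
    X = [[1.0, p, s] for p, s in zip(pe, structure_pe)]
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    return solve_linear(XtX, Xty)


# Simulated single-subject data at one (electrode, time) point.
rng = random.Random(0)
n_trials = 200
pe = [rng.gauss(0, 1) for _ in range(n_trials)]
structure_pe = [rng.gauss(0, 1) for _ in range(n_trials)]
eeg = [0.5 + 2.0 * p - 1.0 * s + rng.gauss(0, 0.1)
       for p, s in zip(pe, structure_pe)]

b0, b_pe, b_str = ols(eeg, pe, structure_pe)
print(round(b_pe, 2), round(b_str, 2))
```

Because both regressors enter the same model, b_str captures variance that PE alone does not explain, which is the sense of a "unique effect" here.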
Figure: P(Correct) vs. iteration # for C3 (TS old) and C4 (TS new); unique effect of structure learning PE
Figure: % choices of TS1 action vs. TS2 action
Collins & Frank, Cognition, accepted
Figure: PE effect and unique effect of structure learning PE in ROI1+2 (**, *, ns)
Figure: proportion correct vs. trial # per input pattern (C0, C1 vs. C2)
Badre & Frank 2012; Collins et al 2014, 2016
New Context: Old TS New TS
Werchan et al, 2016, JNeurosci
Share: physical movements (mappings from sounds to notes)
Share: chord progression, rhythm, etc. (desired sound/song)
Piccolo
Need compositionality: reuse flute mappings to play a song usually played on guitar
Rewards: What do you want to do?
Transitions: How can you do it?
Clustering goals and transitions jointly will not allow each to generalize independently of the other
Joint clustering: one cluster covers both rewards and transitions
Independent clustering: separate clusters for rewards (ClusterR) and transitions (ClusterT)
Nick Franklin
Figure: joint clustering vs. independent clustering of R and φ. Levels: contexts (C), clusters, functions, policies
Franklin & Frank, 2018, PLOS Computational Biology
Actions: a = {a1, a2, a3, a4, a5, a6, a7, a8}
Cardinal movements: A = {N, S, E, W}
Reduced transition: φ: a → A
Model-based RL: what to do vs. how to do it?
Rewards (what to do?): $R: x \to \{0, +1\}$
Transitions (how to do it?): $x \in \{\langle x_i, x_j \rangle : i = 1, \dots, 6;\ j = 1, \dots, 6\}$
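A small sketch of this separation, assuming a 6 x 6 grid of states, a mapping φ from action keys to cardinal movements, and a reward function R over states. The specific key assignments, the goal location, and the rule that off-grid moves leave the state unchanged are hypothetical choices for illustration.

```python
# Cardinal movements as (dx, dy) offsets; grid coordinates run from 1 to 6.
MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}
GRID = range(1, 7)


def make_transition(phi):
    """Build a transition function from a key-to-movement mapping phi."""
    def step(state, key):
        dx, dy = MOVES[phi[key]]
        nxt = (state[0] + dx, state[1] + dy)
        # Assumption: a move that would leave the grid keeps the state.
        return nxt if nxt[0] in GRID and nxt[1] in GRID else state
    return step


def make_reward(goal):
    """Build R: x -> {0, +1}, rewarding only the goal state."""
    return lambda state: 1 if state == goal else 0


# Two hypothetical mappings ("how to do it") ...
phi_1 = {"a1": "N", "a2": "S", "a3": "E", "a4": "W"}
phi_2 = {"a5": "N", "a6": "S", "a7": "E", "a8": "W"}
# ... combined with one reward function ("what to do").
reward = make_reward((3, 3))
step_1 = make_transition(phi_1)
step_2 = make_transition(phi_2)

print(reward(step_1((3, 2), "a1")), reward(step_2((3, 2), "a5")))
```

Keeping φ and R as separate objects is what lets either one be reused in a new context without the other.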
Mapping (φ) and Reward (R) in contexts C1, C2, C3, C4
Agent Performance
solve task domain
task domain
and (lower KL-Divergence)
Function Estimates Agent Performance
Figure: contexts (C) grouped into clusters over φ(a, A) (transitions) and R(x)
Meta Model
select independent/joint clustering as actor
Independent (i.e. minimax)
idea?
Meta-agent: joint actor (weight w), independent actor (weight 1 − w)
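One way to read the meta-agent is as a stochastic arbiter that hands control to the joint actor with probability w and to the independent actor otherwise. The sketch below shows only that selection step. The stub actors and the fixed value of w are placeholders, and how w is learned is not specified here.

```python
import random


def meta_choose(w, joint_actor, ind_actor, state, rng=random):
    """Pick the joint actor with probability w, else the independent actor."""
    actor = joint_actor if rng.random() < w else ind_actor
    return actor(state)


# Placeholder actors that only report which one was used.
def joint_actor(state):
    return ("joint", state)


def ind_actor(state):
    return ("independent", state)


rng = random.Random(1)
choices = [meta_choose(0.7, joint_actor, ind_actor, "s0", rng)[0]
           for _ in range(1000)]
print(choices.count("joint") / len(choices))
```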
2. Theory
Figure: contexts (C) grouped into clusters over φ(a, A) and R(x)
Generalization via Bayesian Inference
Posterior probability a context is in a cluster ∝ Likelihood given the context is in a cluster × Prior probability a context is in a cluster
Likelihood: task specific
Prior: task general
How good a guess about cluster assignment is the prior? Formally, how good an estimator of the generative process is the prior, regardless of task specifics?
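The posterior ∝ likelihood × prior relationship can be illustrated with a few lines of code. The prior values below are what a CRP with counts [2, 1] and α = 1 would give. The likelihood values are made-up numbers standing in for how well each cluster explains the observations in the current task.

```python
def cluster_posterior(likelihoods, priors):
    """Normalize likelihood * prior over candidate clusters."""
    unnorm = [lik * pri for lik, pri in zip(likelihoods, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Task-general prior over [cluster 1, cluster 2, new cluster].
priors = [0.5, 0.25, 0.25]
# Task-specific likelihoods (hypothetical values).
likelihoods = [0.1, 0.6, 0.3]

posterior = cluster_posterior(likelihoods, priors)
print([round(p, 3) for p in posterior])
```

Before any task-specific evidence arrives, the likelihoods are uninformative and the posterior equals the prior, which is why the quality of the prior matters for early generalization.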
Figure: joint clustering (contexts clustered over rewards and transitions together) vs. independent clustering (contexts clustered separately for rewards and for transitions), with actions a1, a2, a3, a4, a5, a6, a7, a8
Only consider prior: Don’t need statistics of specific transitions, rewards; replace with arbitrary symbol
R-Sequence T-Sequence
Update the CRP prior in each context and ask: how well can it guess the next reward?
* CRP with independent clustering
* CRP with joint clustering
G Sequence: AAAAAAAAAABBBBBBBBBB
T Sequence: 11111222222222211111
G Sequence: AAAAAAAAAABBBBBBBBBB
T Sequence: 11111122222222221111
G Sequence: AAAAAAAAAABBBBBBBBBB
T Sequence: 11111112222222222111
Hold the structure of each sequence constant
Increase mutual information () by progressively switching pairs
Independent clustering is better with low mutual information
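To make the manipulation concrete, here is a sketch that computes the empirical mutual information between a G sequence and a T sequence. The sequences are rebuilt programmatically to follow the pattern shown above, and the fully aligned case is added as a reference point that is not on the slide.

```python
from collections import Counter
from math import log2


def mutual_information(g_seq, t_seq):
    """Empirical mutual information I(G; T) in bits."""
    n = len(g_seq)
    joint = Counter(zip(g_seq, t_seq))
    g_counts = Counter(g_seq)
    t_counts = Counter(t_seq)
    return sum((c / n) * log2(c * n / (g_counts[g] * t_counts[t]))
               for (g, t), c in joint.items())


g = "A" * 10 + "B" * 10
# Shift the block of 2s one step at a time, as in the sequences above.
t_seqs = ["1" * k + "2" * 10 + "1" * (10 - k) for k in (5, 6, 7)]
mi = [mutual_information(g, t) for t in t_seqs]

# Reference case (an addition): T fully aligned with G.
mi_aligned = mutual_information(g, "1" * 10 + "2" * 10)
print([round(v, 3) for v in mi], mi_aligned)
```

The first sequence pair has zero mutual information, and each shift toward alignment raises it.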
Figure: Independent vs Joint (bits)
G Sequence: AAAAAAAAAAAAAAAAAAAABCD
T Sequence: 11111111111111111111234
G Sequence: AAAAAAAAAAAAAAAAAAAABCD
T Sequence: 11111111111111111115234
G Sequence: AAAAAAAAAAAAAAAAAAAABCD
T Sequence: 11111111111111111156234
Hold mutual information constant: T perfectly predicts G (H(G|T) = 0)
Increase chance of misidentifying T, i.e. noise
Even when T perfectly predicts G, independent clustering is useful when T observations are noisy
Figure: Independent vs Joint (bits)
G Sequence: AAAAAAAAAAAAAAAAAAAABCDBCD
T Sequence: 11111111111111111111234342
G Sequence: AAAAAAAAAAAAAAAAAAAABCDBCD
T Sequence: 11111111111111111115234342
G Sequence: AAAAAAAAAAAAAAAAAAAABCDBCD
T Sequence: 11111111111111111156234342
Reduce Larger compositionality benefit
Reduce by adding to sequence
Hold mutual information constant
T perfectly predicts G
Increase chance of misidentifying T, i.e. noise
Even when T perfectly predicts G, independent clustering is useful when T observations are noisy
H(G|T) ≈ 0.2 bits
Figure: Independent vs Joint (bits)
a) Joint better asymptotically
b) Independent better in noisy and/or independent environments
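The conditional entropy H(G|T) used in this section can be computed directly from a pair of sequences. The sketch below does so for a 23-symbol and a 26-symbol noise-free sequence pair following the patterns shown earlier. That the second case corresponds to the "≈ 0.2 bits" figure is an inference from the numbers, not something the slides state.

```python
from collections import Counter
from math import log2


def conditional_entropy(g_seq, t_seq):
    """Empirical conditional entropy H(G | T) in bits."""
    n = len(g_seq)
    joint = Counter(zip(g_seq, t_seq))
    t_counts = Counter(t_seq)
    return -sum((c / n) * log2(c / t_counts[t])
                for (g, t), c in joint.items())


# T perfectly predicts G: every T symbol maps to a single G symbol.
h_perfect = conditional_entropy("A" * 20 + "BCD", "1" * 20 + "234")

# Extended sequence: T symbols 2, 3, 4 each map to two different G symbols.
h_extended = conditional_entropy("A" * 20 + "BCDBCD", "1" * 20 + "234342")

print(round(h_perfect, 3), round(h_extended, 3))
```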
Transitions: relationship between keypresses (f, s, a, ;, l, k, j, d) and cardinal movements (N, S, E, W)
Goal values: goals A, B, C
3. Human Behavior
Goal values: Where do you want to go?
Transitions: How do you get there? (relationship between keypresses f, s, a, ;, l, k, j, d and cardinal movements N, S, E, W)
Do people cluster goals and transitions together (jointly)?
Do people cluster goals and transitions independently?
Figure: goal popularity (goals A, B, C) under joint clustering and under independent clustering
Figure: test contexts (1, 2, 3, 4)
Model and Human Behavior: Ambiguous Structure
Generalization is a combination of independent and joint clustering: meta-generalization
Model and Human Behavior: Independent Structure
Figure: test contexts (1, 2, 3, 4) and goals A, B, C, D
Model and Human Behavior: Joint Structure
is C, what is S?) simultaneous with actions
LNCC
Anne Collins (Berkeley), Nick Franklin (Harvard), Denise Werchan, Dima Amso, Jim Cavanagh