Jacob Andreas Berkeley → Microsoft Semantic Machines → MIT
Linguistic scaffolds for policy learning
(what can language do for RL?)
An NLPer’s view of RL
- (env, R): memorize 1 reward fn
- (env, R1), (env, R2): memorize k reward fns [e.g. Taylor & Stone 09]
- (env, R1) + (-2, 3), (env, R1) + (-2, -2): Learn to accomplish new goals! [e.g. Schaul et al. 15]
- (env, R1) + "run northwest", (env, R1) + "go southwest": Learn to follow instructions!
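The step from goal coordinates to instructions can be sketched in a few lines of Python. This is only an illustration: the sparse reward, the coordinate convention (x grows east, y grows north), and the pairing of each instruction with a goal are assumptions, and the lookup table `INSTRUCTION_TO_GOAL` is a hypothetical stand-in for what a learned instruction follower would have to infer.

```python
def make_reward(goal):
    """Return a sparse reward function: 1.0 at the goal cell, 0.0 elsewhere."""
    def reward(position):
        return 1.0 if position == goal else 0.0
    return reward


# Hypothetical pairing of instructions with goal cells.
# Assumed convention: x grows to the east, y grows to the north.
INSTRUCTION_TO_GOAL = {
    "run northwest": (-2, 3),
    "go southwest": (-2, -2),
}


def reward_for_instruction(instruction):
    """Build the reward function that an instruction stands for."""
    return make_reward(INSTRUCTION_TO_GOAL[instruction])


if __name__ == "__main__":
    northwest = reward_for_instruction("run northwest")
    print(northwest((-2, 3)), northwest((-2, -2)))
```

Memorizing one reward function corresponds to a single call to `make_reward`; following instructions means the agent has to recover the goal from the string rather than being handed the coordinates.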
Instructions as observations
- (env, R1) + "run northwest", (env, R1) + "go southwest"
- (env, R1) + (-2, 3), (env, R1) + (-2, -2)
Beyond observations
(1) Instructions are moves in a game, not observations of an environment.
Beyond goals
- (env, R1) + "???"
- (env, R1) + "not so fast"
- (env, R1) + "run northwest"
- (env, R1) + "go southwest"
(2) There’s more to language learning than instruction following!
Language use as gameplay
Generation & understanding
[Anderson et al. 18]
Turn right and walk through the kitchen. Go right into the living room and stop by the rug.
A reference game
[Frank & Goodman 12]
“glasses”
The rational speech acts model
[Frank & Goodman 12, Degen 13]
Literal listener L0:
- L0( . | glasses) = 1/2, 1/2 (the two referents wearing glasses)
- L0( . | hat) = 1 (the one referent wearing a hat)
Pragmatic speaker, S1( u | . ) ∝ L0( . | u ):
- referent with glasses only: S1( glasses | . ) = 1
- referent with hat and glasses: S1( glasses | . ) = 1/3, S1( hat | . ) = 2/3
Pragmatic listener, L1( . | u ) ∝ S1( u | . ):
- L1( . | glasses) = 3/4, 1/4
- L1( . | hat) = 1
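The numbers on the rational speech acts slides can be reproduced with a short Python sketch. It assumes the usual three-referent setup (a plain face, a face with glasses only, a face with hat and glasses), a uniform prior over referents, no utterance costs, and a rationality parameter of 1; the referent names are illustrative labels, not taken from the slides.

```python
from fractions import Fraction

# Illustrative referent names; the slides show these as pictures.
REFERENTS = ["plain", "glasses_only", "hat_and_glasses"]
UTTERANCES = ["glasses", "hat"]

# Literal semantics: the referents each utterance is true of.
TRUE_OF = {
    "glasses": {"glasses_only", "hat_and_glasses"},
    "hat": {"hat_and_glasses"},
}


def normalize(weights):
    """Scale weights to sum to 1; leave all-zero weights unchanged."""
    total = sum(weights.values())
    if total == 0:
        return dict(weights)
    return {key: value / total for key, value in weights.items()}


def literal_listener(utterance):
    """L0(referent | utterance): uniform over referents the utterance fits."""
    return normalize(
        {r: Fraction(int(r in TRUE_OF[utterance])) for r in REFERENTS}
    )


def pragmatic_speaker(referent):
    """S1(utterance | referent), proportional to L0(referent | utterance)."""
    return normalize({u: literal_listener(u)[referent] for u in UTTERANCES})


def pragmatic_listener(utterance):
    """L1(referent | utterance), proportional to S1(utterance | referent)."""
    return normalize({r: pragmatic_speaker(r)[utterance] for r in REFERENTS})


if __name__ == "__main__":
    print(pragmatic_speaker("hat_and_glasses"))
    print(pragmatic_listener("glasses"))
```

Exact fractions are used instead of floats so the results can be compared directly with the values on the slides: hearing "glasses", the pragmatic listener shifts from 1/2 to 3/4 on the glasses-only referent, because a speaker describing the other referent would more likely have said "hat".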
Pragmatics
Q: Do you know what time it is? A: Yes
I find his cooking very interesting.
[Grice 70]
RSA game tree: as speaker
[Figure: game tree. The speaker moves first (branches: hat, glasses), the listener moves second (branches: hat, glasses), and the leaves carry payoffs of -1 or +1.]
RSA game tree: as listener
[Figure: the same tree from the listener's side. The utterance "glasses" is observed; the speaker's node is unknown (marked ?).]
A recipe for pragmatic language understanding
1. Train a base speaker model
   smiley plain glasses man glasses hat & glasses
   hat & glasses glasses man guy with hat
2. Solve this POMDP:
   [Figure: speaker/listener game tree with branches hat, glasses and payoffs -1, +1]
Daniel Fried Ronghang Hu Volkan Cirik
Speaker-follower models for vision-and-language navigation. NeurIPS 18.
Application: instruction following
human instruction: Go through the door on the right and continue straight. Stop in the next room in front of the bed.
[Figure: top-down overview of trajectories. (a) orange: trajectory without pragmatic inference (baseline policy); (b) green: trajectory with pragmatic inference (reasoning).]
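One simple way to picture pragmatic inference for instruction following is candidate rescoring: a base follower proposes routes, and a speaker model re-ranks them by how well each route explains the instruction. The Python sketch below is an illustration under stated assumptions, not necessarily the procedure used in the cited paper; the route names, the log-probabilities, and the `weight` parameter are all hypothetical.

```python
import math


def pragmatic_choice(candidates, follower_logp, speaker_logp, weight=0.5):
    """Pick the route with the best weighted sum of speaker and follower log-scores.

    weight=0.0 reduces to the base follower; weight=1.0 trusts only the speaker.
    """
    def score(route):
        return weight * speaker_logp[route] + (1.0 - weight) * follower_logp[route]

    return max(candidates, key=score)


# Hypothetical scores for two candidate routes.
# follower_logp[r]: log P(route r | instruction) under the base follower.
# speaker_logp[r]:  log P(instruction | route r) under the base speaker.
follower_logp = {"stop_in_hallway": math.log(0.6), "stop_by_bed": math.log(0.4)}
speaker_logp = {"stop_in_hallway": math.log(0.1), "stop_by_bed": math.log(0.7)}

routes = list(follower_logp)
baseline_route = max(routes, key=follower_logp.get)
pragmatic_route = pragmatic_choice(routes, follower_logp, speaker_logp)

if __name__ == "__main__":
    print("baseline:", baseline_route)
    print("pragmatic:", pragmatic_route)
```

With these made-up numbers the base follower prefers the hallway route, while the rescored choice is the route by the bed, since a speaker describing the hallway route would be unlikely to mention the bed.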
Application: instruction generation
reasoning: Walk past the dining room table and chairs and take a right into the living room. Stop once you are on the rug.
seq2seq: Walk past the dining room table and chairs and wait there.
human: Turn right and walk through the kitchen. Go right into the living room and stop by the rug.
Lesson
Utterances are chosen to facilitate correct interpretation in context. (This makes the learning problem easier!)
Language as a scaffold for learning
What else is an instruction follower good for?
Language learning Reinforcement learning
go east of the heart
Learning with latent language. A, Klein & Levine. NAACL 18.
f( · ; η, )
Pretraining via language learning
NORTH
go east of the heart [Branavan et al., 09]
π
L(f( · ; η, ), · )
(Standard) reinforcement learning
[Figure: an unknown input (???) in place of the instruction, a policy π, and a reward R.]
Concept learning
L(f( · ; η, ), · )
[Figure: candidate descriptions are passed to π, which acts (NORTH,…; SOUTH,…) and receives a reward R.]
- find the horse: -0.52
- left of the heart: 0.33
- heart east side: 0.95
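Concept learning as shown on these slides can be read as a search over descriptions: propose candidate strings, score each one by acting on it with the pretrained policy, and keep the best. The Python sketch below illustrates only that selection step. The `SLIDE_REWARDS` table is a stand-in for actually rolling out the policy, and pairing each description with a score in the order they appear across the slide builds is an assumption.

```python
def choose_description(candidates, evaluate):
    """Score every candidate description and return (best description, all scores)."""
    scores = {description: evaluate(description) for description in candidates}
    best = max(scores, key=scores.get)
    return best, scores


# Stand-in for rolling out f( . ; eta, description) in the environment.
# Values are the numbers shown on the slides; the pairing is assumed.
SLIDE_REWARDS = {
    "find the horse": -0.52,
    "left of the heart": 0.33,
    "heart east side": 0.95,
}

best_description, all_scores = choose_description(
    list(SLIDE_REWARDS), SLIDE_REWARDS.__getitem__
)

if __name__ == "__main__":
    print(best_description, all_scores[best_description])
```

The policy parameters stay fixed throughout; only the description changes, which is what makes the description act as the learned "concept".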
As multitask learning
go east of the heart: arg min_η L(f( | ; η, ))
find the triangle: arg min_η L(f( | ; η, ))
???
[Caruana 97]
Language learning Reinforcement learning
[Figure: a policy π and a reward R, shown once for language learning and once for reinforcement learning.]
As a language game…
go east of the heart
speaker model, listener loss
arg min
???
-0.52
π R
Results
True description: reach cell on left of triangle
Pred description: reach square left of triangle
Results: RL
[Plot: average reward (−1.0 to 3.0) against timestep (×1000, 20 to 100) for L3 (this work), Multitask, and Scratch.]
Results
examples: emboldens → emboldecs, kisses → kisses, loneliness → locelicess, vein → veic, dogtrot → dogtrot
input: loonies
true description: change any n to a c
true output: loocies
pred. description: replace all n s with c
pred. output: loocies
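The string-editing example can be checked mechanically. The Python sketch below encodes the rule that both descriptions express (every "n" becomes "c") and confirms that it reproduces the example pairs and the output for "loonies". The function name `apply_rule` is hypothetical, and hard-coding the rule is a simplification: in the talk the rule is carried by a natural-language description, not by code.

```python
# Input/output pairs and the held-out input, as listed on the slide.
EXAMPLES = [
    ("emboldens", "emboldecs"),
    ("kisses", "kisses"),
    ("loneliness", "locelicess"),
    ("vein", "veic"),
    ("dogtrot", "dogtrot"),
]
TEST_INPUT = "loonies"


def apply_rule(word):
    """Replace every 'n' with 'c', the edit both descriptions express."""
    return word.replace("n", "c")


# The rule is consistent with a pair when it maps the input to the output.
consistent = all(apply_rule(source) == target for source, target in EXAMPLES)
predicted_output = apply_rule(TEST_INPUT)

if __name__ == "__main__":
    print(consistent, predicted_output)
```

Words without an "n" ("kisses", "dogtrot") pass through unchanged, which is why the predicted description can differ in wording from the true one and still yield the same output.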
Results: programming by demonstration
Identity: 18
Multitask: 50
Meta: 62
This Work: 76
Results: locomotion
Modular multitask reinforcement learning with policy sketches. A, Klein & Levine. ICML 2017.
north, east, north
Generalization
[Bar chart: Training and Adaptation scores (axis 25 to 100) for This work and Multitask; bar labels 47, 89, 76, 42.]
Learning with corrections
Language learning Reinforcement learning
go north a bit more
f( · ; η, )
Pretraining by learning to correct
JD Co-Reyes
Guiding policies with language via meta-learning. ICLR 19.
[Figure: the policy π, f( · ; η, ), takes the action NORTH and receives the correction "further east".]
Learning from corrections
[Figure: the policy π, f( · ; η, ), acts (WEST,…), receives a correction, and acts again (NORTH,…) over repeated rounds.]
go to top further west
Touch cyan block. Move closer to magenta block. Move a lot up. Move a little up.
Enter the blue room. Enter the red room. Exit the blue room. Pick up the blue triangle
Lesson
Language is useful as side information, not just a goal specification. Use it with / instead of instructions as a representational bottleneck or interactive advice.
So what comes next?
What comes next?
Challenges for the field:
- huge datasets → Learn to make do without an annotation for every rollout!
- with fake annotations → Learn to generalize from fake strings to real ones!
- that look very little like natural language → Pay attention to human evals (or scope claims accordingly)!
Learn more: Luketina et al., A survey of reinforcement learning informed by natural language
https://arxiv.org/abs/1906.03926
[Figure: agent-environment loop. The agent sends an Action to the environment; the environment returns State, Reward.]
Task-dependent
- Language-assisted:
  Key: Opens a door of the same color as the key.
  Skull: They come in two varieties, rolling skulls and bouncing skulls ... you must jump over rolling skulls and walk under bouncing skulls.
- Language-conditional: Go down the ladder and walk right immediately to avoid falling off the conveyor belt, jump to the yellow rope and again to the platform on the right.
Task-independent
- [...] having the correct key can open the lock [...]
- [...] known lock and key device was discovered [...]
- [...] unless the correct key is inserted [...]
- Pre-training → Pre-trained: vkey, vskull, vladder, vrope