SLIDE 1 CS 4803 / 7643: Deep Learning
Erik Wijmans Georgia Tech
Topics:
– Application: PointGoal Navigation
– Trust Region Policy Optimization (TRPO)
– Proximal Policy Optimization (PPO)
SLIDE 2 Who Am I?
Erik Wijmans 3rd year PhD student at GT
Advisors: Dhruv Batra and Irfan Essa
Research Interests:
- Computer Vision
- Visual Navigation
- Embodied AI (virtual robots)
- Simulation to reality transfer
SLIDE 3 Lecture plan/motivation
- Combine CNNs, RNNs (LSTMs), and RL together — all things you have
learned about in this course — through a task called PointGoal Navigation
- Introduce more advanced RL — TRPO and PPO
- Show results using PPO on PointGoal Navigation
SLIDE 4 State-of-the-Art Visual Recognition
Slide credit: Abhishek Das
SLIDE 5 He et al., 2016a,b; He et al., 2017; Lin et al., 2017
State-of-the-Art Visual Recognition
Slide credit: Abhishek Das
SLIDE 6 Vinyals et al., 2015; Karpathy and Johnson, 2016; Lu et al., 2018
[Figure: Visual Dialog pipeline; Q-BOT asks "Are there any animals?", A-BOT answers "Yes, there are two elephants." over rounds of dialog, with question/answer encoders and decoders, history encoders, fact embeddings, a feature regression network, and a reward function]
State-of-the-Art Visual Recognition
Slide credit: Abhishek Das
SLIDE 7 Yang et al., 2016; Das et al., 2017a,b; Anderson et al., 2016
[Figure: Visual Dialog Q-BOT/A-BOT pipeline, same as the previous slide]
State-of-the-Art Visual Recognition
Slide credit: Abhishek Das
SLIDE 8 Image Credit: You et al., 2016
State-of-the-Art Visual Recognition
Slide credit: Abhishek Das
SLIDE 9 Applications
Slide credit: Abhishek Das
SLIDE 10 Applications
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
SLIDE 11 Applications
Physical agent
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
SLIDE 12 Applications
Physical agent capable of taking actions in the world
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
SLIDE 13 Image Credit: Lockheed Martin; DARPA Robotics Challenge
Applications
Physical agent capable of taking actions in the world
Slide credit: Abhishek Das
SLIDE 14 Applications
Is there smoke in any room around you? Yes, in one room. Go there and look for people …
Physical agent capable of taking actions in the world and talking to humans in natural language
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
SLIDE 15 Applications
Is there smoke in any room around you? Yes, in one room. Go there and look for people …
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
SLIDE 16 Is there smoke in any room around you? Yes, in one room. Go there and look for people …
Applications
Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das
SLIDE 17 Challenges
Image Credit: ImageNet. Image Credit: Lockheed Martin; DARPA Robotics Challenge. Slide credit: Abhishek Das
SLIDE 18 Challenges
Image Credit: ImageNet; Video Credit: Lee et al., 2012
Egocentric vision: no access to well-composed, curated images
Slide credit: Abhishek Das
SLIDE 19 Challenges
Video Credit: Lee et al., 2012
Egocentric vision
Slide credit: Abhishek Das
SLIDE 20 Challenges
Video Credit: Lee et al., 2012
Egocentric vision, Active perception
[Figure: action/observation loop between agent and environment]
Agent controls the incoming data distribution
Slide credit: Abhishek Das
SLIDE 21 Challenges
Image Credit: ImageNet
Egocentric vision, Active perception, Sparse rewards
Slide credit: Abhishek Das
SLIDE 22 Challenges
Image Credit: ImageNet
Egocentric vision, Active perception, Sparse rewards
Slide credit: Abhishek Das
SLIDE 23 Challenges
Image Credit: ImageNet
Egocentric vision, Active perception, Sparse rewards
Slide credit: Abhishek Das
SLIDE 24 Challenges
Egocentric vision, Active perception, Sparse rewards, Language understanding
Slide credit: Abhishek Das
SLIDE 25 Internet AI
Image Credit: ImageNet. Image Credit: Lockheed Martin; DARPA Robotics Challenge. Slide credit: Abhishek Das
SLIDE 26 Vladlen Koltun5 Abhishek Kadian1*
Oleksandr Maksymets1* Jia Liu1 Manolis Savva1,4* Erik Wijmans1,2,3 Bhavana Jain1 Yili Zhao1 Julian Straub2 Jitendra Malik1,6 Devi Parikh1,3 Dhruv Batra 1,3 1 2 3 4 5 6
* denotes equal contribution
SLIDE 27
Standardizing the Embodied AI “software stack”
SLIDE 28 Standardizing the Embodied AI “software stack”
Tasks: EmbodiedQA (Das et al., 2018), Interactive QA (Gordon et al., 2018), Vision-Language Navigation (Anderson et al., 2018), Language grounding (Hill et al., 2017), Visual Navigation (Zhu et al., 2017; Gupta et al., 2017)
SLIDE 29 EmbodiedQA (Das et al., 2018)
Standardizing the Embodied AI “software stack”
SLIDE 30 Vision-Language Navigation (Anderson et al., 2018)
Standardizing the Embodied AI “software stack”
SLIDE 31 Standardizing the Embodied AI “software stack”
Simulators: AI2-THOR (Kolve et al., 2017), MINOS (Savva et al., 2017), Gibson (Zamir et al., 2018), CHALET (Yan et al., 2018), House3D (Wu et al., 2017)
Tasks: EmbodiedQA (Das et al., 2018), Interactive QA (Gordon et al., 2018), Vision-Language Navigation (Anderson et al., 2018), Language grounding (Hill et al., 2017), Visual Navigation (Zhu et al., 2017; Gupta et al., 2017)
Datasets: Matterport3D (Chang et al., 2017), 2D-3D-S (Armeni et al., 2017), Replica (Straub et al., 2019)
Habitat Platform: Habitat-Sim (generic dataset support) + Habitat-API
SLIDE 32
Habitat-Sim
SLIDE 33
Habitat-Sim Demo
SLIDE 34
Habitat-API
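The Habitat-API surface at the time looked roughly like the following; treat the config path and the random-action loop as a sketch adapted from the habitat-api README rather than a verbatim quote:

import habitat

# Load a PointNav task configuration (path as in the habitat-api repository)
config = habitat.get_config("configs/tasks/pointnav.yaml")
env = habitat.Env(config=config)

observations = env.reset()  # dict of sensor readings, e.g. "rgb", "depth", "pointgoal"
while not env.episode_over:
    # A random agent, just to show the step loop
    observations = env.step(env.action_space.sample())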
SLIDE 35
PointGoal Navigation
SLIDE 36
PointGoal Navigation
SLIDE 37
PointGoal Navigation
SLIDE 38
PointGoal Navigation
SLIDE 39 PointGoal Navigation
Goal
SLIDE 40
PointGoal Navigation
SLIDE 41
Agent and Model Design
SLIDE 42
Agent and Model Design
SLIDE 43 Agent and Model Design
- 1.25m tall cylinder with 0.1m radius
SLIDE 44 Agent and Model Design
- 1.25m tall cylinder with 0.1m radius
- Actions:
  - <stop>: Indicates the agent believes it has completed the task
  - <forward>: Moves 0.25m forward
  - <left>, <right>: Turn 10 degrees
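For concreteness, the discrete action space can be written as a small enum (the names and integer ids here are illustrative, not the Habitat constants):

from enum import IntEnum

class NavAction(IntEnum):
    STOP = 0      # declare the episode complete
    FORWARD = 1   # move 0.25m forward
    LEFT = 2      # turn 10 degrees left
    RIGHT = 3     # turn 10 degrees right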
SLIDE 45
Agent and Model Design
SLIDE 46
Agent and Model Design
SLIDE 47
Agent and Model Design
SLIDE 48-52 Agent and Model Design
[Figure build: at each timestep t, a CNN encodes the egocentric observation; a recurrent Policy with hidden state h_t outputs an action a_t (<forward>, <left>, <right>, <stop>) and a value estimate V_t, unrolled from t = 0 until the agent emits <stop> at step T]
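A minimal PyTorch sketch of this architecture; the layer sizes, the three-layer CNN, and the GRU cell are illustrative choices, not the exact model from the paper:

import torch
import torch.nn as nn

class NavPolicy(nn.Module):
    """CNN encoder + recurrent actor-critic policy (sizes are illustrative)."""

    def __init__(self, num_actions=4, hidden_size=512):
        super().__init__()
        # CNN encodes the egocentric frame into a flat feature vector
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_size), nn.ReLU(),
        )
        self.rnn = nn.GRUCell(hidden_size, hidden_size)   # carries h_t across steps
        self.actor = nn.Linear(hidden_size, num_actions)  # logits over <stop>, <forward>, <left>, <right>
        self.critic = nn.Linear(hidden_size, 1)           # value estimate V_t

    def forward(self, obs, h):
        # obs: (B, 3, H, W) egocentric frame; h: (B, hidden_size) recurrent state
        x = self.cnn(obs)
        h = self.rnn(x, h)
        dist = torch.distributions.Categorical(logits=self.actor(h))
        return dist, self.critic(h).squeeze(-1), h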
SLIDE 53 Agent and Model Design
- How do we train this agent?
SLIDE 54 Agent and Model Design
- How do we train this agent?
- Both actions (they are discrete) and the simulation are non-differentiable
SLIDE 55 Agent and Model Design
- How do we train this agent?
- Both actions (they are discrete) and the simulation are non-differentiable
- Use reinforcement learning!
SLIDE 56 Outline
- RL Refresher/Advantage Actor Critic (A2C)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Application: PointGoal Navigation Results
SLIDE 57 Outline
- RL Refresher/Advantage Actor Critic (A2C)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Application: PointGoal Navigation Results
SLIDE 58
RL Refresher
SLIDE 59
RL Refresher
SLIDE 60
RL Refresher
SLIDE 61
RL Refresher
SLIDE 62
RL Refresher
SLIDE 63
RL Refresher
SLIDE 64 Objective:
$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \gamma^{t} r_{t}\right], \qquad \theta^{*} = \arg\max_{\theta} J(\theta)$
RL Refresher
SLIDE 65
Reinforce
$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, R(\tau)\right]$
SLIDE 66 Advantage Actor Critic (A2C)
$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, R(\tau)\right]$
SLIDE 67 Advantage Actor Critic (A2C)
- High variance: $\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, R(\tau)$
- Reduce variance with baseline: $\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \big(R(\tau) - b(s_{t})\big)$
SLIDE 68-72 Advantage Actor Critic (A2C)
- High variance: $\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, R(\tau)$
- Reduce variance with baseline: $\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \big(R(\tau) - b(s_{t})\big)$
- Use value-function as the baseline (A2C): $\nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t})\, \big(R(\tau) - V^{\pi}(s_{t})\big)$, i.e. weight the score function by the advantage $\hat{A}_{t} = R(\tau) - V^{\pi}(s_{t})$
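A minimal sketch of the resulting A2C loss for one rollout, assuming Monte-Carlo returns have already been computed (the entropy bonus commonly added in practice is omitted):

import torch.nn.functional as F

def a2c_loss(log_probs, values, returns, value_coef=0.5):
    # log_probs: log pi_theta(a_t | s_t) for the actions taken, shape (T,)
    # values:    critic predictions V(s_t), shape (T,)
    # returns:   discounted returns R_t from the rollout, shape (T,)
    advantages = returns - values.detach()  # no policy gradient through the baseline
    policy_loss = -(log_probs * advantages).mean()
    value_loss = F.mse_loss(values, returns)  # regress the critic toward the returns
    return policy_loss + value_coef * value_loss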
SLIDE 73
- A2C is great, but you can only use each rollout once!
Advantage Actor Critic (A2C)
SLIDE 74
- A2C is great, but you can only use each rollout once!
Advantage Actor Critic (A2C)
Why?
SLIDE 75
- A2C is great, but you can only use each rollout once!
- No theoretical grounding to do so
Advantage Actor Critic (A2C)
SLIDE 76 Advantage Actor Critic (A2C)
SLIDE 77
Advantage Actor Critic (A2C)
Image credit: Alberto Metelli, 2018
SLIDE 78 Outline
- RL Refresher/Advantage Actor Critic (A2C)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Application: PointGoal Navigation Results
SLIDE 79 Trust Region Policy Optimization (TRPO)
A2C Maximizes: $\mathbb{E}_{t}\!\left[\log \pi_{\theta}(a_{t} \mid s_{t})\, \hat{A}_{t}\right]$
SLIDE 80 Trust Region Policy Optimization (TRPO)
Given a policy:
$\pi_{\theta_{\text{old}}}$
SLIDE 81 Trust Region Policy Optimization (TRPO)
Given a policy: $\pi_{\theta_{\text{old}}}$
Collect experience and calculate advantage
SLIDE 82 Trust Region Policy Optimization (TRPO)
Given a policy: $\pi_{\theta_{\text{old}}}$
Collect experience and calculate advantage
Maximize: $L(\theta) = \mathbb{E}_{t}\!\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\, \hat{A}_{t}\right]$
SLIDE 83 Trust Region Policy Optimization (TRPO)
Maximize: $L(\theta) = \mathbb{E}_{t}\!\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\, \hat{A}_{t}\right]$
Read as: Policy $\pi_{\theta}$ is better than $\pi_{\theta_{\text{old}}}$ if it takes good actions ($\hat{A}_{t} > 0$) more often and takes bad actions ($\hat{A}_{t} < 0$) less often
SLIDE 84 Trust Region Policy Optimization (TRPO)
Maximize: $L(\theta) = \mathbb{E}_{t}\!\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\, \hat{A}_{t}\right]$
Read as: Policy $\pi_{\theta}$ is better than $\pi_{\theta_{\text{old}}}$ if it takes good actions ($\hat{A}_{t} > 0$) more often and takes bad actions ($\hat{A}_{t} < 0$) less often
Why this objective?
SLIDE 85 Trust Region Policy Optimization (TRPO)
Given a policy: $\pi_{\theta_{\text{old}}}$
Collect experience and calculate advantage
Maximize: $L(\theta) = \mathbb{E}_{t}\!\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\, \hat{A}_{t}\right]$
SLIDE 86 Image credit: Alberto Metelli, 2018
Trust Region Policy Optimization (TRPO)
SLIDE 87
Trust Region Policy Optimization (TRPO)
SLIDE 88 Trust Region Policy Optimization (TRPO)
SLIDE 89
- PS 1 problem 1
- In this problem, you showed that the gradient descent update rule can be seen as the minimizer of the affine lower-bound of the loss $L$, subject to a trust region:
  $w_{t+1} = \arg\min_{w} \; L(w_{t}) + \nabla L(w_{t})^{\top}(w - w_{t}) \quad \text{s.t.} \quad \|w - w_{t}\|_{2} \le \epsilon$
Trust Region Policy Optimization (TRPO)
SLIDE 90
Trust Region Policy Optimization (TRPO)
$\max_{\theta} \; \mathbb{E}_{t}\!\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\, \hat{A}_{t}\right] - \beta\, \mathbb{E}_{t}\!\left[\mathrm{KL}\!\left(\pi_{\theta_{\text{old}}}(\cdot \mid s_{t}) \,\|\, \pi_{\theta}(\cdot \mid s_{t})\right)\right]$
SLIDE 91 Trust Region Policy Optimization (TRPO)
- Advantage
  - Able to perform multiple optimization steps per rollout
SLIDE 92 Trust Region Policy Optimization (TRPO)
- Advantage
  - Able to perform multiple optimization steps per rollout
- Disadvantage
  - Choosing the correct value for beta is challenging and problem/network dependent
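A sketch of the KL-penalized surrogate above (full TRPO instead enforces a hard KL constraint via conjugate gradient and a line search; this penalty form is where the beta bullet comes from). The helper assumes log-probs gathered under the old and new policies for the same actions:

import torch

def kl_penalized_loss(new_log_probs, old_log_probs, advantages, beta=0.01):
    # Importance ratio pi_theta / pi_theta_old for the actions taken
    ratio = torch.exp(new_log_probs - old_log_probs)
    surrogate = (ratio * advantages).mean()
    # Sample estimate of KL(pi_old || pi_theta) under actions drawn from pi_old
    kl = (old_log_probs - new_log_probs).mean()
    return -(surrogate - beta * kl)  # negate: optimizers minimize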
SLIDE 93 Outline
- RL Refresher/Advantage Actor Critic (A2C)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Application: PointGoal Navigation Results
SLIDE 94
Proximal Policy Optimization (PPO)
SLIDE 95
Proximal Policy Optimization (PPO)
SLIDE 96 Given a policy:
$\pi_{\theta_{\text{old}}}$
Proximal Policy Optimization (PPO)
SLIDE 97 Given a policy:
$\pi_{\theta_{\text{old}}}$
Proximal Policy Optimization (PPO)
SLIDE 98 Proximal Policy Optimization (PPO)
Given a policy: $\pi_{\theta_{\text{old}}}$
Objective: Maximize $\mathbb{E}_{t}\!\left[\frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}\, \hat{A}_{t}\right]$
SLIDE 99
Proximal Policy Optimization (PPO)
SLIDE 100
Proximal Policy Optimization (PPO)
$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{t}\!\left[\min\!\left(r_{t}(\theta)\,\hat{A}_{t},\ \mathrm{clip}\!\left(r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{t}\right)\right], \quad r_{t}(\theta) = \frac{\pi_{\theta}(a_{t} \mid s_{t})}{\pi_{\theta_{\text{old}}}(a_{t} \mid s_{t})}$
SLIDE 101
Proximal Policy Optimization (PPO)
$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{t}\!\left[\min\!\left(r_{t}(\theta)\,\hat{A}_{t},\ \mathrm{clip}\!\left(r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_{t}\right)\right]$
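A sketch of the clipped surrogate in PyTorch (the next slide notes that epsilon = 0.2 "just works" in a lot of cases):

import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(new_log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # min(...) removes the incentive to push the ratio outside [1 - eps, 1 + eps]
    return -torch.min(unclipped, clipped).mean()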
SLIDE 102
Proximal Policy Optimization (PPO)
- Advantage
  - Able to perform multiple optimization steps per rollout
  - epsilon=0.2 “just works” in a lot of cases
  - Easily handles networks with hundreds of millions of parameters
SLIDE 103
Proximal Policy Optimization (PPO)
- Advantage
  - Able to perform multiple optimization steps per rollout
  - epsilon=0.2 “just works” in a lot of cases
  - Easily handles networks with hundreds of millions of parameters
- Disadvantage
  - Other methods are more sample efficient
SLIDE 104 A2C Implementation
1. Collect a set of trajectories using the current policy
2. Update the policy via one step of the A2C objective (see the sketch below)
3. Repeat
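As a sketch, the A2C update is a single gradient step per rollout, reusing a2c_loss from the earlier snippet; afterwards the samples are stale:

def a2c_update(optimizer, log_probs, values, returns):
    # Exactly one gradient step per rollout (on-policy)
    loss = a2c_loss(log_probs, values, returns)  # from the sketch earlier
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # pi_theta has now changed, so this rollout no longer comes from the
    # current policy and cannot be reused ("use each rollout once")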
SLIDE 105 PPO/TRPO Implementation
1. Collect a set of trajectories using the current policy
2. For a few epochs (typically 2 or 4):
   1. Sample mini-batches from the rollout (typically 2 or 4)
   2. Update the policy via a step of the PPO/TRPO objective on each mini-batch
3. Repeat (see the sketch below)
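Putting the steps together, a schematic PPO update (rollout collection and return/advantage computation elided; `policy` follows the NavPolicy sketch from earlier, `ppo_clip_loss` is defined above, and the minibatch layout is an assumption):

def ppo_update(policy, optimizer, minibatches, num_epochs=4):
    # minibatches: sampled from one rollout; each holds observations, recurrent
    # state, actions, old log-probs, returns, and advantages (layout is assumed)
    for _ in range(num_epochs):
        for b in minibatches:
            dist, values, _ = policy(b["obs"], b["hidden"])
            new_log_probs = dist.log_prob(b["actions"])
            actor_loss = ppo_clip_loss(new_log_probs, b["old_log_probs"], b["advantages"])
            critic_loss = 0.5 * (values - b["returns"]).pow(2).mean()
            optimizer.zero_grad()
            (actor_loss + critic_loss).backward()
            optimizer.step()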
SLIDE 106 Outline
- RL Refresher/Advantage Actor Critic (A2C)
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Application: PointGoal Navigation Results
SLIDE 107
PointGoal Navigation Results
SLIDE 109
Qualitative Results
SLIDE 110
SLIDE 111
SLIDE 112
SLIDE 113
SLIDE 114
SLIDE 115
SLIDE 116
SLIDE 117
Backtracking
SLIDE 118
SLIDE 119
SLIDE 120
SLIDE 121
SLIDE 122
SLIDE 123
SLIDE 124
SLIDE 125
Visual Turing Test
SLIDE 126
Option 1 Option 2
SLIDE 127
Option 1
SLIDE 128
Option 1 Option 2
SLIDE 129
Option 2
SLIDE 130
Option 1 Option 2
SLIDE 131
Learned Agent vs. Shortest Path Oracle
SLIDE 132