CS 4803 / 7643: Deep Learning - Topics: Application: PointGoal Navigation, Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) - PowerPoint PPT Presentation by Erik Wijmans, Georgia Tech


slide-1
SLIDE 1

CS 4803 / 7643: Deep Learning

Erik Wijmans Georgia Tech

Topics:

– Application: PointGoal Navigation
– Trust Region Policy Optimization (TRPO)
– Proximal Policy Optimization (PPO)

slide-2
SLIDE 2

Who Am I?

Erik Wijmans, 3rd year PhD student at Georgia Tech
Advisors: Dhruv Batra and Irfan Essa

Research Interests

  • Computer Vision
  • Visual Navigation
  • Embodied AI (virtual robots)
  • Simulation to reality transfer
slide-3
SLIDE 3

Lecture plan/motivation

  • Combine CNNs, RNNs (LSTMs), and RL together — all things you have learned about in this course — through a task called PointGoal Navigation
  • Introduce more advanced RL — TRPO and PPO
  • Show results using PPO on PointGoal Navigation
slide-4
SLIDE 4

State-of-the-Art Visual Recognition

Slide credit: Abhishek Das

slide-5
SLIDE 5

He et al., 2016a,b; He et al., 2017; Lin et al., 2017

State-of-the-Art Visual Recognition

Slide credit: Abhishek Das

slide-6
SLIDE 6

Vinyals et al., 2015; Karpathy and Johnson, 2016; Lu et al., 2018

[Figure: Visual Dialog: Q-BOT asks "Are there any animals?", A-BOT answers "Yes, there are two elephants." Each bot has a question/answer encoder and decoder, a history encoder, and a fact embedding; Q-BOT additionally has a feature regression network. Rounds of dialog are trained with a reward function.]

State-of-the-Art Visual Recognition

Slide credit: Abhishek Das

slide-7
SLIDE 7

Yang et al., 2016; Das et al., 2017a,b; Anderson et al., 2016


State-of-the-Art Visual Recognition

Slide credit: Abhishek Das

slide-8
SLIDE 8

Image Credit: You et al., 2016

State-of-the-Art Visual Recognition

Slide credit: Abhishek Das

slide-9
SLIDE 9

Applications

Slide credit: Abhishek Das

slide-10
SLIDE 10

Applications

Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-11
SLIDE 11

Applications

Physical agent

Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-12
SLIDE 12

Applications

Physical agent capable of taking actions in the world

Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-13
SLIDE 13

Image Credit: Lockheed Martin; DARPA Robotics Challenge

Applications

Physical agent capable of taking actions in the world

Slide credit: Abhishek Das

slide-14
SLIDE 14

Applications

Is there smoke in any room around you? Yes, in one room Go there and look for people …

Physical agent capable of taking actions in the world and talking to humans in natural language

Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-15
SLIDE 15

Applications

Is there smoke in any room around you? Yes, in one room Go there and look for people …

Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-16
SLIDE 16

Is there smoke in any room around you? Yes, in one room Go there and look for people …

Applications

Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-17
SLIDE 17

Challenges

Image Credit: Image-Net Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-18
SLIDE 18

Challenges

Image Credit: Image-Net, Video Credit: Lee et al., 2012

Egocentric vision: no access to well-composed, curated images

Slide credit: Abhishek Das

slide-19
SLIDE 19

Challenges

Video Credit: Lee et al., 2012

Egocentric vision

Slide credit: Abhishek Das

slide-20
SLIDE 20

Challenges

Video Credit: Lee et al., 2012

Egocentric vision
Active perception

Agent controls incoming data distribution

Slide credit: Abhishek Das

slide-21
SLIDE 21

Challenges

Image Credit: Image-Net

Egocentric vision
Active perception
Sparse rewards

Slide credit: Abhishek Das


slide-23
SLIDE 23

Challenges

Image Credit: Image-Net

Egocentric vision
Active perception
Sparse rewards

Slide credit: Abhishek Das

slide-24
SLIDE 24

Challenges

Egocentric vision
Active perception
Sparse rewards
Language understanding

Slide credit: Abhishek Das

slide-25
SLIDE 25

  • Internet AI
  • Embodied AI

Image Credit: Image-Net Image Credit: Lockheed Martin; DARPA Robotics Challenge Slide credit: Abhishek Das

slide-26
SLIDE 26

Abhishek Kadian*, Oleksandr Maksymets*, Manolis Savva*, Erik Wijmans, Vladlen Koltun, Jia Liu, Bhavana Jain, Yili Zhao, Julian Straub, Jitendra Malik, Devi Parikh, Dhruv Batra

* denotes equal contribution

slide-27
SLIDE 27

Standardizing the Embodied AI “software stack”

slide-28
SLIDE 28

EmbodiedQA (Das et al., 2018)

Tasks

Interactive QA (Gordon et al., 2018)

Vision-Language Navigation (Anderson et al., 2018)

Language grounding (Hill et al., 2017) Visual Navigation (Zhu et al., 2017, Gupta et al., 2017)

Standardizing the Embodied AI “software stack”

slide-29
SLIDE 29

EmbodiedQA (Das et al., 2018)

Standardizing the Embodied AI “software stack”

slide-30
SLIDE 30

Vision-Language Navigation (Anderson et al., 2018)

Standardizing the Embodied AI “software stack”

slide-31
SLIDE 31

Simulators

AI2-THOR (Kolve et al., 2017), MINOS (Savva et al., 2017), Gibson (Zamir et al., 2018), CHALET (Yan et al., 2018), House3D (Wu et al., 2017), EmbodiedQA (Das et al., 2018)

Tasks

Interactive QA (Gordon et al., 2018)

Vision-Language Navigation (Anderson et al., 2018)

Language grounding (Hill et al., 2017) Visual Navigation (Zhu et al., 2017, Gupta et al., 2017)

Datasets

Matterport3D (Chang et al., 2017) 2D-3D-S (Armeni et al., 2017) Replica (Straub et al., 2019)

Standardizing the Embodied AI “software stack”

Habitat Platform: Habitat-Sim + Habitat-API, with generic dataset support

slide-32
SLIDE 32

Habitat-Sim

slide-33
SLIDE 33

Habitat-Sim Demo

slide-34
SLIDE 34

Habitat-API

slide-35
SLIDE 35

PointGoal Navigation


PointGoal Navigation

Goal


slide-41
SLIDE 41

Agent and Model Design


  • 1.25m tall cylinder with 0.1m radius
slide-44
SLIDE 44

Agent and Model Design

  • 1.25m tall cylinder with 0.1m radius
  • Actions:
    • <stop>: Indicates the agent believes it has completed the task
    • <forward>: Moves 0.25m forward
    • <left>, <right>: Turn 10 degrees
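The embodiment and action space above can be written down as a small 2D kinematic sketch; the constants come from the slide, while the function and names are hypothetical illustration, not the actual simulator API:

```python
import math

# Agent spec from the slide: a 1.25 m tall cylinder of radius 0.1 m.
FORWARD_STEP_M = 0.25   # <forward> moves 0.25 m
TURN_ANGLE_DEG = 10.0   # <left>/<right> turn 10 degrees

ACTIONS = ("stop", "forward", "left", "right")

def apply_action(x, y, heading_deg, action):
    """Advance a simple 2D pose; <stop> leaves it unchanged (episode ends)."""
    if action == "forward":
        rad = math.radians(heading_deg)
        return (x + FORWARD_STEP_M * math.cos(rad),
                y + FORWARD_STEP_M * math.sin(rad),
                heading_deg)
    if action == "left":
        return x, y, (heading_deg + TURN_ANGLE_DEG) % 360.0
    if action == "right":
        return x, y, (heading_deg - TURN_ANGLE_DEG) % 360.0
    return x, y, heading_deg  # "stop"
```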
slide-45
SLIDE 45


CNN

Agent and Model Design

slide-49
SLIDE 49

[Figure: CNN + recurrent policy unrolled over time. At each step t, a CNN encodes the current observation; the policy combines the CNN features with hidden state h_{t-1} and the previous action a_{t-1} to produce the next action a_t (e.g. <left>, <forward>, <stop>) and a value estimate V_t.]

Agent and Model Design
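The CNN + recurrent policy above can be sketched with plain NumPy; the dimensions, weight initialization, and greedy action choice are illustrative only, not the lecture's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the real model uses a deep CNN and a larger hidden state.
FEAT, HID, N_ACTIONS = 8, 16, 4   # 4 actions: <stop>, <forward>, <left>, <right>

W_ih = 0.1 * rng.normal(size=(HID, FEAT))       # input-to-hidden weights
W_hh = 0.1 * rng.normal(size=(HID, HID))        # hidden-to-hidden (recurrent) weights
W_pi = 0.1 * rng.normal(size=(N_ACTIONS, HID))  # policy head
w_v = 0.1 * rng.normal(size=HID)                # value head

def step(cnn_feat, h_prev):
    """One unroll step: CNN features + previous hidden state -> (logits, V_t, h_t)."""
    h = np.tanh(W_ih @ cnn_feat + W_hh @ h_prev)
    logits = W_pi @ h          # scores over the discrete actions
    value = float(w_v @ h)     # critic's state-value estimate V_t
    return logits, value, h

h = np.zeros(HID)
for t in range(3):                        # unroll a few time steps
    cnn_feat = rng.normal(size=FEAT)      # stand-in for CNN(observation_t)
    logits, value, h = step(cnn_feat, h)
    action = int(np.argmax(logits))       # greedy choice, just for the sketch
```

In training one would sample from a softmax over the logits rather than taking the argmax, and the hidden state carries information across the whole episode.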

slide-53
SLIDE 53

Agent and Model Design

  • How do we train this agent?
slide-54
SLIDE 54

Agent and Model Design

  • How do we train this agent?
  • Both actions (they are discrete) and the simulation are non-differentiable

slide-55
SLIDE 55

Agent and Model Design

  • How do we train this agent?
  • Both actions (they are discrete) and the simulation are non-differentiable

  • Use reinforcement learning!
slide-56
SLIDE 56

Outline

  • RL Refresher/Advantage Actor Critic (A2C)
  • Trust Region Policy Optimization (TRPO)
  • Proximal Policy Optimization (PPO)
  • Application: PointGoal Navigation Results
slide-57
SLIDE 57

Outline

  • RL Refresher/Advantage Actor Critic (A2C)
  • Trust Region Policy Optimization (TRPO)
  • Proximal Policy Optimization (PPO)
  • Application: PointGoal Navigation Results
slide-58
SLIDE 58

RL Refresher


slide-64
SLIDE 64

Objective:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \gamma^t r_t\right]$

RL Refresher

slide-65
SLIDE 65

REINFORCE

$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\right]$, where $R_t = \sum_{k \ge t} \gamma^{k-t} r_k$ is the return.
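To make the score-function (REINFORCE) estimator concrete, here is a tiny two-action bandit where the sampled estimate of the policy gradient can be checked against the analytic gradient; the setup is illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-action bandit with fixed rewards; the policy is a softmax over one parameter.
rewards = np.array([1.0, 0.0])

def probs(theta):
    z = np.array([theta, 0.0])
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_grad(theta, n_samples=200_000):
    """Score-function estimate of dJ/dtheta for J = E_{a ~ pi_theta}[r(a)]."""
    p = probs(theta)
    actions = rng.choice(2, size=n_samples, p=p)
    # For this softmax, d log pi(a)/d theta = 1{a = 0} - p[0].
    score = (actions == 0).astype(float) - p[0]
    return float(np.mean(score * rewards[actions]))

theta = 0.5
p = probs(theta)
exact = p[0] * (1 - p[0]) * (rewards[0] - rewards[1])  # analytic dJ/dtheta
est = reinforce_grad(theta)                            # Monte Carlo estimate
```

The estimator is unbiased, but its spread across samples is exactly the high variance the next slides address with a baseline.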
slide-66
SLIDE 66

Advantage Actor Critic (A2C)

  • High variance: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R_t\right]$
  • Reduce variance with baseline: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(R_t - b(s_t)\big)\right]$
  • Use value-function as the baseline (A2C): $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A_t\right]$, where $A_t = R_t - V(s_t)$
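The progression above can be sketched numerically: compute discounted returns, subtract the value baseline to get advantages A_t = R_t - V(s_t), and weight the log-probabilities. The rollout numbers are toy values, not from the lecture:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_{k >= t} gamma^(k - t) r_k, computed backwards over the rollout."""
    R, out = 0.0, []
    for r in reversed(rewards):
        R = r + gamma * R
        out.append(R)
    return np.array(out[::-1])

def a2c_policy_loss(log_probs, rewards, values, gamma=0.99):
    """Policy-gradient loss with the value function as baseline (A2C)."""
    returns = discounted_returns(rewards, gamma)
    advantages = returns - values            # A_t = R_t - V(s_t)
    return float(-np.mean(log_probs * advantages))  # minimize the negated objective

# Toy 3-step rollout: sparse reward at the end, critic's value estimates, log pi(a_t|s_t).
rewards = [0.0, 0.0, 1.0]
values = np.array([0.2, 0.4, 0.8])
log_probs = np.array([-0.5, -0.7, -0.2])
loss = a2c_policy_loss(log_probs, rewards, values)
```

In practice the critic is trained alongside the actor with a value-regression loss, and the advantages are treated as constants (no gradient flows through them into the policy term).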
slide-73
SLIDE 73

Advantage Actor Critic (A2C)

  • A2C is great, but you can only use each rollout once! Why?
  • No theoretical grounding to do so
  • Works poorly in practice

Image credit: Alberto Metelli, 2018

slide-78
SLIDE 78

Outline

  • RL Refresher/Advantage Actor Critic (A2C)
  • Trust Region Policy Optimization (TRPO)
  • Proximal Policy Optimization (PPO)
  • Application: PointGoal Navigation Results
slide-79
SLIDE 79

Trust Region Policy Optimization (TRPO)

A2C maximizes: $\mathbb{E}\left[\sum_t \log \pi_\theta(a_t \mid s_t)\, A_t\right]$

slide-80
SLIDE 80

Trust Region Policy Optimization (TRPO)

Given a policy: $\pi_\theta$
slide-81
SLIDE 81

Trust Region Policy Optimization (TRPO)

Given a policy: $\pi_{\theta_{\text{old}}}$

Collect experience and calculate advantage $\hat{A}^{\pi_{\theta_{\text{old}}}}(s, a)$

slide-82
SLIDE 82

Trust Region Policy Optimization (TRPO)

Given a policy: $\pi_{\theta_{\text{old}}}$

Collect experience and calculate advantage $\hat{A}^{\pi_{\theta_{\text{old}}}}(s, a)$

Maximize: $\mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}^{\pi_{\theta_{\text{old}}}}(s, a)\right]$
slide-83
SLIDE 83

Trust Region Policy Optimization (TRPO)

Maximize: $\mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}^{\pi_{\theta_{\text{old}}}}(s, a)\right]$

Read as: the new policy $\pi_\theta$ is better than $\pi_{\theta_{\text{old}}}$ if it takes good actions ($\hat{A} > 0$) more often and takes bad actions ($\hat{A} < 0$) less often

slide-84
SLIDE 84

Trust Region Policy Optimization (TRPO)


Why this objective?

slide-85
SLIDE 85

Trust Region Policy Optimization (TRPO)

slide-86
SLIDE 86

Image credit: Alberto Metelli, 2018

Trust Region Policy Optimization (TRPO)

slide-87
SLIDE 87
  • Use a trust-region!

Trust Region Policy Optimization (TRPO)

slide-88
SLIDE 88

Trust Region Policy Optimization (TRPO)

  • PS 1 problem 1
slide-89
SLIDE 89
  • PS 1 problem 1
  • In this problem, you showed that the gradient descent update rule $\theta_{k+1} = \theta_k - \alpha \nabla_\theta L(\theta_k)$ can be seen as the minimizer of the affine lower bound of $L(\theta)$ subject to a trust-region: $\|\theta - \theta_k\|_2^2 \le \epsilon$

Trust Region Policy Optimization (TRPO)

slide-90
SLIDE 90

Trust Region Policy Optimization (TRPO)

$\max_\theta\; \mathbb{E}\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, \hat{A}^{\pi_{\theta_{\text{old}}}}(s, a) - \beta\, D_{\mathrm{KL}}\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right]$
slide-91
SLIDE 91

Trust Region Policy Optimization (TRPO)

  • Advantage
  • Able to perform multiple optimization steps per rollout
slide-92
SLIDE 92

Trust Region Policy Optimization (TRPO)

  • Advantage
  • Able to perform multiple optimization steps per rollout
  • Disadvantage
  • Choosing the correct value for beta is challenging and problem/network dependent
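The beta trade-off can be made concrete with a toy single-sample example. The sketch below is an illustrative version of a KL-penalized surrogate for categorical policies (the helper names are ours, not from the lecture); it only shows that a larger beta penalizes the same policy change more heavily.

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) between two categorical distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_surrogate(pi_new, pi_old, action, advantage, beta):
    """Single-sample KL-penalized surrogate:
    (pi_new / pi_old) * advantage - beta * KL(pi_old || pi_new)."""
    ratio = pi_new[action] / pi_old[action]
    return ratio * advantage - beta * kl(pi_old, pi_new)

pi_old = [0.5, 0.5]
pi_new = [0.6, 0.4]  # the update increased the probability of action 0

# Same policy change, two penalty strengths: larger beta lowers the surrogate.
lo_beta = penalized_surrogate(pi_new, pi_old, 0, advantage=1.0, beta=0.1)
hi_beta = penalized_surrogate(pi_new, pi_old, 0, advantage=1.0, beta=10.0)
assert lo_beta > hi_beta
```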

slide-93
SLIDE 93

Outline

  • RL Refresher/Advantage Actor Critic (A2C)
  • Trust Region Policy Optimization (TRPO)
  • Proximal Policy Optimization (PPO)
  • Application: PointGoal Navigation Results
slide-94
SLIDE 94

Proximal Policy Optimization (PPO)

slide-95
SLIDE 95

Proximal Policy Optimization (PPO)

slide-96
SLIDE 96

Given a policy:

$\pi_\theta(a \mid s)$

Proximal Policy Optimization (PPO)

slide-97
SLIDE 97

Given a policy:

$L(\theta) = \hat{\mathbb{E}}_t\!\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right]$

Proximal Policy Optimization (PPO)

slide-98
SLIDE 98

Given a policy:

Proximal Policy Optimization (PPO)

$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

Objective: Maximize

$L(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\, \hat{A}_t \right]$
slide-99
SLIDE 99

Proximal Policy Optimization (PPO)

slide-100
SLIDE 100

Proximal Policy Optimization (PPO)

$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right]$
slide-101
SLIDE 101

Proximal Policy Optimization (PPO)

$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right]$
slide-102
SLIDE 102
  • Advantage
  • Able to perform multiple optimization steps per rollout
  • epsilon=0.2 “just works” in a lot of cases
  • Easily handles networks with hundreds of millions of parameters

Proximal Policy Optimization (PPO)
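The clipping rule is easy to sketch per sample. The function below is an illustrative implementation of the standard PPO clipped term, the minimum of the unclipped and clipped ratio-weighted advantage; the name and structure are ours, not code from the lecture.

```python
def ppo_clip_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped * advantage)

# Positive advantage: the objective stops growing once ratio > 1 + eps,
# so there is no incentive to move the policy further than the clip range.
assert ppo_clip_term(1.5, advantage=1.0) == 1.2

# Negative advantage: pushing the ratio below 1 - eps is not rewarded further.
assert ppo_clip_term(0.5, advantage=-1.0) == -0.8
```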

slide-103
SLIDE 103
  • Advantage
  • Able to perform multiple optimization steps per rollout
  • epsilon=0.2 “just works” in a lot of cases
  • Easily handles networks with hundreds of millions of parameters
  • Disadvantage
  • Other methods are more sample efficient

Proximal Policy Optimization (PPO)

slide-104
SLIDE 104

A2C Implementation

  • 1. Collect a set of trajectories using current policy
  • 2. Update policy via step of A2C objective
  • 3. Repeat
slide-105
SLIDE 105

PPO/TRPO Implementation

  • 1. Collect a set of trajectories using current policy
  • 2. For a few epochs (typically 2 or 4)
  • 1. Sample mini-batches from the rollout (typically 2 or 4)
  • 1. Update the policy via step of PPO/TRPO objective
  • 3. Repeat
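The loop above can be sketched as follows; `update_fn` is a hypothetical stand-in for one gradient step on the PPO/TRPO objective. The contrast with A2C is in the two nested loops: A2C takes a single step per rollout, while PPO/TRPO reuses the same rollout for several epochs of mini-batch updates.

```python
import random

def ppo_style_update(rollout, update_fn, num_epochs=4, num_minibatches=4):
    """Skeleton of the PPO/TRPO loop from the slide: reuse one rollout
    for several epochs of shuffled mini-batch updates."""
    steps = 0
    for _ in range(num_epochs):
        indices = list(range(len(rollout)))
        random.shuffle(indices)
        batch_size = max(1, len(rollout) // num_minibatches)
        for start in range(0, len(indices), batch_size):
            batch = [rollout[i] for i in indices[start:start + batch_size]]
            update_fn(batch)  # one gradient step on the PPO/TRPO objective
            steps += 1
    return steps

# A rollout of 8 transitions yields 4 epochs x 4 mini-batches = 16 updates;
# an A2C-style update would take just 1 step on the same rollout.
n = ppo_style_update(list(range(8)), update_fn=lambda batch: None)
assert n == 16
```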
slide-106
SLIDE 106

Outline

  • RL Refresher/Advantage Actor Critic (A2C)
  • Trust Region Policy Optimization (TRPO)
  • Proximal Policy Optimization (PPO)
  • Application: PointGoal Navigation Results
slide-107
SLIDE 107

PointGoal Navigation Results

slide-108
SLIDE 108


slide-109
SLIDE 109

Qualitative Results

slide-110
SLIDE 110
slide-111
SLIDE 111
slide-112
SLIDE 112
slide-113
SLIDE 113
slide-114
SLIDE 114
slide-115
SLIDE 115
slide-116
SLIDE 116
slide-117
SLIDE 117

Backtracking

slide-118
SLIDE 118
slide-119
SLIDE 119
slide-120
SLIDE 120
slide-121
SLIDE 121
slide-122
SLIDE 122
slide-123
SLIDE 123
slide-124
SLIDE 124
slide-125
SLIDE 125

Visual Turing Test

slide-126
SLIDE 126

Option 1 Option 2

slide-127
SLIDE 127

Option 1

slide-128
SLIDE 128

Option 1 Option 2

slide-129
SLIDE 129

Option 2

slide-130
SLIDE 130

Option 1 Option 2

slide-131
SLIDE 131

Learned Agent Shortest Path Oracle

slide-132
SLIDE 132