CS391R: Robot Learning (Fall 2020)
Overview of Robot Decision Making
- Prof. Yuke Zhu
Today's Agenda
What is Robot Decision Making?
Mathematical Framework of Sequential Decision Making
Learning for Decision Making
[Figure: perceive-act loops] [Levine et al. JMLR 2016] [Bohg et al. ICRA 2018] [Sa et al. IROS 2014]
Assistive Robots (Companions) Outer Space (Explorers) Autonomous Driving (Transporters)
[Source: Boston Dynamics]
[Source: Boston Dynamics] [Source: DeepMind’s AlphaGo]
We will re-visit them through paper reading in the following weeks.
Study the parts that you are less familiar with from online resources.
$P_{ss'} = \Pr[s_{t+1} = s' \mid s_t = s, a_t = a]$
Goal: find a policy $\pi$ that maximizes the expected discounted reward over $t \ge 0$
$V^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \,\middle|\, s_0 = s\right]$
$\sum_{s' \in S} P(s' \mid s, a)\, V^\pi(s')$
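To make the value function and Bellman backup concrete, here is a minimal sketch (not from the slides) of iterative policy evaluation on a small randomly generated tabular MDP; the MDP arrays, the fixed policy, and the convergence threshold are illustrative assumptions.

```python
import numpy as np

# Toy tabular MDP (illustrative, not from the lecture): S states, A actions.
S, A, gamma = 5, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s'] = Pr[s_{t+1} = s' | s_t = s, a_t = a]
R = rng.uniform(-1, 1, size=(S, A))          # r(s, a)
pi = rng.integers(0, A, size=S)              # a fixed deterministic policy pi(s)

# Iterative policy evaluation: V(s) <- r(s, pi(s)) + gamma * sum_{s'} P(s' | s, pi(s)) V(s')
V = np.zeros(S)
for _ in range(1000):
    V_new = R[np.arange(S), pi] + gamma * (P[np.arange(S), pi] @ V)
    if np.max(np.abs(V_new - V)) < 1e-8:     # stop once the Bellman backup has converged
        V = V_new
        break
    V = V_new
print("V^pi =", V)
```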
A special case where an exact solution is easy to compute: assume linear transitions and quadratic reward functions.
Evaluate outcomes of sampled actions with models. Choose the action that leads to the best (predicted) outcome.
Linear transition: $s_{t+1} = A_t s_t + B_t a_t$
Quadratic reward (always negative): $r(s_t, a_t) = -s_t^\top U_t s_t - a_t^\top W_t a_t$
Extensions: LQG (Gaussian noise), iLQR (non-linear transitions)
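A minimal sketch of how the linear-quadratic case is solved exactly by a backward Riccati recursion; the 2-D system matrices, horizon, and cost weights below are illustrative assumptions, and the reward convention is the negative quadratic cost above.

```python
import numpy as np

# Illustrative linear system and quadratic cost; r(s, a) = -s^T U s - a^T W a.
n, m, H = 2, 1, 50
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # s_{t+1} = A s_t + B a_t
B = np.array([[0.0], [0.1]])
U = np.eye(n)                            # state cost weight
W = 0.1 * np.eye(m)                      # action cost weight

# Backward Riccati recursion for the finite-horizon LQR value function V_t(s) = -s^T P_t s.
P = U.copy()
gains = []
for _ in range(H):
    K = np.linalg.solve(W + B.T @ P @ B, B.T @ P @ A)   # optimal feedback gain: a_t = -K s_t
    P = U + A.T @ P @ (A - B @ K)
    gains.append(K)
gains = gains[::-1]                      # gains ordered t = 0 .. H-1

# Roll out the optimal controller from an initial state.
s = np.array([1.0, 0.0])
for K in gains:
    a = -K @ s
    s = A @ s + B @ a
print("final state:", s)
```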
Monte-Carlo Tree Search (MCTS) for Tic-Tac-Toe
Learned dynamics model $\hat{P}$: can be represented by Gaussian Processes, Neural Networks, GMMs, etc.
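As one example of the model classes listed above, here is a minimal sketch (assuming scikit-learn is available; the 1-D pendulum-like toy data is an illustrative stand-in) of fitting a Gaussian Process to one-step transitions $(s_t, a_t) \rightarrow s_{t+1}$.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Illustrative 1-D toy dynamics: s_{t+1} = s_t + 0.1*a_t - 0.05*sin(s_t) + noise.
rng = np.random.default_rng(0)
s = rng.uniform(-np.pi, np.pi, size=200)
a = rng.uniform(-1, 1, size=200)
s_next = s + 0.1 * a - 0.05 * np.sin(s) + 0.01 * rng.normal(size=200)

X = np.column_stack([s, a])                       # model input: (s_t, a_t)
gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, s_next)                                 # learned model \hat{P}(s_{t+1} | s_t, a_t)

# Query the model: mean prediction and predictive uncertainty for a new state-action pair.
mean, std = gp.predict(np.array([[0.5, 0.2]]), return_std=True)
print(f"predicted next state: {mean[0]:.3f} +/- {std[0]:.3f}")
```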
Model structure is known (e.g., simulator). We tune some model parameters (e.g., mass and friction).
[Ramos et al. RSS’19]
Week 12 Tue
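A minimal sketch of the idea of tuning simulator parameters to match real data: the toy point-mass "simulator", the hidden true mass and friction values, and the random-search loop below are all illustrative assumptions, not the method of [Ramos et al. RSS'19].

```python
import numpy as np

def simulate(mass, friction, horizon=50):
    """Toy 1-D point-mass 'simulator' (illustrative): returns a velocity trajectory."""
    v, traj = 0.0, []
    for _ in range(horizon):
        v = v + (1.0 - friction * v) / mass * 0.1   # constant unit force, Euler step dt = 0.1
        traj.append(v)
    return np.array(traj)

# Pretend this trajectory came from the real robot (generated here with hidden "true" parameters).
real_traj = simulate(mass=2.0, friction=0.5)

# Random search over (mass, friction) to minimize the trajectory mismatch with the real data.
rng = np.random.default_rng(0)
best, best_err = None, np.inf
for _ in range(2000):
    m, f = rng.uniform(0.5, 5.0), rng.uniform(0.0, 1.0)
    err = np.mean((simulate(m, f) - real_traj) ** 2)
    if err < best_err:
        best, best_err = (m, f), err
print("estimated (mass, friction):", best, "error:", best_err)
```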
Week 8 Tue
[Hafner et al. ICLR’20] [Finn et al. ICRA’17]
Predicting future raw sensory data
Predicting future latent state
$f(s_{t+1} \mid s_t, a_t)$
$h_t = g(s_t), \quad f(h_{t+1} \mid h_t, a_t)$
$\Pr[\mu \mid \mathcal{D}]$
“Dynamics Learning with Cascaded Variational Inference for Multi-Step Manipulation.” Fang, Zhu, Garg, Savarese, Fei-Fei, CoRL 2019
$\tau = \{(s_i, a_i, r_i) \mid i = 0, \ldots, H\}$
$\pi^* = \arg\max_\pi \; \mathbb{E}\big[\textstyle\sum_{t \ge 0} \gamma^t r(s_t, \pi(s_t))\big]$
$Q^*(s, a) = \mathbb{E}\big[r(s, a) + \gamma \max_{a'} Q^*(s', a')\big]$
$\pi^*(a|s) = \arg\max_{a'} Q^*(s, a')$
Deep Q-Network (DQN): represent $Q$ with neural networks
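A minimal sketch of the Q-learning update that DQN approximates with a neural network; here the Q function is a table and the environment is a toy random MDP (both illustrative assumptions, not from the slides).

```python
import numpy as np

# Toy tabular MDP (illustrative), same notation as above: P[s, a, s'], R[s, a].
S, A, gamma, alpha, eps = 5, 2, 0.9, 0.1, 0.1
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(-1, 1, size=(S, A))

Q = np.zeros((S, A))
s = 0
for _ in range(20000):
    # epsilon-greedy action selection
    a = rng.integers(A) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next = rng.choice(S, p=P[s, a])
    r = R[s, a]
    # Q-learning target: r + gamma * max_a' Q(s', a'); DQN replaces the table with a network
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    s = s_next

pi_star = np.argmax(Q, axis=1)   # greedy policy: pi*(s) = argmax_a Q(s, a)
print("greedy policy:", pi_star)
```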
Policy parameterized by $\theta$
Trajectories collected under policy $\pi_\theta$
Total reward of a trajectory
Policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_\pi\big[Q^\pi(s, a)\, \nabla_\theta \log \pi_\theta(a|s)\big]$
Different ways of computing the Q values lead to different PG variants: Monte-Carlo estimates (REINFORCE), learning value functions (Actor-Critic)
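A minimal sketch of the Monte-Carlo (REINFORCE) estimate of the policy gradient, for a softmax policy on a toy two-armed bandit; the bandit reward means, learning rate, and iteration count are illustrative assumptions.

```python
import numpy as np

# Toy 2-armed bandit (illustrative): a one-step problem, so the sampled reward stands in for Q^pi.
true_means = np.array([0.2, 0.8])
rng = np.random.default_rng(0)

theta = np.zeros(2)          # softmax policy: pi_theta(a) = softmax(theta)[a]
lr = 0.05
for _ in range(5000):
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()
    a = rng.choice(2, p=probs)
    r = rng.normal(true_means[a], 0.1)           # sampled return (Monte-Carlo estimate of Q)
    grad_logpi = -probs
    grad_logpi[a] += 1.0                         # d/dtheta log pi_theta(a) for a softmax policy
    theta += lr * r * grad_logpi                 # REINFORCE update: r * grad log pi
print("learned action probabilities:", probs)
```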
“QT-Opt: Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation.” Kalashnikov et al. CoRL 2018
Policy gradient:
- Learns the policy directly; often more stable
- Works for continuous action spaces
- Needs data from the current policy to compute the policy gradient ("on-policy" algorithm); data inefficient
- Gradient estimates can be very noisy
Q-learning:
- Can learn the Q function from any interaction data, not just trajectories gathered using the current policy ("off-policy" algorithm)
- Relatively data-efficient (can reuse old interaction data)
- Needs to optimize over actions: hard to apply to continuous action spaces
- The optimal Q function can be complicated and hard to learn
Imitation as supervised learning: match the policy action to the action from the expert policy, using a distance metric that measures the discrepancy between the expert action and the policy action (e.g., KL-divergence).
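A minimal sketch of behavioral cloning: fit a policy to expert state-action pairs by minimizing a squared-error discrepancy. The synthetic expert data and the linear policy class below are illustrative assumptions.

```python
import numpy as np

# Illustrative expert data: states paired with the expert's (continuous) actions.
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 3))
expert_actions = states @ np.array([0.5, -1.0, 0.2]) + 0.01 * rng.normal(size=500)

# Behavioral cloning with a linear policy a = w^T s: minimize mean squared discrepancy.
w = np.zeros(3)
lr = 0.1
for _ in range(500):
    pred = states @ w                                      # policy actions on expert states
    grad = 2 * states.T @ (pred - expert_actions) / len(states)
    w -= lr * grad
print("cloned policy weights:", w)   # should approach the expert's [0.5, -1.0, 0.2]
```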
"Agile Autonomous Driving using End-to-End Deep Imitation Learning" Pan, Cheng, Saigol, Lee, Yan, Theodorou, Boots. RSS 2018
Solving full-fledged RL in the inner loop is expensive. To solve efficiently, IRL methods often assume:
- Known dynamics (for comparing the learner and expert policies efficiently)
- Linear reward function: $r(s, a) = w^\top \phi(s)$
Problem: IRL is generally ill-posed. There are many reward functions under which the expert policy is optimal. How can we address it?
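A minimal sketch of why the linear-reward assumption helps: with $r(s,a) = w^\top \phi(s)$, comparing a policy to the expert reduces to comparing expected feature counts. Everything below (the features, the toy trajectories, and the simple feature-difference weight choice) is an illustrative toy, not a full IRL algorithm.

```python
import numpy as np

def feature_counts(trajectories, phi, gamma=0.9):
    """Discounted feature counts: average of sum_t gamma^t phi(s_t) over a set of trajectories."""
    mu = np.zeros(phi(trajectories[0][0]).shape)
    for traj in trajectories:
        for t, s in enumerate(traj):
            mu += (gamma ** t) * phi(s)
    return mu / len(trajectories)

phi = lambda s: np.array([s, s ** 2])           # illustrative features of a scalar state
rng = np.random.default_rng(0)
expert_trajs = [rng.normal(1.0, 0.1, size=20) for _ in range(50)]   # expert states hover near 1
policy_trajs = [rng.normal(0.0, 0.5, size=20) for _ in range(50)]   # current policy's states

mu_E, mu_pi = feature_counts(expert_trajs, phi), feature_counts(policy_trajs, phi)
w = mu_E - mu_pi                                 # reward weights that favor expert-like features
reward = lambda s: w @ phi(s)                    # r(s) = w^T phi(s)
print("reward at expert-like vs policy-like states:", reward(1.0), reward(0.0))
```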
[Goodfellow et al. 2014; Ho & Ermon, 2016]
The policy generates trajectories in the environment; a discriminator $D_\psi$ scores states, predicting 0 if they come from the policy and 1 if they come from the demonstrations.
A strong discriminator indicates a bad policy; a weak discriminator indicates a good policy.
IL reward on policy states: $r_{IL}(s_t) = -\log(1 - D_\psi(s_t))$
Discriminator objective: $\max_\psi \; \mathbb{E}_{\text{demo}}\big[\log D_\psi(s_t)\big] + \mathbb{E}_{\pi}\big[\log(1 - D_\psi(s_t))\big]$
- Represent complex reward functions with neural networks
- More iterative approaches to update reward and policy (no need to run full RL before updating the reward function)
- We don't know the dynamics, but we have access to a simulator to compare the learner's trajectories with the expert's.
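A minimal sketch of the adversarial IL reward above: a logistic-regression discriminator (an illustrative stand-in for a neural network) is trained to output ~1 on demonstration states and ~0 on policy states, and the policy's reward is $r_{IL}(s) = -\log(1 - D(s))$. The 1-D toy state distributions are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative state samples: demonstrations cluster near 1, the current policy near 0.
rng = np.random.default_rng(0)
demo_states = rng.normal(1.0, 0.3, size=(500, 1))
policy_states = rng.normal(0.0, 0.5, size=(500, 1))

# Discriminator D_psi: label 1 for demonstration states, 0 for policy-generated states.
X = np.vstack([demo_states, policy_states])
y = np.concatenate([np.ones(500), np.zeros(500)])
D = LogisticRegression().fit(X, y)

def il_reward(states):
    """IL reward r_IL(s) = -log(1 - D(s)): high where the discriminator thinks 'demo'."""
    d = D.predict_proba(states)[:, 1]
    return -np.log(1.0 - np.clip(d, 1e-6, 1 - 1e-6))

print("reward on policy-like state:", il_reward(np.array([[0.0]]))[0])
print("reward on demo-like state:  ", il_reward(np.array([[1.0]]))[0])
```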
Learning source: reward function or expert demonstration?
Reward function → have a model?
  yes → optimal control & planning
  no → learn a model?
    yes → model-based reinforcement learning
    no → model-free reinforcement learning
    (Week 7 Thu, Week 8 Tue)
Expert demonstration → estimate the reward?
  no → imitation as supervised learning (Week 8 Thu)
  yes → known dynamics?
    yes → inverse reinforcement learning (Week 9 Tue)
    no → adversarial imitation learning (Week 9 Thu)
Learning from rich data sources: language, preferences, instructions
Object variations and long-horizon tasks
Efficient learning of new tasks: fast learning from limited experience; representing and transferring past knowledge
Safety and robustness: probabilistic and formal guarantees for learning and inference