Multi-Agent Adversarial Inverse Reinforcement Learning



  1. Multi-Agent Adversarial Inverse Reinforcement Learning. Lantao Yu, Jiaming Song, Stefano Ermon. Department of Computer Science, Stanford University. Contact: lantaoyu@cs.stanford.edu

  2. Motivation
     • By definition, the performance of RL agents heavily relies on the quality of the reward function: $\max_\pi \mathbb{E}_\pi\left[\sum_{t=1}^{T} \gamma^t r(s_t, a_t)\right]$
     • (Slide figures: computer games, multi-agent systems, dialogue.)
     • In many real-world scenarios, especially in multi-agent settings, hand-tuning informative reward functions can be very challenging.
     • Solution: learning from expert demonstrations!
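For concreteness, here is a minimal sketch (plain Python, with hypothetical per-step rewards) of the discounted return whose expectation the RL objective above maximizes:

    def discounted_return(rewards, gamma=0.99):
        # Sum_t gamma^t * r(s_t, a_t) for a single trajectory; the RL objective
        # is the expectation of this quantity over trajectories drawn from pi.
        return sum(gamma ** t * r for t, r in enumerate(rewards))

    # Hypothetical rewards r(s_t, a_t) collected along one rollout.
    print(discounted_return([0.0, 0.0, 1.0, 0.5]))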

  3. Motivation
     • Imitation learning does not recover reward functions.
     • Behavior Cloning: $\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{\pi_E}[\log \pi(a \mid s)]$
     • Generative Adversarial Imitation Learning [Ho & Ermon, 2016]
     • (Slide figure: expert policy $\pi_E(a \mid s)$ → IRL → reward $r(s, a)$ → RL → learned policy $\pi(a \mid s)$.)
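A minimal behavior-cloning sketch (PyTorch, with hypothetical expert data and a discrete action space): maximizing $\mathbb{E}_{\pi_E}[\log \pi(a \mid s)]$ is equivalent to minimizing the cross-entropy between the expert's actions and the policy's action distribution.

    import torch
    import torch.nn as nn

    # Hypothetical expert demonstrations: states (N, 4) and discrete actions in {0, 1}.
    states = torch.randn(256, 4)
    actions = torch.randint(0, 2, (256,))

    policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
    optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

    for _ in range(200):
        logits = policy(states)
        # Minimizing the NLL of expert actions == maximizing E_{pi_E}[log pi(a|s)].
        loss = nn.functional.cross_entropy(logits, actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()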

  4. Motivation
     • Imitation learning does not recover reward functions.
     • Behavior Cloning: $\pi^* = \arg\max_{\pi \in \Pi} \mathbb{E}_{\pi_E}[\log \pi(a \mid s)]$
     • Generative Adversarial Imitation Learning [Ho & Ermon, 2016]
     • (Slide figure: the state-action distribution $p(s, a)$ of $\pi(a \mid s)$ is matched to that of $\pi_E(a \mid s)$ with a GAN.)
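A minimal sketch of this GAN-style distribution matching (PyTorch, with hypothetical expert and policy state-action batches): a discriminator is trained to separate expert $(s, a)$ pairs from policy pairs, and its output then provides the learning signal for the policy.

    import torch
    import torch.nn as nn

    state_dim, action_dim = 4, 2
    disc = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))
    opt = torch.optim.Adam(disc.parameters(), lr=3e-4)
    bce = nn.BCEWithLogitsLoss()

    # Hypothetical (s, a) batches; in practice from expert data and policy rollouts.
    expert_sa = torch.randn(128, state_dim + action_dim)
    policy_sa = torch.randn(128, state_dim + action_dim)

    # Discriminator step: expert pairs labeled 1, policy pairs labeled 0.
    logits_e, logits_p = disc(expert_sa), disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Policy step (not shown): run RL with a surrogate reward derived from the
    # discriminator, e.g. log D(s, a), pushing pi's (s, a) distribution toward pi_E's.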

  5. Motivation
     • Why should we care about reward learning?
     • Scientific inquiry: human and animal behavioral studies, inferring intentions, etc.
     • Presupposition: the reward function is considered to be the most succinct, robust and transferable description of the task [Abbeel & Ng, 2004], e.g. $r^* = -(\text{object pos} - \text{goal pos})^2$ vs. $\pi^*: S \to P(A)$.
     • Re-optimizing policies in new environments, debugging and analyzing imitation learning algorithms, etc.
     • These properties are even more desirable in multi-agent settings.

  6. Preliminaries
     • Single-Agent Inverse RL
     • Basic principle: find a reward function that explains the expert behaviors. (This problem is ill-defined.)
     • Maximum Entropy Inverse RL (MaxEnt IRL) provides a general probabilistic framework to resolve the ambiguity:
       $p_\omega(\tau) \propto \left[ \eta(s_1) \prod_{t=1}^{T} P(s_{t+1} \mid s_t, a_t) \right] \exp\left( \sum_{t=1}^{T} r_\omega(s_t, a_t) \right)$
       $\max_\omega \mathbb{E}_{\pi_E}[\log p_\omega(\tau)] = \mathbb{E}_{\tau \sim \pi_E}\left[ \sum_{t=1}^{T} r_\omega(s_t, a_t) \right] - \log Z_\omega$
       where $Z_\omega$ is the partition function.
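A minimal sketch of the MaxEnt IRL update for a linear reward $r_\omega(s, a) = \omega^\top \phi(s, a)$ (NumPy, with hypothetical feature arrays): the gradient of the log-likelihood above reduces to the expert feature expectations minus the feature expectations under the model distribution $p_\omega$ (the $\log Z_\omega$ term).

    import numpy as np

    def maxent_irl_gradient(expert_feats, model_feats):
        # d/dw E_{pi_E}[log p_w(tau)] for r_w = w . phi:
        # expert feature expectations minus feature expectations under p_w.
        return expert_feats.mean(axis=0) - model_feats.mean(axis=0)

    # Hypothetical per-trajectory feature sums phi(tau) = sum_t phi(s_t, a_t).
    # In practice, model_feats must be recomputed for the current w, e.g. by soft
    # value iteration or by sampling from the induced soft-optimal policy.
    expert_feats = np.random.randn(100, 8)
    model_feats = np.random.randn(100, 8)

    w = np.zeros(8)
    for _ in range(50):
        w += 0.1 * maxent_irl_gradient(expert_feats, model_feats)  # gradient ascent on omega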
