slide-1
SLIDE 1

Multi-Agent Adversarial Inverse Reinforcement Learning

Lantao Yu, Jiaming Song, Stefano Ermon

Department of Computer Science, Stanford University

Contact: lantaoyu@cs.stanford.edu

slide-2
SLIDE 2

Motivation

  • By definition, the performance of RL agents heavily relies on the quality of reward functions.
  • In many real-world scenarios, especially in multi-agent settings, hand-tuning informative reward functions can be very challenging.
  • Solution: learning from expert demonstrations!

$$\max_\pi \; \mathbb{E}_\pi\!\left[\sum_{t=1}^{T} \gamma^t r(s_t, a_t)\right]$$

Applications: computer games, dialogue, multi-agent systems.

slide-3
SLIDE 3

Motivation

  • Imitation learning does not recover reward functions.
  • Behavior Cloning
  • Generative Adversarial Imitation Learning [Ho & Ermon, 2016]

$$\pi^* = \max_{\pi \in \Pi} \; \mathbb{E}_{\pi_E}[\log \pi(a|s)]$$

(Diagram: expert policy $\pi_E(a|s)$ → IRL → reward $r(s,a)$ → RL → policy $\pi(a|s)$.)

slide-4
SLIDE 4

Motivation

  • Imitation learning does not recover reward functions.
  • Behavior Cloning
  • Generative Adversarial Imitation Learning [Ho & Ermon, 2016]

$$\pi^* = \max_{\pi \in \Pi} \; \mathbb{E}_{\pi_E}[\log \pi(a|s)]$$

(Diagram: matching the occupancy measure $p(s,a)$ of $\pi(a|s)$ to that of $\pi_E(a|s)$ with a GAN.)
slide-5
SLIDE 5

Motivation

  • Why should we care about reward learning?
  • Scientific inquiry: human and animal behavioral studies, inferring intentions, etc.
  • Presupposition: the reward function is considered to be the most succinct, robust and transferable description of the task. [Abbeel & Ng, 2004]
  • Re-optimizing policies in new environments, debugging and analyzing imitation learning algorithms, etc.
  • These properties are even more desirable in multi-agent settings.

$$r^* = (\text{object pos} - \text{goal pos})^2 \qquad \text{vs.} \qquad \pi^* : S \to \mathcal{P}(A)$$
slide-6
SLIDE 6

Preliminaries

  • Single-Agent Inverse RL
  • Basic principle: find a reward function that explains the expert behaviors (ill-defined, since many reward functions can explain the same behavior).
  • Maximum Entropy Inverse RL (MaxEnt IRL) provides a general probabilistic framework to resolve the ambiguity:

$$p_\omega(\tau) \propto \left[\eta(s_1) \prod_{t=1}^{T} P(s_{t+1} \mid s_t, a_t)\right] \exp\!\left(\sum_{t=1}^{T} r_\omega(s_t, a_t)\right)$$

$$\max_\omega \; \mathbb{E}_{\pi_E}[\log p_\omega(\tau)] = \mathbb{E}_{\tau \sim \pi_E}\!\left[\sum_{t=1}^{T} r_\omega(s_t, a_t)\right] - \log Z_\omega$$

where $Z_\omega$ is the partition function. (A brute-force sketch of this likelihood follows below.)
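To make the objective concrete, here is a brute-force numpy sketch of the trajectory likelihood above on a tiny tabular MDP. The MDP itself (sizes, random dynamics, random rewards) is a made-up illustration, and $Z_\omega$ is computed by enumerating every trajectory, which is only feasible at toy scale:

```python
import itertools
import numpy as np

n_states, n_actions, T = 3, 2, 2
rng = np.random.default_rng(0)
eta = np.full(n_states, 1.0 / n_states)                           # eta(s_1)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> dist over s'
r_omega = rng.normal(size=(n_states, n_actions))                  # tabular r_omega(s, a)

def unnormalized_p(states, actions):
    """eta(s_1) * prod_t P(s_{t+1}|s_t,a_t) * exp(sum_t r_omega(s_t,a_t))."""
    w = eta[states[0]]
    for t in range(T - 1):
        w *= P[states[t], actions[t], states[t + 1]]
    return w * np.exp(sum(r_omega[s, a] for s, a in zip(states, actions)))

# Partition function Z_omega by enumerating all length-T state-action sequences.
Z = sum(
    unnormalized_p(ss, aa)
    for ss in itertools.product(range(n_states), repeat=T)
    for aa in itertools.product(range(n_actions), repeat=T)
)

log_p = np.log(unnormalized_p((0, 1), (1, 0))) - np.log(Z)
print(f"log p_omega(tau) = {log_p:.3f}")
```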
slide-7
SLIDE 7

Preliminaries

  • Single-Agent Inverse RL
  • Adversarial Inverse RL (AIRL) provides an efficient sampling-based approximation to MaxEnt IRL.
  • Special discriminator structure:

$$D_{\omega,\phi}(s, a, s') = \frac{\exp(f_{\omega,\phi}(s, a, s'))}{\exp(f_{\omega,\phi}(s, a, s')) + \pi(a|s)}, \qquad f_{\omega,\phi}(s, a, s') = r_\omega(s, a) + \gamma h_\phi(s') - h_\phi(s)$$

  • Train the policy (generator) with $\log D - \log(1 - D)$.
  • Under certain conditions, $r_\omega(s, a)$ is guaranteed to recover the ground-truth reward up to a constant. (A small numerical sketch follows below.)
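A minimal numpy sketch of the discriminator structure and the generator's reward signal; the function name and scalar inputs are illustrative assumptions, not the AIRL reference implementation:

```python
import numpy as np

def airl_discriminator(f_value, log_pi):
    """D = exp(f) / (exp(f) + pi(a|s)), computed stably in log space.

    f_value: f_{omega,phi}(s, a, s') = r_omega(s, a) + gamma * h_phi(s') - h_phi(s)
    log_pi:  log pi(a|s) under the current generator policy
    Returns (log D, log(1 - D)).
    """
    m = np.maximum(f_value, log_pi)
    log_denom = m + np.log(np.exp(f_value - m) + np.exp(log_pi - m))
    return f_value - log_denom, log_pi - log_denom

log_D, log_1mD = airl_discriminator(f_value=0.7, log_pi=np.log(0.2))
# The generator's reward signal: log D - log(1 - D) simplifies to f - log pi.
print(log_D - log_1mD, 0.7 - np.log(0.2))
```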
slide-8
SLIDE 8

Preliminaries

  • Markov Games [Littman, 1994]: a multi-agent generalization of the Markov decision process.
  • Agent number $N$
  • State space $S$
  • Action spaces $\{A_i\}_{i=1}^N$
  • Transition dynamics $P : S \times A_1 \times \cdots \times A_N \to \mathcal{P}(S)$
  • Initial state distribution $\eta \in \mathcal{P}(S)$

(A plain-data sketch of this tuple follows below.)
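For concreteness, the Markov game tuple could be carried around as a plain container like the following hypothetical sketch (finite spaces assumed for simplicity; this is not an interface from the paper):

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple
import numpy as np

@dataclass
class MarkovGame:
    n_agents: int                                             # N
    states: Sequence[int]                                     # S (finite here)
    action_spaces: List[Sequence[int]]                        # {A_i}, i = 1..N
    transition: Callable[[int, Tuple[int, ...]], np.ndarray]  # P(. | s, a_1..a_N) over S
    initial_dist: np.ndarray                                  # eta in P(S)
    rewards: List[Callable[[int, Tuple[int, ...]], float]]    # r_i(s, a_1..a_N) per agent
```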
slide-9
SLIDE 9

Preliminaries

  • Solution Concepts to Markov Games
  • Correlated equilibrium (CE) [Aumann, 1974]: a joint strategy profile where no agent can achieve higher expected reward through unilaterally changing its own policy.
  • Nash equilibrium (NE) [Hu et al, 1998]: a more restrictive equilibrium which further requires agents’ actions in each state to be independent.
  • Incompatible with MaxEnt IRL.
slide-10
SLIDE 10

Preliminaries

  • Solution Concepts to Markov Games
  • Logistic quantal response equilibrium (LQRE) [McKelvey & Palfrey, 1995; 1998]: a stochastic generalization of NE and CE.
  • LQRE is a joint strategy profile satisfying a set of fixed-point constraints in which each agent plays a softmax (logistic) best response to the others.
  • Existing optimality notions do not explicitly define a tractable joint strategy profile which we can use to maximize the likelihood of expert demonstrations.

slide-11
SLIDE 11

Method

  • Logistic Stochastic Best Response Equilibrium
  • Motivated by LQRE, Gibbs sampling [Hastings, 1970], dependency networks [Heckerman et al, 2000] and best response dynamics [Nisan et al, 2011].

slide-12
SLIDE 12

Method

  • Logistic Stochastic Best Response Equilibrium
  • Single-shot normal-form game: consider a Markov chain over $A_1 \times \cdots \times A_N$, where the state of the Markov chain at step $k$ is denoted $z^{(k)} = (z_1, \ldots, z_N)^{(k)}$. Each agent in turn resamples its action from a softmax best response to the others' current actions; for agent 1:

$$z_1^{(k+1)} \sim P\!\left(a_1 \mid a_{-1} = z_{-1}^{(k)}\right) = \frac{\exp\!\left(\lambda r_1(a_1, z_{-1}^{(k)})\right)}{\sum_{a_1'} \exp\!\left(\lambda r_1(a_1', z_{-1}^{(k)})\right)}$$

  • Because the Markov chain is ergodic, it admits a unique stationary joint policy, which we call the LSBRE for the normal-form game. (A runnable sketch follows below.)
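The construction above is ordinary Gibbs sampling with softmax conditionals, so it is easy to simulate. Below is a runnable sketch for a two-player normal-form game with made-up payoff matrices and $\lambda$; the empirical visit frequencies of the chain approximate the LSBRE joint policy:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 2.0
r1 = np.array([[2.0, 0.0], [0.0, 1.0]])  # r_1(a_1, a_2): payoffs to agent 1
r2 = np.array([[2.0, 0.0], [0.0, 1.0]])  # r_2(a_1, a_2): payoffs to agent 2

def softmax_response(payoffs):
    logits = lam * payoffs
    p = np.exp(logits - logits.max())
    return p / p.sum()

z = np.array([0, 0])                      # z^(0): initial joint action
counts = np.zeros((2, 2))
for k in range(100_000):
    # z_1^(k+1) ~ P(a_1 | a_{-1} = z_{-1}^(k)), then likewise for agent 2
    z[0] = rng.choice(2, p=softmax_response(r1[:, z[1]]))
    z[1] = rng.choice(2, p=softmax_response(r2[z[0], :]))
    counts[z[0], z[1]] += 1

print(counts / counts.sum())              # empirical stationary joint policy (LSBRE)
```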
slide-13
SLIDE 13

Method

  • Logistic Stochastic Best Response Equilibrium
  • Markov game: consider $T$ Markov chains over $(A_1 \times \cdots \times A_N)^{|S|}$, where the state of the $t$-th Markov chain at step $k$ is $\{z_i^{t,(k)} : S \to A_i\}_{i=1}^N$.
  • For $t \in [T, \ldots, 1]$, we recursively define the $t$-th Markov chain with an analogous per-agent softmax best-response update rule.
  • We define the unique stationary joint distribution of the Markov chains as the LSBRE strategy profiles.
slide-14
SLIDE 14

Method

  • Multi-Agent Adversarial Inverse RL
  • By parameterizing the reward functions with $\omega$, the LSBRE induces a well-defined trajectory distribution.
  • The reward parameters are then estimated by maximizing the likelihood of expert demonstrations under this distribution.
slide-15
SLIDE 15

Method

  • Multi-Agent Adversarial Inverse RL
  • Bridging the optimization of the joint likelihood and each conditional likelihood with maximum pseudolikelihood estimation (Theorem 2):

Let $\tau_1, \ldots, \tau_M$ be i.i.d. samples from the LSBRE induced by some unknown reward function. Suppose that $\pi_i^t(a_i^t \mid a_{-i}^t, s^t; \omega_i)$ is differentiable w.r.t. $\omega_i$. Then as $M \to \infty$, with probability tending to 1, the pseudolikelihood equation has a root that tends to the maximizer of the joint likelihood. (A toy pseudolikelihood fit follows below.)
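To ground the idea, here is a toy maximum pseudolikelihood fit, unrelated to the paper's models: for a pair of $\pm 1$ variables with joint $p(x) \propto \exp(\theta x_1 x_2)$, we maximize the sum of conditional log-likelihoods, which sidesteps the partition function, the same trick the pseudolikelihood approach relies on:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
theta_true = 1.5

# Gibbs-sample data from the true model: p(x_i = +1 | x_j) = sigmoid(2*theta*x_j).
state, samples = np.array([1.0, 1.0]), []
for _ in range(5000):
    for i in range(2):
        p_plus = 1.0 / (1.0 + np.exp(-2 * theta_true * state[1 - i]))
        state[i] = 1.0 if rng.random() < p_plus else -1.0
    samples.append(state.copy())
X = np.array(samples)

def neg_pseudolikelihood(theta):
    # log p(x_i = s | x_{-i}) = -log(1 + exp(-2*theta*s*x_{-i})); sum over i and data
    ll = 0.0
    for i in range(2):
        ll += -np.log1p(np.exp(-2 * theta * X[:, i] * X[:, 1 - i])).sum()
    return -ll

# The maximizer should be close to theta_true = 1.5.
print(minimize_scalar(neg_pseudolikelihood, bounds=(0, 5), method="bounded").x)
```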
slide-16
SLIDE 16

Method

  • Multi-Agent Adversarial Inverse RL
  • Maximizing the pseudolikelihood objective: by characterizing the trajectory distribution of LSBRE (Theorem 1), we can optimize a tractable surrogate loss, realized adversarially on the next slide.

slide-17
SLIDE 17

Method

  • Multi-Agent Adversarial Inverse RL
  • Practical MA-AIRL framework:
  • Train the $\omega$-parameterized discriminators as:

$$\max_\omega \; \mathbb{E}_{\pi_E}\!\left[\sum_{i=1}^{N} \log \frac{\exp(f_{\omega_i}(s, a))}{\exp(f_{\omega_i}(s, a)) + q_{\theta_i}(a_i \mid s)}\right] + \mathbb{E}_{q_\theta}\!\left[\sum_{i=1}^{N} \log \frac{q_{\theta_i}(a_i \mid s)}{\exp(f_{\omega_i}(s, a)) + q_{\theta_i}(a_i \mid s)}\right]$$

  • Train the $\theta$-parameterized generators (policies) as:

$$\max_\theta \; \mathbb{E}_{q_\theta}\!\left[\sum_{i=1}^{N} \log D_{\omega_i}(s, a) - \log(1 - D_{\omega_i}(s, a))\right] = \mathbb{E}_{q_\theta}\!\left[\sum_{i=1}^{N} f_{\omega_i}(s, a) - \log q_{\theta_i}(a_i \mid s)\right]$$

(A numpy sketch of these objectives follows below.)
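As a sanity check on the algebra, here is a numpy sketch of the per-agent discriminator loss and the induced generator reward; the batch inputs are random placeholders and the helper names are assumptions, not the authors' code:

```python
import numpy as np

def logsumexp2(a, b):
    m = np.maximum(a, b)
    return m + np.log(np.exp(a - m) + np.exp(b - m))

def discriminator_loss(f_expert, log_q_expert, f_policy, log_q_policy):
    """Negative of the slide's discriminator objective for one agent i.

    f_*:     f_{omega_i}(s, a) evaluated on expert / generator batches
    log_q_*: log q_{theta_i}(a_i | s) under the current generator policy
    """
    log_D_expert = f_expert - logsumexp2(f_expert, log_q_expert)        # log D, expert data
    log_1mD_policy = log_q_policy - logsumexp2(f_policy, log_q_policy)  # log(1-D), generator data
    return -(log_D_expert.mean() + log_1mD_policy.mean())

def generator_reward(f_policy, log_q_policy):
    # log D - log(1 - D) simplifies to f_{omega_i}(s, a) - log q_{theta_i}(a_i | s)
    return f_policy - log_q_policy

rng = np.random.default_rng(0)
f_e, f_p = rng.normal(size=64), rng.normal(size=64)
log_q_e = np.log(rng.uniform(0.05, 1.0, size=64))
log_q_p = np.log(rng.uniform(0.05, 1.0, size=64))
print(discriminator_loss(f_e, log_q_e, f_p, log_q_p))
```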
slide-18
SLIDE 18

Experiments

  • Policy imitation performance
  • Cooperative tasks: cooperative navigation & cooperative communication.
  • Use the ground-truth reward as the oracle evaluation metric.
slide-19
SLIDE 19

Experiments

  • Policy imitation performance
  • Competitive task (competitive keep-away)
  • “Battle” evaluation: we place the experts and learned policies in the same environment; a learned policy is considered better if it receives a higher expected return than its opponent.

slide-20
SLIDE 20

Experiments

  • Reward recovery
  • Measuring the statistical correlation between the learned reward and the ground truth.
  • A more direct evaluation in multi-agent systems.
  • Pearson’s correlation coefficient (PCC): measures the linear correlation between two random variables.
  • Spearman’s rank correlation coefficient (SCC): measures the statistical dependence between the rankings of two random variables. (See the SciPy sketch below.)
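Both metrics are available in SciPy; a sketch with random stand-in rewards (not the paper's data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
gt_reward = rng.normal(size=1000)                               # ground-truth rewards
learned_reward = 0.8 * gt_reward + 0.2 * rng.normal(size=1000)  # noisy recovered rewards

pcc, _ = stats.pearsonr(gt_reward, learned_reward)    # linear correlation
scc, _ = stats.spearmanr(gt_reward, learned_reward)   # rank correlation
print(f"PCC={pcc:.3f}  SCC={scc:.3f}")
```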

slide-21
SLIDE 21

Experiments

  • Reward recovery
  • Cooperative tasks
slide-22
SLIDE 22

Experiments

  • Reward recovery
  • Competitive task
slide-23
SLIDE 23

Summary

  • We proposed a new solution concept for Markov games, which allows us to characterize the trajectory distribution induced by parameterized rewards.
  • We proposed the first multi-agent MaxEnt IRL framework, which is effective and scalable to Markov games with continuous state-action spaces and unknown dynamics.
  • We employed maximum pseudolikelihood estimation and adversarial reward learning to achieve tractability.
  • Experimental results demonstrate that MA-AIRL can recover both policies and reward functions that are highly correlated with the ground truth.

slide-24
SLIDE 24

Thank You!

Lantao Yu, Jiaming Song, Stefano Ermon. Multi-Agent Adversarial Inverse Reinforcement Learning. ICML 2019.

Poster: 06:30 -- 09:00 PM @ Pacific Ballroom #36

Lantao Yu, Jiaming Song, Stefano Ermon

Department of Computer Science, Stanford University