
Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. CMU, Google Research, Stanford University.


  1. Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill. CMU, Google Research, Stanford University.

  2. Minimax-Optimal PAC Bounds. Key contribution: a new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound. Notation: S = number of states, A = number of actions, H = episode length, T = number of episodes, ϵ = accuracy.

  3. Minimax-Optimal PAC Bounds. Key contribution: a new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound. First minimax-optimal (for small ϵ). Notation: S = number of states, A = number of actions, H = episode length, T = number of episodes, ϵ = accuracy. Prior work: [DLB ‘17], [DB ‘15].

  4. Minimax-Optimal PAC Bounds. Key contribution: a new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound. First minimax-optimal (for small ϵ); matches existing bounds and improves on them for large H. Notation: S = number of states, A = number of actions, H = episode length, T = number of episodes, ϵ = accuracy. Prior work: √(SAH²T) + √(H³T) + S²AH² [AOM ‘17], [DLB ‘17], [DB ‘15].

  5. Motivation: Need for Accountability in Online RL. Even with PAC and regret bounds, the expected return in the next episode during learning is unknown. (Figure label: current episode.)

  6. Motivation: Need for Accountability in Online RL. How good will my treatment be? Is it the best possible?

  7. Our Proposal: Algorithms output policy certificates before each episode

  8. Algorithms with Policy Certificates. A natural extension of model-based optimistic algorithms: (1) UCB on the optimal value function; (2) greedy policy; (3) LCB on the value function of the current policy; (4) output certificate.

  9. Algorithms with Policy Certificates. A natural extension of model-based optimistic algorithms: (1) UCB on the optimal value function; (2) greedy policy; (3) LCB on the value function of the current policy; (4) output certificate.

  10. Symbiosis of Optimism and Certificates. Certificates: • Challenge: […] random. • Insight from optimism: […] at known rate.

  11. Symbiosis of Optimism and Certificates. Certificates: • Challenge: […] random. • Insight from optimism: […] at known rate. Optimism: • Challenge: exploration bonus depends on […]. • Insight from certificates: bound by […].

  12. Symbiosis of Optimism and Certificates. Certificates: • Challenge: […] random. • Insight from optimism: […] at known rate. Optimism: • Challenge: exploration bonus depends on […]. • Insight from certificates: bound by […]. Takeaways: more accountable algorithms through accurate policy certificates; better exploration bonuses yield minimax-optimal PAC and regret bounds.
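
The four steps on slides 8 and 9 (UCB on the optimal value function, greedy policy, LCB on the value of that policy, output certificate) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' algorithm: the Hoeffding-style bonus, the function name plan_with_certificate, its parameters, and the choice to report the certificate as the gap between the upper and lower bound at the initial state are all assumptions made here. Rewards are assumed to lie in [0, 1].

```python
import numpy as np

def plan_with_certificate(counts, reward_sum, trans_counts, H, s0=0, delta=0.1):
    """Hypothetical sketch of one planning pass before an episode.

    counts:       (S, A) visit counts
    reward_sum:   (S, A) summed observed rewards, each reward in [0, 1]
    trans_counts: (S, A, S) observed transition counts
    Returns (policy of shape (H, S), certificate for initial state s0).
    """
    S, A = counts.shape
    n = np.maximum(counts, 1)
    r_hat = reward_sum / n                    # empirical mean rewards
    p_hat = trans_counts / n[:, :, None]      # empirical transition model

    # Assumed Hoeffding-style confidence width; the slides do not give the bonus.
    bonus = H * np.sqrt(np.log(2 * S * A * H / delta) / (2 * n))
    bonus[counts == 0] = H                    # no data: maximally uncertain

    v_up = np.zeros((H + 1, S))               # upper bounds on optimal values
    v_lo = np.zeros((H + 1, S))               # lower bounds on the policy's values
    policy = np.zeros((H, S), dtype=int)
    states = np.arange(S)

    for h in range(H - 1, -1, -1):
        # Step 1: upper confidence bound on the optimal Q-function.
        q_up = np.clip(r_hat + p_hat @ v_up[h + 1] + bonus, 0, H - h)
        # Step 3: lower confidence bound, evaluated for the greedy policy below.
        q_lo = np.clip(r_hat + p_hat @ v_lo[h + 1] - bonus, 0, H - h)
        # Step 2: act greedily with respect to the upper bound.
        policy[h] = q_up.argmax(axis=1)
        v_up[h] = q_up[states, policy[h]]
        v_lo[h] = q_lo[states, policy[h]]

    # Step 4: certificate = gap between the two bounds at the initial state.
    certificate = float(v_up[0, s0] - v_lo[0, s0])
    return policy, certificate
```

Calling the function with all-zero counts returns the largest possible certificate, H, since nothing is known yet; as counts grow, the bonus shrinks and the certificate tightens.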
