Policy Certificates: Towards Accountable Reinforcement Learning
Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
CMU, Google Research, Stanford University
Minimax-Optimal PAC Bounds
S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy
Key contribution: new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound
Prior work: [AOM ‘17] [DLB ‘17] [DB ‘15]
Regret bound of [AOM ‘17]: $\sqrt{SAH^2T} + \sqrt{H^3T} + S^2AH^2$
PAC bound: first minimax-optimal! (for small ϵ)
Regret bound: matches existing + improves for large H
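One way to see where a large H matters in the prior-work regret expression √(SAH²T) + √(H³T) + S²AH² is to evaluate its three terms numerically. The sketch below does that with constants and log factors ignored (an assumption made here for illustration); the function names `aom17_regret_terms` and `dominant_term` are hypothetical and not taken from the talk.

```python
import math


def aom17_regret_terms(S, A, H, T):
    """Evaluate the three terms of sqrt(S A H^2 T) + sqrt(H^3 T) + S^2 A H^2.

    Constants and logarithmic factors are dropped (illustrative assumption).
    S: #states, A: #actions, H: episode length, T: #episodes.
    """
    return {
        "sqrt(SAH^2T)": math.sqrt(S * A * H**2 * T),
        "sqrt(H^3T)": math.sqrt(H**3 * T),
        "S^2AH^2": float(S**2 * A * H**2),
    }


def dominant_term(S, A, H, T):
    """Return the name of the largest of the three terms."""
    terms = aom17_regret_terms(S, A, H, T)
    return max(terms, key=terms.get)


if __name__ == "__main__":
    # Short horizon (H < S*A) versus long horizon (H > S*A), many episodes.
    print(dominant_term(S=5, A=2, H=5, T=10**6))
    print(dominant_term(S=5, A=2, H=100, T=10**6))
```

Comparing the two square-root terms directly, √(H³T) exceeds √(SAH²T) exactly when H > SA, so for long horizons and many episodes the √(H³T) term is the one that dominates this expression.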
Motivation: Need for Accountability in Online RL
[Figure label: current episode]
Even with PAC + regret bounds: the expected return in the next episode is unknown during learning
How good will my treatment be? Is it the best possible?
Our Proposal: Algorithms output policy certificates before each episode
Algorithms with Policy Certificates
Natural extension of model-based optimistic algorithms:
1. UCB on optimal value function
2. Greedy policy
3. LCB on value function of current policy
4. Output certificate
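The four steps above can be sketched as a single backward pass over an empirical tabular model. This is a minimal illustration, not the algorithm from the talk: the function name `plan_with_certificate`, the data layout, the assumption that rewards lie in [0, 1], and in particular the Hoeffding-style `bonus` are all hypothetical choices made here. The idea it shows is that if the upper bound really dominates the optimal value and the lower bound really sits below the value of the greedy policy, then their difference bounds that policy's optimality gap.

```python
import math


def plan_with_certificate(r_hat, p_hat, counts, H, s0=0, delta=0.1):
    """Optimistic planning on an empirical tabular model (illustrative sketch).

    r_hat[s][a]:     empirical mean reward, assumed to lie in [0, 1]
    p_hat[s][a][s2]: empirical transition probability
    counts[s][a]:    number of visits to (s, a)
    Returns (policy, certificate), where policy[h][s] is an action index.
    """
    S, A = len(r_hat), len(r_hat[0])
    v_up = [[0.0] * S for _ in range(H + 1)]  # UCB on the optimal value
    v_lo = [[0.0] * S for _ in range(H + 1)]  # LCB on the greedy policy's value
    policy = [[0] * S for _ in range(H)]

    def bonus(n):
        # Hypothetical Hoeffding-style width; NOT the bonus used in the talk.
        return H * math.sqrt(math.log(1.0 / delta) / max(1, n))

    for h in range(H - 1, -1, -1):
        for s in range(S):
            # Step 1: upper confidence bound on the optimal Q-value per action,
            # clipped to the largest return still achievable from step h.
            q_up = []
            for a in range(A):
                nxt = sum(p_hat[s][a][s2] * v_up[h + 1][s2] for s2 in range(S))
                q_up.append(min(H - h, r_hat[s][a] + nxt + bonus(counts[s][a])))
            # Step 2: act greedily with respect to the upper bound.
            a_star = max(range(A), key=lambda a: q_up[a])
            policy[h][s] = a_star
            v_up[h][s] = q_up[a_star]
            # Step 3: lower confidence bound on the value of that same policy.
            nxt_lo = sum(p_hat[s][a_star][s2] * v_lo[h + 1][s2] for s2 in range(S))
            v_lo[h][s] = max(
                0.0, r_hat[s][a_star] + nxt_lo - bonus(counts[s][a_star])
            )

    # Step 4: the certificate is the gap between the bounds at the start state.
    return policy, v_up[0][s0] - v_lo[0][s0]


if __name__ == "__main__":
    # One state, two actions, horizon 2; action 1 pays reward 1 every step.
    r_hat = [[0.0, 1.0]]
    p_hat = [[[1.0], [1.0]]]
    for n in (1, 10**8):
        policy, cert = plan_with_certificate(r_hat, p_hat, [[n, n]], H=2)
        print(n, policy, cert)
```

In this toy example the certificate equals the full horizon when each action has been tried once and shrinks toward zero as the visit counts grow, which is the behaviour a certificate output before each episode is meant to expose.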
Symbiosis of Optimism and Certificates
Certificates:
- Challenge: random […]
- Insight from optimism: […] at known rate
Optimism:
- Challenge: exploration bonus depends on […]
- Insight from certificates: bound by […]
More accountable algorithms through accurate policy certificates
Better exploration bonuses yield minimax-optimal PAC & regret bounds