Policy Certificates: Towards Accountable Reinforcement Learning


Feb 17, 2023


SLIDE 1

Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
CMU, Google Research, Stanford University

SLIDE 2

Minimax-Optimal PAC Bounds

S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy

Key contribution: new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound

SLIDE 3

Minimax-Optimal PAC Bounds

S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy

Key contribution: new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound

First minimax-optimal! (for small ϵ)
Prior work: [DLB '17], [DB '15]

SLIDE 4

Minimax-Optimal PAC Bounds

S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy

Key contribution: new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound

$\sqrt{SAH^2T} + \sqrt{H^3T} + S^2AH^2$

First minimax-optimal! (for small ϵ)
Matches existing + improves for large H
Prior work: [AOM '17], [DLB '17], [DB '15]
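To get a feel for how the three terms in the expression on this slide, √(SAH²T), √(H³T) and S²AH², trade off, the following minimal sketch evaluates each one for given S, A, H and T. Constants and logarithmic factors are ignored, the function names are hypothetical, and the example values are chosen only for illustration.

```python
import math


def bound_terms(S, A, H, T):
    """Evaluate the three terms of sqrt(S*A*H^2*T) + sqrt(H^3*T) + S^2*A*H^2.

    Constants and log factors are ignored; this is an order-of-magnitude aid only.
    """
    return {
        "sqrt(SAH^2T)": math.sqrt(S * A * H**2 * T),
        "sqrt(H^3T)": math.sqrt(H**3 * T),
        "S^2AH^2": float(S**2 * A * H**2),
    }


def dominant_term(S, A, H, T):
    """Name of the largest of the three terms for the given problem sizes."""
    terms = bound_terms(S, A, H, T)
    return max(terms, key=terms.get)


if __name__ == "__main__":
    # Illustrative values only (not taken from the slides).
    for S, A, H, T in [(10, 5, 20, 100), (10, 5, 20, 10**9), (2, 2, 100, 10**9)]:
        print((S, A, H, T), dominant_term(S, A, H, T))
```

Comparing the first two terms directly, √(H³T) exceeds √(SAH²T) exactly when H > SA, so the √(H³T) term only matters when the horizon is long relative to the number of state-action pairs. For few episodes, the T-independent term S²AH² is the largest.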

SLIDE 5

Motivation: Need for Accountability in Online RL

[Figure label: current episode]

Even with PAC and regret bounds, the expected return in the next episode is unknown during learning

SLIDE 6

Motivation: Need for Accountability in Online RL

How good will my treatment be? Is it the best possible?

SLIDE 7

Our Proposal: Algorithms output policy certificates before each episode

SLIDE 8

Algorithms with policy certificates

Natural extension of model-based optimistic algorithms:
1. UCB on optimal value function
2. Greedy policy
3. LCB on value function of current policy
4. Output certificate

SLIDE 9

Algorithms with Policy Certificates

Natural extension of model-based optimistic algorithms:
1. UCB on optimal value function
2. Greedy policy
3. LCB on value function of current policy
4. Output certificate
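The four steps can be sketched for a tabular episodic MDP roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name `plan_with_certificate` is hypothetical, rewards are assumed to lie in [0, 1], the confidence width is a generic Hoeffding-style placeholder rather than the bonus from the talk, and the certificate is taken to be the gap between the upper and lower bound at the initial state. With this placeholder width the sketch makes no claim that the bounds hold with any particular probability.

```python
import math


def plan_with_certificate(counts, reward_sum, trans_counts, H, s0=0, delta=0.1):
    """Return (policy, certificate) for the next episode.

    counts[s][a]           visits to (s, a) so far
    reward_sum[s][a]       sum of observed rewards, each assumed in [0, 1]
    trans_counts[s][a][t]  observed transitions from s to t under action a
    """
    S, A = len(counts), len(counts[0])
    log_term = math.log(2 * S * A * H / delta)

    def bonus(s, a):
        # Placeholder Hoeffding-style width (an assumption of this sketch);
        # unvisited pairs get the maximal width H.
        n = counts[s][a]
        return H * math.sqrt(log_term / n) if n > 0 else float(H)

    def backup(s, a, v_next):
        # Empirical reward plus empirical expected next value (0 if unvisited).
        n = counts[s][a]
        if n == 0:
            return 0.0
        expected_next = sum(trans_counts[s][a][t] * v_next[t] for t in range(S)) / n
        return reward_sum[s][a] / n + expected_next

    v_up = [0.0] * S  # upper bounds on the optimal value at step h + 1
    v_lo = [0.0] * S  # lower bounds on the value of the greedy policy at step h + 1
    policy = [[0] * S for _ in range(H)]
    for h in range(H - 1, -1, -1):
        v_max = float(H - h)  # at most H - h reward remains from step h
        new_up, new_lo = [0.0] * S, [0.0] * S
        for s in range(S):
            # 1. UCB on the optimal value function
            q_up = [min(v_max, max(0.0, backup(s, a, v_up) + bonus(s, a)))
                    for a in range(A)]
            # 2. Greedy policy with respect to the upper bound
            a_star = max(range(A), key=q_up.__getitem__)
            policy[h][s] = a_star
            # 3. LCB on the value function of the current (greedy) policy
            q_lo = backup(s, a_star, v_lo) - bonus(s, a_star)
            new_up[s] = q_up[a_star]
            new_lo[s] = min(v_max, max(0.0, q_lo))
        v_up, v_lo = new_up, new_lo
    # 4. Output certificate: gap between the two bounds at the initial state
    return policy, v_up[s0] - v_lo[s0]


if __name__ == "__main__":
    S, A, H = 3, 2, 4
    zeros = [[0] * A for _ in range(S)]
    no_transitions = [[[0] * S for _ in range(A)] for _ in range(S)]
    _, eps = plan_with_certificate(zeros, zeros, no_transitions, H)
    print("certificate with no data:", eps)
```

With no data the certificate equals the trivial value H, and it shrinks as visit counts grow, which is the behaviour one would want from a quantity reported before each episode.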

SLIDE 10

Symbiosis of Optimism and Certificates

Certificates:

Challenge: random
Insight from optimism:

at known rate

SLIDE 11

Symbiosis of Optimism and Certificates

Certificates:

Challenge: random
Insight from optimism:

at known rate

Optimism:

Challenge: exploration bonus

depends on

Insight from certificates:

bound by

SLIDE 12

Symbiosis of Optimism and Certificates

Certificates:

Challenge: random
Insight from optimism:

at known rate

Optimism:

Challenge: exploration bonus

depends on

Insight from certificates:

bound by

More accountable algorithms through accurate policy certificates
Better exploration bonuses yield minimax-optimal PAC & regret bounds