Policy Certificates: Towards Accountable Reinforcement Learning


slide-1
SLIDE 1

Policy Certificates: Towards Accountable Reinforcement Learning

Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill
CMU, Google Research, Stanford University

slide-4
SLIDE 4

Minimax-Optimal PAC Bounds

S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy

Key contribution: a new algorithm for episodic tabular MDPs with both a PAC bound and a high-probability regret bound

$\sqrt{SAH^2T} + \sqrt{H^3T} + S^2AH^2$

Prior work: [AOM ’17], [DLB ’17], [DB ’15]

PAC bound: first minimax-optimal! (for small ϵ)
Regret bound: matches existing + improves for large H
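
One way to read "improves for large H" is to compare the two square-root terms of the expression √(SAH²T) + √(H³T) + S²AH² shown on this slide. This is only a sketch under two assumptions that the slide text does not confirm: that the expression is a regret bound from prior work, and that the improvement concerns its √(H³T) term.

```latex
\[
\frac{\sqrt{H^3 T}}{\sqrt{S A H^2 T}} = \sqrt{\frac{H}{SA}},
\qquad\text{so}\qquad
\sqrt{H^3 T} > \sqrt{S A H^2 T} \iff H > SA .
\]
```

For example, with S = 5 and A = 2, the √(H³T) term is the larger of the two once H > 10, so a bound without that term would be smaller in exactly this long-horizon regime.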

slide-5
SLIDE 5

Motivation: Need for Accountability in Online RL

current episode

Even with PAC + regret bounds: expected return in next episode during learning unknown

slide-6
SLIDE 6

Motivation: Need for Accountability in Online RL

How good will my treatment be? Is it the best possible?

slide-7
SLIDE 7

Our Proposal: Algorithms output policy certificates before each episode

slide-8
SLIDE 8

Algorithms with policy certificates

Natural extension of model-based optimistic algorithms:

  1. UCB on optimal value function
  2. Greedy policy
  3. LCB on value function of current policy
  4. Output certificate
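
The four steps above can be sketched for a tabular episodic MDP as follows. This is a minimal illustration, not the algorithm from the talk. The function name `certified_greedy_policy`, its arguments, and the crude Hoeffding-style bonus are all assumptions made for the example. The certificate is taken to be the gap between the upper and lower bound at the start state, which is one natural reading of step 4.

```python
import numpy as np

def certified_greedy_policy(P_hat, R_hat, counts, H, s0, delta=0.1):
    """Plan once from an empirical model and return (policy, certificate).

    P_hat:  (S, A, S) empirical transition probabilities
    R_hat:  (S, A) empirical mean rewards, assumed to lie in [0, 1]
    counts: (S, A) number of visits to each state-action pair
    """
    S, A = R_hat.shape
    # Hypothetical confidence width; the talk's bonuses are tighter than this.
    bonus = H * np.sqrt(np.log(2 * S * A * H / delta) / np.maximum(counts, 1))

    V_up = np.zeros((H + 1, S))   # upper bounds on the optimal value
    V_lo = np.zeros((H + 1, S))   # lower bounds on the greedy policy's value
    policy = np.zeros((H, S), dtype=int)
    states = np.arange(S)

    for h in range(H - 1, -1, -1):
        remaining = H - h  # values at step h cannot exceed the steps left
        # Step 1: UCB on the optimal Q-function (optimistic backup).
        Q_up = np.clip(R_hat + P_hat @ V_up[h + 1] + bonus, 0, remaining)
        # Step 2: greedy policy with respect to the upper bound.
        policy[h] = Q_up.argmax(axis=1)
        # Step 3: LCB on the Q-function of that same policy (pessimistic backup).
        Q_lo = np.clip(R_hat + P_hat @ V_lo[h + 1] - bonus, 0, remaining)
        V_up[h] = Q_up[states, policy[h]]
        V_lo[h] = Q_lo[states, policy[h]]

    # Step 4: certificate = width of the interval at the start state.
    certificate = V_up[0, s0] - V_lo[0, s0]
    return policy, certificate
```

With no data (`counts` all zero) the interval is vacuous and the certificate equals H. As visit counts grow, the bonus shrinks and so does the certificate.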

slide-12
SLIDE 12

Symbiosis of Optimism and Certificates

Certificates:

  • Challenge: [formula missing] random
  • Insight from optimism: [formula missing] at known rate

Optimism:

  • Challenge: exploration bonus depends on [formula missing]
  • Insight from certificates: bound by [formula missing]

More accountable algorithms through accurate policy certificates
Better exploration bonuses yield minimax-optimal PAC & regret bounds