SLIDE 1

Learning to Collaborate in Markov Decision Processes

Goran Radanovic, Rati Devidze, David C. Parkes, Adish Singla

SLIDE 2

Motivation: Human-AI Collaboration

Example setting [Dimitrakakis et al., NIPS 2017]:

[Diagram: the Helper-AI (Agent A1) commits to a policy π1; the Human (Agent A2) (best) responds to π1; the two agents act together on a shared Task.]

  • Behavioral differences: the agents have different models of the world.

SLIDE 3

Motivation: Human-AI Collaboration

[Diagram: the Helper-AI (Agent A1) commits to a policy π1; Agent A2's policy π2 changes over time as they work on the Task.]

Humans change/adapt their behavior over time.

Can we use learning to obtain a good policy for A1 despite the changing behavior of A2, without modeling A2's learning dynamics?

SLIDE 4

Formal Model: Two-agent MDP

  • Episodic two-agent MDP with commitments: in each episode, Agent A1 commits to a policy and Agent A2 responds to it.
  • Goal: design a learning algorithm for A1 that achieves sublinear regret.

    – This implies near-optimality for smooth MDPs.

From A1's perspective, rewards and transitions are non-stationary.
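As a rough formalization of the objective (the notation here is illustrative, not taken from the slides): in episode t, A1 commits to a policy π1^t, A2 responds with π2^t, and V(π1, π2) denotes A1's expected return for the episode. One standard benchmark compares A1 against the best fixed commitment in hindsight:

  R(T) = max_{π1} Σ_{t=1}^{T} V(π1, π2^t) − Σ_{t=1}^{T} V(π1^t, π2^t)

Sublinear regret means R(T)/T → 0, i.e., A1's average per-episode return approaches that of the best fixed commitment.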

SLIDE 5

Experts with Double Recency Bias

  • Based on experts in MDPs [Even-Dar et al., NIPS 2005]:

    – Assign an experts algorithm to each state.
    – Use Q-values as the experts' losses.

  • Introduce a double recency bias, with two components:

" − 1 " − % Recency windowing &',) Recency modulation *' = 1 Γ -

)./

&',) !

SLIDE 6

Main Results (Informally)

Theorem: The regret of ExpDRBias decays sublinearly in the number of episodes, provided that the magnitude of change of A2's policy is o(1).

Theorem: Assume that the magnitude of change of A2's policy is Ω(1). Then achieving sublinear regret is at least as hard as learning parity with noise.
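One way to read "magnitude of change" (an assumed formalization, for illustration only) is the per-episode drift of A2's policy, e.g.

  Δ_t = max_s ‖ π2^{t+1}(· | s) − π2^t(· | s) ‖_1

Under this reading, the positive result asks the drift to vanish, Δ_t = o(1), while the hardness result applies when the drift stays Ω(1).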

SLIDE 7

Thank you!

  • Visit me at the poster session!
