SLIDE 1

Learning to Collaborate in Markov Decision Processes

Goran Radanovic, Rati Devidze, David C. Parkes, Adish Singla

SLIDE 2

Motivation: Human-AI Collaboration

Example setting [Dimitrakakis et al., NIPS 2017]:

[Diagram: the Helper-AI (Agent A1) commits to a policy π1; the Human (Agent A2) (best) responds to π1; the two agents act together on a shared Task.]

  • Behavioral differences: the agents have different models of the world.

SLIDE 3

Motivation: Human-AI Collaboration

[Diagram: the Helper-AI (Agent A1) commits to a policy π1; Agent A2's policy π2 changes over time as they work on the Task.]

Humans change/adapt their behavior over time.

Can we use learning to obtain a good policy for A1 despite the changing behavior of A2, without modeling A2's learning dynamics?

SLIDE 4

Formal Model: Two-agent MDP

  • Episodic two-agent MDP with commitments: in each episode, Agent A1 commits to a policy and Agent A2 responds to it.
  • Goal: design a learning algorithm for A1 that achieves sublinear regret.

    – This implies near-optimality for smooth MDPs.

From A1's perspective, rewards and transitions are non-stationary.
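As a rough formalization of the objective (the notation here is illustrative, not taken from the slides): in episode t, A1 commits to a policy π1^t, A2 responds with π2^t, and V(π1, π2) denotes A1's expected return for the episode. One standard benchmark compares A1 against the best fixed commitment in hindsight:

  R(T) = max_{π1} Σ_{t=1}^{T} V(π1, π2^t) − Σ_{t=1}^{T} V(π1^t, π2^t)

Sublinear regret means R(T)/T → 0, i.e., A1's average per-episode return approaches that of the best fixed commitment.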

SLIDE 5

Experts with Double Recency Bias

  • Based on experts in MDPs [Even-Dar et al., NIPS 2005]:

    – Assign an experts algorithm to each state.
    – Use Q-values as the experts' losses.

  • Introduce a double recency bias, with two components:

" − 1 " − % Recency windowing &',) Recency modulation *' = 1 Γ -

)./

&',) !

SLIDE 6

Main Results (Informally)

Theorem: The regret of ExpDRBias decays sublinearly in the number of episodes, provided that the magnitude of change of A2's policy is o(1).

Theorem: Assume that the magnitude of change of A2's policy is Ω(1). Then achieving sublinear regret is at least as hard as learning parity with noise.
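One way to read "magnitude of change" (an assumed formalization, for illustration only) is the per-episode drift of A2's policy, e.g.

  Δ_t = max_s ‖ π2^{t+1}(· | s) − π2^t(· | s) ‖_1

Under this reading, the positive result asks the drift to vanish, Δ_t = o(1), while the hardness result applies when the drift stays Ω(1).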

SLIDE 7

Thank you!

  • Visit me at the poster session!
