Feedback Control for Learning in Games Gurdal ARSLAN & Jeff - - PowerPoint PPT Presentation



SLIDE 1

Feedback Control for Learning in Games

Gurdal ARSLAN & Jeff SHAMMA

Mechanical and Aerospace Engineering UCLA

SLIDE 2

Setup: Repeated Games

  • Time k = 1,2,3,…
  • Player i:

– Strategy:

pi(k) ∈ ∆

– Action:

ai(k) = rand[pi(k)]

– Payoff:

Ui(ai, a−i) = aiᵀ Mi a−i

– Play:

pi(k) = F(information up to time k)

  • Assume players do not share utilities!
  • Separate issues: will players compute NE? Should they?

How can simple rules lead players to mixed strategy Nash equilibrium?
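For concreteness, one stage of this repeated game can be sketched as follows; the sampling step ai(k) = rand[pi(k)] and the bilinear payoff aiᵀ Mi a−i mirror the definitions above, while the matching-pennies matrices are an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def play_round(p1, p2, M1, M2):
    """One stage of the repeated game: each player samples an action from
    its mixed strategy (a_i = rand[p_i]) and receives the bilinear payoff
    U_i(a_i, a_-i) = a_i^T M_i a_-i."""
    a1 = rng.choice(len(p1), p=p1)
    a2 = rng.choice(len(p2), p=p2)
    e1 = np.eye(len(p1))[a1]          # one-hot encoding of the sampled actions
    e2 = np.eye(len(p2))[a2]
    return (a1, a2), (e1 @ M1 @ e2, e2 @ M2 @ e1)

# Matching pennies (zero-sum): the two payoffs always cancel.
M1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
M2 = -M1.T
(a1, a2), (u1, u2) = play_round([0.5, 0.5], [0.5, 0.5], M1, M2)
```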

SLIDE 3

Prior Work & Convergence

  • (Stochastic) Fictitious Play
  • No Regret
  • New approaches: Multirate, Joint weak calibration, Regret testing, …

  • Convergence results:

– Special cases: NE
– Correlated equilibria
– Convex hull of NE
– “Dwell” near NE
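For reference, here is a minimal discrete-time fictitious play loop, in which each player best-responds to the opponent's empirical action frequencies; the coordination-game payoff matrix is an illustrative choice on which FP is known to converge to a pure NE:

```python
import numpy as np

def fictitious_play(M1, M2, steps=2000):
    """Discrete-time fictitious play: each player best-responds to the
    opponent's empirical frequencies, maintained as running averages
    (seeded with a uniform prior)."""
    n, m = M1.shape
    q1, q2 = np.ones(n) / n, np.ones(m) / m   # empirical frequencies
    for k in range(1, steps + 1):
        a1 = int(np.argmax(M1 @ q2))          # best response to opponent history
        a2 = int(np.argmax(M2 @ q1))
        q1 += (np.eye(n)[a1] - q1) / (k + 1)  # incremental running average
        q2 += (np.eye(m)[a2] - q2) / (k + 1)
    return q1, q2

# Coordination game: both players lock onto the payoff-dominant action.
C = np.array([[2.0, 0.0], [0.0, 1.0]])
q1, q2 = fictitious_play(C, C.T)
```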

SLIDE 4

Non-convergence Results

  • Shapley game vs Fictitious Play
  • Crawford (1985): a wide class of learning mechanisms must fail to converge to mixed strategies.

  • Jordan anticoordination game: 3 players, each with 2 moves.
  • Hart & Mas-Colell (2003): consider a larger class & show: Uncoupled + Jordan anticoordination ⇒ non-convergence.

[Figure: payoff diagram of the three-player Jordan anticoordination game (P1, P2, P3)]

SLIDE 5

Preview

  • Introduce new uncoupled dynamics based on “feedback control”.
  • Demonstrate how convergence to mixed strategy NE can be enabled (including Shapley & Jordan games).

  • Best/Better response variants.
  • Action/Payoff based versions.
  • Two/Multi-player cases.
SLIDE 6

Feedback Control

  • K = controller = sequential decision maker
  • P = process with approximate model Pmodel
  • Think of “standing upright”

[Block diagram: desired behavior → (+/−) → error → controller K → process P (with disturbance) → actual behavior, fed back to the comparator]

SLIDE 7

What’s the Connection?

  • FB → GT:

– New initiatives in “cooperative control” (combat systems, networks, self-assembly, automata teams, …) require a general-sum formulation.

  • GT → FB:

DMi is in feedback with DM-i

[Figure: network of decision makers DM1–DM5 in mutual feedback]

SLIDE 8

Typical Controller: PID

  • Proportional + Integral + Derivative

– KP ⇒ current error
– KI ⇒ error history
– KD ⇒ error change

  • “Workhorse” of traditional control design.
  • Model of human motion control, homeostasis, …
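A minimal discrete-time PID sketch; the integrator process ẋ = u, the gains, and the setpoint below are illustrative choices, not taken from the slides:

```python
class PID:
    """Discrete PID: u = Kp*e + Ki*sum(e)*dt + Kd*(e - e_prev)/dt."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0        # accumulated error history (I term)
        self.e_prev = None         # last error, for the change estimate (D term)

    def step(self, error):
        self.integral += error * self.dt
        deriv = 0.0 if self.e_prev is None else (error - self.e_prev) / self.dt
        self.e_prev = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Regulate a simple integrator process x' = u toward the setpoint 1.0.
pid, x, dt = PID(2.0, 0.5, 0.1, 0.01), 0.0, 0.01
for _ in range(2000):
    u = pid.step(1.0 - x)          # error = desired - actual
    x += u * dt                    # Euler step of the process
```

With only the proportional term the loop already converges here; the integral term is what removes steady-state error once a constant disturbance enters the process.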
SLIDE 9

Derivative Action

[Figure: error e(t) extrapolated from the current time t to t+τ]

  • React to predicted error
  • Example: “Balancing”:
SLIDE 10

Repeated Games in Continuous Time

  • Empirical frequencies: qi(k) = (1/k) Σm≤k vi(m), where vi(m) ∈ ∆ is the vertex for action ai(m)
  • ODE method of stochastic approximation:

Deterministic continuous-time analysis ⇒ probabilistic discrete-time conclusions

SLIDE 11

Derivative Action FP (DAFP)

  • Define smoothed best response: βi(q−i) = softmax(λ Mi q−i)
  • FP: q̇i = βi(q−i) − qi
  • Derivative action FP: q̇i = βi(q−i + γ q̇−i) − qi
  • “First order” model of adversary: a moving target.
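Under stated assumptions, these dynamics can be sketched with a crude Euler step; the logit form of the smoothed best response and the matching-pennies matrices are illustrative, and the opponent derivative q̇−i is approximated by the previous step's vector field:

```python
import numpy as np

def smoothed_BR(M, q, lam=2.0):
    """Logit (smoothed) best response: softmax of the expected payoff vector."""
    v = lam * (M @ q)
    e = np.exp(v - v.max())        # shift for numerical stability
    return e / e.sum()

# Euler sketch of DAFP on matching pennies: each player best-responds to a
# first-order prediction q_-i + gamma * qdot_-i of the opponent's frequency.
M1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
M2 = -M1.T
gamma, dt = 0.3, 0.01
q1, q2 = np.array([0.9, 0.1]), np.array([0.2, 0.8])
d1, d2 = np.zeros(2), np.zeros(2)          # previous-step derivative estimates
for _ in range(30000):
    b1 = smoothed_BR(M1, q2 + gamma * d2)  # respond to the "moving target"
    b2 = smoothed_BR(M2, q1 + gamma * d1)
    d1, d2 = b1 - q1, b2 - q2              # qdot_i = beta_i - q_i
    q1, q2 = q1 + dt * d1, q2 + dt * d2
# q1, q2 drift toward the unique mixed NE (1/2, 1/2)
```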
SLIDE 12

Ideal vs Approximate

  • Ideal ⇒ Implicit Equations
  • Approximate:
  • Use of ideal differentiators can always lead to NE (a misleading conclusion).

SLIDE 13

Approximate Differentiator

  • Define:
  • Asymptotically
  • Two-player implementation:
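One standard first-order construction consistent with this slide filters the observed signal and scales the filter error: ṙ = λ(q − r), with λ(q − r) serving as the derivative estimate (exact as λ → ∞). The sine test signal and the gains below are illustrative:

```python
import math

# First-order approximate differentiator: given samples of q(t), the filter
# r' = lam * (q - r) yields lam * (q - r) ≈ q'(t) for large lam.
lam, dt = 50.0, 1e-3
r, t = 0.0, 0.0
est, true = [], []
for _ in range(5000):
    q = math.sin(t)
    deriv_est = lam * (q - r)      # derivative estimate from the filter error
    r += dt * deriv_est            # Euler step of the filter state
    est.append(deriv_est)
    true.append(math.cos(t))       # exact derivative, for comparison
    t += dt
```

After the initial transient dies out, the estimate tracks cos(t) with only a small phase lag of about 1/λ.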
SLIDE 14

Local Convergence of DAFP

  • Theorem: Consider a two-player game with a NE.

1) If the NE is stable under FP, it remains stable under DAFP.
2) If the NE is unstable under FP, it can be stabilized under DAFP by a suitable choice of γ, depending on the eigenvalues of the linearized FP dynamics.

SLIDE 15

Jordan Anticoordination Revisited

  • Unique mixed NE is unstable under FP.
  • The eigenvalue condition holds, hence the NE is stabilizable by DAFP.
SLIDE 16

Extensions to “Gradient Play”

  • “Better Response” = GP
  • DAGP: gradient play with derivative action.
  • Theorem: similar result, using the eigenvalues of the linearized GP dynamics.
  • Shapley & Jordan games convergent.
SLIDE 17

Crawford & Conlisk

  • Crawford (1985): Nonconvergence of a class of algorithms.
  • Conlisk (1993): “Adaptation in games: Two solutions to the Crawford puzzle”, J. of Economic Behavior and Organization.

– Two-player zero-sum games
– Play in “rounds” (…, R−1, R, R+1, …)
– In round R+1, adjust the mixed strategy using a “forecast” payoff based on rounds R & R−1

SLIDE 18

Discrete Time

  • Theorem: Local attractor in continuous time ⇒ positive probability of convergence to NE in discrete time.
  • …as opposed to zero probability.
SLIDE 19

Payoff Based Rules

  • Use “stimulus response”
  • Theorem: Positive probability of convergence to NE.
SLIDE 20

Jordan Anticoordination: Payoff Based DAGP

γ = 1, λ = 50, ε = 0.1

SLIDE 21

Multiplayer Games

  • Immediate extensions in the case of a “pair-wise utility” structure.
  • Otherwise, must inspect a “joint-action” version of FP.
SLIDE 22

Concluding Remarks

  • Feedback control motivates the use of auxiliary dynamics to enable NE convergence.
  • Other “controller” structures are possible (all mixed-strategy equilibria “stabilizable”).

  • DAFP & DAGP respect “graph” structures.
  • Key concerns:

– Natural?
– Strategic?