SLIDE 1
Feedback Control for Learning in Games
Gurdal ARSLAN & Jeff SHAMMA
Mechanical and Aerospace Engineering, UCLA
SLIDE 2
Setup: Repeated Games
- Time k = 1,2,3,…
- Player i:
– Strategy: pi(k) ∈ ∆
– Action: ai(k) = rand[pi(k)]
– Payoff: Ui(ai, a-i) = ai^T Mi a-i
– Play: pi(k) = F(information up to time k)
- Assume players do not share utilities!
- Separate issues: Will they? Should they? Can they compute NE?
How can simple rules lead players to mixed strategy Nash equilibrium?
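As a minimal sketch of this setup, assuming a two-player bilinear game with hypothetical 3x3 payoff matrices M1 and M2 (not the games discussed later), one stage of the repeated game could look like:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical payoff matrices for a two-player example (illustration only).
M1 = rng.standard_normal((3, 3))   # player 1's payoff matrix
M2 = rng.standard_normal((3, 3))   # player 2's payoff matrix

def play_stage(p1, p2):
    """One stage: sample actions from mixed strategies, return realized payoffs."""
    a1 = rng.choice(len(p1), p=p1)   # a_i(k) = rand[p_i(k)]
    a2 = rng.choice(len(p2), p=p2)
    u1 = M1[a1, a2]                  # U_1 = e_{a1}^T M_1 e_{a2}
    u2 = M2[a2, a1]                  # U_2 = e_{a2}^T M_2 e_{a1}
    return a1, a2, u1, u2

p1 = np.ones(3) / 3                  # uniform mixed strategies
p2 = np.ones(3) / 3
print(play_stage(p1, p2))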
SLIDE 3
Prior Work & Convergence
- (Stochastic) Fictitious Play
- No Regret
- New approaches: Multirate, Joint weak calibration, Regret testing, …
- Convergence results:
– Special cases: NE
– Correlated equilibria
– Convex hull of NE
– “Dwell” near NE
SLIDE 4
Non-convergence Results
- Shapley game vs Fictitious Play
- Crawford (1985): a wide class of learning mechanisms must fail to converge to mixed strategies.
- Jordan anticoordination game: 3 players, each with 2 moves.
- Hart & Mas-Colell (2003): Consider larger class & show Uncoupled + Jordan anticoordination = non-convergence
[Diagram: Jordan anticoordination game among players P1, P2, P3]
SLIDE 5
Preview
- Introduce new uncoupled dynamics based on “feedback control”.
- Demonstrate how convergence to mixed strategy NE can be enabled (including Shapley & Jordan games).
- Best/Better response variants.
- Action/Payoff based versions.
- Two/Multi-player cases.
SLIDE 6
Feedback Control
- K = controller = sequential decision maker
- P = process with approximate model Pmodel
- Think of “standing upright”
[Block diagram: desired behavior + error → controller K → process P (with disturbance) → actual behavior, fed back with negative sign]
SLIDE 7
What’s the Connection?
- FB → GT:
– New initiatives in “cooperative control” (combat systems, networks, self-assembly, automata teams…) require general sum formulation.
- GT → FB:
– DMi is in feedback with DM-i
[Diagram: interconnected decision makers DM1–DM5]
SLIDE 8
Typical Controller: PID
- Proportional + Integral + Derivative
– KP ⇒ current error
– KI ⇒ error history
– KD ⇒ error change
- “Workhorse” of traditional control design.
- Model of human motion control, homeostasis, …
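A minimal discrete-time sketch of such a PID update, with hypothetical gains kp, ki, kd and step dt (a generic illustration, not a controller from the talk):

def pid_step(error, state, kp, ki, kd, dt):
    """One discrete-time PID update; state = (accumulated error, previous error)."""
    integral, prev_error = state
    integral += error * dt                             # K_I: error history
    derivative = (error - prev_error) / dt             # K_D: error change
    u = kp * error + ki * integral + kd * derivative   # K_P: current error
    return u, (integral, error)

# e.g. u, state = pid_step(error=0.3, state=(0.0, 0.25), kp=1.0, ki=0.1, kd=0.05, dt=0.01)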
SLIDE 9
Derivative Action
[Plot: error e extrapolated from time t (now) to t+τ]
- React to predicted error
- Example: “Balancing”:
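The prediction behind derivative action can be written as a standard first-order extrapolation (stated here as an assumption):

$$ e(t+\tau) \;\approx\; e(t) + \tau\,\dot e(t), $$

so reacting to the predicted error e(t+τ) combines proportional and derivative action.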
SLIDE 10
Repeated Games in Continuous Time
- Empirical frequencies:
- ODE method of stochastic approximation:
– Deterministic continuous-time analysis ⇒ probabilistic discrete-time conclusions
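In the usual stochastic-approximation form, assuming the standard definitions rather than the exact expressions intended here, the empirical frequencies and their mean ODE are:

$$ q_i(k) = \frac{1}{k}\sum_{j=1}^{k} a_i(j), \qquad q_i(k+1) = q_i(k) + \frac{1}{k+1}\bigl(a_i(k+1) - q_i(k)\bigr), \qquad \dot q_i = p_i - q_i . $$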
SLIDE 11
Derivative Action FP (DAFP)
- Define smoothed best response:
- FP:
- Derivative action FP:
- “First order” model of adversary: Moving target.
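One common way to write these dynamics, assuming a logit smoothed best response with smoothing parameter $\varepsilon$ and derivative gain $\gamma$ (a hedged sketch, not necessarily the exact formulas intended here):

$$ \beta_i(q_{-i}) = \arg\max_{p_i \in \Delta}\Bigl( p_i^{T} M_i q_{-i} + \varepsilon\,H(p_i) \Bigr), \qquad \text{FP: } \dot q_i = \beta_i(q_{-i}) - q_i, \qquad \text{DAFP: } \dot q_i = \beta_i\bigl(q_{-i} + \gamma\,\widehat{\dot q}_{-i}\bigr) - q_i, $$

where $H(\cdot)$ is the entropy and $\widehat{\dot q}_{-i}$ estimates the opponent's drift (the "moving target").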
SLIDE 12
Ideal vs Approximate
- Ideal ⇒ Implicit Equations
- Approximate:
- Use of ideal differentiators can always lead to NE (a misleading conclusion).
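To see why ideal differentiation leads to implicit equations, take the DAFP form sketched above as an assumption: an ideal differentiator uses $\dot q_{-i} = p_{-i} - q_{-i}$, so the instantaneous strategies must satisfy the coupled equations

$$ p_i = \beta_i\bigl(q_{-i} + \gamma\,(p_{-i} - q_{-i})\bigr), \qquad i = 1,2, $$

in which each player's strategy appears on both sides through the other's; an approximate differentiator keeps the update explicit.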
SLIDE 13
Approximate Differentiator
- Define:
- Asymptotically
- Two-player implementation:
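A standard first-order approximate differentiator, given here as an assumed form of the construction referenced above: each player filters the opponent's empirical frequency as

$$ \dot v_i = \lambda\,(q_{-i} - v_i), \qquad r_i = \lambda\,(q_{-i} - v_i), $$

so that $r_i \approx \dot q_{-i}$ for large $\lambda$ (the filter $\lambda s/(s+\lambda)$ approximates the differentiator $s$ at low frequencies), and $r_i$ replaces the ideal derivative in DAFP.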
SLIDE 14
Local Convergence of DAFP
- Theorem: Consider a two-player game with a NE.
– 1) FP stable at the NE ⇒ DAFP stable at the NE.
– 2) FP unstable at the NE, but a condition on the eigenvalues of the linearized dynamics holds ⇒ DAFP stable at the NE for suitable parameters.
SLIDE 15
Jordan Anticoordination Revisited
- Unique mixed NE is unstable under FP.
- The eigenvalue condition is satisfied, hence the NE is stabilizable by DAFP.
SLIDE 16
Extensions to “Gradient Play”
- “Better Response” = GP
- DAGP (sketched below):
- Theorem: Similar result, using the eigenvalues of the linearized GP dynamics.
- Shapley & Jordan games convergent.
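One common form of gradient play and its derivative-action variant, written here as a hedged sketch with $\Pi_\Delta$ denoting projection onto the simplex (not necessarily the exact dynamics intended):

$$ \text{GP: } \dot q_i = \Pi_\Delta\bigl[q_i + M_i q_{-i}\bigr] - q_i, \qquad \text{DAGP: } \dot q_i = \Pi_\Delta\bigl[q_i + M_i\bigl(q_{-i} + \gamma\,\widehat{\dot q}_{-i}\bigr)\bigr] - q_i . $$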
SLIDE 17
Crawford & Conlisk
- Crawford (1985): Nonconvergence of a class of algorithms.
- Conlisk (1993): “Adaptation in games: Two solutions to the Crawford puzzle”, J. of Economic Behavior and Organization.
– Two-player zero-sum games
– Play in “rounds” (…, R-1, R, R+1, …)
– On round R+1, adjust the mixed strategy using a “forecast” payoff based on intervals R & R-1
SLIDE 18
Discrete Time
- Theorem: Local attractor in continuous time ⇒ positive probability of convergence to NE in discrete time.
- …as opposed to zero probability.
SLIDE 19
Payoff Based Rules
- Use “stimulus response”
- Theorem: Positive probability of convergence to NE.
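A generic stimulus-response (reinforcement-style) update, offered as a hedged illustration of payoff-based adjustment rather than the specific payoff-based DAGP rule:

import numpy as np

def stimulus_response_update(propensities, action, payoff, decay=0.99):
    """Generic stimulus-response update: the realized payoff reinforces the
    action that produced it; the mixed strategy is the normalized propensities.
    Propensities should be initialized to positive values."""
    propensities = decay * propensities            # slowly forget old stimuli
    propensities[action] += max(payoff, 0.0)       # reinforce the played action
    strategy = propensities / propensities.sum()   # mixed strategy
    return propensities, strategy

props = np.ones(3)                                 # positive initial propensities
props, strategy = stimulus_response_update(props, action=1, payoff=0.7)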
SLIDE 20
Jordan Anticoordination: Payoff Based DAGP
[Simulation plot; parameters: γ = 1, λ = 50, ε = 0.1]
SLIDE 21
Multiplayer Games
- Immediate extensions in case of “pair-wise utility” structure (one such form sketched below):
- Otherwise, must inspect “joint-action” version of FP.
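One way to state a pair-wise utility structure, assuming a form consistent with the bilinear two-player setup (with $N_i$ a hypothetical set of neighbors of player $i$):

$$ U_i(a) = \sum_{j \in N_i} a_i^{T} M_{ij}\, a_j, $$

so each pairwise term can be handled with the two-player dynamics along the corresponding edge of the interaction graph.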
SLIDE 22
Concluding Remarks
- Feedback control motivates the use of auxiliary dynamics to enable NE convergence.
- Other “controller” structures possible (all mixed strategy equilibria “stabilizable”).
- DAFP & DAGP respect “graph” structures.
- Key concerns: