SLIDE 1
Feedback Control for Learning in Games
Gurdal ARSLAN & Jeff SHAMMA
Mechanical and Aerospace Engineering, UCLA
SLIDE 2
Setup: Repeated Games
- Time k = 1,2,3,…
- Player i:
– Strategy: pi(k) ∈ ∆
– Action: ai(k) = rand[pi(k)]
– Payoff: Ui(ai, a-i) = ai^T Mi a-i
– Play: pi(k) = F(information up to time k)
- Assume players do not share utilities!
- Separate issues: Will they? Should they? Can they compute NE?
How can simple rules lead players to mixed strategy Nash equilibrium?
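As a minimal sketch of this setup, assuming a two-player bilinear game with hypothetical 3x3 payoff matrices M1 and M2 (not the games discussed later), one stage of the repeated game could look like:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical payoff matrices for a two-player example (illustration only).
M1 = rng.standard_normal((3, 3))   # player 1's payoff matrix
M2 = rng.standard_normal((3, 3))   # player 2's payoff matrix

def play_stage(p1, p2):
    """One stage: sample actions from mixed strategies, return realized payoffs."""
    a1 = rng.choice(len(p1), p=p1)   # a_i(k) = rand[p_i(k)]
    a2 = rng.choice(len(p2), p=p2)
    u1 = M1[a1, a2]                  # U_1 = e_{a1}^T M_1 e_{a2}
    u2 = M2[a2, a1]                  # U_2 = e_{a2}^T M_2 e_{a1}
    return a1, a2, u1, u2

p1 = np.ones(3) / 3                  # uniform mixed strategies
p2 = np.ones(3) / 3
print(play_stage(p1, p2))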
SLIDE 3
Prior Work & Convergence
- (Stochastic) Fictitious Play
- No Regret
- New approaches: Multirate, Joint weak calibration, Regret testing, …
- Convergence results:
– Special cases: NE
– Correlated equilibria
– Convex hull of NE
– “Dwell” near NE
SLIDE 4
Non-convergence Results
- Shapley game vs Fictitious Play
- Crawford (1985): a wide class of learning mechanisms must fail to converge to mixed strategies.
- Jordan anticoordination game: 3 players, each with 2 moves.
- Hart & Mas-Colell (2003): Consider larger class & show Uncoupled + Jordan anticoordination = non-convergence
[Diagram: Jordan anticoordination game among players P1, P2, P3]
SLIDE 5
Preview
- Introduce new uncoupled dynamics based on “feedback control”.
- Demonstrate how convergence to mixed strategy NE can be enabled (including Shapley & Jordan games).
- Best/Better response variants.
- Action/Payoff based versions.
- Two/Multi-player cases.
SLIDE 6
Feedback Control
- K = controller = sequential decision maker
- P = process with approximate model Pmodel
- Think of “standing upright”
[Block diagram: desired behavior + error → controller K → process P (with disturbance) → actual behavior, fed back with negative sign]
SLIDE 7
What’s the Connection?
- FB → GT:
– New initiatives in “cooperative control” (combat systems, networks, self-assembly, automata teams…) require general sum formulation.
- GT → FB:
– DMi is in feedback with DM-i
[Diagram: interconnected decision makers DM1–DM5]
SLIDE 8
Typical Controller: PID
- Proportional + Integral + Derivative
– KP ⇒ current error
– KI ⇒ error history
– KD ⇒ error change
- “Workhorse” of traditional control design.
- Model of human motion control, homeostasis, …
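A minimal discrete-time sketch of such a PID update, with hypothetical gains kp, ki, kd and step dt (a generic illustration, not a controller from the talk):

def pid_step(error, state, kp, ki, kd, dt):
    """One discrete-time PID update; state = (accumulated error, previous error)."""
    integral, prev_error = state
    integral += error * dt                             # K_I: error history
    derivative = (error - prev_error) / dt             # K_D: error change
    u = kp * error + ki * integral + kd * derivative   # K_P: current error
    return u, (integral, error)

# e.g. u, state = pid_step(error=0.3, state=(0.0, 0.25), kp=1.0, ki=0.1, kd=0.05, dt=0.01)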
SLIDE 9
Derivative Action
[Plot: error e extrapolated from time t (now) to t+τ]
- React to predicted error
- Example: “Balancing”:
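The prediction behind derivative action can be written as a standard first-order extrapolation (stated here as an assumption):

$$ e(t+\tau) \;\approx\; e(t) + \tau\,\dot e(t), $$

so reacting to the predicted error e(t+τ) combines proportional and derivative action.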
SLIDE 10
Repeated Games in Continuous Time
- Empirical frequencies:
- ODE method of stochastic approximation:
– Deterministic continuous-time analysis ⇒ probabilistic discrete-time conclusions
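In the usual stochastic-approximation form, assuming the standard definitions rather than the exact expressions intended here, the empirical frequencies and their mean ODE are:

$$ q_i(k) = \frac{1}{k}\sum_{j=1}^{k} a_i(j), \qquad q_i(k+1) = q_i(k) + \frac{1}{k+1}\bigl(a_i(k+1) - q_i(k)\bigr), \qquad \dot q_i = p_i - q_i . $$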
SLIDE 11
Derivative Action FP (DAFP)
- Define smoothed best response:
- FP:
- Derivative action FP:
- “First order” model of adversary: Moving target.
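One common way to write these dynamics, assuming a logit smoothed best response with smoothing parameter $\varepsilon$ and derivative gain $\gamma$ (a hedged sketch, not necessarily the exact formulas intended here):

$$ \beta_i(q_{-i}) = \arg\max_{p_i \in \Delta}\Bigl( p_i^{T} M_i q_{-i} + \varepsilon\,H(p_i) \Bigr), \qquad \text{FP: } \dot q_i = \beta_i(q_{-i}) - q_i, \qquad \text{DAFP: } \dot q_i = \beta_i\bigl(q_{-i} + \gamma\,\widehat{\dot q}_{-i}\bigr) - q_i, $$

where $H(\cdot)$ is the entropy and $\widehat{\dot q}_{-i}$ estimates the opponent's drift (the "moving target").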
SLIDE 12
Ideal vs Approximate
- Ideal ⇒ Implicit Equations
- Approximate:
- Use of ideal differentiators can always lead to NE (a misleading conclusion).
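To see why ideal differentiation leads to implicit equations, take the DAFP form sketched above as an assumption: an ideal differentiator uses $\dot q_{-i} = p_{-i} - q_{-i}$, so the instantaneous strategies must satisfy the coupled equations

$$ p_i = \beta_i\bigl(q_{-i} + \gamma\,(p_{-i} - q_{-i})\bigr), \qquad i = 1,2, $$

in which each player's strategy appears on both sides through the other's; an approximate differentiator keeps the update explicit.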
SLIDE 13
Approximate Differentiator
- Define:
- Asymptotically
- Two-player implementation:
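A standard first-order approximate differentiator, given here as an assumed form of the construction referenced above: each player filters the opponent's empirical frequency as

$$ \dot v_i = \lambda\,(q_{-i} - v_i), \qquad r_i = \lambda\,(q_{-i} - v_i), $$

so that $r_i \approx \dot q_{-i}$ for large $\lambda$ (the filter $\lambda s/(s+\lambda)$ approximates the differentiator $s$ at low frequencies), and $r_i$ replaces the ideal derivative in DAFP.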
SLIDE 14
Local Convergence of DAFP
- Theorem: Consider a two-player game with a NE.
– 1) FP stable at the NE ⇒ DAFP stable at the NE.
– 2) FP unstable at the NE, but a condition on the eigenvalues of the linearized dynamics holds ⇒ DAFP stable at the NE for suitable parameters.
SLIDE 15
Jordan Anticoordination Revisited
- Unique mixed NE is unstable under FP.
- The eigenvalue condition is satisfied, hence the NE is stabilizable by DAFP.
SLIDE 16
Extensions to “Gradient Play”
- “Better Response” = GP
- DAGP (sketched below):
- Theorem: Similar result, using the eigenvalues of the linearized GP dynamics.
- Shapley & Jordan games convergent.
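One common form of gradient play and its derivative-action variant, written here as a hedged sketch with $\Pi_\Delta$ denoting projection onto the simplex (not necessarily the exact dynamics intended):

$$ \text{GP: } \dot q_i = \Pi_\Delta\bigl[q_i + M_i q_{-i}\bigr] - q_i, \qquad \text{DAGP: } \dot q_i = \Pi_\Delta\bigl[q_i + M_i\bigl(q_{-i} + \gamma\,\widehat{\dot q}_{-i}\bigr)\bigr] - q_i . $$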
SLIDE 17
Crawford & Conlisk
- Crawford (1985): Nonconvergence of a class of algorithms.
- Conlisk (1993): “Adaptation in games: Two solutions to the Crawford puzzle”, J. of Economic Behavior and Organization.
– Two-player zero-sum games
– Play in “rounds” (…, R-1, R, R+1, …)
– On round R+1, adjust the mixed strategy using a “forecast” payoff based on intervals R & R-1
SLIDE 18
Discrete Time
- Theorem: Local attractor in continuous time ⇒ positive probability of convergence to NE in discrete time.
- …as opposed to zero probability.
SLIDE 19
Payoff Based Rules
- Use “stimulus response”
- Theorem: Positive probability of convergence to NE.
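A generic stimulus-response (reinforcement-style) update, offered as a hedged illustration of payoff-based adjustment rather than the specific payoff-based DAGP rule:

import numpy as np

def stimulus_response_update(propensities, action, payoff, decay=0.99):
    """Generic stimulus-response update: the realized payoff reinforces the
    action that produced it; the mixed strategy is the normalized propensities.
    Propensities should be initialized to positive values."""
    propensities = decay * propensities            # slowly forget old stimuli
    propensities[action] += max(payoff, 0.0)       # reinforce the played action
    strategy = propensities / propensities.sum()   # mixed strategy
    return propensities, strategy

props = np.ones(3)                                 # positive initial propensities
props, strategy = stimulus_response_update(props, action=1, payoff=0.7)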
SLIDE 20
Jordan Anticoordination: Payoff Based DAGP
[Simulation plot; parameters: γ = 1, λ = 50, ε = 0.1]
SLIDE 21
Multiplayer Games
- Immediate extensions in case of “pair-wise utility” structure (one such form sketched below):
- Otherwise, must inspect “joint-action” version of FP.
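One way to state a pair-wise utility structure, assuming a form consistent with the bilinear two-player setup (with $N_i$ a hypothetical set of neighbors of player $i$):

$$ U_i(a) = \sum_{j \in N_i} a_i^{T} M_{ij}\, a_j, $$

so each pairwise term can be handled with the two-player dynamics along the corresponding edge of the interaction graph.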
SLIDE 22
Concluding Remarks
- Feedback control motivates the use of auxiliary dynamics to enable NE convergence.
- Other “controller” structures possible (all mixed strategy equilibria “stabilizable”).
- DAFP & DAGP respect “graph” structures.
- Key concerns: