Empirical-evidence Equilibria in Stochastic Games
Nicolas Dudebout
Outline
- Stochastic games
- Empirical-evidence equilibria (EEEs)
- Open questions in EEEs
Stochastic Games
- Game theory
- Markov decision processes
Game Theory

Decision making: a single payoff $v : A \to \mathbb{R}$, with an optimal decision

$a^\star \in \arg\max_{a \in A} v(a)$

Game theory: one payoff per player, $v_1 : A_1 \times A_2 \to \mathbb{R}$ and $v_2 : A_1 \times A_2 \to \mathbb{R}$.

Nash equilibrium: a pair $(a_1^\star, a_2^\star)$ such that

$a_1^\star \in \arg\max_{a_1 \in A_1} v_1(a_1, a_2^\star)$
$a_2^\star \in \arg\max_{a_2 \in A_2} v_2(a_1^\star, a_2)$
Example: Battle of the Sexes
Payoff matrix (row player, column player):

        F       O
  F   2, 2    0, 1
  O   0, 0    1, 3

Nash equilibria:
- $(F, F)$
- $(O, O)$
- the mixed profile $(\tfrac{3}{4} F + \tfrac{1}{4} O,\; \tfrac{1}{3} F + \tfrac{2}{3} O)$
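These equilibria are easy to check numerically. Below is a minimal sketch (not from the talk) using NumPy: `is_nash` verifies that a pair of mixed strategies are mutual best responses, with `U1` and `U2` encoding the table above.

```python
import numpy as np

# Battle of the Sexes payoffs from the table above; rows/columns are (F, O).
U1 = np.array([[2.0, 0.0],   # row player
               [0.0, 1.0]])
U2 = np.array([[2.0, 1.0],   # column player
               [0.0, 3.0]])

def is_nash(p, q, tol=1e-9):
    """Check that mixed strategies p (row) and q (column) are mutual best responses."""
    row_payoffs = U1 @ q       # row player's payoff for each pure row against q
    col_payoffs = U2.T @ p     # column player's payoff for each pure column against p
    return (p @ row_payoffs >= row_payoffs.max() - tol
            and q @ col_payoffs >= col_payoffs.max() - tol)

F, O = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(is_nash(F, F))                                          # True
print(is_nash(O, O))                                          # True
print(is_nash(np.array([0.75, 0.25]), np.array([1/3, 2/3])))  # True
```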
Markov Decision Process (MDP)

- Dynamic: $y^+ \sim f(y, a)$, i.e., $y_{t+1} \sim f(y_t, a_t)$
- Stage cost: $v(y, a)$
- History: $h_t = (y_0, y_1, \dots, y_t, a_0, a_1, \dots, a_t)$
- Strategy: $g : H \to A$; with full state information, a Markov strategy $g : Y \to A$ suffices
- Utility: $U(g) = \mathbb{E}_{f,g}\!\left[\sum_{t=0}^{\infty} \delta^t\, v(y_t, a_t)\right]$
- Bellman's equation: $V^\star(y) = \max_{a \in A} \left\{ v(y, a) + \delta\, \mathbb{E}_f[V^\star(y^+) \mid y, a] \right\}$

Dynamic programming uses knowledge of $f$; reinforcement learning learns $f$ from repeated interaction.
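As a concrete instance of Bellman's equation, here is a small value-iteration sketch on a randomly generated MDP. The kernel `P`, the stage payoff `v`, and the discount `delta` are made-up illustrations, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(0)
nY, nA, delta = 4, 2, 0.9
P = rng.dirichlet(np.ones(nY), size=(nA, nY))  # P[a, y] is a distribution over y+
v = rng.uniform(size=(nY, nA))                 # stage payoff v(y, a), maximized as on the slide

V = np.zeros(nY)
for _ in range(10_000):
    # Bellman update: Q[y, a] = v(y, a) + delta * E_f[V(y+) | y, a]
    Q = v + delta * np.einsum("ayz,z->ya", P, V)
    V_new = Q.max(axis=1)
    if np.abs(V_new - V).max() < 1e-10:
        break
    V = V_new

g = Q.argmax(axis=1)  # a Markov strategy g : Y -> A, optimal for this MDP
print(V, g)
```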
Imperfect Information (POMDP)

- Dynamic: $x^+ \sim f(x, a)$
- Signal: $s \sim q(x)$
- History: $h_t = (s_0, s_1, \dots, s_t, a_0, a_1, \dots, a_t)$
- Strategy: $g : H \to A$; equivalently, a strategy over beliefs $g : \Delta(X) \to A$
- Belief: $\mathbb{P}_{f,q,g}[x \mid h]$
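The belief in the last line is maintained recursively by a Bayes filter. A minimal sketch, assuming finite sets and made-up kernels `F` (dynamic) and `Q` (signal):

```python
import numpy as np

def belief_update(b, a, s, F, Q):
    """One Bayes-filter step: b'(x+) is proportional to Q[x+, s] * sum_x b(x) F[a][x, x+]."""
    b_pred = F[a].T @ b        # predict: push the belief through the dynamic
    b_new = Q[:, s] * b_pred   # correct: weight by the signal likelihood
    return b_new / b_new.sum()

# Illustrative two-state example: F[a][x, x2] = P(x+ = x2 | x, a), Q[x, s] = P(s | x).
F = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
Q = np.array([[0.7, 0.3],
              [0.1, 0.9]])
b = np.array([0.5, 0.5])
b = belief_update(b, a=0, s=1, F=F, Q=Q)
print(b)
```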
Stochastic Games

- Dynamic: $x^+ \sim f(x, a_1, a_2)$
- Signals: $s_1 \sim q_1(x)$ and $s_2 \sim q_2(x)$
- Histories: $h^1_t = (s^1_0, \dots, s^1_t, a^1_0, \dots, a^1_t)$ and $h^2_t = (s^2_0, \dots, s^2_t, a^2_0, \dots, a^2_t)$
- Strategies: $g_1 : H_1 \to A_1$ and $g_2 : H_2 \to A_2$
- Beliefs: $\mathbb{P}_{f,q_1,g_1,q_2,g_2}[x, h_2 \mid h_1]$ and $\mathbb{P}_{f,q_1,g_1,q_2,g_2}[x, h_1 \mid h_2]$
Existing Approaches
- (Weakly) belief-free equilibrium
- Mean-field equilibrium
- Incomplete theories
Empirical-evidence Equilibria
Motivation
[Diagram: Agent 1 and Agent 2 interact through Nature.]
- 0. Pick arbitrary strategies
- 1. Formulate simple but consistent models
- 2. Design strategies optimal w.r.t. the models; then back to 1.
Empirical-evidence equilibrium is a fixed point:
- Strategies optimal w.r.t. models
- Models consistent with strategies
Example: Asset Management
An agent trades one asset on the stock market. Each agent builds a model based on:
- information published by the company
- observed trading activity
The resulting model can be very different for each agent.
Multiple to Single Agent

[Diagram: from Agent 1's viewpoint, Agent 2 is lumped together with Nature into an aggregate "Nature 1", reducing the game to a single-agent problem.]
Single Agent Setup

- Agent: $y^+ \sim f(y, a, s)$
- Nature: $x^+ \sim p(x, y, a)$, with signal $s \sim q(x)$
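A roll-out of this interconnection can be simulated directly. The sketch below uses randomly generated finite kernels; `P`, `Qk`, `Fk`, and the strategy `g` are illustrative assumptions, not the talk's.

```python
import numpy as np

rng = np.random.default_rng(0)
nX, nY, nA, nS = 3, 4, 2, 2

P  = rng.dirichlet(np.ones(nX), size=(nX, nY, nA))  # nature: p(x, y, a) -> dist over x+
Qk = rng.dirichlet(np.ones(nS), size=nX)            # signal: q(x) -> dist over s
Fk = rng.dirichlet(np.ones(nY), size=(nY, nA, nS))  # agent:  f(y, a, s) -> dist over y+
g  = rng.integers(nA, size=nY)                      # an arbitrary strategy y -> a

x, y = 0, 0
for t in range(5):
    a = int(g[y])                           # act
    s = int(rng.choice(nS, p=Qk[x]))        # observe a signal from nature
    x = int(rng.choice(nX, p=P[x, y, a]))   # nature's state moves
    y = int(rng.choice(nY, p=Fk[y, a, s]))  # agent's state moves
    print(t, x, y, a, s)
```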
Example: Asset Management

Instantiating the single-agent setup:
- State: holding $y \in \{0, \dots, N\}$
- Action: sell one, hold, or buy one, $a \in \{-1, 0, 1\}$
- Signal: price $s \in \{\text{Low}, \text{High}\}$
- Stage cost: $s \cdot a$
- Nature: $x$ represents market sentiment, political climate, and the other traders
Single Agent Setup

- True loop: $x^+ \sim p(x, y, a)$, $s \sim q(x)$, $y^+ \sim f(y, a, s)$, $a \sim g(h)$
- Model: replace the true signal $s$ by a modeled signal $\tilde{s}$, closing the agent's loop as $y^+ \sim f(y, a, \tilde{s})$
- $\tilde{s}$ consistent with $s$
- $g$ optimal w.r.t. $\tilde{s}$
Depth-$n$ Consistency

Consider a binary stochastic process $s$:

0100010001001010010110111010000111010101...

- $0$-characteristic: $\mathbb{P}[s = 0]$, $\mathbb{P}[s = 1]$
- $1$-characteristic: $\mathbb{P}[s s^+ = 00]$, $\mathbb{P}[s s^+ = 10]$, $\mathbb{P}[s s^+ = 01]$, $\mathbb{P}[s s^+ = 11]$
- ...
- $n$-characteristic: probabilities of all strings of length $n + 1$

Definition: Two processes $s$ and $s'$ are depth-$n$ consistent if they have the same $n$-characteristic.
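Empirical $n$-characteristics are straightforward to estimate from a sample path by counting windows. A small sketch (the tolerance in the consistency check is an assumption):

```python
from collections import Counter

def n_characteristic(seq, n):
    """Empirical n-characteristic: frequencies of all length-(n+1) substrings."""
    windows = [seq[i:i + n + 1] for i in range(len(seq) - n)]
    total = len(windows)
    return {w: c / total for w, c in Counter(windows).items()}

def depth_n_consistent(seq1, seq2, n, tol=0.05):
    """Approximate depth-n consistency: the empirical n-characteristics match within tol."""
    c1, c2 = n_characteristic(seq1, n), n_characteristic(seq2, n)
    return all(abs(c1.get(k, 0) - c2.get(k, 0)) < tol for k in set(c1) | set(c2))

s = "0100010001001010010110111010000111010101"
print(n_characteristic(s, 0))  # 0-characteristic: P[s = 0], P[s = 1]
print(n_characteristic(s, 1))  # 1-characteristic: P of 00, 01, 10, 11
```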
Depth-$n$ Consistency: Example

[Diagram: a Markov chain on model states $z_\emptyset$, $z_0$, $z_1$, with branching probabilities 0.5/0.5 and next-signal probabilities 0.3/0.7, illustrating a depth-1 model.]
Complete Picture

Fix a depth $n \in \mathbb{N}$.

- Nature: $x^+ \sim p(x, y, a)$, with signal $s \sim q(x)$
- Agent: $y^+ \sim f(y, a, s)$, $a \sim g(y, z)$
- Model: $z^+ \sim \sigma_m(z)$, $\tilde{s} \sim m(z)$
- Consistency map $g \mapsto m$: $m$ consistent with $g$
- Optimality map $m \mapsto g$: $g$ optimal w.r.t. $m$

The model state $z$ contains the last $n$ observed signals, and

$m(z = (s_1, s_2, \dots, s_n))[s_{n+1}] = \mathbb{P}_g[s_{t+1} = s_{n+1} \mid s_t = s_n, \dots, s_{t-n+1} = s_1]$
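The model side of this picture is a small Markov chain on windows of $n$ signals. A sketch, where the model kernel `m` is a made-up placeholder for a model produced by the consistency map:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2
# m[z] is a distribution over the next (binary) signal for each window z of length n.
windows = [tuple(map(int, f"{k:0{n}b}")) for k in range(2 ** n)]
m = {z: rng.dirichlet([1.0, 1.0]) for z in windows}

def sigma_m(z):
    """One step of z+ ~ sigma_m(z): draw a modeled signal from m(z), shift it into the window."""
    s_tilde = int(rng.choice(2, p=m[z]))
    return z[1:] + (s_tilde,), s_tilde

z = (0, 0)
for _ in range(5):
    z, s_tilde = sigma_m(z)
    print(z, s_tilde)
```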
Definition

$(g, m)$ is an empirical-evidence optimum (EEO) for $n$ iff
- $g$ is optimal w.r.t. $m$
- $m$ is depth-$n$ consistent with $g$

$(g, m)$ is an $\varepsilon$-empirical-evidence optimum ($\varepsilon$-EEO) for $n$ iff
- $g$ is $\varepsilon$-optimal w.r.t. $m$
- $m$ is depth-$n$ consistent with $g$
Existence Result

Theorem: For all $n$ and $\varepsilon$, there exists an $\varepsilon$-EEO for $n$.

Proof sketch:
- Prove continuity of the composed map $m \mapsto g \mapsto m$, with behavioral strategies $g : Y \times Z \to \Delta(A)$
- The model $m$ is parametrized over a simplex (convex and compact)
- Apply Brouwer's fixed-point theorem
Multiagent Setting

For each agent $i$:
- Agent: $y_i^+ \sim f_i(y_i, a_i, s_i)$, $a_i \sim g_i(y_i, z_i)$
- Model: $z_i^+ \sim \sigma_{m,i}(z_i)$, $\tilde{s}_i \sim m_i(z_i)$
- Nature: $x^+ \sim p(x, y, a)$, with signals $s_i \sim q_i(x)$

where $y = (y_1, y_2, \dots, y_N)$, $a = (a_1, a_2, \dots, a_N)$, and $s = (s_1, s_2, \dots, s_N)$.
Empirical-evidence Equilibrium

$(g, m)$ is an empirical-evidence equilibrium (EEE) for $\mathbf{n} = (n_1, n_2, \dots, n_N)$ iff
- for all $i$, $g_i$ is optimal w.r.t. $m_i$
- for all $i$, $m_i$ is depth-$n_i$ consistent with $g$

Theorem: For all $\mathbf{n}$ and $\varepsilon$, there exists an $\varepsilon$-EEE for $\mathbf{n}$.
Open Questions

- Endogenous models, in which the model depends on the agent's own actions
- Large numbers of agents
- Large depths $n$
- Relating EEEs to other concepts (mean-field equilibria, the social optimum)
- Offline computation
- Online learning using empirical evidence
Example: Asset Management

- State: holdings $y_i \in \{0, \dots, N\}$
- Action: sell one, hold, or buy one, $a_i \in \{-1, 0, 1\}$
- Signal: price $s \in \{\text{Low}, \text{High}\}$
- Dynamic: $y_i^+ = y_i + a_i$
- Stage cost: $s \cdot a_i$
- Nature: the market trend $T \in \{\text{Bull}, \text{Bear}\}$, with $x = (T, s)$; Nature is a sticky bear (the trend switches rarely)
Example: Asset Management

- 0. Pick arbitrary models $m$
- 1. Design strategies $g$ optimal w.r.t. the models $m$
- 2. Formulate consistent models $m^{\mathrm{upd}}$; then back to 1.

Depth-0 consistency: each model is a single prediction $m^i = \mathbb{P}[s = \text{High}]$, initialized at $m^1 = 1$ and $m^2 = 0$ and updated by the smoothed rule (see the sketch below)

$m^i_{t+1} = (1 - \beta)\, m^i_t + \beta\, m^{i,\mathrm{upd}}_t$
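A minimal simulation of this loop, under stated assumptions: a sticky two-state trend, fixed signal likelihoods, and actions that do not influence the price, so the consistent depth-0 model is simply the long-run frequency of High and each update smooths toward the latest observation. All numbers (`STICKY`, `P_HIGH`, `beta`) are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
STICKY = 0.95                          # probability the trend persists (assumed)
P_HIGH = {"Bull": 0.8, "Bear": 0.2}    # P(price = High | trend) (assumed)

def nature_step(trend):
    """Sticky-bear nature: the trend rarely switches; the price is High w.p. P_HIGH[trend]."""
    if rng.random() > STICKY:
        trend = "Bear" if trend == "Bull" else "Bull"
    return trend, rng.random() < P_HIGH[trend]

m = np.array([1.0, 0.0])   # depth-0 models: m1 = 1, m2 = 0 as on the slide
beta = 0.05
trend = "Bear"
for t in range(2000):
    trend, price_high = nature_step(trend)
    # Smoothed update toward the latest empirical evidence:
    # m_{t+1} = (1 - beta) * m_t + beta * m_upd
    m = (1 - beta) * m + beta * float(price_high)

print(m)  # both predictions approach the long-run P[price = High]
```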
Learning Results: Offline
[Plot: prediction $m^i_t[\text{High}]$ versus time $t$ from 0 to 100, for agents $i = 1$ and $i = 2$, converging from their initial values.]
Learning Results: Online
[Plot: prediction $m^i_t[\text{High}]$ versus time $t$ from 0 to 100, for agents $i = 1$ and $i = 2$, converging from their initial values.]
Empirical-evidence Equilibria

- Introduce: an equilibrium concept based on empirical evidence
- Contrast: with existing approaches to stochastic games
- Compute: via offline and online learning