Learning, Equilibria, Limitations, and Robots*
Michael Bowling
Computer Science Department
Carnegie Mellon University
*Joint work with Manuela Veloso
Talk Outline

Robots
– A two robot, adversarial, concurrent learning problem.
– The challenges for multiagent learning.
– Value function approximation, parameterized policies, state and temporal abstractions.
– Don't learn motion control or obstacle avoidance.
– Can predict our own state through latency, but not others': asymmetric partial observability.
Each of these limits agent behavior, sacrificing optimality.
All of these challenges involve agent limitations . . . their own and others'.
Restricted policy space Πi: any subset of the stochastic policies, π : S → PD(Ai).
Restricted best response BRi(π−i): the set of all policies from Πi that are optimal given the policies of the other players.
Restricted equilibrium: πi ∈ BRi(π−i) for every player, i.e., a strategy for each player such that no player can gain by deviating within its restricted set, given the other players continue to play the equilibrium.
Do Restricted Equilibria Exist?
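For reference, the three definitions above can be written compactly as below. The value criterion Vi (each player's expected reward under the joint policy) is my assumption about what "optimal" means here; the slides only say "optimal".

```latex
% Compact restatement of the definitions above. The value criterion
% V_i is an assumption; the slides only say "optimal".
\[
  \Pi_i \;\subseteq\; \{\, \pi \;:\; S \to PD(A_i) \,\}
  \qquad \text{(restricted policy space)}
\]
\[
  BR_i(\pi_{-i}) \;=\; \bigl\{\, \pi_i \in \Pi_i \;:\;
    V_i(\pi_i, \pi_{-i}) \,\ge\, V_i(\pi_i', \pi_{-i})
    \;\; \forall\, \pi_i' \in \Pi_i \,\bigr\}
  \qquad \text{(restricted best response)}
\]
\[
  \text{Restricted equilibrium:}\qquad
  \pi_i \in BR_i(\pi_{-i}) \;\;\text{for all players } i .
\]
```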
[Figure: payoff matrices for a stochastic game with states s0, sL, and sR and actions L and R.]
This game has no restricted equilibria!
*This counterexample is brought to you by Martin Zinkevich.
If, for all π−i, BRi(π−i) is convex, and each Πi is convex, then there exists a restricted equilibrium.
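Why convexity is the key condition: existence then follows from a standard fixed-point argument. The sketch below, via Kakutani's fixed-point theorem, is my own framing and adds compactness and closed-graph conditions the slide leaves implicit; it is not necessarily the proof from the talk.

```latex
% Sketch (my framing, via Kakutani's fixed-point theorem).
% Extra assumptions beyond the slide: each \Pi_i is compact, and each
% BR_i is non-empty with a closed graph.
\[
  \Pi \;=\; \Pi_1 \times \cdots \times \Pi_n, \qquad
  BR(\pi) \;=\; BR_1(\pi_{-1}) \times \cdots \times BR_n(\pi_{-n}) .
\]
% If \Pi is non-empty, compact, and convex, and BR is non-empty,
% convex-valued, and has a closed graph, then BR has a fixed point
% \pi^* \in BR(\pi^*), which is exactly a restricted equilibrium.
```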
None of these limitations are nice enough to guarantee the existence of restricted equilibria.
GraWoLF: Gradient-based WoLF
(Sutton et al., 2000)
– Policy improvement with parameterized policies.
– Takes steps in the direction of the gradient of the value:

    π(s, a) = e^(φ_sa · θ_k) / Σ_b e^(φ_sb · θ_k)

    θ_{k+1} = θ_k + α_k Σ_a φ_sa π(s, a) f_k(s, a)

– f_k is an approximation of the advantage function:

    f_k(s, a) ≈ Q(s, a) − V^π(s) ≈ Q(s, a) − Σ_b π(s, b) Q(s, b)
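A minimal sketch of the actor update above, assuming binary features φ_sa and a softmax (Gibbs) policy. The function and variable names are mine, and the advantage estimates f_k are taken as given rather than learned.

```python
import numpy as np

def policy(theta, phi_s):
    """pi(s, .) proportional to exp(phi_sa . theta) (Gibbs/softmax policy)."""
    prefs = phi_s @ theta          # one action preference per row of phi_s
    prefs -= prefs.max()           # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def actor_step(theta, phi_s, f_k, alpha):
    """theta <- theta + alpha * sum_a phi_sa * pi(s, a) * f_k(s, a)."""
    pi = policy(theta, phi_s)
    grad = (phi_s * (pi * f_k)[:, None]).sum(axis=0)
    return theta + alpha * grad

# Tiny usage example: 3 actions, 5 binary features per (s, a) pair.
rng = np.random.default_rng(0)
phi_s = rng.integers(0, 2, size=(3, 5)).astype(float)
theta = np.zeros(5)
f_k = np.array([0.5, -0.2, -0.3])  # advantage estimates for each action
theta = actor_step(theta, phi_s, f_k, alpha=0.1)
```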
(Bowling & Veloso, 2002) – Variable learning rate accounts for other agents. ∗ Learn fast when losing. ∗ Cautious when winning, since agents may adapt. – Theoretical and empirical evidence of convergence.
[Figure: RPS Without WoLF and RPS With WoLF. Policy trajectories in Pr(Rock) vs. Pr(Paper) space for Player 1 and Player 2.]
[Figure: RPS Without WoLF and RPS With WoLF, one player limited. P(Rock), P(Paper), and P(Scissors) over 300,000 games for Player 1 (Limited) and Player 2 (Unlimited).]
(Sutton & Barto 1998) – Space covered by overlapping and offset tilings. – Maps continuous (or discrete) spaces to a vector of boolean values. – Provides discretization and generalization.
[Figure: Tiling One and Tiling Two, two overlapping and offset tilings over the same space.]
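A minimal sketch of tile coding as described above, assuming a unit hypercube input, a handful of uniformly offset tilings, and a dense boolean output vector; real implementations usually hash the tile indices instead of materializing the full vector.

```python
import numpy as np

def tile_features(x, num_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """Map a point x in [low, high]^d to a boolean feature vector with
    exactly one active tile per tiling (discretization + generalization)."""
    x = np.asarray(x, dtype=float)
    d = x.size
    width = (high - low) / tiles_per_dim
    features = np.zeros(num_tilings * tiles_per_dim ** d, dtype=bool)
    for t in range(num_tilings):
        offset = (t / num_tilings) * width           # each tiling is offset
        idx = np.floor((x - low + offset) / width).astype(int)
        idx = np.clip(idx, 0, tiles_per_dim - 1)
        flat = np.ravel_multi_index(idx, (tiles_per_dim,) * d)
        features[t * tiles_per_dim ** d + flat] = True
    return features

# Usage: a 2-D point becomes a length 4 * 8 * 8 = 256 boolean vector
# with 4 bits set (one per tiling).
print(tile_features([0.3, 0.7]).sum())   # -> 4
```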
n     |S|          |S × A|      SIZEOF(π or Q)   VALUE(det)   VALUE(random)
4     692          15,150       ~59 KB           −2           −2.5
8     3 × 10^6     1 × 10^7     ~47 MB           −20          −10.5
13    1 × 10^11    7 × 10^11    ~2.5 TB          −65          −28
[Example: quartile abstraction of a card-game state. My Hand {1, 3, 4, 5, 6, 8, 11, 13} → quartiles 1, 4, 6, 8, 13; Opp Hand {4, 5, 8, 9, 10, 11, 12, 13} → quartiles 4, 8, 10, 11, 13; Deck {1, 2, 3, 5, 9, 10, 11, 12} → quartiles 1, 3, 9, 10, 12; plus the current Card (11) and Action (3).]
TILES ∈ {0, 1}^106
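A minimal sketch of the quartile summary illustrated in the example above. The use of np.percentile (with its default linear interpolation) is my choice, and the step that tiles these summaries into the boolean TILES vector is not reproduced here.

```python
import numpy as np

def quartiles(cards):
    """Five-number summary (min, 25%, 50%, 75%, max) of a set of cards."""
    return np.percentile(sorted(cards), [0, 25, 50, 75, 100])

my_hand  = [1, 3, 4, 5, 6, 8, 11, 13]
opp_hand = [4, 5, 8, 9, 10, 11, 12, 13]
deck     = [1, 2, 3, 5, 9, 10, 11, 12]

# 15 numbers summarizing the full card state, which a tile coder
# (as sketched earlier) would turn into boolean features.
state_summary = np.concatenate(
    [quartiles(my_hand), quartiles(opp_hand), quartiles(deck)])
print(state_summary)
```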
[Figure: 4 Cards, 8 Cards, and 13 Cards. Value vs. a worst-case opponent as a function of the number of training games, for WoLF, Fast, Slow, and Random.]
[Figure: Fast, Slow, and WoLF. Expected value while learning as a function of the number of games.]
The adversarial robot learning environment.
Playback of learned policies in simulation and on the robots. The robot video can be downloaded from. . . http://www.cs.cmu.edu/~mhb/research/
[Figure: Attacker's expected reward for learned policies (matchups LR v R, LL v R, R v R, R v LL, R v RL), under Omni vs Omni, Diff vs Omni, and Diff vs Diff.]
[Figure: Attacker's expected reward, Omni vs Omni, worst-case E(LL), for A: LL, A: R*, D: LL, and D: R.]
What is the right goal for learning in problems with limited agents?
– Equilibrium learning may be in trouble.
– Lagoudakis and Parr's approximation and minimax (NIPS '02).
– Correlated equilibria?
– Performance during learning. – Generality of learned policies. ∗ How can I be exploited? ∗ What if everyone played this policy?