
Learning, Equilibria, Limitations, and Robots* Michael Bowling



  1. Learning, Equilibria, Limitations, and Robots* Michael Bowling, Computer Science Department, Carnegie Mellon University. *Joint work with Manuela Veloso

  2. Talk Outline • Robots – A two robot, adversarial, concurrent learning problem. – The challenges for multiagent learning. • Limitations and Equilibria • Limitations and Learning

  3. The Domain — CMDragons — 1

  4. The Domain — CMDragons — 2

  5. The Task — Breakthrough

  6. The Task — Breakthrough

  7. The Task — Breakthrough

  8. The Challenges

  9. The Challenges • Challenge #1: Continuous State and Action Spaces – Value function approximation, parameterized policies, state and temporal abstractions. – Limits agent behavior, sacrificing optimality.

  10. The Challenges • Challenge #1: Continuous State and Action Spaces – Value function approximation, parameterized policies, state and temporal abstractions. – Limits agent behavior, sacrificing optimality. • Challenge #2: Fixed Behavioral Components – Don’t learn motion control or obstacle avoidance. – Limits agent behavior, sacrificing optimality.

  11. The Challenges • Challenge #1: Continuous State and Action Spaces – Value function approximation, parameterized policies, state and temporal abstractions. – Limits agent behavior, sacrificing optimality. • Challenge #2: Fixed Behavioral Components – Don’t learn motion control or obstacle avoidance. – Limits agent behavior, sacrificing optimality. • Challenge #3: Latency – Can predict our own state through latency, not others. – Asymmetric partial observability. – Limits agent behavior, sacrificing optimality.

  12. The Challenges — 1 • Challenge #1: Continuous State and Action Spaces • Challenge #2: Fixed Behavioral Components • Challenge #3: Latency All of these challenges involve agent limitations... their own and others'.

  13. Talk Outline • Robots – A two robot, adversarial, concurrent learning problem. – The challenges for multiagent learning. • Limitations and Equilibria • Limitations and Learning

  14. Limitations Restrict Behavior • Restricted Policy Space — $\bar{\Pi}_i \subseteq \Pi_i$: any subset of the stochastic policies $\pi : S \to PD(A_i)$. • Restricted Best Response — $\overline{BR}_i(\pi_{-i})$: the set of all policies from $\bar{\Pi}_i$ that are optimal given the policies of the other players. • Restricted Equilibrium — $\pi_{i=1\ldots n}$ with $\pi_i \in \overline{BR}_i(\pi_{-i})$: a strategy for each player such that no player both can and wants to deviate, given that the other players continue to play the equilibrium. Do Restricted Equilibria Exist?
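
The restricted best-response idea above can be made concrete numerically. Below is a minimal Python sketch (not from the talk) that picks a best response from a finite sample of a restricted policy set in a matrix game; the function names, the opponent policy, and the example restriction are hypothetical.

```python
import numpy as np

def restricted_best_response(payoff, opponent_policy, restricted_policies):
    """Pick the best policy for the row player from a restricted set,
    given a fixed opponent (column) policy in a matrix game."""
    values = [pi @ payoff @ opponent_policy for pi in restricted_policies]
    best = int(np.argmax(values))
    return restricted_policies[best], values[best]

# Example: rock-paper-scissors payoffs, with the row player restricted to
# policies that never play Rock (a hypothetical restriction).
rps = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
restricted = [np.array([0.0, p, 1.0 - p]) for p in np.linspace(0, 1, 101)]
opponent = np.array([0.5, 0.25, 0.25])
pi_star, value = restricted_best_response(rps, opponent, restricted)
print(pi_star, value)
```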

  15. Do Restricted Equilibria Exist? — 1 [Worked example: payoff matrices for an explicit game and the corresponding implicit (restricted) game, comparing the games' equilibria (mixtures such as (1/3, 1/3, 1/3)) with a restricted equilibrium.]

  16. Do Restricted Equilibria Exist? — 2 • A two-player, zero-sum stochastic game (Marty's Game). From the start state $s_0$, where all payoffs are zero, the players' actions L and R lead to states $s_L$ and $s_R$, whose payoff matrices are $\begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$ and $\begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$. • Players are restricted to policies that play the same distribution over actions in all states. This game has no restricted equilibria! (This counterexample is brought to you by Martin Zinkevich.)

  17. Do Restricted Equilibria Exist? — 3 • In matrix games, if $\bar{\Pi}_i$ is convex, then . . . • If $\bar{\Pi}_i$ is statewise convex, then . . . • In no-control stochastic games, if $\bar{\Pi}_i$ is convex, then . . . • In single-controller stochastic games, if $\bar{\Pi}_1$ is statewise convex and $\bar{\Pi}_{i \neq 1}$ is convex, then . . . • In team games . . .

  18. Do Restricted Equilibria Exist? — 3 • In matrix games, if $\bar{\Pi}_i$ is convex, then . . . • If $\bar{\Pi}_i$ is statewise convex, then . . . • In no-control stochastic games, if $\bar{\Pi}_i$ is convex, then . . . • In single-controller stochastic games, if $\bar{\Pi}_1$ is statewise convex and $\bar{\Pi}_{i \neq 1}$ is convex, then . . . • In team games . . . . . . there exists a restricted equilibrium. Proofs use Kakutani's fixed-point theorem after showing that $\overline{BR}_i(\pi_{-i})$ is convex for all $\pi_{-i}$.
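
For reference, here is a sketch of the fixed-point argument the slide alludes to, written in LaTeX; this is a paraphrase of the standard Kakutani construction under assumed compactness and convexity, not the talk's actual proof.

```latex
% Joint restricted best-response correspondence on the restricted policy space:
\[
  F(\pi) \;=\; \overline{BR}_1(\pi_{-1}) \times \cdots \times \overline{BR}_n(\pi_{-n}),
  \qquad \pi \in \bar{\Pi}_1 \times \cdots \times \bar{\Pi}_n .
\]
% If each restricted policy set is compact and convex, each
% \overline{BR}_i(\pi_{-i}) is nonempty and convex, and F has a closed graph,
% then Kakutani's fixed-point theorem gives
\[
  \exists\, \pi^{*}:\;\; \pi^{*} \in F(\pi^{*})
  \;\;\Longleftrightarrow\;\;
  \pi^{*}_i \in \overline{BR}_i(\pi^{*}_{-i}) \;\; \forall i,
\]
% which is exactly a restricted equilibrium.
```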

  19. The Challenges — 2 • Challenge #1: Continuous State and Action Spaces • Challenge #2: Fixed Behavioral Components • Challenge #3: Latency None of these limitations yields restricted policy spaces nice enough (e.g., convex) to guarantee that restricted equilibria exist.

  20. Talk Outline • Robots – A two robot, adversarial, concurrent learning problem. – The challenges for multiagent learning. • Limitations and Equilibria • Limitations and Learning

  21. Three Ideas — One Algorithm • Idea #1: Policy Gradient Ascent • Idea #2: WoLF Variable Learning Rate • Idea #3: Tile Coding Combined: GraWoLF — Gradient-based WoLF.

  22. Idea #1 • Policy Gradient Ascent (Sutton et al., 2000) – Policy improvement with parameterized policies. – Takes steps in the direction of the gradient of the value:

  $$\pi(s,a) = \frac{e^{\phi_{sa}\cdot\theta_k}}{\sum_{b\in A_i} e^{\phi_{sb}\cdot\theta_k}}, \qquad \theta_{k+1} = \theta_k + \alpha_k \sum_a \phi_{sa}\,\pi(s,a)\,f_k(s,a)$$

  – $f_k$ is an approximation of the advantage function:

  $$f_k(s,a) \;\approx\; Q(s,a) - V^{\pi}(s) \;\approx\; Q(s,a) - \sum_b \pi(s,b)\,Q(s,b)$$
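
A minimal Python sketch of the update above, assuming a softmax policy over features and an externally learned Q estimate; `features` and `q_estimate` are hypothetical placeholders, not the talk's implementation.

```python
import numpy as np

def softmax_policy(theta, features, actions, state):
    """pi(s, a) proportional to exp(phi_sa . theta), as on the slide."""
    prefs = np.array([features(state, a) @ theta for a in actions])
    prefs -= prefs.max()                     # for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

def gradient_step(theta, features, actions, state, q_estimate, alpha):
    """One update of the slide's rule:
    theta <- theta + alpha * sum_a phi_sa * pi(s, a) * f(s, a),
    with f(s, a) ~ Q(s, a) - sum_b pi(s, b) Q(s, b)."""
    pi = softmax_policy(theta, features, actions, state)
    q = np.array([q_estimate(state, a) for a in actions])
    baseline = pi @ q                        # approximates V^pi(s)
    grad = np.zeros_like(theta)
    for i, a in enumerate(actions):
        grad += features(state, a) * pi[i] * (q[i] - baseline)
    return theta + alpha * grad
```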

  23. Idea #2 • Win or Learn Fast (WoLF) (Bowling & Veloso, 2002) – A variable learning rate that accounts for the other agents: ∗ learn fast when losing; ∗ be cautious when winning, since the other agents may adapt. – Theoretical and empirical evidence of convergence. [Plots: rock-paper-scissors strategy trajectories, Pr(Rock) vs. Pr(Paper) for Players 1 and 2, without and with WoLF.]
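
One way to realize the WoLF rule alongside the gradient step above is to keep two step sizes and switch on whether the current policy outperforms an average policy. This is a hedged sketch of that rule with hypothetical names, not the exact GraWoLF winning criterion.

```python
def wolf_step_size(value_current, value_average, alpha_win, alpha_lose):
    """Win or Learn Fast: step cautiously when the current policy does at
    least as well as the average policy ("winning"), and step faster when
    it does worse ("losing")."""
    return alpha_win if value_current >= value_average else alpha_lose

# Hypothetical usage with the gradient sketch above; v_pi and v_avg would
# come from a learned critic evaluated for the current and average policies:
# alpha = wolf_step_size(v_pi, v_avg, alpha_win=0.01, alpha_lose=0.04)
# theta = gradient_step(theta, features, actions, state, q_estimate, alpha)
```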

  24. Idea #2 — 2 [Plots: action probabilities P(Rock), P(Paper), P(Scissors) over 300,000 games, without and with WoLF, for Player 1 (Limited) and Player 2 (Unlimited).]

  25. Idea #3 • Tile Coding (a.k.a. CMACs) (Sutton & Barto, 1998) – The space is covered by overlapping and offset tilings. – Maps continuous (or discrete) spaces to a vector of boolean values. – Provides both discretization and generalization. [Figure: two overlapping, offset tilings (Tiling One, Tiling Two).]
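
A minimal Python sketch of tile coding as described above: several identical grids, each offset by a fraction of a tile, with one active binary feature per tiling. Hashing the tile indices into a fixed-size vector is a common trick but an assumption here, not necessarily what the talk used.

```python
import numpy as np

def tile_indices(x, num_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """Map a continuous vector x in [low, high]^d to one active tile per
    tiling.  Each tiling is the same grid shifted by a fraction of a tile
    width, which gives both discretization and generalization."""
    scaled = (np.asarray(x, dtype=float) - low) / (high - low) * tiles_per_dim
    indices = []
    for t in range(num_tilings):
        coords = np.floor(scaled + t / num_tilings).astype(int)
        coords = np.clip(coords, 0, tiles_per_dim)   # guard the upper edge
        idx = t
        for c in coords:                             # flatten grid coords
            idx = idx * (tiles_per_dim + 1) + int(c)
        indices.append(idx)
    return indices

def binary_features(x, num_features=4096, **tiling_kwargs):
    """Boolean feature vector with one active bit per tiling (hashed)."""
    phi = np.zeros(num_features, dtype=bool)
    for idx in tile_indices(x, **tiling_kwargs):
        phi[idx % num_features] = True
    return phi
```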

  26. The Task

  27. The Task — Goofspiel • A.k.a. “The Game of Pure Strategy”

  28. The Task — Goofspiel • A.k.a. “The Game of Pure Strategy” • Each player plays a full suit of cards. • Each player uses their cards (without replacement) to bid on cards from another suit.

  29. The Task — Goofspiel • A.k.a. "The Game of Pure Strategy" • Each player plays a full suit of cards. • Each player uses their cards (without replacement) to bid on cards from another suit.

  n  | |S|     | |S×A|   | Size of π or Q | Value (det) | Value (random)
  4  | 692     | 15,150  | ~59 KB         | −2          | −2.5
  8  | 3×10^6  | 1×10^7  | ~47 MB         | −20         | −10.5
  13 | 1×10^11 | 7×10^11 | ~2.5 TB        | −65         | −28

  • The game is very large. • Deterministic policies are very bad. • The random policy isn't too bad.
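
To make the rules above concrete, here is a small Python simulator for one game of Goofspiel; the tie rule (discarding the prize) and the policy interface are assumptions for illustration, not the talk's exact setup.

```python
import random

def play_goofspiel(policy1, policy2, n=13, seed=None):
    """Play one game of Goofspiel with n cards per suit and return the
    score difference (player 1 minus player 2).  Each round a prize card
    is revealed and both players bid a card from their remaining hand;
    the higher bid wins the prize (ties discard it; rule variants differ)."""
    rng = random.Random(seed)
    prizes = list(range(1, n + 1))
    rng.shuffle(prizes)
    hand1, hand2 = set(range(1, n + 1)), set(range(1, n + 1))
    remaining = set(prizes)
    score = 0
    for prize in prizes:
        remaining.discard(prize)             # cards still to come
        bid1 = policy1(hand1, hand2, remaining, prize)
        bid2 = policy2(hand2, hand1, remaining, prize)
        hand1.remove(bid1)
        hand2.remove(bid2)
        if bid1 > bid2:
            score += prize
        elif bid2 > bid1:
            score -= prize
    return score

# The random baseline from the table above: bid a uniformly random card.
def random_policy(hand, opp_hand, deck, prize):
    return random.choice(sorted(hand))

print(play_goofspiel(random_policy, random_policy, n=13, seed=0))
```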

  30. The Task — Goofspiel — 2 • Tile-coded state representation: each of My Hand (e.g., {1, 3, 4, 5, 6, 8, 11, 13}), Opp Hand (e.g., {4, 5, 8, 9, 10, 11, 12, 13}), and Deck (e.g., {1, 2, 3, 5, 9, 10, 11, 12}) is summarized by its quartiles (⟨1, 4, 6, 8, 13⟩, ⟨4, 8, 10, 11, 13⟩, ⟨1, 3, 9, 10, 12⟩), together with the Card up for bid (11) and the Action (3); tile coding maps this to TILES ∈ {0, 1}^(10^6). • Gradient ascent on this parameterization. • WoLF variable learning rate on the gradient step size.
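
A sketch of how the quartile summary on this slide could be computed before tile coding; the scaling choice is mine, and np.percentile interpolates, so the summary is continuous rather than the card-valued quartiles shown on the slide.

```python
import numpy as np

def quartiles(cards):
    """Five-number summary (min, 25%, median, 75%, max) of a card set."""
    return np.percentile(sorted(cards), [0, 25, 50, 75, 100])

def goofspiel_state_vector(my_hand, opp_hand, deck, prize, action, n=13):
    """Summarize the state as on the slide: quartiles of my hand, the
    opponent's hand, and the remaining deck, plus the card up for bid and
    the bid itself, scaled to [0, 1] for a tile coder."""
    raw = np.concatenate([quartiles(my_hand),
                          quartiles(opp_hand),
                          quartiles(deck),
                          np.array([prize, action], dtype=float)])
    return raw / n

# Usage: phi = binary_features(goofspiel_state_vector(...)) with the
# tile-coding sketch from Idea #3, giving a boolean TILES-style vector.
print(goofspiel_state_vector({1, 3, 4, 5, 6, 8, 11, 13},
                             {4, 5, 8, 9, 10, 11, 12, 13},
                             {1, 2, 3, 5, 9, 10, 11, 12}, 11, 3))
```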

  31. Results — Worst-Case [Plots: value against the worst-case opponent vs. number of training games (0 to 40,000), for 4, 8, and 13 cards; curves for WoLF, Fast, Slow, and Random.]

  32. Results — While Learning [Plots: expected value while learning vs. number of games (0 to 40,000), for the Fast, Slow, and WoLF learners.]

  33. The Task — Breakthrough

  34. The Task — Breakthrough — 2

  35. Results — Breakthrough WARNING!

  36. Results — Breakthrough WARNING! • These results are preliminary; some are only hours old. • They involve a single run of learning in a highly stochastic learning environment. • More experiments are in progress.

  37. Results — "To the videotape..." Playback of learned policies in simulation and on the robots. The robot video can be downloaded from: http://www.cs.cmu.edu/~mhb/research/

  38. Results — 3 [Plot: Omni vs. Omni with learned policies; attacker's expected reward for the matchups LR v R, LL v R, R vs R, R vs LL, and R v RL.]
