
Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization
AAMAS 2012 - June 6, 2012


SLIDE 1

AAMAS 2012 - June 6, 2012

Michael Johanson, Nolan Bard, Marc Lanctot, Richard Gibson, Michael Bowling University of Alberta

University of Alberta Computer Poker Research Group

Efficient Nash Equilibrium Approximation through Monte Carlo Counterfactual Regret Minimization

Wednesday, November 14, 2012

SLIDE 2

Motivation

Tackling the practical challenge of Nash equilibrium computation in large games.

A Nash equilibrium strategy is guaranteed not to lose in expectation (2-player, zero-sum).

Very useful property in practice:

  • Dominant approach in the Annual Computer Poker Competition
  • 2008: beat human professionals at 2-player limit Texas hold’em poker

SLIDE 3

Motivation

[Figure: Size of Game Solved. # Information Sets (10^5 to 10^10) versus Computer Poker Competition Year (2006 to 2011).]

The poker community is now solving games with 10^9 decisions (information sets).

LPs don’t scale to this size of game.

We’ve made great progress on efficient approximation algorithms (CFR, EGT).

SLIDE 4

Counterfactual Regret Minimization (CFR), NIPS 2007

[Figure: CFR Convergence. Best Response (mbb/game), 25 to 100, versus Computation Time (seconds), 5,000 to 20,000.]

CFR is the competition’s most popular algorithm.

  • Iterative, resembles self-play; reinforcement learning flavour.
  • Memory efficient (2 doubles per infoset-action)
  • Converges quickly (1/ε^2)
  • Programmer friendly: easy to implement and optimize
  • Linear speedup with many cores

This paper: a new CFR variant that converges more quickly in imperfect information games.

SLIDE 5

Counterfactual Regret Minimization (CFR)

Basic idea: Start with two uniform random strategies. Play them against each other. Put a regret-minimizing agent at every decision, and let it independently learn its part of the strategy.

Run many iterations: walk the game tree, agents update their parts of the strategy. Average strategy profile converges to equilibrium.

[Diagram: two strategies σ play against each other; at information set I, Regret(I) = (-2, 1, 4) gives σ(I) = (0, 0.2, 0.8); the average strategy approaches a Nash equilibrium.]
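
The Regret(I) and σ(I) values on this slide are consistent with regret matching, where each action is played in proportion to its positive regret. A minimal Python sketch of that rule follows; the function name `regret_matching` is illustrative and does not come from the slides.

```python
def regret_matching(regrets):
    """Return a strategy proportional to the positive part of each regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total > 0:
        return [p / total for p in positive]
    # No action has positive regret: fall back to the uniform strategy.
    return [1.0 / len(regrets)] * len(regrets)


# The slide's example: Regret(I) = (-2, 1, 4) gives sigma(I) = (0, 0.2, 0.8).
sigma = regret_matching([-2, 1, 4])
```

Actions with negative regret get probability zero, which is why the first action is never played in the slide's example.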

SLIDE 6

Counterfactual Regret Minimization (CFR)

To update a decision, we need:

  • Probability of other players taking their series of actions
  • Expected value (or unbiased estimate) of actions’ utilities given opponent’s strategy

Recursively walk the tree:

  • Push forwards opponent action probabilities
  • Return EV at this terminal node or in this subtree

[Diagram: game tree from the Root to the Terminal Nodes. At information set I, π_-i(I) = 0.2 and V(I) = (-2, 2, 6); the terminal values -2, 2, 6 are reached with opponent probability p = 0.2.]
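
A sketch of the update at a single information set, using the numbers on this slide. It assumes that V(I) holds the expected value of each action and that the current strategy is σ(I) = (0, 0.2, 0.8) with regrets (-2, 1, 4) from the previous slide; the slides do not state that these belong together, and `regret_update` is a hypothetical helper.

```python
def regret_update(regrets, sigma, action_values, opp_reach):
    """Add counterfactual regret for one visit to an information set.

    opp_reach is the probability of the other players' actions leading here.
    Returns the new regrets and the node's expected value under sigma.
    """
    node_value = sum(p * v for p, v in zip(sigma, action_values))
    new_regrets = [
        r + opp_reach * (v - node_value)
        for r, v in zip(regrets, action_values)
    ]
    return new_regrets, node_value


# Assumed pairing of the slides' numbers, for illustration only.
new_regrets, node_value = regret_update(
    regrets=[-2.0, 1.0, 4.0],
    sigma=[0.0, 0.2, 0.8],
    action_values=[-2.0, 2.0, 6.0],
    opp_reach=0.2,
)
```

The node value is what the recursion returns to the parent; the opponent reach probability is what the parent pushed forward.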

SLIDE 7

Chance-Sampled CFR

In practice, a sampling variant of CFR is used.

Chance Sampling: on each iteration, randomly sample one set of chance events and only update that part of the tree.

Terminal nodes: Get an unbiased estimate of my state’s value. Takes O(1) time.

Sampled: Public Chance, My Private Chance, Opponent Private Chance

Recursion: PASS one scalar (opponent reach probability), RETURN one scalar (value of subgame)
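
A minimal sketch of chance-sampled CFR, assuming Kuhn poker (three cards, one betting round) as a stand-in game. The slides do not use Kuhn poker, and every name here is illustrative. Each iteration samples one deal and walks only that part of the tree; each call returns one scalar value. The acting player's own reach probability is carried along only to weight the average strategy.

```python
import random

ACTIONS = "pb"  # p = pass/fold, b = bet/call


class Node:
    """Regret and average-strategy accumulators for one information set."""

    def __init__(self):
        self.regret_sum = [0.0, 0.0]
        self.strategy_sum = [0.0, 0.0]

    @staticmethod
    def _normalize(weights):
        total = sum(weights)
        return [w / total for w in weights] if total > 0 else [0.5, 0.5]

    def strategy(self):
        # Regret matching: play in proportion to positive regret.
        return self._normalize([max(r, 0.0) for r in self.regret_sum])

    def average_strategy(self):
        return self._normalize(self.strategy_sum)


def terminal_utility(cards, history):
    """Utility for the player to act at `history`, or None if not terminal."""
    if len(history) < 2:
        return None
    player = len(history) % 2
    higher = cards[player] > cards[1 - player]
    if history[-2:] == "pp":  # both passed: showdown for the ante
        return 1 if higher else -1
    if history[-2:] == "bb":  # bet and call: showdown for two chips
        return 2 if higher else -2
    if history[-1] == "p":  # a bet was followed by a fold
        return 1
    return None


def cfr(nodes, cards, history, reach):
    """Walk the sampled tree; return the value for the player to act."""
    utility = terminal_utility(cards, history)
    if utility is not None:
        return utility
    player = len(history) % 2
    node = nodes.setdefault(f"{cards[player]}{history}", Node())
    sigma = node.strategy()
    action_values = [0.0, 0.0]
    node_value = 0.0
    for a, action in enumerate(ACTIONS):
        child_reach = list(reach)
        child_reach[player] *= sigma[a]
        # The child's value is from the opponent's point of view: negate it.
        action_values[a] = -cfr(nodes, cards, history + action, child_reach)
        node_value += sigma[a] * action_values[a]
    for a in range(2):
        # Regrets are weighted by the opponent's reach probability.
        node.regret_sum[a] += reach[1 - player] * (action_values[a] - node_value)
        node.strategy_sum[a] += reach[player] * sigma[a]
    return node_value


def train(iterations, seed=0):
    rng = random.Random(seed)
    nodes = {}
    for _ in range(iterations):
        cards = rng.sample(range(3), 2)  # chance sampling: one deal per iteration
        cfr(nodes, cards, "", [1.0, 1.0])
    return nodes
```

Calling `train(20000)` returns a dictionary keyed by private card plus betting history; `average_strategy()` on each node gives the approximate equilibrium strategy.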

SLIDE 8

New CFR Sampling Variants

Chance Sampling (CS)
  Sample: Public chance, My Private Chance, Opponent Private Chance

Opponent-Public Chance Sampling (OPCS)
  Sample: Public chance, Opponent Private Chance
  Expand: My Private Chance

Self-Public Chance Sampling (SPCS)
  Sample: Public chance, My Private Chance
  Expand: Opponent Private Chance

Public Chance Sampling (PCS)
  Sample: Public chance
  Expand: My Private Chance, Opponent Private Chance

SLIDE 9

Opponent-Public Chance Sampling (OPCS)

Sample one public chance event.

Sample one opponent private chance event.

Enumerate all of my possible private chance events.

KEY OBSERVATION: Opponent can’t observe my chance event, so their strategy is the same for all of them. I can efficiently update all of these decisions in the same recursive pass!

Terminal nodes: n states to evaluate. Takes O(n) time.

Recursion: PASS one scalar (opponent reach probability), RETURN a vector (values of subgames)

[Diagram: one sampled Public Chance event and one sampled Opponent Private Chance event; My Private Chance is expanded over all (45 choose 2) hands.]
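
A sketch of the O(n) terminal evaluation described above, assuming a showdown in which every hand has been reduced to an integer strength (higher wins). The function name and the `pot` parameter are illustrative, not from the slides.

```python
def opcs_showdown(my_hand_ranks, opp_hand_rank, opp_reach, pot):
    """Value of each of my hands against the one sampled opponent hand.

    Takes one scalar (opp_reach) and returns one value per hand I could hold.
    """
    values = []
    for rank in my_hand_ranks:
        if rank > opp_hand_rank:
            outcome = 1
        elif rank < opp_hand_rank:
            outcome = -1
        else:
            outcome = 0
        values.append(opp_reach * pot * outcome)
    return values


values = opcs_showdown([1, 5, 9], opp_hand_rank=5, opp_reach=0.2, pot=2)
```

Because the opponent's reach probability does not depend on my private cards, the same scalar is reused for every entry of the returned vector.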

SLIDE 10

New CFR Sampling Variants

Chance Sampling (CS)
  Sample: Public chance, My Private Chance, Opponent Private Chance

Opponent-Public Chance Sampling (OPCS)
  Sample: Public chance, Opponent Private Chance
  Expand: My Private Chance
  Slower, many updates per iteration

Self-Public Chance Sampling (SPCS)
  Sample: Public chance, My Private Chance
  Expand: Opponent Private Chance

Public Chance Sampling (PCS)
  Sample: Public chance
  Expand: My Private Chance, Opponent Private Chance

SLIDE 11

New CFR Sampling Variants

Chance Sampling (CS)
  Sample: Public chance, My Private Chance, Opponent Private Chance

Opponent-Public Chance Sampling (OPCS)
  Sample: Public chance, Opponent Private Chance
  Expand: My Private Chance
  Slower, many updates per iteration

Self-Public Chance Sampling (SPCS)
  Sample: Public chance, My Private Chance
  Expand: Opponent Private Chance

Public Chance Sampling (PCS)
  Sample: Public chance
  Expand: My Private Chance, Opponent Private Chance

SLIDE 12

Self-Public Chance Sampling (SPCS)

Sample one public chance event.

Sample one of my private chance events.

Enumerate all of opponent’s possible private chance events.

Terminal nodes: n states to evaluate. Much more precise estimate of my value, since I compare my state to all of theirs!

RESULT: Slow but very precise updates.

Recursion: PASS one vector (opponent reach probabilities), RETURN one scalar (value of subgame)

[Diagram: one sampled Public Chance event and one sampled My Private Chance event; Opponent Private Chance is expanded over all (45 choose 2) hands.]
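
A sketch of the corresponding terminal evaluation, again assuming a showdown with integer hand strengths. My single sampled hand is compared against every opponent hand, weighted by the opponent's reach probability for that hand. The names are illustrative, not from the slides.

```python
def spcs_showdown(my_hand_rank, opp_hand_ranks, opp_reach, pot):
    """Value of my sampled hand against the opponent's whole range.

    Takes a vector of opponent reach probabilities and returns one scalar.
    """
    total = 0.0
    for rank, reach in zip(opp_hand_ranks, opp_reach):
        # +1 if I win, -1 if I lose, 0 on a tie.
        outcome = (my_hand_rank > rank) - (my_hand_rank < rank)
        total += reach * outcome
    return pot * total


value = spcs_showdown(5, opp_hand_ranks=[1, 5, 9], opp_reach=[0.5, 0.25, 0.25], pot=2)
```

Summing over the whole opponent range replaces a single sampled comparison, which is where the lower-variance estimate comes from.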

SLIDE 13

New CFR Sampling Variants

Chance Sampling (CS)
  Sample: Public chance, My Private Chance, Opponent Private Chance

Opponent-Public Chance Sampling (OPCS)
  Sample: Public chance, Opponent Private Chance
  Expand: My Private Chance
  Slower, many updates per iteration

Self-Public Chance Sampling (SPCS)
  Sample: Public chance, My Private Chance
  Expand: Opponent Private Chance
  Slower, very precise updates

Public Chance Sampling (PCS)
  Sample: Public chance
  Expand: My Private Chance, Opponent Private Chance

SLIDE 15

Public Chance Sampling (PCS)

Sample one public chance event.

Enumerate all of my private chance events.

Enumerate all of opponent’s possible private chance events.

Terminal nodes: n states to evaluate against n states. Looks like O(n^2) work. But depending on game structure, O(n) is often possible, making it as fast as OPCS or SPCS!

RESULT: Slower, but do many precise updates on each iteration.

Recursion: PASS one vector (opponent reach probabilities), RETURN one vector (values of subgames)

[Diagram: one sampled Public Chance event; My Private Chance and Opponent Private Chance are each expanded over (47 choose 2) hands.]
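
A sketch of how the n-against-n showdown can avoid O(n^2) work, assuming integer hand strengths and ignoring card-removal effects (hands that share cards), which a real poker implementation has to correct for. This is one possible approach and is not taken from the slides. The naive version compares every pair; the second sorts the opponent hands once and uses prefix sums of reach probability, which costs O(n log n) here and O(n) if hands are kept presorted.

```python
from bisect import bisect_left, bisect_right
from itertools import accumulate


def pcs_showdown_naive(my_ranks, opp_ranks, opp_reach, pot):
    """O(n^2): compare each of my hands with each opponent hand."""
    return [
        pot * sum(reach * ((m > r) - (m < r)) for r, reach in zip(opp_ranks, opp_reach))
        for m in my_ranks
    ]


def pcs_showdown_fast(my_ranks, opp_ranks, opp_reach, pot):
    """Same values via one sort and prefix sums of opponent reach."""
    order = sorted(range(len(opp_ranks)), key=opp_ranks.__getitem__)
    sorted_ranks = [opp_ranks[j] for j in order]
    # prefix[k] = total reach of the k weakest opponent hands.
    prefix = [0.0] + list(accumulate(opp_reach[j] for j in order))
    total = prefix[-1]
    values = []
    for m in my_ranks:
        weaker = bisect_left(sorted_ranks, m)        # hands I beat
        not_stronger = bisect_right(sorted_ranks, m)  # hands I beat or tie
        win_mass = prefix[weaker]
        lose_mass = total - prefix[not_stronger]
        values.append(pot * (win_mass - lose_mass))
    return values


my_ranks = [3, 1, 2, 2]
opp_ranks = [2, 3, 1, 2]
opp_reach = [0.1, 0.2, 0.3, 0.4]
slow = pcs_showdown_naive(my_ranks, opp_ranks, opp_reach, pot=2.0)
fast = pcs_showdown_fast(my_ranks, opp_ranks, opp_reach, pot=2.0)
```

Both functions take a vector of opponent reach probabilities and return a vector of values, matching the PASS/RETURN shape on this slide.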

SLIDE 16

New CFR Sampling Variants

Chance Sampling (CS)
  Sample: Public chance, My Private Chance, Opponent Private Chance

Opponent-Public Chance Sampling (OPCS)
  Sample: Public chance, Opponent Private Chance
  Expand: My Private Chance
  Slower, more updates per iteration

Self-Public Chance Sampling (SPCS)
  Sample: Public chance, My Private Chance
  Expand: Opponent Private Chance
  Slower, very precise updates

Public Chance Sampling (PCS)
  Sample: Public chance
  Expand: My Private Chance, Opponent Private Chance
  Same speed, many updates per iteration
  Same speed, very precise updates

SLIDE 17

Results: 2-round, 4-bet Poker

94 million decision points (information sets)

[Figure: Best response (mbb/g), 10^-1 to 10^4, versus Time (seconds), 10^2 to 10^5, for CS, OPCS, SPCS, and PCS.]

SLIDE 18

Abstracted Limit Texas Hold’em Poker

Real Poker Game: 3*10^14 decisions (infosets)

Abstraction

Abstract Poker Game: 10^9 decisions (infosets)

Larger abstractions are better in practice, but take longer to solve. Can evaluate by measuring exploitability in the abstract game.

SLIDE 19

Results: Abstracted Limit Texas Hold’em Poker

[Figure: four panels comparing CS and PCS, each plotting Abstract Best Response (mbb/g) versus Time (seconds). Panels: 5 buckets, 3.6m decisions; 8 buckets, 23.6m decisions; 10 buckets, 57.3m decisions; 12 buckets, 118.6m decisions.]

SLIDE 20

Alternate domain: Bluff, an imperfect information dice game

[Figure: Best Response, 10^-4 to 10^-1, versus Time (seconds), 10^3 to 10^5, for CS and PCS.]

SLIDE 21

Conclusion

Counterfactual Regret Minimization is a state-of-the-art algorithm for Nash equilibrium approximation in 2-player zero-sum games.

Public Chance Sampled CFR:

  • Takes advantage of the structure of imperfect information
  • Converges faster in practice

SLIDE 22

Thanks! Poster: Panel 056

University of Alberta Computer Poker Research Group

Ground Floor

SLIDE 23

Games Solved for the Annual Computer Poker Competition

[Figure, panel 1: Size of Game Solved. # Information Sets (10^5 to 10^10) versus Competition Year (2006 to 2011).]

[Figure, panel 2: Distance to Equilibrium. Exploitability (milliblinds/game), 100 to 500, versus Competition Year (2007 to 2011).]

SLIDE 24

Results: Limit Texas Hold’em Poker

One-on-one performance against a strong opponent (Hyperborean2010.IRO)

[Figure: One-on-One Performance (mbb/g), -50 to 10, versus Time (seconds), 200,000 to 800,000, for CS and PCS.]
