

SLIDE 1

Deep Counterfactual Regret Minimization

Noam Brown*1,2, Adam Lerer*1, Sam Gross1, Tuomas Sandholm2,3

*Equal Contribution

1 Facebook AI Research; 2 Carnegie Mellon University; 3 Strategic Machine Inc., Strategy Robot Inc., and Optimized Markets Inc.

SLIDE 2

Counterfactual Regret Minimization (CFR) [Zinkevich et al., NeurIPS-07]

  • CFR is the leading algorithm for solving partially observable games
  • Iteratively converges to an equilibrium
  • Used by every top poker AI in the past 7 years, including Libratus
  • Every single one used a tabular form of CFR
  • This paper introduces a function approximation form of CFR using deep neural networks
  • Less domain knowledge
  • Easier to apply to other games
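CFR's core action-selection rule, regret matching, can be sketched in a few lines. This is an illustrative sketch, not the authors' code: at each decision point, play each action with probability proportional to its positive cumulative regret.

```python
# Minimal regret-matching sketch: map cumulative regrets at one decision
# point to a strategy (one probability per action).

def regret_matching(cum_regrets):
    """Put probability on each action in proportion to its positive regret."""
    positives = [max(r, 0.0) for r in cum_regrets]
    total = sum(positives)
    if total == 0.0:
        # No positive regret yet: fall back to uniform play.
        return [1.0 / len(cum_regrets)] * len(cum_regrets)
    return [p / total for p in positives]

print(regret_matching([10.0, -5.0, 30.0]))  # -> [0.25, 0.0, 0.75]
```

Actions with negative cumulative regret are never played, which is what drives the iterative convergence to equilibrium described above.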
SLIDE 3

Example of Monte Carlo CFR [Lanctot et al., NeurIPS-09]

[Figure: game tree with chance nodes C and decision points 𝑄1, 𝑄2; sampled payoff 25]

  • Simulate a game with one player designated as the traverser

SLIDE 4

Example of Monte Carlo CFR [Lanctot et al., NeurIPS-09]

[Figure: game tree with chance nodes C and decision points 𝑄1, 𝑄2; payoffs 100 and 25]

  • Simulate a game with one player designated as the traverser
  • After game ends, traverser sees how much better she could have done by choosing other actions

SLIDE 5

Example of Monte Carlo CFR [Lanctot et al., NeurIPS-09]

[Figure: game tree with chance nodes C and decision points 𝑄1, 𝑄2; payoffs 100 and 25]

  • Simulate a game with one player designated as the traverser
  • After game ends, traverser sees how much better she could have done by choosing other actions
  • This difference is added to the action’s regret. In future iterations, actions with higher regret are chosen with higher probability

SLIDE 6

Example of Monte Carlo CFR [Lanctot et al., NeurIPS-09]

[Figure: game tree with chance nodes C and decision points 𝑄1, 𝑄2; payoffs 100, 25, 50, and 120]

  • Simulate a game with one player designated as the traverser
  • After game ends, traverser sees how much better she could have done by choosing other actions
  • This difference is added to the action’s regret. In future iterations, actions with higher regret are chosen with higher probability
  • Process repeats even for hypothetical decision points
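The regret update in this traversal can be sketched for a single decision point, using the slides' numbers (action values 100 and 25). The function name and setup are illustrative, not the paper's implementation:

```python
# One MCCFR-style regret update at a single decision point: each action's
# regret grows by how much better that action was than the expected value
# of the current strategy.

def update_regrets(cum_regrets, strategy, action_values):
    """Add each action's instantaneous regret (its value minus the
    expected value under the current strategy) to the running totals."""
    expected = sum(p * v for p, v in zip(strategy, action_values))
    return [r + (v - expected) for r, v in zip(cum_regrets, action_values)]

# The traverser played uniformly over two actions worth 100 and 25.
print(update_regrets([0.0, 0.0], [0.5, 0.5], [100.0, 25.0]))  # -> [37.5, -37.5]
```

The action worth 100 accumulates positive regret, so regret matching will favor it on future iterations, exactly as the bullets above describe.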

SLIDE 7

Prior Approach: Abstraction in Games

[Figure: states of the original game bucketed together into a smaller abstracted game]

  • Requires extensive domain knowledge
  • Several papers written on how to do abstraction just in poker
  • Difficult to extend to other games
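As a toy illustration of the bucketing idea, strategically similar situations can be mapped to a shared bucket by a coarse hand-crafted feature. The `bucket` function here is hypothetical; real poker abstractions use far richer, domain-specific features:

```python
# Toy card-abstraction sketch: bucket situations by win probability rounded
# into one of n_buckets bins, so similar states share one strategy entry.

def bucket(win_prob, n_buckets=5):
    """Map a win probability in [0, 1] to a bucket index in [0, n_buckets)."""
    return min(int(win_prob * n_buckets), n_buckets - 1)

# Two similar hands land in the same bucket: they are "bucketed together".
print(bucket(0.62), bucket(0.64))  # -> 3 3
```

Choosing these features well is exactly the domain knowledge the bullets above say is hard to port to other games.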
SLIDE 8

Deep CFR

  • Input: low-level features (visible cards, observed actions)
  • Output: estimate of action regrets
  • On each iteration:
    1. Collect samples of action regrets, add to a buffer
    2. Train a network to predict regrets
    3. Use network’s regret estimates to play on next iteration
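The three steps can be sketched as a loop. This is a minimal sketch, not the paper's implementation: `RegretNet`, `sample_regrets`, and the per-state averaging "network" are stand-ins, since the real system trains a deep neural network on low-level features.

```python
from collections import defaultdict

class RegretNet:
    """Stand-in for the regret network: 'predicts' per-action regrets
    for a state by averaging the buffered samples (two actions here)."""
    def __init__(self):
        self.pred = defaultdict(lambda: [0.0, 0.0])  # state -> regrets

    def fit(self, buffer):
        sums = defaultdict(lambda: [0.0, 0.0, 0])
        for state, regrets in buffer:
            s = sums[state]
            s[0] += regrets[0]
            s[1] += regrets[1]
            s[2] += 1
        for state, (r0, r1, n) in sums.items():
            self.pred[state] = [r0 / n, r1 / n]

def regret_matching(regrets):
    """Play actions in proportion to positive predicted regret."""
    pos = [max(r, 0.0) for r in regrets]
    total = sum(pos)
    return [p / total for p in pos] if total > 0 else [0.5, 0.5]

def deep_cfr(sample_regrets, iterations=3):
    """sample_regrets(strategy_fn) returns (state, regrets) samples
    from game traversals played with strategy_fn."""
    buffer, net = [], RegretNet()
    for _ in range(iterations):
        strategy_fn = lambda state: regret_matching(net.pred[state])
        buffer.extend(sample_regrets(strategy_fn))  # 1. collect samples
        net.fit(buffer)                             # 2. train the predictor
        # 3. the next iteration plays from the refreshed predictor
    return net
```

The key design point the slide makes is that the tabular regret store is replaced by a trained function of the state, so no hand-built abstraction is needed.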
SLIDE 9

Deep CFR

  • Input: low-level features (visible cards, observed actions)
  • Output: estimate of action regrets
  • On each iteration:
    1. Collect samples of action regrets, add to a buffer
    2. Train a network to predict regrets
    3. Use network’s regret estimates to play on next iteration
  • Theorem: With arbitrarily high probability, Deep CFR converges to an 𝜗-Nash equilibrium in two-player zero-sum games, where 𝜗 is determined by prediction error
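For reference, an 𝜗-Nash equilibrium is a strategy profile 𝜎 from which no single player can gain more than 𝜗 by deviating unilaterally. This is the standard definition (notation assumed, not taken from the slides):

```latex
% \vartheta-Nash equilibrium: for every player i, the best unilateral
% deviation \sigma_i' improves i's utility u_i by at most \vartheta.
\forall i:\quad \max_{\sigma_i'} u_i(\sigma_i', \sigma_{-i}) - u_i(\sigma) \le \vartheta
```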

SLIDE 10

Experimental results in limit Texas hold’em

  • Deep CFR produces superhuman performance in heads-up limit Texas hold’em poker
    • ~10 trillion decision points
    • Once played competitively by humans
  • Deep CFR outperforms Neural Fictitious Self-Play (NFSP), the prior best deep RL algorithm for partially observable games [Heinrich & Silver, arXiv-15]
  • Deep CFR is also much more sample efficient
  • Deep CFR is competitive with domain-specific abstraction algorithms
SLIDE 11

Conclusions

  • Among algorithms for non-tabular solving of partially observable games, Deep CFR is the fastest and most sample-efficient, and produces the best results
  • Uses less domain knowledge than abstraction-based approaches, making it easier to apply to other games