SLIDE 1

Monte-Carlo tree search for multi-player, no-limit Texas hold'em poker

Guy Van den Broeck

SLIDE 2

SLIDE 3

Deceptive play

Should I bluff?

SLIDE 4

Opponent modeling

Should I bluff? Is he bluffing?

SLIDE 5

Incomplete information

Should I bluff? Who has the Ace? Is he bluffing?

SLIDE 6

Game of chance

Should I bluff? Who has the Ace? What are the odds? Is he bluffing?

SLIDE 7

Exploitation

Should I bluff? Who has the Ace? What are the odds? Is he bluffing? I'll bet because he always calls

SLIDE 8

Huge state space

Should I bluff? Who has the Ace? What are the odds? Is he bluffing? What can happen next? I'll bet because he always calls

SLIDE 9

Risk management & Continuous action space

Should I bet $5 or $10? Should I bluff? Who has the Ace? What are the odds? Is he bluffing? What can happen next? I'll bet because he always calls

SLIDE 10

Take-Away Message: We can solve all these problems!

Should I bet $5 or $10? Should I bluff? Who has the Ace? What are the odds? Is he bluffing? What can happen next? I'll bet because he always calls

SLIDE 11

Problem Statement

- A bot for Texas hold'em poker
- No-limit & > 2 players
- Not done before!
- Exploitative, not game-theoretic
- Game tree search + opponent modeling
- Applies to any problem with
  - incomplete information
  - non-determinism
  - continuous actions

SLIDE 12–14

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
  - Opponent model
- Conclusion

SLIDE 15–17

Poker Game Tree

- Minimax trees: deterministic
  - Tic-tac-toe, checkers, chess, Go, …
- Expecti(mini)max trees: chance
  - Backgammon, …
- Miximax trees: hidden information, plus an opponent model that supplies probabilities at opponent nodes (sketch below)

[Figure: game trees with max, min, chance, and mix (modeled opponent) nodes.]
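
To make the node types concrete, here is a minimal backup sketch in Python. It is illustrative only, not the talk's implementation: the node encoding, payoffs, and probabilities are hypothetical. Max nodes take the best child; chance nodes and modeled-opponent nodes ("mix") take probability-weighted averages:

    # Minimal miximax backup (illustrative). A node is one of:
    #   ("leaf", payoff), ("max", [children]), ("mix", [(prob, child), ...])
    # "mix" covers both chance nodes (card reveals) and opponent nodes,
    # whose action probabilities come from the opponent model.
    def miximax(node):
        kind, data = node
        if kind == "leaf":
            return data                              # payoff at resolve/showdown
        if kind == "max":                            # our decision node
            return max(miximax(child) for child in data)
        if kind == "mix":                            # chance or modeled opponent
            return sum(p * miximax(child) for p, child in data)
        raise ValueError(kind)

    # Hypothetical hand: fold loses 1; call leads to a 0.5/0.5 card reveal;
    # raise leads to an opponent node with model probabilities 0.6/0.3/0.1.
    tree = ("max", [
        ("leaf", -1.0),
        ("mix", [(0.5, ("leaf", 2.0)), (0.5, ("leaf", -2.0))]),
        ("mix", [(0.6, ("leaf", 3.0)), (0.3, ("leaf", 1.0)),
                 (0.1, ("leaf", -4.0))]),
    ])
    print(miximax(tree))                             # 1.7 -> raising is best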

SLIDE 18–34

[Figure, built up step by step: the miximax tree for one betting decision. The root is a max node over my action: fold, call, or raise. Fold leads to a Resolve leaf; call and raise lead to Reveal Cards chance nodes with outcome probabilities (0.5/0.5), followed by opponent action nodes (opp-1 action, opp-2 action) whose fold/call/raise probabilities (0.6/0.3/0.1) come from the opponent model. Leaf payoffs (1, 3, 4, 2, …) are backed up through the tree as expected values; the backup rule is written out below.]
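
Written out, with illustrative placeholders v_fold, v_call, v_raise rather than values read off the original figure: at an opponent node the model's action probabilities weight the subtree values,

    EV(opponent node) = 0.6 * v_fold + 0.3 * v_call + 0.1 * v_raise

chance nodes average over card probabilities the same way (here 0.5/0.5), and our own decision nodes take the maximum over fold, call, and raise.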

SLIDE 35

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
  - Opponent model
- Conclusion

SLIDE 36

Short Experiment

SLIDE 37

Opponent Model

- Set of probability trees (Weka's M5'); a training sketch follows below
- Separate model for
  - Actions
  - Hand cards at showdown
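
A minimal training sketch for one such action model, using scikit-learn's DecisionTreeRegressor as a stand-in for Weka's M5'. The feature names mirror the fold-probability tree on the next slide; the data is random placeholder data, so the learned tree here is meaningless:

    # Sketch: learn P(fold | game state) as a regression tree.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    features = ["nbAllPlayerRaises", "callFrequency", "nbActionsThisRound",
                "potOdds", "AF", "potSize", "stackSize", "foldFrequency"]

    rng = np.random.default_rng(0)
    X = rng.random((5000, len(features)))        # placeholder game states
    y = (rng.random(5000) < 0.7).astype(float)   # placeholder did-fold labels

    # Large leaves so that leaf means behave like probability estimates.
    model = DecisionTreeRegressor(min_samples_leaf=200).fit(X, y)

    state = rng.random((1, len(features)))
    print("P(fold) =", model.predict(state)[0])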

SLIDE 38

Fold Probability

nbAllPlayerRaises <= 1.5 :
|   callFrequency <= 0.128 :
|   |   nbActionsThisRound <= 2.5 :
|   |   |   potOdds <= 0.28 :
|   |   |   |   AF <= 2.585 : 0.6904
|   |   |   |   AF > 2.585 :
|   |   |   |   |   potSize <= 3.388 :
|   |   |   |   |   |   round=flop <= 0.5 : 0.8068
|   |   |   |   |   |   round=flop > 0.5 : 0.6896
|   |   |   |   |   potSize > 3.388 : 0.8198
|   |   |   potOdds > 0.28 :
|   |   |   |   stackSize <= 97.238 :
|   |   |   |   |   callFrequency <= 0.038 : 0.8838
|   |   |   |   |   callFrequency > 0.038 :
|   |   |   |   |   |   round=flop <= 0.5 : 0.8316
|   |   |   |   |   |   round=flop > 0.5 :
|   |   |   |   |   |   |   nbSeatedPlayers <= 7.5 : 0.6614
|   |   |   |   |   |   |   nbSeatedPlayers > 7.5 : 0.7793
|   |   |   |   stackSize > 97.238 :
|   |   |   |   |   potSize <= 4.125 :
|   |   |   |   |   |   foldFrequency <= 0.813 : 0.7839
|   |   |   |   |   |   foldFrequency > 0.813 : 0.9037
|   |   |   |   |   potSize > 4.125 : 0.8623
|   nbActionsThisRound > 2.5 :
|   |   potOdds <= 0.218 :
|   |   |   callFrequency <= 0.067 : 0.8753
|   |   |   callFrequency > 0.067 : 0.7661
|   |   potOdds > 0.218 :
|   |   |   AF <= 2.654 : 0.8818
|   |   |   AF > 2.654 : 0.921

SLIDE 39

(Can also be relational)

- Tilde probability tree [Ponsen08]

SLIDE 40

Opponent Ranks

- Learn distribution of hand ranks at showdown

[Figures: probability per rank bucket, and probability per number of raises.]

SLIDE 41

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
  - Opponent model
- Conclusion

SLIDE 42

Traversing the tree

- Limit Texas hold'em
  - 10^18 nodes
  - Fully traversable
- No-limit
  - > 10^71 nodes
  - Too large to traverse
  - Sampled, not searched: Monte-Carlo Tree Search

SLIDE 43

Monte-Carlo Tree Search [Chaslot08]

[Figure: the phases of MCTS covered on the following slides: selection, expansion, simulation, backpropagation.]

SLIDE 44–48

Selection

- UCT (Multi-Armed Bandit)
- CrazyStone

In each node, X̄_i is an estimate of the reward and n_i is the number of samples. UCT selects the child maximizing

    X̄_i + C * sqrt( ln(n) / n_i )

where the first term rewards exploitation and the second exploration (code sketch below).
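
A minimal sketch of this selection rule. This is standard UCT; the constant C and the node fields are generic, not the talk's exact implementation:

    import math

    class Child:
        def __init__(self):
            self.value = 0.0    # running estimate of the reward (X̄_i)
            self.visits = 0     # number of samples (n_i)

    def uct_select(children, c=1.4):
        """Pick the child maximizing exploitation + exploration."""
        total = sum(ch.visits for ch in children)
        def score(ch):
            if ch.visits == 0:
                return float("inf")   # always try unvisited children first
            return ch.value + c * math.sqrt(math.log(total) / ch.visits)
        return max(children, key=score)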

SLIDE 49

Expansion & Simulation

SLIDE 50–52

Backpropagation

- Sample-weighted average
- Maximum child

In each node, X̄_i is an estimate of the reward and n_i is the number of samples (code sketch below).
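
The two backup rules above, sketched with the same node fields as the selection sketch (illustrative only):

    def backpropagate(path, reward):
        # Sample-weighted average: each node keeps the running mean
        # of all rewards sampled in its subtree.
        for node in reversed(path):
            node.visits += 1
            node.value += (reward - node.value) / node.visits

    def backup_max_child(node):
        # Maximum child: estimate a decision node by its best child
        # rather than by the average over its samples.
        node.value = max(ch.value for ch in node.children)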

SLIDE 53

Initial experiments

- 1 MCTS bot + 2 rule-based bots
- Exploitative!

[Figure: winnings of the MCTS Bot.]

SLIDE 54

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
  - Opponent model
- Conclusion

SLIDE 55–57

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
    - Uncertainty in MCTS
    - Continuous action spaces
  - Opponent model
    - Online learning
    - Concept drift
- Conclusion

SLIDE 58

MCTS for games with uncertainty?

- Expected reward distributions (ERD)
- Sample selection using ERD
- Backpropagation of ERD

[VandenBroeck09]

SLIDE 59–74

[Figure, built up step by step: expected reward distributions. Estimating a node's value from 10 or 100 samples gives a distribution with sampling variance around the true value; with ∞ samples the estimate converges to the MiniMax (here ExpectiMax/MixiMax) value. The final frames combine uncertainty and sampling, labeling the limit "ExpectiMax/MixiMax / T(P)".]

SLIDE 75–77

ERD selection strategy

- Objective?
  - Find the maximum expected reward
  - Sample more in subtrees with
    (1) a high expected reward
    (2) an uncertain estimate
- UCT does (1) but not really (2)
- CrazyStone does (1) and (2), but for deterministic games (Go)
- UCT+ selection combines (1), the "expected value under perfect play", with (2), a "measure of uncertainty due to sampling" (sketch below)
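
A sketch of such a selection rule, assuming, as the annotations suggest, that the score adds the value estimate (1) to a constant times the sampling uncertainty of that estimate (2); the field names and the constant are placeholders, not the paper's exact formulation:

    import math

    def uct_plus_select(children, c=2.0):
        """Value estimate plus a measure of uncertainty due to sampling."""
        def score(ch):
            if ch.visits < 2:
                return float("inf")
            # ch.sq_sum is assumed to hold the running sum of squared rewards.
            variance = max(ch.sq_sum / ch.visits - ch.value ** 2, 0.0)
            std_error = math.sqrt(variance / ch.visits)  # shrinks as n_i grows
            return ch.value + c * std_error
        return max(children, key=score)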

SLIDE 78–83

ERD max-distribution backpropagation

[Figure, built up step by step: a max node over subtrees A and B with estimates 3 and 4. Sample-weighted averaging backs up 3.5; taking the maximum child backs up 4; backing up the max-distribution yields 4.5. "When the game reaches P, we'll have more time to find the real …"]

Why the max-distribution exceeds the best single estimate: P(A>4) = 0.2, P(A<4) = 0.8, P(B>4) = 0.5, P(B<4) = 0.5, so for independent estimates P(max(A,B)>4) = 1 − 0.8 · 0.5 = 0.6 > 0.5.
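
The general rule behind these numbers: for independent children, the CDF of the maximum is the product of the child CDFs, so the max-distribution is more optimistic than either child alone. A minimal check of the slide's example:

    # P(max(A,B) <= x) = P(A <= x) * P(B <= x) for independent A and B.
    p_a_gt4 = 0.2                                  # P(A > 4)
    p_b_gt4 = 0.5                                  # P(B > 4)
    p_max_gt4 = 1 - (1 - p_a_gt4) * (1 - p_b_gt4)
    print(p_max_gt4)                               # 0.6 > 0.5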

SLIDE 84

Experiments

- 2 MCTS bots: max-distribution vs. sample-weighted backpropagation
- 2 MCTS bots: UCT+ (stddev) vs. UCT

SLIDE 85

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
    - Uncertainty in MCTS
    - Continuous action spaces
  - Opponent model
    - Online learning
    - Concept drift
- Conclusion

SLIDE 86

Dealing with continuous actions

- Sample discrete actions
- Progressive unpruning [Chaslot08] (ignores smoothness of the EV function)
- …
- Tree learning search (work in progress)

[Figure: EV as a function of relative bet size.]

SLIDE 87

Tree learning search

- Based on regression tree induction from data streams
  - training examples arrive quickly
  - nodes split when there is a significant reduction in stddev
  - training examples are immediately forgotten
- Edges in the TLS tree are not actions but sets of actions, e.g., (raise in [2,40]), (fold or call)
- MCTS provides a stream of (action, EV) examples
- Split action sets to reduce the stddev of EV when significant (sketch below)
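
A minimal sketch of the splitting step: a leaf covering a bet interval accumulates (action, EV) samples from MCTS and reports the boundary that most reduces the spread of EV. The minimum samples per side stand in for the proper significance test the method calls for:

    import statistics

    class ActionIntervalLeaf:
        """TLS leaf covering bets in [lo, hi]; accumulates MCTS samples."""
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi
            self.samples = []                    # (bet_size, ev) pairs

        def add(self, bet, ev):
            self.samples.append((bet, ev))

        def best_split(self, min_side=30):
            """Boundary that most reduces the size-weighted stddev of EV."""
            self.samples.sort()
            evs = [ev for _, ev in self.samples]
            best, best_score = None, statistics.pstdev(evs) * len(evs)
            for i in range(min_side, len(evs) - min_side):
                score = (statistics.pstdev(evs[:i]) * i
                         + statistics.pstdev(evs[i:]) * (len(evs) - i))
                if score < best_score:
                    best, best_score = self.samples[i][0], score
            return best                          # None if no split helps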

SLIDE 88–91

Tree learning search

[Figure, built up step by step: a max node with edges {Fold, Call} and Bet in [0,10]. Once enough samples arrive, the optimal split of the bet interval is found at 4, and the edge splits into Bet in [0,4] and Bet in [4,10], each with its own subtree.]

SLIDE 92

Tree learning search

[Figure: a TLS tree whose successive levels each cover one action of P1, then one action of P2.]

SLIDE 93

Selection Phase

[Figure: a sample with bet size 2.4 is routed through P1's node.]

Each node has an EV estimate, which generalizes over actions.

SLIDE 94–95

Expansion

[Figure: the selected node (below P1's and P2's actions) is expanded with a new node representing any action of P3.]

SLIDE 96–97

Backpropagation

[Figure: a new sample arrives and a split becomes significant, so the action-set node is split.]

SLIDE 98–99

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
    - Uncertainty in MCTS
    - Continuous action spaces
  - Opponent model
    - Online learning
    - Concept drift
- Conclusion

SLIDE 100

Online learning of opponent model

- Start from a (safe) model of a general opponent
- Exploit weaknesses of the specific opponent

[Figure: timeline marking where to start learning a model of the specific opponent (exploration of opponent behavior).]
SLIDE 101–104

Multi-agent interaction

[Figure, built up step by step: Yellow learns a model for Blue and changes strategy. Yellow doesn't profit! Green profits without changing strategy!!]

SLIDE 105

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
    - Uncertainty in MCTS
    - Continuous action spaces
  - Opponent model
    - Online learning
    - Concept drift
- Conclusion

SLIDE 106

Concept drift

- While learning from a stream, the training examples in the stream change
  - In the opponent model: a changing strategy
  - "Changing gears is not just about bluffing, it's about changing strategy to achieve a goal."
- Learning with concept drift must
  - adapt quickly to changes
  - yet be robust to noise
  - (recognize recurrent concepts)

SLIDE 107

Basic approach to concept drift

- Maintain a window of training examples
  - large enough to learn from
  - small enough to adapt quickly
  - without 'old' concepts
- Heuristics to adjust window size (sketch below)
  - based on the FLORA2 framework [Widmer92]
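
A sketch of such a window-adjustment heuristic in the FLORA2 spirit; the threshold and shrink factor are illustrative placeholders, not Widmer's original parameters:

    from collections import deque

    class AdaptiveWindow:
        """Window of (example, was_prediction_correct) pairs."""
        def __init__(self, max_size=200):
            self.window = deque()
            self.max_size = max_size

        def accuracy(self):
            return (sum(ok for _, ok in self.window) / len(self.window)
                    if self.window else 1.0)

        def add(self, example, correct):
            self.window.append((example, correct))
            if self.accuracy() < 0.6:                  # drift suspected:
                for _ in range(len(self.window) // 5): # drop oldest 20%
                    self.window.popleft()
            elif len(self.window) > self.max_size:     # stable: stay bounded
                self.window.popleft()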

SLIDE 108

[Figure: accuracy and window size over time for the 4 components of a single opponent model, marking where online learning starts and where concept drift occurs.]

SLIDE 109

[Figure: with bad parameters for the heuristic, accuracy and window size do not recover: NOT ROBUST.]

SLIDE 110

Outline

- Overview of approach
  - The poker game tree
  - Opponent model
  - Monte-Carlo tree search
- Research challenges
  - Search
  - Opponent model
- Conclusion

SLIDE 111

Conclusions

- First exploitative poker bot for
  - no-limit hold'em
  - > 2 players
- Apply in other games
  - backgammon
  - computational pool
  - …
- Challenges for MCTS
  - games with uncertainty
  - continuous action spaces
- Challenges for ML
  - online learning
  - concept drift
  - (relational learning)

SLIDE 112

Thanks for listening!