

SLIDE 1

Monte Carlo Continual Resolving for Online Strategy Computation in Imperfect Information Games

Michal Sustr

Faculty of Electrical Engineering, Czech Technical University
michal.sustr@aic.fel.cvut.cz

December 6, 2018


SLIDE 2

Overview

1. Motivation - why online algorithms?
2. Background: EFGs, solution concepts, offline algorithms
3. Online algorithms: without guarantees, with guarantees
4. Continual resolving: gadget game, online play function
5. MCCR: evaluation
6. Future work
7. Conclusion


SLIDES 3–5

Motivation - why online algorithms?

• Offline vs. online algorithms.
• How to find unexploitable strategies with online algorithms?
• How to evaluate online algorithms?

SLIDES 6–10

Summary of what we've done

• We extend Continual Resolving from DeepStack to work on general games.
• We observe that the construction of the public tree of a game depends on a crude definition of augmented information sets, which can be refined in a domain-dependent way.
• We implemented MCCFR with incremental tree building as a resolver for Continual Resolving.
• We prove that the exploitability of CR and MCCR can be bounded, and that the bound improves with time.
• We implemented the new algorithm MCCR (Monte Carlo Continual Resolving). It works, but it is not significantly better than previous algorithms.

SLIDE 11

Background

SLIDES 12–13

EFGs

Extensive Form Game

Defined by the tuple G = ⟨H, Z, A, P, σ_c, u, I⟩:
• H - histories
• Z - terminal histories
• A - actions in information sets
• P - player function (who plays in a given history)
• σ_c - stochastic transitions (the chance player)
• u - utilities in terminals
• I - information partition into infosets

Assumptions: two-player zero-sum games, perfect recall.
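To make the tuple concrete, here is a minimal sketch of how such a game could be represented (Python; the field names and signatures are assumptions for illustration, not the thesis implementation):

```python
# A minimal sketch of the EFG tuple G = ⟨H, Z, A, P, σ_c, u, I⟩.
# Illustrative only; field names and signatures are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

History = Tuple[str, ...]  # a history is a sequence of action labels

@dataclass
class EFG:
    histories: List[History]                       # H
    terminals: List[History]                       # Z ⊆ H
    actions: Callable[[History], List[str]]        # A: legal actions at a history
    player: Callable[[History], int]               # P: 1, 2, or 0 for chance
    chance: Callable[[History, str], float]        # σ_c: chance action probabilities
    utility: Callable[[History, int], float]       # u: payoff of player i at terminal z
    infosets: Dict[int, List[FrozenSet[History]]]  # I: per-player partition into infosets
```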

SLIDE 14

Game of small poker

[Figure: game tree of a small poker game. Chance deals one of four card combinations (🂼🂼, 🂾🂼, 🂼🂾, 🂾🂾); player 1 folds (F) or bets (B), player 2 then folds (f) or calls (c), with terminal payoffs of 1 and 3.]

SLIDE 15

Poker with infoset partition I_1, I_2

[Figure: the same small poker tree, with the information-set partitions I_1 and I_2 drawn over the nodes.]

SLIDES 16–19

Augmented information sets

What is a player's uncertainty about moves where he doesn't play? (An equivalent of information sets for the opponent's moves.)
→ we can use augmented information sets I^aug

Observation history

For player i ∈ {1, 2} and history h ∈ H \ Z, player i's observation history O_i(h) in h is the sequence (I_1, a_1, I_2, a_2, ...) of the information sets visited and the actions taken by i on the path to h (including I ∋ h if h ∈ H_i).

Augmented information sets

Two histories g, h ∈ H \ Z belong to the same augmented information set I ∈ I_i^aug iff O_i(g) = O_i(h).
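As a sketch, an observation history can be computed by walking the path to h and recording what player i sees. This builds on the EFG sketch above; `infoset_of` (returning i's augmented infoset of a history prefix) is an assumed helper, not part of the thesis code:

```python
def observation_history(game: EFG, h: History, i: int) -> tuple:
    """O_i(h): the infosets player i passes through and the actions i takes
    on the path to h. `infoset_of` is an assumed, hypothetical helper."""
    seq = []
    prefix: History = ()
    for a in h:
        seq.append(infoset_of(game, prefix, i))  # infoset i observes here
        if game.player(prefix) == i:
            seq.append(a)                        # i also observes its own action
        prefix = prefix + (a,)
    return tuple(seq)
```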

SLIDES 20–21

Augmented infoset partitions I_1^aug, I_2^aug

[Figures: the small poker tree again, first with the augmented infoset partition I_1^aug and then with I_2^aug drawn over the nodes.]

SLIDES 22–23

How should I^aug look for the game pictured?¹

[Figures: example game, not recoverable from the extraction.]

¹ Thanks Vojta for the pics in TikZ!

SLIDES 24–27

Can we refine I^aug?

[Figures: a sequence of candidate refinements, not recoverable from the extraction.]

SLIDES 28–30

Telling apart

We'd like to formalize the notion of "every player knows that every player knows" in some parts of the game. We will call these public states. Examples: Poker, II Goofspiel, Phantom Tic-Tac-Toe.

Telling apart histories

We write g ∼ h when there is a player who cannot distinguish the two (cannot tell h apart from g): g ∼ h ⟺ O_1(g) = O_1(h) ∨ O_2(g) = O_2(h).

Transitive closure

We denote by ≈ the transitive closure of ∼. Formally, g ≈ h iff (∃n)(∃h_1, ..., h_n): g ∼ h_1, h_1 ∼ h_2, ..., h_{n−1} ∼ h_n, h_n ∼ h.

SLIDES 31–32

Public state

A public partition is any partition S of H \ Z whose elements are closed under ∼ (telling apart). An element S of any such S is called a public state.

A public partition induces a public tree - a tree that we can traverse to compute a strategy online.
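The coarsest such partition is exactly the set of ≈-classes, i.e. the connected components of ∼. Here is a sketch of computing it with a union-find over histories (quadratic and illustrative only; `obs1`/`obs2` stand for the O_1, O_2 functions from the sketch above):

```python
from itertools import combinations

def public_states(histories, obs1, obs2):
    """Coarsest public partition: connected components of the relation
    g ∼ h ⟺ O_1(g) = O_1(h) ∨ O_2(g) = O_2(h). Histories must be hashable."""
    parent = {h: h for h in histories}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for g, h in combinations(histories, 2):
        if obs1(g) == obs1(h) or obs2(g) == obs2(h):  # g ∼ h
            parent[find(g)] = find(h)

    groups = {}
    for h in histories:
        groups.setdefault(find(h), []).append(h)
    return list(groups.values())
```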

SLIDE 33

IIGS-3: beginning of the game

[Figure: the IIGS-3 game tree for Round 1 and the beginning of Round 2.]

SLIDES 34–35

IIGS-3: augmented infosets I_1^aug and I_2^aug

[Figures: the same tree fragment with the augmented infoset partitions I_1^aug and I_2^aug highlighted.]

SLIDE 36

IIGS-3 public tree

[Figure: the public tree of IIGS-3 over Rounds 1–3, with "play" nodes.]

SLIDES 37–38

IIGS-3 public tree induced by domain-specific I^aug

We'd like something more refined when possible.

[Figure: a finer public tree for IIGS-3 whose public states also expose the win/draw/lose outcome of each round.]

SLIDES 39–40

Solution concepts

Approximate Nash Equilibrium

A profile σ (where σ_i ∈ Σ_i is a behavioural strategy of player i) is an ε-NE if

(∀i ∈ {1, 2}): u_i(σ) ≥ max_{σ'_i ∈ Σ_i} u_i(σ'_i, σ_{−i}) − ε.

Exploitability of a strategy profile

expl_i(σ) := u_i(σ*) − min_{σ'_{−i} ∈ Σ_{−i}} u_i(σ_i, σ'_{−i}),

and expl(σ) := 1/2 [expl_1(σ) + expl_2(σ)].
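For intuition, here is the definition instantiated for a zero-sum matrix game (a sketch, not thesis code; note the game-value terms u_i(σ*) cancel when the two exploitabilities are averaged):

```python
import numpy as np

def exploitability(A, x, y):
    """expl(σ) = 1/2 [expl_1(σ) + expl_2(σ)] for a zero-sum matrix game with
    row-player payoff matrix A and profile σ = (x, y). The game-value terms
    u_i(σ*) cancel in the average, leaving only the best-response gaps."""
    br_row = np.max(A @ y)   # player 1's best-response value against y
    br_col = np.min(x @ A)   # player 2 minimizes player 1's payoff against x
    return 0.5 * (br_row - br_col)

# Rock-paper-scissors: the uniform profile is a NE, so expl = 0.
A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
u = np.ones(3) / 3
assert abs(exploitability(A, u, u)) < 1e-12
```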

SLIDES 41–43

Complexity?

• Finding a NE in general-sum games is PPAD-hard (not NP-complete, because we know a solution must exist).
• Zero-sum EFGs with imperfect recall are NP-hard.
• Zero-sum EFGs with perfect recall can be formulated as a linear program:
  – The number of constraints is equal to the number of pure strategies of the other player, which can be exponential in the size of the game.
  – But this optimization problem has polynomial-time "separation oracles": a polynomial-time algorithm that tests whether a given point satisfies all inequalities and, if not, finds a violated one.
  – The ellipsoid method can then be applied to solve it in polynomial time.

SLIDE 44

Complexity?

But even simple games are huge.

Game          |S|     |I|     |H|      |Z|      |Ω|
IIGS(5)       363     9948    41331    14400    13
LD(1,1,6)     4098    24576   147456   147420   396
GP(3,3,2,2)   2671    7920    23760    44883    45

IIGS-13: |H| ≈ 10^22; no-limit Texas hold 'em: |H| ≈ 10^160.

SLIDES 45–50

Offline algorithms

• Sequence-form LP
• Double Oracle
• EGT (Excessive Gap Technique)
• CFR (Counterfactual Regret Minimization)
• CFR+
• ... probably more, but not many

SLIDES 51–52

CFR

Counterfactuals

The counterfactual value (CFV) of player i under strategy profile σ is

v_i^σ(h) := π_{−i}^σ(h) u_i^σ(h),

and the counterfactual regret at information set I for playing action a is

r_i^σ(I, a) := v_i^σ(I, a) − v_i^σ(I).

Immediate counterfactual regret

R̄_{i,imm}^T(I) := max_{a∈A(I)} R̄_{i,imm}^T(I, a) := max_{a∈A(I)} (1/T) Σ_{t=1}^T r_i^{σ^t}(I, a)

SLIDES 53–54

CFR

Regret matching update rule

σ^{t+1}(I)(a) := R̄_{i,imm}^{t,+}(I, a) / Σ_{a'∈A(I)} R̄_{i,imm}^{t,+}(I, a')

Average strategy

σ̄^T(I)(a) := [Σ_{t=1}^T π_i^{σ^t}(I) σ^t(I, a)] / [Σ_{t=1}^T π_i^{σ^t}(I)]    (where I ∈ I_i)
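A compact sketch of the regret-matching rule for a single infoset (illustrative; it assumes the cumulative regrets are accumulated elsewhere):

```python
import numpy as np

def regret_matching(cum_regret):
    """σ^{t+1}(I): play each action proportionally to its positive cumulative
    regret; fall back to uniform when no action has positive regret."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))
```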

SLIDE 55

CFR

[Figure: CFR on rock-paper-scissors - P(a) over 200 iterations for both players, showing the regret-matching (current) strategy and the average strategy for Rock, Paper and Scissors.]

SLIDES 56–57

MCCFR

Monte Carlo variant of CFR - we sample one terminal history at a time (this is called Outcome Sampling).

Sampling distribution

The sampling distribution must have a positive probability of sampling any leaf (even in unreachable parts of the tree). We use the sampling strategy σ^{t,ε} := (1 − ε)σ^t + ε · rnd, where ε ∈ (0, 1] controls the exploration and rnd(I)(a) := 1/|A(I)|. This is also called "epsilon-on-policy exploration".
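A one-line sketch of this exploratory sampling strategy at an infoset (illustrative only):

```python
import numpy as np

def sampling_strategy(sigma_I, epsilon=0.6):
    """σ^{t,ε}(I) = (1 − ε)·σ^t(I) + ε·uniform: every action, and hence every
    leaf below it, keeps a positive sampling probability."""
    n = len(sigma_I)
    return (1.0 - epsilon) * np.asarray(sigma_I) + epsilon / n
```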

SLIDE 58

MCCFR

Sampled regrets

Regrets are now sampled:

r̃_i^{σ^t}(I, a) := { w_I · (π^{σ^t}(z|ha) − π^{σ^t}(z|h))   if ha ⊏ z,
                     w_I · (0 − π^{σ^t}(z|h))                otherwise,

where h denotes the prefix of z which is in I, and w_I stands for π_{−i}^{σ^t}(z|h) u_i(z) / π^{σ^t,ε}(z).
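In code, the per-action sampled regret at the visited infoset looks roughly like this (a sketch of the formula above; `w_I` and the tail probabilities are assumed to be computed during the sampled traversal):

```python
def sampled_regrets(actions, a_sampled, tail_h, tail_ha, w_I):
    """r̃_i^{σ^t}(I, ·) for the sampled terminal z. `tail_h` is π^{σ^t}(z|h),
    `tail_ha` is π^{σ^t}(z|ha); only the sampled action's branch reaches z,
    so every other action contributes a zero tail probability."""
    return {a: w_I * ((tail_ha if a == a_sampled else 0.0) - tail_h)
            for a in actions}
```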

SLIDES 59–60

MCCFR average strategy (1/2)

The current strategy is not guaranteed to converge to an equilibrium, so we need to calculate the average strategy. Recall that

σ̄^T(I)(a) := [Σ_{t=1}^T π_i^{σ^t}(I) σ^t(I, a)] / [Σ_{t=1}^T π_i^{σ^t}(I)]    (where I ∈ I_i),

which can be rewritten as

σ̄^T(I)(a) := acc^T(I, a) / Σ_{a'∈A(I)} acc^T(I, a'),

where acc denotes the cumulative sum acc^T(I, a) = Σ_{t=1}^T π_i^{σ^t}(I) σ^t(I, a).

SLIDES 61–63

MCCFR average strategy (2/2)

There are multiple ways to calculate the average strategy with MCCFR. We use stochastically-weighted averaging (updating only infosets that are on the trajectory of the sampled terminal z):

acc^t(I)(a) := acc^{t−1}(I)(a) + { (π_i^{σ^t}(h) / π^{σ^t,ε}(h)) σ^t(I, a)   if ha ⊏ z,
                                    0                                         otherwise.
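A sketch of that accumulator update at an infoset on the sampled trajectory (the data layout, `acc` mapping infosets to per-action numpy vectors, is an assumption):

```python
import numpy as np

def update_average(acc, I, sigma_I, a_sampled, reach_i, reach_sampling):
    """acc^t(I)(a) += (π_i^{σ^t}(h) / π^{σ^t,ε}(h)) · σ^t(I, a) if ha ⊏ z,
    else unchanged - i.e. only the sampled action's entry is touched."""
    vec = acc.setdefault(I, np.zeros(len(sigma_I)))
    vec[a_sampled] += (reach_i / reach_sampling) * sigma_I[a_sampled]
```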

SLIDE 64

Online algorithms

SLIDES 65–66

Online algorithms without guarantees

Information-Set Monte Carlo Tree Search
• samples as in a perfect-information game, but computes statistics for the whole infoset
• various selection functions - we use UCT, RM

Unsafe resolving
• does not deal with what happens outside of the subgame
• summarizes what happened so far with a chance node with σ_c(∅, a) = π^σ(h)/π^σ(S)

SLIDES 67–68

Online algorithms with guarantees

OOS (Online Outcome Sampling)
• updates the sampling distribution to send more samples into the current play position
• incremental tree building

Continual Resolving
• uses safe resolving (gadget game) repeatedly in public states
• we need to get the CFVs somehow!

SLIDE 69

Continual resolving

SLIDES 70–71

Gadget game

For continual resolving, we will need to construct a "resolving gadget game":

[Figure: the resolving gadget - for each opponent infoset at the root of the subgame, the opponent chooses either to follow (F), after which play continues as in the original game, or to terminate (T).]

SLIDE 72

Continual resolving

Main idea: repeatedly construct a gadget game in each encountered public state S during play, and solve the gadget game. We need to store data ∀I ∈ S:
• reach probabilities of infosets π_{−i}^σ(I)
• CFVs of infosets v_{−i}(I)
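A minimal sketch of the data carried between resolving steps (the layout and field names are assumptions for illustration):

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ResolvingData:
    """D(S): what continual resolving must remember about a public state S."""
    opp_reach: Dict[str, float] = field(default_factory=dict)  # π_{−i}^σ(I) per infoset I ∈ S
    opp_cfv: Dict[str, float] = field(default_factory=dict)    # v_{−i}(I) per infoset I ∈ S
```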

SLIDE 73

Function Play of Continual Resolving

Input:  an information set I ∈ I_1
Output: an action a ∈ A(I)

S ← the public state which contains I
if S ∉ KPS then
    G(S) ← BuildResolvingGame(S, D(S))
    KPS ← KPS ∪ {S}
    NPS ← all S' ∈ S where CR acts for the first time after leaving KPS
    ρ̃, D̃ ← Resolve(G(S), NPS)
    σ_1|S' ← ρ̃|S' for each S' ∈ NPS
    D ← calculate data for NPS based on D, σ_1 and D̃
end
return a ∼ σ_1(I)

SLIDE 74

MCCR

SLIDES 75–80

Description of algorithm

We implement the "Resolve" function using MCCFR:
• MCCFR finds the strategy in the current information set.
• While sampling, we store the expected values ũ^{σ̄^T}(h).
• The CFVs are then simply obtained as ṽ_i^{σ̄^T}(h) = π_{−i}^{σ̄^T}(h) ũ^{σ̄^T}(h).
• We know how we arrived at h, so we can easily store π^{σ̄^T}(h),
• but we need to estimate ũ^{σ̄^T}(h) well.
• This is difficult, because we use σ^t to sample, not σ̄^t!
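A sketch of the kind of running estimator behind ũ^{σ̄^T}(h) (illustrative only - the thesis handles the σ^t-versus-σ̄^t mismatch with more care than this):

```python
class AvgValueEstimate:
    """Running estimate of ũ^{σ̄^T}(h): an importance-corrected average of the
    values sampled while playing σ^{t,ε}. Multiplying by the tracked opponent
    reach π_{−i}^{σ̄^T}(h) then yields the CFV estimate ṽ_i^{σ̄^T}(h)."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, sampled_value, importance_weight):
        self.total += importance_weight * sampled_value
        self.count += 1

    def value(self):
        return self.total / max(self.count, 1)
```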

SLIDE 81

Bounds

CR bound

Suppose that CR uses D = (r_1, ṽ) and G(S, σ_1, ṽ). Then the exploitability of its strategy is bounded by

expl_1(σ_1) ≤ ε_0^ṽ + ε_1^R + ε_1^ṽ + ... + ε_{N−1}^ṽ + ε_N^R,

where N is the number of resolving steps and

ε_n^R := expl_1(ρ̃_n),    ε_n^ṽ := Σ_{J ∈ Ŝ_{n+1}(2)} |ṽ(J) − v_1^{σ*_n, CBR_2}(J)|

are the exploitability (in G(S_n)) and the value-estimation error made by the n-th resolver (resp. the initialization for n = 0).

SLIDE 82

Bounds

MCCR bound

With probability at least (1 − p)^{N+1}, the exploitability of the strategy σ computed by MCCR satisfies

expl_i(σ) ≤ (√2/√p + 1) · (|I_i| Δ_{u,i} √(A_i) / δ²) · (2/√T_0 + (2N − 1)/√T_R).

SLIDES 83–87

Evaluation of online algorithms

Evaluation of online algorithms is hard.
• Correct "brute-force" way: simulate all the possible game trajectories and resolve accordingly.
• Drawback: O(t|S|), and for reliability we should use more seeds!
• Averaging over seeds, we produce what we call the σ̄̄ (double-bar) strategy; see the sketch after this list.
• This strategy is no worse than the individual seed strategies: expl(σ̄̄^T) ≤ (1/T) Σ_{t=1}^T expl(σ̄^t) (t means seeds in this context).
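A sketch of the seed averaging, reach-weighted per infoset the same way as the CFR average strategy so that the result corresponds to picking a seed uniformly at random (the data layout is an assumption, not the thesis code):

```python
def double_bar_strategy(seed_strategies, seed_reaches):
    """σ̄̄: mixture of the seed strategies σ̄^t. `seed_strategies[t][I]` is a
    probability vector, `seed_reaches[t][I]` is π_i^{σ̄^t}(I); weighting each
    seed by its own reach makes the mixture behave like play with a uniformly
    random seed, which is why expl(σ̄̄^T) ≤ (1/T) Σ_t expl(σ̄^t)."""
    out = {}
    for I in seed_strategies[0]:
        num = sum(r[I] * s[I] for s, r in zip(seed_strategies, seed_reaches))
        den = sum(r[I] for r in seed_reaches)
        out[I] = num / den if den > 0 else seed_strategies[0][I]
    return out
```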

SLIDE 88

Experimental results

SLIDE 89

Results - CFVs

[Figure: averages and variances of the CFV differences as a function of the number of samples (10¹–10⁷), for B-RPS, PTTT, IIGS-5, IIGS-13, LD-116, LD-226, GP-3322 and GP-4644.]

SLIDE 90

Results - convergence, "reset" variant

[Figure: four panels plotting exploitability against the number of iterations per gadget game (10²–10⁶) and the number of iterations in the root (10²–10⁷).]

SLIDE 91

Results - convergence, "keep" variant

[Figure: the same four panels as on the previous slide, for the "keep" variant.]

SLIDE 92

Results - convergence, slices

[Figure: expl₂ as a function of T_R (with T_0 = 10⁷) and as a function of T_0 (with T_R = 10⁶), for GP-3322, IIGS-5 and LD-116 in both the "reset" and "keep" variants.]

SLIDE 93

Results - exploitability given time budget

[Figure: expl₂ versus the time budget (10⁰–10³ ms) on IIGS-5, LD-116 and GP-3322, comparing MCCR (reset), MCCR (keep), OOS (PST), MCCFR and RND.]

SLIDE 94

Results - exploration parameter

[Figure: expl₁ versus the number of samples in the gadget (10²–10⁶) for ε = 0.2, 0.4, 0.6 and 0.8, on II-GS(5), LD(1,1,6) and GP(3,3,2,2).]

SLIDE 95

Results - selected domains

Head-to-head results (row algorithm's score against the column algorithm, ± confidence interval):

IIGS-13       MCCR (reset)  MCCFR        OOS (PST)    OOS (IST)    RM           UCT          RND
MCCR (keep)   −23.8 ± 7.8   8.0 ± 8.0    11.2 ± 8.0   −7.6 ± 8.0   −54.1 ± 6.8  −70.4 ± 5.7  43.3 ± 7.3
MCCR (reset)                22.0 ± 7.9   20.8 ± 7.9   0.8 ± 8.1    −34.6 ± 7.6  −61.5 ± 6.4  52.9 ± 6.9
MCCFR                                    −1.4 ± 8.0   −18.6 ± 7.9  −58.5 ± 6.6  −76.8 ± 5.2  37.2 ± 7.5
OOS (PST)                                             −19.9 ± 7.9  −58.4 ± 6.6  −76.1 ± 5.2  34.2 ± 7.6
OOS (IST)                                                          −40.3 ± 7.4  −60.0 ± 6.5  54.2 ± 6.7
RM                                                                              −22.4 ± 7.9  81.3 ± 4.7
UCT                                                                                          92.0 ± 3.1

LD-226        MCCR (reset)  MCCFR        OOS (PST)    OOS (IST)    RM           UCT          RND
MCCR (keep)   1.0 ± 8.1     46.0 ± 7.2   45.2 ± 7.3   −23.6 ± 7.9  −34.0 ± 7.7  −34.4 ± 7.7  75.0 ± 5.4
MCCR (reset)                39.8 ± 7.5   45.0 ± 7.3   −32.0 ± 7.7  −42.0 ± 7.4  −46.0 ± 7.2  81.8 ± 4.7
MCCFR                                    1.6 ± 8.1    −51.8 ± 7.0  −48.0 ± 7.1  −43.6 ± 7.3  52.2 ± 7.0
OOS (PST)                                             −58.2 ± 6.6  −53.2 ± 6.9  −47.2 ± 7.2  42.6 ± 7.4
OOS (IST)                                                          −10.0 ± 8.1  −19.0 ± 8.0  83.6 ± 4.5
RM                                                                              −13.2 ± 8.1  80.6 ± 4.8
UCT                                                                                          75.6 ± 5.3

GP-4644       MCCR (reset)  MCCFR        OOS (PST)    OOS (IST)    RM           UCT          RND
MCCR (keep)   −0.0 ± 4.2    10.3 ± 5.8   13.2 ± 5.6   −0.2 ± 5.3   −1.0 ± 4.3   −4.1 ± 3.6   18.7 ± 5.7
MCCR (reset)                9.7 ± 4.8    11.5 ± 5.0   1.1 ± 4.2    −3.4 ± 3.7   −2.2 ± 3.1   15.5 ± 5.1
MCCFR                                    −5.4 ± 6.3   −11.5 ± 5.5  −12.2 ± 4.9  −8.6 ± 4.0   11.6 ± 6.1
OOS (PST)                                             −12.2 ± 5.5  −10.8 ± 4.9  −6.6 ± 4.1   11.2 ± 6.1
OOS (IST)                                                          −0.4 ± 4.2   −0.0 ± 3.4   18.0 ± 5.6
RM                                                                              0.2 ± 2.3    19.7 ± 4.5
UCT                                                                                          18.5 ± 4.2

SLIDES 96–101

Future work

• Find a good domain to show dominance.
• Variance reduction.
• Using OOS as the resolver in the parent public state.
• Heuristics for resolving (neural networks) on general games.
• Find ε-NE CFVs without having the strategy.
• Evaluating online algorithms is hard, and we need something faster (exploitability bounds).

SLIDES 102–107

Conclusion

Takeaways:
• We specified how to do continual resolving for general imperfect-information games.
• Monte Carlo methods can be used to solve these games in an online fashion.

Furthermore:
• Augmented information sets can be further refined, but it is ambiguous how, and probably domain-dependent.
• We implemented an MCCFR resolver and showed how the CFVs of the average strategy can be computed.
• We experimentally verified that MCCR works and improves with more samples.
• However, it is not strong enough to beat unsafe algorithms like IS-MCTS within the same time budget, and it is not always stronger than OOS.

SLIDE 108

Thanks for coming!

Questions? Suggestions? Feedback?