

SLIDE 1

Data Biased Robust Counter Strategies

Michael Johanson, Michael Bowling. November 14, 2012

[Title slide graphic: playing cards with the Computer Poker Research Group logo]

University of Alberta Computer Poker Research Group

SLIDE 2

Introduction

Computer Poker Research Group

• Created Polaris, the world’s strongest program for playing Heads-Up Limit Texas Hold’em Poker
• July 2008: Went to Las Vegas, played against six poker pros, and won the 2nd Man-Machine Poker Championship
• Won several events in the 2008 AAAI Computer Poker Competition

Research goals:

• Solve very large extensive form games
• Learn to model and exploit an opponent’s strategy

SLIDE 3

Model Uncertainty and Risk

In this talk, we present a technique for dealing with three types of model uncertainty:

• The opponent / environment changes after we model it
• The model is more accurate in some areas than others
• The model’s prior beliefs are very inaccurate

SLIDE 4

Texas Hold’em Poker

Our domain: 2-player Limit Texas Hold’em Poker

• Zero-sum extensive form game
• Repeated game (hundreds or thousands of short games)
• Hidden information (can’t see the opponent’s cards)
• Stochastic elements (cards are dealt randomly)
• Goal: win as much money as possible

RL interpretation:

• POMDP (when the opponent’s strategy is static)
• Some properties of the world are known: the probability distributions at chance nodes
• Don’t know exactly what state you are in (because of the opponent’s cards)
• Transition probabilities at opponent choice nodes are unknown
• Payoffs at terminal nodes are unknown

SLIDE 5

Types of strategies

There are lots of ways to play games like poker. Two are well known:

Nash Equilibrium

• Minimizes worst-case performance
• Doesn’t try to exploit the opponent’s mistakes

Best Response

• Maximizes performance against a specific static opponent
• Doesn’t try to minimize worst-case performance
• Problem: requires the opponent’s strategy

Goals:

• Observe the opponent, build a model, and use it instead of the opponent’s strategy
• Bound worst-case performance
    • The model could be inaccurate
    • The opponent could change
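The next slides plot these two goals against each other. For reference, a minimal sketch of how the two quantities can be defined (my notation, not the slides’), where u_i is player i’s expected payoff in mb/g, Σ₂ is the set of all opponent strategies, and σ_fix is the specific static opponent:

```latex
% Sketch under assumed notation (not from the slides): u_i(\sigma_1,\sigma_2) is
% player i's expected payoff, \sigma_{fix} the static opponent, \Sigma_2 the set
% of all opponent strategies.
\begin{align*}
  \text{Exploitation of the opponent:} \quad & u_1(\sigma_1, \sigma_{\mathrm{fix}}) \\
  \text{Worst-case exploitability:}    \quad & \max_{\sigma_2 \in \Sigma_2} u_2(\sigma_1, \sigma_2)
\end{align*}
```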

SLIDE 6

Types of Strategies

Performance against a static opponent, in millibets per game

[Plot: Exploitation of Opponent (mb/g) vs. Worst Case Exploitability (mb/g)]

• Game theory: Nash equilibrium. Low exploitiveness, low exploitability
• Decision theory: Best response. High exploitiveness, high exploitability

SLIDE 7

Types of Strategies

Performance against a static opponent, in millibets per game

[Plot: Exploitation of Opponent (mb/g) vs. Worst Case Exploitability (mb/g), with a Mixture curve added]

Mixture: Linear tradeoff of exploitiveness and exploitability
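One way to read the “linear tradeoff” (a sketch in assumed notation, not from the slides): let σ_λ play the best response σ_BR with probability λ and a Nash equilibrium σ_NE otherwise. Its performance against the fixed model interpolates exactly linearly in λ, while its worst-case exploitability is at most the same linear interpolation:

```latex
% \sigma_\lambda = "play \sigma_{BR} with probability \lambda, else \sigma_{NE}".
\begin{align*}
  u_1(\sigma_\lambda, \sigma_{\mathrm{fix}})
    &= \lambda\, u_1(\sigma_{BR}, \sigma_{\mathrm{fix}})
     + (1-\lambda)\, u_1(\sigma_{NE}, \sigma_{\mathrm{fix}}) \\
  \max_{\sigma_2} u_2(\sigma_\lambda, \sigma_2)
    &\le \lambda \max_{\sigma_2} u_2(\sigma_{BR}, \sigma_2)
     + (1-\lambda) \max_{\sigma_2} u_2(\sigma_{NE}, \sigma_2)
\end{align*}
```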

SLIDE 8

Types of Strategies

Performance against a static opponent, in millibets per game

[Plot: Exploitation of Opponent (mb/g) vs. Worst Case Exploitability (mb/g), with Mixture and Restricted Nash Response curves]

Restricted Nash Response: Much better than linear tradeoff

SLIDE 9

Restricted Nash Response


Proposed by Johanson, Zinkevich and Bowling (Computing robust counter-strategies, NIPS 2007)

Choose a value p and play an unusual game:

• With probability p, the opponent is forced to play according to a static strategy
• With probability 1 − p, the opponent is free to play as they like

• p = 1: Best response
• p = 0: Nash equilibrium
• 0 < p < 1: Different tradeoffs between exploiting the model and being robust to any opponent
• This provably generates the best possible counter-strategies to the opponent
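A compact way to write this restricted game (my reconstruction of the construction above, using the same assumed notation as the earlier sketches):

```latex
% The opponent must follow the model \sigma_{fix} with probability p and may play
% freely with probability 1 - p; a maximin strategy against this restricted set is
% the p-restricted Nash response.
\begin{align*}
  \Sigma_2^{p,\sigma_{\mathrm{fix}}}
    &= \bigl\{\, p\,\sigma_{\mathrm{fix}} + (1-p)\,\sigma_2 \;:\; \sigma_2 \in \Sigma_2 \,\bigr\} \\
  \sigma_1^{*}
    &\in \arg\max_{\sigma_1}\;
        \min_{\sigma_2 \in \Sigma_2^{p,\sigma_{\mathrm{fix}}}} u_1(\sigma_1, \sigma_2)
\end{align*}
```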
SLIDE 10

Restricted Nash Response

Performance against model of Orange:

[Plot: Restricted Nash Response curve, Exploitation of Opponent (mb/g) vs. Worst Case Exploitability (mb/g), for p = 0, 0.5, 0.7, 0.8, 0.9, 0.93, 0.97, 0.99, 1]

SLIDE 11

Goals

Goals:

• Observe the opponent, build a model, and use it instead of the opponent’s strategy
• Bound worst-case performance
    • The model could be inaccurate
    • The opponent could change

SLIDE 12

Frequentist Opponent Models

[Diagram: fragment of the opponent’s decision tree with observed action frequencies at information sets, e.g. 4/10, 6/10, 1/4, 3/4, 0/0, 3/3, for hands such as 2♦2♥ and K♦K♥]

• Observe 100,000 to 1 million games played by the opponent
• Do frequency counts on actions taken at information sets
• The model assumes the opponent takes actions with the observed frequencies
• Need a default policy when there are no observations (see the sketch below)

Poker: Always-Call
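A minimal sketch of such a frequency-count model with an Always-Call default policy (hypothetical class and method names, not the authors’ code):

```python
from collections import defaultdict

ACTIONS = ["fold", "call", "raise"]

class FrequentistOpponentModel:
    """Frequency-count opponent model with a default policy for unobserved states."""

    def __init__(self):
        # counts[infoset][action] = number of times the action was observed
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, infoset, action):
        """Record one observed opponent action at an information set."""
        self.counts[infoset][action] += 1

    def num_observations(self, infoset):
        """Total number of actions observed at this information set."""
        return sum(self.counts[infoset].values())

    def policy(self, infoset):
        """Modelled action distribution: observed frequencies, or the default."""
        total = self.num_observations(infoset)
        if total == 0:
            # Default policy where we have no observations: Always-Call.
            return {"fold": 0.0, "call": 1.0, "raise": 0.0}
        return {a: self.counts[infoset][a] / total for a in ACTIONS}
```

The per-information-set observation counts are also what the Data Biased Response schedules later in the talk consume when deciding how much to trust the model.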

SLIDE 13

Problems with Restricted Nash Response

Problem 1: Overfitting to the model

[Plot: Exploitation (mb/g) vs. Exploitability (mb/g), with curves labelled “Orange” and “Model”]

SLIDE 14

Problems with Restricted Nash Response

Problem 2: Requires a lot of training data

[Plot: Exploitation (mb/h) vs. Exploitability (mb/h) for models built from 100, 1k, 10k, 100k, and 1m observed games]

SLIDE 15

Data Biased Response

Restricted Nash Response had two problems:

• The model wasn’t accurate in states we never observed
• The model was more accurate in some states than in others

We need a new approach: we’d like to use the model only where we have reason to trust it.

New approach: use the model’s accuracy as part of the restricted game.

SLIDE 16

Data Biased Response

Let’s set up another restricted game:

• Instead of one p value for the whole tree, we’ll set one p value for each choice node, p(i)
• More observations → more confidence in the model → a higher p(i)
• Set a maximum p(i) value, Pmax, that we vary to produce a range of strategies
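In the same assumed notation as the earlier Restricted Nash Response sketch, the restriction probability becomes per-information-set: at each opponent choice node i, the restricted opponent must follow the model with probability p(i) and is free otherwise (my reconstruction, not the slides’ notation):

```latex
% Per-node restriction (assumed notation): \sigma_{fix}(i) is the model's action
% distribution at opponent node i; \tilde{\sigma}_2 is an arbitrary strategy.
\[
  \sigma_2(i) \;=\; p(i)\,\sigma_{\mathrm{fix}}(i) \;+\; \bigl(1 - p(i)\bigr)\,\tilde{\sigma}_2(i),
  \qquad \tilde{\sigma}_2 \in \Sigma_2, \quad 0 \le p(i) \le P_{\max}
\]
```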

SLIDE 17

Data Biased Response

Three examples:

• 1-Step: p(i) = 0 if 0 observations, p(i) = Pmax otherwise
• 10-Step: p(i) = 0 if fewer than 10 observations, p(i) = Pmax otherwise
• 0-10 Linear: p(i) = 0 if 0 observations, p(i) = Pmax if 10 or more, and p(i) grows linearly in between (sketched in code below)

By setting p(i) = 0 in unobserved states, our prior is that the opponent will play as strongly as possible.
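A minimal code sketch of the three schedules above (hypothetical function names; num_obs is the number of observations at node i, as counted by the frequency model earlier):

```python
def p_1_step(num_obs, p_max):
    """1-Step: trust the model (up to Pmax) after a single observation."""
    return 0.0 if num_obs == 0 else p_max

def p_10_step(num_obs, p_max):
    """10-Step: require at least 10 observations before trusting the model."""
    return 0.0 if num_obs < 10 else p_max

def p_0_10_linear(num_obs, p_max):
    """0-10 Linear: confidence grows linearly from 0 to Pmax over the first 10 observations."""
    return p_max * min(num_obs, 10) / 10.0
```

Each point on the curves that follow would then correspond to one choice of Pmax, with p(i) computed per choice node from the model’s observation counts.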
SLIDE 18

DBR doesn’t overfit to the model

RNR and several DBR curves:

[Plot: Exploitation (mb/h) vs. Exploitability (mb/h), with curves for RNR, 1-Step, 10-Step, and 0-10 Linear]

SLIDE 19

DBR works with fewer observations

0-10 Linear DBR curve:

[Plot: Exploitation (mb/h) vs. Exploitability (mb/h) for models built from 100, 1k, 10k, 100k, and 1m observed games]

SLIDE 20

Conclusion

Data Biased Response technique:

• Generate a range of strategies, trading off exploitation and worst-case performance
• Take advantage of observed information
• Avoid overfitting to parts of the model we suspect are inaccurate

SLIDE 21

Future directions

Extend to single-player domains

Can overfitting be reduced by assuming a slightly adversarial environment in unobserved / underobserved areas?

More rigorous method for setting p from the observations