Low-Variance and Zero-Variance Baselines in Extensive-Form Games - PowerPoint PPT Presentation





SLIDE 1

Low-Variance and Zero-Variance Baselines in Extensive-Form Games

Trevor Davis^{2,*}, Martin Schmid^{1}, Michael Bowling^{1,2}

*Work done during internship at DeepMind

SLIDE 2

Monte Carlo game solving

Extensive-form games (EFGs)
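The talk's setting, extensive-form games, can be pictured as a game tree of histories. A minimal sketch of such a tree (all names here are illustrative, not from the talk):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Node:
    """A history in an extensive-form game: either a terminal node
    with a payoff, or a decision point where `player` acts."""
    player: Optional[int] = None   # None at terminal histories
    payoff: float = 0.0            # payoff if terminal
    children: Dict[str, "Node"] = field(default_factory=dict)  # action -> next history

    def is_terminal(self) -> bool:
        return not self.children
```

For example, a one-decision game where player 0 picks between two payoffs is `Node(player=0, children={"heads": Node(payoff=1.0), "tails": Node(payoff=-1.0)})`.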


SLIDE 4

Baseline functions - evaluating unsampled actions

SLIDE 5
Our Contribution

Building on VR-MCCFR (Schmid et al., AAAI 2019), this work achieves:

  • Lower variance, faster convergence
  • Provable zero-variance samples

SLIDES 6-8

Monte Carlo evaluation

Unbiased updates at h, where unsampled actions are assigned an estimated value (zero in vanilla outcome sampling) and the sampled action's value is importance-weighted, so that the estimate remains unbiased in expectation.
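The unbiased update on these slides can be sketched as follows: sample one action from the current strategy, divide its value by the sampling probability, and assign zero to the unsampled actions (function and variable names here are mine, not the talk's):

```python
import random

def sampled_action_values(action_values, sample_probs):
    """One Monte Carlo sample of the action values at a history h.
    The sampled action a gets v(a)/q(a); unsampled actions get 0.
    In expectation each estimate equals the true value, so the
    update is unbiased - but single samples can be very noisy."""
    actions = list(action_values)
    sampled = random.choices(actions, weights=[sample_probs[a] for a in actions])[0]
    q = sample_probs[sampled]
    return {a: (action_values[a] / q if a == sampled else 0.0) for a in actions}
```

Averaging many such samples recovers the true values; the high per-sample variance is exactly what baselines are introduced to reduce.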

SLIDE 9

Baseline functions

SLIDES 10-12

Evaluation with baseline

Without baseline: the sampled value is used directly, and unsampled actions contribute nothing.

Baseline correction: subtract a baseline prediction from the sampled value, and add the prediction back for every action (a control variate).
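One way to sketch the baseline correction described on these slides (a control variate in the VR-MCCFR style; the variable names are illustrative):

```python
import random

def baseline_corrected_values(action_values, sample_probs, baseline):
    """Baseline-corrected Monte Carlo sample at a history h.
    Sampled action a: b(a) + (v(a) - b(a)) / q(a); unsampled: b(a).
    Unbiased for any fixed baseline; the closer b is to v,
    the lower the variance - and if b == v the sample is exact."""
    actions = list(action_values)
    sampled = random.choices(actions, weights=[sample_probs[a] for a in actions])[0]
    q = sample_probs[sampled]
    return {
        a: (baseline[a] + (action_values[a] - baseline[a]) / q) if a == sampled
        else baseline[a]
        for a in actions
    }
```

With a zero baseline this reduces to the plain importance-weighted estimate; with a perfect baseline every sample reproduces the true values exactly, previewing the zero-variance result later in the talk.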

SLIDE 13

Theoretical results

Theorem 1: baseline-corrected values are unbiased.

Theorem 2: each baseline-corrected value has variance bounded by a sum of squared prediction errors in the subtree rooted at a.
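In rough notation (the symbols below are assumptions on my part, since the slide equations did not survive extraction), the two theorems can be written as:

```latex
% Theorem 1 (unbiasedness): the baseline-corrected estimate
% matches the true value in expectation
\mathbb{E}\left[\hat{v}^{\,b}(a)\right] = v(a)

% Theorem 2 (variance bound): variance controlled by the baseline's
% squared prediction errors over histories h' below action a,
% with sampling-dependent weights c_{h'}
\mathrm{Var}\left[\hat{v}^{\,b}(a)\right]
  \;\le\; \sum_{h' \sqsupseteq ha} c_{h'}\,\bigl(b(h') - v(h')\bigr)^{2}
```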

SLIDES 14-15

Baseline function selection

We want the baseline to predict the expected value at each history. Learned history baseline: set the baseline at each history to the average of previous sampled values there.

Note: the expected value depends on the players' strategies, which change between iterations, so the sampled values are not stationary; the running average is therefore not an unbiased estimate of the current expectation. The baseline-corrected values are still unbiased regardless.
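The "average of previous samples" rule can be sketched as a small running-average table (names are mine; the talk gives only the idea):

```python
class LearnedHistoryBaseline:
    """Per-history running average of previously sampled values.
    Because the strategies (and hence the true values) drift over
    iterations, this average lags the current expectation - but the
    corrected estimates stay unbiased for any baseline."""
    def __init__(self):
        self._total = {}   # history -> sum of sampled values
        self._count = {}   # history -> number of samples

    def update(self, history, sampled_value):
        self._total[history] = self._total.get(history, 0.0) + sampled_value
        self._count[history] = self._count.get(history, 0) + 1

    def __call__(self, history):
        n = self._count.get(history, 0)
        return self._total[history] / n if n else 0.0
```

Unseen histories fall back to a zero baseline, which recovers the uncorrected estimate.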

SLIDE 16

Baseline convergence evaluation

[Plot: Leduc poker, Monte Carlo Counterfactual Regret Minimization (MCCFR+); curves compare no baseline, VR-MCCFR (Schmid et al.), and the learned history baseline.]

SLIDES 17-18

Predictive baseline

Updating with the learned history baseline: the optimal baseline depends on the strategy update. So use the updated strategy to update the baseline: recursively set the baseline to the new strategy's expected value over the children's baseline values.
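A sketch of the recursion (structure and names are my assumptions about the construction, not the talk's notation): after computing the updated strategy, walk the subtree bottom-up and set each history's baseline to the new strategy's expected value over its children's baselines, with leaves anchored at their exact payoffs:

```python
def predictive_baseline(node, new_policy, baseline):
    """Set baseline[h] to the updated strategy's expected value.
    `node` is ("h", payoff) at a leaf or ("h", {action: child}) at a
    decision point; `new_policy[h][action]` is the just-updated
    strategy. The recursion fills `baseline` bottom-up."""
    history, body = node
    if not isinstance(body, dict):      # terminal: exact payoff
        baseline[history] = body
    else:                               # internal: expectation over children
        value = 0.0
        for action, child in body.items():
            predictive_baseline(child, new_policy, baseline)
            value += new_policy[history][action] * baseline[child[0]]
        baseline[history] = value
    return baseline[history]
```

Because each baseline value is the exact expectation under the new strategy (not a lagged average), it can match the true values, which is what makes the zero-variance result on the next slide possible.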

SLIDE 19

Zero-variance updates

Theorem: the baseline-corrected values have zero variance if:

  • We use the predictive baseline
  • We sample public outcomes
  • All outcomes are sampled at least once

SLIDE 20

Baseline variance evaluation

[Plot: Leduc poker, Monte Carlo Counterfactual Regret Minimization (MCCFR+); curves compare no baseline, VR-MCCFR (Schmid et al.), the learned history baseline, and the predictive baseline.]

SLIDE 21
Conclusion

  • Lower variance, faster convergence
  • Provable zero-variance samples