
Low-Variance and Zero-Variance Baselines in Extensive-Form Games



  1. Low-Variance and Zero-Variance Baselines in Extensive-Form Games. Trevor Davis 2,*, Martin Schmid 1, Michael Bowling 1,2. (*Work done during internship at DeepMind.)

  2. Monte Carlo game solving. Extensive-form games (EFGs).

  3. Monte Carlo game solving. Extensive-form games (EFGs).
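
  For reference (the notation here is mine, not from the slides): the quantity a Monte Carlo solver estimates is the expected utility of player i under a strategy profile sigma,

        u_i(\sigma) = \sum_{z \in Z} \pi^\sigma(z) \, u_i(z),

  where Z is the set of terminal histories, \pi^\sigma(z) is the probability of reaching z when every player follows \sigma, and u_i(z) is the payoff at z. Enumerating Z is infeasible in large games, so Monte Carlo methods sample trajectories instead.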

  4. Baseline functions: evaluating unsampled actions.

  5. Our contribution. Prior work: VR-MCCFR (Schmid et al., AAAI 2019). This work: ● Lower variance, faster convergence ● Provable zero-variance samples

  6. Monte Carlo evaluation. Unbiased updates at h.

  7. Monte Carlo evaluation. Unbiased updates at h: [equation], where [equation].

  8. Monte Carlo evaluation. Unbiased updates at h: [equation], where [equation]. Unsampled actions: [equation].
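
  The update formulas on slides 6-8 are images and did not survive extraction. A standard reconstruction of the outcome-sampling estimator they describe (the sampling policy q is assumed notation):

        \hat{u}(h, a) = \hat{u}(h a) / q(a \mid h)    if a is the action sampled at h,
        \hat{u}(h, a) = 0                             for unsampled actions,
        \hat{u}(h)    = \sum_a \sigma(a \mid h) \, \hat{u}(h, a),   with \hat{u}(z) = u(z) at terminal z.

  Dividing the sampled child's value by its sampling probability keeps the estimate unbiased, but assigning 0 to every unsampled action is the main source of variance.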

  9. Baseline functions

  10. Evaluation with baseline. Without baseline: [equation].

  11. Evaluation with baseline. Without baseline: [equation]. Baseline correction: [equation].

  12. Evaluation with baseline. Without baseline: [equation]. Baseline correction: [equation] (control variate).
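
  The formulas on slides 10-12 are likewise images; the baseline-corrected (control variate) estimator of VR-MCCFR, which this work builds on, has the standard form (a reconstruction, same notation as above):

        \hat{u}^b(h, a) = b(h, a) + ( \hat{u}^b(h a) - b(h, a) ) / q(a \mid h)    if a is sampled,
        \hat{u}^b(h, a) = b(h, a)                                                 if a is unsampled,
        \hat{u}^b(h)    = \sum_a \sigma(a \mid h) \, \hat{u}^b(h, a).

  The baseline b(h, a) is added and subtracted with the same expectation, so the mean is unchanged while part of the sampling noise cancels: the closer b(h, a) is to the true action value, the lower the variance.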

  13. Theoretical results. Theorem 1: baseline-corrected values are unbiased: [equation]. Theorem 2: each baseline-corrected value has variance bounded by a sum of squared prediction errors in the subtree rooted at a: [equation].
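
  In the notation above, Theorem 1 says that for any baseline b

        E_q[ \hat{u}^b(h, a) ] = u^\sigma(h, a),

  and Theorem 2 bounds Var[ \hat{u}^b(h, a) ] by a sampling-probability-weighted sum of squared prediction errors ( b(h', a') - u^\sigma(h', a') )^2 over the subtree below the action (exact weights omitted here). In particular, a baseline that matches the true values everywhere in the subtree gives zero variance, which is what the predictive baseline later exploits.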

  14. Baseline function selection. We want [equation]. Learned history baseline: we know [equation]; set to the average of previous samples.

  15. Baseline function selection. We want [equation]. Learned history baseline: we know [equation]; set to the average of previous samples. Note: the target expectation depends on the strategies and so is not stationary; therefore the baseline is not an unbiased estimate of the current expectation, but the baseline-corrected values are still unbiased.
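
  One concrete way to "set to the average of previous samples" (the exact rule on the slide is an image; the decay parameter \alpha below is an assumption):

        on the k-th visit to (h, a):   b(h, a) \leftarrow ( (k - 1) \, b(h, a) + \hat{u}^b(h a) ) / k
        or with exponential decay:     b(h, a) \leftarrow (1 - \alpha) \, b(h, a) + \alpha \, \hat{u}^b(h a).

  Because the strategies change every iteration, this average chases a moving target and lags the current expectation; by Theorem 1, the corrected values \hat{u}^b remain unbiased for any such baseline.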

  16. Baseline convergence evaluation: Leduc poker, Monte Carlo Counterfactual Regret Minimization (MCCFR+). [Convergence plot comparing: no baseline, VR-MCCFR (Schmid et al.), learned history baseline.]

  17. Predictive baseline. Updating with the learned history baseline: [equation]. Optimal baseline depends on the strategy update: [equation].

  18. Predictive baseline. Updating with the learned history baseline: [equation]. Optimal baseline depends on the strategy update: [equation]. Use the updated strategy to update the baseline: recursively set [equation].
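
  A hedged reconstruction of the recursion (the slide shows it only as images): during the backward pass, after regret matching produces the updated strategy \sigma' at the sampled history h a, set the baseline for the action leading into it to the value the estimator would compute under \sigma',

        b(h, a) \leftarrow \sum_{a'} \sigma'(a' \mid h a) \, \hat{u}^b(h a, a'),

  working from the end of the sampled trajectory back to the root, where the child values \hat{u}^b(h a, a') are this iteration's baseline-corrected values (the stored baselines for unsampled a'). The baseline thus predicts the value the updated strategy will produce on the next visit instead of averaging stale samples.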

  19. Zero-variance updates. If ● we use the predictive baseline, ● we sample public outcomes, and ● all outcomes are sampled at least once, then (Theorem) the baseline-corrected values have zero variance.
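
  Zero variance together with unbiasedness (Theorem 1) means every individual sample is exact rather than only correct on average:

        \hat{u}^b(h, a) = u^\sigma(h, a)   with probability 1.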

  20. Baseline variance evaluation: Leduc poker, Monte Carlo Counterfactual Regret Minimization (MCCFR+). [Variance plot comparing: no baseline, VR-MCCFR (Schmid et al.), learned history baseline, predictive baseline.]

  21. Conclusion: ● Lower variance, faster convergence ● Provable zero-variance samples
