Adaptive Antithetic Sampling for Variance Reduction



  1. Adaptive Antithetic Sampling for Variance Reduction Hongyu Ren*, Shengjia Zhao*, Stefano Ermon *equal contribution

  2. Goal: Estimation of μ = 𝔼_{p(x)}[f(x)] is ubiquitous in machine learning problems. (Figure: three panels showing a VAE encoder/decoder, a GAN generator/discriminator with real/fake decisions, and an RL agent–environment loop with states, actions, and rewards.) Examples: Variational Autoencoder: 𝔼_{p(x)} 𝔼_{q(z|x)}[log p(x, z) − log q(z|x)]. Generative Adversarial Nets: 𝔼_{p(x)}[log D(x)] + 𝔼_{p(z)}[log(1 − D(G(z)))]. Reinforcement Learning: 𝔼_{p(τ)}[Σ_t r(s_t, a_t)].

  3. Goal: Estimation of μ = 𝔼_{p(x)}[f(x)] is ubiquitous in machine learning problems. i.i.d. Monte Carlo estimation: μ ≈ (1/2)(f(x1) + f(x2)), with x1, x2 ∼ p(x). MC is unbiased: 𝔼[(1/2)(f(x1) + f(x2))] = μ. But it has high variance: the estimate can be far off with a small sample size.
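A minimal NumPy sketch of the plain i.i.d. estimator above, assuming a standard normal p(x) and the illustrative integrand f(x) = x² (both choices are mine, not from the slides); it shows the two-sample estimate is unbiased but noisy:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2                      # illustrative integrand; E[f(X)] = 1 for X ~ N(0, 1)

def mc_estimate(n):
    """Plain i.i.d. Monte Carlo estimate of E_p[f(x)] with n samples."""
    x = rng.standard_normal(n)
    return f(x).mean()

# Unbiased but noisy: repeat the 2-sample estimate many times.
estimates = np.array([mc_estimate(2) for _ in range(10_000)])
print("mean of estimates:", estimates.mean())   # close to the true value 1.0
print("std of estimates: ", estimates.std())    # large spread for n = 2
```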

  4. Goal: Estimation of μ = 𝔼_{p(x)}[f(x)] is ubiquitous in machine learning problems. i.i.d. Monte Carlo estimation: μ ≈ (1/2)(f(x1) + f(x2)), with x1, x2 ∼ p(x). Trivial solution: use more samples! Better solution: use a better sampling strategy than i.i.d.

  5. Antithetic Sampling. Don't sample i.i.d.: x1, x2 ∼ p(x1)p(x2). Instead, sample from a correlated joint distribution: x1, x2 ∼ q(x1, x2). The estimator stays unbiased if the marginals are preserved: q(x1) = p(x1) and q(x2) = p(x2). Goal: minimize Var_{q(x1,x2)}[(f(x1) + f(x2))/2].
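A hedged sketch of the idea on this slide: one simple way to build a correlated q(x1, x2) that keeps N(0, 1) marginals is a bivariate Gaussian with correlation ρ (the test function and ρ values are illustrative assumptions, not from the paper). The estimator stays unbiased for any ρ, and a well-chosen ρ already lowers the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

def correlated_pairs(rho, n):
    """Bivariate normal pairs with correlation rho; both marginals stay N(0, 1),
    so the paired estimator is unbiased for any rho in (-1, 1)."""
    cov = [[1.0, rho], [rho, 1.0]]
    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n)
    return z[:, 0], z[:, 1]

def paired_estimates(f, x1, x2):
    return 0.5 * (f(x1) + f(x2))       # one unbiased estimate per pair

f = lambda x: np.sin(x) + x            # illustrative integrand; E[f] = 0 under N(0, 1)
for rho in (0.0, -0.9):                # rho = 0 recovers i.i.d. sampling
    est = paired_estimates(f, *correlated_pairs(rho, 100_000))
    print(f"rho = {rho:+.1f}   mean = {est.mean():+.4f}   var = {est.var():.4f}")
```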

  6. Example: Negative Sampling. q(x1, x2) is defined by: 1. Sample x1 ∼ p(x). 2. Set x2 = −x1. The marginal on x1 is p(x1), and (for a symmetric p) the marginal on x2 is also p(x2), so the estimator is unbiased. (Figure: scatter plot of the sampled pairs, with the marginals on x1 and x2 shown along the axes.)

  7. Example: Negative Sampling, Best Case. q(x1, x2) defined by: 1. Sample x1 ∼ p(x). 2. Set x2 = −x1. For f(x) = x³: (f(x1) + f(x2))/2 = 0, which matches 𝔼_{p(x)}[f(x)] = 0, so Var_{q(x1,x2)}[(f(x1) + f(x2))/2] = 0: no error with a sample size of 2!

  8. Example: Negative Sampling, Worst Case. q(x1, x2) defined by: 1. Sample x1 ∼ p(x). 2. Set x2 = −x1. For f(x) = x²: f(x1) = f(x2), so x2 is redundant and Var_{q(x1,x2)}[(f(x1) + f(x2))/2] doubles compared to i.i.d.!
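A small numerical check of the best-case and worst-case claims on the two slides above, assuming a standard normal p(x): negative sampling drives the variance to zero for f(x) = x³ and doubles it for f(x) = x².

```python
import numpy as np

rng = np.random.default_rng(0)

def estimator_variance(f, antithetic, n_pairs=100_000):
    """Empirical variance of the 2-sample estimator (f(x1) + f(x2)) / 2
    under a standard normal p(x)."""
    x1 = rng.standard_normal(n_pairs)
    x2 = -x1 if antithetic else rng.standard_normal(n_pairs)
    return np.var(0.5 * (f(x1) + f(x2)))

for name, f in [("f(x) = x^3  (best case) ", lambda x: x ** 3),
                ("f(x) = x^2  (worst case)", lambda x: x ** 2)]:
    print(f"{name}: i.i.d. var = {estimator_variance(f, False):.3f}, "
          f"negative-sampling var = {estimator_variance(f, True):.3f}")
```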

  9. General Result. Question: is there an antithetic distribution that always works better than i.i.d.? Yes: sampling without replacement is always a tiny bit better. No Free Lunch (Theorem 1): no antithetic distribution works better than sampling without replacement for every function f.
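The without-replacement claim can be sanity-checked numerically in the simplest finite setting, a uniform distribution over ten values (the distribution and test function here are illustrative assumptions, not from the paper); without replacement the marginals stay uniform and the paired estimator's variance is slightly smaller:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(10.0)               # p(x): uniform over ten values
f = lambda x: (x - 4.5) ** 2           # illustrative test function

def pair_variance(replace):
    """Variance of the 2-sample estimator with vs. without replacement."""
    pairs = np.array([rng.choice(values, size=2, replace=replace)
                      for _ in range(50_000)])
    return np.var(0.5 * (f(pairs[:, 0]) + f(pairs[:, 1])))

print("with replacement (i.i.d.):", round(pair_variance(True), 3))
print("without replacement:      ", round(pair_variance(False), 3))  # slightly smaller
```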

  10. Valid Distribution Set. 𝒬_unbiased: the set of joint distributions q(x1, x2) that satisfy q(x1) = p(x1) and q(x2) = p(x2). (Figure: two example joints over (x1, x2) with identical marginals.)

  11. Variance of Example Functions. For f1(x) = x³, different joints in 𝒬_unbiased behave very differently: some give low variance and some give high variance, so we pick the low-variance one. 𝒬_unbiased: set of distributions q(x1, x2) that satisfy q(x1) = p(x1), q(x2) = p(x2). (Figure: example joints labeled by the variance they induce for f1.)

  12. Variance of Example Functions. For f2(x) = eˣ + 2x·sin(x), the picture changes: a different joint in 𝒬_unbiased now gives low variance, while the one that was best for f1 gives high variance. 𝒬_unbiased: set of distributions q(x1, x2) that satisfy q(x1) = p(x1), q(x2) = p(x2). (Figure: the same joints labeled by the variance they induce for f2.)

  13. Pick a Good Distribution for a Class of Functions. Given a class ℱ = {f1, f2, ...}, pick the joint in 𝒬_unbiased with low variance on average over ℱ, rather than one tailored to a single function (others have high variance on average over ℱ). 𝒬_unbiased: set of distributions q(x1, x2) that satisfy q(x1) = p(x1), q(x2) = p(x2).

  14. Pick a Good Distribution for a Class of Functions. Training: pick a good q for several functions (low variance on average). Generalization: the learned q should also give low variance for similar functions. 𝒬_unbiased: set of distributions q(x1, x2) that satisfy q(x1) = p(x1), q(x2) = p(x2).

  15. Training Objective: min_q 𝔼_{f∼ℱ} Var_{q(x1,x2)}[(f(x1) + f(x2))/2]  s.t.  q(x1, x2) ∈ 𝒬_unbiased.
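A toy stand-in for this objective, assuming N(0, 1) marginals, a small hand-picked function class, and a single correlation parameter ρ as the only degree of freedom in q; the paper's actual parameterization and surrogate objective are richer, so this grid search is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
function_class = [lambda x: x ** 3,
                  lambda x: np.sin(x),
                  lambda x: x + 0.1 * x ** 2]   # illustrative stand-in for the class F

def avg_pair_variance(rho, n=100_000):
    """Average variance of the paired estimator over the function class,
    using a joint with correlation rho that keeps both N(0, 1) marginals."""
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    x1, x2 = z[:, 0], z[:, 1]
    return float(np.mean([np.var(0.5 * (f(x1) + f(x2))) for f in function_class]))

# Crude stand-in for optimizing over q: grid search over the single parameter rho.
rhos = np.linspace(-0.95, 0.95, 39)
best = min(rhos, key=avg_pair_variance)
print("best rho:", round(best, 3), " avg variance:", round(avg_pair_variance(best), 4))
```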

  16. Practical Training Algorithm. We design: 1. A parameterization of 𝒬_unbiased via copulas. 2. A surrogate objective to optimize the variance.
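As a sketch of the copula idea (not the paper's exact construction): a Gaussian copula lets one parameterize the dependence between x1 and x2 while keeping both marginals exact, which is precisely the 𝒬_unbiased constraint. The exponential marginal and SciPy usage below are assumptions chosen only for illustration:

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)

def gaussian_copula_pairs(rho, marginal_ppf, n):
    """Pairs (x1, x2) whose dependence is controlled by rho but whose marginals
    are exactly the target distribution (given by its inverse CDF / ppf)."""
    z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
    u = norm.cdf(z)                     # each column is marginally Uniform(0, 1)
    return marginal_ppf(u[:, 0]), marginal_ppf(u[:, 1])

# Example: Exponential(1) marginals with strong negative dependence within each pair.
x1, x2 = gaussian_copula_pairs(rho=-0.9, marginal_ppf=expon.ppf, n=200_000)
print("marginal means:", x1.mean().round(3), x2.mean().round(3))   # both ~1.0
print("pair correlation:", np.corrcoef(x1, x2)[0, 1].round(3))     # negative
```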

  17. Wasserstein GAN w/ Gradient Penalty. (Figures: variance of the gradient vs. batch size; inception score vs. batch size, wall-clock time, and gradient variance per iteration.) Gulrajani, Ishaan, et al. "Improved training of Wasserstein GANs." Advances in Neural Information Processing Systems. 2017.

  18. Importance Weighted Autoencoder. (Figures: log-likelihood improvement, higher is better, and probability of improvement, comparing our method vs. negative sampling and our method vs. i.i.d. sampling.) Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance weighted autoencoders." arXiv preprint arXiv:1509.00519 (2015).

  19. Conclusion • Define a general family of (parameterized) unbiased antithetic distributions. • Propose an optimization framework to learn the antithetic distribution based on the task at hand. • Sampling from the resulting joint distribution reduces variance at negligible computational cost. Welcome to our poster session for further discussions! Thursday 6:30-9pm @ Pacific Ballroom #205
