

SLIDE 1

Adaptive Antithetic Sampling for Variance Reduction

Hongyu Ren*, Shengjia Zhao*, Stefano Ermon

*equal contribution

SLIDE 2

Goal

Estimation of $\mu = \mathbb{E}_{p(x)}[f(x)]$ is ubiquitous in machine learning problems.

Reinforcement Learning: $\mathbb{E}_{p(\tau)}\left[\sum_t r(s_t, a_t)\right]$
Variational Autoencoder: $\mathbb{E}_{p(x)}\,\mathbb{E}_{q(z|x)}\left[\log \frac{p(x, z)}{q(z|x)}\right]$
Generative Adversarial Nets: $\mathbb{E}_{p(x)}[\log D(x)] + \mathbb{E}_{p(z)}[\log(1 - D(G(z)))]$

[Figure: agent-environment loop (state, action, reward); VAE networks $q(z|x)$, $p(x|z)$; GAN generator $G(z)$ and real/fake discriminator $D$]

SLIDE 3

Goal

Estimation of $\mu = \mathbb{E}_{p(x)}[f(x)]$ is ubiquitous in machine learning problems.

Monte Carlo Estimation: $\mu \approx \frac{1}{2}\left(f(x_1) + f(x_2)\right)$, with $x_1, x_2 \sim p(x)$ i.i.d.

MC is unbiased: $\mathbb{E}\left[\frac{1}{2}\left(f(x_1) + f(x_2)\right)\right] = \mu$

High variance: the estimate can be far off with a small sample size.
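As an illustration (not from the slides), here is a minimal numpy sketch of the two-sample i.i.d. Monte Carlo estimator, assuming for concreteness that $p(x)$ is a standard normal and $f(x) = x^3$; repeating the estimate many times shows it is correct on average but individual estimates scatter widely.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 3              # example integrand (an assumption, not from the slides)
sample_p = rng.standard_normal    # p(x): standard normal (an assumption)

def mc_estimate(n_samples=2):
    """Plain i.i.d. Monte Carlo estimate of E_p[f(x)]."""
    x = sample_p(n_samples)
    return f(x).mean()

# Repeat the 2-sample estimate many times: the mean is ~0 (unbiased),
# but the spread is large, so a single estimate can be far off.
estimates = np.array([mc_estimate(2) for _ in range(10_000)])
print("mean of estimates:", estimates.mean())   # close to E_p[x^3] = 0
print("std of estimates: ", estimates.std())    # roughly sqrt(15/2) ~ 2.7
```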

SLIDE 4

Goal

Estimation of $\mu = \mathbb{E}_{p(x)}[f(x)]$ is ubiquitous in machine learning problems.

Monte Carlo Estimation: $\mu \approx \frac{1}{2}\left(f(x_1) + f(x_2)\right)$, with $x_1, x_2 \sim p(x)$ i.i.d.

Trivial solution: use more samples! Better solution: a better sampling strategy than i.i.d.

SLIDE 5

Antithetic Sampling

Don't sample i.i.d.: $x_1, x_2 \sim p(x_1)\,p(x_2)$. Instead, sample from a correlated distribution: $(x_1, x_2) \sim q(x_1, x_2)$.

Unbiased if $q(x_1) = p(x_1)$ and $q(x_2) = p(x_2)$.

Goal: minimize $\mathrm{Var}_{q(x_1, x_2)}\left[\frac{f(x_1) + f(x_2)}{2}\right]$
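Why matching the marginals is enough for unbiasedness, spelled out with the definitions above (linearity of expectation):

$\mathbb{E}_{q(x_1, x_2)}\left[\frac{f(x_1) + f(x_2)}{2}\right] = \frac{1}{2}\,\mathbb{E}_{q(x_1)}[f(x_1)] + \frac{1}{2}\,\mathbb{E}_{q(x_2)}[f(x_2)] = \frac{1}{2}\,\mathbb{E}_{p(x)}[f(x)] + \frac{1}{2}\,\mathbb{E}_{p(x)}[f(x)] = \mu$

So any joint with the correct marginals is unbiased, no matter how strongly $x_1$ and $x_2$ are correlated; the correlation only affects the variance.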

SLIDE 6

Example: Negative Sampling

π‘Ÿ 𝑦1, 𝑦2 defined by 1.Sample 𝑦1 ∼ π‘ž(𝑦). 2.Pick 𝑦2 = βˆ’π‘¦1.

[Figure: joint distribution of $(x_1, x_2)$ under negative sampling; the marginals of $x_1$ and $x_2$ both equal $p(x)$]
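A minimal sketch of this pair sampler, assuming $p(x)$ is a standard normal (symmetric about 0, so negating a sample leaves its distribution unchanged); the empirical check mirrors the figure: both marginals still match $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

def negative_pair(n_pairs):
    """Negative (antithetic) sampling: x1 ~ p(x), x2 = -x1.
    Assumes p(x) is symmetric about 0 (here: standard normal),
    so the marginal of x2 equals p as well."""
    x1 = rng.standard_normal(n_pairs)
    x2 = -x1
    return x1, x2

x1, x2 = negative_pair(100_000)
# Both marginals should look like N(0, 1): mean ~ 0, std ~ 1.
print(f"x1: mean {x1.mean():+.3f}, std {x1.std():.3f}")
print(f"x2: mean {x2.mean():+.3f}, std {x2.std():.3f}")
```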
SLIDE 7

Example: Negative Sampling

Best Case Example: $f(x) = x^3$

$q(x_1, x_2)$ defined by: 1. Sample $x_1 \sim p(x)$. 2. Pick $x_2 = -x_1$.

$\frac{f(x_1) + f(x_2)}{2} = 0$ for every sample, which matches $\mathbb{E}_{p(x)}[f(x)] = 0$ (since $f$ is odd and $p$ is symmetric), so

$\mathrm{Var}_{q(x_1, x_2)}\left[\frac{f(x_1) + f(x_2)}{2}\right] = 0$: no error with a sample size of 2!

SLIDE 8

Example: Negative Sampling

π‘Ÿ 𝑦1, 𝑦2 defined by 1.Sample 𝑦1 ∼ π‘ž(𝑦). 2.Pick 𝑦2 = βˆ’π‘¦1.

𝑔 𝑦1 = 𝑔(𝑦2), 𝑦2 redundant Varπ‘Ÿ(𝑦1,𝑦2)

𝑔 𝑦1 +𝑔(𝑦2) 2

doubles!

Worst Case Example

𝑔 = 𝑦2
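A small sketch comparing the two cases empirically (again assuming $p(x)$ is a standard normal): for the odd function $f(x) = x^3$ the pair average is exactly 0 every time, while for the even function $f(x) = x^2$ the antithetic pair has about twice the variance of two i.i.d. samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

x1 = rng.standard_normal(n)
x2_anti = -x1                      # negative sampling: x2 = -x1
x2_iid = rng.standard_normal(n)    # baseline: a fresh i.i.d. sample

for name, f in [("f(x) = x^3 (best case) ", lambda x: x ** 3),
                ("f(x) = x^2 (worst case)", lambda x: x ** 2)]:
    var_anti = ((f(x1) + f(x2_anti)) / 2).var()
    var_iid = ((f(x1) + f(x2_iid)) / 2).var()
    print(f"{name}: antithetic var = {var_anti:.2f}, i.i.d. var = {var_iid:.2f}")
# Expected: ~0.00 vs ~7.50 for x^3, and ~2.00 vs ~1.00 (doubled) for x^2.
```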

SLIDE 9

General Result

Question: is there an antithetic distribution that always works better than i.i.d.? Yes: sampling without replacement is always a tiny bit better. No Free Lunch (Theorem 1): no antithetic distribution works better than sampling without replacement for every function $f$.
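One way to see the "tiny bit better" claim empirically (a sketch, not the paper's Theorem 1 argument): for a discrete uniform $p(x)$, drawing two values without replacement keeps both marginals equal to $p$ but makes the draws slightly negatively correlated, so the variance of the pair average shrinks by a factor $(K-2)/(K-1)$ for $K$ support points.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.arange(10.0)     # support of a uniform discrete p (illustrative assumption)
f = lambda x: x ** 2         # any test function

def pair_average_variance(with_replacement, n_trials=100_000):
    """Empirical variance of (f(x1) + f(x2)) / 2 under the chosen pair sampler.
    Without replacement, both marginals are still uniform over `values`."""
    pairs = np.array([rng.choice(values, size=2, replace=with_replacement)
                      for _ in range(n_trials)])
    return f(pairs).mean(axis=1).var()

print("i.i.d. (with replacement):", pair_average_variance(True))
print("without replacement:      ", pair_average_variance(False))
# Without replacement is smaller by roughly (K-2)/(K-1) = 8/9 with K = 10 values.
```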

SLIDE 10

Valid Distribution Set

$\mathcal{P}_{\text{unbiased}}$: set of distributions $q(x_1, x_2)$ that satisfy $q(x_1) = p(x_1)$, $q(x_2) = p(x_2)$

[Figure: example joint distributions over $(x_1, x_2)$, all with the required marginals]

SLIDE 11

Variance of example functions

$f_1(x) = x^3$

[Figure: joint distributions in $\mathcal{P}_{\text{unbiased}}$ ranked from high variance to low variance for $f_1$; pick the low-variance distribution]

$\mathcal{P}_{\text{unbiased}}$: set of distributions $q(x_1, x_2)$ that satisfy $q(x_1) = p(x_1)$, $q(x_2) = p(x_2)$

SLIDE 12

Variance of example functions

$f_2(x) = e^x + 2x\sin(x)$

[Figure: joint distributions in $\mathcal{P}_{\text{unbiased}}$ ranked from high variance to low variance for $f_2$; pick the low-variance distribution]

$\mathcal{P}_{\text{unbiased}}$: set of distributions $q(x_1, x_2)$ that satisfy $q(x_1) = p(x_1)$, $q(x_2) = p(x_2)$

SLIDE 13

Pick Good Distribution for a Class of Functions

[Figure: joint distributions in $\mathcal{P}_{\text{unbiased}}$ with high variance on average for $\mathcal{F}$ vs. low variance on average for $\mathcal{F}$]

$\mathcal{F} = \{f_1, f_2, \dots\}$

$\mathcal{P}_{\text{unbiased}}$: set of distributions $q(x_1, x_2)$ that satisfy $q(x_1) = p(x_1)$, $q(x_2) = p(x_2)$

SLIDE 14

Pick Good Distribution for a class of functions

Training: pick a good $q$ for several functions. Generalization: low variance for similar functions.

[Figure: joint distributions in $\mathcal{P}_{\text{unbiased}}$ with high variance on average vs. low variance on average]

$\mathcal{P}_{\text{unbiased}}$: set of distributions $q(x_1, x_2)$ that satisfy $q(x_1) = p(x_1)$, $q(x_2) = p(x_2)$

SLIDE 15

Training Objective

$\min_{q}\ \mathbb{E}_{f \sim \mathcal{F}}\left[\mathrm{Var}_{q(x_1, x_2)}\left[\frac{f(x_1) + f(x_2)}{2}\right]\right] \quad \text{s.t.}\quad q(x_1, x_2) \in \mathcal{P}_{\text{unbiased}}$
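To make the objective concrete, here is a rough numerical sketch (not the paper's algorithm): it evaluates $\mathbb{E}_{f \sim \mathcal{F}}\left[\mathrm{Var}_q\left[\frac{f(x_1)+f(x_2)}{2}\right]\right]$ by Monte Carlo for a toy one-parameter family inside $\mathcal{P}_{\text{unbiased}}$, bivariate standard normal pairs with correlation `rho` (an illustrative assumption), and picks the `rho` with the lowest averaged variance.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "training set" of functions F (the slides' examples, reused here as an assumption).
F = [lambda x: x ** 3, lambda x: np.exp(x) + 2 * x * np.sin(x)]

def correlated_normal_pair(rho, n):
    """Bivariate standard normal with correlation rho: both marginals are N(0, 1)
    for every rho in (-1, 1), so the joint stays inside P_unbiased."""
    x1 = rng.standard_normal(n)
    x2 = rho * x1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    return x1, x2

def objective(rho, n=200_000):
    """Monte Carlo estimate of E_{f ~ F}[ Var_q[(f(x1) + f(x2)) / 2] ]."""
    x1, x2 = correlated_normal_pair(rho, n)
    return float(np.mean([((f(x1) + f(x2)) / 2).var() for f in F]))

rhos = [-0.9, -0.5, 0.0, 0.5, 0.9]
for rho in rhos:
    print(f"rho = {rho:+.1f}: objective = {objective(rho):.3f}")
print("best rho:", min(rhos, key=objective))
```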

SLIDE 16

Practical Training Algorithm

We design

  • 1. A parameterization of $\mathcal{P}_{\text{unbiased}}$ via copulas (a minimal sketch follows below).
  • 2. A surrogate objective to optimize the variance.
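A minimal sketch of the copula idea, under strong simplifying assumptions (a single-parameter Gaussian copula and a marginal $p$ specified by its inverse CDF; the paper's parameterization is more expressive): correlate two standard normals, map them through the normal CDF to correlated uniforms, then apply the inverse CDF of $p$, so each coordinate has marginal exactly $p$ for any correlation parameter.

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)

def copula_pair(rho, p_ppf, n):
    """Gaussian-copula pair sampler: both marginals equal p (given by its inverse
    CDF p_ppf) for any rho in (-1, 1), while rho controls the dependence."""
    z1 = rng.standard_normal(n)
    z2 = rho * z1 + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(n)
    u1, u2 = norm.cdf(z1), norm.cdf(z2)    # correlated Uniform(0, 1) variables
    return p_ppf(u1), p_ppf(u2)            # correlated samples with marginal p

# Example: p(x) = Exponential(1); the marginals are preserved for any rho.
x1, x2 = copula_pair(rho=-0.8, p_ppf=expon.ppf, n=100_000)
print("x1 mean/std:", x1.mean(), x1.std())            # both ~1.0 for Exp(1)
print("x2 mean/std:", x2.mean(), x2.std())
print("corr(x1, x2):", np.corrcoef(x1, x2)[0, 1])     # negative dependence
```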
SLIDE 17

Wasserstein GAN w/ gradient penalty

[Figure: variance of gradient, Inception Score, and Inception Score variance plotted against batch size and wall-clock time per iteration]

Gulrajani, Ishaan, et al. "Improved Training of Wasserstein GANs." Advances in Neural Information Processing Systems. 2017.

SLIDE 18

Importance Weighted Autoencoder

Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance weighted autoencoders." arXiv preprint arXiv:1509.00519 (2015).

[Figure: log-likelihood improvement (higher is better) and probability of improvement, our method vs. i.i.d. sampling and our method vs. negative sampling]

SLIDE 19

Conclusion

  • Define a general family of (parameterized) unbiased antithetic distributions.
  • Propose an optimization framework to learn the antithetic distribution based on the task at hand.
  • Sampling from the resulting joint distribution reduces variance at negligible computational cost.

Welcome to our poster session for further discussions! Thursday 6:30-9pm @ Pacific Ballroom #205