VCMC: Variational Consensus Monte Carlo
Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
Berkeley Vision and Learning Center, September 22, 2015


SLIDE 1

VCMC: Variational Consensus Monte Carlo

Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015

SLIDE 2

Probabilistic models!

◮ Object tracking & recognition (e.g., labeling a scene: sky, fog, bridge, grass, water)
◮ Small molecule discovery
◮ Genomics & phylogenetics
◮ Personalized recommendations

SLIDE 3

Outline

◮ Bayesian inference and Markov chain Monte Carlo
◮ MCMC is hard → New data-parallel algorithms
◮ VCMC: Our approach and theoretical results
◮ Empirical evaluation

SLIDE 4

Bayesian models encode uncertainty using probabilities

A model is a probabilistic description of data:

    y_i ∼ N(αx_i + β, σ²)

A probability distribution over model parameters:

    π(α, β, σ | x, y)

SLIDE 5

Bayesian inference uses Bayes' rule

    π(θ | x)  ∝  π(θ) · π(x | θ)
    posterior     prior   likelihood

◮ Model parameters: θ = (α, β, σ)
◮ Data: x = {(x_1, y_1), (x_2, y_2), …, (x_10, y_10)}
◮ Probabilistic model of data: y_i ∼ N(αx_i + β, σ²)

SLIDE 6

In general, posterior distributions are difficult to work with

Normalizing involves an integral that is often intractable:

    π(θ | x) = π(θ)π(x | θ) / ∫_Θ π(θ)π(x | θ) dθ
SLIDE 7

In general, posterior distributions are difficult to work with

Normalizing involves an integral that is often intractable:

    π(θ | x) = π(θ)π(x | θ) / ∫_Θ π(θ)π(x | θ) dθ

Expectations w.r.t. the posterior = more intractable integrals:

    E_π[f] = ∫_Θ f(θ) π(θ | x) dθ

(These are statistics that distill information about the posterior.)

SLIDE 8

Solution: Monte Carlo integration

Given a finite set of samples θ_1, θ_2, …, θ_T ∼ π(θ | x)

SLIDE 9

Solution: Monte Carlo integration

Given a finite set of samples θ_1, θ_2, …, θ_T ∼ π(θ | x), estimate an intractable expectation as a sum:

    E_π[f] = ∫_Θ f(θ) π(θ | x) dθ ≈ (1/T) Σ_{t=1}^{T} f(θ_t)

SLIDE 10

Solution: Monte Carlo integration

Given a finite set of samples θ_1, θ_2, …, θ_T ∼ π(θ | x), estimate an intractable expectation as a sum:

    E_π[f] = ∫_Θ f(θ) π(θ | x) dθ ≈ (1/T) Σ_{t=1}^{T} f(θ_t)

i.e., replace a distribution with samples from it.
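The estimator on this slide is easy to try out. The sketch below is ours, not from the talk: it stands in for MCMC output with i.i.d. draws from a known distribution, so the true expectation is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for MCMC output: T i.i.d. samples from a known "posterior";
# we use N(0, 1) so the true value of the expectation is known.
T = 100_000
samples = rng.standard_normal(T)

def f(theta):
    return theta ** 2  # example test function; E[f] = 1 under N(0, 1)

# Monte Carlo estimate: E_pi[f] ≈ (1/T) * sum_t f(theta_t)
estimate = f(samples).mean()
print(estimate)  # close to 1.0, the second moment of N(0, 1)
```

With 100,000 samples the Monte Carlo error here is on the order of 0.005; real MCMC samples are correlated, so the effective accuracy is lower for the same T.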

SLIDE 11

Markov chain Monte Carlo (MCMC)

◮ Widely used class of sampling algorithms
◮ Sample by simulating a Markov chain (a biased random walk) whose stationary distribution (after convergence) is the posterior:

    θ_1, θ_2, …, θ_T ∼ π(θ | x)

◮ Use the samples for Monte Carlo integration:

    E_π[f] = ∫_Θ f(θ) π(θ | x) dθ ≈ (1/T) Σ_{t=1}^{T} f(θ_t)
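As an illustration of "simulating a Markov chain whose stationary distribution is the posterior", here is a minimal random-walk Metropolis sketch. This is our toy example, not the samplers used in the paper; the target `log_posterior` and the step size are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_posterior(theta):
    # Unnormalized log posterior; a standard normal, for illustration only.
    return -0.5 * theta ** 2

def random_walk_metropolis(log_post, theta0, T, step=1.0):
    """Simulate a Markov chain whose stationary distribution is exp(log_post)."""
    theta = theta0
    samples = np.empty(T)
    for t in range(T):
        proposal = theta + step * rng.standard_normal()
        # Accept with probability min(1, pi(proposal) / pi(theta)).
        if np.log(rng.uniform()) < log_post(proposal) - log_post(theta):
            theta = proposal
        samples[t] = theta  # on rejection, the current state is repeated
    return samples

samples = random_walk_metropolis(log_posterior, 0.0, 50_000)
print(samples.mean(), samples.var())  # sample mean ≈ 0, variance ≈ 1
```

Note the serial structure: each iteration depends on the previous state and (in a real model) on the full dataset, which is exactly the bottleneck the next slides address.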

SLIDE 12

Outline

◮ Bayesian inference and Markov chain Monte Carlo
◮ MCMC is hard → New data-parallel algorithms
◮ VCMC: Our approach and theoretical results
◮ Empirical evaluation

SLIDE 13

Traditional MCMC

◮ Serial, iterative algorithm for generating samples
◮ Slow for two reasons:
   (1) Large number of iterations required to converge
   (2) Each iteration depends on the entire dataset
◮ Most innovation in MCMC has targeted (1)
◮ Recent threads of work target (2)

SLIDE 14

Serial MCMC

[Diagram: Data → single core → samples]

SLIDE 15

Data-parallel MCMC

[Diagram: Data, partitioned across parallel cores, each producing "samples"]

SLIDE 16

Aggregate samples from across partitions — but how?

[Diagram: Data → parallel cores → "samples" → Aggregate]

SLIDE 17

Factorization (⋆) motivates a data-parallel approach

    π(θ | x)  ∝  π(θ) · π(x | θ)  =  ∏_{j=1}^{J} [ π(θ)^{1/J} π(x^{(j)} | θ) ]
    posterior     prior   likelihood              sub-posterior
SLIDE 18

Factorization (⋆) motivates a data-parallel approach

    π(θ | x)  ∝  π(θ) · π(x | θ)  =  ∏_{j=1}^{J} [ π(θ)^{1/J} π(x^{(j)} | θ) ]
    posterior     prior   likelihood              sub-posterior

◮ Partition the data as x^{(1)}, …, x^{(J)} across J cores
◮ The jth core samples from a distribution proportional to the jth sub-posterior (a 'piece' of the full posterior)
◮ Aggregate the sub-posterior samples to form approximate full posterior samples
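The factorization (⋆) can be checked numerically on a toy model. The model, prior, and partition below are our illustrative choices: a normal prior, normal likelihood with known variance, and three equal data shards.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: theta ~ N(0, 1) prior, x_i | theta ~ N(theta, 1) likelihood.
x = rng.normal(1.5, 1.0, size=90)
J = 3
partitions = np.array_split(x, J)  # x^(1), ..., x^(J)

def log_prior(theta):
    return -0.5 * theta ** 2

def log_lik(theta, data):
    return -0.5 * np.sum((data - theta) ** 2)

def log_subposterior(theta, data_j):
    # pi(theta)^(1/J) * pi(x^(j) | theta), in log space.
    return log_prior(theta) / J + log_lik(theta, data_j)

theta = 0.7
full = log_prior(theta) + log_lik(theta, x)
factored = sum(log_subposterior(theta, d) for d in partitions)
print(np.isclose(full, factored))  # True: the sub-posteriors multiply to the posterior
```

The prior is raised to the power 1/J so that multiplying the J sub-posteriors recovers exactly one copy of the prior, not J copies.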

SLIDE 19

Aggregation strategies for sub-posterior samples

    π(θ | x)  ∝  π(θ) · π(x | θ)  =  ∏_{j=1}^{J} [ π(θ)^{1/J} π(x^{(j)} | θ) ]
    posterior     prior   likelihood              sub-posterior

◮ Sub-posterior density estimation (Neiswanger et al., UAI 2014)
◮ Weierstrass samplers (Wang & Dunson, 2013)
◮ Weighted averaging of sub-posterior samples:
   ◮ Consensus Monte Carlo (Scott et al., Bayes 250, 2013)
   ◮ Variational Consensus Monte Carlo (Rabinovich et al., NIPS 2015)

SLIDE 20

Aggregate 'horizontally' (⋆) across partitions

[Diagram: Data → parallel cores → "samples" → Aggregate]

SLIDE 21

Recall that samples are parameter vectors

[Diagram: each sample is a vector of parameter components]

SLIDE 22

Naïve aggregation = Average

    Aggregate(θ^(1), θ^(2)) = 0.5 · θ^(1) + 0.5 · θ^(2)

SLIDE 23

Less naïve aggregation = Weighted average

    Aggregate(θ^(1), θ^(2)) = 0.58 · θ^(1) + 0.42 · θ^(2)

SLIDE 24

Consensus Monte Carlo (Scott et al., 2013)

    Aggregate(θ^(1), θ^(2)) = W_1 θ^(1) + W_2 θ^(2)

◮ Weights are inverse covariance matrices
◮ Motivated by Gaussian assumptions
◮ Designed at Google for the MapReduce framework
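A sketch of this aggregation rule under the stated Gaussian motivation: weight each sub-posterior by its inverse sample covariance and combine samples index by index. The synthetic sample sets and all variable names below are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
J, T, d = 3, 5000, 2

# Stand-ins for sub-posterior sample sets (one per core), with different spreads.
sub_samples = [rng.normal(loc=j, scale=1.0 + j, size=(T, d)) for j in range(J)]

# CMC-style weights: the inverse sample covariance of each sub-posterior.
weights = [np.linalg.inv(np.cov(s, rowvar=False)) for s in sub_samples]
total = np.linalg.inv(sum(weights))  # normalizer (sum_j W_j)^(-1)

# Aggregate sample t as (sum_j W_j)^(-1) sum_j W_j theta_t^(j).
aggregated = np.stack([
    total @ sum(W @ s[t] for W, s in zip(weights, sub_samples))
    for t in range(T)
])
print(aggregated.shape)  # (5000, 2)
```

Because the weights are inverse covariances, tighter (more confident) sub-posteriors pull the aggregate more strongly, which is exact when every sub-posterior is Gaussian.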

SLIDE 25

Outline

◮ Bayesian inference and Markov chain Monte Carlo
◮ MCMC is hard → New data-parallel algorithms
◮ VCMC: Our approach and theoretical results
◮ Empirical evaluation

SLIDE 26

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution
Method: Convex optimization via variational Bayes

SLIDE 27

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution
Method: Convex optimization via variational Bayes

F = aggregation function; q_F = approximate distribution

    L(F)  =  E_{q_F}[log π(X, θ)]  +  H[q_F]
    objective     likelihood           entropy

SLIDE 28

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution
Method: Convex optimization via variational Bayes

F = aggregation function; q_F = approximate distribution

    L̃(F)  =  E_{q_F}[log π(X, θ)]  +  H̃[q_F]
    objective     likelihood           relaxed entropy

SLIDE 29

Variational Consensus Monte Carlo

Goal: Choose the aggregation function to best approximate the target distribution
Method: Convex optimization via variational Bayes

F = aggregation function; q_F = approximate distribution

    L̃(F)  =  E_{q_F}[log π(X, θ)]  +  H̃[q_F]
    objective     likelihood           relaxed entropy

No mean field assumption

SLIDE 30

Variational Consensus Monte Carlo

    Aggregate(θ^(1), θ^(2)) = W_1 θ^(1) + W_2 θ^(2)

◮ Optimize over weight matrices (⋆)
◮ Restrict to valid solutions when parameter vectors are constrained
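To make "optimize over the weights" concrete, here is a deliberately simplified toy of ours: a scalar weight chosen by grid search against a moment-matching objective. This is not the paper's method, which optimizes matrix-valued F against the variational Bayes bound; only the idea of tuning the aggregation weights carries over.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 1-D sub-posterior sample sets (stand-ins for parallel chains).
s1 = rng.normal(0.5, 1.0, size=4000)
s2 = rng.normal(1.5, 2.0, size=4000)
reference = rng.normal(1.0, 0.8, size=4000)  # stand-in for the full posterior

def objective(w):
    # Toy objective: match the mean and variance of the aggregated samples
    # to the reference (the paper instead maximizes a variational bound).
    agg = w * s1 + (1 - w) * s2
    return (agg.mean() - reference.mean()) ** 2 + (agg.var() - reference.var()) ** 2

# Grid search over the scalar weight (the paper uses convex optimization over F).
grid = np.linspace(0.0, 1.0, 101)
best = min(grid, key=objective)
print(0.0 < best < 1.0)  # True: an interior, non-uniform weight is selected
```

Even in this toy, the optimized weight differs from the uniform 0.5, which is the gap between naïve averaging and optimized aggregation that VCMC exploits.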

SLIDE 31

Variational Consensus Monte Carlo

Theorem (Entropy relaxation)
Under mild structural assumptions, we can choose

    H̃[q_F] = c_0 + (1/K) Σ_{k=1}^{K} h_k(F),

with each h_k a concave function of F, such that

    H[q_F] ≥ H̃[q_F].

We therefore have L(F) ≥ L̃(F).

SLIDE 32

Variational Consensus Monte Carlo

Theorem (Concavity of the variational Bayes objective)
Under mild structural assumptions, the relaxed variational Bayes objective

    L̃(F) = E_{q_F}[log π(X, θ)] + H̃[q_F]

is concave in F.

SLIDE 33

Outline

◮ Bayesian inference and Markov chain Monte Carlo
◮ MCMC is hard → New data-parallel algorithms
◮ VCMC: Our approach and theoretical results
◮ Empirical evaluation

SLIDE 34

Empirical evaluation

◮ Compare 3 aggregation strategies:
   ◮ Uniform average
   ◮ Gaussian-motivated weighted average (CMC)
   ◮ Optimized weighted average (VCMC)
◮ For each algorithm A, report the approximation error of an expectation E_π[f], relative to serial MCMC:

    ε_A(f) = |E_A[f] − E_MCMC[f]| / |E_MCMC[f]|

◮ Preliminary speedup results
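The error metric used throughout the evaluation is straightforward to compute; below is a small helper of ours for the relative-error formula, with made-up example values.

```python
def relative_error(estimate, reference):
    """epsilon_A(f) = |E_A[f] - E_MCMC[f]| / |E_MCMC[f]|"""
    return abs(estimate - reference) / abs(reference)

# Hypothetical values: an aggregated estimate of 1.1 against a serial-MCMC
# reference of 1.0 gives a 10% relative error.
print(relative_error(1.1, 1.0))  # ≈ 0.1
```

Note the metric is undefined when the reference expectation is zero, which is one reason the plots truncate large relative errors.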

SLIDE 35

Example 1: High-dimensional Bayesian probit regression

#data = 100,000, d = 300
First moment estimation error, relative to serial MCMC (error truncated at 2.0)

SLIDE 36

Example 2: High-dimensional covariance estimation

Normal-inverse Wishart model
#data = 100,000, #dim = 100 ⟹ 5,050 parameters
(L) First moment estimation error; (R) eigenvalue estimation error

SLIDE 37

Example 3: Mixture of 8, 8-dim Gaussians

Error relative to serial MCMC, for cluster comembership probabilities of pairs of test data points

SLIDE 38

VCMC error decreases as the optimization runs longer

Initialize VCMC with CMC weights (inverse covariance matrices)

SLIDE 39

VCMC reduces CMC error at the cost of speedup (∼2x)

VCMC speedup is approximately linear

[Plot legend: CMC, VCMC]

SLIDE 40

Concluding thoughts

Contributions
◮ Convex optimization framework for Consensus Monte Carlo
◮ Structured aggregation accounting for constrained parameters
◮ Entropy relaxation
◮ Empirical evaluation

Future work
◮ More structured and complex (latent variable) models
◮ Alternate posterior factorizations and aggregation schemes

We'd love to hear about your Bayesian inference problems!

SLIDE 41

SLIDE 42

Example 1: High-dimensional Bayesian probit regression

[Figure: four panels of moment approximation error vs. number of cores (5, 10, 25, 50, 100), for first and mixed second moments, comparing Uniform, Gaussian, and VCMC on subposteriors and partial posteriors]

Figure: High-dimensional probit regression (d = 300). Moment approximation error for the uniform and Gaussian averaging baselines and VCMC, relative to serial MCMC, for (left) subposteriors and (right) partial posteriors. We assessed three groups of functions: first moments, with f(β) = β_j for 1 ≤ j ≤ d; pure second moments, with f(β) = β_j² for 1 ≤ j ≤ d; and mixed second moments, with f(β) = β_i β_j for 1 ≤ i < j ≤ d. For brevity, results for pure second moments are relegated to the supplement. Relative errors are truncated at 2 in all cases.

SLIDE 43

Example 2: High-dimensional covariance estimation

[Figure: moment approximation error vs. number of cores (25, 50, 100), with panels for first, pure second, and mixed second moments, comparing Uniform, Gaussian, and VCMC]

Figure: High-dimensional normal-inverse Wishart model (d = 100). (Far left, left, right) Moment approximation error for the uniform and Gaussian averaging baselines and VCMC, relative to serial MCMC. Letting ρ_j denote the jth largest eigenvalue of Λ⁻¹, we assessed three groups of functions: first moments, with f(Λ) = ρ_j for 1 ≤ j ≤ d; pure second moments, with f(Λ) = ρ_j² for 1 ≤ j ≤ d; and mixed second moments, with f(Λ) = ρ_i ρ_j for 1 ≤ i < j ≤ d. (Far right) Graph of the error in estimating E[ρ_j] as a function of j (where ρ_1 ≥ ρ_2 ≥ ⋯ ≥ ρ_d).

SLIDE 44

Example 3: Mixture of 8, 8-dim Gaussians

Figure: Mixture of Gaussians (d = 8, L = 8). Expectation approximation error for the uniform and Gaussian baselines and VCMC. We report the median error, relative to serial MCMC, for cluster comembership probabilities of pairs of test data points, for (left) σ = 1 and (right) σ = 2, where we run the VCMC optimization procedure for 50 and 200 iterations, respectively. When σ = 2, some comembership probabilities are estimated poorly by all methods; we therefore only use the 70% of comembership probabilities with the smallest errors across all the methods.

SLIDE 45

Computational efficiency

Figure: Error versus timing and speedup measurements. (Left) VCMC error as a function of the number of seconds of optimization. The cost of optimization is nonnegligible, but still moderate compared to serial MCMC, particularly since our optimization scheme only needs small batches of samples and can therefore operate concurrently with the sampler. (Right) Error versus speedup relative to serial MCMC, for both CMC with Gaussian averaging (small markers) and VCMC (large markers). In this case, the cost of optimization is small enough that a near-linear speedup is achieved.