slide-1
SLIDE 1

Scalable Metropolis-Hastings for Exact Bayesian Inference with Large Datasets

Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, George Deligiannidis, Arnaud Doucet

June 8, 2019

Cornish et al. Scalable Metropolis–Hastings June 8, 2019 1 / 24

slide-2
SLIDE 2

Problem

Bayesian inference via MCMC is expensive for large datasets


slide-5
SLIDE 5

Problem

Consider a posterior over parameters θ given n data points y_i:

π(θ) = p(θ | y_{1:n}) ∝ p(θ) ∏_{i=1}^n p(y_i | θ).

Metropolis–Hastings

Given a proposal q and current state θ:

1. Propose θ′ ∼ q(θ, ·)
2. Accept θ′ with probability

α_MH(θ, θ′) := 1 ∧ [q(θ′, θ) π(θ′)] / [q(θ, θ′) π(θ)]
             = 1 ∧ [q(θ′, θ) p(θ′)] / [q(θ, θ′) p(θ)] · ∏_{i=1}^n p(y_i | θ′) / p(y_i | θ)

⇒ O(n) computation per step to compute α_MH(θ, θ′)
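As a reference point, the O(n) step above can be sketched in a few lines. This is a minimal illustration with our own toy names, not code from the paper: evaluating the acceptance ratio touches every one of the n data points.

```python
import numpy as np

def mh_step(theta, log_prior, log_lik, y, propose, rng):
    """One Metropolis-Hastings step with a symmetric proposal.

    Illustrative sketch (our own names): the acceptance ratio sums over
    all n data points, which is the O(n) per-step bottleneck the talk
    sets out to remove.
    """
    theta_prop = propose(theta, rng)
    # log pi(theta') - log pi(theta) = prior term + sum over all n likelihoods
    log_ratio = log_prior(theta_prop) - log_prior(theta)
    log_ratio += sum(log_lik(yi, theta_prop) - log_lik(yi, theta) for yi in y)
    if np.log(rng.uniform()) < min(0.0, log_ratio):
        return theta_prop  # accept
    return theta           # reject
```

For n data points this costs n likelihood evaluations per proposal, which motivates the subsampling schemes that follow.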

slide-7
SLIDE 7

Our approach

Want a method with cost o(n) per step – subsampling
Want our method not to reduce accuracy – exactness

slide-9
SLIDE 9

Our approach

Several existing exact subsampling methods:

Firefly [Maclaurin and Adams, 2014]
Delayed acceptance [Banterle et al., 2015]
Piecewise-deterministic MCMC [Bouchard-Côté et al., 2018, Bierkens et al., 2018]

Our method: an exact subsampling scheme based on a proxy target that requires on average O(1) or O(1/√n) likelihood evaluations per step

[Plot: likelihood evaluations per iteration vs. n, for MH, SMH-1, and SMH-2.]

Figure 1: Average number of likelihood evaluations per iteration required by SMH for a 10-dimensional logistic regression posterior as the number of data points n increases.

slide-10
SLIDE 10

Three key ingredients

1. A factorised MH acceptance probability
2. Procedures for fast simulation of Bernoulli random variables
3. Control performance using an approximate target ("control variates")

slide-14
SLIDE 14

Ingredient 1 - Factorised Metropolis–Hastings

Suppose we can factor the target like

π(θ) ∝ ∏_{i=1}^n π_i(θ)

Obvious choice (with a flat prior) is π_i(θ) = p(y_i | θ)

Can show that (for a symmetric proposal)

α_FMH(θ, θ′) := ∏_{i=1}^n α_FMH,i(θ, θ′) := ∏_{i=1}^n 1 ∧ π_i(θ′) / π_i(θ)

is also a valid acceptance probability for an MH-style algorithm

Compare with the MH acceptance probability

α_MH(θ, θ′) = 1 ∧ ∏_{i=1}^n π_i(θ′) / π_i(θ)

slide-18
SLIDE 18

Ingredient 1 - Factorised Metropolis–Hastings

Explicitly (assuming symmetric q), the FMH algorithm is:

Factorised Metropolis-Hastings (FMH)

1. Propose θ′ ∼ q(θ, ·)
2. Accept θ′ with probability

α_FMH(θ, θ′) := ∏_{i=1}^n α_FMH,i(θ, θ′) := ∏_{i=1}^n 1 ∧ π_i(θ′) / π_i(θ)

Can implement the acceptance step by sampling independent B_i ∼ Bernoulli(α_FMH,i(θ, θ′)) and accepting if every B_i = 1

Can stop as soon as some B_i = 0: delayed acceptance

However, still must compute all n terms in order to accept

slide-19
SLIDE 19

Three key ingredients

1. A factorised MH acceptance probability
2. Procedures for fast simulation of Bernoulli random variables
3. Control performance using an approximate target ("control variates")

slide-23
SLIDE 23

Ingredient 2 - Fast Bernoulli simulation

How can we avoid simulating these n Bernoullis? Assuming we have bounds λ̄_i(θ, θ′) ≥ − log α_FMH,i(θ, θ′) =: λ_i(θ, θ′), we can use the following:

Poisson subsampling

1. C ∼ Poisson(∑_{i=1}^n λ̄_i(θ, θ′))
2. X_1, …, X_C iid ∼ Categorical([λ̄_i(θ, θ′) / ∑_{j=1}^n λ̄_j(θ, θ′)]_{1≤i≤n})
3. B_j ∼ Bernoulli(λ_{X_j}(θ, θ′) / λ̄_{X_j}(θ, θ′)) for 1 ≤ j ≤ C

⇒ P(B_1 = ⋯ = B_C = 0) = α_FMH(θ, θ′), so we can use this procedure to perform the FMH accept/reject step

Intuition: sample a discrete Poisson point process on {1, …, n} with intensity i ↦ λ_i(θ, θ′) by thinning one with intensity i ↦ λ̄_i(θ, θ′)
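A direct transcription of the three steps above can be sketched as follows (our own names; `np.random.Generator.choice` stands in for the alias sampler discussed on a later slide):

```python
import numpy as np

def poisson_subsample_accept(lam, lam_bar, theta, theta_prop, rng):
    """FMH accept/reject via Poisson subsampling (illustrative sketch).

    lam(i, theta, theta') = lambda_i = -log alpha_FMH,i  (per-factor intensity),
    lam_bar = array of upper bounds lambda_bar_i >= lambda_i.
    Accepts with probability prod_i alpha_FMH,i = exp(-sum_i lambda_i).
    """
    total = lam_bar.sum()                 # precomputable once per bound family
    C = rng.poisson(total)                # 1. number of candidate factors
    X = rng.choice(len(lam_bar), size=C, p=lam_bar / total)  # 2. which factors
    for i in X:                           # 3. thinning step
        if rng.uniform() < lam(i, theta, theta_prop) / lam_bar[i]:
            return False                  # some B_j = 1: reject
    return True                           # all B_j = 0: accept
```

Note `rng.choice` with explicit weights is O(n) per call; Walker's alias method (next slides) is what makes step 2 O(C) after a one-off Θ(n) setup.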

slide-32
SLIDE 32

Ingredient 2 - Fast Bernoulli simulation

Poisson subsampling

1. C ∼ Poisson(∑_{i=1}^n λ̄_i(θ, θ′))  ⇒ O(1) (after precomputing ∑_{i=1}^n ψ_i)
2. X_1, …, X_C iid ∼ Categorical([λ̄_i(θ, θ′) / ∑_{j=1}^n λ̄_j(θ, θ′)]_{1≤i≤n})  ⇒ O(C) (via Walker's alias method [Walker, 1977], after Θ(n) setup cost)
3. B_j ∼ Bernoulli(λ_{X_j}(θ, θ′) / λ̄_{X_j}(θ, θ′)) for 1 ≤ j ≤ C  ⇒ O(C)

⇒ Overall cost of O(C)

When is this efficient? Suppose our bounds have the form:

λ̄_i(θ, θ′) = ϕ(θ, θ′) ψ_i ≥ − log α_FMH,i(θ, θ′) = λ_i(θ, θ′).  (*)

Then:

∑_{i=1}^n λ̄_i(θ, θ′) = ϕ(θ, θ′) ∑_{i=1}^n ψ_i  and  λ̄_i(θ, θ′) / ∑_{j=1}^n λ̄_j(θ, θ′) = ψ_i / ∑_{j=1}^n ψ_j.

(*) holds for instance if log π_i is Lipschitz (but we will see a better case later).
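Step 2 relies on Walker's alias method for O(1) categorical draws after a Θ(n) setup; since under (*) the weights ψ_i / ∑_j ψ_j do not depend on (θ, θ′), the tables are built once for the whole chain. A minimal self-contained implementation (ours, for illustration):

```python
import numpy as np

def build_alias(weights):
    """Walker's alias tables: Theta(n) setup, then O(1) per draw.

    Sketch of the structure the slide refers to; in SMH the weights
    would be psi_i / sum_j psi_j, fixed for the whole run.
    """
    n = len(weights)
    prob = np.asarray(weights, dtype=float) * n / np.sum(weights)
    alias = np.zeros(n, dtype=int)
    small = [i for i in range(n) if prob[i] < 1.0]
    large = [i for i in range(n) if prob[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                      # overflow of column s goes to l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for i in small + large:
        prob[i] = 1.0                     # guard against floating-point residue
    return prob, alias

def alias_draw(prob, alias, rng):
    i = rng.integers(len(prob))           # pick a column uniformly
    return i if rng.uniform() < prob[i] else alias[i]  # keep it or take its alias
```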

slide-37
SLIDE 37

Potential problems

Two problems now to overcome:

1. Since C ∼ Poisson(∑_{i=1}^n λ̄_i(θ, θ′)), potentially C > n
   ⇒ Must ensure C = o(n) if we are to achieve anything
2. Since each α_FMH,i(θ, θ′) ≤ 1, can have α_FMH(θ, θ′) → 0 as n → ∞
   ⇒ Must ensure α_FMH(θ, θ′) is well behaved

These problems are related since

E[C | θ, θ′] = ∑_{i=1}^n λ̄_i(θ, θ′)  and  α_FMH(θ, θ′) ≥ exp(−∑_{i=1}^n λ̄_i(θ, θ′)).

The key question is how to choose bounds for which ∑_{i=1}^n λ̄_i(θ, θ′) is small.

slide-38
SLIDE 38

Three key ingredients

1. A factorised MH acceptance probability
2. Procedures for fast simulation of Bernoulli random variables
3. Control performance using an approximate target ("control variates")

slide-40
SLIDE 40

Ingredient 3 - control variates

Write the target as

π(θ) = ∏_{i=1}^n π_i(θ) = ∏_{i=1}^n exp(−U_i(θ))

for potentials U_i = − log π_i

Approximate Û_{k,i}(θ) ≈ U_i(θ), where Û_{k,i} is a k-th order Taylor expansion of U_i around some fixed θ̂ (not depending on i)

slide-44
SLIDE 44

Ingredient 3 - control variates

Also let

Û_k(θ) := ∑_{i=1}^n Û_{k,i}(θ) ≈ U(θ) := ∑_{i=1}^n U_i(θ) = − log π(θ)

which is itself a Taylor expansion of U(θ) around θ̂

Explicitly,

Û_1(θ) = U(θ̂) + ∇U(θ̂)ᵀ(θ − θ̂)
Û_2(θ) = U(θ̂) + ∇U(θ̂)ᵀ(θ − θ̂) + ½ (θ − θ̂)ᵀ ∇²U(θ̂) (θ − θ̂)

In particular, exp(−Û_2(θ)) ≈ π(θ) is a Gaussian approximation to the target at θ̂
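Building the k = 2 surrogate Û_2 above is mechanical once U, its gradient, and its Hessian at θ̂ are available. A sketch with our own helper names (the callables are assumed supplied by the model):

```python
import numpy as np

def make_U2_hat(U, grad_U, hess_U, theta_hat):
    """Second-order Taylor surrogate of a potential U around theta_hat.

    exp(-U2_hat) is then a Gaussian approximation of exp(-U) near theta_hat.
    Illustrative sketch; U, grad_U, hess_U are model-supplied callables.
    """
    U0 = U(theta_hat)            # value, gradient, Hessian: computed once
    g0 = grad_U(theta_hat)
    H0 = hess_U(theta_hat)

    def U2_hat(theta):
        d = theta - theta_hat
        return U0 + g0 @ d + 0.5 * d @ H0 @ d

    return U2_hat
```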

slide-48
SLIDE 48

Ingredient 3 - control variates

Define the Scalable Metropolis-Hastings (SMH) acceptance probability

α_SMH-k(θ, θ′) := [1 ∧ exp(−Û_k(θ′)) / exp(−Û_k(θ))] ∏_{i=1}^n [1 ∧ exp(Û_{k,i}(θ′) − U_i(θ′)) / exp(Û_{k,i}(θ) − U_i(θ))].

Corresponds to FMH using the factorisation π = exp(−Û_k) · ∏_{i=1}^n exp(Û_{k,i} − U_i), with exp(−Û_k) playing the role of an extra factor π_{n+1} and each exp(Û_{k,i} − U_i) the role of π_i

The first factor can be simulated directly in O(1) time
The remaining factors can be simulated with Poisson subsampling
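The two-stage structure can be sketched as follows (our own names; `residual_accept` stands in for the Poisson-subsampling routine of Ingredient 2 applied to the residual factors):

```python
import numpy as np

def smh_accept(Uk_hat, residual_accept, theta, theta_prop, rng):
    """Two-stage SMH accept/reject (illustrative sketch).

    Stage 1: Bernoulli(1 ∧ exp(-Uk_hat(theta')) / exp(-Uk_hat(theta)))
             -- O(1), uses only the Taylor surrogate.
    Stage 2: the product of residual factors 1 ∧ exp(Uk_hat_i - U_i)(...),
             delegated to a subsampling routine (residual_accept).
    """
    log_a1 = min(0.0, Uk_hat(theta) - Uk_hat(theta_prop))
    if np.log(rng.uniform()) >= log_a1:
        return False                      # cheap first factor rejects
    return residual_accept(theta, theta_prop, rng)
```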

slide-51
SLIDE 51

Ingredient 3 - control variates

Recall we need upper bounds − log α_FMH,i(θ, θ′) ≤ ϕ(θ, θ′) ψ_i =: λ̄_i(θ, θ′)

Possible to show that, if we can find constants

Ū_{k+1,i} ≥ sup_{θ∈Θ, |β|=k+1} |∂^β U_i(θ)|  (*)

then we can use

λ̄_i(θ, θ′) := (‖θ − θ̂‖₁^{k+1} + ‖θ′ − θ̂‖₁^{k+1}) · Ū_{k+1,i} / (k + 1)!

with ϕ(θ, θ′) = ‖θ − θ̂‖₁^{k+1} + ‖θ′ − θ̂‖₁^{k+1} and ψ_i = Ū_{k+1,i} / (k + 1)!

(*) constitutes the only quantity that must be specified by hand to use our method on a given model

slide-58
SLIDE 58

Ingredient 3 - control variates

Heuristically, suppose

θ ∼ π (chain is at stationarity)
‖θ − θ_MAP‖ = O(1/√n) (1/√n concentration - key assumption)
‖θ′ − θ‖ = O(1/√n) (proposal is scaled like 1/√n)
‖θ̂ − θ_MAP‖ = O(1/√n) (θ̂ is not too far from the mode)

then by the triangle inequality

∑_{i=1}^n λ̄_i(θ, θ′) = (‖θ − θ̂‖₁^{k+1} + ‖θ′ − θ̂‖₁^{k+1}) · ∑_{i=1}^n Ū_{k+1,i} / (k + 1)! = O(n^{−(k+1)/2}) · O(n) = O(n^{(1−k)/2})

In particular, ∑_{i=1}^n λ̄_i(θ, θ′) is O(1) if k = 1 and O(1/√n) if k = 2

slide-61
SLIDE 61

Summary

This directly yields an average cost per step

E[C | θ, θ′] = ∑_{i=1}^n λ̄_i(θ, θ′) = O(1) if k = 1, O(1/√n) if k = 2.

Likewise, the acceptance probability is stable, since in

α_SMH-k(θ, θ′) := [1 ∧ exp(−Û_k(θ′)) / exp(−Û_k(θ))] ∏_{i=1}^n [1 ∧ exp(Û_{k,i}(θ′) − U_i(θ′)) / exp(Û_{k,i}(θ) − U_i(θ))]

the first factor is ≥ exp(−O(1)) (can show) and the product over i is ≥ exp(−∑_{i=1}^n λ̄_i(θ, θ′)).

Can do even better with an exp(−Û_k)-reversible proposal (the first term vanishes).

slide-63
SLIDE 63

Application - logistic regression

We consider logistic regression with covariates x_i ∈ R^d and responses y_i ∈ {0, 1}:

p(y_i | θ, x_i) = Bernoulli(y_i | 1 / (1 + exp(−θᵀx_i)))
⇒ U_i(θ) = − log p(y_i | θ, x_i) = log(1 + exp(θᵀx_i)) − y_i θᵀx_i

Admits upper bounds

Ū_{2,i} = ¼ max_{1≤j≤d} |x_ij|²,   Ū_{3,i} = (1 / (6√3)) max_{1≤j≤d} |x_ij|³
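The potential and the bound constants on this slide translate directly to code (illustrative sketch; array names are ours):

```python
import numpy as np

def logistic_potential(theta, x, y):
    """U_i(theta) = log(1 + exp(theta^T x_i)) - y_i * theta^T x_i,
    computed with logaddexp for numerical stability."""
    t = x @ theta
    return np.logaddexp(0.0, t) - y * t

def smh_bound_constants(X):
    """Per-datum derivative bounds from the slide:
    U2bar_i = max_j x_ij^2 / 4,  U3bar_i = max_j |x_ij|^3 / (6 sqrt(3))."""
    m = np.abs(X).max(axis=1)          # max_j |x_ij| for each data point
    return m ** 2 / 4.0, m ** 3 / (6.0 * np.sqrt(3.0))
```

The Ū_{k+1,i} bounds are data-dependent constants, so they (and the ψ_i they induce) are computed once before the chain runs.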

slide-64
SLIDE 64

Application - logistic regression

Empirical result for d = 10

[Plot: likelihood evaluations per iteration vs. n, for MH, SMH-1, and SMH-2.]

Figure 2: Average number of likelihood evaluations per iteration required by SMH for a 10-dimensional logistic regression posterior as the number of data points n increases.

slide-65
SLIDE 65

Application - logistic regression

Empirical result for d = 10

[Plot: effective sample size per second vs. n, for MH, SMH-1, SMH-2, FlyMC, and Zig-Zag.]

Figure 3: Effective sample size per second of computation for the posterior mean of the first regression coefficient (higher is better)

slide-66
SLIDE 66

Thanks for listening

Please feel free to ask any questions now, or find us later at poster #202.
