Zig-Zag Monte Carlo
Joris Bierkens, Delft University of Technology
February 7, 2017



Acknowledgements

Collaborators: Andrew Duncan, Paul Fearnhead, Antonietta Mira, Gareth Roberts

Financial support [sponsor logos omitted in this transcription]

Outline

1. Motivation: Markov Chain Monte Carlo
2. One-dimensional Zig-Zag process
3. Multi-dimensional ZZP
4. Subsampling
5. Doubly intractable likelihood

Bayesian inference

In Bayesian inference we typically deal with a posterior density

  π(x) = π(x; y) ∝ L(y | x) π0(x),   x ∈ Rᵈ,

where L(y | x) is the likelihood of the data y given parameter x ∈ Rᵈ, and π0 is a prior density for x. Quantities of interest are e.g.

  • posterior mean ∫ x π(x) dx,
  • posterior variance ∫ x² π(x) dx − (∫ x π(x) dx)²,
  • tail probability ∫ 𝟙{x ≥ c} π(x) dx.

All of these involve integrals of the form ∫ h(x) π(x) dx.


Evaluating ∫ h(x)π(x) dx

Possible approaches:

1. Explicit (analytic) integration. Rarely possible.
2. Numerical integration. Curse of dimensionality.
3. Monte Carlo. Draw independent samples (X1, X2, . . . ) from π and use the law of large numbers:
     ∫ h(x)π(x) dx = lim_{K→∞} (1/K) Σ_{k=1}^K h(X_k).
   Requires independent samples from π.
4. Markov chain Monte Carlo. Construct an ergodic Markov chain (X1, X2, . . . ) with invariant distribution π(x) dx and use Birkhoff's ergodic theorem:
     ∫ h(x)π(x) dx = lim_{K→∞} (1/K) Σ_{k=1}^K h(X_k).
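The plain Monte Carlo estimate of approach 3 takes only a few lines. The sketch below is ours, not from the slides: the Gaussian target and the observable h(x) = x² are arbitrary illustrative choices.

```python
import random

def monte_carlo_estimate(h, sample, K=100_000):
    # Law of large numbers: (1/K) Σ h(X_k) → ∫ h(x)π(x) dx as K → ∞
    return sum(h(sample()) for _ in range(K)) / K

random.seed(1)
# Example: π = N(0, 1); the second moment ∫ x²π(x) dx equals 1
second_moment = monte_carlo_estimate(lambda x: x * x,
                                     lambda: random.gauss(0.0, 1.0))
```

The whole difficulty, addressed by approach 4 and by the Zig-Zag process below, is that `sample()` is rarely available for a posterior known only up to a constant.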

One-dimensional Zig-Zag process

Dynamics

  • Continuous time.
  • Current state (X(t), Θ(t)) ∈ R × {−1, +1}.
  • Move X(t) in direction Θ(t) = ±1 until a switch occurs.
  • The switching intensity is λ(X(t), Θ(t)).

[Figure: a sample zig-zag trajectory t ↦ X(t).]


Relation between switching rate and potential

Generator:

  Lf(x, θ) = θ ∂f/∂x(x, θ) + λ(x, θ)(f(x, −θ) − f(x, θ)),   x ∈ R, θ ∈ {−1, +1}.

  • Potential U(x) = − log π(x).
  • π is invariant if and only if λ(x, +1) − λ(x, −1) = U′(x) for all x.
  • Equivalently, λ(x, θ) = γ(x) + max(0, θU′(x)), with γ(x) ≥ 0.

Example: Gaussian distribution N(0, σ²)

  • Density π(x) ∝ exp(−x²/(2σ²))
  • Potential U(x) = x²/(2σ²)
  • Derivative U′(x) = x/σ²
  • Switching rates λ(x, θ) = (θx/σ²)⁺ + γ(x)
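For this Gaussian example (taking γ ≡ 0) the switching times can be drawn exactly by inverting the integrated rate: with a = θx and E ~ Exp(1), solving ∫₀ᵀ λ(x + θs, θ) ds = E gives T = −a + √((a⁺)² + 2σ²E). The sketch below is our own illustration, not code from the talk; it simulates the process and returns time averages of X and X².

```python
import math
import random

def zigzag_gaussian(sigma=1.0, T=20_000.0, seed=0):
    """Exact 1D Zig-Zag simulation targeting N(0, σ²), rate λ(x,θ) = (θx/σ²)⁺."""
    rng = random.Random(seed)
    x, theta, t = 0.0, 1.0, 0.0
    s1 = s2 = 0.0  # accumulate ∫ X dt and ∫ X² dt exactly on each segment
    while t < T:
        a = theta * x
        E = rng.expovariate(1.0)
        # invert the integrated switching rate to get the next switch time
        tau = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * sigma ** 2 * E)
        tau = min(tau, T - t)
        # the path is linear on [t, t+τ]: X(t+s) = x + θs
        s1 += x * tau + theta * tau ** 2 / 2.0
        s2 += x ** 2 * tau + x * theta * tau ** 2 + tau ** 3 / 3.0
        x += theta * tau
        t += tau
        theta = -theta
    return s1 / T, s2 / T

m1, m2 = zigzag_gaussian()  # time averages of X and X²; → 0 and σ² if ergodic
```

Note that the time averages use the exact integrals over each linear segment rather than discretized samples.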


Proof of invariance of π ∝ exp(−U)

  Lf(x, θ) = θ ∂f/∂x(x, θ) + λ(x, θ)(f(x, −θ) − f(x, θ)),   λ(x, +1) − λ(x, −1) = U′(x).

Markov semigroup: P(t)f(x, θ) = E_{x,θ}[f(X(t), Θ(t))].

π stationary means that

  Σ_{θ=±1} ∫_R P(t)f(x, θ) π(x) dx = Σ_{θ=±1} ∫_R f(x, θ) π(x) dx,   f ∈ D(L), t ≥ 0.

Differentiating gives the equivalent condition

  Σ_{θ=±1} ∫_R Lf(x, θ) π(x) dx = 0,   f ∈ D(L).

For the switching part of L,

  Σ_{θ=±1} ∫_R λ(x, θ)(f(x, −θ) − f(x, θ)) π(x) dx
    = ∫_R {λ(x, +1)(f(x, −1) − f(x, +1)) + λ(x, −1)(f(x, +1) − f(x, −1))} π(x) dx
    = ∫_R (f(x, −1) − f(x, +1))(λ(x, +1) − λ(x, −1)) π(x) dx
    = ∫_R (f(x, −1) − f(x, +1)) U′(x) π(x) dx
    = −∫_R (f(x, −1) − f(x, +1)) π′(x) dx          (since π′ = −U′π)
    = ∫_R (f′(x, −1) − f′(x, +1)) π(x) dx          (integration by parts)
    = −Σ_{θ=±1} ∫_R θ ∂f/∂x(x, θ) π(x) dx,

which is minus the transport part of L, so the two contributions cancel and the stationarity condition holds.

Use in Monte Carlo

(X(t), Θ(t))_{t≥0} has invariant distribution proportional to π(x). If ergodic,

  lim_{T→∞} (1/T) ∫₀ᵀ h(X(s)) ds = ∫_R h(x)π(x) dx.

How to use in computations

Either:

  • numerically integrate (1/T) ∫₀ᵀ h(X(s)) ds for some finite T > 0, or
  • define (X1, X2, . . . ) by setting X_k = X(kΔ) for some Δ > 0 and use as in traditional MCMC.

CLT for the 1D Zig-Zag process

[B., Duncan, Limit theorems for the Zig-Zag process, 2016]

X(t) satisfies a Central Limit Theorem (CLT) for observable h if

  (1/√T) ∫₀ᵀ [h(X(s)) − E_π h(X)] ds ⇒ N(0, σ_h²).

Example: unimodal potential/density function

[Figure: a zig-zag trajectory t ↦ X(t) with marked up-crossing times T₁⁺, T₂⁺, T₃⁺, down-crossing times T₁⁻, T₂⁻, T₃⁻ and excursions S₁⁺, S₁⁻, S₂⁺, S₂⁻, S₃⁺, S₃⁻.]

  • Say Y_i = ∫_{T⁺_{i−1}}^{T⁺_i} h(X(s)) ds.
  • The CLT for the ZZP follows essentially from the CLT for Σ_{i=1}^{N(t)} Y_i.

CLT for the 1D Zig-Zag process

[B., Duncan, Limit theorems for the Zig-Zag process, 2016]

General formula for the asymptotic variance:

  σ_h² = 2 ∫_R (λ(x, +1) + λ(x, −1)) |φ′(x)|² π(x) dx,   where −L_Langevin φ = h̄ := h − π(h).

Langevin diffusion: σ_h² = 2 ∫_R |φ′(x)|² π(x) dx.

Cool results

  • Computational efficiency of the ZZP is better than IID sampling for a Gaussian target (oscillatory ACF).
  • Student-t distribution with ν degrees of freedom:
    • the Langevin diffusion satisfies a CLT for ν > 2,
    • the Zig-Zag process satisfies a CLT for ν > 1.



Multi-dimensional Zig-Zag process

  • Target π(x) = exp(−U(x)) on Rᵈ.
  • Set of directions θ ∈ {−1, +1}ᵈ.
  • Switching rates λᵢ(x, θ) = (θᵢ ∂ᵢU(x))⁺, for i = 1, . . . , d.

Cool observation

  • Factorized target distribution π(x) = Π_{i=1}^d πᵢ(xᵢ) with πᵢ(y) = exp(−Uᵢ(y)).
  • Switching rates: λᵢ(x, θ) = (θᵢ Uᵢ′(xᵢ))⁺.
  • Every component of the Zig-Zag process mixes at O(1).
  • Compare to RWM O(d), MALA O(d^{1/3}), HMC O(d^{1/4}).


Sampling

[Figure: dU/dx as a function of x, the switching rate λ(x) = max(0, dU/dx), and a computational upper bound Λ(x) ≥ λ(x).]

  • Draw T with P(T ≥ t) = exp(−∫₀ᵗ Λ(X(s)) ds).
  • Accept T with probability λ(X(T))/Λ(X(T)); otherwise repeat.
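The two steps above are Poisson thinning. A minimal sketch, assuming for simplicity a constant bound Λ (the slides allow a state-dependent Λ(x)); function and variable names are ours:

```python
import random

def next_switch_thinning(x, theta, lam, Lam, rng):
    """Draw the next Zig-Zag switching time by thinning: propose event times
    from a homogeneous Poisson process of rate Lam ≥ λ along the ray x + θt,
    accept each proposal with probability λ/Lam.
    Returns (switching time, position at the switch)."""
    t = 0.0
    while True:
        t += rng.expovariate(Lam)   # candidate event time at rate Λ
        pos = x + theta * t         # deterministic motion in the meantime
        if rng.random() < lam(pos, theta) / Lam:
            return t, pos

# Sanity check: with a constant rate λ ≡ 0.5 the accepted time is Exp(0.5),
# so the empirical mean over many draws should be close to 1/0.5 = 2.
rng = random.Random(0)
mean_time = sum(next_switch_thinning(0.0, 1.0, lambda p, th: 0.5, 2.0, rng)[0]
                for _ in range(20_000)) / 20_000
```

The thinning step is exact: no discretization error is introduced, only wasted proposals when Λ is a loose bound.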

Subsampling

Say U = ½(U1 + U2).

[Figure: dU1/dx, dU2/dx and dU/dx as functions of x; the corresponding rates λ1(x), λ2(x) and a common upper bound Λ(x).]

  • Draw T with P(T ≥ t) = exp(−∫₀ᵗ Λ(X(s)) ds).
  • Draw I from {1, 2} uniformly.
  • Accept T with probability λ_I(X(T))/Λ(X(T)); otherwise repeat.


Subsampling

Intractable likelihood, big data:

  U(x) = (1/n) Σ_{i=1}^n Uᵢ(x).

If π(x) ∝ Π_{i=1}^n f(yᵢ | x) π0(x), take

  Uᵢ(x) = − log π0(x) − n log f(yᵢ | x).

Theorem
With subsampling, the Zig-Zag process has exp(−U) as invariant density.

Proof: The effective switching rate is

  λ(x, θ) = (1/n) Σ_{i=1}^n λᵢ(x, θ) = (1/n) Σ_{i=1}^n (θUᵢ′(x))⁺,

so that

  λ(x, +1) − λ(x, −1) = (1/n) Σ_{i=1}^n [(Uᵢ′(x))⁺ − (−Uᵢ′(x))⁺]
                      = (1/n) Σ_{i=1}^n [(Uᵢ′(x))⁺ − (Uᵢ′(x))⁻]
                      = (1/n) Σ_{i=1}^n Uᵢ′(x) = U′(x).
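The subsampled switch simulation is thinning with a randomly chosen factor: propose at rate Λ, pick one index I uniformly, accept with probability λ_I/Λ, so accepted events occur at the averaged rate (1/n) Σᵢ λᵢ. A minimal sketch (our names, constant bound Λ assumed for simplicity):

```python
import random

def next_switch_subsampled(x, theta, lams, Lam, rng):
    """Propose events at rate Lam, pick one factor i of U = (1/n) Σ U_i
    uniformly at random, accept with probability λ_i(x, θ)/Λ.
    Accepted events occur at the effective rate (1/n) Σ_i λ_i(x, θ),
    using only one factor evaluation per proposal."""
    t = 0.0
    while True:
        t += rng.expovariate(Lam)
        pos = x + theta * t
        lam_i = rng.choice(lams)   # uniform subsample of one factor
        if rng.random() < lam_i(pos, theta) / Lam:
            return t

# Sanity check: constant factor rates 0.2 and 0.8 average to an effective
# rate of 0.5, so the mean accepted time should be close to 1/0.5 = 2.
rng = random.Random(0)
lams = [lambda p, th: 0.2, lambda p, th: 0.8]
mean_time = sum(next_switch_subsampled(0.0, 1.0, lams, 1.0, rng)
                for _ in range(20_000)) / 20_000
```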


Subsampling - scaling

  • Without subsampling: O(n) computations per O(1) update.
  • With naive subsampling: O(1) computations per O(1/n) update.
  • Subsampling with control variates: O(1) computations per O(1) update: super-efficient.
  • The control variates approach depends on posterior contraction and requires finding a point close to the mode: O(n) start-up cost.

Control variates

  • U(x) = (1/n) Σ_{i=1}^n Uᵢ(x).
  • Let x⋆ denote (a point ‘close’ to) the mode of the posterior distribution.
  • Naive subsampling: λᵢ(x, θ) = (θUᵢ′(x))⁺.
  • Control variates: λᵢ(x, θ) = (θ {Uᵢ′(x) + U′(x⋆) − Uᵢ′(x⋆)})⁺.
  • If x is close to the mode, then Uᵢ′(x) − Uᵢ′(x⋆) is small (under assumptions on U).
  • So each λᵢ(x, θ) is close to the ‘ideal’ switching rate (θU′(x))⁺.
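The control-variate rate is cheap because U′(x⋆) is precomputed once. The sketch below (our notation, not the authors' code) checks it on an illustrative Gaussian mean model with a flat prior, where Uᵢ is quadratic and the control-variate rate coincides with the ideal rate (θU′(x))⁺ exactly:

```python
def cv_switch_rate(i, x, theta, dUi, dU_star, x_star):
    """Control-variate switching rate from the slide:
       λ_i(x, θ) = (θ {U_i'(x) + U'(x⋆) − U_i'(x⋆)})⁺,
    where dUi(i, x) = U_i'(x) and dU_star = U'(x⋆) is precomputed once."""
    return max(0.0, theta * (dUi(i, x) + dU_star - dUi(i, x_star)))

# Illustrative model: f(y_i | x) = N(y_i; x, 1), flat prior, so
# U_i'(x) = n (x − y_i) and U'(x) = n (x − ȳ).
y = [1.0, 2.0, 3.0]
n = len(y)
ybar = sum(y) / n
dUi = lambda i, x: n * (x - y[i])
x_star = 2.5                       # a point near the mode ȳ = 2
dU_star = n * (x_star - ybar)      # U'(x⋆), computed once at start-up
rates = [cv_switch_rate(i, 3.0, 1.0, dUi, dU_star, x_star) for i in range(n)]
# each rate equals the ideal rate (θ n (x − ȳ))⁺ = 3.0 here,
# since the quadratic U_i makes the control-variate correction exact
```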

100 observations

[Three figure slides: experimental results with 100 observations; plots omitted in this transcription.]

10,000 observations

[Three figure slides: experimental results with 10,000 observations; plots omitted in this transcription.]

Scaling in number of observations

[Figure: log₂(ESS / epochs) against log₂(number of observations), for Zig-Zag, Zig-Zag with subsampling, Zig-Zag with control variates, and Zig-Zag with a poor computational bound.]

Scaling in number of observations

[Figure: log₂(ESS / second) against log₂(number of observations), for Zig-Zag, Zig-Zag with subsampling, Zig-Zag with control variates, and Zig-Zag with a poor computational bound.]

Doubly intractable likelihood

In many applications the distribution of interest π has the following form:

  π(x; y) = exp(Σ_{i=1}^d xᵢsᵢ(y)) / (Z(y) M(x)) · π0(x),   x ∈ Rᵈ,

where

  • y ∈ {0, 1}ⁿ is a fixed observed realization of the ‘forward model’
      p(y | x) = exp(Σ_{i=1}^d xᵢsᵢ(y)) / M(x),
  • sᵢ, i = 1, . . . , d, are statistics which characterize the distribution of the forward model, with weights x1, . . . , x_d ∈ R,
  • Z(y) is the usual normalization constant.

Computational problem: computation of M(x) is O(2ⁿ):

  M(x) = Σ_{y∈{0,1}ⁿ} exp(Σ_{i=1}^d xᵢsᵢ(y)).

Examples of doubly intractable likelihood

  p(y | x) = exp(Σ_{i=1}^d xᵢsᵢ(y)) / M(x),   x ∈ Rᵈ, y ∈ {0, 1}ⁿ.

Ising model (physics, image analysis)

  • s1(y) = yᵀWy, where W is an interaction matrix
  • s2(y) = hᵀy, where h represents an “external magnetic field”
  • x1, x2 serve as “inverse temperatures”

Exponential Random Graph Model

  • random graphs over k vertices, with n := ½k(k − 1) possible edges
  • y1, . . . , yn indicate the presence of an edge
  • s1(y): number of edges in the random graph
  • s2(y): e.g. number of triangles in the random graph


The Zig-Zag process applied to doubly intractable likelihood

For simplicity, say x ∈ R and ignore the prior distribution:

  π(x; y) = exp(x s(y)) / (Z(y) M(x)),   M(x) = Σ_{z∈{0,1}ⁿ} exp(x s(z)),   x ∈ R, y ∈ {0, 1}ⁿ,

so that, up to a constant in x, U(x) = − log π(x; y) = −x s(y) + log M(x). For the derivative of U we find

  U′(x) = −s(y) + d/dx log M(x)
        = −s(y) + Σ_{z∈{0,1}ⁿ} exp(x s(z)) s(z) / M(x)
        = −s(y) + E_x[s(Y)],

where Y is a realization of the forward model with parameter x.

The Zig-Zag process applied to doubly intractable likelihood

  U′(x) = −s(y) + E_x[s(Y)].

Switching rate complexity is O(2ⁿ): for x ∈ R, θ ∈ {−1, +1},

  λ(x, θ) = max(θU′(x), 0) = max(−θ s(y) + θ E_x[s(Y)], 0).

Idea: use an unbiased estimate of E_x[s(Y)].

Crude algorithm for determining the next switch:

1. Determine an upper bound Λ(x) for λ(x, θ).
2. Generate a switching time according to P(T ≥ t) = exp(−∫₀ᵗ Λ(X(r)) dr).
3. Obtain an unbiased estimate Ĝ of (d/dx)U(x).
4. Accept the switch with probability max(0, θĜ)/Λ(X(T)); otherwise repeat.


Unbiased estimation of E_x[s(Y)]

Two possible approaches:

  • Perfect sampling, “coupling from the past” (Propp, Wilson, 1996): use Glauber dynamics in an ingenious way to obtain a sample Y which is distributed exactly according to the forward distribution π(· | x). Disadvantages:
    • not applicable to all discrete models,
    • exponentially slow convergence in “cold temperature” regimes.
  • Unbiased MCMC sampling (Glynn, Rhee, 2014): introduce an N-valued random variable N. Define Δᵢ := s(Yᵢ) − s(Ỹᵢ), where (Yᵢ) and (Ỹᵢ) are two realizations of Glauber dynamics, correlated in a specific way. Unbiased estimate:
      Ĝ = Σ_{i=0}^N Δᵢ / P(N ≥ i).
    Disadvantages:
    • no global upper bound for the estimate,
    • the variance may be extremely large.
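The Glynn-Rhee estimator itself is tiny. The sketch below shows only the debiasing mechanism, with a deterministic Δᵢ standing in for the coupled-chain differences s(Yᵢ) − s(Ỹᵢ) on the slide (purely for illustration; names are ours):

```python
import random

def glynn_rhee(delta, tail_prob, sample_N):
    """One unbiased draw of Ĝ = Σ_{i=0}^N Δ_i / P(N ≥ i): truncating the
    series at a random level N and reweighting each term by 1/P(N ≥ i)
    keeps E[Ĝ] = Σ_i Δ_i whenever the tail sums converge."""
    N = sample_N()
    return sum(delta(i) / tail_prob(i) for i in range(N + 1))

rng = random.Random(0)
q = 0.5  # geometric truncation level: P(N ≥ i) = q**i

def sample_N():
    n = 0
    while rng.random() < q:
        n += 1
    return n

# Δ_i = 2^{-i}, so the target value is Σ_i Δ_i = 2; averaging many
# independent unbiased draws should recover it.
mean_est = sum(glynn_rhee(lambda i: 0.5 ** i, lambda i: q ** i, sample_N)
               for _ in range(20_000)) / 20_000
```

In this toy case each term Δᵢ/P(N ≥ i) equals 1, so a single draw is N + 1; the slide's warning applies in general: without a bound on Ĝ the acceptance step needs care, and the variance can blow up when Δᵢ decays slowly relative to P(N ≥ i).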

Zig-Zag Process

  • We can use piecewise deterministic Markov processes for sampling.
  • An unbiased estimate of the log-density gradient results in the correct invariant distribution.
  • Significantly better scaling than IID sampling for big data.
  • Doubly intractable likelihood: work in progress.

References

  • B., Roberts, A piecewise deterministic scaling limit of Lifted Metropolis-Hastings in the Curie-Weiss model, to appear in Annals of Applied Probability, 2015, https://arxiv.org/abs/1509.00302
  • B., Fearnhead, Roberts, The Zig-Zag Process and Super-Efficient Sampling for Bayesian Analysis of Big Data, 2016, https://arxiv.org/abs/1607.03188
  • B., Duncan, Limit theorems for the Zig-Zag process, 2016, https://arxiv.org/abs/1607.08845
  • B., Fearnhead, Pollock, Roberts, Piecewise Deterministic Markov Processes for Continuous-Time Monte Carlo, https://arxiv.org/abs/1611.07873

Thank you!