SLIDE 1

Adaptive HMC via the Infinite Exponential Family

Arthur Gretton

⋆Gatsby Unit, CSML, University College London

RegML, 2017

Arthur Gretton (Gatsby Unit, UCL) Adaptive HMC via the Infinite Exponential Family 30/03/2016 1 / 38

SLIDE 2

Setting: MCMC for intractable non-linear targets

Using samples to compute expectations

We have a density of the form p(x) = π(x)/Z, where Z = ∫ π(x) dx is often impractical to compute.

Goal: compute expectations of functions, E_p[f(x)] = ∫ f(x) p(x) dx.

SLIDE 3

Setting: MCMC for intractable non-linear targets

Using samples to compute expectations

We have a density of the form p(x) = π(x)/Z, where Z = ∫ π(x) dx is often impractical to compute.

Goal: compute expectations of functions, E_p[f(x)] = ∫ f(x) p(x) dx.

Given samples {x_i}_{i=1}^n with distribution p(x),

E_p[f(x)] ≈ (1/n) Σ_{i=1}^n f(x_i)
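The sample-average estimator on the slide above can be sketched in a few lines; the standard normal target and the test function f(x) = x² are illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# When we can sample p directly (here a standard normal), E_p[f(x)] is
# estimated by the sample mean of f over the draws.
samples = rng.normal(size=100_000)

def f(x):
    return x ** 2  # E[x^2] = Var(x) = 1 under the standard normal

estimate = np.mean(f(samples))  # close to 1 for large n
```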

SLIDE 4

Setting: MCMC for intractable non-linear targets

Metropolis-Hastings MCMC

A visual guide...

SLIDE 5

Setting: MCMC for intractable non-linear targets

Metropolis-Hastings MCMC

Unnormalized target π(x) ∝ p(x). Generate a Markov chain with invariant distribution p:

Initialize x₀ ∼ p₀. At iteration t ≥ 0, propose a move to state x′ ∼ q(·|x_t). Accept or reject proposals based on the ratio:

x_{t+1} = x′ with probability min( 1, [π(x′) q(x_t|x′)] / [π(x_t) q(x′|x_t)] ), and x_{t+1} = x_t otherwise.
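A minimal sketch of the accept/reject rule above, assuming a symmetric Gaussian proposal (so the q terms cancel); the target and step size are toy choices:

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.

    With a symmetric proposal, q(x_t | x') = q(x' | x_t), so the acceptance
    ratio reduces to pi(x') / pi(x_t)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    chain = [x.copy()]
    for _ in range(n_steps):
        proposal = x + step * rng.normal(size=x.shape)
        # Accept w.p. min(1, pi(x') / pi(x_t)); done in log space for stability.
        if np.log(rng.uniform()) < log_pi(proposal) - log_pi(x):
            x = proposal
        chain.append(x.copy())
    return np.array(chain)

# Toy target: unnormalized standard normal, log pi(x) = -||x||^2 / 2.
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x ** 2), np.zeros(1), 20_000)
```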

SLIDE 6

Setting: MCMC for intractable non-linear targets

Metropolis-Hastings MCMC

Unnormalized target π(x) ∝ p(x). Generate a Markov chain with invariant distribution p:

Initialize x₀ ∼ p₀. At iteration t ≥ 0, propose a move to state x′ ∼ q(·|x_t). Accept or reject proposals based on the ratio:

x_{t+1} = x′ with probability min( 1, [π(x′) q(x_t|x′)] / [π(x_t) q(x′|x_t)] ), and x_{t+1} = x_t otherwise.

What proposal q(·|x_t)?

Too narrow or too broad → slow convergence. Does not conform to the support of the target → slow convergence.

SLIDE 7

Setting: MCMC for intractable non-linear targets

Adaptive MCMC

Adaptive Metropolis (Haario, Saksman & Tamminen, 2001): update the proposal q_t(·|x_t) = N(x_t, ν² Σ̂_t), using estimates of the target covariance.

SLIDE 8

Setting: MCMC for intractable non-linear targets

Adaptive MCMC

Adaptive Metropolis (Haario, Saksman & Tamminen, 2001): update the proposal q_t(·|x_t) = N(x_t, ν² Σ̂_t), using estimates of the target covariance.

Locally miscalibrated for strongly non-linear targets: the directions of large variance depend on the current location.
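A sketch of the adaptive Metropolis idea, assuming the common scaling ν² = 2.38²/d; recomputing the sample covariance over the whole history is for clarity only (the paper uses a cheap recursive update):

```python
import numpy as np

def adaptive_metropolis(log_pi, x0, n_steps, seed=0):
    """Sketch of adaptive Metropolis: the Gaussian proposal covariance is a
    running estimate of the target covariance, scaled by nu^2 = 2.38^2 / d."""
    rng = np.random.default_rng(seed)
    d = len(x0)
    x = np.array(x0, dtype=float)
    chain = [x.copy()]
    eps = 1e-6  # small jitter keeps the proposal covariance non-singular
    for t in range(n_steps):
        if t < 100:
            cov = np.eye(d)  # fixed proposal during a short warm-up
        else:
            cov = (2.38 ** 2 / d) * (np.cov(np.array(chain).T) + eps * np.eye(d))
        proposal = rng.multivariate_normal(x, np.atleast_2d(cov))
        if np.log(rng.uniform()) < log_pi(proposal) - log_pi(x):
            x = proposal
        chain.append(x.copy())
    return np.array(chain)

# Anisotropic Gaussian target with standard deviations (1, 2).
chain = adaptive_metropolis(lambda v: -0.5 * (v[0] ** 2 + v[1] ** 2 / 4.0),
                            np.zeros(2), 4000)
```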

SLIDE 9

Setting: MCMC for intractable non-linear targets

Alternative adaptive sampler: the Kameleon

Idea: fit a Gaussian in feature space, and take local steps in the directions of maximum principal components.

D. Sejdinovic, H. Strathmann, M. Lomeli, C. Andrieu, and A. Gretton, ICML 2014

SLIDE 10

Setting: MCMC for intractable non-linear targets

Hamiltonian Monte Carlo

HMC: distant moves, high acceptance probability. Potential energy U(x) = −log π(x), auxiliary momentum p ∼ exp(−K(p)); simulate for t ∈ ℝ along the Hamiltonian flow of H(p, x) = K(p) + U(x), using the operator

(∂K/∂p) (∂/∂x) − (∂U/∂x) (∂/∂p)

Numerical simulation (e.g. leapfrog) depends on gradient information.

[Figure: HMC trajectory over a sliced posterior in (θ2, θ7).]
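The leapfrog integrator mentioned above, in a minimal sketch assuming kinetic energy K(p) = ‖p‖²/2; the quadratic potential used to check energy conservation is a toy choice:

```python
import numpy as np

def leapfrog(grad_U, x, p, step_size, n_steps):
    """Leapfrog integration of Hamiltonian dynamics with K(p) = ||p||^2 / 2,
    so dx/dt = p and dp/dt = -grad U(x)."""
    x, p = np.array(x, dtype=float), np.array(p, dtype=float)
    p = p - 0.5 * step_size * grad_U(x)        # initial half step in momentum
    for _ in range(n_steps - 1):
        x = x + step_size * p                  # full step in position
        p = p - step_size * grad_U(x)          # full step in momentum
    x = x + step_size * p
    p = p - 0.5 * step_size * grad_U(x)        # final half step in momentum
    return x, p

# Toy check: for U(x) = x^2 / 2 the flow is a rotation in (x, p), and the
# total energy H = U + K is nearly conserved by leapfrog.
x0, p0 = np.array([1.0]), np.array([0.0])
x1, p1 = leapfrog(lambda x: x, x0, p0, step_size=0.1, n_steps=100)
H0 = 0.5 * float(x0 @ x0 + p0 @ p0)
H1 = 0.5 * float(x1 @ x1 + p1 @ p1)
```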

SLIDE 11

Setting: MCMC for intractable non-linear targets

Intractable & Non-linear Target in GPC

Sliced posterior over hyperparameters of a Gaussian Process classifier on the UCI Glass dataset, obtained using Pseudo-Marginal MCMC.

[Figure: sliced posterior over (θ2, θ7).]

Can you learn an HMC sampler?

SLIDE 12

Setting: MCMC for intractable non-linear targets

Outline for remainder of talk

[Figure: sliced posterior over (θ2, θ7).]

Infinite dimensional exponential family (Sriperumbudur et al. 2014): an exponential family with RKHS-valued natural parameter, learned via score matching, with no log-partition function.

Kernel Adaptive Hamiltonian Monte Carlo (Strathmann et al. 2015): a global estimate of the gradient of the log target density from previous samples. Mixing performance close to ideal "known density" HMC.

SLIDE 13

MCMC Kameleon

Infinite dimensional exponential family density estimator

Bharath Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Revant Kumar, and Aapo Hyvarinen, JMLR 2017, to appear

(slides adapted from Bharath's talk)

SLIDE 14

MCMC Kameleon

The Exponential Family of Distributions

Natural form: p_θ(x) = q₀(x) exp(θᵀT(x) − A(θ)), where

• θ ∈ Θ ⊂ ℝᵐ (natural parameter)
• q₀: probability density defined over Ω ⊂ ℝᵈ
• A(θ): log-partition function, A(θ) = log ∫ exp(θᵀT(x)) q₀(x) dx
• T(x): sufficient statistic

Includes many commonly used distributions: Normal, Binomial, Poisson, Exponential, ...

SLIDE 15

MCMC Kameleon

Infinite Dimensional Generalization

P = { p_f(x) = exp(f(x) − A(f)) q₀(x), x ∈ Ω : f ∈ F }, where

F = { f ∈ H : A(f) = log ∫ exp(f(x)) q₀(x) dx < ∞ }

(Canu and Smola, 2005; Fukumizu, 2009): H is a reproducing kernel Hilbert space (RKHS).

SLIDE 16

MCMC Kameleon

Reproducing kernel Hilbert space

Exponentiated quadratic kernel: k(x, x′) = exp(−‖x − x′‖²/(2σ²)) = Σ_{i=1}^∞ φ_i(x) φ_i(x′)

f(x) = Σ_{i=1}^∞ f_i φ_i(x), with Σ_{i=1}^∞ f_i² < ∞.

SLIDE 17

MCMC Kameleon

Reproducing kernel Hilbert space

Function with the exponentiated quadratic kernel:

f(x) := Σ_{i=1}^m α_i k(x_i, x) = Σ_{i=1}^m α_i ⟨φ(x_i), φ(x)⟩_H = ⟨ Σ_{i=1}^m α_i φ(x_i), φ(x) ⟩_H

[Figure: f(x) as a weighted sum of kernel bumps.]
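The expansion f(x) = Σ_i α_i k(x_i, x) above is easy to evaluate directly; the centres and coefficients below are toy values for illustration:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Exponentiated quadratic kernel k(x, y) = exp(-(x - y)^2 / (2 sigma^2))
    for scalar inputs (vectorised over arrays)."""
    return np.exp(-np.subtract.outer(x, y) ** 2 / (2 * sigma ** 2))

# An RKHS function f(x) = sum_i alpha_i k(x_i, x), represented by its
# centres x_i and coefficients alpha_i (toy values).
centres = np.array([-2.0, 0.0, 3.0])
alphas = np.array([0.5, -1.0, 0.8])

def f(x):
    return gaussian_kernel(np.atleast_1d(x), centres) @ alphas
```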

SLIDE 18

MCMC Kameleon

Reproducing kernel Hilbert space

Function with the exponentiated quadratic kernel:

f(x) := Σ_{i=1}^m α_i k(x_i, x) = Σ_{i=1}^m α_i ⟨φ(x_i), φ(x)⟩_H = ⟨ Σ_{i=1}^m α_i φ(x_i), φ(x) ⟩_H = Σ_{ℓ=1}^∞ f_ℓ φ_ℓ(x) = ⟨f(·), φ(x)⟩_H

where f_ℓ := Σ_{i=1}^m α_i φ_ℓ(x_i).

[Figure: f(x) as a weighted sum of kernel bumps.]

Possible to write functions of infinitely many features!

SLIDE 19

MCMC Kameleon

RKHS-Based Exponential Family

H is an RKHS:

P = { p_f(x) = exp(⟨f, φ(x)⟩_H − A(f)) q₀(x), x ∈ Ω, f ∈ F }, where

F = { f ∈ H : A(f) = log ∫ exp(f(x)) q₀(x) dx < ∞ }.

Finite dimensional RKHS: one-to-one correspondence between the finite dimensional exponential family and the RKHS. The sufficient statistic T(x) induces the kernel k(x, y) = ⟨T(x), T(y)⟩; conversely, a kernel k(x, y) = ⟨Φ(x), Φ(y)⟩ supplies the sufficient statistic Φ(x).

SLIDE 20

MCMC Kameleon

Examples

• Exponential: Ω = ℝ₊₊, k(x, y) = xy.
• Normal: Ω = ℝ, k(x, y) = xy + x²y².
• Beta: Ω = (0, 1), k(x, y) = log x log y + log(1 − x) log(1 − y).
• Gamma: Ω = ℝ₊₊, k(x, y) = log x log y + xy.
• Inverse Gaussian: Ω = ℝ₊₊, k(x, y) = xy + 1/(xy).
• Poisson: Ω = ℕ ∪ {0}, k(x, y) = xy, q₀(x) = (x! e)⁻¹.
• Geometric: Ω = ℕ ∪ {0}, k(x, y) = xy, q₀(x) = 1.
• Binomial: Ω = {0, ..., m}, k(x, y) = xy, q₀(x) = 2⁻ᵐ (m choose x).

SLIDE 21

MCMC Kameleon

Problem: Given random samples X₁, ..., X_n drawn i.i.d. from an unknown density p₀ := p_{f₀} ∈ P, estimate p₀.

SLIDE 22

MCMC Kameleon

Maximum Likelihood Estimation

f_ML = argmax_{f ∈ F} Σ_{i=1}^n log p_f(X_i) = argmax_{f ∈ F} [ Σ_{i=1}^n f(X_i) − n log ∫ exp(f(x)) q₀(x) dx ].

SLIDE 23

MCMC Kameleon

Maximum Likelihood Estimation

f_ML = argmax_{f ∈ F} Σ_{i=1}^n log p_f(X_i) = argmax_{f ∈ F} [ Σ_{i=1}^n f(X_i) − n log ∫ exp(f(x)) q₀(x) dx ].

Solving the above yields that f_ML satisfies

(1/n) Σ_{i=1}^n φ(X_i) = ∫ φ(x) p_{f_ML}(x) dx

Can we solve this?

SLIDE 24

MCMC Kameleon

Solving max. likelihood equations

Finite dimensional case: Normal distribution N(µ, σ²), with φ(x) = (x, x²)ᵀ. The maximum likelihood equations give

(1/n) Σ_{i=1}^n (x_i, x_i²)ᵀ = ∫ (x, x²)ᵀ p_{f_ML}(x) dx = (µ_ML, σ²_ML + µ²_ML)ᵀ

System of likelihood equations: solvable.
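A quick numerical check of the moment-matching form of the Normal MLE above; the true (µ, σ) values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 2.0, 1.5
x = rng.normal(mu, sigma, size=200_000)

# Moment matching: the sample moments of phi(x) = (x, x^2) equal the model
# moments (mu_ML, sigma_ML^2 + mu_ML^2), which we can invert directly.
mu_ml = x.mean()
sigma2_ml = (x ** 2).mean() - mu_ml ** 2
```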

SLIDE 25

MCMC Kameleon

Solving max. likelihood equations

Finite dimensional case: Normal distribution N(µ, σ²), with φ(x) = (x, x²)ᵀ. The maximum likelihood equations give

(1/n) Σ_{i=1}^n (x_i, x_i²)ᵀ = ∫ (x, x²)ᵀ p_{f_ML}(x) dx = (µ_ML, σ²_ML + µ²_ML)ᵀ

System of likelihood equations: solvable.

Infinite dimensional case, characteristic kernel: ill-posed! (Fukumizu, 2009): a method of sieves, involving a pseudo-MLE that restricts P to a series of finite dimensional submanifolds, which enlarge as the sample size increases.

SLIDE 26

MCMC Kameleon

Score matching (general version)

Score matching (Hyvärinen, 2005), since MLE can be intractable even in finite dimensions when A(θ) is not easily computable. Assuming p_f is differentiable (w.r.t. x) and ∫ p₀(x) ‖∇_x log p_f(x)‖² dx < ∞ for all f ∈ F:

D_F(p₀‖p_f) := (1/2) ∫ p₀(x) ‖∇_x log p₀(x) − ∇_x log p_f(x)‖² dx

SLIDE 27

MCMC Kameleon

Score matching: 1-D proof

D_F(p₀, p_f) = (1/2) ∫_a^b p₀(x) ( d log p₀(x)/dx − d log p_f(x)/dx )² dx

SLIDE 28

MCMC Kameleon

Score matching: 1-D proof

D_F(p₀, p_f) = (1/2) ∫_a^b p₀(x) ( d log p₀(x)/dx − d log p_f(x)/dx )² dx

= (1/2) ∫_a^b p₀(x) (d log p₀(x)/dx)² dx + (1/2) ∫_a^b p₀(x) (d log p_f(x)/dx)² dx − ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

SLIDE 29

MCMC Kameleon

Score matching: 1-D proof

D_F(p₀, p_f) = (1/2) ∫_a^b p₀(x) ( d log p₀(x)/dx − d log p_f(x)/dx )² dx

= (1/2) ∫_a^b p₀(x) (d log p₀(x)/dx)² dx + (1/2) ∫_a^b p₀(x) (d log p_f(x)/dx)² dx − ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

Final term: ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

SLIDE 30

MCMC Kameleon

Score matching: 1-D proof

D_F(p₀, p_f) = (1/2) ∫_a^b p₀(x) ( d log p₀(x)/dx − d log p_f(x)/dx )² dx

= (1/2) ∫_a^b p₀(x) (d log p₀(x)/dx)² dx + (1/2) ∫_a^b p₀(x) (d log p_f(x)/dx)² dx − ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

Final term: ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

= ∫_a^b p₀(x) (d log p_f(x)/dx) (1/p₀(x)) (dp₀(x)/dx) dx, where the p₀(x) factors cancel

SLIDE 31

MCMC Kameleon

Score matching: 1-D proof

D_F(p₀, p_f) = (1/2) ∫_a^b p₀(x) ( d log p₀(x)/dx − d log p_f(x)/dx )² dx

= (1/2) ∫_a^b p₀(x) (d log p₀(x)/dx)² dx + (1/2) ∫_a^b p₀(x) (d log p_f(x)/dx)² dx − ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

Final term: ∫_a^b p₀(x) (d log p_f(x)/dx)(d log p₀(x)/dx) dx

= ∫_a^b p₀(x) (d log p_f(x)/dx) (1/p₀(x)) (dp₀(x)/dx) dx, where the p₀(x) factors cancel

= [ (d log p_f(x)/dx) p₀(x) ]_a^b − ∫_a^b p₀(x) (d² log p_f(x)/dx²) dx, by integration by parts.

SLIDE 32

MCMC Kameleon

Score matching (general version)

Score matching (Hyvärinen, 2005), since MLE can be intractable even in finite dimensions when A(θ) is not easily computable. Assuming p_f is differentiable (w.r.t. x) and ∫ p₀(x) ‖∇_x log p_f(x)‖² dx < ∞ for all f ∈ F:

D_F(p₀‖p_f) := (1/2) ∫ p₀(x) ‖∇_x log p₀(x) − ∇_x log p_f(x)‖² dx

(a)
= ∫ p₀(x) Σ_{i=1}^d [ (1/2) (∂ log p_f(x)/∂x_i)² + ∂² log p_f(x)/∂x_i² ] dx + (1/2) ∫ p₀(x) ‖∂ log p₀(x)/∂x‖² dx,

where partial integration is used in (a), under the condition that p₀(x) ∂ log p_f(x)/∂x_i → 0 as x_i → ±∞, for all i = 1, ..., d.

SLIDE 33

MCMC Kameleon

Empirical Estimator

p_n represents n i.i.d. samples from p₀.

SLIDE 34

MCMC Kameleon

Empirical Estimator

p_n represents n i.i.d. samples from p₀.

D_F(p_n‖p_f) := (1/n) Σ_{a=1}^n Σ_{i=1}^d [ (1/2) (∂ log p_f(X_a)/∂x_i)² + ∂² log p_f(X_a)/∂x_i² ] + C

Since D_F(p_n, p_f) is independent of A(f),

f*_n = argmin_{f ∈ F} D_F(p_n, p_f)

should be easily computable, unlike the MLE.

SLIDE 35

MCMC Kameleon

Empirical Estimator

p_n represents n i.i.d. samples from p₀.

D_F(p_n‖p_f) := (1/n) Σ_{a=1}^n Σ_{i=1}^d [ (1/2) (∂ log p_f(X_a)/∂x_i)² + ∂² log p_f(X_a)/∂x_i² ] + C

Since D_F(p_n, p_f) is independent of A(f),

f*_n = argmin_{f ∈ F} D_F(p_n, p_f)

should be easily computable, unlike the MLE.

Add an extra term λ‖f‖²_H to regularize.
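To make the empirical score-matching objective concrete, here is a hypothetical finite-dimensional version (a quadratic log model, i.e. a Gaussian fit), not the RKHS estimator itself; setting the objective's gradient to zero gives a small linear system:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 1.0, 0.5
x = rng.normal(mu, sigma, size=100_000)

# Hypothetical model: log p_f(x) = theta1 * x + theta2 * x^2 + const, so
#   d/dx   log p_f = theta1 + 2 * theta2 * x
#   d2/dx2 log p_f = 2 * theta2.
# The empirical score-matching objective is
#   J = mean[ 0.5 * (theta1 + 2 theta2 x)^2 + 2 theta2 ],
# and setting dJ/dtheta = 0 gives a 2x2 linear system in (theta1, theta2).
m1, m2 = x.mean(), (x ** 2).mean()
A = np.array([[1.0, 2 * m1],
              [m1, 2 * m2]])
b = np.array([0.0, -1.0])
theta1, theta2 = np.linalg.solve(A, b)

# For a Gaussian, theta1 = mu / sigma^2 = 4 and theta2 = -1 / (2 sigma^2) = -2.
```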

SLIDE 36

MCMC Kameleon

How do we get a computable solution?

p_f(x) = exp(⟨f, φ(x)⟩_H − A(f)) q₀(x). Thus

∂/∂x log p_f(x) = ∂/∂x ⟨f, φ(x)⟩_H + ∂/∂x log q₀(x).

SLIDE 37

MCMC Kameleon

How do we get a computable solution?

p_f(x) = exp(⟨f, φ(x)⟩_H − A(f)) q₀(x). Thus

∂/∂x log p_f(x) = ∂/∂x ⟨f, φ(x)⟩_H + ∂/∂x log q₀(x).

Kernel trick for derivatives: ∂f(X)/∂x_i = ⟨f, ∂φ(X)/∂x_i⟩_H

SLIDE 38

MCMC Kameleon

How do we get a computable solution?

p_f(x) = exp(⟨f, φ(x)⟩_H − A(f)) q₀(x). Thus

∂/∂x log p_f(x) = ∂/∂x ⟨f, φ(x)⟩_H + ∂/∂x log q₀(x).

Kernel trick for derivatives: ∂f(X)/∂x_i = ⟨f, ∂φ(X)/∂x_i⟩_H

Dot product between feature derivatives: ⟨∂φ(X)/∂x_i, ∂φ(X′)/∂x_j⟩_H = ∂²k(X, X′)/∂x_i ∂x_{d+j}
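The mixed-derivative identity above is easy to verify numerically for the exponentiated quadratic kernel in one dimension; the analytic formula below is standard for this kernel:

```python
import numpy as np

sigma = 1.0

def k(x, y):
    """Exponentiated quadratic kernel, scalar inputs."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def d2k_dxdy(x, y):
    """Analytic mixed derivative d^2 k / (dx dy), i.e. the inner product
    <d phi(x)/dx, d phi(y)/dy>_H from the derivative kernel trick."""
    return k(x, y) * (1 / sigma ** 2 - (x - y) ** 2 / sigma ** 4)

# Numerical check against a central finite difference in both arguments.
x0, y0, h = 0.3, -0.8, 1e-4
fd = (k(x0 + h, y0 + h) - k(x0 + h, y0 - h)
      - k(x0 - h, y0 + h) + k(x0 - h, y0 - h)) / (4 * h ** 2)
```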

SLIDE 39

MCMC Kameleon

How do we get a computable solution?

p_f(x) = exp(⟨f, φ(x)⟩_H − A(f)) q₀(x). Thus

∂/∂x log p_f(x) = ∂/∂x ⟨f, φ(x)⟩_H + ∂/∂x log q₀(x).

Kernel trick for derivatives: ∂f(X)/∂x_i = ⟨f, ∂φ(X)/∂x_i⟩_H

Dot product between feature derivatives: ⟨∂φ(X)/∂x_i, ∂φ(X′)/∂x_j⟩_H = ∂²k(X, X′)/∂x_i ∂x_{d+j}

By the representer theorem:

f*_n = α ξ̂ + Σ_{ℓ=1}^n Σ_{j=1}^d β_{ℓj} ∂φ(X_ℓ)/∂x_j

SLIDE 40

MCMC Kameleon

Consistency and Rates for f_{λ,n}

Suppose the existence assumptions hold.

(i) (Consistency) If f₀ ∈ R(C), then ‖f_{λ,n} − f₀‖_H → 0 as λ√n → ∞, λ → 0, and n → ∞.

(ii) (Rates of convergence) Suppose f₀ ∈ R(C^β) for some β > 0. Then ‖f_{λ,n} − f₀‖_H = O_{p₀}( n^{−min(1/4, β/(2(β+1)))} ) with λ = n^{−max(1/4, 1/(2(β+1)))} as n → ∞.

SLIDE 41

MCMC Kameleon

Kernel Adaptive Hamiltonian Monte Carlo (KMC)

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton, NIPS 2015

SLIDE 42

MCMC Kameleon

Hamiltonian Monte Carlo

HMC: distant moves, high acceptance probability. Potential energy U(x) = −log π(x), auxiliary momentum p ∼ exp(−K(p)); simulate for t ∈ ℝ along the Hamiltonian flow of H(p, x) = K(p) + U(x), using the operator

(∂K/∂p) (∂/∂x) − (∂U/∂x) (∂/∂p)

Numerical simulation (e.g. leapfrog) depends on gradient information.

[Figure: HMC trajectory over a sliced posterior in (θ2, θ7).]

SLIDE 43

MCMC Kameleon

Infinite dimensional exponential families

Proposal is the RKHS exponential family model [Fukumizu, 2009; Sriperumbudur et al. 2014], but accept using the correct MH ratio (to correct for both the model and the leapfrog discretisation):

const × π(x) ≈ exp( ⟨f, k(x, ·)⟩_H − A(f) )

Sufficient statistic: the feature map k(·, x) ∈ H, which satisfies f(x) = ⟨f, k(x, ·)⟩_H for any f ∈ H. Natural parameter: f ∈ H.

Estimation of unnormalised density models from samples via score matching [Sriperumbudur et al. 2014]. Expensive: the full solution requires solving a (td + 1)-dimensional linear system.

SLIDE 44

MCMC Kameleon

Approximate solution: KMC lite

f(x) = Σ_{i=1}^m α_i k(z_i, x), where z ⊆ x is a sub-sample, m < n. α comes from the linear system

α̂_λ = −(σ/2) (C + λI)⁻¹ b

where C ∈ ℝ^{m×m} and b ∈ ℝᵐ depend on the kernel matrix. Cost O(m³ + m²d) (or cheaper with low-rank approximations and conjugate gradient). Geometrically ergodic on log-concave targets (fast convergence).

[Figure: gradient norm, Gaussian target vs KMC lite estimate.]

SLIDE 45

MCMC Kameleon

Approximate solution: KMC finite

f(x) = θᵀφ_x, with Random Fourier Features: φ_xᵀ φ_y ≈ k(x, y).

θ ∈ ℝᵐ can be computed from θ̂_λ := (C + λI)⁻¹ b, where

b := −(1/t) Σ_{i=1}^t Σ_{ℓ=1}^d φ̈_{x_i}^ℓ

C := (1/t) Σ_{i=1}^t Σ_{ℓ=1}^d φ̇_{x_i}^ℓ (φ̇_{x_i}^ℓ)ᵀ

with φ̇_x^ℓ := ∂φ_x/∂x_ℓ and φ̈_x^ℓ := ∂²φ_x/∂x_ℓ².

On-line updates cost O(dm²). Updates are fast, and use the full Markov chain history. Caveat: needs correct initialisation.

[Figure: gradient norm, Gaussian target vs KMC finite estimate.]
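The Random Fourier Feature approximation above can be sketched as follows; the inputs and feature count are illustrative:

```python
import numpy as np

def rff_features(X, omega, bias):
    """Random Fourier features: phi_x[j] = sqrt(2/m) * cos(omega_j . x + b_j),
    so that phi_x . phi_y approximates the Gaussian kernel k(x, y)."""
    m = omega.shape[1]
    return np.sqrt(2.0 / m) * np.cos(X @ omega + bias)

rng = np.random.default_rng(3)
d, m, sigma = 2, 5000, 1.0
# For k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), draw omega ~ N(0, I / sigma^2).
omega = rng.normal(scale=1.0 / sigma, size=(d, m))
bias = rng.uniform(0.0, 2.0 * np.pi, size=m)

x = np.array([[0.2, -0.5]])
y = np.array([[1.0, 0.3]])
exact = float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2)))
approx = float(rff_features(x, omega, bias) @ rff_features(y, omega, bias).T)
```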

SLIDE 46

MCMC Kameleon

Does kernel HMC work in high dimensions?

Challenging Gaussian target (top): eigenvalues λ_i ∼ Exp(1); covariance diag(λ₁, ..., λ_d), randomly rotated. Use a Rational Quadratic kernel to account for the resulting highly 'non-singular' length-scales. KMC scales up to d ≈ 30.

An easy, isotropic Gaussian target (bottom): more smoothness allows KMC to scale up to d ≈ 100.

[Figure: heatmaps over n and d for the two targets.]

SLIDE 47

Experiments

Gaussian Process Classification on UCI data

Standard GPC model: p(f, y, θ) = p(θ) p(f|θ) p(y|f), where p(f|θ) is a GP and p(y|f) is a sigmoidal likelihood. Goal: sample from p(θ|y) ∝ p(θ) p(y|θ). Unbiased estimate p̂(y|θ) via importance sampling. No access to the likelihood or its gradient.

[Figure: sliced posterior over (θ2, θ7).]

SLIDE 48

Experiments

Gaussian Process Classification on UCI data

Standard GPC model: p(f, y, θ) = p(θ) p(f|θ) p(y|f), where p(f|θ) is a GP and p(y|f) is a sigmoidal likelihood. Goal: sample from p(θ|y) ∝ p(θ) p(y|θ). Unbiased estimate p̂(y|θ) via importance sampling. No access to the likelihood or its gradient.

[Figure: MMD from ground truth vs iteration, for KMC, KAMH, and RW.]

Significant mixing improvements over the state of the art.

SLIDE 49

Experiments

Conclusions

Kernel HMC: a simple, versatile, gradient-free adaptive MCMC sampler. The derivative of the log density is fit to samples and used as the proposal in HMC. Outperforms existing adaptive approaches on nonlinear target distributions.

Future work: how does the convergence rate degrade with increasing dimension?

Kernel HMC code: https://github.com/karlnapf/kernel_hmc

SLIDE 50

Bayesian Gaussian Process Classification

Our case: the target π(·) and its log gradient are not computable: Pseudo-Marginal MCMC.

SLIDE 51

Bayesian Gaussian Process Classification

Our case: the target π(·) and its log gradient are not computable: Pseudo-Marginal MCMC.

Example: when is the target not computable? GPC model: latent process f, labels y (with covariate matrix X), and hyperparameters θ:

p(f, y, θ) = p(θ) p(f|θ) p(y|f)

f|θ ∼ N(0, K_θ), a GP with covariance K_θ

SLIDE 52

Bayesian Gaussian Process Classification

Our case: the target π(·) and its log gradient are not computable: Pseudo-Marginal MCMC.

Example: when is the target not computable? GPC model: latent process f, labels y (with covariate matrix X), and hyperparameters θ:

p(f, y, θ) = p(θ) p(f|θ) p(y|f)

f|θ ∼ N(0, K_θ), a GP with covariance K_θ

Automatic Relevance Determination (ARD) covariance:

(K_θ)_{ij} = κ(x_i, x_j|θ) = exp( −(1/2) Σ_{s=1}^d (x_{i,s} − x_{j,s})² / exp(θ_s) )

SLIDE 53

Bayesian Gaussian Process Classification

Our case: the target π(·) and its log gradient are not computable: Pseudo-Marginal MCMC.

Example: when is the target not computable? GPC model: latent process f, labels y (with covariate matrix X), and hyperparameters θ:

p(f, y, θ) = p(θ) p(f|θ) p(y|f)

f|θ ∼ N(0, K_θ), a GP with covariance K_θ

Automatic Relevance Determination (ARD) covariance:

(K_θ)_{ij} = κ(x_i, x_j|θ) = exp( −(1/2) Σ_{s=1}^d (x_{i,s} − x_{j,s})² / exp(θ_s) )

Classification: p(y|f) = Π_{i=1}^n p(y_i|f(x_i)), where p(y_i|f(x_i)) = (1 + exp(−y_i f(x_i)))⁻¹, y_i ∈ {−1, 1}.
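The ARD covariance above can be sketched directly; the function below is a minimal assumed implementation, vectorised over two sets of points:

```python
import numpy as np

def ard_kernel(X, Y, theta):
    """Sketch of the ARD covariance: dimension s has length-scale exp(theta_s).
    X is (n, d), Y is (m, d); returns the (n, m) covariance matrix."""
    diff = X[:, None, :] - Y[None, :, :]            # pairwise differences
    return np.exp(-0.5 * np.sum(diff ** 2 / np.exp(theta), axis=-1))

X = np.array([[0.0, 1.0],
              [2.0, -1.0]])
K = ard_kernel(X, X, np.zeros(2))  # theta = 0 gives unit length-scales
```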

SLIDE 54

Pseudo-Marginal MCMC

Example: when is the target not computable? Gaussian process classification, with latent process f:

p(θ|y) ∝ p(θ) p(y|θ) = p(θ) ∫ p(f|θ) p(y|f, θ) df =: π(θ)

... but we cannot integrate out f.

SLIDE 55

Pseudo-Marginal MCMC

Example: when is the target not computable? Gaussian process classification, with latent process f:

p(θ|y) ∝ p(θ) p(y|θ) = p(θ) ∫ p(f|θ) p(y|f, θ) df =: π(θ)

... but we cannot integrate out f.

MH ratio:

α(θ, θ′) = min( 1, [p(θ′) p(y|θ′) q(θ|θ′)] / [p(θ) p(y|θ) q(θ′|θ)] )

SLIDE 56

Pseudo-Marginal MCMC

Example: when is the target not computable? Gaussian process classification, with latent process f:

p(θ|y) ∝ p(θ) p(y|θ) = p(θ) ∫ p(f|θ) p(y|f, θ) df =: π(θ)

... but we cannot integrate out f.

MH ratio:

α(θ, θ′) = min( 1, [p(θ′) p(y|θ′) q(θ|θ′)] / [p(θ) p(y|θ) q(θ′|θ)] )

Filippone & Girolami, 2013 use Pseudo-Marginal MCMC: an unbiased estimate of p(y|θ) via importance sampling:

p̂(θ|y) ∝ p(θ) p̂(y|θ) ≈ p(θ) (1/n_imp) Σ_{i=1}^{n_imp} p(y|f⁽ⁱ⁾) p(f⁽ⁱ⁾|θ) / Q(f⁽ⁱ⁾)
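The importance-sampling estimate above can be illustrated in a toy model where the marginal is known in closed form; all the values below are assumed for illustration, not the actual GPC model:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy unbiased importance-sampling estimate of a marginal likelihood
# p(y) = integral of p(y|f) p(f) df, with all quantities Gaussian so the
# truth is available in closed form.
y, sigma_lik, sigma_prior = 0.7, 0.5, 1.0

def p_y_given_f(f):
    return np.exp(-0.5 * (y - f) ** 2 / sigma_lik ** 2) / (sigma_lik * np.sqrt(2 * np.pi))

# Proposal Q = prior p(f), so the weights p(y|f) p(f) / Q(f) reduce to p(y|f).
f_samples = rng.normal(0.0, sigma_prior, size=500_000)
p_hat = float(np.mean(p_y_given_f(f_samples)))

# Closed form: y ~ N(0, sigma_prior^2 + sigma_lik^2).
var = sigma_prior ** 2 + sigma_lik ** 2
p_true = float(np.exp(-0.5 * y ** 2 / var) / np.sqrt(2 * np.pi * var))
```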

SLIDE 57

Pseudo-Marginal MCMC

Example: when is the target not computable? Gaussian process classification, with latent process f:

p(θ|y) ∝ p(θ) p(y|θ) = p(θ) ∫ p(f|θ) p(y|f, θ) df =: π(θ)

... but we cannot integrate out f.

Estimated MH ratio:

α(θ, θ′) = min( 1, [p(θ′) p̂(y|θ′) q(θ|θ′)] / [p(θ) p̂(y|θ) q(θ′|θ)] )

SLIDE 58

Pseudo-Marginal MCMC

Example: when is the target not computable? Gaussian process classification, with latent process f:

p(θ|y) ∝ p(θ) p(y|θ) = p(θ) ∫ p(f|θ) p(y|f, θ) df =: π(θ)

... but we cannot integrate out f.

Estimated MH ratio:

α(θ, θ′) = min( 1, [p(θ′) p̂(y|θ′) q(θ|θ′)] / [p(θ) p̂(y|θ) q(θ′|θ)] )

Replacing the marginal likelihood p(y|θ) with an unbiased estimate p̂(y|θ) still results in the correct invariant distribution [Beaumont, 2003; Andrieu & Roberts, 2009].

SLIDE 59

How rich is our density family?

Let q₀ ∈ C₀(Ω) be a probability density such that q₀(x) > 0 for all x ∈ Ω ⊂ ℝᵈ. Define

P_c := { p ∈ C₀(Ω) : ∫ p(x) dx = 1, p(x) ≥ 0 for all x ∈ Ω, and ‖p/q₀‖_∞ < ∞ }.

SLIDE 60

How rich is our density family?

Let q₀ ∈ C₀(Ω) be a probability density such that q₀(x) > 0 for all x ∈ Ω ⊂ ℝᵈ. Define

P_c := { p ∈ C₀(Ω) : ∫ p(x) dx = 1, p(x) ≥ 0 for all x ∈ Ω, and ‖p/q₀‖_∞ < ∞ }.

Suppose k satisfies

(∗) k(·, x) ∈ C₀(Ω) for all x ∈ Ω ⊂ ℝᵈ;
(∗∗) ∫∫ k(x, y) dµ(x) dµ(y) > 0 for all µ ∈ M_b(Ω)\{0}.

Then P is dense in P_c w.r.t. the KL divergence, total variation and Hellinger distances.