Well-known shortcomings, advantages and computational challenges in Bayesian modelling: a few case stories


SLIDE 1

Well-known shortcomings, advantages and computational challenges in Bayesian modelling: a few case stories

Ole Winther

The Bioinformatics Center, University of Copenhagen & Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark

  • ole.winther@gmail.com
  • wi@imm.dtu.dk

1

SLIDE 2

Overview

  1. How many species? Predicting sequence tags.
     • Non-parametric Bayes
     • Averaging beats maximum likelihood
     • The model is always wrong (and Bayes can’t tell)
  2. Computing the marginal likelihood with MCMC
     • Motivation: computing corrections to EP/C
     • The trouble with Gibbs sampling
     • Gaussian process classification

2

SLIDE 3

How many species?

3

SLIDE 4

DNA sequence tags - CAGE (cap analysis of gene expression)

4

SLIDE 5

Look at the data - cerebellum

[Figure: number of species m(n) versus sample size n for the cerebellum library, full range and zoomed view.]

5

SLIDE 6

Look at the data - embryo

[Figure: number of species m(n) versus sample size n for the embryo library, full range and zoomed view.]

6

SLIDE 7

Chinese restaurant process - Pitman-Yor sampling formula

Observing a new species, given counts n = (n_1, . . . , n_k) in k bins with Σ_{i=1}^{k} n_i = n:

p(n_1, . . . , n_k, 1 | n, σ, θ) = (θ + kσ) / (n + θ)

Re-observing species j:

p(n_1, . . . , n_{j−1}, n_j + 1, n_{j+1}, . . . , n_k | n, σ, θ) = (n_j − σ) / (n + θ)

Exchangeability – the probability is invariant to re-ordering. For E, E, M, T, T:

p_1 = (θ/θ) · (1−σ)/(1+θ) · (θ+σ)/(2+θ) · (θ+2σ)/(3+θ) · (1−σ)/(4+θ)

and for M, T, E, T, E:

p_2 = (θ/θ) · (θ+σ)/(1+θ) · (θ+2σ)/(2+θ) · (1−σ)/(3+θ) · (1−σ)/(4+θ) = . . . = p_1
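The predictive rule above is easy to simulate. Below is a minimal Python sketch of sequential sampling from the Pitman-Yor predictive distribution (Chinese restaurant process with discount σ and concentration θ); the function name and interface are illustrative, not from the slides.

```python
import numpy as np

def pitman_yor_extend(counts, m, sigma, theta, rng=None):
    """Extend observed species counts by m draws from the Pitman-Yor
    predictive rule: a new species with prob. (theta + k*sigma)/(n + theta),
    species j again with prob. (n_j - sigma)/(n + theta)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = list(counts)
    n = sum(counts)
    for _ in range(m):
        k = len(counts)
        probs = [(nj - sigma) / (n + theta) for nj in counts]
        probs.append((theta + k * sigma) / (n + theta))   # the "new species" bin
        j = rng.choice(k + 1, p=probs)
        if j == k:
            counts.append(1)        # observe a new species
        else:
            counts[j] += 1          # re-observe species j
        n += 1
    return counts
```

Running this repeatedly from the observed counts gives Monte Carlo draws of how many new species a further m tags would reveal.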

7

SLIDE 8

Inference and prediction

Likelihood function, e.g. for E, E, M, T, T:

p(n | σ, θ) = (θ/θ) · (1−σ)/(1+θ) · (θ+σ)/(2+θ) · (θ+2σ)/(3+θ) · (1−σ)/(4+θ)

and in general

p(n | σ, θ) = [ ∏_{j=1}^{k−1} (θ + jσ) ] [ ∏_{i′=1}^{k} ∏_{j′=1}^{n_{i′}−1} (j′ − σ) ] / ∏_{i=1}^{n−1} (i + θ)

Flat prior distribution for σ ∈ [0, 1]; θ is a pseudo-count parameter. Prediction for a new count m:

p(m | n) = ∫ p(m | n, σ, θ) p(σ, θ | n) dσ dθ

computed with Gibbs sampling over (σ, θ) and Pitman-Yor sampling for m.
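For reference, a minimal Python sketch of the log of the likelihood above for one observed sequence of counts; it implements the three products directly, and the function name is illustrative.

```python
import numpy as np

def pitman_yor_loglik(counts, sigma, theta):
    """Log-likelihood log p(n | sigma, theta) of one observed sequence under
    the Pitman-Yor sampling formula (counts n_1..n_k, with sum n)."""
    counts = np.asarray(counts)
    n, k = counts.sum(), len(counts)
    # log prod_{j=1}^{k-1} (theta + j*sigma): one factor per new species after the first
    new_species = np.log(theta + sigma * np.arange(1, k)).sum()
    # log prod_i prod_{j=1}^{n_i - 1} (j - sigma): one factor per re-observation
    reobserve = sum(np.log(np.arange(1, ni) - sigma).sum() for ni in counts)
    # log prod_{i=1}^{n-1} (i + theta): normalisation over the whole sequence
    norm = np.log(np.arange(1, n) + theta).sum()
    return new_species + reobserve - norm
```

With a flat prior on σ ∈ [0, 1] and a prior on θ, this is the only model-specific quantity a Gibbs or grid sampler over (σ, θ) needs.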

8

SLIDE 9

Averaging versus max. likelihood

[Figure: contours of log L − log L_ML in the (σ, θ) plane for the cerebellum and embryo libraries.]

9

SLIDE 10

Notice anything funny?

[Figure: number of species k versus sample size n for the cerebellum library, full range and zoomed view.]

10

SLIDE 11

Notice anything funny? Example 2

[Figure: number of species k versus sample size n for the embryo library, full range and zoomed view.]

11

SLIDE 12

(Well-known) take home messages

  • Parameter averaging works!
  • The model is always wrong! (as revealed with sufficient data)
  • Only one model considered!
  • What happened to being Bayesian about model selection?

12

SLIDE 13

Calculating the marginal likelihood

The marginal likelihood:

p(D | H) = ∫ p(D | f, H) p(f | H) df

Approximate inference is needed!

Monte Carlo: slow mixing, non-trivial to get marginal likelihood estimates.

Expectation propagation+, variational Bayes, loopy BP+: sometimes not precise, approximation errors not controllable, not applicable to all models.

13

SLIDE 14

Motivation: validating EP corrections

Kuss and Rasmussen (JMLR 2006): N = 767 3-vs-5 USPS digit GP classification with the squared-exponential kernel

K(x, x′) = σ² exp( −||x − x′||² / (2ℓ²) )

Panels – I: log R = log Z_EPc − log Z_EP; II+III: log Z_MCMC − log Z_EP.

[Figure: contour plots of log Z_EPc − log Z_EP and log Z_MCMC − log Z_EP over log lengthscale ln(ℓ) and log magnitude ln(σ).]

Thanks to Malte Kuss for making III available.
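For completeness, a minimal sketch of the squared-exponential covariance used above (vectorised NumPy; the function name is illustrative).

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma, ell):
    """Squared-exponential kernel K(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 ell^2)).
    X1, X2 are arrays of shape (n1, d) and (n2, d)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    return sigma ** 2 * np.exp(-d2 / (2.0 * ell ** 2))
```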

14

SLIDE 15

Marginal likelihood from importance sampling

Importance sampling:

p(D | H) = ∫ [ p(D | f, H) p(f | H) / q(f) ] q(f) df

Draw samples f_1, . . . , f_R from q(f) and set

p(D | H) ≈ (1/R) Σ_{r=1}^{R} p(D | f_r, H) p(f_r | H) / q(f_r)

This will usually not work because the ratio varies too much.
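A minimal Python sketch of this estimator, accumulating the importance weights in log space; all arguments are placeholders for the model at hand.

```python
import numpy as np

def log_marginal_is(log_lik, log_prior, log_q, sample_q, R=10_000, rng=None):
    """Importance-sampling estimate of log p(D|H) with proposal q:
    p(D|H) = E_q[ p(D|f) p(f) / q(f) ], averaged in log space for stability."""
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.empty(R)
    for r in range(R):
        f = sample_q(rng)                                  # f_r ~ q(f)
        log_w[r] = log_lik(f) + log_prior(f) - log_q(f)    # log importance weight
    return np.logaddexp.reduce(log_w) - np.log(R)
```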

15

SLIDE 16

Marginal likelihood from thermodynamic integration

Variants: parallel tempering, simulated tempering and annealed importance sampling.

h(f) = p(D | f, H) p(f | H),   p(f | β) = (1/Z(β)) h^β(f) q^{1−β}(f)

log Z(β_2) − log Z(β_1) = ∫_{β_1}^{β_2} [d log Z(β)/dβ] dβ = ∫_{β_1}^{β_2} ∫ log[ h(f)/q(f) ] p(f | β) df dβ

Run N_β chains and interpolate:

log Z(β_2) − log Z(β_1) ≈ (Δβ/R) Σ_{b=1}^{N_β} Σ_{r=1}^{R} log[ h(f_rb) / q(f_rb) ]

Other things that might work even better: multi-canonical sampling.
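A minimal Python sketch of the thermodynamic-integration estimate on a grid of β values, assuming a sampler for each tempered distribution p(f|β) is available; the trapezoidal quadrature and all names are illustrative choices.

```python
import numpy as np

def log_z_ratio_ti(betas, sample_p_beta, log_h, log_q, R=1_000, rng=None):
    """Thermodynamic-integration estimate of log Z(beta_max) - log Z(beta_min).

    betas         : increasing grid of inverse temperatures
    sample_p_beta : callable (beta, R, rng) -> R samples from p(f|beta) ∝ h(f)^beta q(f)^(1-beta)
    log_h, log_q  : log h(f) and log q(f) for a single sample f"""
    rng = np.random.default_rng() if rng is None else rng
    # E_beta[ log h(f) - log q(f) ] = d log Z(beta) / d beta, estimated per chain
    means = np.array([np.mean([log_h(f) - log_q(f) for f in sample_p_beta(beta, R, rng)])
                      for beta in betas])
    betas = np.asarray(betas, dtype=float)
    # trapezoidal quadrature of the derivative over the beta grid
    return float(np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(betas)))
```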

16

SLIDE 17

The trouble with Gibbs sampling

Cycle over the variables f_i, i = 1, . . . , N and sample from the conditionals

p(f_i | f_rest) = p(f) / p(f_rest) ∝ p(f)
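To make the sweep concrete, here is a minimal sketch of naive Gibbs sampling for N(f | 0, C) using the Gaussian conditionals; when C is strongly correlated these axis-aligned updates mix slowly, which is the trouble this slide refers to. Names are illustrative.

```python
import numpy as np

def gibbs_gaussian(C, n_sweeps=5000, rng=None):
    """Naive Gibbs sampler for N(f | 0, C): cycle over coordinates and draw
    each f_i from its Gaussian conditional given all other coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.linalg.inv(C)                     # precision matrix
    N = C.shape[0]
    f = np.zeros(N)
    samples = np.empty((n_sweeps, N))
    for s in range(n_sweeps):
        for i in range(N):
            others = [j for j in range(N) if j != i]
            var_i = 1.0 / P[i, i]            # conditional variance
            mean_i = -var_i * (P[i, others] @ f[others])
            f[i] = mean_i + np.sqrt(var_i) * rng.standard_normal()
        samples[s] = f
    return samples
```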

[Figure: Gibbs sampling trajectory in a two-dimensional Gaussian.]

17

SLIDE 18

A trivial cure for N(f|0, C)

Gibbs sample zi, i = 1, . . . , N with N(z|0, I)

f = Lz

with

C = LL^T
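A minimal sketch of the cure: the coordinates of z are independent, so a Gibbs sweep over z is trivial (they can even be drawn jointly), and f = Lz has exactly the covariance C. The function name is illustrative.

```python
import numpy as np

def sample_whitened(C, n_samples, rng=None):
    """Sample N(f | 0, C) by drawing independent z ~ N(0, I) (so a Gibbs sweep
    over z mixes perfectly) and mapping f = L z with C = L L^T."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(C)
    z = rng.standard_normal((n_samples, C.shape[0]))
    return z @ L.T
```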

[Figure: Gibbs sampling before and after the whitening transformation f = Lz.]

18

SLIDE 19

Gaussian process classification (GPC)

p(f | y, K, β) = (1/Z(β)) ∏_n φ(y_n f_n) exp( −(β/2) f^T K^{−1} f )

Noise-free formulation f_nf:

φ(y f) = ∫ θ(y f_nf) N(f_nf | f, I) df_nf

Joint distribution:

p(f, f_nf, y | K, β) = p(y | f_nf) p(f_nf | f) p(f | K, β)

Marginalize out f:

p(f_nf | y, K, β) ∝ ∏_n θ(y_n f_nf,n) · N(f_nf | 0, I + K/β)

Samples of f can be recovered from p(f | f_nf) (Gaussian).

An efficient sampler for truncated Gaussians is needed!
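As a small illustration of the resulting target, the sketch below builds the truncated-Gaussian set-up for GPC: f_nf ~ N(0, I + K/β) restricted to the orthant y_n f_nf,n ≥ 0. Absorbing the labels turns this into an f ≥ 0 constraint, which the constrained Gibbs sweep sketched after slide 23 can handle; function names are illustrative.

```python
import numpy as np

def gpc_truncated_gaussian(K, y, beta):
    """Noise-free GPC target: f_nf ~ N(0, C) with C = I + K/beta, truncated to
    the orthant y_n * f_nf_n >= 0.  Absorbing the labels, g = diag(y) f_nf has
    covariance diag(y) C diag(y) and the constraint becomes g >= 0, so a
    Cholesky factor L of that covariance feeds directly into a whitened,
    constrained Gibbs sampler for g = L z >= 0."""
    y = np.asarray(y, dtype=float)
    C = np.eye(len(y)) + K / beta
    C_signed = (y[:, None] * C) * y[None, :]   # diag(y) C diag(y)
    L = np.linalg.cholesky(C_signed)
    return C, L
```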

19

SLIDE 20

MCMC for GPC - related work

  • G. Rodriguez-Yam, R. Davis and L. Scharf: “Efficient Gibbs Sampling of Truncated Multivariate Normal with Application to Constrained Linear Regression” (preprint, 2004).
  • R. Neal, U. Paquet: sample the joint (f, f_nf) and use Adler’s over-relaxation on f.
  • P. Rujan, R. Herbrich: playing billiards in version space, the Bayes point machine.
  • M. Kuss and C. Rasmussen: hybrid Monte Carlo in z-space, f = Lz, and annealed importance sampling.
  • Y. Qi and T. Minka: Hessian-based Metropolis-Hastings with local Gaussian proposals.

20

SLIDE 21

Gibbs sampling - pos/neg covariance

Conditionals are truncated Gaussians, so sampling is easy!

[Figure: Gibbs sampling of truncated Gaussians with positive and negative covariance.]

21

SLIDE 22

Efficient Gibbs sampling I

Gibbs sample in whitened space: zi, i = 1, . . . , N with N(z|0, I)

f = Lz

with

C = LL^T

What happens to the constraints, say f ≥ 0?

f = Lz ≥ 0

This is just a linear transformation, so the feasible region in z-space is convex and each conditional is a (doubly) truncated Gaussian.

22

SLIDE 23

Determine limits of conditionals

The jth conditional must respect the constraints for i = 1, . . . , N:

L_ij z_j ≥ − Σ_{k≠j} L_ik z_k

Divide the rows into the sets

S+,j = {i | L_ij > 0},   S−,j = {i | L_ij < 0},   S0,j = {i | L_ij = 0}

so that

z_j,lower = max_{i∈S+,j} ( − Σ_{k≠j} L_ik z_k / L_ij ),   z_j,upper = min_{i∈S−,j} ( − Σ_{k≠j} L_ik z_k / L_ij )

and sample by inverse CDF:

z_j = φ^{−1}( φ(z_j,lower) + rand · [ φ(z_j,upper) − φ(z_j,lower) ] )

23
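A minimal Python sketch of one such sweep for the constraint f = Lz ≥ 0, following the limits and inverse-CDF step above (this is the form GPC takes after absorbing the labels, see the sketch after slide 19); names are illustrative, and for bounds far out in the Gaussian tails a more careful truncated-normal sampler would be needed.

```python
import numpy as np
from scipy.stats import norm

def constrained_gibbs_sweep(z, L, rng):
    """One Gibbs sweep in whitened space for f = L z subject to f >= 0.

    Row i of the constraint reads L[i, j] * z[j] >= -sum_{k != j} L[i, k] * z[k];
    rows with L[i, j] > 0 give lower bounds, rows with L[i, j] < 0 give upper
    bounds, and z[j] is drawn from the standard normal truncated to that
    interval by inverse CDF."""
    for j in range(len(z)):
        rest = -(L @ z - L[:, j] * z[j])        # -sum_{k != j} L[i, k] z[k], one value per row i
        pos, neg = L[:, j] > 0, L[:, j] < 0
        lower = np.max(rest[pos] / L[pos, j]) if pos.any() else -np.inf
        upper = np.min(rest[neg] / L[neg, j]) if neg.any() else np.inf
        # inverse-CDF draw from the (doubly) truncated standard normal conditional
        u = norm.cdf(lower) + rng.random() * (norm.cdf(upper) - norm.cdf(lower))
        z[j] = norm.ppf(u)
    return z
```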
SLIDE 24

Gibbs sampling positive covariance

[Figure: efficient Gibbs sampling for the positive-covariance case.]

24

SLIDE 25

Gibbs sampling negative covariance

[Figure: efficient Gibbs sampling for the negative-covariance case.]

25

SLIDE 26

Kuss+Rasmussen set-up

EP, EP+corrections and MCMC are all very precise!

[Figure: contour plots over log lengthscale ln(ℓ) and log magnitude ln(σ) of the marginal-likelihood differences in the Kuss and Rasmussen set-up.]

Details of the EP corrections will (hopefully) come to a conference near you soon!

26

SLIDE 27

Summary

  • Averaging works!
  • (X-)validation points to model mis-specification!
  • How to find better (noise) models?
  • Marginal likelihood from sampling
  • The trouble with Gibbs sampling and a cure
  • Is machine learning becoming (Bayesian) statistics with big data sets?

27

SLIDE 28

Acknowledgments

Bioinformatics: Albin Sandelin, Eivind Valen

Machine learning: Ulrich Paquet, Manfred Opper

Students (many shared):
  • Ricardo Henao (machine learning and bioinformatics)
  • Morten Hansen (communication)
  • Carsten Stahlhut (source location in EEG)
  • Ján Petur Petersen (machine learning for ship propulsion)
  • Man-Hung Tang (bioinformatics)
  • Eivind Valen (bioinformatics)
  • Troels Marstrand (bioinformatics)

28