  1. Well-known shortcomings, advantages and computational challenges in Bayesian modelling: a few case stories. Ole Winther, The Bioinformatics Center, University of Copenhagen; Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark. ole.winther@gmail.com, owi@imm.dtu.dk

  2. Overview
     1. How many species? Predicting sequence tags.
        • Non-parametric Bayes
        • Averaging beats maximum likelihood
        • The model is always wrong (and Bayes can't tell)
     2. Computing the marginal likelihood with MCMC
        • Motivation: computing corrections to EP/C
        • The trouble with Gibbs sampling
        • Gaussian process classification

  3. How many species?

  4. DNA sequence tags - CAGE

  5. Look at the data - cerebellum
     [Figure: m(n) versus n for the cerebellum library, shown over the full range of n and zoomed in to n ≤ 30.]

  6. Look at the data - embryo
     [Figure: m(n) versus n for the embryo library, shown over the full range of n and zoomed in to n ≤ 30.]

  7. Chinese restaurant process - Pitman-Yor sampling formula
     Observing a new species given counts $\mathbf{n} = (n_1, \ldots, n_k)$ in $k$ bins:
     $$p(n_1, \ldots, n_k, 1 \mid n, \sigma, \theta) = \frac{\theta + k\sigma}{n + \theta}, \qquad \text{with } \sum_{i=1}^{k} n_i = n$$
     Re-observing species $j$:
     $$p(n_1, \ldots, n_{j-1}, n_j + 1, n_{j+1}, \ldots, n_k \mid n, \sigma, \theta) = \frac{n_j - \sigma}{n + \theta}$$
     Exchangeability – invariant to re-ordering:
     $$\text{E, E, M, T, T}: \quad p_1 = \frac{\theta}{\theta}\,\frac{1-\sigma}{1+\theta}\,\frac{\theta+\sigma}{2+\theta}\,\frac{\theta+2\sigma}{3+\theta}\,\frac{1-\sigma}{4+\theta}$$
     $$\text{M, T, E, T, E}: \quad p_2 = \frac{\theta}{\theta}\,\frac{\theta+\sigma}{1+\theta}\,\frac{\theta+2\sigma}{2+\theta}\,\frac{1-\sigma}{3+\theta}\,\frac{1-\sigma}{4+\theta} = \ldots = p_1$$
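
The two predictive rules above can be turned directly into a sequential sampler. A minimal sketch, not code from the talk; the function name and defaults are mine:

```python
import numpy as np

def sample_pitman_yor_counts(n_draws, sigma=0.5, theta=1.0, rng=None):
    """Sequentially sample species counts from the Pitman-Yor predictive rule.

    At each step a new species appears with probability (theta + k*sigma)/(n + theta),
    otherwise species j is re-observed with probability (n_j - sigma)/(n + theta).
    Returns the list of counts n_1, ..., n_k."""
    rng = np.random.default_rng(rng)
    counts = []                              # n_1, ..., n_k
    for n in range(n_draws):
        k = len(counts)
        # probabilities for re-observing each existing species, then for a new one
        probs = np.array([nj - sigma for nj in counts] + [theta + k * sigma])
        probs /= (n + theta)                 # sums to exactly 1
        j = rng.choice(k + 1, p=probs)
        if j == k:
            counts.append(1)                 # new species
        else:
            counts[j] += 1                   # re-observe species j
    return counts

# e.g. counts = sample_pitman_yor_counts(1000, sigma=0.5, theta=10.0)
```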

  8. Inference and prediction
     Likelihood function, e.g. for E, E, M, T, T:
     $$p(\mathbf{n} \mid \sigma, \theta) = \frac{\theta}{\theta}\,\frac{1-\sigma}{1+\theta}\,\frac{\theta+\sigma}{2+\theta}\,\frac{\theta+2\sigma}{3+\theta}\,\frac{1-\sigma}{4+\theta} = \frac{1}{\prod_{i=1}^{n-1}(i+\theta)} \prod_{j=1}^{k-1}(\theta + j\sigma) \prod_{i'=1}^{k}\prod_{j'=1}^{n_{i'}-1}(j'-\sigma)$$
     Flat prior distribution for σ ∈ [0, 1]; θ is a pseudo-count parameter.
     Predictions for a new count m:
     $$p(m \mid \mathbf{n}) = \int p(m \mid \mathbf{n}, \sigma, \theta)\, p(\sigma, \theta \mid \mathbf{n})\, d\sigma\, d\theta$$
     with Gibbs sampling of (σ, θ) and Pitman-Yor sampling for m.
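
The product form of the likelihood lends itself to direct evaluation in log space. A minimal sketch, assuming the counts are given as a list of n_i values (the function name is mine); a prior over (σ, θ) could then be combined with this likelihood in the Gibbs sweep the slide mentions:

```python
import numpy as np

def pitman_yor_log_likelihood(counts, sigma, theta):
    """Log of p(n | sigma, theta) for observed species counts n_1, ..., n_k,
    i.e. the exchangeable partition probability on the slide."""
    counts = np.asarray(counts)
    n, k = counts.sum(), len(counts)
    log_p = 0.0
    log_p -= np.sum(np.log(np.arange(1, n) + theta))          # 1 / prod_{i=1}^{n-1} (i + theta)
    log_p += np.sum(np.log(theta + sigma * np.arange(1, k)))  # prod_{j=1}^{k-1} (theta + j*sigma)
    for n_i in counts:
        log_p += np.sum(np.log(np.arange(1, n_i) - sigma))    # prod_{j=1}^{n_i - 1} (j - sigma)
    return log_p
```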

  9. Averaging versus max. likelihood
     [Figure: contour plots of log L − log L_ML over (σ, θ) for the cerebellum and embryo data sets.]

  10. Notice anything funny?
      [Figure: cerebellum library, estimated number of species k versus total tag count n, extrapolated far beyond the observed sample; n = 186257 is marked on the axis.]

  11. Notice anything funny? Example 2
      [Figure: embryo library, estimated number of species k versus total tag count n; n = 344794 is marked on the axis.]

  12. (Well-known) take home messages
      • Parameter averaging works!
      • The model is always wrong! (as revealed with sufficient data)
      • Only one model considered!
      • What happened to being Bayesian about model selection?

  13. Calculating the marginal likelihood
      The marginal likelihood:
      $$p(\mathcal{D} \mid \mathcal{H}) = \int p(\mathcal{D} \mid \mathbf{f}, \mathcal{H})\, p(\mathbf{f} \mid \mathcal{H})\, d\mathbf{f}$$
      Approximate inference needed!
      Monte Carlo: slow mixing, and non-trivial to get marginal likelihood estimates.
      Expectation propagation+, variational Bayes, loopy BP+: sometimes not precise, approximation errors not controllable, not applicable to all models.

  14. Motivation: validating EP corrections
      Kuss-Rasmussen (JMLR 2006): N = 767, 3-vs-5 GP USPS digit classification with
      $$K(\mathbf{x}, \mathbf{x}') = \sigma^2 \exp\!\left(\frac{-\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2}\right)$$
      I: log R = log Z_EPc − log Z_EP, and II+III: log Z_MCMC − log Z_EP.
      [Figure: three contour panels of these quantities over log lengthscale ln(ℓ) and log magnitude ln(σ).]
      Thanks to Malte Kuss for making III available.
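
For reference, a minimal sketch of the squared-exponential covariance used in this experiment, vectorised over rows of the input matrices (the function name is mine):

```python
import numpy as np

def se_kernel(X1, X2, sigma=1.0, ell=1.0):
    """Squared-exponential covariance K(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 ell^2))."""
    # squared Euclidean distances between all pairs of rows of X1 and X2
    sq_dists = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return sigma**2 * np.exp(-sq_dists / (2 * ell**2))
```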

  15. Marginal likelihood from importance sampling
      Importance sampling:
      $$p(\mathcal{D} \mid \mathcal{H}) = \int \frac{p(\mathcal{D} \mid \mathbf{f}, \mathcal{H})\, p(\mathbf{f} \mid \mathcal{H})}{q(\mathbf{f})}\, q(\mathbf{f})\, d\mathbf{f}$$
      Draw samples $\mathbf{f}_1, \ldots, \mathbf{f}_R$ from $q(\mathbf{f})$ and set
      $$p(\mathcal{D} \mid \mathcal{H}) \approx \frac{1}{R} \sum_{r=1}^{R} \frac{p(\mathcal{D} \mid \mathbf{f}_r, \mathcal{H})\, p(\mathbf{f}_r \mid \mathcal{H})}{q(\mathbf{f}_r)}$$
      This will usually not work because the ratio varies too much.
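
As a concrete illustration, a minimal log-space sketch of this estimator, assuming user-supplied evaluators for log p(D|f,H) + log p(f|H) and log q(f) plus a sampler for q; all names are placeholders:

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_importance(log_joint, log_q, q_sampler, R=1000, rng=None):
    """Importance-sampling estimate of log p(D|H).

    log_joint(f) returns log p(D|f,H) + log p(f|H); log_q(f) returns log q(f);
    q_sampler(rng) returns one draw f ~ q."""
    rng = np.random.default_rng(rng)
    log_w = np.array([log_joint(f) - log_q(f) for f in (q_sampler(rng) for _ in range(R))])
    # log of (1/R) * sum_r exp(log_w_r), computed stably
    return logsumexp(log_w) - np.log(R)
```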

  16. Marginal likelihood from thermodynamic integration
      Variants: parallel tempering, simulated tempering and annealed importance sampling.
      $$h(\mathbf{f}) = p(\mathcal{D} \mid \mathbf{f}, \mathcal{H})\, p(\mathbf{f} \mid \mathcal{H}), \qquad p(\mathbf{f} \mid \beta) = \frac{1}{Z(\beta)}\, h^{\beta}(\mathbf{f})\, q^{1-\beta}(\mathbf{f})$$
      $$\log Z(\beta_2) - \log Z(\beta_1) = \int_{\beta_1}^{\beta_2} \frac{d \log Z(\beta)}{d\beta}\, d\beta = \int_{\beta_1}^{\beta_2} \int \log\!\frac{h(\mathbf{f})}{q(\mathbf{f})}\, p(\mathbf{f} \mid \beta)\, d\mathbf{f}\, d\beta$$
      Run $N_\beta$ chains and interpolate:
      $$\log Z(\beta_2) - \log Z(\beta_1) \approx \frac{\Delta\beta}{R} \sum_{b=1}^{N_\beta} \sum_{r=1}^{R} \log\frac{h(\mathbf{f}_{rb})}{q(\mathbf{f}_{rb})}$$
      Other things that might work even better: multi-canonical.
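
A minimal numerical sketch of the final estimate, assuming draws from each tempered distribution p(f | β_b) are already available from the chains; it uses the trapezoidal rule over β rather than the plain Δβ sum, and all names are mine:

```python
import numpy as np

def thermo_log_z_diff(samples_per_beta, log_h, log_q, betas):
    """Thermodynamic-integration estimate of log Z(beta_last) - log Z(beta_first).

    samples_per_beta[b] holds draws from p(f | beta_b); betas is the matching
    inverse-temperature grid."""
    means = []
    for samples in samples_per_beta:
        # Monte Carlo estimate of E[log h(f)/q(f)] under p(f | beta_b)
        means.append(np.mean([log_h(f) - log_q(f) for f in samples]))
    return np.trapz(means, betas)   # integrate the expectation over beta
```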

  17. The trouble with Gibbs sampling
      Cycle over variables $f_i$, $i = 1, \ldots, N$ and sample the conditionals
      $$p(f_i \mid \mathbf{f}_{\setminus i}) = \frac{p(\mathbf{f})}{p(\mathbf{f}_{\setminus i})} \propto p(\mathbf{f})$$
      [Figure: two-dimensional illustration of Gibbs sampling; both axes span −2.5 to 2.5.]
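
To see the problem concretely, here is a minimal sketch of coordinate-wise Gibbs sampling for a correlated zero-mean Gaussian N(f | 0, C), the setting of the next slide; with strong correlation the conditional variances are tiny and the axis-aligned moves explore the distribution very slowly. Function name and defaults are mine:

```python
import numpy as np

def gibbs_gaussian(C, n_steps=1000, rng=None):
    """Coordinate-wise Gibbs sampling for f ~ N(0, C), with C a 2x2 covariance.

    The conditional p(f_i | f_j) of a bivariate Gaussian is Gaussian with
    mean C_ij / C_jj * f_j and variance C_ii - C_ij^2 / C_jj."""
    rng = np.random.default_rng(rng)
    f = np.zeros(2)
    trace = []
    for _ in range(n_steps):
        for i in range(2):
            j = 1 - i
            cond_mean = C[i, j] / C[j, j] * f[j]
            cond_var = C[i, i] - C[i, j] ** 2 / C[j, j]
            f[i] = cond_mean + np.sqrt(cond_var) * rng.standard_normal()
        trace.append(f.copy())
    return np.array(trace)

# strong correlation -> tiny conditional variance -> slow exploration
# trace = gibbs_gaussian(np.array([[1.0, 0.99], [0.99, 1.0]]), n_steps=500)
```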

  18. A trivial cure for N(f | 0, C)
      Gibbs sample $z_i$, $i = 1, \ldots, N$ with $\mathcal{N}(\mathbf{z} \mid \mathbf{0}, I)$ and set $\mathbf{f} = \mathbf{L}\mathbf{z}$ with $\mathbf{C} = \mathbf{L}\mathbf{L}^T$.
      [Figure: two panels comparing the samplers; both axes span −2.5 to 2.5.]
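
A minimal sketch of the reparameterisation: since the components of z are independent under N(z | 0, I), sampling them (by Gibbs or directly) is trivial, and f = Lz then has the desired covariance. The helper name is mine:

```python
import numpy as np

def sample_gaussian_whitened(C, n_samples=1000, rng=None):
    """Sample f ~ N(0, C) by drawing z ~ N(0, I) and setting f = L z with C = L L^T."""
    rng = np.random.default_rng(rng)
    L = np.linalg.cholesky(C)                             # C = L L^T
    Z = rng.standard_normal((n_samples, C.shape[0]))      # independent z, trivial to sample
    return Z @ L.T                                        # each row is one draw of f
```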

  19. Gaussian process classification (GPC)
      $$p(\mathbf{f} \mid \mathbf{y}, K, \beta) = \frac{1}{Z(\beta)} \prod_n \phi(y_n f_n)\, \exp\!\left(-\frac{\beta}{2}\, \mathbf{f}^T K^{-1} \mathbf{f}\right)$$
      Noise-free formulation with auxiliary variables $\mathbf{f}_{nf}$:
      $$\phi(y f) = \int \theta(y f_{nf})\, \mathcal{N}(f_{nf} \mid f, I)\, df_{nf}$$
      Joint distribution:
      $$p(\mathbf{f}, \mathbf{f}_{nf}, \mathbf{y} \mid K, \beta) = p(\mathbf{y} \mid \mathbf{f}_{nf})\, p(\mathbf{f}_{nf} \mid \mathbf{f})\, p(\mathbf{f} \mid K, \beta)$$
      Marginalize out f:
      $$p(\mathbf{f}_{nf} \mid \mathbf{y}, K, \beta) \propto \prod_n \theta(y_n f_{nf,n})\; \mathcal{N}(\mathbf{f}_{nf} \mid \mathbf{0}, I + K/\beta)$$
      Samples of f can be recovered from p(f | f_nf) (Gaussian).
      An efficient sampler for the truncated Gaussian is needed!
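
One way to get such a sampler is coordinate-wise Gibbs with one-dimensional truncated-normal conditionals. A minimal sketch below for x ~ N(0, Σ) restricted to the orthant y_i x_i > 0, where Σ would play the role of I + K/β; this illustrates the kind of sampler needed, not necessarily the one used in the talk, and the function name is mine:

```python
import numpy as np
from scipy.stats import truncnorm

def gibbs_truncated_gaussian(Sigma, y, n_steps=200, rng=None):
    """Gibbs sampling for x ~ N(0, Sigma) restricted to the orthant y_i * x_i > 0.

    Uses the precision matrix: x_i | x_{-i} is Gaussian with variance 1/Lambda_ii and
    mean -(1/Lambda_ii) * sum_{j != i} Lambda_ij x_j, truncated to the allowed half-line."""
    rng = np.random.default_rng(rng)
    Lam = np.linalg.inv(Sigma)                   # precision matrix
    N = len(y)
    x = np.abs(rng.standard_normal(N)) * y       # feasible start: correct signs
    samples = []
    for _ in range(n_steps):
        for i in range(N):
            var_i = 1.0 / Lam[i, i]
            mean_i = -var_i * (Lam[i] @ x - Lam[i, i] * x[i])
            sd_i = np.sqrt(var_i)
            # truncate to x_i > 0 if y_i = +1, or x_i < 0 if y_i = -1
            lo, hi = (0.0, np.inf) if y[i] > 0 else (-np.inf, 0.0)
            a, b = (lo - mean_i) / sd_i, (hi - mean_i) / sd_i
            x[i] = truncnorm.rvs(a, b, loc=mean_i, scale=sd_i, random_state=rng)
        samples.append(x.copy())
    return np.array(samples)
```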
