Well-known shortcomings, advantages and computational challenges in Bayesian modelling: a few case stories


SLIDE 1

Well-known shortcomings, advantages and computational challenges in Bayesian modelling: a few case stories

Ole Winther

The Bioinformatics Center, University of Copenhagen & Informatics and Mathematical Modelling, Technical University of Denmark, DK-2800 Lyngby, Denmark

  • ole.winther@gmail.com
  • wi@imm.dtu.dk

1

SLIDE 2

Overview

  1. How many species? Predicting sequence tags.
     • Non-parametric Bayes
     • Averaging beats maximum likelihood
     • The model is always wrong (and Bayes can’t tell)
  2. Computing the marginal likelihood with MCMC
     • Motivation: computing corrections to EP/C
     • The trouble with Gibbs sampling
     • Gaussian process classification

2

SLIDE 3

How many species?

3

SLIDE 4

DNA sequence tags - CAGE (cap analysis of gene expression)

4

SLIDE 5

Look at the data - cerebellum

[Figure: number of species m(n) versus sample size n for the cerebellum library, full range and zoomed view.]

5

SLIDE 6

Look at the data - embryo

[Figure: number of species m(n) versus sample size n for the embryo library, full range and zoomed view.]

6

SLIDE 7

Chinese restaurant process - Pitman-Yor sampling formula

Observing a new species, given counts n = (n_1, . . . , n_k) in k bins with Σ_{i=1}^{k} n_i = n:

p(n_1, . . . , n_k, 1 | n, σ, θ) = (θ + kσ) / (n + θ)

Re-observing species j:

p(n_1, . . . , n_{j−1}, n_j + 1, n_{j+1}, . . . , n_k | n, σ, θ) = (n_j − σ) / (n + θ)

Exchangeability – the probability is invariant to re-ordering. For E, E, M, T, T:

p_1 = (θ/θ) · (1−σ)/(1+θ) · (θ+σ)/(2+θ) · (θ+2σ)/(3+θ) · (1−σ)/(4+θ)

and for M, T, E, T, E:

p_2 = (θ/θ) · (θ+σ)/(1+θ) · (θ+2σ)/(2+θ) · (1−σ)/(3+θ) · (1−σ)/(4+θ) = . . . = p_1
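The predictive rule above is easy to simulate. Below is a minimal Python sketch of sequential sampling from the Pitman-Yor predictive distribution (Chinese restaurant process with discount σ and concentration θ); the function name and interface are illustrative, not from the slides.

```python
import numpy as np

def pitman_yor_extend(counts, m, sigma, theta, rng=None):
    """Extend observed species counts by m draws from the Pitman-Yor
    predictive rule: a new species with prob. (theta + k*sigma)/(n + theta),
    species j again with prob. (n_j - sigma)/(n + theta)."""
    rng = np.random.default_rng() if rng is None else rng
    counts = list(counts)
    n = sum(counts)
    for _ in range(m):
        k = len(counts)
        probs = [(nj - sigma) / (n + theta) for nj in counts]
        probs.append((theta + k * sigma) / (n + theta))   # the "new species" bin
        j = rng.choice(k + 1, p=probs)
        if j == k:
            counts.append(1)        # observe a new species
        else:
            counts[j] += 1          # re-observe species j
        n += 1
    return counts
```

Running this repeatedly from the observed counts gives Monte Carlo draws of how many new species a further m tags would reveal.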

7

SLIDE 8

Inference and prediction

Likelihood function, e.g. for E, E, M, T, T:

p(n | σ, θ) = (θ/θ) · (1−σ)/(1+θ) · (θ+σ)/(2+θ) · (θ+2σ)/(3+θ) · (1−σ)/(4+θ)

and in general

p(n | σ, θ) = [ ∏_{j=1}^{k−1} (θ + jσ) ] [ ∏_{i′=1}^{k} ∏_{j′=1}^{n_{i′}−1} (j′ − σ) ] / ∏_{i=1}^{n−1} (i + θ)

Flat prior distribution for σ ∈ [0, 1]; θ is a pseudo-count parameter. Prediction for a new count m:

p(m | n) = ∫ p(m | n, σ, θ) p(σ, θ | n) dσ dθ

computed with Gibbs sampling over (σ, θ) and Pitman-Yor sampling for m.
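For reference, a minimal Python sketch of the log of the likelihood above for one observed sequence of counts; it implements the three products directly, and the function name is illustrative.

```python
import numpy as np

def pitman_yor_loglik(counts, sigma, theta):
    """Log-likelihood log p(n | sigma, theta) of one observed sequence under
    the Pitman-Yor sampling formula (counts n_1..n_k, with sum n)."""
    counts = np.asarray(counts)
    n, k = counts.sum(), len(counts)
    # log prod_{j=1}^{k-1} (theta + j*sigma): one factor per new species after the first
    new_species = np.log(theta + sigma * np.arange(1, k)).sum()
    # log prod_i prod_{j=1}^{n_i - 1} (j - sigma): one factor per re-observation
    reobserve = sum(np.log(np.arange(1, ni) - sigma).sum() for ni in counts)
    # log prod_{i=1}^{n-1} (i + theta): normalisation over the whole sequence
    norm = np.log(np.arange(1, n) + theta).sum()
    return new_species + reobserve - norm
```

With a flat prior on σ ∈ [0, 1] and a prior on θ, this is the only model-specific quantity a Gibbs or grid sampler over (σ, θ) needs.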

8

SLIDE 9

Averaging versus max. likelihood

[Figure: contours of log L − log L_ML in the (σ, θ) plane for the cerebellum and embryo libraries.]

9

SLIDE 10

Notice anything funny?

[Figure: number of species k versus sample size n for the cerebellum library, full range and zoomed view.]

10

SLIDE 11

Notice anything funny? Example 2

[Figure: number of species k versus sample size n for the embryo library, full range and zoomed view.]

11

SLIDE 12

(Well-known) take home messages

  • Parameter averaging works!
  • The model is always wrong! (as revealed with sufficient data)
  • Only one model considered!
  • What happened to being Bayesian about model selection?

12

SLIDE 13

Calculating the marginal likelihood

The marginal likelihood:

p(D | H) = ∫ p(D | f, H) p(f | H) df

Approximate inference is needed!

Monte Carlo: slow mixing, non-trivial to get marginal likelihood estimates.

Expectation propagation+, variational Bayes, loopy BP+: sometimes not precise, approximation errors not controllable, not applicable to all models.

13

SLIDE 14

Motivation: validating EP corrections

Kuss and Rasmussen (JMLR 2006): N = 767 3-vs-5 USPS digit GP classification with the squared-exponential kernel

K(x, x′) = σ² exp( −||x − x′||² / (2ℓ²) )

Panels – I: log R = log Z_EPc − log Z_EP; II+III: log Z_MCMC − log Z_EP.

[Figure: contour plots of log Z_EPc − log Z_EP and log Z_MCMC − log Z_EP over log lengthscale ln(ℓ) and log magnitude ln(σ).]

Thanks to Malte Kuss for making III available.
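For completeness, a minimal sketch of the squared-exponential covariance used above (vectorised NumPy; the function name is illustrative).

```python
import numpy as np

def sq_exp_kernel(X1, X2, sigma, ell):
    """Squared-exponential kernel K(x, x') = sigma^2 * exp(-||x - x'||^2 / (2 ell^2)).
    X1, X2 are arrays of shape (n1, d) and (n2, d)."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)   # pairwise squared distances
    return sigma ** 2 * np.exp(-d2 / (2.0 * ell ** 2))
```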

14

SLIDE 15

Marginal likelihood from importance sampling

Importance sampling:

p(D | H) = ∫ [ p(D | f, H) p(f | H) / q(f) ] q(f) df

Draw samples f_1, . . . , f_R from q(f) and set

p(D | H) ≈ (1/R) Σ_{r=1}^{R} p(D | f_r, H) p(f_r | H) / q(f_r)

This will usually not work because the ratio varies too much.
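A minimal Python sketch of this estimator, accumulating the importance weights in log space; all arguments are placeholders for the model at hand.

```python
import numpy as np

def log_marginal_is(log_lik, log_prior, log_q, sample_q, R=10_000, rng=None):
    """Importance-sampling estimate of log p(D|H) with proposal q:
    p(D|H) = E_q[ p(D|f) p(f) / q(f) ], averaged in log space for stability."""
    rng = np.random.default_rng() if rng is None else rng
    log_w = np.empty(R)
    for r in range(R):
        f = sample_q(rng)                                  # f_r ~ q(f)
        log_w[r] = log_lik(f) + log_prior(f) - log_q(f)    # log importance weight
    return np.logaddexp.reduce(log_w) - np.log(R)
```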

15

SLIDE 16

Marginal likelihood from thermodynamic integration

Variants: parallel tempering, simulated tempering and annealed importance sampling.

h(f) = p(D | f, H) p(f | H),   p(f | β) = (1/Z(β)) h^β(f) q^{1−β}(f)

log Z(β_2) − log Z(β_1) = ∫_{β_1}^{β_2} [d log Z(β)/dβ] dβ = ∫_{β_1}^{β_2} ∫ log[ h(f)/q(f) ] p(f | β) df dβ

Run N_β chains and interpolate:

log Z(β_2) − log Z(β_1) ≈ (Δβ/R) Σ_{b=1}^{N_β} Σ_{r=1}^{R} log[ h(f_rb) / q(f_rb) ]

Other things that might work even better: multi-canonical sampling.
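A minimal Python sketch of the thermodynamic-integration estimate on a grid of β values, assuming a sampler for each tempered distribution p(f|β) is available; the trapezoidal quadrature and all names are illustrative choices.

```python
import numpy as np

def log_z_ratio_ti(betas, sample_p_beta, log_h, log_q, R=1_000, rng=None):
    """Thermodynamic-integration estimate of log Z(beta_max) - log Z(beta_min).

    betas         : increasing grid of inverse temperatures
    sample_p_beta : callable (beta, R, rng) -> R samples from p(f|beta) ∝ h(f)^beta q(f)^(1-beta)
    log_h, log_q  : log h(f) and log q(f) for a single sample f"""
    rng = np.random.default_rng() if rng is None else rng
    # E_beta[ log h(f) - log q(f) ] = d log Z(beta) / d beta, estimated per chain
    means = np.array([np.mean([log_h(f) - log_q(f) for f in sample_p_beta(beta, R, rng)])
                      for beta in betas])
    betas = np.asarray(betas, dtype=float)
    # trapezoidal quadrature of the derivative over the beta grid
    return float(np.sum(0.5 * (means[1:] + means[:-1]) * np.diff(betas)))
```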

16

SLIDE 17

The trouble with Gibbs sampling

Cycle over the variables f_i, i = 1, . . . , N and sample from the conditionals

p(f_i | f_rest) = p(f) / p(f_rest) ∝ p(f)
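To make the sweep concrete, here is a minimal sketch of naive Gibbs sampling for N(f | 0, C) using the Gaussian conditionals; when C is strongly correlated these axis-aligned updates mix slowly, which is the trouble this slide refers to. Names are illustrative.

```python
import numpy as np

def gibbs_gaussian(C, n_sweeps=5000, rng=None):
    """Naive Gibbs sampler for N(f | 0, C): cycle over coordinates and draw
    each f_i from its Gaussian conditional given all other coordinates."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.linalg.inv(C)                     # precision matrix
    N = C.shape[0]
    f = np.zeros(N)
    samples = np.empty((n_sweeps, N))
    for s in range(n_sweeps):
        for i in range(N):
            others = [j for j in range(N) if j != i]
            var_i = 1.0 / P[i, i]            # conditional variance
            mean_i = -var_i * (P[i, others] @ f[others])
            f[i] = mean_i + np.sqrt(var_i) * rng.standard_normal()
        samples[s] = f
    return samples
```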

[Figure: Gibbs sampling trajectory in a two-dimensional Gaussian.]

17

SLIDE 18

A trivial cure for N(f|0, C)

Gibbs sample zi, i = 1, . . . , N with N(z|0, I)

f = Lz

with

C = LL^T
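A minimal sketch of the cure: the coordinates of z are independent, so a Gibbs sweep over z is trivial (they can even be drawn jointly), and f = Lz has exactly the covariance C. The function name is illustrative.

```python
import numpy as np

def sample_whitened(C, n_samples, rng=None):
    """Sample N(f | 0, C) by drawing independent z ~ N(0, I) (so a Gibbs sweep
    over z mixes perfectly) and mapping f = L z with C = L L^T."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(C)
    z = rng.standard_normal((n_samples, C.shape[0]))
    return z @ L.T
```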

[Figure: Gibbs sampling before and after the whitening transformation f = Lz.]

18

SLIDE 19

Gaussian process classification (GPC)

p(f | y, K, β) = (1/Z(β)) ∏_n φ(y_n f_n) exp( −(β/2) f^T K^{−1} f )

Noise-free formulation f_nf:

φ(y f) = ∫ θ(y f_nf) N(f_nf | f, I) df_nf

Joint distribution:

p(f, f_nf, y | K, β) = p(y | f_nf) p(f_nf | f) p(f | K, β)

Marginalize out f:

p(f_nf | y, K, β) ∝ ∏_n θ(y_n f_nf,n) · N(f_nf | 0, I + K/β)

Samples of f can be recovered from p(f | f_nf) (Gaussian).

An efficient sampler for truncated Gaussians is needed!
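As a small illustration of the resulting target, the sketch below builds the truncated-Gaussian set-up for GPC: f_nf ~ N(0, I + K/β) restricted to the orthant y_n f_nf,n ≥ 0. Absorbing the labels turns this into an f ≥ 0 constraint, which the constrained Gibbs sweep sketched after slide 23 can handle; function names are illustrative.

```python
import numpy as np

def gpc_truncated_gaussian(K, y, beta):
    """Noise-free GPC target: f_nf ~ N(0, C) with C = I + K/beta, truncated to
    the orthant y_n * f_nf_n >= 0.  Absorbing the labels, g = diag(y) f_nf has
    covariance diag(y) C diag(y) and the constraint becomes g >= 0, so a
    Cholesky factor L of that covariance feeds directly into a whitened,
    constrained Gibbs sampler for g = L z >= 0."""
    y = np.asarray(y, dtype=float)
    C = np.eye(len(y)) + K / beta
    C_signed = (y[:, None] * C) * y[None, :]   # diag(y) C diag(y)
    L = np.linalg.cholesky(C_signed)
    return C, L
```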

19

SLIDE 20

MCMC for GPC - related work

  • G. Rodriguez-Yam, R. Davis and L. Scharf: “Efficient Gibbs Sampling of Truncated Multivariate Normal with Application to Constrained Linear Regression” (preprint, 2004).
  • R. Neal, U. Paquet: sample the joint (f, f_nf) and use Adler’s over-relaxation on f.
  • P. Rujan, R. Herbrich: playing billiards in version space, the Bayes point machine.
  • M. Kuss and C. Rasmussen: hybrid Monte Carlo in z-space, f = Lz, and annealed importance sampling.
  • Y. Qi and T. Minka: Hessian-based Metropolis-Hastings with local Gaussian proposals.

20

SLIDE 21

Gibbs sampling - pos/neg covariance

Conditionals are truncated Gaussians, so sampling is easy!

[Figure: Gibbs sampling of truncated Gaussians with positive and negative covariance.]

21

SLIDE 22

Efficient Gibbs sampling I

Gibbs sample in whitened space: zi, i = 1, . . . , N with N(z|0, I)

f = Lz

with

C = LL^T

What happens to the constraints, say f ≥ 0?

f = Lz ≥ 0

This is just a linear transformation, so the feasible region in z-space is convex and each conditional is a (doubly) truncated Gaussian.

22

SLIDE 23

Determine limits of conditionals

The jth conditional must respect the constraints for i = 1, . . . , N:

L_ij z_j ≥ − Σ_{k≠j} L_ik z_k

Divide the rows into the sets

S+,j = {i | L_ij > 0},   S−,j = {i | L_ij < 0},   S0,j = {i | L_ij = 0}

so that

z_j,lower = max_{i∈S+,j} ( − Σ_{k≠j} L_ik z_k / L_ij ),   z_j,upper = min_{i∈S−,j} ( − Σ_{k≠j} L_ik z_k / L_ij )

and sample by inverse CDF:

z_j = φ^{−1}( φ(z_j,lower) + rand · [ φ(z_j,upper) − φ(z_j,lower) ] )

23
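A minimal Python sketch of one such sweep for the constraint f = Lz ≥ 0, following the limits and inverse-CDF step above (this is the form GPC takes after absorbing the labels, see the sketch after slide 19); names are illustrative, and for bounds far out in the Gaussian tails a more careful truncated-normal sampler would be needed.

```python
import numpy as np
from scipy.stats import norm

def constrained_gibbs_sweep(z, L, rng):
    """One Gibbs sweep in whitened space for f = L z subject to f >= 0.

    Row i of the constraint reads L[i, j] * z[j] >= -sum_{k != j} L[i, k] * z[k];
    rows with L[i, j] > 0 give lower bounds, rows with L[i, j] < 0 give upper
    bounds, and z[j] is drawn from the standard normal truncated to that
    interval by inverse CDF."""
    for j in range(len(z)):
        rest = -(L @ z - L[:, j] * z[j])        # -sum_{k != j} L[i, k] z[k], one value per row i
        pos, neg = L[:, j] > 0, L[:, j] < 0
        lower = np.max(rest[pos] / L[pos, j]) if pos.any() else -np.inf
        upper = np.min(rest[neg] / L[neg, j]) if neg.any() else np.inf
        # inverse-CDF draw from the (doubly) truncated standard normal conditional
        u = norm.cdf(lower) + rng.random() * (norm.cdf(upper) - norm.cdf(lower))
        z[j] = norm.ppf(u)
    return z
```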
SLIDE 24

Gibbs sampling positive covariance

[Figure: efficient Gibbs sampling for the positive-covariance case.]

24

SLIDE 25

Gibbs sampling negative covariance

[Figure: efficient Gibbs sampling for the negative-covariance case.]

25

SLIDE 26

Kuss+Rasmussen set-up

EP, EP+corrections and MCMC are all very precise!

[Figure: contour plots over log lengthscale ln(ℓ) and log magnitude ln(σ) of the marginal-likelihood differences in the Kuss and Rasmussen set-up.]

Details of the EP corrections will (hopefully) come to a conference near you soon!

26

SLIDE 27

Summary

  • Averaging works!
  • (X-)validation points to model mis-specification!
  • How to find better (noise) models?
  • Marginal likelihood from sampling
  • The trouble with Gibbs sampling and a cure
  • Is machine learning becoming (Bayesian) statistics with big data sets?

27

SLIDE 28

Acknowledgments

Bioinformatics: Albin Sandelin, Eivind Valen

Machine learning: Ulrich Paquet, Manfred Opper

Students (many shared):
  • Ricardo Henao (machine learning and bioinformatics)
  • Morten Hansen (communication)
  • Carsten Stahlhut (source location in EEG)
  • Ján Petur Petersen (machine learning for ship propulsion)
  • Man-Hung Tang (bioinformatics)
  • Eivind Valen (bioinformatics)
  • Troels Marstrand (bioinformatics)

28