

  1. Bayesian nonparametrics. Dr. Jarad Niemi. STAT 615 - Iowa State University. December 5, 2017.

  2. Bayesian nonparametrics. There are two main approaches to Bayesian nonparametric density estimation: the Dirichlet process and Polya trees. See Müller and Mitra (2013) for a general overview of Bayesian nonparametric problems, e.g. density estimation, clustering, regression, random effects distributions, etc.

  3. Motivation. [Figure: estimated density of phi (density 0.00 to 0.20, phi 0 to 8).]

  4. Goal. Let $Y_i$ come from an unknown probability measure $G$, i.e. $Y_i \sim G$. As a Bayesian, the natural approach is to put a prior on $G$. That is, we want to make statements like $P(Y_i \in A) = G(A)$ for any set $A$.

  5. Dirichlet process. One approach is to use a Dirichlet process (Ferguson 1973). We write $G \sim DP(aG_0)$, where $a > 0$ is the concentration (or total mass) parameter and $G_0$ is the base measure, i.e. a probability distribution defined on the support of $G$. For any partition $A_1, \ldots, A_K$ of the sample space $S$, the probability vector $[G(A_1), \ldots, G(A_K)]$ follows a Dirichlet distribution, i.e.
  $$[G(A_1), \ldots, G(A_K)] \sim Dir([aG_0(A_1), \ldots, aG_0(A_K)]).$$
  Thus $E[G(A_1)] = G_0(A_1)$ and $Var[G(A_1)] = \frac{G_0(A_1)[1 - G_0(A_1)]}{1 + a}$.
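The partition moments above are easy to check by simulation. A minimal R sketch (not from the slides; the concentration $a = 5$, base measure $G_0 = N(0,1)$, and four-cell partition are illustrative choices):

    set.seed(1)
    a      <- 5
    breaks <- c(-Inf, -1, 0, 1, Inf)                 # partition A_1, ..., A_4 of the real line
    G0     <- diff(pnorm(breaks))                    # G0(A_1), ..., G0(A_4)

    rdirichlet1 <- function(alpha) {                 # one Dirichlet draw via gammas
      g <- rgamma(length(alpha), shape = alpha, rate = 1)
      g / sum(g)
    }

    draws <- t(replicate(1e4, rdirichlet1(a * G0)))  # rows are [G(A_1), ..., G(A_4)]
    rbind(theory = G0, simulated = colMeans(draws))  # E[G(A_k)] = G0(A_k)
    rbind(theory = G0 * (1 - G0) / (1 + a),          # Var[G(A_k)]
          simulated = apply(draws, 2, var))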

  6. Conjugacy of the Dirichlet process. Assume $Y_i \overset{ind}{\sim} G$ and $G \sim DP(aG_0)$. Then for any partition $\{A_1, \ldots, A_K\}$, we have
  $$[G(A_1), \ldots, G(A_K)] \mid y \sim Dir\left(\left[aG_0(A_1) + \sum_{i=1}^n I(y_i \in A_1), \ldots, aG_0(A_K) + \sum_{i=1}^n I(y_i \in A_K)\right]\right)$$
  and thus
  $$G \mid y \sim DP\left(aG_0 + \sum_{i=1}^n \delta_{y_i}\right)$$
  which has
  $$E[G(A) \mid y] = \frac{a}{a+n} G_0(A) + \frac{n}{a+n} \left(\frac{1}{n} \sum_{i=1}^n I(y_i \in A)\right).$$
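So the posterior mean of $G$ is a convex combination of the base measure and the empirical distribution. A minimal R sketch of this for $A = (-\infty, x]$ (not from the slides; $G_0 = N(0,1)$, $a = 2$, and the simulated stand-in data are illustrative):

    set.seed(2)
    a <- 2
    y <- rnorm(50, mean = 1)                 # stand-in data
    n <- length(y)
    # E[G((-Inf, x]) | y] = a/(a+n) * G0((-Inf, x]) + n/(a+n) * empirical CDF at x
    post_mean_cdf <- function(x) a / (a + n) * pnorm(x) + n / (a + n) * ecdf(y)(x)
    curve(post_mean_cdf(x), from = -3, to = 4, ylab = "posterior mean CDF")
    curve(pnorm(x), add = TRUE, lty = 2)     # prior mean E[G] = G0, for comparison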

  7. Stick-breaking representation. A constructive representation of the Dirichlet process is the stick-breaking representation. Assume $G \sim DP(aG_0)$; then
  $$G(\cdot) = \sum_{h=1}^\infty \pi_h \delta_{\theta_h}(\cdot)$$
  where $\pi \sim \text{stick}(a)$ and $\theta_h \overset{ind}{\sim} G_0$. The stick distribution is the following: $\pi_h = \nu_h \prod_{\ell < h}(1 - \nu_\ell)$ with $\nu_h \overset{ind}{\sim} Be(1, a)$. [Diagram: the unit interval [0, 1] broken into successive pieces of lengths $\pi_1, \pi_2, \pi_3, \pi_4, \pi_5, \ldots$]
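A truncated version of this construction is easy to simulate. A sketch (the truncation level H, concentration a, and N(0,1) base measure are illustrative choices):

    set.seed(3)
    a     <- 1
    H     <- 100                              # truncation level for simulation
    nu    <- rbeta(H, 1, a)                   # nu_h ~ Be(1, a)
    pi_h  <- nu * c(1, cumprod(1 - nu[-H]))   # pi_h = nu_h * prod_{l < h} (1 - nu_l)
    theta <- rnorm(H)                         # atoms theta_h ~ G0 = N(0, 1)
    sum(pi_h)                                 # nearly 1 for large H
    plot(theta, pi_h, type = "h",             # one (approximate) realization of G
         xlab = "theta", ylab = "weight")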

  8. Realizations from a DP. Base measure is a standard normal. Realizations are across the columns and values of $a$ are down the rows. [Figure: 5x4 grid of DP realizations, density vs. theta on $[-5, 5]$; rows $a = 0.01, 0.1, 1, 10, 100$, columns are four independent realizations.]

  9. DP mixture. If the distribution we are trying to approximate is absolutely continuous, then a DP is not reasonable, since realizations of a DP are discrete. Thus, we may want to use a DP mixture, i.e.
  $$Y_i \overset{ind}{\sim} p(\cdot \mid \theta_i), \quad \theta_i \overset{ind}{\sim} G, \quad G \sim DP(aG_0)$$
  for some parametric model $p(\cdot \mid \theta)$. Alternatively, if we use the stick-breaking construction, we have
  $$Y_i \overset{ind}{\sim} p(\cdot \mid \theta_i), \quad \theta_i \overset{ind}{\sim} \sum_{h=1}^\infty \pi_h \delta_{\theta_h^*}$$
  where $\theta_h^* \overset{ind}{\sim} G_0$ and $\pi \sim \text{stick}(a)$.

  10. Finite approximation to the stick-breaking representation. For some $\epsilon > 0$, there exists an $H$ such that $\sum_{h=H}^\infty \pi_h < \epsilon$, and components $H$ and beyond can reasonably be ignored. The resulting model is
  $$Y_i \overset{ind}{\sim} p(\cdot \mid \theta_i), \quad \theta_i \overset{ind}{\sim} \sum_{h=1}^H \pi_h \delta_{\theta_h^*}$$
  where $\pi_h = \nu_h \prod_{\ell < h}(1 - \nu_\ell)$, $\nu_h \overset{ind}{\sim} Be(1, a)$ for $h < H$, and $\nu_H = 1$.
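One simple way to choose $H$ (not from the slides): since the $\nu_h$ are iid $Be(1, a)$, the expected mass beyond the first $H - 1$ components is $E\left[\prod_{\ell < H}(1 - \nu_\ell)\right] = (a/(1+a))^{H-1}$, so we can take the smallest $H$ making this expectation less than $\epsilon$:

    # Smallest H with expected leftover stick mass (a/(1+a))^(H-1) < epsilon
    choose_H <- function(a, epsilon) ceiling(1 + log(epsilon) / log(a / (1 + a)))
    choose_H(a = 1,  epsilon = 1e-3)   # 11
    choose_H(a = 10, epsilon = 1e-3)   # 74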

  11. Normal example. A DP mixture model for the marginal distribution of $Y_i = \phi_i$ is
  $$Y_i \overset{ind}{\sim} N(\mu_i, \sigma_i^2), \qquad (\mu_i, \sigma_i^2) \overset{ind}{\sim} \sum_{h=1}^H \pi_h \delta_{(\mu_h^*, \sigma_h^{2*})}$$
  where $\sum_{h=1}^H \pi_h = 1$. Alternatively, we can introduce a latent variable $\zeta_i = h$ if observation $i$ came from group $h$. Then
  $$Y_i \mid \zeta_i = h \overset{ind}{\sim} N(\mu_h^*, \sigma_h^{2*}), \qquad \zeta_i \overset{ind}{\sim} Cat(H, \pi)$$
  where $\zeta \sim Cat(H, \pi)$ is a categorical random variable with $P(\zeta = h) = \pi_h$ for $h = 1, \ldots, H$ and $\pi = (\pi_1, \ldots, \pi_H)$.
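A generative sketch of the latent-variable formulation (H, pi, and the component parameters are illustrative values, not from the slides):

    set.seed(5)
    n     <- 500
    H     <- 3
    pi_h  <- c(0.5, 0.3, 0.2)
    mu    <- c(-2, 0, 3)
    sigma <- c(0.5, 1, 0.8)
    zeta  <- sample(H, n, replace = TRUE, prob = pi_h)  # zeta_i ~ Cat(H, pi)
    y     <- rnorm(n, mu[zeta], sigma[zeta])            # y_i | zeta_i = h ~ N(mu_h*, sigma_h^2*)
    hist(y, breaks = 40, freq = FALSE)                  # the mixture density emerges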

  12. Normal example. Let
  $$Y_i \overset{ind}{\sim} N(\mu_i, \sigma_i^2), \qquad (\mu_i, \sigma_i^2) \overset{ind}{\sim} \sum_{h=1}^H \pi_h \delta_{(\mu_h^*, \sigma_h^{2*})}$$
  where the base measure $G_0$ is
  $$\mu_h^* \mid \sigma_h^{2*} \overset{ind}{\sim} N(m_h, v_h^2 \sigma_h^{2*}) \quad \text{and} \quad \sigma_h^{2*} \overset{ind}{\sim} IG(c_h, d_h).$$
  But since each $(\mu_i, \sigma_i^2)$ must equal $(\mu_h^*, \sigma_h^{2*})$ for some $h$, we can rewrite the model as
  $$Y_i \overset{ind}{\sim} \sum_{h=1}^H \pi_h N(\mu_h^*, \sigma_h^{2*})$$
  with a prior that is equal to the base measure. Thus this model is equivalent to our finite mixture with the exception of the prior for $\pi$.

  13. MCMC - Blocked Gibbs sampler. A Gibbs sampler with stationary distribution
  $$p(\pi, \mu, \sigma^2, \zeta \mid y) \propto p(y \mid \zeta, \mu, \sigma^2) \, p(\zeta \mid \pi) \, p(\mu \mid \sigma^2) \, p(\sigma^2) \, p(\pi)$$
  has the following steps (sketched in R below):
  1. For $i = 1, \ldots, n$, independently sample $\zeta_i$ from its full conditional
  $$P(\zeta_i = h \mid \ldots) \propto \pi_h N(y_i; \mu_h^*, \sigma_h^{2*}).$$
  2. Jointly sample $\pi$ and $(\mu, \sigma^2)$, because they are conditionally independent:
     a. Sample $\nu_h \overset{ind}{\sim} Be(1 + Z_h, a + Z_h^+)$ for $h = 1, \ldots, H-1$, where $Z_h = \sum_{i=1}^n I(\zeta_i = h)$ and $Z_h^+ = \sum_{h'=h+1}^H Z_{h'}$, and set $\nu_H = 1$. Then calculate $\pi_h = \nu_h \prod_{\ell < h}(1 - \nu_\ell)$.
     b. For $h = 1, \ldots, H$, sample $(\mu_h^*, \sigma_h^{2*})$ from their full conditional
     $$\mu_h^* \mid \sigma_h^{2*} \overset{ind}{\sim} N(m_h', v_h'^2 \sigma_h^{2*}), \qquad \sigma_h^{2*} \overset{ind}{\sim} IG(c_h', d_h')$$
     where $m_h'$, $v_h'^2$, $c_h'$, and $d_h'$ are exactly the same as in the normal finite mixture MCMC.
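A rough R sketch of one iteration of this sampler. The updates for $m_h'$, $v_h'^2$, $c_h'$, $d_h'$ below are the standard normal-inverse-gamma conjugate forms the slide refers to; the prior constants (m0, v20, c0, d0) and the structure of the state list are illustrative assumptions:

    gibbs_step <- function(y, state, a, H, m0 = 0, v20 = 1, c0 = 1, d0 = 1) {
      # Step 1: sample each zeta_i from its full conditional
      zeta <- sapply(y, function(yi) {
        p <- state$pi * dnorm(yi, state$mu, sqrt(state$sigma2))
        sample(H, 1, prob = p)
      })
      # Step 2a: stick-breaking update for pi
      Z     <- tabulate(zeta, nbins = H)          # Z_h = # {i : zeta_i = h}
      Zplus <- rev(cumsum(rev(Z)))[-1]            # Z_h^+ = sum_{h' > h} Z_h'
      nu    <- c(rbeta(H - 1, 1 + Z[-H], a + Zplus), 1)
      pi_h  <- nu * c(1, cumprod(1 - nu[-H]))     # pi_h = nu_h * prod_{l < h} (1 - nu_l)
      # Step 2b: conjugate normal-inverse-gamma update for each component
      mu <- sigma2 <- numeric(H)
      for (h in 1:H) {
        yh   <- y[zeta == h]; nh <- length(yh)
        ybar <- if (nh > 0) mean(yh) else 0
        v2p  <- 1 / (1 / v20 + nh)                              # v'^2
        mp   <- v2p * (m0 / v20 + nh * ybar)                    # m'
        cp   <- c0 + nh / 2                                     # c'
        dp   <- d0 + (sum((yh - ybar)^2) +
                      nh * (ybar - m0)^2 / (1 + nh * v20)) / 2  # d'
        sigma2[h] <- 1 / rgamma(1, cp, dp)                      # sigma2_h* ~ IG(c', d')
        mu[h]     <- rnorm(1, mp, sqrt(v2p * sigma2[h]))        # mu_h* | sigma2_h*
      }
      list(zeta = zeta, pi = pi_h, mu = mu, sigma2 = sigma2)
    }

Starting from any valid state, e.g. state <- list(pi = rep(1/H, H), mu = rnorm(H), sigma2 = rep(1, H)), repeated calls to gibbs_step (saving each returned state) produce the MCMC draws.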

  14. JAGS.

    library("rjags")

    dp_normal_blocked = "
    model {
      for (i in 1:n) {
        y[i]    ~ dnorm(mu[zeta[i]], tau[zeta[i]])
        zeta[i] ~ dcat(pi[])
      }
      for (h in 1:H) {
        mu[h]    ~ dnorm(2, 1/3)
        tau[h]   ~ dgamma(.1, .1)
        sigma[h] <- 1/sqrt(tau[h])
      }

      # Stick breaking
      for (h in 1:(H-1)) { V[h] ~ dbeta(1, a) }
      V[H] <- 1
      pi[1] <- V[1]
      for (h in 2:H) {
        pi[h] <- V[h] * (1 - V[h-1]) * pi[h-1] / V[h-1]
      }
    }
    "

    tmp = hat[sample(nrow(hat), 1000), ]
    dat = list(n = nrow(tmp), H = 25, y = tmp$phi, a = 1)
    jm  = jags.model(textConnection(dp_normal_blocked),
                     data = dat, n.chains = 3)
    r   = jags.samples(jm, c('mu', 'sigma', 'pi', 'zeta'), 1e3)
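Two notes on this code. The pi[h] loop avoids a nested product: since $\pi_{h-1} = \nu_{h-1} \prod_{\ell < h-1}(1 - \nu_\ell)$, dividing by V[h-1] recovers the partial product, and multiplying by V[h] * (1 - V[h-1]) yields $\pi_h$. And the data frame hat (with column phi) is assumed to hold the $\phi_i$ draws from earlier in the lecture.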

  15. Monitor convergence of density. As previously discussed, the model as constructed has identifiability problems among the $\pi_h$, $\mu_h^*$, and $\sigma_h^{2*}$ due to label switching. What is identified in the model is the value of the density at any particular value. So rather than directly monitoring the parameters, we will monitor the estimated density, i.e. at iteration $m$ of the MCMC, the estimated density at location $x$ is
  $$\sum_{h=1}^H \pi_h^{(m)} N\left(x; \mu_h^{*(m)}, \sigma_h^{2*(m)}\right).$$
  Monitoring this quantity at a variety of locations $x$ will provide appropriate convergence assessment.
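A sketch of this computation from the jags.samples output r above (assuming the mcarray objects index as [component, iteration, chain]; the monitoring locations xs and the use of chain 1 are illustrative):

    xs   <- c(-2.5, 0, 2.5)                      # monitoring locations x
    dens <- sapply(xs, function(x)
      sapply(1:1000, function(m)                 # iteration m, chain 1
        sum(r$pi[, m, 1] * dnorm(x, r$mu[, m, 1], r$sigma[, m, 1]))))
    matplot(dens, type = "l", lty = 1,           # one traceplot per location
            xlab = "iteration", ylab = "estimated density")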

  16. Monitor convergence of density. [Figure: traceplots of the estimated density at four locations ($x \approx -2.53$, $-2.12$, $1.32$, $1.73$) over 1000 iterations for 3 chains.]

  17. Monitoring the number of utilized components. Since we are using a finite approximation to the DP, we should monitor the index of the maximum occupied component (or the number of occupied clusters). If the finite approximation is reasonable, then this number will be smaller than $H$. If not, then $H$ should be increased. Specifically, at iteration $m$, we monitor
  $$\max\left\{\zeta_1^{(m)}, \ldots, \zeta_n^{(m)}\right\},$$
  the index of the maximum occupied cluster, or
  $$\sum_{h=1}^H I(Z_h > 0),$$
  the number of occupied clusters.
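Both diagnostics can be computed from the monitored zeta draws, e.g. (same mcarray indexing assumption as above; chain 1 shown):

    zeta1   <- r$zeta[, , 1]                   # n x iterations matrix, chain 1
    max_occ <- apply(zeta1, 2, max)            # index of maximum occupied component
    n_occ   <- apply(zeta1, 2, function(z)     # number of occupied clusters
      sum(tabulate(z, nbins = dat$H) > 0))
    summary(max_occ); summary(n_occ)           # if max_occ is often near H, increase H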
