Lecture 14: Inference in Dirichlet Processes (Blei & Jordan, Variational Inference for Dirichlet Process Mixture Models, Bayesian Analysis 2006)



SLIDE 1

CS598JHM: Advanced NLP (Spring 2013)

http://courses.engr.illinois.edu/cs598jhm/

Julia Hockenmaier

juliahmr@illinois.edu
3324 Siebel Center
Office hours: by appointment

Lecture 14: Inference in Dirichlet Processes

(Blei & Jordan, Variational inference for Dirichlet Process Mixture models, Bayesian Analysis 2006)

SLIDE 2

Dirichlet Process mixture models

A mixture model with a DP as nonparametric prior:

‘Mixing weights’ (prior): G | {α, G0} ~ DP(α, G0)
The base distribution G0 and G are distributions over the same probability space.

‘Cluster’ parameters: ηn | G ~ G
For each data point n = 1, ..., N, draw ηn, the parameters of a distribution over observations, from G.

(We can interpret this as clustering because G is discrete with probability 1; hence different ηn take on identical values ηc* with nonzero probability. The data points are partitioned into |C| clusters: c = c1...cN.)

Observed data: xn | ηn ~ p(xn | ηn)
For each data point n = 1, ..., N, draw observation xn from p(xn | ηn).
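To make the clustering effect concrete, here is a minimal sketch (my addition, not from the slides) that draws η1, ..., ηN with G integrated out, via the Polya urn scheme; the Gaussian base distribution G0 and the value of α are illustrative assumptions:

```python
import random

def polya_urn(N, alpha, draw_from_G0):
    """Draw eta_1..eta_N from a DP(alpha, G0) prior with G integrated
    out (Polya / Blackwell-MacQueen urn)."""
    etas = []
    for n in range(N):
        # With probability alpha/(n + alpha), draw a fresh value from G0;
        # otherwise reuse one of the n previous values, chosen uniformly
        # (so each distinct value is reused with probability n_k/(n + alpha)).
        if random.random() < alpha / (n + alpha):
            etas.append(draw_from_G0())
        else:
            etas.append(random.choice(etas))
    return etas

# Example: G0 = N(0, 1); ties among the etas are the clusters.
etas = polya_urn(10, alpha=1.0, draw_from_G0=lambda: random.gauss(0.0, 1.0))
```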

SLIDE 3

Stick-breaking representation of DPMs

The component parameters: ηi* ~ G0

The mixing proportions πi(v) are defined by a stick-breaking process:
Vi ~ Beta(1, α)
πi(v) = vi ∏j=1...i−1 (1 − vj)
also written as π(v) ~ GEM(α) (Griffiths/Engen/McCloskey)

Hence, if G ~ DP(α, G0):
G = ∑i=1...∞ πi(v) δηi*  with ηi* ~ G0

[Figure: breaking the unit stick: π1 = v1; the remainder 1 − v1 is broken again, giving π2 = (1 − v1)v2, and so on]
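A minimal sketch of this process (my addition; α and the truncation level T are illustrative assumptions):

```python
import random

def stick_breaking(alpha, T):
    """Mixing weights pi_i(v) = v_i * prod_{j<i} (1 - v_j), truncated at T."""
    weights, stick = [], 1.0
    for _ in range(T):
        v = random.betavariate(1.0, alpha)  # V_i ~ Beta(1, alpha)
        weights.append(stick * v)           # break off a pi_i-sized piece
        stick *= 1.0 - v                    # what remains of the unit stick
    return weights

pi = stick_breaking(alpha=1.0, T=20)  # sums to just under 1 for finite T
```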

SLIDE 4

DP mixture models with DP(α, G0)

  • 1. Define stick-breaking weights by drawing Vi | α ~ Beta(1, α)
  • 2. Draw cluster parameters ηi* | G0 ~ G0 for i = 1, 2, ...
  • 3. For the nth data point:

Draw cluster id Zn | {v1, v2, ...} ~ Mult(π(v))
Draw observation Xn | zn ~ p(x | ηzn*)

Here p(x | η*) is from an exponential family of distributions, and G0 is the corresponding conjugate prior: e.g. p(x | η*) multinomial, G0 Dirichlet. A toy end-to-end sampler is sketched below.
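The following sketch (my addition) runs the three steps above; for simplicity it uses a Gaussian likelihood and a Gaussian G0 rather than the multinomial/Dirichlet pair, and all hyperparameter values are assumptions:

```python
import random

def sample_dpm(N, alpha, T=50):
    """Toy sampler for a truncated DP mixture with Gaussian components."""
    # 1. Stick-breaking weights pi_i(v) from V_i ~ Beta(1, alpha).
    pis, stick = [], 1.0
    for _ in range(T):
        v = random.betavariate(1.0, alpha)
        pis.append(stick * v)
        stick *= 1.0 - v
    pis[-1] += stick  # fold leftover mass into the last component (truncation)
    # 2. Component parameters eta_i* ~ G0; here G0 = N(0, 10) stands in
    #    for the conjugate prior of the exponential-family likelihood.
    etas = [random.gauss(0.0, 10.0) for _ in range(T)]
    # 3. For each data point: Z_n ~ Mult(pi(v)), X_n ~ N(eta_{z_n}*, 1).
    data = []
    for _ in range(N):
        z = random.choices(range(T), weights=pis)[0]
        data.append((z, random.gauss(etas[z], 1.0)))
    return data
```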

SLIDE 5

Stick-breaking construction of DPMs

Stick lengths: Vi ~ Beta(1, α), yielding mixing weights πi(v) = vi ∏j<i (1 − vj)
Component parameters: ηi* ~ G0 (assume G0 is a conjugate prior with hyperparameter λ)
Assignment of data to components: Zn | {v1, ...} ~ Mult(π(v))
Generating the observations: Xn | zn ~ p(xn | ηzn*)

[Plate diagram: α → Vk (k = 1...∞); λ → ηk* (k = 1...∞); Zn and Xn in a plate over n = 1...N]

SLIDE 6

Inference for DP mixture models

Given observed data x1, ..., xn, compute the predictive density:

p(x | x1, ..., xn, α, G0) = ∫ p(x | w) p(w | x1, ..., xn, α, G0) dw

Problem: the posterior over the latent variables p(w | x1, ..., xn, α, G0) cannot be computed in closed form.

Approximate inference:

  • Gibbs sampling:

Sample from a Markov chain with equilibrium distribution p(W | x1, ..., xn, α, G0)

  • Variational inference:

Construct a tractable variational approximation q of p with free variational parameters ν

SLIDE 7

Gibbs sampling

SLIDE 8

Gibbs sampling for DPMs


Two variants that differ in their definition of the Markov chain:

Collapsed Gibbs sampler: integrates out G and the distinct parameter values {η1*, ..., η|C|*} associated with the clusters.

Blocked Gibbs sampler: based on the stick-breaking construction; this requires a truncated variant of the DP.

SLIDE 9

Collapsed Gibbs sampler for DPMs

Integrate out the random measure G and the distinct parameter values {η1*, ..., η|C|*} associated with each cluster.

Given data x = x1...xN, each state of the Markov chain is a cluster assignment c = c1...cN of the data points, and each sample is likewise a cluster assignment.

Given a sampled cluster assignment cb = c1...cN with C distinct clusters, the predictive density is

p(xN+1 | cb, x, α, λ) = ∑k≤C+1 p(cN+1 = k | cb, α) p(xN+1 | cb, cN+1 = k, λ)

SLIDE 10

Collapsed Gibbs sampler for DPMs


‘Macro-sample step’: assign a new cluster to all data points.

‘Micro-sample step’: sample the assignment variable Cn for each data point, conditioned on the assignment of the remaining points, c-n.

Cn is either one of the values in c-n or a new value:

p(cn = k | x, c-n) ∝ p(xn | x-n, c-n, cn = k, λ) p(cn = k | c-n, α)

with p(xn | x-n, c-n, cn = k, λ) = p(x | c-n, cn = k, λ) / p(x-n | c-n, cn = k, λ)
and p(cn = k | c-n, α) given by the Polya (Blackwell/MacQueen) urn.

Inference: after burn-in, collect B sample assignments cb and average across their predictive densities.
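A schematic sketch of the micro-sample sweep (my addition, not the paper's code); `cluster_pred_density` is a hypothetical hook standing in for the conjugate posterior predictive p(xn | x-n, c-n, cn = k, λ):

```python
import random

def gibbs_sweep(x, c, alpha, cluster_pred_density):
    """One 'micro-sample' sweep: resample each c_n given c_{-n}.
    `cluster_pred_density(xn, members)` is a hypothetical hook returning
    the posterior predictive p(xn | {xi in cluster}, lambda); called with
    an empty cluster it should return the prior predictive."""
    N = len(x)
    for n in range(N):
        # Group the remaining points by cluster (this is c_{-n}).
        clusters = {}
        for i in range(N):
            if i != n:
                clusters.setdefault(c[i], []).append(x[i])
        labels, scores = [], []
        for k, members in clusters.items():
            # Polya urn prior for an existing cluster: n_k / (N - 1 + alpha).
            labels.append(k)
            scores.append(len(members) / (N - 1 + alpha)
                          * cluster_pred_density(x[n], members))
        # A brand-new cluster gets prior mass alpha / (N - 1 + alpha).
        labels.append(max(clusters, default=-1) + 1)
        scores.append(alpha / (N - 1 + alpha) * cluster_pred_density(x[n], []))
        c[n] = random.choices(labels, weights=scores)[0]
    return c
```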

SLIDE 11

Blocked Gibbs sampling

Based on the stick-breaking construction; states of the Markov chain consist of (V, η*, Z).

Problem: in the actual DPM model, V and η* are infinite. Instead, the blocked Gibbs sampler uses a truncated DP (TDP), which samples only a finite collection of T stick lengths (and hence clusters) by setting VT = 1, so that πi(v) = vi ∏j<i (1 − vj) = 0 for all i > T.

SLIDE 12

Blocked Gibbs sampling

The states of the Markov chain consist of

  • the beta variables V = {V1...VT-1},
  • the mixture component parameters η* = {η1*...ηT*}
  • the indicator variables Z = {Z1...ZN}

Sampling:

  • For n = 1...N, sample Zn from p(zn = k | v, η*, x) ∝ πk(v) p(xn | ηk*)
  • For k = 1...T−1, sample Vk from Beta(γk1, γk2), where

γk1 = 1 + nk, with nk the number of data points in cluster k
γk2 = α + nk+1...T, with nk+1...T the number of data points in clusters k+1...T

  • For k = 1...T, sample ηk* from its posterior p(ηk* | τk), with

τk = (λ1 + ∑n:zn=k xn , λ2 + nk)

Predictive density for each sample:

p(xn+1 | x, z, α, λ) = ∑k≤T E[πk(v) | γ1...γT] p(xn+1 | τk)
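One iteration of this sampler might look as follows (my sketch; `component_density` and `draw_eta_posterior` are hypothetical hooks for the exponential-family likelihood and a draw from its conjugate posterior):

```python
import random

def blocked_gibbs_step(x, z, alpha, T, component_density, draw_eta_posterior):
    """One blocked Gibbs iteration for a DP truncated at T components."""
    N = len(x)
    n_k = [sum(1 for zn in z if zn == k) for k in range(T)]
    # 1. V_k ~ Beta(gamma_k1, gamma_k2) with gamma_k1 = 1 + n_k and
    #    gamma_k2 = alpha + n_{k+1...T}; V_T is fixed at 1 (truncation).
    v = [random.betavariate(1 + n_k[k], alpha + sum(n_k[k + 1:]))
         for k in range(T - 1)] + [1.0]
    pis, stick = [], 1.0
    for vk in v:                       # pi_k(v) = v_k * prod_{j<k} (1 - v_j)
        pis.append(stick * vk)
        stick *= 1.0 - vk
    # 2. eta_k* from its posterior given the data currently in cluster k.
    etas = [draw_eta_posterior([x[n] for n in range(N) if z[n] == k])
            for k in range(T)]
    # 3. Z_n ~ Mult with weights pi_k(v) * p(x_n | eta_k*).
    for n in range(N):
        weights = [pis[k] * component_density(x[n], etas[k]) for k in range(T)]
        z[n] = random.choices(range(T), weights=weights)[0]
    return v, etas, z
```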

SLIDE 13

Variational inference (recap)

SLIDE 14

Standard EM

L(q, θ) = ln p(X | θ) − KL(q || p) is a lower bound on the incomplete log-likelihood ln p(X | θ).

E-step: with θold fixed, return qnew that maximizes L(q, θold) w.r.t. q(Z). Now KL(qnew || pold) = 0.

M-step: with qnew fixed, return θnew that maximizes L(qnew, θ) w.r.t. θ.

If L(qnew, θnew) > L(qnew, θold), then ln p(X | θnew) > ln p(X | θold), and hence KL(qnew || pnew) > 0.

[Figure: decomposition of ln p(X | θold) into L(q, θold) and KL(q || p)]
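For reference, the decomposition behind this picture (standard material, reconstructed here in LaTeX rather than taken from the slides):

```latex
% For any q(Z): ln p(X | theta) = L(q, theta) + KL(q || p), with
\begin{align}
\mathcal{L}(q,\theta) &= \sum_Z q(Z)\,\ln\frac{p(X,Z\mid\theta)}{q(Z)}\\[2pt]
\mathrm{KL}(q\,\|\,p) &= -\sum_Z q(Z)\,\ln\frac{p(Z\mid X,\theta)}{q(Z)} \;\ge\; 0\\[2pt]
\ln p(X\mid\theta) &= \mathcal{L}(q,\theta) + \mathrm{KL}(q\,\|\,p)
  \;\ge\; \mathcal{L}(q,\theta)
\end{align}
```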

SLIDE 15

Variational inference

Variational inference is applicable when you have to compute an intractable posterior over latent variables p(W | X).

Basic idea: replace the exact but intractable posterior p(W | X) with a tractable approximate posterior q(W | X, V). Here q(W | X, V) is from a family of simpler distributions over the latent variables W that is defined by a set of free variational parameters V.

Unlike in EM, KL(q || p) > 0 for any q, since q only approximates p.

SLIDE 16

Variational EM

Initialization: define an initial model θold and variational distribution q(W | X, V).

E-step: find the variational parameters V that make q(W | X, V) best approximate the true posterior p(W | X, θold), i.e. maximize the lower bound w.r.t. q.

M-step: find model parameters θnew that maximize the expectation of p(W, X | θ) under the variational posterior q(W | X, V). Set θold := θnew.
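As a skeleton (my sketch; `e_step` and `m_step` are hypothetical hooks whose bodies depend on the chosen model and variational family):

```python
def variational_em(x, theta, nu, e_step, m_step, n_iters=100):
    """Alternate between fitting the variational posterior q(W | X, nu)
    (E-step) and re-estimating the model parameters theta (M-step)."""
    for _ in range(n_iters):
        nu = e_step(x, theta, nu)    # fit q(W | X, nu) to p(W | X, theta)
        theta = m_step(x, nu)        # maximize E_q[log p(W, X | theta)]
    return theta, nu
```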

SLIDE 17

Blei and Jordan’s mean-field variational inference for DP

SLIDE 18

Variational inference

Define a family of variational distributions qν(w) with variational parameters ν = ν1...νM that are specific to each observation xi.

Set ν to minimize the KL divergence between qν(w) and p(w | x, θ):

D(qν(w) || p(w | x, θ)) = Eq[log qν(W)] − Eq[log p(W, x | θ)] + log p(x | θ)

(Here, log p(x | θ) can be ignored when finding q.)

This is equivalent to maximizing a lower bound on log p(x | θ):

log p(x | θ) = Eq[log p(W, x | θ)] − Eq[log qν(W)] + D(qν(w) || p(w | x, θ))
log p(x | θ) ≥ Eq[log p(W, x | θ)] − Eq[log qν(W)]

SLIDE 19

qν(W) for DPMs

Blei and Jordan again use the stick-breaking construction. Hence, the latent variables are W = (V, η*, Z):

  • V: T − 1 truncated stick lengths
  • η*: T component parameters
  • Z: cluster assignments of the N data points

SLIDE 20

Variational inference for DPMs

In general: log p(x | θ) ≥ Eq[log p(W, x | θ)] − Eq[log qν(W)]

For DPMs: θ = (α, λ); W = (V, η*, Z)

log p(x | α, λ) ≥ Eq[log p(V | α)] + Eq[log p(η* | λ)]
  + ∑n ( Eq[log p(Zn | V)] + Eq[log p(xn | Zn)] )
  − Eq[log qν(V, η*, Z)]

Problem: V = {V1, V2, ...} and η* = {η1*, η2*, ...} are infinite.
Solution: use a truncated representation.

SLIDE 21

Variational approximations qν(v,η*, z)

[Plate diagram: variational parameters γt for Vt, τt for ηt*, and φn for Zn, with plates over the T components and the N data points]

The variational parameters are ν = (γ1...T−1, τ1...T, φ1...N):

qν(v, η*, z) = ∏t≤T−1 qγt(vt) ∏t≤T qτt(ηt*) ∏n≤N qφn(zn)

  • qγt(vt): Beta distributions with variational parameter γt
  • qτt(ηt*): conjugate priors for η, with parameter τt
  • qφn(zn): multinomials with variational parameters φn
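A sketch of the resulting coordinate-ascent updates for γ and φ (my reconstruction under the truncation convention above, not the paper's code; the τ update and the expected log-likelihood array `e_log_lik` depend on the chosen exponential family and are left as an assumed input):

```python
import numpy as np
from scipy.special import digamma

def update_gamma(phi, alpha):
    """gamma_t = (1 + sum_n phi_nt, alpha + sum_n sum_{j>t} phi_nj), t < T."""
    counts = phi.sum(axis=0)                         # expected cluster sizes (length T)
    tail = np.concatenate([np.cumsum(counts[::-1])[-2::-1], [0.0]])  # sum_{j>t}
    return 1.0 + counts[:-1], alpha + tail[:-1]

def update_phi(gamma1, gamma2, e_log_lik):
    """phi_nt proportional to exp(E[log pi_t(V)] + E[log p(xn | eta_t*)]);
    `e_log_lik` is an assumed N x T array of E_q[log p(xn | eta_t*)],
    computed from tau for the chosen exponential family."""
    e_log_v = digamma(gamma1) - digamma(gamma1 + gamma2)    # E[log V_t]
    e_log_1mv = digamma(gamma2) - digamma(gamma1 + gamma2)  # E[log (1 - V_t)]
    # E[log pi_t] = E[log V_t] + sum_{j<t} E[log (1 - V_j)]; V_T is fixed at 1.
    e_log_pi = (np.concatenate([e_log_v, [0.0]])
                + np.concatenate([[0.0], np.cumsum(e_log_1mv)]))
    log_phi = e_log_pi + e_log_lik
    log_phi -= log_phi.max(axis=1, keepdims=True)           # numerical stability
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)
```

Iterating these updates (together with the conjugate τ update) to convergence maximizes the truncated lower bound from SLIDE 20.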