Lecture 14: Inference in Dirichlet Processes
(Blei & Jordan, Variational inference for Dirichlet Process Mixture models, Bayesian Analysis 2006)
CS598JHM: Advanced NLP (Spring 2013) — http://courses.engr.illinois.edu/cs598jhm/


1. CS598JHM: Advanced NLP (Spring 2013)
   http://courses.engr.illinois.edu/cs598jhm/
   Lecture 14: Inference in Dirichlet Processes
   (Blei & Jordan, Variational inference for Dirichlet Process Mixture models, Bayesian Analysis 2006)
   Julia Hockenmaier, juliahmr@illinois.edu, 3324 Siebel Center
   Office hours: by appointment

2. Dirichlet Process mixture models
   A mixture model with a DP as nonparametric prior:
   - 'Mixing weights' (prior): G | α, G_0 ~ DP(α, G_0)
     The base distribution G_0 and G are distributions over the same probability space.
   - 'Cluster' parameters: η_n | G ~ G
     For each data point n = 1, ..., N, draw a distribution η_n (with value η*_c) over observations from G.
     (We can interpret this as clustering because G is discrete with probability 1; hence different η_n take on identical values η*_c with nonzero probability. The data points are partitioned into |C| clusters: c = c_1 ... c_N.)
   - Observed data: x_n | η_n ~ p(x_n | η_n)
     For each data point n = 1, ..., N, draw an observation x_n from p(· | η_n).

3. Stick-breaking representation of DPMs
   (Figure: a unit stick broken into pieces π_1 = v_1, π_2 = (1 − v_1)v_2, ...)
   The component parameters η*: η*_i ~ G_0
   The mixing proportions π_i(v) are defined by a stick-breaking process:
     V_i ~ Beta(1, α)
     π_i(v) = v_i ∏_{j=1..i−1} (1 − v_j)
   also written as π(v) ~ GEM(α) (Griffiths/Engen/McCloskey).
   Hence, if G ~ DP(α, G_0):
     G = Σ_{i=1..∞} π_i(v) δ_{η*_i}   with η*_i ~ G_0
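A minimal sketch (not from the slides) of drawing GEM(α) weights with NumPy. The truncation level T and the convention of setting the last stick to 1 are assumptions made for illustration:

```python
import numpy as np

def stick_breaking_weights(alpha, T, rng=None):
    """Draw mixing proportions pi(v) ~ GEM(alpha), truncated at T components.

    V_i ~ Beta(1, alpha);  pi_i = v_i * prod_{j<i} (1 - v_j).
    The last stick is set to 1 so the truncated weights sum to 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(1.0, alpha, size=T)
    v[-1] = 1.0                      # truncation: absorb the remaining stick
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining             # pi_i = v_i * prod_{j<i}(1 - v_j)

pi = stick_breaking_weights(alpha=1.0, T=50)
print(pi.sum())   # 1.0 up to floating-point error
```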

4. DP mixture models with DP(α, G_0)
   1. Define stick-breaking weights by drawing V_i | α ~ Beta(1, α)
   2. Draw cluster parameters η*_i | G_0 ~ G_0, i = {1, 2, ...}
   3. For the n-th data point:
      - Draw a cluster id Z_n | {v_1, v_2, ...} ~ Mult(π(v))
      - Draw an observation X_n | z_n ~ p(x | η*_{z_n})
   p(x | η*) is from an exponential family of distributions, and G_0 is the corresponding conjugate prior,
   e.g. p(x | η*) multinomial, G_0 Dirichlet (a generative sketch of this case follows below).
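As a rough, hedged illustration of the multinomial/Dirichlet example from the slide (reusing the stick_breaking_weights helper sketched above; the truncation level T and the single-token data layout are my assumptions):

```python
import numpy as np

def sample_dpm_data(N, alpha, lam, V, T=50, rng=None):
    """Generate N observations from a (truncated) DP mixture of categoricals.

    G_0 is a symmetric Dirichlet(lam) over a vocabulary of size V,
    and each observation is a single categorical draw given its cluster.
    """
    rng = np.random.default_rng() if rng is None else rng
    pi = stick_breaking_weights(alpha, T, rng)           # mixing weights pi(v)
    eta = rng.dirichlet(np.full(V, lam), size=T)         # cluster parameters eta*_i ~ G_0
    z = rng.choice(T, size=N, p=pi)                      # Z_n ~ Mult(pi(v))
    x = np.array([rng.choice(V, p=eta[k]) for k in z])   # X_n ~ p(x | eta*_{z_n})
    return x, z

x, z = sample_dpm_data(N=200, alpha=1.0, lam=0.5, V=10)
```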

5. Stick-breaking construction of DPMs
   (Plate diagram: α → V_k and λ → η*_k for k = 1..∞; V_k → Z_n and η*_k, Z_n → X_n for n = 1..N)
   Stick lengths V_i ~ Beta(1, α), yielding mixing weights π_i(v) = v_i ∏_{j<i} (1 − v_j)
   Component parameters: η*_i ~ G_0 (assume G_0 is a conjugate prior with hyperparameter λ)
   Assignment of data to components: Z_n | {v_1, ...} ~ Mult(π(v))
   Generating the observations: X_n | z_n ~ p(x_n | η*_{z_n})

6. Inference for DP mixture models
   Given observed data x_1, ..., x_n, compute the predictive density:
     p(x | x_1, ..., x_n, α, G_0) = ∫ p(x | w) p(w | x_1, ..., x_n, α, G_0) dw
   Problem: the posterior over the latent variables, p(w | x_1, ..., x_n, α, G_0), cannot be computed in closed form.
   Approximate inference:
   - Gibbs sampling: sample from a Markov chain with equilibrium distribution p(W | x_1, ..., x_n, α, G_0)
   - Variational inference: construct a tractable variational approximation q of p with free variational parameters ν
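Not on the slide, but the standard way the posterior samples are then used: approximate the integral by a Monte Carlo average over B samples w^{(b)} drawn from the Markov chain:

```latex
p(x \mid x_1,\dots,x_n,\alpha,G_0)
  \;\approx\; \frac{1}{B}\sum_{b=1}^{B} p\!\left(x \mid w^{(b)}\right),
\qquad w^{(b)} \sim p(w \mid x_1,\dots,x_n,\alpha,G_0)
```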

7. Gibbs sampling

8. Gibbs sampling for DPMs
   Two variants that differ in their definition of the Markov chain:
   - Collapsed Gibbs sampler: integrates out G and the distinct parameter values {η*_1, ..., η*_{|C|}} associated with the clusters.
   - Blocked Gibbs sampler: based on the stick-breaking construction. This requires a truncated variant of the DP.

9. Collapsed Gibbs sampler for DPMs
   Integrate out the random measure G and the distinct parameter values {η*_1, ..., η*_{|C|}} associated with each cluster.
   Given data x = x_1 ... x_N, each state of the Markov chain is a cluster assignment c = c_1 ... c_N of the data points; each sample is likewise a cluster assignment c = c_1 ... c_N.
   Given a sampled cluster assignment c^(b) = c_1 ... c_N with C distinct clusters, the predictive density is
     p(x_{N+1} | c^(b), x, α, λ) = Σ_{k ≤ C+1} p(c_{N+1} = k | c^(b), α) p(x_{N+1} | c^(b), c_{N+1} = k, λ)
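A small sketch (my own, not from the slides) of the Pólya-urn prior term p(c_{N+1} = k | c^(b), α) used above, where the counts come from the current assignment:

```python
import numpy as np

def crp_prior(counts, alpha):
    """p(c_new = k | c, alpha) for the existing clusters and one new cluster.

    counts[k] is the number of points currently assigned to cluster k.
    Returns a vector of length C+1; the last entry is the new-cluster probability.
    """
    counts = np.asarray(counts, dtype=float)
    probs = np.append(counts, alpha)
    return probs / probs.sum()

print(crp_prior([3, 1, 2], alpha=1.0))  # [0.428..., 0.142..., 0.285..., 0.142...]
```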

10. Collapsed Gibbs sampler for DPMs
   'Macro-sample step': assign a new cluster to all data points.
   'Micro-sample step': sample the assignment variable C_n for each data point, conditioned on the assignments of the remaining points, c_{-n}.
   C_n is either one of the values in c_{-n} or a new value:
     p(c_n = k | x, c_{-n}) ∝ p(x_n | x_{-n}, c_{-n}, c_n = k, λ) p(c_n = k | c_{-n}, α)
   with
     p(x_n | x_{-n}, c_{-n}, c_n = k, λ) = p(x_n, x_{-n} | c_{-n}, c_n = k, λ) / p(x_{-n} | c_{-n}, c_n = k, λ)
   and p(c_n = k | c_{-n}, α) given by the Pólya (Blackwell/MacQueen) urn.
   Inference: after burn-in, collect B sampled assignments c^(b) and average across their predictive densities.
   (A sketch of one micro-sample sweep follows below.)
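The slide leaves the likelihood term abstract; below is a hedged sketch of one micro-sample sweep for the concrete categorical/Dirichlet case (single-token observations, symmetric Dirichlet(λ) base measure). The data layout and helper names are my own assumptions, not Blei & Jordan's code:

```python
import numpy as np

def collapsed_gibbs_sweep(x, c, alpha, lam, V, rng=None):
    """One sweep of collapsed Gibbs sampling for a DP mixture of categoricals.

    x: int array of N observations (word ids in 0..V-1)
    c: int array of N current cluster assignments (modified in place)
    lam: symmetric Dirichlet hyperparameter of the base measure G_0
    """
    rng = np.random.default_rng() if rng is None else rng
    clusters = sorted(set(c))
    # word counts per cluster and cluster sizes
    word_counts = {k: np.bincount(x[c == k], minlength=V).astype(float) for k in clusters}
    sizes = {k: float((c == k).sum()) for k in clusters}

    for n in range(len(x)):
        k_old = c[n]
        # remove point n from its cluster
        word_counts[k_old][x[n]] -= 1
        sizes[k_old] -= 1
        if sizes[k_old] == 0:
            del word_counts[k_old], sizes[k_old]
        ks = list(word_counts.keys())
        # CRP prior times Dirichlet-multinomial predictive likelihood
        prior = np.array([sizes[k] for k in ks] + [alpha])
        lik = np.array([(word_counts[k][x[n]] + lam) / (word_counts[k].sum() + V * lam)
                        for k in ks] + [1.0 / V])
        probs = prior * lik
        probs /= probs.sum()
        choice = rng.choice(len(ks) + 1, p=probs)
        k_new = ks[choice] if choice < len(ks) else (max(ks) + 1 if ks else 0)
        if k_new not in word_counts:
            word_counts[k_new] = np.zeros(V)
            sizes[k_new] = 0.0
        word_counts[k_new][x[n]] += 1
        sizes[k_new] += 1
        c[n] = k_new
    return c
```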

11. Blocked Gibbs sampling
   Based on the stick-breaking construction. States of the Markov chain consist of (V, η*, Z).
   Problem: in the actual DPM model, V and η* are infinite.
   Instead, the blocked Gibbs sampler uses a truncated DP (TDP), which samples only a finite collection of T stick lengths (and hence clusters), by setting V_{T−1} = 1 so that π_i = 0 for i ≥ T:
     π_i(v) = v_i ∏_{j<i} (1 − v_j)
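A one-line check (not on the slide) that the truncation still gives a proper distribution over components: the stick-breaking weights telescope, so

```latex
\sum_{i=1}^{T-1} \pi_i(v)
  \;=\; \sum_{i=1}^{T-1} v_i \prod_{j<i}(1-v_j)
  \;=\; 1 - \prod_{j=1}^{T-1}(1-v_j)
  \;=\; 1 \quad\text{when } v_{T-1} = 1
```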

12. Blocked Gibbs sampling
   The states of the Markov chain consist of
   - the beta variables V = {V_1 ... V_{T−1}},
   - the mixture component parameters η* = {η*_1 ... η*_T},
   - the indicator variables Z = {Z_1 ... Z_N}.
   Sampling (a code sketch follows below):
   - For n = 1...N, sample Z_n from p(z_n = k | v, η*, x) ∝ π_k(v) p(x_n | η*_k)
   - For k = 1...K, sample V_k from Beta(γ_{k,1}, γ_{k,2}) with
       γ_{k,1} = 1 + n_k, where n_k is the number of data points in cluster k
       γ_{k,2} = α + n_{k+1...K}, where n_{k+1...K} is the number of data points in clusters k+1...K
   - For k = 1...K, sample η*_k from its posterior p(η*_k | τ_k), with
       τ_k = (λ_1 + Σ_{i: z_i = k} x_i, λ_2 + n_k)
   Predictive density for each sample:
     p(x_{N+1} | x, z, α, λ) = Σ_k E[π_k(v) | γ_1, ..., γ_K] p(x_{N+1} | τ_k)
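Again a hedged sketch rather than the paper's algorithm verbatim: one blocked Gibbs sweep for the same categorical/Dirichlet setup, with the truncation level T and data layout assumed as before:

```python
import numpy as np

def blocked_gibbs_sweep(x, z, alpha, lam, V, T, rng=None):
    """One sweep of blocked Gibbs sampling for a truncated DP mixture of categoricals.

    x: int array of N observations (word ids in 0..V-1)
    z: int array of N current cluster indicators in 0..T-1 (modified in place)
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(x)
    # counts per truncated component
    n_k = np.bincount(z, minlength=T).astype(float)
    counts = np.zeros((T, V))
    np.add.at(counts, (z, x), 1.0)

    # 1. Sample the stick lengths V_k ~ Beta(1 + n_k, alpha + n_{k+1..T})
    n_greater = np.concatenate((np.cumsum(n_k[::-1])[::-1][1:], [0.0]))
    v = rng.beta(1.0 + n_k, alpha + n_greater)
    v[-1] = 1.0                                     # truncation: last stick takes the rest
    pi = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))

    # 2. Sample the component parameters eta*_k from their Dirichlet posteriors
    eta = np.vstack([rng.dirichlet(lam + counts[k]) for k in range(T)])

    # 3. Sample the indicators Z_n proportional to pi_k * p(x_n | eta*_k)
    probs = pi[None, :] * eta[:, x].T               # shape (N, T)
    probs /= probs.sum(axis=1, keepdims=True)
    for n in range(N):
        z[n] = rng.choice(T, p=probs[n])
    return z, v, eta
```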

13. Variational inference (recap)

14. Standard EM
   L(q, θ) = ln p(X | θ) − KL(q || p) is a lower bound on the incomplete log-likelihood ln p(X | θ).
   E-step: with θ_old fixed, return the q_new that maximizes L(q, θ_old) wrt. q(Z). Now KL(q_new || p_old) = 0.
   M-step: with q_new fixed, return the θ_new that maximizes L(q_new, θ) wrt. θ.
   If L(q_new, θ_new) > L(q_new, θ_old): ln p(X | θ_new) > ln p(X | θ_old), and hence KL(q_new || p_new) > 0.
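For reference (the standard decomposition behind the E- and M-steps, not specific to these slides), where q ranges over distributions of the latent variables Z:

```latex
\ln p(X \mid \theta)
  = \underbrace{\mathbb{E}_{q(Z)}\!\left[\ln \frac{p(X, Z \mid \theta)}{q(Z)}\right]}_{\mathcal{L}(q,\theta)}
  \;+\; \underbrace{\mathrm{KL}\!\left(q(Z)\,\|\,p(Z \mid X, \theta)\right)}_{\ge\, 0}
```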

15. Variational inference
   Variational inference is applicable when you have to compute an intractable posterior over latent variables p(W | X).
   Basic idea: replace the exact but intractable posterior p(W | X) with a tractable approximate posterior q(W | X, ν).
   q(W | X, ν) is from a family of simpler distributions over the latent variables W that is defined by a set of free variational parameters ν.
   Unlike in EM, KL(q || p) > 0 for any q, since q only approximates p.

16. Variational EM
   Initialization: define an initial model θ_old and variational distribution q(W | X, ν).
   E-step:
   - Find the ν that maximizes the variational lower bound, i.e. fits q(W | X, ν) to the true posterior p(W | X, θ_old)
   - Compute the required expectations under the new variational distribution q(W | X, ν)
   M-step:
   - Find model parameters θ_new that maximize the expectation of p(W, X | θ) under the variational posterior q(W | X, ν)
   - Set θ_old := θ_new
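A skeleton of this loop, purely as a sketch: the update functions, their names, and their signatures are placeholders I am assuming, not anything defined on the slides:

```python
def variational_em(x, init_theta, init_nu, elbo, update_nu, update_theta,
                   max_iter=100, tol=1e-6):
    """Generic variational EM loop.

    update_nu(x, theta, nu)    -> new variational parameters (E-step)
    update_theta(x, theta, nu) -> new model parameters       (M-step)
    elbo(x, theta, nu)         -> the variational lower bound, used to monitor convergence
    """
    theta, nu = init_theta, init_nu
    bound = -float("inf")
    for _ in range(max_iter):
        nu = update_nu(x, theta, nu)        # E-step: fit q(W | x, nu) to p(W | x, theta)
        theta = update_theta(x, theta, nu)  # M-step: maximize E_q[log p(W, x | theta)]
        new_bound = elbo(x, theta, nu)
        if new_bound - bound < tol:         # the bound is non-decreasing under exact updates
            break
        bound = new_bound
    return theta, nu
```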

17. Blei and Jordan's mean-field variational inference for DPs

18. Variational inference
   Define a family of variational distributions q_ν(w) with variational parameters ν = ν_1 ... ν_M that are specific to each observation x_i.
   Set ν to minimize the KL-divergence between q_ν(w) and p(w | x, θ):
     D(q_ν(w) || p(w | x, θ)) = E_q[log q_ν(W)] − E_q[log p(W, x | θ)] + log p(x | θ)
   (Here, log p(x | θ) can be ignored when finding q.)
   This is equivalent to maximizing a lower bound on log p(x | θ), since D(q_ν(w) || p(w | x, θ)) ≥ 0:
     log p(x | θ) = E_q[log p(W, x | θ)] − E_q[log q_ν(W)] + D(q_ν(w) || p(w | x, θ))
     log p(x | θ) ≥ E_q[log p(W, x | θ)] − E_q[log q_ν(W)]

19. q_ν(W) for DPMs
   Blei and Jordan again use the stick-breaking construction. Hence, the latent variables are W = (V, η*, Z):
   - V: the T − 1 truncated stick lengths
   - η*: the T component parameters
   - Z: the cluster assignments of the N data points

20. Variational inference for DPMs
   In general: log p(x | θ) ≥ E_q[log p(W, x | θ)] − E_q[log q_ν(W)]
   For DPMs: θ = (α, λ); W = (V, η*, Z)
     log p(x | α, λ) ≥ E_q[log p(V | α)] + E_q[log p(η* | λ)]
                       + Σ_n ( E_q[log p(Z_n | V)] + E_q[log p(x_n | Z_n)] )
                       − E_q[log q_ν(V, η*, Z)]
   Problem: V = {V_1, V_2, ...} and η* = {η*_1, η*_2, ...} are infinite.
   Solution: use a truncated representation.

21. Variational approximations q_ν(v, η*, z)
   (Plate diagram: γ_t → V_t and τ_t → η*_t for t = 1..T; φ_n → Z_n for n = 1..N)
   The variational parameters are ν = (γ_{1..T−1}, τ_{1..T}, φ_{1..N}):
     q_ν(v, η*, z) = ∏_{t<T} q_{γ_t}(v_t) ∏_{t≤T} q_{τ_t}(η*_t) ∏_{n≤N} q_{φ_n}(z_n)
   - q_{γ_t}(v_t): Beta distributions with variational parameters γ_t
   - q_{τ_t}(η*_t): conjugate priors for η, with parameters τ_t
   - q_{φ_n}(z_n): multinomials with variational parameters φ_n
   (A sketch of the coordinate update for φ_n follows below.)
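To make the mean-field factorization concrete, here is a hedged sketch of the coordinate update for the responsibilities φ_n in the categorical/Dirichlet case. The parameter shapes are my assumptions, and for simplicity all T sticks get free Beta factors rather than fixing v_T = 1 as the paper does:

```python
import numpy as np
from scipy.special import digamma

def update_phi(x, gamma, tau):
    """Mean-field update for the cluster responsibilities phi_n (categorical likelihood).

    x:     int array of N observations (word ids in 0..V-1)
    gamma: (T, 2) Beta variational parameters for the stick lengths V_t
    tau:   (T, V) Dirichlet variational parameters for the cluster parameters eta*_t

    phi_{n,t} is proportional to
      exp( E_q[log V_t] + sum_{j<t} E_q[log(1 - V_j)] + E_q[log p(x_n | eta*_t)] ).
    """
    # E_q[log V_t] and E_q[log(1 - V_t)] under q_{gamma_t}(v_t) = Beta(gamma_t1, gamma_t2)
    e_log_v = digamma(gamma[:, 0]) - digamma(gamma.sum(axis=1))
    e_log_1mv = digamma(gamma[:, 1]) - digamma(gamma.sum(axis=1))
    e_log_pi = e_log_v + np.concatenate(([0.0], np.cumsum(e_log_1mv[:-1])))

    # E_q[log eta*_{t,w}] under q_{tau_t}(eta*_t) = Dirichlet(tau_t)
    e_log_eta = digamma(tau) - digamma(tau.sum(axis=1, keepdims=True))

    log_phi = e_log_pi[None, :] + e_log_eta[:, x].T   # shape (N, T)
    log_phi -= log_phi.max(axis=1, keepdims=True)     # stabilize before exponentiating
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)
```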
