Introduction To Machine Learning
David Sontag, New York University
Lecture 21, April 14, 2016


  1. Introduction To Machine Learning. David Sontag, New York University. Lecture 21, April 14, 2016.

  2. Expectation maximization. Algorithm is as follows:
     (1) Write down the complete log-likelihood $\log p(x, z; \theta)$ in such a way that it is linear in $z$.
     (2) Initialize $\theta^0$, e.g. at random or using a good first guess.
     (3) Repeat until convergence:
         $$\theta^{t+1} = \arg\max_{\theta} \sum_{m=1}^{M} E_{p(z^m \mid x^m; \theta^t)}\big[\log p(x^m, Z; \theta)\big]$$
     Notice that $\log p(x^m, Z; \theta)$ is a random function because $Z$ is unknown. By linearity of expectation, the objective decomposes into expectation terms and data terms. The "E" step corresponds to computing the objective (i.e., the expectations); the "M" step corresponds to maximizing the objective.
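     A minimal Python sketch of the loop on the slide above, assuming hypothetical e_step and m_step functions supplied for a particular model (neither function name comes from the lecture):

        import numpy as np

        def em(x, theta0, e_step, m_step, max_iters=100, tol=1e-6):
            """Generic EM: alternate between computing the expected complete
            log-likelihood (E step) and maximizing it over theta (M step)."""
            theta = theta0
            prev_obj = -np.inf
            for _ in range(max_iters):
                # E step: posteriors p(z^m | x^m; theta^t) and the objective
                # evaluated at the current parameters.
                posteriors, obj = e_step(x, theta)
                # M step: theta^{t+1} = argmax of the expected objective.
                theta = m_step(x, posteriors)
                if obj - prev_obj < tol:   # the objective never decreases
                    break
                prev_obj = obj
            return theta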

  3. Derivation of EM algorithm. [Figure from the tutorial by Sean Borman: the log-likelihood $L(\theta)$ and its lower bound $l(\theta \mid \theta^n)$, which touch at $\theta^n$, i.e. $L(\theta^n) = l(\theta^n \mid \theta^n)$; the next iterate $\theta^{n+1}$ maximizes the bound, so $L(\theta^{n+1}) \ge l(\theta^{n+1} \mid \theta^n)$.]
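     A LaTeX sketch of the standard lower-bound argument behind this figure, written for a single example $x$ (summing over examples is identical term by term); the derivation is not spelled out on the slide and follows Borman's tutorial:

        \begin{align*}
        L(\theta) = \log p(x;\theta)
          &= \log \sum_{z} p(z \mid x; \theta^n)\,
             \frac{p(x, z; \theta)}{p(z \mid x; \theta^n)} \\
          &\ge \sum_{z} p(z \mid x; \theta^n)\,
             \log \frac{p(x, z; \theta)}{p(z \mid x; \theta^n)}
             =: l(\theta \mid \theta^n)
             && \text{(Jensen's inequality)}
        \end{align*}

     Equality holds at $\theta = \theta^n$, so choosing $\theta^{n+1}$ to maximize $l(\cdot \mid \theta^n)$ gives $L(\theta^{n+1}) \ge l(\theta^{n+1} \mid \theta^n) \ge l(\theta^n \mid \theta^n) = L(\theta^n)$.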

  4. Application to mixture models. [Plate diagram: prior distribution $\theta$ over topics; topic-word distributions $\beta$; topic $z_d$ of doc $d$; words $w_{id}$, $i = 1$ to $N$, $d = 1$ to $D$.] This model is a type of (discrete) mixture model, called multinomial naive Bayes (a word can appear multiple times). The document is generated from a single topic.

  5. EM for mixture models. [Same plate diagram as above: prior $\theta$ over topics, topic-word distributions $\beta$, topic $z_d$ of doc $d$, words $w_{id}$.]
     The complete likelihood is $p(w, Z; \theta, \beta) = \prod_{d=1}^{D} p(w_d, Z_d; \theta, \beta)$, where
         $$p(w_d, Z_d; \theta, \beta) = \theta_{Z_d} \prod_{i=1}^{N} \beta_{Z_d, w_{id}}$$
     Trick #1: re-write this as
         $$p(w_d, Z_d; \theta, \beta) = \prod_{k=1}^{K} \theta_k^{1[Z_d = k]} \prod_{i=1}^{N} \prod_{k=1}^{K} \beta_{k, w_{id}}^{1[Z_d = k]}$$

  6. EM for mixture models. Thus, the complete log-likelihood is:
         $$\log p(w, Z; \theta, \beta) = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} 1[Z_d = k] \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} 1[Z_d = k] \log \beta_{k, w_{id}} \right]$$
     In the "E" step, we take the expectation of the complete log-likelihood with respect to $p(z \mid w; \theta^t, \beta^t)$, applying linearity of expectation, i.e.
         $$E_{p(z \mid w; \theta^t, \beta^t)}[\log p(w, z; \theta, \beta)] = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
     In the "M" step, we maximize this with respect to $\theta$ and $\beta$.
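     A minimal numpy sketch of this E step for the multinomial mixture, computing $p(Z_d = k \mid w_d; \theta^t, \beta^t)$ for every document; the toy counts, theta, and beta below are made up for illustration:

        import numpy as np

        def e_step(counts, theta, beta):
            """Posterior over topics for each document in a multinomial naive
            Bayes mixture. counts: (D, W) word-count matrix; theta: (K,) topic
            prior; beta: (K, W) topic-word distributions. Returns (D, K)."""
            # log p(w_d, Z_d = k) = log theta_k + sum_w N_dw * log beta_{k,w}
            log_joint = np.log(theta)[None, :] + counts @ np.log(beta).T
            # Normalize in log space for numerical stability.
            log_joint -= log_joint.max(axis=1, keepdims=True)
            resp = np.exp(log_joint)
            return resp / resp.sum(axis=1, keepdims=True)

        # Toy example: D=2 documents, W=3 word types, K=2 topics.
        counts = np.array([[3, 0, 1], [0, 2, 2]], dtype=float)
        theta = np.array([0.6, 0.4])
        beta = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
        print(e_step(counts, theta, beta))   # each row sums to 1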

  7. EM for mixture models. Just as with complete data, this maximization can be done in closed form. First, re-write the expected complete log-likelihood from
         $$\sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
     to
         $$\sum_{k=1}^{K} \log \theta_k \sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t) + \sum_{k=1}^{K} \sum_{w=1}^{W} \log \beta_{k, w} \sum_{d=1}^{D} N_{dw}\, p(Z_d = k \mid w_d; \theta^t, \beta^t)$$
     We then have that
         $$\theta_k^{t+1} = \frac{\sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t)}{\sum_{\hat{k}=1}^{K} \sum_{d=1}^{D} p(Z_d = \hat{k} \mid w_d; \theta^t, \beta^t)}$$
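     A matching numpy sketch of the closed-form M step; the $\beta$ update coded here follows the same normalization pattern, though only the $\theta$ update appears explicitly on the slide, and the toy inputs are illustrative:

        import numpy as np

        def m_step(counts, resp):
            """Closed-form M step for the multinomial mixture.
            counts: (D, W) word counts N_dw; resp: (D, K) posteriors
            p(Z_d = k | w_d). Returns updated theta (K,) and beta (K, W)."""
            # theta_k: average responsibility for topic k across documents.
            theta = resp.sum(axis=0) / resp.sum()
            # beta_{k,w}: expected count of word w under topic k, normalized.
            expected_counts = resp.T @ counts                    # (K, W)
            beta = expected_counts / expected_counts.sum(axis=1, keepdims=True)
            return theta, beta

        # Toy usage for D=2 docs, K=2 topics, W=3 word types.
        counts = np.array([[3, 0, 1], [0, 2, 2]], dtype=float)
        resp = np.array([[0.9, 0.1], [0.2, 0.8]])
        theta, beta = m_step(counts, resp)
        print(theta, beta)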

  8. Latent Dirichlet allocation (LDA). Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents. [Figure: example documents grouped by topic.] They have many applications in information retrieval, document summarization, and classification. [Figure: a new document, "What is this document about?", whose words $w_1, \ldots, w_N$ yield a distribution over topics $\theta$: weather .50, finance .49, sports .01.] LDA is one of the simplest and most widely used topic models.

  9. Generative model for a document in LDA.
     (1) Sample the document's topic distribution $\theta$ (aka topic vector): $\theta \sim \text{Dirichlet}(\alpha_{1:T})$, where the $\{\alpha_t\}_{t=1}^{T}$ are fixed hyperparameters. Thus $\theta$ is a distribution over $T$ topics with mean $\theta_t = \alpha_t / \sum_{t'} \alpha_{t'}$.
     (2) For $i = 1$ to $N$, sample the topic $z_i$ of the $i$'th word: $z_i \mid \theta \sim \theta$.
     (3) ... and then sample the actual word $w_i$ from the $z_i$'th topic: $w_i \mid z_i \sim \beta_{z_i}$, where $\{\beta_t\}_{t=1}^{T}$ are the topics (a fixed collection of distributions on words).
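     A minimal numpy sketch of this three-step generative process for one document; the hyperparameters, vocabulary size, and topic matrix below are arbitrary placeholders:

        import numpy as np

        rng = np.random.default_rng(0)

        def sample_document(alpha, beta, n_words):
            """Sample one document from the LDA generative model.
            alpha: (T,) Dirichlet hyperparameters; beta: (T, W) topic-word
            distributions; n_words: number of words in the document."""
            theta = rng.dirichlet(alpha)                       # step 1: topic vector
            z = rng.choice(len(alpha), size=n_words, p=theta)  # step 2: topic per word
            w = np.array([rng.choice(beta.shape[1], p=beta[t]) for t in z])  # step 3
            return theta, z, w

        # Toy example: T=3 topics over a W=5 word vocabulary.
        alpha = np.array([0.5, 0.5, 0.5])
        beta = np.array([[0.6, 0.2, 0.1, 0.05, 0.05],
                         [0.1, 0.1, 0.6, 0.1, 0.1],
                         [0.05, 0.05, 0.1, 0.2, 0.6]])
        theta, z, w = sample_document(alpha, beta, n_words=10)
        print(theta, z, w)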

  10. Generative model for a document in LDA. (1) Sample the document's topic distribution $\theta$ (aka topic vector): $\theta \sim \text{Dirichlet}(\alpha_{1:T})$, where the $\{\alpha_t\}_{t=1}^{T}$ are hyperparameters. The Dirichlet density, defined over the simplex $\Delta = \{\theta \in \mathbb{R}^T : \forall t\ \theta_t \ge 0,\ \sum_{t=1}^{T} \theta_t = 1\}$, is:
         $$p(\theta_1, \ldots, \theta_T) \propto \prod_{t=1}^{T} \theta_t^{\alpha_t - 1}$$
     [Figure: surface plots of $\log \Pr(\theta)$ over $(\theta_1, \theta_2)$ for $T = 3$ (so $\theta_3 = 1 - \theta_1 - \theta_2$), for two settings of $\alpha_1 = \alpha_2 = \alpha_3$.]
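     A quick numerical check of this density using scipy's Dirichlet distribution (the $\alpha$ values and test point are arbitrary; scipy only adds the normalizing constant that the slide leaves implicit in the $\propto$):

        import numpy as np
        from scipy.stats import dirichlet

        alpha = np.array([2.0, 2.0, 2.0])    # symmetric hyperparameters, T = 3
        theta = np.array([0.5, 0.3, 0.2])    # a point on the simplex

        # Unnormalized log-density: sum_t (alpha_t - 1) * log theta_t
        unnormalized = np.sum((alpha - 1) * np.log(theta))
        # scipy's logpdf adds the log normalizing constant on top of this.
        print(unnormalized, dirichlet.logpdf(theta, alpha))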

  11. Generative model for a document in LDA. (3) ... and then sample the actual word $w_i$ from the $z_i$'th topic: $w_i \mid z_i \sim \beta_{z_i}$, where $\{\beta_t\}_{t=1}^{T}$ are the topics (a fixed collection of distributions on words), i.e. $\beta_t = p(w \mid z = t)$.
     [Figure: documents and three example topics, each a distribution over words:
       Topic 1: politics .0100, president .0095, obama .0090, washington .0085, religion .0060, ...
       Topic 2: religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, ...
       Topic 3: sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, ...]

  12. Example of using LDA. [Figure from Blei, Introduction to Probabilistic Topic Models, 2011: topics $\beta_1, \ldots, \beta_T$ (e.g. gene .04, dna .02, genetic .01; life .02, evolve .01, organism .01; brain .04, neuron .02, nerve .01; data .02, number .02, computer .01), a document with per-word topic assignments $z_{1d}, \ldots, z_{N_d d}$, and its topic proportions $\theta_d$.]

  13. “Plate” notation for LDA model. [Plate diagram: Dirichlet hyperparameters $\alpha$; topic distribution $\theta_d$ for document $d$; topic-word distributions $\beta$; topic $z_{id}$ of word $i$ of doc $d$; word $w_{id}$; $i = 1$ to $N$, $d = 1$ to $D$.] Variables within a plate are replicated in a conditionally independent manner.

  14. Comparison of mixture and admixture models. [Two plate diagrams. Left: prior distribution $\theta$ over topics, topic-word distributions $\beta$, topic $z_d$ of doc $d$, words $w_{id}$. Right: Dirichlet hyperparameters $\alpha$, topic distribution $\theta_d$ for document $d$, topic-word distributions $\beta$, topic $z_{id}$ of word $i$ of doc $d$, words $w_{id}$.]
     The model on the left is a mixture model, called multinomial naive Bayes (a word can appear multiple times); the document is generated from a single topic. The model on the right (LDA) is an admixture model; the document is generated from a distribution over topics.
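     A small sketch contrasting the two generative processes under toy parameters (all names and values are illustrative): the mixture model draws one topic per document, while LDA draws per-document topic proportions and then a fresh topic for every word.

        import numpy as np

        rng = np.random.default_rng(1)
        K, W, N = 3, 5, 8                         # topics, vocab size, words per doc
        beta = rng.dirichlet(np.ones(W), size=K)  # topic-word distributions (K, W)

        def sample_mixture_doc(theta):
            """Multinomial naive Bayes mixture: one topic z_d for the whole doc."""
            z_d = rng.choice(K, p=theta)
            return np.array([rng.choice(W, p=beta[z_d]) for _ in range(N)])

        def sample_lda_doc(alpha):
            """LDA (admixture): per-doc topic proportions, one topic per word."""
            theta_d = rng.dirichlet(alpha)
            z = rng.choice(K, size=N, p=theta_d)
            return np.array([rng.choice(W, p=beta[t]) for t in z])

        print(sample_mixture_doc(theta=np.array([0.5, 0.3, 0.2])))
        print(sample_lda_doc(alpha=np.array([0.5, 0.5, 0.5])))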
