Introduction To Machine Learning
David Sontag, New York University
Lecture 21, April 14, 2016


  1. Introduction To Machine Learning. David Sontag, New York University. Lecture 21, April 14, 2016.

  2. Expectation maximization. Algorithm is as follows:
     (1) Write down the complete log-likelihood $\log p(x, z; \theta)$ in such a way that it is linear in $z$.
     (2) Initialize $\theta^0$, e.g. at random or using a good first guess.
     (3) Repeat until convergence:
         $$\theta^{t+1} = \arg\max_{\theta} \sum_{m=1}^{M} E_{p(z^m \mid x^m; \theta^t)}\big[\log p(x^m, Z; \theta)\big]$$
     Notice that $\log p(x^m, Z; \theta)$ is a random function because $Z$ is unknown. By linearity of expectation, the objective decomposes into expectation terms and data terms. The "E" step corresponds to computing the objective (i.e., the expectations); the "M" step corresponds to maximizing the objective.
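     A minimal Python sketch of the loop on the slide above, assuming hypothetical e_step and m_step functions supplied for a particular model (neither function name comes from the lecture):

        import numpy as np

        def em(x, theta0, e_step, m_step, max_iters=100, tol=1e-6):
            """Generic EM: alternate between computing the expected complete
            log-likelihood (E step) and maximizing it over theta (M step)."""
            theta = theta0
            prev_obj = -np.inf
            for _ in range(max_iters):
                # E step: posteriors p(z^m | x^m; theta^t) and the objective
                # evaluated at the current parameters.
                posteriors, obj = e_step(x, theta)
                # M step: theta^{t+1} = argmax of the expected objective.
                theta = m_step(x, posteriors)
                if obj - prev_obj < tol:   # the objective never decreases
                    break
                prev_obj = obj
            return theta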

  3. Derivation of EM algorithm. [Figure from the tutorial by Sean Borman: the log-likelihood $L(\theta)$ and its lower bound $l(\theta \mid \theta^n)$, which touch at $\theta^n$, i.e. $L(\theta^n) = l(\theta^n \mid \theta^n)$; the next iterate $\theta^{n+1}$ maximizes the bound, so $L(\theta^{n+1}) \ge l(\theta^{n+1} \mid \theta^n)$.]
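     A LaTeX sketch of the standard lower-bound argument behind this figure, written for a single example $x$ (summing over examples is identical term by term); the derivation is not spelled out on the slide and follows Borman's tutorial:

        \begin{align*}
        L(\theta) = \log p(x;\theta)
          &= \log \sum_{z} p(z \mid x; \theta^n)\,
             \frac{p(x, z; \theta)}{p(z \mid x; \theta^n)} \\
          &\ge \sum_{z} p(z \mid x; \theta^n)\,
             \log \frac{p(x, z; \theta)}{p(z \mid x; \theta^n)}
             =: l(\theta \mid \theta^n)
             && \text{(Jensen's inequality)}
        \end{align*}

     Equality holds at $\theta = \theta^n$, so choosing $\theta^{n+1}$ to maximize $l(\cdot \mid \theta^n)$ gives $L(\theta^{n+1}) \ge l(\theta^{n+1} \mid \theta^n) \ge l(\theta^n \mid \theta^n) = L(\theta^n)$.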

  4. Application to mixture models. [Plate diagram: prior distribution $\theta$ over topics; topic-word distributions $\beta$; topic $z_d$ of doc $d$; words $w_{id}$, $i = 1$ to $N$, $d = 1$ to $D$.] This model is a type of (discrete) mixture model, called multinomial naive Bayes (a word can appear multiple times). The document is generated from a single topic.

  5. EM for mixture models. [Same plate diagram as above: prior $\theta$ over topics, topic-word distributions $\beta$, topic $z_d$ of doc $d$, words $w_{id}$.]
     The complete likelihood is $p(w, Z; \theta, \beta) = \prod_{d=1}^{D} p(w_d, Z_d; \theta, \beta)$, where
         $$p(w_d, Z_d; \theta, \beta) = \theta_{Z_d} \prod_{i=1}^{N} \beta_{Z_d, w_{id}}$$
     Trick #1: re-write this as
         $$p(w_d, Z_d; \theta, \beta) = \prod_{k=1}^{K} \theta_k^{1[Z_d = k]} \prod_{i=1}^{N} \prod_{k=1}^{K} \beta_{k, w_{id}}^{1[Z_d = k]}$$

  6. EM for mixture models. Thus, the complete log-likelihood is:
         $$\log p(w, Z; \theta, \beta) = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} 1[Z_d = k] \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} 1[Z_d = k] \log \beta_{k, w_{id}} \right]$$
     In the "E" step, we take the expectation of the complete log-likelihood with respect to $p(z \mid w; \theta^t, \beta^t)$, applying linearity of expectation, i.e.
         $$E_{p(z \mid w; \theta^t, \beta^t)}[\log p(w, z; \theta, \beta)] = \sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
     In the "M" step, we maximize this with respect to $\theta$ and $\beta$.
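     A minimal numpy sketch of this E step for the multinomial mixture, computing $p(Z_d = k \mid w_d; \theta^t, \beta^t)$ for every document; the toy counts, theta, and beta below are made up for illustration:

        import numpy as np

        def e_step(counts, theta, beta):
            """Posterior over topics for each document in a multinomial naive
            Bayes mixture. counts: (D, W) word-count matrix; theta: (K,) topic
            prior; beta: (K, W) topic-word distributions. Returns (D, K)."""
            # log p(w_d, Z_d = k) = log theta_k + sum_w N_dw * log beta_{k,w}
            log_joint = np.log(theta)[None, :] + counts @ np.log(beta).T
            # Normalize in log space for numerical stability.
            log_joint -= log_joint.max(axis=1, keepdims=True)
            resp = np.exp(log_joint)
            return resp / resp.sum(axis=1, keepdims=True)

        # Toy example: D=2 documents, W=3 word types, K=2 topics.
        counts = np.array([[3, 0, 1], [0, 2, 2]], dtype=float)
        theta = np.array([0.6, 0.4])
        beta = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
        print(e_step(counts, theta, beta))   # each row sums to 1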

  7. EM for mixture models. Just as with complete data, this maximization can be done in closed form. First, re-write the expected complete log-likelihood from
         $$\sum_{d=1}^{D} \left[ \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \theta_k + \sum_{i=1}^{N} \sum_{k=1}^{K} p(Z_d = k \mid w; \theta^t, \beta^t) \log \beta_{k, w_{id}} \right]$$
     to
         $$\sum_{k=1}^{K} \log \theta_k \sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t) + \sum_{k=1}^{K} \sum_{w=1}^{W} \log \beta_{k, w} \sum_{d=1}^{D} N_{dw}\, p(Z_d = k \mid w_d; \theta^t, \beta^t)$$
     We then have that
         $$\theta_k^{t+1} = \frac{\sum_{d=1}^{D} p(Z_d = k \mid w_d; \theta^t, \beta^t)}{\sum_{\hat{k}=1}^{K} \sum_{d=1}^{D} p(Z_d = \hat{k} \mid w_d; \theta^t, \beta^t)}$$
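     A matching numpy sketch of the closed-form M step; the $\beta$ update coded here follows the same normalization pattern, though only the $\theta$ update appears explicitly on the slide, and the toy inputs are illustrative:

        import numpy as np

        def m_step(counts, resp):
            """Closed-form M step for the multinomial mixture.
            counts: (D, W) word counts N_dw; resp: (D, K) posteriors
            p(Z_d = k | w_d). Returns updated theta (K,) and beta (K, W)."""
            # theta_k: average responsibility for topic k across documents.
            theta = resp.sum(axis=0) / resp.sum()
            # beta_{k,w}: expected count of word w under topic k, normalized.
            expected_counts = resp.T @ counts                    # (K, W)
            beta = expected_counts / expected_counts.sum(axis=1, keepdims=True)
            return theta, beta

        # Toy usage for D=2 docs, K=2 topics, W=3 word types.
        counts = np.array([[3, 0, 1], [0, 2, 2]], dtype=float)
        resp = np.array([[0.9, 0.1], [0.2, 0.8]])
        theta, beta = m_step(counts, resp)
        print(theta, beta)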

  8. Latent Dirichlet allocation (LDA). Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents. [Figure: example documents grouped by topic.] They have many applications in information retrieval, document summarization, and classification. [Figure: a new document, "What is this document about?", whose words $w_1, \ldots, w_N$ yield a distribution over topics $\theta$: weather .50, finance .49, sports .01.] LDA is one of the simplest and most widely used topic models.

  9. Generative model for a document in LDA.
     (1) Sample the document's topic distribution $\theta$ (aka topic vector): $\theta \sim \text{Dirichlet}(\alpha_{1:T})$, where the $\{\alpha_t\}_{t=1}^{T}$ are fixed hyperparameters. Thus $\theta$ is a distribution over $T$ topics with mean $\theta_t = \alpha_t / \sum_{t'} \alpha_{t'}$.
     (2) For $i = 1$ to $N$, sample the topic $z_i$ of the $i$'th word: $z_i \mid \theta \sim \theta$.
     (3) ... and then sample the actual word $w_i$ from the $z_i$'th topic: $w_i \mid z_i \sim \beta_{z_i}$, where $\{\beta_t\}_{t=1}^{T}$ are the topics (a fixed collection of distributions on words).
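     A minimal numpy sketch of this three-step generative process for one document; the hyperparameters, vocabulary size, and topic matrix below are arbitrary placeholders:

        import numpy as np

        rng = np.random.default_rng(0)

        def sample_document(alpha, beta, n_words):
            """Sample one document from the LDA generative model.
            alpha: (T,) Dirichlet hyperparameters; beta: (T, W) topic-word
            distributions; n_words: number of words in the document."""
            theta = rng.dirichlet(alpha)                       # step 1: topic vector
            z = rng.choice(len(alpha), size=n_words, p=theta)  # step 2: topic per word
            w = np.array([rng.choice(beta.shape[1], p=beta[t]) for t in z])  # step 3
            return theta, z, w

        # Toy example: T=3 topics over a W=5 word vocabulary.
        alpha = np.array([0.5, 0.5, 0.5])
        beta = np.array([[0.6, 0.2, 0.1, 0.05, 0.05],
                         [0.1, 0.1, 0.6, 0.1, 0.1],
                         [0.05, 0.05, 0.1, 0.2, 0.6]])
        theta, z, w = sample_document(alpha, beta, n_words=10)
        print(theta, z, w)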

  10. Generative model for a document in LDA. (1) Sample the document's topic distribution $\theta$ (aka topic vector): $\theta \sim \text{Dirichlet}(\alpha_{1:T})$, where the $\{\alpha_t\}_{t=1}^{T}$ are hyperparameters. The Dirichlet density, defined over the simplex $\Delta = \{\theta \in \mathbb{R}^T : \forall t\ \theta_t \ge 0,\ \sum_{t=1}^{T} \theta_t = 1\}$, is:
         $$p(\theta_1, \ldots, \theta_T) \propto \prod_{t=1}^{T} \theta_t^{\alpha_t - 1}$$
     [Figure: surface plots of $\log \Pr(\theta)$ over $(\theta_1, \theta_2)$ for $T = 3$ (so $\theta_3 = 1 - \theta_1 - \theta_2$), for two settings of $\alpha_1 = \alpha_2 = \alpha_3$.]
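     A quick numerical check of this density using scipy's Dirichlet distribution (the $\alpha$ values and test point are arbitrary; scipy only adds the normalizing constant that the slide leaves implicit in the $\propto$):

        import numpy as np
        from scipy.stats import dirichlet

        alpha = np.array([2.0, 2.0, 2.0])    # symmetric hyperparameters, T = 3
        theta = np.array([0.5, 0.3, 0.2])    # a point on the simplex

        # Unnormalized log-density: sum_t (alpha_t - 1) * log theta_t
        unnormalized = np.sum((alpha - 1) * np.log(theta))
        # scipy's logpdf adds the log normalizing constant on top of this.
        print(unnormalized, dirichlet.logpdf(theta, alpha))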

  11. Generative model for a document in LDA. (3) ... and then sample the actual word $w_i$ from the $z_i$'th topic: $w_i \mid z_i \sim \beta_{z_i}$, where $\{\beta_t\}_{t=1}^{T}$ are the topics (a fixed collection of distributions on words), i.e. $\beta_t = p(w \mid z = t)$.
     [Figure: documents and three example topics, each a distribution over words:
       Topic 1: politics .0100, president .0095, obama .0090, washington .0085, religion .0060, ...
       Topic 2: religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, ...
       Topic 3: sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, ...]

  12. Example of using LDA. [Figure from Blei, Introduction to Probabilistic Topic Models, 2011: topics $\beta_1, \ldots, \beta_T$ (e.g. gene .04, dna .02, genetic .01; life .02, evolve .01, organism .01; brain .04, neuron .02, nerve .01; data .02, number .02, computer .01), a document with per-word topic assignments $z_{1d}, \ldots, z_{N_d d}$, and its topic proportions $\theta_d$.]

  13. “Plate” notation for LDA model. [Plate diagram: Dirichlet hyperparameters $\alpha$; topic distribution $\theta_d$ for document $d$; topic-word distributions $\beta$; topic $z_{id}$ of word $i$ of doc $d$; word $w_{id}$; $i = 1$ to $N$, $d = 1$ to $D$.] Variables within a plate are replicated in a conditionally independent manner.

  14. Comparison of mixture and admixture models. [Two plate diagrams. Left: prior distribution $\theta$ over topics, topic-word distributions $\beta$, topic $z_d$ of doc $d$, words $w_{id}$. Right: Dirichlet hyperparameters $\alpha$, topic distribution $\theta_d$ for document $d$, topic-word distributions $\beta$, topic $z_{id}$ of word $i$ of doc $d$, words $w_{id}$.]
     The model on the left is a mixture model, called multinomial naive Bayes (a word can appear multiple times); the document is generated from a single topic. The model on the right (LDA) is an admixture model; the document is generated from a distribution over topics.
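     A small sketch contrasting the two generative processes under toy parameters (all names and values are illustrative): the mixture model draws one topic per document, while LDA draws per-document topic proportions and then a fresh topic for every word.

        import numpy as np

        rng = np.random.default_rng(1)
        K, W, N = 3, 5, 8                         # topics, vocab size, words per doc
        beta = rng.dirichlet(np.ones(W), size=K)  # topic-word distributions (K, W)

        def sample_mixture_doc(theta):
            """Multinomial naive Bayes mixture: one topic z_d for the whole doc."""
            z_d = rng.choice(K, p=theta)
            return np.array([rng.choice(W, p=beta[z_d]) for _ in range(N)])

        def sample_lda_doc(alpha):
            """LDA (admixture): per-doc topic proportions, one topic per word."""
            theta_d = rng.dirichlet(alpha)
            z = rng.choice(K, size=N, p=theta_d)
            return np.array([rng.choice(W, p=beta[t]) for t in z])

        print(sample_mixture_doc(theta=np.array([0.5, 0.3, 0.2])))
        print(sample_lda_doc(alpha=np.array([0.5, 0.5, 0.5])))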
