Probabilistic Graphical Models
David Sontag, New York University
Lecture 2, February 7, 2013
Bayesian networks (reminder of last lecture)

A Bayesian network is specified by a directed acyclic graph G = (V, E) with:

1. One node i ∈ V for each random variable X_i
2. One conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values

Corresponds 1-1 with a particular factorization of the joint distribution:

p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

Powerful framework for designing algorithms to perform probability computations
Example

Consider the following Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with edges D → G, I → G, I → S, G → L, and CPDs:

p(d): d^0 = 0.6, d^1 = 0.4
p(i): i^0 = 0.7, i^1 = 0.3
p(g | i, d), over (g^1, g^2, g^3):
  i^0, d^0: (0.3, 0.4, 0.3)
  i^0, d^1: (0.05, 0.25, 0.7)
  i^1, d^0: (0.9, 0.08, 0.02)
  i^1, d^1: (0.5, 0.3, 0.2)
p(s | i), over (s^0, s^1): i^0: (0.95, 0.05); i^1: (0.2, 0.8)
p(l | g), over (l^0, l^1): g^1: (0.1, 0.9); g^2: (0.4, 0.6); g^3: (0.99, 0.01)

What is its joint distribution?

p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
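The factorization on this slide can be evaluated directly: store each CPD as a lookup table and multiply. A minimal sketch, using the CPD values from the tables above (states encoded as 0/1 for binary variables and 1–3 for the grade):

```python
# Student network joint: p(d, i, g, s, l) = p(d) p(i) p(g|i,d) p(s|i) p(l|g)
p_d = {0: 0.6, 1: 0.4}
p_i = {0: 0.7, 1: 0.3}
p_g = {  # p(g | i, d), g in {1, 2, 3}
    (0, 0): {1: 0.3, 2: 0.4, 3: 0.3},
    (0, 1): {1: 0.05, 2: 0.25, 3: 0.7},
    (1, 0): {1: 0.9, 2: 0.08, 3: 0.02},
    (1, 1): {1: 0.5, 2: 0.3, 3: 0.2},
}
p_s = {0: {0: 0.95, 1: 0.05}, 1: {0: 0.2, 1: 0.8}}  # p(s | i)
p_l = {1: {0: 0.1, 1: 0.9}, 2: {0: 0.4, 1: 0.6}, 3: {0: 0.99, 1: 0.01}}  # p(l | g)

def joint(d, i, g, s, l):
    """One term per node, each conditioned only on its parents."""
    return p_d[d] * p_i[i] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]

# Sanity check: the factorized joint sums to 1 over all assignments.
total = sum(joint(d, i, g, s, l)
            for d in (0, 1) for i in (0, 1)
            for g in (1, 2, 3) for s in (0, 1) for l in (0, 1))
```

Note how cheap the representation is: five small tables instead of a single table with 2 · 2 · 3 · 2 · 2 = 48 entries.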
D-separation ("directed separation") in Bayesian networks

Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation

Look to see if there is an active path between X and Z when the variables Y are observed

[Figure: (a), (b) — example paths between X and Z through Y]

If no such path exists, then X and Z are d-separated with respect to Y

d-separation reduces statistical independencies (hard) to connectivity in graphs (easy)

Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query
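One standard way to implement this check (equivalent to searching for active paths, though not the only algorithm) is via the moralized ancestral graph: restrict to ancestors of X ∪ Z ∪ Y, connect co-parents, drop directions, delete the observed nodes, and test connectivity. A sketch, with the student network as the example:

```python
from itertools import combinations

def d_separated(parents, X, Z, Y):
    """Is X independent of Z given Y in the DAG? parents: node -> parent list."""
    # 1. Restrict to the ancestors of X ∪ Z ∪ Y.
    relevant, stack = set(), list(X | Z | Y)
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, []))
    # 2. Moralize: link each node to its parents, link co-parents, drop directions.
    adj = {n: set() for n in relevant}
    for n in relevant:
        ps = [p for p in parents.get(n, []) if p in relevant]
        for p in ps:
            adj[n].add(p); adj[p].add(n)
        for a, b in combinations(ps, 2):
            adj[a].add(b); adj[b].add(a)
    # 3. Delete observed nodes Y and test whether X can still reach Z.
    seen, stack = set(), [x for x in X if x not in Y]
    while stack:
        n = stack.pop()
        if n in seen or n in Y:
            continue
        seen.add(n)
        stack.extend(adj[n] - Y)
    return not (seen & Z)

# Student network: D → G ← I → S, G → L
par = {'G': ['D', 'I'], 'S': ['I'], 'L': ['G']}
d_separated(par, {'D'}, {'I'}, set())   # v-structure at G blocks the path
d_separated(par, {'D'}, {'I'}, {'G'})   # observing G activates it
```

The two calls at the end show the v-structure behavior: D and I are d-separated marginally, but observing their common child G couples them.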
Independence maps

Let I(G) be the set of all conditional independencies implied by the directed acyclic graph (DAG) G

Let I(p) denote the set of all conditional independencies that hold for the joint distribution p

A DAG G is an I-map (independence map) of a distribution p if I(G) ⊆ I(p)

A fully connected DAG G is an I-map for any distribution, since I(G) = ∅ ⊆ I(p) for all p

G is a minimal I-map for p if the removal of even a single edge makes it not an I-map

A distribution may have several minimal I-maps; each corresponds to a specific node ordering

G is a perfect map (P-map) for distribution p if I(G) = I(p)
Equivalent structures

Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions)

Which of these are equivalent?

[Figure: four three-node structures (a)–(d) over X, Y, Z]
Equivalent structures

Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions)

Are these equivalent?

[Figure: two five-node networks over V, W, X, Y, Z]
What are some frequently used graphical models?
Hidden Markov models

[Figure: chain Y1 → Y2 → ... → Y6, with an emission Yt → Xt at each step]

Frequently used for speech recognition and part-of-speech tagging

Joint distribution factors as:

p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)

p(y_1) is the distribution for the starting state
p(y_t | y_{t-1}) is the transition probability between any two states
p(x_t | y_t) is the emission probability

What are the conditional independencies here? For example, Y_1 ⊥ {Y_3, ..., Y_6} | Y_2
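The factorization above turns into a single product over time steps. A minimal sketch with a two-state, two-symbol HMM; the parameter values here are illustrative, not from the lecture:

```python
import numpy as np

pi = np.array([0.6, 0.4])      # p(y1): starting-state distribution
A = np.array([[0.7, 0.3],      # A[i, j] = p(y_t = j | y_{t-1} = i): transitions
              [0.2, 0.8]])
B = np.array([[0.9, 0.1],      # B[i, k] = p(x_t = k | y_t = i): emissions
              [0.3, 0.7]])

def hmm_joint(y, x):
    """p(y, x) = p(y1) p(x1|y1) * prod_{t>=2} p(yt|y_{t-1}) p(xt|yt)."""
    prob = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(y)):
        prob *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return prob

# Summing over every length-2 state/observation sequence gives 1.
total = sum(hmm_joint([y1, y2], [x1, x2])
            for y1 in (0, 1) for y2 in (0, 1)
            for x1 in (0, 1) for x2 in (0, 1))
```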
Hidden Markov models

[Figure: chain Y1 → Y2 → ... → Y6, with an emission Yt → Xt at each step]

Joint distribution factors as:

p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)

A homogeneous HMM uses the same parameters (β and α below) for each transition and emission distribution (parameter sharing):

p(y, x) = p(y_1) α_{x_1, y_1} ∏_{t=2}^{T} β_{y_t, y_{t-1}} α_{x_t, y_t}

How many parameters need to be learned?
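One way to answer the question (a sketch, using the standard counting convention that a distribution over k outcomes has k − 1 free parameters): with S hidden states and M observation symbols, parameter sharing makes the count independent of the sequence length T.

```python
def hmm_param_count(S, M):
    """Free parameters of a homogeneous HMM with S states, M symbols."""
    start = S - 1        # p(y1): one distribution over S states
    trans = S * (S - 1)  # beta: S conditional distributions over S states
    emit = S * (M - 1)   # alpha: S conditional distributions over M symbols
    return start + trans + emit

hmm_param_count(2, 2)  # 1 + 2 + 2 = 5
```

Without sharing, the transition and emission terms would each be multiplied by roughly T; sharing is what keeps the model learnable from modest data.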
Mixture of Gaussians

The N-dim. multivariate normal distribution, N(µ, Σ), has density:

p(x) = 1 / ((2π)^{N/2} |Σ|^{1/2}) exp( −½ (x − µ)^T Σ^{−1} (x − µ) )

Suppose we have k Gaussians given by µ_k and Σ_k, and a distribution θ over the numbers 1, ..., k

Mixture of Gaussians distribution p(y, x) given by:

1. Sample y ∼ θ (specifies which Gaussian to use)
2. Sample x ∼ N(µ_y, Σ_y)
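The two sampling steps above translate directly into code. A sketch with three illustrative 2-D components (the mixing weights, means, and covariances are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.5, 0.3, 0.2])        # distribution over components
mus = [np.array([0.0, 0.0]),
       np.array([3.0, 3.0]),
       np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

def sample_mixture():
    y = rng.choice(len(theta), p=theta)              # 1. which Gaussian to use
    x = rng.multivariate_normal(mus[y], Sigmas[y])   # 2. draw from that Gaussian
    return y, x

samples = [sample_mixture() for _ in range(1000)]
```

Marginalizing out y gives the familiar multimodal density p(x) = Σ_j θ_j N(x; µ_j, Σ_j).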
Mixture of Gaussians

The marginal distribution over x looks like:

[Figure: multimodal density formed by the weighted component Gaussians]
Latent Dirichlet allocation (LDA)

Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents

Many applications in information retrieval, document summarization, and classification

[Figure: a new document with words w_1, ..., w_N and the question "What is this document about?", answered by a distribution over topics θ, e.g. weather .50, finance .49, sports .01]

LDA is one of the simplest and most widely used topic models
Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):

   θ ∼ Dirichlet(α_{1:T})

   where the {α_t}_{t=1}^T are fixed hyperparameters. Thus θ is a distribution over T topics with mean E[θ_t] = α_t / Σ_{t'} α_{t'}

2. For i = 1 to N, sample the topic z_i of the i'th word:

   z_i | θ ∼ θ

3. ... and then sample the actual word w_i from the z_i'th topic:

   w_i | z_i ∼ β_{z_i}

   where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words)
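The three steps above can be run forward to generate a synthetic document. A sketch with a toy vocabulary and three hand-made topics (the vocabulary, α, and β here are illustrative, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["gene", "dna", "brain", "neuron", "data", "computer"]
alpha = np.array([1.0, 1.0, 1.0])        # Dirichlet hyperparameters, T = 3
beta = np.array([                        # beta[t] = p(w | z = t)
    [0.4, 0.4, 0.05, 0.05, 0.05, 0.05],  # "genetics" topic
    [0.05, 0.05, 0.4, 0.4, 0.05, 0.05],  # "neuroscience" topic
    [0.05, 0.05, 0.05, 0.05, 0.4, 0.4],  # "computing" topic
])

def generate_document(N):
    theta = rng.dirichlet(alpha)              # 1. document's topic distribution
    words = []
    for _ in range(N):
        z = rng.choice(len(alpha), p=theta)   # 2. topic of word i
        w = rng.choice(len(vocab), p=beta[z]) # 3. word from the z'th topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
```

Because each word draws its own z_i, a single document can mix topics, which is exactly what distinguishes LDA from a one-topic-per-document mixture model.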
Generative model for a document in LDA

1. Sample the document's topic distribution θ (aka topic vector):

   θ ∼ Dirichlet(α_{1:T})

   where the {α_t}_{t=1}^T are hyperparameters. The Dirichlet density, defined over the simplex Δ = {θ ∈ R^T : θ_t ≥ 0 ∀t, Σ_{t=1}^T θ_t = 1}, is:

   p(θ_1, ..., θ_T) ∝ ∏_{t=1}^{T} θ_t^{α_t − 1}

For example, for T = 3 (θ_3 = 1 − θ_1 − θ_2):

[Figure: log Pr(θ) surfaces over (θ_1, θ_2) for two settings of α_1 = α_2 = α_3]
Generative model for a document in LDA

3. ... and then sample the actual word w_i from the z_i'th topic:

   w_i | z_i ∼ β_{z_i}

   where {β_t}_{t=1}^T are the topics (a fixed collection of distributions on words), i.e. β_t = p(w | z = t)

[Figure: example topics as distributions over words]
  politics: politics .0100, president .0095, obama .0090, washington .0085, religion .0060, ...
  religion: religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, ...
  sports: sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, ...
Example of using LDA

[Figure: topics as word distributions (e.g. gene .04, dna .02, genetic .01; life .02, evolve .01, organism .01; brain .04, neuron .02, nerve .01; data .02, number .02, computer .01), a document with per-word topic assignments z_{1,d}, ..., z_{N_d,d}, and its topic proportions θ_d]

(Blei, Introduction to Probabilistic Topic Models, 2011)
"Plate" notation for LDA model

[Figure: plate diagram — Dirichlet hyperparameters α → topic distribution θ_d for document d → topic z_id of word i of doc d → word w_id, with the topic-word distributions β also feeding w_id; inner plate i = 1 to N, outer plate d = 1 to D]

Variables within a plate are replicated in a conditionally independent manner
Comparison of mixture and admixture models

[Figure: left — a plate model with a single topic z_d per document; right — LDA, with a topic z_id per word and a per-document topic distribution θ_d drawn from Dirichlet(α)]

Model on left is a mixture model:
  Called multinomial naive Bayes (a word can appear multiple times)
  Document is generated from a single topic

Model on right (LDA) is an admixture model:
  Document is generated from a distribution over topics