Probabilistic Graphical Models
David Sontag, New York University
Lecture 2, February 7, 2013


  1. Probabilistic Graphical Models. David Sontag, New York University. Lecture 2, February 7, 2013.

  2. Bayesian networks: reminder of last lecture

     A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
       1. one node i ∈ V for each random variable X_i, and
       2. one conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values.

     This corresponds 1-1 with a particular factorization of the joint distribution:

       p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))

     It is a powerful framework for designing algorithms to perform probability computations.
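
The factorization translates directly into code: the joint probability of a complete assignment is a product of one CPD lookup per node. A minimal sketch; the nested-dict representation of CPDs is an illustrative assumption, not anything from the lecture.

```python
def joint_probability(parents, cpds, assignment):
    """Joint probability of a complete assignment under a Bayesian network.

    parents[v]: tuple of v's parent variables
    cpds[v][(x_v, x_pa)]: p(x_v | x_Pa(v)) for parent values x_pa (a tuple)
    """
    p = 1.0
    for v in cpds:
        x_pa = tuple(assignment[u] for u in parents[v])
        p *= cpds[v][(assignment[v], x_pa)]
    return p

# Toy two-node network X -> Y with binary variables (values are made up)
parents = {"X": (), "Y": ("X",)}
cpds = {"X": {(0, ()): 0.8, (1, ()): 0.2},
        "Y": {(0, (0,)): 0.9, (1, (0,)): 0.1,
              (0, (1,)): 0.3, (1, (1,)): 0.7}}
print(joint_probability(parents, cpds, {"X": 1, "Y": 1}))  # 0.2 * 0.7 = 0.14
```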

  3. Example

     Consider the following Bayesian network (the "student" network): Difficulty → Grade ← Intelligence, Intelligence → SAT, Grade → Letter, with CPDs:

       p(D):   d0 = 0.6,  d1 = 0.4
       p(I):   i0 = 0.7,  i1 = 0.3

       p(G | I, D):         g1     g2     g3
                  i0, d0    0.3    0.4    0.3
                  i0, d1    0.05   0.25   0.7
                  i1, d0    0.9    0.08   0.02
                  i1, d1    0.5    0.3    0.2

       p(S | I):            s0     s1
                  i0        0.95   0.05
                  i1        0.2    0.8

       p(L | G):            l0     l1
                  g1        0.1    0.9
                  g2        0.4    0.6
                  g3        0.99   0.01

     What is its joint distribution?

       p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
       p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
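
To make the factorization concrete, here is a hedged evaluation of one assignment of the student network, using the CPT values as reconstructed above (the string encoding of values such as "d0" is an illustrative choice):

```python
# CPTs of the student network, transcribed from the slide
p_d = {"d0": 0.6, "d1": 0.4}
p_i = {"i0": 0.7, "i1": 0.3}
p_g = {("i0", "d0"): {"g1": 0.3,  "g2": 0.4,  "g3": 0.3},
       ("i0", "d1"): {"g1": 0.05, "g2": 0.25, "g3": 0.7},
       ("i1", "d0"): {"g1": 0.9,  "g2": 0.08, "g3": 0.02},
       ("i1", "d1"): {"g1": 0.5,  "g2": 0.3,  "g3": 0.2}}
p_s = {"i0": {"s0": 0.95, "s1": 0.05}, "i1": {"s0": 0.2, "s1": 0.8}}
p_l = {"g1": {"l0": 0.1,  "l1": 0.9}, "g2": {"l0": 0.4, "l1": 0.6},
       "g3": {"l0": 0.99, "l1": 0.01}}

# p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
d, i, g, s, l = "d0", "i1", "g1", "s1", "l1"
p = p_d[d] * p_i[i] * p_g[(i, d)][g] * p_s[i][s] * p_l[g][l]
print(p)  # 0.6 * 0.3 * 0.9 * 0.8 * 0.9 ≈ 0.1166
```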

  4. D-separation ("directed separation") in Bayesian networks

     An algorithm to determine whether X ⊥ Z | Y by looking at graph separation.

     Look to see if there is an active path between X and Z when the variables Y are observed. [Figure: two example graphs, (a) and (b).] If there is no such path, then X and Z are d-separated with respect to Y.

     d-separation reduces statistical independencies (hard) to connectivity in graphs (easy).

     It is important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query.
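
An equivalent way to decide d-separation without enumerating active paths is the ancestral-moralization criterion: X ⊥ Z | Y holds iff X and Z are disconnected, after deleting the observed nodes Y, in the moralized ancestral graph of X ∪ Y ∪ Z. A self-contained sketch of that standard criterion (not the lecture's own algorithm); the sets X, Z, Y are assumed disjoint:

```python
from collections import defaultdict

def d_separated(parents, X, Z, Y):
    """True iff X ⊥ Z | Y in the DAG given by parents[v] = set of v's parents."""
    # 1. Restrict to the ancestral subgraph of X ∪ Z ∪ Y
    relevant, stack = set(), list(X | Z | Y)
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: drop edge directions and "marry" co-parents
    adj = defaultdict(set)
    for v in relevant:
        ps = list(parents.get(v, ()))
        for p in ps:
            adj[v].add(p)
            adj[p].add(v)
        for a in ps:
            for b in ps:
                if a != b:
                    adj[a].add(b)
    # 3. Delete the observed nodes Y and test reachability from X to Z
    seen, stack = set(X), list(X)
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u not in Y and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen.isdisjoint(Z)

# The student network from the previous slide: D -> G <- I, I -> S, G -> L
parents = {"D": set(), "I": set(), "G": {"D", "I"}, "S": {"I"}, "L": {"G"}}
print(d_separated(parents, {"D"}, {"I"}, set()))   # True: independent a priori
print(d_separated(parents, {"D"}, {"I"}, {"G"}))   # False: observing Grade couples them
```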

  5. Independence maps

     Let I(G) be the set of all conditional independencies implied by the directed acyclic graph (DAG) G. Let I(p) denote the set of all conditional independencies that hold for the joint distribution p.

     A DAG G is an I-map (independence map) of a distribution p if I(G) ⊆ I(p).

     A fully connected DAG G is an I-map for any distribution, since I(G) = ∅ ⊆ I(p) for all p.

     G is a minimal I-map for p if the removal of even a single edge makes it not an I-map. A distribution may have several minimal I-maps, each corresponding to a specific node ordering.

     G is a perfect map (P-map) for distribution p if I(G) = I(p).

  6. Equivalent structures

     Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions).

     Which of these are equivalent? [Figure: four three-node structures over X, Y, Z, labeled (a) through (d).]

  7. Equivalent structures

     Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions).

     Are these equivalent? [Figure: two five-node structures over V, W, X, Y, Z.]

  8. What are some frequently used graphical models?

  9. Hidden Markov models

     [Figure: chain Y1 → Y2 → ... → Y6, with each Yt → Xt.]

     Frequently used for speech recognition and part-of-speech tagging.

     The joint distribution factors as:

       p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)

     p(y_1) is the distribution for the starting state, p(y_t | y_{t-1}) is the transition probability between any two states, and p(x_t | y_t) is the emission probability.

     What are the conditional independencies here? For example, Y_1 ⊥ {Y_3, ..., Y_6} | Y_2.
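
The factorization is straightforward to evaluate numerically; a short numpy sketch, where the matrix encoding of the three CPDs is an illustrative assumption:

```python
import numpy as np

def hmm_joint(pi, A, B, y, x):
    """p(y, x) for integer-coded state sequence y and observation sequence x.

    pi[k]   = p(y_1 = k)                  starting-state distribution
    A[j, k] = p(y_t = k | y_{t-1} = j)    transition probabilities
    B[k, v] = p(x_t = v | y_t = k)        emission probabilities
    """
    p = pi[y[0]] * B[y[0], x[0]]
    for t in range(1, len(y)):
        p *= A[y[t - 1], y[t]] * B[y[t], x[t]]
    return p

# Tiny 2-state, 2-symbol HMM with made-up parameters
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(hmm_joint(pi, A, B, y=[0, 0, 1], x=[0, 0, 1]))
```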

  10. Hidden Markov models

      The joint distribution factors as:

        p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)

      A homogeneous HMM uses the same parameters (β and α below) for each transition and emission distribution (parameter sharing):

        p(y, x) = p(y_1) α_{x_1, y_1} ∏_{t=2}^{T} β_{y_t, y_{t-1}} α_{x_t, y_t}

      How many parameters need to be learned?
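
The closing question has a standard answer once sizes are fixed. Assuming K hidden states and M discrete observation symbols (symbols of our choosing, not the slide's), each row of a stochastic vector or matrix loses one free entry to the sum-to-one constraint:

```python
def hmm_parameter_count(K, M):
    """Free parameters of a homogeneous HMM with K states and M symbols."""
    initial = K - 1            # p(y_1): K entries, one determined by the rest
    transitions = K * (K - 1)  # beta: K rows over K next-states
    emissions = K * (M - 1)    # alpha: K rows over M symbols
    return initial + transitions + emissions

print(hmm_parameter_count(K=3, M=10))  # 2 + 6 + 27 = 35
```

Without parameter sharing, the transition and emission counts would each grow by roughly a factor of T, the sequence length.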

  11. Mixture of Gaussians

      The N-dimensional multivariate normal distribution, N(µ, Σ), has density:

        p(x) = 1 / ((2π)^{N/2} |Σ|^{1/2}) exp( -(1/2) (x − µ)^T Σ^{-1} (x − µ) )

      Suppose we have k Gaussians given by µ_k and Σ_k, and a distribution θ over the numbers 1, ..., k.

      The mixture of Gaussians distribution p(y, x) is given by:
        1. Sample y ∼ θ (specifies which Gaussian to use)
        2. Sample x ∼ N(µ_y, Σ_y)
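
The two-step sampling procedure maps directly onto numpy; a hedged sketch with made-up component parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-component mixture in 2 dimensions
theta = np.array([0.3, 0.7])               # distribution over components
mus = [np.zeros(2), np.array([3.0, 3.0])]  # component means
Sigmas = [np.eye(2), 0.5 * np.eye(2)]      # component covariances

def sample_mixture(n):
    ys = rng.choice(len(theta), size=n, p=theta)  # step 1: y ~ theta
    xs = np.array([rng.multivariate_normal(mus[y], Sigmas[y]) for y in ys])  # step 2
    return ys, xs

ys, xs = sample_mixture(1000)
print(np.bincount(ys) / len(ys))  # empirical component frequencies, close to theta
```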

  12. Mixture of Gaussians

      The marginal distribution over x looks like: [Figure omitted: a plot of the marginal density p(x), with one bump per mixture component.]

  13. Latent Dirichlet allocation (LDA)

      Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents.

      [Figure: documents mapped to topics such as religion, politics, and sports, each topic shown as a list of related words.]

      Many applications in information retrieval, document summarization, and classification.

      [Figure: a new document with words w_1, ..., w_N and its inferred distribution over topics θ. "What is this document about?" weather 0.50, finance 0.49, sports 0.01.]

      LDA is one of the simplest and most widely used topic models.

  14. Generative model for a document in LDA

      1. Sample the document's topic distribution θ (aka topic vector):
           θ ∼ Dirichlet(α_{1:T})
         where the {α_t}, t = 1, ..., T, are fixed hyperparameters. Thus θ is a distribution over T topics with mean θ̄_t = α_t / ∑_{t'} α_{t'}.
      2. For i = 1 to N, sample the topic z_i of the i'th word:
           z_i | θ ∼ θ
      3. ... and then sample the actual word w_i from the z_i'th topic:
           w_i | z_i ∼ β_{z_i}
         where {β_t}, t = 1, ..., T, are the topics (a fixed collection of distributions on words).
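
The three steps compose into a tiny sampler; a sketch assuming T topics over a V-word vocabulary, with made-up α and β:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, N):
    """Sample one N-word document from the LDA generative model.

    alpha: (T,) Dirichlet hyperparameters
    beta:  (T, V) topic-word distributions, one row per topic
    """
    theta = rng.dirichlet(alpha)                 # step 1: topic distribution
    z = rng.choice(len(alpha), size=N, p=theta)  # step 2: topic of each word
    w = np.array([rng.choice(beta.shape[1], p=beta[zi]) for zi in z])  # step 3
    return theta, z, w

alpha = np.array([0.5, 0.5, 0.5])         # T = 3 topics
beta = rng.dirichlet(np.ones(8), size=3)  # 3 random topics over an 8-word vocabulary
theta, z, w = generate_document(alpha, beta, N=10)
print(theta, z, w)
```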

  15. Generative model for a document in LDA

      1. Sample the document's topic distribution θ (aka topic vector):
           θ ∼ Dirichlet(α_{1:T})
         where the {α_t} are hyperparameters. The Dirichlet density, defined over the simplex Δ = { θ ∈ R^T : θ_t ≥ 0 ∀t, ∑_{t=1}^{T} θ_t = 1 }, is:

           p(θ_1, ..., θ_T) ∝ ∏_{t=1}^{T} θ_t^{α_t − 1}

      [Figure: for T = 3 (with θ_3 = 1 − θ_1 − θ_2), plots of log Pr(θ) over (θ_1, θ_2) for two settings of α_1 = α_2 = α_3.]
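
The effect of α on the density's shape (what the omitted plots illustrate) is easy to probe numerically; a sketch using scipy's Dirichlet density, assuming scipy is available:

```python
import numpy as np
from scipy.stats import dirichlet

center = np.array([1/3, 1/3, 1/3])    # middle of the simplex
corner = np.array([0.9, 0.05, 0.05])  # near a corner
for alpha in ([5.0, 5.0, 5.0], [0.8, 0.8, 0.8]):
    # alpha > 1 concentrates mass near the center of the simplex;
    # alpha < 1 pushes it toward the corners (sparse topic mixtures)
    print(alpha, dirichlet.pdf(center, alpha), dirichlet.pdf(corner, alpha))
```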

  16. Generative model for a document in LDA

      3. ... and then sample the actual word w_i from the z_i'th topic:
           w_i | z_i ∼ β_{z_i}
         where {β_t}, t = 1, ..., T, are the topics (a fixed collection of distributions on words), i.e. β_t = p(w | z = t).

      [Figure: three example topics as word distributions: a politics topic (politics .0100, president .0095, obama .0090, washington .0085, religion .0060, ...), a religion topic (religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, ...), and a sports topic (sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, ...).]

  17. Example of using LDA

      [Figure: a document with each word color-coded by its topic assignment z_{1,d}, ..., z_{N_d,d}, next to the document's topic proportions θ_d and the topics β_1, ..., β_T themselves, e.g. a genetics topic (gene 0.04, dna 0.02, genetic 0.01, ...), an evolution topic (life 0.02, evolve 0.01, organism 0.01, ...), a neuroscience topic (brain 0.04, neuron 0.02, nerve 0.01, ...), and a data-analysis topic (data 0.02, number 0.02, computer 0.01, ...).]

      (Blei, Introduction to Probabilistic Topic Models, 2011)

  18. "Plate" notation for LDA model

      [Figure: plate diagram. Dirichlet hyperparameters α generate the topic distribution θ_d for each document d; θ_d generates the topic z_id of word i of document d; z_id and the topic-word distributions β generate the word w_id. The inner plate replicates over words i = 1 to N, the outer plate over documents d = 1 to D.]

      Variables within a plate are replicated in a conditionally independent manner.

  19. Comparison of mixture and admixture models

      [Figure: two plate diagrams. Left: a prior distribution θ over topics generates a single topic z_d per document, which generates every word w_id. Right (LDA): Dirichlet hyperparameters α generate a per-document topic distribution θ_d, which generates a topic z_id for each word.]

      The model on the left is a mixture model, called multinomial naive Bayes (a word can appear multiple times): each document is generated from a single topic.

      The model on the right (LDA) is an admixture model: each document is generated from a distribution over topics.
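
The admixture behaviour is visible in standard library output; a hedged scikit-learn sketch on a made-up toy corpus (only the LDA side is fit here, since scikit-learn packages multinomial naive Bayes as a supervised classifier rather than an unsupervised mixture):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the team won the game in overtime",
        "the senate passed the budget vote",
        "the coach praised the senate budget coverage"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # one topic *distribution* per document, not a single label
```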
