Probabilistic Graphical Models

  1. Probabilistic Graphical Models. David Sontag, New York University. Lecture 2, February 2, 2012.

  2. Bayesian networks: reminder of last lecture. A Bayesian network is specified by a directed acyclic graph G = (V, E) with:
     (1) one node i ∈ V for each random variable X_i, and
     (2) one conditional probability distribution (CPD) per node, p(x_i | x_Pa(i)), specifying the variable's probability conditioned on its parents' values.
     This corresponds 1-1 with a particular factorization of the joint distribution:
         p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
     Bayesian networks are a powerful framework for designing algorithms to perform probability computations.

  3. Example. Consider the following Bayesian network over Difficulty (D), Intelligence (I), Grade (G), SAT (S), and Letter (L), with edges D → G, I → G, I → S, and G → L, and CPDs:
         p(D):  d^0 0.6,  d^1 0.4          p(I):  i^0 0.7,  i^1 0.3
         p(G | I, D):        g^1     g^2     g^3
             i^0, d^0        0.3     0.4     0.3
             i^0, d^1        0.05    0.25    0.7
             i^1, d^0        0.9     0.08    0.02
             i^1, d^1        0.5     0.3     0.2
         p(S | I):  i^0: s^0 0.95, s^1 0.05      i^1: s^0 0.2, s^1 0.8
         p(L | G):  g^1: l^0 0.1, l^1 0.9      g^2: l^0 0.4, l^1 0.6      g^3: l^0 0.99, l^1 0.01
     What is its joint distribution?
         p(x_1, ..., x_n) = ∏_{i ∈ V} p(x_i | x_Pa(i))
         p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)
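
A minimal Python sketch of this factorization, assuming the standard student-network CPTs as tabulated above (the variable encoding and names are my own, not from the slides):

    # CPTs of the student network (0/1-valued except Grade, which takes values 1..3).
    p_D = {0: 0.6, 1: 0.4}                      # p(Difficulty)
    p_I = {0: 0.7, 1: 0.3}                      # p(Intelligence)
    p_G = {                                     # p(Grade | Intelligence, Difficulty)
        (0, 0): [0.30, 0.40, 0.30],
        (0, 1): [0.05, 0.25, 0.70],
        (1, 0): [0.90, 0.08, 0.02],
        (1, 1): [0.50, 0.30, 0.20],
    }
    p_S = {0: [0.95, 0.05], 1: [0.2, 0.8]}      # p(SAT | Intelligence)
    p_L = {1: [0.10, 0.90], 2: [0.40, 0.60], 3: [0.99, 0.01]}  # p(Letter | Grade)

    def joint(d, i, g, s, l):
        """p(d, i, g, s, l) = p(d) p(i) p(g | i, d) p(s | i) p(l | g)."""
        return p_D[d] * p_I[i] * p_G[(i, d)][g - 1] * p_S[i][s] * p_L[g][l]

    print(joint(d=0, i=1, g=1, s=1, l=1))  # probability of one full assignment

Summing joint(...) over all 2 · 2 · 3 · 2 · 2 = 48 assignments returns 1, which is a quick sanity check on the tables.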

  4. D-separation ("directed separation") in Bayesian networks. An algorithm to decide whether X ⊥ Z | Y by looking at graph separation: check whether there is an active path between X and Z when the variables Y are observed. (Figure: two small example networks, (a) and (b).) If there is no such path, then X and Z are d-separated with respect to Y. D-separation reduces questions of statistical independence (hard) to questions of connectivity in graphs (easy). It is important because it allows us to quickly prune the Bayesian network, keeping just the variables relevant for answering a query.
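
One way to make this test concrete is the ancestral-moral-graph construction: X ⊥ Z | Y holds in G exactly when X and Z are disconnected after restricting to the ancestors of X ∪ Y ∪ Z, moralizing (marrying parents and dropping edge directions), and deleting the observed nodes Y. Below is a small sketch of that test; it is my own illustration, not code from the lecture.

    from collections import deque

    def ancestors(parents, nodes):
        """All ancestors of `nodes`, including the nodes themselves."""
        seen, stack = set(nodes), list(nodes)
        while stack:
            for p in parents.get(stack.pop(), []):
                if p not in seen:
                    seen.add(p)
                    stack.append(p)
        return seen

    def d_separated(parents, X, Z, Y):
        """True iff X and Z are d-separated given Y in the DAG described by
        `parents` (a dict mapping each node to the list of its parents)."""
        keep = ancestors(parents, set(X) | set(Z) | set(Y))
        # Moralize the ancestral subgraph: connect each node to its parents
        # and "marry" every pair of parents, ignoring edge directions.
        adj = {v: set() for v in keep}
        for v in keep:
            ps = [p for p in parents.get(v, []) if p in keep]
            for p in ps:
                adj[v].add(p)
                adj[p].add(v)
            for a in ps:
                for b in ps:
                    if a != b:
                        adj[a].add(b)
        # Delete the observed nodes, then look for any remaining path from X to Z.
        blocked = set(Y)
        frontier = deque(x for x in X if x not in blocked)
        reached = set(frontier)
        while frontier:
            v = frontier.popleft()
            if v in Z:
                return False  # an active path exists
            for u in adj[v] - blocked - reached:
                reached.add(u)
                frontier.append(u)
        return True

    # Student network: D -> G <- I, I -> S, G -> L.
    parents = {"G": ["D", "I"], "S": ["I"], "L": ["G"]}
    print(d_separated(parents, {"D"}, {"I"}, set()))   # True: D and I are marginally independent
    print(d_separated(parents, {"D"}, {"I"}, {"G"}))   # False: observing G activates the v-structure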

  5. Independence maps. Let I(G) be the set of all conditional independencies implied by the directed acyclic graph (DAG) G, and let I(p) denote the set of all conditional independencies that hold for the joint distribution p. A DAG G is an I-map (independence map) of a distribution p if I(G) ⊆ I(p). A fully connected DAG G is an I-map for any distribution, since I(G) = ∅ ⊆ I(p) for all p. G is a minimal I-map for p if the removal of even a single edge makes it no longer an I-map; a distribution may have several minimal I-maps, each corresponding to a specific node ordering. G is a perfect map (P-map) for p if I(G) = I(p).

  6. Equivalent structures. Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same set of distributions). Which of these are equivalent? (Figure: four three-node structures over X, Y, Z, labeled (a)-(d).)

  7. Equivalent structures (continued). Which of these are equivalent? (Figure: two five-node structures over V, W, X, Y, Z.) A causal network is a Bayesian network with an explicit requirement that the relationships be causal. Bayesian networks are not the same as causal networks.

  8. What are some frequently used graphical models?

  9. Quick Medical Reference, decision theoretic (Miller et al. '86, Shwe et al. '91). A bipartite network with diseases d_1, ..., d_n as parents of findings f_1, ..., f_m. The joint distribution factors as
         p(f, d) = ∏_j p(d_j) ∏_i p(f_i | d)
     where p(d_j = 1) is the prior probability of having disease j. The model assumes the following independencies: d_i ⊥ d_j and f_i ⊥ f_j | d. Common findings can be caused by hundreds of diseases, so too many parameters would be required to specify the CPD p(f_i | d) as a table.

  10. Quick Medical Reference, decision theoretic (Miller et al. '86, Shwe et al. '91). Instead, we use a noisy-or parameterization:
         p(f_i = 0 | d) = (1 - q_i0) ∏_{j ∈ Pa(i)} (1 - q_ij)^{d_j}
     Here q_ij = p(f_i = 1 | d_j = 1) is the probability that disease j, if present, could alone cause the finding to have a positive outcome, and q_i0 is the "leak" probability: the probability that the finding is caused by something other than the diseases included in the model.
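
A tiny sketch of this parameterization (the numbers are my own toy values, not from the QMR model):

    def noisy_or_prob_positive(d, q, q0):
        """p(f_i = 1 | d) = 1 - (1 - q_i0) * prod_j (1 - q_ij)^{d_j},
        where d is the 0/1 vector over the parent diseases of finding i."""
        p_off = 1.0 - q0
        for dj, qij in zip(d, q):
            p_off *= (1.0 - qij) ** dj
        return 1.0 - p_off

    # A finding with three parent diseases, two of which are present.
    print(noisy_or_prob_positive(d=[1, 1, 0], q=[0.8, 0.3, 0.5], q0=0.01))

Note that the noisy-or CPD needs only |Pa(i)| + 1 numbers per finding instead of the 2^|Pa(i)| entries of a full table, which is exactly the point of the parameterization.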

  11. Hidden Markov models. (Figure: a chain Y_1 → Y_2 → ... → Y_6 of hidden states, each Y_t emitting an observation X_t.) Frequently used for speech recognition and part-of-speech tagging. The joint distribution factors as
         p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
     where p(y_1) is the distribution for the starting state, p(y_t | y_{t-1}) is the transition probability between any two states, and p(x_t | y_t) is the emission probability. What are the conditional independencies here? For example, Y_1 ⊥ {Y_3, ..., Y_6} | Y_2.

  12. Hidden Markov models (continued). The joint distribution factors as
         p(y, x) = p(y_1) p(x_1 | y_1) ∏_{t=2}^{T} p(y_t | y_{t-1}) p(x_t | y_t)
     A homogeneous HMM uses the same parameters (β and α below) for every transition and emission distribution (parameter sharing):
         p(y, x) = p(y_1) α_{x_1, y_1} ∏_{t=2}^{T} β_{y_t, y_{t-1}} α_{x_t, y_t}
     How many parameters need to be learned? (See the sketch below.)
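
As a concrete (hypothetical) instance with K hidden states and V observation symbols, the joint above can be evaluated directly from the shared tables; counting free parameters gives K - 1 for the initial distribution, K(K - 1) for the transitions, and K(V - 1) for the emissions. A small sketch, with toy numbers of my own:

    import numpy as np

    def hmm_joint(pi, beta, alpha, y, x):
        """p(y, x) = p(y_1) * alpha[x_1, y_1] * prod_{t>=2} beta[y_t, y_{t-1}] * alpha[x_t, y_t],
        with pi[k] = p(y_1 = k), beta[k, k'] = p(y_t = k | y_{t-1} = k'),
        and alpha[v, k] = p(x_t = v | y_t = k)."""
        p = pi[y[0]] * alpha[x[0], y[0]]
        for t in range(1, len(y)):
            p *= beta[y[t], y[t - 1]] * alpha[x[t], y[t]]
        return p

    pi = np.array([0.6, 0.4])                 # K = 2 hidden states
    beta = np.array([[0.7, 0.4],              # transition table; each column sums to 1
                     [0.3, 0.6]])
    alpha = np.array([[0.5, 0.1],             # emission table, V = 3 symbols; each column sums to 1
                      [0.3, 0.2],
                      [0.2, 0.7]])
    print(hmm_joint(pi, beta, alpha, y=[0, 1, 1], x=[0, 2, 2]))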

  13. Mixture of Gaussians. The N-dimensional multivariate normal distribution N(µ, Σ) has density:
         p(x) = (2π)^{-N/2} |Σ|^{-1/2} exp( -(1/2) (x - µ)^T Σ^{-1} (x - µ) )
     Suppose we have k Gaussians given by µ_k and Σ_k, and a distribution θ over the numbers 1, ..., k. The mixture-of-Gaussians distribution p(y, x) is given by:
     (1) sample y ∼ θ (specifies which Gaussian to use), then
     (2) sample x ∼ N(µ_y, Σ_y).
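
A short sketch of this two-step sampling procedure (the means and covariances are toy values of my own choosing):

    import numpy as np

    rng = np.random.default_rng(0)

    theta = np.array([0.5, 0.3, 0.2])                       # mixing distribution over k = 3 components
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2), np.diag([1.0, 0.2])]

    def sample_mixture(n):
        ys = rng.choice(len(theta), size=n, p=theta)        # step 1: which Gaussian to use
        xs = np.stack([rng.multivariate_normal(mus[y], Sigmas[y]) for y in ys])  # step 2: x ~ N(mu_y, Sigma_y)
        return ys, xs

    ys, xs = sample_mixture(5)
    print(ys)
    print(xs)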

  14. Mixture of Gaussians. The marginal distribution over x looks like: (Figure: plot of the resulting marginal density of x.)

  15. Latent Dirichlet allocation (LDA). Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents. (Figure: documents linked to topics such as religion, sports, and politics, each topic being a list of related words.) There are many applications in information retrieval, document summarization, and classification. (Figure: a new document is described by its words w_1, ..., w_N and a distribution θ over topics answering "What is this document about?", e.g. weather .50, finance .49, sports .01.) LDA is one of the simplest and most widely used topic models.

  16. Generative model for a document in LDA.
     (1) Sample the document's topic distribution θ (aka topic vector): θ ∼ Dirichlet(α_1:T), where the {α_t}, t = 1, ..., T, are fixed hyperparameters. Thus θ is a distribution over T topics with mean θ_t = α_t / Σ_{t'} α_{t'}.
     (2) For i = 1 to N, sample the topic z_i of the i'th word: z_i | θ ∼ θ.
     (3) ... and then sample the actual word w_i from the z_i'th topic: w_i | z_i ∼ β_{z_i}, where the {β_t}, t = 1, ..., T, are the topics (a fixed collection of distributions on words).
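
The three steps can be written directly as a sampler. The hyperparameters and topic matrix below are toy values of my own, included only to make the sketch runnable:

    import numpy as np

    rng = np.random.default_rng(0)

    alpha = np.array([0.5, 0.5, 0.5])     # Dirichlet hyperparameters for T = 3 topics
    beta = np.array([                     # beta[t, v] = p(word v | topic t), V = 4 words
        [0.60, 0.30, 0.05, 0.05],
        [0.05, 0.05, 0.60, 0.30],
        [0.25, 0.25, 0.25, 0.25],
    ])

    def generate_document(N):
        theta = rng.dirichlet(alpha)                              # step 1: topic distribution
        zs = rng.choice(len(alpha), size=N, p=theta)              # step 2: topic of each word
        ws = [rng.choice(beta.shape[1], p=beta[z]) for z in zs]   # step 3: the words themselves
        return theta, zs, ws

    print(generate_document(N=10))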

  17. Generative model for a document in LDA (continued).
     (1) Sample the document's topic distribution θ (aka topic vector): θ ∼ Dirichlet(α_1:T), where the {α_t}, t = 1, ..., T, are hyperparameters. The Dirichlet density is:
         p(θ_1, ..., θ_T) ∝ ∏_{t=1}^{T} θ_t^(α_t - 1)
     (Figure: two surface plots of log Pr(θ) over (θ_1, θ_2), for two different settings of α_1 = α_2.)
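
To see how the hyperparameters shape the density, it is enough to compare the unnormalized log-density Σ_t (α_t - 1) log θ_t at the center of the simplex and near a corner. The values below are my own toy choices:

    import numpy as np

    def dirichlet_log_density_unnormalized(theta, alpha):
        return np.sum((alpha - 1.0) * np.log(theta))

    center = np.array([1/3, 1/3, 1/3])
    corner = np.array([0.98, 0.01, 0.01])
    for a in (np.full(3, 5.0), np.full(3, 0.5)):
        print(a[0],
              dirichlet_log_density_unnormalized(center, a),
              dirichlet_log_density_unnormalized(corner, a))
    # With alpha > 1 the center scores higher (mass concentrates on even topic mixtures);
    # with alpha < 1 the corner scores higher (mass concentrates on sparse topic mixtures).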

  18. Generative model for a document in LDA (continued).
     (3) ... and then sample the actual word w_i from the z_i'th topic: w_i | z_i ∼ β_{z_i}, where the {β_t}, t = 1, ..., T, are the topics (a fixed collection of distributions on words), i.e. β_t = p(w | z = t).
     (Figure: example topics, each a distribution over words:
         politics: politics .0100, president .0095, obama .0090, washington .0085, religion .0060, ...
         religion: religion .0500, hindu .0092, judaism .0080, ethics .0075, buddhism .0016, ...
         sports: sports .0105, baseball .0100, soccer .0055, basketball .0050, football .0045, ...)

  19. Example of using LDA. (Figure from Blei, Introduction to Probabilistic Topic Models, 2011: topics β_1, ..., β_T are word distributions such as "gene .04, dna .02, genetic .01", "life .02, evolve .01, organism .01", "brain .04, neuron .02, nerve .01", and "data .02, number .02, computer .01"; each word of the example document carries a topic assignment z_1,d, ..., z_Nd,d, and θ_d gives the document's topic proportions.)
