Probabilistic Graphical Models David Sontag New York University - PowerPoint PPT Presentation

Probabilistic Graphical Models David Sontag New York University Lecture 2, February 7, 2013 David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 1 / 31

Bayesian networks Reminder of last lecture A Bayesian network is specified by a directed acyclic graph G = ( V , E ) with: One node i ∈ V for each random variable X i 1 One conditional probability distribution (CPD) per node, p ( x i | x Pa ( i ) ), 2 specifying the variable’s probability conditioned on its parents’ values Corresponds 1-1 with a particular factorization of the joint distribution: � p ( x i | x Pa ( i ) ) p ( x 1 , . . . x n ) = i ∈ V Powerful framework for designing algorithms to perform probability computations David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 2 / 31

Example Consider the following Bayesian network: d 0 d 1 i 0 i 1 0.7 0.3 0.6 0.4 Difficulty Intelligence g 1 g 2 g 3 Grade SAT i 0 , d 0 0.3 0.4 0.3 i 0 , d 1 0.05 0.25 0.7 s 0 s 1 i 0 , d 0 0.9 0.08 0.02 Letter i 0 , d 1 i 0 0.5 0.3 0.2 0.95 0.05 i 1 0.2 0.8 l 0 l 1 g 1 0.1 0.9 g 2 0.4 0.6 g 2 0.99 0.01 What is its joint distribution? � p ( x 1 , . . . x n ) = p ( x i | x Pa ( i ) ) i ∈ V p ( d , i , g , s , l ) = p ( d ) p ( i ) p ( g | i , d ) p ( s | i ) p ( l | g ) David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 3 / 31

D-separation (“directed separated”) in Bayesian networks Algorithm to calculate whether X ⊥ Z | Y by looking at graph separation Look to see if there is active path between X and Y when variables Y are observed: X Y Z X Y Z (a) (b) If no such path, then X and Z are d-separated with respect to Y d-separation reduces statistical independencies (hard) to connectivity in graphs (easy) Important because it allows us to quickly prune the Bayesian network, finding just the relevant variables for answering a query David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 4 / 31

Independence maps Let I ( G ) be the set of all conditional independencies implied by the directed acyclic graph (DAG) G Let I ( p ) denote the set of all conditional independencies that hold for the joint distribution p . A DAG G is an I-map (independence map) of a distribution p if I ( G ) ⊆ I ( p ) A fully connected DAG G is an I-map for any distribution, since I ( G ) = ∅ ⊆ I ( p ) for all p G is a minimal I-map for p if the removal of even a single edge makes it not an I-map A distribution may have several minimal I-maps Each corresponds to a specific node-ordering G is a perfect map (P-map) for distribution p if I ( G ) = I ( p ) David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 5 / 31

Equivalent structures Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions) Which of these are equivalent? X Y Z Z Z X Y Y X X Y Z (a) (b) (c) (d) David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 6 / 31

Equivalent structures Different Bayesian network structures can be equivalent in that they encode precisely the same conditional independence assertions (and thus the same distributions) Are these equivalent? V X Z V X Z W Y W Y David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 7 / 31

What are some frequently used graphical models? David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 8 / 31

Hidden Markov models Y5 Y6 Y2 Y3 Y4 Y1 X1 X2 X3 X4 X5 X6 Frequently used for speech recognition and part-of-speech tagging Joint distribution factors as: T � p ( y , x ) = p ( y 1 ) p ( x 1 | y 1 ) p ( y t | y t − 1 ) p ( x t | y t ) t =2 p ( y 1 ) is the distribution for the starting state p ( y t | y t − 1 ) is the transition probability between any two states p ( x t | y t ) is the emission probability What are the conditional independencies here? For example, Y 1 ⊥ { Y 3 , . . . , Y 6 } | Y 2 David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 9 / 31

Hidden Markov models Y6 Y3 Y4 Y5 Y1 Y2 X1 X2 X3 X4 X5 X6 Joint distribution factors as: T � p ( y , x ) = p ( y 1 ) p ( x 1 | y 1 ) p ( y t | y t − 1 ) p ( x t | y t ) t =2 A homogeneous HMM uses the same parameters ( β and α below) for each transition and emission distribution ( parameter sharing ): T � p ( y , x ) = p ( y 1 ) α x 1 , y 1 β y t , y t − 1 α x t , y t t =2 How many parameters need to be learned? David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 10 / 31

Mixture of Gaussians The N -dim. multivariate normal distribution, N ( µ, Σ), has density: 1 − 1 � � 2( x − µ ) T Σ − 1 ( x − µ ) p ( x ) = (2 π ) N / 2 | Σ | 1 / 2 exp Suppose we have k Gaussians given by µ k and Σ k , and a distribution θ over the numbers 1 , . . . , k Mixture of Gaussians distribution p ( y , x ) given by Sample y ∼ θ (specifies which Gaussian to use) 1 Sample x ∼ N ( µ y , Σ y ) 2 David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 11 / 31

Mixture of Gaussians The marginal distribution over x looks like: David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 12 / 31

Latent Dirichlet allocation (LDA) Topic models are powerful tools for exploring large data sets and for making inferences about the content of documents !"#$%&'() *"+,#) .&/,0,"'1 )+".()1 +"/,9#)1 65)&65//1 +.&),3&'(1 2,'3$1 )"##&.1 "65%51 4$3,5)%1 &(2,#)1 65)7&(65//1 :5)2,'0("'1 6$332,)%1 8""(65//1 .&/,0,"'1 - - - Many applications in information retrieval, document summarization, and classification New+document+ What+is+this+document+about?+ weather+ .50+ finance+ .49+ sports+ .01+ Words+w 1 ,+…,+w N+ Distribu6on+of+topics + θ LDA is one of the simplest and most widely used topic models David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 13 / 31

Generative model for a document in LDA 1 Sample the document’s topic distribution θ (aka topic vector) θ ∼ Dirichlet ( α 1: T ) where the { α t } T t =1 are fixed hyperparameters. Thus θ is a distribution over T topics with mean θ t = α t / � t ′ α t ′ 2 For i = 1 to N , sample the topic z i of the i ’th word z i | θ ∼ θ 3 ... and then sample the actual word w i from the z i ’th topic w i | z i ∼ β z i where { β t } T t =1 are the topics (a fixed collection of distributions on words) David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 14 / 31

Generative model for a document in LDA Sample the document’s topic distribution θ (aka topic vector) 1 θ ∼ Dirichlet ( α 1: T ) where the { α t } T t =1 are hyperparameters.The Dirichlet density, defined over θ ∈ R T : ∀ t θ t ≥ 0 , � T ∆ = { � t =1 θ t = 1 } , is: T � θ α t − 1 p ( θ 1 , . . . , θ T ) ∝ t t =1 For example, for T =3 ( θ 3 = 1 − θ 1 − θ 2 ): α 1 = α 2 = α 3 = α 1 = α 2 = α 3 = log Pr( θ ) log Pr( θ ) θ 1 θ 2 θ 1 θ 2 David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 15 / 31

Generative model for a document in LDA 3 ... and then sample the actual word w i from the z i ’th topic w i | z i ∼ β z i where { β t } T t =1 are the topics (a fixed collection of distributions on words) Documents+ Topics+ poli6cs+.0100+ religion+.0500+ sports+.0105+ president+.0095+ hindu+.0092+ baseball+.0100+ obama+.0090+ judiasm+.0080+ soccer+.0055+ washington+.0085+ ethics+.0075+ basketball+.0050+ religion+.0060+ buddhism+.0016+ football+.0045+ …+ …+ …+ � � β t = p ( w | z = t ) David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 16 / 31 θ

Example of using LDA Topic proportions and Topics Documents assignments gene 0.04 β 1 dna 0.02 z 1 d genetic 0.01 .,, θ d life 0.02 evolve 0.01 organism 0.01 .,, brain 0.04 neuron 0.02 nerve 0.01 ... z Nd data 0.02 number 0.02 β T computer 0.01 .,, (Blei, Introduction to Probabilistic Topic Models , 2011) David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 17 / 31

“Plate” notation for LDA model Dirichlet α hyperparameters Topic distribution θ d for document Topic-word β Topic of word i of doc d distributions z id Word w id i = 1 to N d = 1 to D Variables within a plate are replicated in a conditionally independent manner David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 18 / 31

Comparison of mixture and admixture models Dirichlet α hyperparameters Prior distribution Topic distribution θ θ d over topics for document Topic-word Topic-word β β Topic of doc d Topic of word i of doc d distributions z d distributions z id w id Word Word w id i = 1 to N i = 1 to N d = 1 to D d = 1 to D Model on left is a mixture model Called multinomial naive Bayes (a word can appear multiple times) Document is generated from a single topic Model on right (LDA) is an admixture model Document is generated from a distribution over topics David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 19 / 31

Probabilistic Graphical Models David Sontag New York University - PowerPoint PPT Presentation

Probabilistic Graphical Models David Sontag New York University Lecture 2, February 7, 2013 David Sontag (NYU) Graphical Models Lecture 2, February 7, 2013 1 / 31 Bayesian networks Reminder of last lecture A Bayesian network is specified by

Probabilistic Graphical Models CMSC 678 UMBC Probabilistic Graphical Models A graph G that

Probabilistic Graphical Models Probabilistic Graphical Models Variable elimination Siamak

Graphical Models Graphical Models Bayesian Networks Siamak Ravanbakhsh Fall 2019 Previously on

Probabilistic Graphical Models Probabilistic Graphical Models introduction to learning Siamak

Computer Science Let me be provocative Probabilistic graphical models is how we do probabilistic

Probabilistic Graphical Models Probabilistic Graphical Models Undirected Models Fall 2019

Probabilistic Graphical Models Probabilistic Graphical Models parameter learning in undirected

Probabilistic Graphical Models Probabilistic Graphical Models Gaussian Network Models Fall 2019

CS 6782: Fall 2010 Probabilistic Graphical Models Guozhang Wang December 10, 2010 1

Probabilistic Graphical Models Probabilistic Graphical Models Relationship between the directed

Probabilistic Graphical Models Probabilistic Graphical Models Review of probability theory

Probabilistic Graphical Models Probabilistic Graphical Models Loopy BP and Bethe Free Energy

Probabilistic Graphical Models Probabilistic Graphical Models Structure learning in Bayesian

Probabilistic Graphical Models Probabilistic Graphical Models MAP inference Siamak Ravanbakhsh

Probabilistic Graphical Models Probabilistic Graphical Models Markov Chain Monte Carlo Inference

The Elimination Algorithm Probabilistic Graphical Models (10- Probabilistic Graphical Models

Improving Information from Manipulable Data Alex Frankel Navin Kartik July 2020 Improving

Advanced and cost efficient the Joint Fire Support Team Trainer of the Bundeswehr #ITEC2019

Joint longitudinal and time-to-event models for multilevel hierarchical data Sam Brilleman 1,2 ,

Remuneration April 2018 Shareholder Engagement Royal Dutch Shell plc Gerard Kleisterlee

The Price of Competition: Effect Size Heterogeneity Matters in High Dimensions! joint work with

Skeletal Posture Estimation Shane Transue, Phuc Nguyen, Tam Vu, and Min-Hyung Choi University of

The KPZ fixed point Jeremy Quastel University of Toronto joint work with Konstantin Matetski

Power systems and Queueing theory: Storage and Electric Vehicles (Joint work with Lisa Flatley,