Probabilistic Graphical Models
parameter learning in undirected models
Siamak Ravanbakhsh, Fall 2019

Learning objectives
the form of the likelihood for undirected models
why it is difficult to maximize
MAP estimation and regularization
pseudo likelihood
pseudo moment-matching
contrastive learning
example

p(A, B, C; θ) = (1/Z(θ)) exp( θ₁ I(A=1, B=1) + θ₂ I(B=1, C=1) )

data: ∣D∣ = 100,  E_D[I(A=1, B=1)] = 0.4,  E_D[I(B=1, C=1)] = 0.4

log-likelihood:
ℓ(D, θ) = Σ_{(a,b,c)∈D} ( θ₁ I(a=1, b=1) + θ₂ I(b=1, c=1) ) − 100 log Z(θ)
        = 40 θ₁ + 40 θ₂ − 100 log Z(θ)

[figure: contour plot of the log-likelihood as a function of θ₁ and θ₂]
the parameters are coupled because of the partition function
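The expression above can be checked numerically. A minimal sketch (not from the slides) that brute-forces Z(θ) over the 2³ joint assignments of (A, B, C) and evaluates 40 θ₁ + 40 θ₂ − 100 log Z(θ); the function and variable names are illustrative only:

```python
# A minimal sketch (not from the slides): brute-force evaluation of the example
# log-likelihood 40*theta1 + 40*theta2 - 100*log Z(theta) for the model
# p(A,B,C) ∝ exp(theta1*I(A=1,B=1) + theta2*I(B=1,C=1)).
import itertools
import numpy as np

def log_Z(theta1, theta2):
    # sum the unnormalized probability over all 2^3 joint assignments of (A,B,C)
    scores = [theta1 * float(a == 1 and b == 1) + theta2 * float(b == 1 and c == 1)
              for a, b, c in itertools.product([0, 1], repeat=3)]
    return np.logaddexp.reduce(scores)

def log_likelihood(theta1, theta2, n=100, n_ab=40, n_bc=40):
    # n_ab: #points with A=1,B=1 ; n_bc: #points with B=1,C=1 (0.4 * 100 each)
    return n_ab * theta1 + n_bc * theta2 - n * log_Z(theta1, theta2)

print(log_likelihood(0.0, 0.0))   # = -100 * log 8; changing theta1 also changes log Z
```

Because log Z(θ) depends on both parameters, no term of the objective involves θ₁ or θ₂ in isolation.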
Likelihood in the linear exponential family (log-linear models)

p(x; θ) = (1/Z(θ)) exp(⟨θ, ϕ(x)⟩)        ϕ(x): sufficient statistics

ℓ(D, θ) = log p(D; θ) = Σ_{x∈D} ⟨θ, ϕ(x)⟩ − ∣D∣ log Z(θ)
        = ∣D∣ ( ⟨θ, E_D[ϕ(x)]⟩ − log Z(θ) )

μ_D = E_D[ϕ(x)] : the expected sufficient statistics of the data
example (image: Michael Jordan's draft)

parameters of a tabular pairwise factor over (X₁, X₂): θ_{1,2,0,0}, θ_{1,2,1,0}, θ_{1,2,0,1}, θ_{1,2,1,1}

expected sufficient statistics:
E_D[I(X₁=0, X₂=0)] = P_D(X₁=0, X₂=0)
E_D[I(X₁=1, X₂=0)] = P_D(X₁=1, X₂=0)
E_D[I(X₁=0, X₂=1)] = P_D(X₁=0, X₂=1)
E_D[I(X₁=1, X₂=1)] = P_D(X₁=1, X₂=1)
the gradient of log Z(θ):

∂/∂θᵢ log Z(θ) = (1/Z(θ)) ∂/∂θᵢ Σ_x exp(⟨θ, ϕ(x)⟩) = (1/Z(θ)) Σ_x ϕᵢ(x) exp(⟨θ, ϕ(x)⟩) = E_{p_θ}[ϕᵢ(x)]

so ∇_θ log Z(θ) = E_{p_θ}[ϕ(x)]

the second derivatives:
∂²/(∂θᵢ ∂θⱼ) log Z(θ) = E[ϕᵢ(x) ϕⱼ(x)] − E[ϕᵢ(x)] E[ϕⱼ(x)] = Cov(ϕᵢ, ϕⱼ)

the Hessian of log Z(θ) is a covariance matrix (positive semidefinite), so log Z(θ) is convex
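The identity ∇_θ log Z(θ) = E_{p_θ}[ϕ(x)] can be verified numerically on any model small enough to enumerate. A hypothetical sketch (the two indicator features below are assumptions, not from the slides):

```python
# Hypothetical sketch: check ∇_θ log Z(θ) = E_θ[ϕ(x)] by finite differences on a
# tiny binary model with two indicator features (the features are assumptions).
import itertools
import numpy as np

states = list(itertools.product([0, 1], repeat=3))

def phi(x):
    a, b, c = x
    return np.array([float(a == 1 and b == 1), float(b == 1 and c == 1)])

def log_Z(theta):
    return np.logaddexp.reduce([theta @ phi(x) for x in states])

def expected_phi(theta):
    # E_θ[ϕ(x)] under p(x;θ) = exp(<θ,ϕ(x)>) / Z(θ), by full enumeration
    logp = np.array([theta @ phi(x) for x in states]) - log_Z(theta)
    return sum(p * phi(x) for p, x in zip(np.exp(logp), states))

theta, eps = np.array([0.3, -0.7]), 1e-5
grad_fd = np.array([(log_Z(theta + eps * e) - log_Z(theta - eps * e)) / (2 * eps)
                    for e in np.eye(2)])
print(grad_fd, expected_phi(theta))   # the two vectors should agree
```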
Likelihood in the linear exponential family (log-linear models)

ℓ(D, θ) = ∣D∣ ( ⟨θ, E_D[ϕ(x)]⟩ − log Z(θ) )
the first term is linear in θ, and log Z(θ) is convex, so ℓ(D, θ) is concave

is it easy to maximize? NO!
estimating Z(θ) is a difficult inference problem
how about just using the gradient? ∇_θ log Z(θ) = E_{p_θ}[ϕ(x)], so the gradient involves inference as well
learning undirected models therefore combines some inference method with gradient-based optimization
for the linear exponential family:

∇_θ ℓ(D, θ) = ∣D∣ ( E_D[ϕ(x)] − E_{p_θ}[ϕ(x)] ) = 0  ⇒  E_{p_θ}[ϕ(x)] = E_D[ϕ(x)]

find the parameter that results in the same expected sufficient statistics as the data (moment matching), e.g.
p(X₁=0, X₂=1; θ) = p_D(X₁=0, X₂=1)
gradient-based learning

arg max_θ log p(D ∣ θ),    gradient ∝ E_D[ϕ(x)] − E_{p_θ}[ϕ(x)]

E_D[ϕ(x)] is easy to calculate
E_{p_θ}[ϕ(x)] requires inference in the graphical model, in an inner loop

at the optimum, the moment-matching condition holds:
p_D(x_i, x_j) = p(x_i, x_j; θ)  for all (i, j) ∈ E
empirical marginals = marginals in our current model

learning with approximate inference (sampling, variational inference, ...) is often exact optimization of an approximate objective
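A minimal end-to-end sketch of this loop on the earlier three-variable example, assuming exact inference by enumeration (any inference routine could supply E_θ[ϕ] instead); the step size and iteration count are illustrative:

```python
# Illustrative sketch of the learning loop on the three-variable example above,
# assuming exact inference by enumeration; step size and #iterations are arbitrary.
import itertools
import numpy as np

states = list(itertools.product([0, 1], repeat=3))

def phi(x):
    a, b, c = x
    return np.array([float(a == 1 and b == 1), float(b == 1 and c == 1)])

def expected_phi(theta):
    # the "inner loop" inference step: E_θ[ϕ(x)] under the current model
    logits = np.array([theta @ phi(x) for x in states])
    p = np.exp(logits - np.logaddexp.reduce(logits))
    return sum(pi * phi(x) for pi, x in zip(p, states))

mu_D = np.array([0.4, 0.4])              # empirical expected sufficient statistics
theta = np.zeros(2)
for _ in range(2000):                     # gradient ascent on the average log-likelihood
    theta += 0.5 * (mu_D - expected_phi(theta))

print(theta, expected_phi(theta))         # moment matching: E_θ[ϕ] ≈ (0.4, 0.4)
```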
generative vs. discriminative training

a Hidden Markov Model (HMM) is trained generatively:
ℓ(D, θ) = Σ_{(x,y)∈D} log p(x, y; θ)
it is easy to train the Bayes-net (assuming full observation): the likelihood decomposes

a Conditional Random Field (CRF) is trained discriminatively, by maximizing the conditional log-likelihood:
arg max_θ ℓ_{Y∣X}(D, θ) = Σ_{(x,y)∈D} log p(y ∣ x; θ)

how to maximize this? again consider the gradient:
∇_θ ℓ_{Y∣X}(D, θ) = Σ_{(x′,y′)∈D} ( ϕ(x′, y′) − E_{p(⋅∣x′;θ)}[ϕ(x′, y)] )
the second term is the conditional expectation of the sufficient statistics, conditioned on the observed x′

to obtain the gradient, for each instance (x, y) ∈ D run inference conditioned on x
inference on the reduced MRF is easy in this case
pro: conditioning could simplify inference
con: have to run inference for each data point (compared to generative training in undirected models)
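For a concrete picture of this gradient, here is a hedged sketch of one gradient evaluation, assuming a small label space that can be enumerated and a user-supplied joint feature function phi(x, y) (both are assumptions, not part of the slides); a real chain CRF would instead compute the conditional expectation with forward-backward messages:

```python
# Hedged sketch: conditional gradient for discriminative training. For each pair
# (x, y) the contribution is phi(x, y) - E_{p(y'|x;θ)}[phi(x, y')]. Assumes an
# enumerable label_space and a user-supplied feature function phi (hypothetical).
import numpy as np

def conditional_gradient(theta, data, phi, label_space):
    grad = np.zeros_like(theta)
    for x, y in data:
        # inference conditioned on the observed x: p(y'|x;θ) ∝ exp(<θ, phi(x,y')>)
        ys = list(label_space)
        logits = np.array([theta @ phi(x, yp) for yp in ys])
        p = np.exp(logits - np.logaddexp.reduce(logits))
        expected = sum(pi * phi(x, yp) for pi, yp in zip(p, ys))
        grad += phi(x, y) - expected    # observed features minus conditional expectation
    return grad
```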
MAP estimation and regularization

in Bayes-nets: a decomposed prior p(θ) gives a decomposed posterior p(θ ∣ D)
in Markov nets: the posterior does not decompose, because the likelihood does not decompose (due to the partition function), so full Bayesian learning is difficult

MAP estimation:  arg max_θ log p(D ∣ θ) + log p(θ)
log p(θ) serves as a regularizer and does not have to be conjugate
compared to a full-Bayesian approach, MAP does not model uncertainty and is sensitive to parametrization

common choices for the prior log p(θ):

the product of univariate Gaussians (Gaussian prior, L2 regularization penalty):
p(θ; σ) ∝ Πᵢ exp(−θᵢ² / (2σ²))  ⇒  log p(θ; σ) = −(1/(2σ²)) Σᵢ θᵢ² + c

the product of univariate Laplacians (Laplace prior, L1 regularization penalty, sparsity-inducing):
p(θ; β) = Πᵢ (1/(2β)) exp(−∣θᵢ∣ / β)  ⇒  log p(θ; β) = −(1/β) Σᵢ ∣θᵢ∣ + c

both priors penalize large parameter values and reduce fluctuations in the density
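In gradient-based MAP training the prior simply adds a penalty term (and its gradient) to the objective. A small sketch of the two gradient contributions, under the σ and β parametrization used above:

```python
# Sketch (assumed notation): how the two priors enter gradient-based MAP training.
#   L2 / Gaussian prior: add -(1/(2σ²)) Σ_i θ_i²  to the objective
#   L1 / Laplace  prior: add -(1/β)     Σ_i |θ_i| to the objective
import numpy as np

def l2_penalty_grad(theta, sigma):
    # gradient of the Gaussian log-prior: smooth shrinkage of every parameter toward 0
    return -theta / sigma**2

def l1_penalty_subgrad(theta, beta):
    # subgradient of the Laplace log-prior: constant pull toward 0 (drives exact zeros)
    return -np.sign(theta) / beta
```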
note that the ratio of two probabilities does not involve the partition function:
log ( p(x; θ) / p(x′; θ) ) = θᵀ ( ϕ(x) − ϕ(x′) )
pseudo moment-matching

we want to set the parameters such that, if/when loopy BP converges, its marginals match the empirical marginals:
p_D(A, B) = p̂(A, B; θ),  p_D(B, D) = p̂(B, D; θ),  …
(empirical marginals = marginals using BP)

factors: ϕ(A, B), ϕ(B, D), ϕ(D, F), ϕ(F, E), ϕ(C, E), ϕ(A, C)

idea: use the BP reparametrization
p(A, B, C, D, E, F) ∝ ( p̂(A, B) ⋯ p̂(C, A) ) / ( p̂(A) ⋯ p̂(F) )
the product of clique marginals, divided by the single-node marginals to cancel the double-counts

set the factors using the empirical marginals, e.g. ϕ(A, B) ← p_D(A, B) / p_D(A),
so that each term in the numerator and denominator is used exactly once;
if we run BP on the resulting model we will have p_D(A, B) = p̂(A, B; θ), …
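A sketch of the factor assignment just described, under the assumption that the graph is the 6-cycle from the slide and the edges are oriented so that each node appears as a first endpoint exactly once (so every single-node marginal is divided out exactly once); all names are illustrative:

```python
# Illustrative sketch of pseudo moment-matching: set each pairwise factor from the
# empirical marginals, phi(x_i, x_j) = p_D(x_i, x_j) / p_D(x_i). Assumes the edges
# are oriented so that each node is the first endpoint of exactly one edge (a cycle).
import numpy as np

def factors_from_empirical(pair_marginals, single_marginals, oriented_edges):
    """pair_marginals[(i, j)]: empirical table p_D(x_i, x_j), shape (|X_i|, |X_j|)
       single_marginals[i]   : empirical vector p_D(x_i), shape (|X_i|,)"""
    factors = {}
    for (i, j) in oriented_edges:
        factors[(i, j)] = pair_marginals[(i, j)] / single_marginals[i][:, None]
    return factors
```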
pseudo likelihood

log-likelihood, using the chain rule:
log p(D; θ) = Σ_{x∈D} Σᵢ log p(xᵢ ∣ x₁, …, x_{i−1}; θ)

the pseudo log-likelihood is an approximation that conditions each variable on all of the others:
log p(D; θ) ≈ Σ_{x∈D} Σᵢ log p(xᵢ ∣ x₋ᵢ; θ),   where x₋ᵢ = [x₁, …, x_{i−1}, x_{i+1}, …, x_n]

each conditional eliminates the normalization constant:
p(xᵢ ∣ x₋ᵢ; θ) = p(x; θ) / Σ_{xᵢ} p(x; θ) = p̃(x; θ) / Σ_{xᵢ} p̃(x; θ)

it also simplifies the gradient: instead of calculating
Σ_{x∈D} ϕ_k(x) − ∣D∣ E_{p_θ}[ϕ_k(x)]        (expensive!)
use
Σ_{x∈D} ( ϕ_k(x) − Σᵢ E_{p(⋅∣x₋ᵢ;θ)}[ϕ_k(xᵢ′, x₋ᵢ)] )

upshot: only conditional expectations are used (tractable!); the conditionals can be further simplified using the Markov blanket of each node

in the limit of large data (assuming we have the right model), this is exact!
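A minimal sketch of the pseudo log-likelihood for a small binary model, assuming a generic feature function phi (an assumption, not from the slides); note that each conditional only normalizes over the values of x_i, so the global Z(θ) never appears:

```python
# Minimal sketch: pseudo log-likelihood for binary variables. For every data point
# and every coordinate i, score log p(x_i | x_{-i}; θ), normalizing over x_i only.
import numpy as np

def pseudo_log_likelihood(theta, data, phi):
    """theta: parameter vector; data: (N, n) array of 0/1 assignments;
       phi  : function mapping a full assignment x to its sufficient statistics."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            scores = []
            for v in (0, 1):                 # enumerate the values of x_i only
                x_v = np.array(x)
                x_v[i] = v
                scores.append(theta @ phi(x_v))
            total += scores[int(x[i])] - np.logaddexp(scores[0], scores[1])
    return total
```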
contrastive learning

log-likelihood:
log p(D; θ) = Σ_{x∈D} log p̃(x; θ) − ∣D∣ log Z(θ)

the first term increases the unnormalized probability of the data; it is easy to evaluate (e.g., ⟨θ, ϕ(x)⟩)
the second term keeps the total sum of unnormalized probabilities small; log Z(θ) = log Σ_x p̃(x; θ) is a sum over exponentially many terms

contrastive methods: replace log Z(θ) with a tractable alternative
contrastive divergence minimization: only look at a small "neighborhood" of the data
margin-based training: consider log max_{x′ ≠ x} p̃(x′; θ)
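A hedged sketch of one contrastive-divergence (CD-1) parameter update, assuming a feature function phi and a single-sweep Gibbs sampler gibbs_sweep supplied by the caller (both hypothetical); the negative samples stay in a small neighborhood of the data and stand in for E_θ[ϕ(x)]:

```python
# Sketch of a CD-1 update: contrast the data against samples one Gibbs sweep away,
# instead of computing the exact model expectation E_θ[ϕ(x)].
import numpy as np

def cd1_update(theta, batch, phi, gibbs_sweep, lr=0.1):
    grad = np.zeros_like(theta)
    for x in batch:
        x_neg = gibbs_sweep(x, theta)        # "negative" sample near the data point
        grad += phi(x) - phi(x_neg)          # data statistics minus neighborhood statistics
    return theta + lr * grad / len(batch)
```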
structure learning in Markov networks

X − Y ⇒ X ⊥ Y ∣ MB(Y) ∨ X ⊥ Y ∣ MB(X)
similar to finding the undirected skeleton of a Bayes net; bound the size of the Markov blanket (versus the number of parents in the BN)

score-based learning: likelihood score, Bayesian score (approx. BIC); these scores do not decompose ⇒ learn models with low tree-width
MAP score (L1-regularized log-likelihood): a convex problem; introduce features one by one until convergence
summary

parameter learning in MRFs is difficult: the normalization constant ties the parameters together, the likelihood does not decompose, and Bayesian inference is also difficult

the (conditional) log-likelihood is concave, so its maximization is a convex problem; gradient steps need inference on the current model; the global optimum satisfies the moment-matching condition; combine inference methods + gradient descent for learning

alternative approaches: pseudo moment matching, pseudo likelihood, contrastive divergence, margin-based training