Probabilistic Graphical Models
Parameter learning in Bayesian networks
Siamak Ravanbakhsh, Fall 2019

Learning objectives
- the likelihood function and maximum-likelihood estimation (MLE)
- the role of the sufficient statistics
- MLE for Bayesian networks
through an example

A coin-flip dataset (heads ≡ 1, tails ≡ 0):  D = {1, 0, 0, 1, 1}
Bernoulli model:  p(x; θ) = θ^x (1 − θ)^{1−x}

Likelihood:  L(θ; D) = ∏_{x∈D} p(x; θ) = θ^3 (1 − θ)^2
The likelihood function is not a pdf (it does not integrate to 1).

Max-likelihood estimate (MLE): maximize the log-likelihood
log L(θ; D) = 3 log θ + 2 log(1 − θ)
∂/∂θ (3 log θ + 2 log(1 − θ)) = 3/θ − 2/(1 − θ) = (3 − 5θ)/(θ(1 − θ)) = 0  ⇒  θ̂ = 3/5
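As a sanity check, here is a minimal sketch (not from the slides) that recovers θ̂ = 3/5 numerically; a grid search over θ stands in for the closed-form derivation above.

```python
# Maximize the Bernoulli log-likelihood for D = {1, 0, 0, 1, 1} on a grid.
import numpy as np

D = np.array([1, 0, 0, 1, 1])            # heads = 1, tails = 0
thetas = np.linspace(0.001, 0.999, 999)  # grid over (0, 1)

# log L(theta; D) = N(1) log(theta) + N(0) log(1 - theta)
n1 = D.sum()
n0 = len(D) - n1
log_lik = n1 * np.log(thetas) + n0 * np.log(1 - thetas)

print(thetas[np.argmax(log_lik)])        # 0.6 = 3/5, the closed-form MLE
```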
For a model P(x; θ) with sufficient statistics ϕ = [ϕ_1, …, ϕ_K],
the sufficient statistics of the dataset are all that matters about the data:

E_D[ϕ(x)] = E_{D′}[ϕ(x′)]  ⇒  L(θ; D)^{1/∣D∣} = L(θ; D′)^{1/∣D′∣}   ∀ D, D′, θ

For the exponential family p(x) ∝ exp(⟨θ, ϕ(x)⟩), with mean parameters
E_p[ϕ(x)] = μ, the likelihood L(θ; D) = ∏_{x∈D} p(x; θ) depends on the data
only through the empirical sufficient statistics, and the MLE links the two
parameterizations: θ ↔ μ.
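To make the equivalence concrete, a minimal sketch (the two datasets are illustrative, not from the slides): for the Bernoulli model, ϕ(x) = x, so any two datasets with the same empirical mean yield the same per-instance likelihood at every θ.

```python
# Two datasets with equal E_D[phi(x)] give equal L(theta; D)^(1/|D|).
import numpy as np

def per_instance_lik(theta, D):
    # Bernoulli: L(theta; D) = prod_x theta^x (1 - theta)^(1 - x)
    D = np.asarray(D)
    log_L = np.sum(D * np.log(theta) + (1 - D) * np.log(1 - theta))
    return np.exp(log_L / len(D))        # L(theta; D)^(1/|D|)

D1 = [1, 0, 0, 1, 1]                     # E_D[x] = 3/5
D2 = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]      # E_D[x] = 6/10 = 3/5

for theta in (0.2, 0.5, 0.8):
    print(per_instance_lik(theta, D1), per_instance_lik(theta, D2))  # equal
```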
an example

A two-variable Bayesian network X → Y:
p(x, y; θ) = p(x; θ_X) p(y ∣ x; θ_{Y∣X})

The likelihood decomposes:
L(D; θ) = ∏_{(x,y)∈D} p(x; θ_X) p(y ∣ x; θ_{Y∣X})
        = (∏_{x∈D} p(x; θ_X)) (∏_{(x,y)∈D} p(y ∣ x; θ_{Y∣X}))
where the first factor is the likelihood of x alone.

In terms of counts:
L(D; θ) = (∏_{ℓ∈Val(X)} θ_{X,ℓ}^{N(x=ℓ)}) (∏_{ℓ,ℓ′∈Val(X)×Val(Y)} θ_{Y∣X,ℓ,ℓ′}^{N(x=ℓ, y=ℓ′)})

N(x=ℓ): number of times x = ℓ in the dataset
N(x=ℓ, y=ℓ′): number of times x = ℓ, y = ℓ′ in the dataset
θ_{X,ℓ} = p(X = ℓ),  θ_{Y∣X,ℓ,ℓ′} = p(Y = ℓ′ ∣ X = ℓ)

Maximizing each factor separately gives the MLE:
θ̂_{X,ℓ} = N(x=ℓ)/∣D∣,   θ̂_{Y∣X,ℓ,ℓ′} = N(x=ℓ, y=ℓ′)/N(x=ℓ)
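A minimal sketch of these count-based estimates (the (x, y) pairs below are hypothetical data, not from the slides):

```python
# Count-based MLE for the two-node network X -> Y.
import numpy as np

data = [(0, 1), (0, 0), (1, 1), (1, 1), (0, 1)]  # hypothetical (x, y) pairs
K_x, K_y = 2, 2

N_x = np.zeros(K_x)
N_xy = np.zeros((K_x, K_y))
for x, y in data:
    N_x[x] += 1                 # N(x = l)
    N_xy[x, y] += 1             # N(x = l, y = l')

theta_X = N_x / len(data)               # p(X = l)   = N(x=l) / |D|
theta_YgX = N_xy / N_x[:, None]         # p(Y = l' | X = l) = N(x=l, y=l') / N(x=l)
print(theta_X)
print(theta_YgX)
```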
general case

p(x; θ) = ∏_i p(x_i ∣ Pa_{x_i}; θ_{X_i∣Pa_{X_i}})

L(D; θ) = ∏_{x∈D} ∏_i p(x_i ∣ Pa_{x_i}; θ_{X_i∣Pa_{X_i}})
        = ∏_i ∏_{(x_i, Pa_{x_i})∈D} p(x_i ∣ Pa_{x_i}; θ_{X_i∣Pa_{X_i}})

The likelihood factors into local likelihood terms, one per conditional
distribution; maximizing each term is similar to solving individual prediction
problems.
Example (heads ≡ 1, tails ≡ 0)

The MLE gives θ̂ = 1/3 in both of the following cases:
case 1.  N(x = 1) = 1,   N(x = 0) = 2
case 2.  N(x = 1) = 100, N(x = 0) = 200

yet the second estimate rests on far more evidence: we need to model our
uncertainty.
Bayesian approach: assume a prior p(θ) and estimate the posterior.
Bayes rule for the parameters:
p(θ ∣ D) = p(θ) p(D ∣ θ) / p(D) ∝ p(θ) p(D ∣ θ)

posterior ∝ prior × likelihood, with likelihood p(D ∣ θ) = ∏_{x∈D} p(x ∣ θ)
and marginal likelihood p(D).
Uniform prior (heads ≡ 1, tails ≡ 0):
p(θ) = 1 for 0 ≤ θ ≤ 1 (and 0 otherwise)
p(θ ∣ D) ∝ p(θ) p(D ∣ θ) ∝ p(D ∣ θ)

Predict by averaging over the posterior, rather than a single MLE value:
p(x ∣ D) = ∫_0^1 p(θ ∣ D) p(x ∣ θ) dθ
with p(θ ∣ D) ∝ θ^{N(1)}(1 − θ)^{N(0)} and p(x ∣ θ) = θ^x (1 − θ)^{1−x}.

Carrying out the integral (and normalizing) gives the Laplace correction:
p(x = 1 ∣ D) = (N(1) + 1)/(N(0) + N(1) + 2)

Compare with prediction using the MLE:  p(x = 1 ∣ D) = N(1)/(N(0) + N(1))
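A minimal sketch comparing the two predictions for the running dataset D = {1, 0, 0, 1, 1}:

```python
# Posterior predictive under a uniform prior (Laplace correction) vs. MLE plug-in.
n1, n0 = 3, 2                              # counts from D = {1, 0, 0, 1, 1}

mle_pred = n1 / (n0 + n1)                  # 0.6
laplace_pred = (n1 + 1) / (n0 + n1 + 2)    # 4/7 ~ 0.571, pulled toward 1/2

print(mle_pred, laplace_pred)
# With no data at all, the Laplace prediction is 1/2; the MLE is undefined (0/0).
```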
Non-uniform prior (heads ≡ 1, tails ≡ 0):
p(θ ∣ D) ∝ p(θ) p(D ∣ θ)
A non-uniform p(θ) can encode prior beliefs, e.g., that we are more likely to
see heads.

p(θ) is a conjugate prior to the likelihood p(D ∣ θ) when the posterior
p(θ ∣ D) stays in the same family as p(θ):

p(D ∣ θ) ∝ θ^{N(1)}(1 − θ)^{N(0)}
p(θ; α, β) = γ θ^{α−1}(1 − θ)^{β−1},  with normalizing constant γ = Γ(α+β)/(Γ(α)Γ(β))
The conjugate prior to the Bernoulli likelihood is the Beta distribution:
p(θ; α, β) = Γ(α+β)/(Γ(α)Γ(β)) θ^{α−1}(1 − θ)^{β−1}

Γ is the extension of the factorial function: Γ(n + 1) = n!
The hyper-parameters α, β can be interpreted as # imaginary heads & tails.

prior predictive:  p(x = 1 ∣ D = ∅) = ∫_θ p(x = 1 ∣ θ) p(θ; α, β) dθ = α/(α + β)

posterior:
p(θ ∣ D) ∝ p(θ) P(D ∣ θ) ∝ θ^{α−1}(1 − θ)^{β−1} θ^{N(1)}(1 − θ)^{N(0)} = θ^{α−1+N(1)}(1 − θ)^{β−1+N(0)}

So if the prior is p(θ; α, β), the posterior is p(θ; α + N(1), β + N(0)).

(image: wikipedia)
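A minimal sketch of the conjugate update (assuming scipy is available; the uniform Beta(1, 1) prior is chosen for illustration):

```python
# Beta-Bernoulli conjugate update for D = {1, 0, 0, 1, 1}:
# prior Beta(a, b) -> posterior Beta(a + N(1), b + N(0)).
from scipy import stats

a, b = 1.0, 1.0                 # uniform prior, Beta(1, 1)
n1, n0 = 3, 2                   # counts from D

posterior = stats.beta(a + n1, b + n0)   # Beta(4, 3)
print(posterior.mean())                  # (a + n1)/(a + b + n1 + n0) = 4/7,
                                         # the Laplace-corrected prediction
```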
[Plots of Beta priors, image omitted: Beta densities with different prior
means α/(α + β), e.g., p(x = 1) = .2, and different prior strengths α + β,
e.g., α = β = 1 vs. α = β = 5, compared against the MLE.]
Bernoulli : Beta  ↔  Categorical : Dirichlet

prior: the Dirichlet distribution
p(θ; α) = Γ(∑_d α_d)/(∏_d Γ(α_d)) ∏_d θ_d^{α_d − 1},   α ∈ (ℜ^+)^D
α: pseudo-counts for the different categories

likelihood:  p(D ∣ θ) ∝ ∏_{x∈D} ∏_d θ_d^{I(x=d)} = ∏_d θ_d^{N(d)}

posterior:
p(θ ∣ D) ∝ p(θ) p(D ∣ θ) ∝ ∏_d θ_d^{N(d)} θ_d^{α_d − 1} = ∏_d θ_d^{α_d + N(d) − 1}

posterior predictive:
p(x = x̄ ∣ D) = ∫_θ p(θ ∣ D) p(x = x̄ ∣ θ) dθ = (α_{x̄} + N(x̄))/(N + ∑_d α_d)
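A minimal sketch of the Dirichlet-categorical posterior predictive (the category counts below are hypothetical data, not from the slides):

```python
# Posterior predictive p(x = d | D) = (alpha_d + N(d)) / (N + sum_d alpha_d).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])      # pseudo-counts for 3 categories
data = [0, 2, 2, 1, 2]                 # hypothetical categorical observations

N_d = np.bincount(data, minlength=len(alpha))   # N(d) for each category
predictive = (alpha + N_d) / (len(data) + alpha.sum())
print(predictive)                       # a proper distribution: sums to 1
```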
D = {1, 0, 0, 1, 1}

maximum likelihood value:  P(D ∣ θ̂) = (3/5)^3 (2/5)^2 ≈ .035

marginal likelihood value:
P(D) = ∫_{θ∈[0,1]} P(θ) P(D ∣ θ) dθ = ∏_{m=1}^{M} p(x^{(m)} ∣ x^{(1)}, …, x^{(m−1)})
     = α_1/α · α_0/(α+1) · (α_0+1)/(α+2) · (α_1+1)/(α+3) · (α_1+2)/(α+4)
     = Γ(α)/Γ(α+5) · Γ(α_1+3)/Γ(α_1) · Γ(α_0+2)/Γ(α_0) ≈ .017
using Γ(x + 1) = xΓ(x), where α = α_0 + α_1 (here with the uniform prior α_0 = α_1 = 1).

marginal likelihood for the Dirichlet in general:
P(D) = Γ(α)/Γ(α + ∣D∣) ∏_i Γ(α_i + ∣D∣ p_D(i))/Γ(α_i)
where ∣D∣ p_D(i) = N(i) is the count of category i and α = ∑_i α_i.
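A minimal sketch evaluating this formula in log space (assuming scipy; the Beta(1, 1) prior matches the ≈ .017 value above):

```python
# Marginal likelihood of D = {1, 0, 0, 1, 1} under a uniform Beta(1, 1) prior:
# P(D) = Gamma(a)/Gamma(a + |D|) * prod_i Gamma(a_i + N(i))/Gamma(a_i).
import numpy as np
from scipy.special import gammaln

alpha = np.array([1.0, 1.0])            # [alpha_0, alpha_1]
N = np.array([2, 3])                    # [N(0), N(1)] for D = {1, 0, 0, 1, 1}

log_PD = (gammaln(alpha.sum()) - gammaln(alpha.sum() + N.sum())
          + np.sum(gammaln(alpha + N) - gammaln(alpha)))
print(np.exp(log_PD))                   # 1/60 ~ 0.0167, vs. the ML value ~0.035
```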
exponential family

p(x ∣ θ) = exp(⟨ϕ(x), θ⟩ − A(θ))
p(D ∣ θ) = exp(⟨∑_{x∈D} ϕ(x), θ⟩ − N A(θ))

conjugate prior:  p(θ; η, ν) ∝ exp(⟨νη, θ⟩ − νA(θ))
posterior:        p(θ ∣ D; η, ν) ∝ exp(⟨νη + ∑_{x∈D} ϕ(x), θ⟩ − (ν + N) A(θ))

ν: imaginary counts;  η: imaginary expected sufficient statistics
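A minimal sketch instantiating this update for the Bernoulli case (an assumption, not from the slides: here ϕ(x) = x, θ = log(p/(1−p)), A(θ) = log(1 + e^θ), and the Beta(α, β) prior is identified with νη = α, ν = α + β, up to the usual ±1 convention):

```python
# Exponential-family conjugate update: nu*eta -> nu*eta + sum phi(x), nu -> nu + N.
import numpy as np

nu, eta = 2.0, 0.5          # Beta(1, 1) prior: nu = alpha + beta, nu*eta = alpha
D = np.array([1, 0, 0, 1, 1])

nu_post = nu + len(D)                        # nu + N = 7
eta_post = (nu * eta + D.sum()) / nu_post    # posterior mean of phi(x)
print(nu_post, eta_post)                     # 7.0, 4/7: matches the Beta(4, 3) mean
```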
global parameter independence: the prior decomposes

Example: for the network X → Y, assume θ_X ⊥ θ_{Y∣X} in the prior; in general

p(θ) = ∏_i p(θ_{X_i∣Pa_{X_i}})

Then the posterior also decomposes, p(θ ∣ D) = ∏_i p(θ_{X_i∣Pa_{X_i}} ∣ D), since

p(θ ∣ D) ∝ (∏_i p(θ_{X_i∣Pa_{X_i}})) (∏_{x∈D} ∏_i p(x_i ∣ Pa_{x_i}; θ_{X_i∣Pa_{X_i}}))    [prior × likelihood]
         = ∏_i ( p(θ_{X_i∣Pa_{X_i}}) ∏_{x∈D} p(x_i ∣ Pa_{x_i}; θ_{X_i∣Pa_{X_i}}) )          [∝ individual posteriors p(θ_{X_i∣Pa_{X_i}} ∣ D)]

So we can apply Bayesian learning to the individual conditional distributions.

The posterior predictive also decomposes:
p(x′ ∣ D) = ∏_i p(x_i′ ∣ Pa_{x_i′}, D) = ∏_i ∫ p(θ_{X_i∣Pa_{X_i}} ∣ D) p(x_i′ ∣ Pa_{x_i′}; θ_{X_i∣Pa_{X_i}}) dθ_{X_i∣Pa_{X_i}}
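A minimal sketch of this decomposition for X → Y (an assumption: Dirichlet priors per conditional distribution, hypothetical data):

```python
# With global parameter independence, Bayesian learning runs separately per
# conditional distribution: Dirichlet updates for X and for Y given each x.
import numpy as np

data = [(0, 1), (0, 0), (1, 1), (1, 1), (0, 1)]   # hypothetical (x, y) pairs
alpha_X = np.ones(2)                               # pseudo-counts for theta_X
alpha_YgX = np.ones((2, 2))                        # pseudo-counts for theta_{Y|x}

for x, y in data:
    alpha_X[x] += 1                                # update posterior for theta_X
    alpha_YgX[x, y] += 1                           # update posterior for theta_{Y|x}

# posterior-predictive CPTs (normalized pseudo-counts)
print(alpha_X / alpha_X.sum())
print(alpha_YgX / alpha_YgX.sum(axis=1, keepdims=True))
```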
discrete case: conditional probability tables (CPTs)

local parameter independence: assume a decomposed prior within each CPT,
p(θ_{Y∣X}) = p(θ_{Y∣x⁰}) p(θ_{Y∣x¹})   (for binary X)

The posterior is also decomposed:
p(θ_{Y∣X} ∣ D) = p(θ_{Y∣x⁰} ∣ D) p(θ_{Y∣x¹} ∣ D),
e.g., p(θ_{Y∣x⁰} ∣ D) ∝ p(θ_{Y∣x⁰}) ∏_{(x⁰,y)∈D} p(y ∣ x⁰; θ_{Y∣x⁰})

In practice: keep a vector of pseudo-counts for each node; after observing N
samples, update these based on the frequency of the different (x, y) values.

Choosing the pseudo-counts:
K2 prior: α_{Y∣x⁰} = α_{Y∣x¹} = [1, …, 1], similar to Laplace smoothing.
BDe prior: use a second Bayes-net to keep the frequencies P′(x_i, pa_{X_i}),
keep a total pseudo-count α, and then set α_{x_i∣pa_{X_i}} = α P′(x_i, pa_{X_i}).
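A minimal sketch of the BDe construction (an assumption for illustration: a uniform reference distribution P′ over binary x and pa, as an "empty" prior network would give):

```python
# BDe pseudo-counts: alpha_{x|pa} = alpha * P'(x, pa).
import numpy as np

alpha_total = 4.0
P_prime = np.full((2, 2), 0.25)        # uniform P'(x, pa) over binary x, pa

alpha_x_pa = alpha_total * P_prime     # one pseudo-count per CPT entry
print(alpha_x_pa)                      # every entry 1.0: with this P' and
                                       # alpha, BDe coincides with the K2 prior
```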