SLIDE 1

Probabilistic Graphical Models

10-708

More on learning fully observed BNs, exponential families, and generalized linear models

Eric Xing

Lecture 10, Oct 12, 2005. Reading: MJ Chap. 7, 8

Exponential family

For a numeric random variable X, a distribution of the form

    p(x \mid \eta) = h(x) \exp\{ \eta^\top T(x) - A(\eta) \} = \frac{1}{Z(\eta)} h(x) \exp\{ \eta^\top T(x) \}

is an exponential family distribution with natural (canonical) parameter η.

The function T(x) is a sufficient statistic. The function A(η) = log Z(η) is the log normalizer.

Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...

A distribution p(x) has finite sufficient statistics (independent of the number of data cases) iff it is in the exponential family.
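As a quick illustration (my own minimal sketch, not from the slides, assuming NumPy): the Bernoulli p(x|π) = π^x (1-π)^(1-x) fits this template with η = log(π/(1-π)), T(x) = x, A(η) = log(1 + e^η), and h(x) = 1. The snippet checks that the two parameterizations agree.

    import numpy as np

    def bernoulli_moment(pi, x):
        # Standard parameterization: p(x | pi) = pi^x (1 - pi)^(1 - x)
        return pi**x * (1.0 - pi)**(1 - x)

    def bernoulli_natural(eta, x):
        # Exponential family form: p(x | eta) = h(x) exp{eta * T(x) - A(eta)}
        # with h(x) = 1, T(x) = x, A(eta) = log(1 + e^eta)
        A = np.log1p(np.exp(eta))
        return np.exp(eta * x - A)

    pi = 0.3
    eta = np.log(pi / (1.0 - pi))          # natural parameter
    for x in (0, 1):
        assert np.isclose(bernoulli_moment(pi, x), bernoulli_natural(eta, x))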

SLIDE 2

Multivariate Gaussian Distribution

For a continuous vector random variable X ∈ R^k:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \}
                          = \frac{1}{(2\pi)^{k/2}} \exp\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma^{-1} x x^\top) + \mu^\top \Sigma^{-1} x - \tfrac{1}{2} \mu^\top \Sigma^{-1} \mu - \tfrac{1}{2} \log|\Sigma| \}

Exponential family representation:

    \eta = [\, \Sigma^{-1}\mu \,;\, -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}) \,], \quad \text{i.e. } \eta_1 = \Sigma^{-1}\mu \text{ and } \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
    T(x) = [\, x \,;\, \mathrm{vec}(x x^\top) \,]
    A(\eta) = \tfrac{1}{2}\mu^\top \Sigma^{-1} \mu + \tfrac{1}{2}\log|\Sigma|
    h(x) = (2\pi)^{-k/2}

Here (µ, Σ) is the moment parameter and η is the natural parameter.

  • Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have a lower degree of freedom).
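A small sketch of the moment/natural parameter conversion implied above (my own illustration, assuming NumPy), using η1 = Σ⁻¹µ and η2 = -½Σ⁻¹:

    import numpy as np

    def gaussian_to_natural(mu, Sigma):
        # eta1 = Sigma^{-1} mu,  eta2 = -1/2 Sigma^{-1}
        P = np.linalg.inv(Sigma)          # precision matrix
        return P @ mu, -0.5 * P

    def gaussian_to_moment(eta1, eta2):
        # Invert the mapping: Sigma = -1/2 eta2^{-1},  mu = Sigma eta1
        Sigma = -0.5 * np.linalg.inv(eta2)
        return Sigma @ eta1, Sigma

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    eta1, eta2 = gaussian_to_natural(mu, Sigma)
    mu_back, Sigma_back = gaussian_to_moment(eta1, eta2)
    assert np.allclose(mu, mu_back) and np.allclose(Sigma, Sigma_back)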

Multinomial distribution

For a binary vector random variable x ~ multinomial(x | π):

    p(x \mid \pi) = \prod_{k=1}^{K} \pi_k^{x_k} = \exp\{ \sum_{k=1}^{K} x_k \ln \pi_k \}
                  = \exp\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + (1 - \sum_{k=1}^{K-1} x_k) \ln(1 - \sum_{k=1}^{K-1} \pi_k) \}
                  = \exp\{ \sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{1 - \sum_{j=1}^{K-1}\pi_j} + \ln(1 - \sum_{k=1}^{K-1} \pi_k) \}

Exponential family representation:

    \eta = [\, \ln(\pi_k / \pi_K) \,;\, 0 \,], \quad \text{where } \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k
    T(x) = [\, x \,]
    A(\eta) = -\ln(1 - \sum_{k=1}^{K-1}\pi_k) = \ln \sum_{k=1}^{K} e^{\eta_k}
    h(x) = 1
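As a sanity check (my own sketch, assuming NumPy): with η_k = ln(π_k/π_K) and η_K = 0, the log normalizer A(η) = ln Σ_k e^{η_k} equals -ln π_K, and the softmax of η recovers π.

    import numpy as np

    pi = np.array([0.2, 0.5, 0.3])             # moment parameters (sum to 1)
    eta = np.log(pi / pi[-1])                  # natural parameters; the last entry is 0

    A = np.log(np.exp(eta).sum())              # A(eta) = ln sum_k exp(eta_k)
    assert np.isclose(A, -np.log(pi[-1]))      # equals -ln(pi_K)
    assert np.allclose(np.exp(eta - A), pi)    # softmax of eta recovers pi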

SLIDE 3

Why exponential family?

Moment generating property

    \frac{dA(\eta)}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}
                           = \int T(x)\, h(x)\, \frac{\exp\{\eta^\top T(x)\}}{Z(\eta)}\, dx = E[T(x)]

    \frac{d^2 A(\eta)}{d\eta^2} = \int T^2(x)\, h(x)\, \frac{\exp\{\eta^\top T(x)\}}{Z(\eta)}\, dx
                                  - \int T(x)\, h(x)\, \frac{\exp\{\eta^\top T(x)\}}{Z(\eta)}\, dx \cdot \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}
                                = E[T^2(x)] - E[T(x)]^2 = \mathrm{Var}[T(x)]

Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centered moment:

    \frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2 A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
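A quick numerical illustration (my own sketch, assuming NumPy): for the Bernoulli with A(η) = log(1 + e^η), a finite-difference derivative of A should match the mean π, and the second difference should match the variance π(1 - π).

    import numpy as np

    A = lambda eta: np.log1p(np.exp(eta))    # Bernoulli log normalizer

    eta, h = 0.7, 1e-4
    pi = 1.0 / (1.0 + np.exp(-eta))          # mean of the Bernoulli

    dA  = (A(eta + h) - A(eta - h)) / (2 * h)             # first derivative
    d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # second derivative

    assert np.isclose(dA, pi, atol=1e-6)                  # mean
    assert np.isclose(d2A, pi * (1 - pi), atol=1e-6)      # variance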

SLIDE 4

Moment vs canonical parameters

The moment parameter µ can be derived from the natural (canonical) parameter:

    \frac{dA(\eta)}{d\eta} = E[T(x)] \;\stackrel{\text{def}}{=}\; \mu

A(η) is convex since

    \frac{d^2 A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] > 0

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (a 1-to-1 mapping):

    \eta \;\stackrel{\text{def}}{=}\; \psi(\mu)

  • A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

[Figure: plot of the convex log normalizer A(η) against η.]

MLE for Exponential Family

For iid data, the log-likelihood is

    \ell(\eta; D) = \log \prod_n h(x_n) \exp\{ \eta^\top T(x_n) - A(\eta) \}
                  = \sum_n \log h(x_n) + \eta^\top \sum_n T(x_n) - N A(\eta)

Take derivatives and set to zero:

    \frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0
    \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N}\sum_n T(x_n)
    \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)

This amounts to moment matching. We can infer the canonical parameters using

    \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})
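A minimal sketch of MLE by moment matching (my own, assuming NumPy) for the Bernoulli: the sufficient statistic is T(x) = x, so µ̂ is the sample mean and η̂ = ψ(µ̂) = log(µ̂/(1 - µ̂)).

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.binomial(1, 0.7, size=10000)    # iid Bernoulli(0.7) samples

    mu_hat = data.mean()                       # moment matching: mu_hat = (1/N) sum_n T(x_n)
    eta_hat = np.log(mu_hat / (1 - mu_hat))    # canonical parameter via eta = psi(mu)

    print(mu_hat, eta_hat)                     # mu_hat ~ 0.7, eta_hat ~ log(0.7/0.3) ~ 0.85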

SLIDE 5

Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).

  • We can throw away X for the purpose of inference w.r.t. θ.
  • Bayesian view:  p(θ | T(x), x) = p(θ | T(x))
  • Frequentist view:  p(x | T(x), θ) = p(x | T(x))
  • The Neyman factorization theorem
  • T(x) is sufficient for θ if

    p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))
    \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))

[Figure: the three corresponding graphical models over θ, T(x), and X.]

Examples

Gaussian:

    \eta = [\, \Sigma^{-1}\mu \,;\, -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}) \,]
    T(x) = [\, x \,;\, \mathrm{vec}(x x^\top) \,]
    A(\eta) = \tfrac{1}{2}\mu^\top \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|
    h(x) = (2\pi)^{-k/2}

    \Rightarrow\; \hat{\mu}_{MLE,1} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n

Multinomial:

    \eta = [\, \ln(\pi_k/\pi_K) \,;\, 0 \,]
    T(x) = [\, x \,]
    A(\eta) = \ln \sum_{k=1}^{K} e^{\eta_k} = -\ln \pi_K
    h(x) = 1

    \Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n

Poisson:

    \eta = \log\lambda
    T(x) = x
    A(\eta) = \lambda = e^{\eta}
    h(x) = \frac{1}{x!}

    \Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n

SLIDE 6

Generalized Linear Models (GLIMs)

The graphical model

  • Linear regression
  • Discriminative linear classification
  • Commonality:

model E(Y) = µ = f(θ^T X)

  • What is p(·), the conditional distribution of Y?
  • What is f(·), the response function?

GLIM

  • The observed input x is assumed to enter into the model via a linear combination of its elements,

    \xi = \theta^\top x

  • The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
  • The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

[Figure: GLIM plate model, Xn → Yn, repeated for n = 1, ..., N.]

GLIM, cont.

  • The choice of exponential family is constrained by the nature of the data Y.
  • Example: y is a continuous vector ⇒ multivariate Gaussian; y is a class label ⇒ Bernoulli or multinomial.
  • The output distribution (possibly with a scale parameter φ):

    p(y \mid \eta) = h(y) \exp\{ \eta^\top y - A(\eta) \}
    \quad\text{or}\quad
    p(y \mid \eta, \phi) = h(y, \phi) \exp\{ \tfrac{1}{\phi}(\eta^\top y - A(\eta)) \}

  • The choice of the response function: subject to some mild constraints, e.g., range [0,1], positivity, ...
  • Canonical response function:

    f = \psi^{-1}(\cdot)

  • In this case θ^T x directly corresponds to the canonical parameter η.

[Figure: the chain x → ξ = θ^T x → f → µ → ψ → η.]
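A small sketch of the GLIM pipeline (my own, assuming NumPy): ξ = θ^T x gives the canonical parameter directly, and µ = f(ξ), where the canonical response f = ψ⁻¹ is the identity for the Gaussian, the logistic function for the Bernoulli, and exp for the Poisson.

    import numpy as np

    def glim_mean(theta, x, family="bernoulli"):
        # Canonical GLIM: eta = xi = theta^T x, mu = f(xi) with f = psi^{-1}
        xi = theta @ x
        if family == "gaussian":
            return xi                         # identity response
        if family == "bernoulli":
            return 1.0 / (1.0 + np.exp(-xi))  # logistic response
        if family == "poisson":
            return np.exp(xi)                 # exponential response
        raise ValueError(family)

    theta = np.array([0.5, -1.0])
    x = np.array([2.0, 1.0])
    print(glim_mean(theta, x, "bernoulli"))   # conditional mean E[y | x]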

SLIDE 7

MLE for GLIMs with natural response

Log-likelihood:

    \ell = \sum_n \log h(y_n) + \sum_n \left( \theta^\top x_n y_n - A(\eta_n) \right)

Derivative of the log-likelihood:

    \frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \right)
                          = \sum_n (y_n - \mu_n)\, x_n
                          = X^\top (y - \mu)

This is a fixed point function because µ is a function of θ.

Online learning for canonical GLIMs

  • Stochastic gradient ascent = least mean squares (LMS) algorithm:

    \theta^{(t+1)} = \theta^{(t)} + \rho \left( y_n - \mu_n^{(t)} \right) x_n

    where \mu_n^{(t)} = (\theta^{(t)})^\top x_n and ρ is a step size.
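A minimal sketch (my own, assuming NumPy; the data are synthetic) of the LMS update θ ← θ + ρ(y_n - µ_n)x_n for a canonical Gaussian GLIM (linear regression), where µ_n = θ^T x_n:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 200, 3
    X = rng.normal(size=(N, d))
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=N)

    theta = np.zeros(d)
    rho = 0.01                                   # step size
    for epoch in range(50):
        for n in range(N):
            mu_n = theta @ X[n]                  # canonical Gaussian GLIM: mu = theta^T x
            theta += rho * (y[n] - mu_n) * X[n]  # LMS update
    print(theta)                                 # should approach theta_true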

Batch learning for canonical GLIMs

The Hessian matrix:

    H = \frac{d^2 \ell}{d\theta\, d\theta^\top}
      = \frac{d}{d\theta^\top} \sum_n (y_n - \mu_n)\, x_n
      = -\sum_n x_n \frac{d\mu_n}{d\theta^\top}
      = -\sum_n x_n \frac{d\mu_n}{d\eta_n} \frac{d\eta_n}{d\theta^\top}
      = -\sum_n x_n \frac{d\mu_n}{d\eta_n} x_n^\top
      = -X^\top W X \qquad \text{(since } \eta_n = \theta^\top x_n \text{)}

where X = [x_n^\top] is the design matrix and

    W = \mathrm{diag}\!\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right),

which can be computed by calculating the second derivative of A(η_n).

SLIDE 8

Iteratively Reweighted Least Squares (IRLS)

Recall the Newton-Raphson method with cost function J:

    \theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta J

We now have

    \nabla_\theta \ell = X^\top (y - \mu), \qquad H = -X^\top W X

Now:

    \theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta \ell
                   = (X^\top W^{(t)} X)^{-1} \left[ X^\top W^{(t)} X \theta^{(t)} + X^\top (y - \mu^{(t)}) \right]
                   = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}

  • where the adjusted response is

    z^{(t)} = X \theta^{(t)} + (W^{(t)})^{-1} (y - \mu^{(t)})

This can be understood as solving the following "iteratively reweighted least squares" problem:

    \theta^{(t+1)} = \arg\min_\theta \, (z - X\theta)^\top W (z - X\theta)
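A generic IRLS sketch (my own, assuming NumPy): the caller supplies the response function f and its derivative dµ/dη, and each iteration solves the weighted least-squares problem above.

    import numpy as np

    def irls(X, y, mean_fn, dmu_deta, n_iter=20):
        """Iteratively reweighted least squares for a canonical GLIM (a sketch)."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ theta                            # eta_n = theta^T x_n
            mu = mean_fn(eta)                          # mu_n = f(eta_n)
            w = np.clip(dmu_deta(eta), 1e-10, None)    # diagonal of W (clipped for stability)
            z = eta + (y - mu) / w                     # adjusted response z
            # theta <- argmin (z - X theta)^T W (z - X theta)
            XtW = X.T * w
            theta = np.linalg.solve(XtW @ X, XtW @ z)
        return theta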

Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

    p(y \mid x) = \mu(x)^{y} \left( 1 - \mu(x) \right)^{1-y}

where µ is a logistic function

    \mu(x) = \frac{1}{1 + e^{-\eta(x)}}

p(y|x) is an exponential family function, with

  • mean:

    E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}}

  • and canonical response function

    \eta = \xi = \theta^\top x

  • IRLS:

    \frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad
    W = \mathrm{diag}\!\left( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \right)
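A self-contained instantiation (my own sketch, assuming NumPy and synthetic data) of IRLS for logistic regression, mirroring the generic sketch above with dµ/dη = µ(1 - µ) as the weights:

    import numpy as np

    sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 2))
    theta_true = np.array([2.0, -1.0])
    y = rng.binomial(1, sigmoid(X @ theta_true))

    theta = np.zeros(2)
    for _ in range(10):                                # IRLS iterations
        mu = sigmoid(X @ theta)
        w = np.clip(mu * (1 - mu), 1e-6, None)         # diagonal of W
        z = X @ theta + (y - mu) / w                   # adjusted response
        XtW = X.T * w
        theta = np.linalg.solve(XtW @ X, XtW @ z)
    print(theta)                                       # close to theta_true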

SLIDE 9

Logistic regression: practical issues

It is very common to use regularized maximum likelihood.

  • IRLS takes O(Nd³) per iteration, where N = number of training cases

and d = dimension of input x.

  • Quasi-Newton methods, which approximate the Hessian, work faster.
  • Conjugate gradient takes O(Nd) per iteration, and usually works best in

practice.

  • Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):

    p(y_n \mid x_n, \theta) = \sigma(y_n \theta^\top x_n) = \frac{1}{1 + e^{-y_n \theta^\top x_n}}, \qquad y_n = \pm 1

    p(\theta) \sim \text{Normal}(0, \lambda^{-1} I)

    \ell(\theta) = \sum_n \log \sigma(y_n \theta^\top x_n) - \frac{\lambda}{2}\, \theta^\top \theta

    \nabla_\theta \ell = \sum_n \left( 1 - \sigma(y_n \theta^\top x_n) \right) y_n x_n - \lambda \theta
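A minimal sketch (my own, assuming NumPy; labels in {-1, +1}) of a stochastic gradient step for the regularized objective above, using the per-example gradient (1 - σ(y_n θ^T x_n)) y_n x_n - λθ:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def sgd_step(theta, x_n, y_n, lam=0.1, rho=0.01):
        # y_n in {-1, +1}; the regularizer is applied per example here for simplicity
        grad = (1.0 - sigmoid(y_n * (theta @ x_n))) * y_n * x_n - lam * theta
        return theta + rho * grad                 # gradient ascent step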

Example 2: linear regression

The conditional distribution: a Gaussian

    p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2} (y - \mu(x))^\top \Sigma^{-1} (y - \mu(x)) \}
                                \;\Rightarrow\; h(x, y) \exp\{ \eta^\top \Sigma^{-1} y - A(\eta) \}

where µ is a linear function

    \mu(x) = \theta^\top x = \eta(x)

p(y|x) is an exponential family function, with

  • mean:

    E[y \mid x] = \mu = \theta^\top x

  • and canonical response function

    \eta = \xi = \theta^\top x

  • IRLS:

    \frac{d\mu}{d\eta} = 1, \qquad W = I

so that

    \theta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}
                   = \theta^{(t)} + (X^\top X)^{-1} X^\top (y - \mu^{(t)})
    \;\xrightarrow{\; t \to \infty \;}\; \theta = (X^\top X)^{-1} X^\top y

Recall: steepest descent gives back the LMS rule from before, while the batch fixed point is the normal equation.
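A small check (my own sketch, assuming NumPy) that for the Gaussian GLIM a single IRLS step (with W = I) already lands on the normal-equation solution θ = (X^T X)^{-1} X^T y:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 2.0, -0.5]) + 0.05 * rng.normal(size=100)

    theta = np.zeros(3)
    mu = X @ theta
    theta_irls = theta + np.linalg.solve(X.T @ X, X.T @ (y - mu))   # one IRLS step (W = I)
    theta_ne = np.linalg.solve(X.T @ X, X.T @ y)                    # normal equation
    assert np.allclose(theta_irls, theta_ne)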

SLIDE 10

MLE for general BNs

If we assume the parameters for each CPD are globally

independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

    \ell(\theta; D) = \log p(D \mid \theta)
                    = \log \prod_n p(\mathbf{x}_n \mid \theta)
                    = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i)
                    = \sum_i \left( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right)

Example: consider the distribution defined by the directed acyclic GM

    p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node DAG X1 → X2, X1 → X3, X2 → X4, X3 → X4, shown next to the four separate family networks (X1), (X1 → X2), (X1 → X3), and (X2, X3 → X4).]
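A rough sketch (my own, assuming NumPy, binary variables, and an illustrative CPD layout that is not from the slides) of how the decomposed log-likelihood is evaluated for this four-node example, one local term per family:

    import numpy as np

    # data[n] = (x1, x2, x3, x4), all binary; cpds holds one conditional table per node
    def bn_log_likelihood(data, cpds):
        ll = 0.0
        for x1, x2, x3, x4 in data:
            ll += np.log(cpds["p1"][x1])              # p(x1)
            ll += np.log(cpds["p2"][x1][x2])          # p(x2 | x1)
            ll += np.log(cpds["p3"][x1][x3])          # p(x3 | x1)
            ll += np.log(cpds["p4"][x2][x3][x4])      # p(x4 | x2, x3)
        return ll

    cpds = {
        "p1": np.array([0.6, 0.4]),
        "p2": np.array([[0.7, 0.3], [0.2, 0.8]]),
        "p3": np.array([[0.5, 0.5], [0.9, 0.1]]),
        "p4": np.array([[[0.8, 0.2], [0.4, 0.6]],
                        [[0.3, 0.7], [0.1, 0.9]]]),
    }
    data = [(0, 1, 0, 1), (1, 1, 1, 0)]
    print(bn_log_likelihood(data, cpds))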

SLIDE 11

MLE for BNs with tabular CPDs

Assume each CPD is represented as a table (multinomial) where

    \theta_{ijk} \;\stackrel{\text{def}}{=}\; p(X_i = j \mid X_{\pi_i} = k)

  • Note that in the case of multiple parents, X_{\pi_i} will have a composite state, and the CPD will be a high-dimensional table.
  • The sufficient statistics are counts of family configurations:

    n_{ijk} \;\stackrel{\text{def}}{=}\; \sum_n x_{n,i}^{j}\, x_{n,\pi_i}^{k}

The log-likelihood is

    \ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{\,n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}

Using a Lagrange multiplier to enforce \sum_j \theta_{ijk} = 1, we get

    \theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}
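A minimal sketch (my own, assuming NumPy) of the counting estimator θ̂_{ijk} = n_{ijk} / Σ_{j'} n_{ij'k} for one node X_i with a single discrete parent:

    import numpy as np

    def tabular_cpd_mle(child_vals, parent_vals, n_child, n_parent):
        # counts[k, j] = n_ijk: number of cases with parent = k and child = j
        counts = np.zeros((n_parent, n_child))
        for j, k in zip(child_vals, parent_vals):
            counts[k, j] += 1
        return counts / counts.sum(axis=1, keepdims=True)   # theta_hat[k, j] = p(X_i = j | parent = k)

    parent = np.array([0, 0, 1, 1, 1, 0])
    child  = np.array([1, 0, 1, 1, 0, 1])
    print(tabular_cpd_mle(child, parent, n_child=2, n_parent=2))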

MLE and Kullback-Leibler divergence

KL divergence:

    \mathrm{KL}\big( q(x) \,\|\, p(x) \big) = \sum_x q(x) \log \frac{q(x)}{p(x)}

Empirical distribution:

    \tilde{p}(x) \;\stackrel{\text{def}}{=}\; \frac{1}{N} \sum_{n=1}^{N} \delta(x, x_n)

  • where δ(x, x_n) is a Kronecker delta function.

Then

    \mathrm{KL}\big( \tilde{p}(x) \,\|\, p(x \mid \theta) \big)
        = \sum_x \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x \mid \theta)}
        = \sum_x \tilde{p}(x) \log \tilde{p}(x) - \sum_x \tilde{p}(x) \log p(x \mid \theta)
        = C - \frac{1}{N} \sum_n \log p(x_n \mid \theta)
        = C - \frac{1}{N}\, \ell(\theta; D)

so maximizing the likelihood (MLE) is equivalent to minimizing the KL divergence:  max_θ ℓ  ⇔  min_θ KL.
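A quick numerical check of this identity (my own sketch, assuming NumPy) for a Bernoulli model: KL(p̃ ‖ p_θ) equals C - (1/N)ℓ(θ; D), so the θ minimizing one maximizes the other.

    import numpy as np

    data = np.array([1, 1, 0, 1, 0, 1, 1, 0])                       # binary observations
    p_tilde = np.array([np.mean(data == 0), np.mean(data == 1)])    # empirical distribution

    theta = 0.4                                                     # candidate Bernoulli parameter
    p_model = np.array([1 - theta, theta])

    kl = np.sum(p_tilde * np.log(p_tilde / p_model))
    C = np.sum(p_tilde * np.log(p_tilde))                           # negative entropy of p_tilde
    ll = np.sum(np.log(np.where(data == 1, theta, 1 - theta)))      # log-likelihood
    assert np.isclose(kl, C - ll / len(data))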

SLIDE 12

Parameter sharing

Consider a time-invariant (stationary) 1st-order Markov model

  • Initial state probability vector:

    \pi_k \;\stackrel{\text{def}}{=}\; p(X_1^k = 1)

  • State transition probability matrix:

    A_{ij} \;\stackrel{\text{def}}{=}\; p(X_t^j = 1 \mid X_{t-1}^i = 1)

The joint:

    p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(X_t \mid X_{t-1})

The log-likelihood:

    \ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}, A)

Again, we optimize each parameter separately

  • π is a multinomial frequency vector, and we've seen it before
  • What about A?

[Figure: chain X1 → X2 → X3 → ... → XT, with A shared across all transitions.]

Learning a Markov chain transition matrix

A is a stochastic matrix:  \sum_j A_{ij} = 1.  Each row of A is a multinomial distribution, so the MLE of A_{ij} is the fraction of transitions from i to j:

    A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)}
                = \frac{ \sum_n \sum_{t=2}^{T} x_{n,t-1}^{i}\, x_{n,t}^{j} }{ \sum_n \sum_{t=2}^{T} x_{n,t-1}^{i} }

Application:

  • If the states X_t represent words, this is called a bigram language model.

Sparse data problem:

  • If i → j did not occur in the data, we will have A_{ij} = 0, and then any future sequence containing the word pair i → j will have zero probability.
  • A standard hack: backoff smoothing or deleted interpolation,

    \tilde{A}_{ij} = \lambda\, \eta_j + (1 - \lambda)\, A_{ij}^{ML}

    where η is a backoff (e.g., marginal) distribution over the next state.
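A small sketch (my own, assuming NumPy and toy integer-coded sequences) of the count-based MLE A_{ij} = #(i→j)/#(i→·), followed by an optional deleted-interpolation smoothing step with a uniform backoff:

    import numpy as np

    def transition_mle(sequences, n_states):
        counts = np.zeros((n_states, n_states))
        for seq in sequences:
            for s, s_next in zip(seq[:-1], seq[1:]):
                counts[s, s_next] += 1                      # #(i -> j)
        return counts / counts.sum(axis=1, keepdims=True)   # row-normalize

    sequences = [[0, 1, 2, 1, 0], [1, 1, 2, 0]]
    A_ml = transition_mle(sequences, n_states=3)

    # deleted interpolation with a uniform backoff distribution eta (lambda chosen arbitrarily)
    lam, eta = 0.1, np.full(3, 1.0 / 3.0)
    A_smooth = lam * eta + (1 - lam) * A_ml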

SLIDE 13

Bayesian language model

Global and local parameter independence: the posterior over A_{i·} and A_{i'·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.

Assign a Dirichlet prior β_i to each row of the transition matrix:

    A_{ij}^{Bayes} = p(j \mid i, D, \beta_i)
                   = \frac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|}
                   = \lambda_i \frac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i)\, A_{ij}^{ML},
    \qquad \text{where } \lambda_i \;\stackrel{\text{def}}{=}\; \frac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}
    \text{ and } |\beta_i| = \sum_k \beta_{i,k}

  • We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)

[Figure: graphical model with hyperparameters α, β over π and the transition-matrix rows A_{i·}, A_{i'·}, ..., generating the chain X1 → X2 → X3 → ... → XT.]
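A final sketch (my own, assuming NumPy) of the posterior-mean estimate above: add the Dirichlet pseudo-counts β_i to the transition counts before normalizing each row.

    import numpy as np

    def transition_bayes(counts, beta):
        # counts[i, j] = #(i -> j);  beta[i, j] = Dirichlet pseudo-counts for row i
        return (counts + beta) / (counts + beta).sum(axis=1, keepdims=True)

    counts = np.array([[0., 3., 1.],
                       [2., 0., 2.],
                       [1., 1., 0.]])
    beta = np.ones_like(counts)          # a uniform Dirichlet(1, 1, 1) prior on each row
    print(transition_bayes(counts, beta))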