
Probabilistic Graphical Models 10-708
More on learning fully observed BNs, exponential families, and generalized linear models
Eric Xing


1. Probabilistic Graphical Models 10-708: More on learning fully observed BNs, exponential families, and generalized linear models. Eric Xing. Lecture 10, Oct 12, 2005. Reading: MJ-Chap. 7, 8.

Exponential family
- For a numeric random variable X,
  $$p(x \mid \eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x)\exp\{\eta^T T(x)\}$$
  is an exponential family distribution with natural (canonical) parameter η.
- Function T(x) is a sufficient statistic.
- Function A(η) = log Z(η) is the log normalizer.
- Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
- A distribution p(x) has finite sufficient statistics (independent of the number of data cases) iff it is in the exponential family.
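To make the definition concrete, here is a minimal sketch (not from the slides) that writes the Bernoulli distribution in this form, with T(x) = x, h(x) = 1, η = log(π/(1−π)), and A(η) = log(1 + e^η); the function name and the sanity check are my own.

```python
import numpy as np

def bernoulli_exp_family(x, pi):
    """Bernoulli pmf written as p(x | eta) = h(x) exp{eta * T(x) - A(eta)}."""
    eta = np.log(pi / (1.0 - pi))   # natural (canonical) parameter
    A = np.log(1.0 + np.exp(eta))   # log normalizer A(eta) = log Z(eta)
    T = x                           # sufficient statistic
    h = 1.0                         # base measure
    return h * np.exp(eta * T - A)

# Sanity check: agrees with the moment parameterization pi^x (1-pi)^(1-x).
pi = 0.3
for x in (0, 1):
    assert np.isclose(bernoulli_exp_family(x, pi), pi**x * (1 - pi)**(1 - x))
```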

2. Multivariate Gaussian distribution
- For a continuous vector random variable X ∈ R^k:
  $$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\left\{-\tfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right\} \quad \text{(moment parameterization)}$$
  $$= \frac{1}{(2\pi)^{k/2}} \exp\left\{-\tfrac{1}{2}\operatorname{tr}\left(\Sigma^{-1} x x^T\right) + \mu^T \Sigma^{-1} x - \tfrac{1}{2}\mu^T \Sigma^{-1}\mu - \tfrac{1}{2}\log|\Sigma|\right\}$$
- Exponential family representation (natural parameterization):
  $$\eta = \left[\Sigma^{-1}\mu;\ \operatorname{vec}\left(-\tfrac{1}{2}\Sigma^{-1}\right)\right] = [\eta_1;\ \operatorname{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu,\ \eta_2 = -\tfrac{1}{2}\Sigma^{-1}$$
  $$T(x) = \left[x;\ \operatorname{vec}(x x^T)\right]$$
  $$A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\operatorname{tr}\left(\eta_2^{-1}\eta_1\eta_1^T\right) - \tfrac{1}{2}\log|-2\eta_2|$$
  $$h(x) = (2\pi)^{-k/2}$$
- Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive-definiteness of Σ, the parameters are constrained and have lower degrees of freedom).

Multinomial distribution
- For a binary (one-of-K indicator) vector random variable x ~ multinomial(x | π):
  $$p(x \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\left\{\sum_k x_k \ln \pi_k\right\}$$
  $$= \exp\left\{\sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right)\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right)\right\}$$
  $$= \exp\left\{\sum_{k=1}^{K-1} x_k \ln\left(\frac{\pi_k}{1 - \sum_{j=1}^{K-1}\pi_j}\right) + \ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right)\right\}$$
- Exponential family representation:
  $$\eta = \left[\ln(\pi_k/\pi_K);\ 0\right], \quad T(x) = [x], \quad A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln\left(\sum_{k=1}^{K} e^{\eta_k}\right), \quad h(x) = 1$$
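A small sketch, assuming NumPy, of the Gaussian moment ↔ natural parameter maps given above; the helper names gaussian_moment_to_natural and gaussian_natural_to_moment are hypothetical, not from the lecture.

```python
import numpy as np

def gaussian_moment_to_natural(mu, Sigma):
    """(mu, Sigma) -> (eta1, eta2) with eta1 = Sigma^{-1} mu, eta2 = -1/2 Sigma^{-1}."""
    P = np.linalg.inv(Sigma)
    return P @ mu, -0.5 * P

def gaussian_natural_to_moment(eta1, eta2):
    """(eta1, eta2) -> (mu, Sigma): Sigma = (-2 eta2)^{-1}, mu = Sigma eta1."""
    Sigma = np.linalg.inv(-2.0 * eta2)
    return Sigma @ eta1, Sigma

# Round-trip check on an arbitrary positive-definite Sigma.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
eta1, eta2 = gaussian_moment_to_natural(mu, Sigma)
mu2, Sigma2 = gaussian_natural_to_moment(eta1, eta2)
assert np.allclose(mu, mu2) and np.allclose(Sigma, Sigma2)
```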

3. Why exponential family?
- Moment generating property:
  $$\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta} = \frac{1}{Z(\eta)}\frac{d}{d\eta}\int h(x)\exp\{\eta^T T(x)\}\,dx$$
  $$= \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx = E[T(x)]$$
  $$\frac{d^2 A}{d\eta^2} = \int T^2(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx - \int T(x)\,\frac{h(x)\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx \cdot \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}$$
  $$= E[T^2(x)] - E[T(x)]^2 = \operatorname{Var}[T(x)]$$

Moment estimation
- We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η); the qth derivative gives the qth centered moment:
  $$\frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2 A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots$$
- When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
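The moment-generating property is easy to check numerically. Below is a minimal sketch (my own, not from the slides) for the Poisson family, where A(η) = e^η, so mean = variance = λ; derivatives of A are approximated by finite differences.

```python
import numpy as np

A = np.exp              # Poisson log normalizer: A(eta) = e^eta
eta = np.log(3.5)       # natural parameter for lambda = 3.5
eps = 1e-5

# Central finite differences for dA/deta and d^2A/deta^2.
mean = (A(eta + eps) - A(eta - eps)) / (2 * eps)
var = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2

# Both should recover lambda = 3.5.
assert np.isclose(mean, 3.5, atol=1e-4)
assert np.isclose(var, 3.5, atol=1e-2)
```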

4. Moment vs. canonical parameters
- The moment parameter µ can be derived from the natural (canonical) parameter:
  $$\frac{dA(\eta)}{d\eta} = E[T(x)] \;\overset{\text{def}}{=}\; \mu$$
- A(η) is convex, since
  $$\frac{d^2 A(\eta)}{d\eta^2} = \operatorname{Var}[T(x)] > 0$$
  [Figure: plot of a convex A(η) over η ∈ [−2, 2], with minimizer η*.]
- Hence we can invert the relationship and infer the canonical parameter from the moment parameter (the map is 1-to-1):
  $$\eta \;\overset{\text{def}}{=}\; \psi(\mu)$$
- A distribution in the exponential family can therefore be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

MLE for exponential family
- For iid data, the log-likelihood is
  $$\ell(\eta; D) = \log \prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\} = \sum_n \log h(x_n) + \eta^T\left(\sum_n T(x_n)\right) - N A(\eta)$$
- Take derivatives and set to zero:
  $$\frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N\,\frac{\partial A(\eta)}{\partial \eta} = 0 \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N}\sum_n T(x_n) \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)$$
- This amounts to moment matching.
- We can then infer the canonical parameters using $\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})$.
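A short sketch of moment matching in practice, assuming Bernoulli data: the MLE of µ is the empirical mean of T(x) = x, and ψ (the logit, for the Bernoulli family) maps it back to the canonical parameter. The random seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=10_000)    # iid Bernoulli(0.3) samples

mu_mle = np.mean(data)                      # mu_MLE = (1/N) sum_n T(x_n)
eta_mle = np.log(mu_mle / (1 - mu_mle))     # eta_MLE = psi(mu_MLE) = logit(mu_MLE)

# mu_mle is close to 0.3, eta_mle close to log(0.3/0.7).
assert np.isclose(mu_mle, 0.3, atol=0.02)
```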

5. Sufficiency
- For p(x | θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x). We can throw away X for the purpose of inference w.r.t. θ.
- Bayesian view: $p(\theta \mid T(x), x) = p(\theta \mid T(x))$
- Frequentist view: $p(x \mid T(x), \theta) = p(x \mid T(x))$
- The Neyman factorization theorem: T(x) is sufficient for θ if
  $$p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x)) \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\,h(x, T(x))$$

Examples
- Gaussian:
  $$\eta = \left[\Sigma^{-1}\mu;\ \operatorname{vec}\left(-\tfrac{1}{2}\Sigma^{-1}\right)\right], \quad T(x) = \left[x;\ \operatorname{vec}(x x^T)\right], \quad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}$$
  $$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$
- Multinomial:
  $$\eta = \left[\ln(\pi_k/\pi_K);\ 0\right], \quad T(x) = [x], \quad A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1}\pi_k\right) = \ln\left(\sum_{k=1}^{K} e^{\eta_k}\right), \quad h(x) = 1$$
  $$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$
- Poisson:
  $$\eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^{\eta}, \quad h(x) = \frac{1}{x!}$$
  $$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$
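One way to see sufficiency operationally (my own toy illustration, not from the slides): the Poisson MLE depends on the data only through the sufficient statistic Σ_n x_n and the count N, so any two datasets with the same statistic give the same estimate.

```python
import numpy as np

def poisson_mle(xs):
    """lambda_MLE = (1/N) sum_n x_n: a function of the sufficient statistic only."""
    return np.mean(xs)

# Different samples, identical sufficient statistic (sum = 6, N = 3) -> same MLE.
assert poisson_mle([0, 2, 4]) == poisson_mle([2, 2, 2])
```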

6. Generalized Linear Models (GLIMs)
- The graphical model: X_n → Y_n, replicated over n = 1, ..., N.
- Commonality of linear regression and discriminative linear classification: both model E(Y) = µ = f(θ^T X).
  - What is p(·), the conditional distribution of Y?
  - What is f(·), the response function?
- GLIM:
  - The observed input x is assumed to enter the model via a linear combination of its elements, ξ = θ^T x.
  - The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
  - The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

GLIM, cont.
- The chain of transformations: x → ξ = θ^T x (via θ) → µ = f(ξ) (via f) → η = ψ(µ) (via ψ), so
  $$p(y \mid \eta) = h(y)\exp\{\eta^T y - A(\eta)\} \;\Rightarrow\; p(y \mid \eta, \phi) = h(y, \phi)\exp\left\{\tfrac{1}{\phi}\left(\eta^T y - A(\eta)\right)\right\}$$
  where φ is a scale (dispersion) parameter.
- The choice of exponential family is constrained by the nature of the data Y. Example: if y is a continuous vector, use a multivariate Gaussian; if y is a class label, use a Bernoulli or multinomial.
- The choice of the response function follows some mild constraints, e.g., range [0, 1], positivity, ...
- Canonical response function: f = ψ^{-1}(·). In this case θ^T x directly corresponds to the canonical parameter η.
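For concreteness, here is a sketch of the canonical response functions f = ψ^{-1} for three common output types, mapping the linear predictor ξ = θ^T x to the conditional mean µ; the function names are illustrative, not from the lecture.

```python
import numpy as np

def response_gaussian(xi):
    """Identity response: psi(mu) = mu, so f = identity (linear regression)."""
    return xi

def response_bernoulli(xi):
    """Logistic response: psi(mu) = logit(mu), so f = sigmoid (logistic regression)."""
    return 1.0 / (1.0 + np.exp(-xi))

def response_poisson(xi):
    """Exponential response: psi(mu) = log(mu), so f = exp (Poisson regression)."""
    return np.exp(xi)
```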

7. MLE for GLIMs with natural response
- Log-likelihood:
  $$\ell = \sum_n \log h(y_n) + \sum_n \left(\theta^T x_n y_n - A(\eta_n)\right)$$
- Derivative of the log-likelihood:
  $$\frac{d\ell}{d\theta} = \sum_n \left(x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\right) = \sum_n (y_n - \mu_n)\,x_n = X^T(y - \mu)$$
  This is a fixed point function, because µ is a function of θ.

Online learning for canonical GLIMs
- Stochastic gradient ascent = least mean squares (LMS) algorithm:
  $$\theta^{(t+1)} = \theta^{(t)} + \rho\left(y_n - \mu_n^{(t)}\right)x_n$$
  where $\mu_n^{(t)} = (\theta^{(t)})^T x_n$ and ρ is a step size.

Batch learning for canonical GLIMs
- The Hessian matrix:
  $$H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\,x_n = -\sum_n x_n \frac{d\mu_n}{d\theta^T} = -\sum_n x_n \frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T} = -\sum_n x_n x_n^T \frac{d\mu_n}{d\eta_n} \quad (\text{since } \eta_n = \theta^T x_n)$$
  $$= -X^T W X$$
  where $X = [x_n^T]$ is the design matrix and $W = \operatorname{diag}\left(\frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N}\right)$, which can be computed by calculating the 2nd derivative of A(η_n).
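A sketch putting both updates together for logistic regression (a canonical GLIM with Bernoulli output, so µ = σ(θ^T x) rather than the identity response in the LMS line above), on synthetic data; the data-generating setup, seed, and step size are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 3
X = rng.normal(size=(N, d))                      # design matrix, rows x_n^T
theta_true = np.array([1.0, -2.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ theta_true)))

# Online (LMS-style) learning: theta <- theta + rho (y_n - mu_n) x_n.
theta = np.zeros(d)
rho = 0.1                                        # step size
for n in range(N):                               # one stochastic-gradient pass
    mu_n = 1 / (1 + np.exp(-theta @ X[n]))       # canonical response for Bernoulli
    theta += rho * (y[n] - mu_n) * X[n]

# One batch Newton step using the Hessian H = -X^T W X from the slide
# (this is one iteration of iteratively reweighted least squares):
mu = 1 / (1 + np.exp(-X @ theta))
W = np.diag(mu * (1 - mu))                       # dmu/deta for Bernoulli
theta += np.linalg.solve(X.T @ W @ X, X.T @ (y - mu))
```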
