SLIDE 1

Probabilistic Graphical Models

10-708

More on learning fully observed BNs, exponential families, and generalized linear models

Eric Xing

Lecture 10, Oct 12, 2005. Reading: MJ Chap. 7, 8

Exponential family

For a numeric random variable X, a distribution of the form

    p(x \mid \eta) = h(x) \exp\{ \eta^\top T(x) - A(\eta) \} = \frac{1}{Z(\eta)} h(x) \exp\{ \eta^\top T(x) \}

is an exponential family distribution with natural (canonical) parameter η.

The function T(x) is a sufficient statistic. The function A(η) = log Z(η) is the log normalizer.

Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...

A distribution p(x) has finite sufficient statistics (independent of the number of data cases) iff it is in the exponential family.
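As a quick illustration (my own minimal sketch, not from the slides, assuming NumPy): the Bernoulli p(x|π) = π^x (1-π)^(1-x) fits this template with η = log(π/(1-π)), T(x) = x, A(η) = log(1 + e^η), and h(x) = 1. The snippet checks that the two parameterizations agree.

    import numpy as np

    def bernoulli_moment(pi, x):
        # Standard parameterization: p(x | pi) = pi^x (1 - pi)^(1 - x)
        return pi**x * (1.0 - pi)**(1 - x)

    def bernoulli_natural(eta, x):
        # Exponential family form: p(x | eta) = h(x) exp{eta * T(x) - A(eta)}
        # with h(x) = 1, T(x) = x, A(eta) = log(1 + e^eta)
        A = np.log1p(np.exp(eta))
        return np.exp(eta * x - A)

    pi = 0.3
    eta = np.log(pi / (1.0 - pi))          # natural parameter
    for x in (0, 1):
        assert np.isclose(bernoulli_moment(pi, x), bernoulli_natural(eta, x))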

SLIDE 2

Multivariate Gaussian Distribution

For a continuous vector random variable X ∈ R^k:

    p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2} (x-\mu)^\top \Sigma^{-1} (x-\mu) \}
                          = \frac{1}{(2\pi)^{k/2}} \exp\{ -\tfrac{1}{2} \mathrm{tr}(\Sigma^{-1} x x^\top) + \mu^\top \Sigma^{-1} x - \tfrac{1}{2} \mu^\top \Sigma^{-1} \mu - \tfrac{1}{2} \log|\Sigma| \}

Exponential family representation:

    \eta = [\, \Sigma^{-1}\mu \,;\, -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}) \,], \quad \text{i.e. } \eta_1 = \Sigma^{-1}\mu \text{ and } \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
    T(x) = [\, x \,;\, \mathrm{vec}(x x^\top) \,]
    A(\eta) = \tfrac{1}{2}\mu^\top \Sigma^{-1} \mu + \tfrac{1}{2}\log|\Sigma|
    h(x) = (2\pi)^{-k/2}

Here (µ, Σ) is the moment parameter and η is the natural parameter.

  • Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have a lower degree of freedom).
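A small sketch of the moment/natural parameter conversion implied above (my own illustration, assuming NumPy), using η1 = Σ⁻¹µ and η2 = -½Σ⁻¹:

    import numpy as np

    def gaussian_to_natural(mu, Sigma):
        # eta1 = Sigma^{-1} mu,  eta2 = -1/2 Sigma^{-1}
        P = np.linalg.inv(Sigma)          # precision matrix
        return P @ mu, -0.5 * P

    def gaussian_to_moment(eta1, eta2):
        # Invert the mapping: Sigma = -1/2 eta2^{-1},  mu = Sigma eta1
        Sigma = -0.5 * np.linalg.inv(eta2)
        return Sigma @ eta1, Sigma

    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
    eta1, eta2 = gaussian_to_natural(mu, Sigma)
    mu_back, Sigma_back = gaussian_to_moment(eta1, eta2)
    assert np.allclose(mu, mu_back) and np.allclose(Sigma, Sigma_back)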

Multinomial distribution

For a binary vector random variable x ~ multinomial(x | π):

    p(x \mid \pi) = \prod_{k=1}^{K} \pi_k^{x_k} = \exp\{ \sum_{k=1}^{K} x_k \ln \pi_k \}
                  = \exp\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + (1 - \sum_{k=1}^{K-1} x_k) \ln(1 - \sum_{k=1}^{K-1} \pi_k) \}
                  = \exp\{ \sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{1 - \sum_{j=1}^{K-1}\pi_j} + \ln(1 - \sum_{k=1}^{K-1} \pi_k) \}

Exponential family representation:

    \eta = [\, \ln(\pi_k / \pi_K) \,;\, 0 \,], \quad \text{where } \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k
    T(x) = [\, x \,]
    A(\eta) = -\ln(1 - \sum_{k=1}^{K-1}\pi_k) = \ln \sum_{k=1}^{K} e^{\eta_k}
    h(x) = 1
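As a sanity check (my own sketch, assuming NumPy): with η_k = ln(π_k/π_K) and η_K = 0, the log normalizer A(η) = ln Σ_k e^{η_k} equals -ln π_K, and the softmax of η recovers π.

    import numpy as np

    pi = np.array([0.2, 0.5, 0.3])             # moment parameters (sum to 1)
    eta = np.log(pi / pi[-1])                  # natural parameters; the last entry is 0

    A = np.log(np.exp(eta).sum())              # A(eta) = ln sum_k exp(eta_k)
    assert np.isclose(A, -np.log(pi[-1]))      # equals -ln(pi_K)
    assert np.allclose(np.exp(eta - A), pi)    # softmax of eta recovers pi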

SLIDE 3

Why exponential family?

Moment generating property

    \frac{dA(\eta)}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}
                           = \int T(x)\, h(x)\, \frac{\exp\{\eta^\top T(x)\}}{Z(\eta)}\, dx = E[T(x)]

    \frac{d^2 A(\eta)}{d\eta^2} = \int T^2(x)\, h(x)\, \frac{\exp\{\eta^\top T(x)\}}{Z(\eta)}\, dx
                                  - \int T(x)\, h(x)\, \frac{\exp\{\eta^\top T(x)\}}{Z(\eta)}\, dx \cdot \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta}
                                = E[T^2(x)] - E[T(x)]^2 = \mathrm{Var}[T(x)]

Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(η). The qth derivative gives the qth centered moment:

    \frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2 A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
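A quick numerical illustration (my own sketch, assuming NumPy): for the Bernoulli with A(η) = log(1 + e^η), a finite-difference derivative of A should match the mean π, and the second difference should match the variance π(1 - π).

    import numpy as np

    A = lambda eta: np.log1p(np.exp(eta))    # Bernoulli log normalizer

    eta, h = 0.7, 1e-4
    pi = 1.0 / (1.0 + np.exp(-eta))          # mean of the Bernoulli

    dA  = (A(eta + h) - A(eta - h)) / (2 * h)             # first derivative
    d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # second derivative

    assert np.isclose(dA, pi, atol=1e-6)                  # mean
    assert np.isclose(d2A, pi * (1 - pi), atol=1e-6)      # variance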

SLIDE 4

Moment vs canonical parameters

The moment parameter µ can be derived from the natural (canonical) parameter:

    \frac{dA(\eta)}{d\eta} = E[T(x)] \;\stackrel{\text{def}}{=}\; \mu

A(η) is convex since

    \frac{d^2 A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] > 0

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (a 1-to-1 mapping):

    \eta \;\stackrel{\text{def}}{=}\; \psi(\mu)

  • A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

[Figure: plot of the convex log normalizer A(η) against η.]

MLE for Exponential Family

For iid data, the log-likelihood is

    \ell(\eta; D) = \log \prod_n h(x_n) \exp\{ \eta^\top T(x_n) - A(\eta) \}
                  = \sum_n \log h(x_n) + \eta^\top \sum_n T(x_n) - N A(\eta)

Take derivatives and set to zero:

    \frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0
    \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N}\sum_n T(x_n)
    \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)

This amounts to moment matching. We can infer the canonical parameters using

    \hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})
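A minimal sketch of MLE by moment matching (my own, assuming NumPy) for the Bernoulli: the sufficient statistic is T(x) = x, so µ̂ is the sample mean and η̂ = ψ(µ̂) = log(µ̂/(1 - µ̂)).

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.binomial(1, 0.7, size=10000)    # iid Bernoulli(0.7) samples

    mu_hat = data.mean()                       # moment matching: mu_hat = (1/N) sum_n T(x_n)
    eta_hat = np.log(mu_hat / (1 - mu_hat))    # canonical parameter via eta = psi(mu)

    print(mu_hat, eta_hat)                     # mu_hat ~ 0.7, eta_hat ~ log(0.7/0.3) ~ 0.85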

SLIDE 5

Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).

  • We can throw away X for the purpose of inference w.r.t. θ.
  • Bayesian view:  p(θ | T(x), x) = p(θ | T(x))
  • Frequentist view:  p(x | T(x), θ) = p(x | T(x))
  • The Neyman factorization theorem
  • T(x) is sufficient for θ if

    p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))
    \;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))

[Figure: the three corresponding graphical models over θ, T(x), and X.]

Examples

Gaussian:

    \eta = [\, \Sigma^{-1}\mu \,;\, -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}) \,]
    T(x) = [\, x \,;\, \mathrm{vec}(x x^\top) \,]
    A(\eta) = \tfrac{1}{2}\mu^\top \Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|
    h(x) = (2\pi)^{-k/2}

    \Rightarrow\; \hat{\mu}_{MLE,1} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n

Multinomial:

    \eta = [\, \ln(\pi_k/\pi_K) \,;\, 0 \,]
    T(x) = [\, x \,]
    A(\eta) = \ln \sum_{k=1}^{K} e^{\eta_k} = -\ln \pi_K
    h(x) = 1

    \Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n

Poisson:

    \eta = \log\lambda
    T(x) = x
    A(\eta) = \lambda = e^{\eta}
    h(x) = \frac{1}{x!}

    \Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n

SLIDE 6

Generalized Linear Models (GLIMs)

The graphical model

  • Linear regression
  • Discriminative linear classification
  • Commonality:

model E(Y) = µ = f(θ^T X)

  • What is p(·), the conditional distribution of Y?
  • What is f(·), the response function?

GLIM

  • The observed input x is assumed to enter into the model via a linear combination of its elements,

    \xi = \theta^\top x

  • The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
  • The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.

[Figure: GLIM plate model, Xn → Yn, repeated for n = 1, ..., N.]

GLIM, cont.

  • The choice of exponential family is constrained by the nature of the data Y.
  • Example: y is a continuous vector ⇒ multivariate Gaussian; y is a class label ⇒ Bernoulli or multinomial.
  • The output distribution (possibly with a scale parameter φ):

    p(y \mid \eta) = h(y) \exp\{ \eta^\top y - A(\eta) \}
    \quad\text{or}\quad
    p(y \mid \eta, \phi) = h(y, \phi) \exp\{ \tfrac{1}{\phi}(\eta^\top y - A(\eta)) \}

  • The choice of the response function: subject to some mild constraints, e.g., range [0,1], positivity, ...
  • Canonical response function:

    f = \psi^{-1}(\cdot)

  • In this case θ^T x directly corresponds to the canonical parameter η.

[Figure: the chain x → ξ = θ^T x → f → µ → ψ → η.]
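A small sketch of the GLIM pipeline (my own, assuming NumPy): ξ = θ^T x gives the canonical parameter directly, and µ = f(ξ), where the canonical response f = ψ⁻¹ is the identity for the Gaussian, the logistic function for the Bernoulli, and exp for the Poisson.

    import numpy as np

    def glim_mean(theta, x, family="bernoulli"):
        # Canonical GLIM: eta = xi = theta^T x, mu = f(xi) with f = psi^{-1}
        xi = theta @ x
        if family == "gaussian":
            return xi                         # identity response
        if family == "bernoulli":
            return 1.0 / (1.0 + np.exp(-xi))  # logistic response
        if family == "poisson":
            return np.exp(xi)                 # exponential response
        raise ValueError(family)

    theta = np.array([0.5, -1.0])
    x = np.array([2.0, 1.0])
    print(glim_mean(theta, x, "bernoulli"))   # conditional mean E[y | x]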

SLIDE 7

MLE for GLIMs with natural response

Log-likelihood:

    \ell = \sum_n \log h(y_n) + \sum_n \left( \theta^\top x_n y_n - A(\eta_n) \right)

Derivative of the log-likelihood:

    \frac{d\ell}{d\theta} = \sum_n \left( x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \right)
                          = \sum_n (y_n - \mu_n)\, x_n
                          = X^\top (y - \mu)

This is a fixed point function because µ is a function of θ.

Online learning for canonical GLIMs

  • Stochastic gradient ascent = least mean squares (LMS) algorithm:

    \theta^{(t+1)} = \theta^{(t)} + \rho \left( y_n - \mu_n^{(t)} \right) x_n

    where \mu_n^{(t)} = (\theta^{(t)})^\top x_n and ρ is a step size.
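A minimal sketch (my own, assuming NumPy; the data are synthetic) of the LMS update θ ← θ + ρ(y_n - µ_n)x_n for a canonical Gaussian GLIM (linear regression), where µ_n = θ^T x_n:

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 200, 3
    X = rng.normal(size=(N, d))
    theta_true = np.array([1.0, -2.0, 0.5])
    y = X @ theta_true + 0.1 * rng.normal(size=N)

    theta = np.zeros(d)
    rho = 0.01                                   # step size
    for epoch in range(50):
        for n in range(N):
            mu_n = theta @ X[n]                  # canonical Gaussian GLIM: mu = theta^T x
            theta += rho * (y[n] - mu_n) * X[n]  # LMS update
    print(theta)                                 # should approach theta_true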

Batch learning for canonical GLIMs

The Hessian matrix:

    H = \frac{d^2 \ell}{d\theta\, d\theta^\top}
      = \frac{d}{d\theta^\top} \sum_n (y_n - \mu_n)\, x_n
      = -\sum_n x_n \frac{d\mu_n}{d\theta^\top}
      = -\sum_n x_n \frac{d\mu_n}{d\eta_n} \frac{d\eta_n}{d\theta^\top}
      = -\sum_n x_n \frac{d\mu_n}{d\eta_n} x_n^\top
      = -X^\top W X \qquad \text{(since } \eta_n = \theta^\top x_n \text{)}

where X = [x_n^\top] is the design matrix and

    W = \mathrm{diag}\!\left( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \right),

which can be computed by calculating the second derivative of A(η_n).

SLIDE 8

Iteratively Reweighted Least Squares (IRLS)

Recall the Newton-Raphson method with cost function J:

    \theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta J

We now have

    \nabla_\theta \ell = X^\top (y - \mu), \qquad H = -X^\top W X

Now:

    \theta^{(t+1)} = \theta^{(t)} - H^{-1} \nabla_\theta \ell
                   = (X^\top W^{(t)} X)^{-1} \left[ X^\top W^{(t)} X \theta^{(t)} + X^\top (y - \mu^{(t)}) \right]
                   = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}

  • where the adjusted response is

    z^{(t)} = X \theta^{(t)} + (W^{(t)})^{-1} (y - \mu^{(t)})

This can be understood as solving the following "iteratively reweighted least squares" problem:

    \theta^{(t+1)} = \arg\min_\theta \, (z - X\theta)^\top W (z - X\theta)
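A generic IRLS sketch (my own, assuming NumPy): the caller supplies the response function f and its derivative dµ/dη, and each iteration solves the weighted least-squares problem above.

    import numpy as np

    def irls(X, y, mean_fn, dmu_deta, n_iter=20):
        """Iteratively reweighted least squares for a canonical GLIM (a sketch)."""
        theta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ theta                            # eta_n = theta^T x_n
            mu = mean_fn(eta)                          # mu_n = f(eta_n)
            w = np.clip(dmu_deta(eta), 1e-10, None)    # diagonal of W (clipped for stability)
            z = eta + (y - mu) / w                     # adjusted response z
            # theta <- argmin (z - X theta)^T W (z - X theta)
            XtW = X.T * w
            theta = np.linalg.solve(XtW @ X, XtW @ z)
        return theta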

Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

    p(y \mid x) = \mu(x)^{y} \left( 1 - \mu(x) \right)^{1-y}

where µ is a logistic function

    \mu(x) = \frac{1}{1 + e^{-\eta(x)}}

p(y|x) is an exponential family function, with

  • mean:

    E[y \mid x] = \mu = \frac{1}{1 + e^{-\eta(x)}}

  • and canonical response function

    \eta = \xi = \theta^\top x

  • IRLS:

    \frac{d\mu}{d\eta} = \mu(1 - \mu), \qquad
    W = \mathrm{diag}\!\left( \mu_1(1-\mu_1), \ldots, \mu_N(1-\mu_N) \right)
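A self-contained instantiation (my own sketch, assuming NumPy and synthetic data) of IRLS for logistic regression, mirroring the generic sketch above with dµ/dη = µ(1 - µ) as the weights:

    import numpy as np

    sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 2))
    theta_true = np.array([2.0, -1.0])
    y = rng.binomial(1, sigmoid(X @ theta_true))

    theta = np.zeros(2)
    for _ in range(10):                                # IRLS iterations
        mu = sigmoid(X @ theta)
        w = np.clip(mu * (1 - mu), 1e-6, None)         # diagonal of W
        z = X @ theta + (y - mu) / w                   # adjusted response
        XtW = X.T * w
        theta = np.linalg.solve(XtW @ X, XtW @ z)
    print(theta)                                       # close to theta_true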

SLIDE 9

Logistic regression: practical issues

It is very common to use regularized maximum likelihood.

  • IRLS takes O(Nd³) per iteration, where N = number of training cases

and d = dimension of input x.

  • Quasi-Newton methods, which approximate the Hessian, work faster.
  • Conjugate gradient takes O(Nd) per iteration, and usually works best in

practice.

  • Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):

    p(y_n \mid x_n, \theta) = \sigma(y_n \theta^\top x_n) = \frac{1}{1 + e^{-y_n \theta^\top x_n}}, \qquad y_n = \pm 1

    p(\theta) \sim \text{Normal}(0, \lambda^{-1} I)

    \ell(\theta) = \sum_n \log \sigma(y_n \theta^\top x_n) - \frac{\lambda}{2}\, \theta^\top \theta

    \nabla_\theta \ell = \sum_n \left( 1 - \sigma(y_n \theta^\top x_n) \right) y_n x_n - \lambda \theta
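A minimal sketch (my own, assuming NumPy; labels in {-1, +1}) of a stochastic gradient step for the regularized objective above, using the per-example gradient (1 - σ(y_n θ^T x_n)) y_n x_n - λθ:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def sgd_step(theta, x_n, y_n, lam=0.1, rho=0.01):
        # y_n in {-1, +1}; the regularizer is applied per example here for simplicity
        grad = (1.0 - sigmoid(y_n * (theta @ x_n))) * y_n * x_n - lam * theta
        return theta + rho * grad                 # gradient ascent step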

Example 2: linear regression

The conditional distribution: a Gaussian

    p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\{ -\tfrac{1}{2} (y - \mu(x))^\top \Sigma^{-1} (y - \mu(x)) \}
                                \;\Rightarrow\; h(x, y) \exp\{ \eta^\top \Sigma^{-1} y - A(\eta) \}

where µ is a linear function

    \mu(x) = \theta^\top x = \eta(x)

p(y|x) is an exponential family function, with

  • mean:

    E[y \mid x] = \mu = \theta^\top x

  • and canonical response function

    \eta = \xi = \theta^\top x

  • IRLS:

    \frac{d\mu}{d\eta} = 1, \qquad W = I

so that

    \theta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}
                   = \theta^{(t)} + (X^\top X)^{-1} X^\top (y - \mu^{(t)})
    \;\xrightarrow{\; t \to \infty \;}\; \theta = (X^\top X)^{-1} X^\top y

Recall: steepest descent gives back the LMS rule from before, while the batch fixed point is the normal equation.
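A small check (my own sketch, assuming NumPy) that for the Gaussian GLIM a single IRLS step (with W = I) already lands on the normal-equation solution θ = (X^T X)^{-1} X^T y:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 2.0, -0.5]) + 0.05 * rng.normal(size=100)

    theta = np.zeros(3)
    mu = X @ theta
    theta_irls = theta + np.linalg.solve(X.T @ X, X.T @ (y - mu))   # one IRLS step (W = I)
    theta_ne = np.linalg.solve(X.T @ X, X.T @ y)                    # normal equation
    assert np.allclose(theta_irls, theta_ne)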

SLIDE 10

MLE for general BNs

If we assume the parameters for each CPD are globally

independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

    \ell(\theta; D) = \log p(D \mid \theta)
                    = \log \prod_n p(\mathbf{x}_n \mid \theta)
                    = \log \prod_n \prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i)
                    = \sum_i \left( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \right)

Example: consider the distribution defined by the directed acyclic GM

    p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\, p(x_2 \mid x_1, \theta_2)\, p(x_3 \mid x_1, \theta_3)\, p(x_4 \mid x_2, x_3, \theta_4)

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the four-node DAG X1 → X2, X1 → X3, X2 → X4, X3 → X4, shown next to the four separate family networks (X1), (X1 → X2), (X1 → X3), and (X2, X3 → X4).]
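A rough sketch (my own, assuming NumPy, binary variables, and an illustrative CPD layout that is not from the slides) of how the decomposed log-likelihood is evaluated for this four-node example, one local term per family:

    import numpy as np

    # data[n] = (x1, x2, x3, x4), all binary; cpds holds one conditional table per node
    def bn_log_likelihood(data, cpds):
        ll = 0.0
        for x1, x2, x3, x4 in data:
            ll += np.log(cpds["p1"][x1])              # p(x1)
            ll += np.log(cpds["p2"][x1][x2])          # p(x2 | x1)
            ll += np.log(cpds["p3"][x1][x3])          # p(x3 | x1)
            ll += np.log(cpds["p4"][x2][x3][x4])      # p(x4 | x2, x3)
        return ll

    cpds = {
        "p1": np.array([0.6, 0.4]),
        "p2": np.array([[0.7, 0.3], [0.2, 0.8]]),
        "p3": np.array([[0.5, 0.5], [0.9, 0.1]]),
        "p4": np.array([[[0.8, 0.2], [0.4, 0.6]],
                        [[0.3, 0.7], [0.1, 0.9]]]),
    }
    data = [(0, 1, 0, 1), (1, 1, 1, 0)]
    print(bn_log_likelihood(data, cpds))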

SLIDE 11

MLE for BNs with tabular CPDs

Assume each CPD is represented as a table (multinomial) where

    \theta_{ijk} \;\stackrel{\text{def}}{=}\; p(X_i = j \mid X_{\pi_i} = k)

  • Note that in the case of multiple parents, X_{\pi_i} will have a composite state, and the CPD will be a high-dimensional table.
  • The sufficient statistics are counts of family configurations:

    n_{ijk} \;\stackrel{\text{def}}{=}\; \sum_n x_{n,i}^{j}\, x_{n,\pi_i}^{k}

The log-likelihood is

    \ell(\theta; D) = \log \prod_{i,j,k} \theta_{ijk}^{\,n_{ijk}} = \sum_{i,j,k} n_{ijk} \log \theta_{ijk}

Using a Lagrange multiplier to enforce \sum_j \theta_{ijk} = 1, we get

    \theta_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}
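A minimal sketch (my own, assuming NumPy) of the counting estimator θ̂_{ijk} = n_{ijk} / Σ_{j'} n_{ij'k} for one node X_i with a single discrete parent:

    import numpy as np

    def tabular_cpd_mle(child_vals, parent_vals, n_child, n_parent):
        # counts[k, j] = n_ijk: number of cases with parent = k and child = j
        counts = np.zeros((n_parent, n_child))
        for j, k in zip(child_vals, parent_vals):
            counts[k, j] += 1
        return counts / counts.sum(axis=1, keepdims=True)   # theta_hat[k, j] = p(X_i = j | parent = k)

    parent = np.array([0, 0, 1, 1, 1, 0])
    child  = np.array([1, 0, 1, 1, 0, 1])
    print(tabular_cpd_mle(child, parent, n_child=2, n_parent=2))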

MLE and Kullback-Leibler divergence

KL divergence:

    \mathrm{KL}\big( q(x) \,\|\, p(x) \big) = \sum_x q(x) \log \frac{q(x)}{p(x)}

Empirical distribution:

    \tilde{p}(x) \;\stackrel{\text{def}}{=}\; \frac{1}{N} \sum_{n=1}^{N} \delta(x, x_n)

  • where δ(x, x_n) is a Kronecker delta function.

Then

    \mathrm{KL}\big( \tilde{p}(x) \,\|\, p(x \mid \theta) \big)
        = \sum_x \tilde{p}(x) \log \frac{\tilde{p}(x)}{p(x \mid \theta)}
        = \sum_x \tilde{p}(x) \log \tilde{p}(x) - \sum_x \tilde{p}(x) \log p(x \mid \theta)
        = C - \frac{1}{N} \sum_n \log p(x_n \mid \theta)
        = C - \frac{1}{N}\, \ell(\theta; D)

so maximizing the likelihood (MLE) is equivalent to minimizing the KL divergence:  max_θ ℓ  ⇔  min_θ KL.
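A quick numerical check of this identity (my own sketch, assuming NumPy) for a Bernoulli model: KL(p̃ ‖ p_θ) equals C - (1/N)ℓ(θ; D), so the θ minimizing one maximizes the other.

    import numpy as np

    data = np.array([1, 1, 0, 1, 0, 1, 1, 0])                       # binary observations
    p_tilde = np.array([np.mean(data == 0), np.mean(data == 1)])    # empirical distribution

    theta = 0.4                                                     # candidate Bernoulli parameter
    p_model = np.array([1 - theta, theta])

    kl = np.sum(p_tilde * np.log(p_tilde / p_model))
    C = np.sum(p_tilde * np.log(p_tilde))                           # negative entropy of p_tilde
    ll = np.sum(np.log(np.where(data == 1, theta, 1 - theta)))      # log-likelihood
    assert np.isclose(kl, C - ll / len(data))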

SLIDE 12

Parameter sharing

Consider a time-invariant (stationary) 1st-order Markov model

  • Initial state probability vector:

    \pi_k \;\stackrel{\text{def}}{=}\; p(X_1^k = 1)

  • State transition probability matrix:

    A_{ij} \;\stackrel{\text{def}}{=}\; p(X_t^j = 1 \mid X_{t-1}^i = 1)

The joint:

    p(X_{1:T} \mid \theta) = p(x_1 \mid \pi) \prod_{t=2}^{T} p(X_t \mid X_{t-1})

The log-likelihood:

    \ell(\theta; D) = \sum_n \log p(x_{n,1} \mid \pi) + \sum_n \sum_{t=2}^{T} \log p(x_{n,t} \mid x_{n,t-1}, A)

Again, we optimize each parameter separately

  • π is a multinomial frequency vector, and we've seen it before
  • What about A?

[Figure: chain X1 → X2 → X3 → ... → XT, with A shared across all transitions.]

Learning a Markov chain transition matrix

A is a stochastic matrix:  \sum_j A_{ij} = 1.  Each row of A is a multinomial distribution, so the MLE of A_{ij} is the fraction of transitions from i to j:

    A_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)}
                = \frac{ \sum_n \sum_{t=2}^{T} x_{n,t-1}^{i}\, x_{n,t}^{j} }{ \sum_n \sum_{t=2}^{T} x_{n,t-1}^{i} }

Application:

  • If the states X_t represent words, this is called a bigram language model.

Sparse data problem:

  • If i → j did not occur in the data, we will have A_{ij} = 0, and then any future sequence containing the word pair i → j will have zero probability.
  • A standard hack: backoff smoothing or deleted interpolation,

    \tilde{A}_{ij} = \lambda\, \eta_j + (1 - \lambda)\, A_{ij}^{ML}

    where η is a backoff (e.g., marginal) distribution over the next state.
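A small sketch (my own, assuming NumPy and toy integer-coded sequences) of the count-based MLE A_{ij} = #(i→j)/#(i→·), followed by an optional deleted-interpolation smoothing step with a uniform backoff:

    import numpy as np

    def transition_mle(sequences, n_states):
        counts = np.zeros((n_states, n_states))
        for seq in sequences:
            for s, s_next in zip(seq[:-1], seq[1:]):
                counts[s, s_next] += 1                      # #(i -> j)
        return counts / counts.sum(axis=1, keepdims=True)   # row-normalize

    sequences = [[0, 1, 2, 1, 0], [1, 1, 2, 0]]
    A_ml = transition_mle(sequences, n_states=3)

    # deleted interpolation with a uniform backoff distribution eta (lambda chosen arbitrarily)
    lam, eta = 0.1, np.full(3, 1.0 / 3.0)
    A_smooth = lam * eta + (1 - lam) * A_ml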

SLIDE 13

Bayesian language model

Global and local parameter independence: the posterior over A_{i·} and A_{i'·} is factorized despite the v-structure on X_t, because X_{t-1} acts like a multiplexer.

Assign a Dirichlet prior β_i to each row of the transition matrix:

    A_{ij}^{Bayes} = p(j \mid i, D, \beta_i)
                   = \frac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|}
                   = \lambda_i \frac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i)\, A_{ij}^{ML},
    \qquad \text{where } \lambda_i \;\stackrel{\text{def}}{=}\; \frac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}
    \text{ and } |\beta_i| = \sum_k \beta_{i,k}

  • We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)

[Figure: graphical model with hyperparameters α, β over π and the transition-matrix rows A_{i·}, A_{i'·}, ..., generating the chain X1 → X2 → X3 → ... → XT.]
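A final sketch (my own, assuming NumPy) of the posterior-mean estimate above: add the Dirichlet pseudo-counts β_i to the transition counts before normalizing each row.

    import numpy as np

    def transition_bayes(counts, beta):
        # counts[i, j] = #(i -> j);  beta[i, j] = Dirichlet pseudo-counts for row i
        return (counts + beta) / (counts + beta).sum(axis=1, keepdims=True)

    counts = np.array([[0., 3., 1.],
                       [2., 0., 2.],
                       [1., 1., 0.]])
    beta = np.ones_like(counts)          # a uniform Dirichlet(1, 1, 1) prior on each row
    print(transition_bayes(counts, beta))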