


School of Computer Science

Learning generalized linear models and tabular CPT of structured full BN

Probabilistic Graphical Models (10-708)

Lecture 9, Oct 15, 2007

Eric Xing

[Figure: a signaling-pathway network (Receptors A-B, Kinases C-E, TF F, Genes G-H) modeled as a BN over variables X1-X8]

Reading: J-Chap. 7,8.


Linear Regression

Let us assume that the target variable and the inputs are related by the equation:

$$y_i = \theta^T \mathbf{x}_i + \varepsilon_i$$

where ε is an error term capturing unmodeled effects or random noise.

Now assume that ε follows a Gaussian distribution N(0, σ²); then we have:

$$p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2}\right)$$
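As a quick numerical sanity check (a minimal sketch of my own, not from the slides; all names are illustrative), the Gaussian noise model above can be evaluated directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from the assumed model: y = theta^T x + eps, eps ~ N(0, sigma^2)
theta_true, sigma = np.array([2.0, -1.0]), 0.5
X = rng.normal(size=(100, 2))
y = X @ theta_true + sigma * rng.normal(size=100)

def log_likelihood(theta, X, y, sigma):
    """Sum over i of log p(y_i | x_i; theta) under the Gaussian noise model."""
    resid = y - X @ theta
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# The data likelihood is higher at the generating parameters than at a perturbed value
print(log_likelihood(theta_true, X, y, sigma) > log_likelihood(theta_true + 1.0, X, y, sigma))
```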


Logistic Regression (sigmoid classifier)

The conditional distribution: a Bernoulli

$$p(y \mid x) = \mu(x)^y\,\big(1 - \mu(x)\big)^{1-y}$$

where µ is a logistic function

$$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$

We can use the brute-force gradient method as in LR, but we can also apply generic results by observing that p(y|x) is an exponential family function, more specifically a generalized linear model.
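In code (a small sketch under my own naming, not from the slides), the Bernoulli conditional and its logistic mean look like this:

```python
import numpy as np

def mu(theta, x):
    """Logistic response: P(y=1 | x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-x @ theta))

def bernoulli_log_pmf(y, p):
    """log p(y | x) = y*log(mu) + (1-y)*log(1-mu)."""
    return y * np.log(p) + (1 - y) * np.log(1 - p)

theta, x = np.array([1.0, -2.0]), np.array([0.5, 0.3])
print(mu(theta, x), bernoulli_log_pmf(1, mu(theta, x)))
```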


Exponential family

For a numeric random variable X,

$$p(x \mid \eta) = h(x)\exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\,h(x)\exp\{\eta^T T(x)\}$$

is an exponential family distribution with natural (canonical) parameter η.

Function T(x) is a sufficient statistic. Function A(η) = log Z(η) is the log normalizer. Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
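For concreteness, here is a minimal sketch (my own example, not from the slides) writing the Bernoulli distribution in this form, with η = log(π/(1−π)), T(x) = x, A(η) = log(1 + e^η), and h(x) = 1:

```python
import numpy as np

def bernoulli_exp_family(x, eta):
    """p(x | eta) = h(x) * exp{eta * T(x) - A(eta)} with T(x) = x and h(x) = 1."""
    A = np.log1p(np.exp(eta))            # log normalizer A(eta) = log(1 + e^eta)
    return np.exp(eta * x - A)

pi = 0.3
eta = np.log(pi / (1 - pi))              # natural parameter of Bernoulli(pi)
print(bernoulli_exp_family(1, eta))      # ~0.3
print(bernoulli_exp_family(0, eta))      # ~0.7
```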


Multivariate Gaussian Distribution

For a continuous vector random variable X ∈ ℝᵏ:

$$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\!\left\{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right\}$$
$$= \frac{1}{(2\pi)^{k/2}}\exp\!\left\{-\frac{1}{2}\operatorname{tr}\!\big(\Sigma^{-1}xx^T\big) + \mu^T\Sigma^{-1}x - \frac{1}{2}\mu^T\Sigma^{-1}\mu - \frac{1}{2}\log|\Sigma|\right\}$$

Exponential family representation:

$$\eta = \Big[\Sigma^{-1}\mu\,;\; \operatorname{vec}\!\big(-\tfrac{1}{2}\Sigma^{-1}\big)\Big] = \big[\eta_1; \operatorname{vec}(\eta_2)\big], \qquad \eta_1 = \Sigma^{-1}\mu,\quad \eta_2 = -\tfrac{1}{2}\Sigma^{-1}$$
$$T(x) = \big[x\,;\; \operatorname{vec}(xx^T)\big]$$
$$A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T\eta_2^{-1}\eta_1 - \tfrac{1}{2}\log\big|{-2\eta_2}\big|$$
$$h(x) = (2\pi)^{-k/2}$$

The first expression of each pair is in terms of the moment parameters (µ, Σ); the second is in terms of the natural parameters η.

  • Note: a k-dimensional Gaussian is a (k + k²)-parameter distribution with a (k + k²)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).
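A small sketch (my own, with illustrative names) of the moment ↔ natural parameter maps for the Gaussian:

```python
import numpy as np

def gaussian_natural_params(mu, Sigma):
    """Moment -> natural: eta1 = Sigma^{-1} mu, eta2 = -1/2 Sigma^{-1}."""
    P = np.linalg.inv(Sigma)             # precision matrix
    return P @ mu, -0.5 * P

def gaussian_moment_params(eta1, eta2):
    """Natural -> moment: Sigma = -1/2 eta2^{-1}, mu = Sigma eta1."""
    Sigma = -0.5 * np.linalg.inv(eta2)
    return Sigma @ eta1, Sigma

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
eta1, eta2 = gaussian_natural_params(mu, Sigma)
mu_back, Sigma_back = gaussian_moment_params(eta1, eta2)
print(np.allclose(mu, mu_back), np.allclose(Sigma, Sigma_back))  # True True
```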


Multinomial distribution

For a binary vector random variable x ~ multi(x|π):

$$p(x \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\!\left\{\sum_k x_k \ln \pi_k\right\}$$
$$= \exp\!\left\{\sum_{k=1}^{K-1} x_k \ln \pi_k + \Big(1 - \sum_{k=1}^{K-1} x_k\Big)\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big)\right\}$$
$$= \exp\!\left\{\sum_{k=1}^{K-1} x_k \ln\!\left(\frac{\pi_k}{1 - \sum_{k=1}^{K-1}\pi_k}\right) + \ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big)\right\}$$

Exponential family representation:

$$\eta = \Big[\ln(\pi_k/\pi_K)\,;\,0\Big], \qquad \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k$$
$$T(x) = [x]$$
$$A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K} e^{\eta_k}\Big)$$
$$h(x) = 1$$
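The natural parameterization can be checked numerically (my own sketch; π_K serves as the reference category):

```python
import numpy as np

pi = np.array([0.2, 0.5, 0.3])               # pi_K = 0.3 is the reference category
eta = np.log(pi[:-1] / pi[-1])               # eta_k = ln(pi_k / pi_K), k < K
A = np.log1p(np.exp(eta).sum())              # A(eta) = ln(1 + sum_k e^eta_k) = -ln(pi_K)
pi_back = np.exp(eta - A)                    # invert: pi_k = e^{eta_k - A}
print(np.allclose(pi_back, pi[:-1]), np.isclose(A, -np.log(pi[-1])))  # True True
```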


Why exponential family?

Moment generating property

First derivative:

$$\frac{dA(\eta)}{d\eta} = \frac{d}{d\eta}\log Z(\eta) = \frac{1}{Z(\eta)}\frac{dZ(\eta)}{d\eta} = \int T(x)\,h(x)\,\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx = E[T(x)]$$

Second derivative:

$$\frac{d^2A(\eta)}{d\eta^2} = E[T^2(x)] - E[T(x)]^2 = \operatorname{Var}[T(x)]$$


Moment estimation

We can easily compute moments of any exponential family

distribution by taking the derivatives of the log normalizer A(η).

The qth derivative gives the qth centered moment. When the sufficient statistic is a stacked vector, partial

derivatives need to be considered.

$$\frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2A(\eta)}{d\eta^2} = \text{variance}, \qquad \cdots$$
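A quick finite-difference check (my own sketch) for the Bernoulli case, where A(η) = log(1 + e^η): the first derivative should equal the mean π and the second the variance π(1 − π):

```python
import numpy as np

A = lambda eta: np.log1p(np.exp(eta))    # Bernoulli log normalizer

eta, h = 0.7, 1e-4
pi = 1 / (1 + np.exp(-eta))              # mean parameter mu = E[T(x)]

dA = (A(eta + h) - A(eta - h)) / (2 * h)               # numeric first derivative
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2    # numeric second derivative

print(np.isclose(dA, pi), np.isclose(d2A, pi * (1 - pi)))  # True True
```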


Moment vs canonical parameters

The moment parameter µ can be derived from the natural (canonical) parameter:

$$\frac{dA(\eta)}{d\eta} = E[T(x)] \overset{\text{def}}{=} \mu$$

A(η) is convex since

$$\frac{d^2A(\eta)}{d\eta^2} = \operatorname{Var}[T(x)] > 0$$

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

$$\eta \overset{\text{def}}{=} \psi(\mu)$$

  • A distribution in the exponential family can be parameterized not only by η (the canonical parameterization) but also by µ (the moment parameterization).

[Figure: plot of a convex log normalizer A(η) against η, with a dual point η∗ marked]
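As a concrete instance (my own sketch), for the Bernoulli family the forward map is µ = dA/dη = 1/(1 + e^{−η}) and its inverse ψ is the logit:

```python
import numpy as np

mu_of_eta = lambda eta: 1 / (1 + np.exp(-eta))   # dA/deta for the Bernoulli family
psi = lambda mu: np.log(mu / (1 - mu))           # inverse map eta = psi(mu), the logit

eta = 1.3
print(np.isclose(psi(mu_of_eta(eta)), eta))      # True: the two maps are mutual inverses
```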


MLE for Exponential Family

For iid data, the log-likelihood is

$$\ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\} = \sum_n \log h(x_n) + \eta^T\Big(\sum_n T(x_n)\Big) - N A(\eta)$$

Take derivatives and set to zero:

$$\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0 \quad\Rightarrow\quad \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)$$

This amounts to moment matching. We can infer the canonical parameters using

$$\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})$$
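Moment matching is a one-liner in practice (my own sketch, Bernoulli case):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=10_000)      # iid Bernoulli data; T(x) = x

mu_mle = x.mean()                           # moment matching: mu = (1/N) sum_n T(x_n)
eta_mle = np.log(mu_mle / (1 - mu_mle))     # canonical parameter via psi (the logit)
print(mu_mle, eta_mle)                      # ~0.3, ~log(0.3/0.7)
```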


Sufficiency

For p(x|θ), T(x) is sufficient for θ if there is no information in X regarding θ beyond that in T(x).

  • We can throw away X for the purpose of inference w.r.t. θ.
  • Bayesian view: $p(\theta \mid T(x), x) = p(\theta \mid T(x))$
  • Frequentist view: $p(x \mid T(x), \theta) = p(x \mid T(x))$
  • The Neyman factorization theorem: T(x) is sufficient for θ if

$$p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x)) \quad\Rightarrow\quad p(x \mid \theta) = g(T(x), \theta)\,h(x, T(x))$$


Examples

Gaussian:

$$\eta = \Big[\Sigma^{-1}\mu\,;\,\operatorname{vec}\!\big(-\tfrac{1}{2}\Sigma^{-1}\big)\Big], \quad T(x) = \big[x\,;\,\operatorname{vec}(xx^T)\big], \quad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}$$
$$\Rightarrow\quad \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n$$

Multinomial:

$$\eta = \Big[\ln(\pi_k/\pi_K)\,;\,0\Big], \quad T(x) = [x], \quad A(\eta) = \ln\Big(\sum_{k=1}^K e^{\eta_k}\Big), \quad h(x) = 1$$
$$\Rightarrow\quad \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$

Poisson:

$$\eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^{\eta}, \quad h(x) = \frac{1}{x!}$$
$$\Rightarrow\quad \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$


Generalized Linear Models (GLIMs)

The graphical model

[Figure: plate model X_n → Y_n, repeated over n = 1..N]

  • Linear regression
  • Discriminative linear classification
  • Commonality: model $E_p(Y) = \mu = f(\theta^T X)$
  • What is p()? the conditional distribution of Y.
  • What is f()? the response function.

GLIM

  • The observed input x is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$.
  • The conditional mean µ is represented as a function f(ξ) of ξ, where f is known as the response function.
  • The observed output y is assumed to be characterized by an exponential family distribution with conditional mean µ.


GLIM, cont.

  • The choice of exp family is constrained by the nature of the data Y
  • Example: y is a continuous vector → multivariate Gaussian; y is a class label → Bernoulli or multinomial
  • The choice of the response function
  • Following some mild constraints, e.g., range [0,1], positivity, ...
  • Canonical response function: $f = \psi^{-1}(\cdot)$
  • In this case θᵀx directly corresponds to the canonical parameter η.

[Diagram: $x \xrightarrow{\ \theta\ } \xi \xrightarrow{\ f\ } \mu \xrightarrow{\ \psi\ } \eta$]

$$p(y \mid \eta) = h(y)\exp\{\eta^T(x)\,y - A(\eta)\} \qquad\Rightarrow\qquad p(y \mid \eta, \phi) = h(y, \phi)\exp\!\left\{\frac{\eta^T(x)\,y - A(\eta)}{\phi}\right\}$$

(the second form includes a scale/dispersion parameter φ)


MLE for GLIMs with natural response

Log-likelihood:

$$\ell = \sum_n \log h(y_n) + \sum_n \big(\theta^T x_n y_n - A(\eta_n)\big)$$

Derivative of log-likelihood:

$$\frac{d\ell}{d\theta} = \sum_n \Big(x_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta}\Big) = \sum_n (y_n - \mu_n)\,x_n = X^T(\mathbf{y} - \mu)$$

This is a fixed point function because µ is a function of θ.

Online learning for canonical GLIMs

  • Stochastic gradient ascent = least mean squares (LMS) algorithm:

$$\theta^{t+1} = \theta^t + \rho\,\big(y_n - \mu_n^t\big)\,x_n, \qquad \text{where } \mu_n^t = (\theta^t)^T x_n \text{ and } \rho \text{ is a step size}$$
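A minimal sketch of this online update for the linear-Gaussian case (my own example; here the canonical response is the identity, so µ = θᵀx):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
theta_true = np.array([1.5, -0.5])
y = X @ theta_true + 0.1 * rng.normal(size=500)

theta, rho = np.zeros(2), 0.01
for x_n, y_n in zip(X, y):                 # one stochastic-gradient pass over the data
    mu_n = theta @ x_n                     # canonical (identity) response
    theta = theta + rho * (y_n - mu_n) * x_n

print(theta)                               # approaches theta_true
```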


Batch learning for canonical GLIMs

The Hessian matrix:

$$H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\,x_n = -\sum_n x_n \frac{d\mu_n}{d\theta^T} = -\sum_n x_n \frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T} = -\sum_n x_n \frac{d\mu_n}{d\eta_n} x_n^T \quad (\text{since } \eta_n = \theta^T x_n)$$
$$= -X^T W X$$

where $X = [x_n^T]$ is the design matrix and

$$W = \operatorname{diag}\!\left(\frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N}\right),$$

which can be computed by calculating the 2nd derivative of A(η_n).

$$X = \begin{bmatrix} -\,x_1^T\,- \\ -\,x_2^T\,- \\ \vdots \\ -\,x_N^T\,- \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$
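For the Bernoulli GLIM, dµ/dη = µ(1 − µ), so the Hessian is easy to form explicitly (my own sketch):

```python
import numpy as np

def glim_hessian_logistic(theta, X):
    """H = -X^T W X with W_nn = dmu_n/deta_n = mu_n (1 - mu_n) for the Bernoulli GLIM."""
    mu = 1 / (1 + np.exp(-X @ theta))
    W = np.diag(mu * (1 - mu))
    return -X.T @ W @ X

X = np.array([[1.0, 0.5], [1.0, -0.2], [1.0, 1.3]])
print(glim_hessian_logistic(np.zeros(2), X))   # symmetric and negative semidefinite
```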


Recall LMS

Cost function in matrix form:

$$J(\theta) = \frac{1}{2}\sum_{i=1}^n \big(x_i^T\theta - y_i\big)^2 = \frac{1}{2}\,(X\theta - \mathbf{y})^T (X\theta - \mathbf{y})$$

with X and y stacked as on the previous slide. To minimize J(θ), take the derivative and set it to zero:

$$\nabla_\theta J = \frac{1}{2}\nabla_\theta\big(\theta^T X^T X\theta - \theta^T X^T\mathbf{y} - \mathbf{y}^T X\theta + \mathbf{y}^T\mathbf{y}\big) = X^T X\theta - X^T\mathbf{y} = 0$$
$$\Rightarrow\quad X^T X\theta = X^T\mathbf{y}$$

The normal equations:

$$\theta^* = \big(X^T X\big)^{-1} X^T \mathbf{y}$$
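In code, the normal equations are one linear solve (my own sketch; solving XᵀXθ = Xᵀy directly is preferable to forming the inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, 0.0, -1.0]) + 0.1 * rng.normal(size=200)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)   # solves the normal equations
print(theta_star)                                 # ~ [2.0, 0.0, -1.0]
```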


Iteratively Reweighted Least Squares (IRLS)

Recall the Newton-Raphson method with cost function J:

$$\theta^{t+1} = \theta^t - H^{-1}\,\nabla_\theta J$$

We now have

$$\nabla_\theta \ell = X^T(\mathbf{y} - \mu), \qquad H = -X^T W X$$

Now:

$$\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta\ell = \theta^t + \big(X^T W^t X\big)^{-1} X^T(\mathbf{y} - \mu^t) = \big(X^T W^t X\big)^{-1} X^T W^t z^t$$

  • where the adjusted response is

$$z^t = X\theta^t + \big(W^t\big)^{-1}(\mathbf{y} - \mu^t)$$

This can be understood as solving the following "iteratively reweighted least squares" problem:

$$\theta^{t+1} = \arg\min_\theta\, (z - X\theta)^T W (z - X\theta)$$
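A compact IRLS sketch for logistic regression (my own implementation of the update above; the small floor on W avoids division by zero):

```python
import numpy as np

def irls_logistic(X, y, iters=20):
    """IRLS: repeatedly solve the weighted least-squares problem in theta."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ theta))              # current mean response
        W = np.clip(mu * (1 - mu), 1e-10, None)        # diagonal of W, as a vector
        z = X @ theta + (y - mu) / W                   # adjusted response z^t
        # theta^{t+1} = (X^T W X)^{-1} X^T W z
        theta = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
    return theta

rng = np.random.default_rng(0)
X = np.c_[np.ones(300), rng.normal(size=(300, 2))]
p = 1 / (1 + np.exp(-(X @ np.array([0.5, 2.0, -1.0]))))
y = rng.binomial(1, p)
print(irls_logistic(X, y))                              # ~ [0.5, 2.0, -1.0]
```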


Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

$$p(y \mid x) = \mu(x)^y\,\big(1 - \mu(x)\big)^{1-y}$$

where µ is a logistic function

$$\mu(x) = \frac{1}{1 + e^{-\eta(x)}}$$

p(y|x) is an exponential family function, with

  • mean: $E[y \mid x] = \mu = \dfrac{1}{1 + e^{-\eta(x)}}$
  • and canonical response function: $\eta = \xi = \theta^T x$

IRLS:

$$\frac{d\mu}{d\eta} = \mu(1-\mu), \qquad W = \begin{bmatrix} \mu_1(1-\mu_1) & & \\ & \ddots & \\ & & \mu_N(1-\mu_N) \end{bmatrix}$$


Logistic regression: practical issues

It is very common to use regularized maximum likelihood:

$$p(y = \pm 1 \mid x, \theta) = \frac{1}{1 + e^{-y\,\theta^T x}} = \sigma\big(y\,\theta^T x\big), \qquad \theta \sim \text{Normal}\big(0, \lambda^{-1} I\big)$$
$$\ell(\theta) = \sum_n \log \sigma\big(y_n\,\theta^T x_n\big) - \frac{\lambda}{2}\,\theta^T\theta$$

  • IRLS takes O(Nd³) per iteration, where N = number of training cases and d = dimension of input x.
  • Quasi-Newton methods, which approximate the Hessian, work faster.
  • Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
  • Stochastic gradient descent can also be used if N is large (cf. the perceptron rule):

$$\nabla_\theta\ell = \big(1 - \sigma(y_n\,\theta^T x_n)\big)\,y_n x_n - \lambda\theta$$
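A stochastic-gradient sketch of this regularized objective (my own code; labels are in {−1, +1} as above):

```python
import numpy as np

sigma = lambda z: 1 / (1 + np.exp(-z))

def sgd_logistic(X, y, lam=0.1, rho=0.05, epochs=10):
    """Stochastic gradient ascent on the regularized log-likelihood."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_n, y_n in zip(X, y):
            grad = (1 - sigma(y_n * (theta @ x_n))) * y_n * x_n - lam * theta
            theta = theta + rho * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=400) > 0, 1, -1)
print(sgd_logistic(X, y))    # roughly proportional to [2, -1]
```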


Example 2: linear regression

The conditional distribution: a Gaussian

$$p(y \mid x, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\exp\!\left\{-\frac{1}{2}\big(y - \mu(x)\big)^T\Sigma^{-1}\big(y - \mu(x)\big)\right\} \;\Rightarrow\; h(y)\exp\{\eta^T(x)\,y - A(\eta)\}$$

where µ is a linear function: $\mu(x) = \theta^T x = \eta(x)$

p(y|x) is an exponential family function, with

  • mean: $E[y \mid x] = \mu = \theta^T x$
  • and canonical response function: $\eta = \xi = \theta^T x$

IRLS:

$$\frac{d\mu}{d\eta} = 1, \qquad W = I$$

$$\theta^{t+1} = \big(X^T W^t X\big)^{-1} X^T W^t z^t = \theta^t + \big(X^T X\big)^{-1} X^T \big(\mathbf{y} - \mu^t\big) \;\xrightarrow{\;t\to\infty\;}\; \big(X^T X\big)^{-1} X^T \mathbf{y}$$

Here $X^T(\mathbf{y} - \mu^t)$ is a steepest-descent direction, $(X^T X)^{-1}$ rescales it, and the fixed point is the normal equation solution.


Classification: generative and discriminative approaches [GMs: Q → X and Q ← X]

Regression: linear, conditional mixture, nonparametric [GM: X → Y]

Density estimation: parametric and nonparametric methods [GM: (µ, σ) → X]

Simple GMs are the building blocks of complex BNs.


An (incomplete) genealogy of graphical models

[Figure: genealogy chart relating families of graphical models]

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans. Such designs are nonetheless useful, and indeed favored, because they put human knowledge to good use ...


MLE for general BNs

If we assume the parameters for each CPD are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

$$\ell(\theta; D) = \log p(D \mid \theta) = \log \prod_n \Big(\prod_i p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big)\Big) = \sum_i \Big(\sum_n \log p\big(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i\big)\Big)$$

[Figure: example BN with tabular CPDs indexed by parent configurations, e.g., X2 = 0, 1 and X5 = 0, 1]


[Figure: the classic alarm network: Earthquake → Alarm ← Burglary, Earthquake → Radio, Alarm → Call]

Factorization:

$$p(\mathbf{x}) = \prod_{i=1}^M p\big(x_i \mid \mathbf{x}_{\pi_i}\big)$$

Local distributions defined by, e.g., multinomial parameters:

$$p\big(x_i^k \mid \mathbf{x}_{\pi_i}^j\big) = \theta^k_{x_i \mid \mathbf{x}_{\pi_i}^j}$$

How to define the parameter prior $p(\theta \mid G)$?

Assumptions (Geiger & Heckerman 97, 99):

  • Complete Model Equivalence
  • Global Parameter Independence
  • Local Parameter Independence
  • Likelihood and Prior Modularity


Global & Local Parameter Independence

Global Parameter Independence: for every DAG model,

$$p(\theta \mid G) = \prod_{i=1}^M p(\theta_i \mid G)$$

Local Parameter Independence: for every node,

$$p(\theta_i \mid G) = \prod_{j=1}^{q_i} p\big(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j} \mid G\big)$$

[Figure: the alarm network again, illustrating both independence assumptions]

For example, $\theta_{\text{Call} \mid \text{Alarm}=\text{YES}}$ is independent of $\theta_{\text{Call} \mid \text{Alarm}=\text{NO}}$.


Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform the Bayesian update of each parameter independently!

[Figure: two samples of a two-node BN sharing parameter nodes θ₁ (for X1) and θ₂|₁ (for X2 given X1), illustrating global and local parameter independence]


Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)

Discrete DAG models:

$$x_i \mid \mathbf{x}_{\pi_i}^j \sim \text{Multi}(\theta)$$

Dirichlet prior:

$$P(\theta) = \frac{\Gamma\big(\sum_k \alpha_k\big)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1} = C(\alpha)\prod_k \theta_k^{\alpha_k - 1}$$

Gaussian DAG models:

$$x_i \mid \mathbf{x}_{\pi_i}^j \sim \text{Normal}(\mu, \Sigma)$$

Normal prior:

$$p(\mu \mid \nu, \Psi) = (2\pi)^{-n/2}\,|\Psi|^{-1/2}\exp\!\left\{-\tfrac{1}{2}(\mu - \nu)'\,\Psi^{-1}(\mu - \nu)\right\}$$

Normal-Wishart prior:

$$p(W \mid \alpha_w, T) = c(n, \alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\!\left\{-\tfrac{1}{2}\operatorname{tr}(T W)\right\}, \qquad \text{where } W = \Sigma^{-1}$$
$$p(\mu \mid \nu, \alpha_\mu, W) = \text{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big)$$



Example: decomposable likelihood of a directed model

Consider the distribution defined by the directed acyclic GM:

$$p(\mathbf{x} \mid \theta) = p(x_1 \mid \theta_1)\,p(x_2 \mid x_1, \theta_2)\,p(x_3 \mid x_1, \theta_3)\,p(x_4 \mid x_2, x_3, \theta_4)$$

This is exactly like learning four separate small BNs, each of which consists of a node and its parents.

[Figure: the DAG X1 → X2, X1 → X3, X2 → X4, X3 → X4, split into four node-plus-parents fragments]


MLE for BNs with tabular CPDs

  • Assume each CPD is represented as a table (multinomial):

$$\theta_{ijk} \overset{\text{def}}{=} p\big(X_i = j \mid X_{\pi_i} = k\big)$$

  • Note that in case of multiple parents, $X_{\pi_i}$ will have a composite state, and the CPD will be a high-dimensional table.
  • The sufficient statistics are counts of family configurations:

$$n_{ijk} \overset{\text{def}}{=} \sum_n x_{n,i}^j\,x_{n,\pi_i}^k$$

  • The log-likelihood is

$$\ell(\theta; D) = \log\prod_{i,j,k}\theta_{ijk}^{n_{ijk}} = \sum_{i,j,k} n_{ijk}\log\theta_{ijk}$$

  • Using a Lagrange multiplier to enforce $\sum_j \theta_{ijk} = 1$, we get:

$$\hat{\theta}_{ijk}^{ML} = \frac{n_{ijk}}{\sum_{j'} n_{ij'k}}$$
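In code, this is just normalized counting (my own sketch for a single node with one parent):

```python
import numpy as np

def mle_cpt(child, parent, n_child, n_parent):
    """MLE of a tabular CPD p(X_i = j | X_parent = k) from fully observed data."""
    counts = np.zeros((n_child, n_parent))
    for j, k in zip(child, parent):                    # n_ijk: family-configuration counts
        counts[j, k] += 1
    return counts / counts.sum(axis=0, keepdims=True)  # normalize over the child states j

child  = np.array([0, 1, 1, 0, 1, 1])   # observed values of X_i
parent = np.array([0, 0, 1, 1, 1, 1])   # observed values of its parent
print(mle_cpt(child, parent, 2, 2))     # each column sums to 1
```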


MLE and Kullback-Leibler divergence

KL divergence:

$$KL\big(q(x)\,\|\,p(x)\big) = \sum_x q(x)\log\frac{q(x)}{p(x)}$$

Empirical distribution:

$$\tilde{p}(x) \overset{\text{def}}{=} \frac{1}{N}\sum_{n=1}^N \delta(x, x_n)$$

  • where δ(x, x_n) is a Kronecker delta function

$$KL\big(\tilde{p}(x)\,\|\,p(x \mid \theta)\big) = \sum_x \tilde{p}(x)\log\tilde{p}(x) - \sum_x \tilde{p}(x)\log p(x \mid \theta) = C - \frac{1}{N}\sum_n \log p(x_n \mid \theta) = C - \frac{1}{N}\,\ell(\theta; D)$$

Max_θ (MLE) ⇔ Min_θ (KL)


Parameter sharing

[Figure: Markov chain X1 → X2 → X3 → ... → XT, with π attached to X1 and a shared transition matrix A on every edge]

  • Consider a time-invariant (stationary) 1st-order Markov model
  • Initial state probability vector: $\pi_k \overset{\text{def}}{=} p\big(X_1^k = 1\big)$
  • State transition probability matrix: $A_{ij} \overset{\text{def}}{=} p\big(X_t^j = 1 \mid X_{t-1}^i = 1\big)$
  • The joint: $p\big(X_{1:T} \mid \theta\big) = p(x_1 \mid \pi)\prod_{t=2}^T p\big(x_t \mid x_{t-1}, A\big)$
  • The log-likelihood: $\ell(\theta; D) = \sum_n \log p\big(x_{n,1} \mid \pi\big) + \sum_n\sum_{t=2}^T \log p\big(x_{n,t} \mid x_{n,t-1}, A\big)$
  • Again, we optimize each parameter separately
  • π is a multinomial frequency vector, and we've seen it before
  • What about A?


Learning a Markov chain transition matrix

  • A is a stochastic matrix: $\sum_j A_{ij} = 1$
  • Each row of A is a multinomial distribution.
  • So the MLE of A_{ij} is the fraction of transitions from i to j:

$$\hat{A}_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=2}^T x_{n,t-1}^i\,x_{n,t}^j}{\sum_n\sum_{t=2}^T x_{n,t-1}^i}$$

  • Application: if the states X_t represent words, this is called a bigram language model
  • Sparse data problem: if the pair i → j did not occur in the data, we will have $\hat{A}_{ij} = 0$, and then any further sequence containing the word pair i → j will have zero probability.
  • A standard hack: backoff smoothing or deleted interpolation (see the sketch below)

$$\tilde{A}_{ij} = \lambda\,\eta_j + (1 - \lambda)\,\hat{A}_{ij}^{ML}$$
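A counting sketch of the bigram MLE with optional backoff smoothing (my own code; η defaults to a uniform backoff distribution):

```python
import numpy as np

def bigram_mle(seqs, n_states, lam=0.0, eta=None):
    """Row-normalized transition counts, optionally smoothed toward eta."""
    counts = np.zeros((n_states, n_states))
    for s in seqs:
        for i, j in zip(s[:-1], s[1:]):     # count transitions i -> j
            counts[i, j] += 1
    A_ml = counts / counts.sum(axis=1, keepdims=True)
    if eta is None:
        eta = np.full(n_states, 1.0 / n_states)
    return lam * eta + (1 - lam) * A_ml

seqs = [[0, 1, 1, 2, 0], [1, 2, 2, 0, 1]]
print(bigram_mle(seqs, 3, lam=0.1))         # no zero entries after smoothing
```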


Bayesian language model

  • Global and local parameter independence
  • The posterior of A_{i·} and A_{i'·} is factorized despite the v-structure on X_t, because X_{t−1} acts like a multiplexer

[Figure: chain X1 → X2 → X3 → ... → XT with parameter nodes π, A_{i·}, A_{i'·} and hyperparameters α, β]

  • Assign a Dirichlet prior β_i to each row of the transition matrix:

$$\hat{A}_{ij}^{Bayes} \overset{\text{def}}{=} p(j \mid i, D, \beta_i) = \frac{\#(i \to j) + \beta_{i,j}}{\#(i \to \cdot) + |\beta_i|} = \lambda_i\,\frac{\beta_{i,j}}{|\beta_i|} + (1 - \lambda_i)\,\hat{A}_{ij}^{ML}, \qquad \text{where } |\beta_i| = \sum_j \beta_{i,j},\quad \lambda_i = \frac{|\beta_i|}{\#(i \to \cdot) + |\beta_i|}$$

  • We could consider more realistic priors, e.g., mixtures of Dirichlets to account for types of words (adjectives, verbs, etc.)
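The identity between the posterior-mean form and the convex-combination form can be verified numerically (my own sketch with hypothetical counts):

```python
import numpy as np

n_ij = np.array([8.0, 1.0, 1.0])     # hypothetical counts #(i -> j) for one row i
beta_i = np.array([1.0, 1.0, 1.0])   # Dirichlet pseudo-counts for that row

A_bayes = (n_ij + beta_i) / (n_ij.sum() + beta_i.sum())

# The same estimate as a convex combination of the prior mean and the ML estimate
lam = beta_i.sum() / (n_ij.sum() + beta_i.sum())
A_mix = lam * beta_i / beta_i.sum() + (1 - lam) * n_ij / n_ij.sum()
print(np.allclose(A_bayes, A_mix))   # True
```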


Example: HMM: two scenarios

  • Supervised learning: estimation when the “right answer” is known
  • Examples:
    GIVEN: a genomic region x = x1…x1,000,000 where we have good (experimental) annotations of the CpG islands
    GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls
  • Unsupervised learning: estimation when the “right answer” is unknown
  • Examples:
    GIVEN: the porcupine genome; we don’t know how frequent the CpG islands are there, nor do we know their composition
    GIVEN: 10,000 rolls of the casino player, but we don’t see when he changes dice
  • QUESTION: Update the parameters θ of the model to maximize P(x|θ) -- maximal likelihood (ML) estimation

Recall definition of HMM

[Figure: HMM graphical model: hidden chain y1 → y2 → y3 → ... → yT with transition matrix A on each edge, each y_t emitting x_t]

  • Transition probabilities between any two states:

$$p\big(y_t^j = 1 \mid y_{t-1}^i = 1\big) = a_{i,j},$$

or in general:

$$p\big(y_t \mid y_{t-1}^i = 1\big) \sim \text{Multinomial}\big(a_{i,1}, a_{i,2}, \ldots, a_{i,M}\big), \quad \forall i \in I$$

  • Start probabilities:

$$p(y_1) \sim \text{Multinomial}\big(\pi_1, \pi_2, \ldots, \pi_M\big)$$

  • Emission probabilities associated with each state:

$$p\big(x_t \mid y_t^i = 1\big) \sim \text{Multinomial}\big(b_{i,1}, b_{i,2}, \ldots, b_{i,K}\big), \quad \forall i \in I,$$

or in general:

$$p\big(x_t \mid y_t^i = 1\big) \sim f\big(\cdot \mid \theta_i\big), \quad \forall i \in I$$


Supervised ML estimation

  • Given x = x1…xN for which the true state path y = y1…yN is known,
  • Define:
    A_{ij} = # times state transition i→j occurs in y
    B_{ik} = # times state i in y emits k in x
  • We can show that the maximum likelihood parameters θ are (see the sketch below):

$$a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=2}^T y_{n,t-1}^i\,y_{n,t}^j}{\sum_n\sum_{t=2}^T y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}$$

$$b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n\sum_{t=1}^T y_{n,t}^i\,x_{n,t}^k}{\sum_n\sum_{t=1}^T y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}$$

  • What if x is continuous? We can treat $\big\{(x_{n,t},\,y_{n,t}) : t = 1..T,\ n = 1..N\big\}$ as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian ...
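Both estimates are normalized counts, which makes the supervised case a few lines of code (my own sketch):

```python
import numpy as np

def supervised_hmm_mle(xs, ys, n_states, n_symbols):
    """Count-based ML estimates of the HMM transition (a) and emission (b) matrices."""
    A = np.zeros((n_states, n_states))
    B = np.zeros((n_states, n_symbols))
    for x, y in zip(xs, ys):
        for i, j in zip(y[:-1], y[1:]):
            A[i, j] += 1                   # transition counts A_ij
        for i, k in zip(y, x):
            B[i, k] += 1                   # emission counts B_ik
    return (A / A.sum(axis=1, keepdims=True),
            B / B.sum(axis=1, keepdims=True))

xs = [[0, 2, 1, 1, 0]]                     # observed symbol sequence
ys = [[0, 0, 1, 1, 0]]                     # known state path
a, b = supervised_hmm_mle(xs, ys, 2, 3)
print(a); print(b)
```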


Supervised ML estimation, ctd.

  • Intuition:
  • When we know the underlying states, the best estimate of θ is the

average frequency of transitions & emissions that occur in the training data

  • Drawback:
  • Given little data, there may be overfitting:
  • P(x|θ) is maximized, but θ is unreasonable

0 probabilities – VERY BAD

  • Example:
  • Given 10 casino rolls, we observe

x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F

  • Then:

aFF = 1; aFL = 0
bF1 = bF3 = bF6 = .2; bF2 = .3; bF4 = 0; bF5 = .1


Pseudocounts

  • Solution for small training sets:
  • Add pseudocounts

    Aij = # times state transition i→j occurs in y + Rij
    Bik = # times state i in y emits k in x + Sik

  • Rij, Sik are pseudocounts representing our prior belief
  • Total pseudocounts: Ri = ΣjRij , Si = ΣkSik ,
  • -- "strength" of prior belief,
  • -- total number of imaginary instances in the prior
  • Larger total pseudocounts ⇒ stronger prior belief
  • Small total pseudocounts: just to avoid 0 probabilities --- smoothing
  • This is equivalent to Bayesian estimation under a uniform prior with "parameter strength" equal to the pseudocounts