SLIDE 1

Probabilistic Graphical Models
Learning: Parameter Estimation

Max Likelihood for Log-Linear Models
Daphne Koller

SLIDE 2

Log-Likelihood for Markov Nets

  • Partition function couples the parameters

    – No decomposition of likelihood
    – No closed form solution

[Figure: pairwise Markov network A - B - C]
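In standard form, the log-likelihood for this network with potentials \phi_1(A,B) and \phi_2(B,C) is:

\ell(\theta : D) = \sum_m \big[ \ln \phi_1(a[m], b[m]) + \ln \phi_2(b[m], c[m]) \big] - M \ln Z, \qquad Z = \sum_{a,b,c} \phi_1(a,b)\, \phi_2(b,c)

Because Z sums over products of both potentials, the two are coupled in the likelihood: neither can be optimized independently of the other.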

SLIDE 3

Example: Log-Likelihood Function

[Figure: log-likelihood surfaces for the network A - B - C, plotted against pairs of potential parameters]

SLIDE 4

Log-Likelihood for Log-Linear Model
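In the standard log-linear parameterization with features f_1, ..., f_k:

P(x; \theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(x) \right\}, \qquad Z(\theta) = \sum_x \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(x) \right\}

\ell(\theta : D) = \sum_{i=1}^{k} \theta_i \sum_m f_i(x[m]) - M \ln Z(\theta)

The first term is linear in \theta; all of the difficulty lives in \ln Z(\theta), analyzed on the next slides.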

SLIDE 5

The Log-Partition Function

Theorem: \frac{\partial}{\partial \theta_i} \ln Z(\theta) = E_\theta[f_i]

Proof: \frac{\partial}{\partial \theta_i} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_x f_i(x) \exp\left\{ \sum_j \theta_j f_j(x) \right\} = \sum_x f_i(x)\, P(x; \theta) = E_\theta[f_i]

SLIDE 6

The Log-Partition Function

Theorem: \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i, f_j]

The Hessian of \ln Z(\theta) is a covariance matrix, hence positive semidefinite, so \ln Z(\theta) is convex.

  • Log-likelihood function = linear term minus convex \ln Z(\theta), hence concave
    – No local optima
    – Easy to optimize

SLIDE 7

Maximum Likelihood Estimation

Theorem: \hat{\theta} is the MLE if and only if the model's expected feature counts match the empirical feature counts:

E_D[f_i] = \frac{1}{M} \sum_m f_i(x[m]) = E_{\hat{\theta}}[f_i] \quad \text{for all } i

SLIDE 8

Computation: Gradient Ascent

  • Use gradient ascent:

    – typically L-BFGS, a quasi-Newton method

  • For gradient, need expected feature counts:

    – in data
    – relative to current model

  • Requires inference at each gradient step
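The gradient being ascended is the difference between empirical and model-expected feature counts:

\frac{\partial}{\partial \theta_i} \ell(\theta : D) = \sum_m f_i(x[m]) - M \, E_\theta[f_i]

A minimal sketch of this computation, not code from the lecture: the toy chain model, the indicator features, and brute-force enumeration standing in for real inference are all illustrative assumptions; scipy's L-BFGS-B plays the role of the quasi-Newton optimizer.

    import numpy as np
    from itertools import product
    from scipy.optimize import minimize

    # Toy chain A - B - C over binary variables, with two indicator features
    # f1 = 1(A = B), f2 = 1(B = C).  (Illustrative choice, not from the slides.)
    def features(x):
        a, b, c = x
        return np.array([float(a == b), float(b == c)])

    states = list(product([0, 1], repeat=3))        # enumerate all 2^3 joint states
    F = np.array([features(x) for x in states])     # one row of features per state

    def neg_log_likelihood(theta, data_counts):
        M = data_counts.sum()
        scores = F @ theta                          # sum_i theta_i f_i(x), per state
        log_Z = np.logaddexp.reduce(scores)         # exact log-partition function
        ll = data_counts @ scores - M * log_Z       # sum_m theta.f(x[m]) - M log Z
        p = np.exp(scores - log_Z)                  # model distribution P(x; theta)
        grad = data_counts @ F - M * (p @ F)        # empirical - M * expected counts
        return -ll, -grad                           # minimize the negative

    # Fake data: how often each joint state was observed (illustrative values).
    data_counts = np.random.default_rng(0).integers(0, 20, size=len(states)).astype(float)

    res = minimize(neg_log_likelihood, np.zeros(2), args=(data_counts,),
                   jac=True, method="L-BFGS-B")
    print("MLE parameters:", res.x)

At the optimum, empirical and expected feature counts coincide, matching the moment-matching theorem above; in a real network, exhaustive enumeration would be replaced by calibrating a clique tree or cluster graph.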
SLIDE 9

Example: Ising Model
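The Ising model is a pairwise log-linear model over variables X_i \in \{-1, +1\}, with edge features f_{ij}(x) = x_i x_j and node features f_i(x) = x_i:

P(x; w, u) = \frac{1}{Z(w, u)} \exp\left\{ \sum_{(i,j) \in E} w_{ij} \, x_i x_j + \sum_i u_i x_i \right\}

The gradient with respect to w_{ij} requires E_\theta[X_i X_j], i.e., the pairwise marginals obtained by inference at each gradient step.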

SLIDE 10

Summary

  • Partition function couples parameters in likelihood
  • No closed form solution, but convex optimization
    – Solved using gradient ascent (usually L-BFGS)
  • Gradient computation requires inference at each gradient step to compute expected feature counts
  • Features are always within clusters in a cluster graph or clique tree, due to family preservation
    – One calibration suffices for all feature expectations

SLIDE 11

Max Likelihood for CRFs

Probabilistic Graphical Models
Learning: Parameter Estimation

SLIDE 12

Estimation for CRFs
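For a CRF we maximize the conditional log-likelihood; in standard form, with a per-instance partition function Z_x(\theta):

\ell_{Y|X}(\theta : D) = \sum_m \left[ \sum_i \theta_i f_i(x[m], y[m]) - \ln Z_{x[m]}(\theta) \right], \qquad Z_x(\theta) = \sum_y \exp\left\{ \sum_i \theta_i f_i(x, y) \right\}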

SLIDE 13

Example

f1(Ys, Xs) = Gs · 1{Ys = g}, where Gs is the average intensity of the green channel for the pixels in superpixel s
f2(Ys, Yt) = 1{Ys = Yt}

[Figure: grid-structured CRF over superpixel labels Yi, Yj]

SLIDE 14

Computation

  • MRF: requires inference at each gradient step
  • CRF: requires inference for each x[m] at each gradient step
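The CRF gradient mirrors the MRF one, but the expected feature counts are conditioned on each instance's x[m]:

\frac{\partial}{\partial \theta_i} \ell_{Y|X}(\theta : D) = \sum_m \Big( f_i(x[m], y[m]) - E_{P(Y \mid x[m]; \theta)}\big[ f_i(x[m], Y) \big] \Big)

Each term has its own partition function Z_{x[m]}(\theta), which is why inference runs once per data instance per gradient step.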

SLIDE 15

However…

  • For inference of P(Y | x), we need to compute the distribution only over Y
  • If we learn an MRF, we need to compute P(Y, X), which may be much more complex

f1(Ys, Xs) = Gs · 1{Ys = g}, where Gs is the average intensity of the green channel for the pixels in superpixel s
f2(Ys, Yt) = 1{Ys = Yt}

SLIDE 16

Summary

  • CRF learning is very similar to MRF learning
    – Likelihood function is concave
    – Optimized using gradient ascent (usually L-BFGS)
  • Gradient computation requires inference: one run per gradient step per data instance
    – cf. once per gradient step for MRFs
  • But the conditional model is often much simpler, so the inference cost for a CRF and an MRF is not the same

SLIDE 17

MAP Estimation for MRFs, CRFs

Probabilistic Graphical Models
Learning: Parameter Estimation

SLIDE 18

Gaussian Parameter Prior

[Figure: density of a zero-mean Gaussian prior over a single parameter]
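In standard form, the zero-mean Gaussian prior over the k parameters is:

P(\theta \mid \sigma^2) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\theta_i^2}{2\sigma^2} \right\}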

SLIDE 19

Laplacian Parameter Prior

[Figure: density of a zero-mean Laplacian prior over a single parameter, more sharply peaked at zero than the Gaussian]
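The corresponding zero-mean Laplacian prior is:

P(\theta \mid \beta) = \prod_{i=1}^{k} \frac{1}{2\beta} \exp\left\{ -\frac{|\theta_i|}{\beta} \right\}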

SLIDE 20

MAP Estimation & Regularization

  • MAP objective: \arg\max_\theta \; \ell(\theta : D) + \log P(\theta)
  • Gaussian prior gives L2 regularization: \log P(\theta) = -\frac{1}{2\sigma^2} \sum_i \theta_i^2 + \text{const}
  • Laplacian prior gives L1 regularization: \log P(\theta) = -\frac{1}{\beta} \sum_i |\theta_i| + \text{const}
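Continuing the toy sketch from the gradient-ascent slide (same hypothetical neg_log_likelihood, data_counts, np, and minimize; sigma2 is an illustrative hyperparameter), MAP estimation with a Gaussian prior just adds the L2 penalty and its gradient to the objective:

    def neg_log_posterior_l2(theta, data_counts, sigma2=1.0):
        # Gaussian prior contributes -log P(theta) = theta^2 / (2 sigma^2) + const
        nll, ngrad = neg_log_likelihood(theta, data_counts)
        return nll + theta @ theta / (2 * sigma2), ngrad + theta / sigma2

    res = minimize(neg_log_posterior_l2, np.zeros(2), args=(data_counts,),
                   jac=True, method="L-BFGS-B")
    print("MAP (L2) parameters:", res.x)

The L1 penalty is not differentiable at zero, so plain L-BFGS does not apply directly; in practice it is handled with subgradient, proximal, or OWL-QN methods.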
SLIDE 21

Summary

  • In undirected models, parameter coupling prevents efficient Bayesian estimation
  • However, we can still use parameter priors to avoid overfitting of the MLE
  • Typical priors are L1, L2
    – Drive parameters toward zero
  • L1 provably induces sparse solutions
    – Performs feature selection / structure learning