Probabilistic Graphical Models
Learning: Parameter Estimation
Max Likelihood for Log-Linear Models
Daphne Koller
Log-Likelihood for Markov Nets

Example network: A - B - C, with P(A, B, C) = (1/Z) φ1(A, B) φ2(B, C), so that

ln P(A, B, C) = ln φ1(A, B) + ln φ2(B, C) − ln Z

The partition function Z = Σ_{A,B,C} φ1(A, B) φ2(B, C) couples the parameters (illustrated below):
– No decomposition of likelihood
– No closed-form solution
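A minimal numeric sketch of this coupling, assuming a binary A - B - C chain with arbitrary potential values and brute-force enumeration of Z:

import numpy as np

rng = np.random.default_rng(0)

# Random positive potentials for the chain A - B - C (binary variables, illustrative values).
phi1 = rng.uniform(0.5, 2.0, size=(2, 2))   # phi1(A, B)
phi2 = rng.uniform(0.5, 2.0, size=(2, 2))   # phi2(B, C)

# Partition function: sum over all joint assignments (a, b, c).
Z = sum(phi1[a, b] * phi2[b, c]
        for a in range(2) for b in range(2) for c in range(2))

# Log-probability of one assignment: the potential terms decompose, but ln Z does not,
# since it sums over products of BOTH potentials.
a, b, c = 1, 0, 1
log_p = np.log(phi1[a, b]) + np.log(phi2[b, c]) - np.log(Z)
print(f"ln P(a,b,c) = {log_p:.4f}, where ln Z = {np.log(Z):.4f} couples both potentials")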
Log-Linear Models

P(X : θ) = (1/Z(θ)) exp{ Σ_i θ_i f_i(X) },  where  Z(θ) = Σ_x exp{ Σ_i θ_i f_i(x) }

Log-likelihood for data D = { x[1], …, x[M] }:

ℓ(θ : D) = Σ_i θ_i Σ_m f_i(x[m]) − M ln Z(θ)

The Log-Partition Function

Theorem: ∂/∂θ_i ln Z(θ) = E_θ[f_i]

Proof: ∂/∂θ_i ln Z(θ) = (1/Z(θ)) Σ_x f_i(x) exp{ Σ_j θ_j f_j(x) } = Σ_x f_i(x) P(x : θ) = E_θ[f_i]
Theorem: ln Z(θ) is a convex function of θ; its Hessian is the covariance matrix of the features, which is positive semi-definite.

Hence the log-likelihood ℓ(θ : D), a linear function of θ minus M ln Z(θ), is concave:
– No local optima
– Easy to optimize
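Both claims are straightforward to check numerically. The sketch below assumes a toy log-linear model over x in {0,1}^3 with four arbitrary illustrative features, computing ln Z by brute-force enumeration:

import numpy as np

rng = np.random.default_rng(1)

# Toy log-linear model: all 8 states of (x0, x1, x2), four arbitrary binary features.
X = np.array([[a, b, c] for a in range(2) for b in range(2) for c in range(2)])
F = np.stack([X[:, 0], X[:, 1] * X[:, 2], X[:, 0] * X[:, 1], X[:, 2]], axis=1).astype(float)
theta = rng.normal(size=4)

def log_Z(theta):
    return np.log(np.exp(F @ theta).sum())

# Model distribution and expected features E_theta[f].
p = np.exp(F @ theta - log_Z(theta))
E_f = p @ F

# Finite-difference gradient of ln Z matches E_theta[f].
eps = 1e-6
grad_fd = np.array([(log_Z(theta + eps * np.eye(4)[i]) - log_Z(theta - eps * np.eye(4)[i])) / (2 * eps)
                    for i in range(4)])
print(np.allclose(grad_fd, E_f, atol=1e-5))        # True

# Hessian of ln Z is the covariance matrix of the features => positive semi-definite.
cov = F.T @ (p[:, None] * F) - np.outer(E_f, E_f)
print(np.all(np.linalg.eigvalsh(cov) >= -1e-10))   # True: ln Z is convex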
Maximum Likelihood Estimation

Theorem: θ̂ is the MLE if and only if E_D[f_i] = E_θ̂[f_i] for every feature f_i, i.e., the expected feature counts relative to the learned model match the empirical feature counts in the data.
Computing the gradient:

(1/M) ∂/∂θ_i ℓ(θ : D) = E_D[f_i] − E_θ[f_i]

– Gradient is the difference between the feature's empirical expectation in the data and its expectation relative to the current model
– Optimized by gradient ascent, typically L-BFGS, a quasi-Newton method (sketched below)
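A sketch of the resulting estimation procedure, assuming a tiny enumerable model so that E_θ[f_i] is exact; the feature set, true parameters, and sample size are illustrative. At the optimum, the moment-matching condition from the theorem above holds:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Toy setup: all states of (x0, x1, x2), four illustrative features.
X = np.array([[a, b, c] for a in range(2) for b in range(2) for c in range(2)])
F = np.stack([X[:, 0], X[:, 1], X[:, 2], X[:, 0] * X[:, 1]], axis=1).astype(float)

# Synthetic data: M samples drawn from a "true" model theta_true (an assumption).
theta_true = np.array([0.5, -1.0, 0.3, 1.2])
p_true = np.exp(F @ theta_true); p_true /= p_true.sum()
data_idx = rng.choice(len(X), size=5000, p=p_true)
E_data = F[data_idx].mean(axis=0)                  # E_D[f_i]

def neg_ll_and_grad(theta):
    logits = F @ theta
    log_Z = np.logaddexp.reduce(logits)
    p = np.exp(logits - log_Z)
    E_model = p @ F                                # E_theta[f_i]
    nll = -(E_data @ theta - log_Z)                # average negative log-likelihood
    return nll, E_model - E_data                   # gradient of the NEGATIVE log-likelihood

res = minimize(neg_ll_and_grad, np.zeros(4), jac=True, method="L-BFGS-B")

# Moment matching at the optimum: E_D[f] == E_theta_hat[f] (up to optimizer tolerance).
p_hat = np.exp(F @ res.x); p_hat /= p_hat.sum()
print(np.allclose(p_hat @ F, E_data, atol=1e-4))   # True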
Summary: MLE for log-linear models
– Likelihood is concave
– Solved using gradient ascent (usually L-BFGS)
– Requires inference at each gradient step to compute expected feature counts
– Features are within cliques of the graph or clique tree, due to family preservation
– One calibration therefore suffices for all feature expectations (see the sketch below)
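A sketch of the last two points, assuming a binary A - B - C chain; brute-force marginalization stands in for clique-tree calibration here, but the idea is the same: the clique beliefs, computed once, yield every feature expectation:

import numpy as np

rng = np.random.default_rng(3)

# Chain A - B - C with log-linear pairwise potentials (binary variables, illustrative).
theta1 = rng.normal(size=(2, 2))   # parameters on clique {A, B}
theta2 = rng.normal(size=(2, 2))   # parameters on clique {B, C}

# "Calibration": compute the two clique marginals once.
# (A real implementation would run clique-tree calibration instead of enumeration.)
joint = np.einsum('ab,bc->abc', np.exp(theta1), np.exp(theta2))
joint /= joint.sum()
belief_AB = joint.sum(axis=2)      # calibrated belief over clique {A, B}
belief_BC = joint.sum(axis=0)      # calibrated belief over clique {B, C}

# Family preservation: each feature's scope sits inside one clique, so every
# feature expectation reads off the corresponding calibrated belief.
E_f_AB_11 = belief_AB[1, 1]                        # E[1{A=1, B=1}]
E_f_BC_agree = belief_BC[0, 0] + belief_BC[1, 1]   # E[1{B = C}]
print(E_f_AB_11, E_f_BC_agree)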
Probabilistic Graphical Models
Learning: Parameter Estimation
Maximum Likelihood for Conditional Random Fields
Example: image segmentation over superpixels, with labels Y_s and observed image features X_s:

f1(Y_s, X_s) = 1{Y_s = g} · G_s,  where G_s is the average intensity of the green channel for the pixels in superpixel s
f2(Y_s, Y_t) = 1{Y_s = Y_t}  for neighboring superpixels s and t
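In code, these two features might look as follows; the label encoding (GREEN = 1) and the intensity values are assumptions for illustration:

import numpy as np

GREEN = 1                                   # index of the label "green" (assumption)

def f1(y_s, G_s):
    # Node feature: the average green-channel intensity G_s fires only when
    # superpixel s is labeled green.
    return float(y_s == GREEN) * G_s

def f2(y_s, y_t):
    # Edge feature: 1 when neighboring superpixels take the same label.
    return float(y_s == y_t)

# Example: superpixel s labeled green with average green intensity 0.8,
# neighbor t labeled non-green.
print(f1(1, 0.8), f2(1, 0))                 # 0.8 0.0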
MRF vs. CRF: an MRF is trained to maximize the joint likelihood P(Y, X : θ), whereas a CRF is trained to maximize the conditional likelihood P(Y | X : θ).
Gradient of the conditional log-likelihood:

(1/M) ∂/∂θ_i ℓ_{Y|X}(θ : D) = (1/M) Σ_m [ f_i(x[m], y[m]) − E_θ[f_i(x[m], Y)] ]

– Likelihood function is concave
– Optimized using gradient ascent (usually L-BFGS)
– Gradient computation requires inference once per gradient step per data instance x[m] (sketched below)
  – cf. once per gradient step for MRFs
– However, the inference cost per run for a CRF and an MRF is not the same: CRF inference conditions on the observed x[m], which can simplify the model considerably
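A sketch of the per-instance gradient term, assuming a small chain CRF whose label space can be enumerated; brute-force enumeration over Y stands in for conditional inference, and the feature and parameter choices are illustrative:

import numpy as np
from itertools import product

# Toy chain CRF over labels Y = (Y1, Y2, Y3) given observed intensities x.
# theta = (theta_node, theta_edge); features aggregate f1 and f2 from above.
def features(y, x):
    node = sum((yi == 1) * xi for yi, xi in zip(y, x))              # sum of f1 terms
    edge = sum(float(y[i] == y[i + 1]) for i in range(len(y) - 1))  # sum of f2 terms
    return np.array([node, edge])

def per_instance_gradient(theta, x, y_obs):
    # Enumerate all label assignments for this x (brute-force conditional
    # inference; a real implementation would calibrate a clique tree over Y).
    ys = list(product([0, 1], repeat=len(x)))
    scores = np.array([features(y, x) @ theta for y in ys])
    p = np.exp(scores - scores.max()); p /= p.sum()
    E_f = sum(pi * features(y, x) for pi, y in zip(p, ys))    # E_theta[f | x]
    return features(y_obs, x) - E_f                           # gradient term for (x, y)

theta = np.array([0.5, 1.0])
print(per_instance_gradient(theta, x=[0.9, 0.2, 0.7], y_obs=(1, 0, 1)))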
Probabilistic Graphical Models
Learning: Parameter Estimation
MAP Estimation for MRFs and CRFs
Gaussian parameter prior: P(θ_i) ∝ exp(−θ_i² / 2σ²)

[Figure: zero-mean Gaussian prior density over a parameter θ_i]
Laplacian parameter prior: P(θ_i) ∝ exp(−|θ_i| / β)

[Figure: zero-mean Laplacian prior density over a parameter θ_i]
MAP estimation with these priors corresponds to regularizing the log-likelihood: the Gaussian prior yields L2 regularization, the Laplacian prior yields L1 regularization.
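A sketch of MAP estimation with a Gaussian prior, i.e., L2 regularization of the log-likelihood; the toy model, the synthetic empirical counts, and the prior variance sigma2 = 1 are all illustrative:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Toy log-linear setup as before; sigma2 is an assumed prior variance.
X = np.array([[a, b, c] for a in range(2) for b in range(2) for c in range(2)])
F = np.stack([X[:, 0], X[:, 1], X[:, 2], X[:, 0] * X[:, 1]], axis=1).astype(float)
E_data = F[rng.choice(len(X), size=1000)].mean(axis=0)   # synthetic empirical counts
sigma2 = 1.0

def neg_map_objective(theta):
    log_Z = np.logaddexp.reduce(F @ theta)
    p = np.exp(F @ theta - log_Z)
    # Gaussian prior on theta <=> L2 penalty added to the (negative) log-likelihood.
    nll = -(E_data @ theta - log_Z) + theta @ theta / (2 * sigma2)
    grad = (p @ F - E_data) + theta / sigma2
    return nll, grad

res = minimize(neg_map_objective, np.zeros(4), jac=True, method="L-BFGS-B")
print(res.x)   # MAP estimate; the L2 penalty shrinks parameters toward zero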
Summary: MAP estimation
– Parameter coupling in the likelihood prevents efficient Bayesian estimation
– MAP estimation with parameter priors helps avoid overfitting of the MLE
– L1 (Laplacian) priors:
  – Drive parameters toward zero
  – Perform feature selection / structure learning