SLIDE 1

Probabilistic Graphical Models
Learning: Parameter Estimation

Max Likelihood for Log-Linear Models
Daphne Koller

SLIDE 2

Log-Likelihood for Markov Nets

  • Partition function couples the parameters

    – No decomposition of likelihood
    – No closed form solution

[Figure: pairwise Markov network A - B - C]
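In standard form, the log-likelihood for this network with potentials \phi_1(A,B) and \phi_2(B,C) is:

\ell(\theta : D) = \sum_m \big[ \ln \phi_1(a[m], b[m]) + \ln \phi_2(b[m], c[m]) \big] - M \ln Z, \qquad Z = \sum_{a,b,c} \phi_1(a,b)\, \phi_2(b,c)

Because Z sums over products of both potentials, the two are coupled in the likelihood: neither can be optimized independently of the other.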

SLIDE 3

Example: Log-Likelihood Function

[Figure: log-likelihood surfaces for the network A - B - C, plotted against pairs of potential parameters]

SLIDE 4

Log-Likelihood for Log-Linear Model
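In the standard log-linear parameterization with features f_1, ..., f_k:

P(x; \theta) = \frac{1}{Z(\theta)} \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(x) \right\}, \qquad Z(\theta) = \sum_x \exp\left\{ \sum_{i=1}^{k} \theta_i f_i(x) \right\}

\ell(\theta : D) = \sum_{i=1}^{k} \theta_i \sum_m f_i(x[m]) - M \ln Z(\theta)

The first term is linear in \theta; all of the difficulty lives in \ln Z(\theta), analyzed on the next slides.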

SLIDE 5

The Log-Partition Function

Theorem: \frac{\partial}{\partial \theta_i} \ln Z(\theta) = E_\theta[f_i]

Proof: \frac{\partial}{\partial \theta_i} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_x f_i(x) \exp\left\{ \sum_j \theta_j f_j(x) \right\} = \sum_x f_i(x)\, P(x; \theta) = E_\theta[f_i]

SLIDE 6

The Log-Partition Function

Theorem: \frac{\partial^2}{\partial \theta_i \, \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i, f_j]

The Hessian of \ln Z(\theta) is a covariance matrix, hence positive semidefinite, so \ln Z(\theta) is convex.

  • Log-likelihood function = linear term minus convex \ln Z(\theta), hence concave
    – No local optima
    – Easy to optimize

SLIDE 7

Maximum Likelihood Estimation

Theorem: \hat{\theta} is the MLE if and only if the model's expected feature counts match the empirical feature counts:

E_D[f_i] = \frac{1}{M} \sum_m f_i(x[m]) = E_{\hat{\theta}}[f_i] \quad \text{for all } i

SLIDE 8

Computation: Gradient Ascent

  • Use gradient ascent:

    – typically L-BFGS, a quasi-Newton method

  • For gradient, need expected feature counts:

    – in data
    – relative to current model

  • Requires inference at each gradient step
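The gradient being ascended is the difference between empirical and model-expected feature counts:

\frac{\partial}{\partial \theta_i} \ell(\theta : D) = \sum_m f_i(x[m]) - M \, E_\theta[f_i]

A minimal sketch of this computation, not code from the lecture: the toy chain model, the indicator features, and brute-force enumeration standing in for real inference are all illustrative assumptions; scipy's L-BFGS-B plays the role of the quasi-Newton optimizer.

    import numpy as np
    from itertools import product
    from scipy.optimize import minimize

    # Toy chain A - B - C over binary variables, with two indicator features
    # f1 = 1(A = B), f2 = 1(B = C).  (Illustrative choice, not from the slides.)
    def features(x):
        a, b, c = x
        return np.array([float(a == b), float(b == c)])

    states = list(product([0, 1], repeat=3))        # enumerate all 2^3 joint states
    F = np.array([features(x) for x in states])     # one row of features per state

    def neg_log_likelihood(theta, data_counts):
        M = data_counts.sum()
        scores = F @ theta                          # sum_i theta_i f_i(x), per state
        log_Z = np.logaddexp.reduce(scores)         # exact log-partition function
        ll = data_counts @ scores - M * log_Z       # sum_m theta.f(x[m]) - M log Z
        p = np.exp(scores - log_Z)                  # model distribution P(x; theta)
        grad = data_counts @ F - M * (p @ F)        # empirical - M * expected counts
        return -ll, -grad                           # minimize the negative

    # Fake data: how often each joint state was observed (illustrative values).
    data_counts = np.random.default_rng(0).integers(0, 20, size=len(states)).astype(float)

    res = minimize(neg_log_likelihood, np.zeros(2), args=(data_counts,),
                   jac=True, method="L-BFGS-B")
    print("MLE parameters:", res.x)

At the optimum, empirical and expected feature counts coincide, matching the moment-matching theorem above; in a real network, exhaustive enumeration would be replaced by calibrating a clique tree or cluster graph.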
SLIDE 9

Example: Ising Model
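The Ising model is a pairwise log-linear model over variables X_i \in \{-1, +1\}, with edge features f_{ij}(x) = x_i x_j and node features f_i(x) = x_i:

P(x; w, u) = \frac{1}{Z(w, u)} \exp\left\{ \sum_{(i,j) \in E} w_{ij} \, x_i x_j + \sum_i u_i x_i \right\}

The gradient with respect to w_{ij} requires E_\theta[X_i X_j], i.e., the pairwise marginals obtained by inference at each gradient step.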

SLIDE 10

Summary

  • Partition function couples parameters in likelihood
  • No closed form solution, but convex optimization
    – Solved using gradient ascent (usually L-BFGS)
  • Gradient computation requires inference at each gradient step to compute expected feature counts
  • Features are always within clusters in a cluster graph or clique tree, due to family preservation
    – One calibration suffices for all feature expectations

SLIDE 11

Max Likelihood for CRFs

Probabilistic Graphical Models
Learning: Parameter Estimation

SLIDE 12

Estimation for CRFs
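For a CRF we maximize the conditional log-likelihood; in standard form, with a per-instance partition function Z_x(\theta):

\ell_{Y|X}(\theta : D) = \sum_m \left[ \sum_i \theta_i f_i(x[m], y[m]) - \ln Z_{x[m]}(\theta) \right], \qquad Z_x(\theta) = \sum_y \exp\left\{ \sum_i \theta_i f_i(x, y) \right\}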

SLIDE 13

Example

f1(Ys, Xs) = Gs · 1{Ys = g}, where Gs is the average intensity of the green channel for the pixels in superpixel s
f2(Ys, Yt) = 1{Ys = Yt}

[Figure: grid-structured CRF over superpixel labels Yi, Yj]

SLIDE 14

Computation

  • MRF: requires inference at each gradient step
  • CRF: requires inference for each x[m] at each gradient step
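The CRF gradient mirrors the MRF one, but the expected feature counts are conditioned on each instance's x[m]:

\frac{\partial}{\partial \theta_i} \ell_{Y|X}(\theta : D) = \sum_m \Big( f_i(x[m], y[m]) - E_{P(Y \mid x[m]; \theta)}\big[ f_i(x[m], Y) \big] \Big)

Each term has its own partition function Z_{x[m]}(\theta), which is why inference runs once per data instance per gradient step.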

SLIDE 15

However…

  • For inference of P(Y | x), we need to compute the distribution only over Y
  • If we learn an MRF, we need to compute P(Y, X), which may be much more complex

f1(Ys, Xs) = Gs · 1{Ys = g}, where Gs is the average intensity of the green channel for the pixels in superpixel s
f2(Ys, Yt) = 1{Ys = Yt}

SLIDE 16

Summary

  • CRF learning is very similar to MRF learning
    – Likelihood function is concave
    – Optimized using gradient ascent (usually L-BFGS)
  • Gradient computation requires inference: one run per gradient step per data instance
    – cf. once per gradient step for MRFs
  • But the conditional model is often much simpler, so the inference cost for a CRF and an MRF is not the same

SLIDE 17

MAP Estimation for MRFs, CRFs

Probabilistic Graphical Models
Learning: Parameter Estimation

SLIDE 18

Gaussian Parameter Prior

[Figure: density of a zero-mean Gaussian prior over a single parameter]
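In standard form, the zero-mean Gaussian prior over the k parameters is:

P(\theta \mid \sigma^2) = \prod_{i=1}^{k} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{\theta_i^2}{2\sigma^2} \right\}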

SLIDE 19

Laplacian Parameter Prior

[Figure: density of a zero-mean Laplacian prior over a single parameter, more sharply peaked at zero than the Gaussian]
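The corresponding zero-mean Laplacian prior is:

P(\theta \mid \beta) = \prod_{i=1}^{k} \frac{1}{2\beta} \exp\left\{ -\frac{|\theta_i|}{\beta} \right\}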

SLIDE 20

MAP Estimation & Regularization

  • MAP objective: \arg\max_\theta \; \ell(\theta : D) + \log P(\theta)
  • Gaussian prior gives L2 regularization: \log P(\theta) = -\frac{1}{2\sigma^2} \sum_i \theta_i^2 + \text{const}
  • Laplacian prior gives L1 regularization: \log P(\theta) = -\frac{1}{\beta} \sum_i |\theta_i| + \text{const}
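Continuing the toy sketch from the gradient-ascent slide (same hypothetical neg_log_likelihood, data_counts, np, and minimize; sigma2 is an illustrative hyperparameter), MAP estimation with a Gaussian prior just adds the L2 penalty and its gradient to the objective:

    def neg_log_posterior_l2(theta, data_counts, sigma2=1.0):
        # Gaussian prior contributes -log P(theta) = theta^2 / (2 sigma^2) + const
        nll, ngrad = neg_log_likelihood(theta, data_counts)
        return nll + theta @ theta / (2 * sigma2), ngrad + theta / sigma2

    res = minimize(neg_log_posterior_l2, np.zeros(2), args=(data_counts,),
                   jac=True, method="L-BFGS-B")
    print("MAP (L2) parameters:", res.x)

The L1 penalty is not differentiable at zero, so plain L-BFGS does not apply directly; in practice it is handled with subgradient, proximal, or OWL-QN methods.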
SLIDE 21

Summary

  • In undirected models, parameter coupling prevents efficient Bayesian estimation
  • However, we can still use parameter priors to avoid overfitting of the MLE
  • Typical priors are L1, L2
    – Drive parameters toward zero
  • L1 provably induces sparse solutions
    – Performs feature selection / structure learning