Computationally Efficient M-Estimation of Log-Linear Structure Models - Noah Smith, Doug Vail, and John Lafferty - PowerPoint PPT Presentation



SLIDE 1

Computationally Efficient M-Estimation of Log-Linear Structure Models

Noah Smith, Doug Vail, and John Lafferty

School of Computer Science, Carnegie Mellon University

{nasmith,dvail2,lafferty}@cs.cmu.edu

SLIDE 2

Sketch of the Talk

A new loss function for supervised structured classification with arbitrary features.

  • Fast & easy to train - no partition functions!
  • Consistent estimator of the joint distribution
  • Information-theoretic interpretation
  • Some practical issues
  • Speed & accuracy comparison
SLIDE 3

Log-Linear Models as Classifiers

Distribution:

$$ p_w(x, y) = \frac{\exp\left(w^\top f(x, y)\right)}{Z(w)} $$

with parameters $w$, input $x$, output $y$, and partition function $Z(w)$.

Classification:

$$ \hat{y}(x) = \arg\max_y \; w^\top f(x, y) $$

a dot-product score, maximized by dynamic programming, search, discrete optimization, etc.
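To make the two pieces concrete, here is a minimal sketch (hypothetical feature vectors and candidate set, not from the talk) showing that classification only compares dot-product scores, so $Z(w)$ cancels and never has to be computed at test time:

```python
import numpy as np

def classify(w, candidate_features):
    """Return the index of the best-scoring candidate output.

    candidate_features: list of feature vectors f(x, y), one per candidate y.
    Z(w) is identical for every y, so it cancels and is never computed.
    In structured models the argmax ranges over exponentially many y and is
    carried out by dynamic programming or search instead of enumeration.
    """
    scores = [w @ f for f in candidate_features]
    return int(np.argmax(scores))

# Toy usage with hypothetical numbers: 3 candidate outputs, 4 features each.
w = np.array([0.5, -1.0, 2.0, 0.1])
candidates = [np.array([1, 0, 1, 0]),
              np.array([0, 1, 0, 1]),
              np.array([1, 1, 0, 0])]
print(classify(w, candidates))  # -> 0 (score 2.5 beats -0.9 and -0.5)
```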

SLIDE 4

Training Log-Linear Models

Maximum Likelihood Estimation:

$$ \hat{w} = \arg\max_w \sum_{i=1}^n \left( w^\top f(x_i, y_i) - \log Z(w) \right) $$

The partition function $Z(w)$ is the pain: it sums over all structures and must be recomputed throughout training.

Also, discriminative alternatives:

  • conditional random fields (x-wise partition functions)
  • maximum margin training (decoding during training)
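To see why $Z(w)$ is the pain, consider a brute-force computation for a sequence model. This toy sketch (my illustration, with a hypothetical whole-sequence feature function) enumerates every label sequence, a sum that MLE must evaluate, or dynamic-program around, at every optimization step:

```python
import itertools
import numpy as np

def log_partition_naive(w, feats, labels, length):
    """Brute-force log Z(w) = log sum_y exp(w . f(y)) over ALL label sequences.

    feats(y) -> feature vector for a complete label sequence y (hypothetical).
    With |labels| = 3 and length 20 there are 3**20 ~ 3.5 billion terms,
    which is why real implementations need dynamic programming instead.
    """
    scores = np.array([w @ feats(y)
                       for y in itertools.product(labels, repeat=length)])
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())  # stable log-sum-exp

# Toy usage: a single count-of-"B" feature, sequences of length 4 (81 terms).
feats = lambda y: np.array([y.count("B")])
print(log_partition_naive(np.array([0.3]), feats, ["B", "I", "O"], 4))
```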

SLIDE 5

Notational Variant

$$ p_w(x, y) = \frac{q_0(x, y)\, \exp\left(w^\top f(x, y)\right)}{Z(w)} $$

where $q_0$ is "some other" base distribution. Still log-linear.

SLIDE 6

Jeon and Lin (2006)

A new loss function for training:

$$ \ell(w) = \frac{1}{n} \sum_{i=1}^n \exp\left(-w^\top f(x_i, y_i)\right) + \mathbf{E}_{q_0}\left[w^\top f\right] $$

the exponentiated, negated dot-product scores of the training examples, plus a linear term in the feature expectations under the base distribution $q_0$.


SLIDE 8

Attractive Properties of the M-Estimator

Computationally efficient: no partition functions. Training touches only per-example dot products and the feature expectations $\mathbf{E}_{q_0}[f]$, which are computed once up front.

SLIDE 9

Attractive Properties of the M-Estimator

Convex. Each term $\exp(-w^\top f(x_i, y_i))$ is convex in $w$: exp is convex, and composing a convex function with an affine map preserves convexity. The remaining term $\mathbf{E}_{q_0}[w^\top f]$ is linear, and a sum of convex and linear functions is convex.
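As a quick check (mine, not on the slide), convexity also follows from the Hessian of the loss, a nonnegatively weighted sum of outer products and hence positive semidefinite:

$$ \nabla^2 \ell(w) = \frac{1}{n} \sum_{i=1}^n \exp\left(-w^\top f(x_i, y_i)\right) f(x_i, y_i)\, f(x_i, y_i)^\top \succeq 0 $$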

SLIDE 10

Statistical Consistency

  • If the data were drawn from some distribution in the given family, parameterized by w*, then the estimate converges to it: $\hat{w} \rightarrow w^*$ as $n \rightarrow \infty$.
  • True of MLE, pseudolikelihood, and the M-estimator.

    – Conditional likelihood is consistent for the conditional distribution.

SLIDE 11

Information-Theoretic Interpretation

  • True model: p*
  • A perturbation is applied to p*, resulting in q0.
  • Goal: recover the true distribution by correcting the perturbation.

[Diagram: correcting the perturbation, with paths labeled MLE and J&L '06.]

SLIDE 13

Minimizing KL Divergence

SLIDE 14

So far …

  • Alternative objective function for log-linear models.

  – Efficient to compute
  – Convex and differentiable
  – Easy to implement
  – Consistent

  • Interesting information-theoretic motivation.

Next …

  • Practical issues
  • Experiments
SLIDE 15

q0 Desiderata

  • Fast to estimate
  • Smooth
  • Straightforward calculation of $\mathbf{E}_{q_0}[f]$

Here: a smoothed HMM.

  – See the paper for details on computing $\mathbf{E}_{q_0}[f]$: it reduces to solving a linear system!

In general, one can sample from q0 to estimate the expectations, as in the sketch below.
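Where the linear-system trick doesn't apply, the fallback mentioned here is Monte Carlo. A minimal sketch (hypothetical sampler and feature function, mine rather than the paper's):

```python
import numpy as np

def expected_features_mc(sample_q0, feature_fn, n_samples=100_000):
    """Monte Carlo estimate of E_{q0}[f]: average f(x, y) over draws from q0.

    sample_q0()   -> one structure (x, y) sampled from the base model q0.
    feature_fn(s) -> feature vector f(s) as a 1-D array.
    This is computed once, before training begins.
    """
    total = None
    for _ in range(n_samples):
        f = np.asarray(feature_fn(sample_q0()), dtype=float)
        total = f if total is None else total + f
    return total / n_samples
```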

SLIDE 16

Optimization

Quasi-Newton methods such as L-BFGS (or conjugate gradient) can be used. The gradient:

$$ \nabla \ell(w) = -\frac{1}{n} \sum_{i=1}^n f(x_i, y_i)\, \exp\left(-w^\top f(x_i, y_i)\right) + \mathbf{E}_{q_0}[f] $$
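Putting the reconstructed loss (slide 6) and this gradient together, training reduces to one call to an off-the-shelf optimizer. A hedged end-to-end sketch using SciPy's L-BFGS; the dense feature matrix, variable names, and the quadratic regularizer from the next slide are my assumptions, not the authors' code:

```python
import numpy as np
from scipy.optimize import minimize

def m_est_objective(w, F_train, Ef_q0, c):
    """Loss and gradient of the (regularized) M-estimation objective.

    F_train: (n, d) matrix whose rows are f(x_i, y_i).
    Ef_q0:   length-d vector of feature expectations under q0.
    c:       prior variance of the quadratic regularizer (next slide).
    """
    e = np.exp(-(F_train @ w))            # exp(-w . f(x_i)), one per example
    n = F_train.shape[0]
    loss = e.mean() + Ef_q0 @ w + (w @ w) / (2.0 * c)
    grad = -(F_train.T @ e) / n + Ef_q0 + w / c
    return loss, grad

def train(F_train, Ef_q0, c=1.0):
    """One L-BFGS run; note there is no partition function anywhere."""
    d = F_train.shape[1]
    res = minimize(m_est_objective, np.zeros(d), args=(F_train, Ef_q0, c),
                   jac=True, method="L-BFGS-B")
    return res.x
```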

SLIDE 17

Regularization

Problem: if we estimate $\mathbf{E}_{q_0}[f_j] = 0$, then $w_j$ will tend toward $-\infty$.

Quadratic regularizer: add $\|w\|^2 / (2c)$ to the loss. It can be interpreted as a 0-mean, c-variance, diagonal Gaussian prior on w, giving a maximum a posteriori analog for the M-estimator.
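Written out (a reconstruction combining the slide-6 loss with the penalty just described), the regularized objective is:

$$ \hat{w} = \arg\min_w \; \frac{1}{n} \sum_{i=1}^n \exp\left(-w^\top f(x_i, y_i)\right) + \mathbf{E}_{q_0}\left[w^\top f\right] + \frac{\|w\|^2}{2c} $$

The resulting $w/c$ term in the gradient pulls each weight toward the prior mean of 0, blocking the divergence described above.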

SLIDE 18

Experiments

  • Data: CoNLL-2000 shallow parsing dataset
  • Task: NP-chunking (by B-I-O labeling)
  • Baseline/q0: smoothed MLE trigram HMM; each B-I-O label emits its word and tag separately
  • Quadratic regularization for the log-linear models, with c selected on held-out data

SLIDE 19

B-I-O Example

Profits  of  franchises  have  n't  been  higher  since  the  mid-1970s
NNS      IN  NNS         VB    RB   VBN   JJR     IN     DT   NNS
B        O   B           O     O    O     O       O      B    I
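To make the encoding concrete, here is a small sketch (mine, not from the talk) that recovers NP chunk spans from a B-I-O label sequence like the one above:

```python
def bio_to_chunks(labels):
    """Convert B-I-O labels to (start, end) NP spans, end exclusive.

    'B' begins a chunk, 'I' continues it, 'O' is outside any chunk.
    """
    chunks, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B":
            if start is not None:
                chunks.append((start, i))
            start = i
        elif lab == "O":
            if start is not None:
                chunks.append((start, i))
            start = None
        # lab == "I": continue the open chunk (assumes well-formed input)
    if start is not None:
        chunks.append((start, len(labels)))
    return chunks

# The example above: [Profits] of [franchises] ... [the mid-1970s]
print(bio_to_chunks(list("BOBOOOOOBI")))  # [(0, 1), (2, 3), (8, 10)]
```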

SLIDE 20

Experiments

model    time (h:m:s)   precision   recall   F1
HMM      0:00:02        85.6        88.7     87.1
M-est.   1:01:37        88.9        90.4     89.6
MEMM     3:39:52        90.9        92.2     91.5
PL       9:34:52        91.9        91.8     91.8
CRF      64:18:24       94.0        93.7     93.9

(rich features, Sha & Pereira '03)

SLIDE 21

Accuracy, Training Time, and c

[Plot: accuracy and training time as the regularization constant c varies; under-regularization hurts.]

SLIDE 22

Generative/Discriminative vs. Features

[Plot: generative vs. discriminative training as features are enriched; the combined gain is more than additive.]

SLIDE 23

18 Minutes Are Not Enough

  • See the paper:

    – q0 experiments
    – a negative result: an attempt to "make it discriminative"

  • WSJ section 22 dependency parsing:

    – generative baseline/q0 (≈ Klein & Manning '03)
    – 85.2% → 86.4%
    – 2 million → 3 million features (≈ McDonald et al. '05)
    – 4 hours of training per value of c

SLIDE 24

Ongoing & Future Work

  • Discriminative training works better but takes longer.

    – Cases where discriminative training may be too expensive:

      • high-complexity inference (parsing)
      • n is very large (MT?)

    – Is there an efficient estimator like this for the conditional distribution?

  • Hidden variables increase complexity, too.

    – Use the M-estimator for the M step of EM?
    – Is there an efficient estimator like this that handles hidden variables?

SLIDE 25

Conclusion

  • M-estimation is:

    – fast to train (no partition functions)
    – easy to implement
    – statistically consistent
    – feature-empowered (like CRFs)
    – generative

A new point on the spectrum of speed/accuracy/expressiveness tradeoffs.

[Diagram: models placed along runtime and accuracy axes.]

SLIDE 26

Thanks!

SLIDE 27

How important is the choice of q0?

  • MAP-trained HMM
  • Empirical marginal
  • Locally uniform model:

    – uniform transitions
    – no temporal effects
    – 0% precision, recall

[Diagram: B, I, and O states with 3, 4, and 4 uniform out-arcs, respectively.]

SLIDE 28

q0 Experiments

q0                            select c to maximize   precision   recall   F1
baseline HMM (no M-est.)      —                      85.6        88.7     87.1
HMM                           F1                     88.9        90.4     89.6
empirical marginal            F1                     84.4        89.4     86.8
locally uniform transitions   F1                     72.9        57.6     64.3
locally uniform transitions   precision              84.4        37.7     52.1

SLIDE 29

Negative Result: Input-Only Features

Idea: make the M-estimator "more discriminative" by including features of words/tags only.

  • Think of the model in two parts: a marginal over inputs and a conditional over outputs given inputs. Input-only features improve the fit of the input marginal, letting it do more of the "explanatory work."

→ Virtually no effect.