Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions
Kevin Gimpel and Noah A. Smith
(PowerPoint presentation)


SLIDE 1

Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions

Kevin Gimpel and Noah A. Smith

SLIDE 2

[Venn diagram: training methods (Risk, Perceptron, Minimum Error Rate Training, Conditional Likelihood, Max-Margin, MIRA, Boosting, Latent Variable Conditional Likelihood) classified by three properties: uses a cost function; is convex; based on probabilistic inference]

SLIDE 3

[Same Venn diagram, now adding Softmax-Margin in the region with all three properties: uses a cost function, is convex, based on probabilistic inference]

SLIDE 4

[Same Venn diagram, now also adding the Jensen Risk Bound alongside Softmax-Margin]

SLIDE 5

Linear Models for Structured Prediction

Decoding selects the highest-scoring output:

    ŷ = argmax_{y′ ∈ Y(x)} θ⊤f(x, y′)

where θ is the weight vector, f the feature function, x the input, and y′ ranges over the candidate outputs.

For a probabilistic interpretation, exponentiate and normalize:

    p_θ(y | x) = exp{θ⊤f(x, y)} / Σ_{y′ ∈ Y(x)} exp{θ⊤f(x, y′)}
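For a toy problem with a small, enumerable output set, the decoding rule and the probabilistic interpretation can be sketched in Python (the weights, features, and output names below are invented for illustration, not from the slides):

```python
import math

# Toy model: three candidate outputs, each with a feature vector f(x, y).
theta = [0.5, -1.0, 2.0]                      # weights
feats = {
    "y1": [1.0, 0.0, 1.0],
    "y2": [0.0, 1.0, 0.0],
    "y3": [1.0, 1.0, 1.0],
}

def score(y):
    """Linear score theta^T f(x, y)."""
    return sum(t * v for t, v in zip(theta, feats[y]))

# Decoding: pick the highest-scoring output.
y_hat = max(feats, key=score)                 # -> "y1"

# Probabilistic interpretation: exponentiate and normalize.
Z = sum(math.exp(score(y)) for y in feats)
p = {y: math.exp(score(y)) / Z for y in feats}
```

`max` over a dict iterates its keys, so `y_hat` is the output with the largest score, and `p` sums to one by construction.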
SLIDE 6

Training

Standard approach is to maximize conditional likelihood, i.e., minimize the log-loss:

    min_θ Σ_i ( −θ⊤f(x_i, y_i) + log Σ_{y ∈ Y(x_i)} exp{θ⊤f(x_i, y)} )

Another approach maximizes margin (Taskar et al., 2003):

    min_θ Σ_i ( −θ⊤f(x_i, y_i) + max_{y ∈ Y(x_i)} { θ⊤f(x_i, y) + cost(y_i, y) } )

where cost(y_i, y) is a task-specific cost function.

SLIDE 7

[Same objectives as slide 6; the highlighted max over outputs with the cost term added to the score is "cost-augmented decoding"]

SLIDE 8

Training

Softmax-margin: replace the "max" in the margin objective with "softmax" (log-sum-exp):

    min_θ Σ_i ( −θ⊤f(x_i, y_i) + log Σ_{y ∈ Y(x_i)} exp{θ⊤f(x_i, y) + cost(y_i, y)} )

The highlighted sum over outputs with the cost term added is "cost-augmented summing".

SLIDE 9

[Same softmax-margin objective as slide 8]

Softmax-margin has appeared before: Sha and Saul (2006), Povey et al. (2008)
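With an enumerable output set, the three objectives can be evaluated side by side. This sketch (invented weights, features, and costs, not from the paper) computes all three losses for one training example and illustrates that softmax-margin upper-bounds both the conditional likelihood loss (since the cost is nonnegative) and the margin loss (since softmax bounds max):

```python
import math

theta = [0.5, -1.0, 2.0]
feats = {"y1": [1.0, 0.0, 1.0], "y2": [0.0, 1.0, 0.0], "y3": [1.0, 1.0, 1.0]}
gold = "y1"
cost = {"y1": 0.0, "y2": 2.0, "y3": 1.0}      # cost(y_gold, y), zero at the gold output

def score(y):
    return sum(t * v for t, v in zip(theta, feats[y]))

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

# Conditional likelihood loss: -theta^T f(gold) + log sum_y exp(theta^T f(y))
cl = -score(gold) + logsumexp([score(y) for y in feats])

# Max-margin (hinge) loss: -theta^T f(gold) + max_y (theta^T f(y) + cost(y))
mm = -score(gold) + max(score(y) + cost[y] for y in feats)

# Softmax-margin: replace the max with log-sum-exp ("cost-augmented summing").
sm = -score(gold) + logsumexp([score(y) + cost[y] for y in feats])
```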

SLIDE 10

Properties of Softmax-Margin

  • Has a probabilistic interpretation in the minimum divergence framework (Jelinek, 1997); details in technical report
  • Is a bound on: max-margin, conditional likelihood, risk

SLIDE 11

Properties of Softmax-Margin

  • Has a probabilistic interpretation in the minimum divergence framework (Jelinek, 1997); details in technical report
  • Is a bound on:
      max-margin (because "softmax" bounds "max")
      conditional likelihood
      risk

SLIDE 12

Risk?

Risk is the expected value of the cost function under the model distribution (Smith and Eisner, 2006; Li and Eisner, 2009):

    min_θ Σ_i Σ_{y ∈ Y(x_i)} p_θ(y | x_i) cost(y_i, y)
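When the output set is small enough to enumerate, risk is a single expectation; a toy sketch (same invented model as before, not from the slides):

```python
import math

theta = [0.5, -1.0, 2.0]
feats = {"y1": [1.0, 0.0, 1.0], "y2": [0.0, 1.0, 0.0], "y3": [1.0, 1.0, 1.0]}
cost = {"y1": 0.0, "y2": 2.0, "y3": 1.0}      # cost(y_gold, y) for gold output y1

def score(y):
    return sum(t * v for t, v in zip(theta, feats[y]))

# Model distribution p(y | x): exponentiate and normalize.
Z = sum(math.exp(score(y)) for y in feats)
p = {y: math.exp(score(y)) / Z for y in feats}

# Risk: expected cost under the model distribution.
risk = sum(p[y] * cost[y] for y in feats)
```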

SLIDE 13

Bounding Conditional Likelihood and Risk

Softmax-margin:

    −θ⊤f(x_i, y_i) + log Σ_y exp{θ⊤f(x_i, y) + cost(y_i, y)}

splits into two nonnegative terms:

    ( −θ⊤f(x_i, y_i) + log Σ_y exp{θ⊤f(x_i, y)} ) + log Σ_y p_θ(y | x_i) exp{cost(y_i, y)}

The first term is the conditional likelihood loss; the second is a bound on risk via Jensen's inequality.

SLIDE 14

Bounding Conditional Likelihood and Risk

[Same decomposition as slide 13: conditional likelihood term plus a Jensen bound on risk]

Softmax-margin is a convex bound on max-margin, conditional likelihood, and risk

SLIDE 15

Bounding Conditional Likelihood and Risk

[Same decomposition as slide 13]

The second term,

    log Σ_y exp{θ⊤f(x_i, y) + cost(y_i, y)} − log Σ_y exp{θ⊤f(x_i, y)},

is an objective in its own right: the Jensen Risk Bound, easier to optimize than risk (cf. Li and Eisner, 2009)
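The decomposition and the Jensen bound can be checked numerically on a toy example (all values invented): the softmax-margin loss equals the conditional likelihood loss plus the Jensen risk bound term, and that term upper-bounds risk because log E[e^C] >= E[C] for a nonnegative cost C:

```python
import math

theta = [0.5, -1.0, 2.0]
feats = {"y1": [1.0, 0.0, 1.0], "y2": [0.0, 1.0, 0.0], "y3": [1.0, 1.0, 1.0]}
gold = "y1"
cost = {"y1": 0.0, "y2": 2.0, "y3": 1.0}

def score(y):
    return sum(t * v for t, v in zip(theta, feats[y]))

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

log_Z = logsumexp([score(y) for y in feats])
p = {y: math.exp(score(y) - log_Z) for y in feats}

cl = -score(gold) + log_Z                                            # conditional likelihood loss
sm = -score(gold) + logsumexp([score(y) + cost[y] for y in feats])   # softmax-margin loss
jrb = logsumexp([score(y) + cost[y] for y in feats]) - log_Z         # Jensen risk bound term
risk = sum(p[y] * cost[y] for y in feats)                            # expected cost
```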
SLIDE 16

Implementation

Conditional likelihood → Softmax-margin

If the cost function factors the same way as the features, it's easy:

  • Add additional features for the cost function
  • Keep their weights fixed

If not, use a simpler cost function or use approximate inference
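For sequence models the sums run over exponentially many outputs, but when the cost decomposes like the features, e.g. Hamming cost over positions in a chain, cost-augmented summing is just the forward algorithm with the cost folded into the local scores as fixed-weight features. A minimal sketch with invented toy scores (not the paper's model or feature set):

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(emit, trans, extra=None):
    """Forward algorithm: log of the sum over all label sequences of
    exp(emission + transition scores, plus optional per-position extras)."""
    n, L = len(emit), len(emit[0])
    add = extra if extra is not None else [[0.0] * L for _ in range(n)]
    alpha = [emit[0][y] + add[0][y] for y in range(L)]
    for t in range(1, n):
        alpha = [logsumexp([alpha[yp] + trans[yp][y] for yp in range(L)])
                 + emit[t][y] + add[t][y] for y in range(L)]
    return logsumexp(alpha)

def softmax_margin_loss(emit, trans, gold):
    """Softmax-margin loss for one chain example with Hamming cost:
    the cost decomposes per position, so it is added to the local
    scores like an extra feature whose weight is held fixed at 1."""
    n, L = len(emit), len(emit[0])
    hamming = [[0.0 if y == gold[t] else 1.0 for y in range(L)] for t in range(n)]
    gold_score = (sum(emit[t][gold[t]] for t in range(n))
                  + sum(trans[gold[t - 1]][gold[t]] for t in range(1, n)))
    return -gold_score + log_partition(emit, trans, hamming)

# Toy 3-position, 2-label example.
emit = [[1.0, 0.2], [0.1, 1.5], [0.7, 0.3]]
trans = [[0.5, -0.2], [0.1, 0.4]]
gold = [0, 1, 0]
loss = softmax_margin_loss(emit, trans, gold)
```

Passing all-zero extras (the default) recovers the conditional likelihood loss, so the softmax-margin value is always at least as large.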

SLIDE 17

Experiments

English named-entity recognition (CoNLL 2003)

Compared softmax-margin and Jensen risk bound with five baselines:

  • Perceptron (Collins, 2002)
  • 1-best MIRA with cost-augmented decoding (Crammer et al., 2006)
  • Max-margin via subgradient descent (Ratliff et al., 2006)
  • Conditional likelihood (Lafferty et al., 2001)
  • Risk (Xiong et al., 2009)

For risk and Jensen risk bound, initialized using the output of conditional likelihood training

Used Hamming cost for the cost function

SLIDE 18

Results

    Method                  Test F1
    Perceptron              83.98*
    MIRA                    85.72*
    Max-Margin              85.28*
    Conditional Likelihood  85.46*
    Risk                    85.59*
    Jensen Risk Bound       85.65*
    Softmax-Margin          85.84

* indicates significance (compared with softmax-margin)

SLIDE 19

Results

[Same results table as slide 18]

Significant improvement with equal training time and implementation difficulty

SLIDE 20

Results

[Same results table as slide 18]

Comparable performance with half the training time

SLIDE 21

[Venn diagram recap: Risk, Conditional Likelihood, Max-Margin, MIRA, Perceptron, Softmax-Margin, and Jensen Risk Bound, classified by three properties: uses a cost function; is convex; based on probabilistic inference]

SLIDE 22

[Plot: training time vs. performance for Perceptron, Risk, Conditional Likelihood, Max-Margin, MIRA, Softmax-Margin, and Jensen Risk Bound]

SLIDE 23

[Same time vs. performance plot, annotated with each method's inference requirements: (cost-augmented) decoding, (cost-augmented) summing, or expectations of feature-cost products]
SLIDE 24

See extended technical report for:

  • Probabilistic interpretation for softmax-margin in the minimum divergence framework (Jelinek, 1997)
  • Softmax-margin training with hidden variables
  • Additional experiments

Thank you!

SLIDE 25

[Figure: loss functions for binary classification plotted against the margin; curve labels not recoverable]

SLIDE 26

Summary of training methods (the original slide also marked Cost Function, Convex, and Prob. Interp. for each method):

    Training Method         Requirements
    Perceptron              decoding
    MIRA                    cost-augmented decoding
    Max-Margin              cost-augmented decoding
    Conditional Likelihood  summing
    Risk                    expectations of products
    Jensen Risk Bound       cost-augmented summing
    Softmax-Margin          cost-augmented summing