Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions
Kevin Gimpel and Noah A. Smith
[Venn diagram: training methods (Perceptron, Boosting, Minimum Error Rate Training, Conditional Likelihood, Latent Variable Conditional Likelihood, Max-Margin, MIRA, Risk) grouped by three properties: uses a cost function, is convex, and based on probabilistic inference. Softmax-Margin and the Jensen Risk Bound are then added to the diagram, with softmax-margin in the intersection of all three properties.]
Linear Models for Structured Prediction
Predict the highest-scoring output:

$$\hat{y} = \operatorname*{argmax}_{y' \in \mathcal{Y}(x)} \theta^\top f(x, y')$$

where $x$ is the input, $y'$ ranges over outputs in $\mathcal{Y}(x)$, $\theta$ is the weight vector, and $f$ is the feature vector function.

For a probabilistic interpretation, exponentiate and normalize:

$$p_\theta(y \mid x) = \frac{\exp\{\theta^\top f(x, y)\}}{\sum_{y' \in \mathcal{Y}(x)} \exp\{\theta^\top f(x, y')\}}$$
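As a concrete illustration (not from the slides), here is a minimal sketch of both views with a toy, fully enumerable output space; the feature function, weights, and input are invented for the example:

```python
import math

theta = [1.0, -0.5, 2.0]  # toy weight vector

def f(x, y):
    # Hypothetical feature vector for illustration only.
    return [x * y, float(y == 2), 1.0]

def score(x, y):
    # Linear score theta^T f(x, y).
    return sum(t * v for t, v in zip(theta, f(x, y)))

x, outputs = 1.5, [0, 1, 2]  # Y(x): a small enumerable output space

# Decoding view: predict the highest-scoring output.
y_hat = max(outputs, key=lambda y: score(x, y))

# Probabilistic view: exponentiate and normalize.
Z = sum(math.exp(score(x, y)) for y in outputs)
p = {y: math.exp(score(x, y)) / Z for y in outputs}
print(y_hat, p)
```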
Training

Standard approach is to maximize conditional likelihood:

$$\min_\theta \sum_{i=1}^{n} \Bigl( -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\bigl\{\theta^\top f(x^{(i)}, y)\bigr\} \Bigr)$$

Another approach maximizes margin (Taskar et al., 2003):

$$\min_\theta \sum_{i=1}^{n} \Bigl( -\theta^\top f(x^{(i)}, y^{(i)}) + \max_{y \in \mathcal{Y}(x^{(i)})} \bigl\{\theta^\top f(x^{(i)}, y) + \mathrm{cost}(y^{(i)}, y)\bigr\} \Bigr)$$

Here $\mathrm{cost}(y^{(i)}, y)$ is a task-specific cost function, and the inner max is "cost-augmented decoding".

Softmax-margin: replace "max" with "softmax" (Sha and Saul, 2006; Povey et al., 2008):

$$\min_\theta \sum_{i=1}^{n} \Bigl( -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\bigl\{\theta^\top f(x^{(i)}, y) + \mathrm{cost}(y^{(i)}, y)\bigr\} \Bigr)$$

The inner log-sum-exp is "cost-augmented summing".
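The following sketch (mine, not the authors') computes the three per-example objectives above by brute-force enumeration over the toy output space from the previous sketch; the 0/1 `cost` function is a placeholder:

```python
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def losses(score, outputs, y_gold, cost):
    s_gold = score(y_gold)
    # Conditional likelihood: -score(gold) + log sum_y exp{score(y)}
    cll = -s_gold + logsumexp([score(y) for y in outputs])
    # Max-margin: -score(gold) + max_y {score(y) + cost(gold, y)}
    # (the inner max is cost-augmented decoding)
    mm = -s_gold + max(score(y) + cost(y_gold, y) for y in outputs)
    # Softmax-margin: replace the max with a log-sum-exp
    # (the inner sum is cost-augmented summing)
    sm = -s_gold + logsumexp([score(y) + cost(y_gold, y) for y in outputs])
    return cll, mm, sm

score = lambda y: {0: 2.0, 1: 3.5, 2: 4.5}[y]   # toy scores from the sketch above
cost = lambda y_gold, y: 0.0 if y == y_gold else 1.0
print(losses(score, [0, 1, 2], y_gold=2, cost=cost))
```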
Properties of Softmax-Margin

- Has a probabilistic interpretation in the minimum divergence framework (Jelinek, 1997); details in the technical report
- Is a bound on:
  - Max-margin (because "softmax" bounds "max")
  - Conditional likelihood
  - Risk
Risk?

Risk is the expected value of the cost function (Smith and Eisner, 2006; Li and Eisner, 2009):

$$\min_\theta \sum_{i=1}^{n} \sum_{y \in \mathcal{Y}(x^{(i)})} p_\theta(y \mid x^{(i)}) \, \mathrm{cost}(y^{(i)}, y)$$
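A minimal sketch of this definition, reusing the toy scores from above (the gold output and the 0/1 cost are invented for illustration):

```python
import math

scores = {0: 2.0, 1: 3.5, 2: 4.5}                # toy scores; gold output is 2
Z = sum(math.exp(s) for s in scores.values())
p = {y: math.exp(s) / Z for y, s in scores.items()}
cost = lambda y: 0.0 if y == 2 else 1.0
risk = sum(p[y] * cost(y) for y in scores)       # sum_y p(y | x) cost(y_gold, y)
print(risk)
```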
Bounding Conditional Likelihood and Risk

Softmax-margin:

$$\sum_{i=1}^{n} \Bigl( -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_{y \in \mathcal{Y}(x^{(i)})} \exp\bigl\{\theta^\top f(x^{(i)}, y) + \mathrm{cost}(y^{(i)}, y)\bigr\} \Bigr)$$

$$= \underbrace{\sum_{i=1}^{n} \Bigl( -\theta^\top f(x^{(i)}, y^{(i)}) + \log \sum_{y} \exp\bigl\{\theta^\top f(x^{(i)}, y)\bigr\} \Bigr)}_{\text{conditional likelihood}} \;+\; \underbrace{\sum_{i=1}^{n} \log \sum_{y} p_\theta(y \mid x^{(i)}) \exp\bigl\{\mathrm{cost}(y^{(i)}, y)\bigr\}}_{\text{bound on risk via Jensen's inequality}}$$

Softmax-margin is a convex bound on max-margin, conditional likelihood, and risk.

Jensen Risk Bound: the second term on its own; easier to optimize than risk (cf. Li and Eisner, 2009):

$$\min_\theta \sum_{i=1}^{n} \log \sum_{y \in \mathcal{Y}(x^{(i)})} p_\theta(y \mid x^{(i)}) \, \exp\bigl\{\mathrm{cost}(y^{(i)}, y)\bigr\}$$
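The sketch below, under the same toy assumptions as the earlier snippets, numerically checks both claims: softmax-margin decomposes exactly into conditional likelihood plus the Jensen risk bound, and the Jensen risk bound dominates risk:

```python
import math

scores = {0: 2.0, 1: 3.5, 2: 4.5}                # toy scores; gold output is 2
cost = lambda y: 0.0 if y == 2 else 1.0

def lse(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

cll = -scores[2] + lse(list(scores.values()))
sm = -scores[2] + lse([s + cost(y) for y, s in scores.items()])
Z = sum(math.exp(s) for s in scores.values())
p = {y: math.exp(s) / Z for y, s in scores.items()}
jrb = math.log(sum(p[y] * math.exp(cost(y)) for y in scores))   # log E[e^cost]
risk = sum(p[y] * cost(y) for y in scores)                      # E[cost]

assert abs(sm - (cll + jrb)) < 1e-9   # softmax-margin = CLL + Jensen risk bound
assert jrb >= risk                    # Jensen's inequality: log E[e^C] >= E[C]
```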
Implementation

Conditional likelihood → Softmax-margin

If the cost function factors the same way as the features, it's easy (a sketch follows below):
- Add additional features for the cost function
- Keep their weights fixed

If not, use a simpler cost function or use approximate inference.
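A minimal sketch of this recipe, assuming a per-position Hamming cost and features that factor over positions (as in a linear-chain CRF); the names `COST_WEIGHT` and `cost_augmented_score` are mine:

```python
COST_WEIGHT = 1.0  # weight of the cost features, clamped (never updated)

def hamming_cost(y_gold, y):
    # Hamming cost factors over positions, just like local CRF features.
    return sum(1.0 for a, b in zip(y_gold, y) if a != b)

def cost_augmented_score(score, y_gold, y):
    # score(y) is the model's usual linear score theta^T f(x, y);
    # adding the fixed-weight cost term turns decoding/summing into their
    # cost-augmented versions, reusing the same dynamic program.
    return score(y) + COST_WEIGHT * hamming_cost(y_gold, y)

print(cost_augmented_score(lambda y: 0.0, ["B-PER", "O"], ["O", "O"]))  # -> 1.0
```

Because the cost decomposes per position, the same forward algorithm that computes the partition function computes the cost-augmented sum.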
Experiments

English named-entity recognition (CoNLL 2003). Compared softmax-margin and the Jensen risk bound with five baselines:
- Perceptron (Collins, 2002)
- 1-best MIRA with cost-augmented decoding (Crammer et al., 2006)
- Max-margin via subgradient descent (Ratliff et al., 2006)
- Conditional likelihood (Lafferty et al., 2001)
- Risk (Xiong et al., 2009)

For risk and the Jensen risk bound, initialized using the output of conditional likelihood training. Used Hamming cost for the cost function.
Results

Method                   Test F1
----------------------   -------
Perceptron               83.98*
MIRA                     85.72
Max-Margin               85.28*
Conditional Likelihood   85.46*
Risk                     85.59
Jensen Risk Bound        85.65
Softmax-Margin           85.84

* indicates a statistically significant difference compared with softmax-margin

Softmax-margin gives a significant improvement over conditional likelihood with equal training time and implementation difficulty; the Jensen risk bound gives performance comparable to risk with half the training time.
[Venn diagram repeated: Softmax-Margin and the Jensen Risk Bound now shown alongside Risk, Conditional Likelihood, Max-Margin, MIRA, and Perceptron, grouped by the same three properties.]

[Plot: the same methods arranged by training time vs. performance, annotated with what each requires during training: decoding (Perceptron), cost-augmented decoding (MIRA, Max-Margin), summing (Conditional Likelihood), cost-augmented summing (Softmax-Margin, Jensen Risk Bound), and expectations of products (Risk).]
See the extended technical report for:
- Probabilistic interpretation for softmax-margin in the minimum divergence framework (Jelinek, 1997)
- Softmax-margin training with hidden variables
- Additional experiments

Thank you!
Loss Functions for Binary Classification

[Plot: loss functions for binary classification.]
Training Method          Requirements               Cost Function   Convex   Prob. Interp.
----------------------   ------------------------   -------------   ------   -------------
Perceptron               decoding                   no              yes      no
MIRA                     cost-augmented decoding    yes             yes      no
Max-Margin               cost-augmented decoding    yes             yes      no
Conditional Likelihood   summing                    no              yes      yes
Risk                     expectations of products   yes             no       yes
Jensen Risk Bound        cost-augmented summing     yes             no       yes
Softmax-Margin           cost-augmented summing     yes             yes      yes