

SLIDE 1

Performance Prediction and Shrinking Language Models

Stanley F. Chen†

IBM T.J. Watson Research Center Yorktown Heights, New York, USA

27 June 2011

†Joint work with Stephen Chu, Ahmad Emami, Lidia Mangu,

Bhuvana Ramabhadran, Ruhi Sarikaya, and Abhinav Sethy.

Stanley F. Chen (IBM) Performance Prediction 27 June 2011 1 / 41

SLIDE 2

What Does a Good Model Look Like?

(test error) ≡ (training error) + (overfit)

SLIDE 3

Overfitting: Theory

e.g., Akaike Information Criterion (1973):

    −(test LL) ≈ −(train LL) + (# params)

e.g., structural risk minimization (Vapnik, 1974):

    (test err) ≤ (train err) + f(VC dimension)

Down with big models!?

SLIDE 4

The Big Idea

Maybe overfit doesn’t act like we think it does. Let’s try to fit overfit empirically.

SLIDE 5

What This Talk Is About

An empirical estimate of the overfit in log likelihood of . . .
Exponential language models . . .
That is really simple and works really well.
Why it works.
What you can do with it.

SLIDE 6

Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion

SLIDE 7

Exponential N-Gram Language Models

Language model: predict the next word given the previous, say, two words.

    P(y = ate | x = the cat)

Log-linear model: features f_i(·); parameters λ_i.

    P(y|x) = exp(Σ_i λ_i f_i(x, y)) / Z_Λ(x)

A binary feature f_i(·) for each n-gram in the training set.
An alternative parameterization of back-off n-gram models.
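The log-linear form above can be sketched in a few lines of Python. This is a toy illustration, not code from the talk: the vocabulary, the parameter values, and the choice of a bigram feature `(x, y)` plus a unigram "backoff" feature `(y,)` are all made up for the example.

```python
import math

# Toy exponential bigram model: one binary feature per n-gram.
# A feature (x, y) fires when the history is x and the predicted word is y;
# a feature (y,) fires whenever y is predicted (a backoff-style feature).
vocab = ["ate", "sat", "ran"]
lambdas = {("the cat", "ate"): 1.2, ("the cat", "sat"): 0.4, ("ate",): 0.3}

def score(x, y):
    """Sum of parameters lambda_i for all features active on (x, y)."""
    return lambdas.get((x, y), 0.0) + lambdas.get((y,), 0.0)

def prob(x, y):
    """P(y | x) = exp(score(x, y)) / Z(x), normalized over the vocabulary."""
    z = sum(math.exp(score(x, w)) for w in vocab)  # normalizer Z(x)
    return math.exp(score(x, y)) / z

# Probabilities over the vocabulary sum to 1 by construction.
assert abs(sum(prob("the cat", w) for w in vocab) - 1.0) < 1e-9
```

Because each feature is binary and tied to one n-gram, the score of a word is just the sum of the parameters of the n-grams it completes, which is what makes this an alternative parameterization of back-off n-gram models.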

SLIDE 8

Details: Regression

Build hundreds of (regularized!) language models.
Compute the actual overfit: log likelihood (LL) per event = log PP.
Calculate lots of statistics for each model (F = # parameters; D = # training events):

    F/D;  (F log D)/D;  (1/D) Σ_i λ_i;  (1/D) Σ_i λ_i²;  (1/D) Σ_i |λ_i|^{4/3};  . . .

Do linear regression!
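The candidate statistics on this slide are cheap to compute. A minimal sketch, with made-up parameter values (the function name and the exact set of statistics are illustrative, not from the talk):

```python
import math

def candidate_statistics(lambdas, D):
    """Per-model statistics to regress against observed overfit.

    lambdas: list of model parameters lambda_i; D: number of training events.
    """
    F = len(lambdas)  # number of parameters
    return {
        "F/D": F / D,
        "F log D / D": F * math.log(D) / D,
        "sum lambda / D": sum(l for l in lambdas) / D,
        "sum lambda^2 / D": sum(l * l for l in lambdas) / D,
        "sum |lambda|^(4/3) / D": sum(abs(l) ** (4 / 3) for l in lambdas) / D,
        "sum |lambda| / D": sum(abs(l) for l in lambdas) / D,
    }

stats = candidate_statistics([1.0, -2.0, 0.5], 100)
```

Each model contributes one row of such statistics plus its observed overfit; the regression then picks out which statistic predicts overfit best.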

SLIDE 9

What Doesn’t Work? AIC-like Prediction

(overfit) ≡ LLtest − LLtrain ≈ γ · (# params) / (# train evs)

[Scatter plot: predicted vs. actual overfit (both axes 1–6).]

SLIDE 10

What Doesn’t Work? BIC-like Prediction

LLtest − LLtrain ≈ γ · (# params) · log(# train evs) / (# train evs)

[Scatter plot: predicted vs. actual overfit (both axes 1–6).]

SLIDE 11

What Does Work? (r = 0.9996)

LLtest − LLtrain ≈ (γ / (# train evs)) · Σ_{i=1}^{F} |λ_i|

[Scatter plot: predicted vs. actual overfit (both axes 1–6).]
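Fitting the single constant γ in this law is ordinary least squares through the origin. A small sketch with synthetic data (the data points and the value 0.938 used to generate them are illustrative; only the functional form comes from the slide):

```python
def fit_gamma(models):
    """Fit overfit ≈ gamma * (sum_i |lambda_i|) / D, least squares through the origin.

    models: list of (lambdas, D, observed_overfit) triples.
    """
    xs = [sum(abs(l) for l in lams) / D for lams, D, _ in models]
    ys = [overfit for _, _, overfit in models]
    # Closed-form slope for a no-intercept regression: sum(x*y) / sum(x*x).
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Synthetic models whose overfit was generated with gamma = 0.938:
data = [([1.0, -0.5, 2.0], 100, 0.938 * 3.5 / 100),
        ([0.2, 0.2], 50, 0.938 * 0.4 / 50)]
assert abs(fit_gamma(data) - 0.938) < 1e-9
```

On real models the fit is of course not exact; the slide's r = 0.9996 is the observed correlation across hundreds of models.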

SLIDE 12

γ = 0.938

Holds for many different types of data:
    Different domains (e.g., Wall Street Journal, . . . ).
    Different token types (letters, parts of speech, words).
    Different vocabulary sizes (27–84,000 words).
    Different training set sizes (100–100,000 sentences).
    Different n-gram orders (2–7).
Holds for many different types of exponential models:
    Word n-gram models; class-based n-gram models; minimum discrimination information models.

SLIDE 13

What About Other Languages?

LLtest − LLtrain ≈ (0.938 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

[Scatter plot: predicted vs. actual overfit (both axes 1–9) for Iraqi, Spanish, German, and Turkish data.]

SLIDE 14

What About Genetic Data?

LLtest − LLtrain ≈ (0.938 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

[Scatter plot: predicted vs. actual overfit (both axes 0.5–2) for rice, chicken, and human genetic data.]

SLIDE 15

Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion

SLIDE 16

Regularization

Improves test set performance.
ℓ1, ℓ2², and ℓ1 + ℓ2² regularization: choose λ_i to minimize

    (obj fn) ≡ LLtrain + α · Σ_{i=1}^{F} |λ_i| + (1/(2σ²)) · Σ_{i=1}^{F} λ_i²

The problem: γ depends on α, σ!
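The regularized objective is easy to write down directly. A minimal sketch, assuming LLtrain denotes the training log PP (a per-event negative log-likelihood, so smaller is better); the function name and example numbers are made up:

```python
def objective(ll_train, lambdas, alpha=0.5, sigma2=6.0):
    """l1 + l2^2 regularized objective to minimize:

        (obj fn) = LL_train + alpha * sum_i |lambda_i| + (1 / (2 sigma^2)) * sum_i lambda_i^2

    Setting alpha = 0 gives pure l2^2 regularization; sigma2 -> infinity gives pure l1.
    """
    l1 = sum(abs(l) for l in lambdas)
    l2_sq = sum(l * l for l in lambdas)
    return ll_train + alpha * l1 + l2_sq / (2.0 * sigma2)

# Shrinking a parameter trades a worse training fit for a smaller penalty.
val = objective(2.0, [1.0, -2.0])
```

The defaults α = 0.5 and σ² = 6 are the single setting the talk uses across all models.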

SLIDE 17

Regularization: Two Criteria

Here: pick a single α, σ across all models.
Usual way: pick α, σ per model for good performance.
Good performance and good overfit prediction?

                  performance    overfit prediction
    ℓ1                                  ✓
    ℓ2²                ✓
    ℓ1 + ℓ2²           ✓                ✓

(α = 0.5, σ² = 6) as good as the best n-gram smoothing.

SLIDE 18

The Law and ℓ1 + ℓ2² Regularization

LLtest − LLtrain ≈ (0.938 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

[Scatter plot: predicted vs. actual overfit (both axes 1–6).]

SLIDE 19

The Law and ℓ2² Regularization

LLtest − LLtrain ≈ (0.882 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

[Scatter plot: predicted vs. actual overfit (both axes 1–6).]

SLIDE 20

Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion

SLIDE 21

Why Exponential Models Are Special

Do some math (and include normalization features):

    LLtest − LLtrain = (1 / (# train evs)) · Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

Compare this to The Law:

    LLtest − LLtrain ≈ (1 / (# train evs)) · Σ_{i=1}^{F} |λ_i| × 0.938

If only . . .

    (discount of f_i(·)) ≈ 0.938 × sgn λ_i

SLIDE 22

What Are Discounts?

How many times fewer an n-gram occurs in the test set . . .
. . . compared to the training set (scaled to equal length).
Studied extensively in language model smoothing.
Let's look at the data.
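A discount in this sense can be computed directly from counts. A minimal sketch, assuming the training count is rescaled by token length to make the two sets comparable (the function name and the toy token lists are made up):

```python
def discount(ngram, train_tokens, test_tokens):
    """How many times fewer `ngram` occurs in the test set than in the
    training set, after rescaling the training count to the test-set length."""
    n = len(ngram)

    def count(tokens):
        # Count sliding-window occurrences of the n-gram.
        return sum(tuple(tokens[i:i + n]) == ngram
                   for i in range(len(tokens) - n + 1))

    scale = len(test_tokens) / len(train_tokens)
    return count(train_tokens) * scale - count(test_tokens)

train = ["the", "cat", "ate", "the", "cat"]   # "the cat" occurs twice
test = ["the", "cat", "ran", "a", "lot"]      # "the cat" occurs once
assert discount(("the", "cat"), train, test) == 1.0
```

A positive discount means the n-gram occurs less often in test data than its training count suggests, which is the usual case for sparse features.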

SLIDE 23

Smoothed Discount Per Feature

(discount of f_i(·)) ≈? 0.938 × sgn λ_i

[Plot: smoothed discount per feature vs. λ, for very sparse, sparse, less sparse, and dense models.]

SLIDE 24

Why The Law Holds More Than It Should

Sparse models all act alike.
Dense models don't overfit much.

    LLtest − LLtrain ≈ (0.938 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

SLIDE 25

Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion

SLIDE 26

Explain Things

Why backoff features help.
Why word class features help.
Why domain adaptation helps.
Why increasing n doesn't hurt.
Why relative performance differences shrink with more data.

SLIDE 27

Make Models Better

(test error) ≈ (training error) + (overfit)

Decrease overfit ⇒ decrease test error.

SLIDE 28

Reducing Overfitting

(overfit) ≈ (0.938 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

In practice, the number of features matters not!
More features lead to less overfitting . . .
. . . if the sum of the parameter magnitudes decreases!

SLIDE 29

A Method for Reducing Overfitting

Before: λ1 = λ2 = 2.

    P_before(y|x) = exp(2·f1(x, y) + 2·f2(x, y)) / Z_Λ(x)

After: λ1 = λ2 = 0; λ3 = 2, where f3(x, y) = f1(x, y) + f2(x, y).

    P_after(y|x) = exp(2·f3(x, y)) / Z_Λ(x) = exp(2·f1(x, y) + 2·f2(x, y)) / Z_Λ(x)
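This identity is easy to check numerically. A toy sketch (the two-word vocabulary and feature values are made up): two features that take identical values on every event are merged into their sum, every probability stays the same, but Σ_i |λ_i| drops from 4 to 2, so the predicted overfit drops.

```python
import math

def prob(feature_vals, lambdas, y, vocab):
    """P(y|x) for a log-linear model; feature_vals[w] lists the feature values on (x, w)."""
    def score(w):
        return sum(l * f for l, f in zip(lambdas, feature_vals[w]))
    z = sum(math.exp(score(w)) for w in vocab)
    return math.exp(score(y)) / z

vocab = ["a", "b"]
# Before: f1 and f2 are identical on every event; lambda1 = lambda2 = 2.
before_feats = {"a": [1, 1], "b": [0, 0]}
# After: a single feature f3 = f1 + f2 (so f3("a") = 2); lambda3 = 2.
after_feats = {"a": [2], "b": [0]}

p_before = prob(before_feats, [2, 2], "a", vocab)
p_after = prob(after_feats, [2], "a", vocab)
assert abs(p_before - p_after) < 1e-12          # same distribution
assert sum(map(abs, [2, 2])) > sum(map(abs, [2]))  # smaller sum of |lambda_i|
```

The same trick underlies Heuristic 1: tying features whose parameters would have been similar anyway shrinks the penalty term without changing what the model can express.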

SLIDE 30

What’s the Catch? (Part I)

Same test set performance?
Re-regularizing the model improves performance even more!

    (obj fn) ≡ LLtrain + α · Σ_{i=1}^{F} |λ_i| + (1/(2σ²)) · Σ_{i=1}^{F} λ_i²

SLIDE 31

What’s the Catch? (Part II)

Select features to sum in hindsight?
When you sum features, you sum their discounts!

    LLtest − LLtrain = (1 / (# train evs)) · Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

Need to pick the features to sum a priori!

SLIDE 32

Heuristic 1: Improving Model Performance

Identify, a priori, features with similar λ_i.
Create a new feature that is the sum of the original features.

SLIDE 33

Example: N-Gram Models and Backoff

λ_{w_{j−2} w_{j−1} w_j} and λ_{w′_{j−2} w_{j−1} w_j} tend to be alike ⇒ create λ_{w_{j−1} w_j}!?

Bigram features reduce overfitting for trigram features.

[Bar chart: overfit for 3g, 2g+3g, and 1g+2g+3g models, split into 1g, 2g, and 3g parameter contributions.]

SLIDE 34

Example: N-Gram Models and Word Classes

Group related words into classes, e.g., {Monday, Tuesday, . . . }.
Add class n-gram features to address sparsity.
Problem: the space of word/class n-gram features is large:

    c_{j−2} c_{j−1} c_j;  w_{j−2} w_{j−1} c_j;  w_{j−1} c_j w_j;  . . .

Apply Heuristic 1 to the word n-gram model!

SLIDE 35

Goldilocks and the Three Class-Based LM’s

Model S:                p(c_j | c_{j−2} c_{j−1}) × p(w_j | c_j)
Model M (Heuristic 1):  p(c_j | c_{j−2} c_{j−1}, w_{j−2} w_{j−1}) × p(w_j | w_{j−2} w_{j−1}, c_j)
Model L:                p(c_j | w_{j−2} c_{j−2} w_{j−1} c_{j−1}) × p(w_j | w_{j−2} c_{j−2} w_{j−1} c_{j−1}, c_j)
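The two-factor decomposition can be sketched with toy lookup tables. This assumes Model M's standard form, p(c_j | c_{j−2}c_{j−1}, w_{j−2}w_{j−1}) × p(w_j | w_{j−2}w_{j−1}, c_j); the class assignments and all probability values below are invented for illustration, not trained distributions.

```python
# Hypothetical word-to-class map and component distributions.
word_class = {"Monday": "DAY", "Tuesday": "DAY", "ate": "VERB"}

# p(c_j | c_{j-2} c_{j-1}, w_{j-2} w_{j-1}), keyed by (class history, word history).
p_class = {(("DAY", "VERB"), ("Monday", "ate")):
           {"DAY": 0.3, "VERB": 0.1, "OTHER": 0.6}}

# p(w_j | w_{j-2} w_{j-1}, c_j), keyed by (word history, predicted class).
p_word = {(("Monday", "ate"), "DAY"): {"Monday": 0.4, "Tuesday": 0.6}}

def model_m_prob(w, word_hist, class_hist):
    """p(w | history) = p(class(w) | histories) * p(w | word history, class(w))."""
    c = word_class[w]
    return p_class[(class_hist, word_hist)][c] * p_word[(word_hist, c)][w]

# p(Tuesday | "Monday ate") = 0.3 * 0.6 = 0.18 in this toy setup.
assert abs(model_m_prob("Tuesday", ("Monday", "ate"), ("DAY", "VERB")) - 0.18) < 1e-12
```

The class factor shares statistics across words in the same class, while the word factor keeps full word-history conditioning, which is why Model M sits between S and L in expressiveness.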

SLIDE 36

This One Is Just Right!

[Bar chart for Models S, M, and L: predicted overfit, LLtest, LLtrain, and actual overfit (scale 1–6).]

SLIDE 37

Model M

Best class-based model results for speech recognition . . .
. . . over a wide range of data sets and training set sizes.
Gains of up to 3% absolute in error rate over word n-gram models.

SLIDE 38

Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion

SLIDE 39

Long Live Big Models!

(test error) ≡ (training error) + (overfit)

    (overfit) ≈ (0.938 / (# train evs)) · Σ_{i=1}^{F} |λ_i|

Despite theory, models with lots of parameters perform well!
Adding the right parameters can lower overfitting: Heuristic 1.

SLIDE 40

Applicability to Other Domains

Log likelihood vs. error rate.
Log-linear models:

    LLtest − LLtrain = (1 / (# train evs)) · Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

It's not the number of parameters . . . it's the size of the parameters!
Explain and/or enhance existing practice?
    e.g., backoff features; class-based features.
    Sometimes the space of feature types is large.

SLIDE 41

For More Details

Stanley F. Chen. 2008. Performance prediction for exponential language models. Technical Report RC 24671, IBM Research Division, October.

Stanley F. Chen, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, and Abhinav Sethy. 2009. Scaling shrinkage-based language models. In Proceedings of ASRU.

Stanley F. Chen and Stephen M. Chu. 2010. Enhanced word classing for Model M. Submitted to Interspeech.
