Language Modeling with Power Low Rank Ensembles (PowerPoint PPT presentation)
SLIDE 1

Language Modeling with Power Low Rank Ensembles

Ankur Parikh, Avneesh Saluja, Chris Dyer, Eric Xing

SLIDE 4

Overview

  • Model: A framework for language modeling using ensembles of low rank matrices and tensors
  • Relations: Includes existing n-gram smoothing techniques as special cases
  • Performance: Consistently outperforms state-of-the-art Kneser-Ney baselines for the same context length
  • Speed: Easily scalable, since no partition function is required

SLIDE 5

Introduction Background Rank Power Ensembles Experiments

Outline

  • Introduction
  • Background on n-gram smoothing
  • Our Approach
  • Rank
  • Power
  • Constructing the Ensemble
  • Experiments

SLIDE 6

Language Modeling

  • Evaluate probabilities of sentences
  • Very useful in downstream applications such as machine translation and speech recognition

$Q(x_1, \ldots, x_4) = 0.3648$ for "Linear algebra is awesome"
$Q(x_1, \ldots, x_4) = 0.1922$ for "Linear algebra is boring"

SLIDE 12

N-grams

  • Predominant approach to language modeling
  • Probabilities are estimated from corpus counts:

$Q(x_j) = \frac{\mathrm{count}(x_j)}{\sum_x \mathrm{count}(x)}$

$Q(x_j \mid x_{j-1}) = \frac{\mathrm{count}(x_j, x_{j-1})}{\mathrm{count}(x_{j-1})}$

$Q(x_j \mid x_{j-1}, x_{j-2}) = \frac{\mathrm{count}(x_j, x_{j-1}, x_{j-2})}{\mathrm{count}(x_{j-1}, x_{j-2})}$

SLIDE 22

N-gram Smoothing

  • Alleviate the data sparsity problem

$Q(x_j \mid x_{j-1}, x_{j-2}), \quad Q(x_j \mid x_{j-1}), \quad Q(x_j)$

SLIDE 28

Advantages of N-gram Models

  • "Fine-to-coarse": captures various levels of dependence
  • Very fast: O(N) test complexity
  • Low context sizes sufficient

$Q(x_j \mid x_{j-1}, x_{j-2}), \quad Q(x_j \mid x_{j-1}), \quad Q(x_j)$

SLIDE 29

Classic Disadvantage of N-gram Models

  • No notion of similarity between words
  • Seen under $Q(x_j \mid x_{j-1})$: (house, decrepit), (house, old), (house, shabby); but what probability should (house, {synonym of old}) get?

SLIDE 36

Motivation For Low Rank Methods

  • Project words to a lower-dimensional space
  • Words with similar contexts will have similar projections

[figure: a matrix β‰ˆ its low rank factorization, with nearby projections for "house", "cabin", "flat" and for "old", "shabby", "decrepit"]

SLIDE 41

Low Rank Approaches

  • Low rank approximation is successful in many ML applications
  • Collaborative filtering (Netflix)
  • Matrix completion
  • These solutions have been attempted in language modeling
  • Saul and Pereira 1997
  • Hutchinson et al. 2011
  • Unfortunately, not generally competitive with Kneser-Ney

SLIDE 45

Problem: Low Rank Methods Operate at a Fixed Granularity

  • If the rank is too small: the probability of (break, spring) gets diluted, since "break" has many synonyms

SLIDE 48

Problem: Low Rank Methods Operate at a Fixed Granularity

  • If the rank is too large: probabilities of rare words like (domicile, dilapidated) are a problem, since the representation is too fine-grained

SLIDE 51

Our Approach

  • Construct ensembles of low rank matrices/tensors to model language at multiple granularities
  • Includes existing n-gram techniques as special cases
  • Absolute discounting
  • Jelinek-Mercer (deleted interpolation)
  • Kneser-Ney
  • Preserves advantages of standard n-gram approaches
  • Effective for short context lengths
  • Fast evaluation at test time

SLIDE 55

Outline

  • Introduction
  • Background on Kneser-Ney smoothing
  • Our Approach
  • Rank
  • Power
  • Constructing the Ensemble
  • Experiments

SLIDE 56

Kneser-Ney: Intuition

  • The lower order distribution should be altered
  • Consider two words, York and door
  • York follows only very few words, e.g. "New York"
  • door can follow many words, e.g. "the door", "red door", "my door", etc.

$Q(x_j = \text{door} \mid \text{backed-off on } x_{j-1}) > Q(x_j = \text{York} \mid \text{backed-off on } x_{j-1})$

SLIDE 59

Kneser-Ney Unigram Distribution

$N(x_j) = |\{x : c(x_j, x) > 0\}|$, the diversity of $x_j$'s histories

$Q_{kn\text{-}uni}(x_j) = \frac{N(x_j)}{\sum_x N(x)}$
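The continuation counts can be computed directly from bigram types. The corpus below is a toy assumption, chosen so that a frequent but predictable word ("york") gets a small Kneser-Ney unigram probability.

```python
from collections import Counter

# Toy corpus (an assumption): "york" is frequent but always follows "new",
# while "door" follows many different words.
tokens = "new york new york new york the door red door my door".split()

# Distinct (history, word) bigram types.
bigram_types = set(zip(tokens, tokens[1:]))

# Continuation count N(w) = |{x : c(x, w) > 0}|: the diversity of w's histories.
N = Counter(w for (_h, w) in bigram_types)
total_types = sum(N.values())

def q_kn_uni(w):
    """Kneser-Ney unigram: continuation counts, normalized."""
    return N[w] / total_types
```

Although "york" and "door" have the same raw count here, "door" has three distinct histories and "york" only one, so the KN unigram prefers "door".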

SLIDE 61

Discounting

$Q_d(x_j \mid x_{j-1}) = \frac{\max(c(x_j, x_{j-1}) - d,\; 0)}{\sum_x c(x, x_{j-1})}$

$Q_{kney}(x_j \mid x_{j-1}) = Q_d(x_j \mid x_{j-1}) + \delta(x_{j-1})\, Q_{kn\text{-}uni}(x_j)$

where $\delta(x_{j-1})$ is the leftover probability.

SLIDE 65

Lower Order Marginal Aligns!

$Q(x_j) = \sum_{x_{j-1}} Q_{kney}(x_j \mid x_{j-1})\, Q(x_{j-1})$

SLIDE 66

Generalizing KN to PLRE

Kneser-Ney:
  • Ensemble composed of unsmoothed n-grams
  • Alter lower order distributions by using counts of unique histories
  • Use absolute discounting to interpolate different n-grams and preserve the lower order marginal constraint

Power Low Rank Ensembles:
  • ? ? ?

SLIDE 72

In General, the Bigram Matrix is Full Rank

SLIDE 73

Independence = Rank 1

  • If $x_j$ and $x_{j-1}$ are independent: $Q(x_j, x_{j-1}) = Q(x_j)\, Q(x_{j-1})$

[figure: the joint matrix $Q(x_j, x_{j-1})$ drawn as the outer product of the vectors $Q(x_j)$ and $Q(x_{j-1})$, i.e. a rank 1 matrix]

  • But what if $x_j$ and $x_{j-1}$ are not independent? What does the best rank 1 approximation give?

SLIDE 77

Rank

  • Let $C$ be the matrix such that $C(x_j, x_{j-1}) = c(x_j, x_{j-1})$
  • Let $M_1 = \arg\min_{M :\, M \ge 0,\ \mathrm{rank}(M) = 1} \| C - M \|_{KL}$ (generalized KL divergence [Lee and Seung 2001])
  • Then $M_1(x_j, x_{j-1}) \propto Q(x_j)\, Q(x_{j-1})$

SLIDE 78

Rank

  • The MLE unigram is the normalized rank 1 approximation of the MLE bigram under KL:

$Q(x_j) = \frac{M_1(x_j, x_{j-1})}{\sum_{x} M_1(x, x_{j-1})}$

  • Vary the rank to obtain quantities between the bigram (full rank) and the unigram (rank 1), with low rank in between
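This claim can be checked numerically. Under generalized KL, the best nonnegative rank 1 approximation of a count matrix has a closed form: the outer product of its row and column sums divided by the total. That closed form is a standard KL-NMF fact assumed by this sketch, and the count matrix is a toy.

```python
import numpy as np

# Toy count matrix C(x_j, x_{j-1}): rows index x_j, columns index x_{j-1}.
C = np.array([[2., 3., 2.],
              [1., 3., 6.],
              [1., 1., 1.]])

# Best nonnegative rank-1 approximation of C under generalized KL:
# (row sums) outer (column sums) / total (assumed closed form).
r = C.sum(axis=1)
c = C.sum(axis=0)
M1 = np.outer(r, c) / C.sum()

# Every column of a rank-1 matrix normalizes to the same vector,
# and here that vector is the MLE unigram Q(x_j).
Q_uni = M1[:, 0] / M1[:, 0].sum()
Q_mle = r / r.sum()
```

Note that the approximation preserves the row and column sums of $C$ exactly, which is the property the ensemble construction later relies on.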

SLIDE 82

Generalizing KN to PLRE

Kneser-Ney:
  • Ensemble composed of unsmoothed n-grams
  • Alter lower order distributions by using counts of unique histories
  • Use absolute discounting to interpolate different n-grams and preserve the lower order marginal constraint

Power Low Rank Ensembles:
  • Ensemble composed of unsmoothed n-grams plus other low rank matrices/tensors
  • ? ?

SLIDE 84

Consider the Elementwise Power

[figure: a count matrix $C$ and its elementwise powers $C^{\gamma}$ for decreasing $\gamma$, each shown with its vector of row sums; as $\gamma$ decreases toward 0 the large counts are flattened, so the row sums shift from raw frequency toward the number of distinct nonzero entries: emphasis on diversity]

SLIDE 94

Consider the Elementwise Power

$M_1^{0} = \arg\min_{M :\, M \ge 0,\ \mathrm{rank}(M) = 1} \| C^{0} - M \|_{KL}$

$Q_{kn\text{-}uni}(x_j) = \frac{M_1^{0}(x_j, x_{j-1})}{\sum_{x} M_1^{0}(x, x_{j-1})}$

power = 1, full rank $\to$ power = 0, full rank $\to$ power = 0, rank 1: the Kneser-Ney unigram; "power low rank" quantities lie in between
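Combining the two ingredients: the normalized rank 1 KL approximation of the zero-powered (indicator) matrix is the Kneser-Ney continuation unigram. The closed form for the rank 1 KL minimizer and the toy matrix are assumptions of this sketch.

```python
import numpy as np

# Toy bigram count matrix C(x_j, x_{j-1}); rows are words, columns histories.
C = np.array([[3., 0., 0.],
              [1., 1., 1.],
              [0., 2., 1.]])

# Elementwise power -> 0: indicator of which bigrams were seen.
C0 = (C > 0).astype(float)

# Rank-1 KL minimizer: outer(row sums, column sums)/total (assumed closed form).
M10 = np.outer(C0.sum(axis=1), C0.sum(axis=0)) / C0.sum()

# Normalizing a column gives the KN continuation unigram N(x_j)/sum_x N(x).
Q = M10[:, 0] / M10[:, 0].sum()
Q_kn = C0.sum(axis=1) / C0.sum()
```

All three words have the same raw count here, yet word 0 (one history) gets less mass than word 1 (three histories), exactly the Kneser-Ney behavior.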

SLIDE 100

Varying Rank and Power

  • Construct matrices of varying rank and power: from power = 1, full rank, through e.g. power = 0.5, low rank, down to power = 0, rank 1

SLIDE 102

Varying Rank and Power

  • Generalizes to higher orders

SLIDE 103

Generalizing KN to PLRE

Kneser-Ney:
  • Ensemble composed of unsmoothed n-grams
  • Alter lower order distributions by using counts of unique histories
  • Use absolute discounting to interpolate different n-grams and preserve the lower order marginal constraint

Power Low Rank Ensembles:
  • Ensemble composed of unsmoothed n-grams plus other low rank matrices/tensors
  • Alter lower order distributions by elementwise power
  • ?

SLIDE 105

Key Requirements

  • Marginal constraint must hold: $Q(x_j) = \sum_{x_{j-1}} Q_{plre}(x_j \mid x_{j-1})\, Q(x_{j-1})$
  • Evaluation of conditional probabilities must be fast

SLIDE 106

Our Approach: Two Step Procedure

  • Step 1: Compute discounts on the powered counts such that the marginal constraint holds. Each count gets a different discount.

[figure: the count matrices at powers 1 and 0.5, each with its own discount]

SLIDE 109

Our Approach: Two Step Procedure

  • Step 2: Take low rank approximations of the discounted quantities such that the marginal constraint still holds

[figure: the discounted matrices at powers 1 and 0.5, their low rank approximations, combined into the power low rank ensemble]

SLIDE 113

Why It Works

  • Low rank approximations with respect to KL preserve row/column sums
  • Therefore, the discounts and the leftover weight are preserved under the low rank approximation
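The row/column sum preservation can be checked with the Lee and Seung multiplicative update for generalized KL: if the factor $W$ has column sums equal to one, a single update of $H$ makes the column sums of $WH$ match those of the target matrix exactly. The matrix sizes and rank below are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy nonnegative matrix and a random rank-2 factorization C ~ W @ H.
C = rng.random((5, 4)) + 0.1
k = 2
W = rng.random((5, k)) + 0.1
H = rng.random((k, 4)) + 0.1

# Normalize the columns of W to sum to 1.
W = W / W.sum(axis=0, keepdims=True)

# Lee-Seung multiplicative update of H for generalized KL divergence:
# H_kj <- H_kj * sum_i W_ik * C_ij / (WH)_ij   (the usual denominator
# sum_i W_ik equals 1 here because W is column-normalized).
H = H * (W.T @ (C / (W @ H)))

approx = W @ H
```

After this one update, each column of `approx` sums to exactly the corresponding column sum of `C`, which is why discounts computed on the counts survive the low rank step.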

SLIDE 118

Normalizer can be Precomputed

  • Low rank approximations with respect to KL preserve row/column sums
  • Compute normalizers on the sparse counts
  • No partition functions!

SLIDE 121

Marginal Constraint Holds

$Q(x_j) = \sum_{x_{j-1}} Q_{plre}(x_j \mid x_{j-1})\, Q(x_{j-1})$

SLIDE 122

Generalizing KN to PLRE

Kneser-Ney:
  • Ensemble composed of unsmoothed n-grams
  • Alter lower order distributions by using counts of unique histories
  • Use absolute discounting to interpolate different n-grams and preserve the lower order marginal constraint

Power Low Rank Ensembles:
  • Ensemble composed of unsmoothed n-grams plus other low rank matrices/tensors
  • Alter lower order distributions by elementwise power
  • Generalized discounting scheme: first compute discounts on powered counts, then take the low rank approximation

SLIDE 123

Training Procedure

  • Count n-grams from the corpus
  • Use alternating minimization (EM) to compute the low rank approximations with respect to KL [Lee and Seung 2001]
  • Because of the ensemble representation, the required rank is only about 100, even for billion word datasets
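The alternating minimization can be sketched with the standard Lee and Seung multiplicative updates; this toy run checks that the generalized KL objective never increases (matrix sizes, rank, and iteration count are assumptions).

```python
import numpy as np

def gkl(V, WH):
    """Generalized KL divergence D(V || WH)."""
    mask = V > 0
    return (V[mask] * np.log(V[mask] / WH[mask])).sum() - V.sum() + WH.sum()

rng = np.random.default_rng(1)
V = rng.random((6, 5)) + 0.1     # toy nonnegative target matrix
k = 2                            # toy rank
W = rng.random((6, k)) + 0.1
H = rng.random((k, 5)) + 0.1

losses = []
for _ in range(50):
    # Alternating multiplicative updates for generalized KL [Lee and Seung 2001].
    H *= W.T @ (V / (W @ H)) / W.sum(axis=0)[:, None]
    W *= (V / (W @ H)) @ H.T / H.sum(axis=1)[None, :]
    losses.append(gkl(V, W @ H))
```

Each multiplicative update is guaranteed not to increase the objective, so the recorded losses form a (numerically) nonincreasing sequence.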

SLIDE 128

Test Time

KN test complexity: $O(n)$. PLRE test complexity: $O(nK)$, where $n$ = order and $K$ = rank.

SLIDE 131

Outline

  • Introduction
  • Background on n-gram smoothing
  • Our Approach
  • Rank
  • Power
  • Constructing the Ensemble
  • Experiments

slide-132
SLIDE 132

Introduction Background Rank Power Ensembles Experiments

Experiments

  • Evaluate on English and Russian
  • Baselines
  • mod-KN – Modified Kneser Ney (back-off)
  • modint-KN – Modified Interpolated Kneser Ney
  • Other comparisons: Class-based models, Neural Networks,

Hierarchical Pitman Yor

43

slide-134
SLIDE 134

Introduction Background Rank Power Ensembles Experiments

Small Datasets - Perplexity

  • English-Small [Bengio et al. 2003]
  • 20K vocabulary
  • 14 million tokens
  • Russian-Small
  • 77K vocabulary
  • 3.5 million tokens

44

                KN        mod-KN    modint-KN   PLRE
English-Small   119.7     104.55    100.07      95.15
Russian-Small   284.09    283.7     260.19      238.96
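For reference, the perplexity figures in the table are exponentiated average negative log-likelihoods; a minimal sketch with hypothetical token probabilities:

```python
import math

# Hypothetical per-token probabilities assigned by a language model
# to a 4-token test string.
token_probs = [0.2, 0.05, 0.1, 0.4]

# Perplexity = exp of the average negative log-likelihood,
# i.e. the inverse geometric mean of the token probabilities.
avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(perplexity)  # about 7.07
```

Lower perplexity means the model assigns higher probability, on average, to the held-out tokens.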

slide-138
SLIDE 138

Introduction Background Rank Power Ensembles Experiments

Small English Comparisons

48

Model                                    Context Size   Perplexity
mod-KN(4)                                3              128
modint-KN(4)                             3              116.6
infinity-gram HPYP [Wood et al. 2009]    infinity       111.8
PLRE(4)                                  3              108.7
LBL [Mnih and Hinton 2007]               5              117
LBL [Mnih and Hinton 2007]               10             107.8
RNN-ME [Mikolov et al. 2012]             infinity       82.1

slide-140
SLIDE 140

Introduction Background Rank Power Ensembles Experiments

Large Datasets - Perplexity

  • English-Large
  • 836,000 types
  • 837 million tokens
  • Russian-Large
  • 1.3 million types
  • 521 million tokens
  • On 8 cores, PLRE (with optimal parameter settings) completes training on English-Large in 3.2 hours and on Russian-Large in 7.7 hours

49

                modint-KN         PLRE
English-Large   77.90 +/- 0.20    75.66 +/- 0.19
Russian-Large   289.6 +/- 6.82    264.59 +/- 5.84

slide-142
SLIDE 142

Introduction Background Rank Power Ensembles Experiments

Machine Translation Task

  • English to Russian translation task

(Language model is used as a feature in the translation system)

  • Unlike other recent works, we use

PLRE instead of modint-KN (not both)

  • To deal with non-determinism, the feature weights are trained only once, using modint-KN; the same weights are then used for both PLRE and modint-KN

50

Method       BLEU
modint-KN    17.63 +/- 0.11
PLRE         17.79 +/- 0.07

Smallest difference: PLRE +0.05    Largest difference: PLRE +0.29

slide-143
SLIDE 143

Conclusion

  • We presented a novel technique for language modeling

called power low rank ensembles

  • Consistently outperforms state-of-the-art Kneser Ney

baselines

  • Effective for small context sizes
  • No partition function required
  • Part of a broader theme: exploiting the connection between linear algebra and probability to develop new solutions for NLP

51

slide-144
SLIDE 144

Thanks!

52

Code/data available at http://www.cs.cmu.edu/~apparikh/plre