Language Modeling with Power Low Rank Ensembles
Ankur Parikh Avneesh Saluja Chris Dyer Eric Xing
Overview
Model: Framework for language modeling using ensembles of low rank matrices and tensors
Relations: Includes ... techniques as special cases
... Kneser Ney baselines for same context length
... required
Introduction Background Rank Power Ensembles Experiments
... translation and speech recognition.
Linear algebra is awesome: $P(w_1, \ldots, w_4) = 0.3648$
Linear algebra is boring: $P(w_1, \ldots, w_4) = 0.1922$
$\mathrm{count}(w_i)$
$\mathrm{count}(w_i, w_{i-1})$
$\mathrm{count}(w_i, w_{i-1}, w_{i-2})$
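The three count tables above can be collected in one pass over a corpus. The following is a minimal sketch, assuming whitespace tokenization; the helper name `ngram_counts` and the toy sentence are illustrative and not from the slides. Tuples are stored in text order, so the slide's $\mathrm{count}(w_i, w_{i-1})$ corresponds to the key `(w_prev, w)`.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (keys are tuples in text order)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus (an assumption made for illustration only).
tokens = "linear algebra is awesome and linear algebra is boring".split()

unigrams = ngram_counts(tokens, 1)   # count(w_i)
bigrams = ngram_counts(tokens, 2)    # count(w_i, w_{i-1})
trigrams = ngram_counts(tokens, 3)   # count(w_i, w_{i-1}, w_{i-2})

print(unigrams[("linear",)])
print(bigrams[("linear", "algebra")])
print(trigrams[("algebra", "is", "awesome")])
```

Storing each order in its own `Counter` mirrors the vector, matrix, and tensor view of the counts used later in the talk.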
$P(w_i \mid w_{i-1}, w_{i-2}) \qquad P(w_i \mid w_{i-1}) \qquad P(w_i)$
$P(w_i \mid w_{i-1})$: (house, decrepit), (house, old), (house, shabby)
(house, {synonym of old})
$P(w_i)$: (house)
house cabin flat
shabby decrepit
... applications
... modeling
... Ney
If rank is too small...
(break, spring)
Probability gets diluted since "break" has many synonyms
If rank is too large...
(domicile, dilapidated)
Probabilities of rare words are a problem, since the representation is too fine grained
... model language at multiple granularities
... should be altered
$P(w_i = \text{door} \mid \text{backed-off on } w_{i-1}) > P(w_i = \text{York} \mid \text{backed-off on } w_{i-1})$
$N_{-}(w_i) = |\{ w : c(w_i, w) > 0 \}|$
Diversity of $w_i$'s history
$P_{kn\text{-}uni}(w_i) = \frac{N_{-}(w_i)}{\sum_{w} N_{-}(w)}$
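The history-diversity count and the altered unigram can be computed directly from bigram counts. This is a minimal sketch under stated assumptions: the function name `kn_unigram` and the toy counts are hypothetical, and keys are stored as `(history, word)`.

```python
from collections import Counter

def kn_unigram(bigram_counts):
    """P_kn-uni(w) = N_-(w) / sum_w' N_-(w'), with N_-(w) = number of distinct histories of w."""
    histories = {}
    for (prev, word), c in bigram_counts.items():
        if c > 0:
            histories.setdefault(word, set()).add(prev)
    n_minus = {w: len(h) for w, h in histories.items()}
    total = sum(n_minus.values())
    return {w: n / total for w, n in n_minus.items()}

# Toy counts (illustrative): "york" is frequent but only ever follows "new",
# while "door" is rarer but follows three different words.
bigram_counts = Counter({
    ("new", "york"): 5,
    ("the", "door"): 1,
    ("a", "door"): 1,
    ("front", "door"): 1,
})
p = kn_unigram(bigram_counts)
print(p["door"], p["york"])
```

In this toy example "york" has the larger raw count, yet "door" receives the larger altered-unigram probability because its history is more diverse, which is the door versus York inequality stated earlier.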
$P_d(w_i \mid w_{i-1}) = \frac{\max(c(w_i, w_{i-1}) - d,\ 0)}{\sum_{w} c(w, w_{i-1})}$
$P_{kney}(w_i \mid w_{i-1}) = P_d(w_i \mid w_{i-1}) + \gamma(w_{i-1})\, P_{kn\text{-}uni}(w_i)$
Where $\gamma(w_{i-1})$ is the leftover probability
$P(w_i) = \sum_{w_{i-1}} P_{kney}(w_i \mid w_{i-1})\, P(w_{i-1})$
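The marginal constraint can be checked numerically on a toy example. The sketch below is an illustration, not the authors' code: it assumes a single discount `d` no larger than the smallest nonzero count, takes the leftover weight as `d` times the number of distinct followers of the history divided by the history count (the standard absolute-discounting choice), and uses hypothetical names and counts.

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(bigram_counts, d=0.75):
    """Return prob(w, v) = P_kney(w | v) for bigram counts keyed as (v, w)."""
    hist_total = Counter()         # c(v) = sum_w c(v, w)
    followers = defaultdict(set)   # distinct words seen after v
    histories = defaultdict(set)   # distinct histories seen before w
    for (v, w), c in bigram_counts.items():
        hist_total[v] += c
        followers[v].add(w)
        histories[w].add(v)
    n_types = sum(len(h) for h in histories.values())
    kn_uni = {w: len(h) / n_types for w, h in histories.items()}

    def prob(w, v):
        discounted = max(bigram_counts.get((v, w), 0) - d, 0) / hist_total[v]
        gamma = d * len(followers[v]) / hist_total[v]   # leftover probability
        return discounted + gamma * kn_uni.get(w, 0.0)

    return prob, hist_total

# Toy counts (illustrative only).
bigram_counts = {
    ("the", "door"): 2, ("the", "house"): 1, ("a", "door"): 1,
    ("new", "york"): 5, ("new", "house"): 1,
}
prob, hist_total = kneser_ney_bigram(bigram_counts)
total = sum(bigram_counts.values())
vocab = sorted({w for (_, w) in bigram_counts})

for w in vocab:
    # Left side: sum over histories of P_kney(w | v) P(v); right side: empirical P(w).
    marginal = sum(prob(w, v) * hist_total[v] / total for v in hist_total)
    empirical = sum(c for (_, x), c in bigram_counts.items() if x == w) / total
    print(w, round(marginal, 6), round(empirical, 6))
```

The two printed columns agree because the mass removed by discounting is redistributed in proportion to history diversity.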
Kneser Ney
unsmoothed n-grams
using count of unique histories
interpolate different n-grams and preserve lower order marginal constraint
Power Low Rank Ensembles
$P(\text{house}) \qquad P(\text{old}) \qquad P(\text{house}, \text{old})$
What does the best rank 1 approximation give?
$B(w_i, w_{i-1}) = c(w_i, w_{i-1})$
$M_1 = \operatorname{argmin}_{M :\ M \ge 0,\ \mathrm{rank}(M) = 1} \| B - M \|_{KL}$
Generalized KL [Lee and Seung 2001]
... bigram under KL: unigram
$P(w_i) = \frac{M_1(w_i, w_{i-1})}{\sum_{w} M_1(w, w_{i-1})}$
full rank, low rank, rank 1
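The rank 1 case can be verified on a small matrix. The sketch below relies on a standard fact rather than anything stated on the slides: the nonnegative rank 1 matrix minimizing generalized KL to a count matrix is the outer product of its row sums and column sums divided by the grand total. The function name and the toy matrix are assumptions.

```python
def rank1_kl(B):
    """Closed-form rank 1 minimizer of generalized KL: (row sums) x (column sums) / total."""
    row = [sum(r) for r in B]
    col = [sum(c) for c in zip(*B)]
    total = sum(row)
    return [[ri * cj / total for cj in col] for ri in row]

# Toy bigram counts: rows index w_i, columns index w_{i-1}.
B = [[4, 0, 1],
     [1, 2, 0],
     [0, 1, 1]]
M1 = rank1_kl(B)

total = sum(map(sum, B))
unigram = [sum(r) / total for r in B]          # empirical P(w_i)

# Normalize each column of M1 over w_i; every history gives the same distribution.
columns = []
for j in range(len(B[0])):
    col_sum = sum(M1[i][j] for i in range(len(B)))
    columns.append([M1[i][j] / col_sum for i in range(len(B))])

print(unigram)
print(columns[0])
```

Every normalized column equals the unigram distribution, so conditioning on the history carries no information at rank 1.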
Kneser Ney
unsmoothed n-grams
using count of unique histories
interpolate different n-grams and preserve lower order marginal constraint
Power Low Rank Ensembles
unsmoothed n-grams plus other low rank matrices/tensors
[Figure: a 3 x 3 count matrix and its elementwise powers, each shown with its row sums]
emphasis on diversity
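The effect of an elementwise power on row sums can be reproduced with a few lines. This is a sketch with made-up numbers, since the values on the slide are not recoverable; `power_counts` and `rho` are hypothetical names, and zero counts are assumed to stay zero under every power.

```python
def power_counts(B, rho):
    """Raise every nonzero count to the power rho; zero counts stay zero."""
    return [[c ** rho if c > 0 else 0.0 for c in row] for row in B]

# Toy counts: rows index w_i, columns index w_{i-1}.
# Row 0 is frequent with one history; row 1 is rarer with three histories.
B = [[9, 0, 0],
     [1, 1, 1],
     [0, 4, 0]]

row_sums = {}
for rho in (1, 0.5, 0):
    powered = power_counts(B, rho)
    row_sums[rho] = [sum(r) for r in powered]
    print(rho, row_sums[rho])
```

At power 1 the row sums are the raw unigram counts; at power 0 each row sum is the number of distinct histories, the diversity count used by Kneser Ney; intermediate powers interpolate between the two.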
$M_1^{0} = \operatorname{argmin}_{M :\ M \ge 0,\ \mathrm{rank}(M) = 1} \| B^{0} - M \|_{KL}$
$P_{kn\text{-}uni}(w_i) = \frac{M_1^{0}(w_i, w_{i-1})}{\sum_{w} M_1^{0}(w, w_{i-1})}$
power = 1, full rank
power = 0, full rank
power = 0, rank = 1
power low rank
power = 1, full rank
power = 0.5, low rank
power = 0, rank = 1
Kneser Ney
unsmoothed n-grams
using count of unique histories
interpolate different n-grams and preserve lower order marginal constraint
Power Low Rank Ensembles
unsmoothed n-grams plus other low rank matrices/tensors
by elementwise power
$P(w_i) = \sum_{w_{i-1}} P_{plre}(w_i \mid w_{i-1})\, P(w_{i-1})$
... constraint holds. Each count gets a different discount
[Figure: counts at powers 1 and 0.5, each with its own discount]
... such that marginal constraint still holds
[Figure: counts at powers 1 and 0.5, each discounted and then approximated at low rank, forming the power low rank ensemble]
... row/column sums
... under the low rank approximation
... scheme: First compute discounts on powered counts, then take low rank approximation
Kneser Ney
unsmoothed n-grams
using count of unique histories
interpolate different n-grams and preserve lower order marginal constraint
Power Low Rank Ensembles
unsmoothed n-grams plus other low rank matrices/tensors
by elementwise power
count from corpus
Use alternating minimization (EM) to compute low rank approximation with respect to KL [Lee and Seung 2001]
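One common form of this alternating minimization is the multiplicative update rule of Lee and Seung for generalized KL. The sketch below is a plain-Python illustration on a tiny dense matrix, not the authors' implementation; the function name, the random initialization, and the iteration count are assumptions, and it presumes every row and column of the input has a positive entry.

```python
import random

def kl_nmf(V, rank, iters=200, seed=0):
    """Multiplicative updates for min over W, H >= 0 of generalized KL(V || W H)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]

    def ratio():
        # R[i][j] = V[i][j] / (W H)[i][j]
        return [[V[i][j] / sum(W[i][k] * H[k][j] for k in range(rank))
                 for j in range(m)] for i in range(n)]

    for _ in range(iters):
        R = ratio()
        for i in range(n):          # update W with H fixed
            for k in range(rank):
                W[i][k] *= sum(R[i][j] * H[k][j] for j in range(m)) / sum(H[k])
        R = ratio()
        for k in range(rank):       # update H with W fixed
            col_w = sum(W[i][k] for i in range(n))
            for j in range(m):
                H[k][j] *= sum(W[i][k] * R[i][j] for i in range(n)) / col_w
    return [[sum(W[i][k] * H[k][j] for k in range(rank)) for j in range(m)]
            for i in range(n)]

# Toy positive count matrix (illustrative only).
V = [[4, 1, 1],
     [1, 2, 1],
     [1, 1, 3]]
M1 = kl_nmf(V, rank=1)
M2 = kl_nmf(V, rank=2)
print([round(sum(col), 6) for col in zip(*M2)])   # column sums of the rank 2 fit
print([sum(col) for col in zip(*V)])              # column sums of V
```

After each update of `H` the column sums of the approximation match those of `V`, and at rank 1 the result is the outer product of the row and column sums divided by the total.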
KN Test Complexity: $O(n)$
PLRE Test Complexity: $O(nK)$
$n$ = order, $K$ = rank
... Hierarchical Pitman Yor

class KN mod-KN modint-KN PLRE
English-Small 119.7 104.55 100.07 95.15
Russian-Small 284.09 283.7 260.19 238.96

Model                                   Context Size   Perplexity
mod-KN(4)                               3              128
modint-KN(4)                            3              116.6
infinity-gram HPYP [Wood et al. 2009]   infinity       111.8
PLRE(4)                                 3              108.7
LBL [Mnih and Hinton 2007]              5              117
LBL [Mnih and Hinton 2007]              10             107.8
RNN-ME [Mikolov et al. 2012]            infinity       82.1
... completes training on English-Large in 3.2 hours and Russian-Large in 7.7 hours

                modint-KN          PLRE
English-Large   77.90 +/- 0.20     75.66 +/- 0.19
Russian-Large   289.6 +/- 6.82     264.59 +/- 5.84
(Language model is used as a feature in the translation system)
... PLRE instead of modint-KN (not both)
... the model is only trained once, using modint-KN. The same feature weights are then used for both PLRE and modint-KN

Method          BLEU
modint-KN       17.63 +/- 0.11
PLRE            17.79 +/- 0.07
Smallest Diff   PLRE+0.05
Largest Diff    PLRE+0.29
... called power low rank ensembles
... baselines
... linear algebra and probability to develop new solutions for NLP
Code/data available at http://www.cs.cmu.edu/~apparikh/plre