Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification
Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney
Agenda
- Background
- Methodology
- Experiments
- Conclusion
2 of 33 Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification — Gao et al. — gao@i6.informatik.rwth-aachen.de — 2019.11.03 @ IWSLT2019
Background Contextual Word Classification
- language modeling (LM) and machine translation (MT)
- dominated by neural networks (NN)
- p(w_1^T) = \prod_{t=1}^T p(w_t | w_1^{t-1}) = \prod_{t=1}^T p(w_t | h_t)
- despite many choices to learn the context vector h:
– feed-forward NN
– recurrent NN
– convolutional NN
– self-attention NN
– ...
- output is often modeled with standard softmax and trained with cross entropy:
p(w_v|h) = \exp(W_v^T h) \Big/ \sum_{v'=1}^V \exp(W_{v'}^T h)
\mathcal{L} = -\log p(w_v|h)
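The standard softmax and cross-entropy loss above can be sketched in a few lines of NumPy (a minimal illustration, not the paper's implementation; shapes and names are assumptions):

```python
import numpy as np

def softmax_xent(W, h, v):
    """Standard softmax over inner products W_v^T h and the cross-entropy
    loss for target word index v. W: (d, V) output matrix, h: (d,) context."""
    logits = W.T @ h                       # (V,) inner products W_v^T h
    logits = logits - logits.max()         # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()
    return p, -np.log(p[v])                # posterior p(w|h) and loss L

rng = np.random.default_rng(0)
W, h = rng.normal(size=(8, 100)), rng.normal(size=8)
p, loss = softmax_xent(W, h, v=3)
```

The max-shift does not change the posterior but keeps the exponentials from overflowing.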
Background Softmax Bottleneck [Yang et al., 2018]
- from previous:
\mathcal{L} = -\log p(w_v|h) = -\log \left[ \exp(W_v^T h) \Big/ \sum_{v'=1}^V \exp(W_{v'}^T h) \right]
- exponential-and-logarithm calculation:
\log p(w_v|h) + \log \sum_{v'=1}^V \exp(W_{v'}^T h) = W_v^T h
- approximate true log posteriors with inner products:
\log \tilde{p}(w_v|h) + C_h \approx W_v^T h
Background Softmax Bottleneck Cont.
- from previous: \log \tilde{p}(w_v|h) + C_h \approx W_v^T h
- in matrix form:

\underbrace{\log \begin{bmatrix} \tilde{p}(w_1|h_1) & \cdots & \tilde{p}(w_1|h_N) \\ \vdots & \ddots & \vdots \\ \tilde{p}(w_V|h_1) & \cdots & \tilde{p}(w_V|h_N) \end{bmatrix}}_{V \times N,\ \mathrm{rank} \sim V} + \underbrace{\begin{bmatrix} C_{h_1} & \cdots & C_{h_N} \\ \vdots & & \vdots \\ C_{h_1} & \cdots & C_{h_N} \end{bmatrix}}_{V \times N} \approx \underbrace{\begin{bmatrix} W_1^T \\ \vdots \\ W_V^T \end{bmatrix}}_{V \times d} \underbrace{\begin{bmatrix} h_1 & \cdots & h_N \end{bmatrix}}_{d \times N,\ \mathrm{rank} \sim d}

- factorization of the true log posterior matrix: \log \tilde{P} + C \approx W^T H
- Softmax Bottleneck:
  – \log \tilde{P} is high-rank for natural language: \mathrm{rank}(\log \tilde{P}) \sim V
  – C decreases the rank of the left-hand side by at most 1
  – the rank of W^T H is bounded by the hidden dimension: \mathrm{rank}(W^T H) \sim d
  – typically V \sim 100{,}000 and d \sim 1000, so the low-rank product cannot match the high-rank log posteriors
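The rank bound can be checked numerically on toy sizes (an illustrative sketch; real systems have V around 100,000 and d around 1000):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, d = 50, 40, 5                 # toy sizes standing in for V ~ 100k, d ~ 1000
W = rng.normal(size=(d, V))         # word matrix
H = rng.normal(size=(d, N))         # context matrix
A = W.T @ H                         # the V x N logit matrix of the linear softmax
# rank(W^T H) <= min(rank W, rank H) <= d, far below min(V, N)
rank = np.linalg.matrix_rank(A)
```

No matter how W and H are trained, the logit matrix can never exceed rank d.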
Background Breaking Softmax Bottleneck
- mixture-of-softmax (mos) [Yang et al., 2018]:
p_{mos}(w_v|h) = \sum_{k=1}^K \pi_k \frac{\exp(W_v^T f_k(h))}{\sum_{v'=1}^V \exp(W_{v'}^T f_k(h))}
- sigsoftmax [Kanai et al., 2018]:
p_{sigsoftmax}(w_v|h) = \frac{\exp(W_v^T h)\,\sigma(W_v^T h)}{\sum_{v'=1}^V \exp(W_{v'}^T h)\,\sigma(W_{v'}^T h)}
- weight norm regularization [Herold et al., 2018]:
\mathcal{L}_{wnr} = -\sum_{\text{data}} \log p(w_v|h) + \rho \sum_{v=1}^V (\|W_v\|_2 - \nu)^2
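The mixture-of-softmax posterior can be sketched as follows (a minimal reading of the formula above; variable names, shapes, and the choice of f_k are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_mos(W, h, fs, pi):
    """Mixture of softmaxes: a convex combination of K softmaxes, each over
    inner products with its own transformed context f_k(h).
    W: (d, V), h: (d,), fs: K callables, pi: (K,) weights summing to 1."""
    comps = np.stack([softmax(W.T @ f(h)) for f in fs])  # (K, V) component posteriors
    return pi @ comps                                    # (V,) mixed posterior

rng = np.random.default_rng(0)
d, V, K = 8, 30, 3
W = rng.normal(size=(d, V))
Qs = rng.normal(size=(K, d, d))
fs = [lambda h, Q=Q: np.tanh(Q.T @ h) for Q in Qs]       # f_k(h) = tanh(Q_k^T h)
pi = np.full(K, 1.0 / K)
p = p_mos(W, rng.normal(size=d), fs, pi)
```

Because each component is itself a proper distribution and the mixture weights sum to one, the result is again a proper distribution, yet it is no longer a log-linear function of h.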
Background Breaking Softmax Bottleneck Cont.
- let z = W_v^T h
- theoretically, to break the softmax bottleneck with an output activation g(z) [Kanai et al., 2018], g should be:
  – nonlinear in \log g(z)
  – numerically stable
  – non-negative
  – monotonically increasing
Background Geometric Explanation of Softmax Bottleneck
- an intuitive example:
– \tilde{p}(dog | a common home pet is ...) ≈ \tilde{p}(cat | a common home pet is ...) ≈ 50%
– learned word vectors are close: W_{dog} ≈ W_{cat}
– posteriors over dog and cat are thus close: p(dog|...) ≈ p(cat|...)
– there exist contexts that would fail the model
– overall an expressiveness problem
Background Kernel Trick
- widely used in support-vector machine (SVM) and logistic regression
- improve expressiveness by implicitly transforming data into high dimensional feature spaces
[Eric, 2019]
Background Kernel Trick Cont.
- K(x, y) = \langle \phi(x), \phi(y) \rangle
- e.g. the squared kernel in 2-D:
K_{sq}\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = (x_1 y_1 + x_2 y_2)^2 = \left(x_1^2, \sqrt{2}\,x_1 x_2, x_2^2\right) \left(y_1^2, \sqrt{2}\,y_1 y_2, y_2^2\right)^T = \phi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)^{\!T} \phi\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)
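The identity above is easy to verify numerically: the implicit kernel value (x^T y)^2 equals the explicit inner product in the 3-D feature space (a small sketch):

```python
import numpy as np

def phi(x):
    """Explicit feature map of the 2-D squared kernel K_sq."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
k_implicit = (x @ y) ** 2        # (x1*y1 + x2*y2)^2, never forming phi
k_explicit = phi(x) @ phi(y)     # same value via the 3-D feature space
```

The kernel trick computes the left-hand side only, so the feature space never has to be materialized.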
- for a kernel function (kernel) to be valid:
  – positive semidefinite (PSD) Gram matrix
  – corresponds to a scalar product in some feature space
- empirically, non-PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005]
→ we do not enforce PSD
- where there is an inner product, there could be an application of the kernel trick
  – import vector machine [Zhu and Hastie, 2002]
  – multilayer kernel machine [Cho and Saul, 2009]
  – gated softmax [Memisevic et al., 2010]
Background Non-Euclidean Word Embedding
- Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]
[Athiwaratkun and Wilson, 2017]
Background Non-Euclidean Word Embedding Cont.
- hyperbolic (Poincaré) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]
[Nickel and Kiela, 2017]
Background Quick Recap
- contextual word classification
- softmax bottleneck
- breaking softmax bottleneck
- geometric explanation of softmax bottleneck
- kernel trick
- non-Euclidean word embedding
→ kernels in softmax
Agenda
- Background
- Methodology
- Experiments
- Conclusion
Methodology Generalized Softmax
- model posterior:
p(w_v|h) = \sum_{k=1}^K \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^V \exp(S_k(W_{v'}, f_k(h)))}
- mixture weight:
\pi_k = \frac{\exp(M_k^T h)}{\sum_{k'=1}^K \exp(M_{k'}^T h)}
- nonlinearly transformed context:
f_k(h) = \tanh(Q_k^T h)
- with trainable parameters: W \in \mathbb{R}^{d \times V}, M \in \mathbb{R}^{d \times K} and Q_k \in \mathbb{R}^{d \times d}
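The three formulas above can be sketched together as follows (a toy NumPy reading, not the paper's implementation; names, shapes, and the example kernel are assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generalized_softmax(W, M, Qs, h, kernels):
    """Generalized softmax: mixture weights pi_k from M, transformed
    contexts f_k(h) = tanh(Q_k^T h), and a kernel score S_k(W_v, f_k(h))
    in place of the plain inner product.
    W: (d, V), M: (d, K), Qs: (K, d, d), kernels: K callables S_k."""
    pi = softmax(M.T @ h)                                   # (K,) mixture weights
    V = W.shape[1]
    p = np.zeros(V)
    for k, S in enumerate(kernels):
        fk = np.tanh(Qs[k].T @ h)                           # f_k(h)
        scores = np.array([S(W[:, v], fk) for v in range(V)])
        p += pi[k] * softmax(scores)                        # k-th component
    return p

rng = np.random.default_rng(0)
d, V, K = 8, 20, 2
W, M = rng.normal(size=(d, V)), rng.normal(size=(d, K))
Qs = rng.normal(size=(K, d, d))
S_lin = lambda w, f: w @ f                                  # baseline lin kernel
p = generalized_softmax(W, M, Qs, rng.normal(size=d), [S_lin] * K)
```

With S_k set to the inner product for every k, this reduces to the mixture-of-softmax model from the background section.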
Methodology Generalized Softmax Cont.
- from previous:
p(w_v|h) = \sum_{k=1}^K \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^V \exp(S_k(W_{v'}, f_k(h)))}
- note:
  – replace inner product with kernels: W_v^T h → S_k(W_v, f_k(h))
  – replace single softmax with a mixture: 1 → \sum_{k=1}^K \pi_k
  – replace context vector with transformed ones: h → f_k(h)
  – shared word vectors due to memory restriction
- motivations:
  – different kernels give different feature spaces
  – based on context, the model chooses which feature space is suitable
  – for each feature space, the context vector could be different
  – ideally, for each feature space, the word vector could also be different
Methodology Individual Kernels
S_{lin}(W_v, h) = W_v^T h
S_{log}(W_v, h) = -\log(\|W_v - h\|^p + 1)
S_{pow}(W_v, h) = -\|W_v - h\|^p
S_{pol}(W_v, h) = (\alpha W_v^T h + c)^p
S_{rbf}(W_v, h) = \exp(-\gamma \|W_v - h\|^2)
S_{wav}(W_v, h) = \cos\!\left(\frac{\|W_v - h\|^2}{a}\right) \exp\!\left(-\frac{\|W_v - h\|^2}{b}\right)
S_{ssg}(W_v, h) = \log \int \mathcal{N}(x; \mu_{W_v}, \Sigma_{W_v})\, \mathcal{N}(x; \mu_h, \Sigma_h)\, dx
S_{mog}(W_v, h) = \sum_{i,j} \log \int \mathcal{N}(x; \mu_{i,W_v}, \Sigma_{i,W_v})\, \mathcal{N}(x; \mu_{j,h}, \Sigma_{j,h})\, dx
S_{hpb}(W_v, h) = -\operatorname{arcosh}\!\left(1 + \frac{2\|W_v - h\|^2}{(1 - \|W_v\|^2)(1 - \|h\|^2)}\right)
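Several of the vector-based kernel scores above can be written out directly (a sketch; the hyperparameter defaults alpha, c, p, gamma, a, b are placeholders, since the paper grid-searches them):

```python
import numpy as np

def s_lin(w, h):            return w @ h
def s_pow(w, h, p=2):       return -np.linalg.norm(w - h) ** p
def s_log(w, h, p=2):       return -np.log(np.linalg.norm(w - h) ** p + 1)
def s_rbf(w, h, gamma=1.0): return np.exp(-gamma * np.linalg.norm(w - h) ** 2)

def s_pol(w, h, alpha=1.0, c=0.0, p=2):
    return (alpha * (w @ h) + c) ** p

def s_wav(w, h, a=1.0, b=1.0):
    sq = np.linalg.norm(w - h) ** 2
    return np.cos(sq / a) * np.exp(-sq / b)
```

All distance-based scores peak when the word vector coincides with the (transformed) context vector, which is the geometric intuition behind using them as classification scores.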
Methodology Individual Kernels Cont.
- kernel sources:
  – baseline: lin
  – classic (e.g. from SVM): log, pow, pol, rbf, wav
  – from non-Euclidean word embedding: ssg, mog, hpb
- the dimension reduction step over d, common to the kernels above, is not immediately executable, so:
  – simplify the wavelet kernel
  – use spherical covariance matrices
  – rewrite the power of the vector difference norm:
\|W_v - h\|^p = (\|W_v\|^2 + \|h\|^2 - 2 W_v^T h)^{p/2}
→ \|W_v - h\|^p becomes a vector-norm-regularized version of the inner product
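The norm-power rewrite is easy to verify numerically; it matters because the right-hand side needs only the inner products already available from the usual matrix multiply (a small sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
w, h = rng.normal(size=16), rng.normal(size=16)
p = 3
lhs = np.linalg.norm(w - h) ** p                   # ||W_v - h||^p directly
rhs = (w @ w + h @ h - 2 * (w @ h)) ** (p / 2)     # via inner products only
```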
Agenda
- Background
- Methodology
- Experiments
- Conclusion
Experiments Experimental Setup
- LM on Switchboard:
– 30K vocabulary
– 25M training tokens
– LSTM recurrent NN [Sundermeyer et al., 2012]
- MT on IWSLT2014 German→English:
– 10K joint byte pair encoding merge operations
– 160K parallel training sentences
– Transformer NN [Vaswani et al., 2017]
- overall:
– grid search over the hyperparameters of the kernels
– vary K and S_k
– fairseq toolkit [Ott et al., 2019]
Experiments Performance of Individual Kernels

Method                                         SWB (Perplexity)   IWSLT (BLEU [%])
Ref. [Irie et al., 2018] / [Wu et al., 2019]         47.6               35.2
lin                                                  46.8               34.3
log                                                 103.0                0.4
pow                                                  46.8               32.8
pol                                                  47.3               31.7
rbf                                                 284.9                0.0
ssg                                                  49.9               34.6
mog                                                  46.7               34.2
hpb                                                 122.6                0.3
wav                                                 289.7                0.0
Experiments Gradient Properties
[Figure: kernel score S(W_v, h) as a function of \|W_v - h\|^p for the rbf, wav, log, and pow kernels]
- expected performance: rbf ≈ wav < log < pow
Experiments Performance of Individual Kernels Revisited

Method                                         SWB (Perplexity)   IWSLT (BLEU [%])
Ref. [Irie et al., 2018] / [Wu et al., 2019]         47.6               35.2
lin                                                  46.8               34.3
log                                                 103.0                0.4
pow                                                  46.8               32.8
pol                                                  47.3               31.7
rbf                                                 284.9                0.0
ssg                                                  49.9               34.6
mog                                                  46.7               34.2
hpb                                                 122.6                0.3
wav                                                 289.7                0.0
Experiments Mixture of Kernels
- encourage equal contribution of mixture components, similar to [Takase et al., 2018]:
\mathcal{L}_{reg} = \mathcal{L}_{ce} + \frac{\rho}{N} \sum^{N} \mathrm{Var}(\pi_k)
- compromise between regularization and performance → ρ = 0.1

ρ       Variance   PPL
0.001   4.74       46.8
0.01    4.98       46.6
0.1     3.67       47.2
1       3.81       47.4
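The variance regularizer can be sketched as follows (one plausible reading of the slide's formula, labeled as an assumption: at each of the N positions, take the variance of the K mixture weights, average over the batch, and scale by rho):

```python
import numpy as np

def mixture_weight_penalty(pi_batch, rho=0.1):
    """Variance penalty on mixture weights. pi_batch: (N, K) mixture
    weights, one row per position; equal weights incur zero penalty."""
    N = pi_batch.shape[0]
    return rho / N * pi_batch.var(axis=1).sum()

uniform = np.full((6, 3), 1.0 / 3)                 # equal contribution
peaked = np.tile([0.9, 0.05, 0.05], (6, 1))        # one dominant component
```

Minimizing this term pushes the mixture weights toward equal contribution, which is the stated goal of the regularizer.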
Experiments Mixture of Kernels Cont.

Name     Mixture settings           PPL
mos*     9×lin                      47.8
mixbig   1 of each kernel           47.1
mix1     lin, log, rbf, hpb, wav    46.6
mix2     3×lin, log                 46.5
mix3     lin, log, pow, pol         47.3
mix4     lin, log, rbf, hpb         47.1
mix5     2×lin, 2×rbf               46.7
* tricks like activation regularization and average stochastic gradient descent are not applied
- no major performance difference:
– at least one lin mixture component
– consistently, a total mixture weight of more than 50% in the lin component
– tied projection matrices across mixture components
Experiments Disambiguation Abilities
- automatic extraction of word clusters:
– recall the dog and cat example
– extract the projection matrix
– calculate all pairwise distances between word vectors
– for each word, sum the distances of the five closest neighbors
– sort the results and pick the words with the smallest values
– make sure each word in the clusters appears at least 500 times
- three extracted clusters:
– {quickly, slowly, soon, quick, easily}
– {democrat, republicans, politicians, republican, democrats}
– {hamster, hamsters, parakeets, rabbits, turtles}
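The extraction recipe above can be sketched as follows (a toy illustration on synthetic vectors rather than the trained projection matrix; function and variable names are assumptions, and the 500-count frequency filter is omitted):

```python
import numpy as np

def neighbor_score(W, n_neighbors=5):
    """For each word vector (row of W), sum the distances to its
    n_neighbors closest other words; small scores mark tight clusters."""
    D = np.linalg.norm(W[:, None, :] - W[None, :, :], axis=-1)  # (V, V) pairwise
    np.fill_diagonal(D, np.inf)                                 # ignore self-distance
    return np.sort(D, axis=1)[:, :n_neighbors].sum(axis=1)

# toy word vectors: two tight pairs and one isolated word
W = np.array([[0.0, 0.0], [0.1, 0.0],      # cluster {0, 1}
              [5.0, 5.0], [5.1, 5.0],      # cluster {2, 3}
              [9.0, 0.0]])                 # outlier 4
scores = neighbor_score(W, n_neighbors=1)
tightest = np.argsort(scores)[:4]          # clustered words come first
```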
Experiments Disambiguation Abilities Cont.

Ground Truth  ... books can end up being outdated very quickly
lin           ... books can end up being outdated very soon
mixbig        ... books can end up being outdated very quickly

Ground Truth  ... if you vote for a republican or vote for a democrat
lin           ... if you vote for a republican or vote for a republican
mixbig        ... if you vote for a republican or vote for a democrat

Ground Truth  ... well we had some turtles and a hamster
lin           ... well we had some turtles and a turtle
mixbig        ... well we had some turtles and a hamster
- manual inspection of appearances of words in the clusters
- examples where mixbig is right and lin is wrong
- need a more systematic way to judge if a model is better at word disambiguation
Agenda
- Background
- Methodology
- Experiments
- Conclusion
Conclusion Conclusion and Outlook
- softmax bottleneck: theory and intuition
- generalized softmax: introducing kernels and mixture of kernels
- individual kernels: 9 different kernels, lin, pol, pow, ssg and mog perform the best
- gradient properties: justification of why some kernels perform better than others
- mixture of kernels: consistent large mixture weight of lin when sharing word vectors
- disambiguation abilities: interesting observations, but a more systematic evaluation is still lacking
- outlook: untie the word embeddings
Thank you for your attention!
Any questions?
Reference

Athiwaratkun, B. and Wilson, A. G. (2017). Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656, Vancouver, Canada.

Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005). Conditionally positive definite kernels for SVM based image recognition. In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116. IEEE.

Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350.

Dhingra, B., Shallue, C., Norouzi, M., Dai, A., and Dahl, G. (2018). Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69.

Eric, K. (2019). Illustration of the kernel trick. http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html. Accessed: 2019.10.31.

Herold, C., Gao, Y., and Ney, H. (2018). Improving neural language models with weight norm initialization and regularization. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100.

Irie, K., Lei, Z., Deng, L., Schlüter, R., and Ney, H. (2018). Investigation on estimation of sentence probability by combining forward, backward and bi-directional LSTM-RNNs. In Proc. Interspeech 2018, pages 392–395.
Kanai, S., Fujiwara, Y., Yamanaka, Y., and Adachi, S. (2018). Sigsoftmax: Reanalysis of the softmax bottleneck. In 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada.

Lin, H.-T. and Lin, C.-J. (2003). A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report.

Memisevic, R., Zach, C., Pollefeys, M., and Hinton, G. E. (2010). Gated softmax classification. In Advances in Neural Information Processing Systems, pages 1603–1611.

Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, MN, USA.

Sundermeyer, M., Schlüter, R., and Ney, H. (2012). LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, pages 194–197, Portland, OR, USA.

Takase, S., Suzuki, J., and Nagata, M. (2018). Direct output connection for a high-rank language model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Vilnis, L. and McCallum, A. (2015). Word representations via Gaussian embedding. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. (2019). Pay less attention with lightweight and dynamic convolutions. In Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2018). Breaking the softmax bottleneck: A high-rank RNN language model. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada.

Zhu, J. and Hastie, T. (2002). Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, pages 1081–1088.