SLIDE 1

Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification

Yingbo Gao, Christian Herold, Weiyue Wang, Hermann Ney

gao@i6.informatik.rwth-aachen.de — 2019.11.03 @ IWSLT2019

SLIDE 2

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 3

Background: Contextual Word Classification

  • language modeling (LM) and machine translation (MT)
  • dominated by neural networks (NN)
  • the sequence probability factorizes into per-position word posteriors, each conditioned on a context vector $h_t$:

$$p(w_1^T) = \prod_{t=1}^{T} p(w_t \mid w_1^{t-1}, \ldots) = \prod_{t=1}^{T} p(w_t \mid h_t)$$

  • despite many choices to learn the context vector $h$:
    – feed-forward NN
    – recurrent NN
    – convolutional NN
    – self-attention NN
    – ...
  • the output is often modeled with a standard softmax and trained with cross entropy:

$$p(w_v \mid h) = \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}, \qquad \mathcal{L} = -\log p(w_v \mid h)$$
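A minimal numpy sketch of this output layer, with toy shapes and names of our own choosing (not the authors' code):

```python
import numpy as np

# Standard softmax output layer with cross-entropy loss.
# W holds one output embedding per vocabulary word; h is the context
# vector produced by some encoder.
d, V = 8, 100                       # hidden size, vocabulary size (toy values)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, V))         # W in R^{d x V}
h = rng.normal(size=d)              # h in R^d

logits = W.T @ h                    # W_v^T h for every v
logits -= logits.max()              # stabilize exp() numerically
p = np.exp(logits) / np.exp(logits).sum()   # p(w_v | h)

v = 42                              # index of the reference word
loss = -np.log(p[v])                # L = -log p(w_v | h)
print(loss)
```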

SLIDE 4

Background: Softmax Bottleneck [Yang et al., 2018]

  • from previous:

$$\mathcal{L} = -\log p(w_v \mid h) = -\log \frac{\exp(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)}$$

  • expanding the exponential-and-logarithm calculation:

$$\log p(w_v \mid h) + \log \sum_{v'=1}^{V} \exp(W_{v'}^T h) = W_v^T h$$

  • i.e. the true log posteriors are approximated with inner products, up to a context-dependent constant $C_h$:

$$\log \tilde{p}(w_v \mid h) + C_h \approx W_v^T h$$

SLIDE 5

Background: Softmax Bottleneck Cont.

  • from previous: $\log \tilde{p}(w_v \mid h) + C_h \approx W_v^T h$
  • in matrix form:

$$\log \underbrace{\begin{bmatrix} \tilde{p}(w_1 \mid h_1) & \cdots & \tilde{p}(w_1 \mid h_N) \\ \vdots & \ddots & \vdots \\ \tilde{p}(w_V \mid h_1) & \cdots & \tilde{p}(w_V \mid h_N) \end{bmatrix}}_{V \times N,\ \text{rank} \sim V} + \begin{bmatrix} C_{h_1} & \cdots & C_{h_N} \end{bmatrix} \approx \underbrace{\begin{bmatrix} W_1^T \\ \vdots \\ W_V^T \end{bmatrix}}_{V \times d} \underbrace{\begin{bmatrix} h_1 & \cdots & h_N \end{bmatrix}}_{d \times N,\ \text{rank} \sim d}$$

  • factorization of the true log posterior matrix: $\log \tilde{P} + C \approx W^T H$
  • Softmax Bottleneck:
    – $\log \tilde{P}$ is high-rank for natural language: $\text{rank}(\log \tilde{P}) \sim V$
    – $C$ decreases the rank of the left-hand side by at most 1
    – the rank of $W^T H$ is bounded by the hidden dimension: $\text{rank}(W^T H) \sim d$
    – typically $V \sim 100{,}000$ and $d \sim 1000$, so the low-rank product cannot represent the high-rank log posterior matrix
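The rank argument is easy to verify numerically; a sketch with toy sizes standing in for $V \sim 100{,}000$ and $d \sim 1000$:

```python
import numpy as np

# rank(W^T H) can never exceed d, no matter how large V and N are,
# while a generic V x N matrix (stand-in for the true log posteriors)
# has rank min(V, N).
rng = np.random.default_rng(0)
V, N, d = 50, 40, 5
W = rng.normal(size=(d, V))
H = rng.normal(size=(d, N))

A = W.T @ H                         # V x N matrix of inner products
print(np.linalg.matrix_rank(A))     # -> 5 (= d)

P = rng.normal(size=(V, N))         # generic high-rank target
print(np.linalg.matrix_rank(P))     # -> 40 (= min(V, N))
```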

SLIDE 6

Background: Breaking the Softmax Bottleneck

  • mixture-of-softmaxes (MoS) [Yang et al., 2018]:

$$p_{\text{mos}}(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(W_v^T f_k(h))}{\sum_{v'=1}^{V} \exp(W_{v'}^T f_k(h))}$$

  • sigsoftmax [Kanai et al., 2018]:

$$p_{\text{sigsoftmax}}(w_v \mid h) = \frac{\exp(W_v^T h)\,\sigma(W_v^T h)}{\sum_{v'=1}^{V} \exp(W_{v'}^T h)\,\sigma(W_{v'}^T h)}$$

  • weight norm regularization [Herold et al., 2018]:

$$\mathcal{L}_{\text{wnr}} = -\sum_{\text{data}} \log p_{\text{mos}}(w_v \mid h) + \rho \sum_{v=1}^{V} \left(\|W_v\|_2 - \nu\right)^2$$
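A hedged numpy sketch of the mixture-of-softmaxes; shapes and parameter names are ours, and the mixture weights (context-dependent in the paper) are drawn at random for brevity:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# Mixture-of-softmaxes: K softmaxes over shared word embeddings W,
# each fed its own nonlinearly transformed context f_k(h).
rng = np.random.default_rng(0)
d, V, K = 8, 100, 3
W = rng.normal(size=(d, V))
Q = rng.normal(size=(K, d, d))      # per-component context transforms
h = rng.normal(size=d)
pi = softmax(rng.normal(size=K))    # mixture weights, sum to 1

p_mos = sum(pi[k] * softmax(W.T @ np.tanh(Q[k].T @ h)) for k in range(K))
print(p_mos.sum())                  # -> 1.0: a convex combination of softmaxes
```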

SLIDE 7

Background: Breaking the Softmax Bottleneck Cont.

  • let $z = W_v^T h$
  • theoretically, to break the softmax bottleneck with an activation $g(z)$ [Kanai et al., 2018]:
    – $\log(g(z))$ is nonlinear
    – numerically stable
    – non-negative
    – monotonically increasing
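These criteria can be spot-checked numerically; a small sketch for the sigsoftmax activation $g(z) = \exp(z)\,\sigma(z)$ of Kanai et al. (2018):

```python
import numpy as np

# Check the criteria above for g(z) = exp(z) * sigmoid(z).
z = np.linspace(-5.0, 5.0, 1001)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
g = np.exp(z) * sigmoid(z)

print((g >= 0).all())                         # non-negative -> True
print((np.diff(g) > 0).all())                 # monotonically increasing -> True
# log g(z) = z + log sigmoid(z); its second difference is nonzero,
# so log g is nonlinear (for the plain softmax it would be exactly 0):
print(np.allclose(np.diff(np.log(g), 2), 0))  # -> False
```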

SLIDE 8

Background: Geometric Explanation of the Softmax Bottleneck

  • an intuitive example:
    – $\tilde{p}(\text{dog} \mid \text{a common home pet is ...}) \approx \tilde{p}(\text{cat} \mid \text{a common home pet is ...}) \approx 50\%$
    – learned word vectors are close: $W_{\text{dog}} \approx W_{\text{cat}}$
    – posteriors over dog and cat are thus close for every context: $p(\text{dog} \mid \ldots) \approx p(\text{cat} \mid \ldots)$
    – hence there exist contexts (e.g. one that clearly determines dog) on which the model must fail
    – overall an expressiveness problem

SLIDE 9

Background: Kernel Trick

  • widely used in support vector machines (SVM) and logistic regression
  • improves expressiveness by implicitly transforming the data into a high-dimensional feature space

[Figure: illustration of the kernel trick, from Eric, 2019]

SLIDE 10

Background: Kernel Trick Cont.

  • $K(x, y) = \langle \phi(x), \phi(y) \rangle$
  • example, the squared kernel in two dimensions:

$$K_{\text{sq}}\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right) = (x_1 y_1 + x_2 y_2)^2 = \begin{bmatrix} x_1^2 \\ \sqrt{2}\,x_1 x_2 \\ x_2^2 \end{bmatrix}^T \begin{bmatrix} y_1^2 \\ \sqrt{2}\,y_1 y_2 \\ y_2^2 \end{bmatrix} = \phi\!\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right)^T \phi\!\left(\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}\right)$$
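A quick numerical check of this identity (our toy values):

```python
import numpy as np

# The squared kernel equals an inner product in the 3-dimensional
# feature space phi, without ever computing phi explicitly.
def K_sq(x, y):
    return (x @ y) ** 2

def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.5, -0.7])
y = np.array([0.3, 2.0])
print(K_sq(x, y), phi(x) @ phi(y))   # identical values
```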

  • for a kernel function (kernel) to be valid:
    – positive semidefinite (PSD) Gram matrix
    – corresponds to a scalar product in some feature space
  • empirically, non-PSD kernels also work well [Lin and Lin, 2003, Boughorbel et al., 2005]
    → we do not enforce PSD
  • wherever there is an inner product, there could be an application of the kernel trick:
    – import vector machine [Zhu and Hastie, 2002]
    – multilayer kernel machine [Cho and Saul, 2009]
    – gated softmax [Memisevic et al., 2010]

SLIDE 11

Background: Non-Euclidean Word Embedding

  • Gaussian embedding [Vilnis and McCallum, 2015, Athiwaratkun and Wilson, 2017]

[Figure: Gaussian word embeddings, from Athiwaratkun and Wilson, 2017]

SLIDE 12

Background: Non-Euclidean Word Embedding Cont.

  • hyperbolic (Poincaré) embedding [Nickel and Kiela, 2017, Dhingra et al., 2018]

[Figure: Poincaré embeddings of a hierarchy, from Nickel and Kiela, 2017]

SLIDE 13

Background: Quick Recap

  • contextual word classification
  • softmax bottleneck
  • breaking softmax bottleneck
  • geometric explanation of softmax bottleneck
  • kernel trick
  • non-Euclidean word embedding

→ kernels in softmax

SLIDE 14

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 15

Methodology: Generalized Softmax

  • model posterior:

$$p(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$$

  • mixture weight:

$$\pi_k = \frac{\exp(M_k^T h)}{\sum_{k'=1}^{K} \exp(M_{k'}^T h)}$$

  • nonlinearly transformed context:

$$f_k(h) = \tanh(Q_k^T h)$$

  • with trainable parameters $W \in \mathbb{R}^{d \times V}$, $M \in \mathbb{R}^{d \times K}$ and $Q_k \in \mathbb{R}^{d \times d}$
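A hedged numpy sketch of the generalized softmax; the RBF score stands in for an arbitrary $S_k$, and all shapes and names are illustrative rather than the released implementation:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

rng = np.random.default_rng(0)
d, V, K = 8, 100, 3
W = rng.normal(size=(d, V))         # shared word vectors, W in R^{d x V}
M = rng.normal(size=(d, K))         # mixture-weight projection M
Q = rng.normal(size=(K, d, d))      # per-component context transforms Q_k
h = rng.normal(size=d)

def S_rbf(W, f, gamma=0.1):
    # S(W_v, f) = exp(-gamma * ||W_v - f||^2), computed for all v at once
    return np.exp(-gamma * ((W - f[:, None]) ** 2).sum(axis=0))

pi = softmax(M.T @ h)               # pi_k = softmax over M_k^T h
p = np.zeros(V)
for k in range(K):
    f_k = np.tanh(Q[k].T @ h)       # f_k(h) = tanh(Q_k^T h)
    p += pi[k] * softmax(S_rbf(W, f_k))   # exp(S_k) / sum_v' exp(S_k)
print(p.sum())                      # -> 1.0
```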

SLIDE 16

Methodology: Generalized Softmax Cont.

  • from previous:

$$p(w_v \mid h) = \sum_{k=1}^{K} \pi_k \frac{\exp(S_k(W_v, f_k(h)))}{\sum_{v'=1}^{V} \exp(S_k(W_{v'}, f_k(h)))}$$

  • note:
    – replace the inner product with kernels: $W_v^T h \to S_k(W_v, f_k(h))$
    – replace the single softmax with a mixture: $1 \to \sum_{k=1}^{K} \pi_k$
    – replace the context vector with transformed ones: $h \to f_k(h)$
    – word vectors are shared across components due to memory restrictions
  • motivations:
    – different kernels give different feature spaces
    – based on the context, the model chooses which feature space is suitable
    – for each feature space, the context vector can be different
    – ideally, for each feature space, the word vector could also be different

SLIDE 17

Methodology: Individual Kernels

$$\begin{aligned}
S_{\text{lin}}(W_v, h) &= W_v^T h \\
S_{\text{log}}(W_v, h) &= -\log(\|W_v - h\|^p + 1) \\
S_{\text{pow}}(W_v, h) &= -\|W_v - h\|^p \\
S_{\text{pol}}(W_v, h) &= (\alpha W_v^T h + c)^p \\
S_{\text{rbf}}(W_v, h) &= \exp(-\gamma \|W_v - h\|^2) \\
S_{\text{wav}}(W_v, h) &= \cos\!\left(\frac{\|W_v - h\|_2}{a}\right) \exp\!\left(-\frac{\|W_v - h\|_2}{b}\right) \\
S_{\text{ssg}}(W_v, h) &= \log \int \mathcal{N}(x; \mu_{W_v}, \Sigma_{W_v})\, \mathcal{N}(x; \mu_h, \Sigma_h)\, dx \\
S_{\text{mog}}(W_v, h) &= \sum_{i,j} \log \int \mathcal{N}(x; \mu_{i,W_v}, \Sigma_{i,W_v})\, \mathcal{N}(x; \mu_{j,h}, \Sigma_{j,h})\, dx \\
S_{\text{hpb}}(W_v, h) &= -\operatorname{arcosh}\!\left(1 + \frac{2\|W_v - h\|^2}{(1 - \|W_v\|^2)(1 - \|h\|^2)}\right)
\end{aligned}$$
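Direct numpy transcriptions of a few of these scores for a single pair $(W_v, h)$; the hyperparameter values are placeholders:

```python
import numpy as np

# A handful of the kernel scores above, written out literally.
def S_lin(w, h):                    return w @ h
def S_log(w, h, p=2):               return -np.log(np.linalg.norm(w - h) ** p + 1)
def S_pow(w, h, p=2):               return -np.linalg.norm(w - h) ** p
def S_pol(w, h, a=1.0, c=0.0, p=2): return (a * (w @ h) + c) ** p
def S_rbf(w, h, gamma=0.1):         return np.exp(-gamma * np.linalg.norm(w - h) ** 2)

w, h = np.ones(4), np.zeros(4)
print([float(f(w, h)) for f in (S_lin, S_log, S_pow, S_pol, S_rbf)])
```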

SLIDE 18

Methodology: Individual Kernels Cont.

  • kernel sources:
    – baseline: lin
    – classic (e.g. from SVM): log, pow, pol, rbf, wav
    – from non-Euclidean word embedding: ssg, mog, hpb
  • the dimension-reduction step over d, common to the kernels above, is not immediately executable; therefore (see the sketch after this list):
    – simplify the wavelet kernel
    – use spherical covariance matrices
    – rewrite the power of the vector difference norm:

$$\|W_v - h\|^p = \left(\|W_v\|^2 + \|h\|^2 - 2\,W_v^T h\right)^{\frac{p}{2}}$$

→ $\|W_v - h\|^p$ becomes a vector-norm-regularized version of the inner product
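A sketch of that rewrite: the squared distances to all $V$ word vectors come from the norms and a single matrix product, without materializing a $V \times d$ difference tensor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 8, 100
W = rng.normal(size=(d, V))
h = rng.normal(size=d)

# ||W_v - h||^2 = ||W_v||^2 + ||h||^2 - 2 W_v^T h, for all v at once;
# raise to p/2 afterwards to obtain ||W_v - h||^p.
sq_dist = (W ** 2).sum(axis=0) + h @ h - 2.0 * (W.T @ h)    # shape (V,)

naive = ((W - h[:, None]) ** 2).sum(axis=0)                 # naive expansion
print(np.allclose(sq_dist, naive))                          # -> True
```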

SLIDE 19

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 20

Experiments: Experimental Setup

  • LM on Switchboard:
    – 30K vocabulary
    – 25M training tokens
    – LSTM recurrent NN [Sundermeyer et al., 2012]
  • MT on IWSLT 2014 German→English:
    – 10K joint byte pair encoding merge operations
    – 160K parallel training sentences
    – Transformer NN [Vaswani et al., 2017]
  • overall:
    – grid search over kernel hyperparameters
    – vary K and S_k
    – fairseq toolkit [Ott et al., 2019]

SLIDE 21

Experiments: Performance of Individual Kernels

    Method                                         SWB (PPL)   IWSLT (BLEU [%])
    ------------------------------------------------------------------------
    Ref. [Irie et al., 2018] / [Wu et al., 2019]      47.6          35.2
    lin                                               46.8          34.3
    log                                              103.0           0.4
    pow                                               46.8          32.8
    pol                                               47.3          31.7
    rbf                                              284.9           0.0
    ssg                                               49.9          34.6
    mog                                               46.7          34.2
    hpb                                              122.6           0.3
    wav                                              289.7           0.0

SLIDE 22

Experiments: Gradient Properties

[Figure: kernel score $S(W_v, h)$ plotted against $\|W_v - h\|^p$ for the rbf, wav, log and pow kernels; rbf and wav saturate quickly, log flattens slowly, pow keeps a steep slope]

  • expected performance: rbf ≈ wav < log < pow
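The intuition can be made concrete by looking at the slope of $S$ with respect to the distance; a small sketch (our reconstruction of the argument):

```python
import numpy as np

# Slope of S as a function of r = ||W_v - h||: rbf flattens out quickly,
# so badly placed word vectors receive almost no gradient; pow does not.
r = np.linspace(0.5, 6.0, 200)
S = {
    "rbf": np.exp(-r ** 2),
    "log": -np.log(r ** 2 + 1),
    "pow": -r ** 2,
}
for name, s in S.items():
    grad = np.gradient(s, r)
    print(name, f"{abs(grad[-1]):.4f}")   # |dS/dr| at r = 6
# rbf -> ~0 (vanishing), log -> ~0.32 (moderate), pow -> ~12 (non-vanishing)
```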

SLIDE 23

Experiments: Performance of Individual Kernels Revisited

    Method                                         SWB (PPL)   IWSLT (BLEU [%])
    ------------------------------------------------------------------------
    Ref. [Irie et al., 2018] / [Wu et al., 2019]      47.6          35.2
    lin                                               46.8          34.3
    log                                              103.0           0.4
    pow                                               46.8          32.8
    pol                                               47.3          31.7
    rbf                                              284.9           0.0
    ssg                                               49.9          34.6
    mog                                               46.7          34.2
    hpb                                              122.6           0.3
    wav                                              289.7           0.0

SLIDE 24

Experiments: Mixture of Kernels

  • encourage equal contribution of the mixture components, similar to [Takase et al., 2018] (sketched in code after the table):

$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{ce}} + \frac{\rho}{N} \sum^{N} \mathrm{Var}(\pi_k)$$

  • compromise between regularization and performance → ρ = 0.1

    ρ       Variance   PPL
    ----------------------
    0.001   4.74       46.8
    0.01    4.98       46.6
    0.1     3.67       47.2
    1       3.81       47.4
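A minimal sketch of the regularizer, assuming the penalty is the variance of the K mixture weights summed over the N positions and scaled by ρ/N:

```python
import numpy as np

# pi would come from the model; random Dirichlet draws stand in here.
rng = np.random.default_rng(0)
N, K = 32, 4
pi = rng.dirichlet(np.ones(K), size=N)   # N x K, rows sum to 1
rho = 0.1

L_ce = 3.5                               # stand-in cross-entropy value
L_reg = L_ce + rho / N * pi.var(axis=1).sum()
print(L_reg)
```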

SLIDE 25

Experiments: Mixture of Kernels Cont.

    Name     Mixture settings          PPL
    --------------------------------------
    mos*     9×lin                     47.8
    mixbig   1 of each kernel          47.1
    mix1     lin, log, rbf, hpb, wav   46.6
    mix2     3×lin, log                46.5
    mix3     lin, log, pow, pol        47.3
    mix4     lin, log, rbf, hpb        47.1
    mix5     2×lin, 2×rbf              46.7

* tricks like activation regularization and averaged stochastic gradient descent are not applied

  • no major performance differences; consistent observations:
    – at least one lin mixture component
    – consistently, a total mixture weight of more than 50% in the lin component
    – projection matrices are tied across mixture components

SLIDE 26

Experiments: Disambiguation Abilities

  • automatic extraction of word clusters (a sketch follows the list below):
    – recall the dog and cat example
    – extract the projection matrix
    – calculate all pairwise distances between word vectors
    – for each word, sum the distances to its five closest neighbors
    – sort the results and pick the words with the smallest values
    – make sure each word in the clusters appears at least 500 times
  • three extracted clusters:
    – {quickly, slowly, soon, quick, easily}
    – {democrat, republicans, politicians, republican, democrats}
    – {hamster, hamsters, parakeets, rabbits, turtles}
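A numpy sketch of this extraction recipe (our reimplementation; the frequency filter is left as a comment):

```python
import numpy as np

# Score each word by the summed distance to its five nearest neighbors
# in the output embedding space; the tightest neighborhoods form clusters.
rng = np.random.default_rng(0)
V, d = 1000, 16
W = rng.normal(size=(V, d))         # rows = word vectors (from the trained model)

sq_norms = (W ** 2).sum(axis=1)
dist2 = sq_norms[:, None] + sq_norms[None, :] - 2 * W @ W.T   # pairwise squared distances
np.fill_diagonal(dist2, np.inf)     # ignore self-distance

score = np.sort(dist2, axis=1)[:, :5].sum(axis=1)   # per-word tightness
candidates = np.argsort(score)[:20] # word ids with the smallest values
print(candidates)
# additionally, keep only words that appear at least 500 times in the data
```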

SLIDE 27

Experiments: Disambiguation Abilities Cont.

    Model          Prediction
    Ground Truth   ... books can end up being outdated very quickly
    lin            ... books can end up being outdated very soon
    mixbig         ... books can end up being outdated very quickly

    Ground Truth   ... if you vote for a republican or vote for a democrat
    lin            ... if you vote for a republican or vote for a republican
    mixbig         ... if you vote for a republican or vote for a democrat

    Ground Truth   ... well we had some turtles and a hamster
    lin            ... well we had some turtles and a turtle
    mixbig         ... well we had some turtles and a hamster

  • manual inspection of appearances of words in the clusters
  • examples where mixbig is right and lin is wrong
  • need a more systematic way to judge if a model is better at word disambiguation

SLIDE 28

Agenda

  • Background
  • Methodology
  • Experiments
  • Conclusion

SLIDE 29

Conclusion: Conclusion and Outlook

  • softmax bottleneck: theory and intuition
  • generalized softmax: introducing kernels and mixtures of kernels
  • individual kernels: 9 different kernels; lin, pol, pow, ssg and mog perform the best
  • gradient properties: a justification of why some kernels perform better than others
  • mixture of kernels: consistently large mixture weight on lin when sharing word vectors
  • disambiguation abilities: interesting observations, but a more systematic evaluation is still lacking
  • outlook: untie the word embeddings

SLIDE 30

Thank you for your attention!

Any questions?

SLIDE 31

References

Athiwaratkun, B. and Wilson, A. G. (2017). Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1645–1656, Vancouver, Canada.

Boughorbel, S., Tarel, J.-P., and Boujemaa, N. (2005). Conditionally positive definite kernels for SVM based image recognition. In 2005 IEEE International Conference on Multimedia and Expo, pages 113–116. IEEE.

Cho, Y. and Saul, L. K. (2009). Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342–350.

Dhingra, B., Shallue, C., Norouzi, M., Dai, A., and Dahl, G. (2018). Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), pages 59–69.

Eric, K. (2019). Illustration of the kernel trick. http://www.eric-kim.net/eric-kim-net/posts/1/kernel_trick.html. Accessed: 2019.10.31.

Herold, C., Gao, Y., and Ney, H. (2018). Improving neural language models with weight norm initialization and regularization. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 93–100.

Irie, K., Lei, Z., Deng, L., Schlüter, R., and Ney, H. (2018). Investigation on estimation of sentence probability by combining forward, backward and bi-directional LSTM-RNNs. In Proc. Interspeech 2018, pages 392–395.

SLIDE 32

References (cont.)

Kanai, S., Fujiwara, Y., Yamanaka, Y., and Adachi, S. (2018). Sigsoftmax: Reanalysis of the softmax bottleneck. In 32nd Conference on Neural Information Processing Systems (NeurIPS), Montréal, Canada.

Lin, H.-T. and Lin, C.-J. (2003). A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Technical report.

Memisevic, R., Zach, C., Pollefeys, M., and Hinton, G. E. (2010). Gated softmax classification. In Advances in Neural Information Processing Systems, pages 1603–1611.

Nickel, M. and Kiela, D. (2017). Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347.

Ott, M., Edunov, S., Baevski, A., Fan, A., Gross, S., Ng, N., Grangier, D., and Auli, M. (2019). fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, MN, USA.

Sundermeyer, M., Schlüter, R., and Ney, H. (2012). LSTM neural networks for language modeling. In Thirteenth Annual Conference of the International Speech Communication Association, pages 194–197, Portland, OR, USA.

Takase, S., Suzuki, J., and Nagata, M. (2018). Direct output connection for a high-rank language model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4599–4609.

SLIDE 33

References (cont.)

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Vilnis, L. and McCallum, A. (2015). Word representations via Gaussian embedding. In 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA.

Wu, F., Fan, A., Baevski, A., Dauphin, Y. N., and Auli, M. (2019). Pay less attention with lightweight and dynamic convolutions. In Seventh International Conference on Learning Representations (ICLR), New Orleans, LA, USA.

Yang, Z., Dai, Z., Salakhutdinov, R., and Cohen, W. W. (2018). Breaking the softmax bottleneck: A high-rank RNN language model. In Sixth International Conference on Learning Representations (ICLR), Vancouver, Canada.

Zhu, J. and Hastie, T. (2002). Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, pages 1081–1088.
