SLIDE 1

Improving Neural Language Models with Weight Norm Initialization and Regularization

Christian Herold∗ Yingbo Gao∗ Hermann Ney

<surname>@i6.informatik.rwth-aachen.de

October 31st, 2018

Human Language Technology and Pattern Recognition
Computer Science Department
RWTH Aachen University

Third Conference on Machine Translation (WMT18)
Brussels, Belgium

* Equal Contribution

SLIDE 2

Outline

◮ Introduction
◮ Neural Language Modeling
◮ Weight Norm Initialization
◮ Weight Norm Regularization
◮ Experiments
◮ Conclusion

SLIDE 3

Introduction

◮ Task: Given a word sequence x_1^J = x_1 x_2 ... x_J, find the probability P(x_1^J) of that sequence

    P(x_1^J) = \prod_{j=1}^{J} P(x_j | x_1^{j-1})    (1)

◮ The perplexity is defined as

    ppl = P(x_1^J)^{-1/J}    (2)

  (a small worked example follows below)

◮ The language model (LM) is trained on monolingual text data
◮ LMs are used in a variety of machine translation tasks:
  ⊲ Corpus filtering [Rossenbach & Rosendahl+ 18]
  ⊲ Unsupervised neural machine translation (NMT) [Kim & Geng+ 18]
  ⊲ Incorporation into NMT training [Stahlberg & Cross+ 18]
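As a worked example of Eqs. (1) and (2), the following Python snippet computes the sequence probability and the resulting perplexity; the conditional probabilities are made up purely for illustration.

    import math

    # Hypothetical conditional probabilities P(x_j | x_1^{j-1}) for a 4-word sequence.
    token_probs = [0.20, 0.05, 0.10, 0.25]

    # Eq. (1): the sequence probability is the product of the conditional probabilities.
    seq_prob = math.prod(token_probs)            # 2.5e-4

    # Eq. (2): perplexity is the inverse geometric mean of the conditional probabilities.
    ppl = seq_prob ** (-1.0 / len(token_probs))
    print(f"perplexity = {ppl:.2f}")             # ~7.95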

SLIDE 4

Neural Language Modeling

◮ The context x_1^{j-1} for predicting the next word x_j is encoded into a vector h_j

    P(x_1^J) = \prod_{j=1}^{J} P(x_j | x_1^{j-1}) = \prod_{j=1}^{J} P(x_j | h_j)    (3)

◮ A popular choice for a neural language model (NLM) is the long short-term memory recurrent neural network (LSTM RNN)

    h_j = LSTM([E^T \cdot x_1, E^T \cdot x_2, ..., E^T \cdot x_{j-1}])    (4)

  where x_1, x_2, ..., x_{j-1} are one-hot vectors and E is the embedding matrix
◮ P(x_j = x_k | h_j), k = 1, 2, ..., V, is calculated with a softmax activation

    P(x_j = x_k | h_j) = \frac{\exp(W_k \cdot h_j)}{\sum_{k'=1}^{V} \exp(W_{k'} \cdot h_j)}    (5)

  where W_k is the k-th row of the projection matrix and V is the vocabulary size
  (a minimal model sketch follows below)
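A minimal PyTorch sketch of Eqs. (3)-(5) follows. It is not the authors' model (which uses three LSTM layers and a Mixture of Softmaxes); the class and variable names are illustrative, and the embedding and projection matrices are tied as on the experimental-setup slide.

    import torch
    import torch.nn as nn

    class LSTMLanguageModel(nn.Module):
        def __init__(self, vocab_size, embed_dim):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)            # E
            self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)     # encodes the context into h_j
            self.projection = nn.Linear(embed_dim, vocab_size, bias=False)  # rows are the W_k
            self.projection.weight = self.embedding.weight                  # tie embedding and projection

        def forward(self, tokens):
            # tokens: (batch, seq_len) word indices; h_j summarizes the context x_1^{j-1}
            h, _ = self.lstm(self.embedding(tokens))
            return self.projection(h)  # unnormalized scores; a softmax over them gives Eq. (5)

    model = LSTMLanguageModel(vocab_size=10000, embed_dim=256)
    logits = model(torch.randint(0, 10000, (2, 35)))   # shape (2, 35, 10000)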

SLIDE 5

Related Work

◮ LSTM RNNs for language modeling: [Melis & Dyer+ 18], [Merity & Keskar+ 17]
◮ Mixture of Softmaxes: [Yang & Dai+ 18]
◮ Using word frequency information for embeddings: [Chen & Si+ 18], [Baevski & Auli 18]
◮ Tying the embedding and projection matrices: [Press & Wolf 17], [Inan & Khosravi+ 17]
◮ Weight normalization reparametrization: [Salimans & Kingma 16]

SLIDE 6

Weight Norm Initialization

[Figure: two panels, Penn Treebank and WikiText-2]

◮ Idea: Initialize the norms of W_k with the scaled logarithm of the word counts

    W_k = \sigma \log(c_k) \cdot \frac{W_k}{\lVert W_k \rVert_2}    (6)

  where c_k denotes the unigram word count for the k-th word and σ is a scalar
  (an initialization sketch follows below)
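A sketch of how Eq. (6) could be applied in PyTorch before training. The function name, the toy counts, and σ = 1.0 are assumptions for illustration, not values from the talk.

    import torch

    def weight_norm_init(W, counts, sigma=1.0):
        """Rescale each row W_k to have norm sigma * log(c_k), as in Eq. (6)."""
        target_norms = sigma * torch.log(counts.float())       # sigma * log(c_k), one value per word
        current_norms = W.norm(dim=1, keepdim=True)            # ||W_k||_2
        return W * (target_norms.unsqueeze(1) / current_norms)

    # Usage sketch: rescale a randomly initialized (tied) projection matrix once before training.
    W = torch.randn(10000, 256)
    counts = torch.randint(2, 50000, (10000,))   # hypothetical unigram counts c_k
    with torch.no_grad():
        W_init = weight_norm_init(W, counts, sigma=1.0)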

SLIDE 7

Weight Norm Regularization

◮ Established method (L2 regularization): regularize every learnable weight w in the network equally by adding a term to the loss function L_0

    L = L_0 + \frac{\lambda}{2} \sum_{w} \lVert w \rVert_2^2    (7)

  where λ is a scalar
◮ Idea: Regularize the norms of W_k to approach a certain value ν

    L_{wnr} = L_0 + \rho \sum_{k=1}^{V} (\lVert W_k \rVert_2 - \nu)^2    (8)

  where ν, ρ ≥ 0 are two scalars
  (a loss-term sketch follows below)
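A sketch of the penalty in Eq. (8) as it might be added to the training loss. The function name is illustrative; the default ρ and ν are the values later tuned on PTB in the talk.

    import torch

    def weight_norm_regularizer(W, rho=1e-3, nu=2.0):
        """Penalty of Eq. (8): rho * sum_k (||W_k||_2 - nu)^2 over the rows of the projection matrix."""
        row_norms = W.norm(dim=1)                   # ||W_k||_2 for k = 1, ..., V
        return rho * ((row_norms - nu) ** 2).sum()

    # During training, the penalty is added to the usual cross-entropy loss L_0, e.g.:
    # loss = cross_entropy + weight_norm_regularizer(model.projection.weight)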

SLIDE 8

Experimental Setup

◮ Datasets: Penn Treebank (PTB) and WikiText-2 (WT2)

                     Tokens    Vocab size
    Penn Treebank
      Train           888k        10k
      Valid            70k
      Test             79k
    WikiText-2
      Train           2.1M        33k
      Valid           214k
      Test            241k

◮ Network structure:
  ⊲ Three-layer LSTM with Mixture of Softmaxes [Yang & Dai+ 18]
  ⊲ Embedding and projection matrices are tied

SLIDE 9

Weight Norm Initialization

    Penn Treebank
      epoch   wni ppl   baseline ppl   ppl reduction (%)
         1     162.18       180.72         10.26
        10      85.92        92.09          6.70
        20      73.36        78.94          7.07
        30      71.44        73.06          2.22
        40      69.27        70.20          1.32

    WikiText-2
      epoch   wni ppl   baseline ppl   ppl reduction (%)
         1     172.19       192.19         10.41
        10      95.90       100.72          4.79
        20      85.14        88.21          3.48
        30      81.80        82.70          1.09
        40      79.28        80.32          1.29

SLIDE 10

Weight Norm Regularization

◮ Perform a grid search on the PTB dataset over the hyperparameters ρ and ν
◮ With the tuned hyperparameters ρ = 1.0 × 10^-3 and ν = 2.0, improvements over the baseline are achieved on both PTB and WT2

    Penn Treebank (perplexity)
      Model                     #Params   Validation   Test
      [Yang & Dai+ 18]            22M        56.54     54.44
      [Yang & Dai+ 18] + WNR      22M        55.03     53.16

    WikiText-2 (perplexity)
      Model                     #Params   Validation   Test
      [Yang & Dai+ 18]            35M        63.88     61.45
      [Yang & Dai+ 18] + WNR      35M        62.67     60.13

SLIDE 11

Weight Norm Regularization

[Figure: distribution of the vector norms (x-axis: vector norm, y-axis: probability [%]) for the baseline and baseline + WNR models, on Penn Treebank and WikiText-2]

◮ Our regularization method leads to a more concentrated distribution of weight norms around ν
◮ We still observe a few words with significantly higher norms

SLIDE 12

Conclusion

◮ The embedding and projection matrices are important components of an NLM
◮ The vector norms are related to the word frequencies
◮ Initializing and/or regularizing these vector norms can improve the NLM
◮ Future work:
  ⊲ Test the methods on larger data sets
  ⊲ Regularize the vector norms towards the word counts
  ⊲ Study the angles between the word vectors
  ⊲ Transfer to machine translation

SLIDE 13

Thank you for your attention

Christian Herold Yingbo Gao Hermann Ney <surname>@i6.informatik.rwth-aachen.de http://www-i6.informatik.rwth-aachen.de/

SLIDE 14

References

[Baevski & Auli 18] A. Baevski, M. Auli: Adaptive Input Representations for Neural Language Modeling. arXiv preprint arXiv:1809.10853, 2018.

[Chen & Si+ 18] P.H. Chen, S. Si, Y. Li, C. Chelba, C.J. Hsieh: GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking. arXiv preprint arXiv:1806.06950, 2018.

[Graca & Kim+ 18] M. Graca, Y. Kim, J. Schamper, J. Geng, H. Ney: The RWTH Aachen University English-German and German-English Unsupervised Neural Machine Translation Systems for WMT 2018. WMT 2018, 2018.

[Inan & Khosravi+ 17] H. Inan, K. Khosravi, R. Socher: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In International Conference on Learning Representations, 2017.

SLIDE 15

References (continued)

[Kim & Geng+ 18] Y. Kim, J. Geng, H. Ney: Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder. EMNLP 2018, 2018.

[Melis & Dyer+ 18] G. Melis, C. Dyer, P. Blunsom: On the State of the Art of Evaluation in Neural Language Models. In International Conference on Learning Representations, 2018.

[Merity & Keskar+ 17] S. Merity, N.S. Keskar, R. Socher: Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

[Press & Wolf 17] O. Press, L. Wolf: Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 157–163, 2017.

SLIDE 16

References (continued)

[Rossenbach & Rosendahl+ 18] N. Rossenbach, J. Rosendahl, Y. Kim, M. Graca, A. Gokrani, H. Ney: The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task. WMT 2018, 2018.

[Salimans & Kingma 16] T. Salimans, D.P. Kingma: Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

[Stahlberg & Cross+ 18] F. Stahlberg, J. Cross, V. Stoyanov: Simple Fusion: Return of the Language Model. arXiv preprint arXiv:1809.00125, 2018.

[Yang & Dai+ 18] Z. Yang, Z. Dai, R. Salakhutdinov, W.W. Cohen: Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In International Conference on Learning Representations, 2018.
