 
Improving Neural Language Models with Weight Norm Initialization and Regularization

Christian Herold*, Yingbo Gao*, Hermann Ney
<surname>@i6.informatik.rwth-aachen.de
October 31st, 2018

Human Language Technology and Pattern Recognition
Computer Science Department
RWTH Aachen University

Third Conference on Machine Translation (WMT18)
Brussels, Belgium

* Equal Contribution

Herold et al.: Improving NLMs with Weight Norm Initialization and Regularization 1/13 31.10.2018

Outline
◮ Introduction
◮ Neural Language Modeling
◮ Weight Norm Initialization
◮ Weight Norm Regularization
◮ Experiments
◮ Conclusion

Introduction
◮ Task: Given a word sequence $x_1^J = x_1 x_2 \ldots x_J$, find the probability $P(x_1^J)$ of that sequence
  $$P(x_1^J) = \prod_{j=1}^{J} P(x_j \mid x_1^{j-1}) \tag{1}$$
◮ The perplexity is defined as
  $$\mathrm{ppl} = P(x_1^J)^{-\frac{1}{J}} \tag{2}$$
◮ The language model (LM) is trained on monolingual text data
◮ LMs are used in a variety of machine translation tasks:
  ⊲ Corpus filtering [Rossenbach & Rosendahl + 18]
  ⊲ Unsupervised neural machine translation (NMT) [Kim & Geng + 18]
  ⊲ Incorporation into NMT training [Stahlberg & Cross + 18]

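As a small illustration of Eqs. (1) and (2), the sketch below computes the perplexity of a sequence from its per-word conditional probabilities. The function names and the example probabilities are hypothetical and chosen only for illustration; the computation is done in log space, which is a common choice for numerical stability rather than something the slides prescribe.

```python
import math

def sequence_log_prob(cond_probs):
    """log P(x_1^J): sum of log P(x_j | x_1^{j-1}) over the sequence, per Eq. (1)."""
    return sum(math.log(p) for p in cond_probs)

def perplexity(cond_probs):
    """ppl = P(x_1^J)^(-1/J), per Eq. (2), evaluated in log space."""
    num_words = len(cond_probs)
    return math.exp(-sequence_log_prob(cond_probs) / num_words)

# Hypothetical conditional probabilities for a 4-word sequence.
example_probs = [0.25, 0.25, 0.25, 0.25]
example_ppl = perplexity(example_probs)
print(example_ppl)
```

A model that assigns every word the same probability 1/N has perplexity N, which is why perplexity is often read as an effective branching factor.
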
Neural Language Modeling
◮ The context $x_1^{j-1}$ for predicting the next word $x_j$ is encoded into a vector $h_j$
  $$P(x_1^J) = \prod_{j=1}^{J} P(x_j \mid x_1^{j-1}) = \prod_{j=1}^{J} P(x_j \mid h_j) \tag{3}$$
◮ A popular choice for a neural language model (NLM) is the long short-term memory recurrent neural network (LSTM RNN)
  $$h_j = \mathrm{LSTM}([E^T \cdot x_1, E^T \cdot x_2, \ldots, E^T \cdot x_{j-1}]) \tag{4}$$
  where $x_1, x_2, \ldots, x_{j-1}$ are 1-hot vectors and $E$ is the embedding matrix
◮ $P(x_j = x_k \mid h_j)$, $k = 1, 2, \ldots, V$, is calculated with a softmax activation
  $$P(x_j = x_k \mid h_j) = \frac{\exp(W_k \cdot h_j)}{\sum_{k'=1}^{V} \exp(W_{k'} \cdot h_j)} \tag{5}$$
  where $W_k$ is the $k$-th row of the projection matrix and $V$ is the vocabulary size

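The softmax in Eq. (5) can be sketched in a few lines of plain Python. This is only an illustration of the single-softmax output layer: the matrix `W_example`, the context vector `h_example`, and the function name are hypothetical, and subtracting the maximum logit is a standard stability trick that leaves the probabilities unchanged.

```python
import math

def softmax_probs(W, h):
    """Return P(x_j = x_k | h_j) for k = 1..V, with W given as a list of V rows."""
    logits = [sum(w_i * h_i for w_i, h_i in zip(row, h)) for row in W]
    max_logit = max(logits)  # shift logits for numerical stability
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical projection matrix (V = 3 words, hidden size 2) and context vector.
W_example = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_example = [0.5, -0.5]
p_example = softmax_probs(W_example, h_example)
print(p_example)
```

Because each logit is the dot product $W_k \cdot h_j$, scaling the norm of a row $W_k$ directly scales how strongly that word's probability responds to the context.
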
Related Work
LSTM RNNs for language modeling:
◮ [Melis & Dyer + 18], [Merity & Keskar + 17]
Mixture of Softmaxes:
◮ [Yang & Dai + 18]
Using word frequency information for embedding:
◮ [Chen & Si + 18], [Baevski & Auli 18]
Tying the embedding and projection matrices:
◮ [Press & Wolf 17], [Inan & Khosravi + 17]
Weight normalization reparametrization:
◮ [Salimans & Kingma 16]

Weight Norm Initialization
[Figure with two panels: Penn Treebank, WikiText-2]
◮ Idea: Initialize the norms of $W_k$ with the scaled logarithm of the word counts
  $$W_k = \sigma \log(c_k) \frac{W_k}{\|W_k\|_2} \tag{6}$$
  where $c_k$ denotes the unigram word count of the $k$-th word and $\sigma$ is a scalar

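A minimal sketch of the initialization in Eq. (6): each row keeps its direction and is rescaled so that its L2 norm equals sigma times the log of the word's unigram count. The helper names, the toy matrix, the counts, and the value of sigma are all hypothetical. The sketch also assumes nonzero rows and counts greater than 1, since a count of 1 gives a target norm of zero; the slides do not say how such cases are handled.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def weight_norm_init(W, counts, sigma):
    """Rescale each row W_k to norm sigma * log(c_k), preserving its direction."""
    rescaled = []
    for row, count in zip(W, counts):
        target_norm = sigma * math.log(count)
        current_norm = l2_norm(row)  # assumed nonzero (e.g. random initialization)
        rescaled.append([target_norm * x / current_norm for x in row])
    return rescaled

# Hypothetical 2-word vocabulary with unigram counts 100 and 10.
W_raw = [[3.0, 4.0], [1.0, 0.0]]
word_counts = [100, 10]
W_init = weight_norm_init(W_raw, word_counts, sigma=0.5)
print([l2_norm(row) for row in W_init])
```

With these toy values, the more frequent word ends up with a norm twice as large as the rarer one, because log(100) = 2 log(10).
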
Weight Norm Regularization
◮ Established method (L2-Regularization): Regularize every learnable weight $w$ in the network equally by adding a term to the loss function $L_0$
  $$L = L_0 + \frac{\lambda}{2} \sum_{w} (\|w\|_2)^2 \tag{7}$$
  where $\lambda$ is a scalar
◮ Idea: Regularize the norms of $W_k$ to approach a certain value $\nu$
  $$L_{wnr} = L_0 + \rho \sqrt{\sum_{k=1}^{V} (\|W_k\|_2 - \nu)^2} \tag{8}$$
  where $\nu, \rho \geq 0$ are two scalars

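The two penalties can be contrasted with a short sketch. It assumes the weight norm regularization term is rho times the square root of the summed squared deviations of the row norms from nu, and that the L2 term is lambda/2 times the sum of squared weights, applied here to a single matrix for brevity. All names and numeric values are hypothetical.

```python
import math

def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))

def l2_penalty(W, lam):
    """(lambda / 2) * sum of squared weights: pulls every weight toward zero."""
    return 0.5 * lam * sum(x * x for row in W for x in row)

def wnr_penalty(W, nu, rho):
    """rho * sqrt(sum_k (||W_k||_2 - nu)^2): pulls every row norm toward nu."""
    squared_dev = sum((l2_norm(row) - nu) ** 2 for row in W)
    return rho * math.sqrt(squared_dev)

# Hypothetical projection matrix with row norms 5 and 2.
W_toy = [[3.0, 4.0], [0.0, 2.0]]
base_loss = 4.0  # placeholder for the unregularized loss L_0
loss_wnr = base_loss + wnr_penalty(W_toy, nu=2.0, rho=1.0e-3)
loss_l2 = base_loss + l2_penalty(W_toy, lam=0.1)
print(loss_wnr, loss_l2)
```

In this toy case only the first row contributes to the weight norm penalty, since the second row already has norm nu, whereas the L2 penalty charges both rows.
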
Experimental Setup
◮ Datasets: Penn Treebank (PTB) and WikiText-2 (WT2)

  Dataset        Split  Tokens  Vocab Size
  Penn Treebank  Train  888k    10k
                 Valid  70k
                 Test   79k
  WikiText-2     Train  2.1M    33k
                 Valid  214k
                 Test   241k

◮ Network structure:
  ⊲ Three-layer LSTM with Mixture of Softmaxes [Yang & Dai + 18]
  ⊲ Embedding and projection matrices are tied

Weight Norm Initialization

  Penn Treebank
  epoch  wni ppl  baseline ppl  ppl reduction (%)
  1      162.18   180.72        10.26
  10     85.92    92.09         6.70
  20     73.36    78.94         7.07
  30     71.44    73.06         2.22
  40     69.27    70.20         1.32

  WikiText-2
  epoch  wni ppl  baseline ppl  ppl reduction (%)
  1      172.19   192.19        10.41
  10     95.90    100.72        4.79
  20     85.14    88.21         3.48
  30     81.80    82.70         1.09
  40     79.28    80.32         1.29

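The "ppl reduction (%)" column appears to be the relative improvement over the baseline, that is, 100 * (baseline - wni) / baseline. The short check below reproduces the epoch-1 entries of both tables under that assumption; the function name is hypothetical.

```python
def ppl_reduction(baseline_ppl, wni_ppl):
    """Relative perplexity reduction over the baseline, in percent."""
    return 100.0 * (baseline_ppl - wni_ppl) / baseline_ppl

# Epoch-1 values taken from the Penn Treebank and WikiText-2 tables.
ptb_epoch1 = ppl_reduction(180.72, 162.18)
wt2_epoch1 = ppl_reduction(192.19, 172.19)
print(round(ptb_epoch1, 2), round(wt2_epoch1, 2))
```
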
Weight Norm Regularization
◮ Perform grid search on the PTB dataset over the hyperparameters $\rho$ and $\nu$
◮ With the tuned hyperparameters $\rho = 1.0 \times 10^{-3}$ and $\nu = 2.0$, improvements over the baseline are achieved on both PTB and WT2

  Penn Treebank
  Model                    #Params  Validation  Test
  [Yang & Dai + 18]        22M      56.54       54.44
  [Yang & Dai + 18] + WNR  22M      55.03       53.16

  WikiText-2
  Model                    #Params  Validation  Test
  [Yang & Dai + 18]        35M      63.88       61.45
  [Yang & Dai + 18] + WNR  35M      62.67       60.13

Weight Norm Regularization
[Figure: histograms of vector norms for baseline and baseline + WNR; x-axis: vector norm, y-axis: probability [%]; panels: Penn Treebank, WikiText-2]
◮ Our regularization method leads to a more concentrated distribution of weight norms around $\nu$
◮ We still observe a few words with significantly higher norms

Conclusion
◮ Embedding and projection matrices are important components of an NLM
◮ Vector norms are related to the word frequencies
◮ Initializing and/or regularizing these vector norms can improve the NLM
◮ Future work
  ⊲ Test methods on larger data sets
  ⊲ Regularize the vector norms to word counts
  ⊲ Study the angles between the word vectors
  ⊲ Transfer to machine translation

Thank you for your attention

Christian Herold, Yingbo Gao, Hermann Ney
<surname>@i6.informatik.rwth-aachen.de
http://www-i6.informatik.rwth-aachen.de/

References
[Baevski & Auli 18] A. Baevski, M. Auli: Adaptive Input Representations for Neural Language Modeling. arXiv preprint arXiv:1809.10853, 2018.
[Chen & Si + 18] P.H. Chen, S. Si, Y. Li, C. Chelba, C.j. Hsieh: GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking. arXiv preprint arXiv:1806.06950, 2018.
[Graca & Kim + 18] M. Graca, Y. Kim, J. Schamper, J. Geng, H. Ney: The RWTH Aachen University English-German and German-English Unsupervised Neural Machine Translation Systems for WMT 2018. WMT 2018, 2018.
[Inan & Khosravi + 17] H. Inan, K. Khosravi, R. Socher: Tying word vectors and word classifiers: A loss framework for language modeling. In International Conference on Learning Representations, 2017.
[Kim & Geng + 18] Y. Kim, J. Geng, H. Ney: Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder. EMNLP 2018, 2018.
[Melis & Dyer + 18] G. Melis, C. Dyer, P. Blunsom: On the state of the art of evaluation in neural language models. In International Conference on Learning Representations, 2018.
[Merity & Keskar + 17] S. Merity, N.S. Keskar, R. Socher: Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.
[Press & Wolf 17] O. Press, L. Wolf: Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 157–163, 2017.
[Rossenbach & Rosendahl + 18] N. Rossenbach, J. Rosendahl, Y. Kim, M. Graca, A. Gokrani, H. Ney: The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task. WMT 2018, 2018.
[Salimans & Kingma 16] T. Salimans, D.P. Kingma: Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
[Stahlberg & Cross + 18] F. Stahlberg, J. Cross, V. Stoyanov: Simple Fusion: Return of the Language Model. arXiv preprint arXiv:1809.00125, 2018.
[Yang & Dai + 18] Z. Yang, Z. Dai, R. Salakhutdinov, W.W. Cohen: Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations, 2018.