

  1. Improving Neural Language Models with Weight Norm Initialization and Regularization
     Christian Herold*, Yingbo Gao*, Hermann Ney
     <surname>@i6.informatik.rwth-aachen.de
     Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University
     Third Conference on Machine Translation (WMT18), Brussels, Belgium, October 31st, 2018
     * Equal contribution

  2. Outline
     ◮ Introduction
     ◮ Neural Language Modeling
     ◮ Weight Norm Initialization
     ◮ Weight Norm Regularization
     ◮ Experiments
     ◮ Conclusion

  3. Introduction
     ◮ Task: Given a word sequence x_1^J = x_1 x_2 ... x_J, find the probability P(x_1^J) of that sequence
       P(x_1^J) = \prod_{j=1}^{J} P(x_j | x_1^{j-1})    (1)
     ◮ The perplexity is defined as
       ppl = P(x_1^J)^{-1/J}    (2)
     ◮ The language model (LM) is trained on monolingual text data
     ◮ LMs are used in a variety of machine translation tasks:
       ⊲ Corpus filtering [Rossenbach & Rosendahl+ 18]
       ⊲ Unsupervised neural machine translation (NMT) [Kim & Geng+ 18]
       ⊲ Incorporation into NMT training [Stahlberg & Cross+ 18]
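To make Equations (1) and (2) concrete, here is a minimal sketch (not from the slides) of how sequence probability and perplexity could be computed from the per-token conditional probabilities; the function name and the example numbers are hypothetical.

```python
import math

def perplexity(token_probs):
    """Compute perplexity from the conditional probabilities P(x_j | x_1^{j-1})
    of each token in a sequence: ppl = P(x_1^J)^(-1/J), done in log space
    for numerical stability."""
    J = len(token_probs)
    log_prob = sum(math.log(p) for p in token_probs)  # log P(x_1^J)
    return math.exp(-log_prob / J)

# Hypothetical example: a 4-token sequence with these model probabilities.
print(perplexity([0.2, 0.05, 0.1, 0.3]))  # ~7.6
```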

  4. Neural Language Modeling
     ◮ The context x_1^{j-1} for predicting the next word x_j is encoded into a vector h_j
       P(x_1^J) = \prod_{j=1}^{J} P(x_j | x_1^{j-1}) = \prod_{j=1}^{J} P(x_j | h_j)    (3)
     ◮ A popular choice for a neural language model (NLM) is the long short-term memory recurrent neural network (LSTM RNN)
       h_j = LSTM([E^T \cdot x_1, E^T \cdot x_2, ..., E^T \cdot x_{j-1}])    (4)
       where x_1, x_2, ..., x_{j-1} are one-hot vectors and E is the embedding matrix
     ◮ P(x_j = x_k | h_j), k = 1, 2, ..., V is calculated with a softmax activation
       P(x_j = x_k | h_j) = \frac{\exp(W_k \cdot h_j)}{\sum_{k'=1}^{V} \exp(W_{k'} \cdot h_j)}    (5)
       where W_k is the k-th row of the projection matrix and V is the vocabulary size
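The structure described by Equations (3)-(5) can be sketched roughly as follows in PyTorch. This is not the authors' implementation; the class name and the embedding/hidden sizes are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Minimal LSTM language model: embed -> LSTM -> projection -> softmax."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)   # E
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.projection = nn.Linear(hidden_dim, vocab_size)    # rows are W_k

    def forward(self, x):
        # x: (batch, seq_len) word indices; indexing the embedding table
        # is equivalent to applying E^T to one-hot vectors.
        emb = self.embedding(x)
        h, _ = self.lstm(emb)                      # h_j for every position j
        logits = self.projection(h)                # W_k . h_j for all k
        return torch.log_softmax(logits, dim=-1)   # log P(x_j = x_k | h_j)
```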

  5. Related Work
     ◮ LSTM RNNs for language modeling: [Melis & Dyer+ 18], [Merity & Keskar+ 17]
     ◮ Mixture of softmaxes: [Yang & Dai+ 18]
     ◮ Using word frequency information for embedding: [Chen & Si+ 18], [Baevski & Auli 18]
     ◮ Tying the embedding and projection matrices: [Press & Wolf 17], [Inan & Khosravi+ 17]
     ◮ Weight normalization reparameterization: [Salimans & Kingma 16]

  6. Weight Norm Initialization
     [Figure: Penn Treebank (left) and WikiText-2 (right)]
     ◮ Idea: Initialize the norms of W_k with the scaled logarithm of the word counts
       W_k = \sigma \log(c_k) \frac{W_k}{\|W_k\|_2}    (6)
       where c_k denotes the unigram word count for the k-th word and σ is a scalar
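A minimal sketch of this initialization (Equation (6)), assuming the projection matrix is a NumPy array and that unigram counts are taken from the training corpus; the function name, variable names, and the value of σ are mine, not from the paper.

```python
import numpy as np

def weight_norm_init(W, counts, sigma=1.0):
    """Rescale each row W_k so that ||W_k||_2 = sigma * log(c_k),
    keeping its direction (Equation (6))."""
    counts = np.asarray(counts, dtype=np.float64)      # unigram counts c_k, aligned with rows of W
    norms = np.linalg.norm(W, axis=1, keepdims=True)   # current ||W_k||_2
    target = sigma * np.log(counts)[:, None]           # desired norms
    return W / norms * target

# Hypothetical usage: V x d projection matrix and per-word training counts.
V, d = 10000, 256
W = np.random.randn(V, d) * 0.01
counts = np.random.randint(2, 50000, size=V)
W = weight_norm_init(W, counts, sigma=0.5)
```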

  7. Weight Norm Regularization
     ◮ Established method (L2 regularization): Regularize every learnable weight w in the network equally by adding a term to the loss function L_0
       L = L_0 + \frac{\lambda}{2} \sum_{w} \|w\|_2^2    (7)
       where λ is a scalar
     ◮ Idea: Regularize the norms of W_k to approach a certain value ν
       L_{wnr} = L_0 + \rho \sqrt{\sum_{k=1}^{V} (\|W_k\|_2 - \nu)^2}    (8)
       where ν, ρ ≥ 0 are two scalars
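A rough sketch of how the penalty in Equation (8) could be added to the training loss in PyTorch; this is not the authors' code, and the ρ and ν values in the usage comment are the tuned values reported later in the slides, used here only as placeholders.

```python
import torch

def weight_norm_regularizer(W, rho, nu):
    """Penalty rho * sqrt(sum_k (||W_k||_2 - nu)^2) from Equation (8).

    W: (V, d) projection matrix whose k-th row is W_k.
    """
    row_norms = W.norm(dim=1)                        # ||W_k||_2 for k = 1..V
    return rho * torch.sqrt(((row_norms - nu) ** 2).sum())

# Hypothetical training step: add the penalty to the cross-entropy loss L_0, e.g.
# loss = cross_entropy_loss + weight_norm_regularizer(model.projection.weight, rho=1e-3, nu=2.0)
```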

  8. Experimental Setup
     ◮ Datasets: Penn Treebank (PTB) and WikiText-2 (WT2)

       Dataset         Split   Tokens   Vocab Size
       Penn Treebank   Train   888k     10k
                       Valid   70k
                       Test    79k
       WikiText-2      Train   2.1M     33k
                       Valid   214k
                       Test    241k

     ◮ Network structure:
       ⊲ Three-layer LSTM with mixture of softmaxes [Yang & Dai+ 18]
       ⊲ Embedding and projection matrices are tied
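Tying the embedding and projection matrices can be sketched as below in PyTorch. This is only an illustration of the tying idea, not the authors' setup: it omits the mixture-of-softmaxes output layer, and it assumes the embedding and hidden dimensions are equal so that E and W have the same shape.

```python
import torch.nn as nn

class TiedLSTMLanguageModel(nn.Module):
    """Three-layer LSTM LM whose projection matrix W is tied to the embedding matrix E."""
    def __init__(self, vocab_size, dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=3, batch_first=True)
        self.projection = nn.Linear(dim, vocab_size, bias=False)
        self.projection.weight = self.embedding.weight   # tie W to E (same Parameter)

    def forward(self, x):
        h, _ = self.lstm(self.embedding(x))
        return self.projection(h)                        # logits W_k . h_j over the vocabulary
```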

  9. Weight Norm Initialization

     Penn Treebank
       epoch   wni ppl   baseline ppl   ppl reduction (%)
       1       162.18    180.72         10.26
       10      85.92     92.09          6.70
       20      73.36     78.94          7.07
       30      71.44     73.06          2.22
       40      69.27     70.20          1.32

     WikiText-2
       epoch   wni ppl   baseline ppl   ppl reduction (%)
       1       172.19    192.19         10.41
       10      95.90     100.72         4.79
       20      85.14     88.21          3.48
       30      81.80     82.70          1.09
       40      79.28     80.32          1.29

  10. Weight Norm Regularization
     ◮ Perform a grid search on the PTB dataset over the hyperparameters ρ and ν
     ◮ With the tuned hyperparameters ρ = 1.0 × 10^-3 and ν = 2.0, improvements over the baseline are achieved on both PTB and WT2

     Penn Treebank
       Model                    #Params   Validation   Test
       [Yang & Dai+ 18]         22M       56.54        54.44
       [Yang & Dai+ 18] + WNR   22M       55.03        53.16

     WikiText-2
       Model                    #Params   Validation   Test
       [Yang & Dai+ 18]         35M       63.88        61.45
       [Yang & Dai+ 18] + WNR   35M       62.67        60.13

  11. Weight Norm Regularization
     [Figure: histograms of the vector norms (probability [%] vs. vector norm) for baseline and baseline + WNR, on Penn Treebank (left) and WikiText-2 (right)]
     ◮ Our regularization method leads to a more concentrated distribution of weight norms around ν
     ◮ We still observe a few words with significantly higher norms
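The kind of histogram shown on this slide can be reproduced with a short analysis script; the following sketch assumes the projection matrices of the two trained models are available as NumPy arrays, and the bin count is arbitrary.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_norm_histogram(W, label):
    """Plot the distribution of row norms ||W_k||_2 of a projection matrix."""
    norms = np.linalg.norm(W, axis=1)
    plt.hist(norms, bins=50, density=True, alpha=0.5, label=label)

# Hypothetical usage with two trained projection matrices:
# plot_norm_histogram(W_baseline, "baseline")
# plot_norm_histogram(W_wnr, "baseline + WNR")
# plt.xlabel("vector norm"); plt.ylabel("probability"); plt.legend(); plt.show()
```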

  12. Conclusion
     ◮ Embedding and projection matrices are important components of an NLM
     ◮ Vector norms are related to the word frequencies
     ◮ Initializing and/or regularizing these vector norms can improve the NLM
     ◮ Future work:
       ⊲ Test the methods on larger data sets
       ⊲ Regularize the vector norms towards the word counts
       ⊲ Study the angles between the word vectors
       ⊲ Transfer to machine translation

  13. Thank you for your attention
     Christian Herold, Yingbo Gao, Hermann Ney
     <surname>@i6.informatik.rwth-aachen.de
     http://www-i6.informatik.rwth-aachen.de/

  14. References
     [Baevski & Auli 18] A. Baevski, M. Auli: Adaptive Input Representations for Neural Language Modeling. arXiv preprint arXiv:1809.10853, 2018.
     [Chen & Si+ 18] P.H. Chen, S. Si, Y. Li, C. Chelba, C.-J. Hsieh: GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking. arXiv preprint arXiv:1806.06950, 2018.
     [Graca & Kim+ 18] M. Graca, Y. Kim, J. Schamper, J. Geng, H. Ney: The RWTH Aachen University English-German and German-English Unsupervised Neural Machine Translation Systems for WMT 2018. WMT 2018, 2018.
     [Inan & Khosravi+ 17] H. Inan, K. Khosravi, R. Socher: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In International Conference on Learning Representations, 2017.

  15. References (continued)
     [Kim & Geng+ 18] Y. Kim, J. Geng, H. Ney: Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder. EMNLP 2018, 2018.
     [Melis & Dyer+ 18] G. Melis, C. Dyer, P. Blunsom: On the State of the Art of Evaluation in Neural Language Models. In International Conference on Learning Representations, 2018.
     [Merity & Keskar+ 17] S. Merity, N.S. Keskar, R. Socher: Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.
     [Press & Wolf 17] O. Press, L. Wolf: Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 157–163, 2017.

  16. References (continued)
     [Rossenbach & Rosendahl+ 18] N. Rossenbach, J. Rosendahl, Y. Kim, M. Graca, A. Gokrani, H. Ney: The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task. WMT 2018, 2018.
     [Salimans & Kingma 16] T. Salimans, D.P. Kingma: Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.
     [Stahlberg & Cross+ 18] F. Stahlberg, J. Cross, V. Stoyanov: Simple Fusion: Return of the Language Model. arXiv preprint arXiv:1809.00125, 2018.
     [Yang & Dai+ 18] Z. Yang, Z. Dai, R. Salakhutdinov, W.W. Cohen: Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In International Conference on Learning Representations, 2018.
