SLIDE 1

Improving Neural Language Models with Weight Norm Initialization and Regularization

Christian Herold∗ Yingbo Gao∗ Hermann Ney

<surname>@i6.informatik.rwth-aachen.de

October 31st, 2018

Human Language Technology and Pattern Recognition
Computer Science Department
RWTH Aachen University

Third Conference on Machine Translation (WMT18)
Brussels, Belgium

* Equal Contribution

SLIDE 2

Outline

◮ Introduction
◮ Neural Language Modeling
◮ Weight Norm Initialization
◮ Weight Norm Regularization
◮ Experiments
◮ Conclusion

SLIDE 3

Introduction

◮ Task: Given a word sequence x_1^J = x_1 x_2 ... x_J, find the probability P(x_1^J) of that sequence

    P(x_1^J) = \prod_{j=1}^{J} P(x_j | x_1^{j-1})    (1)

◮ The perplexity is defined as

    ppl = P(x_1^J)^{-1/J}    (2)

  (a small worked example follows below)

◮ The language model (LM) is trained on monolingual text data
◮ LMs are used in a variety of machine translation tasks:
  ⊲ Corpus filtering [Rossenbach & Rosendahl+ 18]
  ⊲ Unsupervised neural machine translation (NMT) [Kim & Geng+ 18]
  ⊲ Incorporation into NMT training [Stahlberg & Cross+ 18]
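As a worked example of Eqs. (1) and (2), the following Python snippet computes the sequence probability and the resulting perplexity; the conditional probabilities are made up purely for illustration.

    import math

    # Hypothetical conditional probabilities P(x_j | x_1^{j-1}) for a 4-word sequence.
    token_probs = [0.20, 0.05, 0.10, 0.25]

    # Eq. (1): the sequence probability is the product of the conditional probabilities.
    seq_prob = math.prod(token_probs)            # 2.5e-4

    # Eq. (2): perplexity is the inverse geometric mean of the conditional probabilities.
    ppl = seq_prob ** (-1.0 / len(token_probs))
    print(f"perplexity = {ppl:.2f}")             # ~7.95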

SLIDE 4

Neural Language Modeling

◮ The context x_1^{j-1} for predicting the next word x_j is encoded into a vector h_j

    P(x_1^J) = \prod_{j=1}^{J} P(x_j | x_1^{j-1}) = \prod_{j=1}^{J} P(x_j | h_j)    (3)

◮ A popular choice for a neural language model (NLM) is the long short-term memory recurrent neural network (LSTM RNN)

    h_j = LSTM([E^T \cdot x_1, E^T \cdot x_2, ..., E^T \cdot x_{j-1}])    (4)

  where x_1, x_2, ..., x_{j-1} are one-hot vectors and E is the embedding matrix
◮ P(x_j = x_k | h_j), k = 1, 2, ..., V, is calculated with a softmax activation

    P(x_j = x_k | h_j) = \frac{\exp(W_k \cdot h_j)}{\sum_{k'=1}^{V} \exp(W_{k'} \cdot h_j)}    (5)

  where W_k is the k-th row of the projection matrix and V is the vocabulary size
  (a minimal model sketch follows below)
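A minimal PyTorch sketch of Eqs. (3)-(5) follows. It is not the authors' model (which uses three LSTM layers and a Mixture of Softmaxes); the class and variable names are illustrative, and the embedding and projection matrices are tied as on the experimental-setup slide.

    import torch
    import torch.nn as nn

    class LSTMLanguageModel(nn.Module):
        def __init__(self, vocab_size, embed_dim):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)            # E
            self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)     # encodes the context into h_j
            self.projection = nn.Linear(embed_dim, vocab_size, bias=False)  # rows are the W_k
            self.projection.weight = self.embedding.weight                  # tie embedding and projection

        def forward(self, tokens):
            # tokens: (batch, seq_len) word indices; h_j summarizes the context x_1^{j-1}
            h, _ = self.lstm(self.embedding(tokens))
            return self.projection(h)  # unnormalized scores; a softmax over them gives Eq. (5)

    model = LSTMLanguageModel(vocab_size=10000, embed_dim=256)
    logits = model(torch.randint(0, 10000, (2, 35)))   # shape (2, 35, 10000)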

SLIDE 5

Related Work

◮ LSTM RNNs for language modeling: [Melis & Dyer+ 18], [Merity & Keskar+ 17]
◮ Mixture of Softmaxes: [Yang & Dai+ 18]
◮ Using word frequency information for embeddings: [Chen & Si+ 18], [Baevski & Auli 18]
◮ Tying the embedding and projection matrices: [Press & Wolf 17], [Inan & Khosravi+ 17]
◮ Weight normalization reparametrization: [Salimans & Kingma 16]

SLIDE 6

Weight Norm Initialization

[Figure: two panels, Penn Treebank and WikiText-2]

◮ Idea: Initialize the norms of W_k with the scaled logarithm of the word counts

    W_k = \sigma \log(c_k) \cdot \frac{W_k}{\lVert W_k \rVert_2}    (6)

  where c_k denotes the unigram word count for the k-th word and σ is a scalar
  (an initialization sketch follows below)
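A sketch of how Eq. (6) could be applied in PyTorch before training. The function name, the toy counts, and σ = 1.0 are assumptions for illustration, not values from the talk.

    import torch

    def weight_norm_init(W, counts, sigma=1.0):
        """Rescale each row W_k to have norm sigma * log(c_k), as in Eq. (6)."""
        target_norms = sigma * torch.log(counts.float())       # sigma * log(c_k), one value per word
        current_norms = W.norm(dim=1, keepdim=True)            # ||W_k||_2
        return W * (target_norms.unsqueeze(1) / current_norms)

    # Usage sketch: rescale a randomly initialized (tied) projection matrix once before training.
    W = torch.randn(10000, 256)
    counts = torch.randint(2, 50000, (10000,))   # hypothetical unigram counts c_k
    with torch.no_grad():
        W_init = weight_norm_init(W, counts, sigma=1.0)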

SLIDE 7

Weight Norm Regularization

◮ Established method (L2 regularization): regularize every learnable weight w in the network equally by adding a term to the loss function L_0

    L = L_0 + \frac{\lambda}{2} \sum_{w} \lVert w \rVert_2^2    (7)

  where λ is a scalar
◮ Idea: Regularize the norms of W_k to approach a certain value ν

    L_{wnr} = L_0 + \rho \sum_{k=1}^{V} (\lVert W_k \rVert_2 - \nu)^2    (8)

  where ν, ρ ≥ 0 are two scalars
  (a loss-term sketch follows below)
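A sketch of the penalty in Eq. (8) as it might be added to the training loss. The function name is illustrative; the default ρ and ν are the values later tuned on PTB in the talk.

    import torch

    def weight_norm_regularizer(W, rho=1e-3, nu=2.0):
        """Penalty of Eq. (8): rho * sum_k (||W_k||_2 - nu)^2 over the rows of the projection matrix."""
        row_norms = W.norm(dim=1)                   # ||W_k||_2 for k = 1, ..., V
        return rho * ((row_norms - nu) ** 2).sum()

    # During training, the penalty is added to the usual cross-entropy loss L_0, e.g.:
    # loss = cross_entropy + weight_norm_regularizer(model.projection.weight)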

SLIDE 8

Experimental Setup

◮ Datasets: Penn Treebank (PTB) and WikiText-2 (WT2)

                     Tokens    Vocab size
    Penn Treebank
      Train           888k        10k
      Valid            70k
      Test             79k
    WikiText-2
      Train           2.1M        33k
      Valid           214k
      Test            241k

◮ Network structure:
  ⊲ Three-layer LSTM with Mixture of Softmaxes [Yang & Dai+ 18]
  ⊲ Embedding and projection matrices are tied

SLIDE 9

Weight Norm Initialization

    Penn Treebank
      epoch   wni ppl   baseline ppl   ppl reduction (%)
         1     162.18       180.72         10.26
        10      85.92        92.09          6.70
        20      73.36        78.94          7.07
        30      71.44        73.06          2.22
        40      69.27        70.20          1.32

    WikiText-2
      epoch   wni ppl   baseline ppl   ppl reduction (%)
         1     172.19       192.19         10.41
        10      95.90       100.72          4.79
        20      85.14        88.21          3.48
        30      81.80        82.70          1.09
        40      79.28        80.32          1.29

SLIDE 10

Weight Norm Regularization

◮ Perform a grid search on the PTB dataset over the hyperparameters ρ and ν
◮ With the tuned hyperparameters ρ = 1.0 × 10^-3 and ν = 2.0, improvements over the baseline are achieved on both PTB and WT2

    Penn Treebank (perplexity)
      Model                     #Params   Validation   Test
      [Yang & Dai+ 18]            22M        56.54     54.44
      [Yang & Dai+ 18] + WNR      22M        55.03     53.16

    WikiText-2 (perplexity)
      Model                     #Params   Validation   Test
      [Yang & Dai+ 18]            35M        63.88     61.45
      [Yang & Dai+ 18] + WNR      35M        62.67     60.13

SLIDE 11

Weight Norm Regularization

[Figure: distribution of the vector norms (x-axis: vector norm, y-axis: probability [%]) for the baseline and baseline + WNR models, on Penn Treebank and WikiText-2]

◮ Our regularization method leads to a more concentrated distribution of weight norms around ν
◮ We still observe a few words with significantly higher norms

SLIDE 12

Conclusion

◮ The embedding and projection matrices are important components of an NLM
◮ The vector norms are related to the word frequencies
◮ Initializing and/or regularizing these vector norms can improve the NLM
◮ Future work:
  ⊲ Test the methods on larger data sets
  ⊲ Regularize the vector norms towards the word counts
  ⊲ Study the angles between the word vectors
  ⊲ Transfer to machine translation

SLIDE 13

Thank you for your attention

Christian Herold Yingbo Gao Hermann Ney <surname>@i6.informatik.rwth-aachen.de http://www-i6.informatik.rwth-aachen.de/

SLIDE 14

References

[Baevski & Auli 18] A. Baevski, M. Auli: Adaptive Input Representations for Neural Language Modeling. arXiv preprint arXiv:1809.10853, 2018.

[Chen & Si+ 18] P.H. Chen, S. Si, Y. Li, C. Chelba, C.J. Hsieh: GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking. arXiv preprint arXiv:1806.06950, 2018.

[Graca & Kim+ 18] M. Graca, Y. Kim, J. Schamper, J. Geng, H. Ney: The RWTH Aachen University English-German and German-English Unsupervised Neural Machine Translation Systems for WMT 2018. WMT 2018, 2018.

[Inan & Khosravi+ 17] H. Inan, K. Khosravi, R. Socher: Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling. In International Conference on Learning Representations, 2017.

SLIDE 15

References (continued)

[Kim & Geng+ 18] Y. Kim, J. Geng, H. Ney: Improving Unsupervised Word-by-Word Translation with Language Model and Denoising Autoencoder. EMNLP 2018, 2018.

[Melis & Dyer+ 18] G. Melis, C. Dyer, P. Blunsom: On the State of the Art of Evaluation in Neural Language Models. In International Conference on Learning Representations, 2018.

[Merity & Keskar+ 17] S. Merity, N.S. Keskar, R. Socher: Regularizing and Optimizing LSTM Language Models. arXiv preprint arXiv:1708.02182, 2017.

[Press & Wolf 17] O. Press, L. Wolf: Using the Output Embedding to Improve Language Models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 157–163, 2017.

SLIDE 16

References (continued)

[Rossenbach & Rosendahl+ 18] N. Rossenbach, J. Rosendahl, Y. Kim, M. Graca, A. Gokrani, H. Ney: The RWTH Aachen University Filtering System for the WMT 2018 Parallel Corpus Filtering Task. WMT 2018, 2018.

[Salimans & Kingma 16] T. Salimans, D.P. Kingma: Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

[Stahlberg & Cross+ 18] F. Stahlberg, J. Cross, V. Stoyanov: Simple Fusion: Return of the Language Model. arXiv preprint arXiv:1809.00125, 2018.

[Yang & Dai+ 18] Z. Yang, Z. Dai, R. Salakhutdinov, W.W. Cohen: Breaking the Softmax Bottleneck: A High-Rank RNN Language Model. In International Conference on Learning Representations, 2018.
