Language Modeling with Deep Transformers
Kazuki Irie, Albert Zeyer, Ralf Schlüter, Hermann Ney
Human Language Technology and Pattern Recognition Group, RWTH Aachen University, Aachen, Germany
INTERSPEECH 2019, Graz, Austria
Introduction
- 2017: Advent of Transformer [Vaswani & Shazeer+ 17] in NLP/beyond.
- Originally an encoder-decoder model for machine translation.
- Decoder component: language model
– Early work in text generation (5 layers) [Liu & Saleh+ 18] ICLR 2018
- Gain in popularity more recently:
– Google: 64-layer character-level Transformer LM [Al-Rfou & Choe+ 19], AAAI 2019
– OpenAI: GPT-2 LM (48 layers) [Radford & Wu+ 19], blog, February 2019
- Large scale language model pre-training at the center of interest in NLP.
– Nvidia: Megatron-LM (72 layers), blog, August 2019
– Salesforce: controllable Transformer LM (48 layers), last week!
Language Modeling with Deep Transformers — INTERSPEECH 2019, Graz, Austria, Sep. 15, 2019
Contributions of this work
- Application of Transformer language models to ASR
– Successful training of deep and powerful Transformer language models.
– Evaluation in both hybrid and attention-based end-to-end ASR.
– Large improvements over the state-of-the-art LSTM LM.
- Comprehensive hyper-parameter tuning
– Crucial for studying a new model.
– In particular for Transformers, which have many hyper-parameters.
- Demonstration of an LM specific property of Transformers
– The LM task automatically provides positional information: no need for an extra signal.
- Analysis and visualization
- Release of model configurations and checkpoints (link in the paper)
https://github.com/rwth-i6/returnn-experiments/tree/master/2019-lm-transformers
Open-source toolkit RETURNN [Zeyer & Alkhouli+ 18]
Transformer Language Model
- Stack L layers; each consisting of self-attention and feed-forward modules.
- Apply residual connections and layer normalization across modules.
- Self-attention typically has multiple attention heads.
[Diagram: one Transformer layer — self-attention + LayerNorm, feed-forward + LayerNorm — with positional encoding at the input.]
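The layer stack described above can be sketched in NumPy. This is a minimal single-head, post-norm variant for illustration only; the actual models use multi-head attention, and the exact normalization placement and all dimensions follow the released RETURNN configurations, not this sketch.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    mu = x.mean(-1, keepdims=True)
    sigma = x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def causal_self_attention(x, Wq, Wk, Wv):
    # x: (T, d_res). Position t may only attend to positions <= t.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores[future] = -np.inf                      # mask out future positions
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax over the past
    return weights @ v

def transformer_lm_layer(x, params):
    # Self-attention module with residual connection and layer norm.
    h = layer_norm(x + causal_self_attention(
        x, params["Wq"], params["Wk"], params["Wv"]))
    # Feed-forward module (d_res -> d_ff -> d_res), ReLU, residual, layer norm.
    ff = np.maximum(0.0, h @ params["W1"]) @ params["W2"]
    return layer_norm(h + ff)
```

Stacking L such layers on top of the word embeddings gives the language model; the causal mask is what lets each position predict only from its prefix.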
Experimental Setups
LibriSpeech dataset [Panayotov & Chen+ 15]:
- 960h of audio with read-speech transcriptions.
- Large LM task: 200K-word vocabulary, 800M words of extra textual training data.
Language modeling for speech recognition in two settings:
- Word-level models for the conventional hybrid HMM/NN system, via lattice rescoring [Sundermeyer & Tüske+ 14]: push forward Transformer states instead of LSTM states.
- BPE subword-level models for the end-to-end attention-based system, via shallow fusion [Gülçehre & Firat+ 17, Toshniwal & Kannan+ 18].
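Shallow fusion is a log-linear combination of the end-to-end model's score and the external LM's score at each beam-search step. A toy sketch; the interpolation weight of 1.0 below is purely illustrative, not the tuned value from the paper:

```python
import math

def shallow_fusion(decoder_log_probs, lm_log_probs, lm_weight):
    # Log-linear interpolation of end-to-end decoder and external LM scores
    # for one beam-search expansion step (one score per vocabulary entry).
    return [d + lm_weight * l
            for d, l in zip(decoder_log_probs, lm_log_probs)]

# Toy 3-word vocabulary: a strong enough LM can overturn the decoder's choice.
decoder = [math.log(0.6), math.log(0.3), math.log(0.1)]
lm      = [math.log(0.2), math.log(0.7), math.log(0.1)]
scores = shallow_fusion(decoder, lm, lm_weight=1.0)
best = max(range(len(scores)), key=lambda i: scores[i])  # word 1 wins
```

With lm_weight = 0 the decoder's own top choice (word 0) survives; the LM weight controls how much the external model reshapes the beam.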
Intensive tuning of the baseline LSTM LM [Sundermeyer & Schlüter+ 12]:
- All tuning details are provided in the paper.
- A wide model gave the best results: 2 layers with 4096 LSTM nodes.
- Relative improvement in perplexity over the 4-gram LM: about 58%.
Optimization of Transformer Models
The exhaustive list of hyper-parameters is long:
- Number of layers & dimension of the residual connection.
- (Dimension of input word embeddings).
- For each layer: number of attention heads, dimension of the key and query, dimension of the value, and dimension of the feed-forward layer.
To reduce this complexity:
- Use the same dimension for key, query, value, and the residual connection.
- Use the same dimensionality across all layers.
This leaves 4 hyper-parameters to describe all our models:
- Number of layers L.
- Feed-forward dimension dff .
- Residual dimension dres.
- Number of attention heads H.
Effect of Depth and Width (Highlight)
Perplexity after 2.5 epochs (H = 8, dres = 512):

 L  | dff    | Params (M) | Train PPL | Dev PPL
 12 | 2,048  | 243        | 67.6      | 67.1
 24 | 2,048  | 281        | 62.2      | 62.3
 42 | 2,048  | 338        | 59.0      | 59.6
 6  | 8,192  | 262        | 66.7      | 66.7
 12 | 4,096  | 268        | 63.5      | 63.8
 4  | 16,384 | 277        | 67.6      | 67.4
 4  | 32,768 | 344        | 65.4      | 68.4
- For a given parameter budget, deep models tend to perform better.
Full tuning details in the paper!
- Effect of the number of heads: helps up to 16; 8 is already good.
- Effect of the activation (ReLU, GeLU, GLU): the standard ReLU is fine!
- Parameter tying (Universal Transformers): improvements without extra parameters!
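The parameter counts in the table can be roughly reproduced from the four hyper-parameters. The sketch below ignores biases and layer-norm parameters and assumes untied 200K-word input and output embeddings of size dres; these are assumptions for illustration, chosen because they are consistent with the table's numbers:

```python
def transformer_lm_params(L, d_ff, d_res=512, vocab=200_000):
    """Rough Transformer LM parameter count (biases and layer norm ignored)."""
    embeddings = 2 * vocab * d_res       # untied input embedding + output softmax
    attention = 4 * d_res * d_res        # per layer: Wq, Wk, Wv, output projection
    feed_forward = 2 * d_res * d_ff      # per layer: up- and down-projection
    return embeddings + L * (attention + feed_forward)

print(round(transformer_lm_params(12, 2048) / 1e6, 1))  # 242.5, close to the table's 243 M
```

Note how the 200K-vocabulary embeddings dominate (about 205M parameters), which is why depth is comparatively cheap here.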
Optimization of Transformer Models: Final Results
Further scaling up. Best model: 96 layers (L = 96, dff = 2048, dres = 512, H = 8).
(A 112-layer model got slightly better still after the camera-ready deadline.)
Final perplexity on LibriSpeech, 200K-vocabulary word level:

 LM          | Params (M) | Dev | Test
 4-gram      | 230        | 146 | 152
 LSTM        | 1048       | 60  | 63
 Transformer | 431        | 54  | 56

Large improvements over the highly optimized LSTM LM:
- About 11% relative improvement in perplexity.
Do we need extra positional encoding in Transformer LMs?
- The amount of information increases at each time step in an LM: is that already a position signal?
- Our finding: external positional encoding is unnecessary.
– Even slight improvements in perplexity without positional encoding.
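For reference, the external signal being ablated is the standard sinusoidal positional encoding of [Vaswani & Shazeer+ 17], which is added to the input word embeddings; dropping it simply means feeding the word embeddings alone. A plain-Python sketch (assumes an even model dimension):

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Standard sinusoidal encoding: sin on even dims, cos on odd dims,
    with geometrically increasing wavelengths. d_model must be even."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe
```

Each position gets a distinct vector of bounded values; the finding here is that the causal LM setup makes this extra signal redundant.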
- Attention in the first layer (all 8 heads per target word position shown):
[Figure: first-layer attention maps over the input "<bos> so they went on to the verandah and looked down upon the lights of the prison and listened to the sea lapping the shore", with positional encoding vs. without positional encoding.]
Other Layers: 3 Categories
Analysis for the 24-layer model (also valid for deeper models).
- There are 3 functional groups of layers:

[Figure: attention maps for heads in each group, over the sentence "<bos> so they went on to the verandah and looked down upon the lights of the prison and listened to the sea lapping the shore".]

- Bottom layers (2–3): "Blur". Roughly an average over all positions; bag-of-words, global information. Some heads focus on difficult words, here verandah.
- Mid layers (4–9): "Window". Focus on the local n-gram.
- Top layers (10–24): "Structured". Attend to specific patterns; feature detectors.
Speech Recognition Experiments: Conventional Hybrid System
WERs (%) for hybrid systems on LibriSpeech 960h.
- The first-pass decoding generates lattices.
- Rescore the lattices (denoted by →) with the LSTM or Transformer (Trafo) LM.

 Language Model | Params (M) | dev-clean PPL/WER | dev-other PPL/WER | test-clean PPL/WER | test-other PPL/WER
 4-gram         | 230        | 152 / 3.4         | 141 / 8.3         | 158 / 3.8          | 146 / 8.8
 → LSTM         | 1048       | 60 / 2.3          | 60 / 5.4          | 65 / 2.6           | 62 / 5.9
 → Transformer  | 431        | 53 / 2.1          | 54 / 5.2          | 58 / 2.5           | 55 / 5.6
 LSTM → Trafo   |            | – / 1.9           | – / 4.5           | – / 2.3            | – / 5.0
Large improvements over the highly optimized LSTM LM:
- 10% relative improvement in perplexity translates to
- 4% to 10% relative improvement in WER.
These define new state-of-the-art results [Lüscher & Beck+ 19] on LibriSpeech 960h.
Speech Recognition Experiments: Attention-Based System
WERs (%) for attention-based models on LibriSpeech 960h. Perplexities are at the 10K-BPE level.

 Language Model | Beam | dev-clean PPL/WER | dev-other PPL/WER | test-clean PPL/WER | test-other PPL/WER
 None           | 12   | – / 4.3           | – / 12.9          | – / 4.4            | – / 13.5
 LSTM           | 64   | 44 / 2.9          | 46 / 8.9          | 47 / 3.2           | 47 / 9.9
 Transformer    | 64   | 36 / 2.6          | 39 / 8.4          | 39 / 2.8           | 39 / 9.3
- Following [Hannun & Lee+ 19] (Interspeech 2019): larger beam size and an end-of-sentence penalty.
- Again, large improvements over the LSTM baseline.
- Best reported WERs for end-to-end systems without data augmentation, e.g. SpecAugment [Park & Chan+ 19] (Interspeech 2019).
- Available on: https://github.com/rwth-i6/returnn-experiments
Conclusion

Summary
- Successfully trained deep Transformer LMs with excellent performance for ASR.
- Demonstrated that positional encoding is not needed for Transformer LMs.
- Visualized and identified hierarchical feature engineering inside Transformer language models, with links to fundamental LM concepts:
– n-grams, bag-of-words, and, in the top layers, max-entropy-model-style features (but data-driven)?

Future work
- Further scaling up (layer-wise training).
- Reduce the memory requirements of Transformers.
- More study of the scalability of Transformer vs. LSTM vs. the amount of training data.
- For LSTMs: deeper (and wider) models with residual connections and layer normalization, e.g. RNMT+ [Chen & Firat+ 18]?
Thank you for your attention.
Thanks to: Eugen Beck, Liuhui Deng, Christoph Lüscher, Arne Nix, Julian Schamper, and Wei Zhou
References
[Al-Rfou & Choe+ 19] R. Al-Rfou, D. Choe, N. Constant, M. Guo, L. Jones. Character-level language modeling with deeper self-attention. In Proc. AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, Jan. 2019.
[Chen & Firat+ 18] M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, M. Schuster, N. Shazeer, N. Parmar, A. Vaswani, J. Uszkoreit, Ł. Kaiser, Z. Chen, Y. Wu, M. Hughes. The best of both worlds: Combining recent advances in neural machine translation. In Proc. ACL, pp. 76–86, Melbourne, Australia, July 2018.
[Gülçehre & Firat+ 17] Ç. Gülçehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, Y. Bengio. On using monolingual corpora in neural machine translation. Computer Speech & Language, Vol. 45, pp. 137–148, Sept. 2017.
[Hannun & Lee+ 19] A. Hannun, A. Lee, Q. Xu, R. Collobert. Sequence-to-sequence speech recognition with time-depth separable convolutions. arXiv preprint arXiv:1904.02619, 2019.
[Liu & Saleh+ 18] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, Ł. Kaiser, N. Shazeer. Generating Wikipedia by summarizing long sequences. In Int. Conf. on Learning Representations (ICLR), Vancouver, Canada, April 2018.
[Lüscher & Beck+ 19] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, H. Ney. RWTH ASR systems for LibriSpeech: Hybrid vs attention. In Proc. Interspeech, Graz, Austria, Sept. 2019.
[Panayotov & Chen+ 15] V. Panayotov, G. Chen, D. Povey, S. Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210, South Brisbane, Queensland, Australia, April 2015.
[Park & Chan+ 19] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. In Proc. Interspeech, Graz, Austria, 2019.
[Radford & Wu+ 19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever. Language models are unsupervised multitask learners. Online: https://blog.openai.com/better-language-models/, 2019.
[Sundermeyer & Schlüter+ 12] M. Sundermeyer, R. Schlüter, H. Ney. LSTM neural networks for language modeling. In Proc. Interspeech, pp. 194–197, Portland, OR, USA, Sept. 2012.
[Sundermeyer & Tüske+ 14] M. Sundermeyer, Z. Tüske, R. Schlüter, H. Ney. Lattice decoding and rescoring with long-span neural network language models. In Proc. Interspeech, pp. 661–665, Singapore, Sept. 2014.
[Toshniwal & Kannan+ 18] S. Toshniwal, A. Kannan, C.-C. Chiu, Y. Wu, T. N. Sainath, K. Livescu. A comparison of techniques for language model integration in encoder-decoder speech recognition. In Proc. SLT, Athens, Greece, Dec. 2018.
[Vaswani & Shazeer+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008, Long Beach, CA, USA, Dec. 2017.
[Zeyer & Alkhouli+ 18] A. Zeyer, T. Alkhouli, H. Ney. RETURNN as a generic flexible neural toolkit with application to translation and speech recognition. In Proc. ACL, Melbourne, Australia, July 2018.