
Efficient Contextual Representation Learning With Continuous Outputs

  1. Efficient Contextual Representation Learning With Continuous Outputs. Kai-Wei Chang, Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh (UCLA).

  2. Motivation: Efficient Contextual Representation Learning. The energy implications of popular NLP models (Strubell et al., 2019).

  3. Background: Language Model Pre-training. Language model objectives: forward / backward / masked. Typical architecture: input layer (subwords / character CNN) → sequence encoder (LSTM / Transformer) → softmax layer. [Figure: an illustration of popular pre-trained language models such as ELMo, GPT, and BERT.]

  4. Background: Softmax Layer. Loss function with a softmax layer: for a context vector c from the sequence encoder and target word w, the loss is −log softmax(Wc)[w], where W is a V × m matrix and V is the vocabulary size. V can become extremely large (800K for ELMo), and W takes up 80% of ELMo's parameters, so the softmax layer becomes the speed bottleneck. [Figure: forward language modeling in ELMo.]
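
To make the bottleneck concrete, here is a minimal PyTorch sketch of a softmax output layer for forward language modeling. This is not the authors' code: the vocabulary size, dimensions, and tensor names are illustrative (ELMo's actual output vocabulary is roughly 800K words).

```python
# Minimal sketch of a softmax output layer (illustrative sizes, not ELMo's real configuration).
import torch
import torch.nn.functional as F

V, m = 50_000, 512                         # vocabulary size and context dimension (ELMo's V is ~800K)
W = torch.randn(V, m, requires_grad=True)  # V x m output matrix: the dominant block of parameters

context = torch.randn(8, m)                # context vectors c from the sequence encoder (batch of 8 tokens)
targets = torch.randint(V, (8,))           # gold next words

logits = context @ W.t()                   # O(V * m) per token: a matmul over the entire vocabulary
loss = F.cross_entropy(logits, targets)    # -log softmax(Wc)[target]
loss.backward()                            # the backward pass also touches all V rows of W
```

Both the forward matmul and the backward pass scale with V, which is why this layer dominates training time and parameter count once V reaches the hundreds of thousands.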

  5. Approach: Accelerating Language Model Training with Continuous Outputs. Loss function with a continuous output layer*: d(c, ŵ), where c is the context vector from the sequence encoder, ŵ is the pre-trained word embedding of the target word w, and d is a distance function such as cosine distance. Predict the word embedding instead of the word! [Figure: forward language modeling in ELMo.] *Sachin Kumar and Yulia Tsvetkov. 2018. Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs.
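
A corresponding sketch of the continuous output layer, assuming d is cosine distance and the pre-trained target embeddings are kept frozen (the cited paper also develops a von Mises-Fisher formulation; sizes and names here are illustrative):

```python
# Minimal sketch of a continuous output layer (cosine-distance variant; illustrative names).
import torch
import torch.nn.functional as F

V, m = 50_000, 300
pretrained = torch.randn(V, m)                    # pre-trained word embeddings (kept frozen)
context = torch.randn(8, m, requires_grad=True)   # context vectors c from the sequence encoder
targets = torch.randint(V, (8,))                  # gold next words

target_emb = pretrained[targets]                  # O(m) lookup of each gold word's embedding
# Predict the embedding instead of a distribution over V words:
loss = (1.0 - F.cosine_similarity(context, target_emb, dim=-1)).mean()
loss.backward()                                   # no trainable output-layer parameters, no O(V) matmul
```

The output layer now adds no trainable parameters and costs only an embedding lookup per token, which is the complexity reduction quantified on the next two slides.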

  6. Approach: Computational Efficiency. Output-layer time complexity: O(|vocabulary|) → O(|embedding|), i.e., negligible. Trainable output-layer parameters: hundreds of millions → 0, an ~80% parameter reduction for ELMo. Compared with related work (sampled softmax, adaptive softmax, subword vocabularies), this is a significant efficiency improvement over existing methods. (A rough parameter-count check appears below.)
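
A back-of-the-envelope check of the "hundreds of millions" figure, using constants from the ELMo setup rather than from this deck (a 512-dimensional projected context and an output vocabulary of roughly 800K words):

$$|W| \approx 800{,}000 \times 512 \approx 4.1 \times 10^{8} \ \text{parameters},$$

which is on the order of the figure cited above and consistent with the ~80% share of ELMo's trainable parameters attributed to the softmax matrix.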

  7. Approach: Computational Efficiency. Efficiency improvement of the output layer: time complexity O(|vocabulary|) → O(|embedding|) (negligible); trainable parameter size: hundreds of millions → 0 (an ~80% parameter reduction for ELMo). Efficiency improvement for the entire model: lower optimizer overhead, GPU memory consumption, and communication cost. ELMo training: 14 days × 3 GPUs → 2.5 days × 4 GPUs.

  8. Approach: Open-vocabulary Training. The loss with a continuous output layer needs ŵ, the pre-trained word embedding of the target word w. What if w is not in the embedding vocabulary? Use an open-vocabulary word embedding model such as FastText or MIMICK (Pinter et al., 2017), which composes an embedding for any word from its subword or character units (see the sketch below).
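
As an illustration of the open-vocabulary idea, below is a FastText-style sketch that composes an embedding for an out-of-vocabulary word from character n-gram vectors. The hashing scheme, n-gram range, and table size are simplified assumptions, not FastText's or MIMICK's exact implementation.

```python
# FastText-style open-vocabulary embedding sketch (simplified; illustrative sizes and hashing).
import torch

NUM_BUCKETS, m = 2_000_000, 300
ngram_table = torch.randn(NUM_BUCKETS, m)   # stand-in for pre-trained character n-gram vectors

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    """Character n-grams of the word wrapped in boundary markers, e.g. '<wh', 'whe', ..."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]

def oov_embedding(word: str) -> torch.Tensor:
    """Average the vectors of the word's character n-grams (Python's hash() is illustrative only)."""
    ids = [hash(g) % NUM_BUCKETS for g in char_ngrams(word)]
    return ngram_table[ids].mean(dim=0)

vec = oov_embedding("unfathomableness")     # an embedding for a word outside any fixed vocabulary
```

With such a model, the continuous output layer can obtain a target embedding for any word in the training corpus, not just those in a fixed softmax vocabulary.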

  9. Experiment. All models are pre-trained on the One Billion Word Benchmark for 10 epochs; ELMo-C, ELMo-A, and ELMo-Sub are trained with the exact same hyper-parameters. ELMo-A achieves a perplexity of 35.8, lower than the 39.7 of the original ELMo.

  10. Experiment. [Table: training time (days × GPUs), batch size (per GPU), and trainable parameters of four ELMo variants.] ELMo-C is 4.2× faster and 6× more memory-efficient than ELMo.

  11. Experiment. [Table: training time (days × GPUs), batch size (per GPU), and trainable parameters of four ELMo variants.] ELMo-A and ELMo-Sub are also more efficient than ELMo, but ELMo-C is still 1.6×–2.3× faster.

  12. Experiment. Performance on five downstream tasks, following the settings of the original ELMo: ELMo-C is comparable with ELMo on four tasks, with SRL being the exception.

  13. Experiment. Performance on five downstream tasks, following the settings of the original ELMo: ELMo-C rivals or outperforms ELMo-A and ELMo-Sub.

  14. Analysis: The Continuous Output Layer with Different Sequence Encoders. [Table: time needed to finish training on one million words using 4 GPUs.] The continuous output layer gives a consistent efficiency improvement over the other variants (1.44×–8.31×), even when the sequence encoder is very large.

  15. Conclusion. Predicting the word embedding instead of computing a softmax accelerates ELMo training; the resulting model, ELMo-C, retains performance comparable to ELMo; and the efficiency gains persist when the approach is applied to large Transformers. Code: https://github.com/uclanlp/ELMO-C
