
Knowledge Distillation (Xiachong Feng) - PowerPoint PPT Presentation



  1. Knowledge Distillation. Xiachong Feng. Pic: https://data-soup.gitlab.io/blog/knowledge-distillation/

  2. Outline
  • Why Knowledge Distillation?
  • Distilling the Knowledge in a Neural Network (NIPS 2014)
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (arXiv 2018)
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (arXiv)
  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence-Level Knowledge Distillation (EMNLP 2016)
  • Cross-Lingual NLP
  • Cross-lingual Distillation for Text Classification (ACL 2017)
  • Zero-Shot Cross-Lingual Neural Headline Generation (IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018)
  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation Approach for Event Detection (AAAI 2019)
  • Paper List
  • Reference
  • Conclusion

  3. Cost • BERT large: 24 transformer layers with 344 million parameters; 16 Cloud TPUs for 4 days; about 12,000 dollars. • GPT-2: 48 transformer layers with 1.5 billion parameters; 64 Cloud TPU v3 for one week; about 43,000 dollars. • XLNet: 128 Cloud TPU v3 for two and a half days; about 61,000 dollars. "Training XLNet costs 60,000 dollars, as much as five BERTs; the 'price tag' of large models is astonishing." https://zhuanlan.zhihu.com/p/71609636?utm_source=wechat_session&utm_medium=social&utm_oi=71065644564480&from=timeline&isappinstalled=0&s_r=0

  4. Trade-Off • Deeper models greatly improve the state of the art on more tasks. • They may be inapplicable in resource-restricted systems such as mobile devices because of low inference-time efficiency. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

  5. Knowledge Distillation Knowledge distillation is a process of distilling or transferring the knowledge from a (set of) large, cumbersome model(s) to a lighter, easier-to-deploy single model, without significant loss in performance. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

  6. Hot Topic Andrej Karpathy A Recipe for Training Neural Networks http://karpathy.github.io/2019/04/25/recipe/

  7. Hot Topic • Towser, "How do you evaluate the BERT model?" https://www.zhihu.com/question/298203515/answer/509923837 • 霍华德 (Howard), "Given that BERT currently achieves such good results in NLP, where should NLP go next?" https://www.zhihu.com/question/320606353/answer/658786633

  8. Distilling the Knowledge in a Neural Network Hinton NIPS 2014 Deep Learning Workshop

  9. Model Compression • Ensemble models are cumbersome and may be too computationally expensive to deploy. • Solution: the knowledge acquired by a large ensemble of models can be transferred to a single small model. • We call this transfer "distillation": moving the knowledge from the cumbersome model to a small model that is more suitable for deployment.

  10. What is Knowledge? 1 Parameters W!

  11. What is Knowledge? 2 Mapping: Input to Output! A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.

  12. Knowledge Distillation • The larger model (teacher) is trained on the training data and produces soft targets. • The small model learns to mimic the teacher as a student, via a loss on those soft targets, and is the model used at test time.

  13. Softmax With Temperature • The logits are divided by a temperature T before applying the softmax. https://blog.csdn.net/qq_22749699/article/details/79460817
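A minimal NumPy sketch of temperature-scaled softmax, q_i = exp(z_i / T) / sum_j exp(z_j / T); the logits below are made-up example values, not numbers from the slides.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """q_i = exp(z_i / T) / sum_j exp(z_j / T); higher T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [8.0, 2.0, 1.0]         # hypothetical teacher logits
print(softmax_with_temperature(logits, T=1.0))   # sharp, close to one-hot
print(softmax_with_temperature(logits, T=5.0))   # softer, exposes relative similarities
```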

  14. Note • The same temperature T is used for teacher and student during training (Train: T). • At test time the student uses T = 1 (Test: T = 1). • The loss is computed on the training data.

  15. Soft Targets • For a given input, the teacher outputs a softened distribution over classes, e.g. (0.98, 0.01, 0.01).

  16. Supervisory signals 1 • One-hot target: "2" is independent of "3" and "7"; a discrete distribution. • Soft target: "2" is similar to "3" and "7"; a continuous distribution; it carries inter-class variance and between-class distance information. • Soft targets have high entropy! Naiyan Wang https://www.zhihu.com/question/50519680/answer/136363665
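To make "soft targets have high entropy" concrete, here is a small sketch comparing the Shannon entropy of a one-hot label with softened targets; the 0.98/0.01/0.01 vector reuses the example from the previous slide, and the other values are illustrative.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (nats)."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

one_hot = [1.0, 0.0, 0.0]
soft    = [0.98, 0.01, 0.01]     # e.g. "2" judged slightly similar to "3" and "7"
softer  = [0.70, 0.20, 0.10]     # what a higher-temperature teacher might output

print(entropy(one_hot))  # ~0: carries only the class identity
print(entropy(soft))     # > 0: also encodes relative class similarity
print(entropy(softer))   # higher still: more inter-class information per example
```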

  17. Data augmentation 2 • Similarity. 周博磊 (Bolei Zhou) https://www.zhihu.com/question/50519680/answer/136359743

  18. Reduce Modes 3 • NMT: real translation data has many modes. • MLE training tends to use a single-mode model to cover multiple modes. Jiatao Gu, Non-Autoregressive Neural Machine Translation https://zhuanlan.zhihu.com/p/34495294

  19. Soft Targets 1. Supervisory signals 2. Data augmentation 3. Reduce modes

  20. How to use unlabeled data? • The distillation loss is computed over both unlabeled data and the training data.

  21. Loss function • Transfer set = unlabeled data + original training set. • The loss combines a hard-target term (student vs. ground-truth labels) and a soft-target term (student vs. teacher outputs). Domain Adaptation of DNN Acoustic Models Using Knowledge Distillation, ICASSP 2017
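Below is a hedged PyTorch sketch of this hard-plus-soft combination in the common Hinton-style form: alpha times the hard-target cross-entropy plus (1 - alpha) times a temperature-T soft-target term. The alpha, T, and T-squared scaling are standard choices, not details taken from the slide or the ICASSP 2017 paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """alpha * CE(hard targets) + (1 - alpha) * T^2 * KL(teacher || student) at temperature T."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # T^2 keeps gradient magnitudes comparable across temperatures
    return alpha * hard + (1.0 - alpha) * soft

# toy batch: 4 examples, 3 classes (random values, illustration only)
student = torch.randn(4, 3, requires_grad=True)
teacher = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(kd_loss(student, teacher, labels))
```

For unlabeled transfer-set examples no ground-truth label exists, so only the soft-target term applies there; this is what lets the transfer set extend beyond the original labeled training data.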

  22. Knowledge Distillation • "How should the soft-target approach be understood?" Yjango https://www.zhihu.com/question/50519680?sort=created

  23. Outline
  • Why Knowledge Distillation?
  • Distilling the Knowledge in a Neural Network (NIPS 2014)
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (arXiv 2018)
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (arXiv)
  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence-Level Knowledge Distillation (EMNLP 2016)
  • Cross-Lingual NLP
  • Cross-lingual Distillation for Text Classification (ACL 2017)
  • Zero-Shot Cross-Lingual Neural Headline Generation (IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018)
  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation Approach for Event Detection (AAAI 2019)
  • Paper List
  • Reference
  • Conclusion

  24. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. University of Waterloo, arXiv

  25. Overview • Distills knowledge from BERT, a state-of-the-art language representation model, into a single-layer BiLSTM. • Tasks: 1. binary sentiment classification; 2. multi-genre natural language inference; 3. Quora Question Pairs redundancy classification. • Achieves results comparable to ELMo, while using roughly 100 times fewer parameters and 15 times less inference time.

  26. Teacher Model • Teacher model: BERT-large

  27. Student Model • Student model: a single-layer BiLSTM with a non-linear classifier
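For concreteness, a rough PyTorch sketch of such a student; the embedding size, hidden size, and use of the final hidden states are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiLSTMStudent(nn.Module):
    # Single-layer bidirectional LSTM encoder + ReLU classifier (sizes are illustrative).
    def __init__(self, vocab_size, num_classes, emb_dim=300, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids):
        x = self.embed(token_ids)                    # (batch, seq, emb)
        _, (h_n, _) = self.bilstm(x)                 # h_n: (2, batch, hidden)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)      # concat forward/backward final states
        return self.classifier(h)                    # logits to be matched against the teacher

model = BiLSTMStudent(vocab_size=30000, num_classes=2)
print(model(torch.randint(1, 30000, (4, 12))).shape)   # torch.Size([4, 2])
```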

  28. Data Augmentation for Distillation • In the distillation approach, a small dataset may not suffice for the teacher model to fully express its knowledge, so the training set is augmented with a large, unlabeled dataset whose pseudo-labels are provided by the teacher. • Method • Masking: with probability p_mask, we randomly replace a word with [MASK]. • POS-guided word replacement: with probability p_pos, we replace a word with another word of the same POS tag. • N-gram sampling: with probability p_ng, we randomly sample an n-gram from the example, where n is randomly selected from {1, 2, ..., 5}.
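A simplified Python sketch of these three rules. The probabilities, the tiny POS lexicon, and the function name are placeholders; the paper applies the rules with a real POS tagger over large unlabeled text.

```python
import random

P_MASK, P_POS, P_NG = 0.1, 0.1, 0.25   # placeholder probabilities, not the paper's values

# toy POS lexicon standing in for a real tagger (assumption for illustration)
POS = {"movie": "NOUN", "film": "NOUN", "great": "ADJ", "terrible": "ADJ", "was": "VERB"}
BY_TAG = {}
for w, t in POS.items():
    BY_TAG.setdefault(t, []).append(w)

def augment(tokens):
    out = []
    for tok in tokens:
        r = random.random()
        if r < P_MASK:                               # masking
            out.append("[MASK]")
        elif r < P_MASK + P_POS and tok in POS:      # POS-guided word replacement
            out.append(random.choice(BY_TAG[POS[tok]]))
        else:
            out.append(tok)
    if random.random() < P_NG:                       # n-gram sampling, n in {1..5}
        n = random.randint(1, min(5, len(out)))
        start = random.randint(0, len(out) - n)
        out = out[start:start + n]
    return out

print(augment("the movie was great".split()))
```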

  29. Distillation objective • Mean-squared-error (MSE) loss between the student network's logits and the teacher's logits. • The authors found MSE to perform slightly better than the alternatives.
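A short PyTorch sketch of this logit-matching objective. The optional cross-entropy term on gold labels and its weight alpha are assumptions added for completeness; the slide itself only states the MSE term between teacher and student logits.

```python
import torch
import torch.nn.functional as F

def logit_mse_distill_loss(student_logits, teacher_logits, labels=None, alpha=0.0):
    """||z_teacher - z_student||^2, optionally mixed with cross-entropy on gold labels."""
    distill = F.mse_loss(student_logits, teacher_logits)
    if labels is None:
        return distill
    return alpha * F.cross_entropy(student_logits, labels) + (1.0 - alpha) * distill

student = torch.randn(4, 2, requires_grad=True)   # toy logits, 4 examples, 2 classes
teacher = torch.randn(4, 2)
print(logit_mse_distill_loss(student, teacher))
```

Compared with the temperature-softened KL/cross-entropy loss sketched earlier, this variant matches raw logits directly, which is the design choice the slide reports as working slightly better here.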

  30. Result

  31. Outline
  • Why Knowledge Distillation?
  • Distilling the Knowledge in a Neural Network (NIPS 2014)
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (arXiv 2018)
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (arXiv)
  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence-Level Knowledge Distillation (EMNLP 2016)
  • Cross-Lingual NLP
  • Cross-lingual Distillation for Text Classification (ACL 2017)
  • Zero-Shot Cross-Lingual Neural Headline Generation (IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018)
  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation Approach for Event Detection (AAAI 2019)
  • Paper List
  • Reference
  • Conclusion

  32. Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding Microsoft

  33. MT-DNN • A pre-training stage followed by a multi-task learning (MTL) stage. Multi-Task Deep Neural Networks for Natural Language Understanding

  34. Distillation • Training uses correct targets + soft targets. • Teacher: an ensemble of different MT-DNNs, initialized using the pre-trained BERT. • Student: the parameters of its shared layers are initialized using the MT-DNN model pre-trained on the GLUE dataset via MTL, as in Algorithm 1, and the parameters of its task-specific output layers are randomly initialized. • The distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9 GLUE tasks (single model).
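A tiny sketch of one natural reading of "soft targets from an ensemble of MT-DNNs": average the class probabilities predicted by the individual fine-tuned teachers and use the result, together with the correct targets, to train the student. Treat the averaging details as an assumption to be checked against the paper.

```python
import torch
import torch.nn.functional as F

def ensemble_soft_targets(list_of_logits):
    # Average the class probabilities of several teacher models (one logits tensor per teacher).
    probs = [F.softmax(z, dim=-1) for z in list_of_logits]
    return torch.stack(probs).mean(dim=0)

teachers = [torch.randn(4, 3) for _ in range(3)]   # toy logits from 3 fine-tuned teachers
soft_targets = ensemble_soft_targets(teachers)
print(soft_targets.sum(dim=-1))                    # each row sums to 1
```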

  35. Teacher Annealing • BAM! Born-Again Multi-Task Networks for Natural Language Understanding • Born-again: the student has the same model architecture as the teacher. • Teacher annealing: the weight λ is linearly increased from 0 to 1 during training.
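A sketch of teacher annealing as the slide describes it: the training target interpolates between the teacher's prediction and the gold label, with the gold-label weight λ increased linearly from 0 to 1 over training, so the student relies on the teacher early on and on the gold labels later. The cross-entropy-against-mixed-target form below is an assumption about the exact loss.

```python
import torch
import torch.nn.functional as F

def annealed_target(gold_onehot, teacher_probs, step, total_steps):
    # lambda goes linearly from 0 (pure distillation) to 1 (pure gold labels) over training
    lam = min(step / total_steps, 1.0)
    return lam * gold_onehot + (1.0 - lam) * teacher_probs

def annealed_loss(student_logits, gold_onehot, teacher_probs, step, total_steps):
    target = annealed_target(gold_onehot, teacher_probs, step, total_steps)
    # cross-entropy of the student against the mixed (annealed) target distribution
    return -(target * F.log_softmax(student_logits, dim=-1)).sum(dim=-1).mean()

# toy example: 4 examples, 3 classes
gold = F.one_hot(torch.tensor([0, 2, 1, 0]), num_classes=3).float()
teacher = F.softmax(torch.randn(4, 3), dim=-1)
student = torch.randn(4, 3, requires_grad=True)
print(annealed_loss(student, gold, teacher, step=100, total_steps=1000))
```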
