Knowledge Distillation
Xiachong Feng
Pic: https://data-soup.gitlab.io/blog/knowledge-distillation/
Outline
1. Why Knowledge Distillation?
2. Distilling the Knowledge in a Neural Network (NIPS 2014)
3. Model Compression:
   - Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (arXiv)
   - Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (arXiv)
   - Sequence-Level Knowledge Distillation (EMNLP 2016)
   - Cross-lingual Distillation for Text Classification (ACL 2017)
   - Cross-lingual headline generation (IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018)
   - … Approach for Event Detection (AAAI 2019)
XLNet cost about $60,000 to train, as much as five BERTs: the "price tag" of large models is staggering. https://zhuanlan.zhihu.com/p/71609636?utm_source=wechat_session&utm_medium=social&utm_oi=71065644564480&from=timeline&isappinstalled=0&s_r=0
Deeper models greatly improve the state of the art on more and more tasks.
But such deep models cannot be deployed in resource-restricted systems such as mobile devices. They may be inapplicable in real-time systems either, because of low inference-time efficiency.
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks
Knowledge distillation is a process of distilling or transferring the knowledge from a (set of) large, cumbersome model(s) to a lighter, easier-to-deploy single model, without significant loss in performance.
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding
Andrej Karpathy A Recipe for Training Neural Networks http://karpathy.github.io/2019/04/25/recipe/
Towser: How should we evaluate the BERT model? https://www.zhihu.com/question/298203515/answer/509923837
霍华德: BERT works so well in NLP; where should NLP go next? https://www.zhihu.com/question/320606353/answer/658786633
Distilling the Knowledge in a Neural Network. Hinton et al., NIPS 2014 Deep Learning Workshop.
The knowledge in an ensemble of expensive models can be transferred to a single small model: distillation moves the knowledge from the cumbersome model to a small model that is more suitable for deployment.
View 1: knowledge as the learned parameters W that map input to output.
A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.
View 2: knowledge as the learned mapping from input to output itself.
Teacher-student training: the larger model (teacher) produces soft targets on the training data, and the small model (student) learns to mimic the teacher by minimizing a loss against those soft targets. The teacher is used only at training time; the student is what gets deployed at test time.
https://blog.csdn.net/qq_22749699/article/details/79460817
Temperature: the logits are softened by a temperature T before the softmax. During training, teacher and student use the same T; at test time the student runs with T = 1. A hard softmax output such as (0.98, 0.01, 0.01) is nearly one-hot; raising T yields soft targets that carry much more information than the one-hot label.
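In symbols (standard notation from the Hinton et al. paper; the mixing weight $\alpha$ is illustrative, not on the slide), the temperature-softened softmax and the usual combined objective are:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

$$\mathcal{L} = \alpha \, T^2 \cdot \mathrm{KL}\!\left(q^{\text{teacher}}_T \,\big\|\, q^{\text{student}}_T\right) + (1 - \alpha) \cdot \mathrm{CE}\!\left(y, \, q^{\text{student}}_{T=1}\right)$$

The $T^2$ factor compensates for the $1/T^2$ scaling that the temperature introduces into the soft-target gradients, as noted in the original paper.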
Why do soft targets help?
1. Soft targets have high entropy: they carry more information per example than one-hot labels (Naiyan Wang, https://www.zhihu.com/question/50519680/answer/136363665).
2. They encode the similarity structure between classes (周博磊, https://www.zhihu.com/question/50519680/answer/136359743).
3. They cover multiple modes of the output distribution (Jiatao Gu, Non-Autoregressive Neural Machine Translation, https://zhuanlan.zhihu.com/p/34495294).
In short, distillation provides: 1. supervisory signals, 2. data augmentation, 3. mode reduction.
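A minimal PyTorch sketch of this classic distillation loss (the function name and the default values of T and alpha are illustrative, not from the slides):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Hinton-style distillation: KL to the softened teacher + CE to hard labels."""
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    # T^2 keeps the soft-target gradients on the same scale as the hard loss.
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)  # student uses T=1 at test time
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a frozen teacher and a trainable student scored on the same batch.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
kd_loss(student_logits, teacher_logits, labels).backward()
```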
Transfer set: the student is trained not only on the original training data (with hard targets) but also on unlabeled data, where the teacher's soft targets provide the only supervision (Domain Adaptation of DNN Acoustic Models Using Knowledge Distillation, ICASSP 2017).
Transfer set = unlabeled data + original training set.
How should we understand the soft-target technique? Yjango https://www.zhihu.com/question/50519680?sort=created
Distilling Task-Specific Knowledge from BERT into Simple Neural Networks (University of Waterloo, arXiv). BERT, a large language representation model, is distilled into a single-layer BiLSTM for classification, with roughly 100 times fewer parameters and 15 times less inference time.
The student is a single-layer BiLSTM topped with a linear classifier. The original labeled data may not suffice for the teacher model to fully express its knowledge, so the transfer set is extended with a large unlabeled dataset, with pseudo-labels provided by the teacher.
The unlabeled set is built with three data augmentation rules (a rough code sketch follows the list):
1. Masking: replace a word with [MASK].
2. POS-guided word replacement: with probability p_pos, replace a word with another of the same POS tag.
3. n-gram sampling: randomly sample an n-gram from the example, where n is randomly selected from {1, 2, ..., 5}.
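A rough sketch of these rules (the probabilities, the toy POS lexicon, and the function names are placeholders; the paper's actual hyperparameters and tagger are not reproduced here):

```python
import random

P_MASK, P_POS = 0.1, 0.1  # placeholder probabilities, not the paper's values

# Toy POS lexicon; in practice a real POS tagger supplies the tags.
POS_VOCAB = {"NOUN": ["boy", "hospital", "model"], "VERB": ["died", "runs", "learns"]}

def mask_or_swap(tokens, pos_tags):
    """Rules 1 and 2: per-token masking and same-POS word replacement."""
    out = []
    for tok, tag in zip(tokens, pos_tags):
        r = random.random()
        if r < P_MASK:
            out.append("[MASK]")                       # rule 1: masking
        elif r < P_MASK + P_POS and tag in POS_VOCAB:
            out.append(random.choice(POS_VOCAB[tag]))  # rule 2: POS-guided swap
        else:
            out.append(tok)
    return out

def ngram_sample(tokens):
    """Rule 3: keep a random n-gram, n drawn from {1, ..., 5}."""
    n = random.randint(1, min(5, len(tokens)))
    start = random.randint(0, len(tokens) - n)
    return tokens[start:start + n]
```

Each synthetic example is then labeled with the teacher's predictions.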
Distillation objective: the mean-squared error between the student network's logits and the teacher's logits (see the short sketch below).
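As a sketch, this objective is a one-liner (assuming PyTorch, as above):

```python
import torch.nn.functional as F

def logit_mse_loss(student_logits, teacher_logits):
    # Penalize the student's raw logits directly against the teacher's.
    return F.mse_loss(student_logits, teacher_logits)
```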
Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding (Microsoft).
MT-DNN (Multi-Task Deep Neural Networks for Natural Language Understanding) is built in two stages: a pre-training stage, where the shared layers are initialized using pre-trained BERT, and an MTL stage, where the model is trained on the GLUE dataset via MTL (as in Algorithm 1) with the parameters of its task-specific output layers randomly initialized. This yields a single model for all GLUE tasks.
Distillation: the teacher is an ensemble of different MT-DNNs; the student is a single MT-DNN with the same architecture as the teacher, initialized from the pre-trained MT-DNN and trained on correct targets + soft targets, where λ is linearly increased from 0 to 1 during training.
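One plausible reading of that schedule, as a sketch (which term λ weights is my assumption; the slide only says λ grows linearly):

```python
def mixed_loss(hard_loss, soft_loss, step, total_steps):
    """Blend correct-target and soft-target losses with a linear schedule."""
    lam = min(step / total_steps, 1.0)  # 0 -> 1 over training
    return (1.0 - lam) * hard_loss + lam * soft_loss
```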
Sequence-Level Knowledge Distillation (EMNLP16, Yoon Kim, Harvard). Knowledge distillation is applied in the sequence-to-sequence setting. As a result, greedy decoding on the distilled 2 × 500 model is 10 times faster than beam search on the 4 × 1000 teacher, with comparable performance.
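Sequence-level KD replaces the gold target sequences with the teacher's beam-search outputs and trains the student on those. A schematic sketch; `teacher_beam_search` and the `student` object with a `neg_log_likelihood` method are hypothetical stand-ins:

```python
def make_distill_corpus(teacher_beam_search, source_sentences):
    """Pair each source with the teacher's beam-search hypothesis."""
    return [(src, teacher_beam_search(src)) for src in source_sentences]

def train_student(student, distill_corpus, optimizer):
    for src, teacher_hyp in distill_corpus:
        # Ordinary word-level cross-entropy, but against the teacher's output
        # sequence: the student learns to reproduce the teacher's mode.
        loss = student.neg_log_likelihood(src, teacher_hyp)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

The distilled student can then be decoded greedily, avoiding beam search at inference time.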
Cross-lingual distillation for text classification (ACL17, CMU). Documents in different languages are classified into the same taxonomy of categories. Question: can classifiers in a label-rich source language help the classification of documents in other label-poor target languages?
First, train a source-language classifier on labeled source documents. Its knowledge is then transferred to the distilled model in the target language by training it on a parallel corpus: the two sides of each parallel pair should have the same distribution over classes as predicted by the source model and the target model, which defines the loss between the source-language classifier and the target-language classifier.
The adversarial variant adds three components: a feature extractor (a CNN), a classifier, and a discriminator trained through gradient reversal. The goal is an extractor with good discriminative performance on the labeled set L that also produces features with similar distributions on L and the unlabeled set U.
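The gradient-reversal trick can be implemented as a custom autograd function; this is the standard construction, not code from the paper:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The reversed gradient pushes the feature extractor to make
        # features on L and U indistinguishable to the discriminator.
        return -ctx.lam * grad_output, None

features = torch.randn(4, 16, requires_grad=True)
reversed_features = GradReverse.apply(features)  # fed to the discriminator
```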
Ayana, Shi-qi Shen, Yun Chen, Cheng Yang, Zhi-yuan Liu, and Mao-song Sun. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 12, December 2018.
Task: generate a target-language (e.g., Chinese) headline for a document given in a different source language (e.g., English). Parallel corpora of source-language articles and target-language headlines are scarce, so training proceeds in phases.
Datasets: LDC2004T07, LDC2004T08 and LDC2005T06.
[Figure: training pipelines over English articles, English headlines, Chinese articles, and Chinese headlines, built from pre-trained NMT and NHG (neural headline generation) components, e.g., English article → NMT → Chinese article → NHG → Chinese headline; CNHG (cross-lingual neural headline generation) is trained from these pipelines.]
AAAI19. Jian Liu, Yubo Chen, Kang Liu. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences.
Kang Liu (刘康), Associate Professor: sentiment analysis, information extraction, question answering. Yubo Chen (陈玉博), Associate Professor. Jun Zhao (赵军): event extraction, relation extraction and knowledge graph construction.
Event Extraction example: "The boy died in the hospital." The event trigger "died" has Type: Die; the event arguments are "the boy" (Role=Victim) and "the hospital" (Role=Place).
Event Detection is the subtask that identifies only the event trigger and its type ("died", Type: Die), without the event arguments.
Challenge: trigger variation and ambiguity, where the same trigger word might refer to entirely different events (e.g., Transfer-Money vs. Release-Parole). Event arguments can provide evidence for event type disambiguation, but in raw sentences such argument annotations are missing.
Two encoders are trained: a teacher E_tea, which sees the gold argument annotations, and a student E_stu, which sees only raw sentences. A discriminator D estimates the probability that a feature f(w_t) comes from E_tea. Training (a condensed sketch follows the list):
1. Pre-train E_tea and the event classifier C on annotated sentences; at test time, E_stu and C build a raw-sentence event detector.
2. Use the outputs of E_tea as positive examples (labeled as 1s) and the outputs of E_stu as negative examples (labeled as 0s) to pre-train D, with both encoders frozen.
3. Train E_stu against the final classification error plus an adversarial term measuring whether E_stu has successfully fooled D, with the other modules frozen.
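A condensed PyTorch sketch of steps 2 and 3 under the assumptions above (encoder types, dimensions, and loss weighting are illustrative):

```python
import torch
import torch.nn as nn

dim = 128
E_tea = nn.GRU(100, dim, batch_first=True)          # teacher encoder (annotated input)
E_stu = nn.GRU(100, dim, batch_first=True)          # student encoder (raw sentences)
D = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # discriminator
bce = nn.BCELoss()

def d_step(x_tea, x_stu, opt_d):
    """Step 2: teacher features -> 1, student features -> 0; encoders frozen."""
    with torch.no_grad():
        f_tea, _ = E_tea(x_tea)
        f_stu, _ = E_stu(x_stu)
    s_tea, s_stu = D(f_tea), D(f_stu)
    loss = bce(s_tea, torch.ones_like(s_tea)) + bce(s_stu, torch.zeros_like(s_stu))
    opt_d.zero_grad(); loss.backward(); opt_d.step()

def stu_step(x_stu, cls_loss, opt_stu):
    """Step 3: classification error + fooling D (opt_stu holds only E_stu's params)."""
    f_stu, _ = E_stu(x_stu)
    s_stu = D(f_stu)
    fool_loss = bce(s_stu, torch.ones_like(s_stu))  # student wants D to say "teacher"
    loss = cls_loss + fool_loss
    opt_stu.zero_grad(); loss.backward(); opt_stu.step()
```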
Input features combine word, entity, and event-argument information; in practice such annotations would come from, e.g., LSTM-CRF taggers.
References
1. What is Knowledge Distillation? https://data-soup.gitlab.io/blog/knowledge-distillation/
2. 李如, [DL] Model Distillation. https://zhuanlan.zhihu.com/p/71986772
3. Towser, How should we evaluate the BERT model? https://www.zhihu.com/question/298203515/answer/509923837
4. 霍华德, BERT works so well in NLP; where should NLP go next? https://www.zhihu.com/question/320606353/answer/658786633
5. Andrej Karpathy, A Recipe for Training Neural Networks. http://karpathy.github.io/2019/04/25/recipe/
6. XLNet cost about $60,000 to train, as much as five BERTs: the "price tag" of large models is staggering. https://zhuanlan.zhihu.com/p/71609636?utm_source=wechat_session&utm_medium=social&utm_oi=71065644564480&from=timeline&isappinstalled=0&s_r=0
7. Naiyan Wang. https://www.zhihu.com/question/50519680/answer/136363665
8. 周博磊. https://www.zhihu.com/question/50519680/answer/136359743
9. https://blog.csdn.net/qq_22749699/article/details/79460817
10. Yjango, How should we understand the soft-target technique? https://www.zhihu.com/question/50519680?sort=created
11. Jiatao Gu, Non-Autoregressive Neural Machine Translation. https://zhuanlan.zhihu.com/p/34495294