slide-1
SLIDE 1

Knowledge Distillation

Xiachong Feng

Pic: https://data-soup.gitlab.io/blog/knowledge-distillation/

slide-2
SLIDE 2

Outline

  • Why Knowledge Distillation?
  • Distilling the knowledge in a neural network NIPS2014
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks arxiv 2018
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural

Language Understanding arxiv

  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence level knowledge distillation EMNLP16
  • Cross Lingual NLP
  • Cross-lingual Distillation for Text Classification ACL17
  • Zero-Shot Cross-Lingual Neural Headline Generation IEEE/ACM TRANSACTIONS ON

AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2018

  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation

Approach for Event Detection AAAI19

  • Paper List
  • Reference
  • Conclusion
slide-3
SLIDE 3

Cost

  • BERT-large
  • Contains 24 transformer layers with 344 million parameters
  • 16 Cloud TPUs | 4 days
  • ~$12,000
  • GPT-2
  • Contains 48 transformer layers with 1.5 billion parameters
  • 64 Cloud TPU v3 | one week
  • ~$43,000
  • XLNet
  • 128 Cloud TPU v3 | two and a half days
  • ~$61,000

XLNet cost about $60,000 to train, roughly the price of five BERTs; the "price tag" of large models is staggering. https://zhuanlan.zhihu.com/p/71609636?utm_source=wechat_session&utm_medium=social&utm_oi=71065644564480&from=timeline&isappinstalled=0&s_r=0

slide-4
SLIDE 4

Trade-Off

Deeper models that greatly improve state of the art on more tasks

  • They may be inapplicable in resource-restricted systems such as mobile devices.
  • They may be inapplicable in real-time systems either, because of low inference-time efficiency.

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

slide-5
SLIDE 5

Knowledge Distillation

Knowledge distillation is a process of distilling or transferring the knowledge from a (set of) large, cumbersome model(s) to a lighter, easier-to-deploy single model, without significant loss in performance.

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

slide-6
SLIDE 6

Hot Topic

Andrej Karpathy A Recipe for Training Neural Networks http://karpathy.github.io/2019/04/25/recipe/

slide-7
SLIDE 7

Hot Topic

Towser, "How to evaluate the BERT model": https://www.zhihu.com/question/298203515/answer/509923837  霍华德, "BERT works so well in NLP; where should NLP go next?": https://www.zhihu.com/question/320606353/answer/658786633

slide-8
SLIDE 8

Distilling the Knowledge in a Neural Network

Hinton NIPS 2014 Deep Learning Workshop

slide-9
SLIDE 9

Model Compression

  • Ensemble model
  • Cumbersome and may be too computationally

expensive

  • Solution
  • The knowledge acquired by a large ensemble of

models can be transferred to a single small model.

  • We use "distillation" to refer to transferring the knowledge

from the cumbersome model to a small model that is more suitable for deployment.

slide-10
SLIDE 10

What is Knowledge?

Parameters W!

1

slide-11
SLIDE 11

What is Knowledge?

Input Output

A more abstract view of the knowledge, that frees it from any particular instantiation, is that it is a learned mapping from input vectors to output vectors.

Mapping : Input to Output!

2

slide-12
SLIDE 12

Knowledge Distillation

Figure: the training data is used to train a larger teacher model; the teacher's soft targets define a loss with which a small student model learns to mimic the teacher; the student is the model used at test time.

slide-13
SLIDE 13

Softmax With Temperature

https://blog.csdn.net/qq_22749699/article/details/79460817

Figure: the softmax takes the logits and a temperature T.
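The formula on this slide was an image; as a reconstruction of the standard temperature-scaled softmax (Hinton et al.), with logits z_i and temperature T:

```latex
q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
```

T = 1 recovers the ordinary softmax; a larger T yields a softer (higher-entropy) distribution over classes.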

slide-14
SLIDE 14

Note

Figure: during training, the soft-target loss uses the same temperature T for both teacher and student; at test time the student uses T = 1.

slide-15
SLIDE 15

Soft Targets

Figure: for an input example, the teacher's soft targets spread probability over the classes, e.g. 0.98 / 0.01 / 0.01.

slide-16
SLIDE 16

Supervisory signals

  • 2 is similar to 3 and 7
  • Continuous distribution
  • Inter-Class variance ✔
  • Between-Class distance ✔
  • 2 independent of 3 and 7.
  • Discrete distribution
  • Inter-Class variance
  • Between-Class distance

Soft target One-hot

Naiyan Wang https://www.zhihu.com/question/50519680/answer/136363665

1

Soft targets have high entropy!

slide-17
SLIDE 17

Data augmentation

2

周博磊 https://www.zhihu.com/question/50519680/answer/136359743

Similarity

slide-18
SLIDE 18

Reduce Modes

  • NMT : Real translation data has many modes.
  • MLE training tends to use a single-mode model to

cover multiple modes.

3

Jiatao Gu Non-Autoregressive Neural Machine Translation https://zhuanlan.zhihu.com/p/34495294

slide-19
SLIDE 19

Soft Targets

1. Supervisory signals  2. Data augmentation  3. Reduce modes

slide-20
SLIDE 20

How to use unlabeled data?

Figure: the same teacher-student setup, with the loss computed on both the original training data and unlabeled data soft-labeled by the teacher.

slide-21
SLIDE 21

Loss function

DOMAIN ADAPTATION OF DNN ACOUSTIC MODELS USING KNOWLEDGE DISTILLATION 2017 ICASSP

Figure: the loss combines a soft-target term (student vs. teacher) and a hard-target term (student vs. ground truth).

Transfer set = unlabeled data + original training set
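A minimal sketch of this combined loss, assuming PyTorch; the function name, the α weighting, and the temperature handling are illustrative, not taken from the slide or the ICASSP paper:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of a soft-target term (teacher) and a hard-target term (gold labels)."""
    # Soft-target term: KL divergence between temperature-scaled distributions,
    # scaled by T^2 so its gradient magnitude matches the hard-target term.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

On the unlabeled part of the transfer set only the soft-target term applies, since no gold labels are available.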

slide-22
SLIDE 22

Knowledge Distillation

Yjango, "How should the soft-target approach be understood?" https://www.zhihu.com/question/50519680?sort=created

slide-23
SLIDE 23

Outline

  • Why Knowledge Distillation?
  • Distilling the knowledge in a neural network NIPS2014
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks arxiv 2018
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural

Language Understanding arxiv

  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence level knowledge distillation EMNLP16
  • Cross Lingual NLP
  • Cross-lingual Distillation for Text Classification ACL17
  • Zero-Shot Cross-Lingual Neural Headline Generation IEEE/ACM TRANSACTIONS ON

AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2018

  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation

Approach for Event Detection AAAI19

  • Paper List
  • Reference
  • Conclusion
slide-24
SLIDE 24

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

University of Waterloo arxiv

slide-25
SLIDE 25

Overview

  • Distill knowledge from BERT, a state-of-the-art

language representation model, into a single-layer BiLSTM

  • Task
  • 1. Binary sentiment classification
  • 2. Multi-genre Natural Language Inference
  • 3. Quora Question Pairs redundancy

classification

  • Achieve comparable results with ELMo, while using

roughly 100 times fewer parameters and 15 times less inference time.

slide-26
SLIDE 26

Teacher Model

  • Teacher Model: fine-tuned BERT-large

slide-27
SLIDE 27

Student Model

  • Student Model: single-layer BiLSTM with a non-linear classifier (see the sketch below)
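A minimal sketch of such a student in PyTorch; the class name and dimensions are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class BiLSTMStudent(nn.Module):
    """Single-layer BiLSTM encoder followed by a small non-linear classifier."""

    def __init__(self, vocab_size, emb_dim=300, hidden_dim=150, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, num_layers=1,
                              bidirectional=True, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),  # logits, usable for distillation
        )

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)             # (batch, seq, emb_dim)
        _, (h_n, _) = self.bilstm(embedded)              # h_n: (2, batch, hidden_dim)
        sentence = torch.cat([h_n[0], h_n[1]], dim=-1)   # concat fwd/bwd final states
        return self.classifier(sentence)                 # (batch, num_classes)
```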

slide-28
SLIDE 28

Data Augmentation for Distillation

  • In the distillation approach, a small dataset may not suffice for the teacher model to fully express its knowledge.
  • Augment the training set with a large, unlabeled dataset, with pseudo-labels provided by the teacher.

  • Method
  • Masking. With probability pmask , we randomly

replace a word with [MASK].

  • POS-guided word replacement. With probability

ppos , we replace a word with another of the same POS tag.

  • n-gram sampling. With probability png , we

randomly sample an n-gram from the example, where n is randomly selected from {1, 2, . . . , 5}.
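A rough sketch of the three operations, assuming tokenized input and precomputed POS tags; the probabilities, helper names, and the way the operations are combined are illustrative rather than the paper's exact procedure:

```python
import random

def augment(tokens, pos_tags, pos_vocab, p_mask=0.1, p_pos=0.1, p_ng=0.25):
    """Masking, POS-guided replacement, and n-gram sampling for one tokenized example.

    pos_vocab maps a POS tag to a list of words observed with that tag (assumed precomputed).
    """
    out = []
    for word, tag in zip(tokens, pos_tags):
        r = random.random()
        if r < p_mask:                                   # masking
            out.append("[MASK]")
        elif r < p_mask + p_pos and pos_vocab.get(tag):  # POS-guided word replacement
            out.append(random.choice(pos_vocab[tag]))
        else:
            out.append(word)
    if random.random() < p_ng and len(out) > 1:          # n-gram sampling, n in {1..5}
        n = random.randint(1, min(5, len(out)))
        start = random.randint(0, len(out) - n)
        out = out[start:start + n]
    return out
```

Each synthetic example is then labeled with the teacher's predicted logits rather than a gold label.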

slide-29
SLIDE 29

Distillation objective

Figure: the teacher's logits and the student's logits.

  • Mean-squared-error (MSE) loss between the student network's logits and the teacher's logits.
  • The authors found MSE to perform slightly better in their experiments.
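In LaTeX, with teacher logits $z^{(T)}$ and student logits $z^{(S)}$, the objective described above is:

```latex
\mathcal{L}_{\text{distill}} = \left\lVert z^{(T)} - z^{(S)} \right\rVert_2^2
```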
slide-30
SLIDE 30

Result

slide-31
SLIDE 31

Outline

  • Why Knowledge Distillation?
  • Distilling the knowledge in a neural network NIPS2014
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks arxiv 2018
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural

Language Understanding arxiv

  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence level knowledge distillation EMNLP16
  • Cross Lingual NLP
  • Cross-lingual Distillation for Text Classification ACL17
  • Zero-Shot Cross-Lingual Neural Headline Generation IEEE/ACM TRANSACTIONS ON

AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2018

  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation

Approach for Event Detection AAAI19

  • Paper List
  • Reference
  • Conclusion
slide-32
SLIDE 32

Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural Language Understanding

Microsoft

slide-33
SLIDE 33

MT-DNN

Figure: MT-DNN training has a pre-training stage and a multi-task learning (MTL) stage.

Multi-task deep neural networks for natural language understanding

slide-34
SLIDE 34

Distillation

  • The parameters of its shared layers are initialized using the MT-DNN model pre-

trained on the GLUE dataset via MTL, as in Algorithm 1, and the parameters of its task-specific output layers are randomly initialized.

  • Distilled MT-DNN significantly outperforms the original MT-DNN on 7 out of 9

GLUE tasks (single model).

Figure: an ensemble of different MT-DNNs serves as the teacher; the student is trained on correct targets + soft targets. The student's shared layers are initialized from the MT-DNN model pre-trained on the GLUE dataset, which is in turn initialized from pre-trained BERT.
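For a class $c$, the soft targets used here are the ensemble's averaged class probabilities; schematically (the notation is mine, not the paper's):

```latex
Q(c \mid X) = \frac{1}{K} \sum_{k=1}^{K} \operatorname{softmax}\!\big(z^{(k)}(X)\big)_c
```

where $z^{(k)}(X)$ are the logits of the $k$-th teacher MT-DNN.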

slide-35
SLIDE 35

Teacher Annealing

  • BAM! Born-Again Multi-Task Networks for Natural

Language Understanding

  • Born Again : the student has the same model

architecture as the teacher.

λ is linearly increased from 0 to 1
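Schematically, the annealed training target mixes gold labels and teacher predictions (a sketch of the BAM objective; the exact task loss $\ell$ depends on the task):

```latex
\mathcal{L} = \ell\Big(f_{\text{student}}(x),\; \lambda\, y_{\text{gold}} + (1-\lambda)\, f_{\text{teacher}}(x)\Big),
\qquad \lambda \text{ increased linearly from } 0 \text{ to } 1
```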

slide-36
SLIDE 36

Outline

  • Why Knowledge Distillation?
  • Distilling the knowledge in a neural network NIPS2014
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks arxiv 2018
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural

Language Understanding arxiv

  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence level knowledge distillation EMNLP16
  • Cross Lingual NLP
  • Cross-lingual Distillation for Text Classification ACL17
  • Zero-Shot Cross-Lingual Neural Headline Generation IEEE/ACM TRANSACTIONS ON

AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2018

  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation

Approach for Event Detection AAAI19

  • Paper List
  • Reference
  • Conclusion
slide-37
SLIDE 37

Sequence level knowledge distillation

EMNLP16 Yoon Kim Harvard

slide-38
SLIDE 38

Seq2Seq

  • Standard KD targets non-recurrent models in the multiclass prediction

setting.

  • Method
  • Word-level Distillation
  • Two novel sequence-level versions of

knowledge distillation

  • Sequence-Level Knowledge Distillation
  • Sequence-Level Interpolation
slide-39
SLIDE 39

Word-Level
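The figure on this slide is missing; in word-level knowledge distillation (Kim & Rush), the student is trained on the teacher's per-position output distributions:

```latex
\mathcal{L}_{\text{WORD-KD}} = - \sum_{t=1}^{|y|} \sum_{k \in \mathcal{V}}
  q(y_t = k \mid x, y_{<t}) \, \log p(y_t = k \mid x, y_{<t})
```

Here $q$ is the teacher distribution, $p$ the student, and $\mathcal{V}$ the target vocabulary.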

slide-40
SLIDE 40

Sentence Level
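Sequence-level knowledge distillation instead matches the teacher's distribution over whole output sequences, approximated in practice by its beam-search output $\hat{y}$:

```latex
\mathcal{L}_{\text{SEQ-KD}} = - \sum_{y \in \mathcal{Y}} q(y \mid x)\, \log p(y \mid x)
  \;\approx\; - \log p(\hat{y} \mid x),
\qquad \hat{y} \approx \arg\max_{y} q(y \mid x)
```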

slide-41
SLIDE 41

Result

  • Large state-of-the-art 4 × 1000 LSTM
  • → 2 × 500 LSTM
  • Not requiring any beam search at test-time. As a

result we are able to perform greedy decoding on the 2 × 500 model 10 times faster than beam search on the 4 × 1000 model with comparable performance.

slide-42
SLIDE 42

Outline

  • Why Knowledge Distillation?
  • Distilling the knowledge in a neural network NIPS2014
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks arxiv 2018
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural

Language Understanding arxiv

  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence level knowledge distillation EMNLP16
  • Cross Lingual NLP
  • Cross-lingual Distillation for Text Classification ACL17
  • Zero-Shot Cross-Lingual Neural Headline Generation IEEE/ACM TRANSACTIONS ON

AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2018

  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation

Approach for Event Detection AAAI19

  • Paper List
  • Reference
  • Conclusion
slide-43
SLIDE 43

Cross-lingual Distillation for Text Classification

ACL17 CMU

slide-44
SLIDE 44

Overview

  • Task
  • Cross-lingual text classification (CLTC) is the task of classifying documents written in

different languages into the same taxonomy of categories.

  • Problem
  • How can we effectively leverage the trained

classifiers in a label-rich source language to help the classification of documents in other label-poor target languages?

  • Method
  • Vanilla version
  • Distillation with Adversarial Feature Adaptation
slide-45
SLIDE 45

Vanilla version

  • The first step of our framework is to train the

source-language classifier on labeled source documents .

  • In the second step, the knowledge captured in the source-language classifier

is transferred to the distilled model in the target language by training it on the parallel corpus.

slide-46
SLIDE 46

Vanilla version

Source language classifier Target language classifier Loss

  • Intuition
  • The intuition is that paired documents in parallel corpus

should have the same distribution of class predicted by the source model and target model.
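Written out (my notation, not the paper's): for each parallel pair, the target-language student is trained on the source classifier's soft predictions:

```latex
\mathcal{L}_{\text{vanilla}} = - \sum_{(x_{\text{src}},\, x_{\text{tgt}})} \sum_{c}
  P_{\text{src}}(c \mid x_{\text{src}}) \, \log P_{\text{tgt}}(c \mid x_{\text{tgt}})
```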

slide-47
SLIDE 47

Problem

slide-48
SLIDE 48

Distillation with Adversarial Feature Adaptation

Figure: a classifier, a discriminator, and a CNN feature extractor. The feature extractor is trained (via gradient reversal) so that the classifier has good discriminative performance on L while the extracted features have similar distributions on L and U.
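The "gradient reverse" trick from the figure is an identity in the forward pass that flips (and scales) the gradient in the backward pass; a minimal PyTorch sketch, not the authors' code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; multiplies the incoming gradient by -lambd in backward."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    """Insert between the feature extractor and the discriminator."""
    return GradReverse.apply(x, lambd)
```

Placed between the feature extractor and the discriminator, it lets one backward pass train the discriminator normally while pushing the feature extractor toward features the discriminator cannot tell apart.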

slide-49
SLIDE 49

Zero-Shot Cross-Lingual Neural Headline Generation

Ayana, Shi-qi Shen, Yun Chen, Cheng Yang, Zhi-yuan Liu, and Mao-song Sun IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 26, NO. 12, DECEMBER 2018

slide-50
SLIDE 50

Cross-lingual headline generation

  • Task
  • Produce a headline in a target language (e.g., Chinese)

given a document in a different source language (e.g., English).

  • Problem
  • Lack of parallel corpora that directly pair source-language

articles with target-language headlines.

  • Error propagation in the translation and summarization

phases.

slide-51
SLIDE 51

Corpus

  • English headline generation
  • Gigaword
  • Chinese headline generation
  • LCSTS
  • English-Chinese translation
  • LDC2002E18, LDC2003E07, LDC2003E14, part of

LDC2004T07, LDC2004T08 and LDC2005T06.

slide-52
SLIDE 52

Model

Figure: pipelines built from NMT (English-Chinese translation) and NHG (monolingual headline generation); CNHG, the English-article-to-Chinese-headline model, is pre-trained from them (paths 1 and 2).

slide-53
SLIDE 53

Teacher-Student

Figure: teacher-student training; the CNHG student (path 1) is trained to imitate the NMT + NHG teacher pipelines (path 2).

slide-54
SLIDE 54

Outline

  • Why Knowledge Distillation?
  • Distilling the knowledge in a neural network NIPS2014
  • Model Compression
  • Distilling Task-Specific Knowledge from BERT into Simple Neural Networks arxiv 2018
  • Multi-Task Setting
  • Improving Multi-Task Deep Neural Networks via Knowledge Distillation for Natural

Language Understanding arxiv

  • BAM! Born-Again Multi-Task Networks for Natural Language Understanding
  • Seq2Seq NMT
  • Sequence level knowledge distillation EMNLP16
  • Cross Lingual NLP
  • Cross-lingual Distillation for Text Classification ACL17
  • Zero-Shot Cross-Lingual Neural Headline Generation IEEE/ACM TRANSACTIONS ON

AUDIO, SPEECH, AND LANGUAGE PROCESSING, 2018

  • Variant
  • Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation

Approach for Event Detection AAAI19

  • Paper List
  • Reference
  • Conclusion
slide-55
SLIDE 55

Exploiting the Ground-Truth: An Adversarial Imitation Based Knowledge Distillation Approach for Event Detection

AAAI19. Jian Liu, Yubo Chen, Kang Liu. National Laboratory of Pattern Recognition, Institute of Automation.
slide-56
SLIDE 56

Author

刘康 (Kang Liu), Associate Professor: sentiment analysis, information extraction, question answering. 陈玉博 (Yubo Chen), Associate Professor (2017). 赵军 (Jun Zhao): event extraction, relation extraction, and knowledge graph construction.

slide-57
SLIDE 57

Event Detection

  • Event Detection ∈ Event Extraction

Example: "The boy died in the hospital."

Event Extraction: event trigger "died" (Type: Die); event arguments "the boy" (Role = Victim) and "the hospital" (Role = Place).

Event Detection: event trigger "died" (Type: Die) only.

slide-58
SLIDE 58

Problem

  • Ambiguity
  • The same event can be expressed in widely

varying ways.

  • Depending on the context, the same expression

might refer to entirely different events.

Example: the same expression could express Transfer-Money or Release-Parole, depending on context.

slide-59
SLIDE 59

Previous

  • Chunk knowledge corresponding to the sentences

can provide evidence for event type disambiguation

  • Problem
  • In the real test scenario, the ground-truth

annotations are missing.

  • Pipeline: error propagation
slide-60
SLIDE 60

Model

slide-61
SLIDE 61

Attention Based Encoder

Figure: a teacher encoder Etea (annotation-enhanced input) and a student encoder Estu (raw sentences).

slide-62
SLIDE 62

Binary Classification-Based Discriminator

  • Input
  • Output
  • A probability p indicating how likely D thinks

that f(wt) comes from Etea.

slide-63
SLIDE 63

Multi-class Event Classifier

slide-64
SLIDE 64

The Adversarial Imitation Strategy

  • In the Pretraining Stage:
  • concatenate Etea and C to form an event detector
slide-65
SLIDE 65

The Adversarial Imitation Strategy

  • In the Pretraining Stage:
  • concatenate Etea and C to form an event detector
  • freeze the event classifier C, and we concatenate Estu

and C to build a raw-sentences event detector.

slide-66
SLIDE 66

The Adversarial Imitation Strategy

  • In the Pretraining Stage:
  • concatenate Etea and C to form an event detector
  • freeze the event classifier C, and we concatenate Estu

and C to build a raw-sentences event detector.

  • freeze both Etea and Estu; use the outputs of Etea as positive

examples (labeled as 1s) and the outputs of Estu as negative examples (labeled as 0s) to pretrain D.

slide-67
SLIDE 67

The Adversarial Imitation Strategy

  • In the adversarial learning stage, Estu is trained with two signals: the final classification error, and whether Estu has successfully fooled D.
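A hedged sketch of the student-encoder objective in this stage, combining both signals GAN-style (the weighting λ is hypothetical, not from the paper):

```latex
\mathcal{L}_{E_{\text{stu}}} =
  \underbrace{\mathcal{L}_{\text{cls}}\big(C(E_{\text{stu}}(x)),\, y\big)}_{\text{final classification error}}
  \; + \; \lambda \, \underbrace{\big(-\log D(E_{\text{stu}}(x))\big)}_{\text{fooling } D}
```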

slide-68
SLIDE 68

Experiments

  • ACE 2005 corpus
  • 34-class classification problem (33 event types + None)
slide-69
SLIDE 69

Performance on Gold-truth Annotations

(Inputs: word, entity, and event-argument annotations.)

slide-70
SLIDE 70

Performance in the Real Testing Scenario

(In the real testing scenario, annotations are predicted by LSTM-CRF taggers rather than taken from gold labels.)

slide-71
SLIDE 71

Paper List

slide-72
SLIDE 72

Reference

1. What is Knowledge Distillation? https://data-soup.gitlab.io/blog/knowledge-distillation/
2. 李如, [DL] Model Distillation. https://zhuanlan.zhihu.com/p/71986772
3. Towser, How to evaluate the BERT model. https://www.zhihu.com/question/298203515/answer/509923837
4. 霍华德, BERT works so well in NLP; where should NLP go next? https://www.zhihu.com/question/320606353/answer/658786633
5. Andrej Karpathy, A Recipe for Training Neural Networks. http://karpathy.github.io/2019/04/25/recipe/
6. XLNet cost $60,000 to train, about five BERTs; the "price tag" of large models is staggering. https://zhuanlan.zhihu.com/p/71609636?utm_source=wechat_session&utm_medium=social&utm_oi=71065644564480&from=timeline&isappinstalled=0&s_r=0
7. Naiyan Wang. https://www.zhihu.com/question/50519680/answer/136363665
8. 周博磊 (Bolei Zhou). https://www.zhihu.com/question/50519680/answer/136359743
9. https://blog.csdn.net/qq_22749699/article/details/79460817
10. Yjango, How should the soft-target approach be understood? https://www.zhihu.com/question/50519680?sort=created
11. Jiatao Gu, Non-Autoregressive Neural Machine Translation. https://zhuanlan.zhihu.com/p/34495294

slide-73
SLIDE 73

Thanks!