SLIDE 1
Sequence-Level Knowledge Distillation
Yoon Kim, Alexander M. Rush
HarvardNLP
Code: https://github.com/harvardnlp/seq2seq-attn
SLIDE 2
SLIDE 3
Sequence-to-Sequence
Machine Translation (Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2015; Luong et al., 2015)
Question Answering (Hermann et al., 2015)
Conversation (Vinyals and Le, 2015; Serban et al., 2016; Li et al., 2016)
Parsing (Vinyals et al., 2015a)
Speech (Chorowski et al., 2015; Chan et al., 2015)
Summarization (Rush et al., 2015)
Caption Generation (Xu et al., 2015; Vinyals et al., 2015b)
Video Generation (Srivastava et al., 2015)
NER/POS-Tagging (Gillick et al., 2016)
SLIDE 4
SLIDE 5
Neural Machine Translation
Excellent results on many language pairs, but large models are needed:
Original seq2seq paper (Sutskever et al., 2014): 4 layers / 1000 units
Deep Residual RNNs (Zhou et al., 2016): 16 layers / 512 units
Google's NMT system (Wu et al., 2016): 8 layers / 1024 units
Beam search + ensembling on top ⇒ deployment is challenging!
SLIDE 6
SLIDE 7
Related Work: Compressing Deep Models
Pruning: prune weights based on an importance criterion (LeCun et al., 1990; Han et al., 2016; See et al., 2016)
Knowledge Distillation: train a student model to learn from a teacher model (Bucila et al., 2006; Ba and Caruana, 2014; Hinton et al., 2015; Kuncoro et al., 2016); sometimes called "dark knowledge"
Other methods: low-rank matrix factorization of weight matrices (Denton et al., 2014), weight binarization (Lin et al., 2016), weight sharing (Chen et al., 2015)
SLIDE 8
SLIDE 9
SLIDE 10
SLIDE 11
Standard Setup
Minimize the negative log-likelihood (NLL):

L_NLL = − ∑_t ∑_{k∈V} ✶{y_t = k} log p(w_t = k | y_{1:t−1}, x; θ)

w_t: random variable for the t-th target token, with support V
y_t: ground-truth t-th target token
y_{1:t−1}: target sentence up to position t − 1
x: source sentence
p(· | x; θ): model distribution, parameterized by θ (conditioning on the source x is dropped from now on)
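As a concrete reference, here is a minimal PyTorch-style sketch of the per-token NLL objective above. It assumes the decoder already produces a tensor of logits over the vocabulary; the tensor names (logits, targets) and the padding convention are illustrative, not from the talk.

```python
import torch
import torch.nn.functional as F

def nll_loss(logits, targets, pad_idx=0):
    """Word-level NLL: -sum_t log p(w_t = y_t | y_{1:t-1}, x; theta).

    logits:  (batch, seq_len, vocab) decoder scores before the softmax
    targets: (batch, seq_len) ground-truth token indices y_t
    """
    log_probs = F.log_softmax(logits, dim=-1)                       # log p(w_t = k | ...)
    gold = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # log p(w_t = y_t)
    mask = (targets != pad_idx).float()                             # ignore padding positions
    return -(gold * mask).sum()
```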
SLIDE 12
Knowledge Distillation (Bucila et al., 2006; Hinton et al., 2015)
Train a large teacher model first to obtain the teacher distribution q(·).
Train a smaller student model p(·) to mimic the teacher.
SLIDE 13
Word-Level Knowledge Distillation
Teacher distribution: q(w_t | y_{1:t−1}; θ_T)

L_NLL = − ∑_t ∑_{k∈V} ✶{y_t = k} log p(w_t = k | y_{1:t−1}; θ)

L_WORD-KD = − ∑_t ∑_{k∈V} q(w_t = k | y_{1:t−1}; θ_T) log p(w_t = k | y_{1:t−1}; θ)
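A minimal sketch of the word-level distillation term, assuming the teacher is run over the same target prefixes so its per-step distribution is available; student_logits and teacher_logits are hypothetical tensor names with the same shapes as in the NLL sketch above.

```python
import torch
import torch.nn.functional as F

def word_kd_loss(student_logits, teacher_logits, targets, pad_idx=0):
    """Cross-entropy between the teacher distribution q and the student p at each step."""
    q = F.softmax(teacher_logits, dim=-1)           # q(w_t = k | y_{1:t-1}; theta_T)
    log_p = F.log_softmax(student_logits, dim=-1)   # log p(w_t = k | y_{1:t-1}; theta)
    mask = (targets != pad_idx).float()             # targets only used to mask padding here
    # -sum_t sum_k q(w_t = k) log p(w_t = k)
    return -((q * log_p).sum(dim=-1) * mask).sum()
```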
SLIDE 14
SLIDE 15
No Knowledge Distillation
SLIDE 16
Word-Level Knowledge Distillation
SLIDE 17
Word-Level Knowledge Distillation
SLIDE 18
Word-Level Knowledge Distillation

L = α L_WORD-KD + (1 − α) L_NLL
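Combining the two terms is then a one-line weighted sum; this sketch reuses the nll_loss and word_kd_loss functions from the illustrative snippets above, with alpha as the mixture weight from the slide.

```python
def word_kd_objective(student_logits, teacher_logits, targets, alpha=0.5, pad_idx=0):
    """L = alpha * L_WORD-KD + (1 - alpha) * L_NLL."""
    return (alpha * word_kd_loss(student_logits, teacher_logits, targets, pad_idx)
            + (1.0 - alpha) * nll_loss(student_logits, targets, pad_idx))
```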
SLIDE 19
Word-Level Knowledge Distillation Results: English → German (WMT 2014)

Model                       BLEU
4 × 1000 Teacher            19.5
2 × 500  Baseline (No-KD)   17.6
2 × 500  Student (Word-KD)  17.7
2 × 300  Baseline (No-KD)   16.9
2 × 300  Student (Word-KD)  17.6
SLIDE 20
This Work
Generalize single-class knowledge distillation to the sequence level.
Sequence-Level Knowledge Distillation (Seq-KD): train towards the teacher's sequence-level distribution.
Sequence-Level Interpolation (Seq-Inter): train on a mixture of the teacher's distribution and the data.
SLIDE 21
Sequence-Level Knowledge Distillation
Recall word-level knowledge distillation:

L_NLL = − ∑_t ∑_{k∈V} ✶{y_t = k} log p(w_t = k | y_{1:t−1}; θ)

L_WORD-KD = − ∑_t ∑_{k∈V} q(w_t = k | y_{1:t−1}; θ_T) log p(w_t = k | y_{1:t−1}; θ)

Instead of word-level cross-entropy, minimize the cross-entropy between the sequence-level distributions implied by q and p:

L_NLL = − ∑_{w∈T} ✶{w = y} log p(w | x; θ)

L_SEQ-KD = − ∑_{w∈T} q(w | x; θ_T) log p(w | x; θ)

The sum is over an exponentially-sized set T of possible target sequences.
SLIDE 22
SLIDE 23
Sequence-Level Knowledge Distillation
Approximate q(w | x) with its mode:

q(w | x) ≈ ✶{w = argmax_{w′} q(w′ | x)}

Approximate the mode with beam search:

ŷ ≈ argmax_w q(w | x)

Simple model: train the student model on ŷ with NLL.
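Plugging the mode approximation into L_SEQ-KD collapses the sum to a single term, L_SEQ-KD ≈ − log p(ŷ | x; θ), i.e. ordinary NLL training on the teacher's beam-search output. In practice this is a data-generation step: decode the training set with the teacher and train the student on the results. The sketch below assumes a hypothetical teacher.beam_search API returning a K-best list; it illustrates the pipeline rather than the authors' exact code.

```python
def build_seq_kd_dataset(teacher, source_sentences, beam_size=5):
    """Replace gold targets with the teacher's beam-search output y-hat."""
    distilled = []
    for x in source_sentences:
        # approximate the mode of q(w | x) with beam search (assumed teacher API)
        y_hat = teacher.beam_search(x, beam_size=beam_size)[0]  # best hypothesis
        distilled.append((x, y_hat))
    return distilled

# The student is then trained on the (x, y_hat) pairs with the standard NLL loss,
# exactly as if y_hat were the ground-truth translation.
```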
SLIDE 24
SLIDE 25
SLIDE 26
Sequence-Level Knowledge Distillation
SLIDE 27
Sequence-Level Knowledge Distillation
SLIDE 28
Sequence-Level Interpolation
Word-level knowledge distillation, L = α L_WORD-KD + (1 − α) L_NLL, essentially trains the student towards a mixture of the teacher and data distributions.
How can we incorporate ground-truth data at the sequence level?
SLIDE 29
Sequence-Level Interpolation
Naively, we could train on both y (the ground-truth sequence) and ŷ (the teacher's beam-search output). This is non-ideal:
It doubles the size of the training set.
y could be very different from ŷ.
Instead, consider a single-sequence approximation.
SLIDE 30
Sequence-Level Interpolation
Take the sequence on the beam with the highest similarity (under a similarity function sim, e.g. BLEU) to the ground truth:

ỹ = argmax_{w∈T} sim(w, y) q(w | x; θ_T) ≈ argmax_{w∈T_K} sim(w, y)

T_K: the K-best sequences from beam search. Similar to local updating (Liang et al., 2006).
Train the student model on ỹ with NLL.
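A sketch of this single-sequence approximation: among the teacher's K-best hypotheses, keep the one most similar to the gold target. Sentence-level BLEU via NLTK is used here as an illustrative choice of sim, and teacher.beam_search is the same assumed API as in the Seq-KD sketch above.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def seq_inter_target(teacher, x, y_gold, beam_size=35):
    """y-tilde = argmax over the K-best list of sim(w, y_gold), with sim = sentence BLEU."""
    k_best = teacher.beam_search(x, beam_size=beam_size)   # assumed: list of token lists
    smooth = SmoothingFunction().method1                   # avoid zero scores on short outputs
    return max(k_best,
               key=lambda w: sentence_bleu([y_gold], w, smoothing_function=smooth))
```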
SLIDE 31
SLIDE 32
SLIDE 33
Sequence-Level Interpolation
SLIDE 34
Sequence-Level Interpolation
SLIDE 35
SLIDE 36
Experiments on English → German (WMT 2014)
Word-KD: word-level knowledge distillation.
Seq-KD: sequence-level knowledge distillation with beam size K = 5.
Seq-Inter: sequence-level interpolation with beam size K = 35; fine-tuned from a pretrained Seq-KD (or baseline) model with a smaller learning rate.
SLIDE 37
Results: English → German (WMT 2014)
(K = decoding beam size)

Model               BLEU K=1   Δ K=1   BLEU K=5   Δ K=5   PPL    p(ŷ)
4 × 1000 Teacher    17.7       −       19.5       −       6.7    1.3%
2 × 500 Student     14.7       −       17.6       −       8.2    0.9%
SLIDE 38
Results: English → German (WMT 2014)

Model               BLEU K=1   Δ K=1   BLEU K=5   Δ K=5   PPL    p(ŷ)
4 × 1000 Teacher    17.7       −       19.5       −       6.7    1.3%
2 × 500 Student     14.7       −       17.6       −       8.2    0.9%
    Word-KD         15.4       +0.7    17.7       +0.1    8.0    1.0%
SLIDE 39
Results: English → German (WMT 2014)

Model               BLEU K=1   Δ K=1   BLEU K=5   Δ K=5   PPL    p(ŷ)
4 × 1000 Teacher    17.7       −       19.5       −       6.7    1.3%
2 × 500 Student     14.7       −       17.6       −       8.2    0.9%
    Word-KD         15.4       +0.7    17.7       +0.1    8.0    1.0%
    Seq-KD          18.9       +4.2    19.0       +1.4    22.7   16.9%
SLIDE 40
Results: English → German (WMT 2014)

Model               BLEU K=1   Δ K=1   BLEU K=5   Δ K=5   PPL    p(ŷ)
4 × 1000 Teacher    17.7       −       19.5       −       6.7    1.3%
2 × 500 Student     14.7       −       17.6       −       8.2    0.9%
    Word-KD         15.4       +0.7    17.7       +0.1    8.0    1.0%
    Seq-KD          18.9       +4.2    19.0       +1.4    22.7   16.9%
    Seq-Inter       18.9       +4.2    19.3       +1.7    15.8   7.6%
SLIDE 41
Results: English → German (WMT 2014)

Model               BLEU K=1   Δ K=1   BLEU K=5   Δ K=5   PPL    p(ŷ)
4 × 1000 Teacher    17.7       −       19.5       −       6.7    1.3%
    Seq-Inter       19.6       +1.9    19.8       +0.3    10.4   8.2%
2 × 500 Student     14.7       −       17.6       −       8.2    0.9%
    Word-KD         15.4       +0.7    17.7       +0.1    8.0    1.0%
    Seq-KD          18.9       +4.2    19.0       +1.4    22.7   16.9%
    Seq-Inter       18.9       +4.2    19.3       +1.7    15.8   7.6%
SLIDE 42
Results: English → German (WMT 2014)

Model               BLEU K=1   Δ K=1   BLEU K=5   Δ K=5   PPL    p(ŷ)
4 × 1000 Teacher    17.7       −       19.5       −       6.7    1.3%
    Seq-Inter       19.6       +1.9    19.8       +0.3    10.4   8.2%
2 × 500 Student     14.7       −       17.6       −       8.2    0.9%
    Word-KD         15.4       +0.7    17.7       +0.1    8.0    1.0%
    Seq-KD          18.9       +4.2    19.0       +1.4    22.7   16.9%
    Seq-Inter       18.9       +4.2    19.3       +1.7    15.8   7.6%

More experiments (different language pairs, combined configurations, different sizes, etc.) in the paper.
SLIDE 43
An Application
SLIDE 44
Decoding Speed
SLIDE 45
Combining Knowledge Distillation and Pruning
The number of parameters is still large for the student models (mostly due to the word embedding tables):
4 × 1000: 221 million
2 × 500: 84 million
2 × 300: 49 million
Prune the student model, following the methodology of See et al. (2016):
Prune x% of weights based on absolute value.
Fine-tune the pruned model (crucial!).
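A minimal sketch of magnitude-based pruning in this spirit (an illustration of the general recipe, not the authors' implementation): zero out the fraction of weights with the smallest absolute values, then fine-tune while keeping the pruned entries at zero via the returned masks.

```python
import torch

def magnitude_prune(model, prune_fraction=0.8):
    """Zero out the smallest-magnitude weights; return masks so they stay zero during fine-tuning."""
    masks = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                      # skip biases and other vector parameters
            continue
        k = int(prune_fraction * param.numel())
        if k == 0:
            continue
        threshold = param.detach().abs().flatten().kthvalue(k).values
        mask = (param.detach().abs() > threshold).float()
        param.data.mul_(mask)                    # prune in place
        masks[name] = mask                       # re-apply after every fine-tuning update
    return masks
```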
SLIDE 46
SLIDE 47
Combining Knowledge Distillation and Pruning
SLIDE 48
Conclusion
Introduced sequence-level versions of knowledge distillation to compress NMT models.
Observations:
Can similarly compress an ensemble into a single model (Kuncoro et al., 2016).
No beam search ⇒ we no longer need the softmax at each step: opens up a window into approximate inner-product methods.
Live deployment: the (greedy) student outperforms the (beam search) teacher! (Crego et al., 2016)
SLIDE 49
SLIDE 50
SLIDE 51
Thank you
https://github.com/harvardnlp/seq2seq-attn
SLIDE 52
Appendix: Decoding Speed
Model Size          GPU      CPU     Android
Beam = 1 (Greedy)
4 × 1000            425.5    15.0    −
2 × 500             1051.3   63.6    8.8
2 × 300             1267.8   104.3   15.8
Beam = 5
4 × 1000            101.9    7.9     −
2 × 500             181.9    22.1    1.9
2 × 300             189.1    38.4    3.4

Source words translated per second.
SLIDE 53
Appendix: Knowledge Distillation and Pruning
Model      Prune %   Params   BLEU   Ratio (Params)
4 × 1000   0%        221 m    19.5   1×
2 × 500    0%        84 m     19.3   3×
2 × 500    50%       42 m     19.3   5×
2 × 500    80%       17 m     19.1   13×
2 × 500    85%       13 m     18.8   18×
2 × 500    90%       8 m      18.5   26×
SLIDE 54
References I
Ba, L. J. and Caruana, R. (2014). Do Deep Nets Really Need to be Deep? In Proceedings of NIPS.
Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of ICLR.
Bucila, C., Caruana, R., and Niculescu-Mizil, A. (2006). Model Compression. In Proceedings of KDD.
Chan, W., Jaitly, N., Le, Q., and Vinyals, O. (2015). Listen, Attend and Spell. arXiv:1508.01211.
Chen, X., Xu, L., Liu, Z., Sun, M., and Luan, H. (2015). Joint Learning of Character and Word Embeddings. In Proceedings of IJCAI.
SLIDE 55
References II
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In Proceedings of EMNLP.
Chorowski, J., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. (2015). Attention-Based Models for Speech Recognition. arXiv:1506.07503.
Crego, J., Kim, J., and Senellart, J. (2016). Systran's Pure Neural Machine Translation System. arXiv:1602.06023.
Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. (2014). Exploiting Linear Structure within Convolutional Neural Networks for Efficient Evaluation. In Proceedings of NIPS.
Gillick, D., Brunk, C., Vinyals, O., and Subramanya, A. (2016). Multilingual Language Processing from Bytes. In Proceedings of NAACL.
SLIDE 56
References III
Han, S., Mao, H., and Dally, W. J. (2016). Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of ICLR.
Hermann, K. M., Kocisky, T., Grefenstette, E., Espeholt, L., Kay, W., Suleyman, M., and Blunsom, P. (2015). Teaching Machines to Read and Comprehend. In Proceedings of NIPS.
Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531.
Kuncoro, A., Ballesteros, M., Kong, L., Dyer, C., and Smith, N. A. (2016). Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser. In Proceedings of EMNLP.
LeCun, Y., Denker, J. S., and Solla, S. A. (1990). Optimal Brain Damage. In Proceedings of NIPS.
SLIDE 57
References IV
Li, J., Galley, M., Brockett, C., Gao, J., and Dolan, B. (2016). A Diversity-Promoting Objective Function for Neural Conversational Models. In Proceedings of NAACL.
Liang, P., Bouchard-Cote, A., Klein, D., and Taskar, B. (2006). An End-to-End Discriminative Approach to Machine Translation. In Proceedings of COLING-ACL.
Lin, Z., Courbariaux, M., Memisevic, R., and Bengio, Y. (2016). Neural Networks with Few Multiplications. In Proceedings of ICLR.
Luong, M.-T., Pham, H., and Manning, C. D. (2015). Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of EMNLP.
Rush, A. M., Chopra, S., and Weston, J. (2015). A Neural Attention Model for Abstractive Sentence Summarization. In Proceedings of EMNLP.
See, A., Luong, M.-T., and Manning, C. D. (2016). Compression of Neural Machine Translation via Pruning. In Proceedings of CoNLL.
SLIDE 58
References V
Serban, I. V., Sordoni, A., Bengio, Y., Courville, A., and Pineau, J. (2016). Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In Proceedings of AAAI.
Srivastava, N., Mansimov, E., and Salakhutdinov, R. (2015). Unsupervised Learning of Video Representations using LSTMs. In Proceedings of ICML.
Sutskever, I., Vinyals, O., and Le, Q. (2014). Sequence to Sequence Learning with Neural Networks. In Proceedings of NIPS.
Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., and Hinton, G. (2015a). Grammar as a Foreign Language. In Proceedings of NIPS.
Vinyals, O. and Le, Q. (2015). A Neural Conversational Model. In Proceedings of ICML Deep Learning Workshop.
Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. (2015b). Show and Tell: A Neural Image Caption Generator. In Proceedings of CVPR.
SLIDE 59