

SLIDE 1

Model distillation and extraction

CS 685, Fall 2020

Advanced Natural Language Processing

Mohit Iyyer

College of Information and Computer Sciences

University of Massachusetts Amherst

many slides from Kalpesh Krishna

SLIDE 2

stuff from last time…

  • Topics you want to see covered?
  • HW1 due 10/28


SLIDE 3

Knowledge distillation:

A small model (the student) is trained to mimic the predictions of a much larger pretrained model (the teacher)

Bucila et al., 2006; Hinton et al., 2015

SLIDE 4

Sanh et al., 2019 (“DistilBERT”)

SLIDE 5

BERT (teacher): 24-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Predictions: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …

SLIDE 6

BERT (teacher): 24-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Soft targets: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …

SLIDE 7

BERT (teacher): 12-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Soft targets t_i: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …

DistilBERT (student): 6-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Cross-entropy loss to predict the soft targets:

L_ce = −∑_i t_i log(s_i)

where t_i is the teacher's probability for word i and s_i is the student's.
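As a concrete reference, here is a minimal sketch of this loss in PyTorch. It is not from the lecture: the temperature T, the tensor shapes, and the toy logits are illustrative assumptions, following the common Hinton-style recipe of softening both distributions. (DistilBERT additionally combines this term with the regular masked-language-modeling loss and a cosine embedding loss.)

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=2.0):
    """L_ce = -sum_i t_i log(s_i), computed from raw logits.

    student_logits, teacher_logits: (batch, vocab_size) scores over the vocabulary.
    T: softmax temperature; T > 1 flattens both distributions so that
       low-probability words contribute more training signal.
    """
    t = F.softmax(teacher_logits / T, dim=-1)          # teacher soft targets t_i
    log_s = F.log_softmax(student_logits / T, dim=-1)  # log s_i, numerically stable
    return -(t * log_s).sum(dim=-1).mean()             # average over the batch

# Toy usage for one masked position over a 5-word vocabulary
teacher_logits = torch.tensor([[4.0, 3.0, 1.8, 1.4, -2.0]])  # e.g. barbershop, barber, ...
student_logits = torch.randn(1, 5, requires_grad=True)
loss = soft_target_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only; the teacher is frozen
```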

SLIDE 8

Instead of a “one-hot” ground truth, we have a full predicted distribution

  • More information is encoded in the target prediction than just the “correct” word
  • The relative order of even low-probability words (e.g., “church” vs. “and” in the previous example) tells us some information
  • e.g., that the <MASK> is likely to be a noun and refer to a location, not a function word (see the toy illustration below)
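A toy numeric illustration of that last point. Only the barbershop/barber/salon/stylist probabilities come from the example above; the tail values for “church” and “and” are made up to show the idea:

```python
# One-hot vs. soft targets for "Bob went to the <MASK> to get a buzz cut".
# Tail probabilities ("church", "and") are hypothetical illustrative values.
vocab   = ["barbershop", "barber", "salon", "stylist", "church", "and"]
one_hot = [1.0,          0.0,      0.0,     0.0,       0.0,      0.0]
soft    = [0.54,         0.20,     0.06,    0.04,      1e-3,     1e-7]  # tail truncated

# Under one-hot targets, "church" and "and" are equally wrong.
# Under soft targets, "church" (a plausible location noun) carries ~10,000x
# the mass of "and" (a function word), and the student learns that ordering.
for w, h, s in zip(vocab, one_hot, soft):
    print(f"{w:10s}  one-hot={h:.1f}  soft={s:.0e}")
```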

SLIDE 10

Can also distill other parts of the teacher, not just its final predictions!

Jiao et al., 2020 (“TinyBERT”)
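For instance, one piece of the TinyBERT recipe matches the student's hidden states to the teacher's through a learned projection. Below is a minimal sketch, assuming PyTorch; the dimensions and the layer mapping are illustrative choices, and the full method also matches attention matrices and embeddings:

```python
import torch
import torch.nn.functional as F

d_student, d_teacher = 312, 768               # illustrative widths; student is narrower
proj = torch.nn.Linear(d_student, d_teacher)  # learned map into the teacher's space

def hidden_state_loss(student_h, teacher_h):
    """MSE between projected student hidden states and teacher hidden states.

    student_h: (batch, seq_len, d_student); teacher_h: (batch, seq_len, d_teacher).
    Each student layer is paired with a teacher layer (e.g., a uniform mapping
    such as student layer m <-> teacher layer 3m).
    """
    return F.mse_loss(proj(student_h), teacher_h)

# Toy usage on random hidden states
loss = hidden_state_loss(torch.randn(8, 128, d_student),
                         torch.randn(8, 128, d_teacher))
```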

SLIDE 11

Distillation helps significantly over just training the small model from scratch

Turc et al., 2019 (“Well-read students learn better”)

SLIDE 12

Turc et al., 2019 (“Well-read students learn better”)

SLIDE 13

Frankle & Carbin, 2019 (“The Lottery Ticket Hypothesis”)

How to prune? Simply remove the weights with the lowest magnitudes in each layer
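A minimal sketch of that pruning step in PyTorch, hand-rolled for clarity (torch.nn.utils.prune offers an equivalent l1_unstructured method, and the pruning fraction p is an illustrative choice). Note that the Lottery Ticket procedure applies this iteratively, rewinding the surviving weights to their original initialization between rounds:

```python
import torch

def magnitude_prune_(weight, p=0.2):
    """Zero out the fraction p of weights with the smallest magnitude in this layer."""
    k = int(p * weight.numel())
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    mask = weight.abs() > threshold
    weight.mul_(mask.to(weight.dtype))  # pruned entries become exactly zero
    return mask                          # keep the mask to re-apply during training

# Toy usage: prune 20% of each linear layer of a small model
model = torch.nn.Sequential(torch.nn.Linear(768, 768),
                            torch.nn.ReLU(),
                            torch.nn.Linear(768, 2))
with torch.no_grad():
    masks = [magnitude_prune_(m.weight)
             for m in model if isinstance(m, torch.nn.Linear)]
```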

SLIDE 14

Can prune a significant fraction of the network with no downstream performance loss

Chen et al., 2020 (“Lottery Ticket for BERT Networks”)

SLIDE 15

What if you only have access to the model’s argmax prediction, and you also don’t have access to its training data?
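To make this setting concrete, an extraction loop might look like the sketch below. All names and hyperparameters here are hypothetical: the victim is a black-box API that returns only an argmax class label, and queries are built from random words because we have no access to its training data:

```python
import random
import torch
import torch.nn.functional as F

def random_query(wordlist, max_len=12):
    """A nonsensical query: random words, since we have no real training data."""
    return " ".join(random.choices(wordlist, k=random.randint(3, max_len)))

def extract(victim_predict, student, encode, wordlist, n_queries=10_000, lr=1e-4):
    """Train a student on (random query, victim argmax label) pairs.

    victim_predict: black-box API, str -> int class label (argmax only).
    student: our local model, tensor -> (num_classes,) logits.
    encode: str -> input tensor for the student.
    """
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(n_queries):
        x = random_query(wordlist)
        y = torch.tensor([victim_predict(x)])   # hard label, no probabilities
        logits = student(encode(x)).unsqueeze(0)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```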

SLIDE 48

  • Limitation: genuine queries can be out-of-distribution but still sensible
  • Only works for RANDOM queries