

SLIDE 1

Model distillation and extraction

CS 685, Fall 2020

Advanced Natural Language Processing

Mohit Iyyer

College of Information and Computer Sciences

University of Massachusetts Amherst

many slides from Kalpesh Krishna

SLIDE 2

stuff from last time…

  • Topics you want to see covered?
  • HW1 due 10/28


SLIDE 3

Knowledge distillation:

A small model (the student) is trained to mimic the predictions of a much larger pretrained model (the teacher)

Bucila et al., 2006; Hinton et al., 2015

SLIDE 4

Sanh et al., 2019 (“DistilBERT”)

SLIDE 5

BERT (teacher): 24-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Predictions: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …

SLIDE 6

BERT (teacher): 24-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Soft targets: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …

SLIDE 7

BERT (teacher): 12-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Soft targets t_i: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …

DistilBERT (student): 6-layer Transformer

Input: Bob went to the <MASK> to get a buzz cut

Cross-entropy loss to predict the soft targets:

L_ce = −∑_i t_i log(s_i)

where t_i is the teacher's probability for word i and s_i is the student's.
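As a concrete reference, here is a minimal sketch of this loss in PyTorch. It is not from the lecture: the temperature T, the tensor shapes, and the toy logits are illustrative assumptions, following the common Hinton-style recipe of softening both distributions. (DistilBERT additionally combines this term with the regular masked-language-modeling loss and a cosine embedding loss.)

```python
import torch
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, T=2.0):
    """L_ce = -sum_i t_i log(s_i), computed from raw logits.

    student_logits, teacher_logits: (batch, vocab_size) scores over the vocabulary.
    T: softmax temperature; T > 1 flattens both distributions so that
       low-probability words contribute more training signal.
    """
    t = F.softmax(teacher_logits / T, dim=-1)          # teacher soft targets t_i
    log_s = F.log_softmax(student_logits / T, dim=-1)  # log s_i, numerically stable
    return -(t * log_s).sum(dim=-1).mean()             # average over the batch

# Toy usage for one masked position over a 5-word vocabulary
teacher_logits = torch.tensor([[4.0, 3.0, 1.8, 1.4, -2.0]])  # e.g. barbershop, barber, ...
student_logits = torch.randn(1, 5, requires_grad=True)
loss = soft_target_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only; the teacher is frozen
```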

SLIDE 8

Instead of a “one-hot” ground truth, we have a full predicted distribution

  • More information is encoded in the target prediction than just the “correct” word
  • The relative order of even low-probability words (e.g., “church” vs. “and” in the previous example) tells us some information
  • e.g., that the <MASK> is likely to be a noun and refer to a location, not a function word (see the toy illustration below)
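A toy numeric illustration of that last point. Only the barbershop/barber/salon/stylist probabilities come from the example above; the tail values for “church” and “and” are made up to show the idea:

```python
# One-hot vs. soft targets for "Bob went to the <MASK> to get a buzz cut".
# Tail probabilities ("church", "and") are hypothetical illustrative values.
vocab   = ["barbershop", "barber", "salon", "stylist", "church", "and"]
one_hot = [1.0,          0.0,      0.0,     0.0,       0.0,      0.0]
soft    = [0.54,         0.20,     0.06,    0.04,      1e-3,     1e-7]  # tail truncated

# Under one-hot targets, "church" and "and" are equally wrong.
# Under soft targets, "church" (a plausible location noun) carries ~10,000x
# the mass of "and" (a function word), and the student learns that ordering.
for w, h, s in zip(vocab, one_hot, soft):
    print(f"{w:10s}  one-hot={h:.1f}  soft={s:.0e}")
```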

SLIDE 10

Can also distill other parts of the teacher, not just its final predictions!

Jiao et al., 2020 (“TinyBERT”)
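For instance, one piece of the TinyBERT recipe matches the student's hidden states to the teacher's through a learned projection. Below is a minimal sketch, assuming PyTorch; the dimensions and the layer mapping are illustrative choices, and the full method also matches attention matrices and embeddings:

```python
import torch
import torch.nn.functional as F

d_student, d_teacher = 312, 768               # illustrative widths; student is narrower
proj = torch.nn.Linear(d_student, d_teacher)  # learned map into the teacher's space

def hidden_state_loss(student_h, teacher_h):
    """MSE between projected student hidden states and teacher hidden states.

    student_h: (batch, seq_len, d_student); teacher_h: (batch, seq_len, d_teacher).
    Each student layer is paired with a teacher layer (e.g., a uniform mapping
    such as student layer m <-> teacher layer 3m).
    """
    return F.mse_loss(proj(student_h), teacher_h)

# Toy usage on random hidden states
loss = hidden_state_loss(torch.randn(8, 128, d_student),
                         torch.randn(8, 128, d_teacher))
```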

SLIDE 11

Distillation helps significantly over just training the small model from scratch

Turc et al., 2019 (“Well-read students learn better”)

SLIDE 12

Turc et al., 2019 (“Well-read students learn better”)

SLIDE 13

Frankle & Carbin, 2019 (“The Lottery Ticket Hypothesis”)

How to prune? Simply remove the weights with the lowest magnitudes in each layer
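A minimal sketch of that pruning step in PyTorch, hand-rolled for clarity (torch.nn.utils.prune offers an equivalent l1_unstructured method, and the pruning fraction p is an illustrative choice). Note that the Lottery Ticket procedure applies this iteratively, rewinding the surviving weights to their original initialization between rounds:

```python
import torch

def magnitude_prune_(weight, p=0.2):
    """Zero out the fraction p of weights with the smallest magnitude in this layer."""
    k = int(p * weight.numel())
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    mask = weight.abs() > threshold
    weight.mul_(mask.to(weight.dtype))  # pruned entries become exactly zero
    return mask                          # keep the mask to re-apply during training

# Toy usage: prune 20% of each linear layer of a small model
model = torch.nn.Sequential(torch.nn.Linear(768, 768),
                            torch.nn.ReLU(),
                            torch.nn.Linear(768, 2))
with torch.no_grad():
    masks = [magnitude_prune_(m.weight)
             for m in model if isinstance(m, torch.nn.Linear)]
```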

SLIDE 14

Can prune a significant fraction of the network with no downstream performance loss

Chen et al., 2020 (“Lottery Ticket for BERT Networks”)

SLIDE 15

What if you only have access to the model’s argmax prediction, and you also don’t have access to its training data?
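To make this setting concrete, an extraction loop might look like the sketch below. All names and hyperparameters here are hypothetical: the victim is a black-box API that returns only an argmax class label, and queries are built from random words because we have no access to its training data:

```python
import random
import torch
import torch.nn.functional as F

def random_query(wordlist, max_len=12):
    """A nonsensical query: random words, since we have no real training data."""
    return " ".join(random.choices(wordlist, k=random.randint(3, max_len)))

def extract(victim_predict, student, encode, wordlist, n_queries=10_000, lr=1e-4):
    """Train a student on (random query, victim argmax label) pairs.

    victim_predict: black-box API, str -> int class label (argmax only).
    student: our local model, tensor -> (num_classes,) logits.
    encode: str -> input tensor for the student.
    """
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(n_queries):
        x = random_query(wordlist)
        y = torch.tensor([victim_predict(x)])   # hard label, no probabilities
        logits = student(encode(x)).unsqueeze(0)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```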

SLIDE 48

  • Limitation: genuine queries can be out-of-distribution but still sensible
  • Only works for RANDOM queries