

SLIDE 1

Distilling Knowledge for Search-based Structured Prediction

Yijia Liu*, Wanxiang Che, Huaipeng Zhao, Bing Qin, Ting Liu
Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology

SLIDE 2

Complex Model Wins

[ResNet, 2015] [He+, 2017]

SLIDE 3

[Bar charts: Dependency Parsing (LAS, 90–93) and NMT (BLEU, 20.5–26.5), comparing Baseline, search SOTA, Distillation, and Ensemble]

SLIDE 4

[Same bar charts, annotated with differences of 0.6 and 1.3 (dependency parsing) and 0.8 and 2.6 (NMT)]

SLIDE 5

Classification vs. Structured Prediction

Classifier

y (a single decision)

Structured Predictor

y_1, y_2, …, y_n (a sequence of decisions)

SLIDE 6

Classification vs. Structured Prediction

Classifier

I like this book

Structured Predictor

I like this book
SLIDE 7

Search Space

Search-based Structured Prediction

[Search-space diagram over the words "I", "this", "like", "book"]

SLIDE 8

Search Space

A Model p(y | state) that Controls the Search Process

[Diagram: from the partial output "I like", the model's distribution p(y | I, like) over the candidate next words {book, I, like, love, the, this}]
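As a concrete illustration (not the authors' code), here is a minimal sketch of how such a model drives greedy search; `score_next` is a hypothetical function that, given the current partial output, returns a probability for every candidate next action.

```python
# Minimal sketch: a model p(y | state) controls search by scoring candidate
# actions at every partially-built output. `score_next` is a hypothetical
# function returning {action: probability} for the current state.

def greedy_search(score_next, start_state, max_steps=50):
    state = list(start_state)              # e.g. the partial output ["I"]
    for _ in range(max_steps):
        probs = score_next(state)          # p(y | state) over candidate actions
        y = max(probs, key=probs.get)      # greedy: take the most probable action
        if y == "</s>":                    # stop once the structure is complete
            break
        state.append(y)                    # extend the partial output
    return state
```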

SLIDE 9

Search Space

Generic p(y | state) Learning Algorithm

[Diagram: from the partial output "I like", the reference next word is "this"]

argmax_θ ∑_y 1(y = this) · log p(y | I, like)
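A hedged sketch of this generic (negative log-likelihood) learning step: at each state along the reference sequence, all target mass sits on the single gold action. The `model`, feature representation, and action ids are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Sketch of one NLL update: maximize log p(y = gold | state), i.e. train
# against a one-hot target that puts all mass on the reference action.
def nll_step(model, optimizer, state_features, gold_action_id):
    logits = model(state_features)                   # scores over all actions
    loss = F.cross_entropy(logits.unsqueeze(0),      # = -log p(gold | state)
                           torch.tensor([gold_action_id]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```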

SLIDE 10

Search Space

Problems of the Generic Learning Algorithm

[Diagram: after "I like", both "this" and "the" appear as plausible next words]

Ambiguities in training data: “both this and the seem reasonable”

SLIDE 11

Search Space

Problems of the Generic Learning Algorithm

[Diagram: a wrong decision (e.g. predicting "love") leads the model into a state never seen during training]

Ambiguities in training data: “both this and the seem reasonable”
Training and test discrepancy: “What if I made a wrong decision?”

SLIDE 12

Search Space

Solutions in Previous Works

[Search-space diagram]

Ambiguities in training data → Ensemble (Dietterich, 2000)
Training and test discrepancy → Explore (Ross and Bagnell, 2010)

SLIDE 13

Search Space

Where We Are

[Search-space diagram]

Knowledge distillation targets both problems: ambiguities in the training data and the training/test discrepancy.

SLIDE 14

Knowledge Distillation

Learning from negative log-likelihood:

argmax_θ ∑_y 1(y = this) · log p(y | I, like)

Learning from knowledge distillation:

argmax_θ ∑_y q(y) · log p(y | I, like)

[Bar charts: the one-hot reference distribution vs. the teacher distribution q over {book, I, like, love, the, this}]

q(y | I, like) is the output distribution of a teacher model (e.g. an ensemble).

On supervised data, the two objectives are interpolated:

argmax_θ [ (1 − α) ∑_y 1(y = this) · log p(y | I, like) + α ∑_y q(y) · log p(y | I, like) ]
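A minimal sketch of this interpolated objective (assuming PyTorch tensors; the names are illustrative): the one-hot NLL term and the soft KD term share the student's log-probabilities and are mixed by α.

```python
import torch
import torch.nn.functional as F

# Sketch of the interpolated loss (1 - alpha) * NLL + alpha * KD.
# `student_logits`: unnormalized student scores over candidate actions.
# `teacher_probs`:  the teacher distribution q(y | state), e.g. from an ensemble.
def distill_loss(student_logits, gold_action_id, teacher_probs, alpha=0.8):
    log_p = F.log_softmax(student_logits, dim=-1)   # student log p(y | state)
    nll = -log_p[gold_action_id]                    # -log p(y = gold | state)
    kd = -(teacher_probs * log_p).sum()             # -sum_y q(y) log p(y | state)
    return (1.0 - alpha) * nll + alpha * kd
```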

SLIDE 15

Knowledge Distillation: from Where

Learning from knowledge distillation:

argmax_θ ∑_y q(y) · log p(y | I, like)

Ambiguities in training data → Ensemble (Dietterich, 2000). We use an ensemble of M structured predictors as the teacher q.
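A small sketch of one way to form the teacher q from such an ensemble: average the M members' output distributions at the same state (the `models` list and `state_features` are assumptions for illustration).

```python
import torch

# Sketch: the teacher q(y | state) as the average of M ensemble members'
# output distributions at the same state.
def ensemble_teacher_q(models, state_features):
    probs = [torch.softmax(m(state_features), dim=-1) for m in models]
    return torch.stack(probs).mean(dim=0)           # q(y | state)
```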

SLIDE 16

Search Space

KD on Supervised (reference) Data

[Diagram: a state on the reference path, the partial output "I like"]

(1 − α) ∑_y 1(y = this) · log p(y | I, like) + α ∑_y q(y) · log p(y | I, like)

SLIDE 17

Search Space

KD on Explored Data

[Diagram: an explored, off-reference state, the partial output "I like the"]

∑_y q(y) · log p(y | I, like, the)

Training and test discrepancy → Explore (Ross and Bagnell, 2010). We use the teacher q to explore the search space and learn from KD on the explored data.
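A hedged sketch of KD on explored data: the teacher samples its way through the search space, and the student receives a KD loss at every visited state, including states off the reference path. All names here are illustrative, not the authors' code.

```python
import torch

# Sketch: the teacher q explores (here by sampling its own predictions),
# and the student is trained to match q at every explored state.
def explore_and_distill(teacher_probs, student_logits, start_state,
                        eos_id, max_steps=50):
    state, losses = list(start_state), []
    for _ in range(max_steps):
        q = teacher_probs(state)                           # q(y | state)
        log_p = torch.log_softmax(student_logits(state), dim=-1)
        losses.append(-(q * log_p).sum())                  # KD loss at this state
        y = torch.multinomial(q, 1).item()                 # teacher samples next action
        if y == eos_id:
            break
        state.append(y)
    return torch.stack(losses).mean()
```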

SLIDE 18

We combine KD on reference and explored data.

SLIDE 19

Experiments

Transition-based Dependency Parsing — Penn Treebank (Stanford dependencies), LAS:

  Baseline: 90.83
  Ensemble (20): 92.73
  Distill (reference, α = 1.0): 91.99
  Distill (exploration): 92.00
  Distill (both): 92.14
  Ballesteros et al. (2016) (dyn. oracle): 91.42
  Andor et al. (2016) (local, B=1): 91.02

Neural Machine Translation — IWSLT 2014 de-en, BLEU:

  Baseline: 22.79
  Ensemble (10): 26.26
  Distill (reference, α = 0.8): 24.76
  Distill (exploration): 24.64
  Distill (both): 25.44
  MIXER (Ranzato et al., 2015): 20.73
  Wiseman and Rush (2016) (local, B=1): 22.53
  Wiseman and Rush (2016) (global, B=1): 23.83

SLIDE 20

Analysis: Why Does the Ensemble Work Better?

  • Examining the ensemble on the “problematic” states.
[Diagram: “problematic” states — optimal-yet-ambiguous (e.g. both "this" and "the" are acceptable) and non-optimal (e.g. after predicting "love")]

SLIDE 21

Analysis: Why Does the Ensemble Work Better?

  • Examining the ensemble on the “problematic” states.
  • Testbed: Transition-based dependency parsing.
  • Tools: a dynamic oracle, which returns the set of reference actions for one state.
  • Evaluate the output distributions against the reference actions.

             Optimal-yet-ambiguous   Non-optimal
  Baseline   68.59                   89.59
  Ensemble   74.19                   90.90
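One way to quantify this, as a sketch under assumed helpers (`model_probs` and `oracle_actions` are hypothetical, not necessarily the authors' exact metric): for each problematic state, sum the probability mass the model assigns to the dynamic oracle's reference actions, then average.

```python
# Sketch: average probability mass a model puts on the dynamic-oracle
# reference actions over a set of states. `model_probs(state)` returns
# {action: probability}; `oracle_actions(state)` returns the reference set.
def mass_on_oracle_actions(states, model_probs, oracle_actions):
    total = 0.0
    for state in states:
        probs = model_probs(state)
        total += sum(probs[a] for a in oracle_actions(state))
    return 100.0 * total / len(states)   # averaged, reported in percent
```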

SLIDE 22

Analysis: Is it Feasible to Fully Learn from KD w/o NLL?

Fully learning from KD is feasible

[Line charts: scores across α values from 1.0 down to 0.1 — Transition-based Parsing (LAS): 92.07, 92.04, 91.93, 91.9, 91.7, 91.72, 91.55, 91.49, 91.3, 91.1, 90.9; Neural Machine Translation (BLEU): 26.96, 27.04, 27.13, 26.95, 26.6, 26.64, 26.37, 26.21, 26.09, 25.9, 24.93]

SLIDE 23

Analysis: Is Learning from KD Stable?

[Charts: Transition-based Parsing and Neural Machine Translation]

SLIDE 24

Conclusion

  • We propose to distill an ensemble into a single model from both reference and exploration states.
  • Experiments on transition-based dependency parsing and machine translation show that our distillation method significantly improves the single model’s performance.
  • Analysis gives empirical support for our distillation method.
SLIDE 25

Thanks and Q/A