

SLIDE 1

Structure-Level Knowledge Distillation For Multilingual Sequence Labeling

Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Fei Huang, Kewei Tu

School of Information Science and Technology, ShanghaiTech University; DAMO Academy, Alibaba Group

SLIDE 2

Motivation

  • Most previous work on sequence labeling has focused on monolingual models.
  • Training and serving many separate monolingual models online is resource-consuming.
  • A unified multilingual model is smaller, easier to serve, and more generalizable.
  • However, existing unified multilingual models are less accurate than monolingual models.

SLIDE 3

Our Solution

Knowledge Distillation

SLIDE 4

Background: Knowledge Distillation

[Diagram: a teacher model is trained on the data.]

Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2014. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.

SLIDE 5

Background: Knowledge Distillation

[Diagram: the teacher and the student each read the data and produce a distribution (𝑄𝑢 and 𝑄𝑡); an XE (cross-entropy) loss compares the two distributions.]


SLIDE 6

Background: Knowledge Distillation

[Diagram: as on the previous slide, but now the student is updated to minimize the XE loss between its distribution and the teacher's.]
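As a minimal sketch of this step (assuming plain per-token softmax classifiers and the temperature knob from Hinton et al.; function and argument names are illustrative):

import torch
import torch.nn.functional as F

def kd_xe_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level XE loss between distributions:
    -sum_y Q_teacher(y) * log Q_student(y), averaged over tokens."""
    q_teacher = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    log_q_student = F.log_softmax(student_logits / temperature, dim=-1)
    return -(q_teacher * log_q_student).sum(dim=-1).mean()

The teacher's logits are detached, so gradients flow only into the student, matching the "Update" arrow on the slide.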

SLIDE 7

Background: Sequence Labeling


Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In NAACL.

SLIDE 8

Background: Sequence Labeling

There is an exponential number of possible label sequences, so the teacher's distribution over whole sequences cannot be enumerated directly.


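For a linear-chain CRF over n tokens and L labels there are L^n candidate label sequences, yet their total score can be summed exactly in O(n·L^2) time by dynamic programming. A minimal sketch of the standard forward algorithm (the emission/transition parameterization follows Lample et al.'s BiLSTM-CRF; names are illustrative):

import torch

def crf_log_partition(emissions, transitions):
    """Log-sum of the scores of all L**n label sequences of a
    linear-chain CRF, computed with the forward algorithm.
    emissions: (n, L) per-token label scores; transitions: (L, L)."""
    alpha = emissions[0]  # log-scores of all length-1 prefixes
    for t in range(1, emissions.size(0)):
        # alpha'[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
    return torch.logsumexp(alpha, dim=0)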

slide-9
SLIDE 9

Top-K Distillation

Distill from the teacher's top-K label sequences, obtained by k-best decoding (sketched below).
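A minimal sketch of the resulting objective, assuming a hypothetical k-best decoder has already extracted the teacher's sequences (teacher_kbest) and that student_seq_log_prob returns the student CRF's log-probability of a given label sequence:

import torch

def topk_distill_loss(teacher_kbest, student_seq_log_prob):
    """Top-K distillation: treat the teacher's k-best label sequences
    as pseudo targets, here with uniform weight 1/k per sequence."""
    nlls = torch.stack([-student_seq_log_prob(seq) for seq in teacher_kbest])
    return nlls.mean()

The uniform 1/k weighting here is a simplification; Top-WK on the next slide weights each sequence by its teacher probability instead.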

SLIDE 10

Top-WK Distillation
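Top-WK replaces the uniform weights with the teacher's own probabilities, renormalized over the k-best list. A sketch of the weighting step (assuming the teacher's k-best sequence log-probabilities are available):

import torch

def topwk_weights(teacher_kbest_log_probs):
    """Renormalize the teacher's k-best sequence probabilities so they
    sum to 1 over the list: softmax(log p_i) = p_i / sum_j p_j."""
    return torch.softmax(teacher_kbest_log_probs, dim=0)

These weights then replace the nlls.mean() averaging in the Top-K sketch above, i.e. the loss becomes (weights * nlls).sum().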

SLIDE 11

Posterior Distillation

Match the teacher's token-level posterior distributions, i.e. the marginal probability of each label at each position (sketched below).
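Token-level posteriors of a linear-chain CRF can be computed exactly with the forward-backward algorithm, and the student is trained to match them with an XE loss. A minimal sketch (shapes and the score parameterization are assumptions):

import torch

def crf_marginals(emissions, transitions):
    """Token-level posteriors P(y_t = j | x) of a linear-chain CRF,
    via the forward-backward algorithm.
    emissions: (n, L) per-token scores; transitions: (L, L).
    Returns an (n, L) matrix whose rows each sum to 1."""
    n, L = emissions.shape
    alphas = [emissions[0]]  # forward scores, include emission at t
    for t in range(1, n):
        alphas.append(torch.logsumexp(alphas[-1].unsqueeze(1) + transitions, dim=0)
                      + emissions[t])
    betas = [torch.zeros(L)]  # backward scores, exclude emission at t
    for t in range(n - 2, -1, -1):
        betas.insert(0, torch.logsumexp(
            transitions + (emissions[t + 1] + betas[0]).unsqueeze(0), dim=1))
    alpha, beta = torch.stack(alphas), torch.stack(betas)
    log_z = torch.logsumexp(alpha[-1], dim=0)  # log partition function
    return torch.exp(alpha + beta - log_z)

def posterior_distill_loss(teacher_marginals, student_marginals, eps=1e-12):
    """XE between the teacher's and the student's token-level posteriors."""
    return -(teacher_marginals * (student_marginals + eps).log()).sum(-1).mean()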

SLIDE 12

Structure-Level Knowledge Distillation
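The slide's figure is not reproduced here. As a purely illustrative sketch (the interpolation weight lam and the exact combination are assumptions, not the paper's formula), the distillation terms above are combined with the ordinary supervised CRF loss during training:

def total_loss(gold_loss, distill_loss, lam=0.5):
    """Hypothetical interpolation of the supervised CRF loss with a
    structure-level distillation term (Top-K/Top-WK and/or posterior)."""
    return (1.0 - lam) * gold_loss + lam * distill_loss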

SLIDE 13

Results

  • Monolingual teacher models outperform the multilingual student models.
  • Our approaches outperform the baseline multilingual model.
  • Top-WK+Posterior performs in between Top-WK and Posterior.

SLIDE 14

Zero-shot Transfer

SLIDE 15

KD with Weaker Teachers

SLIDE 16

k Value in Top-K

SLIDE 17

Conclusion

  • Two structure-level KD methods: Top-K and posterior distillation.
  • Our approaches improve the performance of multilingual models over 4 tasks on 25 datasets.
  • Our distilled models have stronger zero-shot transfer ability on the NER and POS tagging tasks.
