Massively Multilingual Transfer for NER. Afshin Rahimi, Yuan Li, and Trevor Cohn. PowerPoint PPT Presentation



SLIDE 1

Massively Multilingual Transfer for NER

Afshin Rahimi, Yuan Li, and Trevor Cohn University of Melbourne

SLIDE 2

6000+ languages ≈ 1% with annotation

Wikipedia:Jroehl

SLIDE 3

Emergency Response Named Entity Recognition

SLIDE 4

Annotation Projection for Transfer

Tagalog: kailangan namin ng mas maraming dugo sa Pagasanjan .
English: we/O need/O more/O blood/O in/O Pagasanjan/B-LOC ./O

The B-LOC tag on the English "Pagasanjan" is projected back onto the aligned Tagalog token (Yarowsky et al., 2001).
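The projection step can be sketched as follows; the word alignment here is hypothetical (in practice it comes from a statistical aligner over parallel text):

```python
# Annotation projection (after Yarowsky et al., 2001): copy NER tags from a
# labelled English sentence onto the aligned Tagalog tokens.
def project_labels(src_tags, alignment, tgt_len):
    """alignment: list of (src_idx, tgt_idx) word-alignment pairs."""
    tgt_tags = ["O"] * tgt_len
    for s, t in alignment:
        if src_tags[s] != "O":
            tgt_tags[t] = src_tags[s]
    return tgt_tags

en_tags = ["O", "O", "O", "O", "O", "B-LOC", "O"]  # we need more blood in Pagasanjan .
tl_tokens = "kailangan namin ng mas maraming dugo sa Pagasanjan .".split()
alignment = [(5, 7)]  # hypothetical: English "Pagasanjan" -> Tagalog "Pagasanjan"
print(project_labels(en_tags, alignment, len(tl_tokens)))
# ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-LOC', 'O']
```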

SLIDE 5

Representation Projection for Transfer

A (mismatched) source-language model tags the target sentence directly, predicting B-LOC for "Pagasanjan" and O elsewhere:

kailangan namin ng mas maraming dugo sa Pagasanjan .

Language-independent representation: cross-lingual word embeddings (Lample et al., 2018)

Ideal: source and target are similar in word order, script, and syntax

SLIDE 6

Direct Transfer for NER

Input: unlabelled sentences in the target language, encoded with cross-lingual embeddings
Output: labelled sentences in the target language

Pre-trained NER source models (English, Arabic, Afrikaans, ...) each tag the same target sentence:

kailangan namin ng mas maraming dugo sa Pagasanjan .

The models disagree: both B-LOC and B-PER appear among their predicted tag sequences.

SLIDE 7

Direct Transfer Results (NER F1 score, WikiANN)

unsurprising

SLIDE 8

Direct Transfer Results (NER F1 score, WikiANN)

unrelated

SLIDE 9

Direct Transfer Results (NER F1 score, WikiANN)

asymmetry

SLIDE 10

Voting & English are often poor!
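Uniform (majority) voting over source-model predictions, the baseline criticized here, can be sketched as a per-token vote; the tag sequences below are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: list of tag sequences, one per source model."""
    voted = []
    for token_tags in zip(*predictions):
        # most common tag wins; ties break by first-seen order
        voted.append(Counter(token_tags).most_common(1)[0][0])
    return voted

preds = [
    ["O", "B-LOC", "O"],  # e.g. an Afrikaans source model
    ["O", "B-LOC", "O"],  # e.g. a Vietnamese source model
    ["O", "B-PER", "O"],  # e.g. an Arabic source model
]
print(majority_vote(preds))  # ['O', 'B-LOC', 'O']
```

The problem noted on this slide: every model gets an equal vote, so many weak sources can outvote a few strong ones.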

SLIDE 11


General findings

  • Transfer is strongest within a language family (Germanic, Romance, Slavic-Cyrillic, Slavic-Latin)
  • Asymmetry between use as source vs. target language (Slavic-Cyrillic, Greek/Turkish/...)
  • But lots of odd results, and overall highly noisy
SLIDE 12

Problem Statement

Input:

  • N black-box source models
  • Unlabelled data in the target language
  • Little or no labelled data (few-shot and zero-shot)

Output:

  • Good predictions in the target language

SLIDE 13

Model 1: Few Shot Ranking and Retraining (RaRe)

Evaluate each source model (EN, AR, VI, ...) on 100 gold sentences in Tagalog to estimate the source model qualities F1_EN, F1_AR, F1_VI, ...

SLIDE 14

Model 1: Few Shot Ranking and Retraining (RaRe)

Each source model (EN, AR, VI, ...) annotates 20k unlabelled sentences in Tagalog, yielding N automatically labelled training sets: Dataset EN, Dataset AR, Dataset VI, ...

SLIDE 15

Model 1: Few Shot Ranking and Retraining (RaRe)

Final training set: a mixture of distilled knowledge. For each l ∈ source languages, sample from Dataset_l with weight g(F1_l).

SLIDE 16

Model 1: Few Shot Ranking and Retraining (RaRe)

  • 1. Train an NER model on the mixture dataset.
  • 2. Fine-tune on 100 gold samples.

Zero-shot variant: uniform sampling without fine-tuning (RaRe-uns)
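The mixture construction can be sketched as quality-weighted sampling; the weighting function g (here F1 squared) and the F1 scores are illustrative assumptions, not the paper's exact choices:

```python
import random

def build_mixture(datasets, f1_scores, n_samples, g=lambda f1: f1 ** 2, seed=0):
    """Sample sentences from each source model's auto-labelled dataset
    with weight g(F1_l), producing one mixed training set."""
    rng = random.Random(seed)
    langs = list(datasets)
    weights = [g(f1_scores[l]) for l in langs]
    total = sum(weights)
    mixture = []
    for lang, w in zip(langs, weights):
        k = round(n_samples * w / total)       # this language's share
        mixture += rng.choices(datasets[lang], k=k)
    rng.shuffle(mixture)
    return mixture

# toy datasets; F1 scores as measured on 100 gold target sentences
datasets = {"en": ["en_sent"] * 100, "ar": ["ar_sent"] * 100, "vi": ["vi_sent"] * 100}
f1 = {"en": 0.6, "ar": 0.3, "vi": 0.7}
mix = build_mixture(datasets, f1, n_samples=1000)  # vi contributes the most
```

Better source models contribute more sentences; a near-spam source contributes almost nothing.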

SLIDE 17

Model: hierarchical BiLSTM-CRF (Lample et al., 2016)

Our method is independent of model choice.

SLIDE 18

Model 2: Zero Shot Transfer (BEA)

What if no gold labels are available?

  • 1. Treat gold labels Z as hidden variables
  • 2. Estimate Z that best explains all the observed predictions
  • 3. Re-estimate the quality of source models

Inspired by Kim and Ghahramani (2012)

SLIDE 19

Model 2: Zero Shot Transfer (BEA)

Predicted label of instance i by model j (observed)

SLIDE 20

Model 2: Zero Shot Transfer (BEA)

True label of instance i

SLIDE 21

Model 2: Zero Shot Transfer (BEA)

Model j's confusion matrix between true and predicted labels.

SLIDE 22

Model 2: Zero Shot Transfer (BEA)

Categorical Distribution

SLIDE 23

Model 2: Zero Shot Transfer (BEA)

Uninformative Dirichlet Priors

SLIDE 24

Model 2: Zero Shot Transfer (BEA)

Find Z that maximises P(Z|Y, β, γ), using a variational mean-field approximation. Warm-start with majority voting (MV).
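This inference resembles Dawid-Skene style aggregation. A toy mean-field-flavoured sketch, with symmetric Dirichlet priors and a majority-vote warm start; it is a simplified stand-in, not the paper's exact variational updates:

```python
import numpy as np

def bea(Y, n_labels, beta=1.0, gamma=1.0, iters=20):
    """Aggregate noisy predictions Y (n_items x n_models, label ids).

    Alternates between (a) expected confusion matrices / label prior given
    the current posterior q over true labels, and (b) updating q given the
    confusions and prior. Warm-started with soft majority voting."""
    n_items, n_models = Y.shape
    # soft majority-vote warm start: q[i, z] ~ fraction of models voting z
    q = np.zeros((n_items, n_labels))
    for j in range(n_models):
        q[np.arange(n_items), Y[:, j]] += 1.0
    q /= q.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # expected confusion counts plus Dirichlet prior gamma
        conf = np.full((n_models, n_labels, n_labels), gamma)
        for j in range(n_models):
            for i in range(n_items):
                conf[j, :, Y[i, j]] += q[i]
            conf[j] /= conf[j].sum(axis=1, keepdims=True)
        prior = q.sum(axis=0) + beta
        prior /= prior.sum()
        # posterior over true labels given confusions and prior
        q = np.tile(prior, (n_items, 1))
        for j in range(n_models):
            q *= conf[j][:, Y[:, j]].T   # P(model j's vote | true label)
        q /= q.sum(axis=1, keepdims=True)
    return q

# three models vote on four tokens (binary labels); the third disagrees twice
Y = np.array([[0, 0, 1],
              [1, 1, 0],
              [0, 0, 0],
              [1, 1, 1]])
print(bea(Y, n_labels=2).argmax(axis=1))
```

The same machinery also yields per-model quality estimates (the confusion matrices), which the extensions on the next slide exploit.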

SLIDE 25

Extensions to BEA

  • 1. Spammer removal: after running BEA, estimate the source model qualities, remove the bottom k, and run BEA again (BEA-uns ×2)
  • 2. Few-shot scenario: given 100 gold sentences, estimate the source model confusion matrices, then run BEA (BEA-sup)
  • 3. Token- vs. entity-level application
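The spammer-removal extension can be sketched generically; here, agreement with the current consensus stands in for BEA's estimated model qualities (an illustrative assumption):

```python
from collections import Counter

def vote(preds):
    """Per-token majority vote over tag sequences."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*preds)]

def remove_spammers(preds, k):
    """Drop the k models that agree least with the aggregate labels,
    then re-aggregate (a stand-in for re-running BEA)."""
    consensus = vote(preds)
    def quality(p):
        return sum(a == b for a, b in zip(p, consensus)) / len(consensus)
    kept = sorted(preds, key=quality, reverse=True)[:len(preds) - k]
    return vote(kept)

preds = [["B-LOC", "O"], ["B-LOC", "O"], ["O", "B-PER"]]
print(remove_spammers(preds, k=1))  # ['B-LOC', 'O']
```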

SLIDE 26

Benchmark: BWET (Xie et al., 2018)

Single-source annotation projection with bilingual dictionaries from cross-lingual word embeddings

  • Transfer English training data to German, Dutch, and Spanish.
  • Train a transformer NER model on the projected training data.

State-of-the-art on zero-shot NER transfer (orthogonal to this work)

SLIDE 27

CoNLL Results (avg F1 over de, nl, es)

[Chart: F1 of methods that use parallel data, a dictionary, or Wikipedia; zero-shot, few-shot, and high-resource settings]

SLIDE 28

CoNLL Results (avg F1 over de, nl, es)


SLIDE 29

CoNLL Results (avg F1 over de, nl, es)


SLIDE 30

CoNLL Results (avg F1 over de, nl, es)


SLIDE 31

WIKIANN NER Datasets (Pan et al., 2017)

  • Silver annotations from Wikipedia for 282 languages.
  • We picked 41 languages based on the availability of bilingual dictionaries.
  • Created balanced training/dev/test partitions (varying the size of training sets according to data availability).

github.com/afshinrahimi/mmner

SLIDE 32

Leave-one-out (L.O.O.) over 41 languages

SLIDE 33

Tagalog: transfer from 40 source languages (L.O.O. over 41 languages)

SLIDE 34


L.O.O. over 41 languages

SLIDE 35

Tamil: transfer from 40 source languages (L.O.O. over 41 languages)

SLIDE 36

Word representation: FastText/MUSE

fastText monolingual Wikipedia embeddings mapped to the English space using identical character strings (Conneau et al., 2017).

SLIDE 37

Results: WikiANN

Supervised: no transfer [chart: F1, low- vs. high-resource languages]

SLIDE 38

Results: WikiANN

Many low-quality source models [chart: zero-shot F1, low- vs. high-resource languages]

SLIDE 39

Results: WikiANN

Single source (en) [chart: zero-shot F1, low- vs. high-resource languages]

SLIDE 40

Results: WikiANN

Bayesian ensembling [chart: zero-shot F1, low- vs. high-resource languages]

SLIDE 41

Results: WikiANN

+ spammer removal [chart: zero-shot F1, low- vs. high-resource languages]

SLIDE 42

Results: WikiANN

MV between top-3 sources [chart: zero- and few-shot F1, low- vs. high-resource languages]

SLIDE 43

Results: WikiANN

Estimate BEA confusion matrices & priors from annotations [chart: zero- and few-shot F1, low- vs. high-resource languages]

SLIDE 44

Results: WikiANN

Ranking and Retraining (RaRe, using character info) [chart: zero- and few-shot F1, low- vs. high-resource languages]

SLIDE 45

Effect of increasing #source languages

The methods are robust to many source languages of varying quality, and do even better with few-shot supervision.


SLIDE 46

Takeaways I


Transfer from multiple source languages helps because for many languages we don’t know the best source language.

takeaway / noun [uk/aus/nz]: a meal cooked and bought at a shop or restaurant but taken somewhere else... Cambridge English Dictionary

SLIDE 47

Takeaways II


With multiple source languages, you need to estimate their qualities because uniform voting doesn’t perform well.


SLIDE 48

Takeaways III


A small training set in target language helps, and can be done cheaply and quickly (Garrette and Baldridge, 2013).


SLIDE 49

Thank you!

Datasets & code github.com/afshinrahimi/mmner

SLIDE 50

Future Work

  • Map all scripts to IPA or the Roman alphabet (good for shared embeddings and character-level transfer)
    ■ uroman: Hermjakob et al. (2018)
    ■ epitran: Mortensen et al. (2018)
  • Can we estimate the quality of source models/languages for a specific target language from language characteristics (Littell et al., 2017)?
  • The technique should apply beyond NER to other tasks.