Transferring NLP models across languages and domains
Barbara Plank ITU, Copenhagen, Denmark August 28, 2019, #SyntaxFest2019 Paris
Statistical NLP: The Need for Data
Learn a mapping Y = f(X) from data with ML, e.g., X = "the dog barks" → Y = DET NOUN VERB

Adverse conditions require the ability to generalize to new conditions:
CROSS-DOMAIN
CROSS-LINGUAL
"OMG! LMAO! LOL! ROFL!" — "I have no idea what you're saying!"
Traditional ML: train and evaluate a separate model (Model A, Model B) per domain/task/language.
Transfer learning: knowledge gained from one problem (Model A) helps solve a related problem (Model B): learning under domain shift, cross-lingual learning.
A taxonomy of transfer learning (adapted from Ruder, 2019):
Transductive transfer (same task): different domains → domain adaptation; different languages → cross-lingual learning
Inductive transfer (different task): tasks learned simultaneously → multi-task learning; tasks learned sequentially → continual learning
Notation: a domain D = {X, P(X)}, where X is the feature space and P(X) a probability distribution over it (e.g., over bags of words); a task T = {Y, P(Y|X)}, where Y is the label space (e.g., +/−).
Domain Adaptation (DA): P(X_src) ≠ P(X_trg)
Cross-lingual Learning (CL): X_src ≠ X_trg
Multi-task Learning (MTL): Y_src ≠ Y_trg
Outline:
1. Domains: Learning to select data
2. Languages: Cross-lingual learning
3. Multi-task learning
Learning to select data for transfer learning with Bayesian Optimization. Sebastian Ruder and Barbara Plank, EMNLP 2017.
How do we select the most relevant source data for a given target domain?
Why don't we just train on all source data?
Prior approaches typically rely on a single, fixed measure of domain similarity.
Intuition: select source data that is most similar to the target domain; but there are many notions of similarity. Idea: learn the data selection measure for the task at hand.
[Figure: training examples x_1 … x_n are scored by a selection policy S = φ(x)ᵀw, sorted by score, and the top m are selected.]
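A minimal sketch of this scoring-and-selection step (the function name and the toy feature matrix are illustrative; in the actual approach the weights w are learned with Bayesian optimization):

```python
import numpy as np

def select_top_m(features: np.ndarray, w: np.ndarray, m: int) -> np.ndarray:
    """Score each example with S = phi(x)^T w and return indices of the top m."""
    scores = features @ w                 # S = phi(x)^T w for every example
    order = np.argsort(-scores)           # sort examples by descending score
    return order[:m]                      # keep the m most relevant examples

# toy usage: 6,000 source examples, 3 similarity/diversity features each
rng = np.random.default_rng(0)
phi = rng.random((6000, 3))               # phi(x) for each source example
w = np.array([0.7, 0.2, 0.1])             # feature weights (learned in practice)
train_idx = select_top_m(phi, w, m=2000)  # select 2,000 examples, as in the talk
```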
Tsvetkov, Y., Faruqui, M., Ling, W., & Dyer, C. (2016). Learning the Curriculum with Bayesian Optimization for Task-Specific Word Representation Learning. In Proceedings of ACL 2016.
Different similarity and diversity features are combined; the feature weights are learned (with Bayesian optimization).
Similarity measures: Jensen-Shannon divergence, Rényi divergence, Bhattacharyya distance, cosine similarity, Euclidean distance, variational distance
Representations: term distributions, topic distributions, word embeddings
Diversity measures: Simpson's index, Rényi entropy, quadratic entropy (Plank, 2011)
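For instance, the Jensen-Shannon divergence between an example's term distribution and the target domain's term distribution can be computed as below (a self-contained sketch with toy distributions; the real features are computed over corpus statistics):

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """Kullback-Leibler divergence KL(p || q) for dense distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence: symmetrised KL via the mixture m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# toy term distributions over a 5-word vocabulary
source = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
target = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
print(js_divergence(source, target))  # 0 = identical, 1 = disjoint (log base 2)
```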
Evaluation: sentiment analysis on the Amazon reviews dataset (Blitzer et al., 2007); POS tagging and dependency parsing on SANCL 2012 (Petrov and McDonald, 2012).

Blitzer, J., Dredze, M., & Pereira, F. (2007). Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of ACL 2007.
Petrov, S., & McDonald, R. (2012). Overview of the 2012 shared task on parsing the web. In Notes of the First Workshop on Syntactic Analysis of Non-Canonical Language (SANCL).
Sentiment analysis: selecting 2,000 from 6,000 source domain examples.
[Chart: accuracy (%, 62–86) on Book, DVD, Electronics, Kitchen for random selection, JS divergence (examples), JS divergence (domain), similarity (topics), diversity, similarity + diversity, and all source data (6,000 examples).]
POS tagging: selecting 2,000 from 14–17.5k source domain examples.
[Chart: accuracy (%, 91–97) on Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ for the same selection methods and all source data.]
Using all source data is useful when domains are very similar.
Dependency parsing: selecting 2,000 from 14–17.5k source domain examples (BIST parser; Kiperwasser & Goldberg, 2016).
[Chart: Labeled Attachment Score (LAS, 80–89) on Answers, Emails, Newsgroups, Reviews, Weblogs, WSJ for the same selection methods and all source data.]
Transferring the learned selection policy across tasks (rows: feature set and the task the policy was trained on; columns: target task):

Feature set   Trained on   POS     Pars    SA
Sim           POS          93.51   83.11   74.19
Sim           Pars         92.78   83.27   72.79
Sim           SA           86.13   67.33   79.23
Div           POS          93.51   83.11   69.78
Div           Pars         93.02   83.41   68.45
Div           SA           90.52   74.68   79.65
Sim+div       POS          93.54   83.24   69.79
Sim+div       Pars         93.11   83.51   72.27
Sim+div       SA           89.80   75.17   80.36
Learning a task-specific data selection policy helps, especially when domains are dissimilar.
The learned policy transfers across models, tasks, and domains.
Code: https://github.com/sebastianruder/learn-to-select-data
Outline:
1. Domains: Learning to select data
2. Languages: Cross-lingual learning
3. Multi-task learning
[Chart: papers in the ACL Anthology (from 2004) whose title contains "cross(-)lingual", per year; from a handful per year in the mid-2000s to 81 in 2019.]
e.g., see the ACL 2019 tutorial (Ruder et al., 2019)
We want to process all languages, but most of them are severely under-resourced. How do we build taggers, parsers, etc. for those?
Approaches: annotation transfer (annotation projection); model transfer (multilingual embeddings, zero-shot/few-shot learning, delexicalization, …). Three case studies follow.
1. Annotation projection (Agić et al., TACL 2016)
Annotation projection (e.g., Hwa et al., 2005; Agić et al., 2015; 2016): project annotations from a tagged source sentence to a target sentence via word alignments.
Source: Was machst du heute ? ("What are you doing today?"), tagged PRON VERB PRON ADV P
Target: Che fesa ncuei ?, receiving the projected tags PRON VERB ADV P
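A minimal sketch of tag projection over word alignments (the alignment pairs here are illustrative; real systems aggregate over many source languages and noisy alignments):

```python
# Project POS tags from a tagged source sentence to an unlabeled target
# sentence through word alignments (source_index -> target_index pairs).
def project_tags(source_tags, alignment, target_len, unk="_"):
    target_tags = [unk] * target_len           # start with all tags unknown
    for src_i, trg_i in alignment:             # copy tags along alignment links
        target_tags[trg_i] = source_tags[src_i]
    return target_tags

# "Was machst du heute ?" -> "Che fesa ncuei ?" (alignment is illustrative)
source_tags = ["PRON", "VERB", "PRON", "ADV", "P"]
alignment = [(0, 0), (1, 1), (3, 2), (4, 3)]   # "du" has no target counterpart
print(project_tags(source_tags, alignment, target_len=4))
# ['PRON', 'VERB', 'ADV', 'P']
```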
Multi-parallel source data: the Bible is available for 100+ languages (data × languages); project from many source languages.
[Chart: dependency parsing, average UAS over 26 languages, comparing delexicalized transfer (Delex), multi-source projection, Bible, and WTC (Watchtower) training data.]
[Chart: Unlabeled Attachment Score for multi-source projection vs. best-source delexicalized transfer (Delex-SelectBest); e.g., Indonesian is the best source for Irish in delexicalized transfer.]
Rahimi et al., ACL 2019
Learning to choose transfer languages, with features (similar to Ruder & Plank, 2017) including word/subword overlap and data size, plus linguistic distances (geographic/genetic distance, etc.).
Lin et al., ACL 2019
4 NLP tasks, including dependency parsing (DEP); data-dependent features outperform WALS syntactic features.
Lin et al., ACL 2019
Amount of supervision, from unlabeled data only to labeled data: do we have parallel data? Multi-parallel data? Embeddings? Lexicons? (Some) gold annotated data, even just a couple of examples?
2. Distant supervision from lexicons (Plank & Agić, EMNLP 2018; Plank & Klerke, NoDaLiDa 2019)
Lexical resources: Wiktionary and UniMorph; a neural tagger with word and character embeddings (Plank et al., 2016).
Subword cues are informative: bi* (85% NOUN in Danish), *able (98% ADJ in WSJ).
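A tiny sketch of how such suffix statistics can be computed from a tagged corpus (the corpus here is a toy example; the percentages on the slide come from real treebanks):

```python
from collections import Counter

def suffix_tag_stats(tagged_corpus, suffix):
    """Estimate P(tag | word ends with suffix) from (word, tag) pairs."""
    counts = Counter(tag for word, tag in tagged_corpus if word.endswith(suffix))
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()}

corpus = [("drinkable", "ADJ"), ("readable", "ADJ"), ("table", "NOUN"),
          ("notable", "ADJ"), ("cable", "NOUN"), ("sustainable", "ADJ")]
print(suffix_tag_stats(corpus, "able"))  # ≈ {'ADJ': 0.67, 'NOUN': 0.33}
```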
[Figure: tagger architecture. A word-level bi-LSTM over "the new beer" predicts DET ADJ NOUN. Each word is represented by a pre-trained word embedding ~w (Polyglot etc.), a character bi-LSTM encoding, and a lexicon embedding ~e from the W: Wiktionary and U: UniMorph lexicons. Training labels ŷ_proj come from annotation projection (Agić et al., 2015; 2016) over parallel text ("la birra nuova"), using WTC with data selection.]
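A compact sketch of this three-part word representation (a hypothetical PyTorch module; the original tagger was not necessarily implemented this way, and all dimensions and names are illustrative):

```python
import torch
import torch.nn as nn

class WordRepresentation(nn.Module):
    """Concatenate word embedding ~w, char-BiLSTM state, and lexicon embedding ~e."""
    def __init__(self, vocab, n_chars, n_lex_tags, w_dim=64, c_dim=32, e_dim=16):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, w_dim)       # ~w (can be pre-trained)
        self.char_emb = nn.Embedding(n_chars, c_dim)
        self.char_lstm = nn.LSTM(c_dim, c_dim, bidirectional=True, batch_first=True)
        self.lex_proj = nn.Linear(n_lex_tags, e_dim)     # ~e from an n-hot tag vector

    def forward(self, word_id, char_ids, lex_nhot):
        w = self.word_emb(word_id)                       # word-level vector
        _, (h, _) = self.char_lstm(self.char_emb(char_ids).unsqueeze(0))
        c = torch.cat([h[0, 0], h[1, 0]])                # final fwd+bwd char states
        e = self.lex_proj(lex_nhot)                      # embed lexicon information
        return torch.cat([w, c, e])                      # input to the word bi-LSTM
```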
WTC (Watchtower corpus): available for 300+ languages; source languages are chosen by word-alignment coverage.
Embed the lexicon (Sagot & Martínez Alonso, 2017): a word's Wiktionary and UniMorph entries are encoded as a lexicon embedding ~e, alongside the word embedding ~w and the character representation ~c over <w> c a s t </w>.
e.g., cast → NOUN, VERB, ADJ (Wiktionary); V;NFIN, V;PST, V;V.PTCP;PST (UniMorph)
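A sketch of turning lexicon entries into an n-hot vector (the tag inventory and the entries for "cast" follow the slide; everything else is illustrative):

```python
import numpy as np

# Combined tag inventory from both lexicons (subset, for illustration)
TAGS = ["NOUN", "VERB", "ADJ", "V;NFIN", "V;PST", "V;V.PTCP;PST"]
TAG2ID = {t: i for i, t in enumerate(TAGS)}

def nhot_lexicon_vector(word, lexicon):
    """n-hot encoding ~e: 1 for every tag the lexicons list for this word."""
    vec = np.zeros(len(TAGS), dtype=np.float32)
    for tag in lexicon.get(word, []):        # unseen words stay all-zero
        vec[TAG2ID[tag]] = 1.0
    return vec

# Wiktionary + UniMorph entries for "cast", as on the slide
lexicon = {"cast": ["NOUN", "VERB", "ADJ", "V;NFIN", "V;PST", "V;V.PTCP;PST"]}
print(nhot_lexicon_vector("cast", lexicon))  # [1. 1. 1. 1. 1. 1.]
```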
[Charts: tagging accuracy vs. data selection baselines; means over 21 languages, each point an average over 3 runs (for random selection: 5 random samples); highlighted values include gains of 3.8% and 10%, 5k instances, and 5%.]
[Chart: accuracy on the dev set (75–84), means over 21 languages (UD 2.1 data); systems: 5k projected, type constraints, n-hot, embed W, embed W+U (DsDs), retrofit; scores between 81.3 and 84.]
Analysis of lexicon coverage (inspired by Li et al., 2012): None = word not in the lexicon; Disjoint = no tag overlap (e.g., English, Croatian, Dutch). The approach works best for languages with high tag set agreement (e.g., Danish, Dutch, Italian) and is hampered by low dictionary coverage (such as Bulgarian, Hindi, Croatian, Finnish).
[Chart: accuracy (70–90) of DsDs for in-corpus vs. out-of-corpus evaluation (no gold training data), over x-axis values 25–200; means over 18 languages for which we had both in- and out-of-corpus gold data.]
Takeaway: lexical information improves projection performance (+5% on average) and enables tagging beyond the lexicon's coverage.
3. Danish NER (to appear in NoDaLiDa 2019)*
* Slide title inspired by Alisa Meechan-Maddon & Joakim Nivre's SyntaxFest presentation :-)
Research questions: Can we build a NER model for Danish from existing English resources? How does that compare to annotating small amounts of gold data, and how do we best combine them? What works best for Danish?
Data: Danish Universal Dependencies (UD) sentences annotated for NER, with Tiny (272 sentences) and Small (604 sentences) training sets; comparatively few sentences contain NEs (vs. 80% on the CoNLL'03 English NER data).
Setup: bilingual embeddings; training on the concatenation of English and Danish (Tiny|Small) data; tuning on Danish.
#sentences per setup (Danish UD train subset + English source, CoNLL'03):

                    Medium (~3k)    Large (all, ~14k)
No target           ~3k             ~14k
Tiny (272 sent.)    272 + ~3k       272 + ~14k
Small (604 sent.)   604 + ~3k       604 + ~14k
Model: a word- and character-level bilstm-CRF.
[Figure: bi-LSTM encoder with a CRF output layer, predicting e.g. O O B-PER.]
MUSE (Conneau et al., 2017; Artetxe et al., 2017)
Bilingual embeddings: project monolingual embeddings into a shared space (many other possibilities, like joint data generation).
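One common way to obtain such a projection (used in the MUSE line of work) is a Procrustes alignment from a seed dictionary; a minimal numpy sketch, where X and Y hold dictionary-aligned source and target vectors (all data here is toy):

```python
import numpy as np

def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Orthogonal W minimising ||X W - Y||_F (Procrustes solution via SVD)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# toy seed dictionary: 100 translation pairs of 50-dim embeddings
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 50))        # source-language vectors (e.g., Danish)
true_W, _ = np.linalg.qr(rng.standard_normal((50, 50)))
Y = X @ true_W                            # target vectors = rotated source (toy)
W = procrustes(X, Y)
print(np.allclose(X @ W, Y, atol=1e-6))   # True: the rotation is recovered
```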
[Chart: NER F1 (20–72) for TnT, plain bilstm-CRF, and bilstm-CRF + Polyglot embeddings, trained on Tiny (4.7k tokens/272 sentences) vs. Small (10k tokens/604 sentences) in-language data; F1 values shown: 67.2, 51.9, 44.3, 56.1, 36.2, 37.5; an 11% difference is highlighted.]
What if we transfer from English to Danish (zero-shot learning)?
[Chart: NER F1 (20–72) for zero-shot, +tiny DA, +small DA, and fine-tune models, for Tiny and Small Danish data.]
What if we add a little labeled data (few-shot learning)?
[Chart: the same models with Medium vs. Large source data, for Tiny and Small Danish data; Medium > Large.]
Takeaways: training jointly on source and target beat fine-tuning from the source; the Medium setup (rather than the entire CoNLL data) was sufficient; a little in-language data combined with cross-lingual embeddings yields an effective NER tagger for Danish quickly.
Outline:
1. Domains: Learning to select data
2. Languages: Cross-lingual learning
3. Multi-task learning
Multi-task learning: "learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better" (Caruana, 1997)
[Figure: single-task learning (STL) trains a separate network per task (input x → task A; input x → task B); multi-task learning (MTL) shares the lower layers: input x → shared representation → task A and task B output layers.]
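A minimal sketch of hard parameter sharing (a hypothetical PyTorch module; the layer sizes and task heads are illustrative, not the setup from any specific paper in this talk):

```python
import torch
import torch.nn as nn

class HardSharingMTL(nn.Module):
    """Shared encoder with one output head per task (hard parameter sharing)."""
    def __init__(self, in_dim=100, hidden=64, n_tags_a=17, n_tags_b=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.head_a = nn.Linear(hidden, n_tags_a)   # e.g., POS tagging
        self.head_b = nn.Linear(hidden, n_tags_b)   # e.g., an auxiliary task

    def forward(self, x, task):
        h = self.shared(x)                          # representation shared by tasks
        return self.head_a(h) if task == "a" else self.head_b(h)

# alternating batches between tasks during training
model = HardSharingMTL()
x = torch.randn(8, 100)                             # a batch of 8 examples
loss_a = nn.functional.cross_entropy(model(x, "a"), torch.randint(0, 17, (8,)))
loss_a.backward()                                   # updates shared encoder + head A
```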
(de Lhoneux et al., EMNLP 2018)
(assume this is a transition-based parser)
http://jalammar.github.io/illustrated-bert/
https://arxiv.org/pdf/1904.02099v2.pdf (to appear at EMNLP 2019)
Pre-trained embeddings and careful fine-tuning: big leaps forward. How much multi-task learning do we still need? Open questions remain (e.g., the choice of auxiliary tasks and the pacing of learning).
Data selection is beneficial in both cross-domain and cross-lingual learning.
Neural models can benefit from the inductive bias of symbolic information.
Multi-task learning provides many opportunities (and challenges), and there is more to be discovered (especially in relation to multilingual modeling).
https://nlp.itu.dk/
@barbara_plank bplank.github.io
Barbara Plank, ITU, Denmark
Supported by: