Continued Training Algorithms
Huda Khayrallah, Jeremy Gwinnup
SCALE Readout, August 9, 2018

[Figure: NMT encoder-decoder architecture, built up component by component (source embedding, encoder, decoder, target embedding, softmax), translating "Wasch dir die Hände" into "Wash your hands".]
[Bahdanau et al. 2015]
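To make the components concrete, here is a minimal PyTorch sketch of an attentional encoder-decoder in this spirit. It is a toy illustration only (single-layer LSTMs, dot-product attention, teacher forcing); the class and variable names are invented, and this is not the SCALE system:

```python
import torch
import torch.nn as nn

class TinyNMT(nn.Module):
    """Toy attentional encoder-decoder (illustrative; not the SCALE system)."""
    def __init__(self, src_vocab, tgt_vocab, dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, dim)            # "Source Embedding"
        self.tgt_embed = nn.Embedding(tgt_vocab, dim)            # "Target Embedding"
        self.encoder = nn.LSTM(dim, dim, batch_first=True)       # "Encoder"
        self.decoder = nn.LSTM(2 * dim, dim, batch_first=True)   # "Decoder" (word + context)
        self.softmax_layer = nn.Linear(dim, tgt_vocab)           # "Softmax" output layer

    def forward(self, src, tgt):
        enc_out, state = self.encoder(self.src_embed(src))       # (B, S, D)
        emb = self.tgt_embed(tgt)                                # (B, T, D), teacher forcing
        outs = []
        for t in range(emb.size(1)):
            query = state[0][-1].unsqueeze(1)                    # decoder hidden (B, 1, D)
            scores = torch.bmm(query, enc_out.transpose(1, 2))   # dot attention (B, 1, S)
            context = torch.bmm(torch.softmax(scores, -1), enc_out)
            out, state = self.decoder(torch.cat([emb[:, t:t+1], context], -1), state)
            outs.append(out)
        return self.softmax_layer(torch.cat(outs, 1))            # logits over target vocab
```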
MT Training
General Domain NMT
General Domain NMT Model, trained on 50M general-domain sentence pairs
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door security door security door
→ Errors due to domain mismatch
In-Domain NMT
In-Domain NMT Model, trained on 30k in-domain sentence pairs
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock for a high degree of protection against coke
→ Errors due to lack of data
Domain Adaptation
Continued Training
[Diagram: a randomly initialized NMT model is trained on 50M general-domain sentence pairs to produce a General Domain NMT Model, which is then continued-trained on 30k in-domain sentence pairs to produce a Continued Training NMT Model.]
Continued Training
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock with increased penetration protection
→ Improved performance!
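Conceptually, continued training is just ordinary training that starts from the general-domain checkpoint instead of a random initialization. A minimal PyTorch-style sketch, assuming a hypothetical `build_nmt_model` constructor and `in_domain_batches` iterator (the systems in this talk were actually trained with Sockeye):

```python
import torch

model = build_nmt_model()  # hypothetical constructor; architecture must match the checkpoint
model.load_state_dict(torch.load("general_domain_model.pt"))  # start from the general model

optimizer = torch.optim.Adam(model.parameters(), lr=0.0003)   # lr from the hyperparameter slide

for epoch in range(10):                  # until held-out in-domain BLEU stops improving
    for src, tgt in in_domain_batches:   # the small (e.g. 30k pairs) in-domain corpus
        optimizer.zero_grad()
        loss = model(src, tgt)           # assume the model returns cross-entropy loss
        loss.backward()
        optimizer.step()
```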
Results
BLEU
Weighted n-gram precision
- Between 0 and 1 (often scaled to 0-100)
- Higher is better
- Imperfect… but cheap, and correlates with human judgments

$$\text{BLEU} = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right)\cdot\left(\prod_{i=1}^{4}\text{precision}_i\right)^{1/4}$$
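As a worked example, the formula above in a few lines of Python. This is the slide's simplified sentence-level BLEU; real corpus BLEU uses an exponential brevity penalty, corpus-level counts, and smoothing:

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (clipped counts)."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def simple_bleu(candidate, reference):
    """min(1, output_len / ref_len) times the geometric mean of 1-4 gram precisions."""
    score = min(1.0, len(candidate) / len(reference))
    for n in range(1, 5):
        score *= ngram_precision(candidate, reference, n) ** 0.25
    return score

hyp = "door lock with increased penetration protection".split()
ref = "door lock with increased degree of security against burglary".split()
print(f"simple BLEU = {simple_bleu(hyp, ref):.3f}")
```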
Russian Patent
[Chart: BLEU for General Domain, In-Domain, Mixed Domain, and Continued Training. Continued training: +9.3 BLEU.]
Patent Results
[Chart: BLEU for German, Korean, Russian, Chinese, comparing General Domain, In-Domain, and Continued Training. Continued-training gains: +0.4 (German), +1.8 (Korean), +10.1 (Russian), +3.5 (Chinese) BLEU.]
Patent Results
[Chart: as above, adding Online A. Gains: +0.4 (German), +1.8 (Korean), +7.2 (Russian), +3.5 (Chinese) BLEU.]
TED Results
[Chart: BLEU for Arabic, German, Farsi, Korean, Russian, Chinese, comparing General Domain, In-Domain, and Continued Training. Gains: +5.8, +5.3, +5.7, +4.2, +2.8, +6.6 BLEU.]
TED Results
[Chart: as above, adding Online A. Gains: +1.1, +1.4, -0.7, -1.6, +6.6, +0.0 BLEU.]
How much data do we need?
[Learning curves: BLEU vs. amount of in-domain training data.
- Patent, German (100,000-800,000 sentence pairs)
- Patent, German/Korean/Russian/Chinese (100,000-800,000)
- Patent, German/Korean/Russian/Chinese (10,000-60,000)
- TED, Arabic/German/Farsi/Korean/Russian/Chinese (50,000-150,000)]
Human Evaluation
Source: 等 了 十个月, 我 终 于 见到 了 他 - 将近 一年 啊。
  Output 1: waiting for 10, i finally met him - nearly a year .
  Output 2: after 10 months, i finally saw him - nearly a year.
  Ranking: Output 2 is better

Source: 这 就是 免费 的 代价 。
  Output 1: that's the price of free.
  Output 2: that's the cost of free.
  Ranking: Output 1 is better

Source: 我 是 说 , 我 已经 够 紧张 的 了
  Output 1: i mean, i'm nervous enough.
  Output 2: i mean, i'm nervous enough.
  Ranking: Both translations are about the same
Continued Training vs General
[Chart: preference counts for Arabic, Chinese, Korean: Continued Training / Tie / General.]
Keyword Search
[Chart: Farsi TED talk NER micro-averaged F1 (0.40-0.60), comparing General Domain SMT, Domain Adapted SMT, General Domain NMT, and Continued Training NMT.]
[Chart: TED talk NER micro-averaged F1 (0.40-0.60) for Arabic, German, Farsi, Korean, Russian, Chinese, same four systems.]
Research Directions
Analysis of Continued Training
Accepted at EMNLP workshop on Analyzing and Interpreting Neural Networks for NLP
- Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson and Philipp Koehn
[Diagram: NMT architecture (source embedding, encoder, decoder, target embedding, softmax), "Wasch dir die Hände" → "Wash your hands".]
Selective Training of Components
[Chart: BLEU gains over the general-domain baseline of 23.3; full continued training gains +11.4. Two series per component:
- Freeze that component, train the rest: Encoder +11.2, Decoder +11.8, Source Embed +10.9, Target Embed +11.3, Softmax +11.1
- Train only that component, freeze the rest: Encoder +11.2, Decoder +9.9, Source Embed +10.6, Target Embed +4.0, Softmax +9.3]
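Implementing the freeze/train split is a few lines in most toolkits. A PyTorch-style sketch, reusing the component names from the toy model earlier (the parameter-name prefixes are assumptions about how the model is organized):

```python
import torch

def set_trainable(model, prefix, trainable):
    """Mark parameters whose names start with `prefix` as trainable or frozen."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = trainable

model = TinyNMT(30000, 30000)   # the toy model from the architecture sketch above

# Train only the decoder during continued training, freezing everything else:
for param in model.parameters():
    param.requires_grad = False
set_trainable(model, "decoder.", True)

# Build the optimizer over trainable parameters only:
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=0.0003)
```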
Elastic Weight Consolidation
Motivation
- Want to adapt NMT models to new domains – continued training is a great way to do this!
- However, specializing models with CT leads to 'forgetting' of info from the original domain.
- Test set BLEU (German):

  Model             General test   Patent test
  General               29.54          35.95
  CT-Patent              7.88          62.28
Elastic Weight Consolidation (EWC)
- In a nutshell: learn Task B with a model trained on Task A, without forgetting the important parts of Task A
- Utilize the Fisher matrix to characterize the sensitivity of parameters

"Overcoming catastrophic forgetting in neural networks" (Kirkpatrick et al., 2017)
New loss function:

$$L(\theta) = L_{\text{train}}(\theta) + \lambda \sum_i F_i \left(\theta_i - \theta_i^{g}\right)^2$$

where $\theta^g$ are the general-domain model's parameters and $F_i$ is the Fisher value for parameter $i$.
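In code, the EWC term is a Fisher-weighted squared distance to the general-domain parameters. A PyTorch-style sketch, assuming `fisher` and `general_params` are dicts keyed by parameter name, computed before continued training begins:

```python
def ewc_penalty(model, fisher, general_params, lam):
    """lam * sum_i F_i * (theta_i - theta_i^g)^2, summed over all parameters."""
    penalty = 0.0
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - general_params[name]) ** 2).sum()
    return lam * penalty

# During continued training on in-domain batches:
#   loss = in_domain_cross_entropy + ewc_penalty(model, fisher, general_params, lam=0.5)
```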
Training Models with EWC
[Diagram: General domain model → approximate Fisher matrix → regularization during continued training → EWC model.]
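One common recipe for the "approximate Fisher matrix" step is the empirical diagonal Fisher: average the squared gradients of the loss over general-domain batches. A sketch under that assumption (the exact recipe used here may differ):

```python
import torch

def approximate_diagonal_fisher(model, general_batches, loss_fn):
    """Empirical diagonal Fisher: mean squared gradient over general-domain data."""
    fisher = {name: torch.zeros_like(p) for name, p in model.named_parameters()}
    num_batches = 0
    for src, tgt in general_batches:
        model.zero_grad()
        loss_fn(model, src, tgt).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
        num_batches += 1
    return {name: f / max(num_batches, 1) for name, f in fisher.items()}

# Snapshot the general-domain parameters to anchor the EWC penalty:
# general_params = {name: p.detach().clone() for name, p in model.named_parameters()}
```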
German Patent EWC Results
[Scatter: general-domain BLEU (5-35) against patent BLEU (35-65). The general-domain model is the starting point ("Start here"); plain continued training reaches high patent BLEU at a large cost in general-domain BLEU ("Continue train to here…"); the goal is to score well on both ("Want to be here!"). Points for EWC regularization weights 2.0, 1.0, 0.75, 0.5, 0.25, 0.125, and 0.0 (plain CT) trace out the tradeoff.]
Conclusion
- Continued training is a powerful means to adapt a high-performing general model to new domains with a small amount of new data.
- EWC allows fine-tuning the tradeoff between general and in-domain performance.
Other Results

BLEU on the in-domain and general-domain test sets, for EWC, plain continued training (Cont.), and the general-domain model (General):

                   In-domain BLEU           General-domain BLEU
                   EWC    Cont.  General    EWC    Cont.  General
German TED         39.95  39.90  34.59      29.19  27.56  29.54
German Patent      57.98  62.28  35.95      23.52   7.88  29.54
Russian TED        28.67  28.60  23.40      27.01  25.19  28.30
Russian Patent     37.28  37.00  23.40      23.55  10.44  28.30
Korean TED         16.76  17.20  11.60      10.61   9.85  11.80
Korean Patent      29.85  31.70   2.70       2.06   0.48  11.80
Example
Source: für die anschlussunterbringung sind die kommunen zuständig .
Reference: communes themselves are responsible for subsequent accommodation .
Cont. Train: the networks are responsible for the connection of the connection .
EWC: the municipalities are responsible for the connection accommodation .
Practical Examples
Extra Slides
[Diagrams: NMT architecture for "Wasch dir die Hände" → "Wash your hands", highlighting the source side (source embedding, encoder) and the target side (target embedding, decoder, softmax).]
Hyperparameters

Model architecture
- num_embed="512:512"
- rnn_num_hidden=512
- rnn_attention_type="dot"
- num_layers=2
- rnn_cell_type="lstm"

Regularization
- embed_dropout=0.0
- rnn_dropout=0.1
- label_smoothing=0.1

Vocabulary
- BPE on source and target
- num_words=30k:30k
- word_min_count="1:1"
- max_seq_len="100:100"

Training configuration
- batch_size=4096
- optimizer=adam
- initial_learning_rate=0.0003
- learning_rate_reduce_factor=0.7
- loss="cross-entropy"
- checkpoint_frequency=4000
Alternate MT explanation
Case Study
Our office needs to translate a lot of Russian patents.
- We have a few translators, but they can only process a small fraction of our data.
- We would like to use machine translation to find the most interesting documents and let our translators focus on those.
- We know neural machine translation has state-of-the-art performance, so we decide to build a neural system…
MT training
[Diagrams: four training setups.
- General Domain Data (50M sentence pairs) → General Domain NMT Model
- In-domain Data → In-Domain NMT Model
- General Domain Data + In-domain Data → Mixed Domain NMT Model
- General Domain NMT Model + In-domain Data → Continued-Training NMT Model]
Continued Training
[Diagram: Randomly Initialized NMT Model → (train on general-domain data) → General Domain NMT Model → (continue training on in-domain data) → Domain Adapted NMT Model.]
Keyword Search
Keyword Search (sort of)
Extrinsic measure of MT Output quality based on ability to retrieve (i.e., match) words or phrases
Human-assigned categories:
- Keyword: venture capitalist, zero gravity, hydrogen
- Sentiment: fantastic, messy, bad, happy
- Person: Heidi, Chris, Leonardo da Vinci, Aristotle
- Organization: Toyota, UNESCO, Ikea, Swedish Army
- Geo-Political Entity: Egypt, San Francisco, Haiti
- Location: Arctic, Africa, hospital, ER, lobby
- Date: Friday, 1980s, last March, today
- Temporal Expression: 4:00 am, 30-second, six weeks
- Numeric Expression: 20 percent, 27 kilometers, one-fifth, two nurses
This metric is pessimistic:
- Inexact matches count as failure
- Tokenization issues exacerbate measures (70 year old vs. 70-year-old)
- Alternative (perfectly acceptable) translations can count as failure
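To make "pessimistic" concrete, here is a toy version of such a matching-based score: micro-averaged F1 where a keyword only counts if it appears verbatim, so paraphrases and tokenization differences score zero. The names and matching rules are assumptions for illustration, not the evaluation's actual implementation:

```python
def keyword_f1(system_outputs, references, keywords):
    """Micro-averaged F1 over exact (verbatim substring) keyword matches."""
    tp = fp = fn = 0
    for sys_out, ref in zip(system_outputs, references):
        for kw in keywords:
            in_ref, in_sys = kw in ref, kw in sys_out
            tp += int(in_ref and in_sys)
            fp += int(in_sys and not in_ref)
            fn += int(in_ref and not in_sys)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# "70-year-old" in the reference vs. "70 year old" in the output: an exact
# matcher counts this as a miss, which is why the metric is pessimistic.
print(keyword_f1(["a 70 year old man"], ["a 70-year-old man"], ["70-year-old"]))  # 0.0
```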
Results
Russian Patent
[Chart: BLEU for General Domain, In-Domain, and Continued Training. Continued training: +10.1 BLEU.]
Russian Patent
[Chart: as above, adding Online A. Continued training: +7.2 BLEU.]
Patent Results
[Chart: BLEU for German, Korean, Russian, Chinese, comparing Domain Adapted SMT, Online A, and NMT Continued Training. Gains: +11.3, +4.6, +7.2, +9.8 BLEU.]
TED Results
[Chart: BLEU for Arabic, German, Farsi, Korean, Russian, Chinese, comparing Domain Adapted SMT, Online A, and Continued Training. Deltas: +1.1, +1.4, +0.0, -0.6, -0.7, +7.0 BLEU.]
TED results
Training data             Ar    De    Fa    Ko    Ru    Zh
SMT General Domain       24.0  31.0  13.9   6.7  25.0  15.2
SMT Mixed Domain         27.8  31.9  18.2  10.7  25.7  16.1
NMT General Domain       29.6  34.6  22.2  11.6  23.4  15.9
NMT In Domain (TED)      27.4  32.3  21.3  14.4  22.9  16.2
NMT Mixed Domain          --   35.6   --    --   24.5  17.8
NMT Continued Training   35.4  39.9  27.9  17.2  28.6  20.4
Microsoft Translator     34.3  38.5  20.9  17.9  28.6  21.0
Patent results
Training data             De    Ko    Ru    Zh
SMT General Domain       26.6   2.4  21.4  13.7
SMT Mixed Domain         50.6  21.7  29.0  29.8
NMT General Domain       36.0   2.7  23.4  12.6
NMT In Domain (Patent)   61.9  29.9  26.9  40.2
NMT Mixed Domain         58.4   --   27.7  33.7
NMT Continued Training   62.3  31.7  37.0  43.7
Microsoft Translator     51.0  27.1  29.8  33.9
[Learning curves: BLEU vs. amount of in-domain training data.
- Patent, German/Korean/Russian/Chinese (1,000-8,000 sentence pairs)
- TED, Russian (50,000-150,000)
- TED, Arabic/German/Farsi/Korean/Russian/Chinese (10,000-60,000)
- TED, Arabic/German/Farsi/Korean/Russian/Chinese (1,000-8,000)]
Human Eval
Continued Training vs General
[Charts: preference share (percent) and counts for Arabic, Korean, Chinese: Continued Training / Tie / General.]

Continued Training vs Human
[Charts: preference share (percent) and counts for Arabic, Korean, Chinese: Continued Training / Tie / Human.]
Human Evaluation
- Trends similar across three languages
- System differences consistent with BLEU
- Human reference (unsurprisingly) better
Research Extensions
Parameter Freezing
[Chart: BLEU by component, relative to the full continued-training baseline of 34.72 (general-domain model with no continued training: 23.32, i.e. -11.4).
- Freezing one component, training the rest: Encoder 34.54 (-0.2), Decoder 35.08 (+0.4), Source Embed 34.25 (-0.5), Target Embed 34.66 (-0.1), Softmax 34.97 (+0.3)
- Training only one component, freezing the rest: Encoder 34.55 (-0.2), Decoder 33.19 (-1.5), Source Embed 33.89 (-0.8), Target Embed 27.37 (-7.4), Softmax 32.65 (-2.1)]
Data Sizes
Datasets
Dataset sizes (# sentence/segment pairs for the in-domain sets):
- Large General Domain: Ar-En 49M (Subtitle, UN, LDC); De-En 28M (Subtitle, WMT); Fa-En 6M (Subtitle); Ko-En 1M (Subtitle); Ru-En 51M (Subtitle, WMT); Zh-En 36M (Subtitle, WMT)
- TED Talks: Ar 175k, De 152k, Fa 114k, Ko 164k, Ru 180k, Zh 170k
- Patent (WIPO): De 821k, Ko 81k, Ru 39k, Zh 154k

Goal: improve test results on TED/Patent using both the large general-domain data and some in-domain data
TED Data
[Same general-domain corpora as above; in-domain (TED): Ar 175k, De 152k, Fa 114k, Ko 164k, Ru 180k, Zh 170k sentence pairs.]
Example sentences: "So, um... she 's kidding..." / "Resumption of the session" / "The European Union supports humanitarian action." / "Allison Hunt: My three minutes hasn't started yet, has it?"
Patent Data
[Same general-domain corpora as above; in-domain (Patent): De 821k, Ko 81k, Ru 39k, Zh 154k sentence pairs.]
Example sentences: "So, um... she 's kidding..." / "Resumption of the session" / "The European Union supports humanitarian action." / "The tablets exhibit improved bioavailability of the active ingredient."
OOV rates
TED OOVs (type count)
Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain       133     204    445     225      140       45
In Domain (TED)      745     700    758     249      813      422
Both domains         126     193    329     133      132       43
Total types         8248    5837   6261    4989     7954     5760

TED OOVs (token count)
Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain       176     235    597     316      153       47
In Domain (TED)      840     809    956     327      933      536
Both domains         168     221    418     187      143       45
Total tokens       28636   35209  39223   45715    31575    33397

TED OOVs (type %)
Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain     1.61%   3.49%  7.11%   4.51%    1.76%    0.78%
In Domain (TED)    9.03%  11.99% 12.11%   4.99%   10.22%    7.33%
Both domains       1.53%   3.31%  5.25%   2.67%    1.66%    0.75%

TED OOVs (token %)
Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain     0.61%   0.67%  1.52%   0.69%    0.48%    0.14%
In Domain (TED)    2.93%   2.30%  2.44%   0.72%    2.95%    1.60%
Both domains       0.59%   0.63%  1.07%   0.41%    0.45%    0.13%
Patent OOVs (type count)
Training data        German  Korean  Russian  Chinese
General Domain         5290    2098     1508      495
In Domain (Patent)     2331     986     4286     1085
Both domains           2100     594     1262      339
Total types           14566    7939    15964     8627

Patent OOVs (token count)
Training data        German  Korean  Russian  Chinese
General Domain        10264    7748     1980     1171
In Domain (Patent)     3864    1724     5715     2061
Both domains           3528    1045     1617      681
Total tokens         132208  186832    81911   135591

Patent OOVs (type %)
Training data        German  Korean  Russian  Chinese
General Domain       36.32%  26.43%    9.45%    5.74%
In Domain (Patent)   16.00%  12.42%   26.85%   12.58%
Both domains         14.42%   7.48%    7.91%    3.93%

Patent OOVs (token %)
Training data        German  Korean  Russian  Chinese
General Domain        7.76%   4.15%    2.42%    0.86%
In Domain (Patent)    2.92%   0.92%    6.98%    1.52%
Both domains          2.67%   0.56%    1.97%    0.50%
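For reference, type and token OOV rates as tabulated above can be computed in a few lines (a generic sketch; the exact tokenization behind these tables is not shown here):

```python
def oov_rates(train_tokens, test_tokens):
    """Return (type OOV rate, token OOV rate) of a test set against a training vocabulary."""
    vocab = set(train_tokens)
    test_types = set(test_tokens)
    oov_types = test_types - vocab
    oov_token_count = sum(1 for tok in test_tokens if tok not in vocab)
    return len(oov_types) / len(test_types), oov_token_count / len(test_tokens)

train = "the cat sat on the mat".split()
test = "the dog sat on the log".split()
print(oov_rates(train, test))  # (0.4, 0.333...): 'dog' and 'log' are unseen
```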
[Bar charts of the OOV tables above: TED OOV type/token counts and percentages, and Patent OOV type/token counts and percentages, by training data (General Domain / In Domain / Both domains).]