Continued Training Algorithms: Huda Khayrallah, Jeremy Gwinnup (SCALE)



slide-1
SLIDE 1

Continued Training Algorithms

Huda Khayrallah, Jeremy Gwinnup SCALE Readout August 9, 2018

slide-2
SLIDE 2

Wasch dir die Hände

slide-3
SLIDE 3

Wasch dir die Hände

Source Embedding

slide-4
SLIDE 4

Wasch dir die Hände

Encoder Source Embedding

slide-5
SLIDE 5

Wasch dir die Hände

Decoder Encoder Source Embedding

slide-6
SLIDE 6

Wasch dir die Hände

Softmax Decoder Encoder Source Embedding

Wash

slide-7
SLIDE 7

Wasch dir die Hände

Softmax Decoder Encoder Source Embedding

Wash

Embedding Target

slide-8
SLIDE 8

Wasch dir die Hände

Softmax Decoder Encoder Source Embedding

Wash

Embedding Target

slide-9
SLIDE 9

Source: Wasch dir die Hände
Target: Wash your hands

Components: Source Embedding, Encoder, Decoder, Softmax, Target Embedding

[Bahdanau et al. 2015]

slide-10
SLIDE 10

MT Training

slide-11
SLIDE 11

General Domain NMT

General Domain NMT Model

50M General Domain sentence pairs

slide-12
SLIDE 12

General Domain NMT

General Domain NMT Model

Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door security door security door
Errors due to domain mismatch

slide-13
SLIDE 13

In-Domain NMT

In-Domain NMT Model

30k In-Domain sentence pairs

slide-14
SLIDE 14

In-Domain NMT

In-Domain NMT Model

Errors due to lack of data
Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock for a high degree of protection against coke

slide-15
SLIDE 15

Domain Adaptation

slide-16
SLIDE 16

Continued Training

Randomly Initialized NMT Model → train on 50M General Domain sentence pairs → General Domain NMT Model

General Domain NMT Model → continue training on 30k In-domain sentence pairs → Continued Training NMT Model
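The recipe on this slide can be sketched with a toy model. This is a minimal illustration, not the presenters' NMT setup: a single-weight regression stands in for the NMT model, and the data, learning rate, epoch counts, and the function name `train` are all assumptions made for the example. The only point is the control flow: the second call starts from the weights produced by the first instead of from a fresh initialization.

```python
def train(pairs, w=0.0, lr=0.01, epochs=50):
    """Fit y ~ w * x by gradient descent on squared error, starting from w."""
    for _ in range(epochs):
        for x, y in pairs:
            # Gradient of (w*x - y)^2 with respect to w is 2*(w*x - y)*x.
            w -= lr * 2 * (w * x - y) * x
    return w

# Hypothetical stand-ins: plenty of "general" data, one "in-domain" example.
general = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
in_domain = [(1.0, 3.0)]

# Step 1: train from a fresh initialization on general-domain data.
w_general = train(general)

# Step 2: continue training the SAME weights on the small in-domain set.
w_continued = train(in_domain, w=w_general, epochs=5)
```

After step 2 the weight has moved from the general-domain solution toward the in-domain one, without ever being retrained from scratch on the small set.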

slide-17
SLIDE 17

Continued Training

Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door lock with increased penetration protection
Improved performance!

Randomly Initialized NMT Model → General Domain NMT Model → Continued Training NMT Model

slide-18
SLIDE 18

Results

slide-19
SLIDE 19

BLEU

Weighted n-gram precision

  • Between 0 and 1

(often scaled to be 0-100)

Higher is better
Imperfect, but cheap and correlates with human judgments

$$\mathrm{BLEU} = \min\left(1, \frac{\text{output length}}{\text{reference length}}\right) \left(\prod_{i=1}^{4} \text{precision}_i\right)^{1/4}$$
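The formula on this slide can be tried out on a single sentence pair. The sketch below follows the slide's simplified form (a length ratio capped at 1, times the geometric mean of 1- to 4-gram precisions). Two things are assumptions of this example rather than statements from the slides: n-gram matches are clipped by their reference counts, as in standard BLEU, and scoring is per sentence against one reference. The name `slide_bleu` is hypothetical. Standard corpus-level BLEU uses an exponential brevity penalty, so this is not a replacement for a real BLEU tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def slide_bleu(output, reference, max_n=4):
    """Length ratio capped at 1, times the geometric mean of n-gram precisions."""
    out, ref = output.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        out_counts = Counter(ngrams(out, n))
        ref_counts = Counter(ngrams(ref, n))
        total = sum(out_counts.values())
        if total == 0:
            return 0.0  # output too short to contain any n-gram of this order
        # Clipped matches: an n-gram is credited at most as often as the reference has it.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        precisions.append(matched / total)
    if min(precisions) == 0:
        return 0.0
    length_penalty = min(1.0, len(out) / len(ref))
    return length_penalty * math.prod(precisions) ** (1.0 / max_n)

score = slide_bleu("please wash your hands now", "wash your hands now please")
```

Because a single missing n-gram order zeroes the product, short or repetitive outputs such as "door security door security door" score 0 under this sentence-level variant.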

slide-20
SLIDE 20

Russian Patent

[Bar chart, BLEU axis 10 to 40: General Domain, In-Domain, Mixed Domain, Continued Training. Label: +9.3 BLEU]

slide-21
SLIDE 21

Patent Results

[Bar chart, BLEU axis 10 to 70, for German, Korean, Russian, Chinese: General Domain, In-Domain, Continued Training. Labels: +0.4, +1.8, +10.1, +3.5 BLEU]

slide-22
SLIDE 22

Patent Results

[Bar chart, BLEU axis 10 to 70, for German, Korean, Russian, Chinese: General Domain, In-Domain, Continued Training, Online A. Labels: +0.4, +1.8, +7.2, +3.5 BLEU]

slide-23
SLIDE 23

TED Results

[Bar chart, BLEU axis 5 to 45, for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain, Continued Training. Labels: +5.8, +5.3, +5.7, +4.2, +2.8, +6.6 BLEU]

slide-24
SLIDE 24

TED Results

[Bar chart, BLEU axis 5 to 45, for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain, Continued Training, Online A. Labels: +1.1, +1.4, 0.7, 1.6, +6.6, +0.0 BLEU; the signs of 0.7 and 1.6 were lost in extraction]

slide-25
SLIDE 25

How much data do we need?

slide-26
SLIDE 26

Patent

[Line chart: BLEU (10 to 60) vs. amount of in-domain training data (100,000 to 800,000); German]

slide-27
SLIDE 27

Patent

[Line chart: BLEU (10 to 60) vs. amount of in-domain training data (100,000 to 800,000); German, Korean, Russian, Chinese]

slide-28
SLIDE 28

Patent

[Line chart: BLEU (10 to 60) vs. amount of in-domain training data (10,000 to 60,000); German, Korean, Russian, Chinese]

slide-29
SLIDE 29

TED

[Line chart: BLEU (10 to 40) vs. amount of in-domain training data (50,000 to 150,000); Arabic, German, Farsi, Korean, Russian, Chinese]

slide-30
SLIDE 30

Human Evaluation

slide-31
SLIDE 31

Source: 等 了 十个月, 我 终 于 见到 了 他 - 将近 一年 啊。
Output 1: waiting for 10, i finally met him- nearly a year .
Output 2: after 10 months, i finally saw him- nearly a year.
Ranking: Output 2 is better

Source: 这 就是 免费 的 代 价 。
Output 1: that's the price of free.
Output 2: that's the cost of free.
Ranking: Output 1 is better

Source: 我 是 说 , 我 已 经 够 紧张 的 了
Output 1: i mean, i'm nervous enough.
Output 2: i mean, i'm nervous enough.
Ranking: Both translations are about the same

slide-32
SLIDE 32

Continued Training vs General

[Chart for Arabic, Chinese, Korean: share of human judgments for Continued Training, Tie, General]

slide-33
SLIDE 33

Keyword Search

slide-34
SLIDE 34

Farsi TED talk NER MicroAvg F1

[Bar chart, F1 axis 0.40 to 0.60, Farsi: General Domain SMT, Domain Adapted SMT, General Domain NMT, Continued Training NMT]

slide-35
SLIDE 35

TED talk NER MicroAvg F1

[Bar chart, F1 axis 0.40 to 0.60, for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain SMT, Domain Adapted SMT, General Domain NMT, Continued Training NMT]
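For readers unfamiliar with the "MicroAvg F1" on these charts: micro-averaging pools the hit and miss counts over all categories before computing precision and recall, so frequent categories weigh more. The sketch below shows that computation in general terms. The function name `micro_f1`, the count layout, and the example numbers are hypothetical, and the slides do not say how hits were counted.

```python
def micro_f1(per_category):
    """Micro-averaged F1: pool counts over categories, then compute P, R, F1.

    per_category maps a category name to a dict with true positives ("tp"),
    false positives ("fp") and false negatives ("fn").
    """
    tp = sum(c["tp"] for c in per_category.values())
    fp = sum(c["fp"] for c in per_category.values())
    fn = sum(c["fn"] for c in per_category.values())
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for two categories.
counts = {
    "Person": {"tp": 3, "fp": 1, "fn": 2},
    "Date": {"tp": 1, "fp": 1, "fn": 0},
}
f1 = micro_f1(counts)
```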

slide-36
SLIDE 36

Research Directions

slide-37
SLIDE 37

Analysis of Continued Training

Accepted at EMNLP workshop on Analyzing and Interpreting Neural Networks for NLP

  • Brian Thompson, Huda Khayrallah, Antonios Anastasopoulos, Arya McCarthy, Kevin Duh, Rebecca Marvin, Paul McNamee, Jeremy Gwinnup, Tim Anderson and Philipp Koehn

slide-38
SLIDE 38

Wash your hands Wasch dir die Hände

Components: Source Embedding, Encoder, Decoder, Softmax, Target Embedding

slide-39
SLIDE 39

Selective Training of Components

[Bar chart, BLEU axis 20 to 35: General Domain (23.3), Encoder, Decoder, Source Embed, Target Embed, Softmax, Continued Training (+11.4)]

slide-40
SLIDE 40

Selective Training of Components

[Bar chart, BLEU axis 20 to 35, legend Freeze / Train: General Domain 23.3; Encoder +11.2; Decoder +11.8; Source Embed +10.9; Target Embed +11.3; Softmax +11.1; Continued Training +11.4]

slide-41
SLIDE 41

Selective Training of Components

[Bar chart, BLEU axis 20 to 35, legend Freeze / Train: General Domain 23.3; Encoder +11.2; Decoder +9.9; Source Embed +10.6; Target Embed +4.0; Softmax +9.3; Continued Training +11.4]

slide-42
SLIDE 42

Selective Training of Components

[Bar chart, BLEU axis 20 to 35, legend Freeze / Train: General Domain 23.3; Encoder, Decoder, Source Embed, Target Embed, Softmax; Continued Training +11.4]
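The "Freeze" and "Train" conditions on these slides amount to choosing which of the five components receive gradient updates during continued training. The following is a minimal sketch of that bookkeeping, assuming plain SGD on lists of floats. The names `COMPONENTS`, `sgd_step`, and `train_only` are hypothetical and do not come from any toolkit; a real NMT system would do the same thing by excluding parameter groups from the optimizer.

```python
COMPONENTS = ("source_embed", "encoder", "decoder", "target_embed", "softmax")

def sgd_step(params, grads, lr, frozen=()):
    """One SGD update; components listed in `frozen` keep their current values."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            updated[name] = list(values)  # untouched general-domain weights
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated

def train_only(component):
    """Frozen set for the 'train one component' condition: everything else."""
    return tuple(c for c in COMPONENTS if c != component)

# Toy parameters and gradients, one value per component.
params = {c: [1.0] for c in COMPONENTS}
grads = {c: [0.5] for c in COMPONENTS}

freeze_encoder = sgd_step(params, grads, lr=1.0, frozen=("encoder",))
only_decoder = sgd_step(params, grads, lr=1.0, frozen=train_only("decoder"))
```

Passing one component name as `frozen` gives the "freeze one, train the rest" condition; `train_only` gives the complementary "train one, freeze the rest" condition.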

slide-43
SLIDE 43

Elastic Weight Consolidation

slide-44
SLIDE 44

Motivation

  • Want to adapt NMT models to new domains – continued training is a great way to do this!
  • However, specializing models with CT leads to ‘forgetting’ of info from the original domain.

BLEU by model and test set:

Model       General test set   Patent test set
General     29.54              35.95
CT-Patent   7.88               62.28

slide-45
SLIDE 45

Elastic Weight Consolidation (EWC)

  • In a nutshell, learn Task B with a model trained on Task A without forgetting the important parts of Task A
  • Utilize the Fisher matrix to characterize the sensitivity of parameters

“Overcoming catastrophic forgetting in neural networks” (Kirkpatrick et al., 2017)

slide-46
SLIDE 46

Training Models with EWC

New Loss Function:

$$L_{\text{train}}(\theta) + \lambda \sum_i F_i \left(\theta_i - \theta_i^{g}\right)^2$$

Diagram labels: General Domain Model, Approximate Fisher Matrix, Regularization, Continued Training, EWC Model
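The EWC loss on this slide adds a per-parameter quadratic penalty to the ordinary training loss. The sketch below spells that out on plain Python lists. It reads the superscript g as "general-domain model" and approximates the Fisher matrix by its diagonal, estimated as the mean squared gradient over general-domain examples. Both readings are assumptions of this example, in the spirit of Kirkpatrick et al. (2017), not details confirmed by the slides. The function names are hypothetical.

```python
def diagonal_fisher(per_example_grads):
    """Diagonal Fisher estimate: mean of squared gradients, per parameter."""
    n = len(per_example_grads)
    size = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(size)]

def ewc_loss(train_loss, params, general_params, fisher, lam):
    """L_train(theta) + lambda * sum_i F_i * (theta_i - theta_i^g)^2."""
    penalty = sum(
        f * (p - g) ** 2 for f, p, g in zip(fisher, params, general_params)
    )
    return train_loss + lam * penalty

# Toy numbers: parameter 0 mattered on general-domain data, parameter 1 did not.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
loss = ewc_loss(2.0, params=[1.0, 5.0], general_params=[0.0, 0.0],
                fisher=fisher, lam=0.5)
```

In this toy case the second parameter has moved far from its general-domain value at no cost because its Fisher weight is zero, while the first is penalized. Setting `lam=0` recovers plain continued training.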

slide-47
SLIDE 47

German Patent EWC Results

[Scatter plot: general-domain BLEU (5 to 35) vs. patent BLEU (35 to 65); points: Gen Domain, CT. Annotation: Start here]

slide-48
SLIDE 48

German Patent EWC Results

[Scatter plot: general-domain BLEU (5 to 35) vs. patent BLEU (35 to 65); points: Gen Domain, CT. Annotation: Continue Train to here…]

slide-49
SLIDE 49

German Patent EWC Results

[Scatter plot: general-domain BLEU (5 to 35) vs. patent BLEU (35 to 65); points: Gen Domain, CT. Annotation: Want to be here!]

slide-50
SLIDE 50

German Patent EWC Results

[Scatter plot: general-domain BLEU (5 to 35) vs. patent BLEU (35 to 65); points: Gen Domain, 2.0, 1.0, 0.75, 0.5, 0.25, 0.125, 0.0 (CT)]

slide-51
SLIDE 51

Conclusion

  • Continued training is a powerful means to adapt a high-performing general model to new domains with a small amount of new data.
  • EWC allows fine-tuning the tradeoff between general and in-domain performance.

slide-52
SLIDE 52

Other Results

BLEU for EWC, continued training (Cont.), and the General model on the in-domain and general-domain test sets:

Setting          Test set         EWC     Cont.   General
German TED       TED              39.95   39.9    34.59
German TED       General domain   29.19   27.56   29.54
German Patent    Patent           57.98   62.28   35.95
German Patent    General domain   23.52   7.88    29.54
Russian TED      TED              28.67   28.6    23.4
Russian TED      General domain   27.01   25.19   28.3
Russian Patent   Patent           37.28   37      23.4
Russian Patent   General domain   23.55   10.44   28.3
Korean TED       TED              16.76   17.2    11.6
Korean TED       General domain   10.61   9.85    11.8
Korean Patent    Patent           29.85   31.7    2.7
Korean Patent    General domain   2.06    0.48    11.8

slide-53
SLIDE 53

Example

Source: für die anschlussunterbringung sind die kommunen zuständig .
Reference: communes themselves are responsible for subsequent accommodation .
Cont. Train: the networks are responsible for the connection of the connection .
EWC: the municipalities are responsible for the connection accommodation .

slide-54
SLIDE 54

Practical Examples

slide-55
SLIDE 55
slide-56
SLIDE 56
slide-57
SLIDE 57
slide-58
SLIDE 58
slide-59
SLIDE 59
slide-60
SLIDE 60
slide-61
SLIDE 61
slide-62
SLIDE 62
slide-63
SLIDE 63
slide-64
SLIDE 64
slide-65
SLIDE 65

Extra Slides

slide-66
SLIDE 66

Wash your hands Wasch dir die Hände

Components: Source Embedding, Encoder, Decoder, Softmax, Target Embedding

slide-67
SLIDE 67

Source Encoder Embedding

Wash your hands Wasch dir die Hände

Softmax Decoder Target Embedding

slide-68
SLIDE 68

Hyperparameters

Model architecture

  • num_embed="512:512"
  • rnn_num_hidden=512
  • rnn_attention_type="dot"
  • num_layers=2
  • rnn_cell_type="lstm"

Regularization

  • embed_dropout=0.0
  • rnn_dropout=0.1
  • label_smoothing=0.1

Vocabulary

  • BPE on Source and Target
  • num_words=30k:30k
  • word_min_count="1:1"
  • max_seq_len="100:100"

Training configuration

  • batch_size=4096
  • optimizer=adam
  • initial_learning_rate=0.0003
  • learning_rate_reduce_factor=0.7
  • loss="cross-entropy"
  • checkpoint_frequency=4000
slide-69
SLIDE 69

Alternate MT explanation

slide-70
SLIDE 70

Case Study

Our office needs to translate a lot of Russian patents.

  • We have a few translators, but they can only process a small fraction of our data.
  • We would like to use machine translation to find the most interesting documents and let our translators focus on those.
  • We know neural machine translation has state-of-the-art performance, so we decide to build a neural system…

slide-71
SLIDE 71

MT training

General Domain NMT Model General Domain Data

slide-72
SLIDE 72

MT training

In-Domain NMT Model In-domain Data

slide-73
SLIDE 73

MT training

Mixed Domain NMT Model In-domain Data General Domain Data

slide-74
SLIDE 74

MT training

General Domain NMT Model + In-domain Data → Continued-Training NMT Model

slide-75
SLIDE 75

MT training

General Domain NMT Model 50M General Domain sentence pairs

slide-76
SLIDE 76

Continued Training

General Domain NMT Model

Train on general domain data

Domain Adapted NMT Model

Continue training on in-domain data

Randomly Initialized NMT Model

slide-77
SLIDE 77

General Domain NMT

General Domain NMT Model

50M General Domain sentence pairs

slide-78
SLIDE 78

General Domain NMT

General Domain NMT Model

Source: дверной замок повышенной степени защищенности от взлома
Human: door lock with increased degree of security against burglary
System: door security door security door
Errors due to domain mismatch

slide-79
SLIDE 79

Keyword Search

slide-80
SLIDE 80

Keyword Search (sort of)

Extrinsic measure of MT Output quality based on ability to retrieve (i.e., match) words or phrases

  • [Insert cartoon]
slide-81
SLIDE 81

Human assigned categories

Keyword: venture capitalist, zero gravity, hydrogen
Sentiment: fantastic, messy, bad, happy
Person: Heidi, Chris, Leonardo da Vinci, Aristotle
Organization: Toyota, UNESCO, Ikea, Swedish Army
Geo-Political Entity: Egypt, San Francisco, Haiti
Location: Arctic, Africa, hospital, ER, lobby
Date: Friday, 1980s, last March, today
Temporal Expression: 4:00 am, 30-second, six weeks
Numeric Expression: 20 percent, 27 kilometers, one-fifth, two nurses

slide-82
SLIDE 82

This metric is pessimistic

Inexact matches count as failure

Tokenization issues exacerbate measures: 70 year old vs. 70-year-old

Alternative (very acceptable) translations can count as failure
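The tokenization point on this slide is easy to reproduce. The sketch below contrasts a strict substring match with a lightly normalized one. The normalization rule (lowercasing, and treating hyphens and whitespace runs as one space) and all function names are hypothetical and for illustration only; the slides do not say the evaluation applied any such normalization, and the pessimism they describe suggests it did not.

```python
import re

def normalize(text):
    """Lowercase, and collapse hyphens and whitespace runs into single spaces."""
    return re.sub(r"[\s\-]+", " ", text.lower()).strip()

def exact_match(keyword, mt_output):
    """Strict retrieval: the keyword must appear verbatim in the MT output."""
    return keyword in mt_output

def normalized_match(keyword, mt_output):
    """Lenient retrieval: compare after normalizing both sides."""
    return normalize(keyword) in normalize(mt_output)

mt_output = "a 70 year old man"
strict = exact_match("70-year-old", mt_output)
lenient = normalized_match("70-year-old", mt_output)
```

Here the strict match counts a perfectly good translation as a miss, while the lenient one finds it. Substring matching is itself a simplification, since it ignores word boundaries.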

slide-83
SLIDE 83

Results

slide-84
SLIDE 84

Russian Patent

[Bar chart, BLEU axis 10 to 40: General Domain, In-Domain, Continued Training. Label: +10.1 BLEU]

slide-85
SLIDE 85

Russian Patent

[Bar chart, BLEU axis 10 to 40: General Domain, In-Domain, Continued Training, Online A. Label: +7.2 BLEU]

slide-86
SLIDE 86

Patent Results

[Bar chart, BLEU axis 10 to 70, for German, Korean, Russian, Chinese: SMT Domain Adapted, Online A, NMT Continued Training. Labels: +11.3, +4.6, +7.2, +9.8 BLEU]

slide-87
SLIDE 87

TED Results

[Bar chart, BLEU axis 5 to 45, for Arabic, German, Farsi, Korean, Russian, Chinese: SMT Domain Adapted, Online A, Continued Training. Labels: +1.1, +1.4, +0.0, -0.6, -0.7, +7.0 BLEU]

slide-88
SLIDE 88

Patent Results

[Bar chart, BLEU axis 10 to 70, for German, Korean, Russian, Chinese: General Domain, In-Domain, Continued Training, Online A. Labels: +0.4, +1.8, +7.2, +3.5 BLEU]

slide-89
SLIDE 89

TED Results

[Bar chart, BLEU axis 5 to 45, for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain, Continued Training, Online A. Labels: +1.1, +1.4, 0.7, +1.6, +6.6, +0.0 BLEU; the sign of 0.7 was lost in extraction]

slide-90
SLIDE 90

TED results

Training data              Ar     De     Fa     Ko     Ru     Zh
SMT, General Domain        24.0   31.0   13.9   6.7    25.0   15.2
SMT, Mixed Domain          27.8   31.9   18.2   10.7   25.7   16.1
NMT, General Domain        29.6   34.6   22.2   11.6   23.4   15.9
NMT, In Domain (TED)       27.4   32.3   21.3   14.4   22.9   16.2
Continued Training         35.4   39.9   27.9   17.2   28.6   20.4
Microsoft Translator       34.3   38.5   20.9   17.9   28.6   21.0

NMT, Mixed Domain: --, 35.6, --, 24.5, 17.8 (only five entries survive for six languages, so the column alignment is not recoverable)

slide-91
SLIDE 91

Patent results

Training data              De     Ko     Ru     Zh
SMT, General Domain        26.6   2.4    21.4   13.7
SMT, Mixed Domain          50.6   21.7   29.0   29.8
NMT, General Domain        36.0   2.7    23.4   12.6
NMT, In Domain (Patent)    61.9   29.9   26.9   40.2
NMT, Mixed Domain          58.4   --     27.7   33.7
Continued Training         62.3   31.7   37.0   43.7
Microsoft Translator       51.0   27.1   29.8   33.9

slide-92
SLIDE 92

Patent

[Line chart: BLEU (10 to 60) vs. amount of in-domain training data (1000 to 8000); German, Korean, Russian, Chinese]

slide-93
SLIDE 93

TED

[Line chart: BLEU (10 to 40) vs. amount of in-domain training data (50000 to 150000); Russian]

slide-94
SLIDE 94

TED

[Line chart: BLEU (10 to 40) vs. amount of in-domain training data (10000 to 60000); Arabic, German, Farsi, Korean, Russian, Chinese]

slide-95
SLIDE 95

TED

[Line chart: BLEU (10 to 40) vs. amount of in-domain training data (1000 to 8000); Arabic, German, Farsi, Korean, Russian, Chinese]

slide-96
SLIDE 96

Human Eval

slide-97
SLIDE 97
slide-98
SLIDE 98

Continued Training vs General

[Bar chart, 0% to 70%, for Arabic, Korean, Chinese: share of judgments for Continued Training, Tie, General]

slide-99
SLIDE 99

Continued Training vs Human

[Bar chart, 0% to 70%, for Arabic, Korean, Chinese: share of judgments for Continued Training, Tie, Human]

slide-100
SLIDE 100

Continued Training vs General

[Bar chart, axis 10 to 90, for Arabic, Korean, Chinese: judgments for Continued Training, Tie, General]

slide-101
SLIDE 101

Continued Training vs Human

[Bar chart, axis 10 to 90, for Arabic, Korean, Chinese: judgments for Continued Training, Tie, Human]

slide-102
SLIDE 102

Human Evaluation

Trends similar across three languages

System differences consistent with BLEU

Human reference (unsurprisingly) better

slide-103
SLIDE 103

Research Extensions

slide-104
SLIDE 104

Parameter Freezing

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: 34.7; -0.2, +0.4, -0.5, -0.1, +0.3; -11.4; -0.2, -1.5, -0.8, -7.4, -2.1]

slide-105
SLIDE 105

Parameter Freezing

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: 34.72, 34.54, 35.08, 34.25, 34.66, 34.97, 23.32]

slide-106
SLIDE 106

Parameter Freezing

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: 34.72, 23.32, 34.55, 33.19, 33.89, 27.37, 32.65]

slide-107
SLIDE 107

Selective Training

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: -0.2, +0.4, -0.5, -0.1, +0.3]

slide-108
SLIDE 108

Selective Training of Components

[Bar chart, BLEU axis 20 to 35: General Domain, Encoder, Decoder, Source Embed, Target Embed, Softmax, Continued Training. Labels: 34.54, 35.08, 34.25, 34.66, 34.97, 34.72, 23.32, 34.55, 33.19, 33.89, 27.37, 32.65]

slide-109
SLIDE 109

Selective Training

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: 34.7, 23.3]

slide-110
SLIDE 110

Selective Training

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: 34.7, 23.3]

slide-111
SLIDE 111

Selective Training

[Bar chart, BLEU axis 20 to 35: Baseline, Encoder, Decoder, Source Embed, Target Embed, Softmax. Labels: 34.7; -0.2, +0.4, -0.5, -0.1, +0.3; -11.4; -0.2, -1.5, -0.8, -7.4, -2.1]

slide-112
SLIDE 112

Data Sizes

slide-113
SLIDE 113

Datasets

Number of sentence/segment pairs per language pair:

Large General Domain: Ar-En 49M (Subtitle, UN, LDC); De-En 28M (Subtitle, WMT); Fa-En 6M (Subtitle); Ko-En 1M (Subtitle); Ru-En 51M (Subtitle, WMT); Zh-En 36M (Subtitle, WMT)
TED Talks: Ar-En 175k; De-En 152k; Fa-En 114k; Ko-En 164k; Ru-En 180k; Zh-En 170k
Patent (WIPO): De-En 821k; Ko-En 81k; Ru-En 39k; Zh-En 154k

Goal: Improve test results on TED/Patent using both Large General Domain and some In-Domain data

slide-114
SLIDE 114

Data

General Domain: Arabic 49M (Subtitle, UN, LDC); German 28M (Subtitle, WMT); Farsi 6M (Subtitle); Korean 1M (Subtitle); Russian 51M (Subtitle, WMT); Chinese 36M (Subtitle, WMT)
TED: Arabic 175k; German 152k; Farsi 114k; Korean 164k; Russian 180k; Chinese 170k
Patent: German 821k; Korean 81k; Russian 39k; Chinese 154k

slide-115
SLIDE 115

TED Data

General Domain: Arabic 49M (Subtitle, UN, LDC); German 28M (Subtitle, WMT); Farsi 6M (Subtitle); Korean 1M (Subtitle); Russian 51M (Subtitle, WMT); Chinese 36M (Subtitle, WMT)
In Domain (TED): Arabic 175k; German 152k; Farsi 114k; Korean 164k; Russian 180k; Chinese 170k

Example sentences:
So, um... she 's kidding...
Resumption of the session
The European Union supports humanitarian action.
Allison Hunt: My three minutes hasn't started yet, has it?

slide-116
SLIDE 116

Patent Data

General Domain: German 28M (Subtitle, WMT); Korean 1M (Subtitle); Russian 51M (Subtitle, WMT); Chinese 36M (Subtitle, WMT)
Patent: German 821k; Korean 81k; Russian 39k; Chinese 154k

Example sentences:
So, um... she 's kidding...
Resumption of the session
The European Union supports humanitarian action.
The tablets exhibit improved bioavailability of the active ingredient.

slide-117
SLIDE 117

OOV rates

slide-118
SLIDE 118

TED OOVs (type count)

Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain    133     204     445    225     140      45
In Domain (TED)   745     700     758    249     813      422
Both domains      126     193     329    133     132      43
Total types       8248    5837    6261   4989    7954     5760

slide-119
SLIDE 119

TED OOVs (token count)

Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain    176     235     597    316     153      47
In Domain (TED)   840     809     956    327     933      536
Both domains      168     221     418    187     143      45
Total tokens      28636   35209   39223  45715   31575    33397

slide-120
SLIDE 120

TED OOVs (type %)

Training data     Arabic  German  Farsi   Korean  Russian  Chinese
General Domain    1.61%   3.49%   7.11%   4.51%   1.76%    0.78%
In Domain (TED)   9.03%   11.99%  12.11%  4.99%   10.22%   7.33%
Both domains      1.53%   3.31%   5.25%   2.67%   1.66%    0.75%

slide-121
SLIDE 121

TED OOVs (token %)

Training data     Arabic  German  Farsi  Korean  Russian  Chinese
General Domain    0.61%   0.67%   1.52%  0.69%   0.48%    0.14%
In Domain (TED)   2.93%   2.30%   2.44%  0.72%   2.95%    1.60%
Both domains      0.59%   0.63%   1.07%  0.41%   0.45%    0.13%
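The four OOV tables for TED report the same quantity in four ways: counted per distinct word (type) or per running word (token), as a raw count or as a percentage of the test set. A minimal sketch of that computation follows, assuming whitespace-tokenized text and a non-empty test set. The function name `oov_stats` and the toy sentences are hypothetical, and the slides do not say whether the counts were taken before or after BPE.

```python
from collections import Counter

def oov_stats(train_tokens, test_tokens):
    """Count test-set words never seen in training, by type and by token."""
    vocab = set(train_tokens)
    test_counts = Counter(test_tokens)
    oov_types = [t for t in test_counts if t not in vocab]
    oov_token_count = sum(test_counts[t] for t in oov_types)
    return {
        "type_count": len(oov_types),
        "token_count": oov_token_count,
        "type_pct": 100.0 * len(oov_types) / len(test_counts),
        "token_pct": 100.0 * oov_token_count / len(test_tokens),
    }

train = "the door lock".split()
test = "the lock against burglary burglary".split()
stats = oov_stats(train, test)
```

The percentages in the tables follow the same arithmetic: for Arabic with general-domain training data, 133 OOV types out of 8248 total types is 1.61%, and 176 OOV tokens out of 28636 total tokens is 0.61%.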

slide-122
SLIDE 122

Patent OOVs (type count)

Training data        German  Korean  Russian  Chinese
General Domain       5290    2098    1508     495
In Domain (Patent)   2331    986     4286     1085
Both domains         2100    594     1262     339
Total types          14566   7939    15964    8627

slide-123
SLIDE 123

Patent OOVs (token count)

Training data        German  Korean  Russian  Chinese
General Domain       10264   7748    1980     1171
In Domain (Patent)   3864    1724    5715     2061
Both domains         3528    1045    1617     681
Total tokens         132208  186832  81911    135591

slide-124
SLIDE 124

Patent OOVs (type %)

Training data        German  Korean  Russian  Chinese
General Domain       36.32%  26.43%  9.45%    5.74%
In Domain (Patent)   16.00%  12.42%  26.85%   12.58%
Both domains         14.42%  7.48%   7.91%    3.93%

slide-125
SLIDE 125

Patent OOVs (token %)

Training data        German  Korean  Russian  Chinese
General Domain       7.76%   4.15%   2.42%    0.86%
In Domain (Patent)   2.92%   0.92%   6.98%    1.52%
Both domains         2.67%   0.56%   1.97%    0.50%

slide-126
SLIDE 126

TED OOVs (Type Count)

[Bar chart, OOV types (200 to 800), for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain (TED), Both domains]

slide-127
SLIDE 127

TED OOVs (Token Count)

[Bar chart, OOV tokens (200 to 1200), for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain (TED), Both domains]

slide-128
SLIDE 128

TED OOVs (Type %)

[Bar chart, % OOV types (0% to 10%), for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain (TED), Both domains]

slide-129
SLIDE 129

TED OOVs (Token %)

[Bar chart, % OOV tokens (0% to 3%), for Arabic, German, Farsi, Korean, Russian, Chinese: General Domain, In Domain (TED), Both domains]

slide-130
SLIDE 130

Patent OOVs (Type Count)

[Bar chart, OOV types (1000 to 6000), for German, Korean, Russian, Chinese: General Domain, In Domain (Patent), Both domains]

slide-131
SLIDE 131

Patent OOVs (Token Count)

[Bar chart, OOV tokens (2000 to 12000), for German, Korean, Russian, Chinese: General Domain, In Domain (Patent), Both domains]

slide-132
SLIDE 132

Patent OOVs (Type %)

[Bar chart, % OOV types (0% to 40%), for German, Korean, Russian, Chinese: General Domain, In Domain (Patent), Both domains]

slide-133
SLIDE 133

Patent OOVs (Token %)

[Bar chart, % OOV tokens (0% to 10%), for German, Korean, Russian, Chinese: General Domain, In Domain (Patent), Both domains]