SLIDE 1

What Kind of Language Is Hard to Language-Model?

ACL 2019 Sabrina J. Mielke and Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner

Johns Hopkins University // City University of New York Graduate Center // Google

sjmielke@jhu.edu

Twitter: @sjmielke – paper and thread pinned!

SLIDE 11

Questions and answers

  • 0. Do current language models do equally well on all languages? No.
  • 1. Which one do they struggle more with: German or English? German.
  • 2. What about non-Indo-European languages, say Chinese? It depends.
  • 3. What makes a language harder to model? Actually, rather technical factors.
  • 4. Is Translationese easier? It’s different, but not actually easier!

SLIDE 15

Outline

  • “Difficulty”
  • Models and languages
  • What correlates with difficulty?
  • And... is Translationese really easier?

SLIDE 19

How to measure “difficulty”?

Language models measure surprisal/information content (NLL; −log p(·)): p(·) ⇒ NLL

en: I love Florence!  0.03 ⇒ 5 bits
de: Ich grüße meine Oma und die Familie daheim. (“I greet my grandma and the family at home.”)  0.008 ⇒ 7 bits
nl: Alle mensen worden vrij en gelijk in waardigheid en rechten geboren. (“All human beings are born free and equal in dignity and rights.”)  0.0004 ⇒ 11 bits

Issue 1: Different topics/styles/content
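To make the unit concrete, a minimal sketch (the probabilities are the slide’s illustrative values, not outputs of a real LM):

```python
import math

def surprisal_bits(p: float) -> float:
    """Surprisal / information content in bits: NLL = -log2 p."""
    return -math.log2(p)

# Illustrative sentence probabilities from the slide, not from a real model
for sentence, p in [("I love Florence!", 0.03),
                    ("Ich gruesse meine Oma und die Familie daheim.", 0.008),
                    ("Alle mensen worden vrij en gelijk ... geboren.", 0.0004)]:
    print(f"{surprisal_bits(p):5.1f} bits  {sentence}")
```

This prints roughly 5.1, 7.0, and 11.3 bits, matching the slide’s rounded values.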

SLIDE 24

How to measure “difficulty”?

Language models measure surprisal/information content (NLL; −log p(·)): p(·) ⇒ NLL

en: Resumption of the session.  0.013 ⇒ 6.3 bits
de: Wiederaufnahme der Sitzung.  0.011 ⇒ 6.5 bits
nl: Hervatting van de sessie.  0.012 ⇒ 6.4 bits

Issue 1: Different topics/styles/content
Solution: train and test on translations!
  • Europarl: 21 languages share ~40M chars
  • Bibles: 62 languages share ~4M chars
  • in total: 69 languages from 13 language families
  • (selecting the fully parallel Bible subset takes a big ILP to solve, which is really fun: Gurobi; see the sketch below)

Issue 2: Comparing scores
Solution: use total bits of an open-vocabulary model. Why?
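The data-selection ILP itself isn’t shown, but picking a maximal fully parallel (language, verse) grid is a maximum-biclique-style problem, so here is a guess at its flavor. The variables and constraints are my own formulation, not the authors’; the paper reportedly solved a far larger instance with Gurobi, while this toy uses pulp’s bundled CBC solver:

```python
import pulp

# Hypothetical availability: (lang, verse) pairs that exist in each Bible
langs, verses = ["deu", "eng", "fra"], range(6)
avail = {("deu", v) for v in [0, 1, 2, 3]} | \
        {("eng", v) for v in [0, 1, 2, 3, 4, 5]} | \
        {("fra", v) for v in [1, 2, 3, 5]}

prob = pulp.LpProblem("parallel_subset", pulp.LpMaximize)
x = {l: pulp.LpVariable(f"x_{l}", cat="Binary") for l in langs}   # keep language l?
y = {v: pulp.LpVariable(f"y_{v}", cat="Binary") for v in verses}  # keep verse v?
z = {(l, v): pulp.LpVariable(f"z_{l}_{v}", cat="Binary")
     for l in langs for v in verses}                              # cell kept?

prob += pulp.lpSum(z.values())  # maximize the number of kept (language, verse) cells
for l in langs:
    for v in verses:
        prob += z[(l, v)] <= x[l]
        prob += z[(l, v)] <= y[v]
        if (l, v) not in avail:
            prob += x[l] + y[v] <= 1  # a missing cell forbids keeping both

prob.solve(pulp.PULP_CBC_CMD(msg=False))
# Expect all three languages with verses 1, 2, 3 kept (a 3x3 = 9-cell grid)
print([l for l in langs if x[l].value() == 1],
      [v for v in verses if y[v].value() == 1])
```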

SLIDE 30

How to compare your language models across languages

  • 1. We need to be open-vocabulary – no UNKs.
    Every UNK is “cheating” – morphologically rich languages have more UNKs, unfairly advantaging them.
  • 2. We can’t normalize per word or even per character in languages individually.
    Example: if Czech “puč” and German “Putsch” are equally likely, they should be equally “difficult.”

⇒ just use overall bits (i.e., surprisal/NLL) of an aligned sentence

[note: the total is easily obtained from BPC or perplexity by multiplying by the total number of chars/words; see the helper below]
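A small helper for the conversions the note refers to (function names are mine):

```python
import math

def total_bits_from_bpc(bpc: float, n_chars: int) -> float:
    """Bits per character times character count gives total bits."""
    return bpc * n_chars

def total_bits_from_perplexity(ppl: float, n_words: int) -> float:
    """Word-level perplexity is 2**(bits per word), so invert and scale."""
    return math.log2(ppl) * n_words

print(total_bits_from_bpc(1.2, 1000))         # 1200.0 total bits
print(total_bits_from_perplexity(64.0, 200))  # 6 bits/word * 200 words = 1200.0
```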

SLIDE 33

How to aggregate multiple intents’ surprisals into “difficulties”?

For fully parallel corpora... we can just sum everything up and compare – that is fair.

[Figure: an aligned multi-text. Four intents (rows 1 to 4, e.g. “Resumption of the session” / “Wiederaufnahme der ...” / “Възобновяване на ...”) appear in en, de, and bg (columns). A language model scores every cell, yielding surprisals/NLLs y_{1,en} ... y_{4,bg}; summing each column gives one total per language: en, de, bg.]
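In the fully parallel case the aggregation really is just a sum; a minimal sketch with made-up surprisal values:

```python
# Surprisals in bits for four aligned intents per language (made-up numbers)
surprisals = {
    "en": [6.3, 41.0, 52.5, 38.1],
    "de": [6.5, 43.2, 55.0, 40.3],
    "bg": [6.4, 44.1, 54.2, 39.8],
}

# Every language saw the same intents, so the totals are directly comparable
totals = {lang: sum(ys) for lang, ys in surprisals.items()}
print(totals)  # en ~137.9, de ~145.0, bg ~144.5
```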

SLIDE 40

How to aggregate multiple intents’ surprisals into “difficulties”?

But what if there’s missing data? Or we want robustness?

[Figure: the same aligned multi-text, now with some (intent, language) cells missing, so the LM yields surprisals/NLLs y_{i,j} only for the observed cells. Each intent i gets a latent “size” n_i and each language j a latent difficulty d_j, so that e.g. y_{2,de} ∼ n_2 · exp(d_de) · (log-normal noise).]

This is a probabilistic model we can perform inference in!

Not quite, though: our actual model is heteroscedastic,

  y_{i,j} = n_i · exp(d_j) · exp(ε_{i,j}),   σ_i² = ln(1 + (exp(σ²) − 1) / n_i),   ε_{i,j} ∼ N((σ² − σ_i²) / 2, σ_i²)

Image CC-BY Mike Grauer Jr / flickr
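To make the aggregation concrete, a minimal inference sketch: my own alternating least-squares fit of the homoscedastic simplification log y_{i,j} ≈ log n_i + d_j on made-up surprisals with missing cells (the paper’s heteroscedastic model and estimator are more involved):

```python
import numpy as np

# Made-up surprisal totals y[i, j] in bits; NaN marks a missing translation
y = np.array([[  6.3,    6.5, np.nan],
              [ 41.0,   43.2,   44.1],
              [np.nan,  55.0,   54.2],
              [ 38.1, np.nan,   39.8]])
langs = ["en", "de", "bg"]

mask = ~np.isnan(y)
log_y = np.log(np.where(mask, y, 1.0))  # the 1.0 placeholder is never read

# Alternate closed-form updates for log n_i (intent sizes) and d_j (difficulties)
log_n = np.zeros(y.shape[0])
d = np.zeros(y.shape[1])
for _ in range(100):
    log_n = np.array([(log_y[i, mask[i]] - d[mask[i]]).mean()
                      for i in range(y.shape[0])])
    d = np.array([(log_y[mask[:, j], j] - log_n[mask[:, j]]).mean()
                  for j in range(y.shape[1])])
    d -= d.mean()  # difficulties are only identified up to a shift; center them

print(dict(zip(langs, d.round(3))))  # relative difficulty per language
```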

SLIDE 41

Outline

  • “Difficulty”
  • Models and languages
  • What correlates with difficulty?
  • And... is Translationese really easier?

SLIDE 44

Good open-vocabulary language models (Mielke and Eisner, 2019)

Formerly state-of-the-art-ish AWD-LSTM (Merity et al., 2018) language models:

char-RNNLM (one RNN step per character):
  t  h  e  _  c  a  t  _  c  h  a  s  e  d

BPE-RNNLM, few merges (one RNN step per subword; @@ marks a word-internal split):
  the  ca@@  t  cha@@  sed

BPE-RNNLM, many merges:
  the  cat  cha@@  sed
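For readers unfamiliar with BPE, a toy sketch of how merges are learned (my own minimal implementation, not the authors’ pipeline):

```python
from collections import Counter

def learn_bpe(words: list[str], n_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(n_merges):
        pairs = Counter((a, b) for s in seqs for a, b in zip(s, s[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        seqs = [merge_seq(s, best) for s in seqs]
    return merges

def merge_seq(s: list[str], pair: tuple[str, str]) -> list[str]:
    """Apply one merge to a symbol sequence, left to right."""
    out, i = [], 0
    while i < len(s):
        if i + 1 < len(s) and (s[i], s[i + 1]) == pair:
            out.append(s[i] + s[i + 1])
            i += 2
        else:
            out.append(s[i])
            i += 1
    return out

# Few merges give near-character units; many merges give near-word units.
print(learn_bpe(["the", "cat", "chased", "the", "cat"], 4))
```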

SLIDE 48

Choosing the number of BPE merges: how many is best?

It depends on the language:

[Plot: total surprisal per Europarl language as a function of the number of BPE merges, expressed as a ratio of the vocabulary size (x-axis 0.2 to 1.0; lower is better), with one line per language plus the average.]

Is this one (0.4) going to be fine? Yeah: it doesn’t matter that much.

[Plot: chars vs. BPE (0.4) vs. BPE (best) across bg cs da de el en es et fi fr hu it lt lv nl pl pt ro sk sl sv; the differences are small.]
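Continuing the toy learn_bpe sketch from the previous slide (an illustration of how a merge budget expressed as a ratio of the vocabulary size changes granularity, not the paper’s pipeline):

```python
# Requires learn_bpe from the sketch above
words = ["the", "cat", "chased", "the", "cat", "that", "chewed"]
vocab_size = len(set(words))  # 5 word types
for ratio in (0.2, 0.4, 0.8):
    n_merges = max(1, int(ratio * vocab_size))
    print(f"ratio {ratio}: {n_merges} merges ->", learn_bpe(words, n_merges))
```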

SLIDE 56

Difficulties for char-/BPE-RNNLM: 21 Europarl languages and 106 Bibles

[Two scatter plots of per-language difficulty (×100): x-axis, BPE-RNNLM with 0.4|V| merges; y-axis, char-RNNLM; quadrants labeled “easier with BPE” / “easier with chars” and “harder” / “easier”. Left, “Difficulties on Europarl”: the 21 Europarl languages, with en at the easy end and de at the hard end. Right, “Difficulties on Bibles”: the 106 Bibles, where the languages shared with Europarl (bul, ces, dan, deu, ell, eng, fin, fra, hun, ita, lit, nld, por, ron) land in similar relative positions and cmn stands apart.]

SLIDE 57

Outline

  • “Difficulty”
  • Models and languages
  • What correlates with difficulty?
  • And... is Translationese really easier?

SLIDE 59

How about: morphological counting complexity (Sagot, 2013)

[Two scatter plots: morphological counting complexity (MCC, x-axis, roughly 50 to 200) against difficulty (×100) for the 21 Europarl languages; one panel for the BPE-RNNLM, one for the char-RNNLM.]

...not particularly striking. Perhaps Finnish was an outlier in Cotterell et al. (2018)?

SLIDE 68

Other linguistically motivated regressors

  • WALS: “Prefixing vs. Suffixing [...] Morphology” (for languages where present)? ...no visible differences.
  • WALS: “Order of Subject, Object and Verb” (for languages where present)? ...no visible differences.
  • Head-POS Entropy (Dehouck and Denis, 2018)? ...neither mean nor skew shows correlation.
  • Average dependency length (computed using UDPipe; Straka et al., 2016)? ...correlation! But not significant after correcting for multiple hypotheses.

This is disappointing.

SLIDE 71

Very simple heuristics are very predictive

Raw sequence length (# of predictions) → char-RNNLM difficulty. Significant on:

  • Europarl at p < .01
  • Bibles at p < .001

i.e., for the char-RNNLM, Czech “puč” is easier than German “Putsch”!

Raw vocabulary size → BPE-RNNLM difficulty. Significant on:

  • not Europarl
  • but Bibles at p < .00000000001

i.e., the BPE-RNNLM still suffers if a language has a high type-token ratio!

Wow! What is happening here? We have many conjectures... (the flavor of these significance tests is sketched below)
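A sketch of the kind of significance test involved (made-up regressor and difficulty values; a plain linear-regression p-value with a Bonferroni-style correction, which may differ from the authors’ exact procedure):

```python
import numpy as np
from scipy import stats

# Hypothetical per-language values: a candidate regressor vs. fitted difficulty
regressor  = np.array([1.02, 0.97, 1.10, 0.97, 1.05, 0.97, 1.01, 1.08])
difficulty = np.array([0.03, -0.02, 0.08, -0.04, 0.04, -0.01, 0.02, 0.06])

res = stats.linregress(regressor, difficulty)
n_hypotheses = 8  # number of regressors tried; correct for multiple testing
print(f"slope={res.slope:.3f}, raw p={res.pvalue:.4f}, "
      f"Bonferroni p={min(1.0, res.pvalue * n_hypotheses):.4f}")
```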

SLIDE 72

Outline

  • “Difficulty”
  • Models and languages
  • What correlates with difficulty?
  • And... is Translationese really easier?

SLIDE 76

Translationese: translations as a separate language?

Common assumption: Translationese is somehow simpler than “native” text.

We have partial parallel data that we can use to evaluate our models:

[Table: columns en_original, en_translated, de_original, de_translated, nl_original, nl_translated. Each row is one intent (“Resumption...”, “The German...”, “Thank you...”), authored natively in one language and translated into the others, so each row fills one “original” cell and the other languages’ “translated” cells.]

...and indeed the original languages seem harder. But we missed something!
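A sketch of the comparison being described (made-up per-sentence NLLs, tagged by provenance):

```python
# Made-up test-set NLLs (bits) for English sentences, tagged by whether the
# English side was natively authored or translated from another language
test_nlls = [(203.1, "original"), (188.7, "translated"), (190.2, "translated"),
             (211.5, "original"), (185.0, "translated")]

by_tag = {}
for nll, tag in test_nlls:
    by_tag.setdefault(tag, []).append(nll)
for tag, vals in by_tag.items():
    print(tag, round(sum(vals) / len(vals), 1))
# "original" scores higher here, i.e. seems harder... but see the next slide.
```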

SLIDE 77

We trained on mostly translationese!

[Bar chart: percentage of native (non-translated) sentences per language (y-axis ticks at 5, 10, 15, 20%), for the 21 Europarl languages sorted by absolute number of native sentences: en fr de es nl it pt sv el fi pl da ro hu sk cs sl lt bg et lv.]

Of course we will then find it easier...

SLIDE 79

Repeat the experiment with fairly balanced training data

Change the training sets! We can rebalance a single language, leaving the others merged, i.e., split en into en_original and en_translated while keeping de and nl whole (a toy version is sketched below):

[Table: columns en_original, en_translated, de, nl, over the same intents as before.]

And the result: the difficulties are now the same!

(more precisely, “native” is 0.004 ± 0.02 easier)
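A sketch of the rebalancing idea (hypothetical provenance tags; the point is just to give the split language equal-sized original and translated training sets):

```python
import random

random.seed(0)
# Hypothetical English training sentences with provenance tags (~10% native,
# roughly matching the Europarl proportions shown two slides ago)
en_train = [(f"sentence {i}", "original" if i % 10 == 0 else "translated")
            for i in range(1000)]

orig = [s for s, tag in en_train if tag == "original"]
trans = [s for s, tag in en_train if tag == "translated"]
n = min(len(orig), len(trans))  # the scarcer class sets the budget (100 here)
en_original_set = random.sample(orig, n)     # train one model on these...
en_translated_set = random.sample(trans, n)  # ...and a comparable one on these
```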

SLIDE 84

Conclusion: cross-linguistic comparisons are tricky (hope we didn’t mess up!)

  • 1. Make sure your training data is comparable and fair.
  • 2. Make sure your metrics are comparable and fair.
  • 3. Make sure your stats are fair (no p-hacking!).
  • 4. Work on more NLP resources for more languages!


SLIDE 85

What Kind of Language Is Hard to Language-Model?

ACL 2019 Sabrina J. Mielke and Ryan Cotterell, Kyle Gorman, Brian Roark, Jason Eisner

Johns Hopkins University // City University of New York Graduate Center // Google

sjmielke@jhu.edu

Twitter: @sjmielke – paper and thread pinned!