SLIDE 1

Computational dialectology with machine translation techniques

Yves Scherrer Department of Digital Humanities, University of Helsinki

Mapping Language Variation and Change, Cambridge, 19 March 2019

SLIDE 2

Illustration: http://vas3k.com/blog/machine_translation/

A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…): 2007 · 2012–2013 · 2017–2018

SLIDE 4

Rule-based machine translation: Standard German → Swiss German

SLIDE 5

Language variation in rule-based machine translation

Generative dialectology (Veith 1970, 1982)

  • Transformation rules derive a multitude of dialect systems D_i from a single reference system B:
  • #Töpfer#_B → #Häfner#_D33333−46999

My proposal:

  • D: Swiss German dialects
  • B: Modern High German (“Standard German”)
  • Most practical, but not historically correct
  • Dialects are not represented as discrete numbered entities, but as probability maps

[Map: StdG immer → geng, shown as a probability map]

SLIDE 7

Example rule: Lemma change

{immer} → {immer} | {gäng} | {geng} | {all}

  • Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps
  • Rules implemented with the XFST finite-state toolkit
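
To make the rule concrete, here is a minimal Python sketch (the original system is implemented in XFST, and the probability values below are invented) of how such a lemma rule, weighted by an SDS probability map, could pick a variant at a given survey point:

    # Minimal sketch: choose the output variant of the rule
    # {immer} -> {immer} | {gäng} | {geng} | {all} at a given survey point,
    # using per-point probabilities extracted from a digitized SDS map.
    # All probability values are invented for illustration.

    PROB_MAP = {
        # survey point -> {dialectal variant: probability}
        "Bern":   {"immer": 0.05, "gäng": 0.50, "geng": 0.40, "all": 0.05},
        "Zürich": {"immer": 0.80, "gäng": 0.05, "geng": 0.05, "all": 0.10},
    }

    def translate_lemma(lemma: str, point: str) -> str:
        """Return the most probable dialectal variant of a StdG lemma."""
        if lemma != "immer":      # this rule only covers one lemma
            return lemma
        variants = PROB_MAP[point]
        return max(variants, key=variants.get)

    print(translate_lemma("immer", "Bern"))    # gäng
    print(translate_lemma("immer", "Zürich"))  # immer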

SLIDE 9

Example: morphological inflection

ADJA [Nom | Acc] Sg Gender Degree Weak → ∅ | i

  • schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi

SLIDE 10

Example: phonological adaptation

Vowel (n d) Vowel → n d | n g | n n | n

  • gestanden → gschtande | gschtange | gschtanne | gschtane

SLIDE 11

Implementation

Finite-state toolkits do not provide functionality for the direct integration of probability maps. We simulate this ability with flag diacritics.

ADJA [Nom | Acc] Sg Gender Degree Weak → ∅ | i

    define adj-2-fl [ ADJA [Nom | Acc] Sg Gender Degree Weak ->
        [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];
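
For readers unfamiliar with flag diacritics: a mark such as @U.3-254.i@ forces all rules applied along one path to agree on the same value for feature 3-254 (here, a reading of one SDS map), so that each output string corresponds to one consistent dialect choice. A rough Python analogue of this unification check, purely for illustration:

    # Rough analogue of XFST unification flags @U.3-254.null@ / @U.3-254.i@:
    # the first rule on a path fixes the value of feature "3-254"; any later
    # rule must agree with it, otherwise the path is discarded.

    def unify(flags: dict, feature: str, value: str) -> bool:
        """Set the feature value, or check consistency if already set."""
        if flags.get(feature, value) != value:
            return False          # contradictory choices: the path fails
        flags[feature] = value
        return True

    flags = {}
    print(unify(flags, "3-254", "i"))     # True: this path picks the -i variant
    print(unify(flags, "3-254", "null"))  # False: a later rule cannot pick null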

SLIDE 12

Conclusions

  • Difficult to achieve good coverage
  • Dialectologically interesting features vs. features relevant for practical usage
  • Difficult to evaluate on “real” data due to the lack of unified writing conventions
  • The digitized maps turned out to be more useful than the rule set
  • Veith’s claim that the ordering of rules mirrors their order of historical appearance is difficult to verify in practice

SLIDE 15

Rule-based machine translation: Standard German → Swiss German

References

  • K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
  • R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
  • Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology – Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
  • Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
  • W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
  • W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.

Digitized SDS maps: http://www.dialektkarten.ch

SLIDE 16

Character-level statistical machine translation: Normalization

SLIDE 19

The data: The ArchiMob corpus

ArchiMob was an oral history project focusing on testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, of both genders and different backgrounds, were interviewed (1999–2001). 43 Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.

SLIDE 20

The task: Normalization

There is a lot of variation in the transcriptions:

  • Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
  • Dialectal variation: different origins of informants
  • Intra-speaker variation

Goals:

  • Create an additional annotation layer to establish identities between forms that are felt to be “the same word”
  • Enable dialect-independent corpus search
  • Facilitate further annotation (e.g. part-of-speech tagging)

SLIDE 22

The task: Normalization

Our normalization language is similar but not identical to Standard German:

Dialect:    jaa de het me no gluegt tänkt dasch ez de genneraal
Normalized: ja dann hat man noch gelugt gedacht das ist jetzt der general

Six documents were normalized manually by our transcribers (30–60 hours per document).

  • Can we use these six documents as training data to normalize the remaining 37 automatically?
  • “Machine translation” from transcribed Swiss German to the normalization language

SLIDE 28

The model: Character-level SMT (CSMT)

Standard SMT systems operate at the word level: they identify sequences of contiguous words (“phrases”) and their translations in a parallel corpus. Character-level SMT systems have been proposed for closely related languages (Vilar et al. 2007, Tiedemann 2009); they identify sequences of contiguous characters:

    _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
    _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t
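
As a minimal sketch (function name mine; this assumes a Moses-style CSMT pipeline), the conversion to character level simply turns every character into a token and marks word boundaries with "_":

    # Sketch of the character-level conversion for CSMT training: every
    # character becomes a token, word boundaries are marked with "_".

    def to_char_level(sentence: str) -> str:
        """'jaa de het' -> '_ j a a _ d e _ h e t'"""
        return " ".join("_" if c == " " else c for c in "_" + sentence)

    print(to_char_level("jaa de het me no gluegt tänkt"))
    # _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
    print(to_char_level("ja dann hat man noch gelugt gedacht"))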

SLIDE 29

The model: Character-level SMT (CSMT)

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • Estimation: 90% of words normalized correctly

    Original          CSMT            Correct
    muurermäischter   maurermeister   maurermeistern
    buechs            buchs           buochs
    riintel           reintal         rheintal
    komfiserii        konfiserei      konfiserie
    kaazèt            kazat           kz
    plimut            pleinmut        plymouth

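The “90% of words normalized correctly” figure is a word-level accuracy; here is a sketch of how such an estimate can be computed against a manually normalized reference (the example words come from the error table above, plus one invented correct case):

    # Sketch of a word-level accuracy computation over aligned word pairs.

    def word_accuracy(system: list, reference: list) -> float:
        correct = sum(s == r for s, r in zip(system, reference))
        return correct / len(reference)

    system    = ["maurermeister", "buchs", "reintal", "und"]
    reference = ["maurermeistern", "buochs", "rheintal", "und"]
    print(word_accuracy(system, reference))  # 0.25 (the table lists errors only)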

SLIDE 32

The model: Character-level SMT (CSMT)

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • 3. Train a distinct CSMT model for every text
  • What character sequences do these models identify?
  • How do the frequencies of these sequences vary across texts and dialects?

SLIDE 35

The analysis: Corpus-based dialectology

Example: What are the different dialectal realizations of normalized ck, and what are their geographical distributions?

  • Look for p(∗ | ck) in the phrase tables created by the CSMT systems

Document 1048:

    c h ||| c k ||| 0.09615 0.18776 0.00247 0.03999 ||| 0-0 1-1 ||| 52 2028 5
    g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46
    k ||| c k ||| 0.01923 0.10652 0.00820 0.03921 ||| 0-1 ||| 52 122 1

Document 1244:

    g g ||| c k ||| 0.04256 0.00023 0.03922 0.00008 ||| 1-0 0-1 ||| 47 51 2
    g ||| c k ||| 0.04256 0.02805 0.00064 0.00066 ||| 0-1 ||| 47 3126 2
    k ||| c k ||| 0.91489 0.67461 0.07597 0.06274 ||| 0-1 ||| 47 566 43

  • Pick one variant (e.g. gg) and plot the probabilities
  • Compare with the relevant SDS maps

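A sketch of this lookup (assuming the standard Moses phrase-table format, in which the first of the four scores is the inverse phrase translation probability p(source | target); the file name is hypothetical):

    # Sketch of the p(* | ck) lookup in a Moses phrase table. Fields are
    # separated by "|||": source ||| target ||| scores ||| alignment ||| counts.

    def realizations_of(path: str, target: str = "c k") -> dict:
        probs = {}
        with open(path, encoding="utf-8") as table:
            for line in table:
                fields = [f.strip() for f in line.split("|||")]
                if fields[1] == target:
                    probs[fields[0]] = float(fields[2].split()[0])
        return probs

    print(realizations_of("phrase-table.1048"))  # hypothetical file name
    # {'c h': 0.09615, 'g g': 0.88462, 'k': 0.01923}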
SLIDE 40

Dialectal gg ↔ Normalized ck (Teggi ↔ Decke)

[Map: p(gg|ck) per document, legend from 0.0 to 1.0 in steps of 0.2. Green areas: SDS map 2/095 “drücken”, variant gg]
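
A plotting sketch (matplotlib; the coordinates below are invented stand-ins for the informants' places of origin):

    # Sketch of the map plots: place each document at its informant's
    # location and colour it by p(gg|ck). Coordinates and values invented.

    import matplotlib.pyplot as plt

    docs = {          # document -> (longitude, latitude, p(gg|ck))
        "1048": (7.45, 46.95, 0.88),
        "1244": (8.54, 47.37, 0.04),
    }

    xs, ys, ps = zip(*docs.values())
    points = plt.scatter(xs, ys, c=ps, cmap="Greens", vmin=0.0, vmax=1.0, s=200)
    plt.colorbar(points, label="p(gg|ck)")
    plt.title("Dialectal gg for normalized ck")
    plt.show()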


SLIDE 42

Dialectal ui ↔ Normalized au (Muis ↔ Maus)

[Map: p(ui|au) per document, legend from 0.00 to 1.0. Green areas: SDS map 1/106 “Maus”, variant ui]

SLIDE 44

Dialectal u ↔ Normalized ll (Täuer ↔ Teller)

[Map: p(u|ll) per document, legend from 0.0 to 1.0 in steps of 0.2. Green areas: SDS map 2/196 “Teller”, variant u]

SLIDE 46

Dialectal n ↔ Normalized nn (Tane ↔ Tanne)

[Map: p(n|nn) per document, legend from 0.0 to 1.0 in steps of 0.2. Green areas: SDS map 2/186 “Tanne”, variant n]

SLIDE 48

Conclusions: Corpus-based dialectology with CSMT

  • Multi-dialectal corpora are fun to work with, but can be problematic due to transcription inconsistencies
  • Normalization provides comparability
  • Counting the frequency of dialectal u is not enough, because u occurs in many other contexts
  • Cf. part-of-speech tagging for dialect syntax
  • Finer-grained search in phrase tables could be useful
  • Example: V u V ↔ V l l V
  • Automatic procedures for the discovery of interesting features could be useful

SLIDE 52

Corpus-based dialectology with CSMT

References

  • P. Koehn (2010): Statistical Machine Translation. Cambridge: Cambridge University Press.
  • T. Samardžić / Y. Scherrer / E. Glaser (2016): ArchiMob – a corpus of spoken Swiss German. In: Proceedings of LREC 2016. Portorož, 4061–4066.
  • Y. Scherrer / N. Ljubešić (2016): Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Proceedings of KONVENS 2016 (Bochumer Linguistische Arbeitsberichte). Bochum, 248–255.
  • Y. Scherrer / T. Samardžić / E. Glaser (to appear): Digitising Swiss German – How to process and study a polycentric spoken language. In: Language Resources and Evaluation.
  • J. Tiedemann (2009): Character-based PSMT for closely related languages. In: Proceedings of the 13th Conference of the European Association for Machine Translation (EAMT 2009). Barcelona, 12–19.
  • D. Vilar / J.-T. Peter / H. Ney (2007): Can we translate letters? In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, 33–39.

ArchiMob corpus: https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
CSMTiser: https://github.com/clarinsi/csmtiser

SLIDE 53

Multi-dialectal neural machine translation: Dialect embeddings

SLIDE 54

Neural machine translation (NMT)

NMT uses deep neural networks to transform sequences of the source language into sequences of the target language:

Illustration: http://vas3k.com/blog/machine_translation/

  • NMT has almost entirely replaced SMT in “common” machine translation tasks
  • Can NMT also be used in our character-level normalization setting?

SLIDE 56

Character-level NMT with dialect embeddings

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • 3. Train a single CNMT model for all texts, adding a text source label at the beginning of each utterance:

    <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
           _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t

  • CNMT models don’t use “windows”, so the labels remain visible until the end of the sentence
  • The model may learn to condition some transformations on the label
  • The model may infer that some labels behave similarly
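
The source-side data preparation for step 3 amounts to prepending a pseudo-character; a minimal sketch (function name mine):

    # Sketch: prepend the text source label to a character-level sequence.

    def add_label(doc_id: str, char_seq: str) -> str:
        return "<" + doc_id + "> " + char_seq

    src = "_ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t"
    print(add_label("1007", src))
    # <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t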


SLIDE 58

Character-level NMT with dialect embeddings

  • NMT models produce embeddings of their input and output tokens
  • Embeddings are just vectors of real numbers

Illustration: https://www.sdl.com/ilp/language/neural-machine-translation.html

SLIDE 59

Character-level NMT with dialect embeddings

  • NMT models produce embeddings of their input and output tokens
  • Standard setting: word embeddings
  • Character-level setting: character embeddings
  • The text source labels are perceived by the model as “special characters” and receive their own embeddings
  • Embeddings are just vectors of real numbers (500 in our case)
  • Apply a dimensionality reduction method (MDS, PCA, t-SNE, …) and plot the results
  • We only look at the label embeddings for now
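
A sketch of this analysis (assuming the 500-dimensional label embeddings have been exported from the trained model as a matrix with one row per label; random vectors and one document id are invented stand-ins):

    # Sketch: reduce the label embeddings to three principal components.

    import numpy as np
    from sklearn.decomposition import PCA

    labels = ["1007", "1048", "1244", "1300"]      # "1300" is invented
    embeddings = np.random.rand(len(labels), 500)  # stand-in for real vectors

    components = PCA(n_components=3).fit_transform(embeddings)
    for label, (c1, c2, c3) in zip(labels, components):
        print(label, round(c1, 2), round(c2, 2), round(c3, 2))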

SLIDE 60

Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 1/3]

SLIDE 61

Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 1/3, with transcriber initials (A, M, P, S, Z)]

SLIDE 62

Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 2/3]

SLIDE 63

Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 3/3]

SLIDE 64

Character-level NMT: Conclusions

  • The model learns that the normalization depends on:
  • The transcriber
  • The geographic origin of the text
  • Open questions:
  • Not all NMT algorithms and dimensionality reduction algorithms work equally well
  • What is the overall normalization quality of NMT?
  • For which types of transformations does the model “look” at the dialect label?

SLIDE 66

Character-level NMT with dialect embeddings

References

  • D. Bahdanau / K. Cho / Y. Bengio (2015): Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR 2015.
  • G. Klein / Y. Kim / Y. Deng / J. Senellart / A. M. Rush (2017): OpenNMT: Open-source toolkit for neural machine translation. In: arXiv preprint arXiv:1701.02810. http://opennmt.net/
  • L. J. P. van der Maaten / G. E. Hinton (2008): Visualizing high-dimensional data using t-SNE. In: Journal of Machine Learning Research 9, 2579–2605.
  • A. Vaswani / N. Shazeer / N. Parmar / J. Uszkoreit / L. Jones / A. N. Gomez / Ł. Kaiser / I. Polosukhin (2017): Attention is all you need. In: Advances in Neural Information Processing Systems, 5998–6008.
  • R. Östling / J. Tiedemann (2017): Continuous multilinguality with language vectors. In: Proceedings of EACL 2017, 644–649.

SLIDE 67

Conclusions

  • Rule-based MT
  • Knowledge-driven, i.e. dialect atlas-driven
  • From standard to dialect
  • Maps are prerequisites
  • Statistical and neural MT
  • Data-driven, i.e. dialect corpus driven
  • From dialect to “standard”
  • Maps result from model training
  • Neural MT
  • Emerging properties of dialect texts

