Computational dialectology with machine translation techniques

SLIDE 1

Computational dialectology with machine translation techniques

Yves Scherrer
Department of Digital Humanities, University of Helsinki

Linguistics Research Seminar, University of Gothenburg, 12 November 2019

SLIDE 2

A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…):

  • 2007: RBMT
  • 2012–2013: SMT
  • 2017–2018: NMT

Illustration: http://vas3k.com/blog/machine_translation/

SLIDE 4

Object of study: Swiss German dialects

SLIDE 5

Rule-based machine translation: Standard German → Swiss German

SLIDE 6

Language variation in rule-based machine translation

Generative dialectology (Veith 1970, 1982)

  • Transformation rules derive a multitude of dialect systems D_i from a single reference system B:
  • #Töpfer#_B → #Häfner#_D33333−46999

My proposal:

  • D: Swiss German dialects
  • B: Modern High German (“Standard German”)
  • Most practical, but not historically correct
  • Dialects are not represented as discrete numbered entities, but as probability maps
  • Example: StdG immer → geng, with the rule’s geographical validity given by a probability map

SLIDE 8

Example rule: Lemma change

{immer} → {immer} | {gäng} | {geng} | {all} | …

  • Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps
  • Rules implemented with the XFST finite-state toolkit
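
To make the mechanism concrete, here is a minimal Python sketch of how such a lemma-change rule can be combined with probability maps. This is an illustration only: the actual system was implemented in XFST, and the locations and probabilities below are invented.

    # Hypothetical probability maps: variant -> {survey location: probability}.
    PROB_MAPS = {
        "immer": {"Zürich": 0.6, "Bern": 0.1},
        "gäng":  {"Zürich": 0.1, "Bern": 0.2},
        "geng":  {"Zürich": 0.1, "Bern": 0.6},
        "all":   {"Zürich": 0.2, "Bern": 0.1},
    }

    def dialect_variants(lemma, location):
        """Rank the dialectal variants of a Standard German lemma by the
        probability their maps assign to the given location."""
        if lemma != "immer":                 # this sketch covers one rule only
            return [(lemma, 1.0)]
        scored = [(v, m.get(location, 0.0)) for v, m in PROB_MAPS.items()]
        return sorted(scored, key=lambda vp: vp[1], reverse=True)

    print(dialect_variants("immer", "Bern"))
    # [('geng', 0.6), ('gäng', 0.2), ('immer', 0.1), ('all', 0.1)]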

SLIDE 10

Example: morphological inflection

ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i

  • schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi

SLIDE 11

Example: phonological adaptation

Vowel (n d) Vowel → n d | n g | n n | n

  • gestanden → gschtande | gschtange | gschtanne | gschtane
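
The same kind of rule is easy to picture procedurally. Below is a small regular-expression sketch (an illustration, not the XFST implementation) that enumerates the candidate outputs of the intervocalic nd rule; which candidate holds where would again be decided by a probability map:

    import re

    VOWELS = "aeiouäöü"
    # Intervocalic 'nd': preceded and followed by a vowel.
    ND = re.compile(f"(?<=[{VOWELS}])nd(?=[{VOWELS}])")

    def nd_variants(word):
        """Enumerate the dialectal candidates for intervocalic 'nd'."""
        return [ND.sub(repl, word) for repl in ("nd", "ng", "nn", "n")]

    print(nd_variants("gschtande"))
    # ['gschtande', 'gschtange', 'gschtanne', 'gschtane']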

SLIDE 12

Implementation

Finite-state toolkits do not provide functionality for direct integration of probability maps. We simulate this ability with flag diacritics.

ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i

    define adj-2-fl [ ADJA [Nom | Acc] Sg Gender Degree Weak ->
                      [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];

  • The flag diacritics record which variant of map 3-254 was chosen, so that all rules conditioned on the same map make mutually consistent choices

SLIDE 13

Conclusions

  • Difficult to achieve good coverage
  • Dialectologically interesting features vs. features relevant for practical usage
  • Difficult to evaluate on “real” data due to lack of unified writing conventions
  • Veith’s claim that the ordering of rules mirrors their order of historical appearance could not be verified in practice
  • The digitized maps turned out to be more useful than the rule set:
    • Dialectometrical analyses
    • Online map viewer

SLIDE 17

Rule-based machine translation: Standard German → Swiss German

References

  • K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
  • R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
  • Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology – Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
  • Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
  • W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
  • W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.

Digitized SDS maps: http://www.dialektkarten.ch

SLIDE 18

Character-level statistical machine translation: Normalization

SLIDE 19

The data: The ArchiMob corpus

ArchiMob was an oral history project collecting testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, both genders and different backgrounds were interviewed (1999–2001). 43 Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.

SLIDE 22

The task: Normalization

There is a lot of variation in the transcriptions:

  • Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
  • Dialectal variation: different origins of informants
  • Intra-speaker variation

Goals:

  • Create an additional annotation layer that establishes identities between forms that are felt to be “the same word”
  • Enable dialect-independent corpus search
  • Facilitate further annotation (e.g. part-of-speech tagging)

SLIDE 24

The task: Normalization

Normalization of historical texts (modernization): [French]

  Ce ſeroit une marque de la force de voſtre merite pluſtoſt que de ma facilité.
  Ce serait une marque de la force de votre mérite plutôt que de ma facilité.

Normalization of user-generated content: [Dutch]

  schaaaat, je et em nii nodig wie jou laat gaan is gwn DOM :p Iloveyouuuu
  schat, je hebt hem niet nodig wie jou laat gaan is gewoon dom :p I love you

Normalization of dialectal texts: [German]

  jaa de het me no gluegt tänkt dasch ez de genneraal
  ja dann hat man noch gelugt gedacht das ist jetzt der general

  • Our normalization language is similar but not identical to Standard German

SLIDE 27

The task: Normalization

Six documents of the ArchiMob corpus were normalized manually by our transcribers (30-60 hours/document). Can we use these six documents as training data to normalize the remaining 37 automatically?

  • “Machine translation” from transcribed Swiss German to the normalization language

SLIDE 28

The model: Character-level SMT (CSMT)

Standard SMT systems operate at the word level: they identify sequences of contiguous words (“phrases”) and their translations in a parallel corpus. Character-level SMT systems have been proposed for closely related languages (Vilar et al. 2007, Tiedemann 2009); they identify sequences of contiguous characters instead:

  _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
  _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _
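
As a concrete illustration of the preprocessing, the following Python sketch (assumed here; a phrase-based toolkit such as Moses would then be trained on its output) rewrites a sentence into the character-level representation, with '_' marking word boundaries:

    def to_char_level(sentence):
        """Mark word boundaries with '_' and separate all characters by
        spaces, so that an SMT toolkit treats characters as 'words'."""
        marked = "_" + sentence.replace(" ", "_") + "_"
        return " ".join(marked)

    print(to_char_level("jaa de het me no gluegt tänkt"))
    # _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _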

SLIDE 31

The model: Character-level SMT (CSMT)

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • Estimate: 90% of words normalized correctly

Typical remaining errors:

  Original         CSMT           Correct
  muurermäischter  maurermeister  maurermeistern
  buechs           buchs          buochs
  riintel          reintal        rheintal
  komfiserii       konfiserei     konfiserie
  kaazèt           kazat          kz
  plimut           pleinmut       plymouth
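
For reference, the accuracy estimate amounts to a simple word-level comparison against a manually normalized sample; a sketch (assumed, with toy data):

    def word_accuracy(hypothesis, reference):
        """Fraction of words whose automatic normalization matches the
        manual one (both sequences aligned word by word)."""
        assert len(hypothesis) == len(reference)
        hits = sum(h == r for h, r in zip(hypothesis, reference))
        return hits / len(reference)

    print(word_accuracy(["maurermeister", "buchs"],
                        ["maurermeistern", "buochs"]))   # 0.0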

SLIDE 34

The model: Character-level SMT (CSMT)

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • 3. Train a distinct CSMT model for every text
  • What character sequences do these models identify?
  • How do the frequencies of these sequences vary across texts and dialects?

SLIDE 37

The analysis: Corpus-based dialectology

Example: What are the different dialectal realizations of normalized ck /kʰ/ and what are their geographical distributions?

  • Look for p(∗ | ck) in the phrase tables created by the CSMT systems

Document 1048:

  c h ||| c k ||| 0.09615 0.18776 0.00247 0.03999 ||| 0-0 1-1 ||| 52 2028 5
  g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46
  k ||| c k ||| 0.01923 0.10652 0.00820 0.03921 ||| 0-1 ||| 52 122 1

Document 1244:

  g g ||| c k ||| 0.04256 0.00023 0.03922 0.00008 ||| 1-0 0-1 ||| 47 51 2
  g ||| c k ||| 0.04256 0.02805 0.00064 0.00066 ||| 0-1 ||| 47 3126 2
  k ||| c k ||| 0.91489 0.67461 0.07597 0.06274 ||| 0-1 ||| 47 566 43

  • Pick one variant (e.g. gg) and plot the probabilities
  • Compare with relevant maps from the dialect atlas SDS
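
A short Python sketch of this lookup (assumed; the file path is hypothetical, and the layout follows the usual Moses phrase-table format, where the first score field is the inverse phrase probability p(source|target)):

    def realizations(phrase_table_path, normalized):
        """Collect p(dialectal sequence | normalized sequence) from a
        Moses-style phrase table with lines of the form
        source ||| target ||| scores ||| alignment ||| counts."""
        probs = {}
        with open(phrase_table_path, encoding="utf-8") as f:
            for line in f:
                fields = [x.strip() for x in line.split("|||")]
                src, tgt, scores = fields[0], fields[1], fields[2].split()
                if tgt == normalized:
                    probs[src] = float(scores[0])
        return probs

    # e.g. realizations("model-1048/phrase-table", "c k")
    # -> {'c h': 0.09615, 'g g': 0.88462, 'k': 0.01923}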

SLIDE 42

Dialectal gg ↔ Normalized ck (Teggi ↔ Decke)

[Map of Switzerland: p(gg|ck) per text, legend from 0.0 to 1.0. Green areas: SDS map 2/095 “drücken”, variant gg.]

SLIDE 44

Dialectal ui ↔ Normalized au (Muis ↔ Maus)

[Map of Switzerland: p(ui|au) per text, legend from 0.00 to 1.0. Green areas: SDS map 1/106 “Maus”, variant ui.]

SLIDE 46

Dialectal u ↔ Normalized ll (Täuer ↔ Teller)

[Map of Switzerland: p(u|ll) per text, legend from 0.0 to 1.0. Green areas: SDS map 2/196 “Teller”, variant u.]

SLIDE 48

Dialectal n ↔ Normalized nn (Tane ↔ Tanne)

[Map of Switzerland: p(n|nn) per text, legend from 0.0 to 1.0. Green areas: SDS map 2/186 “Tanne”, variant n.]

SLIDE 50

Conclusions: Corpus-based dialectology with CSMT

  • Multi-dialectal corpora are fun to work with, but can be problematic due to transcription inconsistencies
  • Normalization provides comparability
  • Counting the frequency of dialectal u is not enough, because u occurs in many other contexts
  • Cf. part-of-speech tagging for dialect syntax
  • Finer-grained search in phrase tables could be useful
  • Example: V u V ↔ V l l V
  • Automatic procedures for the discovery of interesting features could be useful

SLIDE 54

Corpus-based dialectology with CSMT

References

  • P. Koehn (2010): Statistical Machine Translation. Cambridge: Cambridge University Press.
  • T. Samardžić / Y. Scherrer / E. Glaser (2016): ArchiMob – a corpus of spoken Swiss German. In: Proceedings of LREC 2016. Portorož, 4061–4066.
  • Y. Scherrer / N. Ljubešić (2016): Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Proceedings of KONVENS 2016 (Bochumer Linguistische Arbeitsberichte). Bochum, 248–255.
  • Y. Scherrer / T. Samardžić / E. Glaser (2019): Digitising Swiss German – How to process and study a polycentric spoken language. In: Language Resources and Evaluation.
  • J. Tiedemann (2009): Character-based PSMT for closely related languages. In: Proceedings of the 13th Conference of the European Association for Machine Translation (EAMT 2009). Barcelona, 12–19.
  • D. Vilar / J.-T. Peter / H. Ney (2007): Can we translate letters? In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, 33–39.

ArchiMob corpus: https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
CSMTiser: https://github.com/clarinsi/csmtiser

SLIDE 55

Multi-dialectal neural machine translation: Dialect embeddings

SLIDE 56

Neural machine translation (NMT)

NMT uses deep neural networks to transform sequences of the source language to sequences of the target language:

Illustration: http://vas3k.com/blog/machine_translation/

  • NMT has almost entirely replaced SMT in “common” machine translation tasks
  • Can NMT also be used in our character-level normalization setting?

SLIDE 58

Character-level NMT with dialect embeddings

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • 3. Train a single CNMT model for all texts, adding a text source label at the beginning of each utterance (see the sketch after this list):

  <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
  _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _

  • CNMT models don’t use fixed-size windows, so the labels remain visible until the end of the sentence
  • The model can learn to condition some transformations on the label
  • The model can infer that some labels behave similarly
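
A sketch of the data preparation (assumed; a character-level toolkit such as OpenNMT would be trained on such pairs). The only change from the CSMT setting is the pseudo-token identifying the source text:

    def make_example(doc_id, dialect, normalized):
        """Build one (source, target) training pair with a text source label."""
        char = lambda s: " ".join("_" + s.replace(" ", "_") + "_")
        return f"<{doc_id}> {char(dialect)}", char(normalized)

    src, tgt = make_example("1007",
                            "jaa de het me no gluegt tänkt",
                            "ja dann hat man noch gelugt gedacht")
    print(src)   # <1007> _ j a a _ d e _ h e t _ m e _ n o _ ...
    print(tgt)   # _ j a _ d a n n _ h a t _ m a n _ n o c h _ ...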

SLIDE 60

Character-level NMT with dialect embeddings

  • NMT models produce embeddings of their input and output tokens
  • Embeddings are just vectors of real numbers

Illustration: https://www.sdl.com/ilp/language/neural-machine-translation.html

SLIDE 61

Character-level NMT with dialect embeddings

NMT models produce embeddings of their input and output tokens. Embeddings are just vectors of real numbers (500 in our case).

  • Let us just look at the embeddings of the text source labels
  • Let us apply a dimensionality reduction method (MDS, PCA, t-SNE, …) to visualize the results
  • Example: PCA, 3 dimensions (see the sketch below)
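
In code, this analysis step could look as follows (a sketch with scikit-learn; the labels and embedding values are placeholders, and in practice the vectors would be read out of the trained model’s embedding matrix):

    import numpy as np
    from sklearn.decomposition import PCA

    labels = [f"<doc{i}>" for i in range(5)]         # text source labels
    embeddings = np.random.rand(len(labels), 500)    # placeholder vectors

    pca = PCA(n_components=3)
    coords = pca.fit_transform(embeddings)           # shape: (n_labels, 3)

    # Each component can then be plotted on a map of Switzerland and
    # compared with transcriber identity, longitude and latitude.
    for label, (c1, c2, c3) in zip(labels, coords):
        print(label, round(c1, 3), round(c2, 3), round(c3, 3))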

SLIDE 64

Character-level NMT with dialect embeddings

[Map: PCA reduction, component 1/3, with transcriber initials plotted at each text’s location]

Correlation ratio: η = 0.816

SLIDE 66

Character-level NMT with dialect embeddings

[Map: PCA reduction, component 2/3]

Correlation with longitude: Pearson’s r = 0.487, p < 0.001

SLIDE 68

Character-level NMT with dialect embeddings

[Map: PCA reduction, component 3/3]

Correlation with latitude: Pearson’s r = 0.499, p < 0.001
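
The reported correlations are plain Pearson correlations between one PCA coordinate per text and a geographic coordinate of its informant; a sketch with invented values:

    import numpy as np
    from scipy.stats import pearsonr

    component = np.array([-0.8, -0.3, 0.1, 0.4, 0.9])     # one PCA coordinate per text
    latitude  = np.array([46.2, 47.0, 46.8, 47.4, 47.5])  # informants' origins

    r, p = pearsonr(component, latitude)
    print(f"Pearson's r = {r:.3f}, p = {p:.3g}")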

SLIDE 69

Character-level NMT: Conclusions

The model learns that the normalization depends on

  • the transcriber
  • the geographic origin of the text.

Open questions:

  • Not all NMT architectures and dimensionality reduction methods work equally well
  • What is the overall normalization quality of NMT?
  • For which types of transformations does the model “look” at the dialect label?

SLIDE 71

Character-level NMT with dialect embeddings

References

  • D. Bahdanau / K. Cho / Y. Bengio (2015): Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR 2015.
  • G. Klein / Y. Kim / Y. Deng / J. Senellart / A. M. Rush (2017): OpenNMT: Open-source toolkit for neural machine translation. In: arXiv preprint arXiv:1701.02810. http://opennmt.net/
  • L. J. P. van der Maaten / G. E. Hinton (2008): Visualizing high-dimensional data using t-SNE. In: Journal of Machine Learning Research 9, 2579–2605.
  • A. Vaswani / N. Shazeer / N. Parmar / J. Uszkoreit / L. Jones / A. N. Gomez / Ł. Kaiser / I. Polosukhin (2017): Attention is all you need. In: Advances in Neural Information Processing Systems, 5998–6008.
  • R. Östling / J. Tiedemann (2017): Continuous multilinguality with language vectors. In: Proceedings of EACL 2017, 644–649.

SLIDE 72

Conclusions

1. Rule-based MT

  • Generative dialectology: from standard to dialect
  • Knowledge-driven, i.e. dialect atlas-driven
  • Maps are prerequisites
  • Results: evaluation on dialect identification and generation, (in-)validation of Veith’s claims, dialectometrical analyses

2. Statistical and neural MT

  • Normalization: from dialect to “standard”
  • Data-driven, i.e. dialect corpus-driven
  • Maps result from model parameters
  • Results: emerging properties of the normalization process and of dialect texts
