Computational dialectology with machine translation techniques
Yves Scherrer, Department of Digital Humanities, University of Helsinki
Linguistics Research Seminar, University of Gothenburg, 12 November 2019

Illustration: http://vas3k.com/blog/machine_translation/
A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…):
- 2007: RBMT
- 2012–2013: SMT
- 2017–2018: NMT
Object of study: Swiss German dialects
Rule-based machine translation: Standard German → Swiss German
Language variation in rule-based machine translation

Generative dialectology (Veith 1970, 1982):
- Transformation rules derive a multitude of dialect systems Di from a single reference system B:
  #Töpfer#B → #Häfner#D33333−46999

My proposal:
- D: Swiss German dialects
- B: Modern High German (“Standard German”)
  - the most practical choice, though not the historically correct one
- Dialects are not represented as discrete numbered entities, but as probability maps
  - Example: StdG immer → geng (the geographical distribution of each variant is a probability map)
Example rule: Lemma change

{immer} → {immer} | {gäng} | {geng} | {all} | …

- Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps
- Rules implemented with the XFST finite-state toolkit
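The probability-map mechanism behind such a rule can be sketched outside the finite-state world as well. Below is a minimal Python illustration; the localities and probabilities are invented for illustration (the real values come from the digitized SDS maps):

```python
import random

# Hypothetical probability maps for the lemma {immer}: for each locality,
# the probability of each dialectal variant. These numbers are made up;
# the actual system reads them from digitized SDS maps.
PROB_MAP = {
    "Bern":   {"immer": 0.05, "gäng": 0.30, "geng": 0.60, "all": 0.05},
    "Zürich": {"immer": 0.70, "gäng": 0.05, "geng": 0.05, "all": 0.20},
}

def apply_lemma_rule(lemma, locality, rng=None):
    """Rewrite the Standard German lemma 'immer' into a dialectal variant,
    sampled according to the locality's probability map."""
    variants = PROB_MAP.get(locality)
    if lemma != "immer" or variants is None:
        return lemma  # the rule does not apply
    rng = rng or random.Random()
    forms = list(variants)
    return rng.choices(forms, weights=[variants[f] for f in forms], k=1)[0]

print(apply_lemma_rule("immer", "Bern", random.Random(0)))
```

Sampling is only one way to use the maps; for deterministic generation one would instead pick the most probable variant at each survey point.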
Example: Morphological inflection

ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i

- schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi
Example: Phonological adaptation

Vowel (n d) Vowel → n d | n g | n n | n

- gestanden → gschtande | gschtange | gschtanne | gschtane
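As a rough sketch of such a context-sensitive rewrite rule, here is a regex-based Python version. The locality codes and their majority variants are invented; in the actual system each of the variants nd / ng / nn / n carries a map-derived probability:

```python
import re

# Invented majority variants per locality for the intervocalic "nd" rule.
ND_VARIANT = {"A": "ng", "B": "nd", "C": "n"}

VOWELS = "aeiouäöü"

def rewrite_nd(word, locality):
    """Replace intervocalic 'nd' by the locality's variant
    (deterministic here: one majority variant per locality)."""
    variant = ND_VARIANT.get(locality, "nd")
    return re.sub(rf"(?<=[{VOWELS}])nd(?=[{VOWELS}])", variant, word)

print(rewrite_nd("gschtande", "A"))  # -> gschtange
```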
Implementation

Finite-state toolkits do not provide functionality for direct integration of probability maps. We simulate this ability with flag diacritics.

ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i

define adj-2-fm [ ADJA [Nom | Acc] Sg Gender Degree Weak ->
    [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];
Conclusions

- Difficult to achieve good coverage
  - Dialectologically interesting features vs. features relevant for practical usage
- Difficult to evaluate on “real” data due to the lack of unified writing conventions
- Veith’s claim that the ordering of rules mirrors their order of historical appearance could not be verified in practice
- The digitized maps turned out to be more useful than the rule set
  - Dialectometrical analyses
  - Online map viewer
References

- K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
- R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
- Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology – Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
- Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
- W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
- W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.

Digitized SDS maps: http://www.dialektkarten.ch
Character-level statistical machine translation: Normalization
The data: The ArchiMob corpus
ArchiMob was an oral history project collecting testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, of both genders and from different backgrounds, were interviewed (1999–2001). 43 of the Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.
The task: Normalization

There is a lot of variation in the transcriptions:
- Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
- Dialectal variation: different origins of informants
- Intra-speaker variation

Goals:
- Create an additional annotation layer that establishes identities between forms that are felt to be “the same word”
- Enable dialect-independent corpus search
- Facilitate further annotation (e.g. part-of-speech tagging)
The task: Normalization

Normalization of historical texts (modernization): [French]
  Ce ſeroit une marque de la force de voſtre merite pluſtoſt que de ma facilité.
  → Ce serait une marque de la force de votre mérite plutôt que de ma facilité.

Normalization of user-generated content: [Dutch]
  schaaaat, je et em nii nodig wie jou laat gaan is gwn DOM :p Iloveyouuuu
  → schat, je hebt hem niet nodig wie jou laat gaan is gewoon dom :p I love you

Normalization of dialectal texts: [German]
  jaa de het me no gluegt tänkt dasch ez de genneraal
  → ja dann hat man noch gelugt gedacht das ist jetzt der general

- Our normalization language is similar but not identical to Standard German
The task: Normalization
Six documents of the ArchiMob corpus were normalized manually by our transcribers (30–60 hours/document). Can we use these six documents as training data to normalize the remaining 37 automatically?

- “Machine translation” from transcribed Swiss German to the normalization language
The model: Character-level SMT (CSMT)

Standard SMT systems operate at the word level: they identify sequences of contiguous words (“phrases”) and their translations in a parallel corpus. Character-level SMT systems have been proposed for closely related languages (Vilar et al. 2007, Tiedemann 2009): they identify sequences of contiguous characters:

Dialect:    _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
Normalized: _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t
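Character-level SMT is typically implemented by feeding a standard SMT toolkit character-segmented input: spaces become underscores and every character becomes a “word”. A minimal sketch of that preprocessing step:

```python
def to_char_tokens(sentence):
    """Segment a sentence into characters for character-level SMT:
    word boundaries are marked with underscores, and characters are
    space-separated so the toolkit treats them as tokens."""
    return " ".join("_" + sentence.replace(" ", "_") + "_")

src = "jaa de het me no gluegt tänkt"
print(to_char_tokens(src))
# _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
```

Applied to both sides of the parallel corpus, this lets the unchanged word aligner and phrase extractor operate on character n-grams.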
The model: Character-level SMT (CSMT)

1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts

Estimate: 90% of words are normalized correctly. Examples:

Original        | CSMT          | Correct
muurermäischter | maurermeister | maurermeistern
buechs          | buchs         | buochs
riintel         | reintal       | rheintal
komfiserii      | konfiserei    | konfiserie
kaazèt          | kazat         | kz
plimut          | pleinmut      | plymouth
The model: Character-level SMT (CSMT)

1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
3. Train a distinct CSMT model for every text

- What character sequences do these models identify?
- How do the frequencies of these sequences vary across texts and dialects?
The analysis: Corpus-based dialectology

Example: What are the different dialectal realizations of normalized ck /kʰ/ and what are their geographical distributions?

- Look for p(∗ | ck) in the phrase tables created by the CSMT systems

Document 1048:
c h ||| c k ||| 0.09615 0.18776 0.00247 0.03999 ||| 0-0 1-1 ||| 52 2028 5
g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46
k ||| c k ||| 0.01923 0.10652 0.00820 0.03921 ||| 0-1 ||| 52 122 1

Document 1244:
g g ||| c k ||| 0.04256 0.00023 0.03922 0.00008 ||| 1-0 0-1 ||| 47 51 2
g ||| c k ||| 0.04256 0.02805 0.00064 0.00066 ||| 0-1 ||| 47 3126 2
k ||| c k ||| 0.91489 0.67461 0.07597 0.06274 ||| 0-1 ||| 47 566 43

- Pick one variant (e.g. gg) and plot the probabilities
- Compare with the relevant maps from the dialect atlas SDS
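These excerpts follow the Moses phrase table layout: source ||| target ||| four scores ||| alignment ||| counts. A small parsing sketch, assuming Moses' default score ordering in which the first score is the inverse phrase probability p(source | target), i.e. p(variant | ck) here:

```python
def variant_prob(line):
    """Parse one Moses-format phrase table line and return
    (source phrase, target phrase, p(source | target)) -- the first
    of the four scores under Moses' default score ordering."""
    fields = [f.strip() for f in line.split("|||")]
    src, tgt, scores = fields[0], fields[1], fields[2].split()
    return src, tgt, float(scores[0])

line = "g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46"
print(variant_prob(line))  # ('g g', 'c k', 0.88462)
```

Running this over every per-document phrase table yields one probability per document, which is what gets plotted on the maps.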
Dialectal gg ↔ Normalized ck (Teggi ↔ Decke)

[Map: p(gg|ck) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; green areas: SDS map 2/095 “drücken”, variant gg]
Dialectal ui ↔ Normalized au (Muis ↔ Maus)

[Map: p(ui|au) per document, binned 0.00–0.02 / 0.02–0.05 / 0.05–0.2 / 0.2–0.6 / 0.6–1.0; green areas: SDS map 1/106 “Maus”, variant ui]
Dialectal u ↔ Normalized ll (Täuer ↔ Teller)

[Map: p(u|ll) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; green areas: SDS map 2/196 “Teller”, variant u]
Dialectal n ↔ Normalized nn (Tane ↔ Tanne)

[Map: p(n|nn) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; green areas: SDS map 2/186 “Tanne”, variant n]
Conclusions: Corpus-based dialectology with CSMT

- Multi-dialectal corpora are fun to work with, but can be problematic due to transcription inconsistencies
  - Normalization provides comparability
- Counting the frequency of dialectal u is not enough, because u occurs in many other contexts
  - Cf. part-of-speech tagging for dialect syntax
- Finer-grained search in phrase tables could be useful
  - Example: V u V ↔ V l l V
- Automatic procedures for the discovery of interesting features could be useful
References

- P. Koehn (2010): Statistical Machine Translation. Cambridge: Cambridge University Press.
- T. Samardžić / Y. Scherrer / E. Glaser (2016): ArchiMob – a corpus of spoken Swiss German. In: Proceedings of LREC 2016. Portorož, 4061–4066.
- Y. Scherrer / N. Ljubešić (2016): Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Proceedings of KONVENS 2016 (Bochumer Linguistische Arbeitsberichte). Bochum, 248–255.
- Y. Scherrer / T. Samardžić / E. Glaser (2019): Digitising Swiss German – How to process and study a polycentric spoken language. In: Language Resources and Evaluation.
- J. Tiedemann (2009): Character-based PSMT for closely related languages. In: Proceedings of the 13th Conference of the European Association for Machine Translation (EAMT 2009). Barcelona, 12–19.
- D. Vilar / J.-T. Peter / H. Ney (2007): Can we translate letters? In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, 33–39.

ArchiMob corpus: https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
CSMTiser: https://github.com/clarinsi/csmtiser
Multi-dialectal neural machine translation: Dialect embeddings
Neural machine translation (NMT)

NMT uses deep neural networks to transform sequences of the source language into sequences of the target language.

Illustration: http://vas3k.com/blog/machine_translation/

- NMT has almost entirely replaced SMT in “common” machine translation tasks
- Can NMT also be used in our character-level normalization setting?
Character-level NMT with dialect embeddings

1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
3. Train a single CNMT model for all texts, adding a text source label at the beginning of each utterance:

   <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
          _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _

- CNMT models don’t use fixed-size windows, so the labels remain visible until the end of the sentence
- The model can learn to condition some transformations on the label
- The model can infer that some labels behave similarly
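In the spirit of the language-label technique (cf. Östling & Tiedemann 2017), step 3 amounts to a trivial preprocessing change: prepend one pseudo-token per document to the character-segmented source side. A sketch (the document id is illustrative):

```python
def to_char_tokens(sentence):
    """Character-segment a sentence; underscores mark word boundaries."""
    return " ".join("_" + sentence.replace(" ", "_") + "_")

def make_labeled_pair(doc_id, dialect, normalized):
    """Build one CNMT training pair whose source side starts with a
    text source label token such as <1007>."""
    return f"<{doc_id}> " + to_char_tokens(dialect), to_char_tokens(normalized)

src, tgt = make_labeled_pair(1007, "jaa de het me no gluegt tänkt",
                             "ja dann hat man noch gelugt gedacht")
print(src)
# <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
```

The label is just another vocabulary item, so the model learns an embedding for it like for any character.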
Character-level NMT with dialect embeddings

- NMT models produce embeddings of their input and output tokens
- Embeddings are just vectors of real numbers

Illustration: https://www.sdl.com/ilp/language/neural-machine-translation.html
Character-level NMT with dialect embeddings

- NMT models produce embeddings of their input and output tokens
  - Let us just look at the embeddings of the text source labels
- Embeddings are just vectors of real numbers (500 in our case)
  - Let us apply a dimensionality reduction method (MDS, PCA, t-SNE, …) to visualize the results
- Example: PCA, 3 dimensions
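The analysis needs nothing more than the matrix of label vectors and a standard PCA. A self-contained numpy sketch on random placeholder vectors (in practice the rows would be read out of the trained CNMT model's embedding table):

```python
import numpy as np

def pca(vectors, n_components=3):
    """Project row vectors onto their first principal components
    (PCA via SVD of the mean-centred matrix)."""
    centred = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T

# Placeholder: 43 text source labels, 500-dimensional embeddings.
rng = np.random.default_rng(0)
label_embeddings = rng.normal(size=(43, 500))
coords = pca(label_embeddings)
print(coords.shape)  # (43, 3)
```

Each of the three resulting coordinates can then be plotted per document and correlated with external variables (transcriber, longitude, latitude).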
Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 1/3, labelled with transcriber initials; correlation ratio η = 0.816]
Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 2/3; correlation with longitude: Pearson’s r = 0.487, p < 0.001]
Character-level NMT with dialect embeddings

[Plot: PCA reduction, component 3/3; correlation with latitude: Pearson’s r = 0.499, p < 0.001]
Character-level NMT: Conclusions

The model learns that the normalization depends on
- the transcriber
- the geographic origin of the text.

Open questions:
- Not all NMT algorithms and dimensionality reduction algorithms work equally well
- What is the overall normalization quality of NMT?
- For which types of transformations does the model “look” at the dialect label?
References

- D. Bahdanau / K. Cho / Y. Bengio (2015): Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR 2015.
- G. Klein / Y. Kim / Y. Deng / J. Senellart / A. M. Rush (2017): OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810. http://opennmt.net/
- L. J. P. van der Maaten / G. E. Hinton (2008): Visualizing high-dimensional data using t-SNE. In: Journal of Machine Learning Research 9, 2579–2605.
- A. Vaswani / N. Shazeer / N. Parmar / J. Uszkoreit / L. Jones / A. N. Gomez / Ł. Kaiser / I. Polosukhin (2017): Attention is all you need. In: Advances in Neural Information Processing Systems, 5998–6008.
- R. Östling / J. Tiedemann (2017): Continuous multilinguality with language vectors. In: Proceedings of EACL 2017, 644–649.
Conclusions

1. Rule-based MT
- Generative dialectology: from standard to dialect
- Knowledge-driven, i.e. dialect atlas-driven
- Maps are prerequisites
- Results: evaluation on dialect identification and generation, (in-)validation of Veith’s claims, dialectometrical analyses

2. Statistical and neural MT
- Normalization: from dialect to “standard”
- Data-driven, i.e. dialect corpus-driven
- Maps result from model parameters
- Results: emerging properties of the normalization process and of dialect texts