Computational dialectology with machine translation techniques
Yves Scherrer
Department of Digital Humanities, University of Helsinki
Mapping Language Variation and Change, Cambridge, 19 March 2019
A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…): 2007, 2012–2013, 2017–2018
Illustration: http://vas3k.com/blog/machine_translation/
Rule-based machine translation: Standard German → Swiss German
Language variation in rule-based machine translation
Generative dialectology (Veith 1970, 1982)
- Transformation rules derive a multitude of dialect systems Di from a single reference system B:
- #Töpfer#B → #Häfner#D33333−46999
My proposal:
- D: Swiss German dialects
- B: Modern High German ("Standard German")
- Most practical, but not historically correct
- Dialects are not represented as discrete numbered entities, but as probability maps
[Map illustration: StdG immer → geng, with its probability map]
Example rule: Lemma change
{immer} → {immer} | {gäng} | {geng} | {all} | …
- Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps (illustrated in the sketch below)
- Rules implemented with the XFST finite-state toolkit
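To make the probability-map representation concrete, here is a minimal Python sketch, not the original XFST-based implementation: each output variant of the lemma rule carries a map from survey points to probabilities, and generation at a given point ranks the variants accordingly. All place names and probability values are hypothetical.

# Minimal sketch, hypothetical data: a lemma-change rule whose variants
# carry probability maps extracted from digitized dialect atlas maps.

# For each output variant of StdG "immer", a map from survey point to probability.
immer_rule = {
    "immer": {"Zurich": 0.60, "Bern": 0.05, "Chur": 0.40},
    "gäng":  {"Zurich": 0.05, "Bern": 0.45, "Chur": 0.00},
    "geng":  {"Zurich": 0.05, "Bern": 0.50, "Chur": 0.00},
    "all":   {"Zurich": 0.30, "Bern": 0.00, "Chur": 0.60},
}

def best_variant(rule, location):
    """Return the most probable dialectal variant at one survey point."""
    return max(rule, key=lambda variant: rule[variant][location])

print(best_variant(immer_rule, "Bern"))    # geng
print(best_variant(immer_rule, "Zurich"))  # immer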
Example: morphological inflection
ADJA [Nom | Acc] Sg Gender Degree Weak → ∅ | i
- schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi
Example: phonological adaptation
Vowel (n d) Vowel → n d | n g | n n | n
- gestanden → gschtande | gschtange | gschtanne | gschtane
Implementation
Finite-state toolkits do not provide functionality for direct integration of probability maps. We simulate this ability with flag diacritics.
ADJA [Nom | Acc] Sg Gender Degree Weak → ∅ | i
define adj-2-fl [ ADJA [Nom | Acc] Sg Gender Degree Weak ->
  [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];
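To make the simulation concrete, here is a minimal post-processing sketch, not the original code: the transducer emits each variant tagged with its flag value, and the probability map for feature 3-254 (the adjectival -i ending) assigns each variant a location-specific weight. All names and values are hypothetical.

# Minimal sketch, hypothetical data: weighting flag-tagged XFST outputs with
# a digitized SDS probability map for feature 3-254.

# Probability of the -i variant at each survey point (hypothetical values).
prob_map_3_254 = {"Bern": 0.92, "Zurich": 0.15, "Chur": 0.48}

def weight_variants(variants, location, prob_map):
    """Attach a location-specific probability to each tagged variant.

    `variants` are (surface_form, flag_value) pairs as produced by the
    transducer, e.g. [("schwarz", "null"), ("schwarzi", "i")].
    """
    p = prob_map[location]
    return [(form, p if flag == "i" else 1.0 - p) for form, flag in variants]

print(weight_variants([("schwarz", "null"), ("schwarzi", "i")], "Bern", prob_map_3_254))
# "schwarzi" is strongly preferred in Bern, "schwarz" in Zurich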
Conclusions
- Difficult to achieve good coverage
- Dialectologically interesting features vs. relevant features for practical usage
- Difficult to evaluate on "real" data due to lack of unified writing conventions
- The digitized maps turned out to be more useful than the rule set
- Veith's claim that the ordering of rules mirrors their order of historical appearance is difficult to verify in practice
Rule-based machine translation: Standard German → Swiss German
References
- K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
- R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
- Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology – Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
- Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
- W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
- W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.
Digitized SDS maps: http://www.dialektkarten.ch
Character-level statistical machine translation: Normalization
The data: The ArchiMob corpus
ArchiMob was an oral history project focusing on testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, of both genders and from different backgrounds, were interviewed (1999–2001). 43 Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.
The task: Normalization
There is a lot of variation in the transcriptions:
- Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
- Dialectal variation: different origins of informants
- Intra-speaker variation
Goals:
- Create an additional annotation layer that establishes identities between forms that are felt to be "the same word"
- Enable dialect-independent corpus search
- Facilitate further annotation (e.g. part-of-speech tagging)
The task: Normalization
Our normalization language is similar but not identical to Standard German:
jaa de het me no gluegt tänkt dasch ez de genneraal
ja dann hat man noch gelugt gedacht das ist jetzt der general
Six documents were normalized manually by our transcribers (30–60 hours/document).
- Can we use these six documents as training data to normalize the remaining 37 automatically?
- "Machine translation" from transcribed Swiss German to the normalization language
The model: Character-level SMT (CSMT)
Standard SMT systems operate at the word level. They identify sequences of contiguous words ("phrases") and their translations in a parallel corpus.
Character-level SMT systems have been proposed for closely related languages (Vilar et al. 2007, Tiedemann 2009). They identify sequences of contiguous characters:
_ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
_ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t
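As an illustration, a sketch of the preprocessing step (not the CSMTiser code itself): each word is split into space-separated characters, with "_" marking the original word boundaries, so that an off-the-shelf SMT toolkit such as Moses treats characters as "words".

# Minimal sketch: turning word-level parallel data into character-level data.

def to_char_level(sentence: str) -> str:
    """Turn 'jaa de het' into '_ j a a _ d e _ h e t'."""
    return " ".join("_ " + " ".join(word) for word in sentence.split())

src = "jaa de het me no gluegt tänkt"
trg = "ja dann hat man noch gelugt gedacht"
print(to_char_level(src))  # _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
print(to_char_level(trg))  # _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t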
The model: Character-level SMT (CSMT)
1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
- Estimation: 90% of words normalized correctly
Some remaining errors:
Original        | CSMT          | Correct
muurermäischter | maurermeister | maurermeistern
buechs          | buchs         | buochs
riintel         | reintal       | rheintal
komfiserii      | konfiserei    | konfiserie
kaazèt          | kazat         | kz
plimut          | pleinmut      | plymouth
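The 90% figure is an estimate that can be checked against a manually normalized sample; a minimal sketch of the word-level accuracy computation, with hypothetical data:

# Minimal sketch, hypothetical data: word-level normalization accuracy.

def word_accuracy(hypotheses, references):
    """Fraction of words whose automatic normalization equals the manual one."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)

hyp = ["ja", "dann", "hat", "man", "noch", "gelugt", "gedenkt"]  # CSMT output
ref = ["ja", "dann", "hat", "man", "noch", "gelugt", "gedacht"]  # gold standard
print(f"{word_accuracy(hyp, ref):.0%}")  # 86%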
The model: Character-level SMT (CSMT)
1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
3. Train a distinct CSMT model for every text
- What character sequences do these models identify?
- How do the frequencies of these sequences vary across texts and dialects?
The analysis: Corpus-based dialectology
Example: What are the different dialectal realizations of normalized ck, and what are their geographical distributions?
- Look for p(∗ | ck) in the phrase tables created by the CSMT systems (see the parsing sketch below)
Document 1048:
c h ||| c k ||| 0.09615 0.18776 0.00247 0.03999 ||| 0-0 1-1 ||| 52 2028 5
g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46
k ||| c k ||| 0.01923 0.10652 0.00820 0.03921 ||| 0-1 ||| 52 122 1
Document 1244:
g g ||| c k ||| 0.04256 0.00023 0.03922 0.00008 ||| 1-0 0-1 ||| 47 51 2
g ||| c k ||| 0.04256 0.02805 0.00064 0.00066 ||| 0-1 ||| 47 3126 2
k ||| c k ||| 0.91489 0.67461 0.07597 0.06274 ||| 0-1 ||| 47 566 43
- Pick one variant (e.g. gg) and plot the probabilities
- Compare with relevant SDS maps
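The lookup itself is a scan over the phrase table files; a minimal sketch, with hypothetical file paths. The field layout is the standard Moses one (source ||| target ||| scores ||| alignment ||| counts), where the first of the four scores is the inverse phrase probability p(source | target), i.e. the probability of a dialectal realization given normalized "c k".

# Minimal sketch: collecting p(variant | ck) from the per-document phrase tables.

def variant_probs(phrase_table_path, normalized="c k"):
    """Map each dialectal realization of `normalized` to its probability."""
    probs = {}
    with open(phrase_table_path, encoding="utf-8") as f:
        for line in f:
            fields = [part.strip() for part in line.split("|||")]
            source, target, scores = fields[0], fields[1], fields[2].split()
            if target == normalized:
                probs[source] = float(scores[0])  # p(source | target)
    return probs

# Hypothetical usage, assuming one phrase table per ArchiMob document:
for doc in ("1048", "1244"):
    print(doc, variant_probs(f"models/{doc}/phrase-table"))
# 1048 {'c h': 0.09615, 'g g': 0.88462, 'k': 0.01923}
# 1244 {'g g': 0.04256, 'g': 0.04256, 'k': 0.91489}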
Dialectal gg ↔ Normalized ck (Teggi ↔ Decke)
[Map: p(gg|ck) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; point values in the gg area range from 0.53 to 0.98. Green areas: SDS map 2/095 "drücken", variant gg]
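Maps like this can be drawn directly from the extracted probabilities; a minimal plotting sketch with matplotlib, where the document coordinates and values are hypothetical:

# Minimal sketch, hypothetical data: plotting extracted probabilities as a
# symbol map, analogous to the probability maps on these slides.
import matplotlib.pyplot as plt

# (longitude, latitude, p(gg | ck)) per document; all values hypothetical.
points = {"1048": (7.45, 46.95, 0.88), "1244": (8.54, 47.37, 0.04)}

lons, lats, probs = zip(*points.values())
sc = plt.scatter(lons, lats, c=probs, cmap="Greens", vmin=0.0, vmax=1.0, s=200)
plt.colorbar(sc, label="p(gg | ck)")
for doc, (lon, lat, _) in points.items():
    plt.annotate(doc, (lon, lat))
plt.title("Dialectal gg ↔ normalized ck")
plt.savefig("gg_ck_map.png")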
Dialectal ui ↔ Normalized au (Muis ↔ Maus)
[Map: p(ui|au) per document, binned 0.00–0.02 / 0.02–0.05 / 0.05–0.2 / 0.2–0.6 / 0.6–1.0; displayed point values: 0.06, 0.14. Green areas: SDS map 1/106 "Maus", variant ui]
Dialectal u ↔ Normalized ll (Täuer ↔ Teller)
[Map: p(u|ll) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; point values in the u area range from 0.45 to 0.7. Green areas: SDS map 2/196 "Teller", variant u]
Dialectal n ↔ Normalized nn (Tane ↔ Tanne)
[Map: p(n|nn) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; point values in the n area range from 0.68 to 0.98. Green areas: SDS map 2/186 "Tanne", variant n]
Conclusions: Corpus-based dialectology with CSMT
- Multi-dialectal corpora are fun to work with, but can be problematic due to transcription inconsistencies
- Normalization provides comparability
- Counting the frequency of dialectal u is not enough, because u occurs in many other contexts
- Cf. part-of-speech tagging for dialect syntax
- Finer-grained search in phrase tables could be useful
- Example: V u V ↔ V l l V
- Automatic procedures for the discovery of interesting features could be useful
Corpus-based dialectology with CSMT
References
- P. Koehn (2010): Statistical Machine Translation. Cambridge: Cambridge University Press.
- T. Samardžić / Y. Scherrer / E. Glaser (2016): ArchiMob – a corpus of spoken Swiss German. In: Proceedings of LREC 2016. Portorož, 4061–4066.
- Y. Scherrer / N. Ljubešić (2016): Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Proceedings of KONVENS 2016 (Bochumer Linguistische Arbeitsberichte). Bochum, 248–255.
- Y. Scherrer / T. Samardžić / E. Glaser (to appear): Digitising Swiss German – How to process and study a polycentric spoken language. In: Language Resources and Evaluation.
- J. Tiedemann (2009): Character-based PSMT for closely related languages. In: Proceedings of the 13th Conference of the European Association for Machine Translation (EAMT 2009). Barcelona, 12–19.
- D. Vilar / J.-T. Peter / H. Ney (2007): Can we translate letters? In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, 33–39.
ArchiMob corpus: https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
CSMTiser: https://github.com/clarinsi/csmtiser
Multi-dialectal neural machine translation: Dialect embeddings
Neural machine translation (NMT)
NMT uses deep neural networks to transform sequences of the source language into sequences of the target language.
Illustration: http://vas3k.com/blog/machine_translation/
- NMT has almost entirely replaced SMT in "common" machine translation tasks
- Can NMT also be used in our character-level normalization setting?
Character-level NMT with dialect embeddings
1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
3. Train a single CNMT model for all texts, adding a text source label at the beginning of each utterance:
<1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
_ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t
- CNMT models don't use "windows", so the labels remain visible until the end of the sentence
- The model may learn to condition some transformations on the label
- The model may infer that some labels behave similarly
(a data preparation sketch follows below)
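A minimal sketch of the data preparation, not the original pipeline: the text source label is prepended as an extra "character" token on the source side, in the spirit of Östling & Tiedemann (2017).

# Minimal sketch: building one labeled CNMT training pair.

def to_chars(sentence: str) -> str:
    """Split a sentence into space-separated characters, '_' marking word breaks."""
    return " ".join("_ " + " ".join(word) for word in sentence.split())

def make_example(doc_id: str, dialect: str, normalized: str):
    """Prepend the source label <doc_id> as an extra token on the source side."""
    return f"<{doc_id}> {to_chars(dialect)}", to_chars(normalized)

src, trg = make_example("1007", "jaa de het me no gluegt tänkt",
                        "ja dann hat man noch gelugt gedacht")
print(src)  # <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
print(trg)  # _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t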
Character-level NMT with dialect embeddings
- NMT models produce embeddings of their input and output tokens
- Embeddings are just vectors of real numbers
Illustration: https://www.sdl.com/ilp/language/neural-machine-translation.html
Character-level NMT with dialect embeddings
- NMT models produce embeddings of their input and output tokens
- Standard setting: word embeddings; character-level setting: character embeddings
- The text source labels are perceived by the model as "special characters" and receive their own embeddings
- Embeddings are just vectors of real numbers (500 in our case)
- Apply a dimensionality reduction method (MDS, PCA, t-SNE, …) and plot the results (sketched below)
- We only look at the label embeddings for now
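A minimal sketch of the reduction step, with hypothetical file names and vocabulary indices: how the embedding matrix is exported depends on the toolkit (e.g. OpenNMT); here we assume it has been saved as a NumPy array with one row per vocabulary item.

# Minimal sketch: PCA on the 500-dimensional label embeddings.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.load("source_embeddings.npy")  # shape: (vocab_size, 500)
label_rows = {"<1007>": 12, "<1048>": 13, "<1143>": 14, "<1244>": 15}  # hypothetical

X = embeddings[list(label_rows.values())]      # label embeddings only
coords = PCA(n_components=3).fit_transform(X)  # 3 components, as in the plots

for label, (c1, c2, c3) in zip(label_rows, coords):
    print(label, round(c1, 2), round(c2, 2), round(c3, 2))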
Character-level NMT with dialect embeddings
[Plot: PCA reduction, component 1/3]
[Plot: PCA reduction, component 1/3, with transcriber initials (Z, P, A, M, S) marking each document]
[Plot: PCA reduction, component 2/3]
[Plot: PCA reduction, component 3/3]
Character-level NMT: Conclusions
- The model learns that the normalization depends on:
  - The transcriber
  - The geographic origin of the text
- Open questions:
  - Not all NMT algorithms and dimensionality reduction algorithms work equally well
  - What is the overall normalization quality of NMT?
  - For which types of transformations does the model "look" at the dialect label?
Character-level NMT with dialect embeddings
References
- D. Bahdanau / K. Cho / Y. Bengio (2015): Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR 2015.
- G. Klein / Y. Kim / Y. Deng / J. Senellart / A. M. Rush (2017): OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810. http://opennmt.net/
- L. J. P. van der Maaten / G. E. Hinton (2008): Visualizing high-dimensional data using t-SNE. In: Journal of Machine Learning Research 9, 2579–2605.
- A. Vaswani / N. Shazeer / N. Parmar / J. Uszkoreit / L. Jones / A. N. Gomez / Ł. Kaiser / I. Polosukhin (2017): Attention is all you need. In: Advances in Neural Information Processing Systems, 5998–6008.
- R. Östling / J. Tiedemann (2017): Continuous multilinguality with language vectors. In: Proceedings of EACL 2017, 644–649.
Conclusions
- Rule-based MT
  - Knowledge-driven, i.e. dialect-atlas-driven
  - From standard to dialect
  - Maps are prerequisites
- Statistical and neural MT
  - Data-driven, i.e. dialect-corpus-driven
  - From dialect to "standard"
  - Maps result from model training
- Neural MT
  - Emerging properties of dialect texts