Computational dialectology with machine translation techniques
Yves Scherrer
Department of Digital Humanities, University of Helsinki
Mapping Language Variation and Change, Cambridge, 19 March 2019
A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…): 2007, 2012–2013, 2017–2018
Illustration: http://vas3k.com/blog/machine_translation/
Rule-based machine translation: Standard German → Swiss German
Language variation in rule-based machine translation
Generative dialectology (Veith 1970, 1982)
- Transformation rules derive a multitude of dialect systems Di from a single reference system B:
- #Töpfer#B → #Häfner#D33333−46999
My proposal:
- D: Swiss German dialects
- B: Modern High German ("Standard German")
- Most practical, but not historically correct
- Dialects are not represented as discrete numbered entities, but as probability maps
[Map illustration: StdG immer → geng, with its probability map]
Example rule: Lemma change
{immer} → {immer} | {gäng} | {geng} | {all} | …
- Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps (illustrated in the sketch below)
- Rules implemented with the XFST finite-state toolkit
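To make the probability-map representation concrete, here is a minimal Python sketch, not the original XFST-based implementation: each output variant of the lemma rule carries a map from survey points to probabilities, and generation at a given point ranks the variants accordingly. All place names and probability values are hypothetical.

# Minimal sketch, hypothetical data: a lemma-change rule whose variants
# carry probability maps extracted from digitized dialect atlas maps.

# For each output variant of StdG "immer", a map from survey point to probability.
immer_rule = {
    "immer": {"Zurich": 0.60, "Bern": 0.05, "Chur": 0.40},
    "gäng":  {"Zurich": 0.05, "Bern": 0.45, "Chur": 0.00},
    "geng":  {"Zurich": 0.05, "Bern": 0.50, "Chur": 0.00},
    "all":   {"Zurich": 0.30, "Bern": 0.00, "Chur": 0.60},
}

def best_variant(rule, location):
    """Return the most probable dialectal variant at one survey point."""
    return max(rule, key=lambda variant: rule[variant][location])

print(best_variant(immer_rule, "Bern"))    # geng
print(best_variant(immer_rule, "Zurich"))  # immer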
Example: morphological inflection
ADJA [Nom | Acc] Sg Gender Degree Weak → ∅ | i
- schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi
Example: phonological adaptation
Vowel (n d) Vowel → n d | n g | n n | n
- gestanden → gschtande | gschtange | gschtanne | gschtane
Implementation
Finite-state toolkits do not provide functionality for direct integration of probability maps. We simulate this ability with flag diacritics.
ADJA [Nom | Acc] Sg Gender Degree Weak → ∅ | i
define adj-2-fl [ ADJA [Nom | Acc] Sg Gender Degree Weak ->
  [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];
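To make the simulation concrete, here is a minimal post-processing sketch, not the original code: the transducer emits each variant tagged with its flag value, and the probability map for feature 3-254 (the adjectival -i ending) assigns each variant a location-specific weight. All names and values are hypothetical.

# Minimal sketch, hypothetical data: weighting flag-tagged XFST outputs with
# a digitized SDS probability map for feature 3-254.

# Probability of the -i variant at each survey point (hypothetical values).
prob_map_3_254 = {"Bern": 0.92, "Zurich": 0.15, "Chur": 0.48}

def weight_variants(variants, location, prob_map):
    """Attach a location-specific probability to each tagged variant.

    `variants` are (surface_form, flag_value) pairs as produced by the
    transducer, e.g. [("schwarz", "null"), ("schwarzi", "i")].
    """
    p = prob_map[location]
    return [(form, p if flag == "i" else 1.0 - p) for form, flag in variants]

print(weight_variants([("schwarz", "null"), ("schwarzi", "i")], "Bern", prob_map_3_254))
# "schwarzi" is strongly preferred in Bern, "schwarz" in Zurich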
Conclusions
- Difficult to achieve good coverage
- Dialectologically interesting features vs. relevant features for practical usage
- Difficult to evaluate on "real" data due to lack of unified writing conventions
- The digitized maps turned out to be more useful than the rule set
- Veith's claim that the ordering of rules mirrors their order of historical appearance is difficult to verify in practice
Rule-based machine translation: Standard German → Swiss German
References
- K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
- R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
- Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology – Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
- Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
- W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
- W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.
Digitized SDS maps: http://www.dialektkarten.ch
Character-level statistical machine translation: Normalization
The data: The ArchiMob corpus
ArchiMob was an oral history project focusing on testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, of both genders and from different backgrounds, were interviewed (1999–2001). 43 Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.
The task: Normalization
There is a lot of variation in the transcriptions:
- Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
- Dialectal variation: different origins of informants
- Intra-speaker variation
Goals:
- Create an additional annotation layer that establishes identities between forms that are felt to be "the same word"
- Enable dialect-independent corpus search
- Facilitate further annotation (e.g. part-of-speech tagging)
The task: Normalization
Our normalization language is similar but not identical to Standard German:
jaa de het me no gluegt tänkt dasch ez de genneraal
ja dann hat man noch gelugt gedacht das ist jetzt der general
Six documents were normalized manually by our transcribers (30–60 hours/document).
- Can we use these six documents as training data to normalize the remaining 37 automatically?
- "Machine translation" from transcribed Swiss German to the normalization language
The model: Character-level SMT (CSMT)
Standard SMT systems operate at the word level. They identify sequences of contiguous words ("phrases") and their translations in a parallel corpus.
Character-level SMT systems have been proposed for closely related languages (Vilar et al. 2007, Tiedemann 2009). They identify sequences of contiguous characters:
_ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
_ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t
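As an illustration, a sketch of the preprocessing step (not the CSMTiser code itself): each word is split into space-separated characters, with "_" marking the original word boundaries, so that an off-the-shelf SMT toolkit such as Moses treats characters as "words".

# Minimal sketch: turning word-level parallel data into character-level data.

def to_char_level(sentence: str) -> str:
    """Turn 'jaa de het' into '_ j a a _ d e _ h e t'."""
    return " ".join("_ " + " ".join(word) for word in sentence.split())

src = "jaa de het me no gluegt tänkt"
trg = "ja dann hat man noch gelugt gedacht"
print(to_char_level(src))  # _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
print(to_char_level(trg))  # _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t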
The model: Character-level SMT (CSMT)
1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
- Estimation: 90% of words normalized correctly
Some remaining errors:
Original        | CSMT          | Correct
muurermäischter | maurermeister | maurermeistern
buechs          | buchs         | buochs
riintel         | reintal       | rheintal
komfiserii      | konfiserei    | konfiserie
kaazèt          | kazat         | kz
plimut          | pleinmut      | plymouth
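The 90% figure is an estimate that can be checked against a manually normalized sample; a minimal sketch of the word-level accuracy computation, with hypothetical data:

# Minimal sketch, hypothetical data: word-level normalization accuracy.

def word_accuracy(hypotheses, references):
    """Fraction of words whose automatic normalization equals the manual one."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)

hyp = ["ja", "dann", "hat", "man", "noch", "gelugt", "gedenkt"]  # CSMT output
ref = ["ja", "dann", "hat", "man", "noch", "gelugt", "gedacht"]  # gold standard
print(f"{word_accuracy(hyp, ref):.0%}")  # 86%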
The model: Character-level SMT (CSMT)
1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
3. Train a distinct CSMT model for every text
- What character sequences do these models identify?
- How do the frequencies of these sequences vary across texts and dialects?
The analysis: Corpus-based dialectology
Example: What are the different dialectal realizations of normalized ck, and what are their geographical distributions?
- Look for p(∗ | ck) in the phrase tables created by the CSMT systems (see the parsing sketch below)
Document 1048:
c h ||| c k ||| 0.09615 0.18776 0.00247 0.03999 ||| 0-0 1-1 ||| 52 2028 5
g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46
k ||| c k ||| 0.01923 0.10652 0.00820 0.03921 ||| 0-1 ||| 52 122 1
Document 1244:
g g ||| c k ||| 0.04256 0.00023 0.03922 0.00008 ||| 1-0 0-1 ||| 47 51 2
g ||| c k ||| 0.04256 0.02805 0.00064 0.00066 ||| 0-1 ||| 47 3126 2
k ||| c k ||| 0.91489 0.67461 0.07597 0.06274 ||| 0-1 ||| 47 566 43
- Pick one variant (e.g. gg) and plot the probabilities
- Compare with relevant SDS maps
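The lookup itself is a scan over the phrase table files; a minimal sketch, with hypothetical file paths. The field layout is the standard Moses one (source ||| target ||| scores ||| alignment ||| counts), where the first of the four scores is the inverse phrase probability p(source | target), i.e. the probability of a dialectal realization given normalized "c k".

# Minimal sketch: collecting p(variant | ck) from the per-document phrase tables.

def variant_probs(phrase_table_path, normalized="c k"):
    """Map each dialectal realization of `normalized` to its probability."""
    probs = {}
    with open(phrase_table_path, encoding="utf-8") as f:
        for line in f:
            fields = [part.strip() for part in line.split("|||")]
            source, target, scores = fields[0], fields[1], fields[2].split()
            if target == normalized:
                probs[source] = float(scores[0])  # p(source | target)
    return probs

# Hypothetical usage, assuming one phrase table per ArchiMob document:
for doc in ("1048", "1244"):
    print(doc, variant_probs(f"models/{doc}/phrase-table"))
# 1048 {'c h': 0.09615, 'g g': 0.88462, 'k': 0.01923}
# 1244 {'g g': 0.04256, 'g': 0.04256, 'k': 0.91489}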
Dialectal gg ↔ Normalized ck (Teggi ↔ Decke)
[Map: p(gg|ck) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; point values in the gg area range from 0.53 to 0.98. Green areas: SDS map 2/095 "drücken", variant gg]
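Maps like this can be drawn directly from the extracted probabilities; a minimal plotting sketch with matplotlib, where the document coordinates and values are hypothetical:

# Minimal sketch, hypothetical data: plotting extracted probabilities as a
# symbol map, analogous to the probability maps on these slides.
import matplotlib.pyplot as plt

# (longitude, latitude, p(gg | ck)) per document; all values hypothetical.
points = {"1048": (7.45, 46.95, 0.88), "1244": (8.54, 47.37, 0.04)}

lons, lats, probs = zip(*points.values())
sc = plt.scatter(lons, lats, c=probs, cmap="Greens", vmin=0.0, vmax=1.0, s=200)
plt.colorbar(sc, label="p(gg | ck)")
for doc, (lon, lat, _) in points.items():
    plt.annotate(doc, (lon, lat))
plt.title("Dialectal gg ↔ normalized ck")
plt.savefig("gg_ck_map.png")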
Dialectal ui ↔ Normalized au (Muis ↔ Maus)
[Map: p(ui|au) per document, binned 0.00–0.02 / 0.02–0.05 / 0.05–0.2 / 0.2–0.6 / 0.6–1.0; displayed point values: 0.06, 0.14. Green areas: SDS map 1/106 "Maus", variant ui]
Dialectal u ↔ Normalized ll (Täuer ↔ Teller)
[Map: p(u|ll) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; point values in the u area range from 0.45 to 0.7. Green areas: SDS map 2/196 "Teller", variant u]
Dialectal n ↔ Normalized nn (Tane ↔ Tanne)
[Map: p(n|nn) per document, binned 0.0–0.2 / 0.2–0.4 / 0.4–0.6 / 0.6–0.8 / 0.8–1.0; point values in the n area range from 0.68 to 0.98. Green areas: SDS map 2/186 "Tanne", variant n]
Conclusions: Corpus-based dialectology with CSMT
- Multi-dialectal corpora are fun to work with, but can be problematic due to transcription inconsistencies
- Normalization provides comparability
- Counting the frequency of dialectal u is not enough, because u occurs in many other contexts
- Cf. part-of-speech tagging for dialect syntax
- Finer-grained search in phrase tables could be useful
- Example: V u V ↔ V l l V
- Automatic procedures for the discovery of interesting features could be useful
Corpus-based dialectology with CSMT
References
- P. Koehn (2010): Statistical Machine Translation. Cambridge: Cambridge University Press.
- T. Samardžić / Y. Scherrer / E. Glaser (2016): ArchiMob – a corpus of spoken Swiss German. In: Proceedings of LREC 2016. Portorož, 4061–4066.
- Y. Scherrer / N. Ljubešić (2016): Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Proceedings of KONVENS 2016 (Bochumer Linguistische Arbeitsberichte). Bochum, 248–255.
- Y. Scherrer / T. Samardžić / E. Glaser (to appear): Digitising Swiss German – How to process and study a polycentric spoken language. In: Language Resources and Evaluation.
- J. Tiedemann (2009): Character-based PSMT for closely related languages. In: Proceedings of the 13th Conference of the European Association for Machine Translation (EAMT 2009). Barcelona, 12–19.
- D. Vilar / J.-T. Peter / H. Ney (2007): Can we translate letters? In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, 33–39.
ArchiMob corpus: https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
CSMTiser: https://github.com/clarinsi/csmtiser
Multi-dialectal neural machine translation: Dialect embeddings
Neural machine translation (NMT)
NMT uses deep neural networks to transform sequences of the source language into sequences of the target language.
Illustration: http://vas3k.com/blog/machine_translation/
- NMT has almost entirely replaced SMT in "common" machine translation tasks
- Can NMT also be used in our character-level normalization setting?
Character-level NMT with dialect embeddings
1. Train a single CSMT model on the six normalized texts
2. Apply this model to produce normalizations for the 37 remaining texts
3. Train a single CNMT model for all texts, adding a text source label at the beginning of each utterance:
<1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
_ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t
- CNMT models don't use "windows", so the labels remain visible until the end of the sentence
- The model may learn to condition some transformations on the label
- The model may infer that some labels behave similarly
(a data preparation sketch follows below)
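A minimal sketch of the data preparation, not the original pipeline: the text source label is prepended as an extra "character" token on the source side, in the spirit of Östling & Tiedemann (2017).

# Minimal sketch: building one labeled CNMT training pair.

def to_chars(sentence: str) -> str:
    """Split a sentence into space-separated characters, '_' marking word breaks."""
    return " ".join("_ " + " ".join(word) for word in sentence.split())

def make_example(doc_id: str, dialect: str, normalized: str):
    """Prepend the source label <doc_id> as an extra token on the source side."""
    return f"<{doc_id}> {to_chars(dialect)}", to_chars(normalized)

src, trg = make_example("1007", "jaa de het me no gluegt tänkt",
                        "ja dann hat man noch gelugt gedacht")
print(src)  # <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t
print(trg)  # _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t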
Character-level NMT with dialect embeddings
- NMT models produce embeddings of their input and output tokens
- Embeddings are just vectors of real numbers
Illustration: https://www.sdl.com/ilp/language/neural-machine-translation.html
Character-level NMT with dialect embeddings
- NMT models produce embeddings of their input and output tokens
- Standard setting: word embeddings; character-level setting: character embeddings
- The text source labels are perceived by the model as "special characters" and receive their own embeddings
- Embeddings are just vectors of real numbers (500 in our case)
- Apply a dimensionality reduction method (MDS, PCA, t-SNE, …) and plot the results (sketched below)
- We only look at the label embeddings for now
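A minimal sketch of the reduction step, with hypothetical file names and vocabulary indices: how the embedding matrix is exported depends on the toolkit (e.g. OpenNMT); here we assume it has been saved as a NumPy array with one row per vocabulary item.

# Minimal sketch: PCA on the 500-dimensional label embeddings.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.load("source_embeddings.npy")  # shape: (vocab_size, 500)
label_rows = {"<1007>": 12, "<1048>": 13, "<1143>": 14, "<1244>": 15}  # hypothetical

X = embeddings[list(label_rows.values())]      # label embeddings only
coords = PCA(n_components=3).fit_transform(X)  # 3 components, as in the plots

for label, (c1, c2, c3) in zip(label_rows, coords):
    print(label, round(c1, 2), round(c2, 2), round(c3, 2))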
Character-level NMT with dialect embeddings
[Plot: PCA reduction, component 1/3]
[Plot: PCA reduction, component 1/3, with transcriber initials (Z, P, A, M, S) marking each document]
[Plot: PCA reduction, component 2/3]
[Plot: PCA reduction, component 3/3]
Character-level NMT: Conclusions
- The model learns that the normalization depends on:
  - The transcriber
  - The geographic origin of the text
- Open questions:
  - Not all NMT algorithms and dimensionality reduction algorithms work equally well
  - What is the overall normalization quality of NMT?
  - For which types of transformations does the model "look" at the dialect label?
Character-level NMT with dialect embeddings
References
- D. Bahdanau / K. Cho / Y. Bengio (2015): Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR 2015.
- G. Klein / Y. Kim / Y. Deng / J. Senellart / A. M. Rush (2017): OpenNMT: Open-source toolkit for neural machine translation. arXiv preprint arXiv:1701.02810. http://opennmt.net/
- L. J. P. van der Maaten / G. E. Hinton (2008): Visualizing high-dimensional data using t-SNE. In: Journal of Machine Learning Research 9, 2579–2605.
- A. Vaswani / N. Shazeer / N. Parmar / J. Uszkoreit / L. Jones / A. N. Gomez / Ł. Kaiser / I. Polosukhin (2017): Attention is all you need. In: Advances in Neural Information Processing Systems, 5998–6008.
- R. Östling / J. Tiedemann (2017): Continuous multilinguality with language vectors. In: Proceedings of EACL 2017, 644–649.
Conclusions
- Rule-based MT
  - Knowledge-driven, i.e. dialect-atlas-driven
  - From standard to dialect
  - Maps are prerequisites
- Statistical and neural MT
  - Data-driven, i.e. dialect-corpus-driven
  - From dialect to "standard"
  - Maps result from model training
- Neural MT
  - Emerging properties of dialect texts