Computational dialectology with machine translation techniques

SLIDE 1

Computational dialectology with machine translation techniques

Yves Scherrer
Department of Digital Humanities, University of Helsinki

Linguistics Research Seminar, University of Gothenburg, 12 November 2019

SLIDE 2

A brief history of my career as a machine translation researcher interested in dialectology (or the other way round…):

  • 2007: RBMT
  • 2012–2013: SMT
  • 2017–2018: NMT

Illustration: http://vas3k.com/blog/machine_translation/

SLIDE 4

Object of study: Swiss German dialects

SLIDE 5

Rule-based machine translation: Standard German → Swiss German

SLIDE 6

Language variation in rule-based machine translation

Generative dialectology (Veith 1970, 1982)

  • Transformation rules derive a multitude of dialect systems D_i from a single reference system B:
  • #Töpfer#_B → #Häfner#_D33333−46999

My proposal:

  • D: Swiss German dialects
  • B: Modern High German (“Standard German”)
  • Most practical, but not historically correct
  • Dialects are not represented as discrete numbered entities, but as probability maps
  • Example: StdG immer → geng, with the rule’s geographical validity given by a probability map

SLIDE 8

Example rule: Lemma change

{immer} → {immer} | {gäng} | {geng} | {all} | …

  • Probability maps extracted from digitized SDS (Sprachatlas der deutschen Schweiz) maps
  • Rules implemented with the XFST finite-state toolkit
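
To make the mechanism concrete, here is a minimal Python sketch of how such a lemma-change rule can be combined with probability maps. This is an illustration only: the actual system was implemented in XFST, and the locations and probabilities below are invented.

    # Hypothetical probability maps: variant -> {survey location: probability}.
    PROB_MAPS = {
        "immer": {"Zürich": 0.6, "Bern": 0.1},
        "gäng":  {"Zürich": 0.1, "Bern": 0.2},
        "geng":  {"Zürich": 0.1, "Bern": 0.6},
        "all":   {"Zürich": 0.2, "Bern": 0.1},
    }

    def dialect_variants(lemma, location):
        """Rank the dialectal variants of a Standard German lemma by the
        probability their maps assign to the given location."""
        if lemma != "immer":                 # this sketch covers one rule only
            return [(lemma, 1.0)]
        scored = [(v, m.get(location, 0.0)) for v, m in PROB_MAPS.items()]
        return sorted(scored, key=lambda vp: vp[1], reverse=True)

    print(dialect_variants("immer", "Bern"))
    # [('geng', 0.6), ('gäng', 0.2), ('immer', 0.1), ('all', 0.1)]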

SLIDE 10

Example: morphological inflection

ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i

  • schwarz ADJA Nom Sg Fem Pos Weak → schwarz | schwarzi

SLIDE 11

Example: phonological adaptation

Vowel (n d) Vowel → n d | n g | n n | n

  • gestanden → gschtande | gschtange | gschtanne | gschtane
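
The same kind of rule is easy to picture procedurally. Below is a small regular-expression sketch (an illustration, not the XFST implementation) that enumerates the candidate outputs of the intervocalic nd rule; which candidate holds where would again be decided by a probability map:

    import re

    VOWELS = "aeiouäöü"
    # Intervocalic 'nd': preceded and followed by a vowel.
    ND = re.compile(f"(?<=[{VOWELS}])nd(?=[{VOWELS}])")

    def nd_variants(word):
        """Enumerate the dialectal candidates for intervocalic 'nd'."""
        return [ND.sub(repl, word) for repl in ("nd", "ng", "nn", "n")]

    print(nd_variants("gschtande"))
    # ['gschtande', 'gschtange', 'gschtanne', 'gschtane']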

SLIDE 12

Implementation

Finite-state toolkits do not provide functionality for direct integration of probability maps. We simulate this ability with flag diacritics.

ADJA [Nom | Acc] Sg Gender Degree Weak → 0 | i

    define adj-2-fl [ ADJA [Nom | Acc] Sg Gender Degree Weak ->
                      [ 0 "@U.3-254.null@" | i "@U.3-254.i@" ]];

  • The flag diacritics record which variant of map 3-254 was chosen, so that all rules conditioned on the same map make mutually consistent choices

SLIDE 13

Conclusions

  • Difficult to achieve good coverage
  • Dialectologically interesting features vs. features relevant for practical usage
  • Difficult to evaluate on “real” data due to lack of unified writing conventions
  • Veith’s claim that the ordering of rules mirrors their order of historical appearance could not be verified in practice
  • The digitized maps turned out to be more useful than the rule set:
    • Dialectometrical analyses
    • Online map viewer

SLIDE 17

Rule-based machine translation: Standard German → Swiss German

References

  • K. R. Beesley / L. Karttunen (2003): Finite State Morphology. CSLI Publications.
  • R. Hotzenköcherle / R. Schläpfer / R. Trüb / P. Zinsli (eds.) (1962–1997): Sprachatlas der deutschen Schweiz. 8 vols. Bern: Francke.
  • Y. Scherrer (2011): Morphology generation for Swiss German dialects. In: C. Mahlow / M. Piotrowski (eds.): Systems and Frameworks for Computational Morphology – Proceedings of the Second International Workshop (SFCM 2011). Berlin: Springer, 130–140.
  • Y. Scherrer (2014): Computerlinguistische Experimente für die schweizerdeutsche Dialektlandschaft – Maschinelle Übersetzung und Dialektometrie. In: D. Huck (ed.): Alemannische Dialektologie: Dialekte im Kontakt. (ZDL Beihefte 155). Stuttgart: Steiner, 261–278.
  • W. H. Veith (1970): -Explikative +Applikative +Komputative Dialektkartographie. (Germanistische Linguistik 4). Hildesheim: Olms.
  • W. H. Veith (1982): Theorieansätze einer generativen Dialektologie. In: W. Besch / U. Knoop / W. Putschke / H. E. Wiegand (eds.): Dialektologie – Ein Handbuch zur deutschen und allgemeinen Dialektforschung. Berlin, New York: De Gruyter, 277–295.

Digitized SDS maps: http://www.dialektkarten.ch

SLIDE 18

Character-level statistical machine translation: Normalization

SLIDE 19

The data: The ArchiMob corpus

ArchiMob was an oral history project collecting testimonials of the Second World War period in Switzerland. 555 informants from all linguistic regions, both genders and different backgrounds were interviewed (1999–2001). 43 Swiss German interviews were transcribed at the University of Zurich (2006–2018) for dialectological research.

SLIDE 22

The task: Normalization

There is a lot of variation in the transcriptions:

  • Transcription inconsistencies: different transcribers, transcription tools and changing guidelines
  • Dialectal variation: different origins of informants
  • Intra-speaker variation

Goals:

  • Create an additional annotation layer that establishes identities between forms that are felt to be “the same word”
  • Enable dialect-independent corpus search
  • Facilitate further annotation (e.g. part-of-speech tagging)

SLIDE 24

The task: Normalization

Normalization of historical texts (modernization): [French]

  Ce ſeroit une marque de la force de voſtre merite pluſtoſt que de ma facilité.
  Ce serait une marque de la force de votre mérite plutôt que de ma facilité.

Normalization of user-generated content: [Dutch]

  schaaaat, je et em nii nodig wie jou laat gaan is gwn DOM :p Iloveyouuuu
  schat, je hebt hem niet nodig wie jou laat gaan is gewoon dom :p I love you

Normalization of dialectal texts: [German]

  jaa de het me no gluegt tänkt dasch ez de genneraal
  ja dann hat man noch gelugt gedacht das ist jetzt der general

  • Our normalization language is similar but not identical to Standard German

SLIDE 27

The task: Normalization

Six documents of the ArchiMob corpus were normalized manually by our transcribers (30-60 hours/document). Can we use these six documents as training data to normalize the remaining 37 automatically?

  • “Machine translation” from transcribed Swiss German to the normalization language

SLIDE 28

The model: Character-level SMT (CSMT)

Standard SMT systems operate at the word level: they identify sequences of contiguous words (“phrases”) and their translations in a parallel corpus. Character-level SMT systems have been proposed for closely related languages (Vilar et al. 2007, Tiedemann 2009); they identify sequences of contiguous characters instead:

  _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
  _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _
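
As a concrete illustration of the preprocessing, the following Python sketch (assumed here; a phrase-based toolkit such as Moses would then be trained on its output) rewrites a sentence into the character-level representation, with '_' marking word boundaries:

    def to_char_level(sentence):
        """Mark word boundaries with '_' and separate all characters by
        spaces, so that an SMT toolkit treats characters as 'words'."""
        marked = "_" + sentence.replace(" ", "_") + "_"
        return " ".join(marked)

    print(to_char_level("jaa de het me no gluegt tänkt"))
    # _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _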

SLIDE 31

The model: Character-level SMT (CSMT)

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • Estimate: 90% of words normalized correctly

Typical remaining errors:

  Original         CSMT           Correct
  muurermäischter  maurermeister  maurermeistern
  buechs           buchs          buochs
  riintel          reintal        rheintal
  komfiserii       konfiserei     konfiserie
  kaazèt           kazat          kz
  plimut           pleinmut       plymouth
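
For reference, the accuracy estimate amounts to a simple word-level comparison against a manually normalized sample; a sketch (assumed, with toy data):

    def word_accuracy(hypothesis, reference):
        """Fraction of words whose automatic normalization matches the
        manual one (both sequences aligned word by word)."""
        assert len(hypothesis) == len(reference)
        hits = sum(h == r for h, r in zip(hypothesis, reference))
        return hits / len(reference)

    print(word_accuracy(["maurermeister", "buchs"],
                        ["maurermeistern", "buochs"]))   # 0.0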

SLIDE 34

The model: Character-level SMT (CSMT)

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • 3. Train a distinct CSMT model for every text
  • What character sequences do these models identify?
  • How do the frequencies of these sequences vary across texts and dialects?

SLIDE 37

The analysis: Corpus-based dialectology

Example: What are the different dialectal realizations of normalized ck /kʰ/ and what are their geographical distributions?

  • Look for p(∗ | ck) in the phrase tables created by the CSMT systems

Document 1048:

  c h ||| c k ||| 0.09615 0.18776 0.00247 0.03999 ||| 0-0 1-1 ||| 52 2028 5
  g g ||| c k ||| 0.88462 0.00993 0.26136 0.00378 ||| 0-0 1-1 ||| 52 176 46
  k ||| c k ||| 0.01923 0.10652 0.00820 0.03921 ||| 0-1 ||| 52 122 1

Document 1244:

  g g ||| c k ||| 0.04256 0.00023 0.03922 0.00008 ||| 1-0 0-1 ||| 47 51 2
  g ||| c k ||| 0.04256 0.02805 0.00064 0.00066 ||| 0-1 ||| 47 3126 2
  k ||| c k ||| 0.91489 0.67461 0.07597 0.06274 ||| 0-1 ||| 47 566 43

  • Pick one variant (e.g. gg) and plot the probabilities
  • Compare with relevant maps from the dialect atlas SDS
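
A short Python sketch of this lookup (assumed; the file path is hypothetical, and the layout follows the usual Moses phrase-table format, where the first score field is the inverse phrase probability p(source|target)):

    def realizations(phrase_table_path, normalized):
        """Collect p(dialectal sequence | normalized sequence) from a
        Moses-style phrase table with lines of the form
        source ||| target ||| scores ||| alignment ||| counts."""
        probs = {}
        with open(phrase_table_path, encoding="utf-8") as f:
            for line in f:
                fields = [x.strip() for x in line.split("|||")]
                src, tgt, scores = fields[0], fields[1], fields[2].split()
                if tgt == normalized:
                    probs[src] = float(scores[0])
        return probs

    # e.g. realizations("model-1048/phrase-table", "c k")
    # -> {'c h': 0.09615, 'g g': 0.88462, 'k': 0.01923}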

SLIDE 42

Dialectal gg ↔ Normalized ck (Teggi ↔ Decke)

[Map of Switzerland: p(gg|ck) per text, legend from 0.0 to 1.0. Green areas: SDS map 2/095 “drücken”, variant gg.]

SLIDE 44

Dialectal ui ↔ Normalized au (Muis ↔ Maus)

[Map of Switzerland: p(ui|au) per text, legend from 0.00 to 1.0. Green areas: SDS map 1/106 “Maus”, variant ui.]

SLIDE 46

Dialectal u ↔ Normalized ll (Täuer ↔ Teller)

[Map of Switzerland: p(u|ll) per text, legend from 0.0 to 1.0. Green areas: SDS map 2/196 “Teller”, variant u.]

SLIDE 48

Dialectal n ↔ Normalized nn (Tane ↔ Tanne)

[Map of Switzerland: p(n|nn) per text, legend from 0.0 to 1.0. Green areas: SDS map 2/186 “Tanne”, variant n.]

SLIDE 50

Conclusions: Corpus-based dialectology with CSMT

  • Multi-dialectal corpora are fun to work with, but can be problematic due to transcription inconsistencies
  • Normalization provides comparability
  • Counting the frequency of dialectal u is not enough, because u occurs in many other contexts
  • Cf. part-of-speech tagging for dialect syntax
  • Finer-grained search in phrase tables could be useful
  • Example: V u V ↔ V l l V
  • Automatic procedures for the discovery of interesting features could be useful

SLIDE 54

Corpus-based dialectology with CSMT

References

  • P. Koehn (2010): Statistical Machine Translation. Cambridge: Cambridge University Press.
  • T. Samardžić / Y. Scherrer / E. Glaser (2016): ArchiMob – a corpus of spoken Swiss German. In: Proceedings of LREC 2016. Portorož, 4061–4066.
  • Y. Scherrer / N. Ljubešić (2016): Automatic normalisation of the Swiss German ArchiMob corpus using character-level machine translation. In: Proceedings of KONVENS 2016 (Bochumer Linguistische Arbeitsberichte). Bochum, 248–255.
  • Y. Scherrer / T. Samardžić / E. Glaser (2019): Digitising Swiss German – How to process and study a polycentric spoken language. In: Language Resources and Evaluation.
  • J. Tiedemann (2009): Character-based PSMT for closely related languages. In: Proceedings of the 13th Conference of the European Association for Machine Translation (EAMT 2009). Barcelona, 12–19.
  • D. Vilar / J.-T. Peter / H. Ney (2007): Can we translate letters? In: Proceedings of the Second Workshop on Statistical Machine Translation. Prague, 33–39.

ArchiMob corpus: https://www.spur.uzh.ch/en/departments/research/textgroup/ArchiMob.html
CSMTiser: https://github.com/clarinsi/csmtiser

SLIDE 55

Multi-dialectal neural machine translation: Dialect embeddings

SLIDE 56

Neural machine translation (NMT)

NMT uses deep neural networks to transform sequences of the source language to sequences of the target language:

Illustration: http://vas3k.com/blog/machine_translation/

  • NMT has almost entirely replaced SMT in “common” machine translation tasks
  • Can NMT also be used in our character-level normalization setting?

SLIDE 58

Character-level NMT with dialect embeddings

  • 1. Train a single CSMT model on the six normalized texts
  • 2. Apply this model to produce normalizations for the 37 remaining texts
  • 3. Train a single CNMT model for all texts, adding a text source label at the beginning of each utterance (see the sketch after this list):

  <1007> _ j a a _ d e _ h e t _ m e _ n o _ g l u e g t _ t ä n k t _
  _ j a _ d a n n _ h a t _ m a n _ n o c h _ g e l u g t _ g e d a c h t _

  • CNMT models don’t use fixed-size windows, so the labels remain visible until the end of the sentence
  • The model can learn to condition some transformations on the label
  • The model can infer that some labels behave similarly
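
A sketch of the data preparation (assumed; a character-level toolkit such as OpenNMT would be trained on such pairs). The only change from the CSMT setting is the pseudo-token identifying the source text:

    def make_example(doc_id, dialect, normalized):
        """Build one (source, target) training pair with a text source label."""
        char = lambda s: " ".join("_" + s.replace(" ", "_") + "_")
        return f"<{doc_id}> {char(dialect)}", char(normalized)

    src, tgt = make_example("1007",
                            "jaa de het me no gluegt tänkt",
                            "ja dann hat man noch gelugt gedacht")
    print(src)   # <1007> _ j a a _ d e _ h e t _ m e _ n o _ ...
    print(tgt)   # _ j a _ d a n n _ h a t _ m a n _ n o c h _ ...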

SLIDE 60

Character-level NMT with dialect embeddings

  • NMT models produce embeddings of their input and output tokens
  • Embeddings are just vectors of real numbers

Illustration: https://www.sdl.com/ilp/language/neural-machine-translation.html

SLIDE 61

Character-level NMT with dialect embeddings

NMT models produce embeddings of their input and output tokens. Embeddings are just vectors of real numbers (500 in our case).

  • Let us just look at the embeddings of the text source labels
  • Let us apply a dimensionality reduction method (MDS, PCA, t-SNE, …) to visualize the results
  • Example: PCA, 3 dimensions (see the sketch below)
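
In code, this analysis step could look as follows (a sketch with scikit-learn; the labels and embedding values are placeholders, and in practice the vectors would be read out of the trained model’s embedding matrix):

    import numpy as np
    from sklearn.decomposition import PCA

    labels = [f"<doc{i}>" for i in range(5)]         # text source labels
    embeddings = np.random.rand(len(labels), 500)    # placeholder vectors

    pca = PCA(n_components=3)
    coords = pca.fit_transform(embeddings)           # shape: (n_labels, 3)

    # Each component can then be plotted on a map of Switzerland and
    # compared with transcriber identity, longitude and latitude.
    for label, (c1, c2, c3) in zip(labels, coords):
        print(label, round(c1, 3), round(c2, 3), round(c3, 3))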

SLIDE 64

Character-level NMT with dialect embeddings

[Map: PCA reduction, component 1/3, with transcriber initials plotted at each text’s location]

Correlation ratio: η = 0.816

SLIDE 66

Character-level NMT with dialect embeddings

[Map: PCA reduction, component 2/3]

Correlation with longitude: Pearson’s r = 0.487, p < 0.001

SLIDE 68

Character-level NMT with dialect embeddings

[Map: PCA reduction, component 3/3]

Correlation with latitude: Pearson’s r = 0.499, p < 0.001
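
The reported correlations are plain Pearson correlations between one PCA coordinate per text and a geographic coordinate of its informant; a sketch with invented values:

    import numpy as np
    from scipy.stats import pearsonr

    component = np.array([-0.8, -0.3, 0.1, 0.4, 0.9])     # one PCA coordinate per text
    latitude  = np.array([46.2, 47.0, 46.8, 47.4, 47.5])  # informants' origins

    r, p = pearsonr(component, latitude)
    print(f"Pearson's r = {r:.3f}, p = {p:.3g}")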

SLIDE 69

Character-level NMT: Conclusions

The model learns that the normalization depends on

  • the transcriber
  • the geographic origin of the text.

Open questions:

  • Not all NMT architectures and dimensionality reduction methods work equally well
  • What is the overall normalization quality of NMT?
  • For which types of transformations does the model “look” at the dialect label?

SLIDE 71

Character-level NMT with dialect embeddings

References

  • D. Bahdanau / K. Cho / Y. Bengio (2015): Neural machine translation by jointly learning to align and translate. In: Proceedings of ICLR 2015.
  • G. Klein / Y. Kim / Y. Deng / J. Senellart / A. M. Rush (2017): OpenNMT: Open-source toolkit for neural machine translation. In: arXiv preprint arXiv:1701.02810. http://opennmt.net/
  • L. J. P. van der Maaten / G. E. Hinton (2008): Visualizing high-dimensional data using t-SNE. In: Journal of Machine Learning Research 9, 2579–2605.
  • A. Vaswani / N. Shazeer / N. Parmar / J. Uszkoreit / L. Jones / A. N. Gomez / Ł. Kaiser / I. Polosukhin (2017): Attention is all you need. In: Advances in Neural Information Processing Systems, 5998–6008.
  • R. Östling / J. Tiedemann (2017): Continuous multilinguality with language vectors. In: Proceedings of EACL 2017, 644–649.

SLIDE 72

Conclusions

1. Rule-based MT

  • Generative dialectology: from standard to dialect
  • Knowledge-driven, i.e. dialect atlas-driven
  • Maps are prerequisites
  • Results: evaluation on dialect identification and generation, (in-)validation of Veith’s claims, dialectometrical analyses

2. Statistical and neural MT

  • Normalization: from dialect to “standard”
  • Data-driven, i.e. dialect corpus-driven
  • Maps result from model parameters
  • Results: emerging properties of the normalization process and of dialect texts
