Transcription, transliteration, transduction, and translation

  1. Transcription, transliteration, transduction, and translation
     A typology of crosslinguistic name representation strategies
     Deryle Lonsdale, BYU Linguistics, lonz@byu.edu
     FHTW 2005

  2. The crossroads
     Many NLP applications treat personal names:
     - (CL)IR of text (MUC, TREC, TIPSTER)
     - (CL)IR of spoken documents (TDT)
     - Information extraction (ACE)
     - i18n, l10n
     - OCR/digitization
     - Semantic Web annotation
     - Homeland security and DoD (Aladdin, REFLEX)
     and, of course,
     - Family history research (PAF, TMG, etc.)

  3. The problem
     Storing and accessing proper nouns crosslinguistically:
     bʊʃ, ブッシュ, 부시, बुश, بوش, բուշ, 布什, Буш, Μπους, Bush, ?

  4. What we won’t address...
     - Other types of proper nouns (organizations, countries, etc.)
     - Position and title modifiers
     - Selection and ordering of name components (surname, patronymics, etc.)
     - Nicknames and hypocoristics
     - Morphological variants (case, honorifics)
     - Coreference, reduced forms, subsequent mentions

  5. Issues
     - Scope: some 6,000 languages
     - Various types of writing systems
     - Conventions: culturally/linguistically set
     - Crosslinguistic: migrations, minorities
     - Diachrony: spelling changes over time
     - Innovation: names are continually invented
     - Borrowings: names cross barriers

  6. Writing systems
     - Alphabetic: (roughly) one symbol / sound
       - Roman (Bush), Armenian (բուշ), Georgian, etc.
     - Syllabic: (usually) one symbol / syllable
       - Hiragana, Katakana (ブッシュ), Cherokee, etc.
     - Abugidic (alphasyllabic): CV*
       - Devanagari (बुश), Inuktitut, Lao, Thai, Tibetan, etc.
     - Logographic: (roughly) one symbol / word
       - Hieroglyphs, Hieratic, Cuneiform, Hanzi (布什), etc.

  7. Special cases
     - Hangul
       - underlyingly alphabetic
       - sounds are arranged compositionally into syllabic symbols (부시)
     - Abjads
       - alphabetic, but without (some/all) vocalization
       - e.g. Arabic, Hebrew, Persian (بوش)

  8. Normalization
     - Direction
       - left-right vs. right-left
       - horizontal vs. vertical
       - boustrophedonic
     - Case
       - DeVon vs. Devon
     - Vocalization
       - McConnell, St. John
     - Diacritics
       - Étienne vs. Etienne
     - Punctuation
     - Abbreviations
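The case and diacritic items on this slide can be illustrated with a minimal sketch. It assumes Python's standard `unicodedata` module; the function name `normalize_name` is hypothetical, and the folding is deliberately lossy, so it is a matching key rather than a display form.

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Return a lossy matching key: no combining diacritics, case folded."""
    # NFD splits a precomposed letter such as É into E + combining acute.
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()

print(normalize_name("Étienne") == normalize_name("Etienne"))
print(normalize_name("DeVon") == normalize_name("Devon"))
```

Both comparisons succeed only because information is thrown away, which is the trade-off the slide points at: the original capitalization and accents still have to be stored separately if they matter.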

  9. Related computational aspects
     - Character sets, fonts, glyphs
     - Input/output (keyboard, display)
     - Collation (ordering, alphabetization)

  10. A few mapping strategies
     - Don’t bother: lexical lookup
     - Transcoding
     - Transcription
     - Transliteration
     - Transduction
     - Translation

  11. Lexical lookup
     - Rote, literal access (e.g. hash tables)
     - Unending, expensive lexicon management task
     - Some automation possible (bitext, text mining)
     - Bush → 布殊
     - Some large-scale commercial undertakings
       - Hundreds of millions of names and variants, primarily European
       - Similar efforts exist for CJK conversion via lookup

  12. Transcoding
     - Rote (mostly) character-by-character symbol conversion (e.g. Unix recode)
       - x44 x61 x6e → xee xb3 xdd
     - Even codes within a language vary
       - 布什 (Mainland China), 布希 (Taiwan), 布殊 (Hong Kong)
       - Osama bin Laden: 10 Hanzi variants
     - Unicode helps, but does not solve the problems
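Transcoding in the sense of this slide can be sketched with Python's built-in codecs. This is only an illustration of byte-level recoding, not how Unix recode is implemented; the helper name `recode` is hypothetical, and KOI8-R is picked here simply as a legacy Cyrillic encoding that Python ships with.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Re-express the same characters in a different character encoding."""
    return data.decode(source).encode(target)

# The byte values quoted on the slide for the Roman-script name.
print("Dan".encode("ascii").hex(" "))  # 44 61 6e

# The same three Cyrillic letters occupy different bytes in each encoding.
koi8 = "Буш".encode("koi8_r")
utf8 = recode(koi8, "koi8_r", "utf-8")
print(koi8.hex(" "))
print(utf8.hex(" "))
```

Recoding changes only the byte representation. It cannot choose between 布什, 布希 and 布殊, which is why the slide treats Unicode as a help rather than a solution.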

  13. Transcription
     - Conversion: (spoken) words → script
       - SAMPA (ASCII)
       - International Phonetic Alphabet (linguistics)
       - Bush → bʊʃ
     - Usually spoken language = transcribed language
     - Sometimes used as a strategy for crosslinguistic textual conversion
     - Variation is a problem: whose dialectal/idiolectal pronunciation should be used?
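The relationship between the two notations named on this slide can be sketched as a symbol table. The table below is a small hand-picked subset of SAMPA, and the name `sampa_to_ipa` is hypothetical; real SAMPA also has multi-character symbols and language-specific variants that a character-by-character lookup like this does not handle.

```python
# A few single-character SAMPA symbols and their IPA equivalents (subset only).
SAMPA_TO_IPA = {
    "U": "ʊ",
    "S": "ʃ",
    "Z": "ʒ",
    "I": "ɪ",
    "T": "θ",
    "D": "ð",
    "N": "ŋ",
    "@": "ə",
}

def sampa_to_ipa(sampa: str) -> str:
    """Map each SAMPA character to IPA, passing unknown characters through."""
    return "".join(SAMPA_TO_IPA.get(ch, ch) for ch in sampa)

print(sampa_to_ipa("bUS"))
```

The conversion between notations is mechanical. The hard part, as the slide notes, is deciding which pronunciation to transcribe in the first place.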

  14. Transliteration
     - Rewrite symbols of source language in target alphabet
       - Bush → Буш
     - Source/target sounds don’t always align
       - 32 English spellings for Muammar Gaddafi
       - 6 Arabic spellings for Clinton
     - Sensitive to properties of target language
       - e.g. Yuschenko vs. Iouchtchenko
     - Romanization chaos: scores of schemes
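The sensitivity to the target language can be made concrete with a toy sketch. The two tables below are illustrative assumptions covering only the letters needed here; they are not any official romanization scheme, and the names `ENGLISH_STYLE`, `FRENCH_STYLE` and `transliterate` are hypothetical.

```python
# Toy Cyrillic-to-Roman tables following English and French spelling habits.
ENGLISH_STYLE = {
    "Ю": "Yu", "щ": "shch", "е": "e", "н": "n", "к": "k", "о": "o",
    "Б": "B", "у": "u", "ш": "sh",
}
FRENCH_STYLE = {
    "Ю": "Iou", "щ": "chtch", "е": "e", "н": "n", "к": "k", "о": "o",
    "Б": "B", "у": "ou", "ш": "ch",
}

def transliterate(text: str, table: dict) -> str:
    """Rewrite each source symbol using the table; unknown symbols pass through."""
    return "".join(table.get(ch, ch) for ch in text)

for name in ("Ющенко", "Буш"):
    print(name, transliterate(name, ENGLISH_STYLE), transliterate(name, FRENCH_STYLE))
```

With these tables the same Cyrillic string comes out as Yushchenko under English conventions and Iouchtchenko under French ones, so the Roman forms cannot be matched by string equality alone.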

  15. Transduction
     - Mapping variable correspondences (transcription, transliteration), often (probabilistic) rule-based
     - Implemented via algorithmic finite-state automata
     - e.g. Soundex (Russell, American, Daitch-Mokotoff), others
     - Bush → buS
     Alternatives generated for a single surname, by method:
     - Alternate spellings based upon easily confused letters: Bcller, Bebler, Beiler, Belber, Belier, Bellcr, Bellen, Bellor, Boller, Bcbler, and 152 others...
     - American soundex alternatives: Beler, Beller
     - Daitch-Mokotoff soundex alternatives: Aueler, Beler, Fbeler, Feler, Peler, Pfeler, Ppheler, Veler, Weler

  16. Problems with Soundex
     - Long names: Sivaramakrishnarao, Sivaramakrishnan, Sivaramarao
     - Implausible collapses
     - Anglocentric
     - Alphabetic-based
     - Not very efficient distributionally
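The long-name problem is easy to reproduce with a minimal sketch of American Soundex. The code follows the commonly published rules (keep the first letter, code the consonants, treat h and w as transparent, truncate or pad to four characters); the function name `soundex` is an illustrative choice, and dropping non-ASCII letters is a simplifying assumption of this sketch, not part of the scheme.

```python
# Digit classes of American Soundex; vowels, y, h and w carry no digit.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name: str) -> str:
    """Return the four-character American Soundex code of a name."""
    # Simplification: letters outside ASCII are discarded rather than mapped.
    letters = [ch for ch in name.lower() if ch.isalpha() and ch.isascii()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate two letters with the same code
        digit = CODES.get(ch, "")  # vowels and y yield "" and reset the run
        if digit and digit != previous:
            code += digit
        previous = digit
    return (code + "000")[:4]  # pad short codes, truncate long ones

for name in ("Sivaramakrishnarao", "Sivaramakrishnan", "Sivaramarao", "Bush"):
    print(name, soundex(name))
```

All three long names receive the same code because only the first few consonants survive truncation, and the Roman-letter classes are what make the scheme alphabetic-based and Anglocentric.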

  17. Translation
     - Most widely used when logographic system is used
     - Names are rendered non-literally, non-phonemically to/from logograph (sequence)
     - Great Salt Lake → 大鹽湖
     - Creative, most opaque of mapping schemes

  18. Common techniques used
     - Machine learning
     - Statistical/stochastic approaches (e.g. n-grams)
     - Entropy/noisy channel approaches
     - Rule-based transformational approaches
     - String matching algorithms
       - Levenshtein edit distance (similarity measure)
       - Dynamic programming techniques
     - Speech processing (recognition, TTS)
     - Bitext mining, alignment metrics, indexing
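Of the techniques listed, Levenshtein edit distance is compact enough to sketch in full. This is the textbook dynamic-programming formulation with unit costs for insertion, deletion and substitution; the function name `levenshtein` and the sample name pairs are illustrative choices rather than data from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

for left, right in [("Bush", "Busch"), ("Gaddafi", "Qaddafi"), ("Gaddafi", "Kadhafi")]:
    print(left, right, levenshtein(left, right))
```

Keeping only two rows bounds memory by the length of the second string. The measure operates on code points, so it only compares names that are already in the same script.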

  19. What’s the best method?
     - One of the schemes listed previously
     - All approaches are information-losing propositions
     - Hybrid approaches combining several of these
       - Pipeline results
       - Poll different engines for optimal results
     - How to generalize beyond a handful of languages?

  20. The direct model
     - Pairwise conversion between specific languages
     - Potentially n x m components
     - Not all pairs will likely be needed, though
     - Developer expertise a problem

  21. The pivot model
     - Neutral “interlingua” or pivot
     - n + m components
     - What could serve as the pivot?
     - Some small-scale examples exist
       - ISCII for Dravidian-script (South Asian) languages
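The component counts quoted for the direct and pivot models can be compared with a short calculation. The language counts in the loop are arbitrary illustrative values, and the function names are hypothetical.

```python
def direct_components(n_sources: int, m_targets: int) -> int:
    """Direct model: one converter per (source, target) pair."""
    return n_sources * m_targets

def pivot_components(n_sources: int, m_targets: int) -> int:
    """Pivot model: one converter into the pivot per source, one out per target."""
    return n_sources + m_targets

for n, m in [(3, 3), (20, 20), (100, 100)]:
    print(n, m, direct_components(n, m), pivot_components(n, m))
```

The direct count is an upper bound, since not every pair will be needed, but the gap widens quickly as languages are added, which is the argument for looking for a pivot.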

  22. Pivot desiderata
     - Neutral representation scheme
     - Should address all possible writing systems
     - Should assure as lossless a conversion as possible
     - Should encode all necessary information
     - Principled enough to allow algorithmic implementation
     - Generative capability necessary
     - Is it even possible to have only one pivot?

  23. Pivot = alphabet?
     - English?
       - Consistency: very bad sound/symbol mapping
       - Anglocentricity
     - IPA?
       - Transparency: difficult for non-linguists
       - Comprehensive, but not totally adequate
     - Logographs would be problematic

  24. Pivot = syllabic?
     - Not as intuitive to alphabet users
     - Syllable definition is still debated in some languages
     - Ambisyllabicity
       - Mary, Brigham, Deryle

  25. Pivot = logographic?
     - Need to invent character (sequences)
     - Meaning is not always obvious
     - Impracticality: complexity of representation, script

  26. An articulated pivot approach
     - More than one “pivot”, feeding into each other
     - n + m + p components
     - Allows grouping of typologically similar languages
     - Intra-pivot links could represent current research results (most commonly used languages)

  27. Conclusions
     - Rich area for current research
     - The issues are daunting
     - Various approaches are being implemented
     - MT has tackled some of the same problems
     - A principled solution might involve some type of articulated pivot
     - Open annotation environment, sharable resources, algorithm libraries
     - Genealogists can contribute
