Transcription, transliteration, transduction, and translation



SLIDE 1

Transcription, transliteration, transduction, and translation

A typology of crosslinguistic name representation strategies

Deryle Lonsdale
BYU Linguistics
lonz@byu.edu

FHTW 2005

SLIDE 2

The crossroads

• Many NLP applications treat personal names
  - (CL)IR of text (MUC, TREC, TIPSTER)
  - (CL)IR of spoken documents (TDT)
  - Information extraction (ACE)
  - i18n, l10n
  - OCR/digitization
  - Semantic Web annotation
  - Homeland security and DoD (Aladdin, REFLEX)

and, of course,

  - Family history research (PAF, TMG, etc.)

SLIDE 3

The problem

• Storing and accessing proper nouns crosslinguistically

[Figure: one name, many written forms: Bush, bʊʃ, Буш, Μπους, բուշ, बुश, بوش, 布什, ブッシュ, 부시, ?]

SLIDE 4

What we won’t address...

• Other types of proper nouns (organizations, countries, etc.)
• Position and title modifiers
• Selection and ordering of name components (surname, patronymics, etc.)
• Nicknames and hypocoristics
• Morphological variants (case, honorifics)
• Coreference, reduced forms, subsequent mentions

SLIDE 5

Issues

• Scope: some 6,000 languages
• Various types of writing systems
• Conventions: culturally/linguistically set
• Crosslinguistic: migrations, minorities
• Diachrony: spelling changes over time
• Innovation: names are continually invented
• Borrowings: names cross barriers

SLIDE 6

Writing systems

• Alphabetic: (roughly) one symbol / sound
  - Roman (Bush), Armenian (բուշ), Georgian, etc.
• Syllabic: (usually) one symbol / syllable
  - Hiragana, Katakana (ブッシュ), Cherokee, etc.
• Abugidic (alphasyllabic): CV*
  - Devanagari (बुश), Inuktitut, Lao, Thai, Tibetan, etc.
• Logographic: (roughly) one symbol / word
  - Hieroglyphs, Hieratic, Cuneiform, Hanzi (布什), etc.

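The writing-system typology can be approximated in code by reading the script label that Unicode attaches to each character name. This is only a sketch: `script_of`, `writing_system_type`, and the `WRITING_SYSTEM_TYPE` table are names assumed here, the table merely restates a few of the slide's examples, and taking the first word of a character name is a heuristic rather than an official script property.

```python
import unicodedata
from collections import Counter

# Hypothetical table restating the slide's typology. Keys are the first
# word of the Unicode character name (a heuristic, not a script property).
WRITING_SYSTEM_TYPE = {
    "LATIN": "alphabetic",
    "ARMENIAN": "alphabetic",
    "GEORGIAN": "alphabetic",
    "HIRAGANA": "syllabic",
    "KATAKANA": "syllabic",
    "CHEROKEE": "syllabic",
    "DEVANAGARI": "abugida",
    "THAI": "abugida",
    "CJK": "logographic",
}

def script_of(text: str) -> str:
    """Return the most common script label among the characters of text."""
    labels = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if not ch.isspace()
    )
    if not labels:
        return "UNKNOWN"
    return labels.most_common(1)[0][0]

def writing_system_type(text: str) -> str:
    """Map a name to the slide's typology, or 'unknown' if not in the table."""
    return WRITING_SYSTEM_TYPE.get(script_of(text), "unknown")

print(script_of("ブッシュ"))          # KATAKANA
print(writing_system_type("布什"))    # logographic
```

Scripts that the table does not list (Hangul, Arabic, and others) come back as "unknown", which is a reminder that a single label per script is a simplification.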
SLIDE 7

Special cases

• Hangul
  - underlyingly alphabetic
  - sounds are arranged compositionally into syllabic symbols (부시)
• Abjads
  - alphabetic, but without (some/all) vocalization
  - e.g. Arabic, Hebrew, Persian (بوش)

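The compositional nature of Hangul is visible in Unicode itself: canonical decomposition (NFD) splits each precomposed syllable block into its alphabetic letters (jamo), and NFC puts them back together. A minimal sketch using Python's standard `unicodedata` module; the function name `hangul_to_jamo` is an assumption made here.

```python
import unicodedata

def hangul_to_jamo(text: str) -> str:
    """Decompose precomposed Hangul syllables into their component jamo."""
    return unicodedata.normalize("NFD", text)

jamo = hangul_to_jamo("부시")
for ch in jamo:
    # Prints the code point and the Unicode name of each letter.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Recomposition restores the original syllable blocks.
assert unicodedata.normalize("NFC", jamo) == "부시"
```

The two syllable blocks come apart into four letters, a consonant and a vowel for each syllable.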
SLIDE 8

Normalization

• Direction
  - left-right vs. right-left
  - horizontal vs. vertical
  - boustrophedonic
• Case
  - DeVon vs. Devon
• Vocalization
  - McConnell, St. John
• Diacritics
  - Étienne vs. Etienne
• Punctuation
• Abbreviations

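Two of the normalization steps listed, case and diacritics, are easy to sketch with Python's standard library. The helper names `strip_diacritics` and `match_key` are assumptions made here, and this is one possible matching key, not a recommended standard.

```python
import unicodedata

def strip_diacritics(name: str) -> str:
    """Remove combining marks: decompose, drop the marks, recompose."""
    decomposed = unicodedata.normalize("NFD", name)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

def match_key(name: str) -> str:
    """Hypothetical comparison key: no diacritics, no case distinctions."""
    return strip_diacritics(name).casefold()

print(strip_diacritics("Étienne"))                # Etienne
print(match_key("DeVon") == match_key("Devon"))   # True
```

Such a key is lossy by design, so it would normally be stored next to the original spelling rather than in place of it.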
SLIDE 9

Related computational aspects

• Character sets, fonts, glyphs
• Input/output (keyboard, display)
• Collation (ordering, alphabetization)

SLIDE 10

A few mapping strategies

• Don’t bother: lexical lookup
• Transcoding
• Transcription
• Transliteration
• Transduction
• Translation

SLIDE 11

Lexical lookup

• Rote, literal access (e.g. hash tables)
• Unending, expensive lexicon management task
• Some automation possible (bitext, text mining)
  - Bush → 布殊
• Some large-scale commercial undertakings
  - Hundreds of millions of names and variants, primarily European
• Similar efforts exist for CJK conversion via lookup
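Rote lookup amounts to a hash table keyed by name and target. A minimal sketch follows; the `LEXICON` table and `lookup` function are hypothetical, the locale tags are assumptions, and the entries are renderings of "Bush" that appear in this deck.

```python
from typing import Optional

# Hypothetical hand-built lexicon: (name, target locale) -> stored form.
LEXICON = {
    ("Bush", "zh-CN"): "布什",
    ("Bush", "zh-TW"): "布希",
    ("Bush", "zh-HK"): "布殊",
    ("Bush", "ru"): "Буш",
}

def lookup(name: str, target: str) -> Optional[str]:
    """Return the stored rendering, or None when the lexicon has no entry."""
    return LEXICON.get((name, target))

print(lookup("Bush", "zh-HK"))     # 布殊
print(lookup("Clinton", "zh-HK"))  # None
```

The second call shows the maintenance problem: any name that nobody has entered simply has no answer.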

SLIDE 12

Transcoding

• Rote (mostly) character-by-character symbol conversion (e.g. Unix recode)
  - x44 x61 x6e → xee xb3 xdd
• Even codes within a language vary
  - 布什 (Mainland China)
  - 布希 (Taiwan)
  - 布殊 (Hong Kong)
• Osama bin Laden: 10 Hanzi variants
• Unicode helps, but does not solve the problems

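Transcoding in the `recode` sense can be sketched with Python's built-in codecs: decode bytes from one character set, re-encode them in another. The choice of KOI8-R, Windows-1251, and UTF-8 is an assumption made for illustration and is not taken from the slide; `transcode` is a hypothetical helper.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one character encoding to another."""
    return data.decode(source).encode(target)

name = "Буш"
koi8 = name.encode("koi8_r")                  # one byte per letter
cp1251 = transcode(koi8, "koi8_r", "cp1251")  # same letters, different bytes
utf8 = transcode(cp1251, "cp1251", "utf-8")   # two bytes per letter

print(koi8.hex(" "), "|", cp1251.hex(" "), "|", utf8.hex(" "))

# Reading bytes with the wrong code table silently yields a different string.
print(koi8.decode("cp1251") == name)  # False
```

The conversion itself is mechanical; the hard part in practice is knowing which source encoding the bytes were written in.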
SLIDE 13

Transcription

• Conversion: (spoken) words → script
  - SAMPA (ASCII)
  - International Phonetic Alphabet (linguistics)
  - Bush → bʊʃ
• Usually spoken language = transcribed language
• Sometimes as a strategy for crosslinguistic textual conversion
• Variation is a problem: whose dialectal/idiolectal pronunciation should be used?
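The two notations named above encode the same sounds, so moving between them is a table lookup. The sketch below covers only a handful of symbols; `SAMPA_TO_IPA` and `sampa_to_ipa` are names assumed here, and converting one character at a time is a simplification that a fuller SAMPA table would not always allow.

```python
# Tiny, hypothetical subset of a SAMPA-to-IPA table.
SAMPA_TO_IPA = {
    "U": "ʊ",
    "S": "ʃ",
    "I": "ɪ",
    "N": "ŋ",
    "T": "θ",
    "D": "ð",
    "Z": "ʒ",
    "@": "ə",
}

def sampa_to_ipa(sampa: str) -> str:
    """Replace each listed SAMPA symbol; pass everything else through."""
    return "".join(SAMPA_TO_IPA.get(symbol, symbol) for symbol in sampa)

print(sampa_to_ipa("bUS"))  # bʊʃ
```

The table says nothing about which pronunciation to transcribe in the first place, which is the variation problem raised in the last bullet.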

SLIDE 14

Transliteration

• Rewrite symbols of source language in target alphabet
  - Bush → Буш
• Source/target sounds don’t always align
  - 32 English spellings for Muammar Gaddafi
  - 6 Arabic spellings for Clinton
• Sensitive to properties of target language
  - e.g. Yuschenko vs. Iouchtchenko
• Romanization chaos: scores of schemes

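The sensitivity to the target language can be shown with two symbol tables applied to the same Cyrillic source. Both tables are hypothetical and deliberately tiny: they contain just enough letters to reproduce the two spellings on the slide and do not implement any published romanization scheme.

```python
# Hypothetical, partial tables: English-oriented vs. French-oriented spelling.
EN_STYLE = {"Ю": "Yu", "Б": "B", "у": "u", "ш": "sh", "щ": "sch",
            "е": "e", "н": "n", "к": "k", "о": "o"}
FR_STYLE = {"Ю": "Iou", "Б": "B", "у": "ou", "ш": "ch", "щ": "chtch",
            "е": "e", "н": "n", "к": "k", "о": "o"}

def transliterate(name: str, table: dict) -> str:
    """Rewrite each source symbol using the table; keep unknown symbols."""
    return "".join(table.get(ch, ch) for ch in name)

print(transliterate("Ющенко", EN_STYLE))  # Yuschenko
print(transliterate("Ющенко", FR_STYLE))  # Iouchtchenko
```

One source spelling yields two target spellings, and going back from either Latin form to Cyrillic would need its own table, which is where the proliferation of schemes comes from.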
SLIDE 15

Transduction

• Mapping variable correspondences (transcription, transliteration), often (probabilistic) rule-based
• Implemented via algorithmic finite-state automata
  - e.g. Soundex (Russell, American, Daitch-Mokotoff), others
  - Bush → बुश

Alternate spellings based upon easily confused letters: Bcller, Bebler, Beiler, Belber, Belier, Bellcr, Bellen, Bellor, Boller, Bcbler, and 152 others...
American soundex alternatives: Beler, Beller
Daitch-Mokotoff soundex alternatives: Aueler, Beler, Fbeler, Feler, Peler, Pfeler, Ppheler, Veler, Weler
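American Soundex, one of the variants named above, is small enough to sketch in full. This follows the commonly described rules (keep the first letter, code the remaining consonants, collapse repeated codes, ignore H and W, pad to four characters); the function name `soundex` and the handling of non-letters are choices made here, and the Russell and Daitch-Mokotoff variants differ in their details.

```python
# Letter groups of American Soundex; vowels, H, W and Y have no code.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in group:
        CODES[letter] = digit

def soundex(name: str) -> str:
    """Return the four-character American Soundex code of a name."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    digits = []
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeat after it counts again
    return (letters[0] + "".join(digits) + "000")[:4]

print(soundex("Bush"))                      # B200
print(soundex("Beler"), soundex("Beller"))  # B460 B460
```

Because only four positions survive, long names that share an opening, such as Sivaramakrishnarao and Sivaramarao, also receive the same code.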

SLIDE 16

Problems with Soundex

• Long names: Sivaramakrishnarao, Sivaramakrishnan, Sivaramarao
• Implausible collapses
• Anglocentric
• Alphabetic-based
• Not very efficient distributionally

SLIDE 17

Translation

• Most widely used when logographic system is used
• Names are rendered non-literally, non-phonemically to/from logograph (sequence)
  - Great Salt Lake → 大鹽湖
• Creative, most opaque of mapping schemes

SLIDE 18

Common techniques used

• Machine learning
• Statistical/stochastic approaches (e.g. n-grams)
• Entropy/noisy channel approaches
• Rule-based transformational approaches
• String matching algorithms
• Levenshtein edit distance (similarity measure)
• Dynamic programming techniques
• Speech processing (recognition, TTS)
• Bitext mining, alignment metrics, indexing

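Two of the techniques listed, Levenshtein edit distance and dynamic programming, meet in one short routine. The sketch below is the textbook row-by-row formulation with unit costs; the function name `levenshtein` is an assumption made here, and real name matchers often weight substitutions by how easily two symbols are confused.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

print(levenshtein("Beler", "Beller"))  # 1
print(levenshtein("DeVon", "Devon"))   # 1
```

A small distance suggests two spellings may be variants of one name, although the threshold for "small" has to be chosen per application.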
SLIDE 19

What’s the best method?

• One of the schemes listed previously
• All approaches are information-losing propositions
• Hybrid approaches combining several of these
  - Pipeline results
  - Poll different engines for optimal results
• How to generalize beyond a handful of languages?

SLIDE 20

The direct model

• Pairwise conversion between specific languages
• Potentially n x m components
• Not all pairs will likely be needed, though
• Developer expertise a problem

SLIDE 21

The pivot model

• Neutral “interlingua” or pivot
• n + m components
• What could serve as the pivot?
• Some small-scale examples exist
  - ISCII for Dravidian-script (South Asian) languages
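The component counts for the direct and pivot models, and the way a pivot conversion is assembled from two halves, can be sketched as follows. Everything here is illustrative: the function names are assumptions, and the two toy converters use a single-entry lookup with an IPA-like pivot purely to have something to compose.

```python
from typing import Callable

Converter = Callable[[str], str]

def direct_components(n: int, m: int) -> int:
    """Direct model: one converter per (source, target) pair."""
    return n * m

def pivot_components(n: int, m: int) -> int:
    """Pivot model: one converter into the pivot per source, one out per target."""
    return n + m

def via_pivot(to_pivot: Converter, from_pivot: Converter) -> Converter:
    """Build a source-to-target converter by composing two pivot components."""
    def convert(name: str) -> str:
        return from_pivot(to_pivot(name))
    return convert

# Toy components, for illustration only.
english_to_pivot = lambda name: {"Bush": "bʊʃ"}.get(name, name)
pivot_to_russian = lambda form: {"bʊʃ": "Буш"}.get(form, form)

print(direct_components(10, 10), pivot_components(10, 10))  # 100 20
print(via_pivot(english_to_pivot, pivot_to_russian)("Bush"))  # Буш
```

The saving grows with the number of languages, but every conversion now passes through the pivot, so whatever the pivot cannot represent is lost for all pairs.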

SLIDE 22

Pivot desiderata

• Neutral representation scheme
• Should address all possible writing systems
• Should assure as lossless a conversion as possible
• Should encode all necessary information
• Principled enough to allow algorithmic implementation
• Generative capability necessary
• Is it even possible to have only one pivot?

SLIDE 23

Pivot = alphabet?

• English?
  - Consistency: very bad sound/symbol mapping
  - Anglocentricity
• IPA?
  - Transparency: difficult for non-linguists
  - Comprehensive, but not totally adequate
• Logographs would be problematic

SLIDE 24

Pivot = syllabic?

• Not as intuitive to alphabet users
• Syllable definition is still debated in some languages
• Ambisyllabicity
  - Mary, Brigham, Deryle

SLIDE 25

Pivot = logographic?

• Need to invent character (sequences)
• Meaning is not always obvious
• Impracticality: complexity of representation, script

SLIDE 26

An articulated pivot approach

• More than one “pivot”, feed into each other
• n + m + p components
• Allows grouping of typologically similar languages
• Intra-pivot links could represent current research results (most commonly used languages)
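An articulated pivot can be pictured as a small directed graph in which every edge is one conversion component, and a conversion is a path through that graph. The graph below is entirely hypothetical (the node names, the grouping of languages, and the links are assumptions made for illustration); `route` is a plain breadth-first search.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical component graph: n = 2 sources feed a pivot, m = 2 targets are
# fed by a pivot, and p = 2 intra-pivot links connect the two pivots.
LINKS: Dict[str, List[str]] = {
    "Russian": ["pivot_alphabetic"],
    "Greek": ["pivot_alphabetic"],
    "pivot_alphabetic": ["pivot_syllabic", "English"],
    "pivot_syllabic": ["pivot_alphabetic", "Japanese"],
}

def route(links: Dict[str, List[str]], source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of components."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in links.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(route(LINKS, "Russian", "Japanese"))
print(sum(len(targets) for targets in LINKS.values()))  # 6, i.e. n + m + p
```

Each extra hop between pivots is another chance to lose information, so short routes between typologically close languages are the attraction of this layout.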

SLIDE 27

Conclusions

• Rich area for current research
• The issues are daunting
• Various approaches are being implemented
• MT has tackled some of the same problems
• A principled solution might involve some type of articulated pivot
• Open annotation environment, sharable resources, algorithm libraries
• Genealogists can contribute