

SLIDE 1

FHTW 2006 1

Exploring Syllables, Romanization, and Analogy in Names

Deryle Lonsdale
BYU Linguistics
lonz@byu.edu

SLIDE 2

Proper nouns and analogy

Proper nouns are interesting linguistically
  • Phonology: sound sequences, syllable structure
  • Orthography: how writing systems do(n’t) reflect sounds
  • Semantics: meaning, denotation
  • Pragmatics: culture, religion, history
  • Translation: crosslinguistic issues

Analogy, a general cognitive strategy, can help in explaining many of these phenomena

SLIDE 3

Arabic script

Arabic is a Semitic language

Arabic script is also used for other languages, including non-Semitic ones
  • Urdu: Pakistan (Indo-Aryan)
  • Persian/Farsi: Iran (Indo-Iranian)
  • Pashto: Afghanistan (Indo-Iranian)

It’s an (impure) abjad
  • Abjad: alphabet but (some) symbols missing
  • No short vowels, though long ones are usually represented

SLIDE 4

Names in Arabic script

Written right-to-left

No capital letters

Vocalization: add missing short vowels

Romanization: converting words to Roman-script languages (e.g. English)

أبومصعب الزرقاوي
Abu M(u)sab al-Z(a)rqawi

محمود احمدی نژاد
M(a)hmoud Ahm(a)din(e)jad

SLIDE 5

Common techniques used

Lexicographic: dictionary lookup

Bitext mining: previous translations

Text-to-speech phonemicization
  • Usually transduction via finite-state methods

Machine learning
  • Statistical/stochastic approaches (e.g. n-grams)
  • Entropy/noisy channel approaches
  • Rule-based transformational approaches
  • Exemplar-based approaches

SLIDE 6

Analogical modeling

Exemplar-based machine learning approach

Analogy is the basic operation

Useful for modeling natural language phenomena
  • Particularly low-level issues: phonology, orthography, morphology

No explicit rules, just store of vectorized exemplar data

Flexible input, output, reporting, metrics
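
To make the "no explicit rules, just stored exemplars" idea concrete, here is a minimal sketch of an exemplar-based predictor. It is a deliberately simplified stand-in (overlap-weighted voting), not Analogical Modeling itself, which selects and weights exemplars in a more principled way. The function name `predict_outcomes` and the weighting scheme are assumptions made for illustration.

```python
from collections import defaultdict


def predict_outcomes(exemplars, query):
    """Score outcomes for `query` by voting over stored exemplars.

    exemplars: list of (features, outcome) pairs, features being a
               sequence of symbols of the same length as `query`.
    Returns a dict mapping each outcome to a percentage of the vote.

    Assumption: each exemplar votes with a weight equal to the number
    of feature positions it shares with the query. This is a toy
    similarity measure, not the AM algorithm.
    """
    votes = defaultdict(float)
    for features, outcome in exemplars:
        overlap = sum(a == b for a, b in zip(features, query))
        votes[outcome] += overlap
    total = sum(votes.values())
    if total == 0:
        return {}
    return {outcome: 100.0 * v / total for outcome, v in votes.items()}
```

The point of the sketch is only that prediction comes from comparing a new item with stored items, and that the result is a scored set of outcomes rather than a single answer.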

SLIDE 7

The task(s)

  • Process Farsi names (Arabic script):
    1) Arabic script → vocalized Arabic script
    2) Arabic script → vocalized romanization

  • 23,000 items with three types of proper noun information (given name(s), last name(s), location)

  • Arabic script and one romanization

SLIDE 8

Sample data

ابراهیمی عراق نژاد | غلامعباس | هپکو
hepko | Ghulam Abaas | Ebrahimi Iraq Nezhad

ابراهیمی عراقی | ناصر | خانه سازی قنات
Khanah Saazi Qnaat | Naser | Ebrahimi Iraqi

ابراهیمی عراقی نژاد | غلامرضا | شهید شیرودی
Shaheed sherodi | Ghulam Reza | Ebrahimi Iraqi Nezhad

آل بوسویلم | عبدالعباس | شهر صنعتی
Shaher Sunhati | Abdul Abaas | Aal Busuylam

آلبوغبیش | محمد | جهانگیری
Jahangeeri | Mohammad | Aalbughabish

آل بیگی غلامی | مسعود | شهید رجائی
Shaheed Rijahee | Masood | Aal Baigi Ghulami

SLIDE 9

Task 1

Provide Arabic-script vocalization

SLIDE 10

Issues in vocalization

Variable placement: metathesis-like
  • Ahm(a)di / Ah(a)mdi

Diphthongs and glides are problematic
  • Baizaa hee / Baizayee
  • Ahsaanian / Ahsaaneean

Nasalization

Vowels (short & long) are notoriously variable in English (ghoti, ghoughpteighbteau)
  • Imami / Imaami

SLIDE 11

Step 1: Transliterate

Transliteration   Romanization
kukb+slTAn        Kowkab+Sultan
zhrA              Zahra
jmilh             Jamila
}biH+Alh          Zabeeulah
}biH+A...         Zabee+A&
Sdiqh             Sideeqa
Dmir              Zameer
ESmt              Esmat
ElirDA            Ali+Reza
GlAmEli           Ghulam+Ali
mHmd+Hsin         Mohmmad+Hussian
mHmd+Eli          Mohmmad+Ali
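
The transliteration step amounts to a character-by-character mapping from Arabic script to ASCII symbols. The sketch below is an illustration only: the table `TRANSLIT` is reconstructed from the sample pairs shown in these slides and is not the author's actual table. Treating a space as `+` is also an assumption, as is the `?` placeholder for unlisted letters.

```python
# Hypothetical symbol table, inferred from the slide samples only.
TRANSLIT = {
    "\u0622": "]",  # alef with madda
    "\u0627": "A",  # alef
    "\u0628": "b",  # beh
    "\u062a": "t",  # teh
    "\u062c": "j",  # jeem
    "\u062d": "H",  # hah
    "\u062e": "x",  # khah
    "\u062f": "d",  # dal
    "\u0630": "}",  # thal
    "\u0631": "r",  # reh
    "\u0632": "z",  # zain
    "\u0633": "s",  # seen
    "\u0634": "C",  # sheen
    "\u0635": "S",  # sad
    "\u0636": "D",  # dad
    "\u0637": "T",  # tah
    "\u0639": "E",  # ain
    "\u063a": "G",  # ghain
    "\u0642": "q",  # qaf
    "\u0643": "k",  # Arabic kaf
    "\u06a9": "k",  # Persian keheh
    "\u0644": "l",  # lam
    "\u0645": "m",  # meem
    "\u0646": "n",  # noon
    "\u0647": "h",  # heh
    "\u0648": "u",  # waw
    "\u064a": "i",  # Arabic yeh
    "\u06cc": "i",  # Farsi yeh
}


def transliterate(name: str) -> str:
    """Map an Arabic-script name to the ASCII scheme, one symbol per letter.

    Assumptions: a space becomes '+', and any letter missing from the
    table becomes '?' so that gaps stay visible.
    """
    return "".join("+" if ch == " " else TRANSLIT.get(ch, "?") for ch in name)
```

With this table, the letters zain, heh, reh, alef come out as `zhrA`, matching the Zahra row above.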

SLIDE 12

Step 2: Capture pairings

Wrote finite-state automaton to capture correspondences between Arabic / romanization

Sliding window across names, 1 character at a time

Prefer 1-1 mappings, but allow for others

Result: training vectors with 31 orthographic features

Outcomes are 0-3 character realizations
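
The sliding-window step can be sketched as follows, assuming (as the sample vectors in this deck suggest) that each vector holds 31 symbols, that the character in focus sits in the middle position, and that `=` fills the positions beyond the name. The function names are hypothetical. The alignment that supplies each outcome is the job of the finite-state automaton and is not reproduced here.

```python
PAD = "="    # filler symbol, as in the sample vectors
WIDTH = 31   # number of orthographic features per vector


def windows(translit: str, width: int = WIDTH) -> list[list[str]]:
    """Return one fixed-width window per character of `translit`.

    The focus character of window i is at the middle index (width // 2);
    positions that fall outside the name are filled with PAD.
    """
    half = width // 2
    padded = [PAD] * half + list(translit) + [PAD] * half
    return [padded[i:i + width] for i in range(len(translit))]


def format_vector(outcome: str, features: list[str]) -> str:
    """Render a vector as 'outcome , f1 f2 ... f31'."""
    return f"{outcome} , {' '.join(features)}"
```

For a 13-character name such as `HAj+mHmd+xAni`, the first window has 15 pad symbols, the 13 characters of the name, and 3 more pad symbols.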

SLIDE 13

Sample vectors

H , = = = = = = = = = = = = = = = H A j + m H m d + x A n i = = =
A , = = = = = = = = = = = = = = H A j + m H m d + x A n i = = = =
j , = = = = = = = = = = = = = H A j + m H m d + x A n i = = = = =
+ , = = = = = = = = = = = = H A j + m H m d + x A n i = = = = = =
m , = = = = = = = = = = = H A j + m H m d + x A n i = = = = = = =
oH , = = = = = = = = = = H A j + m H m d + x A n i = = = = = = = =
am , = = = = = = = = = H A j + m H m d + x A n i = = = = = = = = =
ad , = = = = = = = = H A j + m H m d + x A n i = = = = = = = = = =
+ , = = = = = = = H A j + m H m d + x A n i = = = = = = = = = = =
x , = = = = = = H A j + m H m d + x A n i = = = = = = = = = = = =
A , = = = = = H A j + m H m d + x A n i = = = = = = = = = = = = =
n , = = = = H A j + m H m d + x A n i = = = = = = = = = = = = = =
i , = = = H A j + m H m d + x A n i = = = = = = = = = = = = = = =

SLIDE 14

Sample generated outputs

Input: خرمن+بیز

78.55  xorami+biz   خُرَمی+بیز
77.72  xrami+biz    خرَمی+بیز
76.69  xarami+biz   خَرَمی+بیز
76.52  xoraman+biz  خُرَمَن+بیز
75.69  xrman+biz    خرمَن+بیز

SLIDE 15

Sample vocalized output

Input: صغری

75.00  صَغری
71.43  صَغَری
64.29  صوغری
64.29  صَغرَی
60.71  صوغَری
60.71  صَغَرَی
53.57  صوغرَی
50.00  صوغَرَی

SLIDE 16

Task 2

Provide vocalized romanization

SLIDE 17

Issues in romanization

Arabic sounds do not always map to English symbols
  • Not just one-to-one correspondence

Divine name often elided
  • آیت ا... غفاری
  • Ayatullah Ghafari

Syllable boundaries are unclear
  • Ambisyllabicity, consonant gemination

Word boundaries are not consistent

SLIDE 18

Process: as for vocalization

Transliterate

Transduce to produce instance vectors
  • 31 orthographic features
  • Outcomes are letter sequences, generally more complicated

Perform vocalization and romanization at once

SLIDE 19

Sample vectors

B , = = = = = = = = = = = = = = = b d x C A n = = = = = = = = = = ,
ad , = = = = = = = = = = = = = = b d x C A n = = = = = = = = = = = ,
akh , = = = = = = = = = = = = = b d x C A n = = = = = = = = = = = = ,
sh , = = = = = = = = = = = = b d x C A n = = = = = = = = = = = = = ,
a , = = = = = = = = = = = b d x C A n = = = = = = = = = = = = = = ,
n , = = = = = = = = = = b d x C A n = = = = = = = = = = = = = = = ,

B , = = = = = = = = = = = = = = = b d x C A n i = = = = = = = = = ,
ad , = = = = = = = = = = = = = = b d x C A n i = = = = = = = = = = ,
akh , = = = = = = = = = = = = = b d x C A n i = = = = = = = = = = = ,
sh , = = = = = = = = = = = = b d x C A n i = = = = = = = = = = = = ,
a , = = = = = = = = = = = b d x C A n i = = = = = = = = = = = = = ,
n , = = = = = = = = = = b d x C A n i = = = = = = = = = = = = = = ,
i , = = = = = = = = = b d x C A n i = = = = = = = = = = = = = = = ,

B , = = = = = = = = = = = = = = = b E A j + z A d h = = = = = = = ,
E , = = = = = = = = = = = = = = b E A j + z A d h = = = = = = = = ,
haa , = = = = = = = = = = = = = b E A j + z A d h = = = = = = = = = ,
j , = = = = = = = = = = = = b E A j + z A d h = = = = = = = = = = ,
+ , = = = = = = = = = = = b E A j + z A d h = = = = = = = = = = = ,
Z , = = = = = = = = = = b E A j + z A d h = = = = = = = = = = = = ,
a , = = = = = = = = = b E A j + z A d h = = = = = = = = = = = = = ,
d , = = = = = = = = b E A j + z A d h = = = = = = = = = = = = = = ,
h , = = = = = = = b E A j + z A d h = = = = = = = = = = = = = = = ,

SLIDE 20

Sample raw output

::::::::::::::
]it+_...bhbhAni
::::::::::::::
91.11 Ayat+Allah+Bahbahaani
91.11 Ayat+Allah+Bahbahani
88.89 Ayat+Allah+Bahbahanee
88.89 Ayat+Allah+Bahbahaanee
88.89 Aayat+Allah+Bahbahaani
88.89 Aayat+Allah+Bahbahani
88.89 Aayat+Allah+Bahbahaani
88.89 Ayat+Allah+Bahbahaanee
86.67 Aayat+Allah+Bahbahaanee
86.67 Aayat+Allah+Bahbahanee
86.67 Aayat+Allah+BahbahAnee

SLIDE 21

Sample output

حافظ
  450.000000 Hafizee
  450.000000 Hafeezee

جمشید
  399.414000 Jamsheed
  396.716000 Jamshid
  394.940000 Jamshaid
  384.322000 Jamasheed

شاهپور
  450.164000 Shaahpur
  395.169000 Shaah+Pur

بهنام
  436.044000 Bahnaam
  402.424000 Behnaam

SLIDE 22

Syllabification is an issue

Even in English

  • Merriam-Webster: si.lly, ho.llow, ba.lance
  • Cambridge: sill.y, ho.llow or holl.ow, bal.ance

People vary in their perceptions, practices

This has implications for doubled consonants (ambisyllabicity)
  • Frequently observed in the data
  • Hessari / Hesaari

Syllable boundary in vectors would help

SLIDE 23

Performance and evaluation

Why not simply transduce?
  • Only one possible realization provided; many are possible and desirable to identify

Generate all possible realizations, with scores
  • Rote recall of forms provided
  • Analogy applied to generate, score, rank alternative possibilities

Human evaluation of alternatives necessary
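
One way to picture "generate all possible realizations, with scores" is to combine the candidate outcomes predicted for each character position and rank the resulting strings. The sketch below does that with a plain Cartesian product. The slides do not say how the reported scores are computed, so averaging the per-position percentages is an assumption, and the function name `ranked_realizations` is hypothetical.

```python
from itertools import product


def ranked_realizations(candidates, top=5):
    """Combine per-position candidates into ranked whole-name realizations.

    candidates: one list per character position, each containing
                (realization, percentage) pairs.
    Returns up to `top` (name, score) pairs, best first.

    Assumption: the score of a full realization is the mean of its
    per-position percentages; ties are broken alphabetically.
    """
    if not candidates:
        return []
    results = []
    for combo in product(*candidates):
        text = "".join(piece for piece, _ in combo)
        score = sum(pct for _, pct in combo) / len(combo)
        results.append((text, score))
    results.sort(key=lambda item: (-item[1], item[0]))
    return results[:top]
```

Because every combination is enumerated, this is only practical for short names with a few alternatives per position. A longer input would call for pruning, for example keeping only the best few candidates at each position.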

SLIDE 24

Conclusions

Interesting issues in Arabic-script name processing

Widely varying practices in romanization of names

Analogy (and AM) provide good account

Techniques can be used for other languages (source and target) if training data available