A Log-linear Block Transliteration Model based on Bi-Stream HMMs
Bing Zhao
Joint work with
Nguyen Bach, Ian Lane, and Stephan Vogel
Language Technologies Institute Carnegie Mellon University April 2007
OOV-words in Machine-Translation
Translation hypotheses cannot be generated for any source word that did not appear in training corpora
OOV words are often major carriers of semantic content, e.g. named entities (person and place names)
Arabic → English
Arabic script / Romanized / English transliteration
e.g. ﺎﻔﺧ ﻲﺟ / xfAjy / Mahasin, Muhasan, Mahsan
Transliteration of place-names for different language pairs
Source language → English transliteration
German: Konstantinopolis → Constantinople
Arabic: ﻖﺸﻣد (Dmk) → Damascus
Spanish: Adelaida → Adelaide
Rule-set either manually defined or automatically generated
Only appropriate for close language pairs (poor performance for Arabic→English transliteration)
Finite state transducers (Knight & Graehl 1997; Stalls & Knight 1998)
Model combination (Al-Onaizan 2002; Huang 2005)
Specific approach typically limited to the target language pair
Highly portable framework
Able to generate high quality transliterations
Broad phonetic classes consistent across languages
e.g. transliterate consonant → consonant, vowel → vowel
Propose a Bi-Stream HMM framework that models both letters and letter classes
Typically, the number of letters is similar across a language pair
Constrained fertility for Arabic→English
Phonetic reordering does not occur in transliteration
Improve handling of context during transliteration
Propose a "block-level" transliteration framework
IBM Model-4
Bi-Stream HMM
Bi-Stream HMM combined with a log-linear model
Special setups for transliteration
Configurations of the SMT decoder
Spelling checker
[Figure: system overview with components: name pairs, preprocessing, letter alignment, transliteration blocks, N-best translation hypotheses, letter language model, Internet spell checker]
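[Figure: training architecture. Name pairs → preprocessing → Bi-HMM (A-to-E) and Bi-HMM (E-to-A) → refined alignment → log-linear model for block alignment → blocks. Feature models in both directions: fertility $P_\phi(f \mid e)$, $P_\phi(e \mid f)$; distortion $P_\Theta(f \mid e)$, $P_\Theta(e \mid f)$; lexicon $P(f \mid e)$, $P(e \mid f)$]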
CVC: Consonant-Vowel-Consonant
Vowels: a e i o u ...
Consonants: k j l ...
Ambi-class: can be both vowel and consonant, e.g. "y"
Unknown: letters without linguistic clues
Additional position markers: initial and final
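A minimal sketch of how a name could be mapped to its letter-class stream. The class inventories for English letters, the tag names (`V`, `C`, `A`, `U`), and the way position markers are attached are illustrative assumptions; the slides only name the classes and the two markers.

```python
# Hypothetical class inventories for English letters (assumption, not from the slides).
VOWELS = set("aeiou")
AMBI = set("y")                      # can act as vowel or consonant
CONSONANTS = set("bcdfghjklmnpqrstvwxz")

def letter_class(ch):
    """Return a one-letter class tag: V, C, A (ambi-class) or U (unknown)."""
    ch = ch.lower()
    if ch in AMBI:
        return "A"
    if ch in VOWELS:
        return "V"
    if ch in CONSONANTS:
        return "C"
    return "U"

def class_stream(name):
    """Map a name to its class sequence, marking the initial and final positions."""
    tags = []
    last = len(name) - 1
    for i, ch in enumerate(name):
        tag = letter_class(ch)
        if i == 0:
            tag += "_init"           # a one-letter name gets only the initial marker
        elif i == last:
            tag += "_final"
        tags.append(tag)
    return tags
```

For example, `class_stream("ranil")` yields a consonant-vowel-consonant-vowel-consonant pattern with the first and last tags marked.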
Left-to-right letter-level alignment
Enriched with letter classes
Generating the letter sequence
Generating the letter-class sequence
Configured for strict monotone alignment
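Standard HMM:
$$P(f_1^J \mid e_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} p(f_j \mid e_{a_j})\, p(a_j \mid a_{j-1})$$
Terms: name-pair probability (left-hand side), letter-transliteration probability $p(f_j \mid e_{a_j})$, state-transition probability $p(a_j \mid a_{j-1})$
Bi-Stream HMM, with letter classes $F_j$ and $E_i$:
$$P(f_1^J, F_1^J \mid e_1^I, E_1^I) = \sum_{a_1^J} \prod_{j=1}^{J} p(f_j \mid e_{a_j})\, p(F_j \mid E_{a_j})\, p(a_j \mid a_{j-1})$$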
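A minimal sketch of the forward computation for a bi-stream HMM under strictly monotone alignment: each generated letter f_j and its class F_j are emitted from the aligned letter e_{a_j} and class E_{a_j}, and the alignment may only stay or move forward. The table layout (dictionaries keyed by pairs), the uniform start distribution, and the `floor` smoothing value are assumptions for illustration.

```python
def bistream_forward(f, F, e, E, p_letter, p_class, p_jump, floor=1e-6):
    """Sum over monotone alignments of the product of letter emission,
    class emission and state-transition probabilities.

    f, F: generated letters and their classes (length J)
    e, E: conditioning letters and their classes (length I)
    p_letter[(f_j, e_i)], p_class[(F_j, E_i)]: emission tables (hypothetical layout)
    p_jump[d]: probability of moving d >= 0 positions forward
    """
    I, J = len(e), len(f)

    def emit(i, j):
        return (p_letter.get((f[j], e[i]), floor)
                * p_class.get((F[j], E[i]), floor))

    # alpha[i]: probability of generating f[0..j] with f[j] aligned to e[i].
    # Assumption: uniform start distribution over the I positions.
    alpha = [emit(i, 0) / I for i in range(I)]
    for j in range(1, J):
        new_alpha = []
        for i in range(I):
            # Monotone constraint: the previous position k cannot exceed i.
            reach = sum(alpha[k] * p_jump.get(i - k, 0.0) for k in range(i + 1))
            new_alpha.append(reach * emit(i, j))
        alpha = new_alpha
    return sum(alpha)
```

Setting every class probability to 1 reduces this to the single-stream HMM, which makes the contribution of the class stream easy to inspect in isolation.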
[Figure: a transliteration block over the source and target letter sequences, described by its start and end, left and right boundaries, source and target centers, and width]
Fertility: letter-level fertility probability, computed with dynamic programming
Lexicon: IBM-1 letter-to-letter transliteration probabilities, giving an IBM Model-1 style score for a named-entity pair
Distortion: letter n-gram pairs are assumed to lie along the diagonal, with a Gaussian distribution over the positions of the centers
Name pairs usually have similar lengths in characters
A letter is transliterated into fewer than 4 letters
How many letters do we want to generate in the target name?
Letter fertilities in both directions
Compute length relevance
[Figure: example fertility assignments for letters e1 e2 e3]
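A minimal sketch of a dynamic program for the length relevance: the probability that a source name generates exactly a given number of target letters, summing over all ways to split that length across the source letters' fertilities. The fertility table layout, the default of fertility 1 for unseen letters, and allowing fertility 0 are assumptions; the cap of 3 follows the "fewer than 4 letters" constraint.

```python
def length_probability(src, tgt_len, fert, max_fert=3):
    """P(target length | source letters) under per-letter fertility distributions.

    fert[letter] is a dict {fertility: probability} (hypothetical layout).
    Letters missing from `fert` default to fertility 1 (assumption).
    """
    # dp[n]: probability that the letters seen so far generated n target letters.
    dp = [1.0] + [0.0] * tgt_len
    for ch in src:
        dist = fert.get(ch, {1: 1.0})
        new_dp = [0.0] * (tgt_len + 1)
        for n, p in enumerate(dp):
            if p == 0.0:
                continue
            for k in range(max_fert + 1):
                if n + k <= tgt_len:
                    new_dp[n + k] += p * dist.get(k, 0.0)
        dp = new_dp
    return dp[tgt_len]
```

Running the same function with the roles of the two names swapped gives the fertility score in the other direction.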
Letter-to-letter transliteration probabilities
Letter-to-letter mapping is captured by lexicons
Compute statistics from the letter alignment
Learn lexicons in both directions
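Compute IBM Model-1 style scores:
$$P(f_1^J \mid e_1^I) = \prod_{j=1}^{J} \frac{1}{I} \sum_{i=1}^{I} p(f_j \mid e_i)$$
$$P(e_1^I \mid f_1^J) = \prod_{i=1}^{I} \frac{1}{J} \sum_{j=1}^{J} p(e_i \mid f_j)$$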
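A minimal sketch of an IBM Model-1 style score for a letter n-gram pair: for each generated letter, average its lexicon probability over all letters on the other side, then multiply across letters (done here in log space). The uniform averaging, the `lex` table layout, and the `floor` value for unseen pairs are assumptions.

```python
import math

def model1_log_score(f, e, lex, floor=1e-6):
    """Log of prod_j [ (1/I) * sum_i lex[(f_j, e_i)] ] for non-empty e.

    lex[(f_letter, e_letter)] is a letter-to-letter probability (hypothetical layout).
    """
    total = 0.0
    for fj in f:
        avg = sum(lex.get((fj, ei), floor) for ei in e) / len(e)
        total += math.log(avg)
    return total
```

Calling it as `model1_log_score(e, f, reverse_lex)` with a lexicon learned in the opposite direction gives the second score.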
Monotone alignment nature of name pairs
Aligned n-gram pairs are mostly located along the diagonal
The center of the block should be along the diagonal
Define the centers for source and target letter n-grams:
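A minimal sketch of a Gaussian distortion score that favors blocks whose source and target centers lie on the diagonal. The center definition used here (midpoint of the span divided by the name length) and the value of `sigma` are assumptions for illustration, since the slide's own definition of the centers did not survive extraction.

```python
import math

def relative_center(start, end, length):
    """Midpoint of the span [start, end] (0-based, inclusive), scaled to (0, 1)."""
    return (start + end + 1) / 2.0 / length

def log_distortion(src_span, src_len, tgt_span, tgt_len, sigma=0.2):
    """Log Gaussian density of the offset between the two relative centers."""
    d = relative_center(*src_span, src_len) - relative_center(*tgt_span, tgt_len)
    return -0.5 * (d / sigma) ** 2 - math.log(sigma * math.sqrt(2.0 * math.pi))
```

A block covering the first two of four source letters and the first three of six target letters has both centers at 0.25, so it scores higher than a block pairing the same source span with the last three target letters.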
Weights for the individual feature functions
Training: Improved Iterative Scaling; downhill simplex
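A minimal sketch of how a log-linear model combines feature functions into one block score: each candidate block carries log-domain feature values, and the score is their weighted sum. The feature names, the candidate layout, and the hand-set weights are hypothetical; the sketch does not implement weight training.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values (features are assumed to be log-probabilities)."""
    return sum(weights[name] * value for name, value in features.items())

def best_block(candidates, weights):
    """Pick the candidate block with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

# Illustrative values only.
weights = {"fertility": 1.0, "lexicon": 2.0, "distortion": 0.5}
candidates = [
    {"block": "A", "features": {"fertility": -1.0, "lexicon": -2.0, "distortion": -0.5}},
    {"block": "B", "features": {"fertility": -0.5, "lexicon": -3.0, "distortion": -0.1}},
]
chosen = best_block(candidates, weights)
```

With these weights the lexicon feature dominates, so block A is chosen even though block B has the better fertility and distortion values.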
Three systems
Applying a spelling checker
Simple comparison with Google translations
Some examples of MT output
Corpus: size, LDC catalog
Buckwalter Arabic Morph: 6K, LDC2004L02
Bilingual person names: 11K, LDC2005G021
Bilingual geographic names: 74K, LDC2005G01-NGA
286 unique tokens were left untranslated
Among them: 97 untranslated unique person and location names
IBM Model-4 in both directions; refined letter alignment; blocks extracted according to heuristics
IBM Model-4 in both directions; refined letter alignment; blocks extracted according to a log-linear model
Bi-Stream HMM in both directions; refined letter alignment; blocks extracted according to a log-linear model
Edit distance between the hypothesis and possibly multiple references
Src = "mHmd"; Ref = Muhammad / Mohammed
Acceptable translation if edit distance = 1
Perfect match if edit distance = 0
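A minimal sketch of this evaluation: compute the Levenshtein distance from the hypothesis to each reference and judge by the smallest one. Lowercasing before comparison and the label names are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

def judge(hyp, refs):
    """Label a hypothesis by its smallest edit distance to any reference."""
    d = min(edit_distance(hyp.lower(), r.lower()) for r in refs)
    if d == 0:
        return "perfect"
    if d == 1:
        return "acceptable"
    return "wrong"
```

With the references Muhammad and Mohammed, the hypothesis "Mohamed" is one deletion away from "Mohammed" and would count as acceptable.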
[Charts: accuracy on names under perfect/exact match and under edit distance of 1]
Significantly better than the rule-based system (52% vs. 14%)
Components: log-linear model, Bi-Stream HMM, and spelling checker
System re-configurations for other language pairs
New features for transliteration
Models for letter alignment for transliteration
Algorithms for extracting letter n-gram pairs for transliteration
[Figure: rule-based transliteration system with training and decoding stages; components: bilingual NE corpus, Learner, Generator, Picker, Applicator, top-N candidates, spell checker, translation hypothesis]
Given "lybyry" and "liberian", how many possible rules?
A: Alignment by calculating edit distance
Use all optimal paths to extract rules according to the alignment paths
Distinguish rules for begin, middle, and end
Use consonants to anchor rules
Head of the rule list (frequency, source, target, position):
379  An  an   Begin
345  q   ca   Begin
303  X   sh   Begin
286  nd  nd   Middle
283  ry  ri   End
273  ny  ni   End
252  kt  ph   Begin
252  qr  car  Begin
219  x   kha  Begin
217  x   kh   Begin
From 5820 pairs
Total: 19957 different rules
Max freq: 379
Min freq: 1
How to know which rule is good or bad?
For each rule, apply it to the held-out data and use reduction
Application order: Begin → End → Middle
Confidence threshold: filter out unreliable rules
Application strategy: for each source word, find all possible rules, and apply them in order
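A minimal sketch of the application strategy: drop rules below a confidence threshold, then apply a Begin rule, an End rule, and Middle rules in that order. The rule tuple layout, the confidence values, the tie-breaking order, copying unmatched letters through unchanged, and returning a single output rather than a top-N list are all assumptions for illustration.

```python
def apply_rules(word, rules, threshold=0.5):
    """rules: list of (source, target, position, confidence) tuples (hypothetical layout)."""
    # Keep reliable rules, most confident first, longer sources breaking ties.
    reliable = sorted((r for r in rules if r[3] >= threshold),
                      key=lambda r: (-r[3], -len(r[0])))
    head = tail = ""
    rest = word
    for src, tgt, pos, _ in reliable:             # 1. Begin
        if pos == "Begin" and src and rest.startswith(src):
            head, rest = tgt, rest[len(src):]
            break
    for src, tgt, pos, _ in reliable:             # 2. End
        if pos == "End" and src and rest.endswith(src):
            tail, rest = tgt, rest[:len(rest) - len(src)]
            break
    middle_rules = [r for r in reliable if r[2] == "Middle"]
    out, i = [], 0
    while i < len(rest):                          # 3. Middle, left to right
        for src, tgt, _, _ in middle_rules:
            if src and rest.startswith(src, i):
                out.append(tgt)
                i += len(src)
                break
        else:
            out.append(rest[i])                   # no rule matched: copy the letter
            i += 1
    return head + "".join(out) + tail

# Rule strings taken from the head of the rule list; confidences are made up.
example_rules = [
    ("q", "ca", "Begin", 0.9),
    ("q", "ka", "Begin", 0.1),   # filtered out by the default threshold
    ("nd", "nd", "Middle", 0.8),
    ("ry", "ri", "End", 0.8),
]
```

A fuller version would enumerate every applicable rule combination and pass the resulting candidates on for ranking, as the "find all possible rules" step suggests.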
The spelling checker significantly improved accuracy (52% → 72%)
Arabic text source sentence ﻪﻐﻨﻴﺴﻣﺮﻜﻳو ﻞﻴﻧار ﻰﻜﻧﻼﻳﺮﺴﻟا ءارزﻮﻟا ﺲﻴﺋر رﺬﺣ /اﻮﺨﻨﻴﺷ / ﺮﻳﺎﻨﻳ 4 ﻮﺒﻤﻟﻮآ ﺎهﺎﻋﺮﺗ ﻰﺘﻟا مﻼﺴﻟا ﺔﻴﻠﻤﻋ ﺮﻴﻣﺪﺗ ﺔﺒﻐﻣ ﻦﻣ ﺎﺠﻧﻮﺗارﺎﻣﻮآ ﺎﻜﻳرﺪﻧﺎﺸﺗ ﺔﺴﻴﺋﺮﻟا ﺞﻳوﺮﻨﻟا
SMT hypothesis
{UNKﻰﻜﻧﻼﻳﺮﺴﻟا ﻪﻐﻨﻴﺴﻣﺮﻜﻳو ﻞﻴﻧار} chairperson {UNK ﺞﻧﻮﺗارﺎﻣﻮآ ﺎﻜﻳرﺪﻧﺎﺸﺗ} cautioned the destruction of the peace process sponsored by norway
SMT with T.a.T
Srilankan Ranil Wikramsinghe chairperson Chandrika Kumaratunga cautioned the destruction of the peace process sponsored by norway
Reference translation
Wickremasinghe warned the country's President Chandrika Kumaratunga of the consequences of destroying the peace process sponsored by the Norwegians