Simplified Abugidas
Chenchen Ding, Masao Utiyama, and Eiichiro Sumita ASTREC, NICT, Japan
Writing Systems: Hierarchy
logogram
phonogram
  syllabic
  segmental
    abugida
    alphabet
    abjad
Symbol inventories and input requirements (logogram > syllabic > abugida > alphabet & abjad):

Logogram: 10^2 – 10^4 symbols → keyboard + input method
Syllabic / Abugida: typically 50 – 70 symbols → keyboard + ?
Alphabet & Abjad: 20 – 40 symbols → keyboard only
→ Insert a light layer of input method
→ Type less, recover automatically
→ Various approaches exist for Chinese and Japanese
(a) ABUGIDA SIMPLIFICATION: merge consonant characters and omit vowel diacritics, so that distinct syllables (e.g., /noon/, /naen/, /nuən/, /nein/) collapse onto one simplified form
(b) RECOVERY: restore the original spelling with machine learning methods
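Step (a) can be sketched in a few lines of Python. This is only an illustration: the Khmer merging table below is a hypothetical subset, whereas the talk's actual schemes merge whole articulation groups across four scripts.

```python
import unicodedata

# Hypothetical subset of a consonant-merging table (Khmer): aspirated
# letters mapped to their unaspirated counterparts. The real schemes
# merge whole articulation groups.
MERGE = {"ឈ": "ជ", "ថ": "ត", "ឍ": "ទ"}

def simplify(text: str) -> str:
    """Toy abugida simplification: omit vowel diacritics (Unicode
    combining marks, category M*) and collapse merged consonants."""
    out = []
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            continue                       # (a) omit vowel diacritic
        out.append(MERGE.get(ch, ch))      # (a) merge consonant character
    return "".join(out)
```

For example, `simplify("ជិតណែន")` drops the two vowel signs and returns the four-consonant skeleton `ជតណន`.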
[Table: simplification schemes for Thai (TH), Burmese (MY), Khmer (KM), and Lao (LO). Consonant characters are merged by articulation (guttural, palatal, dental, and labial plosives, nasals, and approximants, plus R-, S-, and H-like, long-A, and zero-consonant groups) into representatives labeled K, G, U, C, J, I, Y, T, D, N, L, P, B, M, W, R, S, H, Q, A, E; vowel diacritics are omitted.]
Length after simplification, e.g.:

len("JTNN") / len("ជិតណែន") = 4 / 6 = 66.7%

Length ratios: Thai 76.0%, Burmese 74.0%, Khmer 77.6%, Lao 72.8%
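The length statistic is just the code-point ratio between the two forms; a one-liner using the slide's Khmer example makes the computation explicit (the function name is ours, for illustration):

```python
def length_ratio(original: str, simplified: str) -> float:
    """Simplified-to-original length ratio, counted in code points."""
    return len(simplified) / len(original)

# Slide example: ជិតណែន (6 code points) vs. its simplified form ជតណន (4).
ratio = length_ratio("ជិតណែន", "ជតណន")  # 4 / 6 ≈ 66.7%
```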
Recovery model:

input:  J T N N (simplified)
output: ជិ ត ណែ ន (original grapheme clusters)

128-dim. embedding → bi-directional LSTM-RNN (256 × 2 = 512-dim.) → linear → softmax
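Recovery can be framed as labeling each simplified character with an original grapheme cluster. A context-free frequency baseline, shown here with hypothetical romanized data (this is not the talk's model), illustrates why a sequence model such as the bi-LSTM is needed:

```python
from collections import Counter, defaultdict

def train_unigram(pairs):
    """For each simplified character, remember its most frequent
    original grapheme cluster in the training pairs."""
    counts = defaultdict(Counter)
    for simplified, clusters in pairs:
        for s, c in zip(simplified, clusters):
            counts[s][c] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def recover(model, simplified):
    """Label each character independently -- no sentence context."""
    return [model.get(s, s) for s in simplified]

# Hypothetical training data in romanized form: the same simplified "N"
# maps to different clusters ("Nae" vs. "N") depending on context.
pairs = [("JTNN", ["Ji", "T", "Nae", "N"]),
         ("TN",   ["T", "N"])]
model = train_unigram(pairs)
# The baseline always emits the majority label for "N", so it can never
# recover "Nae" -- exactly the ambiguity a bi-LSTM resolves from context.
```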
→ A 16-model ensemble is adequate
→ Top-4 candidates are satisfactory
(† : p < 0.01, ‡ : p < 0.001)
→ Embedding + bi-LSTM outperforms N-gram features
→ The RNN outperforms the SVM regardless of training data size
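The top-k evaluation above counts a prediction as correct when the reference appears among the model's k best candidates; a minimal sketch (a hypothetical helper, not the talk's evaluation script):

```python
def topk_accuracy(candidate_lists, references, k=4):
    """Fraction of items whose reference is among the top-k candidates."""
    hits = sum(ref in cands[:k]
               for cands, ref in zip(candidate_lists, references))
    return hits / len(references)

# Toy example: two items, each with a ranked candidate list.
cands = [["a", "b", "c"], ["c", "a", "b"]]
refs = ["b", "a"]
```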
[Plot: top-1 accuracy (88% – 98%) vs. number of graphemes after simplification (200K – 2M), for SVM and RNN models on Thai, Burmese, Khmer, and Lao (TH/MY/KH/LO).]
→ Experimentally showed the feasibility of encoding abugidas in a lossy manner