Simplified Abugidas Chenchen Ding, Masao Utiyama, and Eiichiro - PowerPoint PPT Presentation

Simplified Abugidas
Chenchen Ding, Masao Utiyama, and Eiichiro Sumita
ASTREC, NICT, Japan

Writing Systems: Hierarchy
• logogram, e.g., 罐
• phonogram
  • syllabic, e.g., かん
  • segmental
    • abugida, e.g., កំប៉ង់
    • alphabet, e.g., can
    • abjad, e.g., פחית

Writing Systems: Population
[Figure: the writing-system hierarchy (logogram; phonogram: syllabic, segmental: abugida, alphabet, abjad) annotated with user populations; the label 10^9 appears three times]

Writing Systems: How to Input
Logogram > Syllabic > Abugida > Alphabet & Abjad
• Logogram and syllabic: keyboard → input method → 10^2–10^4 symbols
• Abugida: keyboard → ? → typically 50–70 symbols
• Alphabet and abjad: keyboard → 20–40 symbols

Writing Systems: How to Input
Logogram > Syllabic > Abugida > Alphabet & Abjad
• Keyboard → input method → 10^2–10^4 symbols
• Keyboard → 20–40 symbols

Motivation of This Study
• Can abugidas be input more efficiently?
  → To insert a light layer of input method
  → To type less and to recover the rest automatically
• Related work
  → Various approaches for Chinese and Japanese
  → To take advantage of the redundancy in a writing system

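Outline of This Study
• Khmer script as an example
• (a) Abugida simplification: vowel diacritic omission and consonant character merging, e.g., … ជិតណែន … → … J T N N …
• (b) Recovery: machine learning methods, e.g., … J T N N … → … ជិតណែន …
[Figure: the merged pair N N covers several Khmer syllables, labeled /noon/, /naen/, /nuən/, and /nein/; the Khmer glyphs of this panel are garbled in the transcript]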

Abugida Simplification
• Four Brahmic abugidas
  • Thai (TH), Burmese (Myanmar, MY), Khmer (Cambodian, KM), and Lao (LO)
• Based on phonetics / conventional usages
  → reduced to 21 symbols

MERGED
MN  Class                 TH        MY        KM      LO
K   guttural plosive I    กขฃ       ကခ        កខ      ກຂ
G   guttural plosive II   คฅฆ       ဂဃ        គឃ      ຄ
U   guttural nasal        ง         င         ង       ງ
C   palate plosive I      จฉ        စဆ        ចឆ      ຈ
J   palate plosive II     ชซฌ       ဇဈ        ជឈ      ຊ
I   palate nasal          ญ         ဉ ည       ញ       ຍ
Y   palate approximant    ย         ယ         យ       ຢ
T   dental plosive I      ฎฏฐดตถ    ဋဌတထ      ដឋតថ    ດຕຖ
D   dental plosive II     ฑฒทธ      ဍဎဒဓ      ឌឍទធ    ທ
N   dental nasal          ณน        ဏန        ណន      ນ
L   dental approximant    ลฦฬ       လဠ        លឡ      ລ
P   labial plosive I      บปผฝ      ပဖ        បផ      ບປຜຝ
B   labial plosive II     พฟภ       ဗဘ        ពភ      ພຟ
M   labial nasal          ม         မ         ម       ມ
W   labial approximant    ว         ဝ         វ       ວ
R   R-like                รฤ        ရ ြ       រ       ຣ
S   S-like                ศษส       သဿ        ឝឞស     ສ
H   H-like                หฬ        ဟ         ហ       ຫຮ
Q   zero-consonant        อ         အ         អ       ອ
A   long-A                ๅ         (garbled) (garbled) (missing)
E   pre-vowel             เ แ โ ใ ไ  (garbled) (garbled) ເ ແ ໂ ໃ ໄ

Dependent signs attached to the MY entries for Y, W, and H are garbled in this transcript and are left out above.

OMITTED
[Rows of omitted diacritics for TH, LO, KM, and MY, together with a DE-V. header label; the dependent signs are garbled in this transcript and are not reproduced]

Abugida Simplification
• Khmer script as an example
  ជិតណែន = ជ + ិ + ត + ណ + ែ + ន → J T N N
  len("J T N N") / len("ជ ិ ត ណ ែ ន") = 4 / 6 = 66.7%
• Length after simplification (Leng.), relative to the original text
  Thai 76.0%, Burmese 74.0%, Khmer 77.6%, Lao 72.8%
• Around one quarter of the characters saved
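The length ratio in the Khmer example can be reproduced with a small sketch. It is an illustration only: the consonant groups are copied from the Khmer (KM) row of the simplification table, the names `KHMER_MERGED`, `simplify`, and `length_ratio` are hypothetical, and every character outside those groups is simply dropped, which matches this example but ignores the A and E columns.

```python
# Khmer consonant groups from the KM row of the table (A and E columns not included).
KHMER_MERGED = {
    "K": "កខ", "G": "គឃ", "U": "ង",
    "C": "ចឆ", "J": "ជឈ", "I": "ញ", "Y": "យ",
    "T": "ដឋតថ", "D": "ឌឍទធ", "N": "ណន", "L": "លឡ",
    "P": "បផ", "B": "ពភ", "M": "ម", "W": "វ",
    "R": "រ", "S": "ឝឞស", "H": "ហ", "Q": "អ",
}

# Invert the groups into a per-character lookup.
CHAR_TO_SYMBOL = {ch: sym for sym, chars in KHMER_MERGED.items() for ch in chars}


def simplify(text):
    """Map each known consonant to its merged symbol; drop everything else."""
    return [CHAR_TO_SYMBOL[ch] for ch in text if ch in CHAR_TO_SYMBOL]


def length_ratio(text):
    """Simplified length divided by original length, both counted in code points."""
    return len(simplify(text)) / len(text)


# The example word from the slide, spelled by code point:
# U+1787 U+17B7 U+178F U+178E U+17C2 U+1793
word = "\u1787\u17b7\u178f\u178e\u17c2\u1793"
print(" ".join(simplify(word)))      # J T N N
print(f"{length_ratio(word):.1%}")   # 66.7%
```

Counting code points rather than rendered glyphs is what makes the denominator 6 here, as on the slide.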

Recovery Methods
• To formulate as a sequential labeling task
  • However, list-wise search as in conditional random fields is costly
  • To solve by point-wise classification
• Support vector machine (SVM) as a baseline
• Recurrent neural network (RNN) as a state-of-the-art method
• Setting for the SVM baseline
  • Linear kernel with N-gram features
  • Using the LIBLINEAR library
  • Wrapped by the KyTea toolkit
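Point-wise classification treats every position independently, so each position needs its own feature vector. The sketch below shows one plausible way to build N-gram features around a position. It is an assumption for illustration: the slide does not list the feature templates actually used with KyTea and LIBLINEAR, and `ngram_features` and the `#` padding symbol are hypothetical.

```python
def ngram_features(symbols, i, max_n=3, pad="#"):
    """Collect every n-gram (n <= max_n) that covers position i.

    Each feature is tagged with its order n and with the offset of its first
    symbol relative to i, so identical n-grams at different offsets stay distinct.
    """
    padded = [pad] * (max_n - 1) + list(symbols) + [pad] * (max_n - 1)
    center = i + max_n - 1  # index of position i inside the padded list
    feats = []
    for n in range(1, max_n + 1):
        for start in range(center - n + 1, center + 1):
            gram = "".join(padded[start:start + n])
            feats.append(f"{n}g[{start - center}]={gram}")
    return feats


# Features for the third symbol of the simplified sequence J T N N.
print(ngram_features("JTNN", 2, max_n=2))
# ['1g[0]=N', '2g[-1]=TN', '2g[0]=NN']
```

A linear classifier would then predict the original writing unit at position `i` from these features alone, without searching over the whole label sequence.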

RNN Structure and Settings
• Bi-gram of graphemes as input
• Embedding → Bi-directional LSTM → Linear transform → Softmax
• Original writing units as output
• Implemented by DyNet
• Trained by Adam
  • Initial learning rate 0.001
  • Controlled by a validation set
• Multi-model ensemble
[Figure: network diagram, bottom to top: input J T N N → 128-dim. → LSTM-RNN → 512-dim. (256×2) → linear → softmax → output ជិ ត ណែ ន]
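The bi-gram input keeps one input unit per grapheme, so the output sequence has the same length as the simplified input. A minimal sketch follows, assuming each grapheme is paired with its right neighbour and the last one with a padding symbol; the slide does not say which neighbour is used, and `grapheme_bigrams` and `</s>` are hypothetical names.

```python
def grapheme_bigrams(symbols, pad="</s>"):
    """Pair each grapheme with its right neighbour; the last one is paired with `pad`."""
    seq = list(symbols)
    return list(zip(seq, seq[1:] + [pad]))


# One bi-gram per position of the simplified sequence J T N N.
print(grapheme_bigrams("JTNN"))
# [('J', 'T'), ('T', 'N'), ('N', 'N'), ('N', '</s>')]
```

Each bi-gram would then be looked up in the embedding table before entering the bi-directional LSTM.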

Experimental Results
• Asian Language Treebank data
  • 20,000 sentences, newswire
• SVM
  • Up to 5-gram for TH, KM, LO
  • 7-gram for Burmese
• RNN
  • ⊕M: M-ensemble
  • @N: Top-N results
• †: p < 0.01, ‡: p < 0.001
→ 16-ensemble is adequate
→ Embedding + bi-LSTM > N-gram features
→ Top-4 is satisfactory
[Table: the accuracy figures are not preserved in this transcript]
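The ⊕M and @N notations can be made concrete with a short sketch. It assumes that an M-ensemble averages the softmax outputs of M models and that a top-N result counts as correct when the reference label is among the N most probable labels; the slide does not state how the models are combined, and all function names here are hypothetical.

```python
def ensemble_average(distributions):
    """Average several models' probability distributions over the same label set."""
    labels = distributions[0].keys()
    m = len(distributions)
    return {lab: sum(d[lab] for d in distributions) / m for lab in labels}


def top_n(distribution, n):
    """Return the n most probable labels, best first."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    return [lab for lab, _ in ranked[:n]]


def top_n_accuracy(predictions, gold, n):
    """Fraction of positions whose gold label appears among the top-n labels."""
    hits = sum(1 for dist, g in zip(predictions, gold) if g in top_n(dist, n))
    return hits / len(gold)


# Two toy models scoring three candidate labels for one position.
model_1 = {"a": 0.6, "b": 0.3, "c": 0.1}
model_2 = {"a": 0.1, "b": 0.6, "c": 0.3}
combined = ensemble_average([model_1, model_2])
print(top_n(combined, 2))  # ['b', 'a']
```

In this toy case the gold label "a" is missed at top-1 but found at top-2, which is the kind of gain the @N columns measure.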

Experimental Results: Training Data Size
[Figure: top-1 accuracy (y-axis, 88% to 98%) against the number of graphemes after simplification (x-axis, 200K to 2M), with one curve each for TH-SVM, MY-SVM, KH-SVM, LO-SVM, TH-RNN, MY-RNN, KH-RNN, and LO-RNN]
→ RNN outperforms SVM, regardless of the training data size

Manual Evaluation
• On the best RNN results for Burmese and Khmer
• Conducted by native speakers
• To classify errors into four levels
  • 0. acceptable, i.e., alternative spelling
  • 1. clear and easy to identify the correct result
  • 2. confusing but possible to identify the correct result
  • 3. incomprehensible

Conclusion and Future Work
• Abugidas can be largely simplified and recovered with high accuracy
  • Four Brahmic abugidas are investigated
  • Simplified into a compact symbol set (around 20 graphemes)
  • Recovered satisfactorily by standard machine learning methods
  → Experimentally shows the feasibility of encoding abugidas in a lossy manner
• Future work
  • Language-specific investigation
  • To integrate a dictionary
  • To develop a practical input method for abugidas

Thanks for your kind attention
