
Semi-supervised Transliteration Mining from Parallel and Comparable Corpora
Walid Aransa, Holger Schwenk, Loic Barrault
LIUM - University of Le Mans, France
firstname.lastname@lium.univ-lemans.fr
Dec 7th 2012
IWSLT 2012, December 6-7, 2012, Hong Kong

Outline
1 Introduction
    Transliteration
    Transliteration challenges
    Transliteration mining
2 Related work
3 Transliteration mining using parallel corpora - semi-supervised
4 Transliteration mining using comparable corpora - semi-supervised
5 Conclusion

Introduction
- Transliteration is the process of writing a word (mainly a proper noun) from one language in the alphabet of another language.
- It requires mapping the pronunciation of the word in the original language to the closest possible pronunciation in the target language.
- The word and its transliteration are called a Transliteration Pair (TP).

Transliteration applications
- Machine translation: improve word alignments and the handling of out-of-vocabulary (OOV) words
- Machine transliteration: train a statistical transliteration system
- Cross-language information retrieval (IR): enrich the search results with orthographic variations
- Named entity recognition (NER)

Transliteration challenges
Examples: transliteration from Arabic into English
- Some Arabic letters have no phonetically equivalent letters in English (e.g. […] and […]).
- Some English letters have no phonetically equivalent letters in Arabic (e.g. v).
- Short vowels (i.e. diacritics) are missing from the Arabic text.
- Some Arabic letters can be mapped to any letter from a group of phonetically close English letters (e.g. ب to "p" or "b").
- Some Arabic letters can be mapped to a sequence of English letters (e.g. خ to "kh").

Transliteration challenges - Cont
Tokenization challenges: the Arabic name is concatenated to clitics such as:
- Preposition […]
- Conjunction […]
- Both together (e.g. […])
Transliteration types:
- Forward: a name is transliterated from its original language into another language.
  Example: Arabic-origin name "محمد" -> "Mohamed"
- Backward: a transliterated name is transliterated back to the original name in its original language.
  Example: "بوش" -> "Bush"

Transliteration mining (TM)
- The automatic extraction of TPs from parallel or comparable corpora is called Transliteration Mining (TM).
- There are several methods to perform TM:
    Supervised
    Unsupervised
    Semi-supervised
- TM research focuses on one of two kinds of data:
    Parallel corpora
    Comparable corpora

Related work
- (Holmes et al., 2004) use a variant of the SOUNDEX method together with n-grams. It improves the precision and recall of name matching in the context of transliterated Arabic name search.
- (Darwish, 2010) presents two methods for improving TM: phonetic conflation of letters and iterative training of a transliteration model. The first method is an improved SOUNDEX phonetic algorithm: a SOUNDEX-like conflation scheme that improves recall and F-measure. The iterative training method improves recall but decreases precision.

TM using parallel corpora - semi-supervised
Figure: Extracting TPs from parallel corpora
[Flow diagram. Boxes: parallel text (Ar, En); POS tagging; word alignment; preprocessing (Ar and En); statistical or rule-based transliteration system (Ar/En); transliterated Ar; normalization (Ar and En); similarity scoring; transliteration table (TT); thresholds; Ar-En TPs]

TM algorithm for parallel corpora
(1) Tag the parallel corpus using a part-of-speech (POS) tagger. We used the Stanford POS tagger for English and Mada/Tokan for Arabic.
(2) Align the tagged bitext using Giza++. Using the source/target alignment file, remove all aligned word pairs whose POS tags are not noun (NN) or proper noun (PNN), and remove all English words starting with a lower-case letter. Word pairs with the lowest alignment scores are also removed (about 5% of the total number of aligned word pairs).
(3) Remove the POS tags from the Arabic and English words.
(4) Transliterate the Arabic word A into English using a rule-based transliteration system (or a previously trained statistical transliteration system).

TM algorithm for parallel corpora - Cont
(5) Normalize the transliteration A_t of the Arabic word, as well as the English word, to Norm 1, Norm 2 and Norm 3, as explained below. The objective of the normalization is to fold English letters with similar phonetics to the same letter or symbol.
(6) For each aligned pair of transliterated Arabic word A_t and English word E, use their normalized forms to calculate the three levels of similarity scores, which are stored in a transliteration table (TT).
(7) Extract TPs from the TT by applying a threshold to each of the three similarity scores. The thresholds were selected using an empirical method shown later.
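Steps (2) and (3) above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `AlignedPair` record, the `filter_aligned_pairs` function and its `drop_fraction` parameter are hypothetical names, the tag set simply copies the NN and PNN labels from the slide (real taggers may use other labels), and the sketch assumes the lowest-score cut is applied to the pairs that survive the tag and capitalisation filters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedPair:
    """One aligned word pair with POS tags (hypothetical record layout)."""
    ar_word: str
    ar_tag: str
    en_word: str
    en_tag: str
    score: float  # alignment score; higher means a more reliable link


# Tags named on the slide; an actual tagset may label these differently.
NOUN_TAGS = {"NN", "PNN"}


def filter_aligned_pairs(pairs, drop_fraction=0.05):
    """Keep noun/proper-noun pairs with a capitalised English word,
    drop the lowest-scoring fraction, and return untagged word pairs."""
    kept = [
        p for p in pairs
        if p.ar_tag in NOUN_TAGS
        and p.en_tag in NOUN_TAGS
        and p.en_word[:1].isupper()
    ]
    # Assumption: the cut is computed over the surviving candidates.
    ranked = sorted(kept, key=lambda p: p.score)
    n_drop = int(len(ranked) * drop_fraction)
    # Step (3): strip the POS tags by returning bare (Arabic, English) pairs.
    return [(p.ar_word, p.en_word) for p in ranked[n_drop:]]
```

The returned list is ordered by ascending alignment score; each surviving pair would then go to the transliteration and normalization steps.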

Calculating the three levels of similarity scores
[Flow diagram. The Ar word goes through the statistical or rule-based transliteration system (Ar/En) to give the transliterated Ar word. The transliterated Ar word and the En word are each normalized to Norm Form1, Norm Form2 and Norm Form3. The paired forms give the 1st, 2nd and 3rd level similarity scores, which are stored in the transliteration table (TT). Thresholds on the TT give the Ar-En TPs.]

Calculating the three levels of similarity scores - Cont
(1) Norm 1 normalization function: fold English letters with similar phonetics to one letter or symbol.
- The word is lower-cased.
- Phonetically equivalent consonants and vowels are folded to one letter, e.g. p and b are normalized to b, v and f are normalized to f, i and e are normalized to e.
- Double consonants are replaced by one letter.
- A hyphen "-" is inserted after the initial two letters "al", which are the transliteration of the Arabic article "ال".
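A rough Python sketch of Norm 1 under stated assumptions: the folding table contains only the three mappings given as examples on the slide (the full table is not shown), the rules are applied in the order listed, and the hyphen rule is applied to every word that starts with "al", which over-generates for names such as "Ali". The name `norm1` is illustrative.

```python
import re

# Only the example mappings from the slide; the complete table is not given.
FOLD = str.maketrans({"p": "b", "v": "f", "i": "e"})

# Any run of the same consonant (every letter except a, e, i, o, u).
DOUBLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1+")


def norm1(word: str) -> str:
    """Lower-case, fold similar-sounding letters, collapse doubled
    consonants, and mark an initial 'al' with a hyphen."""
    w = word.lower().translate(FOLD)
    w = DOUBLE_CONSONANT.sub(r"\1", w)
    # Assumption: any initial "al" is treated as the Arabic article.
    if w.startswith("al") and not w.startswith("al-") and len(w) > 2:
        w = "al-" + w[2:]
    return w
```

With these rules, spelling variants such as "Mohammed" and "Mohamed" both normalize to "mohamed", and "Alqaeda" and "Al-Qaeda" both normalize to "al-qaeda".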

Calculating the three levels of similarity scores - Cont
(2) Norm 2 normalization function, applied to the Norm 1 output:
- Double vowels are replaced by one similar upper-case letter (e.g. ee is normalized to E).
- Non-initial and non-final vowels are removed only if not followed by a vowel or not preceded by a vowel.
(3) Norm 3 normalization function, applied to the Norm 2 output:
- The hyphen "-" and vowels are removed.
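The following Python sketch shows one possible reading of Norm 2 and Norm 3 and how the three score levels could feed the thresholding step. Several parts are assumptions rather than the authors' method: the vowel-removal rule is read as "drop an interior lower-case vowel when neither neighbour is a vowel", upper-case letters produced from double vowels are kept by Norm 2 and removed by Norm 3, the similarity measure is `difflib.SequenceMatcher.ratio()` (the slides shown here do not define the measure), and the threshold values are placeholders. All function names are illustrative, and the inputs are expected to be Norm 1 forms.

```python
import re
from difflib import SequenceMatcher

VOWELS = "aeiou"


def norm2(n1: str) -> str:
    """Norm 2 on a Norm 1 string: 'ee' -> 'E', then drop isolated
    interior vowels (assumed reading of the rule on the slide)."""
    w = re.sub(r"([aeiou])\1", lambda m: m.group(1).upper(), n1)
    out = []
    for i, ch in enumerate(w):
        interior = 0 < i < len(w) - 1
        if interior and ch in VOWELS:
            prev_is_vowel = w[i - 1].lower() in VOWELS
            next_is_vowel = w[i + 1].lower() in VOWELS
            if not prev_is_vowel and not next_is_vowel:
                continue  # isolated short vowel: remove it
        out.append(ch)
    return "".join(out)


def norm3(n2: str) -> str:
    """Norm 3 on a Norm 2 string: remove the hyphen and all vowels."""
    return re.sub(r"[-aeiouAEIOU]", "", n2)


def similarity(a: str, b: str) -> float:
    # Assumed measure; returns a value in [0, 1].
    return SequenceMatcher(None, a, b).ratio()


def three_level_scores(at_n1: str, en_n1: str):
    """Scores for the Norm 1, Norm 2 and Norm 3 forms of a candidate pair."""
    at_n2, en_n2 = norm2(at_n1), norm2(en_n1)
    at_n3, en_n3 = norm3(at_n2), norm3(en_n2)
    return (
        similarity(at_n1, en_n1),
        similarity(at_n2, en_n2),
        similarity(at_n3, en_n3),
    )


def is_tp(scores, thresholds=(0.5, 0.7, 0.9)):
    """Accept a pair when every level reaches its (placeholder) threshold."""
    return all(s >= t for s, t in zip(scores, thresholds))
```

For example, a vowel-less transliteration "mhmd" and the English form "mohamed" differ at the first level but become identical at the second and third levels, since `norm2("mohamed")` yields "mhmd". This illustrates why looser normalizations can recover pairs that differ only in short vowels.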
