Query Word Labeling and Back Transliteration for Indian Languages

Query Word Labeling and Back Transliteration for Indian Languages: MSRI Shared Task System Description
Spandana Gella (1, 2), Jatin Sharma (1), Kalika Bali (1)
(1) Microsoft Research, India
(2) University of Melbourne, Australia
December 4, 2013
Authors: Spandana Gella, Jatin Sharma, Kalika Bali

SubTask 1: Query Word Labeling
Many Indian languages, especially in social media, are written using romanized script.
Table: Shared task description in two separate steps of query labeling and back transliteration

Our Methodology
Word-level language identification
- Based on character n-gram features learned from wordlists extracted from monolingual corpora (King and Abney, 2013)
- Adding context switch probability to indirectly learn the language sequence patterns
- Frequency-based filtering
Back transliteration
- Hash-based mapping between source and target languages (Kumar and Udupa, 2011)
- Use Indic character mapping to create training data in resource-poor languages

Terminology, Datasets and Tools
- Character n-gram features: hello: 'h', 'e', ..., 'o', 'he', 'el', ..., 'hel', ..., 'hell', 'ello', 'hello'
- Training resources: word lists (from the Leipzig Corpus and Anandbazar Patrika), word frequencies, and transliterated pairs given as part of the shared task
- Training size: 100 to 5000 words (always <= 546 for Gujarati)
- Tools: MALLET (McCallum, 2002) for learning classifiers; MSRI Name Search Tool for transliteration
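The character n-gram features listed for "hello" can be sketched as follows. This is a minimal illustration only: the function name `char_ngrams` and the `max_n` cap are assumptions, since the slides do not state the maximum n-gram length or whether word-boundary markers were used.

```python
def char_ngrams(word, max_n=5):
    """Return every character n-gram of length 1..max_n, shortest first.

    max_n is an assumed cap; the slide example for "hello" goes up to
    the full word, i.e. n = 5.
    """
    grams = []
    for n in range(1, min(max_n, len(word)) + 1):
        # Slide a window of width n across the word.
        for start in range(len(word) - n + 1):
            grams.append(word[start:start + n])
    return grams


if __name__ == "__main__":
    print(char_ngrams("hello"))
```

For a five-letter word this yields 15 features (5 unigrams, 4 bigrams, 3 trigrams, 2 four-grams, and the word itself).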

Word label prediction based on n-gram features
[Figure: Learning curves for maximum entropy, naive Bayes and decision tree classifiers on word labeling for Hindi, Gujarati and Bangla on development data. Panels: (a) Hindi-English, (b) Gujarati-English, (c) Bangla-English. x-axis: sampled size (100 to 5000 words); y-axis: accuracy.]
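A word-level classifier over character n-grams can be sketched with a small naive Bayes model trained from per-language word lists. This is an illustration under stated assumptions, not the system described in the slides (which used MALLET): the class name `NgramNaiveBayes`, the add-one smoothing, the `max_n=3` cap, and the toy word lists are all hypothetical.

```python
import math
from collections import Counter


def char_ngrams(word, max_n=3):
    """All character n-grams of length 1..max_n (assumed cap)."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


class NgramNaiveBayes:
    """Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, wordlists, max_n=3):
        self.max_n = max_n
        # Per-language n-gram counts gathered from the word lists.
        self.counts = {
            lang: Counter(g for w in words for g in char_ngrams(w, max_n))
            for lang, words in wordlists.items()
        }
        self.totals = {lang: sum(c.values()) for lang, c in self.counts.items()}
        self.vocab_size = len(set().union(*self.counts.values()))
        n_words = sum(len(words) for words in wordlists.values())
        self.log_prior = {lang: math.log(len(words) / n_words)
                          for lang, words in wordlists.items()}

    def log_scores(self, word):
        """Unnormalised log score of each language for one word."""
        scores = {}
        for lang, counts in self.counts.items():
            denom = self.totals[lang] + self.vocab_size
            score = self.log_prior[lang]
            for gram in char_ngrams(word.lower(), self.max_n):
                score += math.log((counts[gram] + 1) / denom)
            scores[lang] = score
        return scores

    def predict(self, word):
        scores = self.log_scores(word)
        return max(scores, key=scores.get)


# Toy word lists, invented for illustration only.
TOY_WORDLISTS = {
    "H": ["pyaar", "kahani", "zindagi", "dil", "gaana", "sapna"],
    "E": ["love", "story", "life", "heart", "song", "dream"],
}

if __name__ == "__main__":
    model = NgramNaiveBayes(TOY_WORDLISTS)
    for word in ["zindagi", "story", "kahaniyan"]:
        print(word, model.predict(word))
```

With lists this small the model is only a demonstration; the learning curves on this slide were obtained with training sizes between 100 and 5000 words.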

Adding context-switch probability
[Figure: Learning curves with varying context switch probabilities, including a "None" setting. Panels: (a) Hindi - Maxent, (b) Gujarati - Maxent, (c) Bangla - Naive Bayes. x-axis: sampled size (100 to 5000 words); y-axis: accuracy.]
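One way to combine per-word classifier scores with a context switch probability is Viterbi decoding with a two-valued transition model: stay in the same language with probability 1 - p, switch with probability p. The slides do not specify how the probability was defined or applied, so this formulation, the function name `decode_with_switch`, and the example scores are assumptions.

```python
import math


def decode_with_switch(word_log_scores, switch_prob):
    """Label a word sequence using per-word log scores plus a switch penalty.

    word_log_scores: one dict per word, mapping label -> log score.
    switch_prob: assumed probability of changing language between
                 adjacent words, strictly between 0 and 1.
    Requires at least two labels.
    """
    labels = list(word_log_scores[0])
    log_stay = math.log(1.0 - switch_prob)
    # Spread the switch mass evenly over the other labels.
    log_switch = math.log(switch_prob / (len(labels) - 1))

    def transition(prev, cur):
        return log_stay if prev == cur else log_switch

    best = dict(word_log_scores[0])   # best path score ending in each label
    backpointers = []
    for scores in word_log_scores[1:]:
        new_best, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[p] + transition(p, cur))
            new_best[cur] = best[prev] + transition(prev, cur) + scores[cur]
            pointers[cur] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace the best path backwards from the best final label.
    path = [max(best, key=best.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Invented scores: the middle word is only weakly English.
EXAMPLE_SCORES = [
    {"H": math.log(0.90), "E": math.log(0.10)},
    {"H": math.log(0.45), "E": math.log(0.55)},
    {"H": math.log(0.90), "E": math.log(0.10)},
]

if __name__ == "__main__":
    print(decode_with_switch(EXAMPLE_SCORES, 0.1))
    print(decode_with_switch(EXAMPLE_SCORES, 0.5))
```

With a low switch probability the weakly scored middle word is pulled toward its neighbours' label; with a switch probability of 0.5 the transition term is uniform and each word keeps its own best label.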

Language Identification Errors

Type                    Romanized        Predicted   Reference
Short Words             i; ve            H; E        E; H
Ambiguous Words         the; ate         E; E        H; H
Erroneous Words         emosal           H           E
Mixed Numerals Words    zara2; duwan2    E; E        H; H

Table: Annotation Errors

Back Transliteration
- MSRI Name Search Tool, built on n-gram-based feature hashing
- Used Indic character mapping between Hindi-Bangla and Hindi-Gujarati
- All 3 systems for Gujarati and Bangla use Indic character mapping
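A character mapping between Hindi and Bangla or Gujarati can be sketched using the fact that the Unicode Devanagari, Bengali and Gujarati blocks share a parallel layout, so corresponding letters sit at the same offset within each block. The slides do not say how the mapping was built, so treating it as a block offset shift is an assumption, as are the names `BLOCK_START` and `map_script`. Some code points have no assigned counterpart in the target block, so a real mapping would need per-character exceptions.

```python
# Start of each 128-code-point Unicode block.
BLOCK_START = {
    "hindi": 0x0900,     # Devanagari
    "bangla": 0x0980,    # Bengali
    "gujarati": 0x0A80,  # Gujarati
}
BLOCK_SIZE = 0x80


def map_script(text, source, target):
    """Shift characters from the source script block to the target block.

    Characters outside the source block (spaces, digits, Latin letters)
    are copied unchanged.
    """
    src = BLOCK_START[source]
    tgt = BLOCK_START[target]
    mapped = []
    for ch in text:
        code = ord(ch)
        if src <= code < src + BLOCK_SIZE:
            mapped.append(chr(code - src + tgt))
        else:
            mapped.append(ch)
    return "".join(mapped)


if __name__ == "__main__":
    word = "\u0928\u092e\u0915"  # Devanagari letters na, ma, ka
    print(map_script(word, "hindi", "bangla"))
    print(map_script(word, "hindi", "gujarati"))
```

Applied to the Hindi side of a transliteration pair list, a mapping like this would produce approximate Bangla or Gujarati pairs that could serve as training data.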

Test Set Results

           Hindi                       Gujarati                    Bangla
System     LA      TF      TQM         LA      TF      TQM         LA      TF      TQM
MSRI-1     0.9823  0.8127  0.1940      0.9614  0.4711  0.0800      0.9259  0.4914  0.0100
MSRI-2     0.9848  0.8130  0.1980      0.9755  0.4803  0.0733      0.9499  0.5033  0.0100
MSRI-3     0.9826  0.8101  0.1860      0.9661  0.4748  0.0667      0.9459  0.5137  0.0100
Maximum    0.9848  0.8130  0.1980      0.9755  0.4803  0.0800      0.9499  0.5137  0.0100
Median     0.9540  0.4160  0.0290      0.9661  0.4748  0.0733      0.9359  0.4973  0.0100

Table: Language labeling analysis on submitted runs in all three languages, along with maximum and median scores. Our runs which had maximum scores are presented in bold. LA - Labeling Accuracy, TF - Transliteration F-score, TQM - % of queries that had exact labeling and transliteration

Transliteration Error Analysis
Table: Transliteration Errors

Summary
Contributions:
- Using context switch probability increases the performance of language labeling in code-mixed language.
- Cross-language character mapping to increase transliteration accuracy is a promising direction for resource-poor languages.
Future Work:
- Extending it to text with spelling variations (covering text normalization)
- Working on multiple languages, especially resource-poor languages, by exploiting resources from related languages

Questions?

Bibliography
King, B. and Abney, S. (2013). Labeling the languages of words in mixed-language documents using weakly supervised methods. In Proceedings of NAACL-HLT, pages 1110–1119.
Kumar, S. and Udupa, R. (2011). Learning hash functions for cross-view similarity search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Two, pages 1360–1365. AAAI Press.
McCallum, A. K. (2002). Mallet: A machine learning for language toolkit.
