

SLIDE 1

Encoding transliteration variation through dimensionality reduction

Parth Gupta1, Paolo Rosso1 and Rafael E. Banchs2 pgupta@dsic.upv.es

1Natural Language Engineering Lab

Technical University of Valencia (UPV), Spain

2 HLT, Institute for Infocomm Research (I2R), Singapore

SLIDE 2

2 of 21

SLIDE 3

Transliterated Search

Example query: "Mere Sapno Ki Rani" (means: My Dream Girl)

3 of 21

SLIDE 4

Transliterated Search (A special case of Lyrics Retrieval)

4 of 21


SLIDE 8

What are the query and the document?

  • Query, e.g. "Mere Sapno ki rani"
    • The most repeated line in the song, e.g. "Ooh la la ooh la la"
    • The first line of the song, e.g. "Tadap tadap ke"
    • The "catchiest" part of the song, e.g. "Billo Rani"
    • A fairly unique line, e.g. "Mujhko saja di pyar ki"
  • Document
    • A webpage/document containing that song's lyrics in [Roman|Devanagari] script

5 of 21

SLIDE 9

Some challenges

  • Extensive spelling variation, e.g. "ayega", "aaega", "ayegaa"
  • Matching across the scripts, e.g. "ae" and its Devanagari equivalent
  • Unlike normal documents, some words/lines are repeated many times (statistical drift?)

6 of 21

SLIDE 10

Looking at the Problem...

  • Basically, the problem is two-fold:
  • 1. Handling spelling variation in the same script
    • Edit distance? Edit distance is an integer, so many entries sit at the same distance, e.g. Sapney → Sapne, Apney, Samney (all at the same distance)
    • A smarter edit distance? Editex uses Phonix and Soundex information in calculating the edit distance, but it needs mature Soundex and Phonix standards for the language.
  • 2. Performing a transliteration generation/mining operation to operate in the other script
    • Basically motivated by the need to operate across the scripts

7 of 21
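The integer-granularity problem above can be seen with a plain Levenshtein distance, sketched below (standard dynamic programming; this is not the Editex variant mentioned on the slide):

```python
# Plain Levenshtein (edit) distance via dynamic programming, to illustrate the
# granularity problem: several different candidates land at exactly the same
# distance from the query term.
def edit_distance(a: str, b: str) -> int:
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                 # delete ca
                           row[j - 1] + 1,                  # insert cb
                           prev_row[j - 1] + (ca != cb)))   # substitute
        prev_row = row
    return prev_row[-1]

# "sapne", "apney" and "samney" are all at distance 1 from "sapney".
for w in ["sapne", "apney", "samney"]:
    print(w, edit_distance("sapney", w))
```

Because the score is a whole number, the distance alone cannot rank `sapne` (the intended variant) above `apney` or `samney`.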


SLIDE 16

Our Model

  • We observe the association among inter/intra-script terms at the character uni/bi-gram level.
    • 1. Intra-script, e.g. s→sh, f→ph, j→z, (mu)→(moo)
    • 2. Inter-script, e.g. k→к, kh→х
  • Ideally the algorithm should automatically derive such mappings, BUT the end goal is to find equivalents considering this information.
  • We model inter/intra-script equivalents jointly.

8 of 21


SLIDE 20

Distribution of Units - Character n-grams in Terms

  • The character n-grams in terms follow the same distribution as terms in documents, with some variation.

[Figure: frequency distributions of character 1-grams, 2-grams and 3-grams, each plotted as frequency vs. n-gram ID]

9 of 21

SLIDE 21

Modeling the terms

  • 1. We create a unique character uni/bi-gram joint space (Cn) of both scripts out of the training terms, where n is the dimensionality, e.g. [a b c ... к х ... ch ks ...]
  • 2. The training term-pairs are transformed into feature vectors (vr, vd ∈ Cn), e.g. vr = "pyar" and vd = its Devanagari form.
  • 3. The dimensionality of these pairs is reduced to hr, hd ∈ Rm such that dist(hr, hd) is minimal, where m << n.
  • 4. [Important] As there is no distinction between features across the scripts, the model can learn principal components within (intra) and across (inter) the scripts jointly.

10 of 21
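Steps 1 and 2 above can be sketched as follows; the toy term list and helper names are illustrative (real training data would mix Roman and Devanagari terms in the same vocabulary):

```python
# Build a joint character uni/bi-gram space from training terms (step 1) and
# turn a term into a count vector in that space (step 2). Terms from both
# scripts share one vocabulary, so no feature is script-specific.
from itertools import chain

def char_ngrams(term, n_max=2):
    """All character n-grams of term for n = 1 .. n_max."""
    return [term[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(term) - n + 1)]

def build_space(terms):
    """Map each distinct n-gram to a dimension index."""
    grams = sorted(set(chain.from_iterable(char_ngrams(t) for t in terms)))
    return {g: i for i, g in enumerate(grams)}

def vectorize(term, space):
    """Count vector v in C^n for one term."""
    v = [0] * len(space)
    for g in char_ngrams(term):
        if g in space:
            v[space[g]] += 1
    return v

space = build_space(["pyar", "pyaar"])   # toy lexicon; both scripts in practice
print(vectorize("pyar", space))
```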

SLIDE 22

Training Method

  • A Deep Autoencoder is trained where the visible layer models the character grams through multinomial sampling [Salakhutdinov and Hinton, 2009].

[Figure: deep autoencoder — an RSM input layer over the original word (vd) and its transliteration (vr), a code layer, and a linear output layer; trained by pre-training followed by fine-tuning]

11 of 21
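As a rough illustration of the reconstruction objective, here is a deliberately tiny single-hidden-layer autoencoder in NumPy. It is a stand-in only: the slides describe a deep autoencoder with RSM (multinomial) visible units and pre-training, whereas this sketch uses a sigmoid code, a linear output layer, and plain gradient descent; layer sizes and the learning rate are assumptions.

```python
# Minimal one-hidden-layer autoencoder with tied weights: sigmoid code layer,
# linear output layer, squared-error loss, trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyAutoencoder:
    def __init__(self, n_visible, n_code, lr=0.1):
        self.W = rng.normal(0, 0.1, (n_visible, n_code))  # tied encode/decode weights
        self.b = np.zeros(n_code)     # code-layer bias
        self.c = np.zeros(n_visible)  # output-layer bias
        self.lr = lr

    def encode(self, v):
        return sigmoid(v @ self.W + self.b)

    def decode(self, h):
        return h @ self.W.T + self.c  # linear output layer

    def step(self, v):
        """One gradient-descent step on 0.5 * ||decode(encode(v)) - v||^2."""
        h = self.encode(v)
        err = self.decode(h) - v                # d(loss)/d(reconstruction)
        dh = (err @ self.W) * h * (1.0 - h)     # backprop through the code layer
        self.W -= self.lr * (np.outer(v, dh) + np.outer(err, h))
        self.b -= self.lr * dh
        self.c -= self.lr * err
        return float((err ** 2).mean())
```

Feeding a term's count vector repeatedly drives the reconstruction error down, and `encode(v)` plays the role of the low-dimensional code h.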

SLIDE 23

Finding equivalents

  • A priori, the complete lexicon of the reference/source collection is projected into the abstract space using the autoencoder.
  • Given a query term qt, its feature vector vqt is also projected into the abstract space as hqt.
  • All terms whose cosine similarity with hqt is greater than θ are considered equivalents.

[Figure: the query term (vq), paired with a zero vector, fed through the RSM layer and linear layer to produce the code (hq)]

12 of 21
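The thresholded lookup itself is a plain cosine-similarity scan; a minimal sketch, with hand-picked vectors standing in for the autoencoder codes (the default θ mirrors the 0.95 used in the runs):

```python
# Return all lexicon terms whose code has cosine similarity > theta with the
# projected query code. The codes here are toy vectors; in the model they are
# the autoencoder projections.
import numpy as np

def equivalents(h_q, lexicon, codes, theta=0.95):
    codes = np.asarray(codes, dtype=float)
    h_q = np.asarray(h_q, dtype=float)
    sims = codes @ h_q / (np.linalg.norm(codes, axis=1) * np.linalg.norm(h_q))
    return [t for t, s in zip(lexicon, sims) if s > theta]

lexicon = ["ayega", "aaega", "billo"]
codes = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
print(equivalents([1.0, 0.1], lexicon, codes))  # → ['ayega', 'aaega']
```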

SLIDE 24

Subtask-2 : Adhoc Retrieval

  • Query Formulation

    Original Query       ik din ayega
    Variants of "ik"     "ik", "ikk", "ig", "eк", "iк"
    Variants of "din"    "din", "didn", "diin", ...
    Variants of "ayega"  "ayega", "aeyega", "ayegaa", ...
    Formulated Query     ik$din, ik$didn, ik$diin, ..., diin$ayega, diin$aeyega, diin$ayegaa, ...

  • Ranking Model (word 2-grams variant)
    • TF-IDF
    • unsupervised DFR (parameter-free)

13 of 21
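The query formulation above can be sketched as: expand each query term with its variants, then join every pair of adjacent variants with "$" to form word 2-grams. The variant lists below are abbreviated and the Devanagari forms are omitted:

```python
# Expand a query into '$'-joined word 2-grams over term variants, as in the
# formulated query above. Variant lists are illustrative and incomplete.
from itertools import product

def formulate(query_terms, variants):
    pairs = []
    for a, b in zip(query_terms, query_terms[1:]):
        for va, vb in product(variants.get(a, [a]), variants.get(b, [b])):
            pairs.append(f"{va}${vb}")
    return pairs

variants = {
    "ik": ["ik", "ikk", "ig"],
    "din": ["din", "didn", "diin"],
    "ayega": ["ayega", "aeyega", "ayegaa"],
}
q = formulate(["ik", "din", "ayega"], variants)
print(q[:3])  # → ['ik$din', 'ik$didn', 'ik$diin']
```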

SLIDE 25

Demo

Transliteration Encoding Demo

14 of 21

SLIDE 26

Adhoc Retrieval: Evaluation

Parameters of our Method:
  • 1. min len - minimum term length for query expansion,
  • 2. θ - similarity threshold and,
  • 3. algo - ranking algorithm.

          min len   θ      algo
  Run-1   2         0.95   TF-IDF
  Run-2   2         0.95   DFR
  Run-3   3         0.95   DFR

  Metric     Run-1    Run-2    Run-3    Score_max   Score_median
  nDCG@5     0.7669   0.8052   0.7584   0.8052      0.5620
  nDCG@10    0.7642   0.8002   0.7534   0.8002      0.5608
  MAP        0.4209   0.4236   0.3558   0.4236      0.2355
  MRR        0.7747   0.8440   0.7773   0.8440      0.5884

  • Trade-off: θ vs. word n-gram

15 of 21

SLIDE 27

Subtask 1: Query Word Labeling

  • Identifying the language of each query term
    • SVM classifier on character uni/bi/tri-grams
    • Training data: 10k English terms and 30k Hindi transliterated terms
  • Transliteration
    • Transliteration mining approach - the most similar Hindi (Devanagari script) term in the projection space

  Labeling                        Transliteration
  Accuracy             0.9540     F-Score (run-1)   0.4209
  F-Score (English)    0.9019     F-Score (run-2)   0.4311
  F-Score (Hindi)      0.9700     F-Score (run-3)   0.3796

16 of 21

SLIDE 28

Subtask 1 Labeling - Comments

  • A simple classification scheme is able to fetch a decent labeling accuracy of 95%.
  • Some terms are present in both languages, e.g. to, me, chain, fool and so on (each is also a valid Hindi word).
  • Such terms need to be handled properly.

17 of 21

SLIDE 29

Subtask 1 Transliteration - Comments

  • We used Wikipedia as the reference collection for transliteration mining.
  • Hindi Wikipedia is quite noisy, and hence our algorithm gets penalized in cases where noisy spellings (e.g. к instead of k) are mined instead of the correct forms.
  • Utilizing a more extensive and linguistically correct collection could improve performance.
  • The transliteration accuracy is 0.43 using a Wikipedia lexicon of 384k entries, despite coverage and misspelling issues.

18 of 21

SLIDE 30

Comments relating Subtask 1 to Subtask 2

  • If the final goal is to retrieve relevant documents, then why restrict the query to the correct transliteration?
  • There also exist phonetic variations of the terms in the indigenous script: popular but not necessarily correct, e.g. "Mohabbat" is also frequently written as "Muhabbat" and "Mahobbat".

19 of 21

SLIDE 31

Thank You for your attention!

20 of 21

SLIDE 32

References (1)

Salakhutdinov, R. and Hinton, G. E. (2009). Replicated Softmax: an Undirected Topic Model. In NIPS, pages 1607–1614.

21 of 21

SLIDE 33

Extras

  • Ram Leela Vs. Haram Leela
  • Sohan Papdi Vs. Mohan Papdi

22 of 21