Encoding transliteration variation through dimensionality reduction

  1. Encoding transliteration variation through dimensionality reduction. Parth Gupta¹, Paolo Rosso¹ and Rafael E. Banchs² (pgupta@dsic.upv.es). ¹Natural Language Engineering Lab, Technical University of Valencia (UPV), Spain; ²HLT, Institute for Infocomm Research (I²R), Singapore


  3. Transliterated Search (means: My Dream Girl)

  4. Transliterated Search (a special case of Lyrics Retrieval)


  8. What is a query and a document?
     • Query: "Mere Sapno ki rani"
       ◦ The most repeated line in the song, e.g. "Ooh la la ooh la la"
       ◦ The first line of the song, e.g. "Tadap tadap ke"
       ◦ The "catchiest" part of the song, e.g. "Billo Rani"
       ◦ A fairly unique line, e.g. "Mujhko saja di pyar ki"
     • Document
       ◦ A webpage/document containing that song's lyrics in [Roman | Devanagari] script

  9. Some challenges
     • Extensive spelling variation, e.g. "ayega", "aaega", "ayegaa"
     • Matching across scripts, e.g. आएगा vs. "ayega"
     • Unlike normal documents, some words/lines are repeated many times (statistical drift?)

  10. Looking at the problem
      • The problem is two-fold:
      1. Handling spelling variation within the same script
         • Edit distance? Edit distance is integer-valued, so many candidates tie at the same distance, e.g. Sapney → Sapne, Apney, Samney
         • A smarter edit distance? Editex uses Phonix and Soundex information when computing edit distance, but it requires mature Soundex and Phonix standards for the language.
      2. Generating or mining transliterations so as to operate in the other script
         • Motivated by the need to match across scripts
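The tie problem with plain edit distance can be seen directly. A minimal sketch (standard dynamic-programming Levenshtein distance, not code from the paper):

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# All three candidates tie at distance 1 from "sapney",
# so edit distance alone cannot rank them.
for cand in ["sapne", "apney", "samney"]:
    print(cand, edit_distance("sapney", cand))  # → 1 each
```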


  16. Our Model
      • We observe associations among inter-/intra-script terms at the character uni-/bi-gram level:
        1. Intra-script, e.g. s → sh, f → ph, j → z, मु (mu) → मू (moo)
        2. Inter-script, e.g. k → क, kh → ख
      • Ideally the algorithm should derive such mappings automatically, but the end goal is to find equivalents using this information.
      • We model inter- and intra-script equivalents jointly.
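As an illustration only (the mapping table below is hand-written, whereas the slide's point is that such associations should be learned by the model), intra-script mappings like s → sh or j → z generate surface variants by substitution:

```python
from itertools import product

# Hypothetical hand-written intra-script mappings of the kind the
# slide lists (s → sh, f → ph, j → z); the model is meant to learn
# such associations rather than use a fixed table.
MAPPINGS = {"s": ["s", "sh"], "f": ["f", "ph"], "j": ["j", "z"]}

def surface_variants(term: str) -> set[str]:
    """Enumerate variants by substituting each character
    with its mapped alternatives."""
    choices = [MAPPINGS.get(ch, [ch]) for ch in term]
    return {"".join(p) for p in product(*choices)}

print(sorted(surface_variants("saja")))
# contains "saja", "saza", "shaja", "shaza"
```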


  20. Distribution of Units: Character n-grams in Terms
      • The character n-grams in terms follow the same distribution as terms in documents, with some variation.
      [Figure: frequency distributions of character 1-grams, 2-grams and 3-grams, plotted against n-gram ID]
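The distributions in the figure can be reproduced in spirit by counting character n-grams over a term list; a sketch with toy terms (the paper's actual training vocabulary is not shown here):

```python
from collections import Counter

def char_ngrams(term: str, n: int):
    """All overlapping character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]

terms = ["sapne", "sapney", "apney", "samney", "ayega", "aaega"]

# Rank-frequency profile for each n, analogous to the three panels.
for n in (1, 2, 3):
    freq = Counter(g for t in terms for g in char_ngrams(t, n))
    print(f"{n}-grams:", freq.most_common(3))
```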

  21. Modeling the terms
      1. We create a joint space (C^n) of the unique character uni-/bi-grams of both scripts out of the training terms, where n is the dimensionality, e.g. [a b c ... ch ks ... क ख ...]
      2. The training term pairs are transformed into feature vectors (v_r, v_d ∈ C^n), e.g. v_r = "pyar" and v_d = प्यार
      3. The dimensionality of these pairs is reduced to h_r, h_d ∈ R^m such that dist(h_r, h_d) is minimal, where m << n
      4. [Important] As there is no distinction between features across the scripts, the model can learn principal components within (intra) and across (inter) the scripts jointly.
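Steps 1 and 2 amount to building one shared index over the uni- and bi-grams of both scripts, then vectorizing each term as n-gram counts. A hedged reconstruction with a toy pair list (the exact featurization in the paper may differ):

```python
import numpy as np

def grams(term, ns=(1, 2)):
    """Character uni- and bi-grams of a term."""
    return [term[i:i + n] for n in ns for i in range(len(term) - n + 1)]

# Toy training pairs: Roman term and its Devanagari counterpart.
pairs = [("pyar", "प्यार"), ("din", "दिन")]

# Joint space C^n: ONE vocabulary over both scripts' n-grams, so
# Roman and Devanagari features share the same dimensions.
vocab = sorted({g for r, d in pairs for g in grams(r) + grams(d)})
index = {g: i for i, g in enumerate(vocab)}

def vectorize(term):
    v = np.zeros(len(vocab))
    for g in grams(term):
        if g in index:
            v[index[g]] += 1
    return v

v_r, v_d = vectorize("pyar"), vectorize("din")
print(v_r.shape)  # both vectors share dimensionality n = len(vocab)
```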

  22. Training Method
      • A deep autoencoder is trained where the visible layer models the character grams through multinomial sampling [Salakhutdinov and Hinton, 2009].
      [Figure: pre-training and fine-tuning architecture; an RSM input layer over the original word (v_d) and its transliteration (v_r), a 20-unit linear code layer, and the output layer]
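A minimal single-hidden-layer stand-in for the autoencoder (NumPy, untied weights, softmax reconstruction of n-gram counts in the spirit of a multinomial visible layer). The real model is deep, RSM-pretrained and fine-tuned, so this is only a shape-of-the-idea sketch with made-up sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Toy count vector of character n-grams (the "visible" data).
v = np.array([2.0, 1.0, 0.0, 1.0, 0.0, 1.0])
n, m = v.size, 3                  # visible size, code size (m << n)
W1 = rng.normal(0, 0.1, (m, n))   # encoder weights
W2 = rng.normal(0, 0.1, (n, m))   # decoder weights

def loss(v):
    h = sigmoid(W1 @ v)           # low-dimensional code
    p = softmax(W2 @ h)           # reconstructed n-gram distribution
    return -np.sum(v * np.log(p + 1e-12)), h

lr = 0.1
l0, _ = loss(v)
for _ in range(200):
    h = sigmoid(W1 @ v)
    p = softmax(W2 @ h)
    # Gradient of multinomial cross-entropy wrt decoder pre-activation:
    da = v.sum() * p - v
    dh = W2.T @ da
    W2 -= lr * np.outer(da, h)
    W1 -= lr * np.outer(dh * h * (1 - h), v)
l1, h = loss(v)
print(l1 < l0, h)  # loss decreases; h is the learned code
```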

  23. Finding equivalents
      • A priori, the complete lexicon of the reference/source collection is projected into the abstract space using the autoencoder.
      • Given the query term q_t, its feature vector v_qt is also projected into the abstract space as h_qt.
      • All terms which have cosine similarity greater than θ are considered equivalents.
      [Figure: the query term (v_q), paired with a zero vector, passes through the RSM layer and the 20-unit linear layer to produce the code h_q]
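Given precomputed codes, the retrieval step is a thresholded cosine search. A sketch assuming the lexicon codes are rows of a matrix (the threshold and toy codes below are illustrative, not from the paper):

```python
import numpy as np

def equivalents(h_q, lexicon_codes, lexicon_terms, theta=0.8):
    """Return all lexicon terms whose code has cosine
    similarity greater than theta with the query code h_q."""
    H = lexicon_codes / np.linalg.norm(lexicon_codes, axis=1, keepdims=True)
    q = h_q / np.linalg.norm(h_q)
    sims = H @ q
    return [t for t, s in zip(lexicon_terms, sims) if s > theta]

# Toy codes: two terms near the query direction, one far away.
codes = np.array([[1.0, 0.1], [0.9, 0.2], [-1.0, 0.5]])
terms = ["sapne", "sapney", "billo"]
print(equivalents(np.array([1.0, 0.0]), codes, terms))
# → ['sapne', 'sapney']
```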

  24. Subtask-2: Ad hoc Retrieval
      • Query formulation
        ◦ Original query: ik din ayega
        ◦ Variants of "ik": "ik", "ikk", "ig", plus Devanagari forms
        ◦ Variants of "din": "din", "didn", "diin", plus Devanagari forms
        ◦ Variants of "ayega": "ayega", "aeyega", "ayegaa", plus Devanagari forms
        ◦ Formulated query: ik$din, ik$didn, ik$diin, diin$ayega, diin$aeyega, diin$ayegaa, ..., plus the corresponding Devanagari 2-grams
      • Ranking model (word 2-gram variant)
        ◦ TF-IDF
        ◦ Unsupervised DFR (parameter-free)
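The word 2-gram formulation can be sketched as the Cartesian product of adjacent words' variant sets, joined with `$` as on the slide (the variant lists here are truncated toy data):

```python
from itertools import product

def formulate(variant_lists, sep="$"):
    """Expand a query into word 2-grams over all variant combinations
    of each pair of adjacent words."""
    out = []
    for left, right in zip(variant_lists, variant_lists[1:]):
        out.extend(sep.join(p) for p in product(left, right))
    return out

variants = [["ik", "ikk"], ["din", "diin"], ["ayega", "ayegaa"]]
q = formulate(variants)
print(q[:4])  # → ['ik$din', 'ik$diin', 'ikk$din', 'ikk$diin']
```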

  25. Demo: Transliteration Encoding Demo
