Analysis and performance of morphological query expansion and language-filtering words on Basque web searching I. Leturia, A. Gurrutxaga, N. Areta, E. Pociello Elhuyar R&D, Usurbil, Basque Country LREC 2008 May 29, 2008 Marrakech


  1. Analysis and performance of morphological query expansion and language-filtering words on Basque web searching I. Leturia, A. Gurrutxaga, N. Areta, E. Pociello Elhuyar R&D, Usurbil, Basque Country LREC 2008 – May 29, 2008 – Marrakech

  2. Contents • Introduction • Current study • Morphological query expansion • Language-filtering words • Conclusions

  4. Basque IR problems
     • Looking for conjugations and inflections
       – Basque is an agglutinative language
       – A given lemma yields many different surface forms: lan (“work”), lana (“the work”), lanak (“the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”)...
       – Looking only for the exact given word, or the word plus an “s” for the plural, is not enough
       – Wildcards are not an appropriate solution: looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)...
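
The wildcard problem can be seen in a small sketch. It uses Python's standard `fnmatch` as a stand-in for a search engine's wildcard matching (an assumption; real engines may behave differently) and only the words quoted on the slide.

```python
from fnmatch import fnmatch

# Surface forms of the lemma "lan" ("work") listed on the slide.
lan_forms = ["lan", "lana", "lanak", "lanari", "lanei", "lanaren"]
# Unrelated words from the slide that merely start with the same letters.
other_words = ["lanabes", "lanbro"]

def wildcard_hits(pattern, words):
    """Return the words matched by a shell-style wildcard pattern."""
    return [w for w in words if fnmatch(w, pattern)]

hits = wildcard_hits("lan*", lan_forms + other_words)
# Matches that are not forms of the lemma "lan" are false positives.
false_positives = [w for w in hits if w not in lan_forms]
print(false_positives)
```

The pattern recovers every form of lan, but it cannot tell them apart from lanabes and lanbro.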

  5. Basque IR problems
     • Language discrimination
       – No search engine offers the possibility of returning only pages in Basque
       – Big problem when looking for technical words that also exist in other languages (anorexia, sulfuroso, byte, allegro, sistema, energia...), short words (katu, ur...) or proper nouns (Egipto, Newton, Pluton...)
       – Many non-Basque results are returned, often no Basque results at all

  6. Our approach
     • API based
       – We use APIs of major search engines
       – Cost-effective solution
       – NLP techniques applied to obtain better results

  7. Our approach
     • Morphological query expansion or MQE (I)
       – We use a morphological generator for Basque created by the IXA Group of the University of the Basque Country
       – We obtain all the forms of a given lemma
       – We ask the search engine for all of them using an OR operator
       – etxe => etxe OR etxea OR etxeak OR etxeari OR etxeei OR etxeek OR...
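
A minimal sketch of the expansion step, assuming a hand-written lookup table in place of the morphological generator. `FORMS`, `expand_lemma` and `mqe_query` are hypothetical names, and the table holds only the forms shown on the slide.

```python
# Hypothetical stand-in for the morphological generator: a hand-written
# lookup containing only the forms of "etxe" shown on the slide.
FORMS = {
    "etxe": ["etxe", "etxea", "etxeak", "etxeari", "etxeei", "etxeek"],
}

def expand_lemma(lemma):
    """Return the known surface forms of a lemma, or the lemma itself."""
    return FORMS.get(lemma, [lemma])

def mqe_query(lemma):
    """Join all surface forms of the lemma with the OR operator."""
    return " OR ".join(expand_lemma(lemma))

print(mqe_query("etxe"))
```

A lemma missing from the table is passed through unchanged, so the query degrades to an exact-word search instead of failing.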

  8. Our approach
     • Morphological query expansion or MQE (II)
       – Each search engine API limits the number of words allowed in a query
       – This makes real lemma-based search impossible
       – But good results can be obtained if the forms sent in the query are the most frequent ones
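
One way to respect a word limit is to keep only the first few forms of a list already sorted by frequency. In the sketch below, the order of the forms of hiztegi follows the noun ranking given later in these slides. The limit of 3 is an arbitrary illustrative value, and the sketch assumes that only search terms (not the OR operators) count towards the limit.

```python
# Forms of the noun "hiztegi" ("dictionary"), ordered from most to least
# frequent case, following the noun ranking given later in the slides.
HIZTEGI_FORMS = [
    "hiztegi", "hiztegia", "hiztegiak",
    "hiztegiko", "hiztegiaren", "hiztegiari",
]

def limited_mqe_query(forms_by_frequency, max_words):
    """Build an OR query from the most frequent forms only.

    max_words is a hypothetical per-query limit; real APIs differ.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    kept = forms_by_frequency[:max_words]
    return " OR ".join(kept)

print(limited_mqe_query(HIZTEGI_FORMS, 3))
```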

  9. Our approach
     • Language-filtering words or LFW
       – Some of the most frequent Basque words are added to the query using an AND operator
       – Several LFWs have to be used, since the most frequent words in Basque exist in other languages too
       – The more LFWs are used, the better the language precision, but at the cost of recall
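
A sketch of how LFWs might be attached to an expanded query. The words eta (“and”) and da (“is”) are common Basque words chosen here purely for illustration; the slides do not say which LFWs the authors use. The parenthesised query syntax is likewise an assumption.

```python
def add_lfws(query, lfws):
    """Require every language-filtering word alongside the original query.

    The query is wrapped in parentheses so that its OR alternatives are
    grouped before the AND conditions apply (assumed query syntax).
    """
    if not lfws:
        return query
    return " AND ".join([f"({query})"] + list(lfws))

# Illustrative LFWs only; not taken from the slides.
print(add_lfws("etxe OR etxea OR etxeak", ["eta", "da"]))
```

Each extra word in `lfws` adds another condition a page must satisfy, which is why precision rises while recall drops.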

  10. Tools built
     • Elebila
       – Search service for Basque
       – API based
       – Lemma-based search (MQE)
       – Returns only pages in Basque (LFWs)
       – Optional search for variants of words
       – Optional lemma-based search for whole noun phrases or terms (by enclosing them in double quotes)
       – http://www.elebila.eu

  11. [Figure: Elebila, with callouts: various possible analyses offered; variant suggestion; lemma-based search; all results in Basque]

  12. Tools built
     • CorpEus (I)
       – Web-as-corpus tool for Basque
       – API based
       – Lemma-based search (MQE)
       – Returns only occurrences in Basque (LFWs)
       – Optional search for variants of words
       – Optional lemma-based search for whole noun phrases or terms (by enclosing them in double quotes)

  13. Tools built
     • CorpEus (II)
       – Parallel downloading of pages
       – Analyses of the results
       – Different ordering criteria
       – Occurrence counts and charts
       – http://www.corpeus.org

  14. [Figure: CorpEus, with callouts: various possible analyses offered; analysis of the results; occurrence counts and charts; all results in Basque; lemma-based search]

  15. Contents • Introduction • Current study • Morphological query expansion • Language-filtering words • Conclusions

  17. Current study
     • Analysis and performance measurement of MQE and LFWs
     • Corpus based

  18. Current study
     • The implementation details of the methodology strongly affect its performance
       – Cases for MQE
       – Which and how many LFWs
     • Previously
       – LFWs chosen based on a classic corpus
       – Cases for MQE chosen quite intuitively
       – Improvement not measured quantitatively

  19. Corpora used
     • ZT Corpusa
       – Corpus of Science and Technology
       – 7.6 million words
     • A web corpus
       – Downloaded all the pages of the Basque branch of Google Directory (+3,000) and recursively followed links of pages in Basque
       – 44,000 documents
       – 20 million words

  20. Words used
     • Some words needed to perform the various measurements
       – For observing the most frequent cases for MQE
       – For measuring the language-precision obtained by LFWs
     • Most asked-for words of the Elebila logs
       – Four months, 400,000 queries, 800,000 words, 70,000 different words
       – Lemmatised and used the most frequent ones
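
The log-processing step above can be sketched as lemmatising every query word and counting lemmas. The lemmatiser here is a hypothetical hand-written lookup and the log is a made-up toy example, not data from the Elebila logs.

```python
from collections import Counter

# Hypothetical lemmatiser: a hand-written lookup covering only the toy log.
LEMMAS = {
    "etxea": "etxe", "etxeak": "etxe", "etxeari": "etxe",
    "lanak": "lan", "hiztegia": "hiztegi",
}

def lemmatise(word):
    """Map a surface form to its lemma; unknown words map to themselves."""
    word = word.lower()
    return LEMMAS.get(word, word)

def lemma_frequencies(queries):
    """Count how often each lemma is asked for across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(lemmatise(w) for w in query.split())
    return counts

# Made-up toy log for illustration only.
toy_log = ["etxea", "lanak etxeak", "hiztegia", "Etxeari"]
top = lemma_frequencies(toy_log).most_common(1)
print(top)
```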

  21. Contents • Introduction • Current study • Morphological query expansion • Language-filtering words • Conclusions

  23. Most frequent cases
     • Observed which are the most frequent cases
       – For each POS
       – Using the most frequently asked-for words of Elebila
       – Using both corpora
       – We have opted for the web corpus lists
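
Ranking cases per POS amounts to counting (POS, case) pairs over a corpus. The sketch below assumes the corpus has already been analysed into such pairs. The input format, the function name `rank_cases` and the toy data are all illustrative assumptions, not the authors' tooling.

```python
from collections import Counter, defaultdict

def rank_cases(tagged_tokens):
    """Rank cases by frequency within each part of speech.

    tagged_tokens is an iterable of (pos, case) pairs, assumed to come
    from a morphological analysis of corpus occurrences.
    """
    per_pos = defaultdict(Counter)
    for pos, case in tagged_tokens:
        per_pos[pos][case] += 1
    return {
        pos: [case for case, _ in counts.most_common()]
        for pos, counts in per_pos.items()
    }

# Made-up toy input for illustration only.
toy_tokens = [
    ("noun", "nominative singular"),
    ("noun", "nominative indefinite"),
    ("noun", "nominative indefinite"),
    ("verb", "imperfective aspect"),
    ("verb", "participle"),
    ("verb", "participle"),
]
ranking = rank_cases(toy_tokens)
print(ranking)
```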

  24. Most frequent cases

| Rank | Verb | Adjective | Noun | Proper noun | Place name |
|---|---|---|---|---|---|
| 1 | Participle / perfective aspect (sortu) | Nominative singular (berria) | Nominative indefinite (hiztegi) | Nominative (Mikel) | Nominative (Egipto) |
| 2 | Imperfective aspect (sortzen) | Nominative plural/Ergative singular (berriak) | Nominative singular (hiztegia) | Ergative (Mikelek) | Genitive locative (Egiptoko) |
| 3 | Verbal noun + -ko (sortzeko) | Nominative indefinite (berri) | Nominative plural/Ergative singular (hiztegiak) | Genitive (Mikelen) | Inessive (Egipton) |
| 4 | Unrealized aspect (sortuko) | Genitive plural (berrien) | Genitive locative singular (hiztegiko) | Dative (Mikeli) | Allative (Egiptora) |
| 5 | Short stem (sor) | Inessive singular (berrian) | Genitive singular (hiztegiaren) | Associative (Mikelekin) | Ablative (Egiptotik) |
| 6 | Verbal noun + Nominative singular (sortzea) | Genitive singular (berriaren) | Dative singular (hiztegiari) | Genitive + Nominative singular (Mikelena) | Genitive (Egiptoren) |
| 7 | Adjectival participle (sortutako) | Associative singular (berriarekin) | Inessive singular (hiztegian) | Partitive (Mikelik) | Dative (Egiptori) |
| 8 | Participle + Nominative singular (sortua) | Ergative indefinite (berrik) | Partitive (hiztegirik) | Genitive + Nominative plural/Ergative singular (Mikelenak) | Genitive locative + Nominative singular (Egiptokoa) |
| 9 | Dynamic adverbial participle (sortuz) | Dative singular (berriari) | Instrumental indefinite (hiztegiz) | Instrumental (Mikelez) | Allative + Genitive locative (Egiptorako) |
| 10 | -ta/-da stative adverbial participle (sortuta) | Instrumental indefinite (berriz) | Instrumental singular (hiztegiaz) | Inessive (Mikelengan) | Associative (Egiptorekin) |
| 11 | Participle + Nominative plural/Ergative singular (sortuak) | Inessive indefinite (berritan) | Genitive singular + Nominative singular (hiztegiarena) | | Genitive locative + Nominative plural/Ergative singular (Egiptokoak) |
| 12 | Verbal noun + Inessive singular (sortzean) | Sociative plural (berriekin) | Genitive plural (hiztegien) | | Destinative (Egiptorentzat) |
| 13 | -(r)ik stative adverbial participle (sorturik) | Inessive plural (berrietan) | Sociative singular (hiztegiarekin) | | Instrumental (Egiptoz) |
| 14 | Verbal noun + Allative singular (sortzera) | Genitive locative singular (berriko) | Ablative singular (hiztegitik) | | Terminal allative (Egiptoraino) |
| 15 | Adjectival participle + Nominative plural/Ergative singular (sortutakoak) | Partitive (berririk) | Allative singular (hiztegira) | | Genitive locative + Inessive singular (Egiptokoan) |
