Cross-language High Similarity Search using a Conceptual Thesaurus


  1. Cross-language High Similarity Search using a Conceptual Thesaurus. Parth Gupta (1), Alberto Barrón-Cedeño (2) and Paolo Rosso (1). (1) Universitat Politècnica de València (UPV), Spain; (2) Universitat Politècnica de Catalunya (UPC), Spain. September 19, 2012.

  2. Outline: Introduction, Conceptual Thesaurus, Method, Results, Analysis, References

  3. Introduction
     - The task of cross-language high similarity search is to identify documents that are duplicates or share very similar information in two different languages.
     - Some examples:
       - Wikipedia articles in multiple languages
       - news stories in different languages covering the same event
       - cross-language cases of plagiarism
       - translated documents, etc.
     - In the literature, the task is also referred to as:
       - cross-language pairwise similarity search
       - cross-language mate retrieval
       - cross-language near-duplicate search

  4. Conceptual Thesaurus (domain specific)
     - Often has a multi-word structure
     - Tries to exhaustively cover the omnipresent concepts of the domain
     - Eurovoc (http://eurovoc.europa.eu/)
       - Emerged from European Parliamentary proceedings
       - Contains 6,797 multilingual concepts in 22 languages
       - Spans 21 domains of European Parliament activities

  5. Eurovoc
     | English | Spanish | German |
     |---|---|---|
     | action for failure to fulfil an obligation | recurso por incumplimiento | Klage wegen Vertragsverletzung |
     | extra-community trade | intercambio extracomunitario | außergemeinschaftlicher Handel |
     | sexual harassment | acoso sexual | sexuelle Belästigung |

  6. Eurovoc
     - Domains of concepts:
       - Politics
       - International relations
       - European community
       - Law
       - Economics
       - and so on
     - Assigning these concepts to Wikipedia documents or Shakespeare stories?

  7. Method - Cross-language Conceptual Thesaurus based Similarity (CL-CTS)
     - Represent documents as a vector of concepts
     - Concept assignment is the least trivial part
     - Challenge: exploit a domain-specific CT for all the corpora
     - Assigning concepts according to their verbatim occurrence in the document gives very poor results [Pouliquen et al. 2006]
     - Assign a concept to a document if it "triggers the concept"

  8. Method contd.
     - Heuristic: the terms of a concept are highly domain dependent together, but domain independent alone.
       - For example, "community" and "trade" compared to "community trade"
     - Concept assignment
       - Sum of the term frequencies (TF) of the concept's terms in the document
       - Stopword removal + stemming
       - Filter the terms based on their discriminative power in the corpora
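The concept assignment step above can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the stopword list, the crude `stem` function, and whitespace tokenisation are all stand-ins, since the slides do not name the tools actually used.

```python
from collections import Counter

# Toy stopword list (assumption: the slides do not specify one).
STOPWORDS = {"a", "an", "the", "of", "for", "to", "and", "in"}

def stem(token: str) -> str:
    """Crude illustrative stemmer: strips a trailing 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def preprocess(text: str) -> list[str]:
    """Lowercase, split on whitespace, remove stopwords, then stem."""
    return [stem(t) for t in text.lower().split() if t not in STOPWORDS]

def concept_score(concept: str, doc_tf: Counter) -> int:
    """Sum the document term frequencies of the concept's distinct terms."""
    return sum(doc_tf[t] for t in set(preprocess(concept)))

def concept_vector(concepts: list[str], document: str) -> list[int]:
    """One score per concept, in the order the concepts are given."""
    doc_tf = Counter(preprocess(document))
    return [concept_score(c, doc_tf) for c in concepts]

concepts = ["community trade", "sexual harassment"]
doc = "The community discussed trade rules and trade tariffs"
vec = concept_vector(concepts, doc)
```

Here "community trade" is triggered because its individual terms occur in the document, even though the exact phrase does not, which is the point of the heuristic.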

  9. Method contd.
     - Not all concepts help in similarity estimation, hence Reduced Concepts (RC)
     - Reduces the comparison vocabulary drastically
     - Domain-independent threshold: $0 < \mathit{df}(t) < \beta$
     - Automatic domain adaptation (Football in "Sports" and "Society and Culture")
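A document-frequency filter of this kind could look like the sketch below. It assumes df is an absolute count of documents containing the term; the slides do not say whether the threshold is a count or a ratio, and the parameter name `max_df` is hypothetical.

```python
from collections import Counter

def document_frequency(docs_terms: list[list[str]]) -> Counter:
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for terms in docs_terms:
        df.update(set(terms))  # each document counts at most once per term
    return df

def reduced_vocabulary(concept_terms: set[str],
                       docs_terms: list[list[str]],
                       max_df: int) -> set[str]:
    """Keep concept terms that occur in the corpus but in fewer than max_df documents."""
    df = document_frequency(docs_terms)
    return {t for t in concept_terms if 0 < df[t] < max_df}

concept_terms = {"trade", "community", "football", "harassment"}
corpus = [["trade", "community"], ["trade", "football"], ["trade"]]
kept = reduced_vocabulary(concept_terms, corpus, max_df=3)
```

In this toy corpus "trade" is dropped for occurring in every document and "harassment" for never occurring, so only the discriminative terms remain.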

  10. Method contd.
      - Concern: the concepts are limited in number and are shared even by only slightly related documents
      - To overcome this limitation of conceptual similarity estimation, we also use Named Entities (NEs) in the similarity
      - n-gram similarity of NEs is the simplest method
      - NEs act as discriminative features, e.g. the Wikipedia page of Rome vs. Madrid
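One simple way to realise an n-gram similarity over named entities is a cosine between character n-gram profiles, sketched below. The choice of character 3-grams and cosine is an assumption for illustration; the slide only says "n-gram similarity", and the entity lists are made up.

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    """Character n-grams of a lowercased string."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def ne_profile(entities: list[str], n: int = 3) -> Counter:
    """Merge the n-gram counts of all named entities of a document."""
    profile = Counter()
    for entity in entities:
        profile.update(char_ngrams(entity, n))
    return profile

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical NE lists extracted from three pages.
rome_en = ne_profile(["Rome", "Colosseum"])
rome_es = ne_profile(["Roma", "Coliseo"])
madrid_es = ne_profile(["Madrid", "Prado"])

sim_same_topic = cosine(rome_en, rome_es)
sim_other_topic = cosine(rome_en, madrid_es)
```

Because named entities are often spelled similarly across languages, the Rome pages share n-grams such as "rom" and "col", while the Madrid page shares none with the English Rome page.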

  11. Method contd.
      - Sometimes highly similar documents are parallel, and the task is to find the parallel document for a given document
      - A pattern in length is observed for parallel documents across languages [Pouliquen et al. 2006]
      - We use the same "length penalty": $\mathrm{len}(\mathrm{parallel}(d_q)) = f(\mu, \sigma, \mathrm{len}(d_q))$
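The slide gives the length model only as a function of μ, σ and len(d_q). A common choice in this line of work is a Gaussian over the length ratio of the two documents; the sketch below assumes that form, and the values of `mu` and `sigma` are purely illustrative rather than taken from the presentation.

```python
import math

def length_factor(len_q: int, len_d: int, mu: float, sigma: float) -> float:
    """Gaussian-shaped score in (0, 1]: highest when len_d / len_q equals mu.

    mu and sigma are assumed to be the mean and standard deviation of the
    length ratio observed between translations for a language pair.
    """
    ratio = len_d / len_q
    return math.exp(-0.5 * ((ratio - mu) / sigma) ** 2)

# Illustrative parameters only.
best = length_factor(100, 110, mu=1.1, sigma=0.3)
close = length_factor(100, 120, mu=1.1, sigma=0.3)
far = length_factor(100, 300, mu=1.1, sigma=0.3)
```

A candidate whose length is far from the expected translation length gets a score near zero, so it is penalised even if its concepts match.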

  12. Method contd.
      - The similarity function:
        $$\omega(q, d) = \underbrace{\frac{\alpha}{2} \left( \frac{\vec{c}_q \cdot \vec{c}_d}{|q|\,|d|} + \ell(q, d) \right)}_{\text{conceptual component}} + \underbrace{(1 - \alpha) \, \zeta(q, d)}_{\text{NE component}}$$
      - Inside the parentheses, the first term is the conceptual similarity and $\ell(q, d)$ is the length penalty.
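Read as a weighted combination, the similarity function can be sketched as below. This follows one reading of the slide: the conceptual cosine and the length penalty are averaged and weighted by α, and the NE similarity is weighted by 1 − α. The function and argument names are hypothetical, and the length penalty and NE similarity are passed in as precomputed numbers.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two dense concept vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def cl_cts_score(c_q: list[float], c_d: list[float],
                 length_penalty: float, ne_similarity: float,
                 alpha: float) -> float:
    """alpha/2 * (conceptual cosine + length penalty) + (1 - alpha) * NE similarity."""
    conceptual = 0.5 * (cosine(c_q, c_d) + length_penalty)
    return alpha * conceptual + (1.0 - alpha) * ne_similarity

perfect = cl_cts_score([3, 0, 1], [3, 0, 1], 1.0, 1.0, alpha=0.5)
unrelated = cl_cts_score([1, 0], [0, 1], 0.0, 0.0, alpha=0.7)
concept_only = cl_cts_score([1, 0], [1, 0], 1.0, 0.0, alpha=0.7)
```

With all three inputs in [0, 1] the score also stays in [0, 1], and α controls how much the NE component contributes.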

  13. Compared with
      1. Cross-language Alignment based Similarity Analysis (CL-ASA) [Barrón-Cedeño et al. 2008, Pinto et al. 2009]
      2. Cross-language Character n-grams (CL-CNG) [McNamee and Mayfield 2004]

  14. Datasets
      - JRC-Acquis (JRC)
        - Nature: related to European Commission activities
        - Size: 10,000 in each language
        - Type: parallel
      - PAN-PC-2011 (PAN)
        - Nature: Project Gutenberg (artificially created cross-language plagiarism cases)
        - Size: 2,920 (en-es) and 2,222 (en-de)
        - Type: noisy parallel
      - Wikipedia (Wiki)
        - Nature: general Wikipedia pages
        - Size: 10,000 in each language
        - Type: comparable

  15. Datasets contd.
      - The vocabulary shared by Eurovoc and JRC is larger than that shared by Eurovoc and PAN or Wiki.

  16. Results: JRC (figure panels: en-es, en-de)

  17. Results: PAN (figure panels: en-es, en-de)

  18. Results: Wiki (figure panels: en-es, en-de)

  19. Analysis
      - CL-CTS with reduced concepts performs much better than with all concepts included
        - R@1: 0.02 → 0.58 (JRC en-es)
      - Including the NE component usually improves performance, except on JRC (interesting!)
      - CL-ASA and CL-CNG exhibit very corpus-dependent performance
      - German remains more difficult than Spanish (word compounding needs better care)
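For reference, the R@1 figures above can be read as recall at rank 1. The sketch below assumes each query document has exactly one relevant (highly similar) document, which fits the parallel and mate-retrieval settings described earlier but is not stated on the slide. The document ids are made up.

```python
def recall_at_k(rankings: dict[str, list[str]],
                gold: dict[str, str],
                k: int = 1) -> float:
    """Fraction of queries whose single relevant document is in the top k results."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)

# Hypothetical ranked results for two queries.
rankings = {"q1": ["d1", "d2"], "q2": ["d3", "d2"]}
gold = {"q1": "d1", "q2": "d2"}

r_at_1 = recall_at_k(rankings, gold, k=1)
r_at_2 = recall_at_k(rankings, gold, k=2)
```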
