  1. Concept Search: Semantics Enabled Syntactic Search
     Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu
     June 2nd, 2008, Tenerife, Spain

  2. Outline
     - Information Retrieval (IR)
     - Syntactic IR
     - Problems of Syntactic IR
     - Semantic Continuum
     - Concept Search (C-Search)
     - C-Search via Inverted Indices
     - Preliminary Evaluation
     - Conclusion and Future Work

  3. Information Retrieval (IR)
     IR can be represented as a mapping function:
         IR: Q → D
     - Q - natural language queries which specify user information needs
     - D - the set of documents in the document collection which meet these
       needs, (optionally) ordered according to their degree of relevance
     [Slide shows an example document collection and example queries.]
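A minimal sketch of the IR mapping above: a query goes in, documents come back ordered by relevance. The scoring function (raw term overlap) and the toy documents are illustrative assumptions, not the model discussed later in the talk.

```python
# IR: Q -> D, sketched as a function from a query string to a list of
# documents ordered by a (toy) relevance score: shared-term count.
def ir(query: str, collection: list[str]) -> list[str]:
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(doc.lower().split())), doc) for doc in collection]
    # Keep documents sharing at least one term, best matches first.
    return [doc for score, doc in sorted(scored, key=lambda s: -s[0]) if score > 0]

docs = ["A little dog chases a huge cat", "A laptop computer is on a coffee table"]
print(ir("computer table", docs))  # ['A laptop computer is on a coffee table']
```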

  4. Information Retrieval System
     IR_System = <Model, Data_Structure, Term, Match>
     - Model - the IR models used for document and query representations,
       for computing query answers, and for relevance ranking:
       - Bag of words model (representation)
       - Boolean Model, Vector Space Model, Probabilistic Model (retrieval)
     - Data_Structure - the data structures used for indexing and retrieval:
       - Inverted Index
       - Signature File
     - Term - an atomic element in document and query representations:
       - a word or a multi-word phrase
     - Match - the matching technique used for term matching:
       - syntactic matching of words or phrases:
         - search for equivalent words
         - search for words with common prefixes
         - search for words within a certain edit distance of a given word
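A sketch of the three syntactic Match variants just listed: exact equality, common-prefix match, and edit-distance match. The Levenshtein implementation is standard; the distance threshold of 1 is an illustrative assumption.

```python
# Levenshtein edit distance via the classic dynamic-programming recurrence.
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match(query_term: str, doc_term: str, mode: str = "exact") -> bool:
    if mode == "exact":
        return query_term == doc_term          # equivalent words
    if mode == "prefix":
        return doc_term.startswith(query_term)  # common prefixes
    if mode == "fuzzy":
        return edit_distance(query_term, doc_term) <= 1  # within edit distance
    raise ValueError(mode)

print(match("comput", "computer", "prefix"))  # True
print(match("mark", "marks", "fuzzy"))        # True
```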

  5. Syntactic IR (Example: Inverted Index)
     [Slide shows an inverted-index example and the answer to query Q3.]
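Since the slide's example is not recoverable from the transcript, here is a minimal inverted-index sketch of syntactic IR: each word maps to the set of documents containing it, and a conjunctive query intersects the posting lists. The toy documents are assumptions standing in for the slide's collection.

```python
from collections import defaultdict

docs = {
    1: "a little dog",
    2: "a huge cat",
    3: "a laptop computer is on a coffee table",
}

# Build the inverted index: term -> set of document ids.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def query_and(terms: list[str]) -> set[int]:
    # AND semantics: intersect the posting lists of all query terms.
    postings = [index[t] for t in terms]
    return set.intersection(*postings) if postings else set()

print(query_and(["computer", "table"]))  # {3}
```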

  6. Problems of Syntactic IR
     (I) Ambiguity of Natural Language
     - Polysemy: one word ↔ multiple meanings
       e.g., baby is a young mammal or a human child
     - Synonymy: different words ↔ same meaning
       e.g., mark and print - a visible indication made on a surface
     (II) Complex Concepts
     - Syntactic IR does not take into account complex concepts formed by
       natural language phrases (e.g., noun phrases)
       E.g., Computer table → A laptop computer is on a coffee table
     (III) Related Concepts
     - Syntactic IR does not take into account related concepts
       E.g., carnivores (flesh-eating mammals) is more general than dog OR cat
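A self-contained illustration of problems (I) and (II) with a purely syntactic matcher: the synonym pair mark/print produces no match, and "computer table" matches a document where the two words co-occur but the complex concept does not. The documents are illustrative assumptions.

```python
docs = {
    1: "a visible mark was made on the surface",
    2: "a laptop computer is on a coffee table",
}

def contains_all(doc: str, terms: list[str]) -> bool:
    words = set(doc.split())
    return all(t in words for t in terms)

# (I) Synonymy: "print" means the same as "mark", but nothing is found.
print([d for d, t in docs.items() if contains_all(t, ["print"])])              # []
# (II) Complex concepts: both words occur, so doc 2 is (wrongly) returned.
print([d for d, t in docs.items() if contains_all(t, ["computer", "table"])])  # [2]
```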

  7. Syntactic IR
     We can think of Syntactic IR as a point in a space of IR approaches.
     [Figure: the IR approach space; Syntactic IR sits at the origin
     (0, 0, 0), "Pure Syntax": NL words matched by string similarity.]

  8. (1) Ambiguity: Natural Language → Formal Language (FL)
     [Figure: the NL2FL axis, running from NL (0) to FL (1), added to the
     space with origin (0, 0, 0), Pure Syntax.]
     - E.g., baby → C(baby): a human child
     - print → C(print): a visible indication made on a surface
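A sketch of the NL2FL step using WordNet through NLTK (WordNet is the lexical database the evaluation uses, see slide 19). Picking the first listed synset is a simplifying assumption, not C-Search's actual sense disambiguation.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def to_concept(word: str):
    synsets = wn.synsets(word, pos=wn.NOUN)
    # Lack of background knowledge (slide 12): if no concept is found,
    # the word itself is used as the concept identifier.
    return synsets[0] if synsets else word

print(to_concept("baby"))   # e.g., Synset('baby.n.01') - a human child
print(to_concept("xyzzy"))  # 'xyzzy' - no synset found, word kept as-is
```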

  9. (2) Complex Concepts: Words → Multi-word Phrases
     [Figure: the W2P axis, running from Word (0) through +Noun Phrase and
     +Verb Phrase towards Free Text (1), added to the space.]
     - E.g., Computer table → C(computer table)
     - A laptop computer is on a coffee table →
       {C(laptop computer), C(coffee table)}
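A sketch of the W2P step: each noun phrase becomes a complex concept, modelled here as a frozenset of content words standing in for a conjunction of atomic concepts. The stopword list and the pre-chunked noun phrases are illustrative assumptions (the talk uses an NLP tool, GATE, for extraction; see slide 19).

```python
STOP = {"a", "an", "the", "is", "on"}

def phrase_to_concept(phrase: str) -> frozenset:
    # A complex concept as the set of its content words.
    return frozenset(w for w in phrase.lower().split() if w not in STOP)

# Query: one complex concept, C(computer table).
print(phrase_to_concept("computer table"))   # frozenset({'computer', 'table'})
# Document: one complex concept per noun phrase.
for np in ["A laptop computer", "a coffee table"]:
    print(phrase_to_concept(np))
```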

  10. (3) Related Concepts: String Similarity → Knowledge
      [Figure: the KNOW axis, running from String Similarity (0) through
      +Statistical Knowledge and +Lexical Knowledge towards Complete
      Ontological Knowledge (1), added to the space.]
      - E.g., "carnivores" ≠ "dog" → C(carnivores) ⊒ C(dog)
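A sketch of the KNOW step: syntactically "carnivores" and "dog" do not match, but WordNet knows that C(carnivore) is more general than C(dog). The synset names are WordNet identifiers; the subsumption test via the hypernym closure is standard NLTK usage.

```python
from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

dog = wn.synset("dog.n.01")
carnivore = wn.synset("carnivore.n.01")

def more_general(general, specific) -> bool:
    # True when `general` appears in the hypernym closure of `specific`.
    return general in specific.closure(lambda s: s.hypernyms())

print("carnivores" == "dog")         # False: no syntactic match
print(more_general(carnivore, dog))  # True:  C(carnivore) ⊒ C(dog)
```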

  11. Semantic Continuum
      [Figure: the full space spanned by the NL2FL, W2P, and KNOW axes,
      from Pure Syntax at (0, 0, 0) to Full Semantics at (1, 1, 1), with
      C-Search marked as a point inside it.]

  12. C-Search in the Semantic Continuum
      - NL2FL axis - lack of background knowledge:
        It is not always possible to find a concept which corresponds to a
        given word (e.g., the concept does not exist in the lexical
        database). In this case, the word itself is used as the identifier
        for a concept.
      - W2P axis - descriptive phrases:
        (Complex) concepts are extracted from descriptive phrases:
            descriptive phrase ::= noun phrase { OR noun phrase }
        E.g., C(A little dog OR a huge cat) = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
      - KNOW axis - lexical knowledge:
        We use synonyms, hyponyms, and hypernyms.
        Semantic matching → search for related complex concepts.
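A sketch of the descriptive-phrase grammar above: a descriptive phrase is a disjunction of noun phrases, and each noun phrase becomes a conjunction of atomic concepts. DNF formulas are modelled as sets of frozensets; the sense suffixes (little-2, dog-1, ...) are dropped for brevity, an illustrative simplification.

```python
STOP = {"a", "an", "the"}

def noun_phrase_to_clause(np: str) -> frozenset:
    # A conjunctive clause, e.g. little ⊓ dog.
    return frozenset(w for w in np.lower().split() if w not in STOP)

def descriptive_phrase_to_dnf(phrase: str) -> set:
    # descriptive phrase ::= noun phrase { OR noun phrase }
    return {noun_phrase_to_clause(np) for np in phrase.split(" OR ")}

print(descriptive_phrase_to_dnf("A little dog OR a huge cat"))
# {frozenset({'little', 'dog'}), frozenset({'huge', 'cat'})}
```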

  13. C-Search in the Semantic Continuum
      [Figure: the same space, with C-Search positioned between NL and FL
      on the NL2FL axis, at +Descriptive Phrase on the W2P axis, and at
      +Lexical Knowledge on the KNOW axis.]

  14. C-Search via Inverted Indices
      Moving from Syntactic IR to C-Search does not require the
      introduction of new data structures or retrieval models.
      The current implementation of C-Search:
      - Model - bag of concepts (representation), Boolean Model
        (retrieval), Vector Space Model (ranking)
      - Data_Structure - Inverted Index
      - Term - an atomic or a complex concept
      - Match - semantic matching of concepts
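A sketch of the engineering point on this slide: C-Search reuses an ordinary inverted index by storing concept identifiers as index terms instead of words. The identifier encoding (strings like "little-2^dog-1") and the documents are illustrative assumptions.

```python
from collections import defaultdict

index = defaultdict(set)

def index_document(doc_id: int, concepts: list[str]) -> None:
    # "Bag of concepts": each extracted concept becomes an index term.
    for c in concepts:
        index[c].add(doc_id)

index_document(1, ["little-2^dog-1"])
index_document(2, ["huge-1^cat-3"])
index_document(3, ["laptop-1^computer-1", "coffee-1^table-2"])

print(index["coffee-1^table-2"])  # {3}
```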

  15. C-Search (Example: Inverted Index)
      [Slide shows an inverted-index example with concepts as index terms.]

  16. Concept Matching
      Goal: to find the set of document concepts matching a query concept:
          Cms = { Cd | Cd ⊑ Cq }
      1st approach - directly via S-Match:
      - Sequentially iterate through all document concepts
      - Compare each document concept with the query concept (using S-Match)
      - Collect those concepts for which S-Match returns "more specific" (⊑)
      - It can be slow! (the number of document concepts can exceed 10^6;
        see the sketch below)
      2nd approach - via Inverted Indices (brief overview):
      - A-Index → indexes atomic concepts by more general atomic concepts
      - ⊓-Index → indexes conjunctive clauses by their components
        (i.e., atomic concepts)
      - ⊔-Index → indexes DNF formulas by their components
        (i.e., conjunctive clauses)
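A sketch of the 1st approach: a linear scan over all document concepts with a subsumption test. Concepts are modelled as frozensets of atomic labels, where a superset is a more specific conjunction; this is a toy stand-in for S-Match, which the talk actually uses.

```python
def more_specific(c_doc: frozenset, c_query: frozenset) -> bool:
    # Every query atom occurs in the document concept.
    return c_query <= c_doc

def match_sequential(c_query: frozenset, doc_concepts: list) -> set:
    # O(number of document concepts): too slow beyond ~10^6 concepts,
    # which motivates the index-based 2nd approach.
    return {c for c in doc_concepts if more_specific(c, c_query)}

docs = [frozenset({"little", "dog"}), frozenset({"huge", "cat"})]
print(match_sequential(frozenset({"dog"}), docs))  # {frozenset({'little', 'dog'})}
```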

  17. Concept Indices (An Example)
      Let us consider the following concept:
          C1 = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
      Fragments of the concept indices for document concept C1
      (a code sketch of these indices follows below):
      Concept A-index (atomic concept → more specific atomic concepts):
      - A5 (canine) → A2, …
      - A6 (feline) → A4, …
      Concept ⊓-index (atomic concept → conjunctive clauses):
      - A1 (little) → C2, …
      - A2 (dog) → C2, …
      - A3 (huge) → C3, …
      - A4 (cat) → C3, …
      Concept ⊔-index (conjunctive clause → DNF formulas):
      - C2 (little ⊓ dog) → C1, …
      - C3 (huge ⊓ cat) → C1, …
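A sketch of the three indices for C1, mirroring the fragments above. Plain dicts of sets stand in for inverted-index posting lists; sense suffixes are dropped, and the entries shown are only the ones visible on the slide.

```python
a_index = {      # more general atomic concept -> more specific atomics
    "canine": {"dog"},
    "feline": {"cat"},
}
and_index = {    # atomic concept -> conjunctive clauses containing it
    "little": {"C2"}, "dog": {"C2"},
    "huge": {"C3"}, "cat": {"C3"},
}
or_index = {     # conjunctive clause -> DNF formulas containing it
    "C2": {"C1"},
    "C3": {"C1"},
}

print(or_index["C2"])  # {'C1'}: C1 contains the clause little ⊓ dog
```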

  18. Concept Retrieval (An Example)
      0. Query concept: Cq = canine ⊔ feline
      1. For each atomic concept → more specific atomic concepts
         (search the A-Index)
         E.g., canine → {dog, wolf, …} and feline → {cat, lion, …}
      2. For each atomic concept → more specific conjunctive clauses
         (search the ⊓-Index)
         E.g., dog → {C2 = little ⊓ dog, …} and cat → {C3 = huge ⊓ cat, …}
         (Note that canine → {C2 = little ⊓ dog, …} and
         feline → {C3 = huge ⊓ cat, …})
      3. For each disjunctive clause → more specific conjunctive clauses
         (union of conjunctive clauses)
         E.g., canine ⊔ feline → {C2 = little ⊓ dog, C3 = huge ⊓ cat, …}
      4. For each disjunctive clause → more specific DNF formulas
         (search the ⊔-Index)
         E.g., canine ⊔ feline → {C1 = (little ⊓ dog) ⊔ (huge ⊓ cat), …}
      5. …
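A sketch of retrieval steps 1-4 above, run against index dicts shaped like those in the slide-17 sketch (redefined here so the example is self-contained). Step 5 is elided on the slide and not reproduced.

```python
a_index = {"canine": {"dog"}, "feline": {"cat"}}
and_index = {"little": {"C2"}, "dog": {"C2"}, "huge": {"C3"}, "cat": {"C3"}}
or_index = {"C2": {"C1"}, "C3": {"C1"}}

def retrieve(query_atoms: list) -> set:
    # 1. A-Index: each query atom plus its more specific atomic concepts.
    atoms = set(query_atoms)
    for a in query_atoms:
        atoms |= a_index.get(a, set())
    # 2. ⊓-Index: atomic concepts -> conjunctive clauses containing them.
    clauses = set()
    for a in atoms:
        clauses |= and_index.get(a, set())
    # 3. The disjunctive query takes the union of the matching clauses,
    # 4. and the ⊔-Index maps those clauses to matching DNF formulas.
    formulas = set()
    for c in clauses:
        formulas |= or_index.get(c, set())
    return formulas

print(retrieve(["canine", "feline"]))  # {'C1'}
```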

  19. Evaluation: Settings
      - Data_set_1: the Home sub-tree of the DMoz web directory
        - Document set: documents classified to nodes (29,506)
        - Query set: concatenations of a node's label and its parent's
          label (890)
        - Relevance judgments: node-document links
      - Data_set_2: the only difference from Data_set_1 is:
        - Document set: concatenations of the titles and descriptions of
          the documents in DMoz
      - WordNet is used as the lexical database
      - GATE is used as the NLP tool
      - Lucene is used as the inverted index

  20. Evaluation Results
      [Slide shows evaluation results for Data_set_1 and Data_set_2.]

  21. Conclusion and Future Work
      Conclusion:
      - In C-Search, syntactic IR is extended with a semantics layer.
      - C-Search performs as well as syntactic search, while allowing for
        an improvement when semantics is available.
      - In principle, C-Search supports a continuum from purely syntactic
        IR to fully semantic IR, in which indexing and retrieval can be
        performed at any point of the continuum, depending on how much
        semantics is available.
      Future work:
      - Development of a more accurate concept extraction algorithm
      - Development of document relevance metrics based on both the
        syntactic and the semantic similarity of query and document
        descriptions
      - Allowing a semantic scope in queries (such as equivalence,
        more/less general, disjoint)
      - Comparing the performance of the proposed solution with
        state-of-the-art syntactic IR systems on a syntactic IR benchmark

  22. Thank You!
