SLIDE 1

Concept Search: Semantics Enabled Syntactic Search

Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu

June 2nd, 2008, Tenerife, Spain

SLIDE 2

Outline

  • Information Retrieval (IR)
  • Syntactic IR
  • Problems of Syntactic IR
  • Semantic Continuum
  • Concept Search (C-Search)
  • C-Search via Inverted Indices
  • Preliminary Evaluation
  • Conclusion and Future work

SLIDE 3

Information Retrieval (IR)

  • IR can be represented as a mapping function:

IR: Q → D

  • Q - natural language queries which specify user information needs
  • D - a set of documents in the document collection which meet these needs, (optionally) ordered according to the degree of relevance

  • Ex. document collection:
  • Ex. queries:
SLIDE 4

Information Retrieval System

IR_System = <Model, Data_Structure, Term, Match>

  • Model – IR models used for document and query representations, for computing query answers, and for relevance ranking.
    • Bag of words model (representation)
    • Boolean Model, Vector Space Model, Probabilistic Model (retrieval)
  • Data_Structure – data structures used for indexing and retrieval.
    • Inverted Index
    • Signature File
  • Term – an atomic element in document and query representations.
    • a word or a multi-word phrase
  • Match – the matching technique used for term matching.
    • syntactic matching of words or phrases:
      • search for equivalent words
      • search for words with common prefixes
      • search for words within a certain edit distance of a given word
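The three syntactic matching techniques listed above can be illustrated with a minimal Python sketch. The function names and the `prefix_len` parameter are assumptions made for this illustration; the slides do not prescribe an implementation. Edit distance is computed here as the standard Levenshtein distance.

```python
def equal_match(term: str, word: str) -> bool:
    """Search for equivalent words: exact string equality."""
    return term == word


def prefix_match(term: str, word: str, prefix_len: int = 4) -> bool:
    """Search for words with a common prefix of (assumed) length prefix_len."""
    if len(term) < prefix_len or len(word) < prefix_len:
        return False
    return term[:prefix_len] == word[:prefix_len]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(term: str, word: str, max_distance: int = 1) -> bool:
    """Search for words within a certain edit distance of a given word."""
    return edit_distance(term, word) <= max_distance
```

All three techniques compare character strings only; none of them looks at what the words mean.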

SLIDE 5

Syntactic IR (Example: Inverted Index)

[Figure not recovered: inverted index example; only the label "Q3:" survives]
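Since the slide's figure did not survive, here is a minimal sketch of a word-level inverted index with Boolean AND retrieval. The documents in `DOCS` and the function names are hypothetical and chosen only for illustration; they are not the collection or queries shown on the original slide.

```python
from collections import defaultdict

# Hypothetical toy collection (not the one from the original slide).
DOCS = {
    "D1": "a laptop computer is on a coffee table",
    "D2": "a little dog or a huge cat",
}


def build_inverted_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the ids of documents that contain every word of the query."""
    postings = [index.get(word, set()) for word in query.lower().split()]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(DOCS)
hits = boolean_and(index, "computer table")
```

Each query word is looked up as a string, so a word that does not literally occur in a document can never retrieve it.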

SLIDE 6

Problems of Syntactic IR

  • (I) Ambiguity of Natural Language
    • Polysemy: one word ↔ multiple meanings
      e.g., baby is a young mammal or a human child
    • Synonymy: different words ↔ same meaning
      e.g., mark and print: a visible indication made on a surface
  • (II) Complex Concepts
    • Syntactic IR does not take into account complex concepts formed by natural language phrases (e.g., noun phrases).
      E.g., the query "computer table" matches the document "A laptop computer is on a coffee table"
  • (III) Related Concepts
    • Syntactic IR does not take into account related concepts.
      E.g., carnivores (flesh-eating mammals) is more general than dog OR cat
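The problems above can be reproduced with a few lines of Python. This is a sketch under the assumption that syntactic matching means "every query word occurs in the document"; the function name and the second and third example documents are hypothetical.

```python
def syntactic_match(query: str, document: str) -> bool:
    """True if every query word occurs as a word in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())


# (II) Complex concepts: the words match, but the document is not about a
# computer table, so this hit is a false positive.
false_positive = syntactic_match(
    "computer table", "A laptop computer is on a coffee table")

# (I) Synonymy: "print" and "mark" can share a meaning, yet the strings differ.
missed_synonym = syntactic_match("print", "a paw mark on a surface")

# (III) Related concepts: dogs and cats are carnivores, yet nothing matches.
missed_related = syntactic_match("carnivores", "A little dog or a huge cat")
```

The first call returns a match that should be rejected, while the other two miss documents that a reader would consider relevant.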

SLIDE 7

Syntactic IR

  • We can think of Syntactic IR as a point in a space of IR approaches

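[Figure: a three-axis space of IR approaches with Syntactic IR at the origin (0, 0, 0), labeled Pure Syntax; the three axes start from NL, Word, and String Similarity]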

SLIDE 8

(1) Ambiguity: Natural Language → Formal Language

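  • E.g., baby → C(baby): a human child
          print → C(print): a visible indication made on a surface

[Figure: the NL2FL axis added to the space, running from NL at the origin (0, 0, 0), Pure Syntax, to FL at 1; Word and String Similarity mark the origin of the other two axes]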

SLIDE 9

(2) Complex Concepts: Words → Multi-word Phrases

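  • E.g., Computer table → C(computer table)
          A laptop computer is on a coffee table → {C(laptop computer), C(coffee table)}

[Figure: the W2P axis added to the space, running from Word at the origin through +Noun Phrase, +Verb Phrase, … to Free Text at 1; the NL2FL axis runs from NL to FL at 1, and String Similarity marks the origin of the third axis]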

SLIDE 10

(3) Related Concepts: String Similarity → Knowledge

  • E.g., “carnivores” ≠ “dog” → C(carnivores) ⊒ C(dog)

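[Figure: the KNOW axis added to the space, running from String Similarity at the origin to Complete Ontological Knowledge at 1, with +Statistical Knowledge and +Lexical knowledge marked along it; the NL2FL axis runs from NL to FL at 1, and the W2P axis from Word through +Noun Phrase, +Verb Phrase, … to Free Text at 1]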

SLIDE 11

Semantic Continuum

[Figure: the semantic continuum spanned by the NL2FL, W2P, and KNOW axes, from Pure Syntax at (0, 0, 0) to Full Semantics at (1, 1, 1), with C-Search marked in the space]

SLIDE 12

C-Search in Semantic Continuum

  • NL2FL-axis - lack of background knowledge:
    • It is not always possible to find a concept which corresponds to a given word (e.g., the concept does not exist in the lexical database). In this case, the word itself is used as the identifier for a concept.
  • W2P-axis - descriptive phrases
    • (Complex) concepts are extracted from descriptive phrases:
      descriptive phrase ::= noun phrase {OR noun phrase}
      E.g., C(A little dog OR a huge cat) = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)
  • KNOW-axis - lexical knowledge
    • We use synonyms, hyponyms, and hypernyms
    • Semantic matching → search for related complex concepts
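The descriptive-phrase grammar and the word-as-identifier fallback described above can be sketched as follows. This is only an illustration: the `SENSES` table is a hypothetical stand-in for a lexical database with word senses already disambiguated, and the parser handles nothing beyond `noun phrase {OR noun phrase}` with articles dropped. A concept is represented as a set of conjunctive clauses (a DNF formula), each clause being a set of atomic concepts.

```python
ARTICLES = {"a", "an", "the"}

# Hypothetical word-to-sense table standing in for a lexical database.
# Real sense selection would require word sense disambiguation.
SENSES = {"little": "little-2", "dog": "dog-1", "huge": "huge-1", "cat": "cat-3"}


def word_to_concept(word: str) -> str:
    """NL2FL step: use the word itself when no concept is known for it."""
    return SENSES.get(word, word)


def phrase_to_concept(phrase: str) -> frozenset:
    """W2P step: turn 'NP OR NP ...' into a disjunction of conjunctions."""
    clauses = set()
    for noun_phrase in phrase.split(" OR "):
        words = [w for w in noun_phrase.lower().split() if w not in ARTICLES]
        clauses.add(frozenset(word_to_concept(w) for w in words))
    return frozenset(clauses)


concept = phrase_to_concept("A little dog OR a huge cat")
```

Here `concept` holds the two clauses {little-2, dog-1} and {huge-1, cat-3}, mirroring (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3). A word missing from `SENSES`, such as "tiny", is kept as its own identifier.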
SLIDE 13

C-Search in Semantic Continuum

[Figure: C-Search in the semantic continuum, between Pure Syntax (0, 0, 0) and Full Semantics (1, 1, 1). C-Search is placed at NL&FL on the NL2FL axis, at +Descriptive Phrase on the W2P axis, and at +Lexical knowledge on the KNOW axis]

SLIDE 14

C-Search via Inverted Indices

  • Moving from Syntactic IR to C-Search does not require the introduction of new data structures or retrieval models
  • The current implementation of C-Search:
    • Model – Bag of concepts (representation), Boolean Model (retrieval), Vector Space Model (ranking)
    • Data_Structure – Inverted Index
    • Term – an atomic or a complex concept
    • Match – semantic matching of concepts
SLIDE 15

C-Search (Example: Inverted Index)

SLIDE 16

Concept Matching

  • Goal: to find the set of document concepts matching a query concept
  • 1st approach - directly via S-Match
    • Sequentially iterate through all document concepts
    • Compare each document concept with the query concept (using S-Match)
    • Collect those concepts for which S-Match returns "more specific" (⊑)
    • It can be slow! (because the number of document concepts > 10E6)
  • 2nd approach - via Inverted Indices (brief overview)
    • A-Index → index atomic concepts by more general atomic concepts
    • ⊓-Index → index conjunctive clauses by their components (i.e., atomic concepts)
    • ⊔-Index → index DNF formulas by their components (i.e., conjunctive clauses)

C_ms(C^q) = { C^d | C^d ⊑ C^q }

(the set of document concepts C^d that are more specific than the query concept C^q)
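The first approach (a sequential scan) can be sketched as below. This does not use S-Match or its API: `concept_le` is a simplified stand-in subsumption test written for this illustration, and `MORE_GENERAL` is a hypothetical hypernym table. Concepts are DNF formulas, represented as sets of conjunctive clauses, each clause a set of atomic concepts.

```python
# Hypothetical lexical knowledge: atom -> directly more general atoms.
MORE_GENERAL = {"dog": {"canine"}, "cat": {"feline"},
                "canine": {"carnivore"}, "feline": {"carnivore"}}


def ancestors(atom):
    """All atoms reachable by following MORE_GENERAL links upward."""
    seen, stack = set(), [atom]
    while stack:
        for parent in MORE_GENERAL.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def atom_le(a_doc, a_query):
    """a_doc ⊑ a_query for atomic concepts."""
    return a_doc == a_query or a_query in ancestors(a_doc)


def clause_le(c_doc, c_query):
    """A conjunction is more specific if it covers every query conjunct."""
    return all(any(atom_le(a, q) for a in c_doc) for q in c_query)


def concept_le(dnf_doc, dnf_query):
    """A DNF is more specific if each of its clauses fits some query clause."""
    return all(any(clause_le(cd, cq) for cq in dnf_query) for cd in dnf_doc)


def more_specific_concepts(query, doc_concepts):
    """Sequential scan: one subsumption test per document concept."""
    return [name for name, c in doc_concepts.items() if concept_le(c, query)]


DOC_CONCEPTS = {
    "C1": {frozenset({"little", "dog"}), frozenset({"huge", "cat"})},
    "C4": {frozenset({"table"})},  # hypothetical extra concept
}
QUERY = {frozenset({"canine"}), frozenset({"feline"})}
result = more_specific_concepts(QUERY, DOC_CONCEPTS)
```

The scan performs one test per document concept, so its cost grows linearly with the number of document concepts, which is the motivation for the index-based second approach.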

SLIDE 17

Concept Indices (An example)

  • Let us consider the following concept:

C1 = (little-2 ⊓ dog-1) ⊔ (huge-1 ⊓ cat-3)

  • Fragments of concept indices for document concept C1:

Concept A-index:
  A5(canine) → A2, …
  A6(feline) → A4, …

Concept ⊓-index:
  A1(little) → C2, …
  A2(dog) → C2, …
  A3(huge) → C3, …
  A4(cat) → C3, …

Concept ⊔-index:
  C2(little ⊓ dog) → C1, …
  C3(huge ⊓ cat) → C1, …
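A minimal sketch of how the ⊓-index and ⊔-index fragments for C1 could be built from a DNF concept. The function name and data layout are assumptions for illustration. The A-index is omitted because it depends on lexical knowledge (more general atomic concepts), which is not derivable from C1 alone.

```python
from collections import defaultdict


def build_concept_indices(doc_concepts):
    """Build the two structural indices from DNF document concepts.

    doc_concepts maps a concept name to a set of conjunctive clauses,
    where each clause is a frozenset of atomic concepts.
    """
    and_index = defaultdict(set)  # atomic concept -> clauses containing it
    or_index = defaultdict(set)   # conjunctive clause -> DNF formulas containing it
    for name, dnf in doc_concepts.items():
        for clause in dnf:
            or_index[clause].add(name)
            for atom in clause:
                and_index[atom].add(clause)
    return and_index, or_index


C2 = frozenset({"little", "dog"})
C3 = frozenset({"huge", "cat"})
and_index, or_index = build_concept_indices({"C1": {C2, C3}})
```

With this input, `and_index["dog"]` contains the clause C2 and `or_index[C3]` contains "C1", matching the fragments listed above.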

SLIDE 18

Concept Retrieval (An example)

  • 0. Query concept: Cq = canine ⊔ feline
  • 1. For each atomic concept → more specific atomic concepts
    • Search A-Index
    • E.g., canine → {dog, wolf, …} and feline → {cat, lion, …}
  • 2. For each atomic concept → more specific conjunctive clauses
    • Search ⊓-Index
    • E.g., dog → {C2 = little ⊓ dog, …} and cat → {C3 = huge ⊓ cat, …}
    • (Note that: canine → {C2 = little ⊓ dog, …} and feline → {C3 = huge ⊓ cat, …})
  • 3. For each disjunctive clause → more specific conjunctive clauses
    • Union of conjunctive clauses
    • E.g., canine ⊔ feline → {C2 = little ⊓ dog, C3 = huge ⊓ cat, …}
  • 4. For each disjunctive clause → more specific DNF formulas
    • Search ⊔-Index
    • E.g., canine ⊔ feline → {C1 = (little ⊓ dog) ⊔ (huge ⊓ cat), …}
  • 5. …
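Steps 1 to 4 can be traced with a small Python sketch over hard-coded index fragments. The dictionaries below copy only the entries named on the slides; the `FORMULAS` table and the final filtering step are assumptions added here so that the toy result stays sound (a formula is kept only if all of its clauses were found), since the slide elides what follows step 4. The query is restricted to a disjunction of atomic concepts.

```python
# Index fragments from the slides (atoms and clauses referred to by name).
A_INDEX = {"canine": {"dog", "wolf"}, "feline": {"cat", "lion"}}
AND_INDEX = {"little": {"C2"}, "dog": {"C2"}, "huge": {"C3"}, "cat": {"C3"}}
OR_INDEX = {"C2": {"C1"}, "C3": {"C1"}}

# Assumption: clause membership of each DNF formula, used for the filter.
FORMULAS = {"C1": {"C2", "C3"}}


def retrieve(query_atoms):
    """Return (clauses, formulas) more specific than a disjunction of atoms."""
    clauses = set()
    for atom in query_atoms:
        # Step 1: more specific atomic concepts (plus the atom itself).
        specific_atoms = {atom} | A_INDEX.get(atom, set())
        # Step 2: conjunctive clauses containing any of those atoms.
        for a in specific_atoms:
            clauses |= AND_INDEX.get(a, set())
        # Step 3: the union over all query atoms accumulates in `clauses`.
    # Step 4: candidate DNF formulas containing any retrieved clause.
    candidates = set()
    for clause in clauses:
        candidates |= OR_INDEX.get(clause, set())
    # Assumed filter: every clause of the formula must have been retrieved.
    formulas = {f for f in candidates if FORMULAS[f] <= clauses}
    return clauses, formulas


clauses, formulas = retrieve({"canine", "feline"})
```

For the query canine ⊔ feline this yields the clauses C2 and C3 and the formula C1. For the query canine alone, C1 is dropped by the filter because its clause C3 (huge ⊓ cat) is not more specific than canine.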
SLIDE 19

Evaluation: Settings

  • Data_set_1: Home sub-tree of the DMoz web directory
    • Document set: documents classified to nodes (29506)
    • Query set: concatenation of each node's label and its parent's label (890)
    • Relevance judgment: node-document links
  • Data_set_2: the only difference from Data_set_1 is:
    • Document set: concatenation of titles and descriptions of docs in DMoz
  • WordNet is used as the lexical DB
  • GATE is used as the NLP tool
  • Lucene is used as the inverted index
SLIDE 20

Evaluation results

  • Data_set_1
  • Data_set_2
SLIDE 21

Conclusion and Future work

  • Conclusion
    • In C-Search, syntactic IR is extended with a semantics layer
    • C-Search performs as well as syntactic search while allowing for an improvement when semantics is available
    • In principle, C-Search supports a continuum from purely syntactic IR to fully semantic IR, in which indexing and retrieval can be performed at any point of the continuum depending on how much semantics is available
  • Future work
    • Development of a more accurate concept extraction algorithm
    • Development of document relevance metrics based on both syntactic and semantic similarities of query and document descriptions
    • Allowing semantic scope (such as equivalence, more/less general, disjoint)
    • Comparing the performance of the proposed solution with state-of-the-art syntactic IR systems using a syntactic IR benchmark

SLIDE 22

Thank You!