Query Languages

R. Baeza-Yates and B. Ribeiro-Neto
Modern Information Retrieval, Chapter 4

Jon Atle Gulla

TDT4215 Query Languages

Query Languages

  • How do you specify your information needs?
  • Query types:
      – Keyword-based querying
      – Pattern matching
      – Structural queries

NTNU/IDI/IS

Keyword-based querying

Single-word queries

Example queries: skiing; TDT4215 exam; NTNU Trondheim; skiing norway snowboarding

  • Result is a set of documents containing at least one of the words of the query
  • Documents ranked according to relevance
  • Web extensions:
      – skiing telemark
      – skiing +telemark
      – skiing -telemark
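
The keyword semantics above can be sketched in a few lines of Python. This is a minimal illustration, not a method given in the slides: it assumes that `+word` means the word must occur, that `-word` means it must not occur, and that relevance is simply the number of distinct query words a document contains. The `search` function and the sample `docs` are hypothetical.

```python
def search(query, docs):
    """Rank documents by how many query words they contain.

    Assumed web-style extensions: '+word' must occur, '-word' must not occur.
    """
    required, excluded, optional = set(), set(), set()
    for token in query.lower().split():
        if token.startswith("+"):
            required.add(token[1:])
        elif token.startswith("-"):
            excluded.add(token[1:])
        else:
            optional.add(token)

    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        # Skip documents missing a required word or containing an excluded one.
        if not required <= words or excluded & words:
            continue
        score = len((optional | required) & words)
        if score > 0:
            results.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    return [doc_id for score, doc_id in sorted(results, key=lambda r: (-r[0], r[1]))]


docs = {
    "d1": "skiing in norway",
    "d2": "telemark skiing and snowboarding",
    "d3": "telemark canal boat trips",
}
print(search("skiing telemark", docs))   # ['d2', 'd1', 'd3']
print(search("skiing +telemark", docs))  # ['d2', 'd3']
print(search("skiing -telemark", docs))  # ['d1']
```

A real engine would use an inverted index and a weighted relevance measure; counting matched words is only the simplest stand-in.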


Keyword-based querying

Context queries

Words appearing near each other may signal a higher relevance than words far apart

  • Phrases:
      – a sequence of single-word queries
      – examples: “new york times”, “to be or not to be”
  • Proximity:
      – a sequence of words is given together with a maximum allowed distance between them
      – example queries: “olympic games” london; ntnu trondheim; eggen rbk
      – ntnu trondheim matches both “…the university in trondheim is ntnu…” and “…ntnu is situated in trondheim…”
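
Phrase and proximity matching can be sketched as follows. The sketch assumes whitespace tokenization and measures distance as the difference between word positions, in either order; the function names are hypothetical and the slides do not prescribe a particular distance definition.

```python
def positions(words, term):
    """Word positions at which term occurs."""
    return [i for i, w in enumerate(words) if w == term]


def contains_phrase(text, phrase):
    """True if the words of phrase occur consecutively and in order."""
    words = text.lower().split()
    target = phrase.lower().split()
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def within_distance(text, term_a, term_b, max_distance):
    """True if the two terms occur at most max_distance positions apart."""
    words = text.lower().split()
    return any(abs(i - j) <= max_distance
               for i in positions(words, term_a)
               for j in positions(words, term_b))


print(contains_phrase("read the new york times today", "new york times"))  # True
print(contains_phrase("new times in york", "new york times"))              # False
print(within_distance("the university in trondheim is ntnu", "ntnu", "trondheim", 3))  # True
print(within_distance("ntnu is situated in trondheim", "ntnu", "trondheim", 3))        # False
```

With a larger allowed distance (4 or more) the second proximity example also matches, which shows how the distance bound controls recall.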


Phrasing or not phrasing

Query: new york times

  • How to deal with queries that have potential phrases?
      – How to recognize a potential phrase?
      – How to interpret potential phrases?
  • Possible interpretations:
      – “new york times”
      – “new york” times
      – new york times
  • Interpretation affects ranking!

Proximity search

Query: new york times

Documents:

  • It's 3 pm in New York, what time is it in the rest of the world?
  • For your reading pleasure, we present historic issues from the New York Times.
  • City of York Council - list of new library opening times and addresses.
  • Three webcam views of Times Square, New York.

Questions:

  • Which document is the most relevant one?
  • How do we achieve this ranking?
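
One possible way to rank these four documents is by the shortest span of text that contains every query word. The slides do not give this heuristic as the answer; it is an assumption, offered as one candidate. The `tokenize`, `smallest_window` and `rank` helpers are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def smallest_window(text, query):
    """Length in words of the shortest span containing every query term, or None."""
    words = tokenize(text)
    terms = set(tokenize(query))
    best = None
    for start, word in enumerate(words):
        if word not in terms:
            continue
        seen = set()
        for end in range(start, len(words)):
            if words[end] in terms:
                seen.add(words[end])
                if seen == terms:
                    length = end - start + 1
                    if best is None or length < best:
                        best = length
                    break
    return best


def rank(docs, query):
    """Smaller windows rank higher; documents missing a term come last."""
    keys = []
    for doc_id, text in docs.items():
        window = smallest_window(text, query)
        keys.append((window is None, window if window is not None else 0, doc_id))
    return [doc_id for _, _, doc_id in sorted(keys)]


docs = {
    "d1": "It's 3 pm in New York, what time is it in the rest of the world?",
    "d2": "For your reading pleasure, we present historic issues from the New York Times.",
    "d3": "City of York Council - list of new library opening times and addresses.",
    "d4": "Three webcam views of Times Square, New York.",
}
print(rank(docs, "new york times"))  # ['d2', 'd4', 'd3', 'd1']
```

Under this heuristic the newspaper document comes first because its three query words are adjacent, while the first document ranks last because it contains "time" rather than "times".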


Keyword-based querying

Boolean queries

  • Boolean operators:

      – OR (e1 OR e2)
      – AND (e1 AND e2)
      – BUT (e1 BUT e2), in place of NOT

  • No ranking of documents provided
  • “Fuzzy boolean”: Meaning of AND and OR relaxed
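
The three operators map directly onto set operations over an inverted index. The sketch below assumes a query is written as a nested tuple `(operator, left, right)`; that representation and the `build_index` and `evaluate` helpers are hypothetical. As on the slide, the result is an unranked set.

```python
def build_index(docs):
    """Map each word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


def evaluate(expr, index):
    """Evaluate a word or a nested (operator, left, right) tuple."""
    if isinstance(expr, str):
        return index.get(expr, set())
    op, left, right = expr
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "OR":
        return a | b
    if op == "AND":
        return a & b
    if op == "BUT":
        return a - b  # documents satisfying left but not right
    raise ValueError(f"unknown operator: {op}")


docs = {
    "d1": "skiing in norway",
    "d2": "snowboarding in norway",
    "d3": "skiing and snowboarding",
}
index = build_index(docs)
print(sorted(evaluate(("AND", "skiing", "norway"), index)))        # ['d1']
print(sorted(evaluate(("BUT", "skiing", "snowboarding"), index)))  # ['d1']
print(sorted(evaluate(("AND", ("OR", "skiing", "snowboarding"), "norway"), index)))  # ['d1', 'd2']
```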

Natural language:

  • Query is an enumeration of words and context queries
  • All documents matching a portion of the user query are retrieved
  • Higher ranking is assigned to those documents matching more parts of the query
  • Query and documents viewed as vectors
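
Viewing query and documents as vectors can be illustrated with raw term counts and cosine similarity. The slide does not specify a weighting scheme, so the choice of plain term frequency is an assumption, and the `vector`, `cosine` and `rank` helpers are hypothetical.

```python
import math
from collections import Counter


def vector(text):
    """Term-frequency vector of a text (assumed weighting: raw counts)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return ids of documents sharing at least one term with the query, best first."""
    q = vector(query)
    scored = [(cosine(q, vector(text)), doc_id) for doc_id, text in docs.items()]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [doc_id for score, doc_id in ordered if score > 0]


docs = {
    "d1": "skiing norway",
    "d2": "skiing norway snowboarding trondheim",
    "d3": "exam trondheim",
}
print(rank("skiing norway snowboarding", docs))  # ['d2', 'd1']
```

The document matching all three query words scores higher than the one matching two, and the document matching none is not retrieved.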


Pattern matching

  • A pattern is a set of syntactic features that must occur in a text segment, ranging from simple (e.g. words) to complex (e.g. regular expressions)
  • Typical patterns:
      – words
      – prefixes: ‘comput’ -> ‘computer’, ‘computation’, ‘computing’
      – suffixes: ‘ters’ -> ‘computers’, ‘testers’, ‘printers’
      – substrings: ‘tal’ -> ‘coastal’, ‘talk’, ‘metallic’
      – ranges: between ‘held’ and ‘hold’ -> ‘hoax’, ‘hissing’
      – allowing errors
      – regular expressions
      – extended patterns
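
Most of the pattern types listed above can be tried against a small vocabulary. This sketch makes several assumptions that the slides do not state: ranges use lexicographic order, "allowing errors" is read as a bound on edit (Levenshtein) distance, and regular expressions must match the whole word. The vocabulary and function names are hypothetical.

```python
import re

VOCAB = ["computer", "computation", "computing", "computers", "testers",
         "printers", "coastal", "talk", "metallic", "hoax", "hissing", "house"]


def prefix_match(prefix, words):
    return [w for w in words if w.startswith(prefix)]


def suffix_match(suffix, words):
    return [w for w in words if w.endswith(suffix)]


def substring_match(sub, words):
    return [w for w in words if sub in w]


def range_match(low, high, words):
    """Words lying between low and high in lexicographic order."""
    return [w for w in words if low <= w <= high]


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(pattern, words, max_errors):
    return [w for w in words if edit_distance(pattern, w) <= max_errors]


def regex_match(pattern, words):
    return [w for w in words if re.fullmatch(pattern, w)]


print(prefix_match("comput", VOCAB))          # ['computer', 'computation', 'computing', 'computers']
print(suffix_match("ters", VOCAB))            # ['computers', 'testers', 'printers']
print(substring_match("tal", VOCAB))          # ['coastal', 'talk', 'metallic']
print(range_match("held", "hold", VOCAB))     # ['hoax', 'hissing']
print(fuzzy_match("computar", VOCAB, 1))      # ['computer']
print(regex_match(r"comput(er|ing)s?", VOCAB))  # ['computer', 'computing', 'computers']
```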


Structural queries

  • Allowing the user to query documents based on their structure (not on their content)
  • Mixing content and structure in a query lets us pose more expressive queries
  • Three main structures:
      – form-like fixed structures
      – hypertext structures
      – hierarchical structures

Fixed structure

  • Document has a fixed set of fields, much like a filled form
  • Intended for document collections with fixed structures
  • Example:
      – Mail archive as a set of mails
      – Each mail has a standard set of fields:
          • sender
          • receiver
          • subject
          • date
          • body
      – User can search for mails sent to a given person with “football” in the subject field
  • Leads to the relational model
      – Extend SQL to full text retrieval -> SFQL
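
The mail example can be sketched with each mail as a record of the five fields. This is plain Python, not SFQL, whose syntax the slides do not show. The sample mails, the addresses and the `search` helper are hypothetical, and matching is assumed to be a case-insensitive substring test per field.

```python
MAILS = [
    {"sender": "kari@example.org", "receiver": "ola@example.org",
     "subject": "Football on Saturday", "date": "2024-05-02",
     "body": "Kick-off at noon."},
    {"sender": "per@example.org", "receiver": "ola@example.org",
     "subject": "Exam schedule", "date": "2024-05-03",
     "body": "Football is cancelled that day."},
    {"sender": "ola@example.org", "receiver": "kari@example.org",
     "subject": "Re: Football on Saturday", "date": "2024-05-04",
     "body": "See you there."},
]


def search(mails, **field_terms):
    """Return mails in which every named field contains its term."""
    return [m for m in mails
            if all(term.lower() in m[field].lower()
                   for field, term in field_terms.items())]


# Mails sent to a given person with "football" in the subject field.
hits = search(MAILS, receiver="ola@example.org", subject="football")
print([m["subject"] for m in hits])  # ['Football on Saturday']
```

The second mail mentions football only in its body, so restricting the search to the subject field excludes it. A plain content query cannot make that distinction.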


Hypertext

  • Hypertext is a directed graph where the nodes hold some text and the links represent connections between nodes
  • Search by following hyperlinks
  • “give me documents that link to X”
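
A hypertext can be modelled as a dictionary from each node to the nodes it links to. The sketch below answers the "documents that link to X" query and follows hyperlinks breadth-first from a start node. The page names and both helper functions are hypothetical.

```python
from collections import deque

# Directed graph: node -> list of link targets.
GRAPH = {
    "home": ["courses", "staff"],
    "courses": ["tdt4215"],
    "staff": ["tdt4215", "home"],
    "tdt4215": [],
}


def linking_to(graph, target):
    """Nodes that contain a link to target."""
    return sorted(node for node, targets in graph.items() if target in targets)


def reachable_from(graph, start):
    """Nodes found by following hyperlinks from start (breadth-first)."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(linking_to(GRAPH, "tdt4215"))             # ['courses', 'staff']
print(sorted(reachable_from(GRAPH, "courses")))  # ['courses', 'tdt4215']
```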


Hierarchical structure

  • Hierarchical structure is an intermediate structuring model that lies between fixed structure and hypertext structure
  • Sample of hierarchical models:
      – PAT expressions: structure is marked in the text as tags (e.g. HTML)
      – Overlapped lists: hierarchical, partly overlapping regions of text defined
      – Lists of references: querying path expressions in text
      – Proximal nodes: many fixed hierarchical structures of text defined
      – Tree matching: document and query give a tree structure
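
A query that mixes content and hierarchical structure can be illustrated on a small tagged document. The sketch does not implement any of the five models named above; it only assumes that structure is marked with tags and uses Python's standard `xml.etree.ElementTree` to ask for sections whose paragraphs contain a given word. The sample document and the `sections_containing` helper are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

DOC = """<book>
  <chapter>
    <title>Query Languages</title>
    <section>
      <title>Keyword-based querying</title>
      <p>Boolean queries combine words.</p>
    </section>
    <section>
      <title>Pattern matching</title>
      <p>Regular expressions describe patterns.</p>
    </section>
  </chapter>
</book>"""


def sections_containing(xml_text, word):
    """Titles of sections with a paragraph containing word."""
    root = ET.fromstring(xml_text)
    hits = []
    for section in root.iter("section"):
        body = " ".join(p.text or "" for p in section.iter("p"))
        if word.lower() in re.findall(r"\w+", body.lower()):
            hits.append(section.findtext("title"))
    return hits


print(sections_containing(DOC, "patterns"))  # ['Pattern matching']
print(sections_containing(DOC, "boolean"))   # ['Keyword-based querying']
```

The structural part of the query is the restriction to `section` elements, and the content part is the word test inside their paragraphs.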


Conclusions

  • Query types:
      – Keyword-based queries:
          • Single-word queries
          • Context queries
          • Boolean queries
          • Natural language
      – Pattern matching
      – Structural queries:
          • Fixed structure
          • Hypertext
          • Hierarchical structure