Text Indexing: Arun Chauhan, COMP 314, Lecture 15, 16, Mar 4, Mar 6, 2003



slide-1
SLIDE 1

Text Indexing

Arun Chauhan COMP 314

Lecture 15, 16 Mar 4, Mar 6, 2003

Lecture 15, 16: Text Indexing Mar 4, Mar 6, 2003

slide-4
SLIDE 4

Searching Text

  • grep utility on Unix
      • specify a regular expression
      • search all specified files
  • what happens if
      • the files are very big, and
      • many repeated searches need to be carried out
  • can we do better?
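
For contrast with what follows, a grep-style search can be sketched as an exhaustive scan: every query re-reads every line of every file. The in-memory `files` dictionary and the function name `grep` are illustrative assumptions, not the Unix utility's implementation.

```python
import re

def grep(pattern, files):
    """Return (file name, line number, line) for every line matching pattern.

    `files` maps a file name to its text; a real tool would read from disk.
    """
    regex = re.compile(pattern)
    hits = []
    for name, text in files.items():
        # Every query scans all the text again, so cost grows with total size.
        for lineno, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, lineno, line))
    return hits

files = {
    "a.txt": "Pease porridge hot\nPease porridge cold",
    "b.txt": "Nine days old",
}
print(grep(r"old$", files))
```

With large files and many repeated queries, this per-query scan is the cost that an index tries to avoid.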


slide-5
SLIDE 5

Indexing

  • split the search process
      • create an index of frequently used terms (also called a concordance)
      • handle the search as a query to look up the index

amortize indexing time over a large number of queries
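
The amortization argument can be made concrete with a rough cost model. The symbols are assumptions introduced only for illustration: $I$ is the one-time cost of building the index, $s$ the cost of one exhaustive scan, $\ell$ the cost of one index lookup (with $\ell < s$), and $q$ the number of queries. Indexing wins when the total cost of scanning exceeds the total cost of indexing plus lookups:

$$q \cdot s \;>\; I + q \cdot \ell \quad\Longleftrightarrow\quad q \;>\; \frac{I}{s - \ell}$$

For example, if building the index costs as much as 10 scans ($I = 10s$) and a lookup costs 1% of a scan ($\ell = 0.01s$), then $q > 10 / 0.99 \approx 10.1$, so indexing pays off from the 11th query onward.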


slide-7
SLIDE 7

Full-text Retrieval

  • full-text retrieval ≡ searching large text databases using automatically constructed concordances
  • Questions
      • How is this different from a library catalog?
      • Can we rely on high-speed modern processors to do exhaustive searches?
      • What kind of indexing would be required for full-text retrieval?


slide-8
SLIDE 8

Indexing: A General Technique

  • no large database can be searched without indexes
  • there may be primary and secondary indexes
  • elaborate data structures to hold the index to support rapid queries
      • e.g., B+ trees
  • other issues
      • separate structures for separate indexes?
      • rapid reindexing for addition, deletion, update
      • size of the index


slide-9
SLIDE 9

Applications

  • databases
      • every database has elaborate index generation schemes
  • web search
      • search engines, e.g., Google, Yahoo!, Lycos
      • also the issue of ranking and displaying the results
  • disk search
      • Apple’s Sherlock creates index files for filesystem search


slide-11
SLIDE 11

Inverted File Index

  • term ≡ keywords of interest
  • lexicon ≡ list of all terms occurring in the text

index[term] = document1, document2, . . .

How do you index non-text data (e.g., PDF files, images)?


slide-12
SLIDE 12

An Example

Document  Text
1         Pease porridge hot, pease porridge cold
2         Pease porridge in the pot
3         Nine days old
4         Some like it hot, some like it cold
5         Some like it in the pot
6         Nine days old

find the lexicon and build the inverted index


slide-13
SLIDE 13

Example (contd.)

Number  Term      Documents (count; document numbers)
1       cold      2; 1, 4
2       days      2; 3, 6
3       hot       2; 1, 4
4       in        2; 2, 5
5       it        2; 4, 5
6       like      2; 4, 5
7       nine      2; 3, 6
8       old       2; 3, 6
9       pease     2; 1, 2
10      porridge  2; 1, 2
11      pot       2; 2, 5
12      some      2; 4, 5
13      the       2; 2, 5
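
The construction of this table can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation: the function name `build_inverted_index` is hypothetical, terms are case-folded and split on non-letters, and document 5 is taken as "Some like it in the pot" so that it agrees with the index entry for "some".

```python
import re
from collections import defaultdict

documents = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}

def build_inverted_index(docs):
    """Map each case-folded term to the sorted list of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z]+", text.lower()):
            postings[term].add(doc_id)
    # Sort the lexicon alphabetically and each posting list by document number.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

index = build_inverted_index(documents)
for number, (term, ids) in enumerate(index.items(), start=1):
    # Same layout as the table: number, term, "count; document numbers".
    print(number, term, f"{len(ids)}; {', '.join(map(str, ids))}")
```

The keys of `index` form the lexicon, and each value is the list of documents in which that term occurs.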


slide-17
SLIDE 17

Using the Inverted Index

  • simple lookup is trivial
  • for large documents, may have to maintain the location within each document
  • compound queries?
      • conjunctive and disjunctive queries, e.g., “term1 AND term2”, “term1 OR term2”
      • complement, “NOT term1”
  • near queries?
  • potentially huge index files
      • should we worry about the index size?
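
The query types listed above map naturally onto set operations over posting lists. The sketch below is one possible arrangement, not the lecture's code: the function names are hypothetical, the postings are a hand-copied subset of the porridge example, and the "near" query assumes a positional index that stores word offsets (counted from 0) within each document.

```python
# Hypothetical postings from the porridge example: term -> set of document numbers.
index = {"hot": {1, 4}, "cold": {1, 4}, "pot": {2, 5}, "pease": {1, 2}}
all_docs = {1, 2, 3, 4, 5, 6}

def query_and(index, t1, t2):
    """Conjunctive query: documents containing both terms."""
    return index.get(t1, set()) & index.get(t2, set())

def query_or(index, t1, t2):
    """Disjunctive query: documents containing either term."""
    return index.get(t1, set()) | index.get(t2, set())

def query_not(index, all_docs, term):
    """Complement: documents that do not contain the term."""
    return all_docs - index.get(term, set())

# Positional index: term -> {document number: [word offsets]}.
positions = {
    "pease": {1: [0, 3], 2: [0]},
    "hot": {1: [2], 4: [3]},
    "cold": {1: [5], 4: [7]},
}

def query_near(positions, t1, t2, k):
    """Documents where t1 and t2 occur within k words of each other."""
    p1, p2 = positions.get(t1, {}), positions.get(t2, {})
    hits = set()
    for doc in p1.keys() & p2.keys():
        if any(abs(a - b) <= k for a in p1[doc] for b in p2[doc]):
            hits.add(doc)
    return hits

print(query_and(index, "hot", "pease"))       # documents with both terms
print(query_near(positions, "hot", "cold", 3))
```

Note that the complement needs the set of all documents, and the near query needs per-document positions; both make the index larger, which is the size concern raised in the last bullet.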


slide-18
SLIDE 18

Trimming the Index

  • case folding
      • mostly, case is immaterial
  • stemming
      • are “search”, “searching”, “searches” different?
      • strategy: maintain only the neutral form of the term
  • eliminate stop words
      • frequently occurring terms ≡ stop list
      • e.g., “a”, “the”, “in”, “to”, etc.
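
The three trimming steps can be sketched as a small term-normalization pipeline. This is an illustration under stated assumptions: `crude_stem` is a deliberately naive suffix stripper standing in for a real stemming algorithm, and the stop list contains only the four words given as examples above.

```python
import re

STOP_WORDS = {"a", "the", "in", "to"}  # only the examples above; real stop lists are longer

def crude_stem(term):
    """Strip a few common suffixes (naive; a real stemmer handles many more cases)."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three letters so short words are left alone.
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def index_terms(text):
    """Case-fold, drop stop words, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())   # case folding
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

print(index_terms("Searching the searches to Search"))
```

All three surface forms collapse to the single term "search", so the lexicon holds one entry instead of three.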


slide-19
SLIDE 19

Effectiveness: Precision

$$\text{Precision} = \frac{r}{t}$$

r: number of relevant documents retrieved
t: total number of documents retrieved

if 50 documents are retrieved and 35 of them are relevant, then the precision is 35/50 = 70%


slide-20
SLIDE 20

Effectiveness: Recall

$$\text{Recall} = \frac{r}{n}$$

r: number of relevant documents retrieved
n: total number of relevant documents in the collection

if 50 documents are retrieved and 35 of them are relevant, then the precision is 35/50 = 70%
if there are 140 relevant documents in the collection, then the recall is 35/140 = 25%
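
The two measures differ only in the denominator, which a short sketch makes explicit. The function and parameter names are illustrative; the numbers are the ones from the example above.

```python
def precision(relevant_retrieved, total_retrieved):
    """Fraction of the retrieved documents that are relevant (r / t)."""
    return relevant_retrieved / total_retrieved

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved (r / n)."""
    return relevant_retrieved / total_relevant

print(f"precision = {precision(35, 50):.0%}")   # 35 of the 50 retrieved are relevant
print(f"recall    = {recall(35, 140):.0%}")     # 35 of the 140 relevant were retrieved
```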


slide-21
SLIDE 21

Search Engines


slide-22
SLIDE 22

Indexing the Web

  • more than 2 billion documents on the web (as of 2003)
  • Google claims to index 1.5 billion documents
  • two indexing approaches
      • search engines (e.g., Google)
      • hierarchical directories (e.g., Yahoo!)


slide-23
SLIDE 23

Web Search Characteristics

  • bulk
  • rapidly changing content
      • about one-third changes every year
  • heterogeneous content
  • duplication, as much as 30%
  • high linkage
  • wide variety of users
  • varying user behavior
      • 85% of users look only at the first screen of results
      • 78% never modify their first query


slide-24
SLIDE 24

Query Characteristics

Terms in query  Share of queries
0               21%
1               26%
2               26%
3               15%
> 3             12%


slide-25
SLIDE 25

Goals of a Search Engine

  • speed
  • recall
  • precision
  • precision in the top result page


slide-26
SLIDE 26

Search Engine Architecture

  • crawler
      • collects pages from the Web
  • indexer
      • indexes the collected pages
  • query server
      • accepts and processes queries and returns the results


slide-27
SLIDE 27

The Crawler

base ← set of known working hyperlinks
queue ← base
while (! queue.empty()) {
    p = remove first element of queue
    process p
    for each page, q, referenced from p
        add q to queue;
}
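
The crawler loop can be sketched as a breadth-first traversal. This is a minimal illustration under stated assumptions: the `links` dictionary is a hypothetical in-memory stand-in for fetching a page and extracting its hyperlinks, and the `seen` set is an addition that is not in the pseudocode, included so that pages linking to each other are not processed forever.

```python
from collections import deque

def crawl(base, links, process):
    """Breadth-first crawl starting from the pages in `base`.

    `links` maps a page to the pages it references (a stand-in for fetching
    and parsing); `process` is called once per page, e.g. to index it.
    """
    queue = deque(base)
    seen = set(base)            # addition: remember pages already queued
    while queue:
        p = queue.popleft()     # take the first element off the queue
        process(p)
        for q in links.get(p, []):
            if q not in seen:
                seen.add(q)
                queue.append(q)

# Toy link graph; pages "a" and "b" link to each other.
web = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": []}
visited = []
crawl(["a"], web, visited.append)
print(visited)
```

Using a queue visits pages in order of distance from the base set; a production crawler would also need politeness delays, error handling, and revisiting of changed pages, none of which is shown here.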


slide-28
SLIDE 28

Indexing

  • inverted index
      • most common, used by Google
      • superimposed coding is another technique
  • term extraction
      • title or the whole document
      • document analysis to identify keywords


slide-29
SLIDE 29

Query Processing

  • keyword vs concept-based searching
      • concept-based searching uses “clustering”
      • Excite used concept-based searching
  • searching “similar” results
  • ranking the hits


slide-30
SLIDE 30

Rankings

  • Google’s page-popularity based rankings
  • combined with proximity of search keywords to those in the document

let page P be pointed to by pages T1, T2, T3, etc.
let L(x) be the number of links going out of page x
let R(x) be the page rank of page x

$$R(P) = (1 - d) + d \times \left( \frac{R(T_1)}{L(T_1)} + \frac{R(T_2)}{L(T_2)} + \cdots + \frac{R(T_k)}{L(T_k)} \right)$$

where d is a damping factor
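
Because each page's rank depends on the ranks of the pages pointing to it, one common way to evaluate the formula is to start from a guess and apply it repeatedly. The sketch below is an illustration, not Google's implementation: the function name, the starting rank of 1.0, the fixed iteration count, and the value d = 0.85 are all assumptions, and every page is assumed to have at least one outgoing link.

```python
def page_rank(out_links, d=0.85, iterations=50):
    """Iterate R(P) = (1 - d) + d * sum(R(T) / L(T)) over pages T linking to P.

    `out_links` maps each page to the list of pages it links to. Every page
    is assumed to appear as a key and to have at least one outgoing link.
    """
    pages = list(out_links)
    rank = {p: 1.0 for p in pages}            # assumed starting guess
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for t, targets in out_links.items():
            share = d * rank[t] / len(targets)   # d * R(T) / L(T)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(graph)
for page, r in sorted(ranks.items()):
    print(page, round(r, 3))
```

In this toy graph page C is pointed to by both A and B, so it ends up ranked above B, which receives only half of A's rank.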


slide-32
SLIDE 32

Solving the Rankings

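$$R(P_1) = (1 - d) + d \times \left( \frac{1}{L(T^{p_1}_1)} R(T^{p_1}_1) + \frac{1}{L(T^{p_1}_2)} R(T^{p_1}_2) + \cdots + \frac{1}{L(T^{p_1}_{k_1})} R(T^{p_1}_{k_1}) \right)$$

$$R(P_2) = (1 - d) + d \times \left( \frac{1}{L(T^{p_2}_1)} R(T^{p_2}_1) + \frac{1}{L(T^{p_2}_2)} R(T^{p_2}_2) + \cdots + \frac{1}{L(T^{p_2}_{k_2})} R(T^{p_2}_{k_2}) \right)$$

$$\vdots$$

$$R(P_n) = (1 - d) + d \times \left( \frac{1}{L(T^{p_n}_1)} R(T^{p_n}_1) + \frac{1}{L(T^{p_n}_2)} R(T^{p_n}_2) + \cdots + \frac{1}{L(T^{p_n}_{k_n})} R(T^{p_n}_{k_n}) \right)$$

$$L \times R = C$$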
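
One way to read the compact form L × R = C, offered as an interpretation rather than the lecture's own derivation: moving every rank term to the left-hand side turns the n equations into a linear system in the unknown ranks. The entry names $\ell_{ij}$, $r$ and $c$ are introduced only for illustration, and no page is assumed to link to itself.

$$
\ell_{ij} =
\begin{cases}
1 & \text{if } i = j \\
-\dfrac{d}{L(P_j)} & \text{if } P_j \text{ links to } P_i \\
0 & \text{otherwise}
\end{cases}
\qquad
r = \begin{pmatrix} R(P_1) \\ \vdots \\ R(P_n) \end{pmatrix}
\qquad
c = \begin{pmatrix} 1 - d \\ \vdots \\ 1 - d \end{pmatrix}
$$

Row $i$ of the product of the matrix $(\ell_{ij})$ with $r$ is $R(P_i) - d \sum_j R(P_j) / L(P_j)$, summed over the pages $P_j$ that link to $P_i$, and setting it equal to $1 - d$ recovers the equation for $R(P_i)$. With n in the billions, a system of this size is typically solved iteratively rather than by direct elimination.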
