INFO 4300 / CS4300 Information Retrieval
IR 1: Boolean Retrieval


SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze's, linked from http://informationretrieval.org/

IR 1: Boolean Retrieval

Paul Ginsparg

Cornell University, Ithaca, NY

25 Aug 2011

SLIDE 2

Plan for today

Course overview
Administrativa
Boolean retrieval

SLIDE 3

Overview “After change, things are different . . .”

SLIDE 4

“Plan”

Search full text: basic concepts
Web search
Probabilistic retrieval
Interfaces
Metadata / semantics
IR ⇔ NLP ⇔ ML
Prereqs: introductory courses in data structures and algorithms, in linear algebra (eigenvalues), and in probability theory (Bayes' theorem)

SLIDE 5

Administrativa (tentative)

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor's Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail the instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, use cs4300-l@lists.cs.cornell.edu
Course text at http://informationretrieval.org/:

Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also

Information Retrieval, S. Büttcher, C. Clarke, G. Cormack
http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

SLIDE 6

Tentative Assignment and Exam Schedules

During this course there will be four assignments, which require programming:
Assignment 1 due Sun 18 Sep
Assignment 2 due Sat 8 Oct
Assignment 3 due Sun 6 Nov
Assignment 4 due Fri 2 Dec
and two examinations:
Midterm on Thu 13 Oct
Final exam on Wed 14 Dec (7:00 PM – 9:30 PM)
The course grade will be based on course assignments, examinations, and subjective measures (e.g., class participation), with rough weightings: Assignments 50%, Examinations 50%, with subjective adjustments (as much as 20%).

SLIDE 7

Outline

1. Introduction
2. Inverted index
3. Processing Boolean queries
4. Discussion Section (next week)

SLIDE 8

Definition of information retrieval

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
IR used to be the province of reference librarians, paralegals, and similar professionals. Now hundreds of millions of people (billions?) engage in information retrieval every day when they use a web search engine or search their email.
Three scales: web, enterprise/institution/domain, personal.

SLIDE 9

Clustering and Classification

IR also covers supporting users in browsing or filtering document collections, or further processing a set of retrieved documents.
Clustering: find a good grouping of the documents based on their contents (cf. arranging books on a bookshelf according to their topic).
Classification: given a set of topics, standing information needs, or other categories (such as suitability of texts for different age groups), decide which class(es), if any, each of a set of documents belongs to.

SLIDE 10

Structured vs Unstructured

"Unstructured data": data with no clear, semantically overt (easy-for-a-computer) structure.
Structured data: e.g., a relational database (product inventories and personnel records).
But no data is truly "unstructured": text data has latent linguistic structure and, in addition, headings, paragraphs, and footnotes with explicit markup.
IR facilitates "semistructured" search: e.g., find documents whose title contains Java and whose body contains threading.

SLIDE 11 – SLIDE 13

Introduction to Information Retrieval

[Figure: "Unstructured (text) vs. structured (database) data in 2006", a bar chart (scale 0 to 200) comparing data volume and market cap for unstructured vs. structured data. The chart graphics did not survive extraction.]

SLIDE 14

Boolean retrieval

The Boolean model is among the simplest models on which to base an information retrieval system.
Queries are Boolean expressions, e.g., Caesar AND Brutus.
The search engine returns all documents that satisfy the Boolean expression.
Does Google use the Boolean model?

SLIDE 15

Outline

1. Introduction
2. Inverted index
3. Processing Boolean queries
4. Discussion Section (next week)

SLIDE 16

Unstructured data in 1650: Shakespeare

Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia? One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. Why is grep not the solution?

Slow (for large collections)
grep is line-oriented, IR is document-oriented
"not Calpurnia" is non-trivial
Other operations (e.g., find the word Romans near countryman) are not feasible
Ranked retrieval (best documents to return): later in the course

SLIDE 17

Term-document incidence matrix

              Anthony    Julius    The       Hamlet   Othello   Macbeth   ...
              and        Caesar    Tempest
              Cleopatra
Anthony          1          1         0         0         0         1
Brutus           1          1         0         1         0         0
Caesar           1          1         0         1         1         1
Calpurnia        0          1         0         0         0         0
Cleopatra        1          0         0         0         0         0
mercy            1          0         1         1         1         1
worser           1          0         1         1         1         0
...

Entry is 1 if the term occurs. Example: Calpurnia occurs in Julius Caesar. Entry is 0 if the term doesn't occur. Example: Calpurnia doesn't occur in The Tempest. (Shakespeare used about 32,000 different words.)

SLIDE 18

Binary-valued vector for Brutus

              Anthony    Julius    The       Hamlet   Othello   Macbeth   ...
              and        Caesar    Tempest
              Cleopatra
Anthony          1          1         0         0         0         1
Brutus           1          1         0         1         0         0
Caesar           1          1         0         1         1         1
Calpurnia        0          1         0         0         0         0
Cleopatra        1          0         0         0         0         0
mercy            1          0         1         1         1         1
worser           1          0         1         1         1         0
...

Reading across the row for Brutus gives its binary-valued vector: 110100.

SLIDE 19

Incidence vectors

So we have a binary-valued vector for each term. To answer the query Brutus AND Caesar AND NOT Calpurnia:

Take the vectors for Brutus, Caesar, and Calpurnia
Complement the vector of Calpurnia
Do a (bitwise) AND on the three vectors:
110100 AND 110111 AND 101111 = 100100
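A quick Python sketch of this bitwise trick (an illustration, not from the slides; the bit order follows the six plays as listed in the matrix, and vectors is a hand-picked subset of the full matrix):

```python
# Incidence vectors packed into integers, one bit per play, ordered as in the
# matrix: Anthony&Cleopatra, Julius Caesar, The Tempest, Hamlet, Othello, Macbeth.
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # truncates the bitwise complement to 6 bits

vectors = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

# Brutus AND Caesar AND NOT Calpurnia
result = vectors["Brutus"] & vectors["Caesar"] & (~vectors["Calpurnia"] & MASK)
print(format(result, "06b"))  # 100100: Anthony and Cleopatra, and Hamlet
```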

SLIDE 20

Answers to query

Anthony and Cleopatra, Act III, Scene ii
Agrippa [Aside to Domitius Enobarbus]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

SLIDE 21

Ad hoc retrieval

Provide documents from within the collection that are relevant to an arbitrary user information need, communicated to the system by means of a one-off, user-initiated query.
Information need: the topic about which the user desires to know more. Query: what the user conveys to the computer.
To assess the effectiveness of an IR system, two key statistics:
Precision: What fraction of the returned results are relevant to the information need?
P = (# relevant results) / (# results returned)
Recall: What fraction of the relevant documents in the collection were returned by the system?
R = (# relevant results) / (total # relevant documents)
Example: from a 100-document collection containing 20 documents relevant to query Q, the IR system returns 10 results, of which 9 are relevant.
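Working that example through the two formulas: P = 9/10 = 90% (of the 10 returned results, 9 are relevant), and R = 9/20 = 45% (of the 20 relevant documents, only 9 were returned).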

SLIDE 22

Bigger collections

Consider N = 10^6 documents, each with about 1000 tokens
On average 6 bytes per token, including spaces and punctuation ⇒ size of document collection is about 6 GB
Assume there are M = 500,000 distinct terms in the collection
(Notice that we are making a term/token distinction.)
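Checking the arithmetic behind the 6 GB figure: 10^6 documents × 1000 tokens/document × 6 bytes/token = 6 × 10^9 bytes, i.e., about 6 GB.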

SLIDE 23

Can’t build the incidence matrix

The matrix would have M × N = 500,000 × 10^6 = 5 × 10^11 (half a trillion) 0s and 1s. But it has no more than one billion 1s.
The matrix is extremely sparse: 10^9 / (5 × 10^11) = 0.2%.
What is a better representation? We only record the 1s.

SLIDE 24

Inverted Index

For each term t, we store a list of all documents that contain t.

Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132, ...
Calpurnia → 2, 31, 54, 101, ...

The terms on the left make up the dictionary; the lists of docIDs are the postings.

SLIDE 25

Inverted index construction

1. Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar . . .
2. Tokenize the text, turning each document into a list of tokens: Friends Romans countrymen So . . .
3. Do linguistic preprocessing, producing a list of normalized tokens, which are the indexing terms: friend roman countryman so . . .
4. Index the documents that each term occurs in by creating an inverted index, consisting of a dictionary and postings (sketched in code below).
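A minimal Python sketch of steps 1–4 (an illustration, not the course's code; lowercasing and whitespace splitting stand in for real tokenization and linguistic preprocessing):

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index mapping term -> sorted list of docIDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():   # toy tokenizer + normalizer
            index[token].add(doc_id)
    # Sorted postings lists are what make the linear merge (below) possible.
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}
index = build_index(docs)
print(index["brutus"])  # [1, 2]
print(index["caesar"])  # [1, 2]
```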

SLIDE 26

Tokenization and preprocessing

Doc 1. I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2. So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

⇒

Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

SLIDE 27

Generate postings

Doc 1. i did enact julius caesar i was killed i' the capitol brutus killed me
Doc 2. so let it be with caesar the noble brutus hath told you caesar was ambitious

⇒ a list of (term, docID) pairs, one per token:

(i,1) (did,1) (enact,1) (julius,1) (caesar,1) (i,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

SLIDE 28

Sort postings

The unsorted (term, docID) pairs:
(i,1) (did,1) (enact,1) (julius,1) (caesar,1) (i,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

⇒ sorted by term, then docID:

(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (i,1) (i,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)

SLIDE 29

Create postings lists, determine document frequency

The sorted (term, docID) pairs from the previous slide

⇒ postings lists with document frequency (df) for each term:

ambitious (df 1) → 2
be (df 1) → 2
brutus (df 2) → 1 → 2
capitol (df 1) → 1
caesar (df 2) → 1 → 2
did (df 1) → 1
enact (df 1) → 1
hath (df 1) → 2
i (df 1) → 1
i' (df 1) → 1
it (df 1) → 2
julius (df 1) → 1
killed (df 1) → 1
let (df 1) → 2
me (df 1) → 1
noble (df 1) → 2
so (df 1) → 2
the (df 2) → 1 → 2
told (df 1) → 2
you (df 1) → 2
was (df 2) → 1 → 2
with (df 1) → 2
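Continuing the build_index sketch from earlier (reusing its index variable), document frequency falls out as the length of each postings list:

```python
# df: number of documents each term appears in, i.e. the postings-list length.
df = {term: len(postings) for term, postings in index.items()}
print(df["brutus"], df["the"], df["ambitious"])  # 2 2 1
```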

SLIDE 30

Split the result into dictionary and postings file

Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132, ...
Calpurnia → 2, 31, 54, 101, ...

The terms form the dictionary; the docID lists form the postings file.

SLIDE 31

Later in this course

Index construction: how can we create inverted indexes for large collections?
How much space do we need for dictionary and index?
Index compression: how can we efficiently store and process indexes for large collections?
Ranked retrieval: what does the inverted index look like when we want the "best" answer?

SLIDE 32

Outline

1. Introduction
2. Inverted index
3. Processing Boolean queries
4. Discussion Section (next week)

SLIDE 33

Simple conjunctive query (two terms)

Consider the query: Brutus AND Calpurnia. To find all matching documents using the inverted index:

1. Locate Brutus in the dictionary
2. Retrieve its postings list from the postings file
3. Locate Calpurnia in the dictionary
4. Retrieve its postings list from the postings file
5. Intersect the two postings lists
6. Return the intersection to the user

SLIDE 34

Intersecting two postings lists

Brutus    → 1 → 2 → 4 → 11 → 31 → 45 → 173 → 174
Calpurnia → 2 → 31 → 54 → 101

Intersection ⇒ 2 → 31

This is linear in the length of the postings lists. Note: this only works if postings lists are sorted.

SLIDE 35

Intersecting two postings lists

Intersect(p1, p2)
  answer ← ⟨ ⟩
  while p1 ≠ nil and p2 ≠ nil
      do if docID(p1) = docID(p2)
            then Add(answer, docID(p1))
                 p1 ← next(p1)
                 p2 ← next(p2)
         else if docID(p1) < docID(p2)
            then p1 ← next(p1)
         else p2 ← next(p2)
  return answer
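The same merge as a runnable Python sketch (an illustration, using sorted Python lists with indices in place of the pseudocode's linked-list next pointers):

```python
def intersect(p1, p2):
    """Intersect two sorted postings lists in O(len(p1) + len(p2)) time."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:       # docID in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:      # advance the list with the smaller docID
            i += 1
        else:
            j += 1
    return answer

# The Brutus AND Calpurnia example from the previous slide:
print(intersect([1, 2, 4, 11, 31, 45, 173, 174], [2, 31, 54, 101]))  # [2, 31]
```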

SLIDE 36

Query processing: Exercise

france → 1 → 2 → 3 → 4 → 5 → 7 → 8 → 9 → 11 → 12 → 13 → 14 → 15
paris  → 2 → 6 → 10 → 12 → 14
lear   → 12 → 15

Compute the hit list for ((paris AND NOT france) OR lear)
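To check an answer, here is a small sketch using Python sets in place of postings lists (sets give the result directly, though they sidestep the sorted-merge technique the exercise is meant to practice):

```python
france = {1, 2, 3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 15}
paris  = {2, 6, 10, 12, 14}
lear   = {12, 15}

# ((paris AND NOT france) OR lear) = set difference, then union
hits = (paris - france) | lear
print(sorted(hits))  # [6, 10, 12, 15]
```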

SLIDE 37

Boolean queries

The Boolean retrieval model can answer any query that is a Boolean expression.
Boolean queries are queries that use AND, OR, and NOT to join query terms.
Views each document as a set of terms.
Is precise: a document matches the condition or it does not.
Primary commercial retrieval tool for 3 decades.
Many professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you are getting.
Many search systems you use are also Boolean: Spotlight, email, intranet, etc.

SLIDE 38

Commercially successful Boolean retrieval: Westlaw

Largest commercial legal search service in terms of the number of paying subscribers:
over half a million subscribers performing millions of searches a day over tens of terabytes of text data.
The service was started in 1975.
In 2005, Boolean search (called "Terms and Connectors" by Westlaw) was still the default, and used by a large percentage of users . . .
. . . although ranked retrieval has been available since 1992.

SLIDE 39

Westlaw: Example queries

Information need: Information on the legal theories involved in preventing the disclosure of trade secrets by employees formerly employed by a competing company.
Query: "trade secret" /s disclos! /s prevent /s employe!

Information need: Requirements for disabled people to be able to access a workplace.
Query: disab! /p access! /s work-site work-place (employment /3 place)

Information need: Cases about a host's responsibility for drunk guests.
Query: host! /p (responsib! liab!) /p (intoxicat! drunk!) /p guest

SLIDE 40

Westlaw: Comments

/s = within sentence, /p = within paragraph, /k = within k words (proximity operators: an augmentation of Boolean queries)
Space is disjunction (OR), not conjunction (AND)! (This was the default in search pre-Google.)
Long, precise queries: proximity operators, incrementally developed; not like web search.
Why professional searchers often like Boolean search: precision, transparency, control. (But: high precision, low recall?)
When are Boolean queries the best way of searching? Depends on: information need, searcher, document collection, . . .

SLIDE 41

Outline

1. Introduction
2. Inverted index
3. Processing Boolean queries
4. Discussion Section (next week)

SLIDE 42

Discussion 1

In preparation, explore three information retrieval systems and compare them:
Bing, a Web search engine (http://bing.com/).
The Library of Congress catalog, a very large bibliographic catalog (http://catalog.loc.gov/).
PubMed, an indexing and abstracting service for medicine and related fields (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi).

SLIDE 43

Use each service separately for the following information discovery task: What is the medical evidence that vaccines can cause autism?
Evaluate each search service. What do you consider the strengths and weaknesses of each service? When would you use them?
(a) Does the service search full text or surrogates? What is the underlying corpus? What effect does this have on your results?
(b) Is fielded searching offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted, and if so, in what order?
(c) From a usability viewpoint, what style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?
