INFO 4300 / CS4300 Information Retrieval

SLIDE 1

INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze’s, linked from http://informationretrieval.org/

IR 6: Ranking

Paul Ginsparg

Cornell University, Ithaca, NY

13 Sep 2011

SLIDE 2

Administrativa

Course Webpage: http://www.infosci.cornell.edu/Courses/info4300/2011fa/
Assignment 1. Posted: 2 Sep, Due: Sun, 18 Sep
Lectures: Tuesday and Thursday 11:40-12:55, Kimball B11
Instructor: Paul Ginsparg, ginsparg@..., 255-7371, Physical Sciences Building 452
Instructor’s Office Hours: Wed 1-2pm, Fri 2-3pm, or e-mail instructor to schedule an appointment
Teaching Assistant: Saeed Abdullah, office hour Fri 3:30pm-4:30pm in the small conference room (133) at 301 College Ave, and by email; use cs4300-l@lists.cs.cornell.edu
Course text at: http://informationretrieval.org/

Introduction to Information Retrieval, C. Manning, P. Raghavan, H. Schütze

see also

Information Retrieval, S. Büttcher, C. Clarke, G. Cormack

http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=12307

SLIDE 3

Administrativa

Reread assignment 1 instructions.
The Midterm Examination is on Thu, Oct 13, from 11:40 to 12:55, in Kimball B11. It will be open book. The topics to be examined are all the lectures and discussion class readings before the midterm break.
According to the registrar (http://registrar.sas.cornell.edu/Sched/EXFA.html), the final examination is Wed 14 Dec 7:00-9:30 pm (location TBD).

SLIDE 4

Discussion 2, 20 Sep

For this class, read and be prepared to discuss the following:

  • K. Sparck Jones, “A statistical interpretation of term specificity and its application in retrieval”. Journal of Documentation 28, 11-21, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/ksj_orig.pdf
  • Letter by Stephen Robertson and reply by Karen Sparck Jones, Journal of Documentation 28, 164-165, 1972. http://www.soi.city.ac.uk/~ser/idfpapers/letters.pdf

The first paper introduced the term weighting scheme known as inverse document frequency (IDF). Some of the terminology used in this paper will be introduced in the lectures. The letter describes a slightly different way of expressing IDF, which has become the standard form. (Stephen Robertson has mounted these papers on his Web site with permission from the publisher.)

SLIDE 5

Overview

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 6

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 7

Query Scores: S(q, d) = Σ_{t∈q} w_t^(idf) · w_{t,d}^(tf)   (ltn.lnn)

  • 1. “A sentence is a document.”
  • 2. “A document is a sentence and a sentence is a document.”
  • 3. “This document is short.”
  • 4. “This document is a sentence.”

tf_{t,d}:
             doc1  doc2  doc3  doc4
  a            2     4     -     1
  and          -     1     -     -
  document     1     2     1     1
  is           1     2     1     1
  sentence     1     2     -     1
  short        -     -     1     -
  this         -     -     1     1

→ w^(tf)_{t,d} = 1 + log10(tf_{t,d}):
             doc1  doc2  doc3  doc4
  a           1.3   1.6    -     1
  and          -     1     -     -
  document     1    1.3    1     1
  is           1    1.3    1     1
  sentence     1    1.3    -     1
  short        -     -     1     -
  this         -     -     1     1

df_t and w^(idf)_t = log10(N/df_t), N = 4:
             df    w^(idf)
  a           3     .125
  and         1     .6
  document    4     0
  is          4     0
  sentence    2     .3
  short       1     .6
  this        2     .3

  • log(4/4) = 0, log(4/3) ≈ .125, log(4/2) ≈ .3, log(4/1) ≈ .6
  • Query: “a sentence”

doc1: .125 ∗ 1.3 + .3 ∗ 1 = .46,   doc2: .125 ∗ 1.6 + .3 ∗ 1.3 = .59
doc3: .125 ∗ 0 + .3 ∗ 0 = 0,   doc4: .125 ∗ 1 + .3 ∗ 1 = .425

  • Query: “short sentence”

doc1: .6 ∗ 0 + .3 ∗ 1 = .3,   doc2: .6 ∗ 0 + .3 ∗ 1.3 = .39
doc3: .6 ∗ 1 + .3 ∗ 0 = .6,   doc4: .6 ∗ 0 + .3 ∗ 1 = .3
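To make the ltn.lnn recipe above concrete, here is a small Python sketch (illustrative, not from the slides) that reproduces these scores for the four toy documents; log base 10 and the absence of document-length normalization are the assumptions stated on this slide.

    import math
    from collections import Counter

    docs = [
        "a sentence is a document",
        "a document is a sentence and a sentence is a document",
        "this document is short",
        "this document is a sentence",
    ]

    N = len(docs)
    tfs = [Counter(d.split()) for d in docs]            # raw term frequencies per document
    df = Counter(t for tf in tfs for t in tf)           # document frequencies
    idf = {t: math.log10(N / df[t]) for t in df}        # w^(idf)_t = log10(N / df_t)

    def w_tf(tf):
        """Sublinear tf weight: 1 + log10(tf) for tf > 0, else 0."""
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def score_ltn_lnn(query):
        """S(q, d) = sum over query terms t of w^(idf)_t * w^(tf)_{t,d}."""
        return [sum(idf.get(t, 0.0) * w_tf(tf[t]) for t in query.split()) for tf in tfs]

    print(score_ltn_lnn("a sentence"))       # ≈ [0.46, 0.59, 0.00, 0.43]
    print(score_ltn_lnn("short sentence"))   # ≈ [0.30, 0.39, 0.60, 0.30]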

SLIDE 8

Query Scores: S(q, d) = Σ_{t∈q} w_t^(idf) · ŵ_{t,d}^(tf)   (ltn.lnc)

  • 1. “A sentence is a document.”
  • 2. “A document is a sentence and a sentence is a document.”
  • 3. “This document is short.”
  • 4. “This document is a sentence.”

w^(tf)_{t,d}:
             doc1  doc2  doc3  doc4
  a           1.3   1.6    -     1
  and          -     1     -     -
  document     1    1.3    1     1
  is           1    1.3    1     1
  sentence     1    1.3    -     1
  short        -     -     1     -
  this         -     -     1     1

→ cosine-normalized ŵ^(tf)_{t,d} (each column divided by its length):
             doc1  doc2  doc3  doc4
  a           .60   .54    -    .45
  and          -    .34    -     -
  document    .46   .44   .5    .45
  is          .46   .44   .5    .45
  sentence    .46   .44    -    .45
  short        -     -    .5     -
  this         -     -    .5    .45

df_t and w^(idf)_t:
             df    w^(idf)
  a           3     .125
  and         1     .6
  document    4     0
  is          4     0
  sentence    2     .3
  short       1     .6
  this        2     .3

lengths(doc1, . . ., doc4) = (2.17, 2.94, 2, 2.24)

  • Query: “a sentence”

doc1: .125 ∗ .6 + .3 ∗ .46 = .21,   doc2: .125 ∗ .54 + .3 ∗ .44 = .20
doc3: .125 ∗ 0 + .3 ∗ 0 = 0,   doc4: .125 ∗ .45 + .3 ∗ .45 = .19

  • Query: “short sentence”

doc1: .6 ∗ 0 + .3 ∗ .46 = .14,   doc2: .6 ∗ 0 + .3 ∗ .44 = .133
doc3: .6 ∗ .5 + .3 ∗ 0 = .3,   doc4: .6 ∗ 0 + .3 ∗ .45 = .134
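For ltn.lnc only the document side changes: each document’s w^(tf) vector is divided by its Euclidean length before the dot product. A self-contained sketch (illustrative; it repeats the setup from the previous block):

    import math
    from collections import Counter

    docs = ["a sentence is a document",
            "a document is a sentence and a sentence is a document",
            "this document is short",
            "this document is a sentence"]
    N = len(docs)
    tfs = [Counter(d.split()) for d in docs]
    df = Counter(t for tf in tfs for t in tf)
    idf = {t: math.log10(N / df[t]) for t in df}

    def w_tf(tf):
        return 1 + math.log10(tf) if tf > 0 else 0.0

    def score_ltn_lnc(query):
        """Like ltn.lnn, but cosine-normalize the document's tf weights (the 'c')."""
        scores = []
        for tf in tfs:
            w = {t: w_tf(c) for t, c in tf.items()}
            length = math.sqrt(sum(v * v for v in w.values()))   # the lengths (2.17, 2.94, 2, 2.24)
            scores.append(sum(idf.get(t, 0.0) * w.get(t, 0.0) / length for t in query.split()))
        return scores

    print(score_ltn_lnc("a sentence"))       # ≈ [0.21, 0.20, 0.00, 0.19]
    print(score_ltn_lnc("short sentence"))   # ≈ [0.14, 0.13, 0.30, 0.13]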

SLIDE 9

Cosine similarity between query and document

cos(q, d) = sim(q, d) = (q/|q|) · (d/|d|) = Σ_{i=1}^{|V|} q_i d_i / ( √(Σ_{i=1}^{|V|} q_i²) · √(Σ_{i=1}^{|V|} d_i²) )

q_i is the tf-idf weight (idf) of term i in the query.
d_i is the tf-idf weight (tf) of term i in the document.
|q| and |d| are the lengths of q and d.
q/|q| and d/|d| are length-1 vectors (= normalized).
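As a concrete illustration (a sketch, not from the slides), cosine similarity over two sparse {term: weight} vectors:

    import math

    def cosine(q, d):
        """cos(q, d) = Σ_i q_i d_i / (|q| |d|) for sparse term-weight dictionaries."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm_q = math.sqrt(sum(w * w for w in q.values()))
        norm_d = math.sqrt(sum(w * w for w in d.values()))
        return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

    print(cosine({"rich": 1.0, "poor": 0.5}, {"rich": 0.3, "poor": 0.9}))   # ≈ 0.71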

SLIDE 10

Cosine similarity illustrated

[Figure: query vector v(q) and document vectors v(d1), v(d2), v(d3) plotted in a two-dimensional term space with axes “rich” and “poor”; θ is the angle between v(q) and a document vector.]

SLIDE 11

Variant tf-idf functions

We’ve considered sublinear tf scaling (wf_{t,d} = 1 + log tf_{t,d}).
Or normalize instead by the maximum tf in the document, tf_max(d):

ntf_{t,d} = a + (1 − a) · tf_{t,d} / tf_max(d)

where a ∈ [0, 1] (e.g., .4) is a smoothing term to avoid large swings in ntf due to small changes in tf (a short code sketch follows below).
This eliminates the repeated-content problem (d′ = d + d), but has other issues:

  • sensitive to changes in the stop word list
  • outlier terms with large tf
  • skewed distribution of many nearly-most-frequent terms
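A minimal sketch of this maximum-tf normalization (illustrative only, with the smoothing term a = 0.4 mentioned above):

    def augmented_ntf(tf_counts, a=0.4):
        """Maximum-tf normalization: ntf_{t,d} = a + (1 - a) * tf_{t,d} / tf_max(d)."""
        tf_max = max(tf_counts.values())
        return {t: a + (1 - a) * tf / tf_max for t, tf in tf_counts.items()}

    # Doubling a document (d' = d + d) doubles every tf and tf_max, so ntf is unchanged.
    print(augmented_ntf({"a": 4, "and": 1, "document": 2, "is": 2, "sentence": 2}))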

SLIDE 12

Components of tf.idf weighting

Term frequency:
  n (natural)      tf_{t,d}
  l (logarithm)    1 + log(tf_{t,d})
  a (augmented)    0.5 + 0.5 × tf_{t,d} / max_t(tf_{t,d})
  b (boolean)      1 if tf_{t,d} > 0, 0 otherwise
  L (log ave)      (1 + log(tf_{t,d})) / (1 + log(ave_{t∈d}(tf_{t,d})))

Document frequency:
  n (no)           1
  t (idf)          log(N / df_t)
  p (prob idf)     max{0, log((N − df_t) / df_t)}

Normalization:
  n (none)             1
  c (cosine)           1 / √(w_1² + w_2² + . . . + w_M²)
  u (pivoted unique)   1/u
  b (byte size)        1/CharLength^α, α < 1

Best known combination of weighting options
Default: no weighting

SLIDE 13

tf.idf example

We often use different weightings for queries and documents.
Notation: qqq.ddd (term frequency / document frequency / normalization) for (query.document)
Example: ltn.lnc
  query: logarithmic tf, idf, no normalization
  document: logarithmic tf, no df weighting, cosine normalization
Isn’t it bad to not idf-weight the document?
Example query: “best car insurance”
Example document: “car insurance auto insurance”

SLIDE 14

tf.idf example: ltn.lnc

Query: “best car insurance”. Document: “car insurance auto insurance”.

word       |          query                    |         document                | product
           | tf-raw tf-wght   df    idf weight | tf-raw tf-wght weight  n’lized  |
auto       |   0      0      5000   2.3   0    |   1      1       1      0.52    |   0
best       |   1      1     50000   1.3  1.3   |   0      0       0      0       |   0
car        |   1      1     10000   2.0  2.0   |   1      1       1      0.52    |  1.04
insurance  |   1      1      1000   3.0  3.0   |   2     1.3     1.3     0.68    |  2.04

Key to columns: tf-raw: raw (unweighted) term frequency; tf-wght: logarithmically weighted term frequency; df: document frequency; idf: inverse document frequency; weight: the final weight of the term in the query or document; n’lized: document weights after cosine normalization; product: the product of final query weight and final document weight.

Document length: √(1² + 0² + 1² + 1.3²) ≈ 1.92, so 1/1.92 ≈ 0.52 and 1.3/1.92 ≈ 0.68.

Final similarity score between query and document: Σ_i w_qi · w_di = 0 + 0 + 1.04 + 2.04 = 3.08
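A short illustrative check of this computation in Python (not from the deck; a collection size of N = 10^6 is assumed because it reproduces the idf column above):

    import math

    N = 1_000_000
    df = {"auto": 5000, "best": 50000, "car": 10000, "insurance": 1000}

    query_tf = {"best": 1, "car": 1, "insurance": 1}
    doc_tf = {"car": 1, "insurance": 2, "auto": 1}

    def w_tf(tf):
        return 1 + math.log10(tf)                                 # logarithmic tf ('l')

    q_w = {t: w_tf(tf) * math.log10(N / df[t]) for t, tf in query_tf.items()}   # l, t, n
    d_w = {t: w_tf(tf) for t, tf in doc_tf.items()}                             # l, n, ...
    norm = math.sqrt(sum(w * w for w in d_w.values()))                          # ..., c
    d_w = {t: w / norm for t, w in d_w.items()}

    score = sum(qw * d_w.get(t, 0.0) for t, qw in q_w.items())
    print(round(score, 2))   # ≈ 3.07 (the slide's 3.08 uses rounded intermediate values)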

SLIDE 15

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 16

Parametric and Zone indices

Digital documents have additional structure: metadata encoded in machine-parseable form (e.g., author, title, date of publication, . . .).
One parametric index for each field.
Fields: take a finite set of values (e.g., dates of authorship).
Zones: arbitrary free text (e.g., titles, abstracts).
Permits searching for documents by Shakespeare written in 1601 containing the phrase “alas poor Yorick”,
or finding documents with “merchant” in the title and “william” in the author list and the phrase “gentle rain” in the body.
Use separate indexes for each field and zone, or use william.abstract, william.title, william.author.
Permits weighted zone scoring.

SLIDE 17

Weighted Zone Scoring

Given a boolean query q and a document d, assign to the pair (q, d) a score in [0, 1] by computing a linear combination of zone scores.
Let g_1, . . . , g_ℓ ∈ [0, 1] such that Σ_{i=1}^{ℓ} g_i = 1.
For 1 ≤ i ≤ ℓ, let s_i be the score between q and the i-th zone. Then the weighted zone score is defined as Σ_{i=1}^{ℓ} g_i s_i.

Example: Three zones: author, title, body; g_1 = .2, g_2 = .5, g_3 = .3 (match in author zone least important).
Compute weighted zone scores directly from inverted indexes: instead of adding a document to the set of results as for a boolean AND query, now compute a score for each document.
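A minimal sketch of weighted zone scoring (illustrative, not from the slides), where the per-zone score s_i is simply 1 if every query term occurs in zone i and 0 otherwise:

    ZONE_WEIGHTS = {"author": 0.2, "title": 0.5, "body": 0.3}   # the g_i, summing to 1

    def weighted_zone_score(query_terms, doc_zones, weights=ZONE_WEIGHTS):
        """Score(q, d) = sum over zones i of g_i * s_i, with boolean per-zone scores s_i."""
        score = 0.0
        for zone, g in weights.items():
            tokens = doc_zones.get(zone, "").lower().split()
            s_i = 1.0 if all(t in tokens for t in query_terms) else 0.0
            score += g * s_i
        return score

    doc = {"author": "william shakespeare",
           "title": "the merchant of venice",
           "body": "the quality of mercy droppeth as the gentle rain from heaven"}
    print(weighted_zone_score(["merchant"], doc))          # 0.5 (title match only)
    print(weighted_zone_score(["gentle", "rain"], doc))    # 0.3 (body match only)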

SLIDE 18

Learning Weights

How to determine the weights gi for weighted zone scoring?

  • A. specified by expert
  • B. “learned” using training examples that have been judged editorially (machine-learned relevance)
  • 1. given a set of training examples [(q, d) plus a relevance judgment (e.g., yes/no)]
  • 2. set the weights g_i to best approximate the relevance judgments

Expensive component: labor-intensive assembly of user-generated relevance judgments, especially expensive in a rapidly changing collection (such as the Web). Or use “passive collaborative feedback”? (clickthrough data)

SLIDE 19

SLIDE 20

Machine Learned Relevance

Given a table s_T(d, q), s_B(d, q) of Boolean matches, and relevance judgments r(d, q) (also, e.g., binary) of document d relevant to query q (see fig. 6.5 in text), compute a score for each of the training examples

score(d, q) = g · s_T(d, q) + (1 − g) · s_B(d, q)

and compare it with r(d, q) using an error function

ε(g, Φ_j) = ( r(d_j, q_j) − score(d_j, q_j) )²

for each training example. Choose g to minimize the total error Σ_j ε(g, Φ_j) (a quadratic function of g, so elementary algebra in this case; more generally a sophisticated optimization problem).
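Since the total error is quadratic in g, it can be minimized in closed form. An illustrative least-squares sketch (the training triples are made up; a counting-based shortcut exists for the fully binary case but is not reproduced here):

    def learn_g(examples):
        """Least-squares g for score = g*s_T + (1 - g)*s_B against judgments r.

        examples: list of (s_T, s_B, r) triples.
        Writing score = s_B + g*(s_T - s_B) and minimizing sum_j (r_j - score_j)^2 gives
        g = sum (s_T - s_B)(r - s_B) / sum (s_T - s_B)^2, clamped to [0, 1].
        """
        num = sum((sT - sB) * (r - sB) for sT, sB, r in examples)
        den = sum((sT - sB) ** 2 for sT, sB, r in examples)
        g = num / den if den else 0.5
        return min(1.0, max(0.0, g))

    # hypothetical training examples: (title match s_T, body match s_B, relevance r)
    train = [(1, 1, 1), (0, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 1), (1, 0, 0)]
    print(learn_g(train))   # 0.5 for this toy data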

SLIDE 21

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 22

Why is ranking so important?

Two lectures ago: Problems with unranked retrieval

Users want to look at a few results – not thousands.
It’s very hard to write queries that produce a few results, even for expert searchers.
→ Ranking is important because it effectively reduces a large set of results to a very small one.

Next: More data on “users only look at a few results”.
Actually, in the vast majority of cases they only look at 1, 2, or 3 results.

SLIDE 23

Empirical investigation of the effect of ranking

How can we measure how important ranking is?
Observe what searchers do when they are searching in a controlled setting:

  • Videotape them
  • Ask them to “think aloud”
  • Interview them
  • Eye-track them
  • Time them
  • Record and count their clicks

The following slides are from Dan Russell’s JCDL talk 2007. Dan Russell is the “Über Tech Lead for Search Quality & User Happiness” at Google.

SLIDE 24

Interview video

So . . . Did you notice the FTD official site?
To be honest I didn’t even look at that. At first I saw “from $20” and $20 is what I was looking for. To be honest, 1800-flowers is what I’m familiar with and why I went there next even though I kind of assumed they wouldn’t have $20 flowers.
And you knew they were expensive?
I knew they were expensive but I thought “hey, maybe they’ve got some flowers for under $20 here . . .”
But you didn’t notice the FTD?
No I didn’t, actually. . . that’s really funny.

SLIDE 25

SLIDE 26

Local work

Granka, L., Joachims, T., and Gay, G. (2004) “Eye-Tracking Analysis of User Behavior in WWW Search”, Proceedings of the 28th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR ’04). http://www.cs.cornell.edu/People/tj/publications/granka_etal_04a.pdf

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

Out of date?

Use of top and right margins
“instant” results (only for high bandwidth users?)
mobile devices

SLIDE 32

Importance of ranking: Summary

Viewing abstracts: Users are a lot more likely to read the abstracts of the top-ranked pages (1, 2, 3, 4) than the abstracts of the lower-ranked pages (7, 8, 9, 10).
Clicking: The distribution is even more skewed for clicking. In 1 out of 2 cases, users click on the top-ranked page. Even if the top-ranked page is not relevant, 30% of users will click on it.
→ Getting the ranking right is very important.
→ Getting the top-ranked page right is most important.

SLIDE 33

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 34

A problem for cosine normalization

Query q: “anti-doping rules Beijing 2008 olympics”
Compare three documents:

d1: a short document on anti-doping rules at the 2008 Olympics
d2: a long document that consists of a copy of d1 and 5 other news stories, all on topics different from Olympics/anti-doping
d3: a short document on anti-doping rules at the 2004 Athens Olympics

What ranking do we expect in the vector space model?
d2 is likely to be ranked below d3 . . . but d2 is more relevant than d3.
What can we do about this?

SLIDE 35

Pivot normalization

Cosine normalization produces weights that are too large for short documents and too small for long documents (on average).
Adjust cosine normalization by a linear adjustment: “turning” the average normalization on the pivot.
Effect: Similarities of short documents with the query decrease; similarities of long documents with the query increase.
This removes the unfair advantage that short documents have.
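The slide gives no formula, but the standard pivoted length normalization replaces the cosine length |d| with a linear blend around a collection-wide pivot; a minimal sketch, with slope and pivot values chosen purely for illustration:

    def pivoted_norm(doc_length, pivot, slope=0.75):
        """Pivoted normalization factor: (1 - slope) * pivot + slope * |d|.

        Documents shorter than the pivot get a larger divisor than plain cosine
        normalization (their scores drop); longer documents get a smaller one.
        """
        return (1.0 - slope) * pivot + slope * doc_length

    pivot = 2.3   # e.g., the average cosine length over the collection (illustrative)
    for length in (1.5, 2.3, 4.0):
        print(length, "->", round(pivoted_norm(length, pivot), 2))   # 1.7, 2.3, 3.58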

SLIDE 36

Predicted and true probability of relevance

source: Lillian Lee

SLIDE 37

Pivot normalization

source: Lillian Lee

SLIDE 38

Outline

1. Recap
2. Zones
3. Why rank?
4. More on cosine
5. Implementation

SLIDE 39

Now we also need term frequencies in the index

Brutus    → 1,2   7,3   83,1   87,2   . . .
Caesar    → 1,1   5,1   13,1   17,1   . . .
Calpurnia → 7,1   8,2   40,1   97,3

The second number in each posting is the term frequency. We also need positions, not shown here.

SLIDE 40

Term frequencies in the inverted index

In each posting, store tf_{t,d} in addition to docID d.
Store it as an integer frequency, not as a (log-)weighted real number . . . because real numbers are difficult to compress.
Unary code is effective for encoding term frequencies. Why?
Overall, additional space requirements are small: much less than a byte per posting.
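As an aside (one common convention, not spelled out on the slide), a unary code writes a positive integer n as n one-bits followed by a terminating zero; it is effective here because most stored term frequencies are 1:

    def unary_encode(n):
        """Unary code for a positive integer n: n ones followed by a terminating zero."""
        return "1" * n + "0"

    def unary_decode(bits):
        """Read one unary-coded integer from the front of a bit string; return (value, rest)."""
        n = bits.index("0")
        return n, bits[n + 1:]

    print([unary_encode(tf) for tf in (1, 2, 3)])   # ['10', '110', '1110']
    print(unary_decode("110" + "10"))               # (2, '10')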

SLIDE 41

How do we compute the top k in ranking?

In many applications, we don’t need a complete ranking. We just need the top k for a small k (e.g., k = 100).
If we don’t need a complete ranking, is there an efficient way of computing just the top k?

Naive:
  • Compute scores for all N documents
  • Sort
  • Return the top k

What’s bad about this? Alternative?

SLIDE 42

Use min heap for selecting top k out of N

Use a binary min heap.
A binary min heap is a binary tree in which each node’s value is less than the values of its children.
Takes O(N log k) operations to construct (where N is the number of documents) . . . then read off the k winners in O(k log k) steps.
Essentially linear in N for small k and large N.

SLIDE 43

Binary min heap

[Figure: binary min heap with root 0.6, children 0.85 and 0.7, and leaves 0.9, 0.97, 0.8, 0.95.]

SLIDE 44

Selecting top k scoring documents in O(N log k)

Goal: Keep the top k documents seen so far.
Use a binary min heap.
To process a new document d′ with score s′ (sketched in code below):

  • Get the current minimum h_m of the heap (O(1))
  • If s′ ≤ h_m, skip to the next document
  • If s′ > h_m, heap-delete-root (O(log k)), then heap-add d′/s′ (O(log k))
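A minimal Python version of this loop using the standard heapq module (illustrative; the (score, docID) pairs are made up):

    import heapq

    def top_k(scored_docs, k):
        """Keep the k highest-scoring (score, docID) pairs using a size-k min heap."""
        heap = []                                   # root = smallest score kept so far
        for score, doc_id in scored_docs:
            if len(heap) < k:
                heapq.heappush(heap, (score, doc_id))
            elif score > heap[0][0]:                # beats current minimum: replace the root
                heapq.heapreplace(heap, (score, doc_id))
        return sorted(heap, reverse=True)           # read off the k winners

    docs = [(0.6, 1), (0.85, 2), (0.7, 3), (0.9, 4), (0.97, 5), (0.8, 6), (0.95, 7)]
    print(top_k(docs, 3))   # [(0.97, 5), (0.95, 7), (0.9, 4)]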

SLIDE 45

Even more efficient computation of top k?

Ranking has time complexity O(N), where N is the number of documents.
Optimizations reduce the constant factor, but they are still O(N), and 10^10 < N < 10^11!
Are there sublinear algorithms? Ideas?
What we’re doing in effect: solving the k-nearest-neighbor (kNN) problem for the query vector (= query point).
There are no general solutions to this problem that are sublinear.
We will revisit this when we do kNN classification.

SLIDE 46

Cluster pruning

Cluster docs in a preprocessing step:

  • Pick √N “leaders”
  • For non-leaders, find the nearest leader (expect √N docs per leader)
  • For query q, find the closest leader L (√N computations)
  • Rank L and its followers

Or generalize: use the b1 closest leaders, and then the b2 leaders closest to the query.
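A rough sketch of the basic (b1 = b2 = 1) scheme, assuming random leader selection and a cosine measure like the one defined earlier (names and toy vectors are illustrative, not from the slides):

    import math
    import random

    def cosine(q, d):
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        nq = math.sqrt(sum(w * w for w in q.values()))
        nd = math.sqrt(sum(w * w for w in d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    def build_clusters(doc_vectors, seed=0):
        """Preprocessing: pick ~sqrt(N) random leaders; attach every doc to its nearest leader."""
        random.seed(seed)
        leaders = random.sample(list(doc_vectors), max(1, math.isqrt(len(doc_vectors))))
        followers = {l: [] for l in leaders}
        for doc_id, vec in doc_vectors.items():
            nearest = max(leaders, key=lambda l: cosine(vec, doc_vectors[l]))
            followers[nearest].append(doc_id)
        return leaders, followers

    def cluster_pruned_search(query_vec, doc_vectors, leaders, followers, k=10):
        """Query time: compare q to the sqrt(N) leaders, then score only the best leader's followers."""
        best = max(leaders, key=lambda l: cosine(query_vec, doc_vectors[l]))
        candidates = set(followers[best]) | {best}
        scored = sorted(((cosine(query_vec, doc_vectors[d]), d) for d in candidates), reverse=True)
        return scored[:k]

    # usage with hypothetical toy vectors
    vectors = {i: {"term%d" % (i % 5): 1.0, "common": 0.5} for i in range(100)}
    leaders, followers = build_clusters(vectors)
    print(cluster_pruned_search({"term3": 1.0, "common": 0.2}, vectors, leaders, followers, k=3))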

SLIDE 47

SLIDE 48

Even more efficient computation of top k

Idea 1: Reorder postings lists

Instead of ordering according to docID . . . order according to some measure of “expected relevance”.

Idea 2: Heuristics to prune the search space

Not guaranteed to be correct . . . but fails rarely; in practice, close to constant time.
For this, we’ll need the concepts of document-at-a-time processing and term-at-a-time processing.
