

SLIDE 1

Information Retrieval (IR)

Based on slides by Prabhakar Raghavan, Hinrich Schütze, Ray Larson

Query

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?

  • Slow (for large corpora)
  • NOT is hard to do
  • Other operations (e.g., find the Romans NEAR countrymen) not feasible

Term-document incidence

1 if play contains word, 0 otherwise

             Antony &   Julius   The
             Cleopatra  Caesar   Tempest  Hamlet  Othello  Macbeth
  Antony         1         1        0        0       0        1
  Brutus         1         1        0        1       0        0
  Caesar         1         1        0        1       1        1
  Calpurnia      0         1        0        0       0        0
  Cleopatra      1         0        0        0       0        0
  mercy          1         0        1        1       1        1
  worser         1         0        1        1       1        0

Incidence vectors

So we have a 0/1 vector for each term.

To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them:

110100 AND 110111 AND 101111 = 100100
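
A minimal sketch of this Boolean retrieval step, using Python integers as the 0/1 incidence vectors over the six plays (bit order follows the table above):

```python
# Incidence vectors over (Antony & Cleopatra, Julius Caesar, The Tempest,
# Hamlet, Othello, Macbeth), most significant bit first.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia; the mask keeps the six play bits.
answer = brutus & caesar & ~calpurnia & 0b111111
print(format(answer, "06b"))  # 100100 -> Antony and Cleopatra, Hamlet
```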

Answers to query

Antony and Cleopatra, Act III, Scene ii

  • Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  • When Antony found Julius Caesar dead,
  • He cried almost to roaring; and he wept
  • When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii

  • Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger corpora

Consider n = 1M documents, each with about 1K terms.

Avg 6 bytes/term (incl. spaces/punctuation), so about 6 GB of data.

Say there are m = 500K distinct terms among these.

SLIDE 2

Can’t build the matrix

A 500K x 1M matrix has half a trillion 0’s and 1’s.

But it has no more than one billion 1’s. Why? (Each of the 1M documents contains at most ~1K terms, so at most a billion cells can be 1.)

So the matrix is extremely sparse.

What’s a better representation?

Documents are parsed to extract words, and these are saved with the document ID.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Term–Doc pairs, in order of appearance:
(I, 1), (did, 1), (enact, 1), (julius, 1), (caesar, 1), (I, 1), (was, 1), (killed, 1), (i', 1), (the, 1), (capitol, 1), (brutus, 1), (killed, 1), (me, 1), (so, 2), (let, 2), (it, 2), (be, 2), (with, 2), (caesar, 2), (the, 2), (noble, 2), (brutus, 2), (hath, 2), (told, 2), (you, 2), (caesar, 2), (was, 2), (ambitious, 2)

Inverted index

After all documents have been parsed, the inverted file is sorted by term:

(ambitious, 2), (be, 2), (brutus, 1), (brutus, 2), (capitol, 1), (caesar, 1), (caesar, 2), (caesar, 2), (did, 1), (enact, 1), (hath, 2), (I, 1), (I, 1), (i', 1), (it, 2), (julius, 1), (killed, 1), (killed, 1), (let, 2), (me, 1), (noble, 2), (so, 2), (the, 1), (the, 2), (told, 2), (you, 2), (was, 1), (was, 2), (with, 2)

Multiple term entries in a single document are merged, and frequency information is added:

(term, doc, freq): (ambitious, 2, 1), (be, 2, 1), (brutus, 1, 1), (brutus, 2, 1), (capitol, 1, 1), (caesar, 1, 1), (caesar, 2, 2), (did, 1, 1), (enact, 1, 1), (hath, 2, 1), (I, 1, 2), (i', 1, 1), (it, 2, 1), (julius, 1, 1), (killed, 1, 2), (let, 2, 1), (me, 1, 1), (noble, 2, 1), (so, 2, 1), (the, 1, 1), (the, 2, 1), (told, 2, 1), (you, 2, 1), (was, 1, 1), (was, 2, 1), (with, 2, 1)
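
A minimal sketch of this construction in Python, using the two example documents above (the tokenizer is deliberately naive):

```python
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

# 1. Parse: collect (term, doc_id) pairs in order of appearance.
pairs = []
for doc_id, text in docs.items():
    for token in text.replace(";", " ").replace(".", " ").split():
        pairs.append((token.lower(), doc_id))

# 2. Sort the pairs by term (then by doc id).
pairs.sort()

# 3. Merge duplicates into postings with per-document frequencies.
index = defaultdict(dict)          # term -> {doc_id: frequency}
for term, doc_id in pairs:
    index[term][doc_id] = index[term].get(doc_id, 0) + 1

print(index["caesar"])             # {1: 1, 2: 2}
print(sorted(index["brutus"]))     # [1, 2]
```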

Issues with index we just built

How do we process a query?

What terms in a doc do we index?
  • All words or only “important” ones?

Stopword list: terms that are so common that they’re ignored for indexing,
e.g., the, a, an, of, to … (language-specific).

Issues in what to index

Cooper’s vs. Cooper vs. Coopers.
Full-text vs. full text vs. {full, text} vs. fulltext.
Accents: résumé vs. resume.

“Cooper’s concordance of Wordsworth was published in 1911. The applications of full-text retrieval are legion: they include résumé scanning, litigation support and searching published journals on-line.”

SLIDE 3

Punctuation

Ne’er: use language-specific, handcrafted “locale” to normalize.

State-of-the-art: break up hyphenated sequence.

U.S.A. vs. USA - use locale.

a.out

Numbers

3/12/91; Mar. 12, 1991
55 B.C.
B-52
100.2.86.144

Generally, don’t index numbers as text.
Creation dates for docs.

Case folding

Reduce all letters to lower case.

Exception: upper case in mid-sentence,
e.g., General Motors; Fed vs. fed; SAIL vs. sail.

Thesauri and soundex

Handle synonyms and homonyms

Hand-constructed equivalence classes

e.g., car = automobile, your = you’re

Index such equivalences, or expand query?

More later ...

Spell correction

Look for all words within (say) edit distance 3 (Insert/Delete/Replace) at query time,
e.g., Alanis Morisette.

Spell correction is expensive and slows the query (up to a factor of 100).

Invoke only when the index returns zero matches?
What if docs contain misspellings?
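
A minimal sketch of the edit-distance computation this relies on (standard Levenshtein dynamic programming with unit-cost insert/delete/replace); the vocabulary at the end is made up for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    # Levenshtein distance: minimum number of insertions, deletions,
    # and replacements needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # replace (or match)
        prev = curr
    return prev[-1]

# Query-time correction: keep vocabulary terms within edit distance 3.
vocabulary = ["morissette", "morsel", "rosette", "marionette"]
print([w for w in vocabulary if edit_distance("morisette", w) <= 3])
```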

Lemmatization

Reduce inflectional/variant forms to base form. E.g.,

  • am, are, is → be
  • car, cars, car's, cars' → car
  • the boy's cars are different colors → the boy car be different color

SLIDE 4

Stemming

Reduce terms to their “roots” before indexing (language dependent),
e.g., automate(s), automatic, automation all reduced to automat.

  for example compressed and compression are both accepted as equivalent to compress
  → for exampl compres and compres are both accept as equival to compres

Porter’s algorithm

Commonest algorithm for stemming English.

Conventions + 5 phases of reductions:
  • phases applied sequentially
  • each phase consists of a set of commands
  • sample convention: of the rules in a compound command, select the one that applies to the longest suffix.

Porter’s stemmer available:
http://www.sims.berkeley.edu/~hearst/irbook/porter.html

Typical rules in Porter

  • sses → ss
  • ies → i
  • ational → ate
  • tional → tion
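
A minimal sketch of how these rules (and the longest-suffix convention above) can be applied; this is only a fragment, not a faithful implementation of Porter's full algorithm:

```python
# Suffix rules from the slide: (suffix, replacement).
RULES = [("sses", "ss"), ("ies", "i"), ("ational", "ate"), ("tional", "tion")]

def apply_rules(word: str) -> str:
    # Of the rules that match, select the one applying to the longest suffix.
    matches = [(suf, rep) for suf, rep in RULES if word.endswith(suf)]
    if not matches:
        return word
    suf, rep = max(matches, key=lambda r: len(r[0]))
    return word[: -len(suf)] + rep

print(apply_rules("caresses"))    # caress
print(apply_rules("ponies"))      # poni
print(apply_rules("relational"))  # relate  (ational beats tional)
```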

Beyond term search

What about phrases?

Proximity: Find Gates NEAR Microsoft.
  • Need the index to capture position information in docs.

Zones in documents: Find documents with (author = Ullman) AND (text contains automata).

Evidence accumulation

1 vs. 0 occurrences of a search term,
2 vs. 1 occurrence,
3 vs. 2 occurrences, etc.

Need term frequency information in docs.

Ranking search results

Boolean queries give inclusion or exclusion of docs.

Need to measure proximity from query to each doc.

Whether docs presented to the user are singletons, or a group of docs covering various aspects of the query.

SLIDE 5

Test Corpora

Standard relevance benchmarks

TREC - the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years.

Reuters and other benchmark sets used.

“Retrieval tasks” specified, sometimes as queries.

Human experts mark, for each query and for each doc, “Relevant” or “Not relevant”
  • or at least for the subset that some system returned.

Sample TREC query

Credit: Marti Hearst

Precision and recall

Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

Precision P = tp/(tp + fp)
Recall    R = tp/(tp + fn)

                 Relevant   Not Relevant
  Retrieved         tp           fp
  Not Retrieved     fn           tn
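
A minimal sketch of these two formulas; the counts in the example calls are made up for illustration:

```python
def precision(tp: int, fp: int) -> float:
    # Fraction of retrieved docs that are relevant.
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Fraction of relevant docs that were retrieved.
    return tp / (tp + fn)

# Hypothetical counts: 40 relevant retrieved, 10 non-relevant retrieved,
# 20 relevant docs missed.
print(precision(tp=40, fp=10))  # 0.8
print(recall(tp=40, fn=20))     # 0.666...
```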

Precision & Recall

Precision: proportion of selected items that are correct.

Recall: proportion of target items that were selected.

Precision-Recall curve: shows the tradeoff.

[Figure: the set of docs the system returned overlaps the set of actually relevant docs; the overlap is tp, with Precision = tp/(tp + fp) and Recall = tp/(tp + fn).]

Precision/Recall

Can get high recall (but low precision) by retrieving all docs on all queries!

Recall is a non-decreasing function of the number of docs retrieved.

Precision usually decreases (in a good system).

Difficulties in using precision/recall

  • Binary relevance
  • Should average over large corpus/query ensembles
  • Need human relevance judgements
  • Heavily skewed by corpus/authorship

SLIDE 6

A combined measure: F

The combined measure that assesses this tradeoff is the F measure (weighted harmonic mean):

  F = 1 / (α·(1/P) + (1 − α)·(1/R)) = (β² + 1)·P·R / (β²·P + R)

People usually use the balanced F1 measure, i.e., with β = 1 or α = ½.

Harmonic mean is a conservative average.

See C. J. van Rijsbergen, Information Retrieval.
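
A minimal sketch of the balanced and weighted F measure as just defined:

```python
def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    # Weighted harmonic mean of precision p and recall r.
    # beta = 1 gives the balanced F1 measure.
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.8, 0.5))             # F1 ~= 0.615
print(f_measure(0.8, 0.5, beta=2.0))   # F2: weights recall more heavily
```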

Precision-recall curves

Evaluation of ranked results:

You can return any number of results ordered by similarity.

By taking various numbers of documents (levels of recall), you can produce a precision-recall curve.

Evaluation

There are various other measures.

Precision at fixed recall
  • This is perhaps the most appropriate thing for web search: all people want to know is how many good matches there are in the first one or two pages of results.

11-point interpolated average precision
  • The standard measure in the TREC competitions: you take the precision at 11 levels of recall varying from 0 to 1 by tenths of the documents, using interpolation (the value for 0 is always interpolated!), and average them.
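
A minimal sketch of the 11-point computation for one query, given the ranked list's relevance judgments; the input in the example is hypothetical:

```python
def eleven_point_avg_precision(ranked_relevant, total_relevant):
    # ranked_relevant: True/False per retrieved doc, in rank order.
    # Collect (recall, precision) after each rank.
    points, hits = [], 0
    for rank, rel in enumerate(ranked_relevant, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at recall r = max precision at any recall >= r.
    def interp(r):
        ps = [p for rec, p in points if rec >= r]
        return max(ps) if ps else 0.0

    return sum(interp(i / 10) for i in range(11)) / 11

# Hypothetical ranking: relevant docs at ranks 1, 3, and 5, out of 3 relevant total.
print(eleven_point_avg_precision([True, False, True, False, True], total_relevant=3))
```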

Ranking models in IR

Key idea:
  • We wish to return, in order, the documents most likely to be useful to the searcher.
  • To do this, we want to know which documents best satisfy a query.

An obvious idea is that if a document talks about a topic more, then it is a better match.

A query should then just specify terms that are relevant to the information need, without requiring that all of them must be present.
  • A document is relevant if it has a lot of the terms.

Binary term presence matrices

Record whether a document contains a word: each document is a binary vector in {0,1}^v.

Idea: query satisfaction = overlap measure |X ∩ Y|.

             Antony &   Julius   The
             Cleopatra  Caesar   Tempest  Hamlet  Othello  Macbeth
  Antony         1         1        0        0       0        1
  Brutus         1         1        0        1       0        0
  Caesar         1         1        0        1       1        1
  Calpurnia      0         1        0        0       0        0
  Cleopatra      1         0        0        0       0        0
  mercy          1         0        1        1       1        1
  worser         1         0        1        1       1        0

SLIDE 7

Overlap matching

What are the problems with the overlap measure?

It doesn’t consider:
  • Term frequency in document
  • Term scarcity in collection (document mention frequency)
  • Length of documents
  • (And length of queries: score not normalized)

Many Overlap Measures

  • Simple matching (coordination level match): |Q ∩ D|
  • Dice’s Coefficient: 2|Q ∩ D| / (|Q| + |D|)
  • Jaccard’s Coefficient: |Q ∩ D| / |Q ∪ D|
  • Cosine Coefficient: |Q ∩ D| / (|Q|^(1/2) · |D|^(1/2))
  • Overlap Coefficient: |Q ∩ D| / min(|Q|, |D|)
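
A minimal sketch of these measures with the query Q and document D treated as sets of terms; the example sets are made up:

```python
import math

def overlap_measures(Q: set, D: set) -> dict:
    inter = len(Q & D)
    return {
        "simple matching": inter,
        "dice":    2 * inter / (len(Q) + len(D)),
        "jaccard": inter / len(Q | D),
        "cosine":  inter / math.sqrt(len(Q) * len(D)),
        "overlap": inter / min(len(Q), len(D)),
    }

print(overlap_measures({"brutus", "caesar"}, {"caesar", "calpurnia", "cleopatra"}))
```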

Documents as vectors

Each doc j can be viewed as a vector of tf×idf values, one component for each term.

So we have a vector space:
  • terms are axes
  • docs live in this space
  • even with stemming, may have 20,000+ dimensions

(The corpus of documents gives us a matrix, which we could also view as a vector space in which words live – transposable data.)

The vector space model

Query as vector:
  • We regard the query as a short document.
  • We return the documents ranked by the closeness of their vectors to the query, also represented as a vector.

Developed in the SMART system (Salton, c. 1970) and standardly used by TREC participants and web IR systems.

Vector Representation

Documents and queries are represented as vectors.
Position 1 corresponds to term 1, position 2 to term 2, ..., position t to term t.
The weight of the term is stored in each position:

  D_i = (w_di1, w_di2, ..., w_dit)
  Q   = (w_q1, w_q2, ..., w_qt)

  (w = 0 if a term is absent)

Vector Space Model

  • Documents are represented as vectors in term space.
  • Terms are usually stems.
  • Documents are represented by weighted vectors of terms.
  • Queries are represented the same as documents.
  • Query and document weights are based on the length and direction of their vector.
  • A vector distance measure between the query and documents is used to rank retrieved documents.

SLIDE 8

Documents in 3D Space

Assumption: Documents that are “close together” in space are similar in meaning.

Document Space has High Dimensionality

What happens beyond 2 or 3 dimensions?

Similarity still has to do with how many tokens are shared in common.

More terms -> harder to understand which subsets of words are shared among similar documents.

We will look in detail at ranking methods.
One approach to handling high dimensionality: clustering.

Word Frequency

Which word is more indicative of document similarity: ‘the’, ‘book’, or ‘Oren’?
  • Need to consider “document frequency”: how frequently the word appears in the doc collection.

Which document is a better match for the query “Kangaroo”: one with 1 mention of kangaroos or one with 10 mentions?
  • Need to consider “term frequency”: how many times the word appears in the current document.

tf x idf

  w_ik = tf_ik · log(N / n_k)

  where:
    tf_ik = frequency of term T_k in document D_i
    idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
    N     = total number of documents in the collection C
    n_k   = the number of documents in C that contain T_k

Inverse Document Frequency

IDF provides high values for rare words and low values for common words.

For a collection of 10,000 documents:

  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.698
  log(10000 / 1)     = 4
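
A quick check of these values (log base 10, N = 10,000):

```python
import math

N = 10_000
for n_k in (10_000, 5_000, 20, 1):
    # idf is largest for terms that occur in only one document.
    print(f"idf(n_k={n_k}) = {math.log10(N / n_k):.3f}")
# idf(n_k=10000) = 0.000
# idf(n_k=5000) = 0.301
# idf(n_k=20) = 2.699
# idf(n_k=1) = 4.000
```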

tf x idf normalization

Normalize the term weights (so longer documents are not unfairly given more weight).
  • To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive.

  w_ik = tf_ik · log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)² · [log(N / n_k)]² )
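
A minimal sketch of these cosine-normalized tf×idf weights for a single document; the counts passed in the example call are made up:

```python
import math

def normalized_tfidf(tf: dict, df: dict, N: int) -> dict:
    # tf: raw term counts in this document; df: document frequency per term;
    # N: number of documents in the collection.
    raw = {t: tf[t] * math.log10(N / df[t]) for t in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm else 0.0) for t, w in raw.items()}

# Hypothetical document: 'caesar' twice, 'brutus' once, in a 6-document collection.
print(normalized_tfidf({"caesar": 2, "brutus": 1}, {"caesar": 5, "brutus": 3}, N=6))
```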

SLIDE 9

Vector space similarity

(Use the weights to compare the documents.)

Now, the similarity of two documents is:

  sim(D_i, D_j) = Σ_{k=1..t} w_ik · w_jk

This is also called the cosine, or normalized inner product. (Normalization was done when weighting the terms.)
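
A minimal sketch of this similarity, assuming the two documents are given as already length-normalized weight vectors over the same term axes:

```python
def sim(d_i, d_j):
    # Inner product of two length-normalized tf-idf vectors = cosine similarity.
    return sum(w_i * w_j for w_i, w_j in zip(d_i, d_j))

print(sim([0.0, 0.6, 0.8], [0.6, 0.8, 0.0]))  # 0.48
```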

What’s Cosine anyway?

One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.

From http://mathworld.wolfram.com/Cosine.html

Cosine Detail (degrees)

Computing Cosine Similarity Scores

[Figure: Q, D1, and D2 plotted as two-dimensional term-weight vectors.]

  D1 = (0.8, 0.3)
  D2 = (0.2, 0.7)
  Q  = (0.4, 0.8)

  cos α1 = 0.74  (Q vs. D1)
  cos α2 = 0.98  (Q vs. D2)

Computing a similarity score

Say we have query vector Q = (0.4, 0.8) and document D2 = (0.2, 0.7).
What does their similarity comparison yield?

  sim(Q, D2) = (0.4·0.2 + 0.8·0.7) / sqrt[(0.4² + 0.8²) · (0.2² + 0.7²)]
             = 0.64 / sqrt(0.42)
             = 0.98
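
A quick check of this worked example, computing the cosine of two raw (un-normalized) vectors:

```python
import math

def cosine(q, d):
    # Cosine similarity: dot product divided by the vector lengths.
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.hypot(*q) * math.hypot(*d))

print(round(cosine((0.4, 0.8), (0.2, 0.7)), 2))  # 0.98
```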

To Think About

How does this ranking algorithm behave?

  • Make a set of hypothetical documents consisting of terms and their weights.
  • Create some hypothetical queries.
  • How are the documents ranked, depending on the weights of their terms and the queries’ terms?

SLIDE 10

Summary: What’s the real point of using vector spaces?

Key: A user’s query can be viewed as a (very) short document.

The query becomes a vector in the same space as the docs.

Can measure each doc’s proximity to it.

Natural measure of scores/ranking – no longer Boolean.