1 Cant build the matrix Inverted index Documents are parsed to - PDF document

Query Information Retrieval (IR) � Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia ? � Could grep all of Shakespeare’s plays for Brutus and Caesar then strip out lines containing Calpurnia ? Based on slides by � Slow (for large corpora) Prabhakar Raghavan, Hinrich Schütze, � NOT is hard to do Ray Larson � Other operations (e.g., find the Romans NEAR countrymen ) not feasible Term-document incidence Incidence vectors � So we have a 0/1 vector for each term. Antony and Cleopatra Julius Caesar The Tempest Hamlet Othello Macbeth � To answer query: take the vectors for Antony 1 1 0 0 0 1 Brutus, Caesar and Calpurnia Brutus 1 1 0 1 0 0 (complemented) � bitwise AND . Caesar 1 1 0 1 1 1 Calpurnia 0 1 0 0 0 0 � 110100 AND 110111 AND 101111 = Cleopatra 1 0 0 0 0 0 100100. mercy 1 0 1 1 1 1 worser 1 0 1 1 1 0 1 if play contains word, 0 otherwise Answers to query Bigger corpora � Consider n = 1M documents, each with � Antony and Cleopatra, Act III, Scene ii about 1K terms. � Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus, When Antony found Julius Caesar dead, � � Avg 6 bytes/term incl spaces/punctuation He cried almost to roaring; and he wept � When at Philippi he found Brutus slain. � � 6GB of data. � Hamlet, Act III, Scene ii � Say there are m = 500K distinct terms Lord Polonius: I did enact Julius Caesar I was killed i' the � among these. Capitol; Brutus killed me. � 1

Can’t build the matrix Inverted index � Documents are parsed to extract words Term Doc # � 500K x 1M matrix has half-a-trillion 0’s and I 1 and these are saved with the document did 1 1’s. enact 1 ID. julius 1 caesar 1 Why? � But it has no more than one billion 1’s. I 1 was 1 killed 1 � matrix is extremely sparse. i' 1 the 1 capitol 1 � What’s a better representation? brutus 1 Doc 1 Doc 2 killed 1 me 1 so 2 let 2 it 2 I did enact Julius So let it be with be 2 with 2 Caesar I was killed Caesar. The noble caesar 2 the 2 i' the Capitol; noble 2 Brutus hath told you brutus 2 Brutus killed me. hath 2 Caesar was ambitious told 2 you 2 caesar 2 was 2 ambitious 2 Term Doc # Term Doc # Term Doc # Term Doc # Freq I 1 ambitious 2 � After all documents � Multiple term entries ambitious 2 ambitious 2 1 did 1 be 2 be 2 be 2 1 enact 1 brutus 1 brutus 1 have been parsed the in a single document brutus 1 1 julius 1 brutus 2 brutus 2 brutus 2 1 caesar 1 capitol 1 capitol 1 capitol 1 1 inverted file is sorted by I 1 caesar 1 are merged and caesar 1 caesar 1 1 was 1 caesar 2 caesar 2 caesar 2 2 killed 1 caesar 2 terms frequency caesar 2 did 1 1 i' 1 did 1 did 1 enact 1 1 the 1 enact 1 enact 1 hath 2 1 capitol 1 hath 1 information added hath 1 brutus 1 I 1 I 1 2 I 1 i' 1 1 killed 1 I 1 I 1 it 2 1 me 1 i' 1 i' 1 julius 1 1 so 2 it 2 it 2 killed 1 2 let 2 julius 1 julius 1 let 2 1 it 2 killed 1 killed 1 me 1 1 be 2 killed 1 killed 1 noble 2 1 with 2 let 2 let 2 so 2 1 caesar 2 me 1 me 1 the 1 1 the 2 noble 2 noble 2 the 2 1 noble 2 so 2 so 2 the 1 told 2 1 brutus 2 the 1 the 2 you 2 1 hath 2 the 2 told 2 was 1 1 told 2 told 2 you 2 was 2 1 you 2 you 2 caesar 2 was 1 with 2 1 was 1 was 2 was 2 was 2 ambitious 2 with 2 with 2 Issues with index we just built Issues in what to index � How do we process a query? Cooper’s concordance of Wordsworth was published in � What terms in a doc do we index? 1911. The applications of full-text retrieval are legion: � All words or only “important” ones? they include résumé scanning, litigation support and � Stopword list: terms that are so common searching published journals on-line. that they’re ignored for indexing. � e.g ., the, a, an, of, to … � Cooper’s vs. Cooper vs. Coopers . � language-specific. � Full-text vs. full text vs. { full, text } vs. fulltext. � Accents: résumé vs. resume . 2

Punctuation Numbers � Ne’er : use language-specific, handcrafted � 3/12/91 “locale” to normalize. � Mar. 12, 1991 � State-of-the-art : break up hyphenated � 55 B.C. sequence. � B-52 � U.S.A. vs. USA - use locale. � 100.2.86.144 � a.out � Generally, don’t index as text � Creation dates for docs Case folding Thesauri and soundex � Reduce all letters to lower case � Handle synonyms and homonyms � exception: upper case in mid-sentence � Hand-constructed equivalence classes � e.g., General Motors � e.g., car = automobile � Fed vs. fed � your � you’re � SAIL vs . sail � Index such equivalences, or expand query? � More later ... Spell correction Lemmatization � Look for all words within (say) edit distance � Reduce inflectional/variant forms to base 3 (Insert/Delete/Replace) at query time form � e.g., Alanis Morisette � E.g., � Spell correction is expensive and slows the � am, are, is → be query (up to a factor of 100) � car, cars, car's , cars' → car � Invoke only when index returns zero � the boy's cars are different colors → the boy matches? car be different color � What if docs contain mis-spellings? 3

Stemming Porter’s algorithm � Reduce terms to their “roots” before � Commonest algorithm for stemming English indexing � Conventions + 5 phases of reductions � language dependent � phases applied sequentially � e.g., automate(s), automatic, automation all � each phase consists of a set of commands reduced to automat . � sample convention: Of the rules in a compound command, select the one that applies to the longest suffix. for exampl compres and for example compressed � Porter’s stemmer available: compres are both accept and compression are both http//www.sims.berkeley.edu/~hearst/irbook/porter.html as equival to compres. accepted as equivalent to compress. Typical rules in Porter Beyond term search � sses → ss � What about phrases? � ies → i � Proximity: Find Gates NEAR Microsoft . � ational → ate � Need index to capture position information in docs. � tional → tion � Zones in documents: Find documents with ( author = Ullman ) AND (text contains automata ). Evidence accumulation Ranking search results � 1 vs. 0 occurrence of a search term � Boolean queries give inclusion or exclusion of docs. � 2 vs. 1 occurrence � Need to measure proximity from query to � 3 vs. 2 occurrences, etc. each doc. � Need term frequency information in docs � Whether docs presented to user are singletons, or a group of docs covering various aspects of the query. 4

Test Corpora Standard relevance benchmarks � TREC - National Institute of Standards and Testing (NIST) has run large IR testbed for many years � Reuters and other benchmark sets used � “Retrieval tasks” specified � sometimes as queries � Human experts mark, for each query and for each doc, “Relevant” or “Not relevant” � or at least for subset that some system returned Sample TREC query Precision and recall � Precision : fraction of retrieved docs that are relevant = P(relevant|retrieved) � Recall : fraction of relevant docs that are retrieved = P(retrieved|relevant) Relevant Not Relevant Retrieved tp fp Not Retrieved fn tn � Precision P = tp/(tp + fp) � Recall R = tp/(tp + fn) Credit: Marti Hearst Precision & Recall Precision/Recall � Can get high recall (but low precision) by tp Actual relevant docs � Precision + retrieving all docs on all queries! tp fp tn � Recall is a non-decreasing function of the � Proportion of selected number of docs retrieved items that are correct fp tp fn tp � Precision usually decreases (in a good system) + tp fn � Recall � Difficulties in using precision/recall � Proportion of target System returned these � Binary relevance items that were selected � Precision-Recall curve � Should average over large corpus/query Precision ensembles � Shows tradeoff � Need human relevance judgements � Heavily skewed by corpus/authorship Recall 5

1 Cant build the matrix Inverted index Documents are parsed to - PDF document

Query Information Retrieval (IR) Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia ? Could grep all of Shakespeares plays for Brutus and Caesar then strip out lines containing Calpurnia ? Based on slides

Humanoid Robot Soccer 101 Thomas Rfer Cyber-Physical Systems German Research Center for

Until now... Phrase Structures and Syntax ANLP: Lecture 11 Focused mostly on regular

University of Trento Logica & Linguaggio 12/03/12 1 Animal problem Consider the following

https://www.youtube.com/watch?v=w0sWIVR1hXw Content From the aboriginal

Goals for the discussion To identify major hazards for requirements quality in the current

Overview of the 7 th NTCIR f Workshop N Noriko Kando k K d National Institute of

Predictability and Efficiency in Predictability and Efficiency in Wireless Sensor Networks

Education at Ohio State The short story Hands-on projects involving networked sensor nodes in

GENI (the Global Environment for Network Innovations) and Cloud Computing Harry Mussman

Why Graphs? Discussion is based on the book and slides by Jimmy Lin and Chris Dyer Analyze

UK Healthy Universities Network: Online discussion on COVID-19 Recovery, Healthy Universities

Possession and homelessness in a time of Coronavirus Justine Compton and Liz Davies, Garden

Meeting: Strategic Commissioning Board Meeting Date 03 February 2020 Action Receive

Over 65s Falls are the most frequent and serious accident in people over 65. Age 30% or 3 in

12 th 16 th March 2018 P7 Residential Information Evening Dalguise DVD St Marys

Commissioning and Enabling Primary Care in Doncaster Laura Sherburn 30 June 2016 Primary Care

Committee February 8, 2016 Agenda OPEN SPACE & ECOLOGY COMMITTEE AGENDA Monday, February 8

Wel Welcom come ! Launch and Learn 8 th June 2016 Dr Philip Crowley National Director Quality

Digital health & cyber security Presented by Chelsea Gordon Wednesday 1 August 2018 2 3 My

Time Signatures to detect multi-headed stealthy attack tools Marc Dacier (EURECOM) Guillaume

for Change Agenda 07:30 07:45 Registration. 07:45 08:10 Introduction Myths and Facts

Agenda Introduction Key Outcomes What is Immersive Technologies Why Immersive

<Title> Yiqun Hu, SP Group Agenda Condition monitoring & anomaly detection

Fundamentals of data and interactive journalism Your instructor: Nadja Popovich A bit about me

1 Cant build the matrix Inverted index Documents are parsed to - PDF document

Query Information Retrieval (IR) Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia ? Could grep all of Shakespeares plays for Brutus and Caesar then strip out lines containing Calpurnia ? Based on slides

Humanoid Robot Soccer 101 Thomas Rfer Cyber-Physical Systems German Research Center for

Until now... Phrase Structures and Syntax ANLP: Lecture 11 Focused mostly on regular

University of Trento Logica &amp; Linguaggio 12/03/12 1 Animal problem Consider the following

https://www.youtube.com/watch?v=w0sWIVR1hXw Content From the aboriginal

Goals for the discussion To identify major hazards for requirements quality in the current

Overview of the 7 th NTCIR f Workshop N Noriko Kando k K d National Institute of

Predictability and Efficiency in Predictability and Efficiency in Wireless Sensor Networks

Education at Ohio State The short story Hands-on projects involving networked sensor nodes in

GENI (the Global Environment for Network Innovations) and Cloud Computing Harry Mussman

Why Graphs? Discussion is based on the book and slides by Jimmy Lin and Chris Dyer Analyze

UK Healthy Universities Network: Online discussion on COVID-19 Recovery, Healthy Universities

Possession and homelessness in a time of Coronavirus Justine Compton and Liz Davies, Garden

Meeting: Strategic Commissioning Board Meeting Date 03 February 2020 Action Receive

Over 65s Falls are the most frequent and serious accident in people over 65. Age 30% or 3 in

12 th 16 th March 2018 P7 Residential Information Evening Dalguise DVD St Marys

Commissioning and Enabling Primary Care in Doncaster Laura Sherburn 30 June 2016 Primary Care

Committee February 8, 2016 Agenda OPEN SPACE &amp; ECOLOGY COMMITTEE AGENDA Monday, February 8

Wel Welcom come ! Launch and Learn 8 th June 2016 Dr Philip Crowley National Director Quality

Digital health &amp; cyber security Presented by Chelsea Gordon Wednesday 1 August 2018 2 3 My

Time Signatures to detect multi-headed stealthy attack tools Marc Dacier (EURECOM) Guillaume

for Change Agenda 07:30 07:45 Registration. 07:45 08:10 Introduction Myths and Facts

Agenda Introduction Key Outcomes What is Immersive Technologies Why Immersive

&lt;Title&gt; Yiqun Hu, SP Group Agenda Condition monitoring &amp; anomaly detection

Fundamentals of data and interactive journalism Your instructor: Nadja Popovich A bit about me

University of Trento Logica & Linguaggio 12/03/12 1 Animal problem Consider the following

Committee February 8, 2016 Agenda OPEN SPACE & ECOLOGY COMMITTEE AGENDA Monday, February 8

Digital health & cyber security Presented by Chelsea Gordon Wednesday 1 August 2018 2 3 My

<Title> Yiqun Hu, SP Group Agenda Condition monitoring & anomaly detection