

SLIDE 1

Text Retrieval Algorithms

Data-Intensive Information Processing Applications ― Session #4

Jimmy Lin, University of Maryland, Tuesday, February 23, 2010

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

SLIDE 2

Source: Wikipedia (Japanese rock garden)

SLIDE 3

Today’s Agenda

Introduction to information retrieval
Basics of indexing and retrieval
Inverted indexing in MapReduce
Retrieval at scale

SLIDE 4

First, nomenclature…

Information retrieval (IR)

Focus on textual information (= text/document retrieval)
Other possibilities include image, video, music, …

What do we search?

Generically, “collections”; less frequently, “corpora”

What do we find?

Generically, “documents,” even though we may be referring to web pages, PDFs, PowerPoint slides, paragraphs, etc.

SLIDE 5

Information Retrieval Cycle

[Diagram: the information retrieval cycle]

Source selection → query formulation → search → selection of results → examination of documents → delivery of information

Each stage involves discovery: system discovery, vocabulary discovery, concept discovery, document discovery; examination may lead back to source reselection.

SLIDE 6

The Central Problem in Search

[Diagram: searcher vs. author]

The searcher has concepts in mind and expresses them as query terms: “tragic love story”. The author had concepts in mind and expressed them as document terms: “fateful star-crossed romance”.

Do these represent the same concepts?

SLIDE 7

Abstract IR Architecture

[Diagram: abstract IR architecture]

Offline: documents pass through a representation function to produce document representations, which are stored in an index.
Online: the query passes through a representation function to produce a query representation.
A comparison function matches the query representation against the index and returns hits.

SLIDE 8

How do we represent text?

Remember: computers don’t “understand” anything!

“Bag of words”

Treat all the words in a document as index terms
Assign a “weight” to each term based on “importance” (or, in the simplest case, presence/absence of the word)
Disregard order, structure, meaning, etc. of the words
Simple, yet effective!

Assumptions

Term occurrence is independent Document relevance is independent “Words” are well-defined
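The bag-of-words representation can be sketched in a few lines of Python. This is a toy illustration, not the slides' code; the lowercasing and punctuation stripping are simplifying assumptions.

```python
import string
from collections import Counter

def bag_of_words(text):
    """Treat all words in a document as index terms; weight = raw count.
    Order, structure, and meaning are deliberately discarded."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return Counter(tokens)

bow = bag_of_words("One fish, two fish")
```

Note that `bag_of_words("one fish two fish")` and `bag_of_words("fish two fish one")` produce identical representations: word order is gone.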

SLIDE 9

What’s a word?

天主教教宗若望保祿二世因感冒再度住進醫院。 這是他今年第二度因同樣的病因住院。 ﻒﻴﺠﻳر كرﺎﻣ لﺎﻗو- ﻢﺳﺎﺑ ﻖﻃﺎﻨﻟا ﺔﻴﻠﻴﺋاﺮﺳﻹا ﺔﻴﺟرﺎﺨﻟا-ﻞﺒﻗ نورﺎﺷ نإ ةرﺎﻳﺰﺑ ﻰﻟوﻷا ةﺮﻤﻠﻟ مﻮﻘﻴﺳو ةﻮﻋﺪﻟا ﺮﻘﻤﻟا ﺔﻠﻳﻮﻃ ةﺮﺘﻔﻟ ﺖﻧﺎآ ﻲﺘﻟا ،ﺲﻧﻮﺗ مﺎﻋ نﺎﻨﺒﻟ ﻦﻣ ﺎﻬﺟوﺮﺧ ﺪﻌﺑ ﺔﻴﻨﻴﻄﺴﻠﻔﻟا ﺮﻳﺮﺤﺘﻟا ﺔﻤﻈﻨﻤﻟ ﻲﻤﺳﺮﻟا1982. Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России. भारत सरकार ने आिथरॎरॎक सवेरॎेरॎक्स क्सण मेःेः िवत्थ त्थीय वषरॎरॎ 2005-06 मेःेः सात फ़ीसदी िवकास दर हािसल करने का आकलन िकया है और कर सुधार पर ज़ोर िदया है 日米連合で台頭中国に対処…アーミテージ前副長官提言 조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안 에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의 에 대해 군대라도 동원해 막고싶은 심정 이라고 말했다는 일부 언론의 보도를 부인했다.

SLIDE 10

Sample Document

McDonald's slims down spuds

Fast-food chain to reduce certain types of fat in its french fries with new cooking oil.

NEW YORK (CNN/Money) - McDonald's Corp. is cutting the amount of "bad" fat in its french fries nearly in half, the fast-food chain said Tuesday as it moves to make all its fried menu items healthier. But does that mean the popular shoestring fries won't taste the same? The company says no. "It's a win-win for our customers because they are getting the same great french-fry taste along with an even healthier nutrition profile," said Mike Roberts, president of McDonald's USA. But others are not so sure. McDonald's will not specifically discuss the kind of oil it plans to use, but at least one nutrition expert says playing with the formula could mean a different taste. Shares of Oak Brook, Ill.-based McDonald's (MCD: down $0.54 to $23.22, Research, Estimates) were lower Tuesday afternoon. It was unclear Tuesday whether competitors Burger King and Wendy's International (WEN: down $0.80 to $34.91, Research, Estimates) would follow suit. Neither company could immediately be reached for comment. …

“Bag of Words”

14 × McDonald's
12 × fat
11 × fries
8 × new
7 × french
6 × company, said, nutrition
5 × food, oil, percent, reduce, taste, Tuesday
…

SLIDE 11

Counting Words…

[Diagram: Documents → Bag of Words → Inverted Index]

Documents become a bag of words via case folding, tokenization, stopword removal, and stemming; going further would require syntax, semantics, word knowledge, etc. The bag-of-words representations are then compiled into an inverted index.

SLIDE 12

Boolean Retrieval

Users express queries as a Boolean expression

AND, OR, NOT Can be arbitrarily nested

Retrieval is based on the notion of sets

Any given query divides the collection into two sets:

retrieved, not-retrieved

Pure Boolean systems do not define an ordering of the results
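The set semantics can be made concrete with a tiny Python sketch. The index below is a hand-built toy (docnos follow the four-document example used later in the deck); it is an illustration, not a real system.

```python
# Toy inverted index: term -> set of docnos containing it.
index = {
    "blue": {2},
    "fish": {1, 2},
    "ham":  {4},
}
collection = {1, 2, 3, 4}  # all docnos

def AND(a, b): return a & b          # set intersection
def OR(a, b):  return a | b          # set union
def NOT(a):    return collection - a # set complement

# ( blue AND fish ) OR ham
hits = OR(AND(index["blue"], index["fish"]), index["ham"])
```

The query simply divides the collection into `hits` and everything else; nothing in the result is ranked.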

SLIDE 13

Inverted Index: Boolean Retrieval

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Postings (term → docnos):

blue → 2
cat → 3
egg → 4
fish → 1, 2
green → 4
ham → 4
hat → 3
one → 1
red → 2
two → 1

SLIDE 14

Boolean Retrieval

To execute a Boolean query, e.g. ( blue AND fish ) OR ham:

Build the query syntax tree:

        OR
       /  \
     AND   ham
    /   \
  blue   fish

For each clause, look up postings:

blue → 2
fish → 1, 2
ham → 4

Traverse postings and apply the Boolean operators.

Efficiency analysis:

Postings traversal is linear (assuming sorted postings)
Start with the shortest postings list first
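The linear AND traversal over two sorted postings lists can be sketched as a merge with two cursors. This is a generic illustration of the technique, assuming docno-sorted lists.

```python
def intersect(p1, p2):
    """AND of two sorted postings lists via a single linear merge.
    Runs in O(len(p1) + len(p2)); pass the shorter list first."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])   # docno in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                 # advance the cursor behind
        else:
            j += 1
    return result
```

OR is the analogous merge that keeps every docno seen; NOT requires the universe of docnos.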

SLIDE 15

Strengths and Weaknesses

Strengths

Precise, if you know the right strategies
Precise, if you have an idea of what you’re looking for
Implementations are fast and efficient

Weaknesses

Users must learn Boolean logic
Boolean logic is insufficient to capture the richness of language
No control over size of result set: either too many hits or none
When do you stop reading? All documents in the result set are considered “equally good”
What about partial matches? Documents that “don’t quite match” the query may be useful also

SLIDE 16

Ranked Retrieval

Order documents by how likely they are to be relevant to the information need

Estimate relevance(q, di)
Sort documents by relevance
Display sorted results

User model

Present hits one screen at a time, best results first
At any point, users can decide to stop looking

How do we estimate relevance?

Assume a document is relevant if it has a lot of query terms
Replace relevance(q, di) with sim(q, di)
Compute similarity of vector representations

SLIDE 17

Vector Space Model

[Figure: document vectors d1–d5 in a space spanned by terms t1, t2, t3, with angles θ and φ between vectors]

Assumption: documents that are “close together” in vector space “talk about” the same things.

Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

SLIDE 18

Similarity Metric

Use the “angle” between the vectors:

sim(dj, dk) = cos(θ) = (dj · dk) / (|dj| |dk|)
            = Σ_{i=1..n} w_{i,j} w_{i,k} / ( √(Σ_{i=1..n} w_{i,j}²) · √(Σ_{i=1..n} w_{i,k}²) )

Or, more generally, inner products:

sim(dj, dk) = dj · dk = Σ_{i=1..n} w_{i,j} w_{i,k}
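Cosine similarity over sparse term-weight vectors can be sketched as follows; representing a vector as a {term: weight} dict is an implementation choice for illustration.

```python
import math

def cosine(dj, dk):
    """sim(dj, dk) = (dj . dk) / (|dj| |dk|), vectors as {term: weight}.
    Returns 0.0 when either vector is empty (no shared basis to compare)."""
    dot = sum(w * dk.get(t, 0.0) for t, w in dj.items())
    norm_j = math.sqrt(sum(w * w for w in dj.values()))
    norm_k = math.sqrt(sum(w * w for w in dk.values()))
    return dot / (norm_j * norm_k) if norm_j and norm_k else 0.0
```

Normalizing by the vector lengths means a long document is not rewarded merely for repeating terms; the plain inner product drops that normalization.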

SLIDE 19

Term Weighting

Term weights consist of two components

Local: how important is the term in this document?
Global: how important is the term in the collection?

Here’s the intuition:

Terms that appear often in a document should get high weights
Terms that appear in many documents should get low weights

How do we capture this mathematically?

Term frequency (local)
Inverse document frequency (global)

SLIDE 20

TF.IDF Term Weighting

w_{i,j} = tf_{i,j} × log(N / n_i)

w_{i,j}: weight assigned to term i in document j
tf_{i,j}: number of occurrences of term i in document j
N: number of documents in the entire collection
n_i: number of documents containing term i
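The weight formula is one line of Python. The slide does not fix the logarithm base; base 10 is assumed here, which only rescales all weights by a constant.

```python
import math

def tfidf(tf, df, n_docs):
    """w_{i,j} = tf_{i,j} * log(N / n_i).
    tf: occurrences of the term in this document (local component).
    df: number of documents containing the term (n_i, global component).
    n_docs: N, total documents in the collection."""
    return tf * math.log10(n_docs / df)
```

A term appearing in every document gets weight 0 regardless of tf, matching the intuition that ubiquitous terms carry no discriminating power.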

SLIDE 21

Inverted Index: TF.IDF

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Postings (term → df; (docno, tf) pairs):

blue → df 1; (2, 1)
cat → df 1; (3, 1)
egg → df 1; (4, 1)
fish → df 2; (1, 2), (2, 2)
green → df 1; (4, 1)
ham → df 1; (4, 1)
hat → df 1; (3, 1)
one → df 1; (1, 1)
red → df 1; (2, 1)
two → df 1; (1, 1)

SLIDE 22

Positional Indexes

Store term positions in postings
Supports richer queries (e.g., proximity)

Naturally, leads to larger indexes…

SLIDE 23

Inverted Index: Positional Information

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat
Doc 4: green eggs and ham

Postings (term → df; (docno, tf, [positions]) entries):

blue → df 1; (2, 1, [3])
cat → df 1; (3, 1, [1])
egg → df 1; (4, 1, [2])
fish → df 2; (1, 2, [2,4]), (2, 2, [2,4])
green → df 1; (4, 1, [1])
ham → df 1; (4, 1, [3])
hat → df 1; (3, 1, [2])
one → df 1; (1, 1, [1])
red → df 1; (2, 1, [1])
two → df 1; (1, 1, [3])

SLIDE 24

Retrieval in a Nutshell

Look up postings lists corresponding to query terms
Traverse postings for each query term
Store partial query-document scores in accumulators
Select top k results to return

SLIDE 25

Retrieval: Document-at-a-Time

Evaluate documents one at a time (score all query terms)

blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

Accumulators (e.g., a priority queue):

Document score in top k?
Yes: insert document score, extract-min if queue too large
No: do nothing

Tradeoffs:

Small memory footprint (good)
Must read through all postings (bad), but skipping possible
More disk seeks (bad), but blocking possible
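Document-at-a-time evaluation can be sketched in Python over the slide's postings: one cursor per query term, always scoring the smallest unprocessed docno, with a bounded min-heap playing the role of the priority queue. The weights here are simply the stored tf values; a real system would use tf.idf or similar.

```python
import heapq

# Postings from the example: term -> list of (docno, weight), sorted by docno.
postings = {
    "blue": [(9, 2), (21, 1), (35, 1)],
    "fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
}

def document_at_a_time(postings, k):
    """Score one document at a time across all query terms; keep the top k
    in a size-bounded min-heap (extract-min when the queue grows too large)."""
    cursors = {t: 0 for t in postings}
    heap = []  # (score, docno); smallest score at the root
    while True:
        frontier = [postings[t][c][0] for t, c in cursors.items()
                    if c < len(postings[t])]
        if not frontier:
            break
        doc = min(frontier)  # next document to evaluate fully
        score = 0.0
        for t in postings:
            c = cursors[t]
            if c < len(postings[t]) and postings[t][c][0] == doc:
                score += postings[t][c][1]
                cursors[t] += 1
        heapq.heappush(heap, (score, doc))
        if len(heap) > k:
            heapq.heappop(heap)  # evict the current minimum
    return sorted(heap, reverse=True)
```

Only the heap (size k) and one cursor per term live in memory, which is why the footprint is small even though every posting is read.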

SLIDE 26

Retrieval: Query-At-A-Time

Evaluate documents one query term at a time

Usually, starting from the rarest term (often with tf-sorted postings)

blue → (9, 2), (21, 1), (35, 1), …
fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

Accumulators (e.g., a hash): Score_{q=x}(doc n) = s

Tradeoffs:

Early termination heuristics (good)
Large memory footprint (bad), but filtering heuristics possible
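Term-at-a-time evaluation is the complementary sketch: walk one term's postings completely, accumulating partial scores per document in a hash. Again the raw tf values stand in for real weights, and "rarest term first" is approximated by shortest postings list first.

```python
postings = {
    "blue": [(9, 2), (21, 1), (35, 1)],
    "fish": [(1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3)],
}

def term_at_a_time(postings, query, k):
    """Process one query term's postings at a time (rarest term first),
    accumulating partial document scores in a hash of accumulators."""
    accumulators = {}
    for term in sorted(query, key=lambda t: len(postings.get(t, []))):
        for docno, w in postings.get(term, []):
            accumulators[docno] = accumulators.get(docno, 0.0) + w
    return sorted(accumulators.items(), key=lambda kv: -kv[1])[:k]
```

The accumulator table can grow as large as the union of all postings touched, which is the "large memory footprint" tradeoff; processing rare terms first enables filtering heuristics that cap it.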

SLIDE 27

MapReduce it?

The indexing problem

Scalability is critical
Must be relatively fast, but need not be real time
Fundamentally a batch operation
Incremental updates may or may not be important
For the web, crawling is a challenge in itself

The retrieval problem

Must have sub-second response time
For the web, only need relatively few results

SLIDE 28

Indexing: Performance Analysis

Fundamentally, a large sorting problem

Terms usually fit in memory Postings usually don’t

How is it done on a single machine?
How can it be done with MapReduce?

First, let’s characterize the problem size:

Size of vocabulary
Size of postings

SLIDE 29

Vocabulary Size: Heaps’ Law

M = kT^b

M is vocabulary size
T is collection size (number of tokens)
k and b are constants
Typically, k is between 30 and 100, b is between 0.4 and 0.6

Heaps’ Law: linear in log-log space
Vocabulary size grows unbounded!

SLIDE 30

Heaps’ Law for RCV1

k = 44, b = 0.49

First 1,000,020 tokens: predicted = 38,323 terms; actual = 38,365 terms

Reuters-RCV1 collection: 806,791 newswire documents (August 20, 1996 – August 19, 1997)

Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)
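Plugging the RCV1 constants into M = kT^b reproduces the slide's prediction, as this one-liner shows:

```python
def heaps_vocab(T, k=44, b=0.49):
    """Heaps' Law: predicted vocabulary size M = k * T**b.
    Defaults are the constants fit to Reuters-RCV1 on the slide."""
    return k * T ** b

predicted = heaps_vocab(1_000_020)  # close to the slide's 38,323
```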

SLIDE 31

Postings Size: Zipf’s Law

cf_i = c / i

cf_i is the collection frequency of the i-th most common term
c is a constant

Zipf’s Law: (also) linear in log-log space
A specific case of power-law distributions

In other words:

A few elements occur very frequently
Many elements occur very infrequently
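The "linear in log-log space" claim for cf_i = c / i can be checked numerically; the constant c below is an arbitrary value chosen for illustration.

```python
import math

c = 1_000_000  # assumed constant, for illustration only
ranks = [1, 10, 100, 1000]
cfs = [c / i for i in ranks]  # cf_i = c / i

# log(cf_i) = log(c) - log(i): a straight line with slope -1 in log-log space.
slopes = [
    (math.log(cfs[j + 1]) - math.log(cfs[j]))
    / (math.log(ranks[j + 1]) - math.log(ranks[j]))
    for j in range(len(ranks) - 1)
]
```

Every computed slope is exactly -1, which is what a pure Zipf distribution looks like on a log-log plot; real collections like RCV1 only approximate this line.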

SLIDE 32

Zipf’s Law for RCV1

Fit isn’t that good… but good enough!

Reuters-RCV1 collection: 806,791 newswire documents (August 20, 1996 – August 19, 1997)

Manning, Raghavan, Schütze, Introduction to Information Retrieval (2008)

SLIDE 33

Figure from: Newman, M. E. J. (2005) “Power laws, Pareto distributions and Zipf's law.” Contemporary Physics 46:323–351.

SLIDE 34

MapReduce: Index Construction

Map over all documents:

Emit term as key, (docno, tf) as value
Emit other information as necessary (e.g., term position)

Sort/shuffle: group postings by term

Reduce:

Gather and sort the postings (e.g., by docno or tf)
Write postings to disk

MapReduce does all the heavy lifting!
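The map/shuffle/reduce flow above can be simulated in-memory in Python. This is a didactic sketch, not Hadoop code: the dictionary of lists stands in for the framework's sort/shuffle phase, and tokenization is naive whitespace splitting.

```python
from collections import Counter, defaultdict

def mapper(docno, text):
    """Emit (term, (docno, tf)) for every term in one document."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (docno, tf)

def reducer(term, values):
    """Gather and sort the postings (here by docno), ready to write out."""
    return term, sorted(values)

def build_index(docs):
    groups = defaultdict(list)          # stands in for sort/shuffle
    for docno, text in docs.items():
        for term, value in mapper(docno, text):
            groups[term].append(value)  # group postings by term
    return dict(reducer(t, v) for t, v in sorted(groups.items()))

index = build_index({
    1: "one fish two fish",
    2: "red fish blue fish",
    3: "cat in the hat",
})
```

In a real job the mappers and reducers run on different machines and the framework performs the grouping; the per-record logic is unchanged.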

SLIDE 35

Inverted Indexing with MapReduce

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat

Map:

Doc 1 → (one, (1, 1)), (fish, (1, 2)), (two, (1, 1))
Doc 2 → (red, (2, 1)), (fish, (2, 2)), (blue, (2, 1))
Doc 3 → (cat, (3, 1)), (hat, (3, 1))

Shuffle and Sort: aggregate values by keys

Reduce:

blue → (2, 1)
cat → (3, 1)
fish → (1, 2), (2, 2)
hat → (3, 1)
one → (1, 1)
red → (2, 1)
two → (1, 1)

SLIDE 36

Inverted Indexing: Pseudo-Code

SLIDE 37

Positional Indexes

Doc 1: one fish, two fish
Doc 2: red fish, blue fish
Doc 3: cat in the hat

Map:

Doc 1 → (one, (1, 1, [1])), (fish, (1, 2, [2,4])), (two, (1, 1, [3]))
Doc 2 → (red, (2, 1, [1])), (fish, (2, 2, [2,4])), (blue, (2, 1, [3]))
Doc 3 → (cat, (3, 1, [1])), (hat, (3, 1, [2]))

Shuffle and Sort: aggregate values by keys

Reduce:

blue → (2, 1, [3])
cat → (3, 1, [1])
fish → (1, 2, [2,4]), (2, 2, [2,4])
hat → (3, 1, [2])
one → (1, 1, [1])
red → (2, 1, [1])
two → (1, 1, [3])

SLIDE 38

Inverted Indexing: Pseudo-Code

SLIDE 39

Scalability Bottleneck

Initial implementation: terms as keys, postings as values

Reducers must buffer all postings associated with a key (to sort them)
What if we run out of memory to buffer postings?

Uh oh!

SLIDE 40

Another Try…

Before — term as key, full postings as value:

(key) fish → (values) (1, 2, [2,4]), (9, 1, [9]), (21, 3, [1,8,22]), (34, 1, [23]), (35, 2, [8,41]), (80, 3, [2,9,76])

Now — (term, docno) pairs as keys, positions as values:

(fish, 1) → [2,4]
(fish, 9) → [9]
(fish, 21) → [1,8,22]
(fish, 34) → [23]
(fish, 35) → [8,41]
(fish, 80) → [2,9,76]

How is this different?
  • Let the framework do the sorting
  • Term frequency implicitly stored
  • Directly write postings to disk!

Where have we seen this before?

SLIDE 41

Postings Encoding

Conceptually:

fish → (1, 2), (9, 1), (21, 3), (34, 1), (35, 2), (80, 3), …

In practice:

  • Don’t encode docnos, encode gaps (or d-gaps):

fish → (1, 2), (8, 1), (12, 3), (13, 1), (1, 2), (45, 3), …

  • But it’s not obvious that this saves space…
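Gap encoding and its inverse are each a few lines; this sketch reproduces the d-gaps shown for the fish postings (docnos only, weights omitted).

```python
def encode_gaps(docnos):
    """Replace each docno (after the first) with its distance from the
    previous one. Gaps are small for frequent terms, so they compress well."""
    return [docnos[0]] + [b - a for a, b in zip(docnos, docnos[1:])]

def decode_gaps(gaps):
    """Recover absolute docnos by a running sum over the gaps."""
    docnos, total = [], 0
    for g in gaps:
        total += g
        docnos.append(total)
    return docnos
```

The gaps themselves are no smaller to store in fixed-width integers; the payoff only comes when they are fed to a variable-length code (unary, γ, δ, Golomb) that spends fewer bits on small numbers.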
SLIDE 42

Overview of Index Compression

Byte-aligned vs. bit-aligned

Non-parameterized bit-aligned:

Unary codes
γ codes
δ codes

Parameterized bit-aligned:

Golomb codes (local Bernoulli model)

Want more detail? Read Managing Gigabytes by Witten, Moffat, and Bell!

SLIDE 43

Unary Codes

x ≥ 1 is coded as x-1 one bits, followed by 1 zero bit

3 = 110, 4 = 1110

Great for small numbers… horrible for large numbers

Overly-biased for very small gaps

Watch out! Slightly different definitions in different textbooks
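Under the definition used here (x-1 one bits, then a zero bit), the encoder is one line:

```python
def unary(x):
    """x >= 1 encoded as (x - 1) one bits followed by a single zero bit.
    Other textbooks swap the roles of 0 and 1, hence the warning above."""
    return "1" * (x - 1) + "0"
```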

SLIDE 44

γ codes

x ≥ 1 is coded in two parts: length and offset

Start with x in binary; remove the highest-order bit = offset
Length is the number of binary digits, encoded in unary code
Concatenate length + offset codes

Example: 9 in binary is 1001
Offset = 001
Length = 4, in unary code = 1110
γ code = 1110:001

Analysis:

Offset = ⎣log x⎦ bits
Length = ⎣log x⎦ + 1 bits
Total = 2⎣log x⎦ + 1 bits
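A γ encoder following the recipe above (the ':' separator is kept only for readability, matching the slides; a real bitstream has no separator):

```python
def unary(x):
    """(x - 1) one bits followed by a zero bit."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Gamma code: unary-coded length, then the binary offset
    (the binary representation with its leading 1 removed)."""
    binary = bin(x)[2:]                      # e.g. 9 -> "1001"
    return unary(len(binary)) + ":" + binary[1:]
```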

SLIDE 45

δ codes

Similar to γ codes, except that the length is encoded in a γ code

Example: 9 in binary is 1001
Offset = 001
Length = 4, in γ code = 11000
δ code = 11000:001

γ codes = more compact for smaller numbers
δ codes = more compact for larger numbers
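The δ encoder differs from γ only in how the length is written (the separators are again for readability only):

```python
def unary(x):
    return "1" * (x - 1) + "0"

def delta(x):
    """Delta code: like gamma, but the length is itself gamma-coded."""
    binary = bin(x)[2:]                       # e.g. 9 -> "1001", length 4
    length_bin = bin(len(binary))[2:]         # 4 -> "100"
    length_gamma = unary(len(length_bin)) + length_bin[1:]  # gamma(4) = "11000"
    return length_gamma + ":" + binary[1:]
```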

SLIDE 46

Golomb Codes

x ≥ 1, parameter b:

q + 1 in unary, where q = ⎣(x - 1) / b⎦
r in binary, where r = x - qb - 1, in ⎣log b⎦ or ⎡log b⎤ bits

Examples:

b = 3: r = 0, 1, 2 → 0, 10, 11
b = 6: r = 0, 1, 2, 3, 4, 5 → 00, 01, 100, 101, 110, 111
x = 9, b = 3: q = 2, r = 2, code = 110:11
x = 9, b = 6: q = 1, r = 2, code = 10:100

Optimal b ≈ 0.69 (N/df)
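A Golomb encoder matching the examples above. The remainder uses truncated binary, which is how the "⎣log b⎦ or ⎡log b⎤ bits" choice is made; the ':' separator is cosmetic.

```python
import math

def golomb(x, b):
    """Golomb code for x >= 1 with parameter b:
    q + 1 in unary where q = (x - 1) // b, then the remainder
    r = x - q*b - 1 in truncated binary (floor(log b) or ceil(log b) bits)."""
    q, r = divmod(x - 1, b)
    quotient = "1" * q + "0"        # q + 1 in unary
    k = math.ceil(math.log2(b))     # ceil(log b): the longer bit width
    cutoff = 2 ** k - b             # the first `cutoff` remainders get k-1 bits
    if r < cutoff:
        rest = format(r, "b").zfill(k - 1) if k > 1 else ""
    else:
        rest = format(r + cutoff, "b").zfill(k)
    return quotient + ":" + rest
```

With b tuned per term (b ≈ 0.69 N/df), short codes land exactly on the gap sizes that term's postings actually produce, which is why Golomb wins in the size comparison two slides ahead.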

SLIDE 47

Comparison of Coding Schemes

x    Unary        γ          δ           Golomb b=3   Golomb b=6
1    0            0          0           0:0          0:00
2    10           10:0       100:0       0:10         0:01
3    110          10:1       100:1       0:11         0:100
4    1110         110:00     101:00      10:0         0:101
5    11110        110:01     101:01      10:10        0:110
6    111110       110:10     101:10      10:11        0:111
7    1111110      110:11     101:11      110:0        10:00
8    11111110     1110:000   11000:000   110:10       10:01
9    111111110    1110:001   11000:001   110:11       10:100
10   1111111110   1110:010   11000:010   1110:0       10:101

Witten, Moffat, Bell, Managing Gigabytes (1999)

SLIDE 48

Index Compression: Performance

Comparison of index size (bits per pointer):

         Bible   TREC
Unary    262     1918
Binary   15      20
γ        6.51    6.63
δ        6.23    6.38
Golomb   6.09    5.84   (recommended best practice)

Bible: King James version of the Bible; 31,101 verses (4.3 MB)
TREC: TREC disks 1+2; 741,856 docs (2,070 MB)

Witten, Moffat, Bell, Managing Gigabytes (1999)

SLIDE 49

Chicken and Egg?

(key) → (value):

(fish, 1) → [2,4]
(fish, 9) → [9]
(fish, 21) → [1,8,22]
(fish, 34) → [23]
(fish, 35) → [8,41]
(fish, 80) → [2,9,76]
…

Write directly to disk

But wait! How do we set the Golomb parameter b?

We need the df to set b… recall: optimal b ≈ 0.69 (N/df)
But we don’t know the df until we’ve seen all postings!

Sound familiar?

SLIDE 50

Getting the df

In the mapper:

Emit “special” key-value pairs to keep track of df

In the reducer:

Make sure “special” key-value pairs come first: process them to determine df

Remember: proper partitioning!

SLIDE 51

Getting the df: Modified Mapper

Input document: Doc 1 — “one fish, two fish”

Emit normal key-value pairs:

(fish, 1) → [2,4]
(one, 1) → [1]
(two, 1) → [3]

Emit “special” key-value pairs to keep track of df:

(fish, ★) → [1]
(one, ★) → [1]
(two, ★) → [1]

SLIDE 52

Getting the df: Modified Reducer

First, compute the df by summing contributions from all “special” key-value pairs:

(fish, ★) → [63], [82], [27], …

Compute the Golomb parameter b…

Important: properly define the sort order so that “special” key-value pairs come first!

Then write the postings directly to disk:

(fish, 1) → [2,4]
(fish, 9) → [9]
(fish, 21) → [1,8,22]
(fish, 34) → [23]
(fish, 35) → [8,41]
(fish, 80) → [2,9,76]
…

Where have we seen this before?

SLIDE 53

MapReduce it?

The indexing problem (just covered):

Scalability is paramount
Must be relatively fast, but need not be real time
Fundamentally a batch operation
Incremental updates may or may not be important
For the web, crawling is a challenge in itself

The retrieval problem (now):

Must have sub-second response time
For the web, only need relatively few results

SLIDE 54

Retrieval with MapReduce?

MapReduce is fundamentally batch-oriented

Optimized for throughput, not latency
Startup of mappers and reducers is expensive

MapReduce is not suitable for real-time queries!

Use separate infrastructure for retrieval…

SLIDE 55

Important Ideas

Partitioning (for scalability)
Replication (for redundancy)
Caching (for speed)
Routing (for load balancing)

The rest is just details!

SLIDE 56

Term vs. Document Partitioning

[Diagram: a term-document matrix split two ways]

Term partitioning: each node holds the complete postings for a subset of terms (T1, T2, T3, …)
Document partitioning: each node holds a full index over a subset of documents (D1, D2, D3, …)

SLIDE 57

Katta Architecture

(Distributed Lucene)

http://katta.sourceforge.net/

SLIDE 58

Questions?

Source: Wikipedia (Japanese rock garden)