Outline Introduction to information retrieval Logical view of - PDF document

• 1 Information Retrieval Models
  Chapter 2 in R. Baeza-Yates and B. Ribeiro-Neto: Modern Information Retrieval, 1999. Addison Wesley.
  Jon Atle Gulla / Terje Brasethvik / Geir Solskinnsbakk

• 2 Outline
  • Introduction to information retrieval
  • Logical view of documents
    – Document representations
    – The “bag-of-words” approach
  • The Classic IR Models
    – Boolean
    – Vector
    – Probabilistic

• 3 Information retrieval
  • Information retrieval = information access
  • (Document retrieval / Text retrieval / Search)
  • Retrieve documents that satisfy the user’s information need from a document collection
    – Query interpretation
    – Document representation and indexing
    – Ranking of retrieved documents
    – Linguistics, arithmetic and statistics

• 4 AllTheWeb
  • AllTheWeb: FAST’s showcase (www.alltheweb.com), 2002
  • [Screenshot: the query and the retrieved documents are marked]
  • (www.alltheweb.com is part of Yahoo today)

• 5 IR vs. IE vs. TDM
  • Information retrieval
    – “Finding documents that are similar to the query”
      • Fulfilling an information need
    – Document retrieval / text retrieval
      • Give me information about Trondheim?
  • Information extraction
    – Extracting data
      • Extract today’s car-sales advertisements from adressa.no
  • Text mining
    – Discovering new knowledge from text
      • Ex: Pubgene (http://www.pubgene.org): discovery of genome relations through retrieval of MEDLINE articles

• 6 Document retrieval
  • Give me information about Apple Computer?
    – Article? Web site? Web store? Prices?
  • Is this flower poisonous?
    – Image? Fact sheet? Medical/biological encyclopedia?
  • How much does a ticket cost from Trondheim to Paris?
    – Airline price table? Web travel agency?

• 7 Text Retrieval vs. Database Queries?
  • Well defined schema vs. no schema
  • Structured data vs. plain unformatted data
  • Identity of records vs. “fuzzy” similarity measures
  • Well defined query languages and operations vs. “natural language” queries and lexical and mathematical query transformations

• 8 Document retrieval problems
  • What is the definition of “CSCW”?
    – Finds no documents about “CSCW”
    – Finds 1M+ documents about “CSCW”
    – Finds no documents that actually define CSCW
    – Finds 50 different definitions of CSCW
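The contrast on slide 7 between identity of records and “fuzzy” similarity measures can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the whitespace tokenizer, the sample records, and the choice of Jaccard overlap as the similarity measure (the slides do not name one).

```python
def tokens(text: str) -> set[str]:
    """Lower-case and split on whitespace; a deliberately naive tokenizer."""
    return set(text.lower().split())


def exact_match(query: str, records: list[str]) -> list[str]:
    """Database-style lookup: a record matches only if it equals the query."""
    return [r for r in records if r == query]


def jaccard(a: set[str], b: set[str]) -> float:
    """Set-overlap similarity in [0, 1]; one of many possible measures."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_search(query: str, records: list[str]) -> list[tuple[float, str]]:
    """Retrieval-style lookup: score every record, keep non-zero scores, best first."""
    q = tokens(query)
    scored = [(jaccard(q, tokens(r)), r) for r in records]
    return sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)


# Hypothetical records, loosely based on the questions on slide 6.
records = [
    "ticket prices Trondheim Paris",
    "flights from Trondheim",
    "poisonous flowers fact sheet",
]
print(exact_match("Trondheim Paris ticket", records))
print(similarity_search("Trondheim Paris ticket", records))
```

The exact-match lookup finds nothing because no record is identical to the query, while the similarity search still returns the two records that share terms with it, ordered by overlap.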

• 9 Retrieval Models
  • A retrieval model is an idealization or abstraction of an actual retrieval process
  • Approximation of the retrieval situation
  • A retrieval model is not the same as a retrieval implementation

• 10 Components of a retrieval model
  • User
    – Search expert (e.g. librarian) vs. non-expert
    – Background (knowledge of topic)
    – In-depth searching vs. “just-wanna-get-an-idea” searching
  • Documents
    – Different languages
    – Semi-structured (e.g. HTML or XML) vs. plain

• 11 Retrieving vs. Browsing?
  • Open web directories
    – Yahoo, …
  • Domain specific
    – Medline, Lexis-Nexis, Jussnett, Dialog, …
  • Libraries
    – Bibsys, ACM/IEEE Diglib
  • Company intranets
    – Project workspaces
    – General information
  • WWW
    – Google, alltheweb, askJeeves, …

• 12 Taxonomy of retrieval models
  • Retrieval: ad hoc, filtering
    – Classic models: Boolean, Vector, Probabilistic
    – Set theoretic: Fuzzy sets, Extended Boolean
    – Algebraic: Generalized vector, Latent semantic indexing, Neural networks
    – Probabilistic: Inference networks, Belief networks
    – Structured models: Non-overlapping lists, Proximal nodes
  • Browsing
    – Browsing models: Flat, Structure guided, Hypertext

• 13 Information Retrieval Model
  • An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where
    – D is a set composed of logical views for the documents in the collection
    – Q is a set composed of logical views for the user information needs (queries)
    – F is a framework for modeling document representations, queries, and their relationships
    – R(qi, dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. Such ranking defines an ordering among the documents with regard to the query qi.

• 14 The retrieval cycle
  • Query transformation
    – Normalization
    – Query expansion
    – Phrasing / anti-phrasing
  • Result presentation
    – Ranking
    – Clustering
    – Classification
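As a rough illustration of the quadruple on slide 13, the sketch below treats logical views as sets of index terms and plugs in one possible ranking function R. The type aliases, the function names, the sample collection, and the conjunctive Boolean R are all assumptions chosen for brevity; the slides only require that R map a query and a document to a real number.

```python
from typing import Callable

Doc = frozenset[str]    # logical view of a document: its set of index terms
Query = frozenset[str]  # logical view of an information need


def boolean_and(q: Query, d: Doc) -> float:
    """R(q, d) for a conjunctive Boolean query: 1.0 if d holds every query term."""
    return 1.0 if q <= d else 0.0


def rank(
    q: Query, docs: dict[str, Doc], R: Callable[[Query, Doc], float]
) -> list[tuple[str, float]]:
    """Order document ids by decreasing R(q, d); ties keep insertion order."""
    scored = [(doc_id, R(q, d)) for doc_id, d in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical collection D and query q.
D = {
    "d1": frozenset({"information", "retrieval", "model"}),
    "d2": frozenset({"database", "query", "schema"}),
    "d3": frozenset({"retrieval", "model", "ranking"}),
}
q = frozenset({"retrieval", "model"})
print(rank(q, D, boolean_and))
```

Because R is passed in as a parameter, the same `rank` function would work unchanged with a different scoring function; only R and the logical views change from one model to another.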

• 15 About document representations
  • Document meta-information
    – (author, title, date, URI, …)
  • Index term selection?
    – Automated indexing: bag of words
    – User selected words: keywords
    – Controlled vocabularies
  • Document structure
  • Document type

• 16 Index term selection
  • [Figure: processing steps leading to index term selection. Labels: language detection, encoding, transliteration, phrasing, stemming, document meta-data extraction, document type recognition, structure recognition, word analysis, document categorization, index term selection]

• 17 Bag-of-words approach
  • A document is an unordered list of words/tokens
    – Grammatical information is lost
  • Tokenization: What is a word?
    – Is “White House” one or two words?
  • Case folding
    – “President Bush” becomes “president”, “bush”
  • Stemming or lemmatization
    – Morphological information is thrown away: “agreements” becomes “agreement” (lemmatization) or even “agree” (stemming)

• 18 Some repetition
  • IR = retrieval of documents that seem to be similar to the user’s information need
  • Information retrieval models
    – Users -> Query
    – Documents -> Document representation
    – Similarity function -> sim(q, di)
  • Document representations
    – (logical views of documents)
    – Index term selection
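The steps listed on slide 17 (tokenization, case folding, stemming) can be sketched as a small pipeline. This is a minimal sketch under stated assumptions: the function names are made up for illustration, the tokenizer simply splits on non-word characters, and `crude_stem` is a toy suffix stripper, not a real stemming algorithm such as Porter's.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split on non-word characters, so "White House" yields two tokens."""
    return re.findall(r"\w+", text)


def case_fold(tokens: list[str]) -> list[str]:
    """Map every token to lower case."""
    return [t.lower() for t in tokens]


def crude_stem(token: str) -> str:
    """Toy suffix stripper for illustration only; NOT a real stemmer."""
    for suffix in ("ments", "ment", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def bag_of_words(text: str, stem: bool = False) -> Counter:
    """Return term counts; word order and grammar are discarded."""
    tokens = case_fold(tokenize(text))
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return Counter(tokens)


print(tokenize("White House"))
print(bag_of_words("President Bush"))
print(bag_of_words("agreements", stem=True))
```

With this naive tokenizer “White House” always becomes two tokens; treating it as one term would need an extra phrasing step, which is the question the slide raises.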

• 19 Example “bag of words”
  Scientists have found compelling new evidence of possible ancient microscopic life on Mars, derived from magnetic crystals in a meteorite that fell to Earth from the red planet, NASA announced on Monday.

  a, ancient, announced, compelling, crystals, derived, earth, evidence, fell, found, from (2x), have, in, life, magnetic, mars, meteorite, microscopic, monday, nasa, new, of, on (2x), planet, possible, red, scientists, that, the, to

• 20 What is this about?
  allmennviteskapelige, at (2x), av, bredt, datateknikk (2x), de (2x), doktorgradsstudier, Dr.ing., dr.scient., emner, en, et, etter-, fagtilbud, fleste, grunn-, har, hoveddel, hovedfagsstudier, i (3x), Instituttet (2x), informasjonsvitenskap., informatikk, innen, innenfor, kurs, leverer (2x), mellom-, NTNU, NTNUs (2x), og (5x), også, områder, samt (2x), selvsagt, sivilingeniørstudium, Som, studiene, til, tilbyr (3x), undervisning, undervisningen, universitetsinstitutt, ved (2x), vi (2x), videre., videreutdanningstilbud
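The English bag of words on slide 19 can be reproduced in a few lines. This is a sketch, assuming a simple regular-expression tokenizer and lower-casing as the only normalization; the variable names are illustrative.

```python
import re
from collections import Counter

sentence = (
    "Scientists have found compelling new evidence of possible ancient "
    "microscopic life on Mars, derived from magnetic crystals in a meteorite "
    "that fell to Earth from the red planet, NASA announced on Monday."
)

# Case-fold, tokenize on word characters, and count each term.
bag = Counter(re.findall(r"\w+", sentence.lower()))

# Print the bag alphabetically, marking repeated terms as on the slide.
print(", ".join(f"{w} ({n}x)" if n > 1 else w for w, n in sorted(bag.items())))
```

Only “from” and “on” occur more than once; the order of the words in the original sentence cannot be recovered from the bag.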

• 21 What is this about?
  allmennviteskapelige, at (2x), av, bredt, datateknikk (2x), de (2x), doktorgradsstudier, Dr.ing., dr.scient., emner, en, et, etter-, fagtilbud, fleste, grunn-, har, hoveddel, hovedfagsstudier, i (3x), Instituttet (2x), informasjonsvitenskap., informatikk, innen, innenfor, kurs, leverer (2x), mellom-, NTNU, NTNUs (2x), og (5x), også, områder, samt (2x), selvsagt, sivilingeniørstudium, Som, studiene, til, tilbyr (3x), undervisning, undervisningen, universitetsinstitutt, ved (2x), vi (2x), videre., videreutdanningstilbud

  Instituttet har et bredt fagtilbud og tilbyr undervisning i emner innenfor de fleste områder innen datateknikk og informasjonsvitenskap. Instituttet leverer en hoveddel av undervisningen ved NTNUs sivilingeniørstudium i datateknikk, samt at vi tilbyr grunn-, mellom- og hovedfagsstudier i informatikk ved de allmennviteskapelige studiene. Som universitetsinstitutt tilbyr vi selvsagt også doktorgradsstudier (dr.ing. og dr.scient.), samt at vi leverer kurs til NTNUs etter- og videreutdanningstilbud - NTNU videre.

  (English: The department has a broad range of courses and teaches subjects within most areas of computer engineering and information science. The department delivers a major part of the teaching in NTNU's engineering degree (sivilingeniør) programme in computer engineering, and we also offer foundation, intermediate and graduate-level studies in informatics within the general science programmes. As a university department we of course also offer doctoral studies (dr.ing. and dr.scient.), and we deliver courses to NTNU's continuing and further education programme, NTNU videre.)

• 22 “The language problem”
  • [Figure: a query Q matched (?) against a set of document representations (D rep)]
