Information Retrieval CS276: Information Retrieval and Web Search - PowerPoint PPT Presentation

Introduction to Information Retrieval
CS276: Information Retrieval and Web Search
Pandu Nayak and Prabhakar Raghavan
Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture (Ch. 1)
- Basic inverted indexes:
  - Structure: dictionary and postings
  - Key step in construction: sorting
- Boolean query processing
  - Intersection by linear-time "merging"
  - Simple optimizations
- Overview of course topics

Plan for this lecture
Elaborate basic indexing
- Preprocessing to form the term vocabulary
  - Documents
  - Tokenization
  - What terms do we put in the index?
- Postings
  - Faster merges: skip lists
  - Positional postings and phrase queries

Recall the basic indexing pipeline
Documents to be indexed: Friends, Romans, countrymen.
  → Tokenizer
Token stream: Friends Romans Countrymen
  → Linguistic modules
Modified tokens: friend roman countryman
  → Indexer
Inverted index:
  friend     → 2, 4
  roman      → 1, 2
  countryman → 13, 16
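The three pipeline stages can be sketched in a few lines of Python. This is only an illustration: the tiny `LEMMAS` table stands in for real linguistic modules, and the two sample documents and their IDs are made up for the example rather than taken from the lecture.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the "linguistic modules" stage.
LEMMAS = {"friends": "friend", "romans": "roman", "countrymen": "countryman"}

def tokenize(text):
    # Naive tokenizer: every maximal run of ASCII letters is a token.
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    # Lowercase, then map to a base form if the toy table knows one.
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)

def build_index(docs):
    # Map each term to the sorted list of document IDs that contain it.
    index = defaultdict(list)
    for doc_id in sorted(docs):
        terms = {normalize(t) for t in tokenize(docs[doc_id])}
        for term in terms:
            index[term].append(doc_id)
    return dict(index)

# Made-up documents, keyed by document ID.
docs = {2: "Friends, Romans, countrymen.", 4: "Friends and countrymen"}
index = build_index(docs)
```

Because documents are visited in increasing ID order, every postings list comes out sorted, which is what the linear-time merge from the previous lecture relies on.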

Parsing a document (Sec. 2.1)
- What format is it in?
  - pdf/word/excel/html?
- What language is it in?
- What character set is in use?
Each of these is a classification problem, which we will study later in the course. But these tasks are often done heuristically ...

Complications: Format/language (Sec. 2.1)
- Documents being indexed can include docs from many different languages
  - A single index may have to contain terms of several languages.
- Sometimes a document or its components can contain multiple languages/formats
  - French email with a German pdf attachment.
- What is a unit document?
  - A file?
  - An email? (Perhaps one of many in an mbox.)
  - An email with 5 attachments?
  - A group of files (PPT or LaTeX as HTML pages)

TOKENS AND TERMS

Tokenization (Sec. 2.2.1)
- Input: "Friends, Romans, Countrymen"
- Output: Tokens
  - Friends
  - Romans
  - Countrymen
- A token is a sequence of characters in a document
- Each such token is now a candidate for an index entry, after further processing
  - Described below
- But what are valid tokens to emit?

Tokenization (Sec. 2.2.1)
- Issues in tokenization:
  - Finland's capital → Finland? Finlands? Finland's?
  - Hewlett-Packard → Hewlett and Packard as two tokens?
    - state-of-the-art: break up hyphenated sequence.
    - co-education
    - lowercase, lower-case, lower case?
    - It can be effective to get the user to put in possible hyphens
  - San Francisco: one token or two?
    - How do you decide it is one token?
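To make the choices above concrete, here is a minimal sketch of two tokenizer policies. Both function names and both regular expressions are illustrative assumptions, not the lecture's method; neither policy is "correct", they simply commit to different answers.

```python
import re

def tokenize_split(text):
    # Policy A: break on every non-alphanumeric character.
    return re.findall(r"[A-Za-z0-9]+", text)

def tokenize_keep(text):
    # Policy B: keep hyphens and apostrophes that sit inside a token.
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", text)

for phrase in ["Finland's capital", "Hewlett-Packard", "state-of-the-art"]:
    print(phrase, tokenize_split(phrase), tokenize_keep(phrase))
```

Policy A turns `Finland's` into `Finland` and `s`, while Policy B keeps `Hewlett-Packard` as one token that a query for `Packard` alone would miss. Neither handles `San Francisco`, which needs knowledge beyond character classes.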

Numbers (Sec. 2.2.1)
- 3/12/91, Mar. 12, 1991, 12/3/91
- 55 B.C.
- B-52
- My PGP key is 324a3df234cb23e
- (800) 234-2333
  - Often have embedded spaces
  - Older IR systems may not index numbers
    - But often very useful: think about things like looking up error codes/stack traces on the web
    - (One answer is using n-grams: Lecture 3)
  - Will often index "meta-data" separately
    - Creation date, format, etc.

Tokenization: language issues (Sec. 2.2.1)
- French
  - L'ensemble → one token or two?
    - L? L'? Le?
    - Want l'ensemble to match with un ensemble
      - Until at least 2003, it didn't on Google
      - Internationalization!
- German noun compounds are not segmented
  - Lebensversicherungsgesellschaftsangestellter
  - 'life insurance company employee'
  - German retrieval systems benefit greatly from a compound splitter module
    - Can give a 15% performance boost for German
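A compound splitter can be approximated with a dictionary lookup. The sketch below is one simple way to do it, not the module the slide refers to: it assumes a hypothetical four-word lexicon and allows an optional linking "s" between parts.

```python
def split_compound(word, lexicon, linkers=("", "s")):
    """Return lexicon parts that cover `word`, or None if no split exists."""
    word = word.lower()
    if word in lexicon:
        return [word]
    # Try the longest known prefix first, then recurse on the remainder.
    for end in range(len(word) - 1, 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        rest = word[end:]
        for linker in linkers:
            if rest.startswith(linker):
                tail = split_compound(rest[len(linker):], lexicon, linkers)
                if tail is not None:
                    return [head] + tail
    return None

# Hypothetical lexicon, just large enough for the slide's example.
lexicon = {"leben", "versicherung", "gesellschaft", "angestellter"}
parts = split_compound("Lebensversicherungsgesellschaftsangestellter", lexicon)
print(parts)
```

Indexing the parts alongside the full compound would let a query for `Versicherung` reach this document. A real splitter also has to avoid spurious splits, which this sketch does not attempt.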

Tokenization: language issues (Sec. 2.2.1)
- Chinese and Japanese have no spaces between words:
  - 莎拉波娃现在居住在美国东南部的佛罗里达。
  - Not always guaranteed a unique tokenization
- Further complicated in Japanese, with multiple alphabets intermingled
  - Dates/amounts in multiple formats
  - フォーチュン 500 社は情報不足のため時間あた $500K( 約 6,000 万円 )
  - Scripts in the example: Katakana, Hiragana, Kanji, Romaji
- End-user can express query entirely in hiragana!

Tokenization: language issues (Sec. 2.2.1)
- Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right
- Words are separated, but letter forms within a word form complex ligatures
- [Arabic example sentence; reading direction: ← → ← → ← start]
- 'Algeria achieved its independence in 1962 after 132 years of French occupation.'
- With Unicode, the surface presentation is complex, but the stored form is straightforward

Stop words (Sec. 2.2.2)
- With a stop list, you exclude from the dictionary entirely the commonest words. Intuition:
  - They have little semantic content: the, a, and, to, be
  - There are a lot of them: ~30% of postings for top 30 words
- But the trend is away from doing this:
  - Good compression techniques (lecture 5) mean the space for including stop words in a system is very small
  - Good query optimization techniques (lecture 7) mean you pay little at query time for including stop words.
  - You need them for:
    - Phrase queries: "King of Denmark"
    - Various song titles, etc.: "Let it be", "To be or not to be"
    - "Relational" queries: "flights to London"
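A short sketch shows why these queries suffer under a stop list. The `STOP_WORDS` set here is an illustrative assumption (the slide names only the, a, and, to, be); real stop lists differ from system to system.

```python
# Illustrative stop list; not taken from any particular system.
STOP_WORDS = {"the", "a", "and", "to", "be", "of", "or", "not", "it", "let"}

def remove_stop_words(query):
    # Lowercase, split on whitespace, and drop anything on the stop list.
    return [w for w in query.lower().split() if w not in STOP_WORDS]

for query in ["King of Denmark", "To be or not to be", "flights to London"]:
    print(query, "->", remove_stop_words(query))
```

With this list, "To be or not to be" is reduced to an empty query, and "flights to London" can no longer be told apart from "flights from London".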

Normalization to terms (Sec. 2.2.3)
- We need to "normalize" words in indexed text as well as query words into the same form
  - We want to match U.S.A. and USA
- Result is terms: a term is a (normalized) word type, which is an entry in our IR system dictionary
- We most commonly implicitly define equivalence classes of terms by, e.g.,
  - deleting periods to form a term
    - U.S.A., USA → USA
  - deleting hyphens to form a term
    - anti-discriminatory, antidiscriminatory → antidiscriminatory
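The two deletion rules above fit in a single function. The sketch below, with the assumed name `normalize_term`, applies the same function to indexed tokens and to query words so that both land in the same equivalence class.

```python
def normalize_term(token):
    # Implicit equivalence classing: delete periods, then delete hyphens.
    return token.replace(".", "").replace("-", "")

# Tokens as they might appear in indexed text.
indexed_tokens = ["U.S.A.", "anti-discriminatory"]
dictionary = {normalize_term(t) for t in indexed_tokens}

# A query word matches only if it is normalized the same way.
query_word = "USA"
print(normalize_term(query_word) in dictionary)
```

The rules are deliberately blunt: deleting every period would also turn `C.A.T.` into `CAT`, so such classes are usually tuned per collection.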

Normalization: other languages (Sec. 2.2.3)
- Accents: e.g., French résumé vs. resume.
- Umlauts: e.g., German: Tuebingen vs. Tübingen
  - Should be equivalent
- Most important criterion:
  - How are your users likely to write their queries for these words?
- Even in languages that standardly have accents, users often may not type them
  - Often best to normalize to a de-accented term
    - Tuebingen, Tübingen, Tubingen → Tubingen
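De-accenting can be sketched with Python's standard `unicodedata` module: decompose each character, then drop the combining marks. The function name `strip_accents` is an assumption for this example.

```python
import unicodedata

def strip_accents(text):
    # NFD splits "ü" into "u" plus a combining diaeresis; drop the marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

for word in ["résumé", "Tübingen", "Tuebingen"]:
    print(word, "->", strip_accents(word))
```

This covers résumé and Tübingen but leaves the transliterated spelling Tuebingen unchanged. Folding "ue" into "u" would need a separate, language-specific rule, since a blanket replacement would also damage unrelated words.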

Normalization: other languages (Sec. 2.2.3)
- Normalization of things like date forms
  - 7 月 30 日 vs. 7/30
  - Japanese use of kana vs. Chinese characters
- Tokenization and normalization may depend on the language and so are intertwined with language detection
  - Morgen will ich in MIT ... (Is this German "mit"?)
- Crucial: Need to "normalize" indexed text as well as query terms into the same form
