Retrieval Models: Outline CS490W: Web I nformation Search & - PDF document

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models CS-490W � Exact-match retrieval method Web Information Search & Management � Unranked Boolean retrieval method � Ranked Boolean retrieval method Information Retrieval: Retrieval Models � Best-match retrieval method � Vector space retrieval method Luo Si � Latent semantic indexing Department of Computer Science Purdue University Retrieval Models Retrieval Models: Unranked Boolean Unranked Boolean: Exact match method Information � Selection Model Need � Retrieve a document iff it matches the precise query Representation Representation � Often return unranked documents (or with chronological order) � Operators Query Retrieval Model Indexed Objects � Logical Operators: AND OR, NOT � Approximately operators: #1(white house) (i.e., within one word Retrieved Objects distance, phrase) #sen(Iraq weapon) (i.e., within a sentence) � String matching operators: Wildcard (e.g., ind* for india and indonesia) Evaluation/Feedback � Field operators: title(information and retrieval)… Overview of Retrieval Models Retrieval Models: Unranked Boolean Retrieval Models Unranked Boolean: Exact match method � Boolean � A query example � Vector space (#2(distributed information retrieval) OR (#1 (federated � Basic vector space SMART, Lucene search)) AND author(#1(Jamie Callan) AND NOT (Steve)) � Extended Boolean � Probabilistic models � Statistical language models Lemur � Two Possion model Okapi � Bayesian inference networks Inquery � Citation/Link analysis models � Page rank Google � Hub & authorities Clever

Retrieval Models: Unranked Boolean Retrieval Models: Ranked Boolean WestLaw system: Commercial Legal/Health/Finance Ranked Boolean: Calculate doc score Information Retrieval System � Term evidence: Evidence from term i occurred in doc j: (tf ij ) and (tf ij *idf i ) � Logical operators � Proximity operators: Phrase, word proximity, same � AND weight: minimum of argument weights sentence/paragraph � OR weight: maximum of argument weights � String matching operator: wildcard (e.g., ind*) Min=0.2 Max=0.6 � Field operator: title(#1(“legal retrieval”)) date(2000) AND OR � Citations: Cite (Salton) Term 0.2 0.6 0.4 0.2 0.6 0.4 evidence Query: (Thailand AND stock AND market) Retrieval Models: Unranked Boolean Retrieval Models: Ranked Boolean Advantages: Advantages: � Work well if user knows exactly what to retrieve � All advantages from unranked Boolean algorithm � Works well when query is precise; predictive; efficient � Predicable; easy to explain � Results in a ranked list (not a full list); easier to browse and � Very efficient find the most relevant ones than Boolean Disadvantages: � Rank criterion is flexible: e.g., different variants of term evidence � It is difficult to design the query; high recall and low precision for loose query; low recall and high precision for strict query Disadvantages: � Results are unordered; hard to find useful ones � Still an exact match (document selection) model: inverse � Users may be too optimistic for strict queries. A few very correlation for recall and precision of strict and loose queries relevant but a lot more are missing � Predictability makes user overestimate retrieval quality Retrieval Models: Ranked Boolean Retrieval Models: Vector Space Model Ranked Boolean: Exact match Vector space model � Similar as unranked Boolean but documents are ordered by � Any text object can be represented by a term vector some criterion � Documents, queries, passages, sentences Retrieve docs from Wall Street Journal Collection � A query can be seen as a short document Query: (Thailand AND stock AND market) � Similarity is determined by distance in the vector space Which word is more important? Reflect importance of � Example: cosine of the angle between two vectors document by its words Many “stock” and “market”, but fewer � The SMART system “Thailand”. Fewer may be more indicative � Developed at Cornell University: 1960-1999 Term Frequency (TF): Number of occurrence in query/doc; larger � Still quite popular number means more important Total number of docs � The Lucene system Inversed Document Frequency (IDF): Number of docs � Open source information retrieval library; (Based on Java) Larger means more important contain a term � Work with Hadoop (Map/Reduce) in large scale app (e.g., Amazon There are many variants of TF, IDF: e.g., consider document length Book)

Retrieval Models: Vector Space Model Retrieval Models: Vector Space Model Vector space model vs. Boolean model Give two vectors of query and document � = � Boolean models q ( q q , ,..., q ) � query as � 1 2 n � � � Query: a Boolean expression that a document must satisfy q = � document as d ( d , d ,..., d ) j j 1 j 2 jn � Retrieval: Deductive inference � � � � calculate the similarity θ ( , ) � Vector space model q d j � � Cosine similarity: Angle between vectors � Query: viewed as a short document in a vector space d � � � � � � j = θ � Retrieval: Find similar vectors/objects sim q d ( , ) cos( ( , q d )) j j � � � θ cos( ( , q d )) j � � � + + + + + + i q d q d ... q d q d q d ... q d q d = j = = � � � 1 j ,1 2 � j ,2 � � j j n , 1 j ,1 2 j ,2 j j n , + + + + q d q d 2 2 2 2 q ... q d ... d 1 n j 1 jn Retrieval Models: Vector Space Model Retrieval Models: Vector Space Model Vector representation Vector representation Retrieval Models: Vector Space Model Retrieval Models: Vector Space Model Vector representation Vector Coefficients Java � The coefficients (vector elements) represent term D 3 evidence/ term importance D 1 � It is derived from several elements Query � Document term weight: Evidence of the term in the document/query D 2 � Collection term weight: Importance of term from observation of collection � Length normalization: Reduce document length bias Sun � Naming convention for coefficients: = First triple represents query term; q d . DCL DCL . k j k , second for document term Starbucks

Retrieval Models: Vector Space Model Retrieval Models: Vector Space Model Advantages: Common vector weight components: � Best match method; it does not need a precise query � lnc.ltc: widely used term weight � Generated ranked lists; easy to explore the results � “l”: log(tf)+1 � Simplicity: easy to implement � “n”: no weight/normalization � “t”: log(N/df) � Effectiveness: often works well � “c”: cosine normalization � Flexibility: can utilize different types of term weighting methods ⎡ ⎤ ( )( ) ∑ N + + ⎢ log( tf ( k ) 1 log( tf ( k ) 1 log ⎥ + + � Used in a wide range of IR tasks: retrieval, classification, .. ⎣ q j ⎦ q d q d q d df ( k ) 1 j 1 2 j 2 n jn = k [ ] 2 summarization, content-based filtering… q d ( ) ⎡ ( ) ⎤ j ∑ ∑ N + 2 + ⎢ ⎥ log( tf ( k ) 1 log( tf ( k ) 1 log q j ⎣ df ( k ) ⎦ k k Retrieval Models: Vector Space Model Retrieval Models: Vector Space Model Disadvantages: Common vector weight components: � Hard to choose the dimension of the vector (“basic concept”); � dnn.dtb: handle varied document lengths terms may not be the best choice � “d”: 1+ln(1+ln(tf)) � Assume independent relationship among terms � “t”: log((N/df) � Heuristic for choosing vector operations � “b”: 1/(0.8+0.2*docleng/avg_doclen) � Choose of term weights � Choose of similarity function � Assume a query and a document can be treated in the same way Retrieval Models: Vector Space Model Retrieval Models: Vector Space Model Disadvantages: � Standard vector space � Represent query/documents in a vector space � Hard to choose the dimension of the vector (“basic concept”); terms may not be the best choice � Each dimension corresponds to a term in the vocabulary � Use a combination of components to represent the term evidence in � Assume independent relationship among terms both query and document � Heuristic for choosing vector operations � Use similarity function to estimate the relationship between � Choose of term weights query/documents (e.g., cosine similarity) � Choose of similarity function � Assume a query and a document can be treated in the same way

Retrieval Models: Outline CS490W: Web I nformation Search & - PDF document

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models CS-490W Exact-match retrieval method Web Information Search & Management Unranked Boolean retrieval method Ranked Boolean retrieval

CS54701: Information Retrieval CS-54701 Information Retrieval Retrieval Models: Language models

Models for Models for Retrieval and Browsing Retrieval and Browsing - Structural Models and

XML Retrieval XML Retrieval XML Retrieval XML Retrieval DB/IR in DB/IR in Theory Theory Web

Model Divergence Retrieval LM, session 10 CS6200: Information Retrieval Slides by: Jesse

Models for Models for Retrieval and Browsing Retrieval and Browsing - Fuzzy Set, Extended

Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency

Retrieval by Content Image Retrieval Image Retrieval Problem Large Image and video data sets

Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval

Luo Si Department of Computer Science Purdue University Retrieval Models Information Need

Accessing XML content: An information retrieval perspective Mounia Lalmas mounia@acm.org 1

Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar

CS54701: Information Retrieval CS-54701 Information Retrieval Luo Si Department of Computer

Information Retrieval Introducing Information Retrieval and Web Search

Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris

Query Likelihood Retrieval LM, session 6 CS6200: Information Retrieval Slides by: Jesse Anderton

Models for Retrieval Models for Retrieval 1. HMM/N-gram-based 2. Latent Semantic Indexing (LSI)

Machine Learning - Intro Aarti Singh Machine Learning 10-701/15-781 Sept 8, 2010 You tell me

A Traffic Study to Interleaved Dark Space Markus De Shon (mdeshon) Agenda Methodology Results

1.4 Recreation, Part 1 Standards for Accessible Playgrounds 26 th Annual Mid-Atlantic ADA Update

ViAMoD Visual Spatiotemporal Pattern Analysis of Movement and Event Data Prof. Dr. Stefan Wrobel

Augmented Reality: Fusing the Real and Synthetic Worlds Mauro Dalla Mura, Michele Zanin,

Future Space The team that helps you connect with your future: Careers Skills audit, CV

Proposition #1 VOTING IS MARCH 22nd; 1PM-9PM AT CE & EJA SCHOOLS Instead of this: We could

CS480/680 Lecture 2: May 8 th , 2019 Nearest Neighbour [RN] Sec. 18.8.1, [HTF] Sec. 2.3.2, [D]

Retrieval Models: Outline CS490W: Web I nformation Search & - PDF document

Retrieval Models: Outline CS490W: Web I nformation Search & Management Retrieval Models CS-490W Exact-match retrieval method Web Information Search & Management Unranked Boolean retrieval method Ranked Boolean retrieval

CS54701: Information Retrieval CS-54701 Information Retrieval Retrieval Models: Language models

Models for Models for Retrieval and Browsing Retrieval and Browsing - Structural Models and

XML Retrieval XML Retrieval XML Retrieval XML Retrieval DB/IR in DB/IR in Theory Theory Web

Model Divergence Retrieval LM, session 10 CS6200: Information Retrieval Slides by: Jesse

Models for Models for Retrieval and Browsing Retrieval and Browsing - Fuzzy Set, Extended

Retrieval by Content Part 2: Text Retrieval Term Frequency and Inverse Document Frequency

Retrieval by Content Image Retrieval Image Retrieval Problem Large Image and video data sets

Information Retrieval Introducing Information Retrieval and Web Search Information Retrieval

Luo Si Department of Computer Science Purdue University Retrieval Models Information Need

Accessing XML content: An information retrieval perspective Mounia Lalmas mounia@acm.org 1

Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar

CS54701: Information Retrieval CS-54701 Information Retrieval Luo Si Department of Computer

Information Retrieval Introducing Information Retrieval and Web Search

Information Retrieval CS276: Information Retrieval and Web Search Text Classification 1 Chris

Query Likelihood Retrieval LM, session 6 CS6200: Information Retrieval Slides by: Jesse Anderton

Models for Retrieval Models for Retrieval 1. HMM/N-gram-based 2. Latent Semantic Indexing (LSI)

Machine Learning - Intro Aarti Singh Machine Learning 10-701/15-781 Sept 8, 2010 You tell me

A Traffic Study to Interleaved Dark Space Markus De Shon (mdeshon) Agenda Methodology Results

1.4 Recreation, Part 1 Standards for Accessible Playgrounds 26 th Annual Mid-Atlantic ADA Update

ViAMoD Visual Spatiotemporal Pattern Analysis of Movement and Event Data Prof. Dr. Stefan Wrobel

Augmented Reality: Fusing the Real and Synthetic Worlds Mauro Dalla Mura, Michele Zanin,

Future Space The team that helps you connect with your future: Careers Skills audit, CV

Proposition #1 VOTING IS MARCH 22nd; 1PM-9PM AT CE &amp; EJA SCHOOLS Instead of this: We could

CS480/680 Lecture 2: May 8 th , 2019 Nearest Neighbour [RN] Sec. 18.8.1, [HTF] Sec. 2.3.2, [D]

Proposition #1 VOTING IS MARCH 22nd; 1PM-9PM AT CE & EJA SCHOOLS Instead of this: We could