
Retrieval Models: Outline (CS490W: Web Information Search & Management)



  1. Retrieval Models: Outline
CS490W: Web Information Search & Management
Information Retrieval: Retrieval Models
Luo Si, Department of Computer Science, Purdue University

Outline:
- Exact-match retrieval methods
  - Unranked Boolean retrieval
  - Ranked Boolean retrieval
- Best-match retrieval methods
  - Vector space retrieval
  - Latent semantic indexing

Retrieval Models (process):
An information need is represented as a query; indexed objects are likewise given a representation. The retrieval model matches the query against the indexed objects to produce retrieved objects, and evaluation/feedback flows back into the query.

Retrieval Models: Unranked Boolean
Unranked Boolean: an exact-match method
- Selection model: retrieve a document iff it matches the precise query
- Often returns unranked documents (or in chronological order)
- Operators:
  - Logical operators: AND, OR, NOT
  - Proximity operators: #1(white house) (within one word distance, i.e., a phrase); #sen(Iraq weapon) (within one sentence)
  - String-matching operators: wildcard (e.g., ind* for india and indonesia)
  - Field operators: title(information and retrieval) ...
- A query example:
  (#2(distributed information retrieval) OR #1(federated search)) AND author(#1(Jamie Callan) AND NOT (Steve))

Overview of Retrieval Models:
- Boolean
- Vector space: basic vector space (SMART, Lucene); extended Boolean
- Probabilistic models: statistical language models (Lemur); two-Poisson model (Okapi); Bayesian inference networks (Inquery)
- Citation/link analysis models: PageRank (Google); hubs & authorities (Clever)
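The selection model above can be sketched with a small inverted index: each term maps to the set of documents containing it, and AND/OR/NOT become set intersection, union, and complement. The documents and query here are illustrative assumptions, not from the slides.

```python
# Minimal sketch of unranked Boolean retrieval over an inverted index.
# The toy documents and query are hypothetical examples.

def build_index(docs):
    """Map each term to the set of doc IDs containing it."""
    index = {}
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index

def AND(*sets):
    return set.intersection(*sets)

def OR(*sets):
    return set.union(*sets)

def NOT(s, universe):
    return universe - s

docs = {
    1: "Thailand stock market report",
    2: "US stock market news",
    3: "Thailand travel guide",
}
index = build_index(docs)
all_ids = set(docs)

# Query: (Thailand AND stock AND market)
hits = AND(index["thailand"], index["stock"], index["market"])
# Note: the result is an unordered set, matching the unranked model.
```

Note that nothing in the result indicates which matching document is better; that is exactly the "results are unordered" disadvantage discussed below.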

  2. Retrieval Models: Unranked Boolean
WestLaw system: a commercial legal/health/finance information retrieval system
- Logical operators
- Proximity operators: phrase, word proximity, same sentence/paragraph
- String-matching operator: wildcard (e.g., ind*)
- Field operators: title(#1("legal retrieval")) date(2000)
- Citations: cite(Salton)

Retrieval Models: Unranked Boolean
Advantages:
- Works well if the user knows exactly what to retrieve
- Predictable; easy to explain
- Very efficient
Disadvantages:
- It is difficult to design the query: a loose query gives high recall and low precision, a strict query gives low recall and high precision
- Results are unordered; hard to find the useful ones
- Users may be too optimistic about strict queries: a few very relevant documents are returned, but many more are missing

Retrieval Models: Ranked Boolean
Ranked Boolean: exact match
- Similar to unranked Boolean, but documents are ordered by some criterion
Example: retrieve docs from a Wall Street Journal collection with the query (Thailand AND stock AND market).
Which word is more important? Reflect the importance of a document by its words: there are many occurrences of "stock" and "market" but fewer of "Thailand", and the rarer word may be more indicative.
- Term frequency (TF): number of occurrences in the query/doc; a larger number means the term is more important
- Inverse document frequency (IDF): based on the ratio of the total number of docs to the number of docs that contain the term; larger means more important
- There are many variants of TF and IDF, e.g., ones that consider document length

Retrieval Models: Ranked Boolean
Ranked Boolean: calculate a doc score
- Term evidence: evidence from term i occurring in doc j: (tf_ij) and (tf_ij * idf_i)
- AND weight: minimum of argument weights
- OR weight: maximum of argument weights
Example: for the query (Thailand AND stock AND market) with term evidence 0.2, 0.6, 0.4, AND gives Min = 0.2; with OR, Max = 0.6.

Retrieval Models: Ranked Boolean
Advantages:
- All the advantages of the unranked Boolean algorithm: works well when the query is precise; predictable; efficient
- Results in a ranked list (not a flat list); easier to browse and find the most relevant documents than with plain Boolean
- The ranking criterion is flexible: e.g., different variants of term evidence
Disadvantages:
- Still an exact-match (document selection) model: inverse correlation between recall and precision for strict vs. loose queries
- Predictability makes users overestimate retrieval quality

Retrieval Models: Vector Space Model
Vector space model:
- Any text object can be represented by a term vector: documents, queries, passages, sentences
- A query can be seen as a short document
- Similarity is determined by distance in the vector space, e.g., the cosine of the angle between two vectors
- The SMART system: developed at Cornell University (1960s-1999); still quite popular
- The Lucene system: an open-source information retrieval library (based on Java); works with Hadoop (MapReduce) in large-scale applications (e.g., Amazon Book)
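The min/max scoring rule above is small enough to state directly in code. This sketch mirrors the slide's example weights (0.2, 0.6, 0.4); using tf*idf as the term evidence is one of the variants the slide mentions.

```python
# Ranked Boolean scoring: AND = minimum of argument weights,
# OR = maximum. The evidence values reproduce the slide's example.

def and_score(*weights):
    return min(weights)

def or_score(*weights):
    return max(weights)

# Term evidence for one document, as in the slide's example
evidence = {"thailand": 0.2, "stock": 0.6, "market": 0.4}

# Query: (Thailand AND stock AND market)
and_result = and_score(*evidence.values())  # 0.2 (the minimum)
# Query: (Thailand OR stock OR market)
or_result = or_score(*evidence.values())    # 0.6 (the maximum)
```

Scoring every document this way and sorting by score yields the ranked list that plain Boolean retrieval lacks.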

  3. Retrieval Models: Vector Space Model
Vector space model vs. Boolean model
- Boolean models
  - Query: a Boolean expression that a document must satisfy
  - Retrieval: deductive inference
- Vector space model
  - Query: viewed as a short document in a vector space
  - Retrieval: find similar vectors/objects

Retrieval Models: Vector Space Model
Given a query vector q = (q_1, q_2, ..., q_n) and a document vector d_j = (d_j1, d_j2, ..., d_jn), calculate the similarity as the cosine of the angle theta(q, d_j) between the two vectors:

  sim(q, d_j) = cos(theta(q, d_j))
              = (q . d_j) / (|q| |d_j|)
              = (q_1 d_j1 + q_2 d_j2 + ... + q_n d_jn)
                / ( sqrt(q_1^2 + ... + q_n^2) * sqrt(d_j1^2 + ... + d_jn^2) )

Retrieval Models: Vector Space Model
Vector representation: documents and the query are points in term space. (The slide's example plots documents D1, D2, D3 and the query along the dimensions Java, Sun, and Starbucks.)

Retrieval Models: Vector Space Model
Vector coefficients:
- The coefficients (vector elements) represent term evidence / term importance
- Each is derived from several elements:
  - Document term weight: evidence for the term in the document/query
  - Collection term weight: importance of the term from observation of the collection
  - Length normalization: reduces document-length bias
- Naming convention for coefficients q_k . d_jk = DCL.DCL: each triple lists (Document term weight, Collection term weight, Length normalization); the first triple describes the query term, the second the document term
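The cosine formula above translates directly into code: a dot product divided by the two vector norms. The example weight vectors are made-up illustrations, not from the slides.

```python
import math

# Cosine similarity between a query vector and a document vector,
# following the slide's formula. Example weights are hypothetical.

def cosine(q, d):
    dot = sum(qk * dk for qk, dk in zip(q, d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

q  = [1.0, 0.0, 1.0]   # query term weights
d1 = [2.0, 1.0, 0.0]   # document term-weight vectors
d2 = [1.0, 0.0, 3.0]

# Rank documents by similarity to the query (highest first)
ranked = sorted([("d1", cosine(q, d1)), ("d2", cosine(q, d2))],
                key=lambda s: s[1], reverse=True)
```

Because cosine depends only on the angle, a long document and a short one with the same term proportions get the same score, which is the built-in length normalization the slides refer to.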

  4. Retrieval Models: Vector Space Model
Advantages:
- Best-match method; it does not need a precise query
- Generates ranked lists; easy to explore the results
- Simplicity: easy to implement
- Effectiveness: often works well
- Flexibility: can use different term-weighting methods
- Used in a wide range of IR tasks: retrieval, classification, summarization, content-based filtering, ...

Retrieval Models: Vector Space Model
Common vector weight components:
- lnc.ltc: a widely used term weight
  - "l": log(tf) + 1
  - "n": no weight/normalization
  - "t": log(N/df)
  - "c": cosine normalization

  sim(q, d_j) = sum_k [ (log(tf_q(k)) + 1) (log(tf_dj(k)) + 1) log(N/df(k)) ]
                / ( sqrt(sum_k (log(tf_q(k)) + 1)^2)
                  * sqrt(sum_k [ (log(tf_dj(k)) + 1) log(N/df(k)) ]^2 ) )

- dnn.dtb: handles varied document lengths
  - "d": 1 + ln(1 + ln(tf))
  - "t": log(N/df)
  - "b": 1 / (0.8 + 0.2 * doclen/avg_doclen)

Retrieval Models: Vector Space Model
Disadvantages:
- Hard to choose the dimensions of the vector (the "basic concepts"); terms may not be the best choice
- Assumes independence among terms
- Vector operations are chosen heuristically: the choice of term weights and the choice of similarity function
- Assumes a query and a document can be treated in the same way

Retrieval Models: Vector Space Model
Standard vector space (summary):
- Represent queries/documents in a vector space; each dimension corresponds to a term in the vocabulary
- Use a combination of components to represent the term evidence in both the query and the document
- Use a similarity function (e.g., cosine similarity) to estimate the relationship between the query and the documents
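The lnc.ltc similarity can be sketched as below. Following the convention stated on the earlier coefficients slide (first triple = query, second = document), the query side gets "lnc" (log tf, no collection weight, cosine norm) and the document side gets "ltc" (log tf, idf = log(N/df), cosine norm); note that other texts apply the triples in the opposite order. All inputs in the example are hypothetical.

```python
import math

# Sketch of lnc.ltc scoring, assuming query = lnc and document = ltc
# per this deck's convention. N, df, and the tf dicts are made-up inputs.

def l_weight(tf):
    """'l' component: log(tf) + 1, or 0 if the term is absent."""
    return math.log(tf) + 1 if tf > 0 else 0.0

def t_weight(N, df):
    """'t' component: idf = log(N/df)."""
    return math.log(N / df)

def lnc_ltc(q_tf, d_tf, N, df):
    """q_tf, d_tf: term -> raw term frequency; df: term -> doc frequency."""
    terms = set(q_tf) | set(d_tf)
    qw = {t: l_weight(q_tf.get(t, 0)) for t in terms}               # lnc
    dw = {t: l_weight(d_tf.get(t, 0)) * t_weight(N, df[t]) for t in terms}  # ltc
    dot = sum(qw[t] * dw[t] for t in terms)
    nq = math.sqrt(sum(w * w for w in qw.values()))  # 'c' normalization
    nd = math.sqrt(sum(w * w for w in dw.values()))
    return dot / (nq * nd) if nq and nd else 0.0

N = 10                            # hypothetical collection size
df = {"a": 2, "b": 5, "c": 1}     # hypothetical document frequencies
score = lnc_ltc({"a": 1, "b": 1}, {"a": 2, "b": 1}, N, df)
```

Swapping in the dnn.dtb components (1 + ln(1 + ln(tf)) and the pivoted 0.8/0.2 length normalizer) would change only the per-term weight functions, not the overall structure.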
