Srihari: CSE 626 1
Retrieval by Content, Part 2: Text Retrieval
Term Frequency and Inverse Document Frequency
Text Retrieval
- Retrieval of text-based information is referred to as Information Retrieval (IR)
- Used by text search engines over the internet
- Text is composed of two fundamental units: documents and terms
- Document: journal paper, book, e-mail message, source code, web page
- Term: word, word-pair, or phrase within a document
Representation of Text
- Should retain as much of the semantic content of the data as possible
- Should allow efficient computation of distance measures between queries and documents
- Natural language processing is difficult, e.g.,
– Polysemy (the same word with different meanings)
– Synonymy (several different ways to describe the same thing)
- IR systems in use today do not rely on NLP techniques
– Instead they rely on vectors of term occurrences
Vector Space Representation
Document-Term Matrix
Terms: t1 database, t2 SQL, t3 index, t4 regression, t5 likelihood, t6 linear

       t1   t2   t3   t4   t5   t6
D1     24   21    9    0    0    3
D2     32   10    5    0    3    0
D3     12   16    5    0    0    0
D4      6    7    2    0    0    0
D5     43   31   20    0    3    0
D6      2    0    0   18    7   16
D7      0    0    1   32   12    0
D8      3    0    0   22    4    2
D9      1    0    0   34   27   25
D10     6    0    0   17    4   23

Documents D1–D5 use mainly the "database"-related terms; D6–D10 use mainly the "regression"-related terms.
dij is the number of times that term tj appears in document Di.
Cosine Distance between Document Vectors
(Document-term matrix as on the previous slide)
d_c(D_i, D_j) = ( sum_{k=1..T} d_ik * d_jk ) / sqrt( ( sum_{k=1..T} d_ik^2 ) * ( sum_{k=1..T} d_jk^2 ) )
- Cosine of the angle between the two document vectors
- Equivalent to their inner product after each has been normalized to unit length; higher values indicate more similar vectors
- Reflects similarity in terms of the relative distributions of the vector components; cosine is not influenced by one document being small compared to the other (as Euclidean distance is)
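As a small sketch (in Python, with illustrative variable names), the cosine measure above can be computed directly from two term-frequency vectors, here rows D1 and D2 of the example matrix:

```python
import math

def cosine_similarity(di, dj):
    """Cosine of the angle between two term-weight vectors.

    Equivalent to the inner product after each vector is normalized
    to unit length; values closer to 1 mean more similar documents.
    """
    dot = sum(a * b for a, b in zip(di, dj))
    norm_i = math.sqrt(sum(a * a for a in di))
    norm_j = math.sqrt(sum(b * b for b in dj))
    return dot / (norm_i * norm_j)

# Rows D1 and D2 of the example document-term matrix
D1 = [24, 21, 9, 0, 0, 3]
D2 = [32, 10, 5, 0, 3, 0]
print(cosine_similarity(D1, D2))  # high similarity: both are "database" documents
```

Note that the value is unchanged if either vector is scaled by a constant, which is why cosine ignores overall document length.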
Euclidean vs Cosine Distance
[Figure: pairwise document-distance matrices plotted as pixel images, with document number on each axis, for the document-term matrix shown earlier. Euclidean panel: white = 0, black = maximum distance. Cosine panel: white = larger cosine (i.e., smaller angle).]

Both plots show two clusters of light sub-blocks: the "database" documents (D1–D5) and the "regression" documents (D6–D10).
Euclidean: D3 and D4 appear closer to D6–D9 than to D5, because D3, D4, and D6–D9 all lie closer to the origin than D5 does. Cosine instead emphasizes the relative contributions of individual terms.
Properties of Document-Term Matrix
- Each vector Di is a surrogate for the original document
- The entire document-term matrix (N x T) is sparse, with only 0.03% of cells being non-zero in the TREC collection
- Each document is a vector in term space
- Due to sparsity, the original text documents are represented by an inverted file structure (rather than by the matrix directly)
– Each term tj points to a list of numbers describing the term's occurrences in each document
- Generating the document-term matrix is non-trivial
– Are plural and singular forms counted as the same term?
– Are very common words used as terms?
Vector Space Representation of Queries
- Queries
– Expressed using the same term-based representation as documents
– A query is a document with very few terms
- Vector space representation of queries
– database = (1,0,0,0,0,0)
– SQL = (0,1,0,0,0,0)
– regression = (0,0,0,1,0,0)

Terms: t1 database, t2 SQL, t3 index, t4 regression, t5 likelihood, t6 linear
Query match against database using cosine distance
(Document-term matrix with terms database, SQL, index, regression, likelihood, linear, as shown earlier)
- database = (1,0,0,0,0,0): closest match is D2
- SQL = (0,1,0,0,0,0): closest match is D3
- regression = (0,0,0,1,0,0): closest match is D9

Using cosine distance, D2, D3, and D9 are ranked as the closest matches to their respective queries.
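A sketch of the matching step: score every document row against a one-term query vector and report the best match, reproducing the "database" and "SQL" examples above. (Zeros fill the cells the flattened slide table omitted.)

```python
import math

# Example document-term matrix, one row per document
matrix = {
    "D1": [24, 21, 9, 0, 0, 3],
    "D2": [32, 10, 5, 0, 3, 0],
    "D3": [12, 16, 5, 0, 0, 0],
    "D4": [6, 7, 2, 0, 0, 0],
    "D5": [43, 31, 20, 0, 3, 0],
    "D6": [2, 0, 0, 18, 7, 16],
    "D7": [0, 0, 1, 32, 12, 0],
    "D8": [3, 0, 0, 22, 4, 2],
    "D9": [1, 0, 0, 34, 27, 25],
    "D10": [6, 0, 0, 17, 4, 23],
}

def cosine(q, d):
    """Cosine measure between a query vector and a document vector."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0

database = [1, 0, 0, 0, 0, 0]
sql = [0, 1, 0, 0, 0, 0]
print(max(matrix, key=lambda doc: cosine(database, matrix[doc])))  # D2
print(max(matrix, key=lambda doc: cosine(sql, matrix[doc])))       # D3
```

For a one-term query, the cosine score reduces to that term's count divided by the document's vector length, so documents dominated by the query term win.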
Weights in Vector Space Model
- dik is the weight of the kth term in document i
- Many different choices for the weights appear in the IR literature
– Boolean approach: set the weight to 1 if the term occurs and 0 if it does not
- Favors larger documents, since a larger document is more likely to include a query term somewhere
– TF-IDF weighting scheme is popular
- TF (term frequency) is the same as seen earlier
- IDF (inverse document frequency) favors terms that occur in relatively few documents
(Document-term matrix as shown earlier)
Inverse Document Frequency of a Term
- Definition:

IDF(t_j) = log(N / n_j)

where N = total number of documents and n_j = number of documents containing term j; (n_j / N) is the fraction of documents that contain term j

- IDF favors terms that occur in relatively few documents
- Example of IDF:
IDF weights of the six terms (using natural logs): 0.105, 0.693, 0.511, 0.693, 0.357, 0.693. The term "database" occurs in many documents and is given the least weight; "regression" occurs in the fewest documents and is given the highest weight.
(Document-term matrix as shown earlier)
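The IDF weights above can be recomputed from the example matrix: n_j is simply the number of documents in which term j occurs at least once.

```python
import math

# Example document-term matrix (rows D1..D10, columns t1..t6)
matrix = [
    [24, 21, 9, 0, 0, 3],
    [32, 10, 5, 0, 3, 0],
    [12, 16, 5, 0, 0, 0],
    [6, 7, 2, 0, 0, 0],
    [43, 31, 20, 0, 3, 0],
    [2, 0, 0, 18, 7, 16],
    [0, 0, 1, 32, 12, 0],
    [3, 0, 0, 22, 4, 2],
    [1, 0, 0, 34, 27, 25],
    [6, 0, 0, 17, 4, 23],
]
N = len(matrix)  # 10 documents

# IDF(t_j) = ln(N / n_j), n_j = number of documents containing term j
idf = [math.log(N / sum(1 for row in matrix if row[j] > 0)) for j in range(6)]
print([round(w, 3) for w in idf])  # [0.105, 0.693, 0.511, 0.693, 0.357, 0.693]
```

Here n_j = (9, 5, 6, 5, 7, 5), so "database" (in 9 of 10 documents) gets the smallest weight, ln(10/9) ≈ 0.105.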
TF-IDF Weighting of Terms
- TF-IDF weighting scheme
– TF = term frequency, denoted TF(d,t)
– IDF = inverse document frequency, IDF(t)
– The TF-IDF weight is the product of TF and IDF for a particular term in a particular document: TF(d,t) * IDF(t)
– An example is given next
TF(d,t) document matrix: as shown earlier.

TF-IDF document matrix, computed with IDF(t) weights (using natural logs) 0.105, 0.693, 0.511, 0.693, 0.357, 0.693:

       t1      t2      t3      t4      t5      t6
D1    2.53   14.56    4.60    0       0       2.07
D2    3.37    6.93    2.55    0       1.07    0
D3    1.26   11.09    2.55    0       0       0
D4    0.63    4.85    1.02    0       0       0
D5    4.53   21.48   10.21    0       1.07    0
D6    0.21    0       0      12.47    2.50   11.09
D7    0       0       0.51   22.18    4.28    0
D8    0.31    0       0      15.24    1.42    1.38
D9    0.10    0       0      23.56    9.63   17.33
D10   0.63    0       0      11.78    1.42   15.94
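A quick sketch of the weighting for one row: multiplying D1's term frequencies by the IDF weights reproduces the first row of the TF-IDF matrix.

```python
import math

# IDF weights from document counts n_j = (9, 5, 6, 5, 7, 5)
idf = [math.log(10 / n) for n in (9, 5, 6, 5, 7, 5)]

# TF-IDF weight = TF(d,t) * IDF(t), here for document D1
tf_d1 = [24, 21, 9, 0, 0, 3]
tfidf_d1 = [round(tf * w, 2) for tf, w in zip(tf_d1, idf)]
print(tfidf_d1)  # [2.53, 14.56, 4.6, 0.0, 0.0, 2.08]
```

(The slide's table shows 2.07 for the last entry; the small difference is only rounding of 3 * ln 2 ≈ 2.079.)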
Classic Approach to Matching Queries to Documents
- Represent queries as term vectors
– 1s for terms occurring in the query and 0s everywhere else
- Represent documents as term vectors, using TF-IDF for the vector components
- Use the cosine distance measure to rank the documents by their distance to the query
– Disadvantage: shorter documents have a better match with the query terms
Document Retrieval with TF and TF-IDF
Query contains both "database" and "index": Q = (1,0,1,0,0,0)

(TF document-term matrix and TF-IDF document matrix as shown earlier)

Document   TF distance   TF-IDF distance
D1         0.70          0.32
D2         0.77          0.51
D3         0.58          0.24
D4         0.60          0.23
D5         0.79          0.43
D6         0.14          0.02
D7         0.06          0.01
D8         0.02          0.02
D9         0.09          0.01
D10        0.01
TF-IDF chooses D2 while TF chooses D5, and it is unclear why D5 would be the better choice. Cosine distance favors shorter documents (a disadvantage). The maximum value in each column marks the best match for query (1,0,1,0,0,0); the cosine measure is higher when the match is better.
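An end-to-end sketch of the comparison: rank the example documents for the query Q = (1,0,1,0,0,0) under plain TF weights and under TF-IDF weights, reproducing the D5-versus-D2 outcome described above.

```python
import math

# TF document-term matrix (zeros fill cells the flattened table omitted)
tf = {
    "D1": [24, 21, 9, 0, 0, 3],
    "D2": [32, 10, 5, 0, 3, 0],
    "D3": [12, 16, 5, 0, 0, 0],
    "D4": [6, 7, 2, 0, 0, 0],
    "D5": [43, 31, 20, 0, 3, 0],
    "D6": [2, 0, 0, 18, 7, 16],
    "D7": [0, 0, 1, 32, 12, 0],
    "D8": [3, 0, 0, 22, 4, 2],
    "D9": [1, 0, 0, 34, 27, 25],
    "D10": [6, 0, 0, 17, 4, 23],
}
idf = [math.log(10 / n) for n in (9, 5, 6, 5, 7, 5)]
query = [1, 0, 1, 0, 0, 0]  # "database" and "index"

def cosine(q, d):
    """Cosine measure between query and document vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    return dot / (math.sqrt(sum(a * a for a in q)) *
                  math.sqrt(sum(b * b for b in d)))

# Weight each document row by IDF to get the TF-IDF matrix
tfidf = {k: [v * w for v, w in zip(row, idf)] for k, row in tf.items()}

best_tf = max(tf, key=lambda k: cosine(query, tf[k]))
best_tfidf = max(tfidf, key=lambda k: cosine(query, tfidf[k]))
print(best_tf, best_tfidf)  # D5 D2
```

TF alone rewards D5's sheer count of "database" occurrences; down-weighting the near-ubiquitous "database" term shifts the win to D2.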
Comments on TF-IDF Method
- A TF-IDF-based IR system
– first builds an inverted index with TF and IDF information
– given a query vector, lists some number of document vectors that are most similar to the query
- TF-IDF is superior in precision-recall compared to other weighting schemes
- It is the default baseline method for comparing retrieval performance