Srihari: CSE 626

Retrieval by Content
Database Retrieval
- In a Database Context
– Query is well-defined
– Operation returns a set of records (or entities) that exactly match the required specifications
– Example query:
- [level = MANAGER] AND [age < 30]
– Returns list of young employees with significant responsibility
[Figure: a data cube with dimensions level (Director, Manager, Staff), location (JFK, BUF, SFO, LAX), and department (Dept A – Dept D). A slice operation drills down to the records for each department and location, where the age field can be looked up; roll-up (e.g., by East Coast) is another operation.]
Retrieval by Content
- More general, less precise queries than Database Retrieval
- Example of Medical Context:
– Query is a patient record containing
- Demographic information (age, sex,..)
- Test results (blood tests, physical tests, biomedical time series, X-rays)
– Search database for similar cases in hospital database
- To determine diagnoses, treatments, outcomes
- Exact match is not relevant since it is unlikely that any other patient matches exactly
- Need to determine similarity among patients based on different data types (multivariate, time series, image data)
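One way to handle similarity across mixed data types is to compute a per-field similarity and combine the fields with weights. The sketch below is illustrative only; the field names (`age`, `sex`, `ecg`), the per-field similarity functions, and the weighting scheme are all assumptions, not part of any specific medical retrieval system.

```python
import math

def patient_similarity(p, q, weights):
    """Combine per-field similarities over mixed data types (illustrative sketch).

    p, q    : dicts with 'age' (number), 'sex' (category),
              'ecg' (time series as a list of floats, assumed aligned)
    weights : dict mapping each field name to its weight
    """
    # Numeric field: similarity decays with absolute difference
    sim_age = 1.0 / (1.0 + abs(p["age"] - q["age"]))
    # Categorical field: exact match or not
    sim_sex = 1.0 if p["sex"] == q["sex"] else 0.0
    # Time-series field: Euclidean distance over aligned samples
    d_ecg = math.sqrt(sum((a - b) ** 2 for a, b in zip(p["ecg"], q["ecg"])))
    sim_ecg = 1.0 / (1.0 + d_ecg)
    total_w = weights["age"] + weights["sex"] + weights["ecg"]
    return (weights["age"] * sim_age
            + weights["sex"] * sim_sex
            + weights["ecg"] * sim_ecg) / total_w
```

In practice each modality (image, time series, multivariate record) would use a domain-appropriate similarity measure; the weighted combination is one simple way to merge them into a single score.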
Retrieval Task
- Find the k objects in the database that are most similar to either a specific query or a specific object
- Examples:
– Searching historical records of the Dow Jones index for past occurrences of a particular time-series pattern
– Searching a database of satellite images for evidence of volcanic eruptions
– Searching the internet for reviews of restaurants in Buffalo
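The top-k retrieval task above can be sketched directly: rank the database objects by distance to the query and keep the k closest. This minimal version assumes objects are fixed-length numeric vectors and uses Euclidean distance; real systems would substitute a domain-specific similarity measure.

```python
import heapq
import math

def euclidean(x, y):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def retrieve_top_k(database, query, k, distance=euclidean):
    """Return the k objects in `database` closest to `query`."""
    return heapq.nsmallest(k, database, key=lambda obj: distance(obj, query))

# Example: retrieve the 2 vectors nearest the origin
db = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(retrieve_top_k(db, (0.0, 0.0), k=2))  # the two points nearest the query
```

`heapq.nsmallest` avoids sorting the whole database when k is small; for large collections an index structure would replace this linear scan.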
Retrieval by Content is Interactive Data Mining
- User is directly involved in exploring the data set by
– Specifying a query
– Interpreting the results of the matching process
- Role of human judgement is not prominent in predictive
and descriptive forms of data mining
- If the database is pre-indexed by content, the task reduces to standard database indexing
- Instead we have a query pattern Q
– Goal is to infer which other objects are most similar to Q
– In text retrieval, Q is a short list of query words matched against large sets of documents
Retrieval by Content depends on notion of Similarity
- Either similarity or distance is used
- Maximize similarity or minimize distance
- Common to reduce measurements to a standard fixed-length vector and use geometric measures (Euclidean, weighted Euclidean, Manhattan, etc.)
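The three geometric measures named above are straightforward to write down for fixed-length vectors; this is a minimal sketch (the function names and the weight-vector convention are my own choices):

```python
import math

def euclidean(x, y):
    """Standard Euclidean (L2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_euclidean(x, y, w):
    """Euclidean distance with a per-dimension weight vector w."""
    return math.sqrt(sum(wi * (a - b) ** 2 for wi, a, b in zip(w, x, y)))

def manhattan(x, y):
    """Manhattan (L1, city-block) distance."""
    return sum(abs(a - b) for a, b in zip(x, y))
```

With all weights equal to 1, weighted Euclidean reduces to plain Euclidean; in practice the weights are often set from the data (e.g., inverse variances of each dimension).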
Retrieval Performance
- In classification and regression
– There is an objective measure of accuracy of the model on unseen test data
– Comparison of different algorithms and models is straightforward
- In retrieval
– Performance is subjective: relative to a query
– Ultimate measure is usefulness to the user
– Performance evaluation is difficult
– Objects in the data set need to be labelled as relevant to the query
Evaluation of a Retrieval Algorithm
- In response to a specific query Q
- Independent test data set
– Test data has not been tuned to given query Q
- Objects of the test data set have been pre-classified (truthed) as relevant or irrelevant to query Q
– Algorithm is not aware of the class labels
– Who determines whether an object is relevant?
Confusion matrix for query Q on the test set:

                            Truth: Relevant    Truth: Not Relevant
  Algorithm: Relevant             TP                  FP
  Algorithm: Not Relevant         FN                  TN
Precision and Recall Definitions
Obtained from the confusion matrix, over the objects returned for query Q:

[Figure: Venn diagram over the database showing relevant and irrelevant objects, the set returned for query Q, and the resulting TP, FP, FN, TN regions.]

Precision = TP / (TP + FP) × 100%
Recall = TP / (TP + FN) × 100%
Observations about Precision and Recall
- 1. Numerator is the same for precision and recall: the number of correctly returned objects (TP)
- 2. Denominator for precision is everything that is returned (TP + FP)
- 3. Denominator for recall is everything that is relevant (TP + FN)
Recall = TP / (TP + FN) × 100%; Recall = 1 means "the whole truth"
Precision = TP / (TP + FP) × 100%; Precision = 1 means "nothing but the truth"
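The two definitions translate directly into code. A small sketch with hypothetical counts (the example numbers are mine, chosen only to illustrate the formulas):

```python
def precision(tp, fp):
    """Fraction of returned objects that are relevant: TP / (TP + FP)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of relevant objects that are returned: TP / (TP + FN)."""
    return tp / (tp + fn)

# Example: 8 relevant objects returned, 2 irrelevant returned,
# and 4 relevant objects missed
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8
print(recall(tp, fn))     # 8/12 ≈ 0.667
```

Note that both share the numerator TP, matching observation 1 above.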
Precision versus Recall
- Assume that the results of retrieval have been pre-classified as relevant or irrelevant w.r.t. query Q
- If the algorithm uses a distance measure to rank objects, then a threshold T is used
– The K_T objects closer than threshold T to the query object Q are returned
- If we run the retrieval algorithm with a set of values of T, we get different (recall, precision) pairs, giving a recall-precision characterization
– Relative to query Q, a particular data set, and a labeling of the data
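The threshold sweep above can be sketched as follows. This is a simple linear-scan version under my own conventions: objects are given as (id, distance-to-query) pairs, relevance labels come from a pre-classified set, and precision at an empty retrieved set is defined as 1.0 (a common but not universal convention).

```python
def precision_recall_at_thresholds(distances, relevant, thresholds):
    """For each threshold T, count objects with distance < T as retrieved
    and return the resulting (recall, precision) pairs.

    distances  : list of (object_id, distance_to_query) pairs
    relevant   : set of object_ids pre-classified as relevant to the query
    thresholds : iterable of threshold values T to sweep
    """
    pairs = []
    total_relevant = len(relevant)
    for t in thresholds:
        retrieved = {oid for oid, d in distances if d < t}
        tp = len(retrieved & relevant)
        prec = tp / len(retrieved) if retrieved else 1.0
        rec = tp / total_relevant if total_relevant else 0.0
        pairs.append((rec, prec))
    return pairs
```

Sweeping T from small to large traces out the recall-precision characterization for that query, data set, and labeling.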
Precision-Recall Relationship
Typically an inverse relationship: as FP is decreased (to increase precision), TP also decreases and FN increases (decreasing recall)
Precision-Recall are evaluated w.r.t. a set of queries
[Figure: recall-precision curve (Precision vs Recall), derived from the Venn diagram of TP, FP, FN, TN over the database.]

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
How is Precision-Recall related to ROC?
- Receiver Operating Characteristics (ROCs) are used to characterize the performance of binary classifiers with variable thresholds
[Figure: ROC curve plotting True Positives (TP) against False Positives (FP) as the threshold T varies, with the relevant/irrelevant score distributions and the TP, FP, FN, TN regions marked.]
Relationship between Precision-Recall and ROC
- Receiver Operating Characteristics (ROCs) are used to characterize the performance of binary classifiers with variable thresholds
[Figure: ROC curve (True Positive vs False Positive as threshold T varies) shown alongside the corresponding recall-precision plot.]

- As FP increases, TP also increases (but at a slower rate); thus Precision = TP / (TP + FP) decreases
- As TP increases, FN decreases; therefore Recall = TP / (TP + FN) increases
- Thus the ROC curve is the inverse of the recall-precision plot
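The relationship can be made concrete by computing both characterizations from the same confusion-matrix counts at a given threshold. A minimal sketch (the example counts are hypothetical):

```python
def roc_point(tp, fp, fn, tn):
    """ROC coordinates at one threshold: (false-positive rate, true-positive rate).

    Note the true-positive rate TP / (TP + FN) is exactly recall.
    """
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return fpr, tpr

def pr_point(tp, fp, fn):
    """Precision-recall coordinates at the same threshold."""
    return tp / (tp + fp), tp / (tp + fn)
```

The key difference is that the ROC normalizes FP by the irrelevant objects (FP + TN), while precision normalizes TP by the returned objects (TP + FP); recall and the true-positive rate are the same quantity.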
Combined Measure of Retrieval
- Harmonic mean of precision and recall

F = 2 / (1/P + 1/R)

- Or equivalently

F = 2PR / (P + R)

- If you travel at 20 mph one way and 40 mph the other way, the average speed is given by the harmonic mean: ≈ 26.7 mph
- Harmonic mean is appropriate when the average of a rate is desired
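Both the F measure and the travel example are the same harmonic-mean computation; a quick sketch:

```python
def f_measure(p, r):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# F for precision 0.8 and recall 2/3
print(f_measure(0.8, 2 / 3))

# Harmonic mean of two speeds: 20 mph out, 40 mph back
print(2 * 20 * 40 / (20 + 40))  # ≈ 26.7 mph average speed
```

The harmonic mean is dominated by the smaller of the two values, so a high F requires both precision and recall to be high.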
Precision-Recall of several algorithms
Precision-Recall are evaluated w.r.t. the same data set and a set of queries. Cannot distinguish between two algorithms except at, say:
- 1. Precision = recall
- 2. Precision when a certain number of objects are retrieved
- 3. Average precision over multiple recall levels
Precision-Recall Properties
- Should average over large corpus/query ensembles
- Need human assessments
– People aren’t reliable assessors
- Assessments have to be binary