

  1. CS 1655 / Spring 2013
     Secure Data Management and Web Applications
     04 – Information Retrieval
     Alexandros Labrinidis, University of Pittsburgh

     What is Information Retrieval?
     • Information organized into documents
       – Large number of documents
       – Data in documents is unstructured
     • Quest:
       – Locate documents that match the user's needs
       – How:
         • Keywords
         • Sample documents
     • Like finding a needle in a haystack
       – Or worse: a hay-colored needle!

  2. IR vs Databases
     • Database systems:
       – Structured data
       – Complex data models
       – Updates / transactions / concurrency control
     • Information retrieval:
       – Unstructured data
       – Collection of documents
       – Approximate searching / ranking of results

     How to retrieve information?
     • One way:
       – Get keywords from the user
       – Scan the entire collection of documents
       – Return the documents that match
     • Problems?
       – Will not scale to large document collections (e.g., the Web)
       – Will not rank results (e.g., too many matches for "Labrinidis")
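The scan-everything approach above can be sketched in a few lines of Python (the collection and query names here are hypothetical, for illustration only):

```python
def naive_search(documents, keywords):
    """Scan every document; return the ids of those containing all keywords.

    Cost is proportional to the total size of the collection per query,
    which is why this cannot scale to Web-sized collections; it also
    returns matches in arbitrary order, with no ranking.
    """
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(k.lower() in words for k in keywords):
            results.append(doc_id)
    return results


docs = {
    "d1": "secure data management and web applications",
    "d2": "information retrieval over unstructured data",
}
print(naive_search(docs, ["data"]))       # both documents contain "data"
print(naive_search(docs, ["retrieval"]))  # only d2 matches
```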

  3. Relevance Ranking using terms
     • Given a term t, how relevant is a document d to the term?
     • Approach #1:
       – Use the number of occurrences of t in d: n(d, t)
     • Approach #2:
       – Normalize the number of occurrences of t in d by the total number of terms n(d) in d:

         r(d, t) = log(1 + n(d, t) / n(d))

     Handling Multiple Query Terms
     • Simple way:
       – Compute independent relevance measures
       – Add them up
     • Better way:
       – Use the inverse document frequency of each term
         • n(t) = number of documents that contain term t
       – Relevance of document d to a set of terms Q:

         r(d, Q) = Σ_{t ∈ Q} r(d, t) / n(t)
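The two formulas above translate directly into code. A minimal sketch, assuming documents are represented as lists of terms (the function and variable names are my own, not from the slides):

```python
import math


def term_relevance(doc_terms, t):
    """r(d, t) = log(1 + n(d, t) / n(d)): the count of t in d,
    normalized by document length and dampened by the log."""
    n_d = len(doc_terms)          # total number of terms in d
    n_dt = doc_terms.count(t)     # occurrences of t in d
    return math.log(1 + n_dt / n_d)


def query_relevance(doc_terms, query, collection):
    """r(d, Q) = sum over t in Q of r(d, t) / n(t), where n(t) is the
    number of documents in the collection that contain t, so common
    terms contribute less to the score."""
    score = 0.0
    for t in query:
        n_t = sum(1 for d in collection if t in d)
        if n_t:                   # terms appearing nowhere contribute 0
            score += term_relevance(doc_terms, t) / n_t
    return score
```

For example, `term_relevance(["data", "web", "data"], "data")` is `log(1 + 2/3)`, and a term that appears in every document gets its contribution divided by the collection size.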

  4. Not all words created equal
     • Google query: the oranges from florida
       – http://www.google.com
     • "the" and "from" are very common and are omitted from the search
       – These are called stop words

     Other factors affecting relevance
     • Word proximity
       – A document where two query terms appear close together should rank higher than one where they are far apart
       – Example?
     • Using hyperlinks
       – E.g., site popularity, hubs, authorities (more on this next time)

  5. Scaling to large collections
     • An effective index structure is crucial
     • Documents containing a specific term are located using an inverted index
       – Each keyword maps to the list of documents that contain it
     • How to support OR/AND semantics?
       – OR: compute the union of the lists
       – AND: compute the intersection of the lists

     How to measure effectiveness
     • Approximate, incomplete results are usual
       – Especially if using an index
     • How to measure the quality of these results?
     • False negative:
       – A relevant document was not returned
     • False positive:
       – An irrelevant document was returned
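An inverted index with AND/OR query semantics, as described above, can be sketched with Python sets (the helper names are hypothetical):

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the set of document ids that contain it.

    Built once over the collection, so a query no longer needs to
    scan every document."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_query(index, terms, mode="and"):
    """AND: intersect the posting sets; OR: take their union."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    if mode == "and":
        result = postings[0]
        for s in postings[1:]:
            result = result & s
        return result
    return set().union(*postings)


docs = {
    "d1": "secure data management",
    "d2": "data retrieval",
    "d3": "web applications",
}
index = build_inverted_index(docs)
print(boolean_query(index, ["data", "retrieval"], "and"))  # intersection
print(boolean_query(index, ["data", "web"], "or"))         # union
```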

  6. Effectiveness metrics
     • Precision
       – What percentage of the retrieved documents are relevant to the query?
     • Recall
       – What percentage of the documents relevant to the query have been retrieved?

     How to improve effectiveness
     • Better ranking
     • Better indexing
     • Classification of documents
       – Instead of a "global" pool, focus on a smaller set of related documents
     • Feedback from users
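Both metrics compare the retrieved set against the (known) relevant set; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
       recall    = |retrieved ∩ relevant| / |relevant|

    A false positive lowers precision; a false negative lowers recall.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# d3 and d4 are false positives; d5 is a false negative.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)  # precision 0.5, recall 2/3
```

Note the tension: returning more documents can only raise recall, but usually lowers precision.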

  7. Focused Crawling
     • google.com
       – Search for "database management systems"
     • google.com
       – Search for "database management systems" +site:pitt.edu
     • scholar.google.com
       – Search for "database management systems"
