

  1. CS 1655 / Spring 2013
     Secure Data Management and Web Applications
     04 – Information Retrieval
     Alexandros Labrinidis, University of Pittsburgh

     What is Information Retrieval?
     • Information organized into documents
       – Large number of documents
       – Data in documents is unstructured
     • Quest:
       – Locate documents that match the user's needs
       – How:
         • Keywords
         • Sample documents
     • Like finding a needle in a haystack
       – Or worse: a hay-colored needle!

  2. IR vs Databases
     • Database systems:
       – Structured data
       – Complex data models
       – Updates / transactions / concurrency control
     • Information retrieval:
       – Unstructured data
       – Collection of documents
       – Approximate searching / ranking of results

     How to retrieve information?
     • One way:
       – Get keywords from the user
       – Scan the entire collection of documents
       – Return the documents that match
     • Problems?
       – Will not scale to large document collections (e.g., the Web)
       – Will not rank results (e.g., too many matches for "Labrinidis")
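The scan-everything approach above can be sketched in a few lines of Python (the collection and query names here are hypothetical, for illustration only):

```python
def naive_search(documents, keywords):
    """Scan every document; return the ids of those containing all keywords.

    Cost is proportional to the total size of the collection per query,
    which is why this cannot scale to Web-sized collections; it also
    returns matches in arbitrary order, with no ranking.
    """
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(k.lower() in words for k in keywords):
            results.append(doc_id)
    return results


docs = {
    "d1": "secure data management and web applications",
    "d2": "information retrieval over unstructured data",
}
print(naive_search(docs, ["data"]))       # both documents contain "data"
print(naive_search(docs, ["retrieval"]))  # only d2 matches
```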

  3. Relevance Ranking using terms
     • Given a term t, how relevant is a document d to the term?
     • Approach #1:
       – Use the number of occurrences of t in d: n(d, t)
     • Approach #2:
       – Normalize the number of occurrences of t in d by the total number of terms n(d) in d:

         r(d, t) = log(1 + n(d, t) / n(d))

     Handling Multiple Query Terms
     • Simple way:
       – Compute independent relevance measures
       – Add them up
     • Better way:
       – Use the inverse document frequency of each term
         • n(t) = number of documents that contain term t
       – Relevance of document d to a set of terms Q:

         r(d, Q) = Σ_{t ∈ Q} r(d, t) / n(t)
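The two formulas above translate directly into code. A minimal sketch, assuming documents are represented as lists of terms (the function and variable names are my own, not from the slides):

```python
import math


def term_relevance(doc_terms, t):
    """r(d, t) = log(1 + n(d, t) / n(d)): the count of t in d,
    normalized by document length and dampened by the log."""
    n_d = len(doc_terms)          # total number of terms in d
    n_dt = doc_terms.count(t)     # occurrences of t in d
    return math.log(1 + n_dt / n_d)


def query_relevance(doc_terms, query, collection):
    """r(d, Q) = sum over t in Q of r(d, t) / n(t), where n(t) is the
    number of documents in the collection that contain t, so common
    terms contribute less to the score."""
    score = 0.0
    for t in query:
        n_t = sum(1 for d in collection if t in d)
        if n_t:                   # terms appearing nowhere contribute 0
            score += term_relevance(doc_terms, t) / n_t
    return score
```

For example, `term_relevance(["data", "web", "data"], "data")` is `log(1 + 2/3)`, and a term that appears in every document gets its contribution divided by the collection size.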

  4. Not all words created equal
     • Google query: the oranges from florida
       – http://www.google.com
     • "the" and "from" are very common and are omitted from the search
       – These are called stop words

     Other factors affecting relevance
     • Word proximity
       – A document where two query terms appear close together should rank higher than one where they are far apart
       – Example?
     • Using hyperlinks
       – E.g., site popularity, hubs, authorities (more on this next time)

  5. Scaling to large collections
     • An effective index structure is crucial
     • Documents containing a specific term are located using an inverted index
       – Each keyword maps to the list of documents that contain it
     • How to support OR/AND semantics?
       – OR: compute the union of the lists
       – AND: compute the intersection of the lists

     How to measure effectiveness
     • Approximate, incomplete results are usual
       – Especially if using an index
     • How to measure the quality of these results?
     • False negative:
       – A relevant document was not returned
     • False positive:
       – An irrelevant document was returned
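An inverted index with AND/OR query semantics, as described above, can be sketched with Python sets (the helper names are hypothetical):

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword to the set of document ids that contain it.

    Built once over the collection, so a query no longer needs to
    scan every document."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_query(index, terms, mode="and"):
    """AND: intersect the posting sets; OR: take their union."""
    postings = [index.get(t.lower(), set()) for t in terms]
    if not postings:
        return set()
    if mode == "and":
        result = postings[0]
        for s in postings[1:]:
            result = result & s
        return result
    return set().union(*postings)


docs = {
    "d1": "secure data management",
    "d2": "data retrieval",
    "d3": "web applications",
}
index = build_inverted_index(docs)
print(boolean_query(index, ["data", "retrieval"], "and"))  # intersection
print(boolean_query(index, ["data", "web"], "or"))         # union
```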

  6. Effectiveness metrics
     • Precision
       – What percentage of the retrieved documents are relevant to the query?
     • Recall
       – What percentage of the documents relevant to the query have been retrieved?

     How to improve effectiveness
     • Better ranking
     • Better indexing
     • Classification of documents
       – Instead of a "global" pool, focus on a smaller set of related documents
     • Feedback from users
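Both metrics compare the retrieved set against the (known) relevant set; a minimal sketch:

```python
def precision_recall(retrieved, relevant):
    """precision = |retrieved ∩ relevant| / |retrieved|
       recall    = |retrieved ∩ relevant| / |relevant|

    A false positive lowers precision; a false negative lowers recall.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# d3 and d4 are false positives; d5 is a false negative.
p, r = precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
print(p, r)  # precision 0.5, recall 2/3
```

Note the tension: returning more documents can only raise recall, but usually lowers precision.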

  7. Focused Crawling
     • google.com
       – Search for "database management systems"
     • google.com
       – Search for "database management systems" +site:pitt.edu
     • scholar.google.com
       – Search for "database management systems"
