Course overview and introduction, CE-324: Modern Information Retrieval - PowerPoint PPT Presentation



SLIDE 1

Course overview and introduction

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2018

Some slides have been adapted from: Profs. Manning, Nayak & Raghavan lectures (CS-276, Stanford)

SLIDE 2

Course info

• Instructor: Mahdieh Soleymani
• Email: soleymani@sharif.edu
• Website: http://ce.sharif.edu/cources/97-98/1/ce324-1

SLIDE 3

Text books

• Main:
  • Introduction to Information Retrieval, C.D. Manning, P. Raghavan and H. Schuetze, Cambridge University Press, 2008.
    • Free online version is available at: http://informationretrieval.org/
• Recommended:
  • Modern Information Retrieval, R. Baeza-Yates and B. Ribeiro-Neto, Addison Wesley, Second Edition, 2011.
  • Managing Gigabytes: Compressing and Indexing Documents and Images, I.H. Witten, A. Moffat, and T.C. Bell, Second Edition, Morgan Kaufmann Publishing, 1999.
  • Information Retrieval: Implementing and Evaluating Search Engines, S. Büttcher, C.L.A. Clarke and G.V. Cormack, MIT Press, 2010.

SLIDE 4

Marking scheme

• Midterm Exam: 25%
• Final Exam: 35%
• Project (multiple phases): 25%
• Mini-exams: 10%
• Quizzes: 5%

SLIDE 5

Typical IR system

• Given: corpus & user query
• Find: a ranked set of docs relevant to the query.
• Corpus: a collection of documents

[Diagram: a query goes into the IR System, which searches the document corpus and returns a list of ranked documents]

SLIDE 6

Information Retrieval (IR)

• Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections [IIR Book].
• Retrieving relevant documents to a query (while retrieving as few non-relevant documents as possible)
  • especially from large sets of documents, efficiently.

SLIDE 7

Information Retrieval (IR)

• These days we frequently think first of web search, but there are many other cases:
  • E-mail search
  • Searching your laptop
  • Corporate knowledge bases
  • Legal information retrieval
SLIDE 8

Basic Definitions

• Document: the unit over which we decide to build a retrieval system
  • textual document: a sequence of words, punctuation, etc. that expresses ideas about some topic in a natural language.
• Corpus or collection: a set of documents
• Information need: the information required by the user about some topic
• Query: a formulation of the information need

SLIDE 9

Heuristic nature of IR

• Problem: semantic gap between query and docs
  • A doc is relevant if the user perceives that it contains the needed information
  • How to extract information from docs, and how to use it to decide relevance?
• Solution: the IR system must interpret and rank docs according to their degree of relevance to the user's query.
• "The notion of relevance is at the center of IR."

SLIDE 10

Minimize search overhead

• Search overhead: time spent in all steps leading to the reading of items containing the needed information
  • Steps: query generation, query execution, scanning results, reading non-relevant items, etc.
• The amount of online data has grown at least as quickly as the speed of computers.

SLIDE 11

Condensing the data (indexing)

• Indexing the corpus speeds up the searching task
  • Using the index instead of linearly scanning the docs, which is computationally expensive for large collections
• Indexing depends on the query language and IR model
• Term (index unit): a word, phrase, or other group of symbols used for retrieval
• Index terms are useful for remembering the document themes
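The indexing idea above can be sketched in a few lines: a minimal inverted index mapping each term to the set of IDs of documents containing it, so that a conjunctive query becomes set intersection rather than a linear scan. The toy corpus and whitespace tokenization are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Conjunctive query: intersect the posting sets of the query terms."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

# Hypothetical toy corpus
docs = {
    1: "modern information retrieval",
    2: "retrieval of web documents",
    3: "modern web search",
}
index = build_inverted_index(docs)
print(search(index, "modern retrieval"))  # → {1}
```

A real index would store sorted postings lists with skip pointers and positional information, but the term-to-postings mapping is the core idea.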

SLIDE 12

Typical IR system architecture

[Diagram: the user's need becomes a text query; Text Operations process the corpus text and the query; Query Operations refine the query; Indexing builds an Index from the Corpus; Searching retrieves docs from the index; Ranking orders the retrieved docs; the User Interface presents ranked docs and collects user feedback]

SLIDE 13

IR system components

• Text Operations form index terms
  • Tokenization, stop word removal, stemming, …
• Indexing constructs an index for a corpus of docs.
• Query Operations transform the query to improve retrieval:
  • Query expansion using a thesaurus, or query transformation using relevance feedback
• Searching retrieves docs that are related to the query.
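The Text Operations step can be sketched as a small pipeline: tokenization, stop-word removal, and stemming. The stop list below is a tiny illustrative sample and the suffix-stripping "stemmer" is deliberately crude; a real system would use something like the Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in"}  # tiny sample list

def tokenize(text):
    """Lowercase and split on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def stem(term):
    """Naive suffix stripping; a stand-in for a real stemmer (e.g., Porter)."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def index_terms(text):
    """Tokenize, drop stop words, and stem: the surviving index terms."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]

print(index_terms("Indexing the documents of a corpus"))
# → ['index', 'document', 'corpu']
```

Note how the crude stemmer mangles "corpus" into "corpu": stemming trades linguistic accuracy for matching morphological variants, which is why production stemmers are far more careful.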

SLIDE 14

IR system components (continued)

• Ranking scores retrieved documents according to their relevance.
• User Interface manages interaction with the user:
  • Query input and visualization of results
  • Relevance feedback

SLIDE 15

Structured vs. unstructured docs

• Unstructured text (free text): a continuous sequence of tokens
• Structured text (fielded text): text is broken into fields that are distinguished by tags or other markup
• Semi-structured text
  • e.g., a web page

SLIDE 16

Databases vs. IR: Structured vs. unstructured data

• Structured data tends to refer to information in "tables"
• Typically allows numerical range and exact match (for text) queries, e.g., GPA < 16 AND Supervisor = Chang.

Student Name | Student ID | Supervisor Name | GPA
Smith        | 20116671   | Joes            | 12
Joes         | 20114190   | Chang           | 14.1
Lee          | 20095900   | Chang           | 19
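The structured query in the example maps directly onto an exact-match plus range filter. A sketch over the table above, with the rows held as Python dicts:

```python
# Rows from the table on this slide
students = [
    {"name": "Smith", "id": 20116671, "supervisor": "Joes",  "gpa": 12.0},
    {"name": "Joes",  "id": 20114190, "supervisor": "Chang", "gpa": 14.1},
    {"name": "Lee",   "id": 20095900, "supervisor": "Chang", "gpa": 19.0},
]

# GPA < 16 AND Supervisor = Chang: a numerical range plus an exact text match
result = [s["name"] for s in students
          if s["gpa"] < 16 and s["supervisor"] == "Chang"]
print(result)  # → ['Joes']
```

A structured query like this has exact semantics: a row either satisfies the predicate or it does not, which is precisely the contrast with the loose, ranked matching of IR.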

SLIDE 17

Semi-structured data

SLIDE 18

Semi-structured data

• In fact almost no data is "unstructured"
  • E.g., this slide has distinctly identified zones such as the Title and Bullets
• Facilitates "semi-structured" search such as
  • Title contains data AND Bullets contain search
• … to say nothing of linguistic structure

SLIDE 19

Data retrieval vs. information retrieval

• Data retrieval
  • Which items contain a set of keywords, or satisfy the given (e.g., regular-expression-like) user query?
  • Well-defined structure and semantics
  • A single erroneous object implies failure!
• Information retrieval
  • Information about a subject
  • Semantics is frequently loose (natural language is not well structured and may be ambiguous)
  • Small errors are tolerated

SLIDE 20

Evaluation of results (Sec. 1.1)

• Precision: fraction of retrieved docs that are relevant to the user's information need
  Precision = relevant retrieved / total retrieved = |Retrieved ∩ Relevant| / |Retrieved|
• Recall: fraction of relevant docs that are retrieved
  Recall = relevant retrieved / relevant exist = |Retrieved ∩ Relevant| / |Relevant|

[Venn diagram: the Retrieved and Relevant sets overlap in the Retrieved & Relevant region]

SLIDE 21

Example

• Assume that there are 8 docs relevant to the query.
• List of the retrieved docs:
  • d1: R, d2: NR, d3: R, d4: R, d5: NR, d6: NR, d7: NR

Precision = 3/7, Recall = 3/8
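The two numbers follow directly from the slide-20 definitions; a minimal sketch that recomputes them from the retrieved list:

```python
from fractions import Fraction

# Retrieved list from the example: R = relevant, NR = non-relevant
retrieved = ["R", "NR", "R", "R", "NR", "NR", "NR"]
total_relevant = 8  # relevant docs that exist for the query

relevant_retrieved = retrieved.count("R")  # |Retrieved ∩ Relevant| = 3
precision = Fraction(relevant_retrieved, len(retrieved))  # / |Retrieved|
recall = Fraction(relevant_retrieved, total_relevant)     # / |Relevant|

print(precision, recall)  # → 3/7 3/8
```

Note that recall needs the count of *all* relevant docs (8 here), including the five relevant ones the system failed to retrieve, which is why recall is hard to measure on large collections.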

SLIDE 22

Web Search

• Application of IR to (HTML) documents on the World Wide Web.
• Web IR
  • Collect the doc corpus by crawling the web
  • Exploit the structural layout of docs
  • Beyond terms, exploit the link structure (ideas from social networks)
    • Link analysis, clickstreams, ...

SLIDE 23

Web IR

[Diagram: a Crawler collects pages from the Web into a corpus; a query goes into the IR System, which returns a list of ranked pages]

SLIDE 24

The web and its challenges

• Web collection properties
  • Distributed nature of the web collection
  • Size of the collection and volume of the user queries
  • Web advertisement (the web is a medium for business too)
  • Predicting relevance on the web
  • Docs change uncontrollably (dynamic and volatile data)
  • Unusual and diverse (heterogeneous) docs, users, and queries

SLIDE 25

Course main topics

• Introduction
• Indexing & text operations
• IR models
  • Boolean, vector space, probabilistic
• Evaluation of IR systems
• Machine learning in IR: classification, clustering, and dimensionality reduction
• Web IR
• Some advanced topics (e.g., recommender systems)

SLIDE 26

Some main trends in IR models

• Boolean models: exact matching
• Vector space model: ranking docs by similarity to the query
• PageRank: ranking of matches by importance of documents
• Combinations of methods
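The vector space trend above can be illustrated with a minimal cosine-similarity ranker over raw term-frequency vectors. This sketch omits any term weighting (a real system would use something like tf-idf), and the toy docs are assumptions for illustration:

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector of a text, as a Counter."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus
docs = {
    "d1": "information retrieval on the web",
    "d2": "web search and link analysis",
    "d3": "boolean retrieval models",
}
query = tf_vector("web retrieval")

# Rank docs by decreasing similarity to the query
ranking = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])),
                 reverse=True)
print(ranking)  # → ['d1', 'd3', 'd2']
```

Unlike the Boolean model's exact yes/no matching, every doc gets a graded score here: d1 matches both query terms, while d3's short length makes its single match count for more than d2's.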