Modern Information Retrieval: Introduction
Hamid Beigy
Sharif University of Technology
September 19, 2020
Some slides have been adapted from slides of Manning, Yannakoudakis, and Schütze.
Table of contents
1. Course Information
2. Introduction
3. Course overview
Course Information
Course Information
1. Course name: Modern Information Retrieval
2. Instructor: Hamid Beigy (Email: beigy@sharif.edu)
3. Class link: https://vc.sharif.edu/ch/beigy
4. Course website: http://ce.sharif.edu/courses/99-00/1/ce324-1/
5. Lectures: Sat-Mon (9:00-10:30)
6. TA: Fariba Lotfi (Email: flotfi@ce.sharif.edu)
Course evaluation
◮ Mid-term exam: 30% (1399/8/17)
◮ Final exam: 30%
◮ Practical assignments: 30%
◮ Quizzes: 10%
Main Reference
[slide shows the cover of the main course textbook]
References
◮ Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. 2nd ed. USA: Addison-Wesley, 2011. ISBN: 9780321416919.
◮ Gerald Kowalski. Information Retrieval Architecture and Algorithms. 1st ed. Berlin, Heidelberg: Springer-Verlag, 2010. ISBN: 9781441977151.
◮ Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
Introduction
Definition of information retrieval
1. We define information retrieval as:
Definition (Information retrieval)
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
2. Document collection: the units over which we have built an IR system. Documents can be
◮ memos
◮ book chapters
◮ paragraphs
◮ scenes of a movie
◮ turns in a conversation...
3. These days we frequently think first of web search, but there are many other cases:
◮ E-mail search
◮ Searching your laptop
◮ Corporate knowledge bases
◮ Legal information retrieval
Structured vs Unstructured Data
◮ Unstructured data means that a formal, semantically overt, easy-for-a-computer structure is missing.
◮ This is in contrast to the rigidly structured data used in DB-style searching (e.g., product inventories, personnel records):
SELECT * FROM business-catalogue WHERE category = "florist" AND city-zip = "cb1"
◮ This does not mean that there is no structure in the data:
◮ Document structure (headings, paragraphs, lists...)
◮ Explicit markup formatting (e.g., in HTML, XML...)
◮ Linguistic structure (latent, hidden)
Information Needs and Relevance
1. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
2. An information need is the topic about which the user desires to know more.
3. A query is what the user conveys to the computer in an attempt to communicate the information need.
4. Types of information needs:
◮ Known-item search
◮ Precise information-seeking search
◮ Open-ended search ("topical search")
Structured vs Unstructured data growth
[figure: chart of the relative growth of structured and unstructured data]
Relevance
1. A document is relevant if the user perceives that it contains information of value with respect to their personal information need.
2. Are the retrieved documents
◮ about the target subject?
◮ up-to-date?
◮ from a trusted source?
◮ satisfying the user's needs?
3. How should we rank documents in terms of these factors?
Information Retrieval Basics
[diagram: a query is posed against a document collection; the IR system returns a set of relevant documents]
How well has the system performed?
◮ The effectiveness of an IR system (i.e., the quality of its search results) is determined by two key statistics about the system's returned results for a query:
◮ Precision: What fraction of the returned results are relevant to the information need?
◮ Recall: What fraction of the relevant documents in the collection were returned by the system?
◮ What is the best balance between the two?
◮ Easy to get perfect recall: just retrieve everything
◮ Easy to get good precision: retrieve only the most relevant
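To make the two statistics concrete, here is a minimal sketch in Python (illustrative only; the function name and the toy document-ID sets are made up, not part of the course material) that computes precision and recall for a single query:

def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Precision and recall for one query, given the set of returned
    document IDs and the set of IDs judged relevant."""
    hits = retrieved & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 3 of the 4 returned documents are relevant (precision 0.75),
# but only 3 of the 6 relevant documents were found (recall 0.50).
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(f"precision={p:.2f} recall={r:.2f}")

The degenerate strategies from the slide show up directly here: returning every document drives recall to 1 while precision collapses, and returning a single sure hit does the reverse.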
A short history of IR
[timeline figure — 1945: memex; 1950s: the term "information retrieval" coined by Calvin Mooers, Boolean IR, literature-searching systems (Alan Kent); 1960s: Salton, the Cranfield experiments, the vector space model (VSM), SMART, evaluation by precision and recall; 1990s: TREC, multimedia evaluation, multilingual evaluation (CLEF); 2000s: PageRank, recommendation systems. Inset: precision/recall trade-off curve.]
A short history of IR (i)
1960-1970
◮ Initial exploration of text retrieval systems for "small" corpora of scientific abstracts, and law and business documents.
◮ Development of the basic Boolean and vector-space models of retrieval.
◮ Prof. Salton and his students at Cornell University were the leading researchers in the area.
1970-1980
◮ Large document database systems, many run by companies (Lexis-Nexis, Dialog, and MEDLINE).
1980-1990
◮ Searching FTP-able documents on the Internet (Archie and WAIS).
1990-2000
◮ Searching the World Wide Web (Lycos, Yahoo, and Altavista).
◮ Organized competitions (NIST TREC).
◮ Recommender systems (Ringo, Amazon, and NetPerceptions).
A short history of IR (ii)
1990-2000 (continued)
◮ Automated text categorization and clustering.
2000-2010
◮ Link analysis for web search (Google).
◮ Parallel processing (MapReduce).
◮ Question answering (TREC Q/A track).
◮ Multimedia IR (image, video, audio, and music).
◮ Cross-language IR.
◮ Document summarization.
2010-2020
◮ Intelligent personal assistants (Siri, Cortana, Google, and Alexa).
◮ Complex question answering (IBM Watson).
◮ Distributional semantics.
◮ Deep learning.
2020-****
◮ By 2025, researchers believe we will have rich multi-sensory experiences capable of producing hallucinations that blend or alter perceived reality.
This slide has been adapted from Prof. Sampath Jayarathna's slides.
IR for non-textual media
[figure]
Unstructured data in 1650
◮ Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia?
◮ One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia.
◮ Why is grep not the solution?
◮ Slow (for large collections)
◮ grep is line-oriented; IR is document-oriented
◮ "not Calpurnia" is non-trivial
◮ Other operations (e.g., find the word Romans near countryman) not feasible
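As a concrete illustration of the alternative (a minimal sketch with toy data; the play contents below are made up, not the real texts), an inverted index maps each term to the set of documents containing it, after which "Brutus AND Caesar AND NOT Calpurnia" reduces to set operations:

from collections import defaultdict

# Toy collection: title -> (made-up) lower-cased contents.
plays = {
    "Julius Caesar": "brutus caesar calpurnia rome",
    "Hamlet": "brutus caesar denmark",
    "The Tempest": "caesar island",
}

# Build the inverted index once: term -> set of documents containing it.
index = defaultdict(set)
for title, text in plays.items():
    for term in text.split():
        index[term].add(title)

# Brutus AND Caesar AND NOT Calpurnia, answered per document,
# without rescanning the texts.
answer = (index["brutus"] & index["caesar"]) - index["calpurnia"]
print(answer)  # {'Hamlet'}

Unlike grep, the NOT clause is just a set difference, matching is per document rather than per line, and the index is built once instead of rescanning the whole collection for every query.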
Web Information Retrieval
[figure]
Related areas
[diagram: Information Retrieval at the center of its neighboring fields]
◮ Applications: web applications, bioinformatics, ...
◮ Mathematics: machine learning, pattern recognition, statistics, optimization
◮ Systems: databases, data mining, software engineering, computer systems, algorithms
◮ Library & information science
◮ Natural language processing
Course overview
Course overview
◮ Introduction
◮ Indexing and text operations
◮ IR models (Boolean, vector space, probabilistic)
◮ Evaluation of IR systems
◮ Query operations
◮ Language models
◮ Machine learning in IR (classification, clustering, and learning to rank)
◮ Dimensionality reduction and word embedding
◮ Web information retrieval and search engines
◮ Some advanced topics:
  ◮ Recommender systems
  ◮ Personalized IR
  ◮ Sentiment analysis
  ◮ Cross-lingual IR
  ◮ QA systems
  ◮ Neural information retrieval
Questions?