Modern Information Retrieval Introduction 1 Hamid Beigy Sharif - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Modern Information Retrieval

Introduction [1]

Hamid Beigy

Sharif University of Technology

September 19, 2020

[1] Some slides have been adapted from slides of Manning, Yannakoudakis, and Schütze.

slide-2
SLIDE 2

Table of contents

  • 1. Course Information
  • 2. Introduction
  • 3. Course overview


slide-3
SLIDE 3

Course Information

slide-4
SLIDE 4

Course Information

  • 1. Course name : Modern Information Retrieval
  • 2. Instructor : Hamid Beigy

Email : beigy@sharif.edu

  • 3. Class Link: https://vc.sharif.edu/ch/beigy
  • 4. Course Website: http://ce.sharif.edu/courses/99-00/1/ce324-1/
  • 5. Lectures: Sat-Mon (9:00-10:30)
  • 6. TAs :

Fariba Lotfi Email: flotfi@ce.sharif.edu


slide-5
SLIDE 5

Course evaluation

◮ Evaluation:

  ◮ Mid-term exam: 30% (1399/8/17)
  ◮ Final exam: 30%
  ◮ Practical assignments: 30%
  ◮ Quiz: 10%

slide-6
SLIDE 6

Main Reference


slide-7
SLIDE 7

References

Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. 2nd ed. USA: Addison-Wesley Publishing Company, 2011. ISBN: 9780321416919.

Gerald Kowalski. Information Retrieval Architecture and Algorithms. 1st ed. Berlin, Heidelberg: Springer-Verlag, 2010. ISBN: 1441977155, 9781441977151.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.

slide-8
SLIDE 8

Introduction

slide-9
SLIDE 9

Definition of information retrieval

  • 1. We define information retrieval as follows.

Definition (Information retrieval): Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

  • 2. Document collection: the units we have built an IR system over. Documents can be

◮ memos
◮ book chapters
◮ paragraphs
◮ scenes of a movie
◮ turns in a conversation...

  • 3. These days we frequently think first of web search, but there are many other cases:

◮ E-mail search
◮ Searching your laptop
◮ Corporate knowledge bases
◮ Legal information retrieval

slide-10
SLIDE 10

Structured vs Unstructured Data

◮ Unstructured data means that a formal, semantically overt, easy-for-computer structure is missing.

◮ This is in contrast to the rigidly structured data used in DB-style searching (e.g. product inventories, personnel records):

    SELECT * FROM business_catalogue WHERE category = 'florist' AND city_zip = 'cb1'

◮ This does not mean that there is no structure in the data:

  ◮ Document structure (headings, paragraphs, lists...)
  ◮ Explicit markup formatting (e.g. in HTML, XML...)
  ◮ Linguistic structure (latent, hidden)
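
The contrast between the two kinds of search can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows, the page texts, and the names `structured_hits`, `keyword_hits`, and `pages` are invented for the example, and the table and column names follow the slide's query with hyphens replaced by underscores so that they are valid SQL identifiers.

```python
import sqlite3

# Structured side: an in-memory table with made-up rows (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business_catalogue (name TEXT, category TEXT, city_zip TEXT)"
)
conn.executemany(
    "INSERT INTO business_catalogue VALUES (?, ?, ?)",
    [
        ("Petal Works", "florist", "cb1"),
        ("Stem & Leaf", "florist", "cb2"),
        ("Bike Shed", "cycle shop", "cb1"),
    ],
)

# The schema tells the database exactly which field holds which value.
rows = conn.execute(
    "SELECT name FROM business_catalogue WHERE category = ? AND city_zip = ?",
    ("florist", "cb1"),
).fetchall()
structured_hits = [name for (name,) in rows]

# Unstructured side: free text with no fields (hypothetical pages).
pages = {
    "page1": "Petal Works is a florist near the station in CB1.",
    "page2": "We deliver flowers across Cambridge, including cb1 and cb2.",
    "page3": "Bike Shed repairs bicycles in CB1.",
}

# Naive keyword matching: every query word must occur somewhere in the text.
keyword_hits = sorted(
    page_id
    for page_id, text in pages.items()
    if "florist" in text.lower() and "cb1" in text.lower()
)

print(structured_hits)
print(keyword_hits)
```

In this toy data, page2 is plausibly relevant to someone looking for a florist in cb1, but it never uses the word "florist", so plain keyword matching misses it. Handling such cases is part of what makes retrieval over unstructured text harder than a database lookup.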

slide-11
SLIDE 11

Information Needs and Relevance

  • 1. Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
  • 2. An information need is the topic about which the user desires to know more.
  • 3. A query is what the user conveys to the computer in an attempt to communicate the information need.
  • 4. Types of information needs:

◮ Known-item search
◮ Precise information-seeking search
◮ Open-ended search (“topical search”)

slide-12
SLIDE 12

Structured vs Unstructured data growth


slide-13
SLIDE 13

Relevance

  • 1. A document is relevant if the user perceives that it contains information of value with respect to their personal information need.
  • 2. Are the retrieved documents

◮ about the target subject?
◮ up-to-date?
◮ from a trusted source?
◮ satisfying the user’s needs?

  • 3. How should we rank documents in terms of these factors?

slide-14
SLIDE 14

Information Retrieval Basics

[Diagram with components: Query, Document Collection, IR System, Set of relevant documents.]

slide-15
SLIDE 15

How well has the system performed?

◮ The effectiveness of an IR system (i.e., the quality of its search results) is determined by two key statistics about the system’s returned results for a query:

  ◮ Precision: What fraction of the returned results are relevant to the information need?
  ◮ Recall: What fraction of the relevant documents in the collection were returned by the system?

◮ What is the best balance between the two?

  ◮ Easy to get perfect recall: just retrieve everything
  ◮ Easy to get good precision: retrieve only the most relevant
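
The two statistics and the trade-off between them can be made concrete with a small Python sketch. The document IDs, the set of relevant documents, and the function name `precision_recall` are assumptions made up for this illustration. Returning 0.0 when nothing is retrieved (or nothing is relevant) is one possible convention, not a standard fixed by the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: document IDs returned by the system
    relevant:  document IDs that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    # Convention assumed here: report 0.0 instead of dividing by zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection of 10 documents, 4 of which are relevant.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d2", "d3", "d4"}

# "Retrieve everything": perfect recall, poor precision.
p_all, r_all = precision_recall(collection, relevant)

# "Retrieve only one document that is certainly relevant":
# perfect precision, poor recall.
p_one, r_one = precision_recall({"d1"}, relevant)

print(p_all, r_all)
print(p_one, r_one)
```

Retrieving the whole collection gives recall 1.0 but precision 4/10, while returning a single relevant document gives precision 1.0 but recall 1/4. Neither extreme is a useful system, which is why the two numbers are reported together.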

slide-16
SLIDE 16

A short history of IR

[Timeline figure, 1945 to 2000s: memex; term "IR" coined by Calvin Mooers; literature searching systems; evaluation by precision and recall (Alan Kent); Cranfield experiments; Boolean IR; SMART; Salton; VSM; PageRank; TREC; multimedia; multilingual (CLEF); recommendation systems. Inset plot: precision versus recall against the number of items retrieved.]

slide-17
SLIDE 17

A short history of IR i

1960-1970 [2]

◮ Initial exploration of text retrieval systems for "small" corpora of scientific abstracts, and law and business documents.
◮ Development of the basic Boolean and vector-space models of retrieval.
◮ Prof. Salton and his students at Cornell University are the leading researchers in the area.

1970-1980

◮ Large document database systems, many run by companies (Lexis-Nexis, Dialog, and MEDLINE)

1980-1990

◮ Searching FTPable documents on the Internet (Archie and WAIS)
◮ Searching the World Wide Web (Lycos, Yahoo, and Altavista)

1990-2000

◮ Searching FTPable documents on the Internet (Archie and WAIS)
◮ Searching the World Wide Web (Lycos, Yahoo, and Altavista)
◮ Organized competitions (NIST TREC)
◮ Recommender systems (Ringo, Amazon, and NetPerceptions)

slide-18
SLIDE 18

A short history of IR ii

1990-2000 (continued)

◮ Automated Text Categorization & Clustering

2000-2010

◮ Link analysis for Web Search (Google)
◮ Parallel Processing (Map-Reduce)
◮ Question Answering (TREC Q/A track)
◮ Multimedia IR (image, video, audio, and music)
◮ Cross-Language IR
◮ Document Summarization

2010-2020

◮ Intelligent Personal Assistants (Siri, Cortana, Google, and Alexa)
◮ Complex Question Answering (IBM Watson)
◮ Distributional Semantics
◮ Deep Learning

2020-****

◮ Researchers believe that by 2025 we will have rich multi-sensorial experiences capable of producing hallucinations that blend or alter perceived reality.

[2] This slide is taken from Prof. Sampath Jayarathna's slides.

slide-19
SLIDE 19

IR for non-textual media


slide-20
SLIDE 20

Unstructured data in 1650

◮ Which plays of Shakespeare contain the words Brutus and Caesar, but not Calpurnia?

◮ One could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia.

◮ Why is grep not the solution?

  ◮ Slow (for large collections)
  ◮ grep is line-oriented, IR is document-oriented
  ◮ “not Calpurnia” is non-trivial
  ◮ Other operations (e.g., find the word Romans near countryman) not feasible
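
One common alternative to scanning the text for every query is to build an index once and answer Boolean queries with set operations. The Python sketch below illustrates that idea; it is not necessarily the method developed later in the course. The `plays` dictionary holds a few made-up words per play (not real excerpts), and `build_index` and `postings` are hypothetical helper names.

```python
from collections import defaultdict

# Toy stand-in for the plays: a few invented words per title,
# NOT real excerpts from Shakespeare.
plays = {
    "Julius Caesar": "brutus caesar calpurnia romans countrymen",
    "Hamlet": "brutus caesar hamlet",
    "Macbeth": "macbeth duncan",
    "Antony and Cleopatra": "antony brutus caesar cleopatra",
}


def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def postings(index, term):
    """Documents containing the term (empty set if the term is unseen)."""
    return index.get(term.lower(), set())


index = build_index(plays)

# Brutus AND Caesar AND NOT Calpurnia, evaluated per document:
# intersection for AND, set difference for AND NOT.
result = (postings(index, "Brutus") & postings(index, "Caesar")) - postings(
    index, "Calpurnia"
)

print(sorted(result))
```

Because the index is document-oriented, "not Calpurnia" becomes a plain set difference over whole plays rather than a line-by-line filter. Proximity queries such as Romans near countryman would additionally need word positions in the index, which this sketch does not store.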

slide-21
SLIDE 21

Web Information Retrieval


slide-22
SLIDE 22

Related areas

[Diagram: Information Retrieval shown among related areas: Databases; Library & Info Science; Machine Learning; Pattern Recognition; Natural Language Processing; Data Mining; Statistics; Optimization; Software engineering; Computer systems; Web Applications, Bioinformatics…; grouped under Mathematics, Algorithms, Applications, and Systems.]

slide-23
SLIDE 23

Course overview

slide-24
SLIDE 24

Course overview

◮ Introduction
◮ Indexing and text operations
◮ IR models (Boolean, vector space, probabilistic)
◮ Evaluation of IR systems
◮ Query operations
◮ Language models
◮ Machine Learning in IR (classification, clustering, and learning to rank)
◮ Dimensionality reduction and word embedding
◮ Web information retrieval and search engines
◮ Some advanced topics

  ◮ Recommender systems
  ◮ Personalized IR
  ◮ Sentiment Analysis
  ◮ Cross-lingual IR
  ◮ QA systems
  ◮ Neural information retrieval

slide-25
SLIDE 25

Questions?
