Introduction Information Retrieval Indian Statistical Institute Information Retrieval (ISI) Introduction 1 / 20

Course details
Books
  [MRS] Introduction to Information Retrieval, Manning, Raghavan, Schütze. https://nlp.stanford.edu/IR-book/
  [BCC] Information Retrieval: Implementing and Evaluating Search Engines, Büttcher, Clarke, Cormack. http://www.ir.uwaterloo.ca/book/
  [CMS] Search Engines: Information Retrieval in Practice, Croft, Metzler, Strohman. http://www.search-engines-book.com/
  Foundations and Trends in Information Retrieval (FTIR). https://www.nowpublishers.com/INR
Weightage: Mid-sem 20%, Project 30%, End-sem 50%
Slides: available from http://www.isical.ac.in/~mandar/courses.html and http://www.isical.ac.in/~debapriyo

Terminology
Problem definition: given a user’s information need, find documents satisfying that need.
  Information need: what the user is looking for
  Query: actual representation of the information need
  Document: any unit / item that can be retrieved
For this course, we will only consider textual information (no images/graphics, maps, speech, video, etc.).

Overview
[Figure: system overview. INDEXING: document collection, index. QUERYING: retrieval engine, results.]

Steps
1. Document acquisition: how is the document collection obtained / constructed? (LATER)
2. Indexing: representing documents so that retrieval is easy
3. Retrieval: matching the user query against documents in the collection
4. Evaluation: how to determine whether the system did well? (NEXT WEEK)

Bag of words approach
Indexing:
  document → list of keywords / content-descriptors / terms
  user’s information need → (natural-language) query → list of keywords
Retrieval: measure overlap between query and documents.

Indexing
1. Tokenisation
2. Stopword removal
3. Stemming
4. Phrase identification
5. Named entity extraction

Indexing – I
Tokenisation: identify individual words.
  Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing.
  ⇓
  Information retrieval IR is the activity of obtaining . . .
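A minimal sketch of the tokenisation step, assuming that a token is a maximal run of letters and digits, with hyphens allowed inside a word. The function name `tokenise` and the regular expression are illustrative choices, not something the slides prescribe; real tokenisers handle many more cases (apostrophes, numbers, non-Latin scripts).

```python
import re

def tokenise(text):
    """Return word tokens: runs of letters/digits, keeping internal hyphens.

    Assumption: punctuation and brackets are separators, so "(IR)" yields "IR".
    """
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

sentence = "Information retrieval (IR) is the activity of obtaining information resources."
tokens = tokenise(sentence)
print(tokens)
```

Keeping "full-text" as one token is a design choice made here for illustration; splitting it into "full" and "text" would be equally defensible.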

Indexing – II
Stopword removal: eliminate common words.
  Information retrieval IR is the activity of obtaining . . .
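Stopword removal can be sketched as a simple filter. The stopword set below is a small hand-picked illustration, not a standard list, and `remove_stopwords` is a hypothetical helper name.

```python
# Illustrative stopword set (an assumption; real systems use longer curated lists).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "from", "on", "or", "be", "can"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ["Information", "retrieval", "IR", "is", "the", "activity", "of", "obtaining"]
kept = remove_stopwords(tokens)
print(kept)
```

With this particular list, "is", "the" and "of" are dropped from the example sentence and the content-bearing words remain.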

Indexing – III
Stemming: reduce words to a common root.
  e.g. resignation, resigned, resigns → resign
  for common languages, use standard algorithms (Porter).
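To make the idea concrete, here is a toy suffix stripper. It is deliberately not the Porter algorithm, which applies several ordered rule phases with conditions on the stem; the suffix list and the minimum stem length are assumptions chosen only so that the example on this slide works.

```python
# Toy suffix list, longest first (an assumption for illustration only).
SUFFIXES = ("ation", "ing", "ed", "s")

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ("resignation", "resigned", "resigns"):
    print(w, "->", toy_stem(w))
```

A rule set this crude over-stems many ordinary words, which is why standard, tested algorithms are preferred in practice.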

Indexing – IV
Phrases: multi-word terms, e.g. computer science, data mining.
Syntactic/linguistic methods:
  use a part-of-speech (POS) tagger
  look for particular POS sequences, e.g. NN NN, JJ NN
  Example: computer/NN science/NN
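The POS-pattern idea can be sketched on input that is already tagged. The sentence below is tagged by hand (the tags other than NN are assumed Penn Treebank style labels), so no tagger is run here; `candidate_phrases` is a hypothetical helper name.

```python
# Tag sequences that mark a two-word candidate phrase (taken from the slide).
PHRASE_PATTERNS = {("NN", "NN"), ("JJ", "NN")}

def candidate_phrases(tagged_tokens):
    """Return adjacent word pairs whose tag pair matches one of the patterns."""
    pairs = zip(tagged_tokens, tagged_tokens[1:])
    return [(w1, w2) for (w1, t1), (w2, t2) in pairs if (t1, t2) in PHRASE_PATTERNS]

# Hand-tagged example; in a real pipeline these tags would come from a POS tagger.
tagged = [("she", "PRP"), ("studies", "VBZ"), ("computer", "NN"), ("science", "NN")]
print(candidate_phrases(tagged))
```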

Indexing – IV
Statistical methods: treat the word pair (a, b) as a phrase if $f(a, b) > \theta$ (threshold)
  Raw frequency: $f_{\text{raw}}(a, b) = n_{(a,b)}$
  Dice coefficient: $f_{\text{dice}}(a, b) = \dfrac{2 \times n_{(a,b)}}{n_a + n_b}$
  $n_a$, $n_b$: number of bi-grams whose first (second) word is a (b)
  . . .
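A small sketch of the Dice coefficient over bi-grams, following the definitions on this slide: n(a,b) is the count of the bi-gram (a, b), n_a counts bi-grams whose first word is a, and n_b counts bi-grams whose second word is b. The toy token sequence and the threshold value 0.8 are assumptions for illustration.

```python
from collections import Counter

def dice_scores(tokens):
    """Map each observed bi-gram (a, b) to 2 * n(a,b) / (n_a + n_b)."""
    bigrams = list(zip(tokens, tokens[1:]))
    n_ab = Counter(bigrams)
    n_first = Counter(a for a, _ in bigrams)    # bi-grams whose first word is a
    n_second = Counter(b for _, b in bigrams)   # bi-grams whose second word is b
    return {(a, b): 2 * n / (n_first[a] + n_second[b]) for (a, b), n in n_ab.items()}

tokens = "data mining and data mining tools use data".split()
scores = dice_scores(tokens)
theta = 0.8  # hypothetical threshold
phrases = [bigram for bigram, score in scores.items() if score > theta]
print(phrases)
```

On this tiny sample both ("data", "mining") and ("tools", "use") score 1.0, which shows why such statistics need a reasonably large collection: a pair seen only once can look like a perfect phrase.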

Indexing
Document collection → Term-Document Matrix
Vocabulary: set of all words in the collection, $t_1, t_2, \ldots, t_M$
Document collection $D_1, D_2, \ldots, D_N$ → $N \times M$ binary (0-1) matrix, with one row per document and one column per term
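As a sketch, the binary term-document matrix can be built from already-indexed documents (lists of terms). The three toy documents are made up for illustration, and the vocabulary is sorted only to make the column order deterministic.

```python
def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] is 1 if term j occurs in doc i."""
    vocabulary = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        present = set(doc)
        matrix.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, matrix

# Hypothetical collection of N = 3 documents.
docs = [
    ["medicine", "hypertension"],
    ["treatment", "medicine"],
    ["hypertension"],
]
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

For realistic collections this matrix is extremely sparse, so it is normally stored as an inverted index rather than as a dense array.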

Retrieval models

Boolean model
Keywords combined using AND, OR, (AND) NOT
  e.g. (medicine OR treatment) AND (hypertension OR “high blood pressure”)
Efficient and easy to implement (list merging)
  AND ≡ intersection
  OR ≡ union
  Example:
    medicine → D1, D4, D5, D10, . . .
    hypertension → D2, D4, D8, D10, . . .
Drawbacks
  OR: one match is as good as many
  AND: one miss is as bad as all
  no ranking
  queries may be difficult to formulate
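The list-merging idea can be sketched as follows, assuming each posting list is a sorted list of integer document ids. Only the ids visible in the example above are used (the lists on the slide are truncated), and the function names are illustrative.

```python
def intersect(p1, p2):
    """AND: ids present in both sorted posting lists, found in one linear pass."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    """OR: ids present in either sorted posting list, without duplicates."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            out.append(p1[i])
            i += 1
        else:
            out.append(p2[j])
            j += 1
    out.extend(p1[i:])
    out.extend(p2[j:])
    return out

medicine = [1, 4, 5, 10]
hypertension = [2, 4, 8, 10]
print(intersect(medicine, hypertension))  # medicine AND hypertension
print(union(medicine, hypertension))      # medicine OR hypertension
```

Both operations take time proportional to the combined length of the two lists, which is what makes the Boolean model cheap to implement. Note that the result is an unordered set of matches: nothing here says which matching document is better.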

Vector space model (VSM)
Any text item (“document”) is represented as a list of terms and associated weights.
$$\begin{array}{c|cccc} & t_1 & t_2 & \cdots & t_M \\ \hline D_1 & w_{11} & w_{12} & \cdots & w_{1M} \\ D_2 & w_{21} & w_{22} & \cdots & w_{2M} \\ \vdots & & & & \\ D_N & w_{N1} & w_{N2} & \cdots & w_{NM} \end{array}$$
Term = keywords or content-descriptors
Weight = measure of the importance of a term in representing the information contained in the document

Term weights
Term frequency (tf)
  repeated words are strongly related to content
  importance does not grow linearly with frequency ⇒ use a sub-linear function
  examples: $1 + \log(tf)$, $1 + \log(1 + \log(tf))$, $\dfrac{tf}{k + tf}$
Inverse document frequency (idf): an uncommon term is more important
  Example: medicine vs. antibiotic
  commonly used functions: $\log \dfrac{N}{1 + df}$, $\log \dfrac{N - df + 0.5}{df + 0.5}$
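A minimal sketch of these weighting functions, taking them to be 1 + log(tf), 1 + log(1 + log(tf)) and tf / (k + tf) for term frequency, and log(N / (1 + df)) and log((N - df + 0.5) / (df + 0.5)) for idf. Natural logarithms, the default k = 1.0, and the function names are assumptions; the slides do not fix a base or a value of k.

```python
import math

def tf_log(tf):
    """1 + log(tf); defined as 0 when the term is absent."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def tf_double_log(tf):
    """1 + log(1 + log(tf)); grows even more slowly than tf_log."""
    return 1 + math.log(1 + math.log(tf)) if tf > 0 else 0.0

def tf_saturating(tf, k=1.0):
    """tf / (k + tf); approaches 1 as tf grows. k = 1.0 is a hypothetical default."""
    return tf / (k + tf)

def idf_smooth(num_docs, df):
    """log(N / (1 + df))."""
    return math.log(num_docs / (1 + df))

def idf_probabilistic(num_docs, df):
    """log((N - df + 0.5) / (df + 0.5)); negative when df exceeds about N / 2."""
    return math.log((num_docs - df + 0.5) / (df + 0.5))

for tf in (1, 2, 10, 100):
    print(tf, round(tf_log(tf), 3), round(tf_double_log(tf), 3), round(tf_saturating(tf), 3))
```

All three tf functions are increasing but sub-linear: going from 10 to 100 occurrences adds far less weight than going from 1 to 10.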

Term weights
Normalisation by document length: term weights for long documents should be reduced
  long docs contain many distinct words
  long docs contain the same word many times
Intuition: each term covers a smaller portion of the overall information content of a long document
Use # bytes, # distinct words, Euclidean length, etc.
Weight = tf × idf / normalisation

Term weights: “traditional” weighting schemes
Cosine normalisation:
$$\frac{(1 + \log(tf)) \times \log\dfrac{N}{1 + df}}{\sqrt{\sum_i w_i^2}}$$
Pivoted normalisation:
$$\frac{\dfrac{1 + \log(tf)}{1 + \log(\text{average } tf)} \times \log\dfrac{N}{df}}{(1.0 - \text{slope}) \times \text{pivot} + \text{slope} \times \#\,\text{unique terms}}$$
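As an illustration of cosine normalisation, the sketch below computes the unnormalised weight (1 + log(tf)) × log(N / (1 + df)) for each term of one document and then divides by the Euclidean length of the resulting vector. The frequencies and collection size are made-up numbers, natural logarithms are assumed, and `cosine_weights` is a hypothetical name.

```python
import math

def cosine_weights(term_freqs, doc_freqs, num_docs):
    """Return tf-idf weights for one document, scaled to unit Euclidean length."""
    raw = {
        term: (1 + math.log(tf)) * math.log(num_docs / (1 + doc_freqs[term]))
        for term, tf in term_freqs.items()
        if tf > 0
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0:
        return raw  # nothing to normalise
    return {term: w / norm for term, w in raw.items()}

# Hypothetical statistics: "medicine" is common in the collection, "antibiotic" is rare.
term_freqs = {"medicine": 3, "antibiotic": 1}
doc_freqs = {"medicine": 500, "antibiotic": 20}
weights = cosine_weights(term_freqs, doc_freqs, num_docs=1000)
print(weights)
```

With these numbers the rarer term ends up with the larger weight even though it occurs less often in the document, which is the effect the idf factor is meant to produce.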

VSM: retrieval
Measure vocabulary overlap between user query and documents.
Over the terms $t_1, \ldots, t_M$: $Q = (q_1, \ldots, q_M)$, $D = (d_1, \ldots, d_M)$
$$\mathrm{Sim}(Q, D) = \vec{Q} \cdot \vec{D} = \sum_i q_i \times d_i$$
  more matches between Q, D ⇒ Sim(Q, D) ↑
  matches on important terms between Q, D ⇒ Sim(Q, D) ↑
Use inverted list (index): $t_i \rightarrow (D_{i_1}, w_{i_1}), \ldots, (D_{i_k}, w_{i_k})$
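A sketch of retrieval with an inverted index, assuming document and query vectors are stored sparsely as term-to-weight dictionaries. The document ids, weights and function names are made up for illustration. Only documents that share at least one term with the query are ever touched, which is the point of the inverted list.

```python
from collections import defaultdict

def build_inverted_index(doc_vectors):
    """Map each term to its posting list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index

def rank(query_vector, index):
    """Accumulate Sim(Q, D) = sum_i q_i * d_i term by term; return docs by score."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weighted document vectors.
doc_vectors = {
    "D1": {"medicine": 0.5, "hypertension": 0.8},
    "D2": {"medicine": 0.9},
    "D3": {"treatment": 0.7},
}
index = build_inverted_index(doc_vectors)
ranking = rank({"medicine": 1.0, "hypertension": 1.0}, index)
print(ranking)
```

D1 matches both query terms and ranks above D2, while D3 shares no term with the query and never receives a score.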
