Information Retrieval: An Introduction
Dr. Grace Hui Yang
InfoSense, Department of Computer Science, Georgetown University, USA
huiyang@cs.georgetown.edu
Jan 2019 @ Cape Town
A Quick Introduction: What do we do at InfoSense
Recommended reading: Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Second edition. 2011.
W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice. 2009.
Information retrieval acts as a remedy to information overload.
IR can be defined in both a broad sense and a narrow sense.
In the broad sense, IR is about algorithms and tools that help people find the information they want. The information can be structured or unstructured, with or without hyperlinks, with or without metadata, and in a foreign language or not (see the Monday lecture on Multilingual IR by Doug Oard). Whether it is the information "they want" is based on the judgement of the person who starts the search.
In the narrow sense, retrieved documents are scored by relevance, ranked (from most relevant to least) and returned in a list.
An IR system involves an information need, a corpus, results, a user, and an evaluation metric.
A brief history of IR, from the 1940s to today: the Memex (1940s); the vector space model and probabilistic retrieval theory (1960s-1970s); Okapi BM25 and TREC (1990s); language models (late 1990s); learning to rank (2000s); deep learning (2010s); alongside long-running threads such as QA, filtering, query understanding, and user modeling.
IR sits among related fields: supervised machine learning (ML), AI, databases (DB), NLP, question answering (QA), HCI, recommendation, information seeking (IS), and library science. The figure contrasted them roughly as follows (solid lines: transformations or special cases; dashed lines: overlap):
- DB: tabulated data and Boolean queries; IR: unstructured data and natural-language queries.
- IR: human-issued queries and non-exhaustive search; recommendation: no query, but a user profile.
- QA: returns answers instead of documents; IR is an intermediate step before answers are extracted.
- NLP: understanding of data and semantics; IR: loss of semantics, only counting terms.
- Library science: controlled vocabulary and browsing; IR: large scale and use of algorithms.
- Information seeking: user-centered study; interactive, complex information needs rather than a single iteration.
- Supervised ML: data-driven, uses training data; classical IR: expert-crafted models, no training data.
The IR pipeline: a user's information need is expressed as a query representation; the corpus is turned into document representations through indexing, producing an index; retrieval models match the query against the index to produce retrieval results; evaluation and feedback close the loop.
From task to query:
Task: get rid of mice in a politically correct way
Info need: info about removing mice without killing them
Verbal form: How do I trap mice alive?
Query: mouse trap
Textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma. Chap 1
The indexing pipeline:
Documents to be indexed: "Friends, Romans, countrymen."
→ Tokenizer → tokens: Friends, Romans, Countrymen
→ Linguistic modules → normalized tokens: friend, roman, countryman
→ Indexer → inverted index: friend → 2, 4; roman → 1, 2; countryman → 13, 16
Textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma. Chap 1
Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.
Textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma. Chap 1
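To make the indexing pipeline concrete, here is a minimal sketch (my own illustration, not code from the lecture) that tokenizes and normalizes the two documents above and builds a tiny inverted index:

```python
import re
from collections import defaultdict

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

def tokenize(text):
    # Split on non-alphabetic characters; a real tokenizer is more careful.
    return [t for t in re.split(r"[^a-zA-Z]+", text) if t]

def normalize(token):
    # Stand-in for the "linguistic modules" step: lowercasing only here
    # (a real system might also stem, e.g., Romans -> roman).
    return token.lower()

index = defaultdict(set)              # term -> postings (set of doc IDs)
for doc_id, text in docs.items():
    for token in tokenize(text):
        index[normalize(token)].add(doc_id)

print(sorted(index["brutus"]))        # [1, 2]
print(sorted(index["capitol"]))       # [1]
```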
(Indexing involves many more interesting details, which we skip here.)
Exact-match (Boolean) retrieval returns unordered result sets; sometimes it is very labor intensive for users to formulate their queries so that they match the terms in each relevant item (document).
Ranked retrieval models, in contrast, order documents based on a ranking score or relevance score.
In the vector space model, documents and queries are represented as vectors in a word space.
Suppose the corpus only has two words, 'jealous' and 'gossip'; they form a two-dimensional space of "jealous" and "gossip".
d1: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip gossip
d2: gossip gossip jealous gossip gossip gossip gossip gossip gossip gossip jealous jealous jealous jealous jealous jealous jealous gossip jealous
d3: jealous gossip jealous jealous jealous jealous jealous jealous jealous jealous jealous
q: gossip gossip jealous gossip gossip gossip gossip gossip jealous jealous jealous jealous
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
If we view queries and documents as points in a Euclidean space, their Euclidean distance is
|q − d| = sqrt(Σ_i (q_i − d_i)²)
Here, if you look at the content (the word distributions) of d1, d2, d3 and q above, d2 is actually the most similar document to q. However, d2 produces a bigger Euclidean distance score to q, simply because d2 is much longer.
Adapted from textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma. Chap 6
So long and short vectors with similar term distributions can always have a big Euclidean distance.
Idea: rank documents according to their angles with the query instead.
The angle between similar vectors is small; between dissimilar vectors it is large.
Ranking by angle (equivalently, by cosine) implicitly performs document length normalization.
Adapted from textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma. Chap 6
cos(q, d) = (q · d) / (|q| |d|) = Σ_{i=1}^{|V|} q_i d_i / (sqrt(Σ_{i=1}^{|V|} q_i²) · sqrt(Σ_{i=1}^{|V|} d_i²))
where q_i is the tf-idf weight of term i in the query and d_i is the tf-idf weight of term i in the document. cos(q, d) is the cosine similarity of q and d, or, equivalently, the cosine of the angle between q and d.
Example: D1 = (0.5, 0.8, 0.3), D2 = (0.9, 0.4, 0.2), Q = (1.5, 1.0, 0). Computing cosine similarities gives cos(D1, Q) ≈ 0.87 and cos(D2, Q) ≈ 0.97, so D2 is ranked first.
Example from textbook "Search Engines: Information Retrieval in Practice" Chap 7
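As a quick check of this example (the snippet is mine, not the textbook's), the two cosine scores can be computed directly:

```python
import math

def cosine(u, v):
    # cos(u, v) = (u . v) / (|u| |v|)
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

D1, D2, Q = (0.5, 0.8, 0.3), (0.9, 0.4, 0.2), (1.5, 1.0, 0)
print(round(cosine(D1, Q), 2))  # 0.87
print(round(cosine(D2, Q), 2))  # 0.97 -> D2 is ranked higher
```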
A term that appears frequently in only a few pages is likely a very important term in those pages.
Textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma. Chap 6
The classical probabilistic model brings term frequency into ranking in a principled rather than ad hoc way, by modeling term frequency distributions with a stochastic element. Each term is associated with a concept: documents are divided into those which are "elite" for the term, i.e. about the concept represented by the term, and those which are not. A document's term frequency is assumed to depend only on eliteness.
Figure adapted from “Search Engines: Information Retrieval in Practice” Chap 7
In the 2-Poisson model, within-document term frequency follows a mixture of two Poisson distributions:
P(tf = k | R) = p' e^{−λ} λ^k / k! + (1 − p') e^{−μ} μ^k / k!
where λ and μ are the Poisson means for tf in the elite and non-elite sets for term t, and
p' = P(document elite for t | R), q' = P(document elite for t | NR).
As tf grows, the resulting term weight approaches the weight that would be given to a direct indicator of eliteness.
p = P(term present | R), q = P(term present | NR)
The Robertson/Sparck-Jones weight, w = log [ p (1 − q) / (q (1 − p)) ], becomes the idf component of Okapi; the approximated term weight tf / (k1 + tf) · w, with constant k1, becomes the tf component of Okapi.
The Okapi/BM25 ranking function has three parts: an idf part (the Robertson-Sparck Jones weight), a tf part, and a user-related (query term frequency) weight. With no relevance information it can be written as
score(q, d) = Σ_{t∈q} log((N − n_t + 0.5)/(n_t + 0.5)) · ((k1 + 1) tf_t)/(K + tf_t) · ((k3 + 1) qtf_t)/(k3 + qtf_t)
where K = k1 ((1 − b) + b · dl/avdl).
Original Okapi: k1 = 2, b = 0.75, k3 = 0. BM25: k1 = 1.2, b = 0.75, k3 = a number from 0 to 1000.
Example from textbook "Search Engines: Information Retrieval in Practice" Chap 7
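Below is a compact sketch of the BM25 scoring function described above, assuming the common no-relevance-information form with the k3 (query term frequency) component omitted; the function and variable names are my own:

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, N, df, k1=1.2, b=0.75):
    """query_terms: list of terms; doc_tf: term -> frequency in this document;
    doc_len/avg_doc_len: document length and average length; N: number of
    documents in the corpus; df: term -> document frequency."""
    K = k1 * ((1 - b) + b * doc_len / avg_doc_len)         # length-normalized k1
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))  # RSJ/idf component
        score += idf * (k1 + 1) * tf / (K + tf)            # tf component
    return score
```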
§ Each document is treated as (the basis for) a language model.
§ Given a query q, rank documents based on P(d|q).
§ By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q).
§ P(q) is the same for all documents, so it can be ignored.
§ P(d) is the prior, often treated as the same for all d; but we can give a higher prior to high-quality documents, e.g., those with high PageRank.
§ P(q|d) is the probability of q given d.
§ So ranking according to P(q|d) and P(d|q) is equivalent.
Textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma.
Figure: query likelihood ranking. Each document d1, d2, .., dN in the corpus induces its own document language model; documents are ranked by the query likelihood p(q|d1), p(q|d2), .., p(q|dN).
Adapted from Mei, Fang and Zhai's "A Study of Poisson Query Generation Model for Information Retrieval"
String = frog said that toad likes frog STOP
P(string|Md1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 · 0.2 = 0.0000000000048 = 4.8 · 10^-12
P(string|Md2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 · 0.2 = 0.0000000000120 = 12 · 10^-12
P(string|Md1) < P(string|Md2)
Thus, document d2 is "more relevant" to the string "frog said that toad likes frog STOP" than d1 is.
Textbook slides for "Introduction to Information Retrieval" by Hinrich Schütze and Christina Lioma.
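The arithmetic above is easy to reproduce (the snippet is mine); note that the last factor in each product is the STOP probability 0.2:

```python
from math import prod

# frog said that toad likes frog STOP
p_md1 = [0.01, 0.03, 0.04, 0.01, 0.02, 0.01, 0.2]
p_md2 = [0.01, 0.03, 0.05, 0.02, 0.02, 0.01, 0.2]
print(prod(p_md1))   # ~4.8e-12
print(prod(p_md2))   # ~1.2e-11, i.e. 12e-12 -> d2 wins
```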
A Bernoulli trial: each trial is independent from all the others, and the probability of success in any trial is θ. The number of successes r in n trials follows the binomial distribution:
b(r; n, θ) = C(n, r) θ^r (1 − θ)^{n−r}
Binomial distribution (e.g., the number of heads in a series of coin tosses). The parameters: N (number of trials) and θ (the probability of success of the event).
Multinomial distribution (e.g., the number of times each side of a die comes up in a set of rolls). The parameters: N (number of trials) and θ1, .., θk (the probability of success for each category).
P(W1 = n1, .., Wk = nk | N, θ1, .., θk) = N! / (n1! n2! .. nk!) · θ1^{n1} θ2^{n2} .. θk^{nk}
where Σ_{i=1}^{k} n_i = N and Σ_{i=1}^{k} θ_i = 1.
The coefficient N!/(n1! .. nk!) is the number of possible orderings of N balls.
We assume events (terms being generated) are independent.
A binomial distribution is the multinomial distribution with k = 2 and θ2 = 1 − θ1.
Each θ_i is estimated by Maximum Likelihood Estimation (MLE): θ̂_i = n_i / N.
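A direct implementation of the multinomial pmf (my own sketch, not from the slides) also confirms that the binomial is the k = 2 special case:

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    # P(W1=n1,..,Wk=nk | N, theta) = N!/(n1!..nk!) * theta1^n1 * .. * thetak^nk
    n = sum(counts)
    orderings = factorial(n) // prod(factorial(c) for c in counts)
    return orderings * prod(t ** c for c, t in zip(counts, thetas))

# Binomial b(r=3; n=5, theta=0.5) as a multinomial with k=2, theta2 = 1 - theta1:
print(multinomial_pmf([3, 2], [0.5, 0.5]))  # 0.3125 = C(5,3) * 0.5^5
```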
Two ways to model query generation, illustrated on document d = "text mining model clustering text model text ..." and query q = "text mining":
Multi-Bernoulli: flip a coin for each word in the vocabulary (e.g. H for "text", H for "mining", T for "model"):
p(q|d) = ∏_{w∈q} p(w|d) · ∏_{w∉q} (1 − p(w|d))
Multinomial: roll a die to choose each query word:
p(q|d) = ∏_{j=1}^{|V|} p(w_j|d)^{c(w_j, q)}
Adapted from Mei, Fang and Zhai's "A Study of Poisson Query Generation Model for Information Retrieval"
§ Issue: a single term t with P(t|Md) = 0 will make P(q|Md) zero.
§ Smooth the estimates to avoid zeros.
The Dirichlet distribution is called a conjugate prior for the multinomial likelihood, just as the beta distribution is the conjugate prior for the binomial:
Dir(θ | α1, .., αk) = Γ(α1 + .. + αk) / (Γ(α1) .. Γ(αk)) · ∏_{i=1}^{k} θ_i^{α_i − 1}
where Γ is the Gamma function.
Conjugacy means the posterior has the same form as the prior: if θ = (θ1, .., θk) has prior Dir(α1, .., αk) and we observe counts (n1, .., nk), the posterior is Dir(α1 + n1, .., αk + nk).
Jelinek-Mercer smoothing:
§ Also known as the mixture model: p(w|d) = λ p(w|Md) + (1 − λ) p(w|Mc).
§ Mixes the probability from the document with the general collection frequency of the word.
§ Correctly setting λ is very important for good performance.
§ High value of λ: conjunctive-like search, tends to retrieve documents containing all query words.
§ Low value of λ: more disjunctive, suitable for long queries.
Textbook slides for “Introduction to Information Retrieval” by Hinrich Schütze and Christina Lioma.
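Here is a minimal sketch (mine, not the slides') of query-likelihood scoring with the two smoothing strategies discussed here: Jelinek-Mercer interpolation, and the Dirichlet-prior estimate that follows from the conjugate prior above. The function names and the `collection_p` table (p(w|collection), assumed nonzero for every query word) are my own assumptions:

```python
import math

def jm_log_likelihood(query, doc_tf, doc_len, collection_p, lam=0.5):
    # p(w|d) = lam * p_ML(w|Md) + (1 - lam) * p(w|Mc)
    return sum(
        math.log(lam * doc_tf.get(w, 0) / doc_len + (1 - lam) * collection_p[w])
        for w in query
    )

def dirichlet_log_likelihood(query, doc_tf, doc_len, collection_p, mu=2000):
    # p(w|d) = (c(w,d) + mu * p(w|Mc)) / (|d| + mu)
    return sum(
        math.log((doc_tf.get(w, 0) + mu * collection_p[w]) / (doc_len + mu))
        for w in query
    )
```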
The Poisson query generation model views each term as arriving by a Poisson process, with the query as the receiver observed for duration |q|.
Document d: "text mining model mining text clustering text ..." gives rates of arrival λ_text = 3/7, λ_mining = 2/7, λ_model = 1/7, λ_clustering = 1/7.
Query q: "mining text mining systems".
p(q|d) = ∏_i e^{−λ_i |q|} (λ_i |q|)^{c(w_i, q)} / c(w_i, q)!
e.g. the "text" factor is e^{−(3/7)|q|} ((3/7)|q|)^1 / 1! and the "mining" factor is e^{−(2/7)|q|} ((2/7)|q|)^2 / 2!.
Slides adapted from Mei, Fang and Zhai's "A Study of Poisson Query Generation Model for Information Retrieval"
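To make the Poisson model concrete, a small sketch (mine) that scores the example query against the example document in log space:

```python
import math

rates = {"text": 3/7, "mining": 2/7, "model": 1/7, "clustering": 1/7}
query = ["mining", "text", "mining", "systems"]

def poisson_log_likelihood(query, rates):
    q_len = len(query)                            # duration |q|
    counts = {w: query.count(w) for w in set(query)}
    log_p = 0.0
    for w in set(rates) | set(counts):
        mean = rates.get(w, 0.0) * q_len          # expected arrivals lambda_i * |q|
        c = counts.get(w, 0)
        if mean == 0.0:
            if c > 0:                             # e.g. "systems": unsmoothed p(q|d) = 0
                return float("-inf")
            continue
        log_p += -mean + c * math.log(mean) - math.log(math.factorial(c))
    return log_p

print(poisson_log_likelihood(query, rates))      # -inf: "systems" needs smoothing
```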
How the three models compare:

                                  multi-Bernoulli      multinomial      Poisson
Event space                       appearance/absence   vocabulary       frequency
Model frequency?                  No                   Yes              Yes
Model length (document/query)?    No                   Implicitly yes   Yes
Works w/o sum-to-one constraint?  Yes                  No               Yes
Per-term smoothing                Easy                 Hard             Easy
Closed-form mixture solution?     No                   No               Yes
The three query likelihood formulas side by side:
Multinomial: p(q|d) = ∏_{j=1}^{|V|} p(w_j|d)^{c(w_j, q)}
Multi-Bernoulli: p(q|d) = ∏_{w∈q} p(w|d) · ∏_{w∉q} (1 − p(w|d))
Poisson: p(q|d) = ∏_{j=1}^{|V|} p(c(w_j, q) | d)
Slides adapted from Mei, Fang and Zhai's "A Study of Poisson Query Generation Model for Information Retrieval"
Language models are effective for IR:
§ Probabilities are inherently length-normalized when doing parameter estimation.
§ Mixing document and collection frequencies has an effect similar to idf: terms rare in the general collection, but common in some documents, will have a greater influence on the ranking.
In practice, language models perform comparably to BM25; added model complexity sometimes helps little (and sometimes worsens the performance).
Retrieval is evaluated with metrics such as Precision, Recall, nDCG, and MAP.
Modern approaches such as learning to rank and deep learning are driven by the use of large amounts of data from which models can be learned.
The pioneers of IR, by contrast, had a deep understanding of the problem, and they hand-crafted or imagined most of the models.
Later lectures (including Emine Yilmaz's) will introduce these topics in depth.
InfoSense, Georgetown University, USA. Contact: huiyang@cs.georgetown.edu