Advanced Topics in Information Retrieval
Vinay Setty (vsetty@mpi-inf.mpg.de) Jannik Strötgen (jtroetge@mpi-inf.mpg.de)
Agenda
Organization
What is this Course About?
What is Information Retrieval?
Retrieval Models
Link Analysis
Indexing & Query Processing
Elasticsearch
Organization
Office: 0.01 in building E 1.7 (MMCI)
Contact: by email (no fixed office hours)
Background Literature
C. D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008, http://www.informationretrieval.org
S. Büttcher, C. L. A. Clarke, G. V. Cormack, Information Retrieval: Implementing and Evaluating Search Engines, MIT Press, 2010
R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, 2011
Required Background Knowledge
Exercise Sheets
Submit your solutions in readable form (best: typeset using LaTeX; worst: scans of your handwriting)
Tutorials
Requirements for 6 ECTS
Solve the exercise problems; bonus points improve your final grade (one grade step per bonus point, at most three in total, at most one per session)
Registration & Password
Materials are in a password-protected area on the course website. The password is announced in the first lecture; if you missed it, send an email to atir16@mpi-inf.mpg.de
Questions/Suggestions?
Agenda
What is this Course About?
Agenda
Information Need
(Examples of information needs, annotated: Ambiguous, Knowledge Base, External Service, Computation, Temporal QA?)
Documents
Webpages, news articles, etc.
Social media: tweets, forums, Facebook statuses, etc.
Books, journals, scholarly articles, etc.
Files/data on your personal computer
Apps/data on your smartphones
Knowledge graphs
What is Information Retrieval?
"Information retrieval is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (usually a query) from within large collections (usually stored on computers)." – Manning et al.
Information Retrieval
Information need → Query Representation
Documents → Document Representation
How to match?
Information Retrieval in a Nutshell
Crawl: strategies for crawl schedule and priority queue for crawl frontier
Extract & Clean: handle dynamic pages, detect duplicates, detect spam
Preprocess: tokenization, stop word removal, stemming, lemmatization
Index: build and analyze web graph, index all tokens, compress index
Process Queries: fast top-k queries, query logging, auto-completion
Rank: scoring function based on content and context criteria, diversification
Present: GUI, user guidance, personalization, visualization
Documents & Queries
Original text: Investigators entered the company's HQ located in Boston MA on Thursday
After preprocessing: investig enter compani hq locat boston ma thursdai
More in next lecture!
Agenda
25
Retrieval Models
A retrieval model specifies, for a document collection D and a query q, which documents to return and in which order (ranking model)
Boolean Retrieval
Documents are retrieved if they exactly match the Boolean query; extensions (e.g., restricting terms to document fields) with rudimentary ranking (e.g., weighted matches) exist
Processing Boolean Queries
How to process the query: Frodo AND Sam AND NOT Gollum?
Take the bitwise AND of the rows for Frodo and Sam and the complement of the row for Gollum; documents whose resulting bit is 1 are relevant.
Term-document incidence matrix (d1 ... d6):
Frodo 1 1 1
Sam 1 1 1 1 1
Gollum 1
Saruman 1
Gandalf 1 1 1 1 1
Sauron 1 1 1 1
d1 and d4 are the relevant documents for Frodo AND Sam AND NOT Gollum
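The bitwise processing described above can be sketched in Python. Since the column alignment of the incidence matrix was lost, the postings below are hypothetical, chosen only to be consistent with the stated answer (d1 and d4):

```python
# Boolean retrieval via bitwise operations on term incidence vectors.
# The postings are hypothetical examples consistent with the slide's
# answer, not the slide's exact matrix.
docs = ["d1", "d2", "d3", "d4", "d5", "d6"]

def bits(postings):
    """Encode a set of documents as an incidence bit vector."""
    return sum(1 << docs.index(d) for d in postings)

frodo = bits(["d1", "d3", "d4"])
sam = bits(["d1", "d2", "d4", "d5", "d6"])
gollum = bits(["d3"])

universe = (1 << len(docs)) - 1  # all documents, used for NOT

# Frodo AND Sam AND NOT Gollum
result = frodo & sam & (universe & ~gollum)
relevant = [d for i, d in enumerate(docs) if (result >> i) & 1]
print(relevant)  # ['d1', 'd4']
```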
Vector Space Model
Documents and queries are represented as vectors in a common high-dimensional vector space; the similarity of query q and document d is the cosine of the angle between them:
sim(q, d) = (q · d) / (‖q‖ ‖d‖) = Σ_v q_v d_v / ( √(Σ_v q_v²) · √(Σ_v d_v²) )
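The cosine similarity formula translates directly into a few lines of Python; the example vectors here are hypothetical, not taken from the slides:

```python
import math

# Cosine similarity between sparse term-weight vectors, matching the
# formula above; vectors are dicts mapping term -> weight.
def cosine(q, d):
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

q = {"frodo": 1.0, "sam": 1.0}
d = {"frodo": 0.5, "sam": 0.5, "gollum": 0.7}
print(round(cosine(q, d), 3))  # ≈ 0.711
```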
tf.idf
Terms that occur in many documents are less discriminative and receive lower weight:
d_v = tf(v, d) · log( |D| / df(v) )
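A sketch of the tf.idf weight over a toy collection (the documents are hypothetical):

```python
import math

# tf.idf weighting as in the formula above: the weight of term v in
# document d is tf(v, d) * log(|D| / df(v)).
docs = [
    ["frodo", "sam", "frodo"],
    ["sam", "gollum"],
    ["gandalf"],
    ["frodo", "gandalf"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                          # term frequency in doc
    df = sum(1 for d in docs if term in d)        # document frequency
    return tf * math.log(len(docs) / df)

# "frodo" occurs twice in docs[0] and in 2 of the 4 documents:
print(tf_idf("frodo", docs[0], docs))  # 2 * log(4/2)
```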
Statistical Language Models
A statistical language model assigns probabilities to word sequences, e.g., to tell a plausible sentence from an implausible one ("it is hard to recognize speech" vs. "it is hard to wreck a nice beach")
Language Model of a Document
Documents are seen as elements from a formal language (e.g., sequences of words). Each document is assumed to have been generated by a language model, and its contents can be used to estimate the model's parameters. The maximum-likelihood estimate is the most natural one:
P[v | θd] = tf(v, d) / Σ_w tf(w, d)
Example document: a b a c a a a c a b b b b a a c b a a a a a a a a
P[a | θd] = 16/25    P[b | θd] = 6/25    P[c | θd] = 3/25
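The maximum-likelihood estimate for the example document can be reproduced with a word counter:

```python
from collections import Counter

# Maximum-likelihood estimate of a document's language model:
# P[v | theta_d] = tf(v, d) / sum_w tf(w, d), for the example document.
doc = "a b a c a a a c a b b b b a a c b a a a a a a a a".split()
tf = Counter(doc)
total = sum(tf.values())
lm = {v: count / total for v, count in tf.items()}
print(lm["a"], lm["b"], lm["c"])  # 0.64 0.24 0.12  (= 16/25, 6/25, 3/25)
```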
Unigram Language Models
A unigram language model assumes independence between words, i.e., it is a bag-of-words way of representing text.

Words     M1      M2
the       0.2     0.15
a         0.1     0.12
Frodo     0.01    0.0002
Sam       0.01    0.0001
said      0.03    0.03
likes     0.02    0.04
that      0.04    0.04
Rosie     0.005   0.01
Gandalf   0.003   0.015
Saruman   0.001   0.002
...       ...     ...

P(Frodo said that Sam likes Rosie) = P(Frodo) · P(said) · P(that) · P(Sam) · P(likes) · P(Rosie)

s    Frodo    said   that   Sam      likes   Rosie
M1   0.01     0.03   0.04   0.01     0.02    0.005
M2   0.0002   0.03   0.04   0.0001   0.04    0.01

P(s | M1) = 0.000000000012
P(s | M2) = 0.0000000000000096
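The two sentence probabilities can be checked by multiplying the word probabilities from the table:

```python
# Sentence probability under the unigram models M1 and M2 from the
# table above: the product of the individual word probabilities.
m1 = {"Frodo": 0.01, "said": 0.03, "that": 0.04,
      "Sam": 0.01, "likes": 0.02, "Rosie": 0.005}
m2 = {"Frodo": 0.0002, "said": 0.03, "that": 0.04,
      "Sam": 0.0001, "likes": 0.04, "Rosie": 0.01}

def prob(sentence, model):
    p = 1.0
    for word in sentence.split():
        p *= model[word]
    return p

s = "Frodo said that Sam likes Rosie"
print(prob(s, m1))  # ≈ 1.2e-11
print(prob(s, m2))  # ≈ 9.6e-15
```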
Zero Probability Problem
Under query generation, a single unseen query term makes the probability of the entire query zero. With the unigram models M1 and M2 from above ("Gollum" has zero probability in both):
P("Frodo", "Gollum" | M1) = 0.01 · 0 = 0
P("Frodo", "Gollum" | M2) = 0.0002 · 0 = 0
Smoothing
Smoothing assigns non-zero probability to unseen words to avoid zero probabilities. It also has an idf-like effect, since more common terms now have higher probability for all documents. Estimating a language model only from the document itself bears the risk of overfitting to this very limited sample, so smoothing mixes in the entire document collection as a background model.
Jelinek-Mercer Smoothing
Uses a linear combination of document and collection statistics to estimate term probabilities:
P[v | θd] = α · tf(v, d) / Σ_w tf(w, d) + (1 − α) · tf(v, D) / Σ_w tf(w, D)
The first summand is the contribution of document d; the second is the corpus contribution, based on the collection frequency of the term in the entire collection D.
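A minimal sketch of Jelinek-Mercer smoothing over a hypothetical two-document collection (the documents and α are illustrative, not from the slides):

```python
from collections import Counter

# Jelinek-Mercer smoothing: linear interpolation between the document
# model and a collection-wide background model with weight alpha.
collection = [
    "frodo sam frodo gollum".split(),
    "sam gandalf sam".split(),
]
coll_tf = Counter(w for doc in collection for w in doc)
coll_len = sum(coll_tf.values())

def p_jm(v, doc, alpha=0.5):
    doc_tf = Counter(doc)
    p_doc = doc_tf[v] / len(doc)      # document contribution
    p_coll = coll_tf[v] / coll_len    # corpus contribution
    return alpha * p_doc + (1 - alpha) * p_coll

d = collection[1]
print(p_jm("gandalf", d))  # seen in d
print(p_jm("frodo", d))    # unseen in d, but still non-zero
```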
Dirichlet Smoothing
P[v | θd] = ( tf(v, d) + μ · tf(v, D) / Σ_w tf(w, D) ) / ( Σ_w tf(w, d) + μ )
The term frequency of the word in the document is smoothed with the collection language model (the LM built from the entire collection), weighted by the Dirichlet prior μ.
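Dirichlet smoothing can be sketched the same way; here the collection acts like a pseudo-document of μ words mixed into the document's counts (toy collection and μ are illustrative assumptions):

```python
from collections import Counter

# Dirichlet smoothing: document counts plus mu "pseudo-counts"
# distributed according to the collection language model.
collection = [
    "frodo sam frodo gollum".split(),
    "sam gandalf sam".split(),
]
coll_tf = Counter(w for doc in collection for w in doc)
coll_len = sum(coll_tf.values())

def p_dirichlet(v, doc, mu=10.0):
    doc_tf = Counter(doc)
    p_coll = coll_tf[v] / coll_len
    return (doc_tf[v] + mu * p_coll) / (len(doc) + mu)

d = collection[1]
vocab = set(coll_tf)
# The smoothed estimates form a proper distribution over the vocabulary:
print(sum(p_dirichlet(v, d) for v in vocab))
```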
Query Likelihood vs. Divergence
Query likelihood ranks documents according to the probability that their language model generates the query; alternatively, documents can be ranked by the Kullback-Leibler divergence between the query language model and the language models estimated from documents:
P[q | θd] ∝ Π_{v ∈ q} P[v | θd]
KL(θq ‖ θd) = Σ_v P[v | θq] · log( P[v | θq] / P[v | θd] )
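Both scores can be computed side by side; the query and document language models below are small hypothetical distributions over a toy vocabulary:

```python
import math

# Query likelihood vs. KL divergence for hypothetical models.
theta_q = {"frodo": 0.5, "sam": 0.5}
theta_d = {"frodo": 0.4, "sam": 0.3, "gollum": 0.3}
query = ["frodo", "sam"]

# Query likelihood: product of P[v | theta_d] over the query terms.
ql = math.prod(theta_d[v] for v in query)

# KL divergence: only terms with non-zero query probability contribute.
kl = sum(p * math.log(p / theta_d[v]) for v, p in theta_q.items())

print(ql, kl)
```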
Agenda
Link Analysis
Link analysis exploits the structure of the web graph to determine characteristics of individual web pages. Example graph on vertices {1, 2, 3, 4} with adjacency matrix A (entry (u, v) is 1 if there is an edge from u to v):

A = [ 0 1 0 1
      0 0 0 1
      1 0 0 1
      0 0 1 0 ]
PageRank
PageRank models a random walk over the web graph: with probability (1 − ε) the surfer follows a random outgoing link of the current page, and with probability ε jumps to a random page. The PageRank of a page is its stationary visiting probability:
p(v) = (1 − ε) · Σ_{(u,v) ∈ E} p(u) / out(u) + ε / |V|
where out(u) denotes the out-degree of u.
PageRank
The PageRank vector is the dominant eigenvector π of the transition probability matrix P, which can be computed using the power-iteration method. With ε = 0.2 for the example graph:

P = [ 0.05 0.45 0.05 0.45
      0.05 0.05 0.05 0.85
      0.45 0.05 0.05 0.45
      0.05 0.05 0.85 0.05 ]

π(0) = [0.25 0.25 0.25 0.25]
π(1) = [0.15 0.15 0.25 0.45]
π(2) = [0.15 0.11 0.41 0.33]
...
π(10) = [0.18 0.12 0.34 0.36]
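The power iteration can be reproduced in a few lines, using the transition probability matrix P from the example:

```python
# Power iteration for PageRank on the example graph with epsilon = 0.2.
P = [
    [0.05, 0.45, 0.05, 0.45],
    [0.05, 0.05, 0.05, 0.85],
    [0.45, 0.05, 0.05, 0.45],
    [0.05, 0.05, 0.85, 0.05],
]

pi = [0.25, 0.25, 0.25, 0.25]  # pi(0): uniform start
for _ in range(10):
    # pi(k+1) = pi(k) * P  (vector-matrix product)
    pi = [sum(pi[u] * P[u][v] for u in range(4)) for v in range(4)]

print([round(x, 2) for x in pi])  # [0.18, 0.12, 0.34, 0.36]
```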
HITS
HITS (Hyperlink-Induced Topic Search) takes a keyword query and considers the subgraph of the web graph induced by the query's result pages. Hub scores h and authority scores a are the dominant eigenvectors of the co-citation matrix AAᵀ and the co-reference matrix AᵀA:
h = α β A Aᵀ h    a = α β Aᵀ A a
h(u) ∝ Σ_{(u,v) ∈ E} a(v)    a(v) ∝ Σ_{(u,v) ∈ E} h(u)
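One way to compute the mutually reinforcing scores is repeated alternation with normalization; the edge list here is a small hypothetical graph, not the slides' exact example:

```python
import math

# Iterative HITS on a hypothetical 4-node graph: a(v) sums h(u) over
# in-links, h(u) sums a(v) over out-links, normalized every round.
edges = [(1, 2), (1, 4), (2, 4), (3, 1), (3, 4), (4, 3)]
nodes = sorted({n for e in edges for n in e})

hub = {n: 1.0 for n in nodes}
auth = {n: 1.0 for n in nodes}

for _ in range(50):
    auth = {v: sum(hub[u] for u, w in edges if w == v) for v in nodes}
    hub = {u: sum(auth[v] for w, v in edges if w == u) for u in nodes}
    norm_a = math.sqrt(sum(x * x for x in auth.values()))
    norm_h = math.sqrt(sum(x * x for x in hub.values()))
    auth = {n: x / norm_a for n, x in auth.items()}
    hub = {n: x / norm_h for n, x in hub.items()}

# Node 4 collects links from 1, 2, and 3, so it gets the top authority.
print(max(auth, key=auth.get))
```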
Agenda
Indexing & Query Processing
Retrieval models specify which documents are relevant to a query but not how they can be identified efficiently. Index structures are a key component of information retrieval systems; variants of the inverted index are by far most common. Query processing strategies operate on these structures to compute query results (e.g., term-at-a-time, document-at-a-time).
Inverted Index
For each term, the index keeps a posting list, possibly with additional structure (e.g., to support skipping) and per-posting payloads (e.g., term frequency, tf.idf score contribution, term offsets).
Dictionary (a ... z) → posting lists, e.g., for the term "giants": (d123, 2, [4, 14]), (d125, 2, [1, 4]), (d227, 1, [6])
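A minimal in-memory sketch of such an index, with per-posting term frequency and offsets; the two documents are hypothetical examples, not the slides' collection:

```python
from collections import defaultdict

# Inverted index: per term, a posting list of
# (document id, term frequency, term offsets).
def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        offsets = defaultdict(list)
        for pos, term in enumerate(text.split()):
            offsets[term].append(pos)
        for term, offs in sorted(offsets.items()):
            index[term].append((doc_id, len(offs), offs))
    return index

docs = {
    "d123": "the giants won the giants lost",
    "d125": "giants giants everywhere",
}
index = build_index(docs)
print(index["giants"])  # [('d123', 2, [1, 4]), ('d125', 2, [0, 1])]
```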
Term-at-a-Time
Processes the posting lists of the query terms one at a time, maintaining a score accumulator for each document. After processing the first k query terms, the accumulator holds
acc(d) = Σ_{i=1..k} score(q_i, d)
Example posting lists: a → (d1, 0.2), (d3, 0.1), (d5, 0.5); b → (d5, 0.3), (d7, 0.2)
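A sketch of term-at-a-time processing; the split of the flattened example postings between terms a and b is an assumption:

```python
from collections import defaultdict

# Term-at-a-time: walk one posting list after the other, summing
# score contributions into per-document accumulators.
postings = {
    "a": [("d1", 0.2), ("d3", 0.1), ("d5", 0.5)],
    "b": [("d5", 0.3), ("d7", 0.2)],
}

def taat(query, postings, k=2):
    acc = defaultdict(float)  # acc(d) = sum of score(q_i, d) so far
    for term in query:
        for doc, score in postings.get(term, []):
            acc[doc] += score
    return sorted(acc.items(), key=lambda item: -item[1])[:k]

print(taat(["a", "b"], postings))  # d5 ranks first with 0.5 + 0.3 = 0.8
```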
Document-at-a-Time
Considers one document at a time, i.e., moves through all query terms' posting lists in parallel, determines each document's complete score, and decides whether it belongs into the top-k (posting lists sorted by document identifier required).
Example posting lists: a → (d1, 0.2), (d3, 0.1), (d5, 0.5); b → (d5, 0.3), (d7, 0.2)
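A sketch of document-at-a-time processing with a min-heap for the top-k; again, the split of the flattened example postings between terms a and b is an assumption:

```python
import heapq

# Document-at-a-time: merge the docid-sorted posting lists of all
# query terms, finish each document's score before moving on, and
# keep only the best k results in a min-heap.
postings = {
    "a": [("d1", 0.2), ("d3", 0.1), ("d5", 0.5)],
    "b": [("d5", 0.3), ("d7", 0.2)],
}

def daat(query, postings, k=2):
    merged = heapq.merge(*(postings.get(t, []) for t in query))
    top = []  # min-heap of (score, doc)
    current, score = None, 0.0
    for doc, s in merged:
        if doc != current:
            if current is not None:
                heapq.heappush(top, (score, current))
                if len(top) > k:
                    heapq.heappop(top)  # drop the (k+1)-th best
            current, score = doc, 0.0
        score += s
    if current is not None:
        heapq.heappush(top, (score, current))
        if len(top) > k:
            heapq.heappop(top)
    return sorted(top, reverse=True)

print(daat(["a", "b"], postings))  # d5 ranks first with 0.5 + 0.3 = 0.8
```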
Agenda
Elasticsearch
Elasticsearch is a distributed, open-source search and analytics engine built on top of Apache Lucene.
Elasticsearch Installation
Download and unpack elasticsearch-2.3.1.tar.gz; start the server with bin/elasticsearch.
Elasticsearch - Indexing
Create the index first, then add documents:

$ curl -XPUT 'http://localhost:9200/atirtest'

$ curl -XPUT 'http://localhost:9200/atirtest/doc/1' -d '{
  "title" : "Ecuador earthquake: Aid agencies step up efforts",
  "pub_date" : 1461230434627,
  "content" : "Aid agencies are stepping up help following Saturdays devastating earthquake in Ecuador, amid concerns over the conditions faced by survivors."
}'

$ curl -XPUT 'http://localhost:9200/atirtest/doc/2' -d '{
  "title" : "Syria conflict: Air strikes on Idlib markets kill dozens",
  "pub_date" : 1461230434457,
  "content" : "At least 44 people have been killed and dozens hurt in Syrian government air strikes on markets in two rebel-held towns in Idlib province, activists say."
}'
Elasticsearch - Boolean Queries
curl -XGET 'localhost:9200/atirtest/_search?pretty' -d '{
  "query": {
    "query_string": {
      "default_field": "content",
      "query": "earthquake AND Ecuador AND NOT Syrian"
    }
  }
}'
Language Analyzers
cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.
English Analyzer Example
{
  "settings": {
    "analysis": {
      "filter": {
        "stop_filter": {
          "type": "stop",
          "stopwords": ["_english_"]
        },
        "custom_english_stemmer": {
          "type": "stemmer",
          "name": "minimal_english"
        }
      },
      "analyzer": {
        "custom_lowercase_stemmed": {
          "tokenizer": "standard",
          "filter": ["lowercase", "stop_filter", "custom_english_stemmer"]
        }
      }
    }
  }
}

The lowercase filter runs first, so stop-word removal and stemming see normalized tokens.
Elasticsearch - TF-IDF
curl -XGET 'localhost:9200/atirtest/_search?pretty' -d '{
  "query": {
    "match": { "content": "Earthquake in Ecuador" }
  }
}'
Okapi BM25 in Elasticsearch
curl -XPUT 'localhost:9200/atirtest/' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "string", "similarity": "BM25" },
        "pub_date": { "type": "date" },
        "content": { "type": "string", "similarity": "BM25" }
      }
    }
  }
}'
Caveats
Queries fail with hard-to-interpret errors when a comma is missing or a bracket not closed.
More details: introduction.html
Credits
Thanks to Prof. Klaus Berberich and Dr. Avishek Anand for allowing us to use their slides!