Axiomatic Analysis and Optimization of Information Retrieval Models
ChengXiang (“Cheng”) Zhai
Department of Computer Science University of Illinois at Urbana-Champaign http://www.cs.uiuc.edu/homes/czhai
1
Search is everywhere, and part of everyone’s life
2
Web Search, Desk Search, Site Search, Enterprise Search, Social Media Search
Search accuracy matters!
3
Sources: Google, Twitter: http://www.statisticbrain.com/ PubMed: http://www.ncbi.nlm.nih.gov/About/tools/restable_stat_pubmed.html
Queries per day, and total time saved per day if each query is answered 1 sec or 10 sec faster:
– Google: 4,700,000,000 queries/day → ~1,300,000 hrs (× 1 sec) / ~13,000,000 hrs (× 10 sec)
– Twitter: 1,600,000,000 queries/day → ~440,000 hrs (× 1 sec) / ~4,400,000 hrs (× 10 sec)
– PubMed: 2,000,000 queries/day → ~550 hrs (× 1 sec) / ~5,500 hrs (× 10 sec)
How can we improve all search engines in a general way?
Behind all the search boxes…
4
[Diagram] A search engine takes a query q and a document collection; a retrieval model computes Score(q, d) for each document d and returns a ranked list, with support from natural language processing and machine learning.
How can we optimize a retrieval model?
Retrieval model = computational definition of “relevance”
5
S(“computer science CMU”, d) is assembled from s(“computer”, d), s(“science”, d), and s(“CMU”, d), where each term score asks:
– How many times does “computer” occur in d? Term Frequency (TF): c(“computer”, d)
– How long is d? Document length: |d|
– How often do we see “computer” in the entire collection? Document Frequency: df(“computer”), or P(“computer”|Collection)
Scoring based on bag of words in general
6
In general, the score is a sum over matched query terms:

$$s(q,d) = f\Big(\sum_{w \in q \cap d} \mathrm{weight}(w,q,d),\; a(q,d)\Big)$$

$$\mathrm{weight}(w,q,d) = g\big[c(w,q),\, c(w,d),\, |d|,\, df(w) \text{ or } p(w|C)\big]$$

– Sum over matched query terms (w ∈ q ∩ d)
– c(w,d): Term Frequency (TF)
– df(w) or p(w|C): Inverse Document Frequency (IDF)
– |d|: document length
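To make this general form concrete, here is a minimal Python sketch of a bag-of-words scorer; the names (`CollectionStats`, `term_weight`, `tf_idf_weight`) are illustrative placeholders, not part of any system described in the talk, and the TF-IDF instantiation of g is just one of many possible choices.

```python
import math
from collections import Counter
from dataclasses import dataclass

@dataclass
class CollectionStats:
    N: int          # number of documents in the collection
    df: dict        # document frequency of each term
    avdl: float     # average document length

def score(query_terms, doc_terms, stats, term_weight):
    """Generic bag-of-words score: a sum over matched query terms, where
    term_weight plays the role of g[c(w,q), c(w,d), |d|, df(w)]."""
    cq, cd = Counter(query_terms), Counter(doc_terms)
    dl = len(doc_terms)
    return sum(term_weight(cq[w], cd[w], dl, stats.df.get(w, 1), stats)
               for w in cq.keys() & cd.keys())   # matched query terms only

def tf_idf_weight(c_wq, c_wd, dl, df_w, stats):
    """One simple instantiation of g: raw query TF x log TF x IDF."""
    return c_wq * (1 + math.log(c_wd)) * math.log((stats.N + 1) / df_w)
```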
Improving retrieval models is a long-standing challenge
1975], [Robertson & Sparck Jones 1976], [van Rijsbergen 1977], [Robertson 1977], [Robertson et al. 1981], [Robertson & Walker 1994], …
[Zhai & Lafferty 2001], [Lavrenko & Croft 2001], [Kurland & Lee 2004], …
1991], …
& Ounis 2005], …
7
Many different models were proposed and tested
Some are working very well (equally well)
Croft 98], [Zhai & Lafferty]
8
but many others failed to work well…
State of the art retrieval models

Pivoted length normalization (PIV):
$$\sum_{w \in q \cap d} \frac{1 + \ln\big(1 + \ln(c(w,d))\big)}{(1-s) + s\,\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$$

Query likelihood with Dirichlet prior smoothing (DIR):
$$\sum_{w \in q \cap d} c(w,q) \cdot \ln\Big(1 + \frac{c(w,d)}{\mu\, p(w|C)}\Big) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$$

Okapi BM25:
$$\sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_3+1)\,c(w,q)}{k_3 + c(w,q)} \cdot \frac{(k_1+1)\,c(w,d)}{k_1\big((1-b) + b\,\frac{|d|}{avdl}\big) + c(w,d)}$$

PL2 is a bit more complicated, but implements similar heuristics
9
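For reference, the three formulas transcribe directly into code. This is a sketch only: the function names and default parameter values (k1, b, k3, s, μ) are mine, not prescribed by the slides.

```python
import math

def piv_term_weight(c_wq, c_wd, dl, df_w, N, avdl, s=0.2):
    """Pivoted length normalization (PIV), per matched term."""
    tf = (1 + math.log(1 + math.log(c_wd))) / ((1 - s) + s * dl / avdl)
    return tf * c_wq * math.log((N + 1) / df_w)

def dir_term_weight(c_wq, c_wd, p_w_C, mu=2000):
    """Query likelihood with Dirichlet prior (DIR), per matched term.
    The document-level part |q| * ln(mu / (|d| + mu)) is added once per document."""
    return c_wq * math.log(1 + c_wd / (mu * p_w_C))

def dir_doc_part(query_len, dl, mu=2000):
    return query_len * math.log(mu / (dl + mu))

def bm25_term_weight(c_wq, c_wd, dl, df_w, N, avdl, k1=1.2, b=0.75, k3=1000):
    """Okapi BM25, per matched term."""
    idf = math.log((N - df_w + 0.5) / (df_w + 0.5))
    qtf = (k3 + 1) * c_wq / (k3 + c_wq)
    tf = (k1 + 1) * c_wd / (k1 * ((1 - b) + b * dl / avdl) + c_wd)
    return idf * qtf * tf
```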
Questions
– Why do these models tend to perform similarly even though they were derived in very different ways?
– Why has it been so hard to beat these strong baseline methods?
– Are they hitting the ceiling of the bag-of-words assumption?
– If yes, how can we prove it? – If not, how can we find a more effective one?
10
Suggested Answers
– Why do these models tend to perform similarly even though they were derived in very different ways? They share some nice common properties; these properties are more important than how each model is derived.
– Why has it been so hard to beat these strong baseline methods? We don't have good knowledge about their deficiencies; other variants don't have all the “nice properties”.
– Are they hitting the ceiling of the bag-of-words assumption? If yes, how can we prove it? If not, how can we find a more effective one? We need to formally define “the ceiling” (= the complete set of “nice properties”).
11
Main Point of the Talk: Axiomatic Relevance Hypothesis (ARH)
– Relevance can be modeled by a set of formally defined constraints on a retrieval function
– If a function satisfies all the constraints, it will perform well empirically
– If function Fa satisfies more constraints than function Fb, Fa would perform better than Fb empirically
– Given a set of relevance constraints C = {c1, …, ck}, function Fa is analytically more effective than function Fb iff the set of constraints satisfied by Fb is a proper subset of those satisfied by Fa
– A function F is optimal iff it satisfies all the constraints in C
12
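The "analytically more effective" relation is simply a proper-subset comparison between the sets of satisfied constraints; the tiny sketch below (with hypothetical constraint labels) makes the definition operational.

```python
def analytically_more_effective(satisfied_a: set, satisfied_b: set) -> bool:
    """Fa is analytically more effective than Fb iff the constraints satisfied
    by Fb are a proper subset of those satisfied by Fa."""
    return satisfied_b < satisfied_a   # proper-subset test on sets

# Hypothetical example: Fa satisfies four constraints, Fb only two of them.
print(analytically_more_effective({"TFC1", "TFC2", "LNC1", "LNC2"}, {"TFC1", "LNC1"}))  # True
```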
Rest of the Talk
models
retrieval model
13
Outline
models
retrieval model
14
Motivation: different models, but similar heuristics

Pivoted length normalization (PIV):
$$\sum_{w \in q \cap d} \frac{1 + \ln\big(1 + \ln(c(w,d))\big)}{(1-s) + s\,\frac{|d|}{avdl}} \cdot c(w,q) \cdot \ln\frac{N+1}{df(w)}$$

Query likelihood with Dirichlet prior smoothing (DIR):
$$\sum_{w \in q \cap d} c(w,q) \cdot \ln\Big(1 + \frac{c(w,d)}{\mu\, p(w|C)}\Big) + |q| \cdot \ln\frac{\mu}{|d| + \mu}$$

Okapi BM25:
$$\sum_{w \in q \cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_3+1)\,c(w,q)}{k_3 + c(w,q)} \cdot \frac{(k_1+1)\,c(w,d)}{k_1\big((1-b) + b\,\frac{|d|}{avdl}\big) + c(w,d)}$$

All three combine Term Frequency (TF), Inverse Document Frequency (IDF), and document length normalization.
Parameter sensitivity
PL2 is a bit more complicated, but implements similar heuristics
15
Are they performing well because they implement similar retrieval heuristics? Can we formally capture these necessary retrieval heuristics?
16
Term Frequency Constraints (TFC1)

TF weighting heuristic I: Give a higher score to a document with more occurrences of a query term.

Let q be a query with only one term w (q: w). If |d1| = |d2| and c(w, d1) > c(w, d2), then f(d1, q) > f(d2, q).
17
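TFC1 can be checked numerically against any candidate scoring function. The sketch below uses the pivoted TF component as a stand-in scorer for a one-term query; all numbers are made up for illustration.

```python
import math

def piv_tf(c_wd, dl, avdl=100.0, s=0.2):
    """Pivoted TF component; stands in for f(d, q) when q has a single term."""
    return (1 + math.log(1 + math.log(c_wd))) / ((1 - s) + s * dl / avdl)

# TFC1: if |d1| = |d2| and c(w,d1) > c(w,d2), then f(d1,q) > f(d2,q).
dl = 100
assert piv_tf(c_wd=5, dl=dl) > piv_tf(c_wd=2, dl=dl)   # holds for this scorer
```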
Term Frequency Constraints (TFC2)

TF weighting heuristic II: Favor a document with more distinct query terms.

Let q be a query and w1, w2 be two query terms (q: w1 w2). Assume |d1| = |d2| and idf(w1) = idf(w2). If c(w1, d2) = c(w1, d1) + c(w2, d1), c(w2, d2) = 0, c(w1, d1) ≠ 0, and c(w2, d1) ≠ 0, then f(d1, q) > f(d2, q).
18
Length Normalization Constraints (LNCs)

Document length normalization heuristic: Penalize long documents (LNC1); avoid over-penalizing long documents (LNC2).

LNC1: Let q be a query. If, for some word w ∉ q, c(w, d2) = c(w, d1) + 1, but for all other words w, c(w, d2) = c(w, d1), then f(d1, q) ≥ f(d2, q).

LNC2: Let q be a query. ∀k > 1, if |d1| = k · |d2| and, for all words w, c(w, d1) = k · c(w, d2), then f(d1, q) ≥ f(d2, q).
19
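A small numeric probe (made-up numbers, one-term query) shows how LNC2 constrains the pivoted normalization parameter s: concatenating a document with itself (k = 2) must not lower its score, which holds for small s but fails for large s, foreshadowing the parameter bound discussed later.

```python
import math

def piv_score(c_wd, dl, c_wq=1, N=1_000_000, df_w=1000, avdl=100.0, s=0.2):
    tf = (1 + math.log(1 + math.log(c_wd))) / ((1 - s) + s * dl / avdl)
    return tf * c_wq * math.log((N + 1) / df_w)

# LNC2 with k = 2: d1 is d2 concatenated with itself,
# so |d1| = 2|d2| and c(w,d1) = 2*c(w,d2); require f(d1,q) >= f(d2,q).
for s in (0.05, 0.2, 0.8):
    f_d2 = piv_score(c_wd=1, dl=50, s=s)
    f_d1 = piv_score(c_wd=2, dl=100, s=s)
    print(f"s={s}: LNC2 satisfied? {f_d1 >= f_d2}")   # True, True, False
```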
20
TF-LENGTH Constraint (TF-LNC)

TF-LN heuristic: Regularize the interaction of TF and document length.

Let q be a query with only one term w (q: w). If |d1| = |d2| + c(w, d1) − c(w, d2) and c(w, d1) > c(w, d2), then f(d1, q) > f(d2, q).
Seven Basic Relevance Constraints
[Fang et al. 2011]
Hui Fang, Tao Tao, ChengXiang Zhai: Diagnostic Evaluation of Information Retrieval Models. ACM Transactions on Information Systems (TOIS), 2011.
21
Outline
models
retrieval model
22
Axiomatic Relevance Hypothesis (ARH)
– Relevance can be modeled by a set of formally defined constraints on a retrieval function
– If a function satisfies all the constraints, it will perform well empirically
– If function Fa satisfies more constraints than function Fb, Fa would perform better than Fb empirically
– Given a set of relevance constraints C = {c1, …, ck}, function Fa is analytically more effective than function Fb iff the set of constraints satisfied by Fb is a proper subset of those satisfied by Fa
– A function F is optimal iff it satisfies all the constraints in C
23
Testing the Axiomatic Relevance Hypothesis
– Is satisfying these constraints correlated with good empirical performance of a retrieval function?
– Can we use constraint analysis to compare retrieval functions without experimentation?
– Constraint analysis reveals optimal ranges of parameter values
– When a formula does not satisfy a constraint, it often indicates non-optimality of the formula
– Violation of constraints may pinpoint where a formula needs to be improved.
24
Bounding Parameters
Constraint analysis (LNC2) bounds the pivoted-normalization parameter: s < 0.4; empirically, the optimal s (for average precision) indeed falls below 0.4.
[Figure: parameter sensitivity of s]
25
Analytical Comparison

The IDF part of Okapi BM25, $\ln\frac{N - df(w) + 0.5}{df(w) + 0.5}$, becomes negative when df(w) is large, causing BM25 to violate many constraints.

[Figure: retrieval performance of Pivoted vs. Okapi as a function of s or b, for keyword and verbose queries]
26
Fixing a deficiency in BM25 improves the effectiveness
Make Okapi satisfy more constraints; expected to help verbose queries
Replacing the original IDF in Okapi BM25 with the pivoted-style IDF $\ln\frac{N+1}{df(w)}$ yields a modified Okapi:

$$\sum_{w \in q \cap d} \ln\frac{N+1}{df(w)} \cdot \frac{(k_3+1)\,c(w,q)}{k_3 + c(w,q)} \cdot \frac{(k_1+1)\,c(w,d)}{k_1\big((1-b) + b\,\frac{|d|}{avdl}\big) + c(w,d)}$$

[Figure: performance of Pivoted, Okapi, and Modified Okapi as a function of s or b, for keyword and verbose queries]
27
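The deficiency is easy to see numerically. This sketch compares the original Okapi IDF with the pivoted-style replacement; once a term appears in more than half the documents, the original IDF goes negative, so additional matches of a very common query term can lower a document's score.

```python
import math

def okapi_idf(N, df):
    return math.log((N - df + 0.5) / (df + 0.5))

def modified_idf(N, df):        # pivoted-style IDF used in the modified Okapi
    return math.log((N + 1) / df)

N = 1_000_000
for df in (100, 400_000, 700_000):      # rare, common, very common term
    print(df, round(okapi_idf(N, df), 3), round(modified_idf(N, df), 3))
# df=700,000 > N/2: okapi_idf is negative, modified_idf stays positive.
```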
Systematic Analysis of 4 State of the Art Models [Fang et al. 11]
28
– Pivoted (PIV): parameter s must be small
– Dirichlet (DIR): problematic when a query term occurs less frequently in a document than expected
– Okapi (BM25): negative IDF; problematic with common terms
– PL2: parameter c must be large
Question: why are Dirichlet and PL2 still competitive despite their inherent problems that can’t be fixed through parameter tuning?
Outline
models
retrieval model
29
How can we leverage constraints to find an optimal retrieval model?
30
Basic Idea of the Axiomatic Framework (Optimization Problem Setup)
[Diagram: formally defined retrieval constraints C1, C2, C3 carve out a target region within the space of candidate retrieval functions S1, S2, S3; the goal is to find functions inside that region]
31
Three Questions
32
– How do we define the constraints? We've talked about that; more later.
– How do we define the function space? One possibility: leverage existing state-of-the-art functions.
– How do we search in the function space? One possibility: search in the neighborhood of existing state-of-the-art functions.
Inductive Definition of Function Space
Q = {q1, q2, ..., qm}; D = {d1, d2, ..., dn}; S : Q × D → ℝ
Define the function space inductively (the slide illustrates Q and D with the terms “cat”, “dog”, and “big”):
– Primitive weighting function (f): for a one-term query and a one-term document, S(Q, D) = f(q, d)
– Query growth function (h): S(Q ⊕ {q}, D) = S(Q, D) + h(q, Q, D)
– Document growth function (g): S(Q, D ⊕ {d}) = S(Q, D) + g(d, Q, D)
33
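One way to read the inductive definition is as a recursion that peels off one term at a time. This is a sketch only: the functions f, g, h are parameters to be instantiated (the axioms constrain them), and the peeling order chosen here (document terms first, then query terms) is an arbitrary illustrative choice.

```python
def inductive_score(Q, D, f, g, h):
    """S(Q, D) built inductively from:
       f(q, d): primitive weight for a one-term query and one-term document,
       g(d, Q, D): score change when the document grows by term d,
       h(q, Q, D): score change when the query grows by term q.
    Q and D are non-empty lists (bags) of terms."""
    if len(Q) == 1 and len(D) == 1:
        return f(Q[0], D[0])
    if len(D) > 1:                       # peel off the last document term
        return inductive_score(Q, D[:-1], f, g, h) + g(D[-1], Q, D[:-1])
    return inductive_score(Q[:-1], D, f, g, h) + h(Q[-1], Q[:-1], D)

# Example shape of a call (toy functions):
# inductive_score(["cat", "dog"], ["cat", "big", "dog"],
#                 f=lambda q, d: 1.0 if q == d else 0.0,
#                 g=lambda d, Q, D: 0.0,
#                 h=lambda q, Q, D: 0.0)
```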
Derivation of New Retrieval Functions
[Diagram] Start from an existing function S(Q, D); decompose it into component functions f, g, h; generalize them to families F, G, H; constrain the families with the constraints (C1, C2, C3); pick constrained instances f', g', h'; and assemble them into a new function S'(Q, D).
34
A Sample Derived Function based on BM25 [Fang & Zhai 05]
$$S(Q,D) = \sum_{t \in Q \cap D} \underbrace{c(t,Q)}_{\text{QTF}} \cdot \underbrace{\left(\frac{N}{df(t)}\right)^{0.35}}_{\text{IDF}} \cdot \underbrace{\frac{c(t,D)}{c(t,D) + s + s\,\frac{|D|}{avdl}}}_{\text{TF with length normalization}}$$
35
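The derived function is straightforward to implement. This sketch follows the formula on the slide (the 0.35 exponent and a single length-normalization parameter s); the function name and the default value of s are illustrative choices.

```python
import math
from collections import Counter

def axiomatic_score(query_terms, doc_terms, N, df, avdl, s=0.5):
    """Sample derived function based on BM25 [Fang & Zhai 05], as shown above."""
    cq, cd = Counter(query_terms), Counter(doc_terms)
    dl = len(doc_terms)
    total = 0.0
    for t in cq.keys() & cd.keys():
        qtf = cq[t]                                   # c(t, Q)
        idf = (N / df[t]) ** 0.35                     # (N / df(t)) ^ 0.35
        tf = cd[t] / (cd[t] + s + s * dl / avdl)      # length-normalized TF
        total += qtf * idf * tf
    return total
```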
The derived function is less sensitive to the parameter setting
[Figure: performance as a function of the parameter value; the axiomatic model is better across a wider range of settings]
36
Inevitability of heuristic thinking and necessity of axiomatic analysis
– Theoretically motivated models don’t automatically perform well empirically – Heuristic adjustment seems always necessary – Cause: inaccurate modeling of relevance
– The answer lies in axiomatic analysis – Use constraints to help identify the error in modeling relevance, thus obtaining insights about how to improve a model
37
Systematic Analysis of 4 State of the Art Models [Fang et al. 11]
38
– Pivoted (PIV): parameter s must be small
– Dirichlet (DIR): problematic when a query term occurs less frequently in a document than expected
– Okapi (BM25): negative IDF; problematic with common terms
– PL2: parameter c must be large
Modified BM25 satisfies all the constraints!
Without knowing its deficiency, we can’t easily propose a new model working better than BM25
A Recent Success of Axiomatic Analysis: Lower Bounding TF Normalization [Lv & Zhai 11]
– Existing retrieval functions fail to properly lower-bound the document-length-normalized TF
– Long documents are overly penalized: a very long document matching two query terms can score lower than a short document matching only one of them
– Adding a lower bound worked for BM25, PL2, Dirichlet, and Piv, leading to improved versions of them (BM25+, PL2+, Dir+, Piv+)
39
New Constraints: LB1 & LB2
40
LB1: Let Q be a query. Assume D1 and D2 are two documents such that S(Q, D1) = S(Q, D2). If we reformulate the query by adding another term q ∉ Q into Q, where c(q, D1) = 0 and c(q, D2) > 0, then S(Q ∪ {q}, D1) < S(Q ∪ {q}, D2).

LB2: Let Q = {q1, q2} be a query with two terms q1 and q2. Assume td(q1) = td(q2), where td(t) can be any reasonable measure of term discrimination value. If D1 and D2 are two documents such that c(q2, D1) = c(q2, D2) = 0, c(q1, D1) > 0, c(q1, D2) > 0, and S(Q, D1) = S(Q, D2), then S(Q, D1 ∪ {q1} − {t1}) < S(Q, D2 ∪ {q2} − {t2}) for all t1 and t2 such that t1 ∈ D1, t2 ∈ D2, t1 ∉ Q, and t2 ∉ Q.
– LB1: Repeated occurrence of an already matched query term isn't as important as the first occurrence of an otherwise absent query term
– LB2: The presence-absence gap (0-1 gap) shouldn't be closed due to length normalization
It turns out none of BM25, PL2, Dirichlet, PIV satisfies both constraints
A general heuristic solution: add a small constant lower bound to the normalized TF component. This worked well for improving all four models.
41
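The lower-bounding fix amounts to a one-line change. Below is a sketch of a BM25+-style per-term weight in the spirit of [Lv & Zhai 11]: a constant δ is added to the normalized TF component, and since only matched terms are scored, δ preserves the presence-absence gap; the parameter defaults here are illustrative.

```python
import math

def bm25_plus_term_weight(c_wq, c_wd, dl, df_w, N, avdl,
                          k1=1.2, b=0.75, k3=1000, delta=1.0):
    """BM25+ per matched term: BM25 with a lower bound delta on the TF component."""
    idf = math.log((N + 1) / df_w)                       # non-negative IDF variant
    qtf = (k3 + 1) * c_wq / (k3 + c_wq)
    tf = (k1 + 1) * c_wd / (k1 * ((1 - b) + b * dl / avdl) + c_wd)
    return idf * qtf * (tf + delta)                      # delta enforces the 0-1 gap
```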
BM25+ Improves over BM25
42
For details, see
Yuanhua Lv, ChengXiang Zhai. Lower Bounding Term Frequency Normalization. Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM'11), pages 7-16, 2011.
More examples of theory-effectiveness gap and the need for axiomatic analysis
independence of query terms; (3) collection LM for smoothing; however, it can't explain why some apparently reasonable smoothing methods perform poorly
99], we must ensure sufficient self translation probability to avoid unreasonable retrieval results, but such a constraint can’t be explained by estimation of translation model
function doesn’t work well as the asymmetric KL-divergence function D(Q||D)
43
Outline
models
retrieval model
44
Open Challenges
– Is there a complete set of constraints? If yes, how can we define them? If no, how can we prove it?
– How do we evaluate a constraint? (e.g., should the score contribution of a term be bounded? In BM25, it is.) – How do we evaluate a set of constraints?
– How should we search the function space? Search in the neighborhood of an existing function? Search in a new function space?
45
Open Challenges
– How can we quantify the degree of satisfaction? – How can we put constraints in a machine learning framework? Something like maximum entropy?
pseudo feedback? Cross-lingual IR?
queries? Specific type of documents?
46
Possible Future Scenario 1: Impossibility Theorems for IR
– Similar to Kleinberg's impossibility theorem for clustering
47
Jon Kleinberg. An Impossibility Theorem for Clustering. Advances in Neural Information Processing Systems (NIPS) 15, 2002.
Future Scenario 2: Sufficiently Restrictive Constraints
– A sufficiently restrictive set of constraints may determine a unique (optimal) retrieval function
– Similar to the derivation of the entropy function
48
Future Scenario 3 (most likely): Open Set of Insufficient Constraints
– The constraints don't conflict with one another, but are insufficient for ensuring good retrieval performance
sure what they are
constructive retrieval functional space and supervised machine learning
49
Summary: Axiomatic Relevance Hypothesis
relevance
retrieval models
models for bridging the theory-effectiveness gap
the state of the art models
function space based on existing or new models and theories
50
Updated Answers
– Why do these models tend to perform similarly even though they were derived in very different ways? They share some nice common properties; these properties are more important than how each model is derived.
– Why has it been so hard to beat these strong baseline methods? We don't have good knowledge about their deficiencies; other variants don't have all the “nice properties”; so far we didn't find a constraint that they fail to satisfy.
– Are they hitting the ceiling of the bag-of-words assumption? If yes, how can we prove it? If not, how can we find a more effective one? We need to formally define “the ceiling” (= the complete set of “nice properties”); with relevance modeled more accurately through constraints, the answer so far: No, they have NOT hit the ceiling yet!
51
Future Work: Putting It All Together!
52
[Figure: a road map from the Vector Space Model (the Cranfield era) toward the Optimal Model; axiomatic analysis marks “we are here,” with the caption “You are 2013.0322 miles from destination.” This strategy seems to have worked well in the past.]
Acknowledgments
Maryam Karimzadehgan, and others
53
Questions/Comments?
54
Overview of all Research Projects
[Diagram: text information management research spanning search, filtering, categorization, summarization, clustering, natural language content analysis, extraction, mining, and visualization, organized around information access, knowledge acquisition, and information organization, with search and mining applications in Web, email, and biomedical informatics, including entity/relation extraction]
More information can be found at http://timan.cs.uiuc.edu/