Document Selection Methodologies for Efficient and Effective Learning-to-Rank
Javed Aslam, Evangelos Kanoulas, Virgil Pavlu, Stefan Savev, Emine Yilmaz
Search Engines
User's Request → Search Engine → Results
Document Corpus; hundreds of features (BM25, tf*idf, PageRank, …)
Training Search Engines
Queries → Search Engine → Metric → Judges
Document Corpus; features: BM25, tf*idf, PageRank, …
Ranking functions: 1. Neural Network 2. Support Vector Machine 3. Regression Function 4. Decision Tree …
Training Data Sets
• Data Collections
– Billions of documents
– Thousands of queries
• Ideal, in theory; infeasible, in practice…
– Extract features from all query-document pairs
– Judge each document with respect to each query (extensive human effort)
– Train over all query-document pairs
Training Data Sets
• Train the ranking function over a subset of the complete collection
• Few queries with many documents judged vs. many queries with few documents judged
– Better to train over many queries with few judged documents [Yilmaz and Robertson '09]
• How should we select documents?
Training Data Sets
• Machine Learning (Active Learning)
– Iterative process
– Tightly coupled with the learning algorithm
• IR Evaluation
– Many test collections already available
– Efficient and effective techniques to construct test collections
• Intelligent ways of selecting documents
• Inferences of effectiveness metrics
Duality between LTR and Evaluation
• This work: explore the duality between Evaluation and Learning-to-Rank
– Employ techniques used for efficient and effective test collection construction to construct training collections
Duality between LTR and Evaluation
• Can test collection construction methodologies be used to construct training collections?
• If yes, which of these methodologies is better?
• What makes one training set better than another?
Methodology
• Depth-100 pool (as the complete collection)
• Select subsets of documents from the depth-100 pool
– Using different document selection methodologies
• Train over the different training sets
– Using a number of learning-to-rank algorithms
• Test the performance of the resulting ranking functions
– Five-fold cross-validation
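The five-fold cross-validation step can be sketched as a query-level split, each fold serving once as the test set (a minimal Python sketch; the query ids and round-robin fold assignment are illustrative, not the paper's exact splits):

```python
def five_fold_splits(query_ids):
    """Partition queries into 5 folds; yield (train, test) with each fold
    used exactly once as the held-out test set."""
    folds = [query_ids[i::5] for i in range(5)]  # round-robin assignment
    for k in range(5):
        test = folds[k]
        train = [q for i, f in enumerate(folds) if i != k for q in f]
        yield train, test

queries = list(range(150))  # 150 TREC queries, as in the data-set slide
splits = list(five_fold_splits(queries))
```

Ranking functions are then trained on each 120-query training split and scored on the held-out 30 queries.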
Data Sets
• Data from TREC 6, 7, and 8
– Document corpus: TREC Discs 4 and 5
– Queries: 150 queries; ad-hoc tracks
– Relevance judgments: depth-100 pools
• Features from each query-document pair
– 22 features; subset of LETOR features (BM25, Language Models, TF-IDF, …)
Document Selection Methodologies
Select subsets of documents
• Subset size varying from 6% to 60%
1. Depth-k pooling
2. InfAP (uniform random sampling)
3. StatAP (stratified random sampling)
4. MTC (greedy on-line algorithm)
5. LETOR (top-k by BM25; current practice)
6. Hedge (greedy on-line algorithm)
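Two of the selection strategies above are simple to sketch (illustrative Python; the run names, document ids, and sampling details are assumptions, not the TREC implementations):

```python
import random

def depth_k_pool(ranked_lists, k):
    """Depth-k pooling: union of the top-k documents from each
    system's ranked list for a query."""
    pool = set()
    for docs in ranked_lists.values():
        pool.update(docs[:k])
    return pool

def uniform_sample(pool, fraction, seed=0):
    """infAP-style selection: a uniform random sample of the pool."""
    docs = sorted(pool)
    random.Random(seed).shuffle(docs)
    n = max(1, int(fraction * len(docs)))
    return set(docs[:n])

# hypothetical ranked lists from two retrieval systems
runs = {"sysA": ["d1", "d2", "d3", "d4"], "sysB": ["d3", "d5", "d1", "d6"]}
pool = depth_k_pool(runs, k=2)
sample = uniform_sample(pool, fraction=0.5)
```

StatAP replaces the uniform draw with stratified sampling that favors highly ranked documents; Hedge and MTC instead pick documents greedily, one at a time.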
Document Selection Methodologies
[Figure: two panels vs. percentage of data used for training — left: discrepancy between relevant and non-relevant documents (symmetrized KL divergence); right: precision of the selection methods; one curve per method (depth, hedge, infAP, mtc, statAP, LETOR)]
• Precision: fraction of selected documents that are relevant
• Discrepancy: symmetrized KL divergence between documents' language models
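The discrepancy statistic can be sketched as a symmetrized KL divergence between smoothed unigram language models (illustrative Python; the additive smoothing scheme and example texts are assumptions, not the paper's exact estimator):

```python
import math
from collections import Counter

def language_model(text, vocab, mu=0.01):
    """Unigram language model with additive smoothing over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: (counts[w] + mu) / (total + mu * len(vocab)) for w in vocab}

def symmetrized_kl(p, q):
    """Symmetrized KL divergence: KL(p||q) + KL(q||p) over a common vocabulary."""
    kl_pq = sum(p[w] * math.log(p[w] / q[w]) for w in p)
    kl_qp = sum(q[w] * math.log(q[w] / p[w]) for w in q)
    return kl_pq + kl_qp

rel = "ranking documents by relevance"
non = "weather forecast for tomorrow"
vocab = set(rel.lower().split()) | set(non.lower().split())
p, q = language_model(rel, vocab), language_model(non, vocab)
d = symmetrized_kl(p, q)  # larger value => more dissimilar document sets
```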
LTR Algorithms
• Train over the different data sets
1. Regression (classification error)
2. Ranking SVM (AUC)
3. RankBoost (pairwise preferences)
4. RankNet (probability of correct order)
5. LambdaRank (nDCG)
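Two of the pairwise objectives above can be sketched directly (illustrative Python; the scores and labels are made up):

```python
import math

def ranknet_pair_prob(s_i, s_j):
    """RankNet models P(doc i ranked above doc j) as a logistic
    function of the score difference."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def pairwise_error(scores, labels):
    """Fraction of misordered relevant/non-relevant pairs: the quantity
    a RankBoost-style learner drives toward zero."""
    pairs = [(i, j) for i in range(len(labels)) for j in range(len(labels))
             if labels[i] > labels[j]]
    wrong = sum(1 for i, j in pairs if scores[i] <= scores[j])
    return wrong / len(pairs) if pairs else 0.0

scores = [2.0, 0.5, 1.5]   # hypothetical model scores
labels = [1, 0, 1]         # 1 = relevant, 0 = non-relevant
err = pairwise_error(scores, labels)  # 0.0: both relevant docs outscore the non-relevant one
```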
Results (1)
[Figure: MAP vs. percentage of data used for training for Regression (left) and Ranking SVM (right); one curve per selection method (depth, hedge, infAP, MTC, statAP, LETOR)]
Results (2)
[Figure: MAP vs. percentage of data used for training for RankBoost (left) and LambdaRank (right); one curve per selection method (depth, hedge, infAP, MTC, statAP, LETOR)]
Results (3)
[Figure: MAP vs. percentage of data used for training for RankNet (left) and RankNet with a hidden layer (right); one curve per selection method (depth, hedge, infAP, MTC, statAP, LETOR)]
Observations (1)
• Some learning-to-rank algorithms are robust to document selection methodologies
– LambdaRank vs. RankBoost
[Figure: MAP vs. percentage of data used for training for LambdaRank (left) and RankBoost (right); one curve per selection method]
Observations (2)
• Near-optimal performance with 1%-2% of the complete collection (depth-100 pool)
– No significant differences at greater percentages (t-test)
– The number of features matters [Taylor et al. '06]
[Figure: MAP vs. percentage of data used for training for RankNet; one curve per selection method]
Observations (3)
• Selection methodology matters
– Hedge (worst performance)
– Depth-k pooling and statAP (best performance)
– LETOR-like (neither most efficient nor most effective)
[Figure: MAP vs. percentage of data used for training for Ranking SVM; one curve per selection method]
Relative Importance on Effectiveness
• Learning-to-Rank algorithm vs. document selection methodology
– 2-way ANOVA model
• Variance decomposition over all data sets
– 26% due to document selection
– 31% due to LTR algorithm
• Variance decomposition (small data sets, <10%)
– 44% due to document selection
– 31% due to LTR algorithm
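The 2-way variance attribution can be sketched as a sum-of-squares decomposition over a selection-method-by-algorithm table of MAP scores (illustrative Python; the MAP values are made up and interaction terms are omitted):

```python
def two_way_ss(table):
    """Main-effect sum-of-squares fractions for a 2-factor table
    (rows = selection method, cols = LTR algorithm)."""
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]
    ss_total = sum((table[i][j] - grand) ** 2 for i in range(r) for j in range(c))
    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    return ss_rows / ss_total, ss_cols / ss_total  # variance fraction per factor

# hypothetical MAP values: 2 selection methods x 3 LTR algorithms
map_table = [[0.20, 0.22, 0.24],
             [0.14, 0.16, 0.18]]
frac_selection, frac_algorithm = two_way_ss(map_table)
```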
What makes one training set better than another?
• Different methods have different properties
– Precision
– Recall
– Similarities between relevant documents
– Similarities between relevant and non-relevant documents
– ...
• Model selection
– Linear model (adjusted R² = 0.99)
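The adjusted R² used to judge such a linear model can be computed as below (illustrative Python; the MAP values and model predictions are made up):

```python
def adjusted_r2(y, y_hat, n_predictors):
    """Adjusted R^2: R^2 penalized for the number of predictors in the model."""
    n = len(y)
    mean = sum(y) / n
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# hypothetical MAP values and linear-model predictions from set properties
y     = [0.15, 0.18, 0.21, 0.24, 0.20, 0.17]
y_hat = [0.16, 0.18, 0.20, 0.23, 0.21, 0.16]
r2a = adjusted_r2(y, y_hat, n_predictors=2)
```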
What makes one training set better than another?
[Figure: scatter plots of MAP vs. precision of the training set, for RankBoost (left) and Ranking SVM (right)]
What makes one training set better than another?
[Figure: scatter plots of MAP vs. discrepancy between relevant and non-relevant documents in the training data, for RankBoost (left) and Ranking SVM (right)]
Conclusions
• Some LTR algorithms are robust to document selection methodologies
• For those that are not, the selection methodology matters
– Depth-k pooling, stratified sampling
• Harmful to select too many relevant documents
• Harmful to select relevant and non-relevant documents that are too similar