  1. Machine Learning for Ranking
  CE-324: Modern Information Retrieval
  Sharif University of Technology
  M. Soleymani, Fall 2018
  Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

  2. Machine learning for IR ranking?
  • We’ve looked at methods for ranking docs in IR
    • Cosine similarity, idf, proximity, pivoted doc length normalization, doc authority, …
  • We’ve looked at methods for classifying docs using supervised machine learning classifiers
    • Naïve Bayes, Rocchio, kNN
  • Surely we can also use machine learning to rank the docs displayed in search results?
    • Sounds like a good idea
    • A.k.a. “machine-learned relevance” or “learning to rank”

  3. Machine learning for IR ranking
  • Actively researched – and actively deployed by major web search engines
    • In the last decade
  • Why didn’t it happen earlier?
    • Modern supervised ML has been around for about 20 years…
    • Naïve Bayes has been around for about 50 years…

  4. Machine learning for IR ranking
  • The IR community wasn’t very connected to the ML community
  • But there were a whole bunch of precursors:
    • Wong, S.K. et al. 1988. Linear structure in information retrieval. SIGIR 1988.
    • Fuhr, N. 1992. Probabilistic methods in information retrieval. Computer Journal.
    • Gey, F. C. 1994. Inferring probability of relevance using the method of logistic regression. SIGIR 1994.
    • Herbrich, R. et al. 2000. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers.

  5. Why weren’t early attempts very successful/influential?
  • Sometimes an idea just takes time to be appreciated…
  • Limited training data
    • Especially for real-world use (as opposed to writing academic papers), it was very hard to gather test collection queries and relevance judgments
    • This has changed, both in academia and industry
  • Poor machine learning techniques
  • Insufficient customization to the IR problem
  • Not enough features for ML to show value

  6. Why wasn’t ML much needed?
  • Traditional ranking functions in IR used a very small number of features, e.g.,
    • Term frequency
    • Inverse document frequency
    • Doc length
  • It was easy to tune weighting coefficients by hand
    • And people did

  7. Why is ML needed now?
  • Modern systems – especially on the Web – use a great number of features:
    • Arbitrary useful features – not a single unified model
    • Log frequency of query word in anchor text?
    • Query word in color on page?
    • # of images on page?
    • # of (out) links on page?
    • PageRank of page?
    • URL length?
    • URL contains “~”?
    • Page edit recency?
    • Page length?
  • The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features.

  8. Weighted zone scoring
  • Simple example: only two zones, title and body
  • score(d, q) = β · s_T(d, q) + (1 − β) · s_B(d, q),   0 ≤ β ≤ 1
    • where s_T and s_B are the title-zone and body-zone scores of doc d for query q
  • We intend to find an optimal value for β (a minimal code sketch follows this slide)
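A minimal sketch of weighted zone scoring, assuming Boolean zone matches (a zone contributes 1 only if it contains every query term). The zone_match helper and the dict-based doc representation are illustrative choices, not from the slides.

```python
def zone_match(query_terms, zone_text):
    """Boolean zone score: 1 if every query term occurs in the zone, else 0."""
    zone_terms = set(zone_text.lower().split())
    return 1.0 if all(t in zone_terms for t in query_terms) else 0.0

def weighted_zone_score(query, doc, beta):
    """score(d, q) = beta * s_T(d, q) + (1 - beta) * s_B(d, q)."""
    q_terms = query.lower().split()
    s_title = zone_match(q_terms, doc["title"])
    s_body = zone_match(q_terms, doc["body"])
    return beta * s_title + (1 - beta) * s_body

# Example usage
doc = {"title": "modern information retrieval",
       "body": "ranking documents with machine learning"}
print(weighted_zone_score("information retrieval", doc, beta=0.7))  # -> 0.7
```

With Boolean zone matches, score(d, q) can only take the values 0, β, 1 − β, and 1, which is what makes the closed-form solution on slide 11 possible.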

  9. Weighted zone scoring
  • Training examples: each example is a tuple consisting of a query q and a doc d, together with a relevance judgment r for d on q.
  • In the simplest form, each relevance judgment is either Relevant or Non-relevant.
  • β is “learned” from the examples, so that the learned scores approximate the relevance judgments in the training examples.

  10. Weighted zone scoring
  • Training set: { (d^(1), q^(1), r^(1)), … , (d^(n), q^(n), r^(n)) }
  • Find the minimum of the following cost function (a numerical sketch follows this slide):
    J(β) = Σ_{i=1}^{n} ( r^(i) − β · s_T(d^(i), q^(i)) − (1 − β) · s_B(d^(i), q^(i)) )²
    β* = argmin_{0 ≤ β ≤ 1} J(β)
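A small sketch, not from the slides, that minimizes J(β) by a simple grid search over [0, 1]; the function name and the toy triples are my own.

```python
import numpy as np

def best_beta(s_title, s_body, r, grid_size=1001):
    """Grid-search beta in [0, 1] minimizing
    J(beta) = sum_i (r_i - beta * s_T_i - (1 - beta) * s_B_i)^2."""
    s_title, s_body, r = map(np.asarray, (s_title, s_body, r))
    betas = np.linspace(0.0, 1.0, grid_size)
    # Predictions for every candidate beta at once: shape (grid_size, n_examples)
    preds = betas[:, None] * s_title + (1 - betas[:, None]) * s_body
    costs = ((r - preds) ** 2).sum(axis=1)
    return betas[costs.argmin()]

# Toy training data: (s_T, s_B, r) triples
s_T = [1, 1, 0, 0, 1]
s_B = [0, 1, 1, 0, 0]
r   = [1, 1, 0, 0, 1]
print(best_beta(s_T, s_B, r))  # -> 1.0 on this toy data
```

Since J(β) is quadratic in β, the optimum could also be solved for analytically; the grid keeps the sketch simple and respects the constraint 0 ≤ β ≤ 1. On this toy data the result matches the closed form on the next slide.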

  11. Weighted zone scoring: special case
  • Boolean match in each zone: s_T(d, q), s_B(d, q) ∈ {0, 1}
  • Boolean relevance judgment: r ∈ {0, 1}
  • Then the optimum has a closed form (a short sketch follows this slide):
    β* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n)
    • n_10r: number of training examples with s_T = 1, s_B = 0 and r = 1
    • n_01n: number of training examples with s_T = 0, s_B = 1 and r = 0
    • (n_10n and n_01r are defined analogously)
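A sketch of the closed-form β* from counts; the representation of examples as (s_T, s_B, r) triples and the fallback value for an empty denominator are my assumptions.

```python
def best_beta_boolean(examples):
    """Closed-form beta* for Boolean zone scores and Boolean relevance.
    Each example is a (s_T, s_B, r) triple with values in {0, 1}."""
    n_10r = sum(1 for sT, sB, r in examples if sT == 1 and sB == 0 and r == 1)
    n_10n = sum(1 for sT, sB, r in examples if sT == 1 and sB == 0 and r == 0)
    n_01r = sum(1 for sT, sB, r in examples if sT == 0 and sB == 1 and r == 1)
    n_01n = sum(1 for sT, sB, r in examples if sT == 0 and sB == 1 and r == 0)
    denom = n_10r + n_10n + n_01r + n_01n
    return (n_10r + n_01n) / denom if denom else 0.5  # fallback if no informative examples

examples = [(1, 0, 1), (1, 1, 1), (0, 1, 0), (0, 0, 0), (1, 0, 1)]
print(best_beta_boolean(examples))  # -> 1.0, same as the grid search above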

  12. Simple example: Using classification for ad hoc IR
  • Collect a training corpus of (q, d, r) triples
    • Relevance r is here binary (but may be multiclass, with 3–7 levels)
  • Doc is represented by a feature vector x = (x_1, x_2)
    • x_1: cosine similarity of doc and query
    • x_2: minimum query window size ω, i.e., the shortest text span including all query words (a sketch of computing ω follows this slide)
    • Query term proximity is a very important new weighting factor
  • Train a model to predict the class r of a doc-query pair
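Not from the slides: a sketch of computing the minimum query window ω over a tokenized doc with a standard two-pointer sweep. Tokenization by whitespace and measuring the span in tokens (inclusive) are my assumptions.

```python
from collections import Counter

def min_query_window(doc_tokens, query_terms):
    """Length (in tokens) of the shortest span of doc_tokens containing every
    query term at least once; returns None if some term is missing."""
    needed = Counter(set(query_terms))
    if not needed:
        return None
    have = Counter()
    missing = len(needed)
    best = None
    left = 0
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            have[tok] += 1
            if have[tok] == needed[tok]:
                missing -= 1
        while missing == 0:          # all terms covered: try to shrink from the left
            span = right - left + 1
            best = span if best is None else min(best, span)
            ltok = doc_tokens[left]
            if ltok in needed:
                have[ltok] -= 1
                if have[ltok] < needed[ltok]:
                    missing += 1
            left += 1
    return best

tokens = "machine learning methods for ranking documents in modern retrieval".split()
print(min_query_window(tokens, ["ranking", "retrieval"]))  # -> 6
```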

  13. Simple example: Using classification for ad hoc IR
  • Training data (shown in the original slide as a table of example doc-query pairs with their feature values and relevance judgments)

  14. Simple example: Using classification for ad hoc IR
  • A linear score function is then
    Score(d, q) = w_1 · CosineScore + w_2 · ω + w_3
  • And the linear classifier is: decide Relevant if Score(d, q) > 0
  • Just like when we were doing text classification (a small sketch follows this slide)
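A minimal sketch of this two-feature classifier. scikit-learn's LinearSVC is my choice of linear learner, and the feature values are made up; neither is prescribed by the slides.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy training data: rows are (cosine score, minimum window size omega),
# labels are 1 = Relevant, 0 = Nonrelevant
X = np.array([[0.05, 2], [0.04, 3], [0.03, 2], [0.02, 4], [0.01, 3],
              [0.03, 5], [0.02, 5], [0.01, 4], [0.005, 5], [0.01, 5]])
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

clf = LinearSVC(C=1.0)
clf.fit(X, y)
w1, w2 = clf.coef_[0]
w3 = clf.intercept_[0]
print(f"Score(d, q) = {w1:.2f} * cosine + {w2:.2f} * omega + {w3:.2f}")

# Decide Relevant iff Score(d, q) > 0
print(clf.decision_function([[0.04, 2]]) > 0)   # likely relevant
print(clf.decision_function([[0.005, 5]]) > 0)  # likely nonrelevant
```

On data like this the learned weight on the cosine score tends to be positive and the weight on ω negative, matching the intuition that high similarity and tight term proximity indicate relevance.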

  15. Simple example: Using classification for ad hoc IR
  • [Figure: training examples plotted with cosine score on the vertical axis and term proximity ω on the horizontal axis; relevant (R) and nonrelevant (N) examples are separated by a linear decision surface.]

  16. More complex example of using classification for search ranking [Nallapati 2004]
  • We can generalize this to classifier functions over more features
  • We can use methods we have seen previously for learning the linear classifier weights

  17. Classifier for information retrieval
  • g(d, q) = w_1 · f_1(d, q) + … + w_m · f_m(d, q) + b
  • Training: g(d, q) must be negative for nonrelevant docs and positive for relevant docs
  • Testing: decide Relevant iff g(d, q) ≥ 0
  • Features are not word-presence features
    • (to deal with query words not in your training data)
    • but scores like the summed (log) tf of all query terms
  • Unbalanced data
    • Problem: it can result in trivial always-say-nonrelevant classifiers
    • A solution: undersample nonrelevant docs during training, i.e., just take some at random (a sketch follows this slide)
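A sketch of the undersampling fix under stated assumptions: scikit-learn's LinearSVC as the classifier, synthetic query-level features, and a 1:1 target ratio. None of these specifics come from the slides.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def undersample(X, y, ratio=1.0):
    """Keep all relevant examples (y == 1) and a random subset of nonrelevant
    examples so that nonrelevant/relevant is roughly `ratio`."""
    rel = np.flatnonzero(y == 1)
    nonrel = np.flatnonzero(y == 0)
    keep = rng.choice(nonrel, size=min(len(nonrel), int(ratio * len(rel))), replace=False)
    idx = np.concatenate([rel, keep])
    return X[idx], y[idx]

# Toy data: 6 query-level features (think tf / idf / tf-idf variants), heavily imbalanced labels
X = rng.normal(size=(2000, 6))
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=2000) > 2.2).astype(int)

Xb, yb = undersample(X, y)
clf = LinearSVC(C=1.0).fit(Xb, yb)
print("relevant fraction before/after undersampling:", y.mean(), yb.mean())
```

Without the undersampling step, a classifier trained on the full imbalanced data can reach high accuracy by always predicting nonrelevant, which is useless for ranking.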

  18. An SVM classifier for ranking [Nallapati 2004]
  • Experiments:
    • 4 TREC data sets
    • Comparisons with Lemur, a state-of-the-art open-source IR engine (Language Model (LM)-based – see IIR ch. 12)
    • 6 features, all variants of tf, idf, and tf.idf scores

  19. An SVM classifier for ranking [Nallapati 2004]

  Train \ Test        Disk 3    Disk 4-5    WT10G (web)
  Disk 3      LM      0.1785    0.2503      0.2666
              SVM     0.1728    0.2432      0.2750
  Disk 4-5    LM      0.1773    0.2516      0.2656
              SVM     0.1646    0.2355      0.2675

  • At best the results are about equal to LM
    • Actually a little bit below
  • Paper’s advertisement: easy to add more features
    • This is illustrated on a homepage-finding task on WT10G:
    • Baseline LM: 52% p@10, baseline SVM: 58%
    • SVM with URL-depth and in-link features: 78% p@10

  20. Learning to rank
  • Classification probably isn’t the right way to think about ranking
    • Classification: map to an unordered set of classes
    • Regression: map to a real value
    • Ordinal regression: map to an ordered set of classes
      • A fairly obscure sub-branch of statistics, but what we want here
  • The ordinal regression formulation gives extra power:
    • Relations between relevance levels are modeled
    • A number of categories of relevance, totally ordered: c_1 < c_2 < … < c_J
    • Training data: each doc-query pair is represented as a feature vector ψ_i, with its relevance ranking c_i as the label

  21. The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] (IIR Sec. 15.4.2)
  • Ranking model: f(d)

  22. Pairwise learning: The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] (IIR Sec. 15.4.2)
  • Aim is to classify instance pairs as correctly ranked or incorrectly ranked
    • This turns an ordinal regression problem back into a binary classification problem in an expanded space
  • We want a ranking function f such that c_i > c_k iff f(ψ_i) > f(ψ_k)
    • … or at least one that tries to do this with minimal error
  • Suppose that f is a linear function: f(ψ_i) = w · ψ_i

  23. The Ranking SVM [Herbrich et al. 1999, 2000; Joachims et al. 2002] (IIR Sec. 15.4.2)
  • Then (combining c_i > c_k iff f(ψ_i) > f(ψ_k) and f(ψ_i) = w · ψ_i):
    c_i > c_k iff w · (ψ_i − ψ_k) > 0
  • Let us then create a new instance space from such pairs:
    Φ_u = Φ(d_i, d_k, q) = ψ_i − ψ_k
    z_u = +1, 0, −1 as c_i >, =, < c_k
  • We can build the model over just the cases for which z_u = −1
  • From training data S = {Φ_u}, we train an SVM (a pairwise-transformation sketch follows this slide)
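A sketch of the pairwise transformation, using scikit-learn's LinearSVC as the SVM (my choice). Unlike the slide, which keeps only one sign of pair, this toy version emits both Φ_u and −Φ_u so the binary problem contains both classes; ties (z_u = 0) are dropped as on the slide.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_instances(psi, c):
    """Ranking-SVM style pairs for one query: for docs i, k with c_i != c_k,
    emit (psi_i - psi_k, +1) for the correctly ordered difference plus its mirror with -1."""
    X, z = [], []
    for i, k in combinations(range(len(c)), 2):
        if c[i] == c[k]:
            continue                      # z_u = 0 pairs are dropped
        diff = psi[i] - psi[k]
        sign = 1 if c[i] > c[k] else -1
        X.extend([sign * diff, -sign * diff])
        z.extend([+1, -1])
    return np.array(X), np.array(z)

# Toy feature vectors psi_i and graded relevance labels c_i for one query
psi = np.array([[0.9, 0.2], [0.6, 0.1], [0.3, 0.4], [0.1, 0.0]])
c = np.array([3, 2, 1, 0])

Xp, zp = pairwise_instances(psi, c)
svm = LinearSVC(C=1.0, fit_intercept=False).fit(Xp, zp)
w = svm.coef_[0]
scores = psi @ w               # f(psi) = w . psi is then used to rank docs
print(np.argsort(-scores))     # should recover the order 0, 1, 2, 3
```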

  24. Two queries in the pairwise space

  25. Ranking by a classifier
  • Assume that the ranked list of docs for a set of sample queries is available as training data
  • Or even that a set of training data in the form (d, d′, q, z) is available
    • This form of training data can also be derived from the ranked lists of docs for sample queries, if they exist
  • Again, we can use an SVM classifier for ranking

  26. Ranking by a classifier
  • Φ(d, q) = ( s_1(d, q), … , s_m(d, q) )
  • Φ(d′, q) = ( s_1(d′, q), … , s_m(d′, q) )
  • The method seeks a vector w in the space of scores (constructed as above) such that
    w · Φ(d, q) ≥ w · Φ(d′, q)
    for each d that precedes d′ in the ranked list of docs for q available in the training data
  • A linear classifier such as an SVM can be used, where its training data are constructed as the following (feature vector, label) pairs:
    (d, d′, q, z)  ⇒  ( Φ(d, q) − Φ(d′, q), z )
    z = +1 if d must precede d′
    z = −1 if d′ must precede d
  • (a sketch constructing such pairs from ranked lists follows this slide)
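A small sketch, not from the slides, that derives (Φ(d, q) − Φ(d′, q), z) training pairs from per-query ranked lists and trains a linear SVM; scikit-learn and the made-up feature values are my assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairs_from_ranked_list(phi):
    """phi: array of Phi(d, q) vectors ordered by the training ranking for one query
    (row 0 = top-ranked). Emit (Phi(d, q) - Phi(d', q), +1) for every d ranked above d',
    plus the mirrored pair labeled -1."""
    X, z = [], []
    n = len(phi)
    for i in range(n):
        for j in range(i + 1, n):
            diff = phi[i] - phi[j]
            X.extend([diff, -diff])
            z.extend([+1, -1])
    return X, z

# Toy setup: two queries, each with a ranked list of three docs described by m = 2 score features
ranked_lists = [
    np.array([[0.8, 0.5], [0.5, 0.4], [0.2, 0.1]]),
    np.array([[0.7, 0.9], [0.4, 0.3], [0.3, 0.2]]),
]

X, z = [], []
for phi in ranked_lists:
    Xq, zq = pairs_from_ranked_list(phi)
    X.extend(Xq)
    z.extend(zq)

w = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(X), np.array(z)).coef_[0]

# Rank unseen docs for a query by w . Phi(d, q), highest score first
new_docs = np.array([[0.6, 0.2], [0.3, 0.7], [0.9, 0.8]])
print(np.argsort(-(new_docs @ w)))
```

The design choice here is that ordering information comes from the positions in the training ranked lists rather than from graded labels, but the pairwise SVM machinery is the same as on slide 23.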
