SLIDE 1

Machine Learning for Ranking

CE-324: Modern Information Retrieval

Sharif University of Technology

M. Soleymani

Fall 2015

Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)

SLIDE 2

Machine learning for IR ranking?

• We’ve looked at methods for ranking docs in IR
  • Cosine similarity, idf, proximity, pivoted doc length normalization, doc authority, …
• We’ve looked at methods for classifying docs using supervised machine learning classifiers
  • Naïve Bayes, Rocchio, kNN
• Surely we can also use machine learning to rank the docs displayed in search results?
  • Sounds like a good idea
  • A.k.a. “machine-learned relevance” or “learning to rank”

SLIDE 3

Machine learning for IR ranking

• Actively researched – and actively deployed by major web search engines
  • In the last decade
• Why didn’t it happen earlier?
  • Modern supervised ML has been around for about 20 years…
  • Naïve Bayes has been around for about 50 years…

SLIDE 4

Machine learning for IR ranking

• The IR community wasn’t very connected to the ML community
• But there were a whole bunch of precursors:
  • Wong, S.K. et al. 1988. Linear structure in information retrieval. SIGIR 1988.
  • Fuhr, N. 1992. Probabilistic methods in information retrieval. Computer Journal.
  • Gey, F. C. 1994. Inferring probability of relevance using the method of logistic regression. SIGIR 1994.
  • Herbrich, R. et al. 2000. Large Margin Rank Boundaries for Ordinal Regression. Advances in Large Margin Classifiers.

SLIDE 5

Why weren’t early attempts very successful/influential?

• Sometimes an idea just takes time to be appreciated…
• Limited training data
  • Especially for real-world use (as opposed to writing academic papers), it was very hard to gather test collection queries and relevance judgments
  • This has changed, both in academia and industry
• Poor machine learning techniques
• Insufficient customization to the IR problem
• Not enough features for ML to show value

SLIDE 6

Why wasn’t ML much needed?

• Traditional ranking functions in IR used a very small number of features, e.g.,
  • Term frequency
  • Inverse document frequency
  • Doc length
• It was easy to tune weighting coefficients by hand
  • And people did

SLIDE 7

Why is ML needed now?

• Modern systems – especially on the Web – use a great number of features:
  • Arbitrary useful features – not a single unified model
  • Log frequency of query word in anchor text?
  • Query word in color on page?
  • # of images on page?
  • # of (out) links on page?
  • PageRank of page?
  • URL length?
  • URL contains “~”?
  • Page edit recency?
  • Page length?
• The New York Times (2008-06-03) quoted Amit Singhal as saying Google was using over 200 such features.

SLIDE 8

Weighted zone scoring


• Simple example:
  • Only two zones: title, body

  score(d, q) = β · S_T(d, q) + (1 − β) · S_B(d, q),   0 ≤ β ≤ 1

  where S_T and S_B are the title and body zone scores

• We intend to find an optimal value for β
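
A minimal Python sketch of this two-zone score (the function and argument names are illustrative, not from the slides):

    def weighted_zone_score(s_title, s_body, beta):
        # score(d, q) = beta * S_T(d, q) + (1 - beta) * S_B(d, q), with 0 <= beta <= 1
        assert 0.0 <= beta <= 1.0
        return beta * s_title + (1.0 - beta) * s_body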

SLIDE 9

Weighted zone scoring


• Training examples: each example is a tuple consisting of a query q and a doc d, together with a relevance judgment r for d on q
  • In the simplest form, each relevance judgment is either Relevant or Non-relevant
• β is “learned” from these examples, so that the learned scores approximate the relevance judgments in the training examples

SLIDE 10

Weighted zone scoring


• Training set of N examples:

  D = { (d_1, q_1, r_1), …, (d_N, q_N, r_N) }

• Find the minimum of the following cost function:

  J_D(β) = Σ_{j=1..N} [ r_j − β · S_T(d_j, q_j) − (1 − β) · S_B(d_j, q_j) ]²

  β* = argmin_{0 ≤ β ≤ 1} J_D(β)
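
Since J_D(β) is a convex quadratic in β, the constrained optimum is just the unconstrained least-squares solution clipped to [0, 1]. A minimal Python sketch of this fitting step (names are illustrative; each training example is an (S_T, S_B, r) triple):

    def fit_beta(examples):
        # examples: list of (s_title, s_body, r) training triples
        # score = beta * s_title + (1 - beta) * s_body
        #       = beta * (s_title - s_body) + s_body,
        # so the squared-error cost is quadratic in beta with a closed-form minimizer
        num = sum((st - sb) * (r - sb) for st, sb, r in examples)
        den = sum((st - sb) ** 2 for st, sb, r in examples)
        if den == 0.0:
            return 0.5  # title and body scores always agree; any beta gives the same cost
        return min(1.0, max(0.0, num / den))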

SLIDE 11

Weighted zone scoring: special case


• Boolean match in each zone: S_T(d, q), S_B(d, q) ∈ {0, 1}
• Boolean relevance judgment: r ∈ {0, 1}

  β* = (n_10r + n_01n) / (n_10r + n_10n + n_01r + n_01n)

  n_10r: number of training examples with S_T = 1, S_B = 0 and r = 1 (relevant)
  n_01n: number of training examples with S_T = 0, S_B = 1 and r = 0 (non-relevant)
  (n_10n and n_01r are defined analogously)
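
In this Boolean special case, β* can be read off directly from the four counts; a small Python sketch (illustrative names, same (S_T, S_B, r) triples as above):

    def fit_beta_boolean(examples):
        # examples: list of (s_title, s_body, r) tuples with every value in {0, 1}
        n10r = sum(1 for ex in examples if ex == (1, 0, 1))
        n10n = sum(1 for ex in examples if ex == (1, 0, 0))
        n01r = sum(1 for ex in examples if ex == (0, 1, 1))
        n01n = sum(1 for ex in examples if ex == (0, 1, 0))
        denom = n10r + n10n + n01r + n01n
        # fall back to 0.5 if no training example has the zones disagreeing
        return (n10r + n01n) / denom if denom else 0.5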

SLIDE 12

Simple example: Using classification for ad-hoc IR

• Collect a training corpus of (q, d, r) triples
  • Relevance r is here binary (but may be multiclass, e.g., with 3–7 graded levels)
• Each doc is represented by a feature vector
  • x = (x_1, x_2)
  • x_1: cosine similarity
  • x_2: minimum query window size (shortest text span including all query words), called ω
  • Query term proximity is a very important new weighting factor
• Train a model to predict the class r of a doc-query pair

SLIDE 13

Simple example: Using classification for ad hoc IR

• Training data


SLIDE 14

Simple example: Using classification for ad hoc IR

• A linear score function is then

  Score(d, q) = w_1 · CosineScore + w_2 · ω + w_0

• And the linear classifier is

  Decide relevant if Score(d, q) > 0

• … just like when we were doing text classification
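
In code, this ranker is just a thresholded linear function of the two features; a minimal sketch with illustrative names (in practice w_0, w_1, w_2 would be learned from the training data):

    def linear_score(cosine_score, omega, w1, w2, w0):
        # Score(d, q) = w1 * CosineScore + w2 * omega + w0
        return w1 * cosine_score + w2 * omega + w0

    def decide_relevant(cosine_score, omega, w1, w2, w0):
        # decide relevant iff the linear score is positive
        return linear_score(cosine_score, omega, w1, w2, w0) > 0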

SLIDE 15

Simple example: Using classification for ad hoc IR

[Figure: training examples plotted by cosine score and term proximity, labeled R (relevant) and N (non-relevant), with a linear decision surface separating them]

SLIDE 16

More complex example of using classification for search ranking [Nallapati 2004]

• We can generalize this to classifier functions over more features
• We can use methods we have seen previously for learning the linear classifier weights

SLIDE 17

Classifier for information retrieval

• g(d, q) = w_1 · f_1(d, q) + ⋯ + w_n · f_n(d, q) + b
  • Training: g(d, q) must be negative for non-relevant docs and positive for relevant docs
  • Testing: decide relevant iff g(d, q) ≥ 0
• Features are not word-presence features
  • (How would we deal with query words not in the training data?)
  • Instead they are scores, like the summed (log) tf of all query terms
• Unbalanced data
  • Problem: it can result in trivial always-say-non-relevant classifiers
  • A solution: undersample non-relevant docs during training (just take some at random), as sketched below
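
A hedged sketch of that random undersampling step (illustrative names; each example is a (feature_vector, label) pair with label 1 = relevant, 0 = non-relevant):

    import random

    def undersample_nonrelevant(examples, ratio=1.0, seed=0):
        # Keep all relevant examples and roughly `ratio` non-relevant examples per
        # relevant one, sampled at random, so training is not dominated by the
        # (much larger) non-relevant class.
        rng = random.Random(seed)
        relevant = [ex for ex in examples if ex[1] == 1]
        nonrelevant = [ex for ex in examples if ex[1] == 0]
        k = min(len(nonrelevant), int(ratio * len(relevant)))
        balanced = relevant + rng.sample(nonrelevant, k)
        rng.shuffle(balanced)
        return balanced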

SLIDE 18

An SVM classifier for ranking [Nallapati 2004]

• Experiments:
  • 4 TREC data sets
  • Comparisons with Lemur, a state-of-the-art open-source IR engine (Language Model (LM)-based – see IIR ch. 12)
  • 6 features, all variants of tf, idf, and tf.idf scores

SLIDE 19

An SVM classifier for ranking [Nallapati 2004]

  Train \ Test      Disk 3    Disk 4-5   WT10G (web)
  Disk 3     LM     0.1785    0.2503     0.2666
             SVM    0.1728    0.2432     0.2750
  Disk 4-5   LM     0.1773    0.2516     0.2656
             SVM    0.1646    0.2355     0.2675

• At best the results are about equal to LM
  • Actually a little bit below
• Paper’s advertisement: easy to add more features
• This is illustrated on a homepage-finding task on WT10G:
  • Baseline LM 52% p@10, baseline SVM 58%
  • SVM with URL-depth and in-link features: 78% p@10

SLIDE 20

Learning to rank

• Classification probably isn’t the right way to think about ranking
  • Classification: map to an unordered set of classes
  • Regression: map to a real value
  • Ordinal regression: map to an ordered set of classes
    • A fairly obscure sub-branch of statistics, but what we want here
• The ordinal regression formulation gives extra power:
  • Relations between relevance levels are modeled
• A number of categories of relevance:
  • These are totally ordered: c_1 < c_2 < ⋯ < c_L
• Training data: each doc-query pair is represented as a feature vector, with its relevance level c_j as the label

SLIDE 21

Learning to rank: Approaches

• Point-wise
  • Predicts a class label or relevance score for each doc independently
• Pair-wise
  • Predicting relative order is closer to the nature of ranking than predicting a class label or relevance score
  • Input is a pair of results for a query, and the class is the relevance ordering relationship between them
• List-wise
  • Learns a ranking function over whole result lists
  • Models the ranking problem in a straightforward fashion
  • Can overcome the drawbacks of the above approaches by tackling the ranking problem directly

SLIDE 22

Problem with Pointwise Approach

• Properties of IR evaluation measures are not well considered:
  • When the number of associated docs varies widely across queries, the loss is dominated by the queries with many docs
  • The position of docs in the ranked list is not taken into account
    • The pointwise loss function may therefore put too much emphasis on unimportant docs (those that end up ranked low in the final results)

SLIDE 23

Pairwise ranking

• Goal: classify instance pairs as correctly ranked or incorrectly ranked
  • This turns an ordinal regression problem back into a binary classification problem (see the sketch below)
• Advantage
  • Predicting relative order is closer to the nature of ranking than predicting a class label or relevance score
• Problem
  • The distribution of the number of doc pairs per query is even more skewed than the distribution of the number of docs per query
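
A minimal sketch of how such pairwise training instances can be built from graded judgments (illustrative names; this follows the common feature-difference construction in the style of ranking SVMs, not a specific recipe from the slides):

    def pairwise_examples(docs_by_query):
        # docs_by_query: dict mapping a query id -> list of (feature_vector, relevance)
        # Every pair of docs with different relevance (for the same query) yields one
        # binary example: the feature-difference vector, labeled +1 if the first doc
        # is more relevant than the second, else -1.
        pairs = []
        for docs in docs_by_query.values():
            for i, (xi, ri) in enumerate(docs):
                for xj, rj in docs[i + 1:]:
                    if ri == rj:
                        continue  # equally relevant pairs carry no ordering information
                    diff = [a - b for a, b in zip(xi, xj)]
                    pairs.append((diff, 1 if ri > rj else -1))
        return pairs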

SLIDE 24

The Limitation of Machine Learning

• Most work produces linear models, learned by weighting different base features
• This contrasts with most of the clever ideas of traditional IR, which are nonlinear scalings and combinations of basic measurements
  • log term frequency, idf, pivoted length normalization

SLIDE 25

Summary

• The idea of learning ranking functions has been around for about 20 years
• But only recently have ML knowledge, availability of training data, and a rich space of features come together to make this a hot research area
• It’s too early to give a definitive statement on which methods are best in this area … it’s still advancing rapidly
• But machine-learned ranking over many features now easily beats traditional hand-designed ranking functions
• And there is every reason to think that the importance of machine learning in IR will increase in the future
