Learning Maximal Marginal Relevance - PowerPoint PPT Presentation



SLIDE 1

Learning Maximal Marginal Relevance Model via Directly Optimizing Diversity Evaluation Measures

Long Xia, Jun Xu, Yanyan Lan, Jiafeng Guo, Xueqi Cheng
Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences

SLIDE 2

Outline

  • Background
  • Related work
  • Our approach
  • Experiments
  • Summary

SLIDE 3

SLIDE 4

Problem of diversity

SLIDE 5

Outline

  • Background
  • Related work
  • Our approach
  • Experiments
  • Summary

SLIDE 6

Related work

  • Heuristic approaches
    • Maximal marginal relevance (MMR) criterion (Carbonell and Goldstein, SIGIR’98)
    • Select documents with high divergence (Zhai et al., SIGIR’03)
    • Minimize the risk of dissatisfaction of the average user (Agrawal et al., WSDM’09)
    • Diversity by proportionality: an election-based approach (Dang and Croft, SIGIR’12)

SLIDE 7

Related work

  • Learning approaches
    • SVM-DIV: formulates the task as a problem of predicting diverse subsets (Yue and Joachims, ICML’08)
    • REC & RBA: online learning algorithms based on users’ clicking behavior (Radlinski et al., ICML’08)
    • R-LTR: a process of sequential document selection, optimizing the likelihood of ground-truth rankings (Zhu et al., SIGIR’14)

SLIDE 8

Related work

  • Diversity evaluation measures
    • Subtopic recall (Zhai et al., SIGIR’03)
    • α-NDCG (Clarke et al., SIGIR’08)
    • ERR-IA (Chapelle et al., CIKM’09)
    • NRBP (Clarke et al., ICTIR’09)

SLIDE 9

Maximal marginal relevance (Carbonell and Goldstein, SIGIR’98)

$$\mathrm{MMR} \stackrel{\mathrm{def}}{=} \arg\max_{D_i \in R \setminus S} \Big[ \lambda\, \mathrm{Sim}_1(D_i, Q) \;-\; (1-\lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big]$$

The first term is the query-document relevance; the second term is the similarity with the already-selected documents.

  • Advantage
    • Models top-down user browsing behavior
  • Disadvantage
    • Non-learning: only a limited number of ranking signals
    • High parameter-tuning cost
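The greedy selection above can be sketched in a few lines. This is a minimal illustration of the criterion, not the paper's code; the names `mmr_rerank`, `query_sim` (holding Sim1(D_i, Q)) and `doc_sim` (holding Sim2(D_i, D_j)) are this sketch's own.

```python
import numpy as np

def mmr_rerank(query_sim, doc_sim, lam=0.5, k=10):
    """Greedy MMR re-ranking.

    query_sim: shape (n,), relevance of each document to the query.
    doc_sim:   shape (n, n), pairwise document similarities.
    lam:       the trade-off parameter lambda.
    Returns the indices of the k selected documents, in rank order.
    """
    n = len(query_sim)
    selected, remaining = [], set(range(n))
    for _ in range(min(k, n)):
        best, best_score = None, -np.inf
        for i in remaining:
            # Penalty: maximal similarity to any already-selected document.
            redundancy = max((doc_sim[i][j] for j in selected), default=0.0)
            score = lam * query_sim[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate relevant documents and one novel one, MMR picks one duplicate, then the novel document, demoting the redundant copy.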

SLIDE 10

Relational Learning-to-Rank (Zhu et al., SIGIR’14)

  • Formalization
    • Four key components: input space, output space, ranking function f, loss function L

$$f^{*} = \arg\min_{f \in \mathcal{F}} \sum_{n=1}^{N} L\big(f(X^{(n)}, R^{(n)}),\, \mathbf{y}^{(n)}\big)$$

SLIDE 11

Relational Learning-to-Rank (Zhu et al., SIGIR’14)

  • Definition of ranking function

$$f_{S}(\mathbf{x}_i, R_i) = \omega_r^{\top}\mathbf{x}_i + \omega_d^{\top} h_S(R_i), \quad \forall\, \mathbf{x}_i \in X \setminus S$$

The first term $\omega_r^{\top}\mathbf{x}_i$ is the relevance score; the second term $\omega_d^{\top} h_S(R_i)$ is the diversity score.

SLIDE 12

Relational Learning-to-Rank (Zhu et al., SIGIR’14)

  • Definition of ranking function

$$f_{S}(\mathbf{x}_i, R_i) = \omega_r^{\top}\mathbf{x}_i + \omega_d^{\top} h_S(R_i), \quad \forall\, \mathbf{x}_i \in X \setminus S$$

Here $R_i$ is the matrix of relationships between document $\mathbf{x}_i$ and the other documents, and $h_S$ is the relational function that aggregates those relationships over the selected set $S$.
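The scoring function and the MMR-style sequential selection it drives can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are invented here, and the relational function `h_S` is taken to be an element-wise minimum over the selected documents, one of several plausible aggregations.

```python
import numpy as np

def rltr_score(x_i, R_i, selected, w_r, w_d):
    """Score of candidate document i given the already-selected set S.

    x_i:      (d_r,) relevance feature vector of document i.
    R_i:      (n, d_d) relationship features between document i and each
              other document (one row per document).
    selected: indices of documents already in S.
    w_r, w_d: relevance and diversity weight vectors (omega_r, omega_d).
    """
    relevance = w_r @ x_i
    if selected:
        # Relational function h_S: element-wise minimum over S
        # (an assumption of this sketch).
        h = R_i[selected].min(axis=0)
        diversity = w_d @ h
    else:
        diversity = 0.0  # first pick is by relevance alone
    return relevance + diversity

def rltr_rank(X, R, w_r, w_d):
    """Sequentially select documents by maximal f_S score."""
    n = X.shape[0]
    selected, remaining = [], list(range(n))
    while remaining:
        best = max(remaining,
                   key=lambda i: rltr_score(X[i], R[i], selected, w_r, w_d))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two identical documents and one dissimilar one, the dissimilar document is promoted to rank 2 despite lower relevance.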

SLIDE 13

Relational Learning-to-Rank (Zhu et al., SIGIR’14)

  • Definition of loss function

$$L\big(f(X, R), \mathbf{y}\big) = -\log P(\mathbf{y} \mid X), \qquad P(\mathbf{y} \mid X) = P\big(\mathbf{x}_{\mathbf{y}(1)}, \mathbf{x}_{\mathbf{y}(2)}, \dots, \mathbf{x}_{\mathbf{y}(n)} \mid X\big)$$

  • Plackett-Luce based probability

$$P(\mathbf{y} \mid X) = \prod_{k=1}^{n} \frac{\exp\big(f_{S_{k-1}}(\mathbf{x}_{\mathbf{y}(k)}, R_{\mathbf{y}(k)})\big)}{\sum_{l=k}^{n} \exp\big(f_{S_{k-1}}(\mathbf{x}_{\mathbf{y}(l)}, R_{\mathbf{y}(l)})\big)}$$
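The Plackett-Luce product above can be evaluated step by step. A minimal sketch (names are this sketch's own): the caller supplies, for each position k, the scores $f_{S_{k-1}}$ of the documents still unranked, with the chosen document's score first, a convention assumed here for simplicity.

```python
import math

def plackett_luce_prob(scores_at_step):
    """Plackett-Luce probability of a ranking.

    scores_at_step: list of lists; scores_at_step[k][0] is the score of
    the document actually placed at position k, the rest are the scores
    of the remaining candidates at that step.
    """
    prob = 1.0
    for scores in scores_at_step:
        exps = [math.exp(s) for s in scores]
        # Softmax-style probability of picking the chosen document.
        prob *= exps[0] / sum(exps)
    return prob
```

When all candidates score equally at every step, the probability reduces to 1/n!, as expected for a uniform choice.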

SLIDE 14

Relational Learning-to-Rank(Zhu et al., SIGIR’14)

  • R-LTR Pros:
  • Modeling sequential user behavior in the MMR way
  • A learnable framework to combine complex features
  • State-of-the-art empirical performance


Can R-LTR be further improved?

SLIDE 15

Motivation

  • R-LTR Cons:
    • Only utilizes “positive” rankings, and treats all “negative” rankings equally
    • Not all negative rankings are equally negative (they receive different evaluation scores)
    • How about using discriminative learning, which is effective in many machine learning tasks?
    • The learning objective differs from the diversity evaluation measures
    • How about directly optimizing the evaluation measures?

SLIDE 16

Major Idea

  • Learn an MMR model using both positive and negative rankings
  • Optimize diversity evaluation measures directly

How to achieve this?

SLIDE 17

Outline

  • Background
  • Related work
  • Our approach
  • Experiments
  • Summary

SLIDE 18

Learning the ranking model

  • Basic loss function

$$\min_{f} \sum_{n=1}^{N} L\big(\hat{\mathbf{y}}^{(n)}, J^{(n)}\big)$$

$J^{(n)}$ denotes the human labels on the documents; $\hat{\mathbf{y}}^{(n)}$ is the ranking constructed by the maximal marginal relevance model; $L(\hat{\mathbf{y}}^{(n)}, J^{(n)})$ is the function judging the ‘loss’ of the predicted ranking $\hat{\mathbf{y}}^{(n)}$ compared with the human labels $J^{(n)}$.

SLIDE 19

Evaluation measures as loss function

  • Aim: maximize the diverse ranking accuracy, in terms of a diversity evaluation measure, on the training data

$$\sum_{n=1}^{N} \Big(1 - F\big(X^{(n)}, \mathbf{y}^{(n)}, J^{(n)}\big)\Big)$$

$F$ represents the evaluation measure, which scores the agreement between the ranking $\mathbf{y}$ over the documents in $X$ and the human judgments $J$. It is difficult to optimize this loss directly, as $F$ is a non-convex function.
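As a concrete instance of such an $F$, here is a minimal sketch of α-NDCG (Clarke et al., SIGIR’08), assuming binary subtopic judgments. The ideal ranking in the denominator is built greedily, the usual approximation since computing the exact ideal is NP-hard; all function names are this sketch's own.

```python
import math

def alpha_dcg(ranking, subtopics, alpha=0.5, depth=20):
    """alpha-DCG of a ranking. subtopics[d] is the set of subtopics that
    document d is judged relevant to."""
    seen = {}  # subtopic -> number of times covered so far
    score = 0.0
    for k, d in enumerate(ranking[:depth], start=1):
        # Gain decays by (1 - alpha) each time a subtopic reappears.
        gain = sum((1 - alpha) ** seen.get(t, 0) for t in subtopics.get(d, ()))
        score += gain / math.log2(k + 1)
        for t in subtopics.get(d, ()):
            seen[t] = seen.get(t, 0) + 1
    return score

def alpha_ndcg(ranking, subtopics, alpha=0.5, depth=20):
    """alpha-NDCG: alpha-DCG divided by the alpha-DCG of a greedily
    built ideal ranking."""
    ideal, remaining = [], list(subtopics)
    while remaining and len(ideal) < depth:
        best = max(remaining,
                   key=lambda d: alpha_dcg(ideal + [d], subtopics, alpha, depth))
        ideal.append(best)
        remaining.remove(best)
    denom = alpha_dcg(ideal, subtopics, alpha, depth)
    return alpha_dcg(ranking, subtopics, alpha, depth) / denom if denom else 0.0
```

A ranking that interleaves subtopics scores 1.0, while placing two documents on the same subtopic first scores strictly less.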

SLIDE 20

Evaluation measures as loss function

  • Resort to optimizing an upper bound of the loss function:

$$\sum_{n=1}^{N} \Big(1 - F\big(X^{(n)}, \mathbf{y}^{(n)}, J^{(n)}\big)\Big) \;\le\; \sum_{n=1}^{N} \max_{\mathbf{y}^{+}\in Y^{+(n)},\, \mathbf{y}^{-}\in Y^{-(n)}} \Big(F\big(X^{(n)}, \mathbf{y}^{+}, J^{(n)}\big) - F\big(X^{(n)}, \mathbf{y}^{-}, J^{(n)}\big)\Big)\cdot \mathbb{1}\big[G(X^{(n)}, R^{(n)}, \mathbf{y}^{+}) \le G(X^{(n)}, R^{(n)}, \mathbf{y}^{-})\big]$$

$Y^{+(n)}$: positive rankings; $Y^{-(n)}$: negative rankings; $\mathbb{1}[\cdot]$ is one if the condition is satisfied and zero otherwise. $G(X, R, \mathbf{y})$ is the query-level ranking model:

$$G(X, R, \mathbf{y}) = \Pr(\mathbf{y} \mid X, R) = \Pr\big(\mathbf{x}_{\mathbf{y}(1)}, \dots, \mathbf{x}_{\mathbf{y}(n)} \mid X, R\big) = \prod_{s=1}^{n-1} \Pr\big(\mathbf{x}_{\mathbf{y}(s)} \mid X, S_{s-1}, R\big) = \prod_{s=1}^{n-1} \frac{\exp\big(f_{S_{s-1}}(\mathbf{x}_{\mathbf{y}(s)}, R_{\mathbf{y}(s)})\big)}{\sum_{l=s}^{n} \exp\big(f_{S_{s-1}}(\mathbf{x}_{\mathbf{y}(l)}, R_{\mathbf{y}(l)})\big)}$$

SLIDE 21

Evaluation measures as loss function

  • The bound can be relaxed one step further: since $F(X^{(n)}, \mathbf{y}^{+}, J^{(n)}) - F(X^{(n)}, \mathbf{y}^{-}, J^{(n)}) \ge 0$, the indicator condition $G(\mathbf{y}^{+}) \le G(\mathbf{y}^{-})$ may be replaced by the weaker condition

$$G\big(X^{(n)}, R^{(n)}, \mathbf{y}^{+}\big) - G\big(X^{(n)}, R^{(n)}, \mathbf{y}^{-}\big) \;\le\; F\big(X^{(n)}, \mathbf{y}^{+}, J^{(n)}\big) - F\big(X^{(n)}, \mathbf{y}^{-}, J^{(n)}\big),$$

which is exactly the update condition used in the PAMM algorithm.

SLIDE 22

Direct optimization with Perceptron

The loss function can be optimized under the framework of Perceptron; the algorithm is referred to as PAMM.

  • First, PAMM generates positive and negative rankings.
  • Second, PAMM optimizes the model parameters $\omega_r$ and $\omega_d$:

1: $\Delta G \leftarrow G(X^{(n)}, R^{(n)}, \mathbf{y}^{+}) - G(X^{(n)}, R^{(n)}, \mathbf{y}^{-})$
2: if $\Delta G \le F(X^{(n)}, \mathbf{y}^{+}, J^{(n)}) - F(X^{(n)}, \mathbf{y}^{-}, J^{(n)})$
3: then
4:   calculate $\Delta\omega_r^{(n)}$ and $\Delta\omega_d^{(n)}$
5:   $(\omega_r, \omega_d) \leftarrow (\omega_r, \omega_d) + \eta \times (\Delta\omega_r^{(n)}, \Delta\omega_d^{(n)})$
6: end if

  • Finally, PAMM outputs the optimized model parameters $(\omega_r, \omega_d)$.
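The update loop above can be sketched as follows. This is a toy sketch, not the paper's implementation: the ranking model $G$, its gradient, and the measure $F$ are abstracted into caller-supplied functions, and both parameter blocks $(\omega_r, \omega_d)$ are folded into a single vector `w`; all names are this sketch's own.

```python
import numpy as np

def pamm_train(pairs, G, grad_G, F, w0, eta=0.1, epochs=10):
    """Perceptron-style PAMM training loop.

    pairs:  list of (y_pos, y_neg) positive/negative ranking pairs.
    G:      (w, y) -> query-level ranking model score for ranking y.
    grad_G: (w, y) -> gradient of G with respect to w.
    F:      y -> value of the diversity evaluation measure for y.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        for y_pos, y_neg in pairs:
            # Update only when the model's margin violates the
            # measure's margin (line 2 of the algorithm).
            if G(w, y_pos) - G(w, y_neg) <= F(y_pos) - F(y_neg):
                w = w + eta * (grad_G(w, y_pos) - grad_G(w, y_neg))
    return w
```

Note the design choice inherited from the bound: unlike a plain perceptron, the update fires until the model separates the pair by at least the difference in evaluation scores, so badly scoring negative rankings are pushed further away than mildly bad ones.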

SLIDE 23

Advantages of PAMM

  • Adopts a ranking model that meets the maximal marginal relevance criterion
  • Able to directly optimize any diversity evaluation measure in training
  • Able to use both positive rankings and negative rankings in training

SLIDE 24

Outline

  • Background
  • Related work
  • Our approach
  • Experiments
  • Summary

SLIDE 25

Experiment setting

  • Dataset: TREC WT2009, WT2010, and WT2011
  • Data processing
  • Indri toolkit (version 5.2)
  • Porter stemmer and stopword removal
  • Evaluation
  • TREC official measures: ERR-IA, α-NDCG
  • Baselines
  • QL, MMR, xQuAD, PM-2, ListMLE, SVM-DIV, StructSVM, R-LTR

Dataset   #queries   #labeled docs   #subtopics per query
WT2009    50         5149            3~8
WT2010    48         6554            3~7
WT2011    50         5000            2~6

SLIDE 26

Feature Vectors (Zhu et al., SIGIR’14)

  • Relevance features
  • Weighting features: VSM, BM25, LM…
  • Term dependency features: MRF
  • Length
  • Pos
  • Diversity features
  • Cosine diversity
  • Jaccard diversity
  • Subtopic diversity
  • Document-level co-occurrence


SLIDE 27

Performance comparison of all methods

Method              WT2009                WT2010                WT2011
                    ERR-IA@20  α-NDCG@20  ERR-IA@20  α-NDCG@20  ERR-IA@20  α-NDCG@20
QL                  0.164      0.269      0.198      0.302      0.352      0.453
ListMLE             0.191      0.307      0.244      0.376      0.417      0.517
MMR                 0.202      0.308      0.274      0.404      0.428      0.530
xQuAD               0.232      0.344      0.328      0.445      0.475      0.565
PM-2                0.229      0.337      0.330      0.448      0.487      0.579
SVM-DIV             0.241      0.353      0.333      0.459      0.490      0.591
StructSVM(ERR-IA)   0.261      0.373      0.355      0.472      0.513      0.613
StructSVM(α-NDCG)   0.260      0.377      0.352      0.476      0.512      0.617
R-LTR               0.271      0.396      0.365      0.492      0.539      0.630
PAMM(ERR-IA)        0.294      0.422      0.387      0.511      0.548      0.637
PAMM(α-NDCG)        0.284      0.427      0.380      0.524      0.541      0.643

SLIDE 28

Effects of positive and negative rankings

[Two plots of ranking accuracy for PAMM(α-NDCG): α-NDCG (roughly 0.40-0.44 on the y-axis) w.r.t. the number of positive rankings (1-10) and w.r.t. the number of negative rankings (5-30).]

SLIDE 29

Summary

  • A new learning-to-rank model for search result diversification: PAMM
     Employs a ranking model that follows the maximal marginal relevance criterion
     Directly optimizes the diversity evaluation measures under the framework of Perceptron
     Able to utilize both positive rankings and negative rankings in training
  • Experimental results show that PAMM significantly outperforms the state-of-the-art baseline methods

SLIDE 30

THANKS!

Contact: xialong@software.ict.ac.cn