
Learning to Rank, Weinan Zhang, Shanghai Jiao Tong University (presentation transcript)



  1. 2019 EE448, Big Data Mining, Lecture 9 Learning to Rank Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html

  2. Content of This Course • Another ML problem: ranking • Learning to rank • Pointwise methods • Pairwise methods • Listwise methods

  3. Ranking Problem • Learning to rank • Pointwise methods • Pairwise methods • Listwise methods • Sincere thanks to Dr. Tie-Yan Liu

  4. The Probability Ranking Principle • https://nlp.stanford.edu/IR-book/html/htmledition/the-probability-ranking-principle-1.html

  5. Regression and Classification • Supervised learning: $\min_\theta \frac{1}{N}\sum_{i=1}^N L(y_i; f_\theta(x_i))$ • Two major problems for supervised learning • Regression: $L(y_i; f_\theta(x_i)) = \frac{1}{2}(y_i - f_\theta(x_i))^2$ • Classification: $L(y_i; f_\theta(x_i)) = -y_i \log f_\theta(x_i) - (1 - y_i)\log(1 - f_\theta(x_i))$
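For concreteness, the two losses above in plain NumPy (the function names are mine, not from the slides):

```python
import numpy as np

def squared_loss(y, y_hat):
    """Regression loss: L = 1/2 (y - f(x))^2."""
    return 0.5 * (y - y_hat) ** 2

def log_loss(y, p_hat, eps=1e-12):
    """Classification (cross-entropy) loss for a binary label y."""
    p_hat = np.clip(p_hat, eps, 1 - eps)   # guard against log(0)
    return -y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)

print(squared_loss(1.0, 1.0))   # 0.0: a perfect prediction costs nothing
print(log_loss(1.0, 0.99))      # ~0.01: a confident correct prediction is cheap
```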

  6. Learning to Rank Problem • Input: a set of instances $X = \{x_1, x_2, \ldots, x_n\}$ • Output: a rank list of these instances $\hat{Y} = \{x_{r_1}, x_{r_2}, \ldots, x_{r_n}\}$ • Ground truth: a correct ranking of these instances $Y = \{x_{y_1}, x_{y_2}, \ldots, x_{y_n}\}$

  7. A Typical Application • Webpage ranking given a query • Page ranking

  8. Webpage Ranking [Diagram: a query q = "ML in China" is fed to a ranking model over an indexed document repository $D = \{d_i\}$, which returns a ranked list of documents: $d^q_1$ = https://www.crunchbase.com, $d^q_2$ = https://www.reddit.com, ..., $d^q_n$ = https://www.quora.com]

  9. Model Perspective • In most existing work, learning to rank is defined as having the following two properties • Feature-based • Each instance (e.g. a query-document pair) is represented with a list of features • Discriminative training • Estimate the relevance given a query-document pair: $y_i = f_\theta(x_i)$ • Rank the documents based on the estimation

  10. Learning to Rank • Input: features of query and documents • Query, document, and combination features • Output: the documents ranked by a scoring function $y_i = f_\theta(x_i)$ • Objective: relevance of the ranking list • Evaluation metrics: NDCG, MAP, MRR… • Training data: the query-doc features and relevance ratings

  11. Training Data • The query-doc features and relevance ratings. Query = 'ML in China'

  Rating | Document                    | Query Length | PageRank | Doc Length | Title Rel. | Content Rel.
  3      | d1 = http://crunchbase.com  | 0.30         | 0.61     | 0.47       | 0.54       | 0.76
  5      | d2 = http://reddit.com      | 0.30         | 0.81     | 0.76       | 0.91       | 0.81
  4      | d3 = http://quora.com       | 0.30         | 0.86     | 0.56       | 0.96       | 0.69

  (Query Length is a query feature; PageRank and Doc Length are document features; Title Rel. and Content Rel. are query-doc features.)

  12. Learning to Rank Approaches • Learn (not define) a scoring function to optimally rank the documents given a query • Pointwise • Predict the absolute relevance (e.g. RMSE) • Pairwise • Predict the ranking of a document pair (e.g. AUC) • Listwise • Predict the ranking of a document list (e.g. Cross Entropy) Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer 2011. http://www.cda.cn/uploadfile/image/20151220/20151220115436_46293.pdf

  13. Pointwise Approaches • Predict the expert ratings • As a regression problem: $y_i = f_\theta(x_i)$, trained by $\min_\theta \frac{1}{2N}\sum_{i=1}^N (y_i - f_\theta(x_i))^2$ • Training data: the ratings and query-doc features from the table on slide 11 (query = 'ML in China')
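A minimal pointwise sketch using the table's values, with least-squares regression standing in for a general $f_\theta$ (the feature columns are the slide's illustrative numbers):

```python
import numpy as np

# Query-doc features and expert ratings from the slide's table.
X = np.array([[0.30, 0.61, 0.47, 0.54, 0.76],   # d1, rating 3
              [0.30, 0.81, 0.76, 0.91, 0.81],   # d2, rating 5
              [0.30, 0.86, 0.56, 0.96, 0.69]])  # d3, rating 4
y = np.array([3.0, 5.0, 4.0])

# Pointwise approach: regress the ratings on the features,
# then rank documents by predicted score.
X1 = np.hstack([X, np.ones((len(X), 1))])       # add a bias column
theta, *_ = np.linalg.lstsq(X1, y, rcond=None)
scores = X1 @ theta
ranking = np.argsort(-scores)                   # best document first
print(ranking)                                  # d2 (5) > d3 (4) > d1 (3)
```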

  14. Point Accuracy != Ranking Accuracy • Same square error might lead to different rankings [Figure: two documents with ground-truth relevance 3 and 4; the predictions (2.4, 3.4) preserve their order, while (3.6, 3.4) has the same squared error but interchanges the ranking]
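A tiny numerical check of this point, using the figure's values (ground truth 3 and 4; both prediction sets are off by 0.6 per document):

```python
import numpy as np

truth = np.array([3.0, 4.0])
pred_a = np.array([2.4, 3.4])   # order of the two docs preserved
pred_b = np.array([3.6, 3.4])   # same squared error, order interchanged

def mse(pred):
    return float(np.mean((truth - pred) ** 2))

def rank(pred):
    return [int(i) for i in np.argsort(-pred)]   # best document first

print(mse(pred_a), mse(pred_b))    # equal pointwise error ...
print(rank(pred_a), rank(pred_b))  # ... but rankings [1, 0] vs [0, 1]
```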

  15. Pairwise Approaches • Not caring about the absolute relevance but the relative preference on a document pair • A binary classification problem • Transform: for query $q^{(i)}$ with labeled documents $d^{(i)}_1$ (rating 5), $d^{(i)}_2$ (rating 3), …, $d^{(i)}_{n^{(i)}}$ (rating 2), form the document pairs $(d^{(i)}_1, d^{(i)}_2), (d^{(i)}_1, d^{(i)}_{n^{(i)}}), \ldots, (d^{(i)}_2, d^{(i)}_{n^{(i)}})$ with preferences 5 > 3, 5 > 2, 3 > 2
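The pair-generation step can be sketched as follows (the helper name is mine):

```python
from itertools import combinations

def to_pairs(ratings):
    """Turn per-document ratings for one query into preference pairs
    (i, j), meaning document i should rank above document j."""
    pairs = []
    for i, j in combinations(range(len(ratings)), 2):
        if ratings[i] > ratings[j]:
            pairs.append((i, j))
        elif ratings[j] > ratings[i]:
            pairs.append((j, i))   # ties generate no pair
    return pairs

# Ratings 5, 3, 2 give the slide's preferences 5>3, 5>2, 3>2:
print(to_pairs([5, 3, 2]))   # [(0, 1), (0, 2), (1, 2)]
```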

  16. Binary Classification for Pairwise Ranking • Given a query $q$ and a pair of documents $(d_i, d_j)$ • Target probability: $y_{i,j} = 1$ if $d_i \triangleright d_j$, otherwise $y_{i,j} = 0$ • Modeled probability: $P_{i,j} = P(d_i \triangleright d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}$, where $o_{i,j} \equiv f(x_i) - f(x_j)$ and $x_i$ is the feature vector of $(q, d_i)$ • Cross entropy loss: $L(q; d_i; d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})$
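In code, the modeled probability and the pairwise cross-entropy look like this (a sketch; the scores $s_i = f(x_i)$ are assumed given):

```python
import math

def pair_prob(s_i, s_j):
    """P(d_i > d_j | q) = sigmoid(o_ij) with o_ij = f(x_i) - f(x_j)."""
    o_ij = s_i - s_j
    return 1.0 / (1.0 + math.exp(-o_ij))

def pair_loss(s_i, s_j, y_ij):
    """Cross-entropy on the pair label y_ij in {0, 1}."""
    p = pair_prob(s_i, s_j)
    return -y_ij * math.log(p) - (1 - y_ij) * math.log(1 - p)

# If the model scores d_i above d_j and the label agrees, the loss is small:
print(pair_loss(2.0, 0.0, 1))   # ~0.127
print(pair_loss(0.0, 2.0, 1))   # ~2.127: the mis-ordered pair is penalized
```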

  17. RankNet • The scoring function $f_\theta(x_i)$ is implemented by a neural network • Modeled probability: $P_{i,j} = P(d_i \triangleright d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}$, $o_{i,j} \equiv f(x_i) - f(x_j)$ • Cross entropy loss: $L(q; d_i; d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})$ • Gradient by the chain rule (backpropagation in the NN): $\frac{\partial L(q; d_i; d_j)}{\partial \theta} = \frac{\partial L(q; d_i; d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}} \frac{\partial o_{i,j}}{\partial \theta} = \frac{\partial L(q; d_i; d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}} \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$ Burges, Christopher JC, et al. "Learning to rank using gradient descent." ICML 2005.
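The chain rule above can be verified numerically. Below is a sketch with a linear scorer $f_\theta(x) = \theta^\top x$ standing in for the neural network (the derivation is identical; only $\partial f / \partial \theta$ changes):

```python
import numpy as np

def ranknet_grad(theta, x_i, x_j, y_ij):
    """Analytic gradient of the pairwise cross-entropy for a linear
    scorer f_theta(x) = theta @ x: dL/do = P_ij - y_ij, do/dtheta = x_i - x_j."""
    o_ij = theta @ (x_i - x_j)
    p_ij = 1.0 / (1.0 + np.exp(-o_ij))
    return (p_ij - y_ij) * (x_i - x_j)

rng = np.random.default_rng(0)
theta = rng.normal(size=3)
x_i, x_j = rng.normal(size=3), rng.normal(size=3)

# Central-difference check of the analytic gradient (label y_ij = 1).
def loss(t):
    o = t @ (x_i - x_j)
    return -np.log(1.0 / (1.0 + np.exp(-o)))

eps = 1e-6
g_num = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                  for e in np.eye(3)])
assert np.allclose(ranknet_grad(theta, x_i, x_j, 1.0), g_num, atol=1e-5)
```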

  18. Shortcomings of Pairwise Approaches • Each document pair is regarded with the same importance [Figure: a ranked list of documents with ratings 2, 4, 3, 2, 4; the same pair-level error can yield different list-level error depending on where in the list it occurs]

  19. Ranking Evaluation Metrics • For binary labels: $y_i = 1$ if $d_i$ is relevant to $q$, otherwise $y_i = 0$ • Precision@$k$ for query $q$: $P@k = \frac{\#\{\text{relevant documents in top } k \text{ results}\}}{k}$ • Average precision for query $q$: $AP = \frac{\sum_k P@k \cdot y_{i(k)}}{\#\{\text{relevant documents}\}}$, where $i(k)$ is the document id at the $k$-th position • Example with relevant documents at positions 1, 3 and 5: $AP = \frac{1}{3}\left(\frac{1}{1} + \frac{2}{3} + \frac{3}{5}\right)$ • Mean average precision (MAP): average of AP over all queries
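A direct translation of P@$k$ and AP, assuming the ranked list is given as a binary relevance vector with the best document first:

```python
def precision_at_k(rels, k):
    """rels: binary relevance of the ranked list, best first."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP = sum over relevant positions of P@k, divided by #relevant."""
    hits, total = 0, 0.0
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / k
    return total / max(sum(rels), 1)

# Relevant documents at ranks 1, 3 and 5, as in the slide's example:
rels = [1, 0, 1, 0, 1]
print(average_precision(rels))   # (1/1 + 2/3 + 3/5) / 3 ≈ 0.7556
```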

  20. Ranking Evaluation Metrics • For graded labels, e.g., $y_i \in \{0, 1, 2, 3, 4\}$ • Normalized discounted cumulative gain (NDCG@$k$) for query $q$: $NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log(j + 1)}$ (gain $2^{y_{i(j)}} - 1$, discount $1/\log(j + 1)$, normalizer $Z_k$) • $i(j)$ is the document id at the $j$-th position • $Z_k$ is set so that the DCG of the ground-truth ranking is normalized to 1
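The definition above takes only a few lines of code. This sketch uses log base 2 and 0-based positions, so the discount is $1/\log_2(j+2)$; the slide leaves the log base unspecified:

```python
import math

def dcg_at_k(labels, k):
    """labels: graded relevance of the ranked list, best first."""
    return sum((2 ** y - 1) / math.log2(j + 2)   # j is the 0-based rank
               for j, y in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """DCG normalized by the DCG of the ideal (sorted) ranking, i.e. Z_k."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 3, 2, 1, 0], 5))   # 1.0: already ideally ordered
print(ndcg_at_k([0, 1, 2, 3, 3], 5))   # < 1.0: reversed order is penalized
```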

  21. Shortcomings of Pairwise Approaches • Same pair-level error but different list-level error: $NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log(j + 1)}$ [Figure: two ranked lists with binary labels $y = 0$ and $y = 1$ showing the same pair-level error with different NDCG]

  22. Listwise Approaches • Training loss is built directly on the difference between the predicted list and the ground-truth list • Straightforward target • Directly optimizes the ranking evaluation measures • Shortcoming: complex model

  23. ListNet • Train the scoring function $y_i = f_\theta(x_i)$ • Rankings are generated based on $\{y_i\}_{i=1 \ldots n}$ • Each possible $k$-length ranking list has a probability $P_f([j_1, j_2, \ldots, j_k]) = \prod_{t=1}^{k} \frac{\exp(f(x_{j_t}))}{\sum_{l=t}^{n} \exp(f(x_{j_l}))}$ • List-level loss: cross entropy between the predicted distribution and the ground truth, $L(\mathbf{y}; f(\mathbf{x})) = -\sum_{g \in \mathcal{G}_k} P_y(g) \log P_f(g)$ • Complexity: many possible rankings Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
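As a small sketch of the permutation probability above (the Plackett-Luce model, here with $k = n$, i.e. full rankings), showing that the probabilities over all rankings form a proper distribution:

```python
import math
from itertools import permutations

def list_prob(scores, ranking):
    """Plackett-Luce probability of a ranking (tuple of doc indices,
    best first) under model scores f(x_j), per ListNet's formula."""
    remaining = list(ranking)
    p = 1.0
    for j in ranking:
        denom = sum(math.exp(scores[l]) for l in remaining)
        p *= math.exp(scores[j]) / denom
        remaining.remove(j)
    return p

scores = [2.0, 1.0, 0.5]
# The probabilities over all 3! = 6 rankings sum to one:
total = sum(list_prob(scores, g) for g in permutations(range(3)))
print(round(total, 6))   # 1.0
```

This is why the list-level loss is expensive: the sum in the cross entropy runs over all $|\mathcal{G}_k|$ rankings, which grows factorially.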

  24. Distance between Ranked Lists • A similar distance: KL divergence Tie-Yan Liu @ Tutorial at WWW 2008

  25. Pairwise vs. Listwise • Pairwise approach shortcoming • Pair-level loss is far from IR list-level evaluations • Listwise approach shortcoming • Hard to define a list-level loss with low model complexity • A good solution: LambdaRank • Pairwise training with listwise information Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.
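The LambdaRank idea can be sketched as follows: keep the pairwise (RankNet-style) gradient, but weight each pair by the $|\Delta NDCG|$ obtained from swapping the two documents. This is a minimal illustration, not Burges et al.'s exact implementation, and the helper names are mine:

```python
import math

def delta_ndcg(labels, i, j, ideal_dcg):
    """|Change in DCG| caused by swapping ranks i and j (0-based),
    normalized by the ideal DCG. labels: graded relevance, best first."""
    gain = lambda y: 2 ** y - 1
    disc = lambda pos: 1.0 / math.log2(pos + 2)
    d = (gain(labels[i]) - gain(labels[j])) * (disc(i) - disc(j))
    return abs(d) / ideal_dcg

def lambda_ij(s_i, s_j, w):
    """RankNet gradient for a pair where doc i should rank above doc j,
    scaled by the listwise weight w = |delta NDCG| (the LambdaRank idea)."""
    return -w / (1.0 + math.exp(s_i - s_j))

# A mis-ordered pair near the top of the list gets a larger weight
# than the same labels swapped near the bottom:
labels = [3, 0, 2, 3, 0]
print(delta_ndcg(labels, 0, 1, 1.0) > delta_ndcg(labels, 3, 4, 1.0))   # True
```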
