SLIDE 1

Learning to Rank

Weinan Zhang, Shanghai Jiao Tong University
http://wnzhang.net
2019 EE448, Big Data Mining, Lecture 9

http://wnzhang.net/teaching/ee448/index.html

SLIDE 2

Content of This Course

  • Another ML problem: ranking
  • Learning to rank
    • Pointwise methods
    • Pairwise methods
    • Listwise methods
SLIDE 3

Ranking Problem

  • Learning to rank
    • Pointwise methods
    • Pairwise methods
    • Listwise methods

Sincere thanks to Dr. Tie-Yan Liu.

SLIDE 4

The Probability Ranking Principle

  • https://nlp.stanford.edu/IR-book/html/htmledition/the-probability-ranking-principle-1.html

SLIDE 5

Regression and Classification

  • Two major problems for supervised learning
  • Supervised learning objective:

    $\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i))$

  • Regression:

    $L(y_i, f_\theta(x_i)) = \frac{1}{2}\,(y_i - f_\theta(x_i))^2$

  • Classification:

    $L(y_i, f_\theta(x_i)) = -y_i \log f_\theta(x_i) - (1 - y_i) \log(1 - f_\theta(x_i))$
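The two per-instance losses and the averaged objective can be sketched in a few lines of Python (a minimal illustration, not from the slides; `empirical_risk` is a hypothetical helper name):

```python
import math

def squared_loss(y, f_x):
    """Regression: L(y, f(x)) = 1/2 (y - f(x))^2."""
    return 0.5 * (y - f_x) ** 2

def cross_entropy_loss(y, f_x):
    """Classification: L(y, f(x)) = -y log f(x) - (1 - y) log(1 - f(x))."""
    return -y * math.log(f_x) - (1 - y) * math.log(1 - f_x)

def empirical_risk(loss, ys, preds):
    """Supervised objective: (1/N) sum_i L(y_i, f(x_i))."""
    return sum(loss(y, p) for y, p in zip(ys, preds)) / len(ys)
```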

SLIDE 6

Learning to Rank Problem

  • Input: a set of instances $X = \{x_1, x_2, \ldots, x_n\}$
  • Output: a ranked list of these instances $\hat{Y} = \{x_{r_1}, x_{r_2}, \ldots, x_{r_n}\}$
  • Ground truth: the correct ranking of these instances $Y = \{x_{y_1}, x_{y_2}, \ldots, x_{y_n}\}$

SLIDE 7

A Typical Application

  • Webpage ranking given a query
  • Page ranking
SLIDE 8

Webpage Ranking

[Diagram] The ranking model takes the query q = "ML in China" and an indexed document repository D = {d_i}, and returns a ranked list of documents:
  d_1^q = https://www.crunchbase.com
  d_2^q = https://www.reddit.com
  ...
  d_n^q = https://www.quora.com

SLIDE 9

Model Perspective

  • In most existing work, learning to rank is defined as having the following two properties:
    • Feature-based: each instance (e.g. a query-document pair) is represented with a list of features
    • Discriminative training: estimate the relevance of a query-document pair, then rank the documents based on the estimates

$y_i = f_\theta(x_i)$

SLIDE 10

Learning to Rank

  • Input: features of query and documents
  • Query, document, and combination features
  • Output: the documents ranked by a scoring function
  • Objective: relevance of the ranking list
  • Evaluation metrics: NDCG, MAP, MRR…
  • Training data: the query-doc features and relevance ratings

$y_i = f_\theta(x_i)$

SLIDE 11

Training Data

  • The query-doc features and relevance ratings

Query = 'ML in China'

Rating  Document                  Query Length  Doc PageRank  Doc Length  Title Rel.  Content Rel.
3       d1=http://crunchbase.com  0.30          0.61          0.47        0.54        0.76
5       d2=http://reddit.com      0.30          0.81          0.76        0.91        0.81
4       d3=http://quora.com       0.30          0.86          0.56        0.96        0.69

(Query features: Query Length. Document features: Doc PageRank, Doc Length. Query-doc features: Title Rel., Content Rel.)

SLIDE 12

Learning to Rank Approaches

  • Learn (not define) a scoring function to optimally rank the documents given a query

  • Pointwise
  • Predict the absolute relevance (e.g. RMSE)
  • Pairwise
  • Predict the ranking of a document pair (e.g. AUC)
  • Listwise
  • Predict the ranking of a document list (e.g. Cross Entropy)

Tie-Yan Liu. Learning to Rank for Information Retrieval. Springer 2011.

http://www.cda.cn/uploadfile/image/20151220/20151220115436_46293.pdf

SLIDE 13

Pointwise Approaches

  • Predict the expert ratings
  • As a regression problem:

$y_i = f_\theta(x_i)$

$\min_\theta \frac{1}{2N} \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$

(trained on the query-doc features and ratings from the training-data table for Query = 'ML in China')
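As a minimal pointwise sketch (not taken from the slides), gradient descent can fit a linear scorer $f_\theta(x) = \theta \cdot x$ to the toy ratings table and then rank the three documents by predicted score; the feature vectors are copied from the table, everything else is illustrative:

```python
# Pointwise sketch: fit a linear scorer f(x) = theta . x by gradient descent
# on the regression objective (1/2N) sum_i (y_i - f(x_i))^2, then rank by score.
# Feature order: [query length, doc pagerank, doc length, title rel., content rel.]
docs = {
    "d1": ([0.30, 0.61, 0.47, 0.54, 0.76], 3.0),  # rating 3
    "d2": ([0.30, 0.81, 0.76, 0.91, 0.81], 5.0),  # rating 5
    "d3": ([0.30, 0.86, 0.56, 0.96, 0.69], 4.0),  # rating 4
}

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

theta = [0.0] * 5
lr, n = 0.5, len(docs)
for _ in range(5000):
    grad = [0.0] * 5
    for x, y in docs.values():
        err = dot(theta, x) - y  # f(x_i) - y_i
        for k in range(5):
            grad[k] += err * x[k] / n
    theta = [t - lr * g for t, g in zip(theta, grad)]

# Rank documents by predicted score (descending)
ranking = sorted(docs, key=lambda d: dot(theta, docs[d][0]), reverse=True)
print(ranking)
```

After training, the highest-rated document (d2) should come first.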

SLIDE 14

Point Accuracy != Ranking Accuracy

  • The same squared error might lead to different rankings

[Figure: Doc 1 has ground-truth relevance 3, Doc 2 has 4; predictions (2.4, 4.6) preserve the order, while predictions (3.6, 3.4) have the same per-document squared error but the ranking is interchanged]

SLIDE 15

Pairwise Approaches

  • Cares not about the absolute relevance, but about the relative preference within a document pair
  • A binary classification problem

For query $q^{(i)}$, the rated document list
$[(d^{(i)}_1, 5),\ (d^{(i)}_2, 3),\ \ldots,\ (d^{(i)}_{n^{(i)}}, 2)]$
is transformed into document pairs
$(d^{(i)}_1, d^{(i)}_2),\ (d^{(i)}_1, d^{(i)}_{n^{(i)}}),\ \ldots,\ (d^{(i)}_2, d^{(i)}_{n^{(i)}})$
with pairwise preferences $5 > 3$, $5 > 2$, $3 > 2$.

SLIDE 16

Binary Classification for Pairwise Ranking

  • Given a query $q$ and a pair of documents $(d_i, d_j)$, the target label is

$y_{i,j} = \begin{cases} 1 & \text{if } d_i \triangleright d_j \\ 0 & \text{otherwise} \end{cases}$

  • Modeled probability

$P_{i,j} = P(d_i \triangleright d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}$, where $o_{i,j} \equiv f(x_i) - f(x_j)$

    ($x_i$ is the feature vector of $(q, d_i)$)

  • Cross-entropy loss between the target and modeled probabilities

$L(q, d_i, d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})$
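The modeled probability (a sigmoid of the score difference) and the cross-entropy loss can be sketched directly (a minimal illustration; function names are hypothetical):

```python
import math

def pairwise_prob(f_xi, f_xj):
    """P_ij = exp(o_ij) / (1 + exp(o_ij)) with o_ij = f(x_i) - f(x_j),
    i.e. the sigmoid of the score difference."""
    o_ij = f_xi - f_xj
    return math.exp(o_ij) / (1.0 + math.exp(o_ij))

def pairwise_loss(y_ij, p_ij):
    """Cross entropy between the target y_ij in {0, 1} and the modeled P_ij."""
    return -y_ij * math.log(p_ij) - (1 - y_ij) * math.log(1 - p_ij)

# Equal scores give P_ij = 0.5; a higher score for d_i pushes P_ij toward 1.
```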

SLIDE 17

RankNet

  • The scoring function $f_\theta(x_i)$ is implemented by a neural network
  • Modeled probability

$P_{i,j} = P(d_i \triangleright d_j \mid q) = \frac{\exp(o_{i,j})}{1 + \exp(o_{i,j})}$, where $o_{i,j} \equiv f(x_i) - f(x_j)$

  • Cross-entropy loss

$L(q, d_i, d_j) = -y_{i,j} \log P_{i,j} - (1 - y_{i,j}) \log(1 - P_{i,j})$

  • Gradient by the chain rule (backpropagation in the NN):

$\frac{\partial L(q, d_i, d_j)}{\partial \theta} = \frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}} \frac{\partial o_{i,j}}{\partial \theta} = \frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}} \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$

Burges, Christopher J.C., et al. "Learning to rank using gradient descent." ICML 2005.
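For this sigmoid-of-score-difference model with cross-entropy loss, the chain-rule factor $\frac{\partial L}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}}$ simplifies to $P_{i,j} - y_{i,j}$; a quick finite-difference check of that simplification (a sketch, assuming the definitions above):

```python
import math

def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))

def loss(o_ij, y_ij):
    """Cross entropy as a function of the score difference o_ij."""
    p = sigmoid(o_ij)
    return -y_ij * math.log(p) - (1 - y_ij) * math.log(1 - p)

def lam(o_ij, y_ij):
    """dL/do_ij = (dL/dP)(dP/do) = P_ij - y_ij for sigmoid + cross entropy."""
    return sigmoid(o_ij) - y_ij

# Verify the analytic gradient against a numeric finite difference
o, y, eps = 0.7, 1, 1e-6
numeric = (loss(o + eps, y) - loss(o - eps, y)) / (2 * eps)
assert abs(numeric - lam(o, y)) < 1e-6
```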

SLIDE 18

Shortcomings of Pairwise Approaches

  • Each document pair is regarded with the same importance

[Figure: ranked lists of rated documents — the same pair-level error can give a different list-level error]

SLIDE 19

Ranking Evaluation Metrics

  • Precision@k for query q:

$P@k = \frac{\#\{\text{relevant documents in top } k \text{ results}\}}{k}$

  • For binary labels:

$y_i = \begin{cases} 1 & \text{if } d_i \text{ is relevant to } q \\ 0 & \text{otherwise} \end{cases}$

  • Average precision for query q, where $i(k)$ is the document id at the k-th position:

$AP = \frac{\sum_k P@k \cdot y_{i(k)}}{\#\{\text{relevant documents}\}}$

  • Example: relevant documents at positions 1, 3, and 5 give

$AP = \frac{1}{3} \left( \frac{1}{1} + \frac{2}{3} + \frac{3}{5} \right)$

  • Mean average precision (MAP): average of AP over all queries
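Both metrics are easy to compute from a binary relevance list in ranked order; a sketch reproducing the slide's AP example (function names are hypothetical):

```python
def precision_at_k(rels, k):
    """P@k = (# relevant documents in top k) / k; rels is the binary
    relevance list in ranked order."""
    return sum(rels[:k]) / k

def average_precision(rels):
    """AP = sum_k P@k * y_{i(k)} / (# relevant documents)."""
    n_rel = sum(rels)
    return sum(precision_at_k(rels, k + 1) * r for k, r in enumerate(rels)) / n_rel

# Slide example: relevant documents at positions 1, 3, 5
ap = average_precision([1, 0, 1, 0, 1])
# equals (1/3) * (1/1 + 2/3 + 3/5)
```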
SLIDE 20

Ranking Evaluation Metrics

  • Normalized discounted cumulative gain (NDCG@k) for query q:

$NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log_2(j + 1)}$

    (gain: $2^{y_{i(j)}} - 1$; discount: $1 / \log_2(j + 1)$; normalizer: $Z_k$)

  • For graded labels, e.g. $y_i \in \{0, 1, 2, 3, 4\}$
  • $i(j)$ is the document id at the j-th position
  • $Z_k$ is set so that the DCG of the ground-truth ranking is normalized to 1
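A sketch of the computation (hypothetical function names; `labels` is the graded relevance list in ranked order, and the ideal DCG supplies the normalizer $Z_k$):

```python
import math

def dcg_at_k(labels, k):
    """DCG@k = sum_{j=1..k} (2^{y_{i(j)}} - 1) / log2(j + 1).
    The 0-indexed loop uses log2(j + 2) for the 1-indexed position j + 1."""
    return sum((2 ** y - 1) / math.log2(j + 2) for j, y in enumerate(labels[:k]))

def ndcg_at_k(labels, k):
    """Normalize by the DCG of the ideal (ground-truth) ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

# A perfectly ordered list scores 1; misordering near the top hurts more.
```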

SLIDE 21

Shortcomings of Pairwise Approaches

  • The same pair-level error but a different list-level error under

$NDCG@k = Z_k \sum_{j=1}^{k} \frac{2^{y_{i(j)}} - 1}{\log_2(j + 1)}$

[Figure: two ranked lists with binary labels y = 0 and y = 1]

SLIDE 22

Listwise Approaches

  • The training loss is built directly on the difference between the predicted list and the ground-truth list
  • Straightforward target: directly optimize the ranking evaluation measures
  • Shortcoming: requires a complex model
SLIDE 23

ListNet

  • Train the scoring function $y_i = f_\theta(x_i)$
  • Rankings are generated based on the scores $\{y_i\}_{i=1 \ldots n}$
  • Each possible k-length ranking list has a probability

$P_f([j_1, j_2, \ldots, j_k]) = \prod_{t=1}^{k} \frac{\exp(f(x_{j_t}))}{\sum_{l=t}^{n} \exp(f(x_{j_l}))}$

  • List-level loss: cross entropy between the predicted distribution and the ground truth

$L(y, f(x)) = -\sum_{g \in G_k} P_y(g) \log P_f(g)$

  • Complexity: many possible rankings to sum over

Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th International Conference on Machine Learning. ACM, 2007.
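In practice the Cao et al. paper uses the top-1 (k = 1) case, where the list distribution reduces to a softmax over scores; a sketch under that simplification (function names are hypothetical):

```python
import math

def softmax(scores):
    """Top-1 ListNet probability of each document being ranked first."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def listnet_top1_loss(true_scores, pred_scores):
    """Cross entropy between the two top-1 distributions: -sum_j P_y(j) log P_f(j)."""
    p_true = softmax(true_scores)
    p_pred = softmax(pred_scores)
    return -sum(pt * math.log(pf) for pt, pf in zip(p_true, p_pred))

# The loss is minimized when the predicted scores induce the same distribution
# as the ground-truth scores; shifting all scores by a constant changes nothing.
```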

SLIDE 24

Distance between Ranked Lists

  • A similar distance: KL divergence

Tie-Yan Liu @ Tutorial at WWW 2008

SLIDE 25

Pairwise vs. Listwise

  • Pairwise approach shortcoming
    • The pair-level loss is far from IR list-level evaluation measures
  • Listwise approach shortcoming
    • Hard to define a list-level loss with low model complexity
  • A good solution: LambdaRank
    • Pairwise training with listwise information

Burges, Christopher J.C., Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS 2006.

SLIDE 26

LambdaRank

  • Pairwise approach gradient, with $o_{i,j} \equiv f(x_i) - f(x_j)$:

$\frac{\partial L(q, d_i, d_j)}{\partial \theta} = \underbrace{\frac{\partial L(q, d_i, d_j)}{\partial P_{i,j}} \frac{\partial P_{i,j}}{\partial o_{i,j}}}_{\lambda_{i,j}} \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$

    ($\lambda_{i,j}$ comes from the pairwise ranking loss; the parenthesized factor from the scoring function itself)

  • LambdaRank basic idea: replace $\lambda_{i,j}$ with $h(\lambda_{i,j}, g_q)$, where $g_q$ is the current ranking list:

$\frac{\partial L(q, d_i, d_j)}{\partial \theta} = h(\lambda_{i,j}, g_q) \left( \frac{\partial f_\theta(x_i)}{\partial \theta} - \frac{\partial f_\theta(x_j)}{\partial \theta} \right)$

SLIDE 27

LambdaRank for Optimizing NDCG

  • A choice of lambda for optimizing NDCG:

$h(\lambda_{i,j}, g_q) = \lambda_{i,j} \cdot \Delta NDCG_{i,j}$

where $\Delta NDCG_{i,j}$ is the NDCG change obtained by swapping $d_i$ and $d_j$ in the current ranking list.
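One way to realize $\Delta NDCG_{i,j}$ is to swap the two documents in the current ranked list and measure the NDCG difference; a sketch (hypothetical function names, graded labels in ranked order):

```python
import math

def dcg(labels):
    """DCG of a full ranked list of graded labels."""
    return sum((2 ** y - 1) / math.log2(j + 2) for j, y in enumerate(labels))

def delta_ndcg(labels, i, j):
    """|Delta NDCG_{i,j}|: the NDCG change if the documents at positions
    i and j of the current ranking were swapped."""
    ideal = dcg(sorted(labels, reverse=True))
    swapped = labels[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(labels) - dcg(swapped)) / ideal

# A mis-ordered pair near the top changes NDCG far more than one near the
# bottom, so h(lambda_ij, g_q) upweights the pairs that matter list-wise.
labels = [0, 3, 2, 3, 1, 0]
top = delta_ndcg(labels, 0, 1)
bottom = delta_ndcg(labels, 4, 5)
```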

SLIDE 28

LambdaRank vs. RankNet

Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.

[Figure: LambdaRank vs. RankNet results with linear nets]

SLIDE 29

LambdaRank vs. RankNet

Burges, Christopher JC, Robert Ragno, and Quoc Viet Le. "Learning to rank with nonsmooth cost functions." NIPS. Vol. 6. 2006.

SLIDE 30

Summary of Learning to Rank

  • Pointwise, pairwise, and listwise approaches for learning to rank
  • Pairwise approaches are still the most popular
    • A balance of ranking effectiveness and training efficiency
  • LambdaRank is a pairwise approach with list-level information
    • Easy to implement, easy to improve and adjust