

SLIDE 1

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 2

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 3

Learning to rank

Learning to rank (L2R)

Definition

"... the task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance." - Liu [2009]

L2R models represent a rankable item (e.g., a document) in a given context (e.g., a user-issued query) as a numerical vector x ∈ R^n. The ranking model f : R^n → R is trained to map the vector to a real-valued score such that relevant items are scored higher. We discuss supervised (offline) L2R models first, and briefly introduce online L2R later.
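To make this setting concrete, here is a minimal sketch (not from the original slides) in which a toy linear model f scores feature vectors and documents are sorted by descending score; the weights and feature values are invented for illustration.

```python
import numpy as np

# Each (query, document) pair is a feature vector x in R^n; the model f maps
# x to a real-valued score, and documents are ranked by descending score.

def f(x, w):
    """A toy linear ranking model: f(x) = w . x."""
    return np.dot(w, x)

# Three candidate documents for one query, each as a 3-dimensional vector.
X = np.array([[0.2, 1.0, 0.1],
              [0.9, 0.3, 0.5],
              [0.4, 0.4, 0.4]])
w = np.array([1.0, 0.5, 2.0])      # "learned" weights, fixed here for illustration

scores = X @ w                      # one score per document
ranking = np.argsort(-scores)       # document indices in descending score order
print(ranking, scores[ranking])
```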

SLIDE 4

Learning to rank

Approaches

Liu [2009] categorizes different L2R approaches based on training objectives:

◮ Pointwise approach: the relevance label y_{q,d} is a number, derived from binary or graded human judgments or implicit user feedback (e.g., CTR). Typically, a regression or classification model is trained to predict y_{q,d} given x_{q,d}.

◮ Pairwise approach: a pairwise preference between documents for a query (d_i ≻_q d_j) serves as the label. This reduces to binary classification: predict the more relevant document.

◮ Listwise approach: directly optimize for a rank-based metric, such as NDCG. This is difficult because these metrics are often not differentiable with respect to the model parameters.

SLIDE 5

Learning to rank

Features

Traditional L2R models employ hand-crafted features that encode IR insights. They can often be categorized as:

◮ Query-independent or static features (e.g., incoming link count and document length)
◮ Query-dependent or dynamic features (e.g., BM25)
◮ Query-level features (e.g., query length)
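As an illustration of the three feature families, here is a minimal sketch of assembling a feature vector for one query/document pair; the overlap_score helper is a hypothetical stand-in for a real query-dependent scorer such as BM25, and the field names are invented.

```python
import numpy as np

def overlap_score(query, doc_text):
    """Stand-in for a query-dependent scorer such as BM25 (simple term overlap)."""
    q_terms = set(query.lower().split())
    d_terms = doc_text.lower().split()
    return sum(t in q_terms for t in d_terms) / (len(d_terms) or 1)

def make_features(query, doc):
    static = [doc["inlink_count"], len(doc["text"].split())]   # query-independent
    dynamic = [overlap_score(query, doc["text"])]              # query-dependent
    query_level = [len(query.split())]                         # query-level
    return np.array(static + dynamic + query_level, dtype=float)

doc = {"inlink_count": 12, "text": "neural networks for information retrieval"}
print(make_features("neural information retrieval", doc))
```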

SLIDE 6

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 7

Learning to rank

A quick refresher - Neural models for different tasks

SLIDE 8

Learning to rank

A quick refresher - What is the Softmax function?

In neural classification models, the softmax function is popularly used to normalize the neural network output scores across all the classes:

p(z_i) = exp(γ·z_i) / Σ_{z∈Z} exp(γ·z)   (2)

(γ is a constant)
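A minimal numpy sketch of equation (2); subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result.

```python
import numpy as np

def softmax(z, gamma=1.0):
    """Softmax over scores z with constant gamma, as in equation (2)."""
    e = np.exp(gamma * (z - z.max()))   # subtract max for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
print(softmax(z))                       # probabilities summing to 1
```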

SLIDE 9

Learning to rank

A quick refresher - What is Cross Entropy?

The cross entropy between two probability distributions p and q over a discrete set of events is given by:

CE(p, q) = − Σ_i p_i log(q_i)   (3)

If p_correct = 1 and p_i = 0 for all other values of i, then:

CE(p, q) = − log(q_correct)   (4)
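A minimal sketch of equations (3) and (4): with a one-hot p, the sum collapses to the negative log-probability of the correct event.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """CE(p, q) = -sum_i p_i * log(q_i), as in equation (3)."""
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])           # one-hot: event 1 is correct
q = np.array([0.2, 0.7, 0.1])           # predicted distribution
print(cross_entropy(p, q))              # equals -log(0.7), as in equation (4)
print(-np.log(0.7))
```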

SLIDE 10

Learning to rank

A quick refresher - What is the Cross Entropy with Softmax loss?

Cross entropy with softmax is a popular loss function for classification:

L_CE = −log( exp(γ·z_correct) / Σ_{z∈Z} exp(γ·z) )   (5)
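A sketch of equation (5), composing the two refreshers above; computing the loss via log-sum-exp avoids overflow for large scores.

```python
import numpy as np

def ce_with_softmax(z, correct, gamma=1.0):
    """L_CE = -log softmax(z)[correct], computed stably via log-sum-exp."""
    z = gamma * z
    log_norm = np.log(np.sum(np.exp(z - z.max()))) + z.max()
    return -(z[correct] - log_norm)

z = np.array([2.0, 1.0, 0.1])
print(ce_with_softmax(z, correct=0))
```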
SLIDE 11

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 12

Learning to rank

Pointwise objectives

Regression-based or classification-based approaches are popular.

Regression loss: given ⟨q, d⟩, predict the value of y_{q,d}. E.g., square loss for binary or categorical labels:

L_Squared = ‖y_{q,d} − f(x_{q,d})‖²   (6)

where y_{q,d} is the one-hot representation [Fuhr, 1989] or the actual value [Cossock and Zhang, 2006] of the label.
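A minimal sketch of the squared loss in equation (6) for a scalar graded label; the label and score values are invented for illustration.

```python
def squared_loss(y, score):
    """Pointwise regression loss: (y - f(x))^2, as in equation (6)."""
    return (y - score) ** 2

# A graded relevance label (e.g., 2 on a 0-4 scale) vs. the model's score f(x).
print(squared_loss(2.0, 1.4))
```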

SLIDE 13

Learning to rank

Pointwise objectives

Regression-based or classification-based approaches are popular.

Classification loss: given ⟨q, d⟩, predict the class y_{q,d}. E.g., cross entropy with softmax over categorical labels Y [Li et al., 2008]:

L_CE(q, d, y_{q,d}) = −log p(y_{q,d}|q, d) = −log( exp(γ·s_{y_{q,d}}) / Σ_{y∈Y} exp(γ·s_y) )   (7)

where s_{y_{q,d}} is the model's score for label y_{q,d}.

SLIDE 14

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 15

Learning to rank

Pairwise objectives

Pairwise loss minimizes the average number of inversions in ranking, i.e., cases where d_i ≻_q d_j but d_j is ranked higher than d_i.

Given ⟨q, d_i, d_j⟩, predict the more relevant document. For ⟨q, d_i⟩ and ⟨q, d_j⟩:

Feature vectors: x_i and x_j
Model scores: s_i = f(x_i) and s_j = f(x_j)

Pairwise loss generally has the following form [Chen et al., 2009]:

L_pairwise = φ(s_i − s_j)   (8)

where φ can be:

◮ Hinge function φ(z) = max(0, 1 − z) [Herbrich et al., 2000]
◮ Exponential function φ(z) = e^(−z) [Freund et al., 2003]
◮ Logistic function φ(z) = log(1 + e^(−z)) [Burges et al., 2005]
◮ etc.
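A minimal sketch of equation (8) with the three choices of φ listed above; the scores are invented for illustration.

```python
import numpy as np

# Pairwise loss L = phi(s_i - s_j) for three common choices of phi.
def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def exponential(z):
    return np.exp(-z)

def logistic(z):
    return np.log1p(np.exp(-z))

s_i, s_j = 1.2, 0.7                 # model scores, with d_i preferred over d_j
z = s_i - s_j
for phi in (hinge, exponential, logistic):
    print(phi.__name__, phi(z))
```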

SLIDE 16

Learning to rank

RankNet

RankNet [Burges et al., 2005] is a pairwise loss function, a popular choice for training neural L2R models and also an industry favourite [Burges, 2015].

Predicted probabilities:

p_ij = p(s_i > s_j) ≡ exp(γ·s_i) / (exp(γ·s_i) + exp(γ·s_j)) = 1 / (1 + e^(−γ(s_i − s_j)))
p_ji ≡ 1 / (1 + e^(−γ(s_j − s_i)))

Desired probabilities: p̄_ij = 1 and p̄_ji = 0.

Computing the cross entropy between p̄ and p:

L_RankNet = −p̄_ij log(p_ij) − p̄_ji log(p_ji)   (9)
          = −log(p_ij)   (10)
          = log(1 + e^(−γ(s_i − s_j)))   (11)
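A minimal sketch of the RankNet loss in its final form, equation (11).

```python
import numpy as np

def ranknet_loss(s_i, s_j, gamma=1.0):
    """L_RankNet = log(1 + exp(-gamma * (s_i - s_j))), as in equation (11)."""
    return np.log1p(np.exp(-gamma * (s_i - s_j)))

print(ranknet_loss(1.2, 0.7))   # small loss: the preferred document scores higher
print(ranknet_loss(0.7, 1.2))   # larger loss: the pair is inverted
```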

SLIDE 17

Learning to rank

Cross Entropy (CE) with Softmax over documents

An alternative loss function assumes a single relevant document d+ and compares it against the full collection D.

The probability of retrieving d+ for q is given by the softmax function:

p(d+|q) = exp(γ·s(q, d+)) / Σ_{d∈D} exp(γ·s(q, d))   (12)

The cross entropy loss is then given by:

L_CE(q, d+, D) = −log p(d+|q)   (13)
               = −log( exp(γ·s(q, d+)) / Σ_{d∈D} exp(γ·s(q, d)) )   (14)
SLIDE 18

Learning to rank

Notes on Cross Entropy (CE) loss

◮ If we consider only a pair of relevant and non-relevant documents in the denominator, CE reduces to RankNet.
◮ Computing the denominator is prohibitively expensive, so L2R models typically consider few negative candidates [Huang et al., 2013, Mitra et al., 2017, Shen et al., 2014].
◮ There is a large body of work in NLP dealing with a similar issue that may be relevant to future L2R models.
◮ E.g., hierarchical softmax [Goodman, 2001, Mnih and Hinton, 2009, Morin and Bengio, 2005], importance sampling [Bengio and Senécal, 2008, Bengio et al., 2003, Jean et al., 2014, Jozefowicz et al., 2016], Noise Contrastive Estimation [Gutmann and Hyvärinen, 2010, Mnih and Teh, 2012, Vaswani et al., 2013], negative sampling [Mikolov et al., 2013], and BlackOut [Ji et al., 2015].
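To make the "few negative candidates" idea concrete, here is an illustrative sketch (a simplification, not the exact scheme of any of the cited papers) that approximates the denominator of equation (14) with a handful of uniformly sampled negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_ce_loss(score_pos, all_neg_scores, n_samples=4, gamma=1.0):
    """Approximate -log p(d+|q) using a few sampled negatives in the denominator."""
    neg = rng.choice(all_neg_scores, size=n_samples, replace=False)
    z = gamma * np.concatenate(([score_pos], neg))
    log_norm = np.log(np.sum(np.exp(z - z.max()))) + z.max()
    return -(z[0] - log_norm)

neg_scores = rng.normal(size=1000)   # scores of (many) non-relevant documents
print(sampled_ce_loss(2.5, neg_scores))
```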

SLIDE 19

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 20

Learning to rank

Listwise

[Figure: two rankings of blue (relevant) and gray (non-relevant) documents.] NDCG and ERR are higher for the left ranking, but the right ranking has fewer pairwise errors. Due to the strong position-based discounting in IR measures, errors at higher ranks are much more problematic than errors at lower ranks. But listwise metrics are non-continuous and non-differentiable.

[Burges, 2010]

SLIDE 21

Learning to rank

LambdaRank

Key observations:

◮ To train a model we don't need the costs themselves, only the gradients (of the costs w.r.t. the model scores).
◮ It is desirable that the gradient be bigger for pairs of documents whose position swap produces a bigger change in NDCG.

LambdaRank [Burges et al., 2006]: multiply the actual gradients by the change in NDCG caused by swapping the rank positions of the two documents:

λ_LambdaRank = λ_RankNet · |ΔNDCG|   (15)
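A minimal sketch of equation (15): the RankNet gradient magnitude for a document pair is scaled by the NDCG change that swapping the two documents would cause. The dcg helper implements the common (2^label − 1)/log2(rank + 1) gain; the labels and scores are invented for illustration.

```python
import numpy as np

def dcg(labels):
    """Standard DCG: sum of (2^label - 1) / log2(rank + 1)."""
    ranks = np.arange(1, len(labels) + 1)
    return np.sum((2.0 ** labels - 1.0) / np.log2(ranks + 1))

def lambda_weight(labels, i, j, s_i, s_j, gamma=1.0):
    """|lambda| for pair (i, j): RankNet gradient magnitude times |delta NDCG|."""
    ideal = dcg(np.sort(labels)[::-1])          # normalizer for NDCG
    swapped = labels.copy()
    swapped[i], swapped[j] = swapped[j], swapped[i]
    delta_ndcg = abs(dcg(labels) - dcg(swapped)) / ideal
    ranknet_grad = gamma / (1.0 + np.exp(gamma * (s_i - s_j)))
    return ranknet_grad * delta_ndcg

labels = np.array([0.0, 2.0, 1.0])              # graded relevance, in ranked order
print(lambda_weight(labels, i=0, j=1, s_i=1.0, s_j=0.8))
```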

SLIDE 22

Learning to rank

ListNet and ListMLE

According to the Luce model [Luce, 2005], given four items {d_1, d_2, d_3, d_4} the probability of observing a particular rank order, say [d_2, d_1, d_4, d_3], is given by:

p(π|s) = φ(s_2) / (φ(s_1) + φ(s_2) + φ(s_3) + φ(s_4)) · φ(s_1) / (φ(s_1) + φ(s_3) + φ(s_4)) · φ(s_4) / (φ(s_3) + φ(s_4))   (16)

where π is a particular permutation and φ is a transformation (e.g., linear, exponential, or sigmoid) over the score s_i corresponding to item d_i.
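A minimal sketch of equation (16): under the Luce model, items are drawn without replacement, each with probability proportional to φ of its score (φ = exp here).

```python
import numpy as np

def luce_prob(scores, order, phi=np.exp):
    """Probability of observing `order` under the Luce model, as in equation (16)."""
    phis = phi(np.asarray(scores, dtype=float))
    remaining = list(range(len(scores)))
    p = 1.0
    for item in order[:-1]:                     # the last factor is always 1
        p *= phis[item] / phis[remaining].sum()
        remaining.remove(item)
    return p

scores = [0.5, 1.2, -0.3, 0.8]                  # scores s_1..s_4 for d_1..d_4
print(luce_prob(scores, order=[1, 0, 3, 2]))    # rank order [d_2, d_1, d_4, d_3]
```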

SLIDE 23

Learning to rank

ListNet and ListMLE

ListNet [Cao et al., 2007]: compute the probability distribution over all possible permutations based on the model scores and the ground-truth labels. The loss is then given by the K-L divergence between these two distributions. This is computationally very costly; computing permutations of only the top-K items makes it slightly less prohibitive.

ListMLE [Xia et al., 2008]: compute the probability of the ideal permutation based on the ground truth. However, with categorical labels more than one permutation is possible, which makes this difficult.
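A minimal ListMLE-style sketch building on the Luce model above: the loss is the negative log-probability of the permutation that sorts documents by their ground-truth labels. This is an illustration, not the exact implementation of Xia et al. [2008].

```python
import numpy as np

def listmle_loss(scores, labels):
    """-log p(ideal permutation | scores) under the Luce model with phi = exp."""
    ideal_order = np.argsort(-np.asarray(labels))   # sort by descending relevance
    phis = np.exp(np.asarray(scores, dtype=float))
    remaining = list(ideal_order)
    loss = 0.0
    for item in ideal_order:
        loss -= np.log(phis[item] / phis[remaining].sum())
        remaining.remove(item)
    return loss

print(listmle_loss(scores=[0.5, 1.2, -0.3], labels=[1, 2, 0]))
```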

SLIDE 24

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 25

Learning to rank

Training under different levels of supervision

Data requirements for training an off-line L2R system

Query/document pairs that encode an ideal ranking given a particular query.

◮ Ideal ranking? Relevance, preference, importance [Liu, 2009], novelty & diversity [Clarke et al., 2008].
◮ What about personalization? Triples of user, query and document.
◮ Related to evaluation: pairs are also used to compute popular off-line evaluation measures.
◮ Graded or binary? "Documents may be relevant to a different degree" [Järvelin and Kekäläinen, 2000]
◮ Absolute or relative? Zheng et al. [2007]

SLIDE 26

Learning to rank

How to satisfy data-hungry models?

There are different ways to obtain query/document pairs, listed here from most to least expensive:

1. Human judgments
2. Explicit user feedback
3. Implicit user feedback
4. Pseudo relevance
SLIDE 27

Learning to rank

Human judgments

Human judges determine the relevance of a document for a given query.

How to determine candidate query/document pairs?

◮ Obtaining human judgments is expensive.
◮ List of queries: sample of incoming traffic or manually curated.
◮ Use existing rankers to obtain rankings and pool the outputs [Sparck Jones and van Rijsbergen, 1976].
◮ Trade-off between the number of queries (shallow) and the number of judgments per query (deep) [Yilmaz and Robertson, 2009].

SLIDE 28

Learning to rank

Explicit user feedback

When presenting results to the user, ask the user to explicitly judge the documents. Unfortunately, users are only rarely willing to give explicit feedback [Joachims et al., 1997].

SLIDE 29

Learning to rank

Extracting pairs from click-through data (training)

Extract implicit judgments from search engine interactions by users.

◮ Assumption: user clicks ⇒ relevance (or preference).
◮ Virtually unlimited data at very low cost, but interpretation is more difficult.
◮ Presentation bias: users are more likely to click higher-ranked links.
◮ How to deal with presentation bias? Joachims [2003] suggests interleaving the results of different rankers and recording preferences; see the sketch below.
◮ Chains of queries (i.e., search sessions) can be identified within logs and more fine-grained user preferences can be extracted [Radlinski and Joachims, 2005].
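To illustrate the interleaving idea, here is a minimal team-draft-style sketch, a later variant in the spirit of the interleaving proposed by Joachims [2003]; the function name and details are illustrative, not from the original slides. The two rankers alternately contribute their best not-yet-shown document, and a click is credited to the contributing ranker.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, rng=random.Random(0)):
    """Team-draft interleaving: rankers alternately contribute their best unused doc."""
    n_docs = len(set(ranking_a) | set(ranking_b))
    combined, teams = [], {}
    while len(combined) < n_docs:
        # Each round, flip a coin for which ranker picks first.
        for team, ranking in rng.sample([("A", ranking_a), ("B", ranking_b)], 2):
            doc = next((d for d in ranking if d not in teams), None)
            if doc is not None:
                combined.append(doc)
                teams[doc] = team
    return combined, teams

combined, teams = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
print(combined)   # the interleaved list shown to the user
print(teams)      # a click on a document counts for its contributing team
```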

SLIDE 30

Learning to rank

Extracting pairs from click-through data (evaluation)

Clicks can also be used to evaluate different rankers.

◮ Radlinski et al. [2008] discuss how absolute metrics (e.g., abandonment rate) do not reliably reflect retrieval quality. However, relative metrics gathered using interleaving methods do reflect retrieval quality.
◮ Carterette and Jones [2008] propose a method to predict the relevance score of unjudged documents. This allows for comparisons across time and datasets.

SLIDE 31

Learning to rank

Side-track: Online LTR

As mentioned earlier, we focus mostly on offline LTR. Besides an active learning set-up, where models are re-trained frequently, neural models have not yet conquered the online paradigm. See the SIGIR’16 tutorial of Grotov and de Rijke [2016] for an overview.

SLIDE 32

Learning to rank

Pseudo relevance judgments

Pseudo relevance collections (discussed first on Slide 96) can also be used to train LTR systems.

Web search: Asadi et al. [2011] construct a pseudo relevance collection from anchor texts in a web corpus. LTR models trained using pseudo relevance outperform unsupervised retrieval functions (e.g., BM25) on TREC collections.

Microblog search: Berendsen et al. [2013] use hashtags as a topical relevance signal. Queries are constructed by sampling terms from tweets.

Personalized product search: Ai et al. [2017] synthesize purchase behavior from Amazon user reviews. Queries and relevance are constructed according to the human-curated Amazon product categories [Van Gysel et al., 2016]. They learn vector space representations for query terms, users and products.

SLIDE 33

Outline

Morning program
  Preliminaries
  Text matching I
  Text matching II
Afternoon program
  Learning to rank
    Overview & basics
    Refresher of cross-entropy
    Pointwise loss
    Pairwise loss
    Listwise loss
    Different levels of supervision
    Toolkits
  Modeling user behavior
  Generating responses
Wrap up

SLIDE 34

Learning to rank

Toolkits for off-line learning to rank

RankLib: https://sourceforge.net/p/lemur/wiki/RankLib
shoelace: https://github.com/rjagerman/shoelace [Jagerman et al., 2017]
QuickRank: http://quickrank.isti.cnr.it [Capannini et al., 2016]
RankPy: https://bitbucket.org/tunystom/rankpy
pyltr: https://github.com/jma127/pyltr
jforests: https://github.com/yasserg/jforests [Ganjisaffar et al., 2011]
XGBoost: https://github.com/dmlc/xgboost [Chen and Guestrin, 2016]
SVMRank: https://www.cs.cornell.edu/people/tj/svm_light [Joachims, 2006]
sofia-ml: https://code.google.com/archive/p/sofia-ml [Sculley, 2009]
pysofia: https://pypi.python.org/pypi/pysofia