
Ranking, Aggregation, and You
Lester Mackey
Collaborators: John C. Duchi and Michael I. Jordan
Stanford University / UC Berkeley
October 5, 2014


  1. Tractable ranking
     First try: Empirical risk minimization ← Intractable!
       min_f R̂_n(f) := Ê_n[L(f(Q), Y)] = (1/n) ∑_{k=1}^n L(f(Q_k), Y_k)
     Idea: Replace loss L(α, Y) with convex surrogate ϕ(α, Y)
       L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)   (Hard)

  2. Tractable ranking
     First try: Empirical risk minimization ← Intractable!
       min_f R̂_n(f) := Ê_n[L(f(Q), Y)] = (1/n) ∑_{k=1}^n L(f(Q_k), Y_k)
     Idea: Replace loss L(α, Y) with convex surrogate ϕ(α, Y)
       L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)   (Hard)
       ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)   (Tractable)
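
To make the hard loss and its convex surrogate concrete, here is a small runnable sketch. The logistic choice of φ, the three-item toy preference matrix, and all variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

def pairwise_loss(alpha, Y):
    """Hard pairwise loss: L(alpha, Y) = sum_{i != j} Y_ij * 1(alpha_i <= alpha_j)."""
    n = len(alpha)
    return sum(Y[i, j] * (alpha[i] <= alpha[j])
               for i in range(n) for j in range(n) if i != j)

def surrogate_loss(alpha, Y, phi=lambda t: np.logaddexp(0.0, -t)):
    """Convex surrogate: sum_{i != j} Y_ij * phi(alpha_i - alpha_j).
    phi defaults to the logistic loss, one convex, non-increasing choice with phi'(0) < 0."""
    n = len(alpha)
    return sum(Y[i, j] * phi(alpha[i] - alpha[j])
               for i in range(n) for j in range(n) if i != j)

# Three items; Y_ij = 1 means item i was preferred to item j in this observation.
Y = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])
alpha = np.array([2.0, 1.0, 0.0])   # scores consistent with the observed preferences
print(pairwise_loss(alpha, Y), surrogate_loss(alpha, Y))
```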

  3. Surrogate ranking
     Idea: Empirical surrogate risk minimization
       min_f R̂_{ϕ,n}(f) := Ê_n[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)

  4. Surrogate ranking
     Idea: Empirical surrogate risk minimization
       min_f R̂_{ϕ,n}(f) := Ê_n[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)
     ◮ If ϕ convex, then minimization is tractable

  5. Surrogate ranking
     Idea: Empirical surrogate risk minimization
       min_f R̂_{ϕ,n}(f) := Ê_n[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)
     ◮ If ϕ convex, then minimization is tractable
     ◮ argmin_f R̂_{ϕ,n}(f) → argmin_f R_ϕ(f) := E[ϕ(f(Q), Y)] as n → ∞

  6. Surrogate ranking
     Idea: Empirical surrogate risk minimization
       min_f R̂_{ϕ,n}(f) := Ê_n[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)
     ◮ If ϕ convex, then minimization is tractable
     ◮ argmin_f R̂_{ϕ,n}(f) → argmin_f R_ϕ(f) := E[ϕ(f(Q), Y)] as n → ∞
     Main Question: Are these tractable ranking procedures consistent?

  7. Surrogate ranking
     Idea: Empirical surrogate risk minimization
       min_f R̂_{ϕ,n}(f) := Ê_n[ϕ(f(Q), Y)] = (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)
     ◮ If ϕ convex, then minimization is tractable
     ◮ argmin_f R̂_{ϕ,n}(f) → argmin_f R_ϕ(f) := E[ϕ(f(Q), Y)] as n → ∞
     Main Question: Are these tractable ranking procedures consistent?
     ⇐⇒ Does argmin_f R_ϕ(f) also minimize the true risk R(f)?
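
The empirical surrogate risk minimization above can be sketched end to end. The linear scoring function f(Q) = Qw, the synthetic noiseless preferences, and the logistic φ are my own assumptions for illustration; the talk does not specify a model class here.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data: each query is a feature matrix Q_k of shape (m items, d features),
# and scores come from a linear model f(Q) = Q @ w.
rng = np.random.default_rng(0)
m, d, n = 4, 3, 50
w_true = rng.normal(size=d)
queries, labels = [], []
for _ in range(n):
    Q = rng.normal(size=(m, d))
    scores = Q @ w_true
    Y = (scores[:, None] > scores[None, :]).astype(float)  # Y_ij = 1 if i preferred to j
    queries.append(Q)
    labels.append(Y)

def phi(t):
    return np.logaddexp(0.0, -t)  # logistic surrogate: convex, phi'(0) = -1/2 < 0

def empirical_surrogate_risk(w):
    total = 0.0
    for Q, Y in zip(queries, labels):
        alpha = Q @ w
        total += np.sum(Y * phi(alpha[:, None] - alpha[None, :]))
    return total / n

w_hat = minimize(empirical_surrogate_risk, np.zeros(d)).x  # no regularization; fine for a sketch
Q_test = rng.normal(size=(m, d))
print("learned order:", np.argsort(-(Q_test @ w_hat)))
print("true order:   ", np.argsort(-(Q_test @ w_true)))
```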

  8. Classification consistency
     Consider the special case of classification

  9. Classification consistency
     Consider the special case of classification
     ◮ Observe: query X, items {0, 1}, label Y_01 = 1 or Y_10 = 1

  10. Classification consistency
     Consider the special case of classification
     ◮ Observe: query X, items {0, 1}, label Y_01 = 1 or Y_10 = 1
     ◮ Pairwise loss: L(α, Y) = Y_01 1(α_0 ≤ α_1) + Y_10 1(α_1 ≤ α_0)

  11. Classification consistency
     Consider the special case of classification
     ◮ Observe: query X, items {0, 1}, label Y_01 = 1 or Y_10 = 1
     ◮ Pairwise loss: L(α, Y) = Y_01 1(α_0 ≤ α_1) + Y_10 1(α_1 ≤ α_0)
     ◮ Surrogate loss: ϕ(α, Y) = Y_01 φ(α_0 − α_1) + Y_10 φ(α_1 − α_0)

  12. Classification consistency
     Consider the special case of classification
     ◮ Observe: query X, items {0, 1}, label Y_01 = 1 or Y_10 = 1
     ◮ Pairwise loss: L(α, Y) = Y_01 1(α_0 ≤ α_1) + Y_10 1(α_1 ≤ α_0)
     ◮ Surrogate loss: ϕ(α, Y) = Y_01 φ(α_0 − α_1) + Y_10 φ(α_1 − α_0)
     Theorem: If φ is convex, the procedure based on minimizing φ is consistent if and only if φ′(0) < 0. [Bartlett, Jordan, and McAuliffe, 2006]
     ⇒ Tractable consistency for boosting, SVMs, logistic regression
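
A quick numerical illustration of the φ′(0) < 0 condition for a few standard convex margin losses; the finite-difference check and the particular losses listed are my own choices, not part of the slides.

```python
import numpy as np

# Check the sign of phi'(0) for some standard convex margin losses.
losses = {
    "hinge":       lambda t: np.maximum(0.0, 1.0 - t),
    "logistic":    lambda t: np.logaddexp(0.0, -t),
    "exponential": lambda t: np.exp(-t),
    "squared":     lambda t: (1.0 - t) ** 2,
}
eps = 1e-6
for name, phi in losses.items():
    deriv = (phi(eps) - phi(-eps)) / (2 * eps)   # central finite difference at 0
    print(f"{name:12s} phi'(0) ~ {deriv:+.3f}  -> {'condition holds' if deriv < 0 else 'check'}")
```

The exponential (boosting), hinge (SVM), and logistic losses all have φ′(0) < 0, matching the slide's conclusion about those procedures.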

  13. Ranking consistency?
     Good news: Can characterize surrogate ranking consistency [Duchi, Mackey, and Jordan, 2013]

  14. Ranking consistency?
     Good news: Can characterize surrogate ranking consistency
     Theorem [Duchi, Mackey, and Jordan, 2013]: The procedure based on minimizing ϕ is consistent ⇐⇒ for every query q,
       min_{α ∉ argmin_{α′} E[L(α′, Y) | q]} E[ϕ(α, Y) | q]  >  min_α E[ϕ(α, Y) | q].
     ◮ Translation: ϕ is consistent if and only if minimizing conditional surrogate risk gives correct ranking for every query
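
The characterization can be checked numerically for any single conditional distribution: minimize the conditional surrogate risk and ask whether the minimizer also attains the minimal conditional true risk. The three-item distribution, the logistic φ, and the brute-force search over orderings below are my own illustrative choices.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

# s[i, j] = E[Y_ij | q], average pairwise preferences for one query q.
s = np.array([[0.0, 0.6, 0.7],
              [0.4, 0.0, 0.8],
              [0.3, 0.2, 0.0]])

def conditional_true_risk(alpha):
    return sum(s[i, j] * (alpha[i] <= alpha[j])
               for i, j in itertools.permutations(range(3), 2))

def conditional_surrogate_risk(alpha, phi=lambda t: np.logaddexp(0.0, -t)):
    return sum(s[i, j] * phi(alpha[i] - alpha[j])
               for i, j in itertools.permutations(range(3), 2))

alpha_star = minimize(conditional_surrogate_risk, np.zeros(3)).x
best_true = min(conditional_true_risk(np.array(p)) for p in itertools.permutations(range(3)))
print("surrogate minimizer:", alpha_star)
print("its true risk:", conditional_true_risk(alpha_star), "vs optimal true risk:", best_true)
```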

  15. Ranking consistency? Bad news: The consequences are dire...

  16. Ranking consistency?
     Bad news: The consequences are dire...
     Consider the pairwise loss: L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)
     [Figure: partial preference graph on four items with observed preferences y_12, y_13, y_34]

  17. Ranking consistency?
     Bad news: The consequences are dire...
     Consider the pairwise loss: L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)
     [Figure: partial preference graph on four items with observed preferences y_12, y_13, y_34]
     Task: Find argmin_α E[L(α, Y) | q]

  18. Ranking consistency?
     Bad news: The consequences are dire...
     Consider the pairwise loss: L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)
     [Figure: partial preference graph on four items with observed preferences y_12, y_13, y_34]
     Task: Find argmin_α E[L(α, Y) | q]
     ◮ Classification (two node) case: Easy
       ◮ Choose α_0 > α_1 ⇐⇒ P[Class 0 | q] > P[Class 1 | q]

  19. Ranking consistency?
     Bad news: The consequences are dire...
     Consider the pairwise loss: L(α, Y) = ∑_{i≠j} Y_ij 1(α_i ≤ α_j)
     [Figure: partial preference graph on four items with observed preferences y_12, y_13, y_34]
     Task: Find argmin_α E[L(α, Y) | q]
     ◮ Classification (two node) case: Easy
       ◮ Choose α_0 > α_1 ⇐⇒ P[Class 0 | q] > P[Class 1 | q]
     ◮ General case: NP-hard
       ◮ Unless P = NP, must restrict problem for tractable consistency
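
The brute-force route to argmin_α E[L(α, Y) | q] makes the hardness tangible: with m items there are m! orderings to try, which is exactly what becomes intractable in general. The cyclic toy preferences below are my own example, not one from the talk.

```python
import itertools
import numpy as np

# s[i, j] = E[Y_ij | q]. Note the cycle: 0 beats 1, 1 beats 2, but 2 beats 0 on average.
s = np.array([[0.0, 0.6, 0.2, 0.7],
              [0.3, 0.0, 0.9, 0.4],
              [0.8, 0.1, 0.0, 0.5],
              [0.2, 0.5, 0.4, 0.0]])

def risk_of_order(order):
    """Expected pairwise loss of a ranking given as a tuple of items, best first."""
    pos = {item: p for p, item in enumerate(order)}
    return sum(s[i, j] for i in range(len(s)) for j in range(len(s))
               if i != j and pos[i] >= pos[j])   # i ranked at or below j => 1(alpha_i <= alpha_j)

best = min(itertools.permutations(range(len(s))), key=risk_of_order)
print("best ordering:", best, "expected loss:", risk_of_order(best))
```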

  20. Low noise distribution
     Define: Average preference for item i over item j: s_ij = E[Y_ij | q]
     ◮ We say i ≻ j on average if s_ij > s_ji

  21. Low noise distribution
     Define: Average preference for item i over item j: s_ij = E[Y_ij | q]
     ◮ We say i ≻ j on average if s_ij > s_ji
     Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.
     ◮ No cyclic preferences on average
     [Figure: three-item preference graph with edges s_12, s_23, s_13, s_31]
     Low noise ⇒ s_13 > s_31

  22. Low noise distribution
     Define: Average preference for item i over item j: s_ij = E[Y_ij | q]
     ◮ We say i ≻ j on average if s_ij > s_ji
     Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.
     ◮ No cyclic preferences on average
     ◮ Find argmin_α E[L(α, Y) | q]: Very easy
       ◮ Choose α_i > α_j ⇐⇒ s_ij > s_ji
     [Figure: three-item preference graph with edges s_12, s_23, s_13, s_31]
     Low noise ⇒ s_13 > s_31
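
Under the low-noise condition the conditional risk minimizer can be read off directly. A sketch, assuming strict preferences on average so the average-preference tournament is transitive; the matrix and the wins-count sort are my own rendering of "choose α_i > α_j ⇐⇒ s_ij > s_ji".

```python
import numpy as np

# s[i, j] = E[Y_ij | q], with strict, acyclic average preferences.
s = np.array([[0.0, 0.7, 0.6, 0.9],
              [0.2, 0.0, 0.8, 0.6],
              [0.3, 0.1, 0.0, 0.7],
              [0.1, 0.3, 0.2, 0.0]])

wins = (s > s.T).sum(axis=1)              # how many items each item beats on average
order = np.argsort(-wins)                 # best item first (valid for a transitive tournament)
alpha = -np.argsort(order).astype(float)  # any scores decreasing along `order`
print("ranking:", order, "scores:", alpha)
```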

  23. Ranking consistency?
     Pairwise ranking surrogate [Herbrich, Graepel, and Obermayer, 2000; Freund, Iyer, Schapire, and Singer, 2003; Dekel, Manning, and Singer, 2004]:
       ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     for φ convex with φ′(0) < 0. Common in the ranking literature.

  24. Ranking consistency?
     Pairwise ranking surrogate [Herbrich, Graepel, and Obermayer, 2000; Freund, Iyer, Schapire, and Singer, 2003; Dekel, Manning, and Singer, 2004]:
       ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     for φ convex with φ′(0) < 0. Common in the ranking literature.
     Theorem: ϕ is not consistent, even in low noise settings. [Duchi, Mackey, and Jordan, 2013]

  25. Ranking consistency?
     Pairwise ranking surrogate [Herbrich, Graepel, and Obermayer, 2000; Freund, Iyer, Schapire, and Singer, 2003; Dekel, Manning, and Singer, 2004]:
       ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     for φ convex with φ′(0) < 0. Common in the ranking literature.
     Theorem: ϕ is not consistent, even in low noise settings. [Duchi, Mackey, and Jordan, 2013]
     ⇒ Inconsistency for RankBoost, RankSVM, Logistic Ranking...

  26. Ranking with pairwise data is challenging

  27. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)

  28. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)
     ◮ Low noise distributions

  29. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)
     ◮ Low noise distributions
     ◮ Inconsistent for standard convex losses
         ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)

  30. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)
     ◮ Low noise distributions
     ◮ Inconsistent for standard convex losses
         ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     ◮ Inconsistent for margin-based convex losses
         ϕ(α, Y) = ∑_{i≠j} φ(α_i − α_j − Y_ij)

  31. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)
     ◮ Low noise distributions
     ◮ Inconsistent for standard convex losses
         ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     ◮ Inconsistent for margin-based convex losses
         ϕ(α, Y) = ∑_{i≠j} φ(α_i − α_j − Y_ij)
     Question: Do tractable consistent losses exist for partial preference data?

  32. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)
     ◮ Low noise distributions
     ◮ Inconsistent for standard convex losses
         ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     ◮ Inconsistent for margin-based convex losses
         ϕ(α, Y) = ∑_{i≠j} φ(α_i − α_j − Y_ij)
     Question: Do tractable consistent losses exist for partial preference data?
     Yes!

  33. Ranking with pairwise data is challenging
     ◮ Inconsistent in general (unless P = NP)
     ◮ Low noise distributions
     ◮ Inconsistent for standard convex losses
         ϕ(α, Y) = ∑_{i≠j} Y_ij φ(α_i − α_j)
     ◮ Inconsistent for margin-based convex losses
         ϕ(α, Y) = ∑_{i≠j} φ(α_i − α_j − Y_ij)
     Question: Do tractable consistent losses exist for partial preference data?
     Yes, if we aggregate!

  34. Outline
     Supervised Ranking: Formal definition; Tractable surrogates; Pairwise inconsistency
     Aggregation: Restoring consistency; Estimating complete preferences
     U-statistics: Practical procedures; Experimental results

  35. An observation
     Can rewrite risk of pairwise loss
       E[L(α, Y) | q] = ∑_{i≠j} s_ij 1(α_i ≤ α_j)
     where s_ij = E[Y_ij | q].

  36. An observation
     Can rewrite risk of pairwise loss
       E[L(α, Y) | q] = ∑_{i≠j} s_ij 1(α_i ≤ α_j) = ∑_{i≠j} max{s_ij − s_ji, 0} 1(α_i ≤ α_j)
     where s_ij = E[Y_ij | q].
     ◮ Only depends on net expected preferences: s_ij − s_ji

  37. An observation
     Can rewrite risk of pairwise loss
       E[L(α, Y) | q] = ∑_{i≠j} s_ij 1(α_i ≤ α_j) = ∑_{i≠j} max{s_ij − s_ji, 0} 1(α_i ≤ α_j)
     where s_ij = E[Y_ij | q].
     ◮ Only depends on net expected preferences: s_ij − s_ji
     Consider the surrogate
       ϕ(α, s) := ∑_{i≠j} max{s_ij − s_ji, 0} φ(α_i − α_j)
     for φ non-increasing and convex, with φ′(0) < 0.

  38. An observation
     Can rewrite risk of pairwise loss
       E[L(α, Y) | q] = ∑_{i≠j} s_ij 1(α_i ≤ α_j) = ∑_{i≠j} max{s_ij − s_ji, 0} 1(α_i ≤ α_j)
     where s_ij = E[Y_ij | q].
     ◮ Only depends on net expected preferences: s_ij − s_ji
     Consider the surrogate
       ϕ(α, s) := ∑_{i≠j} max{s_ij − s_ji, 0} φ(α_i − α_j)  ≠  ∑_{i≠j} s_ij φ(α_i − α_j)
     for φ non-increasing and convex, with φ′(0) < 0.
     ◮ Either i → j penalized or j → i, but not both

  39. An observation
     Can rewrite risk of pairwise loss
       E[L(α, Y) | q] = ∑_{i≠j} s_ij 1(α_i ≤ α_j) = ∑_{i≠j} max{s_ij − s_ji, 0} 1(α_i ≤ α_j)
     where s_ij = E[Y_ij | q].
     ◮ Only depends on net expected preferences: s_ij − s_ji
     Consider the surrogate
       ϕ(α, s) := ∑_{i≠j} max{s_ij − s_ji, 0} φ(α_i − α_j)  ≠  ∑_{i≠j} s_ij φ(α_i − α_j)
     for φ non-increasing and convex, with φ′(0) < 0.
     ◮ Either i → j penalized or j → i, but not both
     ◮ Consistent whenever average preferences are acyclic
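
The aggregate surrogate ϕ(α, s) defined on this slide is straightforward to implement. The logistic φ and the small ridge term (added only so the unconstrained minimum stays finite) are my own choices for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

def aggregate_surrogate(alpha, s, phi=lambda t: np.logaddexp(0.0, -t)):
    net = np.maximum(s - s.T, 0.0)            # max{s_ij - s_ji, 0}
    diffs = alpha[:, None] - alpha[None, :]   # alpha_i - alpha_j
    return np.sum(net * phi(diffs)) + 1e-3 * np.dot(alpha, alpha)  # tiny ridge, my addition

s = np.array([[0.0, 0.6, 0.7],
              [0.4, 0.0, 0.8],
              [0.3, 0.2, 0.0]])               # acyclic on average: 0 over 1 over 2
alpha_hat = minimize(aggregate_surrogate, np.zeros(3), args=(s,)).x
print("scores:", alpha_hat, "ranking:", np.argsort(-alpha_hat))
```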

  40. What happened?
     Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_k ϕ(α, Y_k)
     ◮ Loss ϕ(α, Y) applied to a single datapoint

  41. What happened?
     Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_k ϕ(α, Y_k)
     ◮ Loss ϕ(α, Y) applied to a single datapoint
     New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) ∑_k Y_k)
     ◮ Loss applied to aggregation of many datapoints

  42. What happened?
     Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_k ϕ(α, Y_k)
     ◮ Loss ϕ(α, Y) applied to a single datapoint
     New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) ∑_k Y_k)
     ◮ Loss applied to aggregation of many datapoints
     New framework: Ranking with aggregate losses
       L(α, s_k(Y_1, ..., Y_k)) and ϕ(α, s_k(Y_1, ..., Y_k))
     where s_k is a structure function that aggregates the first k datapoints

  43. What happened?
     Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_k ϕ(α, Y_k)
     ◮ Loss ϕ(α, Y) applied to a single datapoint
     New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) ∑_k Y_k)
     ◮ Loss applied to aggregation of many datapoints
     New framework: Ranking with aggregate losses
       L(α, s_k(Y_1, ..., Y_k)) and ϕ(α, s_k(Y_1, ..., Y_k))
     where s_k is a structure function that aggregates the first k datapoints
     ◮ s_k combines partial preferences into more complete estimates

  44. What happened?
     Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) ∑_k ϕ(α, Y_k)
     ◮ Loss ϕ(α, Y) applied to a single datapoint
     New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) ∑_k Y_k)
     ◮ Loss applied to aggregation of many datapoints
     New framework: Ranking with aggregate losses
       L(α, s_k(Y_1, ..., Y_k)) and ϕ(α, s_k(Y_1, ..., Y_k))
     where s_k is a structure function that aggregates the first k datapoints
     ◮ s_k combines partial preferences into more complete estimates
     ◮ Consistency characterization extends to this setting

  45. Aggregation via structure function
     [Figure: several partial preference graphs Y_1, Y_2, ..., Y_k on four items combined by the structure function into s_k(Y_1, ..., Y_k)]

  46. Aggregation via structure function
     [Figure: several partial preference graphs Y_1, Y_2, ..., Y_k on four items combined by the structure function into s_k(Y_1, ..., Y_k)]
     Question: When does aggregation help?
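
One simple structure function, offered only as an illustration (the talk does not commit to a particular s_k here): average the observed partial preference matrices, so the aggregate covers pairs that no single observation covers.

```python
import numpy as np

def s_k(Ys):
    """Structure function sketch: s_k(Y_1, ..., Y_k) = (1/k) * sum_t Y_t."""
    return np.mean(Ys, axis=0)

# Three partial observations on 4 items; Y[i, j] = 1 means i preferred to j.
Y1 = np.zeros((4, 4)); Y1[0, 1] = 1                  # compared only items 0 and 1
Y2 = np.zeros((4, 4)); Y2[1, 2] = 1; Y2[2, 3] = 1
Y3 = np.zeros((4, 4)); Y3[0, 2] = 1; Y3[3, 2] = 1
aggregate = s_k([Y1, Y2, Y3])
print(aggregate)   # a more complete preference structure than any single Y_t
```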

  47. Complete data losses
     ◮ Normalized Discounted Cumulative Gain (NDCG)
     ◮ Precision, Precision@k
     ◮ Expected reciprocal rank (ERR)
     Pros: Popular, well-motivated, admit tractable consistent surrogates
     ◮ e.g., Penalize mistakes at top of ranked list more heavily

  48. Complete data losses
     ◮ Normalized Discounted Cumulative Gain (NDCG)
     ◮ Precision, Precision@k
     ◮ Expected reciprocal rank (ERR)
     Pros: Popular, well-motivated, admit tractable consistent surrogates
     ◮ e.g., Penalize mistakes at top of ranked list more heavily
     Cons: Require complete preference data

  49. Complete data losses
     ◮ Normalized Discounted Cumulative Gain (NDCG)
     ◮ Precision, Precision@k
     ◮ Expected reciprocal rank (ERR)
     Pros: Popular, well-motivated, admit tractable consistent surrogates
     ◮ e.g., Penalize mistakes at top of ranked list more heavily
     Cons: Require complete preference data
     Idea:
     ◮ Use aggregation to estimate complete preferences from partial preferences

  50. Complete data losses
     ◮ Normalized Discounted Cumulative Gain (NDCG)
     ◮ Precision, Precision@k
     ◮ Expected reciprocal rank (ERR)
     Pros: Popular, well-motivated, admit tractable consistent surrogates
     ◮ e.g., Penalize mistakes at top of ranked list more heavily
     Cons: Require complete preference data
     Idea:
     ◮ Use aggregation to estimate complete preferences from partial preferences
     ◮ Plug estimates into consistent surrogates

  51. Complete data losses
     ◮ Normalized Discounted Cumulative Gain (NDCG)
     ◮ Precision, Precision@k
     ◮ Expected reciprocal rank (ERR)
     Pros: Popular, well-motivated, admit tractable consistent surrogates
     ◮ e.g., Penalize mistakes at top of ranked list more heavily
     Cons: Require complete preference data
     Idea:
     ◮ Use aggregation to estimate complete preferences from partial preferences
     ◮ Plug estimates into consistent surrogates
     ◮ Check that aggregation + surrogacy retains consistency
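
For concreteness, here is one common variant of NDCG (gain 2^rel − 1 with a log2 position discount); the particular variant and the toy relevance vector are my own choices and may differ from the one used in the talk's experiments.

```python
import numpy as np

def dcg(relevances_in_ranked_order):
    """Discounted cumulative gain with gain 2^rel - 1 and log2 position discount."""
    rel = np.asarray(relevances_in_ranked_order, dtype=float)
    discounts = np.log2(np.arange(2, len(rel) + 2))
    return np.sum((2.0 ** rel - 1.0) / discounts)

def ndcg(scores, relevances):
    order = np.argsort(-np.asarray(scores))        # predicted ranking, best first
    ideal = np.sort(np.asarray(relevances))[::-1]  # best possible ordering
    return dcg(np.asarray(relevances)[order]) / dcg(ideal)

print(ndcg(scores=[0.1, 2.0, 1.3, 0.4], relevances=[0, 3, 2, 1]))
```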

  52. Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008; Chapelle, Metzler, Zhang, and Grinspan, 2009]
     ◮ Person i clicks on first relevant result, k(i)
     [Figure: ranked result list with positions 1 through 5]

  53. Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008; Chapelle, Metzler, Zhang, and Grinspan, 2009]
     ◮ Person i clicks on first relevant result, k(i)
     ◮ Relevance probability of item k is p_k
     [Figure: ranked result list with positions 1 through 5]

  54. Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008; Chapelle, Metzler, Zhang, and Grinspan, 2009]
     ◮ Person i clicks on first relevant result, k(i)
     ◮ Relevance probability of item k is p_k
     ◮ Probability of a click on item k is p_k ∏_{j=1}^{k−1} (1 − p_j)
     [Figure: ranked result list with positions 1 through 5]

  55. Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008; Chapelle, Metzler, Zhang, and Grinspan, 2009]
     ◮ Person i clicks on first relevant result, k(i)
     ◮ Relevance probability of item k is p_k
     ◮ Probability of a click on item k is p_k ∏_{j=1}^{k−1} (1 − p_j)
     ◮ ERR loss assumes p is known
     [Figure: ranked result list with positions 1 through 5]

  56. Cascade model for click data [Craswell, Zoeter, Taylor, and Ramsey, 2008; Chapelle, Metzler, Zhang, and Grinspan, 2009]
     ◮ Person i clicks on first relevant result, k(i)
     ◮ Relevance probability of item k is p_k
     ◮ Probability of a click on item k is p_k ∏_{j=1}^{k−1} (1 − p_j)
     ◮ ERR loss assumes p is known
     Estimate p via maximum likelihood on n clicks:
       ŝ = argmax_{p ∈ [0,1]^m} ∑_{i=1}^n ( log p_{k(i)} + ∑_{j=1}^{k(i)−1} log(1 − p_j) )
     ⇒ Consistent ERR minimization under our framework
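
The argmax on this slide has a simple solution. The closed form below is my own reduction of the cascade likelihood, assuming every session ends in a click and the items are always displayed in the same order: p̂_j equals the fraction of sessions that clicked position j among those that examined it.

```python
import numpy as np

def cascade_mle(clicks, m):
    """Cascade-model MLE: p_hat[j] = (#sessions clicking item j) / (#sessions examining item j),
    where a session that clicks position k has examined positions 1, ..., k (1-indexed)."""
    clicks = np.asarray(clicks)
    clicked = np.array([(clicks == j).sum() for j in range(1, m + 1)])
    examined = np.array([(clicks >= j).sum() for j in range(1, m + 1)])
    with np.errstate(invalid="ignore"):
        return np.where(examined > 0, clicked / examined, 0.0)

# Seven sessions, each recording the clicked position in a 5-item list.
print(cascade_mle(clicks=[1, 3, 2, 1, 3, 5, 1], m=5))
```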

  57. Benefits of aggregation
     ◮ Tractable consistency for partial preference losses:
       argmin_f lim_{k→∞} E[ϕ(f(Q), s_k(Y_1, ..., Y_k))]  ⇒  argmin_f lim_{k→∞} E[L(f(Q), s_k(Y_1, ..., Y_k))]
     ◮ Use complete data losses with realistic partial preference data
     ◮ Models process of generating relevance scores from clicks/comparisons

  58. What remains?
     Before aggregation, we had
       argmin_f (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)  →  argmin_f E[ϕ(f(Q), Y)]
           (empirical)                                 (population)

  59. What remains?
     Before aggregation, we had
       argmin_f (1/n) ∑_{k=1}^n ϕ(f(Q_k), Y_k)  →  argmin_f E[ϕ(f(Q), Y)]
           (empirical)                                 (population)
     What's a suitable empirical analogue R̂_{ϕ,n}(f) with aggregation?
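
This excerpt stops before the answer, but the outline's "U-statistics" entry suggests its flavor. Purely as a guess at what such an empirical analogue could look like (not the talk's construction): for labels sharing a query, average the aggregate surrogate over all size-k subsets, with s_k taken to be the subset mean.

```python
import itertools
import numpy as np

def aggregated_empirical_risk(alpha, Ys, k, phi=lambda t: np.logaddexp(0.0, -t)):
    """U-statistic style average of the aggregate surrogate over size-k subsets of labels."""
    def surrogate(alpha, s):
        net = np.maximum(s - s.T, 0.0)
        return np.sum(net * phi(alpha[:, None] - alpha[None, :]))
    subsets = itertools.combinations(range(len(Ys)), k)
    return np.mean([surrogate(alpha, np.mean([Ys[i] for i in sub], axis=0))
                    for sub in subsets])

# Three partial preference observations for one query over three items.
Ys = [np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]], dtype=float),
      np.array([[0, 0, 1], [0, 0, 0], [0, 0, 0]], dtype=float),
      np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]], dtype=float)]
print(aggregated_empirical_risk(np.array([2.0, 1.0, 0.0]), Ys, k=2))
```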
