SLIDE 1 Ranking, Aggregation, and You
Lester Mackey†
Collaborators: John C. Duchi† and Michael I. Jordan∗
†Stanford University ∗UC Berkeley
October 5, 2014
SLIDE 2
A simple question
SLIDE 3 A simple question
◮ On a scale of 1 (very white) to 10 (very black), how black is this
box?
SLIDE 4 A simple question
◮ On a scale of 1 (very white) to 10 (very black), how black is this
box?
◮ Which box is blacker?
SLIDE 5
Another question
On a scale of 1 to 10, how relevant is this result for the query flowers?
SLIDE 6
Another question
On a scale of 1 to 10, how relevant is this result for the query flowers?
SLIDE 7
Another question
SLIDE 8
What have we learned?
SLIDE 9 What have we learned?
- 1. We are good at pairwise comparisons
◮ Much worse at absolute relevance judgments [Miller, 1956, Shiffrin and Nosofsky, 1994, Stewart, Brown, and Chater, 2005]
SLIDE 10 What have we learned?
- 1. We are good at pairwise comparisons
◮ Much worse at absolute relevance judgments [Miller, 1956, Shiffrin and Nosofsky, 1994, Stewart, Brown, and Chater, 2005]
- 2. We are good at expressing sparse, partial preferences
◮ Much worse at expressing complete preferences
[Figure: a complete preference ordering over ftd.com, en.wikipedia.org/..., 1800flowers.com vs. the sparse pairwise preferences you actually express]
SLIDE 11
Ranking
Goal: Order set of items/results to best match your preferences
SLIDE 12 Ranking
Goal: Order set of items/results to best match your preferences
◮ Web search: Return most relevant URLs for user queries
SLIDE 13 Ranking
Goal: Order set of items/results to best match your preferences
◮ Web search: Return most relevant URLs for user queries
◮ Recommendation systems:
  ◮ Movies to watch based on user’s past ratings
  ◮ News articles to read based on past browsing history
  ◮ Items to buy based on patron’s or other patrons’ purchases
SLIDE 14 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
SLIDE 15 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
SLIDE 16 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
- 3. Realistic: Make use of ubiquitous partial preference data
SLIDE 17 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
- 3. Realistic: Make use of ubiquitous partial preference data
Past work: 1+2 are possible given complete preference data
[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]
SLIDE 18 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
- 3. Realistic: Make use of ubiquitous partial preference data
Past work: 1+2 are possible given complete preference data
[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]
This work [Duchi, Mackey, and Jordan, 2013]
SLIDE 19 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
- 3. Realistic: Make use of ubiquitous partial preference data
Past work: 1+2 are possible given complete preference data
[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]
This work [Duchi, Mackey, and Jordan, 2013]
◮ Standard (tractable) procedures for ranking with partial
preferences are inconsistent
SLIDE 20 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
- 3. Realistic: Make use of ubiquitous partial preference data
Past work: 1+2 are possible given complete preference data
[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]
This work [Duchi, Mackey, and Jordan, 2013]
◮ Standard (tractable) procedures for ranking with partial
preferences are inconsistent
◮ Aggregating partial preferences into more complete preferences
can restore consistency
SLIDE 21 Ranking procedures
Goal: Order set of items/results to best match your preferences
- 1. Tractable: Run in polynomial time
- 2. Consistent: Recover true preferences given sufficient data
- 3. Realistic: Make use of ubiquitous partial preference data
Past work: 1+2 are possible given complete preference data
[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]
This work [Duchi, Mackey, and Jordan, 2013]
◮ Standard (tractable) procedures for ranking with partial
preferences are inconsistent
◮ Aggregating partial preferences into more complete preferences
can restore consistency
◮ New estimators based on U-statistics achieve 1+2+3
SLIDE 22
Outline
Supervised Ranking
  ◮ Formal definition
  ◮ Tractable surrogates
  ◮ Pairwise inconsistency
Aggregation
  ◮ Restoring consistency
  ◮ Estimating complete preferences
U-statistics
  ◮ Practical procedures
  ◮ Experimental results
SLIDE 23
Outline
Supervised Ranking
  ◮ Formal definition
  ◮ Tractable surrogates
  ◮ Pairwise inconsistency
Aggregation
  ◮ Restoring consistency
  ◮ Estimating complete preferences
U-statistics
  ◮ Practical procedures
  ◮ Experimental results
SLIDE 24
Supervised ranking
Observe: Sequence of training examples
SLIDE 25 Supervised ranking
Observe: Sequence of training examples
◮ Query Q: e.g., search term “flowers”
SLIDE 26 Supervised ranking
Observe: Sequence of training examples
◮ Query Q: e.g., search term “flowers”
◮ Set of m items IQ to rank
◮ e.g., websites {1, 2, 3, 4}
SLIDE 27 Supervised ranking
Observe: Sequence of training examples
◮ Query Q: e.g., search term “flowers”
◮ Set of m items IQ to rank
◮ e.g., websites {1, 2, 3, 4}
◮ Label Y representing some preference
structure over items
SLIDE 28 Supervised ranking
Observe: Sequence of training examples
◮ Query Q: e.g., search term “flowers”
◮ Set of m items IQ to rank
◮ e.g., websites {1, 2, 3, 4}
◮ Label Y representing some preference
structure over items
◮ Item 1 preferred to {2, 3} and item 3 to 4
[Figure: preference graph Y on items {1, 2, 3, 4} with edges y12, y13, y34]
Example: Y is a graph on items {1, 2, 3, 4}
SLIDE 29
Supervised ranking
Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to induce item rankings for each query
SLIDE 30 Supervised ranking
Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to induce item rankings for each query
◮ Real-valued score for each item i in item set IQ
αi := fi(Q)
◮ Vector of scores f(Q) induces ranking over IQ
i ranked above j ⇔ αi > αj
SLIDE 31
Supervised ranking
Example: Scoring function f with scores f1(Q) > f2(Q) > f3(Q) induces the same ranking as preference graph Y
[Figure: chain graph Y on items 1 → 2 → 3, matched by f1(Q) > f2(Q) and f2(Q) > f3(Q)]
SLIDE 32 Supervised ranking
Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to predict item ranking
Suffer loss: L(f(Q), Y)
◮ Encodes discord between observed label Y and prediction f(Q)
◮ Depends on specific ranking task and available data
SLIDE 33
Supervised ranking
Example: Pairwise loss
SLIDE 34 Supervised ranking
Example: Pairwise loss
◮ Let Y = (weighted) adjacency matrix for a preference graph
◮ Yij = the preference weight on edge (i, j)
SLIDE 35 Supervised ranking
Example: Pairwise loss
◮ Let Y = (weighted) adjacency matrix for a preference graph
◮ Yij = the preference weight on edge (i, j)
◮ Let α = f(Q) be the predicted scores for query Q
SLIDE 36 Supervised ranking
Example: Pairwise loss
◮ Let Y = (weighted) adjacency matrix for a preference graph
◮ Yij = the preference weight on edge (i, j)
◮ Let α = f(Q) be the predicted scores for query Q
◮ Then, L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)
◮ Imposes a penalty for each misordered edge
L(α, Y) = Y12·1(α1 ≤ α2) + Y13·1(α1 ≤ α3) + Y34·1(α3 ≤ α4)
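As an aside not in the talk, this pairwise loss is easy to compute directly; a minimal Python sketch, with the example graph's edge weights taken as 1 and hypothetical scores:

```python
def pairwise_loss(alpha, edges):
    """L(alpha, Y) = sum over edges (i, j) of Y_ij * 1(alpha_i <= alpha_j)."""
    return sum((w for (i, j), w in edges.items() if alpha[i] <= alpha[j]), 0.0)

# Preference graph above: item 1 preferred to 2 and 3, item 3 preferred to 4
edges = {(1, 2): 1.0, (1, 3): 1.0, (3, 4): 1.0}
alpha_good = {1: 4.0, 2: 3.0, 3: 2.0, 4: 1.0}  # respects every edge
alpha_bad = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}   # misorders every edge
print(pairwise_loss(alpha_good, edges))  # 0.0
print(pairwise_loss(alpha_bad, edges))   # 3.0
```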
SLIDE 37
Supervised ranking
Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to rank items
Suffer loss: L(f(Q), Y)
Goal: Minimize the risk R(f) := E[L(f(Q), Y)]
SLIDE 38
Supervised ranking
Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to rank items
Suffer loss: L(f(Q), Y)
Goal: Minimize the risk R(f) := E[L(f(Q), Y)]
Main Question: Are there tractable ranking procedures that minimize R as n → ∞?
SLIDE 39 Tractable ranking
First try: Empirical risk minimization
min_f R̂n(f) := Ên[L(f(Q), Y)] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)
SLIDE 40 Tractable ranking
First try: Empirical risk minimization ← Intractable!
min_f R̂n(f) := Ên[L(f(Q), Y)] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)
SLIDE 41 Tractable ranking
First try: Empirical risk minimization ← Intractable!
min_f R̂n(f) := Ên[L(f(Q), Y)] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)
L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)   ← Hard
SLIDE 42 Tractable ranking
First try: Empirical risk minimization ← Intractable!
min_f R̂n(f) := Ên[L(f(Q), Y)] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)
Idea: Replace loss L(α, Y) with convex surrogate ϕ(α, Y)
L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)   ← Hard
SLIDE 43 Tractable ranking
First try: Empirical risk minimization ← Intractable!
min_f R̂n(f) := Ên[L(f(Q), Y)] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)
Idea: Replace loss L(α, Y) with convex surrogate ϕ(α, Y)
L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)   ← Hard
ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj)   ← Tractable
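A small sketch (my illustration, not the talk's code) of one standard choice, the logistic surrogate φ(t) = log(1 + exp(−t)), which is convex and non-increasing with φ′(0) = −1/2 < 0; the edge weights and scores are hypothetical:

```python
import math

def phi(t):
    """Logistic surrogate phi(t) = log(1 + exp(-t)): convex, non-increasing, phi'(0) = -1/2 < 0."""
    return math.log1p(math.exp(-t))

def surrogate_loss(alpha, edges):
    """phi(alpha, Y) = sum over edges (i, j) of Y_ij * phi(alpha_i - alpha_j)."""
    return sum(w * phi(alpha[i] - alpha[j]) for (i, j), w in edges.items())

edges = {(1, 2): 1.0, (1, 3): 1.0, (3, 4): 1.0}
flat = {1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}
sorted_scores = {1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}
print(surrogate_loss(flat, edges))           # 3*log(2), about 2.079
print(surrogate_loss(sorted_scores, edges))  # smaller: correctly ordered scores lower the surrogate
```

Unlike the 0-1 pairwise loss, this objective is differentiable in α, which is what makes minimization tractable.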
SLIDE 44 Surrogate ranking
Idea: Empirical surrogate risk minimization
min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y)] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)
SLIDE 45 Surrogate ranking
Idea: Empirical surrogate risk minimization
min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y)] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)
◮ If ϕ convex, then minimization is tractable
SLIDE 46 Surrogate ranking
Idea: Empirical surrogate risk minimization
min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y)] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)
◮ If ϕ convex, then minimization is tractable
◮ argmin_f R̂ϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y)] as n → ∞
SLIDE 47 Surrogate ranking
Idea: Empirical surrogate risk minimization
min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y)] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)
◮ If ϕ convex, then minimization is tractable
◮ argmin_f R̂ϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y)] as n → ∞
Main Question: Are these tractable ranking procedures consistent?
SLIDE 48 Surrogate ranking
Idea: Empirical surrogate risk minimization
min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y)] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)
◮ If ϕ convex, then minimization is tractable
◮ argmin_f R̂ϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y)] as n → ∞
Main Question: Are these tractable ranking procedures consistent?
⇔ Does argmin_f Rϕ(f) also minimize the true risk R(f)?
SLIDE 49
Classification consistency
Consider the special case of classification
SLIDE 50 Classification consistency
Consider the special case of classification
◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
SLIDE 51 Classification consistency
Consider the special case of classification
◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
◮ Pairwise loss: L(α, Y) = Y01·1(α0 ≤ α1) + Y10·1(α1 ≤ α0)
SLIDE 52 Classification consistency
Consider the special case of classification
◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
◮ Pairwise loss: L(α, Y) = Y01·1(α0 ≤ α1) + Y10·1(α1 ≤ α0)
◮ Surrogate loss: ϕ(α, Y) = Y01·φ(α0 − α1) + Y10·φ(α1 − α0)
SLIDE 53 Classification consistency
Consider the special case of classification
◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
◮ Pairwise loss: L(α, Y) = Y01·1(α0 ≤ α1) + Y10·1(α1 ≤ α0)
◮ Surrogate loss: ϕ(α, Y) = Y01·φ(α0 − α1) + Y10·φ(α1 − α0)
Theorem: If φ is convex, the procedure based on minimizing φ is consistent if and only if φ′(0) < 0.
[Bartlett, Jordan, and McAuliffe, 2006]
⇒ Tractable consistency for boosting, SVMs, logistic regression
SLIDE 54 Ranking consistency?
Good news: Can characterize surrogate ranking consistency
1[Duchi, Mackey, and Jordan, 2013]
SLIDE 55 Ranking consistency?
Good news: Can characterize surrogate ranking consistency
Theorem:1 The procedure based on minimizing ϕ is consistent ⇔ for every query q,
argmin_α E[ϕ(α, Y) | q] ⊆ argmin_{α′} E[L(α′, Y) | q]
◮ Translation: ϕ is consistent if and only if minimizing conditional surrogate risk gives correct ranking for every query
1[Duchi, Mackey, and Jordan, 2013]
SLIDE 56
Ranking consistency?
Bad news: The consequences are dire...
SLIDE 57 Ranking consistency?
Bad news: The consequences are dire...
Consider the pairwise loss: L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)
SLIDE 58 Ranking consistency?
Bad news: The consequences are dire...
Consider the pairwise loss: L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)
Task: Find argminα E[L(α, Y ) | q]
SLIDE 59 Ranking consistency?
Bad news: The consequences are dire...
Consider the pairwise loss: L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)
Task: Find argminα E[L(α, Y ) | q]
◮ Classification (two node) case: Easy
◮ Choose α0 > α1 ⇔ P[Class 0 | q] > P[Class 1 | q]
SLIDE 60 Ranking consistency?
Bad news: The consequences are dire...
Consider the pairwise loss: L(α, Y) = Σ_{i≠j} Yij·1(αi ≤ αj)
Task: Find argminα E[L(α, Y ) | q]
◮ Classification (two node) case: Easy
◮ Choose α0 > α1 ⇔ P[Class 0 | q] > P[Class 1 | q]
◮ General case: NP hard
◮ Unless P = NP, must restrict problem for tractable consistency
SLIDE 61 Low noise distribution
Define: Average preference for item i over item j: sij = E[Yij | q]
◮ We say i ≻ j on average if sij > sji
SLIDE 62 Low noise distribution
Define: Average preference for item i over item j: sij = E[Yij | q]
◮ We say i ≻ j on average if sij > sji
Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.
[Figure: graph on items {1, 2, 3} with edges s12, s23 and the opposing pair s13, s31; low noise ⇒ s13 > s31]
◮ No cyclic preferences on average
SLIDE 63 Low noise distribution
Define: Average preference for item i over item j: sij = E[Yij | q]
◮ We say i ≻ j on average if sij > sji
Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.
◮ No cyclic preferences on average
◮ Find argmin_α E[L(α, Y) | q]: Very easy
  ◮ Choose αi > αj ⇔ sij > sji
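Under the low-noise condition this optimal ordering can be read off the average preferences directly; a toy sketch (not from the talk, with hypothetical sij values), using the fact that a transitive tournament is sorted by pairwise win counts:

```python
def rank_from_averages(s):
    """Under the low-noise (acyclic) condition, the relation 'i beats j iff s[i][j] > s[j][i]'
    is a transitive tournament, so sorting items by their number of pairwise wins yields
    an ordering with alpha_i > alpha_j exactly when s_ij > s_ji."""
    items = sorted(s)
    wins = {i: sum(1 for j in items if j != i and s[i][j] > s[j][i]) for i in items}
    return sorted(items, key=lambda i: -wins[i])

# Hypothetical acyclic average preferences s_ij = E[Y_ij | q] on items {1, 2, 3}
s = {1: {1: 0.0, 2: 0.7, 3: 0.9},
     2: {1: 0.3, 2: 0.0, 3: 0.6},
     3: {1: 0.1, 2: 0.4, 3: 0.0}}
print(rank_from_averages(s))  # [1, 2, 3]
```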
SLIDE 64 Ranking consistency?
Pairwise ranking surrogate:
[Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]
ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj) for φ convex with φ′(0) < 0. Common in ranking literature.
SLIDE 65 Ranking consistency?
Pairwise ranking surrogate:
[Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]
ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj) for φ convex with φ′(0) < 0. Common in ranking literature.
Theorem: ϕ is not consistent, even in low noise settings.
[Duchi, Mackey, and Jordan, 2013]
SLIDE 66 Ranking consistency?
Pairwise ranking surrogate:
[Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]
ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj) for φ convex with φ′(0) < 0. Common in ranking literature.
Theorem: ϕ is not consistent, even in low noise settings.
[Duchi, Mackey, and Jordan, 2013]
⇒ Inconsistency for RankBoost, RankSVM, Logistic Ranking...
SLIDE 67
Ranking with pairwise data is challenging
SLIDE 68 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
SLIDE 69 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
SLIDE 70 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj)
SLIDE 71 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y) = Σ_{i≠j} φ(αi − αj − Yij)
SLIDE 72 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y) = Σ_{i≠j} φ(αi − αj − Yij)
Question: Do tractable consistent losses exist for partial preference data?
SLIDE 73 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y) = Σ_{i≠j} φ(αi − αj − Yij)
Question: Do tractable consistent losses exist for partial preference data? Yes!
SLIDE 74 Ranking with pairwise data is challenging
◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y) = Σ_{i≠j} Yij·φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y) = Σ_{i≠j} φ(αi − αj − Yij)
Question: Do tractable consistent losses exist for partial preference data? Yes, if we aggregate!
SLIDE 75
Outline
Supervised Ranking
  ◮ Formal definition
  ◮ Tractable surrogates
  ◮ Pairwise inconsistency
Aggregation
  ◮ Restoring consistency
  ◮ Estimating complete preferences
U-statistics
  ◮ Practical procedures
  ◮ Experimental results
SLIDE 76 An observation
Can rewrite the risk of the pairwise loss:
E[L(α, Y) | q] = Σ_{i≠j} sij·1(αi ≤ αj), where sij = E[Yij | q].
SLIDE 77 An observation
Can rewrite the risk of the pairwise loss:
E[L(α, Y) | q] = Σ_{i≠j} sij·1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0}·1(αi ≤ αj), where sij = E[Yij | q].
◮ Only depends on net expected preferences: sij − sji
SLIDE 78 An observation
Can rewrite the risk of the pairwise loss:
E[L(α, Y) | q] = Σ_{i≠j} sij·1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0}·1(αi ≤ αj), where sij = E[Yij | q].
◮ Only depends on net expected preferences: sij − sji
Consider the surrogate ϕ(α, s) := Σ_{i≠j} max{sij − sji, 0}·φ(αi − αj) for φ non-increasing and convex, with φ′(0) < 0.
SLIDE 79 An observation
Can rewrite the risk of the pairwise loss:
E[L(α, Y) | q] = Σ_{i≠j} sij·1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0}·1(αi ≤ αj), where sij = E[Yij | q].
◮ Only depends on net expected preferences: sij − sji
Consider the surrogate ϕ(α, s) := Σ_{i≠j} max{sij − sji, 0}·φ(αi − αj) for φ non-increasing and convex, with φ′(0) < 0.
◮ Either i → j penalized or j → i, but not both
SLIDE 80 An observation
Can rewrite the risk of the pairwise loss:
E[L(α, Y) | q] = Σ_{i≠j} sij·1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0}·1(αi ≤ αj), where sij = E[Yij | q].
◮ Only depends on net expected preferences: sij − sji
Consider the surrogate ϕ(α, s) := Σ_{i≠j} max{sij − sji, 0}·φ(αi − αj) for φ non-increasing and convex, with φ′(0) < 0.
◮ Either i → j penalized or j → i, but not both
◮ Consistent whenever average preferences are acyclic
SLIDE 81 What happened?
Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) Σ_{i=1}^k ϕ(α, Yi)
◮ Loss ϕ(α, Y) applied to a single datapoint
SLIDE 82 What happened?
Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) Σ_{i=1}^k ϕ(α, Yi)
◮ Loss ϕ(α, Y) applied to a single datapoint
New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{i=1}^k Yi)
◮ Loss applied to aggregation of many datapoints
SLIDE 83 What happened?
Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) Σ_{i=1}^k ϕ(α, Yi)
◮ Loss ϕ(α, Y) applied to a single datapoint
New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{i=1}^k Yi)
◮ Loss applied to aggregation of many datapoints
New framework: Ranking with aggregate losses L(α, sk(Y1, . . . , Yk)) and ϕ(α, sk(Y1, . . . , Yk)), where sk is a structure function that aggregates the first k datapoints
SLIDE 84 What happened?
Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) Σ_{i=1}^k ϕ(α, Yi)
◮ Loss ϕ(α, Y) applied to a single datapoint
New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{i=1}^k Yi)
◮ Loss applied to aggregation of many datapoints
New framework: Ranking with aggregate losses L(α, sk(Y1, . . . , Yk)) and ϕ(α, sk(Y1, . . . , Yk)), where sk is a structure function that aggregates the first k datapoints
◮ sk combines partial preferences into more complete estimates
SLIDE 85 What happened?
Old surrogates: E[ϕ(α, Y) | q] = lim_{k→∞} (1/k) Σ_{i=1}^k ϕ(α, Yi)
◮ Loss ϕ(α, Y) applied to a single datapoint
New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{i=1}^k Yi)
◮ Loss applied to aggregation of many datapoints
New framework: Ranking with aggregate losses L(α, sk(Y1, . . . , Yk)) and ϕ(α, sk(Y1, . . . , Yk)), where sk is a structure function that aggregates the first k datapoints
◮ sk combines partial preferences into more complete estimates
◮ Consistency characterization extends to this setting
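A toy numeric sketch (my illustration, not from the talk) of why the order of averaging matters for the aggregate surrogate ϕ(α, s) = Σ max{sij − sji, 0}·φ(αi − αj): averaging per-datapoint losses keeps both directions of an opposing pair, while aggregating the data first cancels them.

```python
import math

def phi(t):
    """Logistic surrogate phi(t) = log(1 + exp(-t))."""
    return math.log1p(math.exp(-t))

def agg_surrogate(alpha, s):
    """phi(alpha, s) = sum_{i != j} max(s_ij - s_ji, 0) * phi(alpha_i - alpha_j)."""
    items = sorted(alpha)
    return sum(max(s.get((i, j), 0.0) - s.get((j, i), 0.0), 0.0) * phi(alpha[i] - alpha[j])
               for i in items for j in items if i != j)

# Four observed preference graphs over items {1, 2}: 1 -> 2 three times, 2 -> 1 once
samples = [{(1, 2): 1.0}] * 3 + [{(2, 1): 1.0}]
alpha = {1: 0.0, 2: 0.0}

# Old style: average the per-datapoint losses
old = sum(agg_surrogate(alpha, y) for y in samples) / len(samples)
# New style: aggregate the data first, then apply the loss once
s_bar = {(1, 2): 0.75, (2, 1): 0.25}
new = agg_surrogate(alpha, s_bar)
print(old, new)  # old = log 2; new = 0.5 * log 2: aggregation cancels the opposing edges
```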
SLIDE 86
Aggregation via structure function
[Figure: several partial preference graphs Y1, Y2, . . . , Yk over items {1, 2, 3, 4}, aggregated by sk(Y1, . . . , Yk) into a more complete preference graph]
SLIDE 87
Aggregation via structure function
[Figure: partial preference graphs Y1, Y2, . . . , Yk aggregated by sk(Y1, . . . , Yk)]
Question: When does aggregation help?
SLIDE 88 Complete data losses
◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)
Pros: Popular, well-motivated, admit tractable consistent surrogates
◮ e.g., Penalize mistakes at top of ranked list more heavily
SLIDE 89 Complete data losses
◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)
Pros: Popular, well-motivated, admit tractable consistent surrogates
◮ e.g., Penalize mistakes at top of ranked list more heavily
Cons: Require complete preference data
SLIDE 90 Complete data losses
◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)
Pros: Popular, well-motivated, admit tractable consistent surrogates
◮ e.g., Penalize mistakes at top of ranked list more heavily
Cons: Require complete preference data
Idea:
◮ Use aggregation to estimate complete preferences from partial preferences
SLIDE 91 Complete data losses
◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)
Pros: Popular, well-motivated, admit tractable consistent surrogates
◮ e.g., Penalize mistakes at top of ranked list more heavily
Cons: Require complete preference data
Idea:
◮ Use aggregation to estimate complete preferences from partial preferences
◮ Plug estimates into consistent surrogates
SLIDE 92 Complete data losses
◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)
Pros: Popular, well-motivated, admit tractable consistent surrogates
◮ e.g., Penalize mistakes at top of ranked list more heavily
Cons: Require complete preference data
Idea:
◮ Use aggregation to estimate complete preferences from partial preferences
◮ Plug estimates into consistent surrogates
◮ Check that aggregation + surrogacy retains consistency
SLIDE 93 Cascade model for click data
[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]
◮ Person i clicks on first relevant result, k(i)
[Figure: ranked results list, positions 1–5]
SLIDE 94 Cascade model for click data
[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]
◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
SLIDE 95 Cascade model for click data
[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]
◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
◮ Probability of a click on item k is pk · Π_{j=1}^{k−1} (1 − pj)
SLIDE 96 Cascade model for click data
[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]
◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
◮ Probability of a click on item k is pk · Π_{j=1}^{k−1} (1 − pj)
◮ ERR loss assumes p is known
SLIDE 97 Cascade model for click data
[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]
◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
◮ Probability of a click on item k is pk · Π_{j=1}^{k−1} (1 − pj)
◮ ERR loss assumes p is known
Estimate p via maximum likelihood on n clicks:
s = argmax_{p∈[0,1]^m} Σ_{i=1}^n [ log p_{k(i)} + Σ_{j<k(i)} log(1 − pj) ]
⇒ Consistent ERR minimization under our framework
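Assuming a fixed presentation order and that every session clicks some position, the cascade likelihood above factorizes per position, so the MLE has a closed form; a sketch (my illustration, with hypothetical click data):

```python
def cascade_mle(click_positions, m):
    """Maximum-likelihood relevance probabilities under the cascade model.

    Each session clicks its first relevant result; a session 'examines' position k
    whenever its click lands at position k or later. The likelihood factorizes per
    position, giving p_hat_k = (#clicks at k) / (#sessions examining k).
    """
    p_hat = []
    for k in range(1, m + 1):
        examined = sum(1 for c in click_positions if c >= k)
        clicked = sum(1 for c in click_positions if c == k)
        p_hat.append(clicked / examined if examined else 0.0)
    return p_hat

# Hypothetical clicks over m = 5 positions: most sessions click position 1
clicks = [1, 1, 1, 2, 3, 1, 2, 1]
print(cascade_mle(clicks, 5))  # [0.625, 0.666..., 1.0, 0.0, 0.0]
```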
SLIDE 98 Benefits of aggregation
◮ Tractable consistency for partial preference losses:
argmin_f lim_{k→∞} E[ϕ(f(Q), sk(Y1, . . . , Yk))] ⇒ argmin_f lim_{k→∞} E[L(f(Q), sk(Y1, . . . , Yk))]
◮ Use complete data losses with realistic partial preference data
◮ Models process of generating relevance scores from clicks/comparisons
SLIDE 99 What remains?
Before aggregation, we had
argmin_f (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk) → argmin_f E[ϕ(f(Q), Y)]
SLIDE 100 What remains?
Before aggregation, we had
argmin_f (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk) → argmin_f E[ϕ(f(Q), Y)]
What’s a suitable empirical analogue R̂ϕ,n(f) with aggregation?
SLIDE 101 What remains?
Before aggregation, we had
argmin_f (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk) → argmin_f E[ϕ(f(Q), Y)]
What’s a suitable empirical analogue R̂ϕ,n(f) with aggregation?
⇔ When does the empirical argmin_f converge to argmin_f lim_{k→∞} E[ϕ(f(Q), sk(Y1, . . . , Yk))]?
SLIDE 102
Outline
Supervised Ranking
  ◮ Formal definition
  ◮ Tractable surrogates
  ◮ Pairwise inconsistency
Aggregation
  ◮ Restoring consistency
  ◮ Estimating complete preferences
U-statistics
  ◮ Practical procedures
  ◮ Experimental results
SLIDE 103 Data with aggregation
[Figure: queries q1, . . . , q5, each with its own preference judgments Y1, Y2, Y3, . . .; nq judgments per query q]
◮ Datapoint consists of query q and preference judgment Y
◮ nq datapoints for query q
◮ Structure functions for aggregation: s(Y1, Y2, . . . , Yk)
SLIDE 104 Data with aggregation
◮ Simple idea: for query q, aggregate all of Y1, Y2, . . . , Ynq
◮ Loss ϕ for query q is nq · ϕ(α, s(Y1, . . . , Ynq))
SLIDE 105 Data with aggregation
◮ Simple idea: for query q, aggregate all of Y1, Y2, . . . , Ynq
◮ Loss ϕ for query q is nq · ϕ(α, s(Y1, . . . , Ynq))
Cons:
◮ Requires detailed knowledge of ϕ and sk(Y1, . . . , Yk) as k → ∞
SLIDE 106 Data with aggregation
◮ Simple idea: for query q, aggregate all of Y1, Y2, . . . , Ynq
◮ Loss ϕ for query q is nq · ϕ(α, s(Y1, . . . , Ynq))
Cons:
◮ Requires detailed knowledge of ϕ and sk(Y1, . . . , Yk) as k → ∞
Ideal procedure:
◮ Agnostic to form of aggregation
◮ Take advantage of independence of Y1, Y2, . . .
SLIDE 107 Digression: U-statistics
◮ U-statistic: classical tool in statistics
◮ Given X1, . . . , Xn, estimate E[g(X1, . . . , Xk)] for g symmetric
◮ Idea: Average all estimates based on k datapoints:
Un = (n choose k)^{−1} Σ_{i1<···<ik} g(Xi1, Xi2, . . . , Xik)
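A minimal sketch (not from the talk) of a U-statistic, using the classical kernel g(x1, x2) = (x1 − x2)²/2, whose U-statistic is exactly the unbiased sample variance:

```python
from itertools import combinations
from statistics import variance

def u_statistic(xs, g, k):
    """U_n = (n choose k)^{-1} * sum over all size-k subsets of g(x_{i1}, ..., x_{ik})."""
    subsets = list(combinations(xs, k))
    return sum(g(*subset) for subset in subsets) / len(subsets)

# Kernel g(x1, x2) = (x1 - x2)^2 / 2 is an unbiased two-sample estimator of the variance,
# so its U-statistic coincides with the usual unbiased sample variance.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
u = u_statistic(xs, lambda a, b: (a - b) ** 2 / 2, k=2)
print(u, variance(xs))  # the two values agree
```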
SLIDE 108 Data with aggregation: U-statistic in the loss
◮ Target: E[ϕ(α, s(Y1, . . . , Yk)) | q]
SLIDE 109 Data with aggregation: U-statistic in the loss
◮ Target: E[ϕ(α, s(Y1, . . . , Yk)) | q]
◮ Idea: Estimate with U-statistic:
(nq choose k)^{−1} Σ_{i1<···<ik} ϕ(α, s(Yi1, . . . , Yik))
SLIDE 110 Data with aggregation: U-statistic in the loss
◮ Target: E[ϕ(α, s(Y1, . . . , Yk)) | q]
◮ Idea: Estimate with U-statistic:
(nq choose k)^{−1} Σ_{i1<···<ik} ϕ(α, s(Yi1, . . . , Yik))
◮ Empirical risk for scoring function f:
(1/n) Σ_q nq · (nq choose k)^{−1} Σ_{i1<···<ik} ϕ(f(q), s(Yi1, . . . , Yik))
SLIDE 111 Convergence of U-statistic procedures
Empirical risk for scoring function f:
(1/n) Σ_q nq · (nq choose k)^{−1} Σ_{i1<···<ik} ϕ(f(q), s(Yi1, . . . , Yik))
Theorem: If we choose kn = o(n) with kn → ∞, then uniformly in f the empirical risk converges to lim_{k→∞} E[ϕ(f(Q), s(Y1, . . . , Yk))]
SLIDE 112 New procedure for learning to rank
◮ Use loss function that aggregates per query:
(1/n) Σ_q nq · (nq choose k)^{−1} Σ_{i1<···<ik} ϕ(f(q), s(Yi1, . . . , Yik))
◮ Learn ranking function by minimizing this empirical risk over f ∈ F
◮ Can optimize by stochastic gradient descent over queries q and subsets (i1, . . . , ik)
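A sketch of such an SGD loop (my illustration; the single-query data, step size, and logistic surrogate are assumptions, not the paper's exact procedure): each step samples a query, then one random size-k subset of its judgments, aggregates the subset into net pairwise counts, and steps on the surrogate gradient.

```python
import math
import random

def sgd_ranking(data, n_items, k, steps=2000, lr=0.1, seed=0):
    """SGD for the U-statistic surrogate risk: each step samples a query q, then a random
    size-k subset of its pairwise judgments (one U-statistic term), aggregates them into
    counts s_ij, and takes a gradient step on sum_{i != j} s_ij * phi(alpha_i - alpha_j)
    with phi(t) = log(1 + exp(-t)).  Scores alpha are per-item parameters."""
    rng = random.Random(seed)
    alpha = [0.0] * n_items
    for _ in range(steps):
        q = rng.choice(list(data))
        subset = rng.sample(data[q], k)          # one U-statistic term
        counts = {}
        for (i, j) in subset:                    # aggregate: pairwise preference counts
            counts[(i, j)] = counts.get((i, j), 0) + 1
        for (i, j), s in counts.items():
            sig = 1.0 / (1.0 + math.exp(alpha[i] - alpha[j]))  # equals -phi'(alpha_i - alpha_j)
            alpha[i] += lr * s * sig
            alpha[j] -= lr * s * sig
    return alpha

# Hypothetical single-query data: item 0 usually beats 1, and 1 usually beats 2
judgments = [(0, 1)] * 8 + [(1, 0)] * 2 + [(1, 2)] * 7 + [(2, 1)] * 3
alpha = sgd_ranking({"q": judgments}, n_items=3, k=5)
print(alpha)  # expect alpha[0] > alpha[1] > alpha[2]
```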
SLIDE 113 Experiments
◮ Web search ◮ Image ranking
SLIDE 114 Web search
◮ Microsoft Learning to Rank Web10K dataset
SLIDE 115 Web search
◮ Microsoft Learning to Rank Web10K dataset
◮ 10,000 queries issued
◮ 100 items per query
◮ Estimated relevance score r ∈ R for each query/result pair
SLIDE 116 Web search
◮ Microsoft Learning to Rank Web10K dataset
◮ 10,000 queries issued
◮ 100 items per query
◮ Estimated relevance score r ∈ R for each query/result pair
◮ Generating pairwise preferences
◮ Choose query q uniformly at random
◮ Choose pair (i, j) of items, and set i ≻ j with probability pij = 1/(1 + exp(rj − ri))
SLIDE 117 Web search
◮ Microsoft Learning to Rank Web10K dataset
◮ 10,000 queries issued
◮ 100 items per query
◮ Estimated relevance score r ∈ R for each query/result pair
◮ Generating pairwise preferences
◮ Choose query q uniformly at random
◮ Choose pair (i, j) of items, and set i ≻ j with probability pij = 1/(1 + exp(rj − ri))
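A sketch (my illustration, with hypothetical relevance scores) of this Bradley–Terry-style sampling step:

```python
import math
import random

def sample_preference(r_i, r_j, rng):
    """Draw one pairwise preference: i beats j with probability 1 / (1 + exp(r_j - r_i))."""
    p_ij = 1.0 / (1.0 + math.exp(r_j - r_i))
    return p_ij, rng.random() < p_ij

rng = random.Random(0)
r = {"a": 2.0, "b": 0.0}  # hypothetical relevance scores for two results
p, a_wins = sample_preference(r["a"], r["b"], rng)
print(round(p, 3))  # 0.881: the higher-scored item wins most comparisons
```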
◮ Aggregate scores by setting si = log(…)
SLIDE 118 Benefits of aggregation
NDCG risk as a function of aggregation level k for n = 10^6 samples
[Figure: NDCG@10 (0.65 to 0.85) versus order k (10^1 to 10^4) for the Aggregate, Pairwise, and Score-based methods]
SLIDE 119 Image ranking
◮ Setup [Grangier and Bengio, 2008]
  ◮ Take most common image search queries on google.com
  ◮ Train an independent ranker based on aggregated preference statistics for each query
  ◮ Compare with standard, disaggregated image-ranking approaches
SLIDE 120
Image ranking experiments
Highly ranked items from the Corel Image Database for the query tree car:
[Figure: top-ranked images from the Aggregated, SVM, and PLSA methods]
SLIDE 121
Conclusions
SLIDE 122 Conclusions
- 1. Partial preference data is abundant and (more) reliable
SLIDE 123 Conclusions
- 1. Partial preference data is abundant and (more) reliable
- 2. General theory of ranking consistency: When is
argmin_f E[ϕ(f(Q), s)] ⊆ argmin_f E[L(f(Q), s)]?
◮ Tractable consistency difficult with partial preference data
◮ Possible with complete preference data
SLIDE 124 Conclusions
- 1. Partial preference data is abundant and (more) reliable
- 2. General theory of ranking consistency: When is
argmin_f E[ϕ(f(Q), s)] ⊆ argmin_f E[L(f(Q), s)]?
◮ Tractable consistency difficult with partial preference data
◮ Possible with complete preference data
- 3. Aggregation can bridge the gap
◮ Can transform pairwise preferences/click data into scores s
SLIDE 125 Conclusions
- 1. Partial preference data is abundant and (more) reliable
- 2. General theory of ranking consistency: When is
argmin_f E[ϕ(f(Q), s)] ⊆ argmin_f E[L(f(Q), s)]?
◮ Tractable consistency difficult with partial preference data
◮ Possible with complete preference data
- 3. Aggregation can bridge the gap
◮ Can transform pairwise preferences/click data into scores s
- 4. Practical consistent procedures via U-statistic aggregation
◮ Allows for arbitrary aggregation s
◮ High-probability convergence of the learned ranking function
SLIDE 126
Future work
SLIDE 127 Future work
◮ Empirical directions
  ◮ Apply to more ranking problems!
  ◮ Which aggregation procedures perform best?
  ◮ How much aggregation is enough?
SLIDE 128 Future work
◮ Empirical directions
  ◮ Apply to more ranking problems!
  ◮ Which aggregation procedures perform best?
  ◮ How much aggregation is enough?
◮ Statistical questions: beyond consistency
  ◮ How does aggregation impact rate of convergence?
  ◮ Can we design statistically efficient ranking procedures?
SLIDE 129 Future work
◮ Empirical directions
  ◮ Apply to more ranking problems!
  ◮ Which aggregation procedures perform best?
  ◮ How much aggregation is enough?
◮ Statistical questions: beyond consistency
  ◮ How does aggregation impact rate of convergence?
  ◮ Can we design statistically efficient ranking procedures?
◮ Other ways of dealing with realistic partial preference data?
SLIDE 130 References I
- P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical
Association, 101:138–156, 2006.
- D. Buffoni, C. Calauzenes, P. Gallinari, and N. Usunier. Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the 28th International Conference on Machine Learning, 2011.
- O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Conference on
Information and Knowledge Management, 2009.
- N. Craswell, O. Zoeter, M. J. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Web Search
and Data Mining (WSDM), pages 87–94, 2008.
- O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In Advances in Neural Information Processing
Systems 16, 2004.
- J. C. Duchi, L. Mackey, and M. I. Jordan. The asymptotics of ranking algorithms. Annals of Statistics, 2013.
- Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. Efficient boosting algorithms for combining preferences. Journal of Machine
Learning Research, 4:933–969, 2003.
- R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. MIT Press, 2000.
- G. Miller. The magic number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97, 1956.
- P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In Proceedings of the 14th
International Conference on Artificial Intelligence and Statistics, 2011.
- R. Shiffrin and R. Nosofsky. Seven plus or minus two: a commentary on capacity limitations. Psychological Review, 101(2):
357–361, 1994.
- N. Stewart, G. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881–911,
2005.
SLIDE 131
SLIDE 132 What is the problem?
Surrogate loss ϕ(α, s) = Σ_{i≠j} sij·φ(αi − αj)
[Figure: preference graph s on items {1, 2, 3} with edges s12, s23, s13 and graph s′ with edge s31, each occurring with probability .5, aggregated into one graph with edges s12, s31, s13, s23]
SLIDE 133 What is the problem?
Surrogate loss ϕ(α, s) = Σ_{i≠j} sij·φ(αi − αj)
[Figure: graphs s and s′ with p(s) = p(s′) = .5, aggregated into one graph with edges s12, s31, s13, s23]
Σ_s p(s)·ϕ(α, s) = ½ϕ(α, s) + ½ϕ(α, s′) ∝ s12φ(α1 − α2) + s13φ(α1 − α3) + s23φ(α2 − α3) + s31φ(α3 − α1)
SLIDE 134
What is the problem?
s12φ(α1 − α2) + s13φ(α1 − α3) + s23φ(α2 − α3) + s31φ(α3 − α1)
SLIDE 135
What is the problem?
s12φ(α1 − α2) + s13φ(α1 − α3) + s23φ(α2 − α3) + s31φ(α3 − α1)
SLIDE 136
What is the problem?
s12φ(α1 − α2) + s13φ(α1 − α3) + s23φ(α2 − α3) + s31φ(α3 − α1)
[Figure: plot of the competing terms s31φ(α3 − α1) and s13φ(α1 − α3) beside the preference graph with edges s12, s31, s13, s23]
SLIDE 137 What is the problem?
s12φ(α1 − α2) + s13φ(α1 − α3) + s23φ(α2 − α3) + s31φ(α3 − α1)
[Figure: plot of the competing terms s31φ(α3 − α1) and s13φ(α1 − α3)]
More bang for your $$: since φ increases to 0 from the left, the surrogate falls fastest by decreasing α1 (α1 ↓).
Result: α∗ = argmin_α Σ_{i≠j} sijφ(αi − αj) can have α∗2 > α∗1, even if s13 − s31 > s12 + s23.