

slide-1
SLIDE 1

Ranking, Aggregation, and You

Lester Mackey†

Collaborators: John C. Duchi† and Michael I. Jordan∗

†Stanford University ∗UC Berkeley

October 5, 2014

slide-2
SLIDE 2

A simple question

slide-3
SLIDE 3

A simple question

◮ On a scale of 1 (very white) to 10 (very black), how black is this box?

slide-4
SLIDE 4

A simple question

◮ On a scale of 1 (very white) to 10 (very black), how black is this box?

◮ Which box is blacker?

slide-5
SLIDE 5

Another question

On a scale of 1 to 10, how relevant is this result for the query flowers?

slide-6
SLIDE 6

Another question

On a scale of 1 to 10, how relevant is this result for the query flowers?

slide-7
SLIDE 7

Another question

slide-8
SLIDE 8

What have we learned?

slide-9
SLIDE 9

What have we learned?

  • 1. We are good at pairwise comparisons

◮ Much worse at absolute relevance judgments [Miller, 1956, Shiffrin and Nosofsky, 1994, Stewart, Brown, and Chater, 2005]

slide-10
SLIDE 10

What have we learned?

  • 1. We are good at pairwise comparisons

◮ Much worse at absolute relevance judgments [Miller, 1956, Shiffrin and Nosofsky, 1994, Stewart, Brown, and Chater, 2005]

  • 2. We are good at expressing sparse, partial preferences

◮ Much worse at expressing complete preferences

Complete preferences:

ftd.com en.wikipedia.org/... 1800flowers.com

What you express:

ftd.com en.wikipedia.org/... 1800flowers.com

slide-11
SLIDE 11

Ranking

Goal: Order set of items/results to best match your preferences

slide-12
SLIDE 12

Ranking

Goal: Order set of items/results to best match your preferences

◮ Web search: Return most relevant URLs for user queries

slide-13
SLIDE 13

Ranking

Goal: Order set of items/results to best match your preferences

◮ Web search: Return most relevant URLs for user queries
◮ Recommendation systems:
  ◮ Movies to watch based on user’s past ratings
  ◮ News articles to read based on past browsing history
  ◮ Items to buy based on patron’s or other patrons’ purchases

slide-14
SLIDE 14

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
slide-15
SLIDE 15

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
slide-16
SLIDE 16

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
  • 3. Realistic: Make use of ubiquitous partial preference data
slide-17
SLIDE 17

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
  • 3. Realistic: Make use of ubiquitous partial preference data

Past work: 1+2 are possible given complete preference data

[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]

slide-18
SLIDE 18

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
  • 3. Realistic: Make use of ubiquitous partial preference data

Past work: 1+2 are possible given complete preference data

[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]

This work [Duchi, Mackey, and Jordan, 2013]

slide-19
SLIDE 19

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
  • 3. Realistic: Make use of ubiquitous partial preference data

Past work: 1+2 are possible given complete preference data

[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]

This work [Duchi, Mackey, and Jordan, 2013]

◮ Standard (tractable) procedures for ranking with partial

preferences are inconsistent

slide-20
SLIDE 20

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
  • 3. Realistic: Make use of ubiquitous partial preference data

Past work: 1+2 are possible given complete preference data

[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]

This work [Duchi, Mackey, and Jordan, 2013]

◮ Standard (tractable) procedures for ranking with partial

preferences are inconsistent

◮ Aggregating partial preferences into more complete preferences

can restore consistency

slide-21
SLIDE 21

Ranking procedures

Goal: Order set of items/results to best match your preferences

  • 1. Tractable: Run in polynomial time
  • 2. Consistent: Recover true preferences given sufficient data
  • 3. Realistic: Make use of ubiquitous partial preference data

Past work: 1+2 are possible given complete preference data

[Ravikumar, Tewari, and Yang, 2011, Buffoni, Calauzenes, Gallinari, and Usunier, 2011]

This work [Duchi, Mackey, and Jordan, 2013]

◮ Standard (tractable) procedures for ranking with partial

preferences are inconsistent

◮ Aggregating partial preferences into more complete preferences

can restore consistency

◮ New estimators based on U-statistics achieve 1+2+3

slide-22
SLIDE 22

Outline

Supervised Ranking: Formal definition · Tractable surrogates · Pairwise inconsistency
Aggregation: Restoring consistency · Estimating complete preferences
U-statistics: Practical procedures · Experimental results

slide-23
SLIDE 23

Outline

Supervised Ranking: Formal definition · Tractable surrogates · Pairwise inconsistency
Aggregation: Restoring consistency · Estimating complete preferences
U-statistics: Practical procedures · Experimental results

slide-24
SLIDE 24

Supervised ranking

Observe: Sequence of training examples

slide-25
SLIDE 25

Supervised ranking

Observe: Sequence of training examples

◮ Query Q: e.g., search term “flowers”

slide-26
SLIDE 26

Supervised ranking

Observe: Sequence of training examples

◮ Query Q: e.g., search term “flowers”
◮ Set of m items IQ to rank
  ◮ e.g., websites {1, 2, 3, 4}

slide-27
SLIDE 27

Supervised ranking

Observe: Sequence of training examples

◮ Query Q: e.g., search term “flowers”
◮ Set of m items IQ to rank
  ◮ e.g., websites {1, 2, 3, 4}
◮ Label Y representing some preference structure over items

slide-28
SLIDE 28

Supervised ranking

Observe: Sequence of training examples

◮ Query Q: e.g., search term “flowers”
◮ Set of m items IQ to rank
  ◮ e.g., websites {1, 2, 3, 4}
◮ Label Y representing some preference structure over items
  ◮ Item 1 preferred to {2, 3} and item 3 to 4

[Figure: preference graph Y on items {1, 2, 3, 4} with edges y12, y13, y34]

Example: Y is a graph on items {1, 2, 3, 4}

slide-29
SLIDE 29

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to induce item rankings for each query

slide-30
SLIDE 30

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to induce item rankings for each query

◮ Real-valued score for each item i in item set IQ: αi := fi(Q)

◮ Vector of scores f(Q) induces ranking over IQ: i ranked above j ⇔ αi > αj
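The rule above — sort items by their scores, highest first — can be sketched as follows (a minimal illustration; the function name and example scores are assumed, not from the talk):

```python
# A scoring function assigns each item a real score alpha_i = f_i(Q);
# sorting scores in descending order yields the induced ranking.
def induced_ranking(scores):
    """Return item indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])

alpha = [0.2, 1.5, 0.7]           # alpha_i = f_i(Q) for items 0, 1, 2
ranking = induced_ranking(alpha)  # item 1 ranked first, then item 2, then item 0
```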

slide-31
SLIDE 31

Supervised ranking

Example: Scoring function f with scores f1(Q) > f2(Q) > f3(Q) induces the same ranking as preference graph Y

[Figure: preference graph Y on items {1, 2, 3}; edges match f1(Q) > f2(Q) and f2(Q) > f3(Q)]

slide-32
SLIDE 32

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to predict item ranking
Suffer loss: L(f(Q), Y )

◮ Encodes discord between observed label Y and prediction f(Q)
◮ Depends on specific ranking task and available data

slide-33
SLIDE 33

Supervised ranking

Example: Pairwise loss

slide-34
SLIDE 34

Supervised ranking

Example: Pairwise loss

◮ Let Y = (weighted) adjacency matrix for a preference graph
  ◮ Yij = the preference weight on edge (i, j)

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

slide-35
SLIDE 35

Supervised ranking

Example: Pairwise loss

◮ Let Y = (weighted) adjacency matrix for a preference graph
  ◮ Yij = the preference weight on edge (i, j)
◮ Let α = f(Q) be the predicted scores for query Q

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

slide-36
SLIDE 36

Supervised ranking

Example: Pairwise loss

◮ Let Y = (weighted) adjacency matrix for a preference graph
  ◮ Yij = the preference weight on edge (i, j)
◮ Let α = f(Q) be the predicted scores for query Q
◮ Then L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)
  ◮ Imposes penalty for each misordered edge

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

L(α, Y ) = Y12 1(α1 ≤ α2) + Y13 1(α1 ≤ α3) + Y34 1(α3 ≤ α4)
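The pairwise loss can be computed directly from a score vector and a preference adjacency matrix. A minimal sketch (function name and 0-indexing assumed), using the slide's example graph with unit edge weights:

```python
# Pairwise loss L(alpha, Y) = sum over i != j of Y[i][j] * 1(alpha_i <= alpha_j):
# a penalty of Y[i][j] for every preference edge (i, j) the scores misorder.
def pairwise_loss(alpha, Y):
    m = len(alpha)
    return sum(Y[i][j] for i in range(m) for j in range(m)
               if i != j and alpha[i] <= alpha[j])

# Edges 1->2, 1->3, 3->4 from the slide (0-indexed), all with weight 1.
Y = [[0, 1, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
loss_good = pairwise_loss([4.0, 3.0, 2.0, 1.0], Y)  # respects every edge
loss_bad = pairwise_loss([1.0, 2.0, 3.0, 4.0], Y)   # reverses every edge
```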

slide-37
SLIDE 37

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to rank items
Suffer loss: L(f(Q), Y )
Goal: Minimize the risk R(f) := E[L(f(Q), Y )]

slide-38
SLIDE 38

Supervised ranking

Observe: (Q1, Y1), . . . , (Qn, Yn)
Learn: Scoring function f to rank items
Suffer loss: L(f(Q), Y )
Goal: Minimize the risk R(f) := E[L(f(Q), Y )]

Main Question: Are there tractable ranking procedures that minimize R as n → ∞?

slide-39
SLIDE 39

Tractable ranking

First try: Empirical risk minimization

min_f R̂n(f) := Ên[L(f(Q), Y )] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)

slide-40
SLIDE 40

Tractable ranking

First try: Empirical risk minimization ← Intractable!

min_f R̂n(f) := Ên[L(f(Q), Y )] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)

slide-41
SLIDE 41

Tractable ranking

First try: Empirical risk minimization ← Intractable!

min_f R̂n(f) := Ên[L(f(Q), Y )] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)

L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)   ← Hard

slide-42
SLIDE 42

Tractable ranking

First try: Empirical risk minimization ← Intractable!

min_f R̂n(f) := Ên[L(f(Q), Y )] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)

Idea: Replace loss L(α, Y ) with convex surrogate ϕ(α, Y )

L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)   ← Hard

slide-43
SLIDE 43

Tractable ranking

First try: Empirical risk minimization ← Intractable!

min_f R̂n(f) := Ên[L(f(Q), Y )] = (1/n) Σ_{k=1}^n L(f(Qk), Yk)

Idea: Replace loss L(α, Y ) with convex surrogate ϕ(α, Y )

L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)   ← Hard

ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj)   ← Tractable
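The relaxation above can be sketched numerically. This is an assumed illustration (the logistic φ(t) = log(1 + e^{−t}) is one common convex choice with φ′(0) < 0, not necessarily the talk's):

```python
import math

# Convex surrogate for the pairwise 0-1 loss: apply a convex, decreasing
# phi to each score difference instead of the hard indicator 1(alpha_i <= alpha_j).
def phi(t):
    return math.log(1.0 + math.exp(-t))  # logistic loss, convex, phi'(0) = -1/2

def surrogate_loss(alpha, Y):
    m = len(alpha)
    return sum(Y[i][j] * phi(alpha[i] - alpha[j])
               for i in range(m) for j in range(m) if i != j)

Y = [[0, 1], [0, 0]]                         # single preference: item 0 over item 1
well_ordered = surrogate_loss([2.0, 0.0], Y)
misordered = surrogate_loss([0.0, 2.0], Y)
# Unlike the 0-1 loss, the surrogate decreases smoothly as the margin grows,
# which is what makes gradient-based minimization possible.
```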

slide-44
SLIDE 44

Surrogate ranking

Idea: Empirical surrogate risk minimization

min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y )] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)

slide-45
SLIDE 45

Surrogate ranking

Idea: Empirical surrogate risk minimization

min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y )] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)

◮ If ϕ convex, then minimization is tractable

slide-46
SLIDE 46

Surrogate ranking

Idea: Empirical surrogate risk minimization

min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y )] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)

◮ If ϕ convex, then minimization is tractable
◮ argmin_f R̂ϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y )] as n → ∞

slide-47
SLIDE 47

Surrogate ranking

Idea: Empirical surrogate risk minimization

min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y )] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)

◮ If ϕ convex, then minimization is tractable
◮ argmin_f R̂ϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y )] as n → ∞

Main Question: Are these tractable ranking procedures consistent?

slide-48
SLIDE 48

Surrogate ranking

Idea: Empirical surrogate risk minimization

min_f R̂ϕ,n(f) := Ên[ϕ(f(Q), Y )] = (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)

◮ If ϕ convex, then minimization is tractable
◮ argmin_f R̂ϕ,n(f) → argmin_f Rϕ(f) := E[ϕ(f(Q), Y )] as n → ∞

Main Question: Are these tractable ranking procedures consistent?
⇔ Does argmin_f Rϕ(f) also minimize the true risk R(f)?

slide-49
SLIDE 49

Classification consistency

Consider the special case of classification

slide-50
SLIDE 50

Classification consistency

Consider the special case of classification

◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1

slide-51
SLIDE 51

Classification consistency

Consider the special case of classification

◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
◮ Pairwise loss: L(α, Y ) = Y01 1(α0 ≤ α1) + Y10 1(α1 ≤ α0)

slide-52
SLIDE 52

Classification consistency

Consider the special case of classification

◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
◮ Pairwise loss: L(α, Y ) = Y01 1(α0 ≤ α1) + Y10 1(α1 ≤ α0)
◮ Surrogate loss: ϕ(α, Y ) = Y01 φ(α0 − α1) + Y10 φ(α1 − α0)

slide-53
SLIDE 53

Classification consistency

Consider the special case of classification

◮ Observe: query X, items {0, 1}, label Y01 = 1 or Y10 = 1
◮ Pairwise loss: L(α, Y ) = Y01 1(α0 ≤ α1) + Y10 1(α1 ≤ α0)
◮ Surrogate loss: ϕ(α, Y ) = Y01 φ(α0 − α1) + Y10 φ(α1 − α0)

Theorem: If φ is convex, procedure based on minimizing φ is consistent if and only if φ′(0) < 0.

[Bartlett, Jordan, and McAuliffe, 2006]

⇒ Tractable consistency for boosting, SVMs, logistic regression

slide-54
SLIDE 54

Ranking consistency?

Good news: Can characterize surrogate ranking consistency

1[Duchi, Mackey, and Jordan, 2013]

slide-55
SLIDE 55

Ranking consistency?

Good news: Can characterize surrogate ranking consistency

Theorem:¹ Procedure based on minimizing ϕ is consistent ⇔

min_α { E[ϕ(α, Y ) | q] : α ∉ argmin_{α′} E[L(α′, Y ) | q] } > min_α E[ϕ(α, Y ) | q].

◮ Translation: ϕ is consistent if and only if minimizing conditional surrogate risk gives correct ranking for every query

1[Duchi, Mackey, and Jordan, 2013]

slide-56
SLIDE 56

Ranking consistency?

Bad news: The consequences are dire...

slide-57
SLIDE 57

Ranking consistency?

Bad news: The consequences are dire...

Consider the pairwise loss: L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

slide-58
SLIDE 58

Ranking consistency?

Bad news: The consequences are dire...

Consider the pairwise loss: L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

Task: Find argminα E[L(α, Y ) | q]

slide-59
SLIDE 59

Ranking consistency?

Bad news: The consequences are dire...

Consider the pairwise loss: L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

Task: Find argminα E[L(α, Y ) | q]

◮ Classification (two node) case: Easy
  ◮ Choose α0 > α1 ⇔ P[Class 0 | q] > P[Class 1 | q]

slide-60
SLIDE 60

Ranking consistency?

Bad news: The consequences are dire...

Consider the pairwise loss: L(α, Y ) = Σ_{i≠j} Yij 1(αi ≤ αj)

[Figure: preference graph on items {1, 2, 3, 4} with edge weights y12, y13, y34]

Task: Find argminα E[L(α, Y ) | q]

◮ Classification (two node) case: Easy
  ◮ Choose α0 > α1 ⇔ P[Class 0 | q] > P[Class 1 | q]
◮ General case: NP-hard
  ◮ Unless P = NP, must restrict problem for tractable consistency

slide-61
SLIDE 61

Low noise distribution

Define: Average preference for item i over item j: sij = E[Yij | q]

◮ We say i ≻ j on average if sij > sji

slide-62
SLIDE 62

Low noise distribution

Define: Average preference for item i over item j: sij = E[Yij | q]

◮ We say i ≻ j on average if sij > sji

Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.

[Figure: items 1, 2, 3 with average preferences s12, s23, s13, s31; low noise ⇒ s13 > s31]

◮ No cyclic preferences on average

slide-63
SLIDE 63

Low noise distribution

Define: Average preference for item i over item j: sij = E[Yij | q]

◮ We say i ≻ j on average if sij > sji

Definition (Low noise distribution): If i ≻ j on average and j ≻ k on average, then i ≻ k on average.

[Figure: items 1, 2, 3 with average preferences s12, s23, s13, s31; low noise ⇒ s13 > s31]

◮ No cyclic preferences on average
◮ Find argminα E[L(α, Y ) | q]: Very easy
  ◮ Choose αi > αj ⇔ sij > sji
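Under the low-noise (acyclic) condition, a valid ordering can be read off the average-preference matrix directly, e.g. by counting how many items each item beats on average. A sketch under assumed names and an assumed toy matrix:

```python
# Under acyclic average preferences, ranking items by the number of other
# items they beat on average (s_ij > s_ji) recovers an order with
# alpha_i > alpha_j exactly when i beats j.
def rank_from_average_prefs(s):
    """s[i][j] = average preference for i over j; returns items best-first."""
    m = len(s)
    wins = [sum(1 for j in range(m) if j != i and s[i][j] > s[j][i])
            for i in range(m)]
    return sorted(range(m), key=lambda i: -wins[i])

# Acyclic example: item 0 beats 1 and 2 on average, and item 1 beats 2.
s = [[0.0, 0.7, 0.6],
     [0.3, 0.0, 0.8],
     [0.4, 0.2, 0.0]]
order = rank_from_average_prefs(s)
```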

slide-64
SLIDE 64

Ranking consistency?

Pairwise ranking surrogate:

[Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]

ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj) for φ convex with φ′(0) < 0. Common in ranking literature.

slide-65
SLIDE 65

Ranking consistency?

Pairwise ranking surrogate:

[Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]

ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj) for φ convex with φ′(0) < 0. Common in ranking literature.

Theorem: ϕ is not consistent, even in low noise settings.

[Duchi, Mackey, and Jordan, 2013]

slide-66
SLIDE 66

Ranking consistency?

Pairwise ranking surrogate:

[Herbrich, Graepel, and Obermayer, 2000, Freund, Iyer, Schapire, and Singer, 2003, Dekel, Manning, and Singer, 2004]

ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj) for φ convex with φ′(0) < 0. Common in ranking literature.

Theorem: ϕ is not consistent, even in low noise settings.

[Duchi, Mackey, and Jordan, 2013]

⇒ Inconsistency for RankBoost, RankSVM, Logistic Ranking...

slide-67
SLIDE 67

Ranking with pairwise data is challenging

slide-68
SLIDE 68

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)

slide-69
SLIDE 69

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions

slide-70
SLIDE 70

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj)

slide-71
SLIDE 71

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y ) = Σ_{i≠j} φ(αi − αj − Yij)

slide-72
SLIDE 72

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y ) = Σ_{i≠j} φ(αi − αj − Yij)

Question: Do tractable consistent losses exist for partial preference data?

slide-73
SLIDE 73

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y ) = Σ_{i≠j} φ(αi − αj − Yij)

Question: Do tractable consistent losses exist for partial preference data? Yes!

slide-74
SLIDE 74

Ranking with pairwise data is challenging

◮ Inconsistent in general (unless P = NP)
◮ Low noise distributions
  ◮ Inconsistent for standard convex losses: ϕ(α, Y ) = Σ_{i≠j} Yij φ(αi − αj)
  ◮ Inconsistent for margin-based convex losses: ϕ(α, Y ) = Σ_{i≠j} φ(αi − αj − Yij)

Question: Do tractable consistent losses exist for partial preference data? Yes, if we aggregate!

slide-75
SLIDE 75

Outline

Supervised Ranking: Formal definition · Tractable surrogates · Pairwise inconsistency
Aggregation: Restoring consistency · Estimating complete preferences
U-statistics: Practical procedures · Experimental results

slide-76
SLIDE 76

An observation

Can rewrite risk of pairwise loss:

E[L(α, Y ) | q] = Σ_{i≠j} sij 1(αi ≤ αj)

where sij = E[Yij | q].

slide-77
SLIDE 77

An observation

Can rewrite risk of pairwise loss:

E[L(α, Y ) | q] = Σ_{i≠j} sij 1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0} 1(αi ≤ αj)

where sij = E[Yij | q].

◮ Only depends on net expected preferences: sij − sji

slide-78
SLIDE 78

An observation

Can rewrite risk of pairwise loss:

E[L(α, Y ) | q] = Σ_{i≠j} sij 1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0} 1(αi ≤ αj)

where sij = E[Yij | q].

◮ Only depends on net expected preferences: sij − sji

Consider the surrogate

ϕ(α, s) := Σ_{i≠j} max{sij − sji, 0} φ(αi − αj)

for φ non-increasing and convex, with φ′(0) < 0.

slide-79
SLIDE 79

An observation

Can rewrite risk of pairwise loss:

E[L(α, Y ) | q] = Σ_{i≠j} sij 1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0} 1(αi ≤ αj)

where sij = E[Yij | q].

◮ Only depends on net expected preferences: sij − sji

Consider the surrogate

ϕ(α, s) := Σ_{i≠j} max{sij − sji, 0} φ(αi − αj)

for φ non-increasing and convex, with φ′(0) < 0.

◮ Either i → j penalized or j → i, but not both

slide-80
SLIDE 80

An observation

Can rewrite risk of pairwise loss:

E[L(α, Y ) | q] = Σ_{i≠j} sij 1(αi ≤ αj) = Σ_{i≠j} max{sij − sji, 0} 1(αi ≤ αj)

where sij = E[Yij | q].

◮ Only depends on net expected preferences: sij − sji

Consider the surrogate

ϕ(α, s) := Σ_{i≠j} max{sij − sji, 0} φ(αi − αj)

for φ non-increasing and convex, with φ′(0) < 0.

◮ Either i → j penalized or j → i, but not both
◮ Consistent whenever average preferences are acyclic
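The aggregated surrogate can be sketched concretely. This is an assumed illustration (logistic φ chosen for concreteness; the talk only requires φ non-increasing, convex, φ′(0) < 0):

```python
import math

# Aggregated surrogate phi(alpha, s) = sum over i != j of
# max(s_ij - s_ji, 0) * phi(alpha_i - alpha_j): only the net direction of
# each average preference contributes, so i->j and j->i are never both penalized.
def logistic(t):
    return math.log(1.0 + math.exp(-t))

def aggregated_surrogate(alpha, s):
    m = len(alpha)
    return sum(max(s[i][j] - s[j][i], 0.0) * logistic(alpha[i] - alpha[j])
               for i in range(m) for j in range(m) if i != j)

s = [[0.0, 0.7], [0.3, 0.0]]     # net preference for 0 over 1 of weight 0.4
good = aggregated_surrogate([1.0, 0.0], s)   # scores agree with net preference
bad = aggregated_surrogate([0.0, 1.0], s)    # scores reverse it
```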

slide-81
SLIDE 81

What happened?

Old surrogates: E[ϕ(α, Y ) | q] = lim_{k→∞} (1/k) Σ_{j=1}^k ϕ(α, Yj)

◮ Loss ϕ(α, Y ) applied to a single datapoint

slide-82
SLIDE 82

What happened?

Old surrogates: E[ϕ(α, Y ) | q] = lim_{k→∞} (1/k) Σ_{j=1}^k ϕ(α, Yj)

◮ Loss ϕ(α, Y ) applied to a single datapoint

New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{j=1}^k Yj)

◮ Loss applied to aggregation of many datapoints

slide-83
SLIDE 83

What happened?

Old surrogates: E[ϕ(α, Y ) | q] = lim_{k→∞} (1/k) Σ_{j=1}^k ϕ(α, Yj)

◮ Loss ϕ(α, Y ) applied to a single datapoint

New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{j=1}^k Yj)

◮ Loss applied to aggregation of many datapoints

New framework: Ranking with aggregate losses L(α, sk(Y1, . . . , Yk)) and ϕ(α, sk(Y1, . . . , Yk)), where sk is a structure function that aggregates the first k datapoints

slide-84
SLIDE 84

What happened?

Old surrogates: E[ϕ(α, Y ) | q] = lim_{k→∞} (1/k) Σ_{j=1}^k ϕ(α, Yj)

◮ Loss ϕ(α, Y ) applied to a single datapoint

New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{j=1}^k Yj)

◮ Loss applied to aggregation of many datapoints

New framework: Ranking with aggregate losses L(α, sk(Y1, . . . , Yk)) and ϕ(α, sk(Y1, . . . , Yk)), where sk is a structure function that aggregates the first k datapoints

◮ sk combines partial preferences into more complete estimates

slide-85
SLIDE 85

What happened?

Old surrogates: E[ϕ(α, Y ) | q] = lim_{k→∞} (1/k) Σ_{j=1}^k ϕ(α, Yj)

◮ Loss ϕ(α, Y ) applied to a single datapoint

New surrogates: ϕ(α, E[Y | q]) = lim_{k→∞} ϕ(α, (1/k) Σ_{j=1}^k Yj)

◮ Loss applied to aggregation of many datapoints

New framework: Ranking with aggregate losses L(α, sk(Y1, . . . , Yk)) and ϕ(α, sk(Y1, . . . , Yk)), where sk is a structure function that aggregates the first k datapoints

◮ sk combines partial preferences into more complete estimates
◮ Consistency characterization extends to this setting

slide-86
SLIDE 86

Aggregation via structure function

[Figure: partial preference graphs Y1, Y2, . . . , Yk over items {1, 2, 3, 4} combined by sk(Y1, . . . , Yk) into a single, more complete preference graph]

slide-87
SLIDE 87

Aggregation via structure function

[Figure: partial preference graphs Y1, Y2, . . . , Yk over items {1, 2, 3, 4} combined by sk(Y1, . . . , Yk) into a single, more complete preference graph]

Question: When does aggregation help?

slide-88
SLIDE 88

Complete data losses

◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates

◮ e.g., Penalize mistakes at top of ranked list more heavily

slide-89
SLIDE 89

Complete data losses

◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates

◮ e.g., Penalize mistakes at top of ranked list more heavily

Cons: Require complete preference data

slide-90
SLIDE 90

Complete data losses

◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates

◮ e.g., Penalize mistakes at top of ranked list more heavily

Cons: Require complete preference data

Idea:
◮ Use aggregation to estimate complete preferences from partial preferences

slide-91
SLIDE 91

Complete data losses

◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates

◮ e.g., Penalize mistakes at top of ranked list more heavily

Cons: Require complete preference data

Idea:
◮ Use aggregation to estimate complete preferences from partial preferences
◮ Plug estimates into consistent surrogates

slide-92
SLIDE 92

Complete data losses

◮ Normalized Discounted Cumulative Gain (NDCG)
◮ Precision, Precision@k
◮ Expected reciprocal rank (ERR)

Pros: Popular, well-motivated, admit tractable consistent surrogates

◮ e.g., Penalize mistakes at top of ranked list more heavily

Cons: Require complete preference data

Idea:
◮ Use aggregation to estimate complete preferences from partial preferences
◮ Plug estimates into consistent surrogates
◮ Check that aggregation + surrogacy retains consistency

slide-93
SLIDE 93

Cascade model for click data

[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

◮ Person i clicks on first relevant result, k(i)

[Figure: ranked list of five results]

slide-94
SLIDE 94

Cascade model for click data

[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk

[Figure: ranked list of five results]

slide-95
SLIDE 95

Cascade model for click data

[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
◮ Probability of a click on item k is pk Π_{j=1}^{k−1} (1 − pj)

[Figure: ranked list of five results]

slide-96
SLIDE 96

Cascade model for click data

[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
◮ Probability of a click on item k is pk Π_{j=1}^{k−1} (1 − pj)
◮ ERR loss assumes p is known

[Figure: ranked list of five results]

slide-97
SLIDE 97

Cascade model for click data

[Craswell, Zoeter, Taylor, and Ramsey, 2008, Chapelle, Metzler, Zhang, and Grinspan, 2009]

◮ Person i clicks on first relevant result, k(i)
◮ Relevance probability of item k is pk
◮ Probability of a click on item k is pk Π_{j=1}^{k−1} (1 − pj)
◮ ERR loss assumes p is known

Estimate p via maximum likelihood on n clicks:

s = argmax_{p∈[0,1]^m} Σ_{i=1}^n [ log p_{k(i)} + Σ_{j=1}^{k(i)−1} log(1 − pj) ]

⇒ Consistent ERR minimization under our framework

[Figure: ranked list of five results]
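The cascade-model maximum likelihood has a simple per-position form, sketched below (function name and data assumed; this uses the standard fact that the cascade MLE for p_j is clicks at position j divided by sessions that examined position j):

```python
# Cascade model: a session examines positions 1..k(i) and clicks only k(i),
# so the MLE for p_j is (#clicks at j) / (#sessions that examined j).
def cascade_mle(clicks, m):
    """clicks: list of clicked positions k(i), 1-indexed; m: list length."""
    examined = [0] * m
    clicked = [0] * m
    for k in clicks:
        for j in range(k):        # positions 1..k were examined
            examined[j] += 1
        clicked[k - 1] += 1
    return [clicked[j] / examined[j] if examined[j] else 0.0
            for j in range(m)]

# Four sessions: two clicked position 1, one position 2, one position 3.
p_hat = cascade_mle([1, 1, 2, 3], m=3)
```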

slide-98
SLIDE 98

Benefits of aggregation

◮ Tractable consistency for partial preference losses:

argmin_f lim_{k→∞} E[ϕ(f(Q), sk(Y1, . . . , Yk))] ⊆ argmin_f lim_{k→∞} E[L(f(Q), sk(Y1, . . . , Yk))]

◮ Use complete data losses with realistic partial preference data
◮ Models process of generating relevance scores from clicks/comparisons

slide-99
SLIDE 99

What remains?

Before aggregation, we had

argmin_f (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)   (empirical)   →   argmin_f E[ϕ(f(Q), Y )]   (population)
slide-100
SLIDE 100

What remains?

Before aggregation, we had

argmin_f (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)   (empirical)   →   argmin_f E[ϕ(f(Q), Y )]   (population)

What’s a suitable empirical analogue R̂ϕ,n(f) with aggregation?

slide-101
SLIDE 101

What remains?

Before aggregation, we had

argmin_f (1/n) Σ_{k=1}^n ϕ(f(Qk), Yk)   (empirical)   →   argmin_f E[ϕ(f(Q), Y )]   (population)

What’s a suitable empirical analogue R̂ϕ,n(f) with aggregation?
⇔ When does

argmin_f R̂ϕ,n(f)   (empirical)   →   argmin_f lim_{k→∞} E[ϕ(f(Q), sk(Y1, . . . , Yk))]   (population) ?

slide-102
SLIDE 102

Outline

Supervised Ranking: Formal definition · Tractable surrogates · Pairwise inconsistency
Aggregation: Restoring consistency · Estimating complete preferences
U-statistics: Practical procedures · Experimental results

slide-103
SLIDE 103

Data with aggregation

[Figure: queries q1, . . . , q5, each with preference labels Y1, Y2, Y3, . . . ; nq labels per query]

◮ Datapoint consists of query q and preference judgment Y
◮ nq datapoints for query q
◮ Structure functions for aggregation: s(Y1, Y2, . . . , Yk)

slide-104
SLIDE 104

Data with aggregation

[Figure: queries q1, . . . , q5, each with preference labels Y1, Y2, Y3, . . . ; nq labels per query]

◮ Simple idea: for query q, aggregate all Y1, Y2, . . . , Ynq
◮ Loss ϕ for query q is nq · ϕ(α, s(Y1, . . . , Ynq))

slide-105
SLIDE 105

Data with aggregation

[Figure: queries q1, . . . , q5, each with preference labels Y1, Y2, Y3, . . . ; nq labels per query]

◮ Simple idea: for query q, aggregate all Y1, Y2, . . . , Ynq
◮ Loss ϕ for query q is nq · ϕ(α, s(Y1, . . . , Ynq))

Cons:
◮ Requires detailed knowledge of ϕ and sk(Y1, . . . , Yk) as k → ∞

slide-106
SLIDE 106

Data with aggregation

[Figure: queries q1, . . . , q5, each with preference labels Y1, Y2, Y3, . . . ; nq labels per query]

◮ Simple idea: for query q, aggregate all Y1, Y2, . . . , Ynq
◮ Loss ϕ for query q is nq · ϕ(α, s(Y1, . . . , Ynq))

Cons:
◮ Requires detailed knowledge of ϕ and sk(Y1, . . . , Yk) as k → ∞

Ideal procedure:
◮ Agnostic to form of aggregation
◮ Take advantage of independence of Y1, Y2, . . .

slide-107
SLIDE 107

Digression: U-statistics

◮ U-statistic: classical tool in statistics
◮ Given X1, . . . , Xn, estimate E[g(X1, . . . , Xk)] for g symmetric
◮ Idea: Average all estimates based on k datapoints:

Un = (n choose k)⁻¹ Σ_{i1<···<ik} g(Xi1, Xi2, . . . , Xik)
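A U-statistic is easy to compute directly for small n. A sketch with an assumed toy kernel (the classical fact used here: the kernel g(x, y) = (x − y)²/2 yields the unbiased sample variance):

```python
from itertools import combinations
from math import comb

# U-statistic: average a symmetric kernel g over all size-k subsets of the
# data to estimate E[g(X_1, ..., X_k)].
def u_statistic(xs, k, g):
    n = len(xs)
    return sum(g(*subset) for subset in combinations(xs, k)) / comb(n, k)

xs = [1.0, 2.0, 3.0, 4.0]
var_hat = u_statistic(xs, 2, lambda x, y: 0.5 * (x - y) ** 2)
# var_hat equals the unbiased sample variance of xs.
```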

slide-108
SLIDE 108

Data with aggregation: U-statistic in the loss

◮ Target: E[ϕ(α, s(Y1, . . . , Yk)) | q]
slide-109
SLIDE 109

Data with aggregation: U-statistic in the loss

◮ Target: E[ϕ(α, s(Y1, . . . , Yk)) | q]
◮ Idea: Estimate with U-statistic:

(nq choose k)⁻¹ Σ_{i1<···<ik} ϕ(α, s(Yi1, . . . , Yik))

slide-110
SLIDE 110

Data with aggregation: U-statistic in the loss

◮ Target: E[ϕ(α, s(Y1, . . . , Yk)) | q]
◮ Idea: Estimate with U-statistic:

(nq choose k)⁻¹ Σ_{i1<···<ik} ϕ(α, s(Yi1, . . . , Yik))

◮ Empirical risk for scoring function f:

R̂ϕ,n(f) = (1/n) Σ_q nq (nq choose k)⁻¹ Σ_{i1<···<ik} ϕ(f(q), s(Yi1, . . . , Yik))

slide-111
SLIDE 111

Convergence of U-statistic procedures

Empirical risk for scoring function f:

R̂ϕ,n(f) = (1/n) Σ_q nq (nq choose k)⁻¹ Σ_{i1<···<ik} ϕ(f(q), s(Yi1, . . . , Yik))

Theorem: If we choose kn = o(n) with kn → ∞, then uniformly in f,

R̂ϕ,n(f) → lim_{k→∞} E[ϕ(f(Q), s(Y1, . . . , Yk))]   (limiting aggregated loss)
slide-112
SLIDE 112

New procedure for learning to rank

◮ Use loss function that aggregates per query:

R̂ϕ,n(f) = (1/n) Σ_q nq (nq choose k)⁻¹ Σ_{i1<···<ik} ϕ(f(q), s(Yi1, . . . , Yik))

◮ Learn ranking function by taking f̂ ∈ argmin_{f∈F} R̂ϕ,n(f)
◮ Can optimize by stochastic gradient descent over queries q and subsets (i1, . . . , ik)
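The stochastic optimization over queries and subsets can be sketched as follows. All names here are assumed for illustration; the point is that each step samples one query and one size-k subset of its labels rather than enumerating all C(nq, k) subsets:

```python
import random

# One SGD step on the U-statistic risk: sample a query, sample k of its
# labels, and take a gradient step on phi(f(q), s(labels)).
def sgd_step(f_params, data, k, grad_fn, lr=0.1, rng=random):
    q = rng.choice(list(data))          # sample a query
    ys = rng.sample(data[q], k)         # sample a size-k subset of its labels
    g = grad_fn(f_params, q, ys)        # gradient of the surrogate on that subset
    return [w - lr * gi for w, gi in zip(f_params, g)]

# Toy usage: a 1-parameter "scorer" pulled toward each subset's mean label.
rng = random.Random(0)
data = {"q1": [1.0, 2.0, 3.0, 4.0]}
grad = lambda w, q, ys: [w[0] - sum(ys) / len(ys)]
w = [0.0]
for _ in range(300):
    w = sgd_step(w, data, k=2, grad_fn=grad, rng=rng)
# w[0] hovers near the overall mean label, 2.5.
```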

slide-113
SLIDE 113

Experiments

◮ Web search ◮ Image ranking

slide-114
SLIDE 114

Web search

◮ Microsoft Learning to Rank Web10K dataset

slide-115
SLIDE 115

Web search

◮ Microsoft Learning to Rank Web10K dataset

◮ 10,000 queries issued
◮ 100 items per query
◮ Estimated relevance score r ∈ R for each query/result pair

slide-116
SLIDE 116

Web search

◮ Microsoft Learning to Rank Web10K dataset

◮ 10,000 queries issued
◮ 100 items per query
◮ Estimated relevance score r ∈ R for each query/result pair

◮ Generating pairwise preferences
  ◮ Choose query q uniformly at random
  ◮ Choose pair (i, j) of items, and set i ≻ j with probability pij = 1 / (1 + exp(rj − ri))

slide-117
SLIDE 117

Web search

◮ Microsoft Learning to Rank Web10K dataset

◮ 10,000 queries issued
◮ 100 items per query
◮ Estimated relevance score r ∈ R for each query/result pair

◮ Generating pairwise preferences
  ◮ Choose query q uniformly at random
  ◮ Choose pair (i, j) of items, and set i ≻ j with probability pij = 1 / (1 + exp(rj − ri))

◮ Aggregate scores by setting si = Σ_{j≠i} log [ P(j ≺ i) / P(i ≺ j) ]
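The preference model and aggregation rule above can be sketched end to end (names assumed; with the logistic model, the log-odds aggregation reduces exactly to sums of relevance differences):

```python
import math

# P(i preferred to j) under the logistic (Bradley-Terry style) model
# used to generate the pairwise preferences.
def bt_prob(ri, rj):
    return 1.0 / (1.0 + math.exp(rj - ri))

# Aggregate score s_i = sum over j != i of log(P(j < i) / P(i < j)).
def aggregate_scores(r):
    m = len(r)
    return [sum(math.log(bt_prob(r[i], r[j]) / bt_prob(r[j], r[i]))
                for j in range(m) if j != i)
            for i in range(m)]

r = [2.0, 1.0, 0.0]        # true relevances
s = aggregate_scores(r)    # equals [3, 0, -3]: order of r is preserved
```

Since log(pij / pji) = ri − rj under this model, each s_i is m·ri minus the sum of all relevances, so the aggregated scores always order items exactly as the true relevances do.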
slide-118
SLIDE 118

Benefits of aggregation

NDCG risk as a function of aggregation level k for n = 10^6 samples

[Figure: NDCG@10 (0.65 to 0.85) vs. aggregation order k (10^1 to 10^4) for the Aggregate, Pairwise, and Score-based methods]

slide-119
SLIDE 119

Image ranking

◮ Setup [Grangier and Bengio, 2008]
  ◮ Take most common image search queries on google.com
  ◮ Train an independent ranker based on aggregated preference statistics for each query
  ◮ Compare with standard, disaggregated image-ranking approaches

slide-120
SLIDE 120

Image ranking experiments

Highly ranked items from the Corel Image Database for the query tree car:

[Figure: top-ranked images for the Aggregated, SVM, and PLSA methods]

slide-125
SLIDE 125

Conclusions

  • 1. Partial preference data is abundant and (more) reliable
  • 2. General theory of ranking consistency: When is

  argmin_f E[ϕ(f(Q), s)] ⊆ argmin_f E[L(f(Q), s)] ?

◮ Tractable consistency is difficult with partial preference data
◮ Possible with complete preference data

  • 3. Aggregation can bridge the gap

◮ Can transform pairwise preferences/click data into scores s

  • 4. Practical consistent procedures via U-statistic aggregation

◮ Allows for arbitrary aggregation s
◮ High-probability convergence of the learned ranking function

slide-129
SLIDE 129

Future work

◮ Empirical directions

◮ Apply to more ranking problems!
◮ Which aggregation procedures perform best?
◮ How much aggregation is enough?

◮ Statistical questions: beyond consistency

◮ How does aggregation impact the rate of convergence?
◮ Can we design statistically efficient ranking procedures?

◮ Other ways of dealing with realistic partial preference data?

slide-130
SLIDE 130

References I

  • P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138–156, 2006.
  • D. Buffoni, C. Calauzenes, P. Gallinari, and N. Usunier. Learning scoring functions with order-preserving losses and standardized supervision. In Proceedings of the 28th International Conference on Machine Learning, 2011.
  • O. Chapelle, D. Metzler, Y. Zhang, and P. Grinspan. Expected reciprocal rank for graded relevance. In Conference on Information and Knowledge Management, 2009.
  • N. Craswell, O. Zoeter, M. J. Taylor, and B. Ramsey. An experimental comparison of click position-bias models. In Web Search and Data Mining (WSDM), pages 87–94, 2008.
  • O. Dekel, C. Manning, and Y. Singer. Log-linear models for label ranking. In Advances in Neural Information Processing Systems 16, 2004.
  • J. C. Duchi, L. Mackey, and M. I. Jordan. The asymptotics of ranking algorithms. Annals of Statistics, 2013.
  • Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer. Efficient boosting algorithms for combining preferences. Journal of Machine Learning Research, 4:933–969, 2003.
  • R. Herbrich, T. Graepel, and K. Obermayer. Large margin rank boundaries for ordinal regression. In Advances in Large Margin Classifiers. MIT Press, 2000.
  • G. Miller. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63:81–97, 1956.
  • P. Ravikumar, A. Tewari, and E. Yang. On NDCG consistency of listwise ranking methods. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, 2011.
  • R. Shiffrin and R. Nosofsky. Seven plus or minus two: a commentary on capacity limitations. Psychological Review, 101(2):357–361, 1994.
  • N. Stewart, G. Brown, and N. Chater. Absolute identification by relative judgment. Psychological Review, 112(4):881–911, 2005.

slide-131
SLIDE 131

slide-133
SLIDE 133

What is the problem?

Surrogate loss ϕ(α, s) = Σ_{ij} s_ij φ(α_i − α_j)

[Figure: two equiprobable preference graphs over items 1, 2, 3 — s with edges s12, s23, s13 and s′ with edge s31, p(s) = p(s′) = .5 — and their aggregate with edges s12, s31, s13, s23]

  Σ_s p(s) ϕ(α, s) = ½ ϕ(α, s) + ½ ϕ(α, s′)
                   ∝ s12 φ(α1 − α2) + s13 φ(α1 − α3) + s23 φ(α2 − α3) + s31 φ(α3 − α1)
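The aggregation identity above follows from linearity of ϕ in s. A small numerical check, with hypothetical unit edge weights and a logistic φ (items 1, 2, 3 are indices 0, 1, 2 here):

```python
import math

def phi(alpha, s):
    # phi(alpha, s) = sum_{ij} s_ij * psi(alpha_i - alpha_j), logistic psi.
    psi = lambda t: math.log1p(math.exp(-t))
    return sum(w * psi(alpha[i] - alpha[j]) for (i, j), w in s.items())

# The two equiprobable preference graphs from the slide.
s_graph = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}   # edges s12, s13, s23
s_prime = {(2, 0): 1.0}                             # edge s31
aggregated = {**s_graph, **s_prime}                 # union of the edge sets

alpha = [0.3, 0.1, -0.2]                            # arbitrary score vector
expected = 0.5 * phi(alpha, s_graph) + 0.5 * phi(alpha, s_prime)
# The expected surrogate equals (1/2) times the surrogate on the aggregate graph:
assert abs(expected - 0.5 * phi(alpha, aggregated)) < 1e-12
```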


slide-137
SLIDE 137

What is the problem?

s12 φ(α1 − α2) + s13 φ(α1 − α3) + s23 φ(α2 − α3) + s31 φ(α3 − α1)

[Figure: preference graph over items 1, 2, 3 with edges s12, s31, s13, s23; the opposing terms s31 φ(α3 − α1) and s13 φ(α1 − α3) are highlighted]

More bang for your $$ by increasing α3 − α1 to 0 from the left: α1 ↓.

Result: α∗ = argmin_α Σ_{ij} s_ij φ(α_i − α_j) can have α∗_2 > α∗_1, even if s13 − s31 > s12 + s23.