SLIDE 1

Statistical Inference for Incomplete Ranking Data: The Case of Rank-Dependent Coarsening

Mohsen Ahmadi Fahandar¹, Eyke Hüllermeier¹, Inés Couso²

¹Intelligent Systems Group, Paderborn University, Germany; ²Department of Statistics, University of Oviedo, Spain

ICML 2017 Tuesday, August 8th

SLIDE 2

Contributions

Considering statistical inference for incomplete ranking data, we:
- Propose a specific type of data-generating process, in which incompleteness is due to "coarsening" of (latent) complete rankings.
- Introduce the concept of "rank-dependent" coarsening.
Under the proposed setting, we study the problem of rank aggregation and the performance of various rank aggregation methods, both theoretically and practically.

SLIDE 3

Rank Aggregation

Given rankings over a set of K items (e.g., K = 5) as observations:

a4 ≻ a5 ≻ a3 ≻ a2 ≻ a1
a5 ≻ a2 ≻ a1 ≻ a3 ≻ a4
a3 ≻ a1 ≻ a5 ≻ a4 ≻ a2
. . .
a1 ≻ a2 ≻ a4 ≻ a3 ≻ a5

Goal: combine the rankings into a (single) consensus ranking a? ≻ a? ≻ a? ≻ a? ≻ a?.

SLIDE 4

Ranking Distributions

Plackett-Luce (PL) model: the probability assigned to ranking π, given parameter vector θ = (θ1, θ2, . . . , θK) ∈ R^K_+, is

P_θ(π) = ∏_{i=1}^{K} θ_{π(i)} / (θ_{π(i)} + θ_{π(i+1)} + . . . + θ_{π(K)})

The mode of the PL distribution (i.e., π*) is the natural consensus in this case.

For example, P_θ(a2 ≻ a1 ≻ a3) = (θ_{a2} / (θ_{a1} + θ_{a2} + θ_{a3})) · (θ_{a1} / (θ_{a1} + θ_{a3})) · (θ_{a3} / θ_{a3}).

Bradley-Terry-Luce (BTL) model: P_θ(a1 ≻ a2) = θ_{a1} / (θ_{a1} + θ_{a2}).
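As a quick illustration (our sketch, not from the slides; the item names and θ values are made up), the PL probability can be computed position by position:

```python
def pl_prob(ranking, theta):
    """Plackett-Luce probability of a full ranking (best item first):
    at each position, the next item is chosen with probability
    proportional to its parameter among the items not yet ranked."""
    p = 1.0
    for i in range(len(ranking)):
        p *= theta[ranking[i]] / sum(theta[j] for j in ranking[i:])
    return p
```

With theta = {'a1': 2.0, 'a2': 3.0, 'a3': 1.0}, pl_prob(['a2', 'a1', 'a3'], theta) evaluates to (3/6)(2/3)(1/1) = 1/3, and the probabilities of all 3! rankings sum to 1.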

SLIDE 5

Incomplete Rankings

In most applications, the observed rankings are incomplete (e.g., K = 5); observations:

a4 ≻ a5 ≻ a3 ≻ a2 ≻ a1
a2 ≻ a1 ≻ a3 ≻ a4
a3 ≻ a1
. . .
a1 ≻ a4 ≻ a5

Goal: still infer a consensus ranking a? ≻ a? ≻ a? ≻ a? ≻ a?. Rank aggregation for incomplete rankings is more challenging!

SLIDE 6

From Complete to Incomplete Ranking

ranking model → (generation P_θ(π)) → full ranking → (coarsening P_λ(τ | π)) → incomplete ranking

Where does the word "coarsening" come from?

SLIDE 7

A Stochastic Model for Incomplete Rankings

The joint distribution over S_K (the set of all rankings of the K items):

P_{θ,λ}(τ, π) = P_θ(π) · P_λ(τ | π)

Generation of full rankings: P_θ : S_K → [0, 1].
Coarsening process: P_λ(· | π), for π ∈ S_K and λ ∈ Λ.
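The generation stage can be sampled sequentially; a minimal sketch assuming a PL model for P_θ (as instantiated later in the talk), with made-up parameters:

```python
import random

def sample_pl(theta, rng):
    """Draw a full ranking from Plackett-Luce: repeatedly pick the
    next item with probability proportional to its parameter among
    the items not yet ranked."""
    remaining = list(theta)
    ranking = []
    while remaining:
        pick = rng.choices(remaining, weights=[theta[i] for i in remaining])[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking
```

A coarsening P_λ(τ | π) would then be applied to each sampled π to produce the observed τ.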

SLIDE 8

Modeling of the Coarsening

Full model (P_θ + P_λ):
- Estimate P_λ:
  - non-parametric (i.e., model and estimate P_λ with no assumptions)
  - parametric (i.e., take P_λ from a parametric family)
- Do not estimate P_λ:
  - ignore the coarsening but make assumptions about it (e.g., rank-dependent)
  - ignore the coarsening and make no assumptions

SLIDE 9

The Underlying Assumption

Standard marginalization: a random subset of items is observed.
full ranking: a4 ≻ a1 ≻ a3 ≻ a2
random subset of items: {a4, a3}
observed ranking: a4 ≻ a3

What we propose: a coarsening that acts only on "ranks" (positions), not items: P : 2^[K] → [0, 1].
full ranking: a4 ≻ a1 ≻ a3 ≻ a2
set of ranks: {1, 2, 3, 4}; random subset of ranks: {2, 4}
observed ranking: a1 ≻ a2
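A minimal sketch of this rank-based coarsening (the function name is ours): reveal exactly the items whose positions fall in the drawn rank subset, keeping their relative order:

```python
def coarsen_by_ranks(full_ranking, rank_subset):
    """Keep only the items at the given (1-based) positions;
    their relative order in the full ranking is preserved."""
    return [item for pos, item in enumerate(full_ranking, start=1)
            if pos in rank_subset]
```

For the full ranking a4 ≻ a1 ≻ a3 ≻ a2, the rank subset {2, 4} yields a1 ≻ a2, as on the slide.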

SLIDE 10

Specific Instantiation

Specific instantiation of the scheme: the ranking model is Plackett-Luce, and the incomplete rankings are pairwise observations:

Plackett-Luce model → (generation P_θ(π)) → full ranking → (coarsening P_λ(τ | π)) → pairwise observations

SLIDE 11

Data Generating Process

Rank-dependence in the case of pairwise comparisons: the entire distribution P_λ is specified by the set of K(K − 1)/2 probabilities

{λ_{u,v} | 1 ≤ u < v ≤ K}, with λ_{u,v} ≥ 0 and Σ_{1≤u<v≤K} λ_{u,v} = 1.

The probability to observe a_i better than a_j:

q′_{i,j} = Σ_{π ∈ E(a_i ≻ a_j)} P_θ(π) · λ_{π(i),π(j)},

where E(a_i ≻ a_j) is the set of all rankings consistent with a_i ≻ a_j, and π(i) here denotes the position of a_i in π.
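For small K, q′ can be evaluated by brute-force enumeration of S_K (our sketch; the θ and λ values in the usage below are illustrative, and π(i) is taken as the rank of item i):

```python
import itertools

def pl_prob(order, theta):
    """PL probability of a full ranking, given as a tuple of item
    indices listed best first."""
    p = 1.0
    for i in range(len(order)):
        p *= theta[order[i]] / sum(theta[j] for j in order[i:])
    return p

def q_prime(i, j, theta, lam):
    """Probability of observing a_i beat a_j: sum over all rankings
    placing a_i above a_j, weighted by the probability lam[(u, v)]
    that the coarsening reveals exactly the pair of ranks (u, v)."""
    total = 0.0
    for order in itertools.permutations(range(len(theta))):
        rank = {item: pos for pos, item in enumerate(order, start=1)}
        if rank[i] < rank[j]:
            total += pl_prob(order, theta) * lam.get((rank[i], rank[j]), 0.0)
    return total
```

Because exactly one pair of ranks is revealed per ranking, the q′ values of all ordered pairs sum to 1.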

SLIDE 12

Data Generating Process

Generated rankings based on PL:

a4 ≻ a5 ≻ a3 ≻ a2 ≻ a1
a5 ≻ a2 ≻ a1 ≻ a3 ≻ a4
a1 ≻ a3 ≻ a2 ≻ a4 ≻ a5
. . .
a1 ≻ a2 ≻ a4 ≻ a3 ≻ a5

Coarsening with λ1,3 = 1 (a degenerate probability distribution: the items at ranks 1 and 3 are always the pair revealed) yields observations D:

a4 ≻ a3
a5 ≻ a1
a1 ≻ a2
. . .
a1 ≻ a4

SLIDE 13

Introduced Bias

Let θ = (14, 5, 1) and let the coarsening be degenerate with λ1,2 = 1 (i.e., top-2 is always observed).

Marginal matrix (p_{i,j} = θ_i / (θ_i + θ_j)):

      −      0.737  0.933
    0.263     −     0.833
    0.067   0.167    −

Observed matrix (q_{i,j} = q′_{i,j} / (q′_{i,j} + q′_{j,i})):

      −      0.714  0.760
    0.286     −     0.559
    0.240   0.441    −

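Both matrices can be reproduced numerically by enumerating the 3! full rankings (our re-computation, not code from the talk):

```python
import itertools

theta = (14.0, 5.0, 1.0)

def pl_prob(order):
    """PL probability of a full ranking (tuple of item indices, best first)."""
    p = 1.0
    for i in range(len(order)):
        p *= theta[order[i]] / sum(theta[j] for j in order[i:])
    return p

def q_prime(i, j):
    """P(observe a_i beat a_j) under the degenerate coarsening
    lambda_{1,2} = 1: only the pair at ranks (1, 2) is revealed."""
    return sum(pl_prob(order)
               for order in itertools.permutations(range(3))
               if (order[0], order[1]) == (i, j))

def p(i, j):   # marginal matrix entry
    return theta[i] / (theta[i] + theta[j])

def q(i, j):   # observed (biased) matrix entry
    return q_prime(i, j) / (q_prime(i, j) + q_prime(j, i))
```

Rounding p(0, 1) and q(0, 1) to three decimals gives 0.737 and 0.714, the first entries of the two matrices.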
SLIDE 14

Definitions

Comparison matrix C (c_{i,j}: number of wins of a_i over a_j):

       a1  a2  a3  a4
  a1    −   6   4   1
  a2    7   −   5   8
  a3    3   4   −   9
  a4    2   1  12   −

Probability matrix P̂ (relative wins), where p̂_{i,j} = c_{i,j} / (c_{i,j} + c_{j,i}):

       a1    a2    a3    a4
  a1    −   0.46  0.57  0.33
  a2  0.54   −    0.56  0.89
  a3  0.43  0.44   −    0.43
  a4  0.67  0.11  0.57   −
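Deriving P̂ from C is mechanical (a sketch; None marks the diagonal):

```python
def prob_matrix(C):
    """Relative-win matrix: P[i][j] = c_ij / (c_ij + c_ji)."""
    K = len(C)
    return [[C[i][j] / (C[i][j] + C[j][i]) if i != j else None
             for j in range(K)] for i in range(K)]

# comparison matrix from the slide (0 on the diagonal)
C = [[0, 6, 4, 1],
     [7, 0, 5, 8],
     [3, 4, 0, 9],
     [2, 1, 12, 0]]
```

round(prob_matrix(C)[0][1], 2) gives 0.46, matching the first row of P̂.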

SLIDE 15

Rank Estimation Framework

Observations D (K = 4):

a4 ≻ a3
a2 ≻ a1
a1 ≻ a2
. . .
a1 ≻ a4

estimate ⇒ matrix C, then aggregate ⇒ estimated ranking π̂ : a2 ≻ a4 ≻ a1 ≻ a3
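The estimation step is a simple tally over the observed pairs (a sketch; items are encoded as 0-based indices):

```python
def count_matrix(observations, K):
    """Pairwise win counts: C[i][j] is how often a_i was observed
    to beat a_j."""
    C = [[0] * K for _ in range(K)]
    for winner, loser in observations:
        C[winner][loser] += 1
    return C
```

count_matrix([(3, 2), (1, 0), (0, 1), (0, 3)], 4) tallies the four observations shown above; an aggregation method is then applied to the resulting C.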

SLIDE 16

Rank Aggregation Methods

Statistical estimation: BTL, BTL(R) (Bradley & Terry, 1952); Least Squares/HodgeRank (LS) (Jiang et al., 2011)
Voting methods: Borda (Borda, 1781); Copeland (CP) (Copeland, 1951)
Spectral methods: Rank Centrality (RC) (Negahban et al., 2012); MC2, MC3 (Dwork et al., 2001)
Graph-based methods: FAS, FAS(R), FAS(B) (Saab, 2001; Fomin et al., 2010)
Pairwise coupling: HT (Hastie & Tibshirani, 1998); Price (Price et al., 1994); WU1, WU2 (Wu et al., 2004)

SLIDE 17

Research Questions

Practical performance: how close is the prediction π̂ to the ground-truth ranking π*?

Consistency: let π̂_N denote the ranking produced as a prediction by a ranking method on the basis of N observed (pairwise) preferences. The method is consistent if

P(π̂_N = π*) → 1 for N → ∞.

SLIDE 18

BTL (Bradley-Terry-Luce)

Given comparison matrix C:

       a1  a2  a3  a4
  a1    −   6   4   1
  a2    7   −   5   8
  a3    3   4   −   9
  a4    2   1  12   −

BTL estimates the parameters by likelihood maximization:

θ̂ ∈ arg max_{θ ∈ R^K_+} ∏_{1≤i≠j≤K} (θ_i / (θ_i + θ_j))^{c_{i,j}}

θ̂ ≈ (0.253, 0.382, 0.178, 0.187) ⇒ π̂ : a2 ≻ a1 ≻ a4 ≻ a3
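One standard way to compute this MLE is the minorization-maximization (MM) iteration for Bradley-Terry models; the sketch below (our implementation, not the talk's) reproduces the slide's estimate:

```python
def fit_btl(C, iters=2000):
    """Bradley-Terry MLE via MM updates:
    theta_i <- W_i / sum_{j != i} N_ij / (theta_i + theta_j),
    with W_i the total wins of item i and N_ij = c_ij + c_ji;
    theta is renormalized to sum to 1 after each sweep."""
    K = len(C)
    theta = [1.0 / K] * K
    wins = [sum(row) for row in C]
    for _ in range(iters):
        new = [wins[i] / sum((C[i][j] + C[j][i]) / (theta[i] + theta[j])
                             for j in range(K) if j != i)
               for i in range(K)]
        s = sum(new)
        theta = [t / s for t in new]
    return theta

C = [[0, 6, 4, 1],
     [7, 0, 5, 8],
     [3, 4, 0, 9],
     [2, 1, 12, 0]]
```

Sorting the items by the fitted parameters yields π̂ : a2 ≻ a1 ≻ a4 ≻ a3.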

SLIDE 19

Borda and Copeland (CP)

Given probability matrix P̂:

       a1    a2    a3    a4
  a1    −   0.46  0.57  0.33
  a2  0.54   −    0.56  0.89
  a3  0.43  0.44   −    0.43
  a4  0.67  0.11  0.57   −

Borda assigns a score to each item:

s_i = Σ_{j≠i} p̂_{i,j}

s ≈ (1.366, 1.983, 1.302, 1.349) ⇒ π̂ : a2 ≻ a1 ≻ a4 ≻ a3

Copeland counts the number of pairwise victories:

s_i = Σ_{j≠i} I(p̂_{i,j} > 1/2)

s = (1, 3, 0, 2) ⇒ π̂ : a2 ≻ a4 ≻ a1 ≻ a3
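Both scores are one-liners over P̂; in this sketch P̂ is recomputed exactly from the comparison matrix C of the earlier slides rather than from the rounded entries shown above:

```python
def borda_copeland(P):
    """Borda: row sums of the relative-win matrix.
    Copeland: number of majority wins per row."""
    K = len(P)
    borda = [sum(P[i][j] for j in range(K) if j != i) for i in range(K)]
    copeland = [sum(P[i][j] > 0.5 for j in range(K) if j != i)
                for i in range(K)]
    return borda, copeland

C = [[0, 6, 4, 1], [7, 0, 5, 8], [3, 4, 0, 9], [2, 1, 12, 0]]
P = [[C[i][j] / (C[i][j] + C[j][i]) if i != j else 0.0
      for j in range(4)] for i in range(4)]
```

The Borda scores round to (1.366, 1.983, 1.302, 1.349) and the Copeland scores are (1, 3, 0, 2), matching the slide.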

SLIDE 20

FAS (Feedback Arc Set)

Given comparison matrix C:

       a1  a2  a3  a4
  a1    −   6   4   1
  a2    7   −   5   8
  a3    3   4   −   9
  a4    2   1  12   −

FAS seeks the ranking that incurs the lowest sum of penalties:

π̂ = arg min_{π ∈ S_K} Σ_{(i,j): π(i) < π(j)} c_{j,i}

π̂ : a2 ≻ a4 ≻ a1 ≻ a3
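For K = 4 the FAS objective can be minimized by exhaustive search over S_K (a sketch; the cited methods use heuristics for larger K):

```python
import itertools

def fas_brute_force(C):
    """Return the order (best item first) minimizing the total count
    of 'backward' wins: each pair contributes the number of times the
    lower-placed item beat the higher-placed one."""
    K = len(C)
    def cost(order):
        return sum(C[order[b]][order[a]]
                   for a in range(K) for b in range(a + 1, K))
    return min(itertools.permutations(range(K)), key=cost)

C = [[0, 6, 4, 1], [7, 0, 5, 8], [3, 4, 0, 9], [2, 1, 12, 0]]
```

fas_brute_force(C) returns (1, 3, 0, 2), i.e., π̂ : a2 ≻ a4 ≻ a1 ≻ a3, matching the slide.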

SLIDE 21

Synthetic Data

PL (with K = 3); the x-axis shows the sample size, the y-axis the normalized Kendall distance.

[Figure: four panels, one per coarsening scenario (λ1,2 = 1; λ1,3 = 1; λ2,3 = 1; full breaking), sample sizes 200 to 2000 on the x-axis, normalized Kendall distance 0.00 to 0.15 on the y-axis; one curve per method: RC, LS, Borda, CP, MC2, MC3, BTL, BTL(R), FAS, FAS(R), FAS(B), Price, WU2, HT, WU1.]

SLIDE 22

Theoretical Findings

Being agnostic of the coarsening, all ranking methods essentially expect (estimates of) the marginals p_{i,j} as input. In our case, however, the inputs are the q_{i,j}. As a consequence, estimates of θ will be biased.

Conjecture: In spite of this, and somewhat surprisingly, all methods are consistent rankers, i.e., produce π* in the limit. We proved this conjecture for some of the methods (including Copeland and FAS), while some proofs are still open (e.g., for Borda and BTL).

SLIDE 23

Summary and Conclusion

We studied statistical inference for incomplete ranking data:
- We proposed a model in which incompleteness is due to coarsening of the complete information (full rankings).
- We proposed the notion of rank-dependent coarsening, which generalizes selection mechanisms such as top-k (whether or not an item is observed depends solely on its rank).
- We studied (probabilistic) rank aggregation as a specific instance.
- Interestingly, rank-dependent coarsening is rather "good-natured" in this setting, in the sense that agnostic learning does not compromise consistency of the rankers.

Thanks!

SLIDE 24

Backup slides


SLIDE 25

Sushi Data

[Figure: two panels of results on the Sushi data, with values between 0.0 and 0.5 on the y-axis; one entry per method: RC, LS, Borda, CP, MC2, MC3, BTL, BTL(R), FAS, FAS(R), FAS(B), Price, WU2, HT, WU1.]

SLIDE 26

Coarsening is Rich

Complete model when K = 3: each full ranking π ∈ {abc, acb, bac, bca, cab, cba} comes with its probability P_θ(π), and the coarsening P_λ(τ | π) puts mass only on observations τ that are order-consistent with π:

full rankings: P_λ(abc | abc), P_λ(acb | acb), P_λ(bac | bac), P_λ(bca | bca), P_λ(cab | cab), P_λ(cba | cba)
τ = ab: P_λ(ab | abc), P_λ(ab | acb), P_λ(ab | cab)
τ = ba: P_λ(ba | bac), P_λ(ba | bca), P_λ(ba | cba)
τ = ac: P_λ(ac | abc), P_λ(ac | acb), P_λ(ac | bac)
τ = ca: P_λ(ca | bca), P_λ(ca | cab), P_λ(ca | cba)
τ = bc: P_λ(bc | abc), P_λ(bc | bac), P_λ(bc | bca)
τ = cb: P_λ(cb | acb), P_λ(cb | cab), P_λ(cb | cba)
τ = a, b, c, or []: defined for every one of the six full rankings

- The actual order of the items remains intact under coarsening.
- The number of probabilities to be specified: 2^K · K!.
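The count 2^K · K! can be checked by enumerating all (rank subset, full ranking) pairs, each of which indexes one coarsening probability (a small sketch):

```python
import itertools

def coarsening_table_size(K):
    """Number of probabilities P_lambda(tau | pi) to specify: one per
    (subset of ranks, full ranking) pair, i.e. 2^K * K!."""
    count = 0
    for pi in itertools.permutations(range(K)):
        for r in range(K + 1):
            for subset in itertools.combinations(range(1, K + 1), r):
                count += 1
    return count
```

coarsening_table_size(3) returns 48 = 2^3 · 3!.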

SLIDE 27

Parametric Modeling of Pλ

More restrictive (stronger) assumptions than rank-dependence:

Top-k: for A ⊆ {1, . . . , K},

P(A) = 1 if A = {1, . . . , k}, and P(A) = 0 otherwise.

Top-k with k a random variable: the positions are discarded with increasing probability, so for A ⊆ {1, . . . , K},

P(A) = ∏_{i∈A} λ_i · ∏_{j∉A} (1 − λ_j),

in which case the coarsening is defined by the K parameters λ1 > λ2 > . . . > λK.
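A quick sanity check (with illustrative λ values) that these rank-retention probabilities define a distribution over subsets of ranks:

```python
import itertools

def subset_prob(A, lam):
    """P(A) for independent rank retention: rank i is kept with
    probability lam[i-1] and dropped with probability 1 - lam[i-1]."""
    p = 1.0
    for i in range(1, len(lam) + 1):
        p *= lam[i - 1] if i in A else 1.0 - lam[i - 1]
    return p

lam = (0.9, 0.5, 0.2)   # lambda_1 > lambda_2 > lambda_3, as required
subsets = [set(c) for r in range(4)
           for c in itertools.combinations((1, 2, 3), r)]
```

The probabilities of all 2^3 rank subsets sum to 1.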
