Fast and Accurate Inference of Plackett–Luce Models — Lucas Maystre (PowerPoint presentation)

SLIDE 1

Fast and Accurate Inference of Plackett–Luce Models

Lucas Maystre, Matthias Grossglauser — LCA 4, EPFL

Swiss Machine Learning Day — November 10th, 2015

SLIDE 2

Outline

  • 1. Introduction to Plackett–Luce models
  • 2. Model inference: state of the art
  • 3. Unifying ML and spectral algorithms
  • 4. Experimental results


SLIDE 3

Plackett–Luce family of models

SLIDE 4

Modeling preferences

Universe of n items. Goal: describe, explain, and predict choices between alternatives, via a probabilistic approach.

SLIDE 5

Luce's choice axiom

Assumption (Luce, 1959). The odds of choosing item i over item j are independent of the rest of the alternatives:

p(i | A) / p(j | A) = p(i | B) / p(j | B),

where A and B are any sets of alternatives that contain both i and j. This property is a.k.a. "independence of irrelevant alternatives".

SLIDE 6

Consequence of axiom

To each item i = 1, ..., n we can assign a number πi ∈ R>0 such that

p(i | {1, . . . , k}) = πi / (π1 + · · · + πk)

πi = strength (or utility, or score) of item i
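As a sanity check, the choice rule and the axiom behind it can be exercised in a few lines of plain Python (the `choice_prob` helper is hypothetical, introduced here for illustration):

```python
def choice_prob(i, alternatives, pi):
    """P(choose i from `alternatives`) under Luce's axiom (hypothetical helper)."""
    return pi[i] / sum(pi[j] for j in alternatives)

# Strengths for three items.
pi = {1: 2.0, 2: 1.0, 3: 1.0}

# The odds of 1 over 2 are the same in {1, 2} and in {1, 2, 3}:
odds_pair = choice_prob(1, {1, 2}, pi) / choice_prob(2, {1, 2}, pi)
odds_triple = choice_prob(1, {1, 2, 3}, pi) / choice_prob(2, {1, 2, 3}, pi)
# both equal pi[1] / pi[2] = 2.0
```

Note that the odds ratio depends only on the two strengths, exactly as the axiom requires.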

SLIDE 7

Bradley–Terry model

[Zermelo, 1928; Bradley & Terry, 1952; Ford, 1957] Variant of the model for pairwise comparisons

p(i ≻ j) = πi / (πi + πj)

SLIDE 8

Plackett–Luce model

[Luce, 1959; Plackett 1975] Variant of the model for (partial or full) rankings

p(i ≻ j ≻ k) = p(i | {i, j, k}) · p(j | {j, k}) = [πi / (πi + πj + πk)] · [πj / (πj + πk)]
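The ranking probability is just a product of sequential Luce choices over the remaining alternatives; a minimal sketch (hypothetical `ranking_prob` helper):

```python
import itertools

def ranking_prob(ranking, pi):
    """P(ranking) under Plackett-Luce: repeatedly choose the best
    remaining item via Luce's choice rule."""
    prob = 1.0
    remaining = list(ranking)
    for item in ranking[:-1]:          # the last choice is forced
        prob *= pi[item] / sum(pi[j] for j in remaining)
        remaining.remove(item)
    return prob

pi = [3.0, 2.0, 1.0]
p = ranking_prob((0, 1, 2), pi)   # (3/6) * (2/3) = 1/3
# Probabilities over all 3! rankings sum to one.
total = sum(ranking_prob(r, pi) for r in itertools.permutations(range(3)))
```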

SLIDE 9

Rao–Kupper model

[Rao & Kupper, 1967] Variant of the model for pairwise comparisons with ties

p(i ≻ j) = πi / (πi + απj)        p(i ≡ j) = (α² − 1) πi πj / [(πi + απj)(πj + απi)]
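The two win probabilities and the tie probability sum to one for any α ≥ 1, which a quick sketch can confirm (hypothetical helper name):

```python
def rao_kupper(pi_i, pi_j, alpha):
    """Win/win/tie probabilities in the Rao-Kupper model (alpha >= 1;
    alpha = 1 recovers Bradley-Terry with no ties)."""
    p_i = pi_i / (pi_i + alpha * pi_j)
    p_j = pi_j / (pi_j + alpha * pi_i)
    p_tie = ((alpha ** 2 - 1) * pi_i * pi_j
             / ((pi_i + alpha * pi_j) * (pi_j + alpha * pi_i)))
    return p_i, p_j, p_tie

p_i, p_j, p_tie = rao_kupper(2.0, 1.0, 1.5)
```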

SLIDE 10

RUM perspective

New parameterization: θi = log(πi)

Xi ∼ Gumbel(θi, 1)        Xi − Xj ∼ Logistic(θi − θj, 1)

p(i ≻ j) = P(Xi − Xj > 0) = 1 / (1 + e^{−(θi − θj)})

[Plot: two Gumbel densities centered at θi and θj along the θ axis.]
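The random-utility view can be checked by simulation: sampling Gumbel utilities and counting wins recovers the logistic formula (a sketch; function names, sample size, and seed are arbitrary choices):

```python
import math
import random

def bt_prob(theta_i, theta_j):
    """Closed-form P(i beats j): logistic in the score difference."""
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))

def simulate_rum(theta_i, theta_j, n_samples, seed=0):
    """Monte-Carlo estimate of P(X_i > X_j) with X ~ Gumbel(theta, 1).
    Gumbel samples via the inverse CDF: theta - log(-log(U))."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_samples):
        x_i = theta_i - math.log(-math.log(rng.random()))
        x_j = theta_j - math.log(-math.log(rng.random()))
        wins += x_i > x_j
    return wins / n_samples

p_exact = bt_prob(1.0, 0.0)
p_est = simulate_rum(1.0, 0.0, 100_000, seed=1)  # close to p_exact
```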

SLIDE 11

Identifying parameters

p(i | {1, . . . , k}) = πi / (π1 + · · · + πk)

The parameters are defined only up to a multiplicative constant. We use the following convention:

Σi πi = 1, or equivalently Σi θi = 0.

SLIDE 12

Beyond preferences

  • GIFGIF experiment (comparative judgment)
  • NASCAR rankings
  • Chess games

SLIDE 13

Model inference

SLIDE 14

Maximum-likelihood

For conciseness, we consider pairwise comparisons. Data come in the form of counts: aji = number of times i beat j.

L(π) = ∏i ∏_{j≠i} ( πi / (πi + πj) )^{aji}

log L(π) = Σi Σ_{j≠i} aji (log πi − log(πi + πj))

This can lead to problems if some count aji is zero: the ML estimate may fail to exist.

  • Assumption. In every partition of the n items into two subsets A and B, some i ∈ A beats some j ∈ B.
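The log-likelihood above is straightforward to evaluate from the count matrix; a minimal sketch, using the slide's convention that a[j][i] counts how many times i beat j:

```python
import math

def log_likelihood(a, pi):
    """log L(pi) = sum_i sum_{j != i} a[j][i] * (log pi_i - log(pi_i + pi_j))."""
    n = len(pi)
    ll = 0.0
    for i in range(n):
        for j in range(n):
            if j != i and a[j][i] > 0:
                ll += a[j][i] * (math.log(pi[i]) - math.log(pi[i] + pi[j]))
    return ll

# Two items: item 0 beat item 1 three times and lost once.
a = [[0, 1],
     [3, 0]]
ll = log_likelihood(a, [0.75, 0.25])
```

With the normalization Σi πi = 1, the scores [0.75, 0.25] give a higher likelihood than the uniform scores, as expected for a 3-to-1 win record.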

SLIDE 15

Rank Centrality

[Negahban et al. 2012] — a completely different take on parameter inference:

  • 1. Items are states of a Markov chain
  • 2. Going from i to j is more likely if j often won against i
  • 3. The stationary distribution defines the scores

Pij = ε aij if i ≠ j,        Pii = 1 − ε Σ_{k≠i} aik

[Figure: example Markov chain over 9 items.]
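A compact sketch of Rank Centrality as described above, with the stationary distribution found by plain power iteration (ε and the iteration count are arbitrary choices, not prescribed by the slide):

```python
def rank_centrality(a, epsilon=0.1, n_iter=5000):
    """a[i][j] = number of times j beat i, so transitions flow toward winners.
    epsilon must be small enough to keep the diagonal non-negative."""
    n = len(a)
    P = [[epsilon * a[i][j] if i != j
          else 1.0 - epsilon * sum(a[i][k] for k in range(n) if k != i)
          for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for _ in range(n_iter):                      # power iteration: pi <- pi P
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Item 0 beats 1 and 2 twice each; 1 beats 2 twice; each upset happens once.
a = [[0, 1, 1],
     [2, 0, 1],
     [2, 2, 0]]
scores = rank_centrality(a)   # item 0 ranks first
```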

SLIDE 16

GMM estimators

[Azari Soufiani et al. 2013, 2014] Generalization of Rank Centrality to rankings:

  • 1. Break each ranking of m items into its (m choose 2) implied pairwise comparisons
  • 2. Construct a Markov chain and find its stationary distribution

The resulting estimator is asymptotically consistent.

Example: the ranking a ≻ b ≻ c ≻ d yields the comparisons a ≻ b, a ≻ c, a ≻ d, b ≻ c, b ≻ d, c ≻ d.
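Rank breaking (step 1 above) fits in a couple of lines: for a ranking listed best-first, every ordered pair of positions yields one pairwise comparison.

```python
from itertools import combinations

def break_ranking(ranking):
    """Break a ranking (best item first) into its (m choose 2) implied
    pairwise comparisons, each as a (winner, loser) tuple."""
    return list(combinations(ranking, 2))

pairs = break_ranking(["a", "b", "c", "d"])
# [('a','b'), ('a','c'), ('a','d'), ('b','c'), ('b','d'), ('c','d')]
```

This works because `itertools.combinations` preserves the input order within each emitted pair.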

SLIDE 17

Unifying ML inference and spectral algorithms

SLIDE 18

MLE as stationary distribution

log L(π) = Σi Σ_{j≠i} aji (log πi − log(πi + πj))

∂ log L(π) / ∂πi = Σ_{j≠i} [ aji/πi − (aji + aij)/(πi + πj) ]
                 = (1/πi) Σ_{j≠i} [ aji πj/(πi + πj) − aij πi/(πi + πj) ]

Setting the gradient to zero gives, for every i,

Σ_{j≠i} [aji/(πi + πj)] πj  =  Σ_{j≠i} [aij/(πi + πj)] πi
       (incoming flow)              (outgoing flow)

with transition rates aij/(πi + πj). These are the global balance equations of a Markov chain whose states are the items.

SLIDE 19

Corresponding MC

Pij = ε aij/(πi + πj) if i ≠ j,        Pii = 1 − ε Σ_{k≠i} aik/(πi + πk)

  • The stationary distribution is the ML estimate iff π = π̂
  • If πi = 1/n for all i, we recover Rank Centrality

We can iteratively adjust π:

  • The (k+1)-th iterate is the stationary distribution of P(k)
  • The unique fixed point of the iteration is the ML estimate

SLIDE 20

Generalization

Pij = ε Σ_{A ∈ Dij} 1/(Σ_{k∈A} πk) if i ≠ j,        Pii = 1 − Σ_{k≠i} Pik

(Dij = the observations in which j is chosen from a set of alternatives A containing i.)

The same Markov chain formulation applies to other models in the same family: a spectral formulation for choices among many alternatives, ranking data, comparisons with ties, etc.

SLIDE 21

Algorithms

Algorithm 1: Luce Spectral Ranking (LSR)
Require: observations D
 1: Λ ← 0_{n×n}
 2: for (i, A) ∈ D do
 3:     for j ∈ A \ {i} do
 4:         Λji ← Λji + n/|A|
 5:     end for
 6: end for
 7: π̄ ← stationary distribution of the Markov chain Λ
 8: return π̄

Algorithm 2: Iterative Luce Spectral Ranking (I-LSR)
Require: observations D
 1: π ← [1/n, . . . , 1/n]ᵀ
 2: repeat
 3:     Λ ← 0_{n×n}
 4:     for (i, A) ∈ D do
 5:         for j ∈ A \ {i} do
 6:             Λji ← Λji + 1 / Σ_{t∈A} πt
 7:         end for
 8:     end for
 9:     π ← stationary distribution of the Markov chain Λ
10: until convergence

  • What is the statistical efficiency of the spectral estimate?
  • What is the computational efficiency of the ML algorithm?
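The two algorithms above can be sketched compactly in Python. This is a toy implementation, not the authors' released code; the stationary distribution is found here by discretizing the rate matrix and power-iterating, and the iteration counts are arbitrary:

```python
def spectral_step(n, data, pi):
    """One LSR step: accumulate rates lam[j][i] += 1 / sum_{t in A} pi_t
    for each observation (winner i, alternatives A), then return the
    stationary distribution of the induced Markov chain."""
    lam = [[0.0] * n for _ in range(n)]
    for winner, alts in data:
        denom = sum(pi[t] for t in alts)
        for j in alts:
            if j != winner:
                lam[j][winner] += 1.0 / denom
    # Discretize the rate matrix into a stochastic matrix and power-iterate.
    eps = 1.0 / (1.0 + max(sum(row) for row in lam))
    P = [[eps * lam[i][j] if i != j
          else 1.0 - eps * sum(lam[i][k] for k in range(n) if k != i)
          for j in range(n)]
         for i in range(n)]
    dist = [1.0 / n] * n
    for _ in range(2000):
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return dist

def lsr(n, data):
    """Algorithm 1: with uniform pi, 1 / sum_{t in A} pi_t = n / |A|."""
    return spectral_step(n, data, [1.0 / n] * n)

def ilsr(n, data, n_iter=30):
    """Algorithm 2: re-run the spectral step with the current estimate;
    the fixed point of the iteration is the ML estimate."""
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = spectral_step(n, data, pi)
    return pi

# Toy data: item 0 beats item 1 three times and loses once.
data = [(0, (0, 1))] * 3 + [(1, (0, 1))]
pi_hat = ilsr(2, data)   # ML estimate: [0.75, 0.25]
```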
SLIDE 22

Experimental results

SLIDE 23

Statistical efficiency

Which inference method works best?

[Plot: RMSE vs. k = 2¹, . . . , 2¹⁰ for the lower bound, ML-F, GMM-F, ML, and LSR. Example data: partial rankings of items a–h with k = 8, 4, 2.]

Take-away: a careful derivation of the Markov chain leads to a better estimator.

SLIDE 24

Computational efficiency

Table 2: Performance of iterative ML inference algorithms.

                       I-LSR           MM                Newton
Dataset     γD        I    T [s]     I       T [s]     I    T [s]
NASCAR     0.832      3     0.08     4        0.10     —        —
Sushi      0.890      2     0.42     4        1.09     3    10.45
YouTube    0.002     12   414.44  8 680  22 443.88     —        —
GIFGIF     0.408     10    22.31   119      109.62     5    72.38
Chess      0.007     15    43.69   181       55.61     3    49.37

(I = number of iterations, T = running time in seconds.)

  • I-LSR is competitive with, or faster than, the state of the art
  • MM seems to converge very slowly in certain cases
SLIDE 25

I-LSR and MM mixing

[Plot: RMSE vs. iteration (1–10) for MM and I-LSR, with k = 2 (well-mixing) and k = 10 (poorly mixing) chains.]

Take-away: I-LSR seems to be robust to slow-mixing chains.

SLIDE 26

Conclusions

  • Variety of models derived from Luce's choice axiom
  • Can interpret the maximum-likelihood estimate as the stationary distribution of a Markov chain
  • Gives rise to a fast and efficient spectral inference algorithm
  • Gives rise to a new iterative algorithm for maximum-likelihood inference

Paper & code available at: lucas.maystre.ch/nips15