SLIDE 1

Online Ranking Combination

Erzsébet Frigó

  • Institute for Computer Science and Control (MTA SZTAKI)

Joint work with Levente Kocsis

SLIDE 2

Overview

◮ Framework: prequential ranking evaluation
◮ Goal: optimize convex combination of ranking models
◮ Our proposal: direct optimization of the ranking function

SLIDE 3

Model combination in prequential framework with ranking evaluation

[Diagram: a stream of user–item events over time; base rankers A1, A2, A3 each score the items i1, …, im for user u, the scores are turned into ranking lists, and the lists are combined.]

SLIDE 4

Model combination in prequential framework with ranking evaluation

[Diagram: same as slide 3.]

Objective: choosing combination weights.
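
Before the algorithms, a minimal Python sketch of the prequential loop just illustrated may help. The `BaseRanker` interface (`score_items`, `update`) and the single-positive nDCG@k reward are illustrative assumptions, not from the talk.

```python
# Minimal sketch of prequential evaluation with a convex combination of
# base rankers; the BaseRanker interface is assumed for illustration.
import numpy as np

def prequential_eval(events, rankers, weights, k=100):
    """For each (user, item) event: rank all items by the weighted sum of the
    base rankers' scores, score the ranking, then reveal the event to every
    ranker so it can update online."""
    rewards = []
    for user, item in events:
        # Each base ranker scores every candidate item for this user.
        scores = np.stack([r.score_items(user) for r in rankers])  # (n_rankers, n_items)
        combined = weights @ scores                                # convex combination
        rank = int((combined > combined[item]).sum()) + 1          # rank of the true item
        rewards.append(1.0 / np.log2(rank + 1) if rank <= k else 0.0)  # nDCG@k, one positive
        for r in rankers:                                          # online update
            r.update(user, item)
    return np.mean(rewards)
```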

SLIDE 5

New idea: optimize ranking function directly

◮ Standard method: take a surrogate function and use its gradient

◮ E.g. MSE

◮ Drawback: the optimum of the surrogate ≠ the optimum of the ranking function
◮ Proposed solution: optimize the ranking function directly
◮ Two approaches:

◮ Global search in the weight space
◮ Gradient approximation (finite differences)

SLIDE 6

ExpW

◮ Choose a subset Q of the weight space Θ

◮ e.g., lay a grid over the parameter space

◮ Apply the exponentially weighted forecaster on Q:

$$P(\text{select } q \in Q \text{ in round } t) = \frac{e^{-\eta_t \sum_{\tau=1}^{t-1} \left(1 - r_\tau(q)\right)}}{\sum_{s \in Q} e^{-\eta_t \sum_{\tau=1}^{t-1} \left(1 - r_\tau(s)\right)}}$$

◮ Theoretical guarantee: $\mathbb{E}[R_T(\text{best static combination in } \Theta) - R_T(\text{ExpW})] < O(\sqrt{T})$

◮ if the cumulative reward function ($R_T$) is sufficiently smooth
◮ and Q is sufficiently large

◮ Difficulty: the size of Q is exponential in the number of base rankers, so the approach cannot scale (a minimal sketch follows)
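
To make the construction concrete, here is a minimal sketch of ExpW in Python. The simplex grid, the fixed learning rate `eta` (the slide's $\eta_t$ may vary per round), and the class interface are illustrative assumptions, not from the talk.

```python
# Minimal ExpW sketch: exponentially weighted forecaster over a grid Q of
# combination weight vectors. Grid construction and fixed eta are assumptions.
import itertools
import numpy as np

def make_grid(n_rankers, resolution=10):
    """All weight vectors on the simplex with coordinates i/resolution.
    The grid size grows exponentially in n_rankers -- the scaling problem."""
    pts = [p for p in itertools.product(range(resolution + 1), repeat=n_rankers)
           if sum(p) == resolution]
    return np.array(pts, dtype=float) / resolution

class ExpW:
    def __init__(self, grid, eta=0.1):
        self.grid = grid
        self.eta = eta
        self.cum_loss = np.zeros(len(grid))      # sum over rounds of (1 - r_tau(q))

    def select(self, rng):
        """Sample a grid point with P(q) proportional to exp(-eta * cum_loss)."""
        logits = -self.eta * self.cum_loss
        p = np.exp(logits - logits.max())        # numerically stable softmax
        p /= p.sum()
        return self.grid[rng.choice(len(self.grid), p=p)]

    def update(self, rewards):
        """rewards[j] = r_t(q_j), the round-t reward of every grid point."""
        self.cum_loss += 1.0 - rewards
```

Note how `make_grid` already exposes the difficulty named above: the number of grid points blows up combinatorially as base rankers are added.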

SLIDE 7

Simultaneous Perturbation Stochastic Approximation (SPSA)

◮ Approximated gradient (for the weight of base ranker i in round t):

$$g_{ti} = \frac{r_t(\theta_t + c_t \Delta_t) - r_t(\theta_t - c_t \Delta_t)}{c_t \Delta_{ti}}$$

◮ $\theta_t$ is the current combination weight vector
◮ $\Delta_t = (\Delta_{t1}, \ldots)$ is a random vector of ±1 entries
◮ $c_t$ is the perturbation step size

◮ Online update step: one gradient step using the approximated gradient (a minimal sketch follows)
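
A minimal sketch of one SPSA round, directly implementing the formula above; `reward_fn` (one prequential reward evaluation at a given weight vector) and the gradient-step learning rate `a_t` are assumptions.

```python
# Minimal SPSA sketch: perturb all weights simultaneously with random signs,
# take two reward evaluations, and step along the approximated gradient.
import numpy as np

def spsa_step(theta, reward_fn, c_t, a_t, rng):
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Delta_t, random +/-1 vector
    r_plus = reward_fn(theta + c_t * delta)             # r_t(theta_t + c_t Delta_t)
    r_minus = reward_fn(theta - c_t * delta)            # r_t(theta_t - c_t Delta_t)
    g = (r_plus - r_minus) / (c_t * delta)              # g_ti, one estimate per weight
    return theta + a_t * g                              # ascend: reward is maximized
```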

SLIDE 11

RSPSA

◮ RSPSA = SPSA + Resilient Backpropagation (RProp)

SLIDE 12

RSPSA

◮ RSPSA = SPSA + Resilient Backpropagation (RProp)
◮ RProp defines gradient step sizes for each weight
◮ The perturbation step size is tied to the gradient step size
◮ Step sizes are updated using RProp

SLIDE 13

Resilient Backpropagation (RProp)

◮ Gradient update rule
◮ Predefined step size for each coordinate

◮ ignores the length of the gradient vector

◮ Step size is updated based on the sign of the gradient

◮ decrease the step if the gradient changed direction
◮ increase it otherwise
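
A minimal sketch of the RProp rule just described, applied to reward ascent; the multipliers 1.2/0.5 and the step bounds are common RProp defaults, not values from the talk.

```python
# Minimal RProp sketch: per-coordinate step sizes adapted from the sign of
# the gradient; the gradient's magnitude is deliberately ignored.
import numpy as np

def rprop_step(theta, grad, step, prev_grad,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=1.0):
    sign_change = grad * prev_grad
    # Same sign as last round: grow the step; sign flipped: shrink it.
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    theta = theta + np.sign(grad) * step    # move by the step size, not the magnitude
    return theta, step, grad                # grad is fed back as prev_grad next round
```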

SLIDE 14

$$g_{ti} = \frac{r_t(\theta_t + c_t \Delta_t) - r_t(\theta_t - c_t \Delta_t)}{c_t \Delta_{ti}}$$

SLIDE 15

RFDSA+

◮ Switch to finite differences (FD)

◮ makes it possible to detect a zero gradient with respect to a single coordinate

◮ If the gradient is 0 w.r.t. a coordinate, then

◮ increase the perturbation size (+) for that coordinate
◮ escape the flat section in the right direction

◮ RFDSA+ = RSPSA − simultaneous perturbation + finite differences + zero-gradient detection
◮ The modifications may seem minor, but they are essential to make the algorithm work (a minimal sketch follows)
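
A minimal sketch of the RFDSA+ loop as this slide describes it: per-coordinate finite differences, plus widening the perturbation on coordinates where the estimated gradient is exactly zero. The growth factor and the RProp-style signed update are assumed details the slides leave out.

```python
# Minimal RFDSA+ sketch: finite-difference gradient per coordinate, with the
# (+) perturbation-widening escape on flat coordinates.
import numpy as np

def rfdsa_plus_step(theta, reward_fn, step, c, grow=2.0):
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = c[i]                                  # perturb only coordinate i
        grad[i] = (reward_fn(theta + e) - reward_fn(theta - e)) / (2.0 * c[i])
        if grad[i] == 0.0:                           # flat section w.r.t. coordinate i:
            c[i] *= grow                             # widen the perturbation to escape
    return theta + np.sign(grad) * step, c           # RProp-style signed step
```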

SLIDE 16

Experiments - Datasets, base rankers

◮ 5 datasets

◮ Amazon

◮ CDs and Vinyl
◮ Movies and TV
◮ Electronics

◮ MovieLens 10M
◮ Twitter

◮ hashtag prediction

◮ Size

◮ # of events: 2M–10M
◮ # of users: 70k–4M
◮ # of items: 10k–100k

◮ Base rankers:

◮ Models updated incrementally:
  ◮ SGD Matrix Factorization
  ◮ Asymmetric Matrix Factorization
  ◮ Item-to-item similarity
  ◮ Most popular
◮ Traditional models updated periodically:
  ◮ SGD Matrix Factorization
  ◮ Implicit Alternating Least Squares MF

SLIDE 17

Combination algorithms in the experiments

Direct optimization:

◮ ExpW

◮ exponentially weighted forecaster on a grid
◮ global optimization

◮ SPSA

◮ gradient method with simultaneous perturbation

◮ RSPSA

◮ SPSA with RProp

◮ RFDSA+

◮ our new algorithm
◮ finite differences, flat-section detection

Baselines:

◮ ExpA

◮ exponentially weighted forecaster on the base rankers

◮ ExpAW

◮ use probabilities of ExpA as weights

◮ SGD

◮ use MSE as a surrogate
◮ target = 1 for the positive sample
◮ target = 0 for generated negative samples (a minimal sketch follows)
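
For comparison, a minimal sketch of this baseline's update, assuming the base rankers' scores for the positive item and for the sampled negatives are given; the negative sampler and the learning rate are assumptions.

```python
# Minimal sketch of the SGD baseline: minimize a pointwise MSE surrogate on the
# combination weights (target 1 for the positive item, 0 for sampled negatives).
import numpy as np

def sgd_surrogate_step(theta, base_scores_pos, base_scores_negs, lr=0.01):
    """base_scores_pos: (n_rankers,) base-ranker scores of the positive item.
    base_scores_negs: (n_neg, n_rankers) scores of sampled negative items."""
    grad = np.zeros_like(theta)
    for x, target in [(base_scores_pos, 1.0)] + [(x, 0.0) for x in base_scores_negs]:
        err = theta @ x - target          # MSE residual of the combined score
        grad += 2.0 * err * x             # d/dtheta of (theta.x - target)^2
    return theta - lr * grad              # descend: the surrogate loss is minimized
```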

SLIDE 18

Results - 2 base rankers (i2i, OMF) - nDCG

[Plot: nDCG (0.01–0.07) over days (1000–7000) for item2item, OMF, ExpA, ExpAW, ExpW, SGD, SPSA, RSPSA, RFDSA+.]

SLIDE 19

Results - 2 base rankers - Combination weights

[Plot: combination weight θ (log scale, 10⁻⁶ to 1) over days (1000–7000) for OptG100+, ExpAW, SGD, SPSA, RSPSA, RFDSA+.]

SLIDE 20

Cumulative reward as a function of the combination weight

[Plot: cumulative reward $R_T(\theta)$, measured in nDCG (0.038–0.054), as a function of the combination weight θ (log scale, 10⁻⁴ to 1).]

SLIDE 21

Results - Scalability

[Plot: nDCG (0.03–0.06) vs. number of OMFs (1–10) for ExpA, ExpAW, SGD, SPSA, RSPSA, RFDSA+.]

SLIDE 22

Results - 6 base rankers - DCG

SLIDE 23

Conclusions

◮ Problem: combining ranking algorithms
◮ Our proposal: optimize the ranking measure directly
◮ Global optimization (ExpW) works well in the case of two base algorithms
◮ Our new algorithm: RFDSA+

◮ solves the remaining problems (scaling, constant sections w.r.t. one coordinate)
◮ yields a strong combination

SLIDE 24

The End

Online Ranking Combination

Erzsébet Frigó

  • Institute for Computer Science and Control (MTA SZTAKI)

Joint work with Levente Kocsis