SLIDE 1

One-Pass Ranking Models for 
 Low-Latency Product Recommendations

Martin Saveski

@msaveski MIT (Amazon Berlin)

SLIDE 2

One-Pass Ranking Models for 
 Low-Latency Product Recommendations

Antonino Freno, Rodolphe Jenatton, Cédric Archambeau

Amazon Machine Learning Team, Berlin

SLIDE 3

Product Recommendations

SLIDES 4–10 (progressive build)

Product Recommendations

Constraints

  • 1. Large # of examples, large # of features
  • 2. Drifting distribution
  • 3. Real-time ranking (< a few ms)

  ⇒ Small memory footprint, fast training time, low prediction latency

SLIDES 11–13 (progressive build)

Our approach

Product Recommendations

  • Stochastic optimization, one-pass learning
  • Sparse models

  ⇒ Small memory footprint, fast training time, low prediction latency

SLIDE 14

Learning Ranking Functions

SLIDES 15–16 (progressive build)

Learning Ranking Functions

Three broad families of models

  • 1. Pointwise (Logistic regression)
  • 2. Pairwise (RankSVM)
  • 3. Listwise (ListNet)

Loss functions

  • Evaluation functions (NDCG)
  • Surrogate functions
SLIDES 17–22 (progressive build)

Loss Function

LambdaRank (Burges et al., 2006)

Setup: Products 1–4 with feature vectors X = (x1, x2, x3, x4) and ground-truth ranks r = (1, 1, 2, 3), where a lower rank means a better product.

For each pair of products (i, j) with ri ≤ rj:

  • Importance of sorting i and j correctly: ∆Mij = M(r) − M(ri/j), the change in the ranking metric M when products i and j swap positions
  • Difference in scores: ∆Sij = max{0, wᵀxj − wᵀxi}

Loss: L(X; w) = Σ over pairs with ri ≤ rj of ∆Mij · ∆Sij
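To make the loss concrete, here is a minimal numpy sketch (my illustration, not the authors' code) that evaluates L(X; w) for one ranked list by brute-force pair enumeration. It plugs in DCG with linear gains for the metric M and takes graded relevance labels (higher = better) instead of the slide's rank numbers; both choices are assumptions.

import numpy as np

def dcg(rel):
    # DCG of relevance labels given in display order (linear gains).
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    return float(np.dot(rel, discounts))

def pairwise_rank_loss(X, rel, w, metric=dcg):
    # L(X; w) = sum over pairs (i better than j) of dM_ij * dS_ij,
    # with items assumed listed in ground-truth order (best first).
    scores = X @ w
    base = metric(rel)
    loss = 0.0
    for i in range(len(rel)):
        for j in range(i + 1, len(rel)):
            if rel[i] <= rel[j]:
                continue  # skip ties; only pairs where i is strictly better
            swapped = rel.copy()
            swapped[i], swapped[j] = swapped[j], swapped[i]
            dm = abs(base - metric(swapped))      # importance of the pair
            ds = max(0.0, scores[j] - scores[i])  # hinge on the score gap
            loss += dm * ds
    return loss

# e.g. pairwise_rank_loss(np.eye(3), np.array([2.0, 1.0, 0.0]),
#                         np.array([0.0, 1.0, 0.0])) > 0,
# because the second-best item outscores the best one.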

SLIDES 23–26 (progressive build)

ElasticRank

Introducing Sparsity

Adding ℓ1 and ℓ2 penalties:

  L∗(X, w) = L(X, w) + λ1·||w||₁ + (λ2/2)·||w||₂²

Both λ1 and λ2 control model complexity

  • λ1 trades off sparsity and performance
  • λ2 adds strong convexity & improves convergence
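Continuing the sketch above (same caveats), the penalized objective simply adds the two terms to the pairwise loss; λ1 and λ2 are tuning knobs, not values reported in the talk:

def elastic_penalty(w, lam1, lam2):
    # lam1 * ||w||_1  +  (lam2 / 2) * ||w||_2^2
    return lam1 * float(np.abs(w).sum()) + 0.5 * lam2 * float(np.dot(w, w))

def elastic_rank_loss(X, rel, w, lam1, lam2):
    # Regularized objective L*(X, w) from the slide.
    return pairwise_rank_loss(X, rel, w) + elastic_penalty(w, lam1, lam2)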

SLIDES 27–30 (progressive build)

Optimization Algorithms

Extensions of Stochastic Gradient Descent

FOBOS: Forward-Backward Splitting (Duchi & Singer, 2009)

  • 1. Gradient step
  • 2. Proximal step involving the regularization

RDA: Regularized Dual Averaging (Xiao, 2010)

  • Keeps a running average of all past gradients
  • Solves a proximal step using the average

pSGD: Pruned Stochastic Gradient Descent

  • Prunes every k gradient steps: if |wi| < θ, set wi = 0
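The three update rules can be sketched as follows for the elastic-net penalty (my simplification; the cited papers use more general step-size schedules and proximal terms):

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1: shrink each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fobos_step(w, g, eta, lam1, lam2):
    # Forward (gradient) step, then backward (proximal) step on the penalty.
    v = w - eta * g
    return soft_threshold(v, eta * lam1) / (1.0 + eta * lam2)

def rda_step(g_avg, t, gamma, lam1, lam2):
    # Closed-form minimizer of  <g_avg, w> + lam1*||w||_1
    #   + (lam2/2 + gamma/(2*sqrt(t))) * ||w||_2^2,   for t = 1, 2, ...
    return -soft_threshold(g_avg, lam1) / (lam2 + gamma / np.sqrt(t))

def psgd_step(w, g, eta, t, k, theta):
    # Plain SGD step; every k steps, prune small weights to exactly zero.
    w = w - eta * g
    if t % k == 0:
        w = np.where(np.abs(w) < theta, 0.0, w)
    return w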

SLIDE 31

Hyper-parameter Optimization

  • Turn-key inference
  • Automatic adjustment of hyper-parameters
  • Bayesian approach (Snoek, Larochelle, Adams; 2012)
  • Gaussian Processes
  • Thompson Sampling
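As a rough illustration of the GP + Thompson sampling loop (my sketch using scikit-learn, not the talk's implementation; validation_recall is a hypothetical function that trains a model with the given λ1 and returns its validation Recall@1):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

candidates = np.linspace(-6, 0, 50).reshape(-1, 1)  # grid over log10(lambda_1)
observed_x, observed_y = [], []

for step in range(20):
    if observed_x:
        gp = GaussianProcessRegressor().fit(np.array(observed_x),
                                            np.array(observed_y))
        # Thompson sampling: draw one function from the GP posterior
        # and act greedily with respect to that single draw.
        draw = gp.sample_y(candidates, n_samples=1, random_state=step).ravel()
        idx = int(np.argmax(draw))
    else:
        idx = np.random.randint(len(candidates))  # no data yet: pick at random
    log_lam = candidates[idx, 0]
    observed_x.append([log_lam])
    observed_y.append(validation_recall(10.0 ** log_lam))  # hypothetical
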
SLIDE 32

LETOR Experiments

ElasticRank is comparable with state-of-the-art models

[Bar chart: NDCG@5 (0.1–0.6) on OHSUMED, TD2003, and TD2004 for Logistic Regression, RankSVM, ListNet, and ElasticRank]

SLIDE 33

Amazon.com Experiments

Experimental Setup

  • # examples: N ≈ millions
  • # features: thousands (millions of dimensions)
  • Purchase logs from a contiguous time interval, split chronologically: training 1/11, validation 1/11, testing 9/11

SLIDE 34

Experimental Results

ElasticRank performs best

[Bar chart: Recall@1 for Logistic Regression, RankSVM, ElasticRank (pSGD), ElasticRank (FOBOS), and ElasticRank (RDA)]

SLIDE 35

Sparsity vs. Performance

RDA achieves the best trade-off

[Plot: Recall@1 (0.26–0.305) vs. number of weights (1–1024, log scale) for pSGD, FOBOS, and RDA]

SLIDE 36

Prediction Time

  Number of weights:      4       29      1804
  Prediction time:     6.2 μs   8.7 μs   10.9 μs
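Scoring a candidate with a sparse linear model touches only the nonzero weights, which is why latency tracks the weight count rather than the full feature dimensionality; a minimal sketch (my illustration):

def score(active_weights, features):
    # active_weights: {feature index: nonzero weight} of the sparse model.
    # features: {feature index: value} for one candidate product.
    # Cost is proportional to the number of nonzero weights, not to the
    # multi-million-dimensional feature space.
    return sum(w * features.get(i, 0.0) for i, w in active_weights.items())

# e.g. score({7: 0.3, 42: -1.2}, {7: 1.0, 9: 5.0}) == 0.3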

SLIDE 37

Contributions

How to learn ranking functions that are:

  • Trained in a single pass
  • Small in memory footprint
  • Sparse

WITHOUT sacrificing performance

SLIDE 38

References

  • C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems (NIPS), 2006.
  • J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 2009.
  • L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 2010.
  • J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2012.

SLIDE 39

One-Pass Ranking Models for 
 Low-Latency Product Recommendations

Martin Saveski

@msaveski MIT (Amazon Berlin)