Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs



slide-1
SLIDE 1

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Alessandro Epasto

J. Feldman*, S. Lattanzi*, S. Leonardi°, V. Mirrokni*

*Google Research °Sapienza U. Rome

slide-2
SLIDE 2

Motivation

  • Recommendation Systems:
  • Bipartite graphs with Users and Items.
  • Identify similar users and suggest relevant items.

  • Concrete example: The AdWords case.
  • Two key observations:
  • Items belong to different categories.
  • Graphs are often lopsided.
slide-3
SLIDE 3

Modeling the Data as a Bipartite Graph

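[Figure: a bipartite graph with millions of advertisers on one side and billions of queries on the other, annotated with hundreds of labels (Retailers, Apparel, Sport Equipment). Example nodes: Nike Store, New York, Soccer Shoes, Soccer Ball. Edge weights: 2$, 3$, 4$, 1$, 5$, 2$.]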

slide-4
SLIDE 4

Personalized PageRank

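For a node v (the seed) and a probability alpha, consider a random walk that at each step jumps back to v with probability alpha and otherwise moves to a random neighbor of the current node. The stationary distribution of this walk assigns a similarity score to each node u in the graph w.r.t. node v.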
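Personalized PageRank for a seed can be approximated by power iteration. The following is a minimal sketch, assuming an unweighted graph stored as adjacency lists in which every node has at least one neighbor; the function name, the toy graph, and alpha = 0.15 are illustrative choices, not taken from the talk.

```python
def personalized_pagerank(neighbors, seed, alpha=0.15, iterations=100):
    """Approximate PPR scores w.r.t. `seed` by power iteration.

    neighbors: dict mapping each node to a list of its neighbors
    (assumption: every node has at least one neighbor).
    """
    scores = {u: 0.0 for u in neighbors}
    scores[seed] = 1.0
    for _ in range(iterations):
        nxt = {u: 0.0 for u in neighbors}
        # Restart: the total mass is 1, so a fraction alpha returns to the seed.
        nxt[seed] += alpha
        for u, nbrs in neighbors.items():
            # The remaining mass at u is split evenly among its neighbors.
            share = (1.0 - alpha) * scores[u] / len(nbrs)
            for w in nbrs:
                nxt[w] += share
        scores = nxt
    return scores


# Toy bipartite graph: advertisers a1..a3, queries q1..q3 (hypothetical data).
graph = {
    "a1": ["q1", "q2"],
    "a2": ["q1", "q2", "q3"],
    "a3": ["q3"],
    "q1": ["a1", "a2"],
    "q2": ["a1", "a2"],
    "q3": ["a2", "a3"],
}
ppr = personalized_pagerank(graph, seed="a1")
```

In this toy graph, a2 shares two queries with the seed a1 while a3 shares none, so a2 receives the higher score.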

slide-5
SLIDE 5

The Problem

[Figure: the same advertiser-query bipartite graph with labels as on the slide "Modeling the Data as a Bipartite Graph".]

slide-6
SLIDE 6

Other Applications

  • General approach applicable to several contexts:
  • Users, Movies, Genres: find similar users and suggest movies.
  • Authors, Papers, Conferences: find related authors and suggest papers to read.

slide-7
SLIDE 7

Semi-Formal Problem Definition

[Figure: bipartite graph with Advertisers on one side and Queries on the other.]

slide-8
SLIDE 8

Semi-Formal Problem Definition

[Figure: the Advertisers-Queries bipartite graph with a node A marked.]

slide-9
SLIDE 9

Semi-Formal Problem Definition

[Figure: the Advertisers-Queries bipartite graph with a node A marked and a legend of labels.]

slide-10
SLIDE 10

Semi-Formal Problem Definition

[Figure: the Advertisers-Queries bipartite graph with a node A marked and a legend of labels.]

Goal: Find the nodes most “similar” to A.

slide-11
SLIDE 11

How to Define Similarity?

  • We address the computation of several node similarity measures:
  • Neighborhood based: Common neighbors, Jaccard Coefficient, Adamic-Adar.

  • Paths based: Katz.
  • Random Walk based: Personalized PageRank.
  • Experimental question: which measure is useful?
  • Algorithmic questions:
  • Can it scale to huge graphs?
  • Can we compute it in real-time?
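The neighborhood-based measures listed above have short standard definitions. The sketch below assumes each advertiser is stored with the set of queries it is connected to; the helper names and the toy data are hypothetical, and Adamic-Adar uses the usual 1 / log(degree) weighting of shared neighbors.

```python
import math
from collections import Counter


def common_neighbors(nbrs, u, v):
    """Number of neighbors shared by u and v."""
    return len(nbrs[u] & nbrs[v])


def jaccard(nbrs, u, v):
    """Shared neighbors divided by the size of the union of neighborhoods."""
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0


def adamic_adar(nbrs, degree, u, v):
    """Shared neighbors weighted by 1 / log(degree); rare neighbors count more."""
    return sum(1.0 / math.log(degree[z]) for z in nbrs[u] & nbrs[v] if degree[z] > 1)


# Hypothetical toy data: advertiser -> set of queries.
nbrs = {
    "a1": {"q1", "q2"},
    "a2": {"q1", "q2", "q3"},
    "a3": {"q3"},
}
# Degree of each query = number of advertisers connected to it.
degree = Counter(q for queries in nbrs.values() for q in queries)
```

For example, `common_neighbors(nbrs, "a1", "a2")` counts the two shared queries q1 and q2, and `jaccard(nbrs, "a1", "a2")` divides that by the three queries in the union.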
slide-12
SLIDE 12

Our Contribution

  • Reduce and Aggregate: a general approach to induce real-time similarity rankings in multi-categorical bipartite graphs, which we apply to several similarity measures.
  • Theoretical guarantees for the precision of the algorithms.

  • Experimental evaluation with real world data.
slide-13
SLIDE 13

Personalized PageRank

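For a node v (the seed) and a probability alpha, consider a random walk that at each step jumps back to v with probability alpha and otherwise moves to a random neighbor of the current node. The stationary distribution of this walk assigns a similarity score to each node u in the graph w.r.t. node v.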

slide-14
SLIDE 14

Challenges

  • Our graphs are too big (billions of nodes) even for very large-scale MapReduce systems.
  • MapReduce is not real-time.
  • We cannot pre-compute the rankings for each subset of labels.

slide-15
SLIDE 15

Reduce and Aggregate

Reduce: Given the bipartite graph and a category, construct a graph with only A-side nodes that preserves the ranking on the entire graph.

Aggregate: Given a node v in A and the reduced graphs of the categories of interest, determine the ranking for v.

[Figure: three numbered panels, 1) to 3), showing nodes a, b, and c.]
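For the simplest measure, common neighbors, the two phases can be sketched as follows. This is an illustration under stated assumptions, not the construction from the talk: categories are disjoint, edges are unweighted, and the function names, category names, and toy edges are hypothetical. Because the categories are disjoint, the number of common neighbors over a set of categories is the sum of the per-category counts, so the run-time step only reads the small reduced graphs.

```python
from collections import defaultdict
from itertools import combinations


def reduce_category(edges):
    """Precomputation for one category.

    edges: iterable of (advertiser, query) pairs whose queries belong to
    the category. Returns {(a, b): number of shared queries} with a < b,
    a graph on advertiser-side nodes only.
    """
    by_query = defaultdict(set)
    for advertiser, query in edges:
        by_query[query].add(advertiser)
    reduced = defaultdict(int)
    for advertisers in by_query.values():
        for a, b in combinations(sorted(advertisers), 2):
            reduced[(a, b)] += 1
    return dict(reduced)


def aggregate(reduced_by_category, categories, v):
    """Run time: rank advertisers by shared queries with v in the chosen categories."""
    scores = defaultdict(int)
    for category in categories:
        for (a, b), count in reduced_by_category[category].items():
            if a == v:
                scores[b] += count
            elif b == v:
                scores[a] += count
    # Highest score first; ties broken by name for a deterministic ranking.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy input: advertisers a, b, c and two disjoint categories.
reduced = {
    "apparel": reduce_category(
        [("a", "q1"), ("b", "q1"), ("a", "q2"), ("b", "q2"), ("c", "q2")]
    ),
    "equipment": reduce_category([("a", "q3"), ("c", "q3")]),
}
ranking = aggregate(reduced, ["apparel", "equipment"], "a")
```

Restricting the ranking to a different subset of categories only changes the list passed to `aggregate`; nothing has to be recomputed on the full bipartite graph.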

slide-16
SLIDE 16

Reduce (Precomputation)

[Figure: bipartite graph of Advertisers and Queries.]

slide-17
SLIDE 17

Reduce (Precomputation)

[Figure: bipartite graph of Advertisers and Queries, with one set of precomputed rankings.]

slide-18
SLIDE 18

Reduce (Precomputation)

[Figure: bipartite graph of Advertisers and Queries, with two sets of precomputed rankings.]

slide-19
SLIDE 19

Reduce (Precomputation)

[Figure: bipartite graph of Advertisers and Queries, with three sets of precomputed rankings.]

slide-20
SLIDE 20

Aggregate (Run Time)

[Figure: two sets of precomputed rankings are combined into the ranking of Red + Yellow for node A.]

slide-21
SLIDE 21

Reduce for Personalized PageRank

  • Markov Chain state aggregation theory (Simon and Ando, ’61; Meyer ’89, etc.).
  • 750x reduction in the number of nodes while correctly preserving the PPR distribution on the entire graph.

[Figure: a bipartite graph with Side A and Side B, containing nodes X and Y, next to a graph on Side A only.]
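One simple way to see why a chain on Side A alone can carry the same information is to compose two steps of the bipartite walk (A to B, then B back to A). The sketch below covers only the plain random walk without restart, where the two-step chain keeps the stationary distribution of the full walk restricted to Side A (proportional to weighted degree). It is an illustration with hypothetical names and weights; the reduction used in the talk, which also handles the restart probability, may be constructed differently.

```python
from collections import defaultdict


def two_step_chain(weights):
    """Collapse a weighted bipartite walk onto Side A.

    weights: dict a -> dict b -> edge weight, for a on Side A and b on Side B.
    Returns dict a -> dict a2 -> probability of going a -> (some b) -> a2.
    """
    by_b = defaultdict(dict)
    for a, row in weights.items():
        for b, w in row.items():
            by_b[b][a] = w
    chain = {}
    for a, row in weights.items():
        a_total = sum(row.values())
        out = defaultdict(float)
        for b, w in row.items():
            b_total = sum(by_b[b].values())
            for a2, w2 in by_b[b].items():
                # P(a -> b) * P(b -> a2)
                out[a2] += (w / a_total) * (w2 / b_total)
        chain[a] = dict(out)
    return chain


# Hypothetical weights between Side A nodes x, y, z and Side B nodes q1..q3.
weights = {
    "x": {"q1": 2.0, "q2": 3.0},
    "y": {"q1": 4.0, "q2": 1.0},
    "z": {"q2": 5.0, "q3": 2.0},
}
chain = two_step_chain(weights)

# Weighted degrees, normalized: the stationary distribution of the two-step chain.
degree = {a: sum(row.values()) for a, row in weights.items()}
total = sum(degree.values())
pi = {a: d / total for a, d in degree.items()}
```

The reduced chain has one state per Side A node, so its size does not depend on the (much larger) Side B.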

slide-22
SLIDE 22

Run-time Aggregation

slide-23
SLIDE 23

Koury et al. Aggregation-Disaggregation Algorithm

Step 1: Partition the Markov chain into DISJOINT subsets.

[Figure: the chain partitioned into subsets A and B.]

slide-24
SLIDE 24

Koury et al. Aggregation-Disaggregation Algorithm

Step 2: Approximate the stationary distribution on each subset independently.

[Figure: subsets A and B with their stationary distributions πA and πB.]

slide-25
SLIDE 25

Koury et al. Aggregation-Disaggregation Algorithm

Step 3: Consider the transition between subsets.

[Figure: subsets A and B with distributions πA and πB and transition blocks PAA, PAB, PBA, PBB.]

slide-26
SLIDE 26

Koury et al. Aggregation-Disaggregation Algorithm

Step 4: Aggregate the distributions. Repeat until convergence.

[Figure: subsets A and B with transition blocks PAA, PAB, PBA, PBB and the updated distributions for A and B.]
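The four steps can be sketched as a textbook-style iterative aggregation-disaggregation loop. This is a sketch under assumptions: the 4-state matrix, the two-block partition, the power-iteration helper, and the single smoothing step per round are illustrative choices, and the within-block distributions are taken from the current iterate rather than computed separately. The details of the algorithm referenced in the talk may differ.

```python
def step(pi, P):
    """One step of the chain: returns pi * P for a row-stochastic matrix P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def stationary(P, iterations=500):
    """Stationary distribution by plain power iteration (reference helper)."""
    pi = [1.0 / len(P)] * len(P)
    for _ in range(iterations):
        pi = step(pi, P)
    return pi


def aggregation_disaggregation(P, blocks, iterations=50):
    """blocks: disjoint lists of state indices covering all states (Step 1)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        # Step 2: distribution inside each block, conditioned on being in it.
        mass = [sum(pi[i] for i in block) for block in blocks]
        phi = [{i: pi[i] / m for i in block} for block, m in zip(blocks, mass)]
        # Step 3: transition probabilities between blocks (coupling matrix).
        C = [
            [sum(phi[I][i] * P[i][j] for i in blocks[I] for j in blocks[J])
             for J in range(len(blocks))]
            for I in range(len(blocks))
        ]
        xi = stationary(C)
        # Step 4: recombine block masses with within-block distributions,
        # then smooth with one step of the full chain.
        z = [0.0] * n
        for I, block in enumerate(blocks):
            for i in block:
                z[i] = xi[I] * phi[I][i]
        pi = step(z, P)
    return pi


# Hypothetical 4-state chain with two tightly coupled blocks {0, 1} and {2, 3}.
P = [
    [0.50, 0.40, 0.05, 0.05],
    [0.30, 0.60, 0.05, 0.05],
    [0.10, 0.10, 0.40, 0.40],
    [0.05, 0.15, 0.30, 0.50],
]
approx = aggregation_disaggregation(P, [[0, 1], [2, 3]])
exact = stationary(P)
```

On this small example the result agrees with plain power iteration on the full chain; the point of the scheme is that most of the work happens on the small blocks and on the even smaller coupling matrix.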

slide-27
SLIDE 27

Aggregation in PPR

Precompute the stationary distributions individually.

[Figure: nodes X and Y with subset A and its stationary distribution πA.]

slide-28
SLIDE 28

Aggregation in PPR

Precompute the stationary distributions individually.

[Figure: nodes X and Y with subset B and its stationary distribution πB.]

slide-29
SLIDE 29

Aggregation in PPR

The two subsets are not disjoint!

[Figure: overlapping subsets A and B.]

slide-30
SLIDE 30

Our Approach

  • The algorithm is based only on the reduced graphs with Advertiser-Side nodes.
  • The aggregation algorithm is scalable and converges to the correct distribution.

[Figure: two reduced graphs on nodes X and Y with distributions πA and πB.]

slide-31
SLIDE 31

Experimental Evaluation

  • We experimented with publicly available and proprietary datasets:
  • Query-Ads graph from Google AdWords: > 1.5 billion nodes, > 5 billion edges.
  • DBLP Author-Papers and Patent Inventor-Inventions graphs.
  • Ground-truth clusters of competitors in Google AdWords.

slide-32
SLIDE 32

Patent Graph

[Figure: Precision vs Recall on the Patent graph for Inter, Jaccard, Adamic-Adar, Katz, and PPR; both axes range from 0.1 to 1.]

slide-33
SLIDE 33

Google AdWords

[Figure: Precision vs Recall on the Google AdWords data.]

slide-34
SLIDE 34

Conclusions and Future Work

  • It is possible to compute several similarity scores on very large bipartite graphs in real-time with good accuracy.
  • Future work could focus on the case where categories are not disjoint.

slide-35
SLIDE 35

Thank you for your attention

slide-36
SLIDE 36

Reduction to the Query Side

[Figure: graph with nodes X and Y and distributions πA and πB.]

slide-37
SLIDE 37

Reduction to the Query Side

This is the larger side of the graph.

[Figure: graph with nodes X and Y and distributions πA and πB.]

slide-38
SLIDE 38

Convergence after One Iteration

[Figure: Kendall-Tau correlation (0.2 to 1) vs position k (10, 20, 30, 40, 50, All) for DBLP, Patent, and Query-Ads (cost).]

slide-39
SLIDE 39

Convergence

[Figure: Approximation Error vs # Iterations; 1 - cosine similarity (1e-06 to 0.001) over 2 to 20 iterations, for DBLP (1 - Cosine) and Patent (1 - Cosine).]
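The two convergence plots report Kendall-Tau correlation between rankings and 1 - cosine similarity between score vectors. A minimal sketch of both metrics follows, assuming rankings without ties over the same items and non-zero score vectors; the function names are hypothetical, and the quadratic Kendall-Tau loop is only meant for short top-k lists.

```python
import math


def one_minus_cosine(x, y):
    """Approximation error between two score vectors (0 means same direction)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def kendall_tau(ranking_a, ranking_b):
    """Kendall-Tau between two rankings of the same items (no ties).

    +1 means identical order, -1 means exactly reversed order.
    """
    position = {item: i for i, item in enumerate(ranking_b)}
    n = len(ranking_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # ranking_a places item i before item j; check whether ranking_b agrees.
            if position[ranking_a[i]] < position[ranking_a[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

To evaluate at a position k as in the plot, one would compare only the first k entries of the exact ranking against their order in the approximate ranking.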