SLIDE 1

Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model

Junxiang Chen Wei Dai Yizhou Sun Jennifer Dy

Northeastern University

May 1, 2015

SLIDE 2

Information Network

◮ Information networks are often used to represent objects and their interactions.

◮ Objects are represented by vertices.
◮ Relationships are represented by edges.

◮ Homogeneous information networks have been well studied.

◮ They are assumed to contain only one type of vertex and one type of edge.

◮ A friendship network is an example.

[Figure: friendship network with vertices Sophia, Emma, Jacob, William, and Mason]

SLIDE 3

Heterogeneous Network

◮ In the real world, objects of multiple types are usually related to each other.

◮ Such data can be represented by a heterogeneous information network.
◮ It involves vertices of multiple types and edges of multiple types.
◮ For example, DBLP is a computer science bibliographic database.

[Figure: DBLP network with vertex types Author, Word, and Venue, and edge types use, publish, appear, and co-author. Example vertices: authors Sophia, William, Mason; words "data", "mining", "database"; venues VLDB, SDM]

SLIDE 4

Related Work

◮ Clustering and ranking are prominent techniques for analyzing information networks.

◮ They are usually regarded as orthogonal techniques.

SLIDE 5

Related Work

Clustering Methods

Homogeneous Networks     Spectral clustering [Shi and Malik, 2000]
                         Affinity propagation [Frey and Dueck, 2007]
                         Stochastic blockmodel [Snijders and Nowicki, 1997]
Heterogeneous Networks   Multi-type spectral clustering [Long et al., 2006]

SLIDE 6

Related Work

Clustering and Ranking Methods

                         Clustering Methods               Ranking Methods
Homogeneous Networks     Spectral clustering              HITS [Kleinberg, 1999]
                         Affinity propagation             PageRank [Page et al., 1998]
                         Stochastic blockmodel
Heterogeneous Networks   Multi-type spectral clustering   PopRank [Nie et al., 2005]

SLIDE 7

Related Work (cont.)

◮ Combining clustering and ranking usually achieves better results.

◮ Sun et al. [2009a] propose the RankClus model for bi-typed networks.
◮ Sun et al. [2009b] introduce the NetClus model for networks with a star-network schema.

Clustering and Ranking Methods

                                 Clustering Methods               Ranking Methods
Homogeneous Networks             Spectral clustering              HITS
                                 Affinity propagation             PageRank
                                 Stochastic blockmodel
Heterogeneous Networks           Multi-type spectral clustering   PopRank
Networks with Specified Schema   RankClus, NetClus (clustering and ranking)

SLIDE 8

Contributions

◮ We develop a Gamma-Poisson generative model called GPNRankClus (Gamma-Poisson Network Model for Ranking and Clustering).

Clustering and Ranking Methods

                                 Clustering Methods               Ranking Methods
Homogeneous Networks             Spectral clustering              HITS
                                 Affinity propagation             PageRank
                                 Stochastic blockmodel
Heterogeneous Networks           Multi-type spectral clustering   PopRank
Networks with Specified Schema   RankClus, NetClus (clustering and ranking)

GPNRankClus (this work)

SLIDE 9

Ranking Scores

◮ We want to achieve ranking and clustering simultaneously.
◮ We assign each vertex $v_n^{(T_m)}$ a ranking score $r_{nk}^{(T_m)}$ for each cluster, representing the importance of the vertex in that cluster, such that

$$v_n^{(T_m)} \in C_k \iff k = \arg\max_l \, r_{nl}^{(T_m)} \qquad (1)$$

$$\mathrm{rank}_k\big(v_i^{(T_m)}\big) < \mathrm{rank}_k\big(v_j^{(T_m)}\big) \iff r_{ik}^{(T_m)} > r_{jk}^{(T_m)} \qquad (2)$$

[Figure: N objects of type $T_m$ and K clusters, annotated with clustering results and ranking results]
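Conditions (1) and (2) can be read off directly from a score table. The sketch below is only an illustration: the score values are made up, and the names `scores`, `cluster_assignments`, and `ranks_within_cluster` are hypothetical, not taken from the slides.

```python
import numpy as np

# Hypothetical ranking scores for N = 4 objects of one type and K = 2 clusters.
# Row n holds the scores r_nk of object n in every cluster k.
scores = np.array([
    [0.9, 0.1],
    [0.2, 0.7],
    [1.5, 0.3],
    [0.1, 0.4],
])

def cluster_assignments(r):
    """Condition (1): each object joins the cluster where its score is largest."""
    return np.argmax(r, axis=1)

def ranks_within_cluster(r, k):
    """Condition (2): a larger score in cluster k means a smaller (better) rank."""
    order = np.argsort(-r[:, k], kind="stable")  # object indices, best first
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)  # rank 1 is the top object
    return ranks

labels = cluster_assignments(scores)
ranks0 = ranks_within_cluster(scores, 0)
print(labels)  # cluster index per object
print(ranks0)  # rank of every object with respect to cluster 0
```

Note that the ranking in a cluster covers every object of the type, not only the objects assigned to that cluster, which matches how condition (2) is stated.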

SLIDE 10

Ranking Scores

◮ Since $r_{nk}^{(T_m)}$ is a positive real number, we place a gamma prior on it:

$$r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$$

[Figure: plate diagram with hyperparameters $\theta_r$ and ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$, each in a plate of size N × K. Legend: Ranking Scores]

SLIDE 11

Intensity of Edge Type

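◮ In heterogeneous networks, the intensity differs across edge types.

◮ Some edge types tend to generate more connections.

◮ We model the intensity of each edge type with a positive real number:

$$\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$$

[Figure: plate diagram with hyperparameters $\theta_r$ and $\theta_\lambda$, ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$ (plates of size N × K), edge counts $W_{ij}^{(T_a,T_b)}$, and edge-type intensities $\lambda^{(T_a,T_b)}$ (plate of size $M^2$). Legend: Ranking Scores, Intensity of Edge Type]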

SLIDE 12

Number of Edges

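◮ Multiple edges may exist between two vertices.

◮ Connections between vertices are treated as counts of repeated events.

$$W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\Big(\lambda^{(T_a,T_b)} \big(\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}\big)\Big)$$

Here $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}$ is the dot product of the ranking scores.

[Figure: plate diagram with hyperparameters $\theta_r$ and $\theta_\lambda$, ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$ (plates of size N × K), edge counts $W_{ij}^{(T_a,T_b)}$ (plate of size $N^2$), and edge-type intensities $\lambda^{(T_a,T_b)}$ (plate of size $M^2$). Legend: Ranking Scores, Intensity of Edge Type, Number of Edges]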

SLIDE 13

Why Dot Product?

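$$W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\Big(\lambda^{(T_a,T_b)} \big(\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}\big)\Big)$$

Here $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}$ is the dot product of the ranking scores.

◮ The dot product can be expressed as

$$\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)} = \cos\theta \times \big\|\mathbf{r}_i^{(T_a)}\big\| \times \big\|\mathbf{r}_j^{(T_b)}\big\|$$

◮ To obtain a large $W_{ij}^{(T_a,T_b)}$ we need

◮ a large $\lambda^{(T_a,T_b)}$,
◮ a large $\cos\theta$,
◮ large $\|\mathbf{r}_i^{(T_a)}\|$ and $\|\mathbf{r}_j^{(T_b)}\|$.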
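The three factors can be checked numerically. The following sketch uses made-up score vectors and a made-up intensity; the names `poisson_rate` and `cosine` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def poisson_rate(lam, r_i, r_j):
    """Poisson rate lambda * (r_i . r_j) for one vertex pair."""
    return lam * float(np.dot(r_i, r_j))

def cosine(r_i, r_j):
    """Cosine of the angle between two score vectors."""
    return float(np.dot(r_i, r_j) / (np.linalg.norm(r_i) * np.linalg.norm(r_j)))

# Hypothetical score vectors over K = 2 clusters.
r_a = np.array([3.0, 0.0])  # important in cluster 0
r_b = np.array([4.0, 0.0])  # important in cluster 0 as well
r_c = np.array([0.0, 4.0])  # important, but in cluster 1
r_d = np.array([0.4, 0.0])  # in cluster 0, but unimportant

lam = 0.5  # hypothetical edge-type intensity

same_cluster = poisson_rate(lam, r_a, r_b)     # aligned and long vectors: large rate
other_cluster = poisson_rate(lam, r_a, r_c)    # orthogonal vectors: cos(theta) = 0
low_importance = poisson_rate(lam, r_a, r_d)   # aligned but short vector: small rate

# The dot product equals cos(theta) times the two norms.
identity_gap = abs(
    np.dot(r_a, r_d) - cosine(r_a, r_d) * np.linalg.norm(r_a) * np.linalg.norm(r_d)
)
print(same_cluster, other_cluster, low_importance, identity_gap)
```

In this toy setting, two vertices are expected to share many edges only when they are important in the same cluster, which is the intuition the slide gives for using the dot product.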

SLIDE 14

Summary of the Model

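◮ For each vertex $n$ and each cluster $k$, draw $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
◮ For each non-zero edge type $(T_a, T_b)$, draw $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$.
◮ For each pair of different vertices $\big(v_i^{(T_a)}, v_j^{(T_b)}\big)$, draw $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\Big(\lambda^{(T_a,T_b)} \big(\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}\big)\Big)$.

[Figure: plate diagram with hyperparameters $\theta_r$ and $\theta_\lambda$, ranking scores $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$ (plates of size N × K), edge counts $W_{ij}^{(T_a,T_b)}$ (plate of size $N^2$), and edge-type intensities $\lambda^{(T_a,T_b)}$ (plate of size $M^2$). Legend: Ranking Scores, Intensity of Edge Type, Number of Edges]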
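The three draws above can be simulated in a few lines. This is a minimal sketch under stated assumptions, not the authors' code: it covers a single edge type between two different vertex types, it treats $\beta$ as a rate parameter (the slides do not say whether the gamma is parameterized by rate or scale), and the function name `sample_network` and all argument names are hypothetical.

```python
import numpy as np

def sample_network(n_a, n_b, n_clusters, alpha_r, beta_r, alpha_lam, beta_lam, seed=0):
    """Sample ranking scores, one edge-type intensity, and an edge-count matrix."""
    rng = np.random.default_rng(seed)

    # Step 1: ranking scores for every vertex and cluster.
    # NumPy's gamma takes shape and scale; assuming beta is a rate, scale = 1 / beta.
    r_a = rng.gamma(shape=alpha_r, scale=1.0 / beta_r, size=(n_a, n_clusters))
    r_b = rng.gamma(shape=alpha_r, scale=1.0 / beta_r, size=(n_b, n_clusters))

    # Step 2: intensity of the single edge type (T_a, T_b).
    lam = rng.gamma(shape=alpha_lam, scale=1.0 / beta_lam)

    # Step 3: edge counts. Entry (i, j) of r_a @ r_b.T is the dot product r_i . r_j.
    rate = lam * (r_a @ r_b.T)
    w = rng.poisson(rate)
    return r_a, r_b, lam, w

r_a, r_b, lam, w = sample_network(
    n_a=5, n_b=3, n_clusters=2, alpha_r=1.0, beta_r=1.0, alpha_lam=2.0, beta_lam=1.0
)
print(w.shape)  # one count per (T_a vertex, T_b vertex) pair
```

For an edge type that connects a vertex type to itself, the diagonal of `w` would have to be excluded, since the model draws counts only for pairs of different vertices.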

SLIDE 15

Inference

◮ It is computationally intractable to evaluate the posterior distributions directly.

◮ We use mean-field variational inference to approximate these distributions.
◮ Ranking and clustering results are given by comparing the expected values of the ranking scores:

$$v_n^{(T_m)} \in C_k, \quad \text{where } k = \arg\max_l \, \mathbb{E}\big[r_{nl}^{(T_m)}\big]$$

$$\mathrm{rank}_k\big(v_n^{(T_m)}\big) = \operatorname{argsort}_i \, \mathbb{E}\big[r_{ik}^{(T_m)}\big]$$

◮ We introduce seeds.

◮ Existing models use seeds to guide the clustering process.

◮ We select 1 representative object for each cluster.

◮ We assign a special prior distribution to these seeds.

[Figure: N objects of type $T_m$ and K clusters, annotated with clustering results and ranking results]

SLIDE 16

Synthetic Data

◮ We generate synthetic data:

  ◮ 400 data points
  ◮ 4 different types
  ◮ 2 clusters

◮ We add noise at different levels.

[Figure: three panels showing results at low, medium, and high noise levels]

SLIDE 17

Real Data

◮ We test the performance of our model on two real heterogeneous network datasets:

  ◮ DBLP dataset
  ◮ YELP dataset

◮ We compare GPNRankClus with state-of-the-art algorithms:

  ◮ NetClus, a clustering and ranking method for heterogeneous networks that follow a star-network schema.
  ◮ GNetMine, a transductive classification method for heterogeneous networks.
  ◮ RankClass, a ranking-based classification method for heterogeneous networks.

SLIDE 18

DBLP Dataset

◮ The dataset includes conferences from Database (DB), Data Mining (DM), Machine Learning (ML), and Information Retrieval (IR).

[Figure: DBLP network schema with vertex types Author, Word, and Venue, and edge types use, publish, appear, and co-author]

Classification Accuracy on Authors

            GPNRankClus   NetClus    GNetMine   RankClass
Accuracy    92.28%        76.11%‡    80.67%     91.12%

Classification Accuracy on Conferences

            GPNRankClus   NetClus    GNetMine   RankClass
Accuracy    100%          85%‡       100%       100%

‡ We test NetClus on the star-schema version of the DBLP dataset.

Top-5 Words in Each Cluster

     DB          DM               ML          IR
1    data        data             learning    web
2    database    mining           knowledge   retrieval
3    databases   learning         system      information
4    query       clustering       reasoning   search
5    system      classification   model       text

Top-5 Conferences in Each Cluster

     DB       DM      ML      IR
1    VLDB     KDD     IJCAI   SIGIR
2    ICDE     PAKDD   AAAI    WWW
3    SIGMOD   ICDM    ICML    CIKM
4    PODS     PKDD    CVPR    ECIR
5    EDBT     SDM     ECML    AAAI

SLIDE 19

YELP Dataset

◮ We examine a subset of the YELP dataset for 3 different clustering tasks:

  ◮ 4 Level-1 categories
  ◮ 6 Restaurant categories
  ◮ 6 Shopping categories

[Figure: YELP network schema with vertex types Business, Review, User, and Word, and edge types given to, given by, and contains]

Classification Accuracy on Businesses

              GPNRankClus   NetClus   GNetMine   RankClass
Level 1       56.25%        17.78%    47.16%     37.19%
Restaurant    66.81%        15.31%    49.36%     57.11%
Shopping      64.62%        13.28%    64.45%     32.58%

Normalized Mutual Information (NMI) on Businesses

              GPNRankClus   NetClus   GNetMine   RankClass
Level 1       0.5590        0.0168    0.1387     0.1579
Restaurant    0.6606        0.0187    0.2346     0.3044
Shopping      0.4721        0.0313    0.3617     0.2335
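For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It rests on assumptions the slides do not confirm: accuracy is computed on cluster indices that are already aligned with the class labels (for example through the seeds), and NMI is normalized by the geometric mean of the two entropies, which is only one of several conventions. The function names are hypothetical.

```python
import numpy as np

def accuracy(true_labels, pred_labels):
    """Fraction of objects whose predicted label equals the true label."""
    t = np.asarray(true_labels).ravel()
    p = np.asarray(pred_labels).ravel()
    return float(np.mean(t == p))

def nmi(true_labels, pred_labels):
    """Mutual information divided by sqrt(H(true) * H(pred)) (assumed convention)."""
    t = np.asarray(true_labels).ravel()
    p = np.asarray(pred_labels).ravel()
    _, t_idx = np.unique(t, return_inverse=True)
    _, p_idx = np.unique(p, return_inverse=True)

    # Joint distribution of (true class, predicted cluster).
    counts = np.zeros((t_idx.max() + 1, p_idx.max() + 1))
    np.add.at(counts, (t_idx, p_idx), 1)
    joint = counts / t.size
    p_true = joint.sum(axis=1)
    p_pred = joint.sum(axis=0)

    nonzero = joint > 0
    independent = np.outer(p_true, p_pred)
    mi = np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero]))
    h_true = -np.sum(p_true * np.log(p_true))
    h_pred = -np.sum(p_pred * np.log(p_pred))
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling has a single group.
        return 1.0 if h_true == h_pred else 0.0
    return float(mi / np.sqrt(h_true * h_pred))

print(accuracy([0, 0, 1, 1], [0, 1, 1, 1]))
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Unlike accuracy, NMI does not depend on how cluster indices are named, so a perfect clustering with swapped labels still scores as a perfect match.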

SLIDE 20

Conclusions

◮ We introduce a new concept of ranking score that conveys both ranking and clustering information.

◮ Based on this concept, we propose a generative model called GPNRankClus.

◮ We model the ranking score of each vertex in each cluster with a gamma distribution.

◮ We model the number of edges with a Poisson distribution.

◮ We test our model on DBLP and YELP data.

◮ GPNRankClus outperforms state-of-the-art baselines.

SLIDE 21

References

Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.

Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.

Bo Long, Zhongfei Mark Zhang, Xiaoyun Wu, and Philip S Yu. Spectral clustering for multi-type relational data. In ICML'06, Pittsburgh, Pennsylvania, USA, pages 585–592, 2006.

Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. Object-level ranking: bringing order to web objects. In WWW'05, Chiba, Japan, pages 567–574, 2005.

Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Computer Networks, 30, 1998.

Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

Tom AB Snijders and Krzysztof Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.

Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. RankClus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT'09, Saint-Petersburg, Russia, pages 565–576, 2009a.

Yizhou Sun, Yintao Yu, and Jiawei Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD'09, Paris, France, pages 797–806, 2009b.

Acknowledgement

This work was partially supported by NIH/NHLBI grants R01HL089856 & R01HL089857.