Clustering and Ranking in Heterogeneous Information Networks via Gamma-Poisson Model
Junxiang Chen, Wei Dai, Yizhou Sun, Jennifer Dy
Northeastern University
May 1, 2015
Information Network
◮ Information networks are often used to represent objects and their interactions.
  ◮ Objects are represented by vertices.
  ◮ Relationships are represented by edges.
◮ Homogeneous information networks have been well studied.
  ◮ They assume only one type of vertex and one type of edge.
  ◮ A friendship network is an example.
[Figure: a friendship network among Sophia, Emma, Jacob, William, and Mason]
Heterogeneous Network
◮ In the real world, objects of multiple types are usually related to each other.
◮ Such data can be represented by a heterogeneous information network.
  ◮ It involves vertices of multiple types and edges of multiple types.
  ◮ For example, DBLP is a computer science bibliographic database.
[Figure: DBLP network schema with vertex types Author, Word, and Venue and edge types use, publish, appear, and co-author. Example vertices: authors Sophia, William, Mason; words "data", "mining", "database"; venues VLDB, SDM.]
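A minimal sketch of how such a network could be stored, assuming typed vertices and integer edge counts keyed by the pair of endpoint types. The class name `HeteroNetwork`, its methods, and the toy edges are illustrative assumptions, not part of the talk. The pairing of edge labels with vertex types in the comments is one reading of the schema figure.

```python
from collections import defaultdict


class HeteroNetwork:
    """Toy container for a heterogeneous information network (hypothetical)."""

    def __init__(self):
        # vertex type -> list of vertex names in insertion order
        self.vertices = defaultdict(list)
        # (type_a, type_b) -> {(name_a, name_b): number of edges}
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_vertex(self, vtype, name):
        if name not in self.vertices[vtype]:
            self.vertices[vtype].append(name)

    def add_edge(self, type_a, a, type_b, b, count=1):
        # Register both endpoints, then accumulate the multi-edge count.
        self.add_vertex(type_a, a)
        self.add_vertex(type_b, b)
        self.edges[(type_a, type_b)][(a, b)] += count

    def edge_types(self):
        return sorted(self.edges)


# Toy edges only; these are not taken from the DBLP data.
net = HeteroNetwork()
net.add_edge("author", "Sophia", "venue", "VLDB")         # assumed: publish
net.add_edge("author", "Sophia", "word", "database", 2)   # assumed: use, twice
net.add_edge("word", "mining", "venue", "SDM")            # assumed: appear
net.add_edge("author", "William", "author", "Mason")      # assumed: co-author
```

Keying the edge counts by type pair keeps each edge type in its own table, so each type can later be given its own parameters.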
Related Work
◮ Clustering and ranking are prominent techniques for analyzing information networks.
◮ They are usually regarded as orthogonal techniques.
Clustering Methods
◮ Homogeneous networks: spectral clustering [Shi and Malik, 2000], affinity propagation [Frey and Dueck, 2007], stochastic blockmodel [Snijders and Nowicki, 1997]
◮ Heterogeneous networks: multi-type spectral clustering [Long et al., 2006]
Clustering and Ranking Methods
◮ Clustering methods
  ◮ Homogeneous networks: spectral clustering, affinity propagation, stochastic blockmodel
  ◮ Heterogeneous networks: multi-type spectral clustering
◮ Ranking methods
  ◮ Homogeneous networks: HITS [Kleinberg, 1999], PageRank [Page et al., 1998]
  ◮ Heterogeneous networks: PopRank [Nie et al., 2005]
Related Work (cont.)
◮ Combining clustering and ranking usually achieves better results.
  ◮ Sun et al. [2009a] propose the RankClus model for bi-typed networks.
  ◮ Sun et al. [2009b] introduce the NetClus model for the star-network schema.
Contributions
◮ We develop a Gamma-Poisson generative model called GPNRankClus (Gamma-Poisson Network Model for Ranking and Clustering).
Clustering and Ranking Methods
◮ Clustering methods
  ◮ Homogeneous networks: spectral clustering, affinity propagation, stochastic blockmodel
  ◮ Heterogeneous networks: multi-type spectral clustering
◮ Ranking methods
  ◮ Homogeneous networks: HITS, PageRank
  ◮ Heterogeneous networks: PopRank
◮ Combined clustering and ranking methods
  ◮ Networks with specified schema: RankClus, NetClus
  ◮ Heterogeneous networks: GPNRankClus
Ranking Scores
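◮ We want to achieve ranking and clustering simultaneously.
◮ We assign each vertex $v_n^{(T_m)}$ a ranking score $r_{nk}^{(T_m)}$ for each cluster $k$ that represents the importance of the vertex in this cluster, such that
$$v_n^{(T_m)} \in C_k \iff k = \arg\max_l \, r_{nl}^{(T_m)}$$
$$\mathrm{rank}_k(v_i^{(T_m)}) < \mathrm{rank}_k(v_j^{(T_m)}) \iff r_{ik}^{(T_m)} > r_{jk}^{(T_m)}$$
◮ Since $r_{nk}^{(T_m)}$ is a positive real number, we model it as $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
[Figure: graphical model with hyperparameter node $\theta_r$ and ranking-score nodes $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$, each in an N × K plate. Legend: Ranking Scores.]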
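The two conditions above can be read as an argmax over clusters and a descending sort within a cluster. The following is a minimal sketch under that reading. The function names and the toy score matrix are illustrative assumptions.

```python
def assign_clusters(scores):
    """scores[n][k] is the ranking score of vertex n in cluster k.

    Each vertex goes to the cluster in which its score is largest.
    """
    return [max(range(len(row)), key=lambda k: row[k]) for row in scores]


def rank_within_cluster(scores, k):
    """Return the 1-based rank of every vertex in cluster k (rank 1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda n: -scores[n][k])
    ranks = [0] * len(scores)
    for position, n in enumerate(order, start=1):
        ranks[n] = position
    return ranks


# Toy scores for 3 vertices and 2 clusters (made up for illustration).
scores = [
    [0.9, 0.1],
    [0.4, 0.3],
    [0.2, 1.5],
]
clusters = assign_clusters(scores)        # [0, 0, 1]
ranks0 = rank_within_cluster(scores, 0)   # [1, 2, 3]
ranks1 = rank_within_cluster(scores, 1)   # [3, 2, 1]
```

Every vertex receives a rank in every cluster, including clusters it is not assigned to, which matches a score defined per vertex and per cluster.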
Intensity of Edge Type
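◮ In heterogeneous networks, the intensity differs across edge types.
  ◮ Some edge types tend to generate more connections.
◮ We model the intensity of each edge type using a positive real number:
$$\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$$
[Figure: graphical model with nodes $\theta_r$, $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$ (N × K plates), $W_{ij}^{(T_a,T_b)}$, $\lambda^{(T_a,T_b)}$ (plate labeled $T_e$, size $M^2$), and $\theta_\lambda$. Legend: Ranking Scores, Intensity of Edge Type.]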
Number of Edges
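◮ Multiple edges may exist between two vertices.
  ◮ Connections between vertices are treated as counts of repeated events.
$$W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\left(\lambda^{(T_a,T_b)} \left(\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}\right)\right)$$
Here $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}$ is the dot product of the ranking scores.
[Figure: graphical model with nodes $\theta_r$, $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$ (N × K plates), $W_{ij}^{(T_a,T_b)}$ (plate of size $N^2$), $\lambda^{(T_a,T_b)}$ (plate labeled $T_e$, size $M^2$), and $\theta_\lambda$. Legend: Ranking Scores, Intensity of Edge Type, Number of Edges.]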
Why Dot Product?
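$$W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\left(\lambda^{(T_a,T_b)} \left(\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}\right)\right)$$
Here $W_{ij}^{(T_a,T_b)}$ is the number of edges, $\lambda^{(T_a,T_b)}$ is the intensity of the edge type, and $\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}$ is the dot product of the ranking scores.
◮ The dot product can be expressed as
$$\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)} = \cos\theta \times \|\mathbf{r}_i^{(T_a)}\| \times \|\mathbf{r}_j^{(T_b)}\|$$
where $\theta$ is the angle between the two ranking-score vectors.
◮ In order to have a large $W_{ij}^{(T_a,T_b)}$ we need
  ◮ Large $\lambda^{(T_a,T_b)}$
  ◮ Large $\cos\theta$
  ◮ Large $\|\mathbf{r}_i^{(T_a)}\|$ and $\|\mathbf{r}_j^{(T_b)}\|$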
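A small numeric check of this decomposition, as a sketch with made-up score vectors. The helper names and the vectors are assumptions for illustration. Two vertices whose scores concentrate on the same cluster have a large cosine and therefore a large dot product. Vertices concentrated on different clusters do not, even when their norms are similar.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))


# Toy ranking-score vectors over 2 clusters (made up for illustration).
r_i = [2.0, 0.1]        # mostly cluster 0
r_j_same = [3.0, 0.2]   # mostly cluster 0 as well
r_j_other = [0.2, 3.0]  # mostly cluster 1

same = dot(r_i, r_j_same)     # about 6.02
other = dot(r_i, r_j_other)   # about 0.7

# The dot product equals cos(theta) times the product of the norms.
rebuilt = cosine(r_i, r_j_same) * norm(r_i) * norm(r_j_same)
```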
Summary of the Model
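◮ For each vertex $n$ and each cluster $k$, draw $r_{nk}^{(T_m)} \sim \mathrm{Gamma}(\alpha_r, \beta_r)$.
◮ For each non-zero edge type $(T_a, T_b)$, draw $\lambda^{(T_a,T_b)} \sim \mathrm{Gamma}(\alpha_\lambda, \beta_\lambda)$.
◮ For each pair of different vertices $(v_i^{(T_a)}, v_j^{(T_b)})$, draw $W_{ij}^{(T_a,T_b)} \sim \mathrm{Pois}\left(\lambda^{(T_a,T_b)} \left(\mathbf{r}_i^{(T_a)} \cdot \mathbf{r}_j^{(T_b)}\right)\right)$.
[Figure: complete graphical model with nodes $\theta_r$, $r_{ik}^{(T_a)}$ and $r_{jk}^{(T_b)}$ (N × K plates), $W_{ij}^{(T_a,T_b)}$ (plate of size $N^2$), $\lambda^{(T_a,T_b)}$ (plate labeled $T_e$, size $M^2$), and $\theta_\lambda$. Legend: Ranking Scores, Intensity of Edge Type, Number of Edges.]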
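The three sampling steps can be sketched with the Python standard library for a single edge type between two vertex types. This is an illustration under stated assumptions, not the authors' implementation. The slides do not say whether Gamma(α, β) uses a rate or a scale, and the sketch assumes β is a rate. The function names, the parameter values, and the choice of Knuth's Poisson sampler are also assumptions.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate only for small rates."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_network(n_a, n_b, n_clusters, alpha_r, beta_r, alpha_lam, beta_lam, seed=0):
    """Sample scores, one edge-type intensity, and edge counts for types T_a and T_b."""
    rng = random.Random(seed)

    def draw_gamma(shape, rate):
        # random.gammavariate takes (shape, scale); beta is ASSUMED to be a rate.
        return rng.gammavariate(shape, 1.0 / rate)

    # Step 1: a ranking score for every vertex and every cluster.
    r_a = [[draw_gamma(alpha_r, beta_r) for _ in range(n_clusters)] for _ in range(n_a)]
    r_b = [[draw_gamma(alpha_r, beta_r) for _ in range(n_clusters)] for _ in range(n_b)]
    # Step 2: one intensity for the edge type (T_a, T_b).
    lam = draw_gamma(alpha_lam, beta_lam)
    # Step 3: a Poisson edge count for every vertex pair.
    w = [
        [sample_poisson(lam * sum(x * y for x, y in zip(r_i, r_j)), rng) for r_j in r_b]
        for r_i in r_a
    ]
    return r_a, r_b, lam, w


r_a, r_b, lam, w = sample_network(3, 4, 2, alpha_r=1.0, beta_r=2.0,
                                  alpha_lam=2.0, beta_lam=2.0, seed=1)
```

Because the two vertex types differ here, every pair is a pair of different vertices. An edge type that connects a vertex type to itself would need to skip the pairs with i = j.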
Inference
◮ It is computationally intractable to evaluate the posterior distributions directly.
◮ We use mean-field variational inference to approximate these distributions.
◮ Ranking and clustering results are given by comparing the expected values of the ranking scores:
$$v_n^{(T_m)} \in C_k, \text{ where } k = \arg\max_l \, \mathbb{E}[r_{nl}^{(T_m)}]$$
$$\mathrm{rank}_k(v_n^{(T_m)}) = \mathrm{argsort}_i \left(\mathbb{E}[r_{ik}^{(T_m)}]\right)$$
◮ We introduce seeds.
  ◮ Existing models use seeds to guide the clustering process.
  ◮ We select 1 representative object for each cluster.
  ◮ We assign a special prior distribution for these seeds.
[Figure: N objects of type $T_m$ and K clusters, showing clustering results and ranking results]
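As a sketch of the last step only, suppose the variational approximation for each ranking score is itself a gamma distribution with a shape and a rate, so that the expected value is shape divided by rate. The slides do not state the variational family, so this is an assumption, as are the function names and the toy parameter values. The variational updates and the seed prior are not shown because the slides do not specify them.

```python
def expected_scores(shape, rate):
    """E[r_nk] = shape_nk / rate_nk for an ASSUMED gamma variational factor."""
    return [[a / b for a, b in zip(s_row, r_row)] for s_row, r_row in zip(shape, rate)]


def cluster_of(expected):
    """Assign each vertex to the cluster with the largest expected score."""
    return [max(range(len(row)), key=row.__getitem__) for row in expected]


def ordering_in_cluster(expected, k):
    """Vertex indices sorted by descending expected score in cluster k."""
    return sorted(range(len(expected)), key=lambda n: expected[n][k], reverse=True)


# Toy variational parameters for 3 vertices and 2 clusters (made up).
shape = [[4.0, 1.0], [2.0, 6.0], [9.0, 1.0]]
rate = [[2.0, 2.0], [4.0, 2.0], [3.0, 2.0]]

expected = expected_scores(shape, rate)      # [[2.0, 0.5], [0.5, 3.0], [3.0, 0.5]]
clusters = cluster_of(expected)              # [0, 1, 0]
top_of_0 = ordering_in_cluster(expected, 0)  # [2, 0, 1]
```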
Synthetic Data
◮ We generate synthetic data:
  ◮ 400 data points
  ◮ 4 different types
  ◮ 2 clusters
◮ We add noise at different levels.
[Figure: three panels titled "Low noise level", "Medium noise level", and "High noise level"]
Real Data
◮ We test the performance of the model on two real heterogeneous network datasets:
  ◮ DBLP dataset
  ◮ YELP dataset
◮ We compare GPNRankClus with state-of-the-art algorithms:
  ◮ NetClus, a clustering and ranking method for heterogeneous networks that follow a star-network schema.
  ◮ GNetMine, a transductive classification method for heterogeneous networks.
  ◮ RankClass, a ranking-based classification method for heterogeneous networks.
DBLP Dataset
◮ The dataset includes conferences from four areas: Database (DB), Data Mining (DM), Machine Learning (ML), and Information Retrieval (IR).
[Figure: DBLP network schema with vertex types Author, Word, and Venue and edge types use, publish, appear, and co-author]
Classification Accuracy on Authors
           GPNRankClus   NetClus    GNetMine   RankClass
Accuracy   92.28%        76.11%‡    80.67%     91.12%
Classification Accuracy on Conferences
           GPNRankClus   NetClus    GNetMine   RankClass
Accuracy   100%          85%‡       100%       100%
‡ We test NetClus on the star-schema version of the DBLP dataset.
Top-5 Words in Each Cluster
Rank   DB          DM               ML          IR
1      data        data             learning    web
2      database    mining           knowledge   retrieval
3      databases   learning         system      information
4      query       clustering       reasoning   search
5      system      classification   model       text
Top-5 Conferences in Each Cluster
Rank   DB       DM      ML      IR
1      VLDB     KDD     IJCAI   SIGIR
2      ICDE     PAKDD   AAAI    WWW
3      SIGMOD   ICDM    ICML    CIKM
4      PODS     PKDD    CVPR    ECIR
5      EDBT     SDM     ECML    AAAI
YELP Dataset
◮ We examine a subset of the YELP dataset for 3 different clustering tasks:
  ◮ 4 Level-1 categories
  ◮ 6 Restaurant categories
  ◮ 6 Shopping categories
[Figure: YELP network schema with vertex types Business, Review, User, and Word and edge types "given to", "given by", and "contains"]
Classification Accuracy on Businesses
             GPNRankClus   NetClus   GNetMine   RankClass
Level 1      56.25%        17.78%    47.16%     37.19%
Restaurant   66.81%        15.31%    49.36%     57.11%
Shopping     64.62%        13.28%    64.45%     32.58%
Normalized Mutual Information (NMI) on Businesses
             GPNRankClus   NetClus   GNetMine   RankClass
Level 1      0.5590        0.0168    0.1387     0.1579
Restaurant   0.6606        0.0187    0.2346     0.3044
Shopping     0.4721        0.0313    0.3617     0.2335
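For readers unfamiliar with NMI, the following sketch computes it from two label lists. The slides do not say which normalization was used, so the geometric-mean form I(A;B) / sqrt(H(A) H(B)) here is an assumption. The function names and the toy labels are also illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a label list."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(a, b):
    """NMI with geometric-mean normalization (ASSUMED; other variants exist)."""
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one labeling puts everything in one group.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / math.sqrt(h_a * h_b)


truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition with the cluster ids swapped
unrelated = [0, 1, 0, 1]    # statistically independent of the truth

score_same = nmi(truth, relabeled)        # 1.0 up to rounding
score_unrelated = nmi(truth, unrelated)   # 0.0
```

Unlike accuracy, NMI does not depend on how cluster ids are matched to category names, so it is a natural companion metric for clustering output.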
Conclusions
◮ We introduce a new concept of ranking score that conveys both ranking and clustering information.
◮ Based on this concept, we propose a generative model called GPNRankClus.
  ◮ We model the ranking score of each vertex in each cluster with a gamma distribution.
  ◮ We model the number of edges with a Poisson distribution.
◮ We test our model on the DBLP and YELP data.
  ◮ GPNRankClus matches or outperforms the state-of-the-art baselines in these experiments.
References
Brendan J Frey and Delbert Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
Jon M Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM (JACM), 46(5):604–632, 1999.
Bo Long, Zhongfei Mark Zhang, Xiaoyun Wu, and Philip S Yu. Spectral clustering for multi-type relational data. In ICML’06, Pittsburgh, Pennsylvania, USA, pages 585–592, 2006.
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. Object-level ranking: bringing order to web objects. In WWW’05, Chiba, Japan, pages 567–574, 2005.
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The pagerank citation ranking: Bringing order to the web. Computer Networks, 30, 1998.
Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(8):888–905, 2000.
Tom AB Snijders and Krzysztof Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.
Yizhou Sun, Jiawei Han, Peixiang Zhao, Zhijun Yin, Hong Cheng, and Tianyi Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT’09, Saint-Petersburg, Russia, pages 565–576, 2009a.
Yizhou Sun, Yintao Yu, and Jiawei Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD’09, Paris, France, pages 797–806, 2009b.
Acknowledgement
This work was partially supported by NIH/NHLBI grants R01HL089856 & R01HL089857.