
GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data



  1. GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data
     Jialu Liu, Chi Wang, Jing Gao, Quanquan Gu, Charu Aggarwal, Lance Kaplan, Jiawei Han
     May 1, 2015
     amss

  2. Outline
     1. Heterogeneity in Networked Data
     2. GIN: the Proposed Network Clustering Algorithm
        - Modeling Subnetworks
        - Unified Model
     3. Experiments

  3. Networked Data
     Much real-world data can be represented as a network (or graph) composed of nodes that are interconnected via meaningful links.

  4. Node Heterogeneity
     In real networks, there will likely be multiple types of nodes.

  5. Link Heterogeneity
     Meanwhile, links can be categorized into different types.

     [Figure: two example networks, one with binary/unweighted links and one with weighted links.]

     Besides carrying weights, links can be directed or undirected.

  6. Dual Heterogeneity
     In this work, we study heterogeneous networks that contain interconnected multi-typed nodes and links. Specifically, links are undirected but may be either binary or weighted.

     [Figure: panels (a) to (d) showing example networks over Author (A), Paper (P), and Venue (V) nodes. Dashed line: binary links. Solid line: weighted links.]

  7. Task and Novelty
     Network Clustering: We aim to find a clustering solution for a general heterogeneous network, in which each cluster consists of multiple types of nodes and links.

     Novelty compared with previous work:
     - We consider heterogeneity in both nodes and links.
     - The algorithm imposes no requirement on the network schema.
     - The algorithm shows that sampling unobserved links (negative sampling) improves performance.

  8. Outline
     1. Heterogeneity in Networked Data
     2. GIN: the Proposed Network Clustering Algorithm
        - Modeling Subnetworks
        - Unified Model
     3. Experiments

  9. Subnetworks
     A subnetwork of a heterogeneous network is either a homogeneous network or a bipartite network. A network whose number of object types is $T = 1$ is called a homogeneous network. It is called a bipartite network when $T = 2$ and links exist only between the two object types.

  10. Symbols
      We use $G$ to denote a heterogeneous network and $G^{(uv)}$ to represent one of its subnetworks (homogeneous or bipartite, depending on whether object type $u$ equals $v$). $G^{(uv)}$ can be either unweighted or weighted. That is, the link $e_{ij}^{(uv)}$ between nodes $x_i^{(u)}$ and $x_j^{(v)}$, with weight $W_{ij}^{(uv)}$, can be binary or take any non-negative value.

  11. Subnetworks with Binary Links
      Suppose the probability of a link between nodes $x_i^{(u)}$ and $x_j^{(v)}$ is $P(e_{ij}^{(uv)} = 1)$. Specifically, we factorize it as

      $$P(e_{ij}^{(uv)} = 1) = \sum_{k=1}^{K} \theta_{ik}^{(u)} \theta_{jk}^{(v)},$$

      where $\{\theta_{ik}^{(u)}\}_{k=1}^{K}$ is a vector of length $K$ indicating the cluster membership of node $x_i^{(u)}$. This factorization implies that two nodes are more likely to be connected if they share the same cluster distribution.

      Example: for $\theta_i^{(u)} = (0.1, 0, 0, 0.6, 0, 0.1, 0.2)$ and $\theta_j^{(v)} = (0, 0.1, 0, 0.7, 0, 0.2, 0)$, the link probability is $0.6 \cdot 0.7 + 0.1 \cdot 0.2 = 0.44$.
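
The factorization is simply an inner product of two membership vectors. A minimal Python sketch, using the two vectors from the slide's illustration; the function name `link_probability` is an illustrative choice, not something defined in the presentation:

```python
def link_probability(theta_i, theta_j):
    """P(e_ij = 1) = sum_k theta_ik * theta_jk for two membership vectors."""
    if len(theta_i) != len(theta_j):
        raise ValueError("membership vectors must have the same length K")
    return sum(a * b for a, b in zip(theta_i, theta_j))


# Membership vectors from the slide's illustration (K = 7).
theta_i = [0.1, 0.0, 0.0, 0.6, 0.0, 0.1, 0.2]
theta_j = [0.0, 0.1, 0.0, 0.7, 0.0, 0.2, 0.0]

p = link_probability(theta_i, theta_j)
print(round(p, 2))  # 0.44, matching the value shown on the slide
```

Only clusters 4 and 6 carry mass in both vectors, so only those two clusters contribute to the probability.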

  12. The underlying generative process for link $e_{ij}^{(uv)}$ is as follows:

      $$e_{ij}^{(uv)} \sim \mathrm{Bernoulli}\Big(\sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big).$$

      For the whole set of binary links $E^{(uv)}$, the following likelihood can be derived to estimate the parameters:

      $$\prod_{i<j} P\big(e_{ij}^{(uv)} = 1\big)^{W_{ij}^{(uv)}} \underbrace{P\big(e_{ij}^{(uv)} = 0\big)^{1 - W_{ij}^{(uv)}}}_{\text{Unobserved Links}} \tag{1}$$
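
As a rough illustration of Eq. (1), the sketch below evaluates its logarithm for a small homogeneous subnetwork. The function name, the toy membership vectors, and the `eps` clamp (added only to keep the logarithm finite) are assumptions made for this example, not part of the presentation.

```python
import math
from itertools import combinations


def binary_log_likelihood(theta, observed, eps=1e-12):
    """Log of Eq. (1): sum over all pairs i < j of a homogeneous subnetwork.

    theta    -- list of membership vectors, one per node
    observed -- set of (i, j) pairs with i < j whose link weight is 1
    """
    total = 0.0
    for i, j in combinations(range(len(theta)), 2):  # all pairs with i < j
        p = sum(a * b for a, b in zip(theta[i], theta[j]))
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total += math.log(p) if (i, j) in observed else math.log(1.0 - p)
    return total


# Toy example: nodes 0 and 1 lean towards cluster 0, node 2 towards cluster 1.
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
ll_plausible = binary_log_likelihood(theta, {(0, 1)})
ll_implausible = binary_log_likelihood(theta, {(0, 2)})
print(ll_plausible > ll_implausible)
```

A link between the two nodes with similar memberships scores a higher likelihood than a link between dissimilar nodes, which is the behaviour the factorization is meant to encode.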

  13. Subnetworks with Weighted Links
      Similar to the Bernoulli setting in the previous subsection, we first model the existence of a link between a given pair of nodes. In addition to the cluster membership vector $\theta_i^{(u)}$, we incorporate a scale parameter $\sigma_i^{(u)}$ for each node $x_i^{(u)}$ to account for the weighted setting. This gives the following generative process for weighted links:

      $$\begin{aligned}
      &\text{(a)}\quad e_{ij}^{(uv)} \sim \mathrm{Bernoulli}\Big(\sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \\
      &\text{(b)}\quad \text{If } e_{ij}^{(uv)} = 1,\quad \omega_{ij}^{(uv)} \sim \mathrm{Poisson}\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)
      \end{aligned} \tag{2}$$

      where the discrete random variable $\omega_{ij}^{(uv)}$ is the weight of the link.
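
The two-step process in (2) can be simulated directly. The sketch below is an illustration under assumptions: the function names are hypothetical, and because Python's standard library has no Poisson sampler, Knuth's multiplication method is swapped in (adequate only for small rates).

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's multiplication method; adequate for small rates."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_weighted_link(theta_i, theta_j, sigma_i, sigma_j, rng):
    """Draw (e_ij, omega_ij) following steps (a) and (b) of the process."""
    affinity = sum(a * b for a, b in zip(theta_i, theta_j))
    if rng.random() >= affinity:  # step (a): e_ij ~ Bernoulli(affinity)
        return 0, 0
    rate = sigma_i * sigma_j * affinity  # step (b): Poisson rate
    return 1, sample_poisson(rate, rng)


rng = random.Random(0)
draws = [sample_weighted_link([0.7, 0.3], [0.6, 0.4], 2.0, 3.0, rng)
         for _ in range(1000)]
print(sum(e for e, _ in draws), "links drawn out of", len(draws))
```

Note that step (b) can return a weight of zero even when a link exists; how such draws are treated is not specified on the slide.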

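  14. Likelihood for a subnetwork with weighted links:

      $$\underbrace{\prod_{W_{ij}^{(uv)} = 0} \Big(1 - \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)}_{\text{Unobserved Links}} \times \prod_{W_{ij}^{(uv)} > 0} \Big(\sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \frac{\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}}}{W_{ij}^{(uv)}!} \, e^{-\sigma_i^{(u)} \sigma_j^{(v)} \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}} \tag{3}$$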
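
A minimal sketch of how the logarithm of Eq. (3) could be evaluated for a bipartite subnetwork with integer weights. The function name, the dense weight-matrix layout `W[i][j]`, and the `eps` clamp are assumptions made for illustration.

```python
import math


def weighted_log_likelihood(theta_u, theta_v, sigma_u, sigma_v, W, eps=1e-12):
    """Log of Eq. (3) for a bipartite subnetwork with weights W[i][j] >= 0."""
    total = 0.0
    for i, row in enumerate(W):
        for j, w in enumerate(row):
            s = sum(a * b for a, b in zip(theta_u[i], theta_v[j]))
            s = min(max(s, eps), 1.0 - eps)  # clamp so that log() stays finite
            if w == 0:
                total += math.log(1.0 - s)  # unobserved link
            else:
                rate = sigma_u[i] * sigma_v[j] * s
                # log(s) + log Poisson(w; rate); lgamma(w + 1) equals log(w!)
                total += math.log(s) + w * math.log(rate) - math.lgamma(w + 1) - rate
    return total


# One node of each type, both spread evenly over K = 2 clusters.
ll = weighted_log_likelihood([[0.5, 0.5]], [[0.5, 0.5]], [2.0], [3.0], [[2]])
print(ll)
```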

  15. Outline
      1. Heterogeneity in Networked Data
      2. GIN: the Proposed Network Clustering Algorithm
         - Modeling Subnetworks
         - Unified Model
      3. Experiments

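  16. Objective Function
      We first define two sets of subnetworks belonging to the same heterogeneous network $G$: $\mathcal{B}$ and $\mathcal{W}$. They contain the subnetworks with binary and weighted links respectively, and satisfy $\mathcal{B} \cup \mathcal{W} = G$ and $\mathcal{B} \cap \mathcal{W} = \emptyset$.

      $$\begin{aligned}
      &\prod_{G^{(uv)} \in \mathcal{B}} \prod_{i<j} \Big(\sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}} \Big(1 - \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{1 - W_{ij}^{(uv)}} \\
      &\times \prod_{G^{(uv)} \in \mathcal{W}} \Bigg[ \prod_{W_{ij}^{(uv)} = 0} \Big(1 - \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \\
      &\qquad \times \prod_{W_{ij}^{(uv)} > 0} \Big(\sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \frac{\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}}}{W_{ij}^{(uv)}!} \, e^{-\sigma_i^{(u)} \sigma_j^{(v)} \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}} \Bigg]
      \end{aligned} \tag{4}$$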

  17. Complete Log-likelihood
      Directly optimizing the previous expression is difficult. We apply the EM algorithm, using $\phi_{ijk_1k_2}^{(uv)}$ to denote the posterior probability that an unobserved link is generated from different cluster assignments of its two end nodes, i.e., $k_1 \neq k_2$. Meanwhile, we use $\psi_{ijk}^{(uv)}$ to denote the posterior probability that a link results from the same cluster assignment of its two end nodes.

      $$\begin{aligned}
      L(\Theta, \Sigma) ={} & \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \sum_{k} \psi_{ijk}^{(uv)} \log \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
      &+ \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \sum_{k} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} \log \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
      &+ \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{k_1 \neq k_2} \phi_{ijk_1k_2}^{(uv)} \log \theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)} \\
      &+ \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} W_{ij}^{(uv)} \log \sigma_i^{(u)} \sigma_j^{(v)}
      \end{aligned} \tag{5}$$
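
The restriction to $k_1 \neq k_2$ for unobserved links can be read off from normalization. The short derivation below assumes each membership vector sums to one, which the slides imply by calling it a cluster distribution but do not state as an explicit constraint:

```latex
\begin{aligned}
1 - \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)}
  &= \Big(\sum_{k_1} \theta_{ik_1}^{(u)}\Big)\Big(\sum_{k_2} \theta_{jk_2}^{(v)}\Big)
     - \sum_{k} \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
  &= \sum_{k_1 \neq k_2} \theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}
\end{aligned}
```

Under that assumption, the probability of no link is a sum over mismatched cluster pairs, and $\phi$ is the posterior over those pairs.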

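  18. Update Functions
      Expectation Step:

      $$\phi_{ijk_1k_2}^{(uv)} = \frac{\theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}}, \qquad \psi_{ijk}^{(uv)} = \frac{\theta_{ik}^{(u)} \theta_{jk}^{(v)}}{\sum_{l} \theta_{il}^{(u)} \theta_{jl}^{(v)}}.$$

      Maximization Step:

      $$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)}.$$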
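
The E-step formulas can be sketched for a single node pair as follows. This covers only the expectation step; the function name and the toy vectors are assumptions, and both normalizers are taken to be non-zero.

```python
def e_step_pair(theta_i, theta_j):
    """Return (psi, phi) for one node pair, following the E-step formulas."""
    K = len(theta_i)
    same = sum(theta_i[l] * theta_j[l] for l in range(K))
    diff = sum(theta_i[a] * theta_j[b]
               for a in range(K) for b in range(K) if a != b)
    # psi[k]: posterior that an observed link comes from shared cluster k.
    psi = [theta_i[k] * theta_j[k] / same for k in range(K)]
    # phi[k1][k2]: posterior that an unobserved link comes from the mismatched
    # assignment (k1, k2); the diagonal is unused and left at 0.
    phi = [[0.0 if k1 == k2 else theta_i[k1] * theta_j[k2] / diff
            for k2 in range(K)] for k1 in range(K)]
    return psi, phi


psi, phi = e_step_pair([0.7, 0.3], [0.4, 0.6])
print(psi)
print(phi)
```

Both posteriors are proper distributions: `psi` sums to one over clusters, and `phi` sums to one over all off-diagonal cluster pairs.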

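  19. Efficiency Issue
      Computing every $\phi_{ijk_1k_2}^{(uv)}$ costs $O(K^2)$:

      $$\phi_{ijk_1k_2}^{(uv)} = \frac{\theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}}$$

      The sum needed by the M-step costs only $O(K)$:

      $$\sum_{l \neq k} \phi_{ijkl}^{(uv)} = \frac{\sum_{l \neq k} \theta_{ik}^{(u)} \theta_{jl}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}} = \frac{\theta_{ik}^{(u)} - \theta_{ik}^{(u)} \theta_{jk}^{(v)}}{1 - \sum_{l} \theta_{il}^{(u)} \theta_{jl}^{(v)}}$$

      $$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \Big[\sum_{l \neq k} \phi_{ijkl}^{(uv)}\Big] \tag{6}$$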
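
The shortcut can be checked numerically. The sketch below compares the explicit double sum with the closed form, assuming both membership vectors sum to one; the function names are illustrative only.

```python
def phi_row_sum_naive(theta_i, theta_j, k):
    """Sum of phi over l != k using the explicit O(K^2) normalizer."""
    K = len(theta_i)
    denom = sum(theta_i[a] * theta_j[b]
                for a in range(K) for b in range(K) if a != b)
    return sum(theta_i[k] * theta_j[l] for l in range(K) if l != k) / denom


def phi_row_sum_fast(theta_i, theta_j, k, same):
    """Closed form; `same` is the inner product, computed once in O(K)."""
    return (theta_i[k] - theta_i[k] * theta_j[k]) / (1.0 - same)


theta_i = [0.1, 0.0, 0.0, 0.6, 0.0, 0.1, 0.2]
theta_j = [0.0, 0.1, 0.0, 0.7, 0.0, 0.2, 0.0]
same = sum(a * b for a, b in zip(theta_i, theta_j))

naive = [phi_row_sum_naive(theta_i, theta_j, k) for k in range(len(theta_i))]
fast = [phi_row_sum_fast(theta_i, theta_j, k, same) for k in range(len(theta_i))]
print(max(abs(a - b) for a, b in zip(naive, fast)))
```

Because the inner product is shared across all $k$, each additional cluster costs constant time once it has been computed.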

  20. Sampling Unobserved Links
      For the unobserved links, the space and time complexity increases significantly if we need to go over all of them. To alleviate this burden, we sample a potential neighbourhood for each node. This also downweights the third term of the update for $\theta_{ik}^{(u)}$:

      $$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \big(W_{ij}^{(uv)} + 1\big) \psi_{ijk}^{(uv)} + \underbrace{\sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)}}_{\downarrow} \tag{7}$$

      We keep all the non-zero links and sample $\eta M$ unobserved links, so that the sample size is proportional to the total number of links $M$ (we choose $\eta = 0.1$ in the experiments).
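
One simple way to draw $\eta M$ unobserved links is rejection sampling over node pairs, sketched below for a homogeneous network. This is an assumption-laden stand-in: the slides mention sampling a potential neighbourhood per node without giving details, so uniform sampling over all non-linked pairs is used here purely for illustration, and the function name is hypothetical.

```python
import random


def sample_unobserved_links(num_nodes, observed, eta=0.1, rng=None):
    """Sample about eta * M node pairs that are not in `observed`.

    observed -- set of (i, j) pairs with i < j; M = len(observed)
    Uniform rejection sampling is an illustrative choice, not the slides' scheme.
    """
    rng = rng or random.Random()
    target = round(eta * len(observed))
    # Never ask for more pairs than actually exist outside `observed`.
    available = num_nodes * (num_nodes - 1) // 2 - len(observed)
    target = min(target, available)
    sampled = set()
    while len(sampled) < target:
        i, j = rng.sample(range(num_nodes), 2)
        pair = (min(i, j), max(i, j))
        if pair not in observed:
            sampled.add(pair)
    return sampled


# Ring network on 100 nodes: M = 100 observed links, so 10 sampled non-links.
observed = {(i, i + 1) for i in range(99)} | {(0, 99)}
negatives = sample_unobserved_links(100, observed, eta=0.1, rng=random.Random(0))
print(len(negatives))
```

Rejection sampling is cheap when the network is sparse, since almost every random pair is an unobserved link.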

  21. Outline
      1. Heterogeneity in Networked Data
      2. GIN: the Proposed Network Clustering Algorithm
         - Modeling Subnetworks
         - Unified Model
      3. Experiments

  22. Datasets
      Four real-world data sets were used.
      - The DBLP data set is a collection of CS publications. We use a subset that belongs to four research areas.
      - The 4Groups data set contains co-author and author-term relationships, where researchers are selected from four data mining and machine learning research groups.
      - The Flickr data set is a network containing three types of objects: image, user, and tag. Links exist between image-user and image-tag pairs.
      - The NSF data set describes NSF Research Awards Abstracts from 1990 to 2003. We use documents associated with terms and investigators that belong to the 10 largest programs.

  23. The important statistics of the four data sets are summarized in the following table.

      | Data set  | DBLP    | 4Groups  | Flickr | NSF       |
      |-----------|---------|----------|--------|-----------|
      | #Nodes    | 70,536  | 1,618    | 4,076  | 30,995    |
      | #Links    | 332,388 | 5,568    | 14,396 | 1,883,682 |
      | Sparsity  | 6.7e-5  | 2.1e-3   | 8.7e-4 | 2.0e-3    |
      | #Clusters | 4       | 4        | 8      | 10        |
      | #Objects  | 4       | 2        | 3      | 3         |
      | #Subnet.  | 3       | 2        | 2      | 2         |
      | Link Cat. | Binary  | Weighted | Binary | Fused     |

      [Figure: Network schemas of all data sets, in which circles of labelled object types are in grey. Dashed (resp., solid) lines refer to binary (resp., weighted) links. Object types shown: Paper, Author, Term, Venue, Image, User, Tag, Doc., Inv.]
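
The Sparsity row appears to be consistent with #Links divided by #Nodes squared. That definition is inferred from the numbers and is not stated on the slide; the sketch below simply checks the inference against the reported values.

```python
# (nodes, links, reported sparsity) copied from the table above.
stats = {
    "DBLP": (70_536, 332_388, 6.7e-5),
    "4Groups": (1_618, 5_568, 2.1e-3),
    "Flickr": (4_076, 14_396, 8.7e-4),
    "NSF": (30_995, 1_883_682, 2.0e-3),
}


def sparsity(num_nodes, num_links):
    """Assumed definition: links over the square of the node count."""
    return num_links / num_nodes ** 2


computed = {name: sparsity(n, m) for name, (n, m, _) in stats.items()}
for name, (_, _, reported) in stats.items():
    print(f"{name}: computed {computed[name]:.1e}, reported {reported:.1e}")
```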
