
GIN: A Clustering Model for Capturing Dual Heterogeneity in Networked Data
Jialu Liu, Chi Wang, Jing Gao, Quanquan Gu, Charu Aggarwal, Lance Kaplan, Jiawei Han
May 1, 2015

Outline
1. Heterogeneity in Networked Data
2. GIN, the Proposed Network Clustering Algorithm: Modeling Subnetworks; Unified Model
3. Experiments

Networked Data
Many real-world data sets can be represented as a network (or graph) composed of nodes that are interconnected via meaningful links.

Node Heterogeneity
Real networks are likely to contain multiple types of nodes.

Link Heterogeneity
Links can likewise be categorized into different types: binary (unweighted) links and weighted links. Besides carrying weights, links can be directed or undirected.
Figure: Two example graphs, one with binary/unweighted links and one with weighted links.

Dual Heterogeneity
This work addresses heterogeneous networks that contain interconnected multi-typed nodes and links. Specifically, links are undirected but may be either binary or weighted.
Figure: Panels (a) to (d) show example networks over Author (A1 to A3), Paper (P1 to P4), and Venue (V1, V2) nodes. Dashed line: binary links. Solid line: weighted links.

Task and Novelty
Network Clustering: Given a general heterogeneous network, we aim to find a clustering solution in which each cluster consists of multiple types of nodes and links.
Novelty compared with previous work:
- We consider heterogeneity in both nodes and links.
- The algorithm imposes no requirement on the network schema.
- The algorithm shows that sampling unobserved links (negative sampling) improves performance.

Subnetworks
A subnetwork of a heterogeneous network is either a homogeneous network or a bipartite network. A network whose number of object types is T = 1 is called a homogeneous network. It is called a bipartite network when T = 2 and links exist only between the two object types.

Symbols
We use $G$ to denote a heterogeneous network and $G^{(uv)}$ to denote one of its subnetworks, which is homogeneous if object type $u$ equals $v$ and bipartite otherwise. $G^{(uv)}$ can be either unweighted or weighted. That is, the link $e_{ij}^{(uv)}$ between nodes $x_i^{(u)}$ and $x_j^{(v)}$ has a weight $W_{ij}^{(uv)}$ that is either binary or any non-negative value.

Subnetworks with Binary Links
Suppose the probability of a link between nodes $x_i^{(u)}$ and $x_j^{(v)}$ is $P(e_{ij}^{(uv)} = 1)$. Specifically, we factorize it as
$$P(e_{ij}^{(uv)} = 1) = \sum_{k=1}^{K} \theta_{ik}^{(u)} \theta_{jk}^{(v)},$$
where $\{\theta_{ik}^{(u)}\}_{k=1}^{K}$ is a vector of length $K$ indicating the cluster membership of node $x_i^{(u)}$. This factorization implies that two nodes are more likely to be connected if they have similar cluster distributions.
Example: $\theta_i^{(u)} = (0.1, 0, 0, 0.6, 0, 0.1, 0.2)$ and $\theta_j^{(v)} = (0, 0.1, 0, 0.7, 0, 0.2, 0)$ give a link probability of 0.44.
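The slide's numeric example can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `link_probability` is not from the presentation, and the two vectors are the ones shown in the example figure.

```python
def link_probability(theta_i, theta_j):
    """Return sum_k theta_ik * theta_jk, the factorized link probability."""
    if len(theta_i) != len(theta_j):
        raise ValueError("membership vectors must have the same length K")
    return sum(a * b for a, b in zip(theta_i, theta_j))


# Membership vectors from the slide's example (K = 7 clusters).
theta_i = [0.1, 0.0, 0.0, 0.6, 0.0, 0.1, 0.2]
theta_j = [0.0, 0.1, 0.0, 0.7, 0.0, 0.2, 0.0]

p = link_probability(theta_i, theta_j)
print(round(p, 2))  # 0.6 * 0.7 + 0.1 * 0.2 = 0.44
```

Only clusters in which both nodes have non-zero membership contribute to the probability, which is why overlapping memberships make a link more likely.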

The underlying generative process for link $e_{ij}^{(uv)}$ is as follows:
$$e_{ij}^{(uv)} \sim \mathrm{Bernoulli}\Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big).$$
For the whole set of binary links $E^{(uv)}$, the following likelihood can be derived to estimate the parameters:
$$\prod_{i<j} \Big(P(e_{ij}^{(uv)} = 1)\Big)^{W_{ij}^{(uv)}} \underbrace{\Big(P(e_{ij}^{(uv)} = 0)\Big)^{1 - W_{ij}^{(uv)}}}_{\text{Unobserved Links}} \qquad (1)$$
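As a rough sketch, the logarithm of likelihood (1) can be evaluated as follows for a single homogeneous subnetwork (u = v). The function name, the nested-list data layout, and the small clamping constant `eps` are assumptions made for illustration and do not come from the presentation.

```python
import math


def binary_log_likelihood(theta, W, eps=1e-12):
    """Log of likelihood (1) for one homogeneous binary subnetwork.

    theta[i] is the membership vector of node i; W[i][j] is 0 or 1.
    The product over i < j becomes a sum of logs.
    """
    n = len(theta)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            p = sum(a * b for a, b in zip(theta[i], theta[j]))
            # Assumption: clamp p away from 0 and 1 to avoid log(0).
            p = min(max(p, eps), 1.0 - eps)
            total += W[i][j] * math.log(p) + (1 - W[i][j]) * math.log(1.0 - p)
    return total


theta = [[0.5, 0.5], [0.5, 0.5]]
W = [[0, 1], [1, 0]]
ll = binary_log_likelihood(theta, W)  # a single observed link with p = 0.5
```

Every unobserved pair contributes a log(1 - p) term, so the cost of a full evaluation grows with the number of node pairs rather than the number of links.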

Subnetworks with Weighted Links
Similar to the Bernoulli setting in the previous subsection, we first model the existence of a link between a given pair of nodes. In addition to the cluster membership vector $\theta_i^{(u)}$, we introduce a scale parameter $\sigma_i^{(u)}$ for each node $x_i^{(u)}$ to account for the weighted setting. This gives the following generative process for weighted links:
$$\begin{aligned}
&\text{(a)}\quad e_{ij}^{(uv)} \sim \mathrm{Bernoulli}\Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \\
&\text{(b)}\quad \text{If } e_{ij}^{(uv)} = 1,\quad \omega_{ij}^{(uv)} \sim \mathrm{Poisson}\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)
\end{aligned} \qquad (2)$$
where the discrete random variable $\omega_{ij}^{(uv)}$ is the weight of the link.
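The two-stage process in (2) can be simulated with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and the Poisson sampler uses Knuth's multiplication method, which is a convenience choice here and is only suitable for small rates.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw from Poisson(rate) with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_weighted_link(theta_i, theta_j, sigma_i, sigma_j, rng):
    """Return (e, omega): link indicator from step (a), weight from step (b)."""
    s = sum(a * b for a, b in zip(theta_i, theta_j))
    if rng.random() >= s:          # step (a): Bernoulli(s) came up 0
        return 0, 0
    omega = sample_poisson(sigma_i * sigma_j * s, rng)  # step (b)
    return 1, omega


rng = random.Random(0)
no_link = sample_weighted_link([1.0, 0.0], [0.0, 1.0], 2.0, 2.0, rng)
e, omega = sample_weighted_link([1.0, 0.0], [1.0, 0.0], 2.0, 2.0, rng)
```

With disjoint memberships the Bernoulli probability is zero and no link is ever generated; with identical one-hot memberships the link always exists and its weight is drawn with rate equal to the product of the two scale parameters.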

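The corresponding likelihood for the weighted links of $G^{(uv)}$ is
$$\underbrace{\prod_{W_{ij}^{(uv)} = 0} \Big(1 - \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)}_{\text{Unobserved Links}} \times \prod_{W_{ij}^{(uv)} > 0} \Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \frac{\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}}}{W_{ij}^{(uv)}!} \times e^{-\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}}. \qquad (3)$$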
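A minimal sketch of the log of likelihood (3) for one homogeneous weighted subnetwork is given below. The function name, the data layout, the restriction to pairs with i < j, and the clamping of the link probability are assumptions made for illustration; weights are assumed to be non-negative integers so that the factorial is defined.

```python
import math


def weighted_log_likelihood(theta, sigma, W, eps=1e-12):
    """Log of likelihood (3): Bernoulli existence times Poisson weight."""
    n = len(theta)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s = sum(a * b for a, b in zip(theta[i], theta[j]))
            s = min(max(s, eps), 1.0 - eps)  # assumption: avoid log(0)
            w = W[i][j]
            if w == 0:
                total += math.log(1.0 - s)   # unobserved link
            else:
                rate = sigma[i] * sigma[j] * s
                # log(s) + w*log(rate) - log(w!) - rate
                total += (math.log(s) + w * math.log(rate)
                          - math.lgamma(w + 1) - rate)
    return total


theta = [[0.5, 0.5], [0.5, 0.5]]
sigma = [1.0, 1.0]
W = [[0, 2], [2, 0]]
ll = weighted_log_likelihood(theta, sigma, W)
```

Here `math.lgamma(w + 1)` supplies log(w!), which keeps the computation stable for large link weights.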

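Objective Function
We first define two sets of subnetworks belonging to the same heterogeneous network $G$: $\mathcal{B}$ and $\mathcal{W}$. They contain the subnetworks with binary and weighted links respectively, such that $\mathcal{B} \cup \mathcal{W} = G$ and $\mathcal{B} \cap \mathcal{W} = \emptyset$. The objective is the joint likelihood
$$\begin{aligned}
&\prod_{G^{(uv)} \in \mathcal{B}} \prod_{i<j} \Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}} \Big(1 - \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{1 - W_{ij}^{(uv)}} \\
&\times \prod_{G^{(uv)} \in \mathcal{W}} \prod_{W_{ij}^{(uv)} = 0} \Big(1 - \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \\
&\times \prod_{W_{ij}^{(uv)} > 0} \Big(\sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big) \frac{\Big(\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}\Big)^{W_{ij}^{(uv)}}}{W_{ij}^{(uv)}!} \times e^{-\sigma_i^{(u)} \sigma_j^{(v)} \sum_k \theta_{ik}^{(u)} \theta_{jk}^{(v)}}.
\end{aligned} \qquad (4)$$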

Complete Log-likelihood
Directly optimizing the previous expression is difficult. We apply the EM algorithm, using $\phi_{ijk_1k_2}^{(uv)}$ to denote the posterior probability that an unobserved link is generated from different cluster assignments of its two end nodes, i.e., $k_1 \neq k_2$. Meanwhile, we use $\psi_{ijk}^{(uv)}$ to denote the posterior probability that a link results from the same cluster assignment of its two end nodes.
$$\begin{aligned}
L(\Theta, \Sigma) ={} & \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \sum_k \psi_{ijk}^{(uv)} \log \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
& + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \sum_k \Big(W_{ij}^{(uv)} + 1\Big) \psi_{ijk}^{(uv)} \log \theta_{ik}^{(u)} \theta_{jk}^{(v)} \\
& + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{k_1 \neq k_2} \phi_{ijk_1k_2}^{(uv)} \log \theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)} \\
& + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} W_{ij}^{(uv)} \log \sigma_i^{(u)} \sigma_j^{(v)}.
\end{aligned} \qquad (5)$$

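Update Functions
Expectation Step:
$$\phi_{ijk_1k_2}^{(uv)} = \frac{\theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}}, \qquad \psi_{ijk}^{(uv)} = \frac{\theta_{ik}^{(u)} \theta_{jk}^{(v)}}{\sum_l \theta_{il}^{(u)} \theta_{jl}^{(v)}}.$$
Maximization Step:
$$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \Big(W_{ij}^{(uv)} + 1\Big) \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)}.$$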
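To make the updates concrete, here is a hedged sketch of one EM iteration for the simplest case only: a single homogeneous subnetwork with binary links, so the weighted term of the M-step is absent. The function name `em_step`, the dense adjacency layout, and the zero-denominator guards are assumptions for illustration; each updated row is normalized to sum to one, which follows from the proportionality sign in the M-step.

```python
def em_step(theta, W):
    """One EM iteration of the membership update for a binary homogeneous network."""
    n, K = len(theta), len(theta[0])
    new_theta = []
    for i in range(n):
        acc = [0.0] * K
        for j in range(n):
            if j == i:
                continue
            if W[i][j] == 1:
                # E-step: psi_ijk for an observed link.
                denom = sum(theta[i][l] * theta[j][l] for l in range(K))
                if denom > 0:
                    for k in range(K):
                        acc[k] += theta[i][k] * theta[j][k] / denom
            else:
                # E-step: sum over l != k of phi_ijkl for an unobserved link.
                denom = sum(theta[i][a] * theta[j][b]
                            for a in range(K) for b in range(K) if a != b)
                if denom > 0:
                    for k in range(K):
                        off = sum(theta[j][l] for l in range(K) if l != k)
                        acc[k] += theta[i][k] * off / denom
        total = sum(acc)
        # M-step: normalize so that each membership vector sums to 1.
        new_theta.append([a / total for a in acc] if total > 0 else list(theta[i]))
    return new_theta


theta0 = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
W = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
theta1 = em_step(theta0, W)
```

The inner loops evaluate the unobserved-link term naively, so this version costs on the order of K squared per node pair.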

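Efficiency Issue
Computing every $\phi_{ijk_1k_2}^{(uv)}$ individually costs $O(K^2)$ per node pair:
$$\phi_{ijk_1k_2}^{(uv)} = \frac{\theta_{ik_1}^{(u)} \theta_{jk_2}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}}$$
The M-step only needs the sum over $l \neq k$, which costs $O(K)$:
$$\sum_{l \neq k} \phi_{ijkl}^{(uv)} = \frac{\sum_{l \neq k} \theta_{ik}^{(u)} \theta_{jl}^{(v)}}{\sum_{l_1 \neq l_2} \theta_{il_1}^{(u)} \theta_{jl_2}^{(v)}} = \frac{\theta_{ik}^{(u)} - \theta_{ik}^{(u)} \theta_{jk}^{(v)}}{1 - \sum_l \theta_{il}^{(u)} \theta_{jl}^{(v)}}$$
The update then becomes
$$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \Big(W_{ij}^{(uv)} + 1\Big) \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \Big[\sum_{l \neq k} \phi_{ijkl}^{(uv)}\Big]. \qquad (6)$$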
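The closed form can be checked numerically against the naive double sum. The sketch below assumes that both membership vectors sum to one, which is what the simplification relies on; the function names are hypothetical.

```python
def phi_sum_naive(theta_i, theta_j, k):
    """Sum over l != k of phi_ijkl using the O(K^2) denominator."""
    K = len(theta_i)
    denom = sum(theta_i[a] * theta_j[b]
                for a in range(K) for b in range(K) if a != b)
    numer = sum(theta_i[k] * theta_j[l] for l in range(K) if l != k)
    return numer / denom


def phi_sum_fast(theta_i, theta_j, k):
    """Same quantity via the closed form; one O(K) dot product per pair."""
    dot = sum(a * b for a, b in zip(theta_i, theta_j))
    return (theta_i[k] - theta_i[k] * theta_j[k]) / (1.0 - dot)


theta_i = [0.1, 0.0, 0.0, 0.6, 0.0, 0.1, 0.2]
theta_j = [0.0, 0.1, 0.0, 0.7, 0.0, 0.2, 0.0]
diffs = [abs(phi_sum_naive(theta_i, theta_j, k) - phi_sum_fast(theta_i, theta_j, k))
         for k in range(len(theta_i))]
```

Because the dot product is shared by all k, computing the term for every cluster of a node pair costs O(K) in total instead of O(K^2).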

Sampling Unobserved Links
For the unobserved links, the space and time complexity increases significantly if we need to go over all of them. To alleviate this burden, we sample a potential neighbourhood for each node. This also downweights the third term of the update for $\theta_{ik}^{(u)}$:
$$\theta_{ik}^{(u)} \propto \sum_{G^{(uv)} \in \mathcal{B}} \sum_{W_{ij}^{(uv)} = 1} \psi_{ijk}^{(uv)} + \sum_{G^{(uv)} \in \mathcal{W}} \sum_{W_{ij}^{(uv)} > 0} \Big(W_{ij}^{(uv)} + 1\Big) \psi_{ijk}^{(uv)} + \underbrace{\sum_{G^{(uv)} \in G} \sum_{W_{ij}^{(uv)} = 0} \sum_{l \neq k} \phi_{ijkl}^{(uv)}}_{\downarrow} \qquad (7)$$
We keep all the non-zero links and sample $\eta M$ unobserved links, so that the sample size is proportional to the total number of links $M$ (we choose $\eta = 0.1$ in the experiments).
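One simple way to draw the unobserved links is rejection sampling over node pairs, sketched below for a homogeneous network. This is a simplification: it samples pairs uniformly over the whole network rather than building a neighbourhood per node as the slide describes, and the function name, the truncation of ηM to an integer, and the seed handling are all assumptions.

```python
import random


def sample_unobserved_links(num_nodes, observed, eta=0.1, seed=0):
    """Sample about eta * M node pairs that are not observed links."""
    rng = random.Random(seed)
    observed = {(min(i, j), max(i, j)) for i, j in observed}
    max_unobserved = num_nodes * (num_nodes - 1) // 2 - len(observed)
    target = min(int(eta * len(observed)), max_unobserved)
    sampled = set()
    while len(sampled) < target:
        i, j = rng.sample(range(num_nodes), 2)   # two distinct nodes
        pair = (min(i, j), max(i, j))
        if pair not in observed:                 # reject observed links
            sampled.add(pair)
    return sampled


# Hypothetical toy network: a ring of 10 nodes, so M = 10 observed links.
observed = [(i, i + 1) for i in range(9)] + [(0, 9)]
sampled = sample_unobserved_links(10, observed, eta=0.5)
```

Rejection sampling is cheap when the network is sparse, since almost every random pair is unobserved.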

Datasets
Four real-world data sets were used.
- The DBLP data set is a collection of CS publications. We use a subset that belongs to four research areas.
- The 4Groups data set contains co-author and author-term relationships, where researchers are selected from four data mining and machine learning research groups.
- The Flickr data set is a network containing three types of objects: image, user and tag. Links exist between image-user and image-tag.
- The NSF data set describes NSF Research Awards Abstracts from 1990 to 2003. We use documents associated with terms and investigators that belong to the 10 largest programs.

The important statistics of the four data sets are summarized in the following table.

Data set     DBLP       4Groups    Flickr     NSF
#Nodes       70,536     1,618      4,076      30,995
#Links       332,388    5,568      14,396     1,883,682
Sparsity     6.7e-5     2.1e-3     8.7e-4     2.0e-3
#Clusters    4          4          8          10
#Objects     4          2          3          3
#Subnet.     3          2          2          2
Link Cat.    Binary     Weighted   Binary     Fused

Figure: Network schemas of all data sets (object types shown: Paper, Author, Venue, Term; Author, Term; Image, User, Tag; Doc., Inv., Term), in which circles of labelled object types are in grey. Dashed (resp., solid) lines refer to binary (resp., weighted) links.
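The presentation does not define the sparsity row. As a sanity check under an assumed definition, the reported values appear consistent with the number of links divided by the squared number of nodes; the sketch below recomputes that ratio from the table and compares it with the reported figures.

```python
# (nodes, links, reported sparsity) copied from the statistics table.
stats = {
    "DBLP": (70536, 332388, 6.7e-5),
    "4Groups": (1618, 5568, 2.1e-3),
    "Flickr": (4076, 14396, 8.7e-4),
    "NSF": (30995, 1883682, 2.0e-3),
}


def sparsity(num_nodes, num_links):
    """Assumed definition: links divided by the squared number of nodes."""
    return num_links / (num_nodes * num_nodes)


relative_errors = {
    name: abs(sparsity(n, m) - reported) / reported
    for name, (n, m, reported) in stats.items()
}
for name, err in relative_errors.items():
    print(f"{name}: computed {sparsity(*stats[name][:2]):.2e}, relative error {err:.1%}")
```

All four recomputed values agree with the table to within rounding, which supports the assumed definition but does not confirm it.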
