
SLIDE 1

Chapter 8-2: Community Detection
Jilles Vreeken
IRDM ‘15/16, 3 Dec 2015

SLIDE 2

IRDM Chapter 8, overview
1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Graph Clustering

You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16

SLIDE 3

IRDM Chapter 8, today
1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection

You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16

SLIDE 4

Chapter 7.4: Community Detection

Aggarwal Ch. 19.3, 17.5; Zaki & Meira Ch. 16

SLIDE 5

Chapter 7.4.1: Detecting Small Communities
SLIDE 6

Trawling

Searching for small communities in the Web graph. What is the signature of a community in a Web graph?
• intuition: many people all talking about the same things
• that is, a dense 2-layer (bipartite) graph

Use this to define “topics”: what the same people on the left talk about on the right.

SLIDE 7

Searching for small communities

A more well-defined problem:
• enumerate complete bipartite subgraphs L_{β,γ}
• where L_{β,γ} has β nodes on the “left”, and every such node links to the same set of γ nodes on the “right”

Example: L_{3,4} with left nodes Y and right nodes Z, |Y| = β = 3 and |Z| = γ = 4, fully connected between Y and Z.

SLIDE 8

Frequent itemset mining

Recall market basket analysis:
• market: a universe V of items
• baskets: n transactions, subsets of V: u_1, u_2, …, u_n ⊆ V, where each u_j is the set of items one person bought
• support: a frequency threshold τ

Goal:
• find all subsets Y ⊆ V s.t. Y ⊆ u_j for at least τ of the sets u_j

What’s the connection between itemsets and complete bipartite graphs?

SLIDE 9

From itemsets to bipartite L_{β,γ}

Frequent itemsets = complete bipartite graphs! How?
• view each node j as the set u_j of the nodes j points to, e.g. a node i linking to b, c, d, e gives u_i = {b, c, d, e}
• L_{β,γ} = a set Z of size γ that occurs in β sets u_j
• looking for L_{β,γ} → set the frequency threshold to β and find all frequent sets of size γ

β … minimum support (|Y| = β); γ … itemset size (|Z| = γ)

SLIDE 10

From itemsets to bipartite L_{β,γ} (Kumar et al. ’99)

1) View each node j as the set u_j of nodes j points to, e.g. u_i = {b, c, d, e}
2) Find frequent itemsets, with β … minimum support and γ … itemset size
3) Say we find a frequent itemset Z = {b, c, d} of support β: this means there are β nodes that all link to each of {b, c, d}
4) We found L_{β,γ}! (a set Z of size γ that occurs in β sets u_j)

SLIDE 11

Example

Support threshold β = τ = 2

Out-link itemsets: u_b = {c, d, e}, u_c = {e}, u_d = {c, e, f, g}, u_e = {f, g}, u_f = {c, e}, u_g = {}

Frequent itemsets:
• {c, e}: support 3
• {f, g}: support 2
• i.e. we found 2 bipartite subgraphs
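The itemset view of trawling can be sketched in a few lines. This is only a brute-force illustration under the slides' small example, not Kumar et al.'s actual algorithm (which relies on Apriori-style pruning to scale to the Web graph); the `out_links` dictionary is an assumed encoding of the example slide's toy graph.

```python
from itertools import combinations

# Out-link sets of the example slide: node -> set of nodes it points to
out_links = {
    "b": {"c", "d", "e"},
    "c": {"e"},
    "d": {"c", "e", "f", "g"},
    "e": {"f", "g"},
    "f": {"c", "e"},
    "g": set(),
}

def bipartite_cores(out_links, beta, gamma):
    """Find all L_{beta,gamma} cores: sets Z of gamma right-nodes such that
    at least beta left-nodes all point to every node of Z."""
    items = sorted(set().union(*out_links.values()))
    cores = []
    for Z in combinations(items, gamma):
        Y = sorted(v for v, links in out_links.items() if set(Z) <= links)
        if len(Y) >= beta:
            cores.append((Y, list(Z)))
    return cores

print(bipartite_cores(out_links, beta=2, gamma=2))
# [(['b', 'd', 'f'], ['c', 'e']), (['d', 'e'], ['f', 'g'])]
```

The output reproduces the slide: {c, e} with support 3 (linked from b, d, f) and {f, g} with support 2 (linked from d, e).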

SLIDE 12

Chapter 7.4.2: Community Detection by Graph Clustering

Aggarwal Ch. 17.5, 19.3; Zaki & Meira Ch. 16

SLIDE 13

Where do graphs come from?

We can have data in graph form
• e.g. our social networks

Or, we map existing data to a graph
• data points become vertices
• add an edge if two data points are similar
• edge weights can also express the degree of similarity

SLIDE 14

Similarity and adjacency matrices

A similarity matrix is an n-by-n non-negative, symmetric matrix
• the opposite of a distance matrix

Recall that a weighted adjacency matrix is also an n-by-n non-negative, symmetric matrix
• for weighted, undirected graphs

So, we can think of every similarity matrix as the adjacency matrix of some weighted, undirected graph
• this graph will be complete (a clique)

Further, we can use any similarity measure between two points as an edge weight

SLIDE 15

Getting non-complete graphs

Using complete graphs can be a waste of resources
• for clustering, we don’t really care about very dissimilar pairs

We can remove edges between dissimilar vertices
• zero weight

Or, we adjust the weights to diminish dissimilar points
• the Gaussian kernel is popular for this:

  w_ij = exp( −‖x_i − x_j‖² / (2σ²) )
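The Gaussian-kernel weighting above is easy to sketch; the data matrix `X` below is an invented toy example, not from the slides.

```python
import numpy as np

def gaussian_similarity(X, sigma=1.0):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for every pair of rows of X."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
W = gaussian_similarity(X, sigma=1.0)
# W is symmetric with ones on the diagonal; the far-away third point
# gets near-zero similarity to the first two, which is the point of
# the kernel: dissimilar pairs are diminished rather than cut outright
```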

SLIDE 16

Getting non-complete graphs (2)

How to decide when vertices are too dissimilar?

In ε-neighbour graphs we add an edge between two vertices that are within distance ε of each other
• usually the resulting graph is considered unweighted, as all weights would be roughly similar

In k-nearest-neighbour graphs we connect two vertices if one is within the k nearest neighbours of the other
• in a mutual k-nearest-neighbour graph we only connect two vertices if they are both in each other’s k nearest neighbours
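The difference between the plain and the mutual k-NN graph is just "or" versus "and" over the directed neighbour relation. A minimal sketch, with an invented 1-d toy data set:

```python
import numpy as np

def knn_graph(X, k, mutual=False):
    """0/1 adjacency matrix of the (mutual) k-nearest-neighbour graph."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours per point
    A = np.zeros_like(d, dtype=int)
    for i, neigh in enumerate(nn):
        A[i, neigh] = 1
    # mutual: edge only if both directions agree; otherwise: either direction
    return (A & A.T) if mutual else (A | A.T)

X = np.array([[0.0], [1.0], [2.0], [10.0]])
# the outlier at 10 picks 2 as its neighbour, so the plain 1-NN graph
# connects them, while the mutual 1-NN graph does not
```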

SLIDE 17

Which similarity graph?

With ε-graphs, choosing the parameter is hard
• no single correct answer if different clusters have different internal similarities

k-nearest neighbours can connect points with different similarities
• but far-away high-density regions become unconnected

The mutual k-nearest-neighbour graph is somewhat in between
• good for detecting clusters with different densities

General recommendation: start with k-NN, try the others if the data supports that

SLIDE 18

Example graph (Zaki & Meira, Fig 16.1)

SLIDE 19

Graph partitioning

Undirected graph. Bi-partitioning task:
• divide the vertices into two disjoint groups A and B

Questions:
• how can we define a “good partition”?
• how can we efficiently identify such a partition?

(figure: a 6-node example graph, split into groups A and B)

SLIDE 20

Graph partitioning

What makes a good partition?
• maximize the number of within-group connections
• minimize the number of between-group connections

SLIDE 21

Clustering as graph cuts

A cut of a connected graph G = (V, E) divides the set of vertices into two partitions S and V ∖ S and removes the edges between them
• a cut can be expressed by giving the set S, or by giving the cut set, i.e. the edges with exactly one end in S:

  C = { (u, v) ∈ E : |{u, v} ∩ S| = 1 }

A graph cut groups the vertices of a graph into two clusters
• subsequent cuts in the components give us a hierarchical clustering

A k-way cut cuts the graph into k disjoint sets of vertices C_1, C_2, …, C_k and removes the edges between them

SLIDE 22

What is a good cut?

Not every cut will cut it. In minimum cut, the goal is to find a set of vertices such that cutting it from the rest of the graph requires removing the least number of edges
• least sum of weights for weighted graphs
• the extension to multiway cuts is straightforward

The minimum cut can be found in polynomial time
• via the max-flow min-cut theorem

But the minimum cut isn’t very good for clustering purposes

SLIDE 23

Cuts that cut it

The minimum cut usually cuts off only one vertex
• not a very appealing clustering
• we want to penalize the cut for imbalanced cluster sizes

In ratio cut, the goal is to minimize the ratio between the weight of the edges in the cut set and the size of the clusters C_i
• let W(A, B) = Σ_{i∈A, j∈B} w_ij, where w_ij is the weight of edge (i, j)

  RatioCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / |C_i|

SLIDE 24

Cuts that cut it

The volume of a set of vertices S is the weight of all edges connected to S:

  vol(S) = W(S, V) = Σ_{i∈S, j∈V} w_ij

In normalized cut we measure the size of C_i by vol(C_i) instead of |C_i|:

  NormalisedCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / vol(C_i)

SLIDE 25

Cuts that cut it

The volume of a set of vertices S is the weight of all edges connected to S:

  vol(S) = W(S, V) = Σ_{i∈S, j∈V} w_ij

In normalized cut we measure the size of C_i by vol(C_i) instead of |C_i|:

  NormalisedCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / vol(C_i)

Finding the optimal RatioCut or NormalisedCut is NP-hard
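Both objectives are cheap to evaluate for a given partition. A sketch on the 6-node example graph used on the surrounding slides (its edge list, reconstructed here 0-indexed, is an assumption based on the adjacency and Laplacian tables shown later):

```python
import numpy as np

# assumed edges of the slides' example graph: 1-2,1-3,1-5,2-3,3-4,4-5,4-6,5-6
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def cut_weight(A, S):
    """W(S, V \\ S): total weight of edges with exactly one end in S."""
    S = sorted(S)
    T = [v for v in range(len(A)) if v not in S]
    return A[np.ix_(S, T)].sum()

def ratio_cut(A, clusters):
    return sum(cut_weight(A, C) / len(C) for C in clusters)

def normalised_cut(A, clusters):
    deg = A.sum(axis=1)
    return sum(cut_weight(A, C) / deg[sorted(C)].sum() for C in clusters)

clusters = [{0, 1, 2}, {3, 4, 5}]
# two edges cross this cut, so RatioCut = 2/3 + 2/3 and, with both
# cluster volumes equal to 8, NormalisedCut = 2/8 + 2/8
```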

SLIDE 26

Spectral graph partitioning

• A: adjacency matrix of an undirected graph G
  • A_ij = 1 if (i, j) is an edge, else 0
• x is a vector in ℝ^n with one component per node
  • think of it as a label/value of each node

What is the meaning of A · x?

  (A x)_i = Σ_{j=1}^{n} A_ij x_j = Σ_{(i,j)∈E} x_j

Entry i of A x is the sum of the labels x_j of the neighbours of i
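The neighbour-sum interpretation of A · x is easy to check numerically on the same assumed 6-node example graph:

```python
import numpy as np

# assumed edges of the slides' example graph (0-indexed)
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # one label per node
z = A @ x
# z[i] is the sum of the labels of i's neighbours; e.g. node 2
# (index 1) has neighbours 1 and 3, so z[1] = x[0] + x[2]
```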

SLIDE 27

What is the meaning of A x?

The i-th coordinate of A · x:
• the sum of the x-values of the neighbours of i
• make this the new value at node i

Spectral graph theory:
• analyse the spectrum of the matrix
• the spectrum is the set of eigenvectors of a graph, ordered by the magnitude (strength) of their corresponding eigenvalues, Λ = (λ_1, λ_2, …, λ_n) with λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n

An eigenpair satisfies A · x = λ · x

SLIDE 28

Example: d-regular graph

Suppose all nodes in a connected graph G have degree d. What are some eigenvalues and eigenvectors of G? In A · x = λ · x, what is λ, and what is x?
• let’s try x = (1, 1, …, 1)
• then A · x = (d, d, …, d) = λ · x, so λ = d

We found an eigenpair of G: x = (1, 1, …, 1), λ = d

Remember the meaning of z = A · x: z_i = Σ_j A_ij x_j = Σ_{(i,j)∈E} x_j

SLIDE 29

Example: graph of 2 components

What if G is not connected?
• G has 2 components C_1 and C_2, each d-regular

What are some eigenvectors?
• x = put all 1s on C_1 and 0s on C_2, or vice versa
  • x′ = (1, …, 1, 0, …, 0), then A · x′ = (d, …, d, 0, …, 0)
  • x′′ = (0, …, 0, 1, …, 1), then A · x′′ = (0, …, 0, d, …, d)
  • so in both cases the corresponding λ = d

A bit of intuition: with two components the two largest eigenvalues coincide, λ_n = λ_{n−1}; with two loosely connected components, λ_n − λ_{n−1} ≈ 0, i.e. the 2nd-largest eigenvalue λ_{n−1} has a value very close to λ_n

SLIDE 30

More intuition

If the graph is connected, then we already know that x_n = (1, …, 1) is an eigenvector. Since eigenvectors are orthogonal, the components of x_{n−1} sum to 0
• why? Because x_n · x_{n−1} = Σ_i x_n[i] · x_{n−1}[i] = Σ_i x_{n−1}[i] = 0

So, we can look at the eigenvector of the 2nd-largest eigenvalue and declare the nodes with a positive label to be in C_1 and those with a negative label in C_2. Still, lots to sort out.

SLIDE 31

Matrix representations

The (weighted) adjacency matrix A has the weight of edge (i, j) at position a_ij
• n×n matrix
• A = [a_ij], with a_ij = 1 if there is an edge between nodes i and j, else 0

Important properties:
• symmetric matrix
• eigenvectors are real and orthogonal

Example (6-node graph with edges 1-2, 1-3, 1-5, 2-3, 3-4, 4-5, 4-6, 5-6):

        1  2  3  4  5  6
    1   0  1  1  0  1  0
    2   1  0  1  0  0  0
    3   1  1  0  1  0  0
    4   0  0  1  0  1  1
    5   1  0  0  1  0  1
    6   0  0  0  1  1  0
slide-32
SLIDE 32

IRDM ‘15/16

Matrix representations (2)

The degree m ee matri rix of a graph is a diagonal 𝑜-by-𝑜 matrix with the degree of vertex 𝑗 at position 𝚬𝑗𝑗 = 𝑒𝑗

 𝚬𝑗𝑗 = 𝑒𝑗 = ∑ 𝑏𝑗𝑗

𝑗

= degree of node i

 n× n diagonal matrix

VIII-2: 32

1 3 2 5 4 6

1 2 3 4 5 6 1

3

2

2

3

3

4

3

5

3

6

2

SLIDE 33

Matrix representations (3)

The normalized adjacency matrix N is the adjacency matrix where in every row i all values are divided by d_i
• every row sums up to 1
• N = D⁻¹A

(picture is on vacation)

SLIDE 34

Matrix representations (4)

The Laplacian matrix L of a graph is the adjacency matrix subtracted from the degree matrix:

  L = D − A

• n×n symmetric matrix

Important properties:
• eigenvalues are non-negative real numbers
• eigenvectors are real and orthogonal

For the example graph:

        1  2  3  4  5  6
    1   3 −1 −1  0 −1  0
    2  −1  2 −1  0  0  0
    3  −1 −1  3 −1  0  0
    4   0  0 −1  3 −1 −1
    5  −1  0  0 −1  3 −1
    6   0  0  0 −1 −1  2

SLIDE 35

Matrix representations (4)

The Laplacian matrix L of a graph is the adjacency matrix subtracted from the degree matrix
• n×n symmetric matrix

  L = D − A =
    [ Σ_{j≠1} a_1j    −a_12     …    −a_1n         ]
    [ −a_21    Σ_{j≠2} a_2j    …    −a_2n          ]
    [   ⋮                      ⋱      ⋮            ]
    [ −a_n1    −a_n2    …    Σ_{j≠n} a_nj          ]

The Laplacian is symmetric and positive semi-definite
• (for undirected graphs)
• has n real, orthogonal eigenvectors, with non-negative eigenvalues 0 ≤ λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n
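The stated properties can be verified directly on the assumed 6-node example graph; its Laplacian spectrum is the one shown later on the spectral-partitioning slide, (0, 1, 3, 3, 4, 5).

```python
import numpy as np

# assumed edges of the slides' example graph (0-indexed)
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A

lam = np.linalg.eigvalsh(L)  # eigenvalues in ascending order
# L is symmetric, every row sums to 0 (so (1,...,1) is a 0-eigenvector),
# and all eigenvalues are non-negative: here (0, 1, 3, 3, 4, 5)
```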

SLIDE 36

The normalised, symmetric Laplacian

The normalised, symmetric Laplacian matrix L_s of a graph is defined as

  L_s = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2}

with entries (L_s)_ii = (Σ_{k≠i} a_ik) / d_i on the diagonal and (L_s)_ij = −a_ij / √(d_i d_j) off the diagonal, and is also positive semi-definite.

The normalised, asymmetric Laplacian L_a is L_a = D^{−1} L

SLIDE 37

Clusterings and matrices, redux

Recall that we can express a clustering using a binary cluster assignment matrix
• each row has exactly one non-zero entry

Let the i-th column of this matrix be c_i
• clusters are disjoint, so c_i^T c_j = 0 for i ≠ j
• cluster C_i has c_i^T c_i = ‖c_i‖² elements

We can express vol(C_i) and W(C_i, V ∖ C_i) using the c_i’s:
• vol(C_i) = Σ_{j∈C_i} d_j = Σ_j Σ_k c_i[j] D_jk c_i[k] = c_i^T D c_i
• W(C_i, C_i) = Σ_{j∈C_i} Σ_{k∈C_i} a_jk = c_i^T A c_i
• W(C_i, V ∖ C_i) = W(C_i, V) − W(C_i, C_i) = c_i^T (D − A) c_i = c_i^T L c_i

SLIDE 38

Cuts using matrices

  RatioCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / |C_i| = Σ_{i=1}^{k} (c_i^T L c_i) / ‖c_i‖²

  NormalisedCut = Σ_{i=1}^{k} W(C_i, V ∖ C_i) / vol(C_i) = Σ_{i=1}^{k} (c_i^T L c_i) / (c_i^T D c_i)

SLIDE 39

The second eigenvalue, λ_2, as an optimization problem

Fact: for a symmetric matrix M,

  λ_2 = min_{x ⊥ x_1} (x^T M x) / (x^T x)

where x_1 is the eigenvector of the smallest eigenvalue λ_1.

What is the meaning of min x^T L x on a graph G?

  x^T L x = Σ_{i,j} L_ij x_i x_j = Σ_{i,j} (D_ij − A_ij) x_i x_j
          = Σ_i d_i x_i² − Σ_{(i,j)∈E} 2 x_i x_j
          = Σ_{(i,j)∈E} (x_i² + x_j² − 2 x_i x_j)
          = Σ_{(i,j)∈E} (x_i − x_j)²

Node i has degree d_i, so the value x_i² is summed up d_i times; but each edge (i, j) has two endpoints, so for every edge we need x_i² + x_j².
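The identity x^T L x = Σ_{(i,j)∈E} (x_i − x_j)² is easy to check numerically for an arbitrary labelling, again on the assumed 6-node example graph:

```python
import numpy as np

# assumed edges of the slides' example graph (0-indexed)
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A

rng = np.random.default_rng(0)
x = rng.normal(size=6)                  # an arbitrary labelling of the nodes
lhs = x @ L @ x                         # quadratic form of the Laplacian
rhs = sum((x[i] - x[j]) ** 2 for i, j in edges)  # sum over edges
# the two agree up to floating-point error
```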

SLIDE 40

λ_2 as an optimization problem

What else do we know about x?
• x is a unit vector: Σ_i x_i² = 1
• x is orthogonal to the 1st eigenvector (1, …, 1), thus Σ_i x_i · 1 = Σ_i x_i = 0

Remember:

  λ_2 = min over all labelings x of the nodes with Σ_i x_i = 0 of
        ( Σ_{(i,j)∈E} (x_i − x_j)² ) / ( Σ_i x_i² )

We want to assign values x_i to the nodes such that few edges cross 0 (for an edge (i, j), we want x_i and x_j to cancel each other out)

SLIDE 41

Finding an optimal cut

Back to finding the optimal cut. Express the partition (C_1, C_2) as a vector c with

  c_i = +1 if i ∈ C_1, and c_i = −1 if i ∈ C_2

We can minimise the cut of the partition by finding a non-trivial vector c that minimises

  f(c) = Σ_{(i,j)∈E} (c_i − c_j)²  over c ∈ {−1, +1}^n

NP-hard… so, let’s relax!
• let the c_i’s take any real value

(Fiedler, 1973)

SLIDE 42

Rayleigh theorem

  min_{c ∈ ℝ^n} f(c) = Σ_{(i,j)∈E} (c_i − c_j)² = c^T L c

• λ_2 = min_c f(c): the minimum value of f(c) is given by the 2nd smallest eigenvalue λ_2 of the Laplacian matrix L
• x = arg min_c f(c): the optimal solution for c is given by the corresponding eigenvector x, referred to as the Fiedler vector

SLIDE 43

So far…

How to define a good partition of a graph?
• minimise a given graph cut criterion

How to efficiently identify such a partition?
• approximate, using information provided by the eigenvalues and eigenvectors of a graph

→ Spectral clustering

SLIDE 44

Spectral clustering algorithms

Three basic stages:
1. Pre-processing
   • construct a matrix representation of the graph
2. Decomposition
   • compute eigenvalues and eigenvectors of the matrix
   • map each point to a lower-dimensional representation based on one or more eigenvectors
3. Grouping
   • assign points to two or more clusters, based on the new representation

SLIDE 45

Spectral partitioning algorithm

1) Pre-processing:
• build the Laplacian matrix L of the graph

2) Decomposition:
• find the eigenvalues λ and eigenvectors x of the matrix L
• map the vertices to the corresponding components of the eigenvector of λ_2

For the example graph, λ = (0, 1, 3, 3, 4, 5), and the eigenvector of λ_2 = 1 assigns nodes 1…6 the values (0.3, 0.6, 0.3, −0.3, −0.3, −0.6).

How do we now find the clusters?

SLIDE 46

Spectral partitioning

3) Grouping:
• sort the components of the reduced 1-dimensional vector
• identify clusters by splitting the sorted vector in two

How to choose a splitting point?
• naïve approaches: split at 0, or at the median value
• more expensive approaches: minimise the normalised cut in 1 dimension (sweep over the ordering of nodes induced by the eigenvector)

Split at 0 in the example: cluster A gets the positive points (nodes 1, 2, 3 with values 0.3, 0.6, 0.3), cluster B the negative points (nodes 4, 5, 6 with values −0.3, −0.3, −0.6).
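The whole pipeline above fits in a few lines of numpy. This sketch again assumes the 6-node example graph reconstructed from the earlier matrix slides, and uses the naive split-at-0 grouping:

```python
import numpy as np

# assumed edges of the slides' example graph (0-indexed)
edges = [(0, 1), (0, 2), (0, 4), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A        # 1) pre-processing: Laplacian

lam, X = np.linalg.eigh(L)            # 2) decomposition: eigenvalues ascending
fiedler = X[:, 1]                     #    eigenvector of the 2nd-smallest one
# 3) grouping: nodes on the same side of 0 form one cluster
cluster_of_node0 = {i for i in range(6) if fiedler[i] * fiedler[0] > 0}
# this recovers the slide's partition {1, 2, 3} vs {4, 5, 6}
```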

SLIDE 47

Example: spectral partitioning

(plot: value of x_2 against rank in x_2)

SLIDE 48

Example: spectral partitioning

(plot: value of x_2 against rank in x_2)

The eigenvector corresponding to λ_2 is useful: it shows the communities!

SLIDE 49

Example: spectral partitioning

The eigenvector corresponding to λ_1 is useless: it doesn’t show anything. The eigenvector corresponding to λ_3 is useless by itself, but useful when considered together with the eigenvector of λ_2.

SLIDE 50

k-way spectral clustering

How do we partition a graph into k clusters? There are two basic approaches:
• recursive bi-partitioning (Hagen et al., ’92)
  • recursively apply a bi-partitioning algorithm in a hierarchical, divisive manner
  • inefficient and unstable
• clustering multiple eigenvectors (Shi & Malik, ’00)
  • build a reduced space from multiple eigenvectors
  • commonly used in recent papers
  • a preferable approach

SLIDE 51

Why use multiple eigenvectors?

Approximates the optimal cut
• can be used to approximate the optimal k-way normalized cut

Emphasizes cohesive clusters
• increases the unevenness in the distribution of the data
• associations between similar points are amplified, associations between dissimilar points are attenuated
• the data begins to “approximate a clustering”

Well-separated space
• transforms the data to a new “embedded space”, consisting of k orthogonal basis vectors

Multiple eigenvectors prevent instability due to information loss

(Shi & Malik, 2000)

SLIDE 52

Is spectral clustering optimal?

Spectral clustering is not always a good approximation of the optimal graph cut
• in so-called cockroach graphs, spectral clustering always cuts horizontally, while cutting vertically is optimal
• approximation ratio of O(n)

(figure: a cockroach graph on nodes v_1 … v_{4k}; the optimal cut is vertical, the spectral cut horizontal)

SLIDE 53

Spectral clustering

To do the clustering, we need to move from our real-valued eigenvectors v_i to binary cluster indicator vectors. First, create a matrix V with the v_i’s as its columns
• optionally, normalise the rows (especially when using L_s)

Then cluster the rows of this matrix using k-means
• or, in principle, any other clustering algorithm

Solving the eigenvectors is O(n³) in general, or O(n²) if the similarity graph has only as many edges as vertices
• the k-means step on the V matrix takes O(tnk²), where t is the number of iterations of k-means

SLIDE 54

Another look at approximate cuts

Allowing real-valued cluster assignment vectors c_i makes the relaxed RatioCut look like

  J = Σ_{i=1}^{k} (c_i^T L c_i) / ‖c_i‖²
    = Σ_{i=1}^{k} (c_i/‖c_i‖)^T L (c_i/‖c_i‖)
    = Σ_{i=1}^{k} v_i^T L v_i

• where v_i = c_i/‖c_i‖, i.e. the unit vector in the direction of c_i

SLIDE 55

Solving the relaxed version

We want to minimise the objective J over the v_i’s
• under the constraint that v_i^T v_i = 1

To solve, take the derivative w.r.t. the v_i’s and find the roots
• add Lagrange multipliers to incorporate the constraints:

  ∂/∂v_i ( Σ_{j=1}^{k} v_j^T L v_j + Σ_{j=1}^{k} λ_j (1 − v_j^T v_j) ) = 0

Hence L v_i = λ_i v_i
• v_i is an eigenvector of L corresponding to the eigenvalue λ_i

SLIDE 56

Which eigenvectors to choose?

We know that L v_i = λ_i v_i
• hence λ_i = v_i^T L v_i

As we’re minimising the sum of the v_i^T L v_i’s, we should choose the v_i’s corresponding to the k smallest eigenvalues
• these are our relaxed cluster indicators

But we also know that λ_1 = 0, and that the corresponding eigenvector is (n^{−1/2}, n^{−1/2}, …, n^{−1/2})
• hmm, that doesn’t help with clustering…

SLIDE 57

Normalised cut and the choice of Laplacians

For the normalised cut, a similar procedure shows that we should select the eigenvectors of the k smallest eigenvalues of L_s instead of L
• or, we can use the asymmetric Laplacian L_a

Which one should we choose?
• both ratio cut and normalised cut aim at minimising the inter-cluster similarity
• only normalised cut also takes the intra-cluster similarity into account → either L_s or L_a

The asymmetric Laplacian is preferable
• with the symmetric one, a further normalisation step is needed

SLIDE 58

Pseudo-code

Algorithm SpectralClustering(connected graph G, k):
  compute the similarity matrix A ∈ ℝ^{n×n} for G
  if ratio cut then B ← L
  else if normalised cut then B ← L_s or L_a
  solve B v_i = λ_i v_i for i = 2, …, k + 1, where λ_2 ≤ λ_3 ≤ ⋯ ≤ λ_{k+1}
  V ← (v_2, v_3, …, v_{k+1})
  Y ← normalise the rows of V
  {C_1, …, C_k} ← k-means on the rows of Y
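A runnable sketch of this procedure is below. Two deliberate deviations from the pseudo-code, both assumptions of this sketch rather than the slides' method: it keeps the k smallest eigenvectors including the trivial first one (a common variant that also behaves sensibly when the graph is not connected), and it uses a tiny deterministic k-means instead of a library call.

```python
import numpy as np

def kmeans(Z, k, iters=50):
    """Tiny Lloyd's k-means with deterministic farthest-first initialisation."""
    cent = [Z[0]]
    for _ in range(k - 1):                    # pick spread-out initial centroids
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in cent], axis=0)
        cent.append(Z[d.argmax()])
    cent = np.array(cent)
    for _ in range(iters):                    # standard Lloyd iterations
        lab = ((Z[:, None] - cent[None]) ** 2).sum(axis=-1).argmin(axis=1)
        cent = np.array([Z[lab == j].mean(axis=0) if (lab == j).any() else cent[j]
                         for j in range(k)])
    return lab

def spectral_clustering(A, k, normalised=False):
    """Sketch of the slides' SpectralClustering, keeping the k smallest
    eigenvectors (including the trivial one) rather than v_2 ... v_{k+1}."""
    d = A.sum(axis=1)
    L = np.diag(d) - A
    if normalised:                            # asymmetric Laplacian L_a = D^-1 L
        lam, U = np.linalg.eig(L / d[:, None])
        lam, U = lam.real, U.real
    else:                                     # ratio cut: plain Laplacian L
        lam, U = np.linalg.eigh(L)
    V = U[:, np.argsort(lam)[:k]]
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    Z = V / np.maximum(norms, 1e-12)          # normalise the rows of V
    return kmeans(Z, k)

# three disconnected triangles must come out as three clusters
A = np.zeros((9, 9))
for base in (0, 3, 6):
    for i in range(3):
        for j in range(i + 1, 3):
            A[base + i, base + j] = A[base + j, base + i] = 1.0
labels = spectral_clustering(A, 3)
```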

SLIDE 59

Conclusions

Frequent subgraph mining finds recurring patterns in graph data
• an enormously complex problem → exact algorithms can’t be fast
• but the graphs are usually not very big, even if there are many of them

Graph clustering is much like other clustering
• any clusterable data can be turned into a similarity graph
• spectral clustering uses well-known linear algebra
• though this doesn’t necessarily make it a good clustering algorithm
