

SLIDE 1

IRDM ‘15/16

Jilles Vreeken

Chapter 8: Graph Mining

1 Dec 2015. Revision 1, December 4th: typos fixed (edge order)

SLIDE 2

IRDM Chapter 8, overview

1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
5. Graph Clustering

You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16

SLIDE 3

IRDM Chapter 8, today

1. The basics
2. Properties of Graphs
3. Frequent Subgraphs
4. Community Detection
5. Graph Clustering

You’ll find this covered in: Aggarwal, Ch. 17, 19; Zaki & Meira, Ch. 4, 11, 16

SLIDE 4

Chapter 8.1: The Basics

Aggarwal Ch. 17.1

SLIDE 5

Networks are everywhere!

Examples: Human Disease Network [Barabasi 2007], Gene Regulatory Network [Decourty 2008], the Facebook network [2010], the Internet [2005]

SLIDE 6

The Internet

[figure: map of the Internet, exhibiting skewed degrees and robustness]

SLIDE 7

High school dating network

(Bearman et al., Am. Jnl. of Sociology, 2004. Image: Mark Newman)

Blue: male, pink: female. Interesting observations?
SLIDE 8

Karate club network

SLIDE 9

Friends

How many of you think that your friends have more friends than you? A recent Facebook study

 examined all of FB’s users: 721 million people with 69 billion friendships
 that is about 10% of the world’s population
 found that 93 percent of the time a user’s friend count was less than the average friend count of his or her friends
 users had an average of 190 friends, while their friends averaged 635 friends of their own

SLIDE 10

Reasons?

You are a loner? Your friends are extraverts? There are more extraverts than introverts in the world?

SLIDE 11

Example

Average number of friends?
= (1 + 3 + 2 + 2)/4 = 2

Average number of friends of friends?
= (3 + 1 + 2 + 2 + 3 + 2 + 3 + 2)/8 = (1×1 + 3×3 + 2×2 + 2×2)/8 = 2.25

(Strogatz, NYT 2012)

SLIDE 12

Always true (almost)!

Proof? Let 𝑋 be the degree of a random vertex, with 𝐸[𝑋] = ∑𝑗 𝑥𝑗/𝑛 and Var[𝑋] = 𝐸[(𝑋 − 𝐸[𝑋])²] = 𝐸[𝑋²] − 𝐸[𝑋]². The average number of friends of friends is

𝐸[𝑋²]/𝐸[𝑋] = 𝐸[𝑋] + Var[𝑋]/𝐸[𝑋] ≥ 𝐸[𝑋]

Essentially, it’s true whenever there is any spread in the number of friends (i.e. whenever there’s a non-zero variance).
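As a minimal sketch (in Python, with a hypothetical edge list matching the four-person example above), the two averages can be verified numerically:

```python
from collections import defaultdict

# Example graph from the Strogatz (NYT 2012) slide: degrees 1, 3, 2, 2.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("C", "D")]

adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

deg = {v: len(ns) for v, ns in adj.items()}

# Average number of friends: the mean degree E[X].
avg_friends = sum(deg.values()) / len(deg)

# Average number of friends of friends: for every vertex, list the
# degrees of its neighbours and average; equivalently E[X^2]/E[X].
fof = [deg[w] for v in adj for w in adj[v]]
avg_fof = sum(fof) / len(fof)

print(avg_friends)  # 2.0
print(avg_fof)      # 2.25
```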

SLIDE 13

Why graphs?

Many real-world data sets come in the form of graphs

 social networks  hyperlinks  protein–protein interaction  XML parse trees  …

Many of these graphs are enormous

 humans cannot understand them → a task for data mining!

SLIDE 14

What is a graph?

A graph 𝐺 is a pair (𝑉, 𝐸 ⊆ 𝑉²)

 elements of 𝑉 are the vertices (or nodes) of the graph
 pairs (𝑢, 𝑣) ∈ 𝐸 are the edges (or arcs) of the graph
 for undirected graphs the pairs are unordered, for directed graphs the pairs are ordered

The graphs can be labelled

 vertices can have labels ℓ(𝑣)
 edges can have labels ℓ(𝑢, 𝑣)

A tree is a rooted, connected, and acyclic graph. Graphs can be represented using adjacency matrices

 a |𝑉| × |𝑉| matrix 𝐴 with 𝐴𝑖𝑗 = 1 if (𝑣𝑖, 𝑣𝑗) ∈ 𝐸

SLIDE 15

Eccentricity, radius, and diameter

The distance 𝑑(𝑣𝑖, 𝑣𝑗) between two vertices is the (weighted) length of the shortest path between them.

The eccentricity of a vertex 𝑣𝑖, 𝑒(𝑣𝑖), is its maximum distance to any other vertex, max𝑗 {𝑑(𝑣𝑖, 𝑣𝑗)}.

The radius of a connected graph, 𝑟(𝐺), is the minimum eccentricity of any vertex, min𝑖 {𝑒(𝑣𝑖)}.

The diameter of a connected graph, 𝑑(𝐺), is the maximum eccentricity of any vertex, max𝑖 {𝑒(𝑣𝑖)} = max𝑖,𝑗 {𝑑(𝑣𝑖, 𝑣𝑗)}.

 the effective diameter of a graph is the smallest number that is larger than the eccentricity of a large fraction of the vertices in the graph
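For an unweighted graph these quantities follow directly from BFS distances; a minimal sketch (the adjacency-list dict and the 4-vertex path graph below are hypothetical):

```python
from collections import deque

def distances(adj, src):
    """BFS from src; returns unweighted shortest-path distances."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def radius_and_diameter(adj):
    # Eccentricity of v = maximum distance from v to any other vertex;
    # radius = min eccentricity, diameter = max eccentricity.
    ecc = {v: max(distances(adj, v).values()) for v in adj}
    return min(ecc.values()), max(ecc.values())

# Hypothetical example: the path a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(radius_and_diameter(adj))  # (2, 3)
```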

SLIDE 16

Clustering Coefficient

The clustering coefficient of vertex 𝑣𝑖, 𝐶(𝑣𝑖), tells how clique-like the neighbourhood of 𝑣𝑖 is.

Let 𝑛𝑖 be the number of neighbours of 𝑣𝑖 and 𝑚𝑖 the number of edges between the neighbours of 𝑣𝑖, excluding 𝑣𝑖 itself:

𝐶(𝑣𝑖) = 𝑚𝑖 / (𝑛𝑖 choose 2) = 2𝑚𝑖 / (𝑛𝑖(𝑛𝑖 − 1))

 well-defined only for 𝑣𝑖 with at least two neighbours
 for others, let 𝐶(𝑣𝑖) = 0

The clustering coefficient of the graph is the average clustering coefficient of the vertices: 𝐶(𝐺) = (1/𝑛) ∑𝑖 𝐶(𝑣𝑖)
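A minimal sketch of these two definitions (the adjacency-list dict and the triangle-plus-pendant graph are hypothetical):

```python
from itertools import combinations

def clustering_coefficient(adj, v):
    """C(v) = 2*m / (n*(n-1)), where n = #neighbours of v and m = #edges
    among those neighbours; 0 if v has fewer than two neighbours."""
    nbrs = adj[v]
    n = len(nbrs)
    if n < 2:
        return 0.0
    m = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * m / (n * (n - 1))

def graph_clustering_coefficient(adj):
    # Average of the per-vertex coefficients.
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Hypothetical example: triangle a-b-c with a pendant vertex d on c.
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
print(clustering_coefficient(adj, "c"))   # 1 edge among 3 neighbours: ~0.333
print(graph_clustering_coefficient(adj))  # (1 + 1 + 1/3 + 0)/4: ~0.583
```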

SLIDE 17

What to do with a graph?

There are many interesting data one can mine from graphs and sets of graphs

 cliques of friends from social networks  hubs and authorities from link graphs  who is the centre of Hollywood  subgraphs that appear frequently in (a set of) graph(s)  areas with higher intra-connectivity than inter-connectivity  …

Graph mining is perhaps the most popular topic in contemporary data mining research

 though not necessarily called as such…

SLIDE 18

Chapter 8.2: Properties of Graphs

Aggarwal Ch. 17.1, 19.2; Zaki & Meira Ch. 4

SLIDE 19

Centrality

Six degrees of Kevin Bacon

 ”Every actor is related to Kevin Bacon by no more than 6 hops”
 Kevin Bacon has acted with many, that have acted with many others, that have acted with many others…
 this makes Kevin Bacon a centre of the co-acting graph

Kevin, however, is not the centre:

 the average distance to him is 2.998
 but to Harvey Keitel it is only 2.848

(http://oracleofbacon.org)

SLIDE 20

Degree and eccentricity centrality

Centrality is a function 𝑐 : 𝑉 → ℝ inducing a total order on 𝑉

 the higher the centrality of a vertex, the more important it is

In degree centrality, 𝑐(𝑣𝑖) = 𝑑(𝑣𝑖), the degree of the vertex. In eccentricity centrality the least eccentric vertex is the most central one, 𝑐(𝑣𝑖) = 1/𝑒(𝑣𝑖)

 the least eccentric vertex is the centre
 the most eccentric vertex is peripheral

SLIDE 21

Closeness centrality

In closeness centrality the vertex with the least total distance to all other vertices is the centre, 𝑐(𝑣𝑖) = (∑𝑗 𝑑(𝑣𝑖, 𝑣𝑗))⁻¹

In eccentricity centrality we aim to minimise the maximum distance; in closeness centrality we aim to minimise the average distance

 this is the distance used to measure the centre of Hollywood

SLIDE 22

Betweenness centrality

Betweenness centrality measures the number of shortest paths that travel through 𝑣𝑖

 measures the “monitoring” role of the vertex
 “all roads lead to Rome”

Let 𝜂𝑗𝑘 be the number of shortest paths between 𝑣𝑗 and 𝑣𝑘, and let 𝜂𝑗𝑘(𝑣𝑖) be the number of those that include 𝑣𝑖

 let 𝛾𝑗𝑘(𝑣𝑖) = 𝜂𝑗𝑘(𝑣𝑖)/𝜂𝑗𝑘
 betweenness centrality is defined as 𝑐(𝑣𝑖) = ∑𝑗≠𝑖 ∑𝑘>𝑗, 𝑘≠𝑖 𝛾𝑗𝑘(𝑣𝑖)
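One standard way to compute this sum efficiently, not covered on the slide, is Brandes’ algorithm; a sketch for unweighted, undirected graphs (the 3-vertex path graph is hypothetical):

```python
from collections import deque

def betweenness(adj):
    """Brandes' algorithm; each unordered pair (j, k) with j < k is
    counted once, as in the slide's definition."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, counting shortest paths (sigma) and predecessors.
        sigma = {v: 0 for v in adj}; sigma[s] = 1
        dist = {v: -1 for v in adj}; dist[s] = 0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if dist[w] < 0:
                    dist[w] = dist[u] + 1
                    queue.append(w)
                if dist[w] == dist[u] + 1:
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        # Accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints: halve.
    return {v: c / 2 for v, c in bc.items()}

# Hypothetical example: path a - b - c; the only shortest path between
# a and c runs through b.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
print(betweenness(adj))  # {'a': 0.0, 'b': 1.0, 'c': 0.0}
```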

SLIDE 23

Prestige

In prestige, a vertex is more central if it has many incoming edges from other vertices of high prestige

 𝐴 is the adjacency matrix of the directed graph 𝐺
 𝑝 is an 𝑛-dimensional vector giving the prestige of the vertices
 𝑝 = 𝐴ᵀ𝑝
 starting from an initial prestige vector 𝑝0, we get

𝑝𝑘 = 𝐴ᵀ𝑝𝑘−1 = 𝐴ᵀ(𝐴ᵀ𝑝𝑘−2) = (𝐴ᵀ)²𝑝𝑘−2 = (𝐴ᵀ)³𝑝𝑘−3 = ⋯ = (𝐴ᵀ)ᵏ𝑝0

The vector 𝑝𝑘 converges to the dominant eigenvector of 𝐴ᵀ

 under some assumptions

(PageRank is based on (normalized) prestige)
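The iteration 𝑝 ← 𝐴ᵀ𝑝 can be sketched as plain power iteration, normalising at every step so the vector does not blow up (the 3-vertex directed graph below is hypothetical; convergence assumes a unique dominant eigenvalue):

```python
def power_iteration(A_t, steps=100):
    """Iterate p <- A^T p with L2 normalisation; converges to the
    dominant eigenvector of A^T (given a unique dominant eigenvalue)."""
    n = len(A_t)
    p = [1.0] * n
    for _ in range(steps):
        q = [sum(A_t[i][j] * p[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in q) ** 0.5
        p = [x / norm for x in q]
    return p

# Hypothetical directed graph on vertices 0, 1, 2 with
# edges 0->1, 0->2, 1->2, 2->0; A[i][j] = 1 iff edge i->j.
A = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
A_t = [list(col) for col in zip(*A)]  # transpose

p = power_iteration(A_t)
print(p)  # vertex 2, with two incoming edges, gets the highest prestige
```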

SLIDE 24

Graph properties

Several real-world graphs exhibit certain characteristics

 studying what these are and explaining why they appear is an important area of network research

As data miners, we need to understand the consequences of these characteristics

 finding a result that can be explained merely by one of these characteristics is not interesting

We also want to model graphs with these characteristics

SLIDE 25

It’s a small world after all

A graph 𝐺 is said to exhibit the small-world property if its average path length scales logarithmically, 𝜇𝐿 ∝ log 𝑛

 six degrees of Kevin Bacon is based on this property
 similarly so for Erdős numbers
 how far a mathematician is from Hungarian combinatorist Paul Erdős
 the radius of a large, connected mathematical co-authorship network (268K authors) is 12 and its diameter is 23

SLIDE 26

Scale-free property

The degree distribution of a graph is the distribution of its vertex degrees

 how many vertices have degree 1, how many degree 2, etc.
 𝑓(𝑘) is the number of vertices with degree 𝑘

A graph 𝐺 is said to exhibit the scale-free property if 𝑓(𝑘) ∝ 𝑘⁻𝛾

 follows a so-called power-law distribution
 the majority of vertices have low degree, only a few have very high degree
 scale-free: 𝑓(𝑐𝑘) = 𝛼(𝑐𝑘)⁻𝛾 = 𝛼𝑐⁻𝛾 𝑘⁻𝛾 ∝ 𝑘⁻𝛾
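Computing 𝑓(𝑘) from an adjacency list is a one-liner; a sketch (the star graph below is a hypothetical, extreme illustration of “many low-degree, few high-degree” vertices):

```python
from collections import Counter

def degree_distribution(adj):
    """f(k): number of vertices with degree k."""
    return Counter(len(ns) for ns in adj.values())

# Hypothetical star graph: the hub has degree 4, the four leaves degree 1.
adj = {"hub": ["a", "b", "c", "d"],
       "a": ["hub"], "b": ["hub"], "c": ["hub"], "d": ["hub"]}

dd = degree_distribution(adj)
print(dd)  # Counter({1: 4, 4: 1})
```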

SLIDE 27

Example: WWW links

[figure: in-degree distribution, 𝛾 = 2.09; out-degree distribution, 𝛾 = 2.72]

(Broder et al., 2000)

SLIDE 28

Clustering effect

A graph exhibits the clustering effect if the distribution of the average clustering coefficient (per degree) follows a power law

 if 𝐶(𝑘) is the average clustering coefficient of all vertices of degree 𝑘, then 𝐶(𝑘) ∝ 𝑘⁻𝛾

The vertices with small degrees are part of highly clustered areas (high clustering coefficient), while “hub vertices” have smaller clustering coefficients

SLIDE 29

Chapter 8.3: Frequent Subgraph Mining

Aggarwal Ch. 17.2, 17.4; Zaki & Meira Ch. 11

SLIDE 30

Subgraphs

Graph (𝑉′, 𝐸′) is a subgraph of graph (𝑉, 𝐸) iff

 𝑉′ ⊆ 𝑉
 𝐸′ ⊆ 𝐸

Note that subgraphs don’t have to be connected

 today we consider only connected subgraphs

Checking whether one graph is a subgraph of another is trivial

 but in most real-world applications there are no direct subgraphs
 two graphs might be similar even if their vertex sets are disjoint

SLIDE 31

Graph isomorphism

Graphs 𝐺 = (𝑉, 𝐸) and 𝐺′ = (𝑉′, 𝐸′) are isomorphic if there exists a bijective function 𝜑: 𝑉 → 𝑉′ such that

 (𝑢, 𝑣) ∈ 𝐸 if and only if (𝜑(𝑢), 𝜑(𝑣)) ∈ 𝐸′
 ℓ(𝑣) = ℓ(𝜑(𝑣)) for all 𝑣 ∈ 𝑉
 ℓ(𝑢, 𝑣) = ℓ(𝜑(𝑢), 𝜑(𝑣)) for all (𝑢, 𝑣) ∈ 𝐸

Graph 𝐺′ is subgraph isomorphic to 𝐺 if there exists a subgraph of 𝐺 that is isomorphic to 𝐺′. No polynomial-time algorithm is known for determining whether 𝐺 and 𝐺′ are isomorphic. Determining whether 𝐺′ is subgraph isomorphic to 𝐺 is NP-hard.
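For very small graphs the definition can be checked directly by brute force over all bijections; a sketch (the two relabelled triangles below are hypothetical, and this factorial-time approach is only meant to make the definition concrete):

```python
from itertools import permutations

def isomorphic(adj1, adj2, labels1, labels2):
    """Brute-force isomorphism check: try every bijection V -> V' and
    test that edges and vertex labels are preserved."""
    v1, v2 = sorted(adj1), sorted(adj2)
    if len(v1) != len(v2):
        return False
    for perm in permutations(v2):
        phi = dict(zip(v1, perm))
        if all(labels1[v] == labels2[phi[v]] for v in v1) and \
           all((phi[u] in adj2[phi[v]]) == (u in adj1[v])
               for u in v1 for v in v1 if u != v):
            return True
    return False

# Hypothetical example: two relabelled triangles, vertex labels {a, a, b}.
adj1 = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}}
adj2 = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
labels1 = {1: "a", 2: "a", 3: "b"}
labels2 = {"x": "b", "y": "a", "z": "a"}
print(isomorphic(adj1, adj2, labels1, labels2))  # True
```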

SLIDE 32

Equivalence and canonical graphs

Isomorphism defines an equivalence relation

 id: 𝑉 → 𝑉, id(𝑣) = 𝑣 shows 𝐺 is isomorphic to itself
 if 𝐺 is isomorphic to 𝐺′ via 𝜑, then 𝐺′ is isomorphic to 𝐺 via 𝜑⁻¹
 if 𝐺 is isomorphic to 𝐻 via 𝜑, and 𝐻 to 𝐽 via 𝜓, then 𝐺 is isomorphic to 𝐽 via 𝜓 ∘ 𝜑

A canonization of a graph 𝐺, canon(𝐺), produces a graph 𝐶 such that for any graph 𝐻 that is isomorphic to 𝐺, canon(𝐺) = canon(𝐻)

 two graphs are isomorphic if and only if their canonical versions are the same

SLIDE 33

Example of isomorphic graphs

[figure: a labelled graph with vertex labels a, b, a, c, b, a]

SLIDE 34

Example of isomorphic graphs

[figure: the same labelled graph, vertex labels a, b, a, c, b, a]

SLIDE 35

Example of isomorphic graphs

[figure: two isomorphic labelled graphs, each with vertex labels a, b, a, c, b, a]

SLIDE 36

Frequent subgraph mining

Given a set 𝑫 of 𝑛 graphs and a minimum support 𝜏, find all connected graphs that are subgraph isomorphic to at least 𝜏 graphs in 𝑫

 an enormously complex problem

For graphs that have 𝑚 vertices there are on the order of 2^(𝑚²) subgraphs (not all of them connected); with 𝑙 labels for vertices and edges the number of distinct labelled graphs is larger still.

Counting support means solving multiple NP-hard problems

SLIDE 37

An example

[figure: an example database of three labelled graphs with vertex labels a, b, c]

SLIDE 38

Mining frequent subgraph patterns

Like for itemsets, the subgraph definition of support is monotone

 we can employ level-wise search!

We can modify

 APRIORI to get AGM (Inokuchi, Washio & Motoda, 2000)
 ECLAT to get FFSM (Huan, Wang & Prins, 2003)
 FP-GROWTH to get GSPAN (Yan & Han, 2002)

SLIDE 39

GraphApriori

Algorithm GRAPHAPRIORI(graph db 𝑫, minsup 𝜏)
begin
  𝑘 ← 1; ℱ1 ← {all frequent singleton graphs}
  while ℱ𝑘 is not empty do
    generate 𝒞𝑘+1 by joining pairs of graphs in ℱ𝑘 that have in common a subgraph of size (𝑘 − 1)
    prune subgraphs from 𝒞𝑘+1 that violate downward closure
    determine ℱ𝑘+1 by support counting on (𝒞𝑘+1, 𝑫), retaining the subgraphs from 𝒞𝑘+1 with support at least 𝜏
    𝑘 ← 𝑘 + 1
  end
  return ℱ1 ∪ … ∪ ℱ𝑘
end

(Inokuchi et al. 2000; Kuramochi & Karypis 2001)

SLIDE 40

GraphApriori

(the GRAPHAPRIORI pseudocode repeated, now illustrating candidate generation)

[figure: candidate generation using edge-based join — two graphs of ℱ𝑘 sharing a (𝑘 − 1)-subgraph join into two candidates of 𝒞𝑘+1]

[figure: candidate generation using node-based join]

(Inokuchi et al. 2000; Kuramochi & Karypis 2001)

SLIDE 41

Canonical codes

We can improve the running time of frequent subgraph mining by either

 speeding up the computation of support
 lots of effort has gone into faster isomorphism checking, with only little progress
 creating fewer candidates that we need to check
 level-wise algorithms generate huge numbers of candidates, all of which must be checked for isomorphism with the others

The GSPAN algorithm is the frequent subgraph mining equivalent of FP-GROWTH; it uses a depth-first approach

(Zaki & Meira Ch. 11; Yan & Han 2002)

SLIDE 42

Depth-First Spanning tree

A depth-first spanning (DFS) tree of a graph 𝐺

 is a connected tree
 contains all the vertices of 𝐺
 is built in depth-first order
 selection between siblings is e.g. based on the vertex index

Edges in the DFS tree are forward edges; edges not in the DFS tree are backward edges. The rightmost path in the DFS tree is the path that travels from the root to the rightmost vertex by always taking the rightmost (last added) child.

SLIDE 43

An example – DFS traversal

[figure: DFS traversal of an example graph on vertices 𝑣1…𝑣8 with labels a, a, d, c, c, b, a, b]

SLIDE 44

An example – the DFS tree

[figure: the DFS tree of the example graph, with the rightmost path highlighted]

SLIDE 45

Candidates from the DFS tree

Given graph 𝐺, we extend it only from the vertices on the rightmost path

 we can add a backward edge from the rightmost vertex to some other vertex on the rightmost path
 we can add a forward edge from any vertex on the rightmost path
 this increases the number of vertices by 1

The order of generating the candidates is

 first backward extensions
 first to the root, then to the root’s child, …
 then forward extensions
 first from the leaf, then from the leaf’s father, …

SLIDE 46

An example – the DFS tree

[figure: the example DFS tree and its rightmost-path extensions]

SLIDE 47

DFS codes and their orders

A DFS code is a sequence of tuples of type ⟨𝑣𝑖, 𝑣𝑗, ℓ(𝑣𝑖), ℓ(𝑣𝑗), ℓ(𝑣𝑖, 𝑣𝑗)⟩

 tuples are given in DFS order
 backward edges are listed before forward edges
 vertices are numbered in DFS order

A DFS code is canonical if it is the smallest of the codes in the ordering

 ⟨𝑣𝑖, 𝑣𝑗, ℓ(𝑣𝑖), ℓ(𝑣𝑗), ℓ(𝑣𝑖, 𝑣𝑗)⟩ < ⟨𝑣𝑥, 𝑣𝑦, ℓ(𝑣𝑥), ℓ(𝑣𝑦), ℓ(𝑣𝑥, 𝑣𝑦)⟩ if
 ⟨𝑣𝑖, 𝑣𝑗⟩ <𝑒 ⟨𝑣𝑥, 𝑣𝑦⟩; or
 ⟨𝑣𝑖, 𝑣𝑗⟩ = ⟨𝑣𝑥, 𝑣𝑦⟩ and ⟨ℓ(𝑣𝑖), ℓ(𝑣𝑗), ℓ(𝑣𝑖, 𝑣𝑗)⟩ <𝑙 ⟨ℓ(𝑣𝑥), ℓ(𝑣𝑦), ℓ(𝑣𝑥, 𝑣𝑦)⟩
 the ordering of the label tuples is lexicographical

SLIDE 48

Ordering the edges

Let 𝑒𝑖𝑗 = (𝑣𝑖, 𝑣𝑗) and 𝑒𝑥𝑦 = (𝑣𝑥, 𝑣𝑦). Then 𝑒𝑖𝑗 <𝑒 𝑒𝑥𝑦 if

 𝑒𝑖𝑗 and 𝑒𝑥𝑦 are both forward edges, and
 𝑗 < 𝑦; or
 𝑗 = 𝑦 and 𝑖 > 𝑥
 𝑒𝑖𝑗 and 𝑒𝑥𝑦 are both backward edges, and
 𝑖 < 𝑥; or
 𝑖 = 𝑥 and 𝑗 < 𝑦
 𝑒𝑖𝑗 is forward and 𝑒𝑥𝑦 is backward, and 𝑗 ≤ 𝑥
 𝑒𝑖𝑗 is backward and 𝑒𝑥𝑦 is forward, and 𝑖 < 𝑦

(typo fixed, edge order now in sync with Zaki & Meira)
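The four cases above translate directly into a comparator; a sketch (edges are hypothetical pairs of DFS vertex indices, forward meaning 𝑖 < 𝑗 and backward meaning 𝑖 > 𝑗):

```python
def edge_lt(e1, e2):
    """The DFS-code edge order: e = (i, j) is a forward edge if i < j,
    a backward edge if i > j (indices in DFS discovery order)."""
    (i, j), (x, y) = e1, e2
    f1, f2 = i < j, x < y  # forward flags
    if f1 and f2:                      # both forward
        return j < y or (j == y and i > x)
    if not f1 and not f2:              # both backward
        return i < x or (i == x and j < y)
    if f1 and not f2:                  # forward vs backward
        return j <= x
    return i < y                       # backward vs forward

# Two forward edges ending at the same vertex: 4 = 4 and 2 > 1,
# so (2, 4) precedes (1, 4).
print(edge_lt((2, 4), (1, 4)))  # True
# A backward edge precedes a forward extension: (3, 1) vs (3, 4).
print(edge_lt((3, 1), (3, 4)))  # True
```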

SLIDE 49

Example

[figure: three isomorphic graphs 𝐺1, 𝐺2, 𝐺3, each on vertices 𝑣1…𝑣4 with vertex labels a, a, a, b and edge labels q, r]

𝑢11 = ⟨𝑣1, 𝑣2, a, a, q⟩  𝑢12 = ⟨𝑣2, 𝑣3, a, a, r⟩  𝑢13 = ⟨𝑣3, 𝑣1, a, a, r⟩  𝑢14 = ⟨𝑣2, 𝑣4, a, b, r⟩
𝑢21 = ⟨𝑣1, 𝑣2, a, a, q⟩  𝑢22 = ⟨𝑣2, 𝑣3, a, b, r⟩  𝑢23 = ⟨𝑣2, 𝑣4, a, a, r⟩  𝑢24 = ⟨𝑣4, 𝑣1, a, a, r⟩
𝑢31 = ⟨𝑣1, 𝑣2, a, a, q⟩  𝑢32 = ⟨𝑣2, 𝑣3, a, a, r⟩  𝑢33 = ⟨𝑣3, 𝑣1, a, a, r⟩  𝑢34 = ⟨𝑣1, 𝑣4, a, b, r⟩

SLIDE 50

Example

[same three graphs 𝐺1, 𝐺2, 𝐺3 and their DFS codes as on the previous slide]

The first tuples are identical: 𝑢11 = 𝑢21 = 𝑢31 = ⟨𝑣1, 𝑣2, a, a, q⟩

SLIDE 51

Example

[same three graphs and DFS codes]

In the second tuple, 𝐺2 is larger in the label order: ⟨a, b, r⟩ > ⟨a, a, r⟩, so 𝐺2’s code cannot be canonical

SLIDE 52

Example

[same three graphs and DFS codes]

The last tuples are forward edges with 4 = 4 but 2 > 1, so 𝑢14 < 𝑢34 → 𝐺1’s code is the smallest

SLIDE 53

gSpan: graph-based substructure pattern mining

The general idea

 use the DFS codes to create candidates
 extend only canonical and frequent candidates

There can be very, very many extensions

 we need to see them all, and all of their isomorphisms, to count their supports

(Yan & Han 2002)

SLIDE 54

Constructing candidates

The candidates are built in a DFS code tree

 a DFS code 𝑎 is an ancestor of DFS code 𝑏 if 𝑎 is a proper prefix of 𝑏
 the siblings in the tree follow the DFS code order

A graph can be frequent only if all of the graphs representing its ancestors in the DFS code tree are frequent. The DFS code tree contains the canonical codes of all subgraphs of the graphs in the data

 but not all vertices in the code tree correspond to canonical codes

We (implicitly) traverse this tree

SLIDE 55

The gSPAN algorithm (sketch)

GSPAN(graph db 𝑫, minsup 𝜏)
  for each frequent 1-edge graph do
    call SUBGRM to grow all nodes in the tree rooted at this edge-graph
    remove this edge from the graph

SUBGRM(frequent subgraph 𝑋, minsup 𝜏)
  if the code is not canonical then return
  add 𝑋 to the set of frequent graphs
  create all super-graphs 𝑍 ⊃ 𝑋, extending 𝑋 with one more edge
  compute the frequencies of all 𝑍
  call SUBGRM for the canonical representations of all frequent 𝑍

SLIDE 56

Computing frequencies

GSPAN merges extension generation and support computation. For each graph in the database

 GSPAN computes all isomorphisms of the current candidate
 this can mean solving NP-complete problems…
 for each isomorphism, it computes all backward and forward extensions
 these extensions are stored together with the graph they appear in

The support of each extension is the number of times we’ve stored it

SLIDE 57

Checking canonicity

Given the DFS code of an extension, we need to check whether the code is canonical. This can be done by re-creating the code

 at every step, choose the smallest of the rightmost-path extensions of the current code in the graph corresponding to the extension

If at any step we get a code that is smaller than the corresponding prefix of the extension’s code, we do not have a canonical code

 if after 𝑘 steps we arrive at the extension’s code, it is canonical

SLIDE 58

Easier problems

Much of the complexity of subgraph mining lies in (checking for) isomorphism. For some types of graphs isomorphism is easy

 different types of trees
 ordered and unordered
 rooted and unrooted
 graphs where every node has a distinct label

SLIDE 59

Conclusions

Graphs are everywhere

 many interesting problems  real graphs often exhibit power-law-like behaviour

Graphs generalise many data settings

 makes it possible to create general algorithms

Many problems in graphs are very difficult

 subgraph isomorphism

Frequent subgraph mining

 involves multiple NP-hard problems
