Data Mining and Machine Learning: Fundamental Concepts and Algorithms - PowerPoint PPT Presentation



  1. Data Mining and Machine Learning: Fundamental Concepts and Algorithms
  dataminingbook.info
  Mohammed J. Zaki (Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY, USA)
  Wagner Meira Jr. (Department of Computer Science, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil)
  Chapter 4: Graph Data

  2. Graphs
  A graph G = (V, E) comprises a finite nonempty set V of vertices or nodes, and a set E ⊆ V × V of edges consisting of unordered pairs of vertices.
  The number of nodes in the graph G, given as |V| = n, is called the order of the graph, and the number of edges, given as |E| = m, is called the size of G.
  A directed graph or digraph has an edge set E consisting of ordered pairs of vertices.
  A weighted graph consists of a graph together with a weight w_ij for each edge (v_i, v_j) ∈ E.
  A graph H = (V_H, E_H) is called a subgraph of G = (V, E) if V_H ⊆ V and E_H ⊆ E.
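These definitions translate directly into a tiny amount of code. Below is a minimal sketch (not from the book) that stores an undirected graph as a vertex set plus a set of unordered vertex pairs and checks the subgraph condition; the vertex labels and helper name are illustrative.

```python
# Minimal sketch: an undirected graph as a vertex set and a set of unordered
# vertex pairs (frozensets), with order, size, and a subgraph test.
V = {1, 2, 3, 4}
E = {frozenset(e) for e in [(1, 2), (1, 3), (2, 3), (3, 4)]}

order = len(V)   # |V| = n = 4
size = len(E)    # |E| = m = 4

def is_subgraph(VH, EH, V, E):
    """H = (VH, EH) is a subgraph of G = (V, E) iff VH ⊆ V and EH ⊆ E."""
    return VH <= V and EH <= E

print(is_subgraph({1, 2, 3}, {frozenset((1, 2))}, V, E))  # True
```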

  3. Undirected and Directed Graphs
  [Figure: an example undirected graph and an example directed graph, each on the eight vertices v_1 through v_8.]

  4. Degree Distribution
  The degree of a node v_i ∈ V is the number of edges incident with it, and is denoted as d(v_i) or just d_i.
  The degree sequence of a graph is the list of the degrees of the nodes sorted in non-increasing order.
  Let N_k denote the number of vertices with degree k. The degree frequency distribution of a graph is given as (N_0, N_1, ..., N_t), where t is the maximum degree for a node in G.
  Let X be a random variable denoting the degree of a node. The degree distribution of a graph gives the probability mass function f for X, given as (f(0), f(1), ..., f(t)), where f(k) = P(X = k) = N_k / n is the probability of a node having degree k.
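As a concrete illustration of these definitions, here is a small hypothetical helper (not from the book) that computes the degree sequence, the degree frequency distribution (N_0, ..., N_t), and the degree distribution f(k) = N_k / n from a vertex set and an edge list.

```python
# Sketch: degree sequence, degree frequency distribution, and degree
# distribution f(k) = N_k / n from a vertex set V and an edge list E.
from collections import Counter

def degree_statistics(V, E):
    deg = {v: 0 for v in V}
    for u, w in E:
        deg[u] += 1
        deg[w] += 1
    degree_sequence = sorted(deg.values(), reverse=True)  # non-increasing
    t = max(degree_sequence)                              # maximum degree
    N = Counter(deg.values())                             # N_k counts
    freq = [N.get(k, 0) for k in range(t + 1)]            # (N_0, ..., N_t)
    f = [Nk / len(V) for Nk in freq]                      # f(k) = N_k / n
    return degree_sequence, freq, f

# Illustrative graph (not the slide's example):
V = {1, 2, 3, 4}
E = [(1, 2), (1, 3), (2, 3), (3, 4)]
print(degree_statistics(V, E))  # ([3, 2, 2, 1], [0, 1, 2, 1], [0.0, 0.25, 0.5, 0.25])
```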

  5. Degree Distribution
  [Figure: the example graph on vertices v_1 through v_8.]
  The degree sequence of the graph is (4, 4, 4, 3, 2, 2, 2, 1).
  Its degree frequency distribution is (N_0, N_1, N_2, N_3, N_4) = (0, 1, 3, 1, 3).
  The degree distribution is given as (f(0), f(1), f(2), f(3), f(4)) = (0, 0.125, 0.375, 0.125, 0.375).

  6. Path, Distance and Connectedness
  A walk in a graph G between nodes x and y is an ordered sequence of vertices, starting at x and ending at y, x = v_0, v_1, ..., v_{t-1}, v_t = y, such that there is an edge between every pair of consecutive vertices, that is, (v_{i-1}, v_i) ∈ E for all i = 1, 2, ..., t. The length of the walk, t, is measured in terms of hops, the number of edges along the walk.
  A path is a walk with distinct vertices (with the exception of the start and end vertices).
  A path of minimum length between nodes x and y is called a shortest path, and the length of the shortest path is called the distance between x and y, denoted d(x, y). If no path exists between the two nodes, the distance is assumed to be d(x, y) = ∞.
  Two nodes v_i and v_j are connected if there exists a path between them. A graph is connected if there is a path between all pairs of vertices. A connected component, or just component, of a graph is a maximal connected subgraph.
  A directed graph is strongly connected if there is a (directed) path between all ordered pairs of vertices. It is weakly connected if there exists a path between node pairs only by considering the edges as undirected.
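Hop distances and connected components can be computed with breadth-first search. The sketch below assumes an unweighted, undirected graph stored as an adjacency list (a dict mapping each vertex to its neighbors); the function names are illustrative.

```python
# Sketch: BFS hop distances d(x, y) and connected components for an
# unweighted, undirected graph given as an adjacency list.
from collections import deque

def bfs_distances(adj, x):
    """Hop distances from x; unreachable nodes keep distance infinity."""
    dist = {v: float('inf') for v in adj}
    dist[x] = 0
    queue = deque([x])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if dist[w] == float('inf'):
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def connected_components(adj):
    """Maximal connected subgraphs, returned as sets of vertices."""
    seen, components = set(), []
    for v in adj:
        if v not in seen:
            comp = {u for u, d in bfs_distances(adj, v).items() if d < float('inf')}
            seen |= comp
            components.append(comp)
    return components
```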

  7. Adjacency Matrix
  A graph G = (V, E), with |V| = n vertices, can be represented as an n × n, symmetric, binary adjacency matrix A, defined as
    A(i, j) = 1 if v_i is adjacent to v_j, and 0 otherwise.
  If the graph is directed, then the adjacency matrix A is not symmetric.
  If the graph is weighted, then we obtain an n × n weighted adjacency matrix A, defined as
    A(i, j) = w_ij if v_i is adjacent to v_j, and 0 otherwise,
  where w_ij is the weight on edge (v_i, v_j) ∈ E.
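A possible construction of such matrices, assuming vertices are indexed 0, ..., n-1, is sketched below with NumPy; the directed and weighted cases are handled by optional arguments.

```python
# Sketch: binary or weighted n x n adjacency matrix from an edge list.
import numpy as np

def adjacency_matrix(n, edges, weights=None, directed=False):
    A = np.zeros((n, n))
    for k, (i, j) in enumerate(edges):
        A[i, j] = 1.0 if weights is None else weights[k]
        if not directed:
            A[j, i] = A[i, j]   # symmetric for undirected graphs
    return A

# Undirected, unweighted example on 4 vertices:
print(adjacency_matrix(4, [(0, 1), (0, 2), (1, 2), (2, 3)]))
```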

  8. Graphs from Data Matrix
  Many datasets that are not in the form of a graph can still be converted into one.
  Let D = {x_i}, i = 1, ..., n (with x_i ∈ R^d), be a dataset. Define a weighted graph G = (V, E) with edge weight w_ij = sim(x_i, x_j), where sim(x_i, x_j) denotes the similarity between points x_i and x_j.
  For instance, using the Gaussian similarity
    w_ij = sim(x_i, x_j) = exp( -||x_i - x_j||^2 / (2σ^2) )
  where σ is the spread parameter.
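A vectorized sketch of this construction (function name is illustrative) is shown below: it returns the full n × n weight matrix W with w_ij = exp(-||x_i - x_j||^2 / (2σ^2)), with the diagonal zeroed so that no self-loops are implied.

```python
# Sketch: Gaussian similarity weights for a data matrix D of shape (n, d).
import numpy as np

def gaussian_similarity_graph(D, sigma):
    """n x n matrix W with w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((D[:, None, :] - D[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)   # drop self-similarities so no self-loops arise
    return W
```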

  9. Iris Similarity Graph
  [Figure: Iris similarity graph built with Gaussian similarity, σ = 1/√2; an edge exists iff w_ij ≥ 0.777. Order: |V| = n = 150; size: |E| = m = 753.]
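For reference, the slide's construction can be approximated as follows. Loading Iris through scikit-learn and using all four raw attributes are assumptions; the book's preprocessing may differ, so the resulting edge count need not match m = 753 exactly.

```python
# Sketch: thresholded Gaussian-similarity graph over the Iris data
# (sigma = 1/sqrt(2), keep an edge iff w_ij >= 0.777).
import numpy as np
from sklearn.datasets import load_iris   # assumed data source

X = load_iris().data                      # 150 points, 4 attributes
sigma = 1 / np.sqrt(2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq / (2 * sigma ** 2))

edges = [(i, j) for i in range(len(X)) for j in range(i + 1, len(X))
         if W[i, j] >= 0.777]
print(len(X), len(edges))                 # order n and size m
```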

  10. Topological Graph Attributes
  Graph attributes are local if they apply to only a single node (or an edge), and global if they refer to the entire graph.
  Degree: The degree of a node v_i ∈ G is defined as d_i = Σ_j A(i, j). The corresponding global attribute for the entire graph G is the average degree: μ_d = (Σ_i d_i) / n.
  Average Path Length: The average path length is given as
    μ_L = ( Σ_i Σ_{j>i} d(v_i, v_j) ) / (n choose 2) = (2 / (n(n-1))) Σ_i Σ_{j>i} d(v_i, v_j).
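Both attributes are straightforward to compute. The sketch below assumes a connected, undirected graph given as an adjacency list and uses breadth-first search for the all-pairs distances entering μ_L.

```python
# Sketch: average degree and average path length for a connected,
# undirected graph given as an adjacency list.
from collections import deque

def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)

def average_path_length(adj):
    n = len(adj)
    total = 0
    for x in adj:                      # BFS from every node
        dist = {x: 0}
        queue = deque([x])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())    # counts each unordered pair twice
    return total / (n * (n - 1))       # = (2 / (n(n-1))) * sum_{i<j} d(i, j)
```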

  11. Iris Graph: Degree Distribution
  [Figure: histogram of the degree distribution f(k) versus degree k for the Iris similarity graph, for degrees up to about 43.]

  12. Iris Graph: Path Length Histogram
  [Figure: histogram of frequency versus path length k for the Iris similarity graph, for path lengths 1 through 11.]

  13. Eccentricity, Radius and Diameter
  The eccentricity of a node v_i is the maximum distance from v_i to any other node in the graph: e(v_i) = max_j { d(v_i, v_j) }.
  The radius of a connected graph, denoted r(G), is the minimum eccentricity of any node in the graph: r(G) = min_i { e(v_i) } = min_i { max_j { d(v_i, v_j) } }.
  The diameter, denoted d(G), is the maximum eccentricity of any vertex in the graph: d(G) = max_i { e(v_i) } = max_{i,j} { d(v_i, v_j) }.
  For a disconnected graph, these values are computed over the connected components of the graph.
  The diameter of a graph G is sensitive to outliers. The effective diameter is more robust; it is defined as the minimum number of hops for which a large fraction, typically 90%, of all connected pairs of nodes can reach each other.
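These quantities can all be derived from the all-pairs hop distances. The sketch below assumes a connected, undirected graph with comparable vertex labels (e.g. integers) stored as an adjacency list; the 90% threshold for the effective diameter is exposed as a parameter.

```python
# Sketch: eccentricity, radius, diameter, and effective diameter via BFS.
import math
from collections import deque

def all_pairs_distances(adj):
    dists = {}
    for x in adj:
        d = {x: 0}
        queue = deque([x])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in d:
                    d[w] = d[u] + 1
                    queue.append(w)
        dists[x] = d
    return dists

def radius_diameter_effective(adj, fraction=0.90):
    dists = all_pairs_distances(adj)
    ecc = {v: max(d.values()) for v, d in dists.items()}     # e(v_i)
    radius, diameter = min(ecc.values()), max(ecc.values())  # r(G), d(G)
    # effective diameter: smallest hop count covering `fraction` of all pairs
    pair_dists = sorted(dists[u][v] for u in adj for v in adj if u < v)
    effective = pair_dists[math.ceil(fraction * len(pair_dists)) - 1]
    return radius, diameter, effective
```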

  14. Clustering Coefficient
  The clustering coefficient of a node v_i is a measure of the density of edges in the neighborhood of v_i.
  Let G_i = (V_i, E_i) be the subgraph induced by the neighbors of vertex v_i. Note that v_i ∉ V_i, as we assume that G is simple. Let |V_i| = n_i be the number of neighbors of v_i, and |E_i| = m_i be the number of edges among the neighbors of v_i.
  The clustering coefficient of v_i is defined as
    C(v_i) = (no. of edges in G_i) / (maximum number of edges in G_i) = m_i / (n_i choose 2) = 2 m_i / (n_i (n_i - 1)).
  The clustering coefficient of a graph G is simply the average clustering coefficient over all the nodes, given as C(G) = (1/n) Σ_i C(v_i).
  C(v_i) is well defined only for nodes with degree d(v_i) ≥ 2; thus we define C(v_i) = 0 if d_i < 2.
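A direct translation of these formulas, assuming a simple undirected graph stored as a dict mapping each vertex to its set of neighbors (with comparable labels such as integers), might look like this:

```python
# Sketch: C(v_i) = 2 m_i / (n_i (n_i - 1)) and the graph average C(G).
def clustering_coefficient(adj, v):
    nbrs = adj[v]
    n_i = len(nbrs)                    # number of neighbors of v
    if n_i < 2:                        # C(v) is defined as 0 when degree < 2
        return 0.0
    # m_i: edges among the neighbors of v, each unordered pair counted once
    m_i = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
    return 2 * m_i / (n_i * (n_i - 1))

def graph_clustering_coefficient(adj):
    # C(G): average clustering coefficient over all nodes
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)

# Example: a triangle plus one pendant vertex
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(graph_clustering_coefficient(adj))   # (1 + 1 + 1/3 + 0) / 4 ≈ 0.583
```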
