Graph-based Proximity Measures
Practical Graph Mining with R
Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty
Department of Computer Science, North Carolina State University
Src: “Introduction to Data Mining” by Vipin Kumar et al
Euclidean distance between two points p and q with d attributes:

$$d(p, q) = \sqrt{\sum_{j=1}^{d} (p_j - q_j)^2}$$
point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance matrix:

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
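The distance matrix for the four example points can be reproduced in R. This is a minimal sketch that assumes the coordinates p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1), which are the values consistent with the listed distances; the variable names `pts` and `d` are illustrative only.

```r
# Coordinates of the four example points (rows p1..p4, columns x and y).
# Assumed values: they are the ones consistent with the listed distances.
pts <- matrix(c(0, 2,
                2, 0,
                3, 1,
                5, 1),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("p1", "p2", "p3", "p4"), c("x", "y")))

# dist() in base R (stats package) uses the Euclidean metric by default.
d <- as.matrix(dist(pts))

# Rounded to three decimals for comparison with the table.
print(round(d, 3))
```

`dist()` returns only the lower triangle, so `as.matrix()` is used here to obtain the full symmetric matrix with a zero diagonal.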
cos( d1, d2 ) = (d1 • d2) / (||d1|| ||d2||)
where • indicates the vector dot product and || d || is the length of vector d.

Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
cos( d1, d2 ) = 5 / (6.481 * 2.449) = .3150
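The cosine example can be checked in a few lines of R. This is a sketch only; `cosine_similarity` is a helper name introduced here for illustration, not a function from the slides or from a package.

```r
# The two document vectors from the example.
d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

# Cosine similarity: dot product divided by the product of vector lengths.
# (Hypothetical helper, defined here for illustration.)
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

dot <- sum(d1 * d2)          # dot product d1 . d2
len1 <- sqrt(sum(d1^2))      # ||d1||
len2 <- sqrt(sum(d2^2))      # ||d2||
sim <- cosine_similarity(d1, d2)

print(c(dot = dot, len1 = len1, len2 = len2, cos = sim))
```

Because only the angle between the vectors matters, scaling either document vector by a positive constant leaves the result unchanged.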
Scatter plots showing the similarity from –1 to 1.
von Neumann
– n1…n8 are vertices (articles)
– Graph is directed
– Edges indicate a citation
– If an edge between two vertices exists, the matrix cell = 1; otherwise = 0
[Figure: adjacency matrix with rows and columns n1 … n8]
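The rule for filling the adjacency matrix can be sketched in R. The citation pairs below are hypothetical, since the edges of the original figure are not available; only the construction is the point.

```r
# Vertex labels for the eight articles.
vertices <- paste0("n", 1:8)

# Hypothetical citations: from[i] cites to[i]. These are NOT the edges
# of the original figure; they are made up for illustration.
from <- c("n1", "n1", "n2", "n5", "n7")
to   <- c("n2", "n3", "n4", "n4", "n8")

# Start with all zeros, then set cell (from, to) to 1 for every citation.
A <- matrix(0, nrow = 8, ncol = 8, dimnames = list(vertices, vertices))
A[cbind(from, to)] <- 1

print(A)
```

Because the graph is directed, the matrix is generally not symmetric: here `A["n1", "n2"]` is 1 while `A["n2", "n1"]` stays 0.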
Term-document matrix: A matrix in which the rows represent terms, columns represent documents, and entries represent a function (e.g., frequency of the given term in the document).
Document correlation matrix: A matrix in which the rows and the columns represent documents, and entries represent the semantic similarity between two documents.
Term correlation matrix: A matrix in which the rows and the columns represent terms, and entries represent the semantic similarity between two terms.
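As an illustration of the three matrices, the sketch below builds a small term-document matrix and derives the two correlation matrices from it using inner products. Treating the inner product as the similarity measure is an assumption made here, as are the toy counts and the names `X`, `K` and `T_mat`.

```r
# Hypothetical term-document matrix: rows are terms, columns are documents,
# entries are term frequencies (made-up numbers).
X <- matrix(c(2, 0, 1,
              0, 1, 1,
              1, 1, 0,
              0, 3, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("t", 1:4), paste0("d", 1:3)))

# Assumption: similarity is measured by inner products.
K <- t(X) %*% X      # document correlation matrix (documents x documents)
T_mat <- X %*% t(X)  # term correlation matrix (terms x terms)

print(K)
print(T_mat)
```

Under this assumption both results are symmetric, and entry (i, j) of `K` is large when documents i and j use the same terms often.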
Simplifies with our definitions of K and T.
Indegree: The indegree, δ⁻(v), of vertex v is the number of edges leading to vertex v. Example: δ⁻(B) = 1
Outdegree: The outdegree, δ⁺(v), of vertex v is the number of edges leading away from vertex v. Example: δ⁺(A) = 3
Maximal indegree: The maximal indegree, Δ⁻, of the graph is the maximum of all indegrees of vertices of the graph. Example: Δ⁻(G) = 2
Maximal outdegree: The maximal outdegree, Δ⁺, of the graph is the maximum of all outdegrees of vertices of the graph. Example: Δ⁺(G) = 3
[Figure: example directed graph G on vertices A, B, C, D]
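The degree definitions map directly onto row and column sums of an adjacency matrix. The sketch below uses a hypothetical four-vertex directed graph, chosen only so that it matches the quoted values (indegree of B = 1, outdegree of A = 3, maximal indegree 2, maximal outdegree 3); it is not necessarily the graph from the original figure.

```r
# Hypothetical directed graph on A, B, C, D with edges
# A->B, A->C, A->D and C->D (an assumed edge set, for illustration).
v <- c("A", "B", "C", "D")
adj <- matrix(0, nrow = 4, ncol = 4, dimnames = list(v, v))
adj["A", c("B", "C", "D")] <- 1
adj["C", "D"] <- 1

# Row i lists the edges leaving vertex i; column j lists the edges entering j.
outdegree <- rowSums(adj)
indegree  <- colSums(adj)

max_indegree  <- max(indegree)
max_outdegree <- max(outdegree)

print(rbind(indegree, outdegree))
```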
Strengths
– Generalization
– Merges relatedness and importance
– Useful in many graph applications

Weaknesses
– Topic drift
– No penalty for loops in adjacency matrix
Shared Nearest Neighbor
– Understanding Proximity
– Proximity Graphs
– Shared Nearest Neighbor Graph
– SNN Algorithm
– Time Complexity
– R Code Example
– Outlier/Anomaly Detection
– Strengths
– Weaknesses
What makes a node a neighbor of another node depends on the definition of proximity.
Definition: the closeness between a set of objects.
– Proximity can measure the extent to which two nodes belong to the same cluster.
– Proximity is a subtle notion whose definition can depend on the specific application.

Proximity graphs:
– Represent neighbor relationships between objects
– Can estimate the likelihood that a link will exist in the future, or is missing from the data for some reason
– Using a proximity graph increases the scale range over which good segmentations are possible
– Can be formulated with respect to many metrics
In the figure, each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared; they are shown in red. Thus, the two black vertices are considered similar when the parameter k = 4 is used for the SNN graph.
Input: G: an undirected graph
Input: k: a natural number (number of shared neighbors)
for i = 1 to N(G) do
    for j = i+1 to N(G) do
        counter = 0
        for m = 1 to N(G) do
            if vertex i and vertex j both have an edge with vertex m then
                counter++
            end if
        end for
        if counter ≥ k then
            Connect an edge between vertex i and vertex j in SNN graph.
        end if
    end for
end for
return SNN graph
for i = 1 to n
    for j = 1 to n
        for k = 1 to n
Three nested loops over the n vertices give a running time of O(n³).
… 0, 1, 0, 0, 1, 0,
  1, 0, 1, 1, 1, 0,
  0, 1, 0, 1, 0, 0,
  0, 1, 1, 0, 1, 1,
  1, 1, 0, 1, 0, 0,
  0, 0, 0, 1, 0, 0)
… weighted=NULL)
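The SNN procedure can be sketched in base R without any graph package. The sketch assumes that the 36 values surviving in the R code example form a 6 × 6 adjacency matrix read row by row, and it replaces the explicit triple loop with a matrix product. The function name `snn_graph` and the choice k = 2 are illustrative; they are not taken from the original code.

```r
# Assumption: the 36 surviving values form a 6 x 6 undirected adjacency matrix.
A <- matrix(c(0, 1, 0, 0, 1, 0,
              1, 0, 1, 1, 1, 0,
              0, 1, 0, 1, 0, 0,
              0, 1, 1, 0, 1, 1,
              1, 1, 0, 1, 0, 0,
              0, 0, 0, 1, 0, 0),
            nrow = 6, byrow = TRUE)

# Hypothetical helper: build the SNN adjacency matrix for threshold k.
snn_graph <- function(adj, k) {
  # For a symmetric 0/1 matrix, entry (i, j) of adj %*% adj counts the
  # vertices m that are adjacent to both i and j (shared neighbors).
  shared <- adj %*% adj
  snn <- (shared >= k) * 1   # connect i and j if they share at least k neighbors
  diag(snn) <- 0             # no self-loops in the SNN graph
  snn
}

snn <- snn_graph(A, k = 2)

# List each SNN edge once (upper triangle only).
snn_edges <- which(snn == 1 & upper.tri(snn), arr.ind = TRUE)
print(snn_edges)
```

The matrix product still performs on the order of n³ multiplications, so it matches the cubic cost of the nested loops, but it hands the work to optimized linear algebra routines.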
Outlier/Anomaly: something that deviates from what is standard, normal, or expected.
Outlier/Anomaly Detection: detecting patterns in a given data set that do not conform to an established normal behavior.
[Figure: scatter plot with an outlier/anomaly marked]
Strengths
– Ability to handle noise and outliers
– Ability to handle clusters of different shapes
– Very good at handling clusters of varying densities

Weaknesses
– Does not take into account the weight of the link between the nodes in a nearest neighbor graph
– A low similarity among nodes of the same cluster in a graph can cause it to find nearest neighbors that are not in the same cluster
HITS
Neumann Kernel
Shared Nearest Neighbor
Run Time