

SLIDE 1

Graph-based Proximity Measures

Practical Graph Mining with R

Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty
Department of Computer Science, North Carolina State University

SLIDE 2

Outline

  • Defining Proximity Measures
  • Neumann Kernels
  • Shared Nearest Neighbor


SLIDE 3

Similarity and Dissimilarity

  • Similarity
    – Numerical measure of how alike two data objects are.
    – Higher when objects are more alike.
    – Often falls in the range [0, 1].
    – Examples: cosine, Jaccard, Tanimoto.
  • Dissimilarity
    – Numerical measure of how different two data objects are.
    – Lower when objects are more alike.
    – Minimum dissimilarity is often 0; the upper limit varies.
  • Proximity refers to either a similarity or a dissimilarity.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 4

Distance Metric

  • A distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies:
    1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
    2. Symmetry: d(p, q) = d(q, p) for all p and q.
    3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
  • Examples:
    – Euclidean distance
    – Minkowski distance
    – Mahalanobis distance

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 5

Is this a distance metric?

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$. Each of the following fails at least one of the metric properties (positive definiteness, symmetry, triangle inequality):

  • $d(p, q) = \max_{1 \le j \le d}(p_j, q_j)$ (not positive definite: $d(p, p) = \max_j p_j$ need not be 0)
  • $d(p, q) = \max_{1 \le j \le d}(p_j - q_j)$ (not symmetric)
  • $d(p, q) = \sum_{j=1}^{d}(p_j - q_j)^2$ (squared Euclidean: violates the triangle inequality)
  • $d(p, q) = \min_{1 \le j \le d}\,|p_j - q_j|$ (not positive definite: it is 0 whenever any one coordinate agrees)
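These failures are easy to confirm numerically. A minimal R sketch (the helper names are mine, not from the slides):

    # Two of the candidate "distances" from this slide
    d_maxdiff <- function(p, q) max(p - q)      # max_j (p_j - q_j)
    d_sqeuc   <- function(p, q) sum((p - q)^2)  # squared Euclidean

    p <- c(0, 0); q <- c(3, 0); r <- c(6, 0)

    d_maxdiff(p, q)  # 0
    d_maxdiff(q, p)  # 3, so symmetry fails

    # Triangle inequality fails: d(p, r) = 36 > d(p, q) + d(q, r) = 9 + 9
    d_sqeuc(p, r) > d_sqeuc(p, q) + d_sqeuc(q, r)  # TRUE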

SLIDE 6

Distance: Euclidean, Minkowski, Mahalanobis

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$.

  • Euclidean: $d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$
  • Minkowski: $d(p, q) = \left(\sum_{j=1}^{d}|p_j - q_j|^r\right)^{1/r}$
    – $r = 1$: city block (Manhattan) distance, the $L_1$ norm
    – $r = 2$: Euclidean distance, the $L_2$ norm
  • Mahalanobis: $d(p, q) = \sqrt{(p - q)^{T}\,\Sigma^{-1}\,(p - q)}$, where $\Sigma$ is the covariance matrix of the data
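All three can be computed in base R. A small sketch (the sample used to estimate $\Sigma$ is made up for illustration):

    a <- c(0, 2); b <- c(2, 0)
    M <- rbind(a, b)

    dist(M, method = "euclidean")         # 2.828
    dist(M, method = "minkowski", p = 1)  # L1 / Manhattan: 4

    # Mahalanobis distance needs a covariance matrix Sigma,
    # estimated here from an arbitrary illustrative sample
    set.seed(42)
    X <- matrix(rnorm(200), ncol = 2)
    Sigma <- cov(X)
    sqrt(mahalanobis(a, center = b, cov = Sigma))  # sqrt((a-b)^T Sigma^-1 (a-b))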

SLIDE 7

Euclidean Distance

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Standardization is necessary if scales differ. Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$.

  • Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$
  • Standard deviation of attributes: $s_p = \sqrt{\frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})^2} \in \mathbb{R}$
  • Example: $p = (\mathit{age}, \mathit{salary})$
  • Standardized/normalized vector: $p_{\mathit{new}} = \left(\frac{p_1 - \bar{p}}{s_p}, \frac{p_2 - \bar{p}}{s_p}, \ldots, \frac{p_d - \bar{p}}{s_p}\right) \in \mathbb{R}^d$, with $s_{p_{\mathit{new}}} = 1$
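In R (the age/salary values are made-up numbers for illustration):

    p <- c(age = 35, salary = 50000)  # attributes on wildly different scales

    p_bar <- mean(p)            # mean of attributes
    s_p   <- sd(p)              # sample standard deviation, 1/(d-1) form
    p_new <- (p - p_bar) / s_p  # standardized vector

    mean(p_new)  # 0
    sd(p_new)    # 1

In practice one more commonly standardizes each attribute across all objects, e.g. with scale() on a data matrix; the sketch above follows the slide's per-object definition.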

SLIDE 8

Distance Matrix

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Input Data Table P (file name: points.dat):

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

  • P = read.table(file = "points.dat", header = TRUE)
  • D = dist(P[, 2:3], method = "euclidean")
  • L1 = dist(P[, 2:3], method = "minkowski", p = 1)
  • help(dist)

Output Distance Matrix D:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 9

Covariance of Two Vectors, cov(p,q)

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$.

  • Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$
  • One definition (sample covariance): $\mathrm{cov}(p, q) = s_{pq} = \frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})(q_k - \bar{q}) \in \mathbb{R}$
  • Another definition: $\mathrm{cov}(p, q) = E\left[(p - E(p))\,(q - E(q))^{T}\right]$, where $E$ denotes the expected value of a random variable.
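The sample-covariance definition maps directly onto base R's cov():

    p <- c(2, 4, 6, 8)
    q <- c(1, 3, 2, 5)

    # (1/(d-1)) * sum of products of deviations from the means
    d <- length(p)
    s_pq <- sum((p - mean(p)) * (q - mean(q))) / (d - 1)

    s_pq       # 3.667
    cov(p, q)  # identical result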

SLIDE 10

Covariance, or Dispersion Matrix, ∑

$N$ points in $d$-dimensional space: $P_1 = (p_{11}, p_{12}, \ldots, p_{1d}) \in \mathbb{R}^d, \;\ldots,\; P_N = (p_{N1}, p_{N2}, \ldots, p_{Nd}) \in \mathbb{R}^d$.

The covariance, or dispersion, matrix:

$$\Sigma(P_1, P_2, \ldots, P_N) = \begin{pmatrix} \mathrm{cov}(P_1, P_1) & \mathrm{cov}(P_1, P_2) & \cdots & \mathrm{cov}(P_1, P_N) \\ \mathrm{cov}(P_2, P_1) & \mathrm{cov}(P_2, P_2) & \cdots & \mathrm{cov}(P_2, P_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(P_N, P_1) & \mathrm{cov}(P_N, P_2) & \cdots & \mathrm{cov}(P_N, P_N) \end{pmatrix}$$

The inverse, $\Sigma^{-1}$, is the concentration matrix or precision matrix.
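In R, cov() on a matrix returns the covariance matrix of its columns (it covaries attributes, the usual convention), and solve() inverts it to give the precision matrix:

    set.seed(1)
    X <- matrix(rnorm(100 * 3), ncol = 3)  # 100 points in 3 dimensions

    Sigma <- cov(X)        # 3 x 3 covariance (dispersion) matrix
    Prec  <- solve(Sigma)  # concentration / precision matrix

    round(Sigma %*% Prec, 10)  # identity matrix, up to rounding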

SLIDE 11

Common Properties of a Similarity

  • Similarities also have some well-known properties:
    – s(p, q) = 1 (or maximum similarity) only if p = q.
    – s(p, q) = s(q, p) for all p and q. (Symmetry)

  where s(p, q) is the similarity between points (data objects) p and q.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 12

Similarity Between Binary Vectors

  • Suppose p and q have only binary attributes
  • Compute similarities using the following quantities

– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1

  • Simple Matching and Jaccard Coefficients:

SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 13

SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
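The same counts in R, for the slide's two vectors:

    p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

    M01 <- sum(p == 0 & q == 1)  # 2
    M10 <- sum(p == 1 & q == 0)  # 1
    M00 <- sum(p == 0 & q == 0)  # 7
    M11 <- sum(p == 1 & q == 1)  # 0

    SMC <- (M11 + M00) / (M01 + M10 + M11 + M00)  # 0.7
    J   <- M11 / (M01 + M10 + M11)                # 0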

SLIDE 14

Cosine Similarity

  • If d1 and d2 are two document vectors, then
    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
    where · indicates the vector dot product and ||d|| is the length of vector d.

  • Example:
    d1 = 3 2 0 5 0 0 0 2 0 0
    d2 = 1 0 0 0 0 0 0 1 0 2

    d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
    ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
    ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449

    cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150

Src: “Introduction to Data Mining” by Vipin Kumar et al
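A one-line helper in R reproduces the example (the name cosine_sim is mine, not from the slides):

    cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

    d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
    d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

    cosine_sim(d1, d2)  # 0.3150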

SLIDE 15

Extended Jaccard Coefficient (Tanimoto)

  • Variation of Jaccard for continuous or count attributes:
    $T(p, q) = \dfrac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
    – Reduces to Jaccard for binary attributes

Src: “Introduction to Data Mining” by Vipin Kumar et al
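The formula above in R, checked against the binary case from the earlier SMC/Jaccard example:

    tanimoto <- function(p, q) {
      pq <- sum(p * q)
      pq / (sum(p^2) + sum(q^2) - pq)
    }

    p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
    tanimoto(p, q)  # 0, matching the Jaccard coefficient J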

SLIDE 16

Correlation (Pearson Correlation)

  • Correlation measures the linear relationship between objects.
  • To compute correlation, we standardize the data objects, p and q, and then take their dot product:

    $p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$
    $q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
    $\mathrm{correlation}(p, q) = p' \cdot q'$

Src: “Introduction to Data Mining” by Vipin Kumar et al
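In R, the standardize-then-dot-product recipe matches cor() once the 1/(n-1) factor implicit in the sample standard deviation is accounted for:

    p <- c(3, 6, 0, 3, 6)
    q <- c(1, 2, 0, 1, 2)

    p_std <- (p - mean(p)) / sd(p)
    q_std <- (q - mean(q)) / sd(q)

    sum(p_std * q_std) / (length(p) - 1)  # 1
    cor(p, q)                             # 1 (q is exactly p/3)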

SLIDE 17

Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 18

General Approach for Combining Similarities

  • Sometimes attributes are of many different types, but an overall similarity is needed.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 19

Using Weights to Combine Similarities

  • We may not want to treat all attributes the same.
    – Use weights w_k that are between 0 and 1 and sum to 1.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 20

Graph-Based Proximity Measures

In order to apply graph-based data mining techniques, such as classification and clustering, it is necessary to define proximity measures between data represented in graph form.

Within-graph proximity measures:
  • Hyperlink-Induced Topic Search (HITS)
  • The Neumann Kernel
  • Shared Nearest Neighbor (SNN)

SLIDE 21

Outline

  • Defining Proximity Measures
  • Neumann Kernels
  • Shared Nearest Neighbor


SLIDE 22

Neumann Kernels: Agenda

  • Neumann Kernel Introduction
  • Co-citation and Bibliographic Coupling
  • Document and Term Correlation
  • Diffusion/Decay Factors
  • Relationship to HITS
  • Strengths and Weaknesses

SLIDE 23

Neumann Kernels (NK)

  • Generalization of HITS
  • Input: undirected or directed graph
  • Output: within-graph proximity measure
    – Importance
    – Relatedness

[Image: John von Neumann]
SLIDE 24

NK: Citation graph

  • Input: a directed graph
    – Vertices n1…n8 are articles
    – Edges indicate citations
  • A citation matrix C can be formed
    – If an edge between two vertices exists, the corresponding matrix cell = 1; otherwise = 0

[Figure: citation graph on vertices n1–n8]

SLIDE 25

NK: Co-citation graph

  • Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph.
  • In the graph above, n1 and n2 are connected because both are referenced by the same node, n5, in the citation graph.
  • Co-citation matrix: $CC = C^{T} C$

[Figure: co-citation graph on vertices n1–n8]

SLIDE 26

NK: Bibliographic Coupling Graph

  • Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references.
  • In the graph above, n5 and n6 are connected because both reference the same node, n2, in the citation graph.
  • Bibliographic coupling matrix: $BC = C\,C^{T}$

[Figure: bibliographic coupling graph on vertices n1–n8]
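Both matrices are a single matrix product away from the citation matrix C. A small R sketch on a made-up four-article citation matrix (not the slide's eight-node graph):

    # C[i, j] = 1 if article i cites article j
    C <- matrix(c(0, 0, 0, 0,
                  0, 0, 0, 0,
                  1, 1, 0, 0,   # article 3 cites articles 1 and 2
                  1, 1, 0, 0),  # article 4 cites articles 1 and 2
                nrow = 4, byrow = TRUE)

    CC <- t(C) %*% C  # co-citation: CC[1, 2] = 2, articles 1 and 2 are co-cited twice
    BC <- C %*% t(C)  # bibliographic coupling: BC[3, 4] = 2, articles 3 and 4 share two references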

SLIDE 27

NK: Document and Term Correlation

Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document).

Example:
D1: "I like this book"
D2: "We wrote this book"
Term-Document Matrix X

SLIDE 28

NK: Document and Term Correlation (2)

Document correlation matrix: A matrix in which the rows and the columns represent documents, and entries represent the semantic similarity between two documents.

Example:
D1: "I like this book"
D2: "We wrote this book"
Document Correlation Matrix $K = X^{T} X$

SLIDE 29

NK: Document and Term Correlation (3)

Term correlation matrix: a matrix in which the rows and the columns represent terms, and entries represent the semantic similarity between two terms.

Example:
D1: "I like this book"
D2: "We wrote this book"

Term Correlation Matrix $T = X X^{T}$
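For the two example documents, X can be written out explicitly; the vocabulary ordering {I, like, this, book, we, wrote} is my choice, since the slide's matrix image is not preserved:

    terms <- c("I", "like", "this", "book", "we", "wrote")
    X <- matrix(c(1, 0,   # I
                  1, 0,   # like
                  1, 1,   # this
                  1, 1,   # book
                  0, 1,   # we
                  0, 1),  # wrote
                nrow = 6, byrow = TRUE,
                dimnames = list(terms, c("D1", "D2")))

    K  <- t(X) %*% X  # 2 x 2 document correlation matrix
    Tm <- X %*% t(X)  # 6 x 6 term correlation matrix ("Tm" avoids masking R's T)
    K  # diagonal = document lengths; off-diagonal 2 = shared terms "this", "book"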

SLIDE 30

Neumann Kernel Block Diagram

Input: a graph
Output: two matrices of dimensions n × n, called $K_\gamma$ and $T_\gamma$
Diffusion/decay factor $\gamma$: a tunable parameter that controls the balance between relatedness and importance

SLIDE 31

NK: Diffusion Factor - Equation & Effect

The Neumann Kernel defines two matrices incorporating a diffusion factor $\gamma$; with our definitions of $K = X^{T}X$ and $T = X X^{T}$, they simplify to

$K_\gamma = K \sum_{n=0}^{\infty} (\gamma K)^n = K\,(I - \gamma K)^{-1}$
$T_\gamma = T \sum_{n=0}^{\infty} (\gamma T)^n = T\,(I - \gamma T)^{-1}$

When $\gamma = 0$, $K_\gamma$ and $T_\gamma$ reduce to $K$ and $T$ (pure relatedness). When $\gamma$ grows toward its maximum allowable value, the induced rankings approach HITS-style importance.
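A direct R transcription of these formulas (the helper name neumann_kernel is mine; X is the term-document matrix from the earlier sketch):

    neumann_kernel <- function(M, gamma) {
      n <- nrow(M)
      M %*% solve(diag(n) - gamma * M)  # M (I - gamma M)^{-1}
    }

    K <- t(X) %*% X
    # The series converges only for gamma < 1 / (largest eigenvalue of K)
    gamma_max <- 1 / max(eigen(K)$values)
    K_gamma <- neumann_kernel(K, 0.9 * gamma_max)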

SLIDE 32

NK: Diffusion Factor - Terminology

  • Indegree: the indegree, δ−(v), of vertex v is the number of edges leading to v. Example: δ−(B) = 1.
  • Outdegree: the outdegree, δ+(v), of vertex v is the number of edges leading away from v. Example: δ+(A) = 3.
  • Maximal indegree: the maximal indegree, Δ−, of a graph is the maximum of the indegree counts over all vertices. Example: Δ−(G) = 2.
  • Maximal outdegree: the maximal outdegree, Δ+, of a graph is the maximum of the outdegree counts over all vertices. Example: Δ+(G) = 3.

[Figure: example graph on vertices A, B, C, D]

SLIDE 33

NK: Diffusion Factor - Algorithm

SLIDE 34

NK: Choice of Diffusion Factor and its Effects on the Neumann Algorithm

  • The Neumann Kernel outputs relatedness between documents and between terms when γ is small (at γ = 0 it reduces to K and T).
  • When γ is larger, the kernel output matches HITS.
SLIDE 35

Comparing NK, HITS, and Co-citation/Bibliographic Coupling

HITS authority ranking for the graph below: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8.

Calculating the Neumann Kernel with γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8. For higher values of γ, the Neumann Kernel converges to HITS.

[Figure: citation graph on vertices n1–n8]

SLIDE 36

Strengths and Weaknesses

Strengths:
  • Generalization of HITS
  • Merges relatedness and importance
  • Useful in many graph applications

Weaknesses:
  • Topic drift
  • No penalty for loops in the adjacency matrix

SLIDE 37

Outline

  • Defining Proximity Measures
  • Neumann Kernels
  • Shared Nearest Neighbor


SLIDE 38

Shared Nearest Neighbor (SNN)

  • An indirect approach to similarity
  • Uses a k-nearest-neighbor graph to determine the similarity between nodes
  • If two vertices have more than k neighbors in common, then they can be considered similar to one another even if a direct link does not exist

SLIDE 39

SNN - Agenda

  • Understanding Proximity
  • Proximity Graphs
  • Shared Nearest Neighbor Graph
  • SNN Algorithm
  • Time Complexity
  • R Code Example
  • Outlier/Anomaly Detection
  • Strengths
  • Weaknesses

SLIDE 40

SNN – Understanding Proximity

What makes a node a neighbor of another node depends on the definition of proximity.

  • Definition: the closeness between a set of objects
  • Proximity can measure the extent to which two nodes belong to the same cluster
  • Proximity is a subtle notion whose definition can depend on the specific application

SLIDE 41

SNN - Proximity Graphs

  • A graph obtained by connecting two points in a set of points by an edge if the two points, in some sense, are close to each other
SLIDE 42

SNN – Proximity Graphs (continued)

[Figure: various types of proximity graphs: cyclic, linear, and radial]

SLIDE 43

SNN – Proximity Graphs (continued)

Other types of proximity graphs:
  • Minimum spanning tree
  • Relative neighbor graph
  • Gabriel graph
  • Nearest neighbor graph (Voronoi diagram)

SLIDE 44

SNN – Proximity Graphs (continued)

  • Represents neighbor relationships between objects
  • Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason
  • Using a proximity graph increases the scale range over which good segmentations are possible
  • Can be formulated with respect to many metrics

SLIDE 45

SNN – Kth Nearest Neighbor (k-NN) Graph

  • Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure
  • Has applications in cluster analysis and outlier detection
SLIDE 46

SNN – Shared Nearest Neighbor Graph

  • An SNN graph is a special type of k-NN graph.
  • If an edge exists between two vertices, then they both belong to each other's k-neighborhood.

In the figure (not reproduced here), each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar when the SNN graph parameter k = 4.

SLIDE 47

SNN – The Algorithm

Input: G, an undirected graph
Input: k, a natural number (required number of shared neighbors)

for i = 1 to N(G) do
    for j = i + 1 to N(G) do
        counter = 0
        for m = 1 to N(G) do
            if vertex i and vertex j both have an edge with vertex m then
                counter = counter + 1
            end if
        end for
        if counter >= k then
            connect an edge between vertex i and vertex j in the SNN graph
        end if
    end for
end for
return SNN graph
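A direct R translation of the pseudocode, operating on an adjacency matrix (the function name snn_graph is mine; the ProximityMeasure package used later provides its own SNN()):

    snn_graph <- function(A, k) {
      n <- nrow(A)
      S <- matrix(0, n, n)
      for (i in seq_len(n - 1)) {
        for (j in (i + 1):n) {
          # Count vertices adjacent to both i and j (shared neighbors)
          shared <- sum(A[i, ] == 1 & A[j, ] == 1)
          if (shared >= k) S[i, j] <- S[j, i] <- 1
        }
      }
      S  # adjacency matrix of the SNN graph
    }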

SLIDE 48

SNN – Time Complexity

The number of vertices of graph G is n. The algorithm runs three nested loops:

for i = 1 to n
    for j = 1 to n
        for m = 1 to n

Loops i and m each iterate once for every vertex in G (n times), and loop j iterates at most n - 1 times (O(n)). Cumulatively, this results in a total running time of O(n³).

SLIDE 49

SNN – R Code Example

  • library(igraph)
  • library(ProximityMeasure)
  • data = c(0, 1, 0, 0, 1, 0,
            1, 0, 1, 1, 1, 0,
            0, 1, 0, 1, 0, 0,
            0, 1, 1, 0, 1, 1,
            1, 1, 0, 1, 0, 0,
            0, 0, 0, 1, 0, 0)
  • mat = matrix(data, 6, 6)
  • G = graph.adjacency(mat, mode = c("directed"), weighted = NULL)
  • V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
  • tkplot(G)
  • SNN(mat, 2)

Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E

[Figure: plotted graph on vertices A–F]

SLIDE 50

SNN – Outlier/Anomaly Detection

  • Outlier/anomaly: something that deviates from what is standard, normal, or expected
  • Outlier/anomaly detection: detecting patterns in a given data set that do not conform to an established normal behavior

[Figure: scatter plot with one point marked as an outlier/anomaly]

SLIDE 51

SNN - Strengths

  • Ability to handle noise and outliers
  • Ability to handle clusters of different sizes and shapes
  • Very good at handling clusters of varying densities

SLIDE 52

SNN - Weaknesses

  • Does not take into account the weight of the link between nodes in a nearest neighbor graph
  • Low similarity among nodes of the same cluster can cause SNN to find nearest neighbors that are not in the same cluster

SLIDE 53

Time Complexity Comparison

Run time:
  • HITS: O(k · n^2.376)
  • Neumann Kernel: O(n^2.376)
  • Shared Nearest Neighbor: O(n³)

Conclusion: Neumann Kernel ≤ HITS < SNN