

SLIDE 1

Graph-based Proximity Measures

Practical Graph Mining with R

Nagiza F. Samatova, William Hendrix, John Jenkins, Kanchana Padmanabhan, Arpan Chakraborty
Department of Computer Science, North Carolina State University

SLIDE 2

Outline

  • Defining Proximity Measures
  • Neumann Kernels
  • Shared Nearest Neighbor


SLIDE 3

Similarity and Dissimilarity

  • Similarity
    – Numerical measure of how alike two data objects are.
    – Higher when objects are more alike.
    – Often falls in the range [0, 1].
    – Examples: cosine, Jaccard, Tanimoto.
  • Dissimilarity
    – Numerical measure of how different two data objects are.
    – Lower when objects are more alike.
    – Minimum dissimilarity is often 0; the upper limit varies.
  • Proximity refers to either a similarity or a dissimilarity.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 4

Distance Metric

  • A distance d(p, q) between two points p and q is a dissimilarity measure if it satisfies:
    1. Positive definiteness: d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if p = q.
    2. Symmetry: d(p, q) = d(q, p) for all p and q.
    3. Triangle inequality: d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
  • Examples:
    – Euclidean distance
    – Minkowski distance
    – Mahalanobis distance

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 5

Is this a distance metric?

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$. Each of the following fails at least one of the metric properties (positive definiteness, symmetry, triangle inequality):

  • $d(p, q) = \max_{1 \le j \le d}(p_j, q_j)$ (not positive definite: $d(p, p) = \max_j p_j$ need not be 0)
  • $d(p, q) = \max_{1 \le j \le d}(p_j - q_j)$ (not symmetric)
  • $d(p, q) = \sum_{j=1}^{d}(p_j - q_j)^2$ (squared Euclidean: violates the triangle inequality)
  • $d(p, q) = \min_{1 \le j \le d}\,|p_j - q_j|$ (not positive definite: it is 0 whenever any one coordinate agrees)
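These failures are easy to confirm numerically. A minimal R sketch (the helper names are mine, not from the slides):

    # Two of the candidate "distances" from this slide
    d_maxdiff <- function(p, q) max(p - q)      # max_j (p_j - q_j)
    d_sqeuc   <- function(p, q) sum((p - q)^2)  # squared Euclidean

    p <- c(0, 0); q <- c(3, 0); r <- c(6, 0)

    d_maxdiff(p, q)  # 0
    d_maxdiff(q, p)  # 3, so symmetry fails

    # Triangle inequality fails: d(p, r) = 36 > d(p, q) + d(q, r) = 9 + 9
    d_sqeuc(p, r) > d_sqeuc(p, q) + d_sqeuc(q, r)  # TRUE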

SLIDE 6

Distance: Euclidean, Minkowski, Mahalanobis

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$.

  • Euclidean: $d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$
  • Minkowski: $d(p, q) = \left(\sum_{j=1}^{d}|p_j - q_j|^r\right)^{1/r}$
    – $r = 1$: city block (Manhattan) distance, the $L_1$ norm
    – $r = 2$: Euclidean distance, the $L_2$ norm
  • Mahalanobis: $d(p, q) = \sqrt{(p - q)^{T}\,\Sigma^{-1}\,(p - q)}$, where $\Sigma$ is the covariance matrix of the data
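All three can be computed in base R. A small sketch (the sample used to estimate $\Sigma$ is made up for illustration):

    a <- c(0, 2); b <- c(2, 0)
    M <- rbind(a, b)

    dist(M, method = "euclidean")         # 2.828
    dist(M, method = "minkowski", p = 1)  # L1 / Manhattan: 4

    # Mahalanobis distance needs a covariance matrix Sigma,
    # estimated here from an arbitrary illustrative sample
    set.seed(42)
    X <- matrix(rnorm(200), ncol = 2)
    Sigma <- cov(X)
    sqrt(mahalanobis(a, center = b, cov = Sigma))  # sqrt((a-b)^T Sigma^-1 (a-b))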

SLIDE 7

Euclidean Distance

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Standardization is necessary if scales differ. Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$.

  • Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$
  • Standard deviation of attributes: $s_p = \sqrt{\frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})^2} \in \mathbb{R}$
  • Example: $p = (\mathit{age}, \mathit{salary})$
  • Standardized/normalized vector: $p_{\mathit{new}} = \left(\frac{p_1 - \bar{p}}{s_p}, \frac{p_2 - \bar{p}}{s_p}, \ldots, \frac{p_d - \bar{p}}{s_p}\right) \in \mathbb{R}^d$, with $s_{p_{\mathit{new}}} = 1$
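In R (the age/salary values are made-up numbers for illustration):

    p <- c(age = 35, salary = 50000)  # attributes on wildly different scales

    p_bar <- mean(p)            # mean of attributes
    s_p   <- sd(p)              # sample standard deviation, 1/(d-1) form
    p_new <- (p - p_bar) / s_p  # standardized vector

    mean(p_new)  # 0
    sd(p_new)    # 1

In practice one more commonly standardizes each attribute across all objects, e.g. with scale() on a data matrix; the sketch above follows the slide's per-object definition.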

SLIDE 8

Distance Matrix

$d(p, q) = \sqrt{\sum_{j=1}^{d}(p_j - q_j)^2}$

Input Data Table P (file name: points.dat):

point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1

  • P = read.table(file = "points.dat", header = TRUE)
  • D = dist(P[, 2:3], method = "euclidean")
  • L1 = dist(P[, 2:3], method = "minkowski", p = 1)
  • help(dist)

Output Distance Matrix D:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 9

Covariance of Two Vectors, cov(p,q)

Let $p = (p_1, p_2, \ldots, p_d) \in \mathbb{R}^d$ and $q = (q_1, q_2, \ldots, q_d) \in \mathbb{R}^d$.

  • Mean of attributes: $\bar{p} = \frac{1}{d}\sum_{k=1}^{d} p_k \in \mathbb{R}$
  • One definition (sample covariance): $\mathrm{cov}(p, q) = s_{pq} = \frac{1}{d-1}\sum_{k=1}^{d}(p_k - \bar{p})(q_k - \bar{q}) \in \mathbb{R}$
  • Another definition: $\mathrm{cov}(p, q) = E\left[(p - E(p))\,(q - E(q))^{T}\right]$, where $E$ denotes the expected value of a random variable.
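The sample-covariance definition maps directly onto base R's cov():

    p <- c(2, 4, 6, 8)
    q <- c(1, 3, 2, 5)

    # (1/(d-1)) * sum of products of deviations from the means
    d <- length(p)
    s_pq <- sum((p - mean(p)) * (q - mean(q))) / (d - 1)

    s_pq       # 3.667
    cov(p, q)  # identical result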

SLIDE 10

Covariance, or Dispersion Matrix, ∑

$N$ points in $d$-dimensional space: $P_1 = (p_{11}, p_{12}, \ldots, p_{1d}) \in \mathbb{R}^d, \;\ldots,\; P_N = (p_{N1}, p_{N2}, \ldots, p_{Nd}) \in \mathbb{R}^d$.

The covariance, or dispersion, matrix:

$$\Sigma(P_1, P_2, \ldots, P_N) = \begin{pmatrix} \mathrm{cov}(P_1, P_1) & \mathrm{cov}(P_1, P_2) & \cdots & \mathrm{cov}(P_1, P_N) \\ \mathrm{cov}(P_2, P_1) & \mathrm{cov}(P_2, P_2) & \cdots & \mathrm{cov}(P_2, P_N) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{cov}(P_N, P_1) & \mathrm{cov}(P_N, P_2) & \cdots & \mathrm{cov}(P_N, P_N) \end{pmatrix}$$

The inverse, $\Sigma^{-1}$, is the concentration matrix or precision matrix.
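In R, cov() on a matrix returns the covariance matrix of its columns (it covaries attributes, the usual convention), and solve() inverts it to give the precision matrix:

    set.seed(1)
    X <- matrix(rnorm(100 * 3), ncol = 3)  # 100 points in 3 dimensions

    Sigma <- cov(X)        # 3 x 3 covariance (dispersion) matrix
    Prec  <- solve(Sigma)  # concentration / precision matrix

    round(Sigma %*% Prec, 10)  # identity matrix, up to rounding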

SLIDE 11

Common Properties of a Similarity

  • Similarities also have some well-known properties:
    – s(p, q) = 1 (or maximum similarity) only if p = q.
    – s(p, q) = s(q, p) for all p and q. (Symmetry)

  where s(p, q) is the similarity between points (data objects) p and q.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 12

Similarity Between Binary Vectors

  • Suppose p and q have only binary attributes
  • Compute similarities using the following quantities

– M01 = the number of attributes where p was 0 and q was 1
– M10 = the number of attributes where p was 1 and q was 0
– M00 = the number of attributes where p was 0 and q was 0
– M11 = the number of attributes where p was 1 and q was 1

  • Simple Matching and Jaccard Coefficients:

SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

J = number of 11 matches / number of not-both-zero attribute values
  = M11 / (M01 + M10 + M11)

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 13

SMC versus Jaccard: Example

p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1

M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7
J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
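The same counts in R, for the slide's two vectors:

    p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

    M01 <- sum(p == 0 & q == 1)  # 2
    M10 <- sum(p == 1 & q == 0)  # 1
    M00 <- sum(p == 0 & q == 0)  # 7
    M11 <- sum(p == 1 & q == 1)  # 0

    SMC <- (M11 + M00) / (M01 + M10 + M11 + M00)  # 0.7
    J   <- M11 / (M01 + M10 + M11)                # 0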

SLIDE 14

Cosine Similarity

  • If d1 and d2 are two document vectors, then
    cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
    where · indicates the vector dot product and ||d|| is the length of vector d.

  • Example:
    d1 = 3 2 0 5 0 0 0 2 0 0
    d2 = 1 0 0 0 0 0 0 1 0 2

    d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
    ||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = 42^0.5 = 6.481
    ||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = 6^0.5 = 2.449

    cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150

Src: “Introduction to Data Mining” by Vipin Kumar et al
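A one-line helper in R reproduces the example (the name cosine_sim is mine, not from the slides):

    cosine_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

    d1 <- c(3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
    d2 <- c(1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

    cosine_sim(d1, d2)  # 0.3150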

SLIDE 15

Extended Jaccard Coefficient (Tanimoto)

  • Variation of Jaccard for continuous or count attributes:
    $T(p, q) = \dfrac{p \cdot q}{\|p\|^2 + \|q\|^2 - p \cdot q}$
    – Reduces to Jaccard for binary attributes

Src: “Introduction to Data Mining” by Vipin Kumar et al
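The formula above in R, checked against the binary case from the earlier SMC/Jaccard example:

    tanimoto <- function(p, q) {
      pq <- sum(p * q)
      pq / (sum(p^2) + sum(q^2) - pq)
    }

    p <- c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    q <- c(0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
    tanimoto(p, q)  # 0, matching the Jaccard coefficient J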

SLIDE 16

Correlation (Pearson Correlation)

  • Correlation measures the linear relationship between objects.
  • To compute correlation, we standardize the data objects, p and q, and then take their dot product:

    $p'_k = (p_k - \mathrm{mean}(p)) / \mathrm{std}(p)$
    $q'_k = (q_k - \mathrm{mean}(q)) / \mathrm{std}(q)$
    $\mathrm{correlation}(p, q) = p' \cdot q'$

Src: “Introduction to Data Mining” by Vipin Kumar et al
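In R, the standardize-then-dot-product recipe matches cor() once the 1/(n-1) factor implicit in the sample standard deviation is accounted for:

    p <- c(3, 6, 0, 3, 6)
    q <- c(1, 2, 0, 1, 2)

    p_std <- (p - mean(p)) / sd(p)
    q_std <- (q - mean(q)) / sd(q)

    sum(p_std * q_std) / (length(p) - 1)  # 1
    cor(p, q)                             # 1 (q is exactly p/3)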

SLIDE 17

Visually Evaluating Correlation

Scatter plots showing the similarity from –1 to 1.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 18

General Approach for Combining Similarities

  • Sometimes attributes are of many different types, but an overall similarity is needed.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 19

Using Weights to Combine Similarities

  • We may not want to treat all attributes the same.
    – Use weights w_k that are between 0 and 1 and sum to 1.

Src: “Introduction to Data Mining” by Vipin Kumar et al

SLIDE 20

Graph-Based Proximity Measures

In order to apply graph-based data mining techniques, such as classification and clustering, it is necessary to define proximity measures between data represented in graph form.

Within-graph proximity measures:
  • Hyperlink-Induced Topic Search (HITS)
  • The Neumann Kernel
  • Shared Nearest Neighbor (SNN)

SLIDE 21

Outline

  • Defining Proximity Measures
  • Neumann Kernels
  • Shared Nearest Neighbor


SLIDE 22

Neumann Kernels: Agenda

  • Neumann Kernel Introduction
  • Co-citation and Bibliographic Coupling
  • Document and Term Correlation
  • Diffusion/Decay Factors
  • Relationship to HITS
  • Strengths and Weaknesses

SLIDE 23

Neumann Kernels (NK)

  • Generalization of HITS
  • Input: undirected or directed graph
  • Output: within-graph proximity measure
    – Importance
    – Relatedness

[Image: John von Neumann]
SLIDE 24

NK: Citation graph

  • Input: a directed graph
    – Vertices n1…n8 are articles
    – Edges indicate citations
  • A citation matrix C can be formed
    – If an edge between two vertices exists, the corresponding matrix cell = 1; otherwise = 0

[Figure: citation graph on vertices n1–n8]

SLIDE 25

NK: Co-citation graph

  • Co-citation graph: a graph in which two nodes are connected if they appear simultaneously in the reference list of a third node in the citation graph.
  • In the graph above, n1 and n2 are connected because both are referenced by the same node, n5, in the citation graph.
  • Co-citation matrix: $CC = C^{T} C$

[Figure: co-citation graph on vertices n1–n8]

SLIDE 26

NK: Bibliographic Coupling Graph

  • Bibliographic coupling graph: a graph in which two nodes are connected if they share one or more bibliographic references.
  • In the graph above, n5 and n6 are connected because both reference the same node, n2, in the citation graph.
  • Bibliographic coupling matrix: $BC = C\,C^{T}$

[Figure: bibliographic coupling graph on vertices n1–n8]
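Both matrices are a single matrix product away from the citation matrix C. A small R sketch on a made-up four-article citation matrix (not the slide's eight-node graph):

    # C[i, j] = 1 if article i cites article j
    C <- matrix(c(0, 0, 0, 0,
                  0, 0, 0, 0,
                  1, 1, 0, 0,   # article 3 cites articles 1 and 2
                  1, 1, 0, 0),  # article 4 cites articles 1 and 2
                nrow = 4, byrow = TRUE)

    CC <- t(C) %*% C  # co-citation: CC[1, 2] = 2, articles 1 and 2 are co-cited twice
    BC <- C %*% t(C)  # bibliographic coupling: BC[3, 4] = 2, articles 3 and 4 share two references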

SLIDE 27

NK: Document and Term Correlation

Term-document matrix: a matrix in which the rows represent terms, the columns represent documents, and the entries represent a function of their relationship (e.g., the frequency of the given term in the document).

Example:
D1: "I like this book"
D2: "We wrote this book"
Term-Document Matrix X

SLIDE 28

NK: Document and Term Correlation (2)

Document correlation matrix: A matrix in which the rows and the columns represent documents, and entries represent the semantic similarity between two documents.

Example:
D1: "I like this book"
D2: "We wrote this book"
Document Correlation Matrix $K = X^{T} X$

SLIDE 29

NK: Document and Term Correlation (3)

Term correlation matrix: a matrix in which the rows and the columns represent terms, and entries represent the semantic similarity between two terms.

Example:
D1: "I like this book"
D2: "We wrote this book"

Term Correlation Matrix $T = X X^{T}$
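For the two example documents, X can be written out explicitly; the vocabulary ordering {I, like, this, book, we, wrote} is my choice, since the slide's matrix image is not preserved:

    terms <- c("I", "like", "this", "book", "we", "wrote")
    X <- matrix(c(1, 0,   # I
                  1, 0,   # like
                  1, 1,   # this
                  1, 1,   # book
                  0, 1,   # we
                  0, 1),  # wrote
                nrow = 6, byrow = TRUE,
                dimnames = list(terms, c("D1", "D2")))

    K  <- t(X) %*% X  # 2 x 2 document correlation matrix
    Tm <- X %*% t(X)  # 6 x 6 term correlation matrix ("Tm" avoids masking R's T)
    K  # diagonal = document lengths; off-diagonal 2 = shared terms "this", "book"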

SLIDE 30

Neumann Kernel Block Diagram

Input: a graph
Output: two matrices of dimensions n × n, called $K_\gamma$ and $T_\gamma$
Diffusion/decay factor $\gamma$: a tunable parameter that controls the balance between relatedness and importance

SLIDE 31

NK: Diffusion Factor - Equation & Effect

The Neumann Kernel defines two matrices incorporating a diffusion factor $\gamma$; with our definitions of $K = X^{T}X$ and $T = X X^{T}$, they simplify to

$K_\gamma = K \sum_{n=0}^{\infty} (\gamma K)^n = K\,(I - \gamma K)^{-1}$
$T_\gamma = T \sum_{n=0}^{\infty} (\gamma T)^n = T\,(I - \gamma T)^{-1}$

When $\gamma = 0$, $K_\gamma$ and $T_\gamma$ reduce to $K$ and $T$ (pure relatedness). When $\gamma$ grows toward its maximum allowable value, the induced rankings approach HITS-style importance.
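A direct R transcription of these formulas (the helper name neumann_kernel is mine; X is the term-document matrix from the earlier sketch):

    neumann_kernel <- function(M, gamma) {
      n <- nrow(M)
      M %*% solve(diag(n) - gamma * M)  # M (I - gamma M)^{-1}
    }

    K <- t(X) %*% X
    # The series converges only for gamma < 1 / (largest eigenvalue of K)
    gamma_max <- 1 / max(eigen(K)$values)
    K_gamma <- neumann_kernel(K, 0.9 * gamma_max)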

SLIDE 32

NK: Diffusion Factor - Terminology

  • Indegree: the indegree, δ−(v), of vertex v is the number of edges leading to v. Example: δ−(B) = 1.
  • Outdegree: the outdegree, δ+(v), of vertex v is the number of edges leading away from v. Example: δ+(A) = 3.
  • Maximal indegree: the maximal indegree, Δ−, of a graph is the maximum of the indegree counts over all vertices. Example: Δ−(G) = 2.
  • Maximal outdegree: the maximal outdegree, Δ+, of a graph is the maximum of the outdegree counts over all vertices. Example: Δ+(G) = 3.

[Figure: example graph on vertices A, B, C, D]

SLIDE 33

NK: Diffusion Factor - Algorithm

SLIDE 34

NK: Choice of Diffusion Factor and its Effects on the Neumann Algorithm

  • The Neumann Kernel outputs relatedness between documents and between terms when γ is small (at γ = 0 it reduces to K and T).
  • When γ is larger, the kernel output matches HITS.
SLIDE 35

Comparing NK, HITS, and Co-citation/Bibliographic Coupling

HITS authority ranking for the graph below: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8.

Calculating the Neumann Kernel with γ = 0.207, the maximum possible value of γ in this case, gives the same ranking: n3 > n4 > n2 > n1 > n5 = n6 = n7 = n8. For higher values of γ, the Neumann Kernel converges to HITS.

[Figure: citation graph on vertices n1–n8]

SLIDE 36

Strengths and Weaknesses

Strengths:
  • Generalization of HITS
  • Merges relatedness and importance
  • Useful in many graph applications

Weaknesses:
  • Topic drift
  • No penalty for loops in the adjacency matrix

SLIDE 37

Outline

  • Defining Proximity Measures
  • Neumann Kernels
  • Shared Nearest Neighbor


SLIDE 38

Shared Nearest Neighbor (SNN)

  • An indirect approach to similarity
  • Uses a k-nearest-neighbor graph to determine the similarity between nodes
  • If two vertices have more than k neighbors in common, then they can be considered similar to one another even if a direct link does not exist

SLIDE 39

SNN - Agenda

  • Understanding Proximity
  • Proximity Graphs
  • Shared Nearest Neighbor Graph
  • SNN Algorithm
  • Time Complexity
  • R Code Example
  • Outlier/Anomaly Detection
  • Strengths
  • Weaknesses

SLIDE 40

SNN – Understanding Proximity

What makes a node a neighbor of another node depends on the definition of proximity.

  • Definition: the closeness between a set of objects
  • Proximity can measure the extent to which two nodes belong to the same cluster
  • Proximity is a subtle notion whose definition can depend on the specific application

SLIDE 41

SNN - Proximity Graphs

  • A graph obtained by connecting two points in a set of points by an edge if the two points, in some sense, are close to each other
SLIDE 42

SNN – Proximity Graphs (continued)

[Figure: various types of proximity graphs: cyclic, linear, and radial]

SLIDE 43

SNN – Proximity Graphs (continued)

Other types of proximity graphs:
  • Minimum spanning tree
  • Relative neighbor graph
  • Gabriel graph
  • Nearest neighbor graph (Voronoi diagram)

SLIDE 44

SNN – Proximity Graphs (continued)

  • Represents neighbor relationships between objects
  • Can estimate the likelihood that a link will exist in the future, or is missing in the data for some reason
  • Using a proximity graph increases the scale range over which good segmentations are possible
  • Can be formulated with respect to many metrics

SLIDE 45

SNN – Kth Nearest Neighbor (k-NN) Graph

  • Forms the basis for the Shared Nearest Neighbor (SNN) within-graph proximity measure
  • Has applications in cluster analysis and outlier detection
SLIDE 46

SNN – Shared Nearest Neighbor Graph

  • An SNN graph is a special type of k-NN graph.
  • If an edge exists between two vertices, then they both belong to each other's k-neighborhood.

In the figure (not reproduced here), each of the two black vertices, i and j, has eight nearest neighbors, including each other. Four of those nearest neighbors are shared (shown in red). Thus, the two black vertices are similar when the SNN graph parameter k = 4.

SLIDE 47

SNN – The Algorithm

Input: G, an undirected graph
Input: k, a natural number (required number of shared neighbors)

for i = 1 to N(G) do
    for j = i + 1 to N(G) do
        counter = 0
        for m = 1 to N(G) do
            if vertex i and vertex j both have an edge with vertex m then
                counter = counter + 1
            end if
        end for
        if counter >= k then
            connect an edge between vertex i and vertex j in the SNN graph
        end if
    end for
end for
return SNN graph
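A direct R translation of the pseudocode, operating on an adjacency matrix (the function name snn_graph is mine; the ProximityMeasure package used later provides its own SNN()):

    snn_graph <- function(A, k) {
      n <- nrow(A)
      S <- matrix(0, n, n)
      for (i in seq_len(n - 1)) {
        for (j in (i + 1):n) {
          # Count vertices adjacent to both i and j (shared neighbors)
          shared <- sum(A[i, ] == 1 & A[j, ] == 1)
          if (shared >= k) S[i, j] <- S[j, i] <- 1
        }
      }
      S  # adjacency matrix of the SNN graph
    }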

SLIDE 48

SNN – Time Complexity

The number of vertices of graph G is n. The algorithm runs three nested loops:

for i = 1 to n
    for j = 1 to n
        for m = 1 to n

Loops i and m each iterate once for every vertex in G (n times), and loop j iterates at most n - 1 times (O(n)). Cumulatively, this results in a total running time of O(n³).

SLIDE 49

SNN – R Code Example

  • library(igraph)
  • library(ProximityMeasure)
  • data = c(0, 1, 0, 0, 1, 0,
            1, 0, 1, 1, 1, 0,
            0, 1, 0, 1, 0, 0,
            0, 1, 1, 0, 1, 1,
            1, 1, 0, 1, 0, 0,
            0, 0, 0, 1, 0, 0)
  • mat = matrix(data, 6, 6)
  • G = graph.adjacency(mat, mode = c("directed"), weighted = NULL)
  • V(G)$label <- c('A', 'B', 'C', 'D', 'E', 'F')
  • tkplot(G)
  • SNN(mat, 2)

Output:
[0] A -- D
[1] B -- D
[2] B -- E
[3] C -- E

[Figure: plotted graph on vertices A–F]

SLIDE 50

SNN – Outlier/Anomaly Detection

  • Outlier/anomaly: something that deviates from what is standard, normal, or expected
  • Outlier/anomaly detection: detecting patterns in a given data set that do not conform to an established normal behavior

[Figure: scatter plot with one point marked as an outlier/anomaly]

SLIDE 51

SNN - Strengths

  • Ability to handle noise and outliers
  • Ability to handle clusters of different sizes and shapes
  • Very good at handling clusters of varying densities

SLIDE 52

SNN - Weaknesses

  • Does not take into account the weight of the link between nodes in a nearest neighbor graph
  • Low similarity among nodes of the same cluster can cause SNN to find nearest neighbors that are not in the same cluster

SLIDE 53

Time Complexity Comparison

Run time:
  • HITS: O(k · n^2.376)
  • Neumann Kernel: O(n^2.376)
  • Shared Nearest Neighbor: O(n³)

Conclusion: Neumann Kernel ≤ HITS < SNN