April 2, 2008
Data Mining:
Concepts and Techniques
— Chapter 9 —
Graph mining and Social Network Analysis
Li Xiong
Slides credits: Jiawei Han and Micheline Kamber
Graph mining
Frequent subgraph mining
Social network analysis
Social network
Social network analysis at different levels
Link analysis
April 2, 2008 Mining and Searching Graphs in Graph Databases
Methods for Mining Frequent Subgraphs
Applications:
Graph Indexing
Similarity Search
Classification and Clustering
Summary
Graphs are ubiquitous
Chemical compounds (cheminformatics)
Protein structures, biological pathways/networks (bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, the Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerate graphs
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges and vertices), weighted, with angles and geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many graph problems have high computational complexity
[Figures: aspirin molecule; yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); Internet; co-author network]
Frequent subgraph mining
Finding frequent subgraphs within a single graph
Finding frequent (sub)graphs in a set of graphs
A (sub)graph is frequent if its support (occurrence frequency) is no less than a minimum support threshold
Applications of graph pattern mining
Mining biochemical structures, program control flow
analysis, XML structures or Web communities
Building blocks for graph classification, clustering,
compression, comparison, and correlation analysis
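The support definition above can be illustrated on the smallest possible patterns, single labeled edges. This is only a sketch: the encoding of a graph as a set of `(label_u, label_v)` pairs, the function name `frequent_edges`, and the toy data are assumptions made for illustration. Real miners such as FSG or gSpan handle larger patterns and subgraph isomorphism.

```python
from collections import Counter


def frequent_edges(graphs, min_support):
    """Return labeled edges that occur in at least min_support graphs.

    Each graph is a set of undirected labeled edges (label_u, label_v);
    this toy encoding is an assumption of the sketch.
    """
    support = Counter()
    for edges in graphs:
        # Normalize edge direction, then count each pattern once per graph.
        patterns = {tuple(sorted(edge)) for edge in edges}
        support.update(patterns)
    return {p: s for p, s in support.items() if s >= min_support}


# Three hypothetical "compounds" described only by their atom-label pairs.
graphs = [
    {("C", "O"), ("C", "N"), ("C", "S")},
    {("O", "C"), ("C", "N")},
    {("C", "N"), ("N", "O")},
]
result = frequent_edges(graphs, min_support=2)
print(result)
```

Support is counted per graph, not per occurrence, which matches the set-of-graphs setting described above.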
Example: Frequent Subgraph Mining in Chemical Compounds
[Figure: graph dataset of three chemical compounds (A), (B), (C), and the two frequent patterns (1), (2) found with minimum support 2]
Finding interesting and frequent substructures in a
single graph
SUBDUE
Finding frequent patterns in a set of independent
graphs
Apriori-based approach
Pattern-growth approach
Problem
Finding “interesting” and repetitive substructures
(connected subgraphs) in data represented as a graph
Basic idea
Minimum description length (MDL) principle
Beam search algorithm
Start with best single vertices
Expand best substructures with a new edge
Substructures are evaluated based on their ability to compress input graphs
Minimum description length (MDL) principle
A formalization of Occam’s Razor
Best hypothesis minimizes description length of the data (largest compression)
Graph substructure discovery based on MDL
Description length (DL): represent vertices and adjacency matrix
Graph compression: replace substructure instances with pointers
Find best substructure S in G that minimizes: DL(S) + DL(G|S)
[Figure: input database (G) with objects R1, C1, T1-T4, S1-S4; substructure (S1) made of a triangle and a square; compressed database (G|S1) in which each instance is replaced by S1]
Holder et al.
Beam search
An optimization of best-first search
Breadth-first search with a predetermined number of paths kept as candidates (beam width)
Subgraph discovery based on beam search
Start with best single vertices
Expand best substructures with a new edge
Substructures are evaluated based on their ability to compress input graphs (minimize description length)
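The beam search loop can be sketched on a deliberately simplified case. The example below is not SUBDUE itself: it searches for a repeated substring in a string (a sequence being a degenerate graph), and the scoring function `description_length` is a toy stand-in for DL(S) + DL(G|S). All names and the sample text are assumptions made for illustration.

```python
def description_length(pattern, text):
    """Toy DL(S) + DL(G|S): pattern size plus text size after each
    non-overlapping occurrence is replaced by a one-symbol pointer."""
    saved = (len(pattern) - 1) * text.count(pattern)
    return len(pattern) + len(text) - saved


def beam_search(text, beam_width, limit):
    """Grow substrings one symbol at a time, keeping the best beam_width."""
    alphabet = sorted(set(text))

    def rank(candidates):
        # Lower description length is better; ties break alphabetically.
        return sorted(candidates, key=lambda p: (description_length(p, text), p))

    # Start with the best single symbols (the analogue of single vertices).
    beam = rank(alphabet)[:beam_width]
    discovered = []
    while beam and len(discovered) < limit:
        discovered.extend(beam)
        # Expand every pattern in the beam by one symbol, keeping only
        # extensions that actually occur in the text.
        children = {p + ch for p in beam for ch in alphabet if p + ch in text}
        beam = rank(children)[:beam_width]
    return rank(discovered)[:limit]


best = beam_search("abcabcabcabd", beam_width=2, limit=6)
print(best[0], description_length(best[0], "abcabcabcabd"))
```

The two stopping conditions mirror the ones used for substructure discovery: the search ends when the beam is empty or when the number of discovered patterns reaches the limit.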
1. Create a substructure for each unique vertex label
Substructures (S): triangle (4), square (4), circle (1), rectangle (1)
[Figure: input database (G) and its graph form, with objects R1, C1, T1-T4, S1-S4]
2. Expand best substructures by an edge or edge + neighboring vertex
Substructures (S): triangle-square, rectangle-square, rectangle-triangle, rectangle-circle
[Figure: input database (G) in graph form]
Holder et al. SRL Workshop
3. Keep best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph with hierarchical description
Problem: finding frequent subgraphs in a set of graphs
Apriori-based approach
AGM: Inokuchi et al. (PKDD’00)
FSG: Kuramochi and Karypis (ICDM’01)
PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
FFSM: Huan et al. (ICDM’03)
Pattern-growth approach
MoFa: Borgelt and Berthold (ICDM’02)
gSpan: Yan and Han (ICDM’02)
Gaston: Nijssen and Kok (KDD’04)
Closed pattern mining
CLOSEGRAPH: Yan and Han (KDD’03)
[Figure: Apriori-based approach: frequent subgraphs G' and G'' are joined (JOIN) into candidate subgraphs G, G1, G2, ..., Gn with an extra vertex or edge]
Level-wise algorithm: building candidate
subgraphs from small frequent subgraphs
AGM (Apriori-based Graph Mining), Inokuchi et al., PKDD’00
generates new graphs with one more node
FSG (Frequent SubGraph mining), Kuramochi and Karypis, ICDM’01
generates new graphs with one more edge
[Figure: pattern-growth approach: a k-edge graph G is extended to (k+1)-edge graphs G1, G2, ..., Gn, and these to (k+2)-edge graphs; the same graph can be generated more than once (duplicate graph)]
Depth-based search and right-most extension
Methods for Mining Frequent Subgraphs
Applications:
Classification and Clustering
Graph Indexing
Similarity Search
Using Graph Patterns
Similarity measures based on graph patterns
Feature-based similarity measure
Each graph is represented as a feature vector
Frequent subgraphs can be used as features
Vector distance
Structure-based similarity measure
Maximal common subgraph
Graph edit distance: insertion, deletion, and relabel
Frequent and discriminative subgraphs are high-quality indexing features
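A feature-based similarity measure can be sketched as follows. The features here are single labeled edges rather than general frequent subgraphs, and the feature list, graph encoding, and function names are all hypothetical choices for the example.

```python
import math


def feature_vector(graph_edges, features):
    """1.0 if the graph contains the feature (here: a labeled edge), else 0.0."""
    normalized = {tuple(sorted(edge)) for edge in graph_edges}
    return [1.0 if f in normalized else 0.0 for f in features]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Assumed list of frequent patterns used as features.
features = [("C", "N"), ("C", "O"), ("N", "O")]
g1 = {("C", "O"), ("C", "N")}
g2 = {("N", "C"), ("O", "N")}
v1 = feature_vector(g1, features)
v2 = feature_vector(g2, features)
distance = euclidean(v1, v2)
print(v1, v2, distance)
```

Once graphs are mapped to vectors, any vector distance can replace the Euclidean one without touching the feature extraction step.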
Social network
Different levels of social network analysis
Common measures and methods for social network analysis
Link analysis
Social network: a social structure consisting of nodes and ties.
Nodes are the individual actors within the networks
May be of different kinds
May have attributes, labels, or classes
Ties are the relationships between the actors
May be of different kinds
Links may have attributes and may be directed or undirected
Homogeneous networks
Single object type and single link type
Single-mode social networks (e.g., friends)
WWW: a collection of linked Web pages
Heterogeneous networks
Multiple object and link types
Medical network: patients, doctors, diseases, contacts, treatments
Bibliographic network: publications, authors, venues
Number of degrees of separation in actual social networks?
Six degrees of separation: everyone is, on average, six "steps" away from any other person on Earth.
Empirical studies
Michael Gurevich, 1961. US population linked by 2 intermediaries
Duncan Watts, 2001. Email delivery on the Internet: average number of intermediaries is 6.
Leskovec and Horvitz, 2007. Instant messages: average path length is 6.6
Is Kevin Bacon the most connected actor?
Rank  Name               Average distance  # of movies  # of links
1     Rod Steiger        2.537527          112          2562
2     Donald Pleasence   2.542376          180          2874
3     Martin Sheen       2.551210          136          3501
4     Christopher Lee    2.552497          201          2993
5     Robert Mitchum     2.557181          136          2905
6     Charlton Heston    2.566284          104          2552
7     Eddie Albert       2.567036          112          3333
8     Robert Vaughn      2.570193          126          2761
9     Donald Sutherland  2.577880          107          2865
10    John Gielgud       2.578980          122          2942
11    Anthony Quinn      2.579750          146          2978
12    James Earl Jones   2.584440          112          3787
…
876   Kevin Bacon        2.786981          46           1811
…
Kevin Bacon: average separation 2.79
[Figure: Rod Steiger (#1), Donald Pleasence (#2), Martin Sheen (#3), Kevin Bacon (#876)]
Actor level: centrality, prestige, and roles such as isolates, liaisons, bridges, etc.
Dyadic level: distance and reachability, structural and other notions of equivalence, and tendencies toward reciprocity.
Triadic level: balance and transitivity
Subset level: cliques, cohesive subgroups, components
Network level: connectedness, diameter, centralization, density, prestige, etc.
April 2, 2008 Social network analysis: methods and applications
Measures in Social Network Analysis – Actor level
Non-directional graphs
Degree Centrality
The number of direct connections a node has
A node with many connections is a 'connector' or 'hub' in the network
Betweenness Centrality
The degree to which an individual lies between other individuals in the network
An intermediary, liaison, or bridge
Closeness Centrality
The degree to which an individual is near all other individuals in a network (directly or indirectly)
Eigenvector centrality
A measure of the relative importance of a node
Based on the principle that connections to high-scoring nodes contribute more to a node's score
Directional graphs
Prestige: measures the degree of incoming ties
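Two of the actor-level measures can be sketched in a few lines. The normalizations used here (degree divided by n - 1, and closeness as (n - 1) divided by the sum of shortest-path distances) are common conventions and are assumptions of this example, as are the adjacency-dictionary encoding and the toy star network. Betweenness and eigenvector centrality are left out to keep the sketch short.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path distances (in hops) from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj):
    """Number of direct connections, normalized by the maximum possible."""
    n = len(adj)
    return {u: len(adj[u]) / (n - 1) for u in adj}


def closeness_centrality(adj):
    """(n - 1) / sum of distances; 0.0 if some node is unreachable."""
    n = len(adj)
    result = {}
    for u in adj:
        dist = bfs_distances(adj, u)
        total = sum(dist.values())
        result[u] = (n - 1) / total if len(dist) == n and total > 0 else 0.0
    return result


# Hypothetical undirected star network: one hub connected to three leaves.
adj = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
print(degree["hub"], degree["a"], closeness["hub"], closeness["a"])
```

In the star network the hub scores highest on both measures, which matches the 'connector' intuition given for degree centrality.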
Figure source: OrgNet.com
Measures in Social Network Analysis – Dyadic, Triadic and Subset Level
Path Length
The distances between pairs of nodes in the network.
Structural equivalence
Extent to which actors have a common set of linkages
to other actors in the system.
Clustering coefficient
A measure of the likelihood that two associates of a
node are associates themselves
Cliquishness of u’s neighborhood
Cohesion
The degree to which actors are connected directly to
each other by cohesive bonds
Cliques
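The clustering coefficient of a single node can be computed directly from its definition: the fraction of pairs of neighbors that are themselves connected. The sketch assumes an undirected graph stored as a dictionary of neighbor sets; the function name and the toy graph are illustrative.

```python
from itertools import combinations


def clustering_coefficient(adj, u):
    """Fraction of pairs of u's neighbors that are linked to each other."""
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to test
    links = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return 2 * links / (k * (k - 1))


# Hypothetical graph: u knows a, b, c; only a and b know each other.
adj = {
    "u": {"a", "b", "c"},
    "a": {"u", "b"},
    "b": {"u", "a"},
    "c": {"u"},
}
cc = clustering_coefficient(adj, "u")
print(cc)
```

Of the three neighbor pairs of u, only (a, b) is linked, so the coefficient is one third.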
Measures in Social Network Analysis – Network Level
Network Centralization
The difference in the number of links across nodes
Centralized vs. decentralized networks
Network density
Proportion of ties in a network relative to the total number possible
Sparse vs. dense networks
Average Path Length
Average of distances between all pairs of nodes
Reach
The degree to which any member of a network can reach other members of the network.
Structural cohesion
The minimum number of members who, if removed from a group, would disconnect the group.
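Network density and average path length follow directly from their definitions. The sketch below assumes an undirected, connected graph given as a dictionary of neighbor sets; the function names and the four-node path are made up for the example.

```python
from collections import deque


def density(adj):
    """Existing ties divided by the n * (n - 1) / 2 possible ties."""
    n = len(adj)
    edges = sum(len(neighbors) for neighbors in adj.values()) / 2
    return edges / (n * (n - 1) / 2)


def average_path_length(adj):
    """Mean shortest-path distance over all reachable ordered pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Hypothetical path network A - B - C - D.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
density_value = density(adj)
apl = average_path_length(adj)
print(density_value, apl)
```

The path has 3 of 6 possible ties, and its pairwise distances (1, 1, 1, 2, 2, 3) average to 5/3.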
Object-Related Tasks
Link-based object ranking
Link-based object classification
Object clustering (group detection)
Object identification (entity resolution)
Link-Related Tasks
Link prediction
Graph-Related Tasks
Subgraph discovery
Graph classification
Generative model for graphs
Link-based object ranking for WWW (actor-level analysis)
PageRank
HITS
Influence and diffusion
Exploit the link structure of a graph to order or prioritize the set of objects within the graph
Focused on graphs with single object type and single link type
Focus of link analysis community
Algorithms
PageRank
HITS
Intuition
Web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
Links as citations: a page cited often is more important
www.stanford.edu has 23,400 inlinks
www.joe-schmoe.com has 1 inlink
Are all links equal?
Recursive model: being cited by a highly cited paper counts a lot…
Eigenvector prestige measure
Each link’s vote is proportional to the importance of its
source page
If page P with importance x has n outlinks, each link gets
x/n votes
Page P’s own importance is the sum of the votes on its
inlinks
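[Figure: three-page web graph with Yahoo (y), Amazon (a), and M’soft (m); Yahoo links to itself and to Amazon (y/2 each), Amazon links to Yahoo and M’soft (a/2 each), M’soft links to Amazon (m)]
Flow equations:
y = y/2 + a/2
a = y/2 + m
m = a/2
Solving the equations with the constraint y + a + m = 1:
y = 2/5, a = 2/5, m = 1/5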
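The stated solution follows by substitution. The steps below use only the three flow equations and the normalization constraint from the slide.

```latex
\begin{align*}
y &= \tfrac{y}{2} + \tfrac{a}{2} \;\Rightarrow\; y = a, \\
m &= \tfrac{a}{2}, \\
y + a + m &= a + a + \tfrac{a}{2} = \tfrac{5a}{2} = 1 \;\Rightarrow\; a = \tfrac{2}{5}, \\
y &= \tfrac{2}{5}, \qquad m = \tfrac{1}{5}, \\
\text{check: } \tfrac{y}{2} + m &= \tfrac{1}{5} + \tfrac{1}{5} = \tfrac{2}{5} = a.
\end{align*}
```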
Web link matrix M: one row and one column per web page
Suppose page j has n outlinks; if j → i, then M_ij = 1/n, else M_ij = 0
M is a column-stochastic matrix: columns sum to 1
Rank vector r: one entry per web page
r_i is the importance score of page i
|r| = 1
Flow equation: r = Mr
Rank vector is an eigenvector of the web matrix
Example (rows and columns ordered y, a, m for Yahoo, Amazon, M’soft):
$$
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
$$
y = y/2 + a/2
a = y/2 + m
m = a/2
r = Mr:
$$
\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}
$$
Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize: r_0 = [1/N, …, 1/N]^T
Iterate: r_{k+1} = M r_k
Stop when |r_{k+1} - r_k|_1 < ε
|x|_1 = ∑_{1≤i≤N} |x_i| is the L1 norm
Can use any other vector norm, e.g., Euclidean
Example (rows and columns ordered y, a, m for Yahoo, Amazon, M’soft):
$$
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix}
$$
      r_0    r_1    r_2    r_3     ...   limit
y     1/3    1/3    5/12   3/8     ...   2/5
a     1/3    1/2    1/3    11/24   ...   2/5
m     1/3    1/6    1/4    1/6     ...   1/5
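The iterative scheme can be written in a few lines of plain Python. This is a minimal sketch: the function name `power_iteration`, the tolerance, and the iteration cap are assumptions of the example, and the matrix is the Yahoo/Amazon/M’soft example with rows and columns ordered y, a, m.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change < eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r


# Column-stochastic link matrix; column j holds 1/n for each outlink of page j.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
r = power_iteration(M)
print(r)
```

The result approaches (2/5, 2/5, 1/5), the same vector obtained by solving the flow equations directly.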
Imagine a random web surfer
At any time t, surfer is on some page P
At time t+1, the surfer follows an outlink from P uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely
p(t) is the probability distribution whose ith
component is the probability that the surfer is at page i at time t
Where is the surfer at time t+1?
p(t+1) = Mp(t)
Suppose the random walk reaches a state such
that p(t+1) = Mp(t) = p(t)
Then p(t) is a stationary distribution for the random
walk
Our rank vector r satisfies r = Mr
Theory of random walks (aka Markov processes):
For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.
A group of pages is a spider trap if there are no
links from within the group to outside the group
Spider traps violate the conditions needed for the
random walk theorem
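Example (rows and columns ordered y, a, m for Yahoo, Amazon, M’soft; M’soft links only to itself):
$$
M = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix}
$$
      r_0   r_1   r_2   r_3   ...   limit
y     1     1     3/4   5/8   ...   0
a     1     1/2   1/2   3/8   ...   0
m     1     3/2   7/4   2     ...   3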
At each time step, the random surfer has two options:
With probability β, follow a link at random
With probability 1-β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a few time steps
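Example with β = 0.8 (rows and columns ordered y, a, m for Yahoo, Amazon, M’soft):
$$
A = 0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}
$$
      r_0   r_1    r_2    r_3     ...   limit
y     1     1.00   0.84   0.776   ...   7/11
a     1     0.60   0.60   0.536   ...   5/11
m     1     1.40   1.56   1.688   ...   21/11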
Matrix formulation: the N × N matrix A
A_ij = βM_ij + (1-β)/N
M_ij = 1/|O(j)| when j → i and M_ij = 0 otherwise, where O(j) is the set of outlinks of page j
Verify that A is a stochastic matrix
The PageRank vector r is the principal eigenvector of A, satisfying r = Ar
Equivalently, r is the stationary distribution of the random walk with teleports
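PageRank with teleports can be sketched by building A explicitly and iterating r ← Ar. The function name `pagerank`, the default β = 0.8, the tolerance, and the dense-matrix representation are assumptions of this example. Building A in full is only practical for tiny graphs.

```python
def pagerank(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration on A = beta * M + (1 - beta) / N (dense, tiny graphs)."""
    n = len(M)
    A = [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next, A
        r = r_next
    return r, A


# Spider-trap example (order y, a, m): M'soft links only to itself.
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
ranks, A = pagerank(M_trap)
print(ranks)
```

Here the ranks sum to 1, so they equal (7/33, 5/33, 21/33). Multiplying by 3 gives the limit (7/11, 5/11, 21/11) from the example that starts at (1, 1, 1).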
Intuitions
Pages that are widely cited are good authorities
Pages that cite many other pages are good hubs
HITS (Hypertext-Induced Topic Selection)
Authorities are pages linked by hubs
Hubs are pages that link to authorities
Iterative reinforcement …
[Figure: hubs pointing to authorities]
Transition (adjacency) matrix A
A[i, j] = 1 if page i links to page j, 0 if not
The hub score vector h: score is proportional to the sum of the authority scores of the pages it links to
h = λAa
Constant λ is a scale factor
The authority score vector a: score is proportional to the sum of the hub scores of the pages it is linked from
a = μA^T h
Constant μ is a scale factor
Example (rows and columns ordered y, a, m for Yahoo, Amazon, M’soft):
$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}
$$
Initialize h, a to all 1’s
h = Aa
Scale h so that its max entry is 1.0
a = A^T h
Scale a so that its max entry is 1.0
Continue until h, a converge
Example:
$$
A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad
A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}
$$
             start                         ...   limit
a(yahoo)     1     1     1      1          ...   1
a(amazon)    1     1     4/5    0.75       ...   0.732
a(m’soft)    1     1     1      1          ...   1
h(yahoo)     1     1     1      1          ...   1.000
h(amazon)    1     2/3   0.71   0.73       ...   0.732
h(m’soft)    1     1/3   0.29   0.27       ...   0.268
h = λAa
a = μA^T h
h = λμAA^T h
a = λμA^T A a
Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
h* is the principal eigenvector of AA^T
a* is the principal eigenvector of A^T A
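The dual iteration can be sketched as follows, following the update order h = Aa, then a = A^T h, with each vector scaled so that its largest entry is 1.0. The function names, tolerance, and iteration cap are assumptions of this example. The matrix is the Yahoo/Amazon/M’soft adjacency matrix with rows and columns ordered y, a, m.

```python
def scale_to_max(v):
    """Scale a vector so that its largest entry is 1.0."""
    top = max(v)
    return [x / top for x in v]


def hits(A, eps=1e-12, max_iter=1000):
    """Alternate hub and authority updates until both vectors stop changing."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(max_iter):
        # Hub score: sum of authority scores of the pages a page links to.
        h_new = scale_to_max(
            [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        )
        # Authority score: sum of hub scores of the pages linking to a page.
        a_new = scale_to_max(
            [sum(A[i][j] * h_new[i] for i in range(n)) for j in range(n)]
        )
        change = sum(abs(x - y) for x, y in zip(h_new + a_new, h + a))
        h, a = h_new, a_new
        if change < eps:
            break
    return h, a


A = [
    [1, 1, 1],  # Yahoo links to Yahoo, Amazon, M'soft
    [1, 0, 1],  # Amazon links to Yahoo, M'soft
    [0, 1, 0],  # M'soft links to Amazon
]
hubs, authorities = hits(A)
print(hubs, authorities)
```

The hub vector approaches (1, 0.732, 0.268) and the authority vector approaches (1, 0.732, 1), the limits shown in the worked example.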
Similarities
Iterative algorithm based on the linkage of the documents on the web
Same problem: what is the value of a link from S to D?
Different models
PageRank: depends on the links into S
HITS: depends on the value of the other links out of S
The destinies of PageRank and HITS post-1998
PageRank: trademark of Google
HITS: not commonly used by search engines (Ask.com?)
Link-based object ranking for WWW (actor-level analysis)
PageRank
HITS
Influence and diffusion
OrgNet.com
CDC: Spread of Airborne Disease
Paper presentations:
Knowledge discovery from transportation network data
Maximizing the spread of influence through a social network
Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography
substructures of molecules”, ICDM'02
Networks”, PKDD'05.
Approaches for Classifying Chemical Compounds”, ICDM 2003
structures”, BIOKDD'02
compounds”, KDD'98
KDD'04
Molecular Graphs”, ICML’05
Alternatives”, COLT/Kernel’03
KDD'94
spatial motifs from protein structure graphs”, RECOMB’04
Massive Biological Networks for Functional Discovery”, ISMB'05
substructures from graph data”, PKDD'00
Daylight Chemical Information Systems, Inc., 2003.
ICML’03
frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004.
Classification”, NIPS’04
Algorithm”, ICDM’04
Noncrashing Bugs’'', SDM'05
Kernels”, ICML’04
KDD'04
from graph databases”. KDD'04
graph searching”, PODS'02
semistructured data”, ICDM'02
graph databases”, KDD'04
Explorations, 5:59-68, 2003
SIGMOD'04
Constraints”, KDD'05
SIGMOD'05
Distance”, ICDE'06
Social Networks. CIKM’03
Viral Marketing. KDD’02
Influence through a Social Network. KDD’03.
Intelligent Systems, 20(1), 80-82, 2005.
search engine. WWW7.
Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99
SIGIR'2004.