Data Mining: Concepts and Techniques, Chapter 9: Graph Mining and Social Network Analysis




slide-1
SLIDE 1

April 2, 2008


Data Mining:

Concepts and Techniques

— Chapter 9 —

Graph mining and Social Network Analysis

Li Xiong

Slides credits: Jiawei Han and Micheline Kamber

slide-2
SLIDE 2

Graph Mining and Social Network Analysis

Graph mining

Frequent subgraph mining

Social network analysis

  • Social network
  • Social network analysis at different levels
  • Link analysis

April 2, 2008 Mining and Searching Graphs in Graph Databases


slide-3
SLIDE 3


Graph Mining

Methods for Mining Frequent Subgraphs

Applications:

  • Graph Indexing
  • Similarity Search
  • Classification and Clustering

Summary

slide-4
SLIDE 4


Why Graph Mining?

Graphs are ubiquitous

  • Chemical compounds (cheminformatics)
  • Protein structures, biological pathways/networks (bioinformatics)
  • Program control flow, traffic flow, and workflow analysis
  • XML databases, Web, and social network analysis

Graph is a general model

  • Trees, lattices, sequences, and items are degenerate graphs

Diversity of graphs

  • Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)

Complexity of algorithms: many problems are of high complexity

slide-5
SLIDE 5


Graph, Graph, Everywhere

[Figure: four example graphs: aspirin; a yeast protein interaction network (from H. Jeong et al., Nature 411, 41 (2001)); the Internet; a co-author network]

slide-6
SLIDE 6


Graph Pattern Mining

Frequent subgraph mining

  • Finding frequent subgraphs within a single graph
  • Finding frequent (sub)graphs in a set of graphs
  • A subgraph is frequent if its support (occurrence frequency) is no less than a minimum support threshold

Applications of graph pattern mining

  • Mining biochemical structures, program control flow analysis, XML structures or Web communities
  • Building blocks for graph classification, clustering, compression, comparison, and correlation analysis
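The support definition above can be illustrated on the smallest possible patterns, single labeled edges. The sketch below is only an assumption-laden toy: the graph encoding (a dict of vertex labels plus an edge list), the function names, and the three-graph dataset are all hypothetical, and real miners extend this counting to larger subgraphs via isomorphism tests.

```python
from collections import Counter

def edge_patterns(vertex_labels, edges):
    """Return the set of labeled-edge patterns present in one undirected graph."""
    patterns = set()
    for u, v in edges:
        # Sort the two labels so (C, O) and (O, C) count as the same pattern.
        patterns.add(tuple(sorted((vertex_labels[u], vertex_labels[v]))))
    return patterns

def frequent_edge_patterns(graphs, min_support):
    """Support of a pattern = number of graphs in the set that contain it."""
    support = Counter()
    for vertex_labels, edges in graphs:
        # Each graph contributes at most 1 to a pattern's support.
        support.update(edge_patterns(vertex_labels, edges))
    return {p: s for p, s in support.items() if s >= min_support}

# Hypothetical toy dataset: three small labeled graphs.
graphs = [
    ({0: "C", 1: "O", 2: "N"}, [(0, 1), (0, 2)]),
    ({0: "C", 1: "O", 2: "S"}, [(0, 1), (1, 2)]),
    ({0: "C", 1: "N", 2: "N"}, [(0, 1), (1, 2)]),
]
frequent = frequent_edge_patterns(graphs, min_support=2)
```

With a minimum support of 2, only the C-O and C-N edges survive, because each appears in two of the three graphs.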

slide-7
SLIDE 7


Example: Frequent Subgraph Mining in Chemical Compounds

[Figure: a graph dataset of three chemical compounds (A), (B), (C), and the two frequent patterns (1), (2) found with minimum support 2]

slide-8
SLIDE 8


Graph Mining Algorithms

Finding interesting and frequent substructures in a single graph

  • SUBDUE

Finding frequent patterns in a set of independent graphs

  • Apriori-based approach
  • Pattern-growth approach

slide-9
SLIDE 9


SUBDUE (Holder et al. KDD’94)

Problem

Finding “interesting” and repetitive substructures (connected subgraphs) in data represented as a graph

Basic idea

Minimum description length (MDL) principle

Beam search algorithm

  • Start with best single vertices
  • Expand best substructures with a new edge
  • Substructures are evaluated based on their ability to compress input graphs

slide-10
SLIDE 10

Minimum Description Length (MDL)

Minimum description length (MDL) principle

  • A formalization of Occam’s Razor
  • Best hypothesis minimizes description length of the data (largest compression)

Graph substructure discovery based on MDL

  • Description length (DL): represent vertices and adjacency matrix
  • Graph compression: replace substructure instances with pointers
  • Find best substructure S in G that minimizes: DL(S) + DL(G|S)

[Figure: input database (G) with vertices R1, C1, T1 to T4, S1 to S4; substructure S1 (a triangle and a square); compressed database (G|S1), in which instances of the substructure are replaced by S1]

Holder et al.

slide-11
SLIDE 11

Beam Search Algorithm

Beam search

  • An optimization of best-first search
  • Breadth-first search with a predetermined number of paths kept as candidates (beam width)

Subgraph discovery based on beam search

  • Start with best single vertices
  • Expand best substructures with a new edge
  • Substructures are evaluated based on their ability to compress input graphs (minimize description length)
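The level-by-level pruning described above can be sketched as a generic beam search. This is not SUBDUE itself: the candidates here are connected vertex sets in a small hypothetical weighted graph, and the score (a sum of made-up vertex weights) merely stands in for the compression value that SUBDUE would compute. All names are illustrative assumptions.

```python
def beam_search(initial, expand, score, beam_width, max_steps):
    """Keep only the `beam_width` best candidates at each level of the search."""
    beam = sorted(initial, key=score, reverse=True)[:beam_width]
    best = beam[0]
    for _ in range(max_steps):
        # Expand every surviving candidate; a set removes duplicate children.
        children = {child for cand in beam for child in expand(cand)}
        if not children:
            break
        beam = sorted(children, key=score, reverse=True)[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Hypothetical toy graph and vertex weights.
adj = {"a": ["b", "e"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": ["a"]}
weight = {"a": 1, "b": 5, "c": 1, "d": 9, "e": 2}

def expand(vertices):
    """Grow a connected vertex set by one neighboring vertex."""
    return [vertices | {n} for v in sorted(vertices) for n in adj[v] if n not in vertices]

def score(vertices):
    return sum(weight[v] for v in vertices)

start = [frozenset({v}) for v in adj]
best = beam_search(start, expand, score, beam_width=2, max_steps=2)
```

With a beam width of 2 the search starts from the two heaviest single vertices (d and b) and, after two expansions, settles on the set {b, c, d}.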


slide-12
SLIDE 12

Algorithm

1. Create substructure for each unique vertex label

Substructures (S): triangle (4), square (4), circle (1), rectangle (1)

[Figure: input database (G) and its graph form, with vertices R1, C1, T1 to T4, S1 to S4; the shape nodes (circle, rectangle, triangle, square) are joined by "on" edges]

Holder et al.

slide-13
SLIDE 13

Algorithm (cont.)

2. Expand best substructures by an edge or edge + neighboring vertex

Substructures (S): triangle-square, rectangle-square, rectangle-triangle, and rectangle-circle, each pair joined by an "on" edge

[Figure: the candidate substructures shown alongside the input graph]

Holder et al.
slide-14
SLIDE 14

Algorithm (cont.)

3. Keep best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph with hierarchical description

Holder et al. SRL Workshop

slide-15
SLIDE 15


Frequent Subgraph Mining Approaches

Problem: finding frequent subgraphs in a set of graphs

Apriori-based approach

  • AGM: Inokuchi, et al. (PKDD’00)
  • FSG: Kuramochi and Karypis (ICDM’01)
  • PATH#: Vanetik and Gudes (ICDM’02, ICDM’04)
  • FFSM: Huan, et al. (ICDM’03)

Pattern growth approach

  • MoFa: Borgelt and Berthold (ICDM’02)
  • gSpan: Yan and Han (ICDM’02)
  • Gaston: Nijssen and Kok (KDD’04)

Closed pattern mining

  • CloseGraph: Yan & Han (KDD’03)

slide-16
SLIDE 16


Apriori-Based Approach

[Figure: frequent subgraphs G’ and G’’ are joined (JOIN) to build candidate subgraphs with an extra vertex or edge; other labels in the figure: G, G1, G2, …, Gn]

Level-wise algorithm: building candidate subgraphs from small frequent subgraphs

slide-17
SLIDE 17


Apriori-Based Search

AGM (Apriori-based Graph Mining), Inokuchi, et al. PKDD’00

  • generates new graphs with one more node

FSG (Frequent SubGraph mining), Kuramochi and Karypis, ICDM’01

  • generates new graphs with one more edge

[Figure: example graphs with vertices labeled a, b, and c]

slide-18
SLIDE 18


Pattern Growth Method

[Figure: pattern growth from k-edge to (k+1)-edge to (k+2)-edge graphs, with labels G, G1, G2, …, Gn; one graph reached a second time is marked "duplicate graph"]

slide-19
SLIDE 19

gSpan (Yan and Han, ICDM’02)

Depth-first search and right-most extension

slide-20
SLIDE 20


Graph Mining

Methods for Mining Frequent Subgraphs

Applications:

  • Classification and Clustering
  • Graph Indexing
  • Similarity Search

slide-21
SLIDE 21


Using Graph Patterns

Similarity measures based on graph patterns

Feature-based similarity measure

  • Each graph is represented as a feature vector
  • Frequent subgraphs can be used as features
  • Vector distance

Structure-based similarity measure

  • Maximal common subgraph
  • Graph edit distance: insertion, deletion, and relabel

Frequent and discriminative subgraphs are high-quality indexing features
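The feature-based measure can be sketched in a few lines, assuming the frequent subgraphs have already been mined and that we know which patterns each graph contains. The pattern names ("C-O", "C-N", "N-N"), the three graphs, and the choice of binary features with Euclidean distance are all hypothetical illustrations, not something the slides prescribe.

```python
import math

def feature_vector(patterns_in_graph, features):
    """Binary vector: 1 if the graph contains the frequent pattern, else 0."""
    return [1 if f in patterns_in_graph else 0 for f in features]

def euclidean(u, v):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical frequent patterns used as features, in a fixed order.
features = ["C-O", "C-N", "N-N"]

# Hypothetical record of which patterns occur in each graph.
g1 = feature_vector({"C-O", "C-N"}, features)   # [1, 1, 0]
g2 = feature_vector({"C-O"}, features)          # [1, 0, 0]
g3 = feature_vector({"C-N", "N-N"}, features)   # [0, 1, 1]

d12 = euclidean(g1, g2)
d13 = euclidean(g1, g3)
d23 = euclidean(g2, g3)
```

Graphs that share more patterns end up closer: here g1 is nearer to g2 than g2 is to g3, which share no pattern at all.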

slide-22
SLIDE 22

Social Network Analysis

  • Social network
  • Different levels of social network analysis
  • Common measures and methods for social network analysis
  • Link analysis


slide-23
SLIDE 23

Social Network

Social network: a social structure consisting of nodes and ties.

Nodes are the individual actors within the networks

  • May be different kinds
  • May have attributes, labels or classes

Ties are the relationships between the actors

  • May be different kinds
  • Links may have attributes and may be directed or undirected

Homogeneous networks

  • Single object type and single link type
  • Single model social networks (e.g., friends)
  • WWW: a collection of linked Web pages

Heterogeneous networks

  • Multiple object and link types
  • Medical network: patients, doctors, disease, contacts, treatments
  • Bibliographic network: publications, authors, venues

slide-24
SLIDE 24

Small World Phenomenon

Number of degrees of separation in actual social networks?

Six degrees of separation: on average, everyone is six "steps" away from any other person on Earth.

Empirical studies

  • Michael Gurevich, 1961. US population linked by 2 intermediaries
  • Duncan Watts, 2001. Email delivery on the internet: average number of intermediaries is 6.
  • Leskovec and Horvitz, 2007. Instant messages: average path length is 6.6


slide-25
SLIDE 25


Six Degrees of Kevin Bacon

  • Vertices: actors and actresses
  • Edge between u and v if they appeared in a film together

Is Kevin Bacon the most connected actor?

NO!

    Rank   Name                Average distance   # of movies   # of links
    1      Rod Steiger         2.537527           112           2562
    2      Donald Pleasence    2.542376           180           2874
    3      Martin Sheen        2.551210           136           3501
    4      Christopher Lee     2.552497           201           2993
    5      Robert Mitchum      2.557181           136           2905
    6      Charlton Heston     2.566284           104           2552
    7      Eddie Albert        2.567036           112           3333
    8      Robert Vaughn       2.570193           126           2761
    9      Donald Sutherland   2.577880           107           2865
    10     John Gielgud        2.578980           122           2942
    11     Anthony Quinn       2.579750           146           2978
    12     James Earl Jones    2.584440           112           3787
    …
    876    Kevin Bacon         2.786981           46            1811
    …

Kevin Bacon

  • No. of movies: 46
  • No. of actors: 1811
  • Average separation: 2.79

slide-26
SLIDE 26


[Figure: Rod Steiger (#1), Donald Pleasence (#2), Martin Sheen (#3), and Kevin Bacon (#876)]

slide-27
SLIDE 27

Social Network Analysis

  • Actor level: centrality, prestige, and roles such as isolates, liaisons, bridges, etc.
  • Dyadic level: distance and reachability, structural and other notions of equivalence, and tendencies toward reciprocity.
  • Triadic level: balance and transitivity
  • Subset level: cliques, cohesive subgroups, components
  • Network level: connectedness, diameter, centralization, density, prestige, etc.

April 2, 2008 Social network analysis: methods and applications


slide-28
SLIDE 28

Measures in Social Network Analysis – Actor level

Non-directional graphs

Degree Centrality

  • The number of direct connections a node has
  • A node with high degree is a 'connector' or 'hub' in the network

Betweenness Centrality

  • The degree to which an individual lies between other individuals in the network
  • A node with high betweenness is an intermediary, liaison, or bridge

Closeness Centrality

  • The degree to which an individual is near all other individuals in a network (directly or indirectly)

Eigenvector centrality

  • A measure of the relative importance of a node
  • Based on the principle that connections to high-scoring nodes contribute more to a node's score

Directional graphs

  • Prestige: measures the degree of incoming ties
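Two of the actor-level measures above are simple enough to sketch directly. The code below assumes an undirected, connected graph stored as an adjacency dict; the five-node network and the normalization of closeness as (n - 1) divided by the sum of shortest-path distances are illustrative choices, one common convention among several.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path (hop) distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj):
    """Number of direct connections of each node."""
    return {u: len(neighbors) for u, neighbors in adj.items()}

def closeness_centrality(adj):
    """(n - 1) / sum of distances to all other nodes; assumes a connected graph."""
    n = len(adj)
    result = {}
    for u in adj:
        total = sum(bfs_distances(adj, u).values())
        result[u] = (n - 1) / total if total > 0 else 0.0
    return result

# Hypothetical network: h is a hub, c bridges to d.
adj = {"h": ["a", "b", "c"], "a": ["h"], "b": ["h"], "c": ["h", "d"], "d": ["c"]}
degree = degree_centrality(adj)
closeness = closeness_centrality(adj)
```

In this toy network the hub h has both the highest degree and the highest closeness, while d, which hangs off the bridge node c, has the lowest closeness.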

slide-29
SLIDE 29

Actor Centrality Example

OrgNet.com

slide-30
SLIDE 30

Measures in Social Network Analysis – Dyadic, Triadic and Subset Level

Path Length

  • The distances between pairs of nodes in the network.

Structural equivalence

  • Extent to which actors have a common set of linkages to other actors in the system.

Clustering coefficient

  • A measure of the likelihood that two associates of a node are associates themselves
  • Cliquishness of u’s neighborhood

Cohesion

  • The degree to which actors are connected directly to each other by cohesive bonds

Cliques
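The clustering coefficient of a node u can be computed as the fraction of pairs of u's neighbors that are themselves linked. The sketch below assumes an undirected graph stored as a dict of neighbor sets; the four-node example and the convention of returning 0 for nodes with fewer than two neighbors are assumptions for illustration.

```python
from itertools import combinations

def clustering_coefficient(adj, u):
    """Fraction of pairs of u's neighbors that are directly connected."""
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention assumed here: undefined cases count as 0
    links = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return links / (k * (k - 1) / 2)

# Hypothetical graph: u knows a, b, c; only a and b know each other.
adj = {
    "u": {"a", "b", "c"},
    "a": {"u", "b"},
    "b": {"u", "a"},
    "c": {"u"},
}
c_u = clustering_coefficient(adj, "u")
c_a = clustering_coefficient(adj, "a")
```

Node u has three neighbor pairs and one of them is linked, so its coefficient is 1/3; node a's two neighbors (u and b) are linked, so its coefficient is 1.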

slide-31
SLIDE 31

Measures in Social Network Analysis – Network Level

Network Centralization

  • The variation in the number of links across nodes
  • Centralized vs. decentralized networks

Network density

  • Proportion of ties in a network relative to the total number possible
  • Sparse vs. dense networks

Average Path Length

  • Average of distances between all pairs of nodes

Reach

  • The degree to which any member of a network can reach other members of the network.

Structural cohesion

  • The minimum number of members who, if removed from a group, would disconnect the group.
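Density and average path length follow directly from their definitions. The sketch below assumes an undirected, connected graph with no self-loops, stored as an adjacency dict; the four-node path graph is a hypothetical example.

```python
from collections import deque

def density(adj):
    """Ties present divided by ties possible, n(n-1)/2, for an undirected graph."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2  # each edge counted twice
    return 2 * m / (n * (n - 1))

def average_path_length(adj):
    """Mean shortest-path distance over all ordered pairs; assumes connectivity."""
    n = len(adj)
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total / (n * (n - 1))

# Hypothetical path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
dens = density(adj)
apl = average_path_length(adj)
```

The path graph has 3 of 6 possible ties (density 0.5), and its six node pairs are at distances 1, 2, 3, 1, 2, 1, giving an average path length of 10/6.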

slide-32
SLIDE 32


Another Taxonomy of Link Mining Tasks

Object-Related Tasks

  • Link-based object ranking
  • Link-based object classification
  • Object clustering (group detection)
  • Object identification (entity resolution)

Link-Related Tasks

  • Link prediction

Graph-Related Tasks

  • Subgraph discovery
  • Graph classification
  • Generative model for graphs

slide-33
SLIDE 33

Social Network Applications

Link-based object ranking for WWW (actor-level analysis)

  • PageRank
  • HITS

Influence and diffusion


slide-34
SLIDE 34


Link-Based Object Ranking (LBR)

  • Exploit the link structure of a graph to order or prioritize the set of objects within the graph
  • Focused on graphs with single object type and single link type
  • Focus of link analysis community

Algorithms

  • PageRank
  • HITS

slide-35
SLIDE 35

PageRank: Ranking web pages (Brin & Page’98)

Intuition

  • Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu
  • Links as citations: a page cited often is more important (www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink)
  • Are all links equal?
  • Recursive model: being cited by a highly cited paper counts a lot…

Eigenvector prestige measure

slide-36
SLIDE 36

Simple Recursive Flow Model

  • Each link’s vote is proportional to the importance of its source page
  • If page P with importance x has n outlinks, each link gets x/n votes
  • Page P’s own importance is the sum of the votes on its inlinks

[Figure: three-page web graph of Yahoo (y), Amazon (a), and M’soft (m); Yahoo sends y/2 to itself and y/2 to Amazon, Amazon sends a/2 to Yahoo and a/2 to M’soft, and M’soft sends m to Amazon]

Flow equations:

    y = y/2 + a/2
    a = y/2 + m
    m = a/2

Solving the equations with the constraint y + a + m = 1 gives y = 2/5, a = 2/5, m = 1/5
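For readers who want the intermediate steps, the stated solution can be checked by hand. The derivation below uses only the three flow equations and the normalization constraint from this slide.

```latex
\begin{align*}
y &= \tfrac{y}{2} + \tfrac{a}{2} \;\Rightarrow\; y = a \\
m &= \tfrac{a}{2} \\
y + a + m &= a + a + \tfrac{a}{2} = \tfrac{5a}{2} = 1 \;\Rightarrow\; a = \tfrac{2}{5} \\
y &= \tfrac{2}{5}, \qquad m = \tfrac{1}{5}
\end{align*}
```

The remaining equation is then satisfied as well: y/2 + m = 1/5 + 1/5 = 2/5 = a.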

slide-37
SLIDE 37

Matrix formulation

Web link matrix M: one row and one column per web page

  • Suppose page j has n outlinks; if j → i, then M_ij = 1/n, else M_ij = 0
  • M is a column stochastic matrix: columns sum to 1

Rank vector r: one entry per web page

  • r_i is the importance score of page i
  • |r| = 1

Flow equation: r = Mr

  • Rank vector is an eigenvector of the web matrix

[Figure: the product Mr = r, highlighting entry M_ij in row i, column j]

slide-38
SLIDE 38

Matrix formulation Example

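[Figure: the Yahoo (y), Amazon (a), M’soft (m) web graph]

Link matrix M (column = source page, row = destination page):

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0

Flow equations:

    y = y/2 + a/2
    a = y/2 + m
    m = a/2

In matrix form, r = Mr:

    [y]   [1/2  1/2   0 ] [y]
    [a] = [1/2   0    1 ] [a]
    [m]   [ 0   1/2   0 ] [m]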

slide-39
SLIDE 39

Power Iteration method

  • Simple iterative scheme (aka relaxation)
  • Suppose there are N web pages
  • Initialize: r_0 = [1/N, …, 1/N]^T
  • Iterate: r_{k+1} = M r_k
  • Stop when |r_{k+1} - r_k|_1 < ε
  • |x|_1 = ∑_{1≤i≤N} |x_i| is the L1 norm
  • Can use any other vector norm, e.g., Euclidean
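The iteration above translates almost line for line into code. The following is a minimal pure-Python sketch; the function name, the default tolerance, and the iteration cap are assumptions, and the matrix is the Yahoo/Amazon/M’soft example from the earlier slide (rows and columns ordered y, a, m).

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        # L1 norm of the difference between successive iterates.
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Column-stochastic link matrix for the three-page example (order: y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
rank = power_iteration(M)
```

On this matrix the iterates approach the flow-equation solution y = 2/5, a = 2/5, m = 1/5.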

slide-40
SLIDE 40

Power Iteration Example

[Figure: the Yahoo (y), Amazon (a), M’soft (m) web graph]

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     1
    m    0    1/2    0

Iterates of r = (y, a, m):

         r_0    r_1    r_2     r_3     …   limit
    y    1/3    1/3    5/12    3/8     …    2/5
    a    1/3    1/2    1/3     11/24   …    2/5
    m    1/3    1/6    1/4     1/6     …    1/5

slide-41
SLIDE 41

Random Walk Interpretation

Imagine a random web surfer

  • At any time t, surfer is on some page P
  • At time t+1, the surfer follows an outlink from P uniformly at random
  • Ends up on some page Q linked from P
  • Process repeats indefinitely

p(t) is the probability distribution whose ith component is the probability that the surfer is at page i at time t

slide-42
SLIDE 42

The stationary distribution

Where is the surfer at time t+1?

p(t+1) = Mp(t)

Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)

Then p(t) is a stationary distribution for the random walk

Our rank vector r satisfies r = Mr

slide-43
SLIDE 43

Existence and Uniqueness of the Solution

Theory of random walks (aka Markov processes):

For graphs that satisfy certain conditions, the stationary distribution is unique and eventually will be reached no matter what the initial probability distribution at time t = 0.


slide-44
SLIDE 44

Spider traps

  • A group of pages is a spider trap if there are no links from within the group to outside the group
  • Spider traps violate the conditions needed for the random walk theorem

Example: M’soft links only to itself

         y     a     m
    y   1/2   1/2    0
    a   1/2    0     0
    m    0    1/2    1

Iterates of r = (y, a, m), starting from (1, 1, 1):

         r_0    r_1    r_2    r_3    …   limit
    y     1      1     3/4    5/8    …     0
    a     1     1/2    1/2    3/8    …     0
    m     1     3/2    7/4     2     …     3

slide-45
SLIDE 45

Random teleports

At each time step, the random surfer has two options:

  • With probability β, follow a link at random
  • With probability 1-β, jump to some page uniformly at random

Common values for β are in the range 0.8 to 0.9

Surfer will teleport out of spider trap within a few time steps

slide-46
SLIDE 46

Random teleports Example (β = 0.8)

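[Figure: the Yahoo (y), Amazon (a), M’soft (m) web graph with the spider trap at M’soft]

Teleport matrix for β = 0.8 (rows and columns ordered y, a, m):

          [1/2  1/2   0 ]         [1/3  1/3  1/3]     [7/15  7/15   1/15]
    0.8 × [1/2   0    0 ] + 0.2 × [1/3  1/3  1/3]  =  [7/15  1/15   1/15]
          [ 0   1/2   1 ]         [1/3  1/3  1/3]     [1/15  7/15  13/15]

Iterates of r = (y, a, m), starting from (1, 1, 1):

         r_0    r_1     r_2     r_3     …   limit
    y     1     1.00    0.84    0.776   …   7/11
    a     1     0.60    0.60    0.536   …   5/11
    m     1     1.40    1.56    1.688   …   21/11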

slide-47
SLIDE 47

Matrix formulation

Matrix A

  • A_ij = βM_ij + (1-β)/N
  • M_ij = 1/|O(j)| when j → i and M_ij = 0 otherwise, where O(j) is the set of outlinks of page j
  • Verify that A is a stochastic matrix

The PageRank vector r is the principal eigenvector of this matrix

  • satisfying r = Ar

Equivalently, r is the stationary distribution of the random walk with teleports
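The teleport formulation can be sketched by building A explicitly and running the same power iteration on it. The function names, tolerance, and iteration cap below are assumptions; the matrix M is the spider-trap example from the earlier slide, with β = 0.8. Building A as a dense matrix is only sensible for toy sizes like this one.

```python
def teleport_matrix(M, beta):
    """A_ij = beta * M_ij + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def pagerank(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration on the teleport matrix, starting from the uniform vector."""
    n = len(M)
    A = teleport_matrix(M, beta)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Spider-trap example (order: y, a, m): M'soft links only to itself.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
A = teleport_matrix(M, 0.8)
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]
rank = pagerank(M, beta=0.8)
```

Every column of A sums to 1, and the result is (7/33, 5/33, 21/33). That is the slide's limit (7/11, 5/11, 21/11) rescaled to sum to 1, since this sketch starts from a vector that sums to 1 rather than 3.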

slide-48
SLIDE 48


HITS: Capturing Authorities & Hubs (Kleinberg’98)

Intuitions

  • Pages that are widely cited are good authorities
  • Pages that cite many other pages are good hubs

HITS (Hypertext-Induced Topic Selection)

  • 1. Authorities are pages containing useful information and linked by Hubs
      • course home pages
      • home pages of auto manufacturers
  • 2. Hubs are pages that link to Authorities
      • course bulletin
      • list of US auto manufacturers

Iterative reinforcement …

[Figure: hubs pointing to authorities]

slide-49
SLIDE 49

Matrix Formulation

Transition (adjacency) matrix A

  • A[i, j] = 1 if page i links to page j, 0 if not

The hub score vector h: score is proportional to the sum of the authority scores of the pages it links to

  • h = λAa
  • Constant λ is a scale factor

The authority score vector a: score is proportional to the sum of the hub scores of the pages it is linked from

  • a = μA^T h
  • Constant μ is a scale factor

[Figure: hubs pointing to authorities]

slide-50
SLIDE 50

Transition Matrix Example

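[Figure: the Yahoo (y), Amazon (a), M’soft (m) web graph]

A[i, j] = 1 if page i (row) links to page j (column):

             y   a   m
        y  [ 1   1   1 ]
    A = a  [ 1   0   1 ]
        m  [ 0   1   0 ]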

slide-51
SLIDE 51

Iterative algorithm

  • Initialize h, a to all 1’s
  • h = Aa
  • Scale h so that its max entry is 1.0
  • a = A^T h
  • Scale a so that its max entry is 1.0
  • Continue until h, a converge
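The steps above can be sketched in pure Python as follows. The function name, the convergence tolerance, and the iteration cap are assumptions; the adjacency matrix is the Yahoo/Amazon/M’soft example from the previous slide (rows and columns ordered y, a, m).

```python
def hits(A, tol=1e-10, max_iter=1000):
    """Alternate h = A a and a = A^T h, rescaling each so its max entry is 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(max_iter):
        # Hub score: sum of authority scores of the pages this page links to.
        h_new = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h_new)
        h_new = [x / top for x in h_new]
        # Authority score: sum of hub scores of the pages linking to this page.
        a_new = [sum(A[i][j] * h_new[i] for i in range(n)) for j in range(n)]
        top = max(a_new)
        a_new = [x / top for x in a_new]
        change = max(abs(x - y) for x, y in zip(h_new + a_new, h + a))
        h, a = h_new, a_new
        if change < tol:
            break
    return h, a

# A[i][j] = 1 if page i links to page j (order: y, a, m).
A = [
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]
hub, authority = hits(A)
```

On this example the hub scores settle near (1, 0.732, 0.268) and the authority scores near (1, 0.732, 1).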

slide-52
SLIDE 52

Iterative Algorithm Example

        [ 1  1  1 ]          [ 1  1  0 ]
    A = [ 1  0  1 ]    A^T = [ 1  0  1 ]
        [ 0  1  0 ]          [ 1  1  0 ]

Authority scores by iteration:

    a(yahoo)   =  1    1      1      1      …   1
    a(amazon)  =  1    1      4/5    0.75   …   0.732
    a(m’soft)  =  1    1      1      1      …   1

Hub scores by iteration:

    h(yahoo)   =  1    1      1      1      …   1.000
    h(amazon)  =  1    2/3    0.71   0.73   …   0.732
    h(m’soft)  =  1    1/3    0.29   0.27   …   0.268

slide-53
SLIDE 53

Existence and Uniqueness of the Solution

    h = λAa
    a = μA^T h
    h = λμAA^T h
    a = λμA^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:

  • h* is the principal eigenvector of the matrix AA^T
  • a* is the principal eigenvector of the matrix A^T A
slide-54
SLIDE 54

Page Rank and HITS

Similarities

  • Iterative algorithm based on the linkage of the documents on the web
  • Same problem: what is the value of a link from S to D?

Different models

  • PageRank: depends on the links into S
  • HITS: depends on the value of the other links out of S

The destinies of PageRank and HITS post-1998

  • PageRank: trademark of Google
  • HITS: not commonly used by search engines (Ask.com?)

slide-55
SLIDE 55

Social Network Analysis Applications

Link-based object ranking for WWW (actor-level analysis)

  • PageRank
  • HITS

Influence and diffusion


slide-56
SLIDE 56

Influence and Diffusion

[Figure: CDC: Spread of Airborne Disease (OrgNet.com)]

slide-57
SLIDE 57

Coming Up

Paper presentations:

  • Knowledge discovery from transportation network data
  • Maximizing the spread of influence through a social network
  • Wherefore Art Thou R3579X? Anonymized Social Networks, Hidden Patterns, and Structural Steganography


slide-58
SLIDE 58


References (1)

  • T. Asai, et al. “Efficient substructure discovery from large semi-structured data”, SDM'02
  • C. Borgelt and M. R. Berthold, “Mining molecular fragments: Finding relevant substructures of molecules”, ICDM'02
  • D. Cai, Z. Shao, X. He, X. Yan, and J. Han, “Community Mining from Multi-Relational Networks”, PKDD'05.
  • M. Deshpande, M. Kuramochi, and G. Karypis, “Frequent Sub-structure Based Approaches for Classifying Chemical Compounds”, ICDM 2003
  • M. Deshpande, M. Kuramochi, and G. Karypis. “Automated approaches for classifying structures”, BIOKDD'02
  • L. Dehaspe, H. Toivonen, and R. King. “Finding frequent substructures in chemical compounds”, KDD'98
  • C. Faloutsos, K. McCurley, and A. Tomkins, “Fast Discovery of ‘Connection Subgraphs’”, KDD'04
  • H. Fröhlich, J. Wegner, F. Sieker, and A. Zell, “Optimal Assignment Kernels For Attributed Molecular Graphs”, ICML’05
  • T. Gärtner, P. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives”, COLT/Kernel’03

slide-59
SLIDE 59


References (2)

  • L. Holder, D. Cook, and S. Djoko. “Substructure discovery in the subdue system”, KDD'94
  • J. Huan, W. Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. “Mining spatial motifs from protein structure graphs”, RECOMB’04
  • J. Huan, W. Wang, and J. Prins. “Efficient mining of frequent subgraph in the presence of isomorphism”, ICDM'03
  • H. Hu, X. Yan, Yu, J. Han and X. J. Zhou, “Mining Coherent Dense Subgraphs across Massive Biological Networks for Functional Discovery”, ISMB'05
  • A. Inokuchi, T. Washio, and H. Motoda. “An apriori-based algorithm for mining frequent substructures from graph data”, PKDD'00
  • C. James, D. Weininger, and J. Delany. “Daylight Theory Manual Daylight Version 4.82”. Daylight Chemical Information Systems, Inc., 2003.
  • G. Jeh, and J. Widom, “Mining the Space of Graph Properties”, KDD'04
  • H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels Between Labeled Graphs”, ICML’03

slide-60
SLIDE 60


References (3)

  • M. Koyuturk, A. Grama, and W. Szpankowski. “An efficient algorithm for detecting frequent subgraphs in biological networks”, Bioinformatics, 20:I200--I207, 2004.
  • T. Kudo, E. Maeda, and Y. Matsumoto, “An Application of Boosting to Graph Classification”, NIPS’04
  • M. Kuramochi and G. Karypis. “Frequent subgraph discovery”, ICDM'01
  • M. Kuramochi and G. Karypis, “GREW: A Scalable Frequent Subgraph Discovery Algorithm”, ICDM’04
  • C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, “Mining Behavior Graphs for ‘Backtrace’ of Noncrashing Bugs”, SDM'05
  • P. Mahé, N. Ueda, T. Akutsu, J. Perret, and J. Vert, “Extensions of Marginalized Graph Kernels”, ICML’04
  • B. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45--87, 1981.
  • S. Nijssen and J. Kok. A quickstart in frequent structure mining can make a difference. KDD'04
  • J. Prins, J. Yang, J. Huan, and W. Wang. “Spin: Mining maximal frequent subgraphs from graph databases”. KDD'04

slide-61
SLIDE 61


References (4)

  • D. Shasha, J. T.-L. Wang, and R. Giugno. “Algorithmics and applications of tree and graph searching”, PODS'02
  • J. R. Ullmann. “An algorithm for subgraph isomorphism”, J. ACM, 23:31--42, 1976.
  • N. Vanetik, E. Gudes, and S. E. Shimony. “Computing frequent graph patterns from semistructured data”, ICDM'02
  • C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. “Scalable mining of large disk-based graph databases”, KDD'04
  • T. Washio and H. Motoda, “State of the art of graph-based data mining”, SIGKDD Explorations, 5:59-68, 2003
  • X. Yan and J. Han, “gSpan: Graph-Based Substructure Pattern Mining”, ICDM'02
  • X. Yan and J. Han, “CloseGraph: Mining Closed Frequent Graph Patterns”, KDD'03
  • X. Yan, P. S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-based Approach”, SIGMOD'04
  • X. Yan, X. J. Zhou, and J. Han, “Mining Closed Relational Graphs with Connectivity Constraints”, KDD'05
  • X. Yan, P. S. Yu, and J. Han, “Substructure Similarity Search in Graph Databases”, SIGMOD'05
  • X. Yan, F. Zhu, J. Han, and P. S. Yu, “Searching Substructures with Superimposed Distance”, ICDE'06
  • M. J. Zaki. “Efficiently mining frequent trees in a forest”, KDD'02
slide-62
SLIDE 62


Ref: Mining on Social Networks

  • D. Liben-Nowell and J. Kleinberg. The Link Prediction Problem for Social Networks. CIKM’03
  • P. Domingos and M. Richardson, Mining the Network Value of Customers. KDD’01
  • M. Richardson and P. Domingos, Mining Knowledge-Sharing Sites for Viral Marketing. KDD’02
  • D. Kempe, J. Kleinberg, and E. Tardos, Maximizing the Spread of Influence through a Social Network. KDD’03.
  • P. Domingos, Mining Social Networks for Viral Marketing. IEEE Intelligent Systems, 20(1), 80-82, 2005.
  • S. Brin and L. Page, The anatomy of a large scale hypertextual Web search engine. WWW7.
  • S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99
  • D. Cai, X. He, J. Wen, and W. Ma, Block-level Link Analysis. SIGIR'2004.