Enhancing Graph Database Indexing By Suffix Tree Structure


slide-1
SLIDE 1

Enhancing Graph Database Indexing By Suffix Tree Structure

  • V. Bonnici, R. Di Natale, A. Ferro, R. Giugno, M. Mongiovì, G. Pigola, A. Pulvirenti

Dipartimento di Matematica e Informatica, Università di Catania

  • D. Shasha

Courant Institute of Mathematical Sciences, New York University

slide-2
SLIDE 2

Outline

  • Biological Motivations
  • GraphGrep
  • GraphGrepSX
  • Experimental results (GIndex, Gcoding, GraphGrep, Ctree)

slide-3
SLIDE 3

Motivations

  • Many applications in industry, science, and engineering share the same problem: given a subgraph, find its occurrences in a database of graphs.

– Prediction of the functionality of new natural or synthesized compounds
– Make a compound Q more active
– Find fragments with the same function among different species
– Predict protein function, predict protein interaction

Figure panels: Gene Ontologies, Pathways, Biological Networks

slide-4
SLIDE 4

Graph indexing system

  • Graph-to-graph matching algorithms can be used, but efficiency considerations suggest using specific techniques to reduce the search space and the time complexity.
  • In a preprocessing phase, each graph of the database is analyzed in order to extract and store its discriminatory properties (features).
  • In the filtering phase, the graph database index is compared with the query index in order to discard the database graphs that lack some feature present in the query graph.
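The two phases above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the feature used here (node-label counts) and the names `extract_features`, `build_index`, and `filter_candidates` are assumptions chosen for brevity.

```python
from collections import Counter

def extract_features(graph):
    """Hypothetical feature extractor: count the node labels of a graph.

    `graph` is a dict {"labels": {node: label}, "edges": [(u, v), ...]}.
    """
    return Counter(graph["labels"].values())

def build_index(database):
    """Preprocessing phase: store the features of every database graph."""
    return {gid: extract_features(g) for gid, g in database.items()}

def filter_candidates(index, query):
    """Filtering phase: discard graphs lacking some feature of the query."""
    q = extract_features(query)
    return sorted(
        gid for gid, feats in index.items()
        if all(feats[f] >= n for f, n in q.items())
    )

database = {
    "g1": {"labels": {1: "A", 2: "B", 3: "C"}, "edges": [(1, 2), (2, 3)]},
    "g2": {"labels": {1: "A", 2: "B"}, "edges": [(1, 2)]},
}
query = {"labels": {1: "B", 2: "C"}, "edges": [(1, 2)]}

index = build_index(database)
candidates = filter_candidates(index, query)
print(candidates)
```

Only the surviving candidates need to be passed to an exact graph-to-graph matcher, which is where the saving in search space comes from.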

slide-5
SLIDE 5

GraphGrep

[Figure: example database of three node-labeled graphs g1, g2, g3, with labels A to E]

[Figure: GraphGrep workflow. Index Construction builds the Index from the Database. Query processing: Filtering (find candidates), Load from DB, Candidate Verification.]

Indexing is crucial to reduce the search space and make the problem affordable!

Graph searching is an NP-complete problem.

  • D. Shasha, J.T.L. Wang, R. Giugno. Algorithmics and Applications of Tree and Graph Searching. Proceedings of the ACM Symposium on Principles of Database Systems (PODS) 2002: 39-52.
  • A. Ferro, R. Giugno, M. Mongiovì, A. Pulvirenti, D. Skripin, D. Shasha. GraphFind: Enhancing Graph Searching by Low Support Data Mining Techniques. BMC Bioinformatics 2008, Vol. 9 (Suppl 4): S10. doi:10.1186/1471-2105-9-S4-S10

slide-6
SLIDE 6

GraphGrep: Index building

For each graph in DB:

  • Find all paths of length from 1 to L (4, 10)
  • Save the paths in a Berkeley DB
  • Count the occurrences of each path in each graph
  • Save the occurrences in a hash table indexed by the strings of the paths
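The steps above can be illustrated with a small Python sketch. It is not the GraphGrep code: an in-memory `dict` stands in for the Berkeley DB and the hash table, path length is assumed to be the number of nodes on the path, paths are assumed not to revisit nodes, and the function name `label_paths` is hypothetical.

```python
from collections import Counter

def label_paths(adj, labels, max_len):
    """Count label paths with 1..max_len nodes in one undirected graph.

    Assumption: a path never revisits a node. Each path is enumerated
    once per direction (e.g. both "AB" and "BA").
    """
    counts = Counter()

    def dfs(path):
        counts["".join(labels[v] for v in path)] += 1
        if len(path) == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                dfs(path + [nxt])

    for start in adj:
        dfs([start])
    return counts

# Toy database with one graph: A(1) - B(2) - C(3)
graphs = {"g1": ({1: [2], 2: [1, 3], 3: [2]}, {1: "A", 2: "B", 3: "C"})}

# Hash table keyed by the path string: path -> {graph id: occurrences}
index = {}
for gid, (adj, labels) in graphs.items():
    for path, n in label_paths(adj, labels, max_len=3).items():
        index.setdefault(path, {})[gid] = n

print(sorted(index))
```

Joining labels into a string is unambiguous only for single-character labels; a tuple key would be the safer choice for longer labels.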

slide-7
SLIDE 7

GraphGrep: Index building

slide-8
SLIDE 8

GraphGrep: Filtering and Matching

[Figure: filtering step followed by "Run VF" on the candidates]

slide-9
SLIDE 9

GraphGrep: Matching VF_lib

(Cordella et al. IEEE PAMI 2004, http://amalfi.dis.unina.it/graph/db/vflib-2.0/doc/vflib.html)

  • Extension of Ullmann's matching algorithm (Journal of the ACM, 1976)
  • The process of finding the mapping function can be suitably described by means of a tree called the State Space Representation (SSR)
  • Each node is a state s of the partial matching process
  • A transition from a generic state s to a successor s’ represents the addition of a pair of matched nodes
  • k-look-ahead rules check in advance whether a consistent state s has no consistent successors after k steps, plus semantic rules
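The state-space idea can be illustrated with a deliberately simplified backtracking matcher. This is not VF_lib and it has no look-ahead rules: it only checks node labels and edges toward already matched nodes. The function name `find_matches` and the toy graphs are assumptions.

```python
def find_matches(q_adj, q_lab, t_adj, t_lab):
    """Enumerate mappings of query nodes onto target nodes.

    A state is a partial mapping {query node: target node}; moving to a
    successor state adds one matched pair. Simplified sketch: labels and
    already mapped neighbours are checked, with no look-ahead rules.
    """
    order = list(q_adj)
    results = []

    def consistent(state, q, t):
        if q_lab[q] != t_lab[t] or t in state.values():
            return False
        # Every query edge to an already matched node must exist in the target.
        return all(state[n] in t_adj[t] for n in q_adj[q] if n in state)

    def extend(state):
        if len(state) == len(order):
            results.append(dict(state))
            return
        q = order[len(state)]
        for t in t_adj:
            if consistent(state, q, t):
                state[q] = t
                extend(state)
                del state[q]  # backtrack to the parent state

    extend({})
    return results

# Query: B - C; target: A(1) - B(2) - C(3) - B(4)
q_adj, q_lab = {"x": ["y"], "y": ["x"]}, {"x": "B", "y": "C"}
t_adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
t_lab = {1: "A", 2: "B", 3: "C", 4: "B"}
matches = find_matches(q_adj, q_lab, t_adj, t_lab)
print(matches)
```

Look-ahead rules would prune a state earlier, before its inconsistent successors are generated; here such states are simply abandoned when no successor passes `consistent`.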

slide-10
SLIDE 10

GraphGrepSX: Idea

Realize a compact representation of the index by making use of suffix trees

“Algorithms on Strings, Trees, and Sequences” by Dan Gusfield

slide-12
SLIDE 12

GraphGrepSX: Idea

Realize a compact representation of the index by making use of suffix trees

[Figure: suffix tree index]


slide-16
SLIDE 16

GraphGrepSX

  • Preprocessing phase

– Replaces the hash table index by a suffix tree index

  • Filtering phase

– Build a query index tree
– The candidate set is constructed by matching the query index tree and the database index

  • This results in a more flexible graph indexing system

– Different ways to build the query index
– An efficient technique to reduce redundant checks

slide-21
SLIDE 21

GraphGrepSX: Index structure

GraphGrepSX builds the tree index as follows:

The index is computed by a DFS visit; backtracking makes it possible to find paths with the same suffix.
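A minimal sketch of such a tree index is given below. It is an illustration under stated assumptions rather than the GraphGrepSX code: the index is modeled as a trie of node labels, path length is assumed to be the number of nodes, paths do not revisit nodes, and the names `new_node` and `add_graph` are hypothetical.

```python
def new_node():
    return {"children": {}, "counts": {}}

def add_graph(root, gid, adj, labels, max_len):
    """Insert every label path of up to max_len nodes of one graph.

    Each trie node stores, per graph id, how many paths of that graph
    spell the label sequence leading to the node.
    """
    def dfs(v, node, visited, depth):
        child = node["children"].setdefault(labels[v], new_node())
        child["counts"][gid] = child["counts"].get(gid, 0) + 1
        if depth == max_len:
            return
        visited.add(v)
        for w in adj[v]:
            if w not in visited:
                dfs(w, child, visited, depth + 1)
        visited.remove(v)  # backtrack: the next branch reuses this trie node

    for start in adj:
        dfs(start, root, set(), 1)

root = new_node()
# g1: A(1) - B(2) - C(3)
add_graph(root, "g1", {1: [2], 2: [1, 3], 3: [2]},
          {1: "A", 2: "B", 3: "C"}, max_len=3)

abc = root["children"]["A"]["children"]["B"]["children"]["C"]
print(abc["counts"])
```

Because paths that share leading labels share trie nodes, each label sequence is stored once, whereas a hash table keyed by full path strings repeats the shared part in every key.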

slide-33
SLIDE 33

Experimental analysis on molecular dataset

AIDS antiviral screening database: http://dtp.nci.nih.gov/docs/aids/

[Plot: Index construction time; x-axis: Number of graphs; y-axis: Building time (sec)]

slide-34
SLIDE 34

Experimental analysis on molecular dataset

[Plot: Total index size (annotation: label-paths table + hashtable); x-axis: Number of graphs; y-axis: Size (Kb)]

[Plot: Index fingerprint size (annotation: hashtable only); x-axis: Number of graphs; y-axis: Size (Kb)]

slide-38
SLIDE 38

GraphGrepSX: Index structure

A path in the index structure is defined as a maximal path if:

  • its length is L, or
  • its length is less than L but it cannot be extended
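The definition can be made concrete with a short sketch that collects the maximal paths of a tree index. The nested-dict tree layout, the function name `maximal_paths`, and the choice of measuring length in nodes are assumptions made for this example.

```python
def maximal_paths(node, max_len, prefix=()):
    """Return label sequences that end at depth max_len or at a leaf."""
    found = []
    for label, child in node["children"].items():
        path = prefix + (label,)
        if len(path) == max_len or not child["children"]:
            found.append(path)  # length L reached, or no extension exists
        else:
            found.extend(maximal_paths(child, max_len, path))
    return found

def leaf():
    return {"children": {}}

# Hypothetical tree with L = 3: A-B-C has full length, A-D cannot be extended.
tree = {"children": {"A": {"children": {
    "B": {"children": {"C": leaf()}},
    "D": leaf(),
}}}}

paths = maximal_paths(tree, max_len=3)
print(paths)
```

Every non-maximal path is a prefix of some maximal path, so checking only the maximal ones avoids re-checking their prefixes.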

slide-41
SLIDE 41

GraphGrepSX: Filtering phase - Query tree index structure construction

  • Discard graphs from the database which do not match the query by analyzing only the maximal paths
  • In the query tree, nodes representing these maximal paths are marked

[Figure: query tree; red nodes represent end-points of maximal paths]

slide-42
SLIDE 42

GraphGrepSX: Filtering phase - candidates generation

[Figure: Dataset index and Query index]

slide-47
SLIDE 47

GraphGrepSX: Filtering phase - candidates generation

Steps: Trees matching, Occurrences verification, Candidates set: {g1, …}
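The matching of the two trees can be sketched as below. This is a simplified illustration, not the GraphGrepSX procedure: it checks every query node rather than only the marked maximal paths, and the tree layout and the names `candidates`, `db`, and `q` are assumptions.

```python
def candidates(db_node, q_node, all_graphs):
    """Match the query tree against the dataset tree from the root down.

    A graph survives if, for every query node, the dataset node reached by
    the same labels records at least as many occurrences for that graph.
    """
    alive = set(all_graphs)
    for label, q_child in q_node["children"].items():
        db_child = db_node["children"].get(label)
        if db_child is None:
            return set()  # the path is absent from the whole dataset
        # Occurrences verification for this path.
        alive &= {g for g, n in db_child["counts"].items()
                  if n >= q_child["count"]}
        alive &= candidates(db_child, q_child, alive)
    return alive

def db(counts, **children):
    return {"counts": counts, "children": children}

def q(count, **children):
    return {"count": count, "children": children}

# Hypothetical dataset index: g1 contains A-B-C, g2 contains only A-B.
dataset = db({}, A=db({"g1": 1, "g2": 1},
                      B=db({"g1": 1, "g2": 1}, C=db({"g1": 1}))))
query = q(0, A=q(1, B=q(1, C=q(1))))

result = candidates(dataset, query, {"g1", "g2"})
print(result)
```

The graphs left in `result` are only candidates; each one still has to be verified by an exact matcher.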

slide-48
SLIDE 48

Experimental analysis

Molecular dataset of 42000 graphs

[Plot: Query time (filtering + matching); x-axis: Query dimension; y-axis: Query time (sec)]

slide-49
SLIDE 49

Experimental analysis

Molecular dataset of 42000 graphs

[Plot: Candidates; x-axis: Query dimension; y-axis: Number of candidates]

slide-50
SLIDE 50

Experimental analysis

Molecular dataset of 42000 graphs

[Plot: Filtering time; x-axis: Query dimension; y-axis: Filtering time (sec)]

[Plot: Matching time; x-axis: Query dimension; y-axis: Matching time (sec)]

slide-51
SLIDE 51

Experimental analysis: CTree, GCoding, GraphGrep, GraphGrepSX

Molecular dataset of 42000 graphs

[Plot: Index construction time; x-axis: Number of graphs; y-axis: Time (sec)]

[Plot: Index size; x-axis: Number of graphs; y-axis: Size (Kb)]

slide-52
SLIDE 52

Experimental analysis: CTree, GCoding, GraphGrep, GraphGrepSX

Molecular dataset of 42000 graphs

[Plot: Query time; x-axis: Query dimension; y-axis: Time (sec)]

[Plot: Candidates; x-axis: Query dimension; y-axis: Candidates]

Huahai He, Ambuj K. Singh: Closure-Tree: An Index Structure for Graph Queries. ICDE '06.

Lei Zou, Lei Chen, Jeffrey Xu Yu, Yansheng Lu: A novel spectral coding in a large graph database. Proceedings of the 11th International Conference on Extending Database Technology, 2008.

slide-53
SLIDE 53

Experimental analysis: CTree, GCoding, GraphGrep, GraphGrepSX

Molecular dataset of 42000 graphs

[Plot: Filtering time; x-axis: Query dimension; y-axis: Time (sec)]

[Plot: Matching time; x-axis: Query dimension; y-axis: Time (sec)]

slide-54
SLIDE 54

Experimental analysis: GraphGrep, GraphGrepSX, GIndex

Molecular dataset of 8000 graphs

[Plot: Candidates; x-axis: Query dimension; y-axis: Candidates]

[Plot: Index construction time; y-axis: Time (sec)]

Yan X, Yu PS, Han J: Graph Indexing Based on Discriminative Frequent Structure Analysis. ACM Transactions on Database Systems 2005, 30(4):960-993.

slide-55
SLIDE 55

Experimental analysis: GraphGrep, GraphGrepSX, GIndex

Molecular dataset of 8000 graphs

[Plot: Filtering time; x-axis: Query dimension; y-axis: Time (sec)]

[Plot: Matching time; x-axis: Query dimension; y-axis: Time (sec)]

Yan X, Yu PS, Han J: Graph Indexing Based on Discriminative Frequent Structure Analysis. ACM Transactions on Database Systems 2005, 30(4):960-993.

slide-56
SLIDE 56

What is coming next?…

  • The increasing size of database applications requires efficient structure searching algorithms.
  • Many applications in industry, science, and engineering share the same problem: given a subgraph, find its occurrences in a database of graphs or in large networks.

Figure panels: Collaboration networks, Social Networks, Biological Networks
slide-57
SLIDE 57

NetMatch

[Screenshot labels: Graph Properties, Query Properties, Network Properties, Run the Algorithm]

  • A. Ferro, R. Giugno, G. Pigola, A. Pulvirenti, D. Skripin, G.D. Bader, D. Shasha. NetMatch: a Cytoscape Plugin for Searching Biological Networks. Bioinformatics, vol. 23, pp. 910-912, 2007. ISSN: 1367-4803. doi:10.1093/bioinformatics/btm032.
slide-58
SLIDE 58

NetMatch: Querying and getting Result

RESULT

slide-59
SLIDE 59

NetMatch: Analyze Results

[Screenshot label: path<4]

slide-60
SLIDE 60

Find Feed-Forward Motifs

Graph motifs over-represented in many network types

Feed-forward loop (Milo et al., Science 298, 2002), found in: Gene regulation, Neurons, Electronic circuits

slide-61
SLIDE 61

Edge Labels: 1 = activator, 2 = repressor, 3 = dual

slide-62
SLIDE 62

Find All Feed-Forward Motifs
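For readers who want to see what is being searched for, a feed-forward loop is a triple of distinct nodes (a, b, c) with directed edges a→b, b→c, and a→c. The sketch below enumerates such triples directly; it is not how NetMatch works internally, and the function name `feed_forward_loops` and the toy network are assumptions. Edge labels follow the 1 = activator, 2 = repressor, 3 = dual convention of the deck.

```python
def feed_forward_loops(edges):
    """Find node triples (a, b, c) with directed edges a->b, b->c and a->c.

    `edges` maps (source, target) to an edge label
    (1 = activator, 2 = repressor, 3 = dual).
    """
    succ = {}
    for (u, v) in edges:
        succ.setdefault(u, set()).add(v)

    loops = []
    for a in sorted(succ):
        for b in sorted(succ[a]):
            for c in sorted(succ.get(b, ())):
                if len({a, b, c}) == 3 and c in succ[a]:
                    labels = (edges[a, b], edges[b, c], edges[a, c])
                    loops.append(((a, b, c), labels))
    return loops

# Hypothetical regulatory network
edges = {("X", "Y"): 1, ("Y", "Z"): 1, ("X", "Z"): 2, ("Z", "W"): 1}
loops = feed_forward_loops(edges)
print(loops)
```

Returning the three edge labels alongside each triple makes it possible to separate, for example, loops built only from activators from loops that include a repressor.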

slide-63
SLIDE 63

Statistical Significance?

slide-64
SLIDE 64

Resources

http://ferrolab.dmi.unict.it/

http://alpha.dmi.unict.it/~ctnyu/

slide-65
SLIDE 65

Summer School 13-20 June, 2009: RNA: Structure, Function and Therapy

http://lipari.cs.unict.it/LipariSchool/

SPEAKERS

  • Dr. Oliver Hobert, Howard Hughes Medical Institute, Department of Biochemistry and Molecular Biophysics, Columbia University Medical Center, New York, USA
  • Chris Sander, Computational Biology Center, Sander Research Lab, Memorial Sloan-Kettering Cancer Center, New York, USA
  • Peter Stadler, Institut für Informatik, Universität Leipzig, Germany
  • Carlo Croce, Comprehensive Cancer Center, Columbus, Ohio
  • Debora Marks, Systems Biology Department, Harvard Medical School, Boston, MA, USA

GUEST SPEAKERS

  • Charles R. Cantor and Graziano Pesole

TUTORIALS

  • Matteo Comin: RNA-protein interaction, University of Padova, Italy
  • A. Pulvirenti: Algorithms for Sequence Alignment, University of Catania, Italy
  • R. Giancarlo: Alignment-free distances, University of Palermo, Italy
  • Knut Reinert: SeqAn: An efficient, generic C++ library for sequence analysis, Institut für Informatik, Fachbereich Mathematik und Informatik, Freie Universität Berlin
  • Christine Heitsch: RNA structure prediction, Berkeley University, School of Mathematics, Georgia Institute of Technology, Atlanta, USA
  • Doron Betel: Non coding RNAs, Memorial Sloan-Kettering Cancer Center, New York, USA
  • Alessandro Laganà: Tools for miRNA genes and target prediction, University of Catania, Italy