MSSG: A Framework for Massive-Scale Semantic Graphs
Timothy D. R. Hartley1, Umit Catalyurek1,2, Füsun Özgüner1
- 1Dept. of Electrical & Computer Engineering
- 2Dept. of Biomedical Informatics
- The Ohio State University
- With Andy Yoo (Lawrence Livermore National Laboratory)
– Kolda et al. (2004) estimate that emerging graphs will have 10^15 entities!
– Data will be dynamic
– Out-of-core data structures
– Parallel computer (shared memory / cluster)
– Commodity hardware is still cheap
– High-speed interconnection networks are becoming commonplace
– Good online performance
– Good I/O performance
– Efficient memory usage
– Efficient scale-free search
– Existing frameworks: TPIE, River
– Social networks
– Cray MTA-2
– IBM BlueGene/L
– High performance
– Expensive!
– Algorithm tightly coupled with data distribution
– Parallel layout
– External memory
– Target graphs will be dynamic
[Architecture diagram: the input graph streams through front-end nodes, which send edges to back-end nodes for storage on disk(s)]
– Analysis
– Storage
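As a concrete illustration of the front-end/back-end split, here is a minimal sketch of how a front-end might route parsed edges to the back-end nodes that own them. The modulo placement, struct layout, and function names are illustrative assumptions, not MSSG's actual code.

```cpp
// Hypothetical front-end edge routing: each edge goes to the back-end
// node that owns its source vertex, bucketed for bulk transfer.
#include <cstdint>
#include <vector>

struct Edge {
    uint64_t src;
    uint64_t dst;
};

// Simple modulo placement (an assumption): vertex v lives on
// back-end (v % numBackends).
inline int ownerOf(uint64_t vertex, int numBackends) {
    return static_cast<int>(vertex % numBackends);
}

// Route a batch of edges into per-back-end buffers; each bucket is then
// streamed to its back-end for storage on local disk.
std::vector<std::vector<Edge>> routeEdges(const std::vector<Edge>& edges,
                                          int numBackends) {
    std::vector<std::vector<Edge>> outbound(numBackends);
    for (const Edge& e : edges)
        outbound[ownerOf(e.src, numBackends)].push_back(e);
    return outbound;
}
```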
– BFS
– Best-first search
– Pattern search
– Neighborhood quality quantification
– Window size is important for amortizing communication latency (sketched below)
– Plug-in architecture
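A minimal sketch of the windowing idea, assuming edges are shipped to back-ends in bulk: a fixed-size window buffers edges and flushes them in a single message, so per-message latency is paid once per window rather than once per edge. All names here are hypothetical.

```cpp
// Hypothetical windowed edge shipping: accumulate edges and flush in bulk.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Edge { uint64_t src, dst; };

// Placeholder for the real transport call (one bulk message per window).
void flushToBackend(const std::vector<Edge>& window) {
    std::printf("shipping %zu edges in one message\n", window.size());
}

class EdgeWindow {
public:
    explicit EdgeWindow(std::size_t capacity) : capacity_(capacity) {
        buffer_.reserve(capacity_);
    }
    void push(const Edge& e) {
        buffer_.push_back(e);
        if (buffer_.size() == capacity_) flush();  // window full: ship it
    }
    void flush() {  // also called once at end-of-stream for a partial window
        if (!buffer_.empty()) {
            flushToBackend(buffer_);
            buffer_.clear();
        }
    }
private:
    std::size_t capacity_;
    std::vector<Edge> buffer_;
};
```

A larger window amortizes latency better but holds more edges in memory before they reach disk, which is the trade-off behind the "window size important" note.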
– Get the adjacency list for a vertex (see the interface sketch below)
– Store vertex metadata (e.g. visited at level x)
– In memory
– Out-of-core
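The two operations above suggest a small storage interface; the sketch below is one hypothetical C++ rendering of it, behind which both an in-memory and an out-of-core implementation could sit. The names are illustrative, not MSSG's API.

```cpp
// Hypothetical vertex-store interface covering the two listed operations.
#include <cstdint>
#include <vector>

struct VertexMetadata {
    int visitedAtLevel = -1;  // -1 means not yet visited
};

class VertexStore {
public:
    virtual ~VertexStore() = default;
    // Return all vertices adjacent to v.
    virtual std::vector<uint64_t> adjacency(uint64_t v) = 0;
    // Persist metadata for v (e.g. "visited at level x").
    virtual void setMetadata(uint64_t v, const VertexMetadata& m) = 0;
    virtual VertexMetadata getMetadata(uint64_t v) = 0;
};

// Two implementations can then sit behind the same interface:
//   class InMemoryVertexStore  : public VertexStore { ... };  // hash map
//   class OutOfCoreVertexStore : public VertexStore { ... };  // disk-backed
```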
– Inspired by the Netezza streaming database
– Read a chunk of the graph from disk
– Pick the edges that match the queried vertex
– Return the full list of adjacent vertices
– Lower seek overhead (scan loop sketched below)
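A sketch of that scan loop, under the assumption that edges are stored on disk as fixed-size (src, dst) records: read large sequential chunks, filter in memory, and accumulate the adjacency list. The large sequential reads are what lower the seek overhead.

```cpp
// Hypothetical streaming scan: sequential chunked reads, in-memory filter.
#include <cstdint>
#include <cstdio>
#include <vector>

struct Edge { uint64_t src, dst; };

std::vector<uint64_t> scanAdjacency(const char* edgeFile, uint64_t vertex) {
    std::vector<uint64_t> neighbors;
    std::FILE* f = std::fopen(edgeFile, "rb");
    if (!f) return neighbors;

    std::vector<Edge> chunk(1 << 20);  // 1M edges (~16 MB) per sequential read
    size_t n;
    while ((n = std::fread(chunk.data(), sizeof(Edge), chunk.size(), f)) > 0) {
        for (size_t i = 0; i < n; ++i)      // pick the edges that match
            if (chunk[i].src == vertex)
                neighbors.push_back(chunk[i].dst);
    }
    std::fclose(f);
    return neighbors;
}
```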
– Fixed record size
– Variable record size
– Multiple fixed record files
– Record sizes chosen to match the scale-free graph vertex degree distribution
– File level 0
– File level 1
– Disk block = unit of I/O (level selection sketched below)
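One way to realize this layout, with illustrative constants (the block size and the doubling-per-level rule are assumptions, not the paper's parameters): record sizes grow per file level and are always a whole number of disk blocks, so the heavy tail of a scale-free degree distribution lands in the rare large-record files while most vertices stay in level 0.

```cpp
// Hypothetical multi-level fixed-record layout for adjacency lists.
#include <cstddef>
#include <cstdint>

constexpr size_t kDiskBlock = 4096;              // unit of I/O (bytes)
constexpr size_t kSlotSize  = sizeof(uint64_t);  // one neighbor id

// Neighbors that fit in one record at each level; level L's record is
// (kDiskBlock << L) bytes, i.e. a power-of-two number of disk blocks.
inline size_t levelCapacity(int level) {
    return (kDiskBlock << level) / kSlotSize;
}

// Pick the smallest file level whose fixed record holds `degree` neighbors.
// In a scale-free graph most vertices are low-degree, so most records are
// level-0 and a lookup costs a single block read.
inline int fileLevelFor(size_t degree) {
    int level = 0;
    while (degree > levelCapacity(level)) ++level;
    return level;
}
```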
– Fast, scalable parallel communication
– High-speed interconnect support
– Easy-to-use filter-based API
– Robust processing model
– Rapid development
– Fast execution time
DataCutter: middleware for data-parallel manipulation of large scientific data
– Transparent copies of filters
– C++/Java/Python filters
– Each filter runs as a thread, enabling pipelined processing
– Data is streamed from producer to consumer filters
– Filters combine computation and application-specific storage access
– Filters can be placed on any number of heterogeneous nodes (schematic pipeline below)
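To make the model concrete, here is a schematic producer/consumer filter pair connected by a stream, each running in its own thread. This only mimics the filter-stream style; it is not DataCutter's actual API, and all names are invented for illustration.

```cpp
// Schematic filter-stream pipeline: producer -> stream -> consumer threads.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <vector>

template <typename T>
class Stream {  // channel with an end-of-stream marker
public:
    void put(T item) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(item)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> get() {  // blocks; empty optional == end of stream
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
    bool closed_ = false;
};

int main() {
    Stream<std::vector<int>> edges;

    std::thread producer([&] {       // e.g. a filter parsing the input graph
        for (int i = 0; i < 4; ++i) edges.put({i, i + 1});
        edges.close();
    });
    std::thread consumer([&] {       // e.g. a filter writing edges to disk
        while (auto buf = edges.get())
            std::printf("edge %d -> %d\n", (*buf)[0], (*buf)[1]);
    });

    producer.join();
    consumer.join();
}
```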
– 8 GB RAM per node
– 500 GB local disks in RAID 0 per node
– InfiniBand interconnect
– Pubmed-S: 3,751,921 vertices and 27,841,781 edges
– Pubmed-L: 26,676,177 vertices and 519,630,678 edges
– Syn-2B: 100 million vertices and 2 billion edges
– Search time (s)
– Aggregate edges/s processed
– Expected ingestion with GrDB in roughly 77 hours
– Expected average search in tens of minutes
– I/O-efficient hash / index structure needed
– More performance testing
– Larger graphs
– Use a queue for frontier vertices (sketch below)
– Use a global queue
– Use local queues with graph partitioning
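A minimal sketch of the queue-based BFS, written against a generic adjacency callback: the frontier queue drives the traversal and each vertex's discovery level is recorded, matching the "visited at level x" metadata noted earlier. In the distributed variant, each back-end would run this loop over a local queue holding the frontier vertices its partition owns.

```cpp
// Queue-based BFS sketch: global frontier queue, levels recorded per vertex.
#include <cstdint>
#include <functional>
#include <queue>
#include <unordered_map>
#include <vector>

// Adjacency callback; in MSSG terms this would be backed by the vertex store.
using Adjacency = std::function<std::vector<uint64_t>(uint64_t)>;

std::unordered_map<uint64_t, int> bfsLevels(uint64_t source,
                                            const Adjacency& adj) {
    std::unordered_map<uint64_t, int> level;  // "visited at level x"
    std::queue<uint64_t> frontier;            // queue of frontier vertices
    level[source] = 0;
    frontier.push(source);
    while (!frontier.empty()) {
        uint64_t v = frontier.front();
        frontier.pop();
        for (uint64_t w : adj(v)) {
            if (!level.count(w)) {            // first discovery of w
                level[w] = level[v] + 1;
                frontier.push(w);
            }
        }
    }
    return level;
}
```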