A Portable, High-Level Graph Analytics Framework Targeting Distributed, Heterogeneous Systems
Robert Searles*, Stephen Herbein*, and Sunita Chandrasekaran
November 14, 2016
Motivation
◮ HPC and Big Data communities are converging
◮ Heterogeneous and distributed systems are becoming increasingly common
◮ Distributing data and leveraging specialized hardware (e.g. accelerators) is critical
◮ Graph analytics are important to both communities
Goal
◮ Develop a portable, high-level framework for programming
current and future HPC systems that:
◮ Distributes data automatically
◮ Utilizes heterogeneous hardware
◮ Accelerate two real-world graph analytics applications
◮ Demonstrate portability by running on a variety of hardware, including multi-core Intel CPUs, NVIDIA GPUs, and AMD GPUs
Our Framework: Spark + X
Figure: Framework architecture: Spark coordinates the cluster, and each node runs a local "X" component on its CPU and GPU.
◮ Utilize the MapReduce framework, Spark, to handle data and task distribution
◮ Automatic data/task distribution
◮ Fault-tolerant
◮ Minimal programmer overhead
◮ Leverage heterogeneous resources to compute the tasks local to each node
◮ Accelerators and other emerging trends in HPC technology
Case Study Applications
◮ Fast Subtree Kernel (FSK)
◮ Call graph similarity analysis
◮ Program characterization
◮ Malware analysis
◮ Triangle enumeration
◮ Spam detection
◮ Web link recommendation
◮ Social network analysis
What is FSK?
◮ Compute-bound graph kernel
◮ Measures the similarity of graphs in a dataset
◮ A graph is represented by a list of feature vectors
◮ Each feature vector represents a subtree

Figure: FSK workflow, from binaries through decompilation to call graphs, an FSK similarity matrix, and SVM-based program characterization.
FSK in our framework
◮ Spark Component
◮ Split up pairwise graph comparisons
◮ Local Component
◮ For each pair of graphs, compare all feature vectors

Figure: Spark distributes the call graphs, and each node runs its share of the pairwise comparisons locally.
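The Spark/local split above can be sketched in plain Python. The pair generation stands in for the Spark map stage, and `compare_graphs` is an illustrative exact-match similarity, not the actual FSK kernel:

```python
from itertools import combinations

def compare_graphs(g1, g2):
    # Local component: compare all feature vectors of the two graphs.
    # Similarity here is the number of exactly matching subtree vectors
    # (an illustrative stand-in for the real FSK comparison).
    return sum(1 for v in g1 for w in g2 if v == w)

def fsk_similarity(graphs):
    # Spark component: split the work into independent pairwise
    # comparisons (in Spark, this dict comprehension would be a map
    # over a parallelized list of index pairs).
    pairs = combinations(range(len(graphs)), 2)
    return {(i, j): compare_graphs(graphs[i], graphs[j]) for i, j in pairs}

# Three tiny "graphs", each a list of subtree feature vectors.
graphs = [[(1, 0), (0, 1)], [(1, 0)], [(0, 1), (1, 1)]]
sims = fsk_similarity(graphs)
```

Each pair is independent, which is what lets Spark schedule the comparisons across nodes without coordination.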
What is Triangle Enumeration?
◮ Data-bound graph operation
◮ Finds all cycles of size 3 (AKA triangles) within a graph
Figure: This graph contains 2 triangles (highlighted in red).
Triangle Enumeration in our framework
◮ Spark Component
◮ Partition the graph
◮ Distribute the vertices/edges across the cluster
◮ Local Component
◮ Count triangles within each subgraph
◮ Done using matrix-matrix multiplication (BLAS)
◮ Spark Component
◮ Count triangles between subgraphs
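The local BLAS step relies on a standard identity: for an undirected adjacency matrix A, trace(A³) counts each triangle six times. A minimal NumPy sketch of the idea; the 5-vertex edge list is an assumed graph with two triangles, since the figure's exact edges are not given:

```python
import numpy as np

def count_triangles(adj):
    # (A^3)[i][i] counts closed walks of length 3 from vertex i; each
    # triangle contributes 6 such walks (3 starting vertices x 2
    # directions), so divide the trace by 6.
    return int(np.trace(np.linalg.matrix_power(adj, 3))) // 6

# Assumed 5-vertex undirected graph containing two triangles:
# {0, 1, 2} and {2, 3, 4}.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (2, 4), (3, 4)]
A = np.zeros((5, 5), dtype=np.int64)
for i, j in edges:
    A[i, j] = A[j, i] = 1

print(count_triangles(A))  # → 2
```

The same matrix product is what gets offloaded to GPU BLAS in the local component.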
Hardware/Software
Fast Subtree Kernel
◮ Software
◮ PySpark
◮ PyOpenCL
◮ Hardware: AMD GPU (Fury X)
Triangle Enumeration
◮ Software
◮ PySpark
◮ ScikitCUDA
◮ Hardware: NVIDIA GPUs
◮ GTX 470
◮ GTX 970
◮ Tesla K20c
FSK Results - Single-Node Parallelism
Figure: Call Graph Similarity, single-node performance: runtime in seconds vs. dataset size (10, 100, 500, 1000) for single-thread CPU, 8-thread CPU, and GPU.

◮ Single-node runtimes (single thread, 8 threads, and GPU)
FSK Results - Multi-Node Scalability
Figure: Call Graph Similarity, single node vs. multi node: runtime in seconds vs. dataset size (10, 100, 500, 1000) for single-node CPU and multi-node CPU (3 nodes).

◮ Multi-node runtimes (CPU saturated on all nodes)
Triangle Enumeration - Optimizing Data Movement
◮ Runtime of the Spark component for Triangle Enumeration with a variable number of partitions, for Erdős–Rényi random graphs of differing densities
Sparse graphs (P=.001)
Figure: Global time (seconds) vs. number of Spark partitions (36, 72, 144) for three configurations (N=5000, P=.001): CPU, GPU with 1 executor, GPU with 4 executors.

◮ Fewer partitions allow more triangles to be counted locally
Denser graphs (P=.05)
Figure: Global time (seconds) vs. number of Spark partitions (36, 72, 144) for three configurations (N=5000, P=.05): CPU, GPU with 1 executor, GPU with 4 executors.

◮ More partitions mean oversubscription of the GPU
◮ Overlaps communication with computation
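The trade-off can be made concrete with a toy calculation: a triangle is counted locally only when all three of its vertices fall in the same partition. The modulo partitioner and the triangle list below are illustrative, not Spark's actual hash partitioner:

```python
def local_fraction(triangles, n_parts):
    # Fraction of triangles whose three vertices all map to the same
    # partition (vertex v goes to partition v % n_parts) and can
    # therefore be counted without inter-node communication.
    local = sum(1 for t in triangles if len({v % n_parts for v in t}) == 1)
    return local / len(triangles)

# Toy triangle list over 7 vertices.
triangles = [(0, 1, 2), (3, 4, 5), (0, 3, 6)]
print(local_fraction(triangles, 1))  # → 1.0 (one partition: everything is local)
print(local_fraction(triangles, 3))  # only (0, 3, 6) stays in one partition
```

More partitions shrink this fraction, shifting work into the more expensive inter-subgraph Spark stage, which is why fewer partitions helped on sparse graphs.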
Triangle Enumeration - Optimizing Local Computation
◮ Performance of the local component of Triangle Enumeration on the CPU and GPU for graphs of varying size and density
Figure: Local run time (0 to 4 s) on the GPU (ScikitCUDA) and the CPU (SciPy) for graph sizes of 1000 to 7000 nodes and densities of 0.00 to 0.05.
◮ Running on the GPU is preferred unless the graph is sparse (density < .01), in which case the CPU is preferred
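A framework can exploit this crossover with a one-line dispatch rule. The function name and the threshold constant below are ours (the constant simply restates the measured crossover density), not part of any published API:

```python
SPARSE_THRESHOLD = 0.01  # measured CPU/GPU crossover density

def local_backend(density):
    # Route each local task: sparse graphs run faster on the CPU
    # (SciPy), denser graphs on the GPU (ScikitCUDA).
    return "cpu" if density < SPARSE_THRESHOLD else "gpu"

print(local_backend(0.001))  # → cpu
print(local_backend(0.05))   # → gpu
```

Because the decision is per-task, a heterogeneous cluster can keep both the CPUs and the GPUs busy on the subgraphs they handle best.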
Conclusion
◮ FSK
◮ Linear scaling
◮ GPU outperforms CPU
◮ Free load balancing with Spark
◮ Triangle Enumeration
◮ Optimize data movement by changing the number of Spark partitions
◮ Improve local performance by choosing where to execute tasks
◮ Our high-level framework
◮ Demonstrated portability using a variety of hardware
Future Work
◮ Additional case-study application
◮ Spiking neural network training
◮ Detecting common subgraphs within neural networks
◮ Additional tests
◮ Scalability test on a large-scale homogeneous cluster
◮ Add latest NVIDIA GPUs (K40/K80) to our heterogeneous cluster
Reproducibility
◮ All data and code on GitHub
◮ https://github.com/rsearles35/WACCPD-2016