A Portable, High-Level Graph Analytics Framework Targeting Distributed, Heterogeneous Systems - PowerPoint PPT Presentation



SLIDE 1

A Portable, High-Level Graph Analytics Framework Targeting Distributed, Heterogeneous Systems

Robert Searles*, Stephen Herbein*, and Sunita Chandrasekaran November 14, 2016

SLIDE 2

Motivation

◮ HPC and Big Data communities are converging
◮ Heterogeneous and distributed systems are becoming increasingly common
◮ Distributing data and leveraging specialized hardware (e.g. accelerators) is critical
◮ Graph analytics are important to both communities

SLIDE 3

Goal

◮ Develop a portable, high-level framework for programming current and future HPC systems that:
  ◮ Distributes data automatically
  ◮ Utilizes heterogeneous hardware
◮ Accelerate two real-world graph analytics applications
◮ Demonstrate portability by running on a variety of hardware, including multi-core Intel CPUs, NVIDIA GPUs, and AMD GPUs

SLIDE 4

Our Framework: Spark + X

Figure: Spark coordinates multiple nodes; each node runs a local "X" component on its CPU and GPU.

◮ Utilize the MapReduce framework, Spark, to handle data and task distribution
  ◮ Automatic data/task distribution
  ◮ Fault-tolerant
  ◮ Minimal programmer overhead
◮ Leverage heterogeneous resources to compute the tasks local to each node
  ◮ Accelerators and other emerging trends in HPC technology
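The Spark + X pattern above can be sketched in miniature. This is a minimal stand-in, not the authors' code: plain Python lists and map/reduce take the place of Spark's RDD operations (which are not assumed available here), and squaring stands in for the per-node "X" component.

```python
from functools import reduce

def partition(data, n_parts):
    """Split data into roughly equal chunks (Spark would do this for us)."""
    return [data[i::n_parts] for i in range(n_parts)]

def local_compute(chunk):
    """The 'X' component: per-node work (an OpenCL/CUDA kernel in the real
    framework); a simple square stands in for it here."""
    return [x * x for x in chunk]

def run(data, n_parts=3):
    # "Map": each partition is processed independently (on a worker in Spark).
    partial = [local_compute(chunk) for chunk in partition(data, n_parts)]
    # "Reduce": combine per-node results back into a global answer.
    return sorted(reduce(lambda a, b: a + b, partial))

print(run([1, 2, 3, 4, 5, 6]))  # [1, 4, 9, 16, 25, 36]
```

With PySpark the same shape would be `sc.parallelize(data).mapPartitions(local_compute)` followed by a collect, which is what gives the automatic distribution and fault tolerance for free.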

SLIDE 5

Case Study Applications

◮ Fast Subtree Kernel (FSK)
  ◮ Call graph similarity analysis
  ◮ Program characterization
  ◮ Malware analysis
◮ Triangle enumeration
  ◮ Spam detection
  ◮ Web link recommendation
  ◮ Social network analysis

SLIDE 6

What is FSK?

◮ Compute-bound graph kernel
◮ Measures the similarity of graphs in a dataset
◮ A graph is represented by a list of feature vectors
  ◮ Each feature vector represents a subtree

Figure: Pipeline diagram relating binaries, call graphs, FSK, the similarity matrix, SVM/decomposition, and program characterization.
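The subtree-feature representation above can be sketched as follows. This is a hypothetical illustration, not the authors' exact kernel: feature vectors are toy tuples, an exact-match count stands in for the real subtree comparison, and the normalization is one plausible choice.

```python
def fsk_similarity(g1, g2):
    """Sketch of an FSK-style comparison: each graph is a list of subtree
    feature vectors; the score counts vectors that match across the two
    graphs (the real kernel's matching and weighting may differ)."""
    matches = sum(1 for v1 in g1 for v2 in g2 if v1 == v2)
    # Normalize so a graph with all-distinct feature vectors scores 1.0
    # against itself.
    return matches / (len(g1) * len(g2)) ** 0.5

a = [(1, 0), (0, 1)]  # toy feature vectors; real ones encode subtree shape
b = [(1, 0), (2, 2)]
print(fsk_similarity(a, a))  # 1.0
print(fsk_similarity(a, b))  # 0.5
```

The all-pairs loop over feature vectors is what makes FSK compute-bound, and is exactly the part the framework offloads to the GPU via PyOpenCL.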

SLIDE 7

FSK in our framework

◮ Spark Component
  ◮ Split up pairwise graph comparisons
◮ Local Component
  ◮ For each pair of graphs, compare all feature vectors

Figure: Spark distributes the call graphs and fans the pairwise comparisons out to parallel Compare tasks.
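The split between the Spark component and the local component can be sketched like this. The graph names, feature vectors, and match-count comparison are toy stand-ins; plain `map()` stands in for Spark's distributed map.

```python
from itertools import combinations

graphs = {  # hypothetical dataset: graph name -> list of subtree feature vectors
    "g1": [(1, 0), (0, 1)],
    "g2": [(1, 0), (2, 2)],
    "g3": [(0, 1), (2, 2)],
}

def compare(pair):
    """Local component: compare all feature vectors of one graph pair
    (a simple match count stands in for the full kernel)."""
    (n1, fv1), (n2, fv2) = pair
    matches = sum(1 for a in fv1 for b in fv2 if a == b)
    return (n1, n2), matches

# Spark component: enumerate the pairwise comparisons and distribute them.
# With PySpark this would be sc.parallelize(pairs).map(compare).collect().
pairs = list(combinations(graphs.items(), 2))
results = dict(map(compare, pairs))
print(results)  # {('g1', 'g2'): 1, ('g1', 'g3'): 1, ('g2', 'g3'): 1}
```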

SLIDE 8

What is Triangle Enumeration?

◮ Data-bound graph operation
◮ Finds all cycles of size 3 (AKA triangles) within a graph

Figure: A 5-vertex example graph containing 2 triangles (highlighted in red).
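The definition can be made concrete with a brute-force check over vertex triples (fine for tiny graphs). The edge list below is a hypothetical 5-vertex graph with exactly 2 triangles, in the spirit of the figure; the figure's exact edges are not recoverable from the slide.

```python
from itertools import combinations

def count_triangles(edges):
    """Count 3-cycles by testing every vertex triple against the edge set."""
    adj = {frozenset(e) for e in edges}
    verts = sorted({v for e in edges for v in e})
    return sum(
        1 for a, b, c in combinations(verts, 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= adj
    )

# Triangles here: {1,2,3} and {3,4,5}.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (3, 5), (4, 5)]
print(count_triangles(edges))  # 2
```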

SLIDE 9

Triangle Enumeration in our framework

◮ Spark Component
  ◮ Partition the graph
  ◮ Distribute the vertices/edges across the cluster
◮ Local Component
  ◮ Count triangles within each subgraph
  ◮ Done using matrix-matrix multiplication (BLAS)
◮ Spark Component
  ◮ Count triangles between subgraphs
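The local component's BLAS trick can be sketched with NumPy: for an undirected adjacency matrix A, trace(A^3) counts each triangle six times (three vertices times two traversal directions). On a GPU node the same two matrix products would go through ScikitCUDA's BLAS bindings; NumPy stands in for them here.

```python
import numpy as np

def local_triangle_count(adj):
    """Triangles in one subgraph via matrix-matrix multiplication (BLAS):
    (A^3)[i][i] counts closed 3-walks from vertex i, so the trace counts
    each triangle 6 times."""
    a3 = adj @ adj @ adj
    return int(np.trace(a3)) // 6

# Adjacency matrix of a 4-vertex graph with one triangle (0-1-2).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=np.int64)
print(local_triangle_count(A))  # 1
```

Expressing the count as dense matrix products is what lets the local work ride on highly tuned BLAS kernels instead of an irregular graph traversal.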

SLIDE 10

Hardware/Software

Fast Subtree Kernel

◮ Software: PySpark, PyOpenCL
◮ Hardware: AMD GPU (Fury X)

Triangle Enumeration

◮ Software: PySpark, ScikitCUDA
◮ Hardware: NVIDIA GPUs (GTX 470, GTX 970, Tesla K20c)

SLIDE 11

FSK Results - Single-Node Parallelism

Figure: Call Graph Similarity - Single Node Performance. Runtime (in seconds) vs. dataset size (10, 100, 500, 1000) for single-thread CPU, 8-thread CPU, and GPU runtimes; bar annotations: 1.02, 1.42, 1.13, 1.18.

◮ Single node runtimes (single thread, 8 threads, and GPU)

SLIDE 12

FSK Results - Multi-Node Scalability

Figure: Call Graph Similarity - Single Node vs. Multi Node. Runtime (in seconds) vs. dataset size (10, 100, 500, 1000) for single-node CPU and multi-node CPU (3 nodes); bar annotations: 0.62, 3.07, 2.99, 3.13.

◮ Multiple node runtimes (CPU saturated on all nodes)

SLIDE 13

Triangle Enumeration - Optimizing Data Movement

◮ Runtime of the Spark component for Triangle Enumeration with a variable number of partitions, for Erdos-Renyi random graphs of differing densities

Sparse graphs (P=.001)

Figure: Global Time (seconds) vs. Number of Spark Partitions (36, 72, 144) for 3 configurations (N=5000, P=.001): CPU, GPU-1 Executor, GPU-4 Executors.

◮ Fewer partitions allows more triangles to be counted locally

Denser graphs (P=.05)

Figure: Global Time (seconds) vs. Number of Spark Partitions (36, 72, 144) for 3 configurations (N=5000, P=.05): CPU, GPU-1 Executor, GPU-4 Executors.

◮ More partitions means oversubscription of the GPU
  ◮ Overlaps communication with computation

SLIDE 14

Triangle Enumeration - Optimizing Local Computation

◮ Performance of the local component of Triangle Enumeration on the CPU and GPU for graphs of varying size and density

Figure: Run time (s) surfaces over graph size (1000-7000 nodes) and graph density (0.00-0.05) for the GPU (ScikitCUDA) and the CPU (SciPy).

◮ Running on the GPU is preferred unless the graph is sparse (density < .01), in which case running on the CPU is preferred
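The placement heuristic implied by these results can be sketched as a simple density-based dispatcher. This is an illustrative assumption, not the authors' implementation: the threshold 0.01 comes from the slide, a neighbor-set intersection stands in for the SciPy sparse path, and NumPy's dense multiply stands in for ScikitCUDA's GPU BLAS.

```python
import numpy as np

DENSITY_THRESHOLD = 0.01  # crossover density reported on the slide

def density(adj):
    n = adj.shape[0]
    return adj.sum() / (n * (n - 1))

def count_sparse_cpu(adj):
    """Sparse-friendly CPU path: for every edge, count common neighbors.
    Each triangle is found once per edge, hence the division by 3."""
    nbrs = [set(np.flatnonzero(row)) for row in adj]
    per_edge = sum(
        len(nbrs[u] & nbrs[v])
        for u in range(adj.shape[0]) for v in nbrs[u] if v > u
    )
    return per_edge // 3

def count_dense_gpu(adj):
    """Dense path: trace(A^3)/6 via BLAS (ScikitCUDA on a GPU node;
    NumPy stands in here)."""
    return int(np.trace(adj @ adj @ adj)) // 6

def count_triangles_auto(adj):
    """Sparse graphs stay on the CPU; denser ones are worth the GPU."""
    if density(adj) < DENSITY_THRESHOLD:
        return count_sparse_cpu(adj)
    return count_dense_gpu(adj)

adj_demo = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=np.int64)
print(count_triangles_auto(adj_demo))  # 1 (dense enough for the GPU path)
```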

SLIDE 15

Conclusion

◮ FSK
  ◮ Linear scaling
  ◮ GPU outperforms CPU
  ◮ Free load balancing with Spark
◮ Triangle Enumeration
  ◮ Optimize data movement by changing the number of Spark partitions
  ◮ Improve local performance by choosing where to execute tasks
◮ Our high-level framework
  ◮ Demonstrated portability using a variety of hardware

SLIDE 16

Future Work

◮ Additional case-study application
  ◮ Spiking neural network training
  ◮ Detecting common subgraphs within neural networks
◮ Additional tests
  ◮ Scalability test on a large-scale homogeneous cluster
  ◮ Add latest NVIDIA GPUs (K40/K80) to our heterogeneous cluster

SLIDE 17

Reproducibility

◮ All data and code on GitHub

◮ https://github.com/rsearles35/WACCPD-2016