
Green-Marl: A DSL for Easy and Efficient Graph Analysis

Sungpack Hong*, Hassan Chafi*+, Eric Sedlar+, and Kunle Olukotun*
*Pervasive Parallelism Lab, Stanford University; +Oracle Labs


  1. Green-Marl: A DSL for Easy and Efficient Graph Analysis
     Sungpack Hong*, Hassan Chafi*+, Eric Sedlar+, and Kunle Olukotun*
     *Pervasive Parallelism Lab, Stanford University; +Oracle Labs

  2. Graph Analysis
     - Classic graphs; new applications
       - Artificial Intelligence, Computational Biology, …
       - SNS apps: LinkedIn, Facebook, …
     - Graph analysis: a process of drawing out further information from the given graph data-set
     - Example: Movie Database
       - "What would be the avg. hop-distance between any two (Australian) actors?"
       - "Is he a central figure in the movie network? How much?"
       - "Do these actors work together more frequently than others?"
     [Figure: movie network with the nodes Sam Worthington, James Cameron, Linda Hamilton, Kevin Bacon, Sigourney Weaver, Jack Black, Ben Stiller, Owen Wilson]

  3. More formally,
     - Graph Data-Set
       - Graph G = (V, E): a relationship (E) between data entities (V)
       - Property P: any extra data associated with each vertex or edge of graph G
       - Your Data-Set = (G, Π) = (G, P_1, P_2, …)
     - Graph analysis on (G, Π)
       - Compute a scalar value, e.g. avg-distance, conductance, eigen-value, …
       - Compute a (new) property, e.g. (max) flow, betweenness centrality, page-rank, …
       - Identify a specific subset of G, e.g. minimum spanning tree, connected component, community structure detection, …
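The data-set model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `GraphDataSet`, `hop_distances`, `avg_hop_distance` and `reachable_from` are hypothetical and do not come from the presentation, and edges are treated as directed. The three functions correspond to the three kinds of analysis listed above (a new property, a scalar value, a subset of G).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class GraphDataSet:
    """(G, Pi): a directed graph G = (V, E) plus named per-vertex properties."""
    out_nbrs: dict                               # vertex -> list of successors (E)
    props: dict = field(default_factory=dict)    # property name -> {vertex: value}

    @property
    def nodes(self):
        return list(self.out_nbrs)


def hop_distances(ds, src):
    """New property: BFS hop distance from src to every reachable vertex."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in ds.out_nbrs[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def avg_hop_distance(ds):
    """Scalar value: mean hop distance over all ordered reachable pairs s != t."""
    total = pairs = 0
    for s in ds.nodes:
        for t, d in hop_distances(ds, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def reachable_from(ds, src):
    """Subset of G: the set of vertices reachable from src."""
    return set(hop_distances(ds, src))
```

For example, an undirected path a-b-c is written with both edge directions, `GraphDataSet({'a': ['b'], 'b': ['a', 'c'], 'c': ['b']})`, and a computed property can be stored back into the data-set with `ds.props['dist_from_a'] = hop_distances(ds, 'a')`.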

  4. The Performance Issue
     - Traditional single-core machines showed limited performance for graph analysis problems
       - A lot of random memory accesses + data does not fit in cache
       - Performance is bound to memory latency
       - Conventional hardware (e.g. floating point units) does not help much
     - Use parallelism to accelerate graph analysis
       - Plenty of data-parallelism in large graph instances
       - Performance now depends on memory bandwidth, not latency
       - Exploit modern parallel computers: multi-core CPU, GPU, Cray XMT, cluster, …

  5. New Issue: Implementation Overhead
     - It is challenging to implement a graph algorithm
       - correctly
       - + and efficiently
       - + while applying parallelism
       - + differently for each execution environment

  6. Our approach: DSL
     - We design a domain specific language (DSL) for graph analysis
     - The user writes his/her algorithm concisely with our DSL
     - The compiler translates it into the target language (e.g. parallel C++ or CUDA)
     [Figure: an intuitive description of a graph algorithm (DSL fragments such as "Foreach (t: G.Nodes) t.sigma += …", with constructs like Foreach, BFS, Edgeset) goes through the DSL compiler and becomes an efficient (parallel) implementation of the given algorithm (C++ fragments such as "For(i=0;i<G.numNodes();i++) { … __fetch_and_add(G.nodes[i], …) …"). The compiler relies on (1) inherent data-parallelism, (2) good impl. templates, and (3) high-level optimization.]

  7. Example: Betweenness Centrality
     - Betweenness Centrality (BC)
       - A measure that tells how 'central' a node is in the graph
       - Used in social network analysis
     - Definition
       - How many shortest paths between any two nodes go through this node
     [Figure: example graph with low-BC and high-BC nodes; labels Kevin Bacon and Ayush K. Kehdekar. Image source: Wikipedia]
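The informal definition on this slide is usually stated as a sum of shortest-path fractions. As a sketch in the conventional notation of [Brandes 2001] (the symbols are the customary ones and are not taken from the slide), let σ_st be the number of shortest paths from s to t and σ_st(v) the number of those that pass through v:

```latex
BC(v) = \sum_{s \neq v \neq t \in V} \frac{\sigma_{st}(v)}{\sigma_{st}}
```

A node that lies on every shortest path between many pairs therefore gets a high score, while a node that lies on none gets zero.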

  8. Example: Betweenness Centrality [Brandes 2001]
     - Init BC for every node and begin outer-loop (s)
     - In BFS order from s: compute sigma (of v) from parents
     - In reverse BFS order: compute delta (of v) from children (w)
     - Accumulate delta into BC
     Looks complex: queues, lists, stack, … Is this parallelizable?

  9. Example: Betweenness Centrality [Brandes 2001]
     - In BFS order from s: compute sigma (of v) from parents
     - In reverse BFS order: compute delta (of v) from children (w)

  10. Example: Betweenness Centrality [Brandes 2001]
      - In BFS order from s: compute sigma (of v) from parents
      - In reverse BFS order: compute delta (of v) from children (w)
      Parallel constructs marked on the algorithm: parallel iteration, parallel assignment, parallel BFS, reduction
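The steps named on these slides (BFS order, sigma, reverse BFS order, delta, accumulation into BC) can be sketched as a level-by-level version of Brandes' algorithm. This is a sequential Python illustration, not the Green-Marl program or the compiler output; the function name and the adjacency-dict input are assumptions. Sigma is pushed from each node to its BFS children, which gives the same values as each child summing over its BFS parents.

```python
def betweenness_centrality(out_nbrs):
    """Brandes-style BC on a directed graph given as {node: [successors]}.

    Counts ordered (s, t) pairs; for an undirected graph stored with both
    edge directions, halve the result to count unordered pairs.
    """
    bc = {v: 0.0 for v in out_nbrs}
    for s in out_nbrs:                       # outer loop over sources
        level = {s: 0}
        sigma = {v: 0.0 for v in out_nbrs}   # number of shortest s-v paths
        sigma[s] = 1.0
        levels = [[s]]
        # Forward pass in BFS order, one frontier at a time.
        while levels[-1]:
            frontier = []
            for v in levels[-1]:
                for w in out_nbrs[v]:
                    if w not in level:       # first visit fixes w's level
                        level[w] = level[v] + 1
                        frontier.append(w)
                    if level[w] == level[v] + 1:
                        sigma[w] += sigma[v]     # v is a BFS parent of w
            levels.append(frontier)
        # Backward pass in reverse BFS order.
        delta = {v: 0.0 for v in out_nbrs}
        for frontier in reversed(levels):
            for v in frontier:
                for w in out_nbrs[v]:
                    if level[w] == level[v] + 1:  # w is a BFS child of v
                        delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
                if v != s:
                    bc[v] += delta[v]        # accumulate delta into BC
    return bc
```

All nodes within one frontier are independent of each other apart from the `+=` updates, which is where the parallel iteration and reduction annotations on the slide would apply.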

  11. DSL Approach: Benefits
      - Three benefits
        - Productivity
        - Portability
        - Performance

  12. Productivity Benefits
      - A common limiting resource in software development: your brain power (i.e. how long can you […]?)
      - A C++ implementation of BC from SNAP (a parallel graph library from GT): ≈ 400 lines of code (with OpenMP)
      - vs. Green-Marl* LOC: 24
      *Green-Marl (그린 말) means "depicted language" in Korean

  13. Productivity Benefits

      Algorithm      Original LOC   Green-Marl LOC   Source       Original implementation
      BC             ~400           24               SNAP         C++, OpenMP
      Vertex Cover   71             21               SNAP         C++, OpenMP
      Conductance    42             10               SNAP         C++, OpenMP
      Page Rank      75             15               http:// ..   C++, single thread
      SCC            65             15               http:// ..   Java, single thread

      - It is more than LOC
        - Focusing on the algorithm, not its implementation
        - More intuitive, less error-prone
        - Rapidly explore many different algorithms

  14. Portability Benefits (On-going work)
      - Multiple compiler targets, selected by command line argument
      [Figure: DSL description → DSL compiler → (parallelized) C++, CUDA for GPU, or codes for cluster, each with its own LIB (& RT)]
      - SMP back-end
      - Cluster back-end (*)
        - For large instances
        - We generate codes that work on Pregel API [Malewicz et al. SIGMOD 2010]
      - GPU back-end (*)
        - For small instances
        - We know some tricks [Hong et al. PPOPP 2011]

  15. Performance Benefits
      [Figure: compiler pipeline: Parsing & Checking → Arch. Independent Opt → Arch. Dependent Opt → Code Generation]
      - The optimization passes use high-level semantic information
      - Back-end specific optimization depends on the target arch. (SMP? GPU? Distributed?)
      - Code generation uses optimized data structures & code templates: threading lib (e.g. OpenMP), graph data structure

  16. Arch-Indep-Opt: Loop Fusion
      [Figure: two loops over the same "set" of nodes (elems are unique) are merged into a single loop by loop fusion; the code on the slide is not recoverable]
      - A C++ compiler cannot merge such loops (independence not guaranteed)
      - Optimization enabled by high-level (semantic) information
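Because the slide's own code did not survive extraction, the following is a minimal stand-in for the idea, assuming two hypothetical per-node properties `a` and `b`. Each iteration reads and writes only the properties of its own node `t`, and the node set has no duplicates, so running the two loops back to back gives the same result as one fused loop.

```python
def run_separately(nodes, a, b):
    """Two loops over the same node set, as written by the programmer."""
    a, b = dict(a), dict(b)          # work on copies, leave inputs untouched
    for t in nodes:
        a[t] = a[t] + 1
    for t in nodes:
        b[t] = 2 * a[t]
    return a, b


def run_fused(nodes, a, b):
    """The same computation after loop fusion: one pass over the node set."""
    a, b = dict(a), dict(b)
    for t in nodes:
        a[t] = a[t] + 1
        b[t] = 2 * a[t]              # only depends on a[t] of the same node
    return a, b
```

The fusion would not be valid if the second loop read a property of a different node (for example a neighbor's `a`), which is the kind of independence a general-purpose C++ compiler cannot assume.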

  17. Arch-Indep-Opt: Flipping Edges
      - Graph-specific optimization
      - Two equivalent formulations of the same computation:
        - Counting the number of incoming neighbors (s) whose B value is positive, for every node (t)
        - Adding 1 for all outgoing neighbors (t), if my (s) B value is positive
      - (Why?) Reverse edges may not be available, or may be expensive to compute
      - Optimization using domain-specific property
      [Figure: edges s → t before and after flipping; the code on the slide is not recoverable]
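The equivalence behind edge flipping can be checked with a small sketch. The function names, the result property `a`, and the helper `reverse_edges` are hypothetical; only the property name B and the two formulations come from the slide. The pull version needs incoming-neighbor lists (reverse edges), while the push version only needs the outgoing-neighbor lists that a graph typically stores.

```python
def reverse_edges(out_nbrs):
    """Build incoming-neighbor lists; this is the extra cost flipping avoids."""
    in_nbrs = {v: [] for v in out_nbrs}
    for s, targets in out_nbrs.items():
        for t in targets:
            in_nbrs[t].append(s)
    return in_nbrs


def count_by_pull(in_nbrs, b):
    """For every node t, count incoming neighbors s with b[s] > 0."""
    return {t: sum(1 for s in in_nbrs[t] if b[s] > 0) for t in in_nbrs}


def count_by_push(out_nbrs, b):
    """Every node s with b[s] > 0 adds 1 to each of its outgoing neighbors."""
    a = {t: 0 for t in out_nbrs}
    for s in out_nbrs:
        if b[s] > 0:
            for t in out_nbrs[s]:
                a[t] += 1            # would be an atomic add if run in parallel
    return a
```

Both functions visit exactly the same set of edges (s, t); they differ only in which endpoint drives the outer loop.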
