

  1. Hardware/Software Vectorization for Closeness Centrality on Multi-/Many-Core Architectures. Ahmet Erdem Sarıyüce, Erik Saule, Kamer Kaya, Ümit V. Çatalyürek. The Ohio State University (BMI, CS, ECE), University of North Carolina at Charlotte (CS). MTAAP 2014.

  2. Outline: Introduction; An SpMM-based approach; Experiments; Conclusion.

  3. Centralities - Concept. Centrality measures answer questions such as: Who controls the flow in a network? Who is more important? Who has more influence? Whose contribution is significant for connections? Applications: covert networks (e.g., terrorist identification), contingency analysis (e.g., weakness/robustness of networks), viral marketing (e.g., who will spread the word best), traffic analysis, store locations. Different kinds of graphs: road networks, social networks, power grids, mechanical meshes.

  4. Closeness Centrality. Definition: let G = (V, E) be an unweighted graph with vertex set V and edge set E. Then cc[v] = 1 / Σ_{u ∈ V} d(u, v), where d(u, v) is the shortest path length between u and v. The best known algorithm computes the shortest path graph rooted at each vertex of the graph. The complexity is O(E) per source and O(VE) in total, which makes it computationally expensive.
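To make the baseline concrete, here is a minimal sketch of the standard approach (one BFS per source) on a CSR graph; the xadj/adj arrays match the representation used in the later code slides, but the function name and details are illustrative rather than the authors' implementation.

    #include <algorithm>
    #include <queue>
    #include <vector>

    // Baseline: one BFS per source, cc[s] = 1 / (sum of distances from s).
    // O(E) per source, O(VE) in total.
    void cc_bfs_baseline(const int* xadj, const int* adj, int n, float* cc) {
      std::vector<int> dist(n);
      for (int s = 0; s < n; ++s) {
        std::fill(dist.begin(), dist.end(), -1);
        std::queue<int> q;
        dist[s] = 0;
        q.push(s);
        long long sum = 0;
        while (!q.empty()) {
          int u = q.front(); q.pop();
          sum += dist[u];
          for (int j = xadj[u]; j < xadj[u + 1]; ++j) {
            int v = adj[j];
            if (dist[v] < 0) { dist[v] = dist[u] + 1; q.push(v); }
          }
        }
        cc[s] = (sum > 0) ? 1.0f / (float)sum : 0.0f;  // guard isolated vertices
      }
    }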

  5. Closeness Centrality (continued). Typical algorithms perform one BFS per source: top-down or bottom-up, direction optimizing, level-synchronous BFS. [Figure: sparsity pattern of the adjacency matrix with "to"/"from" axes.] There is no regularity in the computation: no use of vector processing units.
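As a reference point for the regularity argument, a level-synchronous BFS can be written with explicit frontier arrays; a minimal top-down sketch (assumed CSR layout, illustrative names):

    #include <vector>

    // Level-synchronous, top-down BFS from source s: each iteration expands
    // the whole current frontier to build the next level.
    void bfs_level_synchronous(const int* xadj, const int* adj, int n, int s,
                               std::vector<int>& dist) {
      dist.assign(n, -1);
      std::vector<int> frontier = {s}, next;
      dist[s] = 0;
      int level = 0;
      while (!frontier.empty()) {
        ++level;
        next.clear();
        for (int u : frontier)
          for (int j = xadj[u]; j < xadj[u + 1]; ++j) {
            int v = adj[j];
            if (dist[v] < 0) { dist[v] = level; next.push_back(v); }  // scattered writes
          }
        frontier.swap(next);
      }
    }

The irregular inner loop over adjacency lists is what keeps the compiler from using the vector units here.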

  6. Vector processing units. SIMD: a key source of performance. MMX (1996): 64-bit registers (x86); SSE (1999): 128-bit registers (x86); AVX (2008): 256-bit registers (x86); MIC (2012): 512-bit registers (Xeon Phi); 512-bit registers to come on x86. Operations: add, mul, and, or, ... Ignoring vectorization wastes 75% (SSE), 87% (AVX), 93% (MIC) of available performance in single precision. It is also often necessary to saturate memory bandwidth.
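For a feel of what the hardware offers, a minimal AVX sketch using the same <immintrin.h> intrinsics that appear later in the deck; the function and array names are illustrative:

    #include <immintrin.h>

    // Add two float arrays 8 lanes at a time with 256-bit AVX registers.
    // Assumes n is a multiple of 8.
    void add_avx(const float* a, const float* b, float* out, int n) {
      for (int i = 0; i < n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // one add covers 8 floats
      }
    }

One _mm256_add_ps covers eight single-precision lanes, which is where the 87% figure for scalar code on AVX comes from.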

  7. Outline: Introduction; An SpMM-based approach; Experiments; Conclusion.

  8. An SpMV-based approach. A simpler definition of level-synchronous BFS: vertex v is at level ℓ if and only if one of the neighbors of v is at level ℓ - 1 and v is not at any level ℓ′ < ℓ. Let x^ℓ_i = true if vertex i is part of the frontier at level ℓ. Then y^{ℓ+1} gives the neighbors of level ℓ: y^{ℓ+1}_k = OR_{j ∈ Γ(k)} x^ℓ_j (an (OR, AND)-SpMV). The next-level frontier is x^{ℓ+1}_i = y^{ℓ+1}_i & ¬(OR_{ℓ′ ≤ ℓ} x^{ℓ′}_i). The contribution of the source to cc[i] is x^ℓ_i / ℓ. Bottom-up (gather reads): for each vertex, are the neighbors in the frontier? Complexity O(ED), where D is the diameter of the graph; writes are performed once, linearly; reads are (hopefully) close-by. Top-down (scatter writes): for each element of the frontier, touch the neighbors. Complexity O(E); writes are scattered in memory; reads are linear.
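A scalar sketch of this (OR, AND)-SpMV formulation for a single source, in the bottom-up form (gather reads); it accumulates the per-source contribution x^ℓ_i / ℓ described above, with illustrative names:

    #include <vector>

    // One source's BFS as repeated (OR, AND)-SpMV: x is the frontier x^l,
    // visited is the OR of all earlier frontiers, y is the neighbor vector.
    void cc_one_source_spmv(const int* xadj, const int* adj, int n, int s, float* cc) {
      std::vector<char> x(n, 0), visited(n, 0), y(n, 0);
      x[s] = visited[s] = 1;
      int level = 0;
      bool cont = true;
      while (cont) {
        ++level;
        cont = false;
        for (int i = 0; i < n; ++i) {            // y^{l+1}_i = OR_{j in Gamma(i)} x^l_j
          char nei = 0;
          for (int j = xadj[i]; j < xadj[i + 1] && !nei; ++j) nei |= x[adj[j]];
          y[i] = nei;
        }
        for (int i = 0; i < n; ++i) {            // x^{l+1}_i = y_i & not(visited_i)
          char cu = y[i] & (visited[i] ? 0 : 1);
          if (cu) { cc[i] += 1.0f / (float)level; visited[i] = 1; cont = true; }
          x[i] = cu;
        }
      }
    }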

  9. From SpMV to SpMM. [Figure: adjacency matrix ("from"/"to") applied to a block of b frontier vectors, one per source in the batch.]
  Data: G = (V, E), batch size b.  Output: cc[.]
      cc[v] ← 0, ∀v ∈ V                                                     (Init)
      partition V into k batches Π = {V1, V2, ..., Vk} of size b
      for each batch of vertices Vp ∈ Π do
          ℓ ← 0
          x^0_{s,s} ← 1 if s ∈ Vp, 0 otherwise
          while Σ_{i,s} x^ℓ_{i,s} > 0 do
              y^{ℓ+1}_{i,s} = OR_{j ∈ Γ(i)} x^ℓ_{j,s}, ∀s, ∀i               (SpMM)
              x^{ℓ+1}_{i,s} = y^{ℓ+1}_{i,s} & ¬(OR_{ℓ′ ≤ ℓ} x^{ℓ′}_{i,s}), ∀s, ∀i   (Update)
              ℓ ← ℓ + 1
              for all v ∈ V do
                  cc[v] ← cc[v] + Σ_s x^ℓ_{v,s} / ℓ
      return cc[.]
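The same algorithm in plain C++ with b = 64 sources packed into one 64-bit word per vertex, as a sketch of the batched scheme above (no intrinsics; __builtin_popcountll is a GCC/Clang builtin, and all names are illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Batched BFS: 64 concurrent sources per batch, one bit per source.
    void cc_spmm_64(const int* xadj, const int* adj, int n, float* cc) {
      std::vector<uint64_t> x(n), visited(n), y(n);
      for (int s0 = 0; s0 < n; s0 += 64) {
        int b = std::min(64, n - s0);
        std::fill(x.begin(), x.end(), 0);
        for (int s = 0; s < b; ++s) x[s0 + s] = 1ULL << s;  // x^0: source s in lane s
        visited = x;
        int level = 0;
        bool cont = true;
        while (cont) {
          ++level; cont = false;
          for (int i = 0; i < n; ++i) {                     // SpMM: OR over neighbors
            uint64_t nei = 0;
            for (int j = xadj[i]; j < xadj[i + 1]; ++j) nei |= x[adj[j]];
            y[i] = nei;
          }
          for (int i = 0; i < n; ++i) {                     // Update: new frontier bits
            uint64_t cu = y[i] & ~visited[i];
            int cnt = __builtin_popcountll(cu);
            if (cnt) { cc[i] += cnt / (float)level; visited[i] |= cu; cont = true; }
            x[i] = cu;
          }
        }
      }
    }

With 256-bit registers the same loop structure is written with intrinsics, as on the next slide.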

  10. Some simple analysis. Complexity of O(VED) instead of O(VE), but D (the diameter) is typically small, and the computation is vectorizable. The matrix is transferred VD/b times instead of V times; D is small and b can be big (512-bit registers on MIC). Increasing b increases the size of the right-hand side: it can potentially trash the cache, but it regularizes the memory access patterns.

  11. Vectorization (AVX kernel, b = 256 concurrent BFS):

    void cc_cpu_256_spmm(int* xadj, int* adj, int n, float* cc) {
      int b = 256;
      size_t size_alloc = n * b / 8;
      char* neighbor = (char*)_mm_malloc(size_alloc, 32);
      char* current  = (char*)_mm_malloc(size_alloc, 32);
      char* visited  = (char*)_mm_malloc(size_alloc, 32);
      for (int s = 0; s < n; s += b) {
        //Init
        #pragma omp parallel for schedule (dynamic, CC_CHUNK)
        for (int i = 0; i < n; ++i) {
          __m256i neigh = _mm256_setzero_si256();
          int il[8] = {0, 0, 0, 0, 0, 0, 0, 0};
          if (i >= s && i < s + b) il[(i-s)>>5] = 1 << ((i-s) & 0x1F);
          __m256i cu = _mm256_set_epi32(il[7], il[6], il[5], il[4],
                                        il[3], il[2], il[1], il[0]);
          _mm256_store_si256((__m256i *)(neighbor + 32 * i), neigh);
          _mm256_store_si256((__m256i *)(current  + 32 * i), cu);
          _mm256_store_si256((__m256i *)(visited  + 32 * i), cu);
        }
        int cont = 1;
        int level = 0;
        while (cont != 0) {
          cont = 0;
          level++;
          //SpMM
          #pragma omp parallel for schedule (dynamic, CC_CHUNK)
          for (int i = 0; i < n; ++i) {
            __m256 vali = _mm256_setzero_ps();
            for (int j = xadj[i]; j < xadj[i+1]; ++j) {
              int v = adj[j];
              __m256 state_v = _mm256_load_ps((float*)(current + 32 * v));
              vali = _mm256_or_ps(vali, state_v);
            }
            _mm256_store_ps((float*)(neighbor + 32 * i), vali);
          }
          //Update
          float flevel = 1.0f / (float) level;
          #pragma omp parallel for schedule (dynamic, CC_CHUNK)
          for (int i = 0; i < n; ++i) {
            __m256 nei = _mm256_load_ps((float *)(neighbor + 32 * i));
            __m256 vis = _mm256_load_ps((float *)(visited + 32 * i));
            __m256 cu  = _mm256_andnot_ps(vis, nei);
            vis = _mm256_or_ps(nei, vis);
            int bcnt = bitCount_256(cu);
            if (bcnt > 0) {
              cc[i] += bcnt * flevel;
              cont = 1;
            }
            _mm256_store_ps((float *)(visited + 32 * i), vis);
            _mm256_store_ps((float *)(current + 32 * i), cu);
          }
        }
      }
      _mm_free(neighbor);
      _mm_free(current);
      _mm_free(visited);
    }

  12. Vectorization (continued). Variants: similar SSE and MIC implementations. Also implemented in a generic way in C++, using various tags to inform the compiler of what it can do (restrict, unroll) and using templates to fix the number of BFS, so that dedicated assembly code is generated for each variant. (The code shown is the same AVX kernel as on the previous slide.)

  13. Software vectorization. Observation: performing multiple BFS at once does not only allow the use of vector registers; it also reduces the number of times the graph is traversed. Idea: why limit the number of concurrent sources to the size of a vector register? We use compiler-vectorized code to generate kernels for different numbers of concurrent BFS. We call this technique software vectorization.
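A minimal sketch of this idea, assuming the word-packed layout from the earlier sketch: the number of 64-bit words per vertex is a template parameter, so the compiler can fully unroll and auto-vectorize the inner loop and emit dedicated code for each batch width (illustrative code, not the authors' generic C++ implementation):

    #include <cstdint>

    // Software vectorization: B 64-bit words per vertex, i.e. 64*B concurrent
    // BFS. The compile-time B lets the compiler unroll and auto-vectorize the
    // word loop with whatever SIMD width the target provides.
    template <int B>
    void spmm_step(const int* xadj, const int* adj, int n,
                   const uint64_t* current, uint64_t* neighbor) {
      #pragma omp parallel for schedule(dynamic, 64)
      for (int i = 0; i < n; ++i) {
        uint64_t acc[B] = {};
        for (int j = xadj[i]; j < xadj[i + 1]; ++j) {
          const uint64_t* row = current + (size_t)adj[j] * B;
          for (int w = 0; w < B; ++w) acc[w] |= row[w];  // unrolled/vectorized by the compiler
        }
        for (int w = 0; w < B; ++w) neighbor[(size_t)i * B + w] = acc[w];
      }
    }

    // Dedicated kernels for, e.g., 64, 512, or 4096 concurrent BFS:
    template void spmm_step<1>(const int*, const int*, int, const uint64_t*, uint64_t*);
    template void spmm_step<8>(const int*, const int*, int, const uint64_t*, uint64_t*);
    template void spmm_step<64>(const int*, const int*, int, const uint64_t*, uint64_t*);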
