

  1. Nested Parallelism PageRank on RISC-V Vector Multi-Processors — Alon Amid, Albert Ou, Krste Asanović, Borivoje Nikolić

  2. Agenda
  ● Silicon-Proven Open Source Hardware (Rocket + Hwacha)
  ● Problem Domain (Graphs/PageRank + Nested Parallelism)
  ● Software Implementations (GraphMat + OpenMP)
  ● FPGA-Accelerated Simulation
  ● SW/HW Design Space Exploration
  ● Full-System Implications

  3. Graphs
  ● Graphs are everywhere
    ○ Implicit data-parallelism
    ○ Irregular data layout
  ● The usefulness of fixed-function acceleration of graph kernels is debatable
  ● Instead, use general-purpose data-parallel acceleration for graph workloads
    ○ Maximize the efficiency of data-parallel processors
  Images: http://netplexity.org/?p=809, http://horicky.blogspot.com/2012/04/basic-graph-analytics-using-igraph.html, http://mathworld.wolfram.com/GraphDiameter.html

  4. Common Data-Parallel Architectures
  ● Packed-SIMD
    ○ Register size exposed in the programming model
    ○ Direct bit-manipulation
    ○ ISA implications with every technology-generation change
  ● GPUs
    ○ SIMT programming model
    ○ Throughput processors, scratchpad memories
  ● Vector Architectures
    ○ Vector-length agnostic programming model (see the sketch below)
    ○ Additional flexibility in µarch optimization
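  To make the vector-length agnostic model concrete, below is a schematic C sketch of a strip-mined loop: software requests a vector length each iteration and uses whatever the hardware grants. This is not actual Hwacha or RVV code; setvl and VLMAX are hypothetical stand-ins for the ISA's vector-length mechanism and the implementation's maximum vector length.

    #include <stddef.h>

    #define VLMAX 64  /* models the hardware's maximum vector length */

    /* Hypothetical stand-in: hardware grants vl <= min(n, VLMAX). */
    static size_t setvl(size_t n) {
        return n < VLMAX ? n : VLMAX;
    }

    /* Strip-mined y[i] = a * x[i]: the same binary runs unchanged on
     * implementations with different VLMAX, which is the point of the
     * vector-length agnostic model. */
    void vec_scale(size_t n, double a, const double *x, double *y) {
        while (n > 0) {
            size_t vl = setvl(n);             /* request a vector length */
            for (size_t i = 0; i < vl; i++)   /* stands in for one       */
                y[i] = a * x[i];              /* vector instruction      */
            x += vl;
            y += vl;
            n -= vl;
        }
    }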

  5. Graphs in Data-Parallel Architectures
  ● Intel AVX
    ○ Small parallelism factor
    ○ AVX register utilization constrained by register size and alignment
      ■ Alternative sparse-matrix representations to fit AVX registers (Grazelle [1])
  ● GPUs [2][3]
    ○ Amortize data movement between host memory and GPU memory
    ○ Load balancing between warps and threads
  [1] Samuel Grossman, Heiner Litz, and Christos Kozyrakis. "Making Pull-Based Graph Processing Performant."
  [2] Farzad Khorasani, Rajiv Gupta, and Laxmi N. Bhuyan. "Scalable SIMD-Efficient Graph Processing on GPUs."
  [3] Multiple works by John Owens (UC Davis)
  Photo credits: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions, https://www.tomshardware.co.uk/why-gpu-pricing-will-drop-further,news-58816.html

  6. Hwacha Vector Architecture
  ● Non-standard RISC-V ISA extension
  ● Integrated with the Rocket Chip generator
  ● Vector-length agnostic programming model
  ● TileLink cache-coherent memory
  ● Silicon-proven, open-source vector accelerator
  ● Parameterizable multi-lane design
    ○ Open-sourced at the 1st RISC-V Summit

  7. Hwacha Vector Architecture
  ● Decoupled access-execute
  ● 4 ops/cycle per lane average throughput
  ● 128 bits/cycle backing memory bandwidth
  ● 16 KiB SRAM banked register file per lane
    ○ Max vector length of 2048 double-width elements
    ○ Systolic bank execution
    ○ 4×128 bits register-file bandwidth
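  (The 2048-element maximum follows directly from the register-file size: 16 KiB = 131,072 bits, and 131,072 bits / 64 bits per double-width element = 2,048 elements.)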

  8. Nested Parallelism
  ● Data-parallel accelerators + multi-processors
  ● Mixing parallelism properties
    ○ Task-level parallelism – flexible, but expensive
    ○ Data-level parallelism – efficient, but rigid
  ● Many design points, in both SW and HW
  ● How to partition?

  9. Graph and Sparse-Matrix Representations
  ● Graphs are commonly represented as:
    ○ Adjacency lists
    ○ Adjacency matrices
  ● The adjacency matrix is usually a sparse matrix
  ● Sparse matrices can be compressed
    ○ Eliminating the zero values
    ○ Reducing storage in memory
  ● Variety of sparse matrix representations
  Example 8×8 sparse matrix used on the following slides:
     0 81  0  0  0  0  0  0
     0  5  0  0  0  0  0  0
     0  0  0  0  0  0  0  0
    61  0  9  0  0  0 34 11
     0  0  0  0  0  0  0  0
     0  0  0  0  0  0  0 42
     0  0  0  0  0  0 17  0
     0 92  0  0  0  0  0 70

  10. Graph and Sparse-Matrix Representations
  COO representation of the example matrix:
    row_indices:    0  1  3  3  3  3  5  6  7  7
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70

  11. Graph and Sparse-Matrix Representations
  COO:
    row_indices:    0  1  3  3  3  3  5  6  7  7
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70
  CSR:
    row_pointers:   0  1  2  2  6  6  7  8 10
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70

  12. Graph and Sparse-Matrix Representations
  COO:
    row_indices:    0  1  3  3  3  3  5  6  7  7
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70
  CSR:
    row_pointers:   0  1  2  2  6  6  7  8 10
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70
  CSC:
    column_pointers: 0  1  4  5  5  5  5  7 10
    row_indices:     3  0  1  7  3  3  6  3  5  7
    values:         61 81  5 92  9 34 17 11 42 70
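  As a concrete reference, here is a minimal C sketch of how the CSR and CSC arrays above might be declared. The struct and field names simply mirror the slide's labels; they are illustrative, not from any particular library.

    #include <stdint.h>

    /* CSR: row_pointers[i] .. row_pointers[i+1] delimits row i's
     * entries in column_indices/values. */
    typedef struct {
        int64_t *row_pointers;    /* length: num_rows + 1 */
        int64_t *column_indices;  /* length: num_nonzeros */
        double  *values;          /* length: num_nonzeros */
    } csr_matrix;

    /* CSC: the same scheme with the roles of rows and columns swapped. */
    typedef struct {
        int64_t *column_pointers; /* length: num_cols + 1 */
        int64_t *row_indices;     /* length: num_nonzeros */
        double  *values;          /* length: num_nonzeros */
    } csc_matrix;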

  13. DCSR/DCSC Representation
  ● Compress across both dimensions
  ● Intended for hyper-sparse matrices
    ○ Hyper-sparsity is required to amortize the overhead of the additional indirection level
  ● Explicit nested parallelism
  DCSR arrays for the example matrix on this slide:
    row_starts:     0  2  5
    row_indices:    0  1  6  7
    row_ptrs:       0  1  5  7 10
    column_indices: 1  2  4  6  8  2  6  1  4  7
    values:        61 81  5 92  9 34 17 11 42 70
  [1] Buluç, Aydın, and John R. Gilbert. "On the representation and multiplication of hypersparse matrices." 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 2008.
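  Continuing the sketch above, a hypothetical C declaration of the DCSR layout, with field names mirroring the slide's array labels:

    #include <stdint.h>

    /* DCSR adds one indirection level over CSR: only non-empty rows
     * are stored, so row_ptrs is indexed by position within
     * row_indices rather than by row number. */
    typedef struct {
        int64_t *row_indices;     /* IDs of the non-empty rows        */
        int64_t *row_ptrs;        /* length: num_nonempty_rows + 1    */
        int64_t *column_indices;  /* length: num_nonzeros             */
        double  *values;          /* length: num_nonzeros             */
        int64_t  num_nonempty_rows;
    } dcsr_matrix;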

  14. Nested Parallelism in DCSR/DCSC
  ● A DCSR representation is composed of multiple CSR representations
  ● Two explicit parallelism levels:
    ○ Level 1 – Task/thread-level parallelism across the external indirection array
    ○ Level 2 – Data-level parallelism within each sub-CSR representation
  Thread assignment via row_starts (Thread 0 | Thread 1), as in the OpenMP sketch below:
    row_starts:     0  2  5
    row_indices:    0  1  6  7
    row_ptrs:       0  1  5  7 10
    column_indices: 1  2  4  6  8  2  6  1  4  7
    values:        61 81  5 92  9 34 17 11 42 70
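  A minimal OpenMP sketch of the level-1 split, assuming the dcsr_matrix type sketched earlier and a row_starts partition array as on the slide (num_partitions + 1 entries). The process_sub_csr function stands for the level-2 data-parallel work on one sub-CSR; a scalar version of it is sketched under slide 17.

    /* Level 1: each partition of non-empty rows goes to one thread. */
    void process_sub_csr(const dcsr_matrix *m, int64_t first, int64_t last,
                         const double *x, double *y);  /* level-2 kernel */

    void dcsr_spmv(const dcsr_matrix *m, const int64_t *row_starts,
                   int num_partitions,  /* row_starts: num_partitions + 1 */
                   const double *x, double *y) {
        #pragma omp parallel for
        for (int p = 0; p < num_partitions; p++)
            process_sub_csr(m, row_starts[p], row_starts[p + 1], x, y);
    }

  In hardware terms, each OpenMP thread maps to a core, and the work inside process_sub_csr is what would be offloaded to that core's vector unit.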

  15. Inner CSR Processing
  ● Each thread processes a small sub-CSR unit
  ● For demonstration purposes, let's make the sub-CSR larger
  Thread 0's small sub-CSR:
    row_indices:    0  1
    row_ptrs:       0  1 ...
    column_indices: 1  2  4  6  8
    values:        61 81  5 92  9
  Larger demonstration CSR:
    row_indices:    0  1  7 12 21 30
    row_ptrs:       0  1  5  8  9 11
    column_indices: 1  2  4  6  8 14 15 27 43 51 53 60
    values:        61 81  5 92  9  3 44  2 17 18 10 44

  16. Sidenote: PageRank
  ● Measure of the importance of nodes in a directed graph
  ● Represents a random walk
  ● Can be implemented as an iterative SpMV
  ● Common iterative graph-processing benchmark
  Images: https://en.wikipedia.org/wiki/File:PageRanks-Example.jpg
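  For reference, the standard update that each SpMV iteration realizes is

    PR_{t+1}(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{in}(v)} \frac{PR_t(u)}{\mathrm{outdeg}(u)}

  where d is the damping factor (commonly 0.85) and N is the number of nodes; the sum over in-neighbors is exactly a sparse matrix-vector product with the out-degree-normalized adjacency matrix.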

  17. Simple Scalar Sparse Matrix Traversal
  ● Process the internal CSR in a simple scalar loop
  ● Traverse the pointers array (the pointer p1 steps through it; slides 18–21 animate these steps)
  ● Follow the pointer to the values array
  ● Perform the required operation (multiplication and accumulation for SpMV)
    row_indices:    0  1  7 12 21 30
    row_ptrs:       0  1  5  8  9 11
    column_indices: 1  2  4  6  8 14 15 27 43 51 53 60
    values:        61 81  5 92  9  3 44  2 17 18 10 44
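  A sketch of that scalar loop as the level-2 kernel left undeclared in the earlier OpenMP sketch; first and last delimit this thread's positions in row_indices/row_ptrs, and x and y are the SpMV input/output vectors (both names are assumptions, not from the deck).

    /* Level-2 work in its simplest form: walk row_ptrs (p plays the
     * role of p1 in the animation), follow each pointer into
     * column_indices/values, and multiply-accumulate. */
    void process_sub_csr(const dcsr_matrix *m, int64_t first, int64_t last,
                         const double *x, double *y) {
        for (int64_t r = first; r < last; r++) {
            double acc = 0.0;
            for (int64_t p = m->row_ptrs[r]; p < m->row_ptrs[r + 1]; p++)
                acc += m->values[p] * x[m->column_indices[p]];  /* gather + MAC */
            y[m->row_indices[r]] += acc;  /* accumulate into the output row */
        }
    }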

