

  1. Nested Parallelism PageRank on RISC-V Vector Multi-Processors — Alon Amid, Albert Ou, Krste Asanović, Borivoje Nikolić

  2. Agenda
  ● Silicon-Proven Open Source Hardware (Rocket + Hwacha)
  ● Problem Domain (Graphs/PageRank + Nested Parallelism)
  ● Software Implementations (GraphMat + OpenMP)
  ● FPGA-Accelerated Simulation
  ● SW/HW Design Space Exploration
  ● Full-System Implications

  3. Graphs
  ● Graphs are everywhere
    ○ Implicit data-parallelism
    ○ Irregular data layout
  ● The usefulness of fixed-function acceleration of graph kernels is debatable
  ● Instead, use general-purpose data-parallel acceleration for graph workloads
    ○ Maximize the efficiency of data-parallel processors
  Images: http://netplexity.org/?p=809, http://horicky.blogspot.com/2012/04/basic-graph-analytics-using-igraph.html, http://mathworld.wolfram.com/GraphDiameter.html

  4. Common Data-Parallel Architectures
  ● Packed-SIMD
    ○ Register size exposed in the programming model
    ○ Direct bit-manipulation
    ○ ISA implications with every technology-generation change
  ● GPUs
    ○ SIMT programming model
    ○ Throughput processors, scratchpad memories
  ● Vector Architectures
    ○ Vector-length agnostic programming model (see the sketch below)
    ○ Additional flexibility in µarch optimization
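  To make the vector-length agnostic model concrete, below is a schematic C sketch of a strip-mined loop: software requests a vector length each iteration and uses whatever the hardware grants. This is not actual Hwacha or RVV code; setvl and VLMAX are hypothetical stand-ins for the ISA's vector-length mechanism and the implementation's maximum vector length.

    #include <stddef.h>

    #define VLMAX 64  /* models the hardware's maximum vector length */

    /* Hypothetical stand-in: hardware grants vl <= min(n, VLMAX). */
    static size_t setvl(size_t n) {
        return n < VLMAX ? n : VLMAX;
    }

    /* Strip-mined y[i] = a * x[i]: the same binary runs unchanged on
     * implementations with different VLMAX, which is the point of the
     * vector-length agnostic model. */
    void vec_scale(size_t n, double a, const double *x, double *y) {
        while (n > 0) {
            size_t vl = setvl(n);             /* request a vector length */
            for (size_t i = 0; i < vl; i++)   /* stands in for one       */
                y[i] = a * x[i];              /* vector instruction      */
            x += vl;
            y += vl;
            n -= vl;
        }
    }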

  5. Graphs in Data-Parallel Architectures
  ● Intel AVX
    ○ Small parallelism factor
    ○ AVX register utilization constrained by register size and alignment
      ■ Alternative sparse-matrix representations to fit AVX registers (Grazelle [1])
  ● GPUs [2][3]
    ○ Amortize data movement between host memory and GPU memory
    ○ Load balancing between warps and threads
  [1] Samuel Grossman, Heiner Litz, and Christos Kozyrakis. "Making Pull-Based Graph Processing Performant."
  [2] Farzad Khorasani, Rajiv Gupta, and Laxmi N. Bhuyan. "Scalable SIMD-Efficient Graph Processing on GPUs."
  [3] Multiple works by John Owens (UC Davis)
  Photo credits: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions, https://www.tomshardware.co.uk/why-gpu-pricing-will-drop-further,news-58816.html

  6. Hwacha Vector Architecture
  ● Non-standard RISC-V ISA extension
  ● Integrated with the Rocket Chip generator
  ● Vector-length agnostic programming model
  ● TileLink cache-coherent memory
  ● Silicon-proven, open-source vector accelerator
  ● Parameterizable multi-lane design
    ○ Open-sourced at the 1st RISC-V Summit

  7. Hwacha Vector Architecture
  ● Decoupled access-execute
  ● 4 ops/cycle per lane average throughput
  ● 128 bits/cycle backing memory bandwidth
  ● 16 KiB SRAM banked register file per lane
    ○ Max vector length of 2048 double-width elements
    ○ Systolic bank execution
    ○ 4×128 bits register-file bandwidth
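  (The 2048-element maximum follows directly from the register-file size: 16 KiB = 131,072 bits, and 131,072 bits / 64 bits per double-width element = 2,048 elements.)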

  8. Nested Parallelism
  ● Data-parallel accelerators + multi-processors
  ● Mixing parallelism properties
    ○ Task-level parallelism – flexible, but expensive
    ○ Data-level parallelism – efficient, but rigid
  ● Many design points, in both SW and HW
  ● How to partition?

  9. Graph and Sparse-Matrix Representations
  ● Graphs are commonly represented as:
    ○ Adjacency lists
    ○ Adjacency matrices
  ● The adjacency matrix is usually a sparse matrix
  ● Sparse matrices can be compressed
    ○ Eliminating the zero values
    ○ Reducing storage in memory
  ● Variety of sparse matrix representations
  Example 8×8 sparse matrix used on the following slides:
     0 81  0  0  0  0  0  0
     0  5  0  0  0  0  0  0
     0  0  0  0  0  0  0  0
    61  0  9  0  0  0 34 11
     0  0  0  0  0  0  0  0
     0  0  0  0  0  0  0 42
     0  0  0  0  0  0 17  0
     0 92  0  0  0  0  0 70

  10. Graph and Sparse-Matrix Representations
  COO representation of the example matrix:
    row_indices:    0  1  3  3  3  3  5  6  7  7
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70

  11. Graph and Sparse-Matrix Representations
  COO:
    row_indices:    0  1  3  3  3  3  5  6  7  7
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70
  CSR:
    row_pointers:   0  1  2  2  6  6  7  8 10
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70

  12. Graph and Sparse-Matrix Representations
  COO:
    row_indices:    0  1  3  3  3  3  5  6  7  7
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70
  CSR:
    row_pointers:   0  1  2  2  6  6  7  8 10
    column_indices: 1  1  0  2  6  7  7  6  1  7
    values:        81  5 61  9 34 11 42 17 92 70
  CSC:
    column_pointers: 0  1  4  5  5  5  5  7 10
    row_indices:     3  0  1  7  3  3  6  3  5  7
    values:         61 81  5 92  9 34 17 11 42 70
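  As a concrete reference, here is a minimal C sketch of how the CSR and CSC arrays above might be declared. The struct and field names simply mirror the slide's labels; they are illustrative, not from any particular library.

    #include <stdint.h>

    /* CSR: row_pointers[i] .. row_pointers[i+1] delimits row i's
     * entries in column_indices/values. */
    typedef struct {
        int64_t *row_pointers;    /* length: num_rows + 1 */
        int64_t *column_indices;  /* length: num_nonzeros */
        double  *values;          /* length: num_nonzeros */
    } csr_matrix;

    /* CSC: the same scheme with the roles of rows and columns swapped. */
    typedef struct {
        int64_t *column_pointers; /* length: num_cols + 1 */
        int64_t *row_indices;     /* length: num_nonzeros */
        double  *values;          /* length: num_nonzeros */
    } csc_matrix;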

  13. DCSR/DCSC Representation
  ● Compress across both dimensions
  ● Intended for hyper-sparse matrices
    ○ Hyper-sparsity is required to amortize the overhead of the additional indirection level
  ● Explicit nested parallelism
  DCSR arrays for the example matrix on this slide:
    row_starts:     0  2  5
    row_indices:    0  1  6  7
    row_ptrs:       0  1  5  7 10
    column_indices: 1  2  4  6  8  2  6  1  4  7
    values:        61 81  5 92  9 34 17 11 42 70
  [1] Buluç, Aydın, and John R. Gilbert. "On the representation and multiplication of hypersparse matrices." 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 2008.
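  Continuing the sketch above, a hypothetical C declaration of the DCSR layout, with field names mirroring the slide's array labels:

    #include <stdint.h>

    /* DCSR adds one indirection level over CSR: only non-empty rows
     * are stored, so row_ptrs is indexed by position within
     * row_indices rather than by row number. */
    typedef struct {
        int64_t *row_indices;     /* IDs of the non-empty rows        */
        int64_t *row_ptrs;        /* length: num_nonempty_rows + 1    */
        int64_t *column_indices;  /* length: num_nonzeros             */
        double  *values;          /* length: num_nonzeros             */
        int64_t  num_nonempty_rows;
    } dcsr_matrix;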

  14. Nested Parallelism in DCSR/DCSC
  ● A DCSR representation is composed of multiple CSR representations
  ● Two explicit parallelism levels:
    ○ Level 1 – Task/thread-level parallelism across the external indirection array
    ○ Level 2 – Data-level parallelism within each sub-CSR representation
  Thread assignment via row_starts (Thread 0 | Thread 1), as in the OpenMP sketch below:
    row_starts:     0  2  5
    row_indices:    0  1  6  7
    row_ptrs:       0  1  5  7 10
    column_indices: 1  2  4  6  8  2  6  1  4  7
    values:        61 81  5 92  9 34 17 11 42 70
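  A minimal OpenMP sketch of the level-1 split, assuming the dcsr_matrix type sketched earlier and a row_starts partition array as on the slide (num_partitions + 1 entries). The process_sub_csr function stands for the level-2 data-parallel work on one sub-CSR; a scalar version of it is sketched under slide 17.

    /* Level 1: each partition of non-empty rows goes to one thread. */
    void process_sub_csr(const dcsr_matrix *m, int64_t first, int64_t last,
                         const double *x, double *y);  /* level-2 kernel */

    void dcsr_spmv(const dcsr_matrix *m, const int64_t *row_starts,
                   int num_partitions,  /* row_starts: num_partitions + 1 */
                   const double *x, double *y) {
        #pragma omp parallel for
        for (int p = 0; p < num_partitions; p++)
            process_sub_csr(m, row_starts[p], row_starts[p + 1], x, y);
    }

  In hardware terms, each OpenMP thread maps to a core, and the work inside process_sub_csr is what would be offloaded to that core's vector unit.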

  15. Inner CSR Processing
  ● Each thread processes a small sub-CSR unit
  ● For demonstration purposes, let's make the sub-CSR larger
  Thread 0's small sub-CSR:
    row_indices:    0  1
    row_ptrs:       0  1 ...
    column_indices: 1  2  4  6  8
    values:        61 81  5 92  9
  Larger demonstration CSR:
    row_indices:    0  1  7 12 21 30
    row_ptrs:       0  1  5  8  9 11
    column_indices: 1  2  4  6  8 14 15 27 43 51 53 60
    values:        61 81  5 92  9  3 44  2 17 18 10 44

  16. Sidenote: PageRank
  ● Measure of the importance of nodes in a directed graph
  ● Represents a random walk
  ● Can be implemented as an iterative SpMV
  ● Common iterative graph-processing benchmark
  Images: https://en.wikipedia.org/wiki/File:PageRanks-Example.jpg
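  For reference, the standard update that each SpMV iteration realizes is

    PR_{t+1}(v) = \frac{1 - d}{N} + d \sum_{u \in \mathrm{in}(v)} \frac{PR_t(u)}{\mathrm{outdeg}(u)}

  where d is the damping factor (commonly 0.85) and N is the number of nodes; the sum over in-neighbors is exactly a sparse matrix-vector product with the out-degree-normalized adjacency matrix.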

  17. Simple Scalar Sparse Matrix Traversal
  ● Process the internal CSR in a simple scalar loop
  ● Traverse the pointers array (the pointer p1 steps through it; slides 18–21 animate these steps)
  ● Follow the pointer to the values array
  ● Perform the required operation (multiplication and accumulation for SpMV)
    row_indices:    0  1  7 12 21 30
    row_ptrs:       0  1  5  8  9 11
    column_indices: 1  2  4  6  8 14 15 27 43 51 53 60
    values:        61 81  5 92  9  3 44  2 17 18 10 44
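  A sketch of that scalar loop as the level-2 kernel left undeclared in the earlier OpenMP sketch; first and last delimit this thread's positions in row_indices/row_ptrs, and x and y are the SpMV input/output vectors (both names are assumptions, not from the deck).

    /* Level-2 work in its simplest form: walk row_ptrs (p plays the
     * role of p1 in the animation), follow each pointer into
     * column_indices/values, and multiply-accumulate. */
    void process_sub_csr(const dcsr_matrix *m, int64_t first, int64_t last,
                         const double *x, double *y) {
        for (int64_t r = first; r < last; r++) {
            double acc = 0.0;
            for (int64_t p = m->row_ptrs[r]; p < m->row_ptrs[r + 1]; p++)
                acc += m->values[p] * x[m->column_indices[p]];  /* gather + MAC */
            y[m->row_indices[r]] += acc;  /* accumulate into the output row */
        }
    }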

