Exploiting GPU Caches in Sparse Matrix Vector Multiplication, Yusuke Nagasaka - PowerPoint PPT Presentation



SLIDE 1

Exploiting GPU Caches in Sparse Matrix Vector Multiplication

Yusuke Nagasaka Tokyo Institute of Technology

SLIDE 2

Sparse Matrix

  • Generated by FEM or arising as graph data

– Often requires solving sparse linear equations fast

  • Iterative methods: CG method, BiCG method

– Level-1 BLAS (dot product + AXPY)

  • Sequential memory access

– Sparse matrix vector multiplication (SpMV)

  • Uses a sparse matrix format
  • Random memory access

Performance depends on the cache hit rate
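The contrast between the sequential Level-1 BLAS access and the random SpMV access can be seen in a minimal CPU sketch of CSR SpMV (function and variable names are my own, not from the slides): the matrix arrays are streamed sequentially, while the input vector is read through an index indirection.

```cpp
#include <vector>
#include <cstddef>

// Minimal CSR SpMV: y = A * x.
// row_ptr[i]..row_ptr[i+1] delimit the non-zeros of row i.
// val[k] and col[k] are read sequentially, but x[col[k]] is a
// random access, so performance depends on the cache hit rate.
std::vector<float> spmv_csr(const std::vector<int>& row_ptr,
                            const std::vector<int>& col,
                            const std::vector<float>& val,
                            const std::vector<float>& x) {
    std::size_t rows = row_ptr.size() - 1;
    std::vector<float> y(rows, 0.0f);
    for (std::size_t i = 0; i < rows; ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            y[i] += val[k] * x[col[k]];   // indirect (random) vector access
    return y;
}
```

Every iterative solver named above spends most of its time in this loop, which is why the rest of the deck focuses on the `x[col[k]]` access.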

SLIDE 3

SpMV computation on GPU

  • High memory bandwidth and parallelism enable high performance
  • Latency is hidden with SMT
  • Available cache per thread is small

– Controlling the cache is difficult
– => Lower cache hit rate compared to CPU

                 Intel Xeon E5-2620 v2                NVIDIA Tesla K20X
  Cache size     L1: 192KB (instruction / data),      Read-only: 12KB * 4 / SMX,
                 L2: 1.5MB, L3: 15MB                  L2: 1.5MB
  Max threads    12 threads                           28,672 threads

SLIDE 4

Contribution

  • We propose a family of cache-aware sparse formats for GPU

– Segmentation along the columns
– Segmented formats and Non-Uniformly Segmented formats

  • 2 ways of SpMV computation
  • Achieve speedups of up to

  • x2.1 for real datasets and x3.2 for synthetic matrices in SpMV
  • x1.15 in CG

SLIDE 5

Sparse Format

  • Compresses out the needless zero elements

– Reduces memory usage
– E.g. COO, CSR

  • Efficient memory access to matrix data depends on the architecture

– Vector machine, GPU: column major formats

  • JDS, ELLPACK, SELL-C-σ
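The compression the slide describes is easiest to see in a COO-to-CSR conversion sketch (my own names; C++17 for the structured bindings): CSR replaces the per-element row index with one offset per row.

```cpp
#include <vector>
#include <tuple>

// Sketch: converting COO triplets (row, col, val) to CSR.
// CSR stores one offset per row instead of a row index per
// element, compressing out the zeros and shrinking memory usage.
struct Csr {
    std::vector<int> row_ptr, col;
    std::vector<float> val;
};

Csr coo_to_csr(const std::vector<std::tuple<int, int, float>>& coo, int rows) {
    Csr m;
    m.row_ptr.assign(rows + 1, 0);
    for (const auto& [r, c, v] : coo) m.row_ptr[r + 1]++;          // count nnz per row
    for (int i = 0; i < rows; ++i) m.row_ptr[i + 1] += m.row_ptr[i]; // prefix sum
    m.col.resize(coo.size());
    m.val.resize(coo.size());
    std::vector<int> next(m.row_ptr.begin(), m.row_ptr.end() - 1);
    for (const auto& [r, c, v] : coo) {
        int k = next[r]++;   // next free slot inside row r
        m.col[k] = c;
        m.val[k] = v;
    }
    return m;
}
```

Column major formats such as JDS and ELLPACK rearrange the same non-zeros so that consecutive threads read consecutive addresses.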

SLIDE 6

(Existing Sparse Format) JDS

  • Reorders the rows by the number of non-zero elements per row

– Generates a column major format

  • Favorable for vector machines and many core architectures
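The reordering can be sketched on the host side (my own names, not the paper's code): sort the rows by non-zero count, then record where each "diagonal" (the j-th non-zeros of all rows) starts, which is exactly the `jds_ptr` array the kernel on the next slide consumes.

```cpp
#include <vector>
#include <algorithm>
#include <numeric>

// Sketch of the JDS reordering: sort rows by non-zero count
// (descending) so the j-th non-zeros of all rows are contiguous
// in column-major storage. nnz[i] is the non-zero count of row i;
// perm is the row permutation, jds_ptr[j] the start of diagonal j.
void build_jds_ptr(const std::vector<int>& nnz,
                   std::vector<int>& perm, std::vector<int>& jds_ptr) {
    int rows = (int)nnz.size();
    perm.resize(rows);
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(),
                     [&](int a, int b) { return nnz[a] > nnz[b]; });
    int nnz_max = rows == 0 ? 0 : nnz[perm[0]];
    jds_ptr.assign(nnz_max + 1, 0);
    for (int j = 0; j < nnz_max; ++j) {
        // rows with more than j non-zeros contribute to diagonal j
        int count = 0;
        for (int i = 0; i < rows; ++i)
            if (nnz[perm[i]] > j) ++count;
        jds_ptr[j + 1] = jds_ptr[j] + count;
    }
}
```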

SLIDE 7

(Existing Sparse Format) SpMV kernel of JDS format

    __constant__ int jds_ptr[];

    __global__ void KernelJDS(float *out, const float* __restrict__ vector,
                              int *jds_col, float *jds_val, int M, int nnz_max)
    {
        // Calculate the i-th row
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = 0;
        float answer = 0;
        int index = i + jds_ptr[j];
        while (index < jds_ptr[j + 1] && j < nnz_max) {
            answer += jds_val[index] * vector[jds_col[index]];
            index = i + jds_ptr[++j];
        }
        out[i] = answer;
    }

  • Using constant cache when nnz_max * sizeof(float) < 16KB
  • Read-only cache for the input vector
  • Using inline PTX assembly not to pollute the cache

– jds_val[index] => ld.global.cv.f32
– jds_col[index] => ld.global.cv.s32
– out[i] => st.global.cs.f32
SLIDE 8

(Existing Sparse Format) SELL-C-σ [Kreutzer, 2013]

  • Converts ELLPACK into row blocks (Sliced ELLPACK)

– Reduces the zero filling
– C is the block size

  • C = WARP size
  • Sorts each window of σ rows

– Tradeoff between zero fill and the cost of sorting
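The zero-fill tradeoff can be quantified with a small sketch (my own function, illustrating the format rather than reproducing the paper's code): each block of C rows is padded to its longest row, and sorting each window of σ rows packs similar-length rows into the same block.

```cpp
#include <vector>
#include <algorithm>
#include <functional>

// Sketch: storage size (values incl. zero fill) of SELL-C.
// Rows are grouped into blocks of C; each block is padded to the
// longest row it contains. Sorting each window of sigma rows by
// length (sigma >= C) reduces the padding, at the cost of sorting.
long padded_size(std::vector<int> nnz, int C, int sigma) {
    for (std::size_t s = 0; s < nnz.size(); s += sigma) {
        auto end = nnz.begin() + std::min<std::size_t>(s + sigma, nnz.size());
        std::sort(nnz.begin() + s, end, std::greater<int>());
    }
    long total = 0;
    for (std::size_t b = 0; b < nnz.size(); b += C) {
        auto first = nnz.begin() + b;
        auto end = nnz.begin() + std::min<std::size_t>(b + C, nnz.size());
        total += (long)(end - first) * *std::max_element(first, end);
    }
    return total;
}
```

With row lengths {4, 1, 4, 1} and C = 2, no sorting (σ = 1) stores 16 values, while σ = 4 groups the long rows together and stores only 10.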

SLIDE 9

Cache Hit Rates of Existing Sparse Formats

  • NVIDIA Tesla K20X
  • Dataset: University of Florida Sparse Matrix Collection
  • JDS format

– Input vector is assigned to the read-only cache
– Coalesced access to matrix data

  Matrix       Size      L2 Cache Hit Rate [%]   Read-only Cache Hit Rate [%]
  Audikw_1     943,695   82.864                  51.420
  Crankseg_2   63,838    98.338                  66.540
  mouse_gene   45,101    99.912                  8.298

SLIDE 10

PROPOSED FORMATS

SLIDE 11

Column size and cache hit rate

  • SpMV execution for a random matrix

– Number of rows: 1024^3
– Number of columns: 2^x (4 <= x <= 24)
– Non-zero elements per row: 16
– Single precision
– Using JDS format

The column sizes where the cache hit rate drops correspond to the cache sizes

  • Read-only cache: 12KB
  • L2 cache: 1.5MB

Segmenting the matrix and the input vector enables a high cache hit rate

SLIDE 12

Segmented Formats

  • Column-wise segmentation

– Each segment is converted to JDS or SELL-C-σ
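The column-wise split can be sketched as routing each non-zero to the segment owning its column range (my own names; segments are kept as simple COO-like lists before the JDS or SELL-C-σ conversion):

```cpp
#include <vector>

// Sketch of column-wise segmentation: every non-zero goes to the
// segment holding its column range, so each sub-matrix only
// touches a seg_size-wide slice of the input vector, sized to
// fit in cache. Column indices are made local to the segment.
struct Segment { std::vector<int> row, col; std::vector<float> val; };

std::vector<Segment> segment_by_column(const std::vector<int>& row_ptr,
                                       const std::vector<int>& col,
                                       const std::vector<float>& val,
                                       int cols, int seg_size) {
    int nseg = (cols + seg_size - 1) / seg_size;
    std::vector<Segment> segs(nseg);
    for (int i = 0; i + 1 < (int)row_ptr.size(); ++i)
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            Segment& s = segs[col[k] / seg_size];
            s.row.push_back(i);
            s.col.push_back(col[k] % seg_size);  // column local to the segment
            s.val.push_back(val[k]);
        }
    return segs;
}
```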

SLIDE 13

Segmented Formats SpMV Execution

  • Two ways of SpMV computation

– 2-phase computation: reduces random writes

  • 1st phase: compute SpMV for each sub-matrix and sub-vector, and store the result into memory
  • 2nd phase: accumulate the intermediate vectors
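The 2-phase scheme can be sketched as follows (a sequential CPU stand-in with my own names; on the GPU both loops of phase 1 run in parallel without write conflicts):

```cpp
#include <vector>

// Sketch of 2-phase SpMV over column segments. Segment s holds
// triplets (row[s][k], col[s][k], val[s][k]) with columns local
// to its input-vector slice x_seg[s]. Phase 1 writes a private
// partial output per segment; phase 2 accumulates them, trading
// atomic writes for extra memory traffic.
std::vector<float> spmv_two_phase(
    const std::vector<std::vector<int>>& row,
    const std::vector<std::vector<int>>& col,
    const std::vector<std::vector<float>>& val,
    const std::vector<std::vector<float>>& x_seg, int rows) {
    std::size_t nseg = row.size();
    // 1st phase: independent partial results per segment
    std::vector<std::vector<float>> part(nseg, std::vector<float>(rows, 0.0f));
    for (std::size_t s = 0; s < nseg; ++s)
        for (std::size_t k = 0; k < val[s].size(); ++k)
            part[s][row[s][k]] += val[s][k] * x_seg[s][col[s][k]];
    // 2nd phase: accumulate the intermediate vectors
    std::vector<float> y(rows, 0.0f);
    for (std::size_t s = 0; s < nseg; ++s)
        for (int i = 0; i < rows; ++i) y[i] += part[s][i];
    return y;
}
```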

SLIDE 14

Segmented Formats SpMV Execution

  • Two ways of SpMV computation

– 1-phase computation using atomic operations
– Prepare additional threads to initialize the output vector
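In the 1-phase variant all segments accumulate into one output vector, so concurrent updates to the same element need an atomic add; on the GPU this is CUDA's `atomicAdd`, and a CPU stand-in using a CAS loop (my sketch, not the paper's code) looks like this:

```cpp
#include <atomic>

// Sketch of the atomic accumulation used by 1-phase SpMV: a
// compare-and-swap loop stands in for the GPU's atomicAdd. The
// output vector must be zeroed beforehand, which the paper
// assigns to additional initializing threads.
void atomic_add(std::atomic<float>& target, float v) {
    float old = target.load();
    // On failure, compare_exchange_weak refreshes 'old' with the
    // current value, so the retry recomputes old + v correctly.
    while (!target.compare_exchange_weak(old, old + v)) {}
}
```

This removes the intermediate vectors of the 2-phase version, but each contended update serializes on the atomic.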

SLIDE 15

Segmented Formats Disadvantages

  • Increased memory access cost

– Additional memory accesses (2-phase SpMV computation)
– Atomic operations are expensive

  • Generates segments having few non-zero elements

– Improvement of reusability < overhead of segmenting
– => Low efficiency

SLIDE 16

Non-Uniformly Segmented Formats (NUS Formats)

  • Mixes multiple segmentation widths

– Large segmentation width for the low density area
– => Reduces the number of segments

  • Sorts by the number of non-zero elements per column

– Puts the high density columns on the left side and the high reusability vector elements at the top
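The two ingredients can be sketched separately (my own names; the doubling width schedule is an illustrative choice, not the paper's actual rule): a column permutation by density, and a set of segment widths that grow toward the sparse side.

```cpp
#include <vector>
#include <algorithm>
#include <numeric>

// Sketch of the NUS column ordering: sort columns by non-zero
// count (descending), so dense, highly reused columns sit on the
// left where the narrow, cache-friendly segments are.
std::vector<int> column_permutation(const std::vector<int>& col_nnz) {
    std::vector<int> perm(col_nnz.size());
    std::iota(perm.begin(), perm.end(), 0);
    std::stable_sort(perm.begin(), perm.end(),
                     [&](int a, int b) { return col_nnz[a] > col_nnz[b]; });
    return perm;
}

// Illustrative non-uniform widths: base, 2*base, 4*base, ...
// until all columns are covered. Wider segments over the sparse
// right side keep the total number of segments small.
std::vector<int> nus_widths(int cols, int base) {
    std::vector<int> w;
    for (int covered = 0, width = base; covered < cols; width *= 2) {
        w.push_back(std::min(width, cols - covered));
        covered += w.back();
    }
    return w;
}
```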

SLIDE 17

Converting NUS Format

[Figure: a small example matrix and vector being permuted; the matrix index is the column index, the vector index is the original row index]

– Count the number of non-zero elements per column
– Sorting
– Reordering
– Update the column indices
– Convert to Segmented CSR
– Convert each sub-matrix to JDS

SLIDE 18

Auto Parameter Tuning Mechanism for the Conjugate Gradient Method

  • Difficulty of setting the parameters

– NUS formats have a 2D parameter space

  • Number of segments (seg_num)
  • Size of segments (seg_size)

  • Detecting the best parameters in the iterative method

– Time to convert the matrix to NUS format <<< time until convergence

SLIDE 19

Auto Parameter Tuning Mechanism for the Conjugate Gradient Method

  • Parallelized with OpenMP sections

– CPU: converting the matrix
– GPU: executing the iterations

  • Parameters

– seg_size is given
– The number of segments is varied
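The overlap of conversion and iteration can be sketched with plain threads (`std::thread` standing in for the OpenMP sections and the GPU stream; names and structure are mine, not the paper's):

```cpp
#include <thread>
#include <functional>

// Sketch of one tuning step: while the solver runs CG iterations
// with the current format (on the GPU in the paper), a second
// thread (the CPU section) converts the matrix to the next
// candidate segment count. Conversion is hidden behind the much
// longer iteration time.
void tune_step(const std::function<void()>& convert_next_format,
               const std::function<void()>& run_iterations) {
    std::thread converter(convert_next_format);  // CPU: conversion
    run_iterations();                            // GPU: iteration
    converter.join();  // next candidate is ready when iterations finish
}
```

Repeating this step over candidate segment counts, and keeping whichever format iterated fastest, gives the auto-tuning loop.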

SLIDE 20

PERFORMANCE EVALUATION

SLIDE 21

Experiment Environment

  • TSUBAME-KFC

– CPU: Intel Xeon E5-2620 v2 2.10GHz x 2
– GPU: NVIDIA Tesla K20X x 4

  • Single precision peak performance: 3.95 [TFLOPS]
  • Bandwidth: 250 [GB/sec]
  • Memory size: 6 [GB]
  • L2 cache: 1.5 [MB]
  • Read-only cache: 12 * 4 [KB/SMX]

  • CUDA 5.5
  • cuSPARSE

– Provided by NVIDIA
– CSR format
– HYBRID format

SLIDE 22

Performance Evaluation: SpMV (Florida datasets)

  • Our formats show

– speedups of x0.86 ~ x2.13
– stable performance

Blue: existing formats, Red: proposal (2-phase ver.), Green: proposal (atomic ver.)

SLIDE 23

Performance Evaluation: Cache Hit Rate of SpMV

  • The segment size suits the read-only cache

– Improvement of the cache hit rate over non-segmented formats

SLIDE 24

Performance Evaluation: SpMV (Randomly generated matrices)

  • Investigating larger matrices

– Number of rows: 1.0M, 1.5M, 2.0M, 2.5M, 3.0M
– Non-zero density: 0.0001%, 0.0002%, 0.0005%

Speedup of up to x3.2; our formats are stable across matrix properties

SLIDE 25

Performance Evaluation: Conjugate Gradient Method

  • CG computation for positive definite matrices

– Similar speedup to SpMV; up to x1.15

Speedup of SpMV alone is x1.22

SLIDE 26

Performance Evaluation: Auto Parameter Tuning CG Method

Speedup is x1.09 (crankseg_2, nd24k)

SLIDE 27

Performance Evaluation: Multi-node CG Method

  • Strong scaling

– One GPU per node

  • Communication between nodes by MPI
  • Send / receive the vector and MPI_Reduce each residual in each iteration

– Assign a row block to each node

  • Each row block has fewer non-zero elements
  • => Causes performance degradation
  • Generate larger random matrices; row size is 8M

SLIDE 28

Performance Evaluation: Multi-node CG Method

  • NUS-SELL-C-σ shows superiority to CSR and SELL-C-σ

– Speedup of up to x1.68
– For lower density matrices, data transfer between nodes takes relatively longer

  • The performance difference between formats is then not noticeable

SLIDE 29

Features of Matrices

  • The family of Segmented formats works well for matrices where

– Input vector access is more random

  • Segmented formats improve the cache hit rate

– The matrix has many non-zero elements

  • Achieves high cache reusability

– The matrix has a large variance in the number of non-zero elements per row

  • Reduces idle threads compared to JDS or SELL-C-σ

SLIDE 30

Conclusion

  • S-Formats and NUS-Formats improve the cache locality and SpMV performance

– NUS formats achieved speedups of up to

  • x2.1 for real datasets and x3.2 for synthetic matrices in SpMV
  • x1.15 for real datasets in CG

E-mail : nagasaka.y.aa@m.titech.ac.jp