A New Parallel Prefix-Scan Algorithm for GPUs
Sepideh Maleki*, Annie Yang, and Martin Burtscher
Department of Computer Science
Highlights
- GPU-friendly algorithm for prefix scans called SAM
- Novelties and features
- Higher-order support that is communication optimal
- Tuple-value support with constant workload per thread
- Carry propagation scheme with O(1) auxiliary storage
- Implemented in unified 100-statement CUDA kernel
- Results
- Outperforms CUB by up to 2.9-fold on higher-order and by up to 2.6-fold on tuple-based prefix sums
Prefix Sums
- Each value in the output sequence is the sum of all input elements up to and including that position
- Input: 1 2 3 4 5 6 7 8
- Output: 1 3 6 10 15 21 28 36
- Can be computed efficiently in parallel
- Applications
- Sorting, lexical analysis, polynomial evaluation, string comparison, stream compaction, and data compression
Data Compression
- Data compression algorithms
- Data model predicts the next value in the input sequence and emits the difference between the actual and predicted value
- Coder maps frequently occurring values to shorter outputs than infrequently occurring values
- Delta encoding
- Widely used data model
- Computes the difference sequence (i.e., predicts the current value to be the same as the previous value in the sequence)
- Used in image compression, speech compression, etc.
(Image credit: Charles Trevelyan for http://plus.maths.org/)
Delta Coding
- Delta encoding is embarrassingly parallel
- Delta decoding appears to be sequential
- Decoded prior value needed to decode current value
- A prefix sum decodes delta-encoded values
- Decoding can also be done in parallel
- Input sequence: 1, 2, 3, 4, 5, 2, 4, 6, 8, 10
- Difference sequence (encoding): 1, 1, 1, 1, 1, -3, 2, 2, 2, 2
- Prefix sum (decoding): 1, 2, 3, 4, 5, 2, 4, 6, 8, 10
Extensions of Delta Coding
- Higher orders
- Higher-order predictions are often more accurate
- First order
- out[k] = in[k] - in[k-1]
- Second order
- out[k] = in[k] - 2·in[k-1] + in[k-2]
- Third order
- out[k] = in[k] - 3·in[k-1] + 3·in[k-2] - in[k-3]
- Tuple values
- Data frequently appear in tuples
- Two-tuples
x0, y0, x1, y1, x2, y2, …, xn-1, yn-1
- Three-tuples
x0, y0, z0, x1, y1, z1, …, xn-1, yn-1, zn-1
Problem and Solution
- Conventional prefix sums are insufficient
- Do not decode higher-order delta encodings
- Do not decode tuple-based delta encodings
- Prior work
- Requires inefficient workarounds to handle higher-order and tuple-based delta encodings
- SAM algorithm and implementation
- Directly and efficiently supports these generalizations
- Even supports combination of higher orders and tuples
Work Efficiency of Prefix Sums
- Sequential prefix sum requires only a single pass
- 2n main-memory accesses (n reads + n writes)
- Linear O(n) complexity
- Parallel algorithm should have same complexity
- O(n) applications of the sum operator
Hierarchical Parallel Prefix Sum
(Figure: hierarchical prefix sum over time. Steps: break the array of arbitrary initial values into chunks; compute local prefix sums; gather the topmost value of each chunk into an auxiliary array; compute the prefix sum of the auxiliary array; add the resulting carry i to all values of chunk i to obtain the final values.)
Standard Prefix-Sum Implementation
- Based on 3-phase approach
- Reads and writes every element twice
- 4n main-memory accesses
- Auxiliary array is stored in global memory
- Calculation is performed across blocks
- High-performance implementations
- Allocate and process several values per thread
- Thrust and CUDPP use this hierarchical approach
SAM Base Implementation
- Intra-block prefix sums
- Computes prefix sum of each chunk conventionally
- Writes local sum of each chunk to auxiliary array
- Writes ready flag to second auxiliary array
- Inter-block prefix sums
- Reads local sums of all prior chunks
- Adds up local sums to calculate carry
- Updates all values in chunk using carry
- Writes final result to global memory
Pipelined Processing of Chunks
(Figure: pipelined processing of 8 chunks by blocks 1-4 over time. Each chunk's local sum (Sum1…Sum8, stored as entries S1…S8 of the local-sum array) and ready flag (F1…F8 of the flag array) are published as soon as they are available. Carry1 = 0, Carry2 = s1, Carry3 = s1 + s2, Carry4 = s1 + s2 + s3; thereafter, Carry5 = Carry1 + Sum1 + s2 + s3 + s4, Carry6 = Carry2 + Sum2 + s3 + s4 + s5, Carry7 = Carry3 + Sum3 + s4 + s5 + s6, and Carry8 = Carry4 + Sum4 + s5 + s6 + s7.)
Carry Propagation Scheme
- Persistent-block-based implementation
- Same block processes every kth chunk
- Carries require only O(1) computation per chunk
- Circular-buffer-based implementation
- Only 3k elements needed at any point in time
- Local sums and ready flags require O(1) storage
- Redundant computations for latency hiding
- Write-followed-by-independent-reads pattern
- Multiple values processed per thread (fewer chunks)
Higher-order Prefix Sums
- Higher-order difference sequences can be computed by repeatedly applying first-order differencing
- The prefix sum is the inverse of order-1 differencing
- k prefix sums decode an order-k sequence
- No direct solution for computing higher orders
- Must use an iterative approach
- Other codes' memory accesses are proportional to the order
Higher-order Prefix Sums (cont.)
- SAM is more efficient
- Internally iterates only the computation phase
- Does not read and write data in each iteration
- Requires only 2n main-memory accesses for any order
- SAM’s higher-order implementation
- Does not require additional auxiliary arrays
- Both sum array and ‘flag’ array are O(1) circular buffers
- Only needs non-Boolean ready ‘flags’
- Uses counts to indicate iteration of current local sum
Tuple-based Prefix Sums
- Data may be tuple based x0, y0, x1, y1, …, xn-1, yn-1
- Other codes have to handle tuples in one of two ways
- Reorder the elements, compute, then undo the reordering
- Slow due to the reordering and may require extra memory
- Define a tuple data type as well as a plus operator on it
- Slow for large tuples due to excessive register pressure
The reordering approach, illustrated on two-tuples:
x0, x1, …, xn-1 | y0, y1, …, yn-1
→ Σi=0..0 xi, Σi=0..1 xi, …, Σi=0..n-1 xi | Σi=0..0 yi, Σi=0..1 yi, …, Σi=0..n-1 yi
→ Σi=0..0 xi, Σi=0..0 yi, Σi=0..1 xi, Σi=0..1 yi, …, Σi=0..n-1 xi, Σi=0..n-1 yi
Tuple-based Prefix Sums (cont.)
- SAM is more efficient
- No reordering
- No special data types or overloaded operators
- Always same amount of data per thread
- SAM’s tuple implementation
- Employs multiple sum arrays, one per tuple element
- Each sum array is an O(1) circular buffer
- Uses modulo operations to determine which array to use
- Still employs single O(1) flag array
Experimental Methodology
- Evaluate following prefix sum implementations
- Thrust library (from CUDA Toolkit 7.5): 4n main-memory accesses
- CUDPP library 2.2: 4n main-memory accesses
- CUB library 1.5.1: 2n main-memory accesses
- SAM 1.1: 2n main-memory accesses
Prefix Sum Throughputs (Titan X)
(Figure: prefix-sum throughput [billion items/s] vs. input size [10^3 to 10^9 items] on the Titan X for Thrust, CUDPP, CUB, and SAM; panels: 32-bit and 64-bit integers.)
- SAM and CUB outperform the other approaches (2n vs. 4n)
- SAM matches cudaMemcpy throughput at high end (264 GB/s)
- Surprising since SAM was designed for higher orders and tuples
- For 64-bit values, item throughput is about half (but the GB/s rate is the same)
Prefix Sum Throughputs (K40)
(Figure: prefix-sum throughput [billion items/s] vs. input size on the K40 for Thrust, CUDPP, CUB, and SAM; panels: 32-bit and 64-bit integers.)
- K40 throughputs are lower for all algorithms
- SAM is faster than Thrust/CUDPP on medium and large inputs
- CUB outperforms SAM by 50% on large inputs with 32-bit ints
- SAM’s implementation is not a particularly good fit for this older GPU
Higher-order Throughputs (Titan X)
(Figure: higher-order prefix-sum throughput [billion items/s] vs. input size on the Titan X for CUB and SAM at orders 2, 5, and 8; panels: 32-bit and 64-bit integers.)
- Throughputs decrease as order increases due to more iterations
- SAM’s performance advantage increases with higher orders
- Always executes 2n global memory accesses
- Outperforms CUB by 52% on order 2, 78% on order 5, and 87% on order 8
Higher-order Throughputs (K40)
(Figure: higher-order prefix-sum throughput [billion items/s] vs. input size on the K40 for CUB and SAM at orders 2, 5, and 8; panels: 32-bit and 64-bit integers.)
- CUB outperforms SAM on orders 2 and 5, but not on order 8
- Again, SAM’s relative performance increases with higher orders
- SAM’s relative performance over CUB is higher on 64-bit values
- Baseline advantage of CUB over SAM is smaller for 64-bit values
Tuple-based Throughputs (Titan X)
(Figure: tuple-based prefix-sum throughput [billion items/s] vs. input size on the Titan X for CUB and SAM with 2-, 5-, and 8-tuples; panels: 32-bit and 64-bit integers.)
- Throughputs decrease with larger tuple sizes due to extra work
- SAM’s performance advantage increases with larger tuple sizes
- Larger tuples increase register pressure in CUB but not in SAM
- SAM is 17% slower on 2-tuples but 20% faster on 5-tuples and 34% faster on 8-tuples
Tuple-based Throughputs (K40)
(Figure: tuple-based prefix-sum throughput [billion items/s] vs. input size on the K40 for CUB and SAM with 2-, 5-, and 8-tuples; panels: 32-bit and 64-bit integers.)
- SAM outperforms CUB on 8-tuples (and larger tuples)
- Again, SAM’s relative performance increases with larger tuple sizes
- The benefit of SAM over CUB is higher with 64-bit values
- SAM already outperforms CUB on 5-tuples
Summary
- SAM directly supports generalized prefix scans
- Higher-order prefix scans
- Tuple-based prefix scans
- SAM performance on Maxwell and Kepler GPUs
- Reaches cudaMemcpy throughput on large inputs
- Outperforms all alternatives by up to 2.9x on higher orders and by up to 2.6x on tuple-based prefix sums
- SAM's relative performance increases with higher orders and larger tuple sizes
Questions?
- Contact Info: Smaleki@txstate.edu
http://cs.txstate.edu/~burtscher/research/SAM/
- Acknowledgments
- National Science Foundation
- NVIDIA Corporation
- Texas Advanced Computing Center