GPU Multisplit. Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens. (PowerPoint presentation)



SLIDE 1

GPU Multisplit

Saman Ashkiani, Andrew Davidson, Ulrich Meyer, John D. Owens

  • S. Ashkiani (UC Davis)

GPU Multisplit GTC 2016 1 / 16

SLIDE 2

Motivating Simple Example: Compaction

Compaction (i.e., binary split)

Traditional approach: Flags → Scan → Shuffle

e.g., splitter: 10

input keys: 25 12  4 76  7 17  6  1
compacted:   4  7  6  1
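The flags → scan → shuffle pipeline above can be sketched sequentially. This is an illustrative Python sketch (function and variable names are mine, not from the talk), assuming keys below the splitter are kept:

```python
def compact(keys, splitter):
    # Flags: mark the keys to keep (here: keys below the splitter)
    flags = [1 if k < splitter else 0 for k in keys]
    # Scan: an exclusive prefix sum of the flags gives each kept key its slot
    offsets, total = [], 0
    for f in flags:
        offsets.append(total)
        total += f
    # Shuffle: scatter the kept keys to their slots
    out = [0] * total
    for i, k in enumerate(keys):
        if flags[i]:
            out[offsets[i]] = k
    return out

# The slide's example (splitter 10):
# compact([25, 12, 4, 76, 7, 17, 6, 1], 10) → [4, 7, 6, 1]
```

On the GPU each of the three stages is a parallel pass; the exclusive scan is the only step that needs global communication.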

SLIDE 3

Motivating Simple Example: Compaction

Compaction (i.e., binary split)

Traditional approach: Flags → Scan → Shuffle

e.g., splitter: 10

Other option: split input into two buckets

input keys:  25 12  4 76  7 17  6  1
buckets:      1  1  0  1  0  1  0  0

output keys:  4  7  6  1 25 12 76 17
buckets:      0  0  0  0  1  1  1  1

SLIDE 4

Motivating Simple Example: Compaction

Compaction (i.e., binary split)

Traditional approach: Flags → Scan → Shuffle

e.g., splitter: 10

Other option: split the input into two buckets

Can also be solved by sorting the keys:

  • Not always possible
  • Loses “stability”, i.e., the initial order within buckets is not preserved

input keys:  25 12  4 76  7 17  6  1
buckets:      1  1  0  1  0  1  0  0

output keys:  4  7  6  1 25 12 76 17
buckets:      0  0  0  0  1  1  1  1

sorted keys:  1  4  6  7 12 17 25 76
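The stability point is easy to check sequentially (a hypothetical Python snippet, not from the talk): sorting also separates the two buckets, but it reorders keys within them, while the stable split keeps the input order.

```python
keys = [25, 12, 4, 76, 7, 17, 6, 1]

# Stable two-way split (splitter 10): order within each bucket is preserved
stable = [k for k in keys if k < 10] + [k for k in keys if k >= 10]

# Sorting also separates the buckets, but destroys the original order
sorted_keys = sorted(keys)

print(stable)       # [4, 7, 6, 1, 25, 12, 76, 17]
print(sorted_keys)  # [1, 4, 6, 7, 12, 17, 25, 76]
```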

SLIDE 5

What is Multisplit?

Multisplit (generalization of binary split)

Let’s try multiple buckets

e.g., splitters: 10 and 20

input keys:  25 17  4 76  7 12  6  1
buckets:      2  1  0  2  0  1  0  0

output keys:  4  7  6  1 17 12 25 76
buckets:      0  0  0  0  1  1  2  2

SLIDE 6

What is Multisplit?

Multisplit (generalization of binary split)

Let’s try multiple buckets

e.g., splitters: 10 and 20

Can also be solved by sorting keys

input keys:  25 17  4 76  7 12  6  1
buckets:      2  1  0  2  0  1  0  0

output keys:  4  7  6  1 17 12 25 76
buckets:      0  0  0  0  1  1  2  2

sorted keys:  1  4  6  7 12 17 25 76
buckets:      0  0  0  0  1  1  2  2

SLIDE 7

Multisplit primitive

Input:

  • an unordered set of keys (or key-value pairs)
  • m, the number of buckets
  • a user-specified function that identifies the bucket for each key

Output:

keys (or key-value pairs) separated into m buckets

(Figure: output keys arranged contiguously into buckets B0 B1 B2 B3)
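A minimal sequential reference for the primitive (a Python sketch with names of my choosing; the actual implementation is a parallel GPU algorithm):

```python
def multisplit(keys, m, bucket_of):
    """Stable multisplit: separate keys into m buckets, preserving
    the input order within each bucket."""
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[bucket_of(k)].append(k)  # appending keeps the split stable
    return [k for b in buckets for k in b]

# Slide 5's example: splitters 10 and 20 give three buckets
bucket_of = lambda k: 0 if k <= 10 else (1 if k <= 20 else 2)
# multisplit([25, 17, 4, 76, 7, 12, 6, 1], 3, bucket_of)
#   → [4, 7, 6, 1, 17, 12, 25, 76]
```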

SLIDE 8

A Fast and Flexible Data-Organization Primitive

Characterizing keys (or key-value pairs) into buckets:

  • General load balancing
  • Priority queues

Single Source Shortest Path (SSSP):

  • Serial (Dijkstra): process the vertex with the lowest weight
  • Bellman-Ford-Moore: process all vertices in parallel
  • Delta-stepping formulation of SSSP [Davidson et al., 2014]: classifies
    vertices into buckets by their weights and processes the lowest weights in
    parallel; with no multisplit primitive available it used radix sort
    instead; using our own multisplit makes it 2.1x faster

Other applications:

  • colored prefix-sum
  • reorganizing rays into 8 direction-based buckets in GPU-based ray tracers [Yang et al., 2013]
  • the first step in building GPU hash tables [Alcantara et al., 2009]
  • the shallow stages of k-d tree construction [Wu et al., 2011]

SLIDE 9

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

(Figure: initial keys 31 24 3 17 82 59 46 6 with B0 = {i ≤ 40} and B1 = {i > 40}; a left-to-right exclusive scan places the B0 keys and a right-to-left exclusive scan places the B1 keys.)
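The recursion can be sketched sequentially: each round performs one stable binary split on one bit of the bucket ID, so ⌈log₂ m⌉ rounds separate all m buckets. An illustrative Python sketch (my own naming):

```python
def scan_based_multisplit(keys, m, bucket_of):
    # ceil(log2(m)) rounds; round k splits stably on bit k of the bucket ID
    rounds = max(1, (m - 1).bit_length())
    for k in range(rounds):
        zeros = [x for x in keys if not (bucket_of(x) >> k) & 1]
        ones = [x for x in keys if (bucket_of(x) >> k) & 1]
        keys = zeros + ones  # one stable binary split (flags, scan, shuffle)
    return keys
```

Each `zeros + ones` line stands in for one flags → scan → shuffle pass over the whole input, which is why this approach costs ⌈log m⌉ full passes.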

SLIDE 10

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

2 Radix sort

sorting keys

  • Overkill (keys end up sorted within buckets)

initial order is not preserved

(Figure: the initial keys 31 24 3 17 82 59 46 6 shown in 7-bit binary representation; sorting them takes up to 7 rounds of binary splits.)

SLIDE 11

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

2 Radix sort

sorting keys

  • Overkill (keys end up sorted within buckets)

initial order is not preserved

3 Reduced bit sort

sort (bucket ID, key) pairs, using the ⌈log m⌉-bit bucket IDs as the sort keys

(Figure: the bucket IDs become the new keys and the original keys 31 24 3 17 82 59 46 6 become the values; a key-value sort on the ⌈log m⌉-bit bucket IDs yields the split.)
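Sequentially, the reduced-bit sort idea amounts to a stable sort keyed on the bucket ID alone (an illustrative Python sketch; on the GPU this is a radix sort over only ⌈log m⌉ bits instead of the full key width):

```python
def reduced_bit_sort(keys, bucket_of):
    # Python's sort is stable, so sorting on the bucket ID alone
    # reproduces a stable multisplit without comparing full keys
    return sorted(keys, key=bucket_of)
```

Because only ⌈log m⌉ bits are sorted, this is much cheaper than a full radix sort when m is small, and stability falls out of the stable sort.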

SLIDE 12

Common Approaches:

1 Recursive scan-based split

⌈log(m)⌉ rounds of binary splits

2 Radix sort

sorting keys

  • Overkill (keys end up sorted within buckets)

initial order is not preserved

3 Reduced bit sort

sort (bucket ID, key) pairs, using the ⌈log m⌉-bit bucket IDs as the sort keys

4 Randomized insertions

  • a PRAM algorithm
  • large buffers for buckets
  • random insertions
  • initial order is not preserved

(Figure: the keys 31 24 3 17 82 59 46 6 are randomly inserted into oversized buffers for B0 and B1, then compacted.)

SLIDE 13

Designing an Efficient Approach

Stable Multisplit → unique permutation + data movement

1 Deriving all permutations → global computations

histogram (h0, . . . , hm−1)
key order per bucket

If ui ∈ Bj, then

    p(i) = (h0 + h1 + · · · + h(j−1)) + |{ur : ur ∈ Bj, r < i}|

i.e., (number of keys in all previous buckets) + (number of keys before ui within its own bucket).

(Figure: output buckets B0 B1 B2 B3)
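The permutation formula can be evaluated directly from one histogram pass plus an exclusive scan. A sequential Python sketch (names are mine):

```python
def stable_permutation(keys, m, bucket_of):
    # Histogram: h[j] = number of keys in bucket j
    h = [0] * m
    for k in keys:
        h[bucket_of(k)] += 1
    # Exclusive scan of h: total keys in all previous buckets
    base, total = [], 0
    for count in h:
        base.append(total)
        total += count
    # p[i] = base[j] + (keys before u_i within its own bucket)
    p, seen = [0] * len(keys), [0] * m
    for i, k in enumerate(keys):
        j = bucket_of(k)
        p[i] = base[j] + seen[j]
        seen[j] += 1
    return p
```

Scattering with `out[p[i]] = keys[i]` then realizes the stable multisplit.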

SLIDE 14

Designing an Efficient Approach

Stable Multisplit → unique permutation + data movement

1 Deriving all permutations → global computations

histogram (h0, . . . , hm−1)
key order per bucket

2 Final data movements → global random scatters

(Figure: output buckets B0 B1 B2 B3)

SLIDE 15

Our high level ideas

1 Global computations

Localize computations

  • several large enough local subproblems → local histograms
  • a single small enough global computation → global histogram
  • several large enough local subproblems → permutations + scatters
  • Avoid shared memory and synchronization: utilize intrinsics

(Figure: pipeline of local pre-scan, global scan, and local post-scan)

SLIDE 16

Our high level ideas

1 Global computations

Localize computations

  • several large enough local subproblems → local histograms
  • a single small enough global computation → global histogram
  • several large enough local subproblems → permutations + scatters
  • Avoid shared memory and synchronization: utilize intrinsics

2 Global random scatters

Reordering keys locally in the last stage → local multisplits

more computational cost but better memory access pattern (coalesced writes)
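Putting the two ideas together, the whole pipeline (local histograms, one small global scan, then local reordering and coalesced scatter) can be emulated sequentially. This Python sketch uses small tiles where the real implementation uses warps or blocks (all names are mine):

```python
def multisplit_pipeline(keys, m, bucket_of, tile=4):
    tiles = [keys[t:t + tile] for t in range(0, len(keys), tile)]
    # Local pre-scan: one small histogram per tile (per warp/block on the GPU)
    hists = [[0] * m for _ in tiles]
    for t, tk in enumerate(tiles):
        for k in tk:
            hists[t][bucket_of(k)] += 1
    # Global scan: a single exclusive scan over m x len(tiles) counters,
    # traversed in bucket-major order
    offsets = [[0] * m for _ in tiles]
    total = 0
    for j in range(m):
        for t in range(len(tiles)):
            offsets[t][j] = total
            total += hists[t][j]
    # Local post-scan: each tile scatters its keys; one bucket's keys from a
    # tile land contiguously (coalesced writes on the GPU)
    out = [0] * len(keys)
    for t, tk in enumerate(tiles):
        local = [0] * m
        for k in tk:
            j = bucket_of(k)
            out[offsets[t][j] + local[j]] = k
            local[j] += 1
    return out
```

The global stage touches only m × (number of tiles) counters, regardless of how many keys there are, which is what keeps the global computation small.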

SLIDE 17

Granularity Tradeoffs

We experimented with a couple of different subproblem granularities:

1. warp

  • warp-synchronous model with minimal warp divergence
  • fast communication via warp-wide ballots/shuffles

2. block

  • more expensive communication via shared memory
  • cheaper global computation (scan over m × Nblocks elements)
  • more locality to extract after reordering

Property                  Direct MS   Warp-level MS        Block-level MS
Subproblem                warp        warp                 block
Reordering                –           warp-wide            block-wide
Computational load        low         medium               high
Coalesced memory access   low         medium               high

SLIDE 18

Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):

  • read keys
  • warp histogram:
    1. bit-by-bit balloting
    2. ⌈log m⌉ rounds
  • store warp histogram

(Figure: each warp writes its histogram h(0,j), h(1,j), . . . , h(m−1,j) to global memory, for warps j = 0, . . . , L−1.)

procedure warp_histogram(bucket_id[0:31])
    Input:  bucket_id[0:31]
    Output: histo[0:m-1]
    for each thread i = 0:31 in parallel within the warp do
        histo_bmp[i] = 0xFFFFFFFF;
        for (int k = 0; k < ceil(log2(m)); k++) do
            temp_buffer = ballot(bucket_id[i] & 0x01);
            if ((i >> k) & 0x01) then
                histo_bmp[i] &= temp_buffer;
            else
                histo_bmp[i] &= XOR(0xFFFFFFFF, temp_buffer);
            end if
            bucket_id[i] >>= 1;
        end for
        histo[i] = popc(histo_bmp[i]);
    end for
    return histo[0:m-1];
end procedure
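A CPU emulation of the ballot-based warp histogram above, with Python integers standing in for the warp-wide ballot bitmasks (illustrative only; real device code would use warp intrinsics such as ballot and popc):

```python
def warp_histogram(bucket_ids, m):
    # bucket_ids holds one bucket ID per lane (32 lanes on real hardware).
    # After the loop, lane i's bitmask marks exactly the lanes whose key
    # falls in bucket i, so popc gives the count of bucket i (for i < m).
    width = len(bucket_ids)
    full = (1 << width) - 1
    ids = list(bucket_ids)
    histo_bmp = [full] * width
    rounds = max(1, (m - 1).bit_length())  # ceil(log2(m))
    for k in range(rounds):
        # ballot: one bit per lane, set if that lane's current low bit is 1
        ballot = 0
        for lane in range(width):
            if ids[lane] & 1:
                ballot |= 1 << lane
        for i in range(width):
            if (i >> k) & 1:
                histo_bmp[i] &= ballot
            else:
                histo_bmp[i] &= full ^ ballot
        ids = [b >> 1 for b in ids]
    return [bin(b).count("1") for b in histo_bmp]  # popc per lane

# e.g. warp_histogram([2, 1, 0, 2, 0, 1, 0, 0], 3)[:3] → [4, 2, 2]
```

Each round costs one ballot plus a few bitwise operations per lane, so the whole histogram takes only ⌈log m⌉ warp-wide votes and no shared memory.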

SLIDE 19

Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):

  • read keys
  • warp histogram:
    1. bit-by-bit balloting
    2. ⌈log m⌉ rounds
  • store warp histogram

2 Scan (Global):

exclusive scan on histograms (m × Nwarps elements)

(Figure: the warp-major layout h(0,0) h(1,0) . . . h(m−1,0), h(0,1) . . . h(m−1,1), . . . is transposed to the bucket-major layout h(0,0) h(0,1) . . . h(0,L−1), h(1,0) . . . h(1,L−1), . . . before the exclusive scan.)

SLIDE 20

Implementation details & Optimizations

Warp-level MS

1 Pre-scan (Local):

read keys warp histogram

1

bit-by-bit balloting

2

⌈log m⌉ rounds store warp histogram

2 Scan (Global):

exclusive scan on histograms (m × Nwarps elements)

3 Post-scan (Local):

  • read keys (or key-value pairs)
  • recompute warp histograms
  • compute local offsets
  • warp-level reordering
  • compute final positions
  • final data movement

(Figure: the pre-scan, scan, and post-scan stages over the warp histograms, as on the previous slides.)

SLIDE 21

Performance Evaluation

In this presentation:

1. NVIDIA Tesla K40c GPU
2. Radix sort from CUB (including the one used in the reduced-bit sort method)
3. Device-wide exclusive scan from CUB
4. Uniform distribution of keys in buckets

More results in the paper:

1. Detailed timing of the different stages of our algorithm
2. Other GPU architectures: Maxwell
3. Different distributions of keys
4. Using our multisplit algorithms in an SSSP method

SLIDE 22

Average running time vs. number of buckets

(Plot: average running time (msec) vs. number of buckets m, for Block-level MS, Direct MS, Reduced-bit sort, and Warp-level MS; (a) key-only, (b) key-value.)

Memory access quality: Block-level MS > Warp-level MS > Direct MS
Computational load: Block-level MS > Warp-level MS > Direct MS

SLIDE 23

Performance vs Radix-Sort

(Plot: speedup over radix sort vs. number of buckets m (2 to 32), for Block-level MS, Direct MS, Reduced-bit sort, Warp-level MS, and Binary Split; (c) key-only, (d) key-value.)

key-only: 3.0x to 6.7x speedup
key-value: 4.4x to 8.0x speedup

SLIDE 24

More buckets

For more buckets than the warp width (m > 32):

  • warp histograms: each thread is put in charge of multiple buckets
  • shared memory capacity becomes the other bottleneck

(Plot: average running time (msec) vs. number of buckets m, from 32 up to 65536, for Block-level MS and Reduced-bit sort, key-only and key-value, with radix sort shown for reference.)

SLIDE 25

Conclusions

  • Introduced a new, efficient data-organization primitive
  • High performance, especially for a low or modest number of buckets

Full paper:

Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 2016)

Code will soon be available in CUDPP:

https://github.com/cudpp/cudpp

SLIDE 26

Thank You

SLIDE 27

References

Alcantara, D. A., Sharf, A., Abbasinejad, F., Sengupta, S., Mitzenmacher, M., Owens, J. D., and Amenta, N. (2009). Real-time parallel hashing on the GPU. ACM Transactions on Graphics, 28(5):154:1–154:9.

Davidson, A., Baxter, S., Garland, M., and Owens, J. D. (2014). Work-efficient parallel GPU methods for single source shortest paths. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2014, pages 349–359.

Wu, Z., Zhao, F., and Liu, X. (2011). SAH KD-tree construction on GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, HPG ’11, pages 71–78.

Yang, X., Xu, D., and Zhao, L. (2013). Efficient data management for incoherent ray tracing. Applied Soft Computing, 13(1):1–8.
