Efficient Stream Reduction on the GPU

David Roger (Grenoble University), Ulf Assarsson (Chalmers University of Technology), Nicolas Holzschuch (Cornell University)

Stream Reduction
Removing unwanted elements from a stream.
[Figure: input stream and the resulting reduced stream]

Applications
● Tree traversal:
  – Ray tracing
  – Collision detection
● Often the bottleneck

Sequential Algorithm
● Algorithm:
    i = 0
    for j = 0 to n-1 do
        if x[j] is valid then
            x[i] = x[j]
            i = i + 1
● Easy: one single loop
● Linear complexity
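The loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' code; the function name `compact_in_place` and the `is_valid` predicate are assumptions introduced for the example.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def compact_in_place(x: List[T], is_valid: Callable[[T], bool]) -> int:
    """Move the valid elements to the front of x, keeping their order.

    Returns the number of valid elements; x[:count] is the reduced stream.
    """
    i = 0
    for j in range(len(x)):
        if is_valid(x[j]):
            x[i] = x[j]  # i <= j always holds, so nothing unread is overwritten
            i += 1
    return i


# Hypothetical input: None marks an unwanted element.
stream = [3, None, 7, None, None, 1]
count = compact_in_place(stream, lambda e: e is not None)
reduced = stream[:count]
```

The write index `i` depends on every earlier iteration, which is why this loop does not map directly onto a parallel machine.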

On GPU
● Parallelism
● We assume no scatter
  – We will speak about scatter later

Talk Structure
● Previous Works
● Algorithm Overview
● Details and Implementation
● Results
● Future Works & Conclusion

Previous works: Horn's Method
● Prefix sum scan: computes the displacements
● Dichotomic search: performs the displacements
Prefix sum: 0 0 1 1 1 1 2 3 3 4 4 5 5 5 6 6
[Figure: input stream, its prefix sum, and the reduced stream]
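To make the displacement idea concrete, here is a small Python sketch. It assumes the prefix sum on the slide is an inclusive count of the removed elements, so that a valid element at input position j lands at output position j - sum[j]. The validity pattern below is inferred from the slide's numbers, and the names are illustrative.

```python
from itertools import accumulate


def invalid_prefix_sum(valid):
    """sums[j] = number of invalid elements among positions 0..j (inclusive)."""
    return list(accumulate(0 if v else 1 for v in valid))


# Validity flags inferred from the prefix sum shown on the slide (assumption).
valid = [True, True, False, True, True, True, False, False,
         True, False, True, False, True, True, False, True]

sums = invalid_prefix_sum(valid)

# Output position of every valid element: its index minus its displacement.
positions = [j - sums[j] for j in range(len(valid)) if valid[j]]
```

With scatter, each valid element could simply be written to `positions`. Without scatter, each output position has to find its source element instead, which is the role of the dichotomic search.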

Previous works
● Prefix sum scan
  – Hillis and Steele, Horn: O(n log n)
  – Blelloch, Sengupta et al., Harris et al.: O(n)
  – Sengupta et al. Hybrid: O(n)
● Dichotomic search: O(n log n)
● Overall complexity: O(n log n)
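As an illustration of where the O(n log n) figure for the Hillis and Steele scan comes from, the sketch below simulates it on the CPU. Each iteration of the `while` loop stands for one parallel pass over all n elements, and there are about log2(n) passes. This is a simplified model written for this transcript, not GPU code.

```python
def hillis_steele_scan(values):
    """Inclusive prefix sum computed in ceil(log2(n)) gather-only passes.

    Returns (sums, passes). Each pass touches all n elements, hence
    O(n log n) total work.
    """
    src = list(values)
    n = len(src)
    offset = 1
    passes = 0
    while offset < n:
        # Every output element reads only from the previous pass (no scatter).
        src = [src[k] + src[k - offset] if k >= offset else src[k]
               for k in range(n)]
        offset *= 2
        passes += 1
    return src, passes


# 1 marks an element to remove; same pattern as the slide's example (assumption).
flags = [0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
sums, passes = hillis_steele_scan(flags)
```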

Other approaches
● Geometry shader + stream output
  – NV_transform_feedback
  – Input stream: vertices in a VBO
  – Geometry shader discards NULL elements
  – Output stream: vertices in a VBO
● No fragments, no fragment shader
● Bitonic sort
  – Slow
● Sum scan + Scatter with vertex engine

Our approach
Input stream, split in blocks → Reduction of the blocks → Concatenation → Reduced stream

Reduction of the blocks
● In parallel
● Using previous works
  – Prefix sum scan
  – Dichotomic search
● Complexity
  – s: size of a block
  – One block: O(s log s)
  – n/s blocks: O(n log s)
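The block-wise step can be sketched as follows. For brevity the per-block compaction uses a plain Python filter as a stand-in for the scan plus dichotomic search, so only the data layout is illustrated, not the GPU passes. The function name, the `None` padding and the example stream are assumptions.

```python
def reduce_blocks(stream, is_valid, s):
    """Compact each block of size s independently.

    Returns (reduced_blocks, counts): every block keeps its original length,
    with valid elements packed at the front and None padding after them.
    """
    reduced_blocks, counts = [], []
    for k in range(0, len(stream), s):  # blocks are independent: parallel on a GPU
        block = stream[k:k + s]
        kept = [e for e in block if is_valid(e)]  # stand-in for scan + search
        counts.append(len(kept))
        reduced_blocks.append(kept + [None] * (len(block) - len(kept)))
    return reduced_blocks, counts


stream = [1, None, 2, None, None, None, 3, 4,
          None, None, None, None, 5, None, None, None]
blocks, counts = reduce_blocks(stream, lambda e: e is not None, 4)
```

Since each of the n/s blocks costs O(s log s) with scan plus search, the total is (n/s) · O(s log s) = O(n log s).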

Concatenation of the blocks
● Prefix sum scan
  – Computes displacements of the blocks in parallel
● Line drawing
  – Segments extremities moved by scattering (vertex engine)
  – Other elements linearly interpolated (rasterization)
● Complexity: O(n)
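A CPU sketch of the concatenation, under the assumption that each reduced block is stored with its valid elements first (padded with `None`) and that the number of valid elements per block is known. The scan over the counts gives each block's destination, and the inner loop plays the role of the rasterizer. All names are illustrative.

```python
from itertools import accumulate


def concatenate_blocks(reduced_blocks, counts, s):
    """Pack the valid prefix of every block of size s into one contiguous list."""
    # Exclusive prefix sum of the counts: destination start of each block.
    starts = [0] + list(accumulate(counts))[:-1]
    out = [None] * sum(counts)
    for b, (start, count) in enumerate(zip(starts, counts)):
        if count == 0:
            continue  # empty block: no line segment is drawn
        # "Vertex shader": only the first extremity needs an explicit position.
        src_first = b * s
        dst_first = start
        # "Rasterization": source and destination segments have the same
        # length, so linear interpolation reduces to a constant offset.
        for dst in range(dst_first, dst_first + count):
            src = src_first + (dst - dst_first)
            out[dst] = reduced_blocks[src // s][src % s]
    return out


blocks = [[1, 2, None, None], [3, 4, None, None],
          [None, None, None, None], [5, None, None, None]]
result = concatenate_blocks(blocks, [2, 2, 0, 1], 4)
```

Every output element is written exactly once, which matches the O(n) bound on the slide.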

Concatenation of the blocks
[Figure: the reduced blocks become the reduced stream in two steps: the segment extremities are moved with the vertex shader, then rasterization fills in the elements between them]

Algorithmic complexity
● All previous works: O(n log n)
● Our algorithm: O(n log s)
  – s is the size of the blocks
  – s is a constant!

Overview
Input stream, split in blocks → Prefix sum scan + Dichotomic search → Prefix sum scan + Line drawing → Reduced stream

Why is it efficient?
The key is block concatenation:
  – Dichotomic search is avoided
  – Vertex engine: scatter ... but less efficient
    ● Use it for a few elements (segment extremities)
    ● Interpolate the other elements

Dichotomic search details
● Gather: at output position i, search for j in the input such that i = j - sum[j]
● Search bounds: i + sum[i] ≤ j ≤ i + sum[15]
● Example (block of 16 elements):
  – Input positions: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  – Prefix sum:      0 0 1 1 1 1 2 3 3 4 4 5 5 5 6 6
  – For i = 5: 6 ≤ j ≤ 11
  – Search result: j = 8, since sum[8] = 3 and 8 - 3 = 5

Dichotomic search pseudo-code
Search j0 such that i = j0 - sum[j0]:
    lowBound = i + sum[i]
    upBound = i + sum[n-1]
    if (upBound > n-1) discard
    j = (lowBound + upBound) / 2
    found = j - sum[j] - i
    while (found ≠ 0) {
        if (found < 0) lowBound = j
        else upBound = j
        j = (lowBound + upBound) / 2
        found = j - sum[j] - i
    }

Dichotomic search improvement
Search j0 such that i = j0 - sum[j0]:
    lowBound = i + sum[i]
    upBound = i + sum[n-1]
    if (upBound > n-1) discard
    j = (lowBound + upBound) / 2
    found = j - sum[j] - i
    while (found ≠ 0) {
        if (found < 0) lowBound = j - found
        else upBound = j - found
        j = (lowBound + upBound) / 2
        found = j - sum[j] - i
    }
Because j - sum[j] is contracting: it grows by at most 1 when j grows by 1.
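A runnable Python sketch of the improved search follows. It assumes `sums` is the inclusive count of removed elements, as in the slide's example. Under that assumption the removed elements directly after j0 also satisfy j - sum[j] = i, so this variant keeps narrowing the bounds until they meet instead of stopping at the first found = 0. That termination rule and the function name are choices made for this sketch, not taken from the slides.

```python
def gather_source(i, sums):
    """Input position j0 whose element lands at output position i.

    Returns None when i lies beyond the reduced block ("discard").
    Relies on g(j) = j - sums[j] being non-decreasing with steps of at most 1.
    """
    n = len(sums)
    up = i + sums[n - 1]
    if up > n - 1:
        return None  # fewer than i + 1 valid elements in the block
    low = i + sums[i]
    while low < up:
        j = (low + up) // 2
        found = j - sums[j] - i
        if found < 0:
            low = j - found  # j0 is at least -found positions to the right
        else:
            up = j - found   # j0 is at least found positions to the left
    return low


# Prefix sum from the slide's example block.
sums = [0, 0, 1, 1, 1, 1, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6]
sources = [gather_source(i, sums) for i in range(len(sums))]
```

For i = 5 the bounds start at 6 ≤ j ≤ 11 and the first midpoint is already j = 8, matching the example on the details slide.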

Lines wrapping
● We use 2D textures: wrap line segments
  – Split all segments in two
● Or
  – Use geometry engine to split only when necessary
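The wrapping problem can be illustrated with a short sketch. It assumes a 1D stream stored row by row in a 2D texture of width `width`, and the helper name and return format are hypothetical. A segment that crosses the end of a row has to be drawn as separate pieces.

```python
def wrap_segment(start, length, width):
    """Split the 1D segment [start, start + length) into per-row pieces.

    Returns a list of (row, first_col, last_col) tuples, one per texture row
    touched by the segment.
    """
    pieces = []
    pos, end = start, start + length
    while pos < end:
        row, col = divmod(pos, width)
        run = min(end - pos, width - col)  # stop at the end of the current row
        pieces.append((row, col, col + run - 1))
        pos += run
    return pieces


crossing = wrap_segment(6, 4, 8)  # crosses the end of row 0
inside = wrap_segment(2, 3, 8)    # fits in row 0
```

When a segment is no longer than a row it crosses at most one row boundary, so two pieces always suffice. That is consistent with either splitting every segment in two or splitting only the segments that actually wrap.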

Behavior: linear complexity

Behavior: block size

Behavior: fill ratio

Comparison with previous works

Scatter? (future work)
● Scatter available in CUDA
● Possible improvements

Scatter? (future work)
Reduction of the blocks:
● without scatter: sum scan + search, O(n log s)
● with scatter: sequential algo (loop over the block), O(n)
Concatenation:
● Simpler
● No wrapping

Scatter? (future work)
● Overall complexity: O(n)
● ... but other techniques in O(n)
  – Sum scan (Harris et al. or Sengupta et al.) + scatter
● Future work: tests with CUDA
  – Expected speed up ≥ 2.5

Conclusion
● Orthogonal to previous works:
  – We don't compete with them, we use them!
● Better asymptotic complexity
  – O(n) vs O(n log n)
● Significant speed up
● Does not require scatter

Thank you
