GPU Primitives - Case Study: Hair Rendering
Ulf Assarsson, Markus Billeter, Ola Olsson, Erik Sintorn
Chalmers University of Technology, Gothenburg, Sweden
Beyond Programmable Shading


  1. GPU Primitives - Case Study: Hair Rendering. Ulf Assarsson, Markus Billeter, Ola Olsson, Erik Sintorn, Chalmers University of Technology, Gothenburg, Sweden. Beyond Programmable Shading

  2. Parallelism • Programming massively parallel systems

  3. Parallelism • Programming massively parallel systems • Parallelizing algorithms

  4. Parallelism • Programming massively parallel systems • Parallelizing algorithms • Our research on 3 key components: 1. Stream compaction, 2. Prefix sum, 3. Sorting

  5. Parallelism • Programming massively parallel systems • Parallelizing algorithms • Our research on 3 key components: 1. Stream compaction – 3x faster than any other implementation we know of, 2. Prefix sum – 30% faster than CUDPP 1.1, 3. Sorting – faster than the newest CUDPP 1.1 (July 2009)


  7. Parallelism • Programming massively parallel systems • Parallelizing algorithms • Our research on 3 key components: 1. Stream compaction, 2. Prefix sum, 3. Sorting. Prefix sum example — input: 1 3 9 4 2 5 7 1 8 4 5 9 3; output: 0 1 4 13 17 19 24 31 32 40 44 49 58. Each output element is the sum of all preceding input elements.
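The scan on the slide can be sketched as a short sequential routine (a CPU illustration of the definition, not the parallel GPU implementation):

```cpp
#include <cstddef>
#include <vector>

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
std::vector<int> exclusive_scan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of all elements strictly before i
        running += in[i];
    }
    return out;
}
// exclusive_scan({1,3,9,4,2,5,7,1,8,4,5,9,3})
//   -> {0,1,4,13,17,19,24,31,32,40,44,49,58}
```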

  8. Parallelism • Programming massively parallel systems • Parallelizing algorithms • Our research on 3 key components: 1. Stream compaction, 2. Prefix sum (input: 1 3 9 4 2 5 7 1 8 4 5 9 3 → output: 0 1 4 13 17 19 24 31 32 40 44 49 58), 3. Sorting (input: 19 5 100 1 63 79 → output: 1 5 19 63 79 100)

  9. 1. Stream Compaction • Used for: – Load balancing & load distribution – Alternative to a global task queue – Parallel tree traversal – Collision detection¹. Each processor handles one node and outputs nodes for continued traversal; stream compaction removes the nil elements. ¹ Stream reduction operations for GPGPU applications, Horn, GPU Gems 2, 2005.

  10. 1. Stream Compaction • Used for: – Load balancing & load distribution – Alternative to a global task queue – Parallel tree traversal – Collision detection (Horn, GPU Gems 2, 2005) – Constructing spatial hierarchies (Lauterbach, Garland, Sengupta, Luebke, Manocha, Fast BVH Construction on GPUs, EGSR 2009) – Radix sort (Satish, Harris, Garland, Designing Efficient Sorting Algorithms for Manycore GPUs, IEEE Parallel & Distributed Processing Symposium, May 2009) – Ray tracing (Aila and Laine, Understanding the Efficiency of Ray Traversal on GPUs, HPG 2009; Roger, Assarsson, Holzschuch, Whitted Ray-Tracing for Dynamic Scenes using a Ray-Space Hierarchy on the GPU, EGSR 2007)

  11. 1. Stream Compaction – shadows • Alias-Free Hard Shadows – Resolution Matched Shadow Maps, by Aaron Lefohn, Shubhabrata Sengupta, John Owens, Siggraph 2008 (prefix sum, stream compaction, sorting) – Sample Based Visibility for Soft Shadows using Alias-free Shadow Maps, by Erik Sintorn, Elmar Eisemann, Ulf Assarsson, EGSR 2008 (prefix sum)

  12. 2. Prefix Sum — input: 1 3 9 4 2 5 7 1 8 4 5 9 3; output: 0 1 4 13 17 19 24 31 32 40 44 49 58. Each output element is the sum of all preceding input elements. • Good for: – Solving recurrence equations – Sparse matrix computations – Tri-diagonal linear systems – Stream compaction

  13. 3. Sorting • Radix sort: – Nadathur Satish, Mark Harris, Michael Garland, Designing Efficient Sorting Algorithms for Manycore GPUs, IEEE Parallel & Distributed Processing Symposium, May 2009. – Markus Billeter, Ola Olsson, Ulf Assarsson, Efficient Stream Compaction on Wide SIMD Many-Core Architectures, HPG 2009.
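The cited GPU radix sorts build each digit pass from exactly the primitives this deck discusses: a histogram of digit values followed by a prefix sum that turns counts into write offsets. A serial sketch of one such least-significant-digit radix sort on byte-sized digits (an illustration of the idea, not the papers' actual GPU code):

```cpp
#include <cstdint>
#include <vector>

// LSD radix sort on 32-bit keys, 8 bits per pass.
// Each pass = histogram + exclusive prefix sum + scatter.
void radix_sort(std::vector<uint32_t>& v) {
    std::vector<uint32_t> tmp(v.size());
    for (int shift = 0; shift < 32; shift += 8) {
        std::size_t count[257] = {0};
        for (uint32_t x : v)                       // histogram of this digit
            ++count[((x >> shift) & 0xFF) + 1];
        for (int i = 0; i < 256; ++i)              // prefix sum -> offsets
            count[i + 1] += count[i];
        for (uint32_t x : v)                       // stable scatter
            tmp[count[(x >> shift) & 0xFF]++] = x;
        v.swap(tmp);
    }
}
```

Sorting the slide's example {19, 5, 100, 1, 63, 79} this way yields {1, 5, 19, 63, 79, 100}.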

  14. Stream Compaction • Parallel algorithms often target an unlimited number of processors and have complexity O(n log n) • But the actual number of processors is far from unlimited

  15. Stream Compaction • More efficient option (~Blelloch 1990): split the input among the processors and work sequentially on each part. E.g.: each stream processor (StreamProc 0, StreamProc 1, StreamProc 2, …) sequentially compacts one part of the stream, removing the unwanted elements inside its part; the compacted parts are then concatenated into the output.
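A serial model of this split-compact-concatenate scheme (assuming, purely for the demo, that "valid" means nonzero):

```cpp
#include <cstddef>
#include <vector>

// Split the input into P parts; each "processor" compacts its part
// sequentially; then the compacted parts are concatenated in order.
std::vector<int> compact_by_parts(const std::vector<int>& in, int P) {
    std::vector<std::vector<int>> parts(P);
    std::size_t n = in.size();
    for (int p = 0; p < P; ++p) {
        std::size_t lo = n * p / P, hi = n * (p + 1) / P;
        for (std::size_t i = lo; i < hi; ++i)   // sequential compaction of one part
            if (in[i] != 0) parts[p].push_back(in[i]);
    }
    std::vector<int> out;                       // concatenate the parts
    for (const auto& part : parts)
        out.insert(out.end(), part.begin(), part.end());
    return out;
}
```

Because each piece is processed independently, the loop over `p` is trivially parallelizable; order is preserved by concatenating the parts in index order.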

  16. Stream Compaction • BUT: naïvely treating each SIMD lane as one processor gives a horrible memory access pattern (StreamProc 0, StreamProc 1, StreamProc 2, … each striding through its own part) • Many versions of algorithms improve the access pattern • We suggest treating the hardware as a limited number of processors with a specific SIMD width – GTX280: 30 processors, logical SIMD width = 32 lanes (CUDA 2.1/2.2 API)

  17. Stream Compaction • Our basic idea: split the input among the processors and work sequentially on each part. Each (multi-)processor (Proc 0, Proc 1, Proc 2, …) sequentially compacts one part of the stream: start by computing the output offsets for each processor, remove the unwanted elements inside each part, then concatenate the parts.

  18. Stream Compaction • Computing the processors' output offsets: – Each processor counts its number of valid elements (i.e., its output length) – Compute the prefix sum array over all counts – This array gives the output position for each processor. Counts = { #valids for p0, #valids for p1, #valids for p2, …, #valids for p#p-1 }; prefix sum = { 0, p0, p0+p1, p0+p1+p2, … }
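The offset computation (counting phase plus exclusive scan over the counts) can be modeled serially like this — again treating nonzero as "valid", a demo assumption:

```cpp
#include <cstddef>
#include <vector>

// Phase 1 of the two-pass compaction: each of P "processors" counts the
// valid elements in its part; an exclusive prefix sum over the counts then
// yields each processor's write offset into the compacted output.
std::vector<std::size_t> output_offsets(const std::vector<int>& in, int P) {
    std::size_t n = in.size();
    std::vector<std::size_t> counts(P, 0);
    for (int p = 0; p < P; ++p)
        for (std::size_t i = n * p / P; i < n * (p + 1) / P; ++i)
            if (in[i] != 0) ++counts[p];
    std::vector<std::size_t> offsets(P);
    std::size_t running = 0;            // exclusive scan: { 0, c0, c0+c1, ... }
    for (int p = 0; p < P; ++p) {
        offsets[p] = running;
        running += counts[p];
    }
    return offsets;
}
```

With these offsets known, every processor can write its compacted part to a disjoint output range with no synchronization in the second pass.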


  20. Stream Compaction • Each processor counts its number of valid elements (w = SIMD width; Proc 0, Proc 1, … each handle w elements at a time) • Each processor loops through its input list: – reading w elements each iteration – perfectly coalesced (i.e., each thread reads 1 element) – each lane (thread / stream processor) increases its counter if its element is valid – finally, the w counters are summed

  21. Stream Compaction • Our basic idea: split the input among the processors and work sequentially on each part. Each processor (Proc 0, Proc 1, Proc 2, …) sequentially compacts its part of the stream, removing the unwanted elements inside each part (compacting each processor's list), then the parts are concatenated.

  22. Stream Compaction • Compacting the input list for each SIMD processor (w = SIMD width; Proc 0, Proc 1, …) • Each processor loops through its input list: – reading w elements each iteration – perfectly coalesced (i.e., each thread reads 1 element) – a standard parallel compaction is used for the w elements (e.g., via POPC, SSE movemask, or any/all votes) – the result is written to the output list and the output position advanced by the number of valid elements
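The per-w-elements parallel compaction via POPC can be modeled with a bitmask: each valid lane's write position is the population count of the validity bits below its lane. On the GPU the mask would come from a warp vote (ballot); here it is built in a plain loop, and "valid" again means nonzero — both demo assumptions:

```cpp
#include <cstdint>

// Compact w (<= 32) elements using rank-by-popcount.
// Returns the number of valid elements written to out[0..].
int compact_w(const int* in, int* out, int w) {
    uint32_t mask = 0;
    for (int lane = 0; lane < w; ++lane)        // stand-in for a warp ballot
        if (in[lane] != 0) mask |= 1u << lane;
    for (int lane = 0; lane < w; ++lane)
        if (mask & (1u << lane)) {
            // Rank = popcount of valid lanes strictly below this lane.
            int rank = __builtin_popcount(mask & ((1u << lane) - 1));
            out[rank] = in[lane];
        }
    return __builtin_popcount(mask);            // advance output by this much
}
```

`__builtin_popcount` is the GCC/Clang intrinsic; C++20 code could use `std::popcount` from `<bit>` instead, and CUDA provides `__popc`.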

  23. Stream Compaction • Stream compaction with: – Optimal coalesced reads – Good write pattern. [Performance chart: compaction time vs. proportion of valid elements (%), comparing several implementations.]

  24. Stream Compaction • In reality we use: – GTX280: P = 480 to increase occupancy and hide memory latency – 30x4 blocks à 4 warps à 32 threads – hardware specific • Highest memory bandwidth if each lane fetches 32-bit data in 64-bit units (i.e., 2 floats instead of 1) – hardware specific

                        32-bit fetches   64-bit fetches   128-bit fetches
      Bandwidth (GB/s)       77.8            102.5             73.4

  25. Stream Compaction • Our trick: avoid algorithms designed for an unlimited number of processors • The sequential algorithm is very simple: split the input into many independent pieces, apply the sequential algorithm to each piece, and combine the results later – divide the work among independent processors – use a SIMD-sequential algorithm on each processor, i.e., fetch a block of w elements and use a parallel algorithm when working within the w elements – work in fast shared memory
