GPGPU 03: NVIDIA case study
GeForce 7800 (2006)
GeForce 7800
- Impossible to maximize throughput with such a rigid architecture: you can’t
keep vertex and fragment shading units busy all the time
- As a result, many bottlenecks in the pipeline (e.g. a single triangle covering the entire screen vs. extremely detailed geometry far away)
- So NVIDIA decided to unify their processor architectures
GeForce 8800 (2008 - Tesla)
GeForce 8800
- Processing element
○ A single processing unit (it can carry out an operation on a piece of data)
○ 128 of them in the full spec
- Shader core
○ Contains 16 processing elements
○ The smallest unit capable of executing code (e.g. “shade fragments with shader <X>”)
- CUDA is introduced
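To put the CUDA introduction in context, here is a minimal sketch of the programming model it enabled: a general-purpose kernel launched over many threads, completely outside the graphics pipeline. The kernel and variable names (scale, d_data) are illustrative, not from the slides.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread scales one element of the array.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));
        cudaMemset(d_data, 0, n * sizeof(float));

        // Launch enough 256-thread blocks to cover all n elements.
        scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }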
Fermi
NVIDIA GF100 - Fermi (2009)
- NVIDIA carried on with the unified architecture in this 2009 series too
- That is, unified stream processors switch between vertex and fragment
shading tasks, depending on workload needs
○ This requires more complex control logic than before
- Fermi is DirectX11 compatible
○ New challenge on the graphics side is the support for tessellation shaders (potentially extremely large data amplification)
- The focus with Fermi was on accelerating graphics applications
Fermi architecture
- Main elements:
○ GPC (Graphics Processing Cluster): a scalable unit of graphics processing units
○ SM (Streaming Multiprocessor): a unit of 32 stream processors (the “GPU core” is now called a CUDA core) that work on the same task
○ CUDA core: the actual unit of ALU operations with a single, unified instruction set
- In particular, GF100 (= GeForce GTX 480) houses
○ 4 GPCs
○ Each GPC contains 4 SMs (16 SMs in total in GF100)
○ 6 memory controllers
- Uses PCIe for accelerated CPU-GPU transfer speeds
CUDA core
- FP: IEEE 754-2008 compliant, with FMA
- INT: boolean, shift, move, ...
FMA
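As an illustration of the FMA capability above: a fused multiply-add computes a*b + c with a single rounding step. A minimal device-code sketch (the kernel name fma_demo is a placeholder); the standard fmaf intrinsic makes the fusion explicit:

    __global__ void fma_demo(const float *a, const float *b, const float *c,
                             float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Single-rounded fused multiply-add: a[i]*b[i] + c[i].
            // The compiler usually emits FFMA for a[i]*b[i] + c[i] anyway;
            // fmaf makes the fusion explicit.
            out[i] = fmaf(a[i], b[i], c[i]);
        }
    }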
SM
- 32 CUDA cores
- 4 special function units:
○ For evaluating transcendental functions (trig, inv trig, logs, length, rcp, etc.) - see the intrinsics sketch after this list
○ They also do interpolation in the graphics pipeline
○ 1 instruction per clock => finishing an entire warp takes 8 clocks
- Essentially, it is partitioned into 4 independent pipelines: 2 ALU, 1 memory, 1
SFU
- The dual warp scheduler selects two warps for execution each cycle (2x32
threads)
- The current instruction of each selected warp then goes to either
○ 16 CUDA cores, or
○ 16 L/S units, or
○ 4 SFUs
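The intrinsics sketch referenced above: the SFU pipeline is what the fast, reduced-precision device intrinsics (__sinf, __expf, etc.) map to, as opposed to the regular CUDA cores. The kernel and variable names below are illustrative:

    __global__ void sfu_demo(const float *x, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = x[i];
            // __sinf and __expf are the fast, hardware-assisted intrinsics
            // that are serviced by the SFUs rather than the CUDA cores.
            out[i] = __sinf(v) * __expf(-v);
        }
    }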
SM
- The ‘ALU’ pipelines are called execution blocks, consisting of 16 cores each
- It takes 2 cycles to execute a single instruction for all 32 threads of a warp on the two execution blocks and the L/S units
- It takes 8 cycles to execute a warp’s 32 SFU operations
SM scheduling
SM
- Each SM has 4 texture units (TU)
○ A TU can fetch 4 texture samples per cycle
○ And deliver them filtered
- Dedicated cache for the TUs
- Each SM has an L1 cache
- More precisely, 64 KB of memory that can be configured as
○ 16 KB L1 + 48 KB shared
○ 48 KB L1 + 16 KB shared
- In rendering: 16 KB L1 + 48 KB shared
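On the CUDA side this Fermi-era L1/shared split is exposed through the cache-config API; a minimal sketch (myKernel is a placeholder name):

    #include <cuda_runtime.h>

    __global__ void myKernel() { /* ... */ }

    int main()
    {
        // Ask for the 48 KB shared / 16 KB L1 split for the whole device...
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // ...or just for one kernel: 48 KB L1 / 16 KB shared in this case.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }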
L2 cache
- 768 KB, shared by all SMs
- Read/write, coherent memory
PolyMorph Engine
- Controls the primitive processing step of the graphics pipeline
- It is itself a 5 stage pipeline
- The resulting primitives are passed to the rasterizer stage
Raster Engine
- 4 Raster Engines in parallel (1 per GPC)
- 3 stage pipeline
- Edge Setup: set up equations for the sides of the triangles
- 8 pixels/cycle per Raster Engine
- Hierarchical Z-culling (tile-based)
- The graphics API calls originate from your application
(using gl* commands in OpenGL, etc.)
- These go into a pushbuffer (called command buffer)
- And then comes the fun part:
https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/
○ If you are serious about graphics, you SHOULD check this site
Communication within the SM
Communication between all SMs across the GPCs
Rendering details
https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline
Kepler
Kepler (2012)
- Successor to Fermi (GeForce 680)
- Enhancing performance (especially double precision - HPC to the forefront!)
- Also focus on efficiency (new mantra is “performance per watt”)
Kepler setup
- 4 GPC
- 8 SMX (just how they called SMs in Kepler)
- 4 memory controllers
SMX
- 192 single-precision CUDA cores, with FMA
- 32 SFU
- Core clock reduced to the GPU clock (it previously ran at twice the GPU clock)
- SMX schedules 4 warps each clock
○ Up to 2 independent instructions of the same warp
- Simplified scheduler
○ Some complex control logic moved into the compiler (reordering math instructions, etc.)
- A single thread can use up to 255 registers
○ And register values can be shuffled between the threads of a warp! Essentially, that’s free communication within a warp, bypassing shared memory and L1 (see the sketch below)
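A sketch of that register-level exchange, using today's warp shuffle intrinsics (on Kepler itself the pre-_sync __shfl variants were used); warpSum is an illustrative helper that sums a value across the 32 threads of a warp without touching shared memory:

    // Sum a value across the 32 threads of a warp: each step reads a
    // register from another lane instead of going through memory.
    __device__ float warpSum(float v)
    {
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        return v;  // lane 0 ends up holding the full sum
    }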
Memory
- Configurable 64 KB memory per SMX, like in Fermi
- Plus a 48 KB read-only data cache
- 1536 KB L2 cache (shared by all SMXs)
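The 48 KB read-only data cache is what loads through const __restrict__ pointers, or through the __ldg intrinsic introduced with Kepler GK110, go through; a minimal sketch (gather is an illustrative kernel name):

    __global__ void gather(const float * __restrict__ src,
                           const int   * __restrict__ idx,
                           float *dst, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // __ldg routes the read through the read-only data cache.
            dst[i] = __ldg(&src[idx[i]]);
        }
    }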
Maxwell
Maxwell (2014)
- Focus on efficiency
- Served GeForce 750, 8xx, and GeForce 9xx
- Setup:
○ 4 GPC
○ 4 Maxwell SM (SMM) per GPC - 16 in total
○ 4 memory controllers
- Some rationalizations:
○ Memory bus: from 192 bits to 128
○ CUDA cores per SMM: 128 (Kepler had 192 per SMX!)
○ etc.
SMM
- 128 CUDA cores
○ In units of 32
○ Each unit of 32 has its own dedicated scheduler and command buffer
- 1 PolyMorph Engine
- 8 texture units
- Simplified design to ease the control logic
- Memory layout revamp:
○ L1 cache shared with the texture cache
○ 96 KB dedicated shared memory
○ L2 cache: 2048 KB
SMM
- 4 warp schedulers
- Each warp scheduler
○ Schedules 2 instructions per cycle
○ Works with its own dedicated 32 CUDA cores
○ 8 L/S units
○ 8 SFU units
Pascal
Pascal
- Full Pascal spec (the GP100 chip; the GeForce 1080 generation)
○ 6 GPC
○ 10 Pascal SM per GPC - 60 in total
○ 30 TPC in total
○ 8 memory controllers, 512 bits each
- 1 GPC:
○ 10 SM
- 1 SM:
○ 64 CUDA cores
○ 4 texture units
- That totals 3840 single-precision CUDA cores and 240 (!) texture units
Pascal (2016)
- Increased computation capacity (deep learning craze)
○ 5.3 TFLOPS FP64 (~double), 10.6 TFLOPS FP32 (~float), 21.2 TFLOPS FP16 (~half)
- NVLink for high bandwidth connectivity
- Modified memory architectures
Floating point precision
Floating point bit depth | Largest value      | Smallest value     | Decimal digits of precision
32-bit float             | 3.4028237 × 10^38  | 1.175494 × 10^-38  | 7.22
16-bit float             | 6.55 × 10^4        | 6.10 × 10^-5       | 3.31
14-bit float             | 6.55 × 10^4        | 6.10 × 10^-5       | 3.01
11-bit float             | 6.50 × 10^4        | 6.10 × 10^-5       | 2.1
10-bit float             | 6.50 × 10^4        | 6.10 × 10^-5       | 1.8
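On Pascal, the doubled FP16 rate from the previous slide is reached by packing two 16-bit values into a half2 and using the paired intrinsics from cuda_fp16.h (compute capability 5.3+); hfma_demo below is an illustrative kernel, not from the slides:

    #include <cuda_fp16.h>

    // One fused multiply-add on a half2 operates on two FP16 values at
    // once, which is how Pascal reaches twice the FP32 rate in FP16.
    __global__ void hfma_demo(const __half2 *a, const __half2 *b,
                              const __half2 *c, __half2 *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __hfma2(a[i], b[i], c[i]);
    }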
Pascal
- NVLink for high speed GPU-GPU data transfer: 160 GByte/s
High Bandwidth Memory (HBM)
- For the workstation/server Tesla P100s
- HBM stacked memory layout
- In short, layered layout of memory, each with its own dedicated controller
- HBM2 can stack up to 8 layers
- Even with ECC
- AMD started HBM1 in 2008
Memory compression
- Try to compress everything you can
- Delta color compression (DCC) for textures:
○ Divide pixels into blocks
○ Assign a reference color to each block
○ Store block pixels by their deviation from the reference color (sketched below)
- New modes in Pascal
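A toy CPU-side sketch of the delta-from-reference idea above; this only illustrates the principle and is not NVIDIA's actual on-chip compressor (block size, delta width and fallback behavior are all assumptions):

    // Illustrative only: store an 8x8 block as one reference color plus
    // small per-pixel deltas. Real DCC hardware chooses block sizes and
    // delta widths adaptively and falls back to uncompressed storage when
    // the deltas do not fit.
    struct CompressedBlock {
        unsigned char reference[4];   // reference RGBA color of the block
        signed char   delta[64][4];   // per-pixel deviation from the reference
    };

    void compressBlock(const unsigned char pixels[64][4], CompressedBlock &out)
    {
        for (int c = 0; c < 4; ++c)
            out.reference[c] = pixels[0][c];   // e.g. first pixel as reference
        for (int p = 0; p < 64; ++p)
            for (int c = 0; c < 4; ++c)        // assumes deltas fit in 8 bits
                out.delta[p][c] = (signed char)(pixels[p][c] - out.reference[c]);
    }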
Load balancing
- Maxwell: statically partitioned the cores between compute and graphics ahead of time
- Pascal can do some sort of a dynamic load balancing
Simultaneous Multi-Projection Engine
Pascal
- A single unified virtual memory space for the CPU and GPU (at least up to
512 TB)
- Compute preemption at instruction level (Kepler/Maxwell was thread block-level)
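On the CUDA side, the unified virtual address space is what managed allocations build on: one pointer valid on both CPU and GPU, with Pascal able to page-fault and migrate memory on demand. A minimal sketch (addOne is an illustrative kernel):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void addOne(int *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1;
    }

    int main()
    {
        int n = 1024, *data;
        // One pointer, usable on both CPU and GPU.
        cudaMallocManaged(&data, n * sizeof(int));
        for (int i = 0; i < n; ++i) data[i] = i;

        addOne<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();

        printf("%d\n", data[0]);   // prints 1
        cudaFree(data);
        return 0;
    }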
Volta
Volta (2017)
- Primarily for the HPC community (Titan V), not for consumers
- Many goodies, like cooperative groups (see the sketch after this list)
- And some incremental dev:
○ Modified dispatch (FP32 and INT32 can co-issue)
○ L1 cache and shared memory merged
- And there is this new processing unit...
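Before moving on, a small taste of the cooperative groups mentioned above: explicit handles for thread groupings that previously were only implicit. blockSum is an illustrative kernel in which each block reduces its values into out[blockIdx]:

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    __global__ void blockSum(const float *in, float *out)
    {
        cg::thread_block block = cg::this_thread_block();
        cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);

        float v = in[block.group_index().x * block.size() + block.thread_rank()];

        // Reduce within the 32-wide tile using shuffles provided by the group.
        for (int offset = warp.size() / 2; offset > 0; offset /= 2)
            v += warp.shfl_down(v, offset);

        if (warp.thread_rank() == 0)
            atomicAdd(&out[block.group_index().x], v);
    }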
Volta SM
Tensor core
- 12x the speed of Pascal
- 64 FP16/FP32 mixed-precision FMA operations per clock
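Programmatically the tensor cores are reached through libraries (cuBLAS, cuDNN) or the warp-level WMMA API; a minimal sketch of one warp computing a 16x16 tile of D = A*B + C (wmma_tile is an illustrative kernel, launch it with at least one full warp, e.g. <<<1, 32>>>):

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a 16x16 tile with half inputs and float
    // accumulation; 16x16x16 is the canonical WMMA shape.
    __global__ void wmma_tile(const half *A, const half *B, float *C)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

        wmma::fill_fragment(acc, 0.0f);
        wmma::load_matrix_sync(a, A, 16);      // leading dimension 16
        wmma::load_matrix_sync(b, B, 16);
        wmma::mma_sync(acc, a, b, acc);        // the tensor core FMA
        wmma::store_matrix_sync(C, acc, 16, wmma::mem_row_major);
    }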
New scheduling
Scheduling before
Scheduling on Volta
Phasma
Turing
Turing (2018)
- September 20, 2018: GeForce RTX 2080
- Real time raytracing engine for consumers
○ It’s essentially the ray tracing equivalent of the PolyMorph engine
- New shading additions
○ VRS (Variable Rate Shading), MVR (Multi-View Rendering)
○ Mesh shading
Turing SM
- Divided into 4 pipelines, each housing
○ 16 FP32
○ 16 INT32
○ 2 Tensor cores
○ 1 warp scheduler
○ 1 dispatch unit
- 96 KB L1/shared
○ 64 KB of it is “shader RAM” (per SM) when executing graphics workloads
- L0 instruction cache
Memory latencies
A*B + C
Raytracing
Raytracing “before”
Raytracing “now”
DXR
Raytracing in practice
- Hybrid solutions to minimize the number of rays
○ Low sample counts usually come with extreme noise - denoising to the forefront of research
■ https://www.youtube.com/watch?v=5pxnDsFLAuY
■ https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising
■ https://www.youtube.com/watch?v=mtdRfl4fmvQ
- Acceleration Structures mean a considerable increase in GPU memory
- Decrease payload sizes as much as you can
Raytracing in practice
DXR
Mesh shader - motivation
Mesh shaders
- Task shader: threads organized in workgroups. Each workgroup can launch an arbitrary number (including zero) of mesh shader workgroups
- Mesh shader: each thread can create primitives.
Texture space shading
- Turing feature, only available via extensions (just like mesh shading)
- Store the shaded fragments of a triangle in a separate texture
- Independent of visibility
- Re-sample this stashed texture instead of re-evaluating the full shading
- Unless we moved around too much
- For certain applications it’s almost a given that we are at least roughly at the
same place for a frame: VR left and right eyes
Classic
Texture space
Texture space shading
- https://devblogs.nvidia.com/texture-space-shading/
- https://www.youtube.com/watch?v=Rpy0-q0TyB0
References
- Fermi whitepaper:
○ http://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia's_fermi-the_first_complete_gpu_architecture.pdf
○ http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
- Kepler whitepaper: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
- Maxwell whitepaper:
○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF
- Pascal whitepaper:
○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
○ https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
- Volta whitepaper:
○ http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
- Turing whitepaper:
○ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf
References
The lowest level details are unfortunately only available via reverse-engineering:
- Volta: https://arxiv.org/abs/1804.06826
- Turing: https://arxiv.org/pdf/1903.07486.pdf
Ampere
In numbers
- 7 GPCs
- Each GPC contains
○ 6 TPCs
○ 1 raster engine
○ (NEW) 2 ROP partitions
○ (NEW) 8 ROP units per ROP partition
- Each TPC contains
○ 2 SMs
○ 1 polymorph engine
- Each SM contains
○ 128 CUDA cores
○ 4 Texture units
○ 4 Tensor Cores (3rd gen)
○ 1 RT Core (2nd gen)
○ 256 KB register file partitioned into 4 × 64 KB parts
○ 128 KB of configurable L1/Shared memory
In numbers
- 12 x 32 bit memory controllers (384 bit)
- 512 KB L2 cache per controller (6144 KB in total)
- An SM partition can now issue 2 FP32 operations per clock (in Turing it could only dual-issue an FP32 + INT32 pair)
SM
- Re-arranged the two main datapaths:
○ 1 with 16 FP32 CUDA cores
○ 1 with 16 FP32 CUDA cores and 16 INT32 units
- Tricky: each clock, an SM partition can
○ either execute 32 FP32 operations
○ or execute 16 FP32 and 16 INT32 operations
- They kept the Turing double-speed FP16 operations (HFMA)
Unified shared memory
- Graphics mode: 64 KB L1 data/texture cache + 48 KB shared + 16 KB
reserved for the graphics pipeline operations
- In compute mode:
L1     | Shared
128 KB | 0 KB
120 KB | 8 KB
112 KB | 16 KB
96 KB  | 32 KB
64 KB  | 64 KB
28 KB  | 100 KB
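In compute mode the split is a per-kernel hint given through the shared memory carveout attribute; a minimal sketch (computeKernel is a placeholder name):

    #include <cuda_runtime.h>

    __global__ void computeKernel() { /* ... */ }

    int main()
    {
        // Hint: give this kernel the maximum shared-memory carveout
        // (roughly the 100 KB shared / 28 KB L1 split in the table above).
        cudaFuncSetAttribute(computeKernel,
                             cudaFuncAttributePreferredSharedMemoryCarveout,
                             cudaSharedmemCarveoutMaxShared);

        computeKernel<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }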
Tensor cores
- 4 per SM
- 256 FP16/32 FMA per clock
- 3rd gen now supports FP16, BF16 (alternative to IEEE FP16), TF32 (‘tensor
float’), FP64, INT8, INT4, binary (INT1)
○ FP16/BF16 has 16x larger throughput in tensor cores than FP32
- TF32 basically retains the exponent range of FP32 but discards mantissa bits
beyond what is representable with FP16
- Faster execution if your input matrices conform to particular structured sparsity patterns (at least a given fraction of values per group being zero)
Tensor cores
NVIDIA types
RT cores
- Some caching and throughput optimizations
- And some other nice goodies
Ampere whitepapers
- https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
- https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf