
SLIDE 1

GPGPU 03

NVIDIA case study

SLIDE 2

GeForce 7800 (2006)

SLIDE 3

GeForce 7800

  • Impossible to maximize throughput with such a rigid architecture: you can’t keep the vertex and fragment shading units busy all the time
  • As a result, there are many bottlenecks in the pipeline (a single triangle covering the entire screen vs. extremely detailed geometry far away)
  • So NVIDIA decided to unify their processor architectures
SLIDE 4

GeForce 8800 (2006, Tesla)

SLIDE 5

GeForce 8800

processing element

SLIDE 6

GeForce 8800

shader core

SLIDE 7

GeForce 8800

  • Processing element
    ○ A single processing unit (it can carry out an operation on a piece of data)
    ○ 128 of them in the full spec
  • Shader core
    ○ Contains 16 processing elements
    ○ The smallest unit capable of executing code (e.g. “shade fragments with shader <X>”)
  • CUDA is introduced
SLIDE 8

Fermi

SLIDE 9

NVIDIA GF100 - Fermi (2009)

  • NVIDIA carried on with the unified architecture in this 2009 series too
  • That is, unified stream processors switch between vertex and fragment shading tasks, depending on workload needs
    ○ This requires more complex control logic than before
  • Fermi is DirectX 11 compatible
    ○ The new challenge on the graphics side is the support for tessellation shaders (potentially extremely large data amplification)
  • The focus with Fermi was on accelerating graphics applications
SLIDE 10

Fermi architecture

  • Main elements:
    ○ GPC (Graphics Processing Cluster): a scalable unit of graphics processing units
    ○ SM (Streaming Multiprocessor): a unit of 32 stream processors (the “GPU core” is now called a CUDA core) that work on the same task
    ○ CUDA core: the actual unit of ALU operations, with a single, unified instruction set
  • In particular, GF100 (= GeForce GTX 480) houses
    ○ 4 GPCs
    ○ 4 SMs per GPC (16 SMs in total in GF100)
    ○ 6 memory controllers
  • Uses PCIe for accelerated CPU-GPU transfer speeds
SLIDE 11

CUDA code

  • FP: IEEE 754-2008, FMA
  • INT: bool, shift, move, ...
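
A minimal CUDA sketch of what the FMA support means in practice: fmaf() compiles to a single fused multiply-add, i.e. a*b + c with one rounding step, as IEEE 754-2008 requires. The kernel name and data layout are illustrative only.

```cuda
// Illustrative kernel: d[i] = a[i] * b[i] + c[i] with a single fused
// multiply-add per element (one rounding step).
__global__ void fma_kernel(const float* a, const float* b,
                           const float* c, float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], c[i]);   // compiles to a single FFMA instruction
}
```
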
SLIDE 12

FMA

SLIDE 13

SM

SLIDE 14

SM

  • 32 CUDA cores
  • 4 special function units (SFUs):
    ○ For evaluating transcendental functions (trig, inverse trig, logs, length, reciprocal, etc.)
    ○ They also do interpolation in the graphics pipeline
    ○ 1 instruction per clock => finishing an entire warp takes 8 clocks
  • Essentially, the SM is partitioned into 4 independent pipelines: 2 ALU, 1 memory, 1 SFU
  • The dual warp scheduler selects two warps for execution each cycle (2 × 32 threads)
  • The current instruction of each selected warp then goes to either
    ○ 16 CUDA cores, or
    ○ 16 L/S units, or
    ○ 4 SFUs

SLIDE 15

SM

  • The ‘ALU’ pipelines are called execution blocks, each consisting of 16 cores
  • For the two execution blocks and the L/S units, it takes 2 cycles to execute a single instruction across all 32 threads of a warp
  • It takes 8 cycles to execute a warp’s 32 SFU instructions
SLIDE 16

SM scheduling

SLIDE 17

SM

  • Each SM has 4 texture units (TUs)
    ○ A TU can fetch 4 texture samples per cycle
    ○ And deliver them filtered
  • Dedicated cache for the TUs
  • Each SM has an L1 cache
  • More precisely, 64 KB of memory that can be configured as
    ○ 16 KB L1 + 48 KB shared, or
    ○ 48 KB L1 + 16 KB shared
  • In rendering: 16 KB L1 + 48 KB shared
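
On the CUDA side, this split is requested per kernel through the Fermi-era cache configuration API; a minimal sketch (the kernel is just a placeholder):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel() { /* ... */ }

int main()
{
    // Ask for the 48 KB shared + 16 KB L1 split for this kernel
    // (a hint only; the driver may override it).
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    // ...or favour L1 instead (48 KB L1 + 16 KB shared):
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);

    my_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```
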
SLIDE 18

L2 cache

  • 768 KB, shared by all SMs
  • Read/write, coherent memory
SLIDE 19
SLIDE 20

PolyMorph Engine

  • Controls the primitive processing step of the graphics pipeline
  • It is itself a 5-stage pipeline
  • The resulting primitives are passed to the rasterizer stage
SLIDE 21

Raster Engine

  • 4 Raster Engines in parallel (1 per GPC)
  • 3-stage pipeline
  • Edge Setup: sets up the equations for the sides of the triangles (see the edge function below)
  • 8 pixels/cycle/Raster Engine
  • Hierarchical Z-culling (tile-based)
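
A common formulation of such edge equations (the slide does not spell them out): for triangle vertices (x_i, y_i), with vertex indices taken mod 3,

```latex
E_i(x, y) = (x - x_i)\,(y_{i+1} - y_i) - (y - y_i)\,(x_{i+1} - x_i),
\qquad i \in \{0, 1, 2\}
```

A sample at (x, y) lies inside the triangle when all three edge functions agree in sign, which is what the rasterizer evaluates per pixel.
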
SLIDE 22
  • The graphics API calls originate from your application (using gl* commands in OpenGL, etc.)
  • These go into a pushbuffer (also called a command buffer)
  • And then comes the fun part: https://fgiesen.wordpress.com/2011/07/09/a-trip-through-the-graphics-pipeline-2011-index/
    ○ If you are serious about graphics, you SHOULD check this site

SLIDE 23

CPU

SLIDE 24

GPU . . .

SLIDE 25
SLIDE 26
SLIDE 27

Communication within the SM

SLIDE 28

Communication between all SMs across all GPCs

SLIDE 29
SLIDE 30

Rendering details

https://developer.nvidia.com/content/life-triangle-nvidias-logical-pipeline

SLIDE 31

Kepler

SLIDE 32

Kepler (2012)

  • Successor to Fermi (GeForce 680)
  • Enhancing performance (especially double precision - HPC to the forefront!)
  • Also focus on efficiency (new mantra is “performance per watt”)
SLIDE 33

Kepler setup

  • 4 GPCs
  • 8 SMXs (SMX is simply what SMs are called in Kepler)
  • 4 memory controllers
SLIDE 34

SMX

SLIDE 35

SMX

  • 192 single precision CUDA cores, with FMA
  • 32 SFUs
  • Core clock reduced to the GPU clock (previously it ran at twice the GPU clock)
  • The SMX schedules 4 warps each clock
    ○ Up to 2 independent instructions of the same warp
  • Simplified scheduler
    ○ Some complex control logic moved into the compiler (reordering math instructions, etc.)
  • A single thread can use up to 255 registers
    ○ And register values can be shuffled between the threads of a warp! Essentially, that’s free intra-warp communication, bypassing shared memory and L1 (see the sketch below)
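
A minimal CUDA sketch of the register-shuffle idea (the *_sync intrinsics are the current spelling of the Kepler-era shuffle instructions); the reduction below keeps all traffic inside the warp’s registers:

```cuda
// Warp-level sum reduction via register shuffles: no shared memory or L1
// traffic, values move directly between the registers of the 32 lanes.
__device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;   // lane 0 ends up holding the sum of all 32 lanes
}
```
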

SLIDE 36

Memory

  • Configurable 64 KB memory per SMX, like in Fermi
  • Plus a 48 KB read-only data cache
  • 1536 KB L2 cache (shared by all SMXs)
SLIDE 37
SLIDE 38
SLIDE 39
SLIDE 40

Maxwell

SLIDE 41

Maxwell (2014)

  • Focus on efficiency
  • Served the GeForce 750, 8xx, and 9xx series
  • Setup:
    ○ 4 GPCs
    ○ 4 Maxwell SMs (SMMs) per GPC, 16 in total
    ○ 4 memory controllers
  • Some rationalizations:
    ○ Memory bus: from 192 bits down to 128
    ○ CUDA cores per SMM: 128 (Kepler had 192 per SMX!)
    ○ etc.

SLIDE 42

SMM

SLIDE 43

SMM

  • 128 CUDA cores
    ○ In units of 32
    ○ Each unit of 32 has its own dedicated scheduler and instruction buffer
  • 1 PolyMorph Engine
  • 8 texture units
  • Simplified design to ease the control logic
  • Memory layout revamp:
    ○ L1 cache shared with the texture cache
    ○ 96 KB dedicated shared memory
    ○ 2048 KB L2 cache

SLIDE 44

SMM

  • 4 warp schedulers
  • Each warp scheduler
    ○ Schedules 2 instructions per cycle
    ○ Works with its own dedicated 32 CUDA cores
    ○ 8 L/S units
    ○ 8 SFUs

SLIDE 45
SLIDE 46

Pascal

SLIDE 47

Pascal

  • Full Pascal spec (remember: these are the 1080s)
    ○ 6 GPCs
    ○ 10 Pascal SMs per GPC, 60 in total
    ○ 30 TPCs in total
    ○ 8 × 512-bit memory controllers
  • 1 GPC:
    ○ 10 SMs
  • 1 SM:
    ○ 64 CUDA cores
    ○ 4 texture units
  • That totals 3840 single precision CUDA cores and 240 (!) texture units
SLIDE 48
SLIDE 49
SLIDE 50

Pascal (2016)

  • Increased computation capacity (the deep learning craze)
    ○ 5.3 TFLOPS FP64 (~double), 10.6 TFLOPS FP32 (~float), 21.2 TFLOPS FP16 (~half)

  • NVLink for high bandwidth connectivity
  • Modified memory architectures
SLIDE 51

Floating point precision

Floating point bit depth   Largest value        Smallest value¹      Decimal digits of precision
32-bit float               3.4028237 × 10^38    1.175494 × 10^-38    7.22
16-bit float               6.55 × 10^4          6.10 × 10^-5         3.31
14-bit float               6.55 × 10^4          6.10 × 10^-5         3.01
11-bit float               6.50 × 10^4          6.10 × 10^-5         2.1
10-bit float               6.50 × 10^4          6.10 × 10^-5         1.8

SLIDE 52

Pascal

  • NVLink for high-speed GPU-GPU data transfer: 160 GB/s
SLIDE 53

High Bandwidth Memory (HBM)

  • For the workstation-class Tesla P100 cards
  • HBM stacked memory layout
  • In short, a layered layout of memory dies, each stack with its own dedicated controller
  • HBM2 can stack up to 8 layers
  • With ECC support as well
  • AMD started developing HBM1 back in 2008
SLIDE 54

Memory compression

  • Try to compress everything you can
  • Delta color compression (DCC) for textures (see the toy sketch after this list):
    ○ Divide pixels into blocks
    ○ Assign a reference color to each block
    ○ Store block pixels by their deviation from the reference color
  • New modes in Pascal
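
The hardware scheme is proprietary, but the idea maps to something like this toy CUDA kernel. Block size, layout, and the delta encoding are made up for illustration; this is not NVIDIA’s actual DCC format.

```cuda
// Toy illustration of delta color compression: one thread block handles one
// 8x8 pixel tile, keeps the tile's top-left pixel as the reference color and
// stores every pixel as a signed delta from it.
// Launch as dcc_toy<<<dim3(width/8, height/8), dim3(8, 8)>>>(...),
// with width and height multiples of 8.
__global__ void dcc_toy(const uchar4* pixels, uchar4* refs, char4* deltas,
                        int width)
{
    int x    = blockIdx.x * 8 + threadIdx.x;
    int y    = blockIdx.y * 8 + threadIdx.y;
    int tile = blockIdx.y * gridDim.x + blockIdx.x;

    uchar4 p   = pixels[y * width + x];
    uchar4 ref = pixels[(blockIdx.y * 8) * width + (blockIdx.x * 8)];

    if (threadIdx.x == 0 && threadIdx.y == 0)
        refs[tile] = ref;                       // one reference color per tile

    deltas[tile * 64 + threadIdx.y * 8 + threadIdx.x] =
        make_char4((signed char)(p.x - ref.x), (signed char)(p.y - ref.y),
                   (signed char)(p.z - ref.z), (signed char)(p.w - ref.w));
}
```
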
SLIDE 55

Load balancing

  • Maxwell: statically partitioned the cores between compute and graphics work up front
  • Pascal can do a form of dynamic load balancing between the two
SLIDE 56

Simultaneous Multi-Projection Engine

SLIDE 57

Simultaneous Multi-Projection Engine

SLIDE 58

Pascal

  • A single unified virtual memory space for the CPU and GPU (at least up to 512 TB)
  • Compute preemption at instruction level (Kepler/Maxwell only had thread-block-level preemption)
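
In CUDA this shows up as managed (unified) memory: a single pointer is valid on both CPU and GPU, and on Pascal pages migrate on demand. A minimal sketch (array size and kernel are illustrative):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation, one pointer, usable from both sides.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = float(i);   // CPU writes

    scale<<<(n + 255) / 256, 256>>>(data, n);         // GPU pages data in on demand
    cudaDeviceSynchronize();

    printf("%f\n", data[42]);                         // CPU reads the result back
    cudaFree(data);
    return 0;
}
```
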

SLIDE 59
SLIDE 60

Volta

SLIDE 61

Volta (2017)

  • Primarily for the HPC community (Titan V), not for consumers
  • Many goodies, like cooperative groups
  • And some incremental development:
    ○ Modified dispatch (FP32 and INT32 can co-issue)
    ○ L1 cache and shared memory merged

  • And there is this new processing unit...
SLIDE 62

Volta SM

SLIDE 63

Tensor core

SLIDE 64

Tensor core

12× the speed of Pascal

SLIDE 65

Tensor core

  • 64 FP16/FP32 mixed-precision FMA operations per clock
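
Tensor cores are exposed to CUDA through the warp-level wmma API; a minimal sketch in which one warp computes a single 16×16×16 tile of D = A·B + C with FP16 inputs and FP32 accumulation (layouts and leading dimensions are chosen arbitrarily for illustration):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Launch with one warp (32 threads); requires sm_70 or newer.
__global__ void wmma_tile(const half* a, const half* b, float* c)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);        // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(acc, fa, fb, acc);         // the tensor core mixed-precision MMA
    wmma::store_matrix_sync(c, acc, 16, wmma::mem_row_major);
}
```
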
SLIDE 66

New scheduling

SLIDE 67

Scheduling before

SLIDE 68

Scheduling on Volta

SLIDE 69

Scheduling on Volta

SLIDE 70

Phasma

SLIDE 71

Turing

SLIDE 72

Turing (2018)

  • September 20, 2018: GeForce RTX 2080
  • Real-time raytracing engine for consumers
    ○ It’s essentially the equivalent of the PolyMorph Engine, but for ray tracing
  • New shading additions
    ○ VRS (Variable Rate Shading), MVR (Multi-View Rendering)
    ○ Mesh shading

SLIDE 73

Turing SM

SLIDE 74

Turing SM

  • Divided into 4 pipelines, each housing
    ○ 16 FP32 cores
    ○ 16 INT32 cores
    ○ 2 Tensor Cores
    ○ 1 warp scheduler
    ○ 1 dispatch unit
  • 96 KB L1/shared memory
    ○ 64 KB of it is “shader RAM” (per SM) when executing graphics workloads
  • L0 instruction cache
SLIDE 75
SLIDE 76

Memory latencies

SLIDE 77

A*B + C

SLIDE 78

Raytracing

SLIDE 79

Raytracing “before”

SLIDE 80

Raytracing “now”

SLIDE 81

DXR

SLIDE 82

Raytracing in practice

  • Hybrid solutions to minimize the number of rays
    ○ Low sample counts usually come with extreme noise, which has pushed denoising to the forefront of research
      ■ https://www.youtube.com/watch?v=5pxnDsFLAuY
      ■ https://research.nvidia.com/publication/interactive-reconstruction-monte-carlo-image-sequences-using-recurrent-denoising
      ■ https://www.youtube.com/watch?v=mtdRfl4fmvQ
  • Acceleration structures mean a considerable increase in GPU memory use
  • Decrease payload sizes as much as you can
SLIDE 83

Raytracing in practice

SLIDE 84

DXR

SLIDE 85

Mesh shader - motivation

SLIDE 86

Mesh shaders

SLIDE 87

Mesh shaders

  • Task shader: threads organized into workgroups. Each workgroup can launch an arbitrary number (including zero) of mesh shader workgroups
  • Mesh shader: each thread can create primitives.
SLIDE 88

Mesh shaders

SLIDE 89

Mesh shaders

SLIDE 90

Mesh shaders

SLIDE 91

Texture space shading

  • Turing feature, only available via extensions (just like mesh shading)
  • Store the shaded fragments of a triangle in a separate texture
  • Independent of visibility
  • Re-sample this stashed texture instead of re-evaluating the full shading
  • Unless we have moved around too much
  • For certain applications it’s almost a given that we stay at roughly the same place for a frame: the left and right eyes in VR

SLIDE 92

Classic

SLIDE 93

Texture space

SLIDE 94

Texture space shading

  • https://devblogs.nvidia.com/texture-space-shading/
  • https://www.youtube.com/watch?v=Rpy0-q0TyB0
SLIDE 95

References

  • Fermi whitepapers:
    ○ http://www.nvidia.com/content/pdf/fermi_white_papers/p.glaskowsky_nvidia's_fermi-the_first_complete_gpu_architecture.pdf
    ○ http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
  • Kepler whitepaper: https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
  • Maxwell whitepapers:
    ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce-GTX-750-Ti-Whitepaper.pdf
    ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF
  • Pascal whitepapers:
    ○ http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_1080_Whitepaper_FINAL.pdf
    ○ https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf
  • Volta whitepaper:
    ○ http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
  • Turing whitepaper:
    ○ https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

SLIDE 96

References

The lowest-level details are unfortunately only available via reverse engineering:

  • Volta: https://arxiv.org/abs/1804.06826
  • Turing: https://arxiv.org/pdf/1903.07486.pdf
SLIDE 97

Ampere

SLIDE 98
SLIDE 99

In numbers

  • 7 GPCs
  • Each GPC contains
    ○ 6 TPCs
    ○ 1 Raster Engine
    ○ (NEW) 2 ROP partitions
    ○ (NEW) 8 ROP units per ROP partition
  • Each TPC contains
    ○ 2 SMs
    ○ 1 PolyMorph Engine
  • Each SM contains
    ○ 128 CUDA cores
    ○ 4 texture units
    ○ 4 Tensor Cores (3rd gen)
    ○ 1 RT Core (2nd gen)
    ○ a 256 KB register file partitioned into 4 × 64 KB parts
    ○ 128 KB of configurable L1/shared memory

SLIDE 100

In numbers

  • 12 × 32-bit memory controllers (384 bits in total)
  • 512 KB L2 cache per controller (6144 KB in total)
  • An SM partition can now issue 2 FP32 operations per clock (in Turing it could only dual-issue a float + int operation pair)

SLIDE 101

SM

SLIDE 102

SM

  • Rearranged the two main datapaths:
    ○ 1 with 16 FP32 CUDA cores
    ○ 1 with 16 FP32 CUDA cores and 16 INT32 units
  • Tricky: each clock, an SM partition can
    ○ either execute 32 FP32 operations,
    ○ or execute 16 FP32 and 16 INT32 operations
  • They kept the Turing double-speed FP16 operations (HFMA)
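
The double-rate FP16 path is what the half2 intrinsics map to; a minimal sketch (kernel name and layout are illustrative):

```cuda
#include <cuda_fp16.h>

// Two FP16 fused multiply-adds per instruction: d = a*b + c on half2 pairs,
// i.e. the packed HFMA path the slide refers to.
__global__ void hfma2_kernel(const half2* a, const half2* b,
                             const half2* c, half2* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = __hfma2(a[i], b[i], c[i]);
}
```
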
SLIDE 103

Unified shared memory

  • Graphics mode: 64 KB L1 data/texture cache + 48 KB shared + 16 KB reserved for the graphics pipeline operations
  • In compute mode:

    L1        Shared
    128 KB    0 KB
    120 KB    8 KB
    112 KB    16 KB
    96 KB     32 KB
    64 KB     64 KB
    28 KB     100 KB
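
With current CUDA toolkits the compute-mode split is requested per kernel as a shared memory carveout hint; a minimal sketch (the kernel is a placeholder, and the driver may still pick a neighbouring configuration):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel() { /* ... */ }

int main()
{
    // Hint: give this kernel the largest possible shared-memory carveout
    // (e.g. the 28 KB L1 + 100 KB shared split in the table above).
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    my_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();
    return 0;
}
```
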

SLIDE 104

Tensor cores

  • 4 per SM
  • 256 FP16/32 FMA per clock
  • 3rd gen now supports FP16, BF16 (an alternative to IEEE FP16), TF32 (‘tensor float’), FP64, INT8, INT4, and binary (INT1)
    ○ FP16/BF16 has 16× larger throughput on the tensor cores than FP32
  • TF32 basically retains the exponent range of FP32 but discards the mantissa bits beyond what is representable with FP16
  • Faster execution if your input matrices match (at least) particular structured sparsity patterns
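
For reference, the bit layouts behind these type names (sign / exponent / mantissa bits):

```latex
\begin{array}{lccc}
            & \text{sign} & \text{exponent} & \text{mantissa} \\
\text{FP32} & 1 & 8 & 23 \\
\text{TF32} & 1 & 8 & 10 \\
\text{BF16} & 1 & 8 & 7  \\
\text{FP16} & 1 & 5 & 10
\end{array}
```

TF32 therefore keeps the FP32 exponent range with FP16-level mantissa precision, which is exactly the trade-off described above.
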

SLIDE 105

Tensor cores

SLIDE 106

NVIDIA types

SLIDE 107

RT cores

  • Some caching and throughput optimizations
  • And some other nice goodies
SLIDE 108

Ampere whitepapers

  • https://www.nvidia.com/content/dam/en-zz/Solutions/geforce/ampere/pdf/NVIDIA-ampere-GA102-GPU-Architecture-Whitepaper-V1.pdf
  • https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf