

SLIDE 1

SLICING THE WORKLOAD

MULTI-GPU OPENGL RENDERING APPROACHES

INGO ESSER – NVIDIA DEVTECH PROVIZ

SLIDE 2

OVERVIEW

Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications

SLIDE 3

MOTIVATION

Apps are becoming less CPU-bound, more GPU-bound

S5135 - GPU-Driven Large Scene Rendering in OpenGL
S5148 - Nvpro-Pipeline: A Research Rendering Pipeline

Fragment load (complex fragment shaders, higher resolutions)
  -> Slice image space
Data / geometry load (large datasets)
  -> Slice data / geometry
Processing (complex compute jobs)
  -> Offload complex calculations to other GPUs

Stereo Rendering / VR is a natural fit

SLIDE 4

OVERVIEW

Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications

SLIDE 5

DIRECTED GPU RENDERING

Quadro only
Allows picking the rendering GPU
Fast blit path to the display GPU
Dedicate GPUs to:
  OpenGL
  Compute
Choose via:
  NVIDIA Control Panel
  NVAPI: developer.nvidia.com/nvapi

SLIDE 6

QUADRO MOSAIC

Via SLI bridge or Quadro Sync board
Advantages:
  Transparent behavior
  One unified desktop
  No tearing
  Fragment clipping possible
Disadvantages:
  Single view frustum
  Whole scene rendered

SLIDE 7

QUADRO SLI FSAA

Use two Quadro boards with an SLI connector
Transparently scale image quality
Up to 128x FSAA

SLIDE 8

QUADRO SLI AFR

Semi-automagic multi-GPU support for alternate frame rendering (AFR)
SLI AFR abstracts the GPUs away
  Application sees one GPU
Driver mirrors static resources between GPUs
  No transfer between GPUs for unchanged data
  E.g. static textures, geometry data
  Dynamic data might need to be transferred

SLIDE 9

QUADRO SLI AFR

Single-GPU frame rendering

[Timeline diagram: GPU0 renders frames n, n+1, n+2, ... sequentially; the display presents each frame as it completes]

SLIDE 10

QUADRO SLI AFR

SLI AFR rendering on two GPUs
Same frame time, same latency
Frames rendered in parallel, twice the frame rate

[Timeline diagram: GPU0 renders even frames, GPU1 renders odd frames; the display presents frames n, n+1, n+2, ... in order]

SLIDE 11

QUADRO SLI AFR

Switch on SLI
  Application needs a profile
  Force AFR1 / AFR2 in the NV Control Panel
  For testing: use the profile "SLI Aware Application"

SLIDE 12

QUADRO SLI AFR

Prerequisites for AFR (the driver is conservative):
  Unbind dynamic resources before calling swap
  GPU queue must be full: no flushing GL queries
  Clear the full surface
If SLI AFR doesn't scale: use the GL debug callback
  glEnable( GL_DEBUG_OUTPUT );
  glDebugMessageCallback( ... );
We are working on improving the debug messages; feedback from developers is welcome!

[Timeline diagram: GPU0 and GPU1 rendering alternating frames n through n+5]

SLIDE 13

OVERVIEW

Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications

SLIDE 14

MULTI-GPU RENDERING

SLIDE 15

DISTRIBUTING WORKLOAD

Use the WGL_NV_gpu_affinity extension
Enumerate GPUs:
  wglEnumGpusNV( UINT iGpuIndex, HGPUNV* phGpu );
Enumerate displays per GPU (needed to determine the final display for the image present):
  wglEnumGpuDevicesNV( HGPUNV hGpu, UINT iDeviceIndex, PGPU_DEVICE lpGpuDevice );
Create an OpenGL context for a specific GPU:
  HGPUNV gpuMask[2] = { hGpu, nullptr }; // hGpu: the enumerated GPU handle
  // Get an affinity DC restricted to this GPU
  HDC affinityDC = wglCreateAffinityDCNV( gpuMask );
  SetPixelFormat( affinityDC, ... );
  HGLRC affinityGLRC = wglCreateContext( affinityDC );

SLIDE 16

SHARING DATA BETWEEN GPUS

For multiple contexts on the same GPU:
  wglShareLists & WGL_ARB_create_context
For multiple contexts across multiple GPUs:
  Readback (GPU1 -> Host), copy on host, upload (Host -> GPU0)
  NV_copy_image extension for OpenGL 3.x:
    Windows: wglCopyImageSubDataNV
    Linux: glXCopyImageSubDataNV
    Avoids extra copies; the same pinned host memory is accessed by both GPUs

SLIDE 17

NV_COPY_IMAGE EXTENSION

Transfer in a single call:
  No binding of objects
  No state changes
  Supports 2D & 3D textures and cube maps
Async for Fermi & above

  wglCopyImageSubDataNV( srcCtx, srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                         tgtCtx, tgtTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                         width, height, 1 );

[Diagram: srcTex on GPU0 transferred over CPU / PCIe to dstTex on GPU1]

SLIDE 18

OPENGL SYNCHRONIZATION

OpenGL commands are asynchronous

glDraw*( ... ) can return before rendering has finished

Use Sync object (GL 3.2+) for apps that need to sync on GPU completion

Much more flexible than using glFinish()

A fence is inserted into the producer's GL stream; a glWaitSync in the consumer's stream blocks execution until the producer signals the fence object

[Diagram: GPU0: glDraw, wglCopy..., glFenceSync; GPU1: glWaitSync, glBind, glDraw]

SLIDE 19

OVERVIEW

Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications

SLIDE 20

SETTING THE STAGE

App with a rendering function renderFrame()
Fragment bound
Improvements:
  Split the image to distribute rendering load (sort-first)
  Use multiple GPUs (4 in the example)
  Do parallel rendering
  Hide transfer overhead

SLIDE 21

RENDER PIPELINE GTC 2014 - ID S4455

[Pipeline diagram: renderFrame() pops the idleQ and pushes the preRenderQ; event tokens then flow through the renderQ (render()), the copyQ (copy()) and the composeQ, and back to the idleQ]

SLIDE 22

APP::RENDERFRAME CALL

Take an event token from the idle queue
Add data for this frame (e.g. frame number, view matrix)
Put the token into the first queue of the pipeline

  auto event = m_idleQueue->pop();
  event->setType( Event::RENDER );
  /* update payload */
  m_preRenderQueue->push( event );

SLIDE 23

PRERENDER STEP

Optional pre-computation (e.g. load balancing information)
Put the event token into the N render queues
Parallel execution begins here

  auto event = inputQueue->pop();
  /* pre-computation code */
  for( auto& i : outputQueues )
  {
    i->push( event );
  }

SLIDE 24

RENDER STEP

N affinity contexts, each optimally rendering 1/Nth of the GPU load
"Manually" multiplex scene resources to all threads
E.g. use the scissor / depth / stencil buffer to confine the rendering area
Use the texture from the event token as render target
Insert a fence at the end to signal the render step has finished

SLIDE 25

COPY STEP

N copy threads copying N textures:
  Wait for the fence from the preceding render thread
  Copy data from the render GPU to the display GPU
  Use the textures from the event token as source & target
  Insert a fence at the end to signal the copy has finished

SLIDE 26

COMPOSE STEP

Pop from the N event queues (CPU synchronization)
Perform N glWaitSync calls (GPU synchronization)
Take the N textures and combine the image data into the output image
Optional post-processing (overlays etc.)
Call SwapBuffers to present the frame

SLIDE 27

OVERVIEW

Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications

SLIDE 28

SLICING IMAGE SPACE

Fragment bound scenario
Split the image up into N sub-images
Every GPU renders the same scene, just different image regions
Compose by reassembling the output image from the sub-images
Scales when the fragment load is distributed well

SLIDE 29

SLICING & COMPOSITION

SLIDE 30

RESULTS – SLICING IMAGE SPACE

[Charts: frame time vs. workload and scaling vs. workload for 1 to 4 GPUs]

SLIDE 31

SLICING VERTEX SPACE

Geometry bound scenario
Split the scene up into N parts
Every GPU renders the same frustum, but a different sub-scene
Compose the output image by depth comparison
Scales when the geometry is distributed well
Transfer full color and depth images

SLIDE 32

SLICING & COMPOSITION

Every Torus: 724201 vertices / 722500 faces

SLIDE 33

RESULTS – SLICING VERTEX SPACE (LO RES)

[Charts: frame time vs. #objects and scaling vs. #objects for 1 to 4 GPUs]

SLIDE 34

RESULTS – SLICING VERTEX SPACE (LO RES)

[Charts: frame time vs. #objects and scaling vs. #objects for 1 to 4 GPUs]

SLIDE 35

RESULTS – SLICING VERTEX SPACE

SLIDE 36

RESULTS – SLICING VERTEX SPACE (HI RES)

[Charts: frame time vs. #objects and scaling vs. #objects for 1 to 4 GPUs]

SLIDE 37

RESULTS – SLICING VERTEX SPACE (HI RES)

PCIe 2.0 x16 can transport ~700 Full HD images per second
Per displayed frame:
  4 Full HD color images
  4 Full HD depth images
700 / 8 = 87.5 fps max, i.e. 11.4 ms minimum per frame
800x600 images: 2.6 ms minimum per frame
4k images: 45.6 ms minimum per frame
Improvements: compression / PCIe 3.0

SLIDE 38

SLICING TIME

General GPU bound scenario
Implement "SLI AFR" manually: distribute whole frames across the GPUs
Every GPU renders a whole frame
No composition, just display the output image on the display GPU
Only scales without inter-frame dependencies

SLIDE 39

SLICING & COMPOSITION

SLIDE 40

RESULTS – SLICING TIME

[Charts: frame time vs. workload and scaling vs. workload for 1 to 4 GPUs]

SLIDE 41

RESULTS – SLICING TIME

[Timeline diagram: render(), copy() and compose() stages overlap across frames n, n+1 and n+2]

SLIDE 42

IN CLOSING

Other applications are possible, e.g.:
  Stereo rendering
  Volume rendering
  Shadow passes

Further questions?

iesser@nvidia.com

Source code soon available at https://github.com/nvpro-samples

SLIDE 43

THANK YOU