SLICING THE WORKLOAD – MULTI-GPU OPENGL RENDERING APPROACHES
INGO ESSER – NVIDIA DEVTECH PROVIZ
OVERVIEW
Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications
MOTIVATION
Apps are becoming less CPU-bound, more GPU-bound
  S5135 – GPU-Driven Large Scene Rendering in OpenGL
  S5148 – Nvpro-Pipeline: A Research Rendering Pipeline
Fragment load (complex fragment shaders, higher resolutions)
  → Slice image space
Data / geometry load (large datasets)
  → Slice data / geometry
Processing (complex compute jobs)
  → Offload complex calculations to other GPUs
Stereo rendering / VR is a natural fit
OVERVIEW
Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications
DIRECTED GPU RENDERING
Quadro only
Allows picking the rendering GPU
Fast blit path to the display GPU
Dedicate GPUs to OpenGL or compute
Choose via:
  NVIDIA Control Panel
  NVAPI: developer.nvidia.com/nvapi
QUADRO MOSAIC
Via SLI bridge or Quadro Sync board
Advantages:
  Transparent behavior
  One unified desktop
  No tearing
  Fragment clipping possible
Disadvantages:
  Single view frustum
  Whole scene rendered
QUADRO SLI FSAA
Use two Quadro boards with an SLI connector
Transparently scales image quality
Up to 128x FSAA
QUADRO SLI AFR
Semi-automagic multi-GPU support for alternate frame rendering (AFR)
SLI AFR abstracts the GPUs away:
  Application sees one GPU
  Driver mirrors static resources between GPUs
  No transfer between GPUs for unchanged data, e.g. static textures and geometry data
  Dynamic data might need to be transferred
QUADRO SLI AFR
Single GPU frame rendering
[Timeline: GPU0 renders frames n through n+4 sequentially; each frame is displayed after its rendering completes.]
QUADRO SLI AFR
SLI AFR rendering on two GPUs
Same frame time, same latency
Frames rendered in parallel, twice the frame rate
[Timeline: GPU0 and GPU1 render alternating frames in parallel; frames n through n+9 are displayed in order.]
QUADRO SLI AFR
Switch on SLI:
  The application needs a profile
  Force AFR1 / AFR2 in the NVIDIA Control Panel
  For testing: use the profile “SLI Aware Application”
QUADRO SLI AFR
Prerequisites for AFR (the driver is conservative):
  Unbind dynamic resources before calling swap
  The GPU queue must be full – no flushing, no GL queries
  Clear the full surface
If SLI AFR doesn’t scale, use the GL debug callback (see the sketch below):
  glEnable( GL_DEBUG_OUTPUT );
  glDebugMessageCallback( ... );
NVIDIA is working on improving the debug messages – feedback from developers is welcome!
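A minimal sketch of wiring up the debug callback, assuming a current debug-enabled context and GLEW for function loading (names here are illustrative, not from the talk):

  #include <cstdio>
  #include <GL/glew.h> // assumption: GLEW loads the GL function pointers

  static void GLAPIENTRY debugCallback( GLenum source, GLenum type, GLuint id,
                                        GLenum severity, GLsizei length,
                                        const GLchar* message, const void* userParam )
  {
      // Print every driver message; hints on AFR scaling problems show up here
      fprintf( stderr, "GL debug [src=0x%X type=0x%X sev=0x%X id=%u]: %s\n",
               source, type, severity, id, message );
  }

  void enableGLDebugOutput()
  {
      glEnable( GL_DEBUG_OUTPUT );
      glEnable( GL_DEBUG_OUTPUT_SYNCHRONOUS ); // messages arrive on the calling thread
      glDebugMessageCallback( debugCallback, nullptr );
  }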
OVERVIEW
Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications
MULTI-GPU RENDERING
DISTRIBUTING WORKLOAD
Use the NV_gpu_affinity extension
Enumerate GPUs:
  wglEnumGpusNV( UINT iGpuIndex, HGPUNV* phGpu );
Enumerate displays per GPU – needed to determine the final display for image present:
  wglEnumGpuDevicesNV( HGPUNV hGpu, UINT iDeviceIndex, PGPU_DEVICE lpGpuDevice );
Create an OpenGL context for a specific GPU:
  HGPUNV gpuMask[2] = { hGpu, nullptr };
  // Get an affinity DC based on the GPU
  HDC affinityDC = wglCreateAffinityDCNV( gpuMask );
  SetPixelFormat( affinityDC, ... );
  HGLRC affinityGLRC = wglCreateContext( affinityDC );
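Putting these calls together, a minimal sketch that creates one affinity context per GPU (assumes the WGL_NV_gpu_affinity entry points were already loaded via wglGetProcAddress; error handling omitted):

  #include <windows.h>
  #include <GL/gl.h>
  #include <vector>
  // HGPUNV, wglEnumGpusNV, wglCreateAffinityDCNV are declared in wglext.h

  std::vector<HGLRC> createAffinityContexts()
  {
      std::vector<HGLRC> contexts;
      HGPUNV hGpu = nullptr;
      for( UINT gpuIndex = 0; wglEnumGpusNV( gpuIndex, &hGpu ); ++gpuIndex )
      {
          HGPUNV gpuMask[2] = { hGpu, nullptr };
          HDC affinityDC = wglCreateAffinityDCNV( gpuMask ); // DC tied to this GPU

          PIXELFORMATDESCRIPTOR pfd = { sizeof( pfd ) };
          pfd.dwFlags    = PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
          pfd.iPixelType = PFD_TYPE_RGBA;
          pfd.cColorBits = 32;
          SetPixelFormat( affinityDC, ChoosePixelFormat( affinityDC, &pfd ), &pfd );

          contexts.push_back( wglCreateContext( affinityDC ) );
      }
      return contexts;
  }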
SHARING DATA BETWEEN GPUS
For multiple contexts on the same GPU:
  wglShareLists & WGL_ARB_create_context
For multiple contexts across multiple GPUs:
  Readback (GPU1 → host), copy on host, upload (host → GPU0)
  NV_copy_image extension for OpenGL 3.x:
    Windows: wglCopyImageSubDataNV
    Linux: glXCopyImageSubDataNV
    Avoids extra copies – the same pinned host memory is accessed by both GPUs
NV_COPY_IMAGE EXTENSION
Transfer in a single call:
  No binding of objects
  No state changes
  Supports 2D & 3D textures and cube maps
  Asynchronous for Fermi & above
  wglCopyImageSubDataNV( srcCtx, srcTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                         tgtCtx, tgtTex, GL_TEXTURE_2D, 0, 0, 0, 0,
                         width, height, 1 );
[Diagram: srcTex on GPU0 is transferred over CPU / PCIe to dstTex on GPU1.]
OPENGL SYNCHRONIZATION
OpenGL commands are asynchronous
glDraw*( ... ) can return before rendering has finished
Use Sync object (GL 3.2+) for apps that need to sync on GPU completion
Much more flexible than using glFinish()
A fence is inserted into the producer’s GL stream; a wait in the consumer’s stream blocks GPU execution until the producer signals the fence object
[Diagram: GPU0 stream: glDraw → wglCopy... → glFenceSync; GPU1 stream: glWaitSync → glBind → glDraw.]
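A minimal sketch of this producer/consumer pattern with GL 3.2 sync objects; how the fence handle travels between threads is up to the application (a plain variable here, purely illustrative):

  GLsync fence = 0;

  // Producer thread, GPU0 context current:
  void producerFinishFrame()
  {
      // ... glDraw*, wglCopyImageSubDataNV ...
      fence = glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 );
      glFlush(); // make sure the fence command reaches the GPU
  }

  // Consumer thread, GPU1 context current:
  void consumerUseTexture( GLuint tex )
  {
      // Blocks the GPU1 command stream (not the CPU) until GPU0 signals
      glWaitSync( fence, 0, GL_TIMEOUT_IGNORED );
      glDeleteSync( fence );
      glBindTexture( GL_TEXTURE_2D, tex );
      // ... glDraw* ...
  }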
OVERVIEW
Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications
SETTING THE STAGE
App with a rendering function renderFrame()
Fragment bound
Improvements:
  Split the image to distribute the rendering load (sort-first)
  Use multiple GPUs (4 in the example)
  Do parallel rendering
  Hide transfer overhead
RENDER PIPELINE (GTC 2014 – ID S4455)
[Pipeline diagram: renderFrame() pops an event from idleQ and pushes it into preRenderQ; events then flow through the render(), copy(), and compose() steps via renderQ, copyQ, and composeQ, and finally return to idleQ.]
APP::RENDERFRAME CALL
Take an event token from the idle queue
Add data for this frame (e.g. frame number, view matrix)
Put the token into the first queue of the pipeline:
  auto event = m_idleQueue->pop();
  event->setType( Event::RENDER );
  /* update payload */
  m_preRenderQueue->push( event );
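The pipeline steps communicate through blocking queues; a minimal sketch of such a queue (the Event type and all names are assumptions based on the calls above):

  #include <condition_variable>
  #include <memory>
  #include <mutex>
  #include <queue>

  struct Event; // carries frame number, view matrix, textures, fences, ...
  using EventPtr = std::shared_ptr<Event>;

  class EventQueue
  {
  public:
      void push( EventPtr e )
      {
          {
              std::lock_guard<std::mutex> lock( m_mutex );
              m_queue.push( std::move( e ) );
          }
          m_cv.notify_one();
      }

      EventPtr pop() // blocks until an event is available
      {
          std::unique_lock<std::mutex> lock( m_mutex );
          m_cv.wait( lock, [this]{ return !m_queue.empty(); } );
          EventPtr e = std::move( m_queue.front() );
          m_queue.pop();
          return e;
      }

  private:
      std::mutex              m_mutex;
      std::condition_variable m_cv;
      std::queue<EventPtr>    m_queue;
  };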
PRERENDER STEP
Optional pre-computation (e.g. load-balancing information)
Put the event token into the N render queues
Parallel execution begins here:
  auto event = inputQueue->pop();
  /* pre-computation code */
  for( auto& i : outputQueues )
  {
      i->push( event );
  }
RENDER STEP
N affinity contexts, each optimally rendering 1/Nth of the GPU load
“Manually” multiplex the scene resources to all threads
E.g. use scissor / depth / stencil buffer to confine the rendering area
Use the texture from the event token as render target
Insert a fence at the end to signal that the render step has finished (see the sketch below)
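A sketch of one render thread’s loop under these assumptions (the m_* members, renderScene, and the Event accessors are illustrative, not from the talk):

  void RenderStep::run()
  {
      for( ;; )
      {
          auto event = m_inputQueue->pop();           // CPU sync with pre-render step
          glBindFramebuffer( GL_FRAMEBUFFER, m_fbo ); // targets the event's texture
          glEnable( GL_SCISSOR_TEST );                // confine to this GPU's image slice
          glScissor( 0, m_sliceIndex * m_sliceHeight, m_width, m_sliceHeight );
          renderScene( event->viewMatrix() );         // this thread's copy of the scene
          event->setRenderFence( glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 ) );
          glFlush();                                  // push the fence to the GPU
          m_outputQueue->push( event );
      }
  }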
COPY STEP
N copy threads copying N textures
Wait for the fence from the preceding render thread
Copy the data from the render GPU to the display GPU
Use the textures from the event token as source & target
Insert a fence at the end to signal that the copy has finished (see the sketch below)
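The corresponding copy-thread sketch, combining the fence wait with the NV_copy_image transfer (again with illustrative member names):

  void CopyStep::run()
  {
      for( ;; )
      {
          auto event = m_inputQueue->pop();                          // CPU sync with render thread
          glWaitSync( event->renderFence(), 0, GL_TIMEOUT_IGNORED ); // GPU sync
          wglCopyImageSubDataNV( m_renderCtx,  event->renderTex(),  GL_TEXTURE_2D, 0, 0, 0, 0,
                                 m_displayCtx, event->displayTex(), GL_TEXTURE_2D, 0, 0, 0, 0,
                                 m_width, m_height, 1 );
          event->setCopyFence( glFenceSync( GL_SYNC_GPU_COMMANDS_COMPLETE, 0 ) );
          glFlush();
          m_outputQueue->push( event );
      }
  }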
COMPOSE STEP
Pop from the N event queues (CPU synchronization)
Perform N glWaitSync calls (GPU synchronization)
Take the N textures and combine their image data into the output image
Optional post-processing (overlays etc.)
Call SwapBuffers to present the frame (see the sketch below)
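A compose-thread sketch under the same assumptions (drawTexturedQuad and the member names are hypothetical helpers):

  void ComposeStep::run()
  {
      std::vector<EventPtr> events;
      for( ;; )
      {
          events.clear();
          for( auto& q : m_inputQueues )             // CPU sync: one event per GPU
              events.push_back( q->pop() );

          for( auto& e : events )
          {
              glWaitSync( e->copyFence(), 0, GL_TIMEOUT_IGNORED ); // GPU sync
              drawTexturedQuad( e->displayTex(), e->sliceRect() ); // reassemble image
          }
          // optional post-processing / overlays here
          SwapBuffers( m_displayDC );                // present the frame

          for( auto& e : events )
              m_idleQueue->push( e );                // recycle the tokens
      }
  }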
OVERVIEW
Motivation
Tools of the trade
  Multi-GPU driver functions
  Multi-GPU programming functions
Multi-threaded multi-GPU renderer
  General workflow
  Different applications
SLICING IMAGE SPACE
Fragment-bound scenario
Split the image up into N sub-images
Every GPU renders the same scene, just different image regions
Compose by reassembling the output image from the sub-images
Scales when the fragment load is distributed well
SLICING & COMPOSITION
RESULTS – SLICING IMAGE SPACE
[Charts: frame time vs. workload and scaling vs. workload, for 1–4 GPUs.]
SLICING VERTEX SPACE
Geometry-bound scenario
Split the scene up into N parts
Every GPU renders the same frustum, but with a different sub-scene
Compose the output image by depth comparison (see the shader sketch below)
Scales when the geometry is distributed well
Transfers full color and depth images
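The depth-comparison composition can be done in a fragment shader; a minimal sketch (GLSL embedded as a C++ string; the texture-array layout and uniform names are assumptions):

  static const char* composeFS = R"(
  #version 330 core
  uniform sampler2DArray colorTex; // N color slices, one layer per GPU
  uniform sampler2DArray depthTex; // N depth slices, one layer per GPU
  uniform int numSlices;
  in vec2 uv;
  out vec4 fragColor;
  void main()
  {
      float bestDepth = 1.0;
      vec4  bestColor = vec4( 0.0 );
      for( int i = 0; i < numSlices; ++i )
      {
          float d = texture( depthTex, vec3( uv, i ) ).r;
          if( d < bestDepth ) // keep the nearest fragment across sub-scenes
          {
              bestDepth = d;
              bestColor = texture( colorTex, vec3( uv, i ) );
          }
      }
      gl_FragDepth = bestDepth;
      fragColor    = bestColor;
  }
  )";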
SLICING & COMPOSITION
Each torus: 724,201 vertices / 722,500 faces
RESULTS – SLICING VERTEX SPACE (LO RES)
[Charts: frame time vs. #objects and scaling vs. #objects, for 1–4 GPUs.]
RESULTS – SLICING VERTEX SPACE (HI RES)
[Charts: frame time vs. #objects and scaling vs. #objects, for 1–4 GPUs.]
RESULTS – SLICING VERTEX SPACE (HI RES)
PCIe 2.0 x16 can transport ~700 Full HD images per second
Per displayed frame:
  4 Full HD color images
  4 Full HD depth images
700 / 8 = 87.5 fps max, i.e. at least 11.4 ms per frame
800x600 images: at least 2.6 ms per frame
4K images: at least 45.6 ms per frame
Improvements: compression / PCIe 3.0
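A quick cross-check of these numbers (assuming RGBA8 images and deriving the effective bandwidth from the ~700 images/s figure above):

  #include <cstdio>

  int main()
  {
      const double fullHDBytes = 1920.0 * 1080.0 * 4.0;  // ~8.3 MB per RGBA8 image
      const double bandwidth   = 700.0 * fullHDBytes;    // ~5.8 GB/s effective

      const double resolutions[][2] = { { 800, 600 }, { 1920, 1080 }, { 3840, 2160 } };
      for( auto& r : resolutions )
      {
          const double bytesPerFrame = 8.0 * r[0] * r[1] * 4.0; // 4 color + 4 depth
          printf( "%.0fx%.0f: at least %.1f ms per frame\n",
                  r[0], r[1], 1000.0 * bytesPerFrame / bandwidth );
      }
      return 0; // prints ~2.6 ms, ~11.4 ms, and ~45.7 ms
  }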
SLICING TIME
General GPU-bound scenario
Implement “SLI AFR” yourself: distribute whole frames across the GPUs
Every GPU renders a whole frame
No composition, just display the output image on the display GPU
Only scales without inter-frame dependencies
SLICING & COMPOSITION
RESULTS – SLICING TIME
[Charts: frame time vs. workload and scaling vs. workload, for 1–4 GPUs.]
RESULTS – SLICING TIME
[Diagram: frames n, n+1, and n+2 are in flight simultaneously; each frame passes through render(), copy(), and compose() while the following frames are already being processed.]
IN CLOSING
Other applications are possible, e.g.:
  Stereo rendering
  Volume rendering
  Shadow passes
Further questions?
iesser@nvidia.com