VULKAN TECHNOLOGY UPDATE Christoph Kubisch, NVIDIA GTC 2017 Ingo Esser, NVIDIA

AGENDA
  Device Generated Commands
  API Interop
  VR in Vulkan
  NSIGHT Support

VK_NVX_device_generated_commands

DEVICE GENERATED COMMANDS
  GPU creates its own work (drawcalls and compute)
  Define the workload in-pipeline, in-frame
  Reduce latency, as no CPU roundtrip is required (VR!)
  [Diagram: CPU to GPU roundtrip, 1-2 frames latency]
  Use any GPU-accessible resources to drive decision making (z-buffer etc.)
  Select level of detail, cull by occlusion, classify work into different state usage, ...

DEVICE GENERATED COMMANDS
  OpenGL Examples
  https://github.com/nvpro-samples/gl_dynamic_lod
    ARB_draw_indirect to classify how particles are drawn (point, mesh, tessellation)
  https://github.com/nvpro-samples/gl_occlusion_culling
    ARB_multi_draw_indirect / NV_command_list to do shader-based occlusion culling
  [Image: Reverse angle & bboxes of culled]
  Model courtesy of PGO Automobiles

EVOLUTION
  Draw Indirect: Typically change # primitives, # instances
  Multi Draw Indirect: Multiple draw calls with different index/vertex offsets
  GL_NV_command_list & DX12 ExecuteIndirect: Change shader input bindings for each draw
  VK_NVX_device_generated_commands: Change shader (pipeline state) per draw call

  Example command structures:
    DrawElements {
      GLuint indexCount;
      GLuint instanceCount;
      GLuint firstIndex;
      GLuint baseVertex;
      GLuint baseInstance;
    }
    UniformAddressCommandNV {
      GLuint header;
      GLushort index;
      GLushort stage;
      GLuint64 address;
    }
    DescriptorSetToken {
      GLuint objectTableIndex;
      GLuint offsets[];
    }

TRADITIONAL SETUP
  Shader classifies items into lists of indirect buffer storage
  Not all items may create work
  [Diagram: Set Pipeline A, Draw Indirects; Set Pipeline T, Draw Indirects; Set Pipeline G, Draw Indirects; Set Pipeline C, Draw Indirects]
  CPU-driven state setup is for worst-case distribution of indirect work
  May yield lots of needless state setup (imagine 100s of potentially-used pipelines)

NEW VULKAN ABILITY
  GPU classifies items with state assignment
  Optionally preserve ordering or provide permutation
    Draw Indirects with State: A G A G A G A G G
  Compact stream without unnecessary state setup or data overfetching
  Grouping by state is still recommended
    Draw Indirects with State: A A A A G G G G G
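The effect of grouping by state can be sketched on the CPU. The block below is only an illustrative model, not the extension's mechanism: `StatefulDraw`, `group_by_state` and `count_state_changes` are hypothetical names, and a stable counting sort stands in for whatever the GPU classification pass would do. It reorders the stream A G A G A G A G G from the slide into A A A A G G G G G while keeping the relative order of draws that share a state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical item: a draw tagged with the pipeline state it needs. */
typedef struct {
    char     pipeline;  /* state label, e.g. 'A' or 'G' */
    unsigned draw_id;   /* original position in the stream */
} StatefulDraw;

/* Stable counting sort by pipeline label: draws that share a state become
 * adjacent and keep their relative order. `out` must hold n elements. */
static void group_by_state(const StatefulDraw *in, StatefulDraw *out, size_t n)
{
    size_t start[256] = {0};
    size_t sum = 0;
    size_t i;
    unsigned c;

    for (i = 0; i < n; ++i)
        start[(unsigned char)in[i].pipeline]++;

    /* Exclusive prefix sum turns per-label counts into start offsets. */
    for (c = 0; c < 256; ++c) {
        size_t count = start[c];
        start[c] = sum;
        sum += count;
    }

    for (i = 0; i < n; ++i)
        out[start[(unsigned char)in[i].pipeline]++] = in[i];
}

/* Number of state switches a consumer of the stream would see,
 * counting the initial state setup as one switch. */
static size_t count_state_changes(const StatefulDraw *s, size_t n)
{
    size_t changes = 0;
    size_t i;
    for (i = 0; i < n; ++i)
        if (i == 0 || s[i].pipeline != s[i - 1].pipeline)
            ++changes;
    return changes;
}
```

For the nine draws on the slide, the ungrouped stream needs eight state switches and the grouped stream needs two.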

PIPELINE CHANGES
  Add command-related work on the GPU to be more efficient at the actual tasks
  Make use of shader specialization (less dynamic branching, more aggressive compile-time optimizations...)
  Shader level of detail
  Partition & organize work by shader permutation or usage pattern

STATELESS DESIGN
  [Diagram: CPU Commands, then Device-Generated Commands (bind bind draw, bind bind draw), then CPU Commands; State Access]
  CPU-provided state is inherited
  Stateful within single command sequence
  Modified state is undefined for subsequent sequences or CPU commands

OVERVIEW
  [Diagram, three columns]
  Sequence & CPU Arguments: VkIndirectCommandsLayout made of VkIndirectCommandsTokens: BindVertexBuffer (binding), Draw
  GPU-Written Arguments: one Buffer per token, uint32[]; the BindVertexBuffer buffer holds 2,256 and 0,0; the Draw buffer holds ..
  Resources: VkObjectTable with [0] Buffer A, [1] Buffer B, [2] Buffer C
  VkCmdProcessCommands writes into Reserved CommandBuffer Space: VkCmdBindVertexBuffer (binding, Buffer C, 256), VkCmdDraw(..), VkCmdBind.., VkCmdDraw

WORKFLOW
  Define a stateless sequence of commands as VkIndirectCommandsLayout
  Register Vulkan resources (VkBuffer, VkDescriptorSet, VkPipeline) in VkObjectTable at developer-managed index
  Fill & modify VkBuffers with command arguments and object table indices for many sequences
  Use VkCmdReserveSpaceForCommands to allocate command buffer space
  Generate the commands from token buffer content via VkCmdProcessCommands
  Execute via VkCmdExecuteCommands
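The workflow can be modelled on the CPU to show how object table indices in the token buffers resolve to resources. This is a minimal sketch under stated assumptions: every type and function below (`ObjectTable`, `register_buffer`, `reserve_space`, `process_commands`, ...) is a hypothetical stand-in and not the Vulkan API, the layout is fixed to one vertex-buffer bind followed by one draw, and a handle value of 0 means "nothing registered".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical stand-ins, not the Vulkan API. */
enum { MAX_OBJECTS = 8, COMMANDS_PER_SEQUENCE = 2 };

typedef enum { CMD_BIND_VERTEX_BUFFER, CMD_DRAW } CommandType;

/* Resources registered at developer-managed indices; 0 means empty. */
typedef struct { uint64_t buffers[MAX_OBJECTS]; } ObjectTable;

/* Per-sequence arguments, one input stream per token. */
typedef struct { uint32_t object_index, offset; } BindVertexArgs;
typedef struct { uint32_t vertex_count, first_vertex; } DrawArgs;

typedef struct {
    CommandType type;
    uint64_t    buffer;      /* resolved handle (bind only) */
    uint32_t    arg0, arg1;  /* bind: offset, unused; draw: count, first */
} Command;

static int register_buffer(ObjectTable *t, uint32_t index, uint64_t handle)
{
    if (index >= MAX_OBJECTS || handle == 0)
        return 0;
    t->buffers[index] = handle;
    return 1;
}

/* Worst-case space: every sequence emits every command of the layout. */
static size_t reserve_space(size_t max_sequences)
{
    return max_sequences * COMMANDS_PER_SEQUENCE;
}

/* Expands the token streams into the reserved space. Returns the number of
 * commands written, or 0 if the space is too small or an index is invalid. */
static size_t process_commands(const ObjectTable *table,
                               const BindVertexArgs *binds,
                               const DrawArgs *draws,
                               size_t sequence_count,
                               Command *reserved, size_t reserved_count)
{
    size_t written = 0;
    size_t s;

    if (sequence_count * COMMANDS_PER_SEQUENCE > reserved_count)
        return 0;

    for (s = 0; s < sequence_count; ++s) {
        uint32_t index = binds[s].object_index;
        if (index >= MAX_OBJECTS || table->buffers[index] == 0)
            return 0;

        reserved[written].type = CMD_BIND_VERTEX_BUFFER;
        reserved[written].buffer = table->buffers[index];
        reserved[written].arg0 = binds[s].offset;
        reserved[written].arg1 = 0;
        ++written;

        reserved[written].type = CMD_DRAW;
        reserved[written].buffer = 0;
        reserved[written].arg0 = draws[s].vertex_count;
        reserved[written].arg1 = draws[s].first_vertex;
        ++written;
    }
    return written;
}
```

With buffers registered at indices 0, 1 and 2, a bind argument of (2, 256) resolves to the third buffer at offset 256, which mirrors the "2,256" entry in the overview diagram.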

SEPARATE GENERATION & EXECUTION
  [Diagram: Primary CmdBuffer: VkCmdProcessCommands, Barrier, VkCmdExecuteCommands ...; Secondary CmdBuffer: VkCmdReserveSpace...]
  Record an array of command sequences into the reserved space
  Reuse commands, or reuse reserved space for another generation
  Generate & Execute as single action is also supported

OBJECT TABLE
  ObjectTable behaves similarly to DescriptorPool
  Do not delete it, nor modify resource indices that may be in-flight
  [Diagram: VkObjectTable with [0] Buffer A; CPU timeline: VkRegisterResource(..., 0); GPU timeline: VkCmdProcessCommands]

OBJECT TABLE
  CommandBuffer reservation depends on ObjectTable's state
  Use only those resources that were registered at reservation time
  [Diagram: VkObjectTable with [0] Buffer A, later with [0] Buffer A and [1] Buffer B; CPU timeline: VkCmdReserve..., VkRegister...(..,1), VkCmdProcess...; GPU timeline: VkCmdProcessCommands]
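The reservation rule can be illustrated with a small model. This is a sketch of the constraint only, with hypothetical names (`ObjectTableModel`, `Reservation`, `index_is_usable`) and an assumed limit of 32 table entries so that a bitmask can record what was registered; the real driver's bookkeeping is not described in the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: one bit per object table index (at most 32). */
typedef struct { uint32_t registered_mask; } ObjectTableModel;

/* A reservation remembers which indices existed when it was made. */
typedef struct { uint32_t mask_at_reservation; } Reservation;

static void register_resource(ObjectTableModel *t, uint32_t index)
{
    if (index < 32u)
        t->registered_mask |= (1u << index);
}

static Reservation reserve_commands(const ObjectTableModel *t)
{
    Reservation r;
    r.mask_at_reservation = t->registered_mask;
    return r;
}

/* Only resources registered before the reservation may be referenced
 * by sequences generated into that reserved space. */
static int index_is_usable(const Reservation *r, uint32_t index)
{
    return index < 32u && ((r->mask_at_reservation >> index) & 1u) != 0u;
}
```

In this model, an index registered after a reservation becomes usable only for space reserved afterwards.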

INDIRECT COMMANDS
  VK_INDIRECT_COMMANDS_TOKEN    EQUIVALENT COMMAND & GPU-WRITTEN ARGUMENTS
  _PIPELINE_NVX                 vkCmdBindPipeline (… pipeline)
  _DESCRIPTOR_SET_NVX           vkCmdBindDescriptorSets (… descrSet, offsets)
  _INDEX_BUFFER_NVX             vkCmdBindIndexBuffer (… buffer, offset)
  _VERTEX_BUFFER_NVX            vkCmdBindVertexBuffers (… buffer, offset)
  _PUSH_CONSTANT_NVX            vkCmdPushConstants (... data)
  _DRAW_INDEXED_NVX             vkCmdDrawIndexed ( *all* )
  _DRAW_NVX                     vkCmdDraw ( *all* )
  _DISPATCH_NVX                 vkCmdDispatch ( *all* )

MULTIPLE INPUT STREAMS
  Traditional approaches used a single interleaved stream (array of structures, AoS)
  VK extension uses input streams (SoA), allows individual re-use and efficient updates on input
  [Diagram: command sequences 0 and 1, each made of Command A, Command B, Command C; interleaved in one Buffer (AoS) versus one Buffer per command (SoA); buffers with Individual Input Rate (0 1) and Common Input Rate (0,1 or 0,1,..)]
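A sketch of how per-token input streams with different input rates could be addressed. The `InputStream` struct, its `stride` and `divisor` fields and the `fetch_arguments` helper are assumptions made for illustration; a divisor of 1 models an individual input rate (one element per sequence) and a larger divisor models a common input rate (one element shared by several consecutive sequences).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one input stream (one token's arguments). */
typedef struct {
    const uint32_t *data;     /* tightly packed uint32 arguments */
    uint32_t        stride;   /* uint32 words per element */
    uint32_t        divisor;  /* sequences sharing one element, 0 treated as 1 */
} InputStream;

/* Returns the arguments that sequence `sequence` reads from stream `s`. */
static const uint32_t *fetch_arguments(const InputStream *s, uint32_t sequence)
{
    uint32_t divisor = s->divisor ? s->divisor : 1u;
    return s->data + (size_t)(sequence / divisor) * s->stride;
}
```

Because each token has its own stream, a frequently updated stream (for example draw arguments) can be rewritten without touching streams that rarely change.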

FLEXIBLE SEQUENCING
  Ordered Sequences: default monotonic order of command sequences
    0 1 2 3 4 5 6 7
  Unordered / Subset: allow impl.-dependent ordering (incoherent)
    3 2 0 1
  Custom Subset: provide sequence indices as additional GPU buffer
    2 5 1 4
  Number of sequences provided by CPU argument; actual number provided by GPU buffer
  [Diagram labels: 8 (CPU Argument), 4 (Buffer), 4 (Buffer)]
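The sequencing options can be sketched as a small resolver. This is an illustrative CPU model with hypothetical names (`resolve_sequences`, `gpu_count`, `index_buffer`); it assumes the CPU-provided number acts as an upper bound that a GPU-written count can only lower, and that an optional index buffer replaces the default monotonic order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the sequence indices to process into `out` (capacity of at least
 * max_sequences) and returns how many there are.
 *   max_sequences: upper bound given by the CPU
 *   gpu_count:     optional GPU-written actual count (NULL if unused)
 *   index_buffer:  optional GPU-written sequence indices (NULL for 0,1,2,..) */
static size_t resolve_sequences(uint32_t max_sequences,
                                const uint32_t *gpu_count,
                                const uint32_t *index_buffer,
                                uint32_t *out)
{
    uint32_t count = max_sequences;
    uint32_t i;

    if (gpu_count != NULL && *gpu_count < count)
        count = *gpu_count;

    for (i = 0; i < count; ++i)
        out[i] = (index_buffer != NULL) ? index_buffer[i] : i;

    return count;
}
```

Called with a bound of 8, a GPU count of 4 and the index buffer 2 5 1 4, the model processes exactly those four sequences in that order.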

TEST BENCHMARK
  200,000 Drawcalls (few triangles/lines)
  45,000 Pipeline switches (lines vs triangles)
  6 Tokens:
    Pipeline
    DescriptorSet (1 ubo + 1 offset)
    DescriptorSet (1 ubo + 1 offset)
    VertexBuffer + 1 offset
    IndexBuffer + 1 offset
    DrawIndexed
  https://github.com/nvpro-samples/gl_vk_threaded_cadscene/blob/master/doc/vulkan_nvxdevicegenerated.md

TEST BENCHMARK
  200 000 drawcalls, 45 000 PSO changes
                             GENERATE                   EXECUTE
    Driver (CPU 1 thread)    8.74 ms (async, on CPU)    14.74 ms
    Device Gen. Cmds         0.35 ms                    8.12 ms
  100 000 drawcalls, no PSO changes
                             GENERATE                   EXECUTE
    Driver (CPU 1 thread)    3.8 ms (async, on CPU)     1.8 ms
    Device Gen. Cmds         0.20 ms                    1.8 ms
  The test benchmark is a very simplified scenario; your mileage will vary

NVIDIA IMPLEMENTATION
  Currently experimental extension, feedback welcome (design, performance etc.)
  VkIndirectCommandsLayout generates internal compute shader
  Compute shader stitches the command buffer from data stored in the VkObjectTable
  Implements redundant state filter within local workgroup
  Reserved command buffer space has to be allocated for worst-case scenario
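A redundant state filter that is local to a workgroup can be sketched as follows. This is a CPU model under assumptions, not NVIDIA's compute shader: `emit_pipeline_binds` is a hypothetical helper, consecutive sequences are assumed to map to consecutive workgroup invocations, and the filter is assumed to reset at each workgroup boundary because a workgroup cannot see the state left by the previous one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emits one pipeline bind per sequence unless the previous sequence in the
 * same workgroup already bound the same pipeline. `binds_out` needs room for
 * n entries, which is also the worst case the reservation must cover.
 * Returns the number of binds actually emitted. */
static size_t emit_pipeline_binds(const uint32_t *pipeline_per_sequence,
                                  size_t n, size_t workgroup_size,
                                  uint32_t *binds_out)
{
    size_t emitted = 0;
    size_t i;

    if (workgroup_size == 0)
        return 0;

    for (i = 0; i < n; ++i) {
        int first_in_group = (i % workgroup_size) == 0;
        if (first_in_group ||
            pipeline_per_sequence[i] != pipeline_per_sequence[i - 1])
            binds_out[emitted++] = pipeline_per_sequence[i];
    }
    return emitted;
}
```

The sketch also shows why grouping by state helps: the fewer state changes inside a workgroup, the less of the worst-case reservation is actually filled.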

NVIDIA IMPLEMENTATION
  Global memory used internally to stitch command buffer
  Previous 200,000 drawcall example reserved ~35 and generated ~15 megs
  [Diagram: VkObjectTable (Pipelines, DescriptorSets) with variable GPU command sizes per object; Command Space (Bind Bind Draw) with reserved size for worst-case]

    struct GeneratingTask {
      uint maxSequences;
      uvec4 sequenceRawSizes;
      uint* outputBuffer;
      uint* inputBuffers[MAX_INPUTS];
      ...
    };

    struct ObjectTable {
      uint pipelinesCount;
      uint descriptorsetsCount;
      uint vertexbuffersCount;
      uint indexbuffersCount;
      uint pushconstantCount;
      uint pipelinesetsCount;
      ResourcePipeline* pipelines;
      ResourceDescriptorSet* descriptorsets;
      ResourceVertexBuffer* vertexbuffers;
      ResourceIndexBuffer* indexbuffers;
      ResourcePushConstant* pushconstants;
      ResourcePipelineSet* pipelinesets;
      uint* rawPipelines;
      uint* rawDescriptorsets;
      uint* rawVertexbuffers;
      uint* rawIndexbuffers;
      uint* rawPushconstants;
      uint* rawPipelinesets;
      uvec2* pipelinediffs;
      uint* rawPipelinediffs;
    };

    layout(std140,binding=0) uniform tableUbo {
      ObjectTable table;
    };

    layout(std140,binding=1) uniform taskUbo {
      GeneratingTask task;
    };

CONCLUSION
  GPU-generating will get slower with divergent resource usage
  Still important to group by state, helps both CPU and GPU
  CPU-generating is asynchronous to device, may not add to frame-time
  GPU-generating is on device, best used to save work, not to offload work

CROSS API INTEROP

CROSS API INTEROP
  Generic framework led by Khronos
  Share device memory & synchronization primitives across APIs and processes
  Created in context of Vulkan, but not exclusive to it
  Vulkan, OpenGL, DirectX (11,12), others may follow

EXTERNAL MEMORY
  VK_KHX_external_memory (& friends)
  New extensions to share memory objects across APIs
  VkMemoryAllocateInfo was extended
    VkImportMemory*Platform*HandleInfoKHX to reference memory owned by other instances of the same device
    VkExportMemory*Platform*HandleInfoKHX to make memory accessible to other instances
  VkGetMemory*Platform*KHX to query platform handle
