Handling Massive Transform Updates in a SceneGraph


  1. April 4-7, 2016 | Silicon Valley. HANDLING MASSIVE TRANSFORM UPDATES IN A SCENEGRAPH. Markus Tavenrath, Senior Developer Technology Engineer. March 5th, 2016.

  2. MOTIVATION. The nvpro-pipeline is good for static scenes, but what about dynamic content? There, the update cost dominates the rendering time, and transform updates offer the biggest potential. The TransformTree is now a separate module and adds CUDA support.

  3. TRANSFORM TREE: What is a TransformTree? A TransformTree is the unfolded SceneGraph containing nothing more than the transforms. [Diagram: a SceneGraph with groups G0 and G1, transforms T0-T3, and shapes S0-S2 is unfolded into a SceneTree below a virtual root node T*; transforms reached through multiple paths appear twice (T2/T3 and T2'/T3').]

  4. TRANSFORM TREE: Usage. What can a TransformTree be used for? Incremental computation of world matrices, i.e. T2.world = T2.local * T0.local * T*.local. What is the world transform used for? Rendering (shader), culling, bounding box computation, collision detection, ...

  5. TRANSFORM TREE: Interface.

     class TransformTree {
     public:
       TransformIndex addTransform(TransformIndex parent, Mat44f const & local);
       void removeTransform(TransformIndex transformIndex);
       void setLocalTransform(TransformIndex transformIndex, Mat44f const & local);

       // getting world matrices will have a cost if the data is on the GPU
       Mat44f const & getWorldTransform(TransformIndex index);

       // interface; implementations exist for CPU and GPU (CUDA/Vulkan/OpenGL)
       virtual void process() = 0;
     };
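     To make the intended call pattern concrete, a minimal usage sketch against this
     interface follows; the function and parameter names are illustrative only and
     assume some concrete TransformTree implementation is bound to 'tree'.

       // Build a small two-level hierarchy below an existing node, animate it,
       // and read one world matrix back.
       void example(TransformTree & tree, TransformIndex root,
                    Mat44f const & localA, Mat44f const & localB)
       {
         TransformIndex a = tree.addTransform(root, localA);
         TransformIndex b = tree.addTransform(a, localB);

         // Per frame: change a local matrix, then let process() propagate it.
         tree.setLocalTransform(a, localB);
         tree.process();                                  // CPU or CUDA implementation

         // Reading a world matrix back may cost a GPU -> CPU synchronization.
         Mat44f world = tree.getWorldTransform(b);
         (void)world;
       }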

  6. TRANSFORM TREE: General Data Structure. The common data structure keeps everything in flat arrays indexed by transform: a hierarchy array (level and parent index per transform), a FreeVector (one bit per index, 1 = index is free), a DirtyVector (one bit per index, 1 = index has changed), the local matrices as input, and the world matrices as output. Keep everything as local as possible, no pointer chasing. [Diagram: example bit patterns for the FreeVector and DirtyVector next to the hierarchy, local-matrix and world-matrix arrays.]
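     A possible C++ layout of these flat arrays, matching the slide (a sketch; the
     member names are assumptions and Mat44f is the matrix type from the interface
     above, not the actual nvpro-pipeline declarations):

       #include <cstdint>
       #include <vector>

       // Flat, index-based storage: no per-node objects, no pointer chasing.
       struct TransformTreeData
       {
         std::vector<uint32_t> parent;     // hierarchy: parent index per transform
         std::vector<uint32_t> level;      // hierarchy: depth below the virtual root
         std::vector<uint32_t> freeBits;   // 1 bit per index, 1 = index is free
         std::vector<uint32_t> dirtyBits;  // 1 bit per index, 1 = local matrix changed
         std::vector<Mat44f>   local;      // input:  local transforms
         std::vector<Mat44f>   world;      // output: accumulated world transforms
       };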

  7. IMPLEMENTATION: CPU. The CPU implementation keeps a list of indices for each level to minimize traversal:

     void process() {
       for (auto level : levels) {
         for (auto index : level.indices) {
           if (dirtyLocal[index] || dirtyWorld[index.parent]) {
             world[index] = local[index] * world[index.parent];
             dirtyWorld[index] = true;
           }
         }
       }
       notify(dirtyWorld);
       clear(dirtyLocal);
       clear(dirtyWorld);
     }
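     The per-level index lists used by this loop could be maintained incrementally
     whenever a transform is added; a sketch under that assumption (not the actual
     nvpro-pipeline code):

       // One index list per depth, so process() walks the tree breadth-first
       // and a parent's world matrix is always updated before its children's.
       struct Level { std::vector<uint32_t> indices; };
       std::vector<Level> levels;

       uint32_t registerTransform(uint32_t transformIndex, uint32_t parentLevel)
       {
         uint32_t myLevel = parentLevel + 1;
         if (levels.size() <= size_t(myLevel))
           levels.resize(myLevel + 1);
         levels[myLevel].indices.push_back(transformIndex);
         return myLevel;                 // stored alongside the parent index
       }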

  8. RESULTS: CPU. A Xeon E5-2630 v3 processes ~15 million transforms per second. Assuming a 1 ms budget for the transform hierarchy, roughly 16k transforms are possible. The real problem is elsewhere: passing the updated transforms through the pipeline to the GPU can cost several times as much as the update itself. Is it possible to keep the transform computations completely on the GPU?

  9. CUDA IMPLEMENTATION. Implement the animation system on the GPU and share the result with the renderer. [Pipeline: CPU -> GPU upload (optional), Process Level 0 … Process Level n (required), GPU -> CPU download (optional).] Is there enough parallelism? What is the fixed cost per stage?

  10. CUDA IMPLEMENTATION: Parallelism. A Quadro K6000 can execute 2,880 threads in parallel. How to saturate this high number of threads? One matrix multiplication per thread is not enough; compute one matrix component per thread instead, which increases the number of required threads by a factor of 16.

  11. PROPAGATION: Matrix multiplication. Each half warp (16 threads) executes one multiplication. [Diagram: result, local and parent world matrices shown as 4x4 grids labeled with per-thread ids (x, y, z).]

  12. PROPAGATION: Matrix multiplication. Each thread computes one element of the result matrix. [Diagram: the same 4x4 grids of per-thread ids as on the previous slide.]

  13. PROPAGATION: Matrix multiplication. Thread (t.x, t.y, z) reads one local and one world matrix component. The reads are coalesced, so there is one memory transaction per matrix. The whole matrix state is then held within the half warp and can be distributed with __shfl. [Diagram: 4x4 grids of the local and world matrices annotated with the owning thread ids.]

  14. PROPAGATION: Multiplication. The full local and world state is known through shuffles, so each thread can grab the required values per iteration:

     // base1 = (Tid.y & 3) * 4 + (Tid.z & 1) * 16
     // base2 =  Tid.x          + (Tid.z & 1) * 16
     world(x,y,z) = __shfl(local, base1 + 0) * __shfl(world, base2 + 0)
                  + __shfl(local, base1 + 1) * __shfl(world, base2 + 4)
                  + __shfl(local, base1 + 2) * __shfl(world, base2 + 8)
                  + __shfl(local, base1 + 3) * __shfl(world, base2 + 12);
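     Written out as a self-contained device function, the half-warp multiplication
     might look like the sketch below. It uses the newer __shfl_sync intrinsic
     (CUDA 9+) instead of the __shfl shown on the slide; the lane arithmetic follows
     base1/base2 above, and the function name is illustrative.

       // Each of the 16 threads of a half warp owns one element of a 4x4 matrix:
       //   threadIdx.x = column (0..3), threadIdx.y = row (0..3),
       //   lane        = threadIdx.x + 4 * threadIdx.y + 16 * (threadIdx.z & 1).
       // Returns element (threadIdx.x, threadIdx.y) of local * parentWorld.
       __device__ float halfWarpMultiply(float localElem, float parentWorldElem)
       {
         unsigned mask  = __activemask();                  // lanes currently executing
         int      half  = (threadIdx.z & 1) * 16;          // lane offset of our half warp
         int      base1 = (threadIdx.y & 3) * 4 + half;    // first lane of our row in 'local'
         int      base2 = (threadIdx.x & 3) + half;        // first lane of our column in 'parentWorld'

         float r = 0.0f;
         for (int k = 0; k < 4; ++k)
           r += __shfl_sync(mask, localElem, base1 + k)            // k-th element of our row of local
              * __shfl_sync(mask, parentWorldElem, base2 + 4 * k); // k-th element of our column of parent world
         return r;
       }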

  15. CUDA IMPLEMENTATION: Algorithm.

     void process() {
       upload();
       for (auto level : levels) {                       // kernel launch per level
         for (auto index : level.indices) {              // warp id specifies the index
           if (dirtyLocal[index] || dirtyWorld[index.parent]) {
             parallelMultiply(world[index], local[index], world[index.parent]);
             if (threadIdx.x == 0 && threadIdx.y == 0)   // mark dirty only once per matrix
               atomicOr(&dirtyWorld[index / 32], 1 << (index & 31));  // atomic, avoid conflicts
           }
         }
       }
       download();
       notify(dirtyWorld);
       clear(dirtyLocal);
       clear(dirtyWorld);
     }
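     The host side of that loop amounts to one kernel launch per hierarchy level. A
     sketch of how this could look is given below; the kernel name, its parameter
     list and the launch shape are assumptions, not the actual nvpro-pipeline code.

       #include <cstdint>
       #include <vector>

       // Declared elsewhere: one thread per matrix element, two matrices per warp.
       __global__ void propagateLevelKernel(uint32_t const * levelIndices, uint32_t count,
                                            uint32_t const * parent,
                                            float const * local, float * world,   // 16 floats per matrix
                                            uint32_t const * dirtyLocal, uint32_t * dirtyWorld);

       void processLevelsOnGpu(std::vector<std::vector<uint32_t>> const & levels,
                               uint32_t const * const * d_levelIndices,            // per-level device arrays
                               uint32_t const * d_parent, float const * d_local, float * d_world,
                               uint32_t const * d_dirtyLocal, uint32_t * d_dirtyWorld)
       {
         for (size_t l = 0; l < levels.size(); ++l)
         {
           uint32_t count = uint32_t(levels[l].size());
           if (count == 0) continue;

           dim3 block(4, 4, 2);                    // 32 threads = 1 warp = 2 matrices
           dim3 grid((count + 1) / 2);             // one half warp per matrix in this level
           propagateLevelKernel<<<grid, block>>>(d_levelIndices[l], count,
                                                 d_parent, d_local, d_world,
                                                 d_dirtyLocal, d_dirtyWorld);
         }
       }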

  16. RESULTS: Matrix multiplication on the GPU. The current version does ~300 million transforms/s on a K6000. It is bandwidth-limited by the dirty bits and the hierarchy, not by the transforms themselves: those reads are not coalesced, so 128-byte transactions are generated. Solving the inefficient hierarchy memory access pattern could bring ~900 million transforms/s (work in progress).

  17. GPU IMPLEMENTATION. Each level is a separate kernel launch (Multiply Level 0, Multiply Level 1, Multiply Level 2) with a minimum cost of ~5 us per launch. The total minimum cost adds up to ~15 us for 3 levels, so deep hierarchies might be bad. There are ways to reduce this cost, not yet addressed.

  18. CUDA IMPLEMENTATION: Results, worst case. 16k matrices, all changed, 2 levels in the hierarchy, result required on the CPU. CPU -> GPU ~100 us, processing the levels ~50 us, GPU -> CPU ~100 us, ~250 us in total: 4x faster than a single CPU core, but the transfers kill the improvement.

  19. CUDA IMPLEMENTATION: Results, medium case. 1 matrix changed (top level), 2 levels in the hierarchy, result required on the CPU. CPU -> GPU ~10 us, processing ~50 us, GPU -> CPU ~100 us, ~160 us in total: 6.25x faster than a single CPU core, but the transfers kill the improvement.

  20. CUDA IMPLEMENTATION: Results, medium case. 1 matrix changed (top level), 2 levels in the hierarchy, result not required on the CPU. CPU -> GPU ~10 us, processing ~50 us, GPU -> CPU 0 us, ~60 us in total: 16x faster than a single CPU core once the download transfer is eliminated.

  21. DATA ON GPU – USE CASES: Graphics interop. No need to transfer the data from the CPU to the GPU, which saves PCI-E bandwidth (~100 us for 16k matrices). Graphics usually also needs the inverse transpose; computing it on the GPU saves CPU time again and saves even more PCI-E bandwidth (another ~100 us for 16k matrices).
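     For OpenGL, keeping the result on the GPU can use standard CUDA graphics
     interop: register the renderer's world-matrix buffer once, then map it each
     frame and let the propagation kernels write into it directly. A sketch follows;
     'worldMatrixBuffer' is an assumed, already created GL buffer object, and this
     is the generic interop pattern rather than the talk's code.

       #include <cuda_gl_interop.h>

       // Once at startup: register the GL buffer that holds the world matrices.
       cudaGraphicsResource_t worldResource = nullptr;
       cudaGraphicsGLRegisterBuffer(&worldResource, worldMatrixBuffer,
                                    cudaGraphicsRegisterFlagsWriteDiscard);

       // Per frame, around the per-level propagation kernels:
       cudaGraphicsMapResources(1, &worldResource);
       float * d_world = nullptr;
       size_t  bytes   = 0;
       cudaGraphicsResourceGetMappedPointer(reinterpret_cast<void **>(&d_world),
                                            &bytes, worldResource);
       // ... launch the propagation kernels with d_world as their output array ...
       cudaGraphicsUnmapResources(1, &worldResource);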

  22. DATA ON GPU – USE CASES: Culling. Frustum culling is quite efficient if the data is already on the GPU. Bounding box generation: the bounding box of the scene might be required for the near/far plane computation.
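     To illustrate how cheap frustum culling becomes once the data already lives on
     the GPU, a minimal bounding-sphere test kernel might look like this (not shown
     in the talk; the six planes are assumed to have their normals pointing into the
     frustum):

       // One thread per object: test a world-space bounding sphere against the
       // six frustum planes given as (a, b, c, d) with inward-facing normals.
       __global__ void frustumCullKernel(float4 const * spheres,   // xyz = center, w = radius
                                         float4 const * planes,    // 6 frustum planes
                                         unsigned int * visible,   // 1 = inside or intersecting
                                         unsigned int   count)
       {
         unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         if (i >= count) return;

         float4 s = spheres[i];
         unsigned int inside = 1;
         for (int p = 0; p < 6; ++p)
         {
           float4 pl = planes[p];
           float  d  = pl.x * s.x + pl.y * s.y + pl.z * s.z + pl.w;
           if (d < -s.w) { inside = 0; break; }   // sphere completely behind this plane
         }
         visible[i] = inside;
       }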

  23. CONCLUSION / FUTURE. The transform hierarchy can be evaluated on the GPU quite fast. If the result is required on the CPU, the gain is limited by the transfer time; this is solvable with interop or by moving the algorithm to Vulkan/OpenGL. If the input data comes from the CPU, the gain might be limited depending on the number of changes; the remedy is to animate the matrices on the GPU as well.

  24. April 4-7, 2016 | Silicon Valley THANK YOU
