HANDLING MASSIVE TRANSFORM UPDATES IN A SCENEGRAPH
Markus Tavenrath


SLIDE 1

April 4-7, 2016 | Silicon Valley

Markus Tavenrath, March 5th 2016 Senior Developer Technology Engineer

HANDLING MASSIVE TRANSFORM UPDATES IN A SCENEGRAPH

SLIDE 2

MOTIVATION

  • nvpro-pipeline is good for static scenes, but what about dynamic content?
  • The update cost dominates the rendering time
  • Big potential: transform updates
  • The TransformTree is now a separate module and adds CUDA support

4/3/2016

SLIDE 3

TRANSFORM TREE

What is a TransformTree


[Diagram: a SceneGraph with group nodes G0 and G1, transforms T0–T3, and shapes S0–S2, unfolded into a TransformTree with virtual root T* above T0–T3 and duplicated nodes T2‘ and T3‘]

  • SceneGraph → SceneTree → TransformTree
  • The TransformTree is the unfolded SceneGraph with nothing more than the transforms
  • A virtual root node is added

SLIDE 4

TRANSFORM TREE

Usage?


What can a TransformTree be used for?
  • Incremental computation of world matrices, i.e. T2.world = T2.local * T0.local * T*.local

What can the world transform be used for?
  • Rendering (shader)
  • Culling
  • Bounding box computation
  • Collision detection
  • …
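The incremental formula above can be made concrete with a small sketch; the matrix type and helper names here are illustrative, not the nvpro-pipeline API:

```cpp
#include <array>
#include <cassert>
#include <vector>

using Mat44f = std::array<float, 16>; // row-major 4x4, column-vector convention

// C = A * B
inline Mat44f mul(const Mat44f& a, const Mat44f& b) {
    Mat44f c{};
    for (int row = 0; row < 4; ++row)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                c[row * 4 + col] += a[row * 4 + k] * b[k * 4 + col];
    return c;
}

inline Mat44f translation(float x, float y, float z) {
    Mat44f m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    m[3] = x; m[7] = y; m[11] = z;
    return m;
}

// world(node) = local(node) * world(parent), evaluated along the parent path,
// e.g. T2.world = T2.local * T0.local * T*.local for the tree on this slide
Mat44f worldOf(const std::vector<Mat44f>& locals,
               const std::vector<int>& parent, int node) {
    Mat44f w = locals[node];
    for (int p = parent[node]; p >= 0; p = parent[p])
        w = mul(w, locals[p]); // post-multiply each ancestor's local matrix
    return w;
}
```

With pure translations the composition is easy to check by hand: the offsets just add up along the path to the root.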

[Diagram: TransformTree with virtual root T* above T0–T3, T2‘ and T3‘]

SLIDE 5

TRANSFORM TREE

Interface

class TransformTree
{
public:
    TransformIndex addTransform(TransformIndex parent, Mat44f const & local);
    void removeTransform(TransformIndex transformIndex);
    void setLocalTransform(TransformIndex transformIndex, Mat44f const & local);

    // getting world matrices will have a cost if the data is on the GPU
    Mat44f const & getWorldTransform(TransformIndex index);

    // interface; implementations for CPU or GPU (CUDA/Vulkan/OpenGL)
    virtual void process() = 0;
};


SLIDE 6

TRANSFORM TREE

General Data Structure


[Diagram: four parallel arrays indexed by TransformIndex — local matrices (input) with a DirtyVector, world matrices (output), and a FreeVector — plus per-entry hierarchy level and parent index]

  • The common data structure has 4 arrays: local matrices (input), world matrices (output), a dirty vector (“index has changed”) and a free vector (“index is free”)
  • Each entry also stores its hierarchy level and parent index
  • Keep everything as local as possible, no pointer chasing
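A structure-of-arrays sketch of this layout (member names are mine, not the actual nvpro-pipeline ones); indices replace pointers, so traversal never chases pointers, and freed slots are reused before the arrays grow:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

using Mat44f = std::array<float, 16>;

struct TransformStorage {
    std::vector<Mat44f>   local;      // input: local matrices
    std::vector<Mat44f>   world;      // output: world matrices
    std::vector<uint8_t>  dirtyLocal; // 1 = "index has changed"
    std::vector<uint8_t>  freeFlag;   // 1 = "index is free" (the FreeVector)
    std::vector<uint32_t> level;      // hierarchy level, 0 = virtual root
    std::vector<uint32_t> parent;     // parent index

    // reuse a free slot if possible, otherwise append to all arrays
    uint32_t allocate(uint32_t parentIndex) {
        for (uint32_t i = 0; i < freeFlag.size(); ++i) {
            if (freeFlag[i]) {
                freeFlag[i]   = 0;
                parent[i]     = parentIndex;
                level[i]      = level[parentIndex] + 1;
                dirtyLocal[i] = 1;
                return i;
            }
        }
        local.emplace_back();
        world.emplace_back();
        dirtyLocal.push_back(1);
        freeFlag.push_back(0);
        level.push_back(parent.empty() ? 0 : level[parentIndex] + 1);
        parent.push_back(parentIndex);
        return uint32_t(parent.size() - 1);
    }

    void release(uint32_t i) { freeFlag[i] = 1; }
};
```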

SLIDE 7

IMPLEMENTATION

CPU

CPU implementation keeps list of indices for each level to minimize traversal


void process()
{
    for (auto level : levels)
    {
        for (auto index : level.indices)
        {
            if (dirtyLocal[index] || dirtyWorld[index.parent])
            {
                world[index] = local[index] * world[index.parent];
                dirtyWorld[index] = true;
            }
        }
    }
    notify(dirtyWorld);
    clear(dirtyLocal);
    clear(dirtyWorld);
}
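A self-contained, compilable version of this level-ordered update might look like the sketch below (simplified: no free list, no notify(), illustrative names); a node recomputes its world matrix when its own local matrix changed or its parent's world matrix changed this frame:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

using Mat44f = std::array<float, 16>; // row-major, column-vector convention

static Mat44f mul(const Mat44f& a, const Mat44f& b) {
    Mat44f c{};
    for (int r = 0; r < 4; ++r)
        for (int col = 0; col < 4; ++col)
            for (int k = 0; k < 4; ++k)
                c[r * 4 + col] += a[r * 4 + k] * b[k * 4 + col];
    return c;
}

static Mat44f translation(float x, float y, float z) {
    Mat44f m{};
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    m[3] = x; m[7] = y; m[11] = z;
    return m;
}

struct CpuTransformTree {
    std::vector<Mat44f> local, world;
    std::vector<int> parent;              // -1 for the virtual root
    std::vector<int> level;
    std::vector<std::vector<int>> levels; // node indices grouped by depth
    std::vector<char> dirtyLocal, dirtyWorld;

    int add(int p, const Mat44f& m) {
        int index = int(local.size());
        int depth = (p < 0) ? 0 : level[p] + 1;
        local.push_back(m); world.push_back(m);
        parent.push_back(p); level.push_back(depth);
        if (depth >= int(levels.size())) levels.resize(depth + 1);
        levels[depth].push_back(index);
        dirtyLocal.push_back(1); dirtyWorld.push_back(0);
        return index;
    }

    void setLocalTransform(int index, const Mat44f& m) {
        local[index] = m;
        dirtyLocal[index] = 1;
    }

    void process() {
        for (auto const& lvl : levels)
            for (int index : lvl) {
                int p = parent[index];
                if (dirtyLocal[index] || (p >= 0 && dirtyWorld[p])) {
                    world[index] = (p < 0) ? local[index]
                                           : mul(local[index], world[p]);
                    dirtyWorld[index] = 1; // children below will see this
                }
            }
        std::fill(dirtyLocal.begin(), dirtyLocal.end(), char(0));
        std::fill(dirtyWorld.begin(), dirtyWorld.end(), char(0));
    }
};
```

Processing level by level guarantees a parent's world matrix is final before any of its children read it, without any recursion or pointer chasing.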

SLIDE 8

RESULTS

CPU

  • A CPU (Xeon E5-2630 v3) processes ~15 million transforms per second
  • Assume a 1 ms budget for the transform hierarchy -> 16k transforms possible
  • The real problem is elsewhere: passing the updated transforms through the pipeline to the GPU can cost multiple times this update
  • Is it possible to keep the transform computations completely on the GPU?


SLIDE 9

CUDA IMPLEMENTATION


[Pipeline: CPU -> GPU upload (optional) → Process Level 0 → … → Process Level n (required) → GPU -> CPU download (optional)]

  • Implement the animation system on the GPU and share the result with the renderer
  • Is there enough parallelism?
  • What is the fixed cost per stage?

SLIDE 10

CUDA IMPLEMENTATION

Parallelism

  • A Quadro K6000 can execute 2,880 threads in parallel
  • How to saturate this high number of threads?
  • One matrix multiplication per thread is not enough
  • Compute one matrix component per thread instead, which increases the number of required threads by 16x
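A small sketch of the thread-to-component mapping. The exact bit layout is an assumption (the slides do not spell it out): a half warp of 16 threads covers one 4x4 matrix, and the lane id decodes to (x = column, y = row, z = which of the two matrices in the warp):

```cpp
#include <cassert>
#include <tuple>

// lane bits: xx | yy | z  ->  (column, row, matrix-in-warp)
std::tuple<int, int, int> laneToComponent(int lane) {
    return { lane & 3, (lane >> 2) & 3, (lane >> 4) & 1 };
}

// one component per thread -> 16 threads per matrix, so this many
// matrix multiplications are needed to fill the GPU's parallel threads
int matricesToSaturate(int parallelThreads) {
    return parallelThreads / 16;
}
```

For the K6000's 2,880 parallel threads this mapping needs 180 concurrent matrix multiplications to saturate the machine, instead of 2,880 with one multiplication per thread.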


SLIDE 11

PROPAGATION

Matrix multiplication


[Diagram: result matrix = local matrix * world matrix, each 4x4 matrix shown as components (x,y,z), with x = column, y = row and z selecting the matrix]

Each half warp (16 threads) executes one multiplication; the thread id selects the component.

SLIDE 12

PROPAGATION

Matrix multiplication


[Diagram: result matrix = local matrix * world matrix, one (x,y,z) component per thread]

Each thread computes one element of the result matrix.
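Written out on the CPU, one element per "thread" is a plain dot product; the loop below stands in for the 16 lanes of a half warp. Note the result is out = local * parentWorld, matching world = local * world.parent from the CPU version:

```cpp
#include <array>
#include <cassert>

using Mat44f = std::array<float, 16>; // element (column x, row y) at y * 4 + x

// out(x,y) = sum_k local(k,y) * world(x,k) -- one independent dot product
// per lane; lane = y * 4 + x
Mat44f perElementMultiply(const Mat44f& local, const Mat44f& world) {
    Mat44f out{};
    for (int lane = 0; lane < 16; ++lane) {
        int x = lane & 3, y = lane >> 2;
        float v = 0.0f;
        for (int k = 0; k < 4; ++k)
            v += local[y * 4 + k] * world[k * 4 + x];
        out[y * 4 + x] = v;
    }
    return out;
}
```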

SLIDE 13

PROPAGATION

Matrix Multiplication


  • Thread (t.x, t.y, z) reads one component of the local and one of the world matrix
  • Coalesced read -> one memory transaction per matrix
  • The whole matrix state is now in a half warp and can be distributed with __shfl

[Diagram: local and world matrices laid out as one (x,y,z) component per lane]

SLIDE 14

PROPAGATION

Multiplication


world(x,y,z) = __shfl(local, base1 + 0) * __shfl(world, base2 + 0)
             + __shfl(local, base1 + 1) * __shfl(world, base2 + 4)
             + __shfl(local, base1 + 2) * __shfl(world, base2 + 8)
             + __shfl(local, base1 + 3) * __shfl(world, base2 + 12);

base1 = (Tid.y & 3) * 4 + (Tid.z & 1) * 16   // row y of the local matrix
base2 = Tid.x + (Tid.z & 1) * 16             // column x of the world matrix

The full local & world state is known through shuffle; each thread can grab the required values per iteration.
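The shuffle indexing can be checked with a CPU emulation: one array slot per lane models each lane's register, and __shfl(v, src) becomes a read of lane src's slot. The lane layout is the one assumed above (element (x, y) of matrix z at lane z*16 + y*4 + x):

```cpp
#include <array>
#include <cassert>

using Warp = std::array<float, 32>; // one register per lane

// Emulated half-warp multiply: lane (x, y, z) combines row y of 'local'
// (lanes base1..base1+3) with column x of 'world' (stride 4 from base2).
Warp shuffleMultiply(const Warp& local, const Warp& world) {
    Warp out{};
    for (int lane = 0; lane < 32; ++lane) {
        int x = lane & 3, y = (lane >> 2) & 3, z = (lane >> 4) & 1;
        int base1 = (y & 3) * 4 + (z & 1) * 16; // row y of local matrix z
        int base2 = x + (z & 1) * 16;           // column x of world matrix z
        out[lane] = local[base1 + 0] * world[base2 + 0]
                  + local[base1 + 1] * world[base2 + 4]
                  + local[base1 + 2] * world[base2 + 8]
                  + local[base1 + 3] * world[base2 + 12];
    }
    return out;
}
```

Multiplying by the identity (matrix 0) and by twice the identity (matrix 1) makes the indexing easy to verify lane by lane.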

SLIDE 15

CUDA IMPLEMENTATION

ALGORITHM


void process()
{
    upload();
    for (auto level : levels)            // one kernel launch per level
    {
        for (auto index : level.indices) // warp id specifies the index
        {
            if (dirtyLocal[index] || dirtyWorld[index.parent])
            {
                parallelMultiply(world[index], local[index], world[index.parent]);
                if (threadIdx.x == 0 && threadIdx.y == 0) // mark dirty only once per matrix
                    atomicOr(&dirtyWorld[index / 32], 1 << (index & 31)); // atomic, avoids conflicts
            }
        }
    }
    download();
    notify(dirtyWorld);
    clear(dirtyLocal);
    clear(dirtyWorld);
}

SLIDE 16

RESULTS

Matrix Multiplication on GPU

  • The current version can do ~300 million transforms/s on a K6000
  • Bandwidth-limited by the dirty bits and the hierarchy, not by the transforms
  • No coalesced reads; 128-byte transactions are being generated
  • Solving the inefficient hierarchy memory access pattern could bring ~900 million transforms/s (work in progress)


SLIDE 17

GPU IMPLEMENTATION


[Pipeline: Multiply Level 0 → Multiply Level 1 → Multiply Level 2, ~5us each]

  • Total minimum cost is ~15us for 3 levels; deep hierarchies might be bad
  • The minimum cost is due to the kernel launch overhead
  • There are ways to reduce this cost, not yet addressed

SLIDE 18

CUDA IMPLEMENTATION

Results


Worst case: 16k matrices, all changed, 2 levels in the hierarchy, result required on the CPU

  CPU -> GPU upload:    ~100us
  Process levels:       ~50us
  GPU -> CPU download:  ~100us
  Total:                ~250us

4x faster than the CPU (single-threaded), but the transfers kill the improvements.

SLIDE 19

CUDA IMPLEMENTATION

Results


Medium case: 1 matrix changed (top level), 2 levels in the hierarchy, result required on the CPU

  CPU -> GPU upload:    ~10us
  Process levels:       ~50us
  GPU -> CPU download:  ~100us
  Total:                ~160us

6.25x faster than the CPU (single-threaded), but the transfer kills the improvements.

SLIDE 20

CUDA IMPLEMENTATION

Results


Medium case: 1 matrix changed (top level), 2 levels in the hierarchy, result not required on the CPU

  CPU -> GPU upload:    ~10us
  Process levels:       ~50us
  GPU -> CPU download:  0us
  Total:                ~60us

16x faster than the CPU (single-threaded) now that no transfer back is needed.
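The speedups in the three result slides follow directly from the stage costs, assuming the ~1000us single-threaded CPU baseline implied earlier (16k transforms at ~15 million transforms/s). A quick arithmetic check:

```cpp
#include <cassert>

struct PipelineCost { double uploadUs, processUs, downloadUs; };

double totalUs(const PipelineCost& c) {
    return c.uploadUs + c.processUs + c.downloadUs;
}

// speedup versus the ~1000us single-threaded CPU update of 16k matrices
double speedupVsCpu(const PipelineCost& c, double cpuUs = 1000.0) {
    return cpuUs / totalUs(c);
}
```

The transfers dominate: dropping only the download already moves the speedup from 6.25x to roughly 16x.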

SLIDE 21

DATA ON GPU – USE CASES

Graphics Interop

  • No need to transfer data from the CPU to the GPU; saves PCI-E bandwidth (~100us for 16k matrices)
  • Graphics usually needs the inverse transpose; computing it on the GPU saves CPU time again and even more PCI-E bandwidth (~100us for 16k matrices)
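The inverse transpose mentioned here is the usual normal matrix for shading: transpose(inverse(M)) of the upper 3x3 block, which equals the cofactor matrix divided by the determinant. A small CPU sketch of that computation (illustrative, not the pipeline's GPU kernel):

```cpp
#include <array>
#include <cassert>

using Mat3 = std::array<float, 9>; // row-major 3x3 (upper block of the world matrix)

// normal matrix = transpose(inverse(M)) = cofactor(M) / det(M);
// cyclic row/column indices give correctly signed cofactors without
// an explicit (-1)^(r+c) factor
Mat3 inverseTranspose(const Mat3& m) {
    auto at = [&](int r, int c) { return m[r * 3 + c]; };
    Mat3 cof{};
    for (int r = 0; r < 3; ++r)
        for (int c = 0; c < 3; ++c) {
            int r0 = (r + 1) % 3, r1 = (r + 2) % 3;
            int c0 = (c + 1) % 3, c1 = (c + 2) % 3;
            cof[r * 3 + c] = at(r0, c0) * at(r1, c1) - at(r0, c1) * at(r1, c0);
        }
    // cofactor expansion along row 0
    float det = at(0, 0) * cof[0] + at(0, 1) * cof[1] + at(0, 2) * cof[2];
    for (float& v : cof) v /= det;
    return cof;
}
```

For a pure scale the result is the reciprocal scale, which is why non-uniform scaling needs the inverse transpose for normals in the first place.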


SLIDE 22

DATA ON GPU – USE CASES

CULLING

  • Frustum culling is quite efficient if the data is already on the GPU
  • Bounding box generation: the bounding box of the scene might be required for near/far plane computation
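Both uses need world-space bounding boxes, which come straight from the world matrices: transform the eight corners of a local AABB and take component-wise min/max. A minimal sketch (conservative corner-based version, not the pipeline's implementation):

```cpp
#include <algorithm>
#include <array>
#include <cassert>

using Mat44f = std::array<float, 16>; // row-major, column-vector convention
struct Aabb { std::array<float, 3> lo, hi; };

// world AABB of a transformed local AABB: transform all 8 corners by the
// world matrix and accumulate the min/max per axis
Aabb transformAabb(const Aabb& box, const Mat44f& m) {
    Aabb out{{ 1e30f,  1e30f,  1e30f},
             {-1e30f, -1e30f, -1e30f}};
    for (int corner = 0; corner < 8; ++corner) {
        float p[3] = { (corner & 1) ? box.hi[0] : box.lo[0],
                       (corner & 2) ? box.hi[1] : box.lo[1],
                       (corner & 4) ? box.hi[2] : box.lo[2] };
        for (int r = 0; r < 3; ++r) {
            float v = m[r*4+0]*p[0] + m[r*4+1]*p[1] + m[r*4+2]*p[2] + m[r*4+3];
            out.lo[r] = std::min(out.lo[r], v);
            out.hi[r] = std::max(out.hi[r], v);
        }
    }
    return out;
}
```

With the world matrices already resident on the GPU, this per-object box (and the scene box reduced from them) never has to round-trip over PCI-E.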


SLIDE 23

CONCLUSION / FUTURE

  • The transform hierarchy can be evaluated on the GPU quite fast
  • If the result is required on the CPU, the gain is limited by the transfer time; solvable with interop or by moving the algorithm to Vulkan/OpenGL
  • If the input data comes from the CPU, the gain might be limited depending on the number of changes; animate the matrices on the GPU instead


SLIDE 24

April 4-7, 2016 | Silicon Valley

THANK YOU