SLIDE 1

GPU Accelerated Tandem Traversal of Blocked Bounding Volume Hierarchies

Jesper Damkjær and Kenny Erleben {damkjaer,kenny}@diku.dk

Department of Computer Science University of Copenhagen

October 2009

SLIDE 2

Traditional BVH Traversal

Two BVHs are traversed

  • Using either a stack or a queue
  • Using a descend rule to descend either tree
  • Or descending both trees simultaneously

For each descent, the BVs in the nodes are compared for overlap
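
The traversal described on this slide can be sketched on the CPU as follows. This is a minimal illustration, not the authors' code: the `AABB`, `Node`, and `tandemTraverse` names, the flat-array binary tree layout, and the choice of a stack with simultaneous descent are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stack>
#include <utility>
#include <vector>

// Axis-aligned box; the slide only says "BV", so AABB is an assumption.
struct AABB {
    float min[3];
    float max[3];
};

// Hypothetical binary BVH node in a flat array; left < 0 marks a leaf.
struct Node {
    AABB bv;
    int left;
    int right;
    int primitive;  // payload index, meaningful for leaves only
};

inline bool overlap(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    }
    return true;
}

// Returns the (primitive of A, primitive of B) pairs whose leaf BVs overlap.
std::vector<std::pair<int, int> > tandemTraverse(const std::vector<Node>& A,
                                                 const std::vector<Node>& B) {
    std::vector<std::pair<int, int> > hits;
    if (A.empty() || B.empty()) return hits;
    std::stack<std::pair<int, int> > work;
    work.push(std::make_pair(0, 0));  // start at the two roots
    while (!work.empty()) {
        const int a = work.top().first;
        const int b = work.top().second;
        work.pop();
        assert(static_cast<std::size_t>(a) < A.size());
        assert(static_cast<std::size_t>(b) < B.size());
        if (!overlap(A[a].bv, B[b].bv)) continue;  // prune this pair
        const bool leafA = A[a].left < 0;
        const bool leafB = B[b].left < 0;
        if (leafA && leafB) {
            hits.push_back(std::make_pair(A[a].primitive, B[b].primitive));
        } else if (leafA) {  // only B can be descended
            work.push(std::make_pair(a, B[b].left));
            work.push(std::make_pair(a, B[b].right));
        } else if (leafB) {  // only A can be descended
            work.push(std::make_pair(A[a].left, b));
            work.push(std::make_pair(A[a].right, b));
        } else {  // descend both trees simultaneously: four child pairs
            work.push(std::make_pair(A[a].left, B[b].left));
            work.push(std::make_pair(A[a].left, B[b].right));
            work.push(std::make_pair(A[a].right, B[b].left));
            work.push(std::make_pair(A[a].right, B[b].right));
        }
    }
    return hits;
}
```

Replacing `std::stack` with a queue would give the breadth-first variant the slide mentions.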

SLIDE 3

Naive BVH on GPU

One pair of BVHs per thread

Upper space bound for the stack: k (c − 1) max(height(A), height(B)),

  • where c is the max. cardinality and k is the size of two BV node references.

Shared memory too small and global memory too slow
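
To get a feel for the bound, it can be evaluated for concrete numbers. The helper names below are hypothetical, and the sample values (8-byte entries, binary trees of height 20 and 16, 128 threads, a 16 KB shared-memory budget) are assumptions chosen for illustration rather than figures from the talk.

```cpp
#include <cassert>
#include <cstddef>

// Per-thread stack bound from the slide: k * (c - 1) * max(height(A), height(B)).
// k: bytes for one stack entry (two BV node references).
// c: maximum cardinality (children per node).
inline std::size_t stackBoundBytes(std::size_t k, std::size_t c,
                                   std::size_t heightA, std::size_t heightB) {
    assert(c >= 1);
    const std::size_t h = heightA > heightB ? heightA : heightB;
    return k * (c - 1) * h;
}

// Total stack storage if every thread in a thread block keeps its own stack.
inline std::size_t threadBlockStackBytes(std::size_t threads,
                                         std::size_t perThreadBytes) {
    return threads * perThreadBytes;
}
```

With k = 8, c = 2 and heights 20 and 16, the bound is 160 bytes per thread. For 128 threads that is 20480 bytes, which already exceeds the assumed 16 KB budget.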

SLIDE 4

Use Blocks

  • 1 block ≡ each node has 4 children
  • If overlap ⇒ 16 new overlap tests (4 × 4 child pairs)
  • Less data to transfer and more work per thread
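
A block-versus-block test might look like the sketch below, which runs all 16 child-pair tests and packs the results into a bit mask. The `Block` layout, the `makeBox` helper, and the mask encoding are assumptions for illustration. The slides do not show the actual data layout.

```cpp
#include <cassert>
#include <cstdint>

struct AABB {
    float min[3];
    float max[3];
};

// Hypothetical block: four sibling BVs stored together, plus the index of
// the block that holds each sibling's own children.
struct Block {
    AABB bv[4];
    int childBlock[4];
};

// Convenience cube [lo, hi]^3, used only to build small examples.
inline AABB makeBox(float lo, float hi) {
    assert(lo <= hi);
    AABB b = {{lo, lo, lo}, {hi, hi, hi}};
    return b;
}

inline bool overlap(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    }
    return true;
}

// Runs the 4 x 4 = 16 BV tests for one pair of blocks.
// Bit (4 * i + j) of the result is set when a.bv[i] overlaps b.bv[j].
inline std::uint16_t overlapMask(const Block& a, const Block& b) {
    unsigned mask = 0;
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            if (overlap(a.bv[i], b.bv[j])) mask |= 1u << (4 * i + j);
        }
    }
    return static_cast<std::uint16_t>(mask);
}
```

Each set bit corresponds to one new pair of child blocks to visit in the next pass.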

SLIDE 5

Use Double Buffered List

  • Stack/Queue ⇒ Double buffered list
  • Swap input/output pairs for next pass
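
The double-buffered idea can be sketched on the CPU as below: each pass reads pairs from an input list, writes the pairs for the next level to an output list, and then the two lists are swapped. For brevity this sketch uses plain binary nodes rather than 4-child blocks. All names are hypothetical, and the inner loop stands in for what would be one GPU thread per pair.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct AABB {
    float min[3];
    float max[3];
};

// Hypothetical binary node in a flat array; left < 0 marks a leaf.
struct Node {
    AABB bv;
    int left;
    int right;
};

inline bool overlap(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    }
    return true;
}

typedef std::pair<int, int> Pair;

// Returns overlapping leaf pairs as (node index in A, node index in B).
std::vector<Pair> traverseDoubleBuffered(const std::vector<Node>& A,
                                         const std::vector<Node>& B) {
    std::vector<Pair> hits, in, out;
    if (A.empty() || B.empty()) return hits;
    in.push_back(Pair(0, 0));
    while (!in.empty()) {  // one iteration = one pass over the input list
        out.clear();
        for (std::size_t t = 0; t < in.size(); ++t) {  // "one thread per pair"
            const int ia = in[t].first;
            const int ib = in[t].second;
            assert(static_cast<std::size_t>(ia) < A.size());
            assert(static_cast<std::size_t>(ib) < B.size());
            const Node& a = A[ia];
            const Node& b = B[ib];
            if (!overlap(a.bv, b.bv)) continue;
            const bool leafA = a.left < 0;
            const bool leafB = b.left < 0;
            if (leafA && leafB) {
                hits.push_back(in[t]);
                continue;
            }
            // A leaf stays in place while the other tree is descended.
            const int ac[2] = {leafA ? ia : a.left, leafA ? ia : a.right};
            const int bc[2] = {leafB ? ib : b.left, leafB ? ib : b.right};
            const int na = leafA ? 1 : 2;
            const int nb = leafB ? 1 : 2;
            for (int i = 0; i < na; ++i) {
                for (int j = 0; j < nb; ++j) out.push_back(Pair(ac[i], bc[j]));
            }
        }
        in.swap(out);  // the output of this pass is the input of the next
    }
    return hits;
}
```

The number of passes is bounded by the combined height of the two trees, so no per-thread stack is needed.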

SLIDE 6

Memory Trick Needed

SLIDE 7

Need Imaginary Nodes

  • Fewer than 4 children ⇒ fill with imaginary nodes
  • Fills up space ⇒ part of calculation time ⇒ use sparingly
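
One way to realise imaginary nodes is to pad a block with an inverted (empty) box that can never pass an overlap test. The slides do not say how the authors represent them, so this representation and the names `imaginaryBox`, `makeBlock`, and `imaginaryCount` are assumptions.

```cpp
#include <cassert>
#include <limits>

struct AABB {
    float min[3];
    float max[3];
};

struct Block {
    AABB bv[4];
    int child[4];  // -1 for slots without a real child
};

inline bool overlap(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    }
    return true;
}

// Inverted box (min > max): fails the overlap test against any finite box.
inline AABB imaginaryBox() {
    const float big = std::numeric_limits<float>::max();
    AABB b = {{big, big, big}, {-big, -big, -big}};
    return b;
}

// Builds a block from `count` real boxes (0..4) and pads the remaining slots.
inline Block makeBlock(const AABB* boxes, const int* children, int count) {
    assert(count >= 0 && count <= 4);
    Block blk;
    for (int i = 0; i < 4; ++i) {
        if (i < count) {
            blk.bv[i] = boxes[i];
            blk.child[i] = children[i];
        } else {
            blk.bv[i] = imaginaryBox();
            blk.child[i] = -1;
        }
    }
    return blk;
}

// Counts padded slots, e.g. to measure how much of a BVH is imaginary.
inline int imaginaryCount(const Block& blk) {
    int n = 0;
    for (int i = 0; i < 4; ++i) {
        if (blk.bv[i].min[0] > blk.bv[i].max[0]) ++n;
    }
    return n;
}
```

Padded slots still occupy memory and still take part in every 16-way test, which is why the slide advises using them sparingly.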

SLIDE 8

Blocks with Mixed Internal or Leaf Nodes

Not allowed ⇒ Simpler code

SLIDE 9

Internal Block versus Leaf Block

if collide (a, k) ⇒ push (e, k)
if collide (a, l) ⇒ push (e, k)
if collide (a, m) ⇒ push (e, k)
if collide (a, n) ⇒ push (e, k)

Redundant results ⇒ add extra check to code
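
The sketch below shows one possible form of the extra check, under my reading of the slide: a is a node in an internal block with child block e, and k, l, m, n are the four nodes of one leaf block. Pushing once per overlapping leaf would emit the same (e, leaf block) pair up to four times, so the sketch pushes at most once per internal node. The `Block` layout and function names are hypothetical.

```cpp
#include <cassert>
#include <utility>
#include <vector>

struct AABB {
    float min[3];
    float max[3];
};

// Hypothetical block; mixed blocks are not allowed, so one flag suffices.
struct Block {
    AABB bv[4];
    int childBlock[4];
    bool leaf;
};

// Convenience cube [lo, hi]^3, used only to build small examples.
inline AABB makeBox(float lo, float hi) {
    assert(lo <= hi);
    AABB b = {{lo, lo, lo}, {hi, hi, hi}};
    return b;
}

inline bool overlap(const AABB& a, const AABB& b) {
    for (int i = 0; i < 3; ++i) {
        if (a.max[i] < b.min[i] || b.max[i] < a.min[i]) return false;
    }
    return true;
}

typedef std::pair<int, int> BlockPair;  // (block index in A, block index in B)

// Internal block versus leaf block: emit (child block, leaf block) at most
// once per internal node, however many leaves that node overlaps.
inline void pushInternalVsLeaf(const Block& internal, const Block& leafBlock,
                               int leafBlockIndex,
                               std::vector<BlockPair>& out) {
    assert(!internal.leaf && leafBlock.leaf);
    for (int i = 0; i < 4; ++i) {
        bool hit = false;
        for (int j = 0; j < 4 && !hit; ++j) {
            hit = overlap(internal.bv[i], leafBlock.bv[j]);
        }
        if (hit) out.push_back(BlockPair(internal.childBlock[i], leafBlockIndex));
    }
}
```

The early exit on the first hit is the extra check: without it, the inner loop would push one copy of the pair for every overlapping leaf.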

SLIDE 10

The Test Setup

Three different configuration types

  • Structured stack
  • Unstructured pile
  • Rock slide

SLIDE 11

The Test Setup (Cont’d)

For each configuration type

  • Increasing number of triangles in objects
  • Increasing number of objects

Test against Rapid

  • Rapid uses OBBs; we use AABBs

No optimization of imaginary nodes in BVHs (up to 33%)

SLIDE 12

Results

Rapid on Intel Quad CPU using one core

[Figure: three plots (Stack: Rapid, Pile: Rapid, Rockslide: Rapid) showing time in seconds against number of objects and triangles per object]

CUDA on GeForce 9800 GX2 using one core

[Figure: three plots (Stack: Cuda only, Pile: Cuda only, Rockslide: Cuda only) showing time in seconds against number of objects and triangles per object]

Stack (5-8) Pile (3-7) Slide (2)

SLIDE 13

Thanks

Questions?