SLIDE 1

SLIDE 2
Outline

• Project goal and problem definition
• Prior Work: Octree textures on GPU
• Parallel Octrees on GPU (2 implementations)
• Space Filling Curves
• CUDA programming model
• Octree Construction
• Results: Comparison of two approaches
• Conclusion

SLIDE 3

Dual Degree Project Goal

To implement the parallel Fast Multipole Method (FMM) on graphics hardware:
• Parallel FMM using multiprocessors (already done)
• FMM using GPUs (to be done, to achieve better speed-up)
A parallel implementation of octrees on the GPU is the first step towards implementing parallel FMM on GPUs.

SLIDE 4

SLIDE 5

[Figure: a 2-D octree with nodes A, B, C, D stored in an indirection texture at grid positions A(0,0), B(1,0), C(2,0), D(3,0)]

• The content of the leaves is stored as an RGB value
• The alpha channel is used to distinguish between an index to a child and the content of a leaf:
  - alpha = 1: data
  - alpha = 0.5: index
  - alpha = 0: empty cell

Octree Textures on GPU


SLIDE 6


Retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1]

The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.

SLIDE 7


Lookup at depth 0, in node A (the root), with I0 = (0,0):

Px = (I0x + frac(M·2^0)) / Sx
Py = (I0y + frac(M·2^0)) / Sy

where frac(A) denotes the fractional part of A, ID is the index of the data grid of the node visited at depth D, and (Sx, Sy) = (4, 1) is the size of the indirection texture in grid units. Let M = (0.7, 0.7):

Coordinates of M within grid A = frac(M·2^0) = frac(0.7×1) = 0.7
x coordinate of the lookup point P in the texture: Px = (0 + 0.7)/4 = 0.175
y coordinate of the lookup point P in the texture: Py = (0 + 0.7)/1 = 0.7

SLIDE 8


Lookup at depth 1, in node B, with I1 = (1,0):

Px = (I1x + frac(M·2^1)) / Sx
Py = (I1y + frac(M·2^1)) / Sy

For M = (0.7, 0.7):

Coordinates of M within grid B = frac(M·2^1) = frac(0.7×2) = 0.4
Px = (1 + 0.4)/4 = 0.35
Py = (0 + 0.4)/1 = 0.4

SLIDE 9


Lookup at depth 2, in node C, with I2 = (2,0):

Px = (I2x + frac(M·2^2)) / Sx
Py = (I2y + frac(M·2^2)) / Sy

For M = (0.7, 0.7):

Coordinates of M within grid C = frac(M·2^2) = frac(0.7×4) = 0.8
Px = (2 + 0.8)/4 = 0.7
Py = (0 + 0.8)/1 = 0.8

SLIDE 10

Drawbacks

• Very difficult to create such a data representation in parallel
• No a priori knowledge of the position where a particular node in the octree is going to land in the texture
• The same amount of memory is allocated for both leaves and internal nodes; internal nodes do not contain much information, so there is no need to allocate the same memory space for an internal node as for a leaf
• Difficult to compute the postorder traversal

SLIDE 11

Parallel Octrees on GPU

• Space Filling Curves
• CUDA Programming Model
• Parallel Bitonic Sort

SLIDE 12

Parallel Octrees on GPU

• Space Filling Curves
• CUDA Programming Model
• Parallel Bitonic Sort

SLIDE 13

Space Filling Curves: Motivation

• Easy to implement in practice
• Mathematical representation that enables fast computation of data ownership
• Easy to parallelize
• Good quality load balancing

An SFC-based domain decomposition meets the above requirements

SLIDE 14

Space Filling Curves

• Consider the recursive bisection of a 2-D area into non-overlapping cells of equal size
• A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering

Z-SFC for k = 2

SLIDE 15

SFC Construction

• Ordering the cells directly is expensive at run time, since it typically requires sorting
• Compute the integer coordinates of each cell containing a particle
• Represent each integer coordinate of a cell using k bits, then interleave the bits starting from the first dimension to form a single integer
• The index of a cell with given coordinates can thus be computed in O(k) time
• Do a parallel integer sort to get the SFC ordering
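The bit-interleaving step above can be sketched directly; the function name morton_index is an illustrative choice for the Z-order (Morton) index that this interleaving produces:

```python
# Hedged sketch of the bit-interleaving step: each integer cell coordinate
# is written with k bits, and the bits are interleaved across dimensions
# (first dimension first, most significant bit first) to form the cell's
# Z-order index.
def morton_index(coords, k):
    """Interleave the k-bit coordinates of a cell into one integer."""
    d = len(coords)
    index = 0
    for bit in range(k - 1, -1, -1):          # most significant bit first
        for dim in range(d):                  # first dimension first
            index = (index << 1) | ((coords[dim] >> bit) & 1)
    return index
```

Sorting the cells by this integer gives the SFC ordering mentioned in the last bullet; for k = 1 in 2-D the four cells map to indices 0 through 3 along the Z-curve.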

SLIDE 16

SFC and Octrees

• SFC decomposition is very similar, though not identical, to an octree decomposition
• Octrees can be viewed as multiple SFCs at various resolutions
• The process of assigning indices can be viewed hierarchically
• Any ambiguity? We only need to be able to:
  - check whether one cell is contained in another cell
  - find the smallest cell containing two given cells
  - find the immediate subcell of a cell that contains a given cell

SLIDE 17

Parallel Octrees on GPU

• Space Filling Curves
• CUDA Programming Model
• Parallel Bitonic Sort

SLIDE 18

CUDA Programming Model

[Figure: a kernel launched on a grid of thread blocks, Block (0,0) through Block (2,1), each block a 4×3 array of threads, Thread (0,0) through Thread (3,2)]

SLIDE 19

CUDA Programming Model

• Block of threads
• Grids of thread blocks
• Any computation that is performed independently on many different data items can be isolated into a function, called a kernel, that is executed on the GPU as many different threads
• A GPU may run all the blocks of a grid sequentially if it has very few parallel capabilities, or in parallel if it has many
• How does CUDA fit with octrees and SFCs?

SLIDE 20

SLIDE 21

[Figure: a quadtree grid of side 2^3 shown level by level, for k = 3, 2, 1, 0]

• Maximum number of levels L is given
• Nodes at level k: 2^kd
• The 2-D position of the parent of a node in the upper level can be calculated directly from the 2-D position of the child node
• Each node also stores:
  - nodeType: -1 for empty, -2 for filled, -3 for filled internal node
  - numEmptyNodes: the number of empty nodes in the subtree of the node (including itself)
  - (level, 2-D position in that level)

Octree Construction
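The pointer-free parent computation claimed above can be sketched as follows; this is a CPU sketch with illustrative function names, for the quadtree case (d = 2):

```python
# Hedged sketch: in a complete quadtree laid out level by level, the parent
# of the node at 2-D position (x, y) on level k sits at (x // 2, y // 2)
# on level k - 1, so no parent pointers are needed.
def parent_position(x, y):
    return (x // 2, y // 2)

def nodes_at_level(k, d=2):
    """Number of nodes on level k of a complete 2^d-ary tree: 2^(k*d)."""
    return 2 ** (k * d)
```

Each dimension simply loses one bit per level, which is the same halving that the Z-SFC index performs.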

SLIDE 22

• Multiple passes, considering two levels in each pass
• Allocate threads so that each thread handles four nodes; these four nodes come one after the other in the SFC linearization of that level
• Each thread checks the number of empty nodes among those four nodes

[Figure: threads T(0,0), T(1,0), T(0,1), T(1,1), each mapped to four consecutive nodes]

Octree Construction contd.

SLIDE 23

• If all four nodes are empty, the thread sets the nodeType field of the parent to -1 and numEmptyNodes to the number of empty nodes in the subtree plus 1. The dataLocation field of the parent remains null.
• If exactly one of the four nodes is non-empty, the thread sets the nodeType of the non-empty node to -1; in the parent it sets numEmptyNodes to the number of empty nodes in the subtree plus 1, nodeType to the nodeType of the non-empty node, and dataLocation to the dataLocation of the non-empty node.
• Otherwise, it just sets the nodeType field of the parent to -3 and numEmptyNodes to the number of empty nodes in the subtree.
• Repeat the same procedure for the remaining levels to generate the complete octree
• Highly data parallel, with zero communication between the GPU threads

Octree Construction contd.
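One pass of this construction can be sketched sequentially on the CPU; in this hedged sketch each loop iteration stands in for one GPU thread, and the slides' field names are shortened to type, numEmpty and data. The node type codes follow the slides (-1 empty, -2 filled, -3 filled internal node):

```python
# Hedged CPU sketch of one bottom-up pass.  Each "thread" handles four
# consecutive children in SFC order and fills in one parent record.
def build_parent_level(children):
    parents = []
    for i in range(0, len(children), 4):
        group = children[i:i + 4]
        empties = sum(c["numEmpty"] for c in group)
        nonempty = [c for c in group if c["type"] != -1]
        if not nonempty:                       # all four children empty
            parents.append({"type": -1, "numEmpty": empties + 1,
                            "data": None})
        elif len(nonempty) == 1:               # adaptive pruning: lift the
            only = nonempty[0]                 # lone child into the parent
            parents.append({"type": only["type"],
                            "numEmpty": empties + 1,
                            "data": only["data"]})
            only["type"] = -1                  # child is now marked empty
        else:                                  # a genuine internal node
            parents.append({"type": -3, "numEmpty": empties,
                            "data": None})
    return parents
```

Because each group of four children maps to exactly one parent slot, no iteration reads or writes another iteration's output, which is the zero-communication property noted above.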

SLIDE 24

• For each node, directly calculate its postorder number (PONA) according to the non-adaptive tree, in parallel
• Also calculate the number of empty nodes (NE) that precede the current node in the postorder numbering of the non-adaptive tree
• Final postorder number in the adaptive tree = PONA - NE
• To calculate PONA, make use of a table structure of node counts
• How to calculate NE? O(log4 n) time for n processors

Postorder Traversal

Maximum level | Total number of nodes in tree
1             | 5
2             | 21
3             | 85
4             | 341
5             | 1365
6             | 5461
…             | …
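The node counts in this table follow a geometric series: a complete quadtree with levels 0 through L contains (4^(L+1) - 1)/3 nodes. A minimal sketch (the function name total_nodes is illustrative):

```python
# Hedged sketch of the table above: a complete 2^d-ary tree with levels
# 0..L has sum over k of (2^d)^k = ((2^d)^(L+1) - 1) / (2^d - 1) nodes.
def total_nodes(L, d=2):
    b = 2 ** d                      # branching factor (4 for a quadtree)
    return (b ** (L + 1) - 1) // (b - 1)
```

For a quadtree this reproduces the table: total_nodes(1) is 5, total_nodes(3) is 85, and so on.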

SLIDE 25

Postorder Traversal

[Figure: kernel code, one kernel calculates NE and the other calculates PONA]

SLIDE 26

SLIDE 27

• Initial memory allocation is the same as in implementation one
• Also allocate a global array G having the size of the octree
• Within each node we store nodeType and dataLocation, as in implementation one
• We do not store the number of empty nodes, but we do store the level and SFC position of the node
• Within each pass, the construction of the parent level from the child-level nodes (or leaves) is exactly the same as in implementation one
• Once the parent level is constructed, copy the SFC-linearized child level to the global array G and delete the child array from memory
• Copying in the next pass starts from where it ended in the last pass

Octree Construction

SLIDE 28

• Parallel sort the nodes in the global array G to get the postorder traversal of the octree. How?
• We order two nodes based on the level in which they lie and their SFC index within that level
• Note that the SFC ordering of the nodes in a particular level is the same as their ordering in the postorder traversal of the tree
• For any two nodes Ni (level Li, SFC index i) and Nj (level Lj, SFC index j) in G, all nodes at level Lj with SFC indices up to a bound determined by Ni come before Ni in the final postorder of the tree
• On simplifying: whether a swap between Nj and Ni is required in the array can be decided from (Li, i) and (Lj, j) alone

Postorder Traversal
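One comparator consistent with the ordering described above can be sketched as follows. This is a hedged sketch, not necessarily the exact formula used on the slides: the key is (index of the node's last descendant leaf in SFC order, deeper level first), which reproduces the postorder, since a node shares its last leaf with its ancestors and must precede them:

```python
# Hedged sketch: a sort key on (level, SFC index) that yields postorder.
# A node on level `level` covers a contiguous run of (2^d)^(L - level)
# leaves at the maximum level L; sort by the last leaf of that run, and
# break ties (ancestor chains) by putting the deeper node first.
def postorder_key(level, sfc_index, L, d=2):
    span = (2 ** d) ** (L - level)          # leaves under this node
    last_leaf = (sfc_index + 1) * span - 1
    return (last_leaf, -level)

nodes = [(0, 0)] + [(1, i) for i in range(4)]   # root plus its 4 children
order = sorted(nodes, key=lambda n: postorder_key(n[0], n[1], L=1))
```

A parallel bitonic sort with this key in place of Python's sorted would give the GPU version.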

SLIDE 29

• For n processors, the sorting algorithm takes O(log² n) steps to sort n elements

Postorder Traversal contd.

Comparison function for sorting (idj, idi)

SLIDE 30

Comparison between implementations 1 and 2

• The time for constructing the octree and the amount of GPU memory required are of the same order for both algorithms.
• Theoretically, the calculation of the postorder numbers of n nodes in implementation one takes O(log4 n) time on n processors. In implementation two, the sorting algorithm takes O(log² n) steps on n processors to sort n elements.
• Bank conflicts in implementation one.
• Large amount of data movement during the sorting phase in implementation two.
• No need to calculate the SFC at each level in implementation one.
• Very difficult to sort a number of elements that is not a power of two in parallel on the GPU in implementation two.

SLIDE 31

Performance Comparison

Time for tree construction and postorder calculation (milliseconds):

Number of levels | Implementation 1 | Implementation 2
3                | 0.21             | 0.35
4                | 0.39             | 0.72
5                | 1.25             | 1.93
6                | 2.62             | 4.56
7                | 5.89             | 16.21
8                | 18.63            | 72.14
9                | 64.78            | 300.18

SLIDE 32

Performance Comparison

Time for tree construction and postorder calculation (milliseconds):

Number of levels | Implementation 1 | Implementation 2
3                | 0.51             | 0.85
4                | 2.60             | 3.12
5                | 5.25             | 6.73
6                | 6.48             | 8.56
7                | 9.89             | 25.21
8                | 20.63            | 98.14
9                | 71.32            | 340.29

SLIDE 33

Conclusion

  • 1. Catalogued the currently existing implementations of octrees on GPUs
  • 2. Proposed two different algorithms for parallel construction of octrees on GPUs using CUDA
  • 3. Rudimentary analysis of the running time of both algorithms

Future Work

  • 1. Further optimizations in the octree implementations (utilizing the memory storage framework used by the GPUs and reducing non-primitive GPU operations such as integer division, modulo, and branching)
  • 2. Adaptation of a global illumination method using the Fast Multipole Method on GPUs
  • 3. Exploration of compressed octrees on GPUs (low priority)
SLIDE 34

References

  • 1. NVIDIA CUDA Programming Guide. http://developer.nvidia.com/cuda
  • 2. S. Lefebvre, S. Hornus, and F. Neyret. Octree Textures on the GPU. In GPU Gems 2, pages 595–614. Addison-Wesley, 2005.
  • 3. H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.
  • 4. T. W. Christopher. Bitonic Sort Tutorial. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
  • 5. S. Aluru and F. Sevilgen. Dynamic compressed hyper-octrees with applications to n-body problems. In Proceedings of Foundations of Software Technology and Theoretical Computer Science, pages 21–33, 1999.
  • 6. M. S. Warren and J. K. Salmon. A parallel hashed octree n-body algorithm. In Proceedings of Supercomputing, pages 12–21, 1993.
  • 7. L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.

SLIDE 35

SLIDE 36

CUDA general specifications

• The maximum number of threads per block is 512
• The maximum sizes of the x-, y-, and z-dimensions of a thread block are 512, 512, and 64, respectively
• The maximum size of each dimension of a grid of thread blocks is 65535
• The warp size is 32 threads
• The number of registers per multiprocessor is 8192
• The amount of shared memory available per multiprocessor is 16 KB, organized into 16 banks
• The amount of constant memory available is 64 KB, with a cache working set of 8 KB per multiprocessor
SLIDE 37

CUDA general specifications

• The maximum number of blocks that can run concurrently on a multiprocessor is 8
• The maximum number of warps that can run concurrently on a multiprocessor is 24
• The maximum number of threads that can run concurrently on a multiprocessor is 768
• Each multiprocessor is composed of eight processors, so a multiprocessor processes the 32 threads of a warp in four clock cycles

SLIDE 38

SLIDE 39

Compressed Octrees

[Figure: a point set in cells numbered 1–10 and the corresponding compressed octree]

SLIDE 40

Compressed Octrees contd.

[Figure: the same cells and compressed octree as before]

• Each node in a compressed octree is either a leaf or has at least 2 children

SLIDE 41

• Store 2 cells in each node of the compressed octree
• Large cell: the largest cell that encloses all the points the node represents
• Small cell: the smallest cell that encloses all the points the node represents

Encapsulating spatial information lost in compression

[Figure: cells numbered 1–10 and the corresponding compressed octree]

SLIDE 42

Compressed Octree Construction

Given a set of points, the side length of the domain, and a pre-specified maximum resolution:

• For each point, generate the index of the leaf cell containing it, which is the cell at the maximum resolution
• Parallel sort the leaf indices to compute their SFC linearization, i.e. the left-to-right order of the leaves in the compressed octree
• Incrementally construct the tree from the sorted list of leaves, starting from a single-node tree containing the first leaf and the root cell
• Note: do not confuse a cell with a node

SLIDE 43

Compressed Octree Construction contd.

 Keep track of the most recently inserted leaf  Let q be the next leaf  Starting from most recently inserted leaf, traverse the path from leaf to

the root until we find first node v such that q belongs to L(v) Now we have 2 cases 1. : create a new node u between v and its parent and insert a new child of u with q as small cell.

SLIDE 44

Compressed Octree Construction contd.

  2. If q lies in the small cell of v: insert q as a child of v corresponding to the subcell that contains it

SLIDE 45

SLIDE 46

Fast Multipole Method

The Fast Multipole Method is concerned with evaluating the effect of a "set of sources" on a set of "evaluation points". More formally, given N sources x_i with strengths q_i and M evaluation points y_j, we wish to evaluate the sums

f(y_j) = Σ_{i=1..N} q_i K(y_j, x_i),  j = 1, …, M

• Total complexity of direct evaluation: O(NM)

slide-47
SLIDE 47

The FMM attempts to reduce this complexity to O(N + M).

• The two main insights that make this possible are:
  - Factorization of the kernel into source and receiver terms
  - Many application domains do not require the function to be calculated at high accuracy
• FMM follows a hierarchical, tree-based approach
• Each node has associated multipole and local expansions
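To make concrete what the FMM accelerates, here is a hedged sketch of the naive O(NM) direct evaluation; the 1/r kernel and the name direct_sum are illustrative choices, since the slides leave the kernel unspecified:

```python
# Hedged sketch of the O(N*M) direct evaluation that the FMM replaces:
# every evaluation point interacts with every source.  The 1/r kernel in
# 2-D is illustrative only.
def direct_sum(sources, strengths, targets):
    out = []
    for (tx, ty) in targets:
        total = 0.0
        for (sx, sy), q in zip(sources, strengths):
            r = ((tx - sx) ** 2 + (ty - sy) ** 2) ** 0.5
            if r > 0:              # skip self-interaction at zero distance
                total += q / r
        out.append(total)
    return out
```

The FMM replaces the inner loop over distant sources with evaluations of the per-node expansions, which is where the O(N + M) bound comes from.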

slide-48
SLIDE 48

Each node has two kinds of interaction lists:

• Far cell list
• Near cell list
• There is no far cell list at levels 0 and 1, since every cell is a near neighbor of every other
• Transfer of energy from near neighbors happens only for leaves

Building Interaction Lists

Next: passes of the FMM

SLIDE 49

FMM Algorithm

SLIDE 50

SLIDE 51

SLIDE 52

SLIDE 53

Only for leaves of the quadtree

FMM Algorithm

SLIDE 54

Octrees and SFCs

• Octrees can be viewed as multiple SFCs at various resolutions
• To establish a total order on the cells of an octree, given 2 cells:
  - if one is contained in the other, the subcell is taken to precede the supercell
  - if they are disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them
• The resulting linearization is identical to a postorder traversal