Outline
Project goal and problem definition Prior Work: Octree textures on GPU Parallel Octrees on GPU (2 implementations) Space Filling Curves CUDA programming model Octree Construction Results: Comparison of two approaches Conclusion
Dual Degree Project Goal
To implement the parallel Fast Multipole Method (FMM) on graphics hardware
Parallel FMM using multiple processors (already done)
FMM using GPUs (to be done, to achieve better speed-up)
A parallel implementation of octrees on the GPU is the first step towards implementing parallel FMM on GPUs
[Figure: octree texture example; nodes A, B, C, D stored at indirection-pool grid positions A(0,0), B(1,0), C(2,0), D(3,0)]
The content of the leaves is stored as an RGB value
The alpha channel is used to distinguish between an index to a child and the content of a leaf:
alpha = 1: data
alpha = 0.5: index
alpha = 0: empty cell
Octree Textures on GPU
Tree lookup: retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1]
The lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached
Depth 0: node A (root), I0 = (0,0)
Px = (I0x + frac(M · 2^0)) / Sx
Py = (I0y + frac(M · 2^0)) / Sy
frac(A) denotes the fractional part of A; ID is the index of the data grid of the node visited at depth D
Let M = (0.7, 0.7)
Coordinate of M within grid A = frac(M · 2^0) = frac(0.7 × 1) = 0.7
x coordinate of the lookup point P in the texture: Px = (I0x + frac(M · 2^0)) / Sx = (0 + 0.7) / 4 = 0.175
y coordinate of the lookup point P in the texture: Py = (I0y + frac(M · 2^0)) / Sy = (0 + 0.7) / 1 = 0.7
Depth 1: node B, I1 = (1,0)
Px = (I1x + frac(M · 2^1)) / Sx
Py = (I1y + frac(M · 2^1)) / Sy
M = (0.7, 0.7)
Coordinate of M within grid B = frac(M · 2^1) = frac(0.7 × 2) = 0.4
Px = (I1x + frac(M · 2^1)) / Sx = (1 + 0.4) / 4 = 0.35
Py = (I1y + frac(M · 2^1)) / Sy = (0 + 0.4) / 1 = 0.4
Depth 2: node C, I2 = (2,0)
Px = (I2x + frac(M · 2^2)) / Sx
Py = (I2y + frac(M · 2^2)) / Sy
M = (0.7, 0.7)
Coordinate of M within grid C = frac(M · 2^2) = frac(0.7 × 4) = 0.8
Px = (I2x + frac(M · 2^2)) / Sx = (2 + 0.8) / 4 = 0.7
Py = (I2y + frac(M · 2^2)) / Sy = (0 + 0.8) / 1 = 0.8
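The lookup steps above can be sketched in code. The following Python snippet is a CPU sketch of the texture-coordinate computation; the pool dimensions Sx = 4, Sy = 1 and the visited grid indices A(0,0), B(1,0), C(2,0) are taken from the worked example.

```python
def lookup_coords(m, grid_index, depth, sx=4, sy=1):
    """Texture coordinates of lookup point P for point m, visited at a node
    whose data grid sits at grid_index in the indirection pool."""
    fx = (m[0] * 2 ** depth) % 1.0   # frac(M * 2^depth), per coordinate
    fy = (m[1] * 2 ** depth) % 1.0
    return ((grid_index[0] + fx) / sx, (grid_index[1] + fy) / sy)

# Walk the example: M = (0.7, 0.7) through A(0,0), B(1,0), C(2,0)
m = (0.7, 0.7)
for grid_index, depth in [((0, 0), 0), ((1, 0), 1), ((2, 0), 2)]:
    px, py = lookup_coords(m, grid_index, depth)
    print(round(px, 3), round(py, 3))
```

On a GPU this computation runs per fragment in a shader; here it simply prints the (Px, Py) pair for each visited depth.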
Drawbacks
Very difficult to create such a data representation in parallel
No a priori knowledge of where a particular node of the octree will land in the texture
The same amount of memory is allocated for leaves and internal nodes, although internal nodes carry little information, so there is no need to allocate the same space for both
Difficult to compute the post-order traversal
Parallel Octrees on GPU
Space Filling Curves CUDA Programming Model Parallel Bitonic Sort
Space Filling Curves: Motivation
Easy to implement in practice
Mathematical representation that enables fast computation of data ownership
Easy to parallelize
Good quality load balancing
An SFC-based domain decomposition meets the above requirements
Space Filling Curves
Consider the recursive bisection of a 2-D area into non-overlapping cells of equal size
A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering
Z-SFC for k = 2
SFC Construction
Ordering the cells by traversing the curve at run time is expensive
Instead, compute the index of each cell directly from the integer coordinates of the cell containing a particle:
Represent each integer coordinate of the cell in binary and interleave the bits, starting from the first dimension, to form a single integer
This interleaved integer is the index of the cell along the curve, and it can be computed independently for each cell
Do a parallel integer sort on the indices to get the SFC ordering
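The bit-interleaving step can be sketched as follows (a CPU sketch; the function name sfc_index is an assumption, not from the slides). Each of the d k-bit coordinates contributes one bit per level, starting from the first dimension.

```python
def sfc_index(coords, k):
    """Z-curve index of a cell: interleave the k-bit integer coordinates,
    taking one bit per dimension from most- to least-significant."""
    index = 0
    for bit in range(k - 1, -1, -1):
        for c in coords:                     # first dimension contributes first
            index = (index << 1) | ((c >> bit) & 1)
    return index

# 2-D grid at resolution k = 2: the 16 cells map bijectively onto 0..15
indices = sorted(sfc_index((x, y), 2) for x in range(4) for y in range(4))
```

Because the index of each cell depends only on that cell's coordinates, this step parallelizes trivially: one GPU thread per cell.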
SFC and Octrees
An SFC decomposition is very similar, though not identical, to an octree decomposition
Octrees can be viewed as multiple SFCs at various resolutions
The process of assigning indices can be viewed hierarchically
Any ambiguity can be resolved with simple operations on the indices:
check whether one cell is contained in another
find the smallest cell containing two given cells
find the immediate subcell of a cell that contains a given cell
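These queries reduce to shifts and comparisons on the interleaved indices, since each level of the tree contributes d bits to a cell's index. A minimal sketch for d = 2 (function names are assumptions, not from the slides):

```python
D = 2  # dimension: each level of a 2^D-ary tree adds D bits to a cell index

def contains(c_idx, c_lvl, s_idx, s_lvl):
    """Is the cell (s_idx, s_lvl) contained in the cell (c_idx, c_lvl)?"""
    return c_lvl <= s_lvl and (s_idx >> (D * (s_lvl - c_lvl))) == c_idx

def smallest_common_cell(i, j, lvl):
    """Smallest cell containing both cells i and j given at the same level lvl."""
    while i != j:            # strip one level of subcell bits at a time
        i, j, lvl = i >> D, j >> D, lvl - 1
    return i, lvl
```

The immediate subcell of a cell that contains a given cell follows the same pattern: shift the deeper index up to one level below the enclosing cell.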
CUDA Programming Model
[Figure: a kernel is launched over a grid of thread blocks, Block (0,0) … Block (2,1); each block contains a 2-D array of threads, Thread (0,0) … Thread (3,2)]
CUDA Programming Model
Threads are grouped into blocks, and blocks into a grid
Any computation that is performed independently on many different data items can be isolated into a function, called a kernel, that is executed on the GPU as many different threads
A GPU with few parallel resources may run all the blocks of a grid sequentially, while one with many parallel resources runs them in parallel
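The grid/block/thread decomposition can be illustrated with a sequential CPU emulation of a kernel launch (a sketch only; real CUDA code would express the same index arithmetic with blockIdx, blockDim and threadIdx):

```python
def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D CUDA launch: run the kernel once per (block, thread) pair.
    A real GPU runs these in parallel; the iteration order here is irrelevant."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

def add_one(block_idx, thread_idx, block_dim, data):
    i = block_idx * block_dim + thread_idx   # global thread index
    if i < len(data):                         # guard: grid may overshoot the data
        data[i] += 1

data = list(range(10))
launch(add_one, 3, 4, data)   # 3 blocks x 4 threads = 12 threads cover 10 items
```

The bounds guard is the standard idiom when the number of launched threads exceeds the number of data items.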
How does CUDA fit with octrees and SFCs?
[Figure: a pyramid of grids at levels k = 3, 2, 1, 0; each level is a 2^k × 2^k grid with coordinates 0 … 2^k − 1]
Maximum number of levels L is given; the number of nodes at level k is 2^kd
The 2-D position of the parent of a node in the upper layer can be calculated directly from the 2-D position of the child node
With each node we also store:
nodeType: −1 for empty, −2 for filled, −3 for filled internal node
numEmptyNodes: the number of empty nodes in the subtree of the node (including itself)
(level, 2-D position in that level)
Octree Construction
Multiple passes, considering two adjacent levels in each pass
Allocate one thread per parent node, so that each thread handles four child nodes
These four nodes come one after the other in the SFC linearization of that level
Each thread checks the number of empty nodes among those four nodes
[Figure: threads T(0,0), T(1,0), T(0,1), T(1,1), each reducing four consecutive child nodes]
Octree Construction contd.
If all four nodes are empty, the thread sets the nodeType field of the parent to −1 and numEmptyNodes to the number of empty nodes in the subtree plus 1. The dataLocation field of the parent still remains null.
If exactly one of the four nodes is non-empty, the thread sets the nodeType of the non-empty node to −1 and, in the parent, sets numEmptyNodes to the number of empty nodes in the subtree plus 1, nodeType to the nodeType of the non-empty node, and dataLocation to the dataLocation of the non-empty node.
Otherwise, it just sets the nodeType field of the parent to −3 and numEmptyNodes to the number of empty nodes in the subtree.
Repeat the same procedure for the remaining levels to generate the
complete octree
Highly data parallel with zero communication between the GPU threads
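One bottom-up pass of the construction can be sketched as follows (a CPU sketch of the per-thread logic; the field names nodeType, numEmptyNodes and dataLocation come from the slides, but the dict representation and function name are assumptions). Each loop iteration plays the role of one thread reducing four consecutive SFC-ordered children:

```python
EMPTY, FILLED, INTERNAL = -1, -2, -3

def build_parent_level(children):
    """Reduce each run of four consecutive SFC-ordered child nodes
    into one parent node, following the three cases above."""
    parents = []
    for i in range(0, len(children), 4):       # one "thread" per parent node
        quad = children[i:i + 4]
        empties = sum(c["numEmptyNodes"] for c in quad)
        nonempty = [c for c in quad if c["nodeType"] != EMPTY]
        if not nonempty:                       # all four children empty
            parent = dict(nodeType=EMPTY, numEmptyNodes=empties + 1,
                          dataLocation=None)
        elif len(nonempty) == 1:               # collapse the lone non-empty child
            c = nonempty[0]
            parent = dict(nodeType=c["nodeType"], numEmptyNodes=empties + 1,
                          dataLocation=c["dataLocation"])
            c["nodeType"] = EMPTY              # the collapsed child becomes empty
        else:                                  # a genuine internal node
            parent = dict(nodeType=INTERNAL, numEmptyNodes=empties,
                          dataLocation=None)
        parents.append(parent)
    return parents
```

Since each quad is disjoint from the others, the iterations are independent, which is what makes the pass embarrassingly parallel on the GPU.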
Octree Construction contd.
For each node, directly calculate its postorder number (PONA) according to the non-adaptive tree, in parallel
Also calculate the number of empty nodes (NE) before the current node in the postorder numbering of the non-adaptive tree
Final postorder number in the adaptive tree = PONA − NE
To calculate PONA, make use of a table structure
How to calculate NE? O(log_4 n) time for n processors
Postorder Traversal

Maximum level | Total number of nodes in the tree
0 | 1
1 | 5
2 | 21
3 | 85
4 | 341
5 | 1365
6 | 5461
… | …
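The node counts in the table are the level-by-level sums 4^k of a full quadtree (d = 2); a quick check, with the helper name total_nodes an assumption:

```python
def total_nodes(max_level, d=2):
    """Total nodes in a full 2^d-ary tree with levels 0 .. max_level."""
    return sum((2 ** d) ** k for k in range(max_level + 1))

counts = [total_nodes(l) for l in range(7)]   # levels 0..6
```

Equivalently, total_nodes(L) = (4^(L+1) − 1) / 3 for a quadtree.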
Postorder Traversal
[Figure: the table structure; one part calculates NE, the other calculates PONA]
Initial memory allocation is the same as in implementation one
Also allocate a global array G having the size of the octree
Within each node we store nodeType and dataLocation as in implementation one
We do not store the number of empty nodes, but we store the (level, SFC index) of the node
Within each pass, the construction of the parent level from the child level nodes (or leaves) is exactly the same as in implementation one
Once the parent level is constructed, copy the SFC-linearized child level to the global array G and delete the child array from memory
Copying in the next pass starts from where it ended in the last pass
Octree Construction
Parallel sort the nodes in the global array G to get the postorder traversal of the octree. How?
We order two nodes based on the level in which they are and their SFC index within that level
Note that the SFC ordering of nodes within a particular level is the same as their ordering in the postorder traversal of the tree
Consider any two nodes Ni (level Li, SFC index i) and Nj (level Lj, SFC index j) in G with Lj > Li: all nodes at level Lj that fall within the subtree of Ni, or within an earlier subtree, come before Ni in the final postorder of the tree
On simplifying: Nj must come before Ni in the array if and only if floor(j / 4^(Lj − Li)) ≤ i; otherwise a swap between Nj and Ni is required
Postorder Traversal
For n processors, the bitonic sorting algorithm takes O(log^2 n) steps to sort n elements
Postorder Traversal contd.
[Code listing: comparison function for sorting (idj, idi)]
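The comparison function can be sketched from the ordering rule above (a CPU reconstruction; Python's cmp_to_key stands in for the GPU comparator, and the (level, sfc) node representation is an assumption). The deeper node's SFC index is projected up to the shallower node's level; the deeper node comes first when the projection does not exceed the shallower node's index:

```python
from functools import cmp_to_key

def compare(a, b):
    """Negative if node a precedes node b in the postorder traversal.
    Nodes are (level, sfc) pairs of a quadtree (4 children per node)."""
    (la, i), (lb, j) = a, b
    if la == lb:
        return i - j                         # same level: plain SFC order
    if la < lb:                              # b is deeper: project j up to la
        return 1 if j // 4 ** (lb - la) <= i else -1
    return -compare(b, a)

# Two-level quadtree: each parent's children precede it; the root comes last
nodes = [(0, 0)] + [(1, s) for s in range(4)] + [(2, s) for s in range(16)]
order = sorted(nodes, key=cmp_to_key(compare))
```

In a bitonic sorting network the same predicate decides whether a compare-exchange step swaps its two elements.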
Comparison between implementations 1 and 2
The time for constructing the octree and the amount of GPU memory required are of the same order for both algorithms
Theoretically, the calculation of the postorder numbers of n nodes in implementation one takes O(log_4 n) time for n processors; in implementation two, the sorting algorithm takes O(log^2 n) steps to sort n elements on n processors
Bank conflicts in implementation one
Large amount of data movement during the sorting phase in implementation two
No need to calculate the SFC at each level in implementation one
Very difficult to sort a number of elements that is not a power of two on the GPU in parallel in implementation two
Performance Comparison
Time for tree construction and postorder calculation (in milliseconds):

Number of levels | Implementation 1 | Implementation 2
3 | 0.21 | 0.35
4 | 0.39 | 0.72
5 | 1.25 | 1.93
6 | 2.62 | 4.56
7 | 5.89 | 16.21
8 | 18.63 | 72.14
9 | 64.78 | 300.18
Performance Comparison
Time for tree construction and postorder calculation (in milliseconds):

Number of levels | Implementation 1 | Implementation 2
3 | 0.51 | 0.85
4 | 2.60 | 3.12
5 | 5.25 | 6.73
6 | 6.48 | 8.56
7 | 9.89 | 25.21
8 | 20.63 | 98.14
9 | 71.32 | 340.29
Conclusion
1. Catalogued the currently existing implementation of octrees on GPUs
2. Proposed two different algorithms for parallel construction of octrees on GPUs using CUDA
3. Rudimentary analysis of the running time for both the algorithms
Future Work
1. Further optimizations of the octree implementations (utilizing the memory storage framework used by the GPUs and reducing non-primitive GPU operations such as integer division, modulo, and branching)
2. Adaptation of a global illumination method using the Fast Multipole Method on GPUs
3. Exploration of compressed octrees on GPUs (low priority)
References
1. NVIDIA CUDA Programming Guide. http://developer.nvidia.com/cuda
2. S. Lefebvre, S. Hornus, and F. Neyret. GPU Gems 2, chapter Octree Textures on the GPU, pages 595–614. Addison-Wesley, 2005.
3. H. Sagan. Space Filling Curves. Springer-Verlag, 1994.
4. T. W. Christopher. Bitonic Sort Tutorial. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
5. S. Aluru and F. Sevilgen. Dynamic compressed hyper-octrees with applications to n-body problems. Proceedings of Foundations of Software Technology and Theoretical Computer Science, pages 21–33, 1999.
6. M. S. Warren and J. K. Salmon. A parallel hashed octree n-body algorithm. Proceedings of Supercomputing, pages 12–21, 1993.
7. L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.
CUDA general specifications
The maximum number of threads per block is 512
The maximum sizes of the x-, y-, and z-dimensions of a thread block are 512, 512, and 64, respectively
The maximum size of each dimension of a grid of thread blocks is 65535
The warp size is 32 threads
The number of registers per multiprocessor is 8192
The amount of shared memory available per multiprocessor is 16 KB, organized into 16 banks
The amount of constant memory available is 64 KB, with a cache working set of 8 KB per multiprocessor
CUDA general specifications
The maximum number of blocks that can run concurrently on a multiprocessor is 8
The maximum number of warps that can run concurrently on a multiprocessor is 24
The maximum number of threads that can run concurrently on a multiprocessor is 768
Each multiprocessor is composed of eight processors, so a multiprocessor processes the 32 threads of a warp in four clock cycles
Compressed Octrees
[Figure: an octree over 10 numbered points and its compressed counterpart]
Compressed Octrees contd.
Each node in a compressed octree is either a leaf or has at least 2 children
Store 2 cells in each node of the compressed octree:
Large cell: the largest cell that encloses all the points the node represents
Small cell: the smallest cell that encloses all the points the node represents
Encapsulating spatial information lost in compression
Compressed Octree Construction
Input: a set of points in a domain of a given side length, with a pre-specified maximum resolution
For each point, generate the index of the leaf cell containing it, i.e. the cell at the maximum resolution
Parallel sort the leaf indices to compute their SFC linearization, i.e. the left-to-right order of the leaves in the compressed octree
Incrementally construct the tree using the sorted list of leaves, starting from a single-node tree for the first leaf and the root cell
Note: Do not confuse a cell with a node
Compressed Octree Construction contd.
Keep track of the most recently inserted leaf; let q be the next leaf
Starting from the most recently inserted leaf, traverse the path from the leaf towards the root until we find the first node v such that q belongs to L(v), the large cell of v
Now we have 2 cases:
1. If q does not belong to the small cell of v: create a new node u between v and its parent, and insert a new child of u with q as its small cell
Compressed Octree Construction contd.
2. If q belongs to the small cell of v: q falls within one of its subcells; insert q as a child of v corresponding to this subcell
Fast Multipole Method
The Fast Multipole Method (FMM) is concerned with evaluating the effect of a "set of sources" on a set of "evaluation points". More formally, given N sources and M evaluation points, we wish to evaluate, at every evaluation point, the sum of the kernel contributions of all sources
Total complexity: O(NM)
The FMM attempts to reduce this complexity to O(N + M)
The two main insights that make this possible are:
factorization of the kernel into source and receiver terms
many application domains do not require the function to be calculated at high accuracy
The FMM follows a hierarchical tree decomposition of the domain; each node has associated expansion coefficients
Each node has two kinds of interaction lists:
Far Cell List
Near Cell List
There is no far cell list at levels 0 and 1, since every cell there is a near neighbor of every other
Transfer of energy from near neighbors happens only for the leaves
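The sum in question has a standard form; a reconstruction under assumed symbols (x_j are the N sources with weights w_j, y_i the M evaluation points, and Φ the kernel):

```latex
f(y_i) \;=\; \sum_{j=1}^{N} w_j \, \Phi(y_i, x_j), \qquad i = 1, \dots, M
```

Evaluating this directly costs O(NM); the factorization of Φ into source and receiver terms is what brings the cost down to O(N + M) at a prescribed accuracy.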
Building Interaction Lists
Next: Passes of FMM
FMM Algorithm
[Figures illustrating the successive passes of the FMM; the direct near-neighbor evaluation applies only to the leaves of the quadtree]
Octrees and SFCs
Octrees can be viewed as multiple SFCs at various resolutions
To establish a total order on the cells of an octree, given 2 cells:
if one is contained in the other, the subcell is taken to precede the supercell
if they are disjoint, order them according to the order of the immediate subcells of the smallest cell containing both