SLIDE 1

SLIDE 2
Outline

• Project goal and problem definition
• Prior Work: Octree textures on GPU
• Parallel Octrees on GPU (2 implementations)
• Space Filling Curves
• CUDA programming model
• Octree Construction
• Results: Comparison of two approaches
• Conclusion

SLIDE 3

Dual Degree Project Goal

To implement the parallel Fast Multipole Method (FMM) on graphics hardware:
• Parallel FMM using multiprocessors (already done)
• FMM using GPUs (to be done, to achieve better speed-up)
A parallel implementation of octrees on the GPU is the first step towards implementing parallel FMM on GPUs.

SLIDE 4

SLIDE 5

[Figure: a 2-D octree with nodes A, B, C, D stored in an indirection texture at grid positions A(0,0), B(1,0), C(2,0), D(3,0)]

• The content of the leaves is stored as an RGB value
• The alpha channel is used to distinguish between an index to a child and the content of a leaf:
  - alpha = 1: data
  - alpha = 0.5: index
  - alpha = 0: empty cell

Octree Textures on GPU


SLIDE 6


Retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1]

The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.

SLIDE 7


Lookup at depth 0, in node A (the root), with I0 = (0,0):

Px = (I0x + frac(M·2^0)) / Sx
Py = (I0y + frac(M·2^0)) / Sy

where frac(A) denotes the fractional part of A, ID is the index of the data grid of the node visited at depth D, and (Sx, Sy) = (4, 1) is the size of the indirection texture in grid units. Let M = (0.7, 0.7):

Coordinates of M within grid A = frac(M·2^0) = frac(0.7×1) = 0.7
x coordinate of the lookup point P in the texture: Px = (0 + 0.7)/4 = 0.175
y coordinate of the lookup point P in the texture: Py = (0 + 0.7)/1 = 0.7

SLIDE 8


Lookup at depth 1, in node B, with I1 = (1,0):

Px = (I1x + frac(M·2^1)) / Sx
Py = (I1y + frac(M·2^1)) / Sy

For M = (0.7, 0.7):

Coordinates of M within grid B = frac(M·2^1) = frac(0.7×2) = 0.4
Px = (1 + 0.4)/4 = 0.35
Py = (0 + 0.4)/1 = 0.4

SLIDE 9


Lookup at depth 2, in node C, with I2 = (2,0):

Px = (I2x + frac(M·2^2)) / Sx
Py = (I2y + frac(M·2^2)) / Sy

For M = (0.7, 0.7):

Coordinates of M within grid C = frac(M·2^2) = frac(0.7×4) = 0.8
Px = (2 + 0.8)/4 = 0.7
Py = (0 + 0.8)/1 = 0.8

SLIDE 10

Drawbacks

• Very difficult to create such a data representation in parallel
• No a priori knowledge of the position where a particular node in the octree is going to land in the texture
• The same amount of memory is allocated for both leaves and internal nodes; internal nodes do not contain much information, so there is no need to allocate the same memory space for an internal node as for a leaf
• Difficult to compute the postorder traversal

SLIDE 11

Parallel Octrees on GPU

• Space Filling Curves
• CUDA Programming Model
• Parallel Bitonic Sort

SLIDE 12

Parallel Octrees on GPU

• Space Filling Curves
• CUDA Programming Model
• Parallel Bitonic Sort

SLIDE 13

Space Filling Curves: Motivation

• Easy to implement in practice
• Mathematical representation that enables fast computation of data ownership
• Easy to parallelize
• Good quality load balancing

An SFC-based domain decomposition meets the above requirements

SLIDE 14

Space Filling Curves

• Consider the recursive bisection of a 2-D area into non-overlapping cells of equal size
• A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering

Z-SFC for k = 2

SLIDE 15

SFC Construction

• Ordering the cells directly is expensive at run time, since it typically requires sorting
• Compute the integer coordinates of each cell containing a particle
• Represent each integer coordinate of a cell using k bits, then interleave the bits starting from the first dimension to form a single integer
• The index of a cell with given coordinates can thus be computed in O(k) time
• Do a parallel integer sort to get the SFC ordering
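The bit-interleaving step above can be sketched directly; the function name morton_index is an illustrative choice for the Z-order (Morton) index that this interleaving produces:

```python
# Hedged sketch of the bit-interleaving step: each integer cell coordinate
# is written with k bits, and the bits are interleaved across dimensions
# (first dimension first, most significant bit first) to form the cell's
# Z-order index.
def morton_index(coords, k):
    """Interleave the k-bit coordinates of a cell into one integer."""
    d = len(coords)
    index = 0
    for bit in range(k - 1, -1, -1):          # most significant bit first
        for dim in range(d):                  # first dimension first
            index = (index << 1) | ((coords[dim] >> bit) & 1)
    return index
```

Sorting the cells by this integer gives the SFC ordering mentioned in the last bullet; for k = 1 in 2-D the four cells map to indices 0 through 3 along the Z-curve.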

SLIDE 16

SFC and Octrees

• SFC decomposition is very similar, though not identical, to an octree decomposition
• Octrees can be viewed as multiple SFCs at various resolutions
• The process of assigning indices can be viewed hierarchically
• Any ambiguity? We only need to be able to:
  - check whether one cell is contained in another cell
  - find the smallest cell containing two given cells
  - find the immediate subcell of a cell that contains a given cell

SLIDE 17

Parallel Octrees on GPU

• Space Filling Curves
• CUDA Programming Model
• Parallel Bitonic Sort

SLIDE 18

CUDA Programming Model

[Figure: a kernel launched on a grid of thread blocks, Block (0,0) through Block (2,1), each block a 4×3 array of threads, Thread (0,0) through Thread (3,2)]

SLIDE 19

CUDA Programming Model

• Block of threads
• Grids of thread blocks
• Any computation that is performed independently on many different data items can be isolated into a function, called a kernel, that is executed on the GPU as many different threads
• A GPU may run all the blocks of a grid sequentially if it has very few parallel capabilities, or in parallel if it has many
• How does CUDA fit with octrees and SFCs?

SLIDE 20

SLIDE 21

[Figure: a quadtree grid of side 2^3 shown level by level, for k = 3, 2, 1, 0]

• Maximum number of levels L is given
• Nodes at level k: 2^kd
• The 2-D position of the parent of a node in the upper level can be calculated directly from the 2-D position of the child node
• Each node also stores:
  - nodeType: -1 for empty, -2 for filled, -3 for filled internal node
  - numEmptyNodes: the number of empty nodes in the subtree of the node (including itself)
  - (level, 2-D position in that level)

Octree Construction
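The pointer-free parent computation claimed above can be sketched as follows; this is a CPU sketch with illustrative function names, for the quadtree case (d = 2):

```python
# Hedged sketch: in a complete quadtree laid out level by level, the parent
# of the node at 2-D position (x, y) on level k sits at (x // 2, y // 2)
# on level k - 1, so no parent pointers are needed.
def parent_position(x, y):
    return (x // 2, y // 2)

def nodes_at_level(k, d=2):
    """Number of nodes on level k of a complete 2^d-ary tree: 2^(k*d)."""
    return 2 ** (k * d)
```

Each dimension simply loses one bit per level, which is the same halving that the Z-SFC index performs.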

SLIDE 22

• Multiple passes, considering two levels in each pass
• Allocate threads so that each thread handles four nodes; these four nodes come one after the other in the SFC linearization of that level
• Each thread checks the number of empty nodes among those four nodes

[Figure: threads T(0,0), T(1,0), T(0,1), T(1,1), each mapped to four consecutive nodes]

Octree Construction contd.

SLIDE 23

• If all four nodes are empty, the thread sets the nodeType field of the parent to -1 and numEmptyNodes to the number of empty nodes in the subtree plus 1. The dataLocation field of the parent remains null.
• If exactly one of the four nodes is non-empty, the thread sets the nodeType of the non-empty node to -1; in the parent it sets numEmptyNodes to the number of empty nodes in the subtree plus 1, nodeType to the nodeType of the non-empty node, and dataLocation to the dataLocation of the non-empty node.
• Otherwise, it just sets the nodeType field of the parent to -3 and numEmptyNodes to the number of empty nodes in the subtree.
• Repeat the same procedure for the remaining levels to generate the complete octree
• Highly data parallel, with zero communication between the GPU threads

Octree Construction contd.
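One pass of this construction can be sketched sequentially on the CPU; in this hedged sketch each loop iteration stands in for one GPU thread, and the slides' field names are shortened to type, numEmpty and data. The node type codes follow the slides (-1 empty, -2 filled, -3 filled internal node):

```python
# Hedged CPU sketch of one bottom-up pass.  Each "thread" handles four
# consecutive children in SFC order and fills in one parent record.
def build_parent_level(children):
    parents = []
    for i in range(0, len(children), 4):
        group = children[i:i + 4]
        empties = sum(c["numEmpty"] for c in group)
        nonempty = [c for c in group if c["type"] != -1]
        if not nonempty:                       # all four children empty
            parents.append({"type": -1, "numEmpty": empties + 1,
                            "data": None})
        elif len(nonempty) == 1:               # adaptive pruning: lift the
            only = nonempty[0]                 # lone child into the parent
            parents.append({"type": only["type"],
                            "numEmpty": empties + 1,
                            "data": only["data"]})
            only["type"] = -1                  # child is now marked empty
        else:                                  # a genuine internal node
            parents.append({"type": -3, "numEmpty": empties,
                            "data": None})
    return parents
```

Because each group of four children maps to exactly one parent slot, no iteration reads or writes another iteration's output, which is the zero-communication property noted above.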

SLIDE 24

• For each node, directly calculate its postorder number (PONA) according to the non-adaptive tree, in parallel
• Also calculate the number of empty nodes (NE) that precede the current node in the postorder numbering of the non-adaptive tree
• Final postorder number in the adaptive tree = PONA - NE
• To calculate PONA, make use of a table structure of node counts
• How to calculate NE? O(log4 n) time for n processors

Postorder Traversal

Maximum level | Total number of nodes in tree
1             | 5
2             | 21
3             | 85
4             | 341
5             | 1365
6             | 5461
…             | …
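The node counts in this table follow a geometric series: a complete quadtree with levels 0 through L contains (4^(L+1) - 1)/3 nodes. A minimal sketch (the function name total_nodes is illustrative):

```python
# Hedged sketch of the table above: a complete 2^d-ary tree with levels
# 0..L has sum over k of (2^d)^k = ((2^d)^(L+1) - 1) / (2^d - 1) nodes.
def total_nodes(L, d=2):
    b = 2 ** d                      # branching factor (4 for a quadtree)
    return (b ** (L + 1) - 1) // (b - 1)
```

For a quadtree this reproduces the table: total_nodes(1) is 5, total_nodes(3) is 85, and so on.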

SLIDE 25

Postorder Traversal

[Figure: kernel code, one kernel calculates NE and the other calculates PONA]

SLIDE 26

SLIDE 27

• Initial memory allocation is the same as in implementation one
• Also allocate a global array G having the size of the octree
• Within each node we store nodeType and dataLocation, as in implementation one
• We do not store the number of empty nodes, but we do store the level and SFC position of the node
• Within each pass, the construction of the parent level from the child-level nodes (or leaves) is exactly the same as in implementation one
• Once the parent level is constructed, copy the SFC-linearized child level to the global array G and delete the child array from memory
• Copying in the next pass starts from where it ended in the last pass

Octree Construction

SLIDE 28

• Parallel sort the nodes in the global array G to get the postorder traversal of the octree. How?
• We order two nodes based on the level in which they lie and their SFC index within that level
• Note that the SFC ordering of the nodes in a particular level is the same as their ordering in the postorder traversal of the tree
• For any two nodes Ni (level Li, SFC index i) and Nj (level Lj, SFC index j) in G, all nodes at level Lj with SFC indices up to a bound determined by Ni come before Ni in the final postorder of the tree
• On simplifying: whether a swap between Nj and Ni is required in the array can be decided from (Li, i) and (Lj, j) alone

Postorder Traversal
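One comparator consistent with the ordering described above can be sketched as follows. This is a hedged sketch, not necessarily the exact formula used on the slides: the key is (index of the node's last descendant leaf in SFC order, deeper level first), which reproduces the postorder, since a node shares its last leaf with its ancestors and must precede them:

```python
# Hedged sketch: a sort key on (level, SFC index) that yields postorder.
# A node on level `level` covers a contiguous run of (2^d)^(L - level)
# leaves at the maximum level L; sort by the last leaf of that run, and
# break ties (ancestor chains) by putting the deeper node first.
def postorder_key(level, sfc_index, L, d=2):
    span = (2 ** d) ** (L - level)          # leaves under this node
    last_leaf = (sfc_index + 1) * span - 1
    return (last_leaf, -level)

nodes = [(0, 0)] + [(1, i) for i in range(4)]   # root plus its 4 children
order = sorted(nodes, key=lambda n: postorder_key(n[0], n[1], L=1))
```

A parallel bitonic sort with this key in place of Python's sorted would give the GPU version.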

SLIDE 29

• For n processors, the sorting algorithm takes O(log² n) steps to sort n elements

Postorder Traversal contd.

Comparison function for sorting (idj, idi)

SLIDE 30

Comparison between implementations 1 and 2

• The time for constructing the octree and the amount of GPU memory required are of the same order for both algorithms.
• Theoretically, the calculation of the postorder numbers of n nodes in implementation one takes O(log4 n) time on n processors. In implementation two, the sorting algorithm takes O(log² n) steps on n processors to sort n elements.
• Bank conflicts in implementation one.
• Large amount of data movement during the sorting phase in implementation two.
• No need to calculate the SFC at each level in implementation one.
• Very difficult to sort a number of elements that is not a power of two in parallel on the GPU in implementation two.

SLIDE 31

Performance Comparison

Time for tree construction and postorder calculation (milliseconds):

Number of levels | Implementation 1 | Implementation 2
3                | 0.21             | 0.35
4                | 0.39             | 0.72
5                | 1.25             | 1.93
6                | 2.62             | 4.56
7                | 5.89             | 16.21
8                | 18.63            | 72.14
9                | 64.78            | 300.18

SLIDE 32

Performance Comparison

Time for tree construction and postorder calculation (milliseconds):

Number of levels | Implementation 1 | Implementation 2
3                | 0.51             | 0.85
4                | 2.60             | 3.12
5                | 5.25             | 6.73
6                | 6.48             | 8.56
7                | 9.89             | 25.21
8                | 20.63            | 98.14
9                | 71.32            | 340.29

SLIDE 33

Conclusion

  • 1. Catalogued the currently existing implementations of octrees on GPUs
  • 2. Proposed two different algorithms for parallel construction of octrees on GPUs using CUDA
  • 3. Rudimentary analysis of the running time of both algorithms

Future Work

  • 1. Further optimizations in the octree implementations (utilizing the memory storage framework used by the GPUs and reducing non-primitive GPU operations such as integer division, modulo, and branching)
  • 2. Adaptation of a global illumination method using the Fast Multipole Method on GPUs
  • 3. Exploration of compressed octrees on GPUs (low priority)
SLIDE 34

References

  • 1. NVIDIA CUDA Programming Guide. http://developer.nvidia.com/cuda
  • 2. S. Lefebvre, S. Hornus, and F. Neyret. Octree Textures on the GPU. In GPU Gems 2, pages 595–614. Addison-Wesley, 2005.
  • 3. H. Sagan. Space-Filling Curves. Springer-Verlag, 1994.
  • 4. T. W. Christopher. Bitonic Sort Tutorial. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm
  • 5. S. Aluru and F. Sevilgen. Dynamic compressed hyper-octrees with applications to n-body problems. In Proceedings of Foundations of Software Technology and Theoretical Computer Science, pages 21–33, 1999.
  • 6. M. S. Warren and J. K. Salmon. A parallel hashed octree n-body algorithm. In Proceedings of Supercomputing, pages 12–21, 1993.
  • 7. L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.

SLIDE 35

SLIDE 36

CUDA general specifications

• The maximum number of threads per block is 512
• The maximum sizes of the x-, y-, and z-dimensions of a thread block are 512, 512, and 64, respectively
• The maximum size of each dimension of a grid of thread blocks is 65535
• The warp size is 32 threads
• The number of registers per multiprocessor is 8192
• The amount of shared memory available per multiprocessor is 16 KB, organized into 16 banks
• The amount of constant memory available is 64 KB, with a cache working set of 8 KB per multiprocessor
SLIDE 37

CUDA general specifications

• The maximum number of blocks that can run concurrently on a multiprocessor is 8
• The maximum number of warps that can run concurrently on a multiprocessor is 24
• The maximum number of threads that can run concurrently on a multiprocessor is 768
• Each multiprocessor is composed of eight processors, so a multiprocessor processes the 32 threads of a warp in four clock cycles

SLIDE 38

SLIDE 39

Compressed Octrees

[Figure: a point set in cells numbered 1–10 and the corresponding compressed octree]

SLIDE 40

Compressed Octrees contd.

[Figure: the same cells and compressed octree as before]

• Each node in a compressed octree is either a leaf or has at least 2 children

SLIDE 41

• Store 2 cells in each node of the compressed octree
• Large cell: the largest cell that encloses all the points the node represents
• Small cell: the smallest cell that encloses all the points the node represents

Encapsulating spatial information lost in compression

[Figure: cells numbered 1–10 and the corresponding compressed octree]

SLIDE 42

Compressed Octree Construction

Given a set of points, the side length of the domain, and a pre-specified maximum resolution:

• For each point, generate the index of the leaf cell containing it, which is the cell at the maximum resolution
• Parallel sort the leaf indices to compute their SFC linearization, i.e. the left-to-right order of the leaves in the compressed octree
• Incrementally construct the tree from the sorted list of leaves, starting from a single-node tree containing the first leaf and the root cell
• Note: do not confuse a cell with a node

SLIDE 43

Compressed Octree Construction contd.

 Keep track of the most recently inserted leaf  Let q be the next leaf  Starting from most recently inserted leaf, traverse the path from leaf to

the root until we find first node v such that q belongs to L(v) Now we have 2 cases 1. : create a new node u between v and its parent and insert a new child of u with q as small cell.

SLIDE 44

Compressed Octree Construction contd.

  2. If q lies in the small cell of v: insert q as a child of v corresponding to the subcell that contains it

SLIDE 45

SLIDE 46

Fast Multipole Method

The Fast Multipole Method is concerned with evaluating the effect of a "set of sources" on a set of "evaluation points". More formally, given N sources x_i with strengths q_i and M evaluation points y_j, we wish to evaluate the sums

f(y_j) = Σ_{i=1..N} q_i K(y_j, x_i),  j = 1, …, M

• Total complexity of direct evaluation: O(NM)

slide-47
SLIDE 47

The FMM attempts to reduce this complexity to O(N + M).

• The two main insights that make this possible are:
  - Factorization of the kernel into source and receiver terms
  - Many application domains do not require the function to be calculated at high accuracy
• FMM follows a hierarchical, tree-based approach
• Each node has associated multipole and local expansions
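To make concrete what the FMM accelerates, here is a hedged sketch of the naive O(NM) direct evaluation; the 1/r kernel and the name direct_sum are illustrative choices, since the slides leave the kernel unspecified:

```python
# Hedged sketch of the O(N*M) direct evaluation that the FMM replaces:
# every evaluation point interacts with every source.  The 1/r kernel in
# 2-D is illustrative only.
def direct_sum(sources, strengths, targets):
    out = []
    for (tx, ty) in targets:
        total = 0.0
        for (sx, sy), q in zip(sources, strengths):
            r = ((tx - sx) ** 2 + (ty - sy) ** 2) ** 0.5
            if r > 0:              # skip self-interaction at zero distance
                total += q / r
        out.append(total)
    return out
```

The FMM replaces the inner loop over distant sources with evaluations of the per-node expansions, which is where the O(N + M) bound comes from.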

slide-48
SLIDE 48

Each node has two kinds of interaction lists:

• Far cell list
• Near cell list
• There is no far cell list at levels 0 and 1, since every cell is a near neighbor of every other
• Transfer of energy from near neighbors happens only for leaves

Building Interaction Lists

Next: passes of the FMM

SLIDE 49

FMM Algorithm

SLIDE 50

SLIDE 51

SLIDE 52

SLIDE 53

Only for leaves of the quadtree

FMM Algorithm

SLIDE 54

Octrees and SFCs

• Octrees can be viewed as multiple SFCs at various resolutions
• To establish a total order on the cells of an octree, given 2 cells:
  - if one is contained in the other, the subcell is taken to precede the supercell
  - if they are disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them
• The resulting linearization is identical to a postorder traversal