Outline Problem Definition Summary of Work Done Space Filling - PowerPoint PPT Presentation

Outline
- Problem Definition
- Summary of Work Done
- Space Filling Curves
- Bottom-Up Octree construction on the GPU
- Timing Results
- The Fast Multipole Method
- Timing and Quality Results
- Conclusion

Problem Definition
To provide an efficient, parallel, GPU-based global illumination solution for point models that is many times faster than the corresponding CPU implementation.
INPUT: A 3-D point model with attributes such as 3-D coordinates, default surface diffuse color, emissivity and surface normals.
OUTPUT: A fast parallel global illumination solution showing effects such as color bleeding and soft shadows.

FMM for Global Illumination on GPU?
- Global illumination is an N-body problem: each particle is affected by the presence of all other particles, so the cost is quadratic in the number of particles.
- The input data (point models in our case) is very large: more than 10^5 particles.
- Direct computation on the GPU is not feasible, because the memory required to exploit the available parallelism is too high.
- FMM solves the quadratic N-body problem efficiently, in linear time, by
  - approximating the solution to a user-defined accuracy
  - using a hierarchical data structure (in our case, the octree)

Contributions
- Octree on GPU, in three variants: non-adaptive, adaptive top-down, and adaptive bottom-up. Notes on the variants: very fast but memory inefficient; fast, and memory efficient compared to the non-adaptive version; parent-child relations calculated using direct SFC indexing. Supported operations: post-order traversal, location of the leaf cell containing a queried point, least common ancestor of two cells, etc. Publication status: published as a poster at I3D 2008; submitted to ICVGIP 2008 as an oral paper; intended for submission as a paper.
- View-independent visibility on GPU: fast method to calculate visibility between all point pairs in parallel. Required for correct global illumination.
- FMM on GPU: fast parallel global illumination for point models. Intended for submission as a paper.
Acknowledgements: Rhushabh Goradia, Prof. Srinivas Aluru

Space Filling Curves
- Consider the recursive bisection of a 2D area into non-overlapping cells of equal size.
- A space-filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering.
- We consider the Z-SFC, or Morton ordering. The index of the cell with coordinates (x, y) is formed by interleaving the bits of x and y.
[Figure: Z-SFC for k = 2]
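A minimal sketch of the Morton index for a 2D cell, assuming the bits of x occupy the even positions and the bits of y the odd positions. The function name `morton_index_2d` and this bit convention are illustrative assumptions, not taken from the slides.

```python
def morton_index_2d(x: int, y: int, k: int) -> int:
    """Interleave the k low bits of x and y into one Z-order (Morton) index.

    Assumption: x bits land in even positions, y bits in odd positions.
    """
    index = 0
    for bit in range(k):
        index |= ((x >> bit) & 1) << (2 * bit)        # x bit -> even position
        index |= ((y >> bit) & 1) << (2 * bit + 1)    # y bit -> odd position
    return index


# With k = 2 the domain is a 4 x 4 grid and the indices run from 0 to 15.
grid = [[morton_index_2d(x, y, 2) for x in range(4)] for y in range(4)]
```

Reading `grid` row by row traces the familiar Z pattern: the four cells of each quadrant receive consecutive indices.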

Octrees and SFCs
1. Octrees can be viewed as multiple SFCs at various resolutions.
2. A parent's index can be generated from its child's SFC index.
3. To establish a total order on the cells of an octree, given 2 cells:
   a) if one is contained in the other, the subcell is taken to precede the supercell;
   b) if disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them.
The resulting linearization is identical to a post-order traversal.
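The parent rule and the ordering rule can be sketched as follows, assuming each cell is stored as a hypothetical `(level, index)` pair where `index` is the cell's Morton index at that level and the root is `(0, 0)`. The representation and function names are assumptions made for illustration.

```python
from functools import cmp_to_key

DIMS = 3  # 3 bits per level for an octree (use 2 for a quadtree)


def parent(cell):
    """Parent cell: drop the last DIMS bits of the child's Morton index."""
    level, index = cell
    return (level - 1, index >> DIMS)


def ancestor_index(cell, level):
    """Morton index of the ancestor of `cell` at a coarser (or equal) level."""
    cell_level, index = cell
    return index >> (DIMS * (cell_level - level))


def contains(a, b):
    """True if cell a contains cell b (a cell contains itself)."""
    return a[0] <= b[0] and ancestor_index(b, a[0]) == a[1]


def compare_cells(a, b):
    """Total order on cells: subcell before supercell, else Z-order."""
    if a == b:
        return 0
    if contains(a, b):
        return 1    # b is the subcell, so b precedes a
    if contains(b, a):
        return -1   # a is the subcell, so a precedes b
    # Disjoint cells: compare ancestors at the coarser of the two levels.
    common = min(a[0], b[0])
    return -1 if ancestor_index(a, common) < ancestor_index(b, common) else 1


cells = [(0, 0), (1, 5), (2, 47), (1, 0), (2, 40), (2, 1)]
post_order = sorted(cells, key=cmp_to_key(compare_cells))
```

Sorting with this comparator places every cell after all of its descendants, which is the post-order linearization.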

Octrees
[Figure: points numbered 1 to 10 in a recursively subdivided domain and the corresponding octree, with chains labeled]

Compressed Octrees
- Each node in a compressed octree is either a leaf or has at least 2 children.
[Figure: points numbered 1 to 10 and the corresponding compressed octree]

Memory Efficient Bottom-Up Octree on the GPU

Intuitions
INPUT: n points (say, a bunny) belonging to some 3-D domain.
OUTPUT: Octree represented in post-order, with parent-child relationships established.
BOTTOM-UP TRAVERSAL
Every internal node in an octree has leaves in its subtree, so given the leaves we can decode this hierarchy and generate the internal nodes.
PARALLEL STRATEGY
- Each internal node can be considered the LCA (least common ancestor) of some particular leaf pairs (in a compressed octree).
- Given the leaves, generation of internal nodes can be parallelized, since each node can be generated independently from a leaf pair.
- Many leaf pairs might have the same LCA node, resulting in duplicates, which can be easily detected and removed.
Parent-child relationships can then be established, and the octree generated from the compressed octree, using SFC indices across multiple levels.
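A sequential sketch of the parallel strategy, under two assumptions the slides do not state: leaves are given as Morton codes at the deepest level, and the leaf pairs used are neighbors in sorted SFC order. Each LCA depends on one pair only, so on a GPU each pair could map to one thread. All names are hypothetical.

```python
def lca(a, b, max_level, dims=3):
    """Least common ancestor of two leaves given by Morton codes at max_level.

    Returns a (level, index) pair; strips `dims` bits per level until the
    two codes agree.
    """
    level = max_level
    while a != b:
        a >>= dims
        b >>= dims
        level -= 1
    return (level, a)


def internal_nodes(leaf_codes, max_level, dims=3):
    """Internal nodes of the compressed tree, generated bottom-up."""
    leaves = sorted(set(leaf_codes))
    # One independent LCA computation per adjacent leaf pair.
    candidates = [lca(a, b, max_level, dims) for a, b in zip(leaves, leaves[1:])]
    # Several pairs may share an LCA; duplicates are removed here.
    return sorted(set(candidates))


# Quadtree example (dims=2) with leaves at level 2.
nodes = internal_nodes([12, 0, 1], max_level=2, dims=2)
```

In the example, leaves 0 and 1 share the level-1 cell `(1, 0)`, and leaf 12 joins them only at the root `(0, 0)`.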

Results: Bunny (124531 points)
[Figure: GPU and CPU time (ms) versus tree level]
Tree level   GPU (ms)   CPU (ms)
5            1218       1101
6            1482       1692
7            2041       2621
8            2501       4291
9            3669       9645

Results: Ganesha (165646 points)
[Figure: GPU and CPU time (ms) versus tree level]
Tree level   GPU (ms)   CPU (ms)
5            1463       1200
6            1762       1981
7            2396       2965
8            2923       4691
9            4501       8945

Fast Multipole Method

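Fast Multipole Method
The FMM is concerned with evaluating the effect of a set of sources $X = \{x_1, \dots, x_N\}$ on a set of evaluation points $Y = \{y_1, \dots, y_M\}$. More formally, given source strengths $u_i$ and a kernel $\Phi$, we wish to evaluate the sum
$v(y_j) = \sum_{i=1}^{N} u_i \, \Phi(y_j, x_i), \quad j = 1, \dots, M$
Total complexity: $O(NM)$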
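For reference, the direct evaluation that the FMM replaces can be sketched as a double loop over evaluation points and sources. The inverse-distance kernel is an assumption chosen only for illustration (the slides do not name the kernel), and the function names are hypothetical.

```python
import math


def direct_sum(sources, strengths, targets, kernel):
    """Evaluate the summed effect of all sources at every evaluation point.

    Performs len(sources) * len(targets) kernel evaluations.
    """
    return [
        sum(u * kernel(y, x) for x, u in zip(sources, strengths))
        for y in targets
    ]


def inverse_distance(y, x):
    """Illustrative kernel: 1 / |y - x| (assumes y != x)."""
    return 1.0 / math.dist(y, x)


sources = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
strengths = [1.0, 3.0]
targets = [(1.0, 0.0, 0.0)]
values = direct_sum(sources, strengths, targets, inverse_distance)
```

With N sources and M evaluation points the kernel is called N times M times, which is the quadratic cost noted above.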

The FMM attempts to reduce this complexity to $O(N + M)$.
The two main insights that make this possible are:
- Factorization of the kernel into source and receiver terms
- Many application domains do not require the function to be calculated at high accuracy
FMM follows a hierarchical structure (in our case, the octree). Each node has associated multipole and local expansions.
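The factorization insight can be seen on a toy 1D kernel that separates exactly, (y - x)^2 = y^2 - 2xy + x^2. This kernel is an assumption chosen because it factorizes without truncation; it is not the global illumination kernel, and real FMM kernels factorize only approximately. The function names are hypothetical.

```python
def source_moments(sources, strengths):
    """Source-side terms, computed once in O(N)."""
    m0 = sum(u for u in strengths)
    m1 = sum(u * x for x, u in zip(sources, strengths))
    m2 = sum(u * x * x for x, u in zip(sources, strengths))
    return m0, m1, m2


def evaluate(moments, y):
    """Receiver-side evaluation in O(1) per evaluation point."""
    m0, m1, m2 = moments
    return y * y * m0 - 2 * y * m1 + m2


def direct(sources, strengths, y):
    """Reference O(N) evaluation per point, for comparison."""
    return sum(u * (y - x) ** 2 for x, u in zip(sources, strengths))


sources = [1, 2, 3]
strengths = [1, 1, 2]
moments = source_moments(sources, strengths)
factored = [evaluate(moments, y) for y in (0, 4)]
```

Because the source terms are summed once and reused for every evaluation point, the total work is proportional to N + M rather than N times M.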

FMM: Building Interaction Lists
Each node has two kinds of interaction lists from which the transfer of energy takes place:
- Far Cell List
- Near Cell List
There is no far cell list at levels 0 and 1, since every cell there is a near neighbor of every other. Transfer of energy from near neighbors happens only for leaves.
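A sketch of how the two lists could be built on a uniform (non-adaptive) grid, assuming the usual FMM convention that the slides do not spell out: near cells are the adjacent cells at the same level, and far cells are the children of the parent's near cells (and of the parent itself) that are not near. Cells are hypothetical integer coordinate tuples.

```python
from itertools import product


def near_cells(cell, level, dims=3):
    """Adjacent cells at the same level (excluding the cell itself)."""
    size = 1 << level  # cells per axis at this level
    result = []
    for offset in product((-1, 0, 1), repeat=dims):
        candidate = tuple(c + o for c, o in zip(cell, offset))
        if candidate != cell and all(0 <= c < size for c in candidate):
            result.append(candidate)
    return result


def far_cells(cell, level, dims=3):
    """Children of the parent's neighborhood that are not near `cell`."""
    if level == 0:
        return []  # the root has no parent
    parent = tuple(c >> 1 for c in cell)
    near = set(near_cells(cell, level, dims)) | {cell}
    far = []
    for p in near_cells(parent, level - 1, dims) + [parent]:
        for child_offset in product((0, 1), repeat=dims):
            child = tuple(2 * c + o for c, o in zip(p, child_offset))
            if child not in near:
                far.append(child)
    return far


interior_far = far_cells((3, 3, 3), level=3)
```

Under this convention the far list comes out empty at level 1, matching the slide, and an interior cell in 3D has at most 189 far cells.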

FMM Algorithm

Step 1: GPU implementation
PARALLELIZATION STRATEGIES
1) Multiple threads per leaf (one thread per particle)
   One thread produces the multipole expansion for each particle in the leaf.
   Drawbacks:
   a) After the expansions are generated they need to be consolidated, which necessitates data transfer to GPU global memory (expensive).
   b) The thread block size on the GPU is fixed at run time, so some threads may remain idle.
2) One thread per leaf
   One thread produces the full multipole expansion for the entire leaf.
   Advantage: The work of each thread is completely independent, so there is no need for shared memory. This also suits leaves with different numbers of particles: a thread that finishes work for a given leaf simply takes on another leaf, without waiting for or synchronizing with other threads.
   Drawback: To realize the full GPU load, the number of leaves should be sufficiently large (at least 8192).

FMM Algorithm
For each level l = l_max - 1, ..., 2

Step 2: GPU implementation
PARALLELIZATION STRATEGIES
Iterate from the last level up to the root (the root is at level 0). For every level, allocate one thread per parent node; each thread produces the multipole-to-multipole translations for all of that node's children.
Drawback: GPU load becomes very small at low l_max (maximum number of levels).
The upward pass is not highly compute-intensive compared to the downward pass: on the CPU, the total time taken by the upward pass is 1% of that taken by the downward pass.
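The level-by-level structure of the upward pass can be sketched with a deliberately simplified stand-in: each node carries only a monopole term (the total strength below it), so the multipole-to-multipole translation reduces to a sum over children. This simplification, the Morton-indexed dictionary layout, and the names are all assumptions made for illustration. The loop stops at level 2 by default, following the level range l = l_max - 1, ..., 2.

```python
from collections import defaultdict


def upward_pass(leaf_moments, l_max, dims=3, top_level=2):
    """Accumulate moments from the leaf level up to `top_level`.

    leaf_moments maps the Morton index of each leaf at level l_max to its
    monopole moment. Returns a dict: level -> {Morton index -> moment}.
    """
    levels = {l_max: dict(leaf_moments)}
    for level in range(l_max - 1, top_level - 1, -1):
        children = levels[level + 1]
        # Group child indices under their parent's Morton index.
        children_of = defaultdict(list)
        for child_index in children:
            children_of[child_index >> dims].append(child_index)
        # Each parent is independent: one unit of work (thread) per parent.
        levels[level] = {
            p: sum(children[c] for c in kids) for p, kids in children_of.items()
        }
    return levels


# Quadtree example (dims=2) with leaves at level 3.
result = upward_pass({0: 1, 1: 2, 63: 4}, l_max=3, dims=2)
```

Each level has fewer parents than the one below it, which is why the available parallelism shrinks toward the root.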

FMM Algorithm Most Expensive Step of the Algorithm PARALLELIZATION: Iterate from level 2 to last, compute the Multipole to Local Translation for each node at current level in parallel.

FMM Algorithm PARALLELIZATION: Iterate from level 2 to last, compute the Local to Local Translation for each node at current level in parallel.

FMM Algorithm
Only for leaves of the Quadtree.
PARALLELIZATION: Iterate from level 2 to last; if leaf, each thread performs all near-neighbor computations for a particular leaf.
PARALLELIZATION: Iterate from level 2 to last; if leaf, compute the Local Expansion for leaves at current level in parallel.

Results: Quality Comparisons Bunny (124531 points) CPU GPU

Results: Quality Comparisons Ganesha (165646 points) CPU GPU

Results: Timing Comparisons (without visibility)
Bunny (124531 points)
[Figure: GPU and CPU time (hr) versus number of points per leaf]
Points per leaf   GPU (hr)   CPU (hr)   Speedup
200               1.01       15.96      15.8
150               1.09       19.18      17.6
100               1.16       21.11      18.2
50                1.21       23.81      19.5
25                1.30       25.87      19.9

Results: Timing Comparisons (without visibility)
Ganesha (165646 points)
[Figure: GPU and CPU time (hr) versus number of points per leaf]
Points per leaf   GPU (hr)   CPU (hr)   Speedup
200               1.11       14.54      13.1
150               1.16       16.58      14.3
100               1.21       20.81      17.2
50                1.28       23.15      18.1
25                1.41       26.37      18.7

Conclusion
1. Non-adaptive octree (speedups of up to 500 times)
2. Adaptive octrees (speedups of up to 3 times)
3. FMM on the GPU for global illumination (speedups of up to 20 times)
Future Work
All 3 steps above can be done on the GPU to make a complete system.

