Outline
- Problem Definition
- Overview of FMM
- Parallel FMM
- Space Filling Curves and Compressed Octrees
- Parallel Compressed Octrees
- Computing Translations
- Octree Textures on GPU
Problem Definition
Goal: implement the parallel Fast Multipole Method (FMM) on graphics hardware.
- Parallel FMM using multiple processors (already done)
- FMM using GPUs (to be done)
Fast Multipole Method
The FMM is concerned with evaluating the effect of a set of sources on a set of evaluation points. More formally, given sources x_1, …, x_N with strengths u_1, …, u_N and evaluation points y_1, …, y_M, we wish to evaluate the sum

f(y_j) = Σ_{i=1}^{N} u_i φ(x_i, y_j),   j = 1, …, M

Total complexity: O(NM). The FMM attempts to reduce this complexity to O(N + M). The two main insights that make this possible are:
- Factorization of the kernel φ into source and receiver terms
- Many application domains do not require the function to be calculated at high accuracy
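As a point of reference, the direct O(NM) evaluation of the sum above can be sketched in a few lines of Python; the 1/|x − y| kernel used here is an illustrative choice, not one mandated by the slides:

```python
# Direct O(N*M) evaluation of f(y_j) = sum_i u_i * phi(x_i, y_j).
# The kernel phi(x, y) = 1/|x - y| is a hypothetical example kernel.

def direct_sum(sources, strengths, targets):
    """Evaluate the field at every target by summing over all sources."""
    def phi(x, y):
        return 1.0 / abs(x - y)  # assumes no target coincides with a source
    # Two nested loops: N sources * M targets => O(NM) work, which the
    # FMM's hierarchical approximation is designed to avoid.
    return [sum(u * phi(x, y) for x, u in zip(sources, strengths))
            for y in targets]

# Example: unit sources at 0 and 3, evaluated at y = 1: 1/1 + 1/2 = 1.5
print(direct_sum([0.0, 3.0], [1.0, 1.0], [1.0]))  # [1.5]
```

The FMM replaces the inner loop over all sources with a few near-field terms plus a truncated expansion summarizing the far field.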
FMM follows a hierarchical decomposition of the domain (a quadtree in 2D, an octree in 3D). Each node has associated multipole and local expansions. Each node has two kinds of interaction lists:
- Far cell list
- Near cell list
There is no far cell list at level 0 or level 1, since every cell there is a near neighbor of every other. Transfer of energy from near neighbors happens only for leaves.
Building Interaction Lists
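The near/far classification for a uniform 2D quadtree level can be sketched as follows; the `(level, i, j)` cell representation and helper names are illustrative choices, not from the slides:

```python
# Sketch of interaction-list construction on a uniform 2D quadtree.
# A cell is (level, i, j); at level l the grid is 2^l x 2^l cells.

def near_list(cell):
    """Cells at the same level that touch `cell` (its near neighbors)."""
    l, i, j = cell
    n = 1 << l
    return [(l, i + di, j + dj)
            for di in (-1, 0, 1) for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
            and 0 <= i + di < n and 0 <= j + dj < n]

def children(cell):
    l, i, j = cell
    return [(l + 1, 2 * i + di, 2 * j + dj)
            for di in (0, 1) for dj in (0, 1)]

def far_list(cell):
    """Children of the parent's near neighbors that are not near `cell`.

    Assumes level >= 1 (no far list at levels 0 and 1, as noted above).
    """
    l, i, j = cell
    parent = (l - 1, i // 2, j // 2)
    near = set(near_list(cell))
    return [c for p in near_list(parent) for c in children(p)
            if c not in near]

# An interior cell at a deep level has the classical 27-entry far list in 2D.
print(len(far_list((3, 3, 3))))  # 27
```

The far list is exactly the set of cells that are well separated from the cell but whose parents were not, which is why it is built from the parent's neighborhood.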
Next: Passes of FMM

FMM Algorithm
(figures omitted: the successive passes of the FMM on the quadtree; the near-neighbor transfer pass applies only to leaves of the quadtree)
Parallel FMM
- Space Filling Curves
- Parallel Compressed Octrees
- Parallel Bitonic Sort
- Parallel Prefix Sum
- Building Interaction Lists and Computing Various Translations
Space Filling Curves
Consider the recursive bisection of a 2D area into non-overlapping cells of equal size. A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering.
(figure: Z-SFC for k = 2)
SFC Construction
Ordering the cells by comparison-based sorting is expensive for the large number of cells typically involved. Instead:
- Represent the integer coordinates of each cell using k bits per dimension, then interleave the bits, starting from the first dimension, to form a single integer.
- This interleaved integer is the index of the cell with the given coordinates; it takes O(k) time to find the index.
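The bit-interleaving step can be sketched directly; the convention below (x bit more significant than the y bit at each level, matching "starting from the first dimension") is one reasonable reading of the slide:

```python
# Bit interleaving: map a cell's integer coordinates (x, y), each k bits,
# to its Z-order (Morton) index in O(k) time. The x bit is taken as the
# more significant bit of each interleaved pair.

def z_index(x, y, k):
    idx = 0
    for b in range(k - 1, -1, -1):        # most to least significant bit
        idx = (idx << 1) | ((x >> b) & 1)
        idx = (idx << 1) | ((y >> b) & 1)
    return idx

# For k = 2 (a 4x4 grid) the 16 cells receive indices 0..15 in Z order:
print(z_index(0, 0, 2), z_index(1, 0, 2), z_index(3, 3, 2))  # 0 2 15
```

Sorting cells by this index yields the Z-SFC linearization without any pairwise cell comparisons.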
Parallel FMM
Space Filling Curves
Parallel Compressed Octrees
Parallel Bitonic Sort Parallel Prefix Sum Building Interaction Lists and computing various Translations
Octrees
Compressed Octrees
Each node in a compressed octree is either a leaf or has at least 2 children. Store 2 cells in each node of the compressed octree:
- Large cell: the largest cell that encloses all the points the node represents
- Small cell: the smallest cell that encloses all the points the node represents
These two cells encapsulate the spatial information lost in compression.
Octrees and SFCs
Octrees can be viewed as multiple SFCs at various resolutions. To establish a total order on the cells of an octree, given 2 cells:
- if one is contained in the other, the subcell is taken to precede the supercell
- if they are disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them
The resulting linearization is identical to a postorder traversal of the octree.
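The total order just described can be sketched as a comparison function on cells; the `(level, x, y)` representation and the fixed maximum resolution `L` are assumptions of this sketch:

```python
# Total order on quadtree cells, with a cell given as (level, x, y).
# Disjoint cells are compared via the Z-indices of their first
# descendants at a common maximum resolution L.

def z_index(x, y, k):
    idx = 0
    for b in range(k - 1, -1, -1):
        idx = (idx << 1) | ((x >> b) & 1)
        idx = (idx << 1) | ((y >> b) & 1)
    return idx

def contains(a, b):
    """True if cell a encloses cell b (a at a coarser or equal level)."""
    (la, xa, ya), (lb, xb, yb) = a, b
    return la <= lb and (xb >> (lb - la), yb >> (lb - la)) == (xa, ya)

def precedes(a, b, L=10):
    """True if cell a comes before cell b in the linearization."""
    if contains(a, b):        # a is the supercell: subcell b comes first
        return False
    if contains(b, a):        # a is the subcell: it precedes its supercell
        return True
    # Disjoint cells: compare positions along the SFC at resolution L,
    # which matches ordering by subcells of their smallest common supercell.
    la, xa, ya = a
    lb, xb, yb = b
    return (z_index(xa, ya, la) << 2 * (L - la)) < (z_index(xb, yb, lb) << 2 * (L - lb))

# A subcell precedes the quadrant that contains it:
print(precedes((2, 1, 1), (1, 0, 0)))  # True
```

With the subcell-first tie-break, sorting cells by `precedes` reproduces a postorder traversal of the tree of cells.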
Parallel Compressed Octree Construction
- Consider n points equally distributed across the processors; let k be the pre-specified maximum resolution.
- For each point, generate the index of the leaf cell containing it, i.e., the cell at the maximum resolution.
- Parallel sort the leaf indices to compute their SFC-linearization, i.e., the left-to-right order of leaves in the compressed octree.
- Each processor obtains the leftmost leaf cell of the next processor. Why? So that the internal nodes lying between a processor's last leaf and the next processor's first leaf can be generated locally.
- On each processor, construct a local compressed octree for the leaf cells within it and the borrowed leaf cell.
- Send the out-of-order nodes to the appropriate processors.
- Insert the received out-of-order nodes into the already existing sorted order of nodes.
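The local construction step can be sketched sequentially: for leaves sorted in Z order, the internal nodes of the compressed tree are the smallest common supercells of adjacent leaf pairs. This LCA formulation is an assumption of the sketch, consistent with the construction described above (2D case, 2 bits per level):

```python
# Given occupied leaf cells at maximum resolution k, sorted by Z-order
# index, the internal nodes of the compressed quadtree are the smallest
# common supercells (LCAs) of adjacent leaves.

def lca_level(za, zb, k):
    """Level of the smallest cell enclosing both leaves (2 bits/level in 2D)."""
    level = 0
    while level < k and (za >> 2 * (k - level - 1)) == (zb >> 2 * (k - level - 1)):
        level += 1
    return level

def internal_nodes(leaf_z, k):
    """Return (level, z_prefix) of each distinct internal node."""
    nodes = set()
    for za, zb in zip(leaf_z, leaf_z[1:]):
        lvl = lca_level(za, zb, k)
        nodes.add((lvl, za >> 2 * (k - lvl)))
    return sorted(nodes)

# Leaves with Z-indices 0, 3, 15 at k = 2: leaves 0 and 3 first share a
# level-1 cell; leaf 15 joins them only at the root (level 0).
print(internal_nodes([0, 3, 15], 2))  # [(0, 0), (1, 0)]
```

Because only adjacent pairs matter, borrowing the next processor's leftmost leaf is exactly what lets each processor finish its share of the internal nodes without further communication.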
Parallel FMM
The FMM computation consists of the following phases:
- Building the compressed octree
- Building interaction lists
- Computing multipole expansions using a bottom-up traversal
- Computing multipole-to-local translations for each cell using its interaction list
- Computing the local expansions using a top-down traversal
- Projecting the field at leaf cells back to the particles
Computing Multipole Expansions
- Each processor scans its local array from left to right.
- If a leaf node is reached, its multipole expansion is computed directly.
- If a node's multipole expansion is known, it is shifted and added to the parent's multipole expansion, provided the parent is local to the processor.
- Why postorder? Because every child appears before its parent, a single left-to-right scan completes all local accumulations.
- If the multipole expansion of a cell is known but its parent lies on a different processor, the cell is labeled a residual node.
- If the multipole expansion at a node is not yet computed when it is visited, it is also labeled a residual node.
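Why the postorder scan works can be illustrated with a toy accumulation; the array-with-parent-pointers representation and the use of plain sums in place of shift-and-add of expansions are assumptions of this sketch:

```python
# In a postorder layout every child appears before its parent, so one
# left-to-right scan suffices to accumulate children into parents
# (plain sums stand in for shift-and-add of multipole expansions).

def upward_accumulate(values, parent):
    """values[i]: node i's own term; parent[i]: index of i's parent (-1 for root)."""
    acc = list(values)
    for i in range(len(acc)):            # single scan, left to right
        if parent[i] >= 0:
            acc[parent[i]] += acc[i]     # parent[i] > i holds in postorder
    return acc

# Postorder tree: leaves 0,1 under node 2; leaf 3 and node 2 under root 4.
print(upward_accumulate([1, 2, 0, 4, 0], [2, 2, 4, 4, -1]))  # [1, 2, 3, 4, 7]
```

By the time the scan reaches any internal node, all of its children have already deposited their contributions, so each node is finalized exactly when it is visited.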
- Residual nodes form a tree (termed the residual tree); it is present in its postorder traversal order, distributed across the processors.
- Multipole expansions on the residual tree can be computed using an efficient parallel upward tree accumulation algorithm.
- The residual tree can be accumulated in far fewer rounds than the global compressed octree: the worst-case number of communication rounds is reduced from the height of the global tree to the height of the residual tree, which is much smaller.
Computing Multipole to Local Translations
An all-to-all communication is used to receive the fields of interaction-list nodes that reside on remote processors. Once all the information is available locally, the multipole-to-local translations are carried out within each processor in much the same way as in sequential FMM.
Computing Local Expansions
Similar to computing multipole expansions:
- Calculate local expansions for the residual tree.
- Compute local expansions for the local tree using a scan of the local array.
- The exact number of communication rounds required is the same as in computing multipole expansions.
The content of a leaf is stored directly as an RGB value. The alpha channel is used to distinguish between an index to a child and the content of a leaf:
- alpha = 1: data
- alpha = 0.5: index
- alpha = 0: empty cell
Octree Textures on GPU
(figure: indirection pool with nodes A(0,0), B(1,0), C(2,0), D(3,0))
Retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1].
The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.
Step 1: node A (the root), I0 = (0,0)
Px = (I0x + frac(Mx · 2^0)) / Sx, Py = (I0y + frac(My · 2^0)) / Sy, where frac(a) denotes the fractional part of a.
Let M = (0.7, 0.7). Coordinates of M within grid A: frac(M · 2^0) = frac(0.7 × 1) = 0.7.
x coordinate of the lookup point P in the texture: Px = (0 + 0.7)/4 = 0.175
y coordinate of the lookup point P in the texture: Py = (0 + 0.7)/1 = 0.7
A(0,0) B(1,0) C(2,0) D(3,0) (1,0) (3,0) (2,0)
+
M
Px = I1x + frac(M.21) Sx Py = I1y + frac(M.21) Sy I1 = (1,0) node B P
I1 = (1,0) M=(0.7, 0.7) Coordinates of M within grid B = frac(M·21) = frac(0.7x2) = 0.4 x coordinate of the lookup point P in the texture = Px = {I1x + frac(M.21)}/Sx = (1 + 0.4)/4 = 0.35 y coordinate of the lookup point P in the texture = Py = {I1y + frac(M.21)}/Sy = (0 + 0.4)/1 = 0.4
A(0,0) B(1,0) C(2,0) D(3,0) (1,0) (3,0) (2,0)
+
M
Px = I2x + frac(M.22) Sx Py = I2y + frac(M.22) Sy I2 = (2,0) node C P
I2 = (2,0) M=(0.7, 0.7) Coordinates of M within grid C = frac(M·22) = frac(0.7x4) = 0.8 x coordinate of the lookup point P in the texture = Px = {I2x + frac(M.22)}/Sx = (2 + 0.8)/4 = 0.7 y coordinate of the lookup point P in the texture = Py = {I2y + frac(M.22)}/Sy = (0 + 0.8)/1 = 0.8
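The three steps above can be reproduced with a small sketch. The 4×1 pool of nodes A–D follows the figure; the helper names are illustrative, and a full lookup would additionally read the texel and branch on its alpha value as described earlier:

```python
import math

# Texture-coordinate computation of the octree-texture lookup walk:
# at depth l, inside the node stored at grid position I of an Sx x Sy
# indirection pool, the lookup point is P = (I + frac(M * 2^l)) / S.

SX, SY = 4.0, 1.0                      # pool holds nodes A..D side by side

def frac(a):
    """Fractional part of a."""
    return a - math.floor(a)

def lookup_point(I, M, l):
    ix, iy = I
    mx, my = M
    return ((ix + frac(mx * 2 ** l)) / SX,
            (iy + frac(my * 2 ** l)) / SY)

M = (0.7, 0.7)
print(lookup_point((0, 0), M, 0))  # node A: approx (0.175, 0.7)
print(lookup_point((1, 0), M, 1))  # node B: approx (0.35, 0.4)
print(lookup_point((2, 0), M, 2))  # node C: approx (0.7, 0.8)
```

Each iteration rescales M by a factor of 2, so the same fragment-program arithmetic serves every level of the tree.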
References
- L. Greengard and V. Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.
- J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal on Scientific and Statistical Computing, 9:669–686, July 1988.
- R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
- B. Hariharan and S. Aluru. Efficient parallel algorithms and software for compressed octrees with applications to hierarchical methods. Parallel Computing, 31:311–331, 2005.
- B. Hariharan, S. Aluru, and B. Shanker. A Scalable Parallel Fast Multipole Method for Analysis of Scattering from Perfect Electrically Conducting Surfaces. Proc. Supercomputing, page 42, 2002.
- H. Sagan. Space Filling Curves. Springer-Verlag, 1994.
- M. Harris. Parallel Prefix Sum (Scan) with CUDA. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.htm
- S. Lefebvre, S. Hornus, and F. Neyret. GPU Gems 2, Octree Textures on the GPU, pages 595–614. Addison Wesley, 2005.
- T. W. Christopher. Bitonic Sort Tutorial.