Parallel Fast Multipole Method on Graphics Hardware


SLIDE 1

SLIDE 2

Outline

• Problem Definition
• Overview of FMM
• Parallel FMM
• Space Filling Curves and Compressed Octrees
• Parallel Compressed Octrees
• Computing Translations
• Octree textures on GPU

SLIDE 3

Problem Definition

To implement the Parallel Fast Multipole Method (FMM) on graphics hardware:
• Parallel FMM using multiple processors (already done)
• FMM using GPUs (to be done)

SLIDE 4

Fast Multipole Method

The FMM is concerned with evaluating the effect of a set of sources X = {x_1, ..., x_N} on a set of evaluation points Y = {y_1, ..., y_M}. More formally, given source strengths q_1, ..., q_N and a kernel K, we wish to evaluate the sum

f(y_j) = Σ_{i=1}^{N} q_i K(x_i, y_j),  j = 1, ..., M

• Total complexity: O(NM)
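To make the cost concrete, here is a minimal direct-evaluation sketch in Python (NumPy and the 1/r kernel are assumptions for illustration; the slides do not fix a particular kernel):

import numpy as np

def direct_sum(sources, charges, targets):
    """Naive evaluation of f(y_j) = sum_i q_i K(x_i, y_j) with
    K(x, y) = 1/|x - y|; the cost is O(N*M)."""
    f = np.zeros(len(targets))
    for j, y in enumerate(targets):
        r = np.linalg.norm(sources - y, axis=1)  # |x_i - y_j| for all i
        f[j] = np.sum(charges / r)               # assumes no x_i equals y_j
    return f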

SLIDE 5

FMM attempts to reduce this complexity to O(N + M).

• The two main insights that make this possible are:
• Factorization of the kernel into source and receiver terms (see the sketch below)
• Many application domains do not require the function to be calculated at high accuracy

• FMM follows a hierarchical tree decomposition of space (a quadtree in 2D, an octree in 3D). Each node has an associated multipole expansion.
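As a sketch of the first insight (the low-rank factors B and C are hypothetical, not notation from the slides): if the kernel factors as K(x, y) ≈ Σ_p B_p(x) C_p(y) with p terms, the double sum collapses from O(NM) to O(p(N + M)):

import numpy as np

def factored_sum(B, C, q):
    """B: (N, p) source-side factors, C: (M, p) receiver-side factors,
    q: (N,) source strengths. Computes C @ (B.T @ q), i.e. an
    approximation of sum_i q_i K(x_i, y_j) for every target j."""
    moments = B.T @ q   # p "multipole-like" coefficients, O(N*p)
    return C @ moments  # evaluation at all M targets, O(M*p)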

SLIDE 6

Building Interaction Lists

Each node has two kinds of interaction lists (sketched below):
• Far cell list
• Near cell list
• No far cell list at level 0 or level 1, since every cell is a near neighbor of every other
• Transfer of energy from near neighbors happens only for leaves
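A rough illustration for one level of a uniform quadtree (a sketch under the standard FMM definition of the lists; the (i, j) cell indexing is an assumption, not taken from the slides):

def interaction_lists(i, j, level):
    """Near list: cells adjacent to (i, j) at the same level.
    Far list: children of the parent's neighbors that are not
    adjacent to (i, j); their effect arrives via translations.
    At levels 0 and 1 the far list comes out empty, as the slide notes."""
    n = 1 << level                                    # cells per side
    adj = lambda a, b: max(abs(a[0] - b[0]), abs(a[1] - b[1])) <= 1
    near = [(x, y) for x in range(max(i - 1, 0), min(i + 2, n))
                   for y in range(max(j - 1, 0), min(j + 2, n))
                   if (x, y) != (i, j)]
    pi, pj = i // 2, j // 2                           # parent cell
    far = [(cx, cy)
           for px in range(max(pi - 1, 0), min(pi + 2, n // 2))
           for py in range(max(pj - 1, 0), min(pj + 2, n // 2))
           for cx, cy in ((2*px, 2*py), (2*px + 1, 2*py),
                          (2*px, 2*py + 1), (2*px + 1, 2*py + 1))
           if not adj((cx, cy), (i, j))]
    return near, far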

Next: Passes of FMM

SLIDE 7

FMM Algorithm

SLIDE 8

FMM Algorithm

SLIDE 9

FMM Algorithm

SLIDE 10

FMM Algorithm

SLIDE 11

Only for leaves of the quadtree

FMM Algorithm
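The figures of slides 7–11 are not reproduced here; the control flow they step through is sketched below with toy scalar "expansions" (plain charge sums) standing in for real multipole and local expansions, purely to show the data flow. The Node layout and field names are assumptions:

from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    near_list: list = field(default_factory=list)  # adjacent leaf cells
    far_list: list = field(default_factory=list)   # interaction list
    charges: list = field(default_factory=list)    # particles in a leaf
    multipole: float = 0.0
    local: float = 0.0

def postorder(n):
    for c in n.children:
        yield from postorder(c)
    yield n

def preorder(n):
    yield n
    for c in n.children:
        yield from preorder(c)

def fmm_passes(root):
    for n in postorder(root):            # upward pass: P2M at leaves, M2M up
        n.multipole = (sum(n.charges) if not n.children
                       else sum(c.multipole for c in n.children))
    for n in preorder(root):             # M2L from the far list, then L2L down
        n.local += sum(f.multipole for f in n.far_list)
        for c in n.children:
            c.local += n.local
    # near-field: direct particle-particle interactions over near_list,
    # performed only for leaves (slide 11)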

SLIDE 12

Parallel FMM

• Space Filling Curves
• Parallel Compressed Octrees
• Parallel Bitonic Sort
• Parallel Prefix Sum
• Building Interaction Lists and computing various Translations

SLIDE 13

Parallel FMM

Space Filling Curves

• Parallel Compressed Octrees
• Parallel Bitonic Sort
• Parallel Prefix Sum
• Building Interaction Lists and computing various Translations

SLIDE 14

Space Filling Curves

• Consider the recursive bisection of a 2D area into non-overlapping cells of equal size
• A space filling curve (SFC) is a mapping of these cells to a one-dimensional linear ordering

Z-SFC for k = 2

SLIDE 15

SFC Construction

• The run time to order the cells by pairwise comparison is expensive
• Instead, represent the integer coordinates of a cell using k bits each, and interleave the bits starting from the first dimension to form a single integer
• The index of the cell with coordinates (x, y) is this interleaved integer; it takes O(k) time to find the index (see the sketch below)
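A minimal sketch of the bit interleaving (Z-order / Morton indexing) in Python:

def z_index(x, y, k):
    """Interleave the k-bit coordinates (x, y) into a single 2k-bit
    Z-order index; O(k) time, no pairwise cell comparisons needed."""
    idx = 0
    for b in range(k):
        idx |= ((x >> b) & 1) << (2 * b)      # bit b of x -> even position
        idx |= ((y >> b) & 1) << (2 * b + 1)  # bit b of y -> odd position
    return idx

# sorting cells by their index yields the Z-SFC order directly
cells = [(3, 1), (0, 2), (1, 1)]
cells.sort(key=lambda c: z_index(c[0], c[1], k=2))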

SLIDE 16

Parallel FMM

Space Filling Curves

Parallel Compressed Octrees

• Parallel Bitonic Sort
• Parallel Prefix Sum
• Building Interaction Lists and computing various Translations

SLIDE 17

Octrees

[Figure: a quadtree over ten labeled points, and the corresponding tree with its leaves in left-to-right SFC order]

SLIDE 18

Compressed Octrees

[Figure: the same ten points; chains of single-child nodes are compressed away]

• Each node in a compressed octree is either a leaf or has at least 2 children

SLIDE 19

• Store 2 cells in each node of the compressed octree
• Large cell: the largest cell that encloses all the points the node represents and no other points
• Small cell: the smallest cell that encloses all the points the node represents

Encapsulating spatial information lost in compression


SLIDE 20

Octrees and SFCs

• Octrees can be viewed as multiple SFCs at various resolutions
• To establish a total order on the cells of an octree, given 2 cells:
• if one is contained in the other, the subcell is taken to precede the supercell
• if disjoint, order them according to the order of the immediate subcells of the smallest supercell enclosing them
• The resulting linearization is identical to a postorder traversal (a comparison sketch follows)

[Figure: cells labeled with their Z-SFC bits]
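A sketch of this comparison, with each cell given as its Z-index plus its level (the base-4 digit-path representation is an assumption made for clarity):

def quadrant_path(z, level):
    """Base-4 digits of a cell's Z-index, from the root down."""
    return [(z >> 2 * (level - 1 - d)) & 3 for d in range(level)]

def precedes(a, b):
    """Total order on cells a = (z, level), b = (z, level):
    a subcell precedes its supercell (hence postorder); disjoint
    cells are ordered by the immediate subcells of their smallest
    common enclosing supercell, i.e. lexicographically on paths."""
    pa, pb = quadrant_path(*a), quadrant_path(*b)
    if pa[:len(pb)] == pb:   # b encloses a: the subcell a comes first
        return True
    if pb[:len(pa)] == pa:   # a encloses b
        return False
    return pa < pb           # disjoint: first differing quadrant decides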

SLIDE 21

Parallel Compressed Octree Construction

 Consider points equally distributed across processors 

= pre-specified maximum resolution

 For each point, generate the index of the leaf cell containing it which is

the cell at the max resolution

 Parallel sort the leaf indices to compute their SFC-linearization, or the

left to right order of leaves in the compressed octree.

 Each processor obtains the leftmost leaf cell of the next processor.

Why ?

 On each processor, construct a local compressed octree for the leaf cells

within it and the borrowed leaf cell.

 Send the

to appropriate processors

 Insert the received out of order nodes in the already existing sorted

  • rder of nodes
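The internal nodes of the local compressed octree correspond to longest common prefixes of consecutive sorted leaf indices; a minimal helper for that step (a sketch, assuming 2D Z-indices with leaves at resolution k):

def smallest_enclosing_level(za, zb, k):
    """Level of the smallest cell enclosing the two leaf cells with
    Z-indices za and zb (leaves live at resolution k): the length of
    the common prefix of their quadrant digits."""
    lvl = 0
    while lvl < k and ((za ^ zb) >> 2 * (k - 1 - lvl)) & 3 == 0:
        lvl += 1
    return lvl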
SLIDE 22

Parallel FMM

• Space Filling Curves
• Parallel Compressed Octrees
• Parallel Bitonic Sort
• Parallel Prefix Sum
• Building Interaction Lists and computing various Translations

SLIDE 23

Parallel FMM

The FMM computation consists of the following phases:
• Building the compressed octree
• Building interaction lists
• Computing multipole expansions using a bottom-up traversal
• Computing multipole to local translations for each cell using its interaction list
• Computing the local expansions using a top-down traversal
• Projecting the field at leaf cells back to the particles

SLIDE 24

Computing Multipole Expansions

• Each processor scans its local array from left to right
• If a leaf node is reached, compute its multipole expansion directly
• If a node's multipole is known, shift and add it to its parent's multipole expansion, provided the parent is local to the processor
• Why postorder? Every child appears before its parent, so a single left-to-right scan suffices
• If the multipole expansion due to a cell is known but its parent lies on a different processor, it is labeled a residual node
• If the multipole expansion at a node is not yet computed when it is visited, it is labeled a residual node

• Residual nodes form a tree (termed the residual tree)
• The tree is present in its postorder traversal order, distributed across processors (a toy version of the scan follows)
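A toy version of this scan (a sketch: nodes are dicts with a local parent index, or None when the parent is remote; the "expansions" are plain sums, and the completeness check is simplified):

def upward_scan(nodes):
    """nodes: this processor's compressed-octree nodes in postorder.
    Returns the indices of the residual nodes."""
    residual = []
    for i, n in enumerate(nodes):
        if n["leaf"]:
            n["multipole"] = sum(n["charges"])      # toy P2M
        elif "multipole" not in n:                  # children were remote:
            residual.append(i)                      # not yet computed
            continue
        p = n["parent"]                             # local index or None
        if p is None:
            residual.append(i)                      # parent on another processor
        else:                                       # toy shift-and-add (M2M)
            nodes[p]["multipole"] = nodes[p].get("multipole", 0.0) + n["multipole"]
    return residual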

SLIDE 25

Multipole expansions on the residual tree can be computed using an efficient parallel upward tree accumulation algorithm. The residual tree can be accumulated in a number of rounds bounded by its height, as compared to the height of the global compressed octree. Thus, the worst-case number of communication rounds is reduced from the height of the global tree to the height of the residual tree, which is much smaller.

SLIDE 26

Computing Multipole to Local Translations

An all-to-all communication is used to receive the fields of interaction-list nodes that reside on remote processors. Once all the information is available locally, the multipole to local translations are conducted within each processor in much the same way as in the sequential FMM (a communication sketch follows).
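A minimal mpi4py sketch of this exchange (mpi4py and the request/reply structure are assumptions; the cited implementation is MPI-based but not necessarily Python):

from mpi4py import MPI

def fetch_remote_fields(comm, needed, owned):
    """needed[p]: ids of interaction-list cells this rank needs from rank p.
    owned: dict mapping locally owned cell id -> multipole expansion.
    Two object all-to-alls: requests go out, expansions come back."""
    requests = comm.alltoall(needed)                    # what others want from us
    replies = [{c: owned[c] for c in req} for req in requests]
    received = comm.alltoall(replies)                   # ship the expansions
    return {c: m for part in received for c, m in part.items()}

# usage: remote = fetch_remote_fields(MPI.COMM_WORLD, needed, owned)
# after this call, every M2L translation can proceed locally.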

SLIDE 27

Computing Local Expansions

Similar to computing multipole expansions:
• Calculate local expansions for the residual tree
• Compute local expansions for the local tree using a scan of the local array
• The exact number of communication rounds required is the same as in computing multipole expansions

SLIDE 28

Octree Textures on GPU

[Figure: a quadtree and its indirection pool; grids A, B, C, D stored at pool coordinates (0,0), (1,0), (2,0), (3,0)]

• The content of the leaves is directly stored as an RGB value
• The alpha channel is used to distinguish between an index to a child and the content of a leaf:
• alpha = 1: data
• alpha = 0.5: index
• alpha = 0: empty cell
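A sketch of that convention as a plain function (the RGBA-in-[0,1] texel layout is assumed; an "index" payload would be the child grid's coordinates scaled into [0,1]):

def encode_texel(kind, payload=(0.0, 0.0, 0.0)):
    """Pack one node of the octree texture into an RGBA tuple:
    alpha = 1 for leaf data, 0.5 for an index to a child grid,
    0 for an empty cell."""
    alpha = {"data": 1.0, "index": 0.5, "empty": 0.0}[kind]
    r, g, b = payload
    return (r, g, b, alpha)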

SLIDE 29

[Figure: the tree texture again, with the lookup point M marked]

Retrieve the value stored in the tree at a point M ∈ [0,1] × [0,1].

The tree lookup starts from the root and successively visits the nodes containing the point M until a leaf is reached.

SLIDE 30

[Figure: lookup step in the root grid A; lookup point P in the indirection pool]

Px = (I0x + frac(Mx · 2^0)) / Sx
Py = (I0y + frac(My · 2^0)) / Sy
I0 = (0,0): pool coordinates of node A (the root); S = (Sx, Sy) = (4, 1) is the size of the indirection pool in grids

frac(A) denotes the fractional part of A. Let M = (0.7, 0.7).
Coordinates of M within grid A = frac(M · 2^0) = frac(0.7 × 1) = 0.7
x coordinate of the lookup point P in the texture: Px = (I0x + frac(Mx · 2^0)) / Sx = (0 + 0.7)/4 = 0.175
y coordinate of the lookup point P in the texture: Py = (I0y + frac(My · 2^0)) / Sy = (0 + 0.7)/1 = 0.7

SLIDE 31

[Figure: lookup step in grid B]

Px = (I1x + frac(Mx · 2^1)) / Sx
Py = (I1y + frac(My · 2^1)) / Sy
I1 = (1,0): pool coordinates of node B

M = (0.7, 0.7)
Coordinates of M within grid B = frac(M · 2^1) = frac(0.7 × 2) = 0.4
x coordinate of the lookup point P in the texture: Px = (I1x + frac(Mx · 2^1)) / Sx = (1 + 0.4)/4 = 0.35
y coordinate of the lookup point P in the texture: Py = (I1y + frac(My · 2^1)) / Sy = (0 + 0.4)/1 = 0.4

SLIDE 32

[Figure: lookup step in grid C]

Px = (I2x + frac(Mx · 2^2)) / Sx
Py = (I2y + frac(My · 2^2)) / Sy
I2 = (2,0): pool coordinates of node C

M = (0.7, 0.7)
Coordinates of M within grid C = frac(M · 2^2) = frac(0.7 × 4) = 0.8
x coordinate of the lookup point P in the texture: Px = (I2x + frac(Mx · 2^2)) / Sx = (2 + 0.8)/4 = 0.7
y coordinate of the lookup point P in the texture: Py = (I2y + frac(My · 2^2)) / Sy = (0 + 0.8)/1 = 0.8
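Putting slides 29–32 together, a CPU sketch of the lookup loop (the grid contents below are hypothetical, chosen only to reproduce the A → B → C path for M = (0.7, 0.7); on the GPU, P would be the actual texture-fetch coordinate):

from math import floor

def tree_lookup(M, grids, S=(4, 1), root="A"):
    """Walk the octree texture from the root grid until a leaf texel
    is hit. grids[name] = (pool origin I, {quadrant: (alpha, payload)});
    alpha 1 = data, 0.5 = index of a child grid, 0 = empty."""
    name, depth = root, 0
    while True:
        I, texels = grids[name]
        fx, fy = (M[0] * 2**depth) % 1.0, (M[1] * 2**depth) % 1.0
        P = ((I[0] + fx) / S[0], (I[1] + fy) / S[1])  # where the GPU would fetch
        alpha, payload = texels[(floor(fx * 2), floor(fy * 2))]
        if alpha == 1.0:
            return payload                            # leaf data
        if alpha == 0.0:
            return None                               # empty cell
        name, depth = payload, depth + 1              # follow the index

empty = (0.0, None)
grids = {
    "A": ((0, 0), {(0, 0): empty, (1, 0): empty, (0, 1): empty, (1, 1): (0.5, "B")}),
    "B": ((1, 0), {(0, 0): (0.5, "C"), (1, 0): empty, (0, 1): empty, (1, 1): empty}),
    "C": ((2, 0), {(0, 0): empty, (1, 0): empty, (0, 1): empty, (1, 1): (1.0, "leaf value")}),
}
print(tree_lookup((0.7, 0.7), grids))  # -> 'leaf value', via A -> B -> C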

SLIDE 33

References

• L. Greengard and V. Rokhlin. A Fast Algorithm for Particle Simulations. Journal of Computational Physics, 73:325–348, 1987.

• J. Carrier, L. Greengard, and V. Rokhlin. A Fast Adaptive Multipole Algorithm for Particle Simulations. SIAM Journal on Scientific and Statistical Computing, 9:669–686, July 1988.

• R. Beatson and L. Greengard. A Short Course on Fast Multipole Methods.
• B. Hariharan and S. Aluru. Efficient Parallel Algorithms and Software for Compressed Octrees with Applications to Hierarchical Methods. Parallel Computing, 31:311–331, 2005.

• B. Hariharan, S. Aluru, and B. Shanker. A Scalable Parallel Fast Multipole Method for Analysis of Scattering from Perfect Electrically Conducting Surfaces. Proc. Supercomputing, page 42, 2002.

SLIDE 34

References (contd.)

• H. Sagan. Space Filling Curves. Springer-Verlag, 1994.
• M. Harris. Parallel Prefix Sum (Scan) with CUDA. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.htm

• S. Lefebvre, S. Hornus, and F. Neyret. Octree Textures on the GPU. In GPU Gems 2, pages 595–614. Addison Wesley, 2005.

• T. W. Christopher. Bitonic Sort Tutorial. http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm


SLIDE 35