Multi-Criteria Partitioning of Multi-Block Structured Grids
Hengjie Wang Aparna Chandramowlishwaran
HPC Forge University of California, Irvine
Jun. 27, 2019
H.Wang, A.Chandramowlishwaran (UCI) Partitioner ICS’ 19 06/28/2019 1 / 39
Background
◮ Structured Grid: regular connectivity between grid cells; cell (i, j) neighbors (i−1, j), (i+1, j), (i, j−1), (i, j+1).
◮ Block: a grid unit equivalent to a single rectangle.
[Figure: airfoil grid with blocks connected by Block2Block interfaces]
◮ Multi-Block Structured Grids
◮ 1 MPI process per node; spawn threads within the node.
◮ Assume shared-memory copy takes no time.
[Figure: example blocks exchanging messages of 40-60 bytes]
◮ Achieve load balance
◮ Minimize communication cost
Algorithms
◮ Top-down strategy: start from whole blocks and cut them down into partitions.
◮ Bottom-up strategy: start from individual blocks and combine them into partitions.
◮ Assign (part of) the largest block to the most underloaded partition.
◮ Cut at the longest edge of a block.
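The greedy assignment above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: block weights are abstracted to scalars, cutting is modeled as splitting a weight, and the names (`greedy_partition`, `block_sizes`, `n_parts`) are hypothetical.

```python
import heapq

def greedy_partition(block_sizes, n_parts):
    """Greedy bottom-up sketch: repeatedly assign the largest remaining
    block to the most underloaded partition, cutting the block when it
    exceeds that partition's remaining capacity."""
    target = sum(block_sizes) / n_parts          # ideal load per partition
    loads = [(0.0, p) for p in range(n_parts)]   # min-heap keyed on current load
    heapq.heapify(loads)
    assignment = []                              # list of (piece_size, partition)
    for size in sorted(block_sizes, reverse=True):
        while size > 0:
            load, p = heapq.heappop(loads)       # most underloaded partition
            room = max(target - load, 0.0)
            piece = min(size, room) if room > 0 else size  # cut at capacity
            assignment.append((piece, p))
            heapq.heappush(loads, (load + piece, p))
            size -= piece
    return assignment
```

With three blocks of weights 60, 50, and 40 and three partitions, this balances every partition to a load of 50 by cutting the two larger blocks.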
◮ Flat MPI ignores shared memory at the algorithm level.
◮ The communication-performance model does not distinguish shared-memory copies from network messages.
◮ Prior approaches primarily focus on reducing communication volume and ignore the effect of latency.
◮ Use the α-β model to measure communication cost, which incorporates both message latency (α) and communication volume (β).
◮ Propose new partitioning algorithms following the top-down strategy.
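As a rough sketch of such an α-β cost function: each message pays a fixed latency α plus its size divided by the bandwidth β. The exact form used in the paper may differ; `message_sizes` and the aggregation modes below are illustrative, with the values measured on Mira later in the talk as defaults.

```python
def alpha_beta_cost(message_sizes, alpha=1.73e-5, beta=1.77e9):
    """alpha-beta cost of one partition's communication: each message
    pays latency alpha (seconds) plus bytes / bandwidth beta (bytes/s).
    Defaults are the Mira values quoted in the talk."""
    return sum(alpha + nbytes / beta for nbytes in message_sizes)

def partition_cost(per_partition_messages, mode="max"):
    """Aggregate per-partition costs as the max (bottleneck partition)
    or the sum, the two criteria mentioned on the factorization slide."""
    costs = [alpha_beta_cost(m) for m in per_partition_messages]
    return max(costs) if mode == "max" else sum(costs)
```

Note that under this model many small messages can cost more than one large message of the same total volume, which is why latency must appear in the cost function.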
Given a target partition weight W, divide the blocks into large blocks (weight > W) and small blocks (weight < W). Cut each large block B with weight W_B into:
◮ B_l with weight W · ⌊W_B / W⌋
◮ B_s with weight W_B − W · ⌊W_B / W⌋
Partition B_l with REB or IF; group B_s together with the small blocks using CCG.
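The large-block cut above is simple arithmetic: B_l carries a whole number of target-weight partitions and B_s carries the remainder. A minimal sketch (`split_large_block` is a hypothetical helper name):

```python
import math

def split_large_block(W_B, W):
    """Split a large block of weight W_B (> W) into:
      B_l with weight W * floor(W_B / W), a whole number of partitions, and
      B_s with the remainder, which joins the small blocks."""
    W_l = W * math.floor(W_B / W)
    W_s = W_B - W_l
    return W_l, W_s
```

For example, a block of weight 230 with target weight 100 splits into B_l of weight 200 (two full partitions) and B_s of weight 30.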
[Figure: boundary types on block faces — f, c, Wall, and B2B (block-to-block)]
◮ Compare np = nx · ny · nz and np = 1 + nx · ny · nz for every case.
◮ Choose the factorization based on the max or sum α-β cost.
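Choosing a factorization np = nx · ny · nz by α-β cost can be sketched as below. The cost estimate (face areas of an Nx × Ny × Nz block with 8-byte cells, one message per internal cut plane per direction) is an illustrative assumption, not the paper's exact model.

```python
def factorizations(np_):
    """All ways to write np_ = nx * ny * nz with positive integers."""
    out = []
    for nx in range(1, np_ + 1):
        if np_ % nx:
            continue
        rest = np_ // nx
        for ny in range(1, rest + 1):
            if rest % ny:
                continue
            out.append((nx, ny, rest // ny))
    return out

def best_factorization(np_, dims, alpha=1.73e-5, beta=1.77e9, cell_bytes=8):
    """Pick the factorization minimizing a simple alpha-beta estimate of
    halo-exchange cost for a dims = (Nx, Ny, Nz) block."""
    Nx, Ny, Nz = dims
    def cost(f):
        nx, ny, nz = f
        # each internal cut plane contributes one face of data and one message
        faces = (nx - 1) * Ny * Nz + (ny - 1) * Nx * Nz + (nz - 1) * Nx * Ny
        msgs = (nx - 1) + (ny - 1) + (nz - 1)
        return alpha * msgs + cell_bytes * faces / beta
    return min(factorizations(np_), key=cost)
```

For a cube-shaped block, this criterion favors balanced factorizations such as 8 = 2 × 2 × 2 over elongated ones such as 8 × 1 × 1, which pay both more latency and more face area.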
◮ Include (part of) the block that most reduces the max communication cost in the partition.
◮ Convert Block2Block communication to shared-memory copy.
◮ Convert Block2Block communication to shared-memory copy.
◮ Avoid cutting blocks.
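The slides do not spell out CCG's procedure. Assuming it groups connected small blocks into the same partition, so that their block-to-block faces become intra-node shared-memory copies, a plausible sketch (all names hypothetical, and blocks assumed smaller than the weight cap W) is:

```python
from collections import defaultdict, deque

def group_connected_blocks(block_sizes, edges, W):
    """Group small blocks so that connected blocks land in the same
    partition (weight cap W). BFS over the block-connectivity graph;
    a block that would overflow the current group is left for a later
    group, keeping every group at or below W."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    groups, seen = [], set()
    # seed groups from the heaviest unassigned block
    for start in sorted(range(len(block_sizes)), key=lambda b: -block_sizes[b]):
        if start in seen:
            continue
        group, load = [], 0
        q = deque([start])
        while q:
            b = q.popleft()
            if b in seen or load + block_sizes[b] > W:
                continue
            seen.add(b)
            load += block_sizes[b]
            group.append(b)
            q.extend(adj[b])   # prefer neighbors: B2B faces stay in-node
        groups.append(group)
    return groups
```

On a chain of four 30-unit blocks with cap 60, this yields two groups of adjacent blocks, so both internal interfaces become shared-memory copies rather than network messages.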
Tests and Results
◮ Our experiments are performed on Mira (an IBM Blue Gene/Q at Argonne National Laboratory).
◮ The latency α = 1.73 × 10⁻⁵ s and bandwidth β = 1.77 × 10⁹ bytes/s are measured on Mira.
◮ Assign each MPI process to a node and spawn one OpenMP thread per core.
◮ The master thread calls MPI non-blocking routines, which overlap with shared-memory copies.

Halo exchange per iteration:
  #pragma omp for
    Copy halo data to sending buffers
  #pragma omp barrier
  #pragma omp master
    Update halos using non-blocking p2p communication
  #pragma omp for
    Copy halo data via shared memory within the node
  #pragma omp barrier
  #pragma omp for
    Copy data from receiving buffers to halo regions
  #pragma omp barrier
  Split blocks evenly among threads:
    Computation
  #pragma omp barrier
[Plot: α-β cost for Greedy, Metis, REB+CCG, IF+CCG, REB+GGS, IF+GGS]
◮ Consistent with the α-β cost.
◮ Latency has more effect.
◮ At 4096 partitions, IF …
◮ 64-256 partitions: Greedy produces the max communication volume and edge cuts; Metis produces the min.
◮ 1024-4096 partitions: Greedy, CCG, and GGS produce close results, with GGS the minimum.
[Plot: timing results for Greedy, Metis, REB+CCG, IF+CCG, REB+GGS, IF+GGS]
◮ Metis produces good results.
◮ Greedy creates good results.
◮ At 4096 partitions, …
Conclusion
◮ Use the α-β model to construct a cost function incorporating the edge cuts (communication volume) and message latency.
◮ Propose modified REB and IF for cutting large blocks, and CCG and GGS for grouping small blocks.
◮ Test our partitioner with a hybrid MPI+OpenMP Jacobi solver on up to 4096 partitions.
◮ Achieve significant speedup in communication on both Bump3D and …