slide-1
SLIDE 1

Multi-Criteria Partitioning of Multi-Block Structured Grids

Hengjie Wang Aparna Chandramowlishwaran

HPC Forge University of California, Irvine

Jun. 27, 2019

H. Wang, A. Chandramowlishwaran (UCI), Partitioner, ICS '19, 06/28/2019

slide-2
SLIDE 2

Outline

◮ Background
◮ Algorithms
◮ Tests and Results
◮ Conclusion

slide-3
SLIDE 3

Background

Outline

◮ Background
◮ Algorithms
◮ Tests and Results
◮ Conclusion

slide-4
SLIDE 4

Background

Structured Grid

◮ Structured grid: regular connectivity between grid cells, e.g. cell (i, j) neighbors (i-1, j), (i+1, j), (i, j-1), and (i, j+1).
◮ Block: a grid unit equivalent to a single rectangle.

[Figure: airfoil grid; blocks connected through Block2Block faces]

slide-5
SLIDE 5

Background

Structured Grid

◮ Multi-Block Structured Grids

[Figure: Bump3D grid, 5 blocks]

slide-6
SLIDE 6

Background

Halo Exchange

Split a block into 2 partitions and assign each partition to a node:

[Figure: the two partitions exchange halo data via Block2Block communication between nodes]

slide-7
SLIDE 7

Background

Hybrid Programming Model

Hybrid programming model:

◮ 1 MPI process per node; spawn threads within the node.
◮ Assume shared memory copies take no time.

Partition 4 blocks onto 2 nodes (block workloads 50, 50, 60, 50; the two 40-byte messages cross nodes, the two 50-byte messages stay within a node):

Average Workload W: 105
Imbalance: 5/105
Edge Cuts: 2
Communication Volume: 80 bytes
Shared Memory Copy: 100 bytes
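The metrics above follow from simple arithmetic; a minimal sketch (the block workloads and the two-node assignment are reconstructed from the figure labels, so treat them as assumptions):

```python
# Four blocks partitioned onto two nodes: node 0 gets blocks of 50 and 60
# cells, node 1 gets two blocks of 50 cells.
partitions = [[50, 60], [50, 50]]

total = sum(sum(p) for p in partitions)
avg = total / len(partitions)                       # average workload W = 105
imbalance = max(sum(p) for p in partitions) - avg   # 110 - 105 = 5

print(avg)               # 105.0
print(imbalance / avg)   # ~0.0476, i.e. the slide's 5/105
```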

slide-11
SLIDE 11

Background

Objectives

Given the number of partitions np and the workload per partition W, the partitioner should:

◮ Achieve load balance
  • Trade off load balance for communication cost
◮ Minimize communication cost
  • Reduce inter-node communication
  • Convert Block2Block communication to shared memory copy

slide-14
SLIDE 14

Algorithms

Outline

◮ Background
◮ Algorithms
◮ Tests and Results
◮ Conclusion

slide-15
SLIDE 15

Algorithms

State-of-the-art Methods

The state-of-the-art methods can be divided into two strategies:

◮ Top-down strategy:
  • Cut large blocks and assign sub-blocks to partitions.
  • Group small blocks to fill partitions.
  • Examples: Greedy [Ytterström 97], Recursive Edge Bisection (REB) [Berger 87], Integer Factorization (IF)
◮ Bottom-up strategy:
  • Transform the problem to graph partitioning and use a graph partitioner.
  • Examples: Metis [Karypis 94], Scotch [Roman 96], Chaco [Leland 95]

slide-16
SLIDE 16

Algorithms

Greedy Algorithm

Greedy Algorithm:

◮ Assign (part of) the largest block to the most underloaded partition.
◮ Cut at the longest edge of a block.

[Worked example: with target W = 300, the partition's workload Wp grows 0 → 200 → 300, cutting a block of 15 into 5 + 10 to reach exactly W.]

Drawbacks:

  • Ignores the connectivity between blocks.
  • Creates excessively small blocks when cutting a large block.
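For concreteness, a minimal 1-D sketch of the greedy loop (the workloads and partition count are illustrative; the real partitioner cuts at the longest edge of a 3-D block, which this sketch reduces to splitting a scalar workload):

```python
import heapq

def greedy_partition(block_sizes, n_parts):
    # Greedy: give (part of) the largest remaining block to the most
    # underloaded partition; cut the block when it would overshoot the
    # average workload W.
    target = sum(block_sizes) / n_parts
    loads = [(0.0, p) for p in range(n_parts)]   # min-heap of (load, partition)
    heapq.heapify(loads)
    parts = [[] for _ in range(n_parts)]
    remaining = sorted(block_sizes, reverse=True)
    while remaining:
        blk = remaining.pop(0)                   # largest remaining block
        load, p = heapq.heappop(loads)           # most underloaded partition
        room = target - load
        if 0 < room < blk:
            parts[p].append(room)                # cut: keep only what fits
            heapq.heappush(loads, (load + room, p))
            remaining.append(blk - room)         # leftover re-enters the pool
            remaining.sort(reverse=True)
        else:
            parts[p].append(blk)
            heapq.heappush(loads, (load + blk, p))
    return parts
```

On workloads 60, 50, 50, 50 and two partitions this reaches a perfectly balanced W = 105 per partition, at the price of an extra cut, and it never looks at block connectivity, which is exactly the drawback the slide points out.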

slide-20
SLIDE 20

Algorithms

Greedy Algorithm

Bump3D grid: 5 blocks; the largest block is 27 times larger than the rest.

[Figures: Bump3D blocks; the Greedy result with 16 partitions]

slide-22
SLIDE 22

Algorithms

Bottom-up Strategy

Bottom-up: convert structured grid partitioning to general graph partitioning. For a graph partitioner to work well, it needs a large number of vertices per partition.

1. Over-decompose the blocks and construct a graph with blocks as vertices.
2. Apply a graph partitioner: Metis, Scotch, Chaco, etc.
3. Merge the blocks within each partition.

slide-23
SLIDE 23

Algorithms

Bottom-up Strategy

Use Metis as the graph partitioner to generate 16 partitions with different over-decomposition methods:

[Figures: over-decompose to elementary blocks vs. over-decompose with IF]

slide-24
SLIDE 24

Algorithms

Limitations of State-of-the-art Methods

The above methods share these limitations:

◮ Flat MPI: shared memory is ignored at the algorithm level.
◮ The communication model does not distinguish shared memory copies from inter-node data transfer.
◮ They primarily focus on reducing communication volume and ignore the effect of the network's latency.

slide-27
SLIDE 27

Algorithms

Our Partition Algorithms

Our contributions:

◮ Use the α-β model to measure communication cost, incorporating communication volume, edge cuts, and network properties.
◮ Propose new partition algorithms following the top-down strategy:
  • Modify Recursive Edge Bisection (REB) and Integer Factorization (IF) for cutting large blocks (workload > W).
  • Propose Cut-Combine-Greedy (CCG) and Graph-Grow-Sweep (GGS) for grouping small blocks.

slide-31
SLIDE 31

Algorithms

Our Partition Algorithms

Cut large blocks, then group small blocks:

◮ Divide the blocks into large blocks (workload > W) and small blocks (workload < W).
◮ Cut each large block B of workload WB into Bl with workload W · ⌊WB/W⌋ and Bs with the remainder WB − W · ⌊WB/W⌋.
◮ Partition Bl with REB or IF.
◮ Group the small blocks (including each Bs) with CCG or GGS to fill partitions.

slide-32
SLIDE 32

Algorithms

Measure of Communication Cost

α-β model: α is the latency (s) and β the bandwidth (bytes/s).

Cost(s) = α + sizeof(message)/β

For a Block2Block message:

tb2b = α + #Halo · FaceArea · sizeof(cell)/β

Summing over all Block2Block messages:

Σ tb2b = α · (Edge Cuts) + (Communication Volume)/β
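The cost function can be written down directly; a small sketch (the message sizes below are made-up numbers, while the α, β values are the Mira measurements quoted later in the talk):

```python
def message_cost(alpha, beta, n_bytes):
    # One alpha-beta message: fixed latency plus bytes over bandwidth.
    return alpha + n_bytes / beta

def total_b2b_cost(alpha, beta, message_sizes):
    # Summed over all Block2Block messages this collapses to
    #   alpha * (edge cuts) + (communication volume) / beta.
    return alpha * len(message_sizes) + sum(message_sizes) / beta

alpha, beta = 1.73e-5, 1.77e9   # latency (s), bandwidth (bytes/s)
sizes = [32768, 65536]          # two halo messages (illustrative sizes)
print(total_b2b_cost(alpha, beta, sizes))
```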

slide-33
SLIDE 33

Algorithms

Cut Large Block: find a cut

Input: block B, workload Wcut, tolerance ε, partition P (optional)
Output: a cut such that (1) the cut-off sub-block's workload fits in [Wcut(1 − ε), Wcut(1 + ε)] and (2) the cut introduces the minimum communication cost δt.

Find a cut:
1: for i = x, y, z do
2:   get the area Ai of the face normal to i
3:   posFloor = ⌊Wcut(1 − ε)/Ai⌋
4:   posCeil = ⌈Wcut(1 + ε)/Ai⌉
5:   for pos ∈ [posFloor, posCeil] do
6:     δtcut = tb2b(Ai) + Σ over Block2Block faces split by the cut of α − Σ over Bj ∈ P of tb2b(cut, Bj)
7:     if δtcut < δtmin then
8:       δtmin = δtcut
9:       cut.pos = pos

[Figure: candidate cut positions between the floor f and ceiling c; faces labeled Wall and B2B]
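A sketch of the position search, assuming a box of dimensions (nx, ny, nz) with unit-workload cells; the credit term for messages converted by placing the sub-block next to partition P is omitted to keep the sketch short:

```python
def find_min_cut(dims, w_cut, eps, alpha=1.73e-5, beta=1.77e9, cell_bytes=8):
    # For each axis, the cut-off slab's workload is pos * (face area);
    # scan positions whose workload lands in [w_cut*(1-eps), w_cut*(1+eps)]
    # and keep the cut whose new Block2Block face costs least under the
    # alpha-beta model.
    best = None
    for axis in range(3):
        face = 1
        for a in range(3):
            if a != axis:
                face *= dims[a]                  # area of the face normal to axis
        lo = max(1, int(w_cut * (1 - eps)) // face)
        hi = min(dims[axis] - 1, -(-int(w_cut * (1 + eps)) // face))
        for pos in range(lo, hi + 1):
            if not (w_cut * (1 - eps) <= pos * face <= w_cut * (1 + eps)):
                continue
            dt = alpha + face * cell_bytes / beta   # cost of the new face
            if best is None or dt < best[0]:
                best = (dt, axis, pos)
    return best   # (cost, axis, cells along axis in the cut-off sub-block)
```

Because the new face's cost grows with its area, the search naturally prefers cutting perpendicular to the longest edge, as the slides illustrate.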

slide-38
SLIDE 38

Algorithms

Cut Large Block: REB

Recursive Edge Bisection (REB):

Algorithm: REB
1: function reb_block(B, np)   ⊲ block B fits in np partitions
2:   if np == 1 then
3:     return
4:   W = B's workload
5:   Wl = W · ⌊np/2⌋/np, Wr = W − Wl
6:   find_min_cut(B, Wl, ε, cut)
7:   cut B into Bl of workload Wl and Br of workload Wr
8:   reb_block(Bl, ⌊np/2⌋)
9:   reb_block(Br, ⌈np/2⌉)

[Figure: recursion for np = 7, splitting 7 → 3 + 4 and so on down to single partitions]
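A compact sketch of the recursion, with the min-communication cut search replaced by a simple longest-edge cut (an assumption to keep the sketch self-contained):

```python
def reb_block(dims, n_parts, cuts=None):
    # Recursive Edge Bisection: split the workload in the ratio
    # floor(np/2) : ceil(np/2) and recurse on both halves. The talk picks
    # the cut via find_min_cut; here we simply cut the longest edge.
    if cuts is None:
        cuts = []
    if n_parts == 1:
        return cuts
    axis = max(range(len(dims)), key=lambda a: dims[a])   # longest edge
    left = dims[axis] * (n_parts // 2) // n_parts         # cells on the floor side
    d_l, d_r = list(dims), list(dims)
    d_l[axis], d_r[axis] = left, dims[axis] - left
    cuts.append((axis, left))
    reb_block(d_l, n_parts // 2, cuts)
    reb_block(d_r, n_parts - n_parts // 2, cuts)
    return cuts
```

Partitioning a block for np = 7 records the six cuts of the slide's recursion tree (np partitions always need np − 1 cuts).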

slide-42
SLIDE 42

Algorithms

Cut Large Block: REB

Using REB to split Bump3D into 16 partitions with different α, β values:

[Figures: partitions for α = 10^-5, β = 10^9 and for α = 10^-4, β = 10^9]

slide-43
SLIDE 43

Algorithms

Cut Large Block: IF

Integer Factorization (IF): choose np = nx · ny · nz with nx/lx ≈ ny/ly ≈ nz/lz, where lx, ly, lz are the block's edge lengths. If np is prime, cut off one partition and factorize the rest.

[Figures: lx = 7, ly = 4 with np = 6 and with np = 7]

slide-44
SLIDE 44

Algorithms

Cut Large Block: IF

Generalize IF using the α-β cost function:

◮ Compare the factorizations np = nx·ny·nz and np = 1 + nx·ny·nz for every case.
◮ Choose the factorization whose max or sum α-β cost is minimum.

[Diagram: factorization → partitions → max or sum cost → min]
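A sketch of the factorization search; scoring by the spread of the cuts-per-length ratios stands in for the talk's max/sum α-β cost comparison (an assumption), and the prime-np branch that factorizes np − 1 is omitted:

```python
def factorizations(n):
    # All ordered triples (nx, ny, nz) with nx * ny * nz == n.
    out = []
    for nx in range(1, n + 1):
        if n % nx:
            continue
        m = n // nx
        for ny in range(1, m + 1):
            if m % ny == 0:
                out.append((nx, ny, m // ny))
    return out

def if_partition(dims, n_parts):
    # Pick (nx, ny, nz) so cuts-per-edge track edge lengths,
    # i.e. nx/lx ≈ ny/ly ≈ nz/lz.
    lx, ly, lz = dims

    def score(f):
        r = (f[0] / lx, f[1] / ly, f[2] / lz)
        return max(r) - min(r)    # spread of the ratios; 0 is perfect

    return min(factorizations(n_parts), key=score)
```

On the slide's 2-D example (lx = 7, ly = 4, np = 6, modeled with lz = 1) this picks a 3 × 2 × 1 grid of partitions.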

slide-45
SLIDE 45

Algorithms

Cut Large Block: Compare REB and IF

Compare REB and IF on Bump3D with α = 10^-5, β = 10^9, np = 16. In general, IF is better at reducing edge cuts and REB at reducing communication volume.

[Figures: REB, edge cuts 66, communication volume 1.57 × 10^6; IF, edge cuts 66, communication volume 1.61 × 10^6]

slide-46
SLIDE 46

Algorithms

Group Small Blocks: CCG

Cut-Combine-Greedy (CCG): cut and combine small blocks in a greedy fashion.

◮ Include (part of) the block that most reduces the communication cost of the partition.
◮ Convert Block2Block communication to shared memory copy.

[Worked example: blocks A: 40, B: 80, C: 60, D: 40 with target W = 160. Starting from A (Wp = 40), CCG adds B (cost −6); adding C whole would overload, so CCG cuts C into C1: 40 and C2: 20 and compares C1 (cost −4 − 4 + 2 = −6) with D (cost −4); adding C1 reaches Wp = 160. Overall, communication cost 14 is converted to shared memory copy while new cost 2 is introduced.]
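The greedy inclusion step can be sketched on a block graph (the block sizes, edge costs, and seed below are illustrative; the "cut" part of CCG is omitted, so oversized neighbours are simply skipped instead of being split):

```python
def ccg_fill(sizes, edges, seed, target):
    # Grow one partition: repeatedly include the block whose edges into the
    # partition convert the most Block2Block communication into
    # shared-memory copies, without exceeding the target workload.
    part, load = {seed}, sizes[seed]

    def gain(b):
        # Total edge cost between candidate b and the current partition.
        return sum(w for (x, y), w in edges.items()
                   if (x == b and y in part) or (y == b and x in part))

    while True:
        cands = [b for b in sizes
                 if b not in part and load + sizes[b] <= target and gain(b) > 0]
        if not cands:
            return part, load
        best = max(cands, key=gain)
        part.add(best)
        load += sizes[best]
```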

slide-51
SLIDE 51

Algorithms

Group Small Blocks: GGS

Graph-Growth-Sweep (GGS): repeatedly use graph growing to group small blocks.

◮ Convert Block2Block communication to shared memory copy.
◮ Avoid cutting blocks.

[Worked example: blocks A-E, each of workload 40, with edge costs 4, 6, 3, 5, 4 and target W = 120. Growth fills partition 1 with A, B, C (W1 = 120), then partition 2 with D and E (W2 = 80); the sweep then moves C to partition 2 (cost −3 − 5 + 6 = −2), ending with W1 = 80 and W2 = 120.]
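A sketch of the growth phase (the block graph is illustrative, and the follow-up sweep that revisits assignments is omitted):

```python
def ggs_grow(sizes, edges, n_parts):
    # Grow each partition from an unassigned seed, always absorbing the
    # unassigned neighbour with the strongest connection to the partition;
    # blocks are never cut.
    target = sum(sizes.values()) / n_parts
    assigned = {}
    for p in range(n_parts):
        free = [b for b in sizes if b not in assigned]
        if not free:
            break
        seed = free[0]
        assigned[seed] = p
        load = sizes[seed]
        while load < target:
            def link(b):
                # Edge cost between candidate b and partition p so far.
                return sum(w for (x, y), w in edges.items()
                           if (x == b and assigned.get(y) == p)
                           or (y == b and assigned.get(x) == p))
            nbrs = [b for b in sizes if b not in assigned and link(b) > 0]
            if not nbrs:
                break
            best = max(nbrs, key=link)
            assigned[best] = p
            load += sizes[best]
    return assigned
```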

slide-59
SLIDE 59

Algorithms

Group Small Blocks: Compare CCG and GGS

To compare CCG and GGS when the number of blocks is large:

◮ CCG converts more communication to shared memory copy, but creates more cuts and new Block2Block communications.
◮ GGS converts less communication to shared memory copy, but avoids cutting blocks and introduces fewer new Block2Block communications.

slide-60
SLIDE 60

Tests and Results

Outline

◮ Background
◮ Algorithms
◮ Tests and Results
◮ Conclusion

slide-61
SLIDE 61

Tests and Results

Hardware and Network Specs

◮ Our experiments are performed on Mira, an IBM BlueGene/Q cluster with 16 cores per node.
◮ The latency α = 1.73 × 10^-5 s and bandwidth β = 1.77 × 10^9 bytes/s are measured by a ping-pong test.

[Plot: measured time vs. message size against the α-β model]

slide-62
SLIDE 62

Tests and Results

Numerical Experiment

Implement a Jacobi-type solver with MPI and OpenMP.

◮ Assign each MPI process to a node and spawn one OpenMP thread per core.
◮ The master thread calls MPI non-blocking routines, which overlap with the shared memory copies.

Experimental Jacobi Solver:
1: for i = 1 → NSTEP do
2:   Copy halo data to the sending buffer   ⊲ #pragma omp for, then barrier
3:   Update halos using non-blocking p2p communication   ⊲ #pragma omp master
4:   Copy halo data via shared memory within the node   ⊲ #pragma omp for, then barrier
5:   Copy data from the receiving buffer to the halo region   ⊲ #pragma omp for, then barrier
6:   Computation, with blocks split evenly among threads   ⊲ #pragma omp barrier
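The loop structure can be mimicked in a serial sketch (a pure-Python stand-in: plain copies replace both the MPI messages and the intra-node shared-memory copies, and the 1-D blocks on a periodic chain are assumptions):

```python
def jacobi_step(blocks):
    # One solver step, mirroring the slide: pack halo data, exchange it,
    # then relax. Each block is a list of cell values with one halo cell
    # at each end; blocks form a periodic 1-D chain.
    send = [(b[1], b[-2]) for b in blocks]    # pack boundary cells
    n = len(blocks)
    for i, b in enumerate(blocks):            # "exchange" halos
        b[0] = send[(i - 1) % n][1]           # left halo <- left nbr's right cell
        b[-1] = send[(i + 1) % n][0]          # right halo <- right nbr's left cell
    for b in blocks:                          # compute: 1-D Jacobi relaxation
        b[1:-1] = [(b[j - 1] + b[j + 1]) / 2 for j in range(1, len(b) - 1)]
    return blocks
```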

slide-63
SLIDE 63

Tests and Results

Bump3D Metrics: Communication Cost

Refine Bump3D 4 times in each direction, 8.3 × 10^7 cells in total. Beyond 512 partitions: Greedy > Metis > REB ≳ IF.

[Plot: communication cost vs. np (64 to 4096) for Greedy, Metis, REB, IF]

slide-64
SLIDE 64

Tests and Results

Bump3D Running Time

[Bar chart: Bump3D running time at np = 1024, 2048, 4096, broken into communication, computation, and others, for Greedy, Metis, REB+CCG, IF+CCG, REB+GGS, IF+GGS]

◮ Consistent with the α-β cost.
◮ Latency has more effect.
◮ At 4096 partitions, IF achieves 5.80x over Greedy and 2.56x over Metis in communication.

slide-67
SLIDE 67

Tests and Results

SpaceX's Falcon-Heavy Grid

The SpaceX Falcon-Heavy grid and the distribution of its blocks:

[Figure: Falcon-Heavy grid; histogram of # blocks vs. workload (10^6 cells)]

slide-68
SLIDE 68

Tests and Results

Falcon-Heavy Metrics: # Sub-Blocks

Choose REB+GGS and IF+CCG to represent GGS and CCG respectively.

[Plot: # sub-blocks per partition vs. np (64 to 4096) for Greedy, Metis, CCG, GGS]

# Sub-blocks: Metis > CCG > Greedy ≥ GGS.

Dominating pattern:

◮ 64-256 partitions: grouping small blocks
◮ 1024-4096 partitions: both cutting and grouping

slide-69
SLIDE 69

Tests and Results

Falcon-Heavy Metrics: Volume and Edge Cuts

◮ Greedy produces the maximum communication volume and edge cuts for 64-256 partitions.
◮ Metis produces the minimum communication volume and edge cuts for 64-256 partitions, with CCG second.
◮ Greedy, CCG, and GGS produce close results for 1024-4096 partitions, with GGS the minimum.

[Plots: communication volume and edge cuts vs. np (64 to 4096) for Greedy, Metis, CCG, GGS]

slide-72
SLIDE 72

Tests and Results

Falcon-Heavy Metrics: Communication Cost

[Plot: communication cost vs. np (64 to 4096) for Greedy, Metis, CCG, GGS]

The pattern is consistent with the metrics:

◮ 64-256 partitions: Metis best, CCG second
◮ 1024-4096 partitions: Metis worst, GGS best

slide-73
SLIDE 73

Tests and Results

Falcon-Heavy Running Time

[Bar chart: Falcon-Heavy running time at np = 1024, 2048, 4096, broken into communication, computation, and others, for Greedy, Metis, REB+CCG, IF+CCG, REB+GGS, IF+GGS]

◮ Metis produces a good result at 4096 partitions.
◮ Greedy produces good results at 1024 and 2048 partitions.
◮ At 4096 partitions, REB+GGS achieves 2.11x over Greedy and 1.54x over Metis in communication.

slide-76
SLIDE 76

Conclusion

Outline

◮ Background
◮ Algorithms
◮ Tests and Results
◮ Conclusion

slide-77
SLIDE 77

Conclusion

Conclusion

◮ Use the α-β model to construct a cost function incorporating edge cuts, communication volume, and network specifics.
◮ Propose modified REB and IF for cutting large blocks, and CCG and GGS for grouping small blocks.
◮ Test our partitioner with a hybrid MPI+OpenMP Jacobi solver on up to 4096 nodes.
◮ Achieve significant speedups in communication on both Bump3D and Falcon-Heavy.