SLIDE 1

Non-Intrusively Avoiding Scaling Problems in and out of MPI Collectives

Hongbo Li, Zizhong Chen, Rajiv Gupta, and Min Xie
May 21, 2018

SLIDE 2

Outline

  • Scaling Problem
  • Avoidance Framework
  • Evaluation
  • Conclusion


SLIDE 4

Scaling Problem

A scaling problem is a bug that occurs only when the program runs at a large scale, in terms of

  • the number of processes (P), or
  • the input size, or
  • both

Scaling problems frequently arise with MPI collectives, since collective communication involves both a group of processes and a message size (the input size).

SLIDE 5

An Example of MPI Collective

MPI_Gather using two processes (P = 2), each transferring two elements (n = 2); process 0 is the root process.
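To make the example concrete, here is a minimal C sketch (ours, not the slides'): each of P processes contributes n = 2 integers and the root receives them laid out rank by rank.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    const int n = 2;                     /* elements sent per process */
    int sendbuf[2] = { rank * n, rank * n + 1 };
    int *recvbuf = NULL;
    if (rank == 0)                       /* only the root needs recvbuf: n*P elements */
        recvbuf = malloc(sizeof(int) * n * P);

    /* Every process sends n ints; the root receives n ints from each,
     * stored contiguously in rank order. */
    MPI_Gather(sendbuf, n, MPI_INT, recvbuf, n, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < n * P; i++)
            printf("%d ", recvbuf[i]);
        printf("\n");
        free(recvbuf);
    }
    MPI_Finalize();
    return 0;
}
```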

SLIDE 6

Scaling Problem

The root cause of a scaling problem involving MPI collectives can be

  • inside MPI collectives
  • outside MPI collectives
SLIDE 7

Many scaling problems are challenging to deal with:

  • They escape testing during the development phase
  • Waiting for an official fix takes days or even months
  • Bug reproduction, root-cause diagnosis, and fixing are all difficult

Inside MPI

Scaling problems reported online, by root cause: environment setting, connection failure, integer overflow, OS platform, unknown.

SLIDE 9-15

Outside MPI

In the user code, the displacement array displs (C int, commonly 32 bits) of irregular collectives can easily be corrupted by integer overflow.

[Figure: messages flow from each process' sendbuf into the root's recvbuf at positions 0, 1, 2, ..., i, ..., P-1.]

In MPI_Gatherv, the root process calculates the address for each incoming message. When displs is not corrupted:

Calculate address: recvbuf + displs[i] × s
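The address arithmetic above can be written as a small C sketch (illustrative only; a real implementation uses the receive datatype's extent for s):

```c
#include <stddef.h>

/* Where the root stores process i's incoming message in MPI_Gatherv:
 * recvbuf + displs[i] * s, with s the size of one element in bytes.
 * displs is a plain C int array, which is the weak point. */
char *incoming_address(char *recvbuf, const int *displs, int i, size_t s) {
    return recvbuf + (ptrdiff_t)displs[i] * (ptrdiff_t)s;
}
```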

SLIDE 16-23

Outside MPI

In the user code, the displacement array displs (C int, commonly 32 bits) of irregular collectives can easily be corrupted by integer overflow.

[Figure: the same gather, now with corrupted displacements.]

In MPI_Gatherv, the root process calculates the address for each incoming message. When displs is corrupted, an overflowed entry is negative:

Calculate address: displs[i] < 0, so recvbuf + displs[i] × s points outside recvbuf
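How displs gets corrupted: user code typically builds it as a prefix sum of the receive counts. A sketch of that idiom and its failure mode (our illustration, not code from the paper):

```c
/* displs built as the prefix sum of the receive counts, the common
 * idiom for MPI_Gatherv. Once displs[i-1] + recvcounts[i-1] exceeds
 * INT_MAX, the 32-bit addition overflows (formally undefined
 * behavior; in practice it wraps negative), and the root later
 * computes recvbuf + displs[i] * s with a bogus offset. */
int build_displs(int *displs, const int *recvcounts, int P) {
    displs[0] = 0;
    for (int i = 1; i < P; i++) {
        displs[i] = displs[i - 1] + recvcounts[i - 1];
        if (displs[i] < 0)
            return i;   /* index of the first corrupted entry */
    }
    return -1;          /* no overflow */
}
```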


SLIDE 25

Outside MPI

In the user code, the displacement array displs (C int, commonly 32 bits) of irregular collectives can easily be corrupted by integer overflow. For MPI_Gatherv, the number of elements (N) received by the root process therefore satisfies

N ≤ displs[P-1] + INT_MAX → N < 2 × INT_MAX

For MPI_Gather (a regular collective),

N ≤ P × INT_MAX

Huge gap: 2 vs. P
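To put numbers on the gap: with 32-bit int, INT_MAX = 2^31 − 1 ≈ 2.1 × 10^9, so MPI_Gatherv can deliver fewer than 2 × INT_MAX ≈ 4.3 × 10^9 elements to the root no matter how many processes participate, whereas MPI_Gather allows up to P × INT_MAX, a ceiling P/2 times higher (384× at P = 768).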

SLIDE 26

Outside MPI

Irregular collectives' limitation stems from the displacement array displs having data type C int. Replace int with long long int?

Discussed, yet never done, due to backward compatibility.

SLIDE 27

An immediate remedy is needed!

SLIDE 28

Outline

  • Scaling Problem
  • Avoidance Framework
  • Evaluation
  • Conclusion

SLIDE 29

Avoidance

  • Scaling problem's trigger
  • Workaround strategy

SLIDE 30

Trigger (1) [Outside MPI]

The trigger of the irregular-collective limitation is

displs[i] < 0
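This trigger can be checked non-intrusively right before the collective is issued; a minimal sketch:

```c
/* Returns 1 if any displacement has wrapped negative, i.e., the
 * out-of-MPI scaling problem would be triggered by this call. */
int displs_corrupted(const int *displs, int P) {
    for (int i = 0; i < P; i++)
        if (displs[i] < 0)
            return 1;
    return 0;
}
```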

SLIDE 31

Trigger (2) [Inside MPI]

Users perform testing:

  • It tells users whether there is a scaling problem
  • It also tells at what scale the problem occurs

Do users really need a fancy supercomputer to perform testing?

Not necessary!

SLIDE 32

Trigger (2) [Inside MPI]

User-side testing: users manifest potential scaling problems in the MPI routines of interest

  • It tells users whether there is a scaling problem
  • It also tells at what scale the problem occurs

Most scaling problems involving MPI collectives relate to both the parallelism scale and the message size.

With ONLY 2 nodes, each having 24 cores and 64 GB of memory, we easily found 4 scaling problems inside released MPI libraries. Scaling problems related only to the number of processes have not been found yet. A sketch of such a test loop follows.
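A hedged sketch of what such user-side testing can look like (our own illustration, not the paper's tool): set the communicator's error handler to MPI_ERRORS_RETURN and grow the per-process message size until the collective fails or memory runs out. For brevity it assumes every rank observes the same outcome each round.

```c
#include <limits.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* One test round: gather n bytes from every process at rank 0. */
static int try_gather(int n, int rank, int P) {
    char *sendbuf = malloc((size_t)n);
    char *recvbuf = (rank == 0) ? malloc((size_t)n * (size_t)P) : NULL;
    if (!sendbuf || (rank == 0 && !recvbuf)) {
        free(sendbuf); free(recvbuf);
        return -1;                          /* hit the memory limit */
    }
    int rc = MPI_Gather(sendbuf, n, MPI_CHAR, recvbuf, n, MPI_CHAR,
                        0, MPI_COMM_WORLD);
    free(sendbuf); free(recvbuf);
    return rc;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* Without this, most MPI errors abort instead of returning. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, P;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &P);

    /* Double the per-process message size each round; the first
     * failing n is the scale at which the problem triggers here. */
    for (long long n = 1024; n <= (long long)INT_MAX; n *= 2) {
        int rc = try_gather((int)n, rank, P);
        if (rank == 0)
            printf("n = %lld bytes/process: %s\n", n,
                   rc == MPI_SUCCESS ? "ok" : "FAILED");
        if (rc != MPI_SUCCESS)
            break;
    }
    MPI_Finalize();
    return 0;
}
```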

SLIDE 33

Workarounds

  • (W1) Partition communication
      (W1-A) Partition processes
      (W1-B) Partition the message
  • (W2) Build a big data type

SLIDE 34

Workaround (1)

Partitioning

MPI_Gatherv communication using the two partitioning strategies, supposing the bug is triggered when nP > 4. Four processes (P = 4) are involved, each sending two elements (n = 2), and process 0 is the root process.

[Figure legend: empty recvbuf, filled recvbuf, temporary buffer]

Each partitioned round satisfies nP ≤ 4.
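A sketch of the message-partitioning strategy (W1-B) under our own simplifications: the big transfer is split into rounds small enough to stay below the trigger, staged through the temporary buffer the figure mentions. MPI_Gather is used here for brevity even though the figure shows MPI_Gatherv.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of W1-B (partition the message): one big gather of n bytes
 * per process becomes rounds of at most `chunk` bytes, staged through
 * a temporary buffer so the final layout in recvbuf matches the
 * single big gather. */
int gather_in_chunks(const char *sendbuf, char *recvbuf, long long n,
                     int chunk, int rank, int P, MPI_Comm comm) {
    char *tmp = (rank == 0) ? malloc((size_t)chunk * (size_t)P) : NULL;
    if (rank == 0 && !tmp)
        return MPI_ERR_NO_MEM;
    for (long long off = 0; off < n; off += chunk) {
        int c = (n - off < chunk) ? (int)(n - off) : chunk;
        /* This round: every process sends bytes [off, off + c). */
        int rc = MPI_Gather(sendbuf + off, c, MPI_CHAR,
                            tmp, c, MPI_CHAR, 0, comm);
        if (rc != MPI_SUCCESS) { free(tmp); return rc; }
        if (rank == 0)   /* place rank i's chunk at recvbuf + i*n + off */
            for (int i = 0; i < P; i++)
                memcpy(recvbuf + (size_t)i * (size_t)n + (size_t)off,
                       tmp + (size_t)i * (size_t)c, (size_t)c);
    }
    free(tmp);
    return MPI_SUCCESS;
}
```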

SLIDE 35

Workaround (2)

Build a big data type

Message size = s × n. A bigger data type (bigger s) gives a smaller count n.

Only effective when the scaling problem is unrelated to s:

  • Effective case: bug triggered when nP > 4
  • Ineffective case: bug triggered when snP > 4

SLIDE 36

Workaround (2)

Build a big data type for MPI_Gather to avoid a bug triggered when nP > 4.

[Figure: two MPI_Gather runs, root = proc 0, showing each process' sendbuf and the root's recvbuf. Left: n = 4, s = 1 B, P = 2. Right: n = 1, s = 4 B, P = 2.]

Left: nP = 8 > 4 (bug triggered). Right: nP = 2 < 4 (bug avoided).
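A sketch of W2 (our illustration): MPI_Type_contiguous packs g elements into one derived type, shrinking the per-process count from n to n/g while the bytes on the wire stay the same. The function name, g, and the use of MPI_INT are assumptions; for brevity g is assumed to divide n.

```c
#include <mpi.h>

/* Sketch of W2 (build a big data type): g consecutive elements become
 * one derived type, so the count passed to MPI_Gather drops from n to
 * n / g. */
int gather_with_big_type(const int *sendbuf, int *recvbuf,
                         long long n, int g, int root, MPI_Comm comm) {
    MPI_Datatype big;
    MPI_Type_contiguous(g, MPI_INT, &big);   /* s grows by a factor g */
    MPI_Type_commit(&big);
    int rc = MPI_Gather(sendbuf, (int)(n / g), big,
                        recvbuf, (int)(n / g), big, root, comm);
    MPI_Type_free(&big);
    return rc;
}
```

Consistent with the previous slide, this trick helps only when the trigger is on the count (nP), not on the total bytes (snP).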

SLIDE 37

Outline

  • Scaling Problem
  • Avoidance Framework
  • Evaluation
  • Conclusion

SLIDE 38

Evaluation – Setting

Tianhe-2:

  • Each node has 24 cores and 64 GB DRAM
  • One process per core

MPI_Gatherv:

  • Effectiveness of avoiding the scaling problem
  • Performance

SLIDE 39

Evaluation – Effectiveness

Workarounds for MPI_Gatherv that avoid the irregular-collective limitation problem. Reported metrics:

  • the maximal workable n (unit: 1 M, i.e., 2^20)
  • the maximal memory consumption on one node, calculated according to the MPI standard

23X increase!

Our workarounds remain effective until the memory limit is hit.

SLIDE 40

Evaluation – Performance

MPI_Gatherv [P = 768, s = 1 B; the bug occurs when n > 2.625 M].

SLIDE 41

Evaluation -- Summary

Effectiveness:

W1-B is the best

Performance:

W2 is the best. The time cost of a collective based on either W1-A or W1-B increases linearly as n increases.

SLIDE 42

Outline

  • Scaling Problem
  • Avoidance Framework
  • Evaluation
  • Conclusion

SLIDE 43

Conclusion

Scaling problems are hard to fix, and thus users often need to wait days or months for an official fix

We provide a non-intrusive framework for application users as an immediate remedy

Easier than debugging. Faster than waiting for an official fix.

SLIDE 44

Thank you!