
SLIDE 1

MPI: 25 Years of Progress

Anthony Skjellum
University of Tennessee at Chattanooga
Tony-skjellum@utc.edu
Formerly: LLNL, MSU, MPI Software Technology, Verari/Verarisoft, UAB, and Auburn University
Co-authors: Ron Brightwell, Sandia; Rossen Dimitrov, Intralinks

SLIDE 3

Outline

• Background
• Legacy
• About Progress
• MPI Taxonomy
• A glimpse at the past
• A look toward the future

SLIDE 4

Progress

• 25 years ago we as a community set out to standardize parallel programming
• It worked 🙂
• Amazing "collective operation" (hmm… still not complete)
• Some things about the other progress too: moving data independently of user calls to MPI…

SLIDE 5

Community

• This was close to the beginning…

SLIDE 6

As we all know (agree?)

• MPI defined progress as a "weak" requirement
• MPI implementations don't have to move the data independently of when MPI is called
• Implementations can do so
• There is no need for an internally concurrent schedule to comply
• For instance: do all the data movement at "Waitall" … predictable if required only to be here!

SLIDE 7

How programs/programmers achieve progress

• The MPI library calls the progress engine when you call any of most MPI calls
• The MPI library does it for you
  ▼ In the transport, MPI just shepherds lightly
  ▼ In an internal thread or threads, periodically scheduled
• You kick the progress engine ("self help"), as sketched in the code after this list
  ▼ You call MPI_Test() sporadically in your user thread
  ▼ You schedule and call MPI_Test() in a helper thread
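To make the "self help" path concrete, here is a minimal sketch, assuming a sender that interleaves sporadic MPI_Test() calls with its own compute loop; do_compute_chunk() and more_work_left() are hypothetical placeholders, not functions from the talk.

```c
/* Minimal sketch of the "self help" progress model: post a nonblocking
 * send, then poke the MPI progress engine with sporadic MPI_Test()
 * calls between chunks of computation.  do_compute_chunk() and
 * more_work_left() are hypothetical application hooks. */
#include <mpi.h>

extern void do_compute_chunk(void);   /* one slice of the application's work */
extern int  more_work_left(void);     /* nonzero while work remains */

void send_with_self_help(const double *buf, int count, int peer, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Isend(buf, count, MPI_DOUBLE, peer, 0 /* tag */, comm, &req);

    while (more_work_left()) {
        do_compute_chunk();
        if (!done) {
            /* A weak-progress MPI may only move data inside MPI calls,
             * so these tests are what lets the send advance; without
             * them, the transfer could all happen at the final wait. */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        }
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```

The helper-thread variant is the same loop moved into its own thread, which requires initializing MPI with MPI_THREAD_MULTIPLE (or MPI_THREAD_SERIALIZED plus user-level locking).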

SLIDE 8

Desirements

• Overlap communication and computation
• Predictability / low jitter
• Later: overlap of communication, computation, and I/O
• Proviso: 🙁 or 🙂 → must have the memory bandwidth
SLIDE 9

MPI Implementation Taxonomy (Dimitrov)

• Message completion notification
  ▼ Asynchronous (blocking)
  ▼ Synchronous (polling)
• Message progress
  ▼ Asynchronous (independent)
  ▼ Synchronous (polling)
• The resulting combinations: blocking-independent, polling-independent, blocking-polling, all-polling

SLIDE 10

Segmentation

• Common technique for implementing overlapping through pipelining (see the sketch after this slide)
• [Diagram: an entire message of size m with a single compute phase, versus the message split into s segments of size m/s, with each segment's transfer overlapped with computation on the next]
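As a rough illustration of the pipelining pattern (my sketch, not code from the talk): the message is cut into nseg segments and each segment is sent with MPI_Isend as soon as it has been computed; compute_segment() is a hypothetical per-segment kernel, and the sketch assumes count is divisible by nseg and nseg <= 64.

```c
/* Hedged sketch of segmentation/pipelining: send a message of "count"
 * doubles as nseg segments of count/nseg elements, so the transfer of
 * segment i can overlap with the computation of segment i+1.
 * compute_segment() is a hypothetical per-segment kernel; assumes
 * count % nseg == 0 and nseg <= 64. */
#include <mpi.h>

extern void compute_segment(double *seg, int len);  /* produce one segment */

void pipelined_send(double *buf, int count, int nseg, int peer, MPI_Comm comm)
{
    MPI_Request reqs[64];
    const int seg_len = count / nseg;

    for (int i = 0; i < nseg; i++) {
        compute_segment(buf + i * seg_len, seg_len);
        MPI_Isend(buf + i * seg_len, seg_len, MPI_DOUBLE,
                  peer, i /* tag */, comm, &reqs[i]);
    }

    /* With strong (independent) progress, earlier segments move while
     * later ones are still being computed; with weak progress they may
     * all move here, at the Waitall. */
    MPI_Waitall(nseg, reqs, MPI_STATUSES_IGNORE);
}
```

Too few segments give little overlap; too many pay per-message startup costs nseg times, which is the trade-off behind the T(s) curve on the next slide.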

SLIDE 11

Optimal Segmentation

[Plot: execution time T(s) versus number of segments s, starting at T_no_overlap for s = 1 and reaching a minimum T_best at the optimal segment count]
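The slide gives only the curve, not a formula; as a hedged illustration of why T(s) has a minimum, here is a generic two-stage (compute, then send) pipeline model, where C is total compute time, m the message size, and α and β the per-message latency and per-byte transfer time. None of these symbols come from the talk.

```latex
% Generic pipeline model over s segments (an assumption, not from the slides):
% per-segment compute a = C/s, per-segment transfer b = \alpha + m\beta/s.
T(s) = \frac{C}{s}
     + (s-1)\,\max\!\left(\frac{C}{s},\ \alpha + \frac{m\beta}{s}\right)
     + \alpha + \frac{m\beta}{s},
\qquad
T(1) = C + \alpha + m\beta = T_{\mathrm{no\ overlap}}.
```

Larger s improves overlap but adds roughly s·α of extra startup cost, so T(s) dips to T_best at some optimal segment count and then rises again, matching the shape of the curve on the slide.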

SLIDE 12

Performance Gain from Overlapping

• Effect of overlapping on FFT global phase in seconds, p = 2
• [Chart: execution time (sec) vs. number of segments (1 to 64) for 1M, 2M, and 4M points, p = 2]

  Size   Max speedup
  1M     1.41
  2M     1.43
  4M     1.43

SLIDE 13

Performance Gain from Overlapping (cont.)

• Effect of overlapping on FFT global phase in seconds, p = 4
• [Chart: execution time (sec) vs. number of segments (1 to 64) for 1M, 2M, and 4M points, p = 4]

  Size   Max speedup
  1M     1.31
  2M     1.32
  4M     1.33

SLIDE 14

Performance Gain from Overlapping (cont.)

• Effect of overlapping on FFT global phase in seconds, p = 8
• [Chart: execution time (sec) vs. number of segments (1 to 64) for 1M, 2M, and 4M points, p = 8]

  Size   Max speedup
  1M     1.32
  2M     1.32
  4M     1.33

SLIDE 15

Effect of Message-Passing Library on Overlapping

• Comparison between blocking and polling modes of MPI, n = 2M, p = 2
• [Chart: execution time (sec) vs. number of segments (1 to 64), blocking vs. polling]

SLIDE 16

Effect of Message-Passing Library on Overlapping (cont.)

• Comparison between blocking and polling modes of MPI, n = 2M, p = 8
• [Chart: execution time (sec) vs. number of segments (1 to 64), blocking vs. polling]

SLIDE 17

Observations/Upshots

• Completion notification method affects latency of short messages (i.e., < 4K on the legacy system)
• Notification method did not affect bandwidth of long messages
• Short-message programs
  ▼ Strong progress, polling notification
• Long-message programs
  ▼ Strong progress, blocking notification

SLIDE 18

Future (soon?)

• MPIs support overlap and notification modes well
• Overlap is worth at most a factor of 2 (3 if you include I/O)
• It is valuable in real algorithmic situations
• Arguably growing in value at exascale
• We need to reveal this capability broadly, without the "self help" model

SLIDE 19

Thank you

• 25 years of progress
• And still going strong…
• Collective! Nonblocking? Persistent! Fault Tolerant?