

  1. MPI: 25 Years of Progress
 Anthony Skjellum
 University of Tennessee at Chattanooga
 Tony-skjellum@utc.edu
 Formerly: LLNL, MSU, MPI Software Technology, Verari/Verarisoft, UAB, and Auburn University
 Co-authors: Ron Brightwell, Sandia; Rossen Dimitrov, Intralinks

  2. Outline
 • Background
 • Legacy
 • About Progress
 • MPI Taxonomy
 • A glimpse at the past
 • A look toward the future

  3. Progress
 • 25 years ago we as a community set out to standardize parallel programming
 • It worked ☺
 • An amazing "collective operation" (hmm… still not complete)
 • Also a few things about the other kind of progress: moving data independently of user calls to MPI …

  4. Community
 • This was close to the beginning …

  5. As we all know (agree?)
 • MPI defined progress as a "weak" requirement
 • MPI implementations don't have to move data independently of when MPI is called
 • Implementations can do so
 • There is no need for an internally concurrent schedule to comply
 • For instance: do all the data movement at "Waitall" … predictable, if progress is required to happen only there! (a sketch follows)
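
A minimal C sketch of that "Waitall" point (mine, not from the talk; compute() is a placeholder for MPI-free application work): under weak progress, a conforming implementation may move none of this message during compute() and perform the entire transfer inside MPI_Wait.

    #include <mpi.h>

    extern void compute(void);   /* placeholder: application work, no MPI calls */

    void weak_progress_example(const double *buf, int count, int dest)
    {
        MPI_Request req;
        MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

        compute();   /* no MPI calls here, so no progress is guaranteed */

        /* A weak-progress MPI may do all of the data movement right here. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }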

  6. How programs/programmers achieve progress
 • The MPI library calls the progress engine when you make most MPI calls
 • The MPI library does it for you
 ▼ In the transport; MPI just shepherds lightly
 ▼ In an internal thread (or threads) scheduled periodically
 • You kick the progress engine yourself ("self help"; sketched below)
 ▼ You call MPI_Test() sporadically in your user thread
 ▼ You schedule and call MPI_Test() in a helper thread
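
A minimal C sketch of the first "self help" variant above (compute_chunk() is a hypothetical stand-in for a slice of the application's work): MPI_Test is called between compute chunks purely to kick the progress engine.

    #include <mpi.h>

    extern void compute_chunk(int i);   /* hypothetical slice of app work */

    void self_help_progress(double *recvbuf, int count, int src, int nchunks)
    {
        MPI_Request req;
        int done = 0;

        MPI_Irecv(recvbuf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);

        for (int i = 0; i < nchunks; i++) {
            compute_chunk(i);
            if (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* kick progress */
        }
        if (!done)
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* ensure completion */
    }

The helper-thread variant is the same idea with the MPI_Test loop moved into a periodically scheduled thread, which requires an appropriate thread level (e.g., MPI_THREAD_MULTIPLE) or careful serialization.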

  7. Desirements
 • Overlap communication and computation
 • Predictability / low jitter
 • Later: overlap of communication, computation, and I/O
 • Proviso: low jitter → you must have the memory bandwidth


  8. MPI Implementation Taxonomy (Dimitrov)
 • Message completion notification
 ▼ Asynchronous (blocking)
 ▼ Synchronous (polling)
 • Message progress
 ▼ Asynchronous (independent)
 ▼ Synchronous (polling)
 [Quadrant diagram: the two axes combine into four implementation classes: independent/blocking, independent/polling, polling/blocking, and all-polling]


  9. Segmentation
 • Common technique for implementing overlap through pipelining (sketched below)
 [Diagram: an entire message of size m sent at once, versus the same message split into s segments of size m/s with each segment's transfer overlapped with a compute chunk]
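
A hedged C sketch of that pipeline (my code, not the talk's; compute_segment() is a hypothetical stage that produces segment i): each segment is handed to MPI_Isend as soon as it is ready, so its transfer overlaps the computation of the next segment.

    #include <mpi.h>

    #define MAX_SEGS 64   /* matches the largest segment count on the slides */

    extern void compute_segment(double *seg, int n);   /* hypothetical stage */

    void pipelined_send(double *msg, int m, int s, int dest)
    {
        MPI_Request reqs[MAX_SEGS];
        int seg = m / s;                 /* assume s divides m evenly */

        for (int i = 0; i < s; i++) {
            compute_segment(msg + i * seg, seg);     /* produce segment i */
            MPI_Isend(msg + i * seg, seg, MPI_DOUBLE, dest, i,
                      MPI_COMM_WORLD, &reqs[i]);     /* overlaps stage i+1 */
        }
        MPI_Waitall(s, reqs, MPI_STATUSES_IGNORE);
    }

Whether the transfers actually overlap depends on the progress model from the taxonomy slide: under weak progress, the sends may sit until MPI_Waitall.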

  10. Optimal Segmentation
 [Plot: execution time T(s) versus number of segments s; T falls from the no-overlap time at s = 1 to a best time T_best at an intermediate segment count s_b, then rises again toward s_m]
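
The shape of that curve follows from a standard pipelining cost model (my notation and an assumed linear cost, not taken from the talk): with per-segment overhead \alpha and per-element transfer time \beta for a message of m elements in s segments,

    T(s) \approx \alpha s + \beta \frac{m}{s},
    \qquad
    \frac{dT}{ds} = 0 \;\Rightarrow\;
    s_b = \sqrt{\frac{\beta m}{\alpha}},
    \qquad
    T_{\mathrm{best}} = T(s_b) = 2\sqrt{\alpha \beta m}

Too few segments leave little to overlap; too many let the per-segment overhead \alpha s dominate, which is why T(s) rises again at large s.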

  11. Performance Gain from Overlapping
 • Effect of overlapping on the FFT global phase, p = 2
 [Chart: max execution time [sec] (0.000–1.000) vs. number of segments (1–64) for n = 1M, 2M, 4M at p = 2]

    size   speedup
    1M     1.41
    2M     1.43
    4M     1.43

  12. Performance Gain from Overlapping (cont.)
 • Effect of overlapping on the FFT global phase, p = 4
 [Chart: max execution time [sec] (0.000–1.000) vs. number of segments (1–64) for n = 1M, 2M, 4M at p = 4]

    size   speedup
    1M     1.31
    2M     1.32
    4M     1.33

  13. Performance Gain from Overlapping (cont.)
 • Effect of overlapping on the FFT global phase, p = 8
 [Chart: max execution time [sec] (0.000–1.000) vs. number of segments (1–64) for n = 1M, 2M, 4M at p = 8]

    size   speedup
    1M     1.32
    2M     1.32
    4M     1.33

  14. Effect of Message-Passing Library on Overlapping
 • Comparison between blocking and polling modes of MPI, n = 2M, p = 2
 [Chart: execution time [sec] (0.000–0.500) vs. number of segments (1–64), one curve each for blocking and polling]

  15. Effect of Message-Passing Library on Overlapping
 • Comparison between blocking and polling modes of MPI, n = 2M, p = 8
 [Chart: execution time [sec] (0.000–0.500) vs. number of segments (1–64), one curve each for blocking and polling]

  16. Observations/Upshots
 • Completion notification method affects the latency of short messages (i.e., < 4 KB on the legacy system)
 • Notification method did not affect the bandwidth of long messages
 • For short-message programs:
 ▼ Strong progress, polling notification
 • For long-message programs:
 ▼ Strong progress, blocking notification

  17. Future (soon?)
 • MPIs should support overlap and notification modes well
 • Overlap is worth at most a factor of 2 (3 if you include I/O): perfect overlap turns t_comm + t_comp into max(t_comm, t_comp), at best a halving
 • It is valuable in real algorithmic situations
 • Arguably growing in value at exascale
 • We need to expose this capability broadly without the "self help" model (one direction is sketched below)
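
One concrete "non self help" direction, consistent with the "Persistent!" note on the closing slide (a hedged sketch, not the talk's specific proposal): persistent requests give the library a long-lived handle that it can progress on its own schedule across iterations.

    #include <mpi.h>

    extern void compute_step(int it);   /* hypothetical per-iteration work */

    void persistent_exchange(double *buf, int count, int dest, int niters)
    {
        MPI_Request req;
        MPI_Send_init(buf, count, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

        for (int it = 0; it < niters; it++) {
            MPI_Start(&req);    /* restart the prebuilt send */
            compute_step(it);   /* overlap, if the implementation progresses */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
        MPI_Request_free(&req);
    }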

  18. Thank you
 • 25 years of progress
 • And still going strong …
 • Collective!
 • Nonblocking?
 • Persistent!
 • Fault Tolerant?
