Automatic MPI application transformation with ASPhALT
Anthony Danalis, Lori Pollock, Martin Swany
University of Delaware
Outline: Motivation, Overview, Transformation, Automation, Evaluation, Future Work
Problem
Overall Research Goal
Proposed Solution:
An automatic system that transforms simple communication code into efficient code.
Requirements:
✔ Achieve high-performance communication
✔ Simplify the MPI code developers write
Side-effect:
Enables legacy parallel MPI applications to scale, even if written without any knowledge of this system
Have your cake + Eat your cake: Automatic cake making machine
Overall Research Goal
ASPhALT: Information from multiple layers contributes to source optimization
Cluster Layers: Application, Runtime Libraries, Operating System/Network
Our Framework: ASPhALT*
[Diagram. Components: Original Code, Application Analyzer, System Benchmarks, System Parameters, Low Level Comm. API, Source to Source Optimizer, Optimized Source Code, Existing Compiler, Executable. Cluster layers: Application, Runtime Libraries, Operating System/Network.]
“Prepushing” Transformation
Comm. Aggregation vs. Performance
Traditional Approach:
High Aggregation -> Low Overhead + High Bandwidth -> High Communication Performance
Why does our communication segmentation work?
Fine Grain Communication -> High Overhead (on the network, not the CPU) + Low Bandwidth (but the transfer is overlapped, i.e. the CPU is not idle) -> High Application Performance
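The overlap argument can be made concrete with a toy timing model. This is a minimal sketch and not part of the slides: the function names, the parameters (`t_kernel`, `t_msg`, `t_xfer`), and the assumption that the network rather than the CPU pays every per-message cost are all hypothetical.

```python
def aggregated_time(n_steps, t_kernel, t_msg, t_xfer):
    """Run every kernel step, then send everything in one blocking message.

    t_msg  : fixed cost per message (hypothetical parameter)
    t_xfer : wire time for the data produced by one kernel step
    """
    return n_steps * t_kernel + t_msg + n_steps * t_xfer


def segmented_time(n_steps, k, t_kernel, t_msg, t_xfer):
    """Send one message per tile of k steps, overlapped with later compute.

    Assumes the network, not the CPU, pays t_msg and t_xfer, and that
    transfers are serialized on the wire.
    """
    if n_steps % k != 0:
        raise ValueError("this sketch assumes k divides n_steps")
    n_tiles = n_steps // k
    compute = k * t_kernel           # CPU time per tile
    transfer = t_msg + k * t_xfer    # network time per tile
    # Two-stage pipeline: the first tile's compute, then the slower stage
    # paces the remaining tiles, then the last tile's transfer drains.
    return compute + (n_tiles - 1) * max(compute, transfer) + transfer
```

With 48 steps, `t_kernel=1.0`, `t_msg=0.5` and `t_xfer=0.25`, the aggregated model gives 60.5 time units and the segmented model with `k=1` gives 48.75, even though the segmented run pays the per-message cost 48 times. Setting `k` equal to `n_steps` collapses the segmented model back to the aggregated one.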
Transformer Prototype
Fortran Semantics vs. MPI Semantics
Before ASPhALT:
  sArray[ NX, NY ]
  DO I = 1, N
    kernel( sArray[ : , I ], ... )
  END DO
  synchrnsTransfer( sArray[:,:], rArray[:,:] )
After ASPhALT:
  sArray[ NX, NY ], rArray[ NX, NY ]
  DO T = 1, N, K
    DO P = 1, NPROC
      S = F( NX, P, NPROC )
      E = G( NX, P, NPROC )
      asynchRecvInit( rArray[ S:E, T:T+K-1], req[ T/K ] )
    END DO
    DO I = T, T+K-1
      kernel( sArray[ : , I ], ... )
    END DO
    DO P = 1, NPROC
      S = F( NX, P, NPROC )
      E = G( NX, P, NPROC )
      asynchSendInit( sArray[ S:E, T:T+K-1] )
    END DO
    IF( T/K > D ) THEN
      wait( req[ T/K - D ] )
    END IF
  END DO
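The shape of the transformed loop can be traced with a small model. The following is a sketch under stated assumptions, not the transformer's actual output: `prepush_schedule` and its event names are hypothetical, tile indices are zero-based, the per-peer loops over `P` are collapsed into one event per tile, and the waits still pending when the loop ends are drained afterwards (the slide does not show that step).

```python
def prepush_schedule(n, k, d):
    """Return the ordered events for n kernel steps, tile size k, wait depth d.

    Each tile posts its receive, runs up to k kernel steps, starts its send,
    and then waits on the receive posted d tiles earlier.
    """
    events = []
    pending = []  # tiles whose receive has been posted but not waited on
    for tile, start in enumerate(range(0, n, k)):
        stop = min(start + k, n)
        events.append(("recv_init", tile))
        pending.append(tile)
        for i in range(start, stop):
            events.append(("kernel", i))
        events.append(("send_init", tile))
        if tile >= d:
            done = tile - d
            events.append(("wait", done))
            pending.remove(done)
    # Drain whatever is still outstanding (assumption, not on the slide).
    for tile in pending:
        events.append(("wait", tile))
    return events
```

For `prepush_schedule(4, 2, 1)` the wait for tile 0 appears only after the kernels and the send of tile 1 have been issued, which is where the overlap of communication and computation comes from.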
After FORTRAN compiler:
  TEMP2[ : ] = sArray[ S:E, T:T+K-1]
  asynchSend( TEMP2[ : ] )
  sArray[ S:E, T:T+K-1] = TEMP2[ : ]
  TEMP1[ : ] = rArray[ S:E, T:T+K-1]
  asynchRecvInit( TEMP1[ : ] )
  rArray[ S:E, T:T+K-1] = TEMP1[ : ]
An array slice means an implicit copy, so that the data passed to the call is contiguous.
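Why a slice forces a copy can be seen from the memory offsets it touches. This is an illustrative sketch only: `slice_offsets` and `pack` are hypothetical helpers that model a column-major array of `nx` rows as a flat Python list, with 1-based inclusive bounds as in the pseudocode.

```python
def slice_offsets(nx, s, e, t, k):
    """Flat column-major offsets of a[s:e, t:t+k-1] (1-based, inclusive bounds)."""
    return [(col - 1) * nx + (row - 1)
            for col in range(t, t + k)
            for row in range(s, e + 1)]


def pack(flat, offsets):
    """Gather scattered elements into a contiguous temporary, like TEMP1/TEMP2."""
    return [flat[o] for o in offsets]
```

For a 4-row array, rows 2:3 of columns 1:2 live at offsets 1, 2, 5 and 6. The gap between 2 and 5 is what the compiler bridges by gathering the elements into a temporary before the call and scattering them back afterwards.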
Potential problems with the compiler-generated copies:
➢ TEMP1 is copied back into rArray immediately after asynchRecvInit, but the data has not arrived yet.
➢ Data flow analysis allows the F90 compiler to redefine TEMP2 after its copy-back, but the data has not departed yet.
Solution: Make the copy explicit
After ASPhALT:
  wait( send )
  Stmp[ : ] = sArray[ S:E, T:T+K-1]
  asynchSend( Stmp[ : ] )
  sArray[ S:E, T:T+K-1] = Stmp[ : ]
  Rtmp[ : ] = rArray[ S:E, T:T+K-1]
  asynchRecvInit( Rtmp[ : ] )
  ...
  wait( receive )
  rArray[ S:E, T:T+K-1] = Rtmp[ : ]
➢ Rtmp is copied back after the data has arrived.
➢ Stmp is a variable introduced by ASPhALT, not by the F90 compiler, so ASPhALT knows when it can be redefined.
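The receive-side hazard and its fix can be reproduced with a toy model. This is a sketch, not MPI: `FakeAsyncRecv` is a hypothetical stand-in that delivers data only when `wait()` is called, which models the worst case for a nonblocking receive.

```python
class FakeAsyncRecv:
    """Toy nonblocking receive: data lands in the buffer only at wait()."""

    def __init__(self, buffer, incoming):
        self.buffer = buffer
        self.incoming = list(incoming)

    def wait(self):
        self.buffer[:] = self.incoming


def copy_back_too_early(incoming):
    """Compiler-style temporary: copy-back happens right after the post."""
    r_array = [0] * len(incoming)
    temp = list(r_array)                 # TEMP1 = rArray slice
    req = FakeAsyncRecv(temp, incoming)  # asynchRecvInit( TEMP1 )
    r_array[:] = temp                    # copy-back before the data arrives
    req.wait()
    return r_array


def copy_back_after_wait(incoming):
    """Explicit temporary: copy-back is placed after the wait."""
    r_array = [0] * len(incoming)
    rtmp = list(r_array)                 # Rtmp
    req = FakeAsyncRecv(rtmp, incoming)  # asynchRecvInit( Rtmp )
    req.wait()                           # wait( receive )
    r_array[:] = rtmp                    # copy-back after arrival
    return r_array
```

With `[7, 8, 9]` as the incoming data, the early copy-back leaves `r_array` at `[0, 0, 0]`, while the version that copies after the wait returns `[7, 8, 9]`.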
Redundant copies (they do not appear in the final ASPhALT output):
  sArray[ S:E, T:T+K-1] = Stmp[ : ]
  Rtmp[ : ] = rArray[ S:E, T:T+K-1]
After ASPhALT (final form):
  ...
  asynchRecvInit( Rtmp[ : ] )
  ...
  wait( receive )
  rArray[ S:E, T:T+K-1] = Rtmp[ : ]
  ...
  Stmp[ : ] = sArray[ S:E, T:T+K-1]
  asynchSend( Stmp[ : ] )
  ...
Issue: Introducing memory copying limits performance
Solution: Discussed after the comm. aggregation graphs
Evaluation of Automatic Transformation
[Four result graphs, one per configuration:]
➢ Synthetic Kernel. Interconnect: Ammasso, NP: 16, size: 1440x1440x48x16 Bytes
➢ Synthetic Kernel. Interconnect: Myrinet-MX, NP: 48, size: 1440x1440x48x16 Bytes
➢ Real Application (visco). Interconnect: Myrinet-MX, NP: 48, size: 9216x2305x48x16 Bytes
➢ Real Application (visco). Interconnect: Myrinet-GM, NP: 24, size: 9216x2305x48x16 Bytes
Extra Memory Copying
Issue: Introducing memory copying limits performance
With tile size K (copies through Rtmp and Stmp):
  ...
  asynchRecvInit( Rtmp[ : ] )
  ...
  wait( receive )
  rArray[ S:E, T:T+K-1] = Rtmp[ : ]
  ...
  Stmp[ : ] = sArray[ S:E, T:T+K-1]
  asynchSend( Stmp[ : ] )
  ...
With K = 1 (still copying through Rtmp and Stmp):
  ...
  asynchRecvInit( Rtmp[ : ] )
  ...
  wait( receive )
  rArray[ S:E, T ] = Rtmp[ : ]
  ...
  Stmp[ : ] = sArray[ S:E, T ]
  asynchSend( Stmp[ : ] )
  ...
Solution: If the network is fast enough for K==1, the buffer can be sent directly without copying:
  ...
  asynchRecvInit( rArray[ S:E, T ] )
  ...
  wait( receive )
  ...
  asynchSend( sArray[ S:E, T ] )
  ...
If K=1, the slice is not needed. If the memory is contiguous, copying is not needed.
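The two conditions above reduce to a simple test on the slice bounds. A minimal sketch, assuming a column-major array with `nx` rows and 1-based inclusive bounds; `can_send_in_place` is a hypothetical helper, not a function of ASPhALT.

```python
def can_send_in_place(nx, s, e, k):
    """True if a[s:e, t:t+k-1] is one contiguous run in column-major storage.

    The answer does not depend on the starting column t.
    """
    if not (1 <= s <= e <= nx) or k < 1:
        raise ValueError("invalid slice bounds")
    single_column = (k == 1)             # one column: rows s..e are adjacent
    full_columns = (s == 1 and e == nx)  # whole columns: no gap between them
    return single_column or full_columns
```

When the function returns `False`, the explicit `Stmp`/`Rtmp` copies are still required; when it returns `True`, the array section can be handed to the communication call directly.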
Automatic Tuning through Benchmarks
Additional Framework Components
Low Level Communication API: libGravel
➢ Thin libraries & macros
➢ Provide access to common low level features (RDMA put)
➢ Abstract hardware details
➢ Abstract protocol details (memory registration/pointer exchange)
➢ Abstract language details (make API usable from F95)
➢ Utilize existing state of the art APIs: UDAPL (current work), GASNet, GM, MX
Additional Framework Components (2)
Application Analyzer
➢ Infer Program Parameters
    Kernel Execution Time
    Data Size per Kernel
    Maximum Buffer Size
➢ Run Empirical Tests on a Slice of the Program
    Try various Parameter Values
    Try various Semantically Equivalent Algorithms
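The "try various parameter values" step amounts to an empirical search. A minimal sketch of such a search, assuming a caller-supplied `measure` function that runs a slice of the program with a given parameter value and returns its elapsed time; `tune`, `measure` and `repeats` are hypothetical names, and the real Application Analyzer may work differently.

```python
def tune(candidates, measure, repeats=3):
    """Return (best_value, best_time) over the candidate parameter values.

    measure(value) must return the elapsed time of one trial run; the
    minimum over `repeats` trials is used to damp timing noise.
    """
    best_value, best_time = None, float("inf")
    for value in candidates:
        elapsed = min(measure(value) for _ in range(repeats))
        if elapsed < best_time:
            best_value, best_time = value, elapsed
    return best_value, best_time
```

In practice `measure` could time the transformed kernel-and-communication loop for one tile size `K`, and the same loop over candidates could compare semantically equivalent algorithms instead of parameter values.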
Current & Future Directions
➢ Extend Generality of asphalt_transformer
➢ Enable the use of Low Level and One-Sided I/O
➢ Refine and Combine all the Parts of ASPhALT
➢ Study effect of ASPhALT on Time-To-Solution with real developers