SLIDE 1

Automatic MPI application transformation with ASPhALT

Anthony Danalis, Lori Pollock, Martin Swany

University of Delaware

SLIDE 2

Motivation Overview Transformation Automation Evaluation Future Work

Problem

SLIDE 6

Overall Research Goal

Proposed Solution:

An automatic system that transforms simple communication code into efficient code.

Requirements:

✔ Achieve high-performance communication
✔ Simplify the MPI code developers write

Side-effect:

Enables legacy parallel MPI applications to scale, even if written without any knowledge of this system.

Have your cake + Eat your cake: Automatic cake making machine

SLIDE 7

Overall Research Goal

ASPhALT: Information from multiple layers contributes to source optimization

Cluster Layers:

  • Application
  • Runtime Libraries
  • Operating System / Network

SLIDE 8

Our Framework: ASPhALT*

[Diagram: framework components]

  • Original Code
  • Application Analyzer
  • System Benchmarks
  • System Parameters
  • Source to Source Optimizer
  • Optimized Source Code
  • Low Level Comm. API
  • Existing Compiler
  • Executable

Cluster layers: Application, Runtime Libraries, Operating System / Network

SLIDE 9

“Prepushing” Transformation

SLIDE 16

Comm. Aggregation vs. Performance

Traditional Approach:

High aggregation gives low overhead and high bandwidth, and therefore high communication performance.

Why does our communication segmentation work?

Fine-grain communication has:

  • High overhead, but on the network, not on the CPU
  • Low bandwidth, but the transfer is overlapped, i.e. the CPU is not idle

The result is high application performance.
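
The trade-off can be made concrete with a toy cost model. The sketch below is only an illustration, not the model ASPhALT uses. It assumes that each message costs a fixed latency plus bytes divided by bandwidth, that small messages reach only a fraction of peak bandwidth, and that a transfer can fully overlap the computation of the next tile. All function names, parameters, and numbers are hypothetical.

```python
def aggregated_time(n_tiles, compute_per_tile, bytes_per_tile, latency, bandwidth):
    """Compute every tile first, then send one large message (no overlap)."""
    compute = n_tiles * compute_per_tile
    transfer = latency + (n_tiles * bytes_per_tile) / bandwidth
    return compute + transfer


def segmented_time(n_tiles, compute_per_tile, bytes_per_tile, latency, bandwidth,
                   small_msg_efficiency=0.5):
    """Send each tile as soon as it is computed, overlapping with the next tile."""
    # Assumption: small messages only reach a fraction of the peak bandwidth.
    per_msg = latency + bytes_per_tile / (bandwidth * small_msg_efficiency)
    # Pipeline: each step lasts as long as the slower of compute and transfer;
    # the final transfer has no computation left to hide behind.
    step = max(compute_per_tile, per_msg)
    return compute_per_tile + (n_tiles - 1) * step + per_msg


if __name__ == "__main__":
    args = dict(n_tiles=48, bytes_per_tile=1e5, latency=1e-5, bandwidth=1e8)
    for compute in (4e-3, 1e-3):
        agg = aggregated_time(compute_per_tile=compute, **args)
        seg = segmented_time(compute_per_tile=compute, **args)
        print(f"compute/tile={compute:g}s aggregated={agg:.5f}s segmented={seg:.5f}s")
```

Under these assumptions segmentation wins only while the per-tile computation is long enough to hide the slower fine-grain transfer, which is why the tile size has to be tuned per system.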

SLIDE 17

Transformer Prototype

SLIDE 18

Fortran Semantics vs. MPI Semantics

Before ASPhALT:

    sArray[ NX, NY ]
    DO I = 1, N
      kernel( sArray[ : , I ], ... )
    END DO
    synchrnsTransfer( sArray[:,:], rArray[:,:] )

After ASPhALT:

    sArray[ NX, NY ], rArray[ NX, NY ]
    DO T = 1, N, K
      DO P = 1, NPROC
        S = F( NX, P, NPROC )
        E = G( NX, P, NPROC )
        asynchRecvInit( rArray[ S:E, T:T+K-1 ], req[ T/K ] )
      END DO
      DO I = T, T+K-1
        kernel( sArray[ : , I ], ... )
      END DO
      DO P = 1, NPROC
        S = F( NX, P, NPROC )
        E = G( NX, P, NPROC )
        asynchSendInit( sArray[ S:E, T:T+K-1 ] )
      END DO
      IF( T/K > D ) THEN
        wait( req[ T/K - D ] )
      END IF
    END DO
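
The order of operations in the tiled loop can be traced with a small sketch. It is an illustration in Python, not part of ASPhALT: the function name `tiled_schedule` and the event labels are hypothetical, and `tile = t // k` mirrors the integer division `T/K` of the pseudocode.

```python
def tiled_schedule(n, k, d):
    """Return the (event, tile) sequence of the tiled loop DO T = 1, N, K."""
    events = []
    for t in range(1, n + 1, k):
        tile = t // k  # same integer division as T/K in the pseudocode
        events.append(("post_recv", tile))   # asynchRecvInit for this tile
        events.append(("compute", tile))     # kernel over columns T..T+K-1
        events.append(("post_send", tile))   # asynchSendInit for this tile
        if tile > d:
            # Only wait for a request posted D tiles earlier, so up to D
            # transfers stay in flight while the computation continues.
            events.append(("wait", tile - d))
    return events


if __name__ == "__main__":
    for event in tiled_schedule(n=8, k=2, d=1):
        print(event)
```

In this sketch, requests that are never waited on inside the loop would have to be completed after it; the slide does not show that step.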

SLIDE 21

Fortran Semantics vs. MPI Semantics

After FORTRAN compiler (each array-slice argument in the ASPhALT output is passed through a temporary):

    TEMP2[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( TEMP2[ : ] )
    sArray[ S:E, T:T+K-1 ] = TEMP2[ : ]

    TEMP1[ : ] = rArray[ S:E, T:T+K-1 ]
    asynchRecvInit( TEMP1[ : ] )
    rArray[ S:E, T:T+K-1 ] = TEMP1[ : ]

An array slice means an implicit copy, so that the data is contiguous.

SLIDE 22

Fortran Semantics vs. MPI Semantics

After FORTRAN compiler:

    TEMP2[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( TEMP2[ : ] )
    sArray[ S:E, T:T+K-1 ] = TEMP2[ : ]

    TEMP1[ : ] = rArray[ S:E, T:T+K-1 ]
    asynchRecvInit( TEMP1[ : ] )
    rArray[ S:E, T:T+K-1 ] = TEMP1[ : ]

Potential Problems:

  • TEMP1 is copied back, but the data has not arrived yet.
  • Data flow analysis allows the F90 compiler to redefine TEMP2 after it is copied back, but the data has not departed yet.

SLIDE 23

Fortran Semantics vs. MPI Semantics

Solution: Make the copy explicit

After ASPhALT:

    wait( send )
    Stmp[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( Stmp[ : ] )
    sArray[ S:E, T:T+K-1 ] = Stmp[ : ]

    Rtmp[ : ] = rArray[ S:E, T:T+K-1 ]
    asynchRecvInit( Rtmp[ : ] )
    ...
    wait( receive )
    rArray[ S:E, T:T+K-1 ] = Rtmp[ : ]

  • Rtmp is copied back only after the data has arrived.
  • Stmp is a variable introduced by ASPhALT, not by the F90 compiler, so ASPhALT knows when to redefine it.
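
Why the copy-back has to follow the wait can be shown with a small simulation. This is a sketch in Python using a thread in place of the network, not MPI and not ASPhALT's generated code. The helper `async_recv_init`, the `arrived` event, and the buffer contents are all hypothetical.

```python
import threading


def async_recv_init(buf, payload, arrived):
    """Start a background 'receive' that fills buf once `arrived` is set."""
    def deliver():
        arrived.wait()      # the data is still in flight until this fires
        buf[:] = payload
    worker = threading.Thread(target=deliver)
    worker.start()
    return worker           # plays the role of the request handle


def demo():
    r_array = [0, 0, 0, 0]
    rtmp = [0, 0, 0, 0]     # explicit staging buffer, like Rtmp on the slide
    arrived = threading.Event()
    request = async_recv_init(rtmp, [1, 2, 3, 4], arrived)

    premature = list(rtmp)  # copy-back before the wait: stale contents

    arrived.set()           # the simulated network delivers the message
    request.join()          # corresponds to wait( receive )
    r_array[:] = rtmp       # copy-back after the wait: valid data
    return premature, r_array


if __name__ == "__main__":
    print(demo())
```

Because the staging buffer is an ordinary variable owned by the transformer, the copy-back can be placed after the wait, which is not possible with a temporary that the compiler introduces on its own.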

SLIDE 24

Fortran Semantics vs. MPI Semantics

After ASPhALT:

    Stmp[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( Stmp[ : ] )
    sArray[ S:E, T:T+K-1 ] = Stmp[ : ]      ! redundant

    Rtmp[ : ] = rArray[ S:E, T:T+K-1 ]      ! redundant
    asynchRecvInit( Rtmp[ : ] )
    ...
    wait( receive )
    rArray[ S:E, T:T+K-1 ] = Rtmp[ : ]

SLIDE 25

Fortran Semantics vs. MPI Semantics

Before ASPhALT:

    sArray[ NX, NY ]
    DO I = 1, N
      kernel( sArray[ : , I ], ... )
    END DO
    synchrnsTransfer( sArray[:,:], rArray[:,:] )

After ASPhALT:

    ...
    asynchRecvInit( Rtmp[ : ] )
    ...
    wait( receive )
    rArray[ S:E, T:T+K-1 ] = Rtmp[ : ]
    ...
    Stmp[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( Stmp[ : ] )
    ...

Issue: Introducing memory copying limits performance.

Solution: Discussed after the communication aggregation graphs.

SLIDE 26

Evaluation of Automatic Transformation

Synthetic Kernel

Interconnect: Ammasso, NP: 16, size: 1440x1440x48x16 bytes

SLIDE 27

Evaluation of Automatic Transformation

Synthetic Kernel

Interconnect: Myrinet-MX, NP: 48, size: 1440x1440x48x16 bytes

SLIDE 28

Evaluation of Automatic Transformation

Real Application (visco)

Interconnect: Myrinet-MX, NP: 48, size: 9216x2305x48x16 bytes

SLIDE 29

Evaluation of Automatic Transformation

Real Application (visco)

Interconnect: Myrinet-GM, NP: 24, size: 9216x2305x48x16 bytes

SLIDE 30

Extra Memory Copying

With K = 1:

    ...
    asynchRecvInit( Rtmp[ : ] )
    ...
    wait( receive )
    rArray[ S:E, T ] = Rtmp[ : ]
    ...
    Stmp[ : ] = sArray[ S:E, T ]
    asynchSend( Stmp[ : ] )
    ...

With general K:

    ...
    asynchRecvInit( Rtmp[ : ] )
    ...
    wait( receive )
    rArray[ S:E, T:T+K-1 ] = Rtmp[ : ]
    ...
    Stmp[ : ] = sArray[ S:E, T:T+K-1 ]
    asynchSend( Stmp[ : ] )
    ...

Issue: Introducing memory copying limits performance.

Solution: If the network is fast for K == 1, the buffer can be sent directly without copying.

    ...
    asynchRecvInit( rArray[ S:E, T ] )
    ...
    wait( receive )
    ...
    asynchSend( sArray[ S:E, T ] )
    ...

  • If K = 1, the slice is not needed.
  • If memory is contiguous, copying is not needed.
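
The contiguity condition can be written as a small predicate. This is a sketch under the assumption of a column-major array `A(nx, ny)`, as in Fortran, with 1-based inclusive bounds. The function names are hypothetical and this is not code from ASPhALT.

```python
def slice_is_contiguous(nx, s, e, k):
    """True if A(s:e, t:t+k-1) of a column-major A(nx, ny) is one memory run."""
    if not (1 <= s <= e <= nx) or k < 1:
        raise ValueError("invalid slice bounds")
    rows = e - s + 1
    # A single column (k == 1) is always one contiguous run of `rows` elements.
    # Several columns are contiguous only if every column is taken whole.
    return k == 1 or rows == nx


def needs_copy(nx, s, e, k):
    """A staging copy is only required when the slice is not contiguous."""
    return not slice_is_contiguous(nx, s, e, k)


if __name__ == "__main__":
    print(needs_copy(nx=10, s=3, e=7, k=1))   # partial rows, one column
    print(needs_copy(nx=10, s=3, e=7, k=4))   # partial rows, four columns
    print(needs_copy(nx=10, s=1, e=10, k=4))  # whole columns
```

A transformer could apply a check of this kind to decide between the staged version with `Stmp`/`Rtmp` and the direct version that passes the array section itself.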

SLIDE 31

Automatic Tuning through Benchmarks

SLIDE 33

Additional Framework Components

Low Level Communication API: libGravel

  • Thin libraries & macros
  • Provide access to common low level features (RDMA put)
  • Abstract hardware details
  • Abstract protocol details (memory registration/pointer exchange)
  • Abstract language details (make API usable from F95)
  • Utilize existing state of the art APIs: UDAPL (current work), GASNet, GM, MX

SLIDE 34

Additional Framework Components (2)

Application Analyzer

Infer Program Parameters:

  • Kernel Execution Time
  • Data Size per Kernel
  • Maximum Buffer Size

Run Empirical Tests on a Slice of the Program:

  • Try various Parameter Values
  • Try various Semantically Equivalent Algorithms
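
The empirical step can be sketched as a small search loop. This is a generic illustration in Python, not the Application Analyzer itself. The names `pick_best` and `wall_clock` are hypothetical, and the `measure` callable stands in for running a slice of the program with one candidate parameter value (for example a tile size K).

```python
import time


def pick_best(measure, candidates, repeats=3):
    """Return (best_value, costs); each candidate is measured `repeats` times."""
    costs = {}
    for value in candidates:
        # Keep the minimum of several runs to reduce measurement noise.
        costs[value] = min(measure(value) for _ in range(repeats))
    best = min(costs, key=costs.get)
    return best, costs


def wall_clock(run):
    """Wrap a callable so that calling it returns its elapsed time in seconds."""
    def measure(value):
        start = time.perf_counter()
        run(value)
        return time.perf_counter() - start
    return measure


if __name__ == "__main__":
    # Hypothetical stand-in for a program slice parameterized by tile size k.
    def program_slice(k):
        sum(i * i for i in range(20000 // k))

    best, costs = pick_best(wall_clock(program_slice), [1, 2, 4, 8, 16])
    print(best, costs)
```

Passing the measurement in as a function keeps the search independent of what is being measured, so the same loop could compare parameter values or semantically equivalent algorithms.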

SLIDE 35

Current & Future Directions

➢ Extend Generality of asphalt_transformer
➢ Enable the use of Low Level and One-Sided I/O
➢ Refine and Combine all the Parts of ASPhALT
➢ Study effect of ASPhALT on Time-To-Solution with real developers

SLIDE 36

Questions?