DAGuE
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
2012 Scheduling Workshop, Pittsburgh, PA
DAGuE [dag] (like in Prague [prag]), not like ragout [ragu]
(the Prague Astronomical Clock was first installed in 1410, making it the third-oldest astronomical clock in the world and the oldest one still in operation)
Hardware becomes more heterogeneous (accelerators/coprocessors)
MPI, hybrid
Hard to maintain and to get good performance
When is it time to redesign software?
The gap between current programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
Decouple “System issues” from Algorithm
(Micro) Tasks
System Language
portability, performance, scheduling heuristics, heterogeneity management, data movement, …
Instead of centralized scheduling or an entire-DAG representation: dynamic and independent discovery of the relevant portions during execution
Do not express communications explicitly; instead make them implicit and schedule to maximize overlap and load balance
DSBP: [22] F. G. Gustavson, L. Karlsson, and B. Kågström. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling. DOI: 10.1145/1499096.1499100.
81 dual Intel Xeon L5420 @ 2.5 GHz nodes (2x4 cores/node) -> 648 cores; Myrinet MX 10 Gb/s; Intel MKL; ScaLAPACK
Figures: DPOTRF, DGEQRF, and DGETRF problem-scaling performance (TFlop/s) vs. matrix size (N = 13k-130k) on 648 cores (Myrinet 10G); curves: Theoretical peak, Practical peak (GEMM), DAGuE, DSBP (DPOTRF only), HPL (DGETRF only), ScaLAPACK.
Extracts more parallelism
Change of the data layout (static task scheduling)
Competes with hand-tuned, hardware-aware scheduling
Cores, Memory Hierarchies, Coherence, Data Movement, Accelerators; Scheduling, Data Movement, Symbolic Representation
Software stack (diagram): Dense LA, Sparse LA, Tools, … built on the Domain Specific Extensions, over the Parallel Runtime, over the Hardware
as systems scale up
DAGuE Toolchain (diagram): Serial Code -> DAGuE compiler -> Dataflow representation -> Dataflow compiler -> Parallel task stubs, which the System compiler builds together with the application code & codelets and additional libraries (MPI, CUDA, pthreads, PLASMA, MAGMA) for execution by the Runtime on the Supercomputer; the Programmer also provides the Domain Specific Extensions and the data distribution.
DAGuE Compiler
Kernels: GEQRT, TSQRT, UNMQR, TSMQR

FOR k = 0 .. SIZE - 1
  A[k][k], T[k][k] <- GEQRT( A[k][k] )
  FOR m = k+1 .. SIZE - 1
    A[k][k]|Up, A[m][k], T[m][k] <- TSQRT( A[k][k]|Up, A[m][k], T[m][k] )
  FOR n = k+1 .. SIZE - 1
    A[k][n] <- UNMQR( A[k][k]|Low, T[k][k], A[k][n] )
    FOR m = k+1 .. SIZE - 1
      A[k][n], A[m][n] <- TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
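Reading the in/out arguments off this loop nest for the smallest non-trivial case, SIZE = 2, gives the DAG directly (five tasks):
  GEQRT(0)       factors A[0][0]
  TSQRT(0,1)     needs A[0][0]|Up from GEQRT(0); updates A[0][0], A[1][0]
  UNMQR(0,1)     needs A[0][0]|Low and T[0][0] from GEQRT(0); updates A[0][1]
  TSMQR(0,1,1)   needs A[1][0], T[1][0] from TSQRT(0,1) and A[0][1] from UNMQR(0,1); updates A[0][1], A[1][1]
  GEQRT(1)       needs A[1][1] from TSMQR(0,1,1)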
for (k = 0; k < A.mt; k++) {
    Insert_Task( zgeqrt,
                 A[k][k], INOUT,
                 T[k][k], OUTPUT );
    for (m = k+1; m < A.mt; m++) {
        Insert_Task( ztsqrt,
                     A[k][k], INOUT | REGION_D | REGION_U,
                     A[m][k], INOUT | LOCALITY,
                     T[m][k], OUTPUT );
    }
    for (n = k+1; n < A.nt; n++) {
        Insert_Task( zunmqr,
                     A[k][k], INPUT | REGION_L,
                     T[k][k], INPUT,
                     A[k][n], INOUT );
        for (m = k+1; m < A.mt; m++) {
            Insert_Task( ztsmqr,
                         A[k][n], INOUT,
                         A[m][n], INOUT | LOCALITY,
                         A[m][k], INPUT,
                         T[m][k], INPUT );
        }
    }
}
REGION_D, …
QR
Omega Test: derives symbolic expressions for the tasks at the endpoints of each data flow and for the conditions for that data flow to exist
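For instance, the lower triangle of A[k][k] produced by GEQRT(k) flows to UNMQR(k, n) for every n in k+1 .. SIZE-1, and this flow exists only when k <= SIZE-2; both the successor set and the guard are kept as symbolic expressions rather than enumerated.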
Diagram: the resulting tile QR dataflow, with incoming/outgoing data from memory (MEM), panels k = 0 .. SIZE-1, inner indices m, n = k+1, and the LOWER/UPPER parts of the diagonal tile.
GEQRT TSQRT UNMQR TSMQR
Control flow is eliminated, therefore maximum parallelism is possible
GEQRT(k)
  /* Execution space */
  k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
  /* Locality */
  : A(k, k)
  RW    A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k)
  WRITE T <- T(k, k)
  /* Priority */
  ; (NT-k)*(NT-k)*(NT-k)
  BODY
    zgeqrt( A, T )
  END
JDF (Job Data Flow)
(Suppose the operator is associative and commutative)
for (s = 1; s < N/2; s = 2*s)
    for (i = 0; i < N-s; i += s)
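A minimal, self-contained C sketch of the binary-tree reduction these loops are getting at (the loop body is not on the slide; combine() and the exact bounds here are illustrative assumptions):

  #include <stdio.h>

  /* combine() stands in for any associative and commutative operator */
  static double combine(double a, double b) { return a + b; }

  int main(void)
  {
      enum { N = 8 };
      double V[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };

      /* at step s, element i absorbs element i + s; after log2(N) steps
         V[0] holds the reduction of all N values */
      for (int s = 1; s < N; s *= 2)
          for (int i = 0; i + s < N; i += 2 * s)
              V[i] = combine(V[i], V[i + s]);

      printf("reduction = %g\n", V[0]);   /* prints 36 */
      return 0;
  }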
reduce(l, p)
  l = 1 .. depth
  p = 0 .. (MT / (1<<l))
  : V(p * (1<<l))
  RW   A <- (1 == l) ? V(2*p)   : A reduce( l-1, 2*p )
            : B reduce( l+1, p/2 )
  READ B <- (1 == l) ? V(2*p+1) : A reduce( l-1, p*2+1 )
  BODY
  END
Solution: hand-writing of the data dependencies using the intermediate dataflow representation (diagram: reduction tree, node reduce(2, 1) with inputs A and B)
Data Flow Compiler
The compiled representation is just a set of functions to obtain the successors or predecessors of a task and to compute the set of initial tasks.
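Purely as an illustration (this is not DAGuE's generated code), such a successor function for the reduce(l, p) task class of the previous slide could look like:

  typedef struct { int l, p; } reduce_task_t;

  /* Successor of reduce(l, p) in a binary reduction tree of the given depth:
     its result feeds reduce(l+1, p/2); the root (l == depth) has none.
     Returns the number of successors written into succ (0 or 1). */
  static int reduce_successors(reduce_task_t t, int depth, reduce_task_t *succ)
  {
      if (t.l >= depth)
          return 0;
      succ[0].l = t.l + 1;
      succ[0].p = t.p / 2;
      return 1;
  }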
dague_object_t *reduce_create( dague_ddesc_t *V, int MT, int depth );
void            reduce_destroy( dague_object_t *o );
reduce(l, p)
  l = 1 .. depth
  p = 0 .. (MT / (1<<l))
  : V(p * (1<<l))
across all nodes
DSEs
dague_ddesc_t *V;
V = dague_onedim_bc( PTR, DAGUE_FLOAT, worldsize, M );
runtime
descriptors
be issued
int main(…)
{
    MPI_Init(…);
    dague_init(cores, worldsize, …);

    dague_ddesc_t  *V = …;
    dague_object_t *r = reduce_create(V, …);

    dague_enqueue(r);
    dague_wait(r);

    reduce_destroy(r);
    dague_fini();
    MPI_Finalize();
}
Algorithm is now expressed as a Parameterized DAG
ahead of time
heterogeneous resources
message passing
heterogeneity
DAG is considered
remote completion notifications
memory hierarchies
Diagram: tile Cholesky DAG (PO = POTRF, TR = TRSM, SY = SYRK, GE = GEMM) with tasks mapped onto Node0..Node3 by the data distribution.
User-defined data distribution function (a sketch follows below)
this ordering due to locality
DAG
Compute the successors of a task in O(d) (practically O(1))
Keep only the local & ready tasks
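A purely illustrative C sketch (not the DAGuE API) of such a user-defined distribution, here a 2-D block-cyclic mapping of tile (m, n) onto a P x Q process grid; the runtime queries it to find where data lives, hence where the task with affinity to that tile runs, and can then discard non-local tasks:

  /* rank owning tile (m, n) under a 2-D block-cyclic distribution (assumption) */
  static int rank_of(int m, int n, int P, int Q)
  {
      return (m % P) * Q + (n % Q);
  }
  /* e.g. with P = Q = 2, tile (3, 2) maps to rank (3 % 2) * 2 + (2 % 2) = 2 */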
Comparison, DAG Unrolling vs. PTG: task window vs. no window, communication-pattern detection, memory footprint (O(N) / O(w)), expressivity.
Performance; Ongoing Work
Figure: DGEQRF weak-scaling performance (TFlop/s) vs. number of cores (108-3072) on a Cray XT5; curves: Practical peak (GEMM), DAGuE, libSCI ScaLAPACK.
level
and from the GPU/Co-processors

/* POTRF Lower case */
GEMM(k, m, n)
  // Execution space
  k = 0 .. MT-3
  m = k+2 .. MT-1
  n = k+1 .. m-1
  // Parallel partitioning
  : A(m, n)
  // Parameters
  READ A <- C TRSM(m, k)
  READ B <- C TRSM(n, k)
  RW   C <- (k == 0) ? A(m, n) : C GEMM(k-1, m, n)
  BODY [CPU, CUDA, MIC, *]
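For example, the execution space above can be counted in closed form: summing over k, m, n gives MT(MT-1)(MT-2)/6 GEMM tasks (exactly one per strictly increasing triple of tile indices), a quantity the runtime can evaluate without ever unrolling the DAG.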
Figures: SPOTRF problem-scaling performance (TFlop/s) vs. matrix size (2k-12k) and weak-scaling performance vs. nodes;matrix size (1;54k to 12;176k) on 12 GPU nodes (Infiniband 20G); curves: Practical peak (GEMM), C2070 + 8 cores per node, 8 cores per node.
Scalability
Figure: performance (GFlop/s) vs. matrix size (N = 10k-50k) with 1 to 4 C1060 GPUs (C1060x1 .. C1060x4).
network, type and number of cores
Increasing the task duration (or the tile size) decreases parallelism
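As a rough illustration (tile Cholesky assumed): an N x N matrix cut into NB x NB tiles yields on the order of (N/NB)^3 / 6 tasks, so doubling NB divides the number of tasks, and hence the available parallelism, by about 8, while each task does roughly 8 times more work.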
auto-tune per system
Figure: efficiency (%) vs. block size (NB = 120-1000) on 1 node (8 cores), 4 nodes (32 cores), and 81 nodes (648 cores).
Not enough parallelism
Hermitian Band Diagonal; 16x16 tiles
high performance tuned scheduling
Total energy consumption
QR factorization (256 cores)
Work in progress with Hatem Ltaief
Figures: power (Watts) vs. time (seconds), broken down into System, CPU, Memory, and Network: (a) ScaLAPACK; (b) DPLASMA.
# Cores   Library     Cholesky    QR
128       ScaLAPACK     192000    672000
          DPLASMA       128000    540000
256       ScaLAPACK     240000    816000
          DPLASMA        96000    540000
512       ScaLAPACK     325000   1000000
          DPLASMA       125000    576000
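Reading the 512-core row: DPLASMA uses about 2.6x less energy than ScaLAPACK for Cholesky (125000 vs. 325000) and about 1.7x less for QR (576000 vs. 1000000).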
SystemG: Virginia Tech energy-monitored cluster (IB 40G, Intel, 8 cores/node)
the dague_object
distributed runs
Synchronizations: transform them into data dependencies
Related Work
Comparison (DAGuE / SMPSs / StarPU / Charm++ / FLAME / QUARK / Tblas / PTG):
  Scheduling:   Distr. (1/core) / Repl. (1/node) / Repl. (1/node) / Distr. (Actors) / w/ SuperMatrix / Repl. (1/node) / Centr. / Centr.
  Language:     Internal (Affine Loops) / Seq. w/ add_task / Seq. w/ add_task / Msg-Driven Objects / Internal (LA DSL) / Seq. w/ add_task / Seq. w/ add_task / Internal
  Accelerator:  GPU / GPU / GPU / GPU / GPU / - / - / -
  Availability: Public / Public / Public / Public / Public / Public / Not Avail. / Not Avail.
Early stage: ParalleX. Non-academic: Swarm, MadLINQ, CnC.
All projects support Distributed and Shared Memory (QUARK with QUARKd; FLAME with Elemental)
New languages should not strive to transform the easy into the trivial, but to transform the impracticable into the achievable.
The end