Transparently Composing CnC Graph Pipelines
on a Cluster
Hongbo Rong, Frank Schlimbach
Programming & Systems Lab (PSL)
Software Systems Group (SSG)
7th Annual Concurrent Collections Workshop, 9/8/2015
Problem
A productivity program running on a cluster
How to compose these non-composable library functions automatically?
Flow Graphs
Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel
Traditional: Bulk-synchronous Parallel
Pipelined & asynchronous: communication overlaps computation
Basic Idea
- User program: as usual, assume sequential, global shared-memory programming
    C = A + B
    E = C * D
- Library: add a CnC graph for each library function
- Compiler/interpreter: compose the corresponding graphs of a sequence of library calls; both graphs use the identical memory for C
- Execution: let CnC do the distribution:
    mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
    mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
    mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
[Figure: graph 1 (A, B -> + -> C) and graph 2 (C, D -> * -> E), composed so that C flows directly from graph 1 into graph 2]
Hello World
User program:
    dgemm(A, B, C)  # C = A*B
    dgemm(C, D, E)  # E = C*D
[Figure: on each process p (1..100), graph 1 multiplies the local row block Ap* with B to produce Cp*; graph 2 multiplies Cp* with D to produce Ep*; the Ep* blocks form E]
- Leverage the library
- No barrier/copy/msg between graphs/steps unless required, e.g. no bcast/gather of C
Code skeleton

User code:
    dgemm(A, B, C)
    dgemm(C, D, E)

Compiler-generated user code:
    initialize_CnC()
    dgemm_dgemm(A, B, C, D, E)
    finalize_CnC()

Context:
    struct dgemm_dgemm_context {
        item_collection *C_collection;
        tuner row_tuner, col_tuner;
        Graph *graph1, *graph2;
        dgemm_dgemm_context(A, B, C, D, E) {
            create C_collection
            graph1 = make_dgemm_graph(A, B, C);
            graph2 = make_dgemm_graph(C, D, E);
        }
    }

Interface:
    void dgemm_dgemm(A, B, C, D, E) {
        dgemm_dgemm_context ctxt(A, B, C, D, E);
        ctxt.graph1->start();
        ctxt.graph2->start();
        ctxt.wait();
        ctxt.graph2->copyout();
    }

Domain-expert written:
    class dgemm_graph {
        tuner *tunerA, *tunerB, *tunerC, *tunerS;
        item_collection *A_collection, *B_collection, *C_collection;
        tag_collection tags;
        step_collection *multiply_steps;
        dgemm_graph(_A, _B, _C) {
            create A/B/C_collection based on A/B/C
            define dataflow graph
        }
    }
Key points
- Compiler (static data distribution)
- Domain-expert written graphs
Advantages
- Useful for any language
    – Dataflow analysis, pattern matching, code replacement
- Extends a scripting language to distributed computing implicitly
- Heavy lifting done in CnC and in graphs written by domain experts
Open questions
- Minimize communication
- Scalability
- Applications