  1. Transparently Composing CnC Graph Pipelines on a Cluster
     Hongbo Rong, Frank Schlimbach
     Programming & Systems Lab (PSL), Software Systems Group (SSG)
     7th Annual Concurrent Collections Workshop, 9/8/2015

  2. Problem
     A productivity program running on a cluster:
     - The programmer is a domain expert, but not a tuning expert
     - The program calls distributed libraries
     - Library functions are not composable: they are black boxes, independent, context-unaware, with a barrier at the end
     How can these non-composable library functions be composed automatically?

  3. Flow Graphs
     - Traditional: bulk-synchronous parallel
     - Pipelined & asynchronous communication
     [Figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel]

  4. Basic Idea
     User program: as usual, assume sequential, global shared-memory programming.
         C = A + B
         E = C * D
     Library: add a CnC graph for each library function.
         [Diagram: one graph takes items A and B and produces C; another takes C and D and produces E]
     Compiler/interpreter: compose the corresponding graphs of a sequence of library calls; both graphs use the identical memory for C (sketched below).
     Execution: let CnC do the distribution.
         mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
         mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
         mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
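
     As an illustration of the composition idea (not the authors' implementation), the following self-contained C++ sketch uses a hypothetical item_collection type shared by both graphs. Because the second graph holds a pointer to the very collection the first graph fills, the C blocks are consumed in place, with no copy or barrier between the two operations; the graph types and run_block() are stand-ins for CnC steps.

         #include <cstddef>
         #include <cstdio>
         #include <map>
         #include <vector>

         // Hypothetical stand-in for a CnC item collection keyed by block index.
         using item_collection = std::map<int, std::vector<double>>;

         struct add_graph {                 // computes C = A + B, block by block
             item_collection *A, *B, *C;
             void run_block(int i) {
                 std::vector<double> c((*A)[i]);
                 for (std::size_t k = 0; k < c.size(); ++k) c[k] += (*B)[i][k];
                 (*C)[i] = c;               // "put": this C block becomes available
             }
         };

         struct mul_graph {                 // computes E = C * D (elementwise, for brevity)
             item_collection *C, *D, *E;
             void run_block(int i) {
                 std::vector<double> e((*C)[i]);   // "get": consumes the same C block in place
                 for (std::size_t k = 0; k < e.size(); ++k) e[k] *= (*D)[i][k];
                 (*E)[i] = e;
             }
         };

         int main() {
             item_collection A{{0, {1, 2}}}, B{{0, {3, 4}}}, D{{0, {2, 2}}}, C, E;
             add_graph g1{&A, &B, &C};
             mul_graph g2{&C, &D, &E};      // same &C: identical memory for C, no copy
             g1.run_block(0);               // in CnC the blocks of the two graphs could overlap
             g2.run_block(0);
             std::printf("E[0] = {%g, %g}\n", E[0][0], E[0][1]);
         }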

  5. Hello World
     User program:
         dgemm(A, B, C)   # C = A*B
         dgemm(C, D, E)   # E = C*D
     - Leverage the library
     - No barrier/copy/message between graphs/steps unless required, e.g. no bcast/gather of C
     [Figure: processes 1 … 100; on process p, Multiply Graph 1 computes Cp* from Ap* and B, then Multiply Graph 2 computes Ep* from Cp* and D; the Ep* blocks together form E. A sketch of this per-process pipeline follows below.]
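
     A rough sketch of the data movement the figure implies, under an assumed block-row distribution (the actual placement is up to the tuners): each "process" multiplies its local A block, keeps the resulting C block local, and feeds it straight into the second multiply, so C is never broadcast or gathered. The dgemm_block helper and the scalar B/D operands are simplifications for illustration only.

         #include <array>
         #include <cstdio>

         constexpr int P = 4;                      // illustrative number of processes
         using Block = std::array<double, 2>;      // one tiny row block

         // Stand-in for a local dgemm on one row block.
         Block dgemm_block(const Block &row, double scale) {
             return {row[0] * scale, row[1] * scale};
         }

         int main() {
             std::array<Block, P> A{{{1, 2}, {3, 4}, {5, 6}, {7, 8}}};
             const double B = 2.0, D = 10.0;       // replicated operands, kept scalar for brevity
             std::array<Block, P> E{};
             for (int p = 0; p < P; ++p) {         // conceptually each iteration is one rank
                 Block C_p = dgemm_block(A[p], B); // Multiply Graph 1: C_p stays on rank p
                 E[p]      = dgemm_block(C_p, D);  // Multiply Graph 2 consumes C_p locally
             }
             std::printf("E[0] = {%g, %g}\n", E[0][0], E[0][1]);
         }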

  6. Code skeleton
     User code (host language):
         dgemm(A, B, C)
         dgemm(C, D, E)
     The compiler rewrites the user code into:
         initialize_CnC()
         dgemm_dgemm(A, B, C, D, E)
         finalize_CnC()
     Generated context:
         struct dgemm_dgemm_context {
             item_collection *C_collection;
             tuner row_tuner, col_tuner;
             Graph *graph1, *graph2;
             dgemm_dgemm_context(A, B, C, D, E) {
                 create C_collection
                 graph1 = make_dgemm_graph(A, B, C);
                 graph2 = make_dgemm_graph(C, D, E);
             }
         }
     Generated interface:
         void dgemm_dgemm(A, B, C, D, E) {
             dgemm_dgemm_context ctxt(A, B, C, D, E);
             ctxt.graph1->start();
             ctxt.graph2->start();
             ctxt.wait();
             ctxt.graph2->copyout();
         }
     Graph written by the domain expert:
         class dgemm_graph {
             tuner *tunerA, *tunerB, *tunerC, *tunerS;
             item_collection *A_collection, *B_collection, *C_collection;
             tag_collection tags;
             step_collection *multiply_steps;
             dgemm_graph(_A, _B, _C) {
                 create A/B/C_collection based on A/B/C
                 define dataflow graph
             }
         }
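
     The skeleton above is pseudocode. Below is a minimal compilable C++ sketch of the generated context and interface only; Graph, make_dgemm_graph() and copyout() are hypothetical stand-ins for the CnC objects the compiler would emit, not a real API, and the graph bodies are left empty.

         #include <cstdio>
         #include <memory>
         #include <vector>

         using Matrix = std::vector<double>;

         struct Graph {                                       // hypothetical CnC graph wrapper
             const Matrix *in1, *in2;
             Matrix *out;
             void start()   { /* launch the multiply steps asynchronously in real CnC */ }
             void copyout() { /* write the distributed result back into *out */ }
         };

         std::unique_ptr<Graph> make_dgemm_graph(const Matrix &a, const Matrix &b, Matrix &c) {
             return std::unique_ptr<Graph>(new Graph{&a, &b, &c});
         }

         struct dgemm_dgemm_context {                         // generated by the compiler
             std::unique_ptr<Graph> graph1, graph2;
             dgemm_dgemm_context(const Matrix &A, const Matrix &B, Matrix &C,
                                 const Matrix &D, Matrix &E)
                 : graph1(make_dgemm_graph(A, B, C)),
                   graph2(make_dgemm_graph(C, D, E)) {}       // graph2 reads the same C
             void wait() { /* block until all steps of both graphs have finished */ }
         };

         void dgemm_dgemm(const Matrix &A, const Matrix &B, Matrix &C,
                          const Matrix &D, Matrix &E) {       // generated interface
             dgemm_dgemm_context ctxt(A, B, C, D, E);
             ctxt.graph1->start();
             ctxt.graph2->start();                            // no barrier between the graphs
             ctxt.wait();
             ctxt.graph2->copyout();
         }

         int main() {
             Matrix A, B, C, D, E;                            // shapes omitted in this sketch
             dgemm_dgemm(A, B, C, D, E);
             std::puts("composed dgemm pipeline returned");
         }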

  7. Key points
     Compiler:
     - Generates a context and an interface for a dataflow
     - Connects expert-written graphs into a pipeline
     - Minimizes communication with step tuners (static scheduling) and item-collection tuners (static data distribution); see the tuner sketch below
     Domain-expert written graphs:
     - High-level algorithms for library functions
     - Input/output collections can come from outside
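
     A sketch of the kind of static placement such tuners could encode. The tuner classes below are hypothetical; only the hook names (computed_on for step collections, consumed_on for item collections) follow the slides, and the block-row ownership policy is an assumption, not the authors' policy.

         #include <cstdio>

         constexpr int NUM_PROCS  = 100;    // illustrative cluster size
         constexpr int ROW_BLOCKS = 1000;   // illustrative blocking of the matrices

         // Step tuner: which process executes the multiply step for row block `row`.
         struct row_step_tuner {
             int computed_on(int row) const { return row % NUM_PROCS; }   // static scheduling
         };

         // Item tuner: which process will consume the C item for row block `row`.
         struct row_item_tuner {
             int consumed_on(int row) const { return row % NUM_PROCS; }   // static distribution
         };

         int main() {
             row_step_tuner steps;
             row_item_tuner items;
             // Because both tuners map the same row block to the same process, the C block
             // produced by graph 1 is already where graph 2's step will run, so no message
             // is needed between the two dgemm graphs.
             const int row = 42;
             std::printf("block %d of %d: computed on %d, consumed on %d\n",
                         row, ROW_BLOCKS, steps.computed_on(row), items.consumed_on(row));
         }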

  8. Advantages
     Useful for any language:
     - Builds on mature compiler/interpreter techniques: dataflow analysis, pattern matching, code replacement (sketched below)
     Extends a scripting language to distributed computing implicitly:
     - Transparent to users
     - Transparent to the language
     - Transparent to libraries
     The heavy lifting is done by CnC and by the domain experts who write the graphs.
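
     An illustrative sketch of the pattern-matching and code-replacement step mentioned above: scan the call sequence for a dgemm whose output feeds the next dgemm and fuse the pair into one composed call. The Call record and fuse_dgemm_pairs() are hypothetical; a real compiler or interpreter would do this on its own IR after dataflow analysis.

         #include <cstddef>
         #include <iostream>
         #include <string>
         #include <vector>

         struct Call { std::string fn; std::vector<std::string> args; };  // inputs..., output last

         std::vector<Call> fuse_dgemm_pairs(const std::vector<Call> &calls) {
             std::vector<Call> out;
             for (std::size_t i = 0; i < calls.size(); ++i) {
                 if (i + 1 < calls.size() && calls[i].fn == "dgemm" && calls[i + 1].fn == "dgemm" &&
                     calls[i].args.back() == calls[i + 1].args[0]) {      // C produced, then consumed
                     out.push_back({"dgemm_dgemm",
                                    {calls[i].args[0], calls[i].args[1], calls[i].args[2],
                                     calls[i + 1].args[1], calls[i + 1].args[2]}});
                     ++i;                                                 // skip the fused second call
                 } else {
                     out.push_back(calls[i]);
                 }
             }
             return out;
         }

         int main() {
             std::vector<Call> prog = {{"dgemm", {"A", "B", "C"}}, {"dgemm", {"C", "D", "E"}}};
             for (const Call &c : fuse_dgemm_pairs(prog)) {               // prints dgemm_dgemm(A, B, C, D, E)
                 std::cout << c.fn << "(";
                 for (std::size_t j = 0; j < c.args.size(); ++j)
                     std::cout << (j ? ", " : "") << c.args[j];
                 std::cout << ")\n";
             }
         }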

  9. Open questions
     Minimizing communication:
     - Item collections: consumed_on
     - Step collections: computed_on
     Scalability
     Applications:
     - There might not be many long sequences of library calls
