Transparently Composing CnC Graph Pipelines on a Cluster. Hongbo Rong, Frank Schlimbach. Programming & Systems Lab (PSL), Software Systems Group (SSG). 7th Annual Concurrent Collections Workshop, 9/8/2015.


SLIDE 1

Transparently Composing CnC Graph Pipelines on a Cluster

Hongbo Rong, Frank Schlimbach

Programming & Systems Lab (PSL)

Software Systems Group (SSG)

7th Annual Concurrent Collections Workshop, 9/8/2015

SLIDE 6

Problem

A productivity program running on a cluster

  • The programmer is a domain expert, but not a tuning expert
  • Call distributed libraries
  • Library functions are not composable
      – Black box
      – Independent
      – Context-unaware
      – Barrier at the end

How to compose these non-composable library functions automatically?

SLIDE 10

Flow Graphs

Traditional: Bulk-synchronous Parallel (figure from https://en.wikipedia.org/wiki/Bulk_synchronous_parallel)

Pipelined & asynchronous: (figure omitted; surviving label: Communication)
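To make the contrast between the two models concrete, here is a small sketch that compares completion times. It assumes, purely for illustration, that every process runs the same chain of stages and that a later stage on a process depends only on that process's own earlier output. The timings and the function names `bsp_makespan` and `pipelined_makespan` are invented for this example and do not come from the talk.

```python
def bsp_makespan(times):
    """times[p][s] = duration of stage s on process p.
    With a barrier after every stage, each stage costs as much as its slowest process."""
    n_stages = len(times[0])
    return sum(max(proc[s] for proc in times) for s in range(n_stages))


def pipelined_makespan(times):
    """Without barriers (and assuming only process-local dependencies),
    each process runs its stages back to back; the slowest process finishes last."""
    return max(sum(proc) for proc in times)


# Hypothetical per-process stage durations (arbitrary time units).
times = [
    [3, 1],  # process 0: slow in stage 0, fast in stage 1
    [1, 3],  # process 1: fast in stage 0, slow in stage 1
]

print("BSP:", bsp_makespan(times))              # 3 + 3 = 6
print("pipelined:", pipelined_makespan(times))  # max(4, 4) = 4
```

Under these assumptions the barrier forces every process to wait for the slowest one at each stage, while the pipelined schedule lets a fast process move on immediately.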

SLIDE 14

Basic Idea

User program
As usual: assume sequential, global shared-memory programming

    C = A + B
    E = C * D

Library
Add a CnC Graph for each library function

    (A, B) → [+] → C
    (C, D) → [*] → E

Compiler/interpreter
Compose the corresponding graphs of a sequence of library calls

    (A, B) → [+] → C;  (C, D) → [*] → E

Both graphs use the identical memory for C

Execution
Let CnC do the distribution:

    mpiexec -genv DIST_CNC=MPI -n 1000 ./julia user_script.jl
    mpiexec -genv DIST_CNC=MPI -n 1000 ./python user_script.py
    mpiexec -genv DIST_CNC=MPI -n 1000 ./matlab user_script.m
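The composition idea (one graph per library call, with both graphs sharing the same storage for C) can be sketched in a few lines of Python. This is a toy, single-process model and not the CnC API: `ItemCollection` and `make_elementwise_graph` are hypothetical names, and the "tiles" are plain integers keyed by a tag.

```python
class ItemCollection:
    """A toy stand-in for a CnC item collection: tag -> value, with subscribers."""

    def __init__(self, name):
        self.name = name
        self.items = {}
        self.subscribers = []  # callbacks fired on every put

    def put(self, tag, value):
        self.items[tag] = value
        for callback in self.subscribers:
            callback(tag)

    def get(self, tag):
        return self.items[tag]


def make_elementwise_graph(op, left, right, out, log):
    """Build a 'graph' computing out[tag] = op(left[tag], right[tag]).
    A step for a tag fires as soon as both inputs for that tag exist."""

    def try_step(tag):
        if tag in left.items and tag in right.items and tag not in out.items:
            log.append((out.name, tag))
            out.put(tag, op(left.get(tag), right.get(tag)))

    left.subscribers.append(try_step)
    right.subscribers.append(try_step)


A, B, C, D, E = (ItemCollection(n) for n in "ABCDE")
log = []

# One graph per library call; both graphs are handed the same object for C.
make_elementwise_graph(lambda x, y: x + y, A, B, C, log)  # C = A + B
make_elementwise_graph(lambda x, y: x * y, C, D, E, log)  # E = C * D

# D is fully available up front; A and B arrive tile by tile.
for tag, d in enumerate([10, 20, 30]):
    D.put(tag, d)
for tag, (a, b) in enumerate([(1, 2), (3, 4), (5, 6)]):
    A.put(tag, a)
    B.put(tag, b)

print(E.items)  # {0: 30, 1: 140, 2: 330}
print(log)      # steps for C and E interleave tile by tile
```

Because the second graph subscribes to the very collection the first graph writes, tile 0 of E is computed before tile 1 of C exists. No barrier or copy separates the two calls.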

SLIDE 22

Hello World

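User program:

    dgemm(A, B, C)  # C = A*B
    dgemm(C, D, E)  # E = C*D

[Figure: Graph 1 and Graph 2 side by side, replicated on Process 1 … Process 100. On Process 1, Graph 1 runs Multiply on A1* and B to produce C1*, and Graph 2 runs Multiply on C1* and D to produce E1*. Process 100 does the same with A100*, C100*, and E100*. The blocks E1* … E100* form E.]

Leverage the library

No barrier/copy/msg between graphs/steps unless required. E.g. no bcast/gather of C.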
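The "no bcast/gather of C" point can be checked with a small single-machine simulation. The sketch below assumes that A is split into contiguous row blocks (one per process) and that B and D are available in full on every process; the slide's figure suggests this layout, but the transcript does not confirm it. The helper names `matmul` and `row_blocks` are made up for the example.

```python
def matmul(X, Y):
    """Plain matrix product on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def row_blocks(M, n_procs):
    """Split the rows of M into n_procs contiguous blocks (assumes an even split)."""
    size = len(M) // n_procs
    return [M[p * size:(p + 1) * size] for p in range(n_procs)]


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0], [0, 1]]
D = [[2, 1], [1, 2]]
n_procs = 2

E_blocks = []
for A_block in row_blocks(A, n_procs):  # one iteration per simulated process
    C_block = matmul(A_block, B)        # Graph 1: this process's rows of C
    E_block = matmul(C_block, D)        # Graph 2: uses only the local rows of C
    E_blocks.append(E_block)

# The only "gather" is for the final result E; C is never assembled globally.
E = [row for block in E_blocks for row in block]
print(E)
```

Each simulated process needs only its own rows of C to produce its rows of E, which is why the intermediate matrix never has to be broadcast or gathered.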

SLIDE 29

Code skeleton

User code

    dgemm(A, B, C)
    dgemm(C, D, E)

Compiler

User code

    initialize_CnC()
    dgemm_dgemm(A, B, C, D, E)
    finalize_CnC()

Context

    struct dgemm_dgemm_context {
        item_collection *C_collection;
        tuner row_tuner, col_tuner;
        Graph *graph1, *graph2;

        dgemm_dgemm_context(A, B, C, D, E) {
            // create C_collection
            graph1 = make_dgemm_graph(A, B, C);
            graph2 = make_dgemm_graph(C, D, E);
        }
    };

Interface

    void dgemm_dgemm(A, B, C, D, E) {
        dgemm_dgemm_context ctxt(A, B, C, D, E);
        ctxt.graph1->start();
        ctxt.graph2->start();
        ctxt.wait();
        ctxt.graph2->copyout();
    }

Domain expert written

    class dgemm_graph {
        tuner *tunerA, *tunerB, *tunerC, *tunerS;
        item_collection *A_collection, *B_collection, *C_collection;
        tag_collection tags;
        step_collection *multiply_steps;

        dgemm_graph(_A, _B, _C) {
            // create A/B/C_collection based on A/B/C
            // define dataflow graph
        }
    };

Figure labels: Host language; C
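As a rough analogue of the generated context and interface, the following Python sketch starts both graphs, waits for them, and returns the result. It uses threads and a queue in place of CnC graphs and item collections, so it illustrates only the control flow (start, start, wait, copy out), not the real runtime. The class and method names are hypothetical, and unlike the skeleton the function returns E rather than taking C and E as arguments.

```python
import queue
import threading


class DgemmDgemmContext:
    """Toy analogue of dgemm_dgemm_context: owns the shared C stream and both graphs."""

    def __init__(self, A, B, D):
        self.C_collection = queue.Queue()  # shared by graph1 (producer) and graph2 (consumer)
        self.E = [None] * len(A)
        self.graph1 = threading.Thread(target=self._multiply_rows, args=(A, B))
        self.graph2 = threading.Thread(target=self._consume_rows, args=(D, len(A)))

    @staticmethod
    def _row_times_matrix(row, M):
        return [sum(r * m for r, m in zip(row, col)) for col in zip(*M)]

    def _multiply_rows(self, A, B):
        # Graph 1: emit each row of C = A*B as soon as it is computed.
        for i, row in enumerate(A):
            self.C_collection.put((i, self._row_times_matrix(row, B)))

    def _consume_rows(self, D, n_rows):
        # Graph 2: turn each row of C into a row of E = C*D as it arrives.
        for _ in range(n_rows):
            i, c_row = self.C_collection.get()  # blocks until a row of C is ready
            self.E[i] = self._row_times_matrix(c_row, D)

    def wait(self):
        self.graph1.join()
        self.graph2.join()


def dgemm_dgemm(A, B, D):
    ctxt = DgemmDgemmContext(A, B, D)
    ctxt.graph1.start()
    ctxt.graph2.start()  # started before graph1 finishes: no barrier in between
    ctxt.wait()
    return ctxt.E        # stands in for graph2->copyout()


E = dgemm_dgemm([[1, 2], [3, 4]], [[0, 1], [1, 0]], [[1, 1], [0, 1]])
print(E)  # [[2, 3], [4, 7]]
```

The point of the design is that the second graph is started before the first one has finished, and the only synchronization is the availability of individual rows of C.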

SLIDE 31

Key points

Compiler

  • Generates a context and an interface for a dataflow
  • Connects expert-written graphs into a pipeline
  • Minimizes communication with step tuners (static scheduling) and item collection tuners (static data distribution)

Domain-expert written graphs

  • High-level algorithms for library functions
  • Input/output collections can be from outside

SLIDE 32

Advantages

Useful for any language

  • Mature work in compiler/interpreter
      – Dataflow analysis, pattern matching, code replacement

Extends a scripting language to distributed computing implicitly

  • Transparent to users
  • Transparent to the language
  • Transparent to libraries

Heavy lifting done in CnC and graph writing by domain experts
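The compiler-side work named on this slide (dataflow analysis, pattern matching, code replacement) can be illustrated with Python's standard `ast` module. The sketch below is an assumption about what such a pass might look like for the two-call example, not the authors' implementation. It handles only adjacent top-level `dgemm(X, Y, Z)` calls with plain names as arguments, and `ast.unparse` requires Python 3.9 or later. A real pass would also need a proper dataflow analysis, for example to confirm that nothing else reads or writes the intermediate matrix between the two calls.

```python
import ast


def _as_dgemm_call(stmt):
    """Return the three argument names if stmt is 'dgemm(a, b, c)', else None."""
    if not (isinstance(stmt, ast.Expr) and isinstance(stmt.value, ast.Call)):
        return None
    call = stmt.value
    if not (isinstance(call.func, ast.Name) and call.func.id == "dgemm"):
        return None
    if call.keywords or len(call.args) != 3:
        return None
    if not all(isinstance(a, ast.Name) for a in call.args):
        return None
    return [a.id for a in call.args]


def fuse_dgemm_pairs(source):
    """Rewrite 'dgemm(X, Y, Z); dgemm(Z, U, V)' into 'dgemm_dgemm(X, Y, Z, U, V)'."""
    tree = ast.parse(source)
    new_body, i = [], 0
    while i < len(tree.body):
        first = _as_dgemm_call(tree.body[i])
        second = _as_dgemm_call(tree.body[i + 1]) if i + 1 < len(tree.body) else None
        # Dataflow check: the output of the first call is the first input of the second.
        if first and second and first[2] == second[0]:
            fused = f"dgemm_dgemm({', '.join(first + second[1:])})"
            new_body.append(ast.parse(fused).body[0])
            i += 2
        else:
            new_body.append(tree.body[i])
            i += 1
    tree.body = new_body
    return ast.unparse(tree)


print(fuse_dgemm_pairs("dgemm(A, B, C)\ndgemm(C, D, E)"))
# dgemm_dgemm(A, B, C, D, E)
```

Calls that do not share an intermediate matrix are left untouched, so the rewrite stays invisible to the user script unless a pipeline is actually available.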

SLIDE 33

Open questions

Minimize communication

  • Item collections: consumed_on
  • Step collections: computed_on
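One way to see why placement matters for communication is to count how many items are produced on one rank and consumed on another. The sketch below borrows the names `computed_on` and `consumed_on` from the slide, but here they are ordinary Python callables mapping a row index to a rank, not the CnC tuner interface. The block and cyclic distributions are illustrative choices.

```python
def row_owner(row, rows_per_rank):
    """Block distribution: contiguous rows map to the same rank."""
    return row // rows_per_rank


def count_messages(n_rows, computed_on, consumed_on):
    """An item needs a message whenever it is produced on one rank and consumed on another.
    Here row i of C is produced by step i of graph 1 and consumed by step i of graph 2."""
    return sum(1 for i in range(n_rows) if computed_on(i) != consumed_on(i))


n_rows, n_ranks = 8, 4
rows_per_rank = n_rows // n_ranks

aligned = count_messages(
    n_rows,
    computed_on=lambda i: row_owner(i, rows_per_rank),
    consumed_on=lambda i: row_owner(i, rows_per_rank),
)
misaligned = count_messages(
    n_rows,
    computed_on=lambda i: row_owner(i, rows_per_rank),  # block distribution
    consumed_on=lambda i: i % n_ranks,                  # cyclic distribution
)
print(aligned, misaligned)  # 0 6
```

When both graphs agree on where a row lives, the intermediate data never crosses ranks; a mismatch between the two mappings turns most rows into messages.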

Scalability

Applications

  • There might not be many long sequences of library calls