  1. Synthesizing MPI Implementations from Functional Data-Parallel Programs
     Inferring data distributions using types
     Tristan Aubrey-Jones, Bernd Fischer

  2. People need to program distributed-memory architectures:
     • Server farms / compute clusters: Big Data/HPC; MapReduce/HPF; Ethernet/Infiniband
     • GPUs: graphics/GPGPU; CUDA/OpenCL; memory levels: thread-local, block-shared, device-global
     • Future many-core architectures?

  3. Our Aim: to automatically generate distributed-memory cluster implementations (C++/MPI).
     [Figure: single-threaded → parallel shared-memory → distributed memory; many distributions possible.]
     We want this not just for arrays, but for other collection types as well as disk-backed collections.

  4. Approach
     • We define Flocc, a data-parallel DSL, where parallelism is expressed via skeletons, or combinator functions (HOFs).
     • We use distributed data layout (DDL) types to carry data-distribution information, and type inference to drive code generation.
     • We develop a code generator that searches for well-performing implementations and generates C++/MPI implementations.

  5. Approach
     [Pipeline diagram: an input program of abstract combinators (eqJoin, map, groupRed) feeds a plan search that enumerates concrete combinator implementations (eqJoin1..eqJoin4, map1, map2, groupRed1, ...), each with a DDL type (e.g., eqJoin1 :: (T1,T2) -> T3, map2 :: T3 -> T4, groupRed1 :: T4 -> T5); plan synthesis inserts redistributions (mirror, repartition, ...) where needed; the back-end generates MPI & C++ code for each plan, and performance feedback steers the search.]

  6. What’s new?
     Last time:
     • Functional DSL with data-parallel combinators
     • Data distributions for distributed-memory implementations using dependent types
     • Distributions for maps, arrays, and lists
     • Synthesis of data distributions using a type inference algorithm
     Since then:
     • MPI/C++ code generation
     • Performance-feedback-based data distribution search
     • Automatic redistribution insertion
     • Local data layouts
     • Arrays with ghosting
     • Type inference with E-unification
     • (Thesis submitted)

  7. RE-CAP: What we presented last time.

  8. A Distributed Map type
     The DMap type extends the basic Map k v type to symbolically describe how the Map should be distributed on the cluster (dependent types!):
     • Key and value types t1 and t2
     • Partition function f : (t1,t2) → N, which takes key-value pairs to node coordinates
     • Partition dimension identifier d1, which specifies which nodes the coordinates map onto
     • Mirror dimension identifier d2, which specifies which nodes to mirror the partitions over
     Also works for arrays (DArr) and lists (DList).
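
     To make the shape of the type concrete, here is a minimal Haskell sketch of a DMap-like value (the names NodeCoord, Dim, and the field names are illustrative, not the authors' definitions; the real DDL types are dependent, which plain Haskell only approximates by carrying the partition function as a value):

       type NodeCoord = Int
       data Dim = D1 | D2 deriving (Show, Eq)

       data DMap k v = DMap
         { partFn   :: (k, v) -> NodeCoord  -- partition function f
         , partDim  :: Dim                  -- partition dimension identifier d1
         , mirrDim  :: Maybe Dim            -- mirror dimension identifier d2
         , localKVs :: [(k, v)]             -- this node's key-value pairs
         }

       -- Example: a matrix A partitioned by row index along D1, mirrored along D2.
       exampleA :: DMap (Int, Int) Float
       exampleA = DMap { partFn   = \((ai, _aj), _v) -> ai
                       , partDim  = D1
                       , mirrDim  = Just D2
                       , localKVs = [] }

     In the real system the partition function appears in the type itself (via the Π binder on slide 9), so the type checker can compare the functions themselves, not just their types.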

  9. Distribution types for group reduces
     Π binds a concrete term in the AST:
     • creates a reference to the argument’s AST when instantiated
     • must match the concrete reference when unifying (rigid)
     • used to specify that a collection is distributed using a specific key-projection function

  10. Distribution types for group reduces
     • Input must be partitioned using the groupReduce’s key-projection function f
     • Result keys are then already co-located, so each node performs a local group reduce: no inter-node communication is necessary (but this constrains the input distribution)
     [Figure: per-node In → local group reduce → Out, for Node1, Node2, Node3.]
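
     A hedged sketch of a purely local group reduce, assuming the argument order used in the matrix-multiply example on slide 12 (key projection, value projection, reduction operator, input; the curried operator is my simplification):

       import qualified Data.Map.Strict as M

       -- If the input is partitioned by the same key projection 'pk',
       -- every group lives entirely on one node, so this runs with no
       -- inter-node communication.
       groupReduceLocal :: Ord k2
                        => ((k1, v1) -> k2)   -- key projection f
                        -> ((k1, v1) -> v2)   -- value projection
                        -> (v2 -> v2 -> v2)   -- associative reduction operator
                        -> [(k1, v1)]         -- local partition of the input
                        -> M.Map k2 v2
       groupReduceLocal pk pv rop =
         foldr (\kv -> M.insertWith rop (pk kv) (pv kv)) M.empty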

  11. Distribution types for joins
     Left and right inputs partitioned and mirrored on orthogonal dimensions, so each node holds one block of A and one block of B and performs a local equi-join:
     • No inter-node communication
     • Output can be partitioned by any f
     • Can be more efficient than mirroring the whole input
     [Figure: a 2-by-2 node grid (d1 by d2); e.g., node (0,0) holds A 0 and B 0, node (0,1) holds A 0 and B 1, node (1,0) holds A 1 and B 0, node (1,1) holds A 1 and B 1.]
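
     A hedged sketch of the local part of the equi-join, assuming the argument order from the example on slide 12 (two key projections, then the two inputs; the quadratic nested-loop body is purely illustrative):

       -- With A partitioned along d1 and mirrored along d2, and B
       -- partitioned along d2 and mirrored along d1, each node already
       -- holds every (a, b) pair it must test, so no communication is
       -- needed to produce its share of the join.
       eqJoinLocal :: Eq k
                   => ((ka, va) -> k)      -- key projection for the left input
                   -> ((kb, vb) -> k)      -- key projection for the right input
                   -> [(ka, va)]           -- local block of A
                   -> [(kb, vb)]           -- local block of B
                   -> [((ka, kb), (va, vb))]
       eqJoinLocal fa fb as bs =
         [ ((ka, kb), (va, vb))
         | (ka, va) <- as
         , (kb, vb) <- bs
         , fa (ka, va) == fb (kb, vb) ]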

  12. Deriving distribution plans (hidden from user)
     • Different distributed implementations of each combinator
     • Enumerate different choices of combinator implementations
     • Each combinator implementation has a DDL type
     • Use type inference to check if a set of choices is sound and to infer data distributions
     • Back-end templates for each combinator implementation

       let R = eqJoin (\((ai,aj),_) -> aj,
                       \((bi,bj),_) -> bi, A, B) in      [eqJoin1 / eqJoin2 / eqJoin3]
       groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj),
                    \(_,(av,bv)) -> mul (av,bv), add, R) [groupReduce1 / groupReduce2]

  13. Distributed matrix multiplication - #1

       let R = eqJoin (\((ai,aj),_) -> aj,
                       \((bi,bj),_) -> bi, A, B) in
       groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj),
                    \(_,(av,bv)) -> mul (av,bv), add, R)

       A : DMap (Int,Int) Float (\((ai,aj),_) → ai) d1 (d2,m)
           -- partitioned by row along d1, mirrored along d2
       B : DMap (Int,Int) Float (\((bi,bj),_) → bj) d2 (d1,m)
           -- partitioned by column along d2, mirrored along d1
       R : DMap ((Int,Int),(Int,Int)) (Float,Float) (\(((ai,aj),(bi,bj)),_) → (ai,bj)) (d1,d2) m
       C : DMap (Int,Int) Float fst (d1,d2) m
           -- partitioned by (row, column) along (d1, d2)

     Must partition and mirror A and B at the beginning of the computation.

  15. Distributed matrix multiplication - #1
     Using eqJoin3 and groupReduce2:
     A partitioned by row along d1, mirrored along d2; B partitioned by column along d2, mirrored along d1.
     [Figure: on a 3-by-3 grid (9 nodes), node (i,j) holds row block A i and column block B j and computes R (i,j) = C (i,j).]
     C: partitioned by (row, column) along (d1, d2).
     A common solution, but A and B must be partitioned and mirrored at the beginning of the computation.

  16. Distributed matrix multiplication - #2
     A: partitioned by column along d
     B: partitioned by row along d (aligned with A)
     C: partitioned by (row, column) along d
     Must exchange R during groupReduce1.

  17. RECENT WORK: What’s new since last time.

  18. Since last time…
     • Local data layouts
     • Distributed arrays with ghosting
     • Automatic redistribution insertion
     • E-unification in type inference
     • MPI/C++ code generation
     • Performance-feedback-based data distribution search
     • Proof of concept (~25k Haskell LOC)

  19. Local data layouts
     • Extra DDL type parameters for local layouts:
       – Choice of data structure (or storage mode): sorted std::vector, hash-map, tree-map, value-stream, etc.
       – Order of elements, or key indexed by: a local layout function, similar to the partition function (see the sketch below)
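
     A hedged sketch of how a storage-mode parameter might be modelled (the constructor and field names are illustrative, not Flocc's; they mirror the storage modes listed above):

       data StorageMode
         = SortedVector   -- sorted std::vector: fast scans and merge joins
         | HashMap        -- hash-map: O(1) point lookups
         | TreeMap        -- tree-map: ordered lookups and range queries
         | ValueStream    -- value-stream: sequential, non-materialized access
         deriving (Show, Eq)

       data LocalLayout k v s = LocalLayout
         { mode     :: StorageMode
         , layoutFn :: (k, v) -> s  -- like the partition function, but
         }                          -- chooses a local order/index key
                                    -- rather than a node coordinate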

  20. Extended array distributions
     • Supports:
       – index offsets, axis-reversal (all HPF alignments)
       – cyclic, blocked, and block-cyclic distributions (all HPF distributions)
       – ghosted regions/fringes
     • Index transformer functions in DDL types take and return tuples of integer array indices, parameterized by:
       – block sizes
       – index directions
       – index offsets
       – ghosted fringe sizes (left and right)
     • Distribution for Jacobi 1D: DArr … bs dir id id (+1) (+1) -> DArr … bs dir id id id
     • Relies on E-unification (later…); a block-cyclic index-mapping sketch follows.
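
     For concreteness, a minimal sketch of the standard HPF-style block-cyclic index mapping (textbook arithmetic, not Flocc's index transformers):

       -- Map a global index to (node, local index), given block size bs
       -- and node count p. bs = 1 gives a cyclic distribution; a bs of
       -- at least ceil(n/p) degenerates to a blocked one.
       blockCyclic :: Int -> Int -> Int -> (Int, Int)
       blockCyclic bs p i = (node, localIx)
         where
           block   = i `div` bs          -- which block the index falls in
           node    = block `mod` p       -- blocks are dealt out cyclically
           localIx = (block `div` p) * bs + i `mod` bs  -- position on that node

       -- e.g. blockCyclic 2 3 7 == (0, 3): index 7 is in block 3, which
       -- lands on node 0 as that node's second block.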

  21. Automatic redistribution insertion
     • Data redistribution and re-layout functions are type casts.
     • For invalid Flocc plans (i.e., plans that don’t type check), insert just enough redistributions (or re-layouts) to make them type check.
     • This means a valid plan can be synthesized for any choice of combinator implementations.
     • It finds implementations that benefit from redistributing data, so that more efficient combinator implementations can be used.
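
     A hedged sketch of redistribution as a cast, reusing the illustrative DMap encoding from the sketch after slide 8 (in the real system the cast changes the DDL type during inference; here it merely rewrites the distribution witness):

       -- Illustrative only: in Flocc the old and new DDL types differ,
       -- and inserting this cast is what turns an ill-typed plan into a
       -- well-typed one. An MPI implementation would also move each pair
       -- to the node f' maps it to; this sketch only records the new
       -- distribution.
       repartition :: ((k, v) -> NodeCoord)  -- new partition function
                   -> Dim                    -- new partition dimension
                   -> DMap k v -> DMap k v
       repartition f' d' m = m { partFn = f', partDim = d', mirrDim = Nothing }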

  22. E-unification
     • Adds equational theories for projection, permutation, and index transformation functions to DDL type inference
     • Uses the E prover and “question” conjectures to return values for existentially quantified variables
     • Allows improved array distributions and more flexible DDL types to be supported
     • (Not integrated with the current prototype)
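
     An illustrative example of the kind of query involved (my example, not from the slides): given the projection axioms fst(pair(x,y)) = x and snd(pair(x,y)) = y, the question conjecture ∃F. ∀x,y. F(pair(x,y)) = y has the witness F = snd, which the prover can return as an answer; likewise, to undo an index offset, ∃G. ∀i. G(i+1) = i has the witness G = (λj. j−1). Slides 23 to 25 apply this to projection, indexing, and permutation functions respectively.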

  23. E-unification: projection functions

  24. E-unification: indexing functions

  25. E-unification: permutation functions
