SLIDE 1

Synthesizing MPI Implementations from Functional Data-Parallel Programs

Inferring data distributions using types

Tristan Aubrey-Jones, Bernd Fischer

SLIDE 2

People need to program distributed memory architectures:

  • GPUs: graphics/GPGPU; CUDA/OpenCL; memory levels: thread-local, block-shared, device-global
  • Server farms/compute clusters: Big Data/HPC; MapReduce/HPF; Ethernet/InfiniBand
  • Future many-core architectures?

SLIDE 3

Our Aim

To automatically generate distributed-memory cluster implementations (C++/MPI).

We want this not just for arrays, but for other collection types too, including disk-backed collections.

Many distributions are possible: single-threaded, parallel shared-memory, distributed-memory.

SLIDE 4

Approach

  • We define Flocc, a data-parallel DSL, where parallelism is expressed via skeletons or combinator functions (HOFs); a minimal reference sketch of two such combinators follows below.
  • We use distributed data layout (DDL) types to carry data distribution information, and type inference to drive code generation.
  • We develop a code generator that searches for well-performing implementations and generates C++/MPI code.
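To make the combinator style concrete, here is a minimal sequential reference sketch of eqJoin and groupReduce in plain Haskell. The names and signatures are assumptions for illustration; the real Flocc combinators carry DDL types and distributed back-end implementations.

```haskell
import qualified Data.Map as M

-- Sequential reference semantics of two Flocc-style combinators
-- (signatures assumed for illustration only).

-- eqJoin: join two key-value collections on a projected join key.
eqJoin :: Eq c
       => ((k1, v1) -> c)           -- left join-key projection
       -> ((k2, v2) -> c)           -- right join-key projection
       -> [(k1, v1)] -> [(k2, v2)]
       -> [((k1, k2), (v1, v2))]
eqJoin pa pb as bs =
  [ ((ka, kb), (va, vb))
  | a@(ka, va) <- as, b@(kb, vb) <- bs, pa a == pb b ]

-- groupReduce: project a grouping key and a value from every pair,
-- then reduce all values that share the same key.
groupReduce :: Ord k'
            => ((k, v) -> k')        -- grouping-key projection
            -> ((k, v) -> w)         -- value projection
            -> (w -> w -> w)         -- associative, commutative reduction
            -> [(k, v)] -> M.Map k' w
groupReduce key val red xs =
  M.fromListWith red [ (key x, val x) | x <- xs ]
```

The matrix-multiplication example later in the deck is written with exactly these two combinators.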

SLIDE 5

Approach

[Pipeline figure: an input program written with abstract combinators (eqJoin, map, groupRed) enters the plan search. Plan synthesis enumerates candidate plans built from distributed-memory combinator implementations with DDL types (e.g., eqJoin1 / map2 / groupRed1, eqJoin2 / map1 / groupRed1, eqJoin3 / map1 with redist1, eqJoin4 with mirror/repartition), inserting redistributions where needed. The back-end generates MPI & C++ code for each candidate, and performance feedback from running the generated code guides the search.]

SLIDE 6

What’s new?

Last time:

  • Functional DSL with data-parallel combinators
  • Data distributions for distributed-memory implementations using dependent types
  • Distributions for maps, arrays, and lists
  • Synthesis of data distributions using a type inference algorithm

Since then:

  • MPI/C++ code generation
  • Performance-feedback-based data distribution search
  • Automatic redistribution insertion
  • Local data layouts
  • Arrays with ghosting
  • Type inference with E-unification
  • (Thesis submitted)
SLIDE 7

RECAP

What we presented last time.

SLIDE 8

A Distributed Map type

The DMap type extends the basic Map k v type to symbolically describe how the Map should be distributed across the cluster:

  • Key and value types t1 and t2
  • Partition function f: (t1,t2) → N
    – takes key-value pairs to node coordinates
  • Partition dimension identifier d1
    – specifies which nodes the coordinates map onto
  • Mirror dimension identifier d2
    – specifies which nodes to mirror the partitions over

The same scheme also works for arrays (DArr) and lists (DList); these are dependent types.
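As a rough value-level illustration (not the actual Flocc definitions, where this information is carried at the type level as a dependent type), a DMap could be modelled in Haskell as a record pairing the local data with its distribution metadata; all names below are assumptions for exposition.

```haskell
import qualified Data.Map as M

type NodeCoord = [Int]   -- a coordinate in the (possibly multi-dimensional) node grid
type DimId     = String  -- a node-grid dimension identifier such as "d1" or "d2"

-- Value-level model of DMap k v f d1 d2: the local partition plus the
-- distribution metadata that the DDL type describes symbolically.
data DMap k v = DMap
  { localData :: M.Map k v              -- this node's share of the data
  , partFn    :: (k, v) -> NodeCoord    -- f: maps key-value pairs to node coordinates
  , partDim   :: DimId                  -- d1: dimension the coordinates index into
  , mirrorDim :: Maybe DimId            -- d2: dimension the partitions are mirrored over
  }
```

DArr and DList would carry analogous metadata for arrays and lists.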

SLIDE 9

Distribution types for group reduces

  • Π binds a concrete term in the AST
  • creates a reference to the argument’s AST when instantiated
  • must match the concrete reference when unifying (rigid)
  • used to specify that a collection is distributed using a specific key projection function

SLIDE 10

Distribution types for group reduces

  • Input must be partitioned using the groupReduce’s key projection function f
  • Result keys are already co-located

[Figure: local group reduce; Node1, Node2, and Node3 each reduce their local "In" partition to their local "Out" partition with no data movement.]

No inter-node communication necessary (but constrains input distribution)
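A toy sequential Haskell sketch of why this works, under an assumed hash-style node assignment: when the input is partitioned by the group-reduce's own key projection f, every key's pairs land on a single node, so the per-node local reduces already form the final result.

```haskell
import qualified Data.Map as M

-- Toy model (all names assumed): 3 nodes, data partitioned by the same key
-- projection f that the group-reduce uses, so no inter-node merge is needed.
numNodes :: Int
numNodes = 3

partitionBy :: ((k, v) -> Int) -> [(k, v)] -> [[(k, v)]]
partitionBy f xs = [ [ x | x <- xs, f x `mod` numNodes == n ] | n <- [0 .. numNodes - 1] ]

localGroupReduce :: Ord k' => ((k, v) -> k') -> ((k, v) -> w) -> (w -> w -> w)
                 -> [(k, v)] -> M.Map k' w
localGroupReduce key val red xs = M.fromListWith red [ (key x, val x) | x <- xs ]

main :: IO ()
main = do
  let xs     = [ ((i, j), i * j) | i <- [1 .. 4 :: Int], j <- [1 .. 4 :: Int] ]
      f      = fst . fst                            -- group by row index
      parts  = partitionBy f xs                     -- partitioned using f itself
      locals = map (localGroupReduce f snd (+)) parts
      merged = M.unions locals                      -- key sets are disjoint: plain union suffices
      global = localGroupReduce f snd (+) xs        -- single-node reference result
  print (merged == global)                          -- True
```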

SLIDE 11

Distribution types for joins

Left and right inputs are partitioned and mirrored on orthogonal dimensions (a toy demonstration follows below):

  • No inter-node communication
  • Output can be partitioned by any f
  • Can be more efficient than mirroring a whole input

[Figure: a 2-by-2 node grid over dimensions d1 and d2; node (i,j) holds input partitions Ai and Bj, performs a local equi-join, and produces its partition of the output.]
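The following toy Haskell sketch (all names and the 2-by-2 grid are assumptions) checks this claim sequentially: with the left input partitioned along d1 and mirrored along d2, and the right input partitioned along d2 and mirrored along d1, every left/right pair of elements meets on exactly one node, so the union of the local joins equals the global join.

```haskell
import Data.List (sort)

-- Toy 2-by-2 node grid (dimension d1 has p coordinates, d2 has q).
p, q :: Int
p = 2
q = 2

-- Left input: partitioned along d1, mirrored along d2.
-- Right input: partitioned along d2, mirrored along d1.
placeA, placeB :: Int -> [(Int, Int)]
placeA h = [ (h `mod` p, j) | j <- [0 .. q - 1] ]
placeB h = [ (i, h `mod` q) | i <- [0 .. p - 1] ]

localJoin :: Eq c => ((k1,v1) -> c) -> ((k2,v2) -> c)
          -> [(k1,v1)] -> [(k2,v2)] -> [((k1,k2),(v1,v2))]
localJoin pa pb as bs =
  [ ((ka,kb),(va,vb)) | a@(ka,va) <- as, b@(kb,vb) <- bs, pa a == pb b ]

main :: IO ()
main = do
  let as = [ ((i,k), fromIntegral (i+k)) | i <- [0..2::Int], k <- [0..2::Int] ] :: [((Int,Int),Double)]
      bs = [ ((k,j), fromIntegral (k*j)) | k <- [0..2::Int], j <- [0..2::Int] ] :: [((Int,Int),Double)]
      ja = \((_,k),_) -> k          -- join A on its column index
      jb = \((k,_),_) -> k          -- join B on its row index
      partA = \((i,_),_) -> i       -- A partitioned by row along d1
      partB = \((_,j),_) -> j       -- B partitioned by column along d2
      onNode n = ( [ a | a <- as, n `elem` placeA (partA a) ]
                 , [ b | b <- bs, n `elem` placeB (partB b) ] )
      locals = concat [ uncurry (localJoin ja jb) (onNode (i,j))
                      | i <- [0 .. p-1], j <- [0 .. q-1] ]
      global = localJoin ja jb as bs
  print (sort locals == sort global)   -- True: each match is found on exactly one node
```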

SLIDE 12

Deriving distribution plans

let R = eqJoin (\((ai,aj),_) -> aj, \((bi,bj),_) -> bi, A, B) in
groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj), \(_,(av,bv)) -> mul (av,bv), add, R)

eqJoin1 / eqJoin2 / eqJoin3    groupReduce1 / groupReduce2

  • Different distributed implementations of each combinator
  • Enumerate different choices of combinator implementations
  • Each combinator implementation has a DDL type
  • Use type inference to check whether a set of choices is sound and to infer data distributions
  • Backend templates for each combinator implementation

(All of this is hidden from the user.)

SLIDE 13

Distributed matrix multiplication - #1

A : DMap (Int,Int) Float (\((ai,aj),_) → ai) d1 (d2,m)
B : DMap (Int,Int) Float (\((bi,bj),_) → bj) d2 (d1,m)
R : DMap ((Int,Int),(Int,Int)) (Float,Float) (\(((ai,aj),(bi,bj)),_) → (ai,bj)) (d1,d2) m
C : DMap (Int,Int) Float fst (d1,d2) m

A: partitioned by row along d1, mirrored along d2
B: partitioned by column along d2, mirrored along d1
C: partitioned by (row, column) along (d1, d2)

A and B must be partitioned and mirrored at the beginning of the computation. (Same program as on the previous slide; a sequential sketch of what it computes follows below.)
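For reference, a self-contained sequential Haskell rendering of this program (using the toy combinator semantics sketched earlier, with assumed signatures) shows what it computes: C = A × B for matrices stored as key-value pairs from (row, column) indices to values.

```haskell
import qualified Data.Map as M

eqJoin :: Eq c => ((k1,v1) -> c) -> ((k2,v2) -> c)
       -> [(k1,v1)] -> [(k2,v2)] -> [((k1,k2),(v1,v2))]
eqJoin pa pb as bs =
  [ ((ka,kb),(va,vb)) | a@(ka,va) <- as, b@(kb,vb) <- bs, pa a == pb b ]

groupReduce :: Ord k' => ((k,v) -> k') -> ((k,v) -> w) -> (w -> w -> w)
            -> [(k,v)] -> M.Map k' w
groupReduce key val red xs = M.fromListWith red [ (key x, val x) | x <- xs ]

-- Sequential version of the Flocc program above: join A's column index against
-- B's row index, multiply the matched element pairs, and sum per (row, column).
matMul :: [((Int,Int),Float)] -> [((Int,Int),Float)] -> M.Map (Int,Int) Float
matMul a b =
  let r = eqJoin (\((_ai,aj),_) -> aj) (\((bi,_bj),_) -> bi) a b
  in  groupReduce (\(((ai,_),(_,bj)),_) -> (ai,bj))
                  (\(_,(av,bv)) -> av * bv)
                  (+) r

main :: IO ()
main = do
  let a = [ ((i,k), fromIntegral (i + k)) | i <- [0..1], k <- [0..2] ]
      b = [ ((k,j), fromIntegral (k * j)) | k <- [0..2], j <- [0..1] ]
  mapM_ print (M.toList (matMul a b))
```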


SLIDE 15

Distributed matrix multiplication - #1

A common solution: eqJoin3 + groupReduce2.

A: partitioned by row along d1, mirrored along d2
B: partitioned by column along d2, mirrored along d1
C: partitioned by (row, column) along (d1, d2)

A and B must be partitioned and mirrored at the beginning of the computation.

[Figure: a 3-by-3 node grid (9 nodes, dimensions d1 and d2); node (i,j) holds row block Ai and column block Bj, computes R(i,j) locally, and group-reduces it into the output block C(i,j).]

SLIDE 16

Distributed matrix multiplication - #2

A: partitioned by column along d
B: partitioned by row along d (aligned with A)
C: partitioned by (row, column) along d

R must be exchanged between nodes during groupReduce1.

SLIDE 17

RECENT WORK

What’s new since last time.

SLIDE 18

Since last time…

  • Local data layouts
  • Distributed arrays with ghosting
  • Automatic redistribution insertion
  • E-unification in type inference
  • MPI/C++ code generation
  • Performance-feedback-based data distribution search
  • Proof-of-concept implementation (~25k lines of Haskell)
SLIDE 19

Local data layouts

  • Extra DDL type parameters for local layouts (modelled in the sketch below)
    – Choice of data structure (or storage mode)
       Sorted std::vector, hash-map, tree-map, value-stream, etc.
    – Order of elements, or key indexed by
       Local layout function, similar to the partition function
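A small illustrative sketch of what these extra parameters might carry, written as runnable Haskell with assumed names; in Flocc they are DDL type parameters rather than run-time values.

```haskell
-- Hypothetical local-layout description attached to a distributed collection.
data StorageMode
  = SortedVector   -- sorted std::vector
  | HashMap        -- hash-map
  | TreeMap        -- tree-map
  | ValueStream    -- streamed values, no materialised container
  deriving (Show, Eq)

data LocalLayout k v = LocalLayout
  { storageMode :: StorageMode      -- which local data structure to generate
  , layoutKey   :: (k, v) -> Int    -- ordering/index key, analogous to the partition function
  }
```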

SLIDE 20

Extended array distributions

  • Supports
    – index offsets, axis reversal (all HPF alignments)
    – cyclic, blocked, and block-cyclic distributions (all HPF distributions; see the sketch below)
    – ghosted regions/fringes
  • Index transformer functions in DDL types
    – take and return tuples of integer array indices
       block sizes  index directions  index offsets  ghosted fringe sizes (left and right)
  • Distribution for Jacobi 1D
    – DArr … bs dir id id (+1) (+1) -> DArr … bs dir id id id
  • Relies on E-unification (later…)
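As a concrete illustration of one of these layouts, the following Haskell function computes the standard HPF-style block-cyclic mapping from a global array index to an owning node and a local offset. This is a sketch of the textbook formula, not Flocc's generated code; ghost fringes would additionally replicate a few boundary elements from neighbouring nodes.

```haskell
-- Block-cyclic layout: blocks of size bs are dealt out cyclically to p nodes.
-- Returns (owning node, local index on that node) for global index i.
-- e.g. blockCyclicOwner 2 3 7 == (0, 3)
blockCyclicOwner :: Int   -- bs: block size
                 -> Int   -- p: number of nodes along this dimension
                 -> Int   -- i: global index
                 -> (Int, Int)
blockCyclicOwner bs p i =
  ( (i `div` bs) `mod` p                     -- which node owns block (i `div` bs)
  , (i `div` (bs * p)) * bs + i `mod` bs )   -- position within that node's local storage

-- bs = 1 gives a cyclic distribution; a block size of ceiling(n/p) gives a plain blocked one.
```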
SLIDE 21

Automatic redistribution insertion

  • Data redistribution and re-layout functions act as type casts.
  • For invalid Flocc plans (i.e., plans that do not type check), insert just enough redistributions (or re-layouts) to make them type check.
  • This means a valid plan can be synthesized for any choice of combinator implementations.
  • It also finds implementations that benefit from redistributing data so that more efficient combinator implementations can be used.

SLIDE 22

E-unification

  • Adds equational theories for projection, permutation, and index transformation functions to DDL type inference
  • Uses the E prover and “question” conjectures to return values for existentially quantified variables
  • Allows improved array distributions and more flexible DDL types to be supported
  • (Not yet integrated with the current prototype)
SLIDE 23

E-unification: projection functions

SLIDE 24

E-unification: indexing functions

SLIDE 25

E-unification: permutation functions

SLIDE 26

Code generation

  • Generates C++ and MPI from plans (i.e., Flocc programs with concrete combinator implementations and inferred DDL types)
  • Transforms plans to a DFG (data-flow graph) and uses expression templates to generate code
  • Currently supports map- and list-based combinator templates
  • Uses a “stream” local storage mode to splice multiple consumers into a producer’s loop body

SLIDE 27

Code generation

Performance is comparable with hand-coded versions:

  • PLINQ comparisons were run on a quad-core (Intel Xeon W3520 / 2.67 GHz) x64 desktop with 12 GB RAM.
  • C++/MPI comparisons were run on Iridis 3 & 4: a 3rd-generation cluster with ~1000 Westmere compute nodes, each with two 6-core CPUs and 22 GB RAM, over an InfiniBand network. Speedups are relative to sequential execution, averaged over runs on 1, 2, 3, 4, 8, 9, 16, and 32 nodes.

SLIDE 28

Performance-feedback-based search

  • Tried different search algorithms to explore candidate implementations
  • For each candidate, redistributions are automatically inserted to make it type check
  • Each candidate is evaluated by generating code, compiling it, and running it on some test data
  • Generates C++ using MPI
SLIDE 29

Performance-feedback-based search

  • Tried 946 different combinations of search heuristics applied to 4 map-based example programs
  • Heuristics composed of
    – Search algorithms
       Genetic searches (e.g., with/without crossover)
       Depth-first exhaustive
       Greedy
    – Termination conditions
    – Runtime pruning
  • Found
    – Genetic searches successfully reduce search time
    – Fixed-budget termination works best
    – Fixed-budget runtime pruning works best
    – Need to enumerate different redistribution-insertion variants

SLIDE 30

The Pros and Cons…

Benefits:

  • Multiple distributed collections: maps, arrays, lists...
  • Generates distributed algorithms fully automatically
  • Performance feedback is more accurate/flexible than cost metrics
  • Finds algorithms that include redistributions
  • Synthesizes local layouts
  • Can support in-memory and disk-backed collections (e.g., for Big Data)

Limitations:

  • The current implementation mainly has list and map combinator backend templates
  • The current implementation’s redistribution-insertion algorithm is slow

SLIDE 31

Future work

  • Extend the prototype implementation:
    – Array combinator backend templates
    – Faster redistribution insertion
  • Integrate equational theories with the implementation.
  • Support more distributed-memory architectures (GPUs).
  • Retrofit into an existing functional language.
  • Similar type inference for imperative languages?
  • (Pass PhD viva)
SLIDE 32

QUESTIONS?