A Multi-Paradigm Approach to High-Performance Scientific Programming - PowerPoint PPT Presentation



SLIDE 1

A Multi-Paradigm Approach to High-Performance Scientific Programming

Pritish Jetley Parallel Programming Laboratory

SLIDE 2

What will the language of tomorrow look like?

  • Language support for modularity
  • Abstraction → Productivity
  • Runtime assistance
  • No sacrifice of performance

SLIDE 3

The future is now...

  • Charm++
  • PGAS (UPC, CAF, X10, Chapel, Fortress, ...)
  • MPI/PGAS hybrids

SLIDE 4

...or is it?

  • How abstract can languages be?
  • Can we reconcile program & language semantics?
  • Can we express algorithms naturally?

SLIDE 5

Our premise

  • Productivity comes from abstractions
  • Specialization of abstractions also yields better parallel performance
    – e.g. relaxed semantics in Global Arrays

SLIDE 6

Our approach

  • Plurality
  • Specialization
  • Interoperability
SLIDE 7

Our agenda

  • A complete set of incomplete, interoperable languages
  • Abstract, specialized languages
  • Completeness through interoperation

SLIDE 8

This talk

  • Productive message-driven programming (Charj)
  • Static data flow (Charisma)
  • Generative recursion (Divcon)
  • Tree-based algorithms (Distree)
  • Disciplined sharing of global data (MSA)

SLIDE 9

Productive message-driven programming with Charj

SLIDE 10

Charj

  • Charm++/Java = Charj
  • Keep the good bits of Charm++:
    – Overdecomposition onto migratable objects
    – Message-driven execution
    – Asynchrony
    – Intelligent runtime system (load balancing, message combination, etc.)
  • But use a source-to-source compiler to address its drawbacks

SLIDE 11

Compiler intervention for productivity

  • Automatically determines parallel interfaces

    // foo.ci
    entry void bar();
    // foo.h
    void bar();
    // foo.cpp
    void Foo::bar() { ... }

    // foo.cj
    void bar();

SLIDE 12

Compiler intervention for productivity

  • Automatically generates per-entry (de)serialization code

    class Particle {
      Vec3 position, accel, vel;
      Real mass, charge;
    }

    class Compute {
      void pairwise(Array<Particle> first, Array<Particle> second) {
        // only uses Particle position, charge
      }
    }

SLIDE 13

Compiler intervention for productivity

  • Semantic checking and type safety

    w.foo(); // “plain”: asynchronous
    x.foo(); // local: preempts
    y.foo(); // sync: blocks
    z.foo(); // array: multiple invocations

    // foo.ci
    readonly int n;
    // foo.cpp
    int n;
    ...
    n = 17; // bug (?)

SLIDE 14

Compiler intervention for productivity

  • Simple optimizations such as live variable analysis
    – Minimize checkpoint footprint
    – Find pertinent data to be offloaded to the GPU

SLIDE 15

Charm++ workflow

SLIDE 16

Charj workflow

SLIDE 17

Static data flow programming with Charisma

SLIDE 18

Expressive scope of Charisma

  • Structured grid methods
  • Wavefront computations
  • Dense linear algebra
  • Permutation
  • Multigrid (MG)
SLIDE 19

Charisma

  • Salient features
    – Object-oriented
    – Programmer decomposes work
    – Global view of data and control
    – Publish-consume model for data dependencies
    – Separation of parallel structure & serial code
    – Compiled into a message-driven Charm++ specification

SLIDE 20

A Charisma program

  • Orchestrates the interactions of collections of objects
SLIDE 21

Indexed collections of objects

  • Objects encapsulate data and work
    – Explicit specification of grain size and locality
    – Allows for adaptive overlap of communication/computation
    – Load balancing, checkpointing, etc.
  • Unit of work is a method invocation
SLIDE 22

Objects communicate by publishing and consuming values

SLIDE 23

Communication between objects

  • Method invocations publish and consume values
  • Publish-consume pattern → data dependencies
  • Parsed by the compiler to generate code

    (p) obj1.foo();   // publishes p
    obj2.bar(p);      // consumes p
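In shared memory, the publish-consume dependency above can be sketched with `std::promise`/`std::future`. This is only an illustrative analogue, not the Charisma runtime, which compiles the pattern into Charm++ message sends; the function name is hypothetical.

```cpp
#include <future>

// Illustrative shared-memory analogue of Charisma's publish-consume
// pattern: the "publish" side sets a value through a promise, and the
// "consume" side blocks on the matching future until it arrives.
int publish_consume() {
    std::promise<int> p;
    std::future<int> f = p.get_future();
    p.set_value(42);    // (p) obj1.foo();  -- publish
    return f.get();     // obj2.bar(p);     -- consume
}
```

The compiler's job in Charisma is to discover exactly this producer-to-consumer edge from the orchestration code and generate the message-driven plumbing.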
SLIDE 24

Parallelism across objects is specified via the foreach construct

SLIDE 25

Object parallelism

  • Invoke foo() on all objects in collection A

    ispace S = {0 : N-1 : 1};
    foreach (x, y in S * S) {
      A[x,y].foo();
    }

  • The ispace construct gives an index space
SLIDE 26

Section communication

  • Dense linear algebra (e.g. LU)

(figure: blocked LU steps – factorize, tri-solve, update – over matrix regions I, II, III)

SLIDE 27

LU in Charisma

    for (K = 0; K < N/g; K++) {
      ispace Trailing = {K+1 : N/g-1};
      // factorize diagonal block, and mcast
      (d) A[K,K].factorize();
      // update active panels, and mcast
      foreach (j in Trailing) {
        (c[j]) A[K,j].utri(d);   // row
        (r[j]) A[j,K].ltri(d);   // column
      }
      // trailing matrix update
      foreach (i, j in Trailing * Trailing) {
        A[i,j].update(r[i], c[j]);
      }
    }
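As a sequential reference point for the blocked Charisma LU above, here is a minimal unpivoted LU sketch (illustrative only, not the generated code): with one-element blocks, the factorize / tri-solve / update phases collapse into a single triple loop.

```cpp
#include <vector>

// In-place LU factorization without pivoting (Doolittle form).
// After the call, A holds U on and above the diagonal and the
// multipliers of L (unit diagonal implied) below it.
void lu_inplace(std::vector<std::vector<double>>& A) {
    const std::size_t n = A.size();
    for (std::size_t k = 0; k < n; k++) {           // "factorize" step at A[k][k]
        for (std::size_t i = k + 1; i < n; i++) {
            A[i][k] /= A[k][k];                     // "tri-solve" on the panel
            for (std::size_t j = k + 1; j < n; j++)
                A[i][j] -= A[i][k] * A[k][j];       // trailing-matrix update
        }
    }
}
```

The Charisma version distributes exactly these three phases over a 2-D collection of blocks, with multicasts carrying d, r[j], and c[j] between them.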

SLIDE 28

Others too...

  • Blelloch (work-efficient) scan
  • Multigrid (MG)
  • Pipelining (Gauss-Seidel)
  • Scatter-gather, reductions, multicasts (OpenAtom)
  • Other dense linear algebra (Gaussian elimination, forward/backward substitution, etc.)
  • Molecular dynamics (MD)
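For concreteness, here is a sequential sketch of the Blelloch work-efficient exclusive scan named above (an assumption of this note: input length is a power of two). The two passes are what Charisma would express as dependent foreach phases.

```cpp
#include <vector>

// Blelloch work-efficient exclusive prefix scan, in place.
// Assumes a.size() is a power of two.
void blelloch_scan(std::vector<int>& a) {
    const std::size_t n = a.size();
    // Up-sweep (reduce): build partial sums in place.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    // Down-sweep: turn the partial sums into an exclusive scan.
    a[n - 1] = 0;
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
}
```

Each inner loop is fully parallel, which is why the algorithm maps naturally onto a foreach over an index space.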
SLIDE 29

Expressing generative recursion with Divcon

SLIDE 30

Generative Recursion

  • Elegant
  • Intuitive
  • Implicit, tree-structured parallelism

SLIDE 31

Examples

  • Sorting, Closest pair
  • Convex hull, Delaunay triangulation
  • Adaptive quadrature, etc.
SLIDE 32

Recursive structure:

    f(A) = g(f(A1), f(A2), ..., f(An))

expressed as a let-expression over partitions pi of A:

    let A1 = f(p1(A)),
        A2 = f(p2(A)),
        ...
        An = f(pn(A))
    in g(A1, A2, ..., An);

SLIDE 33

Data movement from A to the sub-problems A1, A2, ..., An:

  • memcpy in shared-memory systems
  • Network communication in distributed memory!
SLIDE 34

Quicksort

    Array<int> qsort(Array<int> A) {
      if (A.length() <= THRESH)
        return seq_sort(A);
      Array<int> LT, EQ, GT;
      int pivot = A[rand(0, A.length())];
      (LT, EQ, GT) = {partition(A, pivot)};
      return concat(qsort(LT), EQ, qsort(GT));
    }
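A sequential C++ analogue of the Divcon quicksort makes the generative structure explicit (a sketch; the threshold value and the deterministic middle pivot are assumptions, not the Divcon defaults):

```cpp
#include <algorithm>
#include <vector>

// Three-way quicksort mirroring the Divcon version: partition into
// LT / EQ / GT around a pivot, recurse on LT and GT, then concatenate.
std::vector<int> qsort3(std::vector<int> a) {
    const std::size_t kThresh = 8;          // grain-size threshold (assumed)
    if (a.size() <= kThresh) {
        std::sort(a.begin(), a.end());      // seq_sort at the leaves
        return a;
    }
    int pivot = a[a.size() / 2];
    std::vector<int> lt, eq, gt;
    for (int v : a) {
        if (v < pivot)       lt.push_back(v);
        else if (v == pivot) eq.push_back(v);
        else                 gt.push_back(v);
    }
    std::vector<int> out = qsort3(std::move(lt));
    out.insert(out.end(), eq.begin(), eq.end());
    std::vector<int> hi = qsort3(std::move(gt));
    out.insert(out.end(), hi.begin(), hi.end());
    return out;
}
```

In Divcon the two recursive calls run as independent tasks; the partition step is where the redistribution cost discussed on the next slides arises.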

SLIDE 35

Significant redistribution costs

(figure: redistribution of array elements around a pivot)

SLIDE 36

Parallel execution

(figure: recursion tree – each level redistributes data into LT/GT sub-arrays until sequential sort is reached at the leaves)

SLIDE 37

Delayed data redistribution

  • Amortize redistribution costs over several recursive invocations
  • Reduces communication
  • But lowers concurrency

SLIDE 38

Best of both worlds

(flowchart: on receiving partition data, adaptive grain-size control decides whether to delay the shuffle – if a shuffle is feasible, shuffle the partition data; otherwise continue serial computation on the leaves)

SLIDE 39

Allows consolidation

  • Redistribution delay → several (new) arrays distributed across the same section of containers
  • If operation-issuing tasks are kept on the same PE, issued operations may be consolidated
  • Consolidated operations are applied together on the target arrays

SLIDE 40

Allows consolidation

(figure: quicksort performance on 256 BG/P cores)

SLIDE 41

A framework for expressing tree-based algorithms

SLIDE 42

Tree-based algorithms

  • Structural (as opposed to generative) recursion
  • N-body codes, granular dynamics, SPH, ...
  • Distributed tree + recursive traversal procedure

SLIDE 43

Data decomposition

  • Spatial entities
  • Compact spatial partitioning of data over chares
SLIDE 44

Distributed tree

  • Global, distributed tree
  • Spatial entities
  • Compact spatial partitions

SLIDE 45

“Chunked” distribution of data

(figure: global tree distributed in chunks across TreePieces)

SLIDE 46

Algorithm comprises concurrent traversals on pieces

  • Visitor + Iterator pattern
  • Visitor defines
    – node()
    – localLeaf()
    – remoteLeaf()
  • Iterate over nodes using a traversal
    – Order decided by the traversal
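The Visitor + Iterator split can be sketched as follows (illustrative names, not the Distree API): the traversal owns the visiting order, while the visitor decides per node whether to descend and what to do at leaves.

```cpp
// Minimal sketch of a traversal parameterized by a visitor.
struct Node {
    int value;
    Node* left = nullptr;
    Node* right = nullptr;
};

struct SumVisitor {
    int sum = 0;
    // Return true to "open" the node, i.e. descend into its children.
    bool node(const Node*) { return true; }
    void leaf(const Node* n) { sum += n->value; }
};

template <typename Visitor>
void traverse(Node* n, Visitor& v) {   // top-down, depth-first order
    if (!n) return;
    if (!n->left && !n->right) { v.leaf(n); return; }
    if (!v.node(n)) return;            // visitor pruned this subtree
    traverse(n->left, v);
    traverse(n->right, v);
}
```

In the framework the same shape holds over the distributed tree, with localLeaf()/remoteLeaf() distinguishing leaves that live on this PE from those fetched over the network.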

SLIDE 47

Traversal with reuse

(figure: PE1/PE2 – a traversal on a TreePiece requests a remote node; the software cache sends a request message to the owner TreePiece if the data is not present locally, the owner responds with a subtree, and a callback fires – immediately if the data is already local)

SLIDE 48

Barnes-Hut control flow

    for (int iteration = 0; iteration < parameters.nIterations; iteration++) {
      // decompose particles onto tree pieces
      decomposerProxy.decompose(universe, CkCallbackResumeThread());
      // build local trees & submit to framework
      treePieceProxy.build(CkCallbackResumeThread());
      // merge trees
      mdtHandle.syncToMerge(CkCallbackResumeThread());
      ...
    }

SLIDE 49

Barnes-Hut control flow

    for (int iteration = 0; iteration < parameters.nIterations; iteration++) {
      ...
      // initialize traversals
      topdown.synch(mdtHandle, CkCallbackResumeThread());
      bottomup.synch(mdtHandle, CkCallbackResumeThread());
      // start gravity and SPH computations
      treePieceProxy.gravity(CkCallback(CkReductionTarget(gravityDone), thisProxy));
      treePieceProxy.sph(CkCallback(CkReductionTarget(sphDone), thisProxy));
      // done with traversal
      topdown.done(CkCallbackResumeThread());
      bottomup.done(CkCallbackResumeThread());
      ...
    }

SLIDE 50

Barnes-Hut control flow

    for (int iteration = 0; iteration < parameters.nIterations; iteration++) {
      ...
      // integrate particle trajectories
      treePieceProxy.integrate(CkCallbackResumeThread((void *&)result));
      // delete distributed tree
      mdtHandle.syncToDelete(CkCallbackResumeThread());
    }

SLIDE 51

Visitor code

    class BarnesHutVisitor {
      bool node(const Node *n) {
        bool doOpen = open(leaf_, n);
        if (!doOpen) {
          gravity(n);
          return false;
        }
        return true;
      }
      ...
    };

SLIDE 52

Visitor code

    class BarnesHutVisitor {
      void localLeaf(Key sourceKey, const Particle *sources, int nSources) {
        gravity(sources, nSources);
      }
      void remoteLeaf(Key sourceKey, const RemoteParticle *sources, int nSources) {
        gravity(sources, nSources);
      }
    };
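The open() test used by node() above is not shown in the slides; a common Barnes-Hut opening criterion (an assumption here, not necessarily the one this code uses) compares a node's size-to-distance ratio against an accuracy parameter theta:

```cpp
#include <cmath>

struct Vec3 { double x, y, z; };

// Open (descend into) a tree node when it subtends too large an angle
// from the target leaf: size/distance > theta. Smaller theta = more
// accurate, more node openings.
bool open(const Vec3& leafCenter, const Vec3& nodeCenter,
          double nodeSize, double theta = 0.5) {
    double dx = nodeCenter.x - leafCenter.x;
    double dy = nodeCenter.y - leafCenter.y;
    double dz = nodeCenter.z - leafCenter.z;
    double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    return nodeSize / dist > theta;
}
```

When open() returns false, the node's multipole moments stand in for its particles and gravity(n) is applied directly, which is what makes the traversal O(N log N).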

SLIDE 53

Distributed Shared Array programming with MSA

slide-54
SLIDE 54

54

Multiphase Shared Arrays (MSA)

  • Disciplined shared address space abstraction
  • Dynamic modes of operation: Read-only, Write-exclusive, Accumulate

slide-55
SLIDE 55

55

MSA Model

(figure: mapping of the shared array across PEs 0–3)

slide-56
SLIDE 56

56

Parallel histogramming in MSA

  • Two MSAs, A and Bins
  • A in read mode, Bins in accumulate mode

    MSA1D<double> A;
    MSA1D<int> Bins;
    MSA1D::Read rd = A.getInitialRead();
    MSA1D::Accum acc = Bins.getInitialAccum();
    for (int x = myStart; x < myStart + myNumElts; x++) {
      acc(getBin(rd.get(x))) += 1;
    }
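The phase discipline above can be illustrated with a plain sequential histogram (a sketch with assumed names, not the MSA API): one read-only pass over the data, with the bins touched only through increments, which is exactly what accumulate mode licenses concurrently.

```cpp
#include <vector>

// Sequential analogue of the MSA histogram: read pass over `a`,
// accumulate pass into `bins`. In MSA, many workers run this loop
// concurrently and the runtime merges the += contributions.
std::vector<int> histogram(const std::vector<double>& a,
                           int nBins, double lo, double hi) {
    std::vector<int> bins(nBins, 0);
    for (double x : a) {
        int b = static_cast<int>((x - lo) / (hi - lo) * nBins);
        if (b == nBins) b = nBins - 1;   // clamp the upper edge into the last bin
        bins[b] += 1;                    // "accumulate" mode: increments only
    }
    return bins;
}
```

Because accumulation is commutative and associative, workers need no locks; the mode change (read → accumulate) is the only synchronization point.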

slide-57
SLIDE 57

57

Compiler and Runtime optimizations

  • Strip mining (Charj)
  • Bipartite graph-based optimal placement
  • Message combining
slide-58
SLIDE 58

58

Conclusion

  • Ecosystem of specialized languages
    – Productivity and performance
    – Higher-level constructs
  • Common runtime substrate for interoperability
    – Completeness of expression