SLIDE 1

A Quest for Unified, Global View Parallel Programming Models for Our Future

Kenjiro Taura

University of Tokyo

[Figure: a tree of tasks T0–T199 created by recursive task spawning, the running visual motif of the talk]


SLIDE 2

Acknowledgements

▶ Jun Nakashima (MassiveThreads)
▶ Shigeki Akiyama, Wataru Endo (MassiveThreads/DM)
▶ An Huynh (DAGViz)
▶ Shintaro Iwasaki (Vectorization)


SLIDE 4

What is task parallelism?

▶ like most CS terms, the definition is vague
▶ I don't consider the opposition "data parallelism vs. task parallelism" useful
  ▶ imagine lots of tasks, each working on a piece of data: is that data parallel or task parallel?
▶ let's instead ask:
  ▶ what is useful from the programmer's viewpoint
  ▶ what distinctions are useful from the implementer's viewpoint

SLIDES 5–9

What is task parallelism?

A system supports task parallelism when:

1. a logical unit of concurrency (that is, a task) can be created dynamically, at an arbitrary point of execution,
2. and cheaply;
3. and tasks are automatically mapped on hardware parallelism (cores, nodes, . . . );
4. and cheaply context-switched.

[Figure: tasks being created at arbitrary points during execution ("create task")]

SLIDE 10

What are they good for?

▶ generality: "creating tasks at arbitrary points" unifies many superficially different patterns
  ▶ parallel nested loops, parallel recursions
  ▶ they compose trivially
▶ programmability: cheap task creation + automatic load balancing allow a straightforward, processor-oblivious decomposition of the work (divide and conquer until trivial)
▶ performance: dynamic scheduling is a basis for hiding latencies and tolerating noise

SLIDE 11

Our goal

▶ programmers use tasks (+ higher-level syntax on top) as the unified means to express parallelism
▶ the system maps tasks to hardware parallelism
  ▶ cores within a node
  ▶ nodes
  ▶ SIMD lanes within a core!

SLIDE 12

Rest of the talk

1. Intra-node Task Parallelism
2. Task Parallelism in Distributed Memory
3. Need Good Performance Analysis Tools
4. Compiler Optimizations and Vectorization
5. Concluding Remarks


SLIDE 14

Agenda: Intra-node Task Parallelism

SLIDES 15–18

Taxonomy

▶ library or frontend: implemented with ordinary C/C++ compilers, or does it heavily rely on a tailored frontend?
▶ tasks suspendable or atomic: can tasks suspend/resume in the middle, or do tasks always run to completion?
▶ synchronization patterns arbitrary or pre-defined: can tasks synchronize in an arbitrary topology, or only in pre-defined synchronization patterns (e.g., bag-of-tasks, fork/join)?
▶ tasks untied or tied: can tasks migrate after they have started?

SLIDE 19

Instantiations

                  library/frontend  suspendable tasks  untied tasks  sync topology
  OpenMP tasks    frontend          yes                yes           fork/join
  TBB             library           yes                no            fork/join
  Cilk            frontend          yes                yes           fork/join
  Quark           library           no                 no            arbitrary
  Nanos++         library           yes                yes           arbitrary
  Qthreads        library           yes                yes           arbitrary
  Argobots        library           yes                yes?          arbitrary
  MassiveThreads  library           yes                yes           arbitrary

SLIDE 20

MassiveThreads

▶ https://github.com/massivethreads/massivethreads
▶ design philosophy: user-level threads (ULT) in an ordinary thread API as you know it
  ▶ tid = myth_create(f, arg)
  ▶ myth_join(tid)
  ▶ myth_yield to switch among threads (useful for latency hiding)
  ▶ mutex and condition variables to build arbitrary synchronization patterns
▶ efficient work stealing scheduler (locally LIFO and child-first; steal the oldest task first)
▶ an (experimental) customizable work stealing [Nakashima and Taura; ROSS 2013]
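To make the API concrete, here is a minimal sketch of a divide-and-conquer computation on this ULT interface (a parallel Fibonacci); the exact myth_create/myth_join signatures are assumptions extrapolated from the bullets above, not checked against the MassiveThreads headers:

  #include <myth/myth.h>

  typedef struct { long n, r; } arg_t;

  /* each call is a task; tasks are created at arbitrary points */
  void *fib(void *p) {
    arg_t *a = p;
    if (a->n < 2) { a->r = a->n; return 0; }
    arg_t x = { a->n - 1, 0 }, y = { a->n - 2, 0 };
    myth_thread_t t = myth_create(fib, &x);  /* cheap local creation */
    fib(&y);                                 /* run one branch ourselves */
    myth_join(t, 0);                         /* sync; the other branch may
                                                have been stolen meanwhile */
    a->r = x.r + y.r;
    return 0;
  }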

SLIDE 21

User-facing APIs on MassiveThreads

▶ TBB's task_group and parallel_for (but with an untied work stealing scheduler)
▶ Chapel tasks on top of MassiveThreads (currently broken orz)
▶ SML# (Ueno @ Tohoku University): ongoing
▶ Tapas (Fukuda @ RIKEN), a domain specific language for particle simulation

TBB interface on MassiveThreads:

  quicksort(a, p, q) {
    if (q - p < th) {
      ...
    } else {
      mtbb::task_group tg;
      r = partition(a, p, q);
      tg.run([=] { quicksort(a, p, r - 1); });
      quicksort(a, r, q);
      tg.wait();
    }
  }

SLIDE 22

Important performance metrics

▶ low local creation/sync overhead
▶ low local context switch cost
▶ reasonably low load balancing (migration) overhead
▶ somewhat sequential scheduling order

  parent() {
    π0:
    spawn { γ: ... };
    π1:
  }

Measured costs (cycles):

  local create      π0 → γ       ≈ 140
  work steal        π0 → π1      ≈ 900
  context switch    myth_yield   ≈ 80

(Haswell i7-4500U (1.80 GHz), GCC 4.9)

SLIDE 23

Comparison to other systems

[Figure: bar chart (clocks, axis up to ≈ 3000; one bar ≈ 7000) comparing Cilk, CilkPlus, MassiveThreads, OpenMP, Qthreads, and TBB on the spawn microbenchmark of the previous slide, with "child" and "parent" bars; the smallest costs are 72, 73, 138, and 167 clocks]

Summary:

▶ Cilk(Plus), known for its superb local creation performance, sacrifices work stealing performance
▶ TBB's local creation overhead is equally good, but it is "parent-first" and its tasks are tied to a worker once started

SLIDES 24–28

Further research agenda (1)

▶ task runtimes for ever larger scale systems are vital
▶ ⇒ "locality-/cache-/hierarchy-/topology-/whatever-aware" schedulers are obviously important
▶ ⇒ hence the proposals for hierarchical/customizable schedulers
▶ ⇒ yet, IMO, there are no clear demonstrations that clearly outperform simple greedy work stealing over many workloads
▶ the question, it seems, ultimately comes to this: when no tasks exist near you but some may exist far from you, steal one or not (stay idle)?

SLIDE 29

Further research agenda (2)

▶ quantify the gap between hand-optimized decomposition and automatic decomposition (by work stealing); e.g.
  ▶ space-filling decomposition vs. work stealing
  ▶ 2.5D matrix multiply vs. work stealing
▶ both experimentally and theoretically


SLIDE 31

Agenda: Task Parallelism in Distributed Memory

SLIDE 32

Two facets of task parallelism in distributed memory settings

▶ a means to hide latency, for which we merely need a local user-level thread library supporting suspend/resume at arbitrary points
▶ a means to globally balance loads, for which we need a system specifically designed to migrate tasks across address spaces

MassiveThreads/DM is a system supporting:

▶ distributed load balancing and latency hiding
▶ + a global address space supporting migration and replication

SLIDE 33

Tasks to hide latencies

The goal:

▶ individual tasks look like ordinary blocking accesses (programmer-friendly)
▶ hide latencies by creating lots of tasks

Ingredients for implementation:

▶ a local tasking layer with good context switch performance
▶ a message/RDMA layer with good multithreaded performance

  scan(global_array<T> a) {
    for (i = 0; i < n; i++) {
      .. = .. a[i] ..;
    }
  }

becomes:

  scan(global_array<T> a) {
    pfor (i = 0; i < n; i++) {
      .. = .. a[i] ..;
    }
  }

SLIDE 34

Preliminary results

▶ context switch: we used MassiveThreads's myth_yield function to switch context upon blocking
▶ message/RDMA: we rolled our own thread-safe communication layer (on MPI, on IB verbs, and on Fujitsu Tofu RMA), partly because Fujitsu MPI lacks multithreading support

  /* a[i] */
  T get(address<T> addr) {
    issue a non-blocking get(addr);
    while (!result available) {
      myth_yield();   /* run other tasks while waiting */
    }
    return result;
  }

[Figure: gets/node/sec (up to ≈ 1.4 × 10^6) as a function of the number of tasks (1–10), for 1–5 workers per node]

SLIDE 35

Taxonomy

▶ library or frontend
▶ tasks suspendable or atomic
▶ synchronization patterns arbitrary or pre-defined
▶ tasks untied or tied
▶ the main issue: implementation complexity rises on distributed memory, especially for untied tasks
  ▶ that is, how do we move tasks across address spaces?

SLIDE 36

Instantiations

                                           library/frontend  suspendable tasks  untied tasks  sync topology  scale
  Distributed Cilk [Blumofe et al. 96]     frontend          yes                yes           fork/join      16
  Satin [Nieuwpoort et al. 01]             frontend          yes                no            fork/join      256
  Tascell [Hiraishi et al. 09]             frontend          yes                yes           fork/join      128
  Scioto [Dinan et al. 09]                 library           no                 no            BoT            8192
  HotSLAW [Min et al. 11]                  library           yes                no            fork/join      256
  X10/GLB [Zhang et al. 13]                library           no                 no            BoT            16384
  Grappa [Nelson et al. 15]                library           yes                no            fork/join      4096
  MassiveThreads/DM [Akiyama et al. 15]    library           yes                yes           fork/join      4096

(BoT = bag-of-tasks)

SLIDE 37

MassiveThreads/DM

▶ a global (inter-node) work stealing library
▶ usable with ordinary C/C++ compilers
▶ supports fork/join with untied tasks
▶ ⇒ moves native threads across nodes

SLIDES 38–40

Migrating native threads

▶ problem: the stack of a native thread contains pointers pointing into itself
▶ migrating a thread to an arbitrary address breaks these pointers
▶ ⇒ upon migration, copy the stack to the same virtual address on the destination node (iso-address [Antoniu et al. 1999])

[Figure: a stack at address @a; naively copied to a different address @a', its internal pointers still reference @a; iso-address copies it to @a on the destination node, keeping them valid]
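The problem is easy to see in plain C: any local whose address is taken produces an intra-stack pointer. A tiny illustration (added here for clarity, not from the original slide):

  #include <stdio.h>

  void leaf(void) {
    int x = 42;
    int *p = &x;   /* p lives on the stack and points into the stack */
    /* copy this frame to a different base address and p still holds
       the old address of x; this is why iso-address keeps the same
       virtual address on the destination node */
    printf("%d\n", *p);
  }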

SLIDE 41

Iso-address limits scalability

▶ for each thread, all nodes must reserve its stack's address range
▶ ⇒ a huge waste of virtual address space

[Figure: every node's virtual address space reserving a slot for every thread in the system]

SLIDE 42

Is consuming a huge virtual memory really a problem?

▶ with high concurrency, it may indeed overflow the virtual address space:

  stack size × task depth × cores/node × nodes
    = 2^14 × 2^13 × 2^8 × 2^13 = 2^48

▶ more importantly, this lavish use of virtual memory prohibits using RDMA for work stealing (as RDMA memory must be pinned)
▶ ⇒ we proposed the UniAddress scheme [Akiyama et al. 2015]

SLIDE 43

Further research agenda

▶ demonstrate global distributed load balancing on practical workloads with lots of shared data
▶ "locality-/hierarchy-/. . . " awareness is even more important in this setting
▶ the latency-hiding opportunity adds an extra dimension: steal or not, switch or not


SLIDE 45

Agenda: Need Good Performance Analysis Tools

SLIDE 46

Analyzing task parallel programs

▶ task parallel systems are more "opaque" to users
  ▶ task management, load balancing, and scheduling all happen under the hood
▶ systems show performance differences, and researchers want to understand precisely where they come from

[Figure: the task tree T0–T199 from the title slide, mapped by the runtime system (create/wait task) onto physical resources]

SLIDE 47

DAG Recorder and DAGViz

▶ DAG Recorder runs a task parallel program and extracts its DAG, augmented with timestamps, CPUs, etc.
▶ DAGViz is its visualizer

  A() {
    for (i = 0; i < 2; i++) {
      mk_task_group;
      create_task(B());
      create_task(C());
      D();
      wait_tasks();
    }
  }
  D() {
    mk_task_group;
    create_task(E());
    F();
    wait_tasks();
  }

[Figure: the recorded DAG, with create_task, wait_tasks, begin_section, and end_task nodes and subtrees for B, C, and E]

SLIDE 48

Why record the DAG?

▶ the DAG is a logical representation of the program execution, independent from the runtime system
  ▶ you can compare the DAGs produced by two systems side by side
▶ the DAG contains sufficient information to reconstruct many details:
  ▶ work and critical path (excluding overhead)
  ▶ actual parallelism (running cores) along time
  ▶ available parallelism (ready tasks) along time
  ▶ how long each task was delayed by the scheduler
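(Standard background, not on the original slide: writing T1 for the work and T∞ for the critical path, the DAG immediately yields the average parallelism T1/T∞, and any greedy scheduler on p workers satisfies Tp ≤ T1/p + T∞; these two numbers alone already bound how well any runtime can do on the recorded execution.)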

SLIDE 49

DAGViz Demo

Seeing is believing.

SLIDES 50–58

Challenge: reducing the space requirement

▶ literally recording all subgraphs is prohibitive
▶ ⇒ collapse "uninteresting" subgraphs into single nodes
▶ current criteria: we collapse a subgraph ⇐⇒
  1. its nodes were executed by a single worker, and
  2. its span is smaller than a (configurable) threshold

[Figure: animation of the example DAG (create_task, wait_tasks, begin_section, end_task nodes; subtrees B, C, E) collapsing step by step until only the coarse structure remains]
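A minimal sketch of that criterion as a predicate over recorded subgraphs (all names here are hypothetical, not DAG Recorder's actual API):

  /* collapse a subgraph iff one worker ran all of it and its span
     (critical-path length within the subgraph) is below a threshold */
  int collapsible(const dag_node_t *g, double threshold) {
    return single_worker(g) && span(g) < threshold;
  }

Applied bottom-up, this keeps every node where parallelism or load balancing actually happened and folds away purely sequential runs.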

SLIDE 59

Ongoing work

▶ we hope to use this tool to automate the discovery of issues in runtime systems:
  ▶ scheduler delays along a critical path
  ▶ work time inflation
▶ shed light on the "steal or not" trade-offs


SLIDE 61

Agenda: Compiler Optimizations and Vectorization

SLIDE 62

Motivation

▶ task parallelism is a friend of divide-and-conquer algorithms
▶ divide-and-conquer makes coding "trivial," by dividing until the problem becomes trivial
  ▶ matrix multiply, matrix factorization, triangular solve, FFT, sorting, . . .
▶ in reality, the programmer has to optimize the leaves manually
▶ why? because we lack good compilers

SLIDE 63

The power of divide-and-conquer

  /* quick sort */
  quicksort(a, p, q) {
    if (q - p < 2) { return; } else { ... }
  }

  /* FFT */
  fft(n, x) {
    if (n == 1) { return x0; } else { ... }
  }

  /* C += AB */
  mm(A, B, C) {
    if (|A| == 1 && |B| == 1 && |C| == 1) { C00 += A00 · B00; } else { ... }
  }

  /* triangular solve LX = B */
  trsm(L, B) {
    if (M == 1) { B /= l11; } else { ... }
  }

  /* Cholesky factorization */
  chol(A) {
    if (n == 1) { return (√a11); } else { ... }
  }

They all admit a "trivial" base case, only if the performance is acceptable . . .

SLIDES 64–65

Static optimizations and vectorization of tasks

▶ goal: run straightforward task-based programs as fast as manually optimized programs
▶ write once, parallelize everywhere (nodes, cores, and vectors)

[Figure: a recursive task program compiled into serialized and vectorized leaf code]

SLIDE 66

What does our compiler do?

1. static cut-off: statically eliminates task creations
2. code-bloat-free inlining: inline-expands recursions
3. loopification: transforms recursions into flat loops (and then vectorizes them if possible)

SLIDE 67

Static cut-off

  f(a, b, · · · ) {
    if (E) {
      L(a, b, · · · )
    } else {
      · · ·
      spawn f(a1, b1, · · · );
      · · ·
      spawn f(a2, b2, · · · );
      · · ·
    }
  }

becomes:

  fseq(a, b, · · · ) {
    if (E) {
      L(a, b, · · · )
    } else {
      · · ·
      fseq(a1, b1, · · · );
      · · ·
      fseq(a2, b2, · · · );
      · · ·
    }
  }

key: determine a condition Hk under which the height of the recursion from the leaves is ≤ k:

▶ H0 = E
▶ Hk+1 = E or ∀i, (ai, bi, · · · ) satisfy Hk

when this succeeds, generate code that statically eliminates all task creations.
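A worked example of the Hk condition (an illustration added here, not from the original slide): for fib with base case E ≡ (n < 2), the recursive arguments are n − 1 and n − 2, so

  H0 = (n < 2)
  H1 = (n < 2) or (n − 1 < 2 and n − 2 < 2) = (n < 3)
  Hk = (n < k + 2)

so guarding the call site with, say, n < 10 lets the compiler dispatch to a clone with no task creations for the bottom eight levels of the recursion.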

SLIDE 68

Code-bloat-free inlining

▶ under condition Hk, inline-expanding all recursions k times would eliminate all function calls
▶ but this would result in exponential code bloat when the function has multiple recursive calls
▶ code-bloat-free inlining fuses multiple recursive calls into a single call site:

  · · ·
  f(a1, b1, · · · );
  · · ·
  f(a2, b2, · · · );
  · · ·

becomes:

  for (i = 0; i < 2; i++) {
    switch (i) {
      case 0: · · ·
      case 1: · · ·
    }
    f(ai, bi, · · · );
  }
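For concreteness, here is the transformation applied by hand to fib (a sketch under the assumption that the fusion works as outlined above); the two call sites become one, so k levels of inlining grow the code linearly in k rather than as 2^k:

  long fib(long n) {
    if (n < 2) return n;
    long r = 0;
    for (int i = 0; i < 2; i++) {
      long arg;
      switch (i) {
        case 0: arg = n - 1; break;   /* code specific to call site 0 */
        case 1: arg = n - 2; break;   /* code specific to call site 1 */
      }
      r += fib(arg);                  /* the single fused call site */
    }
    return r;
  }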

SLIDE 69

Loopification

  fseq(a, b, · · · ) {
    if (E) {
      L(a, b, · · · )
    } else {
      · · ·
      fseq(a1, b1, · · · );
      · · ·
      fseq(a2, b2, · · · );
      · · ·
    }
  }

becomes:

  for i ∈ P {
    L(xi, yi, · · · )
  }

▶ instead of code-bloat-free inlining, loopification attempts to generate a flat (or shallow) loop directly from the recursive code
▶ it tries to synthesize hypotheses that the original code is an affine loop of leaf blocks
▶ the loopified code may then be vectorized
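A concrete case where the hypothesis holds (an illustration added here, not from the original slide): a recursive vector add whose leaves, enumerated left to right, form exactly an affine loop the compiler can then vectorize:

  /* divide-and-conquer form: leaves touch a[p], a[p+1], ..., a[q-1] */
  void vecadd(float *a, const float *b, long p, long q) {
    if (q - p == 1) {
      a[p] += b[p];                 /* the leaf block L */
    } else {
      long r = (p + q) / 2;
      vecadd(a, b, p, r);
      vecadd(a, b, r, q);
    }
  }

  /* loopified (and now vectorizable) form */
  void vecadd_loop(float *a, const float *b, long p, long q) {
    for (long i = p; i < q; i++)
      a[i] += b[i];
  }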

SLIDE 70

Results: effect of optimizations

[Figure: relative performance over the base, for dynamic cut-off, static cut-off, code-bloat-free inlining, loopification, and the proposed combination, across fib, nqueens, fft, sort, nbody, strassen, vecadd, heat2d, heat3d, gaussian, matmul, trimul, treeadd, treesum, and uts; most speedups fall between 2 and 16, with the largest bars reaching 27.12, 17.56, 17.65, 220.14, and 109.72]

SLIDE 71

Results: remaining gap to hand-optimized code

[Figure: relative performance (task = 1) of OpenMP, hand-optimized OpenMP, and Polly versions on nbody, vecadd, heat2d, heat3d, gaussian, matmul, and trimul, with average and geometric mean; most bars fall between 0.5 and 3]


SLIDE 73

Agenda: Concluding Remarks

SLIDE 74

Future outlook of task parallelism

▶ the goal: offer both programmability and performance
▶ there is a long way to go toward acceptable performance on distributed memory machines. why?
  ▶ dynamic load balancing → random traffic
  ▶ global address space → fine-grain communication
▶ these are OK in shared memory today. why not on distributed memory (at least for now)?
  ▶ checking errors and completion everywhere
  ▶ doing mutual exclusion everywhere
  ▶ no analog of hardware prefetching
  ▶ or lack of bandwidth to tolerate random traffic and aggressive prefetching

Thank you for listening