

SLIDE 1

The old challenge: How to support users?

mirko.rahn@itwm.fraunhofer.de Dagstuhl, November 2017

SLIDE 2

CC-HPC@itwm.fraunhofer.de

  • What we do: Holistic optimization, dealing with many core (structured) SMP machines (Xeon, Phi), large machines (100–500 DSM nodes), and very large machines (1000–5000 DSM nodes).
  • Staff: 1/3 computer scientists, 1/3 mathematicians, 1/3 physicists, 1/3 else.
  • Hardware: Many cores (10⁴–10⁶ threads), multiple levels of memory (tape, cold disk, spinning disk, SSD, NVRAM, DRAM, HBM, cache levels 3, 2, 1, SIMD).
  • Costs: Computation is for free; data transfer dominates energy, latency, and throughput.
  • Software, high level: Asynchronous communication, task-based programming (abstraction, load balancing, time skewing), frameworks & DSLs.
  • Software, mid level: Hybrid process/thread, multi-level cache blocking, zero-indirection memory layouts, zero-copy dependency management.
  • Software, low level: SIMD intrinsics, coroutines (not threads), CAS & lock-free.

SLIDE 3

Parallel programming: Shining reality. 16 nodes → 1584 nodes ⇒ 1 h → 1 min.

Problem size 1000³ ⇒ each of the 512 · 3 · 28 = 43008 cores has about 28.5³ points. 8th order operator ⇒ ((28.5 − 16)/28.5)³ ≈ 8.5% inner points. ⇒ Latency is what matters.

SLIDE 4

Parallel programming: Reality.

Server:
  write /global/config
  broadcast "setup" to all clients
Client:
  on "setup":
    read /global/config

Some clients might fail: The distributed file system violates POSIX!?

Server:
  write /global/config
  fsync /global
  broadcast "setup" to all clients
Client:
  on "setup":
    read /global/config

Stalls the cluster for minutes! Still, some clients might fail because their local view of the metadata is not updated.

Server:
  write /global/config
  broadcast "setup" to all clients
Client:
  on "setup":
    while (! exists /global/config): sleep 1
    read /global/config

Typically works, so this is industrial production code.

SLIDE 5

Parallel programming: Distribute data. Beginner.

class EqualDistribution { size_t size_per_rank (); };

EqualDistribution distribution (size, nProc);
offset_t begin = iProc * distribution.size_per_rank (); // type error!?
transfer (begin, distribution.size_per_rank ());

BROKEN: 12 elements on 5 ranks ⇒ 3 elements per rank. ### ### ##? ##? ##? ⇒ There is no “size per rank”!

class ContiguousDistribution {
  offset_t begin (rank_t);
  size_t size (rank_t i) = begin (i+1) - begin (i); // size_t operator- (offset_t, offset_t)
};

Discipline! Programmer’s discipline! Teacher’s discipline!
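A minimal sketch of such a ContiguousDistribution, assuming the common `rank * total / nProc` choice of begin offsets (the slide does not fix a formula; member names here are illustrative):

```cpp
#include <cstddef>

// No "size per rank": only per-rank begin offsets exist; a rank's
// size is the difference of two begins, so the sizes always sum to
// exactly 'total' and no rank runs out of bounds.
struct ContiguousDistribution
{
  std::size_t total; // number of elements
  unsigned nProc;    // number of ranks

  // monotone, begin(0) == 0, begin(nProc) == total
  std::size_t begin (unsigned rank) const
  {
    return std::size_t (rank) * total / nProc;
  }
  std::size_t size (unsigned rank) const
  {
    return begin (rank + 1) - begin (rank);
  }
};
```

For 12 elements on 5 ranks this yields sizes 2, 2, 3, 2, 3, which sum to exactly 12 with no out-of-bounds tail.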

SLIDE 6

Parallel programming: Distributed termination detection. Junior.

Each process i ∈ {0, . . . , P − 1}: on init: t_i = r_i = 0; on send: t_i = t_i + 1; on recv: r_i = r_i + 1. Termination detection uses:

bool Comm::messages_in_flight ()
{
  long d = t - r; // note: signed!!
  long D;
  MPI_Allreduce (&d, &D, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);
  return D != 0;
}

while (c.messages_in_flight ()) ... // global operation vs. resource utilization

CORRECT! But does not scale! Also it mixes messages with control messages. (This is the state of the art in 2017.)
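Without an MPI installation the same check can be mimicked in plain C++; the loop below plays the role of the MPI_Allreduce over all processes (the `Counters` type and the `messages_in_flight` signature are illustrative):

```cpp
#include <vector>

struct Counters { long t; long r; }; // sent / received, per process

// Messages are in flight iff the global sum of t_i - r_i is nonzero.
// The signed type matters: a single process may have received more
// than it sent, so the local difference can be negative.
bool messages_in_flight (std::vector<Counters> const& procs)
{
  long D = 0; // plays the role of the MPI_Allreduce result
  for (Counters const& c : procs)
  {
    D += c.t - c.r;
  }
  return D != 0;
}
```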

SLIDE 7

Parallel programming: Distributed termination detection. Professional.

Solved, e. g. Friedemann Mattern: Algorithms for distributed termination detection, 1987. ATTENTION: Inconsistent cuts are possible! Termination detection at scale: Asynchronously compute ∑_{i=0}^{P−1} t_i and ∑_{i=0}^{P−1} r_i twice. Termination ⇔ all four values are equal. Library? Interface? Transformation? Language construct?
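A much-simplified sketch of the four-value check. This compresses the scheme to its core idea and simulates the two asynchronous waves as two snapshots; all names are illustrative, and Mattern's actual algorithms handle inconsistent cuts with more care.

```cpp
#include <utility>
#include <vector>

struct Counters { long t; long r; }; // sent / received, per process

// One "wave": sum all t_i and all r_i. In reality this happens
// asynchronously, so two waves may see different cuts.
std::pair<long, long> wave (std::vector<Counters> const& procs)
{
  long T = 0, R = 0;
  for (Counters const& c : procs) { T += c.t; R += c.r; }
  return {T, R};
}

// Termination <=> the four values from two waves are all equal:
// nothing is in flight AND nothing changed between the waves.
bool terminated (std::pair<long, long> first, std::pair<long, long> second)
{
  return first.first == first.second && first == second;
}
```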

SLIDE 8

Parallel programming: GASPI/GPI: Notifications. Professional.

Fine-grained synchronization: a remote notification attached to a single message.

  • write_notify (source, destination, notification)
  • waitsome (set of notifications)

Structured stencil with double buffering:

while (! done)
{
  write_notify_to_all_neighbours ();
  compute_inner_region ();
  while (! all neighbour data received)
  {
    process (neighbour = wait_some (unprocessed neighbours));
  }
}

Communication and computation happen at the same time. Requires a lot of programming discipline. Synthesis!

SLIDE 9

Parallel programming: GASPI/GPI: Notifications. Senior/Library writer.

Zero copy unstructured nearest neighbor:

while (tile = lock_unlocked_and_ready_tile ())
{
  process (tile);
  publish_progress (tile); // update ready flags of neighbors
  unlock (tile);
}

Task-based middleware:

while (! done)
{
  task = get_ready_task (); // blocking, maybe busy
  process (task);
  publish_progress (task); // might enable other tasks
}

Dynamic communication patterns. Debugging is often a nightmare. Interface design!

SLIDE 10

Egor’s tool: Almost ready for the tool chain.

Implements GPI on top of threads rather than processes.

WARNING: ThreadSanitizer: data race (pid=4141)
  Read of size 4 at 0x7f42f5ffc024 by thread T4 (mutexes: write M101):
    #0 dump main.c:9 (exe+0x00000006b9bd)
    #1 main main.c:41 (exe+0x00000006be3f)
    #2 operator() /devel/src/gpi/gpi_detail/GlobalState.cpp:50 (exe+0x0000000787e4)
    #3 execute_native_thread_routine /src/gcc-4.8.1/x86_64-unknown-linux-gnu/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:84
  Previous write of size 1 at 0x7f42f5ffc027 by thread T38:
    #0 memcpy /src/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors.cc:577 (exe+0x000000028090)
    #1 operator() /devel/src/gpi/gpi_detail/GlobalState.cpp:188 (exe+0x0000000781c2)
    #2 gpi_detail::Executor::threadMain() /devel/src/gpi/gpi_detail/Executor.cpp:23 (exe+0x00000007f869)
    #3 execute_native_thread_routine /src/gcc-4.8.1/x86_64-unknown-linux-gnu/libstdc++-v3/src/c++11/../../../.././libstdc++-v3/src/c++11/thread.cc:84

SLIDE 11

Parallel programming: Intended data race: Weak minimum. Senior.

Let M(t) = min_{i=0}^{P−1} f_i(t), where f_i is only known to process i and t is a point in time. To compute M(t) a barrier is required. ⇒ Not possible at scale. Note: The barrier latency is not the problem; the problem is the accumulation of imbalances.

Additional knowledge: All f_i are strictly increasing. ⇒ M(t) is strictly increasing.

Easier to compute: A strictly increasing, eventually consistent weak minimum W(t) ≤ M(t): Publish f_i(t) asynchronously. (Publish wave.) Reduce all values upon request. (Reduction wave.) No synchronization between the waves. ⇒ Data race. The race is okay as long as f_i(t) is read/written atomically. Latency stays the same, but work can be done asynchronously and therefore the imbalances are smeared out.

Detect the race, prove the algorithm is correct with the race!
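The two waves can be sketched with std::atomic slots. This is a shared-memory stand-in: in GPI the slots would live in remotely writable segments, and all names here are illustrative.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

// publish() is the publish wave of process i, reduce() is the
// reduction wave; they deliberately do not synchronize. The race is
// benign because each slot is a single atomic value: a reduction may
// see older values, so its result W is a lower bound of the true
// minimum M, and since every f_i only grows, W is monotone too.
struct WeakMinimum
{
  std::vector<std::atomic<long>> slot;

  explicit WeakMinimum (std::size_t nProc) : slot (nProc) {}

  void publish (std::size_t i, long fi)
  {
    slot[i].store (fi, std::memory_order_relaxed);
  }
  long reduce () const
  {
    long w = slot[0].load (std::memory_order_relaxed);
    for (auto const& s : slot)
    {
      w = std::min (w, s.load (std::memory_order_relaxed));
    }
    return w;
  }
};
```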

SLIDE 12

Parallel programming: Alltoall. Library writer.

@iProc:
  forall other ranks:
    async_write (data[other] -> other);
  wait_for_local_completion ();
  barrier ();
  work (received_data);

BROKEN: Local completion plus barrier ⇒ all data has been sent. Unknown whether or not data has been received! Works fine on InfiniBand (non-overtaking) but fails on Cray and TCP Ethernet.

SLIDE 13

Olaf’s tool: Not ready for the tool chain.

SLIDE 14

Parallel programming: Alltoall.

@iProc:
  forall other ranks:
    async_write_notify (data[other] -> other, notify: data from iProc);
  outstanding_messages = nProc;
  while (outstanding_messages --> 0)
    sender = wait_for_notification ();
    partial_work (received_data[sender]);
  work (received_data);

  • CORRECT. No wait. No barrier. Better overlap. Partial work possible.
  • nProc many notifications → if the memory scaling issue kicks in, then trade with latency.
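A single-process mock of the receive side of this pattern. The queue stands in for arriving notifications and all names are illustrative; the point is the counting-down loop that enables per-sender partial work with no barrier.

```cpp
#include <queue>
#include <vector>

// Each entry in 'notifications' stands for one arrived write_notify,
// carrying the sender's rank. The receiver counts its outstanding
// notifications down to zero and can already do per-sender partial
// work in arrival order (no barrier, no global wait).
std::vector<int> drain (std::queue<int> notifications)
{
  std::vector<int> partial_work_order;
  int outstanding = int (notifications.size ()); // the slide's nProc
  while (outstanding --> 0) // i.e. decrement, then test > 0
  {
    int const sender = notifications.front ();
    notifications.pop ();                        // wait_for_notification()
    partial_work_order.push_back (sender);       // partial_work (received_data[sender])
  }
  return partial_work_order;
}
```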

SLIDE 15

Parallel programming: Reality. Undergraduate using tools.

[Plot: gitfan parallel efficiency, normalized to 4 nodes; x-axis: #nodes x #threads (4x16 to 48x16); y-axis: parallel efficiency (0.75 to 1.25).]

Legacy symbolic linear algebra now parallel. Tools help!

SLIDE 16

Summary

  • Hardware is complex and requires parallel programming.
  • Parallel programming is hard.
  • Programming discipline can (should) be enforced.
  • Annotations can (should) be required.

Create industrial tool chains that help to produce correct parallel software for large machines. Rethink software and how it can tolerate latency in visibility and inconsistency.
