
Group Operation Assembly Language - A Flexible Way to Express Collective Communication - PowerPoint PPT Presentation

Torsten Hoefler, Christian Siebert, Andrew Lumsdaine - Open Systems Lab, Indiana University, Bloomington / NEC Laboratories Europe, Sankt Augustin, Germany


  1. Group Operation Assembly Language - A Flexible Way to Express Collective Communication
     Torsten Hoefler¹, Christian Siebert², Andrew Lumsdaine¹
     ¹Open Systems Lab, Indiana University, Bloomington
     ²NEC Laboratories Europe, Sankt Augustin, Germany
     ICPP 2009, Vienna, Austria - 09/25/09

  2. Introduction
     - MPI as de-facto standard in parallel processing
     - Collective operations are an integral part of MPI
     - Large body of research on advanced algorithms
     - Multiple implementations in MPI libraries: e.g., MPICH2, MVAPICH, Open MPI
     - "Group operations" are also used in other environments (e.g., MRNet, multicast)

  3. Motivation
     - Group operations are a general concept
       - e.g., used in MPI, UPC, MRNet
     - Nonblocking collective operations have arrived
       - NBC will be in MPI 3.0 (or 2.3?)
     - Most implementations are hard-coded
       - Control flow as static branches in source code
       - Requires considerable hand-tuning
     - User-defined (sparse) collective operations (?)
     - Hardware offload and NBC

  4. Broadcast Tree Examples
     - Binomial trees are used in many small-message collectives (e.g., Bcast, Reduce)

  5. Our Goals
     - Define a minimal language to express collective communication, to enable:
       - efficient representation for offload
       - fast and simple execution on slow PEs
       - good specification of advanced algorithms
       - execution in resource-constrained environments (NIC)
       - (automatic) transformational optimizations

  6. Abstracting
     - What is the minimal set of operations needed to perform any collective algorithm?
     - Theorem 1 states that send, receive, and (local) dependencies are sufficient to model any collective algorithm (see the sketch below)
       - allows a concise definition!
     - Theorem 2 states that the ordering requirement is relative to each single operation
       - allows optimized/adaptive execution!
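
To make the abstraction concrete, here is a minimal C sketch of the three primitives (send, receive, local operation) connected by explicit dependencies. All type and field names are illustrative assumptions for this sketch, not GOAL's actual internal representation.

    /* Minimal sketch of the Theorem 1 primitives: send, receive, and a local
     * operation, connected by explicit dependencies.  Names and fields are
     * illustrative assumptions, not GOAL's internal format. */
    #include <stddef.h>

    typedef enum { OP_SEND, OP_RECV, OP_LOCAL } op_type_t;

    typedef struct action {
        op_type_t type;      /* send, receive, or local computation            */
        int       peer;      /* source/destination rank (ignored for OP_LOCAL) */
        void     *buf;       /* message or operand buffer                      */
        size_t    len;       /* buffer length in bytes                         */
        int       ndeps;     /* how many actions must complete before this one */
        int       deps[4];   /* indices of those actions (fixed bound here)    */
    } action_t;

    /* A per-process schedule is simply an array of such actions; the deps[]
     * entries make it a dependency DAG.  Per Theorem 2 the ordering constraint
     * is local to each action, so any action whose dependencies are satisfied
     * may execute next, in any order or in parallel. */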

  7. Group Operation Assembly Language
     - Very low-level specification (compilation target)
       - cf. RISC assembler code
     - Translated into a machine-dependent form
       - cf. RISC bytecode

  8. A Binomial Tree Example

  9. GOAL Language Interface
     - GOAL language interface (Bcast example):

       rank #0 {
         send <msg>,<len> to 1;
         send <msg>,<len> to 2;
         send <msg>,<len> to 4;
       }

       rank #1 {
         r:  recv <msg>,<len> from 0;
         s1: send <msg>,<len> to 3;
         s2: send <msg>,<len> to 5;
         requ s1 -> r;
         requ s2 -> r;
       }

       rank #5 {
         recv <msg>,<len> from 1;
         …
       }

  10. Group Operation Assembly Language
     - Alternative schedule creation at runtime
     - Library interface (sketched in use below):
       - gop = GOAL_Create()
       - id = GOAL_Send(sched, buf, size, dest)
       - id = GOAL_Recv(sched, buf, size, dest)
       - GOAL_Exec(sched, func, buf, size)
       - GOAL_Requ(sched, src_id, tgt_id)
       - sched = GOAL_Compile(gop)
     - Internal representation reflects a dependency DAG
       - enables transformational optimizations
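
The slide gives the runtime library interface only by name. As an illustration, the following hedged sketch builds rank 1's part of the binomial Bcast from slide 9 with these calls. The handle types (GOAL_Handle, GOAL_Schedule), the exact prototypes, and the argument order are assumptions inferred from the slide, not the verified GOAL/LibNBC headers.

    #include <stddef.h>

    /* Hypothetical prototypes matching the calls listed on the slide; the
     * real headers may differ in types and argument order. */
    typedef struct goal_op    *GOAL_Handle;
    typedef struct goal_sched *GOAL_Schedule;

    GOAL_Handle   GOAL_Create(void);
    int           GOAL_Send(GOAL_Handle gop, void *buf, size_t size, int dest);
    int           GOAL_Recv(GOAL_Handle gop, void *buf, size_t size, int src);
    int           GOAL_Exec(GOAL_Handle gop, void (*func)(void *, size_t),
                            void *buf, size_t size);
    void          GOAL_Requ(GOAL_Handle gop, int src_id, int tgt_id);
    GOAL_Schedule GOAL_Compile(GOAL_Handle gop);

    /* Rank 1 of the Bcast example: receive from 0, then forward to 3 and 5. */
    GOAL_Schedule build_rank1_bcast(void *buf, size_t size)
    {
        GOAL_Handle gop = GOAL_Create();

        int r  = GOAL_Recv(gop, buf, size, 0);  /* r:  recv <msg> from 0 */
        int s1 = GOAL_Send(gop, buf, size, 3);  /* s1: send <msg> to 3   */
        int s2 = GOAL_Send(gop, buf, size, 5);  /* s2: send <msg> to 5   */

        GOAL_Requ(gop, s1, r);  /* requ s1 -> r: the send needs the received data */
        GOAL_Requ(gop, s2, r);  /* requ s2 -> r                                   */

        return GOAL_Compile(gop);  /* dependency DAG -> machine-dependent schedule */
    }

The two GOAL_Requ calls encode requ s1 -> r and requ s2 -> r from slide 9: the compiled schedule lets both sends start as soon as the receive completes, in either order or in parallel.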

  11. Optimization Possibilities
     - Adaptive execution
       - Possible to consider the process arrival pattern
       - Independent ops: sent to ready hosts first

  12. Optimization Possibilities (cont.)
     - Parallel execution
       - Schedule (DAG) allows for parallel execution
       - Multiple parallel NICs
       - Same scheduling issues as for multicore task libraries (TBB, Cilk, OpenMP 3.0)
     - Static schedule (compiler) optimization
       - e.g., architecture-dependent pipelining
     - Scheduler runs in thread or hardware
       - Offload to spare CPU core
       - Offload to NIC (same GOAL specification)

  13. Advanced Example - Dissemination

  14. Schedule Details
     - Result of GOAL assembly
       - Optimized for each architecture
       - Should not lose flexibility
       - Represents the dependency/execution graph
     - Our machine-dependent representation:
       - We propose a binary schedule (sketched below)
       - Linear memory layout (cache/prefetch friendly)
       - Executor is only 98 SLOC of C code in LibNBC
       - Compression possible (not in this work)
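
The binary schedule itself is not spelled out on the slide; the following is a hedged sketch of a linear layout of fixed-size records and a small polling executor in that spirit. The record fields, the single-dependency simplification, and the issue callback are assumptions for illustration and do not reproduce LibNBC's actual binary format.

    #include <stddef.h>

    /* One fixed-size record of the (sketched) binary schedule.  Records are
     * stored back to back in one linear buffer, which is what makes the
     * format cache- and prefetch-friendly. */
    typedef struct sched_entry {
        int    type;   /* send / recv / local op, encoded as in the earlier sketch */
        int    peer;   /* communication partner (rank)                             */
        void  *buf;    /* data buffer                                              */
        size_t len;    /* length in bytes                                          */
        int    dep;    /* index of one prerequisite entry, -1 if independent       */
        int    done;   /* completion flag maintained by the executor               */
    } sched_entry_t;

    /* Transport-specific hook that starts or tests one action and returns
     * nonzero once it has completed (e.g., wrapping MPI or InfiniBand verbs). */
    typedef int (*issue_fn)(sched_entry_t *entry);

    /* One progression pass over the linear schedule: every entry whose single
     * dependency is satisfied is issued/tested.  Returns 1 when all are done. */
    int schedule_progress(sched_entry_t *sched, int n, issue_fn issue)
    {
        int finished = 1;
        for (int i = 0; i < n; i++) {
            if (sched[i].done)
                continue;
            finished = 0;
            if (sched[i].dep >= 0 && !sched[sched[i].dep].done)
                continue;                 /* prerequisite not yet satisfied */
            if (issue(&sched[i]))
                sched[i].done = 1;
        }
        return finished;
    }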

  15. Execution Constraints
     - How much memory do we need to execute a schedule?
     - We can use a sliding window: hold only parts of the schedule in a scratchpad memory (e.g., on the NIC) - see the sketch below
     - Theorem 3: A schedule of length N can be executed with additional memory using a constant-size window.
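
Continuing the previous sketch, the constant-size window can be pictured as follows: only W entries of the linear schedule are held in a small local buffer (standing in for NIC scratchpad) at a time, and the window is refilled from host memory once it drains. The fetch_entries helper and the window policy are assumptions; the sketch also assumes the schedule is stored in topological order, so dependencies always point backwards and anything before the current window is already complete.

    #include <string.h>

    #define W 8   /* window size: a constant, independent of the schedule length N */

    /* Copy `cnt` entries starting at `off` into the scratchpad window and
     * rewrite their dependency indices: anything before the window is already
     * complete, anything inside becomes window-relative. */
    static void fetch_entries(sched_entry_t *win, const sched_entry_t *host,
                              int off, int cnt)
    {
        memcpy(win, host + off, (size_t)cnt * sizeof *win);
        for (int i = 0; i < cnt; i++) {
            win[i].done = 0;
            if (win[i].dep >= 0 && win[i].dep < off)
                win[i].dep = -1;      /* prerequisite finished in an earlier window */
            else if (win[i].dep >= off)
                win[i].dep -= off;    /* prerequisite lives in this window */
        }
    }

    /* Execute the full schedule while never holding more than W entries locally. */
    void run_windowed(const sched_entry_t *host_sched, int n, issue_fn issue)
    {
        sched_entry_t window[W];

        for (int off = 0; off < n; off += W) {
            int cnt = (n - off < W) ? (n - off) : W;
            fetch_entries(window, host_sched, off, cnt);
            while (!schedule_progress(window, cnt, issue))
                ;                     /* poll until this window has drained */
        }
    }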

  16. Execution Constraints (contd.)
     - This additional memory consumption is infeasible
       - SRAM on a NIC is expensive!
     - Solution: introduce additional dependencies
       - BUT: additional dependencies cause serialization
     - Theorem 4: Each schedule can be executed in constant memory if dummy actions are added.

  17. Implementation
     - Ernest Rutherford: "We don't have the money, so we have to think."
       - No easy access to a programmable NIC
       - Working with Myricom on Myrinet
       - Mellanox seems to have a similar interface in its next-generation API
     - We offloaded to a spare CPU core
       - Threading model
       - Replacing the current implementation in LibNBC
       - Less synchronicity than the round-based scheme!

  18. Test System
     - Odin cluster at Indiana University
       - 4x InfiniBand SDR
       - Single 288-port Mellanox switch
       - 128 nodes, 4 cores per node -> 512 cores
     - Open MPI coll component "tuned", version 1.3
     - LibNBC 1.0 (with NBCBench 1.0)
       - OFED-optimized version (uses RDMA-W)

  19. Blocking Collectives
     - No performance penalty!

  20. Nonblocking Collectives
     - Even less overhead!

  21. Conclusions
     - Abstract definition of group communication
       - Easy definition of (non-)blocking for offload
       - Universal (implements all collectives)
       - Small overhead, maximum asynchrony
     - Enables compiler-based optimizations and dynamic scheduling
       - e.g., pipelining, coalescing, memory registration
     - First step towards high-level communication expression

  22. Future Work
     - Investigate compiler optimizations
     - Compress schedules (reduce resource needs)
     - Implement the scheduler on NICs

     Questions?
