

SLIDE 1

Overlapping Communication and Computation with High Level Communication Routines

  • On Optimizing Parallel Applications

Torsten Hoefler and Andrew Lumsdaine

Open Systems Lab Indiana University Bloomington, IN 47405, USA

Conference on Cluster Computing and the Grid (CCGrid’08)

Lyon, France

21st May 2008

SLIDE 2

Introduction

Solving Grand Challenge Problems
  • not a Grid talk: an HPC-centric view of highly scalable, tightly coupled machines
  • Thanks for the introduction, Manish!
  • All processors will be multi-core
  • All computers will be massively parallel
  • All programmers will be parallel programmers
  • All programs will be parallel programs
  • ⇒ All (massively) parallel programs need optimized communication (patterns)

SLIDE 3

Fundamental Assumptions (I)

We need more powerful machines!
  • Solutions for real-world scientific problems need huge processing power (Grand Challenges)
  • Capabilities of single PEs have fundamental limits
  • The scaling/frequency race is currently stagnating
  • Moore's law is still valid (number of transistors per chip)
  • Instruction-level parallelism is limited (pipelining, VLIW, multi-scalar)
  • Explicit parallelism seems to be the only solution
  • Single chips and transistors get cheaper
  • Implicit transistor use (ILP, branch prediction) has its limits

SLIDE 4

Fundamental Assumptions (II)

Parallelism requires communication
  • Local or even global data dependencies exist
  • Off-chip communication becomes necessary; it bridges a physical distance (many PEs)

Communication latency is limited
  • It is widely accepted that the speed of light limits data transmission
  • Example: minimal 0-byte latency for 1 m ≈ 3.3 ns ≈ 13 cycles on a 4 GHz PE
  • Bandwidth can hide latency only partially
  • Bandwidth is limited (physical constraints)
  • The problem of "scaling out" (especially iterative solvers)
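The cycle count in the example follows directly from the speed of light ($c \approx 3 \times 10^{8}$ m/s):

$t_{1\,\mathrm{m}} \ge \frac{1\,\mathrm{m}}{3 \times 10^{8}\,\mathrm{m/s}} \approx 3.3\,\mathrm{ns}$, and $3.3\,\mathrm{ns} \times 4\,\mathrm{GHz} \approx 13$ cycles.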

SLIDE 5

Assumptions about Parallel Program Optimization

Collective Operations
  • Collective Operations (COs) are an optimization tool
  • CO performance influences application performance
  • Optimized implementation and analysis of COs is non-trivial

Hardware Parallelism
  • More PEs handle more tasks in parallel
  • Transistors/PEs take over communication processing
  • Communication and computation could run simultaneously

Overlap of Communication and Computation
  • Overlap can hide latency
  • Improves application performance

SLIDE 6

Overview (I)

Theoretical Considerations
  • a model for parallel architectures
  • parametrize the model
  • derive models for BC and NBC (blocking and non-blocking collectives)
  • prove optimality of collective operations in the model (?)
  • show processor idle time during BC
  • show limits of the model (IB, BG/L)

Implementation of NBC
  • how to assess performance?
  • highly portable, low-performance version
  • IB-optimized, high-performance, threaded version

SLIDE 7

Overview (II)

Application Kernels
  • FFT (strong data dependency)
  • compression (parallel data analysis)
  • Poisson solver (2D decomposition)

Applications - show how performance benefits in microbenchmarks carry over to real-world applications
  • ABINIT
  • Octopus
  • OSEM medical image reconstruction

SLIDE 8

The LogGP model

Modelling Network Communication
  • The LogP model family offers the best tradeoff between ease of use and accuracy
  • LogGP is the most accurate variant across different message sizes

Methodology
  • assess LogGP parameters for modern interconnects
  • model collective communication
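For reference (not stated explicitly on the slide, but it is the form behind the "G*s+g" fits in the plots that follow), the standard LogGP estimate for transferring a single message of s bytes is

$T(s) \approx o_s + L + o_r + (s - 1) \cdot G$,

with $g$ (or $(s-1) \cdot G$ for large messages) as the minimum gap between consecutive message injections at the sender.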

[Figure: LogGP time diagram of a single message between sender and receiver, showing send overhead o_s, receive overhead o_r, latency L, and gaps g and G on the CPU and network levels]

SLIDE 9

TCP/IP - GigE/SMP

[Plot: time in microseconds vs. data size in bytes; curves: MPICH2 G·s+g, MPICH2 o, TCP G·s+g, TCP o]

SLIDE 10

Myrinet/GM (preregistered/cached)

[Plot: time in microseconds vs. data size in bytes; curves: Open MPI G·s+g, Open MPI o, Myrinet/GM G·s+g, Myrinet/GM o]

SLIDE 11

InfiniBand (preregistered/cached)

[Plot: time in microseconds vs. data size in bytes; curves: Open MPI G·s+g, Open MPI o, OpenIB G·s+g, OpenIB o]

SLIDE 12

Modelling Collectives

LogGP Models - general
  • $t_{barr} = (2o + L) \cdot \lceil \log_2 P \rceil$
  • $t_{allred} = 2 \cdot (2o + L + m \cdot G) \cdot \lceil \log_2 P \rceil + m \cdot \gamma \cdot \lceil \log_2 P \rceil$
  • $t_{bcast} = (2o + L + m \cdot G) \cdot \lceil \log_2 P \rceil$

CPU and Network LogGP parts
  • $t^{CPU}_{barr} = 2o \cdot \lceil \log_2 P \rceil$, $t^{NET}_{barr} = L \cdot \lceil \log_2 P \rceil$
  • $t^{CPU}_{allred} = (4o + m \cdot \gamma) \cdot \lceil \log_2 P \rceil$, $t^{NET}_{allred} = 2 \cdot (L + m \cdot G) \cdot \lceil \log_2 P \rceil$
  • $t^{CPU}_{bcast} = 2o \cdot \lceil \log_2 P \rceil$, $t^{NET}_{bcast} = (L + m \cdot G) \cdot \lceil \log_2 P \rceil$
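As a quick consistency check (implied by, though not written on, the slide), the CPU and network parts sum back to the general model, e.g. for the allreduce:

$t^{CPU}_{allred} + t^{NET}_{allred} = (4o + m\gamma)\lceil \log_2 P \rceil + 2(L + mG)\lceil \log_2 P \rceil = 2(2o + L + mG)\lceil \log_2 P \rceil + m\gamma \lceil \log_2 P \rceil = t_{allred}$,

and analogously for the barrier and broadcast models.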

SLIDE 13

CPU Overhead - MPI_Allreduce LAM/MPI 7.1.2

[Surface plot: CPU usage (share) as a function of communicator size and data size]

SLIDE 14

CPU Overhead - MPI_Allreduce MPICH2 1.0.3

[Surface plot: CPU usage (share) as a function of communicator size and data size]

SLIDE 15

Implementation of Non-blocking Collectives

LibNBC for MPI
  • single-threaded
  • highly portable
  • schedule-based design

LibNBC for InfiniBand
  • single-threaded (first version)
  • receiver-driven message passing
  • very low overhead

Threaded LibNBC
  • thread support requires MPI_THREAD_MULTIPLE
  • completely asynchronous progress
  • complicated due to scheduling issues
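A minimal sketch of the single-threaded overlap pattern, written against the MPI-3 nonblocking-collective interface (MPI_Ialltoall / MPI_Test / MPI_Wait) that later standardized these routines; LibNBC's NBC_* calls are used analogously, and compute_block() is a hypothetical placeholder for independent application work.

```c
/* Sketch only: overlap an MPI_Ialltoall with independent computation and
 * progress the (single-threaded) library by testing the request. */
#include <mpi.h>

void compute_block(int b);   /* hypothetical placeholder for application work */

void overlapped_alltoall(double *sendbuf, double *recvbuf, int count,
                         int nblocks, MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    /* start the collective; it can proceed while we compute */
    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, comm, &req);

    for (int b = 0; b < nblocks; b++) {
        compute_block(b);
        /* a single-threaded library only progresses when called into,
         * so test the request occasionally during the computation */
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    /* the exchanged data is needed from here on (no-op if already done) */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```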

SLIDE 16

LibNBC - Alltoall overhead, 64 nodes

[Plot: overhead in microseconds vs. message size; curves: Open MPI/blocking, LibNBC/Open MPI (1024), LibNBC/OF (waitonsend)]

SLIDE 17

First Example

Derivation from the “normal” implementation
  • data distribution identical to the “normal” 3D-FFT
  • first FFT in z direction and index swap are identical

Design goals to minimize communication overhead
  • start communication as early as possible
  • achieve maximum overlap time

Solution (sketched in code below)
  • start MPI_Ialltoall as soon as the first xz-plane is ready
  • calculate the next xz-plane
  • start the next communication accordingly ...
  • collect multiple xz-planes (tile factor)
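A structural sketch of this pipeline (not the authors' code), assuming hypothetical helpers fft_1d_z(), fft_1d_x(), plane_sendbuf(), and plane_recvbuf(); the tile factor (collecting several planes per message) is omitted for clarity.

```c
/* Structural sketch of the pipelined 3D-FFT (not the authors' code). */
#include <mpi.h>

#define NPLANES 3   /* xz-planes per process, as in the 3x3x3 example */

void    fft_1d_z(int plane);       /* 1D FFTs of one xz-plane in z direction */
void    fft_1d_x(int plane);       /* 1D FFTs of one xz-plane in x direction */
double *plane_sendbuf(int plane);  /* packed plane, ready to send            */
double *plane_recvbuf(int plane);  /* destination for the exchanged plane    */

void pipelined_fft(int count, MPI_Comm comm)
{
    MPI_Request req[NPLANES];

    /* transform each xz-plane in z and immediately start its exchange,
     * so the MPI_Ialltoall overlaps with transforming the next plane */
    for (int p = 0; p < NPLANES; p++) {
        fft_1d_z(p);
        MPI_Ialltoall(plane_sendbuf(p), count, MPI_DOUBLE,
                      plane_recvbuf(p), count, MPI_DOUBLE, comm, &req[p]);
    }

    /* wait for each plane's data, then finish it in x direction */
    for (int p = 0; p < NPLANES; p++) {
        MPI_Wait(&req[p], MPI_STATUS_IGNORE);
        fft_1d_x(p);
    }
}
```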

SLIDE 18

Transformation in z Direction

Data already transformed in y direction


1 block = 1 double value (3x3x3 grid)

SLIDE 19

Transformation in z Direction

Transform first xz plane in z direction


  • pattern means that data was transformed in y and z direction

SLIDE 20

Transformation in z Direction

start MPI_Ialltoall of first xz plane and transform second plane


  • cyan color means that data is communicated in the background

SLIDE 21

Transformation in z Direction

start MPI_Ialltoall of second xz plane and transform third plane


  • data of two planes is not accessible due to communication

SLIDE 22

Transformation in x Direction

start communication of the third plane and ...


  • we need the first xz plane to go on ...

SLIDE 23

Transformation in x Direction

... so MPI_Wait for the first MPI_Ialltoall!


  • and transform first plane (new pattern means xyz transformed)

SLIDE 24

Transformation in x Direction

Wait and transform second xz plane


  • first plane’s data could be accessed for next operation

SLIDE 25

Transformation in x Direction

wait and transform last xz plane


  • done! → 1 complete 1D-FFT overlaps a communication

SLIDE 26

3d-FFT performance

[Plot: FFT communication overhead in seconds on 2 to 64 nodes; series: MPI/BL, MPI/NBC, OF/NBC]

SLIDE 27

Parallel Data Compression

Second Example: Data-Parallel Loops (Parallel Compression)

    for (i = 0; i < N/P; i++) {
        compute(i);
    }
    comm(N/P);
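A hedged sketch of how such a loop can be overlapped: split the N/P iterations into blocks, start a nonblocking exchange for each finished block, and compute the next block while the previous exchange is in flight. compute_block() and start_block_comm() are hypothetical placeholders; the measured code uses LibNBC rather than this exact structure.

```c
/* Overlapped (pipelined) variant of the loop above; a sketch, not the
 * measured implementation. */
#include <mpi.h>

void compute_block(int block);                                     /* compress one block  */
void start_block_comm(int block, MPI_Comm comm, MPI_Request *req); /* e.g. MPI_Ialltoallv */

void overlapped_loop(int nblocks, MPI_Comm comm)
{
    MPI_Request req = MPI_REQUEST_NULL;

    for (int b = 0; b < nblocks; b++) {
        compute_block(b);                   /* compute/compress block b         */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* previous block's exchange done?  */
        start_block_comm(b, comm, &req);    /* send block b while computing b+1 */
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);      /* exchange of the last block       */
}
```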

SLIDE 28

Parallel Compression Performance

[Plot: communication overhead in seconds on 8 to 64 nodes; series: MPI/BL, MPI/NBC, OF/NBC]

SLIDE 29

Domain Decomposition

  • nearest-neighbor communication can be implemented with MPI_Alltoallv
  • we propose a new collective, MPI_Neighbor_xchg[v] (see the sketch below)

[Figure: 2D domain decomposition across processes P0 to P11, showing process-local data and the halo data exchanged with neighbors]
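For illustration, a minimal halo-exchange sketch using the MPI-3 neighborhood collective MPI_Neighbor_alltoall on a Cartesian communicator, which addresses the same use case as the proposed MPI_Neighbor_xchg[v]. The buffer layout (one contiguous block of halo_count doubles per neighbor) is an assumption of this sketch; MPI_Ineighbor_alltoall is the nonblocking form that enables overlap.

```c
/* Minimal halo-exchange sketch on a 2D process grid. sendhalo/recvhalo each
 * hold 4 blocks of halo_count doubles, one per Cartesian neighbor. */
#include <mpi.h>

void halo_exchange(double *sendhalo, double *recvhalo, int halo_count,
                   MPI_Comm comm)
{
    int dims[2] = {0, 0}, periods[2] = {0, 0}, nprocs;
    MPI_Comm cart;

    MPI_Comm_size(comm, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);                   /* choose a 2D grid  */
    MPI_Cart_create(comm, 2, dims, periods, 0, &cart);  /* neighbor topology */

    /* one block of halo_count doubles to/from each of the 4 neighbors;
     * MPI_Ineighbor_alltoall is the nonblocking form that allows overlap */
    MPI_Neighbor_alltoall(sendhalo, halo_count, MPI_DOUBLE,
                          recvhalo, halo_count, MPI_DOUBLE, cart);

    MPI_Comm_free(&cart);
}
```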

SLIDE 30

Parallel 3d-Poisson solver - Speedup

[Plots: speedup vs. number of CPUs (8 to 96) for Gigabit Ethernet and InfiniBand, each comparing blocking and non-blocking communication]

Cluster: 128 nodes, 2 GHz Opteron 246; Interconnect: Gigabit Ethernet, InfiniBand; System size 800x800x800 (1 node ≈ 5300 s)

SLIDE 31

Medical Image Reconstruction

  • OSEM algorithm
  • allreduction of the full image

[Plot: communication overhead in seconds on 8, 16, and 32 nodes; bars for MPI_Allreduce(), NBC_Iallreduce(), and NBC_Iallreduce() (threaded)]
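A generic sketch of the pattern behind the NBC_Iallreduce() bars above: start a nonblocking allreduce of the locally accumulated image and keep doing work that does not depend on the summed image until it is needed. This is written with MPI-3's MPI_Iallreduce (the standardized counterpart of NBC_Iallreduce); independent_work() is a hypothetical placeholder, and what can legally be overlapped is determined by the actual OSEM code, not by this sketch.

```c
/* Generic overlap sketch for the full-image reduction. */
#include <mpi.h>

void independent_work(void);   /* whatever can legally be overlapped */

void reduce_image(const double *local_img, double *global_img, int npixels,
                  MPI_Comm comm)
{
    MPI_Request req;

    MPI_Iallreduce(local_img, global_img, npixels, MPI_DOUBLE,
                   MPI_SUM, comm, &req);
    independent_work();                  /* overlap window                */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* summed image is available now */
}
```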

SLIDE 32

Thank you for your Attention
