Vers des mcanismes gnriques de communication et une meilleure - PowerPoint PPT Presentation

Vers des mécanismes génériques de communication et une meilleure maîtrise des affinités dans les grappes de calculateurs hiérarchiques Brice Goglin 15 avril 2014

Towards generic Communication Mechanisms and better Affinity Management in Clusters of Hierarchical Nodes Brice Goglin April 15th, 2014

Scientific simulation is everywhere ● Used by many industries – Faster than real experiments – Cheaper – More flexible ● Today's society cannot live without it ● Used by many non computer scientists 2014/04/15 HDR Brice Goglin 3/58

Growing computing needs ● Growing platform performance – Multiprocessors – Clusters of nodes – Higher frequency – Multicore processors ● High Performance Computing combines all of them – Only computer scientists can understand the details – But everybody must parallelize his codes 2014/04/15 HDR Brice Goglin 4/58

Hierarchy of computing resources 2014/04/15 HDR Brice Goglin 5/58

Increasing hardware complexity ● Vendors cannot keep the hardware simple – Multicore instead of higher frequencies ● You have to learn parallelism – Hierarchical memory organization ● Non uniform memory access (NUMA) and multiple caches – Your performance may vary – Complex network interconnection ● Hierarchical ● Very different hardware features 2014/04/15 HDR Brice Goglin 6/58

Background ● 2002-2005: PhD – Interaction between HPC networks and storage ● Towards a generic networking API ● Still no portable API? ● 2005-2006: Post-doc – On the influence of vendors on HPC ecosystems ● Benchmarks, hidden features, etc. – Multicore and NUMA spreading ● Clusters and large SMP worlds merging 2014/04/15 HDR Brice Goglin 7/58

Since 2006 ● Joined Inria Bordeaux and LaBRI in 2006 ● Optimizing low-level HPC layers – Interaction with OS and drivers 2014/04/15 HDR Brice Goglin 8/58

HPC stack HPC Applications Numerical Compilers Libraries Run-time support PhD + Postdoc Operating system & Drivers HPC Standard NUMA Multicore Accelerators Networks Networks 2014/04/15 HDR Brice Goglin 9/58

A) Bringing HPC network innovations to the masses HPC Applications Performance, portability and features without Numerical Compilers specialized hardware Libraries Run-time support PhD + MPI MPI over Postdoc Intra-node Ethernet Operating system & Drivers HPC Standard NUMA Multicore Accelerators Networks Networks 2014/04/15 HDR Brice Goglin 10/58

B) Better management of hierarchical cluster nodes HPC Applications Numerical Compilers Libraries Platform model Run-time support Memory & PhD + MPI MPI over I/O affinity Postdoc Intra-node Ethernet Operating system & Drivers Understanding &mastering HPC Standard NUMA Multicore Accelerators platforms and affinities Networks Networks 2014/04/15 HDR Brice Goglin 11/58

A.1) Bringing HPC network innovations to the masses: High performance MPI over Ethernet HPC Applications Numerical Compilers Libraries Run-time support PhD + MPI over Postdoc Ethernet Operating system & Drivers HPC Standard NUMA Multicore Accelerators Networks Networks

MPI is everywhere ● De facto standard for communicating between nodes – And even often inside nodes ● 20-year-old standard – Nothing ready to replace it – Real codes will not leave the MPI world unless a stable and proven standard emerges ● MPI is not perfect – API needs enhancements – Implementations need a lot of optimizations 2014/04/15 HDR Brice Goglin 13/58

Two worlds for networking in HPC Specialized Standard Technology (InfiniBand, MX) (TCP/IP, Ethernet) Expensive, Hardware Any specialized Low latency, Performance High latency? high throughput Designed for RDMA, messages Flows Data transfer Zero-copy Additional copies Write in user-space, Notification Interrupt in the kernel or interrupt 2014/04/15 HDR Brice Goglin 14/58

Existing alternatives ● Gamma, Multiedge, EMP, etc. – Deployment issues ● Require modified drivers and/or NIC firmwares – Only compatible with a few platforms ● Break IP stack – No more administration network? – Use custom MPI implementations ● Less stable, not feature complete, etc. 2014/04/15 HDR Brice Goglin 15/58

High Performance MPI over Ethernet, really? ● Take the best of both worlds – Better Ethernet performance by avoiding TCP/IP – Easy to deploy and easy to use ● Open-MX software – Portable implementation of Myricom's specialized networking stack (MX) ● Joint work with N. Furmento, L. Stordeur, R. Perier, 2014/04/15 HDR Brice Goglin 16/58

MPI over Ethernet Issue #1: Memory Copies Application DMA IB HCA Incoming network packet 2014/04/15 HDR Brice Goglin 17/58

MPI over Ethernet Issue #1: Memory Copies ● Copy is expensive – Lower throughput Application ● Virtual remapping? Copy? – [Passas, 2009] – Remapping isn't cheap Kernel – Alignment constraints DMA ➔ I/O AT Copy Offload Ethernet – on Intel since 2006 NIC Incoming network packet 2014/04/15 HDR Brice Goglin 18/58

MPI over Ethernet Issue #1: IMB Pingpong +30% on average for other IMB tests [Cluster 2008] 2014/04/15 HDR Brice Goglin 19/58

MPI over Ethernet Issue #2: Interrupt Latency Kernel Interrupts Standard NIC Incoming network packets ● Tradeoff between reactivity and CPU usage 2014/04/15 HDR Brice Goglin 20/58

MPI over Ethernet Issue #2: Interrupt Latency ● Adapt interrupts to the message structure – Small messages ● Immediate interrupt ➔ Reactivity – Large messages ● Coalescing ➔ Small CPU usage [Cluster 2009] 2014/04/15 HDR Brice Goglin 21/58

MPI over Ethernet, summary ● TCP/IP Ethernet features adapted to MPI – Interrupt coalescing (and multiqueue filtering) ● Success thanks to widespread API – Open-MX works with all MPI implementations [ParCo 2011] ● But MX is going away – Still waiting for a generic HPC network API? 2014/04/15 HDR Brice Goglin 22/58

A.2) Bringing HPC network innovations to the masses: Intra-node MPI communication HPC Applications Numerical Compilers Libraries Run-time support PhD + MPI MPI over Postdoc Intra-node Ethernet Operating system & Drivers HPC Standard NUMA Multicore Accelerators Networks Networks

MPI inside nodes, really? ● MPI codes work unmodified on multicores – No need to add OpenMP, etc. ● Long history of intra-node communication optimization in the Runtime team ➔ Focus on large messages ● KNEM software – Joint work with S. Moreaud (PhD), G. Mercier, R. Namyst, 2014/04/15 HDR Brice Goglin 24/58

MPI inside nodes, how? or how HPC vendors abuse drivers Inter-node Local Double Copy across Library shared buffer Direct Copy Driver between processes NIC Shared Hardware Software Memory Loopback Loopback 2014/04/15 HDR Brice Goglin 25/58

Portability issues Solution Shared-memory Direct-copy Latency OK High Throughput Depends OK Send-receive OK Features Collectives OK Send-receive only RMA needs work Network-specific or Portability OK Platform-specific Security OK None 2014/04/15 HDR Brice Goglin 26/58

KNEM ( Kernel Nemesis ) design ● RMA-like API – Out-of-bound synchronization is easy ● Fixes existing direct-copy issues – Designed for send-recv, collectives and RMA – Does not require specific network/platform driver – Built-in security model [ICPP 2009] 2014/04/15 HDR Brice Goglin 27/58

Applying KNEM to collectives ● OpenMPI collectives directly on top on KNEM – No serialization in the root process anymore – Much better overlap between collective steps – e.g. MPI_Bcast 48% faster on 48-core AMD server [ICPP 2011, JPDC 2013] 2014/04/15 HDR Brice Goglin 28/58

MPI intra-node, summary ● Pushed kernel-assistance to the masses – Available in all MPI implementations, for all platforms – For different kinds on communication, vectorial buffer support, and overlapped copy offload ● Basic support included in Linux (CMA) – Thanks to IBM ● When do we enable which strategy? – High impact of process locality 2014/04/15 HDR Brice Goglin 29/58

B.1) Better managing hierarchical cluster nodes : Modeling modern platforms HPC Applications Numerical Compilers Libraries Platform model Run-time support PhD + MPI MPI over Postdoc Intra-node Ethernet Operating system & Drivers HPC Standard NUMA Multicore Accelerators Networks Networks

View of server topology

Servers' topology is actually getting (too) complex

Using locality for binding: Binding related tasks Shared cache

Using locality for binding: Binding near involved resources Application buffer GPU

Using locality AFTER binding: Adapting hierarchical barriers

Modeling platforms ● Static model (hwloc software) + memory model ● Joint work with J. Clet-Ortega (PhD), B. Putigny (PhD), A. Rougier, B. Ruelle, S. Thibault, and many other academics and vendors contributing to hwloc. 2014/04/15 HDR Brice Goglin 36/58

Vers des mcanismes gnriques de communication et une meilleure - PowerPoint PPT Presentation

Vers des mcanismes gnriques de communication et une meilleure matrise des affinits dans les grappes de calculateurs hirarchiques Brice Goglin 15 avril 2014 Towards generic Communication Mechanisms and better Affinity Management

Vers une Analyse Conceptuelle des Rseaux Sociaux Erick Stattner Martine Collard Laboratory of

Block Ciphers and DES S-DES DES Details DES Design Other Ciphers CSS441: Security and

Data Encryption Standard Simplified-DES Details of DES DES in OpenSSL Cryptography DES in

Data Encryption Standard Simplified-DES Details of DES DES in OpenSSL Cryptography DES in

Vers une vision intgre de la protection de lair et du climat en Wallonie : Introduction

Vers une nouvelle dfinition de lischmie myocardique Genevive Derumeaux Hpital Louis

Groupe ABC arbitrage Prsentation SFAF - Comptes annuels 2017 Retour vers le futur des marchs

Ph Philip ip Br Brown wn Cardif rdiff f Unive vers rsity ty SRHE Conference-7 December

Adaptive and Localized Basis Functions for O ( N ) vers. Linear Scaling, Large Systems and

SK Telecom 1 U U U U U U U- U - - communication - - - - - communication

J UNE 14 2013 J UNE 14, 2013 R R EINSURANCE REGULATIONS : I MPACT ON I RISKS IN THE I NDIAN L

Analyse Relationnelle de Concepts: Une approche pour fouiller des ensembles de donnes

Du Python, du Bash, un Raspberry Pi et des clefs USB Kitten Groomer, le nettoyeur de cl e USB.

Administration de la navigation arienne Une Communication claire et transparente avec les

Measuring the Progress of Societies: some alternative measures of wellbeing Jon Hall OECD July

Diagnostic et Prise Prise en Charge des en Charge des Diagnostic et Echecs De Thrombolyse De

Digital and Information Literacy as Discursive Mapping of an Information Landscape Andrew

Introduction to the MIZAR system Adam Naumowicz adamn@mizar.org Institute of Computer Science

Physics Simulation Engines 2008 Chalmers University of Technology Markus Larsson

Simulation Engines TDA571|DIT030 Physics Tommaso Piazza 1 Administrative stuff Group

REVELATION 18 FALL OF BABYLON Series The future has already begun #15 IBC Eindhoven, 7 June

Shaykh Abbas Jaffer February 25, 2017 ALI 389: Makkan and Madani Revelations INTRODUCTION 2

Deriving Consensus for Multi-Parallel Corpora: An English Bible Study Patrick Xia David

How are spouses supposed to treat each other? 1 Peter 3:1-7 Introduction Being a Christian