Hybrid MPI - A Case Study on the Xeon Phi Platform


SLIDE 1

Hybrid MPI - A Case Study on the Xeon Phi Platform

Udayanga Wickramasinghe
Center for Research on Extreme Scale Technologies (CREST), Indiana University

Greg Bronevetsky
Lawrence Livermore National Laboratory

Andrew Friedley
Intel Corporation

Andrew Lumsdaine
Center for Research on Extreme Scale Technologies (CREST), Indiana University

ROSS 2014, Munich, Germany

SLIDE 2

Hybrid MPI - Motivation

§ MPI – the dominant programming model in HPC
§ Hybrid MPI – an MPI implementation specialized for intra-node point-to-point communication
  § Fast point-to-point communication over shared memory hardware
§ Evolving processor architectures
  § Single core → dual core → quad core → multi-core → many-core/clusters
  § High compute density and performance per watt
  § Robust shared memory hardware
§ Motivation – maximize use of many-core hardware
  § Make maximum use of the Xeon Phi's shared memory hardware
  § Extract maximum communication performance from the available bandwidth of the Xeon Phi hardware

SLIDE 3

Agenda

§ Xeon Phi Platform
§ Traditional MPI Design
§ Hybrid MPI
  § A Shared Heap
  § Communication
§ Experimental Evaluation
  § Micro-benchmarks
  § Applications
§ Hybrid MPI Highlights
§ Towards Hybrid MPI Future

SLIDE 4

Xeon Phi Platform

§ Intel Many Integrated Core architecture (MIC) → Xeon Phi (earlier known as Knights Corner – 50 cores)
§ Used in the #1 supercomputing cluster – Tianhe-2 (http://top500.org/)
§ STAMPEDE @ TACC
§ Xeon Phi processor → 61 cores with 4 hardware threads each
  § No out-of-order execution
  § x86 compatibility
  § Shorter instruction pipeline
§ Simpler cores → higher power efficiency

SLIDE 5

Xeon Phi Platform

§ Inter-core communication
  § Bi-directional ring topology interconnect
  § ~320 GB/s aggregate theoretical bandwidth
§ 4 modes of operation (MPSS)
  § Host
  § Offload – offloads computation to the Phi
  § Symmetric – ranks on both Host and Phi
  § Phi Only

SLIDE 6

Xeon Phi Software Model/Stack (MPSS)

§ Offload/Symmetric/Phi-only modes supported via the Intel Many-core Platform Software Stack (MPSS)
  § Shared memory / SHM
  § SCIF
  § IB verbs / IB-SCIF

[Figure: placement of MPI applications across the multi-core host and many-core Phi in Host, Offload (offloaded computation), Symmetric, and Phi Only modes]

SLIDE 7

Agenda

§ Xeon Phi Platform
§ Traditional MPI Design
§ Hybrid MPI
  § A Shared Heap
  § Communication
§ Experimental Evaluation
  § Micro-benchmarks
  § Applications
§ Hybrid MPI Highlights
§ Towards Hybrid MPI Future

SLIDE 8

Traditional MPI with Disjoint Address Spaces

§ Process-based ranks – regular process abstraction – shared nothing
§ Communication
  § Disjoint address spaces – multiple copies per message
  § IPC/kernel buffers/shared buffers – resources grow rapidly with the number of ranks

SLIDE 9

Traditional MPI with Disjoint Address Spaces (contd.)

[Figure: P1's Send_buffer is copied into a shared segment and then into P2's Recv_buffer]

§ Two Copies – resources grow as ranks increase

SLIDE 10

Alternative MPI – Avoid Copies

§ Necessary to share memory
  § Thread-based ranks – share everything
    – Heap/data/text segments are shared among threads
  § Thread pinning to Xeon Phi cores via KMP_AFFINITY
    § scatter/compact/fine
§ However, a few problems arise when resources are shared
  § Ensuring mutual exclusion
  § Transforming globals/heap variables to thread-local
  § Network resource contention

SLIDE 11

Hybrid MPI – A Shared Heap

§ Hybrid MPI approach
  § Each rank's (P1, P2, P3, P4) heap is mmap()-ed to a shared segment
  § Every rank has access to the entire shared segment
  § Each process allocates memory in its own chunk → heap_p1, heap_p2, heap_p3, heap_p4

SLIDE 12

Hybrid MPI – A Shared Heap (contd.)

§ Single copy using the unified shared address space of Hybrid MPI
§ Implemented with mmap()
  § MAP_SHARED, MAP_FIXED features (a sketch of the shared-heap mapping follows)

[Figure: P1's Send_buffer in heap_p1 is copied directly into P2's Recv_buffer in heap_p2]
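A minimal sketch of how a per-rank heap chunk might be carved out of one shared mapping. The segment name, chunk size, and setup sequence below are illustrative assumptions, not the actual Hybrid MPI implementation; in particular, the real design also relies on MAP_FIXED to place the heap at the same virtual address in every rank, which this sketch omits.

/* Illustrative shared-heap setup (SEG_NAME and CHUNK_SIZE are assumed). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME   "/hmpi_shared_heap"   /* hypothetical shm object name */
#define CHUNK_SIZE (64UL * 1024 * 1024)  /* 64 MB per rank (example)     */

static void *shared_heap_attach(int rank, int nranks)
{
    size_t total = CHUNK_SIZE * (size_t)nranks;

    /* Every rank opens (or creates) the same shared memory object. */
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); exit(1); }
    if (ftruncate(fd, (off_t)total) != 0) { perror("ftruncate"); exit(1); }

    /* MAP_SHARED makes stores visible to every rank that maps the object. */
    void *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); exit(1); }
    close(fd);

    /* Each rank allocates only inside its own chunk of the mapping. */
    return (char *)base + (size_t)rank * CHUNK_SIZE;
}

Any buffer allocated from the returned chunk is directly readable by every other rank on the node, which is what enables the single-copy transfer described above.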

SLIDE 13

Hybrid MPI View on Xeon Phi

§ Hybrid MPI has its own shared memory extension for intra-node communication
§ Inter-node communication via Intel MPI
  § InfiniBand network
  § TCP/IP
  § PCIe (PCI Express) / SCIF (Symmetric Communication Interface)

[Figure: on a Xeon Phi node, HMPI ranks use the SHM extension for intra-node traffic, while Intel MPI over the OS/MPSS stack handles traffic to the host/network node]

SLIDE 14

Agenda

§ Xeon Phi Platform
§ Traditional MPI Design
§ Hybrid MPI
  § A Shared Heap
  § Communication
§ Experimental Evaluation
  § Micro-benchmarks
  § Applications
§ Hybrid MPI Highlights
§ Towards Hybrid MPI Future

SLIDE 15

Hybrid MPI – Message matching

§ ‘send’ requests are matched with local ‘receive’ requests in HMPI_Progress
§ Matching is on the tuple <rank, comm, tag>
§ Two queues are used (sketched in the code below)
  § Shared – protected by a global MCS lock
  § Private – where matching is performed; drained from the global queue
  § Minimizes contention

[Figure: ranks P1–P4 each drain a private queue from a shared global queue protected by an MCS lock (Mellor-Crummey-Scott algorithm); each HMPI_Request carries the <rank, communicator, tag> tuple plus payload information]
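A simplified sketch of the two-queue matching idea described above; the data structures and function names are assumptions for illustration (with a mutex standing in for the MCS lock), not the actual HMPI code.

/* Illustrative two-queue matching: senders append to a shared queue under
 * a lock; the receiver drains it into a private queue and matches on the
 * <rank, comm, tag> tuple without holding the lock. */
#include <pthread.h>
#include <stddef.h>

typedef struct request {
    int rank, comm, tag;          /* matching tuple                  */
    void *payload;                /* message data or immediate copy  */
    size_t len;
    struct request *next;
} request_t;

typedef struct { request_t *head, *tail; } queue_t;

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the MCS lock */

static void enqueue(queue_t *q, request_t *r)
{
    r->next = NULL;
    if (q->tail) q->tail->next = r; else q->head = r;
    q->tail = r;
}

/* Receiver side: move everything from the shared queue into the private queue. */
static void drain_shared(queue_t *shared, queue_t *priv)
{
    pthread_mutex_lock(&shared_lock);        /* short critical section: pointer swap only */
    request_t *r = shared->head;
    shared->head = shared->tail = NULL;
    pthread_mutex_unlock(&shared_lock);

    while (r) { request_t *next = r->next; enqueue(priv, r); r = next; }
}

/* Receiver side: find and unlink the first request matching <rank, comm, tag>. */
static request_t *match(queue_t *priv, int rank, int comm, int tag)
{
    request_t *prev = NULL;
    for (request_t *r = priv->head; r; prev = r, r = r->next) {
        if (r->rank == rank && r->comm == comm && r->tag == tag) {
            if (prev) prev->next = r->next; else priv->head = r->next;
            if (priv->tail == r) priv->tail = prev;
            return r;
        }
    }
    return NULL;
}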

SLIDE 16

Hybrid MPI – Communication protocols

§ 3 protocols (selection by message size is sketched below)
  § Direct Transfer
  § Immediate Transfer
  § Synergistic Transfer
§ Direct Transfer
  § Single memcpy() from the sender's buffer to the receiver's buffer
  § Applied when the message is medium-sized (512 B ≤ m ≤ ~8 KB)
§ Immediate Transfer
  § Applied when the message is small (≤ ~512 bytes)
  § Payload is transferred immediately with the HMPI request (header)
  § Message is cache-aligned to fit cache lines and the 32 KB L1 cache
    – Avoids 2 copies → exploits temporal locality; the payload is already in the receiver's L1/L2 cache
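A minimal sketch of size-based protocol selection; the thresholds come from the slides, but the names and dispatch function are illustrative assumptions rather than the actual HMPI entry points.

/* Illustrative protocol dispatch by message size. */
#include <stddef.h>

#define IMMEDIATE_MAX 512          /* ~512 B: payload rides along with the request header */
#define DIRECT_MAX    (8 * 1024)   /* ~8 KB: single memcpy() through the shared heap      */

enum protocol { PROTO_IMMEDIATE, PROTO_DIRECT, PROTO_SYNERGISTIC };

static enum protocol choose_protocol(size_t msg_size)
{
    if (msg_size <= IMMEDIATE_MAX)
        return PROTO_IMMEDIATE;    /* small: copy into the cache-aligned request        */
    if (msg_size <= DIRECT_MAX)
        return PROTO_DIRECT;       /* medium: one direct memcpy() sender to receiver    */
    return PROTO_SYNERGISTIC;      /* large: sender and receiver copy blocks together   */
}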

SLIDE 17

Hybrid MPI – Immediate protocol

§ The direct protocol incurs separate cache misses for each step

[Figure: sender → local_send() copies the application buffer into the cache-aligned HMPI_Request (tag, comm, rank, eager payload); sender → add_queue() appends the request to the shared global queue in memory]

SLIDE 18

Hybrid MPI – Immediate protocol

§ The message is transferred at the matching stage

[Figure: receiver → match() pulls the HMPI_Request (tag, comm, rank, eager payload) from the shared global queue into the receiver's L1/L2 cache]

SLIDE 19

Hybrid MPI – Immediate protocol

§ At the data transfer
  § No cache miss to fetch the data
  § If the destination buffer is already in cache, copying is extremely fast
  § 43% – 70% improvement for 32 B – 512 B messages

[Figure: receiver → receive() copies the eager payload from the HMPI_Request, already resident in the receiver's L1/L2 cache, into dest_buffer]

SLIDE 20

Hybrid MPI – Communication protocols

§ Synergistic Transfer
  § For large messages (≥ 8 KB), both sender and receiver engage actively in copying the message to its destination (sketched below)

[Figure: in a regular transfer the receiver alone copies blocks 1–5 of the message, taking time T1; in a synergistic transfer the sender and receiver each copy a share of the blocks and finish in time T2 << T1]
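A rough sketch of the cooperative block-copy idea, assuming a shared atomic block counter that sender and receiver both advance; the structure, names, and block size are illustrative, not the HMPI implementation.

/* Illustrative synergistic copy: both sides claim blocks with an atomic
 * counter and copy them in parallel through the shared heap. */
#include <stdatomic.h>
#include <string.h>

#define BLOCK_SIZE (64 * 1024)    /* example block granularity */

struct xfer {
    const char   *src;            /* sender's buffer in the shared heap   */
    char         *dst;            /* receiver's buffer in the shared heap */
    size_t        len;
    atomic_size_t next_block;     /* next block index to be claimed       */
};

/* Called by BOTH the sender and the receiver once the transfer is matched. */
static void synergistic_copy(struct xfer *x)
{
    size_t nblocks = (x->len + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (;;) {
        size_t b = atomic_fetch_add(&x->next_block, 1);   /* claim one block */
        if (b >= nblocks)
            break;
        size_t off = b * BLOCK_SIZE;
        size_t n = (off + BLOCK_SIZE <= x->len) ? BLOCK_SIZE : x->len - off;
        memcpy(x->dst + off, x->src + off, n);
    }
}

Whichever side arrives first starts copying immediately and the other joins as soon as it matches the request, which is why the combined time T2 is much smaller than the receiver-only time T1.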

SLIDE 21

Agenda

§ Xeon Phi Platform
§ Traditional MPI Design
§ Hybrid MPI
  § A Shared Heap
  § Communication
§ Experimental Evaluation
  § Micro-benchmarks
  § Applications
§ Hybrid MPI Highlights
§ Towards Hybrid MPI Future

SLIDE 22

Experimental Setup

§ TACC STAMPEDE node
  § Host processor
    – Xeon E5, 8 cores, 2.7 GHz, 32 GB DDR3 RAM, CentOS 6.3
  § Coprocessor
    – Xeon Phi, 61 cores, 1.1 GHz, 8 GB GDDR5 RAM
    – Linux-based BusyBox OS (kernel version 2.6) / MPSS
    – Intel icc/mpicc compilers – cross-compiled for the Xeon Phi
§ Presta benchmark, “purple suite”
§ 2 types of experiments
  § Intra-node
    – Single STAMPEDE node (from 2 ranks to 240 ranks in one node)
    – All experiments run in ‘Phi-Only’ mode – only on the coprocessor
    – Benchmarks used – Presta stress benchmark – com / latency
  § Inter-node
    – Between nodes, still in ‘Phi-Only’ mode
    – Communication via the InfiniBand FDR interconnect

SLIDE 23

Intra-node Setup

§ Intra-node setup
  § Each core is bound to a rank
  § All nodes tested have one Xeon Phi coprocessor
  § Rank pairs are formed on opposite sides of the Xeon Phi coprocessor's ring interconnect

[Figure: ranks on cores 1–30 are paired with ranks on cores 31–60 across the Xeon Phi ring interconnect]

SLIDE 24

Inter-node Setup

§ Inter-node setup
  § A subset of cores/ranks from each node is selected
  § Communication in symmetric mode – Phi to Phi
  § RDMA over InfiniBand

[Figure: several nodes, each with a Xeon Phi (cores 1–60) attached to its host over PCIe, connected through the InfiniBand fabric]

SLIDE 25

Presta com benchmark

§ 2 types of Presta com benchmark measurements (a minimal exchange pattern is sketched below)
  § Uni-directional
    – One-way communication – MPI_Send / MPI_Recv
  § Bi-directional
    – Two-way communication – MPI_Sendrecv
    – Full duplex – both sender and receiver transfer data at the same time
    – Rank pairs are generated as in the uni-directional benchmark

[Figure: uni-directional – data flows from rank-j to rank-k; bi-directional – data flows between rank-j and rank-k in both directions simultaneously]
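A minimal sketch of the two exchange patterns using standard MPI point-to-point calls; the message size, tag, and even/odd pairing are arbitrary choices for illustration, not the Presta benchmark code.

/* Illustrative uni- and bi-directional exchange between paired ranks. */
#include <mpi.h>
#include <stdlib.h>

#define MSG_SIZE 4096
#define TAG      0

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sbuf = malloc(MSG_SIZE), *rbuf = malloc(MSG_SIZE);
    int peer = (rank % 2 == 0) ? rank + 1 : rank - 1;   /* pair even/odd ranks */

    if (peer >= 0 && peer < size) {
        /* Uni-directional: even ranks send, odd ranks receive. */
        if (rank % 2 == 0)
            MPI_Send(sbuf, MSG_SIZE, MPI_CHAR, peer, TAG, MPI_COMM_WORLD);
        else
            MPI_Recv(rbuf, MSG_SIZE, MPI_CHAR, peer, TAG, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        /* Bi-directional: both sides send and receive at the same time. */
        MPI_Sendrecv(sbuf, MSG_SIZE, MPI_CHAR, peer, TAG,
                     rbuf, MSG_SIZE, MPI_CHAR, peer, TAG,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}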

SLIDE 26

Intra vs Inter node Point to Point Communication

§ Intra-node Hybrid MPI peak bandwidth (~50 GB/s) >> Intel MPI peak bandwidth (~40 GB/s) for intra-node communication with 60 ranks
§ For small messages, Hybrid MPI outperforms Intel MPI – speedup due to the immediate protocol
§ Medium/large messages → direct copy and synergistic protocols

[Figure: Intra-node BW (60 ranks – 1 node)]

SLIDE 27

Inter node Point to Point Communication

§ Inter-node bi-directional bandwidth
  § Smaller bandwidth difference between the two implementations
  § Due to noise in measurements, subtleties in message patterns, etc.

[Figure: Inter-node BW (960 ranks – 16 nodes)]

SLIDE 28

Intra-node Message Size Specific - small

[Figure: message size 32 bytes]

§ Shows the effect of the Hybrid MPI immediate protocol
  § Fast copying due to temporal locality
  § A 32-byte message fits in a cache line on a Xeon Phi core
§ The Hybrid MPI bi-directional benchmark outperforms the other variants at all rank counts

SLIDE 29

Intra-node Message Size Specific - small

[Figure: message size 512 bytes]

SLIDE 30

Intra-node Message Size Specific - medium

§ Hybrid MPI direct protocol
§ Both Hybrid MPI's uni-directional and bi-directional transfers perform well over Intel MPI
§ Bi-directional BW >> uni-directional BW

[Figure: message size 4 KB]

SLIDE 31

Intra-node Message Size Specific - large

[Figure: message size 512 KB]

§ The positive impact of Hybrid MPI's synergistic protocol is visible when the number of ranks reaches 60
§ For 512 KB messages → 39 GB/s peak BW, but 8 MB is even better
§ For 8 MB messages → 50 GB/s peak BW

SLIDE 32

Intra-node Message Size Specific Performance

§ Bandwidth increases rapidly with the number of ranks
  § More cores are engaged in active data transfer
  § More memory load/store requests are dispatched to the controllers
  § Prefetching and cache coherence effects during the transfer
  § More activity implies higher aggregate bandwidth
§ In general, for medium/large messages, bi-directional BW > uni-directional BW
  § Hybrid MPI peak bi-directional BW ~50 GB/s vs Intel MPI ~32 GB/s (message size 128 KB, 60 ranks)
  § With synergistic transfer, multiple rank pairs can use multiple channels on the ring interconnect for simultaneous memcpy() in both directions

SLIDE 33

A Benchmark Without Message Matching

§ An experimental control to measure the cost of message matching in MPI
  § Gives an upper limit on bandwidth and latency
§ Algorithm
  § Initialize a shared memory pool to store source and destination memory pointers for messages
    – Uses the extended heap of Hybrid MPI for shared access
  § Presta com benchmark with MPI message matching replaced by atomic synchronization (sketched below)
    – All Hybrid MPI protocols (direct, immediate, synergistic) are in-lined in the benchmark
    – Atomic spin locks (e.g., __sync_bool_compare_and_swap()) synchronize sender and receiver
    – Sender and receiver synchronize → next iteration
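A minimal sketch of the kind of compare-and-swap handshake the slide describes, using the GCC atomic builtin named above; the flag layout, states, and spin loops are illustrative assumptions rather than the benchmark's actual code.

/* Illustrative sender/receiver handshake that replaces MPI matching with
 * a shared state flag living in the shared heap. */
#include <string.h>

enum { IDLE = 0, POSTED = 1, DONE = 2 };

struct slot {
    volatile int state;    /* visible to both ranks via the shared heap */
    const void  *src;      /* sender's buffer */
    size_t       len;
};

/* Sender: publish the buffer, then wait until the receiver has copied it. */
static void sender_iteration(struct slot *s, const void *buf, size_t len)
{
    s->src = buf;
    s->len = len;
    while (!__sync_bool_compare_and_swap(&s->state, IDLE, POSTED))
        ;                                   /* spin until the slot is free     */
    while (s->state != DONE)
        ;                                   /* spin until the copy is finished */
    s->state = IDLE;                        /* ready for the next iteration    */
}

/* Receiver: wait for a posted buffer, copy it directly, signal completion. */
static void receiver_iteration(struct slot *s, void *dst)
{
    while (s->state != POSTED)
        ;                                   /* spin until a message is posted  */
    memcpy(dst, s->src, s->len);            /* direct copy via the shared heap */
    __sync_bool_compare_and_swap(&s->state, POSTED, DONE);
}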

SLIDE 34

A Benchmark Without Message Matching (contd.)

[Figure: (a) small messages (≤ 1 KB), (b) large messages (≥ 1 KB)]

§ Too much strain on the memory subsystem when message size >> cache size
  § Saturates the memory channels/interconnect quickly
§ Peak BW of ~61 GB/s without message matching vs ~50 GB/s in regular mode
  § 35% overhead for message matching at peak

SLIDE 35

Agenda

§ Xeon Phi Platform
§ Traditional MPI Design
§ Hybrid MPI
  § A Shared Heap
  § Communication
§ Experimental Evaluation
  § Micro-benchmarks
  § Applications
§ Hybrid MPI Highlights
§ Towards Hybrid MPI Future

SLIDE 36

Application Benchmarks

§ FFT2D application
  § Representative benchmark developed by T. Hoefler and S. Gottlieb
    – Implements a simple parallel FFT (Fast Fourier Transform) on a 2D array
    – Uses the FFTW library (developed at M.I.T.) for the 1-D decomposition
§ Application performance based on FFT2D variants
  § FFT2D collective
    – Original MPI collective-based implementation
    – Communication with MPI_Alltoall, MPI_Scatter, MPI_Gather, etc.
  § FFT2D point-to-point
    – Since Hybrid MPI implements only point-to-point primitives → collectives are transformed into MPI_Send/Recv/Isend/Irecv/Wait patterns (a sketch of this transformation follows)
§ Performance measurements
  § Application time – time to complete the program
  § Comm time – time spent on data exchange between ranks
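A hedged sketch of how an all-to-all exchange can be expressed with point-to-point primitives; this is a generic non-blocking pairwise pattern for illustration and is not necessarily the exact transformation used in the FFT2D benchmark.

/* Illustrative MPI_Alltoall equivalent built from Isend/Irecv/Waitall. */
#include <mpi.h>
#include <stdlib.h>

/* Each rank sends `count` doubles to every rank and receives `count`
 * doubles from every rank, like MPI_Alltoall on MPI_DOUBLE. */
static void alltoall_p2p(const double *sendbuf, double *recvbuf,
                         int count, MPI_Comm comm)
{
    int size;
    MPI_Comm_size(comm, &size);

    MPI_Request *reqs = malloc(2 * (size_t)size * sizeof(MPI_Request));

    for (int peer = 0; peer < size; peer++) {
        MPI_Irecv(recvbuf + (size_t)peer * count, count, MPI_DOUBLE,
                  peer, 0, comm, &reqs[peer]);
        MPI_Isend((void *)(sendbuf + (size_t)peer * count), count, MPI_DOUBLE,
                  peer, 0, comm, &reqs[size + peer]);
    }
    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}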

SLIDE 37

FFT2D Benchmark Intra-node

§ Delta Improvement = (Intel MPI time − Hybrid MPI time) / Intel MPI time × 100%
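For example, with purely hypothetical numbers, an Intel MPI time of 10 s and a Hybrid MPI time of 6 s would give a delta improvement of (10 − 6) / 10 × 100% = 40%.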

SLIDE 38

FFT2D Benchmark Intra-node (contd.)

§ Up to 240 ranks on the Phi (using 4 hardware threads per core)
§ [app/comm]-relative to point-to-point FFT2D – the Intel MPI baseline is the modified point-to-point benchmark
§ [app/comm]-relative to collective FFT2D – the Intel MPI baseline is the original collective-based benchmark
SLIDE 39

FFT2D Benchmark Intra-node (contd.)

§ Considerable improvement in operational times
  § 5% to 66% in communication time, 4% to 65% in application time
  § More ranks on the Phi → higher improvement
  § ranks ≤ 16 → essentially zero improvement
§ The data show no significant difference between the point-to-point and collective baselines – using the P2P version does not affect validity

SLIDE 40

FFT2D Benchmark Inter-node

§ Up to 900 ranks spanning 30 nodes
§ Inter-node improvement is marginal – dominated by network overhead
  § Hybrid MPI delegates inter-node communication to the underlying MPI layer
  § The 6% improvement at 120 ranks → likely noise or other factors

SLIDE 41

Hybrid MPI Highlights

§ Hybrid MPI highlights
  § Extremely high throughput via shared memory and single/zero-copy techniques
    – 50 GB/s peak BW measured
    – Significant overall improvements for all message sizes (via the Hybrid MPI protocols – immediate/direct/synergistic)
  § Results show improvements in FFT2D application and communication time
    – Up to 65% communication time improvement
  § The higher the number of ranks, the higher the improvement gained by Hybrid MPI

Message size             Improvement
Small (< 512 B)          12% – 68%
Medium (512 B – 8 KB)    45% – 72%
Large (> 8 KB)           65%

SLIDE 42

Towards a Hybrid MPI Future

§ Efficient use of Xeon Phi cores and memory channels
  § Throughput is proportional to the number of cores used
    – More ranks → more bandwidth
  § Achieve higher throughput by balancing the communication load across the available cores
§ Optimizing message matching
  § At peak, 35% of time is spent matching incoming receives
  § Efficient data structures and algorithms to reduce matching overhead
§ Collectives and inter-node implementation
  § Currently Hybrid MPI does not support collectives or a native inter-node mode
  § Use available technologies (e.g., SCIF, IB) to improve off-Phi bandwidth and latency