Efficient Intra-node Communication on Intel MIC Clusters - PowerPoint PPT Presentation


SLIDE 1

Efficient Intra-node Communication on Intel MIC Clusters

Sreeram Potluri, Akshay Venkatesh, Devendar Bureddy, Krishna Kandalla, Dhabaleswar K. Panda

Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University


SLIDE 2

Outline

  • Introduction
  • Problem Statement
  • Hybrid MPI Communication Runtime
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 3

Many Integrated Core (MIC) Architecture

  • Hybrid system architectures with graphics processors have become common - high compute density and high performance per watt
  • Intel introduced the Many Integrated Core (MIC) architecture geared for HPC
  • X86 compatibility - applications and libraries can run out-of-the-box or with minor modifications

  • Many low-power processor cores, hardware threads and wide vector units
  • MPI continues to be a predominant programming model in HPC

[Figure: processor evolution from single-core, dual-core, quad-core, oct-core and twelve-core to hybrid architectures]

SLIDE 4

Programming Models on Clusters with MIC

[Figure: MPI launch modes on a node with Xeon (host) and Xeon Phi, from multi-core centric to many-core centric: Host-only, Offload (MPI on host with offloaded computation), Symmetric (MPI on host and Xeon Phi), and MIC-only]

  • Xeon Phi is the first commercial product based on the MIC architecture
  • Flexibility in launching MPI jobs on clusters with Xeon Phi

SLIDE 5

MPI Communication on a Node with a Xeon Phi

[Figure: communication paths over PCIe between the Intel Xeon host, the Intel Xeon Phi and the IB HCA: Intra-Host, Intra-MIC, Host-to-MIC and MIC-to-Host]

  • Various paths for MPI communication on a node with Xeon Phi

SLIDE 6

Symmetric Communication Stack with MPSS

  • MPSS – Intel Manycore Platform Software Stack

– Shared Memory
– Symmetric Communication Interface (SCIF) – over PCIe
– IB Verbs – through the IB adapter
– IB-SCIF – IB Verbs over SCIF

[Figure: symmetric communication stack on the Host and the Xeon Phi: SHM via POSIX calls, SCIF via the SCIF API, and IB / IB-SCIF via IB Verbs, over PCI-E and the IB HCA]

SLIDE 7

Problem Statement

  • What are the performance characteristics of the different communication channels available on a node with Xeon Phi?
  • How can an MPI communication runtime take advantage of the different channels?
  • Can a low-latency and high-bandwidth hybrid communication channel be designed, leveraging all the channels?
  • What is the impact of such a hybrid communication channel on the performance of benchmarks and applications?

SLIDE 8

Outline

  • Introduction
  • Problem Statement
  • Hybrid MPI Communication Runtime
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 9

MVAPICH2/MVAPICH2-X Software

  • High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
– More than 165,000 downloads directly from the OSU site

– Empowering many TOP500 clusters

  • 7th ranked 204,900-core cluster (Stampede) at TACC
  • 14th ranked 125,980-core cluster (Pleiades) at NASA
  • and many others

– Available with software stacks of many IB, HSE and server vendors including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu

  • Partner in the U.S. NSF-TACC Stampede (9 PFlop) System

SLIDE 10

Intra-MIC Communication

  • Shared Memory Interface (CH3-SHM)

– POSIX shared memory API
– Small messages: pair-wise shared-memory regions between processes
– Large messages: a buffer pool per process; data is divided into chunks (8KB) to pipeline the copy-in and copy-out
– MPSS offers two implementations of memcpy: a multi-threaded copy and a DMA-assisted copy, which offers low latency for large messages
– We use 64KB chunks to trigger the use of DMA-assisted copies for large messages (see the sketch below)

[Figure: MVAPICH2 stack on the Xeon Phi: CH3 with the SHM-CH3 and SCIF-CH3 channels over SCIF]
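
A minimal sketch of the chunked, pipelined shared-memory copy described above. This is not the MVAPICH2 implementation: the ring layout, slot count, and the names shm_send/shm_recv are illustrative, and the ring is assumed to live in a POSIX shared-memory segment (shm_open + mmap) visible to both processes. It only shows how dividing a large message into chunks lets the sender's copy-in overlap with the receiver's copy-out; with 64KB chunks, each memcpy is large enough to use the DMA-assisted path noted above.

    /* Pipelined chunk copy through a shared buffer ring (sketch only).
     * A real implementation would add memory barriers around the flag updates. */
    #include <string.h>
    #include <stddef.h>

    #define CHUNK_SIZE (64 * 1024)   /* 64KB chunks trigger the DMA-assisted memcpy */
    #define NUM_SLOTS  4             /* a few slots so copy-in and copy-out overlap */

    typedef struct {
        volatile int full;           /* set by the sender, cleared by the receiver */
        size_t       len;
        char         data[CHUNK_SIZE];
    } shm_slot_t;

    /* Sender: stream the message through the shared ring, one chunk at a time. */
    void shm_send(shm_slot_t *ring, const char *src, size_t total)
    {
        size_t sent = 0;
        int slot = 0;
        while (sent < total) {
            size_t len = (total - sent < CHUNK_SIZE) ? total - sent : CHUNK_SIZE;
            while (ring[slot].full)  /* wait until the receiver drains this slot */
                ;
            memcpy(ring[slot].data, src + sent, len);   /* copy-in */
            ring[slot].len  = len;
            ring[slot].full = 1;
            sent += len;
            slot = (slot + 1) % NUM_SLOTS;
        }
    }

    /* Receiver: drain chunks as they arrive, overlapping with the sender's copy-in. */
    void shm_recv(shm_slot_t *ring, char *dst, size_t total)
    {
        size_t rcvd = 0;
        int slot = 0;
        while (rcvd < total) {
            while (!ring[slot].full) /* wait until the sender fills this slot */
                ;
            memcpy(dst + rcvd, ring[slot].data, ring[slot].len);  /* copy-out */
            rcvd += ring[slot].len;
            ring[slot].full = 0;
            slot = (slot + 1) % NUM_SLOTS;
        }
    }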

SLIDE 11

Intra-MIC Communication

  • SCIF Channel (CH3-SCIF)

– Control of the DMA engine is given to the user
– API for remote memory access:

  • Registration: scif_register
  • Initiation: scif_writeto/readfrom
  • Completion: scif_fence_signal

– We use a write-based rendezvous protocol (sketched below):

  • Sender sends Request-To-Send (RTS)
  • Receiver responds with Ready-to-Receive (RTR) with the registered buffer offset and flag offset
  • Sender issues scif_writeto followed by scif_fence_signal
  • Both processes poll for flag to be set
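
A minimal sketch of the sender side of this protocol, using the SCIF RMA calls named above (scif_register, scif_writeto, scif_fence_signal from <scif.h> in MPSS). The RTS/RTR exchange itself, error handling, page alignment of the send buffer, and deregistration are omitted; the remote offsets are assumed to have arrived in the receiver's RTR message. This is an illustration of the call sequence, not the MVAPICH2 code.

    #include <scif.h>
    #include <stddef.h>
    #include <sys/types.h>

    int scif_rndv_send(scif_epd_t epd, void *sbuf, size_t len,
                       off_t local_flag_off,   /* registered flag the sender polls */
                       off_t remote_data_off,  /* receiver's registered buffer (from RTR) */
                       off_t remote_flag_off)  /* receiver's flag word (from RTR) */
    {
        /* Register the (page-aligned) send buffer in the SCIF registered address space. */
        off_t loff = scif_register(epd, sbuf, len, 0,
                                   SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
        if (loff == SCIF_REGISTER_FAILED)
            return -1;

        /* Initiate the DMA write into the receiver's registered buffer. */
        if (scif_writeto(epd, loff, len, remote_data_off, 0) != 0)
            return -1;

        /* Fence the RMA and signal completion: write 1 into the local flag and
         * into the receiver's flag, so both processes can poll for it. */
        if (scif_fence_signal(epd, local_flag_off, 1, remote_flag_off, 1,
                              SCIF_FENCE_INIT_SELF |
                              SCIF_SIGNAL_LOCAL | SCIF_SIGNAL_REMOTE) != 0)
            return -1;

        return 0;
    }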

[Figure: MVAPICH2 stack on the Xeon Phi: CH3 with the SHM-CH3 and SCIF-CH3 channels over SCIF]

SLIDE 12

Host-MIC Communication

[Figure: Host-MIC communication stacks: MVAPICH2 with OFA-IB-CH3 and SCIF-CH3 channels on both the Host and the Xeon Phi, using IB Verbs (mlx4_0 or scif0) and SCIF over PCI-E and the IB HCA]

  • IB Channel (OFA-IB-CH3)

– Uses IB Verbs
– Selection of the IB network interface to switch between IB and IB-SCIF

  • SCIF-CH3

– Can be used for communication between the Xeon Phi and the Host

SLIDE 13

Host-MIC Communication: Host-Initiated SCIF

  • DMA can be initiated by the host or the Xeon Phi
  • But performance is not symmetric
  • Host-initiated DMA delivers better performance
  • The host-initiated mode takes advantage of this (see the sketch below):

– Write-based transfer from Host-to-Xeon Phi
– Read-based transfer from Xeon Phi-to-Host
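
A minimal sketch of this host-initiated scheme (a hypothetical helper, not the MVAPICH2 code): the host owns the connected SCIF endpoint and drives the DMA in both directions, issuing scif_writeto for Host-to-MIC transfers and scif_readfrom for MIC-to-Host transfers, so the faster host-initiated DMA is used either way. Offsets refer to previously registered windows on each side.

    #include <scif.h>

    enum xfer_dir { HOST_TO_MIC, MIC_TO_HOST };

    int host_initiated_xfer(scif_epd_t epd, off_t host_off, off_t mic_off,
                            size_t len, enum xfer_dir dir)
    {
        if (dir == HOST_TO_MIC)
            /* Host pushes data into the Xeon Phi's registered window. */
            return scif_writeto(epd, host_off, len, mic_off, 0);
        /* Host pulls data out of the Xeon Phi's registered window. */
        return scif_readfrom(epd, host_off, len, mic_off, 0);
    }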

[Figure: Symmetric vs. Host-Initiated transfers for the Host-to-MIC and MIC-to-Host directions]

  • Symmetric mode to maximize resource utilization on host and Xeon Phi

SLIDE 14

Outline

  • Introduction
  • Problem Statement
  • Hybrid MPI Communication Runtime
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 15

Experimental Setup


  • TACC Stampede Node

– Host

  • Dual-socket oct-core Intel Sandy Bridge (E5-2680 @ 2.70GHz)
  • CentOS release 6.3 (Final)

– MIC

  • SE10P (B0-KNC)
  • 61 cores @ 1085.854 MHz, 4 hardware threads/core
  • OS 2.6.32-279.el6.x86_64, MPSS 2.1.4346-16

– Compiler: Intel Composer_xe_2013.2.146
– Network Adapter: IB FDR MT 4099 HCA
– Enhanced MPI based on MVAPICH2 1.9

SLIDE 16

Intra-MIC Point-to-Point Communication

[Figures: osu_latency (usec), osu_bw (MB/sec) and osu_bibw (MB/sec) vs. message size (4K-4M bytes) for Intra-MIC communication]
  • Default chunk size severely limits performance
  • Tuned block size alleviates it, but shared-memory performance is still low
  • Using SCIF works around these limitations – 75% improvement in latency and 4.0x improvement in bandwidth over SHM-TUNED

SLIDE 17

Host-MIC Point-to-Point Communication

[Figures: osu_latency for Host-MIC communication, small messages (2B-2K) and large messages (4K-4M), latency in usec vs. message size]

  • IB provides a low-latency path – 4.7 usec for 4-byte messages
  • IB-SCIF has overheads due to SCIF and the additional software layer
  • The SCIF designs are already hybrid and use IB for small messages
  • SCIF outperforms IB for large messages – 72% improvement for 4MB messages
  • Host-Initiated SCIF takes advantage of the faster host-initiated DMA – 33% improvement over SCIF for 64KB messages

SLIDE 18

Host-MIC Point-to-Point Communication

[Figures: osu_bw MIC-to-Host, osu_bw Host-to-MIC and osu_bibw for Host-MIC communication, bandwidth in MB/sec vs. message size (4K-4M bytes)]
  • IB bandwidth is limited MIC-to-Host due to a peer-to-peer limitation on Sandy Bridge
  • SCIF works around this, and host-initiated DMA delivers better bandwidth too – 6.6x improvement over IB
  • Host-initiated SCIF is worse than SCIF in bidirectional bandwidth due to wasted resources

SLIDE 19

Collective Communication


  • 16 processes on host + 16 processes on MIC
  • Host-initiated SCIF or symmetric SCIF is selected at the collective level, based on the communication pattern and message size (see the sketch below)
  • Gather, a rooted collective, uses host-initiated SCIF – 75% improvement at 1MB
  • All-to-all uses symmetric SCIF – 78% improvement at 1MB
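
A hypothetical sketch of such a collective-level selection; the names and the rule below are illustrative, not the runtime's actual logic. The idea is that rooted collectives such as Gather map to host-initiated SCIF, while dense patterns such as All-to-all map to symmetric SCIF.

    #include <stddef.h>

    enum scif_mode { SCIF_HOST_INITIATED, SCIF_SYMMETRIC };

    /* is_rooted: the collective has a single root (e.g., MPI_Gather). */
    static enum scif_mode select_scif_mode(int is_rooted, size_t msg_size)
    {
        (void)msg_size;  /* a real runtime would also tune on message size */
        return is_rooted ? SCIF_HOST_INITIATED : SCIF_SYMMETRIC;
    }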

[Figures: osu_gather (root on host) and osu_alltoall latency in usec vs. message size (4K-1M bytes)]

SLIDE 20

Performance of 3D Stencil Communication Benchmark


  • Near-neighbor communication – up to 6 neighbors – 64KB messages (see the sketch below)
  • 67% improvement in time per step

[Figure: 3D stencil time per step (msec) vs. process count (Host + MIC: 4+4, 8+8, 16+16), showing 67% improvement]
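
A minimal sketch of the near-neighbor exchange pattern measured above (not the benchmark's actual code): each process posts nonblocking receives and sends of 64KB to its up-to-6 neighbors and waits for all of them, which makes up one step.

    #include <mpi.h>

    #define MSG_SIZE (64 * 1024)   /* 64KB messages, as in the benchmark */

    void stencil_exchange(int neighbors[6], int nneigh,
                          char sendbuf[6][MSG_SIZE], char recvbuf[6][MSG_SIZE])
    {
        MPI_Request reqs[12];
        int nreq = 0;

        for (int i = 0; i < nneigh; i++)   /* post the receives first */
            MPI_Irecv(recvbuf[i], MSG_SIZE, MPI_CHAR, neighbors[i], 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);

        for (int i = 0; i < nneigh; i++)   /* then the matching sends */
            MPI_Isend(sendbuf[i], MSG_SIZE, MPI_CHAR, neighbors[i], 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* one step completes */
    }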

SLIDE 21

Performance of P3DFFT Library


  • (MPI + OpenMP) version of a popular library for 3D Fast Fourier Transforms – the test performs a forward transform and a backward transform in each iteration
  • 2 processes on Host (8 threads/process) + 8 processes on MIC (8 threads/process)
  • Uses symmetric SCIF because of the MPI_Alltoall
  • Up to 19% improvement using SCIF-ENHANCED

[Figure: P3DFFT time per loop (sec) vs. problem size (256x256x256 and 512x512x512), showing 19% and 16% improvement]

SLIDE 22

Conclusion and Future Work

  • A hybrid communication runtime to optimize intra-node MPI communication on clusters with Xeon Phi
  • Takes advantage of SCIF in addition to standard channels like shared memory and IB
  • Up to 75% improvement in latency and 6x improvement in unidirectional bandwidth for MIC-Host communication
  • Up to 78% improvement in MPI_Alltoall performance
  • Considerable improvements with 3D Stencil and P3DFFT kernels
  • Focus on optimizations for shared-memory-based communication
  • Working on designs for inter-node communication on clusters with Xeon Phi


SLIDE 23

Thank You!

{potluri, akshay, bureddy, kandalla, panda} @cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/
