Efficient Intra-node Communication on Intel MIC Clusters - PowerPoint PPT Presentation


SLIDE 1

Efficient Intra-node Communication on Intel MIC Clusters

Sreeram Potluri, Akshay Venkatesh, Devendar Bureddy, Krishna Kandalla, Dhabaleswar K. Panda

Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University


SLIDE 2

Outline

  • Introduction
  • Problem Statement
  • Hybrid MPI Communication Runtime
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 3

Many Integrated Core (MIC) Architecture

  • Hybrid system architectures with graphics processors have become common - high compute density and high performance per watt
  • Intel introduced the Many Integrated Core (MIC) architecture geared for HPC
  • X86 compatibility - applications and libraries can run out-of-the-box or with minor modifications

  • Many low-power processor cores, hardware threads and wide vector units
  • MPI continues to be a predominant programming model in HPC

[Figure: processor evolution from single-core, dual-core, quad-core, oct-core and twelve-core to hybrid architectures]

SLIDE 4

Programming Models on Clusters with MIC

[Figure: MPI launch modes on a node with Xeon (host) and Xeon Phi, from multi-core centric to many-core centric: Host-only, Offload (MPI on host with offloaded computation), Symmetric (MPI on host and Xeon Phi), and MIC-only]

  • Xeon Phi is the first commercial product based on the MIC architecture
  • Flexibility in launching MPI jobs on clusters with Xeon Phi

SLIDE 5

MPI Communication on a Node with a Xeon Phi

[Figure: communication paths over PCIe between the Intel Xeon host, the Intel Xeon Phi and the IB HCA: Intra-Host, Intra-MIC, Host-to-MIC and MIC-to-Host]

  • Various paths for MPI communication on a node with Xeon Phi

SLIDE 6

Symmetric Communication Stack with MPSS

  • MPSS – Intel Manycore Platform Software Stack

– Shared Memory
– Symmetric Communication Interface (SCIF) – over PCIe
– IB Verbs – through the IB adapter
– IB-SCIF – IB Verbs over SCIF

[Figure: symmetric communication stack on the Host and the Xeon Phi: SHM via POSIX calls, SCIF via the SCIF API, and IB / IB-SCIF via IB Verbs, over PCI-E and the IB HCA]

SLIDE 7

Problem Statement

  • What are the performance characteristics of the different communication channels available on a node with Xeon Phi?
  • How can an MPI communication runtime take advantage of the different channels?
  • Can a low-latency and high-bandwidth hybrid communication channel be designed, leveraging all the channels?
  • What is the impact of such a hybrid communication channel on the performance of benchmarks and applications?

SLIDE 8

Outline

  • Introduction
  • Problem Statement
  • Hybrid MPI Communication Runtime
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 9

MVAPICH2/MVAPICH2-X Software

  • High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)

– MVAPICH (MPI-1) and MVAPICH2 (MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,000 organizations (HPC centers, industry and universities) in 70 countries
– More than 165,000 downloads directly from the OSU site

– Empowering many TOP500 clusters

  • 7th ranked 204,900-core cluster (Stampede) at TACC
  • 14th ranked 125,980-core cluster (Pleiades) at NASA
  • and many others

– Available with software stacks of many IB, HSE and server vendors including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu

  • Partner in the U.S. NSF-TACC Stampede (9 PFlop) System

SLIDE 10

Intra-MIC Communication

  • Shared Memory Interface (CH3-SHM)

– POSIX shared memory API
– Small messages: pair-wise shared-memory regions between processes
– Large messages: a buffer pool per process; data is divided into chunks (8KB) to pipeline the copy-in and copy-out
– MPSS offers two implementations of memcpy: a multi-threaded copy and a DMA-assisted copy, which offers low latency for large messages
– We use 64KB chunks to trigger the use of DMA-assisted copies for large messages (see the sketch below)

[Figure: MVAPICH2 stack on the Xeon Phi: CH3 with the SHM-CH3 and SCIF-CH3 channels over SCIF]
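
A minimal sketch of the chunked, pipelined shared-memory copy described above. This is not the MVAPICH2 implementation: the ring layout, slot count, and the names shm_send/shm_recv are illustrative, and the ring is assumed to live in a POSIX shared-memory segment (shm_open + mmap) visible to both processes. It only shows how dividing a large message into chunks lets the sender's copy-in overlap with the receiver's copy-out; with 64KB chunks, each memcpy is large enough to use the DMA-assisted path noted above.

    /* Pipelined chunk copy through a shared buffer ring (sketch only).
     * A real implementation would add memory barriers around the flag updates. */
    #include <string.h>
    #include <stddef.h>

    #define CHUNK_SIZE (64 * 1024)   /* 64KB chunks trigger the DMA-assisted memcpy */
    #define NUM_SLOTS  4             /* a few slots so copy-in and copy-out overlap */

    typedef struct {
        volatile int full;           /* set by the sender, cleared by the receiver */
        size_t       len;
        char         data[CHUNK_SIZE];
    } shm_slot_t;

    /* Sender: stream the message through the shared ring, one chunk at a time. */
    void shm_send(shm_slot_t *ring, const char *src, size_t total)
    {
        size_t sent = 0;
        int slot = 0;
        while (sent < total) {
            size_t len = (total - sent < CHUNK_SIZE) ? total - sent : CHUNK_SIZE;
            while (ring[slot].full)  /* wait until the receiver drains this slot */
                ;
            memcpy(ring[slot].data, src + sent, len);   /* copy-in */
            ring[slot].len  = len;
            ring[slot].full = 1;
            sent += len;
            slot = (slot + 1) % NUM_SLOTS;
        }
    }

    /* Receiver: drain chunks as they arrive, overlapping with the sender's copy-in. */
    void shm_recv(shm_slot_t *ring, char *dst, size_t total)
    {
        size_t rcvd = 0;
        int slot = 0;
        while (rcvd < total) {
            while (!ring[slot].full) /* wait until the sender fills this slot */
                ;
            memcpy(dst + rcvd, ring[slot].data, ring[slot].len);  /* copy-out */
            rcvd += ring[slot].len;
            ring[slot].full = 0;
            slot = (slot + 1) % NUM_SLOTS;
        }
    }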

SLIDE 11

Intra-MIC Communication

  • SCIF Channel (CH3-SCIF)

– Control of the DMA engine is given to the user
– API for remote memory access:

  • Registration: scif_register
  • Initiation: scif_writeto/readfrom
  • Completion: scif_fence_signal

– We use a write-based rendezvous protocol (sketched below):

  • Sender sends Request-To-Send (RTS)
  • Receiver responds with Ready-to-Receive (RTR) with the registered buffer offset and flag offset
  • Sender issues scif_writeto followed by scif_fence_signal
  • Both processes poll for flag to be set
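
A minimal sketch of the sender side of this protocol, using the SCIF RMA calls named above (scif_register, scif_writeto, scif_fence_signal from <scif.h> in MPSS). The RTS/RTR exchange itself, error handling, page alignment of the send buffer, and deregistration are omitted; the remote offsets are assumed to have arrived in the receiver's RTR message. This is an illustration of the call sequence, not the MVAPICH2 code.

    #include <scif.h>
    #include <stddef.h>
    #include <sys/types.h>

    int scif_rndv_send(scif_epd_t epd, void *sbuf, size_t len,
                       off_t local_flag_off,   /* registered flag the sender polls */
                       off_t remote_data_off,  /* receiver's registered buffer (from RTR) */
                       off_t remote_flag_off)  /* receiver's flag word (from RTR) */
    {
        /* Register the (page-aligned) send buffer in the SCIF registered address space. */
        off_t loff = scif_register(epd, sbuf, len, 0,
                                   SCIF_PROT_READ | SCIF_PROT_WRITE, 0);
        if (loff == SCIF_REGISTER_FAILED)
            return -1;

        /* Initiate the DMA write into the receiver's registered buffer. */
        if (scif_writeto(epd, loff, len, remote_data_off, 0) != 0)
            return -1;

        /* Fence the RMA and signal completion: write 1 into the local flag and
         * into the receiver's flag, so both processes can poll for it. */
        if (scif_fence_signal(epd, local_flag_off, 1, remote_flag_off, 1,
                              SCIF_FENCE_INIT_SELF |
                              SCIF_SIGNAL_LOCAL | SCIF_SIGNAL_REMOTE) != 0)
            return -1;

        return 0;
    }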

[Figure: MVAPICH2 stack on the Xeon Phi: CH3 with the SHM-CH3 and SCIF-CH3 channels over SCIF]

SLIDE 12

Host-MIC Communication

[Figure: Host-MIC communication stacks: MVAPICH2 with OFA-IB-CH3 and SCIF-CH3 channels on both the Host and the Xeon Phi, using IB Verbs (mlx4_0 or scif0) and SCIF over PCI-E and the IB HCA]

  • IB Channel (OFA-IB-CH3)

– Uses IB Verbs
– Selection of the IB network interface to switch between IB and IB-SCIF

  • SCIF-CH3

– Can be used for communication between the Xeon Phi and the Host

SLIDE 13

Host-MIC Communication: Host-Initiated SCIF

  • DMA can be initiated by the host or the Xeon Phi
  • But performance is not symmetric
  • Host-initiated DMA delivers better performance
  • The host-initiated mode takes advantage of this (see the sketch below):

– Write-based transfer from Host-to-Xeon Phi
– Read-based transfer from Xeon Phi-to-Host
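
A minimal sketch of this host-initiated scheme (a hypothetical helper, not the MVAPICH2 code): the host owns the connected SCIF endpoint and drives the DMA in both directions, issuing scif_writeto for Host-to-MIC transfers and scif_readfrom for MIC-to-Host transfers, so the faster host-initiated DMA is used either way. Offsets refer to previously registered windows on each side.

    #include <scif.h>

    enum xfer_dir { HOST_TO_MIC, MIC_TO_HOST };

    int host_initiated_xfer(scif_epd_t epd, off_t host_off, off_t mic_off,
                            size_t len, enum xfer_dir dir)
    {
        if (dir == HOST_TO_MIC)
            /* Host pushes data into the Xeon Phi's registered window. */
            return scif_writeto(epd, host_off, len, mic_off, 0);
        /* Host pulls data out of the Xeon Phi's registered window. */
        return scif_readfrom(epd, host_off, len, mic_off, 0);
    }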

[Figure: Symmetric vs. Host-Initiated transfers for the Host-to-MIC and MIC-to-Host directions]

  • Symmetric mode to maximize resource utilization on host and Xeon Phi

SLIDE 14

Outline

  • Introduction
  • Problem Statement
  • Hybrid MPI Communication Runtime
  • Performance Evaluation
  • Conclusion and Future Work

SLIDE 15

Experimental Setup


  • TACC Stampede Node

– Host

  • Dual-socket oct-core Intel Sandy Bridge (E5-2680 @ 2.70GHz)
  • CentOS release 6.3 (Final)

– MIC

  • SE10P (B0-KNC)
  • 61 cores @ 1085.854 MHz, 4 hardware threads/core
  • OS 2.6.32-279.el6.x86_64, MPSS 2.1.4346-16

– Compiler: Intel Composer_xe_2013.2.146
– Network Adapter: IB FDR MT 4099 HCA
– Enhanced MPI based on MVAPICH2 1.9

SLIDE 16

Intra-MIC Point-to-Point Communication

[Figures: osu_latency (usec), osu_bw (MB/sec) and osu_bibw (MB/sec) vs. message size (4K-4M bytes) for Intra-MIC communication]
  • Default chunk size severely limits performance
  • Tuned block size alleviates it, but shared-memory performance is still low
  • Using SCIF works around these limitations – 75% improvement in latency and 4.0x improvement in bandwidth over SHM-TUNED

SLIDE 17

Host-MIC Point-to-Point Communication

[Figures: osu_latency for Host-MIC communication, small messages (2B-2K) and large messages (4K-4M), latency in usec vs. message size]

  • IB provides a low-latency path – 4.7 usec for 4-byte messages
  • IB-SCIF has overheads due to SCIF and the additional software layer
  • The SCIF designs are already hybrid and use IB for small messages
  • SCIF outperforms IB for large messages – 72% improvement for 4MB messages
  • Host-Initiated SCIF takes advantage of the faster host-initiated DMA – 33% improvement over SCIF for 64KB messages

SLIDE 18

Host-MIC Point-to-Point Communication

[Figures: osu_bw MIC-to-Host, osu_bw Host-to-MIC and osu_bibw for Host-MIC communication, bandwidth in MB/sec vs. message size (4K-4M bytes)]
  • IB bandwidth is limited MIC-to-Host due to a peer-to-peer limitation on Sandy Bridge
  • SCIF works around this, and host-initiated DMA delivers better bandwidth too – 6.6x improvement over IB
  • Host-initiated SCIF is worse than SCIF in bidirectional bandwidth due to wasted resources

SLIDE 19

Collective Communication


  • 16 processes on host + 16 processes on MIC
  • Host-initiated SCIF or symmetric SCIF is selected at the collective level, based on the communication pattern and message size (see the sketch below)
  • Gather, a rooted collective, uses host-initiated SCIF – 75% improvement at 1MB
  • All-to-all uses symmetric SCIF – 78% improvement at 1MB
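
A hypothetical sketch of such a collective-level selection; the names and the rule below are illustrative, not the runtime's actual logic. The idea is that rooted collectives such as Gather map to host-initiated SCIF, while dense patterns such as All-to-all map to symmetric SCIF.

    #include <stddef.h>

    enum scif_mode { SCIF_HOST_INITIATED, SCIF_SYMMETRIC };

    /* is_rooted: the collective has a single root (e.g., MPI_Gather). */
    static enum scif_mode select_scif_mode(int is_rooted, size_t msg_size)
    {
        (void)msg_size;  /* a real runtime would also tune on message size */
        return is_rooted ? SCIF_HOST_INITIATED : SCIF_SYMMETRIC;
    }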

[Figures: osu_gather (root on host) and osu_alltoall latency in usec vs. message size (4K-1M bytes)]

SLIDE 20

Performance of 3D Stencil Communication Benchmark


  • Near-neighbor communication – up to 6 neighbors – 64KB messages (see the sketch below)
  • 67% improvement in time per step

[Figure: 3D stencil time per step (msec) vs. process count (Host + MIC: 4+4, 8+8, 16+16), showing 67% improvement]
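
A minimal sketch of the near-neighbor exchange pattern measured above (not the benchmark's actual code): each process posts nonblocking receives and sends of 64KB to its up-to-6 neighbors and waits for all of them, which makes up one step.

    #include <mpi.h>

    #define MSG_SIZE (64 * 1024)   /* 64KB messages, as in the benchmark */

    void stencil_exchange(int neighbors[6], int nneigh,
                          char sendbuf[6][MSG_SIZE], char recvbuf[6][MSG_SIZE])
    {
        MPI_Request reqs[12];
        int nreq = 0;

        for (int i = 0; i < nneigh; i++)   /* post the receives first */
            MPI_Irecv(recvbuf[i], MSG_SIZE, MPI_CHAR, neighbors[i], 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);

        for (int i = 0; i < nneigh; i++)   /* then the matching sends */
            MPI_Isend(sendbuf[i], MSG_SIZE, MPI_CHAR, neighbors[i], 0,
                      MPI_COMM_WORLD, &reqs[nreq++]);

        MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);  /* one step completes */
    }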

SLIDE 21

Performance of P3DFFT Library


  • (MPI + OpenMP) version of a popular library for 3D Fast Fourier Transforms – the test performs a forward transform and a backward transform in each iteration
  • 2 processes on Host (8 threads/process) + 8 processes on MIC (8 threads/process)
  • Uses symmetric SCIF because of the MPI_Alltoall
  • Up to 19% improvement using SCIF-ENHANCED

[Figure: P3DFFT time per loop (sec) vs. problem size (256x256x256 and 512x512x512), showing 19% and 16% improvement]

SLIDE 22

Conclusion and Future Work

  • A hybrid communication runtime to optimize intra-node MPI communication on clusters with Xeon Phi
  • Takes advantage of SCIF in addition to standard channels like shared memory and IB
  • Up to 75% improvement in latency and 6x improvement in unidirectional bandwidth for MIC-Host communication
  • Up to 78% improvement in MPI_Alltoall performance
  • Considerable improvements with 3D Stencil and P3DFFT kernels
  • Focus on optimizations for shared-memory-based communication
  • Working on designs for inter-node communication on clusters with Xeon Phi


SLIDE 23

Thank You!

{potluri, akshay, bureddy, kandalla, panda} @cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu/
