Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters


SLIDE 1

Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters

Akshay Venkatesh, Krishna Kandalla, Dhabaleswar K. Panda

Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University

SLIDE 2

Outline

  • Introduction
  • Problem Statement
  • Designs
  • Experimental Evaluation and Analyses
  • Conclusion and Future Work
SLIDE 3

Scientific applications, Accelerators and MPI

  • Several areas such as medical sciences, atmospheric research, and earthquake simulation rely on speed of computation for better prediction/analysis
  • Many instances of applications benefit from the use of large-scale systems
  • Accelerators/coprocessors further increase computational speed and improve energy efficiency
  • MPI continues to be widely used as HPC embraces heterogeneous architectures

SLIDE 4

MVAPICH2/MVAPICH2-X Software

  • MPI(+X) continues to be the predominant programming model in HPC
  • High-performance, open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
    – MVAPICH2-X (MPI + PGAS), available since 2012
    – Used by more than 2,055 organizations (HPC centers, industry, and universities) in 70 countries
    – More than 181,000 downloads directly from the OSU site
    – Empowering many TOP500 clusters
      • 6th-ranked 462,462-core cluster (Stampede) at TACC
      • 19th-ranked 125,980-core cluster (Pleiades) at NASA
      • 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, and many others
    – Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Partner in the U.S. NSF-TACC Stampede system

SLIDE 5

Latest Version of MVAPICH2

  • Support for GPU-aware MPI
  • Optimized and tuned point-to-point operations involving GPU buffers
  • Support for GPUDirect RDMA
  • Optimized GPU collectives
  • Ongoing effort to design a high-performance library that enables MPI communication on MIC clusters

SLIDE 6

Intel Xeon Phi Specifications

  • Belongs to the Many Integrated Core (MIC) family
  • 61 cores on the chip (each running at 1 GHz)
  • 4 hardware threads per core (smart round-robin scheduler)
  • 1 teraflop peak throughput and energy efficient
  • x86-compatible; supports OpenMP, MPI, Cilk, etc.
  • Installed in the compute node as a PCI Express device
SLIDE 7

MPI on MIC Clusters (1)

  • Stampede has ~6,000 MIC coprocessors
  • Tianhe-2 has ~48,000 MIC coprocessors
  • Through MPSS*, MIC coprocessors can directly use IB HCAs through peer-to-peer PCI communication for inter- and intra-node communication
  • MPI is predominantly used to harness multiple such compute nodes in tandem

*MPSS – Intel Manycore Platform Software Stack

SLIDE 8

MPI on MIC Clusters (2)

  • MIC supports various modes of operation:
    – Offload mode
    – Coprocessor-only (native) mode
    – Symmetric mode
  • Non-uniform source and destination platforms:
    – Host to MIC, MIC to host, MIC to MIC
  • Transfers involving the MIC incur additional cost owing to the expensive PCIe path
SLIDE 9

Symmetric Mode and Implications

  • Non-uniform source and destination platforms:
    – Host to MIC
    – MIC to host
    – MIC to MIC
  • Transfers involving the MIC incur additional cost owing to the expensive PCIe path
  • Performance is non-uniform
SLIDE 10

Symmetric mode and Implications

  • MIC->MIC latency is ~8x host->host latency
  • Host->IB NIC bandwidth is ~6x MIC->IB NIC bandwidth

[Plot: latency (us, x1000) vs. message size (256K-16M bytes) for host->remote_host, host->remote_mic, mic->remote_host, and mic->remote_mic]

SLIDE 11

MPI on MIC Clusters (3)

  • MIC supports various modes of operation:
    – Offload mode
    – Coprocessor-only (native) mode
    – Symmetric mode
  • Besides point-to-point primitives, the MPI Standard also defines a set of collectives such as:
    – MPI_Bcast
    – MPI_Gather

SLIDE 12

MPI_Gather

  • Gather is used in:
    – Multi-agent heuristics
    – Mini applications
    – Reduction operations, and more (a minimal usage example follows)

[Diagram: three processes each contribute elements 1, 2, 3, which are collected at the ROOT]
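To make the semantics concrete, here is a minimal MPI_Gather sketch (ours, not from the original slides): every rank contributes three integers, and the root receives them ordered by source rank. The count of 3 and the choice of rank 0 as root are arbitrary.

/* Minimal MPI_Gather sketch: each rank contributes three ints; the
 * root (rank 0 here, an arbitrary choice) receives them in rank order. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendbuf[3] = { rank, rank, rank };  /* this rank's contribution */
    int *recvbuf = NULL;
    if (rank == 0)                          /* only the root needs space */
        recvbuf = malloc((size_t)size * 3 * sizeof(int));

    /* After this call the root holds size * 3 ints, ordered by rank. */
    MPI_Gather(sendbuf, 3, MPI_INT, recvbuf, 3, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("root gathered %d ints\n", size * 3);
        free(recvbuf);
    }
    MPI_Finalize();
    return 0;
}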

SLIDE 13

MPI_Gather (2)

  • MPI_Gather:
    – One root process receives data from every other process
  • On homogeneous systems the collective adopts one of:
    – Linear scheme
    – Binomial scheme (a sketch follows)
    – Hierarchical scheme
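As a reference point for the binomial scheme, here is a hedged sketch of our own (not the library's implementation): data flows toward the root along a binomial tree in log2(size) rounds, with the root fixed at rank 0 and equal byte counts per rank assumed for brevity.

/* Binomial-tree gather built from p2p calls; root fixed at rank 0.
 * tmp holds data for ranks [rank, rank + held) in rank order. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

void binomial_gather(const char *sendbuf, int nbytes, char *recvbuf,
                     MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    char *tmp = (rank == 0) ? recvbuf : malloc((size_t)size * nbytes);
    memcpy(tmp, sendbuf, (size_t)nbytes);
    int held = 1;                       /* chunks accumulated so far */

    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank & mask) {
            /* Our turn to send everything accumulated to the partner. */
            MPI_Send(tmp, held * nbytes, MPI_BYTE, rank - mask, 0, comm);
            break;
        } else if (rank + mask < size) {
            /* Partner holds min(mask, size - rank - mask) chunks. */
            int incoming = (rank + 2 * mask <= size)
                               ? mask : size - rank - mask;
            MPI_Recv(tmp + (size_t)held * nbytes, incoming * nbytes,
                     MPI_BYTE, rank + mask, 0, comm, MPI_STATUS_IGNORE);
            held += incoming;
        }
    }
    if (rank != 0) free(tmp);
}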

SLIDE 14

Typical MPI_Gather on MIC Clusters

  • Yellow grid boxes represent host processors
  • Blue grid boxes represent MIC coprocessors

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

SLIDE 15

Typical MPI_Gather on MIC Clusters

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA; local leaders highlighted]

  • Hierarchical scheme, or leader-based scheme
  • Communicator per node
  • Leader is the least rank in the node (a sketch of the scheme follows)
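A minimal sketch of this leader-based scheme, under our own simplifying assumptions: equal contributions of COUNT ints per process, equally sized nodes, and a gather root that is itself node leader 0. MPI_COMM_TYPE_SHARED yields the per-node communicator, and rank 0 within it (the least rank) acts as leader; the function and variable names are ours.

/* Two-level, leader-based gather (assumptions: equal COUNT per process,
 * equally sized nodes, root is leader 0). Names are illustrative. */
#include <mpi.h>
#include <stdlib.h>
#define COUNT 4

void leader_gather(const int *sendbuf, int *recvbuf, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    /* One communicator per node; rank 0 in it is the least-rank leader. */
    MPI_Comm node;
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                        MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    /* Level 1: intra-node gather at the leader. */
    int *staged = NULL;
    if (nrank == 0)
        staged = malloc((size_t)nsize * COUNT * sizeof(int));
    MPI_Gather(sendbuf, COUNT, MPI_INT, staged, COUNT, MPI_INT, 0, node);

    /* Level 2: leaders gather the per-node aggregates at the root.
     * With unequal node sizes MPI_Gatherv would be needed instead. */
    MPI_Comm leaders;
    MPI_Comm_split(comm, nrank == 0 ? 0 : MPI_UNDEFINED, rank, &leaders);
    if (nrank == 0) {
        MPI_Gather(staged, nsize * COUNT, MPI_INT,
                   recvbuf, nsize * COUNT, MPI_INT, 0, leaders);
        MPI_Comm_free(&leaders);
        free(staged);
    }
    MPI_Comm_free(&node);
}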

SLIDE 16

Typical MPI_Gather on MIC Clusters

  • Local leader on the MIC directly uses the NIC
  • IB reading from the MIC is costly
  • When the root of the gather is on a MIC, there are transfers from the local MIC to remote MICs
  • This can be very costly

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

SLIDE 17

Outline

  • Introduction
  • Problem Statement
  • Designs
  • Experimental Evaluation and Analyses
  • Conclusion and Future Work
SLIDE 18

Problem Statement

  • What are the primary bottlenecks that affect the performance of the MPI_Gather operation on MIC clusters?
  • Can we design algorithms to overcome architecture-specific performance deficits to improve gather latency?
  • Can we analyze and quantify the benefits of our proposed approach using micro-benchmarks?

SLIDE 19

MPI Collectives on MIC

  • Primitive operations such as MPI_Send, MPI_Recv, and their non-blocking counterparts have been optimized*
  • MPI collectives such as MPI_Alltoall, MPI_Scatter, etc., which are designed on top of p2p primitives, immediately benefit from such optimizations (see the sketch after this list)
  • Can we do better?
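As an illustration of that layering, here is a hedged sketch of our own: a flat, linear gather written purely with p2p primitives, so any tuning of MPI_Send/MPI_Recv transparently speeds it up. Counts are in bytes and assumed equal across ranks.

/* Linear gather built only from p2p calls, so p2p optimizations
 * carry over directly. Equal byte counts per rank are assumed. */
#include <mpi.h>
#include <string.h>

void linear_gather(const void *sendbuf, int nbytes, void *recvbuf,
                   int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == root) {
        char *dst = (char *)recvbuf;
        memcpy(dst + (size_t)rank * nbytes, sendbuf, (size_t)nbytes);
        for (int src = 0; src < size; src++)
            if (src != root)
                MPI_Recv(dst + (size_t)src * nbytes, nbytes, MPI_BYTE,
                         src, 0, comm, MPI_STATUS_IGNORE);
    } else {
        MPI_Send((void *)sendbuf, nbytes, MPI_BYTE, root, 0, comm);
    }
}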

*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid’13

SLIDE 20

Outline

  • Introduction
  • Problem Statement
  • Designs
  • Experimental Evaluation and Analyses
  • Conclusion and Future Work
SLIDE 21

Design Goals

  • Avoid IB reading from the MIC
    – Especially for large transfers
  • Use pipelining methods
  • Overlap operations when possible
SLIDE 22

Designs

  • 3-level hierarchical algorithm
  • Pipelined Algorithm
  • Overlapped 3-Level-hierarchical algorithm
SLIDE 23

Designs

  • 3-level hierarchical algorithm
  • Pipelined Algorithm
  • Overlapped 3-Level-hierarchical algorithm
SLIDE 24

Design 1: 3-Level-Hierarchical Algorithm

  • Step 1: Same as the default hierarchical, or leader-based, scheme

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA; local leaders highlighted]

SLIDE 25

Design 1: 3-Level-Hierarchical Algorithm

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • Step 2: Transfer of MIC-aggregated data to the host over PCIe
  • The difference: SCIF is used, and its performance is relatively competitive

SLIDE 26

Design 1: 3-Level-Hierarchical Algorithm

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • Advantage: Step 3, which involves transferring the large aggregate message, does not involve any IB reads from the MIC
  • Disadvantage: MIC cores are slow, and hence intra-MIC gathers are slower
  • => The host leader needs to wait

SLIDE 27

Designs

  • 3-level hierarchical algorithm
  • Pipelined Algorithm
  • Overlapped 3-Level-hierarchical algorithm
SLIDE 28

Design 2: Pipelined Gather

  • The leader on each host posts non-blocking receives from all processes within the node
  • The leader sends its own data to the gather root
  • Each process within a node sends to the leader on the host
  • The host leader forwards to the gather root (a sketch follows)

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]
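A hedged sketch of the host leader's side of this design, under our own simplifying assumptions (fixed CHUNK sizes, a tag carrying the local rank, a root identified by its world rank): receives are pre-posted, the leader's own chunk goes out first, and each in-node chunk is forwarded as soon as it arrives.

/* Host-leader side of the pipelined gather: pre-post receives from all
 * in-node processes, then forward each chunk to the gather root as it
 * completes. CHUNK, tags, and rank layout are illustrative assumptions. */
#include <mpi.h>
#include <stdlib.h>

void pipelined_leader(const char *mydata, int CHUNK, int root,
                      MPI_Comm world, MPI_Comm node)
{
    int nsize;
    MPI_Comm_size(node, &nsize);

    char *stage = malloc((size_t)nsize * CHUNK);
    MPI_Request *reqs = malloc((size_t)nsize * sizeof(MPI_Request));

    /* Pre-post non-blocking receives (slot 0 is the leader itself). */
    reqs[0] = MPI_REQUEST_NULL;
    for (int r = 1; r < nsize; r++)
        MPI_Irecv(stage + (size_t)r * CHUNK, CHUNK, MPI_BYTE,
                  r, 0, node, &reqs[r]);

    /* Send the leader's own chunk to the root first... */
    MPI_Send((void *)mydata, CHUNK, MPI_BYTE, root, 0, world);

    /* ...then forward each in-node chunk as it arrives (pipelining).
     * The tag carries the local rank so the root can place the data;
     * a real implementation tracks fuller placement metadata. */
    for (int done = 1; done < nsize; done++) {
        int idx;
        MPI_Waitany(nsize, reqs, &idx, MPI_STATUS_IGNORE);
        MPI_Send(stage + (size_t)idx * CHUNK, CHUNK, MPI_BYTE,
                 root, idx, world);
    }
    free(reqs);
    free(stage);
}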

SLIDE 29

Design 2: Pipelined Gather

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • Advantage: None of the steps involve IB reading from the MIC
  • Disadvantage: The majority of the transfer burden lies on the host-node leaders
  • => Processing non-blocking receives on the host leader can become a bottleneck

SLIDE 30

Designs

  • 3-level hierarchical algorithm
  • Pipelined Algorithm
  • Overlapped 3-Level-hierarchical algorithm
SLIDE 31

Design 3: 3-Level-Hierarchical Overlapped Algorithm

  • Step 1: Same as the default hierarchical, or leader-based, scheme

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA; local leaders highlighted]

SLIDE 32

Design 3: 3-Level-Hierarchical Overlapped Algorithm

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • The host leader first posts a non-blocking receive from the MIC leader in the node
  • The host leader gathers data from the host processes and sends it to the root
  • Meanwhile, the MIC leader starts sending data to the host leader (sketched below)
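A hedged sketch of the host leader's overlap (ours; chunk sizes, tags, and the MIC leader's rank are assumptions): posting the receive from the MIC leader before the host-side gather lets the slow PCIe transfer proceed concurrently with the host-side work.

/* Host-leader side of the overlapped design: post the receive for the
 * MIC leader's aggregate first, overlap it with the host-side gather
 * and first forward, then wait for and forward the MIC data. The rank
 * of the MIC leader, chunk sizes, and tags are illustrative. */
#include <mpi.h>
#include <stdlib.h>

void overlapped_host_leader(const char *mydata, int CHUNK,
                            int host_n, int mic_n, int mic_leader,
                            int root, MPI_Comm world, MPI_Comm host_node)
{
    char *host_agg = malloc((size_t)host_n * CHUNK);
    char *mic_agg  = malloc((size_t)mic_n * CHUNK);
    MPI_Request req;

    /* 1. Post the receive for the MIC leader's aggregate up front,
     *    so the MIC->host transfer proceeds in the background. */
    MPI_Irecv(mic_agg, mic_n * CHUNK, MPI_BYTE, mic_leader, 0,
              world, &req);

    /* 2. Gather from host-local processes and forward to the root;
     *    this work overlaps with the transfer started above. */
    MPI_Gather((void *)mydata, CHUNK, MPI_BYTE,
               host_agg, CHUNK, MPI_BYTE, 0, host_node);
    MPI_Send(host_agg, host_n * CHUNK, MPI_BYTE, root, 1, world);

    /* 3. Only now wait for the slower MIC aggregate and forward it. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Send(mic_agg, mic_n * CHUNK, MPI_BYTE, root, 2, world);

    free(host_agg);
    free(mic_agg);
}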

SLIDE 33

Design 3: 3-Level-Hierarchical Overlapped Algorithm

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • Advantage: No step involves transferring the large aggregate message using IB reads from the MIC
  • Advantage: The host does not wait for the relatively slow MIC leader to finish and send data
  • Experiments show good overlap
  • The host leader forwards the MIC data to the gather root

SLIDE 34

Designs

  • 3-level hierarchical algorithm
  • Pipelined Algorithm
  • Overlapped 3-Level-hierarchical algorithm

– Direct variant for MIC-rooted gathers

SLIDE 35

Design 3: 3-Level-Hierarchical Overlapped Algorithm (MIC Root)

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • Root is on the MIC
  • Data is still channeled through the host leader on the node
  • Advantage: SCIF write is used

SLIDE 36

Design 3: 3-Level-Hierarchical Overlapped Algorithm (MIC Root)

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • The host leader on the node where the MIC gather root resides becomes a bottleneck

SLIDE 37

Design 3: 3-Level-Hierarchical Overlapped Algorithm (Direct Variant)

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • IB write performance is competitive

SLIDE 38

Design 3: 3-Level-Hierarchical Overlapped Algorithm (Direct Variant)

[Diagram: Nodes 0-3, each with a Host, a MIC, and an HCA]

  • The HCA directly writes into the MIC
  • Limitations remain in the very large message range, but they are alleviated in the medium message range

SLIDE 39

Outline

  • Introduction
  • Problem Statement
  • Designs
  • Experimental Evaluation and Analyses
  • Conclusion and Future Work
SLIDE 40

Experimental Setup

  • Stampede cluster
  • Up to 64 nodes with Sandy Bridge-EP (Xeon E5-2680) processors and Xeon Phi (Knights Corner) coprocessors used
  • Each host has 32 GB of memory; each MIC has 8 GB
  • Mellanox FDR switches (SX6036) used for the interconnect
  • OSU Micro-Benchmark suite (OMB)
  • Optimizations added to MVAPICH2 1.9 (MV2-MIC)
    – This version already has intra-node SCIF-based optimizations
  • Intel MPI (IMPI) 4.1.1.026
SLIDE 41

Experiments

  • Symmetric mode of execution
  • The number of MPI processes on the MIC is restricted to 16 due to limited memory
  • An OpenMP + MPI mode, where each MPI process spawns several OpenMP threads, is a better fit for the MIC
  • All designs were constructed on top of the MVAPICH2 version that has SCIF optimizations*

*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid’13

SLIDE 42

Host-rooted Gather

  • 3-Level-Overlapped shows up to 78% and 85% improvement over MV2 and Intel-MPI, respectively

[Plots: latency (us, x1000) vs. message size (8K-512K bytes) comparing 3-Level-Overlapped, 3-Level-Hierarchical, Pipelined, Intel-MPI, and MV2-MIC. Left: 128-process gather (16h + 16m), annotated improvements of 62% and 76%. Right: 256-process gather (16h + 16m), annotated improvements of 78% and 85%.]

SLIDE 43

MIC-rooted Gather

  • 3-Level-Overlapped shows up to 83% and 87% improvement over MV2 and Intel-MPI, respectively

[Plots: latency (us, x1000) vs. message size (8K-512K bytes) comparing 3-Level-Overlapped, 3-Level-Hierarchical, Pipelined, Intel-MPI, and MV2-MIC. Left: 256-process gather (16h + 16m), annotated improvements of 83% and 87%. Right: 128-process gather (16h + 16m).]

SLIDE 44

Case with Maximum Gather Latency

[Plot: latency (us, x1000) vs. message size (8K-512K bytes) for a 256-process MIC-rooted gather (16h + 16m), comparing 3-Level-Overlapped-Direct, 3-Level-Overlapped, and MV2-MIC; a 67% improvement is annotated.]

  • The maximum latency of the 3-Level-Overlapped algorithm bloats up in the large message range compared to default MV2-MIC
  • The Direct variant of the algorithm alleviates this problem to a large extent

SLIDE 45

Outline

  • Introduction
  • Problem Statement
  • Designs
  • Experimental Evaluation and Analyses
  • Conclusion and Future Work
SLIDE 46

Conclusions


  • Bottlenecks in MPI_Gather identified for MIC clusters
  • MPI_Gather optimized for the symmetric mode of operation
  • Hierarchical and overlapping methods used to alleviate architectural bottlenecks
  • Future goals: exploration of other MPI collectives
SLIDE 47

Web Pointers

http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH web page: http://mvapich.cse.ohio-state.edu
panda@cse.ohio-state.edu