Optimized MPI Gather Collective for Many Integrated Core (MIC) InfiniBand Clusters
Akshay Venkatesh, Krishna Kandalla, Dhabaleswar K. Panda
Network-based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
2
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
3
Scientific applications, Accelerators and MPI
- Several areas such as medical sciences, atmospheric research, and earthquake simulations rely on the speed of computations for better prediction and analysis
- Many applications benefit from the use of large-scale systems
- Accelerators/coprocessors further increase computational speed and improve energy efficiency
- MPI continues to be widely used as HPC embraces heterogeneous architectures
- MPI(+X) continues to be the predominant programming model in HPC
MVAPICH2/MVAPICH2-X Software
- High-performance open-source MPI library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
– MVAPICH2-X (MPI + PGAS), available since 2012
– Used by more than 2,055 organizations (HPC centers, industry, and universities) in 70 countries
– More than 181,000 downloads directly from the OSU site
– Empowering many TOP500 clusters
- 6th-ranked 462,462-core cluster (Stampede) at TACC
- 19th-ranked 125,980-core cluster (Pleiades) at NASA
- 21st-ranked 73,278-core cluster (Tsubame 2.0) at Tokyo Institute of Technology, and many others
– Available with the software stacks of many IB, HSE, and server vendors, including Linux distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
- Partner in the U.S. NSF-TACC Stampede System
4 XSCALE13
5
Latest Version of MVAPICH2
- Support for GPU-Aware MPI
- Optimized and tuned point-to-point operations involving GPU buffers
- Support for GPUDirect RDMA
- Optimized GPU Collectives
- Ongoing effort to design a high-performance library that enables MPI communication on MIC clusters
6
Intel Xeon Phi Specifications
- Belongs to the Many Integrated Core (MIC) family
- 61 cores on the chip (each running at 1 GHz)
- 4 hardware threads per core (smart round-robin scheduler)
- 1 teraflop peak throughput and energy efficient
- x86-compatible and supports OpenMP, MPI, Cilk, etc.
- Installed in the compute node as a PCI Express device
7
MPI on MIC Clusters (1)
- Stampede has ~6,000 MIC Coprocessors
- Tianhe-2 has ~48,000 MIC Coprocessors
- Through MPSS*, MIC coprocessors can directly use IB HCAs via peer-to-peer PCIe communication for inter- and intra-node communication
- MPI is predominantly used to make use of multiple such compute nodes in tandem
*MPSS – Intel Manycore Platform Software Stack
8
MPI on MIC Clusters(2)
- MIC supports various modes of operation
– Offload mode
– Coprocessor-only or Native mode
– Symmetric mode
- Non-uniform host and destination platforms
– Host to MIC, MIC to host, MIC-to-MIC
- Transfers involving the MIC incur additional cost owing to the expensive PCIe path
9
Symmetric mode and Implications
- Non-uniform host and destination platforms
– Host to MIC
– MIC to host
– MIC to MIC
- Transfers involving the MIC incur additional cost owing to the expensive PCIe path
- Performance is non-uniform
10
Symmetric mode and Implications
- MIC->MIC Latency = 8 X Host->Host Latency
- Host->IB NIC bandwidth = 6 X MIC->IB NIC Bandwidth
[Chart: latency (us x 1000) vs. message size (256K–16M bytes) for host->remote_host, host->remote_mic, mic->remote_host, and mic->remote_mic]
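A minimal ping-pong sketch (in the spirit of osu_latency) can expose this path asymmetry; it is not the benchmark used for the chart above. The rank placement (rank 0 on a host, rank 1 on a local or remote MIC) is assumed to be controlled by the launcher, and the message size and iteration count are illustrative.

```c
/* Minimal ping-pong latency sketch; run in symmetric mode with rank 0 on a
 * host and rank 1 on a MIC (or any other pairing) and compare the results. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NBYTES (1 << 20)   /* 1 MB message, illustrative */
#define ITERS  100         /* illustrative iteration count */

int main(int argc, char **argv)
{
    int rank;
    static char buf[NBYTES];
    memset(buf, 0, sizeof buf);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {            /* e.g. a host rank */
            MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* e.g. a MIC rank */
            MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    if (rank == 0)
        printf("avg one-way latency: %.1f us\n",
               (MPI_Wtime() - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}
```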
11
MPI on MIC Clusters(2)
- MIC supports various modes of operation
– Offload mode – Coprocessor-only or Native mode – Symmetric mode
- Besides point-to-point primitives, MPI Standard also
defines a set of collectives such as:
– MPI_Bcast
– MPI_Gather
12
MPI_Gather
- Gather used in
– Multi-agent heuristics
– Mini applications
– Can be used for reduction operations and more
[Diagram: each process contributes its data elements (1, 2, 3), which are collected at the ROOT process]
13
MPI Gather(2)
- MPI_Gather
– One root process receives data from every other process
- On homogeneous systems the collective adopts
– Linear scheme
– Binomial scheme
– Hierarchical scheme
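A minimal MPI_Gather example in C: every rank contributes one integer and rank 0 (the root) collects them in rank order. The buffer sizes and the use of MPI_INT are illustrative; the internal scheme (linear, binomial, hierarchical) is chosen by the MPI library.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendval = rank;                 /* each process contributes its rank */
    int *recvbuf = NULL;
    if (rank == 0)                      /* only the root needs a receive buffer */
        recvbuf = malloc((size_t)size * sizeof(int));

    /* Root (rank 0) receives one int from every process, ordered by rank. */
    MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, recvbuf[i]);
        free(recvbuf);
    }
    MPI_Finalize();
    return 0;
}
```

Launched with, e.g., `mpirun -np 8 ./gather_example` (the binary name is hypothetical), rank 0 prints one value per rank.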
14
Typical MPI_Gather on MIC Clusters
- Yellow grid boxes
represent host processors
- Blue grid boxes
represent MIC coprocessors
[Diagram: four nodes (Node 0–3), each with a host (Host 0–3), a MIC coprocessor (MIC 0–3), and an IB HCA]
15
Typical MPI_Gather on MIC Clusters
- Hierarchical or leader-based scheme
- Communicator per node
- Leader is the least rank in the node
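A sketch of how such a leader-based gather can be pieced together from standard MPI calls; this is not the MVAPICH2 implementation. MPI_COMM_TYPE_SHARED stands in for "communicator per node" (in symmetric mode the host and MIC run separate OS images, so a real implementation would split on an explicit node key), and equal rank counts per node plus a world root that is itself a leader are assumed.

```c
#include <mpi.h>
#include <stdlib.h>

/* Leader-based gather of 'count' ints per rank; recvbuf is significant only
 * at the global root (world rank 0). */
void hierarchical_gather_int(const int *sendbuf, int count, int *recvbuf)
{
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    /* Communicator per node; the leader is the least rank in the node. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, wrank,
                        MPI_INFO_NULL, &node_comm);
    int nrank, nsize;
    MPI_Comm_rank(node_comm, &nrank);
    MPI_Comm_size(node_comm, &nsize);

    /* Leaders form their own communicator for the inter-node step. */
    MPI_Comm leader_comm;
    MPI_Comm_split(MPI_COMM_WORLD, nrank == 0 ? 0 : MPI_UNDEFINED, wrank,
                   &leader_comm);

    /* Step 1: gather inside the node at the local leader. */
    int *node_buf = (nrank == 0)
        ? malloc((size_t)nsize * count * sizeof(int)) : NULL;
    MPI_Gather(sendbuf, count, MPI_INT, node_buf, count, MPI_INT, 0, node_comm);

    /* Step 2: leaders gather the aggregated node blocks at the root leader;
     * only leaders touch the network in this step. */
    if (nrank == 0) {
        MPI_Gather(node_buf, nsize * count, MPI_INT,
                   recvbuf, nsize * count, MPI_INT, 0, leader_comm);
        MPI_Comm_free(&leader_comm);
        free(node_buf);
    }
    MPI_Comm_free(&node_comm);
}
```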
16
Typical MPI_Gather on MIC Clusters
- Local leader on the
MIC directly uses the NIC
- IB reading from MIC
is costly
- When the root of the gather is on the MIC, there are transfers from the local MIC to the remote MIC
- This can be very
costly
17
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
18
Problem Statement
- What are the primary bottlenecks that affect the
performance of the MPI Gather operation on MIC clusters?
- Can we design algorithms to overcome architecture
specific performance deficits to improve gather latency?
- Can we analyze and quantify the benefits of our
proposed approach using micro-benchmarks?
19
MPI Collectives on MIC
- Primitive operations such as MPI_Send, MPI_Recv, and their non-blocking counterparts have been optimized*
- MPI collectives such as MPI_Alltoall, MPI_Scatter, etc., which are designed on top of p2p primitives, immediately benefit from such optimizations
- Can we do better?
*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid’13
20
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
21
Design Goals
- Avoid IB reading from MIC
– Especially for large transfers
- Use pipelining methods
- Overlap operations when possible
22
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
23
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
24
Design1: 3-Level-Hierarchical Algorithm
- Step 1: Same as the default hierarchical or leader-based scheme
25
Design1: 3-Level-Hierarchical Algorithm
- Step 2: Transfer of MIC-aggregated data to the host over PCIe
- The difference? SCIF is used, and its performance is comparatively good
26
Design1: 3-Level-Hierarchical Algorithm
- Advantage: Step 3, which transfers the large aggregate message, does not involve any IB reads from the MIC
- Disadvantage: MIC cores are slow, so intra-MIC gathers are slower
- => The host leader needs to wait
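An illustrative sketch of the three levels in plain MPI, not the MVAPICH2 code. The communicators (mic_comm, host_comm, node_comm, leader_comm), the leader ranks, and the assumption of equal rank counts per MIC and per host are hypothetical simplifications.

```c
#include <mpi.h>
#include <stdlib.h>

/* 3-level gather of 'count' ints per rank.  Assumed (hypothetical) layout:
 *   mic_comm    - ranks on this node's MIC (leader = rank 0)
 *   host_comm   - ranks on this node's host (leader = rank 0)
 *   node_comm   - this node's host leader (rank 0) and MIC leader (rank 1)
 *   leader_comm - one host leader per node, the root's leader at rank 0
 * rootbuf is significant only at the root leader. */
void gather_3level_int(const int *sendbuf, int count, int *rootbuf,
                       int on_mic, int ranks_per_mic,
                       MPI_Comm mic_comm, MPI_Comm host_comm,
                       MPI_Comm node_comm, MPI_Comm leader_comm)
{
    if (on_mic) {
        int mrank;
        MPI_Comm_rank(mic_comm, &mrank);
        int *mic_agg = (mrank == 0)
            ? malloc((size_t)ranks_per_mic * count * sizeof(int)) : NULL;

        /* Level 1: intra-MIC gather to the MIC leader (slow MIC cores). */
        MPI_Gather(sendbuf, count, MPI_INT, mic_agg, count, MPI_INT, 0, mic_comm);

        /* Level 2: MIC leader ships the aggregate to the host leader; inside
         * MVAPICH2 this PCIe hop is SCIF-optimized. */
        if (mrank == 0) {
            MPI_Send(mic_agg, ranks_per_mic * count, MPI_INT, 0, 0, node_comm);
            free(mic_agg);
        }
    } else {
        int hrank, hsize;
        MPI_Comm_rank(host_comm, &hrank);
        MPI_Comm_size(host_comm, &hsize);
        int node_vals = (hsize + ranks_per_mic) * count;
        int *node_agg = (hrank == 0) ? malloc((size_t)node_vals * sizeof(int)) : NULL;

        /* Host-side gather to the host leader. */
        MPI_Gather(sendbuf, count, MPI_INT, node_agg, count, MPI_INT, 0, host_comm);

        if (hrank == 0) {
            /* Append the MIC aggregate arriving over PCIe. */
            MPI_Recv(node_agg + (size_t)hsize * count, ranks_per_mic * count,
                     MPI_INT, 1, 0, node_comm, MPI_STATUS_IGNORE);

            /* Level 3: only host leaders talk to the HCA, so the large
             * aggregate never triggers an IB read from MIC memory. */
            MPI_Gather(node_agg, node_vals, MPI_INT,
                       rootbuf, node_vals, MPI_INT, 0, leader_comm);
            free(node_agg);
        }
    }
}
```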
27
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
28
Design 2: Pipelined Gather
- The leader on each host posts non-blocking receives from all processes within the node
- The leader sends its own data to the gather root
- Each process within the node sends its data to the leader on the host
- The host leader forwards it to the gather root
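A sketch of the host-leader side of this pipelined idea in plain MPI, not the MVAPICH2 code. The node_comm layout, the tags, and the fixed per-rank count are illustrative assumptions; a real implementation would also carry origin-rank information so the root can place each chunk.

```c
#include <mpi.h>
#include <stdlib.h>

/* Pipelined forwarding at a host leader.  Assumed (hypothetical) layout:
 * node_comm holds all ranks on this node (host + MIC) with the host leader
 * at rank 0; non-leaders simply MPI_Send 'count' ints to rank 0 of node_comm. */
void pipelined_leader(const int *own_data, int count, int root,
                      MPI_Comm node_comm, MPI_Comm world_comm)
{
    int nlocal;
    MPI_Comm_size(node_comm, &nlocal);

    int *chunks = malloc((size_t)nlocal * count * sizeof(int));
    MPI_Request *reqs = malloc((size_t)(nlocal - 1) * sizeof(MPI_Request));

    /* Post non-blocking receives from every other rank on this node. */
    for (int r = 1; r < nlocal; r++)
        MPI_Irecv(chunks + (size_t)r * count, count, MPI_INT, r, 0,
                  node_comm, &reqs[r - 1]);

    /* Send the leader's own contribution to the gather root right away. */
    MPI_Send(own_data, count, MPI_INT, root, 0, world_comm);

    /* Forward each local chunk as soon as it lands, so the PCIe hop and the
     * IB hop overlap instead of being serialized. */
    for (int done = 0; done < nlocal - 1; done++) {
        int idx;
        MPI_Waitany(nlocal - 1, reqs, &idx, MPI_STATUS_IGNORE);
        MPI_Send(chunks + (size_t)(idx + 1) * count, count, MPI_INT,
                 root, 0, world_comm);
    }

    free(reqs);
    free(chunks);
}
```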
29
Design 2: Pipelined Gather
- Advantage: None of
the steps involve IB reading from MIC
- Disadvantage: the majority of the transfer burden lies on the host leaders
- => Processing non-blocking receives on the host leader can become a bottleneck
30
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
31
Design 3: 3-Level-Hierarchical Overlapped Algorithm
- Step 1: Same as the default hierarchical or leader-based scheme
32
Design 3: 3-Level-Hierarchical Overlapped Algorithm
- The host leader first posts a non-blocking receive from the MIC leader on the node
- The host leader gathers data from the host processes and sends it to the root
- Meanwhile, the MIC leader starts sending its data to the host leader
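A sketch of this overlap at the host leader in plain MPI, not the MVAPICH2 code. The communicator layout, leader ranks, and message tags are illustrative assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

/* Overlapped host-leader step.  Assumed (hypothetical) layout: host_comm
 * holds this node's host ranks (leader = rank 0); node_comm pairs the host
 * leader (rank 0) with the MIC leader (rank 1); 'mic_vals' is the size of
 * the MIC leader's aggregate in ints; tags 1 and 2 are arbitrary. */
void overlapped_leader(const int *own_data, int count, int mic_vals, int root,
                       MPI_Comm host_comm, MPI_Comm node_comm, MPI_Comm world_comm)
{
    int hsize;
    MPI_Comm_size(host_comm, &hsize);

    int *mic_agg  = malloc((size_t)mic_vals * sizeof(int));
    int *host_agg = malloc((size_t)hsize * count * sizeof(int));

    /* 1. Post the receive for the MIC aggregate first; the slow intra-MIC
     *    gather and the PCIe transfer now proceed in the background. */
    MPI_Request req;
    MPI_Irecv(mic_agg, mic_vals, MPI_INT, 1, 0, node_comm, &req);

    /* 2. Meanwhile, gather the host ranks' data and ship it to the root. */
    MPI_Gather(own_data, count, MPI_INT, host_agg, count, MPI_INT, 0, host_comm);
    MPI_Send(host_agg, hsize * count, MPI_INT, root, 1, world_comm);

    /* 3. Only now wait for the MIC data and forward it: the host leader is
     *    never idle waiting on the MIC, and no IB read touches MIC memory. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Send(mic_agg, mic_vals, MPI_INT, root, 2, world_comm);

    free(host_agg);
    free(mic_agg);
}
```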
33
Design 3: 3-Level-Hierarchical Overlapped Algorithm
- Advantage: No step
involves transferring the large aggregate message using IB reads from MIC
- Advantage: Host
does not wait for relatively slow MIC leader to finish and send data
- Experiments show
good overlap
- Host leader
forwards MIC data to gather root
34
Designs
- 3-level hierarchical algorithm
- Pipelined Algorithm
- Overlapped 3-Level-hierarchical algorithm
– Direct variant for MIC rooted gathers
35
Design 3: 3-Level-Hierarchical Overlapped Algorithm (MIC root)
- Root on MIC
- Data is still channeled through the host leader on the node
- Advantage: SCIF write is used
36
Design 3: 3-Level-Hierarchical Overlapped Algorithm (MIC root)
- Host leader on the
node where MIC gather root exists becomes a bottleneck
37
Design 3: 3-Level-Hierarchical Overlapped Algorithm (Direct Variant)
- IB writes into the MIC perform comparatively well (unlike IB reads from the MIC)
38
Design 3: 3-Level-Hierarchical Overlapped Algorithm (Direct Variant)
- The HCA directly writes into MIC memory
- Limitations remain for very large messages but are alleviated in the medium message range
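A sketch of the direct variant at a MIC-resident gather root in plain MPI, not the MVAPICH2 code. The leader_ranks array, the equal per-node aggregate size, and the handling of the root's own node (omitted here) are illustrative assumptions; the point is that the remote leaders' transfers complete as writes into MIC memory rather than taking an extra hop through the local host.

```c
#include <mpi.h>
#include <stdlib.h>

/* Direct variant: the MIC-resident root posts receives straight from each
 * remote host leader.  'leader_ranks' lists their world ranks and
 * 'vals_per_node' is the per-node aggregate size in ints. */
void direct_mic_root(int *rootbuf, int vals_per_node, int nnodes,
                     const int *leader_ranks, MPI_Comm world_comm)
{
    MPI_Request *reqs = malloc((size_t)nnodes * sizeof(MPI_Request));

    /* One receive per remote node, landing directly in the root's buffer. */
    for (int n = 0; n < nnodes; n++)
        MPI_Irecv(rootbuf + (size_t)n * vals_per_node, vals_per_node, MPI_INT,
                  leader_ranks[n], 0, world_comm, &reqs[n]);

    MPI_Waitall(nnodes, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
}
```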
39
Outline
- Introduction
- Problem Statement
- Optimizations
- Experimental Evaluation and Analyses
- Conclusion and Future Work
40
Experimental Setup
- Stampede Cluster
- Up to 64 nodes used, each with Sandy Bridge-EP (Xeon E5-2680) processors and Xeon Phi coprocessors (Knights Corner)
- Each host has 32 GB of memory; each MIC has 8 GB
- Mellanox FDR switches (SX6036) to interconnect
- OSU Micro-benchmark suite (OMB)
- Optimizations added to MVAPICH2-1.9 (MV2-MIC)
– Version already has intra-node SCIF based optimizations
- Intel MPI 4.1.1.026
41
Experiments
- Symmetric mode of execution
- The number of MPI processes on the MIC is restricted to 16 due to limited memory
- OpenMP + MPI mode where each MPI process
spawns several OpenMP threads is a better fit for MIC
- All designs constructed on top of MVAPICH2 version
that has SCIF optimizations*
*S. Potluri, A. Venkatesh, D. Bureddy, K. Kandalla and D. K. Panda: Efficient Intra-node Communication on Intel-MIC Clusters, CCGrid’13
42
Host-rooted Gather
- 3-Level-Overlapped shows up to 78% and 85% improvement over MV2 and
Intel-MPI respectively
[Chart: 128-process gather (16h + 16m); latency (us x 1000) vs. message size (8K–512K bytes) for 3-Level-Overlapped, 3-Level-Hierarchical, Pipelined, Intel-MPI, and MV2-MIC; labels show 62% and 76% gains]
[Chart: 256-process gather (16h + 16m); same axes and schemes; labels show 78% and 85% gains]
43
MIC-rooted Gather
- 3-Level-Overlapped shows up to 83% and 87% improvement over MV2 and
Intel-MPI respectively
[Chart: 256-process gather (16h + 16m); latency (us x 1000) vs. message size (8K–512K bytes) for 3-Level-Overlapped, 3-Level-Hierarchical, Pipelined, Intel-MPI, and MV2-MIC; labels show 83% and 87% gains]
[Chart: 128-process gather (16h + 16m); same axes and schemes]
44
Case with Maximum Gather Latency
[Chart: 256-process MIC-rooted gather (16h + 16m); latency (us x 1000) vs. message size (8K–512K bytes) for 3-Level-Overlapped-Direct, 3-Level-Overlapped, and MV2-MIC]
- The maximum latency of the 3-Level-Overlapped algorithm bloats up in the large message range compared to the default MV2-MIC
- The Direct variant of the algorithm alleviates this problem to a large extent (up to 67%)
45
Outline
- Introduction
- Problem Statement
- Designs
- Experimental Evaluation and Analyses
- Conclusion and Future Work
Conclusions
46
- Bottlenecks in MPI_Gather identified for MIC
clusters
- Optimized MPI Gather for the Symmetric mode of operation
- Hierarchical and overlapping methods used to
alleviate architectural bottlenecks
- Future Goals: Exploration of other MPI collectives
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http://mvapich.cse.ohio-state.edu
panda@cse.ohio-state.edu
47 XSCALE13