 
Optimizing MPI Intra-node Communication with New Task Model for Many-core Systems
Akio SHIMADA
Research and Development Group, Hitachi, Ltd.
LENS INTERNATIONAL WORKSHOP 2015

Background
• A large number of parallel processes can be invoked within a node on a many-core system
  • MPI and some PGAS language runtimes invoke multiple processes
• Fast intra-node communication is required
  • Many studies have proposed a variety of intra-node communication schemes (e.g. KNEM, LiMIC) since the appearance of multi-core processors, and try to accelerate intra-node communication on many-core systems (e.g. hybrid MPI)
[Figure: parallel processes mapped to the cores of a node. Panels: "Communication on Multi-core Node" and "Communication on Many-core Node"]

Conventional Intra-node Communication Schemes
• There are address space boundaries among processes
• Overheads for "crossing address space boundaries among processes" are produced
  • Shared Memory: double-copy via shared memory is required for every communication
  • OS kernel assistance (KNEM, LiMIC, etc.): system call overhead is produced for every communication
[Figure, Shared Memory: the sender's send buffer is copied to an intermediate buffer in shared memory, which is then copied to the receiver's receive buffer]
[Figure, OS Kernel: the OS kernel copies the sender's send buffer to the receiver's receive buffer]

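The cost difference between the two copy strategies above can be made concrete with a small model. The sketch below is only an illustration, not code from Open MPI, KNEM, or LiMIC: plain Python byte buffers stand in for process memory, and the function names `send_double_copy` and `send_single_copy` are hypothetical. It counts how many bytes are moved when a message goes through a bounded intermediate buffer versus when it is copied once.

```python
def send_double_copy(send_buf, recv_buf, shared):
    """Move send_buf into recv_buf through a bounded intermediate buffer.

    `shared` models the shared-memory segment (an assumption of this sketch);
    every byte is copied twice. Returns the number of bytes copied in total.
    """
    chunk = len(shared)
    if chunk == 0:
        raise ValueError("intermediate buffer must not be empty")
    copied = 0
    for off in range(0, len(send_buf), chunk):
        n = min(chunk, len(send_buf) - off)
        shared[:n] = send_buf[off:off + n]   # copy 1: sender -> intermediate buffer
        recv_buf[off:off + n] = shared[:n]   # copy 2: intermediate buffer -> receiver
        copied += 2 * n
    return copied


def send_single_copy(send_buf, recv_buf):
    """Move send_buf into recv_buf with one copy. Returns bytes copied."""
    recv_buf[:len(send_buf)] = send_buf
    return len(send_buf)


if __name__ == "__main__":
    message = bytes(range(256)) * 10
    via_shared = bytearray(len(message))
    direct = bytearray(len(message))
    print(send_double_copy(message, via_shared, bytearray(64)))
    print(send_single_copy(message, direct))
```

The model ignores synchronization and the system call that a kernel-assisted single copy needs; it only shows that the shared-memory path touches each byte twice.
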
Proposal
• Partitioned Virtual Address Space (PVAS)
  • A new task model for efficient parallel processing on many-core systems
  • PVAS makes it possible for parallel processes within the same node to run in the same address space
  • PVAS can remove the overheads for crossing address space boundaries from intra-node communication

Address Space Layout
• PVAS partitions a single address space into multiple segments (PVAS partitions) and assigns them to parallel processes (PVAS tasks)
• Parallel processes use the same page table for managing memory mapping information
• A PVAS task can use only its own PVAS partition as its local memory (it cannot allocate memory within a PVAS partition assigned to another PVAS task)
• A PVAS task is almost the same as a normal process, except that it shares the same address space with other processes
[Figure, Normal Task Model: Process 0 and Process 1 each have their own address space containing TEXT, DATA&BSS, HEAP, STACK, and KERNEL]
[Figure, PVAS Task Model: one address space, from low to high address, holds PVAS Partition 0 (PVAS Task 0: TEXT, DATA&BSS, HEAP, STACK), PVAS Partition 1 (PVAS Task 1: TEXT, DATA&BSS, HEAP, STACK), further partitions, and KERNEL]

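The partitioning rule above can be sketched as simple address arithmetic. This is a minimal model under stated assumptions, not the PVAS implementation: it assumes equally sized, contiguous partitions starting at a fixed base address, and the class `PvasLayout` with its fields and methods is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PvasLayout:
    """Hypothetical layout: equal-size partitions laid out back to back."""
    base: int            # lowest address of partition 0 (assumed value)
    partition_size: int  # bytes per partition (assumed value)
    num_tasks: int       # number of PVAS tasks sharing the address space

    def partition_range(self, task_id):
        """Return the [start, end) address range owned by a task."""
        if not 0 <= task_id < self.num_tasks:
            raise ValueError("unknown task id")
        start = self.base + task_id * self.partition_size
        return start, start + self.partition_size

    def owner_of(self, addr):
        """Return the task whose partition contains addr, or None."""
        index = (addr - self.base) // self.partition_size
        return index if 0 <= index < self.num_tasks else None

    def may_allocate(self, task_id, addr, length):
        """A task may allocate only inside its own partition."""
        start, end = self.partition_range(task_id)
        return start <= addr and addr + length <= end


if __name__ == "__main__":
    layout = PvasLayout(base=0x1000_0000, partition_size=1 << 30, num_tasks=4)
    print([hex(a) for a in layout.partition_range(1)])
    print(layout.owner_of(0x1000_0000 + (1 << 30)))
```

In this model any task can compute which partition an address belongs to, while allocation stays confined to the task's own range, which mirrors the distinction the slide draws between access and allocation.
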
PVAS Feature
• All memory of a PVAS task is exposed to the other PVAS tasks within the same node
• A PVAS task can access the memory of the other PVAS tasks by load/store instructions (there are no address space boundaries among them)
• A pair of PVAS tasks can exchange data without the overheads for crossing an address space boundary

Optimizing Open MPI by PVAS
• The PVAS BTL component is implemented in the Byte Transfer Layer (BTL) of Open MPI
• SM BTL
  • Supports double-copy communication via shared memory
  • Supports single-copy communication with OS kernel assistance (using KNEM)
• PVAS BTL (developed on the basis of the SM BTL)
  • Copies the data from the send buffer to the receive buffer without OS kernel assistance by using the PVAS facility

PVAS BTL
• Invokes each MPI process as a PVAS task
• Copies the data from the send buffer to the receive buffer directly when transferring the data
  • The overheads for crossing an address space boundary are not produced
• Single-copy communication (avoiding an extra memory copy)
• OS kernel assistance is not necessary (avoiding system call overhead)
[Figure: MPI Process 0 (PVAS Task 0) holds the send buffer and MPI Process 1 (PVAS Task 1) holds the receive buffer. ① Sender posts the pointer to the send buffer. ② Receiver copies the data from the send buffer.]

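The two-step protocol in the figure above can be imitated with threads, which, like PVAS tasks, share one address space. The sketch below is an analogy rather than the PVAS BTL code: a `queue.Queue` plays the role of the channel on which the pointer is posted, a `memoryview` stands in for the pointer, and the names `sender`, `receiver`, and `transfer` are hypothetical.

```python
import queue
import threading


def sender(mailbox, send_buf, copied):
    """Step 1: post a reference to the send buffer, not the data itself."""
    mailbox.put(memoryview(send_buf))
    copied.wait()  # leave the send buffer untouched until the copy is done


def receiver(mailbox, recv_buf, copied):
    """Step 2: copy straight from the sender's buffer into the receive buffer."""
    src = mailbox.get()
    recv_buf[:len(src)] = src  # the only data copy in the transfer
    copied.set()


def transfer(payload):
    """Run one sender and one receiver and return what the receiver got."""
    send_buf = bytearray(payload)
    recv_buf = bytearray(len(send_buf))
    mailbox = queue.Queue(maxsize=1)
    copied = threading.Event()
    tasks = [
        threading.Thread(target=sender, args=(mailbox, send_buf, copied)),
        threading.Thread(target=receiver, args=(mailbox, recv_buf, copied)),
    ]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return bytes(recv_buf)


if __name__ == "__main__":
    print(transfer(b"hello PVAS"))
```

Only the reference travels through the mailbox, so the payload is copied once and no intermediate buffer is involved. The completion event is an assumption of this sketch about how the sender learns that its buffer can be reused.
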
Evaluation Environment
• Intel Xeon Phi 5110P
  • 1.083 GHz, 60 cores (4 HT)
  • 32 KB L1 cache, 512 KB L2 cache
  • 8 GB of main memory
• OS
  • Intel MPSS Linux 2.6.38.8 with PVAS facility
• MPI
  • Open MPI 1.8 with PVAS BTL

Latency Evaluation
• Ping-pong communication latency was measured by running Intel MPI Benchmarks
• PVAS BTL outperforms the others regardless of the message size
• Latency of the SM BTL (KNEM) is higher than that of the SM BTL when the message size is small, because of the system call overhead
[Figure: latency (usec, logarithmic scale from 1 to 1,000,000) versus message size (64 bytes to 32M bytes) for SM, SM (KNEM), and PVAS]

NAS Parallel Benchmarks (NPB)
• Running NPB on a single node
• Number of processes
  • 128 (MG, CG, FT, IS, LU)
  • 225 (SP, BT)
• Problem size
  • CLASS A, B, C (A < B < C)
• PVAS BTL improves benchmark performance by up to 28%
  • SP (CLASS C)
[Figure: three bar charts of performance improvement (%) for SM (KNEM) and PVAS on the benchmarks MG, CG, FT, IS, LU, SP, and BT, one chart each for CLASS A, CLASS B, and CLASS C; one entry is marked N/A]

Optimizing Non-contiguous Data Transfer Using Derived Data Types
• An MPI process can access the MPI internal objects of the other MPI process when using the PVAS facility
• Sender and receiver exchange pointers to their data type information
• Sender and receiver copy the data from the send buffer to the receive buffer, consulting the data type information of both sides
• Sender and receiver copy the data in parallel
[Figure, SM BTL: ① memory copy by sender from the send buffer to shared memory, ② sender posts the pointer to the intermediate buffer, ③ memory copy by receiver into the receive buffer]
[Figure, PVAS BTL: ① sender and receiver exchange pointers to their data type information, ② memory copy by sender, ②' memory copy by receiver, both directly from the send buffer of MPI Process 0 (PVAS Task 0) to the receive buffer of MPI Process 1 (PVAS Task 1)]

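The parallel copy described above can be sketched with two workers that each move half of the elements directly between two strided layouts. This is an illustration under assumptions, not the Open MPI datatype engine: each layout is described by a hypothetical `(count, blocklen, stride)` triple in the spirit of `MPI_Type_vector`, with one byte as the element, and the even split of work between the two sides is an assumption of the sketch.

```python
import threading


def vector_offsets(count, blocklen, stride):
    """Element offsets selected by a strided layout of `count` blocks."""
    return [b * stride + i for b in range(count) for i in range(blocklen)]


def copy_range(src, src_offs, dst, dst_offs, lo, hi):
    """Copy elements lo..hi-1 from the source layout to the destination layout."""
    for k in range(lo, hi):
        dst[dst_offs[k]] = src[src_offs[k]]


def parallel_typed_copy(src, src_type, dst, dst_type):
    """Copy between two layouts with no packing step, using two workers."""
    src_offs = vector_offsets(*src_type)
    dst_offs = vector_offsets(*dst_type)
    if len(src_offs) != len(dst_offs):
        raise ValueError("layouts describe different element counts")
    mid = len(src_offs) // 2
    workers = [
        # first half: the copy performed by the sender side
        threading.Thread(target=copy_range,
                         args=(src, src_offs, dst, dst_offs, 0, mid)),
        # second half: the copy performed by the receiver side
        threading.Thread(target=copy_range,
                         args=(src, src_offs, dst, dst_offs, mid, len(src_offs))),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()


if __name__ == "__main__":
    send_buf = bytearray(range(12))
    recv_buf = bytearray(6)
    # 3 blocks of 2 elements, 4 apart, gathered into one contiguous block of 6
    parallel_typed_copy(send_buf, (3, 2, 4), recv_buf, (1, 6, 6))
    print(list(recv_buf))
```

Because both sides can read both layout descriptions, neither has to pack the data into a contiguous intermediate buffer first, which is the saving the slide attributes to the PVAS facility.
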
Latency Evaluation Using DDTBench (1/2)
• DDTBench [Timo et al., EuroMPI'12] mimics the communication patterns of MPI applications by using derived data types
• MPI processes send and receive the non-contiguous data in WRF, MILC, NPB, LAMMPS, and SPECFEM3D
[Figure: eight panels comparing SM and PVAS: WRF_x_vec, WRF_x_sa, WRF_y_sa, WRF_y_vec, NAS_MG_z, NAS_MG_x, NAS_MG_y, MILC_su3_zd. X-axis: data size, Y-axis: latency (usec)]
