MapReduce Frameworks on Multiple PVAS for Heterogeneous Computing Systems
Mikiko Sato, Tokyo University of Agriculture and Technology
Background
2
A recent tendency for performance improvements is due to increases in the number of CPU cores with accelerators.
- GPGPU, Intel XeonPhi
- The multi-core and many-core CPUs provide differing computational performance, parallelism, latency, etc.
[Figure: An application program spans a multi-core CPU running a multi-core OS (Linux) and a many-core CPU running a many-core OS (light-weight kernel), and the two cooperate. Multi-core tasks handle I/O processing and low-parallel, high-latency processing; many-core tasks handle highly parallel computational processing.]
The important issue is how to improve the application performance using both types of CPUs cooperatively.
MapReduce framework
Big data analytics has been identified as an exciting area for both academia and industry. The MapReduce framework[1] is a popular programming framework for big data analytics and scientific computing. MapReduce was originally designed for distributed computing and has been extended to various architectures (HPC systems[2], GPGPUs[3], many-core CPUs[4]).
MapReduce on a heterogeneous system with XeonPhi
The hardware features of the Xeon Phi achieve high performance (512-bit VPUs, MIMD thread parallelism, coherent L2 cache, etc.). The host processor assists the data transfer for MapReduce processing.
[1] Welcome to Apache Hadoop (online), available from http://hadoop.apache.org.
[2] "K MapReduce: A scalable tool for data-processing and search/ensemble applications on large-scale supercomputers," M. Matsuda, et al., in CLUSTER, IEEE Computer Society, pp. 1-8, 2013.
[3] "Mars: a MapReduce framework on graphics processors," B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, in PACT, pp. 1-8, 2008.
[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor," M. Lu, et al., Big Data, IEEE International Conference on, pp. 125-130, Oct. 2013.
3
Previous MapReduce frameworks on XeonPhi
4
MRPhi[4]
- Optimized MapReduce framework for XeonPhi coprocessor
It uses the SIMD VPUs for the map phase, SIMD hash computation algorithms based on the MIMD hyper-threading, etc. Pthreads are used for Master/Worker task control on the Xeon Phi.
Important issues for performance are both utilizing the advanced Xeon Phi features and effective thread control.

MrPhi[5]
- The expanded version of MRPhi[4]: the MapReduce operations and data are transferred separately from the host to the Xeon Phi.
- MPI communication is used for data transfer and synchronous control between the host and the Xeon Phi.

The communication overhead will be one of the factors limiting MapReduce performance.
[4] "Optimizing the MapReduce framework on Intel Xeon Phi coprocessor," M. Lu, et al., Big Data, IEEE International Conference on, pp. 125-130, Oct. 2013.
[5] "MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors," M. Lu, et al., IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, pp. 1-14, 2014.
Inter-task communications
Turn-around times (TAT) of a null function call in the Xeon Phi offloading scheme are measured as the reference of our study.
The communication overhead is large when sending small data between the host and the Xeon Phi.
→ It is important for MapReduce performance to reduce the communication cost between the host and the Xeon Phi as much as possible.
5
[Figure: The Delegator Task on the local CPU writes a request (8-128 B) into a buffer polled by the Delegatee Task on the remote CPU; the Delegatee writes an 8 B result back into a buffer polled by the Delegator.]

Turn-around times (TAT) are measured. The processing request data varies between 8 bytes and 128 bytes and the response data is fixed at 8 bytes (Xeon E5-2670, MPSS 3-2.1.6720-13).
Issues & Goal
In order to obtain high performance on the hybrid-architecture systems, it is important to
- perform inter-task communication with less overhead
- execute processing on the suitable CPU, considering the differences in performance and characteristics between the CPUs
Goal
Enable cooperation with little overhead between tasks for the MapReduce framework on a hybrid system.
In order to realize this program execution environment, "Multiple PVAS" (Multiple Partitioned Virtual Address Space) will be provided as system software for task collaboration with less overhead on the hybrid-architecture system.
6
Task Model
The task model of M-PVAS is based on PVAS[1].
The PVAS system assigns one partition to one PVAS task. PVAS tasks execute, each using its own PVAS partition, on the same PVAS address space.
→ PVAS tasks can communicate by reading/writing virtual addresses in a PVAS address space, without using additional shared memory.
[Figure: An application program on the many-core CPU runs as PVAS Tasks #1, #2, #3, ..., #M, each occupying one PVAS partition (TXT, DATA & BSS, HEAP, STACK) within a single PVAS address space, alongside a kernel export region.]
[1] Shimada, A., Gero, B., Hori, A. and Ishikawa, Y.: Proposing a new task model towards many-core architecture (MES '13).

7
M-PVAS Task Model
M-PVAS maps a number of PVAS address spaces onto a single virtual address space, the "Multiple PVAS Address Space". PVAS tasks belonging to the same Multiple PVAS address space can access other PVAS address spaces, even on a different CPU.
→ M-PVAS tasks can communicate with another M-PVAS task by just accessing its virtual address.
8
It is convenient for developing parallel programs that collaborate between different CPUs.
Basic Design of M-PVAS MapReduce
M-PVAS MapReduce was designed based on MRPhi[3], with the same MapReduce processing model as MRPhi[3]:
- The host sends the MapReduce data to the Xeon Phi repeatedly.
- The Workers execute the MapReduce operations, each accessing its own part of the data.
The inter-task communication and the task-control parts are changed to compare the performance gain between using the pthread and MPI interfaces and using the M-PVAS methods.
9
(MRPhi) pthread control vs. (M-PVAS) M-PVAS task control
(MRPhi) MPI communication vs. (M-PVAS) shared address space
Master/Worker Task Control on M-PVAS

10

The Master Task controls the Worker Tasks:
- The Master Task notifies the Worker Tasks of the MapReduce control data (fig. ①) ← the same as the pthread version. The control data includes the processing information (Map or Reduce), the number of Worker Tasks, the MapReduce data address and size, the MapReduce result data address, etc.
- The Master and Worker Tasks synchronize using busy-waiting flags and an atomic counter (fig. ②, ③) ← the simple flag sensing is expected to give better performance.
Data transfer for MapReduce processing
Non-blocking data transfer is employed by both the Sender Task on the host system and the Master Task on the many-core system.
- The Sender Task gets the request from the Master Task and transfers the data.
- The double buffering requires two buffers, with one used to receive the next data chunk while the other holds the current data chunk being processed.
- The Workers divide the receive-buffer data and each executes its Map processing.
With this control, computation and data transfer can be overlapped, which is expected to give better performance.
11
Implementations of Data Transfer

12

M-PVAS
The Master writes the buffer address and size information in the Master's address space; the Sender checks them and simply copies the memory using the memcpy() function.

MRPhi
MRPhi uses the MPI_Irecv() and MPI_Wait() functions to get the data asynchronously.
Evaluation
Execution environment for M-PVAS MapReduce:
- Xeon Phi: Master Task = 1, Worker Tasks = 239
- Host (Xeon): Sender Task = 1
- Benchmark: Monte Carlo, which shows good performance on the Xeon Phi
Many-core CPU: Intel Xeon Phi 5110P (60 cores, 240 threads, 1.053 GHz); Memory: GDDR5 8 GB; OS: Linux 2.6.38
Multi-core CPU: Intel Xeon E5-2650 x2 (8 cores, 16 threads, 2.6 GHz); Memory: DDR3 64 GB; OS: Linux 2.6.32 (CentOS 6.3)
Intel CCL: MPSS Version 3.4.3; MPI: IMPI Version 5.0.1.035
13
Summary
In this study, the task execution model "Multiple Partitioned Virtual Address Space (M-PVAS)" is applied to the MapReduce framework. The effect of the M-PVAS model is estimated with the MapReduce benchmark Monte Carlo.
- In the current state, M-PVAS MapReduce shows better performance than the original MapReduce framework.
- M-PVAS achieves around a 1.8-2.0x speedup.
- The main factor is the data transfer processing.