 
Improving Node-level MapReduce Performance using Processing-in-Memory Technologies

Mahzabeen Islam, Marko Scrbak and Krishna M. Kavi
Computer Systems Research Laboratory, Department of Computer Science & Engineering, University of North Texas, USA

Mike Ignatowski and Nuwan Jayasena
AMD Research - Advanced Micro Devices, Inc., USA

Overview
• Introduction
• Motivation
• Proposed Model
  - Server Architecture
  - Programming Framework
• Experiments
• Results
• Conclusion and Future Work
• Related Work
• References

[3] Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: ISPASS (2014)
[4] Zhang, D., Jayasena, N., Lyashevsky, A., et al.: TOP-PIM: Throughput-Oriented Programmable Processing in Memory. In: HPDC (2014)

Introduction
• 3D-stacked DRAM consists of DRAM dies stacked on top of a logic die. Compared with existing DRAM technologies, it provides:
  - higher memory bandwidth,
  - lower access latencies, and
  - lower energy consumption.
  - Example: Hybrid Memory Cube (HMC), with a capacity of 2-4 GB, a bandwidth of 160 GB/s (15x DDR3), and 70% less energy per bit [1]
• The bottom logic die contains peripheral circuitry (row decoders, sense amplifiers, etc.), but enough silicon remains for other logic
• 3D-DRAM can be used as a large last-level cache, as main memory, or as a buffer in front of PCM
• SRAM can be integrated in the logic layer to aid address translation (hardware page tables)
• The recent trend is to put processing capabilities in the logic layer

Processing in Memory
• Processing-In-Memory (PIM) is the concept of moving computation closer to memory
• Advantages:
  - Adding simple processing cores in memory gives low access latency, high memory bandwidth, and a high degree of parallelism
  - Cache pollution is reduced because some data is never transferred to the main cores
  - Data-intensive, memory-bound applications, which do not benefit from conventional cache hierarchies, could benefit from PIM
• Concerns:
  - Designing an appropriate system architecture
    - Too many design choices: main processor, PIM processors, memory hierarchy, communication channels, interfaces
  - Requires changes to the operating system (memory management), the programming framework (e.g. the MapReduce library), and the programming model (synchronization, coherence)

Our Work
• 3D-stacked DRAM has generated renewed interest in PIM
• Several low-power cores in the logic layer of a 3D-DRAM can execute memory-bound functions closer to memory
• Our current research focuses on Big Data analyses based on the MapReduce programming model
  - Map functions are good candidates for execution on PIM processors
  - We propose and evaluate a server architecture here
  - MapReduce is modified for shared memory processors
    - We plan to investigate using PIM for other parts of MapReduce applications
    - And for other classes of applications (scale-out applications)
    - Contemporary research shows that emerging scale-out applications do not benefit from conventional processor architectures and cache hierarchies [2]

Proposed Server Architecture
• Host processor connected to multiple 3D Memory Units (3DMUs)
• PIM cores in the logic layer of each 3DMU
• Simple, in-order, single-issue, energy-efficient PIM cores with only L1 caches
• Processes running on the host control the execution of PIM threads
• Unified memory view as proposed by the Heterogeneous System Architecture (HSA) Foundation
• A number of such nodes will make up a cluster

[Figure: 3DMU organization. Labels: Host; abstract load/store interface; PIM & DRAM controllers; timing-specific DRAM interface; memory dies]

Proposed MapReduce Framework
• Adapt MapReduce frameworks for shared memory systems that exhibit NUMA
  - We chose Phoenix++, which works with CMP and SMP systems
  - We needed to modify Phoenix++ for our purpose
• Map phase - overlap with reading input using MP cores (host reads from files)
• Reduce phase - special data structures (2D hash tables) allow local reduction in the 3DMUs, minimizing the amount of data transferred during the final reduction
• Merge phase - the initial stages can be performed by PIM cores, and the rest by the host processor
• Here we emphasize single-node (intra-node) MapReduce operation and assume that a global (inter-node) level of MapReduce operation will take place if a cluster of such nodes is needed

[Figure: process organization. Labels: Input; Master Process; Manager Process 0 to Manager Process 3, each paired with a group of PIM Threads; regions marked "Running on host processor" and "Running on 3DMUs"]
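
The effect of local reduction can be illustrated with a small word-count sketch. This is not Phoenix++ code: plain Python `Counter` objects stand in for the 2D hash tables, each input string stands in for the splits held by one 3DMU, and the names `map_and_local_reduce` and `final_reduce` are hypothetical.

```python
from collections import Counter

def map_and_local_reduce(split):
    """Map emits one count per word; the Counter combines them inside one 3DMU."""
    return Counter(split.split())

def final_reduce(partials):
    """Host-side final reduction over the per-3DMU partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds counts key by key
    return total

# Hypothetical input: one string per 3DMU.
splits = ["a b a", "b c", "a c c", "d"]
partials = [map_and_local_reduce(s) for s in splits]
result = final_reduce(partials)

emitted_pairs = sum(len(s.split()) for s in splits)  # pairs without local reduction
transferred_pairs = sum(len(p) for p in partials)    # pairs after local reduction
```

Only the combined per-3DMU entries cross to the host, so `transferred_pairs` can never exceed `emitted_pairs`; the saving grows with the number of repeated keys inside each 3DMU.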

Experiment Setup
• Baseline vs. New System Configuration

[Figure: baseline system. Labels: two Xeon E5 processors (P0, P1) connected by QPI, each with its own memory (MEM0, MEM1)]

Table 1: Baseline System Configuration
  CPU           2 x Xeon E5-2640; 6 cores per processor, 2 threads/core; out-of-order, 4-wide issue
  Clock Speed   2.5 GHz
  L3 Cache      15 MB/processor
  Power         TDP = 95 W/processor; low-power = 15 W/processor
  Memory BW     42.6 GB/s per processor
  Memory        32 GB (8 x 4 GiB DDR3 DIMMs), NUMA enabled

Table 2: New System Configuration
                   Host Processor                PIM cores
  Processing Unit  1 Xeon E5-2640                64 = 4 * 16 ARM Cortex-A5
                   6 cores, 2 threads/core       In-order, single-issue
                   Out-of-order, 4-wide issue
  Clock Speed      2.5 GHz                       1 GHz
  LL Cache         15 MB                         32 KB I and 32 KB D per core
  Power            TDP = 95 W                    80 mW/core (5.12 W for 64 cores)
  Memory BW        42.6 GB/s                     1.33 GB/s per core
  Memory           32 GB (4 x 8 GiB 3DMUs), shared by host and PIM cores
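
The aggregate PIM figures implied by the two configuration tables can be checked with a few lines of arithmetic. The sketch below only recombines numbers listed in the tables. The summed power values are nominal TDP totals, not measurements, the aggregate bandwidth assumes all PIM cores stream concurrently, and the variable names are illustrative.

```python
# Values copied from the baseline and new system configuration tables.
UNITS = 4                # 3DMUs per node
CORES_PER_UNIT = 16      # PIM cores per 3DMU
PIM_CORE_POWER_MW = 80   # power per ARM Cortex-A5 core, in mW
PIM_CORE_BW_GBS = 1.33   # memory bandwidth per PIM core, in GB/s
XEON_TDP_W = 95          # TDP per Xeon E5-2640, in W

pim_cores = UNITS * CORES_PER_UNIT
pim_power_w = pim_cores * PIM_CORE_POWER_MW / 1000
pim_bw_gbs = pim_cores * PIM_CORE_BW_GBS

baseline_tdp_w = 2 * XEON_TDP_W       # baseline: two host processors
new_tdp_w = XEON_TDP_W + pim_power_w  # new system: one host processor plus PIM cores

print(f"{pim_cores} PIM cores, {pim_power_w:.2f} W, {pim_bw_gbs:.2f} GB/s aggregate")
print(f"nominal processor power: baseline {baseline_tdp_w} W, new {new_tdp_w:.2f} W")
```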

Experiments and Analysis
• Our assumption is that we can overlap the reading of data with the execution of map tasks
• Input reading is performed by the host CPU, and the map tasks by the PIM cores

[Figure: two timelines, (a) and (b), over Time (ms) from 0 to 224. Top row: the host reads input (IP) splits into 3DMU0, 3DMU1, 3DMU2 and 3DMU3 in turn. Rows below: idle/busy state of the PIM cores in 3DMU0 to 3DMU3.]
Fig: (a) PIM cores mostly idle (b) PIM core utilization is high

  - We do not want the cores to sit idle
  - Estimate the number of cores needed
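
The two situations in the figure can be reproduced with a toy timeline model. The sketch below is an assumption-laden illustration, not the authors' simulator: the host reads splits back to back, hands them to the 3DMUs round-robin, and the PIM cores of a 3DMU process one split at a time. The function name `pim_utilization` and all timing values are hypothetical.

```python
def pim_utilization(num_splits, t_read, t_proc, num_units=4):
    """Return (busy fraction per 3DMU, makespan) for a round-robin read schedule.

    t_read: time the host needs to read one input split
    t_proc: time the PIM cores of one 3DMU need to map one input split
    """
    busy = [0.0] * num_units     # accumulated processing time per 3DMU
    free_at = [0.0] * num_units  # time at which each 3DMU finishes its current split
    for i in range(num_splits):
        unit = i % num_units                # round-robin placement of split i
        ready = (i + 1) * t_read            # split i is fully read at this time
        start = max(ready, free_at[unit])   # wait for the data and for the cores
        free_at[unit] = start + t_proc
        busy[unit] += t_proc
    makespan = max(free_at)
    return [b / makespan for b in busy], makespan

# Hypothetical timings in ms: fast PIM processing leaves the cores mostly idle,
# while processing time close to num_units * t_read keeps them busy.
low_util, _ = pim_utilization(num_splits=16, t_read=16, t_proc=16)
high_util, high_makespan = pim_utilization(num_splits=16, t_read=16, t_proc=64)
```

In this model the cores stay busy without falling behind the host when `t_proc` is close to `num_units * t_read`, which is the balance the core-count estimate aims for.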

Experiments and Analysis
How many PIM cores per 3DMU do we need?
• The time taken by the PIM cores to process an input split should be smaller than the time taken by the host to read one input split
• The estimate depends on the following quantities:
  - s, the factor that indicates the relative slowdown of the simple PIM cores compared to the host
  - the time taken by the host to complete the map function on one input split
  - the time taken by the host to read one input split
  - n, the number of PIM cores in each of the 4 3DMUs
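
A minimal sketch of this estimate follows, under assumptions that are not taken from the slides: the map work on a split divides evenly over the PIM cores, and the requirement is read as a throughput condition over all 4 * n cores, i.e. s * T_map / (4 * n) <= T_read, where T_map and T_read are names introduced here for the host map time and the host read time of one split. The function name and the example timings are hypothetical.

```python
import math

def min_pim_cores_per_unit(t_map, t_read, slowdown, num_units=4):
    """Smallest n satisfying slowdown * t_map / (num_units * n) <= t_read.

    t_map:    host time to run the map function on one input split
    t_read:   host time to read one input split
    slowdown: factor s by which a PIM core is slower than the host
    """
    if min(t_map, t_read, slowdown) <= 0 or num_units <= 0:
        raise ValueError("all inputs must be positive")
    # Rearranged condition: n >= slowdown * t_map / (num_units * t_read)
    return max(1, math.ceil(slowdown * t_map / (num_units * t_read)))

# Hypothetical timings in ms, not measurements from the experiments.
n_exact = min_pim_cores_per_unit(t_map=128, t_read=8, slowdown=4)
n_rounded = min_pim_cores_per_unit(t_map=100, t_read=8, slowdown=4)
```

Rounding up guarantees the condition holds for the returned n; a larger slowdown factor or a faster host read rate raises the required core count proportionally.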