

slide-1
SLIDE 1

Improving Node-level MapReduce Performance using Processing-in-Memory Technologies

Mahzabeen Islam, Marko Scrbak and Krishna M. Kavi Computer Systems Research Laboratory Department of Computer Science & Engineering University of North Texas, USA Mike Ignatowski and Nuwan Jayasena AMD Research - Advanced Micro Devices, Inc., USA

slide-2
SLIDE 2

Overview

  • Introduction
  • Motivation
  • Proposed Model
  • Server Architecture
  • Programming Framework
  • Experiments
  • Results
  • Conclusion and Future Work
  • Related Work
  • References

3. Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: ISPASS (2014)
4. Zhang, D., Jayasena, N., Lyashevsky, A., et al.: TOP-PIM: throughput-oriented programmable processing in memory. In: HPDC (2014)

slide-3
SLIDE 3

Introduction

  • 3D-stacked DRAM consists of DRAM dies stacked on top of a logic die; it provides
  • higher memory bandwidth,
  • lower access latencies, and
  • lower energy consumption than existing DRAM technologies

Ø Hybrid Memory Cube (HMC): capacity 2-4 GB, bandwidth 160 GB/sec (15x DDR3), 70%less energy per bit 1

  • The bottom logic die contains peripheral circuitry (row decoders, sense amps, etc.), but there is still enough silicon left for other logic

  • 3D-DRAM can be used as a large last-level cache, as main memory, or as a buffer to PCM
  • SRAM can be integrated in the logic layer to aid address translation – hardware page tables
  • A recent trend is to put processing capabilities in the logic layer


slide-4
SLIDE 4

Processing in Memory

  • Processing-In-Memory (PIM) is the concept of moving computation closer to memory

  • Advantages:

Ø Low access latency, high memory bandwidth and high degree of parallelization can be achieved by adding simple processing cores in memory Ø Minimize cache pollution by not transferring some data to main cores Ø Data intensive/memory bounded applications , which do not benefit from the conventional cache hierarchies, could benefit from PIM

  • Concerns:

Ø Designing appropriate system architecture.

§ Too many design choices – main processor, PIM processors, memory hierarchy, communication channels, interfaces

Ø Requires changes to Operating System (memory management), programming framework (e.g. MapReduce library), programming models (synchronization, coherence)


slide-5
SLIDE 5

Our Work

  • 3D-stacked DRAM has generated renewed interest in PIM
  • We can use several low-power cores in the logic layer of a 3D-DRAM to execute memory-bound functions closer to memory
  • Our current research focuses on Big Data analyses based on the MapReduce programming model

Ø Map functions are good candidates for executing on PIM processors
Ø We propose and evaluate a server architecture here
Ø MapReduce is modified for shared-memory processors

§ We plan to investigate using PIM for other parts of MapReduce applications
§ And other classes of applications (scale-out applications)
§ Contemporary research shows that emerging scale-out applications do not benefit from conventional processor architectures and cache hierarchies 2


slide-6
SLIDE 6

Proposed Server Architecture

Fig: Proposed server architecture – the host connects to each 3DMU through an abstract load/store interface; the PIM & DRAM controllers sit in the logic die below the memory dies, behind a timing-specific DRAM interface

  • Host processor connected to multiple 3D Memory Units (3DMUs)
  • PIM cores in the logic layer of each 3DMU
  • Simple, in-order, single-issue, energy-efficient PIM cores with only L1 caches
  • Processes running on the host control the execution of PIM threads
  • Unified memory view as proposed by the Heterogeneous System Architecture (HSA) Foundation
  • A number of such nodes will make up a cluster


slide-7
SLIDE 7

Proposed MapReduce Framework

  • Adapt MapReduce frameworks for shared-memory systems that exhibit NUMA

Ø We chose Phoenix++, which works with CMP and SMP systems
Ø Needed to modify Phoenix++ for our purposes


  • Map phase – overlapped with reading the input, using PIM cores (the host reads from files)
  • Reduce phase – special data structures (2D hash tables) allow local reduction in the 3DMUs to minimize the amount of data transferred during the final reduction
  • Merge phase – initial stages can be performed by PIM cores, and the rest by the host processor
  • Here we emphasize single (intra-) node level MapReduce operation, and assume a global (inter-) node level of MapReduce operation will take place if we build a cluster of such nodes

Fig: Framework organization – a master process and an input manager run on the host processor; manager processes 0-3 each control the PIM threads of one 3DMU
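The overlap-and-local-reduce flow above can be sketched in plain Python. This is an illustrative word-count only: the host thread stands in for file reading, one worker thread per 3DMU stands in for the PIM cores, and all names are ours, not the Phoenix++ API.

```python
from collections import Counter
from queue import Queue
from threading import Thread

NUM_MUS = 4  # one worker per 3D Memory Unit (3DMU)

def run_wordcount(splits):
    """Host 'reads' input splits and hands them round-robin to per-3DMU
    workers; workers overlap map work with reading and build local hash
    tables, and a final host-side pass merges the local tables."""
    queues = [Queue() for _ in range(NUM_MUS)]
    local_tables = [Counter() for _ in range(NUM_MUS)]

    def pim_worker(mu_id):
        # Map + local reduce: count words into this 3DMU's private table,
        # so only the (small) local table crosses to the host later.
        while True:
            split = queues[mu_id].get()
            if split is None:          # poison pill: host finished reading
                return
            local_tables[mu_id].update(split.split())

    workers = [Thread(target=pim_worker, args=(i,)) for i in range(NUM_MUS)]
    for w in workers:
        w.start()
    # Host thread: read splits and distribute them to the 3DMUs.
    for i, split in enumerate(splits):
        queues[i % NUM_MUS].put(split)
    for q in queues:
        q.put(None)
    for w in workers:
        w.join()
    # Final reduction on the host: merge the per-3DMU local tables.
    merged = Counter()
    for table in local_tables:
        merged.update(table)
    return merged
```

The per-partition tables play the role of the slide's 2D hash tables: local reduction happens where the data sits, and only reduced tables move during the final merge.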


slide-8
SLIDE 8

Experiment Setup

Fig: Baseline system – two Xeon E5 processors connected by QPI, each with its own local memory

Table 1: Baseline System Configuration

CPU          2 x Xeon E5-2640; 6 cores/processor, 2 threads/core; out-of-order, 4-wide issue
Clock Speed  2.5 GHz
L3 Cache     15 MB/processor
Power        TDP = 95 W/processor; low-power = 15 W/processor
Memory BW    42.6 GB/s per processor
Memory       32 GB (8 x 4 GiB DDR3 DIMMs), NUMA enabled

Table 2: New System Configuration

                 Host Processor                      PIM cores
Processing Unit  1 Xeon E5-2640; 6 cores,            64 (= 4 x 16) ARM Cortex-A5;
                 2 threads/core; OoO, 4-wide issue   in-order, single-issue
Clock Speed      2.5 GHz                             1 GHz
LL Cache         15 MB                               32 KB I + 32 KB D per core
Power            TDP = 95 W                          80 mW/core (5.12 W for 64 cores)
Memory BW        42.6 GB/s                           1.33 GB/s per core
Memory           32 GB (4 x 8 GiB 3DMUs)

  • Baseline vs. New System Configuration



slide-9
SLIDE 9

Experiments and Analysis

  • Our assumption is that we can overlap reading of data with the execution of map tasks
  • The input reading is performed by the host CPU and the map tasks by PIM cores

Ø We do not want the cores to sit idle Ø Estimate the number of cores needed

Fig : (a) PIM cores mostly idle (b) PIM core utilization is high

(Figure: timelines over 0-224 ms showing the host reading input splits into 3DMU0-3DMU3 and the busy/idle periods of the PIM cores in each 3DMU; in (a) the cores are mostly idle, in (b) mostly busy)

slide-10
SLIDE 10

Experiments and Analysis

How many PIM cores per 3DMU do we need? The time taken by the PIM cores to process an input split should be smaller than the time taken by the host to read one input split. Let t_read be the time the host takes to read one input split, t_map the time the host takes to complete the map function on one input split, and s the factor indicating the relative slowdown of the simple PIM cores compared to the host. There are 4 3DMUs and each contains n PIM cores, so 4n PIM cores share the map work. For the PIM cores to keep pace with input reading:

    (s × t_map) / (4n) ≤ t_read,   i.e.   n ≥ (s × t_map) / (4 × t_read)


slide-11
SLIDE 11

Experiments and Analysis

  • We ran different workloads on the baseline system with Phoenix++ and measured t_map, the time to complete the map function on one input split (Table 3)
  • We used two different storage technologies – HDD and SSD – on the baseline system to measure t_read, the time to read one input split (Table 4)

Table 3: Measured map time (t_map) per input split
word count         25 ms
histogram           7 ms
string match       12 ms
linear regression   7 ms

Table 4: Measured read time (t_read) per input split
HDD  10.42 ms
SSD   2.17 ms
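The core-count estimate can be sketched numerically from these measurements. The condition (4n PIM cores with slowdown s must keep pace with the host reading one split every t_read, i.e. n ≥ s·t_map / (4·t_read)) and s = 4 are from the slides; the function and variable names are ours.

```python
import math

# Measured on the baseline system (Tables 3 and 4); times are per input split.
T_MAP_MS = {"word count": 25, "histogram": 7, "string match": 12, "linear regression": 7}
T_READ_MS = {"HDD": 10.42, "SSD": 2.17}
NUM_MUS = 4   # four 3DMUs share the map work
S = 4         # estimated slowdown of a PIM core relative to a host core

def cores_per_3dmu(t_map_ms, t_read_ms, s=S):
    """Smallest n such that the 4n PIM cores keep pace with input reading:
    s * t_map / (4n) <= t_read  =>  n >= s * t_map / (4 * t_read)."""
    return math.ceil(s * t_map_ms / (NUM_MUS * t_read_ms))

# Worst case across workloads for each storage type.
worst = {st: max(cores_per_3dmu(tm, tr) for tm in T_MAP_MS.values())
         for st, tr in T_READ_MS.items()}
```

With these inputs the worst case is word count on SSD, needing 12 cores per 3DMU, consistent with the next slide's "fewer than 16" conclusion.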


slide-12
SLIDE 12
Experiments and Analysis

  • To achieve full utilization of the PIM cores, n ≥ (s × t_map) / (4 × t_read) must hold, where t_map is the host's map time and t_read the host's read time per input split
  • We estimated how many cores (n) we need to achieve the overlapping of map tasks with input reading, for different storage technologies and slowdown factors
  • For an estimated slowdown factor of 4 for the PIM cores, we need fewer than 16 cores per 3DMU – but we use 16 cores to handle stragglers

Fig: Required number of PIM cores for different slowdown factors


slide-13
SLIDE 13

Feasibility Analysis

  • Embedding 16 PIM cores on the logic layer of a 3DMU is possible

Ø We use ARM Cortex-A5 as PIM core to estimate silicon area needed

§ 40 nm process technology
§ 1 GHz clock
§ 32 KB D and 32 KB I L1 caches

  • Area?

Ø Each PIM core needs an area of 0.80mm2 Ø The area overhead for 16 PIM core is only ~12% in the logic die 3

  • Power budget?

Ø Each PIM core has an average power consumption of 80 mW Ø 16 PIM core will consume 1.28W, which is only ~13% of the 10W TDP budget of the logic layer 4


slide-14
SLIDE 14

Performance Analysis

The performance gain using 16 PIM cores per 3DMU, assuming a slowdown factor of 4, is shown here. We reduced the execution time by t_map, since the map work is overlapped with input reading. The average reduction is 8%. We feel that overlapping some merge and reduce functions would lead to higher gains. Gains are higher if the input does not fit in memory.


slide-15
SLIDE 15

Energy Analysis

  • Data is for 16 PIM cores in each 3DMU (64 in total)
  • Relative energy savings range from 10% to 23%, with absolute energy savings ranging from 80 J to 2045 J
  • The energy reduction is due to the low-power cores

Fig: Energy consumption by processing units


slide-16
SLIDE 16

Bandwidth Utilization

  • On the baseline system the peak BW consumption is less than 15 GB/s
  • Assuming 64 PIM cores, peak BW consumption is 60 GB/s (15 GB/s at each 3DMU)
  • Each SerDes link provides 40 GB/s at 5 W power consumption 5
  • 64 ARM Cortex-A5 cores (if placed on the host chip) can consume at most 88 GB/s 6, requiring at least 3 SerDes links to transfer data between the host and the 3D-DRAM
  • But when we use PIM cores, these high-bandwidth links are not required

Fig: Bandwidth consumption when running wordcount on two different systems.
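The "at least 3 SerDes links" claim is simple arithmetic on the slide's quoted figures; a quick check (variable names are ours):

```python
import math

# Figures quoted on the slide.
HOST_SIDE_PEAK_GBPS = 88   # 64 Cortex-A5 cores placed on the host chip
SERDES_LINK_GBPS = 40      # bandwidth of one SerDes link
SERDES_LINK_W = 5          # power of one SerDes link

# Without PIM, all that traffic must cross the host<->3D-DRAM links:
links_needed = math.ceil(HOST_SIDE_PEAK_GBPS / SERDES_LINK_GBPS)
link_power = links_needed * SERDES_LINK_W

# With PIM, each 3DMU's 15 GB/s stays inside its own stack, so these
# high-bandwidth off-chip links are not needed for the map traffic.
```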


slide-17
SLIDE 17

Summary and conclusions

  • We propose to use simple, energy-efficient cores embedded in the logic layer of 3D-DRAMs
  • We show how our architecture can be used for MapReduce workloads
  • We estimated the number of PIM cores needed to achieve a good balance of work between the host and the PIMs
  • We have shown that we achieve both energy savings and performance gains
  • We have found that most applications do not need the high bandwidth offered (or proposed) by current prototypes of 3D-stacked DRAMs


slide-18
SLIDE 18

Future Work

  • Conduct experiments using a variety of scale-out applications
  • Investigate the impact of reduced memory-bus traffic on the memory hierarchy

Ø Bandwidth utilization
§ Use low-energy buses with smaller bandwidth instead of high-speed SerDes links
Ø Alternative cache organization
§ What are the savings?

  • ARM Cortex-A5
Ø Do we need such processor complexity?
Ø Simple RISC cores? Even less energy consumption
Ø GPGPUs, FPGAs

  • Estimate the performance of a cluster comprising the proposed nodes


slide-19
SLIDE 19

Related Study

  • Several studies have been conducted on using 3D-DRAM as LLC or main memory [1, 2]
  • Researchers worked on the PIM idea a decade ago [3, 4, 5, 6], but integrating DRAM and logic on the same die was not very successful at that time
  • The Phoenix++ MapReduce framework [7] works for conventional large-scale shared-memory CMP and SMP systems. We use Phoenix++ as our basis and propose changes to adapt it to our PIM architecture
  • The Near Data Computing (NDC) architecture [8] is similar in spirit to our study and assumes 3D-DRAMs embedded with processing cores

Ø The NDC study works only with in-memory MapReduce workloads

  • A recent study on scale-out cloud applications shows that they do not benefit from common server-class processors with complex microarchitectures and deep cache hierarchies; they also do not need high-bandwidth on- and off-chip links [9]


slide-20
SLIDE 20

References

1. Black, B., Annavaram, M., Brekelbaum, N., DeVale, J., et al.: Die stacking (3D) microarchitecture. In: MICRO, pp. 469-479. IEEE (2006)
2. Zhang, D. P., Jayasena, N., Lyashevsky, A., et al.: A new perspective on processing-in-memory architecture design. In: Proceedings of the ACM SIGPLAN Workshop
3. Patterson, D., Anderson, T., Cardwell, N., et al.: A case for intelligent RAM. In: IEEE Micro, 17(2), 34-44 (1997)
4. Torrellas, J.: FlexRAM: Toward an advanced Intelligent Memory system: A retrospective paper. In: Intl. Conference on Computer Design, pp. 3-4. IEEE (2012)
5. Draper, J., Chame, J., Hall, M., et al.: The architecture of the DIVA processing-in-memory chip. In: Proceedings of Supercomputing, pp. 14-25. ACM (2002)
6. Rezaei, M., Kavi, K. M.: Intelligent memory manager: Reducing cache pollution due to memory management functions. In: Journal of Systems Architecture, 52(1), 41-55 (2006)
7. Talbot, J., Yoo, R. M., Kozyrakis, C.: Phoenix++: modular MapReduce for shared-memory systems. In: Proceedings of the international workshop on MapReduce and its applications, pp. 9-16. ACM (2011)
8. Pugsley, S. H., Jestes, J., Zhang, H.: NDC: Analyzing the Impact of 3D-Stacked Memory+Logic Devices on MapReduce Workloads. In: International Symposium on Performance Analysis of Systems and Software (ISPASS) (2014)
9. Ferdman, M., Adileh, A., Kocberber, O., et al.: A Case for Specialized Processors for Scale-Out Workloads. In: IEEE Micro, pp. 31-42 (2014)


slide-21
SLIDE 21

Thank You Questions ?
