

SLIDE 1

Can Non-Volatile Memory Benefit MapReduce Applications on HPC Clusters?

Md. Wasi-ur-Rahman, Nusrat Sharmin Islam, Xiaoyi Lu, and Dhabaleswar K. (DK) Panda

Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA

PDSW-DISCS 2016
SLIDE 2

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 3

Introduction

  • Big Data has become one of the most important elements in business analytics
  • The rate of information growth appears to be exceeding Moore's Law
  • Every day ~2.5 quintillion (2.5×10^18) bytes of data are created
  • Big Data and High Performance Computing (HPC) are converging to meet large-scale data processing challenges
  • According to IDC, 67% of HPC centers are running High Performance Data Analysis (HPDA) workloads
  • The revenues of these workloads are expected to grow exponentially

Image sources: http://www.coolinfographics.com/blog/tag/data?currentPage=3, http://www.climatecentral.org/news/white-house-brings-together-big-data-and-climate-change-17194

SLIDE 4

Big Data Processing with Hadoop

  • Hadoop is the open-source implementation of the MapReduce programming model for Big Data analytics (a canonical mapper example follows below)
  • Major components

– HDFS
– MapReduce

  • The underlying Hadoop Distributed File System (HDFS) can be used by both MapReduce and end applications

[Diagram: Hadoop Framework — User Applications and Hadoop Common (RPC) layered over the MapReduce and HDFS components]
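To ground the programming model, here is the canonical word-count mapper from Hadoop's Java MapReduce API — the standard introductory example, shown only as context (it is not code from this talk):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Canonical word-count mapper: emits (word, 1) for every token in a line.
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}
```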

SLIDE 5

Drivers of Modern HPC Cluster Architectures

[Photos: Tianhe-2, Titan, Stampede, Gordon]

  • Multi-core/many-core technologies
  • Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
  • Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), Parallel File Systems
  • Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Diagram: components of modern HPC clusters — accelerators/coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip), high-performance interconnects (InfiniBand: <1 µsec latency, 100 Gbps bandwidth), multi-core processors, and SSD/NVMe-SSD/NVRAM storage]

SLIDE 6

Non-Volatile Memory Trends

http://www.slideshare.net/Yole_Developpement/yole-emerging-nonvolatile-memory-2016-report-by-yole-developpement?next_slideshow=2
http://www.chipdesignmag.com/bursky/?paged=2

  • NVM devices offer DRAM-like performance characteristics with persistence, making them suitable for data processing middleware
  • The number of NVM applications is growing rapidly because of their byte-addressability and persistence features

SLIDE 7

NVM-aware HDFS

  • Our previous work, NVFS, provides NVRAM-based designs for HDFS
  • Exploits byte-addressability of NVM for communication and I/O in HDFS
  • MapReduce, Spark, and HBase can obtain better performance by utilizing NVFS as input-output storage
  • N. S. Islam, M. W. Rahman, X. Lu, and D. K. Panda, High Performance Design for HDFS with Byte-Addressability of NVM and RDMA, International Conference on Supercomputing (ICS '16), June 2016.

[Diagram: NVM and RDMA-aware HDFS (NVFS) — applications and benchmarks (Hadoop MapReduce, Spark, HBase) co-designed with NVFS (cost-effectiveness, use-case); the DFSClient's RDMA sender communicates with DataNode-side RDMA receivers and replicators, which write through NVFS-MemIO to NVM or through NVFS-BlkIO to SSDs]

SLIDE 8

MapReduce on HPC Systems


Our previous works provide designs for MapReduce with these HPC resources

SLIDE 9

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 10

Problem Statement

  • What are the possible choices for using NVRAM in the MapReduce execution pipeline?
  • How can MapReduce execution frameworks take advantage of NVRAM in such use cases?
  • Can MapReduce benchmarks and applications benefit from the use of NVRAM in terms of performance and scalability?


SLIDE 11

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 12

Key Contributions

  • Proposed a novel NVRAM-assisted Map Output Spill approach
  • Applied our approach on top of RDMA-based Hadoop MapReduce to retain enhancements in both the map and reduce phases
  • The proposed approach significantly outperforms current approaches, as demonstrated with different sets of workloads


SLIDE 13

RDMA-enhanced MapReduce

  • RDMA-based MapReduce

– RDMA-based shuffle engine
– Pre-fetching and caching of intermediate data

  • M. W. Rahman, N. S. Islam, X. Lu, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance RDMA-based Design of Hadoop MapReduce over InfiniBand, HPDIC, in conjunction with IPDPS, 2013

  • Hybrid Overlapping among Phases (HOMR)

– Overlapping among map, shuffle, and merge phases as well as shuffle, merge, and reduce phases
– Advanced shuffle algorithms with dynamic adjustments in shuffle volume

  • M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, 2014

These designs are incorporated into the public release of the "RDMA for Apache Hadoop" package under the HiBD project

SLIDE 14

The High-Performance Big Data (HiBD) Project

  • RDMA for Apache Spark
  • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)

– Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions

  • RDMA for Apache HBase
  • RDMA for Memcached (RDMA-Memcached)
  • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
  • OSU HiBD-Benchmarks (OHB)

– HDFS, Memcached, and HBase micro-benchmarks

  • http://hibd.cse.ohio-state.edu
  • User base: 195 organizations from 26 countries
  • More than 18,600 downloads from the project site
  • RDMA for Impala (upcoming)

Available for InfiniBand and RoCE

SLIDE 15

RDMA for Apache Hadoop 2.x

  • High-performance design of Hadoop over RDMA-enabled interconnects

– High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for HDFS, MapReduce, and RPC components
– Enhanced HDFS with in-memory and heterogeneous storage
– High-performance design of MapReduce over Lustre
– Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, HDP, and CDH

  • Current release: 1.1.0

– Based on Apache Hadoop 2.7.3
– Compliant with Apache Hadoop 2.7.3, HDP 2.5.0.3, and CDH 5.8.2 APIs and applications
– http://hibd.cse.ohio-state.edu

SLIDE 16

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design

– Optimization Opportunities
– NVRAM-Assisted Map Spilling

  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 17

Optimization Opportunities

  • Utilizing NVMs as PCIe SSD devices would be straightforward, as the sketch after this list shows

– Configuring the Hadoop local directories with the NVMe SSD locations
– No design changes required
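As a minimal sketch of such a configuration change (assuming a hypothetical NVMe mount point, /mnt/nvme/hadoop/local; in practice the same property would normally be set in mapred-site.xml rather than in code):

```java
import org.apache.hadoop.conf.Configuration;

public class NvmeLocalDirs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Point MapReduce intermediate data (spill/merge files) at the
        // NVMe SSD; "/mnt/nvme/hadoop/local" is a hypothetical mount point.
        conf.set("mapreduce.cluster.local.dir", "/mnt/nvme/hadoop/local");
        System.out.println(conf.get("mapreduce.cluster.local.dir"));
    }
}
```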

[Chart: Execution time (s) with HDD, SSD, and RAMDisk as intermediate data storage]

  • Performance improvement potential with such configuration changes is not high

– Only 16% improvement for RAMDisk over HDD as intermediate data storage

  • Utilizing NVMs as NVRAM can be crucial
SLIDE 18

HOMR Design and Execution Flow

[Diagram: HOMR execution flow — map tasks (Read, Map, Spill, Merge) consume input files and produce intermediate data; reduce tasks (Shuffle, In-Mem Merge, Reduce) fetch intermediate data over RDMA and write output files. All operations are in-memory; opportunities exist to improve the performance with NVRAM.]

SLIDE 19

Profiling Map Phase

  • Map execution performance can be estimated from five different stages, listed below; a timing sketch follows the list

1. Reading input data from the file system
2. Applying the map() function
3. Serialization and partitioning
4. Spilling key-value pairs to files
5. Merging the spill files and writing the data to intermediate storage

The last two stages involve disk operations on intermediate data storage.
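A minimal sketch of how such per-stage costs can be captured; the stage methods below are hypothetical stubs standing in for Hadoop internals, used only to show where the timestamps would be taken:

```java
// Illustrative per-stage timer for the five map stages; not Hadoop code.
public class MapStageProfiler {
    static void read()    { /* read input split from the file system */ }
    static void map()     { /* apply the user map() function */ }
    static void collect() { /* serialize and partition key-value pairs */ }
    static void spill()   { /* spill sorted key-value runs to files */ }
    static void merge()   { /* merge spill files into intermediate storage */ }

    public static void main(String[] args) {
        String[] names = {"read", "map", "collect", "spill", "merge"};
        Runnable[] stages = {MapStageProfiler::read, MapStageProfiler::map,
                             MapStageProfiler::collect, MapStageProfiler::spill,
                             MapStageProfiler::merge};
        for (int i = 0; i < stages.length; i++) {
            long t0 = System.nanoTime();
            stages[i].run();
            System.out.printf("%s: %.3f ms%n", names[i],
                              (System.nanoTime() - t0) / 1e6);
        }
    }
}
```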

SLIDE 20

Profiling Map Phase

  • Profiled 20 GB Sort and TeraSort experiments on 8 nodes with default Hadoop
  • Averaged over 3 executions
  • Spill + Merge takes 1.71x more time compared to Read + Map + Collect for Sort; for TeraSort, it takes 3.75x more time

[Chart: Time (s) spent in Read + Map + Collect vs. Spill + Merge for Sort and TeraSort]

SLIDE 21

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design

– Optimization Opportunities
– NVRAM-Assisted Map Spilling

  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 22

NVRAM-Assisted Map Spilling


[Diagram: NVRAM-assisted map spilling — same HOMR flow as before (map tasks: Read, Map, Spill, Merge; reduce tasks: Shuffle, In-Mem Merge, Reduce; shuffle over RDMA), but the Spill stage now targets NVRAM]

  • Minimizes the disk operations in the Spill phase (see the sketch below)
  • Final merged output is still written to intermediate data storage to maintain similar fault tolerance
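A minimal sketch of the spill path under this design, assuming the NVRAM is exposed as a memory-mappable (e.g., DAX-mounted) file; the path /mnt/pmem/spill.buf and the 64 MB region size are hypothetical:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class NvramSpillSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical NVRAM-backed file (e.g., on a DAX-mounted pmem device).
        try (RandomAccessFile f = new RandomAccessFile("/mnt/pmem/spill.buf", "rw");
             FileChannel ch = f.getChannel()) {
            MappedByteBuffer spill = ch.map(FileChannel.MapMode.READ_WRITE, 0, 64 << 20);
            // Spill phase: byte-addressable stores instead of disk writes.
            byte[] kv = "key\tvalue\n".getBytes(StandardCharsets.UTF_8);
            spill.put(kv);
            spill.force(); // flush the mapped region for persistence
            // Merge phase (not shown): merged output is still written to the
            // regular intermediate-data directory for fault tolerance.
        }
    }
}
```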

SLIDE 23

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 24

Experimental Setup

  • We have used SDSC-Comet for our evaluation

– 9 nodes
– 12-core Intel Xeon E5-2680 v3 (Haswell) processors
– 128 GB DDR4 DRAM
– 320 GB local SATA SSD
– 56 Gbps FDR InfiniBand

  • Software and libraries

– Hadoop 2.6.0, JDK 1.7
– RDMA-based Apache Hadoop 0.9.7


SLIDE 25

Configurations and Notations

  • Hadoop configurations used throughout the experiments
  • Notations used in the graphs

Parameter                     Value
HDFS Block Size               256 MB
HDFS Data Directory           <SSD Location>
Intermediate Data Directory   <SSD Location>
YARN Concurrent Containers    12

Hadoop Repo                                              Notation Used
Apache Hadoop                                            MR
RDMA Hadoop                                              RMR
RDMA Hadoop with NVRAM-Assisted Map Spill (this paper)   RMR-NVM

SLIDE 26

Simulating NVRAM Performance

  • Because of hardware limitations, we perform simulation to predict NVRAM performance using DRAM
  • Assumption: NVRAM write is 10x slower compared to DRAM write; NVRAM read performs similarly to DRAM read

– Flashtec NVRAM. http://www.enterprisetech.com/2014/08/06/flashtec-nvram-15-million-iops-sub-microsecond-latency
– S. Pelley, T. F. Wenisch, B. T. Gold, and B. Bridge, Storage Management in the NVRAM Era, Proc. VLDB Endow., 2013.

  • We simulate NVRAM performance by adding a delay (δ) after DRAM write operations
  • We utilize System.nanoTime() for adding a sleep to simulate δ; a minimal sketch follows
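A minimal sketch of this delay injection; the 900 ns value of δ is a placeholder for illustration, not the value used in the paper:

```java
public class NvramDelaySim {
    // Busy-wait for delta nanoseconds after a DRAM write to mimic the
    // slower NVRAM write latency; Thread.sleep() is too coarse at this scale.
    static void nvramWriteDelay(long deltaNanos) {
        long end = System.nanoTime() + deltaNanos;
        while (System.nanoTime() < end) {
            // spin until delta has elapsed
        }
    }

    public static void main(String[] args) {
        byte[] buffer = new byte[4096];
        long t0 = System.nanoTime();
        buffer[0] = 1;          // the actual DRAM write
        nvramWriteDelay(900);   // hypothetical delta for a 10x slower write
        System.out.println("elapsed ns: " + (System.nanoTime() - t0));
    }
}
```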


SLIDE 27

Benefits in Map Phase

  • Read + Map + Collect performs similarly across different MR designs
  • Spill + Merge in RMR-NVM performs significantly better compared to both MR and RMR
  • 20 GB Sort and TeraSort experiments on 8 nodes; the RMR-NVM Map phase performs at least 2x better compared to RMR

[Charts: Time (s) for Read + Map + Collect and for Spill + Merge, for Sort and TeraSort under MR, RMR, and RMR-NVM]

SLIDE 28

Benefits in Map Phase (Contd.)

  • Profiling Map Spill Cost for different MR frameworks
  • Sort experiment with 96 maps on 8 nodes
  • Sorted spill costs for all maps; averaged over 3 iterations to minimize variation
  • Average benefit of 2.39x is achieved across all maps

[Chart: Spill cost (ms) per map task (1 to 96) for MR-IPoIB, RMR, and RMR-NVM, showing the 2.39x average benefit of RMR-NVM]

SLIDE 29

Comparison with Sort and TeraSort

  • For Sort, RMR-NVM achieves 2.37x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit is 55% compared to MR-IPoIB and 28% compared to RMR
  • For TeraSort, RMR-NVM achieves 2.48x benefit for the Map phase compared to RMR and MR-IPoIB; overall benefit is 51% compared to MR-IPoIB and 31% compared to RMR

SLIDE 30

Evaluation of Intel HiBench Workloads

  • We evaluate different HiBench workloads with Huge data sets on 8 nodes
  • Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:

– Sort: 42% (25 GB)
– TeraSort: 39% (32 GB)
– PageRank: 21% (5 million pages)

  • Other workloads:

– WordCount: 18% (25 GB)
– KMeans: 11% (100 million samples)

SLIDE 31

Evaluation of PUMA Workloads

  • We evaluate different PUMA workloads on 8 nodes with 30 GB data size
  • Performance benefits for shuffle-intensive workloads compared to MR-IPoIB:

– AdjList: 39%
– SelfJoin: 58%
– RankedInvIndex: 39%

  • Other workloads:

– SeqCount: 32%
– InvIndex: 18%

SLIDE 32

Outline

  • Introduction
  • Problem Statement
  • Key Contributions
  • Opportunities and Design
  • Performance Evaluation
  • Conclusion and Future Work


SLIDE 33

Conclusion and Future Work

  • We propose an enhanced design of MapReduce with NVRAM
  • NVRAM-assisted Map Spilling provides significant performance benefits (2.73x) in the Map phase compared to previous designs
  • Overall, it achieves 55% performance benefits for Sort and 58% for SelfJoin
  • This design will be made available in the public release of the "RDMA for Apache Hadoop" package under the HiBD project (http://hibd.cse.ohio-state.edu)
  • In the future, we plan to extend other MapReduce execution frameworks (e.g., Spark, Tez) by leveraging similar design choices with NVRAM

SLIDE 34

Thank You!

{rahmanmd, islamn, luxi, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
High Performance Big Data: http://hibd.cse.ohio-state.edu/