slide-1
SLIDE 1

Accelerating Data Management and Processing on Modern Clusters with RDMA-Enabled Interconnects

Keynote Talk at ADMS 2014 by

Dhabaleswar K. (DK) Panda, The Ohio State University, E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

slide-2
SLIDE 2

Introduction to Big Data Applications and Analytics

  • Big Data has become one of the most important elements of business analytics
  • Provides groundbreaking opportunities for enterprise information management and decision making
  • The amount of data is exploding; companies are capturing and digitizing more information than ever
  • The rate of information growth appears to be exceeding Moore's Law

ADMS '14

2

slide-3
SLIDE 3

4V Characteristics of Big Data

ADMS '14

  • Commonly accepted 3V’s of Big Data
  • Volume, Velocity, Variety

Michael Stonebraker: Big Data Means at Least Three Different Things, http://www.nist.gov/itl/ssd/is/upload/NIST-stonebraker.pdf

  • 4/5V’s of Big Data – 3V + *Veracity, *Value

3

slide-4
SLIDE 4

Velocity of Big Data – How Much Data Is Generated Every Minute on the Internet?

The global Internet population grew 6.59% from 2010 to 2011 and now represents 2.1 billion people.

http://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute

ADMS '14

4

slide-5
SLIDE 5
  • Substantial impact on designing and utilizing modern data management and processing systems in multiple tiers
    – Front-end data accessing and serving (Online)
      • Memcached + DB (e.g., MySQL), HBase
    – Back-end data analytics (Offline)
      • HDFS, MapReduce, Spark

ADMS '14

Data Management and Processing in Modern Datacenters

5

slide-6
SLIDE 6
  • Three-layer architecture of Web 2.0

– Web Servers, Memcached Servers, Database Servers

  • Memcached is a core component of Web 2.0 architecture

6

Overview of Web 2.0 Architecture and Memcached

ADMS '14

slide-7
SLIDE 7

ADMS '14

7

Memcached Architecture

  • Distributed Caching Layer
    – Allows aggregation of spare memory from multiple nodes
    – General purpose
  • Typically used to cache database queries and results of API calls
  • Scalable model, but typical usage is very network intensive (a minimal client-side sketch follows)
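To make the caching layer concrete, here is a minimal client-side sketch using the standard libmemcached C API (the same client library the RDMA-Memcached design later in the talk stays API-compatible with); the server address, key, and value are illustrative placeholders.

```c
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Connect to a single Memcached server (host/port are placeholders). */
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);

    const char *key = "user:42";            /* e.g., a cached DB row or API result */
    const char *val = "{\"name\":\"alice\"}";

    /* SET: cache the value with no expiration and no flags. */
    memcached_return_t rc = memcached_set(memc, key, strlen(key),
                                          val, strlen(val), (time_t)0, (uint32_t)0);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    /* GET: fetch it back; a miss returns NULL and the caller falls back to the DB. */
    size_t vlen; uint32_t flags;
    char *out = memcached_get(memc, key, strlen(key), &vlen, &flags, &rc);
    if (out) {
        printf("hit: %.*s\n", (int)vlen, out);
        free(out);
    }
    memcached_free(memc);
    return 0;
}
```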
slide-8
SLIDE 8

8

HBase Overview

  • Apache Hadoop Database (http://hbase.apache.org/)
  • Semi-structured database, which is highly scalable
  • Integral part of many datacenter applications
    – e.g., Facebook Social Inbox
  • Developed in Java for platform independence and portability
  • Uses sockets for communication!

(HBase Architecture)

ADMS '14

slide-9
SLIDE 9

Hadoop Distributed File System (HDFS)

  • Primary storage of Hadoop; highly reliable and fault-tolerant
  • Adopted by many reputed organizations
    – e.g., Facebook, Yahoo!
  • NameNode: stores the file system namespace
  • DataNode: stores data blocks
  • Developed in Java for platform independence and portability
  • Uses sockets for communication!

(HDFS Architecture: the client communicates with the NameNode and DataNodes over RPC)

ADMS '14

9

slide-10
SLIDE 10

Data Movement in Hadoop MapReduce

Disk Operations

  • Map and Reduce tasks carry out the total job execution
    – Map tasks read from HDFS, operate on the data, and write the intermediate data to local disk
    – Reduce tasks obtain these data via shuffle from TaskTrackers, operate on them, and write to HDFS
  • Communication in the shuffle phase uses HTTP over Java sockets

10

ADMS '14

Bulk Data Transfer

slide-11
SLIDE 11

Spark Architecture Overview

  • An in-memory data-processing framework
    – Iterative machine learning jobs
    – Interactive data analytics
    – Scala-based implementation
    – Standalone, YARN, Mesos
  • Scalable and communication intensive
    – Wide dependencies between Resilient Distributed Datasets (RDDs)
    – MapReduce-like shuffle operations to repartition RDDs
    – Sockets-based communication

11 ADMS '14

http://spark.apache.org

slide-12
SLIDE 12
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase

– RDMA-based Memcached – Case study with OLTP – SSD-assisted hybrid Memcached – RDMA-based HBase

  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

12

slide-13
SLIDE 13

High-End Computing (HEC): PetaFlop to ExaFlop

ADMS '14

13

Expected to have an ExaFlop system in 2020-2022! (100 PFlops in 2015, 1 EFlops in 2018?)

slide-14
SLIDE 14

Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org)

(Chart: percentage of clusters and number of clusters in the Top 500 over time)

ADMS '14

14

slide-15
SLIDE 15
  • High End Computing (HEC) grows dramatically
    – High Performance Computing
    – Big Data Computing
  • Technology Advancement
    – Multi-core/many-core technologies and accelerators
    – Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
    – Solid State Drives (SSDs) and Non-Volatile Random-Access Memory (NVRAM)
    – Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

High End Computing (HEC)

(Systems pictured: Tianhe-2 (#1), Titan (#2), Stampede (#6), Tianhe-1A (#10))

ADMS '14

15

slide-16
SLIDE 16

Overview of High Performance Interconnects

  • High-Performance Computing (HPC) has adopted advanced interconnects and protocols
    – InfiniBand
    – 10 Gigabit Ethernet/iWARP
    – RDMA over Converged Enhanced Ethernet (RoCE)
  • Very good performance
    – Low latency (a few microseconds)
    – High bandwidth (100 Gb/s with dual FDR InfiniBand)
    – Low CPU overhead (5-10%)
  • The OpenFabrics software stack (www.openfabrics.org) with IB, iWARP, and RoCE interfaces is driving HPC systems
  • Many such systems in the Top500 list

ADMS '14

16

slide-17
SLIDE 17

ADMS '14

All interconnects and protocols in the OpenFabrics Stack

(Diagram: Application / Middleware sits on top of either the Sockets or the Verbs interface; the stacks below are)

  Stack           Interface   Protocol                     Adapter             Switch
  1/10/40 GigE    Sockets     Ethernet Driver, TCP/IP      Ethernet Adapter    Ethernet Switch
  10/40 GigE-TOE  Sockets     Hardware-offloaded TCP/IP    Ethernet Adapter    Ethernet Switch
  IPoIB           Sockets     IPoIB (kernel space)         InfiniBand Adapter  InfiniBand Switch
  RSockets        Sockets     RSockets (user space)        InfiniBand Adapter  InfiniBand Switch
  SDP             Sockets     SDP                          InfiniBand Adapter  InfiniBand Switch
  iWARP           Verbs       TCP/IP (user space)          iWARP Adapter       Ethernet Switch
  RoCE            Verbs       RDMA (user space)            RoCE Adapter        Ethernet Switch
  IB Native       Verbs       RDMA (user space)            InfiniBand Adapter  InfiniBand Switch

17

slide-18
SLIDE 18

Trends of Networking Technologies in TOP500 Systems

Percentage share of InfiniBand is steadily increasing (Chart: interconnect family – systems share)

ADMS '14

18

slide-19
SLIDE 19
  • 223 IB Clusters (44.3%) in the June 2014 Top500 list

(http://www.top500.org)

  • Installations in the Top 50 (25 systems):

Large-scale InfiniBand Installations

    – 519,640 cores (Stampede) at TACC (7th)
    – 62,640 cores (HPC2) in Italy (11th)
    – 147,456 cores (Super MUC) in Germany (12th)
    – 76,032 cores (Tsubame 2.5) at Japan/GSIC (13th)
    – 194,616 cores (Cascade) at PNNL (15th)
    – 110,400 cores (Pangea) at France/Total (16th)
    – 96,192 cores (Pleiades) at NASA/Ames (21st)
    – 73,584 cores (Spirit) at USA/Air Force (24th)
    – 77,184 cores (Curie thin nodes) at France/CEA (26th)
    – 65,320 cores (iDataPlex DX360M4) at Germany/Max-Planck (27th)
    – 120,640 cores (Nebulae) at China/NSCS (28th)
    – 72,288 cores (Yellowstone) at NCAR (29th)
    – 70,560 cores (Helios) at Japan/IFERC (30th)
    – 138,368 cores (Tera-100) at France/CEA (35th)
    – 222,072 cores (QUARTETTO) in Japan (37th)
    – 53,504 cores (PRIMERGY) in Australia (38th)
    – 77,520 cores (Conte) at Purdue University (39th)
    – 44,520 cores (Spruce A) at AWE in UK (40th)
    – 48,896 cores (MareNostrum) at Spain/BSC (41st)
    – and many more!

ADMS '14

19

slide-20
SLIDE 20

Open Standard InfiniBand Networking Technology

  • Introduced in Oct 2000
  • High Performance Data Transfer
    – Interprocessor communication and I/O
    – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec, i.e., >100 Gbps), and low CPU utilization (5-10%)
  • Flexibility for LAN and WAN communication
  • Multiple Transport Services
    – Reliable Connection (RC), Unreliable Connection (UC), Reliable Datagram (RD), Unreliable Datagram (UD), and Raw Datagram
    – Provides flexibility to develop upper layers (see the QP-creation sketch below)
  • Multiple Operations
    – Send/Recv
    – RDMA Read/Write
    – Atomic Operations (very unique)
      • Enable high-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
  • Leading to big changes in designing HPC clusters, file systems, cloud computing systems, grid computing systems, …
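As a concrete illustration of the transport services listed above, the sketch below shows how a verbs application picks RC or UD when creating a queue pair. It assumes the protection domain and completion queue already exist, and the capacity values are arbitrary.

```c
#include <infiniband/verbs.h>
#include <stddef.h>

/* Create a queue pair on an already-opened device context.
 * 'reliable' selects the Reliable Connection (RC) service;
 * otherwise an Unreliable Datagram (UD) QP is created. */
struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq, int reliable)
{
    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .cap = {
            .max_send_wr  = 128,   /* outstanding send WQEs */
            .max_recv_wr  = 128,   /* outstanding recv WQEs */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
        .qp_type = reliable ? IBV_QPT_RC : IBV_QPT_UD,
    };
    return ibv_create_qp(pd, &attr);   /* NULL on failure */
}
```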

ADMS '14

20

slide-21
SLIDE 21

ADMS '14

Communication in the Channel Semantics (Send/Receive Model)

(Diagram: two nodes, each with a processor, memory, and an InfiniBand device holding a QP (Send/Recv queues) and a CQ; memory segments on both sides; a hardware ACK flows between the devices)

  • Send WQE contains information about the send buffer (multiple non-contiguous segments)
  • Receive WQE contains information on the receive buffer (multiple non-contiguous segments); incoming messages have to be matched to a receive WQE to know where to place the data
  • The processor is involved only to:
    1. Post the receive WQE
    2. Post the send WQE
    3. Pull completed CQEs from the CQ
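A minimal sketch of the channel semantics described above: the receiver posts a receive WQE, the sender posts a send WQE, and each side pulls completions from its CQ. Buffers are assumed to be registered already (mr->lkey); error handling is omitted.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Receiver: post a receive WQE describing where an incoming message may land. */
static int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 }, *bad;
    return ibv_post_recv(qp, &wr, &bad);
}

/* Sender: post a send WQE; the HCA matches it to a receive WQE on the remote side. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey };
    struct ibv_send_wr wr = { .wr_id = 2, .sg_list = &sge, .num_sge = 1,
                              .opcode = IBV_WR_SEND,
                              .send_flags = IBV_SEND_SIGNALED }, *bad;
    return ibv_post_send(qp, &wr, &bad);
}

/* Either side: reap one completion (CQE) from the completion queue. */
static int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                   /* busy-poll for simplicity */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```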

21

slide-22
SLIDE 22

ADMS '14

Communication in the Memory Semantics (RDMA Model)

(Diagram: the same two-node setup as the send/receive model, but only the initiator posts work; the target's InfiniBand device moves data directly to/from the target's memory, with a hardware ACK back to the initiator)

  • Send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
  • The initiator processor is involved only to:
    1. Post the send WQE
    2. Pull the completed CQE from the send CQ
  • No involvement from the target processor
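The memory semantics above map onto an RDMA WRITE work request: the initiator supplies both its local buffer and the target's virtual address and rkey (obtained out of band), and the target CPU never participates. A minimal sketch, assuming the local buffer is already registered:

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Initiator: write 'len' bytes from a registered local buffer directly into
 * remote memory at (remote_addr, rkey). The target processor is not involved. */
static int rdma_write(struct ibv_qp *qp, struct ibv_mr *local_mr,
                      void *local_buf, uint32_t len,
                      uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = local_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 3,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,   /* generate a CQE on the send CQ */
    }, *bad;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    return ibv_post_send(qp, &wr, &bad);
}
```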

22

slide-23
SLIDE 23
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase
  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

23

slide-24
SLIDE 24

Wide Adaptation of RDMA Technology

  • Message Passing Interface (MPI) for HPC
  • Parallel File Systems

– Lustre – GPFS

  • Delivering excellent performance (latency, bandwidth and

CPU Utilization)

  • Delivering excellent scalability

ADMS '14

24

slide-25
SLIDE 25
  • High Performance open-source MPI Library for InfiniBand, 10GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
    – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
    – MVAPICH2-X (MPI + PGAS), available since 2012
    – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC)
    – Used by more than 2,200 organizations in 73 countries
    – More than 221,000 downloads from the OSU site directly
    – Empowering many TOP500 clusters
      • 7th ranked 519,640-core cluster (Stampede) at TACC
      • 13th ranked 74,358-core cluster (Tsubame 2.5) at Tokyo Institute of Technology
      • 23rd ranked 96,192-core cluster (Pleiades) at NASA
      • many others . . .
    – Available with software stacks of many IB, HSE, and server vendors including Linux Distros (RedHat and SuSE)
    – http://mvapich.cse.ohio-state.edu
  • Partner in the U.S. NSF-TACC Stampede System

MVAPICH2/MVAPICH2-X Software

ADMS '14

25

slide-26
SLIDE 26

ADMS '14

One-way Latency: MPI over IB with MVAPICH2

(Charts: small- and large-message one-way latency vs. message size for Qlogic-DDR, Qlogic-QDR, ConnectX-DDR, ConnectX2-PCIe2-QDR, ConnectX3-PCIe3-FDR, Sandy-ConnectIB-DualFDR, and Ivy-ConnectIB-DualFDR; small-message latencies range from 0.99 to 1.82 us. Platforms: DDR/QDR – 2.4 GHz quad-core (Westmere) Intel, PCIe Gen2, IB switch; FDR – 2.6 GHz octa-core (SandyBridge) Intel, PCIe Gen3, IB switch; ConnectIB-Dual FDR – 2.6 GHz octa-core (SandyBridge) and 2.8 GHz deca-core (IvyBridge) Intel, PCIe Gen3, IB switch)

26

slide-27
SLIDE 27

ADMS '14

Bandwidth: MPI over IB with MVAPICH2

(Charts: unidirectional bandwidth up to 12,810 MBytes/sec and bidirectional bandwidth up to 24,727 MBytes/sec with Ivy-ConnectIB-DualFDR, vs. message size, for the same adapters and platforms as the latency charts)

27

slide-28
SLIDE 28

Can High-Performance Interconnects Benefit Data Management and Processing?

  • Most of the current Big Data systems use Ethernet infrastructure with sockets
  • Concerns for performance and scalability
  • Usage of high-performance networks is beginning to draw interest
    – Oracle, IBM, Google, and Intel are working along these directions
  • What are the challenges?
  • Where do the bottlenecks lie?
  • Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?
  • Can HPC clusters with high-performance networks be used for Big Data applications using Hadoop and Memcached?

ADMS '14

28

slide-29
SLIDE 29

Designing Communication and I/O Libraries for Big Data Systems: Challenges

ADMS '14

(Diagram, top to bottom: Applications → Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached) → Programming Models (Sockets) and Other Protocols? → Communication and I/O Library, with components for point-to-point communication, threaded models and synchronization, QoS, fault-tolerance, I/O and file systems, virtualization, and benchmarks → Networking Technologies (InfiniBand, 1/10/40GigE and intelligent NICs), Commodity Computing System Architectures (multi-/many-core architectures and accelerators), and Storage Technologies (HDD and SSD))

29

slide-30
SLIDE 30

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

  • Current Design: Application → Sockets → 1/10 GigE Network
  • Enhanced Designs: Application → Accelerated Sockets → 10 GigE or InfiniBand (Verbs / Hardware Offload)
  • Our Approach: Application → OSU Design → 10 GigE or InfiniBand (Verbs Interface)

  • Sockets not designed for high performance
    – Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)
    – Zero-copy not available for non-blocking sockets

ADMS '14

30

slide-31
SLIDE 31
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase
  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

31

slide-32
SLIDE 32
  • RDMA for Memcached (RDMA-Memcached)
  • RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  • RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
  • OSU HiBD-Benchmarks (OHB)
  • http://hibd.cse.ohio-state.edu
  • RDMA for Apache HBase and Spark

The High-Performance Big Data (HiBD) Project

ADMS '14

32

slide-33
SLIDE 33
  • High-Performance Design of Memcached over RDMA-enabled Interconnects
    – High performance design with native InfiniBand and RoCE support at the verbs-level for Memcached and libMemcached components
    – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
  • Current release: 0.9.1
    – Based on Memcached 1.4.20 and libMemcached 1.0.18
    – Compliant with Memcached APIs and applications
    – Tested with
      • Mellanox InfiniBand adapters (DDR, QDR and FDR)
      • RoCE support with Mellanox adapters
      • Various multi-core platforms
    – http://hibd.cse.ohio-state.edu

RDMA for Memcached Distribution

ADMS '14

33

slide-34
SLIDE 34
  • High-Performance Design of Hadoop over RDMA-enabled Interconnects
    – High performance design with native InfiniBand and RoCE support at the verbs-level for HDFS, MapReduce, and RPC components
    – Easily configurable for native InfiniBand, RoCE and the traditional sockets-based support (Ethernet and InfiniBand with IPoIB)
  • Current release: 0.9.9/0.9.1
    – Based on Apache Hadoop 1.2.1/2.4.1
    – Compliant with Apache Hadoop 1.2.1/2.4.1 APIs and applications
    – Tested with
      • Mellanox InfiniBand adapters (DDR, QDR and FDR)
      • RoCE support with Mellanox adapters
      • Various multi-core platforms
      • Different file systems with disks and SSDs
    – http://hibd.cse.ohio-state.edu

RDMA for Apache Hadoop 1.x/2.x Distributions

ADMS '14

34

slide-35
SLIDE 35
  • Released in OHB 0.7.1 (ohb_memlat)
  • Evaluates the performance of stand-alone Memcached
  • Three different micro-benchmarks
    – SET Micro-benchmark: micro-benchmark for memcached set operations
    – GET Micro-benchmark: micro-benchmark for memcached get operations
    – MIX Micro-benchmark: micro-benchmark for a mix of memcached set/get operations (Read:Write ratio is 90:10)
  • Calculates the average latency of Memcached operations
  • Can measure throughput in Transactions Per Second (an illustrative latency-loop sketch follows)
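The actual ohb_memlat benchmark ships with the HiBD packages; the loop below is only an illustrative stand-in showing how such a latency/TPS measurement is typically structured with libmemcached (iteration count, key, value size, and server address are arbitrary).

```c
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

#define ITERS 10000

int main(void)
{
    memcached_st *memc = memcached_create(NULL);
    memcached_server_add(memc, "127.0.0.1", 11211);   /* placeholder server */

    char value[64];
    memset(value, 'x', sizeof(value));
    memcached_return_t rc;
    memcached_set(memc, "bench", 5, value, sizeof(value), (time_t)0, (uint32_t)0);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {                 /* timed GET loop */
        size_t len; uint32_t flags;
        char *out = memcached_get(memc, "bench", 5, &len, &flags, &rc);
        free(out);
    }
    gettimeofday(&t1, NULL);

    double usec = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    printf("avg GET latency: %.2f us, TPS: %.0f\n",
           usec / ITERS, ITERS / (usec / 1e6));
    memcached_free(memc);
    return 0;
}
```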

ADMS '14

OSU HiBD Micro-Benchmark (OHB) Suite - Memcached

35

slide-36
SLIDE 36
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase

– RDMA-based Memcached – Case study with OLTP – SSD-assisted hybrid Memcached – RDMA-based HBase

  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

36

slide-37
SLIDE 37

Design Challenges of RDMA-based Memcached

  • Can Memcached be re-designed from the ground up to utilize RDMA-capable networks?
  • How efficiently can we utilize RDMA for Memcached operations?
  • Can we leverage the best features of RC and UD to deliver both high performance and scalability to Memcached?
  • Memcached applications need not be modified; the middleware uses the verbs interface if available

ADMS '14

37

slide-38
SLIDE 38

Memcached-RDMA Design

  • Server and client perform a negotiation protocol
    – Master thread assigns clients to the appropriate worker thread
  • Once a client is assigned a verbs worker thread, it can communicate directly and is "bound" to that thread (see the dispatch sketch below)
  • All other Memcached data structures are shared among RDMA and sockets worker threads
  • Native IB-verbs-level design and evaluation with
    – Server: Memcached (http://memcached.org)
    – Client: libmemcached (http://libmemcached.org)
    – Different networks and protocols: 10GigE, IPoIB, native IB (RC, UD)

(Diagram: sockets and RDMA clients contact the master thread, which hands them off to sockets or verbs worker threads; all workers share the Memcached data structures — memory slabs, items, …)
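The published design above is part of the OSU RDMA-Memcached implementation; the pthread sketch below is only a simplified illustration, with invented names, of the hand-off idea: a master thread accepts incoming connections and binds each one to a sockets or verbs worker thread, while the item store stays shared.

```c
#include <pthread.h>

/* Illustrative only: invented names, not the actual Memcached-RDMA sources. */
enum conn_kind { CONN_SOCKETS, CONN_VERBS };

struct conn   { enum conn_kind kind; int fd_or_qp; };   /* one client connection */
struct worker {                                          /* one worker thread     */
    pthread_t tid;
    pthread_mutex_t lock;
    struct conn *pending[64];
    int n;
};

static struct worker sock_workers[4], verbs_workers[4];

void init_workers(void)
{
    for (int i = 0; i < 4; i++) {
        pthread_mutex_init(&sock_workers[i].lock, NULL);
        pthread_mutex_init(&verbs_workers[i].lock, NULL);
    }
}

/* Master thread: after the negotiation protocol picks a transport, bind the
 * connection to one worker of the matching type (simple round robin); the
 * client then talks to that worker directly. The slab/item store stays shared. */
void dispatch(struct conn *c)
{
    static int rr_sock, rr_verbs;
    struct worker *w = (c->kind == CONN_VERBS) ? &verbs_workers[rr_verbs++ % 4]
                                               : &sock_workers[rr_sock++ % 4];
    pthread_mutex_lock(&w->lock);
    if (w->n < 64)
        w->pending[w->n++] = c;
    pthread_mutex_unlock(&w->lock);
}
```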

ADMS '14

38

slide-39
SLIDE 39
  • Memcached Get latency
    – 4 bytes, RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us
    – 2K bytes, RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us
  • Almost a factor-of-four improvement over 10GigE (TOE) for 2K bytes on the DDR cluster

Memcached Get Latency (Small Message)

(Charts: Get latency vs. message size from 1 byte to 2K on the Intel Clovertown cluster (IB: DDR) and the Intel Westmere cluster (IB: QDR), comparing IPoIB, 10GigE, OSU-RC-IB, and OSU-UD-IB)

ADMS '14

39

slide-40
SLIDE 40

ADMS '14

Memcached Get TPS (4 bytes)

  • Memcached Get transactions per second for 4 bytes
    – On IB QDR: 1.4 M/s (RC), 1.3 M/s (UD) for 8 clients
  • Significant improvement with native IB QDR compared to IPoIB

(Charts: thousands of transactions per second vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), and OSU-UD-IB (QDR))

40

slide-41
SLIDE 41

Memcached Performance (FDR Interconnect)

(Charts: Memcached GET latency vs. message size and throughput in thousands of TPS vs. number of clients, comparing OSU-IB (FDR) and IPoIB (FDR); latency reduced by nearly 20X, throughput improved by 2X)

  • Memcached Get latency
    – 4 bytes – OSU-IB: 2.84 us; IPoIB: 75.53 us
    – 2K bytes – OSU-IB: 4.49 us; IPoIB: 123.42 us
  • Memcached throughput (4 bytes)
    – 4080 clients – OSU-IB: 556 Kops/sec; IPoIB: 233 Kops/sec
    – Nearly 2X improvement in throughput

Experiments on TACC Stampede (Intel SandyBridge Cluster, IB: FDR)

ADMS '14

41

slide-42
SLIDE 42

ADMS '14

Application Level Evaluation – Olio Benchmark

  • Olio Benchmark
    – RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients
  • 4X better than IPoIB for 8 clients
  • Hybrid design achieves performance comparable to that of the pure RC design

(Charts: execution time (ms) vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR))

42

slide-43
SLIDE 43

ADMS '14

Application Level Evaluation – Real Application Workloads

  • Real Application Workload
    – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients
  • 12X better than IPoIB for 8 clients
  • Hybrid design achieves performance comparable to that of the pure RC design

(Charts: execution time (ms) vs. number of clients, comparing IPoIB (QDR), OSU-RC-IB (QDR), OSU-UD-IB (QDR), and OSU-Hybrid-IB (QDR))

  • J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, M. W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K. Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP'11
  • J. Jose, H. Subramoni, K. Kandalla, M. W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached Design for InfiniBand Clusters Using Hybrid Transport, CCGrid'12

43

slide-44
SLIDE 44

ADMS '14

44

Memcached-RDMA – Case Studies with OLTP Workloads

  • Design of scalable architectures using Memcached and MySQL
  • Case study to evaluate the benefits of using RDMA-Memcached for traditional OLTP workloads (a read-cache-read sketch follows)
slide-45
SLIDE 45
  • Illustration with the Read-Cache-Read access pattern using a modified mysqlslap load-testing tool
  • Up to 40 nodes, 10 concurrent threads per node, 20 queries per client
  • Memcached-RDMA can
    – improve query latency by up to 66% over IPoIB (32Gbps)
    – improve throughput by up to 69% over IPoIB (32Gbps)

ADMS '14

45

Evaluation with Traditional OLTP Workloads - mysqlslap

(Charts: latency (sec) and throughput (q/ms) vs. number of clients from 48 to 400, comparing Memcached-IPoIB (32Gbps) and Memcached-RDMA (32Gbps))

slide-46
SLIDE 46

ADMS '14

46

Traditional Memcached Deployments

  • Existing memcached deployment
    – Low hit ratio at the Memcached server due to limited main memory size
    – Frequent access to the backend DB
  • Mmap() SSD into virtual memory
    – Higher hit ratio due to SSD-mapped virtual memory
    – Significant overhead in virtual memory management (a minimal mmap sketch follows)
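A minimal sketch of the "mmap() SSD into virtual memory" option: a file on the SSD is mapped into the server's address space so the slab allocator can use it as if it were RAM, at the cost of kernel page-fault handling. The path and region size are placeholders.

```c
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    const size_t size = 1UL << 30;            /* 1 GB region; placeholder size */
    int fd = open("/mnt/ssd/memcached.swap", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, size) != 0) { perror("open/ftruncate"); return 1; }

    /* Map the SSD-backed file; loads/stores to 'region' trigger kernel paging. */
    void *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... hand 'region' to the slab allocator as extra item memory ... */

    munmap(region, size);
    close(fd);
    return 0;
}
```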

slide-47
SLIDE 47

ADMS '14

47

SSD-Assisted Hybrid Memory for Memcached

Get Latency (us):
                                   IB Verbs   IPoIB   10GigE   1GigE
  MySQL                            N/A        10763   10724    11220
  Memcached (in RAM)               10         60      40       150
  Memcached (SSD virtual memory)   347        387     362      455

SSD Basic Performance (us) (Fusion-io ioDrive):
                Random Read   Random Write
  SSD Latency   68            70

  • SSD-Assisted Hybrid Memory to expand memory size
  • A user-level library that bypasses the kernel-based virtual memory management overhead

slide-48
SLIDE 48

ADMS '14

48

Evaluation on SSD-Assisted Hybrid Memory for Memcached - Latency

(Charts: Memcached Get latency improves from 347 us to 93 us (3.6X) and Put latency from 48 us to 16 us (3X))

  • Memcached with InfiniBand DDR transport (16Gbps)
  • 30GB data in SSD, 256 MB read/write buffer
  • Get / Put a random object
  • X. Ouyang, N. S. Islam, R. Rajachandrasekar, J. Jose, M. Luo, H. Wang & D. K. Panda, SSD-Assisted Hybrid Memory to Accelerate Memcached over High Performance Networks, ICPP'12

slide-49
SLIDE 49

ADMS '14

49

Evaluation on SSD-Assisted Hybrid Memory for Memcached - Throughput

  • Memcached with InfiniBand DDR transport (16Gbps)
  • 30GB data in SSD, object size = 1KB, 256 MB read/write buffer
  • 1, 2, 4 and 8 client processes perform random get()

(Chart: transactions/sec (thousands) vs. number of processes, comparing Native Hmem (upper bound), Memcached+IB+Hmem, and Memcached+IB+VMS; up to 5.3X improvement)

slide-50
SLIDE 50

ADMS '14

Motivation – Detailed Analysis on HBase Put/Get

  • HBase 1KB Put
    – Communication time: 8.9 us
    – A factor of 6X improvement over 10GigE for communication time
  • HBase 1KB Get
    – Communication time: 8.9 us
    – A factor of 6X improvement over 10GigE for communication time
  • M. W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, Poster Paper, ISPASS'12

(Charts: time breakdown (us) for HBase Put 1KB and Get 1KB over IPoIB (DDR), 10GigE, and OSU-IB (DDR), split into communication, communication preparation, server processing, server serialization, client processing, and client serialization)

50

slide-51
SLIDE 51

ADMS '14

HBase-RDMA Design Overview

  • JNI layer bridges Java-based HBase with a communication library written in native code (a minimal JNI bridge sketch follows)
  • Enables high performance RDMA communication, while supporting the traditional socket interface

(Diagram: Applications → HBase → either the Java Socket Interface over 1/10 GigE / IPoIB networks, or the Java Native Interface (JNI) → OSU-IB Design → IB Verbs → RDMA-capable networks (IB, 10GigE/iWARP, RoCE, ...))
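A minimal sketch of how such a JNI bridge generally looks: a Java-side native method hands a direct ByteBuffer to C code, which can then pass the raw pointer down to the verbs layer. The package, class, and method names here are hypothetical, not the actual OSU-IB sources.

```c
#include <jni.h>
#include <stdint.h>

/* Hypothetical native method:
 *   package edu.example.rdma;
 *   class RdmaBridge { native long send(java.nio.ByteBuffer buf, int len); }
 * The direct buffer's address can be registered with / handed to the verbs layer,
 * avoiding a copy between the JVM heap and native memory. */
JNIEXPORT jlong JNICALL
Java_edu_example_rdma_RdmaBridge_send(JNIEnv *env, jobject self,
                                      jobject direct_buf, jint len)
{
    void *addr = (*env)->GetDirectBufferAddress(env, direct_buf);
    if (addr == NULL)
        return -1;                 /* not a direct buffer */

    /* ... post a send / RDMA-write WQE describing (addr, len) here ... */
    (void)self; (void)len;
    return (jlong)(intptr_t)addr;  /* illustrative return value */
}
```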

51

slide-52
SLIDE 52

HBase-RDMA Design - Communication Flow

(Diagram: communication flow between the HBase client and server; on each side a Network Selector chooses between sockets over the 1/10G network and the JNI interface → OSU-IB Design → IB Verbs; the server side includes IB Reader, Responder, handler threads, helper threads, reader threads, and call/response queues)

  • RDMA design components
    – Network Selector, IB Reader, helper threads, JNI Adaptive Interface

ADMS '14

52

slide-53
SLIDE 53

ADMS '14

HBase Micro-benchmark (Single-Server-Multi-Client) Results

  • HBase Get latency
    – 4 clients: 104.5 us; 16 clients: 296.1 us
  • HBase Get throughput
    – 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec
  • 27% improvement in throughput for 16 clients over 10GigE

(Charts: latency (us) and throughput (ops/sec) vs. number of clients, comparing 10GigE, IPoIB (DDR), and OSU-IB (DDR))

  • J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS'12

53

slide-54
SLIDE 54

ADMS '14

HBase – YCSB Read-Write Workload

  • HBase Get latency (Yahoo! Cloud Serving Benchmark)
    – 64 clients: 2.0 ms; 128 clients: 3.5 ms
    – 42% improvement over IPoIB for 128 clients
  • HBase Put latency
    – 64 clients: 1.9 ms; 128 clients: 3.5 ms
    – 40% improvement over IPoIB for 128 clients

(Charts: read and write latency (us) vs. number of clients from 8 to 128, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))

54

slide-55
SLIDE 55
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase
  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

55

slide-56
SLIDE 56
  • RDMA-based Designs and Performance Evaluation

– HDFS – MapReduce – Spark

Acceleration Case Studies and In-Depth Performance Evaluation

ADMS '14

56

slide-57
SLIDE 57

Design Overview of HDFS with RDMA

  • Enables high performance RDMA communication, while supporting the traditional socket interface
  • JNI layer bridges Java-based HDFS with a communication library written in native code

(Diagram: Applications → HDFS (Write / Others) → either the Java Socket Interface over the 1/10 GigE / IPoIB network, or the Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, 10GigE/iWARP, RoCE, ...))

  • Design Features
    – RDMA-based HDFS write
    – RDMA-based HDFS replication
    – Parallel replication support
    – On-demand connection setup (a minimal rdma_cm sketch follows)
    – InfiniBand/RoCE support
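On-demand connection setup is typically built on the librdmacm connection manager; a compressed sketch of the client-side sequence (resolve address, resolve route, connect) is shown below, with event handling reduced to a small helper. The destination address is assumed to be filled in by the caller, and QP creation is elided.

```c
#include <rdma/rdma_cma.h>
#include <stddef.h>

/* Block until the next CM event arrives, check its type, and ack it. */
static int expect(struct rdma_event_channel *ch, enum rdma_cm_event_type type)
{
    struct rdma_cm_event *ev;
    if (rdma_get_cm_event(ch, &ev)) return -1;
    int ok = (ev->event == type);
    rdma_ack_cm_event(ev);
    return ok ? 0 : -1;
}

/* Lazily connect to a peer only when the first transfer to it is issued. */
struct rdma_cm_id *connect_on_demand(struct sockaddr *dst)
{
    struct rdma_event_channel *ch = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;
    struct rdma_conn_param param = { .retry_count = 7 };

    if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) return NULL;
    if (rdma_resolve_addr(id, NULL, dst, 2000) ||
        expect(ch, RDMA_CM_EVENT_ADDR_RESOLVED))  return NULL;
    if (rdma_resolve_route(id, 2000) ||
        expect(ch, RDMA_CM_EVENT_ROUTE_RESOLVED)) return NULL;
    /* (A QP would be created on id->verbs here before connecting.) */
    if (rdma_connect(id, &param) ||
        expect(ch, RDMA_CM_EVENT_ESTABLISHED))    return NULL;
    return id;   /* connection established; cached for subsequent transfers */
}
```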

ADMS '14

57

slide-58
SLIDE 58

Communication Times in HDFS

  • Cluster with HDD DataNodes
    – 30% improvement in communication time over IPoIB (QDR)
    – 56% improvement in communication time over 10GigE
  • Similar improvements are obtained for SSD DataNodes

(Chart: communication time (s) vs. file size from 2GB to 10GB, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR); reduced by 30%)

ADMS '14

58

  • N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

slide-59
SLIDE 59

Evaluations using OHB HDFS Micro-benchmark

  • Cluster with 4 HDD DataNodes, single disk per node
    – 25% improvement in latency over IPoIB (QDR) for 10GB file size
    – 50% improvement in throughput over IPoIB (QDR) for 10GB file size

(Charts: write latency (s) and total write throughput with 2 clients (MBps) vs. file size from 2 to 10 GB, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR); latency reduced by 25%, throughput increased by 50%)

ADMS '14

59

  • N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012

slide-60
SLIDE 60

Evaluations using TestDFSIO

  • Cluster with 8 HDD DataNodes, single disk per node
    – 24% improvement over IPoIB (QDR) for 20GB file size
  • Cluster with 4 SSD DataNodes, single SSD per node
    – 61% improvement over IPoIB (QDR) for 20GB file size

(Charts: total TestDFSIO throughput (MBps) vs. file size from 5 to 20 GB for the 8 HDD-node cluster with 8 maps and the 4 SSD-node cluster with 4 maps, comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR); increases of 24% and 61%)

ADMS '14

60

slide-61
SLIDE 61

ADMS '14

Evaluations using Enhanced DFSIO of Intel HiBench on TACC-Stampede

  • Cluster with 64 DataNodes, single HDD per node
    – 64% improvement in throughput over IPoIB (FDR) for 256GB file size
    – 37% improvement in latency over IPoIB (FDR) for 256GB file size

(Charts: aggregated throughput (MBps) and execution time (s) vs. file size of 64, 128, and 256 GB, comparing IPoIB (FDR) and OSU-IB (FDR); throughput increased by 64%, latency reduced by 37%)

61

slide-62
SLIDE 62

Evaluations using YCSB (32 Region Servers: 100% Update)

(Charts: put throughput (Kops/sec) and average put latency (us) vs. number of records from 120K to 480K with 1KB records, comparing IPoIB (QDR) and OSU-IB (QDR))

  • HBase using TCP/IP, running over HDFS-IB
  • HBase Put latency for 480K records
    – 201 us for the OSU design; 272 us for IPoIB (32Gbps)
  • HBase Put throughput for 480K records
    – 4.42 Kops/sec for the OSU design; 3.63 Kops/sec for IPoIB (32Gbps)
  • 26% improvement in average latency; 24% improvement in throughput

ADMS '14

62

slide-63
SLIDE 63
  • YCSB evaluation with 4 RegionServers (100% update)
  • HBase Put latency and throughput for 360K records
    – 37% improvement over IPoIB (32Gbps)
    – 18% improvement over OSU-IB HDFS only

HDFS and HBase Integration over IB (OSU-IB)

(Charts: latency (us) and throughput (Kops/sec) vs. number of records from 120K to 480K with 1KB records, comparing IPoIB (QDR), OSU-IB (HDFS) with IPoIB (HBase), and OSU-IB (HDFS) with OSU-IB (HBase))

ADMS '14

63

slide-64
SLIDE 64
  • RDMA-based Designs and Performance Evaluation

– HDFS – MapReduce – Spark

Acceleration Case Studies and In-Depth Performance Evaluation

ADMS '14

64

slide-65
SLIDE 65

Design Overview of MapReduce with RDMA

(Diagram: Applications → MapReduce (JobTracker, TaskTracker, Map, Reduce) → either the Java Socket Interface over the 1/10 GigE / IPoIB network, or the Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, 10GigE/iWARP, RoCE, ...))

ADMS '14

  • Enables high performance RDMA communication, while supporting the traditional socket interface
  • JNI layer bridges Java-based MapReduce with a communication library written in native code
  • Design Features
    – RDMA-based shuffle
    – Prefetching and caching of map output
    – Efficient shuffle algorithms
    – In-memory merge
    – On-demand shuffle adjustment
    – Advanced overlapping
      • map, shuffle, and merge
      • shuffle, merge, and reduce
    – On-demand connection setup
    – InfiniBand/RoCE support

65

slide-66
SLIDE 66

Advanced Overlapping among different phases

  • A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases compared to other approaches
    – Efficient shuffle algorithms
    – Dynamic and efficient switching
    – On-demand shuffle adjustment

(Diagram: Default Architecture vs. Enhanced Overlapping vs. Advanced Overlapping)

  • M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

ADMS '14

66

slide-67
SLIDE 67

Evaluations using OHB MapReduce Micro-benchmark

  • Stand-alone MapReduce micro-benchmark (MR-AVG)
  • 1 KB key/value pair size
  • For 8 slave nodes, RDMA has up to 30% improvement over IPoIB (56Gbps)
  • For 16 slave nodes, RDMA has up to 28% improvement over IPoIB (56Gbps)

(Charts: results for 8 and 16 slave nodes, comparing IPoIB (FDR) and OSU-IB (FDR); improvements of 30% and 28%)

ADMS '14

67

  • D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)

slide-68
SLIDE 68

Evaluations using Sort (HDD, SSD)

  • With 8 HDD DataNodes for a 40GB sort
    – 43% improvement over IPoIB (QDR)
    – 44% improvement over UDA-IB (QDR)
  • With 8 SSD DataNodes for a 40GB sort
    – 52% improvement over IPoIB (QDR)
    – 45% improvement over UDA-IB (QDR)

(Charts: job execution time (sec) vs. data size from 25 to 40 GB for 8 HDD and 8 SSD DataNodes, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR))

ADMS '14

68

slide-69
SLIDE 69

Evaluations with TeraSort

  • 100 GB TeraSort with 8 DataNodes, 2 HDDs per node
    – 49% benefit compared to UDA-IB (QDR)
    – 54% benefit compared to IPoIB (QDR)
    – 56% benefit compared to 10GigE

(Chart: job execution time (sec) vs. data size of 60, 80, and 100 GB, comparing 10GigE, IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR); gains of 49% and 54%)

ADMS '14

69

slide-70
SLIDE 70
Performance Evaluation on Larger Clusters

  • For a 240GB sort on 64 nodes
    – 40% improvement over IPoIB (QDR) with HDD used for HDFS
  • For a 320GB TeraSort on 64 nodes
    – 38% improvement over IPoIB (FDR) with HDD used for HDFS

(Charts: job execution time (sec) for Sort on the OSU cluster, 60/120/240 GB on 16/32/64 nodes, comparing IPoIB (QDR), UDA-IB (QDR), and OSU-IB (QDR); and for TeraSort on TACC Stampede, 80/160/320 GB on 16/32/64 nodes, comparing IPoIB (FDR), UDA-IB (FDR), and OSU-IB (FDR))

ADMS '14

70

slide-71
SLIDE 71
  • 50% improvement in Self Join over IPoIB (QDR) for 80 GB data size
  • 49% improvement in Sequence Count over IPoIB (QDR) for 30 GB data size

Evaluations using PUMA Workload

ADMS '14

(Chart: normalized execution time for AdjList (30GB), SelfJoin (80GB), SeqCount (30GB), WordCount (30GB), and InvertIndex (30GB), comparing 10GigE, IPoIB (QDR), and OSU-IB (QDR))

71

slide-72
SLIDE 72
  • 50 small MapReduce jobs executed in a cluster size of 4
  • Maximum performance benefit 24% over IPoIB (QDR)
  • Average performance benefit 13% over IPoIB (QDR)

Evaluations using SWIM

(Chart: execution time (sec) per job for the 50 SWIM jobs, comparing IPoIB (QDR) and OSU-IB (QDR))

ADMS '14

72

slide-73
SLIDE 73
  • RDMA-based Designs and Performance Evaluation

– HDFS – MapReduce – Spark

Acceleration Case Studies and In-Depth Performance Evaluation

ADMS '14

73

slide-74
SLIDE 74

Design Overview of Spark with RDMA

  • Design Features
    – RDMA-based shuffle
    – SEDA-based plugins
    – Dynamic connection management and sharing
    – Non-blocking and out-of-order data transfer
    – Off-JVM-heap buffer management (see the registration sketch below)
    – InfiniBand/RoCE support

ADMS '14

  • Enables high performance RDMA communication, while supporting the traditional socket interface
  • JNI layer bridges Scala-based Spark with a communication library written in native code
  • X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014
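Off-JVM-heap buffer management matters because RDMA needs pinned, registered memory that the JVM garbage collector cannot move. The sketch below shows only the generic native side of that idea: allocating a page-aligned buffer outside the Java heap and registering it with the verbs layer; the protection domain is assumed to exist, and this is not the actual RDMA-Spark code.

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* Allocate a page-aligned buffer outside the JVM heap and register it for RDMA.
 * Registered (pinned) memory cannot be moved by the OS or the JVM GC, so the
 * HCA can read/write it directly during shuffle transfers. */
struct ibv_mr *alloc_offheap_region(struct ibv_pd *pd, size_t bytes, void **buf_out)
{
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, bytes) != 0)
        return NULL;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (mr == NULL) { free(buf); return NULL; }

    *buf_out = buf;   /* could be exposed to the JVM as a direct ByteBuffer via JNI */
    return mr;
}
```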

74

slide-75
SLIDE 75

ADMS '14

Preliminary Results of Spark-RDMA Design - GroupBy

(Charts: GroupBy time (sec) vs. data size for a 4 HDD-node cluster with 32 cores and an 8 HDD-node cluster with 64 cores, comparing 10GigE, IPoIB, and RDMA)

  • Cluster with 4 HDD nodes, single disk per node, 32 concurrent tasks
    – 18% improvement over IPoIB (QDR) for 10GB data size
  • Cluster with 8 HDD nodes, single disk per node, 64 concurrent tasks
    – 20% improvement over IPoIB (QDR) for 20GB data size

75

slide-76
SLIDE 76
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase
  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

76

slide-77
SLIDE 77

Optimize Apache Hadoop over Parallel File Systems

(Diagram of the Lustre setup: compute nodes running TaskTracker, Map, Reduce, and the Lustre client, connected to MetaData Servers and Object Storage Servers)

  • HPC Cluster Deployment
    – Hybrid topological solution of Beowulf architecture with separate I/O nodes
    – Lean compute nodes with a light OS, more memory space, and small local storage
    – Sub-cluster of dedicated I/O nodes with parallel file systems, such as Lustre
  • MapReduce over Lustre
    – Local disk is used as the intermediate data directory, or
    – Lustre is used as the intermediate data directory

ADMS '14

77

slide-78
SLIDE 78
Case Study – Performance Improvement of MapReduce over Lustre on TACC-Stampede

  • Local disk is used as the intermediate data directory
  • For a 500GB sort on 64 nodes
    – 44% improvement over IPoIB (FDR)
  • For a 640GB sort on 128 nodes
    – 48% improvement over IPoIB (FDR)

(Charts: job execution time (sec) vs. data size of 300/400/500 GB, and for 20 GB to 640 GB across cluster sizes of 4 to 128 nodes, comparing IPoIB (FDR) and OSU-IB (FDR))

ADMS '14

  • M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014

78

slide-79
SLIDE 79
Case Study – Performance Improvement of MapReduce over Lustre on TACC-Stampede

  • Lustre is used as the intermediate data directory
  • For a 160GB sort on 16 nodes
    – 35% improvement over IPoIB (FDR)
  • For a 320GB sort on 32 nodes
    – 33% improvement over IPoIB (FDR)

(Charts: job execution time (sec) vs. data size of 80/120/160 GB, and for 80/160/320 GB on cluster sizes of 8/16/32 nodes, comparing IPoIB (FDR) and OSU-IB (FDR))

ADMS '14

  • Can more optimizations be achieved by leveraging more features of Lustre?

79

slide-80
SLIDE 80
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase
  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

80

slide-81
SLIDE 81
  • Some other existing Hadoop solutions

– Hadoop-A (UDA)

  • RDMA-based implementation of Hadoop MapReduce shuffle engine
  • Uses plug-in based solution : UDA (Unstructured Data Accelerator)
  • http://www.mellanox.com/page/products_dyn?product_family=144

– Cloudera Distributions of Hadoop

  • CDH: Open Source distribution
  • Cloudera Standard: CDH with automated cluster management
  • Cloudera Enterprise: Cloudera Standard + enhanced management capabilities and support
  • Ethernet/IPoIB
  • http://www.cloudera.com/content/cloudera/en/products.html

– Hortonworks Data Platform
  • HDP: Open Source distribution
  • Developed as projects through the Apache Software Foundation (ASF), NO proprietary extensions or add-ons
  • Functional areas: Data Management, Data Access, Data Governance and Integration, Security, and Operations
  • Ethernet/IPoIB
  • http://hortonworks.com/hdp

ADMS '14

Ongoing and Future Activities for Hadoop Accelerations

81

slide-82
SLIDE 82

Designing Communication and I/O Libraries for Big Data Systems: Solved a Few Initial Challenges

(Diagram: the same layered architecture shown earlier, now with an RDMA Protocol added in the communication and I/O library; "Upper level changes?" remains an open question)

ADMS '14

82

slide-83
SLIDE 83
  • Multi-threading and Synchronization

– Multi-threaded model exploration – Fine-grained synchronization and lock-free design – Unified helper threads for different components – Multi-endpoint design to support multi-threading communications

  • QoS and Virtualization

– Network virtualization and locality-aware communication for Big Data middleware – Hardware-level virtualization support for End-to-End QoS – I/O scheduling and storage virtualization – Live migration

ADMS '14

More Challenges

83

slide-84
SLIDE 84
  • Support of Accelerators
    – Efficient designs for Big Data middleware to take advantage of NVIDIA GPGPUs and Intel MICs
    – Offload computation-intensive workload to accelerators
    – Explore maximum overlapping between communication and offloaded computation
  • Fault Tolerance Enhancements
    – Exploration of light-weight fault tolerance mechanisms for Big Data
  • Support of Parallel File Systems
    – Optimize Big Data middleware over parallel file systems (e.g., Lustre) on modern HPC clusters
  • Big Data Benchmarking

ADMS '14

More Challenges (Cont’d)

84

slide-85
SLIDE 85
  • The current benchmarks provide some performance behavior
  • However, they do not provide any information to the designer/developer on:
    – What is happening at the lower layer?
    – Where are the benefits coming from?
    – Which design is leading to benefits or bottlenecks?
    – Which component in the design needs to be changed, and what will be its impact?
    – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

ADMS '14

Are the Current Benchmarks Sufficient for Big Data Management and Processing?

85

slide-86
SLIDE 86

Challenges in Benchmarking of RDMA-based Designs

(Diagram: the same layered architecture; current benchmarks exist at the applications level, no benchmarks exist at the communication and I/O library level, and the correlation between the two levels is an open question)

ADMS '14

86

slide-87
SLIDE 87

OSU MPI Micro-Benchmarks (OMB) Suite

  • A comprehensive suite of benchmarks to
    – Compare performance of different MPI libraries on various networks and systems
    – Validate low-level functionalities
    – Provide insights into the underlying MPI-level designs
  • Started with basic send-recv (MPI-1) micro-benchmarks for latency, bandwidth and bi-directional bandwidth
  • Extended later to
    – MPI-2 one-sided
    – Collectives
    – GPU-aware data movement
    – OpenSHMEM (point-to-point and collectives)
    – UPC
  • Has become an industry standard
  • Extensively used for design/development of MPI libraries, performance comparison of MPI libraries, and even in procurement of large-scale systems
  • Available from http://mvapich.cse.ohio-state.edu/benchmarks
  • Available in an integrated manner with the MVAPICH2 stack

ADMS '14

87

slide-88
SLIDE 88

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next Generation Big Data Systems and Applications

(Diagram: the same layered architecture, with applications-level benchmarks and micro-benchmarks driving an iterative design process)

ADMS '14

88

slide-89
SLIDE 89
  • Upcoming releases of RDMA-enhanced packages will support
    – Hadoop 2.x MapReduce & RPC
    – Spark
    – HBase
  • Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
    – HDFS
    – MapReduce
    – RPC
  • Exploration of other components (threading models, QoS, virtualization, accelerators, etc.)
  • Advanced designs with upper-level changes and optimizations

ADMS '14

Future Plans of OSU High Performance Big Data Project

89

slide-90
SLIDE 90
  • Overview of Modern Clusters, Interconnects and Protocols
  • Challenges for Accelerating Data Management and Processing
  • The High-Performance Big Data (HiBD) Project
  • RDMA-based design for Memcached and HBase
  • RDMA-based designs for Apache Hadoop and Spark

– Case studies with HDFS, MapReduce, and Spark – RDMA-based MapReduce on HPC Clusters with Lustre

  • Ongoing and Future Activities
  • Conclusion and Q&A

Presentation Outline

ADMS '14

90

slide-91
SLIDE 91
  • Presented an overview of data management and processing middleware in different tiers
  • Provided an overview of modern networking technologies
  • Discussed challenges in accelerating Big Data middleware
  • Presented initial designs to take advantage of InfiniBand/RDMA for Memcached, HBase, Hadoop and Spark
  • Results are promising
  • Many other open issues need to be solved
  • Will enable the Big Data management and processing community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

ADMS '14

Concluding Remarks

91

slide-92
SLIDE 92

panda@cse.ohio-state.edu

Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/

ADMS '14

Thank You!

The High-Performance Big Data Project http://hibd.cse.ohio-state.edu/

92