SSS: An Implementation of Key-value Store based MapReduce Framework
Hirotaka Ogawa (AIST, Japan), Hidemoto Nakada, Ryousei Takano, Tomohiro Kudoh
MapReduce
- A promising programming tool for implementing large-scale data-intensive apps
- Essentially provides a data-parallel computing model
– Map
- Spreads a segment of a single array computation over multiple processors
- Performs each computation on the relevant processor
– Reduce
- Aggregates distributed reduction variables
- Performs computation over them
- (Theoretically) most SPMD-type apps can be realized by the MR model (see the sketch below)
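As a minimal illustration of this data-parallel model (plain Java, illustrative only; not the SSS or Hadoop API), the sketch below splits an array into segments, lets each segment compute a local reduction variable in the map step, and aggregates the partial results in the reduce step:

    import java.util.Arrays;
    import java.util.stream.IntStream;

    public class DataParallelSketch {
        public static void main(String[] args) {
            double[] a = IntStream.range(0, 1_000_000).asDoubleStream().toArray();
            int parts = 4;  // number of "processors" (array segments)

            // Map: each segment independently computes its local reduction variable.
            double[] partial = IntStream.range(0, parts).parallel()
                .mapToDouble(p -> {
                    int lo = p * a.length / parts, hi = (p + 1) * a.length / parts;
                    double sum = 0.0;
                    for (int i = lo; i < hi; i++) sum += a[i] * a[i];
                    return sum;
                }).toArray();

            // Reduce: aggregate the distributed reduction variables.
            double total = Arrays.stream(partial).sum();
            System.out.println(total);
        }
    }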
Extends MapReduce to HPC
- Users can develop HPC apps faster and more easily than ever!
– Provides a higher-level programming model than parallel programming languages, e.g., HPF, OpenMP
– Provides simpler communication topologies and synchronization model than message-passing libraries, e.g., MPI
- But there are limitations in MapReduce
– Sacrificing runtime performance
– Fixed workflow
Why?
- Semantic gap between the MR data model and the input/output data format
– MR apps handle KV data
– Backend DFS provides an interface to large data files
- No opportunity to reuse the internal KV data
– These KV data exist only in the MR runtime
- Considerable overhead for reading/writing large amounts of data, in particular for iterative apps
- Flexible workflows, e.g., reusing KV data (incl. the intermediate KV data) across multiple maps and reduces, are infeasible!
Related Work
- Twister
– Map tasks can read input data from:
- Distributed in-memory cache
- Local disks
- MRAP
– Map tasks can read input data from:
- Preprocessed files optimized for efficient access
- Original files
- → Partial solutions
– Cannot handle the intermediate KV data
– Users need to determine which data is cacheable (immutable)
Our Solution
[Figure: Our solution compared with a file-system-based runtime. Left: a MapReduce runtime and task manager on top of a file system, with an internal data (cache) manager holding KV data and serialized files. Right: a MapReduce runtime and task manager on top of a storage system that holds the KV data directly, with a storage-side cache.]
SSS: Distributed KVS based MapReduce
- Our first prototype of KVS-based MapReduce
- Hadoop-compatible API (word-count example below)
- Distributed KVS-centric
– Scale-out
- Horizontally adding new nodes
– Owner computes
- Each map and reduce is distributed to the node where the target KV data is stored
– Shuffle-and-Sort phase is not required
– On-memory cache
- Enables reuse of KV data across multiple maps and reduces
– Flexible workflows are supported
- Any combination of map and reduce, including iterative apps, is available
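Because SSS exposes a Hadoop-compatible API, the user-visible code is the familiar Mapper/Reducer pair; the word-count classes below use the stock org.apache.hadoop.mapreduce API (the SSS-specific job setup is not shown on the slides and is omitted here):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit <word, 1> for every token in the input value.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts grouped under each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                result.set(sum);
                ctx.write(key, result);
            }
        }
    }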
Why?
[Figure: How MapReduce runs in SSS. Input KV pairs stored in the KVS feed the Map tasks; the intermediate KV pairs (iKV) are written back to the KVS, grouped by key, and consumed by the Reduce tasks, whose output KV pairs are stored in the KVS again.]
How?
[Figure: A single MapReduce pass. Map tasks read the input key-value data and emit intermediate key-value data, which the Reduce tasks turn into the output key-value data.]
Iterative Application
[Figure: Chained MapReduce passes reusing the intermediate key-value data. The output and intermediate key-value data of one pass feed the maps and reduces of the next.]
SSS: Architectural Overview
SSS Map Thread Pipeline
- To limit memory usage and hide the latency of the KVS, the SSS runtime employs a pipelining technique (see the sketch below)
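A minimal sketch of such a pipeline, assuming a hypothetical KvsClient interface (the real SSS internals are not shown on the slides): map threads emit KV pairs into a bounded queue, and dedicated writer threads store them to the KVS, so KVS latency overlaps with map computation and the queue bound caps memory use.

    import java.util.AbstractMap.SimpleEntry;
    import java.util.Map;
    import java.util.concurrent.*;

    public class MapPipelineSketch {
        // Hypothetical placeholder for the real KVS client.
        interface KvsClient { void put(String key, byte[] value); }

        private final BlockingQueue<Map.Entry<String, byte[]>> queue =
                new ArrayBlockingQueue<>(10_000);        // bounds memory usage
        private final ExecutorService writers = Executors.newFixedThreadPool(4);

        MapPipelineSketch(KvsClient kvs) {
            for (int i = 0; i < 4; i++) {
                writers.submit(() -> {
                    try {
                        while (true) {
                            Map.Entry<String, byte[]> kv = queue.take();  // wait for emitted pairs
                            kvs.put(kv.getKey(), kv.getValue());          // overlaps with map work
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();               // writer shutdown
                    }
                });
            }
        }

        // Called from map threads; blocks when the queue is full, throttling the maps.
        void emit(String key, byte[] value) throws InterruptedException {
            queue.put(new SimpleEntry<>(key, value));
        }
    }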
Packed-SSS Map Thread Pipeline
- To reduce the number of KV data items (see the sketch below)
– Stores KV data into a thread-local buffer
– Converts them to a single large KV datum
– Stores it to the KVS
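A minimal sketch of the packing step, again assuming a hypothetical KvsClient and an illustrative serialization format (the actual SSS packing format is not described on the slides): each map thread appends its pairs to a local buffer and stores the whole buffer as one large KV with a single put.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.UUID;

    public class PackedEmitter {
        // Hypothetical placeholder for the real KVS client.
        interface KvsClient { void put(String key, byte[] value); }

        private final KvsClient kvs;
        private final ByteArrayOutputStream buf = new ByteArrayOutputStream();
        private final DataOutputStream out = new DataOutputStream(buf);

        PackedEmitter(KvsClient kvs) { this.kvs = kvs; }

        // Append one KV pair to the thread-local buffer instead of issuing a put per pair.
        void emit(String key, String value) throws IOException {
            out.writeUTF(key);
            out.writeUTF(value);
        }

        // Store the whole buffer as a single large KV, reducing the number of
        // KV data items the KVS has to manage.
        void flush() throws IOException {
            out.flush();
            kvs.put("packed-" + UUID.randomUUID(), buf.toByteArray());
            buf.reset();
        }
    }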
Preliminary Evaluation
- Environment
– Number of nodes: 16 + 1 (master)
– CPUs per node: Intel Xeon W5590 3.33GHz x 2
– Memory per node: 48GB
– OS: CentOS 5.5 x86_64
– Storage: Fusion-io ioDrive Duo 320GB
– NIC: Mellanox ConnectX-II 10G
- MapReduce implementations
– SSS
– Packed-SSS
– Hadoop (replica count set to 1, to avoid unintended replication)
Benchmarks
- Word count
– 128 text files
- Total file size: 12.5GiB
- Each file size: almost 100MiB
– No combiners employed
– Input: Coarse grain, Output: Fine grain
- Iterative identity map and reduce
– Both map and reduce generate a set of KV data identical to their inputs (see the sketch below)
– Applied iteratively 8 times
– 8 million keys
- Total amount of KV data: almost 128MiB
– Input/Output: Fine grain
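The identity benchmark only moves KV data through the runtime; with a Hadoop-compatible API it amounts to the pair of classes below (the key/value types are an assumption, since the slides do not state them):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IdentityMR {
        // Map: pass every input pair through unchanged.
        public static class IdMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(key, value);
            }
        }

        // Reduce: re-emit every grouped value under its key.
        public static class IdReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    ctx.write(key, v);
                }
            }
        }
    }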
DistCopy: Distributing 12.5GiB text files for WordCount
[Chart: DistCopy time in seconds for Hadoop (serial), SSS (serial), and SSS (parallel). SSS (serial) is about 17% faster and SSS (parallel) about 90% faster than Hadoop.]
- Our KVS is not slower than HDFS
- Parallelization is very effective due to ioDrive + 10GbE
WordCount
[Chart: Running time in seconds over 5 trials for Hadoop, SSS, and packed-SSS. SSS is about 12% faster and packed-SSS about 3x faster than Hadoop.]
Iterative Identity Map/Reduce
[Chart: Running time in seconds versus iteration count (1 to 8) for Hadoop, SSS, and packed-SSS. SSS is about 2.9x faster and packed-SSS about 10x faster than Hadoop.]
Numbers of KV data/files
- WordCount

               # Map Inputs    # Intermediate    # Reduce Outputs
  SSS          128             1.5 billion       2.7 million
  Packed-SSS   128             2,048             16
  Hadoop       128 files       ~256 files        16 files

- Iterative Identity Map/Reduce

               # Map Inputs    # Intermediate    # Reduce Outputs
  SSS          8 million       8 million         8 million
  Packed-SSS   128             128               128
  Hadoop       128 files       128 files         128 files
Conclusion
- SSS is our first prototype of KVS-based MapReduce
- Distributed KVS-centric
– Scale-out
– Owner computes
– Shuffle-and-Sort phase is not required
– On-memory cache
– Flexible workflows are supported
- Hadoop-compatible API
- Runtime performance is better than Hadoop's, but we think it is not yet sufficient
Future Work
- Performance
- Fault-tolerance
- More comprehensive benchmarks
– To identify the characteristics and feasibility for various classes of HPC and data-intensive apps
- Higher-level programming tool
– Pig, Szl, DryadLINQ, HAMA, R, etc.
– We have already implemented our own Sawzall clone running on top of Hadoop and SSS
Thank you!
Matrix Multiply by MR
[Figure: Matrices A and B, each blocked into a 4x4 grid of sub-matrices. Map performs the block multiplies and emits each partial product keyed by its output block, e.g., <C11, A11*B11>, <C11, A12*B21>, <C12, A11*B12>, ... Reduce performs the block add for each key, e.g., <C11, A11*B11 + A12*B21 + A13*B31 + A14*B41>. A code sketch follows.]
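A plain-Java sketch of this map/reduce (block size, layout, and key encoding are illustrative assumptions): the map step emits each partial block product keyed by its output block, and the reduce step block-adds all partial products sharing a key.

    import java.util.*;

    public class BlockMatMulMR {
        // Multiply two dense blocks (sub-matrices).
        static double[][] multiply(double[][] a, double[][] b) {
            int n = a.length, m = b[0].length, k = b.length;
            double[][] c = new double[n][m];
            for (int i = 0; i < n; i++)
                for (int p = 0; p < k; p++)
                    for (int j = 0; j < m; j++)
                        c[i][j] += a[i][p] * b[p][j];
            return c;
        }

        // Block add: element-wise sum of two blocks of equal shape.
        static double[][] add(double[][] x, double[][] y) {
            double[][] r = new double[x.length][x[0].length];
            for (int i = 0; i < x.length; i++)
                for (int j = 0; j < x[0].length; j++)
                    r[i][j] = x[i][j] + y[i][j];
            return r;
        }

        // a[i][k] and b[k][j] are blocks of the blocked matrices A and B.
        static Map<String, double[][]> blockMultiply(double[][][][] a, double[][][][] b) {
            // Map phase: emit <"C(i,j)", A(i,k)*B(k,j)> for every (i, k, j).
            Map<String, List<double[][]>> intermediate = new HashMap<>();
            for (int i = 0; i < a.length; i++)
                for (int k = 0; k < b.length; k++)
                    for (int j = 0; j < b[0].length; j++)
                        intermediate.computeIfAbsent("C" + i + "," + j, key -> new ArrayList<>())
                                    .add(multiply(a[i][k], b[k][j]));

            // Reduce phase: block-add all partial products with the same output key.
            Map<String, double[][]> c = new HashMap<>();
            intermediate.forEach((key, parts) ->
                    c.put(key, parts.stream().reduce(BlockMatMulMR::add).get()));
            return c;
        }
    }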