

SLIDE 1

Building the next Generation of MapReduce Programming Models over MPI to Fill the Gaps between Data Analytics and Supercomputers

Michela Taufer

University of Delaware Collaborators: Tao Gao, Boyu Zhang (University of Delaware) Pavan Balaji, Yanfei Guo (Argonne National Laboratory) BingQiang Wang, Yutong Lu (Guangzhou Supercomputer Center) Pietro Cicotti (San Diego Supercomputer Center) Yanjie Wei (Shenzhen Institute of Advanced Technologies)

SLIDE 2

MapReduce Programming Model

  • MapReduce runtime handles the parallel job execution, communication, and data movement

  • Users provide map and reduce functions

WordCount example: the input "Hello World Hello World" is split across two map tasks, which emit <Hello,1>, <World,1>, <Hello,1>, <World,1>; after the shuffle, two reduce tasks emit <Hello,2> and <World,2>.
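To make the division of labor concrete, below is a minimal sketch of the two user-supplied functions for WordCount. The Emitter interface and the callback signatures are illustrative assumptions; every MapReduce-over-MPI framework defines its own callback types, but the user's job is the same: emit <word, 1> pairs in map and sum them in reduce, while the runtime decides where each pair is sent and when the grouped values reach reduce.

#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical emit interface: the framework would collect the <key, value>
// pairs produced by the user functions (names and signatures are illustrative).
struct Emitter {
    virtual void emit(const std::string& key, std::int64_t value) = 0;
    virtual ~Emitter() = default;
};

// Map: split one line of input into words and emit <word, 1> for each word.
void wordcount_map(const std::string& line, Emitter& out) {
    std::istringstream words(line);
    std::string word;
    while (words >> word) out.emit(word, 1);
}

// Reduce: sum the counts gathered for one word and emit <word, total>.
void wordcount_reduce(const std::string& word,
                      const std::vector<std::int64_t>& counts, Emitter& out) {
    std::int64_t total = 0;
    for (std::int64_t c : counts) total += c;
    out.emit(word, total);
}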

SLIDE 3

WordCount: A Concrete Example

Map

Input text:

When tweetle beetles fight, it’s called a tweetle beetle battle. And when they battle in a puddle, it’s a tweetle beetle puddle battle. And when tweetle beetles battle with paddles in a puddle, They call it a tweetle beetle puddle paddle battle.

Map output (<key, value> pairs, one list per input line):

(When, 1), (tweetle, 1), (beetles, 1), (fight, 1)
(it’s, 1), (called, 1), (a, 1), (tweetle, 1), (beetle, 1), (battle, 1)
(And, 1), (when, 1), (they, 1), (battle, 1), (in, 1), (a, 1), (puddle, 1)
(it’s, 1), (a, 1), (tweetle, 1), (beetle, 1), (puddle, 1), (battle, 1)
(And, 1), (when, 1), (tweetle, 1), (beetles, 1), (battle, 1), (with, 1), (paddles, 1), (in, 1), (a, 1), (puddle, 1)
(They, 1), (call, 1), (it, 1), (a, 1), (tweetle, 1), (beetle, 1), (puddle, 1), (paddle, 1), (battle, 1)

SLIDE 4

WordCount: A Concrete Example

Reduce

The shuffled <key, value> pairs are grouped by key and reduced to counts:

(tweetle, 1), (tweetle, 1), (tweetle, 1), (tweetle, 1), (tweetle, 1) → (tweetle, 5)
(battle, 1), (battle, 1), (battle, 1), (battle, 1), (battle, 1) → (battle, 5)
(puddle, 1), (puddle, 1), (puddle, 1), (puddle, 1) → (puddle, 4)
(beetle, 1), (beetle, 1), (beetle, 1) → (beetle, 3)
(beetles, 1), (beetles, 1) → (beetles, 2)
(when, 1), (when, 1) → (when, 2)
(When, 1) → (When, 1)

SLIDE 5

Data Generation on HPC Systems

From: https://xdmod.ccr.buffalo.edu

SLIDE 6

Is MapReduce over MPI an appealing way to handle big data processing on HPC systems?
SLIDE 7

Data Processing on HPC Systems

  • Key differences between Cloud computing and HPC systems rule out the naïve reuse of Cloud methods

[Diagram: HPC systems pair processors with a fast interconnect and a shared disk array and are programmed with MPI/OpenMP; Cloud computing systems pair each processor with local disk over Ethernet and are programmed with Hadoop/Spark.]

SLIDE 8

A Fundamentally Correct MapReduce (MR) over MPI

  • Support the logical map-shuffle-reduce workflow in four phases

§ Map, aggregate, convert, and reduce [1]

[Diagram: every process P0 … Pn runs map → barrier → aggregate → barrier → convert → barrier → reduce → barrier → output; intermediate data moves from <key, value> to <key, list<value>> pairs.]

[1] S. J. Plimpton and K. D. Devine. MapReduce in MPI for Large-Scale Graph Algorithms. Parallel Computing, 37(9):610–632, 2011.
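For concreteness, here is a minimal skeleton of how a user might drive this four-phase flow. MapReduceJob and its methods are hypothetical stand-ins, not the MR-MPI API of [1]; the stub bodies only mark the barrier that ends each explicitly invoked phase, which is exactly the synchronization cost examined on the next slides.

#include <mpi.h>
#include <cstdio>

// Hypothetical skeleton of a four-phase MapReduce-over-MPI driver (not the
// MR-MPI API). Phase bodies are stubs; each ends with a barrier to make the
// synchronization point after every explicitly invoked phase visible.
class MapReduceJob {
public:
    explicit MapReduceJob(MPI_Comm comm) : comm_(comm) {}

    void map()       { /* run user map on this process's input */    MPI_Barrier(comm_); }
    void aggregate() { /* all-to-all shuffle of <key,value> pairs */  MPI_Barrier(comm_); }
    void convert()   { /* group values into <key, list<value>> */     MPI_Barrier(comm_); }
    void reduce()    { /* run user reduce on the grouped pairs */     MPI_Barrier(comm_); }

private:
    MPI_Comm comm_;
};

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    MapReduceJob job(MPI_COMM_WORLD);
    job.map();        // phase 1: map
    job.aggregate();  // phase 2: aggregate (extra synchronization)
    job.convert();    // phase 3: convert (extra data staging)
    job.reduce();     // phase 4: reduce

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) std::printf("four-phase MapReduce job finished\n");

    MPI_Finalize();
    return 0;
}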

SLIDE 9

Extra Synchronizations

  • Aggregate and convert must be invoked explicitly by the user

§ Cost: extra synchronizations

[Same workflow diagram as the previous slide, with the barriers between the four phases highlighted.]
SLIDE 10

Extra Data Staging

  • Aggregate and convert need to store intermediate data

§ Cost: extra data staging

[Same workflow diagram, with the intermediate <key, value> and <key, list<value>> data staged between phases highlighted.]
SLIDE 11

Extra Memory Usage and Poor Data Management

  • Zooming in on the map / aggregate operations

[Same workflow diagram, zooming in on the map and aggregate operations.]

SLIDE 12


Extra Memory Usage and Poor Data Management

  • Allocating additional memory buffers for metadata

§ Cost: extra memory use

  • If the in-memory buffer is full → spill data to disk

§ Cost: poor data management

[Diagram: with static allocation, each process (P0, P1) runs map into fixed send buffers, a staging area, and receive buffers.]

SLIDE 13

Tackling Shortcomings of a Correct MR Model

Shortcomings: extra synchronizations, extra data staging, extra memory use, and poor data management.

A journey to design and implement Mimir, an efficient MR over MPI framework, that also addresses:

  • Memory inefficiency
  • Load balancing issues
  • I/O variability
SLIDE 14

Impact: Out-of-memory Operations

Single-node execution time of WordCount (Wikipedia) with MR-MPI on Comet (128 GB memory) [1]

Out-of-memory processing

  • Existing MapReduce over MPI implementations still struggle with memory limits: they can process only 4 GB of data in memory on a 128 GB node

[1] T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M. Taufer. Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems. In Proceedings of the IPDPS, 2017.

SLIDE 15

Reduce Synchronization and Extra Data Staging

  • Interleave operations: e.g., map interleaves with aggregate

[Diagram: on each process P0 … Pn, map interleaves with aggregate, so the two phases no longer need a barrier between them; convert and reduce follow.]

Improvements: 1. Reduce synchronization; 2. Reduce extra data staging
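A minimal sketch of one way to interleave map with aggregate, assuming (as an illustration, not Mimir's actual code) that emitted pairs go straight into per-destination send buffers and that all ranks periodically agree, via MPI_Allreduce, whether to flush them with a collective exchange. The point is that the shuffle overlaps the map phase instead of waiting behind a global barrier after it.

#include <mpi.h>
#include <functional>
#include <string>
#include <vector>

// Illustrative sketch of interleaving map with aggregate. Each emitted
// <key,value> pair is appended to the send buffer of the process that owns
// the key; at regular points all ranks agree whether any buffer is full and,
// if so, exchange buffers collectively, overlapping shuffle with map.
class InterleavedShuffle {
public:
    InterleavedShuffle(MPI_Comm comm, std::size_t buf_limit)
        : comm_(comm), limit_(buf_limit) {
        MPI_Comm_size(comm_, &nprocs_);
        send_bufs_.resize(nprocs_);
    }

    // Called by the user map function for every <key,value> pair.
    void emit(const std::string& key, const std::string& value) {
        int dest = static_cast<int>(std::hash<std::string>{}(key) % nprocs_);
        std::string& buf = send_bufs_[dest];
        buf += key; buf += '\t'; buf += value; buf += '\n';
        if (buf.size() >= limit_) need_flush_ = 1;
    }

    // Called collectively at regular points during the map phase.
    void maybe_exchange() {
        int any_full = 0;
        MPI_Allreduce(&need_flush_, &any_full, 1, MPI_INT, MPI_MAX, comm_);
        if (!any_full) return;
        // Exchange buffer sizes; moving the buffers with MPI_Alltoallv and
        // unpacking them into the local KV container is omitted for brevity.
        std::vector<int> send_counts(nprocs_), recv_counts(nprocs_);
        for (int i = 0; i < nprocs_; ++i)
            send_counts[i] = static_cast<int>(send_bufs_[i].size());
        MPI_Alltoall(send_counts.data(), 1, MPI_INT,
                     recv_counts.data(), 1, MPI_INT, comm_);
        for (auto& b : send_bufs_) b.clear();
        need_flush_ = 0;
    }

private:
    MPI_Comm comm_;
    std::size_t limit_;
    int nprocs_ = 1;
    int need_flush_ = 0;
    std::vector<std::string> send_bufs_;
};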

SLIDE 16

(Animation of the previous slide: the same interleaved workflow, now shown with the barrier between map and aggregate removed.)

SLIDE 17

Optimizing Intermediate Data Management

  • Use the send buffer directly as the output of map

§ Avoid extra buffer usage

  • Use the KV/KMV container as the staging area

§ Dynamically allocate one or multiple pages

[Diagram: with dynamic allocation, each process (P0, P1) maps directly into its send buffer and stages intermediate data in a KV container that grows page by page.]

Improvements: 3. Avoid extra memory buffer usage; 4. Manage intermediate data more efficiently
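A sketch of a page-based KV container, assuming (illustratively, not Mimir's actual data structure) that records are appended to fixed-size pages allocated on demand, so memory grows with the data actually staged rather than being reserved statically up front.

#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Illustrative page-based KV container: pages are allocated only when the
// current one is full, so the footprint tracks the staged data.
class KVContainer {
public:
    explicit KVContainer(std::size_t page_size = 64u << 20)  // 64 MB pages
        : page_size_(page_size) {}

    // Append one <key,value> record; allocates a new page when the current
    // one cannot hold it. (Records larger than a page are not handled here.)
    void add(const char* key, std::size_t klen,
             const char* val, std::size_t vlen) {
        std::size_t need = klen + vlen + 2;  // record plus two separators
        if (pages_.empty() || used_ + need > page_size_) {
            pages_.push_back(std::make_unique<char[]>(page_size_));
            used_ = 0;
        }
        char* dst = pages_.back().get() + used_;
        std::memcpy(dst, key, klen);            dst[klen] = '\0';
        std::memcpy(dst + klen + 1, val, vlen); dst[klen + 1 + vlen] = '\0';
        used_ += need;
    }

    std::size_t num_pages() const { return pages_.size(); }

private:
    std::size_t page_size_;
    std::size_t used_ = 0;
    std::vector<std::unique_ptr<char[]>> pages_;
};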

SLIDE 18

Mimir vs. MR-MPI: WordCount on Comet

  • Single-node execution (24 processes, 128 GB memory)

§ Benchmark: WordCount (WC) with the Wikipedia dataset
§ Settings: MR-MPI (64 MB page and 512 MB page); Mimir (64 MB page)

[Chart: execution time vs. dataset size, annotated with 64X and 4X; Mimir can handle a 4X larger dataset.]

[1] T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M. Taufer. Mimir: Memory-Efficient and Scalable MapReduce for Large Supercomputing Systems. In Proceedings of the IPDPS, 2017.

SLIDE 19

Impact: Load Imbalance

  • <key,value> pairs are NOT distributed evenly among processes

§ Imbalanced <key,value> pairs may cause poor resource usage

[Left chart: execution time (total time, sec) of WordCount (Wikipedia) with Mimir on Tianhe-2 without load balancing. Right chart: number of <key,value> pairs per process for WordCount (Wikipedia) on 768 processes (x-axis: process ids).]

SLIDE 20

Impact: Load Imbalance

  • <key,value> pairs are NOT distributed evenly among processes

§ Imbalanced <key,value> pairs may cause poor resource usage

[Left chart: balance ratio (max memory / min memory across all processes) for WordCount (Wikipedia) with Mimir on Tianhe-2. Right chart: number of <key,value> pairs per process for WordCount (Wikipedia) on 768 processes (x-axis: process ids).]

SLIDE 21

Combining <key,value> Pairs

  • Combiner operations:

§ Merge <key,value> pairs with the same key before the shuffle
§ Merge <key,value> pairs with the same key after the shuffle

  • Application dependent:

§ WordCount → YES
§ Join → NO

[Diagram: a combine step runs on each process P0 … Pn both before and after the interleaved map/aggregate exchange of <key, value> pairs.]

SLIDE 22

Combining <key,value> Pairs

  • Merge <key,value> pairs with the same key before shuffle
  • Merge <key,value> pairs with the same key after shuffle

WordCount example with a combiner: the input lines "Hello World", "Hi World", "Hello World", and "Hello" are mapped to <Hello,1>, <World,1>, <Hi,1>, … pairs; local combining produces partial counts such as <Hello,2> and <World,2> before the shuffle, and the reduce side emits <Hello,3>, <World,3>, and <Hi,1>.
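A minimal sketch of a map-side combiner for WordCount, under the assumption (illustrative, not Mimir's implementation) that each process keeps a small hash map of partial counts and forwards one combined pair per distinct word at shuffle time. Combining applies only when the reduction is associative and commutative, which is why WordCount benefits and Join does not.

#include <cstdint>
#include <string>
#include <unordered_map>

// Illustrative local combiner: sums counts per key on each process before
// the shuffle, so only one <word, partial_count> pair per word is sent.
class WordCountCombiner {
public:
    // Called instead of emitting <word, 1> directly to the send buffer.
    void emit(const std::string& word, std::int64_t count) {
        partial_[word] += count;   // merge pairs with the same key
    }

    // At shuffle time, forward one combined pair per key, then reset.
    template <typename Sender>
    void flush(Sender&& send) {
        for (const auto& [word, count] : partial_) send(word, count);
        partial_.clear();
    }

private:
    std::unordered_map<std::string, std::int64_t> partial_;
};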

SLIDE 23

Combiner Results: WordCount on Tianhe-2

[Charts: total time (sec), balance ratio, and memory usage (GB) vs. number of KV pairs (1e8), comparing Mimir with Mimir + combiner.]

T. Gao, Y. Guo, B. Zhang, P. Cicotti, Y. Lu, P. Balaji, and M. Taufer. Skew Mitigation in MapReduce for Supercomputing Systems. In preparation, 2017.

SLIDE 24

Files and IO Variability

  • Sizes of different input files can be very different

§ e.g., sequence file sizes in the genomes dataset

  • I/O performance variability lets a few slow processes hold back the progress of all processes

[Left chart: sequence file sizes (GB) in the 1000 Genomes dataset vary from a few MB to a few hundred GB. Right chart: read time (sec) per process rank when reading 6 TB of genomics data with 768 processes (8 GB per process) on Tianhe-2.]

SLIDE 25

Streaming IO Model

  • Files are viewed as segments of one continuous data stream

§ Files are cut into many equal-size chunks
§ The MR application provides a chunking function to Mimir (see the sketch below)

[Diagram: a sequence file distributed across processes P0 … P3; in the discrete I/O model each process reads whole files, while in the stream I/O model equal-size chunks feed the map tasks, whose <key,value> pairs flow through Mimir's shuffle communication.]
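A sketch of the chunk-boundary handling a streaming I/O model needs, assuming (illustratively) a text-like format whose records end at a separator character; the function name and signature are made up for this example. A chunk is read at a fixed offset and size and then extended to the next record separator so no record is split; a consumer of a non-initial chunk would symmetrically skip everything up to its first separator.

#include <cstddef>
#include <cstdio>
#include <string>

// Illustrative chunk reader for a streaming I/O model. Reads chunk_size
// bytes starting at offset, then extends the chunk to the next record
// separator so records are never split across chunks.
std::string read_chunk(std::FILE* f, long offset, long chunk_size,
                       char record_sep = '\n') {
    std::fseek(f, offset, SEEK_SET);

    std::string chunk(static_cast<std::size_t>(chunk_size), '\0');
    std::size_t got = std::fread(&chunk[0], 1, chunk.size(), f);
    chunk.resize(got);

    // Only extend if we filled the whole chunk (i.e., we did not hit EOF).
    if (got == static_cast<std::size_t>(chunk_size)) {
        int c;
        while ((c = std::fgetc(f)) != EOF) {
            chunk.push_back(static_cast<char>(c));
            if (c == record_sep) break;
        }
    }
    return chunk;
}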

SLIDE 26

Work Stealing: General Idea

  • File chunks are statically partitioned across processes initially
  • Once a process finishes its own work, it tries to steal a chunk from a chosen victim process using fetch_and_add
  • Each process manages a chunk map to keep track of where the “stolen” chunks have gone
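A sketch of the stealing step with MPI one-sided communication, assuming (illustratively) that each process exposes a 64-bit counter of its next unprocessed chunk in an RMA window created with MPI_Win_allocate; the helper below is made up for this example, not Mimir's code. MPI_Fetch_and_op atomically reads and increments the victim's counter, so two thieves can never claim the same chunk. The chunk-map update shown on the next slides (an atomic PUT recording where the stolen chunk went) is omitted here.

#include <mpi.h>
#include <cstdint>

// Illustrative helper for stealing a chunk index from a victim process.
// Assumes each process exposes one std::int64_t counter (the index of its
// next unprocessed chunk) at displacement 0 of an RMA window.
std::int64_t steal_chunk(MPI_Win win, int victim, std::int64_t victim_num_chunks) {
    const std::int64_t one = 1;
    std::int64_t claimed = 0;

    MPI_Win_lock(MPI_LOCK_SHARED, victim, 0, win);
    // Atomically fetch the victim's "next chunk" counter and add 1 to it,
    // so no other thief (or the victim itself) can claim the same index.
    MPI_Fetch_and_op(&one, &claimed, MPI_INT64_T, victim,
                     /*target_disp=*/0, MPI_SUM, win);
    MPI_Win_unlock(victim, win);

    if (claimed < victim_num_chunks) return claimed;  // this chunk is ours
    return -1;  // victim has no work left; the caller picks another victim
}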

SLIDE 27

Work Stealing: Implementation

  • Processes acquire a local chunk

[Diagram: processes P0 and P1, each with a steal offset, chunk ids, and a chunk map; a local chunk is acquired with a fetch-and-op (FOP) on the steal offset followed by an atomic PUT into the chunk map.]

SLIDE 28

Work Stealing: Implementation

  • Processes steal a remote chunk

[Diagram: stealing a remote chunk combines an atomic GET on the victim's chunk map, a fetch-and-op (FOP) on the victim's steal offset, and an atomic PUT recording the new owner in the chunk map.]
SLIDE 29

Real Bioinformatics Application: K-mer Counting

  • A fundamental operation in genome analytics with several use cases:

§ K-mer counting is fundamental to analyzing or estimating genome assembly (the number of k-mers determines the graph size)
§ Core tool for understanding similarities in genomic samples

  • E.g., the rate of increase in k-mer counts explains how similar multiple genomes are

§ Error validation tool

  • When a newly processed genome is merged with a reference genome, a drastic increase in k-mer counts indicates many errors

SLIDE 30

Integration in Real Bioinformatics Application

  • Bloomfish combines single-node k-mer counting (i.e., JellyFish) with our MapReduce framework (i.e., Mimir)

[Diagram: sequence files feed map tasks that emit <mer, NULL> pairs; JellyFish hash arrays on each process perform in-situ analysis, and Mimir's shuffle communication combines the partial results into the output files.]
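A sketch of the map-side k-mer extraction, an illustrative stand-in for what JellyFish does inside Bloomfish (the function name and the use of std::string keys are assumptions; real k-mer counters pack bases into compact 2-bit encodings). Each read is decomposed into its overlapping k-mers, and each k-mer becomes a key, matching the <mer, NULL> pairs in the diagram above.

#include <cstddef>
#include <string>
#include <vector>

// Illustrative k-mer extraction: decompose one read into its overlapping
// k-mers; each returned k-mer would be emitted as a <mer, NULL> key.
std::vector<std::string> extract_kmers(const std::string& read, std::size_t k = 22) {
    std::vector<std::string> kmers;
    if (read.size() < k) return kmers;            // read shorter than k
    kmers.reserve(read.size() - k + 1);
    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        kmers.push_back(read.substr(i, k));       // one overlapping k-mer
    }
    return kmers;
}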
SLIDE 31

Analyze Large-scale DNA Dataset

Data source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp//phase3/data/

[Chart: weak scalability (8 GB/process) for the 1000 Genomes dataset on up to 24 processes (3 TB) of Tianhe-2.]

Jellyfish can count 22-mers of a 3 TB dataset with 24 processes in about 24 hours.

SLIDE 32

Analyze Large-scale DNA Dataset

[Chart: weak scalability (8 GB/process) for the 1000 Genomes dataset on up to 3,072 processes (24 TB) of Tianhe-2.]

Bloomfish (Jellyfish on top of Mimir) can count 22-mers of a 24 TB dataset with 3,072 processes in about one hour.

T. Gao, Y. Guo, Y. Wei, B. Wang, Y. Lu, P. Cicotti, P. Balaji, and M. Taufer. Bloomfish: A Highly Scalable Distributed K-mer Counting Framework. In Proceedings of the ICPADS, 2017.
SLIDE 33

Lessons Learned

  • We present our journey to build Mimir, a memory-efficient and scalable MapReduce over MPI

§ Mimir can handle a 16X larger dataset in memory compared with MR-MPI
§ Mimir scales to 16,384 processes
§ Mimir is open-source: https://github.com/TauferLab/Mimir.git

  • We co-design I/O in Mimir to support a stream I/O model, work stealing, and nonblocking collective communication

  • We integrate Mimir into Bloomfish, a highly scalable k-mer counting framework, and show substantial data scaling

Join the panel today at 15:30 to discuss MPI on Post-Exascale Systems