slide-1
SLIDE 1

1/ 29

Data Locality in MapReduce

Loris Marchal (1), Olivier Beaumont (2)

1: CNRS and ENS Lyon, France. 2: INRIA Bordeaux Sud-Ouest, France.

New Challenges in Scheduling Theory — March 2016

slide-2
SLIDE 2

2/ 29

MapReduce basics

◮ Well-known framework for data processing on parallel clusters
◮ Popularized by Google; open-source implementation: Apache Hadoop
◮ Breaks the computation into small tasks distributed on the processors
◮ Dynamic scheduler: handles failures and processor heterogeneity
◮ Centralized scheduler launches all tasks
◮ Users only have to write code for two functions:
  ◮ Map: filters the data, produces intermediate results
  ◮ Reduce: summarizes the information
◮ Large data files are split into chunks scattered on the platform (e.g. using HDFS for Hadoop)
◮ Goal: process the computation near the data, avoid large data transfers

slide-3
SLIDE 3

3/ 29

MapReduce example

Textbook example: WordCount (count the number of occurrences of each word in a text)

1. The text is split into chunks scattered on local disks
2. Map: compute the number of occurrences of each word in a chunk, producing <word, #occurrences> pairs
3. Sort and Shuffle: gather all pairs with the same word on a single processor
4. Reduce: merge the results for each word (sum the #occurrences)
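The four steps can be sketched in plain Python (a toy illustration, not the Hadoop API; `map_phase`, `shuffle`, and `reduce_phase` are hypothetical names):

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: count word occurrences in one chunk, emit <word, #occurrences> pairs."""
    counts = defaultdict(int)
    for word in chunk.split():
        counts[word] += 1
    return list(counts.items())

def shuffle(mapped_chunks):
    """Sort and Shuffle: gather all pairs with the same word together."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_phase(word, counts):
    """Reduce: merge the partial counts of a single word."""
    return word, sum(counts)

chunks = ["to be or not", "to be is to do"]          # chunks on local disks
mapped = [map_phase(c) for c in chunks]              # Map, one task per chunk
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
# result == {'to': 3, 'be': 2, 'or': 1, 'not': 1, 'is': 1, 'do': 1}
```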
slide-4
SLIDE 4

4/ 29

Other usages of MapReduce

◮ Several phases of Map and Reduce (tightly coupled applications)
◮ Only a Map phase (independent tasks, divisible load scheduling)

slide-5
SLIDE 5

5/ 29

MapReduce locality

Potential data transfer sources:

◮ Sort and Shuffle: data exchange between all processors
  ◮ Depends on the application (size and number of <key,value> pairs)
◮ Map task allocation: when a Map slot is available on a processor:
  ◮ choose a local chunk if any
  ◮ otherwise choose any unprocessed chunk and transfer its data

Replication during the initial data distribution:

◮ To improve data locality and fault tolerance
◮ Optional; basic setting: 3 replicas:
  ◮ first, the chunk is placed on a disk
  ◮ one copy is sent to another disk of the same rack (local communication)
  ◮ one copy is sent to another rack

slide-6
SLIDE 6

6/ 29

Objective of this study

Analyze the data locality of the Map phase:

  • 1. estimate the volume of communication
  • 2. estimate the load imbalance without communication

Using a simple model, to provide good estimates and measure the influence of key parameters:

◮ Replication factor
◮ Number of tasks and processors
◮ Task heterogeneity (to come)

Disclaimer: work in progress. Comments/contributions welcome!

slide-7
SLIDE 7

7/ 29

Outline

Introduction & motivation
Related work
Volume of communication of the Map phase
Load imbalance without communication
Conclusion

slide-8
SLIDE 8

8/ 29

Outline

Introduction & motivation
Related work
Volume of communication of the Map phase
Load imbalance without communication
Conclusion

slide-9
SLIDE 9

9/ 29

Related work 1/2

MapReduce locality:

◮ Improvement of the Shuffle phase
◮ Few studies on the locality of the Map phase (mostly experimental)

Balls-into-bins:

◮ Random allocation of n balls into p bins:
  ◮ For n = p, maximum load of log n / log log n
  ◮ Estimation of the maximum load with high probability for n ≥ p [Raab & Steger 2013]
◮ Choosing the least loaded among r candidates improves a lot:
  ◮ "Power of two choices" [Mitzenmacher 2001]
  ◮ Maximum load n/p + O(log log p) [Berenbrink et al. 2000]
  ◮ Adaptation for weighted balls [Berenbrink et al. 2008]


slide-11
SLIDE 11

10/ 29

Related work 2/2

Work-stealing:

◮ Independent tasks or tasks with precedence constraints
◮ Steal part of a victim's task queue in unit time
◮ Distributed process (steal operations may fail)
◮ Bound on the makespan using a potential function [Tchiboukdjian, Gast & Trystram 2012]

slide-12
SLIDE 12

11/ 29

Outline

Introduction & motivation Related work Volume of communication of the Map phase Load imbalance without communication Conclusion

slide-13
SLIDE 13

12/ 29

Problem statement – MapReduce model

Data distribution:

◮ p processors, each with its own data storage (disk)
◮ n tasks (or chunks)
◮ r copies of each chunk distributed uniformly at random

Allocation strategy:

◮ whenever a processor is idle:
  ◮ allocate a local task if possible
  ◮ otherwise, allocate a random task and copy its data chunk
  ◮ invalidate all other replicas of the chosen chunk

Cost model:

◮ Uniform chunk size (a parameter of MapReduce)
◮ Uniform task durations

Question:

◮ Total volume of communication (in number of chunks)
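The allocation strategy above can be simulated directly. A minimal sketch (the helper name `simulate` and the round-robin idle order, equivalent to idle-driven scheduling under uniform task durations, are assumptions of this sketch):

```python
import random

def simulate(n, p, r, seed=0):
    """Place r replicas of each of n chunks uniformly at random on p disks,
    then run the greedy allocation: an idle processor takes a local chunk if
    possible, otherwise a random remote one (counted as one communication).
    Returns the fraction of non-local tasks."""
    rng = random.Random(seed)
    local = [set() for _ in range(p)]
    for chunk in range(n):
        for proc in rng.sample(range(p), r):   # r distinct disks per chunk
            local[proc].add(chunk)
    unprocessed = set(range(n))
    non_local = 0
    proc = 0
    while unprocessed:
        candidates = local[proc] & unprocessed
        if candidates:
            chunk = rng.choice(sorted(candidates))     # local task
        else:
            chunk = rng.choice(sorted(unprocessed))    # remote task: transfer
            non_local += 1
        unprocessed.discard(chunk)   # invalidates all other replicas
        proc = (proc + 1) % p        # next idle processor
    return non_local / n

print(simulate(n=10_000, p=1000, r=3))
```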


slide-17
SLIDE 17

13/ 29

Simple solution

◮ Consider the system after k chunks have been allocated
◮ A processor i requests a new task
◮ Assumption: the remaining r(n − k) replicas are uniformly distributed
◮ Probability that none of them reach i:

p_k = (1 − 1/p)^(r(n−k)) = e^(−r(n−k)/p) + o(1/p)

◮ Fraction of non-local chunks:

f = (1/n) Σ_k p_k ≈ (p / (rn)) (1 − e^(−rn/p))
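Under the uniformity assumption the estimate is easy to evaluate numerically (a small sketch; `predicted_nonlocal_fraction` is an illustrative name):

```python
import math

def predicted_nonlocal_fraction(n, p, r):
    """Estimated fraction of non-local tasks: f = (p/(r*n)) * (1 - exp(-r*n/p))."""
    return (p / (r * n)) * (1 - math.exp(-r * n / p))

# influence of the replication factor for p = 1000, n = 10,000
for r in range(1, 7):
    print(r, round(predicted_nonlocal_fraction(10_000, 1000, r), 4))
```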

slide-18
SLIDE 18

14/ 29

Simple solution - simulations

[Plot: fraction of non-local tasks vs. replication factor (1 to 6), for p = 1000 processors and n = 10,000 tasks; curves: MapReduce simulations, 1 − f]

◮ Largely underestimates non-local tasks without replication (r = 1)
◮ Moderate accuracy with replication (r > 1)

slide-19
SLIDE 19

15/ 29

Simple solution - questioning the assumption

Remaining chunks without replication (100 processors, 1000 tasks):

[Figure frames: the initial distribution (10 chunks/proc on average) and the state after 200, 400, 600, and 800 steps]

Non-uniform distribution after some time

slide-25
SLIDE 25

16/ 29

Simple solution - questioning the assumption

Remaining chunks with replication r = 3 (100 processors, 1000 tasks):

[Figure frames: the initial distribution (30 chunks/proc on average) and the state after 200, 400, 600, and 800 steps]

Uniform distribution for a large part of the execution?

slide-31
SLIDE 31

17/ 29

Simple solution - questioning the assumption

Assumption: after k steps, the remaining r(n − k) replicas are uniformly distributed

◮ χ² test to check whether the distribution is uniform
◮ Fraction of the execution with a uniform distribution:

[Plot: fraction of the execution with a uniform distribution vs. replication factor (1 to 5)]

◮ For r = 1: non-uniform distribution for most of the execution
◮ For r > 1: uniform distribution in a majority of cases

slide-32
SLIDE 32

18/ 29

Lower bound on communications without replication

◮ Consider n balls placed into p bins (the initial distribution with r = 1)
◮ A processor with k < n/p chunks will have to receive at least n/p − k chunks
◮ It may need more chunks if some of its own chunks are used by other starving processors
◮ Assume that we steal chunks only from overloaded processors
◮ Let N_k be the number of processors with exactly k chunks:

N_k = p × C(n, k) × (1/p)^k × (1 − 1/p)^(n−k) ≈ p e^(−n/p) (n/p)^k / k!   when k ≪ n, p

◮ Then, the communication volume is given by:

V = Σ_{k < n/p} (n/p − k) N_k = p e^(−n/p) (n/p)^(n/p + 1) / (n/p)! ≈ sqrt(np / (2π))
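The exact sum and its asymptotic estimate can be compared numerically (a sketch; `lower_bound_volume` is an illustrative name, using the binomial load distribution rather than its Poisson approximation):

```python
import math

def lower_bound_volume(n, p):
    """Lower bound on the communication volume for r = 1:
    V = sum over k < n/p of (n/p - k) * N_k, where N_k is the expected
    number of processors holding exactly k chunks (binomial distribution)."""
    lam = n / p
    total = 0.0
    for k in range(math.ceil(lam)):
        n_k = p * math.comb(n, k) * (1 / p) ** k * (1 - 1 / p) ** (n - k)
        total += (lam - k) * n_k
    return total

# compare with the asymptotic estimate sqrt(n*p / (2*pi))
print(lower_bound_volume(1000, 100), math.sqrt(1000 * 100 / (2 * math.pi)))
```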


slide-36
SLIDE 36

19/ 29

Lower bound without replication – simulations

[Plot: fraction of non-local tasks vs. number of tasks (5,000 to 50,000); curves: MapReduce with random stealing, MapReduce stealing from the most loaded processor, lower bound]

slide-37
SLIDE 37

20/ 29

Outline

Introduction & motivation
Related work
Volume of communication of the Map phase
Load imbalance without communication
Conclusion

slide-38
SLIDE 38

21/ 29

Estimate load imbalance without communication

◮ Previous section: estimate the communication done by MapReduce to mitigate load imbalance
◮ But load imbalance might be more desirable than large data exchanges
◮ Objective: estimate the makespan without communication

Model:

◮ Same data distribution (n chunks on p processors, r replicas of each chunk)
◮ Allocation mechanism:
  ◮ When a processor is idle, allocate a task on a local chunk (if any)
  ◮ Invalidate the other replicas of the chosen chunk
◮ Uniform or slightly heterogeneous task durations (w_i ≤ (Σ_j w_j / n) · log n), unknown beforehand

slide-39
SLIDE 39

22/ 29

Makespan without replication

◮ Without replication: each chunk is on a single processor
◮ Processor execution time = sum of its chunk sizes
◮ Similar to the maximum load of a bin in balls-into-bins:
  ◮ With identical tasks, when n/polylog(n) ≤ p ≤ n log n:

M ∼ log p / log((p log p) / n)   w.h.p.

  ◮ For other cases, see [Raab & Steger 2013], [Berenbrink 2008]

slide-40
SLIDE 40

23/ 29

Makespan with replication – intuition

We build an analogy between:

◮ Modified MapReduce with replication r
◮ Balls-into-bins with r choices:
  ◮ For each ball, select r bins at random
  ◮ Allocate the ball to the least loaded bin among them

In the following:

◮ Slightly different starting times of the processors: t_i
◮ Initial load of bin i: t_i (same tie-break at time 0)
◮ The same sets of random choices C_i = {i_1, . . . , i_r} are used by both processes

slide-41
SLIDE 41

24/ 29

Makespan with replication – analogy

Modified MapReduce:

◮ For each task i: place a copy of task T_i on the processors with index in C_i = {i_1, . . . , i_r}
◮ When a processor becomes idle: execute the available task with smallest index (if any)

NB: allocation with replication, load balancing at runtime

Balls-into-bins with multiple choices:

◮ For each ball i: place ball i in the least loaded bin with index in C_i = {i_1, . . . , i_r}

NB: load balancing during the allocation

Theorem. The makespan of Modified MapReduce is equal to the maximum load of balls-into-bins with multiple choices.
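The theorem can be checked numerically by simulating both processes side by side (a sketch; the event-driven idle order and tie-breaking by processor index are implementation choices of this sketch, and start times are distinct random floats so ties essentially never occur):

```python
import heapq
import random

def balls_in_bins(t, tasks):
    """Balls-into-bins with multiple choices: ball i goes to the least loaded
    bin among its candidates C_i (ties broken by bin index)."""
    load = list(t)
    for choices, size in tasks:
        k = min(choices, key=lambda b: (load[b], b))
        load[k] += size
    return max(load)

def modified_mapreduce(t, tasks):
    """Modified MapReduce: an idle processor executes the smallest-index
    available task replicated on its disk; the other replicas are invalidated."""
    time = list(t)
    avail = set(range(len(tasks)))
    heap = [(time[k], k) for k in range(len(time))]
    heapq.heapify(heap)
    while heap:
        now, k = heapq.heappop(heap)             # earliest idle processor
        mine = [i for i in avail if k in tasks[i][0]]
        if not mine:
            continue                             # processor k is done
        i = min(mine)                            # smallest-index local task
        avail.discard(i)                         # invalidate the other replicas
        time[k] = now + tasks[i][1]
        heapq.heappush(heap, (time[k], k))
    return max(time)

rng = random.Random(42)
p, n, r = 8, 40, 3
t = [rng.random() * 0.01 for _ in range(p)]                 # distinct start times
tasks = [(rng.sample(range(p), r), 1.0) for _ in range(n)]  # (C_i, size)
print(modified_mapreduce(t, tasks), balls_in_bins(t, tasks))
```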


slide-44
SLIDE 44

25/ 29

Makespan with replication – proof

Lemma. Let proc(i) be the processor executing task i and bin(i) the bin containing ball i; then proc(i) = bin(i).

Proof by induction:

◮ The first ball is put in the bin k ∈ C_1 with the smallest t_k; same for the first task
◮ Consider task/ball i:
  ◮ When T_i starts, only tasks with smaller indices have been processed by the processors of C_i
  ◮ Completion time of such a processor k before starting T_i:

C_k = Σ_{j < i, proc(j) = k} size(j)

  ◮ Ball i is considered after balls 1, . . . , i − 1; the load of bin k at that time is:

L_k = Σ_{j < i, bin(j) = k} size(j)

  ◮ Ball i is put in the bin k ∈ C_i with the smallest L_k
  ◮ By induction, C_k = L_k

slide-45
SLIDE 45

26/ 29

Makespan with replication – results

The maximum load using multiple choices (r ≥ 2) is at most

n/p + log log n / log r + Θ(1)   w.h.p.   [Berenbrink et al. 2000]

Simulations with 200 processors and 400 (identical) tasks:

[Plot: makespan vs. replication factor (1 to 5); curves: MapReduce simulations, balls-into-bins formula]
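The multiple-choice maximum load behind this formula is easy to reproduce in simulation (a sketch with the slide's parameters; `max_load_r_choices` is an illustrative name):

```python
import math
import random

def max_load_r_choices(n, p, r, seed=0):
    """Greedy multiple-choice allocation: each of n unit balls goes to the
    least loaded of r random bins; returns the maximum bin load."""
    rng = random.Random(seed)
    load = [0] * p
    for _ in range(n):
        k = min(rng.sample(range(p), r), key=load.__getitem__)
        load[k] += 1
    return max(load)

n, p = 400, 200
for r in range(2, 6):
    bound = n / p + math.log(math.log(n)) / math.log(r)
    print(r, max_load_r_choices(n, p, r), round(bound, 2))
```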

slide-46
SLIDE 46

27/ 29

Outline

Introduction & motivation
Related work
Volume of communication of the Map phase
Load imbalance without communication
Conclusion

slide-47
SLIDE 47

28/ 29

Conclusion

◮ Data locality analysis of the Map phase of MapReduce
◮ Task allocation mechanism with initial data placement: very simple and general
◮ Volume of communication:
  ◮ Simple formula, accurate for r ≥ 2 (formal proof missing)
  ◮ Lower bound for r = 1 = exact volume for a variant of MapReduce (steal from the most loaded)
◮ Load imbalance without communication:
  ◮ Makespan = maximum load for multiple-choice balls-into-bins
◮ Key parameter: replication (both for communication and makespan)
◮ Analogy: replication vs. "power of two choices" for balls-into-bins
◮ NB: cost of replication: large communication volume prior to the computation (best-effort, possibly for many computations)

slide-48
SLIDE 48

29/ 29

Perspectives

Extensions: better estimate the communication volume with replication:

◮ Use the analogy with balls-into-bins with r choices (at most 2p holes [Berenbrink et al. 2000])?
◮ Use a potential function (cf. [Tchiboukdjian et al. 2012])?
◮ Heterogeneous task durations

Long-term perspectives:

◮ More complex data dependences (2D, tasks sharing files)