SLIDE 1

Network Calculus for Parallel Processing George Kesidis

The Pennsylvania State University kesidis@gmail.com

Dagstuhl Seminar on Network Calculus

March 8-11, 2015, at Schloss Dagstuhl

March 9, 2015 George Kesidis 1

SLIDE 2

Outline of the talk

  • Introduction
  • Review of two results from the 1980s for Markovian models
    – A two-server Markovian system: two M/M/1 queues with coupled arrivals
    – A multi-server Markovian system
  • Single-stage, fork-join system
  • Network calculus applications (in collaboration with Y. Shan, B. Urgaonkar & Jorg)
    – Simple deterministic result
    – Stationary analysis via gSBB
    – Numerical example using Facebook data
  • Discussion
    – Load balancing in a single processing stage
    – Workload transformation for tandem processing stages
    – Dynamic scheduling
    – Applications with feedback, e.g., distributed simulation
  • References

SLIDE 3

Parallel processing systems - overview

  • Decades of study on concurrent programming and parallel processing (including cluster computing), often in highly application-specific settings.
  • Challenges include:
    – resource allocation and load balancing to reduce delays at barrier (synchronization, join) points,
    – redundancy for robustness/protection, and
    – maintaining consistent shared memory/state across processors while minimizing communication overhead,
    – especially when dealing with feedback in the application itself.
  • Techniques may be proactive or reactive/dynamic in nature.
  • Today, popular platforms use Virtual Machines (VMs) mounted on multi-processor servers of a single data-center, or a group of data-centers forming a cloud.

SLIDE 4

Feed-forward parallel-processing systems

  • A certain family of jobs is best served by a particular arrangement of VMs/processors for parallel execution.
  • In the following, we consider jobs that lend themselves to feed-forward parallel-processing systems, e.g., many search/data-mining applications.
  • In a single parallel-processing stage, a job is partitioned into tasks (i.e., the job is “forked” or the tasks are demultiplexed); the tasks are then worked upon in parallel by different processors.
  • Within parallel-processing systems, there are often processing barriers (points of synchronization, or “joins”) at which some or all component tasks of a job need to be completed before the next stage of processing of the job can commence.
  • The terminus of the entire parallel-processing system is typically a barrier.
  • Thus, the latency of a stage (between barriers, or between the exogenous job arrivals and the first barrier) is the greatest latency among the processing paths through it.

SLIDE 5

MapReduce

  • Google’s MapReduce template for parallel processing with VMs (especially its open-source implementation, Apache Hadoop) is a very popular such framework for handling a sequence of search tasks.
  • MapReduce is a multi-stage parallel-processing framework where each processor is a VM (again, mounted on a server of a data-center).
  • In MapReduce, jobs arrive and are partitioned into tasks.
  • Each task is then assigned to a mapper VM for initial processing (first stage).
  • The results of the mappers are transmitted (shuffled), pipelined with the mappers’ operation, to reducers (second stage).
  • Reducer VMs combine the mapper results they have received and perform additional processing.
  • A barrier exists before each reducer (after its mapper-shuffler stage) and after all the reducers (after the reducer stage).

SLIDE 6

Simple MapReduce example of a word-search application

  • Two mappers that search and one reducer that combines their results.
  • Document corpus to be searched is divided between the mappers.

SLIDE 7

Single-stage, fork-join systems - a Markovian analysis

  • Jobs sequentially arrive to a parallel processing system of K identical servers.
  • The ith job arrives at time ti and spawns (forks) K tasks.
  • Let xj,i be the service-duration of the task assigned to server j by job i.
  • The tasks assigned to a server are queued in FIFO fashion.
  • The sojourn (or response) time D_{j,i} − t_i of the ith task of server j is the sum of its service time x_{j,i} and its queueing delay:

    D_{j,i} = x_{j,i} + max{D_{j,i−1}, t_i},  ∀ i ≥ 1, 1 ≤ j ≤ K,  with D_{j,0} = 0.

  • The response time of the ith job is

    max_{1≤j≤K} D_{j,i} − t_i.
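The recursion above can be sketched directly in a few lines (a minimal illustration; the function name and list-based bookkeeping are mine, and the system is assumed to start empty, so D_{j,0} = 0):

```python
def fork_join_response_times(arrivals, service, K):
    """Simulate the fork-join recursion D_{j,i} = x_{j,i} + max(D_{j,i-1}, t_i).

    arrivals:   nondecreasing job arrival times t_i.
    service:    service[i][j] = x_{j,i}, service time of job i's task at server j.
    Returns the per-job response times max_j D_{j,i} - t_i.
    """
    D = [0.0] * K  # D_{j,0} = 0: all servers start empty
    responses = []
    for t, x in zip(arrivals, service):
        # each server's task departs after its service time, which begins
        # once the server's previous task departs or the job arrives
        D = [x[j] + max(D[j], t) for j in range(K)]
        responses.append(max(D) - t)  # job done when its last task departs
    return responses
```

For instance, a job arriving at t = 1 behind a job whose slower task departs at t = 3 only starts that task at t = 3.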

SLIDE 8

Two-server (K = 2) system

  • Suppose that jobs arrive according to a Poisson process with intensity λ > 0, i.e.,

    t_i − t_{i−1} ∼ exp(λ),  so that E(t_i − t_{i−1}) = λ^{−1}.

  • Also, assume that the task service-times x_{j,i} are mutually independent and exponentially distributed:

    x_{1,i} ∼ exp(α)  and  x_{2,i} ∼ exp(β),  ∀ i ≥ 1.

  • Let Q_j(t) be the number of tasks in server j at time t.
  • (Q_1, Q_2) is a continuous-time Markov chain.

SLIDE 9

Transition rates of (Q1, Q2) with m, n ≥ 0

SLIDE 10

Stationary distribution of (Q1, Q2)

  • Assume that the system is stable, i.e., λ < min{α, β}.
  • For the Markov process (Q_1, Q_2) in steady state, let the stationary probabilities be

    p_{m,n} = P((Q_1, Q_2) = (m, n)).

  • The balance equations are

    (λ + α 1{m > 0} + β 1{n > 0}) p_{m,n} = λ 1{m > 0, n > 0} p_{m−1,n−1} + α p_{m+1,n} + β p_{m,n+1},  ∀ m, n ∈ Z≥0,

    where Σ_{m=0}^∞ Σ_{n=0}^∞ p_{m,n} = 1.

SLIDE 11

Stationary distribution of (Q1, Q2) (cont)

  • The balance equations can be solved via the two-dimensional moment generating function (Z-transform) [Flatto & Hahn 1984]:

    P(z, w) = Σ_{m=0}^∞ Σ_{n=0}^∞ p_{m,n} z^m w^n,  z, w ∈ C.

  • Multiplying the balance equations by z^m w^n and summing over m, n gives P(z, w) in terms of the boundary values P(z, 0) and P(0, w).
  • In the load-balanced case where α = β, with ρ := λ/α < 1 [eq. (6.5) of FH’84],

    P(z, 0) = (1 − ρ)^{3/2} / √(1 − ρz).

  • From this, we can find the first two moments of p_{m,0}:

    Σ_{m=0}^∞ m p_{m,0} = (d/dz) P(z, 0) |_{z=1} = ρ/2,

    Σ_{m=0}^∞ m² p_{m,0} = (d/dz) z (d/dz) P(z, 0) |_{z=1} = ρ/2 + (3/4) · ρ²/(1 − ρ).
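These two moments can be corroborated numerically from the binomial series (1 − ρz)^{−1/2} = Σ_m C(2m, m) (ρz/4)^m, whose coefficients give the boundary probabilities p_{m,0} directly (a verification sketch; function names are mine):

```python
def series_moments(rho, n_terms=4000):
    """First two moments of p_{m,0} from the Taylor coefficients of
    P(z,0) = (1-rho)^{3/2} (1-rho z)^{-1/2}.  The coefficient c_m of z^m
    in (1-rho z)^{-1/2} obeys c_m = c_{m-1} * rho * (2m-1)/(2m), c_0 = 1.
    """
    scale = (1.0 - rho) ** 1.5
    m1 = m2 = 0.0
    coef = 1.0
    for m in range(1, n_terms):
        coef *= rho * (2 * m - 1) / (2 * m)
        p = scale * coef          # p_{m,0}
        m1 += m * p
        m2 += m * m * p
    return m1, m2

# closed forms from the slide: rho/2  and  rho/2 + (3/4) rho^2/(1 - rho)
```

At ρ = 0.5, the partial sums converge to ρ/2 = 0.25 and ρ/2 + (3/4)ρ²/(1 − ρ) = 0.625.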

SLIDE 12

Job sojourn times

  • Recall that a job is completed (departs the system) only when all of its tasks have been served.
  • Some jobs have arrived with none of their tasks completed, while others have had only one task completed.
  • So, in the two-server (K = 2) case, |Q_1 − Q_2| is the number of jobs queued in the system with exactly one task completed.
  • Let q_k = P(Q_1 − Q_2 = k) in steady state, for k ∈ Z.
  • Note that ∀ k ≥ 0,

    q_k = Σ_{m=k}^∞ p_{m,m−k}.

SLIDE 13

Job sojourn times in the load-balanced case

  • Summing the balance equations for (Q_1, Q_2) from m = k ≥ 0 with n = m − k gives

    (λ + α + β) q_k − β p_{k,0} = λ q_k + α q_{k+1} + β q_{k−1} − β p_{k−1,0}
    ⇒ α(q_{k+1} − q_k) − β(q_k − q_{k−1}) = −β p_{k,0} + β p_{k−1,0}.

  • In the symmetric case (i.e., the servers are load-balanced) where α = β > λ, this implies

    q_{k+1} − q_k = −p_{k,0},  ∀ k ≥ 0,

    where q_k = q_{−k}, ∀ k ∈ Z.

  • Thus,

    q_k = Σ_{m=k}^∞ p_{m,0},  ∀ k ≥ 0.

SLIDE 14

Job sojourn times in the load-balanced case (cont)

  • Consider jobs with no tasks completed, and completed tasks whose sibling tasks are not yet completed, in the load-balanced (α = β) case.
  • By Little’s theorem, the mean sojourn time of a job is:

    EQ_1/λ + E|Q_1 − Q_2|/(2λ)
    = 1/(α − λ) + (1/λ) Σ_{k=1}^∞ k q_k
    = 1/(α − λ) + (1/λ) Σ_{k=1}^∞ k Σ_{m=k}^∞ p_{m,0}
    = 1/(α − λ) + (1/λ) Σ_{m=1}^∞ p_{m,0} Σ_{k=1}^m k
    = 1/(α − λ) + (1/λ) Σ_{m=1}^∞ p_{m,0} (m² + m)/2
    = 1/(α − λ) + ρ/(4λ) + (3/(8λ)) · ρ²/(1 − ρ) + ρ/(4λ),

    where (α − λ)/λ = (1 − ρ)/ρ, and we have used the first two moments of p_{m,0} computed above.

SLIDE 15

Job sojourn times in the load-balanced case - main result

  • So, the mean sojourn time of a job in the load-balanced (α = β) case is:

    EQ_1/λ + E|Q_1 − Q_2|/(2λ) = (1/(α − λ)) (3/2 − ρ/8),

    where 1/(α − λ) is just the mean sojourn time in a stationary M/M/1 queue.

  • Note that the delay factor above M/M/1 satisfies:

    11/8 ≤ 3/2 − ρ/8 ≤ 3/2.
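This closed form is easy to corroborate by simulating the task-sojourn recursion from earlier (a sketch; the sample size, seed, and tolerance below are arbitrary choices of mine):

```python
import random

def mean_sojourn_k2(lam, alpha, n_jobs=200_000, seed=1):
    """Monte Carlo mean job sojourn time for the K=2 fork-join system:
    Poisson(lam) job arrivals, i.i.d. exp(alpha) task service at each server."""
    rng = random.Random(seed)
    t = D1 = D2 = 0.0
    total = 0.0
    for _ in range(n_jobs):
        t += rng.expovariate(lam)                  # next job arrival time
        D1 = rng.expovariate(alpha) + max(D1, t)   # task departure, server 1
        D2 = rng.expovariate(alpha) + max(D2, t)   # task departure, server 2
        total += max(D1, D2) - t                   # job sojourn time
    return total / n_jobs

lam, alpha = 0.5, 1.0
rho = lam / alpha
predicted = (1.5 - rho / 8) / (alpha - lam)  # (3/2 - rho/8)/(alpha - lam)
```

At ρ = 0.5 the prediction is 2.875, and the long-run simulation average should agree to within a few percent.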

SLIDE 16

Bounds for K > 2 servers - Associated RVs

  • Again, consider the load-balanced (i.i.d. exp(α) task service times) and stable (λ < α) case.
  • To obtain an upper bound, it was argued in [Nelson and Tantawi 1988] that, for each job i, its task sojourn times {S_{j,i} := D_{j,i} − t_i}_{j=1}^K form an “associated” group of random variables.
  • Applying any monotonic function g to each member of a group of associated random variables {X_j} yields a group of random variables {g(X_j)} with (pairwise) non-negative covariance, cov(g(X_j), g(X_l)) ≥ 0.
  • The following useful maximal inequality follows: ∀ x > 0,

    P(max_{1≤j≤K} S_{j,i} > x) ≤ 1 − Π_{j=1}^K P(S_{j,i} ≤ x),

    i.e., the Bernoulli random variables 1{S_{j,i} ≤ x} (each a monotonically decreasing function of S_{j,i}) have non-negative covariance, since

    P(max_{1≤j≤K} S_{j,i} > x) = 1 − P(max_{1≤j≤K} S_{j,i} ≤ x).

SLIDE 17

Bounds for K > 2 servers (cont)

  • The stationary sojourn time S(K) of a job has distribution satisfying, ∀ x > 0:

    P(S(K) > x) = lim_{i→∞} P(max_{1≤j≤K} S_{j,i} > x) ≤ 1 − Π_{j=1}^K lim_{i→∞} P(S_{j,i} ≤ x),

    where each limit is the stationary sojourn-time distribution of an M/M/1 queue.

  • Using PASTA and conditioning on the number of jobs in a stationary M/M/1 queue (∼ geom(ρ)), one can show that the stationary sojourn time of a task at each server ∼ exp(α − λ), so that

    P(S(K) > x) ≤ 1 − (1 − e^{−(α−λ)x})^K.

  • Thus, using ES(K) = ∫_0^∞ P(S(K) > x) dx, one obtains

    ES(K) ≤ ∫_0^∞ (1 − (1 − e^{−(α−λ)x})^K) dx = H_K/(α − λ),

    where H_K := Σ_{k=1}^K 1/k is the Kth harmonic number.

SLIDE 18

Bounds for K > 2 servers - main result

  • From the previous display (for the load-balanced case α = β), the mean sojourn time satisfies ES(K) ≤ H_K/(α − λ), where H_K := Σ_{k=1}^K 1/k is the Kth harmonic number.
  • Since H_K = O(log K), this gives ES(K) = O(log K).
  • Ignoring queueing delays (a job then waits only for the longest of its K i.i.d. exp(α) task service times), we get a simple lower bound

    ES(K) ≥ H_K/α,

    giving some measure of tightness to the previous upper bound.
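The identity ∫_0^∞ (1 − (1 − e^{−cx})^K) dx = H_K/c underlying both bounds (it is the mean of the maximum of K i.i.d. exp(c) variables) can be checked numerically (a sketch; the step size and integration cutoff are arbitrary):

```python
import math

def harmonic(K):
    """H_K = sum_{k=1}^K 1/k."""
    return sum(1.0 / k for k in range(1, K + 1))

def max_of_exponentials_mean(c, K, dx=1e-4, x_max=60.0):
    """Midpoint-rule integral of P(max of K i.i.d. exp(c) > x)
    = 1 - (1 - e^{-c x})^K over [0, x_max]."""
    s = 0.0
    for i in range(int(x_max / dx)):
        x = (i + 0.5) * dx
        s += (1.0 - (1.0 - math.exp(-c * x)) ** K) * dx
    return s
```

For example, with c = α − λ = 0.5 and K = 4, the integral matches H_4/0.5 ≈ 4.1667.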

SLIDE 19

Single-stage, fork-join systems - a deterministic analysis

  • Consider a bank of K parallel queues, where queue/processor k is provisioned with service capacity s_k.
  • Here, let A be the (fluid, defined for t ≥ 0) cumulative input process of work, divided among the queues so that the kth queue has arrivals a_k and departures d_k with

    A(t) = Σ_k a_k(t),  ∀ t ≥ 0.

  • Define the virtual delay process of queue k for hypothetical departures at time t ≥ 0 as

    δ_k(t) = t − a_k^{−1}(d_k(t)),

    where the inverses a_k^{−1} of the non-decreasing functions a_k are taken continuous from the left, so that a_k(a_k^{−1}(v)) ≡ a_k^{−1}(a_k(v)) ≡ v.

  • The following definition of the cumulative departures D is such that the output ready for processing in the subsequent (reducer) stage is determined by the most “lagging” queue/processor: ∀ t ≥ 0,

    D(t) = A(t − max_k δ_k(t)) = A(min_k a_k^{−1}(d_k(t))).
SLIDE 20
SLIDE 21

Delay bound under service and input-burstiness curves

  • The convolution (⊗) of two non-decreasing functions f and g, with f(t) = g(t) = 0 for t ≤ 0, is

    (f ⊗ g)(t) = inf_{0≤τ≤t} { f(τ) + g(t − τ) }.

  • Define a delay function ∆_v for any v ≥ 0 as

    ∆_v(t) = 0 if t ≤ v,  +∞ if t > v.

  • So, for any function f, constant v ≥ 0, and time t,

    f(t − v) = (f ⊗ ∆_v)(t).

  • For a queue with cumulative arrival and departure functions a(t) and c(t), respectively, the queue has lower service curve s_min if, for all times t and arrivals a,

    c(t) ≥ (s_min ⊗ a)(t).

  • A lower service curve is a non-decreasing function that describes a service guarantee of the queue.
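A discrete-time sketch of ⊗ and ∆_v (sampling at integer times; the function names are mine) exhibits the shift property f(t − v) = (f ⊗ ∆_v)(t) for non-decreasing f with f(0) = 0:

```python
INF = float("inf")

def conv(f, g):
    """Min-plus convolution on samples t = 0..n-1:
    (f ⊗ g)[t] = min over 0 <= s <= t of f[s] + g[t-s]."""
    n = min(len(f), len(g))
    return [min(f[s] + g[t - s] for s in range(t + 1)) for t in range(n)]

def delay(v, n):
    """Sampled delay function Δ_v: 0 for t <= v, +inf for t > v."""
    return [0.0 if t <= v else INF for t in range(n)]

f = [float(t) for t in range(6)]  # f(t) = t, non-decreasing, f(0) = 0
# (f ⊗ Δ_2)(t) = f(t - 2), with f understood to be 0 for negative arguments
```

Convolving with ∆_2 shifts the samples of f right by two steps, padding with zeros.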

SLIDE 22

Delay bound under service and input-burstiness curves (cont)

  • We assume that the arrivals to queue k are bounded by a burstiness curve (traffic envelope) b_{in,k} in the sense that, for all t ≥ 0,

    a_k(t) ≤ (a_k ⊗ b_{in,k})(t),

    i.e., b_{in,k}(x) is an upper bound on the arrivals to queue k in any time interval of length x.

  • If a queue with lower service curve s_{min,k} has arrivals with burstiness curve b_{in,k}, an upper bound on the delay is given by

    d_{max,k} = min{ z ≥ 0 : ∀ x ≥ 0, s_{min,k}(x) ≥ (b_{in,k} ⊗ ∆_z)(x) }.   (1)

  • Here, d_{max,k} is the largest horizontal distance between b_{in,k} and s_{min,k}.
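For instance, with a token-bucket envelope b_in(x) = σ + ρx and a rate-latency service curve s_min(x) = R·(x − T)^+ with ρ ≤ R, the horizontal distance in (1) is T + σ/R; a brute-force grid search agrees (a sketch; the curves, grids, and parameter values are illustrative choices of mine):

```python
def horizontal_deviation(b, s, xs, zs):
    """For each x on the grid, find the smallest z with s(x+z) >= b(x);
    return the largest such z (the horizontal gap between b and s)."""
    worst = 0.0
    for x in xs:
        z = next((z for z in zs if s(x + z) >= b(x)), None)
        if z is not None:
            worst = max(worst, z)
    return worst

SIGMA, RHO, R, T = 2.0, 1.0, 2.0, 1.0  # illustrative parameters

def b_in(x):
    """Token-bucket envelope: sigma + rho*x."""
    return SIGMA + RHO * x

def s_min(x):
    """Rate-latency service curve: R*(x - T)^+."""
    return R * max(x - T, 0.0)

xs = [0.05 * i for i in range(200)]  # interval lengths x in [0, 10)
zs = [0.01 * i for i in range(401)]  # candidate delays z in [0, 4]
# analytic horizontal deviation: T + SIGMA/R = 2.0
```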

SLIDE 23

Simple deterministic delay-bound claim

  • Claim: A lower service curve of the fork-join system is given by

    s_min(t) = ∆_{max_k d_{max,k}}(t).

  • Remark: The claim simply states that the maximum delay of the whole system is the maximum delay among its queues.

SLIDE 24

Proof of deterministic delay-bound claim

  • By hypothesis, ∀ t ≥ v ≥ 0 and ∀ k,

    s_{min,k}(t − v) ≥ (b_{in,k} ⊗ ∆_{d_{max,k}})(t − v) = b_{in,k}(t − v − d_{max,k}) ≥ a_k(t − d_{max,k}) − a_k(v).

  • Thus, ∀ t ≥ v ≥ 0 and ∀ k,

    a_k(v) + s_{min,k}(t − v) ≥ a_k(t − d_{max,k})
    ⇒ (a_k ⊗ s_{min,k})(t) ≥ a_k(t − d_{max,k})
    ⇒ a_k^{−1}((a_k ⊗ s_{min,k})(t)) ≥ t − d_{max,k},

    where we have used the fact that each a_k is non-decreasing.

  • Thus,

    D(t) = A(min_k a_k^{−1}(d_k(t)))
    ≥ A(min_k a_k^{−1}((a_k ⊗ s_{min,k})(t)))
    ≥ A(min_k { t − d_{max,k} })
    = A(t − max_k d_{max,k}) = (A ⊗ ∆_{max_k d_{max,k}})(t),

    where we have used the fact that A is non-decreasing.

SLIDE 25

Single-stage, fork-join systems - a stationary analysis

  • Claim: In the stationary regime at t ≥ 0, if

    A1: the service to queue k satisfies s_k ≥ s_{min,k}, where s_{min,k}(v) := vμ_k, ∀ v ≥ 0;

    A2: the demux/mapper divides arriving work roughly in proportion to the minimum allocated service rates μ_k (strong load balancing), i.e., ∀ k, ∃ small ε_k > 0 such that ∀ v ≤ t,

    | a_k(t) − a_k(v) − (μ_k/M)(A(t) − A(v)) | ≤ ε_k  a.s.,

    where M := Σ_k μ_k;

    A3: the total arrivals have generalized (strong) stochastically bounded burstiness (gSBB):

    P( max_{v≤t} A(t) − A(v) − M(t − v) ≥ x ) ≤ Φ(x),

    where Φ is decreasing in x > 0;

    then, ∀ x > 2M max_k ε_k/μ_k,

    P(A(t) − D(t) ≥ x) ≤ Φ(x − 2M max_k ε_k/μ_k).
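Assumption A3 can be checked empirically on a trace: max_{v≤t} A(t) − A(v) − M(t − v) is just the backlog of a virtual queue of rate M fed by A, computable by a Lindley recursion (a sketch over unit time steps; the function names are mine):

```python
def backlog_path(increments, M):
    """B(t) = max_{v<=t} A(t) - A(v) - M*(t-v) over unit time steps,
    via the Lindley recursion B(t) = max(0, B(t-1) + a(t) - M),
    where a(t) is the work arriving in step t and B starts at 0."""
    B, path = 0.0, []
    for a in increments:
        B = max(0.0, B + a - M)
        path.append(B)
    return path

def empirical_tail(path, x):
    """Empirical estimate of P(backlog >= x) along the sample path,
    usable as a data-driven candidate for the gSBB bound Phi(x)."""
    return sum(1 for b in path if b >= x) / len(path)
```

For instance, alternating step arrivals of 3 and 0 against service rate M = 2 give a backlog path alternating between 1 and 0.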

SLIDE 26

Single-stage, fork-join systems - a stationary analysis (figure)
SLIDE 27

A stationary analysis - proof of claim

    P(A(t) − D(t) ≥ x)
    = P(A(t) − A(min_k a_k^{−1}(d_k(t))) ≥ x)
    = P(min_k a_k^{−1}(d_k(t)) ≤ A^{−1}(A(t) − x) =: t − z)
    = P(∃ k s.t. d_k(t) ≤ a_k(t − z))
    = P(∃ k s.t. a_k(t) − d_k(t) ≥ a_k(t) − a_k(t − z) =: x_k)
    ≤ P(∃ k s.t. max_{v≤t} a_k(t) − a_k(v) − (t − v)μ_k ≥ x_k),

  • where we have used the fact that A and the a_k are non-decreasing (cumulative arrivals), and the inequality is by assumption A1.
  • Also, we have defined non-negative random variables z and x_k such that

    Σ_k x_k = x = A(t) − A(t − z).

SLIDE 28

A stationary analysis - proof of claim (cont)

So, by using A2 and then A3, we get

    P(A(t) − D(t) ≥ x)
    ≤ P(∃ k s.t. max_{v≤t} (μ_k/M)(A(t) − A(v)) + ε_k − (t − v)μ_k ≥ (μ_k/M)x − ε_k)
    = P(∃ k s.t. max_{v≤t} (A(t) − A(v)) − (t − v)M ≥ x − 2(M/μ_k)ε_k)
    = P(max_{v≤t} (A(t) − A(v)) − (t − v)M ≥ x − 2M max_k ε_k/μ_k)
    ≤ Φ(x − 2M max_k ε_k/μ_k).

SLIDE 29

Numerical example based on a Facebook dataset

  • Figure 3 of [Chen et al. 2011] depicts a week-long trace of the total number of jobs arriving to a MapReduce system operated by Facebook.
  • Clearly, the job rate exhibits “time-of-day” periodicity in its mean and variance.
  • It can be simply modeled as a bounded AR(1) (two-parameter autoregressive) process with (deterministically) time-varying parameters.
  • A day-long trace of the individual-job data from which Figure 3 of [Chen et al. 2011] was partially derived is publicly available.
  • From this dataset, we depict the aggregate job arrival rate, in ten-minute intervals (i.e., 144 time samples), in the following figure.
  • Here, we see that the data is roughly stationary.
  • Rather than interpolating, we took the workload as zero during what was likely an hour-long observational outage starting at hour 14.
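A bounded AR(1) job-rate model of the kind mentioned above can be sketched as follows (the clipping to [lo, hi], Gaussian innovations, and all parameter values are my assumptions; the deck only names the model class):

```python
import random

def bounded_ar1(n, phi, mean, sigma, lo, hi, seed=0):
    """Sample path of a bounded AR(1) job-rate model:
    r_t = mean + phi*(r_{t-1} - mean) + N(0, sigma^2), clipped to [lo, hi]."""
    rng = random.Random(seed)
    r, path = mean, []
    for _ in range(n):
        r = mean + phi * (r - mean) + rng.gauss(0.0, sigma)
        r = min(hi, max(lo, r))  # keep the rate within physical bounds
        path.append(r)
    return path

# e.g., 144 ten-minute samples covering one day, as in the figure below
```

Time-of-day periodicity would be captured by letting `mean` and `sigma` vary deterministically with t.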

SLIDE 30

Aggregate job arrival rate

(figure: number of jobs arriving per ten-minute interval vs. hours)

SLIDE 31

Facebook job types

  • Moreover, Table 1 of [Chen et al. 2011] identifies ten different Facebook job types (i.e., ten rows), found through clustering based on the features (columns) given in that table.
  • In column 1, the number n_j of observed jobs of type j is given.
  • Also, the mean number of “task-seconds” per type-j job for the mapper stage, w_j, is given in the “Map time” column (we divided “Map time” by 600 s, consistent with the ten-minute sampling of the aggregate number of jobs in the above figure).
  • With this information, we can develop an aggregate workload model for the mapper stage, A, assuming that at each point in time the types of arriving jobs are distributed as in column 1 of Table 1 of [Chen et al. 2011].
  • Timing information associated with the workloads, including the total execution durations of individual jobs, is not given in the publicly available datasets.
  • Execution times are provided for individual jobs of CMU’s OpenCloud Hadoop cluster (so the previous assumption would not be necessary were we to model that dataset).

SLIDE 32

Cumulative mapper workload A - typical generated trace

(figure: cumulative arrivals and service curve; workload vs. hours)

SLIDE 33

Queue process Q for A with service rate M = 600 (line’s slope)

(figure: queue backlog vs. hours)

SLIDE 34

gSBB bound Φ at service rate M = 600

(figure: mean estimated probability Φ(x) vs. x, with confidence bars)

  • Recall assumption A3 of the previous claim.
  • We used the day-long raw trace given above and multiple samplings of the average “Map time” data of Table 1 of [Chen et al. 2011].
  • The vertical lines are 95% confidence bars based on 30 independent trials.

SLIDE 35

Discussion - load balancing in a single processing stage

  • Typically, the amount of parallelism allocated to a job at a stage is based on the size of the job’s input data-set for that stage, as that information is readily available operationally online.
  • The execution time of the component tasks will, of course, also greatly depend on other factors such as algorithmic/computational complexity.
  • E.g., in rows 4 and 5 of Table 1 of [Chen et al. 2011], two Facebook job types have about the same mean input data size (≈ 400 KB in the Input column) but significantly different mean Map times (one is roughly double the other).
  • This said, it is likely that the same algorithm will be applied to all tasks of a given job, so that effective load balancing from job to tasks typically may be achieved, i.e., when μ_k = μ_l ∀ k, l in the previous claim (which otherwise allows for processors of different capacities μ, as considered in [Ghodsi et al. 2011]).

SLIDE 36

Discussion - tandem processing stages

  • In MapReduce applications for search, the first (mapper) stage performs the search on input files, with:
  • the workload partitioned among the mapper’s parallel processors to allow pipelined operation with the shuffler (communication with the following reducer stage), and
  • the reducer stage combining the results of the mapper.
  • The incident workloads of the two stages may be very different (again, see Table 1 of [Chen et al. 2011]).
  • Such tandem parallel-processing stages can be simply modeled by suitably selecting the available service capacities μ at each stage and by suitably transforming the workload between stages.
  • For example, one can compute the gSBB bound Φ_2 for the incident workload of the reducer (second) stage, just as Φ_1 for the mapper (first) stage was computed above using the data from Figure 3 and Table 1 of [Chen et al. 2011], the latter also containing reducer-workload information.
  • That is, the departing workflow of the first (mapper) stage is transformed so as to have a different gSBB bound for the next (reducer) stage.

SLIDE 37

Workload transformation for tandem processing stages

  • More precisely, suppose J_i is the counting process of jobs incident to the ith stage, and that the kth task of the jth job at that stage has workload w_{i,j,k}.
  • Thus, the cumulative workload to the ith stage is simply

    A_i(t) = Σ_{j ≤ J_i(t)} Σ_k w_{i,j,k} = Σ_k a_{i,k}(t),

    where the workload to the kth processor of the ith stage is

    a_{i,k}(t) = Σ_{j ≤ J_i(t)} w_{i,j,k}.
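In code, evaluating A_i(t) is a job-count lookup followed by a prefix sum (a sketch; the data layout and function name are my choices):

```python
from bisect import bisect_right

def cumulative_workload(arrival_times, task_work, t):
    """A_i(t): total work of all stage-i jobs arrived by time t.
    arrival_times: sorted job arrival times of this stage.
    task_work[j]:  list of per-task workloads (w_{i,j,k})_k of job j."""
    m = bisect_right(arrival_times, t)  # J_i(t): number of jobs arrived by t
    return sum(sum(w) for w in task_work[:m])
```

Summing only the kth entries of `task_work` instead would give the per-processor process a_{i,k}(t).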

SLIDE 38

Workload transformation between stages (cont)

  • To use the workload data of Table 1 of [Chen et al. 2011] for the reducer stage, we need the counting process J_2 of jobs incident to the reducer stage (J_1 for the mapper stage was given in their Figure 3).
  • We can obtain it by considering the cumulative work done (“departures”) D_1 of the first stage, an object central to our previous claims.
  • The arrival time of the mth job to the second stage is

    J_2^{−1}(m) := D_1^{−1}( Σ_{j≤m} Σ_k w_{1,j,k} ).

  • Note that the service rates M, μ of the first stage will affect J_2, and hence A_2.
  • Given such gSBB-bound transformations between stages, the results of our previous claims can be generalized to obtain end-to-end results across tandem processing stages,
  • including for in-tree networks and certain more complex networks with multiclass workflows or feedback [C.-S. Chang 1999].
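The transformation J_2^{−1}(m) = D_1^{−1}(Σ_{j≤m} Σ_k w_{1,j,k}) can be sketched on a sampled departure curve, using the same continuous-from-the-left inverse convention as earlier (names and the sampling approximation are mine):

```python
def stage2_arrival_times(D1_times, D1_values, job_work):
    """For each m, the first sampled time at which stage-1 departures D_1
    reach the cumulative mapper work of jobs 1..m: an approximation of
    J_2^{-1}(m) = D_1^{-1}(sum_{j<=m} sum_k w_{1,j,k}).
    D1_times/D1_values: nondecreasing samples of the departure curve D_1.
    job_work[j]: total mapper work of job j."""
    out, cum = [], 0.0
    for w in job_work:
        cum += w
        # left-continuous inverse: earliest sample with D_1 >= cum
        t = next(t for t, d in zip(D1_times, D1_values) if d >= cum)
        out.append(t)
    return out
```

For example, with D_1 increasing at rate 2 per unit time, jobs of mapper work 2, 2, 3 reach the second stage at times 1, 2, and 4 (the last waiting for D_1 to reach 7).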

SLIDE 39

Job/task specialization for performance improvements

  • It is difficult to optimally set up the processor topology and provision it for a job (or job family) completely proactively.
  • Resource allocation is often modified within the existing MapReduce/Hadoop template, with customizations for specific job/task types.
  • These changes can be proactive or reactive/dynamic, to deal with performance degradation due to:
    – excessive stragglers (overdue tasks causing delays at barrier points), prompting cancel-and-relaunch or just relaunch of tasks (thus increasing the associated workload),
    – excessive communication overhead, including that needed to maintain consistent shared memory/state among processes of different stages,
    – faults.
  • One can use redundant mapper/reducer functionality, e.g., the same dataset assigned to multiple mappers.

SLIDE 40

Job/task specialization for performance improvements (cont)

  • Currently, under MapReduce, such redundancy is applied in “uniform” fashion.
  • Alternatively, redundancy could be based on recognition of hotspots or congestion points.
  • For example, more mappers could be allocated according to the success of a search of a particular data subset (which is duplicated for each assigned mapper).
  • This can also be done proactively, via cloud-computing templates customized for specific jobs.
  • Task prioritization can be added: non-FIFO scheduling, or a greater likelihood of certain types of tasks being relaunched when delayed by smaller amounts of time.
  • Moreover, certain types of jobs may only require “soft” synchronization (not hard barriers) at the join points of certain of their tasks.
  • None of these methods are new to the general problem space of parallel computation.

SLIDE 41

Applications with feedback

  • So far, we have considered applications that map to feed-forward processor topologies.
  • Processor topologies with feedback are needed for, e.g., distributed simulation of:
    – communication networks,
    – manufacturing systems with “re-entrant lines”.
  • Rather than “hard” synchronization, one can use rollback when an inconsistency is detected in shared memory/state.
  • Other application-specific tricks:
    – modeling choices, e.g., packet or fluid traffic models (with a ripple effect for the latter), importance sampling,
    – dynamic time-warp.

SLIDE 42

Summary

  • Parallel processing is a classical area, with techniques from concurrent programming, cluster computing, and (now trending) cloud computing.
  • Markov models of fork-join systems were studied in the 1980s under highly idealized assumptions.
  • It is possible to apply methods of network calculus.
  • Workloads naturally change as jobs progress through the system.
  • Workloads associated with component tasks also change with the application of proactive/reactive techniques for performance improvement.