Sol: Fast Distributed Computation Over Slow Networks (PowerPoint PPT Presentation)


slide-1
SLIDE 1

Sol

Fast Distributed Computation Over Slow Networks

Fan Lai, Jie You, Xiangfeng Zhu, Harsha V. Madhyastha, Mosharaf Chowdhury



slide-3
SLIDE 3
Distributed Data Processing is Ubiquitous

  • Distributed computation in Local-Area Networks (LAN)
  • To accelerate executions within a single cluster
  • Computation over Wide-Area Networks (WAN)
  • To reduce data transfers, mitigate privacy risks

Efforts for computation in LAN; efforts for computation over WAN (e.g., Google Spanner, Azure Cosmos DB, Iridium, CLARINET, Tetrium)


slide-8
SLIDE 8

Execution Engine: Core of Big Data Stack

[Architecture diagram: applications (SQL queries such as SELECT * FROM …, AI/ML such as k-means and SVM, stream processing such as WordCount and TopKCount) feed an Execution Planner, which, together with the Storage System and Resource Scheduler, sits atop the Execution Engine: a Coordinator driving Worker1, Worker2, …, WorkerN.]


slide-11
SLIDE 11

[Same architecture diagram, annotated: efforts for computation in LAN target the execution engine within a single cluster, while efforts for computation over WAN (Google Spanner, Azure Cosmos DB, Iridium, CLARINET, Tetrium) target geo-distributed settings.]

While network conditions are diverse in the real world, execution engines remain the same.

slide-12
SLIDE 12

Outline


  • Today’s Execution Engines
  • Sol Architecture
  • Control Plane Design
  • Data Plane Design
  • Evaluation
slide-16
SLIDE 16

Impact of Networks on Latency-sensitive Jobs

Problem #1: Slow job execution in high-latency networks

[CDF of job completion time for queries from 100 GB TPC benchmarks under four network settings (10 Gbps / O(1) ms, 1 Gbps / O(1) ms, 10 Gbps / O(100) ms, 1 Gbps / O(100) ms), with a 4.9x gap annotated between the low-latency and high-latency settings.]

slide-18
SLIDE 18

Control Plane Inefficiency Due to High Latency

[Timeline, O(1) ms RTT: the Coordinator issues Launch(■) to the Worker and receives Complete(■) almost immediately, so the Worker stays busy from one task to the next.]

slide-19
SLIDE 19

Control Plane Inefficiency Due to High Latency

[Timeline, O(100) ms RTT: each Launch(■)/Complete(■) round trip between Coordinator and Worker now takes O(100) ms, leaving the Worker idle between tasks while messages cross the network.]

Late-binding of tasks postpones scheduling
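The idleness above can be quantified with a back-of-envelope model (an illustration, not from the slides): under pull-based late binding, a worker runs roughly one task per control-plane round trip, so its utilization is about task duration divided by task duration plus RTT.

```python
def worker_utilization(task_ms: float, rtt_ms: float) -> float:
    """Fraction of time a worker is busy when each task launch
    waits on one coordinator round trip (late binding)."""
    return task_ms / (task_ms + rtt_ms)

# With 50 ms tasks: near-full utilization at O(1) ms RTT,
# but only about a third busy at a 100 ms WAN RTT.
print(round(worker_utilization(50, 1), 2))    # 0.98
print(round(worker_utilization(50, 100), 2))  # 0.33
```

This is exactly why the slowdown hits latency-sensitive jobs hardest: their tasks are short, so the RTT term dominates.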

slide-23
SLIDE 23

Data Plane Inefficiency Due to Low Bandwidth

Problem #2: Tasks hog CPUs throughout the lifespan → CPU underutilization in low-bandwidth networks

[Query25 on the 1 TB TPC benchmark: Stages 1, 2, and 3 with data transfers over the network between stages. The resource-utilization plot over roughly 250 s shows occupied CPUs near 100% of the total while actual CPU utilization stays low during bandwidth-bound periods and bandwidth utilization runs high.]

slide-24
SLIDE 24

Outline


  • Today’s Execution Engines
  • Sol Architecture
  • Control Plane Design
  • Data Plane Design
  • Evaluation

Problem #1

High latency → Idleness of workers

Problem #2

Low b/w → CPU underutilization

slide-25
SLIDE 25

Outline


  • Today’s Execution Engines
  • Sol Architecture
  • Control Plane Design
  • Data Plane Design
  • Evaluation

Sol: a federated execution engine for diverse network conditions with

  • faster job execution
  • higher resource utilization

slide-29
SLIDE 29

Sol: A Federated Execution Engine

Sol Architecture

  • Central Coordinator
  • Coordinate inter-site executions
  • Site Manager
  • Coordinate local workers
  • Manage queued tasks
  • Task Manager
  • Manage worker resources

[Diagram: task arrivals reach the Sol Coordinator, which coordinates Site Managers at Site 1, Site 2, and Site 3 over WAN links with O(100) ms RTTs; within each site, the Site Manager manages Workers (each running a Task Manager) over the LAN.]

slide-30
SLIDE 30

Outline


  • Today’s Execution Engines
  • Sol Architecture
  • Control Plane Design
  • Data Plane Design
  • Evaluation

Problem #1: High latency → Idleness of workers

Solution: Push tasks proactively to reduce worker idle time

slide-31
SLIDE 31

Task Early-binding in Control Plane

[Existing designs, O(100) ms RTT: the Coordinator issues Launch(■), waits O(100) ms for Complete(■), then issues the next Launch(■); the Worker sits idle in between.]

slide-36
SLIDE 36
Task Early-binding in Control Plane

  • Coordinator ⟷ Site Manager
  • Inter-site operations are early-binding → Guarantee high utilization
  • Site Manager ⟷ Worker
  • Intra-site operations are late-binding → Retain precise views

[Timeline: the Coordinator pushes Launch(■ ■) to the Site Manager over the O(100) ms WAN link; the Site Manager late-binds each Launch(■) to the Worker over the O(1) ms LAN as Complete(■) messages arrive, keeping the Worker busy.]
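A minimal sketch of this two-level binding (illustrative Python, not Sol's actual code; the class and method names are invented): the coordinator early-binds a batch of tasks to a site-level queue in one WAN round trip, while workers late-bind by pulling from that queue only when they actually go idle.

```python
from collections import deque

class SiteManager:
    """Holds tasks pushed early by the coordinator; workers
    late-bind by pulling only when they become free."""
    def __init__(self):
        self.queue = deque()

    def push(self, tasks):  # early-binding: one WAN round trip for a whole batch
        self.queue.extend(tasks)

    def pull(self):         # late-binding: one cheap LAN round trip per task
        return self.queue.popleft() if self.queue else None

site = SiteManager()
site.push(["t1", "t2", "t3"])              # coordinator → site manager (WAN)
while (task := site.pull()) is not None:   # idle worker → site manager (LAN)
    print("running", task)
```

The key property: the expensive WAN hop is amortized over a batch, while per-task placement decisions still happen at LAN timescales with a fresh view of worker state.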

slide-40
SLIDE 40
Challenge 1.1: How Many Tasks to Push?

  • Queue up too few
  • Not enough work → Underutilization
  • Queue up too many
  • Scheduling too early → Suboptimal placement
  • Target:
  • Total duration of queued tasks ≃ Round-Trip Time (RTT)
  • Sol works well without precise knowledge of task durations
  • Hoeffding bound (details in paper)

[Diagram: the Coordinator pushes tasks to the Site Manager's queue, one RTT away.]
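As an illustration of the target above (a sketch under stated assumptions, not the paper's exact algorithm): queue tasks until their estimated total duration covers one RTT, using a Hoeffding-style lower bound on the mean task duration so the queue rarely runs dry even when the estimate is noisy.

```python
import math

def tasks_to_queue(durations_ms, max_ms, rtt_ms, delta=0.05):
    """How many tasks to keep queued at a site manager so that, with
    probability >= 1 - delta, their total work covers one RTT.

    durations_ms: observed sample of task durations, each in [0, max_ms].
    Hoeffding: true mean >= sample mean - max_ms * sqrt(ln(1/delta) / (2n)).
    """
    n = len(durations_ms)
    mean = sum(durations_ms) / n
    slack = max_ms * math.sqrt(math.log(1 / delta) / (2 * n))
    per_task = max(mean - slack, 1e-9)  # conservative per-task estimate
    return math.ceil(rtt_ms / per_task)

# 100 samples of ~50 ms tasks, tasks bounded by 200 ms, 100 ms WAN RTT:
print(tasks_to_queue([50.0] * 100, max_ms=200.0, rtt_ms=100.0))
```

With more samples the confidence slack shrinks, and the answer approaches the naive RTT / mean-duration target of 2 tasks; with few samples it conservatively queues extra.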

slide-45
SLIDE 45

Challenge 1.2: How to Push Tasks w/ Dependencies?

17

  • Task placements depend on upstream outputs
  • In order to reduce data transfers over networks

Complete(T1)

C S1 S2

Time

T1 T2 Complete(T2) Start(T3) Launch(T3)

Design in Existing Engines

T1 T2 T3

Task Dependencies

Output1 Output2

W/o full knowledge, pushing leads to tradeoff

slide-48
SLIDE 48

Challenge 1.2: How to Push Tasks w/ Dependencies?

18

Complete(T1) Complete(T2)

C S1 S2

Time

T1 T2

Push w/ Correct Speculations

T3

Sol saves RTTs

  • 1. Sol improves utilization by pushing with speculation
  • E.g., historical information

Complete(T1)

C S1 S2

Time

T1 T2 Complete(T2) Start(T3) Launch(T3)

Design in Existing Engines

Activate(T3)

slide-53
SLIDE 53

Challenge 1.2: How to Push Tasks with Dependencies?

  • 2. In case of mistakes, Sol retains good scheduling by recovering
  • With worker-initiated re-scheduling

[Push under mispredictions: T3 was speculatively pushed to the wrong site; when Complete(T1) and Complete(T2) arrive, the speculative copy is cancelled (Cancel(T3)) and T3 is re-scheduled (Re-Schedule(T3), Start(T3)) at the right site, matching the timeline of existing engines.]

Sol does not make things worse

slide-54
SLIDE 54

Task Early-binding in Control Plane

  • Sol improves utilization while retaining good scheduling quality

[Recap: push with correct speculations (Activate(T3)) shows Sol improves utilization by saving RTTs; push under mispredictions (Cancel(T3), Re-Schedule(T3), Start(T3)) shows Sol retains good scheduling quality.]

slide-55
SLIDE 55

Outline


  • Today’s Execution Engines
  • Sol Architecture
  • Control Plane Design
  • Data Plane Design
  • Evaluation

Problem #2: Low b/w → CPU underutilization

Solution: Decouple resource provisioning to improve CPU utilization

slide-58
SLIDE 58

Resource Decoupling in Data Plane

  • Decouple the resource provisioning internally with:
  • Communication task: prepare data over networks
  • Computation task: perform computation on input

Sol scales down CPU requirements and reclaims unused CPUs
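A toy sketch of this decoupling (illustrative Python; the names are invented, and Sol manages this inside the execution engine rather than per call): the communication step runs without holding a CPU slot while it waits on the network, and a CPU slot is claimed only once the data is ready to compute on.

```python
import threading

cpu_slots = threading.Semaphore(2)   # pretend the worker has 2 CPUs

def run_task(name, fetch, compute):
    data = fetch()                   # communication task: network-bound,
                                     # holds no CPU slot while waiting
    with cpu_slots:                  # computation task: claims a CPU slot
        return compute(data)         # only once its input has arrived

# Without decoupling, a slow fetch would pin a CPU for the task's whole
# lifespan; here CPUs stay free for other ready tasks in the meantime.
result = run_task("t1", fetch=lambda: [3, 1, 2], compute=sorted)
print(result)  # [1, 2, 3]
```

This is what lets Sol reclaim the "occupied but idle" CPUs visible in the Query25 utilization plot.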
slide-63
SLIDE 63

Challenge 2: How to Manage Jobs?

For bandwidth-intensive tasks:

  • How many communication tasks to create?
  • Too few → Network is not saturated
  • Too many → CPUs are not saturated
  • Either way: adapt to available bandwidth

[Control flow of decoupling: for each incoming task, check "large remote read?"; if yes, create a communication task, and hand off once the communication completes.]
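One way to size the communication-task pool, as a hedged sketch (the real policy adapts online; the parameters and function name here are invented): spawn roughly enough parallel fetchers to cover the link bandwidth, capped so that computation still gets CPUs.

```python
import math

def num_comm_tasks(link_gbps, per_task_gbps, total_cpus, reserve_frac=0.5):
    """Enough parallel fetchers to saturate the link, but never more
    than the share of CPUs we are willing to spend on communication."""
    need = math.ceil(link_gbps / per_task_gbps)          # saturate the network
    cap = max(1, int(total_cpus * (1 - reserve_frac)))   # keep CPUs for compute
    return min(need, cap)

# 1 Gbps WAN link, ~0.2 Gbps achieved per fetcher, 16-CPU worker:
print(num_comm_tasks(1.0, 0.2, 16))  # 5
```

On a fast LAN the cap binds (CPUs are the scarce resource); on a slow WAN the bandwidth term binds, which is the adaptation the bullet above calls for.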

slide-68
SLIDE 68

Challenge 2: How to Manage Jobs?

For bandwidth-intensive tasks:

  • How to manage the computation tasks?
  • Prioritize them when their data is ready

[Control flow of decoupling: for each incoming task, check "large remote read?"; if yes, create a communication task; once the communication completes and CPUs are available, run the computation, otherwise it waits among the pending tasks.]

slide-70
SLIDE 70

Evaluation

With a prototype supporting generic data processing. How does Sol perform:

  • 1. compared to existing engines?
  • 2. across the design space?
  • 3. under uncertainties?

Environment (deployment over WAN):

  • 10-site deployment in EC2
  • 4 m4.4xlarge VMs in each site

slide-72
SLIDE 72

Sol Improves Job Performance and Resource Utilization (WAN)

Benchmark (multi-job execution):

  • Latency-sensitive TPC queries
  • Bandwidth-intensive TeraSort

Baseline: Apache Spark

[CDF of job completion time (log scale, 10^0 to 10^3 s) for Sol vs. Spark: 16.4x improvement on average.]

slide-75
SLIDE 75

Sol Improves Job Performance and Resource Utilization (WAN)

[CDF of job completion time (log scale) for Sol, Sol without data-plane decoupling, and Spark: 16.4x improvement on average overall; the control plane alone contributes 2.6x on average, with the data plane providing the rest.]

Control Plane: Early-binding → Less idle time
Data Plane: Decoupling → Less under-utilization

16.4x better job completion + 1.8x better CPU utilization

slide-76
SLIDE 76

Sol Performs Well Across the Design Space (LAN)

[Two CDFs of job completion time, Sol vs. Spark: 1.3x improvement on average in the high-bandwidth setting (10 Gbps); 3.9x improvement on average in the low-bandwidth setting (1 Gbps).]

slide-77
SLIDE 77

Sol

A federated execution engine for diverse network conditions with

  • Faster job execution
  • Higher resource utilization

Improve CPU utilization:

  • before task executions → early-binding of tasks
  • during task executions → decoupling of resource provisioning

https://github.com/SymbioticLab/Sol