SLIDE 1

6.888 Lecture 8: Networking for Data Analytics

Mohammad Alizadeh

Spring 2016

Many thanks to Mosharaf Chowdhury (Michigan) and Kay Ousterhout (Berkeley)

SLIDE 2

“Big Data”

Huge amounts of data being collected daily, from a wide variety of sources

  • Web, mobile, wearables, IoT, scientific
  • Machines: monitoring, logs, etc.

Many applications

  • Business intelligence, scientific research, health care

SLIDE 3

Big Data Systems

[Timeline figure, 2005-2015: MapReduce, Dryad, Hadoop, DryadLINQ, Hive, Pregel, GraphLab, Spark, Storm, Dremel, BlinkDB, Spark-Streaming, GraphX]

SLIDE 4

Data-Parallel Applications

Multi-stage dataflow

  • Computation interleaved with communication

Computation Stage (e.g., Map, Reduce)

  • Distributed across many machines
  • Tasks run in parallel

Communication Stage (e.g., Shuffle)

  • Between successive computation stages

[Figure: Map Stage → Shuffle → Reduce Stage]

A communication stage cannot complete until all the data have been transferred

SLIDE 5

Questions

How to design the network for data-parallel applications?

  • What are good communication abstractions?

Does the network matter for data-parallel applications?

  • What are the bottlenecks for these applications?

SLIDE 6

Efficient Coflow Scheduling with Varys

Slides by Mosharaf Chowdhury (Michigan), with minor modifications

SLIDE 7

Existing Solutions

[Timeline figure, 1980s-2015: GPS, RED, WFQ, CSFQ, ECN, RCP, XCP, DCTCP, D3, D2TCP, PDQ, FCP, DeTail, pFabric; grouped into Per-Flow Fairness and Flow Completion Time approaches]

Independent flows cannot capture the collective communication behavior common in data-parallel applications

Flow: transfer of data from a source to a destination

SLIDE 8

Coflow

Communication abstraction for data-parallel applications to express their performance goals

  • 1. Minimize completion times
  • 2. Meet deadlines
SLIDE 9

[Figure: coflow structures: Single Flow, Parallel Flows, Aggregation, Broadcast, Shuffle (All-to-All)]

SLIDE 10

How to schedule coflows online …

  • #1 … for faster completion of coflows?
  • #2 … to meet more deadlines?

[Figure: datacenter fabric with input ports 1..N and output ports 1..N]

SLIDE 11

Benefits of Inter-Coflow Scheduling

[Figure: two coflows with demands of 3, 6, and 2 units across Link 1 and Link 2; Gantt charts over time 0-6 for three schedules]

  • Fair Sharing: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6
  • Smallest-Flow First [1,2]: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6
  • The Optimal: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6

  • 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012.
  • 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.

SLIDE 12

Inter-Coflow Scheduling is NP-Hard

[Figure: the same two-coflow example on Link 1 and Link 2 (demands of 3, 6, and 2 units); Gantt charts over time 0-6 for three schedules]

  • Fair Sharing: Coflow 1 comp. time = 6, Coflow 2 comp. time = 6
  • Flow-level Prioritization [1,2]: Coflow 1 comp. time = 6, Coflow 2 comp. time = 6
  • The Optimal: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6

  • 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM 2012.
  • 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM 2013.

Concurrent Open Shop Scheduling [1]

  • Examples include job scheduling and caching blocks
  • Solutions use an ordering heuristic

  • 1. A Note on the Complexity of the Concurrent Open Shop Problem, Journal of Scheduling, 9(4):389–396, 2006
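To make the gap between flow-level and coflow-level scheduling concrete, here is a small Python sketch. The per-coflow demands are my reading of the figure (Coflow 1: a 3-unit flow on Link 1; Coflow 2: a 2-unit flow on Link 1 and a 6-unit flow on Link 2), and all function names are illustrative, not anything from Varys. Under those assumptions it reproduces the fair-sharing (5 and 6) and optimal (3 and 6) completion times shown two slides back.

    from collections import defaultdict

    def fair_share_finish_times(flows):
        """Per-link max-min fair sharing: flows on the same link split its
        unit capacity equally. Returns {flow_id: finish_time}."""
        finish = {}
        by_link = defaultdict(dict)                  # link -> {flow_id: remaining units}
        for fid, (link, size) in flows.items():
            by_link[link][fid] = float(size)
        for link, remaining in by_link.items():
            t = 0.0
            while remaining:
                rate = 1.0 / len(remaining)          # equal split of unit capacity
                dt = min(remaining.values()) / rate  # run until the next flow ends
                t += dt
                for fid in list(remaining):
                    remaining[fid] -= rate * dt
                    if remaining[fid] <= 1e-9:
                        finish[fid] = t
                        del remaining[fid]
        return finish

    def ordered_finish_times(flows, coflow_of, order):
        """Coflow-aware scheduling: on each link, serve flows in the given
        coflow order, one at a time at full (unit) rate."""
        finish = {}
        by_link = defaultdict(list)
        for fid in flows:
            by_link[flows[fid][0]].append(fid)
        for link, fids in by_link.items():
            t = 0.0
            for fid in sorted(fids, key=lambda f: order.index(coflow_of[f])):
                t += flows[fid][1]
                finish[fid] = t
        return finish

    def cct(finish, coflow_of):
        """Coflow completion time = finish time of the coflow's last flow."""
        out = defaultdict(float)
        for fid, t in finish.items():
            out[coflow_of[fid]] = max(out[coflow_of[fid]], t)
        return dict(out)

    flows = {"f1": ("L1", 3), "f2": ("L1", 2), "f3": ("L2", 6)}   # (link, units)
    coflow_of = {"f1": "C1", "f2": "C2", "f3": "C2"}

    print(cct(fair_share_finish_times(flows), coflow_of))                        # C1 = 5, C2 = 6
    print(cct(ordered_finish_times(flows, coflow_of, ["C1", "C2"]), coflow_of))  # C1 = 3, C2 = 6

The point of the toy model: scheduling coflow C1 ahead of C2 shortens C1 without delaying C2, because C2's completion is pinned by its 6-unit flow on the other link.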

SLIDE 13

Inter-Coflow Scheduling is NP-Hard

[Figure: datacenter modeled as a switch with input links 1-3 and output links 1-3; the two-coflow demands (3, 6, and 2 units) shown on the links]

Concurrent Open Shop Scheduling with Coupled Resources

  • Examples include job scheduling and caching blocks
  • Solutions use an ordering heuristic
  • Consider matching constraints

SLIDE 14

Varys

Employs a two-step algorithm to minimize coflow completion times

  • 1. Ordering heuristic: keep an ordered list of coflows to be scheduled, preempting if needed
  • 2. Allocation algorithm: allocate the minimum required resources to each coflow so it finishes in minimum time
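A minimal sketch of step 1 under assumptions: coflows are ordered by the completion time of their bottleneck link, smallest first, which is a simplified version of the smallest-effective-bottleneck-first idea in the Varys paper. The data layout and function names are illustrative, not the Varys implementation.

    def bottleneck_time(coflow, link_capacity=1.0):
        """Completion time of this coflow if every link served only this coflow:
        max over links of (total units on that link / link capacity)."""
        per_link = {}
        for link, size in coflow["flows"]:               # flows = [(link, units), ...]
            per_link[link] = per_link.get(link, 0.0) + size
        return max(per_link.values()) / link_capacity

    def order_coflows(active_coflows):
        """Step 1: keep the active coflows sorted so that the one with the smallest
        bottleneck completion time is served first; a newly arrived small coflow
        moves ahead of (preempts) larger ones."""
        return sorted(active_coflows, key=bottleneck_time)

    coflows = [
        {"name": "C1", "flows": [("L1", 3)]},
        {"name": "C2", "flows": [("L1", 2), ("L2", 6)]},
    ]
    print([c["name"] for c in order_coflows(coflows)])   # ['C1', 'C2']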

SLIDE 15

Allocation Algorithm

A coflow cannot finish before its very last flow

Finishing flows faster than the bottleneck cannot decrease a coflow's completion time

Allocate minimum flow rates such that all flows of a coflow finish together on time
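A minimal sketch of step 2 under the same toy model: the bottleneck link fixes the coflow's finish time, and every other flow gets just enough rate to finish at that same time (finishing earlier would not help the coflow, and the freed bandwidth can go to other coflows). The names and the unit-capacity link model are illustrative assumptions, not the paper's exact algorithm.

    def allocate_rates(flows, link_capacity=1.0):
        """flows: {flow_id: (link, units)}. Returns ({flow_id: rate}, duration):
        the slowest (bottleneck) link fixes the coflow's finish time, and every
        other flow is paced just enough to finish at that same time."""
        per_link = {}
        for link, size in flows.values():
            per_link[link] = per_link.get(link, 0.0) + size
        duration = max(per_link.values()) / link_capacity    # bottleneck finish time
        rates = {fid: size / duration for fid, (link, size) in flows.items()}
        return rates, duration

    rates, t = allocate_rates({"f1": ("L1", 2), "f2": ("L2", 6)})
    print(rates, t)   # f1 gets rate 1/3, f2 gets rate 1.0; both finish at t = 6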

SLIDE 16

Varys Architecture

Centralized master-slave architecture

  • Applications use a client library to communicate with the master

Actual timing and rates are determined by the coflow scheduler

[Figure: compute tasks (sender, receiver, driver) call the Varys client library (Put/Get/Reg); Varys daemons report to the Varys master, whose coflow scheduler uses a topology monitor and usage estimator and talks to the network interface and the (distributed) file system]

  • 1. Download from http://varys.net
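The Put/Get/Reg labels in the figure hint at how an application hands its communication to Varys. Below is a purely hypothetical usage sketch in Python: the class, method names, and signatures are stand-ins for illustration, not the actual Varys client API.

    # Hypothetical stand-in for the client library that talks to the Varys master.
    class FakeVarysClient:
        def register(self, num_flows):             # driver: announce a new coflow
            return "coflow-0"
        def put(self, coflow_id, data_id, data):   # sender task: publish a partition
            pass
        def get(self, coflow_id, data_id):         # receiver task: fetch a partition
            return b""
        def unregister(self, coflow_id):           # driver: the coflow is finished
            pass

    client = FakeVarysClient()
    coflow = client.register(num_flows=4)          # driver, before the shuffle starts
    client.put(coflow, "map0-part0", b"...")       # each map task, once per partition
    data = client.get(coflow, "map0-part0")        # each reduce task
    client.unregister(coflow)                      # driver, after the shuffle completes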
SLIDE 17

Discussion


SLIDE 18

Making Sense of Performance in Data Analytics Frameworks

Slides by Kay Ousterhout (Berkeley), with minor modifications

SLIDE 19

Stragglers

Scarlett [EuroSys ‘11], SkewTune [SIGMOD ‘12], LATE [OSDI ‘08], Mantri [OSDI ‘10], Dolly [NSDI ‘13], GRASS [NSDI ‘14], Wrangler [SoCC ’14]

Disk

Themis [SoCC ‘12], PACMan [NSDI ’12], Spark [NSDI ’12], Tachyon [SoCC ’14]

Network

  • Load balancing: VL2 [SIGCOMM '09], Hedera [NSDI '10], Sinbad [SIGCOMM '13]
  • Application semantics: Orchestra [SIGCOMM '11], Baraat [SIGCOMM '14], Varys [SIGCOMM '14]
  • Reduce data sent: PeriSCOPE [OSDI '12], SUDO [NSDI '12]
  • In-network aggregation: Camdoop [NSDI '12]
  • Better isolation and fairness: Oktopus [SIGCOMM '11], EyeQ [NSDI '12], FairCloud [SIGCOMM '12]

SLIDE 20

[Same Disk, Stragglers, and Network references as the previous slide]

Missing: what’s most important to end-to-end performance?

SLIDE 21

[Same Disk, Stragglers, and Network references as the previous slide]

Widely-accepted mantras:

  • Network and disk I/O are bottlenecks
  • Stragglers are a major issue with unknown causes

SLIDE 22

This work

  • (1) How can we quantify performance bottlenecks? → Blocked time analysis
  • (2) Do the mantras hold? → Takeaways based on three workloads run with Spark

SLIDE 23

Blocked time analysis

  • (1) Measure the time when tasks are blocked on the network
  • (2) Simulate how job completion time would change

SLIDE 24

(1) Measure the time when tasks are blocked on the network

[Figure: a task's timeline split into network read, compute, and disk write; labels show the original task runtime, the time blocked on network, the time blocked on disk, the task runtime if the network were infinitely fast, and the best case (only the time to handle one record)]
SLIDE 25

(2) Simulate how job completion time would change

[Figure: three tasks (Task 0, 1, 2) packed onto 2 slots; t_o = original job completion time; subtracting the time blocked on the network directly gives an incorrectly computed time because it doesn't account for task scheduling; t_n = job completion time with an infinitely fast network]
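A minimal sketch of this simulation under simplifying assumptions: a fixed task order, greedy assignment to whichever of the 2 slots frees up first, and made-up per-task numbers. It also shows why step (2) needs a replay of the scheduler rather than simply subtracting the total blocked time from the job's runtime.

    import heapq

    def job_completion_time(task_durations, num_slots=2):
        """Greedy list scheduling: each task starts on whichever slot frees up
        first; the job finishes when its last task finishes."""
        slots = [0.0] * num_slots            # time at which each slot becomes free
        heapq.heapify(slots)
        finish = 0.0
        for d in task_durations:
            start = heapq.heappop(slots)
            end = start + d
            finish = max(finish, end)
            heapq.heappush(slots, end)
        return finish

    # Hypothetical measurements: (original runtime, time blocked on network) per task.
    tasks = [(10.0, 4.0), (8.0, 1.0), (6.0, 3.0)]

    t_o = job_completion_time([r for r, _ in tasks])        # original job completion time
    t_n = job_completion_time([r - b for r, b in tasks])    # with an infinitely fast network
    naive = t_o - sum(b for _, b in tasks)                  # incorrect: ignores task scheduling
    print(t_o, t_n, naive)                                  # 14.0 9.0 6.0

In this made-up example the naive subtraction claims the network costs 8 time units, while the replay shows the job would only shrink from 14 to 9, because shortened tasks still queue for the same two slots.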

SLIDE 26

Takeaways based on three Spark workloads:

  • Network optimizations can reduce job completion time by at most 2%
  • CPU (not I/O) is often the bottleneck
  • <19% reduction in completion time from optimizing disk
  • Many straggler causes can be identified and fixed

SLIDE 27

When does the network matter?

The network is important when:

  • (1) Computation is optimized
  • (2) Serialization time is low
  • (3) A large amount of data is sent over the network
SLIDE 28

Discussion


SLIDE 29

What You Said

“I very much appreciated the thorough nature of the "Making Sense of Performance in Data Analytics Frameworks" paper.”

“I see their paper as more of a survey on the performance of current data analytics platforms as opposed to a paper that discusses fundamental tradeoffs between compute and networking resources. I think the question of whether current ‘data-analytics platforms’ are network bound or CPU bound depends heavily on the implementation and design assumptions. As a result, I see their work as somewhat of a self-fulfilling prophecy.”

SLIDE 30

What You Said

“The paper admits its bias in primarily studying instrumented Spark servers. It uses traces from real-world services to back up its conclusions across other types and scales of services, and is reasonably convincing in this analysis. It is easy to agree with the conclusion that services should be more heavily instrumented.”


SLIDE 31

Next Time: Wireless/Optical Data Centers
