Coflow: Recent Advances and What's Next? Mosharaf Chowdhury - PowerPoint PPT Presentation



SLIDE 1

Coflow: Recent Advances and What's Next?

Mosharaf Chowdhury, University of Michigan

SLIDE 2

Research overview:

  • Rack-Scale Computing
  • Datacenter-Scale Computing
  • Geo-Distributed Computing: Fast Analytics Over the WAN
  • Proactive Analytics: Before You Think!
  • Coflow Networking
  • Open Source: Apache Spark; an open-source cluster file system
  • Resource Allocation (Facebook)
  • DAG Scheduling (Microsoft, Apache YARN)
  • Cluster Caching (Alluxio)

SLIDE 3

Typical network latencies at each scale:

  • Rack-Scale Computing: < 0.01 ms
  • Datacenter-Scale Computing: ~ 1 ms
  • Geo-Distributed Computing: > 100 ms

SLIDE 4

Big Data

The volume of data businesses want to make sense of is increasing, as is the variety of sources:

  • Web, mobile, wearables, vehicles, scientific, …

Meanwhile, disks, SSDs, and memory keep getting cheaper, while processor speeds have stalled.

SLIDE 5

Big Datacenters for Massive Parallelism

A decade of data-parallel frameworks (2005–2015): MapReduce, Hadoop, Spark1, Hive, Dryad, DryadLINQ, Spark Streaming, GraphX, GraphLab, Pregel, Storm, Dremel, BlinkDB.

  • 1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI'2012.
SLIDE 6

Distributed Data-Parallel Applications

Multi-stage dataflow

  • Computation interleaved with communication

Computation Stage (e.g., Map, Reduce)

  • Distributed across many machines
  • Tasks run in parallel

Communication Stage (e.g., Shuffle)

  • Between successive computation stages

[Figure: a Map stage connected to a Reduce stage by a shuffle.]

A communication stage cannot complete until all the data have been transferred
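Because a stage is gated by its slowest flow, stage completion time is the maximum, not the average, of per-flow finish times. A minimal illustration (the finish times are hypothetical):

```python
# Hypothetical finish times (time units) for the flows of one shuffle.
flow_finish_times = [3, 3, 5]

# The stage completes only when its last flow does.
stage_completion = max(flow_finish_times)                              # 5
avg_flow_completion = sum(flow_finish_times) / len(flow_finish_times)  # ~3.66
```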

SLIDE 7

Communication is Crucial for Performance

As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck

Facebook jobs spend ~25% of runtime on average in intermediate communication.1

  • 1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

SLIDE 8

Faster Communication Stages: The Traditional Networking Approach

Flow

  • Transfers data from a source to a destination
  • Independent unit of allocation, sharing, load balancing, and/or prioritization

SLIDE 9

Existing Solutions

Decades of flow-level mechanisms: GPS, RED, WFQ, CSFQ (1980s–1990s); ECN, XCP, RCP (2000s); DCTCP, D2TCP, D3, PDQ, FCP, DeTail, pFabric (2005–2015). The focus shifted from per-flow fairness to flow completion time.

Independent flows cannot capture the collective communication behavior common in data-parallel applications

SLIDE 10

Why Do They Fall Short?

[Figure: the datacenter fabric abstracted as one non-blocking switch with three input links and three output links; senders s1–s3 transfer to receivers r1 and r2 across it.]

SLIDE 11

Why Do They Fall Short?

[Figure: the same fabric, with a shuffle from s1, s2, s3 to r1 and r2 laid onto the input and output links.]

SLIDE 12

Why Do They Fall Short?

[Figure: per-flow fair sharing on the fabric. Flows finish at times 3, 3, and 5 on the links to r1 and r2; shuffle completion time = 5, average flow completion time = 3.66.]

Solutions focusing on flow completion time cannot further decrease the shuffle completion time.

SLIDE 13

Improve Application-Level Performance1

Slow down faster flows to accelerate slower flows.

[Figure: under per-flow fair sharing, flows finish at times 3, 3, and 5 (shuffle completion time = 5; average flow completion time = 3.66). Under data-proportional allocation, every flow finishes at time 4 (shuffle completion time = 4; average flow completion time = 4).]

  • 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
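Orchestra's data-proportional allocation can be sketched for a single shared link: give each flow bandwidth proportional to its data, so all flows on the link finish together. A minimal sketch; the flow sizes and unit link capacity below are hypothetical:

```python
def data_proportional_rates(flow_sizes, capacity=1.0):
    """Split link bandwidth in proportion to each flow's data,
    so every flow on the link finishes at the same instant."""
    total = sum(flow_sizes)
    return [capacity * size / total for size in flow_sizes]

sizes = [1.0, 1.0, 2.0]                 # hypothetical flow sizes on one link
rates = data_proportional_rates(sizes)  # [0.25, 0.25, 0.5]
finish = [s / r for s, r in zip(sizes, rates)]
# All flows finish together at total/capacity = 4.0: faster flows were
# slowed down, but no flow finishes later than the transfer itself.
```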

SLIDE 14

Coflow

Communication abstraction for data-parallel applications to express their performance goals.

Information a coflow can expose to the scheduler:

  • 1. Size of each flow;
  • 2. Total number of flows;
  • 3. Endpoints of individual flows;
  • 4. Dependencies between coflows.
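As a sketch, the information above could be carried by a coflow descriptor like the following (the names and fields are illustrative, not any system's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Flow:
    src: str                    # source endpoint
    dst: str                    # destination endpoint
    size: Optional[int] = None  # bytes; None when unknown (non-clairvoyant)

@dataclass
class Coflow:
    coflow_id: str
    flows: List[Flow] = field(default_factory=list)
    parents: List[str] = field(default_factory=list)  # coflow dependencies

    @property
    def num_flows(self) -> int:
        """Total number of flows in this coflow."""
        return len(self.flows)

    @property
    def total_size(self) -> Optional[int]:
        """Sum of flow sizes, or None if any size is unknown."""
        sizes = [f.size for f in self.flows]
        return None if None in sizes else sum(sizes)
```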
SLIDE 15

Coflows capture common communication patterns: a single flow, parallel flows, aggregation, broadcast, shuffle, and all-to-all.

SLIDE 16

How to schedule coflows online …

  • 1. … for faster completion of coflows?
  • 2. … to meet more deadlines?
  • 3. … for fair allocation of the network?

[Figure: a datacenter with N input links and N output links.]

SLIDE 17

Varys, Aalo & HUG

  • 1. Coflow Scheduler: faster, application-aware data transfers throughout the network
  • 2. Global Coordination: consistent calculation and enforcement of scheduler decisions
  • 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users

  • 1. Efficient Coflow Scheduling with Varys, SIGCOMM'2014.
  • 2. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM'2015.
  • 3. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI'2016.

SLIDE 18

Benefits of Inter-Coflow Scheduling

Two coflows share two unit-capacity links: Coflow 1 has 3 units on Link 1; Coflow 2 has 2 units on Link 1 and 6 units on Link 2.

  • Fair Sharing: Coflow 1 completion time = 5, Coflow 2 completion time = 6
  • Smallest-Flow First1,2: Coflow 1 completion time = 5, Coflow 2 completion time = 6
  • The Optimal: Coflow 1 completion time = 3, Coflow 2 completion time = 6

  • 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
  • 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
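The example can be replayed in a small fluid-model simulation. One reading of the slide's numbers that reproduces all three outcomes, stated here as an assumption: Coflow 1 has a single 3-unit flow on Link 1, and Coflow 2 has a 2-unit flow on Link 1 plus a 6-unit flow on Link 2, with unit-capacity links:

```python
from collections import defaultdict

# Assumed instance: (coflow, link, size in units); unit-capacity links.
FLOWS = [("C1", "L1", 3.0), ("C2", "L1", 2.0), ("C2", "L2", 6.0)]

def fair_finish(sizes):
    """Fluid fair sharing (processor sharing) on one unit-capacity link."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    finish = [0.0] * len(sizes)
    t, served, active = 0.0, 0.0, len(sizes)
    for i in order:
        t += (sizes[i] - served) * active  # drain the next-smallest flow
        finish[i], served, active = t, sizes[i], active - 1
    return finish

def serial_finish(sizes, order):
    """Serve one flow at a time, in `order`, at the full link rate."""
    finish, t = [0.0] * len(sizes), 0.0
    for i in order:
        t += sizes[i]
        finish[i] = t
    return finish

def coflow_completion_times(policy):
    """Per-link scheduling suffices here: there is no sender-side coupling."""
    by_link = defaultdict(list)
    for idx, (_, link, _) in enumerate(FLOWS):
        by_link[link].append(idx)
    finish = {}
    for idxs in by_link.values():
        sizes = [FLOWS[i][2] for i in idxs]
        if policy == "fair":
            done = fair_finish(sizes)
        elif policy == "smallest-flow-first":
            done = serial_finish(sizes, sorted(range(len(sizes)), key=lambda k: sizes[k]))
        else:  # "coflow-order": all of C1 strictly before any of C2
            rank = {"C1": 0, "C2": 1}
            done = serial_finish(
                sizes, sorted(range(len(sizes)), key=lambda k: rank[FLOWS[idxs[k]][0]]))
        for pos, idx in enumerate(idxs):
            finish[idx] = done[pos]
    cct = defaultdict(float)
    for idx, (coflow, _, _) in enumerate(FLOWS):
        cct[coflow] = max(cct[coflow], finish[idx])  # a coflow ends with its last flow
    return dict(cct)

print(coflow_completion_times("fair"))                 # {'C1': 5.0, 'C2': 6.0}
print(coflow_completion_times("smallest-flow-first"))  # {'C1': 5.0, 'C2': 6.0}
print(coflow_completion_times("coflow-order"))         # {'C1': 3.0, 'C2': 6.0}
```

Both flow-level policies hold Coflow 1 at 5 because they advance Coflow 2's small flow on Link 1, even though Coflow 2 cannot finish before time 6 anyway; coflow-aware ordering exploits exactly that slack.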

SLIDE 19

Inter-Coflow Scheduling is NP-Hard

It is concurrent open shop scheduling with coupled resources: input and output links must be allocated together.

  • Examples of concurrent open shop include job scheduling and caching blocks
  • Existing solutions use an ordering heuristic
  • Coflow scheduling must additionally consider matching constraints
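One such ordering heuristic is Varys's Smallest-Effective-Bottleneck-First (SEBF): order coflows by the completion time of their most-loaded port. A sketch; the 2x2 demand matrices below are hypothetical, echoing the earlier two-coflow example:

```python
def bottleneck_time(demand, capacity=1.0):
    """Completion time of a coflow given the whole fabric to itself:
    the load of its most-loaded ingress or egress port over port capacity.
    demand[i][j] = data (units) from ingress port i to egress port j."""
    ingress_load = [sum(row) for row in demand]
    egress_load = [sum(col) for col in zip(*demand)]
    return max(max(ingress_load), max(egress_load)) / capacity

# Hypothetical demand matrices for two coflows on a 2-port fabric.
coflows = {
    "C1": [[3.0, 0.0],
           [0.0, 0.0]],
    "C2": [[2.0, 0.0],
           [0.0, 6.0]],
}
order = sorted(coflows, key=lambda c: bottleneck_time(coflows[c]))
print(order)  # ['C1', 'C2']: C1's bottleneck is 3 units, C2's is 6
```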

SLIDE 20

Many Problems to Solve

              Varys     Aalo      HUG
Objective     Min CCT   Min CCT   Fair CCT
Clairvoyant   Yes       No        No
Optimal       Yes       No        No

SLIDE 21

Coflow-Based Architecture

Centralized master-slave architecture:

  • Applications use a client library to communicate with the master
  • Actual timing and rates are determined by the coflow scheduler

[Figure: a master/coordinator runs the coflow scheduler and coordinates local daemons; on each machine, computation tasks reach the network interface through the local daemon.]
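A sketch of what the client library might look like. The register/put/get/unregister shape follows Varys's API, but every signature below is illustrative rather than the system's actual interface:

```python
class CoflowClient:
    """Illustrative client-library stub for talking to the coflow master."""

    def __init__(self, master_addr):
        self.master_addr = master_addr  # the master makes all scheduling decisions
        self._next_id = 0

    def register(self, num_flows=None):
        """Announce a new coflow (and, if known, its number of flows)."""
        self._next_id += 1
        return f"coflow-{self._next_id}"

    def put(self, coflow_id, data_id, payload):
        """Hand data to the local daemon; the scheduler decides when and
        how fast it actually crosses the network."""
        raise NotImplementedError("stub: would enqueue via the local daemon")

    def get(self, coflow_id, data_id):
        """Fetch data; would block until the scheduler serves this flow."""
        raise NotImplementedError("stub")

    def unregister(self, coflow_id):
        """Mark the coflow finished so the scheduler can move on."""
        raise NotImplementedError("stub")
```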

SLIDE 22

Coflow API

Change the applications:

  • At the very least, we need to know what a coflow is
  • For clairvoyant versions, we need more information
  • Changing the framework instead can enable ALL jobs to take advantage of coflows

DO NOT change the applications1:

  • Infer coflows from network traffic patterns
  • Design robust coflow schedulers that can tolerate misestimations
  • Our current solution only works for coflows without dependencies; we need DAG support!

  • 1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM'2016.

SLIDE 23

Performance Benefits of Using Coflows

[Bar chart, "Overhead Over Varys" (lower is better), comparing Varys (1.00), per-flow fairness1 (3.21), FIFO4 (5.65), per-flow prioritization2,3 (5.53), FIFO-LM (22.07), and a non-clairvoyant scheme (1.10).]

  • 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
  • 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
  • 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
  • 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014.

SLIDE 24

The Need for Coordination

# (Emulated) Aalo Slaves           100   1000   10000   50000   100000
Average Coordination Time (ms)       8     17     115     495      992

Coordination is necessary to determine, in real time:

  • Coflow size (sum);
  • Coflow rates (max);
  • Partial order of coflows (ordering).

It can be a large source of overhead:

  • It matters little for large coflows in slow networks, but …

How to perform decentralized coflow scheduling?
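To see why this matters, a back-of-the-envelope check using the measured coordination times above: if each scheduling round costs t ms, coflows shorter than t/ε cannot keep coordination below a fraction ε of their completion time. The simple one-round model and the 10% target are assumptions:

```python
# Average coordination time (ms) vs. number of emulated Aalo slaves (from above).
coordination_ms = {100: 8, 1000: 17, 10000: 115, 50000: 495, 100000: 992}

def min_coflow_duration_ms(num_slaves, eps=0.1):
    """Shortest coflow for which one coordination round stays under
    a fraction `eps` of the coflow's completion time (simple model)."""
    return coordination_ms[num_slaves] / eps

print(min_coflow_duration_ms(100))     # 80.0   -> sub-100 ms coflows are fine
print(min_coflow_duration_ms(100000))  # 9920.0 -> only multi-second coflows amortize it
```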

SLIDE 25

Coflow-Aware Load Balancing

Especially useful in asymmetric topologies

  • For example, in the presence of switch or link failures

Provides an additional degree of freedom

  • During path selection
  • For dynamically determining load balancing granularity

Increases the need for coordination, at an even higher cost

SLIDE 26

Coflow-Aware Routing

Relevant in topologies w/o full bisection bandwidth

  • When topologies have temporary in-network oversubscriptions
  • In geo-distributed analytics

Scheduling-only solutions do not work well

  • Calls for routing-scheduling joint solutions
  • Must take network utilization into account
  • Must avoid frequent path changes

Increased need for coordination

SLIDE 27

Coflows in Circuit-Switched Networks

Circuit switching is relevant again due to the rise of optical networks:

  • Provides very high bandwidth
  • Setting up new circuits is expensive

Co-scheduling applications and coflows:

  • Schedule tasks so that we can reuse already-established circuits
  • Perform in-network aggregation using existing circuits instead of waiting for new circuits to be created

SLIDE 28

Extension to Multiple Resources1

A DAG of coflows is very similar to a job DAG of stages.

  • The same principle applies, but with new challenges
  • Consider both fungible resources (bandwidth) and non-fungible resources (CPU cores)
  • Across the entire DAG

  • 1. Altruistic Scheduling in Multi-Resource Clusters, OSDI'2016.
SLIDE 29

Coflow

Communication abstraction for data-parallel applications to express their performance goals.

Key open challenges:

  • 1. Better theoretical understanding
  • 2. Efficient solutions for decentralization, topologies, multi-resource settings, estimation over DAGs, circuit switching, etc.

More information:

  • 1. Papers: http://www.mosharaf.com/publications/
  • 2. Software/simulator/workloads: https://github.com/coflow