Recent Advances and What’s Next?
Coflow
Mosharaf Chowdhury University of Michigan
Rack-Scale Computing | Datacenter-Scale Computing | Geo-Distributed Computing
Fast Analytics Over the WAN
Proactive Analytics: Before You Think!
Coflow Networking
Apache Spark (open source)
Cluster File System (open source)
Resource Allocation (Microsoft)
DAG Scheduling (Apache YARN)
Cluster Caching (Alluxio)
Rack-Scale Computing (< 0.01 ms) | Datacenter-Scale Computing (~1 ms) | Geo-Distributed Computing (> 100 ms)
Big Data
The volume of data businesses want to make sense of is increasing, and so is the variety of its sources.
Disks, SSDs, and memory keep getting cheaper, while processor speeds have stalled.
Big Datacenters for Massive Parallelism
2005–2015: MapReduce, Dryad, Hadoop, DryadLINQ, Pregel, Hive, Storm, Dremel, Spark, GraphLab, GraphX, Spark-Streaming, BlinkDB
Distributed Data-Parallel Applications
Multi-stage dataflow
Computation stages (e.g., Map, Reduce) alternate with communication stages (e.g., the Shuffle between the Map and Reduce stages).
A communication stage cannot complete until all of its data have been transferred.
Communication is Crucial for Performance
As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck
Facebook jobs spend ~25% of their runtime, on average, in intermediate communication.1
Faster Communication Stages: Traditional Networking Approach
Flow: transfers data from a source to a destination; the independent unit of allocation, sharing, load balancing, and/or prioritization.
Existing Solutions
1980s–2015: GPS, RED, WFQ, CSFQ, ECN, RCP, XCP, DCTCP, D3, DeTail, D2TCP, FCP, PDQ, pFabric
Objectives: per-flow fairness and flow completion time
Independent flows cannot capture the collective communication behavior common in data-parallel applications
Why Do They Fall Short?
[Figure: a shuffle over a datacenter fabric, with senders s1–s3 transferring to receivers r1 and r2 over the fabric's input and output links.]
Per-Flow Fair Sharing: flows finish at times 3, 3, and 5 on each receiver's link; average flow completion time = 3.66; Shuffle Completion Time = 5
Solutions focusing on flow completion time cannot further decrease the shuffle completion time
Improve Application-Level Performance1
Data-Proportional Allocation: slow down faster flows to accelerate slower flows.
Every flow now finishes at time 4, so the average flow completion time rises from 3.66 to 4, but the Shuffle Completion Time drops from 5 to 4.
1 Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011.
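The allocation above can be sketched in a few lines; this is my own minimal model (made-up flow names and sizes, unit link capacities), not Orchestra's implementation. Giving each flow a rate proportional to its size makes every flow finish exactly when the bottleneck link drains, which is the earliest the shuffle can possibly finish:

```python
# Sketch of Orchestra-style data-proportional allocation (MADD).
# Flow names and sizes below are illustrative, not the talk's exact example.
from collections import defaultdict

# flow -> (sender link, receiver link, size in units)
flows = {
    "s1->r1": ("s1", "r1", 2.0),
    "s2->r1": ("s2", "r1", 2.0),
    "s3->r2": ("s3", "r2", 4.0),
}

def madd_rates(flows, capacity=1.0):
    # Total load per link determines the bottleneck completion time.
    load = defaultdict(float)
    for snd, rcv, size in flows.values():
        load[snd] += size
        load[rcv] += size
    t_opt = max(load.values()) / capacity
    # rate_i = size_i / t_opt: faster flows are slowed to the bottleneck's
    # pace, and no link's rates ever exceed its capacity.
    return t_opt, {f: size / t_opt for f, (_, _, size) in flows.items()}

t_opt, rates = madd_rates(flows)
print(t_opt)   # 4.0 -- every flow finishes exactly at the bottleneck time
assert all(r <= 1.0 for r in rates.values())
```

Note the design point: any work-conserving scheme also finishes the bottleneck link at the same time, but data-proportional allocation guarantees no flow finishes needlessly early, freeing bandwidth for other coflows without delaying this one.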
Coflow: a communication abstraction for data-parallel applications to express their performance goals
Patterns range from a single flow and parallel flows to aggregation, broadcast, shuffle, and all-to-all.
How to schedule coflows…
1. for faster completion?
2. to meet more deadlines?
3. for fair allocation of the network?
Faster, application-aware data transfers throughout the network
Consistent calculation and enforcement of scheduler decisions
Decouples network optimizations from applications, relieving developers and end users
1 Efficient Coflow Scheduling with Varys, SIGCOMM’2014.
2 Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM’2015.
Benefits of Inter-Coflow Scheduling
[Example: two links of unit capacity. Coflow 1 has 3 units on Link 1; Coflow 2 has 2 units on Link 1 and 6 units on Link 2.]
Fair Sharing: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6
Smallest-Flow First1,2: Coflow 1 comp. time = 5, Coflow 2 comp. time = 6
The Optimal: Coflow 1 comp. time = 3, Coflow 2 comp. time = 6
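A small simulation can check these numbers. The setup below is my reconstruction of the slide's example (Coflow 1: 3 units on Link 1; Coflow 2: 2 units on Link 1 and 6 units on Link 2; unit-capacity links), and each link is scheduled independently:

```python
# Sketch comparing per-flow fairness, smallest-flow-first, and a
# coflow-aware ordering on the two-coflow example above.

def fair_times(sizes):
    """Per-flow max-min fair sharing on one link -> finish time per flow."""
    remaining, t, done = dict(sizes), 0.0, {}
    while remaining:
        rate = 1.0 / len(remaining)               # equal share per active flow
        step = min(remaining.values()) / rate     # time until next flow ends
        t += step
        for f in list(remaining):
            remaining[f] -= rate * step
            if remaining[f] <= 1e-9:
                done[f] = t
                del remaining[f]
    return done

def serial_times(sizes, order):
    """Run flows one after another in the given priority order."""
    t, done = 0.0, {}
    for f in order:
        t += sizes[f]
        done[f] = t
    return done

def cct(done_by_link):
    """Coflow completion time = latest finish among the coflow's flows."""
    out = {}
    for done in done_by_link:
        for (coflow, _), t in done.items():
            out[coflow] = max(out.get(coflow, 0.0), t)
    return out

link1 = {("C1", "f1"): 3.0, ("C2", "f2"): 2.0}
link2 = {("C2", "f3"): 6.0}

fair = cct([fair_times(link1), fair_times(link2)])
sff  = cct([serial_times(link1, [("C2", "f2"), ("C1", "f1")]),  # smallest flow first
            serial_times(link2, [("C2", "f3")])])
opt  = cct([serial_times(link1, [("C1", "f1"), ("C2", "f2")]),  # coflow-aware order
            serial_times(link2, [("C2", "f3")])])
print(fair)  # C1 finishes at 5, C2 at 6
print(sff)   # C1 at 5, C2 at 6 -- no better than fairness here
print(opt)   # C1 at 3, C2 at 6 -- scheduling C1's flow first helps C1 for free
```

The point the simulation makes: Coflow 2 cannot finish before time 6 anyway (its 6-unit flow pins Link 2), so letting Coflow 1 go first on Link 1 costs Coflow 2 nothing.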
Inter-Coflow Scheduling
Inter-coflow scheduling is NP-hard: it corresponds to Concurrent Open Shop Scheduling with Coupled Resources.
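Because the problem is NP-hard, practical schedulers fall back on heuristics. A minimal sketch of a Varys-style Smallest-Effective-Bottleneck-First ordering, with made-up coflow names and sizes:

```python
# Sketch (assumption: this mirrors the spirit of Varys' SEBF heuristic):
# greedily order coflows by the completion time of their most-loaded port.
from collections import defaultdict

def bottleneck(coflow):
    """coflow: list of (src_port, dst_port, size). Returns the time the
    coflow needs on its most-loaded port at unit capacity."""
    load = defaultdict(float)
    for src, dst, size in coflow:
        load[src] += size
        load[dst] += size
    return max(load.values())

def sebf_order(coflows):
    # Smallest effective bottleneck first: short coflows jump ahead of
    # long ones, shrinking average coflow completion time.
    return sorted(coflows, key=lambda name: bottleneck(coflows[name]))

coflows = {  # toy coflows; names and sizes are illustrative
    "C1": [("s1", "r1", 3.0)],
    "C2": [("s1", "r1", 2.0), ("s2", "r2", 6.0)],
}
print(sebf_order(coflows))  # ['C1', 'C2'] -- C1's bottleneck (3) < C2's (6)
```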
Many Problems to Solve

              Min CCT   Min CCT   Fair CCT
Clairvoyant   Yes       No        No
Optimal       Yes       No        No
Coflow-Based Architecture
Centralized master-slave architecture: local daemons on each machine communicate with the master/coordinator.
Actual timing and rates are determined by the coflow scheduler.
[Figure: computation tasks hand their flows (f) to local daemons via the network interface; the daemons coordinate with the coflow scheduler on the master/coordinator.]
1 CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM’2016.
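A toy model of this master-slave loop (my own sketch, not Varys/Aalo code; the ordering policy here is just a stand-in): daemons report per-coflow progress, and the coordinator aggregates it into a global schedule that daemons then enforce.

```python
# Sketch of centralized coflow coordination: local daemons observe flows,
# the master computes one consistent cross-machine ordering.

class LocalDaemon:
    def __init__(self, name):
        self.name = name
        self.sent = {}            # coflow id -> bytes sent so far

    def report(self):
        return dict(self.sent)    # snapshot shipped to the coordinator

class Coordinator:
    def __init__(self, daemons):
        self.daemons = daemons

    def schedule(self):
        # Aggregate observed progress across all daemons, then order
        # coflows by total bytes sent (a smallest-first-style policy;
        # a real scheduler would use a smarter rule).
        totals = {}
        for d in self.daemons:
            for coflow, sent in d.report().items():
                totals[coflow] = totals.get(coflow, 0) + sent
        return sorted(totals, key=totals.get)

d1, d2 = LocalDaemon("m1"), LocalDaemon("m2")
d1.sent = {"C1": 10, "C2": 50}
d2.sent = {"C1": 5}
print(Coordinator([d1, d2]).schedule())  # ['C1', 'C2']
```

The key property this models: no single daemon sees enough of a coflow to rank it; only the aggregated view lets every machine make consistent decisions.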
Coflow API
Change the applications: developers know what a coflow is and can provide more information.
Changing the framework can enable ALL jobs to take advantage.
Or DO NOT change the applications1: identify coflows from their traffic patterns; the scheduler can tolerate misestimations.
Our current solution only works for coflows without dependencies; we need DAG support!
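For illustration, a coflow-style API from the application's point of view might look like the following. All names here are hypothetical (the Varys paper describes a similar register/put/get-style interface); this in-memory stub only shows the shape of the calls, not any scheduling:

```python
# Hypothetical coflow API sketch: senders and receivers tag their
# transfers with a shared coflow id so the scheduler can treat them
# as one unit.

class CoflowClient:
    def __init__(self):
        self._store = {}

    def register(self, width):
        """Announce a new coflow (width = expected number of flows)."""
        coflow_id = f"coflow-{len(self._store)}"
        self._store[coflow_id] = {}
        return coflow_id

    def put(self, coflow_id, key, data):
        """A sender task contributes one flow's data to the coflow."""
        self._store[coflow_id][key] = data

    def get(self, coflow_id, key):
        """A receiver task fetches data; a real implementation would let
        the coflow scheduler pace this transfer."""
        return self._store[coflow_id][key]

# Usage: a mapper writes shuffle output, a reducer reads it back.
client = CoflowClient()
cid = client.register(width=2)
client.put(cid, "map0->reduce0", b"payload")
print(client.get(cid, "map0->reduce0"))  # b'payload'
```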
Performance Benefits of Using Coflows
[Bar chart from the Varys and Aalo evaluations, overhead over Varys (lower is better): Varys 1.00, Fair 3.21, FIFO 5.65, Priority 5.53, FIFO-LM 22.07, NC 1.10. Fair = per-flow fairness; Priority = per-flow prioritization.]
1 Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM’2011
2 pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM’2013
3 Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM’2014
The Need for Coordination
[Average coordination time vs. number of (emulated) Aalo slaves: 100 → 8 ms; 1,000 → 17 ms; 10,000 → 115 ms; 50,000 → 495 ms; 100,000 → 992 ms.]
Coordination is necessary to determine rates in realtime. It can be a large source of overhead for coflows in slow networks, but …
How to perform decentralized coflow scheduling?
Coflow-Aware Load Balancing
Especially useful in asymmetric topologies
Provides an additional degree of freedom
Increased need for coordination, but at an even higher cost
Coflow-Aware Routing
Relevant in topologies w/o full bisection bandwidth
Scheduling-only solutions do not work well
Increased need for coordination
Coflows in Circuit-Switched Networks
Circuit switching is relevant again due to the rise of optical networks
Co-scheduling applications and coflows to determine which circuits to create
Extension to Multiple Resources
A DAG of coflows is very similar to a job DAG of stages, and it raises similar challenges.
Consider both fungible (e.g., bandwidth) and non-fungible (e.g., cores) resources.
Communication abstraction for data-parallel applications to express their performance goals
Key open challenges
1. Better theoretical understanding
2. Efficient solutions to deal with decentralization, topologies, multi-resource settings, estimations over DAGs, circuit-switching, etc.
More information
1. Papers: http://www.mosharaf.com/publications/
2. Software/simulator/workloads: https://github.com/coflow