Coflow: Recent Advances and What's Next? Mosharaf Chowdhury - PowerPoint PPT Presentation



SLIDE 1

Coflow: Recent Advances and What's Next?

Mosharaf Chowdhury, University of Michigan

SLIDE 2

Research overview:

  • Rack-Scale Computing
  • Datacenter-Scale Computing
  • Geo-Distributed Computing: Fast Analytics Over the WAN
  • Proactive Analytics: Before You Think!
  • Coflow Networking
  • Open Source: Apache Spark; an open-source cluster file system
  • Resource Allocation (Facebook)
  • DAG Scheduling (Microsoft, Apache YARN)
  • Cluster Caching (Alluxio)

SLIDE 3

Typical network latencies at each scale:

  • Rack-Scale Computing: < 0.01 ms
  • Datacenter-Scale Computing: ~ 1 ms
  • Geo-Distributed Computing: > 100 ms

SLIDE 4

Big Data

The volume of data businesses want to make sense of is increasing, as is the variety of sources:

  • Web, mobile, wearables, vehicles, scientific, …

Meanwhile, disks, SSDs, and memory keep getting cheaper, while processor speeds have stalled.

SLIDE 5

Big Datacenters for Massive Parallelism

A decade of data-parallel frameworks (2005–2015): MapReduce, Hadoop, Spark1, Hive, Dryad, DryadLINQ, Spark Streaming, GraphX, GraphLab, Pregel, Storm, Dremel, BlinkDB.

  • 1. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI'2012.
SLIDE 6

Distributed Data-Parallel Applications

Multi-stage dataflow

  • Computation interleaved with communication

Computation Stage (e.g., Map, Reduce)

  • Distributed across many machines
  • Tasks run in parallel

Communication Stage (e.g., Shuffle)

  • Between successive computation stages

[Figure: a Map stage connected to a Reduce stage by a shuffle.]

A communication stage cannot complete until all the data have been transferred
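Because a stage is gated by its slowest flow, stage completion time is the maximum, not the average, of per-flow finish times. A minimal illustration (the finish times are hypothetical):

```python
# Hypothetical finish times (time units) for the flows of one shuffle.
flow_finish_times = [3, 3, 5]

# The stage completes only when its last flow does.
stage_completion = max(flow_finish_times)                              # 5
avg_flow_completion = sum(flow_finish_times) / len(flow_finish_times)  # ~3.66
```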

SLIDE 7

Communication is Crucial for Performance

As SSD-based and in-memory systems proliferate, the network is likely to become the primary bottleneck

Facebook jobs spend ~25% of runtime on average in intermediate communication.1

  • 1. Based on a month-long trace with 320,000 jobs and 150 million tasks, collected from a 3000-machine Facebook production MapReduce cluster.

SLIDE 8

Faster Communication Stages: The Traditional Networking Approach

Flow

  • Transfers data from a source to a destination
  • Independent unit of allocation, sharing, load balancing, and/or prioritization

SLIDE 9

Existing Solutions

Decades of flow-level mechanisms: GPS, RED, WFQ, CSFQ (1980s–1990s); ECN, XCP, RCP (2000s); DCTCP, D2TCP, D3, PDQ, FCP, DeTail, pFabric (2005–2015). The focus shifted from per-flow fairness to flow completion time.

Independent flows cannot capture the collective communication behavior common in data-parallel applications

SLIDE 10

Why Do They Fall Short?

[Figure: the datacenter fabric abstracted as one non-blocking switch with three input links and three output links; senders s1–s3 transfer to receivers r1 and r2 across it.]

SLIDE 11

Why Do They Fall Short?

[Figure: the same fabric, with a shuffle from s1, s2, s3 to r1 and r2 laid onto the input and output links.]

SLIDE 12

Why Do They Fall Short?

[Figure: per-flow fair sharing on the fabric. Flows finish at times 3, 3, and 5 on the links to r1 and r2; shuffle completion time = 5, average flow completion time = 3.66.]

Solutions focusing on flow completion time cannot further decrease the shuffle completion time.

SLIDE 13

Improve Application-Level Performance1

Slow down faster flows to accelerate slower flows.

[Figure: under per-flow fair sharing, flows finish at times 3, 3, and 5 (shuffle completion time = 5; average flow completion time = 3.66). Under data-proportional allocation, every flow finishes at time 4 (shuffle completion time = 4; average flow completion time = 4).]

  • 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
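Orchestra's data-proportional allocation can be sketched for a single shared link: give each flow bandwidth proportional to its data, so all flows on the link finish together. A minimal sketch; the flow sizes and unit link capacity below are hypothetical:

```python
def data_proportional_rates(flow_sizes, capacity=1.0):
    """Split link bandwidth in proportion to each flow's data,
    so every flow on the link finishes at the same instant."""
    total = sum(flow_sizes)
    return [capacity * size / total for size in flow_sizes]

sizes = [1.0, 1.0, 2.0]                 # hypothetical flow sizes on one link
rates = data_proportional_rates(sizes)  # [0.25, 0.25, 0.5]
finish = [s / r for s, r in zip(sizes, rates)]
# All flows finish together at total/capacity = 4.0: faster flows were
# slowed down, but no flow finishes later than the transfer itself.
```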

SLIDE 14

Coflow

Communication abstraction for data-parallel applications to express their performance goals.

Information a coflow can expose to the scheduler:

  • 1. Size of each flow;
  • 2. Total number of flows;
  • 3. Endpoints of individual flows;
  • 4. Dependencies between coflows.
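As a sketch, the information above could be carried by a coflow descriptor like the following (the names and fields are illustrative, not any system's actual API):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Flow:
    src: str                    # source endpoint
    dst: str                    # destination endpoint
    size: Optional[int] = None  # bytes; None when unknown (non-clairvoyant)

@dataclass
class Coflow:
    coflow_id: str
    flows: List[Flow] = field(default_factory=list)
    parents: List[str] = field(default_factory=list)  # coflow dependencies

    @property
    def num_flows(self) -> int:
        """Total number of flows in this coflow."""
        return len(self.flows)

    @property
    def total_size(self) -> Optional[int]:
        """Sum of flow sizes, or None if any size is unknown."""
        sizes = [f.size for f in self.flows]
        return None if None in sizes else sum(sizes)
```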
SLIDE 15

Coflows capture common communication patterns: a single flow, parallel flows, aggregation, broadcast, shuffle, and all-to-all.

SLIDE 16

How to schedule coflows online …

  • 1. … for faster completion of coflows?
  • 2. … to meet more deadlines?
  • 3. … for fair allocation of the network?

[Figure: a datacenter with N input links and N output links.]

SLIDE 17

Varys, Aalo & HUG

  • 1. Coflow Scheduler: faster, application-aware data transfers throughout the network
  • 2. Global Coordination: consistent calculation and enforcement of scheduler decisions
  • 3. The Coflow API: decouples network optimizations from applications, relieving developers and end users

  • 1. Efficient Coflow Scheduling with Varys, SIGCOMM'2014.
  • 2. Efficient Coflow Scheduling Without Prior Knowledge, SIGCOMM'2015.
  • 3. HUG: Multi-Resource Fairness for Correlated and Elastic Demands, NSDI'2016.

SLIDE 18

Benefits of Inter-Coflow Scheduling

Two coflows share two unit-capacity links: Coflow 1 has 3 units on Link 1; Coflow 2 has 2 units on Link 1 and 6 units on Link 2.

  • Fair Sharing: Coflow 1 completion time = 5, Coflow 2 completion time = 6
  • Smallest-Flow First1,2: Coflow 1 completion time = 5, Coflow 2 completion time = 6
  • The Optimal: Coflow 1 completion time = 3, Coflow 2 completion time = 6

  • 1. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
  • 2. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
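The example can be replayed in a small fluid-model simulation. One reading of the slide's numbers that reproduces all three outcomes, stated here as an assumption: Coflow 1 has a single 3-unit flow on Link 1, and Coflow 2 has a 2-unit flow on Link 1 plus a 6-unit flow on Link 2, with unit-capacity links:

```python
from collections import defaultdict

# Assumed instance: (coflow, link, size in units); unit-capacity links.
FLOWS = [("C1", "L1", 3.0), ("C2", "L1", 2.0), ("C2", "L2", 6.0)]

def fair_finish(sizes):
    """Fluid fair sharing (processor sharing) on one unit-capacity link."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    finish = [0.0] * len(sizes)
    t, served, active = 0.0, 0.0, len(sizes)
    for i in order:
        t += (sizes[i] - served) * active  # drain the next-smallest flow
        finish[i], served, active = t, sizes[i], active - 1
    return finish

def serial_finish(sizes, order):
    """Serve one flow at a time, in `order`, at the full link rate."""
    finish, t = [0.0] * len(sizes), 0.0
    for i in order:
        t += sizes[i]
        finish[i] = t
    return finish

def coflow_completion_times(policy):
    """Per-link scheduling suffices here: there is no sender-side coupling."""
    by_link = defaultdict(list)
    for idx, (_, link, _) in enumerate(FLOWS):
        by_link[link].append(idx)
    finish = {}
    for idxs in by_link.values():
        sizes = [FLOWS[i][2] for i in idxs]
        if policy == "fair":
            done = fair_finish(sizes)
        elif policy == "smallest-flow-first":
            done = serial_finish(sizes, sorted(range(len(sizes)), key=lambda k: sizes[k]))
        else:  # "coflow-order": all of C1 strictly before any of C2
            rank = {"C1": 0, "C2": 1}
            done = serial_finish(
                sizes, sorted(range(len(sizes)), key=lambda k: rank[FLOWS[idxs[k]][0]]))
        for pos, idx in enumerate(idxs):
            finish[idx] = done[pos]
    cct = defaultdict(float)
    for idx, (coflow, _, _) in enumerate(FLOWS):
        cct[coflow] = max(cct[coflow], finish[idx])  # a coflow ends with its last flow
    return dict(cct)

print(coflow_completion_times("fair"))                 # {'C1': 5.0, 'C2': 6.0}
print(coflow_completion_times("smallest-flow-first"))  # {'C1': 5.0, 'C2': 6.0}
print(coflow_completion_times("coflow-order"))         # {'C1': 3.0, 'C2': 6.0}
```

Both flow-level policies hold Coflow 1 at 5 because they advance Coflow 2's small flow on Link 1, even though Coflow 2 cannot finish before time 6 anyway; coflow-aware ordering exploits exactly that slack.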

SLIDE 19

Inter-Coflow Scheduling is NP-Hard

It is concurrent open shop scheduling with coupled resources: input and output links must be allocated together.

  • Examples of concurrent open shop include job scheduling and caching blocks
  • Existing solutions use an ordering heuristic
  • Coflow scheduling must additionally consider matching constraints
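One such ordering heuristic is Varys's Smallest-Effective-Bottleneck-First (SEBF): order coflows by the completion time of their most-loaded port. A sketch; the 2x2 demand matrices below are hypothetical, echoing the earlier two-coflow example:

```python
def bottleneck_time(demand, capacity=1.0):
    """Completion time of a coflow given the whole fabric to itself:
    the load of its most-loaded ingress or egress port over port capacity.
    demand[i][j] = data (units) from ingress port i to egress port j."""
    ingress_load = [sum(row) for row in demand]
    egress_load = [sum(col) for col in zip(*demand)]
    return max(max(ingress_load), max(egress_load)) / capacity

# Hypothetical demand matrices for two coflows on a 2-port fabric.
coflows = {
    "C1": [[3.0, 0.0],
           [0.0, 0.0]],
    "C2": [[2.0, 0.0],
           [0.0, 6.0]],
}
order = sorted(coflows, key=lambda c: bottleneck_time(coflows[c]))
print(order)  # ['C1', 'C2']: C1's bottleneck is 3 units, C2's is 6
```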

SLIDE 20

Many Problems to Solve

              Varys     Aalo      HUG
Objective     Min CCT   Min CCT   Fair CCT
Clairvoyant   Yes       No        No
Optimal       Yes       No        No

SLIDE 21

Coflow-Based Architecture

Centralized master-slave architecture:

  • Applications use a client library to communicate with the master
  • Actual timing and rates are determined by the coflow scheduler

[Figure: a master/coordinator runs the coflow scheduler and coordinates local daemons; on each machine, computation tasks reach the network interface through the local daemon.]
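A sketch of what the client library might look like. The register/put/get/unregister shape follows Varys's API, but every signature below is illustrative rather than the system's actual interface:

```python
class CoflowClient:
    """Illustrative client-library stub for talking to the coflow master."""

    def __init__(self, master_addr):
        self.master_addr = master_addr  # the master makes all scheduling decisions
        self._next_id = 0

    def register(self, num_flows=None):
        """Announce a new coflow (and, if known, its number of flows)."""
        self._next_id += 1
        return f"coflow-{self._next_id}"

    def put(self, coflow_id, data_id, payload):
        """Hand data to the local daemon; the scheduler decides when and
        how fast it actually crosses the network."""
        raise NotImplementedError("stub: would enqueue via the local daemon")

    def get(self, coflow_id, data_id):
        """Fetch data; would block until the scheduler serves this flow."""
        raise NotImplementedError("stub")

    def unregister(self, coflow_id):
        """Mark the coflow finished so the scheduler can move on."""
        raise NotImplementedError("stub")
```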

SLIDE 22

Coflow API

Change the applications:

  • At the very least, we need to know what a coflow is
  • For clairvoyant versions, we need more information
  • Changing the framework instead can enable ALL jobs to take advantage of coflows

DO NOT change the applications1:

  • Infer coflows from network traffic patterns
  • Design robust coflow schedulers that can tolerate misestimations
  • Our current solution only works for coflows without dependencies; we need DAG support!

  • 1. CODA: Toward Automatically Identifying and Scheduling Coflows in the Dark, SIGCOMM'2016.

SLIDE 23

Performance Benefits of Using Coflows

[Bar chart, "Overhead Over Varys" (lower is better), comparing Varys (1.00), per-flow fairness1 (3.21), FIFO4 (5.65), per-flow prioritization2,3 (5.53), FIFO-LM (22.07), and a non-clairvoyant scheme (1.10).]

  • 1. Managing Data Transfers in Computer Clusters with Orchestra, SIGCOMM'2011.
  • 2. Finishing Flows Quickly with Preemptive Scheduling, SIGCOMM'2012.
  • 3. pFabric: Minimal Near-Optimal Datacenter Transport, SIGCOMM'2013.
  • 4. Decentralized Task-Aware Scheduling for Data Center Networks, SIGCOMM'2014.

SLIDE 24

The Need for Coordination

# (Emulated) Aalo Slaves           100   1000   10000   50000   100000
Average Coordination Time (ms)       8     17     115     495      992

Coordination is necessary to determine, in real time:

  • Coflow size (sum);
  • Coflow rates (max);
  • Partial order of coflows (ordering).

It can be a large source of overhead:

  • It matters little for large coflows in slow networks, but …

How to perform decentralized coflow scheduling?
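To see why this matters, a back-of-the-envelope check using the measured coordination times above: if each scheduling round costs t ms, coflows shorter than t/ε cannot keep coordination below a fraction ε of their completion time. The simple one-round model and the 10% target are assumptions:

```python
# Average coordination time (ms) vs. number of emulated Aalo slaves (from above).
coordination_ms = {100: 8, 1000: 17, 10000: 115, 50000: 495, 100000: 992}

def min_coflow_duration_ms(num_slaves, eps=0.1):
    """Shortest coflow for which one coordination round stays under
    a fraction `eps` of the coflow's completion time (simple model)."""
    return coordination_ms[num_slaves] / eps

print(min_coflow_duration_ms(100))     # 80.0   -> sub-100 ms coflows are fine
print(min_coflow_duration_ms(100000))  # 9920.0 -> only multi-second coflows amortize it
```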

SLIDE 25

Coflow-Aware Load Balancing

Especially useful in asymmetric topologies

  • For example, in the presence of switch or link failures

Provides an additional degree of freedom

  • During path selection
  • For dynamically determining load balancing granularity

Increases the need for coordination, at an even higher cost

SLIDE 26

Coflow-Aware Routing

Relevant in topologies w/o full bisection bandwidth

  • When topologies have temporary in-network oversubscriptions
  • In geo-distributed analytics

Scheduling-only solutions do not work well

  • Calls for routing-scheduling joint solutions
  • Must take network utilization into account
  • Must avoid frequent path changes

Increased need for coordination

SLIDE 27

Coflows in Circuit-Switched Networks

Circuit switching is relevant again due to the rise of optical networks:

  • Provides very high bandwidth
  • Setting up new circuits is expensive

Co-scheduling applications and coflows:

  • Schedule tasks so that we can reuse already-established circuits
  • Perform in-network aggregation using existing circuits instead of waiting for new circuits to be created

SLIDE 28

Extension to Multiple Resources1

A DAG of coflows is very similar to a job DAG of stages.

  • The same principle applies, but with new challenges
  • Consider both fungible resources (bandwidth) and non-fungible resources (CPU cores)
  • Across the entire DAG

  • 1. Altruistic Scheduling in Multi-Resource Clusters, OSDI'2016.
SLIDE 29

Coflow

Communication abstraction for data-parallel applications to express their performance goals.

Key open challenges:

  • 1. Better theoretical understanding
  • 2. Efficient solutions for decentralization, topologies, multi-resource settings, estimation over DAGs, circuit switching, etc.

More information:

  • 1. Papers: http://www.mosharaf.com/publications/
  • 2. Software/simulator/workloads: https://github.com/coflow