CLARINET: WAN-Aware Optimization for Analytics Queries

Raajay Viswanathan◦ Ganesh Ananthanarayanan† Aditya Akella◦

◦University of Wisconsin-Madison

†Microsoft

Abstract

Recent work has made the case for geo-distributed analytics, where data collected and stored at multiple datacenters and edge sites world-wide is analyzed in situ to drive operational and management decisions. A key issue in such systems is ensuring low response times for analytics queries issued against geo-distributed data. A central determinant of response time is the query execution plan (QEP). Current query optimizers do not consider the network when deriving QEPs, which is a key drawback as the geo-distributed sites are connected via WAN links with heterogeneous and modest bandwidths, unlike intra-datacenter networks. We propose CLARINET, a novel WAN-aware query optimizer. Deriving a WAN-aware QEP requires working jointly with the execution layer of analytics frameworks that places tasks to sites and performs scheduling. We design efficient heuristic solutions in CLARINET to make such a joint decision on the QEP. Our experiments with a real prototype deployed across EC2 datacenters, and large-scale simulations using production workloads show that CLARINET improves query response times by ≥ 50% compared to state-of-the-art WAN-aware task placement and scheduling.

1 Introduction

Large organizations, such as Microsoft, Facebook, and Google, each operate many 10s-100s of datacenters (DCs) and edge clusters worldwide [1, 5, 6, 13] where crucial services (e.g., chat/voice, social networking, and cloud-based storage) are hosted to provide low-latency access to (nearby) users. These sites routinely gather service data (e.g., end-user session logs) as well as server monitoring logs. Analyzing this geo-distributed data is important toward driving key operations and management tasks. Example analyses include querying server logs to maintain system health dashboards, querying session logs to aid server selection for video applications [15], and correlating network/server logs to detect attacks. Recent work has shown that centrally aggregating and analyzing this data using frameworks such as Spark [48] can be slow, i.e., it cannot support the timeliness requirements of the applications above [24], and can cause wasteful use of the expensive wide-area network (WAN) bandwidth [35, 43, 36]. In contrast, executing the analytics queries geo-distributedly on the data stored in-place at the sites—an approach called geo-distributed analytics (GDA)—can result in faster query completions [35, 43]. GDA entails bringing WAN-awareness to data analytics frameworks. Prior work on GDA has shown how to make query execution (specifically, data and task placement) WAN-aware [43, 35, 36].

This paper makes a strong case for pushing WAN-awareness up the data analytics stack, into query optimization. While it can substantially lower GDA query completion times, it requires radically new approaches to query optimization, and rethinking the division of functionality between query optimization and execution. The query optimizer (QO) takes a user's input query/script and determines an optimal query execution plan (QEP) from among many equivalent QEPs that differ in, e.g., their ordering of joins in the query. QOs in modern analytics frameworks [2, 7] largely use database technology developed over 30+ years of research. These QOs consider many factors (e.g., buffer cache and distribution of column values) but largely ignore the network because they were designed for a single-server setup. Some parallel databases considered the network, but they model the cost of any over-the-network access via a single parameter. This is less problematic within a DC where the network is high-bandwidth and homogeneous. Geo-distributed clusters, on the other hand, are connected by WAN links whose bandwidths are heterogeneous and limited (§2.1), varying by over 20×, because of differences in provisioning of WAN links as well as usage by different (non-analytics) applications.

Given this heterogeneity, existing network-agnostic QOs can produce query plans that are far from optimal (§2.2). For example, QOs decide the ordering of multi-way joins purely based on the size of the intermediate outputs. However, this can lead to heavy data transfer over thin WAN links, thereby inflating completion times. Likewise, today's QOs optimize one query at a time; as such, when multiple queries are issued simultaneously, their individual QEPs can contend for the same WAN links. Thus, we need a new approach for WAN-aware multi-query optimization. Arguably, because QOs sit upper-most in analytics stacks, their being network-agnostic fundamentally limits the benefits from downstream advances in task placement/scheduling [21, 43, 35, 36]. However, as data analytics queries are DAGs of interconnected tasks, WAN-aware query planning itself has to be performed in concert with the placement and scheduling of the queries' tasks and intermediate network transfers (§2.2), in contrast with most existing systems where these are conducted in isolation. This is because task placement impacts which WAN links are exercised by a given QEP, and scheduling impacts when they are exercised, both of which determine if the QEP is WAN-optimal. Unfortunately, formulating an optimal solution for such multi-query network-aware joint query planning, placement, and scheduling is computationally intractable.

We develop a novel heuristic for the above problem. First, we show how to compute the WAN-optimal QEP for a single query, which includes task placement and scheduling (§4). For tractability, our solution relies on reserving WAN links for scheduled (but yet to execute) tasks/transfers; however, we show that such link reservations lead to faster query completions in practice. Given a batch of n queries, we order them based on their individually optimal QEPs' expected completion times; the QEP for the ith query is chosen considering the WAN impact of the preceding i − 1 queries. This mimics shortest-job-first (SJF) order while allowing for cross-query optimization (§5.1). However, it results in bandwidth fragmentation (due to task dependencies), thereby hurting completion times. To overcome this, our final heuristic considers groups of k ≤ n queries from the above order and explores how to compact their schedules tightly in time, while obeying inter-task ordering (§5.2). The result is a cross-query schedule that veers from SJF but is closer to work-conserving, and offers low average completion times for GDA queries. We also extend the heuristic to accommodate fair treatment of queries, minimize WAN bandwidth costs, and handle online query arrivals (§5.3).

We have built our solution into CLARINET, a WAN-aware QO for Hive [3]. Instead of introducing WAN-awareness inside existing QOs [2, 7], CLARINET is architecturally outside of them. We modify existing QOs to simply output all the functionally equivalent QEPs for a query, and CLARINET picks the best WAN-aware QEP per query, as well as a task placement and schedule which it provides as hints to the query execution layer. Our design allows any analytics system to take advantage of CLARINET with minimal changes.

Figure 1: Architecture of GDA systems. A master (running the scheduler and namenode) coordinates workers at sites Site-1 through Site-4, which are interconnected by heterogeneous WAN tunnel bundles.

We deploy a CLARINET prototype across 10 regions on Amazon EC2, and evaluate it using realistic TPC-DS queries. We also conduct large-scale trace-driven simulations using production workloads from two large online service providers. Our evaluation shows that, compared to a baseline that uses a network-agnostic QO and task placement, CLARINET improves average query performance by 60-80% in different settings. We find that CLARINET's joint query planning and task placement/scheduling doubles the benefits compared to state-of-the-art WAN-aware placement/scheduling.

2 Background and Motivation

In this section, we first discuss the architectural details of GDA, focusing on WAN constraints. We then analyze how queries are handled in existing GDA systems.

2.1 Geo-Distributed Analytics

GDA Architecture: In GDA, there is a central master at one of the DCs/edge sites where queries—written, e.g., in SparkSQL [7], HiveQL [3], or Pig Latin [33]—are submitted. For every query, the QO at the master constructs an optimized query execution plan (QEP), essentially a DAG of many interdependent tasks. A centralized scheduler places the tasks in a QEP at nodes across different sites based on resource availability and schedules them based on task dependencies.¹

¹ Typically, the task scheduler, the namenode of the distributed file system, and the master all run at the same site to reduce inter-process communication latencies between them. However, it is possible to distribute them across different processing sites.


Figure 2: Distribution of bandwidth between data processing sites for (a) Amazon EC2 and (b) a large OSP, MICROSOFT. The bandwidths reported are normalized with respect to the minimum observed. For Amazon EC2, the bandwidth between a pair of sites is obtained through active measurements using iperf; the minimum bandwidth observed over a 10-minute interval is taken as the guaranteed bandwidth. For MICROSOFT, we use the topology and traffic information from applications to compute the guaranteed bandwidth.

WAN Constraints: The sites are inter-connected by a WAN which we assume is optimized using MPLS-based [45] or software-defined traffic engineering [23, 20, 26]. In either case, end-to-end tunnels are established by the WAN manager for forwarding analytics traffic between all site-pairs. The WAN manager updates tunnel capacities in response to background traffic shifts. The running time of queries is typically lower than the interval between WAN configuration changes (∼10-15 minutes [20, 23]); thus, we assume that the bandwidth between site-pairs remains constant for the duration of a query's execution. We can therefore abstract the WAN as a logical full mesh with fixed-bandwidth links (fig. 1). However, the available bandwidth between pairs of sites can differ significantly because of differences in physical topology and the traffic matrix of non-analytics applications. Figure 2 highlights this variation between pairs of the 10 Amazon EC2 regions, and for the DCs operated by MICROSOFT. The ratio of the highest to lowest bandwidth is > 20. Also, WAN bandwidth is 1-2 orders of magnitude less than intra-DC bandwidth (e.g., the maximum inter-site bandwidth is 450 Mbps between EC2 regions, whereas intra-site bandwidth is 10 Gbps). Thus, WAN bandwidth is highly constraining and a significant bottleneck for GDA, in contrast with intra-DC analytics.
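A minimal sketch of this abstraction (ours, not code from the paper): the WAN is modeled as a symmetric site-pair bandwidth matrix that is held fixed while a query runs, from which shuffle durations are estimated. The site names and bandwidth values below are illustrative placeholders.

```python
# Sketch of the WAN abstraction: a logical full mesh with fixed link bandwidths,
# assumed constant for a query's duration. Values are placeholders, not measurements.
BANDWIDTH_GBPS = {                       # symmetric, site-pair -> guaranteed Gbps
    ("site-1", "site-2"): 0.45,
    ("site-1", "site-3"): 0.10,
    ("site-2", "site-3"): 0.20,
}

def link_bw(a: str, b: str) -> float:
    """Bandwidth of the logical link between sites a and b (Gbps)."""
    return BANDWIDTH_GBPS.get((a, b), BANDWIDTH_GBPS.get((b, a), float("inf")))

def shuffle_secs(gigabytes: float, src: str, dst: str) -> float:
    """Estimated WAN transfer time for `gigabytes` of data from src to dst."""
    if src == dst:
        return 0.0                       # intra-site transfers are not WAN-bound
    return gigabytes * 8 / link_bw(src, dst)
```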

2.2 Illustrative Examples for Drawbacks of Current GDA Query Processing

Given a query, its relational operators (e.g., SELECT, GROUPBY, JOIN, TABLESCAN, etc.) are transformed to individual "map" or "reduce" stages in the QEPs compiled by the QO. For example, SELECT and TABLESCAN are transformed to a map stage, whereas JOIN or GROUPBY are transformed to a reduce stage. If the input tables for these operators are partitioned and/or spread out across different sites, then the execution of downstream reduce stages (for JOIN or GROUPBY) will involve the flow of data across the constrained WAN, limiting overall query performance. In what follows, we argue that because existing QOs do not account for such WAN constraints, their chosen QEP for a query may be sub-optimal. Because multiple queries can contend simultaneously for limited WAN bandwidth, QOs for GDA must account for two additional issues: (a) consider task placement and network transfer scheduling when determining a QEP's quality, and (b) plan for multiple queries at once.

Modern QOs employ a combination of heuristic techniques as well as cost-based optimization (CBO) to determine the best execution plan for each query. Heuristic optimization leverages widely accepted techniques—e.g., predicate push-down and partition pruning—for reducing query execution times. CBO explores the space of all possible QEPs—e.g., those generated by considering alternate join orderings of tables—and chooses the one with the least cost. The cost of a QEP is based on a cost model, which captures the cost of accessing a single byte of data over a resource, and cardinality estimates of intermediate data based on individual and cross-table statistics, histograms of column values, etc. CBOs also account for a variety of factors, including the availability of buffer cache and indexes. State-of-the-art technologies for accurate cardinality estimation and cost modeling have been developed over 30+ years of research. However, most have focused on single-server systems, which ignore the network as a factor in determining query performance [2, 10, 17]. Even parallel databases model the network as a "single pipe", which essentially assumes the entire network is homogeneous [34, 14, 25, 12, 32, 41, 38, 46, 49, 47, 28]. Such simple models are clearly insufficient to account for heterogeneous WAN bandwidths, which matter greatly in GDA (§2.1).

Importance of network-aware query optimization: Consider a three-way join, QA: σA(WS) ⋈ σA(SS) ⋈ σA(CS), shown in fig. 3(a), that compares overall sales for a set of items (starting with 'A') across three different tables; the tables are spread across different sites inter-connected by a WAN (fig. 4). QA can be executed through three different QEPs (figs. 3(b)-3(d)), one each from the three different join orders. Table 1 lists the sizes of the different intermediate outputs.

Table 1: Selectivity and join cardinality estimates
  Notation                          Size
  |σA(∗)|, |σX(∗)|, |σY|Z(∗)|       200 GB, 160 GB, 25 GB
  |σA|X(WS) ⋈ σA|X(SS)|             12 GB
  |σA|X(SS) ⋈ σA|X(CS)|             10 GB
  |σA|X(WS) ⋈ σA|X(CS)|             16 GB


Figure 3: An example SQL query and its different query execution plans. Each QEP corresponds to a different join order; note how the selectivity predicate is pushed down to minimize the records processed during the joins.

  (a) A sample SQL query:
      SELECT SS.item as item, SUM(SS.sales), SUM(WS.sales), SUM(CS.sales)
      FROM store_sales SS, web_sales WS, cat_sales CS
      WHERE SS.item == CS.item AND SS.item == WS.item
      GROUP BY item HAVING item STARTSWITH 'A'

  (b) QEP-1: the filtered SS and CS tables (200 GB each) are hash-joined; the 10 GB result is broadcast-joined with WS.
  (c) QEP-2: the filtered SS and WS tables are hash-joined first (12 GB result), then broadcast-joined with CS.
  (d) QEP-3: the filtered WS and CS tables are hash-joined first (16 GB result), then broadcast-joined with SS.

Figure 4: Three sites interconnected by bidirectional WAN links; each site contains a unique table. WS resides at DC1, SS at DC2, and CS at DC3; the link bandwidths are 80 Gbps (DC1-DC2), 100 Gbps (DC1-DC3), and 40 Gbps (DC2-DC3).

A network-agnostic QO, or one that models the entire network by a single parameter, will pick QEP-1 since it has the least output cardinality after the first join. QEP-1 will take 20.5s: the first join, σA(SS) ⋈ σA(CS), will be implemented as a hash join since it involves large tables; by placing reducers uniformly across sites DC2 and DC3, the join involves 100GB of data transfer in either direction. Over a 40Gbps link, this transfer takes 20s. The second join will be implemented as a broadcast join since one of the tables is small. It involves transferring the 10GB of data spread across sites DC2 and DC3 to DC1; the bottleneck is the transfer on the 80Gbps link, which takes 0.5s. Contrast this with QEP-3, which joins tables WS and CS first. Even though it has the highest cardinality for intermediate data, it might be advantageous because sites DC1 and DC3 have high bandwidth between them. By placing tasks uniformly between sites DC1 and DC3, the first join takes only 8s (100GB over the 100Gbps link). The second join takes 1.6s (8GB over the 40Gbps link). QEP-3 completes in 9.6s, or ≈53% faster.
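The arithmetic behind this example can be checked with a few lines of Python. The sketch below uses the bandwidths of fig. 4 and the cardinalities of Table 1; the helper names are ours, not CLARINET's.

```python
# Back-of-the-envelope recomputation of the QEP-1 vs. QEP-3 comparison above.
BW = {("DC1", "DC2"): 80, ("DC1", "DC3"): 100, ("DC2", "DC3"): 40}  # Gbps, fig. 4

def bw(a, b):
    return BW.get((a, b)) or BW[(b, a)]            # links are bidirectional

def transfer_secs(gigabytes, site_a, site_b):
    return gigabytes * 8 / bw(site_a, site_b)      # GB -> Gb, divide by Gbps

# A cardinality-only (network-agnostic) CBO ranks plans by intermediate size alone:
cardinality_gb = {"QEP-1": 10, "QEP-2": 12, "QEP-3": 16}         # from Table 1
print("cardinality-only choice:", min(cardinality_gb, key=cardinality_gb.get))  # QEP-1

# QEP-1: hash-join SS (DC2) with CS (DC3); reducers split evenly, so each site
# ships 100 GB of its 200 GB input to the other over the 40 Gbps link.
qep1 = transfer_secs(100, "DC2", "DC3") + transfer_secs(5, "DC1", "DC2")
# (broadcast join: 10 GB split 5/5 across DC2/DC3 goes to DC1; 80 Gbps bottleneck)

# QEP-3: hash-join WS (DC1) with CS (DC3) over the fast 100 Gbps link first,
# then broadcast-join the 16 GB result (8 GB per site) with SS at DC2 (40 Gbps).
qep3 = transfer_secs(100, "DC1", "DC3") + transfer_secs(8, "DC2", "DC3")

print("QEP-1:", qep1, "s")   # 20.5 s
print("QEP-3:", qep3, "s")   # 9.6 s
```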

Importance of considering placement in QEP selection: For each QEP, the exact pattern of traffic between the data processing sites depends not only on the nature of data flow between stages (e.g., scatter-gather, one-to-one) but also on the placement of tasks in each stage. Thus, placement must be taken into account in assessing QEP quality. Placement matters due to contention from currently running GDA queries. Consider a scenario where an already running query is using the logical links DC1 → DC3 and DC2 → DC1. Without control over task placement, a network-aware optimizer would choose QEP-1 for QA to avoid the links already being utilized; its running time would then be 20.5s. However, by choosing QEP-3 and placing all reduce tasks for the first join at DC1, we can completely avoid DC1 → DC3 and DC2 → DC1 and finish in 17.6s.²

Importance of considering scheduling in QEP selection: Similar to placement, the impact of scheduling network transfers should also be accounted for when assessing a QEP. Consider a query QX that is similar to QA in structure, but operates on a different slice of data, say, items starting with 'X'. Say QX arrives soon after two simple two-way joins, QY: σZ(WS) ⋈ σZ(CS) and QZ: σZ(WS) ⋈ σZ(SS), start to execute. The selectivity information for the queries' input datasets is shown in Table 1. Being two-way joins, QY and QZ have no choice of QEPs; they utilize the WAN bandwidth between DC1 and DC3, and DC1 and DC2, respectively. The joins take 5s each. Without control over scheduling, QEP-1 is the best choice for executing QX (since it avoids the links used by QY and QZ); its completion time is 16.5s.³ However, if we can control the scheduling of queries, we can still choose QEP-3 for QX and delay its network transfers by 5s. The completion time is then lowered to 13s.⁴

Multi-query optimization: QOs in modern stacks, e.g., Hive and SparkSQL, optimize each query individually. Resource contention among concurrent queries is potentially left to be resolved at the execution layer through scheduling and task placement. However, when contention cannot be resolved there, jointly determining the QEPs for all queries provides better opportunities to avoid resource contention, thereby resulting in lower query completion times. Classic multi-query database QOs [40] leverage efficient reuse of common sub-expressions across queries [39], shared scans [44], and sharing of buffer caches [19], but like single-query QOs they do not model the network.

² WS ⋈ CS takes 16s to transfer 200GB over the 100Gbps link, and the next join takes 1.6s to send 16GB over the 80Gbps link.
³ The first join of QEP-1 utilizes the bandwidth between DC2 and DC3 and takes 16s. The bottleneck for the second join, which starts after QZ completes, is the 80Gbps link; transferring 5GB over it takes 0.5s.
⁴ The first join of QEP-3 takes 6.4s (80GB between DC1 and DC3). The second join takes 1.6s to transfer 8GB from DC3 to DC2. With a delay of 5s (waiting for the two-way joins to finish), the completion time is 13s.


Consider a case where QA and QX arrive concurrently. A network-aware query optimizer will choose the same QEP for both queries, resulting in contention for bandwidth on the links between DC1 and DC3. The scheduler, to optimize for average completion time, will execute the shortest query first (QX in this case); the average running time will then be 12s. However, by choosing QEP-2 and QEP-3 for queries QX and QA respectively, we can completely avoid link contention and keep the average completion time at 9.4s (9.2s for QX and 9.6s for QA).

3 CLARINET’s Design

Accomplishing multi-query network-aware plan selection and task placement/scheduling requires an analytics framework where a single entity is simultaneously responsible for both QEP selection and scheduling. Current big-data analytics stacks, however, are highly modular, with individual components being developed and operated independently. Realizing joint optimization in such a setting would require radical changes.

Figure 5: CLARINET's late-binding design. The Hive QO (for HiveQL queries) and the Spark QO (for SQL queries) each emit multiple candidate QEPs per query to CLARINET, which forwards one QEP per query, with location and schedule hints, to the execution framework running on the multi-site cluster deployment (alongside the WAN manager and per-site resource managers).

CLARINET's design (Figure 5) addresses this challenge

via late-binding. In CLARINET, the QOs are modified to provide a set of functionally equivalent QEPs (QEP-Set)⁵ to an intermediate shim layer. The shim layer (CLARINET) collects the QEP-Sets from multiple queries/QOs and computes a single, optimal QEP per query, as well as location and scheduling hints, which it forwards to the execution framework. Each node (operator) in a QEP forwarded from a QO to CLARINET is annotated with its output cardinality and parallelism as estimated by the QO. The cardinality represents the total amount of data transferred from the current operator to its successor operator; as this data will potentially be sent over the WAN, cardinality directly affects QEP selection. The operator parallelism decides the number of tasks to be spawned for each operator. The location and scheduling hints suggested by CLARINET specify a location (at the data-center level) and a start time for each task.

⁵ A dynamic-programming-based QO will generate exponentially many query plans for each query. We limit the size of the QEP-Set by placing a bound (5 seconds) on the time spent exploring multiple query plans.

The late-binding approach offers several advantages. First, given the complexity of QOs, modifying their internal cost models to account for the WAN is quite challenging. Also, QOs with widely different objectives (Calcite [2] vs. Catalyst [10]) would have to be modified individually; e.g., SparkSQL's QO [10] should factor the availability of in-memory caches of RDDs [48] against WAN costs. By design, CLARINET requires no changes to a QO's internal cost model: any QO that can provide multiple QEPs based on its current cost model can be made WAN-aware through CLARINET. Second, by introducing an intermediate layer, CLARINET relieves (i) the analytics application (e.g., Hive) of making scheduling decisions and joint query optimization, and (ii) the execution layer of making any network-specific scheduling/placement decisions. This minimizes code changes to both the application and execution frameworks. Third, WAN awareness does not come at the cost of existing query optimizations; an application can prevent the WAN from interfering with plan selection by exposing only its QO's chosen best QEP.

Problem statement and assumptions: Given a set of n queries, CLARINET receives QEP-Sets QSj, j ∈ [1, ..., n], from the QOs corresponding to each query. Among the exponentially many combinations, the objective is to select exactly one QEP from each QSj, along with task locations and schedules, such that the average query run time is minimized. Here, the task locations determine the site at which tasks are executed, whereas the schedule determines the start time of each task. For analytical tractability, we require that tasks are scheduled such that network transfers on logical links between sites do not temporally overlap with one another.

This allows us to accurately determine the duration of network transfers and reduce the QEP selection problem to a well-studied job-shop scheduling problem. Such time multiplexing (or non-overlap) also has the advantage that resource sharing can be enforced through scheduling; (weighted) bandwidth sharing, on the other hand, requires additional per-transfer rate control on top of the rate control already enforced by the WAN manager. Crucially, the non-overlap assumption does not affect the quality of the solution: as we prove below, any optimal schedule has an equivalent optimal non-overlapped schedule.

Theorem: A schedule, S, of interdependent transfers over multiple network resources, where each transfer is allocated an arbitrary time-varying share of the available network bandwidth on a single resource, can be converted into an equivalent interruptible schedule, N, such that no two transfers in N share a resource at any given point in time, and the completion time of a transfer in N is not greater than its completion time in S.

Proof sketch: For a network transfer, f, on a resource, let s(f) and e(f) be its start and end times under schedule S. For each resource, the start and end times of all its transfers can be viewed as release times and deadlines, respectively. Converting S to N can be achieved by determining the earliest-deadline-first (EDF) schedule of flows for each resource independently. Given s(f) and e(f), an EDF schedule is feasible since S is a complete schedule. For a detailed proof, refer to [42].

We simplify further and focus on obtaining non-interruptible transfer schedules, because implementing interruptible transfers requires significant changes to query execution. However, such schedules do not permit perfect "packing" of transfers across links. The resulting fragmentation of link capacity delays the scheduling of network transfers and inflates completion times. Essentially, CLARINET incorporates a clever approach—which we develop gradually in the next two sections—that systematically combats such resource fragmentation and optimizes average query completion times. In sum, our simplifying assumptions do not impact CLARINET's effectiveness.

However, even with these assumptions, computing the best cross-query QEPs along with task placements and schedules is NP-hard. In fact, the problem is hard even for a single query [30, 31]. Our approach is as follows: we start with an effective heuristic for the best single-query QEP that decouples placement and scheduling (§4). We then use this to gradually develop our multi-query heuristic (§5).

4 Single Query WAN-Awareness

At a high level, WAN-aware QEP selection for a single query proceeds as follows: for every QEP in the query's QEP-Set, we determine the placement and schedule of tasks such that its running time is minimized. The QEP with the shortest running time is then selected. Because of the inherent hardness of joint placement and scheduling of DAGs [30, 31], CLARINET's approach is to decouple them, as described next.

4.1 Assigning Locations to Tasks in a QEP

Given a QEP, for tasks with no dependencies (e.g., map tasks) we use the common approach of “site-locality”, i.e., their locations are the same as the location of their input data. The placement of intermediate (reduce) tasks is decided based on the amount and location of intermediate data generated by their parents, along with the bandwidths at the sites.

Figure 6: (a) shows a simple MR job with 3 tasks in each stage. (b) shows the same job with placement information (color-coded) for all the tasks. (c) shows the corresponding augmented DAG, with tasks in the same stage and location coalesced into one; the resulting M and R sub-stages of the augmented DAG are shown, and shuffle tasks representing network transfers between tasks at different locations are shown explicitly.

We decide the task placements for a QEP iteratively for each of its stages in topological order. Since the query plan specifies a partial ordering between dependent stages, stages with no order among them can be simultaneously scheduled. To ensure that the placement decision for these stages takes into account other stages' decisions, we reserve a block of time on logical links for network transfers (consistent with our non-overlap assumption).

Formulation: The optimal task placement for a stage is obtained by solving a linear program which takes as input: (i) the distribution of output data (D_ℓ) from the predecessor stages across sites (ℓ), (ii) the inter-site WAN bandwidths (B_ℓ1^ℓ2), and (iii) the length of time (τ_ℓ1^ℓ2) for which stages that have no ordering with respect to the current stage have reserved inter-site links.⁶ The best distribution of tasks (r_ℓ) across sites is obtained by solving the following problem:

    min_r  max_{ℓ1,ℓ2}  ( D_ℓ1 · r_ℓ2 / B_ℓ1^ℓ2  +  τ_ℓ1^ℓ2 )    (1a)
    such that  r_ℓ2 ≥ 0                                           (1b)
               Σ_ℓ2  r_ℓ2 = 1                                     (1c)

Once the locations of the reducers are fixed, we use the resulting traffic pattern to update the durations for which resources are blocked for later stages; e.g., between ℓ1 and ℓ2 we increment the duration τ_ℓ1^ℓ2 by D_ℓ1 · r_ℓ2 / B_ℓ1^ℓ2.

⁶ ℓ, ℓ1, ℓ2 are indices over the set of data processing sites.
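To make the formulation concrete, the sketch below solves problem (1) with scipy.optimize.linprog, linearizing the min-max objective with an auxiliary bottleneck variable t. It is our own rendering of the formulation, not CLARINET's code; units are assumed consistent (data volume divided by bandwidth gives seconds), and same-site pairs are skipped since local data does not cross the WAN.

```python
# Sketch of the per-stage placement LP (1a)-(1c).
# D[l1]: data produced at site l1 by predecessor stages; B[l1][l2]: link bandwidth;
# tau[l1][l2]: time already reserved on the link; r[l2]: fraction of tasks at site l2.
import numpy as np
from scipy.optimize import linprog

def place_stage(D, B, tau):
    sites = len(D)
    n_vars = sites + 1                       # variables: r_0..r_{L-1}, t
    c = np.zeros(n_vars); c[-1] = 1.0        # minimize t, the bottleneck finish time

    A_ub, b_ub = [], []
    for l1 in range(sites):
        for l2 in range(sites):
            if l1 == l2:
                continue                     # co-located data never crosses the WAN
            # D[l1]*r[l2]/B[l1][l2] + tau[l1][l2] <= t  rewritten for linprog as
            # (D[l1]/B[l1][l2])*r[l2] - t <= -tau[l1][l2]
            row = np.zeros(n_vars)
            row[l2] = D[l1] / B[l1][l2]
            row[-1] = -1.0
            A_ub.append(row); b_ub.append(-tau[l1][l2])

    A_eq = [np.append(np.ones(sites), 0.0)]  # constraint (1c): sum of r_l2 equals 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * sites + [(0, None)], method="highs")
    return res.x[:sites], res.x[-1]          # task fractions, stage transfer time
```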

4.2 Scheduling tasks in a QEP

In contrast to scheduling within a DC, scheduling a QEP in a geo-distributed setting involves scheduling both the compute phase of a task and the transfer of its input data from remote sites. To explicitly model these network transfers, we augment the DAG of tasks representing the QEP with vertices (called shuffle tasks) corresponding to network transfers; fig. 6 shows an example. We assume that the compute phase of a task can only start after all its inputs are available at the site. Further, since tasks of a stage that are executed at the same site exercise the same network resource, we coalesce them into a sub-stage. This reduces the number of entities that need to be scheduled and the overall scheduling complexity.

We formulate the scheduling as a binary integer linear program that takes as input: (i) the coalesced DAG augmented with shuffle tasks, henceforth called the augmented-DAG, which captures dependencies among tasks, and (ii) the durations of compute and shuffle tasks. The duration of a compute task is the same as the expected running time of the task in an intra-DC setting. The duration of a shuffle between sites is estimated as the ratio of the data transferred to the WAN bandwidth between the sites. The objective is to determine the optimal start times for all the tasks in the augmented-DAG such that the overall execution time of the QEP is minimized.

Formulation: Let c_i be the augmented-DAG of the i-th QEP in the QEP-Set for the query. Let V_i represent the set of vertices in c_i and ≤_i the partial order between them. The start times of the vertices (s(·)) should obey the partial order ≤_i. Thus, for each pair of ordered vertices (u, v) ∈ ≤_i belonging to c_i,

s(v) ≥ s(u) + d(u) (2)

where d(·) represents the duration of vertices. We incorporate non-overlap of flows on network links in our scheduling problem by imposing:

    s(v) ≥ s(u) + d(u) − N(1 − z_uv)    (3a)
    s(u) ≥ s(v) + d(v) − N·z_uv         (3b)

where u and v are shuffle tasks that contend for the same network link, z_uv indicates that v is executed after u, and N is a large constant. When z_uv = 1, Equation (3a) ensures that the start time of vertex v is greater than the completion time of vertex u, while Equation (3b) is satisfied trivially. When z_uv = 0, the conditions invert. Equations (3a) and (3b) are enforced for all links. The completion time (Φ_i) of the i-th QEP is given by:

    Φ_i := max_{u ∈ V_i} ( s(u) + d(u) )    (4)

where u ranges over the vertices of c_i. We solve the program for all the QEPs in the QEP-Set; the one with the smallest duration is chosen to be executed.

Handling currently running queries: We "reserve" network links for tasks already placed and scheduled. Therefore, while computing the best schedule for a QEP, we have to factor in currently running queries that block resources. We add constraints to the above formulation in order to accommodate these currently running queries. Let B(r) be the set of time intervals for which existing queries block resource r, and let low(b) and high(b) be the lower and upper bounds of an interval b ∈ B(r). For every vertex u using a network link, we include these two constraints:

    s(u) ≥ high(b) − N(1 − x_ub)        (5a)
    low(b) ≥ s(u) + d(u) − N·x_ub       (5b)

where x_ub is a binary indicator denoting that u is scheduled after the interval b. Like eqs. (3a) and (3b), these constraints kick in alternately, ensuring that transfers do not overlap with reserved intervals.
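The sketch below (ours, not CLARINET's implementation) encodes equations (2)-(4) for a toy augmented-DAG using PuLP with its bundled CBC solver; the task names, durations, and big-M value are hypothetical. Constraints of the form (5a)/(5b) for pre-reserved intervals would be added analogously.

```python
# Minimal sketch of the scheduling ILP of Section 4.2 for one small augmented-DAG.
import pulp

dur = {"m1": 10, "shufA": 8, "shufB": 6, "r1": 5}                 # seconds (hypothetical)
edges = [("m1", "shufA"), ("m1", "shufB"), ("shufA", "r1"), ("shufB", "r1")]
link_of = {"shufA": ("DC1", "DC2"), "shufB": ("DC1", "DC2")}       # same link: no overlap

N = 10_000                                  # big-M, larger than any feasible finish time
prob = pulp.LpProblem("qep_schedule", pulp.LpMinimize)
s = {v: pulp.LpVariable(f"s_{v}", lowBound=0) for v in dur}        # start times
makespan = pulp.LpVariable("makespan", lowBound=0)
prob += makespan                                                   # objective: eq. (4)

for u, v in edges:                                                 # precedence, eq. (2)
    prob += s[v] >= s[u] + dur[u]

shuffles = list(link_of)
for i in range(len(shuffles)):
    for j in range(i + 1, len(shuffles)):
        u, v = shuffles[i], shuffles[j]
        if link_of[u] != link_of[v]:
            continue
        z = pulp.LpVariable(f"z_{u}_{v}", cat="Binary")            # 1 => v runs after u
        prob += s[v] >= s[u] + dur[u] - N * (1 - z)                # eq. (3a)
        prob += s[u] >= s[v] + dur[v] - N * z                      # eq. (3b)

for v in dur:                                                      # makespan epigraph
    prob += makespan >= s[v] + dur[v]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print({v: s[v].value() for v in dur}, "makespan:", makespan.value())
```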

5 Multiple Contending Queries

In this section, we build upon the solution in §4 to solve the problem statement outlined in §3 for a workload of ‘n’ (>1) queries that arrive simultaneously and compete with each other for the inter-site WAN links. In §5.1, we first provide a strawman algorithm that emulates shortest-job first (SJF), and iteratively determines the QEP, placement, and schedule for each of the ‘n’ queries. Unfortunately, the strawman algorithm results in a schedule with link resources being fallow close to 22% of the time (ref. Figure 12(a)) due to resource fragmentation. In §5.2, we present a novel heuristic that builds on the strawman and minimizes resource fragmentation; it combats fragmentation by carefully packing flows from k (≤ n) queries at a time from the schedule determined by the strawman. We discuss several enhancements in §5.3.

5.1 Strawman Iterative QEP Selection

Our strawman heuristic is based on shortest-job-first (SJF) scheduling. We pick this because it is a well-understood scheduling discipline that is typically used to minimize average completion times. The strawman functions iteratively: in every iteration, we pick the QEP and determine the schedule for exactly one query, as follows. For each QEP belonging to the QEP-Set of an unscheduled query, we calculate its duration, placement, and schedule of tasks (using the techniques in §4). We then pick the QEP (and thus the query) with the shortest duration among all the QEPs considered; we do not consider this query in future iterations. At the end of each iteration, we reserve the resources required by the chosen QEP. By doing so, we ensure that the running time for the query is not affected by queries considered in future iterations. Further, it ensures that the choice of QEPs in future iterations accounts for the current query's WAN impact, thereby enabling cross-query planning.
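In pseudocode form, the strawman can be sketched as follows (our rendering; plan_qep stands in for the §4 placement and scheduling machinery and is an assumed helper, not part of CLARINET's API):

```python
# Sketch of the iterative SJF strawman of Section 5.1.
# plan_qep(qep, reserved) -> (duration, placement, schedule, new_reservations)
def iterative_sjf(qep_sets, plan_qep):
    reserved = {}              # link -> list of reserved (start, end) intervals
    chosen = {}                # query -> (qep, placement, schedule)
    pending = dict(qep_sets)   # query -> list of candidate QEPs

    while pending:
        best = None            # (duration, query, qep, placement, schedule, new_rsv)
        for query, qeps in pending.items():
            for qep in qeps:
                dur, placement, schedule, new_rsv = plan_qep(qep, reserved)
                if best is None or dur < best[0]:
                    best = (dur, query, qep, placement, schedule, new_rsv)

        dur, query, qep, placement, schedule, new_rsv = best
        chosen[query] = (qep, placement, schedule)
        for link, intervals in new_rsv.items():            # reserve links so that later
            reserved.setdefault(link, []).extend(intervals) # queries plan around this QEP
        del pending[query]     # the shortest remaining query is now fixed

    return chosen
```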


Figure 7: Example highlighting fragmentation of resources with the SJF heuristic. Tasks A1 and A2 belong to Job A; A2 is dependent on A1. Job B has only one task, B1. A1 and B1 use resource R1; A2 uses resource R2. In the SJF schedule (a), resource R2 remains idle until t = 20s; a better schedule (b) avoids this idleness.

5.2 Final Heuristic to Combat Resource Fragmentation

The above iterative heuristic is not ideal because it can cause links to remain fallow at times, even if there are other flows which could use those links. See Figure 7; as shown, fragmentation arises because jobs need multiple resources (multiple links in this case) and because of dependencies across tasks. This shows that vanilla SJF is not ideal for minimizing average completion times in our setting. If not controlled, such underutilization of link resources can delay query completions arbitrarily.

We address this by modifying the above SJF strawman to use a knob (k) that reduces resource fragmentation. The knob allows us to deviate from the iteratively computed schedule towards a schedule with low resource fragmentation in a controlled manner. We start with the solution obtained from the iterative algorithm described in §5.1. The solution determines: (i) O, a total ordering over the queries based on their time of completion, and (ii) the mapping of inter-site flows to network resources (obtained from the choice of QEP and task placement). With this information, our final heuristic creates a constrained schedule as follows. We maintain a dynamic set, D, consisting of the k shortest queries (based on ordering O) for which at least one task is not yet scheduled. Whenever a flow belonging to a query in D is available (i.e., all its predecessors have completed) and the resource it needs is free, we immediately schedule the flow on the resource rather than wait for its start time in the iterative schedule. If multiple flows meet the criteria, we break ties in favor of short-duration flows. When all tasks for a query are scheduled, it is dropped from the dynamic set and a new query is added.

When k = 1, only flows belonging to the shortest query can be moved ahead; thus the resulting schedule will be close to the strawman's. When k equals the total number of concurrent queries n, the resulting schedule will have no fallow links (note that the query completion times and the ordering of flows on a resource will differ from those computed by our iterative SJF algorithm). But it may not offer good performance: at high values of k, the initial stages (mappers) of all k QEPs are scheduled first, as they are available immediately. Thus, resources are indiscriminately blocked ahead of later stages for all k QEPs, resulting in an increase in average completion times. We evaluate this effect in §7, and show the average completion time benefits of an ideal "sweet-spot" value of k. Note that in this heuristic, only the schedule is altered; the task placement and the QEP remain the same.
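The following sketch captures the shortest-k packing loop (our rendering over assumed flow and link data structures, not CLARINET's code): flows of the k shortest queries are started greedily whenever their predecessors are done and their link is free, with shorter flows breaking ties.

```python
# Sketch of the shortest-k heuristic of Section 5.2 as a discrete-event loop.
import heapq
import itertools

def pack_shortest_k(order, flows_of, k):
    """order: queries in strawman completion order O.
    flows_of: query -> list of flow dicts {id, link, dur, preds (flow ids)}."""
    active = list(order[:k])                   # the dynamic set D
    waiting = list(order[k:])
    started, done = set(), set()
    link_free_at, finish = {}, {}
    seq = itertools.count()
    events = [(0.0, next(seq), None)]          # (time, tiebreak, finished flow id)

    while events:
        now, _, finished_flow = heapq.heappop(events)
        if finished_flow is not None:
            done.add(finished_flow)
        for q in list(active):
            # shorter flows first, mirroring the tie-breaking rule in the text
            for f in sorted(flows_of[q], key=lambda x: x["dur"]):
                if f["id"] in started:
                    continue
                ready = all(p in done for p in f["preds"])
                if ready and link_free_at.get(f["link"], 0.0) <= now:
                    started.add(f["id"])
                    end = now + f["dur"]
                    link_free_at[f["link"]] = end          # non-overlap on this link
                    finish[q] = max(finish.get(q, 0.0), end)
                    heapq.heappush(events, (end, next(seq), f["id"]))
            if all(f["id"] in started for f in flows_of[q]):
                active.remove(q)                           # all of q's flows placed:
                if waiting:
                    active.append(waiting.pop(0))          # admit next query from O
                    heapq.heappush(events, (now, next(seq), None))  # re-scan now
    return finish                                          # query -> completion time
```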

5.3 Enhancements

Fairness: Our heuristic can push long queries' start times significantly later in order to favor shorter-running queries. This is not acceptable if the long queries are initiated by different applications that require performance guarantees. To mitigate this bias, we adopt an approach similar to [22]. Essentially, we want to ensure that the running time of a query Qj is bounded by d_j = n × dur_j, where n is the number of simultaneously running queries, dur_j is the standalone run time of the query without contention, and d_j denotes the calculated deadline for the query. We then adapt the heuristic in §5.2 as follows. We sort queries in descending order of a "proximity score"; this score captures how close a query is to its deadline and is obtained as:

    Proximity_j(t) = 1 − (d_j − t) / d_j    (6)

where t is the time at which the dynamic set (§5.2) is updated (upon completion of a query). We pick the top ǫM queries in this sorted order and call them H. Here, ǫ (0 < ǫ ≤ 1) is a fairness control knob and M is the number of queries with at least one task not yet scheduled. The dynamic set D (from §5.2) is then obtained by picking the shortest k queries from H; if k > |H|, then D = H. By doing so, we block the tasks of queries that are far from their deadline from being scheduled and prefer those closer to their deadline. When ǫ = 1, H contains all the remaining queries and the heuristic is identical to the one in §5.2. When ǫ → 0, D contains only the queries with the highest proximity to their fair-share deadlines, thus offering maximum fairness.

WAN utilization: By favoring QEPs and task placements that result in smaller completion times, CLARINET implicitly reduces WAN usage. However, unlike recent work [43], CLARINET cannot provide explicit guarantees on WAN usage. To explicitly control WAN usage, we filter from the QEP-Set of every query those QEPs whose best (in terms of WAN use) task placement results in inter-site WAN usage exceeding a threshold, β.


With this limited set of QEPs per query, we then apply the techniques in §5.1 and §5.2 for scheduling the transfers.

Online arrivals: We have assumed so far that the set of n queries arrives simultaneously. We now extend the heuristic in §5.2 to support online query arrivals. Upon arrival of a new query, we recompute the QEP choice, task placement, and schedule for the new query together with all previously submitted queries for which none of the tasks have started executing. Doing so might alter the QEP and schedule for prior, as-yet-unexecuted queries based on the new information. Changing the QEP for already-executing queries would waste resources; CLARINET does not alter the QEP for those queries.
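A sketch of the fairness-aware construction of the dynamic set D described earlier in this subsection (ours; the helper names and inputs are assumed):

```python
# Sketch of the fairness knob of Section 5.3 applied to the dynamic set D.
def build_dynamic_set(unfinished, t, standalone_dur, order, n, k, eps):
    # Fair-share deadline d_j = n * dur_j, and the proximity score of eq. (6):
    # Proximity_j(t) = 1 - (d_j - t) / d_j, which approaches 1 near the deadline.
    def proximity(q):
        d = n * standalone_dur[q]
        return 1.0 - (d - t) / d

    m = len(unfinished)
    h_size = max(1, int(eps * m))                  # eps is the fairness control knob
    H = sorted(unfinished, key=proximity, reverse=True)[:h_size]

    # D = the k shortest queries (per the strawman order O) among H;
    # if k > |H|, then D = H.
    rank = {q: i for i, q in enumerate(order)}
    return sorted(H, key=lambda q: rank[q])[:k]
```

With eps = 1, H contains every unfinished query and D degenerates to the §5.2 heuristic; as eps → 0, only the queries closest to their fair-share deadlines are eligible.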

6 Implementation

We build CLARINET as a stand-alone module that interfaces with Hive [3] at the application level and Tez [4] at the execution-framework level. We modified Hive and Tez to interface with CLARINET as follows.

Modifications to Hive/Calcite: Hive internally uses the Apache Calcite [2] library as a CBO. Calcite offers two types of QOs: (i) HepPlanner, a greedy CBO, and (ii) VolcanoPlanner, a dynamic-programming-based CBO [16] that enumerates all possible QEPs for a query. By default, Hive uses the HepPlanner, but since it does not explore all possible QEPs, we modify Hive to interface with VolcanoPlanner. We further modify VolcanoPlanner to return the operator trees (OPTs) representing multiple join orders, along with the estimated cardinality (in bytes) of each operator, for each input query. All the OPTs are then compiled to corresponding QEPs by applying heuristic physical-layer optimizations like partition pruning, field trimming, etc. The QEPs together constitute the QEP-Set for the query. We find that a typical TPC-DS [8] query has tens of QEPs in its QEP-Set. Each QEP is also annotated with the estimate of intermediate data for each stage; this is used by CLARINET to estimate network transfer times.

Modifications to Tez: CLARINET interfaces with Tez by providing hints regarding placement locations and start times for individual tasks. We modify Tez's DAG scheduler to schedule tasks based on these inputs. If a task becomes available before its scheduled start time, we hold it back and schedule it for execution later; a task is never held back beyond its scheduled start time.

Scheduling non-overlapped transfers: CLARINET employs a schedule that requires non-overlap of flows between two sites. Consider a simple MapReduce job similar to the one in fig. 6(a). If tasks from two map stages (say, M1 and M2) are executed at the same location, then the transfer of their intermediate data to any downstream task (say, R1) happens in an overlapped fashion; i.e., when R1 starts executing, it reads the data written by both M1 and M2 simultaneously. To enforce non-overlapped transfers by controlling the task schedule, we introduce relay stages in the QEP (stages F and G in fig. 8(b)). A task in a relay stage does not process data; it reads remote data and writes it locally. Its parallelism and locations are identical to those of the corresponding reducer stage. By specifying the start times of the tasks in the relay stages (F1 and G1 in fig. 8(b)), CLARINET explicitly determines the start times of inter-stage shuffles and can ensure that they happen in a non-overlapped fashion.

Figure 8: Modification of QEPs forwarded to the execution framework by adding relay stages (F and G). Relay stages ensure network transfers fully utilize bandwidth and can be scheduled in a non-overlapped fashion. Here, map tasks M1 and M2 are executed in the same site, whereas the reducer task R1 is executed in a different site; relay tasks F1 and G1 are co-located with R1. (a) shows the overlapped flows without relays; (b) shows the resulting independent flows with relays.
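A small sketch of the relay-stage rewrite (our rendering over an assumed DAG representation, not the actual Hive/Tez modification):

```python
# Sketch: insert a relay vertex on every shuffle edge whose endpoints are at
# different sites, so each inter-site transfer becomes a separately schedulable,
# non-overlapping flow that lands at the consumer's site.
def add_relay_stages(vertices, edges):
    """vertices: task name -> site; edges: list of (producer, consumer)."""
    new_edges, counter = [], 0
    for src, dst in edges:
        if vertices[src] == vertices[dst]:
            new_edges.append((src, dst))          # same-site edge: leave untouched
            continue
        counter += 1
        relay = f"relay{counter}_{src}_to_{dst}"
        vertices[relay] = vertices[dst]           # relay runs at the consumer's site
        new_edges.append((src, relay))            # WAN shuffle terminates at the relay,
        new_edges.append((relay, dst))            # the consumer then reads it locally
    return vertices, new_edges
```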

7 Evaluation

We experimentally evaluate CLARINET in realistic settings and against state-of-the-art GDA techniques. We first evaluate CLARINET in a real GDA deployment over 10 Amazon EC2 DCs, using the standard TPC-DS [8] workload for benchmarking. For evaluating CLARINET at a large scale, we also use traces of analytics queries executed on two OSPs' production clusters, simulating a GDA setup spread across tens of DCs executing thousands of queries. By default, we run CLARINET without the fairness enhancement.

The de-facto way in which queries are executed in a Hive-atop-Tez deployment is used as the baseline for comparison. Specifically, QEP selection and task placement are both network-agnostic: the QEP is selected by Hive's default QO, and the reducers are placed uniformly across the sites where input data is present. Since we are interested in reducing average completion time, we use our shortest-query-first heuristic (SJF; §5.1) to schedule the tasks belonging to multiple queries. We call this baseline HIVE+.⁷

⁷ The '+' in HIVE+ indicates that the SJF heuristic is used for multiple queries. In a normal deployment, concurrent queries will arbitrarily share WAN bandwidth, thereby delaying completion time for all.


Figure 9: Percentage reduction in running time of HIVESINGLEDC w.r.t. HIVE+ for TPC-DS queries (sorted by increasing gains). Negative gains indicate that the running times of queries with HIVESINGLEDC are greater than with HIVE+.

Prior work [35] has shown that centrally aggregating raw data to one DC is wasteful. However, it only considers the case where raw input data is centrally aggregated; it is possible to reduce the amount of data sent over the WAN by suitably processing/filtering the raw input data. For completeness, we evaluate our baseline (HIVE+) against an alternative that centrally aggregates data after pre-processing; we call this alternative HIVESINGLEDC. In our implementation, HIVESINGLEDC uses the default QEP chosen by Hive's QO; the map tasks which process (filter) the raw input data are co-located with the data, and all reduce tasks are placed in the DC with the most intermediate data after the map stages.

We also study HIVE-IR+ in simulation. HIVE-IR+ uses the QEP chosen by Hive, but decisions on placement and scheduling are made using the algorithms described in §4 and §5. The IR in HIVE-IR+ stands for Iridium [35], a state-of-the-art scheme for WAN-aware data/task placement; CLARINET's task placement is similar to Iridium's [35]. We use the "+" suffix since Iridium only does network-aware data/task placement, whereas HIVE-IR+ also does network-aware transfer scheduling. Comparing CLARINET and HIVE-IR+ highlights the importance of doing QEP selection along with WAN-aware task placement and transfer scheduling. We measure the improvements of CLARINET and HIVE-IR+ in terms of percentage reduction in average query run time compared to HIVE+.

7.1 Testbed Deployment Results

Deployment Setup and Workload: We spin up 5 server instances, each with 40 vCPUs (2.4 GHz Intel Xeon processors) and 160GB RAM, in each of the 10 EC2 regions. We deploy HDFS+YARN across all the instances; a single server in one of the regions functions as the HDFS namenode and the YARN resource manager. The connectivity between different sites is through the public Internet; the naturally available bandwidth (see fig. 2) acts as the constrained resource. To avoid disk read/write bottlenecks, we store all the intermediate data in memory; this also aligns with recent trends toward in-memory analytics [48, 27]. We use TPC-DS queries on datasets at different scales (10, 50, 100, 500) for our evaluation. Our workload is generated by randomly choosing the queries and the scale of data. The input tables are randomly spread across the different geographical regions, similar to prior studies [35, 43].

Comparison with the single-DC execution model: Figure 9 compares the running times of individual TPC-DS queries using HIVESINGLEDC and HIVE+. For only 2 of the 24 queries that we evaluated does HIVESINGLEDC have a smaller running time than HIVE+; further, HIVESINGLEDC can be up to four times slower (0.25×) than HIVE+. Upon closer investigation, we find that for the queries where HIVESINGLEDC is faster, the distribution of the largest input table was skewed: 70% of the input data was in one DC. Thus, for such cases, HIVESINGLEDC requires only 30% of the mapper outputs and none of the reducer outputs to be transferred across the WAN. Overall, the distributed execution model effectively utilizes the total WAN bandwidth when the input data is spread across multiple DCs. However, when the input data is skewed, placing all the reducers in one DC is advantageous. Thus, in all further experiments, we also consider a task placement strategy where all the tasks are placed in the DC with the largest input data, in addition to the placement approaches discussed in §4.1. CLARINET's design and the iterative heuristic described in §5.1 easily accommodate multiple task placement strategies for each QEP.

CLARINET performance: Figure 10(a) shows the run-time reduction of CLARINET compared to HIVE+ for TPC-DS queries when run individually. Network-aware QEP selection, task placement, and scheduling result in at least a 20% reduction (1.25× speedup) in query run time; the gains can be as high as 80% (5×) for some of the queries. For 75% of the queries, CLARINET chooses a QEP different from the one chosen by default in Hive (not shown). This highlights the importance of network-aware QEP selection even for single queries. Figure 10(b) shows the gains when multiple TPC-DS queries of different scales are run simultaneously; we report results over 30 randomly chosen batches of TPC-DS queries with 8 or 12 queries per batch. Gains of more than 40% (1.66×) are observed in all the batches. On average, we see a ≈60% reduction, or 2.5× speedup, higher than in the single-query case (45% on average). By placing the reducers randomly across different geographical regions, HIVE+ transfers 75% (Figure 10(c)) of the total intermediate data between EC2 DCs. Since the inter-region bandwidth is limited, this leads to longer running times. CLARINET, on the other hand, transfers only half of the intermediate data between DCs. Figure 10(d) shows the distribution of bandwidth and intermediate data across the inter-site links for a single run with 12 simultaneously running queries.


Figure 10: Results from a real CLARINET deployment across Amazon EC2 datacenters. (a) Percentage reduction in running times of CLARINET w.r.t. HIVE+ for 29 individual TPC-DS queries, sorted by observed gains. (b) Percentage reduction in average completion times of CLARINET w.r.t. HIVE+ when batches of 8/12 randomly chosen TPC-DS queries of different scales are executed simultaneously, sorted by observed gains. (c) Comparison of HIVE+ and CLARINET w.r.t. intermediate data sent over the WAN as a percentage of the total intermediate data (75% vs. 56%), measured over a single run with 12 simultaneous queries. (d) Distribution of bandwidth and intermediate data across a subset of pairwise logical links between the DCs for a single batch of 12 queries; links unused by both CLARINET and HIVE+ are ignored.

The difference between the intermediate-data and bandwidth distributions is greater for HIVE+ than for CLARINET. For example, HIVE+ transfers 45% of its intermediate data over logical links that account for only 20% of the bandwidth; in comparison, CLARINET places only 20% of its load on that 20% of the bandwidth. By considering multiple candidate QEPs for each query and by controlling task placement and scheduling, CLARINET is able to match intermediate data to the available bandwidth across different links. Since HIVE+ has no alternate choices of QEP or task placement/schedule, it tends to put heavy load on some links and no load on others.

Multi-query optimization: To quantify the need for multi-query optimization in the geo-distributed setting, we measure how the QEPs chosen for queries optimized jointly differ from the QEPs chosen when the queries are run individually. For 60% of the queries, when run with 8 or 12 queries concurrently, the QEP of choice in CLARINET differs from the one chosen when the queries are run individually. As an illustrative example of CLARINET's cross-query behavior, consider TPC-DS query 7; it involves a five-way join of a fact table with 4 dimension tables, one of which is fairly large. When run by itself, CLARINET never joins the fact table with the large dimension table (even though they are located in DCs within a continent), to avoid a costly WAN transfer. However, in 5 out of 6 batches where Query 7 runs simultaneously with other queries that load the links behind the preferred dimension table, CLARINET forces Query 7 to join the large tables upfront.

Resource Fragmentation: For a single run with 12 simultaneously running queries, we compute the duration for which inter-DC links remain idle. A resource is idle if a task is available to run but is not scheduled for execution. For CLARINET, the links are fallow for only 3% of the time, which is minimal. Our larger-scale simulation results confirm the reduction in resource fragmentation achieved by our approach.

Optimization overhead: We also measured the time CLARINET spends in optimizing the query plan. After parallelizing the evaluation of each candidate query in every iteration, CLARINET spends less than 1s (on average) per iteration. For optimizing 12 queries in 30 different batches, CLARINET takes a maximum of 15s; the median optimization time is 8s for a batch of 12 queries. Relative to query execution times (tens of minutes), the optimization overhead of CLARINET is acceptable in practice.

7.2 Simulation Results

Trace-driven Simulator: For large-scale experiments, with 50 sites and thousands of queries, we evaluate CLARINET through a trace-driven simulation based on production traces obtained from the analytics clusters of two large OSPs, FACEBOOK and MICROSOFT. These traces contain information on query arrival times, input/intermediate data sizes for each query, data locations, QEP structure, etc., for 350K and 600K jobs respectively; please refer to [37, 18] for more details about the workloads. Unfortunately, we do not have logs from the query optimizer that generated the QEPs, and hence do not have information regarding alternate QEPs. To overcome this, we use QEPs generated from TPC-DS queries, superimposed with information on input table size and intermediate data size from the traces. Every job in the trace is replaced by a randomly chosen TPC-DS query, and the TPC-DS input tables acquire the distribution and location characteristics of the input data for the corresponding job in the trace. Thus, our workload has similar load and data distributions as the production traces, but uses query-plan options from the TPC-DS benchmark.


Figure 11: Overall gains of CLARINET and HIVE-IR+ w.r.t. HIVE+ as measured in the simulator using FACEBOOK and MICROSOFT production traces. (a) Percentage reduction in average running times of CLARINET (59% and 63%) and HIVE-IR+ (31% and 25%) w.r.t. HIVE+ on the FACEBOOK and MICROSOFT traces, respectively. (b) CDF of the per-query gains of CLARINET on the FACEBOOK and MICROSOFT traces.

Queries arrive in batches of a few hundred; results for other batch sizes are similar. We impose a logical full-mesh topology with the bandwidth between each pair of sites chosen randomly from [100 Mbps, 5 Gbps].

Figure 11(a) shows the reduction in average running time for CLARINET and HIVE-IR+ when compared to HIVE+ for both production traces. Compared to the HIVE+ baseline, CLARINET improves the average query completion time by about 60%, or a 2.5× speedup. CLARINET offers 28 and 38 percentage points of improvement over HIVE-IR+ for the FACEBOOK and MICROSOFT traces, respectively; this translates, respectively, to 1.75× and 2× speedups relative to HIVE-IR+. These additional gains come from choosing better QEPs. For a network topology with higher bandwidths and low variation (drawn from [10 Gbps, 50 Gbps]), CLARINET achieves 47% and 52% reductions in run time for the FACEBOOK and MICROSOFT traces, respectively, relative to HIVE+. Higher bandwidth implies overall smaller running times even for a WAN-agnostic system like HIVE+; even under such a scenario, CLARINET offers a 2× improvement.

Figure 11(b) plots the distribution of CLARINET's gains w.r.t. HIVE+. Note that CLARINET does not increase the running time of any query. However, the distribution has a heavy tail: some queries have moderate improvement while others have substantial improvement. The variation is especially prominent in the MICROSOFT traces, where approximately 38% of the queries have less than 20% (1.25×) improvement and 20% of the queries have greater than 70% improvement (3× speedup). In §7.4, we present an in-depth analysis of performance improvement for different classes of queries.

Figure 12: Performance of our overall heuristic as a function of k, for the FACEBOOK and MICROSOFT traces. (a) Variation of performance (improvement [%]) with k for the shortest-k heuristic. (b) Reduction in resource fragmentation (percentage of time links are idle) with increasing k.

Figure 13: Variation of performance and fairness metrics with respect to ǫ, for the FACEBOOK and MICROSOFT traces. (a) Percentage reduction in average query run times relative to HIVE+. (b) Percentage of jobs that do not meet their fair deadline.

7.3 CLARINET's heuristics and design decisions

Next, we explore the effectiveness of key CLARINET design decisions in simulation.

Effectiveness in Combating Resource Fragmentation: Recall from §5.2 that our approach to combat resource fragmentation is to allow network transfers from the top-k shortest queries to be scheduled when resources are fallow. Figure 12(a) plots the variation in overall runtime reduction for different values of k. For k = 1, CLARINET does vanilla SJF scheduling. As we increase k, the gain increases, peaks at k = 57 for both the FACEBOOK and MICROSOFT traces, and then decreases. Vanilla shortest-job-first is considerably worse than choosing the best value of k. Figure 12(b) shows the fraction of time (in percentage) that inter-site links remain fallow as k varies. We see severe underutilization of resources at k = 1, explaining the poor performance of SJF. At the peak (k = 57), the links are unutilized only 5% of the time. Increasing k further continues to reduce link fallow time; however, higher values of k lead to cases where the initial stages (mappers) of k QEPs get scheduled first, as they are available immediately. As a result, resources are blocked for the later stages of all k QEPs, increasing average run times.
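To make the shortest-k idea concrete, the sketch below backfills an idle link using only transfers from the k queries with the smallest estimated remaining time; with k = 1 it degenerates to vanilla SJF. The data structures are simplified assumptions, not CLARINET's scheduler.

```python
# Illustrative sketch of shortest-k backfilling (simplified data model).
from dataclasses import dataclass

@dataclass
class Transfer:
    src_site: str
    dst_site: str
    size_gb: float

@dataclass
class Query:
    remaining_time_est: float      # estimated remaining completion time
    ready_transfers: list          # transfers whose inputs are already produced

def pick_transfer_for_idle_link(src, dst, queries, k):
    """Backfill an idle (src, dst) link using only the k shortest queries."""
    shortest_k = sorted(queries, key=lambda q: q.remaining_time_est)[:k]
    for q in shortest_k:                       # prefer the shortest query first
        for t in q.ready_transfers:
            if (t.src_site, t.dst_site) == (src, dst):
                return t
    return None                                # leave the link idle

# k = 1 reduces to strict SJF backfilling; larger k trades SJF ordering for
# fewer idle links, which is exactly the tension shown in Figure 12.
```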


Figure 14: Isolating gains observed across queries, binned by COV (normalized standard deviation). (a) Queries binned by the total amount of intermediate data: the COV bins <0.5, 0.5-1, 1-2, >2 contain 22%, 38%, 25%, 15% of queries, with 33%, 66%, 60%, 82% improvement, respectively. (b) Queries binned by the total input size of tables: 10%, 9%, 37%, 44% of queries, with 28%, 35%, 59%, 75% improvement. (c) Queries binned by bandwidth skew: 35%, 14%, 18%, 33% of queries, with 48%, 50%, 55%, 71% improvement.

Table 2: CLARINET (C) vs. CLARINET-O (CO), a variation allowing overlap of network transfers. HIVE+ is used as the baseline; entries are the percentage reduction in run time.

            FACEBOOK       MICROSOFT
            C      CO      C      CO
  25%ile    27     13       9      8
  Mean      59     30      63     34
  75%ile    68     36      67     40
  90%ile    72     47      78     48

Fairness across queries: Recall from §5.3 that CLARINET uses a knob ǫ to ensure fairness: ǫ → 0 biases CLARINET's core heuristic (§5.2) to schedule from jobs that are nearing deadlines computed based on their fair share (hence leading to greater fairness), whereas ǫ → 1 favors performance at the expense of jobs being delayed beyond their fair-share deadlines. Performance improvements from CLARINET relative to HIVE+ as a function of ǫ are shown in Figure 13(a). Even when biasing toward fairness (ǫ → 0), CLARINET offers substantial improvements (20%) relative to HIVE+. As we trade off some amount of fairness (higher ǫ), CLARINET's benefits improve almost linearly. Figure 13(b) shows the percentage of jobs that did not meet the fair deadline as a function of ǫ. For low values of ǫ (= 0.1), almost 90% of jobs meet their deadline and are not starved by jobs arriving in the future.
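The exact rule is defined in §5.3; as a rough illustration of the trade-off only (not CLARINET's mechanism), one can imagine blending deadline urgency with shortest-remaining-time ordering under a single ǫ-weighted priority. The `fair_deadline` and `remaining_time_est` fields below are assumed bookkeeping values.

```python
# Hedged sketch of an epsilon-style fairness/performance knob. This is only an
# illustration of the described trade-off, NOT CLARINET's actual rule (Sec 5.3).
def priority(query, now, epsilon):
    # Smaller value => scheduled earlier.
    deadline_term = query.fair_deadline - now    # urgency w.r.t. fair-share deadline
    sjf_term = query.remaining_time_est          # shortest-remaining-time (performance)
    # epsilon -> 0: ordering dominated by fair-share deadlines (fairness).
    # epsilon -> 1: ordering dominated by remaining time (performance).
    return (1.0 - epsilon) * deadline_term + epsilon * sjf_term

def next_query(queries, now, epsilon):
    return min(queries, key=lambda q: priority(q, now, epsilon))
```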

Non-overlap: We compare CLARINET with CLARINET-O, a variant that disregards CLARINET's schedule and allows tasks to be scheduled as and when they are available. Competing flows on a WAN link thus share the bandwidth equally in space, rather than being scheduled across time. The QEPs chosen and the placement of tasks are identical in both cases. Table 2 reports the run time reduction with respect to HIVE+. Even with an overlapped schedule, the gains of CLARINET-O over HIVE+ are significant; average run time reduces by 34% (1.5×). This is due to good QEP selection and task placement. Further, CLARINET is 29 percentage points better than CLARINET-O by virtue of combining QEP selection and task placement with non-overlapped transfer scheduling. Overlap results in a lower allocation of bandwidth for all contending flows, thereby increasing all queries' completion times.
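The intuition can be seen with two flows on a single link; the back-of-the-envelope computation below uses arbitrary illustrative numbers (not from our traces) to compare fair sharing in space against serializing the shorter flow first.

```python
# Two flows of 10 GB and 40 GB on a 1 GB/s link (illustrative numbers only).
C, s1, s2 = 1.0, 10.0, 40.0                       # link GB/s, flow sizes in GB

# Overlapped (fair share in space): both flows get C/2 until the short one ends.
t1_overlap = s1 / (C / 2)                                    # 20 s
t2_overlap = t1_overlap + (s2 - (C / 2) * t1_overlap) / C    # 50 s

# Non-overlapped: short flow first at full rate, then the long flow.
t1_serial = s1 / C                                           # 10 s
t2_serial = t1_serial + s2 / C                               # 50 s

print(t1_overlap, t2_overlap)   # 20.0 50.0
print(t1_serial, t2_serial)     # 10.0 50.0
# The long flow finishes at the same time either way, but the short flow
# finishes 2x earlier without overlap, so the average completion time drops.
```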

7.4 Profiling gains of queries (simulation)

To isolate the characteristics of queries that contribute to higher performance, we categorize them based on the amount of skew in (i) the intermediate data generated by different stages, (ii) the spread of input data across sites, and (iii) the average outgoing bandwidth of the sites where a query's input tables are located. For each characteristic, we split the queries into bins based on the normalized standard deviation (COV). Figure 14 presents the performance gains for queries in each bin. Queries with high skew (> 2) in the amount of intermediate data perform 3× better than queries whose intermediate data is equally distributed; the absolute improvement over HIVE+ is as high as 82%. A similar trend is observed for queries categorized by the skew in input data. For queries with low skew in intermediate/input data, all join orders (all possible QEPs) exercise all links in the topology, so choosing one over another does not offer substantial improvement. We also observe performance gains improving (48% to 71%) with growing bandwidth skew, but the effect is less pronounced. This is consistent with the high gains observed on a homogeneous WAN substrate (§7.2).

CLARINET's performance is not intrinsically tied to the presence of high WAN skew.
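For reference, the COV used for binning above is simply the standard deviation normalized by the mean; the small illustrative computation below (assuming NumPy is available) shows how per-site data volumes map to the bins of Figure 14.

```python
import numpy as np

def cov(values):
    """Coefficient of variation: standard deviation normalized by the mean."""
    values = np.asarray(values, dtype=float)
    return values.std() / values.mean()

def cov_bin(c):
    """Bins used in Figure 14: <0.5, 0.5-1, 1-2, >2."""
    if c < 0.5:
        return "<0.5"
    if c < 1.0:
        return "0.5-1"
    if c < 2.0:
        return "1-2"
    return ">2"

# Example: per-site intermediate data (GB) produced by one query's stages.
print(cov_bin(cov([5, 5, 6, 4, 5, 5, 5, 5, 5, 5])))           # low skew  -> "<0.5"
print(cov_bin(cov([0, 0, 0, 0, 0, 0, 0, 0, 0, 50])))          # high skew -> ">2"
```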

8 Related Work

We discuss related work on query optimization in §2.2.

CLARINET adds to the rich literature on query optimization in both (distributed) database systems [34, 14, 25, 12, 32, 41, 38, 46, 49] and big data analytics stacks [2, 10]. In particular, it shows how to bring WAN awareness into query optimization in a principled fashion. Other recent work has explored lower-layer optimizations to improve GDA query performance. Iridium [35] develops WAN-aware input data and task placement for two-stage MapReduce jobs. Geode [43] develops input data movement and join algorithm


selection strategies to minimize WAN bandwidth usage. JetStream [36] proposes adaptive filtering and local aggregation of data to improve latency. SWAG [21] coordinates compute task scheduling across DCs. Many of these apply to simple 1- or 2-stage queries [35, 21, 43], whereas CLARINET considers general DAGs. Some also require detailed modifications to existing analytics frameworks [36], whereas CLARINET's design allows it to be integrated with ease. More importantly, CLARINET operates at a higher layer than all prior systems, by optimizing query plan generation, and thus has a more fundamental impact on query performance. CLARINET is also complementary to these prior systems (e.g., [21, 36]).

9 Discussion

Our experimental results highlight the performance CLARINET achieves through WAN-aware QEP selection, combined with the operator placement and scheduling aspects of the execution framework. While this motivates the need to explore non-traditional query optimization approaches for geo-distributed settings, there are a few other aspects to consider.

First, the efficacy of CLARINET depends on the availability of known, non-fluctuating bandwidth between DCs. Most software-defined WAN managers provide this abstraction under normal operating conditions. Further, our experiments on Amazon EC2 showed that minor fluctuations in available bandwidth do not adversely affect CLARINET's performance. However, under catastrophic network failures, the bandwidth available between DCs can change drastically, and CLARINET does not have any mechanism to react to such scenarios. Prior work [11, 29, 9] has presented approaches to dynamically change query execution plans under system changes and cardinality estimation errors. Developing similar techniques to adapt CLARINET's execution plan under bandwidth changes is part of our future work.

Second, CLARINET does not leverage performance gains obtained from techniques that minimize the overall data transferred over the WAN. These include: (i) using Bloom filters to implement joins as semi-joins, and (ii) caching (intermediate) result data from prior queries [43]. While reducing WAN traffic improves query completion time in the geo-distributed setting, the total data sent over the WAN (e.g., determined by the number of common keys in a Bloom-filter semi-join implementation) can still be large depending on the dataset. In such cases, network-aware QEP selection and scheduling of transfers can further reduce aggregate run times even when WAN traffic reduction methods are used.
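For context, a Bloom-filter semi-join ships a compact digest of one table's join keys instead of the table itself, so only matching rows cross the WAN. The sketch below illustrates the data-movement pattern with a plain Python set standing in for a real Bloom filter (a set has no false positives, unlike a Bloom filter, but the transfer pattern and the caveat above are the same); it is not CLARINET's or Geode's implementation.

```python
# Illustrative semi-join: instead of shipping table R to S's site, ship only a
# digest of R's join keys, then ship back just the matching rows of S.
def semi_join_transfer(r_rows, s_rows, key):
    keys_of_r = {row[key] for row in r_rows}          # digest sent to S's site
    matching_s = [row for row in s_rows if row[key] in keys_of_r]
    return matching_s                                  # only these rows cross the WAN

R = [{"uid": 1}, {"uid": 2}]
S = [{"uid": 1, "bytes": 10}, {"uid": 3, "bytes": 99}]
print(semi_join_transfer(R, S, "uid"))                 # [{'uid': 1, 'bytes': 10}]
# If most of S's keys also appear in R, the shipped rows remain large, which is
# the caveat above: traffic reduction alone may not shrink transfer time much.
```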

10 Conclusion

In this paper, we consider the problem of running analytics queries over data gathered and stored at multiple sites inter-connected by heterogeneous WAN links. We argue that, in order to optimize query completion times, it is crucial for the query plan to be WAN-aware, for query planning to be done jointly with selecting the placement and schedule of the query's tasks, and for multiple queries to be jointly optimized. We design CLARINET, a novel WAN-aware query optimizer that incorporates a variety of novel heuristics for these issues. We implement CLARINET such that it can be integrated into existing data analytics frameworks with minimal modifications. Our experiments using an EC2 deployment and large-scale simulations show that CLARINET reduces query completion times by 2× compared to using state-of-the-art WAN-aware placement and scheduling. We also show how our scheme can ensure fair treatment of queries.

Acknowledgments

We thank the anonymous reviewers and our shepherd Amol Deshpande for their insightful comments. Raajay and Aditya are supported by the Wisconsin Institute on Software-defined Datacenters of Madison and grants from Google and the National Science Foundation (CNS-1302041, CNS-1330308, CNS-1345249).

References

[1] Amazon datacenter locations. https://aws.amazon.com/about-aws/global-infrastructure/.
[2] Apache Calcite - a dynamic data management framework. http://calcite.incubator.apache.org. Accessed 04-27-2015.
[3] Apache Hive. http://hive.apache.org.
[4] Apache Tez. http://tez.apache.org.
[5] Google datacenter locations. http://www.google.com/about/datacenters/inside/locations/.
[6] Microsoft datacenters. http://www.microsoft.com/en-us/server-cloud/cloud-os/global-datacenters.aspx.
[7] Spark SQL. https://spark.apache.org/sql.
[8] TPC Benchmark DS (TPC-DS). http://www.tpc.org/tpcds.
[9] AGARWAL, S., KANDULA, S., BRUNO, N., WU, M.-C., STOICA, I., AND ZHOU, J. Reoptimizing data parallel computing. In NSDI (2012).
[10] ARMBRUST, M., XIN, R. S., LIAN, C., HUAI, Y., LIU, D., BRADLEY, J. K., MENG, X., KAFTAN, T., FRANKLIN, M. J., GHODSI, A., AND ZAHARIA, M. Spark SQL: Relational data processing in Spark. In SIGMOD (2015).
[11] AVNUR, R., AND HELLERSTEIN, J. M. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2000), SIGMOD '00, ACM, pp. 261–272.


[12] BERNSTEIN, P. A., AND CHIU, D.-M. W. Using semi-joins to solve relational queries. Journal of the ACM 28, 1 (1981), 25–40.
[13] CALDER, M., FAN, X., HU, Z., KATZ-BASSETT, E., HEIDEMANN, J., AND GOVINDAN, R. Mapping the expansion of Google's serving infrastructure. In IMC (2013).
[14] DEWITT, D. J., GHANDEHARIZADEH, S., SCHNEIDER, D., BRICKER, A., HSIAO, H.-I., RASMUSSEN, R., ET AL. The Gamma database machine project. IEEE Transactions on Knowledge and Data Engineering 2, 1 (1990), 44–62.
[15] GANJAM, A., SIDDIQUI, F., ZHAN, J., LIU, X., STOICA, I., JIANG, J., SEKAR, V., AND ZHANG, H. C3: Internet-scale control plane for video quality optimization. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15) (Oakland, CA, May 2015), USENIX Association, pp. 131–144.
[16] GRAEFE, G. Volcano: An extensible and parallel query evaluation system. IEEE Trans. on Knowl. and Data Eng. 6, 1 (Feb. 1994), 120–135.
[17] GRAEFE, G. The Cascades framework for query optimization. Data Engineering Bulletin 18 (1995).
[18] GRANDL, R., ANANTHANARAYANAN, G., KANDULA, S., RAO, S., AND AKELLA, A. Multi-resource packing for cluster schedulers. In SIGCOMM (2014).
[19] GUPTA, A., SUDARSHAN, S., AND VISHWANATHAN, S. Query scheduling in multi query optimization. In Database Engineering and Applications, 2001 International Symposium on (2001), IEEE, pp. 11–19.
[20] HONG, C.-Y., KANDULA, S., MAHAJAN, R., ZHANG, M., GILL, V., NANDURI, M., AND WATTENHOFER, R. Achieving high utilization with software-driven WAN. In SIGCOMM (2013).
[21] HUNG, C.-C., GOLUBCHIK, L., AND YU, M. Scheduling jobs across geo-distributed datacenters. In SoCC (2015).
[22] ISARD, M., PRABHAKARAN, V., CURREY, J., WIEDER, U., TALWAR, K., AND GOLDBERG, A. Quincy: Fair scheduling for distributed computing clusters. In SOSP (2009).
[23] JAIN, S., KUMAR, A., MANDAL, S., ONG, J., POUTIEVSKI, L., SINGH, A., VENKATA, S., WANDERER, J., ZHOU, J., ZHU, M., ZOLLA, J., HÖLZLE, U., STUART, S., AND VAHDAT, A. B4: Experience with a globally-deployed software defined WAN. In SIGCOMM (2013).
[24] JIANG, J., DAS, R., ANANTHANARAYANAN, G., CHOU, P., PADMANABHAN, V., SEKAR, V., DOMINIQUE, E., GOLISZEWSKI, M., KUKOLECA, D., VAFIN, R., AND ZHANG, H. Via: Improving internet telephony call quality using predictive relay selection. In SIGCOMM (2015).
[25] KITSUREGAWA, M., TANAKA, H., AND MOTO-OKA, T. Application of hash to data base machine and its architecture. New Generation Computing 1, 1 (1983), 63–74.
[26] KUMAR, A., JAIN, S., NAIK, U., RAGHURAMAN, A., KASINADHUNI, N., ZERMENO, E. C., GUNN, C. S., AI, J., CARLIN, B., AMARANDEI-STAVILA, M., ROBIN, M., SIGANPORIA, A., STUART, S., AND VAHDAT, A. BwE: Flexible, hierarchical bandwidth allocation for WAN distributed computing. In SIGCOMM (2015).
[27] LI, H., GHODSI, A., ZAHARIA, M., SHENKER, S., AND STOICA, I. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In Proceedings of the ACM Symposium on Cloud Computing (New York, NY, USA, 2014), SOCC '14, ACM, pp. 6:1–6:15.
[28] MACKERT, L. F., AND LOHMAN, G. M. R* optimizer validation and performance evaluation for distributed queries. In PVLDB (1986).
[29] MARKL, V., RAMAN, V., SIMMEN, D., LOHMAN, G., PIRAHESH, H., AND CILIMDZIC, M. Robust query processing through progressive optimization. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2004), SIGMOD '04, ACM, pp. 659–670.
[30] MASTROLILLI, M., AND SVENSSON, O. (Acyclic) job shops are hard to approximate. In FOCS (2008).
[31] MONALDO, M., AND OLA, S. Improved bounds for flow shop scheduling. In ICALP (2009).
[32] MULLIN, J. K. Optimal semijoins for distributed database systems. IEEE Transactions on Software Engineering 16, 5 (1990), 558–560.
[33] OLSTON, C., REED, B., SRIVASTAVA, U., KUMAR, R., AND TOMKINS, A. Pig Latin: A not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2008), SIGMOD '08, ACM, pp. 1099–1110.
[34] POLYCHRONIOU, O., SEN, R., AND ROSS, K. A. Track join: Distributed joins with minimal network traffic. In SIGMOD (2014).
[35] PU, Q., ANANTHANARAYANAN, G., BODIK, P., KANDULA, S., AKELLA, A., BAHL, V., AND STOICA, I. Low latency geo-distributed data analytics. In SIGCOMM (2015).
[36] RABKIN, A., ARYE, M., SEN, S., PAI, V. S., AND FREEDMAN, M. J. Aggregation and degradation in JetStream: Streaming analytics in the wide area. In NSDI (2014).
[37] REN, X., ANANTHANARAYANAN, G., WIERMAN, A., AND YU, M. Hopper: Decentralized speculation-aware cluster scheduling at scale. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (New York, NY, USA, 2015), SIGCOMM '15, ACM, pp. 379–392.
[38] RODIGER, W., MUHLBAUER, T., UNTERBRUNNER, P., REISER, A., KEMPER, A., AND NEUMANN, T. Locality-sensitive operators for parallel main-memory database clusters. In ICDE (2014).
[39] ROY, P., SESHADRI, S., SUDARSHAN, S., AND BHOBE, S. Efficient and extensible algorithms for multi query optimization. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data (New York, NY, USA, 2000), SIGMOD '00, ACM, pp. 249–260.
[40] SELLIS, T. K. Multiple-query optimization. ACM Trans. Database Syst. 13, 1 (Mar. 1988), 23–52.
[41] URHAN, T., AND FRANKLIN, M. J. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin (2000), 27–33.
[42] VISWANATHAN, R., ANANTHANARAYANAN, G., AND AKELLA, A. CLARINET: WAN-aware optimization for analytics queries. Tech. Rep. TR1841, University of Wisconsin-Madison, 2016.
[43] VULIMIRI, A., CURINO, C., GODFREY, B., PADHYE, J., AND VARGHESE, G. Global analytics in the face of bandwidth and regulatory constraints. In NSDI (2015).
[44] WANG, X., OLSTON, C., SARMA, A. D., AND BURNS, R. CoScan: Cooperative scan sharing in the cloud. In Proceedings of the 2nd ACM Symposium on Cloud Computing (2011), SOCC '11.
[45] XIAO, X., HANNAN, A., BAILEY, B., AND NI, L. M. Traffic engineering with MPLS in the Internet. IEEE Network 14, 2 (2000), 28–33.
[46] XIN, R. S., ROSEN, J., ZAHARIA, M., FRANKLIN, M. J., SHENKER, S., AND STOICA, I. Shark: SQL and rich analytics at scale. In SIGMOD (2013).


[47] XIONG, P., HACIGUMUS, H., AND NAUGHTON, J. F. A software-defined networking based approach for performance management of analytical queries on distributed data stores. In SIGMOD (2014).
[48] ZAHARIA, M., CHOWDHURY, M., DAS, T., DAVE, A., MA, J., MCCAULEY, M., FRANKLIN, M., SHENKER, S., AND STOICA, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI (2012).
[49] ZAMANIAN, E., BINNIG, C., AND SALAMA, A. Locality-aware partitioning in parallel database systems. In SIGMOD (2015).