SLIDE 1

PLIANT: LEVERAGING APPROXIMATION TO IMPROVE RESOURCE EFFICIENCY IN DATACENTERS

Neeraj Kulkarni, Feng Qi, Christina Delimitrou

SLIDE 2

CLOUD COMPUTING

§ Resource Flexibility

  • Users can elastically scale their resources on-demand

§ Cost Efficiency

  • Sharing resources between multiple users and applications

Batch applications (QoS: throughput) vs. latency-critical interactive apps (QoS: tail latency)

SLIDE 3

LOW UTILIZATION!

§ Servers operate at 10% - 40% utilization most of the time

§ Major reasons:

  • Dedicated servers for interactive services
  • Resource over-provisioning – conservative reservations

[Figure: CPU utilization distributions for a Twitter cluster and a Google cluster]

  • C. Delimitrou and C. Kozyrakis, “Quasar: Resource-Efficient and QoS-Aware Cluster Management,” in ASPLOS, 2014
  • L. Barroso et al., “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Second edition, 2013
SLIDE 4

MULTI-TENANCY

§ Scheduling multiple jobs on the same server

  • Increases server utilization and cost efficiency
  • Causes interference in shared resources (cores, LLC, memory, network)

§ Interference → unpredictable performance
§ Especially difficult with interactive services

[Figure: App1 and App2 co-located on one server, contending for CPU cores, LLC, memory, compute, and network]

SLIDE 5

PREVIOUS SOLUTIONS

  • 1. Allow co-scheduling only of apps that would not violate QoS
  • Bubble-Up, Bubble-Flux, Paragon and Quasar
  • 2. Partition shared resources at runtime to reduce interference
  • Heracles, Ubik, Rubik
  • 3. Reduce interference by throttling applications at runtime
  • Bubble-Flux, ReQoS, Protean Code

§ But these approaches sacrifice …

  • Server utilization, by disallowing certain co-locations
  • Performance of batch applications, by treating them as low-priority
SLIDE 6

BREAK UTILIZATION VS PERFORMANCE TRADE-OFF

§ Approximate computing applications

  • Tolerate some loss in output accuracy in return for

» Improved performance, or
» Same performance with reduced resources

§ Cloud workloads suitable for approximation

  • Performance can be more important than highest output quality

§ Co-locate approximate batch apps with interactive services

  • Meet performance for both applications at the cost of some inaccuracy
SLIDE 7

LEVERAGING APPROXIMATION

  • 1. Mitigate interference:
  • Approximation can reduce # of requests to memory system & network
  • Approximation may not always be sufficient
  • 2. Meet performance of approximate applications:
  • When approximation is not enough, employ resource partitioning:

» Core relocation
» Cache partitioning
» Memory partitioning

  • Provide more resources to interactive service to meet its QoS
  • Approximation preserves the performance of batch applications
SLIDE 8

APPROXIMATION TECHNIQUES

§ Loop perforation: Skip fraction of iterations

  • Fewer instructions & data accesses → exec time ⇩ & cache interference ⇩

§ Synchronization elision: Barriers, locks elided

  • Threads don’t wait for sync → exec time ⇩
  • Reduces memory accesses for acquiring locks

§ Lower precision: Reduce precision of variables

  • e.g., replace ‘double’ with ‘float’ or ‘int’
  • Reduces memory traffic

  double l_c = 1.0  →  float l_c = 1.0        (lower precision)

  for i = 1 to N:                             (loop perforation)
      if i % 2 != 0:
          .....

  lock()                                      (synchronization elision:
  g_c = g_c + l_c                              lock/unlock/BARRIER elided)
  unlock()
  BARRIER()

  for i = 1 to M:                             (tiling: compute A[i][2],
      A[i][2] = F(i,2)                         project onto neighbors
      for j = 1 to 3:                          instead of A[i][j] = F(i,j))
          A[i][j] = A[i][2]

§ Tiling: Compute 1 element & project onto neighbors

  • Fewer instructions & data accesses → exec time ⇩ & cache interference ⇩
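As a concrete illustration, loop perforation can be sketched in a few lines of Python (a minimal sketch, not Pliant's implementation; the function names, data, and skip rate are illustrative assumptions):

```python
# Illustrative sketch of loop perforation (hypothetical function names).
# The perforated variant visits only every `skip`-th element and rescales,
# trading a small accuracy loss for fewer instructions and data accesses.

def precise_sum(xs):
    # Precise variant: visits every element.
    return sum(xs)

def perforated_sum(xs, skip=2):
    # Perforated variant: skips a fraction of the iterations.
    kept = xs[::skip]
    return sum(kept) * (len(xs) / max(len(kept), 1))

data = [float(i) for i in range(1000)]
exact = precise_sum(data)          # 499500.0
approx = perforated_sum(data)      # 499000.0 with skip=2
error = abs(approx - exact) / exact
print(f"exact={exact} approx={approx} error={error:.2%}")
```

Skipping half the iterations here costs about 0.1% accuracy, which mirrors the slide's point: execution time and cache pressure drop at a modest quality cost.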

SLIDE 9

APPROXIMATION TRADE-OFFS

§ 100s of approximate variants
§ Pruning design space:

  • Hint-based:

» Employ approximations hinted by ACCEPT* tool

  • Profiling-based (gprof):

» Approximate in functions which contribute most to execution time

[Figure: execution time normalized to precise vs. inaccuracy (%) for Canneal; precise, approximate, and selected variants shown]

*ACCEPT: A Programmer-Guided Compiler Framework for Practical Approximate Computing, A. Sampson et al.

[Figure: tail latency vs. QoS for nginx, memcached, and mongodb when co-scheduled with Canneal variants (Precise, Approx v1 - v8)]

SLIDE 10

PLIANT: GOALS

§ High utilization

  • Co-schedule interactive services with approximate applications

§ High QoS

  • Satisfy QoS of all co-scheduled jobs at the cost of some accuracy loss

§ Minimize accuracy loss

  • Adjust approximation at runtime using slack in tail latency

§ Techniques used to reduce interference at runtime

  • Approximation
  • Resource relocation (core relocation, cache & memory partitioning)
SLIDE 11

PLIANT - OVERVIEW

  • Continuously monitors the tail latency
  • Dynamic recompilation
  • Runtime allocation

[Figure: system overview. A workload generator on a client sends requests to a server where an interactive service and an approximate computing app share CPUs, LLC, and main memory. Pliant’s performance monitor detects QoS violations; its actuator drives design space exploration, dynamic recompilation, and runtime resource allocation.]

SLIDE 12

PLIANT – RUNTIME ALGORITHM

§ Meet QoS as fast as possible
§ Minimize accuracy loss using latency slack when QoS is met

[Figure: state machine. While QoS is not met, the batch app steps from precise toward its most approximate variant (interference ⇩), then yields cores one at a time (Batch: -1 core, Interactive: +1 core). When latency slack exceeds 10%, the steps reverse to recover accuracy.]
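The feedback loop on this slide could look roughly like the following (a hedged Python sketch; the function name, thresholds, and simulated latency trace are assumptions, not Pliant's code):

```python
# Sketch of the runtime loop: step toward more approximation while QoS is
# violated, and step back toward precise when latency slack exceeds 10%.
# Level 0 = precise; max_level = most approximate variant.

def next_approx_level(level, max_level, tail_latency, qos_target):
    if tail_latency > qos_target:        # QoS not met: approximate more
        return min(level + 1, max_level)
    if tail_latency < 0.9 * qos_target:  # slack > 10%: recover accuracy
        return max(level - 1, 0)
    return level                         # within band: hold steady

qos = 100.0                              # illustrative tail-latency target
level, history = 0, []
for latency in [120, 130, 110, 95, 85, 80, 80]:   # simulated trace
    level = next_approx_level(level, 8, latency, qos)
    history.append(level)
print(history)                           # [1, 2, 3, 3, 2, 1, 0]
```

Once the most approximate variant is reached and QoS is still violated, the real system falls back to resource reallocation (slide 13).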

SLIDE 13

PLIANT – RUNTIME ALGORITHM

§ Multiple resources: cores, LLC and memory

[Figure: decision tree. With the batch app at its most approximate variant and QoS still unmet: if the CPU is saturated, move 1 core to the interactive service; if the cache is thrashing, move 1 LLC way; if memory is saturated, move 512 MB.]
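The resource-reclamation choice condenses into a small dispatch function (an illustrative Python sketch; the predicate names and return strings are assumptions):

```python
# Sketch of the multi-resource decision: once the batch app is fully
# approximate and QoS is still violated, reclaim whichever shared
# resource is the bottleneck for the interactive service.

def pick_reallocation(cpu_saturated, cache_thrashing, mem_saturated):
    if cpu_saturated:
        return "batch -1 core, interactive +1 core"
    if cache_thrashing:
        return "batch -1 LLC way, interactive +1 LLC way"
    if mem_saturated:
        return "batch -512 MB, interactive +512 MB"
    return "hold"

print(pick_reallocation(True, False, False))
print(pick_reallocation(False, True, False))
```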

SLIDE 14

PLIANT – VARYING APPROXIMATION DEGREE

§ Dynamic recompilation system

  • Approximate variants are aggregated to construct a tunable app
  • Linux signals tell DynamoRIO to switch to an approximate variant
  • The drwrap_replace() interface is used to replace functions

» Coarse granularity → low overheads

[Figure: the tunable app binary contains the precise function f1_p (addr0) plus approximate variants f1_a1 (addr1), f1_a2 (addr2), … The Pliant runtime maps signal0/signal1/signal2 to these addresses; on a signal, DynamoRIO redirects calls from f1_p to the selected variant.]
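The signal-to-variant dispatch can be mimicked in plain Python (a POSIX-only stand-in: the real system redirects the function address inside DynamoRIO via drwrap_replace(), and the variant bodies and signal mapping here are illustrative assumptions):

```python
import os
import signal

def f1_precise(x):
    return x * x                 # precise variant

def f1_approx1(x):
    return int(x) * int(x)       # lower-precision variant

# Dispatch table the runtime flips on a signal; Pliant instead rewrites
# the call target with drwrap_replace() at the binary level.
current = {"f1": f1_precise}

def on_signal(signum, frame):
    current["f1"] = f1_approx1 if signum == signal.SIGUSR1 else f1_precise

signal.signal(signal.SIGUSR1, on_signal)   # switch to approx1
signal.signal(signal.SIGUSR2, on_signal)   # switch back to precise

before = current["f1"](1.5)                # 2.25 (precise)
os.kill(os.getpid(), signal.SIGUSR1)       # runtime requests approximation
after = current["f1"](1.5)                 # 1 (approximate)
print(before, after)
```

Switching whole functions rather than individual instructions is what gives the coarse granularity and low overhead the slide mentions.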

SLIDE 15

PLIANT – RUNTIME RESOURCE ALLOCATION

§ All applications run in Docker containers

§ Core relocation

  • Docker update interface to allocate cores to each container

§ Cache allocation

  • Intel’s Cache Allocation Technology (CAT) to allocate cache ways

§ Memory capacity

  • Docker update interface to assign memory limits
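A runtime driving these knobs might shell out to `docker update` roughly as follows (command construction only, nothing is executed; the container name and values are hypothetical, and cache-way allocation via Intel CAT is not shown):

```python
# Sketch: build `docker update` invocations for core relocation and
# memory capping. `--cpuset-cpus` and `--memory` are real docker update
# flags; "batch_app" is a hypothetical container name.

def core_update_cmd(container, cpuset):
    # Pin the container to a CPU set, e.g. "0-3".
    return ["docker", "update", "--cpuset-cpus", cpuset, container]

def memory_update_cmd(container, mem_bytes):
    # Cap the container's memory in bytes.
    return ["docker", "update", "--memory", str(mem_bytes), container]

print(core_update_cmd("batch_app", "0-3"))
print(memory_update_cmd("batch_app", 4 * 1024**3))
```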
SLIDE 16

EXPERIMENTAL SETUP

§ Interactive services: NGINX, memcached, MongoDB

§ 24 approximate computing applications:

  • PARSEC, SPLASH2x, MineBench, BioPerf benchmark suites

§ Systems

  • 44-physical-core dual-socket platform, 128 GB RAM, 56 MB LLC/socket
  • Interactive services & approximate applications pinned to different physical cores of the same socket

§ Baseline

  • Approximate applications run in precise mode
  • Cores, cache, and memory shared fairly among the applications
SLIDE 17

EVALUATION - DYNAMIC BEHAVIOR

[Figure: timeline of tail latency and resource allocation under Pliant. As load rises, the batch app steps through approximate variants (Precise, Approx v1 - v8) and yields cores (Batch: -1 core, Interactive: +1 core); when slack returns, cores and accuracy are restored.]

SLIDE 18

EVALUATION – DYNAMIC BEHAVIOR

§ Across interactive services

  • memcached and NGINX need to reclaim resources
  • In case of MongoDB, approximation is enough

[Figure: dynamic behavior of Pliant across memcached, NGINX, and MongoDB, showing the approximate variants used over time]

SLIDE 19

EVALUATION – DYNAMIC BEHAVIOR

[Figure: dynamic behavior of Pliant across approximate applications (Bayesian and SNP)]

§ Across approximate applications

  • Bayesian shows bursty behavior - approximation usually enough
  • In case of SNP, no resource reclamation is required

§ Across all co-schedulings, QoS is met for all apps at an accuracy loss of up to 5% (2.8% on average)

SLIDE 20

SUMMARY - PLIANT

§ Approximation can break the performance vs utilization trade-off
§ Many cloud applications can tolerate some loss of quality
§ Pliant – a practical runtime system

  • Incremental approximation using dynamic recompilation
  • Dynamic allocation of shared resources

§ Achieves high utilization

  • Enabled co-scheduling of approximate batch apps with interactive services

§ Achieves high QoS

  • Meets QoS for all apps at cost of small accuracy loss (max 5%, avg 2.8%)
SLIDE 21

QUESTIONS?

SLIDE 22


THANK YOU!