SLIDE 1

PLIANT: LEVERAGING APPROXIMATION TO IMPROVE RESOURCE EFFICIENCY IN DATACENTERS

Neeraj Kulkarni, Feng Qi, Christina Delimitrou

SLIDE 2

CLOUD COMPUTING

§ Resource Flexibility

  • Users can elastically scale their resources on-demand

§ Cost Efficiency

  • Sharing resources between multiple users and applications

Batch applications (QoS: throughput) vs. latency-critical interactive apps (QoS: tail latency)

SLIDE 3

LOW UTILIZATION!

§ Servers operate at 10% - 40% utilization most of the time

§ Major reasons:

  • Dedicated servers for interactive services
  • Resource over-provisioning – conservative reservations

[Figure: CPU utilization distributions for a Twitter cluster and a Google cluster]

  • C. Delimitrou and C. Kozyrakis, “Quasar: Resource-Efficient and QoS-Aware Cluster Management,” in ASPLOS, 2014
  • L. Barroso et al., “The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines,” Second edition, 2013
SLIDE 4

MULTI-TENANCY

§ Scheduling multiple jobs on the same server

  • Increases server utilization and cost efficiency
  • Causes interference in shared resources (cores, LLC, memory, network)

§ Interference → unpredictable performance
§ Especially difficult with interactive services

[Figure: App1 and App2 co-located on one server, contending for CPU cores, LLC, memory, compute, and network]

SLIDE 5

PREVIOUS SOLUTIONS

  • 1. Allow co-scheduling only of apps that would not violate QoS
  • Bubble-Up, Bubble-Flux, Paragon and Quasar
  • 2. Partition shared resources at runtime to reduce interference
  • Heracles, Ubik, Rubik
  • 3. Reduce interference by throttling applications at runtime
  • Bubble-Flux, ReQoS, Protean Code

§ But these approaches sacrifice …

  • Server utilization, by disallowing certain co-locations
  • Performance of batch applications, by treating them as low-priority
SLIDE 6

BREAK UTILIZATION VS PERFORMANCE TRADE-OFF

§ Approximate computing applications

  • Tolerate some loss in output accuracy in return for

» Improved performance, or
» Same performance with reduced resources

§ Cloud workloads suitable for approximation

  • Performance can be more important than highest output quality

§ Co-locate approximate batch apps with interactive services

  • Meet performance for both applications at the cost of some inaccuracy
SLIDE 7

LEVERAGING APPROXIMATION

  • 1. Mitigate interference:
  • Approximation can reduce # of requests to memory system & network
  • Approximation may not always be sufficient
  • 2. Meet performance of approximate applications:
  • When approximation is not enough, employ resource partitioning:

» Core relocation
» Cache partitioning
» Memory partitioning

  • Provide more resources to interactive service to meet its QoS
  • Approximation preserves the performance of batch applications
SLIDE 8

APPROXIMATION TECHNIQUES

§ Loop perforation: Skip fraction of iterations

  • Fewer instructions & data accesses → exec time ⇩ & cache interference ⇩

§ Synchronization elision: Barriers, locks elided

  • Threads don’t wait for sync → exec time ⇩
  • Reduces memory accesses for acquiring locks

§ Lower precision: Reduce precision of variables

  • e.g., replace ‘double’ with ‘float’ or ‘int’
  • Reduces memory traffic

  double l_c = 1.0  →  float l_c = 1.0        (lower precision)

  for i = 1 to N:                             (loop perforation)
      if i % 2 != 0:
          .....

  lock()                                      (synchronization elision:
  g_c = g_c + l_c                              lock/unlock/BARRIER elided)
  unlock()
  BARRIER()

  for i = 1 to M:                             (tiling: compute A[i][2],
      A[i][2] = F(i,2)                         project onto neighbors
      for j = 1 to 3:                          instead of A[i][j] = F(i,j))
          A[i][j] = A[i][2]

§ Tiling: Compute 1 element & project onto neighbors

  • Fewer instructions & data accesses → exec time ⇩ & cache interference ⇩
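As a concrete illustration, loop perforation can be sketched in a few lines of Python (a minimal sketch, not Pliant's implementation; the function names, data, and skip rate are illustrative assumptions):

```python
# Illustrative sketch of loop perforation (hypothetical function names).
# The perforated variant visits only every `skip`-th element and rescales,
# trading a small accuracy loss for fewer instructions and data accesses.

def precise_sum(xs):
    # Precise variant: visits every element.
    return sum(xs)

def perforated_sum(xs, skip=2):
    # Perforated variant: skips a fraction of the iterations.
    kept = xs[::skip]
    return sum(kept) * (len(xs) / max(len(kept), 1))

data = [float(i) for i in range(1000)]
exact = precise_sum(data)          # 499500.0
approx = perforated_sum(data)      # 499000.0 with skip=2
error = abs(approx - exact) / exact
print(f"exact={exact} approx={approx} error={error:.2%}")
```

Skipping half the iterations here costs about 0.1% accuracy, which mirrors the slide's point: execution time and cache pressure drop at a modest quality cost.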

SLIDE 9

APPROXIMATION TRADE-OFFS

§ 100s of approximate variants
§ Pruning design space:

  • Hint-based:

» Employ approximations hinted by ACCEPT* tool

  • Profiling-based (gprof):

» Approximate in functions which contribute most to execution time

[Figure: execution time normalized to precise vs. inaccuracy (%) for Canneal; precise, approximate, and selected variants shown]

*ACCEPT: A Programmer-Guided Compiler Framework for Practical Approximate Computing, A. Sampson et al.

[Figure: tail latency vs. QoS for nginx, memcached, and mongodb when co-scheduled with Canneal variants (Precise, Approx v1 - v8)]

SLIDE 10

PLIANT: GOALS

§ High utilization

  • Co-schedule interactive services with approximate applications

§ High QoS

  • Satisfy QoS of all co-scheduled jobs at the cost of some accuracy loss

§ Minimize accuracy loss

  • Adjust approximation at runtime using slack in tail latency

§ Techniques used to reduce interference at runtime

  • Approximation
  • Resource relocation (core relocation, cache & memory partitioning)
SLIDE 11

PLIANT - OVERVIEW

  • Continuously monitors the tail latency
  • Dynamic recompilation
  • Runtime allocation

[Figure: system overview. A workload generator on a client sends requests to a server where an interactive service and an approximate computing app share CPUs, LLC, and main memory. Pliant’s performance monitor detects QoS violations; its actuator drives design space exploration, dynamic recompilation, and runtime resource allocation.]

SLIDE 12

PLIANT – RUNTIME ALGORITHM

§ Meet QoS as fast as possible
§ Minimize accuracy loss using latency slack when QoS is met

[Figure: state machine. While QoS is not met, the batch app steps from precise toward its most approximate variant (interference ⇩), then yields cores one at a time (Batch: -1 core, Interactive: +1 core). When latency slack exceeds 10%, the steps reverse to recover accuracy.]
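The feedback loop on this slide could look roughly like the following (a hedged Python sketch; the function name, thresholds, and simulated latency trace are assumptions, not Pliant's code):

```python
# Sketch of the runtime loop: step toward more approximation while QoS is
# violated, and step back toward precise when latency slack exceeds 10%.
# Level 0 = precise; max_level = most approximate variant.

def next_approx_level(level, max_level, tail_latency, qos_target):
    if tail_latency > qos_target:        # QoS not met: approximate more
        return min(level + 1, max_level)
    if tail_latency < 0.9 * qos_target:  # slack > 10%: recover accuracy
        return max(level - 1, 0)
    return level                         # within band: hold steady

qos = 100.0                              # illustrative tail-latency target
level, history = 0, []
for latency in [120, 130, 110, 95, 85, 80, 80]:   # simulated trace
    level = next_approx_level(level, 8, latency, qos)
    history.append(level)
print(history)                           # [1, 2, 3, 3, 2, 1, 0]
```

Once the most approximate variant is reached and QoS is still violated, the real system falls back to resource reallocation (slide 13).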

SLIDE 13

PLIANT – RUNTIME ALGORITHM

§ Multiple resources: cores, LLC and memory

[Figure: decision tree. With the batch app at its most approximate variant and QoS still unmet: if the CPU is saturated, move 1 core to the interactive service; if the cache is thrashing, move 1 LLC way; if memory is saturated, move 512 MB.]
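The resource-reclamation choice condenses into a small dispatch function (an illustrative Python sketch; the predicate names and return strings are assumptions):

```python
# Sketch of the multi-resource decision: once the batch app is fully
# approximate and QoS is still violated, reclaim whichever shared
# resource is the bottleneck for the interactive service.

def pick_reallocation(cpu_saturated, cache_thrashing, mem_saturated):
    if cpu_saturated:
        return "batch -1 core, interactive +1 core"
    if cache_thrashing:
        return "batch -1 LLC way, interactive +1 LLC way"
    if mem_saturated:
        return "batch -512 MB, interactive +512 MB"
    return "hold"

print(pick_reallocation(True, False, False))
print(pick_reallocation(False, True, False))
```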

SLIDE 14

PLIANT – VARYING APPROXIMATION DEGREE

§ Dynamic recompilation system

  • Approximate variants are aggregated to construct a tunable app
  • Linux signals tell DynamoRIO to switch to an approximate variant
  • The drwrap_replace() interface is used to replace functions

» Coarse granularity → low overheads

[Figure: the tunable app binary contains the precise function f1_p (addr0) plus approximate variants f1_a1 (addr1), f1_a2 (addr2), … The Pliant runtime maps signal0/signal1/signal2 to these addresses; on a signal, DynamoRIO redirects calls from f1_p to the selected variant.]
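The signal-to-variant dispatch can be mimicked in plain Python (a POSIX-only stand-in: the real system redirects the function address inside DynamoRIO via drwrap_replace(), and the variant bodies and signal mapping here are illustrative assumptions):

```python
import os
import signal

def f1_precise(x):
    return x * x                 # precise variant

def f1_approx1(x):
    return int(x) * int(x)       # lower-precision variant

# Dispatch table the runtime flips on a signal; Pliant instead rewrites
# the call target with drwrap_replace() at the binary level.
current = {"f1": f1_precise}

def on_signal(signum, frame):
    current["f1"] = f1_approx1 if signum == signal.SIGUSR1 else f1_precise

signal.signal(signal.SIGUSR1, on_signal)   # switch to approx1
signal.signal(signal.SIGUSR2, on_signal)   # switch back to precise

before = current["f1"](1.5)                # 2.25 (precise)
os.kill(os.getpid(), signal.SIGUSR1)       # runtime requests approximation
after = current["f1"](1.5)                 # 1 (approximate)
print(before, after)
```

Switching whole functions rather than individual instructions is what gives the coarse granularity and low overhead the slide mentions.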

SLIDE 15

PLIANT – RUNTIME RESOURCE ALLOCATION

§ All applications run in Docker containers

§ Core relocation

  • Docker update interface to allocate cores to each container

§ Cache allocation

  • Intel’s Cache Allocation Technology (CAT) to allocate cache ways

§ Memory capacity

  • Docker update interface to assign memory limits
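A runtime driving these knobs might shell out to `docker update` roughly as follows (command construction only, nothing is executed; the container name and values are hypothetical, and cache-way allocation via Intel CAT is not shown):

```python
# Sketch: build `docker update` invocations for core relocation and
# memory capping. `--cpuset-cpus` and `--memory` are real docker update
# flags; "batch_app" is a hypothetical container name.

def core_update_cmd(container, cpuset):
    # Pin the container to a CPU set, e.g. "0-3".
    return ["docker", "update", "--cpuset-cpus", cpuset, container]

def memory_update_cmd(container, mem_bytes):
    # Cap the container's memory in bytes.
    return ["docker", "update", "--memory", str(mem_bytes), container]

print(core_update_cmd("batch_app", "0-3"))
print(memory_update_cmd("batch_app", 4 * 1024**3))
```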
SLIDE 16

EXPERIMENTAL SETUP

§ Interactive services: NGINX, memcached, MongoDB

§ 24 approximate computing applications:

  • PARSEC, SPLASH2x, MineBench, BioPerf benchmark suites

§ Systems

  • 44-physical-core dual-socket platform, 128 GB RAM, 56 MB LLC/socket
  • Interactive services & approximate applications pinned to different physical cores of the same socket

§ Baseline

  • Approximate applications run in precise mode
  • Cores, cache, and memory shared fairly among the applications
SLIDE 17

EVALUATION - DYNAMIC BEHAVIOR

[Figure: timeline of tail latency and resource allocation under Pliant. As load rises, the batch app steps through approximate variants (Precise, Approx v1 - v8) and yields cores (Batch: -1 core, Interactive: +1 core); when slack returns, cores and accuracy are restored.]

SLIDE 18

EVALUATION – DYNAMIC BEHAVIOR

§ Across interactive services

  • memcached and NGINX need to reclaim resources
  • In case of MongoDB, approximation is enough

[Figure: dynamic behavior of Pliant across memcached, NGINX, and MongoDB, showing the approximate variants used over time]

SLIDE 19

EVALUATION – DYNAMIC BEHAVIOR

[Figure: dynamic behavior of Pliant across approximate applications (Bayesian and SNP)]

§ Across approximate applications

  • Bayesian shows bursty behavior - approximation usually enough
  • In case of SNP, no resource reclamation is required

§ Across all co-schedulings, QoS is met for all apps at an accuracy loss of up to 5% (2.8% on average)

SLIDE 20

SUMMARY - PLIANT

§ Approximation can break the performance vs utilization trade-off
§ Many cloud applications can tolerate some loss of quality
§ Pliant – a practical runtime system

  • Incremental approximation using dynamic recompilation
  • Dynamic allocation of shared resources

§ Achieves high utilization

  • Enabled co-scheduling of approximate batch apps with interactive services

§ Achieves high QoS

  • Meets QoS for all apps at cost of small accuracy loss (max 5%, avg 2.8%)
SLIDE 21

QUESTIONS?

SLIDE 22


THANK YOU!