

SLIDE 1

RUBIK: FAST ANALYTICAL POWER MANAGEMENT

FOR LATENCY-CRITICAL SYSTEMS

HARSHAD KASTURE, DAVIDE BARTOLINI, NATHAN BECKMANN, DANIEL SANCHEZ

MICRO 2015

SLIDE 2

Motivation

• Low server utilization in today’s datacenters results in resource and energy inefficiency
• Stringent latency requirements of user-facing services are a major contributing factor
• Power management for these services is challenging
  ◦ Strict requirements on tail latency
  ◦ Inherent variability in request arrival and service times
• Rubik uses statistical modeling to adapt to short-term variations
  ◦ Respond to abrupt load changes
  ◦ Improve power efficiency
  ◦ Allow colocation of latency-critical and batch applications

SLIDE 3

Understanding Latency-Critical Applications

[Diagram: a client request enters the datacenter at a root node and fans out to leaf nodes, each backed by its own back-end servers]


SLIDE 6

Understanding Latency-Critical Applications

• The few slowest responses determine user-perceived latency
• Tail latency (e.g., 95th / 99th percentile), not mean latency, determines performance

[Diagram: the fan-out tree with per-node response-time annotations (1 ms); the slowest leaf gates the overall reply]
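A minimal sketch (not from the talk; all numbers made up) of why the tail, not the mean, governs user-perceived latency under fan-out:

```python
import random
import statistics

random.seed(0)

# Simulate 10,000 leaf-node service times (exponential: mostly fast,
# with a long tail of stragglers).
leaf_latencies = [random.expovariate(1.0) for _ in range(10_000)]

def percentile(data, p):
    """Nearest-rank p-th percentile."""
    s = sorted(data)
    k = min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))
    return s[k]

mean = statistics.mean(leaf_latencies)
p99 = percentile(leaf_latencies, 99)

# A root node that fans out to 100 leaves waits for the slowest one,
# so its typical latency tracks the leaves' tail, not their mean.
root_latencies = [max(random.sample(leaf_latencies, 100))
                  for _ in range(1_000)]
root_mean = statistics.mean(root_latencies)

print(f"leaf mean={mean:.2f}  leaf p99={p99:.2f}  root mean={root_mean:.2f}")
```

Even though the average leaf is fast, the root's typical latency lands near the leaves' 99th percentile.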

SLIDE 7

Prior Schemes Fall Short

• Traditional DVFS schemes (cpufreq, TurboBoost, …)
  ◦ React to coarse-grained metrics like processor utilization
  ◦ Oblivious to short-term performance requirements
• Power management for embedded systems (PACE, GRACE, …)
  ◦ Do not consider queuing
• Schemes designed specifically for latency-critical systems (PEGASUS [Lo ISCA’14], Adrenaline [Hsu HPCA’15])
  ◦ Rely on application-specific heuristics
  ◦ Too conservative

SLIDE 8

Insight 1: Short-Term Load Variations

• Latency-critical applications have significant short-term load variations
• PEGASUS [Lo ISCA’14] uses feedback control to adapt the frequency setting to diurnal load variations
  ◦ Deduce server load from observed request latency
  ◦ Cannot adapt to short-term variations

[Figure: load trace for moses, showing short-term variations]

SLIDE 9

Insight 2: Queuing Matters!

• Tail latency is often determined by queuing, not the length of individual requests
• Adrenaline [Hsu HPCA’15] uses application-level hints to distinguish long requests from short ones
  ◦ Long requests boosted (sped up)
  ◦ Frequency settings must be conservative to handle queuing

[Figure: latency breakdown for moses]

SLIDE 10

Rubik Overview

• Use queue length as a measure of instantaneous system load
• Update frequency whenever queue length changes
• Adapt to short-term load variations

[Figure: core activity, queue length, and core frequency over time; Rubik raises frequency as the queue grows and drops to idle when it drains]
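The event-driven policy above can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: `required_frequency` is a toy stand-in for Rubik's statistical model, and the DVFS states are made up.

```python
# Hypothetical sketch of Rubik's control loop: the core frequency is
# re-evaluated on every arrival/departure, i.e., whenever the queue
# length changes.

FREQ_STEPS_GHZ = [1.2, 1.6, 2.0, 2.4]  # assumed available DVFS states

def required_frequency(queue_len: int) -> float:
    """Toy policy: deeper queues demand higher frequency.
    Rubik derives this from its statistical model instead."""
    if queue_len == 0:
        return FREQ_STEPS_GHZ[0]              # idle: lowest state
    idx = min(queue_len, len(FREQ_STEPS_GHZ)) - 1
    return FREQ_STEPS_GHZ[idx]

class Core:
    def __init__(self):
        self.queue_len = 0
        self.freq = FREQ_STEPS_GHZ[0]

    def on_arrival(self):
        self.queue_len += 1
        self.freq = required_frequency(self.queue_len)

    def on_departure(self):
        self.queue_len -= 1
        self.freq = required_frequency(self.queue_len)

core = Core()
core.on_arrival(); core.on_arrival(); core.on_arrival()
print(core.freq)   # deeper queue -> higher frequency
core.on_departure(); core.on_departure(); core.on_departure()
print(core.freq)   # queue drained -> back to lowest state
```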

SLIDE 11

Goal: Reshaping Latency Distribution

[Figure: probability density of response latency]

SLIDE 12

Key Factors in Setting Frequencies

• Distribution of cycle requirements of individual requests
  ◦ Larger variance → more conservative frequency setting
• How long has a request spent in the queue?
  ◦ Longer wait times → higher frequency
• How many requests are queued waiting for service?
  ◦ Longer queues → higher frequency

SLIDE 13

There’s Math!

Condition the per-request cycle distribution $S$ on the $\omega$ cycles the head request has already executed, then convolve to get the aggregate cycle demand of $i$ queued requests:

$$P[S_0 \ge c] = P[S \ge c + \omega \mid S \ge \omega] = \frac{P[S \ge c + \omega]}{P[S \ge \omega]}$$

$$P[S_i] = P[S_{i-1}] * P[S] = \underbrace{P[S_0] * P[S] * \cdots * P[S]}_{i\ \text{times}}$$

Then set the core frequency so every queued request meets the latency target $L$:

$$f = \max_{i = 0 \ldots N} \frac{c_i^L}{t_i - m_i}$$

where $c_i^L$ is the target-tail cycle demand through request $i$, $t_i$ its remaining time budget, and $m_i$ its projected memory stall time (which does not scale with core frequency).

[Figure: cycle-demand distributions for $S$, $S_0$, and $S_i$]
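A minimal numerical sketch of these equations (my reconstruction, not the paper's code), representing $S$ as a discrete PMF; the toy PMF and the time budget `t`, stall time `m` are made up:

```python
import numpy as np

def condition_on_executed(pmf, omega):
    """Residual distribution: P[S0 = c] = P[S = c + omega] / P[S >= omega]."""
    tail = pmf[omega:].sum()
    return pmf[omega:] / tail

def tail_cycles(pmf, percentile):
    """Smallest c with P[S <= c] >= percentile (the c^L of the slide)."""
    cdf = np.cumsum(pmf)
    return int(np.searchsorted(cdf, percentile))

# Toy PMF over cycle counts 0..9 (uniform, for illustration only).
S = np.full(10, 0.1)

# Head request already ran 4 cycles: its residual distribution S0.
S0 = condition_on_executed(S, 4)

# Aggregate demand of the head plus 2 more queued requests:
# P[S_2] = P[S_0] * P[S] * P[S]  (discrete convolutions).
S2 = np.convolve(np.convolve(S0, S), S)

c_L = tail_cycles(S2, 0.99)   # 99th-percentile cycle demand
t, m = 20.0, 2.0              # time budget and memory stall time (made up)
f = c_L / (t - m)             # required frequency: f = c^L / (t - m)
print(c_L, round(f, 3))
```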

SLIDE 14

Efficient Implementation

• Pre-computed tables store most of the required quantities
  ◦ Table contents are independent of system load!
• Implemented as a software runtime
• Hardware support: fast per-core DVFS, performance counters for CPI stacks

[Tables: target-tail entries c0…c15 and m0…m15, one row per bucket of ω (ω = 0, < 25th pct, < 50th pct, < 75th pct, otherwise); updated periodically, read on each request arrival/departure]
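The fast path can be sketched as a pure table lookup (a hypothetical illustration: the bucket boundaries and all table values below are made up, not the paper's):

```python
# The heavy statistics are precomputed offline into per-queue-position
# tables (c_i, m_i), so each arrival/departure only does a lookup and a max.

# C_TABLE[bucket][i]: tail cycle demand with i+1 requests queued;
# M_TABLE[bucket][i]: projected memory stall time; one row per bucket
# of omega (cycles the head request has already executed).
C_TABLE = [[5, 9, 13, 17], [4, 8, 12, 16], [3, 7, 11, 15]]
M_TABLE = [[1, 2, 3, 4],   [1, 2, 3, 4],   [1, 2, 3, 4]]
OMEGA_BUCKETS = [0, 100, 400]  # bucket lower bounds, in cycles

def omega_bucket(omega):
    """Index of the last bucket whose lower bound is <= omega."""
    b = 0
    for j, lo in enumerate(OMEGA_BUCKETS):
        if omega >= lo:
            b = j
    return b

def pick_frequency(omega, queue_len, budgets):
    """f = max_i c_i / (t_i - m_i), with every term read from the tables."""
    b = omega_bucket(omega)
    return max(C_TABLE[b][i] / (budgets[i] - M_TABLE[b][i])
               for i in range(queue_len))

f = pick_frequency(omega=150, queue_len=3, budgets=[10.0, 20.0, 30.0])
print(round(f, 3))
```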

SLIDE 15

Evaluation

• Microarchitectural simulations using zsim
  ◦ Power model tuned to a real system
• Compare Rubik against two oracular schemes:
  ◦ StaticOracle: pick the lowest static frequency that meets latency targets for a given request trace
  ◦ AdrenalineOracle: assume oracular knowledge of long and short requests, use offline training to pick frequencies for each

[Diagram: simulated chip with six cores (Core 0–5) and a shared L3]
  • Westmere-like OOO cores
  • Fast per-core DVFS
  • CPI stack counters
  • Pin threads to cores
SLIDE 16

Evaluation

• Five diverse latency-critical applications
  ◦ xapian (search engine)
  ◦ masstree (in-memory key-value store)
  ◦ moses (statistical machine translation)
  ◦ shore-mt (OLTP)
  ◦ specjbb (Java middleware)
• For each application, the latency target is set at the tail latency achieved at nominal frequency (2.4 GHz) at 50% utilization

SLIDE 17

Tail Latency

[Figure: tail latency results]



SLIDE 21

Core Power Savings

• All three schemes save significant power at low utilization
• Rubik performs best, reducing core power by up to 66%
• Rubik’s relative savings increase as short-term adaptation becomes more important
• Rubik saves significant power even at high utilization
  ◦ 17% on average, and up to 34%

SLIDE 22

Real Machine Power Savings

• V/F transition latencies of >100 µs even with integrated voltage controllers
  ◦ Likely due to inefficiencies in firmware
• Rubik successfully adapts to higher V/F transition latencies

SLIDE 23

Static Power Limits Efficiency

[Figure: datacenter utilization over time, with latency-critical machines left partly idle while batch machines are provisioned separately]

SLIDE 24

RubikColoc: Colocation Using Rubik

[Diagram: RubikColoc runs latency-critical and batch applications on the same machine with a statically partitioned LLC; Rubik sets the latency-critical cores’ frequencies]

SLIDE 25

RubikColoc Savings

• RubikColoc saves significant power and resources over a segregated datacenter baseline
  ◦ At high load: 17% reduction in datacenter power consumption and 19% fewer machines
  ◦ At low load: 31% reduction in datacenter power consumption and 41% fewer machines

SLIDE 26

Conclusions

• Rubik uses fine-grained power management to reduce active core power consumption by up to 66%
• Rubik uses statistical modeling to account for various sources of uncertainty, and avoids application-specific heuristics
• RubikColoc uses Rubik to colocate latency-critical and batch applications, reducing datacenter power consumption by up to 31% while using up to 41% fewer machines

SLIDE 27

THANKS FOR YOUR ATTENTION! QUESTIONS?