Enhancing Server Efficiency in the Face of Killer Microseconds


SLIDE 1

Enhancing Server Efficiency in the Face of Killer Microseconds

Amirhossein Mirhosseini, Akshitha Sriraman, Thomas F. Wenisch
University of Michigan
HPCA 2019, 02/18/2019

SLIDE 2

Killer Microseconds

[Barroso’17]

  • Frequent microsecond-scale pauses in datacenter applications

– Stalls for accessing emerging memory & I/O devices
– Mid-tier servers synchronously waiting for leaf nodes
– Brief idle periods in high-throughput microservices

  • Modern computing systems not effective in hiding microseconds

– Micro-architectural techniques are insufficient
– OS/software context switches are too coarse-grained

SLIDE 3

Our proposal: Duplexity

  • Cost-effective, highly multithreaded server design
  • Heterogeneous design: dyads of cores

– Master core for latency-sensitive microservices
– Lender core for latency-insensitive applications

  • Key idea 1: master core may “borrow” threads from the lender core to fill utilization holes

  • Key idea 2: cores protect threads’ cache states to avoid excessive tail latencies and QoS violations
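The borrowing policy in key idea 1 can be pictured with a toy utilization model. This is our own illustration, not the paper's mechanism: the event shapes, durations, and round-robin rotation are assumptions chosen only to show how filler threads turn stall time into useful work.

```python
# Hypothetical sketch of the "borrowing" idea: whenever the
# latency-sensitive master thread stalls, the core runs borrowed
# filler threads; when the stall resolves, the master resumes.

from collections import deque

def simulate(master_events, filler_queue):
    """master_events: list of ("run"|"stall", duration_us) for the master.
    Returns (busy_us, total_us): useful work done vs. elapsed time."""
    busy = total = 0.0
    for kind, dur in master_events:
        total += dur
        if kind == "run":
            busy += dur                # master executes normally
        elif filler_queue:             # master stalled: borrow threads
            filler_queue.rotate(-1)    # round-robin among filler threads
            busy += dur                # stall filled with batch work
        # with no filler threads, the stall is a utilization hole
    return busy, total

# Master alternates 1 us of compute with 9 us stalls (e.g., a remote access).
events = [("run", 1), ("stall", 9)] * 4
busy_alone, total = simulate(events, deque())
busy_filled, _ = simulate(events, deque(["pagerank", "sssp"]))
print(busy_alone / total, busy_filled / total)  # 0.1 vs. 1.0 core utilization
```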


Duplexity improves core utilization by 4.8x in the presence of killer microseconds

SLIDE 4

Outline

  • Killer microseconds
  • Why (scaling) SMT is not an option
  • Duplexity server architecture
  • Evaluation methodology and results


SLIDE 5

Modern HW is great at hiding nanosecond scale stalls…

[Figure: a 50-nanosecond stall, easily hidden by caches, out-of-order execution, memory-level parallelism, speculation, and prefetching]

Micro-architectural techniques are at best able to hide 100s of nanoseconds

SLIDE 6

Modern OS is great at hiding millisecond scale stalls…

[Figure: a 5-millisecond stall, easily hidden by an OS context switch]

OS context switching typically incurs an overhead of at least 5–20 μs

SLIDE 7

  • Emerging memory and I/O technologies:

– NVM, disaggregated memory, … : O(1 μs)
– High-end flash, accelerators, … : O(10 μs)

  • Brief idle periods:

– With μs-scale microservices, idle periods also shrink to μs scales

  • 200K QPS service at 50% load has average idle periods of only 10 μs
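As a rough sanity check on the bullet above: the exact idle-period length depends on the arrival distribution, so the sketch below only derives the mean interarrival time and idle fraction, which already put the idle gaps firmly at μs scale.

```python
# Back-of-the-envelope arithmetic for a 200K QPS service at 50% load
# (assumes a single server and steady load; not a queueing-theory model).
qps = 200_000                        # queries per second
load = 0.5                           # 50% utilization

interarrival_us = 1e6 / qps          # mean time between requests: 5 us
service_us = interarrival_us * load  # mean service time at 50% load: 2.5 us
idle_fraction = 1 - load             # the core sits idle half the time

print(interarrival_us, service_us, idle_fraction)  # 5.0 2.5 0.5
```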


But, today’s devices and microservices inflict μs-scale stalls

Need HW/SW mechanisms to hide μs-scale latencies

SLIDE 8

Multithreading is the obvious solution

  • OS context switches are too coarse-grained for μs-scale periods

– User-level cooperative multithreading [Cho’18]
– Hardware (simultaneous) multithreading [Yamamoto’95][Tullsen’95][Tullsen’96] …


But, we need a lot of (10+) threads to fill μs-scale stall/idle periods
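A back-of-the-envelope way to see why 10+ threads are needed (our own arithmetic with assumed burst lengths, not a figure from the paper):

```python
import math

# Rough sizing sketch: while one thread waits out a stall of `stall_us`,
# the remaining threads must supply enough compute to keep the core busy.

def threads_to_hide(stall_us, compute_us):
    """Threads needed so the core never idles, assuming each thread
    alternates `compute_us` of execution with `stall_us` of waiting."""
    return math.ceil(stall_us / compute_us) + 1

print(threads_to_hide(10, 1))   # 10 us stalls, 1 us bursts -> 11 threads
print(threads_to_hide(0.1, 1))  # 100 ns stalls -> 2 threads (2-way SMT suffices)
```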

SLIDE 9

Simply adding more threads is not enough

  • Complicates fetch/dispatch/issue logic

– Prolonging its critical path

  • Requires a larger register file
  • Pressure/thrashing in L1 caches
  • Higher tail latency due to interference among threads

– Up to 5.7x higher tail latency
– 1.5x higher tail latency even at low load with a low-IPC co-runner


Need complexity management and performance isolation mechanisms

SLIDE 10

Duplexity

  • Two main objectives:

– Maximize performance density and energy efficiency

  • Fill utilization “holes” arising from killer microseconds

– Minimize disruption of latency-sensitive threads

  • Avoid excessive tail latency due to interference


Borrow latency-insensitive batch threads to fill microservices’ utilization holes; isolate stateful μarch structures (e.g., caches) to avoid QoS violations

SLIDE 11

Duplexity : a server made of “Dyads”

  • Master core

– Designed for latency-sensitive microservices
– “Borrows” threads from the lender core to fill utilization holes

  • Lender core

– Designed for latency-insensitive batch applications

  • Shared backlog of batch threads


[Figure: multiple lender–master core dyads share an LLC, memory and I/O controllers, and shared backlogs of batch threads]

SLIDE 12

Lender Core

  • Latency-insensitive batch threads

– In-order execution

  • Variable number of virtual contexts needed

– FIFO run-queue of virtual contexts in memory
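One way to picture the virtual-context backlog (class and method names here are hypothetical; the real mechanism is in hardware):

```python
from collections import deque

# Sketch of the lender core's virtual contexts: hardware SMT slots hold
# the running threads, while overflow contexts wait in an in-memory FIFO
# run-queue. When a resident thread stalls, it is swapped out and the
# head of the FIFO is swapped in.

class LenderCore:
    def __init__(self, hw_contexts, backlog):
        self.resident = list(backlog[:hw_contexts])    # in hardware slots
        self.run_queue = deque(backlog[hw_contexts:])  # virtual contexts in memory

    def on_stall(self, slot):
        """Resident thread in `slot` stalled: rotate it through the FIFO."""
        if self.run_queue:
            stalled = self.resident[slot]
            self.resident[slot] = self.run_queue.popleft()
            self.run_queue.append(stalled)

core = LenderCore(hw_contexts=2, backlog=["t0", "t1", "t2", "t3"])
core.on_stall(0)  # t0 stalls: t2 takes its slot, t0 joins the queue tail
print(core.resident, list(core.run_queue))  # ['t2', 't1'] ['t3', 't0']
```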


SLIDE 13

Lender Core

  • Hierarchical Simultaneous Multithreading (HSMT)

– Backlog of virtual contexts
– Inspired by Balanced Multithreading [Tune’04]

[Figure: 8-way in-order SMT datapath — per-context PCs (PC 0–7) indexed by a V-contexts pointer feed the frontend (fetch, instruction cache, instruction buffer, per-thread in-order issue queues FIFO 0–7) and the backend (select logic, register file, functional units, data cache)]

SLIDE 14

Master Core

  • Single latency-sensitive master thread

  • Borrows threads from the lender core to fill μs-scale holes

– Single-threaded out-of-order mode for the master thread
– Multi-threaded in-order mode for filler threads

  • Inspired by Morphcore [Khubaib’12]


Filler threads thrash the cache, TLB, and branch predictor state of the master thread → increases tail latency by more than 2x

SLIDE 15

[Figure: naive replication duplicates register files, I/D caches, branch predictors, and I/D TLBs between the master thread and the filler threads]

Segregating State

  • Naive solution: replicate all stateful μarch structures

– Register files, caches, branch predictor, TLBs, etc.


Caches and register files are large and power-hungry → full replication undermines performance density and energy efficiency objectives


Master core replicates only inexpensive structures (e.g., TLBs and branch predictors)

SLIDE 16

Segregating Register Files

  • Repurpose physical RF as architectural RF for filler threads
  • Retain master thread architectural registers

– Facilitates fast restart when the stall resolves
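The repurposing can be illustrated with example sizes. The register counts below are our own assumptions, not the paper's configuration; the point is only that one physical register file serves both modes.

```python
# Hypothetical register-file arithmetic for the master core.
PHYS_REGS = 160   # total physical registers (assumed)
ARCH_REGS = 32    # architectural registers per thread (assumed)

# Master-thread (out-of-order) mode: all registers serve the master
# thread as architectural registers plus a rename pool.
ooo_rename_pool = PHYS_REGS - ARCH_REGS               # 128 rename registers

# Filler-thread (in-order) mode: the master's 32 architectural registers
# are retained for a fast restart; the rest become architectural register
# sets for borrowed filler threads.
filler_threads = (PHYS_REGS - ARCH_REGS) // ARCH_REGS  # 4 filler threads

print(ooo_rename_pool, filler_threads)  # 128 4
```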


[Figure: in master-thread mode the register file holds the master thread’s architectural + physical registers; in filler-thread mode it retains the master thread’s architectural registers alongside the filler threads’ architectural registers]

What about caches?

SLIDE 17

Master-Lender Dyads

  • Master core remotely accesses the L1 I/D caches of the lender core

– Protects the master thread’s state
– Allows filler threads to hit on their own cache state

  • L0 I/D caches as effective bandwidth filters

[Figure: in master-thread mode, the master core runs out of its own L1 I/D caches; in filler-thread mode, it accesses the lender core’s L1 I/D caches remotely through small L0 I/D caches]

Master thread can resume execution almost immediately once the stall resolves

SLIDE 18

Evaluation Methodology

  • Master thread:

– Open source μs-scale microservices

  • Locality sensitive hashing, protocol routing, remote caching, word stemming
  • Filler threads:

– Data-parallel distributed graph algorithms

  • Page Rank, single-source shortest path
  • Design Alternatives:

– Baseline single-threaded OoO, SMT, Duplexity+Replication; more alternatives in the paper


SLIDE 19

Evaluation


Duplexity achieves 34% higher average core utilization than SMT, within 4% of the utilization achieved by Duplexity + replication

SLIDE 20

Evaluation


Duplexity improves performance density by 49%, 28%, and 10% compared to the baseline, SMT, and Duplexity+Replication, respectively

SLIDE 21

Evaluation


SMT worsens tail latency by 2.7x on average (up to 5.7x); Duplexity maintains tail latency within 19%


SLIDE 22

Conclusions

  • Killer Microseconds: Frequent μs-scale pauses in microservices
  • Modern computing systems not effective in hiding microseconds
  • Our proposal: Duplexity

– Cost-effective, highly multithreaded server architecture
– Heterogeneous design:

  • Master cores for latency-sensitive microservices
  • Lender cores for latency-insensitive batch applications

– Master core may “borrow” threads from the lender core to fill utilization holes
– Cores protect their threads’ cache states to avoid QoS violations

Duplexity improves utilization by 4.8x while maintaining tail latency within 19%

SLIDE 23

Questions?
