Enhancing Server Efficiency in the Face of Killer Microseconds
Amirhossein Mirhosseini, Akshitha Sriraman, Thomas F. Wenisch
University of Michigan
HPCA 2019, 02/18/2019
Killer Microseconds
[Barroso’17]
- Frequent microsecond-scale pauses in datacenter applications
– Stalls for accessing emerging memory & I/O devices
– Mid-tier servers synchronously waiting for leaf nodes
– Brief idle periods in high-throughput microservices
- Modern computing systems not effective in hiding microseconds
– Micro-architectural techniques are insufficient
– OS/software context switches are too coarse-grained
Our proposal: Duplexity
- Cost-effective highly multithreaded server design
- Heterogeneous design: dyads of cores
– Master core for latency-sensitive microservices
– Lender core for latency-insensitive applications
- Key idea 1: master core may “borrow” threads from the lender core to fill utilization holes
- Key idea 2: cores protect threads’ cache states to avoid excessive tail latencies and QoS violations
Duplexity improves core utilization by 4.8x in the presence of killer microseconds
Outline
- Killer microseconds
- Why (scaling) SMT is not an option
- Duplexity server architecture
- Evaluation methodology and results
Modern HW is great at hiding nanosecond scale stalls…
50 nanoseconds
Caches! OoO! MLP! Spec! Prefetching!
Micro-architectural techniques are at best able to hide 100s of nanoseconds
Modern OS is great at hiding millisecond scale stalls…
5 milliseconds … yawn … Context Switch!
OS context switching typically incurs an overhead of at least 5-20 μs
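Back-of-the-envelope, using the slide's own numbers: for a microsecond-scale stall, the two context switches (away and back) can cost as much time as the stall they were meant to hide.

```python
# Illustrative arithmetic (numbers from the slide, not a measurement):
# an OS context switch costs at least ~5 us; a killer-microsecond
# stall lasts ~10 us.
stall_us = 10.0            # typical microsecond-scale stall
switch_overhead_us = 5.0   # low end of the 5-20 us range

# Two switches per hidden stall: switch out, then switch back.
overhead_fraction = 2 * switch_overhead_us / stall_us
print(overhead_fraction)   # 1.0: the overhead equals the entire stall
```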
- Emerging memory and I/O technologies:
– NVM, disaggregated memory, … : O(1 μs)
– High-end flash, accelerators, … : O(10 μs)
- Brief idle periods:
– With μs-scale microservices, idle periods also shrink to μs scales
- 200K QPS service at 50% load has average idle periods of only 10 μs
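The 10 μs figure follows from a simple memoryless-arrival model (a sketch; the assumption that "200K QPS" is the service's peak capacity and that arrivals are Poisson is ours): once the core drains its queue, the expected gap until the next arrival is 1/λ.

```python
# Reconstructing the slide's idle-period estimate, assuming an
# M/M/1-style model: peak capacity 200K QPS, run at 50% load.
peak_qps = 200_000
load = 0.5
arrival_rate = load * peak_qps   # 100K queries/sec actually arriving

# Memoryless arrivals: mean idle period = 1/lambda.
mean_idle_us = 1e6 / arrival_rate
print(mean_idle_us)              # 10.0 microseconds
```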
But, today’s devices and microservices inflict μs-scale stalls
Need HW/SW mechanisms to hide μs-scale latencies
Multithreading is the obvious solution
- OS context switches are too coarse-grain for μs-scale periods
– User-level cooperative multithreading [Cho’18]
– Hardware (simultaneous) multithreading [Yamamoto’95][Tullsen’95][Tullsen’96] …
But, we need a lot of (10+) threads to fill μs-scale stall/idle periods
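Rough sizing of the "10+ threads" claim, with assumed per-thread numbers (not from the paper): if a thread computes for ~1 μs between ~10 μs stalls, the other threads must supply 10 μs of work while it waits.

```python
# Thread-count sizing sketch (run_us and stall_us are assumptions):
# to keep the core busy through one thread's stall, the remaining
# threads must cover ceil(stall/run) compute bursts.
import math

run_us = 1.0      # useful work per burst (assumed)
stall_us = 10.0   # microsecond-scale stall (assumed)

threads_needed = 1 + math.ceil(stall_us / run_us)
print(threads_needed)   # 11 threads to fully hide the stall
```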
Simply adding more threads is not enough
- Complicates fetch/dispatch/issue logic
– Prolonging its critical path
- Requires a larger register file
- Pressure/thrashing in L1 caches
- Higher tail latency due to interference among threads
– Up to 5.7x higher tail latency
– 1.5x higher tail even at low load with a low-IPC co-runner
Need complexity management and performance isolation mechanisms
Duplexity
- Two main objectives:
– Maximize performance density and energy efficiency
- Fill utilization “holes” arising from killer microseconds
– Minimize disruption of latency-sensitive threads
- Avoid excessive tail latency due to interference
Borrow latency-insensitive batch threads to fill microservices’ utilization holes
Isolate stateful uarch structures (e.g., caches) to avoid QoS violations
Duplexity : a server made of “Dyads”
- Master core
– Designed for latency-sensitive microservices
– “Borrows” threads from the lender core to fill utilization holes
- Lender core
– Designed for latency-insensitive batch applications
- Shared backlog of batch threads
[Diagram: multiple master-lender dyads sharing an LLC, memory/I/O controllers, and shared thread backlogs]
Lender Core
- Latency-insensitive batch threads
– In-order execution
- Variable number of virtual contexts needed
– FIFO run-queue of virtual contexts in memory
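A minimal sketch of the virtual-context idea (the structure and counts are assumptions, not the paper's RTL): more software threads than hardware thread slots, with the surplus parked in an in-memory FIFO run-queue and swapped in when an active thread stalls.

```python
# Lender-core virtual contexts, sketched in software. The 8 hardware
# slots come from the slide's 8-way SMT datapath; the 16-thread
# backlog and 32-register context are assumed for illustration.
from collections import deque

class VirtualContext:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0
        self.regs = [0] * 32   # architectural state saved in memory

HW_SLOTS = 8
backlog = deque(VirtualContext(t) for t in range(16))

# Fill the hardware contexts from the head of the FIFO.
active = [backlog.popleft() for _ in range(HW_SLOTS)]

def on_stall(slot):
    # Swap the stalled context for the FIFO head; park it at the tail.
    stalled = active[slot]
    active[slot] = backlog.popleft()
    backlog.append(stalled)

on_stall(0)
print(active[0].tid, len(backlog))   # 8 8
```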
Lender Core
- Hierarchical Simultaneous Multithreading (HSMT)
– Backlog of virtual contexts
– Inspired by Balanced Multithreading [Tune’04]
[Diagram: 8-way in-order SMT datapath. Frontend: 8 PCs plus a virtual-contexts pointer, fetch, instruction cache, instruction buffer. Backend: per-thread in-order issue FIFOs (FIFO 0-7), register file, select logic, functional units, data cache.]
Master Core
- Single latency-sensitive master thread
- Borrows threads from the lender core to fill μs-scale holes
– Single-threaded out-of-order mode for the master thread
– Multi-threaded in-order mode for filler threads
- Inspired by Morphcore [Khubaib’12]
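A behavioral sketch of the mode switch (class names, method names, and the filler count are assumptions for illustration, not the actual microarchitecture): the master core runs one out-of-order master thread; on a μs-scale stall it borrows fillers from the shared backlog and flips to an in-order multithreaded mode, then returns them when the stall resolves.

```python
# Master-core mode switching, modeled in software.
class MasterCore:
    def __init__(self, lender_backlog):
        self.mode = "OoO-master"
        self.lender_backlog = lender_backlog   # shared with lender core
        self.fillers = []

    def master_stalls(self, n_fillers=4):
        # Borrow filler threads from the lender core's backlog.
        self.fillers = [self.lender_backlog.pop(0) for _ in range(n_fillers)]
        self.mode = "in-order-fillers"

    def stall_resolves(self):
        # Return the fillers; the master's architectural state was
        # preserved, so it resumes almost immediately.
        self.lender_backlog.extend(self.fillers)
        self.fillers = []
        self.mode = "OoO-master"

core = MasterCore(lender_backlog=list(range(8)))
core.master_stalls()
print(core.mode, core.fillers)               # in-order-fillers [0, 1, 2, 3]
core.stall_resolves()
print(core.mode, len(core.lender_backlog))   # OoO-master 8
```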
Filler threads thrash the cache, TLB, and branch predictor state of the master thread → increase tail latency by >2x
[Diagram: master thread and filler threads contending for register files, I/D caches, I/D TLBs, and branch predictors]
Segregating State
- Naive solution: replicate all stateful uarch structures
– Register files, caches, branch predictor, TLBs, etc.
Caches and register files are large and power-hungry → full replication undermines performance density and energy efficiency objectives
Master core only replicates inexpensive structures (e.g., TLBs and predictors)
Segregating Register Files
- Repurpose physical RF as architectural RF for filler threads
- Retain master thread architectural registers
– Facilitates fast restart when the stall resolves
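A sketch of the register-file repurposing arithmetic (the physical-register and per-thread architectural-register counts are assumptions for illustration): in filler mode the master thread's architectural registers stay pinned, and the remaining physical registers back the fillers' architectural state.

```python
# Register-file segregation, sketched with assumed sizes.
PHYS_REGS = 160   # physical register file size (assumed)
ARCH_REGS = 32    # architectural registers per thread (assumed)

master_arch = range(0, ARCH_REGS)           # retained across the stall
filler_pool = range(ARCH_REGS, PHYS_REGS)   # repurposed for fillers

fillers_supported = len(filler_pool) // ARCH_REGS
print(fillers_supported)   # 4 filler threads fit in the freed registers
```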
[Diagram: register file usage. Master mode: master thread’s architectural + physical registers. Filler mode: master thread’s architectural registers retained; remaining registers hold filler threads’ architectural registers.]
What about caches?
Master-Lender Dyads
- Master core remotely accesses the L1 I/D caches of the lender core
– Protects the master thread’s state
– Allows filler threads to hit on their own cache state
- L0 I/D caches as effective bandwidth filters
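A toy model of the bandwidth-filter idea (the hit rate and access count are assumptions, not measured numbers): the small L0 absorbs most filler-thread accesses, so only L0 misses travel to the lender core's remote L1.

```python
# L0 filter effect, sketched with assumed numbers.
accesses = 1_000_000   # filler-thread cache accesses (assumed)
l0_hit_rate = 0.85     # L0 hit rate (assumed)

# Only L0 misses consume cross-core bandwidth to the remote L1.
remote_l1_lookups = round(accesses * (1 - l0_hit_rate))
print(remote_l1_lookups)   # 150000: a ~6.7x reduction in remote traffic
```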
[Diagram: in master-thread mode the master core uses its own L1 I/D caches; in filler-thread mode it accesses the lender core’s L1 I/D caches remotely, filtered through small L0 I/D caches]
Master thread can almost immediately resume execution as the stall resolves
Evaluation Methodology
- Master thread:
– Open source μs-scale microservices
- Locality-sensitive hashing, protocol routing, remote caching, word stemming
- Filler threads:
– Data-parallel distributed graph algorithms
- PageRank, single-source shortest path
- Design Alternatives:
– Baseline single-threaded OoO, SMT, Duplexity+Replication; more alternatives in the paper
Evaluation
Duplexity achieves 34% higher average core utilization than SMT, within 4% of the utilization achieved by Duplexity+Replication
Evaluation
Duplexity improves performance density by 49%, 28%, and 10% compared to baseline, SMT, and Duplexity+Replication
Evaluation
SMT worsens tail latency by 2.7x on average (up to 5.7x) Duplexity maintains tail latency within 19%
Conclusions
- Killer Microseconds: Frequent μs-scale pauses in microservices
- Modern computing systems not effective in hiding microseconds
- Our proposal: Duplexity
– Cost-effective highly multithreaded server architecture
– Heterogeneous design:
- Master cores for latency-sensitive microservices
- Lender cores for latency-insensitive batch application
– Master core may “borrow” threads from the lender core to fill utilization holes
– Cores protect their threads’ cache states to avoid QoS violations
Duplexity improves utilization by 4.8x while maintaining tail latency within 19%
Questions?