Enhancing Server Efficiency in the Face of Killer Microseconds
Amirhossein Mirhosseini, Akshitha Sriraman, Thomas F. Wenisch
University of Michigan
HPCA 2019, 02/18/2019
Killer Microseconds
[Barroso’17]
- Frequent microsecond-scale pauses in datacenter applications
– Stalls for accessing emerging memory & I/O devices
– Mid-tier servers synchronously waiting for leaf nodes
– Brief idle periods in high-throughput microservices
- Modern computing systems not effective in hiding microseconds
– Micro-architectural techniques are insufficient
– OS/software context switches are too coarse-grained
Our proposal: Duplexity
- Cost-effective highly multithreaded server design
- Heterogeneous design: dyads of cores
– Master core for latency-sensitive microservices
– Lender core for latency-insensitive applications
- Key idea 1: master core may “borrow” threads from the lender core to fill utilization holes
- Key idea 2: cores protect threads’ cache states to avoid excessive tail latencies and QoS violations
Duplexity improves core utilization by 4.8x in the presence of killer microseconds
Outline
- Killer microseconds
- Why (scaling) SMT is not an option
- Duplexity server architecture
- Evaluation methodology and results
Modern HW is great at hiding nanosecond scale stalls…
50 nanoseconds
Caches! OoO! MLP! Spec! Prefetching!
Micro-architectural techniques are at best able to hide 100s of nanoseconds
Modern OS is great at hiding millisecond scale stalls…
5 milliseconds … yawn … Context Switch!
OS context switching typically incurs an overhead of at least 5-20 μs
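Back-of-the-envelope, using the slide's own numbers: for a microsecond-scale stall, the two context switches (away and back) can cost as much time as the stall they were meant to hide.

```python
# Illustrative arithmetic (numbers from the slide, not a measurement):
# an OS context switch costs at least ~5 us; a killer-microsecond
# stall lasts ~10 us.
stall_us = 10.0            # typical microsecond-scale stall
switch_overhead_us = 5.0   # low end of the 5-20 us range

# Two switches per hidden stall: switch out, then switch back.
overhead_fraction = 2 * switch_overhead_us / stall_us
print(overhead_fraction)   # 1.0: the overhead equals the entire stall
```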
- Emerging memory and I/O technologies:
– NVM, disaggregated memory, … : O(1 μs)
– High-end flash, accelerators, … : O(10 μs)
- Brief idle periods:
– With μs-scale microservices, idle periods also shrink to μs scales
- 200K QPS service at 50% load has average idle periods of only 10 μs
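The 10 μs figure follows from a simple memoryless-arrival model (a sketch; the assumption that "200K QPS" is the service's peak capacity and that arrivals are Poisson is ours): once the core drains its queue, the expected gap until the next arrival is 1/λ.

```python
# Reconstructing the slide's idle-period estimate, assuming an
# M/M/1-style model: peak capacity 200K QPS, run at 50% load.
peak_qps = 200_000
load = 0.5
arrival_rate = load * peak_qps   # 100K queries/sec actually arriving

# Memoryless arrivals: mean idle period = 1/lambda.
mean_idle_us = 1e6 / arrival_rate
print(mean_idle_us)              # 10.0 microseconds
```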
But, today’s devices and microservices inflict μs-scale stalls
Need HW/SW mechanisms to hide μs-scale latencies
Multithreading is the obvious solution
- OS context switches are too coarse-grain for μs-scale periods
– User-level cooperative multithreading [Cho’18]
– Hardware (simultaneous) multithreading [Yamamoto’95][Tullsen’95][Tullsen’96] …
But, we need a lot of (10+) threads to fill μs-scale stall/idle periods
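Rough sizing of the "10+ threads" claim, with assumed per-thread numbers (not from the paper): if a thread computes for ~1 μs between ~10 μs stalls, the other threads must supply 10 μs of work while it waits.

```python
# Thread-count sizing sketch (run_us and stall_us are assumptions):
# to keep the core busy through one thread's stall, the remaining
# threads must cover ceil(stall/run) compute bursts.
import math

run_us = 1.0      # useful work per burst (assumed)
stall_us = 10.0   # microsecond-scale stall (assumed)

threads_needed = 1 + math.ceil(stall_us / run_us)
print(threads_needed)   # 11 threads to fully hide the stall
```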
Simply adding more threads is not enough
- Complicates fetch/dispatch/issue logic
– Prolonging its critical path
- Requires a larger register file
- Pressure/thrashing in L1 caches
- Higher tail latency due to interference among threads
– Up to 5.7x higher tail latency
– 1.5x higher tail even at low load with a low-IPC co-runner
Need complexity management and performance isolation mechanisms
Duplexity
- Two main objectives:
– Maximize performance density and energy efficiency
- Fill utilization “holes” arising from killer microseconds
– Minimize disruption of latency-sensitive threads
- Avoid excessive tail latency due to interference
Borrow latency-insensitive batch threads to fill microservices’ utilization holes
Isolate stateful uarch structures (e.g., caches) to avoid QoS violations
Duplexity : a server made of “Dyads”
- Master core
– Designed for latency-sensitive microservices
– “Borrows” threads from the lender core to fill utilization holes
- Lender core
– Designed for latency-insensitive batch applications
- Shared backlog of batch threads
[Diagram: multiple master-lender dyads sharing an LLC, memory/I/O controllers, and shared thread backlogs]
Lender Core
- Latency-insensitive batch threads
– In-order execution
- Variable number of virtual contexts needed
– FIFO run-queue of virtual contexts in memory
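A minimal sketch of the virtual-context idea (the structure and counts are assumptions, not the paper's RTL): more software threads than hardware thread slots, with the surplus parked in an in-memory FIFO run-queue and swapped in when an active thread stalls.

```python
# Lender-core virtual contexts, sketched in software. The 8 hardware
# slots come from the slide's 8-way SMT datapath; the 16-thread
# backlog and 32-register context are assumed for illustration.
from collections import deque

class VirtualContext:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0
        self.regs = [0] * 32   # architectural state saved in memory

HW_SLOTS = 8
backlog = deque(VirtualContext(t) for t in range(16))

# Fill the hardware contexts from the head of the FIFO.
active = [backlog.popleft() for _ in range(HW_SLOTS)]

def on_stall(slot):
    # Swap the stalled context for the FIFO head; park it at the tail.
    stalled = active[slot]
    active[slot] = backlog.popleft()
    backlog.append(stalled)

on_stall(0)
print(active[0].tid, len(backlog))   # 8 8
```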
Lender Core
- Hierarchical Simultaneous Multithreading (HSMT)
– Backlog of virtual contexts
– Inspired by Balanced Multithreading [Tune’04]
[Diagram: 8-way in-order SMT datapath. Frontend: 8 PCs plus a virtual-contexts pointer, fetch, instruction cache, instruction buffer. Backend: per-thread in-order issue FIFOs (FIFO 0-7), register file, select logic, functional units, data cache.]
Master Core
- Single latency-sensitive master thread
- Borrows threads from the lender core to fill μs-scale holes
– Single-threaded out-of-order mode for the master thread
– Multi-threaded in-order mode for filler threads
- Inspired by Morphcore [Khubaib’12]
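A behavioral sketch of the mode switch (class names, method names, and the filler count are assumptions for illustration, not the actual microarchitecture): the master core runs one out-of-order master thread; on a μs-scale stall it borrows fillers from the shared backlog and flips to an in-order multithreaded mode, then returns them when the stall resolves.

```python
# Master-core mode switching, modeled in software.
class MasterCore:
    def __init__(self, lender_backlog):
        self.mode = "OoO-master"
        self.lender_backlog = lender_backlog   # shared with lender core
        self.fillers = []

    def master_stalls(self, n_fillers=4):
        # Borrow filler threads from the lender core's backlog.
        self.fillers = [self.lender_backlog.pop(0) for _ in range(n_fillers)]
        self.mode = "in-order-fillers"

    def stall_resolves(self):
        # Return the fillers; the master's architectural state was
        # preserved, so it resumes almost immediately.
        self.lender_backlog.extend(self.fillers)
        self.fillers = []
        self.mode = "OoO-master"

core = MasterCore(lender_backlog=list(range(8)))
core.master_stalls()
print(core.mode, core.fillers)               # in-order-fillers [0, 1, 2, 3]
core.stall_resolves()
print(core.mode, len(core.lender_backlog))   # OoO-master 8
```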
Filler threads thrash the cache, TLB, and branch predictor state of the master thread → increase tail latency by >2x
[Diagram: master thread and filler threads contending for register files, I/D caches, I/D TLBs, and branch predictors]
Segregating State
- Naive solution: replicate all stateful uarch structures
– Register files, caches, branch predictor, TLBs, etc.
Caches and register files are large and power-hungry → full replication undermines performance density and energy efficiency objectives
Master core only replicates inexpensive structures (e.g., TLBs and predictors)
Segregating Register Files
- Repurpose physical RF as architectural RF for filler threads
- Retain master thread architectural registers
– Facilitates fast restart when the stall resolves
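A sketch of the register-file repurposing arithmetic (the physical-register and per-thread architectural-register counts are assumptions for illustration): in filler mode the master thread's architectural registers stay pinned, and the remaining physical registers back the fillers' architectural state.

```python
# Register-file segregation, sketched with assumed sizes.
PHYS_REGS = 160   # physical register file size (assumed)
ARCH_REGS = 32    # architectural registers per thread (assumed)

master_arch = range(0, ARCH_REGS)           # retained across the stall
filler_pool = range(ARCH_REGS, PHYS_REGS)   # repurposed for fillers

fillers_supported = len(filler_pool) // ARCH_REGS
print(fillers_supported)   # 4 filler threads fit in the freed registers
```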
[Diagram: register file usage. Master mode: master thread’s architectural + physical registers. Filler mode: master thread’s architectural registers retained; remaining registers hold filler threads’ architectural registers.]
What about caches?
Master-Lender Dyads
- Master core remotely accesses the L1 I/D caches of the lender core
– Protects the master thread’s state
– Allows filler threads to hit on their own cache state
- L0 I/D caches as effective bandwidth filters
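A toy model of the bandwidth-filter idea (the hit rate and access count are assumptions, not measured numbers): the small L0 absorbs most filler-thread accesses, so only L0 misses travel to the lender core's remote L1.

```python
# L0 filter effect, sketched with assumed numbers.
accesses = 1_000_000   # filler-thread cache accesses (assumed)
l0_hit_rate = 0.85     # L0 hit rate (assumed)

# Only L0 misses consume cross-core bandwidth to the remote L1.
remote_l1_lookups = round(accesses * (1 - l0_hit_rate))
print(remote_l1_lookups)   # 150000: a ~6.7x reduction in remote traffic
```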
[Diagram: in master-thread mode the master core uses its own L1 I/D caches; in filler-thread mode it accesses the lender core’s L1 I/D caches remotely, filtered through small L0 I/D caches]
Master thread can almost immediately resume execution as the stall resolves
Evaluation Methodology
- Master thread:
– Open source μs-scale microservices
- Locality-sensitive hashing, protocol routing, remote caching, word stemming
- Filler threads:
– Data-parallel distributed graph algorithms
- PageRank, single-source shortest path
- Design Alternatives:
– Baseline single-threaded OoO, SMT, Duplexity+Replication; more alternatives in the paper
Evaluation
Duplexity achieves 34% higher average core utilization than SMT, within 4% of the utilization achieved by Duplexity+Replication
Evaluation
Duplexity improves performance density by 49%, 28%, and 10% compared to baseline, SMT, and Duplexity+Replication
Evaluation
SMT worsens tail latency by 2.7x on average (up to 5.7x) Duplexity maintains tail latency within 19%
Conclusions
- Killer Microseconds: Frequent μs-scale pauses in microservices
- Modern computing systems not effective in hiding microseconds
- Our proposal: Duplexity
– Cost-effective highly multithreaded server architecture
– Heterogeneous design:
- Master cores for latency-sensitive microservices
- Lender cores for latency-insensitive batch application
– Master core may “borrow” threads from the lender core to fill utilization holes
– Cores protect their threads’ cache states to avoid QoS violations
Duplexity improves utilization by 4.8x while maintaining tail latency within 19%
Questions?