Resource Efficient Computing for Warehouse-scale Datacenters
Christos Kozyrakis
Stanford University http://csl.stanford.edu/~christos
DATE Conference – March 21st 2013
Computing is the Innovation Catalyst
Science, Government, Commerce, Healthcare, Education, Entertainment
[K. Vaid, Microsoft Global Foundation Services, 2010]
Scalable capabilities for demanding services
Websearch, social nets, machine translation, cloud computing
Compute, storage, networking
Cost effective
Low capital & operational expenses
Low total cost of ownership (TCO)
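As a rough illustration of how CapEx and OpEx combine into TCO, here is a toy per-server model; every input (server price, lifetime, power draw, PUE, electricity rate) is an assumption for illustration, not a figure from the talk:

```python
# Toy monthly TCO for one server: amortized purchase cost (CapEx) plus
# energy cost, with cooling/distribution overhead folded in via PUE (OpEx).
# All input values are illustrative assumptions.
def monthly_tco(server_cost=2000.0, lifetime_months=36,
                watts=300.0, pue=1.5, usd_per_kwh=0.07):
    capex = server_cost / lifetime_months      # amortized CapEx per month
    kwh = watts * pue * 24 * 30 / 1000.0       # monthly energy drawn
    return capex + kwh * usd_per_kwh

print(round(monthly_tco(), 2))  # -> 78.24
```

Both classic cost-reduction levers act on this sum: commodity servers cut the CapEx term, and better power delivery and cooling cut the PUE factor in the OpEx term.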
Cost reduction
Switch to commodity servers
Improved power delivery & cooling
Capability scaling
More datacenters
More servers per datacenter
Multicore servers
Scalable network fabrics
Are we using our current resources efficiently? Are we building the right systems to begin with?
Servers dominate datacenter cost
CapEx and OpEx
Server resources are poorly utilized
CPU cores, memory, storage
[Figure: monthly datacenter cost breakdown: Servers 61%, Energy 16%, Cooling 14%, Networking 6%, Other 3%]
[J. Hamilton, http://mvdirona.com]
[Figures: total cost of ownership breakdown; server utilization distribution. U. Hoelzle and L. Barroso, 2009]
Primary reasons
Diurnal user traffic & unexpected spikes
Planning for future traffic growth
Difficulty of designing balanced servers
Higher utilization through workload co-scheduling
Analytics run on front-end servers when traffic is low
Spiking services overflow onto servers for other services
Servers with unused resources export them to other servers
E.g., storage, Flash, memory
So, why hasn’t co-scheduling solved the problem yet?
Interference on shared resources
Cores, caches, memory, storage, network
Large performance losses
E.g. 40% for Google apps [Tang’11]
QoS issue for latency-critical applications
Optimized for low 99th-percentile latency in addition to throughput
Assume a 1% chance of >1 sec latency per server, with 100 servers used per request
Then there is a 63% chance of user request latency >1 sec
Common cures lead to poor utilization
Limited resource sharing
Exaggerated reservations
Research agenda
Workload analysis
Understand resource needs, impact of interference
Mechanisms for interference reduction
HW & SW isolation mechanisms (e.g., cache partitioning)
Interference-aware datacenter management
Scheduling for min interference and max resource use
Resource efficient hardware design
Energy efficient, optimized for sharing
Potential for >5x improvement in TCO
Two obstacles to good performance
Interference: sharing resources with other apps
Heterogeneity: running on a suboptimal server configuration
[Diagram: scheduler taking apps, metrics, and system state]
Quickly classify incoming apps
For heterogeneity and interference caused/tolerated
Heterogeneity & interference aware scheduling
Send apps to the best possible server configuration
Co-schedule apps that don't interfere much
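One way to realize both bullets is a greedy placement rule: prefer the server type where the app performs best, minus a penalty for interference with apps already placed there. This is a hypothetical sketch with invented scores, not the scheduler from the talk:

```python
# Hypothetical greedy, heterogeneity- and interference-aware placement.
# perf[app][server_type]: predicted performance on that configuration (0..1).
# interf[a][b]: how badly apps a and b hurt each other when co-located (0..1).
def place(app, servers, perf, interf):
    def score(server):
        penalty = sum(interf[app].get(other, 0.0) for other in server["apps"])
        return perf[app][server["type"]] - penalty
    best = max(servers, key=score)   # best configuration net of interference
    best["apps"].append(app)
    return best

servers = [{"type": "big-core", "apps": []}, {"type": "wimpy", "apps": []}]
perf = {"web": {"big-core": 0.9, "wimpy": 0.6},
        "batch": {"big-core": 0.7, "wimpy": 0.6}}
interf = {"web": {"batch": 0.5}, "batch": {"web": 0.5}}

place("web", servers, perf, interf)    # web -> big-core (0.9 beats 0.6)
place("batch", servers, perf, interf)  # batch avoids web: 0.7 - 0.5 < 0.6
print([s["apps"] for s in servers])    # -> [['web'], ['batch']]
```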
Monitor & adapt
Deviation from expected behavior signals error or phase change
[Diagram: scheduler extended with app classification (heterogeneity, interference, learning) alongside metrics and system state]
Cannot afford to exhaustively analyze workloads
High churn rates of evolving and/or unknown apps
Classification using collaborative filtering
Similar to recommendations for movies and other products
Leverage knowledge from previously scheduled apps
Within 1 min of sparse profiling we can estimate:
How much interference an app causes/tolerates on each resource
How well it will perform on each server type
[Diagram: classification pipeline over a sparse utility matrix (applications x resources): initial decomposition via SVD, PQ factorization fitted with SGD, reconstructed utility matrix, final SVD, interference scores]
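A toy version of the factorization step in that pipeline: fill in a sparse utility matrix via low-rank PQ factorization fitted with SGD, in the style of movie recommenders. All matrix sizes and values below are synthetic:

```python
import numpy as np

# Synthetic ground-truth utility matrix (applications x server types),
# observed only at a random subset of entries (brief profiling runs).
rng = np.random.default_rng(0)
n_apps, n_types, rank = 30, 8, 2
U = rng.normal(size=(n_apps, rank)) @ rng.normal(size=(rank, n_types))
mask = rng.random(U.shape) < 0.7          # ~70% of entries observed

# PQ factorization (U ~= P @ Q) fitted by SGD over the observed entries.
P = rng.normal(scale=0.1, size=(n_apps, rank))
Q = rng.normal(scale=0.1, size=(rank, n_types))
lr, reg = 0.05, 0.01
for _ in range(500):
    for i, j in zip(*np.nonzero(mask)):
        err = U[i, j] - P[i] @ Q[:, j]
        p_old = P[i].copy()
        P[i] += lr * (err * Q[:, j] - reg * P[i])
        Q[:, j] += lr * (err * p_old - reg * Q[:, j])

# How well do we estimate the entries we never profiled?
rmse = np.sqrt(np.mean((U[~mask] - (P @ Q)[~mask]) ** 2))
print(f"RMSE on unobserved entries: {rmse:.3f}")
```

This reconstruction is what lets a short, sparse profile of a new app be expanded into estimates for every server type and interference source.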
5K apps on 1K EC2 instances (14 server types)
Better performance with same resources
Most workloads within 10% of ideal performance
Can serve additional apps without the need for more HW
Example: scheduling work on underutilized memcached servers
Reporting QPS at a cutoff of 500 µs for 95th-percentile latency
High potential for utilization improvement
All the way to 100% CPU utilization with little QoS impact
Several open issues
System configuration, OS scheduling, management of hardware resources
[Figure: memcached 95th-percentile latency (µs) vs. number of background processes (6-24), at 25%, 50%, 75%, and 100% of peak QPS, with % of base IPC and % server utilization]
Are we using our current resources efficiently? Are we building the right systems to begin with?
Server power main energy bottleneck in datacenters
PUE of ~1.1 → the rest of the system is energy efficient
Significant main memory (DRAM) power
25-40% of server power across all utilization points
Low dynamic range → no energy proportionality [U. Hoelzle and L. Barroso, 2009]
DDR3 optimized for high bandwidth (1.5V, 800MHz)
On-chip DLLs & on-die termination lead to high static power
70 pJ/bit at 100% utilization, 260 pJ/bit at low data rates
LVDDR3 alternative (1.35V, 400MHz)
Lower Vdd, higher on-die termination
Still disproportional at 190 pJ/bit
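The disproportionality in those numbers falls out of a simple amortization model: a fixed static power is spread over fewer bits at low data rates. The 20/50 pJ split below is an assumed fit to the DDR3 figures above, not a measured breakdown:

```python
# Toy energy-per-bit model: dynamic energy per bit plus static energy
# amortized over utilization. The split (20 pJ dynamic, 50 pJ static at
# full rate) is assumed so that full utilization lands at 70 pJ/bit.
DYNAMIC_PJ = 20.0
STATIC_PJ_AT_FULL_RATE = 50.0

def energy_per_bit(utilization: float) -> float:
    return DYNAMIC_PJ + STATIC_PJ_AT_FULL_RATE / utilization

print(energy_per_bit(1.0))   # -> 70.0 pJ/bit at full bandwidth
print(energy_per_bit(0.25))  # -> 220.0 pJ/bit at 25%: far from proportional
```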
Need memory systems that consume power in proportion to use
What metric can we trade for efficiency?
Online apps rely on memory capacity, density, reliability
But not on memory bandwidth
Web-search and map-reduce: CPU or DRAM latency bound, <6% of peak DRAM bandwidth used
Memory caching, DRAM-based storage, social media: overall bandwidth limited by the network (<10% of DRAM bandwidth)
We can trade off bandwidth for energy efficiency
Resource utilization for Microsoft services under stress testing [Micro'11]:

                        CPU    Memory BW    Disk BW
Large-scale analytics   88%    1.6%         8%
Search                  97%    5.8%         36%
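Those utilizations make the bandwidth-for-energy trade concrete: even if the memory interface delivered only half of DDR3's peak bandwidth, both services would remain far from saturating it. A quick check with the numbers above:

```python
# Measured memory-bandwidth use as a fraction of DDR3 peak (figures from
# the stress-testing table above), checked against a half-peak interface.
mem_bw_fraction = {"large-scale analytics": 0.016, "search": 0.058}
HALF_BW = 0.5
for service, frac in mem_bw_fraction.items():
    headroom = HALF_BW / frac   # how over-provisioned the link still is
    print(f"{service}: {headroom:.0f}x headroom at half peak bandwidth")
```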
Same core, capacity, and latency as DDR3
Interface optimized for lower power & lower bandwidth (1/2 of DDR3)
No termination, lower frequency, faster powerdown modes
Energy proportional & energy efficient (5x)
LPDDR2 module: die stacking + buffered module design
High capacity + good signal integrity
5x reduction in memory power, no performance loss
Save power or increase capability in TCO neutral manner
Unintended consequences
Energy efficient DRAM → L3 cache power now dominates
[Figure: memory power breakdown for Search, Memcached-a/b, SPECPower, SPECWeb, SPECjbb]
Resource efficiency
A promising approach for scalability & cost efficiency
Potential for large benefits in TCO
Key questions
Are we using our current resources efficiently?
Research on understanding, reducing, and managing interference
Hardware & software
Are we building the right systems to begin with?
Research on new compute, memory, and storage structures