Think outside the rack 2015-04-21 WRSC john wilkes / - PowerPoint PPT Presentation

Think outside the rack
2015-04-21 WRSC
john wilkes / johnwilkes@google.com, Parthasarathy Ranganathan, Steven Hand
Google Inc.

Datacenter loads are not SPEC benchmarks
Single query across multiple racks of multiple servers: graph of one query and associated RPCs for work distribution (only two levels shown); other queries going on, but not shown.
Graphic from Dick Sites

Good news! lots of new technologies
Silicon/hardware is getting ever more inventive
forced to move to parallelism to track Moore's "law"
Main memory: lots of volatile RAM, new non-volatile h/w
Computation: oodles of cores, specialized accelerators
Storage: flash/SSD, [magnetic disks still kicking]
Networking: high bandwidth + low latency + lossless(?)

Good news! resource disaggregation
Conceptually it's wonderful:
Build a "rack computer" from a kit of parts (*)
○ a single big, disaggregated machine
○ all the benefits of a unified OS
Build a "datacenter in a rack" (*)
○ a single, scaled-down distributed system, like the big guys use
○ all the benefits of shared-nothing distributed systems
(*) OK - build at least two, for reliability

Good news! a RackScale foo is both of these
Upsides:
○ meet all the needs of all but the largest organizations
○ buy just what you need (save money)
○ build just what you want (go fast)
○ tune for peak performance (go fast; save money)
○ conceptually similar to existing programming models
What could possibly go wrong?

An RSfoo breaks everything
An RSfoo is not the same as a computer
○ multiple internal failure domains
○ non-uniform resource access costs
An RSfoo is not the same as a datacenter
○ shared nothing => disaggregated resources
○ existing programming models don't work

Datacenter experiences are relevant
DRAM errors (1% AFR)
Disk failures (2-10% AFR)
Machine crashes (~2/year)
OS upgrades (2-6/year)
A 2000-machine service will have >10 machine crashes per day
This is not a problem because of the shared-nothing model
Images by Connie Zhou
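The ">10 machine crashes per day" figure follows directly from the per-machine crash rate on the slide. A minimal back-of-the-envelope sketch, assuming crashes are spread evenly over a 365-day year (the function and constant names are illustrative, not from the talk):

```python
# Back-of-the-envelope check of the crash rate quoted on the slide.
MACHINES = 2000                 # service size, from the slide
CRASHES_PER_MACHINE_YEAR = 2    # "~2/year", from the slide
DAYS_PER_YEAR = 365             # assumption: crashes spread evenly over the year


def crashes_per_day(machines, crashes_per_machine_year):
    """Expected machine crashes per day across the whole service."""
    return machines * crashes_per_machine_year / DAYS_PER_YEAR


rate = crashes_per_day(MACHINES, CRASHES_PER_MACHINE_YEAR)
print(f"{rate:.1f} machine crashes per day")  # prints "11.0 machine crashes per day"
```

OS upgrades (2-6 per machine per year) would add further planned restarts on top of this number.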

RSfoo failures
If disaggregation is used
○ each component failure ⇒ partial system failure ⇒ visible at the app level
○ fault propagation at the speed of light
Apps aren't designed to handle this today

RSfoo provisioning
How much of what to buy?
○ workload lifetime << hardware depreciation cycle
○ multiple esoteric resources
○ requires (dynamic) hardware evolution
Apps + planning tools aren't designed to handle this today

RSfoo placement/scheduling
Avoid resource stranding
○ disaggregation helps …
○ but RSfoo has more resource types
Avoid bad placement
○ NUMA writ large
○ dynamic interference
Existing placement / scheduling algorithms aren't good at this today
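Resource stranding, as named on the slide above, can be illustrated with a toy placement problem. The sketch below is an assumption-laden illustration, not a method from the talk: the two-resource `Capacity` type, the `first_fit` helper, and all server and task sizes are hypothetical. It compares first-fit placement on two fixed servers against the same tasks placed into a single pooled (disaggregated) capacity.

```python
from dataclasses import dataclass


@dataclass
class Capacity:
    cpu: int  # cores
    ram: int  # GiB


def first_fit(servers, tasks):
    """Place each task on the first server with enough free CPU and RAM."""
    free = [Capacity(s.cpu, s.ram) for s in servers]  # copy, so inputs stay intact
    placed, rejected = [], []
    for task in tasks:
        for slot in free:
            if slot.cpu >= task.cpu and slot.ram >= task.ram:
                slot.cpu -= task.cpu
                slot.ram -= task.ram
                placed.append(task)
                break
        else:  # no server could hold the task
            rejected.append(task)
    return placed, rejected, free


# Hypothetical sizes, chosen only to make the effect visible.
servers = [Capacity(8, 32), Capacity(8, 32)]
tasks = [Capacity(6, 8), Capacity(6, 8), Capacity(4, 32)]

# Fixed servers: the third task fits nowhere, although enough capacity is free in total.
placed, rejected, free = first_fit(servers, tasks)
stranded_cpu = sum(f.cpu for f in free)
stranded_ram = sum(f.ram for f in free)
print(len(placed), len(rejected), stranded_cpu, stranded_ram)  # prints "2 1 4 48"

# Disaggregated: treat the same hardware as one pool of CPU and RAM.
pool = [Capacity(sum(s.cpu for s in servers), sum(s.ram for s in servers))]
pooled_placed, pooled_rejected, _ = first_fit(pool, tasks)
print(len(pooled_placed), len(pooled_rejected))  # prints "3 0"
```

The pooled case deliberately ignores the slide's other two concerns: it models only two resource types and treats every access as equal cost, so it says nothing about "NUMA writ large" or dynamic interference.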

RSfoo inter-application interference
RSfoos are small-scale datacenters, so will run multiple apps
Disaggregated resources make ...
○ performance isolation much harder
○ security isolation much harder
○ failure isolation much harder
Apps + systems aren't designed to handle this today

RSfoo groups
You still need multiple RSfoos ...
○ control-plane failure
○ datacenter / network / environment failure
○ "big" workloads
○ end-user latency
Existing inter-datacenter solutions (e.g., full replication) probably aren't ideal

Good news! We'll have job security ;-)

Good news!
The solutions are in sight. The problems are just beginning.
○ Failures
○ Provisioning and configuration hassles
○ Interference
○ Multi-RSfoo support

One possible approach
For each feature/property/behavior, start by asking:
○ "is this a big computer, or a small datacenter?"
○ (distributed systems techniques go a long way)
Thinking about timescales may help:
○ seconds and up - datacenter control model
○ below that: application-level
Introduce a feature after addressing issues identified here
○ don't forget the programming model!
