The Time-less Datacenter: Paul Borrill and Alan H. Karp, Earth Computing

SLIDE 1

The Time-less Datacenter

Paul Borrill and Alan H. Karp

Earth Computing

The Datacenter Resilience Company

Stanford EE Computer Systems Colloquium

Wednesday, November 16, 2016

http://ee380.stanford.edu

SLIDE 2

Cloud Computing

The Three Taxes:

  1. Complexity
  2. Fragility
  3. Vulnerability

SLIDE 3

Earth Computing | The Datacenter Resilience Company

Twitter Today

Systems can fail in catastrophic ways, leading to death or tremendous financial loss. Although there are many potential causes, including physical failure, human error, and environmental factors, design errors are increasingly becoming the most serious culprit.*

*NASA Formal Methods Program: https://shemesh.larc.nasa.gov/fm/fm-why-new.html

SLIDE 4

Key Computer Science Problems

Reliable Consensus

  • Two Generals Problem (no fixed-length protocol can guarantee a reliable solution in an environment where messages can get lost)
  • Slow Node vs. Link Failure Indistinguishability, i.e., what can one side of a failed link assume about a partner or cohort on the other side?

FLP Result

  • Impossibility of Distributed Consensus with One Faulty Process

Key Idea:

  • Don’t depend on processes to provide liveness; use a new kind of link
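The indistinguishability argument behind the Two Generals bullet can be checked in a toy model. The sketch below is not from the talk, and its names (`views`, `last_sender_cannot_tell`) are hypothetical. It assumes a fixed protocol of n messages that alternate between generals A and B, and it shows that the sender of the last message has the same view whether or not that message arrived. Because the sender must decide the same way in both runs, the last message cannot be what guarantees agreement; dropping it and repeating the argument reduces any fixed-length protocol to zero messages.

```python
# Toy model of the Two Generals argument (hypothetical, for illustration only).
# Odd-numbered messages go A -> B, even-numbered messages go B -> A.

def views(delivered: int) -> tuple[frozenset, frozenset]:
    """Return (view_a, view_b): the message numbers each general has received
    when only the first `delivered` messages of the protocol arrive."""
    view_a: set[int] = set()
    view_b: set[int] = set()
    for i in range(1, delivered + 1):
        if i % 2 == 1:
            view_b.add(i)  # A -> B
        else:
            view_a.add(i)  # B -> A
    return frozenset(view_a), frozenset(view_b)


def last_sender_cannot_tell(n: int) -> bool:
    """True if the sender of message n sees the same thing whether or not
    message n was delivered, even though the two runs differ for the receiver."""
    full_run = views(n)
    lost_run = views(n - 1)
    sender = 0 if n % 2 == 1 else 1  # index of the general who sent message n
    return full_run[sender] == lost_run[sender] and full_run != lost_run


for n in range(1, 6):
    print(n, last_sender_cannot_tell(n))  # True for every n
```

The model ignores message contents on purpose: whatever a message says, its sender's view is unchanged by its loss, which is all the argument needs.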

SLIDE 5

Problem: Event Ordering is Hard

  • In a distributed system over a general network, we can’t tell whether an event at process R happened before an event at process Q unless one caused the other in some way
  • Causal Trees provide this guarantee when they are stable
  • Dynamic Causal Trees provide guarantees through failure & healing, iff you have AIT on each link
  • Needs Atomic Information Transfer (AIT) in the link

[Figure: space-time diagram of processes P, Q and R along a time axis t, each event labelled with its vector timestamp (e.g., P:2 Q:5 R:4), highlighting one event’s causal history and future effect; message lines have slope ≤ c.]
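The timestamps in the figure record, for each process, how far an event’s causal history reaches into every other process. The standard vector-clock rules (Fidge/Mattern) can be sketched as below; the class and function names are hypothetical, and counters here start at 0 and are incremented on every event, so the numbering may differ from the figure’s convention. An event a happened before b exactly when a’s vector is component-wise ≤ b’s and the two differ; if neither holds, the events are concurrent, which is the case the slide says cannot be ordered.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """A process that stamps each of its events with a vector clock."""
    name: str
    members: tuple[str, ...]
    clock: dict[str, int] = field(init=False)

    def __post_init__(self) -> None:
        self.clock = {m: 0 for m in self.members}

    def local_event(self) -> dict[str, int]:
        self.clock[self.name] += 1
        return dict(self.clock)  # a snapshot is this event's timestamp

    def send(self) -> dict[str, int]:
        return self.local_event()  # the message carries the sender's timestamp

    def receive(self, stamp: dict[str, int]) -> dict[str, int]:
        for m, t in stamp.items():  # merge by element-wise maximum
            self.clock[m] = max(self.clock[m], t)
        return self.local_event()


def happened_before(a: dict[str, int], b: dict[str, int]) -> bool:
    return all(a[m] <= b[m] for m in a) and a != b


def concurrent(a: dict[str, int], b: dict[str, int]) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)


members = ("P", "Q", "R")
p, q, r = (Process(n, members) for n in members)
a = p.send()         # P sends a message to Q
b = q.receive(a)     # Q receives it
c = r.local_event()  # R acts independently
print(happened_before(a, b), concurrent(b, c))  # True True
```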

SLIDE 6

Problem: Consensus is Hard

  • Failure detectors have failed to solve the problem
  • 2PC (Fail-Stop)
  • Blocks if the coordinator fails (a liveness problem)
  • 3PC vulnerable to network partitions (can violate safety)
  • Paxos (Fail-Recover)
  • Robust algorithm, but hard to understand & get right
  • Causal Trees make roles robust and easier to understand & verify

SLIDE 7

Why? Because The Network is Flaky!

  • App developers believe the network is the problem
  • Networks drop, delay, duplicate & reorder packets
  • Networking people believe the apps are the problem
  • The end-to-end principle: apps should retry to distinguish between delays & drops … but … retries* ruin TCP’s ordering guarantees
  • Both are incorrect. The solution requires a simple but fresh perspective

Peter Bailis and Kyle Kingsbury, “The Network is Reliable.”
* Application retries (e.g., opening a new socket)
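The footnote’s point about retries can be reproduced in a few lines. This toy simulation is not from the talk, and `simulate_retry` is a hypothetical name. It models each connection as an independent FIFO queue, because TCP orders bytes only within one connection. When the application gives up on a stalled connection and retries on a new socket, the receiver sees m1 twice, and the stale copy arrives after m2.

```python
from collections import deque


def simulate_retry() -> list[tuple[str, str]]:
    """Return (connection, message) pairs in the order the receiver sees them."""
    conn_a: deque[str] = deque()  # each connection is FIFO, like one TCP stream
    conn_b: deque[str] = deque()

    conn_a.append("m1")  # m1 goes out on connection A, which then stalls
    # The sender times out. It cannot tell a delay from a drop, so it opens
    # a new connection B, retries m1, and carries on with m2.
    conn_b.append("m1")
    conn_b.append("m2")

    received = []
    while conn_b:  # B delivers first ...
        received.append(("B", conn_b.popleft()))
    while conn_a:  # ... then A's delayed m1 finally arrives
        received.append(("A", conn_a.popleft()))
    return received


print(simulate_retry())
# [('B', 'm1'), ('B', 'm2'), ('A', 'm1')]
```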

SLIDE 8

Datacenter Failures Cascade

Switches are DReDDful

They Drop, Reorder, Delay and Duplicate Packets

  • Interdependent failures
  • Reconstruction storms
  • Timeout storms
  • Gossip storms
  • Cascade failures

SLIDE 9

It’s Time to Simplify

Delta, Amazon, Google, Apple, Netflix, PayPal, …

SLIDE 10

Earth Computing

Tim Berners-Lee

The Big Idea

World Wide Web

Key Idea:

2 Simple Sets of Rules

  • Document Language (html)
  • Connection Protocol (http)

Cloud Computing

Mere mortals can now get their computers to talk to each other.

Mere mortals can now manage their infrastructures.

ONE-WAY LINKS vs. TWO-WAY LINKS

Earth Computing

Key Idea:

2 Simple Sets of Rules

  • Graph Language (gvml)
  • Connection Protocol (eclp)

SLIDE 11

CAS

  • Atomic Instruction

Key Idea: Lock-free data structures

  • New Concurrency Libraries
  • Atomic RMW

AIT

  • Atomic Information Transfer

Key Idea: Recoverable Atomic Tokens

  • Deterministic, In-Order
  • Reversible Atomic Message

Distributed Systems Primitives

Properties compared: Concurrent Safety, Non-Blocking, Deterministic, Recoverability, Durable

Indivisible Property: Shared Memory (CAS) vs. Reversible Token (AIT)

{ While (CAS(oldvalue, newvalue, …) != newvalue) }
{ Transfer (AIT(tokenID, Notify=NO, …) != Continue) }
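The CAS side of the comparison is the familiar lock-free retry loop: read the current value, compute the new one, and retry if another thread changed the word in between. Python has no hardware CAS, so the hypothetical `AtomicCell` below emulates one, using a lock only to make the single compare-and-swap step atomic. Its `compare_and_swap` returns a bool, whereas the slide’s pseudocode compares CAS’s return value against `newvalue`; both express the same retry-until-success idea.

```python
import threading


class AtomicCell:
    """Emulated CAS word (hypothetical): the lock only makes the CAS step atomic."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._lock = threading.Lock()

    def load(self) -> int:
        return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def increment(cell: AtomicCell) -> int:
    """Lock-free style increment: retry until our CAS wins."""
    while True:
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return old + 1


counter = AtomicCell()


def worker() -> None:
    for _ in range(1000):
        increment(counter)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())  # 4000
```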

SLIDE 12

C2C Lattice of Cells & Links

[Figure: today’s tiered datacenter network (DC gateways GW, spine nodes SN, leaf nodes LN, top-of-rack switches ToR) compared with an EARTH Computing lattice of cells and links]

Today’s Networking: Servers & Switches

EARTH Computing: Cells & Links

Servers, Any to Any (IP) addressing

Simpler Wiring: N2N, Switchless

SLIDE 13

The Datacenter Today

Today: Internal Segregation Firewalls

The Datacenter Simplified

EARTH: Dynamic Confinement Domains

Fundamentally Simpler

SLIDE 14

Split infrastructure into:

  • Cloud: datacenter accessed by untrusted legacy protocols
  • Earth: dynamic, resilient, programmable topologies
  • Core: where data is immutable, secure, protected, & resilient to perturbations (failures, disasters, attacks)

[Figure labels: Outside World, Cloudplane, EarthCore, Earth Computing Network Fabric, Data Center]

SLIDE 15

Confidential | Earth Computing Inc.

The Big Idea

EarthCore

SLIDE 16

Logical Foundation for Resilience

[Figure: Fabric, a set of interconnected cells, each consisting of a cell agent with six NICs.]

SLIDE 17

EARTH Computing Link Protocol (ECLP)

  • Events: Replaces Heartbeats, Timeouts
  • Addresses the Common Knowledge* Problem

[Figure: three cells, each a cell agent with six NICs, joined by cables.]

New Distributed Systems Foundation

*Knowledge and Common Knowledge in a Distributed Environment – Joseph Y. Halpern & Yoram Moses ’90 (initial version 1984).
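The slide names the properties of ECLP but not its interface, so the following is only a toy model of the behaviour claimed for AIT: a token is owned by exactly one half of a link, a transfer either commits or is reversed, and a link failure is an event observed by both halves instead of something inferred from a missed heartbeat. `Link`, `Half` and the method names are hypothetical. In a real link the two halves sit on different cells; a single object is used here only to make the invariant easy to check.

```python
from enum import Enum


class Half(Enum):
    LEFT = "left"
    RIGHT = "right"


def other(half: Half) -> Half:
    return Half.RIGHT if half is Half.LEFT else Half.LEFT


class Link:
    """Toy model of a two-half link carrying one indivisible token (hypothetical)."""

    def __init__(self, holder: Half = Half.LEFT) -> None:
        self.holder = holder         # the half that currently owns the token
        self.in_flight = False       # a transfer has started but is not committed
        self.events: list[str] = []  # events that BOTH halves observe

    def begin_transfer(self) -> None:
        if self.in_flight:
            raise RuntimeError("transfer already in progress")
        self.in_flight = True        # the sender keeps ownership until commit

    def commit(self) -> None:
        # The receiving half acknowledged: ownership flips in one indivisible step.
        if not self.in_flight:
            raise RuntimeError("no transfer in progress")
        self.holder = other(self.holder)
        self.in_flight = False
        self.events.append(f"token now on {self.holder.value}")

    def fail(self) -> None:
        # A failure is an event delivered to both halves, not inferred from a timeout.
        # An uncommitted transfer is reversed, so the sender still owns the token.
        if self.in_flight:
            self.in_flight = False
            self.events.append(f"transfer reversed; token stays on {self.holder.value}")
        else:
            self.events.append(f"link down; token on {self.holder.value}")


link = Link()
link.begin_transfer()
link.fail()             # failure in mid-transfer
print(link.holder)      # Half.LEFT
link.begin_transfer()   # after the link heals, try again
link.commit()
print(link.holder)      # Half.RIGHT
```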

SLIDE 18

Composable Presence Management

[Figure: three routers and two cell agents, each with six NICs, connected by four cables.]

SLIDE 19

Composable Presence Management

[Figure: three routers and two cell agents, each with six NICs, connected by four cables.]

SLIDE 20

Demo

SLIDE 21

Two Generals Problem

SLIDE 22

Example Use Cases

Two-Phase Commit, Paxos, Link Reversal

SLIDE 23

Example Use Cases

  • Two-phase commit: The prepare phase asks whether the receiving agent is ready to accept the token. This serves two purposes: communication liveness and agent readiness. Links provide the communication liveness test, and we can avoid blocking on agent readiness by having the link store the token on the receiving half of the link. If there is a failure, both sides know, and both sides know what to do next.
  • Paxos: “Agents may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.” The assumption is that a node that has failed and restarted can’t remember the state it needs to recover. With AIT, the other half of the link can tell it the state to recover from.
  • Reliable tree generation: Binary link reversal algorithms work by reversing the directions of some edges, transforming an arbitrary directed acyclic input graph into an output graph with at least one route from each node to a special destination node. The resulting graph can thus be used to route messages in a loop-free manner. Links store the direction of the arrow (head and tail); AIT facilitates the atomic swap of the arrow’s tail and head to maintain loop-free routes during failure and recovery.
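For reference, the classic two-phase-commit decision rule is sketched below. It does not model the token stored on the receiving half of the link. Its only link-related assumption is a hypothetical `LinkFailed` event that lets the coordinator count an unreachable participant as a “no” vote at once instead of after a timeout, in the spirit of “if there is a failure, both sides know.” The returned decision is what the coordinator would broadcast in the second phase; the broadcast itself is omitted.

```python
from typing import Callable, Dict, Tuple


class LinkFailed(Exception):
    """Hypothetical event: the link to a participant reported a failure."""


def two_phase_commit(
    participants: Dict[str, Callable[[], bool]],
) -> Tuple[str, Dict[str, bool]]:
    """Phase 1 (prepare): ask each participant whether it can commit.
    Decision: COMMIT only if every vote is yes; otherwise ABORT.
    A LinkFailed event is recorded as a "no" vote immediately."""
    votes: Dict[str, bool] = {}
    for name, prepare in participants.items():
        try:
            votes[name] = prepare()
        except LinkFailed:
            votes[name] = False
    decision = "COMMIT" if all(votes.values()) else "ABORT"
    return decision, votes


def ready() -> bool:
    return True


def unreachable() -> bool:
    raise LinkFailed("link to participant went down")


print(two_phase_commit({"a": ready, "b": ready})[0])        # COMMIT
print(two_phase_commit({"a": ready, "b": unreachable})[0])  # ABORT
```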
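Link reversal can be illustrated with the classic full-reversal variant of Gafni and Bertsekas, in which any node other than the destination that has become a sink (no outgoing edges) reverses all of its incident edges, repeating until no such sink remains. This is a swapped-in, well-known variant, not necessarily the binary link reversal the slide refers to, and the function names are hypothetical. It assumes the underlying undirected graph is connected and the input is acyclic; reversal preserves acyclicity, so the result routes every node to the destination without loops.

```python
Edge = tuple[str, str]  # (tail, head): the arrow points tail -> head


def full_link_reversal(nodes: list[str], edges: set[Edge], dest: str) -> set[Edge]:
    """While some node other than `dest` is a sink, it reverses all incident edges."""
    edges = set(edges)
    while True:
        sinks = [n for n in nodes
                 if n != dest and not any(tail == n for tail, _ in edges)]
        if not sinks:
            return edges
        s = sinks[0]
        # Every edge touching a sink points into it; flip each one to point out.
        edges = {(head, tail) if head == s else (tail, head) for tail, head in edges}


def routes_to(edges: set[Edge], src: str, dest: str) -> bool:
    """True if following arrows from src can reach dest."""
    seen, stack = {src}, [src]
    while stack:
        n = stack.pop()
        if n == dest:
            return True
        for tail, head in edges:
            if tail == n and head not in seen:
                seen.add(head)
                stack.append(head)
    return False


nodes = ["D", "A", "B", "C"]
edges = {("D", "A"), ("A", "B"), ("C", "B")}  # A, B and C cannot reach D
result = full_link_reversal(nodes, edges, "D")
print(sorted(result))                          # [('A', 'D'), ('B', 'A'), ('C', 'B')]
print(all(routes_to(result, n, "D") for n in nodes))  # True
```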

SLIDE 24

Common Knowledge

Courtesy: Adrian Colyer, The Morning Paper. https://blog.acolyer.org/2015/02/16/knowledge-and-common-knowledge-in-a-distributed-environment/

SLIDE 25

Implementation on Smart NICs

[Figure: a cell agent connected to six NICs, each attached to its own PHY.]

  • Cell Hardware (containing processor, memory, storage and, e.g., 6 physical network ports)
  • Cell Software
  • Primary Agent
  • Network Interface Controller
  • Network Connector
  • Entanglement Synchronization Domain

SLIDE 26

TRAPHs (Tree-gRAPHs)

  • Simple Provisioning, Confinement, Elasticity, Migration, Failover

New Distributed Systems Foundation

Bare Metal

NAL (Network Asset Layer)
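TRAPHs are described here only at the slide level. As a minimal sketch of one ingredient, building a tree over whatever cell-to-cell wiring is actually present rather than from a blueprint, the code below floods breadth-first from a root cell, and each cell adopts as parent the neighbor it first heard from. This is a generic BFS spanning-tree construction, not Earth Computing’s TRAPH algorithm; `discover_tree`, the cell names and the wiring are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def discover_tree(links: Dict[str, List[str]], root: str) -> Dict[str, Optional[str]]:
    """Breadth-first flood from `root`; returns each cell's parent (None for root)."""
    parent: Dict[str, Optional[str]] = {root: None}
    frontier = deque([root])
    while frontier:
        cell = frontier.popleft()
        for nbr in links[cell]:
            if nbr not in parent:  # first time this cell is reached
                parent[nbr] = cell
                frontier.append(nbr)
    return parent


# Six cells: a ring c0..c5 plus one chord c0-c3 (hypothetical wiring).
links = {
    "c0": ["c1", "c5", "c3"],
    "c1": ["c0", "c2"],
    "c2": ["c1", "c3"],
    "c3": ["c2", "c4", "c0"],
    "c4": ["c3", "c5"],
    "c5": ["c4", "c0"],
}
tree = discover_tree(links, "c0")
print(tree)
# {'c0': None, 'c1': 'c0', 'c5': 'c0', 'c3': 'c0', 'c2': 'c1', 'c4': 'c5'}
```

Because each cell only inspects its own links, the construction uses a local-only view, and rerunning it after a link failure yields a tree over the surviving wiring.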

SLIDE 27

Demo: Simulator

SLIDE 28

  • Don’t Make the Datacenter Look Like the Internet
  • Simpler to Configure/Reconfigure
  • More Resilient to All Perturbations
  • Easier to Secure

Key Ideas

  • RAFE: Reliable Address-Free Ethernet
  • Replace switches with cell-to-cell links
  • Don’t rely on blueprints; discover wiring
  • Event-driven, so no network timeouts
  • Keep state in links for recovery
  • No VLANs, no network-layer encryption
  • Scalable design: local-only view
  • No IP; service addressing
  • Self-recovering from link & server failures
  • RAFE is a discovery process rather than a configuration process

Questions?
