SLIDE 1

Balancing on the edge

Transport affinity without network state

João Taveira Araújo, Lorenzo Saino, Raul Landa and Lennert Buytenhek. NSDI 2018.

SLIDE 2

this is the last slide (sort of)

Faild decomposes load balancing as a division of labour

  • leverage hardware wherever possible - no latency cost in expected case
  • push functions requiring state towards hosts - low latency overhead in worst case
  • efficient, stateless, while ensuring graceful completion of flows
  • in production at Fastly since 2013

Read the paper if you have an interest in transport protocols or Internet architecture.

SLIDE 3

the problem

SLIDES 4–9

2009 ~ 2012

SLIDE 10


technology

  • SSD, multicore, 10Gbps NICs
  • network programmability

📊

cost of entry

  • maturity of open source reverse proxies (nginx, varnish)
  • network topology flattened and bandwidth costs dropped

💹

market

  • incumbents addressed shrinking use case: large static assets
  • a lot of suppressed demand
SLIDE 11

start over

SLIDES 12–13

PÖP (point of presence)

BILLY

SLIDE 14

PÖP (point of presence)

BILLY BILLY BILLY BILLY

SLIDES 15–16

clöud

SLIDES 17–20

SLIDES 21–22

BILLY

RÅCK

SLIDES 23–25

Requirements

BILLY

  • maximize RPS
  • low latency
  • absorb DDOS attacks
  • no single points of failure
  • no service disruption on maintenance

SLIDE 26

Problem statement

BILLY

Given a fixed physical footprint, how do you design a load-balancing architecture that is efficient, resilient and graceful?

SLIDE 27

the topology

SLIDES 28–30

Guidelines

BILLY

  • maximize number of hosts
  • need switches for connecting to upstreams
  • avoid dedicated network hardware:
      • unneeded features (routers)
      • load balancer appliances
      • multiple switch tiers

SLIDES 31–32

POP topology

SLIDES 33–35

POP topology

  Hosts   Racks   Bandwidth* (Gbps)   RPS (millions)   Storage (TB)
  8       0.5     200                 0.32             768
  16      1       1200                0.64             1536
  32      2       2400                1.28             3072
  64      4       4800                2.56             6144

  * notional host bandwidth on fabric accounting for loss of one switch

SLIDE 36

load balancing

SLIDE 37

anything that maintains state is easy to DDOS

  • load balancer appliances
  • Ananta [SIGCOMM’13]
  • Duet [SIGCOMM’14]
  • MagLev [NSDI’16]
  • SilkRoad [SIGCOMM’17]
SLIDE 38

software-only load balancers can’t make use of full bisection bandwidth

SLIDE 39

stateless solutions are not graceful

  • many switch ECMP implementations result in rehashing
  • even ones that don’t will break ongoing connections
SLIDE 40

faild

SLIDE 41

Faild

(diagram: hosts A, B, C)

SLIDES 42–43

Faild

(diagram: hosts A, B, C; made-up IPs)

FIB

  Destination prefix    Next hop IP
  192.168.0.0/24        10.0.1.A
  192.168.0.0/24        10.0.2.A
  192.168.0.0/24        10.0.1.B
  192.168.0.0/24        10.0.2.B
  192.168.0.0/24        10.0.1.C
  192.168.0.0/24        10.0.2.C

  • on switch, map the VIP to a static set of virtual next hops (sketch below)
  • using a static set avoids rehashing
  • ECMP width determines the granularity with which we load balance
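
To make the "static set" point concrete, here is a minimal Python sketch (not Fastly's implementation) using the slides' made-up addresses: the ECMP hash is computed over a fixed list of virtual next hops, so rebinding a virtual next hop to a different host never moves any other flow.

```python
import hashlib

# Fixed ECMP set of virtual next hops (the slides' made-up IPs); its size and
# order never change, which is what avoids rehashing.
VIRTUAL_NEXT_HOPS = [f"10.0.{i}.{h}" for h in "ABC" for i in (1, 2)]

def ecmp_bucket(five_tuple, width=len(VIRTUAL_NEXT_HOPS)):
    """Stand-in for the switch's ECMP hash over the flow 5-tuple."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % width

# Virtual next hop -> physical host: the only binding the controller rewrites.
arp = {"10.0.1.A": "a", "10.0.2.A": "a",
       "10.0.1.B": "b", "10.0.2.B": "b",
       "10.0.1.C": "c", "10.0.2.C": "c"}

flow = ("198.51.100.7", 41852, "192.168.0.10", 443, "tcp")
vnh = VIRTUAL_NEXT_HOPS[ecmp_bucket(flow)]
print(flow, "->", vnh, "->", arp[vnh])

# Draining host b only rewrites its two ARP bindings; the flow's bucket, and
# therefore its virtual next hop, is unchanged.
arp.update({"10.0.1.B": "c", "10.0.2.B": "a"})
print(flow, "->", vnh, "->", arp[vnh])
```
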
SLIDES 44–46

Faild

(diagram: hosts A, B, C; the virtual MAC picks out the target host)

Controller

ARP table

  IP address    MAC address
  10.0.1.A      xx:xx:xx:xx:xx:a
  10.0.2.A      xx:xx:xx:xx:xx:a
  10.0.1.B      xx:xx:xx:xx:xx:b
  10.0.2.B      xx:xx:xx:xx:xx:b
  10.0.1.C      xx:xx:xx:xx:xx:c
  10.0.2.C      xx:xx:xx:xx:xx:c

FIB: same static set of six virtual next hops as above

  • on switch, control forwarding by manipulating the ARP entry
  • controller maps each virtual next hop to a virtual MAC address
  • encode information about the current host in the virtual MAC address (the last octet names the target host)
  • bridging table configured to map each virtual MAC to an egress port (sketch below)
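
The same toy addresses can illustrate the ARP and bridging plumbing: the controller hands out a virtual MAC per virtual next hop, with the target host named in the last octet, and the switch bridging table statically maps each virtual MAC to an egress port. The MAC prefix, host indices and port numbers below are invented for illustration; only the "last octet identifies the host" pattern comes from the slides.

```python
PREFIX = "02:00:00:00:00"                        # placeholder locally-administered prefix
HOST_INDEX = {"a": 0x0A, "b": 0x0B, "c": 0x0C}   # made-up host identifiers
EGRESS_PORT = {"a": 1, "b": 2, "c": 3}           # made-up switch ports

def virtual_mac(host):
    """Virtual MAC whose last octet names the target host."""
    return f"{PREFIX}:{HOST_INDEX[host]:02x}"

# ARP table published by the controller: virtual next hop IP -> virtual MAC.
arp_table = {vnh: virtual_mac(host) for vnh, host in {
    "10.0.1.A": "a", "10.0.2.A": "a",
    "10.0.1.B": "b", "10.0.2.B": "b",
    "10.0.1.C": "c", "10.0.2.C": "c"}.items()}

# Bridging table on the switch: virtual MAC -> egress port, configured once.
bridge_table = {virtual_mac(h): EGRESS_PORT[h] for h in HOST_INDEX}

print(arp_table["10.0.1.B"], "-> port", bridge_table[arp_table["10.0.1.B"]])
```
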
SLIDES 47–50

Faild

hosts send health status to the controller

  • on drain, update the ARP entries: remap the drained host's entries (sketch below)
  • balance virtual next hops across available servers

Controller

ARP table (before drain)

  IP address    MAC address
  10.0.1.A      xx:xx:xx:xx:xx:a
  10.0.2.A      xx:xx:xx:xx:xx:a
  10.0.1.B      xx:xx:xx:xx:xx:b
  10.0.2.B      xx:xx:xx:xx:xx:b
  10.0.1.C      xx:xx:xx:xx:xx:c
  10.0.2.C      xx:xx:xx:xx:xx:c

ARP table (after draining host B: only B's entries are remapped)

  IP address    MAC address
  10.0.1.A      xx:xx:xx:xx:xx:a
  10.0.2.A      xx:xx:xx:xx:xx:a
  10.0.1.B      xx:xx:xx:xx:xx:c
  10.0.2.B      xx:xx:xx:xx:xx:a
  10.0.1.C      xx:xx:xx:xx:xx:c
  10.0.2.C      xx:xx:xx:xx:xx:c

FIB: unchanged; the static set of virtual next hops never moves
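
The drain step boils down to rewriting only the drained host's bindings. A small sketch, with an assumed round-robin policy for spreading the drained entries (the real controller's placement policy may differ):

```python
from itertools import cycle

def remap_on_drain(arp, drained, healthy):
    """Return a new virtual-next-hop -> host table with `drained` removed.

    Only the drained host's entries change; every other binding, and therefore
    every flow hashed to a healthy host, is left untouched.
    """
    replacements = cycle(sorted(healthy))        # assumed round-robin spreading
    return {vnh: (next(replacements) if host == drained else host)
            for vnh, host in arp.items()}

arp = {"10.0.1.A": "a", "10.0.2.A": "a",
       "10.0.1.B": "b", "10.0.2.B": "b",
       "10.0.1.C": "c", "10.0.2.C": "c"}
print(remap_on_drain(arp, drained="b", healthy=["a", "c"]))
```
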

SLIDES 51–52

isn't this just consistent hashing?

yes, but we can extend the mechanism and avoid resets entirely

SLIDES 53–59

embed mapping history in MAC address

  • append the previous target as part of the MAC address
  • still results in resets, but…
  • …conveys the necessary information down to the host (sketch below)

(diagram: the last MAC octet names the current host, the octet before it the previous host)

Controller

ARP table (without history, after draining host B)

  IP address    MAC address
  10.0.1.A      xx:xx:xx:xx:xx:a
  10.0.2.A      xx:xx:xx:xx:xx:a
  10.0.1.B      xx:xx:xx:xx:xx:c
  10.0.2.B      xx:xx:xx:xx:xx:a
  10.0.1.C      xx:xx:xx:xx:xx:c
  10.0.2.C      xx:xx:xx:xx:xx:c

ARP table (with history: previous and current host in the last two octets)

  IP address    MAC address
  10.0.1.A      xx:xx:xx:xx:a:a
  10.0.2.A      xx:xx:xx:xx:a:a
  10.0.1.B      xx:xx:xx:xx:b:c
  10.0.2.B      xx:xx:xx:xx:b:a
  10.0.1.C      xx:xx:xx:xx:c:c
  10.0.2.C      xx:xx:xx:xx:c:c

FIB: unchanged
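
A sketch of the two-octet encoding these tables imply (xx:xx:xx:xx:&lt;previous&gt;:&lt;current&gt;); the MAC prefix and host indices are placeholders, not production values:

```python
PREFIX = "02:00:00:00"   # placeholder for the first four octets

def encode_vmac(previous, current):
    """Virtual MAC carrying the previous and current target host."""
    return f"{PREFIX}:{previous:02x}:{current:02x}"

def decode_vmac(mac):
    """Recover (previous host, current host) from the last two octets."""
    prev_octet, cur_octet = mac.split(":")[4:6]
    return int(prev_octet, 16), int(cur_octet, 16)

HOST_A, HOST_B, HOST_C = 0x0A, 0x0B, 0x0C        # made-up host indices
# Virtual next hop 10.0.1.B after host B drains onto host C:
mac = encode_vmac(previous=HOST_B, current=HOST_C)
print(mac, decode_vmac(mac))                     # 02:00:00:00:0b:0c (11, 12)
```
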

SLIDES 60–69

Host processing

(flow chart: destination MAC address → Match previous? → SYN packet? → Local socket? → Process / Redirect)

Every packet arrives addressed to a virtual MAC carrying the previous and current target (e.g. xx:xx:xx:xx:c:b):

  • Match previous? If the two octets match (b == b), no remapping has happened: process locally.
  • If they differ (c != b), the mapping recently changed:
      • SYN packet? A new connection belongs to the current host: process locally.
      • Local socket? A mid-flow packet that matches an established local connection: process locally.
      • Otherwise, redirect the packet to the previous target encoded in the MAC (the sketch below spells out this chain).
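
The flow chart reduces to a short per-packet decision chain on the host. A sketch of that chain, where `has_local_socket` and `forward_to_host` stand in for the kernel-side lookups the real implementation performs:

```python
from dataclasses import dataclass

@dataclass
class Packet:
    dst_mac: str       # virtual MAC: xx:xx:xx:xx:<previous>:<current>
    is_syn: bool
    flow: tuple        # 5-tuple identifying the connection

def handle(pkt, has_local_socket, forward_to_host):
    prev, cur = (int(o, 16) for o in pkt.dst_mac.split(":")[4:6])
    if prev == cur:                   # match previous: no recent remapping
        return "process"
    if pkt.is_syn:                    # new connection belongs to the current host
        return "process"
    if has_local_socket(pkt.flow):    # mid-flow packet for a connection we own
        return "process"
    forward_to_host(prev, pkt)        # detour the packet to the previous host
    return "redirect"
```
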

SLIDE 70

[figure: CDF of round-trip time (µs), steady state vs draining]

Host processing

Low latency

  • expected case: switches do all heavy lifting
  • worst case: detour routing costs 14μs

Negligible impact on CPU utilization

  • impact only when refilling
  • peak CPU utilization below 0.3%

median difference: 14µs

SLIDE 71

Host processing

Low latency

  • expected case: switches do all heavy lifting
  • worst case: detour routing costs 20μs

Negligible impact on CPU utilization

  • impact only when refilling (transient)
  • peak CPU utilization below 0.3%

[figure: estimated PDF of CPU utilization (%) in steady state, drain and refill]

SLIDES 72–74

Timeline (2012–2018)

deployed globally

3 × 10^14 requests per day

SLIDE 75

we suspect it works

SLIDE 76

Assumption #1 hash buckets are equally loaded

SLIDES 77–78

Hashing

[figure: requests per second per host over time, roughly 2k–4k RPS across 30 minutes]

Implications for capacity planning

  • you are bound by the most loaded host in a cluster

SLIDES 79–82

Uneven hashing

Inject synthetic, equally distributed traffic

[figure: normalized bucket load vs rank of next hop, for two switch models]

Significant skew

  • most loaded bucket is 6 times more loaded than the least loaded

Behaviour can depend on the number of next hops

  • some buckets received no traffic for specific numbers of configured next hops (skew metric sketched below)
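
The skew numbers come from comparing per-bucket load against the mean. A tiny sketch of that metric, with invented counter values:

```python
def bucket_skew(requests_per_bucket):
    """Normalized load per bucket plus the max/min load ratio."""
    mean = sum(requests_per_bucket) / len(requests_per_bucket)
    normalized = [r / mean for r in requests_per_bucket]
    return normalized, max(requests_per_bucket) / min(requests_per_bucket)

normalized, ratio = bucket_skew([900, 1500, 2400, 400, 1100, 1700])
print([round(x, 2) for x in normalized], "max/min =", round(ratio, 1))
```
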

SLIDE 83

Assumption #2 switches hash identically

SLIDES 84–88

Hash polarization

Vendors were told hash polarization was bad (sketch of why, below)

  • in many cases you can't configure the seed
  • in one case you can configure the seed, but the vendor additionally uses the boot order of linecards to add entropy
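
For context, a self-contained sketch of why polarization is considered harmful in the first place (textbook ECMP behaviour, not a Faild-specific detail): if two switch tiers hash the same fields with the same function and seed, each second-tier switch only ever exercises a few of its buckets.

```python
import hashlib

def ecmp_hash(flow, seed=0):
    digest = hashlib.sha256(f"{seed}:{flow}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

TIER1_WIDTH, TIER2_WIDTH = 4, 6   # e.g. 4 tier-2 switches, 6 virtual next hops
flows = [("198.51.100.1", sport, "192.168.0.10", 443) for sport in range(20000)]

for seeds in [(0, 0), (0, 1)]:                    # identical vs distinct seeds
    buckets_used = [set() for _ in range(TIER1_WIDTH)]
    for f in flows:
        sw = ecmp_hash(f, seeds[0]) % TIER1_WIDTH                   # tier-1 choice
        buckets_used[sw].add(ecmp_hash(f, seeds[1]) % TIER2_WIDTH)  # tier-2 choice
    print("seeds", seeds, "->", [len(b) for b in buckets_used], "buckets used per switch")
```
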

SLIDE 89

Assumption #3 packets in a flow use same network path

SLIDE 90

Nope, things break

Fragmentation

  • returning ICMP packets hash on outer header
  • took draft to IETF in 2014

ECN

  • some middleboxes hash on TOS field
  • ended up turning ECN negotiation off; breaks anycast too
  • still looking for vendor(s) behind this, affected multiple ISPs

SYN proxies

  • recent trend in enterprise appliances
  • route lookup after connection handoff results in new path
  • one vendor fixed implementation
SLIDE 91

paper has lots more stuff

  • SYN cookie handling
  • ARP reconfiguration measurements
  • evaluation of switch and host draining
  • switch controller details
  • host-side implementation quirks
  • ECMP skew results
  • switch memory
  • real flow measurements
  • vendors that don’t test their products
SLIDES 92–96

the value is not in the implementation

SLIDES 97–100

the value is in the design

SLIDE 101

the value is in the design

Faild decomposes load balancing as a division of labour

  • leverage hardware wherever possible - no latency cost in expected case
  • push functions requiring state towards hosts - low latency overhead in worst case
  • result is efficient, resilient and graceful
  • many of the design patterns applicable to other networking environments
SLIDE 102

…the design is now part of the architecture


Five years of dealing with the consequences of changing a fraction of the Internet:

  • PMTUD in ECMP networks
  • talking to vendors about broken hashing implementations and middleboxes
  • raising awareness within transport community and academia

Faild is part of a shift in the economics of edge delivery that has since percolated through the industry. If you propose protocol changes, please take this paper into account.

SLIDE 103

Additional materials (2015–2017)

  • NANOG 70 presentation on broken hashing
  • Networking @Scale presentation
  • SREcon presentation
  • RFC 7690: workarounds for PMTUD in ECMP networks
  • blog post covering the Faild design