Freeing the Whale: How to Fail at Scale (oliver gould, cto, buoyant)



from QConSF, November 9, 2016


slide-1
SLIDE 1

Freeing the Whale

How to Fail at Scale

oliver gould

cto, buoyant

from QConSF, November 9, 2016

slide-2
SLIDE 2

2010

A FAILWHALE ODYSSEY

slide-3
SLIDE 3
slide-4
SLIDE 4
slide-5
SLIDE 5

Twitter, 2010

10^7 users
10^7 tweets/day
10^2 engineers
10^1 ops eng
10^1 services
10^1 deploys/week
10^2 hosts
0 datacenters
10^1 user-facing outages/week

https://blog.twitter.com/2010/measuring-tweets

slide-6
SLIDE 6
objective

reliability flexibility

slide-7
SLIDE 7
objective

reliability flexibility

solution

platform SOA + devops


i.e. “microservices”

slide-8
SLIDE 8

Resilience is an imperative: our software runs on the truly dismal computers we call datacenters. Besides being heinously complex… they are unreliable and prone to operator error.

Marius Eriksen @marius, RPC Redux

slide-9
SLIDE 9

software you didn’t write
hardware you can’t touch
network you can’t trace
break in new and surprising ways
and your customers shouldn’t notice

slide-10
SLIDE 10

freeing the whale

photo: Johanan Ottensooser

slide-11
SLIDE 11

mesos.apache.org

UC Berkeley, 2010
Twitter, 2011
Apache, 2012
Abstracts compute resources
Promise: don’t worry about the hosts

slide-12
SLIDE 12

aurora.apache.org

Twitter, 2011
Apache, 2013
Schedules processes on Mesos
Promise: no more puppet, monit, etc

slide-13
SLIDE 13

timelines Aurora (or Marathon, or …)

host Mesos host host host host host

users notifications

x800 x300 x1000

slide-14
SLIDE 14

timelines Aurora (or Marathon, or …)

host Mesos host host host host

users notifications

x800 x300 x1000

🔦

slide-15
SLIDE 15

service discovery

timelines users

zookeeper

create ephemeral /svc/users/node_012345
 {"host": "host-abc", "port": 4321}

slide-16
SLIDE 16

service discovery

timelines users

zookeeper

watch /svc/users/*

slide-17
SLIDE 17

service discovery

timelines users

zookeeper

GetUser(olix0r)

slide-18
SLIDE 18

service discovery

timelines users

zookeeper

uh oh.

GetUser(olix0r)

slide-19
SLIDE 19

service discovery

timelines users

zookeeper

client caches results

GetUser(olix0r)

slide-20
SLIDE 20

service discovery

timelines users

zookeeper

GetUser(olix0r)

zookeeper serves empty results?!

slide-21
SLIDE 21

service discovery

timelines users

zookeeper

service discovery is advisory

GetUser(olix0r)
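
Slides 15 to 21 suggest that a client should treat discovery data as advice and should not throw away a working address set just because the registry went quiet or returned nothing. A minimal sketch of that rule follows. `AdvisoryResolver` and `on_update` are hypothetical names introduced for illustration, and no real ZooKeeper client is involved.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

Endpoint = Tuple[str, int]  # (host, port), as in the ephemeral node payload


@dataclass
class AdvisoryResolver:
    """Hypothetical client-side cache of one service's endpoints."""

    endpoints: FrozenSet[Endpoint] = frozenset()

    def on_update(self, update: Optional[FrozenSet[Endpoint]]) -> None:
        # `update` is None when the registry cannot be reached and may be
        # empty when it misbehaves. In both cases keep the last known-good
        # set instead of forgetting every address.
        if update:
            self.endpoints = update
```

The trade-off is that the cache may keep addresses of instances that are really gone, so the request path still has to cope with failed connections.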

slide-22
SLIDE 22

github.com/twitter/finagle

RPC library (JVM)
asynchronous
built on Netty
scala
functional
strongly typed
first commit: Oct 2010

slide-23
SLIDE 23

datacenter

[1] physical [2] link [3] network [4] transport

aws, azure, digitalocean, gce, …
canal, weave, …
kubernetes, mesos, swarm, …

rpc

[5] session [6] presentation

http/2, mux, …
json, protobuf, thrift, …

business

[7] application

languages, libraries

slide-24
SLIDE 24

“It’s slow” is the hardest problem you’ll ever debug.

Jeff Hodges @jmhodges, Notes on Distributed Systems for Young Bloods

slide-25
SLIDE 25
observability

counters (e.g. client/users/failures)
histograms (e.g. client/users/latency/p99)
tracing
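
As a rough illustration of the first two items, the sketch below keeps named counters and raw latency samples and reads a percentile back out. The `Metrics` class and its methods are hypothetical. A production library would normally use bucketed histograms instead of storing every sample.

```python
from collections import defaultdict


class Metrics:
    """Hypothetical in-memory registry for counters and latency samples."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.samples = defaultdict(list)

    def incr(self, name, delta=1):
        self.counters[name] += delta

    def observe(self, name, value):
        self.samples[name].append(value)

    def percentile(self, name, p):
        """Nearest-rank percentile for an integer p between 1 and 100."""
        data = sorted(self.samples[name])
        if not data:
            raise ValueError(f"no samples recorded for {name}")
        rank = max(1, -(-p * len(data) // 100))  # ceiling division
        return data[rank - 1]
```

For example, after recording one failure with `incr("client/users/failures")` and a series of latencies with `observe("client/users/latency", ms)`, `percentile("client/users/latency", 99)` gives the p99 value.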

slide-26
SLIDE 26

tracing

slide-27
SLIDE 27

timeouts & retries

timelines -> users: timeout=400ms retries=3
users -> web: timeout=400ms retries=2
web -> db: timeout=200ms retries=3

slide-28
SLIDE 28

timeouts & retries

timelines -> users: timeout=400ms retries=3
users -> web: timeout=400ms retries=2
web -> db: timeout=200ms retries=3

800ms! 600ms!
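
The 800ms and 600ms figures appear to come from multiplying each hop's timeout by its retry count. The sketch below reproduces that arithmetic under two assumptions: `retries=N` means up to N attempts that may each run for the full timeout, and the three settings belong to the timelines -> users, users -> web, and web -> db hops in that order. The names `HOPS`, `worst_case_ms`, and `mismatches` are hypothetical.

```python
# (hop, per-attempt timeout in ms, attempts), read off the slide
HOPS = [
    ("timelines -> users", 400, 3),
    ("users -> web", 400, 2),
    ("web -> db", 200, 3),
]


def worst_case_ms(timeout_ms, attempts):
    """Longest a hop can take if every attempt runs into its timeout."""
    return timeout_ms * attempts


def mismatches(hops):
    """Hops whose worst case outlasts the timeout of the hop calling them."""
    found = []
    for (_, caller_timeout, _), (name, timeout, attempts) in zip(hops, hops[1:]):
        worst = worst_case_ms(timeout, attempts)
        if worst > caller_timeout:
            found.append((name, worst, caller_timeout))
    return found
```

Under these assumptions `mismatches(HOPS)` flags both inner hops: they can keep working for 800ms and 600ms after a caller that only waits 400ms has already given up.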

slide-29
SLIDE 29

deadlines

timelines -> users: timeout=400ms
users -> web: deadline=323ms (77ms elapsed)
web -> db: deadline=210ms (113ms elapsed)
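
The numbers on this slide follow from subtracting the time already spent at each hop from the budget that hop received. A minimal sketch of that bookkeeping, with a hypothetical helper name:

```python
def remaining_deadline_ms(deadline_ms, elapsed_ms):
    """Budget to pass downstream after spending elapsed_ms at this hop."""
    remaining = deadline_ms - elapsed_ms
    if remaining <= 0:
        # Nothing left: fail here instead of calling downstream at all.
        raise TimeoutError("deadline exceeded before calling downstream")
    return remaining


users_to_web = remaining_deadline_ms(400, 77)         # 323
web_to_db = remaining_deadline_ms(users_to_web, 113)  # 210
```

Only the entry point picks a timeout. Every later hop inherits whatever is left, so no hop keeps working on a request its caller has already abandoned.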

slide-30
SLIDE 30

retries

typical: retries=3

slide-31
SLIDE 31

retries

typical: retries=3
worst-case: 300% more load!!!

slide-32
SLIDE 32

budgets

typical: retries=3
worst-case: 300% more load!!!

better: retryBudget=20%
worst-case: 20% more load
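
One way to picture a retry budget is as an account: every original request deposits a fraction of a retry, and every retry withdraws a whole one. The sketch below is an illustration only, not Finagle's implementation. The `RetryBudget` class here is hypothetical and omits the time window a production budget would need.

```python
class RetryBudget:
    """Hypothetical budget that allows retries for a percentage of requests."""

    def __init__(self, percent=20):
        self.percent = percent
        self.balance = 0  # in hundredths of a retry, to avoid float drift

    def record_request(self):
        # Each original request earns percent/100 of one retry.
        self.balance += self.percent

    def try_retry(self):
        # A retry is allowed only if a whole retry has been earned.
        if self.balance >= 100:
            self.balance -= 100
            return True
        return False
```

With a 20% budget, 100 failing requests produce at most 20 retries, so a struggling downstream service sees at most 20% extra load where a fixed `retries=3` policy could quadruple its traffic.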

slide-33
SLIDE 33

load shedding via cancellation

timelines users web db timelines users web db

timeout!

slide-34
SLIDE 34

load shedding via cancellation

timelines users web db timelines users web db

timeout!
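
The idea on these two slides is that a timeout at the top of the call chain should also stop the work further down. The sketch below models the four services as in-process coroutines, which is a simplification: over a network the RPC protocol has to carry the cancellation. All function names are hypothetical.

```python
import asyncio


async def db(log):
    try:
        await asyncio.sleep(10)  # stands in for a slow query
        log.append("db finished")
    except asyncio.CancelledError:
        log.append("db cancelled")  # work is abandoned, freeing capacity
        raise


async def web(log):
    return await db(log)


async def users(log):
    return await web(log)


async def timelines(log, timeout_s=0.05):
    try:
        # On timeout, wait_for cancels the whole users -> web -> db chain.
        return await asyncio.wait_for(users(log), timeout_s)
    except asyncio.TimeoutError:
        log.append("timelines timed out")
        return None


def run_demo():
    log = []
    asyncio.run(timelines(log))
    return log
```

Without cancellation, `db` would keep running its ten-second query for a caller that stopped listening after 50ms.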

slide-35
SLIDE 35

backpressure

timelines users web db timelines users web db

1000 requests 100 requests 1000 requests

slide-36
SLIDE 36

backpressure

timelines users web db timelines users web db

1000 failed

💁

1000 failed

slide-37
SLIDE 37

backpressure

timelines users web db

100 ok 100 ok 100 ok + 900 failed/redirected/etc
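
The last slide of this sequence can be read as admission control: a service that can handle 100 requests accepts 100 and turns the other 900 away immediately. A minimal single-threaded sketch under that reading, with hypothetical names:

```python
class ConcurrencyLimiter:
    """Hypothetical admission control: reject work beyond a fixed capacity."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        if self.in_flight >= self.max_in_flight:
            return False  # fail fast so the caller can redirect or degrade
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1


def offer(limiter, n):
    """Offer n simultaneous requests and return (admitted, rejected)."""
    admitted = sum(1 for _ in range(n) if limiter.try_acquire())
    return admitted, n - admitted
```

Here `offer(ConcurrencyLimiter(100), 1000)` admits 100 requests and rejects 900, matching the "100 ok + 900 failed/redirected/etc" outcome. The rejected callers learn right away that they must go elsewhere.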

slide-38
SLIDE 38

request-level load balancing

lb algorithms:

  • round-robin
  • fewest connections
  • queue depth
  • exponentially-weighted moving average (ewma)
  • aperture
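
To make the ewma entry concrete, the sketch below keeps an exponentially weighted moving average of observed latency per endpoint and sends each request to the endpoint with the lowest average. It is a simplified illustration, not Finagle's balancer. The `EwmaBalancer` class and the fixed `alpha` smoothing factor are assumptions.

```python
class EwmaBalancer:
    """Hypothetical balancer that prefers the lowest moving-average latency."""

    def __init__(self, endpoints, alpha=0.5):
        self.alpha = alpha  # weight given to the newest observation
        self.ewma = {endpoint: 0.0 for endpoint in endpoints}

    def pick(self):
        # Ties go to the endpoint listed first.
        return min(self.ewma, key=self.ewma.get)

    def record(self, endpoint, latency_ms):
        previous = self.ewma[endpoint]
        self.ewma[endpoint] = self.alpha * latency_ms + (1 - self.alpha) * previous
```

Because the decision uses per-request latency, a slow instance loses traffic within a few requests. A connection-level balancer would not notice until connections failed.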

slide-39
SLIDE 39
slide-40
SLIDE 40

So just rewrite everything in Finagle!?

slide-41
SLIDE 41

linkerd

slide-42
SLIDE 42

github.com/buoyantio/linkerd

service mesh proxy
built on finagle & netty
suuuuper pluggable
http, thrift, …
etcd, consul, kubernetes, marathon, zookeeper, …
…

slide-43
SLIDE 43

Linkers and Loaders, John R. Levine, Academic Press

slide-44
SLIDE 44

linker for the datacenter

slide-45
SLIDE 45

logical naming

applications refer to logical names
 requests are bound to concrete names
 delegations express routing

/s/users

/#/io.l5d.zk/prod/users /#/io.l5d.zk/staging/users

/s => /#/io.l5d.zk/prod
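
The delegation on this slide can be read as a prefix rewrite: `/s/users` matches the prefix `/s`, so it is rewritten to `/#/io.l5d.zk/prod/users`. The sketch below implements only that much. It treats any name starting with `/#/` as concrete and ignores the richer features of real delegation tables. The helper names are hypothetical.

```python
def parse_dtab(text):
    """Parse 'prefix => destination' entries separated by ';'."""
    entries = []
    for part in text.split(";"):
        if "=>" in part:
            prefix, dest = part.split("=>")
            entries.append((prefix.strip(), dest.strip()))
    return entries


def segments(path):
    return [s for s in path.split("/") if s]


def rewrite_once(name, dtab):
    parts = segments(name)
    for prefix, dest in reversed(dtab):  # later entries take precedence
        head = segments(prefix)
        if parts[:len(head)] == head:
            return "/".join([dest] + parts[len(head):])
    raise LookupError(f"no delegation matches {name}")


def bind(name, dtab, max_steps=16):
    """Rewrite a logical name until it becomes a concrete /#/ name."""
    for _ in range(max_steps):
        if name.startswith("/#/"):
            return name
        name = rewrite_once(name, dtab)
    raise LookupError(f"gave up binding after {max_steps} rewrites")
```

Swapping the single delegation for `/s => /#/io.l5d.zk/staging` sends the same logical name `/s/users` to the staging instances without touching application code.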

slide-46
SLIDE 46

per-request routing: staging

GET / HTTP/1.1
 Host: mysite.com
 l5d-dtab: /s/B => /s/B2

slide-47
SLIDE 47

per-request routing: debug proxy

GET / HTTP/1.1
 Host: mysite.com
 l5d-dtab: /s/E => /s/P/s/E
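
Slides 46 and 47 show the same mechanism applied per request: delegations carried in the `l5d-dtab` header are added to the routing table for that request only. The sketch below assumes the base table from slide 45 (`/s => /#/io.l5d.zk/prod`) and assumes that header entries take precedence over it. `overrides_from` and `route` are hypothetical helpers, not linkerd APIs.

```python
BASE_DTAB = [("/s", "/#/io.l5d.zk/prod")]  # assumed base table


def overrides_from(request_text):
    """Collect delegations from l5d-dtab header lines of a raw HTTP request."""
    entries = []
    for line in request_text.splitlines()[1:]:  # skip the request line
        key, _, value = line.partition(":")
        if key.strip().lower() == "l5d-dtab":
            for part in value.split(";"):
                if "=>" in part:
                    prefix, dest = part.split("=>")
                    entries.append((prefix.strip(), dest.strip()))
    return entries


def route(name, dtab, max_steps=16):
    """Apply prefix rewrites, newest entry first, until the name is concrete."""
    for _ in range(max_steps):
        if name.startswith("/#/"):
            break
        for prefix, dest in reversed(dtab):
            if name == prefix or name.startswith(prefix + "/"):
                name = dest + name[len(prefix):]
                break
        else:
            raise LookupError(f"no delegation matches {name}")
    return name


request = "GET / HTTP/1.1\nHost: mysite.com\nl5d-dtab: /s/B => /s/B2\n"
dtab = BASE_DTAB + overrides_from(request)
```

With this request, `route("/s/B", dtab)` resolves to `/#/io.l5d.zk/prod/B2` while every other service name still resolves as before, so one request can exercise a staged version of B without any redeploy.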

slide-48
SLIDE 48

linkerd service mesh

transport security
service discovery
circuit breaking
backpressure
deadlines
retries
tracing
metrics
keep-alive
multiplexing
load balancing
per-request routing
service-level objectives

Service B instance

linkerd

Service C instance

linkerd

Service A instance

linkerd

slide-49
SLIDE 49

demo: gob’s microservice

slide-50
SLIDE 50

web word gen l5d l5d l5d

slide-51
SLIDE 51

web word gen gen-v2 l5d l5d l5d l5d

slide-52
SLIDE 52

web word gen gen-v2 l5d l5d l5d l5d

namerd

slide-53
SLIDE 53

github.com/buoyantio/linkerd-examples

slide-54
SLIDE 54

linkerd roadmap

  • Battle test HTTP/2
  • TLS client certs
  • Deadlines
  • Dark Traffic
  • All configurable everything
slide-55
SLIDE 55

more at linkerd.io
slack: slack.linkerd.io
email: ver@buoyant.io
twitter: @olix0r, @linkerd

thanks!