WHAT I WISH I KNEW BEFORE SCALING UBER TO 1,000 SERVICES, MATT RANNEY (PowerPoint presentation)

SLIDE 1

WHAT I WISH I KNEW BEFORE SCALING UBER TO 1,000 SERVICES

MATT RANNEY

SLIDE 2

SLIDE 3
SLIDE 4
SLIDE 5

As of April 2016, Uber:
Cities worldwide: 400+
Countries: 70
Employees: 6,000+

SLIDE 6

LIFE LESSONS

SLIDE 7
SLIDE 8

MICROSERVICES

Immutable? Append Only?

SLIDE 9

WHY MICROSERVICES?

• Move and release independently
• Own your uptime
• Use the “best” tool for the job

SLIDE 10

WHAT ARE THE COSTS?

• Now you have a distributed system
• Everything is an RPC
• What if it breaks?
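Because every call now crosses the network, any one of them can hang or fail. One common defensive pattern is a per-call deadline with a fallback value; a minimal Python sketch follows. The talk does not prescribe this pattern, and `call_with_deadline`, the 8-worker pool, and `slow_eta_lookup` are hypothetical names and sizes chosen for illustration.

```python
import concurrent.futures as cf
from typing import Callable, TypeVar

T = TypeVar("T")

# Shared pool for outbound calls; 8 workers is an arbitrary assumption.
_POOL = cf.ThreadPoolExecutor(max_workers=8)


def call_with_deadline(rpc: Callable[[], T], timeout_s: float, fallback: T) -> T:
    """Run `rpc` on the pool and wait at most `timeout_s` seconds for it.

    On a timeout or any exception the caller gets `fallback` instead of
    hanging or crashing. A timed-out call keeps running in its worker
    thread, because Python cannot cancel a thread that has already started.
    """
    future = _POOL.submit(rpc)
    try:
        return future.result(timeout=timeout_s)
    except cf.TimeoutError:
        future.cancel()  # only succeeds if the call never started
        return fallback
    except Exception:
        # Deliberately broad for the sketch; real code should log the error.
        return fallback


if __name__ == "__main__":
    import time

    def slow_eta_lookup() -> int:  # hypothetical dependency that hangs
        time.sleep(0.5)
        return 240

    print(call_with_deadline(slow_eta_lookup, timeout_s=0.05, fallback=None))
```

A deadline bounds how long the caller waits, but the slow call still occupies a worker thread until it finishes.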

SLIDE 11

LESS OBVIOUS COSTS

• Everything is a tradeoff
• You can build around problems
• Might trade complexity for politics
• You get to keep your biases

SLIDE 12

• Pre-history: PHP (outsourced)
• Dispatch: Node.js, moving to Go
• Core services: Python, moving to Go
• Maps: Python and Java
• Data: Python and Java
• Metrics: Go

SLIDE 13

LANGUAGES

• Hard to share code
• Hard to move between teams
• WIWIK (what I wish I knew): fragments the culture

SLIDE 14

RPC

• HTTP/REST gets complicated
• JSON needs a schema
• RPCs are slower than PCs (plain procedure calls)
• WIWIK: servers are not browsers
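“JSON needs a schema” in practice means each service checks incoming payloads against an agreed shape at its boundary. A minimal Python sketch, assuming a dataclass serves as the schema; `TripRequest` and `decode` are hypothetical names. Schema languages such as Thrift or Protocol Buffers are the usual heavier-weight alternative.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Type, TypeVar

M = TypeVar("M")


@dataclass(frozen=True)
class TripRequest:  # hypothetical message type, for illustration only
    rider_id: str
    pickup_lat: float
    pickup_lng: float
    seats: int


def _matches(value: Any, typ: type) -> bool:
    if isinstance(value, bool):  # bool is a subclass of int in Python
        return typ is bool
    if typ is float:  # JSON has a single number type, so accept ints too
        return isinstance(value, (int, float))
    return isinstance(value, typ)


def decode(cls: Type[M], raw: str) -> M:
    """Parse `raw` and check it against the fields declared on dataclass `cls`."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    declared = {f.name: f.type for f in fields(cls)}
    missing = declared.keys() - data.keys()
    unknown = data.keys() - declared.keys()
    if missing or unknown:
        raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")
    for name, typ in declared.items():
        if not _matches(data[name], typ):
            raise TypeError(f"{name}: expected {typ.__name__}, got {type(data[name]).__name__}")
    return cls(**data)
```

For example, `decode(TripRequest, '{"rider_id": "r1", "pickup_lat": 37.77, "pickup_lng": -122.42, "seats": 2}')` returns a `TripRequest`, while a payload with `"seats": "2"` is rejected with a `TypeError` instead of flowing silently into the service.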

SLIDE 15

HOW MANY REPOS

• Many is good
• One is good
• Many is bad
• One is bad

SLIDE 16
SLIDE 17

[Chart: time axis from April 2016 to May 2016]

SLIDE 18

OPERATIONAL

• What happens when things break?
• Can other teams release your service?
• Understand a service in the larger context

SLIDE 19

PERFORMANCE

Depends on language tools

SLIDE 20
SLIDE 21
SLIDE 22
SLIDE 23
SLIDE 24

PERFORMANCE

• Doesn’t matter until it does
• Probably want at least simple perf requirements
• WIWIK: “good” is not required, but “known” is
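A “known” performance number can be as simple as a percentile computed from latency samples and compared with a stated budget. The Python sketch below assumes nearest-rank percentiles; `PerfRequirement` and `check` are hypothetical names, and the budgets are placeholders a service owner would choose.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class PerfRequirement:  # hypothetical shape for a "simple perf requirement"
    percentile: float  # e.g. 99.0 means p99
    max_ms: float


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]


def check(samples: Sequence[float],
          reqs: Sequence[PerfRequirement]) -> List[Tuple[PerfRequirement, float, bool]]:
    """Report the measured value for every requirement and whether it is met."""
    report = []
    for req in reqs:
        measured = percentile(samples, req.percentile)
        report.append((req, measured, measured <= req.max_ms))
    return report
```

The report keeps the measured value even when a requirement passes, which is the point of the slide: the number is known whether or not it is good.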

SLIDE 25
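FANOUT

• Overall latency ≥ latency of the slowest call

Example: a service with 1 ms average latency and 1,000 ms p99
• Use 1 process: 1% of requests take at least 1,000 ms
• Use 100 processes: 63% of requests take at least 1,000 ms

Each call is slow with probability 0.01, so a request that waits on 100 independent calls is slow with probability $1 - 0.99^{100} \approx 0.634$ (63.4%).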
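The arithmetic above can be checked with a small simulation. This Python sketch assumes a simplified latency model in which each call independently takes 1,000 ms with probability 0.01 and 1 ms otherwise, and the parent request waits for all of its calls; `simulated_slow_fraction` is a hypothetical helper.

```python
import random


def simulated_slow_fraction(fanout: int, p_slow: float = 0.01,
                            trials: int = 10_000, seed: int = 42) -> float:
    """Estimate how often a request that fans out to `fanout` calls is slow.

    Simplified model (an assumption): each call independently takes 1,000 ms
    with probability `p_slow` and 1 ms otherwise. The parent waits for all of
    its calls, so its latency is the maximum of the individual latencies.
    """
    rng = random.Random(seed)
    slow = 0
    for _ in range(trials):
        parent_ms = max(1000 if rng.random() < p_slow else 1 for _ in range(fanout))
        if parent_ms >= 1000:
            slow += 1
    return slow / trials


if __name__ == "__main__":
    for n in (1, 100):
        print(n, f"{simulated_slow_fraction(n):.1%}")
```

The estimates should land near 1% and 63.4%. The model makes the independence assumption explicit; calls that share a slow dependency would be correlated, and the numbers would differ.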

SLIDE 26

[Figure: percentage of requests that are slow (0% to 100%) vs. number of processes used (1 to 1,024, doubling), with curves for p95, p99, and p99.9]
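The chart generalizes the same calculation: for a per-call percentile $q$ (p95, p99, p99.9), a request that uses $n$ processes is slow with probability $1 - (q/100)^n$, assuming independent calls. This Python sketch prints that table for 1 to 1,024 processes; it is not a reproduction of the slide's plotted data, and `slow_fraction` is a hypothetical name.

```python
def slow_fraction(fanout: int, percentile: float) -> float:
    """Probability that at least one of `fanout` independent calls is slower
    than the per-call latency at `percentile` (e.g. 99.0 for p99)."""
    p_slow = 1.0 - percentile / 100.0
    return 1.0 - (1.0 - p_slow) ** fanout


if __name__ == "__main__":
    print(f"{'processes':>9}  {'p95':>7}  {'p99':>7}  {'p99.9':>7}")
    for n in (2 ** k for k in range(11)):  # 1, 2, 4, ..., 1024
        cells = "  ".join(f"{slow_fraction(n, pct):7.1%}" for pct in (95.0, 99.0, 99.9))
        print(f"{n:>9}  {cells}")
```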

SLIDE 27

TRACING

• Lots of ways to get this
• Best way to understand fanout

SLIDE 28
SLIDE 29
SLIDE 30
SLIDE 31

TRACING

• Probably want sampling
• WIWIK: cross-language context propagation
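Cross-language propagation means every service, in every language, must read the trace context from incoming requests and copy it onto outgoing ones while honoring a single sampling decision. This Python sketch assumes the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`) as the wire format, not necessarily what Uber used, with head-based sampling decided at the root; `TraceContext`, `start_root`, and `child_of` are hypothetical names.

```python
import random
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceContext:
    trace_id: str  # 32 lowercase hex chars, shared by every span in one trace
    span_id: str   # 16 lowercase hex chars, the span that made the call
    sampled: bool  # decided once at the root, then honored by every hop

    def to_traceparent(self) -> str:
        """Encode as a W3C Trace Context `traceparent` header value (version 00)."""
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"

    @staticmethod
    def from_traceparent(value: str) -> "TraceContext":
        parts = value.strip().split("-")
        if len(parts) != 4 or parts[0] != "00" or len(parts[1]) != 32 or len(parts[2]) != 16:
            raise ValueError(f"unsupported traceparent: {value!r}")
        _, trace_id, span_id, flags = parts
        return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_root(sample_rate: float, rng: random.Random) -> TraceContext:
    """Head-based sampling: only the first service in the request path decides."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), rng.random() < sample_rate)


def child_of(parent: TraceContext) -> TraceContext:
    """Outgoing call: keep the trace id and sampling decision, mint a new span id."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)
```

Each language in the stack would need an equivalent of `from_traceparent` and `to_traceparent`; if any one service drops the header, the trace breaks at that hop.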

SLIDE 32

LOGGING

• Need consistent, structured logging
• Multiple languages make this hard
• Logging floods can amplify problems
• WIWIK: accounting
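Structured output, flood control, and accounting can all live in one place in the logging pipeline. A sketch using Python's standard `logging` module: a formatter that emits one JSON object per line, and a handler filter that caps repeats of the same message per time window while counting emitted and dropped lines per service. `JsonFormatter`, `FloodGuard`, `make_logger`, and the limits are hypothetical, not a description of Uber's logging stack.

```python
import json
import logging
import time
from collections import Counter
from typing import Callable


class JsonFormatter(logging.Formatter):
    """One JSON object per line, with the same core keys in every service."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "service": record.name,
            "msg": record.getMessage(),
        }
        payload.update(getattr(record, "fields", {}))  # structured extras
        return json.dumps(payload, sort_keys=True)


class FloodGuard(logging.Filter):
    """Pass at most `limit` records per (service, message) in each `window_s`
    seconds, and count emitted and dropped lines per service (accounting)."""

    def __init__(self, limit: int, window_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        super().__init__()
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self.window_start = clock()
        self.seen: Counter = Counter()
        self.emitted: Counter = Counter()
        self.dropped: Counter = Counter()

    def filter(self, record: logging.LogRecord) -> bool:
        now = self.clock()
        if now - self.window_start >= self.window_s:
            self.window_start, self.seen = now, Counter()
        key = (record.name, record.msg)  # group by template, not formatted text
        self.seen[key] += 1
        if self.seen[key] > self.limit:
            self.dropped[record.name] += 1
            return False
        self.emitted[record.name] += 1
        return True


def make_logger(name: str, stream, guard: FloodGuard) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(guard)
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With `FloodGuard(limit=3)`, a burst of ten identical `upstream timeout` warnings from a `dispatch` logger inside one window yields three JSON lines, and `guard.dropped["dispatch"]` records the other seven, so the volume is still accounted for.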

SLIDE 33
SLIDE 34
SLIDE 35

LOAD TESTING

• Need to test against production
• Without breaking metrics
• Preferably all the time
• WIWIK: all systems need to handle “test” traffic
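One way for every system to handle test traffic is to mark it at the edge, forward the marker on every hop, record its metrics separately, and skip irreversible side effects. A minimal Python sketch under those assumptions; the `x-test-traffic` header, `ChargeService`, and the metric names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

TEST_HEADER = "x-test-traffic"  # hypothetical header name, not a standard


def is_test(headers: Dict[str, str]) -> bool:
    return headers.get(TEST_HEADER) == "1"


def downstream_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Forward the test marker so every hop knows this is synthetic load."""
    return {TEST_HEADER: "1"} if is_test(incoming) else {}


@dataclass
class Metrics:
    """Counters keyed by (name, traffic class) so load tests never skew prod numbers."""
    counts: Counter = field(default_factory=Counter)

    def incr(self, name: str, test: bool) -> None:
        self.counts[(name, "test" if test else "prod")] += 1


@dataclass
class ChargeService:  # hypothetical service with an irreversible side effect
    metrics: Metrics = field(default_factory=Metrics)
    ledger: List[Tuple[str, int]] = field(default_factory=list)

    def handle(self, headers: Dict[str, str], rider_id: str, cents: int) -> str:
        test = is_test(headers)
        self.metrics.incr("charge.requests", test)
        # Validation, pricing, and downstream calls would run for both classes here.
        if test:
            return "skipped"  # exercise the code path, but never move real money
        self.ledger.append((rider_id, cents))
        return "charged"
```

Keying metrics by traffic class keeps production dashboards clean while load tests still exercise the same code paths.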

SLIDE 36

FAILURE TESTING

WIWIK: people won’t like it

SLIDE 37

MIGRATIONS

• Old stuff still has to work
• What happened to immutable?
• WIWIK: mandates are bad

SLIDE 38

OPEN SOURCE

• Build/buy tradeoff is hard
• Commoditization
• WIWIK: this will make people sad

SLIDE 39

POLITICS

• Services allow people to play politics
• Company > Team > Self

SLIDE 40

TRADEOFFS

• Everything is a tradeoff
• Try to make them intentionally

SLIDE 41

THANKS