Mesos at Yelp: Building a production ready PaaS, Rob Johnson

slide-1
SLIDE 1

Mesos at Yelp: Building a production ready PaaS

Rob Johnson robj@yelp.com/@rob_johnson_

slide-2
SLIDE 2
Who Am I:

  • Rob Johnson
  • Operations Team at Yelp
  • Spend most of my time working on PaaSTA

slide-3
SLIDE 3

Yelp’s Mission:

Connecting people with great local businesses.

slide-4
SLIDE 4

Yelp Stats:

As of Q2 2015

83M 32 68% 83M

slide-5
SLIDE 5

PaaSTA

slide-6
SLIDE 6

Yelp’s homegrown Platform-as-a-Service

slide-7
SLIDE 7

What’s the problem we’re trying to solve here?

slide-8
SLIDE 8
  • Yelp’s monolith is ~3 million LoC (that’s just the Python).*
  • Increasing number of developers.

*as of 28/09/2015

slide-9
SLIDE 9
  • Code deployments become increasingly difficult to coordinate.
  • Surface area for impact of a bug greatly increases.

slide-10
SLIDE 10

What’s the solution?

slide-11
SLIDE 11

SOA

slide-12
SLIDE 12

Solves everything, right?

slide-13
SLIDE 13

SOA: Round 1

slide-14
SLIDE 14
  • Statically defined list of hosts to deploy a service on.
  • Operations handle deciding which hosts to deploy to.

slide-15
SLIDE 15
  • Manually configure Nagios for each service.
  • Manual deployment system. Lots of rsync wrappers to push code around.

slide-16
SLIDE 16

This doesn’t scale well.

slide-17
SLIDE 17

PaaSTA

slide-18
SLIDE 18
  • Built on the shoulders of established tools.
  • ‘Glue Code’ that coordinates these tools.

slide-19
SLIDE 19

Components

slide-20
SLIDE 20

Mesos

slide-21
SLIDE 21

Marathon

slide-22
SLIDE 22

Chronos

(almost)

slide-23
SLIDE 23

My work here is done, right?

slide-24
SLIDE 24

Not Quite.

slide-25
SLIDE 25

Services != Production

slide-26
SLIDE 26

What makes a service production ready?

slide-27
SLIDE 27
  • easy deployment for developers

slide-28
SLIDE 28
  • easy deployment for developers
  • discovery
slide-29
SLIDE 29
  • easy deployment for developers
  • discovery
  • monitoring
slide-30
SLIDE 30
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
slide-31
SLIDE 31
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-32
SLIDE 32
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-33
SLIDE 33

Services at Yelp tend to be:

  • HTTP API
  • Python
  • uWSGI
slide-34
SLIDE 34

We want to be stack agnostic; developers shouldn’t be constrained by dependencies on a server.

slide-35
SLIDE 35
  • PaaSTA only runs Docker containers.
  • Developers own the creation of the image.

slide-36
SLIDE 36

PaaSTA currently has Java, Golang and Python apps in production.

slide-37
SLIDE 37

PaaSTA provides tooling to automate the build and deployment of images via Jenkins.

slide-38
SLIDE 38

PaaSTA uses Git as its control plane.

slide-39
SLIDE 39

git push → make itest → push to registry → performance check → deploy to dev (repeat for each dev env) → manual intervention → prod
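
The flow on this slide can be read as an ordered list of stages with a manual gate before production. Below is a minimal sketch of that reading, not PaaSTA's actual Jenkins tooling: the function names, the environment names and the always-passing placeholder step are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def always_ok() -> bool:
    """Hypothetical placeholder for a real step such as running `make itest`."""
    return True


def build_stages(dev_envs: List[str], approved_for_prod: bool) -> List[Stage]:
    """Lay out the stages in the order shown on the slide."""
    stages: List[Stage] = [
        ("git push", always_ok),
        ("make itest", always_ok),
        ("push to registry", always_ok),
        ("performance check", always_ok),
    ]
    # "repeat for each dev env"
    stages += [(f"deploy to {env}", always_ok) for env in dev_envs]
    # The manual gate passes only once a human has approved the release.
    stages.append(("manual intervention", lambda: approved_for_prod))
    stages.append(("deploy to prod", always_ok))
    return stages


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one that fails."""
    completed = []
    for name, step in stages:
        if not step():
            break
        completed.append(name)
    return completed
```

With `approved_for_prod=False` the run stops after the last dev deployment, which is the point of the manual intervention step.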

slide-40
SLIDE 40

Once a given image is marked for deployment in production, PaaSTA ‘bounces’ the app, gracefully upgrading the version.
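
One way to picture a graceful bounce is "bring up the new version, then drain the old one only as fast as healthy new instances can replace it". The sketch below assumes that strategy; the talk does not say which strategy PaaSTA uses, and the `Instance` type and `bounce_step` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    version: str
    healthy: bool


def bounce_step(instances: List[Instance], target_version: str, desired: int) -> List[str]:
    """Return the actions for one pass of a 'start new, then drain old' bounce."""
    new = [i for i in instances if i.version == target_version]
    old = [i for i in instances if i.version != target_version]
    healthy_new = sum(1 for i in new if i.healthy)

    actions = []
    if len(new) < desired:
        actions.append(f"start {desired - len(new)} x {target_version}")
    # Retire old instances only while healthy new ones keep us at `desired`.
    surplus = min(len(old), max(0, healthy_new + len(old) - desired))
    if surplus:
        actions.append(f"drain and stop {surplus} old")
    return actions
```

Calling `bounce_step` repeatedly until it returns an empty list walks the app from the old version to the new one without dropping below the desired instance count, provided the old instances stay healthy.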

slide-41
SLIDE 41
  • Reduces operational overhead of deploying a service.
  • Removes bottleneck of going through operations to deploy.

slide-42
SLIDE 42
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-43
SLIDE 43

Smartstack

slide-44
SLIDE 44
  • Originally written by Airbnb
  • Yelp now has maintainers working on it.

slide-45
SLIDE 45

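[Diagram: services s1-s4 running on three hosts, each host with its own H, S and N component (HAProxy, Synapse and Nerve, going by the later slides), all connected to ZK.]

slide-46
SLIDE 46

[Diagram: services s1-s4 running on three hosts, each host with its own H, S and N component, all connected to ZK.]

slide-47
SLIDE 47

[Diagram: services s1-s4 running on three hosts, each host with its own H, S and N component, all connected to ZK.]

slide-48
SLIDE 48

[Diagram: services s1-s4 running on three hosts, each host with its own H, S and N component, all connected to ZK.]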

slide-49
SLIDE 49

There’s no place like 169.254.255.254 (rather than 127.0.0.1)

slide-50
SLIDE 50

Why Smartstack?

slide-51
SLIDE 51
  • ZK/synapse/nerve dying doesn’t wipe us out.
  • HAProxy has its own health checking system we can fall back to.

slide-52
SLIDE 52
  • HAProxy is a proven load balancer and HTTP proxy.
  • We can use Smartstack with non-PaaSTA services.

slide-53
SLIDE 53

Zero-downtime HAProxy reloads: http://bit.ly/1RsctGi

slide-54
SLIDE 54
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-55
SLIDE 55
slide-56
SLIDE 56
  • API allows us to send event data.
  • Flexibility to assign alerts to service authors, rather than forcing it on the operations team.
slide-57
SLIDE 57

$ cat monitoring.yaml

---
team: search_infra
notification_email: search@yelp.com
page: true
runbook: 'y/rb-myservice'
alert_after: 5m
realert_every: 10m
tip: 'The federator service is in the critical path for search, you should be fixing this'
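
To make the two timing keys concrete, here is a small sketch of how `alert_after` and `realert_every` could be interpreted. The semantics are an assumption based on the key names (wait this long before the first alert, then repeat at this interval); the functions `parse_duration` and `should_notify` are hypothetical and are not part of PaaSTA or the monitoring system.

```python
import re
from datetime import timedelta
from typing import Optional

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Turn a value such as '5m' into a timedelta."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


def should_notify(failing_for: timedelta,
                  since_last_alert: Optional[timedelta],
                  config: dict) -> bool:
    """Decide whether to notify, under the assumed semantics described above."""
    if failing_for < parse_duration(config["alert_after"]):
        return False  # still inside the grace period
    if since_last_alert is None:
        return True   # first alert for this failure
    return since_last_alert >= parse_duration(config["realert_every"])
```

With the values from the slide, a check that has been failing for three minutes stays quiet, and one that was already alerted on four minutes ago is not repeated until ten minutes have passed.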

slide-58
SLIDE 58

./check_marathon_services_replication
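
The slide only names the check, so the following is a guess at the kind of logic a replication check performs, not the script's real implementation: compare the instances a service should have with the instances actually available and map the result to a conventional Nagios-style status. The 50% threshold and the function name are assumptions.

```python
from typing import Tuple

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios-style plugin statuses


def check_replication(service: str, expected: int, available: int,
                      crit_percent: int = 50) -> Tuple[int, str]:
    """Return (status, message) for one service under the assumed threshold."""
    if expected <= 0:
        return OK, f"{service}: no instances expected"
    percent = 100 * available / expected
    message = f"{service}: {available}/{expected} instances available"
    if percent < crit_percent:
        return CRITICAL, message
    if available < expected:
        return WARNING, message
    return OK, message
```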

slide-59
SLIDE 59

./check_hung_setup_marathon_jobs

slide-60
SLIDE 60
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-61
SLIDE 61

Yelp organises machines into latency zones.

slide-62
SLIDE 62

Superregion Region Habitat

slide-63
SLIDE 63

$ cat smartstack.yaml

---
main:
  advertise: [superregion]
  discover: superregion
  proxy_port: 20603
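
From a client's point of view, the `proxy_port` above is all it needs to know. The sketch below assumes the local HAProxy listens on the 169.254.255.254 address shown earlier and that the service speaks HTTP; the `service_url` helper and the `/status` path are hypothetical.

```python
from urllib.parse import urlunsplit

# Address from the earlier slide; assumed to be where the local HAProxy listens.
SMARTSTACK_ADDRESS = "169.254.255.254"


def service_url(proxy_port: int, path: str = "/") -> str:
    """Build the URL a client would call to reach a service via the local proxy."""
    netloc = f"{SMARTSTACK_ADDRESS}:{proxy_port}"
    return urlunsplit(("http", netloc, path, "", ""))


# The port comes from smartstack.yaml; the client never names a backend host.
url = service_url(20603, "/status")
```

The client talks to the same address and port on every machine, and the proxy decides which backend in the discovered latency zone receives the request.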

slide-64
SLIDE 64

By choosing a more specific latency zone, service owners optimize for RTT over availability.

slide-65
SLIDE 65
  • By being aware of these latency zones, PaaSTA can make smarter decisions on how to constrain applications.

slide-66
SLIDE 66

Without this coupling, Marathon wouldn’t balance apps evenly amongst the latency zones.
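
As an illustration of what "balancing evenly" means here, the sketch below computes an even spread of instances over latency zones and builds a Marathon `GROUP_BY` constraint for the zone attribute. Whether PaaSTA generates exactly this constraint is an assumption, and the attribute and zone names are made up.

```python
from typing import Dict, List


def group_by_constraint(attribute: str, zones: List[str]) -> List[List[str]]:
    """Marathon-style constraint asking for an even spread over the attribute's values."""
    return [[attribute, "GROUP_BY", str(len(set(zones)))]]


def even_spread(instances: int, zones: List[str]) -> Dict[str, int]:
    """The distribution an even spread should produce (leftovers go to the first zones)."""
    ordered = sorted(set(zones))
    base, extra = divmod(instances, len(ordered))
    return {zone: base + (1 if index < extra else 0)
            for index, zone in enumerate(ordered)}
```

For five instances over two zones this gives a 3/2 split rather than letting all five land in whichever zone has free resources first.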

slide-67
SLIDE 67
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-68
SLIDE 68

PaaSTA comes with a CLI for managing PaaSTA services.

slide-69
SLIDE 69
slide-70
SLIDE 70
slide-71
SLIDE 71
slide-72
SLIDE 72
  • easy deployment for developers
  • discovery
  • monitoring
  • highly available
  • operational support
slide-73
SLIDE 73

Questions?

slide-74
SLIDE 74

  • @YelpEngineering
  • YelpEngineers
  • engineeringblog.yelp.com
  • github.com/yelp