Arrested Development: The awkward adolescence of a microservices-based application



slide-1
SLIDE 1

Arrested Development

The awkward adolescence of a microservices-based application
EuroPython 2015
Scott Triglia

slide-2
SLIDE 2

The Company

slide-3
SLIDE 3

77M reviews
142M monthly unique users

slide-4
SLIDE 4

Your Speaker

Scott Triglia (@scott_triglia)
4 years with Yelp
Search, ML, Services

slide-5
SLIDE 5

Yelp Transaction Platform

The Product

slide-6
SLIDE 6

Yelp Transaction Platform (or just “Platform”)

The Product

slide-7
SLIDE 7
slide-8
SLIDE 8
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12

Microservices

That Hot Trend

slide-13
SLIDE 13

“…an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms…”

http://martinfowler.com/articles/microservices.html

slide-14
SLIDE 14

(clarkmaxwell via Flickr; CC BY-NC-ND 2.0)

slide-15
SLIDE 15

Monolithic python code resisted decoupling

slide-16
SLIDE 16

Monolithic python code catered to the lowest common denominator

slide-17
SLIDE 17

Monolithic python code was anti-agile

slide-18
SLIDE 18

[Chart with axes labeled Time and Services]

slide-19
SLIDE 19

Pinterest Gingerbread House

slide-20
SLIDE 20

Pinterest Gingerbread House

slide-21
SLIDE 21

API complexity increases

slide-22
SLIDE 22

coupling rises

slide-23
SLIDE 23

interactions get murky

slide-24
SLIDE 24

process does not scale

slide-25
SLIDE 25

So what’s an engineer to do?

slide-26
SLIDE 26
  • Decoupling
  • Defining
  • Understanding Production
  • Staying Agile
slide-27
SLIDE 27
  • Decoupling
  • Defining
  • Understanding Production
  • Staying Agile
slide-28
SLIDE 28

Old boring problem: monolithic spaghetti code

slide-29
SLIDE 29

Solution: microservices!

slide-30
SLIDE 30

New exciting problem: how to share concepts across services

slide-31
SLIDE 31

New exciting problem: distributed tech debt

slide-32
SLIDE 32

service_type

slide-33
SLIDE 33

service_type

What product does your business provide, and how is it provided?

slide-34
SLIDE 34

service_type

  • pickup
  • delivery

slide-35
SLIDE 35

service_type

  • pickup
  • delivery
  • booking_at_business
  • booking_at_customer

slide-36
SLIDE 36

service_type

  • pickup
  • delivery
  • hotel_reservation
  • booking_at_business
  • booking_at_customer
  • goods_at_customer
  • goods_at_business

slide-37
SLIDE 37
slide-38
SLIDE 38
slide-39
SLIDE 39
slide-40
SLIDE 40

  • Confusing
  • Pervasive
  • Convenient, but not designed

slide-41
SLIDE 41
slide-42
SLIDE 42

Draw boundaries, introduce domain-specific concepts tied to functionality
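One way to read this slide, as a rough sketch only: rather than every service branching on the shared service_type string, a service translates it once at its boundary into a narrow concept tied to what that service actually does. The service_type values below are the ones listed on the earlier slides; FulfillmentMethod, fulfillment_for, needs_delivery_address, and the mapping itself are hypothetical names and assumptions, not taken from the talk.

```python
from enum import Enum


class FulfillmentMethod(Enum):
    """Hypothetical concept owned by one imagined fulfillment service."""

    CUSTOMER_COMES_TO_BUSINESS = "customer_comes_to_business"
    BUSINESS_COMES_TO_CUSTOMER = "business_comes_to_customer"


# Translation happens once, at the service boundary. The keys are the
# service_type values from the slides; the grouping is an assumption.
_SERVICE_TYPE_TO_FULFILLMENT = {
    "pickup": FulfillmentMethod.CUSTOMER_COMES_TO_BUSINESS,
    "booking_at_business": FulfillmentMethod.CUSTOMER_COMES_TO_BUSINESS,
    "goods_at_business": FulfillmentMethod.CUSTOMER_COMES_TO_BUSINESS,
    "hotel_reservation": FulfillmentMethod.CUSTOMER_COMES_TO_BUSINESS,
    "delivery": FulfillmentMethod.BUSINESS_COMES_TO_CUSTOMER,
    "booking_at_customer": FulfillmentMethod.BUSINESS_COMES_TO_CUSTOMER,
    "goods_at_customer": FulfillmentMethod.BUSINESS_COMES_TO_CUSTOMER,
}


def fulfillment_for(service_type: str) -> FulfillmentMethod:
    """Map a shared service_type onto this service's own concept."""
    try:
        return _SERVICE_TYPE_TO_FULFILLMENT[service_type]
    except KeyError:
        raise ValueError(f"unknown service_type: {service_type!r}") from None


def needs_delivery_address(service_type: str) -> bool:
    """Business logic asks about functionality, not about service_type."""
    method = fulfillment_for(service_type)
    return method is FulfillmentMethod.BUSINESS_COMES_TO_CUSTOMER
```

With this shape, adding a new service_type touches one mapping per service instead of every conditional that used to compare strings.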

slide-43
SLIDE 43
slide-44
SLIDE 44

Lessons

slide-45
SLIDE 45

Interfaces are the sum of APIs, shared libraries, and the data that flows through them

slide-46
SLIDE 46

Sacrificing DRYness can be the best choice for overall design

slide-47
SLIDE 47

Service interfaces are a great opportunity to intentionally decouple systems

slide-48
SLIDE 48
  • Decoupling
  • Defining
  • Understanding Production
  • Staying Agile
slide-49
SLIDE 49

Have you ever needed to understand a system and been told “go read the source”?

slide-50
SLIDE 50

What about a system which only validates half its interface?

slide-51
SLIDE 51

Coming from a python monolith, strong interfaces were quite rare

slide-52
SLIDE 52

def checkout(order, price, **kwargs):
    """Process an order."""
    validate_order(order)
    charge_credit_card(order.user, price)
    notify_user(order, **kwargs)
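In the checkout function above, the **kwargs hide what notify_user accepts, so a caller has to read the source to find out. A minimal sketch of a more explicit signature follows. Order, NotificationOptions, and their fields are hypothetical stand-ins, and the body returns a summary instead of calling the real validate_order, charge_credit_card, and notify_user helpers, which the slide does not show.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Order:
    """Stand-in for the order object; the fields are assumptions."""

    order_id: str
    user: str


@dataclass(frozen=True)
class NotificationOptions:
    """Hypothetical replacement for the open-ended **kwargs."""

    send_email: bool = True
    send_push: bool = False


def checkout(
    order: Order,
    price: Decimal,
    notification: NotificationOptions = NotificationOptions(),
) -> dict:
    """Process an order through an interface that states what it accepts."""
    # Validate the whole interface, not half of it.
    if not isinstance(price, Decimal) or price <= 0:
        raise ValueError("price must be a positive Decimal")

    channels = []
    if notification.send_email:
        channels.append("email")
    if notification.send_push:
        channels.append("push")

    # A real version would validate, charge, and notify here.
    return {
        "order_id": order.order_id,
        "user": order.user,
        "charged": str(price),
        "channels": channels,
    }
```

An unknown keyword such as checkout(order, price, sms=True) now fails immediately with a TypeError instead of being silently forwarded.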

slide-53
SLIDE 53
slide-54
SLIDE 54
slide-55
SLIDE 55
slide-56
SLIDE 56

Client side - Yelp/bravado

from bravado.client import SwaggerClient

client = SwaggerClient.from_url(
    "www.myservice.com/swagger.json"
)
pet = client.pet.getPetById(petId=42).result()

slide-57
SLIDE 57

Server side - striglia/pyramid_swagger

# In your Pyramid webapp.py
config.include('pyramid_swagger')

slide-58
SLIDE 58

Lessons

slide-59
SLIDE 59

Interfaces should be intentional

slide-60
SLIDE 60

Interfaces should be explicit

slide-61
SLIDE 61

Find the mechanical things which don’t scale and automate them mercilessly

slide-62
SLIDE 62
  • Decoupling
  • Defining
  • Understanding Production
  • Staying Agile
slide-63
SLIDE 63

Real customer bug report: “We’re seeing 504s talking to the /user_info API”

slide-64
SLIDE 64

Ancient times: Use logic and whatever logs happen to exist

slide-65
SLIDE 65

(drbethsnow via Flickr; CC BY-NC-ND 2.0)

slide-66
SLIDE 66

Better: Log all incoming API requests to any service

slide-67
SLIDE 67

(spam via Flickr; CC BY 2.0)

slide-68
SLIDE 68

Best: Every service has a detailed access/error log and tooling to examine them
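As an illustration of what a per-service access log could look like, here is a small WSGI middleware that emits one JSON line per request. It is a sketch under assumptions: the class name, the field names, and the emit callback are hypothetical, and the talk does not say how Yelp's services actually produce their logs.

```python
import json
import time


class AccessLogMiddleware:
    """Wrap a WSGI app and emit one JSON access-log line per request."""

    def __init__(self, app, service_name, emit=print):
        self.app = app
        self.service_name = service_name
        self.emit = emit  # hypothetical sink; swap in a real log shipper

    def __call__(self, environ, start_response):
        started = time.monotonic()
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            # The status line looks like "504 Gateway Timeout".
            captured["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, recording_start_response)
        finally:
            # Measures time until the app returns its response iterable,
            # so streamed bodies are not included in the duration.
            self.emit(json.dumps({
                "service": self.service_name,
                "method": environ.get("REQUEST_METHOD"),
                "path": environ.get("PATH_INFO"),
                # If the app raised before responding, record a 500.
                "status": captured.get("status", 500),
                "duration_s": round(time.monotonic() - started, 4),
            }))
```

Because every line carries the service name, path, status, and duration, a report like "504s on /user_info" becomes a query over structured data instead of a hunt through whatever logs happen to exist.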

slide-69
SLIDE 69
slide-70
SLIDE 70
slide-71
SLIDE 71
slide-72
SLIDE 72
slide-73
SLIDE 73

So what about that customer with the mystery 504?

slide-74
SLIDE 74
slide-75
SLIDE 75

0.15 s 2.5 s

slide-76
SLIDE 76

Realistically: Don’t require the customer to report issues in the first place

slide-77
SLIDE 77
slide-78
SLIDE 78

es_host: elasticsearch-hostname
es_port: 14900
index: logstash-errors-%G.%V
type: frequency
num_events: 20
timeframe:
  minutes: 2
alert:
  - "modules.sensu_alert.SensuAlerter"
sensu:
  team: platform
  tip: "This alert indicates a large number of errors across the Platform product. See <link to Kibana> for details."
  page: true
  status: 2  # CRITICAL

slide-79
SLIDE 79

es_host: elasticsearch-hostname
es_port: 14900
index: logstash-errors-%G.%V
type: frequency
num_events: 20
timeframe:
  minutes: 2
alert:
  - "modules.sensu_alert.SensuAlerter"
sensu:
  team: platform
  tip: "This alert indicates a large number of errors across the Platform product. See <link to Kibana> for details."
  page: true
  status: 2  # CRITICAL

slide-80
SLIDE 80

es_host: elasticsearch-hostname
es_port: 14900
index: logstash-errors-%G.%V
type: frequency
num_events: 20
timeframe:
  minutes: 2
alert:
  - "modules.sensu_alert.SensuAlerter"
sensu:
  team: platform
  tip: "This alert indicates a large number of errors across the Platform product. See <link to Kibana> for details."
  page: true
  status: 2  # CRITICAL
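The rule above asks for an alert when 20 error events land within 2 minutes. The following sketch shows that idea as a sliding window. It is not ElastAlert's implementation: the FrequencyRule class here is a hypothetical illustration that assumes events arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyRule:
    """Fire when num_events events fall within one timeframe window."""

    def __init__(self, num_events: int, timeframe: timedelta):
        self.num_events = num_events
        self.timeframe = timeframe
        self._window = deque()  # timestamps still inside the window

    def add_event(self, timestamp: datetime) -> bool:
        """Record one event; return True if the threshold was just reached."""
        self._window.append(timestamp)

        # Drop events that are older than the timeframe.
        cutoff = timestamp - self.timeframe
        while self._window and self._window[0] < cutoff:
            self._window.popleft()

        if len(self._window) >= self.num_events:
            # Reset so one burst produces one alert, not one per event.
            self._window.clear()
            return True
        return False
```

For the values on the slide, the rule would be built as FrequencyRule(num_events=20, timeframe=timedelta(minutes=2)) and fed the timestamp of each logged error.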

slide-81
SLIDE 81

Lessons

slide-82
SLIDE 82

Logging is a superpower. Use it wisely and constantly.

slide-83
SLIDE 83

But raw data is not enough! Visualize and monitor actively.

slide-84
SLIDE 84

These approaches make a world of difference:

  • Incident response from days to minutes
  • Investigations from ∞ to minutes
slide-85
SLIDE 85
  • Decoupling
  • Defining
  • Understanding Production
  • Staying Agile
slide-86
SLIDE 86

Uncomfortable conversation: “Customers had their orders interrupted. How are you preventing it going forward?”

slide-87
SLIDE 87

Understandable response: “Deploy more carefully”

slide-88
SLIDE 88

Understandable response: “Expand oncall”

slide-89
SLIDE 89

How do we ensure the team stays agile as our services grow in complexity?

slide-90
SLIDE 90

Pain point: The testing environment is {broken, flaky, not like prod}

slide-91
SLIDE 91
slide-92
SLIDE 92

Pain point: Tests passed but production broke

slide-93
SLIDE 93

Production monitoring is the natural extension of excellent pre-deploy testing.

slide-94
SLIDE 94
slide-95
SLIDE 95

Pain point: No clue how much time we spend fixing production issues

slide-96
SLIDE 96

Pain point: Tough to argue what changes will make things more robust
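Both pain points come down to missing measurements. As a hedged sketch of one way to get them, the snippet below totals time spent per root cause from a list of incident records. The Incident fields and the cause labels are hypothetical, and the talk does not describe the tooling Yelp actually used.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Incident(NamedTuple):
    """Hypothetical incident record; the fields are assumptions."""

    cause: str
    detected_at: datetime
    resolved_at: datetime


def time_spent_by_cause(incidents):
    """Return total incident duration per cause, largest first."""
    totals = {}
    for incident in incidents:
        duration = incident.resolved_at - incident.detected_at
        totals[incident.cause] = totals.get(incident.cause, timedelta()) + duration
    # Sorting puts the most expensive cause first, which gives a concrete
    # basis for arguing which robustness work to do next.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

Fed from postmortem records, a summary like this turns "we should fix the test environment" into a claim backed by hours lost.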

slide-97
SLIDE 97
slide-98
SLIDE 98
slide-99
SLIDE 99

And as with everything else, this must eventually be automated

slide-100
SLIDE 100

Lessons

slide-101
SLIDE 101

Networks of services are fundamentally harder to test. Prepare accordingly.

slide-102
SLIDE 102

Failure will happen. Focus on both identifying and recovering quickly.

slide-103
SLIDE 103

Staying agile is easy if your application rarely fails and recovers automatically

slide-104
SLIDE 104

Wrap Up

slide-105
SLIDE 105

Know your roots

slide-106
SLIDE 106

Be explicit

slide-107
SLIDE 107

Measure everything

slide-108
SLIDE 108

Scale via automation

slide-109
SLIDE 109

  • Yelp/bravado
  • striglia/pyramid_swagger
  • Yelp/elastalert

slide-110
SLIDE 110

http://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html

slide-111
SLIDE 111

Our accumulated wisdom

Yelp/service-principles

slide-112
SLIDE 112

Questions?

@scott_triglia scott.triglia@gmail.com