SLIDE 1 Arrested Development
The awkward adolescence of a microservices-based application
EuroPython 2015
Scott Triglia
SLIDE 2
The Company
SLIDE 3 77M reviews 142M monthly unique users
SLIDE 4
Your Speaker Scott Triglia @scott_triglia 4 years with Yelp Search, ML, Services
SLIDE 5
Yelp Transaction Platform
The Product
SLIDE 6
Yelp Transaction Platform (or just “Platform”)
The Product
SLIDE 7
SLIDE 8
SLIDE 9
SLIDE 10
SLIDE 11
SLIDE 12
Microservices
That Hot Trend
SLIDE 13 “…an approach to developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms…”
http://martinfowler.com/articles/microservices.html
SLIDE 14 (clarkmaxwell via Flickr; CC BY-NC-ND 2.0)
SLIDE 15
Monolithic Python code resisted decoupling
SLIDE 16
Monolithic Python code catered to the lowest common denominator
SLIDE 17
Monolithic Python code was anti-agile
SLIDE 18
Time
Services
SLIDE 19 Pinterest Gingerbread House
SLIDE 20 Pinterest Gingerbread House
SLIDE 21
API complexity increases
SLIDE 22
coupling rises
SLIDE 23
interactions get murky
SLIDE 24
process does not scale
SLIDE 25
So what’s an engineer to do?
SLIDE 26
- Decoupling
- Defining
- Understanding Production
- Staying Agile
SLIDE 27
- Decoupling
- Defining
- Understanding Production
- Staying Agile
SLIDE 28
Old boring problem Monolithic spaghetti code
SLIDE 29
Solution: microservices!
SLIDE 30
New exciting problem how to share concepts across services
SLIDE 31
New exciting problem distributed tech debt
SLIDE 32
service_type
SLIDE 33
service_type
What product does your business provide and how does it provide it?
SLIDE 34
service_type
pickup delivery
SLIDE 35
service_type
pickup delivery booking_at_business booking_at_customer
SLIDE 36
service_type
pickup delivery hotel_reservation booking_at_business booking_at_customer goods_at_customer goods_at_business
SLIDE 37
SLIDE 38
SLIDE 39
SLIDE 40
- Confusing
- Pervasive
- Convenient, but not designed
SLIDE 41
SLIDE 42
Draw boundaries, introduce domain-specific concepts tied to functionality
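A minimal sketch of what drawing such boundaries might look like. The names here (`FulfillmentMethod`, `BookingLocation`, and the two translation helpers) are hypothetical and not from the talk. The idea is that each domain owns a small concept tied to the functionality it controls, and the catch-all `service_type` string is translated once, at the boundary, instead of being interpreted differently by every service.

```python
from enum import Enum
from typing import Optional


class FulfillmentMethod(Enum):
    """Hypothetical concept owned by the ordering domain."""
    PICKUP = "pickup"
    DELIVERY = "delivery"


class BookingLocation(Enum):
    """Hypothetical concept owned by the booking domain."""
    AT_BUSINESS = "at_business"
    AT_CUSTOMER = "at_customer"


# Translation from the legacy catch-all value happens in one place only.
_LEGACY_FULFILLMENT = {
    "pickup": FulfillmentMethod.PICKUP,
    "delivery": FulfillmentMethod.DELIVERY,
}
_LEGACY_BOOKING = {
    "booking_at_business": BookingLocation.AT_BUSINESS,
    "booking_at_customer": BookingLocation.AT_CUSTOMER,
}


def fulfillment_from_service_type(service_type: str) -> Optional[FulfillmentMethod]:
    """Return the ordering-domain concept, or None if it does not apply."""
    return _LEGACY_FULFILLMENT.get(service_type)


def booking_location_from_service_type(service_type: str) -> Optional[BookingLocation]:
    """Return the booking-domain concept, or None if it does not apply."""
    return _LEGACY_BOOKING.get(service_type)
```

A service that only cares about delivery logistics can then depend on `FulfillmentMethod` alone and never needs to know that bookings exist.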
SLIDE 43
SLIDE 44
Lessons
SLIDE 45
Interfaces are the sum of APIs, shared libraries, and the data that flows through them
SLIDE 46
Sacrificing DRYness can be the best choice for overall design
SLIDE 47 Service interfaces are a great opportunity to intentionally decouple systems
SLIDE 48
- Decoupling
- Defining
- Understanding Production
- Staying Agile
SLIDE 49
Have you ever needed to understand a system and been told go read the source?
SLIDE 50 What about a system which
interface?
SLIDE 51
Coming from a Python monolith, strong interfaces were quite rare
SLIDE 52
def checkout(order, price, **kwargs):
    """Process an order."""
    validate_order(order)
    charge_credit_card(order.user, price)
    notify_user(order, **kwargs)
SLIDE 53
SLIDE 54
SLIDE 55
SLIDE 56 Client side - Yelp/bravado
from bravado.client import SwaggerClient

# from_url needs a full URL, including the scheme
client = SwaggerClient.from_url(
    "http://www.myservice.com/swagger.json"
)
pet = client.pet.getPetById(petId=42).result()
SLIDE 57 Server side - striglia/pyramid_swagger
# In your Pyramid webapp.py
config.include('pyramid_swagger')
SLIDE 58
Lessons
SLIDE 59
Interfaces should be intentional
SLIDE 60
Interfaces should be explicit
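As an illustration, here is one way the `checkout` function from the earlier slide could state its interface explicitly. This is a sketch, not code from the talk: `NotificationOptions`, its fields, and the `Order` type are hypothetical stand-ins for whatever used to travel through `**kwargs`, and the three helpers are stubs that only record that they ran.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List


@dataclass(frozen=True)
class NotificationOptions:
    """Hypothetical replacement for the open-ended **kwargs."""
    send_email: bool = True
    send_push: bool = False


@dataclass(frozen=True)
class Order:
    """Stub order type; the real one is not shown in the talk."""
    order_id: int
    user: str


calls: List[str] = []  # records which stub ran, for demonstration only


def validate_order(order: Order) -> None:
    if order.order_id <= 0:
        raise ValueError("order_id must be positive")
    calls.append("validate")


def charge_credit_card(user: str, price: Decimal) -> None:
    calls.append(f"charge {user} {price}")


def notify_user(order: Order, options: NotificationOptions) -> None:
    calls.append(f"notify email={options.send_email} push={options.send_push}")


def checkout(order: Order, price: Decimal, *,
             notification: NotificationOptions = NotificationOptions()) -> None:
    """Process an order. Every accepted option is visible in the signature."""
    validate_order(order)
    charge_credit_card(order.user, price)
    notify_user(order, notification)
```

With this shape, a misspelled option such as `sned_email=False` fails immediately with a `TypeError` instead of being silently swallowed by `**kwargs`.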
SLIDE 61
Find the mechanical things which don’t scale and automate them mercilessly
SLIDE 62
- Decoupling
- Defining
- Understanding Production
- Staying Agile
SLIDE 63
Real customer bug report: “We’re seeing 504s talking to the /user_info API”
SLIDE 64
Ancient times: Use logic and whatever logs happen to exist
SLIDE 65 (drbethsnow via Flickr; CC BY-NC-ND 2.0)
SLIDE 66
Better: Log all incoming API requests to any service
SLIDE 67 (spam via Flickr; CC by 2.0)
SLIDE 68
Best: Every service has a detailed access/error log and tooling to examine them
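A rough sketch of what a per-service access log could look like, using only the standard library. The middleware name, the log field names, and the use of an `X-Request-ID` header are all assumptions made for illustration; the talk does not show Yelp's actual logging code.

```python
import json
import logging
import time
import uuid

access_log = logging.getLogger("access")


class AccessLogMiddleware:
    """Wrap any WSGI app and emit one JSON log line per request."""

    def __init__(self, app, service_name):
        self.app = app
        self.service_name = service_name

    def __call__(self, environ, start_response):
        started = time.monotonic()
        record = {
            "service": self.service_name,
            # Reuse an upstream request id if present so logs can be joined
            # across services; otherwise generate one.
            "request_id": environ.get("HTTP_X_REQUEST_ID") or uuid.uuid4().hex,
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
        }

        def logging_start_response(status, headers, exc_info=None):
            # Status arrives as e.g. "504 Gateway Timeout"; keep the code.
            record["status"] = int(status.split(" ", 1)[0])
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, logging_start_response)
        finally:
            # Measures time until the app returns its response iterable,
            # not the time to stream the full body.
            record["duration_ms"] = round((time.monotonic() - started) * 1000, 2)
            access_log.info(json.dumps(record, sort_keys=True))
```

Because each line is structured JSON with the same fields in every service, a log pipeline can index it directly, and a question like "which requests to `/user_info` returned 504?" becomes a query instead of an investigation.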
SLIDE 69
SLIDE 70
SLIDE 71
SLIDE 72
SLIDE 73
So what about that customer with the mystery 504?
SLIDE 74
SLIDE 76
Realistically: Don’t require the customer to report issues in the first place
SLIDE 77
SLIDE 78
es_host: elasticsearch-hostname
es_port: 14900
index: logstash-errors-%G.%V
type: frequency
num_events: 20
timeframe:
  minutes: 2
alert:
- "modules.sensu_alert.SensuAlerter"
sensu:
  team: platform
  tip: "This alert indicates a large number of errors across the Platform product. See <link to Kibana> for details."
  page: true
  status: 2 # CRITICAL
SLIDE 79
SLIDE 80
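The rule above fires when at least `num_events` matching log entries arrive within `timeframe`. The sliding-window check behind that idea can be sketched as follows. This is written from scratch for illustration and is not taken from ElastAlert's implementation; the class and method names are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyWindow:
    """Report when num_events events fall within a sliding timeframe."""

    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe
        self._timestamps = deque()

    def add_event(self, timestamp):
        """Record one event (in time order); return True if the threshold is met."""
        self._timestamps.append(timestamp)
        cutoff = timestamp - self.timeframe
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] < cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.num_events


# Mirrors the config: 20 errors within 2 minutes should page the team.
window = FrequencyWindow(num_events=20, timeframe=timedelta(minutes=2))
```

A burst of 20 errors one second apart trips the threshold on the twentieth event, while the same 20 errors spread one minute apart never do.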
SLIDE 81
Lessons
SLIDE 82
Logging is a superpower. Use it wisely and constantly.
SLIDE 83
But raw data is not enough! Visualize and monitor actively.
SLIDE 84 These approaches make a world of difference:
- Incident response from days to minutes
- Investigations from ∞ to minutes
SLIDE 85
- Decoupling
- Defining
- Understanding Production
- Staying Agile
SLIDE 86 Uncomfortable conversation: “Customers had their orders
preventing it going forward?”
SLIDE 87
Understandable response: “Deploy more carefully”
SLIDE 88
Understandable response: “Expand oncall”
SLIDE 89
How do we ensure the team stays agile as our services grow in complexity?
SLIDE 90
Pain point: The testing environment is {broken, flaky, not like prod}
SLIDE 91
SLIDE 92
Pain point: Tests passed but production broke
SLIDE 93
Production monitoring is the natural extension of excellent pre-deploy testing.
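One way to read this: the assertion made before deploying can be the same one that keeps running afterwards. A minimal sketch, assuming a hypothetical `check_user_info` contract check, a hypothetical `user_id` response field, and a caller-supplied `fetch` function (none of these are from the talk):

```python
from typing import Callable, Dict, List


def check_user_info(fetch: Callable[[str], Dict]) -> List[str]:
    """Return a list of contract violations for the /user_info response."""
    problems = []
    response = fetch("/user_info")
    if response.get("status") != 200:
        problems.append(f"unexpected status {response.get('status')}")
    if "user_id" not in response.get("body", {}):
        problems.append("body is missing user_id")
    return problems


def run_as_test(fetch: Callable[[str], Dict]) -> None:
    """Pre-deploy: fail the build on any violation."""
    problems = check_user_info(fetch)
    assert not problems, problems


def run_as_monitor(fetch: Callable[[str], Dict],
                   alert: Callable[[str], None]) -> None:
    """Post-deploy: run the same check on a schedule and alert instead of failing."""
    for problem in check_user_info(fetch):
        alert(problem)
```

In the test suite `fetch` would call a local or staging instance; in production it would call the live service, and `alert` would hand off to whatever paging system the team uses.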
SLIDE 94
SLIDE 95
Pain point: No clue how much time we spend fixing production issues
SLIDE 96
Pain point: Tough to argue what changes will make things more robust
SLIDE 97
SLIDE 98
SLIDE 99
And as with everything else, this must eventually be automated
SLIDE 100
Lessons
SLIDE 101
Networks of services are fundamentally harder to test. Prepare accordingly.
SLIDE 102
Failure will happen. Focus on both identifying and recovering quickly.
SLIDE 103
Staying agile is easy if your application rarely fails and recovers automatically
SLIDE 104
Wrap Up
SLIDE 105
Know your roots
SLIDE 106
Be explicit
SLIDE 107
Measure everything
SLIDE 108
Scale via automation
SLIDE 109
Yelp/bravado striglia/pyramid_swagger Yelp/elastalert
SLIDE 110
http://engineeringblog.yelp.com/2015/03/using-services-to-break-down-monoliths.html
SLIDE 111
Our accumulated wisdom
Yelp/service-principles
SLIDE 112 Questions?
@scott_triglia scott.triglia@gmail.com