Building and running applications at scale in Zalando Online - - PowerPoint PPT Presentation

slide-1
SLIDE 1

Building and running applications at scale in Zalando

Online fashion store Checkout case

By Pamela Canchanya

slide-2
SLIDE 2

About Zalando

slide-3
SLIDE 3
slide-4
SLIDE 4

  • ~ 5.4 billion EUR revenue in 2018
  • > 250 million visits per month
  • > 15,500 employees in Europe
  • > 70% of visits via mobile devices
  • > 26 million active customers
  • > 300,000 product choices
  • ~ 2,000 brands
  • 17 countries

About Zalando

slide-5
SLIDE 5

Black Friday at a glance

slide-6
SLIDE 6

Zalando Tech

slide-7
SLIDE 7

From monolith to microservice architecture

> 1000 microservices

Reorganization

slide-8
SLIDE 8

Platform

  • > 1100 developers
  • > 200 development teams

Tech organization

slide-9
SLIDE 9

End to end responsibility

slide-10
SLIDE 10

Checkout “Allow customers to buy seamlessly and conveniently”

Goal

slide-11
SLIDE 11

Checkout landscape

  • Programming languages: Java, Scala, Node.js, and many more
  • Communication: REST & messaging
  • Data storage: Cassandra
  • Configurations: etcd
  • Infrastructure: AWS & Kubernetes
  • Client side: React
  • Container: Docker

slide-12
SLIDE 12

Checkout architecture

[Architecture diagram. Components shown: Skipper, Tailor, frontend fragments, backend for frontend, Checkout service, Cassandra, and the dependencies of each layer]

slide-13
SLIDE 13

Checkout is a critical component in the shopping journey

  • Direct impact on business revenue
  • Direct impact on customer experience
slide-14
SLIDE 14

Checkout challenges in a microservice ecosystem

  • Increased points of failure
  • Multiple dependencies evolving independently
slide-15
SLIDE 15

Lessons learnt building Checkout with

  • Reliability patterns
  • Scalability
  • Monitoring
slide-16
SLIDE 16

Building microservices with reliability patterns

slide-17
SLIDE 17

Checkout confirmation page

[Diagram. Services shown behind the confirmation page: Cart, Delivery Destination, Payments Service, Delivery Service]

slide-18
SLIDE 18

Checkout confirmation page

Delivery Service

slide-19
SLIDE 19

Unwanted error

slide-20
SLIDE 20

Doing retries

for (var i = 1; i <= numRetries; i++) {
  try {
    return getDeliveryOptionsForCheckout(cart);
  } catch (error) {
    // Rethrow only once the last attempt has failed
    if (i >= numRetries) {
      throw error;
    }
  }
}

slide-21
SLIDE 21

Retry for transient errors like a network error or service overload
slide-22
SLIDE 22

Retries for some errors

try {
  getDeliveryOptionsForCheckout(cart) match {
    case Success(result)  => // return result
    case TransientFailure => // retry operation
    case Error            => // throw error
  }
} catch {
  // A catch block needs a case clause to handle the exception
  case e: Exception => println("Delivery options exception")
}

slide-23
SLIDE 23

Retries with exponential backoff

[Timeline: Attempt 1, Attempt 2, Attempt 3, each labeled 100 ms, with an exponential backoff time between attempts]
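A minimal sketch of retries with exponential backoff, in JavaScript like the retry loop earlier in the deck. The names `backoffDelay`, `retryWithBackoff`, `isTransient`, and `wait`, as well as the 100 ms base and 2000 ms cap, are assumptions for illustration and not taken from the Checkout code base.

```javascript
// Delay before the next attempt: baseMs, 2*baseMs, 4*baseMs, ... capped at maxMs.
function backoffDelay(attempt, baseMs = 100, maxMs = 2000) {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

// Default sleep; callers can inject a fake `wait` to avoid real waiting in tests.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Calls operation() up to maxAttempts times, waiting longer after each failure.
// isTransient (hypothetical hook) decides whether an error is worth retrying.
async function retryWithBackoff(operation, options = {}) {
  const {
    maxAttempts = 3,
    baseMs = 100,
    maxMs = 2000,
    isTransient = () => true,
    wait = sleep,
  } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on non-transient errors.
      if (attempt === maxAttempts || !isTransient(error)) {
        throw error;
      }
      await wait(backoffDelay(attempt, baseMs, maxMs));
    }
  }
}
```

Usage would look like `retryWithBackoff(() => getDeliveryOptionsForCheckout(cart), { maxAttempts: 3 })`. Adding random jitter to the delay is a common refinement that this sketch leaves out.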

slide-24
SLIDE 24

Retries get exhausted and failures become permanent

slide-25
SLIDE 25

Prevent execution of operations that are likely to fail

slide-26
SLIDE 26

Circuit breaker pattern

Circuit breaker pattern - Martin Fowler blog post

slide-27
SLIDE 27

Open circuit: operations fail immediately

Target

error rate > 50% threshold, so getDeliveryOptionsForCheckout = failure
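A minimal sketch of a circuit breaker that opens once the error rate passes a threshold, assuming a sliding window of recent calls. The class name, the options (`threshold`, `windowSize`, `resetMs`, `now`), and their defaults are hypothetical. The half-open state described in Martin Fowler's post is simplified away: after `resetMs` the circuit simply closes again.

```javascript
// Opens when the failure rate over the last `windowSize` calls exceeds
// `threshold`, and fails fast while open.
class CircuitBreaker {
  constructor({ threshold = 0.5, windowSize = 10, resetMs = 5000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.windowSize = windowSize;
    this.resetMs = resetMs;
    this.now = now;
    this.results = [];    // true = success, false = failure
    this.openedAt = null; // timestamp when the circuit opened, null if closed
  }

  isOpen() {
    if (this.openedAt === null) return false;
    if (this.now() - this.openedAt >= this.resetMs) {
      // Reset period elapsed: close again and start with a clean window.
      this.openedAt = null;
      this.results = [];
      return false;
    }
    return true;
  }

  record(success) {
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((ok) => !ok).length;
    const windowFull = this.results.length === this.windowSize;
    if (windowFull && failures / this.results.length > this.threshold) {
      this.openedAt = this.now();
    }
  }

  // Runs the operation unless the circuit is open, and records the outcome.
  async call(operation) {
    if (this.isOpen()) throw new Error("circuit open");
    try {
      const result = await operation();
      this.record(true);
      return result;
    } catch (error) {
      this.record(false);
      throw error;
    }
  }
}
```

With `threshold: 0.5`, a call such as `breaker.call(() => getDeliveryOptionsForCheckout(cart))` is rejected immediately once more than half of the recent calls have failed, without reaching the delivery service.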

slide-28
SLIDE 28

Fallback as an alternative to failure

Unwanted failure: no Checkout

Fallback: only the Standard delivery service, with a default delivery promise
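A sketch of this fallback, assuming the delivery lookup is passed in as a function. The constant `STANDARD_DELIVERY_FALLBACK`, its fields, and the `degraded` flag are made up for illustration.

```javascript
// Returned when the delivery service cannot be reached (values are placeholders).
const STANDARD_DELIVERY_FALLBACK = [
  { type: "STANDARD", promise: "default delivery promise" },
];

// Try the real lookup; on any failure, degrade to standard delivery
// instead of failing the whole checkout.
async function deliveryOptionsWithFallback(fetchOptions, cart) {
  try {
    return { options: await fetchOptions(cart), degraded: false };
  } catch (error) {
    return { options: STANDARD_DELIVERY_FALLBACK, degraded: true };
  }
}
```

The `degraded` flag lets the caller log or count how often the fallback is served, so a failing dependency does not go unnoticed.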

slide-29
SLIDE 29

Putting it all together

  • Do retries of operations with exponential backoff
  • Wrap operations with a circuit breaker
  • Handle failures with fallbacks when possible
  • Otherwise make sure to handle the exceptions

circuitCommand(
  getDeliveryOptionsForCheckout(cart)
    .retry(2)
)
  .onSuccess(/* do something with result */)
  .onError(getDeliveryOptionsForCheckoutFallback)

slide-30
SLIDE 30

Scaling microservices

slide-31
SLIDE 31

Traffic pattern

slide-32
SLIDE 32

Traffic pattern

slide-33
SLIDE 33

Microservice infrastructure

  • Incoming requests reach the load balancer and are distributed by instance
  • Each instance runs a container
  • Containers use the Zalando base image (Node env or JVM env)

slide-34
SLIDE 34

Scaling horizontally

[Diagram: load balancer, three instances, container]

slide-35
SLIDE 35

Scaling horizontally

[Diagram: load balancer, four instances, container]

slide-36
SLIDE 36

Scaling vertically

[Diagram: load balancer, three instances, container]

slide-37
SLIDE 37

Scaling vertically

[Diagram: load balancer, three instances, container]

slide-38
SLIDE 38

Scaling consequences

Cassandra

More service connections, more saturation, and the risk of an unhealthy database

slide-39
SLIDE 39

A microservice cannot scale if its downstream microservices cannot scale
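A small worked example of why scaling out a service puts pressure on its database: every new instance brings its own connections. The function names and all numbers here are hypothetical, not Checkout's real figures.

```javascript
// Total connections the database sees = instances x connections per instance.
function totalConnections(instances, connectionsPerInstance) {
  return instances * connectionsPerInstance;
}

// Largest instance count the database can absorb for a given connection limit.
function maxInstancesFor(connectionLimit, connectionsPerInstance) {
  return Math.floor(connectionLimit / connectionsPerInstance);
}

// Example: going from 3 to 6 instances with 8 connections each
// doubles the load on the database from 24 to 48 connections.
const before = totalConnections(3, 8);
const after = totalConnections(6, 8);
```

If the database tolerates, say, 100 connections, `maxInstancesFor(100, 8)` caps the service at 12 instances no matter how much capacity the service's own infrastructure has.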

slide-40
SLIDE 40

Low traffic rollouts

[Diagram: Service v1 (traffic 100%) and Service v2 (traffic 0%), each with instances 1 to 4]

slide-41
SLIDE 41

High traffic rollouts

[Diagram: Service v1 (traffic 100%) and Service v2 (traffic 0%); one version runs instances 1 to 4, the other instances 1 to 6]

slide-42
SLIDE 42

Rollout with not enough capacity

slide-43
SLIDE 43

Rollouts should allocate the same capacity as the version serving 100% of traffic
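A sketch of this rule as a pre-rollout check. It assumes the reading of the rollout diagrams in which the live version has scaled up to 6 instances while the new version would start with a configured minimum of 4. The function names and the equal-load-per-replica assumption are illustrative only.

```javascript
// Replicas to start the new version with before shifting any traffic:
// at least what the live version currently runs, never below the configured minimum.
function replicasForNewVersion(liveReplicas, configuredMin) {
  return Math.max(liveReplicas, configuredMin);
}

// True if the new version could absorb `trafficShare` (0..1) of the current load,
// assuming each replica handles the same load as a live one.
function hasCapacity(newReplicas, liveReplicas, trafficShare) {
  return newReplicas >= Math.ceil(liveReplicas * trafficShare);
}
```

Under these assumptions, `hasCapacity(4, 6, 1)` is false, which corresponds to the rollout with not enough capacity, while starting the new version with `replicasForNewVersion(6, 4)` replicas makes the full traffic switch safe.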

slide-44
SLIDE 44

Monitor microservices

slide-45
SLIDE 45

Four-layer model of the microservice ecosystem: Hardware, Communication, Application platform, Microservice

Monitoring microservice ecosystem

slide-46
SLIDE 46

Four-layer model of the microservice ecosystem: Hardware, Communication, Application platform, Microservice

Infrastructure metrics

Monitoring microservice ecosystem

slide-47
SLIDE 47

Four-layer model of the microservice ecosystem: Hardware, Communication, Application platform, Microservice

Microservice metrics

Monitoring microservice ecosystem

slide-48
SLIDE 48

First example

slide-49
SLIDE 49

Hardware metrics

slide-50
SLIDE 50

Communication metrics

slide-51
SLIDE 51

Rate and responses of API endpoints

slide-52
SLIDE 52

Dependencies metrics

slide-53
SLIDE 53

Language specific metrics

slide-54
SLIDE 54

Second Example

slide-55
SLIDE 55

Infrastructure metrics

slide-56
SLIDE 56

Node JS metrics

slide-57
SLIDE 57

Frontend microservice metrics

slide-58
SLIDE 58

Anti-pattern: using dashboards for outage detection

slide-59
SLIDE 59

“Something is broken, and somebody needs to fix it right now! Or, something might break soon, so somebody should look soon.”

Practical Alerting - Monitoring distributed systems Google SRE Book

Alerting

slide-60
SLIDE 60

Alert: Unhealthy instances 1 of 5

Cause: No more memory, JVM is misconfigured

slide-61
SLIDE 61

Alert: Service checkout is returning 4XX responses above the 25% threshold

Cause: A recent change broke the API contract for a business rule that was not considered

slide-62
SLIDE 62

Alert: No orders in last 5 minutes

Cause: A downstream dependency is experiencing connectivity issues

slide-63
SLIDE 63

Alert: Checkout database disk utilization is 80%

Cause: Saturation of data storage by an increase in traffic
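The four example alerts can be sketched as rules evaluated against a metrics snapshot. The thresholds come from the slides; the snapshot field names (`unhealthyInstances`, `requests`, `responses4xx`, `ordersLast5Min`, `diskUtilization`) are hypothetical and do not reflect the monitoring tool Zalando uses.

```javascript
// Each rule maps a metrics snapshot to true (fire) or false.
const alertRules = [
  { name: "Unhealthy instances",
    fires: (m) => m.unhealthyInstances >= 1 },
  { name: "4XX responses above 25%",
    fires: (m) => m.requests > 0 && m.responses4xx / m.requests > 0.25 },
  { name: "No orders in last 5 minutes",
    fires: (m) => m.ordersLast5Min === 0 },
  { name: "Database disk utilization at 80%",
    fires: (m) => m.diskUtilization >= 0.8 },
];

// Returns the names of all alerts that fire for the given snapshot.
function firingAlerts(metrics) {
  return alertRules.filter((rule) => rule.fires(metrics)).map((rule) => rule.name);
}
```

Each rule describes a symptom that can be observed from outside the service. The cause, such as a misconfigured JVM or a broken API contract, is what the responder finds afterwards.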

slide-64
SLIDE 64

Alerts notify about symptoms

slide-65
SLIDE 65

Alerts should be actionable

slide-66
SLIDE 66

Incident response

Figure: Five stages of incident response. Source: Production-Ready Microservices

slide-67
SLIDE 67

Example of postmortem

Summary of incident: No orders in last 5 minutes, 13.05.2019 between 16:00 and 16:45

Impact on customers: 2K customers could not complete checkout

Impact on business: 50K euros lost in orders that could not be completed

Analysis of root cause: Why were there no orders?

Action items ...

slide-68
SLIDE 68

Every incident should have a postmortem

slide-69
SLIDE 69
slide-70
SLIDE 70

Preparing for Black Friday

  • Business forecast
  • Load testing of real customer journey
  • Capacity planning
slide-71
SLIDE 71

Checklist for every microservice involved in Black Friday

  • Are the architecture and dependencies reviewed?
  • Are the possible points of failure identified and mitigated?
  • Are reliability patterns implemented?
  • Are the configurations adjustable without a deployment?
  • Do we have a scaling strategy?
  • Is monitoring in place?
  • Are all alerts actionable?
  • Is our team prepared for 24x7 incident management?
slide-72
SLIDE 72

Situation room

slide-73
SLIDE 73

Black Friday pattern of requests

> 4,200 orders/min
slide-74
SLIDE 74

My summary of learnings

  • Think outside the happy path and mitigate failures with reliability patterns
  • Services are only as scalable as their dependencies
  • Monitor the microservice ecosystem
slide-75
SLIDE 75

Resources

  • Service reliability engineering
  • Production-Ready Microservices
  • Monitoring and alerting tool used by Zalando
  • Tailor
  • Skipper
  • Load testing in Zalando
  • Kubernetes in Zalando
slide-76
SLIDE 76

Obrigada Thank you Danke

Contact

Pamela Canchanya pam.cdm@posteo.net @pamcdm
