SLIDE 1

Mature microservices and how to operate them

Sarah Wells, Technical Director for Operations & Reliability, The Financial Times @sarahjwells

SLIDE 2

SLIDE 3

https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69

SLIDE 4

https://www.ft.com/companies

SLIDE 5

Problem: we’d set up a redirect to a page which didn’t exist

SLIDE 6

We weren’t sure how to fix the data via the URL management tool

SLIDE 7

SLIDE 8

SLIDE 9

We got it fixed

SLIDE 10

Polyglot architectures are great - until you need to work out how this database is backed up

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

Microservices are more complicated to operate and maintain

SLIDE 15

Why bother?

SLIDE 16

SLIDE 17

SLIDE 18

“Experiment” for most organizations really means “try”

Linda Rising, Experiments: the Good, the Bad and the Beautiful

SLIDE 19

Overlap tests by componentising the barrier

SLIDE 20

Releasing changes frequently doesn’t just ‘happen’

SLIDE 21

Done right, microservices enable this

SLIDE 22

The team that builds the system has to operate it too

SLIDE 23

What happens when teams move on to new projects?

SLIDE 24

Your next legacy system will be microservices, not a monolith

SLIDE 25

Optimising for speed
Operating microservices
When people move on

SLIDE 26

Optimising for speed

SLIDE 27

SLIDE 28

Measure                   High performers
Delivery lead time

SLIDE 29

Measure                   High performers
Delivery lead time        Less than one hour

“How long would it take you to release a single line of code to production?”

SLIDE 30

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency

SLIDE 31

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand

SLIDE 32

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service

SLIDE 33

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service   Less than one hour

SLIDE 34

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service   Less than one hour
Change fail rate

SLIDE 35

Measure                   High performers
Delivery lead time        Less than one hour
Deployment frequency      On demand
Time to restore service   Less than one hour
Change fail rate          0 - 15%

SLIDE 36

High performing organisations release changes frequently

SLIDE 37

Continuous delivery is the foundation

SLIDE 38

“If it hurts, do it more frequently, and bring the pain forward.”

SLIDE 39

Our old build and deployment process was very manual…

SLIDE 40

SLIDE 41

You can’t experiment when you do 12 releases a year

SLIDE 42

1. An automated build and release pipeline

SLIDE 43

2. Automated testing, integrated into the pipeline

SLIDE 44

3. Continuous integration

SLIDE 45

If you aren’t releasing multiple times a day, consider what is stopping you

SLIDE 46

You’ll probably have to change the way you architect things

SLIDE 47

Zero downtime deployments:

sequential deployments
schemaless databases

SLIDE 48

In-hours releases mean the people who can help are there

SLIDE 49

You need to be able to test and deploy your changes independently

SLIDE 50

You need systems - and teams - to be loosely coupled

SLIDE 51

Done right, microservices are loosely coupled

SLIDE 52

Processes also have to change

SLIDE 53

Often there is ‘process theatre’ around things and this can safely be removed

SLIDE 54

Change approval boards don’t reduce the chance of failure

SLIDE 55

Filling out a form for each change takes too long

SLIDE 56

How fast are we moving?

SLIDE 57

SLIDE 58

SLIDE 59

Releasing 250 times as often

SLIDE 60

Changes are small, easy to understand, independent and reversible

SLIDE 61

<1% failure rate
~16% failure rate

SLIDE 62

Optimising for speed
Operating microservices

SLIDE 63

SLIDE 64

There are patterns and approaches that help

SLIDE 65

DevOps is essential for success

SLIDE 66

You can’t hand things off to another team when they change multiple times a day

SLIDE 67

High performing teams get to make their own decisions about tools and technology

SLIDE 68

Delegating tool choice to teams makes it hard for central teams to support everything

SLIDE 69

Make it someone else’s problem

SLIDE 70

https://medium.com/wardleymaps

SLIDE 71

Buy rather than build, unless it’s critical to your business

SLIDE 72

Work out what level of risk you’re comfortable with

SLIDE 73

“We’re not a hospital or a power station”

SLIDE 74

We value releasing often so we can experiment frequently

SLIDE 75

Accept that you will generally be in a state of ‘grey failure’

SLIDE 76

SLIDE 77

Retry on failure:

backoff before retrying
give up if it’s taking too long
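
The two retry rules above can be sketched roughly as follows. This is a minimal illustration, not the FT's implementation: the function name `retry_with_backoff`, its parameters and the default values are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(call, max_elapsed=5.0, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call `call()` until it succeeds or the time budget is used up.

    All names and defaults here are hypothetical, chosen for illustration.
    """
    deadline = clock() + max_elapsed
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            # Back off before retrying: exponential delay with jitter,
            # capped at max_delay so waits do not grow without bound.
            delay = min(max_delay, base_delay * (2 ** attempt))
            delay *= random.uniform(0.5, 1.0)
            # Give up if it is taking too long: re-raise the last error
            # rather than sleeping past the deadline.
            if clock() + delay > deadline:
                raise
            sleep(delay)
            attempt += 1
```

The jitter spreads retries out so that many callers recovering from the same failure do not all hit the downstream service at the same moment.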

SLIDE 78

Mitigate now, fix tomorrow

SLIDE 79

How do you know something’s wrong?

SLIDE 80

Concentrate on the business capabilities

SLIDE 81

Synthetic monitoring

SLIDE 82

SLIDE 83

SLIDE 84

SLIDE 85

SLIDE 86

No data fixtures required

SLIDE 87

Also helps us know things are broken even if no user is currently doing anything
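
One way to picture a synthetic check is as a small function run on a schedule against a business capability. The sketch below is an assumption-laden illustration: `run_synthetic_check`, `probe` and `is_success` are hypothetical names, and the real probe (for example, publishing a known test article and reading it back) is left to the caller.

```python
import time


def run_synthetic_check(probe, is_success, clock=time.monotonic):
    """Run one synthetic request and report whether the capability works.

    `probe` performs the synthetic action; `is_success` judges its result.
    Both are hypothetical hooks supplied by the caller.
    """
    started = clock()
    try:
        result = probe()
        ok = bool(is_success(result))
        error = None
    except Exception as exc:
        # A probe that blows up counts as a failed check, not a crashed monitor.
        ok = False
        error = repr(exc)
    return {"ok": ok, "seconds": clock() - started, "error": error}
```

Because the check generates its own traffic, it can report a failure even when no real user is active.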

SLIDE 88

Make sure you know whether real things are working in production

SLIDE 89

Our editorial team is inventive

SLIDE 90

What does it mean for a publish to be ‘successful’?

SLIDE 91

SLIDE 92

SLIDE 93

SLIDE 94

SLIDE 95

Build observability into your system

SLIDE 96

Observability: can you infer what’s going on in the system by looking at its external outputs?

SLIDE 97

Log aggregation

SLIDE 98

SLIDE 99

Metrics

SLIDE 100

Keep it simple:

request rate
latency
error rate
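
As a rough illustration of how little is needed, the three metrics above can be derived from a list of per-request records. The `Request` type, the `summarise` function, the choice of a 95th-percentile latency and the "status 500 or above is an error" rule are all assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical record of one handled request.
    duration_ms: float
    status: int


def summarise(requests, window_seconds):
    """Return request rate, latency and error rate for one time window."""
    count = len(requests)
    if count == 0:
        return {"rate_per_s": 0.0, "p95_ms": None, "error_rate": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank 95th percentile: rank = ceil(0.95 * count), in integers.
    rank = (95 * count + 99) // 100
    # Assumption: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": count / window_seconds,
        "p95_ms": durations[rank - 1],
        "error_rate": errors / count,
    }
```

In practice a metrics library or the service mesh would usually collect these figures; the point is that each service only needs to expose the same three numbers.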

SLIDE 101

You’ll always be migrating something

SLIDE 102

Doing anything 150 times is painful

SLIDE 103

Deployment pipelines need to be templated

SLIDE 104

Use a service mesh

SLIDE 105

You’ll have services that haven’t been released for years

SLIDE 106

But you don’t want to find out your service can’t be released when you most need to do it

SLIDE 107

Build everything overnight?

SLIDE 108

Optimising for speed
Operating microservices
When people move on

SLIDE 109

Every system must be owned

SLIDE 110

If you won’t invest enough to keep it running properly, shut it down

SLIDE 111

Keeping documentation up to date is a challenge

SLIDE 112

We started with a searchable runbook library

SLIDE 113

SLIDE 114

System codes are very helpful

SLIDE 115

We needed to represent this stuff as a graph

SLIDE 116

SLIDE 117

SLIDE 118

Helps if you can give people something in return

SLIDE 119

SLIDE 120

SLIDE 121

Practice

SLIDE 122

“If it hurts, do it more frequently, and bring the pain forward.”

SLIDE 123

Failovers, database restores

SLIDE 124

Chaos engineering

https://principlesofchaos.org/

SLIDE 125

Understand your steady state
Look at what you can change - minimise the blast radius
Work out what you expect to see happen
Run the experiment and see if you were right
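
The four steps can be sketched as a tiny harness. Everything here is hypothetical: `run_experiment`, its callbacks and the single numeric steady-state metric are simplifications for illustration, not a real chaos engineering tool.

```python
def run_experiment(measure, inject, rollback, tolerance):
    """Follow the four steps: steady state, small change, hypothesis, compare.

    `measure` returns one number describing steady state (for example a
    success ratio); `inject` and `rollback` apply and undo the change.
    All four parameters are hypothetical hooks supplied by the caller.
    """
    baseline = measure()      # understand your steady state
    inject()                  # change one small thing (minimal blast radius)
    try:
        observed = measure()  # run the experiment
    finally:
        rollback()            # always undo the change, even on error
    # Hypothesis: the metric stays within `tolerance` of the baseline.
    held = abs(observed - baseline) <= tolerance
    return {"baseline": baseline, "observed": observed, "hypothesis_held": held}
```

If `hypothesis_held` comes back false, the experiment has found a weakness while the blast radius was still small.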

SLIDE 126

Wrapping up…

SLIDE 127

Building and operating microservices is hard work

SLIDE 128

You have to maintain knowledge of services that are live

SLIDE 129

Plan now for the future of legacy microservices

SLIDE 130

Remember: it’s all about the business value of moving fast

SLIDE 131

@sarahjwells

Mature microservices and how to operate them
