Mature microservices and how to operate them - Sarah Wells

SLIDE 1

Mature microservices and how to operate them

Sarah Wells, Technical Director for Operations & Reliability, The Financial Times @sarahjwells

SLIDE 2
SLIDE 3

https://www.ft.com/stream/c47f4dfc-6879-4e95-accf-ca8cbe6a1f69

SLIDE 4

https://www.ft.com/companies

SLIDE 5

Problem: we’d set up a redirect to a page which didn’t exist

SLIDE 6

We weren’t sure how to fix the data via the url management tool

SLIDE 7
SLIDE 8
SLIDE 9

We got it fixed

SLIDE 10

Polyglot architectures are great - until you need to work out how *this* database is backed up

SLIDE 11
SLIDE 12
SLIDE 13
SLIDE 14

Microservices are more complicated to operate and maintain

SLIDE 15

Why bother?

SLIDE 16
SLIDE 17
SLIDE 18

“Experiment” for most organizations really means “try”

Linda Rising, Experiments: the Good, the Bad and the Beautiful

SLIDE 19

Overlap tests by componentising the barrier

SLIDE 20

Releasing changes frequently doesn’t just ‘happen’

SLIDE 21

Done right, microservices enable this

SLIDE 22

The team that builds the system *has* to operate it too

SLIDE 23

What happens when teams move on to new projects?
SLIDE 24

Your next legacy system will be microservices not a monolith

SLIDE 25

  • Optimising for speed
  • Operating microservices
  • When people move on

SLIDE 26

Optimising for speed

SLIDE 27
SLIDE 28

Measure                  High performers
Delivery lead time

SLIDE 29

Measure                  High performers
Delivery lead time       Less than one hour

“How long would it take you to release a single line of code to production?”

SLIDE 30

Measure                  High performers
Delivery lead time       Less than one hour
Deployment frequency

SLIDE 31

Measure                  High performers
Delivery lead time       Less than one hour
Deployment frequency     On demand

SLIDE 32

Measure                  High performers
Delivery lead time       Less than one hour
Deployment frequency     On demand
Time to restore service

SLIDE 33

Measure                  High performers
Delivery lead time       Less than one hour
Deployment frequency     On demand
Time to restore service  Less than one hour

SLIDE 34

Measure                  High performers
Delivery lead time       Less than one hour
Deployment frequency     On demand
Time to restore service  Less than one hour
Change fail rate

SLIDE 35

Measure                  High performers
Delivery lead time       Less than one hour
Deployment frequency     On demand
Time to restore service  Less than one hour
Change fail rate         0 - 15%
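
The four measures listed on this slide can be derived from a simple deployment log. The following is a minimal sketch, not something shown in the talk: the `Deployment` record, its field names and the `delivery_metrics` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record of one production release.
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    """Summarise the four measures over a reporting period."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "median_lead_time": median(lead_times),
        "deploys_per_day": len(deployments) / period_days,
        "median_time_to_restore": median(restores) if restores else None,
        "change_fail_rate": len(failures) / len(deployments),
    }
```

Comparing `median_lead_time` and `median_time_to_restore` against `timedelta(hours=1)`, and `change_fail_rate` against `0.15`, gives a rough check against the "high performers" column.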

SLIDE 36

High performing organisations release changes frequently

SLIDE 37

Continuous delivery is the foundation

SLIDE 38

“If it hurts, do it more frequently, and bring the pain forward.”

SLIDE 39

Our old build and deployment process was very manual…

SLIDE 40
SLIDE 41

You can’t experiment when you do 12 releases a year

SLIDE 42

1. An automated build and release pipeline

SLIDE 43

2. Automated testing, integrated into the pipeline

SLIDE 44

3. Continuous integration
SLIDE 45

If you aren’t releasing multiple times a day, consider what is stopping you

SLIDE 46

You’ll probably have to change the way you architect things

SLIDE 47

Zero downtime deployments:

  • sequential deployments
  • schemaless databases
SLIDE 48

In-hours releases mean the people who can help are there

SLIDE 49

You need to be able to test and deploy your changes independently

SLIDE 50

You need systems - and teams - to be loosely coupled

SLIDE 51

Done right, microservices are loosely coupled

SLIDE 52

Processes also have to change

SLIDE 53

Often there is ‘process theatre’ around things and this can safely be removed

SLIDE 54

Change approval boards don’t reduce the chance of failure

SLIDE 55

Filling out a form for each change takes too long

SLIDE 56

How fast are we moving?

SLIDE 57
SLIDE 58
SLIDE 59

Releasing 250 times as often

SLIDE 60

Changes are small, easy to understand, independent and reversible

SLIDE 61

<1% failure rate ~16% failure rate

SLIDE 62

  • Optimising for speed
  • Operating microservices

SLIDE 63
SLIDE 64

There are patterns and approaches that help

SLIDE 65

DevOps is essential for success

SLIDE 66

You can’t hand things off to another team when they change multiple times a day

SLIDE 67

High performing teams get to make their own decisions about tools and technology

SLIDE 68

Delegating tool choice to teams makes it hard for central teams to support everything

SLIDE 69

Make it someone else’s problem

SLIDE 70

https://medium.com/wardleymaps

SLIDE 71

Buy rather than build, unless it’s critical to your business

SLIDE 72

Work out what level of risk you’re comfortable with

SLIDE 73

“We’re not a hospital or a power station”

SLIDE 74

We value releasing often so we can experiment frequently

SLIDE 75

Accept that you will generally be in a state of ‘grey failure’

SLIDE 76
SLIDE 77

Retry on failure:

  • backoff before retrying
  • give up if it’s taking too long
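
The two retry rules above could look something like this in code. This is a minimal sketch under stated assumptions: `retry_with_backoff` and its parameters are hypothetical names, and exponential backoff with jitter is one common choice of backoff, not necessarily the one used at the FT.

```python
import random
import time


def retry_with_backoff(operation, max_seconds=10.0, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Call operation() until it succeeds or the time budget is spent."""
    deadline = clock() + max_seconds
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            # Back off before retrying: exponential delay with jitter, capped.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            if clock() + delay > deadline:
                raise  # give up: another attempt would exceed the time budget
            sleep(delay)
            attempt += 1
```

The `sleep` and `clock` parameters exist only so the function can be exercised in tests without real waiting. When the budget runs out, the last exception is re-raised so the caller can decide how to degrade.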
SLIDE 78

Mitigate now, fix tomorrow

SLIDE 79

How do you know something’s wrong?

SLIDE 80

Concentrate on the business capabilities

SLIDE 81

Synthetic monitoring

SLIDE 82
SLIDE 83
SLIDE 84
SLIDE 85
SLIDE 86

No data fixtures required

SLIDE 87

Also helps us know things are broken even if no user is currently doing anything
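
A synthetic check of a business capability such as publishing might be sketched as below. The `publish` and `fetch` callables are hypothetical stand-ins for whatever the real system exposes; the slides do not show the FT's actual implementation.

```python
import time
import uuid


def synthetic_publish_check(publish, fetch, timeout_s=5.0, poll_s=0.5,
                            sleep=time.sleep, clock=time.monotonic):
    """Publish a uniquely marked item, then poll until it can be read back."""
    marker = f"synthetic-{uuid.uuid4()}"  # unique per run, so no fixtures are needed
    publish(marker)
    deadline = clock() + timeout_s
    while clock() < deadline:
        if fetch(marker):
            return True   # the capability works end to end
        sleep(poll_s)
    return False          # raise an alert: nothing came out the other side
```

Run on a schedule, a check like this exercises the whole path even when no real user is publishing.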

SLIDE 88

Make sure you know whether *real* things are working in production

SLIDE 89

Our editorial team is inventive

SLIDE 90

What does it mean for a publish to be ‘successful’?

SLIDE 91
SLIDE 92
SLIDE 93
SLIDE 94
SLIDE 95

Build observability into your system

SLIDE 96

Observability: can you infer what’s going on in the system by looking at its external outputs?

SLIDE 97

Log aggregation

SLIDE 98
SLIDE 99

Metrics

SLIDE 100

Keep it simple:

  • request rate
  • latency
  • error rate
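
As an illustration of how little is needed for these three metrics, here is a minimal sketch that computes them for one time window. `RequestRecord` and `summarise` are hypothetical names, and treating status codes of 500 and above as errors is an assumption.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float  # how long the request took
    status: int         # HTTP status code returned


def summarise(records, window_seconds):
    """Request rate, p95 latency and error rate for one time window."""
    if not records:
        return {"request_rate": 0.0, "p95_latency_ms": None, "error_rate": 0.0}
    durations = sorted(r.duration_ms for r in records)
    # Nearest-rank 95th percentile, in integer arithmetic to avoid float rounding.
    p95_index = (95 * len(durations) + 99) // 100 - 1
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "request_rate": len(records) / window_seconds,  # requests per second
        "p95_latency_ms": durations[p95_index],
        "error_rate": errors / len(records),
    }
```

In practice a metrics library or the service mesh would usually collect these; the point is only that the same three numbers apply to every service.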
SLIDE 101

You’ll always be migrating *something*

SLIDE 102

Doing anything 150 times is painful

SLIDE 103

Deployment pipelines need to be templated
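
One way to read "templated" is a single pipeline definition rendered once per service, so a change to the template reaches every service at once. The sketch below assumes a made-up pipeline format and a hypothetical `render_pipeline` helper; it is not the syntax of any real CI tool.

```python
from string import Template

# Hypothetical pipeline definition shared by every service.
PIPELINE_TEMPLATE = Template("""\
service: $name
steps:
  - build: $build_command
  - test: $test_command
  - deploy: $name to $cluster
""")


def render_pipeline(name, build_command="make build", test_command="make test",
                    cluster="prod"):
    """Render the shared template for one service."""
    return PIPELINE_TEMPLATE.substitute(
        name=name,
        build_command=build_command,
        test_command=test_command,
        cluster=cluster,
    )


if __name__ == "__main__":
    # The service names here are invented for the example.
    for service in ["content-api", "image-resizer"]:
        print(render_pipeline(service))
```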

SLIDE 104

Use a service mesh

SLIDE 105

You’ll have services that haven’t been released for years

SLIDE 106

But you don’t want to find out your service can’t be released when you most need to do it

SLIDE 107

Build everything overnight?

SLIDE 108

  • Optimising for speed
  • Operating microservices
  • When people move on

SLIDE 109

Every system must be owned

SLIDE 110

If you won’t invest enough to keep it running properly, shut it down

SLIDE 111

Keeping documentation up to date is a challenge

SLIDE 112

We started with a searchable runbook library

SLIDE 113
SLIDE 114

System codes are very helpful

SLIDE 115

We needed to represent this stuff as a graph
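
To show why a graph fits this kind of information, here is a minimal in-memory sketch keyed by system code. The `SystemGraph` class, the `OWNS` and `DEPENDS_ON` relation names and the example codes are assumptions for illustration, not the FT's actual model or tooling.

```python
from collections import defaultdict


class SystemGraph:
    """Teams, systems and their relationships, stored as labelled edges."""

    def __init__(self):
        self._incoming = defaultdict(set)  # (target, relation) -> set of sources

    def add(self, source, relation, target):
        self._incoming[(target, relation)].add(source)

    def dependents_of(self, system_code):
        """Every system that depends on system_code, directly or transitively."""
        seen, stack = set(), [system_code]
        while stack:
            current = stack.pop()
            for upstream in self._incoming.get((current, "DEPENDS_ON"), ()):
                if upstream not in seen:
                    seen.add(upstream)
                    stack.append(upstream)
        return seen

    def owners_affected_by(self, system_code):
        """Teams owning system_code or anything that depends on it."""
        systems = self.dependents_of(system_code) | {system_code}
        return {team for s in systems
                for team in self._incoming.get((s, "OWNS"), ())}
```

A question like "who do we need to tell if this database goes down?" then becomes a short traversal instead of a search through documents.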

SLIDE 116
SLIDE 117
SLIDE 118

Helps if you can give people something in return

SLIDE 119
SLIDE 120
SLIDE 121

Practice

SLIDE 122

“If it hurts, do it more frequently, and bring the pain forward.”

SLIDE 123

Failovers, database restores

SLIDE 124

Chaos engineering

https://principlesofchaos.org/

SLIDE 125

  • Understand your steady state
  • Look at what you can change - minimise the blast radius
  • Work out what you expect to see happen
  • Run the experiment and see if you were right
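
The four steps on this slide map onto a small experiment harness. This is a sketch only: `run_experiment` and its `measure`, `inject` and `rollback` callables are hypothetical, and a real experiment would measure over a period rather than take a single reading.

```python
def run_experiment(measure, inject, rollback, expected, tolerance):
    """Run one chaos experiment and report whether the hypothesis held.

    measure()  returns a number describing system health, e.g. success rate
    inject()   makes one small change (keep the blast radius minimal)
    rollback() undoes that change
    expected   what you predict measure() will return during the experiment
    """
    steady_state = measure()   # understand your steady state
    inject()                   # change one thing
    try:
        observed = measure()   # run the experiment
    finally:
        rollback()             # always restore the system, even on error
    return {
        "steady_state": steady_state,
        "observed": observed,
        "hypothesis_held": abs(observed - expected) <= tolerance,
    }
```

If `hypothesis_held` is false, the gap between `expected` and `observed` is the thing worth investigating.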

SLIDE 126

Wrapping up…

SLIDE 127

Building and operating microservices is hard work

SLIDE 128

You have to maintain knowledge of services that are live

SLIDE 129

Plan now for the future of legacy microservices

SLIDE 130

Remember: it’s all about the business value of moving fast

SLIDE 131

Thank you!