SLIDE 1

Avoiding alerts overload from microservices

Sarah Wells Principal Engineer, Financial Times @sarahjwells

SLIDE 2

SLIDE 3

SLIDE 4

SLIDE 5

Knowing when there’s a problem isn’t enough

SLIDE 6

You only want an alert when you need to take action

SLIDE 7

Hello

SLIDE 8

SLIDE 9

SLIDE 10

1

SLIDE 11

1 2

SLIDE 12

1 2 3

SLIDE 13

1 2 3 4

SLIDE 14

Monitoring this system…

SLIDE 15

Microservices make it worse

SLIDE 16

“microservices (n,pl): an efficient device for transforming business problems into distributed transaction problems”

@drsnooks

SLIDE 17

The services *themselves* are simple…

SLIDE 18

There’s a lot of complexity around them

SLIDE 19

Why do they make monitoring harder?

SLIDE 20

You have a lot more services

SLIDE 21

99 functional microservices, 350 running instances

SLIDE 22

52 non-functional services, 218 running instances

SLIDE 23

That’s 568 separate service instances

SLIDE 24

If we checked each service every minute…

SLIDE 25

817,920 checks per day

SLIDE 26

What about system checks?

SLIDE 27

16,358,400 checks per day

SLIDE 28

“One-in-a-million” issues would hit us 16 times every day

SLIDE 29

Running containers on shared VMs reduces this to 92,160 system checks per day

SLIDE 30

For a total of 910,080 checks per day
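
(Reconstructing the arithmetic: 568 running instances × 1,440 minutes per day = 817,920 service checks per day. The 16,358,400 figure implies roughly 20 system-level checks per instance each minute, since 817,920 × 20 = 16,358,400. Consolidating those system checks onto shared VMs gives 64 × 1,440 = 92,160 per day, and 817,920 + 92,160 = 910,080 in total. The 20-check and 64-VM counts are inferred from the totals shown, not stated on the slides.)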

SLIDE 31

It’s a distributed system

SLIDE 32

Services are not independent

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

SLIDE 37

http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

SLIDE 38

You have to change how you think about monitoring

SLIDE 39

How can you make it better?

SLIDE 40

1. Build a system you can support
SLIDE 41

The basic tools you need

SLIDE 42

Log aggregation

SLIDE 43

SLIDE 44

Logs go missing or get delayed more now

SLIDE 45

Which means log-based alerts may miss stuff

SLIDE 46

Monitoring

SLIDE 47

SLIDE 48

Limitations of our Nagios integration…

SLIDE 49

No ‘service-level’ view

SLIDE 50

Default checks included things we couldn’t fix

SLIDE 51

A new approach for our container stack

SLIDE 52

We care about each service

SLIDE 53

SLIDE 54

We care about each VM

SLIDE 55

SLIDE 56

We care about unhealthy instances

SLIDE 57

Monitoring needs aggregating somehow

SLIDE 58

SAWS

SLIDE 59

Built by Silvano Dossan. See our Engine Room blog: http://bit.ly/1GATHLy

SLIDE 60

"I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin"

SLIDE 61

"Our screens have a viewing angle of about 10 degrees"

SLIDE 62

"It never seems to show the page I want"

SLIDE 63

Code at: https://github.com/muce/SAWS

SLIDE 64

Dashing

SLIDE 65

SLIDE 66

SLIDE 67

Graphing of metrics

SLIDE 68

SLIDE 69

SLIDE 70

SLIDE 71

SLIDE 72

https://www.flickr.com/photos/davidmasters/2564786205/

SLIDE 73

The things that make those tools WORK

SLIDE 74

Effective log aggregation needs a way to find all related logs

SLIDE 75

Transaction IDs tie all microservices together
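
As an illustration of the pattern (a minimal sketch in Java, not the FT's actual code), a servlet filter can pick up the caller's transaction ID, or mint one at the edge, and put it into SLF4J's MDC so that every log line the request produces carries the ID. The X-Request-Id header name is an assumption here; the tid_ prefix matches the transaction IDs visible in the alert examples later in this talk.

    import java.io.IOException;
    import java.util.UUID;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletRequest;
    import org.slf4j.MDC;

    public class TransactionIdFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String tid = ((HttpServletRequest) req).getHeader("X-Request-Id");
            if (tid == null || tid.isEmpty()) {
                tid = "tid_" + UUID.randomUUID(); // first service in the chain mints the ID
            }
            MDC.put("transaction_id", tid); // the log pattern includes %X{transaction_id}
            try {
                chain.doFilter(req, res); // downstream HTTP calls must forward the header too
            } finally {
                MDC.remove("transaction_id"); // don't leak IDs across pooled threads
            }
        }

        public void init(FilterConfig config) {}

        public void destroy() {}
    }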

SLIDE 76

Make it easy for any language you use

SLIDE 77

SLIDE 78

Services need to report on their own health

SLIDE 79

The FT healthcheck standard

GET http://{service}/__health

SLIDE 80

The FT healthcheck standard

GET http://{service}/__health returns 200 if the service can run the healthcheck

SLIDE 81

The FT healthcheck standard

GET http://{service}/__health returns 200 if the service can run the healthcheck; each check will return "ok": true or "ok": false
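
For illustration, a minimal sketch of such an endpoint using the JDK's built-in HTTP server; the service name, check name, and businessImpact field are illustrative (the real FT standard defines more fields than shown here).

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    public class HealthcheckServer {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/__health", exchange -> {
                boolean dbOk = pingDatabase(); // hypothetical dependency probe
                String body = "{ \"name\": \"content-ingester\", \"checks\": [ "
                        + "{ \"name\": \"database connectivity\", \"ok\": " + dbOk + ", "
                        + "\"businessImpact\": \"New content cannot be published\" } ] }";
                byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "application/json");
                exchange.sendResponseHeaders(200, bytes.length); // 200 = the checks could run
                try (OutputStream os = exchange.getResponseBody()) {
                    os.write(bytes);
                }
            });
            server.start();
        }

        private static boolean pingDatabase() {
            return true; // placeholder for a real connectivity check
        }
    }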

SLIDE 82

SLIDE 83

SLIDE 84

Knowing about problems before your clients do

SLIDE 85

Synthetic requests tell you about problems early

https://www.flickr.com/photos/jted/5448635109
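
A hedged sketch of the idea: request a known piece of content on a schedule and treat a non-200 status or a slow response as an early warning. The URL, latency threshold, and alerting hook below are all illustrative.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    public class SyntheticCheck {
        public static void main(String[] args) {
            HttpClient client = HttpClient.newHttpClient();
            Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
                long start = System.nanoTime();
                try {
                    HttpRequest request = HttpRequest.newBuilder(
                                URI.create("https://api.example.com/content/known-test-uuid"))
                            .timeout(Duration.ofSeconds(5))
                            .GET()
                            .build();
                    HttpResponse<Void> response =
                            client.send(request, HttpResponse.BodyHandlers.discarding());
                    long millis = (System.nanoTime() - start) / 1_000_000;
                    if (response.statusCode() != 200 || millis > 1_000) {
                        raiseAlert("synthetic check: status " + response.statusCode()
                                + " in " + millis + "ms");
                    }
                } catch (Exception e) {
                    raiseAlert("synthetic check failed: " + e.getMessage());
                }
            }, 0, 1, TimeUnit.MINUTES);
        }

        private static void raiseAlert(String message) {
            System.err.println(message); // stand-in for a real alerting channel
        }
    }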

SLIDE 86

2. Concentrate on the stuff that matters
SLIDE 87

It’s the business functionality you should care about

SLIDE 88

SLIDE 89

We care about whether content got published successfully

SLIDE 90

SLIDE 91

When people call our APIs, we care about speed

SLIDE 92

… we also care about errors

SLIDE 93

SLIDE 94

But it's the end-to-end that matters

https://www.flickr.com/photos/robef/16537786315/

SLIDE 95

If you just want information, create a dashboard or report

SLIDE 96

Checking the services involved in a business flow

SLIDE 97

/__health?categories=lists-publish
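
(Each healthcheck declares the categories it belongs to, so a monitor watching the lists-publish business flow can poll this one URL on each service involved and evaluate only the checks relevant to that flow.)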

SLIDE 98

SLIDE 99

SLIDE 100

3. Cultivate your alerts
SLIDE 101

Make each alert great

http://www.thestickerfactory.co.uk/

SLIDE 102

Splunk Alert: PROD - MethodeAPIResponseTime5MAlert Business Impact The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out.

...

SLIDE 103

SLIDE 104

Technical Impact The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.

SLIDE 105

Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below. Please see the run book for more information.

_time                      transaction_id   uuid
Mon Oct 12 07:43:54 2015   tid_pbueyqnsqe   a56a2698-6e90-11e5-8608-a0853fb4e1fe

SLIDE 106

SLIDE 107

SLIDE 108

SLIDE 109

Make sure you can't miss an alert

SLIDE 110

‘Ops Cops’ keep an eye on our systems

SLIDE 111

Use the right communication channel

SLIDE 112

It’s not email

SLIDE 113

Slack integration
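
For illustration, a minimal sketch of pushing an alert into Slack through an incoming webhook (Slack's standard integration point); the webhook URL is a placeholder that you would configure per channel.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class SlackAlert {
        // Placeholder webhook URL; Slack issues the real one when you add the integration
        private static final String WEBHOOK_URL =
                "https://hooks.slack.com/services/T0000/B0000/XXXXXXXX";

        public static void send(String message) throws Exception {
            String payload = "{\"text\": \"" + message.replace("\"", "\\\"") + "\"}";
            HttpRequest request = HttpRequest.newBuilder(URI.create(WEBHOOK_URL))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(payload))
                    .build();
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.discarding());
        }
    }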

SLIDE 114

SLIDE 115

Support isn’t just getting the system fixed

SLIDE 116

SLIDE 117

‘You build it, you run it’?

SLIDE 118

Review the alerts you get

SLIDE 119

If it isn't helpful, make sure you don't get sent it again

SLIDE 120

See if you can improve it

www.workcompass.com/
SLIDE 121

When you didn't get an alert

SLIDE 122

What would have told you about this?

SLIDE 123

SLIDE 124

Setting up an alert is part of fixing the problem (code, test alerts)

SLIDE 125

System boundaries are more difficult

Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

SLIDE 126

Make sure you would know if an alert stopped working

SLIDE 127

Add a unit test

public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() { … }
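
A hedged sketch of what the body of such a test might do: pin the exact phrase the Splunk alert searches for, so a harmless-looking rewording of the log message can't silently kill the alert. PublishFailureLogger and the trigger phrase are illustrative names, not the FT's actual code.

    import static org.junit.Assert.assertTrue;

    import org.junit.Test;

    public class PublishFailureAlertTest {

        // The phrase the Splunk saved search looks for (illustrative, not the FT's real one)
        private static final String SPLUNK_TRIGGER_PHRASE = "publish failure";

        @Test
        public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
            String logLine = PublishFailureLogger.failureMessage(
                    "tid_pbueyqnsqe", "a56a2698-6e90-11e5-8608-a0853fb4e1fe");
            assertTrue(logLine.contains(SPLUNK_TRIGGER_PHRASE));
        }

        // Hypothetical stand-in for the production class that writes the log line
        static class PublishFailureLogger {
            static String failureMessage(String transactionId, String uuid) {
                return "publish failure: transaction_id=" + transactionId + " uuid=" + uuid;
            }
        }
    }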

SLIDE 128

Deliberately break things

SLIDE 129

Chaos snail
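
The idea, sketched under assumptions (the FT's actual tool isn't shown here): on a slow schedule, stop one running container at random, then check that the alert you expected actually fired. The use of the docker CLI is illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class ChaosSnail {
        public static void main(String[] args) throws Exception {
            // List the IDs of all running containers
            Process ps = new ProcessBuilder("docker", "ps", "-q").start();
            List<String> ids = new ArrayList<>();
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(ps.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    ids.add(line.trim());
                }
            }
            if (ids.isEmpty()) {
                return; // nothing to break today
            }
            // Stop one victim at random, then go and verify the expected alert arrived
            String victim = ids.get(new Random().nextInt(ids.size()));
            new ProcessBuilder("docker", "stop", victim).inheritIO().start().waitFor();
            System.out.println("Stopped " + victim + "; did the right alert fire?");
        }
    }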

SLIDE 130

It’s going to change: deal with it

SLIDE 131

Out of date information can be worse than none

SLIDE 132

Automate updates where you can

SLIDE 133

Find ways to share what’s changing

SLIDE 134

In summary: to avoid alerts overload…

SLIDE 135

1. Build a system you can support

SLIDE 136

2. Concentrate on the stuff that matters

SLIDE 137

3. Cultivate your alerts

SLIDE 138

A microservice architecture lets you move fast…

SLIDE 139

But there’s an associated operational cost

SLIDE 140

Make sure it’s a cost you’re willing to pay

SLIDE 141

Thank you