Alerting Husbandry - Julien Goodwin (jgoodwin@studio442.com.au)


slide-1
SLIDE 1

Alerting Husbandry

Julien Goodwin jgoodwin@studio442.com.au – @laptop006

slide-2
SLIDE 2

Bad Alerts

slide-3
SLIDE 3

Obsolete Alerts

  • $THING is down!
  • Because we turned it down a year ago
  • $THING has bug $FOO!
  • Really? The vendor fixed it three years ago and we upgraded.
slide-4
SLIDE 4

Unactionable Alerts

  • $THING is down!
  • But it’s managed by another team, just thought you’d like to be woken up.

slide-5
SLIDE 5

SLA Alerts

  • $SERVICE has failed SLA
  • So can I do anything about it?
  • Log for later reporting instead
slide-6
SLIDE 6

Bad Thresholds

  • $SERVER has a high load average of 4
  • It has 32 cores, that’s no load
  • $LUN is nearly full, only 100MB left
  • It’s a 10 TB LUN, so I have no time to respond
  • Or it’s a 200 MB LUN mounted as /boot and a new kernel was just installed
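The two examples above suggest scaling thresholds to the thing being measured rather than using fixed numbers. A minimal sketch, assuming hypothetical limits (1.5 load per core, 4 hours to respond) and a fill rate supplied by whatever monitoring system is in use:

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Normalise load average by core count so one limit fits any server size."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


def hours_until_full(free_bytes: float, fill_rate_bytes_per_hour: float) -> float:
    """Estimate time to full from a recent fill rate; infinity if not growing."""
    if fill_rate_bytes_per_hour <= 0:
        return float("inf")
    return free_bytes / fill_rate_bytes_per_hour


def should_page_load(load_average: float, cores: int,
                     per_core_limit: float = 1.5) -> bool:
    # per_core_limit is an illustrative assumption, not a recommended value.
    return load_per_core(load_average, cores) > per_core_limit


def should_page_disk(free_bytes: float, fill_rate_bytes_per_hour: float,
                     min_hours_to_respond: float = 4.0) -> bool:
    # Page only if the volume will fill before someone could plausibly react.
    return hours_until_full(free_bytes, fill_rate_bytes_per_hour) < min_hours_to_respond
```

With these assumptions, a load of 4 on 32 cores does not page, a static /boot with 100MB free does not page, and a fast-filling LUN with 100MB free does.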
slide-7
SLIDE 7

Hair trigger alerts

  • $THING didn’t respond in 50ms
  • Once
  • It responded in 51ms
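One common way to avoid a hair trigger is to require the condition to hold across several recent samples before alerting. A minimal sketch, assuming a hypothetical window of 5 samples of which 4 must breach the threshold:

```python
from collections import deque


class SustainedBreach:
    """Fire only when most of the recent samples exceed the threshold."""

    def __init__(self, threshold_ms: float, window: int = 5, min_breaches: int = 4):
        # window and min_breaches are illustrative assumptions.
        self.threshold_ms = threshold_ms
        self.min_breaches = min_breaches
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms: float) -> bool:
        """Record one sample and return True if the alert should fire."""
        self.samples.append(latency_ms > self.threshold_ms)
        return sum(self.samples) >= self.min_breaches
```

A single 51ms response against a 50ms threshold never fires here; four slow responses within the window do.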
slide-8
SLIDE 8

Non-Impacting Redundancy

  • WEB_SERVER_4 is down
  • But I have 8 servers, and only need 6 at full load
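The point above can be expressed by alerting on remaining capacity rather than on individual servers. A minimal sketch, where the "ticket" and "page" tiers are assumptions added for illustration:

```python
def capacity_alert(healthy: int, needed_at_full_load: int) -> str:
    """Decide severity from spare capacity, not from any single server's state."""
    spare = healthy - needed_at_full_load
    if spare < 0:
        return "page"    # cannot serve full load: user-visible impact is possible
    if spare == 0:
        return "ticket"  # still serving, but one more failure would hurt
    return "ok"          # redundancy is absorbing the failure
```

With 8 servers and 6 needed, losing WEB_SERVER_4 leaves 7 healthy and the result is "ok".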
slide-9
SLIDE 9

Spamming alerts

  • $THING is down!
  • For the 28345972398th time
  • Even if it’s important you’ve stopped caring
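Repeat notifications for the same problem can be suppressed with a per-alert repeat interval. A minimal sketch, assuming a hypothetical one-hour interval and timestamps passed in as seconds:

```python
class Deduplicator:
    """Suppress repeat notifications for the same alert within an interval."""

    def __init__(self, repeat_interval_s: float = 3600.0):
        # One hour is an illustrative assumption.
        self.repeat_interval_s = repeat_interval_s
        self._last_sent = {}

    def should_notify(self, alert_key: str, now_s: float) -> bool:
        """Return True if a notification for alert_key should go out now."""
        last = self._last_sent.get(alert_key)
        if last is not None and now_s - last < self.repeat_interval_s:
            return False
        self._last_sent[alert_key] = now_s
        return True
```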
slide-10
SLIDE 10

Nobody cares

  • $TEST_SERVER has no backups
  • I want it that way
  • Most of the earlier items end up in this bucket
slide-11
SLIDE 11

Related Practices

slide-12
SLIDE 12

E-mail alerts

  • It’s not high priority enough to page, so I’ll email about it
  • Within a few weeks the entire team will have a filter to mark read & delete
  • Having a separate archived alert list may work well as a log
slide-13
SLIDE 13

Undocumented Alerts

  • $THING is broken!
  • So what am I supposed to do?
  • Document actions to take in a “playbook”
  • All oncallers should be able to follow it
slide-14
SLIDE 14

Alert Acceptance

  • Have a review process for any new alerts or thresholds.
  • Require documentation, expected impact, test data, etc.
  • Only oncallers should accept alerts.
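The review requirements above could be enforced with a simple checklist. A minimal sketch; the `AlertProposal` fields are hypothetical names chosen to mirror the bullets, not part of any real tool:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertProposal:
    """A proposed new alert or threshold (hypothetical structure)."""
    name: str
    playbook: str = ""
    expected_impact: str = ""
    test_data: str = ""
    accepted_by_oncaller: bool = False


def missing_requirements(proposal: AlertProposal) -> List[str]:
    """List what is still needed before the alert may go live."""
    missing = []
    if not proposal.playbook.strip():
        missing.append("documentation")
    if not proposal.expected_impact.strip():
        missing.append("expected impact")
    if not proposal.test_data.strip():
        missing.append("test data")
    if not proposal.accepted_by_oncaller:
        missing.append("oncaller acceptance")
    return missing
```

An empty list means the proposal satisfies every item on the checklist.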
slide-15
SLIDE 15

Silencing

  • If your alert system pages people you need a silence mechanism
  • In practice this becomes a whole system
  • Oncallers get very grumpy when woken up for other people’s planned work
  • If relevant, this may include the need to schedule silences for things like carrier outages
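A scheduled silence can be modelled as a time window attached to a target. A minimal sketch, assuming hypothetical names (`Silence`, `schedule_silence`, `is_silenced`) and timezone-aware timestamps; a real silence system would also need matching rules, ownership and expiry handling:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable


@dataclass(frozen=True)
class Silence:
    """A window during which alerts for one target should not page."""
    target: str
    start: datetime
    end: datetime
    reason: str


def schedule_silence(target: str, start: datetime, hours: float, reason: str) -> Silence:
    """Create a silence for planned work, e.g. a carrier maintenance window."""
    return Silence(target, start, start + timedelta(hours=hours), reason)


def is_silenced(target: str, at: datetime, silences: Iterable[Silence]) -> bool:
    """True if any silence covers this target at the given time."""
    return any(s.target == target and s.start <= at < s.end for s in silences)
```

Recording a reason with each silence keeps it reviewable later, so planned work stays distinguishable from an alert someone simply muted.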

slide-16
SLIDE 16

Production by Fiat

  • $THING is now in production because I say so — $VP
  • Good luck
slide-17
SLIDE 17

A Plug

Contains great selections on alerting, postmortems, availability & more.