Service Ownership: Learn Faster (Holly Allen, Service Engineering)

SLIDE 1

Service Ownership

Learn Faster

Holly Allen Service Engineering @hollyjallen
SLIDE 2

Holly Allen

Software development and leadership for 18 years
SLIDE 3 @hollyjallen,#QConSF Nov 2018
SLIDE 4
SLIDE 5
SLIDE 6

Software! 😎

SLIDE 7
SLIDE 8

S L O W 😪

SLIDE 9
SLIDE 10

Measure Design Learn

SLIDE 11

Toyota Production System

SLIDE 12
SLIDE 13
SLIDE 14
SLIDE 15
SLIDE 16
SLIDE 17

Kaizen Continuous Improvement

SLIDE 18

Measure Design Learn

SLIDE 19
SLIDE 20
SLIDE 21
SLIDE 22

Executive dedication to learning

SLIDE 23

High Trust Teams

SLIDE 24

Measure Design Learn

SLIDE 25
SLIDE 26

Slack launched February 2014

🚁
SLIDE 27

5 Years

Grew to 13+ million weekly active users, with active sessions of 10+ hours a day

SLIDE 28

5 Years

From 10 to 15,000 servers
In 25 cloud data centers worldwide

SLIDE 29

5 Years

From 8 to 1,200 people
In 9 offices worldwide

SLIDE 30

Measure Design Learn

SLIDE 31
SLIDE 32

✅ Continuous Deployment
✅ Experiment Frameworks
✅ User Research

SLIDE 33

Something didn't scale...

SLIDE 34

Centralized Operations

😮
SLIDE 35

Who should be responsible for the management, monitoring and operation of a production application?

SLIDE 36

Centralized Operations Division of Labor

SLIDE 37

Devs: Features, Scale, Architecture

Ops: Cloud Infra, Deployment, Monitoring

SLIDE 38

Ops is getting the pages

SLIDE 39

Product Development grew faster than Operations. A lot faster.

SLIDE 40

20 Product Developers

1 Ops Engineer

SLIDE 41

How can operations reliably reach the developers when there's a problem?

SLIDE 42

"Call Maude, she knows how this works"

SLIDE 43

Devs: I've never been on-call before, this is scary!

Ops: Now I know I can find a developer when I need to.

SLIDE 44

Ops is getting the first pages
Ultra-senior devs on-call

SLIDE 45

Measure Design Learn

SLIDE 46

How can operations reliably reach the developers when there's a problem?

SLIDE 47

Most devs go on-call Fall 2017

📠
SLIDE 48

Kaizen Continuous Improvement

SLIDE 49

"Wait, I'm on-call now?"

SLIDE 50

Devs: I'm glad I'm only on call a few times a year

Ops: I'll be able to reach a search engineer if I need to.

SLIDE 51

Learn by Doing

SLIDE 52

On-call 3 times a year 🤕

SLIDE 53

Ops is getting the first pages
Ultra-senior devs on-call
Seven One dev rotations

SLIDE 54

Continuous Deployment
100+ prod deploys a day

SLIDE 55

What Changed?

SLIDE 56
SLIDE 57
SLIDE 58

Page the dev

SLIDE 59

Devs: I don't understand this part of the code

Ops: These are the machine alerts I'm seeing

SLIDE 60

Human Routers

SLIDE 61

"Call Andy, he knows how this works"

SLIDE 62

Postmortems weren't a great place for learning

SLIDE 63

Can we catch problems earlier?

SLIDE 64
SLIDE 65
SLIDE 66
SLIDE 67

Investing in tech to make detection and remediation faster

SLIDE 68

Reorg! Fall 2017

Operations is out
Service Engineering is in

SLIDE 69

How can Slack ensure that developers know when there's a problem?

SLIDE 70

Centralized Operations Service Ownership

SLIDE 71

Measure Design Learn

SLIDE 72

"We are the toolsmiths and specialists. We empower Service Ownership"

SLIDE 73

Devs: Features, Reliability, Performance, Postmortems

Service: Cloud Platform, Observability tools, Service Discovery, Define best practice

SLIDE 74

I joined Slack in February 2018

👌
SLIDE 75

How to empower development teams to improve service reliability?

SLIDE 76

Define service health and operational maturity

  • At least one alerting health metric, like latency or throughput

SLIDE 77

Send metrics to Prometheus
Observability team is here to help! 🔯
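
A minimal sketch of what one alerting health metric could look like when exposed for Prometheus to scrape. It computes a p99 latency and a throughput value and renders both as gauges in the Prometheus text exposition format, using only the standard library. The metric names, the `service` label, the 60 second window, and the sample latencies are all assumptions for illustration, not Slack's actual metrics.

```python
import math


def p99(samples):
    """Return the 99th-percentile value using the nearest-rank method."""
    ordered = sorted(samples)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def render_gauge(name, help_text, value, labels):
    """Render one gauge in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return "\n".join([
        f"# HELP {name} {help_text}",
        f"# TYPE {name} gauge",
        f"{name}{{{label_str}}} {value}",
    ])


# Hypothetical request latencies (seconds) seen during a 60 second window.
latencies = [0.12, 0.08, 0.31, 0.09, 0.45, 0.10, 0.11, 0.07, 0.95, 0.13]
labels = {"service": "search"}  # hypothetical service name

exposition = "\n".join([
    render_gauge("request_latency_p99_seconds", "p99 request latency",
                 p99(latencies), labels),
    render_gauge("requests_per_second", "Request throughput",
                 round(len(latencies) / 60, 3), labels),
])
print(exposition)
```

In practice a team would more likely use an official Prometheus client library than format the text by hand; the point here is only that a single latency or throughput series is enough to hang a first alert on.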

SLIDE 78

Define service health and operational maturity

  • Team should be on-call ready
  • At least 4, preferably 6 engineers participating to make it sustainable
  • 24/7 or during the weekday, depending on the service
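
To make the "at least 4, preferably 6" guideline concrete, here is a rough sketch of how rotation size translates into on-call load. It assumes a simple round-robin with one engineer on call per shift and one-week shifts; those assumptions, and the function names, are illustrative and not taken from the talk.

```python
import math


def shifts_per_engineer_per_year(engineers, shift_days=7):
    """Upper bound on shifts each engineer takes per year in a round-robin."""
    if engineers < 1:
        raise ValueError("a rotation needs at least one engineer")
    total_shifts = math.ceil(365 / shift_days)
    return math.ceil(total_shifts / engineers)


def rotation_status(engineers, minimum=4, preferred=6):
    """Classify a rotation against the 'at least 4, preferably 6' guideline."""
    if engineers < minimum:
        return "too small"
    if engineers < preferred:
        return "acceptable"
    return "preferred"


for size in (2, 4, 6):
    print(size, rotation_status(size), shifts_per_engineer_per_year(size))
```

Under these assumptions a two-person rotation puts each engineer on call roughly every other week, while six people bring that down to about nine weeks a year.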

SLIDE 79

Define service health and operational maturity

  • Runbooks for standard actions and troubleshooting
  • Central location in our code repository
  • Up to date and usable by any engineer

SLIDE 80

Define service health and operational maturity

  • Paging alerts should link to the runbook
  • Make responding to a page easy
  • Practice incident response
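
One way to keep "paging alerts should link to the runbook" from drifting is a small check that could run in CI. The sketch below flags paging alerts that have no runbook link. The alert structure, the `severity` and `runbook_url` field names, and the example path are hypothetical; a real alerting system will have its own schema.

```python
def missing_runbooks(alerts):
    """Return names of paging alerts that do not link to a runbook."""
    return [
        alert["name"]
        for alert in alerts
        if alert.get("severity") == "page" and not alert.get("runbook_url")
    ]


# Hypothetical alert definitions; field names and paths are illustrative only.
alerts = [
    {"name": "SearchLatencyHigh", "severity": "page",
     "runbook_url": "runbooks/search/latency.md"},
    {"name": "QueueBacklog", "severity": "page"},
    {"name": "DiskFillingSlowly", "severity": "ticket"},
]

print(missing_runbooks(alerts))
```

Non-paging alerts are deliberately ignored here, since the slide's requirement applies to alerts that wake someone up.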
SLIDE 81

Incident Lunch ⛑

SLIDE 82

Site Reliability Engineers

  • Devops generalists
  • Emotional intelligence
  • Mentoring
  • Ambassadors
  • Operational maturity
SLIDE 83

SRE embedded in dev teams

SLIDE 84

Devs Ops SRE

SLIDE 85

Devs: Um, where are the SREs?

SREs: I'm over here doing operational tasks

SLIDE 86

SRE Ops is still getting the first pages

SLIDE 87

How do we lower operational burden on the SREs?

SLIDE 88

Plan: Send paging alerts to the development teams

SLIDE 89

Devs: We need training

SREs: We're going to plan this out perfectly

SLIDE 90
SLIDE 91

Host-level alerts
Hundreds of them

SLIDE 92

Test with the users

SLIDE 93
SLIDE 94

Everything was fine!

💫
SLIDE 95

Empowered Continuous Improvement

SLIDE 96

Devs Ops SRE

SLIDE 97

How do we test our understanding of how Slack will fail?

SLIDE 98

"Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."

SLIDE 99

Measure Design Learn

SLIDE 100

Success Metrics

  • Increased engineer confidence
  • Validate reliability improvements
  • Learn something new
  • Practice incident response
SLIDE 101

Centralized Operations Service Ownership

SLIDE 102

How do we ensure the teams are being alerted, instead of skillsets?

SLIDE 103

How do we make postmortems a place for learning again?

SLIDE 104

How do we make sure a capable incident commander is available for all incidents?

SLIDE 105

Measure Design Learn

SLIDE 106
SLIDE 107
SLIDE 108

Copy the questions
Not the answers

SLIDE 109

Change is possible
You don't have to be ready

SLIDE 110

Empowered Continuous Improvement

SLIDE 111

Measure Design Learn

SLIDE 112

Learn Faster

SLIDE 113

Holly Allen Service Engineering @hollyjallen

Thank You