Service Ownership
Learn Faster
Holly Allen Service Engineering @hollyjallen
@hollyjallen, #QConSF Nov 2018
Holly Allen
Software development and leadership for 18 years
Software! 😎
S L O W 😪
Measure Design Learn
Toyota Production System
Kaizen Continuous Improvement
Measure Design Learn
Executive dedication to learning
High Trust Teams
Measure Design Learn
Slack launched February 2014
5 Years
Grew to 13+ million weekly active users, with active sessions of 10+ hours a day
5 Years
From 10 to 15,000 servers in 25 cloud data centers worldwide
5 Years
From 8 to 1,200 people in 9 offices worldwide
Measure Design Learn
✅ Continuous Deployment
✅ Experiment Frameworks
✅ User Research
Something didn't scale...
Centralized Operations
Who should be responsible for the management, monitoring and operation of a production application?
Centralized Operations
Division of Labor
Devs
Features Scale Architecture
Ops
Cloud Infra Deployment Monitoring
Ops is getting the pages
Product Development grew faster than Operations. A lot faster.
20 Product Developers
1 Ops Engineer
How can operations reliably reach the developers when there's a problem?
"Call Maude, she knows how this works"
Devs
I've never been
this is scary!
Ops
Now I know I can find a developer when I need to.
Ops is getting the first pages
Ultra-senior devs on-call
Measure Design Learn
How can operations reliably reach the developers when there's a problem?
Most devs go on-call, Fall 2017
Kaizen Continuous Improvement
"Wait, I'm on-call now?"
Devs
I'm glad I'm only
times a year
Ops
I'll be able to reach a search engineer if I need to.
Learn by Doing
On-call 3 times a year 🤕
Ops is getting the first pages
Ultra-senior devs on-call
Seven One dev rotations
Continuous Deployment: 100+ prod deploys a day
What Changed?
Page the dev
Devs
I don't understand this part of the code
Ops
These are the machine alerts I'm seeing
Human Routers
"Call Andy, he knows how this works"
Postmortems weren't a great place for learning
Can we catch problems earlier?
Investing in tech to make detection and remediation faster
Reorg! Fall 2017
Operations is out, Service Engineering is in
How can Slack ensure that developers know when there's a problem?
Centralized Operations
Service Ownership
Measure Design Learn
"We are the toolsmith and
Ownership"
Devs
Features Reliability Performance Postmortems
Service
Cloud Platform Observability tools Service Discovery Define best practice
I joined Slack in February 2018
How to empower development teams to improve service reliability?
Define service health and maturity
metric, like latency or throughput
Send metrics to Prometheus
Observability team is here to help! 🔯
Define service health and maturity
ready
engineers participating to make it sustainable
depending on the service
Define service health and maturity
actions and troubleshooting
repository
any engineer
Define service health and maturity
the runbook
page easy
Incident Lunch ⛑
Site Reliability Engineers
SRE embedded in dev teams
Devs Ops SRE
Devs
Um, where are the SREs?
SREs
I'm over here doing
tasks
SRE is still getting the first pages
How do we lower operational burden on the SREs?
Plan: Send paging alerts to the development teams
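One way to picture this plan is a lookup from the alerting service to the team that owns it, with the SREs as the fallback. The sketch below is illustrative only and is not Slack's tooling: the service names, rotation names, and the `route_alert` helper are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical ownership map: service name -> owning team's on-call rotation.
SERVICE_OWNERS = {
    "search": "search-dev-oncall",
    "messaging": "messaging-dev-oncall",
}

# Hypothetical fallback rotation for alerts with no registered owner.
FALLBACK_ROTATION = "sre-oncall"


@dataclass(frozen=True)
class Alert:
    service: str   # service that raised the alert
    summary: str   # human-readable description


def route_alert(alert: Alert) -> str:
    """Return the rotation to page: the owning team if known, else the fallback."""
    return SERVICE_OWNERS.get(alert.service, FALLBACK_ROTATION)


if __name__ == "__main__":
    print(route_alert(Alert("search", "p99 latency above threshold")))
    print(route_alert(Alert("billing", "disk almost full")))
```

Keeping the fallback explicit means an alert for an unowned service still reaches a human, and each fallback page points at a gap in the ownership map.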
Devs
We need training
SREs
We're going to plan this out perfectly
Host-level alerts
Hundreds of them
Test with the users
Everything was fine!
Empowered Continuous Improvement
Devs Ops SRE
How do we test our understanding of how Slack will fail?
"Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."
Measure Design Learn
Success Metrics
confidence
improvements
Centralized Operations
Service Ownership
How do we ensure the teams are being alerted, instead of skillsets?
How do we make postmortems a place for learning again?
How do we make sure a capable incident commander is available for all incidents?
Measure Design Learn
Copy the questions, not the answers
Change is possible
You don't have to be ready
Empowered Continuous Improvement
Measure Design Learn
Learn Faster
Thank You