Service Ownership Learn Faster Holly Allen Service Engineering @hollyjallen
Holly Allen Software development and leadership for 18 years
@hollyjallen,#QConSF Nov 2018
Software!
S L O W
Measure Design Learn
Toyota Production System
Kaizen: Continuous Improvement
Measure Design Learn
Executive dedication to learning
High Trust Teams
Measure Design Learn
Slack launched February 2014
5 Years: Grew to 13+ million weekly active users, with active sessions of 10+ hours a day
5 Years: From 10 to 15,000 servers in 25 cloud data centers worldwide
5 Years: From 8 to 1,200 people in 9 offices worldwide
Measure Design Learn
• Continuous Deployment
• Experiment Frameworks
• User Research
Something didn't scale...
Centralized Operations
Who should be responsible for the management, monitoring and operation of a production application?
Centralized Operations: Division of Labor
Devs: Features, Scale, Architecture
Ops: Cloud Infra, Deployment, Monitoring
Ops is getting the pages
Product Development grew faster than Operations. A lot faster.
20 Product Developers : 1 Ops Engineer
How can operations reliably reach the developers when there's a problem?
"Call Maude, she knows how this works"
Devs: "I've never been on-call before, this is scary!"
Ops: "Now I know I can find a developer when I need to."
Ops is getting the first pages
Ultra-senior devs on-call
Measure Design Learn
How can operations reliably reach the developers when there's a problem?
Fall 2017: Most devs go on-call
Kaizen: Continuous Improvement
"Wait, I'm on-call now?"
Devs: "I'm glad I'm only on call a few times a year"
Ops: "I'll be able to reach a search engineer if I need to."
Learn by Doing
On-call 3 times a year
Ops is getting the first pages
Ultra-senior devs on-call
Seven dev rotations (previously one)
Continuous Deployment: 100+ prod deploys a day
What Changed?
Page the dev
Devs: "I don't understand this part of the code"
Ops: "These are the machine alerts I'm seeing"
Human Routers
"Call Andy, he knows how this works"
Postmortems weren't a great place for learning
Can we catch problems earlier?
Investing in tech to make detection and remediation faster
Reorg! Fall 2017: Operations is out, Service Engineering is in
How can Slack ensure that developers know when there's a problem?
Centralized Operations → Service Ownership
Measure Design Learn
"We are the toolsmith and specialists. We empower Service Ownership"
Devs: Features, Reliability, Performance, Postmortems
Service: Cloud Platform, Observability tools, Service Discovery, Define best practice
I joined Slack in February 2018
How to empower development teams to improve service reliability?
Define service health and operational maturity
• At least one alerting health metric, like latency or throughput
Send metrics to Prometheus. Observability team is here to help!
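As a rough illustration of "send metrics to Prometheus", the sketch below renders one latency gauge in the Prometheus text exposition format using only the Python standard library. The metric name, help text, and the `search` service label are hypothetical and do not come from the talk; a real service would more likely use an official Prometheus client library than hand-render the format.

```python
from typing import Dict, Iterable, Tuple


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str,
                 samples: Iterable[Tuple[Dict[str, str], float]]) -> str:
    """Render one gauge metric family as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between scrapes.
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical health metric for a hypothetical "search" service.
    print(render_gauge(
        "service_request_latency_seconds",
        "Request latency in seconds.",
        [({"service": "search"}, 0.25)],
    ), end="")
```

An alerting rule on a metric like this one would be what satisfies the "at least one alerting health metric" criterion.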
Define service health and operational maturity
• Team should be on-call ready
• At least 4, preferably 6 engineers participating to make it sustainable
• 24/7 or during the weekday, depending on the service
Define service health and operational maturity
• Runbooks for standard actions and troubleshooting
• Central location in our code repository
• Up to date and usable by any engineer
Define service health and operational maturity
• Paging alerts should link to the runbook
• Make responding to a page easy
• Practice incident response
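The criteria on the "Define service health and operational maturity" slides can be read as a checklist. The sketch below is one possible encoding, assuming a hypothetical `ServiceReadiness` record and a `readiness_gaps` helper; neither name nor the field layout comes from the talk, and only the thresholds (one alerting metric, at least 4 engineers with 6 preferred, a runbook, alerts linking to it) are taken from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceReadiness:
    """Hypothetical record of a service's operational maturity."""
    name: str
    alerting_health_metrics: int   # count of alerting metrics (latency, throughput, ...)
    oncall_engineers: int          # engineers participating in the rotation
    has_runbook: bool              # runbook exists in the central code repository
    alerts_link_to_runbook: bool   # paging alerts carry a link to the runbook


def readiness_gaps(service: ServiceReadiness) -> List[str]:
    """Return the slide criteria this service does not yet meet."""
    gaps = []
    if service.alerting_health_metrics < 1:
        gaps.append("needs at least one alerting health metric")
    if service.oncall_engineers < 4:
        gaps.append("needs at least 4 on-call engineers (6 preferred)")
    if not service.has_runbook:
        gaps.append("needs a runbook in the central repository")
    if not service.alerts_link_to_runbook:
        gaps.append("paging alerts should link to the runbook")
    return gaps


if __name__ == "__main__":
    # Hypothetical service that has metrics and a runbook but a thin rotation.
    search = ServiceReadiness("search", 2, 3, True, False)
    for gap in readiness_gaps(search):
        print(f"{search.name}: {gap}")
```

A check like this only covers what can be counted; "up to date and usable by any engineer" and "practice incident response" still need a human review.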
Incident Lunch
Site Reliability Engineers
• Devops generalists
• Emotional intelligence
• Mentoring
• Ambassadors
• Operational maturity
SRE embedded in dev teams
Devs SRE Ops
Devs: "Um, where are the SREs?"
SREs: "I'm over here doing operational tasks"
SRE Ops is still getting the first pages
How do we lower operational burden on the SREs?
Plan: Send paging alerts to the development teams
Devs: "We need training"
SREs: "We're going to plan this out perfectly"
Host-level alerts: hundreds of them
Test with the users
Everything was fine!
Empowered Continuous Improvement
Devs SRE Ops
How do we test our understanding of how Slack will fail?
"Disasterpiece Theater is an ongoing series of exercises in which we will purposely cause a part of Slack to fail."
Measure Design Learn
Success Metrics
• Increased engineer confidence
• Validate reliability improvements
• Learn something new
• Practice incident response