Winston: Helping Netflix Engineers Sleep at Night
Our journey… helping engineers reduce operational load and MTTR
Sayli Karmarkar
Senior Software Engineer Diagnostics and Remediation Engineering (DaRE)
skarmarkar@netflix.com
@HikerTechy
https://www.linkedin.com/in/saylikarmarkar

DaRE Team's Focus
Build platforms, tools, and libraries to help teams reduce MTTR for operational issues.
2:00 AM PagerDuty alert
2:02 AM Engineer wakes up
2:07 AM Logs in and ACKs
2:10 AM Studies the alert
2:15 AM Checks runbook
2:20 AM Runs diagnostics
2:30 AM Fixes/mitigates the problem
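Read in chronological order, the timeline above puts mitigation 30 minutes after the page. The arithmetic can be sketched in a few lines of Python; the times are taken from the slide, while the function name and list layout are illustrative assumptions.

```python
from datetime import datetime

# Times come from the slide; pairing each label with its time follows the
# order in which the labels and times appear there.
TIMELINE = [
    ("2:00 AM", "PagerDuty alert"),
    ("2:02 AM", "Engineer wakes up"),
    ("2:07 AM", "Logs in and ACKs"),
    ("2:10 AM", "Studies the alert"),
    ("2:15 AM", "Checks runbook"),
    ("2:20 AM", "Runs diagnostics"),
    ("2:30 AM", "Fixes/mitigates the problem"),
]


def minutes_since_alert(timeline):
    """Return (label, minutes elapsed since the first entry) for each step."""
    def parse(text):
        return datetime.strptime(text, "%I:%M %p")

    start = parse(timeline[0][0])
    return [
        (label, int((parse(when) - start).total_seconds() // 60))
        for when, label in timeline
    ]


for label, minutes in minutes_since_alert(TIMELINE):
    print(f"+{minutes:2d} min  {label}")
```

Most of the elapsed time goes to waking up, logging in, and gathering context before any fix is attempted, which is the portion runbook automation aims to shrink.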
Scale and Growth
Availability
Productivity
providing a platform to automate their runbooks
event
than infrastructure aka PaaS.
services
management
2:00 AM PagerDuty alert
2:02 AM Engineer wakes up
2:07 AM Logs in and ACKs
2:10 AM Studies the alert
2:15 AM Checks runbook
2:20 AM Runs diagnostics
2:30 AM Fixes the problem
False Positive
2:00 AM 2:05 AM 2:05 AM 2:15 AM Assisted Diagnostics Mitigates the problem
Inbound Integrations (through SQS)
Outbound Integrations Bolt ...
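As a rough illustration of the event-driven flow implied by the inbound integrations, the sketch below pulls JSON events off a queue and routes each to a registered runbook. Everything here is an assumption made for illustration: an in-memory `queue.Queue` stands in for SQS, and `RUNBOOKS`, `runbook`, `diagnose_high_cpu`, and `drain` are hypothetical names, not Winston's or any library's actual API.

```python
import json
import queue

# Hypothetical registry mapping an event type to the runbook that handles it.
RUNBOOKS = {}


def runbook(event_type):
    """Decorator that registers a function as the runbook for an event type."""
    def register(fn):
        RUNBOOKS[event_type] = fn
        return fn
    return register


@runbook("high_cpu")
def diagnose_high_cpu(event):
    # A real runbook would collect diagnostics; this one only describes the step.
    return f"collect diagnostics for {event['instance']}"


def drain(inbound):
    """Process every queued event and return one result string per event."""
    results = []
    while True:
        try:
            raw = inbound.get_nowait()
        except queue.Empty:
            return results
        event = json.loads(raw)
        handler = RUNBOOKS.get(event.get("type"))
        if handler is None:
            results.append(f"no runbook for {event.get('type')}")
        else:
            results.append(handler(event))


# In-memory stand-in for the inbound queue (assumption: events arrive as JSON).
inbound = queue.Queue()
inbound.put(json.dumps({"type": "high_cpu", "instance": "i-123"}))
inbound.put(json.dumps({"type": "disk_full", "instance": "i-456"}))
print(drain(inbound))
```

Keeping the registry separate from the transport means the same runbooks could be triggered from any inbound source that delivers events in the same shape.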
As a Service (High Availability and Reliability)
efficiency over new features?
management practices are often not followed
functionality
○ Compliance/Auditing
○ Persistence
○ Security (Authentication/Authorization)
False Positives
down an instance
Diagnostics - Correlation could point towards the root cause
when an issue occurs
anomalous behavior
failures and reach the root cause quicker
Mitigation
○ Find common operational use-cases and allow them to be re-used
○ Improve discoverability of Winston by integrating into existing alerting systems
○ Polyglot support (Groovy based runbooks)
○ Resource isolation using containers
○ Rate limiting capability
○ Provide aggregate visualization of runbook executions
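One common way to provide the rate limiting capability listed above is a token bucket placed in front of runbook executions. The sketch below is a generic version of that technique, not Winston's implementation; the `TokenBucket` class, its parameters, and the injectable `clock` are assumptions made for illustration.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` executions, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the limiter can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and consume a token if an execution may start now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical usage: at most 2 back-to-back executions, refilling 1 per second.
limiter = TokenBucket(rate_per_sec=1, capacity=2)
for attempt in range(3):
    print(attempt, "run" if limiter.allow() else "throttled")
```

A limiter like this keeps an alert storm from launching an unbounded number of runbook executions at once while still letting a short burst through.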
instead of documentation.
than fixing the root-cause
http://techblog.netflix.com/2016/08/introducing-winston-event-driven.html
https://docs.stackstorm.com/
skarmarkar@netflix.com
@HikerTechy
We are hiring
Senior Software Engineer - https://jobs.netflix.com/jobs/860752