Automated Troubleshooting of Live Site Issues Sriram Srinivasan - PowerPoint PPT Presentation

Automated Troubleshooting of Live Site Issues Sriram Srinivasan PayPal SRE May 23, 2017

About Me
• MTS 2, Software Engineer @ PayPal
• Site Reliability Engineer

Agenda
• A bit about PayPal SRE
• Troubleshooting Challenges
• Manual Troubleshooting Process
• Requirements of Automated Troubleshooting Platform
• Evolution of the Architecture
• Architecture in Detail
• Major features of the Automated Troubleshooting Platform
• How to troubleshoot any type of issue through Workflows
• Future Plans

A bit about PayPal SRE
• Focus on the Key Aspects of Site Reliability:
  - Availability
  - Performance
  - Change Management
  - Monitoring and Alerting
  - Incident Management
• Troubleshoot and drive resolution of Live issues (from every domain) across the company.

Troubleshooting Challenges
• Manual:
  - Not having enough data to troubleshoot
  - Knowledge of the area/domain
  - Multiple signal generators
  - Inherent urgency in resolving
  - Takes time (due to human intervention)
  - Past troubleshooting knowledge not always leveraged
  - Low priority issues don't get enough attention
  - Expiry of the logs
• Landscape:
  - Newer Products/Flows
  - People, Product & Bugs changing teams
  - Growing number of issues as we grow
  - Growing signal generators
  - Troubleshooting system-generated Alerts not scalable

Manual Troubleshooting Process
• Issue Comprehension
• Categorize Issue (System vs Application)
• Look for Samples
• Tag Samples with the corresponding logs (spanning multiple applications)
• Check further in:
  - Stack Trace (Logs from the point of entry)
  - Recent Pushes (pertaining to the application/service)
  - Deployment Logs
  - Databases
  - In-house Alerting & Monitoring tools
  - In-house Admin tool
  - Code base
  - Production box
  - Bug Tracker/Ticketing Systems
  - …

Requirements
• Explicit Functional Requirements:
  - Automate the troubleshooting process
• Implicit Functional Requirements:
  - Provision to talk to disparate signal generators/data sources (like log servers, DB, …) synchronously/asynchronously
  - Adaptable to the growing signal generators/data sources
  - Ability to troubleshoot any type of issue/alert
  - Troubleshooting data augmentation/enrichment
  - Assimilation of the results from various data sources
  - Retain concerned Logs/troubleshooting info forever
  - Single place to view the auto-troubleshooting result
  - Build a Platform

Evolution of the Architecture
• Key Abstractions:
  - Identify the Type of Issue/Alert (Workflow)
  - Workflow has the say on how to troubleshoot (control strategy)
  - Augment the Troubleshooting Data
  - Invoke various Fetchers in the order prescribed (diverse specialized modules)
  - Gather results in a common place
  - Assimilate / Solution is incrementally constructed
• Architecture Patterns:
  - Multitier / n-tier architecture
  - Service-oriented architecture (SOA)
  - Presentation–abstraction–control / MVC
  - Blackboard system
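
The key abstractions above map closely onto the Blackboard pattern: a control strategy (the Workflow) invokes specialized modules (Fetchers) in a prescribed order, and each one posts its findings to a common place so the result is built up incrementally. The following is a minimal Python sketch of that idea only. The names `Blackboard`, `run_workflow`, `recent_pushes` and `stack_trace`, and the canned return values, are hypothetical and not taken from the PayPal platform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Blackboard:
    """Common place where the issue and every Fetcher's findings accumulate."""
    issue: Dict[str, str]
    findings: Dict[str, object] = field(default_factory=dict)


# A Fetcher (hypothetical signature) reads the board and returns its findings.
Fetcher = Callable[[Blackboard], object]


def recent_pushes(board: Blackboard) -> object:
    # Stand-in for querying a deployment system; returns placeholder data.
    return [f"push to {board.issue['service']} (placeholder)"]


def stack_trace(board: Blackboard) -> object:
    # Stand-in for a log lookup; it can read what earlier Fetchers posted.
    return {"pushes_seen": len(board.findings.get("recent_pushes", []))}


def run_workflow(issue: Dict[str, str], fetchers: List[Fetcher]) -> Blackboard:
    """Invoke Fetchers in the prescribed order, gathering results on the board."""
    board = Blackboard(issue=dict(issue))
    for fetcher in fetchers:
        board.findings[fetcher.__name__] = fetcher(board)
    return board


board = run_workflow({"service": "checkout"}, [recent_pushes, stack_trace])
```

Because every Fetcher receives the whole board, a later Fetcher can build on what an earlier one found, which is one plausible reason for the order of invocation to matter.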

Architecture

Workflows
• A Workflow specifies how to enrich the troubleshooting data and which Fetchers are required (including the order of invocation).
• Each Workflow is described in JSON and is simply a union of various Sections (or Directives).
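
The slides do not show the actual Workflow schema, so the following is only a guess at what such a JSON document could look like, embedded in Python so it can be parsed and checked. The section names `enrich` and `fetchers`, the `order` field, and all values are assumptions made for illustration.

```python
import json

# Hypothetical Workflow: one section for data enrichment and one section
# listing the Fetchers together with their order of invocation.
WORKFLOW_JSON = """
{
  "name": "example-payment-flow",
  "enrich": {"add_fields": ["correlation_id", "time_window"]},
  "fetchers": [
    {"name": "recent_pushes", "order": 2},
    {"name": "stack_trace", "order": 1},
    {"name": "deployment_logs", "order": 3}
  ]
}
"""


def fetcher_order(workflow_text: str) -> list:
    """Return Fetcher names sorted by the order the Workflow prescribes."""
    workflow = json.loads(workflow_text)
    ordered = sorted(workflow["fetchers"], key=lambda entry: entry["order"])
    return [entry["name"] for entry in ordered]


order = fetcher_order(WORKFLOW_JSON)
```

Keeping the order in the data rather than in code means a new product or flow can be onboarded by adding a JSON document, without touching the engine that runs it.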

Major features of the Automated Troubleshooting Platform
• Pluggable:
  - Fetchers are pluggable. We can add as many Fetchers (for Data Sources) as we want.
  - The language used to develop a Fetcher is not fixed.
• Expandable:
  - Add as many Workflows (Products/Flows) as needed. The Workflow says which Fetchers are invoked and in what order.
  - Issues and various types of Alerts can be triaged.
• Scalable by Design:
  - Asynchronous invocation of Fetchers.
  - The underlying technologies also help.
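
One way to picture pluggable Fetchers with asynchronous invocation is a registry plus `asyncio.gather`. This is a sketch under assumptions, not the platform's implementation: the `register` decorator, the `FETCHERS` registry, the Fetcher names and the returned dictionaries are all hypothetical, and `asyncio.sleep(0)` merely stands in for real I/O against a data source.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List

# Hypothetical registry: adding a data source means registering one more Fetcher.
FETCHERS: Dict[str, Callable[[dict], Awaitable[dict]]] = {}


def register(name: str):
    """Decorator that plugs a Fetcher into the registry under the given name."""
    def decorator(func):
        FETCHERS[name] = func
        return func
    return decorator


@register("log_server")
async def fetch_logs(issue: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for a network call to a log server
    return {"source": "log_server", "issue": issue["id"]}


@register("database")
async def fetch_db(issue: dict) -> dict:
    await asyncio.sleep(0)  # placeholder for a database query
    return {"source": "database", "issue": issue["id"]}


async def triage(issue: dict, names: List[str]) -> Dict[str, dict]:
    """Run the named Fetchers concurrently and collect results by name."""
    results = await asyncio.gather(*(FETCHERS[name](issue) for name in names))
    return dict(zip(names, results))


result = asyncio.run(triage({"id": "ISSUE-1"}, ["log_server", "database"]))
```

`asyncio.gather` returns results in the order the awaitables were passed, so pairing them with `names` is safe. Concurrent invocation like this suits Fetchers that do not depend on one another; Fetchers with an ordering constraint would still be awaited in sequence.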

Benefits
• Fast Triaging of all Issues & Alerts:
  - All issues and alerts are auto-triaged in minutes.
  - Reduces MTTT (Mean Time to Triage) and thus MTTR (Mean Time to Resolve).
• Less Cost to Company:
  - Reduces the Sustaining Budget of teams. Teams can spend their effort on building other cool features.
• Customer Satisfaction:
  - Better Customer Satisfaction, as logs are available forever and we don't need to go back to our customers.
• Better Insights:
  - As a single platform, it holds all the issues and their resolutions, so this data platform can provide various insights.
  - Past triaging knowledge is leveraged for future troubleshooting.

Future Plans
• Platform Usage:
  - Continuously evolve the platform by adding more Fetchers
• Disposition:
  - Smart Issue Classification & Intelligent Issue Routing
• Data Platform:
  - Cataloguing each Issue with additional information (Resolution details, additional Notes)
  - Insights generation sliced by products, flows, root cause
• Proactive Measures:
  - Use the data to identify where most issues arise and invest there

Questions?
