Introducing Reliability Toolkit: easy-to-use monitoring and alerting
Robin van Zijll & Janna Brummel
PromCon 2018 • 10 August 2018

Hi! Robin and Janna [Photo]

What do we work on with whom, how and why?
Who? Team of 7 SREs with the goal to reduce mean time to repair and increase mean time between failures for IT services within a bank
Why? We do not reach the availability levels expected by customers or regulators
How? We enable ~300 BizDevOps squads through engineering, delivery of tooling, consulting and education
What? We deliver a monitoring solution (the Reliability Toolkit) and a ChatOps platform, we facilitate postmortems and we educate engineers about SRE-related topics

Why did we develop the Reliability Toolkit?
Alerting not directly to teams: Time before an engineer starts resolving a (major) incident is 69 minutes on average
Lack of white-box: Currently the only real monitoring is black-box, which does not fit with 'you build it, you run it'
High level of technology diversity: Prometheus exporters make monitoring highly adoptable
A bank can be a documentation factory: It is a pain for teams to create something new
Simplicity: One toolkit to cover reliability building blocks, easy to get started, easy to use

What's in the Reliability Toolkit?
Prometheus, Alertmanager, Grafana, Model Builder*

How do we provision the Reliability Toolkit?
SRE Team:
Together with a team we create a joint config
We maintain and update the bin files
We deliver the Reliability Toolkit on 5 machines over 3 environments, we remain responsible
We deliver client libraries so metrics can be scraped from servers
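To make "metrics can be scraped" concrete, here is a minimal sketch of what a scrape response looks like in the Prometheus text exposition format. It is not the toolkit's client library: the helper `render_gauge` and the metric and label names are illustrative assumptions, and a real client library also handles label escaping, other metric types and the HTTP endpoint.

```python
from typing import Dict, List, Tuple


def render_gauge(name: str, help_text: str,
                 samples: List[Tuple[Dict[str, str], float]]) -> str:
    """Render one gauge metric in the Prometheus text exposition format.

    Hypothetical helper for illustration only; label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Labels are rendered as key="value" pairs inside curly braces.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative metric: one sample per (assumed) site label.
    print(render_gauge(
        "logged_on_customers",
        "Customers currently logged on",
        [({"site": "a"}, 42), ({"site": "b"}, 17)],
    ))
```

Prometheus would fetch text like this over HTTP on every scrape interval; the application only has to keep the values current.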

Increasing and improving usage of Reliability Toolkit
Include client libraries in engineering frameworks
Ensure a good feedback loop with your customers
Educate others during onboarding and workshops
Create dashboards accessible to all engineers

Create awesome dashboards accessible to all engineers

NLA (NGINX Log Aggregator*)
[Diagram components: Applications, NGINX, Kafka, NGINX Log Aggregator*, Prometheus, Grafana]

Error Overview (1)

Error Overview (2)

Team Overview (1)

Team Overview (2)

Team Overview (3)

Team Overview (4)

Educate others during onboarding and workshops

PromQL Workshop: Example Assignment
Selecting a range vector in Prometheus is done by appending a time window specification between square brackets to your metric (for example, my_metric[1m] selects 1 minute). These ranges allow the use of all sorts of functions in Prometheus that manipulate the data. You can also have Prometheus calculate the change in the number of logged-on customers using functions like delta() or deriv().
delta: change in value between the first and last value of a time series in a range vector (time range)
idelta: change in value between the last 2 values of a time series in a range vector (time range)
deriv: per-second derivative of a time series in a range vector
These functions should only be used with GAUGES. Note that the idelta function is somewhat less useful, as its meaning depends on the scrape interval.
Objective: Understand the delta(), deriv() and idelta() functions
1. Use the 'logged_on_customers' metric
2. Add a panel showing the per-second change in the number of logged-on customers for each site

PromQL Workshop: Example Solution
You should have filled in: "deriv(logged_on_customers[1m])"
The graph should look similar to this: [Graph]
Note that this graph has a 30 min time frame instead of the default 15 min.
Difference between delta and deriv: delta shows you the difference between two points in time, where the two values are subtracted from each other. These two values are selected based on the given time frame (in this case 1 min). deriv(v range-vector), on the other hand, calculates the per-second derivative of the time series in a range vector v, using simple linear regression. In other words, deriv calculates the slope of the graph.
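The difference between the three functions can be sketched outside Prometheus. The Python below is a simplified illustration, not Prometheus's implementation: the function names mirror PromQL, the sample data is made up, and the real delta() additionally extrapolates its result to cover the full time range, which this sketch skips.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, gauge value)


def delta(samples: List[Sample]) -> float:
    """Last value minus first value in the window (no extrapolation)."""
    return samples[-1][1] - samples[0][1]


def idelta(samples: List[Sample]) -> float:
    """Difference between the last two samples in the window."""
    return samples[-1][1] - samples[-2][1]


def deriv(samples: List[Sample]) -> float:
    """Per-second slope of the least-squares line through the samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


if __name__ == "__main__":
    # Made-up 1-minute window scraped every 15 s: 3 extra customers per scrape.
    window = [(0, 100), (15, 103), (30, 106), (45, 109), (60, 112)]
    print(delta(window), idelta(window), deriv(window))
```

With this window, delta reports the total change over the minute, idelta only the change over the last scrape interval, and deriv the per-second slope, which is why deriv is the function that answers the assignment.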

Notify when things are different than expected

Model Builder
[Graph: Current load, Expected load, Potential Alert]

Model Builder
Input: Currently we support GAUGES and COUNTERS as input
modelType: AveragingModel. Prediction based on values in buckets
Output: model_http_request_rate. Model as sample in Prometheus
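The slides do not show how AveragingModel works internally, so the following is only a sketch of one way "prediction based on values in buckets" could look. Bucketing by hour of day (UTC), the class layout and the `deviates` helper with its 50% tolerance are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Optional


class AveragingModel:
    """Predicts the expected value as the mean of past values in the same bucket.

    Assumption: one bucket per hour of day (UTC). The real Model Builder
    may bucket differently.
    """

    def __init__(self) -> None:
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    @staticmethod
    def _bucket(ts: float) -> int:
        return datetime.fromtimestamp(ts, tz=timezone.utc).hour

    def observe(self, ts: float, value: float) -> None:
        bucket = self._bucket(ts)
        self._sums[bucket] += value
        self._counts[bucket] += 1

    def predict(self, ts: float) -> Optional[float]:
        bucket = self._bucket(ts)
        count = self._counts.get(bucket, 0)
        if count == 0:
            return None  # no history for this bucket yet
        return self._sums[bucket] / count


def deviates(current: float, expected: float, tolerance: float = 0.5) -> bool:
    """True when current load differs from expected by more than the tolerance."""
    return abs(current - expected) > tolerance * expected


if __name__ == "__main__":
    day = 86400
    model = AveragingModel()
    model.observe(0, 100.0)    # midnight UTC, day 1
    model.observe(day, 120.0)  # midnight UTC, day 2
    expected = model.predict(2 * day)
    print(expected, deviates(200.0, expected))
```

In the toolkit the expected value is written back to Prometheus as a sample, so a comparison like `deviates` would more naturally live in an alerting rule that compares the current and expected series.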

Questions?
