SLIDE 1

MONITORING COMPLEX SYSTEMS: LESSONS FROM MONITORING 10K BANK INTEGRATIONS

Joy Zheng

#GHC18

SLIDE 2

PAGE 2 | GRACE HOPPER CELEBRATION 2018 PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY

WHY MONITORING

  • Run reliable services at scale
  • Understand service performance over time
  • Quickly detect and react to problems
SLIDE 3

GOALS

  • Customize monitoring to your system or company
  • Avoid incurring high customization costs
SLIDE 4

WHO IS PLAID?

  • Personal finances
  • Lending
  • Consumer payments
  • Banking and brokerage
  • Business finances
  • Integration partners

SLIDE 5

WHO IS PLAID?

SLIDE 6

MONITORING AT PLAID

  • 10,000 financial institutions

SLIDE 7

MONITORING AT PLAID

  • 10,000 financial institutions
  • Heterogeneous traffic patterns

SLIDE 8

TWO SAMPLE INSTITUTIONS

Institution                        Hour 1                                    Hour 2
First Platypus Bank                10 billion tries / 100 million failures   10 billion tries / 200 million failures
Tattersall Federal Credit Union    5 tries / 1 failure                       5 tries / 5 failures
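
The two institutions above suggest why a single alerting rule is hard to apply across heterogeneous traffic. The following is a minimal sketch using the figures from the slide. The two thresholds (`MAX_FAILURES` and `MAX_FAILURE_RATE`) and the helper names are hypothetical, chosen only for illustration, and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HourlySample:
    institution: str
    hour: int
    tries: int
    failures: int

    @property
    def failure_rate(self) -> float:
        """Fraction of tries that failed in this hour."""
        return self.failures / self.tries


# Figures taken from the slide.
SAMPLES = [
    HourlySample("First Platypus Bank", 1, 10_000_000_000, 100_000_000),
    HourlySample("First Platypus Bank", 2, 10_000_000_000, 200_000_000),
    HourlySample("Tattersall Federal Credit Union", 1, 5, 1),
    HourlySample("Tattersall Federal Credit Union", 2, 5, 5),
]

# Hypothetical thresholds, not from the talk.
MAX_FAILURES = 100       # absolute failures per hour
MAX_FAILURE_RATE = 0.10  # fraction of tries that fail


def count_alert(sample: HourlySample) -> bool:
    """Fire when the absolute number of failures exceeds the threshold."""
    return sample.failures > MAX_FAILURES


def rate_alert(sample: HourlySample) -> bool:
    """Fire when the failure rate exceeds the threshold."""
    return sample.failure_rate > MAX_FAILURE_RATE


if __name__ == "__main__":
    for s in SAMPLES:
        print(
            f"{s.institution}, hour {s.hour}: "
            f"rate={s.failure_rate:.0%} "
            f"count_alert={count_alert(s)} rate_alert={rate_alert(s)}"
        )
```

Under these assumed thresholds, the count-based rule fires for the large institution in both hours and never for the small one, while the rate-based rule misses the large institution's failure rate doubling from 1% to 2% and fires for the small one after a single failure.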

SLIDE 9

MONITORING AT PLAID

  • 10,000 financial institutions
  • Heterogeneous traffic patterns
  • Metrics beyond success/failure

SLIDE 10

Determining Requirements

SLIDE 11

OUR STEPS

  1. Convince ourselves we needed a new system
  2. Identify a full list of metrics to monitor
  3. Prioritize, prioritize, prioritize
  4. Determine technical system requirements
  5. Research technologies
SLIDE 12

THE OLD SYSTEM

SLIDE 13

METRICS WISHLIST

SLIDE 14

PRIORITIZATION: #SHIPTHEMVP

  • Based on customer impact and instrumentation cost
    Narrowed to 1/3 of original wishlist
  • Still failed to narrow the list of metrics enough
    Result: time spent writing complex (unused) database logic

SLIDE 15

TECHNICAL REQUIREMENTS

  • Scalability: 10k banks * # metrics each
  • Latency: 30s from event to metric; 1s to query metrics
  • Usability: Engineers can create metrics and alerts with minimal monitoring implementation knowledge
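
The scalability and latency requirements above can be written down as simple checks. This is only a sketch: the class and method names are hypothetical, and the figures passed in at the bottom (20 metrics per institution, a 5 s pipeline delay, a 0.4 s query) are made-up inputs rather than numbers from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    institutions: int = 10_000           # from the slide: 10k banks
    max_event_to_metric_s: float = 30.0  # from the slide: 30s from event to metric
    max_query_s: float = 1.0             # from the slide: 1s to query metrics

    def series_count(self, metrics_per_institution: int) -> int:
        """Scalability: number of banks times the number of metrics each."""
        return self.institutions * metrics_per_institution

    def latency_ok(self, event_to_metric_s: float, query_s: float) -> bool:
        """True when both latency budgets are met."""
        return (
            event_to_metric_s <= self.max_event_to_metric_s
            and query_s <= self.max_query_s
        )


if __name__ == "__main__":
    reqs = Requirements()
    # Hypothetical inputs, for illustration only.
    print(reqs.series_count(metrics_per_institution=20))
    print(reqs.latency_ok(event_to_metric_s=5.0, query_s=0.4))
```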

SLIDE 16

Building a Monitoring Pipeline

SLIDE 17

TWO USE CASES

  • Standard Pipeline: Monitoring, alerting, and dashboards at scale for easy-to-generate metrics
  • Custom Pipeline: Metrics generation involving custom aggregation

SLIDE 18

STANDARD PIPELINE

SLIDE 19

PROMETHEUS

SLIDE 20

PROMETHEUS

    server: tasks_processed_count{
              server="i-abcdef123",
              status="success",
              institution="FirstPlatypusBank"
            }
    query:  sum(rate(tasks_processed_count[5m]))
    alert:  sum(rate(tasks_processed_count{status!="success"}[5m]))
              by (institution)
              > 100

SLIDE 21

PROMETHEUS

    query:  sum(rate(tasks_processed_count[5m]))

SLIDE 22

PROMETHEUS

    alert:  sum(rate(tasks_processed_count{status!="success"}[5m]))
              by (institution)
              > 100
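
To make the alert expression above more concrete, the sketch below mimics in plain Python what `sum(rate(...{status!="success"}[5m])) by (institution) > 100` computes: a per-second rate for each non-success series, summed per institution, then compared to the threshold. It is a simplification and not how Prometheus is implemented: the real `rate()` also handles counter resets and extrapolates to the window boundaries. The sample data, the `status="error"` label value, and all function names are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples, window_s, now):
    """Per-second increase between the first and last sample in the window.

    Simplification: ignores counter resets and boundary extrapolation.
    `samples` is a time-ordered list of (unix_seconds, counter_value).
    """
    in_window = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def failing_rate_by_institution(series_list, window_s, now):
    """Sum the rates of all series whose status label is not "success"."""
    totals = defaultdict(float)
    for series in series_list:
        if series["labels"].get("status") == "success":
            continue
        rate = simple_rate(series["samples"], window_s, now)
        if rate is not None:
            totals[series["labels"].get("institution", "")] += rate
    return dict(totals)


def firing(series_list, threshold=100, window_s=300, now=300):
    """Institutions whose summed failure rate exceeds the threshold."""
    rates = failing_rate_by_institution(series_list, window_s, now)
    return {inst: r for inst, r in rates.items() if r > threshold}


# Hypothetical data: two servers report errors for one institution.
SERIES = [
    {"labels": {"server": "i-abcdef123", "status": "error",
                "institution": "FirstPlatypusBank"},
     "samples": [(0, 0), (300, 18_000)]},
    {"labels": {"server": "i-abcdef456", "status": "error",
                "institution": "FirstPlatypusBank"},
     "samples": [(0, 0), (300, 18_000)]},
    {"labels": {"server": "i-abcdef123", "status": "success",
                "institution": "FirstPlatypusBank"},
     "samples": [(0, 0), (300, 900_000)]},
    {"labels": {"server": "i-abcdef123", "status": "error",
                "institution": "TattersallFCU"},
     "samples": [(0, 0), (300, 300)]},
]

if __name__ == "__main__":
    print(firing(SERIES))
```

Each error series for the first institution rises by 18,000 over 300 seconds (60 per second), so the two servers together exceed the threshold of 100, while the second institution at 1 per second does not.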

SLIDE 23

ALERTMANAGER

    route:
      receiver: team-monitoring-email
      group_by:
        - alertname
        - environment
      routes:
        - match_re:
            alert_type: ^(?:monitoring_uptime|prometheus)$
          routes:
            - receiver: team-monitoring-email
              match_re:
                environment: ^(?:testing|preprod)$
            - receiver: team-monitoring-email
              match_re:
                plaid_env: ^(?:testing|preprod)$
            - receiver: team-monitoring-pager

SLIDE 24

ALERTMANAGER

    - match_re:
        alert_type: ^(?:monitoring_uptime|prometheus)$

SLIDE 25

ALERTMANAGER

    - receiver: team-monitoring-email
      match_re:
        environment: ^(?:testing|preprod)$
    - receiver: team-monitoring-email
      match_re:
        plaid_env: ^(?:testing|preprod)$

SLIDE 26

ALERTMANAGER

    - receiver: team-monitoring-pager
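
The routing tree from the Alertmanager slides can be traced with a small model. The sketch below assumes Alertmanager's documented behavior that an alert descends into the first child route whose matchers all match, that a route without a receiver inherits its parent's, and that a missing label compares as an empty string. It leaves out `group_by` and the `continue` option, and the `Route` class and `resolve` function are hypothetical names, not part of Alertmanager.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Route:
    receiver: Optional[str] = None
    match_re: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        """All regex matchers must fully match; a missing label counts as ""."""
        return all(
            re.fullmatch(pattern, labels.get(name, "")) is not None
            for name, pattern in self.match_re.items()
        )


def resolve(route: Route, labels: Dict[str, str],
            inherited: Optional[str] = None) -> Optional[str]:
    """Return the receiver for an alert: first matching child wins."""
    receiver = route.receiver or inherited
    for child in route.routes:
        if child.matches(labels):
            return resolve(child, labels, receiver)
    return receiver


NON_PROD = r"^(?:testing|preprod)$"

# Mirrors the route tree shown on the slides.
ROOT = Route(
    receiver="team-monitoring-email",
    routes=[
        Route(
            match_re={"alert_type": r"^(?:monitoring_uptime|prometheus)$"},
            routes=[
                Route(receiver="team-monitoring-email",
                      match_re={"environment": NON_PROD}),
                Route(receiver="team-monitoring-email",
                      match_re={"plaid_env": NON_PROD}),
                Route(receiver="team-monitoring-pager"),
            ],
        )
    ],
)

if __name__ == "__main__":
    # The label values below are hypothetical examples.
    print(resolve(ROOT, {"alert_type": "prometheus", "environment": "production"}))
    print(resolve(ROOT, {"alert_type": "prometheus", "environment": "testing"}))
    print(resolve(ROOT, {"alert_type": "disk_full"}))
```

Under this model, only alerts of type `monitoring_uptime` or `prometheus` outside the testing and preprod environments reach the pager; everything else goes to email.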

SLIDE 27

GRAFANA

SLIDE 28

STANDARD PIPELINE

SLIDE 29

CUSTOM PIPELINE

SLIDE 30

RESULTS

Events per second processed                                             > 700
Metrics exported                                                        190k+
Services monitored                                                      17
Engineers who have contributed monitoring changes (on a team of 45)     31
Average delay from event to metrics generation                          < 5s

SLIDE 31

Takeaways

SLIDE 32

TAKEAWAYS

  1. Build the end-to-end pipeline first
     Learn from real usage to avoid creating extra complexity
  2. Make monitoring components independently usable
     Other services can make use of individual pieces of the pipeline
  3. Tailor custom aggregation to the narrowest areas which require it
     Let services which don’t need customization use the standard pipeline
  4. Use standard components where possible
     Custom components have higher implementation costs and higher developer education costs

SLIDE 33

Questions?