  1. MONITORING COMPLEX SYSTEMS: LESSONS FROM MONITORING 10K BANK INTEGRATIONS | Joy Zheng | #GHC18

  2. WHY MONITORING
     • Run reliable services at scale
     • Understand service performance over time
     • Quickly detect and react to problems
     PAGE 2 | GRACE HOPPER CELEBRATION 2018 #GHC18 PRESENTED BY ANITAB.ORG AND THE ASSOCIATION FOR COMPUTING MACHINERY

  3. GOALS
     • Customize monitoring to your system or company
     • Avoid incurring high customization costs

  4. WHO IS PLAID? PERSONAL FINANCES, LENDING, BANKING AND BROKERAGE, CONSUMER PAYMENTS, BUSINESS FINANCES, INTEGRATION PARTNERS

  5. WHO IS PLAID?

  6. MONITORING AT PLAID
     10,000 financial institutions

  7. MONITORING AT PLAID
     10,000 financial institutions
     Heterogeneous traffic patterns

  8. TWO SAMPLE INSTITUTIONS
             Tattersall Federal Credit Union          First Platypus Bank
     Hour 1  10 billion tries / 100 million failures  5 tries / 1 failure
     Hour 2  10 billion tries / 200 million failures  5 tries / 5 failures
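
The two sample institutions show why one absolute failure threshold does not suit heterogeneous traffic. The sketch below only restates the slide's numbers as failure rates. The `failure_rate` and `hourly_rates` helpers and the data layout are illustrative assumptions, not anything shown in the talk.

```python
# Hourly (tries, failures) pairs taken from the slide.
SAMPLES = {
    "Tattersall Federal Credit Union": [
        (10_000_000_000, 100_000_000),  # Hour 1
        (10_000_000_000, 200_000_000),  # Hour 2
    ],
    "First Platypus Bank": [
        (5, 1),  # Hour 1
        (5, 5),  # Hour 2
    ],
}


def failure_rate(tries: int, failures: int) -> float:
    """Fraction of tries that failed; 0.0 when there was no traffic."""
    return failures / tries if tries else 0.0


def hourly_rates(samples):
    """Map each institution to its list of per-hour failure rates."""
    return {
        name: [failure_rate(tries, failures) for tries, failures in hours]
        for name, hours in samples.items()
    }


if __name__ == "__main__":
    for name, rates in hourly_rates(SAMPLES).items():
        print(name, [f"{rate:.0%}" for rate in rates])
```

Tattersall goes from a 1% to a 2% failure rate while its failure count grows by 100 million. First Platypus Bank goes from 20% to 100% on only five tries. Any fixed count threshold low enough to catch the second case would fire constantly for the first.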

  9. MONITORING AT PLAID
     10,000 financial institutions
     Heterogeneous traffic patterns
     Metrics beyond success/failure

  10. Determining Requirements

  11. OUR STEPS
     1. Convince ourselves we needed a new system
     2. Identify a full list of metrics to monitor
     3. Prioritize, prioritize, prioritize
     4. Determine technical system requirements
     5. Research technologies

  12. THE OLD SYSTEM

  13. METRICS WISHLIST

  14. PRIORITIZATION: #SHIPTHEMVP
     • Based on customer impact and instrumentation cost
       Narrowed to 1/3 of original wishlist
     • Still failed to narrow the list of metrics enough
       Result: time spent writing complex (unused) database logic

  15. TECHNICAL REQUIREMENTS
     Scalability: 10k banks * # metrics each
     Latency: 30s from event to metric; 1s to query metrics
     Usability: Engineers can create metrics and alerts with minimal monitoring implementation knowledge

  16. Building a Monitoring Pipeline

  17. TWO USE CASES
     1. Standard Pipeline: Monitoring, alerting, and dashboards at scale for easy-to-generate metrics
     2. Custom Pipeline: Metrics generation involving custom aggregation

  18. STANDARD PIPELINE

  19. PROMETHEUS

  20. PROMETHEUS
     server:
       tasks_processed_count{
         server="i-abcdef123",
         status="success",
         institution="FirstPlatypusBank"
       }
     query:
       sum(rate(tasks_processed_count[5m]))
     alert:
       sum(rate(tasks_processed_count{status!="success"}[5m]))
         by (institution)
         > 100
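
To make the alert expression concrete, here is a rough sketch of what `sum(rate(...{status!="success"}[5m])) by (institution) > 100` computes. It is a simplification under stated assumptions: `simple_rate` ignores counter resets and the extrapolation Prometheus applies at window edges. The sample values, the `"error"` status label, and the `TattersallFCU` institution label are made up for illustration.

```python
from collections import defaultdict


def simple_rate(points, window=300):
    """Per-second increase of a counter over the last `window` seconds.

    Simplified stand-in for PromQL rate(): no counter-reset handling
    and no extrapolation to the window boundaries.
    """
    if not points:
        return 0.0
    latest = points[-1][0]
    in_window = [(t, v) for t, v in points if t >= latest - window]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)


def failing_institutions(series, threshold=100.0):
    """Sum non-success rates per institution and keep those above threshold."""
    totals = defaultdict(float)
    for labels, points in series:
        if labels["status"] != "success":
            totals[labels["institution"]] += simple_rate(points)
    return {inst: rate for inst, rate in totals.items() if rate > threshold}


# Hypothetical samples: (labels, [(timestamp_seconds, counter_value), ...]).
SERIES = [
    ({"server": "i-abcdef123", "status": "success", "institution": "FirstPlatypusBank"},
     [(0, 0), (300, 90000)]),
    ({"server": "i-abcdef123", "status": "error", "institution": "FirstPlatypusBank"},
     [(0, 0), (300, 18000)]),
    ({"server": "i-abcdef456", "status": "error", "institution": "FirstPlatypusBank"},
     [(0, 0), (300, 18000)]),
    ({"server": "i-abcdef123", "status": "error", "institution": "TattersallFCU"},
     [(0, 0), (300, 3000)]),
]

if __name__ == "__main__":
    print(failing_institutions(SERIES))
```

The `by (institution)` clause is what collapses the per-server series into one value per institution before the threshold is applied.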

  23. ALERTMANAGER
     route:
       receiver: team-monitoring-email
       group_by:
         - alertname
         - environment
       routes:
         - match_re:
             alert_type: ^(?:monitoring_uptime|prometheus)$
           routes:
             - receiver: team-monitoring-email
               match_re:
                 environment: ^(?:testing|preprod)$
             - receiver: team-monitoring-email
               match_re:
                 plaid_env: ^(?:testing|preprod)$
             - receiver: team-monitoring-pager
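
The routing tree above can be read as follows. Alertmanager walks it from the top. At each level the first child route whose matchers all match takes the alert (unless `continue: true` is set), and a route with no matchers matches everything. The sketch below imitates only that selection logic in plain Python. `pick_receiver` and `matches` are illustrative helpers, not part of Alertmanager, the `production` label value is an assumption, and `group_by` is left out because it affects grouping rather than routing.

```python
import re

# The slide's routing tree, transcribed as nested dicts.
ROUTE = {
    "receiver": "team-monitoring-email",
    "routes": [
        {
            "match_re": {"alert_type": r"^(?:monitoring_uptime|prometheus)$"},
            "routes": [
                {"receiver": "team-monitoring-email",
                 "match_re": {"environment": r"^(?:testing|preprod)$"}},
                {"receiver": "team-monitoring-email",
                 "match_re": {"plaid_env": r"^(?:testing|preprod)$"}},
                {"receiver": "team-monitoring-pager"},
            ],
        }
    ],
}


def matches(route, labels):
    """True when every regex matcher fully matches; a missing label counts as ''."""
    return all(
        re.fullmatch(pattern, labels.get(name, "")) is not None
        for name, pattern in route.get("match_re", {}).items()
    )


def pick_receiver(route, labels, inherited=None):
    """Return the receiver of the deepest matching route (first match wins)."""
    receiver = route.get("receiver", inherited)
    for child in route.get("routes", []):
        if matches(child, labels):
            return pick_receiver(child, labels, receiver)
    return receiver


if __name__ == "__main__":
    print(pick_receiver(ROUTE, {"alert_type": "prometheus", "environment": "production"}))
    print(pick_receiver(ROUTE, {"alert_type": "prometheus", "environment": "testing"}))
```

Under this reading, monitoring alerts from testing or preprod go to email, and all other monitoring alerts fall through to the pager receiver.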

  27. GRAFANA

  28. STANDARD PIPELINE

  29. CUSTOM PIPELINE

  30. RESULTS
     > 700 events per second processed
     190k+ metrics exported
     17 services monitored
     31 engineers who have contributed monitoring changes (on a team of 45)
     <5s average delay from event to metrics generation

  31. Takeaways
