
Monitoring Swift: OpenStack Summit, Austin 2016



  1. Monitoring Swift
     OpenStack Summit, Austin 2016
     Adam Takvam, Sr. Systems Engineer
     Martin Lanner, Engagement Manager
     April 28, 2016

  2. 2 | SwiftStack Confidential

  3. Overview
     • Problems
     • Swift key monitoring concepts
       - What to monitor?
       - How to monitor
       - Usage intelligence
       - Capacity planning
       - Operational health
       - Audit trails
     • Background
       - Methods: logs + system metrics
       - Interpretation of metrics
       - Actions: thresholds + alerting
     • Monitoring methods - demos
       - Logging: ELK
       - Trending/Forecasting: Prometheus + Grafana
       - System monitoring: Zabbix

  4. It’s Linux!

  5. Properties of Swift
     • Distributed system
     • Extremely durable through replication or Erasure Coding
     • No single point of failure
     • Even distribution of data
     • Resilient
     • Self-healing capabilities
     • Can take a lot of abuse and negligence

  6. Anatomy of a Monitoring Solution
     • Agent: Gathers metrics on a host and either pushes or advertises them
       - Logstash
       - Prometheus Node Exporter
       - Zabbix Agent
       - Nagios NRPE
     • Aggregation Engines: Collect metrics from agents and provide an API with access to aggregated metric values
       - Nagios
       - Zabbix
       - Elasticsearch
       - Prometheus
     • Visualizer: Renders graphs in a human-friendly format for easy comprehension of system state
       - Kibana
       - Grafana
     • Alerting: Uses metric thresholds to trigger alerts when metrics fall out of an acceptable range
       - AlertManager
       - PagerDuty

  7. Developing a Monitoring Strategy
     Forms of Monitoring
     • System utilization: CPU, memory, disk I/O, network, auditing cycles, replicator timing
     • Performance: Transaction latency
     • Errors: Invalid requests or states
     • Outages: Service failures
     • Feature usage: Understand CRUD operations and traffic patterns
     • Audit trail: Who did what when?
     Monitoring Lifecycle
     • Measurement
     • Reporting
     • Characterization
     • Thresholds
     • Alerting
     • Root cause analysis
     • Remediation
       - Manual
       - Automated

  8. Examples of monitoring methods
     • ELK: Usage intelligence
       - Who?
       - Agents
       - HTTP response codes
       - Errors
       - Audit trails
     • Prometheus: Capacity planning
       - Data growth
       - Trending analytics
     • Zabbix: Operational health
       - Network
       - CPU
       - RAM

  9. Key concepts for monitoring Swift
     • Cluster full
       - df
       - Data growth
       - Capacity planning
     • Networking
       - Availability
       - Saturation
     • Auditing cycles
     • Replication cycle timing
     • Proxy state
       - CPU
       - /healthcheck
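The "Cluster full" check on this slide relies on df. A minimal Python sketch of the same idea follows, assuming storage devices are mounted under /srv/node (the path used in the deck's Prometheus rule). The function names and the 0.8 fill threshold are illustrative assumptions, not part of the presentation.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def fill_ratio(used_bytes: int, total_bytes: int) -> float:
    """Return the fraction of capacity in use; 0.0 if the total is unknown."""
    if total_bytes <= 0:
        return 0.0
    return used_bytes / total_bytes


def drive_fill_ratios(devices_root: str = "/srv/node") -> Dict[str, float]:
    """Map each directory under devices_root to its filesystem fill ratio.

    Assumption: every subdirectory is a mounted storage device. For a
    directory that is not a mount point, shutil.disk_usage reports the
    parent filesystem instead.
    """
    root = Path(devices_root)
    ratios: Dict[str, float] = {}
    if not root.is_dir():
        return ratios
    for dev in sorted(root.iterdir()):
        if dev.is_dir():
            usage = shutil.disk_usage(dev)
            ratios[dev.name] = fill_ratio(usage.used, usage.total)
    return ratios


def drives_over(ratios: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the names of drives at or above the (assumed) fill threshold."""
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)
```

Calling drives_over(drive_fill_ratios()) on a storage node would list the devices that need attention. In practice an agent such as the Prometheus Node Exporter gathers the same numbers.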

  10. Load balancer health checks against Swift proxy servers
      • Most load balancers run ICMP checks against all IPs in their pools by default
      • Also, consider configuring the load balancer to run HTTP checks (over TCP) against Swift’s /healthcheck endpoint
      Example:
        demo@demo:~$ curl http://swift.swiftstack.oss/healthcheck
        OK
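The curl example above can also be scripted. The sketch below is one possible way to do that with the Python standard library. It assumes a healthy proxy answers with HTTP 200 and the body "OK", as shown on the slide; the function names and the 2-second timeout are hypothetical choices.

```python
import urllib.error
import urllib.request


def is_healthy(status: int, body: str) -> bool:
    """A proxy counts as healthy only on HTTP 200 with the body 'OK'."""
    return status == 200 and body.strip() == "OK"


def check_proxy(url: str, timeout: float = 2.0) -> bool:
    """Fetch a /healthcheck URL and report whether the proxy looks healthy.

    Connection failures, timeouts, and HTTP error statuses all count as
    unhealthy rather than raising.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return is_healthy(resp.status, body)
    except (urllib.error.URLError, OSError):
        return False
```

For instance, check_proxy("http://swift.swiftstack.oss/healthcheck") would return True against the demo cluster from the slide and False if the proxy were down or unreachable.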

  11. Audit trails with ELK

  12. Object size distribution

  13. Distribution of CRUD operations over time
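The slide itself is a chart. As a rough illustration of the aggregation behind such a view, the sketch below buckets requests by hour and CRUD category. The input format (ISO timestamp plus HTTP method) and the method-to-CRUD mapping are assumptions for illustration; they are not taken from the deck or from Swift's actual log format.

```python
from collections import Counter
from datetime import datetime

# Assumed mapping from HTTP method to CRUD category.
CRUD_BY_METHOD = {
    "PUT": "create",
    "GET": "read",
    "HEAD": "read",
    "POST": "update",
    "DELETE": "delete",
}


def crud_histogram(requests):
    """Count CRUD operations per hour.

    requests: iterable of (iso_timestamp, http_method) pairs.
    Returns a Counter keyed by (hour_start_iso, crud_operation).
    Methods outside the mapping are ignored.
    """
    counts = Counter()
    for timestamp, method in requests:
        operation = CRUD_BY_METHOD.get(method.upper())
        if operation is None:
            continue
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        counts[(hour.isoformat(), operation)] += 1
    return counts
```

In an ELK setup, Logstash would parse the proxy logs and Kibana would do this bucketing; the function only shows the shape of the computation.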

  14. Zabbix triggers for Swift

  15. Zabbix node memory usage

  16. Zabbix drive utilization events

  17. Disk I/O

  18. Object Replicator Operations

  19. Prometheus + Grafana trending and forecasting

  20. Alerting
      Example:
        ALERT StorageCritical24Hours
          IF sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node/.*"}[1d], 24*3600))
             < sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node/.*"}) * 0.2
          FOR 1h
          LABELS {
            group="storage_admin",
            severity="critical"
          }
      Translation: Send a critical alert to all members of the storage_admin group if the total available storage capacity is projected to be less than 20% of the total storage capacity within the next 24 hours and that forecast has held true for at least 1 hour, recalculating every 5 minutes (per server config / not shown).
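To make the forecast in the alert rule concrete, here is a small Python sketch of the underlying idea: fit a least-squares line through recent free-space samples, extrapolate 24 hours ahead, and compare the result with 20% of total capacity. It is a simplification, not Prometheus's implementation. It works on a single aggregated series and extrapolates from the last sample, and the function names are hypothetical.

```python
def linear_forecast(samples, horizon_s):
    """Extrapolate a least-squares line through (timestamp_s, value) samples.

    Returns the fitted value horizon_s seconds after the latest sample.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = max(t for t, _ in samples)
    return slope * (last_t + horizon_s) + intercept


def storage_critical(free_samples, total_bytes,
                     horizon_s=24 * 3600, min_free_fraction=0.2):
    """True if projected free space falls below the given fraction of capacity."""
    projected_free = linear_forecast(free_samples, horizon_s)
    return projected_free < total_bytes * min_free_fraction
```

The FOR 1h clause in the rule has no counterpart here; a caller would need to track how long storage_critical has kept returning True before raising an alert.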

  21. Q&A / Demo

  22. Thank you!
