Monitoring Swift: OpenStack Summit, Austin 2016



slide-1
SLIDE 1

Monitoring Swift

OpenStack Summit, Austin 2016

Adam Takvam, Sr. Systems Engineer Martin Lanner, Engagement Manager April 28, 2016

slide-2
SLIDE 2 2 | SwiftStack Confidential
slide-3
SLIDE 3 3

Overview

  • Problems
  • Usage intelligence
  • Capacity planning
  • Operational health
  • Audit trails
  • Background
  • Methods: logs + system metrics
  • Interpretation of metrics
  • Actions: thresholds + alerting
  • Swift key monitoring concepts
  • What to monitor?
  • How to monitor
  • Monitoring methods - demos
  • Logging: ELK
  • Trending/Forecasting: Prometheus + Grafana
  • System monitoring: Zabbix
slide-4
SLIDE 4 4

It’s Linux!

slide-5
SLIDE 5 5

Properties of Swift

  • Distributed system
  • Extremely durable through replication or Erasure Coding
  • No single point of failure
  • Even distribution of data
  • Resilient
  • Self-healing capabilities
  • Can take a lot of abuse and negligence
slide-6
SLIDE 6 6

Anatomy of a Monitoring Solution

  • Agent: Gathers metrics on a host and either pushes or advertises them
  • Logstash
  • Prometheus Node Exporter
  • Zabbix Agent
  • Nagios NRPE
  • Aggregation Engines: Collect metrics from agents and provide an API with access to aggregated metric values
  • Nagios
  • Zabbix
  • Elasticsearch
  • Prometheus
  • Visualizer: Renders graphs in a human-friendly format for easy comprehension of system state
  • Kibana
  • Grafana
  • Alerting: Uses metric thresholds to trigger alerts when metrics fall out of an acceptable range
  • AlertManager
  • PagerDuty
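
As a rough illustration of the roles listed on this slide, the sketch below models an agent that either pushes a sample to an aggregation engine or advertises it for the engine to pull, plus a threshold-based alert check. All class and function names are hypothetical; this is not the API of Logstash, Prometheus, Zabbix, or Nagios.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Hypothetical agent: gathers one metric on a host."""
    host: str
    read_metric: Callable[[], float]  # e.g. a function returning CPU load

    def advertise(self) -> float:
        # Pull model: the aggregator calls this to read the current value.
        return self.read_metric()

    def push(self, aggregator: "Aggregator") -> None:
        # Push model: the agent sends the current value to the aggregator.
        aggregator.receive(self.host, self.read_metric())


@dataclass
class Aggregator:
    """Hypothetical aggregation engine: stores samples per host."""
    samples: Dict[str, List[float]] = field(default_factory=dict)

    def receive(self, host: str, value: float) -> None:
        self.samples.setdefault(host, []).append(value)

    def scrape(self, agent: Agent) -> None:
        # Pull one sample from an agent that advertises its metric.
        self.receive(agent.host, agent.advertise())

    def average(self, host: str) -> float:
        # Aggregated value exposed to visualizers and alerting.
        return mean(self.samples[host])


def out_of_range(value: float, low: float, high: float) -> bool:
    """Alerting rule: True when the value leaves the acceptable range."""
    return not (low <= value <= high)
```

In this toy model, `agent.push(aggregator)` and `aggregator.scrape(agent)` record the same sample; the difference is only which side initiates the transfer.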
slide-7
SLIDE 7 7

Forms of Monitoring

  • System utilization: CPU, memory, disk I/O, network, auditing cycles, replicator timing
  • Performance: Transaction latency
  • Errors: Invalid requests or states
  • Outages: Service failures
  • Feature usage: Understand CRUD operations and traffic patterns
  • Audit trail: Who did what when?

Monitoring Lifecycle

  • Measurement
  • Reporting
  • Characterization
  • Thresholds
  • Alerting
  • Root cause analysis
  • Remediation
  • Manual
  • Automated
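
One way to read the measurement, characterization, thresholds, and alerting steps: collect baseline samples, summarize them, derive a threshold from that summary, then compare new measurements against it. The sketch below assumes a mean-plus-k-standard-deviations rule, which is an illustrative choice and not something the slides prescribe; the function names and sample values are hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def characterize(samples: Sequence[float]) -> Tuple[float, float]:
    """Characterization: summarize baseline measurements as (mean, stdev)."""
    return mean(samples), stdev(samples)


def derive_threshold(samples: Sequence[float], k: float = 3.0) -> float:
    """Threshold: mean plus k sample standard deviations (assumed rule)."""
    mu, sigma = characterize(samples)
    return mu + k * sigma


def should_alert(value: float, limit: float) -> bool:
    """Alerting: fire when a new measurement exceeds the threshold."""
    return value > limit


# Hypothetical baseline, e.g. request latency in milliseconds.
baseline = [10.0, 12.0, 11.0, 9.0, 13.0]
limit = derive_threshold(baseline)
```

Root cause analysis and remediation remain human or automation tasks that start once `should_alert` returns True.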

Developing a Monitoring Strategy

slide-8
SLIDE 8 8

Examples of monitoring methods

  • ELK: Usage intelligence
  • Who?
  • Agents
  • HTTP response codes
  • Errors
  • Audit trails
  • Prometheus: Capacity planning
  • Data growth
  • Trending analytics
  • Zabbix: Operational health
  • Network
  • CPU
  • RAM
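
To make the usage-intelligence items concrete, the sketch below tallies HTTP response codes and CRUD operations from request records that have already been parsed out of logs. The record fields (`method`, `status`) and the method-to-CRUD mapping are assumptions for illustration; in an ELK setup this kind of aggregation would typically happen in Elasticsearch and Kibana rather than in a script.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Assumed mapping from HTTP method to CRUD operation.
CRUD_BY_METHOD = {
    "PUT": "create",
    "GET": "read",
    "HEAD": "read",
    "POST": "update",
    "DELETE": "delete",
}


def summarize(records: Iterable[Dict[str, object]]) -> Tuple[Counter, Counter]:
    """Count response codes and CRUD operations across parsed records."""
    status_counts: Counter = Counter()
    crud_counts: Counter = Counter()
    for record in records:
        status_counts[int(record["status"])] += 1
        method = str(record["method"]).upper()
        crud_counts[CRUD_BY_METHOD.get(method, "other")] += 1
    return status_counts, crud_counts


# Hypothetical records, standing in for parsed proxy log entries.
example_records = [
    {"method": "PUT", "status": 201},
    {"method": "GET", "status": 200},
    {"method": "GET", "status": 404},
    {"method": "DELETE", "status": 204},
]
statuses, operations = summarize(example_records)
```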
slide-9
SLIDE 9 9

Key concepts for monitoring Swift

  • Cluster full
  • df
  • Data growth
  • Capacity planning
  • Networking
  • Availability
  • Saturation
  • Proxy state
  • CPU
  • /healthcheck
  • Auditing cycles
  • Replication cycle timing
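
For the cluster-full and df items, a minimal df-style check might look like the sketch below. It assumes Swift drives are mounted under `/srv/node/` (the mount-point pattern that appears in the alerting rule on the Alerting slide) and uses a hypothetical 80% warning level; both should be adjusted to the actual deployment.

```python
import os
import shutil
from typing import Dict, Iterable, List


def used_fraction(path: str) -> float:
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return usage.used / usage.total


def drives_over(paths: Iterable[str], limit: float = 0.8) -> Dict[str, float]:
    """Return the paths whose used fraction exceeds `limit` (assumed level)."""
    result = {}
    for path in paths:
        fraction = used_fraction(path)
        if fraction > limit:
            result[path] = fraction
    return result


def swift_mounts(root: str = "/srv/node") -> List[str]:
    """List mount points directly under `root`; empty if `root` is absent."""
    if not os.path.isdir(root):
        return []
    candidates = (os.path.join(root, name) for name in os.listdir(root))
    return sorted(p for p in candidates if os.path.ismount(p))
```

A call such as `drives_over(swift_mounts())` would then report the drives on one node that are filling up; cluster-wide fullness still needs these per-node values collected centrally.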
slide-10
SLIDE 10 10

Load balancer health checks against Swift proxy servers

  • Most load balancers run ICMP checks against all IPs in their pools by default
  • Also, consider configuring the load balancer to run TCP checks against Swift’s /healthcheck endpoint

Example:

demo@demo:~$ curl http://swift.swiftstack.oss/healthcheck
OK
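
The same check can be scripted. The sketch below is a minimal illustration of what a load balancer or monitoring probe might do against the /healthcheck endpoint: request the URL and treat an HTTP 200 response with the body OK as healthy. The function names and the 2-second timeout are assumptions.

```python
import http.client
import urllib.request


def is_healthy(status: int, body: bytes) -> bool:
    """Healthy means HTTP 200 with a body of OK (as in the curl example)."""
    return status == 200 and body.strip() == b"OK"


def check_proxy(url: str, timeout: float = 2.0) -> bool:
    """Probe a proxy's /healthcheck URL; any failure counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status, response.read())
    except (OSError, http.client.HTTPException):
        # Covers connection errors, timeouts, and non-2xx responses
        # (urllib raises HTTPError, a subclass of OSError, for those).
        return False
```

For the proxy in the example above, the call would be `check_proxy("http://swift.swiftstack.oss/healthcheck")`.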

slide-11
SLIDE 11 11

Audit trails with ELK

slide-12
SLIDE 12 12

Object size distribution

slide-13
SLIDE 13 13

Distribution of CRUD operations over time

slide-14
SLIDE 14 14

Zabbix triggers for Swift

slide-15
SLIDE 15 15

Zabbix node memory usage

slide-16
SLIDE 16 16

Zabbix drive utilization events

slide-17
SLIDE 17 17

Disk I/O

slide-18
SLIDE 18 18

Object Replicator Operations

slide-19
SLIDE 19 19

Prometheus + Grafana trending and forecasting

slide-20
SLIDE 20 20

Alerting

Example:

ALERT StorageCritical24Hours
  IF sum(predict_linear(node_filesystem_free{job="swiftstack",mountpoint=~"/srv/node/.*"}[1d], 24*3600))
    < sum(node_filesystem_size{job="swiftstack",mountpoint=~"/srv/node/.*"}) * 0.2
  FOR 1h
  LABELS { group="storage_admin", severity="critical" }

Translation: Send a critical alert to all members of the storage_admin group if the total available storage capacity is projected to fall below 20% of the total storage capacity within the next 24 hours and that forecast has held true for at least 1 hour, recalculating every 5 minutes (per server config, not shown).
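
The forecast in this rule rests on linear extrapolation: Prometheus's `predict_linear` fits a least-squares line to the samples in the range and evaluates it a given number of seconds ahead. The sketch below reproduces that idea in plain Python for a single series. It is a simplified illustration with hypothetical function names and sample data, not Prometheus's implementation; in particular, it extrapolates from the last sample's timestamp rather than from the rule evaluation time.

```python
from typing import Sequence, Tuple


def predict_linear(samples: Sequence[Tuple[float, float]],
                   seconds_ahead: float) -> float:
    """Least-squares forecast from (timestamp, value) pairs.

    Returns the fitted value `seconds_ahead` after the last timestamp.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    target_time = samples[-1][0] + seconds_ahead
    return intercept + slope * target_time


def storage_critical(free_forecast: float, total_size: float,
                     min_free_ratio: float = 0.2) -> bool:
    """Mirror of the rule's comparison: forecast free < 20% of total."""
    return free_forecast < total_size * min_free_ratio


# Hypothetical series: free space drops by 100 units per hour.
history = [(0.0, 1000.0), (3600.0, 900.0), (7200.0, 800.0)]
forecast = predict_linear(history, 24 * 3600)
```

The `FOR 1h` clause has no counterpart here; it would correspond to requiring `storage_critical` to stay true across every evaluation for an hour before notifying anyone.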

slide-21
SLIDE 21 21

Q&A / Demo

slide-22
SLIDE 22 22

Thank you!
