Efficient Monitoring and Root Cause Analysis in Complex Systems



slide-1
SLIDE 1

Efficient Monitoring and Root Cause Analysis in Complex Systems

Witek Bedyk

slide-2
SLIDE 2

Agenda

  • Benefits of robust monitoring
  • Measurements vs. Alarms
  • Importance of Alarms Correlation
  • Effective Alerting
  • Self-healing
slide-3
SLIDE 3

Why is Monitoring useful?

  • Improve system / application uptime
  • Reduce administration burden
  • Resource optimization
  • Prevent bottlenecks
  • Make use of collected data (e.g. billing)
slide-5
SLIDE 5

Use Case

Customer escalation: “We have cloud outage! Keystone is flapping up and down continuously and many requests get 503 service unavailable error.”

slide-6
SLIDE 6

Healthcheck

Simple up-or-down checks on the HTTP endpoints of services. Metrics collected:

  • http_status [0, 1]
  • http_response_time
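A health check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the monitoring agent's implementation. The function name `http_check` and the injectable `opener` parameter are hypothetical. The convention that 0 means "up" and 1 means "down" is an assumption, chosen to agree with the alarm expression `http_status > 0` shown later in the deck.

```python
import time
import urllib.error
import urllib.request


def http_check(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (http_status, http_response_time) for one endpoint.

    http_status is 0 when the endpoint answered with a non-error status
    and 1 otherwise (assumed convention, see the lead-in).
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = 0 if response.status < 400 else 1
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        status = 1
    elapsed = time.monotonic() - start
    return status, elapsed
```

Calling `http_check` with the URL of a service endpoint returns both measurements from a single request. The `opener` argument exists only so that the check can be exercised without a network.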

slide-7
SLIDE 7

Metrics

  • Metrics measure and report on quantifiable data from your system
  • CPU, memory, network, filesystem, disk I/O
  • Services:
    ○ MySQL, RabbitMQ, Apache, MemcacheD, etc.
  • LibVirt, Open vSwitch
  • Applications:
    ○ StatsD, Prometheus
  • Custom checks
slide-8
SLIDE 8

Dimensions

  • Dimensions are a dictionary of key-value pairs used to describe metrics.
  • hostname
  • service
  • component
  • url
  • device
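To make the idea concrete, the sketch below models a measurement as a name, a value and a dictionary of dimensions, and selects measurements by dimension. The `Metric` class, the `matches` helper, the host names and the URLs are all hypothetical and serve only as illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Metric:
    name: str
    value: float
    dimensions: dict = field(default_factory=dict)


def matches(metric, name, **wanted):
    """True if the metric has the given name and every wanted dimension."""
    return metric.name == name and all(
        metric.dimensions.get(key) == val for key, val in wanted.items()
    )


# Hypothetical sample data: the same metric reported from two hosts.
metrics = [
    Metric("http_status", 1, {"hostname": "node-a", "service": "keystone",
                              "url": "http://node-a:5000/v3"}),
    Metric("http_status", 0, {"hostname": "node-b", "service": "keystone",
                              "url": "http://node-b:5000/v3"}),
    Metric("process.cpu_perc", 12.5, {"hostname": "node-a",
                                      "service": "keystone"}),
]

# Select one metric name, narrowed down by a dimension.
failing_hosts = [
    m.dimensions["hostname"]
    for m in metrics
    if matches(m, "http_status", service="keystone") and m.value > 0
]
```

Because the dimensions are plain key-value pairs, the same metric name can be sliced by host, service or URL without defining a separate metric for each combination.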
slide-9
SLIDE 9

Transaction-level vs. System-level metrics

  • Transaction-level: end user perspective

○ Is Horizon working correctly?

  • System-level: administrator perspective

○ Reveals failures of service components

slide-10
SLIDE 10

Dependencies

[Diagram: dependencies between MySQL, MemcacheD, Keystone and Apache]

slide-11
SLIDE 11

Gathered metrics

  • http_status
  • http_response_time
  • apache.net.hits
  • apache.performance.idle_worker_count
  • mysql.performance.open_files
  • mysql.net.connections
  • memcache.curr_connections
  • memcache.get_misses_rate
  • process.cpu_perc
  • process.open_file_descriptors

slide-12
SLIDE 12

Dashboards

slide-13
SLIDE 13

Alarms

Status of the system or resource meets criteria indicating an action is required.

slide-14
SLIDE 14

Alarm definitions

  • Alarm definitions are templates specifying how alarms should be created.
  • Grouping:
    ○ http_status > 0, match_by: ["service", "component", "hostname", "url"]
  • Filtering:
    ○ avg(cpu.idle_perc{service=monitoring}) < 20
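The two expressions above can be read as follows: grouping creates one alarm per distinct combination of the `match_by` dimensions, while filtering restricts an expression to measurements carrying a given dimension value. The Python sketch below imitates that behaviour on in-memory data. It is an illustration under stated assumptions, not the alarm engine itself: the function names and sample values are hypothetical, and the grouped alarm is evaluated on the latest value only, which is a simplification.

```python
from statistics import mean

MATCH_BY = ["service", "component", "hostname", "url"]


def http_status_alarms(measurements, match_by=MATCH_BY):
    """One alarm per dimension combination, based on the latest value."""
    latest = {}
    for m in measurements:  # measurements are assumed to be in time order
        if m["name"] == "http_status":
            key = tuple(m["dimensions"].get(d) for d in match_by)
            latest[key] = m["value"]
    return {key: "ALARM" if value > 0 else "OK" for key, value in latest.items()}


def cpu_idle_alarm(measurements, service, threshold=20):
    """avg(cpu.idle_perc{service=...}) < threshold over matching measurements."""
    values = [
        m["value"]
        for m in measurements
        if m["name"] == "cpu.idle_perc"
        and m["dimensions"].get("service") == service
    ]
    return "ALARM" if values and mean(values) < threshold else "OK"


# Hypothetical sample measurements.
measurements = [
    {"name": "http_status", "value": 1,
     "dimensions": {"service": "identity", "component": "keystone-api",
                    "hostname": "node-a", "url": "http://node-a:5000/v3"}},
    {"name": "http_status", "value": 0,
     "dimensions": {"service": "identity", "component": "keystone-api",
                    "hostname": "node-b", "url": "http://node-b:5000/v3"}},
    {"name": "cpu.idle_perc", "value": 10.0,
     "dimensions": {"service": "monitoring", "hostname": "node-a"}},
    {"name": "cpu.idle_perc", "value": 20.0,
     "dimensions": {"service": "monitoring", "hostname": "node-b"}},
]
```

With this data, `http_status_alarms(measurements)` yields two separate alarms, one per host, whereas `cpu_idle_alarm(measurements, "monitoring")` yields a single alarm computed over both hosts.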
slide-15
SLIDE 15

Use case (alarms)

  • Keystone API is down on node A.
  • Keystone API is down on node A.
  • Keystone API is down on node A.
  • Keystone API is down on node A.
  • Keystone API is up on node A.
  • Keystone API is up on node A.
  • MemcacheD number of connections is high on node A.
  • Keystone API is up on node A.
  • Keystone API is up on node A.
  • MemcacheD hit rate is low on node A.

slide-16
SLIDE 16

Alarms correlation

  • “80% of the mean time to repair is wasted on trying to locate the issue” (Gartner)
  • Remove noise from the environment
  • Alerts should:
    ○ be meaningful
    ○ be actionable
    ○ indicate the point of failure
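As a small illustration of noise removal, the sketch below collapses consecutive identical notifications into a single entry with a repeat count. It uses the notification stream from the alarms use case earlier in the deck. The `collapse` helper is hypothetical and shows only the simplest form of de-duplication, not a correlation engine.

```python
from itertools import groupby

notifications = (
    ["Keystone API is down on node A."] * 4
    + ["Keystone API is up on node A."] * 2
    + ["MemcacheD number of connections is high on node A."]
    + ["Keystone API is up on node A."] * 2
    + ["MemcacheD hit rate is low on node A."]
)


def collapse(stream):
    """Merge runs of identical consecutive notifications into (text, count)."""
    return [(text, len(list(run))) for text, run in groupby(stream)]


summary = collapse(notifications)
```

Ten notifications shrink to five entries. Deciding which of those entries points at the actual failure is still left to the operator, and that is the gap alarm correlation is meant to close.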

slide-17
SLIDE 17

Vitrage

  • OpenStack Root Cause Analysis service
  • organize alarms
    ○ define relationships between alarms
    ○ represent as an entity graph
  • analyze
    ○ represent system health
  • find root cause
    ○ graphical visualization
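The idea of finding a root cause in a graph of related alarms can be sketched as a walk along "caused by" edges until alarms with no known cause are reached. The code below is plain Python and does not use the Vitrage API. The alarm names and the causal links between them are hypothetical examples.

```python
def root_causes(alarm, caused_by):
    """Follow 'caused by' edges back to the alarms that have no known cause."""
    roots, stack, seen = set(), [alarm], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against cycles in the graph
        seen.add(current)
        parents = caused_by.get(current, [])
        if parents:
            stack.extend(parents)
        else:
            roots.add(current)
    return roots


# Hypothetical causal relationships: effect -> list of direct causes.
caused_by = {
    "Keystone cluster degraded": ["Keystone API down on node A"],
    "Keystone API down on node A": ["MemcacheD connections high on node A"],
}

result = root_causes("Keystone cluster degraded", caused_by)
```

Here `result` contains only the MemcacheD alarm, so the operator can be pointed at the start of the causal chain instead of at every symptom along it.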

slide-18
SLIDE 18

Dependencies

[Diagram: dependencies between MySQL, MemcacheD, Keystone and Apache]

slide-19
SLIDE 19

Dependencies

[Diagram: Keystone cluster, Keystone instances and MemcacheD]

slide-25
SLIDE 25

Monitor Analyze Plan Execute (MAPE)

[Diagram: MAPE loop. The Monitor, Analyze, Plan and Execute stages surround a Managed Resource, which is observed through Sensors and changed through Effectors]
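One pass through the loop can be sketched as four steps in sequence. This is a minimal illustration: the function name, the connection threshold and the `restart_memcached` action are hypothetical.

```python
def mape_iteration(read_sensor, apply_effector, max_connections=100):
    """Run one Monitor, Analyze, Plan, Execute pass and return the plan."""
    connections = read_sensor()                          # Monitor
    overloaded = connections > max_connections           # Analyze
    plan = ["restart_memcached"] if overloaded else []   # Plan
    for action in plan:                                  # Execute
        apply_effector(action)
    return plan
```

The sensor and the effector are passed in as callables, so the same loop body can be run against a real resource or against stubs.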

slide-27
SLIDE 27

Vitrage Templates

  • Vitrage Templates are used to express Condition → Action scenarios.
  • if <condition> then raise deduced alarm
  • if <condition> then set deduced state
  • if <condition> then add causal relationship (used for RCA capability)
  • if <condition> then execute Mistral workflow
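The Condition → Action pattern can be illustrated with a tiny rule evaluator. The sketch below is plain Python and does not reproduce the Vitrage template syntax. The alarm names, action labels and workflow name are hypothetical.

```python
def evaluate(rules, state):
    """Return the actions of every rule whose condition holds for the state."""
    actions = []
    for condition, action in rules:
        if condition(state):
            actions.append(action(state))
    return actions


# Each rule is a (condition, action) pair over the set of active alarms.
rules = [
    (lambda s: "memcached_connections_high" in s["alarms"],
     lambda s: ("raise_deduced_alarm", "keystone_degraded")),
    (lambda s: {"memcached_connections_high", "keystone_api_down"} <= s["alarms"],
     lambda s: ("add_causal_relationship",
                "memcached_connections_high", "keystone_api_down")),
    (lambda s: "keystone_api_down" in s["alarms"],
     lambda s: ("execute_workflow", "restart_memcached")),
]

state = {"alarms": {"memcached_connections_high", "keystone_api_down"}}
triggered = evaluate(rules, state)
```

Keeping conditions and actions as separate callables mirrors the template idea: new scenarios are added as data, without changing the evaluator.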
slide-28
SLIDE 28

Self-healing

[Diagram: Keystone cluster, Keystone instances and MemcacheD]

slide-32
SLIDE 32

OpenStack Healthcheck APIs

  • more detailed checks would be useful for most OpenStack services
  • common middleware should get implemented in Oslo
  • existing old effort:

    ○ https://storyboard.openstack.org/#!/story/2001439
    ○ https://review.opendev.org/617924

slide-33
SLIDE 33

Summary

  • Robust monitoring is essential
  • Measurements vs. Alarms
  • Importance of Alarms Correlation
  • Self-healing
slide-34
SLIDE 34

Thank You 谢谢

Questions and Answers