Hawkular Metrics: Metric Storage & Alerting (Stefan Negrea)



SLIDE 1

Hawkular Metrics

Metric Storage & Alerting
Stefan Negrea

SLIDE 2

About Me

Co-Creator of Hawkular Metrics

SLIDE 3

  • Introduction to Hawkular Metrics
  • Hawkular Demo
  • Hawkular Metrics & Alerting

SLIDE 4

Pre-History

  • 2006 JBoss Operations Network 1.0
  • 2008 Project RHQ
      ○ JBoss Operations Network 2.0
      ○ Metrics stored in Postgres

SLIDE 5

Pre-History

SLIDE 6

Pre-History

  • 2012 - 2013 RHQ Storage Nodes
      ○ Cassandra based
      ○ Store metrics
  • 2014 RHQ Metrics

SLIDE 7

Hawkular

It’s a hawk with a monocular. Hawks are known for their very sharp vision and are very good hunters: they catch prey by anticipating its movements at very high speed. The goal is to monitor and catch anomalies in fast-paced environments.

All* projects are licensed under the Apache License 2.0.

SLIDE 8

History

  • 2014 Hawkular organization formed
  • 2014 Hawkular Alerting started
  • 02/2015 RHQ Metrics joins Hawkular org
  • 12/2015 Hawkular Metrics integrated in OpenShift Origin v3
  • 10/2016 Hawkular Metrics includes Hawkular Alerting

SLIDE 9

Hawkular Metrics

Hawkular Metrics is a storage engine for metric data.

  • metric data = a measurement taken at a specific time
  • storage engine = stores metrics efficiently for their useful lifetime

SLIDE 10

Supported Metrics

  • Gauge
      ○ number
      ○ varies (not monotonic)
      ○ rate of change
  • Counter
      ○ integer
      ○ monotonic (increasing or decreasing)
      ○ rate of change
      ○ support for reset

Memory usage
    (metric1, 4.5, 1493301898245)
    (metric1, 5.6, 1493301898246)
    (metric1, 1.2, 1493301898247)

Number of visitors
    (metric2, 4, 1493301898248)
    (metric2, 5, 1493301898249)
    (metric2, 9, 1493301898250)
    (metric2, 0, 1493301898251)
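The "rate of change" and "support for reset" items can be illustrated with the visitor data points above. This is a minimal sketch, not Hawkular's implementation: it assumes an increasing counter, treats any drop in value as a reset, and reports a per-second rate. The `DataPoint` type and `counter_rates` function are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Optional


class DataPoint(NamedTuple):
    metric_id: str
    value: float
    timestamp_ms: int  # milliseconds since the Unix epoch


def counter_rates(points: List[DataPoint]) -> List[Optional[float]]:
    """Per-second rate between consecutive points.

    Assumption for this sketch: a drop in value means the counter was
    reset, so that interval yields None instead of a negative rate.
    """
    rates: List[Optional[float]] = []
    for prev, cur in zip(points, points[1:]):
        elapsed_s = (cur.timestamp_ms - prev.timestamp_ms) / 1000.0
        delta = cur.value - prev.value
        if delta < 0 or elapsed_s <= 0:
            rates.append(None)  # reset (or unusable interval)
        else:
            rates.append(delta / elapsed_s)
    return rates


# The "Number of visitors" data points from the slide.
visitors = [
    DataPoint("metric2", 4, 1493301898248),
    DataPoint("metric2", 5, 1493301898249),
    DataPoint("metric2", 9, 1493301898250),
    DataPoint("metric2", 0, 1493301898251),
]
rates = counter_rates(visitors)
```

The last interval (9 to 0) is reported as `None`, which is one way to keep a reset from showing up as a large negative rate.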

SLIDE 11

Supported Metrics

  • Availability
      ○ Availability of a resource
      ○ up, down, or unknown
      ○ can compute interesting stats based on values
  • String
      ○ just that
      ○ possible uses: logs, events, config

Server status
    (metric3, UP, 1493301898253)
    (metric3, DOWN, 1493301898254)
    (metric3, UP, 1493301898255)

Value of configuration key ‘k’
    (metric4, “k=v”, 1493301898256)
    (metric4, “k=t”, 1493301898257)
    (metric4, “k=1”, 1493301898258)
    (metric4, “k=4”, 1493301898259)
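As an illustration of the stats that can be computed from availability values, the sketch below derives uptime and downtime ratios from the server status data points above. It assumes each state holds until the next data point and that the caller supplies the end of the window; `availability_stats` and the chosen `end_ms` are hypothetical, not part of Hawkular's API.

```python
def availability_stats(points, end_ms):
    """Summarize availability over [first timestamp, end_ms).

    points: list of (timestamp_ms, state) sorted by time, where state is
    "UP", "DOWN" or "UNKNOWN". Assumption: a state lasts until the next
    data point (or until end_ms for the last one).
    """
    durations = {"UP": 0, "DOWN": 0, "UNKNOWN": 0}
    next_timestamps = [ts for ts, _ in points[1:]] + [end_ms]
    for (ts, state), next_ts in zip(points, next_timestamps):
        durations[state] += next_ts - ts
    total = sum(durations.values())
    return {
        "uptime_ratio": durations["UP"] / total if total else 0.0,
        "downtime_ratio": durations["DOWN"] / total if total else 0.0,
        "downtime_duration_ms": durations["DOWN"],
    }


# The "Server status" data points from the slide.
status = [
    (1493301898253, "UP"),
    (1493301898254, "DOWN"),
    (1493301898255, "UP"),
]
stats = availability_stats(status, end_ms=1493301898257)
```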

SLIDE 12

SLIDE 13

Cassandra - Storage

Management & Support

  • Highly available, fault tolerant
  • No specialized node roles
  • Minimal configuration

Performance & Scalability

  • Optimized for writes
  • Data compression
  • Indexing

SLIDE 14

Cassandra - Storage

  • CQL based
  • Partitioning & indexing of data based on usage
  • Use built-in compression & TTL
  • Use the DataStax driver, fully async
  • Support for latest C* 3.0.x release
  • Keep updating to latest stable
  • Use multiple tables for indexing

SLIDE 15

App Layer

  • REST API with JSON
  • JAX-RS 2.0 (async spec)
  • Fully async = JAX-RS 2.0 async + RxJava + async C* driver
  • Stateless** server (Metrics, mostly)
  • Minimal clustering via Infinispan
  • Schema Management
  • Easy to use
      ○ packaged distribution with WildFly
      ○ download and run, only JDK required

SLIDE 16

Performance - Sample

C*: 4 CPU, 4GB; Hawkular: 4 CPU, 4GB
message sizes:
    10 datapoints: 2592 req/sec => 25920 datapoints/sec
    100 datapoints: 365 req/sec => 36500 datapoints/sec
    5000 datapoints: 7.6 req/sec => 38000 datapoints/sec

C*: 8 CPU, 8GB; Hawkular: 8 CPU, 4GB
message sizes:
    10 datapoints: 4655 req/sec => 46550 datapoints/sec
    100 datapoints: 604 req/sec => 60400 datapoints/sec
    5000 datapoints: 15 req/sec => 75000 datapoints/sec
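The datapoints/sec figures in the performance sample are simply the message size multiplied by the request rate. The short sketch below reproduces that arithmetic from the numbers on the slide; the dictionary layout and the function name are illustrative choices only.

```python
# (datapoints per message, requests per second), copied from the slide.
samples = {
    "C* 4 CPU/4GB, Hawkular 4 CPU/4GB": [(10, 2592), (100, 365), (5000, 7.6)],
    "C* 8 CPU/8GB, Hawkular 8 CPU/4GB": [(10, 4655), (100, 604), (5000, 15)],
}


def datapoints_per_sec(message_size, requests_per_sec):
    """Ingest throughput = data points per message * messages per second."""
    return message_size * requests_per_sec


throughput = {
    config: [round(datapoints_per_sec(size, rate)) for size, rate in rows]
    for config, rows in samples.items()
}

for config, values in throughput.items():
    print(config, values)
```

In both configurations, larger messages give higher overall throughput even though the request rate drops.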

SLIDE 17

Features

  • Multi-tenant
      ○ tenant id required on each request (HAWKULAR-TENANT header)
      ○ no way to get data from multiple tenants at once
  • Can insert data without pre-creating metrics
  • Data is compressed using Gorilla compression
      ○ 2 hour time window
      ○ further reduces disk footprint
      ○ LZ4 enabled in Cassandra
      ○ Load testing:
          ■ 5000 data points/sec for 5 days = 26GB
          ■ 83M data points ~ 1GB of disk space
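The two load-testing figures are consistent with each other, as the following worked arithmetic shows. The bytes-per-point estimate is an extra derived number and assumes decimal gigabytes (10^9 bytes), which the slide does not specify.

```python
SECONDS_PER_DAY = 24 * 60 * 60

points_per_sec = 5000
days = 5
disk_gb = 26

# Total data points written during the load test.
total_points = points_per_sec * days * SECONDS_PER_DAY

# Data points stored per GB of disk, to compare with the "83M" figure.
points_per_gb = total_points / disk_gb

# Rough on-disk cost per data point; assumes 1 GB = 10**9 bytes.
bytes_per_point = disk_gb * 10**9 / total_points

print(f"total points: {total_points:,}")
print(f"points per GB: {points_per_gb:,.0f}")
print(f"bytes per point: {bytes_per_point:.1f}")
```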

SLIDE 18

Features

  • Bulk insertion endpoint for metrics and data
  • Tagging support for metrics and single data points
      ○ key, value; multi-tag support
          ■ tag1 = d
      ○ metrics queryable via TQL (tag query language)
          ■ AND, OR, NOT
          ■ grouping
          ■ wildcard matching
          ■ a1 = 'd' OR ( a1 != 'ab' AND c1 )
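To make the example query `a1 = 'd' OR ( a1 != 'ab' AND c1 )` concrete, the sketch below evaluates the same logic against plain tag dictionaries. It is not a TQL parser and does not reflect how Hawkular evaluates queries; in particular, it assumes that a bare `c1` means "tag c1 is present" and that `a1 != 'ab'` requires the tag `a1` to exist. The metric names and tag values are made up.

```python
def matches(tags):
    """Hand-written equivalent of: a1 = 'd' OR ( a1 != 'ab' AND c1 )."""
    a1_is_d = tags.get("a1") == "d"
    # Assumption: the inequality only matches metrics that carry tag a1.
    a1_is_not_ab = "a1" in tags and tags["a1"] != "ab"
    # Assumption: a bare tag name means the tag exists, with any value.
    has_c1 = "c1" in tags
    return a1_is_d or (a1_is_not_ab and has_c1)


# Hypothetical metrics and their tags.
metrics = {
    "cpu0": {"a1": "d"},
    "cpu1": {"a1": "ab", "c1": "x"},
    "cpu2": {"a1": "zz", "c1": "x"},
    "cpu3": {"c1": "x"},
}
selected = sorted(name for name, tags in metrics.items() if matches(tags))
```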

SLIDE 19

Features - Simple REST API

  • Endpoint for each metric type
      ○ /gauges, /availability, /counters, /strings
      ○ Each metric type has almost identical endpoints
  • Raw data - /gauges/raw
  • Raw data for single metric - /strings/{metric_id}/raw
  • Query time aggregation
      ○ multiple metrics - /availability/stats
      ○ single metric - /counters/{metric_id}/stats
  • Bulk operations - /metrics

** String metrics do not have stats (yet?)
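The endpoint layout above can be sketched with the Python standard library. The code only builds the requests and does not send them. Several details are assumptions rather than facts from the slides: the base URL `http://localhost:8080/hawkular/metrics`, the tenant id `test`, and the JSON body shape (a list of objects with `timestamp` and `value`). The helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumptions: server location, REST base path and tenant id.
BASE_URL = "http://localhost:8080/hawkular/metrics"
TENANT_ID = "test"


def _headers():
    # The tenant id must accompany every request.
    return {"Hawkular-Tenant": TENANT_ID, "Content-Type": "application/json"}


def raw_url(metric_type, metric_id):
    """URL of the raw-data endpoint for a single metric."""
    return f"{BASE_URL}/{metric_type}/{quote(metric_id, safe='')}/raw"


def read_raw_request(metric_type, metric_id):
    """Build a GET request for the raw data of one metric."""
    return Request(raw_url(metric_type, metric_id), headers=_headers(), method="GET")


def write_raw_request(metric_type, metric_id, datapoints):
    """Build a POST request carrying data points (assumed JSON shape)."""
    body = json.dumps(datapoints).encode("utf-8")
    return Request(raw_url(metric_type, metric_id), data=body,
                   headers=_headers(), method="POST")


read_req = read_raw_request("gauges", "metric1")
write_req = write_raw_request(
    "gauges", "metric1", [{"timestamp": 1493301898245, "value": 4.5}]
)
```

Against a running server, either request could be passed to `urllib.request.urlopen`; because each metric type has almost identical endpoints, only the `metric_type` argument should need to change.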

SLIDE 20

Features - Aggregation & Rate

  • Query Time Aggregation
      ○ Combine multiple metrics and get statistical data
      ○ Gauge and counter: average, median, percentile, sum
      ○ Availability: ratios for uptime and downtime, downtime duration
      ○ Time Slicing: first group data, then compute stats
      ○ Single or multiple metrics
  • Rate
      ○ available for gauges and counters
      ○ rate of change of the values for the timespan
      ○ ex: how fast is the number of total requests increasing
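Time slicing ("first group data, then compute stats") can be sketched as follows: data points are grouped into fixed-width buckets, and statistics are then computed per bucket. This is only an illustration of the idea, not Hawkular's aggregation code. The `bucket_stats` function, the bucket width and the sample values are hypothetical, and percentiles are left out for brevity.

```python
import statistics
from collections import defaultdict


def bucket_stats(points, start_ms, bucket_ms):
    """Group (timestamp_ms, value) pairs into buckets, then compute stats."""
    buckets = defaultdict(list)
    for timestamp_ms, value in points:
        # Step 1: assign each data point to its time slice.
        buckets[(timestamp_ms - start_ms) // bucket_ms].append(value)

    result = []
    for index in sorted(buckets):
        values = buckets[index]
        # Step 2: compute the statistics for that slice.
        result.append({
            "start": start_ms + index * bucket_ms,
            "samples": len(values),
            "avg": statistics.mean(values),
            "median": statistics.median(values),
            "sum": sum(values),
        })
    return result


# Hypothetical gauge readings, sliced into one-minute buckets.
points = [(0, 4.5), (10_000, 5.6), (20_000, 1.2), (60_000, 2.0), (70_000, 4.0)]
stats = bucket_stats(points, start_ms=0, bucket_ms=60_000)
```

The same function works for several metrics at once if their data points are merged into one list before bucketing.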

SLIDE 21

Metrics + Alerting

  • Natural fit: collect data and then alert on anomalies
  • Two ways to alert on metric data
      ○ Dedicated API for setting up alerts; incoming data is filtered and processed by the alerting engine
      ○ Metrics Alerter that queries single or multiple metrics; no need to predefine alert triggers ahead of time

SLIDE 22

Alerting Features

  • Single and group Triggers
  • Template triggers
  • Complex conditions
  • Dampening
  • Auto-resolve/auto-disable triggers
  • Pluggable notifiers
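Dampening limits how often a trigger fires when a condition flaps. The sketch below shows one common form, firing only after a number of consecutive positive evaluations. It illustrates the general idea under that assumption and is not Hawkular Alerting's implementation; the class name, the threshold and the sample readings are hypothetical.

```python
class ConsecutiveDampening:
    """Fire only after `required` consecutive evaluations are true."""

    def __init__(self, required):
        self.required = required
        self.streak = 0

    def evaluate(self, condition_met):
        # Any false evaluation breaks the streak.
        self.streak = self.streak + 1 if condition_met else 0
        if self.streak >= self.required:
            self.streak = 0  # start over after firing
            return True
        return False


# Hypothetical CPU readings, evaluated against the condition "above 90".
cpu_readings = [95, 97, 60, 92, 96, 99, 98]
dampening = ConsecutiveDampening(required=3)
fired = [dampening.evaluate(reading > 90) for reading in cpu_readings]
```

With these readings the trigger fires once, on the sixth value, because the dip to 60 breaks the first streak.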

SLIDE 23

SLIDE 24

Roadmap - 2017

  • Automatic & persisted aggregation
  • Management capabilities for the Cassandra cluster
  • Query language
  • Performance improvements
      ○ already have a good baseline, but can do better
      ○ read/write

SLIDE 25

Demo

SLIDE 26

Demo

  • Install ccm
      ○ https://github.com/pcmanus/ccm
  • Start a single node C* cluster
      ○ ccm create -v 3.0.12 -n 1 -s hawkular
  • Download, extract and start Hawkular Metrics
      ○ https://origin-repository.jboss.org/nexus/content/groups/public/org/hawkular/metrics/hawkular-metrics-wildfly-standalone/0.26.1.Final/
      ○ bin/standalone -b 0.0.0.0
  • Download, extract and start Grafana
  • Download, install, and configure the Hawkular plugin for Grafana
      ○ https://grafana.com/plugins/hawkular-datasource/installation
      ○ https://github.com/hawkular/hawkular-grafana-datasource
          i. pick a tenant id of your choice

SLIDE 27

Demo

  • Install the Hawkular Metrics Python client via pip
      ○ pip install hawkular-client
  • Install psutil to collect CPU stats
      ○ pip install psutil
  • Create a custom agent (using the Python client)
      ○ make sure you use the same tenant id configured with Grafana
      ○ pre-create and tag a metric for each CPU
      ○ collect CPU usage every 10 seconds
      ○ send the data to Hawkular Metrics

SLIDE 28

Demo

    #!/usr/bin/env python3
    # Minimal agent: report per-CPU usage to Hawkular Metrics as gauges.
    import time

    import psutil
    from hawkular.metrics import HawkularMetricsClient, MetricType

    # Use the same tenant id that is configured in Grafana.
    client = HawkularMetricsClient(tenant_id='test')

    # Pre-create one gauge per CPU, tagged with the CPU name.
    cpu_percent = psutil.cpu_percent(interval=1, percpu=True)
    for index, cpu in enumerate(cpu_percent):
        client.create_metric_definition(MetricType.Gauge, 'cpu%s' % index, cpu='cpu%s' % index)

    # Sample every CPU, push the readings, then wait 10 seconds.
    while True:
        cpu_percent = psutil.cpu_percent(interval=1, percpu=True)
        for index, cpu in enumerate(cpu_percent):
            client.push(MetricType.Gauge, 'cpu%s' % index, float(cpu))
        time.sleep(10)

SLIDE 29

Resources

  • Web - http://www.hawkular.org/
  • Github - https://github.com/hawkular
  • Metrics Documentation - http://www.hawkular.org/tags/metrics.html
  • Alerting Documentation - http://www.hawkular.org/tags/alerts.html
  • Twitter - https://twitter.com/hawkular_org

SLIDE 30

Thank you!

hawkular.org
#hawkular (on freenode)

snegrea@redhat.com