SLIDE 1

Migrating from Nagios to Prometheus

NOV 07, 2019

SLIDE 2

Runtastic Infrastructure

  • Base: Linux (Ubuntu), SDN (Cisco), Chef, Terraform
  • Virtualization: Linux KVM, OpenNebula, 3600 CPU Cores, 20 TB Memory, 100 TB Storage
  • Core DBs: Physical, Hybrid, Big
  • Technologies: Really a lot, open source

SLIDE 3

Our Monitoring back in 2017...

  • Nagios

○ Many Checks for all Servers
○ Checks for NewRelic

  • Pingdom

○ External HTTP Checks
○ Specific Nagios Alerts
○ Alerting via SMS

  • NewRelic

○ Error Rate
○ Response Time

SLIDE 4

Configuration hell...

SLIDE 5

Alert overflow...

SLIDE 6

Goals for our new Monitoring system

  • Make On Call as comfortable as possible
  • Automate as much as possible
  • Make use of graphs
  • Rework our alerting
  • Make it scalable!
SLIDE 7

Starting with Prometheus...

SLIDE 8

Prometheus

SLIDE 9

Our Prometheus Setup

  • 2x Bare Metal
  • 8 Core CPU
  • Ubuntu Linux
  • 7.5 TB of Storage
  • 7 months of Retention time
  • Internal TSDB
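A retention window of this length is set at startup rather than in prometheus.yml. A minimal sketch of the relevant flags, assuming Prometheus 2.x (flag names changed between versions, and the paths are illustrative, not from the talk):

```
--config.file=/etc/prometheus/prometheus.yml
--storage.tsdb.path=/data/prometheus
--storage.tsdb.retention.time=214d    # roughly 7 months
```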
SLIDE 10

Automation

SLIDE 11

Our Goals for Automation

  • Roll out Exporters on new servers automatically

○ using Chef

  • Use Service Discovery in Prometheus

○ using Consul

  • Add HTTP Healthcheck for a new Microservice

○ using Terraform

  • Add Silences with 30d duration

○ using Terraform

SLIDE 12

Consul

  • Consul for our Terraform State
  • Agent Rollout via Chef
  • One Service definition per Exporter on each Server
SLIDE 13

Consul

SLIDE 14

What Labels do we need?

  • What’s the Load of all workers of our Newsfeed service?

○ node_load1{service="newsfeed", role="workers"}

  • What’s the Load of a specific Leaderboard server?

○ node_load1{hostname="prd-leaderboard-server-001"}
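With service, role, and hostname attached as labels, broader questions become one-liners too; a hypothetical follow-up query (not from the talk):

```
avg by (role) (node_load1{service="newsfeed"})
```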

SLIDE 15

...and how we implemented them in Consul

{
  "service": {
    "name": "prd-sharing-server-001-mongodbexporter",
    "tags": [
      "prometheus",
      "role:trinidad",
      "service:sharing",
      "exporter:mongodb"
    ],
    "port": 9216
  }
}

SLIDE 16

Scrape Configuration

- job_name: prd
  consul_sd_configs:
    - server: 'prd-consul:8500'
      token: 'ourconsultoken'
      datacenter: 'lnz'
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: .*,prometheus,.*
      action: keep
    - source_labels: [__meta_consul_node]
      target_label: hostname
    - source_labels: [__meta_consul_tags]
      regex: .*,service:([^,]+),.*
      replacement: '${1}'
      target_label: service
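Prometheus flattens a target's Consul tags into the single label __meta_consul_tags, joined and wrapped with commas, which is why the fully anchored .*,tag,.* patterns above work. A small sketch of that matching, using the tag string from the Consul service definition on the previous slide:

```python
import re

# Prometheus joins Consul tags into one comma-wrapped string, e.g.:
meta_consul_tags = ",prometheus,role:trinidad,service:sharing,exporter:mongodb,"

# The 'keep' action drops the target unless the anchored regex matches.
keep = re.fullmatch(r".*,prometheus,.*", meta_consul_tags) is not None

# Extract the service name the same way the service relabel rule does.
service = re.fullmatch(r".*,service:([^,]+),.*", meta_consul_tags).group(1)

print(keep, service)  # True sharing
```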

SLIDE 17

External Health Checks

  • 3x Blackbox Exporters
  • Accessing SSL Endpoints
  • Checks for

○ HTTP Response Code
○ SSL Certificate
○ Duration

SLIDE 18

Add Healthcheck via Terraform

resource "consul_service" "health_check" {
  name = "${var.srv_name}-healthcheck"
  node = "blackbox_aws"
  tags = [
    "healthcheck",
    "url:https://status.runtastic.com/${var.srv_name}",
    "service:${var.srv_name}",
  ]
}

SLIDE 19

Job Config for Blackbox Exporters

- job_name: blackbox_aws
  metrics_path: /probe
  params:
    module: [http_health_monitor]
  consul_sd_configs:
    - server: 'prd-consul:8500'
      token: 'ourconsultoken'
      datacenter: 'lnz'
  relabel_configs:
    - source_labels: [__meta_consul_tags]
      regex: .*,healthcheck,.*
      action: keep
    - source_labels: [__meta_consul_tags]
      regex: .*,url:([^,]+),.*
      replacement: '${1}'
      target_label: __param_target
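After relabeling, __param_target becomes the target query parameter of the probe request sent to the Blackbox Exporter. A sketch of the resulting URL (exporter host and the concrete service name are illustrative, not from the talk):

```python
from urllib.parse import urlencode

# The job's params plus the relabeled url: tag form the probe request.
params = {
    "module": "http_health_monitor",                   # from the job config
    "target": "https://status.runtastic.com/sharing",  # from the url: tag (illustrative)
}
probe_url = "http://blackbox-exporter:9115/probe?" + urlencode(params)
print(probe_url)
```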

SLIDE 20

Add Silence via Terraform

resource "null_resource" "prometheus_silence" {
  provisioner "local-exec" {
    command = <<EOF
${var.amtool_path} silence add 'service=~SERVICENAME' \
  --duration='30d' \
  --comment='Silence for the newly deployed service' \
  --alertmanager.url='http://prd-alertmanager:9093'
EOF
  }
}

SLIDE 21

OpsGenie

SLIDE 22

Our Initial Alerting Plan

  • Alerts with Low Priority

○ Slack Integration

  • Alerts with High Priority (OnCall)

○ Slack Integration
○ OpsGenie

SLIDE 23

...why not forward all Alerts to OpsGenie?

SLIDE 24

Define OpsGenie Alert Routing

  • Prometheus OnCall Integration

○ High Priority Alerts (e.g. Service DOWN)
○ Call the poor On Call Person
○ Post Alerts to Slack #topic-alerts

  • Prometheus Ops Integration

○ Low Priority Alerts (e.g. Chef-Client failed runs)
○ Disable Notifications
○ Post Alerts to Slack #prometheus-alerts

SLIDE 25

Setup Alertmanager Config

- receiver: 'opsgenie_oncall'
  group_wait: 10s
  group_by: ['...']
  match:
    oncall: 'true'
- receiver: 'opsgenie'
  group_by: ['...']
  group_wait: 10s

SLIDE 26

...and its receivers

- name: "opsgenie_oncall"
  opsgenie_configs:
    - api_url: "https://api.eu.opsgenie.com/"
      api_key: "ourapitoken"
      priority: "{{ range .Alerts }}{{ .Labels.priority }}{{ end }}"
      message: "{{ range .Alerts }}{{ .Annotations.title }}{{ end }}"
      description: "{{ range .Alerts }}\n{{ .Annotations.summary }}\n\n{{ if ne .Annotations.dashboard \"\" -}}\nDashboard:\n{{ .Annotations.dashboard }}\n{{- end }}{{ end }}"
      tags: "{{ range .Alerts }}{{ .Annotations.instance }}{{ end }}"

SLIDE 27

Why we use group_by: ['...']

  • OpsGenie deduplicates alerts itself
  • Alerts are grouped by all labels
  • Easier to keep an overview of alerts
SLIDE 28

Example Alerting Rule for On Call

- alert: HTTPProbeFailedMajor
  expr: max by(instance, service)(probe_success) < 1
  for: 1m
  labels:
    oncall: "true"
    priority: "P1"
  annotations:
    title: "{{ $labels.service }} DOWN"
    summary: "HTTP Probe for {{ $labels.service }} FAILED.\nHealth Check URL: {{ $labels.instance }}"

SLIDE 29

Example Alerting Rule with Low Priority

- alert: MongoDB-ScannedObjects
  expr: max by(hostname, service)(rate(mongodb_mongod_metrics_query_executor_total[30m])) > 500000
  for: 1m
  labels:
    priority: "P3"
  annotations:
    title: "MongoDB - Scanned Objects detected on {{ $labels.service }}"
    summary: "High value of scanned objects on {{ $labels.hostname }} for service {{ $labels.service }}"
    dashboard: "https://prd-prometheus.runtastic.com/d/oCziI1Wmk/mongodb"

SLIDE 30

Alert Management via Slack

SLIDE 31

Setting up the Heartbeat

groups:
  - name: opsgenie.rules
    rules:
      - alert: OpsGenieHeartBeat
        # vector(1) is always 1, so this alert fires permanently
        expr: vector(1)
        for: 5m
        labels:
          heartbeat: "true"
        annotations:
          summary: "Heartbeat for OpsGenie"

SLIDE 32

...and its Alertmanager Configuration

- receiver: 'opsgenie_heartbeat'
  repeat_interval: 5m
  group_wait: 10s
  match:
    heartbeat: 'true'

- name: "opsgenie_heartbeat"
  webhook_configs:
    - url: 'https://api.eu.opsgenie.com/v2/heartbeats/prd_prometheus/ping'
      send_resolved: false
      http_config:
        basic_auth:
          password: "opsgenieAPIkey"

SLIDE 33

CI/CD Pipeline

SLIDE 34

Goals for our Pipeline

  • Put all Alerting and Recording Rules into a Git Repository
  • Automatically test for syntax errors
  • Deploy master branch on all Prometheus servers
  • Merge to master -> Deploy on Prometheus
SLIDE 35

How it works

  • Jenkins

○ running promtool against each .yml file

  • Bitbucket sending HTTP calls when the master branch changes
  • Ruby-based HTTP Handler on the Prometheus Servers

○ Accepting HTTP calls from Bitbucket
○ Git pull
○ Prometheus reload
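The handler side can stay very small. The talk's implementation is Ruby; the sketch below is a Python equivalent with illustrative paths and port, and it assumes Prometheus runs with the lifecycle endpoint enabled (sending SIGHUP to the process would work as well):

```python
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

RULES_DIR = "/etc/prometheus/rules"            # assumed rules checkout
RELOAD_URL = "http://localhost:9090/-/reload"  # needs --web.enable-lifecycle

def deploy_commands(repo_dir, reload_url):
    """Commands run for each push notification: pull, then reload."""
    return [
        ["git", "-C", repo_dir, "pull", "--ff-only"],
        ["curl", "-fsS", "-X", "POST", reload_url],
    ]

class DeployHook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Bitbucket calls this endpoint whenever master changes.
        for cmd in deploy_commands(RULES_DIR, RELOAD_URL):
            subprocess.run(cmd, check=True)
        self.send_response(204)
        self.end_headers()

# HTTPServer(("", 8080), DeployHook).serve_forever() would start it.
```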

SLIDE 36

Verify Builds for each Branch

SLIDE 37
SLIDE 37

runtastic.com

THANK YOU

Niko Dominkowitsch

Infrastructure Engineer

niko.dominkowitsch@runtastic.com