@snehainguva: prometheus everything, observing kubernetes in the cloud



slide-1
SLIDE 1

@snehainguva

slide-2
SLIDE 2

prometheus everything,
observing kubernetes in the cloud

digitalocean.com

slide-3
SLIDE 3

about me

  • software engineer @DigitalOcean
  • former delivery, currently observability
  • kubernetes, prometheus

slide-4
SLIDE 4

Some stats

slide-5
SLIDE 5

  • 15 kubernetes clusters
  • 12 data centers
  • 300+ production applications

slide-6
SLIDE 6

  • 2 promethei + 1 alertmanager per cluster
  • 1.5 million+ time series
  • 99,218 samples/sec

(note: data-center wide scraping is at 550k samples/sec)

slide-7
SLIDE 7

the plan:

  • the pre-kubernetes days
  • kubernetes at DigitalOcean (aka docc)
  • prometheus + alertmanager and kubernetes
  • alerting in action: examples
  • potential pitfalls
  • next steps
slide-8
SLIDE 8

pre-kubernetes:

  • service owners write an application
  • provision a server with chef or ansible
  • use a CI/CD pipeline, bash scripts, or other tools to deploy and update the application on a VM

slide-9
SLIDE 9

pre-kubernetes:

  • use nagios + various plugins to monitor
  • use collectd + application metrics + statsd + graphite
  • push data to OpenTSDB

slide-10
SLIDE 10

pre-kubernetes:

  • provisioning a host took longer than writing the actual service
  • blackbox monitoring was NOT insightful
  • whitebox monitoring of services was NOT easily queryable

slide-11
SLIDE 11

docc: DigitalOcean Command Center

A tool for deploying containerized, stateless applications

slide-12
SLIDE 12

What is kubernetes?

An open-source container orchestration tool, originally from Google

slide-13
SLIDE 13

What is docc?

An abstraction layer on top of kubernetes

[diagram: CLI, DOCCSERVER, deployment → pods, service]

slide-14
SLIDE 14

post-docc:

  • service owner writes an application
  • service owner dockerizes the application
  • describe the application in a JSON manifest file
  • deploy!

slide-15
SLIDE 15

post-docc:

  • deployments and updates take minutes, not hours
  • view running applications
  • get application logs
  • easily scale, update, or restart applications

slide-16
SLIDE 16

But what about monitoring?

slide-17
SLIDE 17

Let’s use prometheus + alertmanager

slide-18
SLIDE 18

[diagram: service, promconfig, alertconfig, alertmanager, docc, deployment → pods]

slide-19
SLIDE 19

step 1: instrument your application

  • use the prometheus golang client
  • expose a metrics endpoint
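The talk instruments services with the Prometheus Go client. Purely to show what a metrics endpoint hands back to the scraper, here is a minimal stdlib-only Python sketch of the text exposition format. The metric name `example_requests_total`, the `REQUESTS_TOTAL` store, and the handler class are hypothetical, and a real service would use an official client library rather than hand-rolling this.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-process counter; a client library would manage this for you.
REQUESTS_TOTAL = {"value": 0}


def render_metrics():
    """Render the counter in the Prometheus text exposition format."""
    return (
        "# HELP example_requests_total Requests handled so far.\n"
        "# TYPE example_requests_total counter\n"
        f"example_requests_total {REQUESTS_TOTAL['value']}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered metrics on GET /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, something like `HTTPServer(("", 9102), MetricsHandler).serve_forever()` from `http.server` would serve the endpoint; the port is arbitrary.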

slide-20
SLIDE 20

step 2: specify metrics, ports, alerts in your manifest file

  • Which metrics endpoint should be scraped?
  • Which container port needs to be exposed?
  • Specify alerting rule, duration interval, and channel.
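As a rough illustration of what such a manifest could carry, the sketch below builds one in Python and serializes it to JSON. The slides do not show the real docc schema, so every field name here (`ports`, `metrics_path`, `alerts`, `for`, `channel`) and every value is hypothetical.

```python
import json

# Hypothetical manifest: the real docc schema is not shown in the talk.
manifest = {
    "name": "example-service",
    "image": "registry.example.com/example-service:1.0.0",
    # Which container port needs to be exposed?
    "ports": [{"name": "metrics", "container_port": 9102}],
    # Which metrics endpoint should be scraped?
    "metrics_path": "/metrics",
    # Alerting rule, duration interval, and channel.
    "alerts": [
        {
            "name": "ExampleServiceDown",
            "rule": 'sum(up{kubernetes_name="example-service"}) == 0',
            "for": "5m",
            "channel": "#example-team-alerts",
        }
    ],
}


def to_manifest_json(m):
    """Serialize the manifest the way it might be saved as manifest.json."""
    return json.dumps(m, indent=2, sort_keys=True)
```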

slide-21
SLIDE 21

step 3: use docc CLI to deploy your application

$ docc deploy manifest.json

annotations contain rules and receiver info

[diagram: service, CLI, doccserver, deployment → pods]

slide-22
SLIDE 22

step 4: prometheus talks to the kubernetes api and grabs the metrics endpoint and port information

[diagram: promconfig, service]
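A minimal sketch of that discovery step, assuming the endpoint and port are carried in pod annotations. The `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` keys are a widely used convention, not something the talk confirms docc uses, and the pod here is a plain dict rather than a real API response.

```python
def scrape_url(pod):
    """Return the URL to scrape for a pod, or None if it should be skipped.

    `pod` is a plain dict shaped loosely like a Kubernetes pod object.
    """
    annotations = pod.get("metadata", {}).get("annotations", {})
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # pod did not opt in to scraping
    ip = pod.get("status", {}).get("podIP")
    if not ip:
        return None  # pod has no address yet
    port = annotations.get("prometheus.io/port", "80")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{ip}:{port}{path}"


# Hypothetical pod, for illustration only.
example_pod = {
    "metadata": {
        "annotations": {
            "prometheus.io/scrape": "true",
            "prometheus.io/port": "9102",
        }
    },
    "status": {"podIP": "10.0.0.12"},
}
```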

slide-23
SLIDE 23

step 5: promconfig grabs alert information and rewrites prometheus rules file

[diagram: promconfig, service]
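promconfig itself is not shown in the slides. The sketch below only illustrates the idea of turning alert definitions into a rules file, using the Prometheus 1.x `ALERT ... IF ...` syntax that appears later in the deck. The input dict shape (`name`, `expr`, `for`, `labels`) is an assumption.

```python
def render_rule(alert):
    """Render one alert dict as a Prometheus 1.x-style rule statement."""
    lines = [f"ALERT {alert['name']}", f"  IF {alert['expr']}"]
    if alert.get("for"):
        lines.append(f"  FOR {alert['for']}")
    labels = alert.get("labels", {})
    if labels:
        body = ", ".join(f'{k} = "{v}"' for k, v in sorted(labels.items()))
        lines.append(f"  LABELS {{ {body} }}")
    return "\n".join(lines)


def render_rules_file(alerts):
    """Join all rendered rules into the text of a rules file."""
    return "\n\n".join(render_rule(a) for a in alerts) + "\n"
```

Regenerating the whole file from the current set of alerts, rather than patching it in place, keeps the output a pure function of the input.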

slide-24
SLIDE 24

step 6: alertconfig grabs alert routes and rewrites alertmanager configuration file

[diagram: service, alertmanager, alertconfig]
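In the same spirit, a sketch of how alert routes might be assembled into an Alertmanager-style configuration. The `route`, `routes`, `match`, `receivers`, and `slack_configs` keys follow Alertmanager's configuration format (newer versions prefer `matchers` over `match`). Routing on a `team` label and the mapping from team to Slack channel are assumptions, not the actual alertconfig behavior.

```python
def build_alertmanager_config(team_channels, default_receiver="default"):
    """Build an Alertmanager-style config dict from {team: slack_channel}."""
    receivers = [{"name": default_receiver}]
    routes = []
    for team, channel in sorted(team_channels.items()):
        receivers.append({"name": team, "slack_configs": [{"channel": channel}]})
        # Alerts labeled team=<team> go to that team's receiver.
        routes.append({"match": {"team": team}, "receiver": team})
    return {
        "route": {"receiver": default_receiver, "routes": routes},
        "receivers": receivers,
    }
```

Serializing the result with `json.dumps` yields text that a YAML parser also accepts, since JSON is a subset of YAML.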

slide-25
SLIDE 25

What should we monitor?

slide-26
SLIDE 26

4 Golden Signals

  • latency
  • traffic
  • errors
  • saturation

request-based system metrics

  • Request
  • Errors
  • Duration

slide-27
SLIDE 27

Brendan Gregg’s USE-ful metrics

  • Utilization
  • Saturation
  • Errors

“Solves 80% of server issues with 5% of the effort.”

slide-28
SLIDE 28

prom metric types

  • counters: cumulative metric that only increases (or resets to zero on restart)
  • gauges: single value that can go up or down
  • histograms: sample observations and count them in buckets
  • summaries: sample observations and report the specified quantiles
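To make the differences concrete, here is a toy Python model of three of the types. It is not the client library's API; the class and method names are only for illustration. Summaries are omitted because computing quantiles needs more machinery than fits in a short sketch.

```python
import bisect


class Counter:
    """Cumulative value that can only increase."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Single value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


class Histogram:
    """Counts observations per bucket, plus a running sum and count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.per_bucket = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # First bucket whose upper bound is >= value.
        self.per_bucket[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Bucket counts as exposed: each includes all smaller buckets."""
        total, out = 0, []
        for n in self.per_bucket:
            total += n
            out.append(total)
        return out
```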

slide-29
SLIDE 29

Putting it all together...

slide-30
SLIDE 30

service metric: traffic

how much demand is placed on the system

loadbalancer backend traffic

sum(rate(haproxy_backend_bytes_out_total{kubernetes_name="loadbalancer", backend="tls_default_neptune_nyc3_internal_digitalocean_com"}[1m])) BY (backend)

  • fxn: rate() and sum()
  • metric type: counter
  • labels: kubernetes_name, backend
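What `rate()` and `sum() BY` do to a counter can be sketched in a few lines of Python. This is a simplification: Prometheus's own `rate()` also extrapolates to the edges of the window, which this version skips. The function names and sample data are made up.

```python
def simple_rate(samples):
    """Per-second increase over (timestamp_seconds, value) counter samples.

    A drop in value is treated as a counter reset, so the new value counts
    as the increase since the reset. No extrapolation is applied.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


def sum_by(series, label):
    """Sum (labels, value) pairs, grouping on one label."""
    totals = {}
    for labels, value in series:
        key = labels.get(label, "")
        totals[key] = totals.get(key, 0.0) + value
    return totals
```

With samples of 100, 400, and 700 bytes taken 30 seconds apart, the increase is 600 bytes over 60 seconds, so the rate is 10 bytes per second.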

slide-31
SLIDE 31

cluster metric: utilization

average time resource is busy servicing work

cluster CPU utilization

(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores))

  • fxn: sum() and rate()
  • metric type: counter

slide-32
SLIDE 32

How should we alert?

slide-33
SLIDE 33

Threshold alerts

Do any of the aforementioned metrics exceed a lower or upper bound?

slide-34
SLIDE 34

Threshold alerts

Are more than 80% of cluster CPU cores being utilized?

(sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) / sum(machine_cpu_cores)) * 100 > 80
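The arithmetic behind that expression, as a small Python sketch with made-up numbers: the rate of `container_cpu_usage_seconds_total` is CPU-seconds consumed per second, and dividing by the core count gives the busy fraction. The function names are hypothetical.

```python
def cpu_utilization_percent(cpu_seconds_per_second, total_cores):
    """Share of cluster cores in use, as a percentage."""
    return cpu_seconds_per_second / total_cores * 100


def exceeds(value, upper=None, lower=None):
    """True if value is above the upper bound or below the lower bound."""
    if upper is not None and value > upper:
        return True
    if lower is not None and value < lower:
        return True
    return False
```

For example, 26 CPU-seconds per second on a 32-core cluster is 81.25% utilization, which crosses the 80% threshold.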

slide-35
SLIDE 35

State-based alerts

Is there a divergence between expected state and actual state of a service?

slide-36
SLIDE 36

State-based alerts

Is my service up and/or scrape-able?

absent(up{kubernetes_name="doccserver"}) or sum(up{kubernetes_name="doccserver"}) == 0
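The two halves of that expression cover different failures, which the Python sketch below spells out: `absent(...)` catches a service with no `up` series at all, while `sum(...) == 0` catches a service whose targets all failed their last scrape. The function name is hypothetical.

```python
def service_is_down(up_values):
    """up_values: the current 0/1 `up` samples for one service's targets."""
    if not up_values:
        # No series at all: the case absent(up{...}) catches.
        return True
    # Series exist but every target failed: the sum(up{...}) == 0 case.
    return sum(up_values) == 0
```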

slide-37
SLIDE 37

Common pitfalls

slide-38
SLIDE 38

Pitfall #1: Alerting fatigue

slide-39
SLIDE 39

Solution: Slack and/or PagerDuty

  • send only the most urgent production alerts to PagerDuty
  • try out different promQL queries to get less spiky metrics
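Both ideas can be sketched in Python. The first function requires a condition to hold for several consecutive evaluations before firing, which is one common way to ignore short spikes. The second routes only urgent production alerts to the pager. The `severity` and `env` label names and their values are assumptions, not the labels used at DigitalOcean.

```python
def firing_points(values, threshold, hold):
    """Indices where the alert fires: value above threshold for at least
    `hold` consecutive evaluations."""
    fired = []
    streak = 0
    for i, value in enumerate(values):
        streak = streak + 1 if value > threshold else 0
        if streak >= hold:
            fired.append(i)
    return fired


def pick_receiver(labels):
    """Page only for urgent production alerts; everything else goes to chat."""
    urgent = labels.get("severity") == "page" and labels.get("env") == "production"
    return "pagerduty" if urgent else "slack"
```

With `hold=1` the series `[50, 95, 60, 85, 90, 92, 70]` fires on every spike above 80; with `hold=3` it fires only once the value has stayed high for three evaluations.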

slide-40
SLIDE 40

Pitfall #2: Who owns what?

slide-41
SLIDE 41

Solution: opinionated manifest file

  • service owners must include maintainer information
  • alerts themselves include descriptions and summaries with several labels
  • alerts must include team-specific receivers
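Rules like these lend themselves to a check at deploy time. A minimal Python sketch follows; the manifest field names (`maintainer`, `alerts`, `description`, `summary`, `receiver`) are hypothetical, since the talk does not show the real schema.

```python
def manifest_problems(manifest):
    """Return a list of ownership problems; an empty list means it passes."""
    problems = []
    if not manifest.get("maintainer"):
        problems.append("missing maintainer")
    for alert in manifest.get("alerts", []):
        name = alert.get("name", "<unnamed>")
        for field in ("description", "summary", "receiver"):
            if not alert.get(field):
                problems.append(f"alert {name}: missing {field}")
    return problems
```

Rejecting a deploy when this list is non-empty means every alert that reaches production already says who owns it and where it should go.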

slide-42
SLIDE 42

Pitfall #3: Meta-monitoring

slide-43
SLIDE 43

Solution: Duplicate promethei and HA alertmanager

[diagram: three alertmanager instances]

slide-44
SLIDE 44

Solution: Deadman’s switch

ALERT JustKeepSwimming IF vector(1)

elastalert
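The rule above always fires, so it works as a heartbeat: as long as it keeps arriving, the alerting pipeline is alive. Something outside that pipeline has to notice when it stops. The slide names elastalert for that role but does not show how it is wired up, so the Python sketch below only illustrates the watcher's logic, with a hypothetical class and timestamps in seconds.

```python
class DeadMansSwitch:
    """Trips when no heartbeat has arrived within `timeout_seconds`."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = None

    def heartbeat(self, now):
        """Record that the always-firing alert was received at time `now`."""
        self.last_seen = now

    def tripped(self, now):
        """True if the heartbeat never arrived or is older than the timeout."""
        if self.last_seen is None:
            return True
        return now - self.last_seen > self.timeout
```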

slide-45
SLIDE 45

slide-46
SLIDE 46

#1: Automated alerts

  • utilize user-defined memory and cpu limits for threshold alerts
  • automatic state-based alerts
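One way this could look, sketched in Python: derive the alert expressions from values the service owner already declared. `container_memory_usage_bytes` is a cAdvisor metric; selecting it by a `kubernetes_name` label mirrors the selectors used elsewhere in these slides but is an assumption, as are the function names and the 90% default.

```python
def memory_alert_expr(service, limit_bytes, percent=90):
    """Threshold alert: memory usage above `percent` of the declared limit."""
    threshold = limit_bytes * percent // 100
    selector = f'{{kubernetes_name="{service}"}}'
    return f"sum(container_memory_usage_bytes{selector}) > {threshold}"


def state_alert_expr(service):
    """State-based alert: service absent or all of its targets down."""
    selector = f'{{kubernetes_name="{service}"}}'
    return f"absent(up{selector}) or sum(up{selector}) == 0"
```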

slide-47
SLIDE 47

#2: Leverage metrics for autopilot

  • user trusts in our custom controllers and schedulers
  • collect metrics and build a model of resource usage over time
  • adjust limits and alerts accordingly

slide-48
SLIDE 48

#3: Leverage metrics for autoscaling

  • services based on resource usage, # connections, etc.
  • loadbalancers based on # of frontend and backend connections
  • # of worker nodes based on memory and cpu capacity metrics
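The talk lists autoscaling as a next step without describing a scheme. As one possible shape, the Python sketch below borrows the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (scale replicas by the ratio of the observed metric to its target) and clamps the result. The function name, bounds, and numbers are illustrative only.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10):
    """Scale replicas in proportion to metric/target, within fixed bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas each seeing 150 connections against a target of 100, this suggests 6 replicas; at 30 connections it suggests scaling down to 2.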

slide-49
SLIDE 49

a brave new world of container orchestration

  • prometheus + alertmanager are awesome!
  • extensibility

slide-50
SLIDE 50

thanks!

@snehainguva

  • The best prometheus tutorials you will ever read, Julius Volz

  • Actual Prometheus Website
  • Kubernetes Project