
SLIDE 1

Monitoring In Motion

Challenges in monitoring Kubernetes, containers, and dynamic infrastructure.

ContainerCon Toronto, Aug 24, 2016
Ilan Rabinovitch
Director, Technical Community, Datadog

SLIDE 2

$ finger ilan@datadog

[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events

SLIDE 3

Datadog Overview

SaaS based infrastructure and app monitoring
Open Source Agent
Time series data (metrics and events)
Processing nearly a trillion data points per day
Intelligent Alerting
We’re hiring! (www.datadoghq.com/careers/)

SLIDE 4

Monitor Everything

Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...

SLIDE 5

SLIDE 6

$ cat ~/.plan

1. Intro: The Importance of Monitoring
2. The Challenge: Monitoring Dynamic Infrastructure
3. Finding the Signal: How do we know what to monitor?
4. Implementation: Applying it to Containerized Workloads

SLIDE 7

Our Focus Area

Culture Automation Metrics Sharing

Damon Edwards and John Willis DevOps Day LA

SLIDE 8

Culture

“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”

Melvin E. Conway

SLIDE 9

SLIDE 10

SLIDE 11

SLIDE 12

SLIDE 13

SLIDE 14

Follow @honest_update on Twitter

SLIDE 15

Collecting data is cheap;  not having it when you need it can be expensive

SLIDE 16

Instrument all the things!

SLIDE 17

Sharing

Looping Back on Culture
Describe the problem as your “enemy”, not each other
Learn Together

SLIDE 18

Sharing

Using and Sharing the same metrics and measurements across teams is key to avoiding misunderstandings.

SLIDE 19

Source: http://bit.ly/1SvvbuP

SLIDE 20

Source: http://bit.ly/1RQRsXW

SLIDE 21

Operational Complexity Increases with...

Number of things to measure 
Velocity of change

SLIDE 22

https://www.datadoghq.com/docker-adoption/

SLIDE 23

How much do we measure?

1 instance

10 metrics from cloud providers

1 operating system (e.g., Linux)

100 metrics

~50 metrics per application

SLIDE 24

SLIDE 25

How much do we measure?

1 instance

10 metrics from cloud providers

1 operating system (e.g., Linux)

100 metrics

~50 metrics per application

N containers

150*N metrics

SLIDE 26

Operational Complexity

100

instances

500

containers

SLIDE 27

Operational Complexity: Scale

160

metrics per host

800

metrics per host

Assuming 5 containers per host

SLIDE 28

Operational Complexity: Scale

100

instances

80,000

metrics

Assuming 5 containers per host
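The scale figures on these slides can be reproduced with a few lines of arithmetic. The sketch below is only one reading of the numbers: the slides do not state how 800 is derived, so multiplying the 160 per-host baseline by the container count is an assumption that happens to match the quoted totals. The function names are hypothetical.

```python
# Per-host figures quoted on the slides.
CLOUD_PROVIDER_METRICS = 10
OS_METRICS = 100
APP_METRICS = 50


def metrics_per_host(containers_per_host: int = 1) -> int:
    """Baseline of 160 metrics, scaled by container count (assumed model)."""
    baseline = CLOUD_PROVIDER_METRICS + OS_METRICS + APP_METRICS
    return baseline * containers_per_host


def fleet_metrics(instances: int, containers_per_host: int) -> int:
    """Total metrics across all instances."""
    return instances * metrics_per_host(containers_per_host)


print(metrics_per_host())      # 160
print(metrics_per_host(5))     # 800
print(fleet_metrics(100, 5))   # 80000
```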

SLIDE 29

How much do we measure?

1 instance

10 metrics from cloud providers

1 operating system (e.g., Linux)

100 metrics

~50 metrics per application

N containers

150*N metrics

Metrics Overload!

SLIDE 30

Operational Complexity Increases with...

Number of things to measure 
Velocity of change

SLIDE 31

Source: Datadog

SLIDE 32

Source: http://bit.ly/1qFylWK

SLIDE 33

SLIDE 34

Operational Complexity Increases with...

Number of things to measure 
Velocity of change

SLIDE 35

SLIDE 36

SLIDE 37

Open Questions

Where is my container running?
What is the capacity of my cluster?
What port is my app running on?
What’s the total throughput of my app?
What’s its response time per tag? (app, version, region)
What’s the distribution of 5xx errors per container?

SLIDE 38

Source: http://bit.ly/1YxJ7Jy

SLIDE 39

More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

SLIDE 40

Monitoring 101

SLIDE 41

Finding Signal - Categorizing Your Metrics

SLIDE 42

SLIDE 43

SLIDE 44

SLIDE 45

SLIDE 46

Examples: NGINX - Metrics

Work Metrics:  

Requests Per Second
Request Time
Error Rates (4xx or 5xx)
Success (2xx)

Resource Metrics: 

Disk I/O
Memory
CPU
Queue Length
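As a rough illustration of the work metrics listed above, the sketch below derives requests per second, error rate, and success rate from per-status-code request counts collected over an interval. The counts and the `work_metrics` helper are hypothetical; how the counts are gathered from NGINX is outside the scope of this example.

```python
def work_metrics(status_counts: dict, interval_seconds: float) -> dict:
    """Summarize throughput and outcome rates from HTTP status code counts."""
    total = sum(status_counts.values())
    errors = sum(n for code, n in status_counts.items() if code >= 400)
    successes = sum(n for code, n in status_counts.items() if 200 <= code < 300)
    return {
        "requests_per_second": total / interval_seconds,
        "error_rate": errors / total if total else 0.0,
        "success_rate": successes / total if total else 0.0,
    }


# Hypothetical counts observed over a 10 second window.
sample_counts = {200: 940, 404: 40, 500: 20}
summary = work_metrics(sample_counts, interval_seconds=10)
print(summary)
```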

SLIDE 47

Examples: NGINX - Events

Configuration Change
Code Deployment
Service Started / Stopped

SLIDE 48

Examples: Events

SLIDE 49

When to let a sleeping engineer lie?

SLIDE 50

When to alert?

SLIDE 51

Recurse until you find root cause

SLIDE 52

What to demand from our monitoring tooling?

SLIDE 53

Cryptic Alerts

W H A T ?

SLIDE 54

EVERY ALERT MUST BE ACTIONABLE

SLIDE 55

Host Centric

SLIDE 56

Service Centric

SLIDE 57

Static configurations tracking dynamic infrastructure are not a recipe for success.

Static vs Dynamic

SLIDE 58

SLIDE 59

SLIDE 60

Query Based Monitoring

“What’s the average throughput of application:nginx per version?”
“Alert me when one of my pods from replication controller:foo is not behaving like the others.”
“Show me the rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2….”
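A query such as "average throughput of application:nginx per version" amounts to filtering points by tag and grouping by another tag. The sketch below shows that idea on an in-memory list; the metric name, tag keys, values, and the `average_by` helper are all hypothetical and do not represent any particular monitoring product's query API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tagged data points.
points = [
    {"metric": "requests_per_s", "value": 120.0,
     "tags": {"application": "nginx", "version": "1", "dc": "us-east"}},
    {"metric": "requests_per_s", "value": 100.0,
     "tags": {"application": "nginx", "version": "1", "dc": "eu-west"}},
    {"metric": "requests_per_s", "value": 200.0,
     "tags": {"application": "nginx", "version": "2", "dc": "us-east"}},
    {"metric": "requests_per_s", "value": 999.0,
     "tags": {"application": "redis", "version": "3", "dc": "us-east"}},
]


def average_by(points, metric, filters, group_by):
    """Average a metric over points matching all filters, grouped by one tag."""
    groups = defaultdict(list)
    for point in points:
        if point["metric"] != metric:
            continue
        if any(point["tags"].get(k) != v for k, v in filters.items()):
            continue
        groups[point["tags"].get(group_by)].append(point["value"])
    return {key: mean(values) for key, values in groups.items()}


per_version = average_by(points, "requests_per_s",
                         filters={"application": "nginx"}, group_by="version")
print(per_version)
```

Because the query names tags rather than hosts, containers that start or stop are included or dropped without any configuration change.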

SLIDE 61

Getting at the metrics…

SLIDE 62

Resource Metrics

Utilization:

CPU (user + system)
memory
i/o
network traffic

Saturation

throttling
swap

Error

Network Errors

(receive vs transmit)

SLIDE 63

Container Events

Starting / Stopping Containers
Scaling Events for Underlying Instances
Deploying a new container build

SLIDE 64

SLIDE 65

How do we get at the upper layers?

SLIDE 66

Getting at the Metrics

Access method   CPU metrics   Memory metrics   I/O metrics   Network metrics
pseudo-files    Yes           Yes              Some          Yes, in 1.6.1+
stats command   Basic         Basic            No            Basic
API             Yes           Yes              Some          Yes

SLIDE 67

Pseudo-files

Provide visibility into container metrics via the file system.
Generally under:

/cgroup/<resource>/docker/$CONTAINER_ID/  

or

/sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/ 

SLIDE 68

Pseudo-files: CPU Metrics

$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451    # time spent running processes since boot
> system 966   # time spent executing system calls since boot

Pseudo-files: CPU Throttling

$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565               # number of enforcement intervals that have elapsed
> nr_throttled 559             # number of times the group has been throttled
> throttled_time 12119585961   # total time that members of the group were throttled (12.12 seconds)
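The `cpu.stat` values shown on this slide can be turned into a throttling summary with a small parser. This is a minimal sketch: the helper names are hypothetical, the sample text mirrors the slide's values, and it assumes `throttled_time` is reported in nanoseconds, which is consistent with the "12.12 seconds" annotation.

```python
def parse_stat(text: str) -> dict:
    """Parse 'key value' lines, as found in cgroup stat pseudo-files."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_stat(path: str) -> dict:
    """Read and parse a stat pseudo-file, e.g. .../docker/$CONTAINER_ID/cpu.stat."""
    with open(path) as handle:
        return parse_stat(handle.read())


def throttling_summary(cpu_stat: dict) -> dict:
    """Share of enforcement periods that were throttled, and time throttled."""
    periods = cpu_stat.get("nr_periods", 0)
    throttled = cpu_stat.get("nr_throttled", 0)
    return {
        "throttled_ratio": throttled / periods if periods else 0.0,
        "throttled_seconds": cpu_stat.get("throttled_time", 0) / 1e9,
    }


# Sample content mirroring the values on the slide.
SAMPLE_CPU_STAT = "nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n"
summary = throttling_summary(parse_stat(SAMPLE_CPU_STAT))
print(summary)
```

A ratio close to 1, as in this sample, means the container hit its CPU limit in nearly every enforcement period.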

SLIDE 69

Docker API

Detailed streaming metrics as JSON over an HTTP socket

$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
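Each object streamed by the stats endpoint can be parsed as JSON and reduced to the figures of interest. The sketch below computes memory utilization from the `memory_stats.usage` and `memory_stats.limit` fields; the sample line is a hypothetical, heavily trimmed payload (a real response carries many more fields), and `memory_percent` is an illustrative helper rather than part of any Docker client library.

```python
import json

# Hypothetical, trimmed stats object; values are in bytes.
SAMPLE_LINE = '{"memory_stats": {"usage": 75004641, "limit": 513802240}}'


def memory_percent(stats: dict):
    """Memory usage as a percentage of the container's limit, or None if unknown."""
    memory = stats.get("memory_stats", {})
    usage = memory.get("usage")
    limit = memory.get("limit")
    if not usage or not limit:
        return None
    return 100.0 * usage / limit


percent = memory_percent(json.loads(SAMPLE_LINE))
print(f"{percent:.2f}%")
```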

SLIDE 70

STATS Command

# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB

SLIDE 71

Side Car Containers

SLIDE 72

Aren’t we still missing a layer?

SLIDE 73

Open Questions

What is the capacity of my cluster?
What’s the total throughput of my app?
What’s its response time per tag? (app, version, region)
What’s the distribution of 5xx errors per container?
Where is my container running? What port?

SLIDE 74

Service Discovery

Diagram components: Docker API, Orchestrator, Monitoring Agent, Container, Config Backend

Diagram data flows: Containers List & Metadata; Additional Metadata (Tags, etc.); Integration Configurations; Host Level Metrics

SLIDE 75

SLIDE 76

Custom Metrics

Instrument custom applications 
You know your key transactions best. 
Use async protocols like Etsy’s StatsD or DogStatsD
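To make the custom-metrics idea concrete, the sketch below formats a counter in the plain-text StatsD datagram format (`name:value|type`) and sends it over UDP, which is fire-and-forget and so does not block the application. The `|#tag,...` suffix is the DogStatsD tagging extension. The metric name, tags, helper names, and the assumption that an agent listens on 127.0.0.1:8125 are illustrative only.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build a StatsD datagram; tags use the DogStatsD '|#' extension."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    """Send one datagram over UDP; no reply is expected or awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# Hypothetical key transaction: count a completed checkout.
datagram = format_metric("checkout.completed", 1, "c",
                         tags=["env:prod", "version:2"])
print(datagram)
# send_metric(datagram)  # uncomment when a StatsD-compatible agent is listening
```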

SLIDE 77

Source: http://bit.ly/1NoW6aj

SLIDE 78

Resources

Monitoring 101: Alerting
https://www.datadoghq.com/blog/monitoring-101-alerting/

Monitoring 101: Collecting the Right Data
https://www.datadoghq.com/blog/monitoring-101-collecting-data/

Monitoring 101: Investigating performance issues
https://www.datadoghq.com/blog/monitoring-101-investigation/

The Power of Tagged Metrics
https://www.datadoghq.com/blog/the-docker-monitoring-problem/

How to Collect Docker Metrics
https://www.datadoghq.com/blog/how-to-collect-docker-metrics/

8 surprising facts about Docker Adoption
https://www.datadoghq.com/docker-adoption/
