Monitoring In Motion
Challenges in monitoring Kubernetes, containers, and dynamic infrastructure.
Ilan Rabinovitch, Director, Technical Community, Datadog
ContainerCon Toronto, Aug 24, 2016
$ finger ilan@datadog
[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events
Datadog Overview
Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...
Monitor Everything
$ cat ~/.plan
Our Focus Area
Culture Automation Metrics Sharing
Damon Edwards and John Willis DevOps Day LA
Culture
“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”
Follow @honest_update on Twitter
Sharing
Looping Back on Culture
Describe the problem as your “enemy”, not each other
Learn Together
Sharing
Using and Sharing the same metrics and measurements across teams is key to avoiding misunderstandings.
Source: http://bit.ly/1SvvbuP
Source: http://bit.ly/1RQRsXW
Operational Complexity Increases with...
https://www.datadoghq.com/docker-adoption/
How much do we measure?
1 instance
1 operating system (e.g., Linux)
~50 metrics per application
N containers
Operational Complexity: Scale
[Figure: instances, containers, and metrics per host, assuming 5 containers per host]
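The scale argument on these slides can be sketched as simple arithmetic. This is a rough illustration only: the slides give about 50 metrics per application and 5 containers per host, while the operating-system metric count (`os_metrics`) and the function names are assumptions introduced here, not figures from the talk.

```python
def metrics_per_host(os_metrics: int, app_metrics: int = 50, containers: int = 5) -> int:
    """Metrics emitted by one host: OS metrics plus app metrics for each container.

    os_metrics is a hypothetical input; the slides do not preserve the real figure.
    app_metrics and containers default to the values quoted on the slides.
    """
    return os_metrics + app_metrics * containers


def fleet_metrics(hosts: int, os_metrics: int, app_metrics: int = 50, containers: int = 5) -> int:
    """Total metrics across a fleet of identical hosts."""
    return hosts * metrics_per_host(os_metrics, app_metrics, containers)


# Example with an assumed 100 OS-level metrics per host.
single_app_host = metrics_per_host(os_metrics=100, containers=1)
containerized_host = metrics_per_host(os_metrics=100, containers=5)
```

Under these assumptions the per-host count grows linearly with the number of containers, which is the point the slides make about operational complexity.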
Operational Complexity Increases with...
Source: Datadog
Source: http://bit.ly/1qFylWK
Operational Complexity Increases with...
Open Questions
Source: http://bit.ly/1YxJ7Jy
More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/
Finding Signal - Categorizing Your Metrics
Examples: NGINX - Metrics
Work Metrics:
Resource Metrics:
Examples: NGINX - Events
Examples: Events
When to alert?
Recurse until you find root cause
What to demand from our monitoring tooling?
Cryptic Alerts
EVERY ALERT MUST BE ACTIONABLE
Host Centric
Service Centric
Static configurations tracking dynamic infrastructure are not a recipe for success.
Static vs Dynamic
Query Based Monitoring
“What’s the average throughput of application:nginx per version?”
“Alert me when one of my pods from replication controller:foo is not behaving like the others”
“Show me the rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2…”
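These questions filter and group by tags rather than by host name. A minimal sketch of that idea follows; the data-point shape (`metric`, `value`, `tags`), the metric name, and the function `average_by` are hypothetical and do not represent any particular product's query language.

```python
from collections import defaultdict
from statistics import mean


def average_by(points, metric, filter_tags, group_by):
    """Average a metric over points matching filter_tags, grouped by one tag key."""
    groups = defaultdict(list)
    for point in points:
        if point["metric"] != metric:
            continue
        tags = point["tags"]
        # Keep only points whose tags match every requested filter.
        if any(tags.get(key) != value for key, value in filter_tags.items()):
            continue
        groups[tags.get(group_by)].append(point["value"])
    return {group: mean(values) for group, values in groups.items()}


# Hypothetical sample data: throughput reported by containers, not by named hosts.
points = [
    {"metric": "throughput", "value": 100, "tags": {"application": "nginx", "version": "1"}},
    {"metric": "throughput", "value": 200, "tags": {"application": "nginx", "version": "1"}},
    {"metric": "throughput", "value": 300, "tags": {"application": "nginx", "version": "2"}},
    {"metric": "throughput", "value": 999, "tags": {"application": "redis", "version": "2"}},
]

per_version = average_by(points, "throughput", {"application": "nginx"}, "version")
```

Because the query names tags instead of hosts, it keeps working as containers are created and destroyed.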
Resource Metrics
Utilization:
Saturation
Error
(receive vs transmit)
Container Events
How do we get at the upper layers?
Getting at the Metrics
ACCESS METHOD   CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
pseudo-files    Yes           Yes              Some          Yes, in 1.6.1+
stats command   Basic         Basic            No            Basic
API             Yes           Yes              Some          Yes
Pseudo-files
/cgroup/<resource>/docker/$CONTAINER_ID/
/sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/
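Since the mount point differs between systems, a collector can try both locations. The sketch below assumes the two layouts shown above (cgroup v1 style paths); the helper names are hypothetical, and hosts using a different cgroup layout would need other paths.

```python
import os

# The two mount points listed on the slide, tried in order.
CGROUP_ROOTS = ("/sys/fs/cgroup", "/cgroup")


def candidate_paths(resource, container_id, filename):
    """Build the possible pseudo-file paths for a resource controller and container."""
    return [
        os.path.join(root, resource, "docker", container_id, filename)
        for root in CGROUP_ROOTS
    ]


def find_pseudo_file(resource, container_id, filename):
    """Return the first candidate path that exists, or None if neither does."""
    for path in candidate_paths(resource, container_id, filename):
        if os.path.exists(path):
            return path
    return None
```

For example, `find_pseudo_file("cpuacct", container_id, "cpuacct.stat")` returns a readable path on hosts that use one of these layouts and `None` otherwise.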
Pseudo-files: CPU Metrics
$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451 # time spent running processes since boot
> system 966 # time spent executing system calls since boot
$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565 # Number of enforcement intervals that have elapsed
> nr_throttled 559 # Number of times the group has been throttled
> throttled_time 12119585961 # Total time that members of the group were throttled (12.12 seconds)
Pseudo-files: CPU Throttling
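The cpu.stat counters shown earlier can be turned into a throttling summary. This is a minimal sketch: the function names are hypothetical, and it assumes the file contains plain `name value` lines with `throttled_time` in nanoseconds, which matches the 12.12 seconds annotation on the slide.

```python
def parse_stat(text):
    """Parse 'name value' lines from a cgroup stat pseudo-file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_summary(cpu_stat):
    """Fraction of enforcement periods that were throttled, and throttled time in seconds."""
    periods = cpu_stat["nr_periods"]
    ratio = cpu_stat["nr_throttled"] / periods if periods else 0.0
    return {
        "throttled_ratio": ratio,
        "throttled_seconds": cpu_stat["throttled_time"] / 1e9,  # nanoseconds to seconds
    }


# Values copied from the cpu.stat example on the slide.
sample = "nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n"
summary = throttle_summary(parse_stat(sample))
```

With the slide's numbers, the container was throttled in 559 of 565 enforcement periods, which is a strong hint that its CPU limit is too low for its workload.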
Docker API
$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
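The same endpoint can be read from a program. The sketch below uses only the Python standard library to talk to the Unix socket and derives a CPU percentage from two consecutive samples. The class and function names are hypothetical, and the JSON field names (`cpu_stats`, `precpu_stats`, `system_cpu_usage`, `online_cpus`) reflect my understanding of the stats payload, so verify them against the API version you run.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that connects to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def fetch_stats(container_id, socket_path="/var/run/docker.sock"):
    """Fetch a single stats sample (stream=false) for one container."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", f"/containers/{container_id}/stats?stream=false")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def cpu_percent(stats):
    """CPU usage as a percentage, from the current and previous counters in one sample."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0
    ncpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or []) or 1
    return cpu_delta / system_delta * ncpus * 100.0
```

A collector would call `cpu_percent(fetch_stats("28d7a95f468e"))` on an interval; the process needs permission to read the Docker socket.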
STATS Command
# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB
Side Car Containers
Aren’t we still missing a layer?
Open Questions
Service Discovery
[Diagram: Docker API, Orchestrator, Monitoring Agent, Container]
Containers List & Metadata
Additional Metadata (Tags, etc)
Config Backend
Integration Configurations
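One way to read the service discovery slides: the agent lists running containers with their metadata, looks up a matching integration configuration in the config backend, and fills in container-specific values. The sketch below illustrates that reading only. The container and template shapes, the `%%host%%` and `%%port%%` placeholders, and the function names are assumptions for illustration, not a confirmed agent implementation.

```python
def render(value, container):
    """Substitute container-specific values into a template string."""
    if not isinstance(value, str):
        return value
    return (
        value.replace("%%host%%", container["ip"])
        .replace("%%port%%", str(container["port"]))
    )


def resolve_checks(containers, templates):
    """Match each container to a config template by image name and render it."""
    checks = []
    for container in containers:
        template = templates.get(container["image"])
        if template is None:
            continue  # No integration configured for this image.
        instance = {key: render(val, container) for key, val in template["instance"].items()}
        checks.append({
            "check": template["check"],
            "instance": instance,
            "tags": container.get("tags", []),
        })
    return checks


# Hypothetical inputs standing in for the container list and the config backend.
containers = [
    {"image": "nginx", "ip": "10.0.0.5", "port": 80, "tags": ["app:web"]},
    {"image": "busybox", "ip": "10.0.0.6", "port": 0},
]
templates = {
    "nginx": {"check": "nginx", "instance": {"nginx_status_url": "http://%%host%%:%%port%%/nginx_status"}},
}
checks = resolve_checks(containers, templates)
```

Keeping templates keyed by image rather than by host is what lets the configuration follow containers as they move.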
Host Level Metrics
Custom Metrics
DogStatsD
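Custom metrics are typically sent to the local agent as small UDP datagrams. The sketch below formats and sends one using only the standard library. It assumes the usual `name:value|type|#tags` datagram layout and the default port 8125; the metric names, tags, and helper functions are hypothetical. In practice an official client library would normally be used instead.

```python
import socket


def format_datagram(name, value, metric_type="g", tags=None, sample_rate=None):
    """Build a statsd-style datagram with optional sample rate and tags."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(name, value, metric_type="g", tags=None, host="127.0.0.1", port=8125):
    """Send one metric over UDP; fire-and-forget, so delivery is not confirmed."""
    payload = format_datagram(name, value, metric_type, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# A counter increment tagged by application and version (hypothetical names).
example = format_datagram("web.page_views", 1, "c", tags=["app:nginx", "version:2"])
```

Tagging custom metrics the same way as infrastructure metrics lets both be queried together, for example per application version.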
Source: http://bit.ly/1NoW6aj
Resources
Monitoring 101: Alerting
https://www.datadoghq.com/blog/monitoring-101-alerting/
Monitoring 101: Collecting the Right Data
https://www.datadoghq.com/blog/monitoring-101-collecting-data/
Monitoring 101: Investigating performance issues
https://www.datadoghq.com/blog/monitoring-101-investigation/
The Power of Tagged Metrics
https://www.datadoghq.com/blog/the-docker-monitoring-problem/
How to Collect Docker Metrics
https://www.datadoghq.com/blog/how-to-collect-docker-metrics/
8 surprising facts about Docker Adoption
https://www.datadoghq.com/docker-adoption/