Monitoring In Motion: Challenges in monitoring Kubernetes, containers, and dynamic infrastructure

Challenges in monitoring Kubernetes, containers, and dynamic infrastructure. Ilan Rabinovitch, Director, Technical Community, Datadog. ContainerCon Toronto, Aug 24, 2016.


slide-1
SLIDE 1

Monitoring In Motion

Challenges in monitoring kubernetes, containers, and dynamic infrastructure.

ContainerCon Toronto
Aug 24, 2016

Ilan Rabinovitch
Director, Technical Community
Datadog

slide-2
SLIDE 2

$ finger ilan@datadog

[datadoghq.com]
Name: Ilan Rabinovitch
Role: Director, Technical Community
Interests:
* Monitoring and Metrics
* Large scale web operations
* FL/OSS Community Events

slide-3
SLIDE 3
  • SaaS based infrastructure and app monitoring
  • Open Source Agent
  • Time series data (metrics and events)
  • Processing nearly a trillion data points per day
  • Intelligent Alerting
  • We’re hiring! (www.datadoghq.com/careers/)

Datadog Overview

slide-4
SLIDE 4

Operating Systems, Cloud Providers, Containers, Web Servers, Datastores, Caches, Queues and more...

Monitor Everything

slide-5
SLIDE 5
slide-6
SLIDE 6

$ cat ~/.plan

  • 1. Intro: The Importance of Monitoring
  • 2. The Challenge: Monitoring Dynamic Infrastructure
  • 3. Finding the Signal: How do we know what to monitor?
  • 4. Implementation: Applying it to Containerized Workloads
slide-7
SLIDE 7

Our Focus Area

Culture, Automation, Metrics, Sharing

Damon Edwards and John Willis, DevOps Day LA

slide-8
SLIDE 8

Culture

“organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations”

  • Melvin E. Conway
slide-9
SLIDE 9
slide-10
SLIDE 10
slide-11
SLIDE 11
slide-12
SLIDE 12
slide-13
SLIDE 13
slide-14
SLIDE 14

Follow @honest_update on Twitter

slide-15
SLIDE 15

Collecting data is cheap;
 not having it when you need it can be expensive

slide-16
SLIDE 16

Instrument all the things!

slide-17
SLIDE 17

Sharing

Looping Back on Culture

  • Describe the problem as your “enemy”, not each other
  • Learn together

slide-18
SLIDE 18

Sharing

Using and sharing the same metrics and measurements across teams is key to avoiding misunderstandings.

slide-19
SLIDE 19

Source: http://bit.ly/1SvvbuP

slide-20
SLIDE 20

Source: http://bit.ly/1RQRsXW

slide-21
SLIDE 21

Operational Complexity Increases with..

  • Number of things to measure

  • Velocity of change
slide-22
SLIDE 22

https://www.datadoghq.com/docker-adoption/

slide-23
SLIDE 23

How much do we measure?

1 instance

  • 10 metrics from cloud providers

1 operating system (e.g., Linux)

  • 100 metrics

~50 metrics per application

slide-24
SLIDE 24
slide-25
SLIDE 25

How much do we measure?

1 instance

  • 10 metrics from cloud providers

1 operating system (e.g., Linux)

  • 100 metrics

~50 metrics per application

N containers

  • 150*N metrics
slide-26
SLIDE 26

Operational Complexity

100

instances

500

containers

slide-27
SLIDE 27

Operational Complexity: Scale

160

metrics per host

800

metrics per host

Assuming 5 containers per host

slide-28
SLIDE 28

Operational Complexity: Scale

100

instances

80,000

metrics

Assuming 5 containers per host
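
The figures on the scale slides can be reproduced with a few lines of arithmetic. This is a sketch under one assumed reading: that the 800 and 80,000 totals come from multiplying the 160-metric baseline by the number of containers per host and then by the number of instances. The slides do not spell this out, and combining the 150*N term with the host-level metrics would give a somewhat different total. The function name is illustrative.

```python
# Per-component metric counts taken from the "How much we measure?" slides.
CLOUD_PROVIDER_METRICS = 10
OS_METRICS = 100
APP_METRICS = 50


def metrics_per_host(containers_per_host: int = 1) -> int:
    """Baseline metrics for one host, scaled by container count (assumed reading)."""
    baseline = CLOUD_PROVIDER_METRICS + OS_METRICS + APP_METRICS  # 160
    return baseline * containers_per_host


single = metrics_per_host()          # one host without extra containers
containerized = metrics_per_host(5)  # the slide's "5 containers per host" case
fleet = containerized * 100          # 100 instances

print(single, containerized, fleet)
```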

slide-29
SLIDE 29

How much do we measure?

1 instance

  • 10 metrics from cloud providers

1 operating system (e.g., Linux)

  • 100 metrics

~50 metrics per application

N containers

  • 150*N metrics

Metrics Overload!

slide-30
SLIDE 30

Operational Complexity Increases with..

  • Number of things to measure

  • Velocity of change
slide-31
SLIDE 31

Source: Datadog

slide-32
SLIDE 32

Source: http://bit.ly/1qFylWK

slide-33
SLIDE 33
slide-34
SLIDE 34

Operational Complexity Increases with..

  • Number of things to measure

  • Velocity of change
slide-35
SLIDE 35
slide-36
SLIDE 36
slide-37
SLIDE 37

Open Questions

  • Where is my container running?
  • What is the capacity of my cluster?
  • What port is my app running on?
  • What’s the total throughput of my app?
  • What’s its response time per tag? (app, version, region)
  • What’s the distribution of 5xx errors per container?
slide-38
SLIDE 38

Source: http://bit.ly/1YxJ7Jy

slide-39
SLIDE 39

More Details at: http://www.datadoghq.com/blog/monitoring-101-alerting/

slide-40
SLIDE 40

Monitoring 101

slide-41
SLIDE 41

Finding Signal - Categorizing Your Metrics

slide-42
SLIDE 42
slide-43
SLIDE 43
slide-44
SLIDE 44
slide-45
SLIDE 45
slide-46
SLIDE 46

Examples: NGINX - Metrics

Work Metrics: 


  • Requests Per Second
  • Request Time
  • Error Rates (4xx or 5xx)
  • Success (2xx)

Resource Metrics:


  • Disk I/O
  • Memory
  • CPU
  • Queue Length
slide-47
SLIDE 47

Examples: NGINX - Events

  • Configuration Change
  • Code Deployment
  • Service Started / Stopped
slide-48
SLIDE 48

Examples: Events

slide-49
SLIDE 49

When to let a sleeping engineer lie?

slide-50
SLIDE 50

When to alert?

slide-51
SLIDE 51

Recurse until you find root cause

slide-52
SLIDE 52

What to demand from our monitoring tooling?

slide-53
SLIDE 53

Cryptic Alerts

W H A T ?

slide-54
SLIDE 54

EVERY ALERT MUST BE ACTIONABLE

slide-55
SLIDE 55

Host Centric

slide-56
SLIDE 56

Service Centric

slide-57
SLIDE 57

Static configurations tracking dynamic infrastructure are not a recipe for success.

Static vs Dynamic

slide-58
SLIDE 58
slide-59
SLIDE 59
slide-60
SLIDE 60

Query Based Monitoring

“What’s the average throughput of application:nginx per version?”
“Alert me when one of my pods from replication controller:foo is not behaving like the others”
“Show me rate of HTTP 500 responses from nginx”
“… across all data centers”
“… running my app version 2…”
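
A query like "average throughput of application:nginx per version" boils down to filtering on some tags and grouping on another. The sketch below shows that idea on in-memory data; the metric name, tag keys, sample values, and the `average_by` helper are all hypothetical and do not correspond to any particular monitoring product's query API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: each point carries a value and a set of key:value tags.
points = [
    {"metric": "nginx.requests_per_s", "value": 120.0,
     "tags": {"application": "nginx", "version": "1", "datacenter": "us-east"}},
    {"metric": "nginx.requests_per_s", "value": 80.0,
     "tags": {"application": "nginx", "version": "1", "datacenter": "eu-west"}},
    {"metric": "nginx.requests_per_s", "value": 200.0,
     "tags": {"application": "nginx", "version": "2", "datacenter": "us-east"}},
    {"metric": "nginx.requests_per_s", "value": 100.0,
     "tags": {"application": "nginx", "version": "2", "datacenter": "eu-west"}},
    {"metric": "redis.commands_per_s", "value": 900.0,
     "tags": {"application": "redis", "version": "3", "datacenter": "us-east"}},
]


def average_by(points, metric, group_tag, **filters):
    """Average a metric per value of group_tag, keeping only points whose tags match filters."""
    groups = defaultdict(list)
    for point in points:
        if point["metric"] != metric:
            continue
        tags = point["tags"]
        if any(tags.get(key) != wanted for key, wanted in filters.items()):
            continue
        if group_tag in tags:
            groups[tags[group_tag]].append(point["value"])
    return {key: mean(values) for key, values in groups.items()}


# "What's the average throughput of application:nginx per version?"
print(average_by(points, "nginx.requests_per_s", "version", application="nginx"))
```

Because the query names tags rather than hosts, it keeps working as containers come and go, which is the point of the slide.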

slide-61
SLIDE 61

Getting at the metrics…

slide-62
SLIDE 62

Resource Metrics

Utilization:

  • CPU (user + system)
  • memory
  • i/o
  • network traffic

Saturation

  • throttling
  • swap

Error

  • Network Errors (receive vs transmit)


slide-63
SLIDE 63

Container Events

  • Starting / Stopping Containers
  • Scaling Events for Underlying Instances
  • Deploying a new container build
slide-64
SLIDE 64
slide-65
SLIDE 65

How do we get at the upper layers?

slide-66
SLIDE 66

Getting at the Metrics

                 CPU METRICS   MEMORY METRICS   I/O METRICS   NETWORK METRICS
pseudo-files     Yes           Yes              Some          Yes, in 1.6.1+
stats command    Basic         Basic            No            Basic
API              Yes           Yes              Some          Yes

slide-67
SLIDE 67

Pseudo-files

  • Provide visibility into container metrics via the file system.
  • Generally under:

/cgroup/<resource>/docker/$CONTAINER_ID/

or

/sys/fs/cgroup/<resource>/docker/$CONTAINER_ID/



slide-68
SLIDE 68

Pseudo-files: CPU Metrics

$ cat /sys/fs/cgroup/cpuacct/docker/$CONTAINER_ID/cpuacct.stat
> user 2451     # time spent running processes since boot
> system 966    # time spent executing system calls since boot

Pseudo-files: CPU Throttling

$ cat /sys/fs/cgroup/cpu/docker/$CONTAINER_ID/cpu.stat
> nr_periods 565                # Number of enforcement intervals that have elapsed
> nr_throttled 559              # Number of times the group has been throttled
> throttled_time 12119585961    # Total time that members of the group were throttled (12.12 seconds)
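
Since these pseudo-files are plain "key value" text, a collector can read them with very little code. The sketch below parses the cpu.stat sample shown above and derives the share of throttled periods. The helper names are illustrative, `read_stat` assumes the /sys/fs/cgroup layout from the slide, and the nanosecond unit for `throttled_time` is inferred from the slide's "12.12 seconds" annotation.

```python
def parse_stat(text: str) -> dict:
    """Parse 'key value' lines, as found in cpuacct.stat and cpu.stat, into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_stat(resource: str, container_id: str, filename: str) -> dict:
    """Read one pseudo-file, assuming the /sys/fs/cgroup layout shown on the slide."""
    path = f"/sys/fs/cgroup/{resource}/docker/{container_id}/{filename}"
    with open(path) as handle:
        return parse_stat(handle.read())


# Sample values copied from the cpu.stat output on the slide.
sample = "nr_periods 565\nnr_throttled 559\nthrottled_time 12119585961\n"
cpu = parse_stat(sample)

throttled_ratio = cpu["nr_throttled"] / cpu["nr_periods"]
throttled_seconds = cpu["throttled_time"] / 1e9  # nanoseconds to seconds

print(f"{throttled_ratio:.1%} of periods throttled, {throttled_seconds:.2f} s in total")
```

On a real host, `read_stat("cpu", container_id, "cpu.stat")` would replace the hard-coded sample.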

slide-69
SLIDE 69

Docker API

  • Detailed streaming metrics as JSON over an HTTP socket


$ curl -v --unix-socket /var/run/docker.sock http://localhost/containers/28d7a95f468e/stats
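
The same endpoint can be read from code. The sketch below opens the Unix socket with the standard library and turns one stats sample into a CPU percentage. It assumes the response carries `cpu_stats` and `precpu_stats` objects with `cpu_usage.total_usage`, `system_cpu_usage`, and `online_cpus` fields, and that `stream=false` requests a single snapshot, as in the Docker Engine API documentation; verify both against the API version in use. The class and function names are illustrative, and the sample values are made up.

```python
import http.client
import json
import socket


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP connection that talks to a Unix domain socket instead of TCP."""

    def __init__(self, socket_path: str):
        super().__init__("localhost")
        self.socket_path = socket_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.socket_path)


def fetch_stats(container_id: str, socket_path: str = "/var/run/docker.sock") -> dict:
    """Fetch one stats snapshot for a container (requires access to the Docker socket)."""
    conn = UnixHTTPConnection(socket_path)
    try:
        conn.request("GET", f"/containers/{container_id}/stats?stream=false")
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def cpu_percent(stats: dict) -> float:
    """CPU usage between the previous and current sample, as a percentage."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if system_delta <= 0 or cpu_delta < 0:
        return 0.0
    ncpus = cpu.get("online_cpus") or len(cpu["cpu_usage"].get("percpu_usage") or []) or 1
    return cpu_delta / system_delta * ncpus * 100.0


# Made-up sample in the assumed response shape.
sample = {
    "cpu_stats": {"cpu_usage": {"total_usage": 750}, "system_cpu_usage": 3000, "online_cpus": 2},
    "precpu_stats": {"cpu_usage": {"total_usage": 500}, "system_cpu_usage": 2000},
}
print(cpu_percent(sample))
```

With a running daemon, `cpu_percent(fetch_stats("28d7a95f468e"))` would apply the same calculation to the container from the curl example.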


slide-70
SLIDE 70

STATS Command

# Usage: docker stats CONTAINER [CONTAINER...]
$ docker stats $CONTAINER_ID
CONTAINER      CPU %   MEM USAGE/LIMIT     MEM %    NET I/O             BLOCK I/O
ecb37227ac84   0.12%   71.53 MiB/490 MiB   14.60%   900.2 MB/275.5 MB   266.8 MB/872.7 MB

slide-71
SLIDE 71

Side Car Containers

slide-72
SLIDE 72

Aren’t we still missing a layer?

slide-73
SLIDE 73

Open Questions

  • What is the capacity of my cluster?
  • What’s the total throughput of my app?
  • What’s its response time per tag? (app, version, region)
  • What’s the distribution of 5xx errors per container?
  • Where is my container running? What port?
slide-74
SLIDE 74

Service Discovery

Diagram components: Docker API, Orchestrator, Monitoring Agent, Container, Config Backend

Diagram flows:

  • Containers List & Metadata
  • Additional Metadata (Tags, etc)
  • Integration Configurations
  • Host Level Metrics
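
One way to read the service discovery diagram: the agent takes the container list and metadata, looks up a configuration template for each image, and fills in the host and port it discovered. The sketch below illustrates that matching step only. The template format, the `{host}` and `{port}` placeholders, the status URL, and every name in it are hypothetical, not the actual format used by any agent.

```python
# Hypothetical configuration templates keyed by image name.
# In the diagram these would come from the config backend.
TEMPLATES = {
    "nginx": {"check": "nginx", "status_url": "http://{host}:{port}/nginx_status"},
    "redis": {"check": "redis", "host": "{host}", "port": "{port}"},
}


def build_configs(containers, templates=TEMPLATES):
    """Render one integration config per container whose image has a template."""
    configs = []
    for container in containers:
        # Drop the image tag, e.g. "nginx:1.11" -> "nginx" (simple image names only).
        image = container["image"].split(":")[0]
        template = templates.get(image)
        if template is None:
            continue  # no integration known for this image
        rendered = {
            key: value.format(host=container["host"], port=container["port"])
            for key, value in template.items()
        }
        # Orchestrator labels become tags on everything the check reports.
        rendered["tags"] = sorted(f"{k}:{v}" for k, v in container.get("labels", {}).items())
        configs.append(rendered)
    return configs


# Hypothetical container list, as the Docker API and orchestrator might describe it.
containers = [
    {"image": "nginx:1.11", "host": "10.0.0.5", "port": 32768,
     "labels": {"app": "web", "version": "2"}},
    {"image": "busybox", "host": "10.0.0.6", "port": 0, "labels": {}},
]

for config in build_configs(containers):
    print(config)
```

Nothing here is tied to a hostname or a fixed port, so the rendered checks follow containers as they are rescheduled.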

slide-75
SLIDE 75
slide-76
SLIDE 76

Custom Metrics

  • Instrument custom applications

  • You know your key transactions best.

  • Use async protocols like Etsy’s StatsD or DogStatsD

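
StatsD-style instrumentation is asynchronous because the application sends a small UDP datagram and moves on without waiting for a reply. The sketch below builds a datagram in the plain-text `name:value|type` form, with the `|#tag,tag` suffix that DogStatsD adds for tags, and sends it to the conventional port 8125. The metric name, tags, and helper names are hypothetical; a real application would more likely use an existing client library.

```python
import socket


def format_metric(name, value, metric_type="c", tags=None):
    """Build a StatsD datagram; tags use the DogStatsD '|#' extension."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload


def send_metric(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget over UDP: no connection setup and no waiting on the collector."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload.encode("utf-8"), (host, port))


# Count one completed checkout, tagged with app and version (hypothetical names).
send_metric(format_metric("checkout.completed", 1, "c", ["app:store", "version:2"]))
```

If no collector is listening, the datagram is simply dropped, so instrumentation of this kind does not slow down or break the key transaction being measured.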

slide-77
SLIDE 77

Source: http://bit.ly/1NoW6aj

slide-78
SLIDE 78

Resources

Monitoring 101: Alerting
https://www.datadoghq.com/blog/monitoring-101-alerting/

Monitoring 101: Collecting the Right Data
https://www.datadoghq.com/blog/monitoring-101-collecting-data/

Monitoring 101: Investigating performance issues
https://www.datadoghq.com/blog/monitoring-101-investigation/

The Power of Tagged Metrics
https://www.datadoghq.com/blog/the-docker-monitoring-problem/

How to Collect Docker Metrics
https://www.datadoghq.com/blog/how-to-collect-docker-metrics/

8 surprising facts about Docker Adoption
https://www.datadoghq.com/docker-adoption/