Kubernetes the Very Hard Way
Laurent Bernaille, Staff Engineer, Infrastructure, Datadog


SLIDE 1

Kubernetes the Very Hard Way

Laurent Bernaille Staff Engineer, Infrastructure @lbernail

SLIDE 2

Datadog

  • Over 350 integrations
  • Over 1,200 employees
  • Over 8,000 customers
  • Runs on millions of hosts
  • Trillions of data points per day
  • 10,000s of hosts in our infrastructure
  • 10s of k8s clusters with 50-2,500 nodes
  • Multi-cloud
  • Very fast growth

SLIDE 3

Why Kubernetes?

  • Dogfooding: improve our k8s integrations
  • Immutable infrastructure: move away from Chef
  • Multi-cloud: common API
  • Community: large and dynamic

SLIDE 4

The very hard way?

SLIDE 5

It was much harder

SLIDE 6

This talk is about the fine print

“Of course, you will need a HA master setup”
“Oh, and yes, you will have to manage your certificates”
“By the way, networking is slightly more complicated, look into CNI / ingress controllers”

SLIDE 7

What happens after “Kube 101”

  • 1. Resilient and Scalable Control Plane
  • 2. Securing the Control Plane
  • a. Kubernetes and Certificates
  • b. Exceptions?
  • c. Impact of Certificate Rotation
  • 3. Efficient networking
  • a. Giving pod IPs and routing them
  • b. Ingresses: Getting data in the cluster
SLIDE 8

What happens after “Kube 101”

  • 1. Resilient and Scalable Control Plane
  • 2. Securing the Control Plane
  • a. Kubernetes and Certificates
  • b. Exceptions?
  • c. Impact of Certificate Rotation
  • 3. Efficient networking
  • a. Giving pod IPs and routing them
  • b. Ingresses: Getting data in the cluster
SLIDE 9

Resilient and Scalable Control Plane

SLIDE 10

Kube 101 Control Plane

Diagram: a single master running etcd, apiserver, controllers, and scheduler; kubelets, kubectl, and in-cluster apps (via a Service) talk to the apiserver.

SLIDE 11

Making it resilient

Diagram: three masters, each running etcd, apiserver, controllers, and scheduler, behind a load balancer; kubelets, kubectl, and in-cluster apps reach the apiservers through it.

SLIDE 12

Kube 101 Control Plane

Diagram: a single master running etcd, apiserver, controllers, and scheduler; kubelets, kubectl, and in-cluster apps (via a Service) talk to the apiserver.

SLIDE 13

Separate etcd nodes

Diagram: three masters (apiserver, controllers, scheduler) behind a load balancer, with etcd moved to dedicated nodes.

SLIDE 14

Single active controller/scheduler

Diagram: same topology; controllers and schedulers run on every master, but only one instance of each is active at a time (leader election).

SLIDE 15

Split scheduler/controllers

Diagram: apiservers behind the load balancer; controllers and schedulers moved to their own nodes, separate from the apiservers; etcd on dedicated nodes.

SLIDE 16

Split etcd

Diagram: same topology, plus a second etcd cluster dedicated to events, separate from the main etcd.

SLIDE 17

Sizing the control plane

  • etcd: 2 clusters x (3 or 5 nodes), disk and network I/O bound
  • apiservers: X nodes, RAM and network I/O bound
  • controllers: 2 nodes, CPU bound
  • schedulers: 2 nodes, CPU bound

SLIDE 18

  • 1. Resilient and Scalable Control Plane
  • 2. Securing the Control Plane
  • a. Kubernetes and Certificates
  • b. Exceptions?
  • c. Impact of Certificate Rotation
  • 3. Efficient networking
  • a. Giving pod IPs and routing them
  • b. Ingresses: Getting data in the cluster

What happens after “Kube 101”

SLIDE 19

Kubernetes and Certificates

SLIDE 20

From “the hard way”

SLIDE 21

“Our cluster broke after ~1y”

SLIDE 22

Certificates in Kubernetes

  • Kubernetes uses certificates everywhere
  • Very common source of incidents
  • Our Strategy: Rotate all certificates daily
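Rotating everything daily only works if expirations are watched continuously. As a minimal sketch (not Datadog's actual tooling), a check like the following could alert when a certificate gets close to its notAfter date, assuming the date string was already extracted (e.g. with `openssl x509 -noout -enddate`):

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> float:
    """Days left before a certificate expires.

    `not_after` is the string openssl prints, e.g.
    "notAfter=Jun  1 12:00:00 2026 GMT" (the "notAfter=" prefix is optional).
    """
    date_part = not_after.split("=", 1)[-1].strip()
    expiry = datetime.strptime(date_part, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - datetime.now(timezone.utc)).total_seconds() / 86400

# With daily rotation, a cert that is more than a couple of days old
# means rotation is broken: alert long before the actual deadline.
```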
SLIDE 23

Certificate management

Diagram: Vault hosts an etcd PKI that issues the etcd peer/server certificates and the apiserver's etcd client certificate.

SLIDE 24

Certificate management

Diagram: Vault hosts two PKIs:

  • etcd PKI: etcd peer/server certificates, etcd client certificate for the apiserver
  • kube PKI: apiserver / kubelet-client certificate, controller client certificate, scheduler client certificate, kubelet client/server certificates

SLIDE 25

Certificate management

Diagram: same PKIs, plus a kube kv store in Vault for the service-account keys; in-cluster apps authenticate with SA tokens.

  • etcd PKI: etcd peer/server certificates, etcd client certificate
  • kube PKI: apiserver / kubelet-client certificate, controller client certificate, scheduler client certificate, kubelet client/server certificates
  • kube kv: SA public key, SA private key

SLIDE 26

Certificate management

Diagram: same as before, plus an apiservice PKI for the aggregation layer.

  • etcd PKI: etcd peer/server certificates, etcd client certificate
  • apiservice PKI: apiservice certificates (proxy/webhooks)
  • kube PKI: apiserver / kubelet-client certificate, controller client certificate, scheduler client certificate, kubelet client/server certificates
  • kube kv: SA public key, SA private key (SA tokens for in-cluster apps)

SLIDE 27

Certificate management

Diagram: same as before, plus OIDC: kubectl users authenticate through an OIDC provider.

  • etcd PKI: etcd peer/server certificates, etcd client certificate
  • apiservice PKI: apiservice certificates (proxy/webhooks)
  • kube PKI: apiserver / kubelet-client certificate, controller client certificate, scheduler client certificate, kubelet client/server certificates
  • kube kv: SA public key, SA private key (SA tokens for in-cluster apps)
  • OIDC auth for kubectl users

SLIDE 28

Exception? Incident...

SLIDE 29

Kubelet: TLS Bootstrap

Diagram (setup):

  • 1. Admin creates a bootstrap token
  • 2. Admin adds the bootstrap token to Vault
  • 3. Controllers get the signing key from Vault (kube PKI)

SLIDE 30

Kubelet: TLS Bootstrap

Diagram (bootstrap):

  • 1. Kubelet gets the bootstrap token from Vault
  • 2. Kubelet authenticates to the apiserver with the token
  • 3. Apiserver verifies the token and maps groups
  • 4. Kubelet creates a CSR
  • 5. Controller verifies RBAC for the CSR creator
  • 6. Controller signs the certificate
  • 7. Kubelet downloads the certificate
  • 8. Kubelet authenticates with the certificate
  • 9. Kubelet registers the node

SLIDE 31

Kubelet certificate issue

  • 1. One day, some kubelets were failing to start, or took tens of minutes to come up
  • 2. Nothing in the logs
  • 3. Everything looked good, but they could not get a certificate
  • 4. It turned out we had a lot of CSRs in flight
  • 5. The signing controller was having a hard time evaluating them all

Chart: CSR resources in the cluster (lower is better!)

SLIDE 32

Why?

Kubelet Authentication

  • Initial creation: bootstrap token, mapped to group “system:bootstrappers”
  • Renewal: use the current node certificate, mapped to group “system:nodes”

Required RBAC permissions

  • CSR creation
  • CSR auto-approval

Group                  CSR creation   CSR auto-approval
system:bootstrappers   OK             OK
system:nodes           OK             (missing)

SLIDE 33

Exception 2? Incident 2...

SLIDE 34

Temporary solution

Diagram: admin creates the webhook configuration with a self-signed certificate as CA and adds the cert + key to Vault (kube kv); the webhook gets its cert and key from there.

One day, after ~1 year

  • Creation of resources started failing (luckily only a Custom Resource)
  • Cert had expired...
SLIDE 35

Take-away

  • Rotate server/client certificates
  • Not easy

But, “If it’s hard, do it often” > no expiration issues anymore

SLIDE 36

Impact of Certificate rotation

SLIDE 37

Apiserver certificate rotation

SLIDE 38

lbernail

Impact on etcd

Charts: apiserver restarts vs. etcd slow queries and etcd traffic

  • We have multiple apiservers and restart each daily
  • Significant etcd network impact (caches are repopulated)
  • Significant impact on etcd performance

SLIDE 39

Impact on Load-balancers

Chart: apiserver restarts vs. ELB surge queue

Significant impact on the LB as connections are re-established. Mitigation: increase accept queues on the apiservers (net.ipv4.tcp_max_syn_backlog, net.core.somaxconn).
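The two sysctls named on the slide can be inspected before tuning them. A minimal, Linux-only sketch (raising them requires `sysctl -w` or an /etc/sysctl.d/ drop-in, done as root):

```python
def read_sysctl(name: str) -> int:
    """Read a kernel parameter from /proc/sys (Linux only)."""
    path = "/proc/sys/" + name.replace(".", "/")
    with open(path) as f:
        return int(f.read().split()[0])

# The two listen-queue knobs the slide mentions: raising them lets the
# apiservers absorb the SYN/accept burst when every client reconnects.
for name in ("net.ipv4.tcp_max_syn_backlog", "net.core.somaxconn"):
    print(name, "=", read_sysctl(name))
```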

SLIDE 40

Impact on apiserver clients

Chart: apiserver restarts vs. coredns memory usage

  • Apiserver restarts
  • Clients reconnect and refresh their caches

> Memory spike for impacted apps. No real mitigation today.

SLIDE 41

Impact on traffic balance

  • Number of connections and traffic are very unbalanced (e.g. 15 MB/s and 2,300 connections on one apiserver vs. 2.5 MB/s and 300 on another)
  • Because connections are very long-lived
  • More clients => bigger impact cluster-wide

SLIDE 42

Why? Simple simulation

Simulation for 48h

  • 5 apiservers
  • 10000 connections (4 x 2500 nodes)
  • Every 4h, one apiserver restarts
  • Reconnections evenly dispatched

Cause

  • Cloud TCP load-balancers use round-robin
  • Long-lived connections
  • No rebalancing
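The simulation described above fits in a few lines, assuming an ideal round-robin LB and immediate, evenly dispatched reconnections:

```python
import collections

def simulate(n_servers=5, n_clients=10_000, restart_every_h=4, hours=48):
    """Round-robin TCP LB, long-lived connections, periodic restarts.

    Each restart drops one server's connections; the reconnections are
    spread round-robin over all servers, so the counts drift apart and
    the most recently restarted server stays underloaded."""
    conns = collections.Counter({s: 0 for s in range(n_servers)})
    rr = 0

    def assign(n):
        nonlocal rr
        for _ in range(n):
            conns[rr % n_servers] += 1
            rr += 1

    assign(n_clients)                      # initial even spread
    for step in range(hours // restart_every_h):
        victim = step % n_servers          # restart the apiservers in turn
        dropped = conns[victim]
        conns[victim] = 0
        assign(dropped)                    # all its clients reconnect
    return conns
```

With the slide's parameters the final counts end up roughly 5x apart, consistent with the 2,300-vs-300-connection imbalance observed in production.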
SLIDE 43

Kubelet certificate rotation

SLIDE 44

Pod graceful termination

Diagram: an admin or controller deletes the pod via the apiserver; the kubelet asks containerd to stop the container with a timeout (“terminationGracePeriodSeconds”); containerd sends SIGTERM, then SIGKILL after the timeout.
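The SIGTERM-then-SIGKILL sequence can be sketched with a tiny process supervisor (an illustration of the behavior, not containerd's actual code):

```python
import signal
import subprocess

def stop_container(proc: subprocess.Popen, grace_seconds: float) -> int:
    """Mimic the pod stop sequence: SIGTERM, wait up to
    terminationGracePeriodSeconds, then SIGKILL if still running."""
    proc.send_signal(signal.SIGTERM)
    try:
        return proc.wait(timeout=grace_seconds)   # exited gracefully
    except subprocess.TimeoutExpired:
        proc.kill()                               # SIGKILL after the grace period
        return proc.wait()
```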

SLIDE 45

Restarts impact graceful termination

Diagram: same flow, but containerd sends SIGKILL after the timeout or when the context is cancelled: kubelet restarts end graceful termination early.

Fixed upstream: “Do not SIGKILL container if container stop is cancelled” https://github.com/containerd/cri/pull/1099

SLIDE 46

Impact on pod readiness

Issue upstream: “pod with readinessProbe will be not ready when kubelet restart” https://github.com/kubernetes/kubernetes/issues/78733

Chart: kubelet restarts on “system” nodes (coredns + other services) vs. coredns endpoints NotReady

On kubelet restart

  • Readiness probes marked as failed
  • Pods removed from service endpoints
  • Requires readiness to succeed again
SLIDE 47

Take-away

Restarting components is not transparent. It would be great if:

○ Components could transparently reload certs (server & client)
○ Clients waited a random 0-Xs before reconnecting, to avoid a thundering herd
○ Reconnections did not trigger memory spikes
○ Cloud TCP load-balancers supported a least-conn algorithm
○ Connections were rebalanced (kill them after a while?)

SLIDE 48

What happens after “Kube 101”

  • 1. Resilient and Scalable Control Plane
  • 2. Securing the Control Plane
  • a. Kubernetes and Certificates
  • b. Exceptions?
  • c. Impact of Certificate Rotation
  • 3. Efficient networking
  • a. Giving pod IPs and routing them
  • b. Ingresses: Getting data in the cluster
SLIDE 49

Efficient networking

SLIDE 50

Network challenges

  • Throughput: trillions of data points daily
  • Scale: clusters of 1,000-2,000 nodes
  • Latency: end-to-end pipeline
  • Topology: multiple clusters, access from standard VMs

SLIDE 51

Giving pods IPs & Routing them

SLIDE 52

From “the Hard Way”

Config excerpt: each node has its node IP and a pod CIDR allocated to it.

SLIDE 53

Small cluster? Static routes

Node 1: IP 192.168.0.1, pod CIDR 10.0.1.0/24
Node 2: IP 192.168.0.2, pod CIDR 10.0.2.0/24

Routes (local or cloud provider):
10.0.1.0/24 => 192.168.0.1
10.0.2.0/24 => 192.168.0.2

Limits:
  • local routes: nodes must be in the same subnet
  • cloud provider routes: number of routes
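The static-route scheme above can be sketched by carving one pod CIDR per node out of a supernet and emitting the route table (a toy illustration; the supernet and prefix are the slide's example values, not a recommendation):

```python
import ipaddress

def pod_routes(node_ips, pod_supernet="10.0.0.0/16", per_node_prefix=24):
    """Assign one pod CIDR per node and return the static routes
    (pod CIDR -> node IP) every node or the cloud route table needs."""
    subnets = ipaddress.ip_network(pod_supernet).subnets(new_prefix=per_node_prefix)
    next(subnets)  # skip 10.0.0.0/24 so numbering matches the slide (10.0.1.0/24, ...)
    return {str(cidr): ip for cidr, ip in zip(subnets, node_ips)}

print(pod_routes(["192.168.0.1", "192.168.0.2"]))
# {'10.0.1.0/24': '192.168.0.1', '10.0.2.0/24': '192.168.0.2'}
```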

SLIDE 54

Mid-size cluster? Overlay

Tunnel traffic between hosts (VXLAN). Examples: Calico, Flannel

Node 1: IP 192.168.0.1, pod CIDR 10.0.1.0/24
Node 2: IP 192.168.0.2, pod CIDR 10.0.2.0/24

Limits:
  • Overhead of the overlay
  • Scaling route distribution (control plane)

SLIDE 55

Large cluster with a lot of traffic? Native pod routing

Performance
  • Datapath: no overhead
  • Control plane: simpler

Addressing
  • Pod IPs are accessible from other clusters and from VMs
SLIDE 56

In practice

On premise: BGP (Calico, kube-router), macvlan

AWS: additional IPs on ENIs (AWS EKS CNI plugin, Lyft CNI plugin, Cilium ENI IPAM)

GCP: IP aliases

SLIDE 57

How it works on AWS

Diagram: an agent attaches an ENI (eth1) and allocates IPs on it; the kubelet calls containerd (CRI), which calls the CNI plugin; the plugin creates a veth pair for the pod, assigns one of the ENI IPs, and adds the routing rule “from IP1, use eth1”.

SLIDE 58

Address space planning

Pod CIDR: /24

  • A /24 per node leads to inefficient address usage
  • sig-network: remove the contiguous-range requirement for CIDR allocation
  • But also:

○ Address space for node IPs (another /20 per cluster for 4,096 nodes)
○ Service IP range (a /20 would make sense for such a cluster)

  • Total per cluster: a /12 for pods (12 node bits + 8 pod bits), plus 2 /20s for nodes and services!

Layout: 10. (8 bits) | cluster (4 bits) | node prefix (12 bits) | pod (8 bits)
  • Up to 16 clusters (4 bits available)
  • Up to 4,096 nodes per cluster (12 bits)
  • Up to 255 pods per node (8 bits)
  • Simple addressing
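The arithmetic of this layout can be checked with the stdlib `ipaddress` module:

```python
import ipaddress

# Layout from the slide: 10.0.0.0/8 = 8 bits "10." | 4 cluster | 12 node | 8 pod
ten = ipaddress.ip_network("10.0.0.0/8")

clusters = list(ten.subnets(new_prefix=12))            # 4 cluster bits
node_cidrs = list(clusters[0].subnets(new_prefix=24))  # 12 node bits per cluster

print(len(clusters))                # 16 clusters
print(len(node_cidrs))              # 4096 nodes per cluster
print(node_cidrs[0].num_addresses)  # 256 addresses per node (/24)
```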

SLIDE 59

Take-away

  • Native pod routing has worked very well at scale
  • A bit more complex to debug
  • Much more efficient datapath
  • Topic is still dynamic (Cilium introduced ENI recently)
  • Great relationship with Lyft / Cilium
  • Plan your address space early
SLIDE 60

Ingresses

SLIDE 61

Ingress: cross-clusters, VM to clusters

Diagram: applications A, B, C, D spread across Cluster 1, Cluster 2, and classic VMs; each side needs to reach services running in the others (where is C? where is B?).

SLIDE 62

Kubernetes default: LB service

Diagram: external clients reach a load balancer whose backends are all the nodes (NodePort); kube-proxy on each node forwards traffic to the pods. On the master, the service-controller configures the LB and its health checks from watching LoadBalancer services on the apiservers.

SLIDE 63

Inefficient datapath & cross-application impacts

Diagram: same LB-service setup for web traffic; since every node is a backend, web traffic can land on a node that only runs kafka and take an extra kube-proxy hop, impacting unrelated applications.

SLIDE 64

ExternalTrafficPolicy: Local?

Diagram: with externalTrafficPolicy: Local, nodes without a local web pod fail the LB health checks and drop out of rotation, so traffic only reaches nodes actually running web pods (no extra hop).

SLIDE 65

L7-proxy ingress controller

Diagram: external clients reach the load balancer, whose backends are nodes (NodePort) running l7proxy pods; the l7proxies forward directly to application pods. The ingress-controller creates the l7proxy deployments and updates their backends from service endpoints (watching ingresses/endpoints on the apiservers); the service-controller configures the LB and health checks (watching LoadBalancer services).

SLIDE 66

Challenges

Limits:
  • All nodes as backends (1,000+)
  • Inefficient datapath
  • Cross-application impacts

Alternatives?
  • ExternalTrafficPolicy: Local? > the number of backend nodes remains the same; issues with some CNI plugins
  • K8s ingress? > still load-balancer based; need to scale the ingress pods; still an inefficient datapath
SLIDE 67

Our target: native routing

Diagram: external clients reach an ALB whose targets are the pod IPs directly (native pod routing); the alb-ingress-controller configures the ALB and its health checks from watching ingresses/endpoints on the apiservers.

SLIDE 68

Remaining challenges

Limited to HTTP ingresses
  • No support for TCP/UDP
  • Ingress v2 should address this

Registration delay
  • Registration with the LB is slow; pod rolling-updates are much faster
  • Mitigations: MinReadySeconds, Pod ReadinessGates
SLIDE 69

Workaround

TCP / registration delay not manageable > dedicated gateways

Diagram: external clients reach a load balancer in front of l7proxies running on dedicated nodes, as pods in the host network, not managed by k8s; the proxies forward to application pods.

SLIDE 70

Take-away

  • Ingress solutions are not great at scale yet
  • May require workarounds
  • Definitely a very important topic for us
  • The community is working on v2 Ingresses
SLIDE 71

Conclusion

SLIDE 72

A lot of other topics

  • Accessing services (kube-proxy)
  • DNS (it’s always DNS!)
  • Challenges with Stateful applications
  • How to DDOS <insert ~anything> with Daemonsets
  • Node Lifecycle / Cluster Lifecycle
  • Deploying applications
  • ...
SLIDE 73

Getting started?

“Deep Dive into Kubernetes Internals for Builders and Operators”, Jérôme Petazzoni, LISA 2019
https://lisa-2019-10.container.training/talk.yml.html
Minimal cluster, showing interactions between the main components

“Kubernetes the Hard Way”, Kelsey Hightower
https://github.com/kelseyhightower/kubernetes-the-hard-way
HA control plane with encryption

SLIDE 74

You like horror stories?

“Kubernetes the very hard way at Datadog”
https://www.youtube.com/watch?v=2dsCwp_j0yQ

“10 ways to shoot yourself in the foot with Kubernetes”
https://www.youtube.com/watch?v=QKI-JRs2RIE

“Kubernetes Failure Stories”
https://k8s.af

SLIDE 75

Key lessons

Self-managed Kubernetes is hard
> If you can, use a managed service

Networking is not easy (especially at scale)

The main challenge is not technical
> Build a team
> Transforming practices and training users is very important

SLIDE 76

Thank you

We’re hiring! https://www.datadoghq.com/careers/
laurent@datadoghq.com
@lbernail