A Series Of Unfortunate Container Events
Netflix’s container platform lessons learned
About the speakers and team
Follow along - @sargun @tomaszbak_ca @fabiokung @aspyker @amit_joshee @anwleung
Netflix’s container management platform
○ Service & batch job management
○ Resource management & optimization
○ Docker/AWS integration
○ Netflix infrastructure support
○ Container execution
Containers In Production
Current Titus scale
Single cloud platform for VMs and containers
Integrate containers with AWS EC2
Container users on Titus
○ Stream processing (Flink)
○ UI services (NodeJS)
○ Internal dashboards
○ Personalization ML model training (GPUs)
○ Content value analysis
○ Digital watermarking
○ Ad hoc reporting
○ Continuous integration builds
○ Media encoding experimentation
○ Archer
Titus high level overview
[Diagram: Titus API (Rhea) backed by Cassandra; Titus Scheduler built on Fenzo with EC2 Auto Scaling; Titus Agents on AWS virtual machines running user and system containers via the Mesos agent and Docker, pulling images from a Docker registry; jobs submitted by workflow systems]
Look away, look away, look away, look away
This session will wreck your evening, your whole life, and your day
Every single episode is nothing but dismay
So, look away, look away, look away
Lessons learned from a year in production?
Run-away submissions
Problem: a user submits a job, then checks its status; if the API doesn’t answer, the client assumes a 404 and re-submits, creating duplicate jobs.
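One common mitigation for this retry-duplication pattern (not necessarily what Titus did) is to make submission idempotent with a client-supplied token, so re-submitting the same request cannot create a second job. A minimal sketch; all names here are hypothetical:

```python
import uuid

class JobStore:
    """Toy in-memory job store keyed by a client-supplied idempotency token."""
    def __init__(self):
        self._by_token = {}

    def submit(self, spec, idempotency_token):
        # A retry with the same token returns the existing job id
        # instead of creating a duplicate job.
        if idempotency_token in self._by_token:
            return self._by_token[idempotency_token]
        job_id = str(uuid.uuid4())
        self._by_token[idempotency_token] = job_id
        return job_id

store = JobStore()
token = str(uuid.uuid4())          # generated once, reused on every retry
first = store.submit({"image": "org/app:1.0"}, token)
retry = store.submit({"image": "org/app:1.0"}, token)
assert first == retry              # the retry did not create a second job
```

The key design point: the token is generated client-side before the first attempt, so even a request whose response was lost can be retried safely.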
System perceived as an infinite queue
Problem: what worked for a content-processing job of 100 containers broke when a user ran a “back-fill” of hundreds of thousands of containers; users treated the system as an infinite queue.
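One way to stop a single back-fill from being treated as an infinite queue is to cap outstanding work per user at admission time. A hypothetical sketch, not Titus's actual policy:

```python
class AdmissionController:
    """Toy per-user admission control: cap outstanding containers per user
    so one giant back-fill cannot monopolize the cluster."""
    def __init__(self, per_user_limit):
        self.per_user_limit = per_user_limit
        self.outstanding = {}

    def try_admit(self, user, count):
        current = self.outstanding.get(user, 0)
        if current + count > self.per_user_limit:
            return False           # reject up front instead of queueing forever
        self.outstanding[user] = current + count
        return True

ctrl = AdmissionController(per_user_limit=1000)
assert ctrl.try_admit("content-team", 100) is True      # the 100-container job
assert ctrl.try_admit("content-team", 100_000) is False # the back-fill is rejected
```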
Invalid jobs
Problem: users exercise REST/JSON in surprising ways, e.g. submitting { "env": { "PATH": null } }, and invalid jobs enter the system.
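The usual fix is strict validation at the API boundary before the job is persisted. A minimal sketch of such a check (illustrative, not the Titus API's actual validator):

```python
def validate_env(env):
    """Reject env maps whose keys or values are not plain strings,
    e.g. the { "PATH": null } payload from the slide."""
    if not isinstance(env, dict):
        raise ValueError("env must be an object mapping string to string")
    for key, value in env.items():
        if not isinstance(key, str) or not isinstance(value, str):
            raise ValueError(f"env entry {key!r} must map a string to a string")
    return env

validate_env({"PATH": "/usr/bin"})      # accepted
try:
    validate_env({"PATH": None})        # the bad payload is rejected early
except ValueError as err:
    print("rejected:", err)
```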
Failing jobs that repeat
Problem: jobs with unfixable errors are retried forever, e.g. a typo’d image tag (“org/imagename:lateest”) or command (/bin/besh -c ...).
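A typo’d image tag can never succeed, so retrying it is pure waste. One mitigation is to classify failures and cap retries; a hypothetical policy sketch, not Titus's actual one:

```python
import enum

class Failure(enum.Enum):
    TRANSIENT = "transient"    # e.g. host died mid-run: worth retrying
    PERMANENT = "permanent"    # e.g. image tag typo: retrying can never succeed

def should_retry(failure, attempts, max_attempts=5):
    """Stop repeating jobs that can never succeed, and cap retries
    even for transient failures."""
    if failure is Failure.PERMANENT:
        return False
    return attempts < max_attempts

assert should_retry(Failure.PERMANENT, attempts=0) is False   # bad tag: give up now
assert should_retry(Failure.TRANSIENT, attempts=1) is True
assert should_retry(Failure.TRANSIENT, attempts=5) is False   # retry budget spent
```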
Testing for “bad” job data
Manual removal of bad job state? ✖
Solution: test production data sets in staging (PROD → STAGING → PROD).
Identifying bad actors
Added auditing to the V2 API to record who is performing each action; carried forward into the V3 API.
Really bad actors - container escape protections
○ Docker 1.10 - user namespaces (Feb 2016)
○ Docker 1.11 - fixed shared networking namespaces
■ User id mapping is per daemon (not per container)
○ Deployed user namespaces recently
■ Problems: shared NFS, OS X LDAP uid/gids
○ Users only have access to containers, not hosts
○ Required “power user” ssh access for perf/debugging
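For reference, in stock Docker the user-namespace remapping mentioned above is a daemon-level setting (`userns-remap` in `/etc/docker/daemon.json`), which is why the uid/gid mapping is shared by every container on the daemon. A generic example, not Titus's exact configuration:

```json
{
  "userns-remap": "default"
}
```

With `"default"`, the daemon creates a `dockremap` user and maps all container root users into that unprivileged host uid/gid range.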
Cloud rate limiting and overall limits
Problem: a user asks for a red/black deploy of 2000 containers instantly, running into cloud API rate limits and account-level limits.
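A standard way to stay under a cloud provider's API rate limits is to put a token bucket in front of the calls, absorbing bursts and smoothing the rest. An illustrative sketch (real systems also need per-call-type limits and backoff on throttle responses):

```python
import time

class TokenBucket:
    """Simple token-bucket limiter for calls to a cloud API."""
    def __init__(self, rate, burst):
        self.rate = rate            # tokens added per second
        self.capacity = burst       # maximum burst size
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=10, burst=100)
allowed = sum(bucket.allow() for _ in range(2000))
# Only roughly the burst size succeeds immediately; the rest must wait.
assert allowed < 2000
```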
Hosts start or go bad
Problem: hosts fail, and new hosts can take a long time to become healthy.
Upgrades - in-place upgrades ✖
[Diagram: an agent running Docker V1, Titus Agent V1, and Mesos V1 with a batch container and a service container is updated in place to Docker V2, Titus Agent V2, and Mesos V2 while the containers keep running; this approach failed]
Upgrades - Whole cluster red/black
Upgrades - partitioned cluster updates
○ Batch jobs to have runtime limits
○ Service jobs moved with Spinnaker migration tasks
[Diagram: the old partition (Docker V1, Titus Agent V1, Mesos V1) is let “drain” while its batch container finishes and its service container moves to the new partition (Docker V2, Titus Agent V2, Mesos V2)]
CI/CD Task Migration
Disconnected containers
Problem: agents lock up or disconnect, leaving containers running that the scheduler no longer expects.
Solution: reconcile agent and scheduler state as aggressively as possible (agent: “I’m running”; scheduler: “You shouldn’t be!” → stop container).
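The “I’m running” / “You shouldn’t be!” exchange above can be sketched as a reconciliation pass in which the scheduler's view is authoritative; any task an agent reports that the scheduler doesn't expect gets a kill order. A hypothetical sketch:

```python
def reconcile(scheduler_tasks, agent_tasks):
    """One reconciliation pass: return the tasks the agent should stop.

    scheduler_tasks: set of task ids the scheduler believes should run.
    agent_tasks: set of task ids the agent reports as running.
    """
    to_kill = []
    for task_id in sorted(agent_tasks):
        if task_id not in scheduler_tasks:
            to_kill.append(task_id)   # orphaned: tell the agent to stop it
    return to_kill

scheduler_view = {"task-1", "task-2"}
agent_report = {"task-1", "task-2", "task-9"}   # task-9 survived a lock-up
assert reconcile(scheduler_view, agent_report) == ["task-9"]
```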
Scheduler failover speed is important
[Diagram: the active scheduler saves state to Cassandra (C*); when it crashes, the standby restores that state from Cassandra and becomes active]
Problem: scheduler failover time increased with scale.
Solutions: active/active ✖
Know your dependencies
Problem: the agent depends on DNS, S3, and Zookeeper; a failure in any of them affects containers.

Containers require kernel knowledge
○ Resource isolation
○ Security
○ Networking, etc.
○ Need for BPF, kprobes
Strategy: embrace chaos
Chaos test all services (even our scheduler).
Alerting and dashboards are key
A very complex system means very complex telemetry and alerting, continuously evolving.
Telemetry system by the numbers:
○ Metrics: 100’s
○ Dashboard graphs: 70+
○ Alerts: 50+
○ Elasticsearch indexes: 4
Temporary ad hoc remediation
○ Detecting and remediating problems when capacity management isn’t working
Solid software
○ Docker Distribution
○ Apache Mesos
○ Zookeeper
○ Cassandra
Managing the Titus product
○ Focus on core business value
○ Features are important; what we deliberately chose NOT to do is just as important!
Ops enablement
Phase 1: manual red/black deploys
Phase 2: runbook for on-call
Phase 3: automated pipelines
Pipeline: deploy new ASG → find current leader → terminate non-leader nodes → terminate current leader
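The automated pipeline steps above can be sketched as a script: deploy the replacement ASG, then terminate old non-leader nodes first and the leader last, so leadership hands off exactly once. All names here (`cloud`, `FakeCloud`) are hypothetical, not a real AWS SDK:

```python
def failover_pipeline(cloud, cluster):
    """Deploy a new ASG, then drain the old one leader-last."""
    cloud.deploy_new_asg(cluster)
    leader = cloud.find_current_leader(cluster)
    for node in cloud.old_nodes(cluster):
        if node != leader:
            cloud.terminate(node)     # non-leaders first: no extra failovers
    cloud.terminate(leader)           # single leadership handoff at the end

class FakeCloud:
    """Minimal in-memory stand-in used to exercise the pipeline's ordering."""
    def __init__(self, nodes, leader):
        self.nodes, self.leader, self.log = nodes, leader, []
    def deploy_new_asg(self, cluster): self.log.append("deploy")
    def find_current_leader(self, cluster): return self.leader
    def old_nodes(self, cluster): return list(self.nodes)
    def terminate(self, node): self.log.append(f"kill:{node}")

cloud = FakeCloud(nodes=["a", "b", "c"], leader="b")
failover_pipeline(cloud, "titusmaster")
assert cloud.log == ["deploy", "kill:a", "kill:c", "kill:b"]
```

Terminating the leader last means the cluster only re-elects once, which is why the runbook ordered the steps this way.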
Service Level Objectives (SLOs)
○ Start latency
○ % crashed
○ API availability
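Two of the three SLO signals above can be computed from per-task records; a toy sketch with hypothetical field names (API availability would come from request logs instead):

```python
def slo_report(tasks, latency_slo_seconds=60.0):
    """Compute start-latency attainment and crash rate over task records."""
    total = len(tasks)
    within = sum(t["start_latency_s"] <= latency_slo_seconds for t in tasks)
    crashed = sum(t["crashed"] for t in tasks)
    return {
        "start_latency_ok_pct": 100.0 * within / total,
        "crashed_pct": 100.0 * crashed / total,
    }

tasks = [
    {"start_latency_s": 12.0, "crashed": False},
    {"start_latency_s": 95.0, "crashed": False},   # missed the latency SLO
    {"start_latency_s": 30.0, "crashed": True},
    {"start_latency_s": 8.0,  "crashed": False},
]
report = slo_report(tasks)
assert report["start_latency_ok_pct"] == 75.0
assert report["crashed_pct"] == 25.0
```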
Documenting container readiness
○ Alpha (early users), beta (subject to change), GA (mass adoption)
Growing usage slowly, carefully
○ 4Q 2015: Titus created, batch GA
○ 1Q 2016: service support added
○ 4Q 2016: first at-scale production service
○ 2Q 2017: Netflix customer-facing service
Key takeaways
Questions?
Backup
Titus High Level Overview
[Diagram: Titus UI → Rhea → Titus API, backed by Cassandra; Titus Master (job management & scheduler, built on Fenzo) with Zookeeper, the EC2 auto-scaling API, and the Mesos master]
[Diagram: Titus Agent on AWS VMs - Titus executor, metrics and logging agents, btrfs, Mesos agent, and Docker with pod & VPC network drivers and an AWS metadata proxy; integration with a Docker registry and S3]