Containerizing Databases at New Relic (What We Learned)
Bryant Vinisky and Joshua Galbraith
Santa Clara, California | April 23rd – 25th, 2018
Safe Harbor
This presentation and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission. Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import. Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings New Relic makes with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this presentation or otherwise, with respect to the information provided.
Introductions
New Relic
- A cloud platform to make every aspect of modern software and infrastructure observable!
Bryant Vinisky
- Senior Site Reliability Engineer
- Database Engineering Team
Joshua Galbraith
- Senior Software Engineer
- Database Engineering Team
Agenda
- Where We Started
- Research
- Prerequisites
- Megabase
- Monitoring
- Lessons Learned
- Outcomes
Where We Started
Databases at New Relic circa 2016
Database Management Issues
- 📧 Using Puppet for configuration and deployment
- 🚛 Slow delivery time due to timing of hardware orders
- 💹 Inefficient hardware use
Why Containers?
- Why not use virtual machines instead?
- Why not use AWS?
- Why not multi-tenant logical databases?
Preparing for the Future
- 🤕 New Container Fabric for deploying our stateless apps
- 🚣 Future regions will be entirely container-based!
- 🛥 Incremental delivery of containerized databases?
Setting Goals
📧 Packaging and Deployment
- consistent and repeatable
🚛 Database Delivery Time
- from months to minutes
💹 Cost Efficiency
- reduce wasted resources
Compressed Timeline
- 😆 Managing hundreds of existing, busy databases
- Need to ship an MVP and gain traction quickly
- Dev work is blocked on database delivery time
Research
A Survey of Open-Source Orchestration Tools
Open-Source Container Orchestration
Stateful Containers: Mesos and Marathon
Key Concepts
- Dynamic Provisioning
- Reservation Labels
- Local Persistent Volumes
- External Volumes
Stateful Containers: Kubernetes
Key Concepts
- StatefulSets
- Pods
- Headless Service
- Persistent Volumes
- Operators
Stateful Containers: Nomad
Key Concepts
- Jobs
- Task Groups
- Allocations
- Sticky Volumes
- Volume Plugins
- FS Drivers
Joyent Blog: Autopilot for Databases
Stateful Containers: Emergent Patterns
Abstract Patterns
- Application-aware orchestration
- Autopilot pattern for lifecycle management
- Storage fabrics and networked storage 😟
Stateful Containers: Current Status
Making Trade-Offs
- Dynamic Scheduling vs. Manual Placement
- Custom Orchestration Framework vs. Client-Server
- Distributed Consensus?
- Local Object Storage?
- Lifecycle Management logic inside or outside Container?
Prerequisites
Dynamic Inventory and/or Service Discovery
Problem: Database Inventory
- What systems do we have?
- What logical databases?
- What team is the owner?
- Naming standards?
- API and CLI access?
Megabase Prerequisite: Inventory System
- Metadata on DB systems and services
- Percona containers on Megabase
- Golang service
- HTTP interface to DB
- Update on event and scheduled jobs
- Seed automation tasks
Inventory System: Database
- Query read-only data from other systems and authorities
- Megabase container deployment information
- Provide basic service discovery
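The inventory role described here can be illustrated with a minimal in-memory sketch. The record fields and the `Inventory` class are hypothetical names chosen for illustration; the slides only say that the real system is a Golang service backed by a database, and do not show its schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatabaseRecord:
    """One deployed database container (hypothetical fields)."""
    name: str     # container / service name
    host: str     # Megabase host running the container
    port: int     # port the database listens on
    team: str     # owning team
    engine: str   # e.g. "percona", "postgres", "redis"


class Inventory:
    """In-memory stand-in for the inventory database."""

    def __init__(self) -> None:
        self._records: Dict[str, DatabaseRecord] = {}

    def register(self, record: DatabaseRecord) -> None:
        # Would be called on a deploy event or by a scheduled refresh job.
        self._records[record.name] = record

    def resolve(self, name: str) -> Optional[DatabaseRecord]:
        # Basic service discovery: container name -> host and port.
        return self._records.get(name)

    def owned_by(self, team: str) -> List[DatabaseRecord]:
        # Answers "what team is the owner?" in the reverse direction.
        return sorted((r for r in self._records.values() if r.team == team),
                      key=lambda r: r.name)

    def to_json(self) -> str:
        # Shape of a payload an HTTP interface could return.
        ordered = sorted(self._records.values(), key=lambda r: r.name)
        return json.dumps([asdict(r) for r in ordered])
```

A client tool could call `resolve("orders-db")` to turn a container name into a host and port before connecting, which is the kind of lookup the deployment tooling later relies on.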
Inventory System: HTTP Interface (JSON)
- Query over HTTP
- Return JSON
Inventory System: HTTP Interface (text)
Inventory System: Dynamic Lookups
Megabase
A platform for containerized databases at New Relic
Megabase: What We Built
Megabase: Ingredients
Bare Metal:
- Kernel 4.4 and 4.14
- CentOS 7 and CoreOS
Docker: 1.12 and 1.17
Image OS:
- Alpine (Postgres and Redis)
- Debian (Percona)
Data stores:
- Percona-server 5.6
- Percona-server 5.7
- Postgresql-server 9.3, 9.5 and 9.6
- Redis-server 3.2
Golang: 1.9
Megabase: Docker Image Building
Dockerfile
- Upstream base image
- Custom labels
- Package dependencies
- Add binaries:
  - rclone
  - configuration sync
  - replication bootstrap
- Entrypoint script
Megabase: Docker Entrypoint
Tasks
- Validate data mount
- Sync base configs from S3
- Apply dynamic configuration
- No data => replication bootstrap/initdb
- Start server process
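The task list above amounts to a small decision flow, sketched below. This is an illustration only: the step names are hypothetical labels, the real entrypoint is a script baked into the image, and treating "no data" as "the directory has no entries" is an assumption.

```python
import os
from typing import Callable, List


def plan_entrypoint(data_dir: str,
                    is_mount: Callable[[str], bool] = os.path.ismount,
                    list_dir: Callable[[str], List[str]] = os.listdir) -> List[str]:
    """Return the ordered entrypoint steps for one container start."""
    # Refuse to start unless the data directory is a real mount, so that
    # data never lands in the container's writable layer by accident.
    if not is_mount(data_dir):
        raise RuntimeError(f"{data_dir} is not a mounted volume")

    steps = [
        "sync_base_configs",      # pull base configs from object storage
        "apply_dynamic_config",   # values computed at start time
    ]

    # An empty data directory means a brand-new instance: bootstrap it
    # from replication (or run initdb) before starting the server.
    if not list_dir(data_dir):
        steps.append("bootstrap_or_initdb")

    steps.append("start_server")
    return steps
```

The filesystem checks are passed in as parameters so the flow can be exercised without a real mount; by default they use `os.path.ismount` and `os.listdir`.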
Megabase: Injecting Configuration
Strategies
- Environment variables
- via deploy config
- Configuration files
- via object storage
- version controlled
- Dynamic computation from
cgroup limits
- Docker images
- via image registry
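As an illustration of dynamic computation from cgroup limits, the sketch below derives an InnoDB buffer pool size from the container memory limit. The cgroup v1 file path is the standard location on kernels of that era; the 0.7 fraction and the function names are assumptions made for the example, not values from the talk.

```python
from pathlib import Path

# Standard cgroup v1 location of the memory limit inside a container.
CGROUP_V1_MEMORY_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")

MIB = 1024 * 1024


def read_memory_limit(path: Path = CGROUP_V1_MEMORY_LIMIT) -> int:
    """Return the container memory limit in bytes."""
    return int(path.read_text().strip())


def buffer_pool_bytes(limit_bytes: int, fraction: float = 0.7) -> int:
    """Size the buffer pool as a fraction of the memory limit.

    The 0.7 default is an illustrative assumption. The result is
    rounded down to a whole number of MiB.
    """
    if not 0 < fraction < 1:
        raise ValueError("fraction must be between 0 and 1")
    return int(limit_bytes * fraction) // MIB * MIB


def render_setting(limit_bytes: int) -> str:
    """Render one line for a MySQL-style configuration file."""
    return f"innodb_buffer_pool_size = {buffer_pool_bytes(limit_bytes)}"
```

Computing the value at container start means the same image and base configuration can be reused across containers with different resource limits.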
Megabase API Server
- Pre/Post deployment dependencies
- Manage container runtime dependencies
- Interface to Docker over HTTPS
- Support custom workflows for database services
megabase server
Megabase API: TLS Authentication
Authenticate all requests with TLS client certificates
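Requiring client certificates on every request is commonly called mutual TLS. The Megabase server itself is written in Go; the sketch below uses Python's standard `ssl` module purely to show the idea, and the function names and file-path parameters are hypothetical.

```python
import ssl
from typing import Optional


def require_client_certs(ctx: ssl.SSLContext,
                         client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Make a server-side context reject peers without a valid certificate."""
    if client_ca_file is not None:
        # CA bundle used to verify the certificates that clients present.
        ctx.load_verify_locations(cafile=client_ca_file)
    # CERT_REQUIRED: the handshake fails unless the client sends a
    # certificate that chains to a trusted CA.
    ctx.verify_mode = ssl.CERT_REQUIRED
    return ctx


def server_tls_context(cert_file: str, key_file: str,
                       client_ca_file: str) -> ssl.SSLContext:
    """Build a TLS context for an API server using mutual TLS."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return require_client_certs(ctx, client_ca_file)
```

With this setting an unauthenticated client never reaches an API handler, because the connection is refused during the TLS handshake.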
Authentication Failure
Authentication Success
Megabase API: Endpoints
Megabase API: Dependencies
Containers expect and validate a bind mount under the data directory.
Megabase deploy pre-step to docker run:
- Carve off extents for logical volume and make filesystem
- Create systemd unit file for persistence and mount volume
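The two pre-steps can be sketched as a command plan plus a generated unit file. The volume group name, mount point and `xfs` filesystem type are assumptions for illustration, since the slides do not name them; the commands are only built here, not executed.

```python
from typing import List


def volume_commands(vg: str, name: str, size_gib: int,
                    fs_type: str = "xfs") -> List[List[str]]:
    """Commands that carve a logical volume and put a filesystem on it."""
    device = f"/dev/{vg}/{name}"
    return [
        ["lvcreate", "--size", f"{size_gib}G", "--name", name, vg],
        [f"mkfs.{fs_type}", device],
    ]


def mount_unit(vg: str, name: str, mount_point: str,
               fs_type: str = "xfs") -> str:
    """Text of a systemd mount unit that remounts the volume at boot.

    systemd requires the unit file name to match the mount point;
    `systemd-escape --path --suffix=mount <mount_point>` computes it.
    """
    return "\n".join([
        "[Unit]",
        f"Description=Data volume for {name}",
        "",
        "[Mount]",
        f"What=/dev/{vg}/{name}",
        f"Where={mount_point}",
        f"Type={fs_type}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])
```

Keeping the mount in a systemd unit means the volume comes back after a host reboot, so the container's bind-mount validation passes on restart.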
Megabase API: Create Logical Volume
Megabase API: Container Runtime
Reduce client config burden
Offload predictable defaults:
- Resource Limits
- Bind mount volume to data directory
- Host networking
- Inject environment secrets
Example docker run on cli =>
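The original command from the slide is not reproduced in this transcript. As a stand-in, the sketch below assembles a `docker run` argument list that applies the defaults listed above. The function name and parameters are hypothetical; `--cpu-quota` is used for the CPU limit on the assumption of the default 100000 microsecond CFS period.

```python
from typing import Dict, List

CFS_PERIOD_US = 100000  # default CFS scheduler period in microseconds


def docker_run_args(name: str, image: str, memory_gib: int, cpus: float,
                    volume_path: str, data_dir: str,
                    env: Dict[str, str]) -> List[str]:
    """Build a `docker run` command line with server-side defaults."""
    args = [
        "docker", "run", "--detach",
        "--name", name,
        # Resource limits
        "--memory", f"{memory_gib}g",
        "--cpu-quota", str(int(cpus * CFS_PERIOD_US)),
        # Host networking
        "--net", "host",
        # Bind mount the prepared volume onto the data directory
        "--volume", f"{volume_path}:{data_dir}",
    ]
    # Inject secrets and settings as environment variables.
    for key in sorted(env):
        args += ["--env", f"{key}={env[key]}"]
    args.append(image)
    return args
```

Because the server fills in these arguments, a client only has to supply the handful of values that actually differ between deployments.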
Megabase API: Failover Tasks
Megabase API: Xtrabackup Stream to Peer
Megabase… Client?
“So we’re just supposed to type out those nasty, long curl commands to operate everything? Really? And TLS client auth too? Are you trying to be annoying?”
- anonymous New Relic db-team member
mb client
Megabase Client: mb
Megabase Client: mb
mb: Command Line Interface
- wrap up server API functionality into command line interface
- transparently handle TLS authentication
- query state, health and configuration across servers and deployments
- coordinate deployments across servers
- leverage inventory to resolve host and container names
Megabase Client: mb
Megabase Client: Deployment Manifest
Minimum info required:
- Container name
- Image path and tag
- CPU, memory and disk resources
- VIP for load balancer and DNS
- Owning team
Auto select values if absent
Define one or more containers
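A manifest with these properties could be validated roughly as follows. This is a sketch under assumptions: the key names, the `containers` list and the choice of `port` as the auto-selected value are all hypothetical, since the slide's actual manifest format is not reproduced in this transcript.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

# Keys mirroring the "minimum info required" list (hypothetical names).
REQUIRED = ("name", "image", "cpus", "memory_gib", "disk_gib", "vip", "team")


@dataclass(frozen=True)
class ContainerSpec:
    name: str
    image: str        # image path and tag
    cpus: float
    memory_gib: int
    disk_gib: int
    vip: str          # VIP for load balancer and DNS
    team: str         # owning team
    port: int         # auto-selected when the manifest omits it


def parse_manifest(doc: Dict[str, Any],
                   first_free_port: int = 3306) -> List[ContainerSpec]:
    """Validate a manifest and fill in values that were left out."""
    specs: List[ContainerSpec] = []
    next_port = first_free_port
    for entry in doc["containers"]:   # one or more containers
        missing = [key for key in REQUIRED if key not in entry]
        if missing:
            raise ValueError(f"manifest entry is missing {missing}")
        port = entry.get("port")
        if port is None:              # auto select if absent
            port = next_port
            next_port += 1
        specs.append(ContainerSpec(port=port,
                                   **{key: entry[key] for key in REQUIRED}))
    return specs
```

Rejecting incomplete entries up front keeps half-specified deployments from ever reaching the API servers.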
Megabase Client: mb Deployment
Process
- Target environment and cluster
- Specify a deployment manifest
- Generate port and passwords
- Send requests to API servers
  - new logical volume
  - run container
- Check responses
- Validate deployment
  - health
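The process above can be sketched as a loop over target hosts. Everything named here is an assumption for illustration: the endpoint paths `/volumes` and `/containers`, the callback signatures and the result strings are hypothetical, and the real `mb` client is not shown in this transcript.

```python
import secrets
from typing import Callable, Dict, List

# (host, endpoint, payload) -> HTTP status code
ApiCall = Callable[[str, str, dict], int]
# (host, container name) -> True when the database answers health checks
HealthCheck = Callable[[str, str], bool]


def deploy(hosts: List[str], container: dict,
           api_call: ApiCall, is_healthy: HealthCheck) -> Dict[str, str]:
    """Deploy one container definition to each host and report per host."""
    # Generate credentials once so every instance shares them.
    payload = dict(container, password=secrets.token_urlsafe(24))
    results: Dict[str, str] = {}
    for host in hosts:
        # Order matters: the volume must exist before the container runs.
        for endpoint in ("/volumes", "/containers"):
            status = api_call(host, endpoint, payload)
            if status != 200:
                results[host] = f"failed at {endpoint} ({status})"
                break
        else:
            # Both requests succeeded, so validate the deployment.
            name = container["name"]
            results[host] = "healthy" if is_healthy(host, name) else "unhealthy"
    return results
```

Returning a per-host result rather than stopping at the first failure lets a multi-host deployment report exactly which servers need attention.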
Monitoring
Using New Relic to Monitor New Relic
Monitoring and Alerting
New Relic APM, Infrastructure and Insights
- Hardware/System, Percona, Postgres, Redis, Megabase/Golang
- Connection availability monitoring events to Insights
- Insights dashboards created for deploys
Monitoring: Insights Dashboards MySQL
Monitoring: Insights Dashboards Postgres
Monitoring: Insights Dashboards Redis
Monitoring: Insights NRQL
Monitoring: Service Availability
Monitoring: APM Transactions
Monitoring: Go Runtime
Lessons Learned
Our First Year of Running Databases in Containers
Concern: Host Failure Blast Radius
- Density of database services per host
- Maintenance (still) happens
- Hardware (still) fails
- Time to recover after host failure
- Time spent on failovers
…and then it happened.
- Production host went down
- RAID controller failed
- Several active primary instances affected
- Time to recover was higher than we expected
- Started DRI sprint to address host failures
Fix: Improve Tools for Mass Failover
Tooling updated to support failover:
- All unhealthy database pools: ‘failover pools unhealthy’
- By megabase hostname: ‘failover host demo-megabase-1c’
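The two failover modes can be read as two ways of selecting target pools. The sketch below illustrates that selection only; the `Pool` fields are hypothetical, and interpreting "by hostname" as "pools whose primary runs on that host" is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Pool:
    name: str
    primary_host: str   # Megabase host running the pool's primary
    healthy: bool


def select_pools(pools: Iterable[Pool], argv: List[str]) -> List[Pool]:
    """Resolve failover arguments to the pools that should fail over.

    Accepts the two forms named above:
      ["pools", "unhealthy"]  -> every pool currently unhealthy
      ["host", "<hostname>"]  -> every pool with its primary on that host
    """
    if argv == ["pools", "unhealthy"]:
        return [p for p in pools if not p.healthy]
    if len(argv) == 2 and argv[0] == "host":
        return [p for p in pools if p.primary_host == argv[1]]
    raise ValueError(f"unsupported failover arguments: {argv}")
```

Selecting by host covers the case where a host is about to go down for maintenance and its pools are still healthy, which the health-based mode would miss.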
Redis Tuning Memory Limits
Redis deploy for experiment with RDB bgsave enabled
Keyspace sizing:
- docker update used to walk memory limits up to 2GB, 4GB and 16GB
- redis maxmemory limit increased to 1.5GB
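One way to reason about these numbers is to estimate peak memory while a background save runs. A forked RDB save shares pages with the parent copy-on-write, so pages rewritten during the save are temporarily held twice. The model below is a simplification and an assumption of this sketch, not a measurement from the talk: it ignores Redis overhead beyond `maxmemory` and treats the slide's GB figures as GiB.

```python
GIB = 1024 ** 3


def peak_during_bgsave(maxmemory: int, cow_fraction: float) -> int:
    """Estimate peak bytes while a forked RDB save is running.

    cow_fraction is the share of the dataset rewritten during the save,
    from 0.0 (idle) to 1.0 (every page copied).
    """
    if not 0.0 <= cow_fraction <= 1.0:
        raise ValueError("cow_fraction must be between 0 and 1")
    return int(maxmemory * (1 + cow_fraction))


def fits(limit: int, maxmemory: int, cow_fraction: float) -> bool:
    """True when the estimated peak stays inside the container limit."""
    return peak_during_bgsave(maxmemory, cow_fraction) <= limit
```

Under this model a 1.5 GiB `maxmemory` fits a 2 GiB limit only while less than a third of the dataset changes during a save, whereas a 4 GiB limit absorbs even the worst case.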
Redis PRIMARY Erratic Memory
Kernel OOM on redis-server
Replica Swap Extends RDB Save Time
Fix: Adjust Alert Policies
New Relic NRQL Alerts
- Adjusted swap alert
- Added alert on kernel version mismatch
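The slides implement this check as a NRQL alert, whose query is not reproduced here. The logic behind a kernel mismatch check can be sketched as follows, assuming that "mismatch" means a host running a different kernel than most of its peers; the function name and input shape are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def kernel_mismatches(kernels: Dict[str, str]) -> List[str]:
    """Return hosts whose kernel differs from the most common one.

    `kernels` maps hostname -> kernel release string, for example the
    output of `uname -r` collected from each host in a cluster.
    """
    if not kernels:
        return []
    expected, _count = Counter(kernels.values()).most_common(1)[0]
    return sorted(host for host, kernel in kernels.items()
                  if kernel != expected)
```

Flagging outliers this way catches a primary and its replicas drifting onto different kernels, where memory and swap behavior may no longer be comparable.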
Outcomes
What We Delivered at New Relic
Keys to Our Success
- Internal team supporting internal customers
- Scoped to meet our specific goals
- Controlled slow roll out and adoption
- Team autonomy and control of our hardware
- Existing APIs and tools from other teams
- Limiting technologies and versions involved
- Balancing trade-offs
- Minimal container resource limits
Outcomes
📧 Packaging and Deployment
- is consistent and repeatable
🚛 Database Delivery Time
- takes minutes not months
💱 Cost Efficiency
- fewer wasted resources
🌎 New Regions are Easy*
- single command mass deploys
Megabase Adoption
References
- https://mesosphere.github.io/marathon/docs/persistent-volumes.html
- https://docs.mesosphere.com/1.11/tutorials/stateful-services/
- https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
- https://youtu.be/J-Ke0TxGUSg (Kubernetes StatefulSet)
- https://coreos.com/blog/introducing-operators.html
- https://youtu.be/faUQcd5_MUc (Towards Running Stateful Applications on Nomad)
- https://github.com/hashicorp/nomad/issues/150
- https://twitter.com/kelseyhightower/status/963415653930553345
- https://twitter.com/kelseyhightower/status/963418681148502016
- https://www.joyent.com/blog/dbaas-simplicity-no-lock-in
- https://www.joyent.com/blog/persistent-storage-patterns
- https://thenewstack.io/methods-dealing-container-storage/
- https://techcrunch.com/2015/11/21/i-want-to-run-stateful-containers-too/
Thank You!
Your questions are now welcome.