Containerizing Databases at New Relic (What We Learned)



SLIDE 1

Bryant Vinisky and Joshua Galbraith
Santa Clara, California | April 23rd – 25th, 2018

Containerizing Databases at New Relic (What We Learned)

SLIDE 2

Safe Harbor

This presentation and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission. Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import. Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings New Relic makes with the SEC from time to time. Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this presentation or otherwise, with respect to the information provided.

SLIDE 3

Introductions

New Relic

  • A cloud platform to make every aspect of modern software and infrastructure observable!

Bryant Vinisky

  • Senior Site Reliability Engineer
  • Database Engineering Team

Joshua Galbraith

  • Senior Software Engineer
  • Database Engineering Team
SLIDE 4

Agenda

  • Where We Started
  • Research
  • Prerequisites
  • Megabase
  • Monitoring
  • Lessons Learned
  • Outcomes

SLIDE 5

Where We Started

Databases at New Relic circa 2016

SLIDE 6

Database Management Issues

📧 Using Puppet for configuration and deployment
🚛 Slow delivery time due to timing of hardware orders
💹 Inefficient hardware use

SLIDE 7

Why Containers?

  • Why not use virtual machines instead?
  • Why not use AWS?
  • Why not multi-tenant logical databases?

SLIDE 8

Preparing for the Future

🤕 New Container Fabric for deploying our stateless apps
🚣 Future regions will be entirely container-based!
🛥 Incremental delivery of containerized databases?

SLIDE 9

Setting Goals

📧 Packaging and Deployment

  • consistent and repeatable

🚛 Database Delivery Time

  • from months to minutes

💹 Cost Efficiency

  • reduce wasted resources
SLIDE 10

Compressed Timeline

  • Managing hundreds of existing, busy databases
  • Need to ship an MVP and gain traction quickly
  • Dev work is blocked on database delivery time

SLIDE 11

Research

A Survey of Open-Source Orchestration Tools

SLIDE 12

Open-Source Container Orchestration

SLIDE 13

Stateful Containers: Mesos and Marathon

Key Concepts

  • Dynamic Provisioning
  • Reservation Labels
  • Local Persistent Volumes

  • External Volumes
SLIDE 14

Stateful Containers: Kubernetes

Key Concepts

  • Stateful Sets
  • Pods
  • Headless Service
  • Persistent Volumes
  • Operators
SLIDE 15

Stateful Containers: Nomad

Key Concepts

  • Jobs
  • Task Groups
  • Allocations
  • Sticky Volumes
  • Volume Plugins
  • FS Drivers
SLIDE 16

Joyent Blog: Autopilot for Databases

SLIDE 17

Stateful Containers: Emergent Patterns

Abstract Patterns

  • Application-aware orchestration
  • Autopilot pattern for lifecycle management
  • Storage fabrics and networked storage 😟

SLIDE 18

Stateful Containers: Current Status

SLIDE 19

Making Trade-Offs

  • Dynamic Scheduling vs. Manual Placement
  • Custom Orchestration Framework vs. Client-Server
  • Distributed Consensus?
  • Local Object Storage?
  • Lifecycle Management logic inside or outside Container?

SLIDE 20

Prerequisites

Dynamic Inventory and/or Service Discovery

SLIDE 21

Problem: Database Inventory

  • What systems do we have?
  • What logical databases?
  • What team is the owner?
  • Naming standards?
  • API and CLI access?

SLIDE 22

Megabase Prerequisite: Inventory System

  • Metadata on DB systems and services
  • Percona containers on Megabase
  • Golang service
  • HTTP interface to DB
  • Update on event and scheduled jobs
  • Seed automation tasks

SLIDE 23

Inventory System: Database

  • Query read-only data from other systems and authorities
  • Megabase container deployment information
  • Provide basic service discovery
SLIDE 24

Inventory System: HTTP Interface (JSON)

  • Query over HTTP
  • Return JSON
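
A minimal sketch of how a client might query such an inventory service. The slides only say that the service is queried over HTTP and returns JSON; the /databases path, the query parameters and the JSON field names below are hypothetical, and the sample body is illustrative rather than real inventory data.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url, **filters):
    """Build an inventory query URL; the /databases path is an assumed endpoint."""
    query = urlencode(sorted(filters.items()))
    return f"{base_url}/databases?{query}" if query else f"{base_url}/databases"


def parse_inventory(body):
    """Parse a JSON array of records and index them by the (assumed) 'name' field."""
    return {record["name"]: record for record in json.loads(body)}


# Illustrative response body with hypothetical field names.
sample = '[{"name": "demo-db-1", "engine": "percona", "team": "db-team"}]'
url = build_query_url("https://inventory.example.com", team="db-team")
records = parse_inventory(sample)
```

Indexing the records by name keeps later lookups (for example, resolving a container name to its host) to a single dictionary access.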

SLIDE 25

Inventory System: HTTP Interface (text)

SLIDE 26

Inventory System: Dynamic Lookups

SLIDE 27

Megabase

A platform for containerized databases at New Relic

SLIDE 28

Megabase: What We Built

SLIDE 29

Megabase: Ingredients

Bare Metal:

  • Kernel 4.4 and 4.14
  • CentOS 7 and CoreOS

Docker: 1.12 and 1.17

Image OS:

  • Alpine (Postgres and Redis)
  • Debian (Percona)

Data stores:

  • Percona-server 5.6
  • Percona-server 5.7
  • Postgresql-server 9.3, 9.5 and 9.6
  • Redis-server 3.2

Golang: 1.9

SLIDE 30

Megabase: Docker Image Building

Dockerfile

  • Upstream base image
  • Custom labels
  • Package dependencies
  • Add binaries:
      • rclone
      • configuration sync
      • replication bootstrap
  • Entrypoint script
SLIDE 31

Megabase: Docker Entrypoint

Tasks

  • Validate data mount
  • Sync base configs from S3
  • Apply dynamic configuration
  • No data => replication bootstrap/initdb
  • Start server process
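
The task list above can be sketched as a small planning function. This is an illustration under stated assumptions, not the actual entrypoint script: the function names are hypothetical, and the rule "bootstrap from replication when a primary is available, otherwise run initdb" is one plausible reading of "No data => replication bootstrap/initdb".

```python
import os
import tempfile


def data_dir_is_mounted(path):
    """Validate that the data directory is a real mount point (the bind mount)."""
    return os.path.ismount(path)


def plan_startup(data_dir, primary_available):
    """Return the ordered steps to run once the data mount has been validated."""
    steps = ["sync base configs", "apply dynamic configuration"]
    if not os.listdir(data_dir):
        # Empty data directory: nothing has been initialized yet (assumed rule).
        steps.append("replication bootstrap" if primary_available else "initdb")
    steps.append("start server process")
    return steps


# An empty temporary directory stands in for a freshly created data volume.
with tempfile.TemporaryDirectory() as empty_dir:
    fresh_plan = plan_startup(empty_dir, primary_available=False)
```

Checking for an empty data directory makes the entrypoint safe to re-run: a restarted container with existing data skips the bootstrap step and goes straight to starting the server.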
SLIDE 32

Megabase: Injecting Configuration

Strategies

  • Environment variables
      • via deploy config
  • Configuration files
      • via object storage
      • version controlled
  • Dynamic computation from cgroup limits
  • Docker images
      • via image registry
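
As an illustration of "dynamic computation from cgroup limits", the sketch below derives a buffer pool size from the container's memory limit. The slides do not say which settings are computed or how; the 70% fraction, the 128 MiB rounding and the choice of a MySQL-style buffer pool setting are assumptions. The file path is the standard cgroup v1 location for the memory limit.

```python
# Standard cgroup v1 location of the container memory limit.
CGROUP_V1_LIMIT = "/sys/fs/cgroup/memory/memory.limit_in_bytes"


def read_memory_limit(path=CGROUP_V1_LIMIT):
    """Read the memory limit, in bytes, that the kernel enforces on this cgroup."""
    with open(path) as handle:
        return int(handle.read().strip())


def buffer_pool_bytes(limit_bytes, fraction=0.7, chunk=128 * 1024 * 1024):
    """Take a fraction of the limit, rounded down to a multiple of chunk.

    Both fraction and chunk are assumed values for illustration.
    """
    target = int(limit_bytes * fraction)
    return max(chunk, target // chunk * chunk)


# An 8 GiB container limit yields a buffer pool of 44 chunks of 128 MiB.
pool = buffer_pool_bytes(8 * 1024 ** 3)
```

Computing the value at container start means the same image and base config can be reused across containers with different resource limits.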
SLIDE 33

Megabase API Server

  • Pre/Post deployment dependencies
  • Manage container runtime dependencies
  • Interface to Docker over HTTPS
  • Support custom workflows for database services

megabase server

SLIDE 34

Megabase API: TLS Authentication

Authenticate all requests with TLS client certificates
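
A minimal sketch of server-side TLS client authentication, shown here with Python's standard ssl module rather than the Go code Megabase actually uses. The function name and file arguments are hypothetical; the key point is that the server refuses any connection that does not present a certificate signed by a trusted CA.

```python
import ssl


def server_context(cert_file=None, key_file=None, ca_file=None):
    """Build a server TLS context that requires a valid client certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    # Reject the handshake unless the client presents a verifiable certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if cert_file:
        ctx.load_cert_chain(cert_file, key_file)  # the server's own identity
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)  # CA that signs client certs
    return ctx


# No files are loaded here; a real server would pass all three paths.
ctx = server_context()
```

With this mode set, an unauthenticated request fails during the TLS handshake, before any API handler runs.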

SLIDE 35

Authentication Failure

SLIDE 36

Authentication Success

SLIDE 37

Megabase API: Endpoints

SLIDE 38

Megabase API: Dependencies

Containers expect and validate bind mount under data directory

Megabase deploy pre-step to docker run:

  • Carve off extents for logical volume and make filesystem
  • Create systemd unit file for persistence and mount volume
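
The two pre-steps above can be sketched as follows. This only builds the commands and the unit file text; it does not run anything. The volume group name, the ext4 filesystem and the mount path are assumptions, since the slides do not name them. systemd requires a mount unit's file name to be derived from its mount path, which the simple replacement below handles only for paths without hyphens or other special characters.

```python
def lv_commands(vg, name, size_gb, fstype="ext4"):
    """Commands to carve a logical volume out of a volume group and format it."""
    device = f"/dev/{vg}/{name}"
    return [
        ["lvcreate", "-L", f"{size_gb}G", "-n", name, vg],
        [f"mkfs.{fstype}", device],
    ]


def mount_unit(vg, name, mount_point, fstype="ext4"):
    """Return (unit file name, unit file body) for a persistent systemd mount."""
    # Simplified escaping: valid only for paths made of plain path components.
    unit_name = mount_point.strip("/").replace("/", "-") + ".mount"
    body = "\n".join([
        "[Unit]",
        f"Description=Data volume for {name}",
        "",
        "[Mount]",
        f"What=/dev/{vg}/{name}",
        f"Where={mount_point}",
        f"Type={fstype}",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ])
    return unit_name, body


commands = lv_commands("vg0", "demodb", 100)
unit_name, unit_body = mount_unit("vg0", "demodb", "/data/demodb")
```

Using a systemd mount unit rather than a one-off mount command means the volume is remounted after a host reboot, before the container is restarted.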
SLIDE 39

Megabase API: Create Logical Volume

SLIDE 40

Megabase API: Container Runtime

Reduce client config burden

Offload predictable defaults:

  • Resource Limits
  • Bind mount volume to data directory
  • Host networking
  • Inject environment secrets

Example docker run on cli =>
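
The command shown on the original slide is not reproduced in this transcript. As a rough sketch of the idea, the function below assembles a docker run argument list from a few inputs and fills in the predictable defaults listed above. The function name, the /data container path and the example values are hypothetical; the flags themselves are standard docker run options.

```python
def docker_run_args(name, image, memory_gb, cpu_shares, data_path, env):
    """Build a docker run command with resource limits, a data bind mount,
    host networking and injected environment variables."""
    args = [
        "docker", "run", "--detach",
        "--name", name,
        "--memory", f"{memory_gb}g",
        "--cpu-shares", str(cpu_shares),
        "--net", "host",
        "--volume", f"{data_path}:/data",  # /data is an assumed container path
    ]
    for key in sorted(env):
        args += ["--env", f"{key}={env[key]}"]
    return args + [image]


example = docker_run_args(
    "demo-db-1", "percona:5.7", 16, 1024, "/data/demodb",
    {"DB_PORT": "3306"},
)
```

Keeping these defaults on the server side means a client only has to state what varies per deployment, such as name, image and resources.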

SLIDE 41

Megabase API: Failover Tasks

SLIDE 42

Megabase API: Xtrabackup Stream to Peer

SLIDE 43

Megabase… Client?

“So we’re just supposed to type out those nasty, long curl commands to operate everything? Really? And TLS client auth too? Are you trying to be annoying?”

  • anonymous New Relic db-team member

mb client

SLIDE 44

Megabase Client: mb

SLIDE 45

Megabase Client: mb

mb: Command Line Interface

  • wrap up server API functionality into command line interface
  • transparently handle TLS authentication
  • query state, health and configuration across servers and deployments
  • coordinate deployments across servers
  • leverage inventory to resolve host and container names
SLIDE 46

Megabase Client: mb

SLIDE 47

Megabase Client: Deployment Manifest

Minimum info required:

  • Container name
  • Image path and tag
  • CPU, memory and disk resources
  • VIP for load balancer and DNS
  • Owning team

Auto select values if absent
Define one or more containers
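
A sketch of validating a deployment manifest against the minimum information listed above. The manifest layout, key names and example values are hypothetical; the slides list the required information but not its format.

```python
from dataclasses import dataclass

# Assumed key names for the minimum required information.
REQUIRED = ("name", "image", "cpus", "memory_gb", "disk_gb", "vip", "team")


@dataclass
class ContainerSpec:
    name: str
    image: str
    cpus: int
    memory_gb: int
    disk_gb: int
    vip: str
    team: str


def parse_manifest(doc):
    """Validate each container entry and return one spec per container."""
    specs = []
    for entry in doc["containers"]:
        missing = [key for key in REQUIRED if key not in entry]
        if missing:
            raise ValueError(f"{entry.get('name', '?')}: missing {missing}")
        specs.append(ContainerSpec(**{key: entry[key] for key in REQUIRED}))
    return specs


# Illustrative manifest defining a single container.
manifest = {"containers": [{
    "name": "demo-db-1", "image": "percona:5.7", "cpus": 4,
    "memory_gb": 16, "disk_gb": 100, "vip": "demo-db-1.example.com",
    "team": "db-team",
}]}
specs = parse_manifest(manifest)
```

Rejecting an incomplete manifest before any request is sent avoids leaving a half-created deployment, such as a logical volume with no container.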

SLIDE 48

Megabase Client: mb Deployment

Process

  • Target environment and cluster
  • Specify a deployment manifest
  • Generate port and passwords
  • Send requests to API servers
      • new logical volume
      • run container
  • Check responses
  • Validate deployment
      • health
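
The "generate port and passwords" step could look something like the sketch below. How mb actually chooses ports and builds passwords is not described in the slides; the port range, the first-free-port rule and the password length are assumptions.

```python
import secrets


def pick_port(used_ports, low=3306, high=3999):
    """Return the lowest port in an assumed range that is not already in use."""
    for port in range(low, high + 1):
        if port not in used_ports:
            return port
    raise RuntimeError("no free port in range")


def generate_password(nbytes=24):
    """Generate a random URL-safe password from the OS entropy source."""
    return secrets.token_urlsafe(nbytes)


port = pick_port({3306, 3307})
password = generate_password()
```

Because containers use host networking, each database on a host needs its own port, which is why the port is chosen per deployment rather than fixed in the image.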
SLIDE 49

Monitoring

Using New Relic to Monitor New Relic

SLIDE 50

Monitoring and Alerting

New Relic APM, Infrastructure and Insights

  • Hardware/System, Percona, Postgres, Redis, Megabase/Golang
  • Connection availability monitoring events to Insights
  • Insights dashboards created for deploys
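
A sketch of a connection availability check that produces an event suitable for sending to Insights. The event type name and attribute names are hypothetical; Insights custom events do require an eventType attribute. Sending the event is left out, and the connect argument exists only so the check can be exercised without a real database.

```python
import socket
import time


def availability_event(host, port, timeout=2.0, connect=socket.create_connection):
    """Try to open a TCP connection and describe the outcome as an event."""
    start = time.monotonic()
    try:
        connect((host, port), timeout=timeout).close()
        available = True
    except OSError:
        available = False
    return {
        "eventType": "DatabaseAvailability",  # hypothetical event type name
        "host": host,
        "port": port,
        "available": available,
        "latencyMs": round((time.monotonic() - start) * 1000, 2),
    }
```

Recording failures as events with available set to false, rather than only logging them, is what makes an availability percentage queryable later.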
SLIDE 51

Monitoring: Insights Dashboards MySQL

SLIDE 52

Monitoring: Insights Dashboards Postgres

SLIDE 53

Monitoring: Insights Dashboards Redis

SLIDE 54

Monitoring: Insights NRQL

SLIDE 55

Monitoring: Service Availability

SLIDE 56

Monitoring: APM Transactions

SLIDE 57

Monitoring: Go Runtime

SLIDE 58

Lessons Learned

Our First Year of Running Databases in Containers

SLIDE 59

Concern: Host Failure Blast Radius

  • Density of database services per host
  • Maintenance (still) happens
  • Hardware (still) fails
  • Time to recover after host failure
  • Time spent on failovers

SLIDE 60

…and then it happened.

  • Production host went down
  • RAID controller failed
  • Several active primary instances affected
  • Time to recover was higher than we expected
  • Started DRI sprint to address host failures

SLIDE 61

Fix: Improve Tools for Mass Failover

Tooling updated to support failover:

  • All unhealthy database pools: ‘failover pools unhealthy’
  • By megabase hostname: ‘failover host demo-megabase-1c’
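
The two selection modes above can be sketched as a single filter. The pool records and their field names are hypothetical; the slides show only the command forms, not the data behind them.

```python
def pools_to_fail_over(pools, host=None):
    """Select pool names by primary host, or all unhealthy pools if no host is given."""
    if host is not None:
        return [p["name"] for p in pools if p["primary_host"] == host]
    return [p["name"] for p in pools if not p["healthy"]]


# Illustrative pool records with assumed field names.
pools = [
    {"name": "pool-a", "primary_host": "demo-megabase-1c", "healthy": True},
    {"name": "pool-b", "primary_host": "demo-megabase-1d", "healthy": False},
]
by_health = pools_to_fail_over(pools)
by_host = pools_to_fail_over(pools, host="demo-megabase-1c")
```

Selecting by host also covers planned maintenance, where the pools are still healthy but their primaries need to move before the host goes down.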
SLIDE 62

Redis Tuning Memory Limits

Redis deploy for experiment with RDB bgsave enabled

Keyspace sizing:

  • docker update used to walk memory limits up to 2GB, 4GB and 16GB
  • redis maxmemory limit increase to 1.5GB
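
A back-of-the-envelope check of why the container limit matters with RDB bgsave enabled. Redis forks to write the snapshot, and copy-on-write means that in the worst case, if every page is modified during the save, memory use can approach twice the dataset size. The factor of two is that worst-case bound, not a measured value from the talk, and it ignores other overhead such as buffers and fragmentation.

```python
GIB = 1024 ** 3


def worst_case_bgsave_bytes(maxmemory_bytes):
    """Upper bound on memory during bgsave: parent dataset plus a full copy."""
    return 2 * maxmemory_bytes


def fits(limit_bytes, maxmemory_bytes):
    """True if the worst case stays within the container memory limit."""
    return worst_case_bgsave_bytes(maxmemory_bytes) <= limit_bytes


maxmemory = int(1.5 * GIB)
# Check a 1.5 GiB maxmemory against container limits of 2, 4 and 16 GiB.
results = {gb: fits(gb * GIB, maxmemory) for gb in (2, 4, 16)}
```

Under this bound a 1.5 GiB maxmemory does not fit in a 2 GiB container during a save, while 4 GiB and 16 GiB leave headroom.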

SLIDE 63

Redis PRIMARY Erratic Memory

SLIDE 64

Kernel OOM on redis-server

SLIDE 65

Replica Swap Extends RDB Save Time

SLIDE 66

Fix: Adjust Alert Policies

New Relic NRQL Alerts

  • Adjusted swap alert
  • Added alert on kernel version mismatch

SLIDE 67

Outcomes

What We Delivered at New Relic

SLIDE 68

Keys to Our Success

  • Internal team supporting internal customers
  • Scoped to meet our specific goals
  • Controlled slow roll out and adoption
  • Team autonomy and control of our hardware
  • Existing APIs and tools from other teams
  • Limiting technologies and versions involved
  • Balancing trade-offs
  • Minimal container resource limits
SLIDE 69

Outcomes

📧 Packaging and Deployment

  • is consistent and repeatable

🚛 Database Delivery Time

  • takes minutes not months

💱 Cost Efficiency

  • fewer wasted resources

🌎 New Regions are Easy*

  • single command mass deploys
SLIDE 70

Megabase Adoption

SLIDE 71

References

  • https://mesosphere.github.io/marathon/docs/persistent-volumes.html
  • https://docs.mesosphere.com/1.11/tutorials/stateful-services/
  • https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/
  • https://youtu.be/J-Ke0TxGUSg (Kubernetes StatefulSet)
  • https://coreos.com/blog/introducing-operators.html
  • https://youtu.be/faUQcd5_MUc (Towards Running Stateful Applications on Nomad)
  • https://github.com/hashicorp/nomad/issues/150
  • https://twitter.com/kelseyhightower/status/963415653930553345
  • https://twitter.com/kelseyhightower/status/963418681148502016
  • https://www.joyent.com/blog/dbaas-simplicity-no-lock-in
  • https://www.joyent.com/blog/persistent-storage-patterns
  • https://thenewstack.io/methods-dealing-container-storage/
  • https://techcrunch.com/2015/11/21/i-want-to-run-stateful-containers-too/
SLIDE 72

Thank You!

Your questions are now welcome.

SLIDE 73

Rate My Session