Lessons From Building Automation For a Large Distributed Database




slide-1
SLIDE 1

Lessons From Building Automation For a Large Distributed Database

Leigh Johnson, Ameet Kotian
October 1, 2019
bit.ly/2ntsVSL

slide-2
SLIDE 2

Leigh Johnson

(she/her/hers)
Staff DRE, Slack
Google Developer Expert (Machine Learning)
@grepLeigh

Ameet Kotian

(he/him/his)
Staff DRE, Slack
Past - SRE, Twitter
@ameetkotian

Presenters

slide-3
SLIDE 3

Slack’s mission is to make people’s working lives simpler, more pleasant, and more productive.

slide-4
SLIDE 4

Agenda

  • 1. Evolution of Slack Automation
  • 2. Case Study: Self-Healing Databases
  • 3. Lessons
  • 4. Q & A
slide-5
SLIDE 5

Evolution of Database Automation

Self-Healing Stateful Systems in 4 Steps

  • 1. Monitoring Alerts
  • 2. Checklists
  • 3. Scripts
  • 4. Automated Workflows

slide-6
SLIDE 6

Monitoring Alerts

  • Metrics
  • Events
  • Logs
  • Traces
  • Dashboards
  • PagerDuty

slide-7
SLIDE 7

Checklists

Just follow these 19 easy steps!

  • Runbooks
  • Shared documents
  • Lots of hand-over
  • Limited by team capacity

slide-8
SLIDE 8

Convert Runbooks to Code

slide-9
SLIDE 9

Convert Runbooks to Code

slide-10
SLIDE 10

Manual Scripts

$ ./provision.sh
$ ./backup_restore.sh
$ ./validate_replication.sh
$ ./service_discovery_stuff.sh
$ ./deprovision_server.sh

Just follow these 5 easy steps!

  • Difficult to maintain
  • Context switching
  • API contracts
  • Multi-step process

  • r...
slide-11
SLIDE 11

Write a Do-Everything Script

$ ./fix-it-now.sh

Just follow this 1 easy step!

slide-12
SLIDE 12

Automated Workflows

  • Systems (not humans)
  • Detects failure mode
  • Executes appropriate response
  • Fail Open
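A minimal sketch of the loop this slide describes, assuming "fail open" means that when the automation cannot classify a failure, or its own response raises, it steps aside and escalates to a human instead of blocking or guessing. The failure-mode names, the `RESPONSES` table, and the `page_human` and `replace_host` helpers are all hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, List

escalations: List[str] = []  # stand-in for a paging system (hypothetical)


def page_human(reason: str) -> str:
    """Hand the incident to the on-call engineer."""
    escalations.append(reason)
    return "escalated"


def replace_host(host: str) -> str:
    """Placeholder response; a real one would start a provisioning workflow."""
    return f"replacing {host}"


# Map each known failure mode to its automated response (names are assumptions).
RESPONSES: Dict[str, Callable[[str], str]] = {
    "replica_down": replace_host,
    "primary_down": replace_host,
}


def handle_failure(mode: str, host: str) -> str:
    """Run the response for a detected failure mode, failing open on surprises."""
    response = RESPONSES.get(mode)
    if response is None:
        # Unknown failure mode: do nothing automatically, escalate instead.
        return page_human(f"unknown failure mode {mode!r} on {host}")
    try:
        return response(host)
    except Exception as exc:
        # The automation itself broke: fall back to a human.
        return page_human(f"response for {mode!r} on {host} failed: {exc}")
```

For example, `handle_failure("replica_down", "db-1")` runs the mapped response, while an unmapped mode such as `"disk_full"` ends up in `escalations`.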

slide-13
SLIDE 13

You will still need firefighters after installing a fire suppression system. Automation is not a magic bullet.

slide-14
SLIDE 14

Automated Workflows are an Investment

  • Many teams stop automating at the scripts stage
  • Instrument, monitor, support n + 1 systems

slide-15
SLIDE 15

If automation is an investment...

  • Quantify value proposition
  • Measure return on investment

Do Due Diligence

slide-16
SLIDE 16

If automation is an investment...

  • Quantify value proposition
  • Measure return on investment
  • Include qualitative data

Do Due Diligence

slide-17
SLIDE 17

Are you ready to kill your pet databases?

slide-18
SLIDE 18

How did we take Slack from...

slide-19
SLIDE 19


slide-20
SLIDE 20
slide-21
SLIDE 21
slide-22
SLIDE 22

Case Study

Building automation for remediating database failures

slide-23
SLIDE 23

Goals

slide-24
SLIDE 24

Goals

Reliably detect MySQL host failures

slide-25
SLIDE 25

Goals

  • Reliably detect MySQL host failures
  • Automatically remediate failed MySQL hosts

slide-26
SLIDE 26

Goals

  • Reliably detect MySQL host failures
  • Automatically remediate failed MySQL hosts
  • Scale (security fixes, kernel upgrades, etc.)

slide-27
SLIDE 27

Slack’s database architecture

slide-28
SLIDE 28

Slack’s database architecture

  • Self-manage MySQL on AWS i3 instances
  • Data is sharded across thousands of hosts
  • Two main types of clusters: Legacy and Vitess

slide-29
SLIDE 29
  • 1. Legacy

shard

Application-level, team-sharded, active primary-primary MySQL setup.

slide-30
SLIDE 30
  • 1. Legacy

shard

Application-level, team-sharded, active primary-primary MySQL setup. Some shards have read replicas.

slide-31
SLIDE 31

For more details... Strength in Numbers: Slack's Database Architecture.

2nd Oct 2019, 1:30 PM

Guido Iaquinti, Josh Varner

slide-32
SLIDE 32

Slack is moving to Vitess

What is Vitess?

Database solution for MySQL

Deploy, scale and manage large MySQL clusters

Built on top of MySQL replication and InnoDB

MySQL features + scalability of a NoSQL database

Open source project by YouTube (Google)

Started in 2010

Cloud Native Computing Foundation endorsed project

Ability to run each component in a container

slide-33
SLIDE 33
  • 2. Vitess

shard

primary-replica MySQL setup

~40% of Slack’s database queries served by Vitess

slide-34
SLIDE 34

For more details on Vitess... My First 90 Days with Vitess.

2nd Oct 2019, 11:00 AM

Morgan Tocker

https://vitess.io/ https://vitess.slack.com/

slide-35
SLIDE 35

What if there is a failure?

Legacy shard Vitess shard

slide-36
SLIDE 36

Salvage or replace?

slide-37
SLIDE 37

Always replace!

slide-38
SLIDE 38

Failure detection

slide-39
SLIDE 39

Auto remediation requires accurate failure signals

slide-40
SLIDE 40
slide-41
SLIDE 41

We already use Orchestrator for automatic master failovers in our Vitess clusters, and it met all the requirements for detecting host failures

https://github.com/github/orchestrator

slide-42
SLIDE 42

Primary failures

Legacy shard Vitess shard

slide-43
SLIDE 43

Use Orchestrator hooks triggered on master failovers
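As a hedged illustration of what such a hook might do: Orchestrator can run configured processes around a failover (for example the `PostMasterFailoverProcesses` setting) and passes failover details to them through `ORC_*` environment variables. The sketch below turns those variables into an event for a remediation handler. The exact variable names should be checked against the Orchestrator version in use, and the event shape and `build_failure_event` function are hypothetical, not Slack's implementation.

```python
import json
import os
from typing import Mapping, Optional


def build_failure_event(env: Mapping[str, str]) -> Optional[dict]:
    """Turn hook environment variables into an event for a remediation handler.

    Returns None when the failed host is missing, so the caller can skip
    enqueueing a malformed event.
    """
    failed_host = env.get("ORC_FAILED_HOST", "")
    if not failed_host:
        return None
    return {
        "failure_type": env.get("ORC_FAILURE_TYPE", "unknown"),
        "failed_host": failed_host,
        # Empty or missing successor is normalized to None.
        "successor_host": env.get("ORC_SUCCESSOR_HOST") or None,
        "cluster": env.get("ORC_FAILURE_CLUSTER", ""),
    }


if __name__ == "__main__":
    # A real hook would send this to the event handler; printing is a stand-in.
    print(json.dumps(build_failure_event(os.environ)))
```

Keeping the parsing in a pure function makes the hook easy to test without running a failover.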

slide-44
SLIDE 44

Replica failures

Legacy shard Vitess shard

slide-45
SLIDE 45

Use Orchestrator's problems API to detect replica failures
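A rough sketch of polling that API, assuming the endpoint is `/api/problems` and that each returned instance carries `Key`, `MasterKey`, and `IsLastCheckValid` fields. Those field names and the rule "a replica is any instance with a non-empty `MasterKey.Hostname`" are assumptions to verify against the Orchestrator API in use. The function names are hypothetical.

```python
import json
import urllib.request
from typing import Any, Dict, List


def fetch_problems(base_url: str, timeout: float = 5.0) -> List[Dict[str, Any]]:
    """GET <base_url>/api/problems and return the decoded JSON list."""
    with urllib.request.urlopen(f"{base_url}/api/problems", timeout=timeout) as resp:
        return json.load(resp)


def failed_replicas(instances: List[Dict[str, Any]]) -> List[str]:
    """Return hostnames of replicas whose last health check failed."""
    hosts = []
    for inst in instances:
        # Assumed: an instance with a master hostname set is a replica.
        is_replica = bool((inst.get("MasterKey") or {}).get("Hostname"))
        hostname = (inst.get("Key") or {}).get("Hostname")
        if is_replica and hostname and not inst.get("IsLastCheckValid", True):
            hosts.append(hostname)
    return hosts
```

A poller could call `failed_replicas(fetch_problems(base_url))` on a schedule and emit one failure event per returned host.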

slide-46
SLIDE 46

Orchestrator can be used to generate accurate failure signals

  • Orchestrator is distributed and uses multiple probes to detect failures.
  • It uses knowledge of MySQL state and replication to detect failures.
  • Rich set of APIs
  • Has inbuilt concept of a shard
  • Downtime/Maintenance mode

slide-47
SLIDE 47

Automated Remediation at Scale

slide-48
SLIDE 48

Failure Event → Event Handler → Provision Workflow → ???

slide-49
SLIDE 49

Automation Components

  • APIs
  • Scheduler
  • Task Queue

slide-50
SLIDE 50

Automation Components

  • Web UI
  • Audit Logs
  • Workflows

slide-51
SLIDE 51

Failure Event → Event Handler → Provision Workflow → Celery

slide-52
SLIDE 52

Celery

Distributed Task Queue Framework

Manual Script

slide-53
SLIDE 53

Celery

Task API

Distributed Task Queue Framework

slide-54
SLIDE 54

Celery

  • Task API
  • Workflows

Distributed Task Queue Framework

slide-55
SLIDE 55

Celery

  • Task API
  • Workflows

Distributed Task Queue Framework

slide-56
SLIDE 56

Celery

  • Task API
  • Workflows
  • Queue Isolation
  • Rate Limits
  • Retry Behavior

Distributed Task Queue Framework

slide-57
SLIDE 57

Celery

  • Task API
  • Workflows
  • Queue Isolation
  • Rate Limits
  • Retry Behavior
  • Scheduler

Distributed Task Queue Framework

slide-58
SLIDE 58

Celery

  • Task API
  • Workflows
  • Queue Isolation
  • Rate Limits
  • Retry Behavior
  • Scheduler
  • Web UI

Distributed Task Queue Framework

pip install celery-flower

slide-59
SLIDE 59

Celery

  • Task API
  • Workflows
  • Queue Isolation
  • Rate Limits
  • Retry Behavior
  • Scheduler
  • Web UI
  • Slack Notifications

Distributed Task Queue Framework

pip install celery-slack-webhooks
github.com/leigh-johnson/celery-slack-webhooks
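To make the workflow and retry ideas from the Celery slides concrete without requiring a broker, here is a plain-Python stand-in. It is not Celery code and does not show Celery's API. Each step of a workflow runs in order, and a step that raises is retried with exponential backoff before the workflow gives up. Celery provides these behaviors through its own task and workflow features. The names `run_with_retry` and `run_workflow` and their parameters are hypothetical.

```python
import time
from typing import Callable, List, Sequence


def run_with_retry(step: Callable[[], None], max_retries: int = 3,
                   base_delay: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep) -> None:
    """Run one step, retrying with exponential backoff on any exception."""
    for attempt in range(max_retries + 1):
        try:
            step()
            return
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: let the workflow fail loudly
            sleep(base_delay * 2 ** attempt)


def run_workflow(steps: Sequence[Callable[[], None]], **retry_opts) -> List[str]:
    """Run steps in order; stop at the first step that exhausts its retries."""
    completed = []
    for step in steps:
        run_with_retry(step, **retry_opts)
        completed.append(step.__name__)
    return completed
```

The `sleep` parameter is injectable so the backoff schedule can be checked in tests without waiting.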

slide-60
SLIDE 60

Distributed Lock

Any Strongly Consistent DB
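The slide names only the building block, so the following is a sketch under assumptions: the lock's purpose is taken to be keeping two workers from remediating the same shard at once, and the lock is a row whose primary key is the lock name, with an expiry so a crashed worker cannot hold it forever. SQLite stands in for the strongly consistent database purely so the example runs locally. The table layout and the `init_locks`, `acquire`, and `release` functions are hypothetical.

```python
import sqlite3
import time
from typing import Optional


def init_locks(conn: sqlite3.Connection) -> None:
    """Create the lock table; the primary key enforces one holder per name."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )
    conn.commit()


def acquire(conn: sqlite3.Connection, name: str, owner: str,
            ttl: float = 300.0, now: Optional[float] = None) -> bool:
    """Try to take the lock; return False if another owner still holds it."""
    now = time.time() if now is None else now
    with conn:  # run the cleanup and the insert in one transaction
        # Drop an expired lock so a crashed holder cannot block forever.
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND expires_at <= ?", (name, now)
        )
        try:
            conn.execute(
                "INSERT INTO locks (name, owner, expires_at) VALUES (?, ?, ?)",
                (name, owner, now + ttl),
            )
        except sqlite3.IntegrityError:
            return False  # primary key conflict: someone else holds the lock
    return True


def release(conn: sqlite3.Connection, name: str, owner: str) -> None:
    """Release the lock, but only if this owner holds it."""
    with conn:
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner)
        )
```

A worker would call `acquire(conn, "shard-7", worker_id)` before starting a remediation workflow for that shard and `release` when it finishes. The TTL should exceed the longest expected workflow run.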

slide-61
SLIDE 61
slide-62
SLIDE 62

Lessons learned

slide-63
SLIDE 63

Three important things - Safety, safety, safety...

slide-64
SLIDE 64

Automation software is just like regular software...use the same scalability/reliability principles

slide-65
SLIDE 65

Automation software is just regular software...release early and often

slide-66
SLIDE 66

You need a rollout strategy… And a rollback strategy

slide-67
SLIDE 67

Show Value Proposition

slide-68
SLIDE 68

Commitment to automation

slide-69
SLIDE 69

Questions?

We are hiring DREs and SREs in San Francisco and Dublin