Arup Chakrabarti OPERATIONS ENGINEERING arup@pagerduty.com - - PowerPoint PPT Presentation

SLIDE 1

PagerDuty

Arup Chakrabarti

OPERATIONS ENGINEERING

@arupchak arup@pagerduty.com

SLIDE 2

Common Ops Mistakes and How to Prevent Them

SLIDE 3

Who the heck is this guy?

Quick Bio

  • Worked at Amazon as an Engineer/Manager
  • Worked at Netflix as a Manager
  • Employee #20-something at PagerDuty
  • Infrastructure was a monolithic Rails app and a single service

  • Still have the MonoRail, now with 10+ services
  • Over last year, ~20 servers -> ~200 servers
SLIDE 4

Quick Disclaimer

  • I did not come up with everything

  • I am speaking for myself
SLIDE 5

The Magical Formula

What is Software Operations?

SLIDE 6

The Magical Formula

What is Software Operations?

  • Change ~ Downtime
  • More change => More Problems
SLIDE 7

Let’s Revise the Magical Formula

Why this is scary

SLIDE 8

Let’s Revise the Magical Formula

Why this is scary

  • Changes ~ Innovation ~ Downtime
  • Maintain stability by stopping innovation
  • Scrappy Startup vs. Big Company
  • Most Big Companies do not innovate because they cannot risk the change
  • Does this mean all companies are eventually doomed to not innovate?

SLIDE 9

The Magical Formula Revised Again

What is Software Operations?

  • Changes ~ (k) Innovation ~ (h) Downtime
  • There are two constants - k and h
  • k - Increase k to amplify innovation per change
  • Test environments, A/B testing, splitting up codebases
  • h - Decrease h to improve stability per change
  • Fast deploys, better alerting, splitting up codebases
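
One way to read the revised formula is as two simple proportionalities: innovation grows as k times the number of changes, and downtime grows as h times the number of changes. The sketch below assumes that linear reading, which the slide does not spell out. The function name and the numbers are hypothetical and only show the direction of the effect.

```python
def effects_of_change(changes: int, k: float, h: float) -> tuple[float, float]:
    """Return (innovation, downtime) for a given number of changes.

    Assumes the linear reading of the slide:
    innovation = k * changes and downtime = h * changes.
    """
    innovation = k * changes
    downtime = h * changes
    return innovation, downtime


# Hypothetical values: same number of changes, different constants.
baseline = effects_of_change(changes=10, k=1.0, h=1.0)
# Raising k (e.g. A/B testing) and lowering h (e.g. fast deploys).
improved = effects_of_change(changes=10, k=2.0, h=0.5)
```

With the same ten changes, the second call yields more innovation and less downtime, which is the point of working on k and h rather than on the number of changes.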

SLIDE 10

Netflix Architecture Diagram

  • 0. Accept that no infrastructure is perfect
SLIDE 11

  • 0. Accept that no infrastructure is perfect
  • Make the best decisions at the time
  • Accept constraints
  • Learn more as our systems or business evolve

  • This is ok

Really, it is ok

SLIDE 12

  • 1. Initial Setup of Infrastructure
  • Using personal email accounts
  • No: set up mailing lists, and ideally have Google Apps set up from the beginning

  • Pre-Optimizing for Scale
  • Use Heroku or other PaaS for as long as you can
  • Technology selection
  • Boring technologies to do cool things
  • Password storage
  • Not in the git repo; use ENV vars or your configuration management tool
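
As a minimal sketch of the ENV var approach, the helper below reads a secret from the process environment and fails loudly when it is missing, so nothing sensitive has to live in the repository. The helper name `get_secret` and the variable name `DB_PASSWORD` are hypothetical, not something the talk prescribes.

```python
import os


def get_secret(name: str) -> str:
    """Read a secret from the environment instead of from source control."""
    value = os.environ.get(name)
    if not value:
        # Fail at startup rather than running with a missing credential.
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example usage (hypothetical variable name):
#   db_password = get_secret("DB_PASSWORD")
```

The value itself would be set by the deploy tooling or the configuration management tool, never committed.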

SLIDE 13

  • 2. Proper Test Environments
  • Separate hosting account for testing
  • Have separate provider accounts for test (e.g. email providers)

  • For local development, use VMs
  • Do not run services on localhost
  • Use Vagrant for this
SLIDE 14

  • 3. Configuration Management
  • Early on, use Ansible or Salt
  • Lightweight and easy to learn
  • Enforces treating ‘Infrastructure as Code’
  • Will scale just fine when you only have 4-5 server types

  • Avoid Bash Scripts
  • Beyond 5 server types, move to Chef, Puppet, Asgard, or other heavier tools
  • Augment CloudFormation or other PaaS-specific tools

SLIDE 15

  • 4. Deployments
  • Consistency
  • Every Engineer
  • Every Piece of Code
  • Use some orchestration tool with Git
  • Capistrano
  • Ansible
  • Salt
  • Celery
SLIDE 16

  • 5. Incident Management
  • Have a process in place and document it somewhere that is easily shared

  • Wiki
  • Dropbox document
  • Not in a random email
  • Make sure you review it monthly
SLIDE 17

Example Procedure

  • 5. Incident Management
  • 1. Everyone jumps onto chat client
  • 2. Everyone dials into group call
  • 3. Each member of the team gives a status update
  • 4. Single person acts as call leader (not a resolver)
  • 5. Call leader gives out orders
  • 6. Have a status update every 10 minutes
  • 7. Call leader maintains an outage log
  • 8. Conduct a post-mortem
SLIDE 18

  • 6. Monitoring and Alerting
  • Start with anything
  • StatsD with Graphite backend
  • CloudWatch
  • Sensu
  • Nagios
  • Use hosted solutions (as long as they make fiscal sense)
  • New Relic or other APMs
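
To show how little it takes to "start with anything", here is a minimal sketch that emits a StatsD counter over UDP using only the standard library. It assumes a StatsD daemon listening on its conventional port 8125 on localhost. The metric name and function names are hypothetical, and a real setup would more likely use an existing client library.

```python
import socket


def format_counter(name: str, value: int = 1) -> bytes:
    """Build a StatsD counter datagram in the form '<name>:<value>|c'."""
    return f"{name}:{value}|c".encode("ascii")


def send_counter(name: str, value: int = 1,
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send one counter increment to StatsD over UDP (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_counter(name, value), (host, port))


# Example usage (hypothetical metric name):
#   send_counter("deploys.completed")
```

Because UDP is fire and forget, instrumenting code this way does not block the application if the metrics host is down.
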
SLIDE 19

  • 7. Backups
  • Back up your data regularly to S3
  • Test your restores at least monthly
  • Practice restoring production data to test env
  • Make sure to scrub sensitive data
  • Measure time to recovery
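
The restore drill above can be sketched in two small pieces: a scrubber that redacts sensitive fields before data lands in the test environment, and a timer around the restore itself. The field names, the `REDACTED` placeholder, and the `restore_fn` callable are assumptions for illustration; the actual restore step depends on your datastore.

```python
import time

# Hypothetical list of fields to redact; adjust to your own schema.
SENSITIVE_FIELDS = {"email", "phone"}


def scrub(record: dict) -> dict:
    """Return a copy of the record with sensitive fields redacted."""
    return {
        key: ("REDACTED" if key in SENSITIVE_FIELDS else value)
        for key, value in record.items()
    }


def timed_restore(restore_fn):
    """Run a restore callable and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = restore_fn()
    elapsed = time.monotonic() - start
    return result, elapsed
```

Tracking the elapsed time across monthly drills gives a measured time to recovery instead of a guess.
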
SLIDE 20

  • 8. High Availability 101
  • Multiple servers at every layer
  • Multiple Load Balancers in DNS
  • Multiple App Servers
  • App servers have to be stateless
  • Use Clustered Datastores
  • MySQL XtraDB Cluster
  • Cassandra
  • Avoid Master/Slave architectures
  • Worry about sharding later
  • You do not know what to shard on yet
SLIDE 21

  • 9. Security 101
  • Use Gateway Hosts for SSH
  • These hosts are whitelisted for SSH; everything else should have global SSH turned off

  • Unique user accounts for everything
  • Easy to revoke access when something happens
  • Use PaaS security features
  • Security Groups, VPC, etc.
  • SSL encryption on everything
SLIDE 22

  • 10. Internal IT needs
  • Have a central list of tools that every department needs
  • Onboarding docs are a good place for this
  • Consolidate machine types
  • Do not let everyone have every machine that they want
  • Easier to support and swap out machines
  • Use images for machines
  • Easy to take a USB stick and make a general image

SLIDE 23

Exploiting your business patterns for managing change

  • Look for seasonality in traffic patterns
  • You can make changes when traffic is at the trough
  • Look for where you can be latency tolerant
  • Can you tolerate an extra 100-200ms of latency?
  • What gets impacted when you go down?
  • Actual revenue
  • Customer trust
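
Finding the trough mentioned above can be as simple as counting requests per hour of day and picking the minimum. The sketch below assumes you already have such counts from your logs or metrics; the function name and the sample numbers are hypothetical.

```python
def quietest_hour(requests_by_hour: dict[int, int]) -> int:
    """Return the hour of day (0-23) with the fewest requests."""
    if not requests_by_hour:
        raise ValueError("no traffic data supplied")
    return min(requests_by_hour, key=requests_by_hour.get)


# Hypothetical request counts keyed by hour of day.
sample_traffic = {0: 120, 4: 35, 12: 900, 20: 400}
change_window = quietest_hour(sample_traffic)
```

With the sample data, the change window falls at hour 4, the point of lowest traffic.
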
SLIDE 24

PagerDuty arup@pagerduty.com

Thank you.

Slides will be available at https://speakerdeck.com/arupchak

Arup Chakrabarti

OPERATIONS ENGINEERING

@arupchak