Operational Efficiency Hacks
John Allspaw, Operations Engineering, Flickr
SLIDE 1

Operational Efficiency Hacks

John Allspaw Operations Engineering, Flickr

Wednesday, April 8, 2009
SLIDE 2

who am I?

Manage the Flickr Operations group
Wrote a geeky book.
SLIDES 3-5

“Efficiencies”

Doing more with the robots you’ve got
Doing more with the humans you’ve got
SLIDES 6-8

Some optimization “rules”

  • Don’t rely on being able to tweak anything.
  • Don’t waste too much time tuning when you have no evidence it’ll matter.
SLIDE 9

Optimization “rules”

(chart: time spent tuning vs. performance tuning gains)
SLIDE 10

Optimization “rules”

“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.”

Knuth (or Hoare)
SLIDE 11

however...
SLIDES 12-15

Optimization “rules”

That doesn’t give us an excuse to be lazy and inefficient. We lean on the experience of people in the community for evidence that tuning(s) might be a worthwhile thing to do.
SLIDE 16

Optimization “rules”

“Yet we should not pass up our opportunities in that critical 3 percent.”

Knuth (or Hoare)
SLIDE 17

(chart annotations: “obvious tuning wins”, “OMG I’m wasting !@#$ time for no reason”, “stop somewhere in here”)

So...
SLIDES 18-24

Our Context

  • 24 TB of MySQL data
  • 32k/sec of MySQL writes
  • 120k/sec of MySQL reads
  • 6 PB of photos
  • 10 TB of storage eaten per day
  • 15,362 service monitors (alerts)
SLIDES 25-27

Infrastructure Hacks

  • Examples of what changing software can do (plain old-fashioned performance tuning)
  • Examples of what changing hardware can do (yay for Mr. Moore!)
SLIDE 28

Leaning on compilers

(synthetic PHP benchmarks, not real-world)

(http://sebastian-bergmann.de/archives/634-PHP-GCC-ICC-Benchmark.html)
SLIDE 29

PHP (real-world)

php 4.4.8 to php 5.2.8 migration
SLIDE 30

same taste, less filling

Can now handle more with less
SLIDES 31-34

Image Processing

  • 2004, Flickr was using ImageMagick for image processing (version 6.1.9)
  • Changed to GraphicsMagick, about 15% faster at the time (version 1.1.5)
  • Only need a subset of ImageMagick features anyway for our purposes
SLIDE 35

Image Processing

  • OpenMP support (http://en.wikipedia.org/wiki/Openmp)
  • Allows parallelization of processing jobs, using multiple cores working on the same image
  • Some algorithms have more parallelization than others
SLIDE 36

Image Processing

  • Test script
  • 7 large-ish DSLR photos
  • Cascade resizing each to 6 smaller sizes, semi-typical for Flickr’s workload
  • Each resize processed serially
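The cascade in that test script can be sketched as follows. This is a minimal illustration, not Flickr's actual script: the sizes, filenames, and the idea of only *building* the `gm convert` command lines (rather than running them) are all assumptions; the real workload shells out to GraphicsMagick.

```python
# Sketch of a cascade resize plan: each size is produced from the previous
# (smaller) output rather than from the full-size original, which is cheaper
# than resizing the big DSLR photo six separate times.
# Sizes and naming scheme are illustrative, not Flickr's actual values.

SIZES = [1024, 500, 240, 100, 75, 48]  # six smaller sizes (hypothetical)

def cascade_commands(src, sizes=SIZES):
    """Return GraphicsMagick command lines for a serial cascade resize."""
    cmds = []
    prev = src
    for px in sizes:
        out = f"{src}.{px}.jpg"
        cmds.append(f"gm convert {prev} -resize {px}x{px} {out}")
        prev = out  # next resize starts from this smaller output
    return cmds

for cmd in cascade_commands("dsc_0001.jpg"):
    print(cmd)
```

Running the commands serially (as the test script did) makes the OpenMP comparison on the next slides fair: any parallelism comes from inside GraphicsMagick, not from running resizes concurrently.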
SLIDE 37

Image Processing

compiler differences

(chart; GM version 1.1.14, non-OpenMP)
SLIDE 38

Image Processing

OpenMP differences

(chart: OpenMP advantage; gcc 4.1.2, on quad-core Xeon L5335 @ 2.00GHz)
SLIDE 39

Image Processing

CPU differences
SLIDES 40-43

Diagonal Scaling

  • Vertically scaling your already horizontally-scaled nodes
  • a.k.a. “tech refresh”
  • a.k.a. “Moore’s Law Surfing”
SLIDES 44-47

Diagonal Scaling

We replaced 67 “old” webservers with 18 “new”:

  servers | CPUs per server | RAM per server | drives per server | total power (W) @60% peak
  67      | 2               | 4GB            | 1x80GB            | 8763.6
  18      | 8               | 4GB            | 1x146GB           | 2332.8

~70% LESS power, 49U LESS rack space
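The power claim follows directly from the table's wattage column; worked out (the exact figure lands around 73%, which the slide rounds to "~70%"):

```python
# Arithmetic check of the webserver swap, using the numbers from the table.
old_total_w = 8763.6  # 67 "old" webservers, total draw @60% peak
new_total_w = 2332.8  # 18 "new" webservers, total draw @60% peak

savings = 1 - new_total_w / old_total_w
print(f"power saved: {savings:.0%}")              # 73% -- the slide's "~70% LESS power"

# Per-server draw barely changed; the win comes from needing far fewer boxes.
print(f"old per-server: {old_total_w / 67:.1f} W")  # 130.8 W
print(f"new per-server: {new_total_w / 18:.1f} W")  # 129.6 W
```

That per-server comparison is the essence of "Moore's Law Surfing": a new box at roughly the same wattage does the work of nearly four old ones.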
SLIDES 48-54

Diagonal Scaling

We replaced 23 “old” image processing boxes with 8 “new”:

  servers | photos/min | rack (U) | total power (W) @60% peak
  23      | 1035       | 23       | 3008.4
  8       | 1120       | 8        | 1036.8

~75% FASTER, 15U LESS rack space, 65% LESS power
SLIDES 55-58

What do you do with old/slow machines?

  • Liquidate
  • Re-purpose as dev/staging/etc
  • “offline” tasks
SLIDES 59-64

Offline Tasks

  • Out-of-band/asynchronous queuing and execution system, for non-realtime tasks
  • See here: http://code.flickr.com/blog/2008/09/26/flickr-engineers-do-it-offline/
  • See Myles Grant talk about it more here: http://en.oreilly.com/velocity2009/public/schedule/detail/7552
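The pattern can be shown in miniature with a stdlib queue and one worker thread. This is only the shape of the idea, not Flickr's system (theirs is a distributed job queue, described in the linked post); the task names and the in-process `queue.Queue` are assumptions for illustration.

```python
# Minimal sketch of the offline-task pattern: the "web request" side enqueues
# a job and returns immediately; a worker drains the queue out-of-band.
import queue
import threading

tasks = queue.Queue()
results = []

def worker():
    while True:
        job = tasks.get()
        if job is None:           # sentinel: shut the worker down
            break
        name, args = job
        results.append(f"ran {name}{args}")  # stand-in for the real work
        tasks.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()

# request side: enqueue and move on -- no user is kept waiting on this work
tasks.put(("resize_photo", (12345,)))
tasks.put(("recompute_contacts", (67890,)))

tasks.join()   # only so this sketch can observe the results before exiting
tasks.put(None)
t.join()
print(results)
```

The point of the hack for old hardware: the worker pool is exactly where re-purposed slow machines can live, since nobody is waiting on the latency of an individual job.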
SLIDE 65

Runbook Hacks

“WTF HAPPENED LAST NIGHT?!”
SLIDES 66-72

Why?

As infrastructure grows, try to keep the Humans:Machines ratio from getting out of hand.

Some of the How:

  • teach machines to build themselves
  • teach machines to watch themselves
  • teach machines to fix themselves
  • reduce MTTR by streamlining
SLIDES 73-75

Automated Infrastructure

  • If there is only one thing you do, automatic configuration and deployment management should be it.
  • See:
  • Opscode/Chef (http://opscode.com/)
  • Puppet (http://reductivelabs.com/products/puppet/)
  • SystemImager/Configurator (http://wiki.systemimager.org)
SLIDE 76

Configuration Management Codeswarm
SLIDE 77

Time

Machine time is cheaper than human time. If a failure results in some commands being run to ‘fix’ it, make the machines do it. (i.e., don’t wake people up for stupid things!)
SLIDES 78-80

Aggregate Monitoring

Don’t care about single nodes, only care about delta change of metrics/faults

  • Warn (email) on X% change
  • Page (wake up) on Y% change

High and low water marks for some metrics
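The escalation logic above can be sketched as a small classifier. The slide deliberately leaves X and Y unspecified, so the 10%/25% thresholds, the function name, and the watermark handling here are all made-up illustrations of the idea, not Flickr's values.

```python
# Sketch of aggregate-delta alerting: compare a cluster-wide metric to its
# previous sample and escalate based on the size of the swing.
WARN_PCT = 10   # "X% change" -> warn by email (hypothetical value)
PAGE_PCT = 25   # "Y% change" -> page/wake someone up (hypothetical value)

def classify(prev, curr, low=None, high=None):
    """Return 'page', 'warn', or 'ok' for one aggregate metric sample."""
    # hard high/low water marks take priority for the metrics that have them
    if high is not None and curr > high:
        return "page"
    if low is not None and curr < low:
        return "page"
    change = abs(curr - prev) / prev * 100
    if change >= PAGE_PCT:
        return "page"
    if change >= WARN_PCT:
        return "warn"
    return "ok"

print(classify(1000, 1050))              # 5% swing  -> ok
print(classify(1000, 880))               # 12% swing -> warn
print(classify(1000, 700))               # 30% swing -> page
print(classify(1000, 1010, high=1005))   # watermark breach -> page
```

Keying the alert off the aggregate delta rather than any single node is what keeps 15,362 service monitors from translating into 15,362 pages.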
SLIDES 81-87

Self-Healing

Make service monitoring fix common failure scenarios, notify us later about it. Daemons/processes run on machines, will take corrective action under certain conditions, and report back with what they did. Can greatly reduce your mean time to recovery (MTTR).
SLIDES 88-91

Basic Apache Example

  1. Webserver not running?
  2. Under certain conditions, try to start it, and email that this happened. (I’ll read it tomorrow)
  3. Won’t start? Assume something’s really wrong, so don’t keep trying (email that, too)
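Those three steps can be sketched as one function. Everything here is hypothetical scaffolding: `is_running`, `start_service`, and `send_email` are injected stubs standing in for whatever a real daemon would do (poke the init script, actually send mail); the slide describes the behavior, not this interface.

```python
# Self-healing sketch of the three steps above: down? try once, report;
# won't start? stop trying and escalate instead of flapping.
def heal_webserver(is_running, start_service, send_email, max_attempts=1):
    if is_running():
        return "ok"
    for _ in range(max_attempts):
        start_service()
        if is_running():
            send_email("apache was down; restarted it (read this tomorrow)")
            return "restarted"
    # won't start: something's really wrong, so don't keep trying
    send_email("apache is down and would NOT restart; needs a human")
    return "gave_up"

# toy run: the service comes up on the first restart attempt
state = {"up": False}
outcome = heal_webserver(
    is_running=lambda: state["up"],
    start_service=lambda: state.update(up=True),
    send_email=print,
)
print(outcome)   # restarted
```

The cap on attempts is the important part: a box that won't start is a signal, and retrying forever would only hide it (and wake someone up anyway, later and angrier).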
SLIDES 92-98

MySQL Self-Healing

Some MySQL issues “fixed” by the machines

  • Kill long-running SELECT queries (marked safe to kill)
  • Queries not safe to kill are marked by the application as “NO KILL” in comments
  • Run EXPLAIN on killed queries, and report the results
  • Keep track of the query types and databases that need the most killing, produce a “DBs that Suck” report
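The victim-selection rules can be sketched over processlist-style rows. The row shape, the 30-second threshold, and the function name are assumptions for illustration; the real daemon would read `SHOW PROCESSLIST`, issue `KILL`, and then `EXPLAIN` each victim as the slide says.

```python
# Sketch of the long-query killer's selection logic: long-running SELECTs
# are candidates unless the application tagged them "NO KILL" in a comment.
LONG_QUERY_SECS = 30   # hypothetical threshold

def pick_victims(processlist):
    """Return ids of queries that are safe to kill, per the rules above."""
    victims = []
    for qid, secs, sql in processlist:   # (id, seconds_running, sql_text)
        if secs < LONG_QUERY_SECS:
            continue
        if not sql.lstrip().upper().startswith("SELECT"):
            continue                     # only SELECTs are candidates
        if "NO KILL" in sql:
            continue                     # the application says hands off
        victims.append(qid)              # real daemon: KILL qid, then EXPLAIN
    return victims

rows = [
    (1, 45, "SELECT * FROM photos WHERE ..."),
    (2, 50, "SELECT /* NO KILL */ * FROM backups"),
    (3, 99, "UPDATE accounts SET ..."),
    (4, 5,  "SELECT id FROM tags"),
]
print(pick_victims(rows))   # [1]
```

Putting the opt-out in the query comment is the clever bit: the application, which knows which queries are genuinely long by design, annotates them at the source, so the killer needs no separate whitelist to maintain.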
SLIDES 99-103

MySQL Self-Healing

Some MySQL replication issues “fixed” by the machines, by error

  • Skip errors that can safely be skipped and restart slave threads
  • Force refetch of replication binlogs on:
  • 1064 (ER_PARSE_ERROR)
  • Re-run queries on:
  • 1205 (ER_LOCK_WAIT_TIMEOUT)
  • 1213 (ER_LOCK_DEADLOCK)
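Handling replication failures "by error" is naturally a dispatch table. The 1064/1205/1213 mappings come straight from the slide; the skip-whitelist contents (1062 here) and the fallback are assumptions, since the slide doesn't say which errors Flickr considered safe to skip.

```python
# Sketch of per-error-code replication handling as a dispatch table.
REFETCH_BINLOG = {1064}          # ER_PARSE_ERROR (from the slide)
RERUN_QUERY    = {1205, 1213}    # ER_LOCK_WAIT_TIMEOUT, ER_LOCK_DEADLOCK
SAFE_TO_SKIP   = {1062}          # e.g. ER_DUP_ENTRY -- assumed, not from the slide

def handle_replication_error(errno):
    if errno in REFETCH_BINLOG:
        return "force refetch of binlogs"
    if errno in RERUN_QUERY:
        return "re-run query"
    if errno in SAFE_TO_SKIP:
        return "skip error, restart slave threads"
    return "page a human"            # unknown errors are never auto-fixed

print(handle_replication_error(1064))   # force refetch of binlogs
print(handle_replication_error(1213))   # re-run query
print(handle_replication_error(9999))   # page a human
```

The explicit fallback matters: self-healing only covers failure modes you have already diagnosed once by hand; anything novel should still wake someone.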
SLIDE 104

Troubleshooting
SLIDES 105-107

Code and Config Deploy Logs

  1. ESSENTIAL
  2. MANDATORY
SLIDES 108-111

Communications

  • Internal IRC
  • For ongoing discussions
  • Logged, so “infinite” scrollback
  • IM Bot (built on libyahoo2.sf.net)
  • For production changes
  • Broadcasts all messages to all contacts
  • Logged, and injected into IRC
  • IM Status = who is on primary/secondary on-call
  • All of IRC and IM Bot slurped into a search index
SLIDES 112-119

(deploy-log screenshot, annotated step by step: when, who, what, detailed what*; time of last deploy shown at the top of ganglia)

*also points to what commands should be used to back out the changes
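A deploy-log record carrying those fields might look like the sketch below. The field names and the JSON shape are invented for illustration; the slides show a web UI, not a schema, so treat this only as the set of facts each entry needs to capture.

```python
# Sketch of one deploy-log entry: when / who / what / detailed what,
# plus the back-out command the footnote calls for.
import json
import time

def deploy_log_entry(who, what, detail, backout_cmd, when=None):
    return {
        "when": when if when is not None else int(time.time()),
        "who": who,
        "what": what,
        "detail": detail,
        "backout": backout_cmd,   # the command that undoes this change
    }

entry = deploy_log_entry(
    who="allspaw",                                        # hypothetical values
    what="config push",
    detail="raised memcached connection limit on www hosts",
    backout_cmd="deploy --rollback config-r1234",
    when=1239148800,
)
print(json.dumps(entry, indent=2))
```

Recording the back-out command at deploy time, not during the outage, is the point: at 3am you want "run this", not "figure out how to undo what someone else shipped."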
SLIDES 120-124

IM Bot (timestamps help correlation)

All IRC and IM Bot messages go into searchable history
SLIDES 125-129

Morals of Our Stories

  • Optimizations can be a Very Good Thing™
  • Weigh time spent optimizing against expected gains
  • Lean on others for how much “expected gains” mean for different scenarios
  • Plain old-fashioned intuition
SLIDE 130

Some Wisdom Nuggets

Jon Prall’s 85 WebOps Rules: http://jprall.vox.com/library/post/85-operations-rules-to-live-by.html
SLIDE 131

Questions?

Photo credits:
http://www.flickr.com/photos/ebarney/3348965637/
http://www.flickr.com/photos/dgmiller/1606071911/
http://www.flickr.com/photos/dannyboyster/60371673/
http://www.flickr.com/photos/bright/189338394/
http://www.flickr.com/photos/nickwheeleroz/2475011402/
http://www.flickr.com/photos/dramaqueennorma/191063346/
http://www.flickr.com/photos/telstar/2861103147/
http://www.flickr.com/photos/norby/2309046043/
http://www.flickr.com/photos/allysonk/201008992/