Metrics-Driven Engineering



slide-1
SLIDE 1

Mike Brittain @mikebrittain

Director of Engineering, Infrastructure

Metrics-Driven Engineering

October 13, 2011

slide-2
SLIDE 2

Tools and Process at Etsy

slide-3
SLIDE 3

How many new visits?

How many listings created?

How many registrations?

How do people use Etsy?

How many convos sent?

How many purchases?

How many new shops?

slide-4
SLIDE 4

Search indexing?

How fast are pages generating?

Async tasks currently in queue?

What is the application doing?

Developer API auth and rate limiting?

Images resized and stored?

Error and warning rates?

slide-5
SLIDE 5

Replication slave lag?

Memcache hits/misses?

Available connections?

Are the servers in good shape?

Database queries per second?

Total outgoing bandwidth?

CPU, Memory, I/O?

slide-6
SLIDE 6

Business Metrics

slide-7
SLIDE 7

Application Metrics

slide-8
SLIDE 8

System Metrics

slide-9
SLIDE 9

Visibility EVERYWHERE

slide-10
SLIDE 10

Constant Change

slide-11
SLIDE 11
slide-12
SLIDE 12

$314 Million GMS 2010

$180 Million GMS 2009

$87 Million GMS 2008

$26 Million GMS 2007

credit: pentarux (flickr)

slide-13
SLIDE 13

25 Million Unique Visitors

1 Billion page views per month

credit: pentarux (flickr)

slide-14
SLIDE 14

Engineering team grew 500% over 18 months

credit: martin_heigan (flickr)

slide-15
SLIDE 15

Less talk, more do.

slide-16
SLIDE 16

Always Be Shipping

credit: ibailemon (flickr)

slide-17
SLIDE 17

Always Be Shipping

(even if it’s your first day)

credit: ibailemon (flickr)

slide-18
SLIDE 18
slide-19
SLIDE 19

90+ Engineers

40+ Deploys / day

credit: misswired (flickr)

slide-20
SLIDE 20

credit: digidave (flickr)

slide-21
SLIDE 21

Code Reviews

slide-22
SLIDE 22

Automated Tests

slide-23
SLIDE 23

$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'on'),
    'profiles'   => array('enabled' => 'on'),
    'new_search' => array('enabled' => 'off'),
);

Config Flags

Enable and disable features quickly

slide-24
SLIDE 24

$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'on'),
    'profiles'   => array('enabled' => 'on'),
    'new_search' => array('enabled' => 'off'),
);

Config Flags

Enable and disable features quickly

Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc.
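The slide names percentage ramp-up and whitelists but does not show how they are evaluated. A minimal Python sketch of one way such a check could work follows. The `percent` and `whitelist` keys, the `rampup` state, the `new_profiles` feature, and the hashing scheme are all assumptions made for illustration, not Etsy's implementation.

```python
import hashlib

# Hypothetical flag table modelled on the $cfg array above. The 'rampup'
# state and the 'percent' / 'whitelist' keys are assumptions.
CFG = {
    "checkout": {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_profiles": {"enabled": "rampup", "percent": 10, "whitelist": {"alice"}},
}


def bucket(feature: str, user_id: str) -> int:
    """Map (feature, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature: str, user_id: str) -> bool:
    flag = CFG.get(feature)
    if flag is None:
        return False  # unknown features stay off
    state = flag["enabled"]
    if state == "on":
        return True
    if state == "off":
        return False
    # Ramp-up: whitelisted users always get the feature, others by bucket.
    if user_id in flag.get("whitelist", set()):
        return True
    return bucket(feature, user_id) < flag.get("percent", 0)
```

Hashing the feature name together with the user id keeps a given user in the same bucket across requests, while different features ramp up over different subsets of users.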

slide-25
SLIDE 25

Failure is not an option

slide-26
SLIDE 26

Failure is not an option inevitable!

slide-27
SLIDE 27

Failure is not an option inevitable! a learning opportunity!

slide-28
SLIDE 28

Failure is not an option inevitable! a learning opportunity!

DETECTABLE!

slide-29
SLIDE 29

Access

slide-30
SLIDE 30
slide-31
SLIDE 31
slide-32
SLIDE 32
slide-33
SLIDE 33

Detect problems quickly

slide-34
SLIDE 34

CONFIDENCE

slide-35
SLIDE 35
slide-36
SLIDE 36

A: Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...

slide-37
SLIDE 37

Engineers build the application

slide-38
SLIDE 38

OPS

Logging Graphing Trending Alerting

ENG

slide-39
SLIDE 39

“Engineers are too busy writing features to build metrics.”

slide-40
SLIDE 40

Metrics are part of every feature

...and so are config flags

slide-41
SLIDE 41

Dead Simple

slide-42
SLIDE 42

Simple, open source tools

slide-43
SLIDE 43

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)
Logging
Logster
StatsD

slide-44
SLIDE 44

Ganglia

slide-45
SLIDE 45

Ganglia

Cluster-oriented
Huge community contributed recipes
Custom metrics (gmetric)

slide-46
SLIDE 46

Graphite

slide-47
SLIDE 47

Graphite

Single-instance
Create new metrics on-the-fly
Customize via URLs and display functions

slide-48
SLIDE 48

Logging

slide-49
SLIDE 49

It’s 2:48 PM.

Do you know where your logs are?

slide-50
SLIDE 50

Logger::log_error("User login failed. Reason: $msg for $username", "login");

slide-52
SLIDE 52

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
slide-58
SLIDE 58

LogFormat "%h %l %u %t \"%r\" %>s %b" common

slide-59
SLIDE 59

LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined

slide-60
SLIDE 60

apache_note()

slide-64
SLIDE 64

grep "/listing/" access.log | \
  awk '{sum=sum+$(NF-2)} END {print sum/NR}'
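For readers less at home in awk, the same idea (average a numeric field counted from the end of each matching line) can be sketched in Python. The function name and parameters are illustrative. Unlike the awk one-liner, this version skips lines where the field is not numeric rather than counting them as zero.

```python
from typing import Iterable, Optional


def average_field(lines: Iterable[str], needle: str, offset_from_end: int) -> Optional[float]:
    """Average a numeric field counted from the end of each matching line.

    offset_from_end=0 is the last field, like awk's $NF;
    offset_from_end=2 corresponds to $(NF-2).
    """
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        if len(fields) <= offset_from_end:
            continue
        try:
            total += float(fields[-1 - offset_from_end])
        except ValueError:
            continue  # e.g. "-" when a note was never set for the request
        count += 1
    return total / count if count else None
```

Counting from the end matters because fields such as the User-Agent contain spaces, so positions counted from the start of the line are not stable.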

slide-65
SLIDE 65

web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling

slide-66
SLIDE 66

Fatals Errors Warnings

Logster

slide-67
SLIDE 67

Logster

github.com/etsy

Run by cron
Keeps a cursor on your log file
Aggregate lines any way you want
Output to Ganglia or Graphite
Simple parsers

slide-68
SLIDE 68

web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login failed. Reason: wrong password for ...
slide-69
SLIDE 69

^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]

slide-70
SLIDE 70

if (fields['log_level'] == "fatal"):
    self.fatals += 1
elif (fields['log_level'] == "error"):
    self.errors += 1
elif (fields['log_level'] == "warning"):
    self.warnings += 1
...

slide-71
SLIDE 71

MetricObject("fatals", (self.fatals / self.duration), "per sec")
MetricObject("errors", (self.errors / self.duration), "per sec")
MetricObject("warning", (self.warnings / self.duration), "per sec")
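Slides 69 to 71 show the three pieces of a parser separately: a pattern, per-line counting, and per-second rates. The standalone Python sketch below puts them together. It does not import Logster, and the class name is hypothetical; the method names only echo the shape of a Logster parser. It uses negated character classes in the pattern, since a greedy `.+` can swallow more than one bracketed group on lines like the samples above.

```python
import re
from collections import Counter

# Negated character classes keep each bracketed group separate, so
# log_level is always the second bracketed group on the line.
LINE_RE = re.compile(r"^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]")


class ErrorLogCounter:
    """Counts fatal/error/warning lines and reports per-second rates."""

    LEVELS = ("fatal", "error", "warning")

    def __init__(self) -> None:
        self.counts = Counter()

    def parse_line(self, line: str) -> None:
        match = LINE_RE.match(line)
        if match is None:
            return  # not a line this parser recognises; skip it
        level = match.group("log_level")
        if level in self.LEVELS:
            self.counts[level] += 1

    def get_state(self, duration: float) -> dict:
        """Return events per second over `duration` seconds."""
        return {level: self.counts[level] / duration for level in self.LEVELS}
```

Reporting a rate rather than a raw count keeps the graph comparable even when the interval between cron runs varies.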

slide-72
SLIDE 72

Fatals Errors Warnings

slide-73
SLIDE 73

StatsD

slide-74
SLIDE 74

StatsD

github.com/etsy

Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec
One line of code

slide-75
SLIDE 75

StatsD::increment("logins.success");

slide-76
SLIDE 76

StatsD::increment("logins.success");

logins
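What a client call like this puts on the wire is small enough to sketch. The Python below builds a StatsD counter datagram (`name:value|c`) and sends it over UDP. The function names are illustrative, and the default host and port (127.0.0.1, 8125) are assumptions about where a daemon might be listening.

```python
import socket


def counter_packet(name: str, delta: int = 1) -> bytes:
    """Build a StatsD counter datagram, e.g. b'logins.success:1|c'."""
    return f"{name}:{delta}|c".encode("ascii")


def increment(name: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget: send one UDP datagram and never wait for a reply."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(counter_packet(name), (host, port))
    except OSError:
        pass  # a metrics failure should never break the request
    finally:
        sock.close()
```

Because UDP needs no connection or acknowledgement, the application pays almost nothing for the call even if the daemon is down.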

slide-77
SLIDE 77

StatsD::timing("gearman.time", $msec);

slide-78
SLIDE 78

StatsD::timing("gearman.time", $msec);

90th pct average lower
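The graph legend lists a 90th-percentile series, an average, and a lower bound. The Python sketch below shows one plausible way an aggregator could derive such series from the timer samples received in one flush interval. It is not taken from StatsD's source, and the key names are assumptions.

```python
from typing import Dict, List


def summarize_timings(values_ms: List[float], pct: int = 90) -> Dict[str, float]:
    """Summarize one flush interval of timer samples.

    Returns the minimum, plus the mean and maximum of the fastest `pct`
    percent of samples, so a few slow outliers do not dominate the graph.
    """
    if not values_ms:
        raise ValueError("no samples to summarize")
    ordered = sorted(values_ms)
    keep = max(1, round(len(ordered) * pct / 100))
    within = ordered[:keep]
    return {
        "lower": ordered[0],
        f"mean_{pct}": sum(within) / len(within),
        f"upper_{pct}": within[-1],
    }
```

With ten samples of 10, 20, ..., 90 and one outlier of 1000 ms, the outlier falls outside the 90 percent window, so the reported mean is 50 ms and the upper bound is 90 ms.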

slide-79
SLIDE 79

Ad hoc

name value timestamp

slide-80
SLIDE 80

echo "events.deploy.site 1 `date +%s`" \
  | nc graphite.etsycorp.com 2003
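The same "name value timestamp" line can be sent from a deploy script rather than the shell. This Python sketch formats the line and writes it to a Graphite plaintext listener over TCP. The function names are illustrative, and the host is left as a parameter because it depends on your environment.

```python
import socket
import time
from typing import Optional


def plaintext_line(name: str, value: float, timestamp: Optional[int] = None) -> str:
    """Format one metric as 'name value timestamp' followed by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {timestamp}\n"


def send_event(name: str, host: str, port: int = 2003, timeout: float = 2.0) -> None:
    """Send a single data point with value 1, like the echo | nc one-liner."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(plaintext_line(name, 1).encode("ascii"))
```

A deploy tool could call `send_event("events.deploy.site", host)` as its final step, so every deploy leaves a data point that graphs can overlay.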

slide-81
SLIDE 81

Vertical Line Technology!

target=drawAsInfinite(events.deploy.site)

slide-82
SLIDE 82
slide-83
SLIDE 83

We could stare at graphs all day...

slide-84
SLIDE 84

http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1

slide-85
SLIDE 85

http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1

webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
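The raw response is a header (target, start, end, step in seconds), a pipe, then comma-separated values, with `None` for a missing point. A small Python sketch of a parser for that shape, with an illustrative function name:

```python
from typing import List, Optional, Tuple


def parse_raw_series(line: str) -> Tuple[str, int, int, int, List[Optional[float]]]:
    """Parse 'name,start,end,step|v1,v2,...' into its parts.

    Missing points arrive as the literal string 'None'.
    """
    header, _, body = line.strip().partition("|")
    # Split from the right: a target expression may itself contain commas.
    name, start, end, step = header.rsplit(",", 3)
    values: List[Optional[float]] = []
    for raw in body.split(","):
        raw = raw.strip()
        if not raw:
            continue
        values.append(None if raw == "None" else float(raw))
    return name, int(start), int(end), int(step), values
```

Once the values are plain numbers, a script can compare the latest point against a threshold instead of a person watching the graph.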

slide-86
SLIDE 86

Holt-Winters Confidence Bands

lower upper

slide-87
SLIDE 87

Holt-Winters Aberration

slide-88
SLIDE 88

Business metrics + Confidence bands = Alertable metrics
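Once upper and lower confidence bands exist for a metric, the alerting rule itself is simple: a point is aberrant when it escapes the band. The Python sketch below assumes the bands have already been computed elsewhere (it does not implement Holt-Winters forecasting), and the function names are illustrative.

```python
from typing import Optional


def aberration(value: Optional[float], lower: float, upper: float) -> float:
    """Distance by which value escapes the band [lower, upper]; 0 if inside."""
    if value is None:
        return 0.0  # no data point, nothing to judge
    if value > upper:
        return value - upper
    if value < lower:
        return value - lower
    return 0.0


def should_alert(value: Optional[float], lower: float, upper: float) -> bool:
    """True when the observed value falls outside the confidence band."""
    return aberration(value, lower, upper) != 0.0
```

A signed result distinguishes a spike (positive) from a drop (negative), which matters for business metrics where a sudden fall in purchases is the worrying case.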

slide-89
SLIDE 89

40,000+ metrics at Etsy

Systems, Applications, Business

slide-90
SLIDE 90

Dashboards

slide-91
SLIDE 91

Dashboards

slide-92
SLIDE 92

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
<img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
</a>

Kind of Hard :-/

slide-93
SLIDE 93

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);

Super Easy!
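The slide shows only the calling side of the helper. As a rough idea of what such a wrapper might do internally, here is a Python sketch that assembles a render URL from a title and a list of metrics. The class and method names are hypothetical and are not Etsy's PHP class.

```python
from typing import List, Tuple
from urllib.parse import urlencode


class GraphiteGraph:
    """Hypothetical helper in the spirit of the PHP class on the slide."""

    def __init__(self, base_url: str, time_from: str = "-1hours") -> None:
        self.base_url = base_url.rstrip("/")
        self.time_from = time_from
        self.title = ""
        self.metrics: List[Tuple[str, str]] = []  # (target, hex color)

    def set_title(self, title: str) -> None:
        self.title = title

    def add_metric(self, target: str, color: str) -> None:
        self.metrics.append((target, color))

    def render_url(self, width: int, height: int) -> str:
        params = [("from", self.time_from), ("width", width),
                  ("height", height), ("title", self.title), ("yMin", 0)]
        params += [("target", target) for target, _ in self.metrics]
        params.append(("colorList", ",".join(c for _, c in self.metrics)))
        # urlencode handles the escaping that is so easy to get wrong by hand.
        return f"{self.base_url}/render?{urlencode(params)}"
```

The point of the wrapper is that nobody has to hand-escape `#` as `%23` or keep the thumbnail and full-size URLs in sync.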

slide-94
SLIDE 94

Metrics!

slide-95
SLIDE 95

Metrics!
Metrics + Events

slide-96
SLIDE 96

Metrics!
Metrics + Events
Metrics + Alerts

slide-97
SLIDE 97

Metrics!
Metrics + Events
Metrics + Alerts
Metrics + Metrics

slide-98
SLIDE 98

High-level, real-time visibility

slide-99
SLIDE 99

Detect problems quickly

slide-100
SLIDE 100

CONFIDENCE

slide-101
SLIDE 101

Make them required features

slide-102
SLIDE 102

Make them dead simple

slide-103
SLIDE 103

Make them accessible

slide-104
SLIDE 104

Make them!

slide-105
SLIDE 105

Thank You

Homework

codeascraft.etsy.com
github.com/etsy

We’re always looking for people who are interested in this kind of stuff...
etsy.com/careers

Get in touch

mike@etsy.com
@mikebrittain

slide-106
SLIDE 106