Mike Brittain @ mikebrittain
Director of engineering, Infrastructure
Metrics-Driven Engineering
October 13, 2011
Tools and Process at Etsy
How many registrations?
How do people use Etsy?
How many new visits?
How many listings created?
How many purchases?
How many new shops?
Search indexing?
How fast are pages generating?
Images resized and stored?
Error and warning rates?
Replication slave lag?
Memcache hits/misses?
Total outgoing bandwidth?
CPU, Memory, I/O?
$87 Million GMS 2008
$26 Million GMS 2007
credit: pentarux (flickr)
credit: martin_heigan (flickr)
credit: ibailemon (flickr)
(even if it’s your first day)
credit: misswired (flickr)
credit: digidave (flickr)
$cfg = array(
    'checkout'   => array('enabled' => 'on'),
    'homepage'   => array('enabled' => 'on'),
    'profiles'   => array('enabled' => 'on'),
    'new_search' => array('enabled' => 'off'),
);
Enable and disable features quickly
Plus “admin-only,” percentage ramp-up, A/B testing, whitelists, blacklists, etc.
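A flag lookup with a percentage ramp-up could be sketched roughly like this. It is a Python illustration, not Etsy's implementation: the `rampup` state, the `percent` key, the `new_cart` entry, and the `bucket` and `is_enabled` helpers are all assumptions made for the example.

```python
import hashlib

# Hypothetical config mirroring the PHP $cfg array. The 'rampup' state,
# the 'percent' key and the 'new_cart' entry are assumptions.
CFG = {
    "checkout": {"enabled": "on"},
    "homepage": {"enabled": "on"},
    "profiles": {"enabled": "on"},
    "new_search": {"enabled": "off"},
    "new_cart": {"enabled": "rampup", "percent": 10},
}


def bucket(feature, user_id):
    """Map (feature, user) to a stable bucket in the range 0..99."""
    digest = hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(feature, user_id=None, cfg=CFG):
    """Return True if the feature is switched on for this user."""
    flag = cfg.get(feature, {"enabled": "off"})  # unknown flags are off
    state = flag["enabled"]
    if state == "on":
        return True
    if state == "rampup" and user_id is not None:
        # The same user always lands in the same bucket, so raising
        # 'percent' only ever adds users to the enabled group.
        return bucket(feature, user_id) < flag.get("percent", 0)
    return False
```

Hashing the feature name together with the user ID keeps each ramp-up independent, so the same users are not always first in line for every new feature.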
Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...
...and so are config flags
Cluster-oriented
Huge community-contributed recipes
Custom metrics (gmetad)
Single-instance
Create new metrics on-the-fly
Customize via URLs and display functions
It’s 2:48 PM.
Logger::log_error("User login failed. Reason: $msg for $username", "login");
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login
LogFormat "%h %l %u %t \"%r\" %>s %b" common
LogFormat "%{True-Client-IP}i %l %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %{etsy_shop_id}n %{etsy_uaid}n %V %{etsy_ab_selections}n %{etsy_request_uuid}n %{etsy_api_consumer_key}n %{etsy_api_method_name}n %{php_memory_usage_bytes}n %{php_time_microsec}n %D" combined
grep "/listing/" access.log | \
  awk '{sum=sum+$(NF-2)} END {print sum/NR}'
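The same one-liner can be sketched in Python for cases where the filter or the arithmetic grows beyond what is comfortable in awk. The `average_field` helper and its parameters are hypothetical names for this example. Like the awk version, it counts the field from the end of the line, so spaces inside the quoted User-Agent do not shift the column.

```python
def average_field(lines, needle, index_from_end):
    """Average a numeric field over the lines that contain `needle`.

    `index_from_end` counts back from the last whitespace-separated
    field: 0 is awk's $NF, 2 is awk's $(NF-2).
    """
    total = 0.0
    count = 0
    for line in lines:
        if needle not in line:
            continue
        fields = line.split()
        total += float(fields[-1 - index_from_end])
        count += 1
    # None signals "no matching lines" instead of dividing by zero.
    return total / count if count else None
```

With a real log, this would be called as `average_field(open("access.log"), "/listing/", 2)`. A field that is not numeric (a `-` placeholder, for instance) raises `ValueError` rather than being counted as zero.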
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling
Fatals Errors Warnings
github.com/etsy
Run by cron
Keeps a cursor on your log file
Aggregate lines any way you want
Output to Ganglia or Graphite
Simple parsers
web0054 [Fri Mar 04 16:27:48 2011] [error] [login] [mk04gw1p71] User login
^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]
if fields['log_level'] == "fatal":
    self.fatals += 1
elif fields['log_level'] == "error":
    self.errors += 1
elif fields['log_level'] == "warning":
    self.warnings += 1
...
MetricObject("fatals", (self.fatals / self.duration), "per sec")
MetricObject("errors", (self.errors / self.duration), "per sec")
MetricObject("warning", (self.warnings / self.duration), "per sec")
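Put together, a parser of this shape might look like the sketch below. It is self-contained and does not import the real library: the `ErrorLogCounter` class and the `Metric` tuple are stand-ins invented for the example. The pattern uses bracket-excluding character classes so that the second bracketed field (the log level) is captured. A greedy `.+` would backtrack to the last bracketed field on the line instead.

```python
import re
from collections import namedtuple

# Stand-in for the MetricObject shown on the slide (assumed shape).
Metric = namedtuple("Metric", ["name", "value", "units"])

# host, [timestamp], [log_level], then anything else.
LINE_RE = re.compile(r"^\S+ \[[^\]]+\] \[(?P<log_level>[^\]]+)\]")


class ErrorLogCounter:
    """Count fatal, error and warning lines and report per-second rates."""

    def __init__(self):
        self.fatals = 0
        self.errors = 0
        self.warnings = 0

    def parse_line(self, line):
        match = LINE_RE.match(line)
        if match is None:
            return  # skip lines that do not look like error-log entries
        level = match.group("log_level")
        if level == "fatal":
            self.fatals += 1
        elif level == "error":
            self.errors += 1
        elif level == "warning":
            self.warnings += 1

    def get_state(self, duration):
        """Return rates for `duration` seconds of log data."""
        duration = float(duration)  # avoid integer division surprises
        return [
            Metric("fatals", self.fatals / duration, "per sec"),
            Metric("errors", self.errors / duration, "per sec"),
            Metric("warning", self.warnings / duration, "per sec"),
        ]
```

A cron-driven runner would feed each new log line to `parse_line`, then call `get_state` with the seconds elapsed since the previous run and ship the result to Ganglia or Graphite.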
Fatals Errors Warnings
github.com/etsy
Network daemon (node.js)
Accepts data over UDP
Flushes to Graphite every 10 sec
One line of code
StatsD::increment("logins.success");
logins
StatsD::timing("gearman.time", $msec);
90th pct average lower
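Behind those one-liners, a client only has to format a short string and send it as a UDP datagram. A minimal Python sketch follows, assuming the usual `name:value|type` StatsD line format and a daemon listening on 127.0.0.1:8125. The host, port, and helper names are assumptions for the example.

```python
import socket


def format_counter(name, delta=1):
    """Counter line: increment `name` by `delta`."""
    return f"{name}:{delta}|c"


def format_timing(name, msec):
    """Timer line: one observation of `msec` milliseconds."""
    return f"{name}:{msec}|ms"


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric; UDP means no reply and no blocking."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Calling `send(format_counter("logins.success"))` or `send(format_timing("gearman.time", 320))` mirrors the two PHP calls. Because the datagram is never acknowledged, a slow or absent daemon cannot slow down the page that emits the metric.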
name value timestamp
echo "events.deploy.site 1 `date +%s`" \
  | nc graphite.etsycorp.com 2003
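The same "name value timestamp" line can be sent from application or deploy code. A minimal Python sketch follows; `format_metric` and `send_metric` are hypothetical helper names, and the port default simply follows the `nc` example above.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one 'name value timestamp' line, newline-terminated."""
    if timestamp is None:
        timestamp = time.time()  # default to "now", like `date +%s`
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(line, host, port=2003):
    """Open a TCP connection, write the line, and close."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))
```

A deploy script could then call `send_metric(format_metric("events.deploy.site", 1), "graphite.etsycorp.com")` as its final step.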
target=drawAsInfinite(events.deploy.site)
http://graphite/render?from=-1hours&width=600&height=200&target=webs.errorLog.warning&rawData=1
webs.errorLog.warning,1318444930,1318448530,60|5.0,1.0,3.0,1.0,0.0,9.0,0.0,1.0,3.0,2.0,1.0,6.0,2.0,6.0,3.0,6.0,4.0,4.0,2.0,1.0,1.0,8.0,2.0,3.0,6.0,3.0,5.0,3.0,0.0,4.0,6.0,2.0,0.0,2.0,0.0,4.0,0.0,3.0,1.0,3.0,4.0,2.0,10.0,3.0,0.0,6.0,0.0,4.0,2.0,5.0,18.0,1.0,1.0,2.0,1.0,8.0,5.0,1.0,1.0,None
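The raw output is easy to consume from a script, for alerting or for further math. A small Python sketch of a parser for one series follows, assuming the `target,start,end,step|v1,v2,...` layout shown above. The `parse_raw_data` name is hypothetical.

```python
def parse_raw_data(text):
    """Parse one series of rawData output.

    Returns (target, start, end, step, values). Points with no data,
    printed as 'None', become Python None.
    """
    header, _, body = text.strip().partition("|")
    # Split from the right: the target may itself contain commas
    # when it is a function call with several arguments.
    target, start, end, step = header.rsplit(",", 3)
    values = [
        None if item.strip() == "None" else float(item)
        for item in body.split(",")
    ]
    return target, int(start), int(end), int(step), values
```

The timestamp of the i-th value is `start + i * step`, so the series can be zipped back into (time, value) pairs when needed.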
lower upper
Systems, Applications, Business
<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
  <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600">
</a>
$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
echo $g->getDashboardHTML(280, 220);
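A wrapper like this mostly exists to hide URL encoding. The Python sketch below shows the idea. It is not the PHP class from the slide: `render_url` and `dashboard_html` are hypothetical helpers, while the query parameters (`from`, `width`, `height`, `title`, `hideLegend`, `target`, `colorList`) are the ones visible in the hand-written markup.

```python
from html import escape
from urllib.parse import urlencode


def render_url(base, targets, width, height, title=None, colors=None,
               time_from="-1hours", hide_legend=False):
    """Build a render URL with one 'target' parameter per series."""
    params = [("from", time_from), ("width", width), ("height", height)]
    if title:
        params.append(("title", title))
    if hide_legend:
        params.append(("hideLegend", 1))
    params.extend(("target", target) for target in targets)
    if colors:
        params.append(("colorList", ",".join(colors)))
    # urlencode percent-encodes parentheses, '#' and spaces for us.
    return f"{base}/render?{urlencode(params)}"


def dashboard_html(base, targets, title, colors=None,
                   width=280, height=220):
    """Thumbnail graph that links to a larger version of itself."""
    full = render_url(base, targets, 800, 600, title, colors)
    thumb = render_url(base, targets, width, height, title, colors,
                       hide_legend=True)
    return f'<a href="{escape(full)}"><img src="{escape(thumb)}"></a>'
```

Deploy markers are just extra targets, for example `"drawAsInfinite(deploys.web.production)"` appended to the `targets` list, which is what makes it cheap to overlay deploys on every dashboard graph.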
codeascraft.etsy.com github.com/etsy
We’re always looking for people who are interested in this kind of stuff... etsy.com/careers
mike @ etsy . com @ mikebrittain