Operations at Twitter
John Adams Twitter Operations
USENIX LISA 2010
Friday, November 12, 2010
John Adams / @netik
Early Twitter employee
Lead engineer: Application Services (Apache, Unicorn, SMTP, etc...)
Operations
What changed since 2009?
Users
source: blog.twitter.com
Searches/Day
source: twitter.com internal, includes API-based searches
Tweets per day
source: blog.twitter.com
(~1000 Tweets/sec)
2,940 TPS: Japan Scores!
3,085 TPS: Lakers Win!
API Web
#newtwitter is an API client
Nothing works the first time.
and then you must re-evaluate to grow.
UNIX friends fail at scale
same thing cause “micro” outages across the site.
Operations Mantra
Find Weakest Point
Metrics + Logs + Science = Analysis
Take Corrective Action
Process
Move to Next Weakest Point
Repeatability
MTTD
MTTR
Sysadmin 2.0 (Devops)
programming task (puppet, chef, cfengine...)
Data Analysis
techniques for seeing patterns in data are going to be an increasingly valuable skill set. Employers take notice!”
“Web Squared: Web 2.0 Five Years On”, Tim O’Reilly, Web 2.0 Summit, 2009
Monitoring
as near to real time as possible
too.
Profiling
Forecasting
Curve-fitting for capacity planning (R, fityk, Mathematica, CurveFit)
[Figure: status_id with fitted curve, r^2 = 0.99; thresholds marked: signed int (32 bit) Twitpocolypse, unsigned int (32 bit) Twitpocolypse]
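A minimal sketch of the kind of forecast the slide describes: fit a curve to an ID counter and estimate when it crosses the signed and unsigned 32-bit ceilings. The slide names R, fityk, Mathematica and CurveFit; plain Python is swapped in here for illustration. The exponential model, the sample data, and all function names are assumptions, not Twitter's numbers or tooling.

```python
import math

INT32_MAX = 2**31 - 1   # signed 32-bit ceiling
UINT32_MAX = 2**32 - 1  # unsigned 32-bit ceiling


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot


def fit_exponential(days, ids):
    """Fit ids ~= exp(a + b*day) by regressing log(ids) on day."""
    return fit_line(days, [math.log(v) for v in ids])


def day_reaching(limit, a, b):
    """Day at which the fitted exponential crosses `limit`."""
    return (math.log(limit) - a) / b


# Made-up sample: an ID counter that doubles every 100 days.
days = [0, 100, 200, 300, 400]
ids = [1.0e8 * 2 ** (d / 100) for d in days]

a, b, r2 = fit_exponential(days, ids)
signed_day = day_reaching(INT32_MAX, a, b)
unsigned_day = day_reaching(UINT32_MAX, a, b)
```

With real measurements the r2 value indicates how far the fit can be trusted; a poor fit suggests trying a different model before acting on the forecast.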
Configuration Management
EARLY in your company.
Puppet
loony
Murder
(Python w/libtorrent)
hosts
Issues with Centralized Management
Process through Reviews
Logging
daemon failure
Scribe
writing
Hadoop for Ops
servers
systems.
Analyze
deploy
200s/SEO
managed services
Dashboard
Whale Watcher
daemon / www logs
Deploy Watcher
Sample window: 300.0 seconds
First start time: Mon Apr 5 15:30:00 2010 (Mon Apr 5 08:30:00 PDT 2010)
Second start time: Tue Apr 6 02:09:40 2010 (Mon Apr 5 19:09:40 PDT 2010)
PRODUCTION APACHE: ALL OK
PRODUCTION OTHER: ALL OK
WEB049 CANARY APACHE: ALL OK
WEB049 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY BACKEND SERVICES: ALL OK
DAEMON031 CANARY OTHER: ALL OK
Deploys
and Latency
Feature “Darkmode”
computationally or IO-Heavy site function
subsystems
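The darkmode idea (a switch that turns off a computationally or IO-heavy site function) can be sketched as a guard around the expensive call. This is an illustrative sketch only: the DarkmodeFlags class, the guarded decorator, and the related_users function are hypothetical names, and the in-process set stands in for whatever store actually holds the flags.

```python
import functools


class DarkmodeFlags:
    """Hypothetical in-process registry of darkmode switches."""

    def __init__(self):
        self._dark = set()

    def darken(self, feature):
        """Turn a feature off."""
        self._dark.add(feature)

    def lighten(self, feature):
        """Turn a feature back on."""
        self._dark.discard(feature)

    def is_dark(self, feature):
        return feature in self._dark


def guarded(flags, feature, fallback):
    """Decorator: while `feature` is dark, skip the call and return `fallback`."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            if flags.is_dark(feature):
                return fallback
            return fn(*args, **kwargs)
        return inner
    return wrap


flags = DarkmodeFlags()


@guarded(flags, "related_users", fallback=())
def related_users(user_id):
    # Stand-in for an expensive query; the arithmetic is a placeholder.
    return (user_id + 1, user_id + 2)
```

Calling flags.darken("related_users") makes related_users return the empty fallback without touching the backend; flags.lighten("related_users") restores normal behavior.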
request flow
[Diagram: Load Balancers, Apache, Rails (Unicorn), FlockDB, Kestrel, Memcached, MySQL, Cassandra, Daemons, Mail Servers, Monitoring]
Many limiting factors in the request pipeline
Apache: Worker Model, MaxClients, TCP Listen queue depth
Rails (unicorn): 2:1 oversubscribed to cores
Varnish (search): # threads
Memcached: # connections
MySQL: # db connections
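A rough sketch of reasoning about these limits: derive the Unicorn worker count from the 2:1 oversubscription rule, then look for the stage with the smallest concurrency cap. Every number below is invented for illustration, and treating the caps as directly comparable is a simplifying assumption.

```python
def unicorn_workers(cores, oversubscription=2):
    """Worker count when workers are oversubscribed to cores (2:1 by default)."""
    return cores * oversubscription


def tightest_limit(limits):
    """Return the (stage, limit) pair with the smallest concurrency limit."""
    return min(limits.items(), key=lambda item: item[1])


# Hypothetical per-host limits; none of these values come from the talk.
limits = {
    "apache MaxClients": 256,
    "unicorn workers": unicorn_workers(cores=8),
    "memcached connections": 1024,
    "mysql connections": 100,
}

stage, limit = tightest_limit(limits)
```

With these made-up values the Rails tier is the first stage to saturate, so requests beyond 16 in flight would wait in the queue in front of it.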
Unicorn Rails Server
Rails
resulting in slow queries against the db
memcached
important configuration data (loss of darkmode flags, for example)
use/eviction rates on individual slabs using
Decomposition
Asynchronous Requests
expensive
handle 3rd party communications or back-end work.
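The pattern of moving expensive work off the request path can be sketched with a queue and a background worker. This is a minimal sketch: Python's standard queue.Queue stands in for a real message queue, and handle_request, daemon_loop, and the job format are hypothetical.

```python
import queue
import threading

work = queue.Queue()   # stand-in for a real message queue
delivered = []         # records what the daemon processed


def handle_request(user, text):
    """Request path: enqueue the slow part and return immediately."""
    work.put((user, text))
    return "accepted"


def daemon_loop():
    """Background worker: drain the queue until a None sentinel arrives."""
    while True:
        job = work.get()
        if job is None:
            work.task_done()
            break
        user, text = job
        # Placeholder for 3rd party communication or other back-end work.
        delivered.append(f"notify followers of {user}: {text}")
        work.task_done()


worker = threading.Thread(target=daemon_loop, daemon=True)
worker.start()

status = handle_request("alice", "hello")

work.put(None)   # ask the worker to stop
work.join()      # wait until every queued job is processed
worker.join()
```

The request returns as soon as the job is queued; the time spent on delivery is paid by the daemon, not by the user waiting on the response.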
Thrift
Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, OCaml (phew!)
Kestrel
Daemons
Daemons
jobs, all at once.
Flock DB
framework
[Diagram: Flock DB, Gizzard, MySQL, MySQL, MySQL]
Disk is the new Tape.
many O(n^y) operations.
users start to notice. Goal: 250-300 ms
Caching
to prevent eviction
Hash instead of Ruby + MD5
Caching
difficult.
power or system failure?
MySQL
store
Gizzard sharding framework
MySQL Challenges
traversal - we have FlockDB for that
Database Replication
write DBs. Reading from master = slow death
queries
(mkill)
Key Points
Questions?
Thanks!