And It All Went Horribly Wrong: Debugging Production Systems
Bryan Cantrill
VP, Engineering
bryan@joyent.com
@bcantrill
Thursday, November 17, 2011
In the beginning...
Sir Maurice Wilkes, 1913 - 2010
“As soon as we started programming, we found to our surprise that it wasn't as easy to get programs right as we had thought. Debugging had to be discovered. I can remember the exact instant when I realized that a large part of my life from then on was going to be spent in finding mistakes in my own programs.”
—Sir Maurice Wilkes, 1913 - 2010
them, we became better at debugging their failures...
faster (cheaper) ones, debuggability often regressed
higher and higher layer of abstraction — and accelerated by extensive use of componentization
the system initially working (develop) — but often harder to understand it when it fails (deploy + operate)
debuggable!
the architecture stateless wherever possible
semantics, moving from ACID to BASE semantics (i.e., different CAP trade-offs) to increase availability
reliable by using redundant components
architectural imperative to survive datacenter failure
the putatively reliable distributed system; single component failure still has significant cost:
with the component in a failed state
induces additional work that can degrade performance
system in a more vulnerable mode whereby further failure becomes more likely
in mature, reliable systems
a panicked kernel, a seg fault, an uncaught exception
failure can alone induce system failure
liveness criteria for the system — and allowing the
invasive, it risks becoming so complicated as to compound failure
(that is, we must debug them), and we must fix them
failures and (especially) transient failures
dereferencing invalid memory or a program-induced abort) its state is static and invalid
component can be debugged postmortem
backwards to find the transition from a valid state to an invalid one
dates from the dawn of the computing age: a core dump
induced when the software has fatally failed, and even then it is only the cost of writing state to stable storage
else to learn from the component's failure in situ; it can be safely restarted, minimizing downtime without sacrificing debuggability
in parallel, and arbitrarily distant in the future
exist on the system of failure
program text as well as program data
debugging (correctly formed stacks are a must, as is the symbol table; type information is invaluable)
repeatedly pathological system
several open source systems have met them...
source illumos operating system (a Solaris derivative)
for components to deliver custom debugger support
deliver powerful postmortem analysis tools, e.g.:
a core dump to find memory leaks in native code
— but much less developed for dynamic environments like Java, Python, Ruby, JavaScript, Erlang, etc.
postmortem debugging via the jdb(1) tool found in HotSpot VM — but it remains VM specific
software components, it is critical that they support postmortem debugging as a first-class operation!
components in node.js...
Google's V8) for building event-oriented servers:
var http = require('http');
// Answer every request with a plain-text greeting.
http.createServer(function (req, res) {
  res.writeHead(200, {'Content-Type': 'text/plain'});
  res.end('Hello World\n');
}).listen(8124, "127.0.0.1");
console.log('Server running at http://127.0.0.1:8124!');
degree of VM specificity in the debugger…
do this somewhat cleanly with a disjoint V8 module
to symbolically dump JavaScript stacks and arguments from an OS core dump:
(V8) handle
http://dtrace.org/blogs/dap/2011/10/31/nodejs-v8-postmortem-debugging/
much more to do here
JavaScript to allow program-induced core dumps…
core dumps
exactly twice over the course of the past year — and
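The program-induced core dump mentioned on this slide can be sketched in a few lines. This is a minimal sketch, assuming a node.js version that provides `process.abort()` (which postdates this talk); the helper name `abortOnUncaughtException` is hypothetical and not part of node.js. Whether a core file is actually written also depends on operating system settings that are outside the scope of the sketch.

```javascript
// Hypothetical helper (not part of node.js): turn an uncaught exception
// into a program-induced abort so that the OS can write a core dump.
// `proc` is expected to look like the global `process` object: an event
// emitter that also has an abort() method.
function abortOnUncaughtException(proc) {
  proc.on('uncaughtException', function (err) {
    // Record what we know, then die immediately so that the heap is
    // captured in the core dump for postmortem analysis.
    console.error(err && err.stack ? err.stack : String(err));
    proc.abort();
  });
}

// Usage (left commented out so that loading this file has no side effects):
// abortOnUncaughtException(process);
```

By the time the `uncaughtException` handler runs, the throwing stack has already been unwound. Later node.js releases accept the V8 flag `--abort-on-uncaught-exception`, which aborts at the point of the throw and so keeps the failing stack in the core dump; where available, that is likely the better choice.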
with architecturally and (given proper postmortem debugging support) be root-caused from a single failure
compensate for — and much more difficult to debug!
start, and the system is still moving!
symptom from cause — you are physician, not scientist
where is the butterfly?
systems originally developed circa 2003 for Solaris 10
subsequently ported to many other systems (Mac OS X, FreeBSD, NetBSD, QNX, nascent Linux port)
situ data aggregation, statically-defined instrumentation
answers to arbitrary questions
the most significant non-fatal pathologies were high in the stack of abstraction...
say, from the kernel, which poses a challenge for interpreted environments
describe semantically relevant points of instrumentation
PHP, Erlang) have added USDT providers that instrument the interpreter itself
call) and doesn't work in JIT'd environments
instrument, we introduced a function into JavaScript that Node can call to get into USDT-instrumented C++
into C++ costs even when probes are not enabled
probe effect once in C++
for the kernel that allows for translation into a structure that is familiar to node programmers
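The concern on this slide, that instrumentation should cost as little as possible while no probe is enabled, can be illustrated in plain JavaScript. This is only a sketch under stated assumptions: `makeProbe`, `enable`, `disable` and `fire` are hypothetical names, nothing here talks to DTrace, and the `consumer` callback merely stands in for the USDT-instrumented C++ layer. The point it shows is that probe arguments are produced by a callback that runs only when the probe is enabled.

```javascript
// Hypothetical, DTrace-free sketch of a probe whose arguments are
// computed lazily. `consumer` stands in for the instrumented C++ layer.
function makeProbe(name, consumer) {
  var enabled = false;
  return {
    enable: function () { enabled = true; },
    disable: function () { enabled = false; },
    // `getArgs` is invoked only when the probe is enabled, so a disabled
    // probe costs one function call and one boolean check.
    fire: function (getArgs) {
      if (!enabled) {
        return false;
      }
      consumer(name, getArgs());
      return true;
    }
  };
}

// Usage: count how often the (potentially expensive) arguments are built.
var built = 0;
var seen = [];
var probe = makeProbe('request-start', function (probeName, args) {
  seen.push([probeName, args]);
});

function buildArgs() {
  built += 1;
  return ['GET', '/index.html'];
}

probe.fire(buildArgs);   // disabled: buildArgs is never called
probe.enable();
probe.fire(buildArgs);   // enabled: buildArgs runs once
```

Passing a function rather than ready-made values keeps argument marshalling off the hot path whenever nobody is tracing, which is the same trade-off the slide describes for the crossing into C++.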
in his node-dtrace-provider npm module: https://github.com/chrisa/node-dtrace-provider
(Devel::DTrace::Provider)
in-kernel events with their user-level (dynamic) context
from probe context, we introduced the notion of a helper — programmatic logic that is attached to the VM itself
get from frame pointers to a string that names the frame
to program
for Python by John Levon and node.js by Dave Pacheco
recovered postmortem after system failure via MDB
into fatal failure via its raise() and panic() actions
then be gcore(1)'d and prun(1)'d
an otherwise dynamic problem
with dynamic instrumentation gives one much more latitude in attacking either variant of system pathology!
debugging work, the V8 ustack helper and his excellent ACM Queue article on postmortem debugging: http://queue.acm.org/detail.cfm?id=2039361
dtrace
support for DTrace