be conservative enhancing failure diagnosis with
play

Be Conservative: Enhancing Failure Diagnosis with Proactive Logging - PowerPoint PPT Presentation

Be Conservative: Enhancing Failure Diagnosis with Proactive Logging Ding Yuan , Soyeon Park, Peng Huang, Yang Liu, Michael Lee, Xiaoming Tang, Yuanyuan Zhou, Stefan Savage University of California, San Diego University of Illinois at


  1. Be Conservative: Enhancing Failure Diagnosis with Proactive Logging Ding Yuan , Soyeon Park, Peng Huang, Yang Liu, Michael Lee, Xiaoming Tang, Yuanyuan Zhou, Stefan Savage University of California, San Diego University of Illinois at Urbana-Champaign http://opera.ucsd.edu/errlog.html

  2. Motivation  Production failures are hard to reproduce  Privacy concerns for input  Hard to recreate the production setting 2

  3. Importance of log messages  Vendors actively collect logs  EMC, NetApp, Cisco, Dell collect logs from >50% of their Fifth annual SANS Survey Reveals 99% customers [SANS2009][EMC][Dell] of Organizations Collect Logs or Plan to Implement Log Management  Log messages cut diagnosis time by 2.2X 3.0X 2.3X Diagnosis time* 1.4X (normalized) * result from >100 randomly sampled failures per software 3

  4. An real-world example of good logging Starting Apache web server $ ./apachectl start Could not open group file: /etc/httpd/gorup No such file or directory Typo misconfiguration What if there is no such log message? 4

  5. Real-world failure report User: “Apache httpd cannot start. No log message printed.” Developer: Cannot reproduce the failure… Ask lots of user information… Detected error User’s misconfiguration: & terminate typo in group file name. if ((status = fileopen(grpfile, ..)) != SUCCESS) { + ap_log_error(“Could not open group file: %s”, grpfile); return DECLINED; Reative instead of proactive! } 5

  6. Real-world bug in Squid web-cache User: “In an array of squid servers, from time to time the available file descriptors drops down to nearly zero. No log message or anything!” Developer : Cannot reproduce the failure… 45 exchanges Ask user for [debug] level logs… Ask user for configuration file Additional log statements. Ask user for DNS statistics… 6

  7. Real-world bug in Squid web-cache User: “In an array of squid servers, from time to time the available file descriptors drops down to nearly zero. No log message or anything!” DNS lookup error Not handled if (status != OK) { properly idnsSendQuery (q); + idnsTcpCleanup(q); + error(“Failed to connect to DNS server with TCP”); } 7

  8. What we have seen from the examples  Developers miss obvious log opportunities  Analogy: solving crime without evidence  How many real-world cases are like this?  What are other obvious places to log? 8

  9. Our contributions  Quantitative evidences  Many opportunities that developers could have logged  Small set of generic “log-worthy” patterns  Errlog : a tool to automate logging if (status != OK) { if (status != OK) { elog(.. ..); Errlog .. .. .. .. .. .. } } Added log statement 9

  10. Outline  Introduction  Characterizing logging efficacy  Errlog design  Evaluation results  Conclusion 10

  11. Goals of our study  Do real-world failures have log messages?  Where to log? 11

  12. Study methodology  Randomly sampled 250 recently reported failures*  Carefully studied the discussion and related code/patch  Study took 4 authors 4 months Software Sampled failures Apache httpd 65 Squid 50 PostgreSQL 45 SVN 45 GNU Coreutils 45 T otal 250 * Data can be found at: http://opera.ucsd.edu/errlog.html 12

  13. How many missed log message?  Only 43% failures have log messages Software Failures with log Apache httpd 37% W/ Log Squid 40% (43%) W/O Log PostgreSQL 53% (57%) SVN 56% Coreutils 33% Overall 43% 13

  14. How many missed log message?  Only 43% failures have log messages  77% have easy-to-log opportunities Software Failures with log Apache httpd 37% W/ Log Squid 40% 77% (43%) W/O Log Easy-to-log PostgreSQL 53% (57%) opportunity SVN 56% Coreutils 33% 7 patterns Overall 43% 14

  15. Pattern I: function return error wrapper function of No log: “open” system call if ((status = fileopen (grpfile, ..)) != SUCCESS) { return DECLINED; } /* Apache httpd misconfiguration. */ Unnecessary user complain and debugging efforts 15

  16. Pattern I: function return error Good practice: Macro to detect all func. return error int main (..) { if (svn_export(..)) #define SVN_ERR(func) .. .. svn_error_t* temp=(func); print } if (temp) message once return temp; svn_err_t* svn_export(..) { Propagate to caller SVN_ERR (svn_versioned(..)); } Create and return an svn_err_t* svn_versioned(..) { error message if (!entry) return error_create( “%s is not under version control” , ..); } /* SVN */ 16

  17. Pattern I: function return error Good practice: Macro to detect all func. return error int main (..) { if (svn_export(..)) #define SVN_ERR(func) .. .. svn_error_t* temp=(func); print } if (temp) message once return temp; svn_err_t* svn_export(..) { Propagate to caller SVN_ERR (svn_versioned(..)); } Create and return an svn_err_t* svn_versioned(..) { error message if (!entry) return error_create( “%s is not under version control” , ..); } /* SVN */ 17

  18. Pattern II: abnormal exit A bug triggered No log: this abort Over 10 discussion messages if (svn_dirent_is_root) btw. user and dev. abort (); + svn_error_raise_on_malfunction(_FILE_, _LINE_); + svn_error_raise_on_malfunction (..) { + err=svn_error_create(“In file ‘%s’ line ‘%d’: internal malfunction”); + svn_handle_error2 (err); print error + abort(); message. + } “I really hate abort() calls! Instead of getting a usable core-dump, I often got nothing. ” /* SVN */ --- developer’s check-in message 18

  19. Generic log-worthy patterns 1. Function return 7. Failed memory error (30%) safety check (3%) 6. Abnormal 2. Switch stmt. exit (4%) fall-through to ‘default’ (14%) 5. Resource leak (4%) 3. Exception signals (13%) 4. Failed input validity check (9%) 19

  20. Generic log-worthy patterns 1. Function return 7. Failed memory error (30%) safety check (3%) 77% 6. Abnormal Exception conditions 2. Switch stmt. exit (4%) fall-through to ‘default’ (14%) 5. Resource leak (4%) 3. Exception signals (13%) 4. Failed input validity check (9%) 20

  21. Log the exception  Classic Fault-Error-Failure model [Laprie.95] Fault Error (exception) Failure Root cause, Start to misbehave, Affect service/result e.g., s/w bug, e.g., system-call Visible to user h/w fault, error return misconfiguration, Log! etc… Relevant to the failures Log? Not too much overhead Fault is hard to find! 21

  22. Why developers missed logging? 250 failures: Give up Failure Detected 113 Log detected Error no log: 11 errors! Don’t be optimistic; 154 (61%) conservatively log! no log: 46 Fault Handle incorrectly 41 Failure no log: 35 Undetected 96 (39%) Error no log: 96 Failure 39 (41%) can be Carefully check detected the error! 22

  23. Outline  Introduction  Characterizing logging efficacy  Errlog design automate such  Evaluation results logging  Conclusion 23

  24. Errlog : a practical logging tool errlog –logfunc=“error” CVS/src Exception identification Modified Source Is it already source code code logged? No Insertion & optimization 24

  25. Exception identification  Mechanically search for generic error conditions  Learn domain-specific error conditions Frequently logged if (status != COMM_OK) if (status != COMM_OK){ goto ERROR; + elog (); .. .. } ERROR: error(“network failure”); 25

  26. Log statement insertion  Check if the exception is already logged  Each log statement contains:  Unique log ID, global counter, call stack, useful variables LogEnhancer [TOCS’12] /* Errlog modified code */ if (status != COMM_OK) { + elog (logID, glob_counter, logEnhancer()); } 26

  27. Adaptive sampling to reduce overhead  Not every identified condition is a true error  E.g., using error return of ‘stat’ to test the existence of file  Adaptive sampling [HauswirthASPLOS’04]  More frequently it occurs, less likely to be a true error  Differentiate run-time log by call stack and errno Dynamic n occurrence 1st 2nd 3rd 4th 5th 6th 7th 8th … 2 th Logged occurrence 27

  28. Evaluation  Applied Errlog on ten software projects Software LOC Software LOC Apache httpd 317K CVS 111K Squid 121K OpenSSH 81K PostgreSQL 1029K Lighttpd 56K SVN 288K gzip 22K Coreutils 69K GNU make 29K Software used in our New software not used in our empirical study empirical study 28

  29. Reducing silent failures  Errlog adds 0.60X extra log printing statements  What is the benefit?  Evaluated on 141 silent failures Subtle exceptions. 35% still 65% have fail silently error msg. with Errlog Failures originally print no logs 29

  30. Comparison with manual logging  16,065 existing log stmt. in ten systems  Many added reactively Objective baseline Average: 85% Average: 83% Used in study New 30

  31. Performance overhead  Why Errlog has overhead?  A few noisy log messages in normal execution  Errlog adds 1.4% overhead Maximum 4.6% <1% <1% <1% <1% <1% <1% 31

Download Presentation
Download Policy: The content available on the website is offered to you 'AS IS' for your personal information and use only. It cannot be commercialized, licensed, or distributed on other websites without prior consent from the author. To download a presentation, simply click this link. If you encounter any difficulties during the download process, it's possible that the publisher has removed the file from their server.

Recommend


More recommend