Production-Run Software Failure Diagnosis via Hardware Performance Counters
Joy Arulraj, Po-Chun Chang, Guoliang Jin and Shan Lu
Motivation
- Software inevitably fails on production machines
- These failures are widespread and expensive
- Internet Explorer zero-day bug [2013]
- Toyota Prius software glitch [2010]
These failures need to be diagnosed before they can be fixed!
Production-run failure diagnosis
- Diagnosing failures on client machines
- Limited info from each client machine
- One bug can affect many clients
- Need to figure out root cause & patch quickly
Executive Summary
Use existing hardware support to diagnose widespread production-run failures with low monitoring overhead
Diagnosing a real-world bug
- Sequential bug in print_tokens
    int is_token_end(char ch) {
        if (ch == '\n')
            return (TRUE);
        else if (ch == ' ')   // Bug: should return FALSE
            return (TRUE);
        else
            return (FALSE);
    }

    Input:           Abc Def
    Expected output: {Abc}, {Def}
    Actual output:   {Abc Def}
Diagnosing concurrency bugs
- Concurrency bug in Apache server
    THREAD 1                                THREAD 2
    decrement_refcnt(...) {                 decrement_refcnt(...) {
        atomic_dec(&obj->refcnt);               atomic_dec(&obj->refcnt);
        /* refcnt: 2 --> 1 */                   /* refcnt: 1 --> 0 */
        if (!obj->refcnt)                       if (!obj->refcnt)
            cleanup(obj);                           cleanup(obj);
    }                                       }

If Thread 2's decrement (1 --> 0) falls between Thread 1's decrement (2 --> 1) and Thread 1's zero check, both threads see refcnt == 0 and both call cleanup(obj).
Requirements for failure diagnosis
- Performance
- Low runtime overhead for monitoring apps
- Suitable for production-run deployment
- Diagnostic Capability
- Ability to accurately explain failures
- Diagnose wide variety of bugs
Existing work
Approach       | Performance                                             | Diagnostic capability
FAILURE REPLAY | High runtime overhead OR non-existent hardware support  | Root cause must be located manually
BUG DETECTION  |                                                         | Many false positives
Cooperative Bug Isolation
- Cooperatively diagnose production-run failures
- Targets widely deployed software
- Each client machine sends back information
- Uses sampling
- Collects only a subset of information
- Reduces monitoring overhead
- Fits well with cooperative debugging approach
Cooperative Bug Isolation
Approach  | Performance                          | Diagnostic capability
CBI / CCI | >100% overhead for many apps (CCI)   | Accurate & automatic

CBI/CCI pipeline: a compiler instruments the program source with predicates (code size increases by more than 10X); sampling collects a subset of predicate evaluations; statistical debugging then identifies failure predictors, i.e. predicates that are TRUE in most FAILURE runs and FALSE in most SUCCESS runs.
Performance-counter based Bug Isolation
- Requires only hardware support that already exists
- Requires no software instrumentation
PBI pipeline: hardware performance counters serve as the predicates, evaluated directly on the program binary (code size unchanged); sampling is done in hardware; statistical debugging identifies failure predictors as before.
PBI Contributions
- Suitable for production-run deployment
- Can diagnose a wide variety of failures
- Design addresses privacy concerns
Approach | Performance                           | Diagnostic capability
PBI      | <2% overhead for most apps evaluated  | Accurate & automatic
Outline
- Motivation
- Overview
- PBI
- Hardware performance counters
- Predicate design
- Sampling design
- Evaluation
- Conclusion
Hardware Performance Counters
- Registers monitor hardware performance events
- 1-8 registers per core
- Each register can contain an event count
- Large collection of hardware events
- Instructions retired, L1 cache misses, etc.
Accessing performance counters
INTERRUPT-BASED: the user configures the event through the kernel, which programs the PMU hardware; on counter overflow, the PMU raises an interrupt and the kernel records which instruction was executing.

POLLING-BASED: the user configures the PMU and reads the current count directly via special instructions.

How do we monitor which event occurs at which instruction using performance counters?
Predicate evaluation schemes
INTERRUPT-BASED
- Kernel configures the counter; a counter overflow raises an interrupt
- Interrupt at instruction C => event occurred at C
- Natural fit for sampling, but imprecise: out-of-order execution can make the interrupt skid past the instruction that caused the event

POLLING-BASED
    old = readCounter()
    <instruction C>
    new = readCounter()
    if (new > old) event occurred at C
- More precise, but requires instrumentation around the monitored instruction
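As a concrete illustration of the polling scheme, here is a minimal sketch using Linux's perf_event_open syscall (not from the slides; the L1D load-miss event choice is illustrative, and the count is read through the kernel rather than with a raw RDPMC instruction):

```c
/* Minimal sketch of the polling-based scheme (illustrative, not PBI itself):
 * read the counter before and after the monitored instruction and compare. */
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HW_CACHE;          /* L1D load misses, as an example */
    attr.config = PERF_COUNT_HW_CACHE_L1D |
                  (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                  (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* monitor this thread */
    if (fd < 0) { perror("perf_event_open"); return 1; }
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    long long old_count = 0, new_count = 0;
    read(fd, &old_count, sizeof(old_count));   /* old = readCounter() */

    /* ... <instruction C being monitored would go here> ... */

    read(fd, &new_count, sizeof(new_count));   /* new = readCounter() */
    if (new_count > old_count)
        printf("event occurred at C\n");
    close(fd);
    return 0;
}
```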
Concurrency bug failures
How do we use performance counters to diagnose concurrency bug failures?

- Use L1 data-cache cache-coherence events
- Each L1 cache line is in one of the MESI states: Modified, Exclusive, Shared, or Invalid
- Local reads/writes and remote reads/writes trigger transitions between these states, so the state a line is in when an instruction touches it reflects the recent thread interleaving
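To make this concrete, here is a much-simplified single-line MESI model (my sketch, not from the slides; real protocols have more transitions, and a local read from Invalid may land in Shared rather than Exclusive if another cache holds the line):

```c
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } MesiState;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } Access;

/* Simplified MESI transitions for one cache line, from the local core's view. */
static MesiState step(MesiState s, Access a) {
    switch (a) {
    case LOCAL_WRITE:  return MODIFIED;                         /* we own a dirty copy       */
    case LOCAL_READ:   return s == INVALID ? EXCLUSIVE : s;     /* fetch; assume no sharers  */
    case REMOTE_READ:  return s == INVALID ? INVALID : SHARED;  /* downgrade our copy        */
    case REMOTE_WRITE: return INVALID;                          /* our copy is now stale     */
    }
    return s;
}

int main(void) {
    /* Replay the Apache interleaving: Thread 1's decrement (local write),
     * then Thread 2's decrement (remote write) before the check at C. */
    MesiState s = INVALID;
    s = step(s, LOCAL_WRITE);   /* -> MODIFIED */
    s = step(s, REMOTE_WRITE);  /* -> INVALID: the state the check at C observes */
    printf("state at C: %s\n", s == INVALID ? "INVALID" : "not INVALID");
    return 0;
}
```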
Atomicity Violation Example

CORE 1 – THREAD 1:

    decrement_refcnt(...) {
        apr_atomic_dec(&obj->refcnt);
    C:  if (!obj->refcnt)
            cleanup_cache(obj);
    }

Thread 1's decrement is a local write, so the cache line holding obj->refcnt moves to the Modified state on Core 1.

CORE 2 – THREAD 2 runs the same decrement_refcnt code. If its decrement executes between Thread 1's decrement and the check at C, that remote write invalidates Core 1's copy of the line: the check at C observes the Invalid state. Invalid at C thus distinguishes failure runs from success runs and serves as the failure predictor.
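One common repair, sketched below with C11 atomics (an illustrative assumption; the actual Apache patch may differ), is to fuse the decrement and the zero test into a single atomic step so no remote write can fall between them:

```c
#include <stdatomic.h>

struct cache_object { atomic_uint refcnt; /* ... */ };
void cleanup_cache(struct cache_object *obj);  /* as in the example above */

void decrement_refcnt(struct cache_object *obj) {
    /* atomic_fetch_sub returns the value *before* the decrement, so exactly
     * one thread observes 1 and performs the cleanup. */
    if (atomic_fetch_sub(&obj->refcnt, 1) == 1)
        cleanup_cache(obj);
}
```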
Atomicity Violation Bugs
Thread interleaving | Failure predictor
WWR interleaving    | INVALID
RWR interleaving    | INVALID
RWW interleaving    | INVALID
WRW interleaving    | SHARED
Order violation

    CORE 1 – MASTER THD                 CORE 2 – SLAVE THD
        print("End", Gend);
    C:  print("Run", Gend - init);          Gend = time();

In a correct run, the slave's Gend = time() is a remote write that precedes the master's local read at C; the read fetches the freshly written line, which ends up in the Shared state.

In a failure run, the master reads Gend before the slave has written it. With only local reads on Core 1, the line at C is in the Exclusive state, so Exclusive at C is the failure predictor.
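A hedged sketch of how the intended order could be enforced (my illustration, not necessarily the application's actual fix): the master blocks on a condition variable until the slave has published Gend.

```c
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gend_set = PTHREAD_COND_INITIALIZER;
static int    gend_ready = 0;
static time_t Gend;

/* Slave thread: publish Gend, then signal the master. */
static void *slave(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    Gend = time(NULL);
    gend_ready = 1;
    pthread_cond_signal(&gend_set);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Master thread: wait until Gend is set before printing it. */
static void master(time_t init) {
    pthread_mutex_lock(&lock);
    while (!gend_ready)
        pthread_cond_wait(&gend_set, &lock);
    pthread_mutex_unlock(&lock);
    printf("End %ld\n", (long)Gend);
    printf("Run %ld\n", (long)(Gend - init));  /* the read at C, now safe */
}

int main(void) {
    time_t init = time(NULL);
    pthread_t t;
    pthread_create(&t, NULL, slave, NULL);
    master(init);
    pthread_join(t, NULL);
    return 0;
}
```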
PBI Predicate Sampling
- We use Perf (provided by Linux kernel 2.6.31+)
    perf record --event=<code> -c <sampling_rate> <program monitored>

Log Id | App    | Core | Performance event | Instruction | Function
1      | Apache | 2    | 0x140 (Invalid)   | 401c3b      | decrement_refcnt
PBI vs. CBI/CCI (Qualitative)
- Performance
- Diagnostic capability
- Discontinuous monitoring (CBI/CCI): the instrumented code must repeatedly decide "sample in this region?" (CBI) and "are other threads sampling?" (CCI)
- Continuous monitoring (PBI): the counters are always on, and sampling decisions happen in hardware
Outline
- Motivation
- Overview
- PBI
- Hardware performance counters
- Predicate design
- Sampling design
- Evaluation
- Conclusion
Methodology
- 23 real-world failures
- In open-source server, client, utility programs
- All CCI benchmarks evaluated for comparison
- Each app executed for 1000 runs (400-600 of them failure runs)
- Success inputs from standard test suites
- Failure inputs from bug reports
- Emulate production-run scenarios
- Same sampling settings for all apps
Evaluation: Diagnostic Capability

Program     | PBI failure predictor
Apache1     | Invalid
Apache2     | Invalid
Cherokee    | Invalid
FFT         | Exclusive
LU          | Exclusive
Mozilla-JS1 | Invalid
Mozilla-JS2 | Invalid
Mozilla-JS3 | Invalid
MySQL1      | Invalid
MySQL2      | Shared
PBZIP2      | Invalid

PBI correctly diagnoses every failure in this set; CCI-P and CCI-H fail to diagnose several of them (FFT and LU among them).
Diagnostic Overhead

Program     | PBI   | CCI-P | CCI-H
Apache1     | 0.40% | 1.90% | 1.20%
Apache2     | 0.40% | 0.40% | 0.10%
Cherokee    | 0.50% | 0.00% | 0.00%
FFT         | 1.00% | 121%  | 118%
LU          | 0.80% | 285%  | 119%
Mozilla-JS1 | 1.50% | 800%  | 418%
Mozilla-JS2 | 1.20% | 432%  | 229%
Mozilla-JS3 | 0.60% | 969%  | 837%
MySQL1      | 3.80% |       |
MySQL2      | 1.20% |       |
PBZIP2      | 8.40% | 1.40% | 3.00%
Conclusion
- Low monitoring overhead
- Good diagnostic capability
- No changes in apps
- Novel use of performance counters
PBI will help developers diagnose production-run software failures with low overhead. Thanks!