Failure Sketching: A Technique for Automated Root Cause Diagnosis - - PowerPoint PPT Presentation

failure sketching a technique for automated root cause
SMART_READER_LITE
LIVE PREVIEW

Failure Sketching: A Technique for Automated Root Cause Diagnosis - - PowerPoint PPT Presentation

Failure Sketching: A Technique for Automated Root Cause Diagnosis of In-Production Failures Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, George Candea Debugging In-Production Software Failures Today 2 Debugging


slide-1
SLIDE 1

Failure Sketching: A Technique for Automated Root Cause Diagnosis

  • f In-Production Failures

Baris Kasikci, Benjamin Schubert, Cristiano Pereira, Gilles Pokam, George Candea

slide-2
SLIDE 2

Debugging In-Production Software Failures Today

2

slide-3
SLIDE 3

Debugging In-Production Software Failures Today

2

slide-4
SLIDE 4

Debugging In-Production Software Failures Today

2

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

slide-5
SLIDE 5

Debugging In-Production Software Failures Today

2

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

Understand root cause

slide-6
SLIDE 6

Debugging In-Production Software Failures Today

2

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

Understand root cause Reproduce the failure

slide-7
SLIDE 7

Debugging In-Production Software Failures Today

2

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

Understand root cause Reproduce the failure

slide-8
SLIDE 8

Related Work

3

  • Collaborative approaches
  • WER [SOSP’09], CBI [PLDI’05], CCI [OOPSLA’10]
  • Identifying differences of failing and successful runs
  • Delta debugging [TSE’02], Symbiosis [PLDI’15]
  • Record & replay, checkpointing
  • ODR [SOSP’09], Triage [SOSP’07]
  • Hardware support
  • PBI [ASPLOS’13], LBRA/LCRA [ASPLOS’14]
slide-9
SLIDE 9

Related Work

3

  • Collaborative approaches
  • WER [SOSP’09], CBI [PLDI’05], CCI [OOPSLA’10]
  • Identifying differences of failing and successful runs
  • Delta debugging [TSE’02], Symbiosis [PLDI’15]
  • Record & replay, checkpointing
  • ODR [SOSP’09], Triage [SOSP’07]
  • Hardware support
  • PBI [ASPLOS’13], LBRA/LCRA [ASPLOS’14]
slide-10
SLIDE 10

Contributions

4

slide-11
SLIDE 11

Contributions

4

Goal: automate the manual detective work of debugging


slide-12
SLIDE 12

Contributions

4

Goal: automate the manual detective work of debugging


Failure sketching

Complements in-house static analysis with in-production dynamic analysis Automatically and efficiently builds accurate failure sketches that show root causes of failures

slide-13
SLIDE 13

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... }

Failure Sketch

5

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Thread 2 Segfault Thread 1 cons(queue* f) { ... mutex_unlock(f->mut); }

slide-14
SLIDE 14

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... }

Failure Sketch

5

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Thread 2 Segfault Thread 1 cons(queue* f) { ... mutex_unlock(f->mut); }

slide-15
SLIDE 15

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... }

Failure Sketch

5

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Thread 2 Segfault Thread 1 cons(queue* f) { ... mutex_unlock(f->mut); }

slide-16
SLIDE 16

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... }

Failure Sketch

5

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Thread 2 Segfault Thread 1 cons(queue* f) { ... mutex_unlock(f->mut); }

slide-17
SLIDE 17

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... }

Failure Sketch

5

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Thread 2 Segfault Thread 1 cons(queue* f) { ... mutex_unlock(f->mut); }

slide-18
SLIDE 18

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... }

Failure Sketch

5

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Thread 2

Root cause

Segfault Thread 1 cons(queue* f) { ... mutex_unlock(f->mut); }

slide-19
SLIDE 19

Failure Sketch Usage Model

6

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

Understand root cause Reproduce the failure

slide-20
SLIDE 20

Failure Sketch Usage Model

6

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

Understand root cause Reproduce the failure

slide-21
SLIDE 21

Failure Sketch Usage Model

6

#0 0x00007f51abae820b in raise (sig=11) at ../nptl/ sysdeps/unix/sysv/linux/pt-raise.c:37 #1 0x000000000042d289 in ap_buffered_log_writer (r=0x7f51a40053d0, handle=0x20eeba0, strs=0x7f51a4003578, strl=0x7f51a40035e8, nelts=14, len=82) at mod_log_config.c:1368 #2 0x000000000042b10d in config_log_transaction (r=0x7f51a40053d0, cls=0x20b9d50, default_format=0x20ee370) at mod_log_config.c:930 #3 0x000000000042aad6 in multi_log_transaction (r=0x7f51a40053d0) at mod_log_config.c:950 #4 0x000000000046cb2d in ap_run_log_transaction (r=0x7f51a40053d0) at protocol.c:1563 #5 0x0000000000436e81 in ap_process_request (r=0x7f51a40053d0) at http_request.c:312 #6 0x000000000042e9da in ap_process_http_connection (c=0x7f519c000b68) at http_core.c:293 #7 0x0000000000465cdd in ap_run_process_connection (c=0x7f519c000b68) at connection.c:85 #8 0x00000000004661f5 in ap_process_connection (c=0x7f519c000b68, csd=0x7f519c000a20) at connection.c:211 #9 0x0000000000451ba0 in process_socket (p=0x7f519c0009b8, sock=0x7f519c000a20, my_child_num=0, my_thread_num=0, bucket_alloc=0x7f51a4001348) at worker.c:632 #10 0x0000000000451221 in worker_thread (thd=0x210fa90, dummy=0x7f51a40008c0) at worker.c:946 #11 0x00007f51ac87c555 in dummy_worker (opaque=0x210fa90) at thread.c:127 #12 0x00007f51abae0182 in start_thread (arg=0x7f51aa8ef700) at pthread_create.c:312 #13 0x00007f51ab80d47d in clone () at ../sysdeps/ unix/sysv/linux/x86_64/clone.S:111

Understand root cause Runtime traces Reproduce the failure

slide-22
SLIDE 22

Failure Sketch Usage Model

6

Understand root cause

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Runtime traces Reproduce the failure

slide-23
SLIDE 23

Failure Sketch Usage Model

6

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Runtime traces

slide-24
SLIDE 24

Failure Sketch Usage Model

6

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Runtime traces

slide-25
SLIDE 25

Outline

  • Challenges
  • Design
  • Evaluation

7

slide-26
SLIDE 26

Outline

  • Challenges
  • Design
  • Evaluation

7

slide-27
SLIDE 27

Challenges of Building Failure Sketches

  • Accuracy
  • Exclude all irrelevant information, preserve all relevant one
  • Recurrence
  • Gathering enough execution information from rare failures
  • Latency
  • Achieve high accuracy after just a few recurrences

8

slide-28
SLIDE 28

Outline

  • Challenges
  • Design

9

  • Evaluation
slide-29
SLIDE 29

Outline

  • Challenges
  • Design

9

  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-30
SLIDE 30

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Gist System Architecture

10

Runtime traces

slide-31
SLIDE 31

Client Server

Gist System Architecture

10

slide-32
SLIDE 32

Client Server

Gist System Architecture

10

Program P (source) Failure report (core dump, stack trace, etc)

1

slide-33
SLIDE 33

Client Server

Gist System Architecture

10

Program P (source) Failure report (core dump, stack trace, etc)

1

Static Analyzer

  • queue* f = init(size);
  • init_vars();
  • free(f->mut);
  • print(“Done”);
  • f->mut = NULL;

Static slice

slide-34
SLIDE 34

Client Server

Gist System Architecture

10

Program P (source) Failure report (core dump, stack trace, etc)

1

Static Analyzer

2

Instrumentation Tracking control and data flow

  • queue* f = init(size);
  • init_vars();
  • free(f->mut);
  • print(“Done”);
  • f->mut = NULL;

Static slice

slide-35
SLIDE 35

Client Server

Gist System Architecture

10

Program P (source) Failure report (core dump, stack trace, etc)

1

Static Analyzer

2

Instrumentation

3

Refinement with runtime traces Tracking control and data flow

  • queue* f = init(size);
  • init_vars();
  • free(f->mut);
  • print(“Done”);
  • f->mut = NULL;

Static slice

slide-36
SLIDE 36

Client Server

Gist System Architecture

10

Program P (source) Failure report (core dump, stack trace, etc)

1

Static Analyzer

2

Instrumentation

3

Refinement with runtime traces Tracking control and data flow

  • queue* f = init(size);
  • init_vars();
  • free(f->mut);
  • print(“Done”);
  • f->mut = NULL;

Static slice Refined static slice

slide-37
SLIDE 37

Client Server

Gist System Architecture

10

Program P (source) Failure report (core dump, stack trace, etc)

1

Static Analyzer

2

Instrumentation

3

Refinement with runtime traces Failure Sketch Computation Engine

4

Tracking control and data flow

  • queue* f = init(size);
  • init_vars();
  • free(f->mut);
  • print(“Done”);
  • f->mut = NULL;

Static slice

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); f->mut = NULL; ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Failure Sketch Refined static slice

slide-38
SLIDE 38

Outline

11

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-39
SLIDE 39

Outline

11

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-40
SLIDE 40

12

Static Analysis to Reduce the Overhead

  • Computes backward slices
  • Includes statements with dependencies to the failure
  • Excludes all other statements
  • Inter-procedural
  • Identify dependencies across functions
slide-41
SLIDE 41

12

Static Analysis to Reduce the Overhead

  • Computes backward slices
  • Includes statements with dependencies to the failure
  • Excludes all other statements
  • Inter-procedural
  • Identify dependencies across functions

Static analysis reduces subsequent runtime tracking (20x)

slide-42
SLIDE 42

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

13

Example

slide-43
SLIDE 43

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

13

Example

slide-44
SLIDE 44

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

13

Example

slide-45
SLIDE 45

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

13

Example

slide-46
SLIDE 46

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

13

Example

slide-47
SLIDE 47

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

13

Example

Segfault

slide-48
SLIDE 48

14

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Example

Segfault

slide-49
SLIDE 49

15

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Example: Static Backward Slicing

Segfault

slide-50
SLIDE 50

Outline

16

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-51
SLIDE 51

Outline

16

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-52
SLIDE 52

Low-Overhead Control Flow Tracking

  • Software-based tracking is expensive (up to 15x)
  • Hardware-based tracking is more efficient
  • Intel PT: new feature in Intel CPUs (~40%)
  • Gist combines static analysis and hardware-based

control flow tracking

  • Low overhead (~2%)

17

slide-53
SLIDE 53

Low-Overhead Control Flow Tracking

  • Software-based tracking is expensive (up to 15x)
  • Hardware-based tracking is more efficient
  • Intel PT: new feature in Intel CPUs (~40%)
  • Gist combines static analysis and hardware-based

control flow tracking

  • Low overhead (~2%)

17

Static analysis → Low-overhead control flow tracking

slide-54
SLIDE 54

Example: Control Flow Tracking (Step 1)

18

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Segfault

slide-55
SLIDE 55

19

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Example: Control Flow Tracking (Step 2)

Segfault

slide-56
SLIDE 56

19

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Example: Control Flow Tracking (Step 2)

Static analysis + control flow tracking shorten the sketch

Segfault

slide-57
SLIDE 57

20

Data Flow Tracking to Increase Accuracy

slide-58
SLIDE 58
  • Data flow information
  • Variable values & total order of memory accesses

20

Data Flow Tracking to Increase Accuracy

slide-59
SLIDE 59
  • Data flow information
  • Variable values & total order of memory accesses
  • Hardware watchpoints
  • Allow tracking reads and writes with low overhead
  • Allow tracking the total order of accesses

20

Data Flow Tracking to Increase Accuracy

slide-60
SLIDE 60
  • Data flow information
  • Variable values & total order of memory accesses
  • Hardware watchpoints
  • Allow tracking reads and writes with low overhead
  • Allow tracking the total order of accesses
  • Monitor multiple clients when run out of watchpoints

20

Data Flow Tracking to Increase Accuracy

slide-61
SLIDE 61
  • Data flow information
  • Variable values & total order of memory accesses
  • Hardware watchpoints
  • Allow tracking reads and writes with low overhead
  • Allow tracking the total order of accesses
  • Monitor multiple clients when run out of watchpoints

20

Data Flow Tracking to Increase Accuracy

Precise ordering information → High accuracy

slide-62
SLIDE 62

21

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Segfault

Example: Data Flow Tracking

slide-63
SLIDE 63

21

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Segfault

Watch &s

Example: Data Flow Tracking

slide-64
SLIDE 64

22

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Example: Data Flow Tracking

Segfault

Watch &s

slide-65
SLIDE 65

22

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

Example: Data Flow Tracking

Thread 1 Thread 2

slide-66
SLIDE 66

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

22

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

Example: Data Flow Tracking

Thread 1 Thread 2

slide-67
SLIDE 67

Outline

23

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-68
SLIDE 68

Outline

23

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-69
SLIDE 69

24

Statistical Analysis

slide-70
SLIDE 70
  • Identification of failure predictors1
  • A good predictor portends a failure with high probability

(e.g., data races, atomicity violations)

24

Statistical Analysis

1 Liblit, B. et al. Scalable sta0s0cal bug isola0on. PLDI 2005

slide-71
SLIDE 71
  • Identification of failure predictors1
  • A good predictor portends a failure with high probability

(e.g., data races, atomicity violations)

  • Example: data races

24

Statistical Analysis

1 Liblit, B. et al. Scalable sta0s0cal bug isola0on. PLDI 2005

write x write x write x read x read x write x

slide-72
SLIDE 72
  • Identification of failure predictors1
  • A good predictor portends a failure with high probability

(e.g., data races, atomicity violations)

  • Example: data races

24

Statistical Analysis

1 Liblit, B. et al. Scalable sta0s0cal bug isola0on. PLDI 2005

write x write x write x read x read x write x

Failure predictors across multiple executions

slide-73
SLIDE 73

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

25 25

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Example: Statistical Analysis

slide-74
SLIDE 74

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

25 25

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

Example: Statistical Analysis

slide-75
SLIDE 75

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

25 25

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

Example: Statistical Analysis

slide-76
SLIDE 76

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

25 25

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

Example: Statistical Analysis

write read

slide-77
SLIDE 77

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

25 25

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; }

Success

void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); } void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Failure

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

void cleanup(State* s) { log(“Func:cleanup”); if(verbose) log(“Cleaning up %p”, s); delete s; } void display_size(State* s) { log(“Func:display_size”); log(“State: %u”, s->size); }

Success

Static analysis + cooperative dynamic analysis

Example: Statistical Analysis

write read

slide-78
SLIDE 78

Outline

26

  • Challenges
  • Design
  • Evaluation

Static Analysis Control and Data Flow Tracking Statistical Analysis Failure sketch

slide-79
SLIDE 79

Outline

26

  • Challenges
  • Design
  • Evaluation
  • Does Gist help developers do root cause diagnosis?
  • Is Gist efficient?
  • Is Gist accurate?
slide-80
SLIDE 80

27

  • Client side executions are analyzed in the lab
  • Real world server and desktop programs

memcached

Experimental Setup

slide-81
SLIDE 81

Do Failure Sketches Help Developers?

  • We manually analyzed the usefulness of Gist for 11

failures

  • Gist-identified failure predictors point to root causes
  • Developers eliminated those root causes to fix the bugs
  • Average number of statements to look at: 7

28

slide-82
SLIDE 82

Do Failure Sketches Help Developers?

  • We manually analyzed the usefulness of Gist for 11

failures

  • Gist-identified failure predictors point to root causes
  • Developers eliminated those root causes to fix the bugs
  • Average number of statements to look at: 7

Gist points developers to the root causes of failures

28

slide-83
SLIDE 83

1 2 3 4 5 5 10 15 20 25

# of statements tracked Performance

  • verhead [%]

Efficiency

29

(Control & data flow tracking)

slide-84
SLIDE 84

1 2 3 4 5 5 10 15 20 25

# of statements tracked Performance

  • verhead [%]

Gist has low average overhead (always below 5%)

Efficiency

29

(Control & data flow tracking)

slide-85
SLIDE 85

30

Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Accuracy

slide-86
SLIDE 86

30

0.0 25.0 50.0 75.0 100.0

Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Accuracy

slide-87
SLIDE 87

30

0.0 25.0 50.0 75.0 100.0

Static slicing Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Accuracy

slide-88
SLIDE 88

30

0.0 25.0 50.0 75.0 100.0

Control flow tracking Static slicing Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Accuracy

slide-89
SLIDE 89

30

0.0 25.0 50.0 75.0 100.0

Control flow tracking Data flow tracking Static slicing Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Accuracy

slide-90
SLIDE 90

30

0.0 25.0 50.0 75.0 100.0

Control flow tracking Data flow tracking Static slicing Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Accuracy

Average accuracy is 96%

slide-91
SLIDE 91

30

0.0 25.0 50.0 75.0 100.0

Control flow tracking Data flow tracking Static slicing Accuracy [%] Apache-1 Apache-2 Apache-3 Apache-4 Cppcheck-1 Cppcheck-2 Curl Transmission SQLite memcached Pbzip2

Each technique is needed for accuracy

Accuracy

Average accuracy is 96%

slide-92
SLIDE 92
  • Failure sketching
  • Combination of static and dynamic program

analysis

  • Failure sketches are summaries explaining failure

root causes

  • Accurate, efficient, improves developer productivity

31

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); f->mut = NULL; ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Conclusion

slide-93
SLIDE 93
  • Failure sketching
  • Combination of static and dynamic program

analysis

  • Failure sketches are summaries explaining failure

root causes

  • Accurate, efficient, improves developer productivity

31

main() { queue* f = init(size); create_thread(cons, f); ... free(f->mut); f->mut = NULL; ... } cons(queue* f) { ... mutex_unlock(f->mut); } 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 Time Failure: segmentation fault Thread 1 Thread 2

Conclusion

Cristiano Baris Ben Gilles George

http://dslab.epfl.ch/proj/gist