Automated Repair of Concurrency Bugs: Ben Liblit with Guoliang Jin and Shan Lu (slide presentation)



slide-1
SLIDE 1

Automated Repair of Concurrency Bugs

Ben Liblit with Guoliang Jin and Shan Lu

slide-2
SLIDE 2


We need reliable software

  • People’s daily lives now depend on reliable software
  • Software companies spend substantial resources on debugging
  • More than 50% of effort goes to finding and fixing bugs
  • Around $300 billion per year
slide-3
SLIDE 3

Concurrency bugs hurt

  • It is an increasingly parallel world
  • Concurrency bugs in history


slide-4
SLIDE 4

Multi-threaded program

  • Concurrent programs under the shared-memory model
  • Programs execute multiple interacting threads in parallel
  • Threads communicate via shared memory
  • Shared-memory accesses should be well-synchronized

[Figure: a multicore chip with four cores (core1 to core4), each with its own cache and running one thread (thread1 to thread4), all connected to shared memory]

slide-5
SLIDE 5

Huge interleaving space

An example concurrency bug:

Thread 1:
    if (ptr != NULL) {
        ptr->field = 1;
    }

Thread 2:
    ptr = NULL;

  • The interleaving space contains bad interleavings
  • Bad interleaving: Thread 2 sets ptr = NULL after Thread 1’s check but before its write, and Thread 1 fails with a segmentation fault
  • Previous research focuses on finding bad interleavings
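The example can be made concrete by enumerating where Thread 2's single statement may fall relative to Thread 1's two steps. The sketch below is a single-threaded simulation, not real concurrency; the names `interleaving_fails` and `count_bad_interleavings` are hypothetical and only illustrate that one of the three interleavings is bad.

```c
#include <assert.h>
#include <stddef.h>

struct obj { int field; };

/* Simulate one interleaving of the slide's example.
 * t2_pos says where Thread 2's "ptr = NULL" runs:
 *   0 = before Thread 1's check, 1 = between check and write, 2 = after the write.
 * Returns 1 if Thread 1 would dereference NULL (the segmentation fault). */
int interleaving_fails(int t2_pos)
{
    struct obj o = { 0 };
    struct obj *ptr = &o;
    int passed_check;

    assert(t2_pos >= 0 && t2_pos <= 2);

    if (t2_pos == 0) ptr = NULL;        /* Thread 2 runs first */
    passed_check = (ptr != NULL);       /* Thread 1: if (ptr != NULL) */
    if (t2_pos == 1) ptr = NULL;        /* Thread 2 runs in the middle */
    if (passed_check) {
        if (ptr == NULL) return 1;      /* ptr->field = 1 would crash here */
        ptr->field = 1;                 /* Thread 1: the guarded write */
    }
    if (t2_pos == 2) ptr = NULL;        /* Thread 2 runs last */
    return 0;
}

/* Count the failing interleavings among the three possible ones. */
int count_bad_interleavings(void)
{
    int bad = 0;
    for (int pos = 0; pos < 3; pos++)
        bad += interleaving_fails(pos);
    return bad;
}
```

Only the middle position fails, which is why disabling that one interleaving is enough to fix this bug.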

slide-6
SLIDE 6

Bug fixing

  • Software quality does not improve until bugs are fixed
  • Manual concurrency-bug fixing is
  • time-consuming: 73 days on average
  • error-prone: 39% of patches are buggy in their first release
  • CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12]
  • A program behaves correctly if bad interleavings do not occur
  • Fix concurrency bugs by disabling bad interleavings

*SIGPLAN: “one of the first papers to attack the problem of automated bug fixing”

slide-7
SLIDE 7

Automated fixing is difficult

  • What is the correct behavior?
  • Usually requires developers’ knowledge
  • How to get the correct behavior?
  • Correct program states under bug-triggering inputs
  • No change to program states under other inputs

[Figure: a bug description (symptom, triggering condition, …) leads via "?" to a patch judged on correctness, performance, and simplicity]

slide-8
SLIDE 8

Automated concurrency-bug fixing?

  • What is the correct behavior?
  • The program state is already correct as long as the buggy interleaving does not occur
  • How to get the correct behavior?
  • Only need to disable failure-inducing interleavings
  • Can leverage well-defined synchronization operations

[Figure: a bug description (symptom, triggering condition, …) leads via "?" to a patch judged on correctness, performance, and simplicity]

slide-9
SLIDE 9

Description: interleavings that lead to software failure

  • Atomicity violation detectors: Park ASPLOS’09, Flanagan POPL’04, Lu ASPLOS’06, Chew EuroSys’10
  • Order violation detectors: Zhang ASPLOS’10, Lucia MICRO’09, Yu ISCA’09, Gao ASPLOS’11
  • Data race detectors: Sen PLDI’08, Savage TOCS’97, Yu SOSP’05, Erickson OSDI’10, Kasikci ASPLOS’10
  • Abnormal data flow detectors: Zhang ASPLOS’11, Shi OOPSLA’10

[Figure: the bug-report patterns produced by the detectors, labeled (p, r, c), (A, B), (Wb, R, Wg), and (I1, I2)]

Patch: correctness, performance, simplicity

How to get a general solution that generates good patches?

slide-10
SLIDE 10

CFix

  • Input: source code and bug reports (description: interleavings that lead to software failure)
  • Output: a final patched binary (patch: correctness, performance, simplicity)

[Figure: the CFix pipeline, with stages Fix-Strategy Design, Synchronization Enforcement, Patch Testing & Selection, Patch Merging, and Run-time Support; mutual exclusion and order strategies yield patched binaries, which are selected and merged into a final patched binary]

slide-11
SLIDE 11

Contributions

  • Show the feasibility of automated fixing for non-deadlock concurrency bugs
  • Techniques that enforce mutual exclusion and order relationships
  • A framework that assembles a set of techniques to automate the whole bug-fixing process: CFix

[Figure: CFix components: Run-time Support, Fix-Strategy Design, Synchronization Enforcement, Patch Merging, Patch Testing & Selection]

slide-12
SLIDE 12

CFix: fix-strategy design

Challenges:

  • Huge variety of bugs


slide-13
SLIDE 13

Two types of synchronization relationships

  • Mutual exclusion
  • Order relationship
  • Why these two?
  • Basic relationships can be achieved by typical synchronizations
  • Based on a study of real-world concurrency bug characteristics

slide-14
SLIDE 14

Fix strategy for atomicity-violation detectors: example 1

Thread 1:
    if (ptr != NULL) {
        ptr->field = 1;
    }

Thread 2:
    ptr = NULL;

slide-15
SLIDE 15

Fix strategy for atomicity-violation detectors: example 2

Thread 1:
    ptr->field = 1;
    ptr->field = 1;

Thread 2:
    ptr = NULL;

slide-16
SLIDE 16

CFix: fix-strategy design

Challenges:

  • Inaccurate root cause
  • Huge variety of bugs

Solution:

  • A combination of mutual exclusion and order relationship enforcement

slide-17
SLIDE 17

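Fix strategies

[Figure: fix strategies per detector type. The AV detector reports (p, c, r); the OV detector reports (A, B); the race detector reports (I1, I2); the DU detector reports (Wb, R, Wg)]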

slide-18
SLIDE 18

CFix: synchronization enforcement

Challenges:

  • Correctness, performance, and simplicity

Solution:

  • Mutual exclusion enforcement: AFix [PLDI’11]
  • Order relationship enforcement: OFix [OSDI’12]

slide-19
SLIDE 19

Mutual exclusion relationship

  • Input: three statements (p, c, r) with contexts
  • Goal: make the code region from p to c mutually exclusive with r

Thread 1:
    if (ptr != NULL) {      /* p */
        ptr->field = 1;     /* c */
    }

Thread 2:
    ptr = NULL;             /* r */

slide-20
SLIDE 20

Mutual exclusion enforcement: AFix

  • Approach: lock
  • Principles:
  • Correctly paired lock acquisition and release operations
  • Small critical section

[Figure: the region from p to c and the statement r, each placed in a critical section]
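A minimal sketch of the lock-based approach on the running `ptr` example, assuming POSIX threads. The lock name `L` and the helper `run_patched_example` are hypothetical; this illustrates the idea of making the p-to-c region mutually exclusive with r, and is not the patch that AFix itself emits.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct obj { int field; };

static struct obj shared_obj;
static struct obj *ptr;
static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;  /* hypothetical patch lock */

/* Thread 1: the region from p (the check) to c (the write) is one critical section. */
static void *thread1(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&L);
    if (ptr != NULL) {          /* p */
        ptr->field = 1;         /* c */
    }
    pthread_mutex_unlock(&L);   /* released on every path out of the region */
    return NULL;
}

/* Thread 2: r is protected by the same lock. */
static void *thread2(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&L);
    ptr = NULL;                 /* r */
    pthread_mutex_unlock(&L);
    return NULL;
}

/* Run both threads once; returns shared_obj.field, or -1 if a thread
 * could not be created. */
int run_patched_example(void)
{
    pthread_t t1, t2;

    shared_obj.field = 0;
    ptr = &shared_obj;
    if (pthread_create(&t1, NULL, thread1, NULL) != 0)
        return -1;
    if (pthread_create(&t2, NULL, thread2, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* Both orders are still allowed; r can no longer fall between p and c. */
    assert(ptr == NULL);
    return shared_obj.field;    /* 1 if Thread 1 ran first, 0 otherwise */
}
```

The critical section covers only the check and the write, in line with the "small critical section" principle. Build with `-pthread` on toolchains that need it.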

slide-21
SLIDE 21

Put p and c into a critical section: naïve

  • A naïve solution
  • Add lock on edges reaching p
  • Add unlock on edges leaving c
  • Potential new bugs
  • Could lock without unlock
  • Could unlock without lock
  • etc.

[Figure: control-flow graphs containing p and c, showing paths where the naïve placement goes wrong]

slide-22
SLIDE 22

Put p and c into a critical section: AFix

  • Assume p and c are in the same function f
  • Step 1: find protected nodes in critical section
  • Step 2: add lock operations
  • Lock on each edge: unprotected node → protected node
  • Unlock on each edge: protected node → unprotected node
  • Avoids the potential bugs of the naïve solution
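The two steps can be sketched on a small control-flow graph. The slides do not define "protected node", so the code below assumes one plausible reading: a node is protected if it is reachable from p and can still reach c. The graph, the function names, and that definition are all assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

#define N 5   /* nodes of a hypothetical control-flow graph */

/* Mark every node reachable from start, following edges forward,
 * or backward when reverse is non-zero. */
static void reach(const int adj[N][N], int start, int reverse, int seen[N])
{
    int stack[N], top = 0;

    memset(seen, 0, N * sizeof seen[0]);
    seen[start] = 1;
    stack[top++] = start;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < N; v++) {
            int edge = reverse ? adj[v][u] : adj[u][v];
            if (edge && !seen[v]) {
                seen[v] = 1;
                stack[top++] = v;
            }
        }
    }
}

/* Step 1: protected = reachable from p AND able to reach c (assumed definition).
 * Step 2: lock on unprotected -> protected edges, unlock on the reverse.
 * Returns the number of protected nodes. */
int place_locks(const int adj[N][N], int p, int c,
                int lock_edge[N][N], int unlock_edge[N][N])
{
    int from_p[N], to_c[N], prot[N], count = 0;

    assert(p >= 0 && p < N && c >= 0 && c < N);
    reach(adj, p, 0, from_p);
    reach(adj, c, 1, to_c);
    for (int u = 0; u < N; u++) {
        prot[u] = from_p[u] && to_c[u];
        count += prot[u];
    }
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++) {
            lock_edge[u][v]   = adj[u][v] && !prot[u] &&  prot[v];
            unlock_edge[u][v] = adj[u][v] &&  prot[u] && !prot[v];
        }
    return count;
}

/* Example graph: 0 = entry, 1 = p, 2 = c, 3 = early return, 4 = exit. */
const int example_cfg[N][N] = {
    {0, 1, 0, 0, 0},   /* entry -> p */
    {0, 0, 1, 1, 0},   /* p -> c, p -> early return */
    {0, 0, 0, 0, 1},   /* c -> exit */
    {0, 0, 0, 0, 1},   /* early return -> exit */
    {0, 0, 0, 0, 0},
};
```

On this graph the protected nodes are p and c, the lock goes on the edge entry → p, and unlocks go on both p → early return and c → exit. The second unlock is the one the naïve placement would miss, leaving a lock without an unlock.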

slide-23
SLIDE 23

Subtle details

  • p and c adjustment when they are in different functions
  • Observation: people put lock and unlock in one function
  • Find the longest common prefix of p’s and c’s stack traces
  • Adjust p and c accordingly
  • Put r into a critical section
  • Do nothing if we can reach r from the p–c critical section
  • Lock type:
  • Lock with timeout: if critical section has blocking operations
  • Reentrant lock: if recursion is possible within critical section
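The stack-trace step can be illustrated with a small helper. This is a sketch under simplifying assumptions: traces are arrays of function names listed outermost-first, and the frame names below are invented. A real trace would also carry call sites.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the number of leading frames shared by two stack traces,
 * each given outermost-first as an array of function names. */
size_t common_prefix_len(const char *const *a, size_t na,
                         const char *const *b, size_t nb)
{
    size_t i = 0;

    assert(a != NULL && b != NULL);
    while (i < na && i < nb && strcmp(a[i], b[i]) == 0)
        i++;
    return i;
}

/* Hypothetical traces for p and c, which sit in different functions. */
const char *const p_trace[] = { "main", "worker", "check_ptr" };
const char *const c_trace[] = { "main", "worker", "update", "set_field" };
```

Here the common prefix has length 2, so `worker` is the deepest function on both traces. Under the slide's observation, p and c would be adjusted to the two calls inside `worker`, so that lock and unlock land in the same function.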

slide-24
SLIDE 24

Order relationship

  • Input: two statements (A, B) with contexts
  • There could be multiple instances of A in one thread
  • There could be multiple threads that could execute A
  • There could be no instance of A during the whole execution
  • Goal: make A execute before B

slide-25
SLIDE 25

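Order relationship: two sub-types

  • firstA-B: B must wait for the first instance of A (e.g., a read after an initialization)
  • allA-B: B must wait for all instances of A (e.g., a destroy after every use)

[Figure: timelines of A instances A1 … An relative to B for each sub-type]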

slide-26
SLIDE 26

OFix allA-B enforcement

  • Approach: condition variable and flag
  • Insert signal operations in A-threads
  • Insert wait operation before B
  • Principles
  • An A-thread signals exactly once, when it will not execute more A
  • An A-thread signals as soon as possible
  • B proceeds when each A-thread has signaled

slide-27
SLIDE 27

OFix allA-B enforcement: A side

How to identify the last A instance in one thread

    . . .;
    for (. . .)
        . . . ;   // A
    . . .;

  • Each thread that executes A signals exactly once, as soon as it can execute no more A

slide-28
SLIDE 28

OFix allA-B enforcement: A side

How to identify the last thread that executes A

  • A counter of the threads that still have to signal: initialized to 1, incremented at each thread_create

void main() {
    for (. . .)
        thread_create(thr_main);   // counter++
    . . .;
}

void ofix_signal() {
    mutex_lock(L);
    counter--;
    if (counter == 0)
        cond_broadcast(con);
    mutex_unlock(L);
}

void thr_main() {
    for (. . .)
        . . . ;   // A
    . . .;
}

slide-29
SLIDE 29

OFix allA-B enforcement: B side

  • Safe to execute B only when the counter is 0
  • Give up if OFix knows that it introduces a new deadlock
  • Timed wait operation to mask potential deadlocks

void ofix_wait() {
    mutex_lock(L);
    if (counter != 0)
        cond_timedwait(con, L, t);
    mutex_unlock(L);
}
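Both sides of the allA-B scheme can be put together in a small POSIX-threads sketch. Several details here are assumptions rather than what OFix generates: the counter is named `counter`, the main thread holds the initial count of 1 until it has created all A-threads, A is modeled as an atomic increment, and the wait loops on an untimed `pthread_cond_wait` instead of the single timed wait on the slide, so the demo is deterministic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define NTHREADS 4
#define ITERS 1000

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int counter = 1;        /* threads that may still execute A; 1 = main */
static atomic_long a_count;    /* how many A instances have executed */

/* Called exactly once by a thread that will execute no more A. */
static void ofix_signal(void)
{
    pthread_mutex_lock(&L);
    counter--;
    if (counter == 0)
        pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

/* Called before B. Deviation from the slide: loop on an untimed wait. */
static void ofix_wait(void)
{
    pthread_mutex_lock(&L);
    while (counter != 0)
        pthread_cond_wait(&con, &L);
    pthread_mutex_unlock(&L);
}

static void *thr_main(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++)
        atomic_fetch_add(&a_count, 1);      /* A */
    ofix_signal();                          /* no more A in this thread */
    return NULL;
}

static void *b_main(void *out)
{
    ofix_wait();
    *(long *)out = atomic_load(&a_count);   /* B */
    return NULL;
}

/* Returns the number of A instances that B observed. */
long run_all_a_before_b(void)
{
    pthread_t a[NTHREADS], b;
    long seen = -1;
    int rc;

    counter = 1;
    atomic_store(&a_count, 0);
    rc = pthread_create(&b, NULL, b_main, &seen);
    assert(rc == 0);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_mutex_lock(&L);
        counter++;                          /* one more potential A-thread */
        pthread_mutex_unlock(&L);
        rc = pthread_create(&a[i], NULL, thr_main, NULL);
        assert(rc == 0);
    }
    (void)rc;
    ofix_signal();                          /* main creates no more A-threads */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(a[i], NULL);
    pthread_join(b, NULL);
    return seen;
}
```

Because the main thread holds its own count until every A-thread has been registered, the counter cannot reach 0 early, and B always observes all `NTHREADS * ITERS` instances of A.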

slide-30
SLIDE 30

OFix firstA-B

  • Basic enforcement
  • When A may not execute
  • Add a safety net of signals with the allA-B algorithm

[Figure: A signals, and B waits for that signal]

slide-31
SLIDE 31

CFix: patch testing & selection

Challenge:

  • Multi-threaded software testing

Solution:

  • CFix-patch-oriented testing

slide-32
SLIDE 32

Patch testing principles

  • Prune incorrect patches
  • Patches causing failures due to wrong fix strategies, etc.
  • Prune slow patches
  • Prune complicated patches
  • Not exhaustive testing, but patch oriented testing
  • Leverage existing testing techniques, with extra heuristics


slide-33
SLIDE 33

Run once without external perturbation

  • Reject if there is a time-out or failure
  • Patches fixing the wrong root cause
  • Make the software fail deterministically

Thread 1:
    ptr->field = 1;
    ptr->field = 1;

Thread 2:
    ptr = NULL;

slide-34
SLIDE 34

Implicit bad patch

  • A failure in patch_b implies a failure in patch_a if patch_a is less restrictive than patch_b
  • Helpful to prune patch_a
  • Traditional testing may not find the failure in patch_a

[Figure: patches a, b, and c, grouped into mutual exclusion and order relationships]

slide-35
SLIDE 35

CFix: patch merging

Challenge:

  • A single programming mistake usually leads to multiple bug reports

Solution:

  • Heuristics to merge patches

slide-36
SLIDE 36

An example with multiple reports

void buf_write() {
    int tmp = buf_len + str_len;
    if (tmp > MAX)
        return;
    memcpy(&buf[buf_len], str, str_len);
    buf_len = tmp;
}

[Figure: two bug reports on this function, (p1, c1, r1) and (p2, c2, r2), with c2 and r2 on the same statement]

Fixing each report separately:

  • Too many lock/unlock operations
  • Potential new deadlocks
  • May hurt performance and simplicity

slide-37
SLIDE 37

Related patch: a case of AFix

  • Merge if p, c, or r is in some other patch’s critical sections

Before merging:
    lock(L1); p1; lock(L2); p2; c1; unlock(L1); c2; unlock(L2)
    lock(L1); r1; unlock(L1); lock(L2); r2; unlock(L2)

After merging:
    lock(L1); p1; p2; c1; c2; unlock(L1)
    lock(L1); r2; unlock(L1)

slide-38
SLIDE 38

The merged patch for the example

void buf_write() {
    int tmp = buf_len + str_len;
    if (tmp > MAX) {
        return;
    }
    memcpy(&buf[buf_len], str, str_len);
    buf_len = tmp;
}

[Figure: after merging, the report labels group as p1; c1, p2; c2, r1, r2]
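As an illustration of what a merged patch could look like, the sketch below guards the whole body of `buf_write` with one lock. The parameters, the buffer size, the lock name `L1`, and the driver `run_buf_demo` are assumptions added to make the fragment self-contained; the slides do not show the actual patch text.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MAX 64

static char buf[MAX];
static int buf_len = 0;
static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;  /* the single merged lock */

/* Append str to buf. One critical section covers every access to buf_len,
 * and the lock is released on both exits. */
void buf_write(const char *str, int str_len)
{
    assert(str != NULL && str_len >= 0);
    pthread_mutex_lock(&L1);
    int tmp = buf_len + str_len;
    if (tmp > MAX) {
        pthread_mutex_unlock(&L1);       /* early return still unlocks */
        return;
    }
    memcpy(&buf[buf_len], str, (size_t)str_len);
    buf_len = tmp;
    pthread_mutex_unlock(&L1);
}

static void *writer(void *arg)
{
    (void)arg;
    for (int i = 0; i < 4; i++)
        buf_write("abcd", 4);
    return NULL;
}

/* Four writers append 16 bytes each, filling the buffer exactly. */
int run_buf_demo(void)
{
    pthread_t t[4];

    buf_len = 0;
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&t[i], NULL, writer, NULL);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    buf_write("x", 1);                   /* rejected: the buffer is full */
    return buf_len;
}
```

With a single lock there is one acquisition per call and no nested locking, which addresses the concerns about too many lock operations and potential new deadlocks.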

slide-39
SLIDE 39

CFix: run-time support

  • To understand whether there is a deadlock underlying a time-out
  • Low overhead, and suitable for production runs

slide-40
SLIDE 40

Evaluation methodology

  • Applications: PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3
  • Detectors: AV detector, OV detector, RA detector, DU detector

slide-41
SLIDE 41

Evaluation result

APP.            # of Ops
PBZIP2          5
x264            7
FFT             5
HTTrack         2
Mozilla-1       2
transmission    2
ZSNES           3
Apache          3
MySQL-1         5
MySQL-2         9
Mozilla-2       3
Cherokee        2
Mozilla-3       5

[Table columns for the AV, OV, RA, and DU detectors: per-application marks not recoverable]

slide-42
SLIDE 42

Comparison with manual patches

APP.            Manual patch
PBZIP2          Order with pthread_join
x264            Order with pthread_join
FFT             Order with pthread_join
HTTrack         N/A
Mozilla-1       Order with lock
transmission    Move before pthread_create
ZSNES           Move before pthread_create
Apache          New lock in structure
MySQL-1         Existing lock and variable
MySQL-2         Existing lock
Mozilla-2       Make the variable local
Cherokee        Existing lock
Mozilla-3       Customized synchronization

  • CFix patches have similar correctness and performance
  • Manual patches integrate better with existing code

slide-43
SLIDE 43

Broader context and related work

  • Concurrency bug detection: atomicity and races; record/replay; production runs; special considerations for repair
  • Correct by construction: synthesis and sketching; derivation from high-level constructs; global static analysis
  • Hot-patching at run time: apply developer-provided fixes; execution steering

slide-44
SLIDE 44

Broader context and related work

Concurrency bug detection

Atomicity and races Record/replay Production runs Special considerations for repair

  • Atomicity violations: Atomizer [Flanagan, POPL’04], CCI [Jin, OOPSLA’10], AVIO [Lu, ASPLOS’06], Vaziri [POPL’06], ConTeGe [PLDI’12; ICSE’13; ISSTA’14]
  • Race conditions: Pacer [Bond, PLDI’10], Choi [PLDI’02], FastTrack [Flanagan, PLDI’09], CCI [Jin, OOPSLA’10], Eraser [Savage, TOCS’97], ConTeGe [PLDI’12; ICSE’13; ISSTA’14]
  • Many, many more: apologies for omissions!

slide-45
SLIDE 45

Broader context and related work

Concurrency bug detection

Atomicity and races Record/replay Production runs Special considerations for repair

  • ODR [Altekar, SOSP’09]
  • DoubleChecker [Biswas, PLDI’14]
  • Light [Liu, PLDI’15]
  • PRES [Park, SOSP’09]
  • SlimFast [Peng, PLDI’16]
  • DoublePlay [Veeraraghavan, ASPLOS’11]
  • Wu [FSE’10]


slide-46
SLIDE 46

Broader context and related work

Concurrency bug detection

Atomicity and races Record/replay Production runs Special considerations for repair

  • Pacer [Bond, PLDI’10]
  • CCI [Jin, OOPSLA’10]
  • Marino [PLDI’09]
  • Veeraraghavan [SOSP’11]


slide-47
SLIDE 47

Broader context and related work

Concurrency bug detection

Atomicity and races Record/replay Production runs Special considerations for repair

  • False positives
  • Run-time information
  • Single vs. multiple
  • Misclassification


slide-48
SLIDE 48

Broader context and related work

Correct by construction

Synthesis and sketching Derivation from high-level constructs Global static analysis

  • Musketeer [Alglave, CAV’14]
  • Deshmukh [ESOP’10]
  • Solar-Lezama [PLDI’08]
  • Vechev [POPL’10]
  • Smart state-space search + verification
  • Infer synchronization to make program obey specification
  • Powerful: flexible specifications
  • Challenge: scalability to real-world code

slide-49
SLIDE 49

Broader context and related work

Correct by construction

Synthesis and sketching Derivation from high-level constructs Global static analysis

  • Kim [MIT-TR’2010]
  • Autolocker [McCloskey, POPL’06]
  • Navabi [PPoPP’08]
  • Vaziri [POPL’06]
  • Weeratunge [OOPSLA’11]
  • Atomic sets, atomic blocks, futures, etc.
  • Manual or profile-directed
  • Critical regions
  • Specified explicitly
  • Aligned with specific functions
  • Locking/barrier plan derived automatically

slide-50
SLIDE 50

Broader context and related work

Correct by construction

Synthesis and sketching Derivation from high-level constructs Global static analysis

  • TraceFinder [Upadhyaya, OOPSLA’10]
  • Whole-program alias and synchronization analyses
  • Identify atomic block boundaries that guarantee conflict-serializability
  • Synchronization via atomic blocks
  • Disregard lock assignment, deadlocks
  • Challenge: scalability

slide-51
SLIDE 51

Broader context and related work

Hot-patching at run time

Apply developer- provided fixes Execution steering

  • ClearView [Perkins, SOSP’09]
  • LOOM [Wu, OSDI’10]
  • Assume human-provided patches
  • If automating, what special considerations apply?
  • Detect → design → apply
  • CFix is neither the first nor the last

slide-52
SLIDE 52

Broader context and related work

Hot-patching at run time

Apply developer- provided fixes Execution steering

  • Hardware-assisted: Musketeer [Alglave, CAV’14], Kivati [Chew, EuroSys’10], Atom-Aid [Lucia, MICRO’09], Yu [ISCA’09; MICRO’10]
  • Known critical regions: AtomRace [Křena, PADTAD’07; Letko, PADTAD’08], Ratanaworabhan [PPoPP’09]
  • Learned deadlock avoidance: Dimmunix [Jula, OSDI’08]
  • Deterministic execution: Aviram [OSDI’10], dOS [Bergan, OSDI’10], Grace [Berger, OOPSLA’09], Cui [OSDI’10; SOSP’11], Dthreads [Liu, SOSP’11], Kendo [Olszewski, ASPLOS’09]

slide-53
SLIDE 53

Broader context and related work

Hot-patching at run time

Apply developer- provided fixes Execution steering

  • Challenges
  • Overhead
  • System non-determinism
  • Language design


slide-54
SLIDE 54

CFix summary

  • CFix uses some heuristics, with good results in practice
  • A combination of mutual exclusion and order enforcement
  • Use testing to select the best patch
  • Fix root cause without requiring detectors to report it
  • Small overhead and good simplicity
  • It is feasible to fix concurrency bugs automatically
  • By removing bad interleavings
  • Must be careful in the details


slide-55
SLIDE 55
