DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems (PowerPoint PPT Presentation)



SLIDE 1

DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian*

*

SLIDE 2

Cloud systems

SLIDE 3

Cloud systems

SLIDE 4

Distributed concurrency bugs (DCbugs)

SLIDE 5

Distributed concurrency bugs (DCbugs)

  • Unexpected timing among distributed operations
SLIDE 6

Distributed concurrency bugs (DCbugs)

  • Unexpected timing among distributed operations
  • Example

[Figure: messages among nodes A, B, C (MapReduce-3274)]

SLIDE 7

Distributed concurrency bugs (DCbugs)

  • Unexpected timing among distributed operations
  • Example

[Figure: two possible timings among nodes A, B, C in MapReduce-3274; the unexpected order leads to a hang]

SLIDE 8

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]

– 26% of failures are caused by non-determinism [1]
– 6% of software bugs in cloud systems [2]

[1] Yuan et al. Simple Testing Can Prevent Most Critical Failures. In OSDI’14
[2] Gunawi et al. What Bugs Live in the Cloud? In SoCC’14
[3] Leesatapornwongsa et al. TaxDC. In ASPLOS’16

SLIDE 9

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose


SLIDE 10

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose

“That is one monster of a race!” (Hadoop MapReduce, MAPREDUCE-3274)

SLIDE 11

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose

“That is one monster of a race!” (Hadoop MapReduce, MAPREDUCE-3274)
“There isn’t a week going by without new bugs about races.” (HBase, HBASE-4397)

SLIDE 12

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose

“That is one monster of a race!” (Hadoop MapReduce, MAPREDUCE-3274)
“There isn’t a week going by without new bugs about races.” (HBase, HBASE-4397)
“We have already fix many cases, however it seems exist many other [racing] cases.” (HBase, HBASE-6147)

SLIDE 13

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose

“That is one monster of a race!” (Hadoop MapReduce, MAPREDUCE-3274)
“There isn’t a week going by without new bugs about races.” (HBase, HBASE-4397)
“We have already found and fix many cases, however it seems exist many other cases.” (HBase, HBASE-6147)
“This has become quite messy, sigh.” (Hadoop MapReduce, MAPREDUCE-4819)

SLIDE 14

DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose

“That is one monster of a race!” (Hadoop MapReduce, MAPREDUCE-3274)
“There isn’t a week going by without new bugs about races.” (HBase, HBASE-4397)
“We have already found and fix many cases, however it seems exist many other cases.” (HBase, HBASE-6147)
“This has become quite messy, sigh.” (Hadoop MapReduce, MAPREDUCE-4819)
“Great catch, Sid! Apologies for missing the race condition.” (Hadoop MapReduce, MAPREDUCE-4099)

SLIDE 15


DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose

“That is one monster of a race!” (Hadoop MapReduce, MAPREDUCE-3274)
“There isn’t a week going by without new bugs about races.” (HBase, HBASE-4397)
“We have already found and fix many cases, however it seems exist many other cases.” (HBase, HBASE-6147)
“This has become quite messy, sigh.” (Hadoop MapReduce, MAPREDUCE-4819)
“Great catch, Sid! Apologies for missing the race condition.” (Hadoop MapReduce, MAPREDUCE-4099)
“We [prefer] debug crashes instead of hanging jobs.” (Hadoop MapReduce, MAPREDUCE-3634)

SLIDE 16


DCbugs need to be tackled

  • Common in distributed systems [1, 2, 3]
  • Difficult to avoid, expose and diagnose


Can we detect DCbugs before they manifest?

SLIDE 17

Previous work

  • Model checking

– Works on abstracted models
– Faces state-space explosion

SLIDE 18

Our idea

  • Follow the philosophy of traditional concurrency bug detection

SLIDE 19

Our idea

  • Follow the philosophy of traditional concurrency bug detection

[Figure: four machines (Machine 1–4) exchanging messages]

SLIDE 20

Our idea

  • Follow the philosophy of traditional concurrency bug detection


SLIDE 21

Our idea

  • Follow the philosophy of traditional concurrency bug detection


SLIDE 22

Example

[Figure: nodes A, B, C]

SLIDE 23

Example

[Figure: nodes A, B, C, zooming into node B]

//UnReg thread
void unReg(jID){
    jMap.remove(jID);
    ...
}

//RPC thread
Task getTask(jID){
    ...
    return jMap.get(jID);
}
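The racing pair above can be condensed into a runnable sketch (the `JMapRace` class and job/task names are illustrative, not the real Hadoop structures): when the UnReg thread's remove() wins the race, the RPC thread's get() returns null and the remote caller never receives its task.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the MAPREDUCE-3274-style race. jMap is shared between
// the UnReg thread and the RPC handler thread with no ordering enforced
// between remove() and get().
public class JMapRace {
    static final Map<String, String> jMap = new HashMap<>();

    // UnReg thread: deregisters the job.
    static void unReg(String jID) {
        jMap.remove(jID);
    }

    // RPC thread: serves getTask requests against the same map.
    static String getTask(String jID) {
        return jMap.get(jID);
    }

    public static void main(String[] args) {
        jMap.put("job_1", "task_7");
        String ok = getTask("job_1"); // expected order: read before deregistration
        unReg("job_1");

        jMap.put("job_2", "task_9");
        unReg("job_2");                // buggy order: deregistration wins the race
        String bad = getTask("job_2"); // null: the remote caller polls forever
        System.out.println(ok + " / " + bad);
    }
}
```

The two call orders in main() correspond to the two timings of the slide's diagram; only the second one hangs the remote while-loop shown later in the talk.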

SLIDE 24

Local concurrency bug detection

SLIDE 25

Local concurrency bug detection

SLIDE 26

Local concurrency bug detection

Is the problem solved?

SLIDE 27

Local concurrency bug detection

 

Trace

T1 T2 T3

SLIDE 28

Local concurrency bug detection

 

Trace

T1 T2 T3

C 1

Challenges
C1: How to handle the huge amount of mem accesses?

SLIDE 29

Local concurrency bug detection

. . .

 

Trace HB

T1 T2 T3

C 1


SLIDE 30

Local concurrency bug detection

. . .

 

Trace HB

T1 T2 T3

C 1


SLIDE 31

Local concurrency bug detection

. . .

 

Trace HB

T1 T2 T3

C 1 C 2

Challenges
C1: How to handle the huge amount of mem accesses?
C2: What’s the happens-before model?

SLIDE 32

Local concurrency bug detection

. . .

 

Trace Triage HB

T1 T2 T3

. . .

C 1 C 2


assert(r)

r w r w

SLIDE 33

Local concurrency bug detection

. . .

 

Trace Triage HB

T1 T2 T3

. . .

C 1 C 2 C 3

Challenges
C1: How to handle the huge amount of mem accesses?
C2: What’s the happens-before model?
C3: How to estimate the distributed impact of a race?

assert(r)

r w r w

SLIDE 34

Local concurrency bug detection

. . .

T1 T2 T3

. . . . . .

 

Trace Triage Trigger HB

C 1 C 2 C 3


+sleep

r w r w

assert(r)

SLIDE 35

Local concurrency bug detection

. . .

 

Trace Triage Trigger HB

T1 T2 T3

. . . . . .

C 1 C 2 C 3 C 4

Challenges
C1: How to handle the huge amount of mem accesses?
C2: What’s the happens-before model?
C3: How to estimate the distributed impact of a race?
C4: How to trigger with distributed time manipulation?

+sleep

r w

assert(r)

r w

SLIDE 36

Contribution

  • A comprehensive HB Model for distributed systems
SLIDE 37

Contribution

 

Trace Triage Trigger HB

C 1 C 2 C 3 C 4


Solved by

DCatch

  • A comprehensive HB Model for distributed systems
  • DCatch tool detects DCbugs from correct runs
SLIDE 38

Contribution

  • A comprehensive HB Model for distributed systems
  • DCatch tool detects DCbugs from correct runs
  • Evaluate on 4 systems
  • Report 32 DCbugs, with 20 of them being truly harmful

SLIDE 39

Outline

  • Motivation
  • DCatch Happens-before Model
  • DCatch tool
  • Evaluation
  • Conclusion
SLIDE 40

DCatch Happens-before Model

HMaster Thd

w

SLIDE 41

DCatch Happens-before Model

Thd HMaster Thd

w

SLIDE 42

DCatch Happens-before Model

Thd Thd HMaster HRegionServer Thd

w

SLIDE 43

DCatch Happens-before Model

Thd Event thread Thd HMaster

e

HRegionServer Thd

w

SLIDE 44

DCatch Happens-before Model

Thd Event thread Thd HMaster

e

HRegionServer Thd ZK Coordinator

w

SLIDE 45

DCatch Happens-before Model

Thd Thd Event thread Thd HMaster

e

HRegionServer Thd ZK Coordinator

w

SLIDE 46

DCatch Happens-before Model

Thd Thd Event thread Thd HMaster

e

HRegionServer Thd ZK Coordinator

w r

SLIDE 47

DCatch Happens-before Model

Thd Thd Event thread Thd HMaster

e

HRegionServer Thd ZK Coordinator

w r

Where is the HB model for distributed systems?

SLIDE 48

DCatch Happens-before Model

Thd Thd Eve thd Thd HMaster HRegionServer Thd ZK Coordinator

w r

SLIDE 49

DCatch Happens-before Model

Dist. Loc.

Thd Thd Eve thd Thd HMaster HRegionServer Thd ZK Coordinator

w r Dist.

SLIDE 50

DCatch Happens-before Model

Dist. Loc. Async. Sync.

Thd Thd Eve thd Thd HMaster HRegionServer Thd ZK Coordinator

w r Dist. Async.

SLIDE 51

DCatch Happens-before Model

Stand. Dist. Loc. Cust. Async. Sync.

Thd Thd Eve thd Thd HMaster HRegionServer Thd ZK Coordinator

w r Dist. Async. Cust.

SLIDE 52

Distributed rules

Distributed Local Standard Custom Async. Sync.

SLIDE 53

Distributed rule #1

  • Logical time clock (Leslie Lamport, 1978)

[Figure: a Send event on Machine 1 happens-before the matching Recv on Machine 2 (Socket; Standard, Asynchronous)]
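Lamport's rule gives the cross-machine ordering: a Send happens-before its matching Recv, so the receiver's logical clock must advance past the timestamp carried by the message. A minimal sketch of the clock rules (illustrative, not DCatch's implementation):

```java
// Sketch of Lamport logical clocks (Lamport, 1978).
public class LamportClock {
    private long clock = 0;

    // Rule 1: every local event ticks the clock.
    public long tick() { return ++clock; }

    // Sending is a local event; the message carries the resulting timestamp.
    public long send() { return tick(); }

    // Rule 2: on receive, jump past the sender's timestamp, then tick,
    // so "Send happens-before Recv" is reflected in the clock order.
    public long receive(long msgTimestamp) {
        clock = Math.max(clock, msgTimestamp);
        return tick();
    }

    public long current() { return clock; }

    public static void main(String[] args) {
        LamportClock m1 = new LamportClock(), m2 = new LamportClock();
        long ts = m1.send();      // Machine 1 sends
        long at = m2.receive(ts); // Machine 2 receives: clock jumps past ts
        System.out.println(ts + " -> " + at); // prints "1 -> 2"
    }
}
```

If two events are ordered by this clock in neither direction, they are candidates for the racing pairs DCatch reports.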

SLIDE 54

Distributed rule #1

  • Logical time clock (Leslie Lamport, 1978)


SLIDE 55

Distributed rule #2

[Figure: Machine 1 blocks at RPC-call until Machine 2 runs RPC-begin…RPC-end; the call returns at RPC-rt (RPC; Standard, Synchronous)]
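Because the caller blocks, a synchronous RPC contributes two cross-machine HB edges: RPC-call happens-before RPC-begin, and RPC-end happens-before RPC-rt. A small sketch of how such edges might be recorded and queried (the `RpcHbRule` helper and event names are illustrative, not DCatch's actual data structures):

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of the synchronous-RPC happens-before rule as a small edge graph.
public class RpcHbRule {
    // HB edges: event -> events that must come later.
    static final Map<String, Set<String>> hb = new HashMap<>();

    static void edge(String before, String after) {
        hb.computeIfAbsent(before, k -> new HashSet<>()).add(after);
    }

    // One RPC adds the two cross-machine edges the rule implies.
    static void addRpc(String id) {
        edge(id + ":call", id + ":begin"); // request delivery
        edge(id + ":end", id + ":rt");     // reply delivery
    }

    // DFS reachability; treats an event as ordered with itself.
    static boolean happensBefore(String a, String b) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push(a);
        Set<String> seen = new HashSet<>();
        while (!stack.isEmpty()) {
            String cur = stack.pop();
            if (cur.equals(b)) return true;
            if (seen.add(cur))
                for (String next : hb.getOrDefault(cur, Set.of())) stack.push(next);
        }
        return false;
    }

    public static void main(String[] args) {
        addRpc("getTask");
        edge("getTask:begin", "getTask:end"); // program order on the callee
        System.out.println(happensBefore("getTask:call", "getTask:rt")); // prints "true"
    }
}
```

Combined with program-order edges on each machine, the two RPC edges order everything before the call against everything after the return.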

SLIDE 56

Distributed rule #2


SLIDE 57

Distributed rule #2


SLIDE 58

Distributed rule #3

In multi-threaded systems:

//Thread1
flag = true;

//Thread2
while(!flag){ }
...

Standard Asynch. Customize Synch.

Socket RPC

  • Dist. while-loop
SLIDE 59

Distributed rule #3


Standard Asynch. Customize Synch.

Socket RPC

  • Dist. while-loop
SLIDE 60

Distributed rule #3


Standard Asynch. Customize Synch.

Socket RPC

  • Dist. while-loop

In distributed systems:

SLIDE 61

Distributed rule #3

In distributed systems:

//Machine A
//Thread1
flag = true;
//Thread2
bool getFlag(){ return flag; }

//Machine B
while(!getFlag()){ }
...

Standard Asynch. Customize Synch.

Socket RPC

  • Dist. while-loop
SLIDE 62

Distributed rule #3


Standard Asynch. Customize Synch.

Socket RPC

  • Dist. while-loop
SLIDE 63

Distributed rule #4

Thd Thd Eve thd Thd Thd ZK Coordinator

w r

Standard Asynch. Customize Synch.

Socket RPC

ZooKeeper Service

SLIDE 64

Distributed rule #4

Standard Asynch. Customize Synch. ZooKeeper Service

Socket RPC

Thd Thd Eve thd Thd Thd ZK Coordinator

w r

SLIDE 65

Distributed rules

– Standard: RPC (synchronous), Socket (asynchronous)
– Customized: Dist. while-loop, ZooKeeper Service

SLIDE 66

Local rules

Distributed Local

SLIDE 67

Local rules

– Standard: Event-related (asynchronous); others n/a

SLIDE 68

Local rules

– Standard: Event-related (asynchronous), Thread fork/join (synchronous)
– Customized: While-loop (synchronous); n/a (asynchronous)

SLIDE 69

Local rules


SLIDE 70

Outline

  • Motivation
  • DCatch Happens-before Model
  • DCatch tool
  • Evaluation
  • Conclusion
SLIDE 71

Triage Trigger Trace HB

SLIDE 72


Triage Trigger Trace HB

Selective tracing: record only memory accesses made in event/message handlers and their callees.

SLIDE 73


Triage Trigger Trace HB

Local Distributed

[1] Raychev et al. Effective Race Detection for Event-Driven Programs. In OOPSLA’13
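Once the local and distributed HB edges are combined, two conflicting accesses are reported as a race iff neither happens-before the other. With vector clocks that check is a pointwise comparison; a minimal sketch (illustrative, not DCatch's actual detector):

```java
// Sketch of the final race check over vector-clock timestamps.
public class RaceCheck {
    // a <= b componentwise: the access stamped a happens-before the one stamped b.
    static boolean leq(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++)
            if (a[i] > b[i]) return false;
        return true;
    }

    // Conflicting accesses race iff they are ordered in neither direction.
    static boolean race(int[] a, int[] b) {
        return !leq(a, b) && !leq(b, a);
    }

    public static void main(String[] args) {
        int[] write = {2, 0}; // e.g. clock of w on one node
        int[] read  = {0, 3}; // clock of r on another, no HB path between them
        System.out.println(race(write, read)); // prints "true": race candidate
    }
}
```

An extra message between the two nodes would raise one clock above the other, make leq() hold in one direction, and suppress the report.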

SLIDE 74


Triage Trigger Trace HB

Machine B

//RPC thread
Task getTask(jID){
    ...
    return jMap.get(jID);
}

//UnReg thread
void unReg(jID){
    jMap.remove(jID);
    ...
}

SLIDE 75


Triage Trigger Trace HB

//Machine B
//RPC thread
Task getTask(jID){
    ...
    return jMap.get(jID);
}

//UnReg thread
void unReg(jID){
    jMap.remove(jID);
    ...
}

//Machine A
while(!getTask(jID)){
}

SLIDE 76


Triage Trigger Trace HB


SLIDE 77


Triage Trigger Trace HB


hang

SLIDE 78


Triage Trigger Trace HB

Thd Thd Event thread

e2 e1

w r

SLIDE 79


Triage Trigger Trace HB

Thd Thd Event thread

e2 e1

w r +sleep

SLIDE 80


Triage Trigger Trace HB

Thd Thd Event thread

e2 e1

w r +sleep

SLIDE 81


Triage Trigger Trace HB

Thd Machine A Thd Machine C Thd Thd Event thread

e2 e1

w r

Machine B

SLIDE 82


Triage Trigger Trace HB

Thd Machine A Thd Machine C Thd Thd Event thread

e2 e1

w r

Machine B

+sleep +sleep

SLIDE 83


Triage Trigger Trace HB

Thd Machine A Thd Machine C Thd Thd Event thread

e2 e1

w r

Machine B

+sleep +sleep +sleep +sleep
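The injected sleeps force the suspected order to actually occur. A toy local illustration of the idea (DCatch coordinates such delays across machines; this `SleepTrigger` sketch uses two threads on one machine and illustrative names):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch of timing injection: delaying the reader forces the
// remove-before-get order a race report predicts, so the buggy
// interleaving manifests deterministically instead of rarely.
public class SleepTrigger {
    static final Map<String, String> jMap = new ConcurrentHashMap<>();

    // Runs the racing get/remove pair once; readerDelayMs is the injected "+sleep".
    public static String runWithDelay(long readerDelayMs) {
        jMap.put("job", "task");
        final String[] seen = new String[1];

        Thread rpc = new Thread(() -> {
            try { Thread.sleep(readerDelayMs); } catch (InterruptedException ignored) { }
            seen[0] = jMap.get("job"); // racing read
        });
        Thread unreg = new Thread(() -> jMap.remove("job")); // racing write

        rpc.start();
        unreg.start();
        try { rpc.join(); unreg.join(); } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return seen[0]; // null when the remove was forced to win
    }

    public static void main(String[] args) {
        // A long enough delay makes the buggy remove-before-get order near-certain.
        System.out.println(runWithDelay(200));
    }
}
```

Without the delay the two orders occur unpredictably; the delay biases the schedule toward the reported racing order so the resulting failure (here, a null reply) can be confirmed.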

SLIDE 84

Outline

  • Motivation
  • DCatch Happens-before Model
  • DCatch tool
  • Evaluation
  • Conclusion
SLIDE 85

Methodology

  • Benchmarks:
    – 7 real-world DCbugs from TaxDC [1]
    – 4 distributed systems

[1] Leesatapornwongsa et al. TaxDC. In ASPLOS’16

SLIDE 86

Overall results

BugID     Detected?   #Bugs   #Benign   #False-pos
CA-1011   ✔           3
HB-4539   ✔           3       1
HB-4729   ✔           4       1
MR-3274   ✔           2       4
MR-4637   ✔           1       2         4
ZK-1144   ✔           5       1         1
ZK-1270   ✔           6       2
Total                 20      5         7

SLIDE 87

Overall results


SLIDE 88

Overall results

Total: 20 = 12 + 8

SLIDE 89

Overall results


SLIDE 90

Other results in our paper

  • Performance overhead
  • Trace compositions
  • HB model impact

– False positives
– False negatives

SLIDE 91

Outline

  • Motivation
  • DCatch Happens-before Model
  • DCatch tool
  • Evaluation
  • Conclusion
SLIDE 92

Conclusion

  • An HB model for distributed systems
  • DCatch detects DCbugs from correct runs with low false-positive rates.

 

Trace Triage Trigger HB

C 1 C 2 C 3 C 4

Local Distributed

SLIDE 93

Thank you! Q&A