1
DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian*
2
Cloud systems
7
Distributed concurrency bugs (DCbugs)
- Unexpected timing among distributed operations
- Example: MapReduce-3274
(Figure: interactions among nodes A, B, and C, ending in a hang)
8
DCbugs need to be tackled
- Common in distributed systems [1, 2, 3]
– 26% of failures are non-deterministic [1]
– 6% of software bugs in cloud systems [2]
[1] Yuan et al. Simple Testing Can Prevent Most Critical Failures. In OSDI’14
[2] Gunawi et al. What Bugs Live in the Cloud? In SoCC’14
[3] Leesatapornwongsa et al. TaxDC. In ASPLOS’16
16
DCbugs need to be tackled
- Common in distributed systems [1, 2, 3]
- Difficult to avoid, expose and diagnose
“That is one monster of a race!” Hadoop Map/Reduce / MAPREDUCE-3274
“There isn’t a week going by without new bugs about races.” HBase / HBASE-4397
“We have already found and fix many cases, however it seems exist many other cases.” HBase / HBASE-6147
“This has become quite messy, sigh.” Hadoop Map/Reduce / MAPREDUCE-4819
“Great catch, Sid! Apologies for missing the race condition.” Hadoop Map/Reduce / MAPREDUCE-4099
“We [prefer] debug crashes instead of hanging jobs.” Hadoop Map/Reduce / MAPREDUCE-3634
Can we detect DCbugs before they manifest?
17
Previous work
- Model checking
– Works on abstracted models
– Faces the state-space explosion issue
21
Our idea
- Follow the philosophy of traditional concurrency bug detection
(Figure: Machine 1, Machine 2, Machine 3, Machine 4)
23
Example
(Figure: nodes A, B, and C; code on node B:)
//UnReg thread
void unReg(jID){
  jMap.remove(jID);
  ....
}
//RPC thread
Task getTask(jID){
  ...
  return jMap.get(jID);
}
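The two handlers above touch the same jMap entry, so the value that getTask returns depends on which handler runs first. A minimal Java sketch of that dependence follows. Only jMap, unReg, and getTask come from the slide; the class name, the register helper, the replay method, and the String types are assumptions made for illustration, and the two orders are replayed sequentially rather than raced.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the node that owns jMap.
class JobTableSketch {
    private final Map<String, String> jMap = new ConcurrentHashMap<>();

    // Assumed helper: puts a job into the table before the handlers run.
    void register(String jID, String task) { jMap.put(jID, task); }

    // Runs on the "UnReg thread" in the slide.
    void unReg(String jID) { jMap.remove(jID); }

    // Runs on the "RPC thread" in the slide; returns null once the job is gone.
    String getTask(String jID) { return jMap.get(jID); }

    // Replays one of the two possible orders and reports what getTask saw.
    static String replay(boolean unRegFirst) {
        JobTableSketch node = new JobTableSketch();
        node.register("job1", "task1");
        if (unRegFirst) {
            node.unReg("job1");
            return node.getTask("job1");
        }
        String seen = node.getTask("job1");
        node.unReg("job1");
        return seen;
    }

    public static void main(String[] args) {
        System.out.println("getTask before unReg: " + replay(false));
        System.out.println("unReg before getTask: " + replay(true));
    }
}
```

When unReg runs first, getTask returns null, which is the value a remote caller would keep seeing if it polls for the task.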
26
Local concurrency bug detection
Is the problem solved?
35
Local concurrency bug detection
Trace -> HB -> Triage -> Trigger
(Figure: threads T1, T2, T3; racing accesses r and w; assert(r); +sleep)
Challenges
C1: How to handle the huge amount of memory accesses? (Trace)
C2: What’s the happens-before model? (HB)
C3: How to estimate the distributed impact of a race? (Triage)
C4: How to trigger with distributed time manipulation? (Trigger)
37
Contribution
- A comprehensive HB Model for distributed systems
- DCatch tool detects DCbugs from correct runs
Challenges C1 to C4 (Trace, HB, Triage, Trigger): solved by DCatch
38
Contribution
- A comprehensive HB Model for distributed systems
- DCatch tool detects DCbugs from correct runs
- Evaluated on 4 systems
- Reports 32 DCbugs, 20 of which are truly harmful
39
Outline
- Motivation
- DCatch Happens-before Model
- DCatch tool
- Evaluation
- Conclusion
47
DCatch Happens-before Model
(Figure: threads and an event thread handling event e on HMaster, a thread on HRegionServer, and the ZK Coordinator; accesses w and r)
Where is the HB model for distributed systems?
51
DCatch Happens-before Model
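HB rule dimensions: Distributed (Dist.) vs. Local (Loc.), Asynchronous (Async.) vs. Synchronous (Sync.), Standard (Stand.) vs. Custom (Cust.)
(Figure: threads and an event thread on HMaster, a thread on HRegionServer, and the ZK Coordinator; accesses w and r; labels Dist., Async., Cust.)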
52
Distributed rules
(Figure: rule taxonomy: Distributed vs. Local, Standard vs. Custom, Async. vs. Sync.)
53
Distributed rule #1
- Logical time clock (Leslie Lamport, 1978)
- Socket: Send on Machine1 happens before Recv on Machine2
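The Send/Recv ordering can be illustrated with a scalar Lamport clock. This is a minimal sketch of Lamport's rule only, not DCatch's implementation; the class and method names are assumptions made for illustration.

```java
// Hypothetical scalar logical clock, one instance per machine.
class LamportClockSketch {
    private long time = 0;

    // Local step or send: advance the clock and return the timestamp to attach.
    long tick() { return ++time; }

    // Receive: move past the sender's timestamp so Send is ordered before Recv.
    long onReceive(long senderTime) {
        time = Math.max(time, senderTime) + 1;
        return time;
    }

    long now() { return time; }

    public static void main(String[] args) {
        LamportClockSketch machine1 = new LamportClockSketch();
        LamportClockSketch machine2 = new LamportClockSketch();
        long sendTs = machine1.tick();            // Send on Machine1
        long recvTs = machine2.onReceive(sendTs); // Recv on Machine2
        System.out.println("send=" + sendTs + " recv=" + recvTs);
    }
}
```

Whatever the receiver's clock was before, the timestamp of the receive is strictly larger than the timestamp carried by the message.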
57
Distributed rule #2
- RPC: RPC-call on Machine1 happens before RPC-begin on Machine2; RPC-end on Machine2 happens before RPC-rt on Machine1
- The caller on Machine1 is waiting between RPC-call and RPC-rt
60
Distributed rule #3
- Dist. while-loop
In multi-threaded systems:
//Thread1
flag = True;
//Thread
while(!flag){ }
...
62
Distributed rule #3
- Dist. while-loop
In distributed systems:
Machine A:
//Thread
while(!getFlag()){ }
...
Machine B:
//Thread1
flag = True;
//Thread2
bool getFlag(){ return flag; }
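The distributed while-loop can be sketched in Java with two threads standing in for the two machines. The class names, the AtomicBoolean, and the direct method call in place of a real RPC are all assumptions made for illustration; the point is only that code after the loop exit is ordered after the write to flag.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class DistWhileLoopSketch {
    // Stand-in for Machine B: owns flag and exposes it through an RPC-like getter.
    static class MachineB {
        private final AtomicBoolean flag = new AtomicBoolean(false);
        void setFlag() { flag.set(true); }        // Thread1 in the slide
        boolean getFlag() { return flag.get(); }  // Thread2, the RPC handler
    }

    // Stand-in for Machine A: polls getFlag() until it returns true.
    static int pollUntilSet(MachineB b) throws InterruptedException {
        int polls = 0;
        while (!b.getFlag()) {
            polls++;
            Thread.sleep(1);
        }
        return polls;
    }

    // Returns true if the code after the loop on "Machine A" observed the flag.
    static boolean run() {
        MachineB b = new MachineB();
        boolean[] sawFlagAfterLoop = new boolean[1];
        Thread machineA = new Thread(() -> {
            try {
                pollUntilSet(b);
                sawFlagAfterLoop[0] = b.getFlag();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        try {
            machineA.start();
            Thread.sleep(20);    // let Machine A spin for a while
            b.setFlag();         // the write that releases the loop
            machineA.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawFlagAfterLoop[0];
    }

    public static void main(String[] args) {
        System.out.println("loop exit saw flag: " + run());
    }
}
```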
64
Distributed rule #4
- ZooKeeper Service
(Figure: threads and an event thread interacting through the ZK Coordinator; accesses w and r)
65
Distributed rules
              Standard   Customized
Synchronous   RPC        Dist. while-loop
Asynchronous  Socket     ZooKeeper service
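One way to read the rules above is that each contributes edges to a happens-before graph, and two accesses are race candidates when neither reaches the other. The following Java sketch assumes that reading; the class, its methods, and the node labels such as w@caller are hypothetical and are not taken from the DCatch tool.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class HbGraphSketch {
    private final Map<String, Set<String>> succ = new HashMap<>();

    // Records that operation `from` happens before operation `to`.
    void addEdge(String from, String to) {
        succ.computeIfAbsent(from, k -> new HashSet<>()).add(to);
    }

    // True if `to` is reachable from `from` through happens-before edges.
    boolean happensBefore(String from, String to) {
        Deque<String> work = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();
        work.push(from);
        while (!work.isEmpty()) {
            String cur = work.pop();
            for (String next : succ.getOrDefault(cur, Collections.<String>emptySet())) {
                if (next.equals(to)) return true;
                if (seen.add(next)) work.push(next);
            }
        }
        return false;
    }

    // Two operations are concurrent when neither is ordered before the other.
    boolean concurrent(String a, String b) {
        return !happensBefore(a, b) && !happensBefore(b, a);
    }

    // Builds the edges of one RPC, plus program order around it.
    static HbGraphSketch rpcExample() {
        HbGraphSketch g = new HbGraphSketch();
        g.addEdge("w@caller", "RPC-call");    // program order on the caller
        g.addEdge("RPC-call", "RPC-begin");   // RPC rule: call before handler begin
        g.addEdge("RPC-begin", "r@handler");  // program order in the handler
        g.addEdge("r@handler", "RPC-end");
        g.addEdge("RPC-end", "RPC-rt");       // RPC rule: handler end before return
        return g;
    }

    public static void main(String[] args) {
        HbGraphSketch g = rpcExample();
        System.out.println(g.happensBefore("w@caller", "r@handler"));
        System.out.println(g.concurrent("w@other", "r@handler"));
    }
}
```

In this example the caller's write is ordered before the handler's read, while a write from an unrelated thread (w@other, with no edges) is concurrent with it.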
69
Local rules
              Standard          Customized
Synchronous   Thread fork/join  While-loop
Asynchronous  Event-related     n/a
72
Trace (C1: How to handle the huge amount of memory accesses?)
Selective tracing: only memory accesses in event/message handlers and their callees.
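Selective tracing can be pictured as a filter in front of the trace log. The sketch below is a single-threaded Java illustration under stated assumptions: the class, its methods, and the location names are hypothetical, and a real tracer would need per-thread state and actual instrumentation hooks.

```java
import java.util.ArrayList;
import java.util.List;

class SelectiveTracerSketch {
    private int handlerDepth = 0;   // > 0 while inside a handler or its callees
    private final List<String> trace = new ArrayList<>();

    void enterHandler() { handlerDepth++; }
    void exitHandler()  { handlerDepth--; }

    // Called at every instrumented memory access; logs only inside handlers.
    void onAccess(String kind, String location) {
        if (handlerDepth > 0) {
            trace.add(kind + " " + location);
        }
    }

    List<String> trace() { return trace; }

    static List<String> demo() {
        SelectiveTracerSketch t = new SelectiveTracerSketch();
        t.onAccess("w", "startupConfig");   // outside any handler: dropped
        t.enterHandler();                   // e.g. an RPC handler begins
        t.onAccess("r", "jMap");            // inside a handler: kept
        t.exitHandler();
        t.onAccess("w", "metricsCounter");  // outside again: dropped
        return t.trace();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```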
73
HB (C2: What’s the happens-before model?)
Local and Distributed rules
[1] Raychev. Effective Race Detection for Event-Driven Programs. In OOPSLA’13
77
Triage (C3: How to estimate the distributed impact of a race?)
Machine B:
//RPC thread
Task getTask(jID){
  ...
  return jMap.get(jID);
}
//UnReg thread
void unReg(jID){
  jMap.remove(jID);
  ....
}
Machine A:
while(!getTask(jID)){
}
Outcome: hang
80
Trigger (C4: How to trigger with distributed time manipulation?)
(Figure: two threads and an event thread handling events e1 and e2; accesses w and r; a +sleep is injected)
83
Trigger (C4: How to trigger with distributed time manipulation?)
(Figure: threads on Machine A, Machine B, and Machine C plus an event thread handling events e1 and e2; accesses w and r; +sleep injected at four points)
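The +sleep idea can be sketched as delaying one side of a candidate race so that the other side goes first. This Java sketch is an illustration only: the class and method names are assumptions, it runs on one machine with two threads, and a sleep makes an order very likely rather than guaranteed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class SleepTriggerSketch {
    // Runs a writer and a reader thread, delaying each by the given number of
    // milliseconds before its access, and returns the observed access order.
    static List<String> run(long sleepBeforeWriteMs, long sleepBeforeReadMs) {
        List<String> order = Collections.synchronizedList(new ArrayList<String>());
        Thread writer = new Thread(() -> access(order, "w", sleepBeforeWriteMs));
        Thread reader = new Thread(() -> access(order, "r", sleepBeforeReadMs));
        writer.start();
        reader.start();
        try {
            writer.join();
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return order;
    }

    private static void access(List<String> order, String op, long sleepMs) {
        try {
            Thread.sleep(sleepMs);   // the injected "+sleep"
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        order.add(op);               // stands in for the racing memory access
    }

    public static void main(String[] args) {
        System.out.println("delay write: " + run(300, 0));
        System.out.println("delay read:  " + run(0, 300));
    }
}
```

Running both delay settings exercises both orders of the same pair of accesses, which is what lets a tool check whether either order leads to a failure.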
85
Methodology
- Benchmarks:
– 7 real-world DCbugs from TaxDC [1]
– 4 distributed systems
[1] Leesatapornwongsa. TaxDC. In ASPLOS’16
88
Overall results
BugID Detected? #. Bugs #. Benign #. false-pos
CA-1011 ✔ 3
HB-4539 ✔ 3 1
HB-4729 ✔ 4 1
MR-3274 ✔ 2 4
MR-4637 ✔ 1 2 4
ZK-1144 ✔ 5 1 1
ZK-1270 ✔ 6 2
Total 20 (= 12 + 8) 5 7
90
Other results in our paper
- Performance overhead
- Trace compositions
- HB model impact
– False positives
– False negatives
- …
92
Conclusion
- An HB Model for distributed systems
- DCatch detects DCbugs from correct runs with low false-positive rates.
(Figure: the Trace, HB, Triage, Trigger pipeline addressing challenges C1 to C4; Local and Distributed rules)