DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

1 DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems
Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian*

2 Cloud systems

7 Distributed concurrency bugs (DCbugs)
• Unexpected timing among distributed operations
• Example: MapReduce-3274
[Figure: two message timelines among nodes A, B, and C; one of them ends in a hang]

8 DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
  – 26% of failures are caused by non-determinism [1]
  – 6% of software bugs in cloud systems [2]
[1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14
[2] Gunawi. What Bugs Live in the Cloud? In SoCC’14
[3] Leesatapornwongsa. TaxDC. In ASPLOS’16

16 DCbugs need to be tackled
• Common in distributed systems [1, 2, 3]
• Difficult to avoid, expose and diagnose
  Hadoop Map/Reduce / MAPREDUCE-3274 “That is one monster of a race!”
  HBase / HBASE-4397 “There isn’t a week going by without new bugs about races.”
  HBase / HBASE-6147 “We have already found and fix many cases, however it seems exist many other cases.”
  Hadoop Map/Reduce / MAPREDUCE-4819 “This has become quite messy, sigh.”
  Hadoop Map/Reduce / MAPREDUCE-4099 “Great catch, Sid! Apologies for missing the race condition.”
  Hadoop Map/Reduce / MAPREDUCE-3634 “We [prefer] debug crashes instead of hanging jobs.”
Can we detect DCbugs before they manifest?

17 Previous work
• Model checking
  – Works on abstracted models
  – Faces the state-space explosion issue

19 Our idea
• Follow the philosophy of traditional concurrency bug detection
[Figure: Machine 1, Machine 2, Machine 3, Machine 4]

23 Example
[Figure: nodes A, B, and C; the code below is shown for node B]
    //UnReg thread
    void unReg(jID){
      ...
      jMap.remove(jID);
      ....
    }
    //RPC thread
    Task getTask(jID){
      return jMap.get(jID);
    }
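The two handlers on slide 23 can be made concrete with a small, self-contained sketch. It is not the Hadoop code: the class `JobMapRace`, the `register` helper, and the use of `String` for both job IDs and tasks are assumptions made so the example compiles. It runs the same two operations in both orders and shows that `getTask` returns `null` once `unReg` has won the race, which is the kind of unexpected value a caller may not be prepared for.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the node-B state on slide 23 (not the Hadoop code).
class JobMapRace {
    // Plays the role of the slide's jMap: job ID -> task.
    private final Map<String, String> jMap = new ConcurrentHashMap<>();

    // Assumed helper so the example has something to look up.
    void register(String jID, String task) { jMap.put(jID, task); }

    // "UnReg thread" side: drops the job.
    void unReg(String jID) { jMap.remove(jID); }

    // "RPC thread" side: returns null if unReg has already run.
    String getTask(String jID) { return jMap.get(jID); }

    public static void main(String[] args) throws InterruptedException {
        // Ordering 1: the lookup happens before the unregistration.
        JobMapRace first = new JobMapRace();
        first.register("job1", "task-for-job1");
        String early = first.getTask("job1");
        first.unReg("job1");

        // Ordering 2: the unregistration runs on its own thread and is
        // forced to finish (join) before the lookup.
        JobMapRace second = new JobMapRace();
        second.register("job1", "task-for-job1");
        Thread unRegThread = new Thread(() -> second.unReg("job1"));
        unRegThread.start();
        unRegThread.join();
        String late = second.getTask("job1");

        System.out.println("getTask before unReg: " + early); // task-for-job1
        System.out.println("getTask after unReg:  " + late);  // null
    }
}
```

The map itself is thread-safe here, so neither ordering corrupts memory. The problem is purely one of timing: each individual access is fine, but one of the two orders produces a result the requester may not handle.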

26 Local concurrency bug detection
Is the problem solved?

35 Local concurrency bug detection
[Figure: threads T1, T2, T3 with a racing read (r) and write (w); pipeline: Trace → HB → Triage (assert(r)) → Trigger (+sleep)]
Challenges
C1 (Trace): How to handle the huge amount of memory accesses?
C2 (HB): What’s the happens-before model?
C3 (Triage): How to estimate the distributed impact of a race?
C4 (Trigger): How to trigger with distributed time manipulation?
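The HB step of this pipeline asks whether two accesses are ordered or concurrent. The slides do not spell out DCatch's own happens-before model, so the sketch below substitutes plain vector clocks, a standard textbook technique, to illustrate the question being asked. The class `VectorClockDemo`, the four clock components, and the assumption that node A sends the unregister request while node C sends the `getTask` RPC are all hypothetical choices for illustration.

```java
import java.util.Arrays;

// Textbook vector clocks, used here as a stand-in for DCatch's HB model.
final class VectorClockDemo {
    // a happens-before b: every component a[i] <= b[i], and at least one is smaller.
    static boolean happensBefore(int[] a, int[] b) {
        boolean strictlySmaller = false;
        for (int i = 0; i < a.length; i++) {
            if (a[i] > b[i]) return false;
            if (a[i] < b[i]) strictlySmaller = true;
        }
        return strictlySmaller;
    }

    // Two events are concurrent if neither is ordered before the other.
    static boolean concurrent(int[] a, int[] b) {
        return !Arrays.equals(a, b) && !happensBefore(a, b) && !happensBefore(b, a);
    }

    // Local step on component `self`.
    static int[] tick(int[] clock, int self) {
        int[] next = clock.clone();
        next[self]++;
        return next;
    }

    // Message receipt on `self`: component-wise max with the sender's clock, then tick.
    static int[] receive(int[] local, int[] fromMessage, int self) {
        int[] merged = new int[local.length];
        for (int i = 0; i < local.length; i++) {
            merged[i] = Math.max(local[i], fromMessage[i]);
        }
        merged[self]++;
        return merged;
    }

    public static void main(String[] args) {
        // Assumed components: node A, node C, and the two handler threads on node B.
        final int A = 0, C = 1, B_UNREG = 2, B_RPC = 3;
        int[] zero = new int[4];

        int[] sendA = tick(zero, A);                    // A sends the unregister request
        int[] sendC = tick(zero, C);                    // C sends the getTask RPC
        int[] remove = receive(zero, sendA, B_UNREG);   // B's UnReg thread runs jMap.remove
        int[] get = receive(zero, sendC, B_RPC);        // B's RPC thread runs jMap.get

        System.out.println("send(A) happens-before remove: " + happensBefore(sendA, remove)); // true
        System.out.println("remove and get concurrent:     " + concurrent(remove, get));      // true
    }
}
```

A message send is ordered before its handler, but nothing orders the two handlers on B against each other, so the `remove`/`get` pair is flagged as a race candidate. Whether that candidate is harmful is the job of the Triage and Trigger steps.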

37 Contribution
• A comprehensive HB Model for distributed systems
• DCatch tool detects DCbugs from correct runs
[Figure: pipeline Trace → HB → Triage → Trigger, annotated with challenges C1 to C4]
Challenges solved by DCatch
C1: How to handle the huge amount of memory accesses?
C2: What’s the happens-before model?
C3: How to estimate the distributed impact of a race?
C4: How to trigger with distributed time manipulation?

38 Contribution
• A comprehensive HB Model for distributed systems
• DCatch tool detects DCbugs from correct runs
• Evaluate on 4 systems
• Report 32 DCbugs, with 20 of them being truly harmful

39 Outline
• Motivation
• DCatch Happens-before Model
• DCatch tool
• Evaluation
• Conclusion
