DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems

  1. DCatch: Automatically Detecting Distributed Concurrency Bugs in Cloud Systems. Haopeng Liu, Guangpu Li, Jeffrey Lukman, Jiaxin Li, Shan Lu, Haryadi Gunawi, and Chen Tian*

  2. Cloud systems

  3. Cloud systems

  4. Distributed concurrency bugs (DCbugs)

  5. Distributed concurrency bugs (DCbugs) • Unexpected timing among distributed operations

  6. Distributed concurrency bugs (DCbugs) • Unexpected timing among distributed operations • Example: MapReduce-3274 (figure: nodes A, B, C)

  7. Distributed concurrency bugs (DCbugs) • Unexpected timing among distributed operations • Example: MapReduce-3274 (figure: nodes A, B, C under two timings; one ends in a hang)

  8. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] – 26% of failures caused by non-deterministic [1] – 6% of software bugs in cloud systems [2] • References: [1] Yuan. Simple Testing Can Prevent Most Critical Failures. In OSDI’14. [2] Gunawi. What Bugs Live in the Cloud? In SoCC’14. [3] Leesatapornwongsa. TaxDC. In ASPLOS’16.

  9. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose

  10. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!”

  11. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!” – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.”

  12. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!” – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.” – HBase / HBASE-6147: “We have already fix many cases, however it seems exist many other [racing] cases.”

  13. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!” – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.” – HBase / HBASE-6147: “We have already found and fix many cases, however it seems exist many other cases.” – Hadoop Map/Reduce / MAPREDUCE-4819: “This has become quite messy, sigh.”

  14. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!” – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.” – HBase / HBASE-6147: “We have already found and fix many cases, however it seems exist many other cases.” – Hadoop Map/Reduce / MAPREDUCE-4819: “This has become quite messy, sigh.” – Hadoop Map/Reduce / MAPREDUCE-4099: “Great catch, Sid! Apologies for missing the race condition.”

  15. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!” – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.” – HBase / HBASE-6147: “We have already found and fix many cases, however it seems exist many other cases.” – Hadoop Map/Reduce / MAPREDUCE-4819: “This has become quite messy, sigh.” – Hadoop Map/Reduce / MAPREDUCE-4099: “Great catch, Sid! Apologies for missing the race condition.” – Hadoop Map/Reduce / MAPREDUCE-3634: “We [prefer] debug crashes instead of hanging jobs.”

  16. DCbugs need to be tackled • Common in distributed systems [1, 2, 3] • Difficult to avoid, expose and diagnose – Hadoop Map/Reduce / MAPREDUCE-3274: “That is one monster of a race!” – HBase / HBASE-4397: “There isn’t a week going by without new bugs about races.” – HBase / HBASE-6147: “We have already found and fix many cases, however it seems exist many other cases.” – Hadoop Map/Reduce / MAPREDUCE-4819: “This has become quite messy, sigh.” – Hadoop Map/Reduce / MAPREDUCE-4099: “Great catch, Sid! Apologies for missing the race condition.” – Hadoop Map/Reduce / MAPREDUCE-3634: “We [prefer] debug crashes instead of hanging jobs.” • Can we detect DCbugs before they manifest?

  17. Previous work • Model checking – Work on abstracted models – Face state-space explosion issue

  18. Our idea • Follow the philosophy of traditional concurrency bug detection

  19. Our idea • Follow the philosophy of traditional concurrency bug detection (figure: Machine 1, Machine 2, Machine 3, Machine 4)

  20. Our idea • Follow the philosophy of traditional concurrency bug detection (figure: Machine 1, Machine 2, Machine 3, Machine 4)

  21. Our idea • Follow the philosophy of traditional concurrency bug detection (figure: Machine 1, Machine 2, Machine 3, Machine 4)

  22. Example (figure: nodes A, B, C)

  23. Example (figure: nodes A, B, C; code shown for B) • UnReg thread: void unReg(jID){ ... jMap.remove(jID); ... } • RPC thread: Task getTask(jID){ return jMap.get(jID); }

  24. Local concurrency bug detection

  25. Local concurrency bug detection

  26. Local concurrency bug detection • Is the problem solved?

  27. Local concurrency bug detection (figure: threads T1, T2, T3) • Steps: Trace

  28. Local concurrency bug detection (figure: threads T1, T2, T3) • Steps: Trace (C1) • Challenges: C1: How to handle the huge amount of mem accesses?

  29. Local concurrency bug detection (figure: threads T1, T2, T3) • Steps: Trace (C1), HB • Challenges: C1: How to handle the huge amount of mem accesses?

  30. Local concurrency bug detection (figure: threads T1, T2, T3 with a read r and a write w) • Steps: Trace (C1), HB • Challenges: C1: How to handle the huge amount of mem accesses?

  31. Local concurrency bug detection (figure: threads T1, T2, T3 with a read r and a write w) • Steps: Trace (C1), HB (C2) • Challenges: C1: How to handle the huge amount of mem accesses? C2: What’s the happens-before model?

  32. Local concurrency bug detection (figure: threads T1, T2, T3 with a read r, a write w, and assert(r)) • Steps: Trace (C1), HB (C2), Triage • Challenges: C1: How to handle the huge amount of mem accesses? C2: What’s the happens-before model?

  33. Local concurrency bug detection (figure: threads T1, T2, T3 with a read r, a write w, and assert(r)) • Steps: Trace (C1), HB (C2), Triage (C3) • Challenges: C1: How to handle the huge amount of mem accesses? C2: What’s the happens-before model? C3: How to estimate the distributed impact of a race?

  34. Local concurrency bug detection (figure: threads T1, T2, T3 with a read r, a write w, assert(r), and an inserted +sleep) • Steps: Trace (C1), HB (C2), Triage (C3), Trigger • Challenges: C1: How to handle the huge amount of mem accesses? C2: What’s the happens-before model? C3: How to estimate the distributed impact of a race?

  35. Local concurrency bug detection (figure: threads T1, T2, T3 with a read r, a write w, assert(r), and an inserted +sleep) • Steps: Trace (C1), HB (C2), Triage (C3), Trigger (C4) • Challenges: C1: How to handle the huge amount of mem accesses? C2: What’s the happens-before model? C3: How to estimate the distributed impact of a race? C4: How to trigger with distributed time manipulation?

  36. Contribution • A comprehensive HB Model for distributed systems

  37. Contribution • A comprehensive HB Model for distributed systems • DCatch tool detects DCbugs from correct runs • Steps: Trace (C1), HB (C2), Triage (C3), Trigger (C4) • Challenges solved by DCatch: C1: How to handle the huge amount of mem accesses? C2: What’s the happens-before model? C3: How to estimate the distributed impact of a race? C4: How to trigger with distributed time manipulation?

  38. Contribution • A comprehensive HB Model for distributed systems • DCatch tool detects DCbugs from correct runs • Evaluate on 4 systems • Report 32 DCbugs, with 20 of them being truly harmful

  39. Outline • Motivation • DCatch Happens-before Model • DCatch tool • Evaluation • Conclusion
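
The Trace and HB steps on the "Local concurrency bug detection" slides can be illustrated with a minimal sketch. It uses plain vector clocks, a generic technique from local race detection, and is not DCatch's actual happens-before model, which the slides do not spell out. The class `HbRaceSketch`, the `Access` type, the location name `jMap[j1]`, and the treatment of `jMap.remove` as a write and `jMap.get` as a read are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HbRaceSketch {
    // One traced memory access (hypothetical trace record, not DCatch's format).
    static final class Access {
        final String thread;              // thread that performed the access
        final String location;            // accessed location, e.g. "jMap[j1]"
        final boolean isWrite;            // true for a write, false for a read
        final Map<String, Integer> clock; // vector clock at the time of the access

        Access(String thread, String location, boolean isWrite, Map<String, Integer> clock) {
            this.thread = thread;
            this.location = location;
            this.isWrite = isWrite;
            this.clock = clock;
        }
    }

    // a happens-before b iff a's clock is <= b's in every component
    // and strictly smaller in at least one.
    static boolean happensBefore(Access a, Access b) {
        Set<String> keys = new HashSet<>(a.clock.keySet());
        keys.addAll(b.clock.keySet());
        boolean strictlySmaller = false;
        for (String k : keys) {
            int x = a.clock.getOrDefault(k, 0);
            int y = b.clock.getOrDefault(k, 0);
            if (x > y) return false;
            if (x < y) strictlySmaller = true;
        }
        return strictlySmaller;
    }

    // Two accesses race if they touch the same location from different threads,
    // at least one is a write, and neither happens before the other.
    static boolean isRace(Access a, Access b) {
        return a.location.equals(b.location)
            && !a.thread.equals(b.thread)
            && (a.isWrite || b.isWrite)
            && !happensBefore(a, b)
            && !happensBefore(b, a);
    }

    // Pairwise scan of the trace; quadratic, which is fine for a sketch.
    static List<String> findRaces(List<Access> trace) {
        List<String> races = new ArrayList<>();
        for (int i = 0; i < trace.size(); i++) {
            for (int j = i + 1; j < trace.size(); j++) {
                Access a = trace.get(i);
                Access b = trace.get(j);
                if (isRace(a, b)) {
                    races.add(a.thread + " <-> " + b.thread + " on " + a.location);
                }
            }
        }
        return races;
    }

    public static void main(String[] args) {
        // jMap.remove(jID) in the UnReg thread vs. jMap.get(jID) in the RPC thread,
        // with no ordering between the two threads.
        Access remove = new Access("UnReg", "jMap[j1]", true, Map.of("UnReg", 1));
        Access get = new Access("RPC", "jMap[j1]", false, Map.of("RPC", 1));
        System.out.println(findRaces(List.of(remove, get)));
    }
}
```

The pairwise scan also hints at challenge C1: the number of candidate pairs grows quadratically with the number of traced accesses, so tracing every memory access does not scale.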

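The Trigger step (shown on the slides as an inserted `+sleep`) forces a suspected race into a chosen order to see whether it is harmful. The sketch below replays the `unReg` / `getTask` example from slide 23 in both orders. To keep the outcome deterministic it swaps the sleep for a `CountDownLatch`; that substitution, the class `TriggerSketch`, the `String` types for `jID` and the task, and the key `"j1"` are assumptions for illustration, not DCatch's mechanism.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

class TriggerSketch {
    // Shared map from job ID to task, standing in for jMap on the slide.
    static final Map<String, String> jMap = new ConcurrentHashMap<>();

    static void unReg(String jID) { jMap.remove(jID); }

    static String getTask(String jID) { return jMap.get(jID); }

    // Waits on the latch, converting interruption into an unchecked exception.
    static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Runs the two racing operations in a forced order and returns what getTask saw.
    static String run(boolean removeFirst) {
        jMap.put("j1", "task1");
        CountDownLatch firstDone = new CountDownLatch(1);
        String[] seen = new String[1];

        Thread unRegThread = new Thread(() -> {
            if (!removeFirst) awaitQuietly(firstDone); // let the get go first
            unReg("j1");
            firstDone.countDown();
        });
        Thread rpcThread = new Thread(() -> {
            if (removeFirst) awaitQuietly(firstDone);  // let the remove go first
            seen[0] = getTask("j1");
            firstDone.countDown();
        });

        unRegThread.start();
        rpcThread.start();
        try {
            unRegThread.join();
            rpcThread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen[0];
    }

    public static void main(String[] args) {
        System.out.println("get before remove: " + run(false));
        System.out.println("remove before get: " + run(true));
    }
}
```

When the remove is forced first, `getTask` returns `null`; a caller that does not expect a missing task is where a failure such as the hang on slide 7 could start.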