  1. Towards Automatically Checking Thousands of Failures with Micro-Specifications. Haryadi S. Gunawi, Thanh Do†, Pallavi Joshi, Joseph M. Hellerstein, Andrea C. Arpaci-Dusseau†, Remzi H. Arpaci-Dusseau†, Koushik Sen. University of California, Berkeley; †University of Wisconsin, Madison.

  2. Cloud Era: solve bigger human problems using clusters of thousands of machines.

  3. Failures in the Cloud: "The future is a world of failures everywhere" - Garth Gibson. "Recovery must be a first-class operation" - Raghu Ramakrishnan. "Reliability has to come from the software" - Jeffrey Dean.

  4. (image-only slide)

  5. (image-only slide)

  6. Why is failure recovery hard? Testing is not advanced enough to cover complex failures: diverse, frequent, and multiple failures (e.g., the Facebook photo-loss incident). Recovery is underspecified: failure recovery behaviors need to be specified, even for customized, well-grounded protocols. Example: "Paxos Made Live - An Engineering Perspective" [PODC '07].

  7. Our Solutions: FTS ("FATE"), the Failure Testing Service, a new abstraction for failure exploration that systematically exercises 40,000 unique combinations of failures; and DTS ("DESTINI"), the Declarative Testing Specification, which enables concise recovery specifications (we have written 74 checks, about 3 lines per check). Note: the tool names have changed since the paper.

  8. Summary of Findings: we applied FATE and DESTINI to three cloud systems (HDFS, ZooKeeper, Cassandra), found 16 new bugs, and reproduced 74 bugs. Problems found include inconsistency, data loss, broken rack awareness, and unavailability.

  9. Outline: Introduction (done), FATE, DESTINI, Evaluation, Summary.

  10. Example: the HDFS write pipeline. A client C asks the master M to allocate a block, then streams the data through a pipeline of three datanodes (1, 2, 3); the protocol has a setup stage and a data transfer stage. With no failures, the write simply completes. If a node fails during the setup stage, recovery recreates a fresh pipeline with a replacement node (e.g., node 4). If a node fails during the data transfer stage, recovery continues on the surviving nodes; the deck notes a bug in this data transfer stage recovery. Key point: failures at different stages lead to different failure behaviors, so the goal is to exercise the different failure recovery paths.

  11. FATE: a failure injection framework that targets I/O points, systematically explores failures, and can inject multiple failures per run. It introduces a new abstraction of a failure scenario, which remembers injected failures and increases failure coverage.
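
A minimal sketch of what the injection side of such a framework might look like, assuming each injection point is identified by a numeric failure ID (made concrete on the next slide); the class name, the long-valued IDs, and the experiment driver that supplies the target set are assumptions for illustration, not FATE's actual API.

    import java.util.HashSet;
    import java.util.Set;

    // Hypothetical per-experiment injection hook (illustrative, not FATE's real code).
    // 'targets' holds the failure IDs this experiment intends to fail; 'observed'
    // remembers every I/O point seen, so the driver can schedule new (including
    // multiple-failure) combinations in later runs and increase coverage.
    final class FailureInjector {
        private final Set<Long> targets;
        private final Set<Long> observed = new HashSet<>();

        FailureInjector(Set<Long> targets) {
            this.targets = targets;
        }

        /** True if the surrounding I/O should be failed, e.g. by throwing an IOException. */
        synchronized boolean shouldFail(long failureId) {
            observed.add(failureId);
            return targets.contains(failureId);
        }

        /** Reported back to the experiment driver after each run. */
        synchronized Set<Long> observedIds() {
            return new HashSet<>(observed);
        }
    }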

  12. Failure ID (example: an I/O between node 2 and node 3):
      Static          - Func. Call: OutputStream.read(); Source File: BlockReceiver.java
      Dynamic         - Stack Trace: ...
      Domain-specific - Source: Node 2; Destination: Node 3; Net. Message: Data Packet; Failure Type: Crash After
      Hash            - 12348729
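
To make the table concrete, the fields might be grouped in code roughly as follows; the class name, field names, and the use of Objects.hash as the fingerprint are illustrative assumptions, not FATE's actual implementation.

    import java.util.Objects;

    // Illustrative grouping of the failure-ID fields from the table above.
    record FailureId(String funcCall,     // static: e.g. "OutputStream.read()"
                     String sourceFile,   // static: e.g. "BlockReceiver.java"
                     String stackTrace,   // dynamic: stack trace at the I/O point
                     String sourceNode,   // domain-specific: e.g. "Node 2"
                     String destNode,     // domain-specific: e.g. "Node 3"
                     String netMessage,   // domain-specific: e.g. "Data Packet"
                     String failureType)  // domain-specific: e.g. "Crash After"
    {
        /** One number identifies the failure point across runs (e.g. 12348729 on the slide). */
        long hash() {
            return Objects.hash(funcCall, sourceFile, stackTrace,
                                sourceNode, destNode, netMessage, failureType);
        }
    }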

  13. How do developers build a failure ID? FATE intercepts all I/Os and uses AspectJ to collect information at every I/O point: the I/O buffers (e.g., file buffer, network buffer) and the target of the I/O (e.g., file name, IP address). Domain-specific information is reverse-engineered from this data.

  14. Failure ID (revisiting the example table from slide 12).

  15. Exploring the failure space: the pipeline exposes several failure points (A, B, C). FATE runs one experiment per single failure (Exp #1 injects A, Exp #2 injects B, Exp #3 injects C), and then experiments that inject combinations of failures (AB, AC, BC), systematically covering the space.
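
Read this way, the exploration is a brute-force enumeration over the failure IDs observed so far: each single ID becomes one experiment, then each pair becomes another. A toy sketch of that driver logic, reusing the numeric IDs from the earlier sketches (the class and method names are assumptions):

    import java.util.ArrayList;
    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Toy enumeration of experiments from observed failure IDs, as on the slide:
    // Exp #1: A, Exp #2: B, Exp #3: C, then the pairs AB, AC, BC.
    final class ExperimentPlanner {
        static List<Set<Long>> plan(List<Long> observedIds) {
            List<Set<Long>> experiments = new ArrayList<>();
            for (Long single : observedIds) {                   // single failures
                experiments.add(new LinkedHashSet<>(List.of(single)));
            }
            for (int i = 0; i < observedIds.size(); i++) {      // pairs of failures
                for (int j = i + 1; j < observedIds.size(); j++) {
                    experiments.add(new LinkedHashSet<>(
                            List.of(observedIds.get(i), observedIds.get(j))));
                }
            }
            return experiments;   // each set becomes one FailureInjector target set
        }
    }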

  16. Outline: Introduction (done), FATE (done), DESTINI, Evaluation, Summary.

  17. DESTINI enables concise recovery specifications: it checks whether expected behaviors match actual behaviors. Its important elements are expectations, facts, failure events, and check timing. It interposes on network and disk protocols.

  18. Writing specifications. The idea: a violation occurs if an expectation differs from the actual facts. In rule form: violationTable() :- expectationTable(), NOT-IN actualTable(). Datalog syntax: ":-" means derivation, and "," means AND.

  19. Example check: correct vs. incorrect recovery of the pipeline after a failure. The expectedNodes table lists (Block, Node) pairs, here (B, Node 1) and (B, Node 2). The check is: incorrectNodes(B, N) :- expectedNodes(B, N), NOT-IN actualNodes(B, N); i.e., a node that is expected to hold a block but does not actually hold it marks incorrect recovery.
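
The rule above is essentially a set difference: a (block, node) pair is a violation if it is expected but not among the actual facts. A minimal sketch of that semantics in Java (DESTINI itself expresses checks as Datalog-style rules; the names here just mirror the rule):

    import java.util.HashSet;
    import java.util.Set;

    // incorrectNodes = expectedNodes \ actualNodes, over (block, node) pairs.
    final class IncorrectNodesCheck {
        record BlockNode(String block, String node) {}

        static Set<BlockNode> incorrectNodes(Set<BlockNode> expectedNodes,
                                             Set<BlockNode> actualNodes) {
            Set<BlockNode> violations = new HashSet<>(expectedNodes);
            violations.removeAll(actualNodes);   // expected but not actual => violation
            return violations;
        }

        public static void main(String[] args) {
            Set<BlockNode> expected = Set.of(new BlockNode("B", "Node 1"),
                                             new BlockNode("B", "Node 2"));
            Set<BlockNode> actual = Set.of(new BlockNode("B", "Node 1")); // node 2's copy is missing
            System.out.println(incorrectNodes(expected, actual));         // reports (B, Node 2)
        }
    }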
