
Storms in the Cloud: Designing and using a fault injection system



  1. Storms in the Cloud: Designing and using a fault injection system • Michalis Zervos • @mzervos

  2. Sources: http://venturebeat.com/ • http://www.pcworld.com/ • Google News • http://thenextweb.com/ • https://twitter.com/netflixhelps • http://mashable.com/ • http://www.itnews.com.au

  3. Service Resilience • Not a solved problem • Goal is: • 100% uptime • No degradation • Responsive 

  4. Traditional testing • Unit tests • Functional / Integration • End to end 

  5. Cloud services – Testing challenges • Continuous evolution • Multiple dependencies • Global distribution • Traffic fluctuation 

  6. Cloud services – Fundamentals • Auto-scaling • Redundancy • Monitoring and detection systems • Auto-mitigation / Failover mechanisms • Staged deployments • Data replication 

  7. The extra mile • Embrace failure • Break the system • Adjust the engineering process 

  8. Storms in the Cloud

  9. Fault Injection System • Support diverse services • Easy to use • Verify resilience and behavior • Simulate complex failures / real-life incidents

  10. Agenda • Designing a Fault Injection System • Usage patterns

  11. Faults • Resource pressure • Network • Processes • Virtual machine • Platform • Application specific • Hardware


  13. Resource Pressure Faults • CPU • Memory (physical, virtual) • Hard disk (capacity, read, write) • Available tools: consume.exe (Windows SDK), stress (Unix), Sysinternals tools
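The tools listed on this slide are the usual way to generate resource pressure. Purely as an illustration of the idea, the sketch below holds a fixed amount of memory and spins one CPU core for a bounded time, then releases both. The function names, the sizes, and the 4 KiB page size are assumptions made for the example, not something taken from the talk.

```python
import time

PAGE_SIZE = 4096  # assumed page size; used only to touch the allocation


def memory_pressure(megabytes: int, hold_seconds: float) -> int:
    """Allocate roughly `megabytes` MiB, hold it, then release it.

    Returns the number of bytes that were held.
    """
    block = bytearray(megabytes * 1024 * 1024)
    # Write one byte per page so the OS has to back the allocation
    # with real memory instead of leaving it lazily mapped.
    for offset in range(0, len(block), PAGE_SIZE):
        block[offset] = 1
    held = len(block)
    time.sleep(hold_seconds)
    del block  # removing the fault means releasing the memory
    return held


def cpu_pressure(duration_seconds: float) -> int:
    """Busy-loop on one core until the deadline; returns iterations done."""
    deadline = time.monotonic() + duration_seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

Both functions are time-bounded on purpose: a pressure fault that cannot end by itself is hard to remove safely.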


  15. Network faults • Layers: Transport (TCP/UDP), Application layer (HTTP) • Types: Disconnect, Latency, Alter response codes (HTTP), Packet reorder / loss (TCP/UDP) • Filters: Domain / IP / Subnet, URL path, Port / Protocol • Available tools: Network Emulator Toolkit (NEWT), FiddlerCore
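One way to picture how the filters and fault types on this slide combine at the HTTP layer is a rule that first decides whether a request is in scope and then says what to do to it. The sketch below is a hypothetical model only: `HttpFaultRule`, `apply_rules`, and the example hosts are invented for illustration and are not the API of NEWT or FiddlerCore.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class HttpFaultRule:
    """Hypothetical rule: which requests to hit, and what to do to them."""
    domain: Optional[str] = None           # filter: exact host match
    path_prefix: Optional[str] = None      # filter: URL path prefix
    port: Optional[int] = None             # filter: destination port
    latency_ms: int = 0                    # fault: added delay
    status_override: Optional[int] = None  # fault: altered HTTP response code

    def matches(self, url: str) -> bool:
        parts = urlsplit(url)
        default_port = 443 if parts.scheme == "https" else 80
        port = parts.port or default_port
        if self.domain is not None and parts.hostname != self.domain:
            return False
        if self.path_prefix is not None and not parts.path.startswith(self.path_prefix):
            return False
        if self.port is not None and port != self.port:
            return False
        return True


def apply_rules(url: str, status: int, rules: List[HttpFaultRule]) -> Tuple[int, int]:
    """Return (status, added_latency_ms) after the first matching rule."""
    for rule in rules:
        if rule.matches(url):
            new_status = status if rule.status_override is None else rule.status_override
            return new_status, rule.latency_ms
    return status, 0
```

A rule with no filters set matches every request, so a real system would likely require at least one filter before accepting a rule.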


  17. Process faults • Stop / Kill • Restart • Stop service • Start • Crash • Hang • Available tools: OS commands, Sysinternals tools
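The kill and restart faults above can be tried out locally with nothing more than the Python standard library. In this sketch the "service" is a stand-in child process that only sleeps; `start_victim`, `kill_process`, and `restart_process` are hypothetical helper names, and a real agent would target an existing process rather than one it started itself.

```python
import subprocess
import sys


def start_victim() -> subprocess.Popen:
    """Start a stand-in 'service': a child Python process that just sleeps."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc: subprocess.Popen, timeout: float = 5.0) -> int:
    """Inject a 'kill' fault and return the process exit code."""
    proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
    return proc.wait(timeout=timeout)


def restart_process(proc: subprocess.Popen) -> subprocess.Popen:
    """Inject a 'restart' fault: kill the process, then start a fresh one."""
    kill_process(proc)
    return start_victim()
```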


  19. Virtual Machine / OS faults • Stop • Restart • BSOD / Kernel panic • Change date • Re-image • Available tools: Cloud Management APIs, OS commands, Sysinternals tools


  21. Distributed platform faults • Quorum loss • Data loss • Move primary node • Remove replica • Available tools (platform specific): Service Fabric testability APIs


  23. Application specific faults • Hooks (instrument service code) • Intercept / Re-route calls (no access to service code) • Available tools: MSR Detours, TestApi – Managed Fault Injection



  26. Hardware faults • Machine • Network devices • Rack • UPS • Datacenter


  28. Injection mechanism • VM External • VM Internal – Service code external → Agent • VM Internal – Service code internal → Hooks


  30. External injection • VM / Region Stop • VM / Region Restart • Re-image (Diagram: Cloud Management Service, Target VM)


  32. VM internal injection - Agent • Resource pressure • Network • Processes • OS • Detours • … (Diagram: Target Service VM with Operating System, Fault Agent, Target Application)


  34. VM internal injection - Hooks • Application behavior • Flexibility • Service specific (Diagram: Target Application)
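A hook, in the sense used on this slide, is a point inside the service code where a fault can be switched on from outside. The sketch below is one minimal way to do that, assuming an in-process registry. The names `fault_point`, `inject`, `remove`, and the `profile_store.read` hook are hypothetical; in a real deployment a fault agent or the management service would presumably drive the registry.

```python
import threading
from typing import Callable, Dict

# Registry of active faults, keyed by hook name (an assumption of this sketch).
_active_faults: Dict[str, Callable[[], None]] = {}
_lock = threading.Lock()


def inject(hook_name: str, action: Callable[[], None]) -> None:
    """Activate a fault: `action` runs whenever the named hook is reached."""
    with _lock:
        _active_faults[hook_name] = action


def remove(hook_name: str) -> None:
    """Deactivate a fault; safe to call even if none is active."""
    with _lock:
        _active_faults.pop(hook_name, None)


def fault_point(hook_name: str) -> None:
    """Called by service code at places where faults may be injected."""
    with _lock:
        action = _active_faults.get(hook_name)
    if action is not None:
        action()  # may sleep, raise, or corrupt state, depending on the fault


def load_profile(user_id: str) -> dict:
    """Example service function instrumented with one hook."""
    fault_point("profile_store.read")
    return {"id": user_id, "name": "example"}
```

When no fault is active the hook costs one dictionary lookup, which is what makes it reasonable to leave such hooks in shipping code.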


  36. Hooks • Fault Agent • VM External

  37. System Architecture (Diagram: Fault Management Service, Cloud Management Service, Target Service VMs with Fault Agents)

  38. System components • Auditing • Automation • Security • Verification • Reporting


  40. Security and Safety • AuthN / AuthZ • Fault agents • Kill switch • Safety nets

  41. Security and Safety: AuthN / AuthZ • Integrate with Identity Provider (Azure Active Directory) • Multi-Factor Authentication • Least-privilege principle • Granular access levels
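Least privilege and granular access levels can be illustrated with a check that caps each caller at the highest environment they may inject into. Everything in this sketch is hypothetical: the `Environment` ordering, the group names in `GRANTS`, and `may_inject` are invented for the example, and in practice the grants would presumably come from the identity provider rather than a dictionary.

```python
from enum import IntEnum
from typing import Dict, Optional


class Environment(IntEnum):
    TEST = 1
    CANARY = 2
    PRODUCTION = 3


# Hypothetical grants: the highest environment each caller may inject into.
GRANTS: Dict[str, Environment] = {
    "dev-team": Environment.TEST,
    "release-engineers": Environment.CANARY,
    "resilience-oncall": Environment.PRODUCTION,
}


def may_inject(caller: str, target: Environment,
               grants: Optional[Dict[str, Environment]] = None) -> bool:
    """Unknown callers get nothing; known callers are capped at their grant."""
    grants = GRANTS if grants is None else grants
    allowed = grants.get(caller)
    return allowed is not None and target <= allowed
```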

  42. Security and Safety: Fault agents • Secure communication – TLS/SSL • Code signing • Execution permissions


  44. Security and Safety: Safety nets (auto fault removal) • Agents – Service connectivity loss: agent-side detection • Service malfunctioning: auto-monitoring module • Unusual behavior: anomaly detection
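Agent-side detection of connectivity loss can be sketched as a heartbeat timeout: if the agent has not heard from the management service for too long, it undoes every fault it injected. The class below is a minimal illustration under that assumption. `FaultAgentSafetyNet`, its method names, and the timeout value are hypothetical, and the clock is injectable only so the behavior can be exercised without waiting.

```python
import time
from typing import Callable, List


class FaultAgentSafetyNet:
    """Removes all active faults if the management service goes silent."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._last_heartbeat = clock()
        self._undo_actions: List[Callable[[], None]] = []

    def register_fault(self, undo: Callable[[], None]) -> None:
        """Remember how to undo a fault at the moment it is injected."""
        self._undo_actions.append(undo)

    def heartbeat(self) -> None:
        """Record that the management service was reachable just now."""
        self._last_heartbeat = self._clock()

    def kill_switch(self) -> int:
        """Undo every active fault, most recent first; returns how many."""
        count = 0
        while self._undo_actions:
            self._undo_actions.pop()()
            count += 1
        return count

    def check(self) -> int:
        """Call periodically; fires the kill switch on connectivity loss."""
        if self._clock() - self._last_heartbeat > self._max_silence:
            return self.kill_switch()
        return 0
```

The same `kill_switch` method can also be called directly by an operator, so the manual and automatic removal paths share one code path.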



  47. Auditing • Faults • Fault agents • Management service • Clients
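An audit trail covering faults, agents, the management service, and clients could be as simple as one structured record per action. The sketch below shows one possible shape for such a record as a JSON line. The field names and the `audit_record` helper are assumptions made for illustration, not the format used by the system in the talk.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(actor: str, component: str, action: str, target: str,
                 when: Optional[datetime] = None) -> str:
    """Build one audit entry as a JSON line (hypothetical schema)."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "timestamp": when.isoformat(),
        "actor": actor,          # who asked for the action
        "component": component,  # client, management service, or fault agent
        "action": action,        # for example "inject" or "remove"
        "target": target,        # the VM or service affected
    }
    return json.dumps(entry, sort_keys=True)
```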



  50. Automation • Scheduling • Zero-configuration • Dependencies auto-discovery
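Scheduling can be pictured as a timeline of faults, each with a start offset and a duration, from which the system works out what should be active at any moment. The sketch below models only that calculation. `ScheduledFault` and `active_faults` are hypothetical names, and overlapping entries are how a multi-fault incident would be expressed in this model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScheduledFault:
    name: str
    start: float     # seconds since the start of the run
    duration: float  # seconds the fault stays active

    def active_at(self, t: float) -> bool:
        """True while start <= t < start + duration."""
        return self.start <= t < self.start + self.duration


def active_faults(schedule: List[ScheduledFault], t: float) -> List[str]:
    """Names of all faults that should be active at time `t`."""
    return [fault.name for fault in schedule if fault.active_at(t)]
```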




  54. Usage scenarios • Resilience verification • Test new features • Training • Verify staged deployments • Test detection, alerting, mitigation systems • Repro incidents

  55. Injection environment • Test • Canary • Production

  56. Recovery Games

  57. Recovery Games • Attacker: Inject faults, Provide hints • Defender: Assess, Analyze, Mitigate

  58. Recovery Games - Goals • Familiarize with monitoring tools • Recognize outage patterns • Train on assessing the impact • Root-cause / mitigation mindset • Practice log analysis

  59. Invest in Fault Injection Testing • Resilience verification • Test new features • Training • Engineering process & culture • Michalis Zervos • @mzervos
