Storms in the Cloud: Designing and using a fault injection system
Michalis Zervos @mzervos
Sources: http://venturebeat.com/ • http://www.pcworld.com/ • Google News - http://thenextweb.com/ • https://twitter.com/netflixhelps • Google News - http://mashable.com/ • Google News - http://www.itnews.com.au
Service Resilience • Not a solved problem • Goal is: • 100% uptime • No degradation • Responsive
Traditional testing • Unit tests • Functional / Integration • End to end
Cloud services – Testing challenges • Continuous evolution • Multiple dependencies • Global distribution • Traffic fluctuation
Cloud services – Fundamentals • Auto-scaling • Redundancy • Monitoring and detection systems • Auto-mitigation / Failover mechanisms • Staged deployments • Data replication
The extra mile • Embrace failure • Break the system • Adjust the engineering process
Storms in the Cloud
Fault Injection System • Support diverse services • Easy to use • Verify resilience and behavior • Simulate complex failures / real-life incidents
Agenda • Designing a Fault Injection System • Usage patterns
Faults • Resource pressure • Network • Processes • Virtual machine • Platform • Application specific • Hardware
Resource Pressure Faults • CPU • Memory: Physical, Virtual • Hard disk: Capacity, Read, Write
Available tools • consume.exe (Windows SDK) • stress (Unix) • Sysinternals tools
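As a rough illustration of what a CPU or memory pressure fault does, here is a minimal Python sketch. It is not one of the tools named on the slide; `burn_cpu` and `hold_memory` are hypothetical helper names, and a real agent would typically run one worker per core and cap the amount of memory it takes.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Busy-loop on one core for roughly `seconds`; returns the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def hold_memory(megabytes: int) -> bytearray:
    """Allocate `megabytes` MiB and touch each 4 KiB page so it is committed."""
    block = bytearray(megabytes * 1024 * 1024)
    for offset in range(0, len(block), 4096):
        block[offset] = 1
    return block


if __name__ == "__main__":
    held = hold_memory(16)   # keep the reference alive while the fault is active
    burn_cpu(0.5)
    del held                 # dropping the reference removes the pressure
```

Removing the fault is simply a matter of stopping the loop and releasing the buffer, which is one reason resource pressure is a comparatively safe fault to start with.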
Network faults • Layers: Transport (TCP/UDP), Application layer (HTTP) • Types: Disconnect, Latency, Alter response codes (HTTP), Packet reorder / loss (TCP/UDP) • Filters: Domain / IP / Subnet, URL path, Port / Protocol
Available tools • Network Emulator Toolkit (NEWT) • Fiddler core
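The combination of filters and fault types on this slide can be sketched as a rule object that an HTTP-layer proxy might consult. This is an assumption about one possible shape, not the API of NEWT or Fiddler; `HttpFaultRule` and `apply_rule` are hypothetical names, and the function only computes the outcome rather than touching real traffic.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class HttpFaultRule:
    """Hypothetical rule: which requests to hit, and what to do to them."""
    domain: Optional[str] = None           # filter: exact host match
    path_prefix: Optional[str] = None      # filter: URL path prefix
    port: Optional[int] = None             # filter: destination port
    latency_ms: int = 0                    # fault: added delay
    status_override: Optional[int] = None  # fault: altered HTTP response code
    disconnect: bool = False               # fault: drop the connection

    def matches(self, url: str) -> bool:
        parts = urlsplit(url)
        # Fall back to the scheme's default port when none is given.
        port = parts.port or {"http": 80, "https": 443}.get(parts.scheme)
        if self.domain is not None and parts.hostname != self.domain:
            return False
        if self.path_prefix is not None and not parts.path.startswith(self.path_prefix):
            return False
        if self.port is not None and port != self.port:
            return False
        return True


def apply_rule(rule: HttpFaultRule, url: str, status: int) -> dict:
    """Return the outcome a proxy would enforce for one response."""
    if not rule.matches(url):
        return {"delay_ms": 0, "status": status, "dropped": False}
    if rule.disconnect:
        return {"delay_ms": rule.latency_ms, "status": None, "dropped": True}
    new_status = rule.status_override if rule.status_override is not None else status
    return {"delay_ms": rule.latency_ms, "status": new_status, "dropped": False}
```

Keeping the filter separate from the fault makes it easy to scope an experiment to a single dependency, for example one URL path on one domain, while leaving the rest of the traffic untouched.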
Process faults • Stop / Kill • Restart • Stop service • Start • Crash • Hang
Available tools • OS commands • Sysinternals tools
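A kill or restart fault can be sketched with Python's standard `subprocess` module. The target here is a stand-in child process that just sleeps; `start_target`, `kill_process`, and `restart_process` are hypothetical helper names, and a real agent would look up an existing process or service instead of spawning its own.

```python
import subprocess
import sys


def start_target() -> subprocess.Popen:
    """Start a stand-in target process that would otherwise run for a minute."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Forcefully kill the process and return its exit code."""
    proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
    return proc.wait(timeout=timeout)


def restart_process(proc: subprocess.Popen) -> subprocess.Popen:
    """Kill the process, then start a fresh instance."""
    kill_process(proc)
    return start_target()
```

A forced kill gives the target no chance to clean up, which is usually the point: it approximates a crash rather than a graceful shutdown.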
Virtual Machine / OS faults • Stop • Restart • BSOD / Kernel panic • Change date • Re-image
Available tools • Cloud Management APIs • OS commands • Sysinternals tools
Distributed platform faults • Quorum loss • Data loss • Move primary node • Remove replica
Available tools – Platform specific • Service Fabric testability APIs
Application specific faults • Hooks: Instrument service code • Intercept / Re-route calls: No access to service code
Available tools • MSR Detours • TestApi – Managed Fault Injection
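The "instrument service code" option can be illustrated with a small decorator that marks injection points inside the service. This is a sketch under assumptions, not how Detours or TestApi work; the registry `_active_faults`, the decorator `fault_hook`, and the sample function `read_profile` are all hypothetical.

```python
import functools
from typing import Dict

# Hypothetical in-process registry: hook name -> exception to raise.
_active_faults: Dict[str, Exception] = {}


def fault_hook(name: str):
    """Mark a function as an injection point identified by `name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            fault = _active_faults.get(name)
            if fault is not None:
                raise fault  # fail the call instead of running the real code
            return func(*args, **kwargs)
        return wrapper
    return decorator


def inject(name: str, error: Exception) -> None:
    """Activate a fault at the named hook."""
    _active_faults[name] = error


def remove(name: str) -> None:
    """Deactivate the fault at the named hook, if any."""
    _active_faults.pop(name, None)


@fault_hook("storage.read")
def read_profile(user_id: int) -> dict:
    """Stand-in for a real service call."""
    return {"id": user_id, "name": "example"}
```

Calling `inject("storage.read", TimeoutError("injected"))` makes every later `read_profile` call raise until `remove("storage.read")` is called. The trade-off is that hooks have to be written and maintained per service.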
Hardware faults • Machine • Network devices • Rack • UPS • Datacenter
Injection mechanism • VM External • VM Internal – Service code external: Agent • VM Internal – Service code internal: Hooks
External injection • VM / Region Stop • VM / Region Restart • Re-image
[Diagram labels: Cloud Management Service, Target VM]
VM internal injection - Agent • Resource pressure • Network • Processes • OS • Detours • …
[Diagram labels: Virtual Machine, Fault Agent, Target Service VM, Operating System, Target Application]
VM internal injection - Hooks • Application behavior • Flexibility • Service specific
[Diagram label: Target Application]
[Diagram labels: Hooks, Fault Agent, VM External]
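One way to read the three injection slides is as a routing table from fault kind to mechanism. The sketch below encodes that reading; the key names in `ROUTES` and the `route` function are illustrative assumptions, not part of the system described in the talk.

```python
from enum import Enum


class Mechanism(Enum):
    VM_EXTERNAL = "cloud management service"
    AGENT = "fault agent"
    HOOKS = "in-service hooks"


# Mapping read off the three injection slides; the key names are illustrative.
ROUTES = {
    "vm_stop": Mechanism.VM_EXTERNAL,
    "vm_restart": Mechanism.VM_EXTERNAL,
    "reimage": Mechanism.VM_EXTERNAL,
    "resource_pressure": Mechanism.AGENT,
    "network": Mechanism.AGENT,
    "process": Mechanism.AGENT,
    "os": Mechanism.AGENT,
    "application_behavior": Mechanism.HOOKS,
}


def route(fault_kind: str) -> Mechanism:
    """Pick the injection mechanism for a fault kind, or fail loudly."""
    try:
        return ROUTES[fault_kind]
    except KeyError:
        raise ValueError(f"no injection mechanism registered for {fault_kind!r}") from None
```

Rejecting unknown fault kinds instead of guessing a default keeps a typo in a request from turning into an unintended fault.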
System Architecture
[Diagram labels: Fault Management Service, Cloud Management Service, Fault Agent, Target Service VMs]
System components • Auditing • Automation • Security • Verification • Reporting
Security and Safety • AuthN / AuthZ • Fault agents • Kill switch • Safety nets
Security and Safety • AuthN / AuthZ: Integrate with Identity Provider (Azure Active Directory), Multi-Factor Authentication, Least-privilege principle, Granular access levels • Fault agents • Kill switch • Safety nets
Security and Safety • AuthN / AuthZ • Fault agents: Secure communication – TLS/SSL, Code signing, Execution permissions • Kill switch • Safety nets
Security and Safety • AuthN / AuthZ • Fault agents • Kill switch • Safety nets
Auto fault removal • Agents – Service connectivity loss: Agent-side detection • Service malfunctioning: Auto-monitoring module • Unusual behavior: Anomaly detection
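Agent-side detection of connectivity loss, together with a kill switch, could be built on leases: each fault expires unless the management service keeps renewing it. The sketch below is one possible design, not the talk's implementation; the class `FaultLeases` and its method names are hypothetical, and the clock is injectable so the behavior can be checked without waiting.

```python
import time
from typing import Callable, Dict, List


class FaultLeases:
    """Agent-side bookkeeping: every fault expires unless its lease is renewed."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expiry: Dict[str, float] = {}             # fault id -> expiry time
        self._undo: Dict[str, Callable[[], None]] = {}  # fault id -> cleanup

    def register(self, fault_id: str, undo: Callable[[], None], ttl: float) -> None:
        """Track an injected fault together with the callable that removes it."""
        self._undo[fault_id] = undo
        self._expiry[fault_id] = self._clock() + ttl

    def renew_all(self, ttl: float) -> None:
        """Call on every successful heartbeat from the management service."""
        now = self._clock()
        for fault_id in self._expiry:
            self._expiry[fault_id] = now + ttl

    def sweep(self) -> List[str]:
        """Remove faults whose lease ran out; returns the removed ids."""
        now = self._clock()
        expired = [f for f, t in self._expiry.items() if t <= now]
        for fault_id in expired:
            self._remove(fault_id)
        return expired

    def kill_switch(self) -> List[str]:
        """Remove every active fault immediately."""
        active = list(self._expiry)
        for fault_id in active:
            self._remove(fault_id)
        return active

    def _remove(self, fault_id: str) -> None:
        del self._expiry[fault_id]
        self._undo.pop(fault_id)()  # run the cleanup for this fault
```

With this design a network partition between the agent and the management service fails safe: heartbeats stop, leases lapse, and the next `sweep()` removes the faults without any outside instruction.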
Auditing • Faults • Fault agents • Management service • Clients
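An audit trail covering faults, agents, the management service, and clients could be as simple as one structured record per action. The field names below are assumptions chosen for illustration; `audit_record` is a hypothetical helper, not part of the system described here.

```python
import json
from datetime import datetime, timezone


def audit_record(actor: str, component: str, action: str, target: str) -> str:
    """Serialize one audit event as a single JSON line."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,          # who issued the request (a client identity)
        "component": component,  # e.g. "client", "management_service", "fault_agent"
        "action": action,        # e.g. "inject" or "remove"
        "target": target,        # the fault and where it was applied
    }
    return json.dumps(event, sort_keys=True)
```

Writing one JSON object per line keeps the log append-only and easy to search when reconstructing who injected what, and when.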
Automation • Scheduling • Zero-configuration • Dependency auto-discovery
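The slide does not say how faults are scheduled. As one possible approach, the sketch below spreads a list of faults at random times across a window, using a seed so that a plan can be reproduced later. The function `plan_faults` and the fault names in the usage note are hypothetical.

```python
import random
from datetime import datetime, timedelta
from typing import List, Tuple


def plan_faults(faults: List[str], start: datetime, window: timedelta,
                seed: int) -> List[Tuple[datetime, str]]:
    """Assign each fault a random start time inside the window, sorted by time."""
    rng = random.Random(seed)  # seeded so the same plan can be regenerated
    span = window.total_seconds()
    plan = [(start + timedelta(seconds=rng.uniform(0, span)), fault)
            for fault in faults]
    return sorted(plan)
```

For example, `plan_faults(["cpu", "network", "kill"], datetime(2024, 1, 1, 9), timedelta(hours=8), seed=1)` yields three timestamped entries within an eight-hour window, and the same seed always yields the same plan.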
Usage scenarios • Resilience verification • Test new features • Training • Verify staged deployments • Test detection, alerting, mitigation systems • Repro incidents
Injection environment • Test • Canary • Production
Recovery Games
Recovery Games • Attacker: Inject faults, Provide hints • Defender: Assess, Analyze, Mitigate
Recovery Games - Goals • Familiarize with monitoring tools • Recognize outage patterns • Train on assessing the impact • Root-cause / mitigation mindset • Practice log analysis
Invest in Fault Injection Testing • Resilience verification • Test new features • Training • Engineering process & culture
Michalis Zervos @mzervos