 
              INFRASTRUCTURE
Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers
Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar
Hardware Sustaining, Facebook
Globally, there are more than 2.8B people using Facebook, WhatsApp, Instagram or Messenger each month.
[Figure: per-product MAP labels: 1.3B, 2.45B, 1B, 1.6B]
*MAP: Monthly Active People. Source: Facebook data, Q4 2019
Contents
• Server Architecture
• Intermittent Errors
• Memory Error Reporting
• Interrupt Handling
  • System Management Interrupts (SMI)
  • Corrected Machine Check Interrupts (CMCI)
• Experiment Infrastructure
• Observations
Server Architecture
• Compute Units
  • Central Processing Unit (CPU)
  • Graphics Processing Unit (GPU)
• Memory
  • Dual In-line Memory Modules (DIMM)
• Storage
  • Flash, Disks
• Network
  • NIC, Cables
• Interfaces
  • PCIe, USB
• Monitoring
  • Baseboard Management Controller (BMC)
  • Sensors, Hot Swap Controller (HSC)
[Diagram: server block diagram showing CPUs, DIMMs, GPU, storage, PCH, TPM, USB peripherals, NIC, BMC, fan controller, sensors, HSC]
Intermittent Errors – Occurrence and Impact
• CPU
  • Machine Check Exceptions -> System Reboots
  • Bitflips -> Data Corruptions
• DIMMs
  • Correctable Errors -> Interrupt Storms, System Stalls
  • Uncorrectable Errors -> System Reboots
• Storage (Ex: Flash, Disks)
  • ECC Errors, RS Encoding Errors -> Retries, Input Output Bandwidth Loss
• Network (Ex: NICs, Cables)
  • CRC Errors, Packet Loss -> Retries, Network Bandwidth Loss
• Interfaces (Ex: PCIe, USB)
  • Correctable Errors -> Retries, Bandwidth Loss
  • Uncorrectable Errors -> System Reboots
Memory Error Reporting
• Firmware Config: System Management Mode, System Management Interrupts (SMI)
• Kernel Driver: Error Detection and Correction (EDAC), Corrected Machine Check Interrupts (CMCI)
• OS Daemon: mcelog
• Errors reported: correctable errors (CEs) and uncorrectable errors (UCEs)
System Management Interrupts
SMI Trigger:
• Memory correctable errors
SMI Handling (firmware System Management Mode (SMM) handler):
• Pause all CPU cores
• Perform Correctable Error (CE) logging
• Capture physical address of the error
• Return from SMM
[Diagram: Platform and Processor; firmware SMI logging handler alongside OS error handling (Machine Check Handler, NMI Handler)]
Corrected Machine Check Interrupts
CMCI Trigger:
• Memory correctable errors
CMCI Handling:
• Invoke CMCI handler
• EDAC kernel driver for error data collection
• Repeat every specified polling interval duration:
  1. Collect CEs from each core
  2. Aggregate CEs
  3. Log aggregated CEs (count per poll) [randomly assigned core]
• CPU stall on 1 core
[Diagram: Platform and Processor; firmware SMI logging handler alongside OS error handling (EDAC, Machine Check Handler, NMI Handler, CMCI Handler)]
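The three polling steps above can be sketched as a small user-space model. This is only an illustration of the collect, aggregate, log cycle, not the EDAC driver's code; `PollLog` and `poll_once` are hypothetical names introduced here, and the per-core counters are plain dictionary entries standing in for hardware counters.

```python
from dataclasses import dataclass, field


@dataclass
class PollLog:
    """Hypothetical stand-in for the kernel log: one entry per poll."""
    entries: list = field(default_factory=list)


def poll_once(per_core_counts, log):
    """Run one polling cycle over per-core CE counters.

    per_core_counts maps a core id to the CEs seen since the last poll.
    Returns the aggregated count and resets the per-core counters.
    """
    # Step 1: collect CEs from each core.
    collected = list(per_core_counts.values())
    # Step 2: aggregate CEs.
    total = sum(collected)
    # Step 3: log the aggregated count (one entry per poll).
    log.entries.append(total)
    # Counters start from zero for the next interval.
    for core in per_core_counts:
        per_core_counts[core] = 0
    return total
```

In this model the whole cycle runs on whichever core calls `poll_once`, which mirrors the slide's point that only one core stalls while the aggregate is logged.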
Experiment Infrastructure
Failure Detection – MachineChecker Daemon
• Runs hardware checks periodically
  • Host ping, memory, CPU, NIC, dmesg, S.M.A.R.T., power supply, SEL, etc.
• Creates an alert if a server check fails
[Diagram: MachineChecker runs periodically on the machine and collects output; the Alert Manager runs off the machine]
Failure Digestion – FBAR
• Facebook Auto Remediation
• Picks up hardware failures, processes logged information, and executes custom-made remediation accordingly
Low-Level Software Fix – Cyborg
• Handles low-level software fixes such as firmware update and reimaging
Manual Fix – Repair Ticketing
• Creates repair tickets for DC technicians to carry out HW/SW fixes
• Provides detailed logs throughout the auto-remediation
• Logs repair actions for further analysis
Experiment Infrastructure
Production System Setup
• Production machines configured with:
  • Step 1: SMI memory reporting
  • Step 2: CMCI memory reporting
Remediation Policy
• Swap memory
Benchmarks
• Repro memory errors: Stressapptest at 10s of correctable errors per second
• Detect performance impact: SPEC (Perlbench), fine-grained stall detector
Experiment Infrastructure
Memory errors in a production environment occur randomly, without the fixed periodicity seen in experimental error-injection setups.
[Figure: correctable error count per second (log scale) vs. time (seconds) for machine types M1, M2, M3]
Observation 1: System Management Interrupts (SMI) cause the machines to stall for hundreds of milliseconds, depending on the logging handler implementation. This is a measurable performance cost for reporting corrected errors.
Application Impact: Example Caching Service
Impact of SMI due to CEs:
• CEs increase (6200 -> 7300)
• Default configs trigger SMIs for every n errors (n=1000)
• Stall all cores of a CPU for 100s of ms
• Application request efficiency drops by ~40%
[Figure: request efficiency (%) and correctable error count vs. time (minutes)]
Observation 2: Benchmarks like perlbench within SPEC are useful to quantify system performance. For variable events, we need to augment the benchmarks with fine-grained detectors to capture performance deviations.
Detect performance impact using benchmarking
Perlbench benchmark scores
• Compare scores with and without SMI stalls.
• Benchmarks return the same scores: no difference observed with or without correctable errors (and the SMI stalls)
Stall detection
• CPU stall duration: 100s of ms
• Fine-grained stall detection to observe CPU stalls
Stressapptest: Helps surface memory correctable errors due to bad DIMMs
[Figure: perlbench scores for perlbench (+stressapp) and perlbench (+stressapp + CEs); bar labels 37.17 and 37.13]
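The slides do not describe how the fine-grained stall detector is built. One common approach, sketched here as an assumption, is to read a monotonic clock in a tight loop and treat any unusually large gap between consecutive readings as a stall of the core running the loop. The names `find_stalls` and `watch_for_stalls` and the threshold parameter are hypothetical.

```python
import time


def find_stalls(timestamps, threshold_s):
    """Return (start, duration) for each gap between consecutive
    timestamps that exceeds threshold_s seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > threshold_s:
            stalls.append((prev, gap))
    return stalls


def watch_for_stalls(duration_s, threshold_s):
    """Busy-loop for duration_s seconds on the current core.

    If the core is paused (for example inside an interrupt handler),
    the next clock reading arrives late and is recorded as a stall.
    Only the gaps are kept, so memory use stays small.
    """
    stalls = []
    prev = time.monotonic()
    deadline = prev + duration_s
    while prev < deadline:
        cur = time.monotonic()
        if cur - prev > threshold_s:
            stalls.append((prev, cur - prev))
        prev = cur
    return stalls
```

A detector like this catches short pauses that an aggregate benchmark score averages away, which is the gap the observation points to. To cover every core, one such loop would need to be pinned to each core; that step is not shown.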
Minimizing performance impact using CMCI interrupts
CMCI Trigger:
• Memory correctable errors
CMCI Handling:
• Invoke CMCI handler
• EDAC kernel driver for error data collection
• Repeat every specified polling interval duration:
  1. Collect CEs from each core
  2. Aggregate CEs
  3. Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core
Observation 3: SMIs are several times more computationally expensive than CMCIs for correctable memory error reporting in a production environment.
SMI vs CMCI performance impact
SMI:
• Stall all cores
• Provide full physical address of the error
CMCI:
• Stall 1 CPU core
[Figure: cumulative stall time (ms) vs. number of correctable errors, SMI (all cores) vs. CMCI (1 core)]
Results hold for M1, M2, M3 machine types since the stalls are a function of error counts.
Observation 4: With an increased polling interval, the time the EDAC driver spends in each individual aggregate logging pass increases.
Every Polling Interval
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core
Optimizing Polling Interval
• Tradeoff: error visibility frequency vs. individual CPU stall
• Modify polling interval
• Obtain maximum individual stall times per core
[Figure: largest individual stall (in ms) vs. EDAC polling interval (1, 2, 4, 6, 8, 10, 20, 30, 35, 40 seconds); bar labels range from 82 to 1244 ms]
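The slides do not say how the polling interval was modified. On mainline Linux the EDAC core exposes a module parameter, `edac_mc_poll_msec`, under `/sys/module/edac_core/parameters/`; the sketch below assumes that interface is what a similar experiment would use. The helper names are hypothetical, the path may differ on other kernels, and writing to it requires root.

```python
from pathlib import Path

# Assumption: mainline Linux EDAC core module parameter, in milliseconds.
EDAC_POLL_PATH = Path("/sys/module/edac_core/parameters/edac_mc_poll_msec")


def read_poll_interval_ms(path=EDAC_POLL_PATH):
    """Return the current polling interval in milliseconds."""
    return int(Path(path).read_text().strip())


def write_poll_interval_ms(interval_ms, path=EDAC_POLL_PATH):
    """Set the polling interval in milliseconds.

    The kernel may reject values it considers too small; this helper
    only checks that the value is positive.
    """
    if interval_ms <= 0:
        raise ValueError("polling interval must be positive")
    Path(path).write_text(f"{interval_ms}\n")
```

The slide's intervals are given in seconds, so a 10 second interval would correspond to `write_poll_interval_ms(10_000)` under this assumption.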
Observation 5: With an increased polling interval for EDAC, context switches become less frequent, so the total time a machine spends in stalls is reduced.
Every Polling Interval
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core
Optimizing Polling Interval
• Tradeoff: error visibility frequency vs. total CPU stall
• Modify polling interval
• Obtain total stall times
[Figure: total stall time (in ms) vs. EDAC polling interval (1, 2, 4, 6, 8, 10, 20, 30, 35, 40 seconds); bar labels range from 8664 to 38726 ms]
Observation 6: With increased polling interval for EDAC, we run the risk of overflow in error aggregation.
Every Polling Interval
• Log aggregated CEs (randomly assigned core)
• CPU stall on 1 core
Optimizing Polling Interval
• Tradeoff: error visibility vs. CPU stalls
• Modify polling interval
• Measure counter overflows and error count variations
[Figure: errors lost per poll vs. EDAC polling interval (1, 2, 4, 6, 8, 10, 20, 30, 35, 40, 60, 120, 240, 450 seconds); bar labels range from 0 to 421]
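The overflow risk can be illustrated with a simple saturating-counter model. This is a sketch under stated assumptions: errors arrive at a constant rate, the counter holds at most `counter_capacity` errors between polls, and anything beyond that is lost. The slides give neither the counter width nor the error rate, so both are parameters here, and the function names are hypothetical.

```python
def errors_lost_per_poll(errors_per_second, poll_interval_s, counter_capacity):
    """Errors arriving within one polling interval beyond what the
    counter can hold (zero if the counter does not overflow)."""
    arrived = errors_per_second * poll_interval_s
    return max(0, arrived - counter_capacity)


def max_safe_interval_s(errors_per_second, counter_capacity):
    """Longest whole-second polling interval with no lost errors,
    assuming a constant, positive error rate."""
    if errors_per_second <= 0:
        raise ValueError("error rate must be positive")
    return counter_capacity // errors_per_second
```

Under this model, losses stay at zero up to a threshold interval and then grow with the interval. Production error rates are bursty rather than constant, so real losses would vary from poll to poll.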