Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers
Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar
Hardware Sustaining, Facebook
[Chart: Monthly Active People (MAP) per product: 2.45B, 1B, 1.3B, 1.6B]
Globally, there are more than 2.8B people using Facebook, WhatsApp, Instagram, or Messenger each month.
Source: Facebook data, Q4 2019. *MAP: Monthly Active People
Contents
- Server Architecture
- Intermittent Errors
- Memory Error Reporting
- Interrupt Handling
- System Management Interrupts (SMI)
- Corrected Machine Check Interrupts (CMCI)
- Experiment Infrastructure
- Observations
Server Architecture
- Compute Units
  - Central Processing Unit (CPU)
  - Graphics Processing Unit (GPU)
- Memory
  - Dual In-line Memory Modules (DIMM)
- Storage
  - Flash, Disks
- Network
  - NIC, Cables
- Interfaces
  - PCIe, USB
- Monitoring
  - Baseboard Management Controller (BMC)
  - Sensors, Hot Swap Controller (HSC)
[Diagram: server block diagram with CPUs, DIMMs, PCH, storage, GPU, USB, BMC, TPM, sensors, HSC, fan controller, peripherals, NIC]
Intermittent Errors – Occurrence and Impact
- CPU: Machine Check Exceptions → System Reboots
- DIMMs: Correctable Errors (bitflips) → Data Corruptions; Uncorrectable Errors → Interrupt Storms, System Stalls, System Reboots
- Storage (e.g., Flash, Disks): ECC Errors, RS Encoding Errors → Retries, I/O Bandwidth Loss
- Network (e.g., NICs, Cables): CRC Errors, Packet Loss → Retries, Network Bandwidth Loss
- Interfaces (e.g., PCIe, USB): Correctable Errors, Uncorrectable Errors → Retries, Bandwidth Loss, System Reboots
Memory Error Reporting
[Diagram: reporting paths for correctable errors (CEs) and uncorrectable errors (UCEs)]
- System Management Interrupts (SMI): handled by firmware in System Management Mode, configured via firmware settings
- Corrected Machine Check Interrupts (CMCI): handled by the Error Detection and Correction (EDAC) kernel driver
- mcelog: OS daemon that logs machine check events
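On Linux, the EDAC driver exposes its aggregated error counters through sysfs, which is the simplest way to watch CE rates from user space. A minimal sketch, assuming the standard /sys/devices/system/edac/mc layout:

```python
#!/usr/bin/env python3
"""Sample EDAC correctable/uncorrectable error counters via sysfs.

Minimal sketch; assumes the standard Linux EDAC sysfs layout
(/sys/devices/system/edac/mc/mc*/{ce_count,ue_count})."""
import glob
import time

EDAC_MC = "/sys/devices/system/edac/mc/mc*"

def read_counts():
    """Return {memory controller name: (ce_count, ue_count)}."""
    counts = {}
    for mc in glob.glob(EDAC_MC):
        with open(mc + "/ce_count") as f:
            ce = int(f.read())
        with open(mc + "/ue_count") as f:
            ue = int(f.read())
        counts[mc.rsplit("/", 1)[-1]] = (ce, ue)
    return counts

if __name__ == "__main__":
    prev = read_counts()
    while True:
        time.sleep(1)
        cur = read_counts()
        for mc, (ce, ue) in cur.items():
            delta = ce - prev.get(mc, (0, 0))[0]
            if delta:
                print(f"{mc}: +{delta} CEs this second (total {ce}, UEs {ue})")
        prev = cur
```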
System Management Interrupts
[Diagram: SMI reporting path from the processor through the platform firmware's SMI logging handler to OS error handling (machine check and NMI handlers)]
SMI Trigger: memory correctable errors
SMI Handling:
1. Pause all CPU cores and enter System Management Mode (SMM)
2. Capture the physical address of the error
3. Perform Correctable Error (CE) logging
4. Return from SMM
Corrected Machine Check Interrupts
[Diagram: CMCI reporting path from the processor to the EDAC CMCI handler in the OS, alongside the machine check, NMI, and SMI logging handlers]
CMCI Trigger: memory correctable errors
CMCI Handling (EDAC kernel driver, repeated every specified polling interval):
1. Collect CEs from each core
2. Aggregate CEs
3. Log aggregated CEs (count per poll) on a randomly assigned core; CPU stall on 1 core only
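To make the flow concrete, here is a user-space model of the poll/aggregate/log cycle. This is a conceptual sketch, not the kernel EDAC code; the core count, interval, and error source are placeholders:

```python
#!/usr/bin/env python3
"""Conceptual model of the EDAC CMCI polling flow (not the kernel code).
Per-core CE counters are collected and aggregated every polling interval,
and a single randomly assigned core pays the logging stall."""
import random
import time

POLL_INTERVAL_S = 2            # EDAC polling interval (tunable; see later slides)
NUM_CORES = 8                  # hypothetical machine
per_core_ce = [0] * NUM_CORES  # stands in for per-core MC bank counters

def inject_errors():
    """Placeholder for the hardware: sprinkle CEs across cores."""
    for _ in range(random.randrange(20)):
        per_core_ce[random.randrange(NUM_CORES)] += 1

def collect_and_aggregate() -> int:
    """Steps 1 and 2: drain each core's counter and sum the CEs."""
    total = sum(per_core_ce)
    for core in range(NUM_CORES):
        per_core_ce[core] = 0
    return total

for _ in range(5):
    inject_errors()
    time.sleep(POLL_INTERVAL_S)
    ces = collect_and_aggregate()
    logging_core = random.randrange(NUM_CORES)  # step 3: one random core logs
    print(f"core {logging_core}: logged {ces} CEs this poll (other cores unaffected)")
```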
Experiment Infrastructure

[Diagram: on the machine, a daemon runs MachineChecker periodically and collects output; off the machine, the Alert Manager creates an alert if a server check fails and routes it through FBAR, Cyborg, and Repair Ticketing]

Failure Detection – MachineChecker
- Runs hardware checks periodically
- Host ping, memory, CPU, NIC, dmesg, S.M.A.R.T., power supply, SEL, etc.

Failure Digestion – FBAR
- Facebook Auto Remediation
- Picks up hardware failures, processes logged information, and executes custom-made remediations accordingly

Low-Level Software Fix – Cyborg
- Handles low-level software fixes such as firmware updates and reimaging

Manual Fix – Repair Ticketing
- Creates repair tickets for DC technicians to carry out HW/SW fixes
- Provides detailed logs throughout the auto-remediation
- Logs repair actions for further analysis
Experiment Infrastructure
Production System Setup
- Production machines configured with:
  - Step 1: SMI memory error reporting
  - Step 2: CMCI memory error reporting
- Remediation policy: swap memory at 10s of correctable errors per second
- Benchmarks:
  - Reproduce memory errors: Stressapptest
  - Detect performance impact: SPEC (perlbench) plus a fine-grained stall detector (a sample driver script is sketched below)
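For reference, the error-reproduction step can be driven from a small wrapper. This is a sketch: the flags follow upstream stressapptest (-s seconds, -M memory in MB, -m copy threads), and the specific values are assumptions, not the talk's configuration:

```python
#!/usr/bin/env python3
"""Drive a memory-stress run to surface correctable errors on bad DIMMs.
Sketch only; flags follow upstream stressapptest conventions and the
duration/size/thread values are assumptions, not the talk's setup."""
import subprocess

def run_memory_stress(seconds: int = 1200, mbytes: int = 4096, threads: int = 8) -> int:
    """Run stressapptest and return its exit code (non-zero on errors found)."""
    cmd = ["stressapptest", "-s", str(seconds), "-M", str(mbytes), "-m", str(threads)]
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    rc = run_memory_stress()
    print("stressapptest exit code:", rc)
```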
Experiment Infrastructure
Memory errors in a production environment occur at random and have no fixed periodicity, unlike the regular error patterns seen in experimental error-injection setups.
[Chart: correctable error count per second (log scale) vs. time (seconds) for machine types M1, M2, M3]
Observation 1: System Management Interrupts (SMI) cause the machines to stall for hundreds of milliseconds, depending on the logging handler implementation. This is a measurable performance impact incurred just to report corrected errors.

Application impact example: caching service. Impact of SMI due to CEs:
- Default configs trigger an SMI for every n errors (n = 1000)
- Each SMI stalls all cores of a CPU for 100s of ms
- As CEs increase (6200 → 7300), application request efficiency drops by ~40%
[Chart: request efficiency (%) and correctable error count vs. time (minutes)]
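A back-of-the-envelope model shows why this adds up; all numbers below are hypothetical placeholders chosen only to match the orders of magnitude above, not measurements from the talk:

```python
# Rough model of CPU time lost to SMI-based CE logging.
# All values are hypothetical placeholders, not measurements from the talk.
cores = 36                 # an SMI stalls every core on the machine
stall_s = 0.1              # per-SMI stall, "100s of ms" order of magnitude
ce_per_sec = 30            # CE rate on a degrading DIMM
smi_every_n_errors = 1000  # default trigger threshold (n = 1000)

smi_per_hour = ce_per_sec * 3600 / smi_every_n_errors
lost_core_seconds_per_hour = smi_per_hour * stall_s * cores
print(f"{smi_per_hour:.0f} SMIs/hour, "
      f"~{lost_core_seconds_per_hour:.0f} core-seconds of stall per hour")
```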
Observation 2: Benchmarks like perlbench within SPEC are useful to quantify system performance. For variable events, we need to augment the benchmarks with fine-grained detectors to capture performance deviations (see the stall-detector sketch below).

Detecting performance impact using benchmarking (perlbench)
- Stressapptest helps surface memory correctable errors due to bad DIMMs
- Compare scores with and without SMI stalls
- The benchmark returns essentially the same score either way: no difference is observed with or without correctable errors (and the SMI stalls)
[Chart: benchmark scores, perlbench (+stressapp): 37.17 vs. perlbench (+stressapp + CEs): 37.13]

Stall detection
- CPU stall duration: 100s of ms
- Fine-grained stall detection is required to observe the CPU stalls
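A minimal user-space detector in this spirit (a sketch under our own assumptions, not the authors' tool): tick on a short fixed period and flag any iteration whose observed gap greatly exceeds it, since a machine-wide SMM stall prevents even this loop from running:

```python
#!/usr/bin/env python3
"""Minimal fine-grained stall detector (sketch, not the authors' tool).

Sleeps for a short, fixed period and reports any iteration whose observed
gap is much longer than expected -- e.g. because every core was stalled
in SMM and this process could not be scheduled."""
import time

PERIOD_S = 0.001       # 1 ms tick
THRESHOLD_S = 0.010    # report gaps longer than 10 ms as stalls

def run():
    last = time.monotonic()
    while True:
        time.sleep(PERIOD_S)
        now = time.monotonic()
        gap = now - last
        if gap > THRESHOLD_S:
            print(f"stall: ~{gap * 1000:.1f} ms at t={now:.3f}")
        last = now

if __name__ == "__main__":
    run()
```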
Minimizing performance impact using CMCI interrupts
Recap of the CMCI handling flow: on a memory correctable error, the EDAC kernel driver collects CEs from each core, aggregates them, and logs the aggregate on one randomly assigned core every specified polling interval, so only that single core stalls.
Observation 3: SMIs are several times more computationally expensive than CMCIs for correctable memory error reporting in a production environment.

SMI vs. CMCI performance impact:
- SMI: stalls all cores; provides the full physical address of the error
- CMCI: stalls 1 CPU core
[Chart: cumulative stall time (ms) vs. number of correctable errors, for SMI (all cores) and CMCI (1 core)]

The results hold for the M1, M2, and M3 machine types, since the stalls are a function of the error counts.
Observation 4: With an increased polling interval, the time the EDAC driver spends in each individual aggregate-logging pass increases.

Every polling interval:
- Log aggregated CEs (randomly assigned core)
- CPU stall on 1 core

Optimizing the polling interval:
- Tradeoff: error visibility frequency vs. individual CPU stall
- Method: modify the polling interval and record the maximum individual stall time per core

Largest individual stall vs. EDAC polling interval:

  EDAC polling interval (s):      1    2    4    6    8   10   20    30    35    40
  Largest individual stall (ms): 82  121  357  281  362  440  842  1178  1242  1244
Observation 5: With an increased polling interval for EDAC, frequent context switches are reduced, and hence the total time a machine spends in stalls decreases.

Every polling interval:
- Log aggregated CEs (randomly assigned core)
- CPU stall on 1 core

Optimizing the polling interval:
- Tradeoff: error visibility frequency vs. total CPU stall
- Method: modify the polling interval and record the total stall time

Total stall time vs. EDAC polling interval:

  EDAC polling interval (s):    1      2      4      6      8     10     20     30     35     40
  Total stall time (ms):    37250  38726  31641  33869  30697  33520  27845  24014  10756   8664
Observation 6: With an increased polling interval for EDAC, we run the risk of overflow in error aggregation. (A sketch of the overflow mechanism follows.)

Every polling interval:
- Log aggregated CEs (randomly assigned core)
- CPU stall on 1 core

Optimizing the polling interval:
- Tradeoff: error visibility vs. CPU stalls
- Method: modify the polling interval and measure counter overflows and error count variations
[Chart: errors lost per poll vs. EDAC polling interval from 1 s to 450 s; losses are 0 at short intervals and grow to 421 errors per poll at the longest intervals]
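The overflow comes from the hardware side: corrected-error counts accumulate in a fixed-width counter between polls and saturate. A sketch assuming the 15-bit corrected-error count field of the x86 IA32_MCi_STATUS register (which tops out at 32767); the CE rates below are hypothetical:

```python
#!/usr/bin/env python3
"""Why long polling intervals lose errors: corrected-error counts
accumulate in a fixed-width hardware counter between polls and saturate.
Sketch assuming the 15-bit corrected-error count field of x86
IA32_MCi_STATUS (max 32767); the CE rates below are hypothetical."""

COUNTER_MAX = 2**15 - 1  # 15-bit saturating corrected-error count

def errors_lost_per_poll(ce_rate_per_s: float, poll_interval_s: float) -> float:
    """Errors arriving beyond the counter's capacity are never counted."""
    arrived = ce_rate_per_s * poll_interval_s
    return max(0.0, arrived - COUNTER_MAX)

for interval in (1, 10, 40, 120, 240, 450):
    lost = errors_lost_per_poll(ce_rate_per_s=100, poll_interval_s=interval)
    print(f"{interval:>4} s poll: ~{lost:.0f} errors lost per poll")
```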
Minimizing performance impact using CMCI interrupts
[Chart: total stall time (ms) and missed error count per poll vs. EDAC polling interval (s)]

Recommendations:
- For measuring 10s of CEs per second, use CMCI
- Use a polling interval of ~37 s
- At this setting, error visibility is maximized while total stall time is minimized (the selection logic is sketched below)
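That recommendation can be reproduced mechanically: take the interval with the lowest total stall whose per-poll error loss is still acceptable. A sketch using the Observation 5 stall measurements; the per-poll loss values are illustrative placeholders, since the talk's loss chart is not paired point-by-point here, and the result lands near the recommended ~37 s:

```python
#!/usr/bin/env python3
"""Pick an EDAC polling interval: the lowest-total-stall interval whose
per-poll error loss stays within budget. Total-stall data is taken from
the Observation 5 slide; the per-poll loss values are illustrative
placeholders, not the talk's exact pairing."""

# (interval_s, total_stall_ms) from the Observation 5 measurements
TOTAL_STALL = [(1, 37250), (2, 38726), (4, 31641), (6, 33869), (8, 30697),
               (10, 33520), (20, 27845), (30, 24014), (35, 10756), (40, 8664)]

# Illustrative placeholders: loss begins once polls outlast the counter
LOSS_PER_POLL = {1: 0, 2: 0, 4: 0, 6: 0, 8: 0, 10: 0,
                 20: 0, 30: 0, 35: 0, 40: 10}

MAX_ACCEPTABLE_LOSS = 0  # demand full error visibility

candidates = [(stall, s) for s, stall in TOTAL_STALL
              if LOSS_PER_POLL[s] <= MAX_ACCEPTABLE_LOSS]
stall_ms, interval_s = min(candidates)
print(f"choose ~{interval_s} s polling: {stall_ms} ms total stall, no errors lost")
```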
Post Package Repair (PPR)
Memory Error Repair
- DDR4 feature that remaps faulty cells to healthy spare cells in memory
- Requires the physical address of the error to perform the repair
  - SMI provides the physical address of the error
  - CMCI does not provide the physical address
- Hard PPR (preferred): persistent across reboots
- Soft PPR: not persistent across reboots

To overcome this, use a hybrid approach: CMCI in the production flow, SMI in the remediation flow.

[Diagram: Post Package Repair, with the address decoder remapping a bad bit in the memory cells to a spare cell between the data I/O and the array]
Hybrid Error Reporting Approach
- Production machines run with CMCI; a daemon runs MachineChecker periodically and collects output
- If the error count exceeds the PPR threshold, the Alert Manager raises an alert and FBAR executes the remediation flow (sketched below):
  1. Change interrupts from CMCI to SMI and reduce the SMI trigger thresholds
  2. Perform hard PPR
  3. Run benchmarks (memory stress)
  4. Change interrupts back from SMI to CMCI
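A sketch of that sequence in code; every helper name here (change_interrupt_mode, run_hard_ppr, run_memory_stress) is a hypothetical stand-in for platform tooling, and only the ordering of steps mirrors the talk:

```python
#!/usr/bin/env python3
"""Hybrid error-reporting remediation flow (sketch). All helpers are
hypothetical stand-ins for platform/firmware tooling; only the sequence
of steps mirrors the talk."""

PPR_THRESHOLD_CE_PER_S = 10  # remediate at 10s of CEs per second

def change_interrupt_mode(mode: str) -> None:
    print(f"[firmware] switch CE reporting to {mode}")  # stand-in

def capture_error_address() -> int:
    # With SMI enabled and thresholds lowered, the next CE's SMI log
    # supplies the physical address; a fixed value stands in here.
    return 0x7F3A1C00

def run_hard_ppr(physical_address: int) -> None:
    print(f"[firmware] hard PPR of row containing 0x{physical_address:X}")

def run_memory_stress() -> bool:
    print("[benchmark] stressapptest-style memory stress")
    return True  # True means no new CEs surfaced

def remediate(ce_rate: float) -> None:
    if ce_rate < PPR_THRESHOLD_CE_PER_S:
        return                          # stay in the normal CMCI flow
    change_interrupt_mode("SMI")        # SMI can report physical addresses
    run_hard_ppr(capture_error_address())
    healthy = run_memory_stress()       # verify the repair held
    change_interrupt_mode("CMCI")       # return to low-overhead reporting
    print("repair verified" if healthy else "escalate to repair ticketing")

if __name__ == "__main__":
    remediate(ce_rate=25.0)
```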
Conclusion
SMI vs. CMCI
- SMIs result in stalls of 100s of ms in production environments
- Benchmarks can be augmented to be sensitive to fine-grained stalls
- CMCI is more efficient for reporting memory errors in production
- CMCI can be further optimized by tweaking the polling interval

PPR
- A hybrid implementation reduces the performance impact in production while retaining the benefits of PPR

[Charts: recap of SMI vs. CMCI cumulative stall time, and of total stall time and missed error count per poll vs. EDAC polling interval]