INFRASTRUCTURE
Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers
Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar
Hardware Sustaining, Facebook


SLIDE 1

INFRASTRUCTURE

SLIDE 2

Hardware Sustaining. Facebook.

Optimizing Interrupt Handling Performance for Memory Failures in Large Scale Data Centers

Harish Dattatraya Dixit, Fred Lin, Bill Holland, Matt Beadon, Zhengyu Yang, Sriram Sankar

SLIDE 3

[Figure: per-app Monthly Active People (MAP): 2.45B, 1.6B, 1.3B, 1B]

Globally, there are more than 2.8B people using Facebook, WhatsApp, Instagram or Messenger each month.

Source: Facebook data, Q4 2019. *MAP = Monthly Active People

SLIDE 4

SLIDE 5
Contents

  • Server Architecture
  • Intermittent Errors
  • Memory Error Reporting
  • Interrupt Handling
    • System Management Interrupts (SMI)
    • Corrected Machine Check Interrupts (CMCI)
  • Experiment Infrastructure
  • Observations

SLIDE 6

Server Architecture

  • Compute Units
    • Central Processing Unit (CPU)
    • Graphics Processing Unit (GPU)
  • Memory
    • Dual In-line Memory Modules (DIMM)
  • Storage
    • Flash, Disks
  • Network
    • NIC, Cables
  • Interfaces
    • PCIe, USB
  • Monitoring
    • Baseboard Management Controller (BMC)
    • Sensors, Hot Swap Controller (HSC)

[Diagram: server block diagram with CPUs, DIMMs, PCH, storage, GPU, USB, BMC, TPM, sensors, HSC, fan controller, peripherals, and NIC]

SLIDE 7

Intermittent Errors – Occurrence and Impact

  • CPU: Machine Check Exceptions → System Reboots
  • DIMMs: Correctable Errors (Bitflips), Uncorrectable Errors → Data Corruptions; Interrupt Storms, System Stalls, System Reboots
  • Storage (e.g., Flash, Disks): ECC Errors, RS Encoding Errors → Retries, Input/Output Bandwidth Loss
  • Network (e.g., NICs, Cables): CRC Errors, Packet Loss → Retries, Network Bandwidth Loss
  • Interfaces (e.g., PCIe, USB): Correctable Errors, Uncorrectable Errors → Retries, Bandwidth Loss, System Reboots

SLIDE 8

Memory Error Reporting

[Diagram: reporting paths for CEs and UCEs. System Management Interrupts (SMI) are handled by firmware in System Management Mode; Corrected Machine Check Interrupts (CMCI) are handled by the Error Detection and Correction (EDAC) kernel driver and the mcelog OS daemon.]
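On Linux, the counts the EDAC driver maintains are exposed through sysfs, so the OS-side reporting path can be observed directly. A minimal sketch that reads per-memory-controller counts (the `mc*/ce_count` and `ue_count` attributes follow the kernel's EDAC sysfs layout; the root path is a parameter so the helper can also be pointed at test data):

```python
from pathlib import Path

def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller_name: (ce_count, ue_count)} from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(edac_root).glob("mc*")):
        ce = int((mc / "ce_count").read_text())   # corrected errors
        ue = int((mc / "ue_count").read_text())   # uncorrected errors
        counts[mc.name] = (ce, ue)
    return counts
```

Sampling this periodically and diffing the counts gives a per-controller CE rate without any interrupt-side instrumentation.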

SLIDE 9

System Management Interrupts

[Diagram: SMI path through the processor, platform firmware (SMI logging handler, NMI handler, machine check handler), and OS error handling]

SMI Trigger: memory correctable errors
SMI Handling:
  1. Pause all CPU cores and enter System Management Mode (SMM)
  2. Capture the physical address of the error
  3. Perform Correctable Error (CE) logging
  4. Return from SMM

SLIDE 10

Corrected Machine Check Interrupts

[Diagram: CMCI path through the processor, platform firmware, and OS error handling, with the EDAC CMCI handler alongside the machine check, NMI, and SMI logging handlers]

CMCI Trigger: memory correctable errors
CMCI Handling (the EDAC kernel driver collects the error data; the work runs on one randomly assigned core, stalling only that core, and repeats every specified polling interval):
  1. Collect CEs from each core
  2. Aggregate CEs
  3. Log aggregated CEs (count per poll)
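The three handling steps can be sketched as a user-space polling loop. This is an illustration of the scheme, not the kernel EDAC driver itself; `read_core_counter` stands in for reading a core's corrected-error bank counter:

```python
import time

def poll_once(read_core_counter, num_cores, last_counts, log):
    """One EDAC-style poll: collect per-core CEs, aggregate, log the delta."""
    total_new = 0
    for core in range(num_cores):                  # 1. collect CEs from each core
        current = read_core_counter(core)
        total_new += current - last_counts[core]   # 2. aggregate new CEs
        last_counts[core] = current
    log(total_new)                                 # 3. log aggregated count per poll
    return total_new

def poll_loop(read_core_counter, num_cores, interval_s, log):
    """Repeat the poll every specified polling interval."""
    last = [0] * num_cores
    while True:
        poll_once(read_core_counter, num_cores, last, log)
        time.sleep(interval_s)
```

Only the core executing `poll_once` is busy during a poll, which is the property the slides contrast with SMI's all-core stall.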

SLIDE 11

Experiment Infrastructure

[Diagram: on the machine, a daemon runs MachineChecker periodically and collects its output; off the machine, the Alert Manager creates an alert if a server check fails]

Failure Detection – MachineChecker
  • Runs hardware checks periodically
  • Host ping, memory, CPU, NIC, dmesg, S.M.A.R.T., power supply, SEL, etc.

SLIDE 12

Experiment Infrastructure

[Diagram: the daemon runs MachineChecker periodically and collects output; off the machine, the Alert Manager creates an alert if a server check fails and hands it to FBAR]

Failure Detection – MachineChecker
Failure Digestion – FBAR
  • Facebook Auto Remediation
  • Picks up hardware failures, processes the logged information, and executes custom-made remediations accordingly

SLIDE 13

Experiment Infrastructure

[Diagram: MachineChecker alerts flow through FBAR to Cyborg, off the machine]

Failure Detection – MachineChecker
Failure Digestion – FBAR
Low-Level Software Fix – Cyborg
  • Handles low-level software fixes such as firmware updates and reimaging

SLIDE 14

Experiment Infrastructure

[Diagram: MachineChecker alerts flow through FBAR and Cyborg to Repair Ticketing, off the machine]

Failure Detection – MachineChecker
Failure Digestion – FBAR
Low-Level Software Fix – Cyborg
Manual Fix – Repair Ticketing
  • Creates repair tickets for DC technicians to carry out HW/SW fixes
  • Provides detailed logs throughout the auto-remediation
  • Logs repair actions for further analysis
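The four-stage pipeline above (MachineChecker, FBAR, Cyborg, repair ticketing) can be sketched as an escalation loop. The callback parameters here are hypothetical stand-ins for those systems, not their actual APIs:

```python
def remediate(check_failures, auto_fix, low_level_fix, open_ticket):
    """One remediation pass: try auto-remediation, then low-level software
    fixes, and only open a manual repair ticket if both fail."""
    tickets = []
    for failure in check_failures():           # MachineChecker: periodic HW checks
        if auto_fix(failure):                  # FBAR: custom auto-remediation
            continue
        if low_level_fix(failure):             # Cyborg: firmware update / reimage
            continue
        tickets.append(open_ticket(failure))   # manual fix by DC technicians
    return tickets
```

The ordering matters: cheap automated fixes are exhausted before a human is paged, which is the point of the escalation chain.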

SLIDE 15

Experiment Infrastructure

Production System Setup
  • Production machines configured with:
    • Step 1: SMI memory error reporting
    • Step 2: CMCI memory error reporting
  • Remediation policy: swap memory at 10s of correctable errors per second
  • Benchmarks:
    • Reproduce memory errors: Stressapptest
    • Detect performance impact: SPEC (perlbench), fine-grained stall detector

SLIDE 16

Experiment Infrastructure

Memory errors in a production environment occur at random, with no fixed periodicity, unlike the regular error patterns seen in an experimental error-injection setup.

[Figure: correctable error count per second (log scale) over ~76 seconds for machine types M1, M2, and M3]

SLIDE 17

Observation 1: System Management Interrupts (SMI) cause the machines to stall for hundreds of milliseconds, depending on the logging handler implementation. Reporting corrected errors therefore has a measurable performance impact.

[Figure: caching service over time; as correctable errors rise from 6200 to 7300, request efficiency (%) drops]

Application Impact Example: Caching Service
Impact of SMI due to CEs:
  • Default configs trigger an SMI for every n errors (n = 1000)
  • CEs increase (6200 → 7300)
  • All cores of a CPU stall for 100s of ms
  • Application request efficiency drops by ~40%

SLIDE 18

Observation 2: Benchmarks like perlbench within SPEC are useful to quantify system performance. For variable events, we need to augment the benchmarks with fine-grained detectors to capture performance deviations.

Detecting performance impact using benchmarking (perlbench)
  • Compare scores with and without SMI stalls
  • Stressapptest helps surface memory correctable errors due to bad DIMMs
  • The benchmark returns essentially the same score either way: 37.17 (perlbench + stressapp) vs. 37.13 (perlbench + stressapp + CEs)
  • No difference observed in scores with or without correctable errors (and the SMI stalls)

Stall detection
  • CPU stall duration: 100s of ms
  • Fine-grained stall detection is needed to observe the CPU stalls
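A fine-grained stall detector of the kind described can be sketched as a heartbeat that timestamps itself at a short cadence and flags gaps well beyond that cadence. The interval and threshold values are illustrative; the detection logic is separated out so it can also run over recorded timestamps:

```python
import time

def find_stalls(timestamps, expected_gap_s, threshold_s):
    """Return (index, gap) pairs where consecutive heartbeats are further
    apart than expected_gap_s + threshold_s, indicating a stall."""
    stalls = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        if gap > expected_gap_s + threshold_s:
            stalls.append((i, gap))
    return stalls

def heartbeat(duration_s, interval_s=0.001):
    """Record monotonic timestamps every interval_s for duration_s seconds;
    an SMI-style all-core stall shows up as a gap in this trace."""
    ts, end = [], time.monotonic() + duration_s
    while time.monotonic() < end:
        ts.append(time.monotonic())
        time.sleep(interval_s)
    return ts
```

Unlike a throughput benchmark score, which averages a 100 ms stall away, the gap trace surfaces each stall individually.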

SLIDE 19

Minimizing performance impact using CMCI interrupts

[Diagram: CMCI path through the processor, platform firmware, and OS error handling, with the EDAC CMCI handler alongside the machine check, NMI, and SMI logging handlers]

CMCI Trigger: memory correctable errors
CMCI Handling (the EDAC kernel driver collects the error data; the work runs on one randomly assigned core, stalling only that core, and repeats every specified polling interval):
  1. Collect CEs from each core
  2. Aggregate CEs
  3. Log aggregated CEs

SLIDE 20

Observation 3: SMIs are several times more computationally expensive than CMCIs for correctable memory error reporting in a production environment.

[Figure: cumulative stall time (ms) vs. number of correctable errors, for SMI (all cores) and CMCI (1 core)]

SMI vs. CMCI performance impact
SMI:
  • Stalls all cores
  • Provides the full physical address of the error
CMCI:
  • Stalls 1 CPU core

Results hold for M1, M2, and M3 machine types, since the stalls are a function of error counts.

SLIDE 21

Observation 4: With an increased polling interval, the amount of time spent in each individual aggregate-logging pass by the EDAC driver increases.

[Figure: largest individual stall vs. EDAC polling interval: 82, 121, 357, 281, 362, 440, 842, 1178, 1242, 1244 ms at intervals of 1, 2, 4, 6, 8, 10, 20, 30, 35, 40 s]

Every polling interval:
  • Log aggregated CEs (randomly assigned core)
  • CPU stall on 1 core

Optimizing Polling Interval
  • Tradeoff: error visibility frequency vs. individual CPU stall
  • Modify the polling interval and obtain the maximum individual stall time per core

SLIDE 22

Observation 5: With an increased polling interval for EDAC, frequent context switches are reduced, and hence the total time a machine spends in stalls is reduced.

[Figure: total stall time vs. EDAC polling interval: 37250, 38726, 31641, 33869, 30697, 33520, 27845, 24014, 10756, 8664 ms at intervals of 1, 2, 4, 6, 8, 10, 20, 30, 35, 40 s]

Every polling interval:
  • Log aggregated CEs (randomly assigned core)
  • CPU stall on 1 core

Optimizing Polling Interval
  • Tradeoff: error visibility frequency vs. total CPU stall
  • Modify the polling interval and obtain the total stall times

SLIDE 23

Observation 6: With an increased polling interval for EDAC, we run the risk of overflow in error aggregation.

Every polling interval:
  • Log aggregated CEs (randomly assigned core)
  • CPU stall on 1 core

Optimizing Polling Interval
  • Tradeoff: error visibility vs. CPU stalls
  • Modify the polling interval and measure counter overflows and error-count variations

[Figure: errors lost per poll vs. EDAC polling interval (1 to 450 s); losses grow from 0 at short intervals to 421 errors per poll at the longest intervals]
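The overflow risk in Observation 6 can be modeled with simple arithmetic: errors arriving within one polling interval beyond what the hardware counter can hold are lost. A back-of-the-envelope sketch, with the counter capacity as an illustrative parameter rather than any specific CPU's MC-bank width:

```python
def errors_lost_per_poll(ce_rate_per_s, poll_interval_s, counter_capacity):
    """Errors arriving in one interval beyond the counter capacity are lost."""
    arrived = ce_rate_per_s * poll_interval_s
    return max(0, arrived - counter_capacity)

def max_safe_interval(ce_rate_per_s, counter_capacity):
    """Longest polling interval (seconds) that avoids counter overflow
    at a given sustained correctable-error rate."""
    return counter_capacity / ce_rate_per_s
```

This is the tradeoff behind the deck's recommendation: pick the longest interval (fewest stalls) whose expected per-poll arrivals still fit in the counter.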

SLIDE 24

Minimizing performance impact using CMCI interrupts

[Figure: total stall time (ms) and missed error count per poll vs. EDAC polling interval (s), combining the total-stall and error-loss curves]

Recommendations:
  • For measuring 10s of CEs per second, use CMCI
  • Use a polling interval of ~37 s
  • Tradeoff:
    • Error visibility: maximized
    • Total stall time: minimized

SLIDE 25

Post Package Repair (PPR)

Memory Error Repair
  • DDR4 feature
  • Remaps faulty cells to healthy spare cells in memory
  • Requires the physical address of the error to perform PPR
    • SMI provides the physical address of the error
    • CMCI does not provide the physical address
  • Hard PPR (preferred): persistent across reboots
  • Soft PPR: not persistent across reboots

To overcome this, use a hybrid approach: CMCI in the production flow, SMI in the remediation flow.

[Diagram: the address decoder remaps a bad bit in the memory cells to a spare cell (Post Package Repair), transparently to data I/O]

SLIDE 26

Hybrid Error Reporting Approach

[Flow: a production machine runs with CMCI interrupts; the daemon runs MachineChecker periodically and collects output; if the error count exceeds the PPR threshold, the Alert Manager hands off to FBAR, which changes interrupts from CMCI to SMI, reduces the SMI trigger thresholds, performs Hard PPR, runs benchmarks (memory stress), and changes interrupts back from SMI to CMCI]
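The hybrid flow can be sketched as the remediation-side sequence: a host whose CE rate crosses the PPR threshold is temporarily switched from CMCI to SMI (since only SMI reports the physical address PPR needs), repaired, verified, and switched back. The controller callbacks are hypothetical stand-ins for the interrupt-configuration and PPR tooling:

```python
def hybrid_ppr_remediation(host, ce_rate, ppr_threshold,
                           set_interrupt_mode, run_hard_ppr, run_memory_stress):
    """If the CE rate exceeds the PPR threshold, switch the host to SMI to
    recover physical error addresses, perform Hard PPR, verify under memory
    stress, then restore CMCI for low-overhead production reporting."""
    if ce_rate <= ppr_threshold:
        return "healthy"
    set_interrupt_mode(host, "SMI")    # SMI reports the physical address
    run_hard_ppr(host)                 # remap faulty cells (persists across reboots)
    ok = run_memory_stress(host)       # benchmark: did the repair hold?
    set_interrupt_mode(host, "CMCI")   # back to production reporting
    return "repaired" if ok else "needs_replacement"
```

The machine spends only the remediation window under the expensive SMI configuration; production time stays on CMCI.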

SLIDE 27

Conclusion

SMI vs. CMCI
  • SMIs result in stalls of 100s of ms in production environments
  • Benchmarks can be augmented to be sensitive to fine-grained stalls
  • CMCI is more efficient for reporting memory errors in production
  • CMCI can be further optimized by tweaking polling intervals

PPR
  • A hybrid implementation reduces performance impact in production while obtaining the benefits of PPR

[Figures: recap of cumulative stall time vs. number of correctable errors for SMI (all cores) and CMCI (1 core), and of total stall time and missed error count per poll vs. EDAC polling interval]

SLIDE 28

Questions

SLIDE 29

Thank you
