Reverse Debugging of Kernel Failures in Deployed Systems

SLIDE 1

Reverse Debugging of Kernel Failures in Deployed Systems

Xinyang Ge, Ben Niu and Weidong Cui
Microsoft Research

USENIX Annual Technical Conference, 2020

SLIDE 2

What happened before the crash?

SLIDE 5

REPT: Reverse Execution with Processor Trace

  • A practical reverse debugging solution for user-mode failures [OSDI’18]
  • Online hardware tracing (e.g., Intel Processor Trace)
      • Logs the control flow with timestamps
      • Low runtime overhead (1-5%)
      • Records no data values
  • Offline binary analysis
      • Recovers data flow from the control flow

How do we make REPT support the kernel?

SLIDE 6

How does REPT work?

Diagram labels: USER, KERNEL

SLIDE 10

How does REPT work?

Diagram labels: USER, KERNEL

Instruction: add rax, rbx
State after: rax=3, rbx=1
State before: rax=?, rbx=?

SLIDE 11

How does REPT work?

Diagram labels: USER, KERNEL

Instruction: add rax, rbx
State after: rax=3, rbx=1
State before (recovered): rax=2, rbx=1
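
The single-instruction example above can be sketched in code. This is a minimal illustration of reversing one `add`, not REPT's actual analysis; the function name `reverse_add` and the dictionary representation of register state are assumptions made for the sketch, and flag updates are ignored.

```python
MASK64 = (1 << 64) - 1  # x86-64 general-purpose registers are 64 bits wide


def reverse_add(dst, src, state_after):
    """Recover the register state before `add dst, src` from the state after it.

    `add` overwrites dst with (dst + src) mod 2**64 and leaves src untouched,
    so the old dst is (new dst - src) mod 2**64. This requires that dst and
    src are different registers and that src is known after the instruction.
    """
    if dst == src:
        # add r, r doubles the register; the top bit of the old value is lost.
        raise ValueError("add r, r is not uniquely reversible")
    state_before = dict(state_after)
    state_before[dst] = (state_after[dst] - state_after[src]) & MASK64
    return state_before


after = {"rax": 3, "rbx": 1}
before = reverse_add("rax", "rbx", after)
print(before)  # {'rax': 2, 'rbx': 1}
```

The source operand survives the instruction, which is what makes this particular `add` reversible; an instruction that destroys its inputs would leave the earlier value unknown.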

SLIDE 13

Can we simply invert the tracing?

  • There are too many processes/threads on a system
      • High memory overhead for tracing
  • Hardware events must be emulated in addition to CPU instructions
      • Interrupts
      • Exceptions
      • System calls
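
The memory concern in the first bullet can be made concrete with simple arithmetic. The sketch below assumes one fixed-size trace buffer per thread versus one per core; the buffer size, thread count, and core count are hypothetical values chosen only for illustration, not measurements from the talk.

```python
def trace_memory_bytes(num_buffers, buffer_bytes):
    """Total memory reserved when every traced entity gets its own buffer."""
    return num_buffers * buffer_bytes


BUFFER_BYTES = 1 * 1024 * 1024  # hypothetical 1 MiB circular buffer
NUM_THREADS = 4000              # hypothetical thread count on a busy machine
NUM_CORES = 16                  # hypothetical core count

per_thread = trace_memory_bytes(NUM_THREADS, BUFFER_BYTES)
per_core = trace_memory_bytes(NUM_CORES, BUFFER_BYTES)
print(per_thread // 2**20, "MiB vs", per_core // 2**20, "MiB")  # 4000 MiB vs 16 MiB
```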
SLIDE 14

Here comes Kernel REPT…

SLIDE 15

Diagram labels: USER, KERNEL, context switch

A context switch is irreversible, and we log it in software.
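
One way to picture a software context-switch log is as a list of timestamped records that say which thread a core started running. The sketch below is an assumption about what such a log could look like; the record fields and the `thread_at` helper are hypothetical, and it assumes the log shares a clock with the hardware trace timestamps.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextSwitch:
    timestamp: int    # assumed to use the same clock as the hardware trace
    core: int
    next_thread: int  # thread that starts running on `core` at `timestamp`


def thread_at(switch_log, core, timestamp):
    """Return the thread running on `core` at `timestamp`, or None if unknown."""
    on_core = sorted((s for s in switch_log if s.core == core),
                     key=lambda s: s.timestamp)
    times = [s.timestamp for s in on_core]
    i = bisect_right(times, timestamp)  # number of switches at or before timestamp
    return on_core[i - 1].next_thread if i else None


log = [ContextSwitch(100, 0, 7), ContextSwitch(250, 0, 9), ContextSwitch(120, 1, 3)]
print(thread_at(log, 0, 180))  # 7
print(thread_at(log, 0, 250))  # 9
print(thread_at(log, 0, 50))   # None
```

With records like these, a per-core control-flow trace can be cut into per-thread segments by looking up each traced timestamp.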

SLIDE 16

Diagram labels: USER, KERNEL, syscalls, interrupts/exceptions

SLIDE 17

Diagram labels: USER, KERNEL, syscalls, interrupts/exceptions

Interrupt Descriptor Table: INTERRUPT GATE 0, INTERRUPT GATE 1, INTERRUPT GATE 2, …, INTERRUPT GATE N

Different events can have different architectural effects.

Kernel Stack: SS, RSP, RFLAGS, CS, RIP, Error Code (the diagram also marks the Stack Pointer)
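
The stack frame on this slide can be turned into a small emulation sketch. The code below is a simplified model, not REPT's implementation: the `Cpu` class and function names are hypothetical, the error-code vector set is a subset, and it deliberately ignores stack switching, 16-byte stack alignment, and the changes an event makes to CS, SS, and RFLAGS.

```python
from dataclasses import dataclass, field

# Exception vectors that push an error code (a subset, e.g. 13 = #GP, 14 = #PF).
VECTORS_WITH_ERROR_CODE = {8, 10, 11, 12, 13, 14, 17}


@dataclass
class Cpu:
    rip: int
    rsp: int
    rflags: int
    cs: int
    ss: int
    stack: dict = field(default_factory=dict)  # address -> 64-bit value

    def push(self, value):
        self.rsp -= 8
        self.stack[self.rsp] = value


def deliver_event(cpu, vector, handler_rip, error_code=0):
    """Forward emulation: push SS, RSP, RFLAGS, CS, RIP (and maybe an error code)."""
    old_rsp = cpu.rsp
    for value in (cpu.ss, old_rsp, cpu.rflags, cpu.cs, cpu.rip):
        cpu.push(value)
    if vector in VECTORS_WITH_ERROR_CODE:
        cpu.push(error_code)
    cpu.rip = handler_rip


def recover_interrupted_context(cpu, vector):
    """Reverse step at handler entry: read the interrupted state from the frame."""
    base = cpu.rsp + (8 if vector in VECTORS_WITH_ERROR_CODE else 0)
    names = ("rip", "cs", "rflags", "rsp", "ss")  # ascending addresses from base
    return {name: cpu.stack[base + 8 * i] for i, name in enumerate(names)}


cpu = Cpu(rip=0x401000, rsp=0x8000, rflags=0x202, cs=0x33, ss=0x2B)
deliver_event(cpu, vector=14, handler_rip=0xFFFF800000001000, error_code=0x4)
ctx = recover_interrupted_context(cpu, vector=14)
print(hex(ctx["rip"]), hex(ctx["rsp"]))  # 0x401000 0x8000
```

The frame layout depends on the vector, which is one concrete sense in which different events have different architectural effects.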

SLIDE 18

That’s it?

SLIDE 19

Automated Analyses

  • A common bug pattern: missing undo operations
      • EnterCriticalRegion vs LeaveCriticalRegion
  • Root-Cause Analysis
      • Scan the kernel execution trace to find missing undo operations
  • Proactive Bug Detector
      • Sanitize the kernel execution based on specified invariants
      • 17 new bugs found and fixed!
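
A scan for missing undo operations might look like the sketch below. It assumes the execution trace has already been reduced to a sequence of (thread id, function name) call events, which is a simplification; the pairing table, the event format, and the function name `find_missing_undos` are hypothetical, and only the EnterCriticalRegion/LeaveCriticalRegion pair comes from the slide.

```python
from collections import Counter

# Hypothetical pairing table; the slide names only this one pair.
UNDO_OF = {"EnterCriticalRegion": "LeaveCriticalRegion"}
DO_OF = {undo: do for do, undo in UNDO_OF.items()}


def find_missing_undos(trace):
    """Return {(thread_id, do_name): count} for calls that were never undone.

    `trace` is an iterable of (thread_id, function_name) events in execution order.
    """
    open_calls = Counter()
    for thread_id, name in trace:
        if name in UNDO_OF:
            open_calls[(thread_id, name)] += 1
        elif name in DO_OF:
            key = (thread_id, DO_OF[name])
            if open_calls[key] > 0:  # ignore an undo with no matching do
                open_calls[key] -= 1
    return {key: n for key, n in open_calls.items() if n > 0}


trace = [
    (1, "EnterCriticalRegion"), (1, "DoWork"), (1, "LeaveCriticalRegion"),
    (2, "EnterCriticalRegion"), (2, "DoWork"),  # thread 2 never leaves
]
print(find_missing_undos(trace))  # {(2, 'EnterCriticalRegion'): 1}
```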
SLIDE 20

Demo

SLIDE 21

Conclusion

  • Debugging production kernel failures is hard
  • REPT now supports the reverse debugging of the kernel
      • Per-core control flow tracing in hardware
      • Context switch logging in software
      • Recovers data flow via CPU instruction and hardware event emulation
  • REPT enables automated analysis beyond reverse debugging
      • Root-cause analysis
      • Sanitizing analysis