REPT: Reverse Debugging of Failures in Deployed Software (Weidong Cui et al.)

SLIDE 1

REPT: Reverse Debugging of Failures in Deployed Software

Weidong Cui1, Xinyang Ge1, Baris Kasikci2, Ben Niu1, Upamanyu Sharma2, Ruoyu Wang3, and Insu Yun4 Microsoft Research1 University of Michigan2 Arizona State University3 Georgia Institute of Technology4

OSDI 2018, Carlsbad, CA

SLIDE 2

What happened before the crash?

SLIDE 3
SLIDE 4
SLIDE 5
SLIDE 6

REPT: Reverse Execution with Processor Trace

SLIDE 7

REPT: Reverse Execution with Processor Trace

  • Online hardware tracing (e.g., Intel Processor Trace)
      • Log the control flow with timestamps
      • Low runtime overhead (1-5%)
      • No data!
  • Offline binary analysis
      • Recovers data flow from the control flow
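
As a rough illustration of why a control-flow log is enough to rebuild the executed instruction sequence: Intel Processor Trace records, among other things, taken/not-taken bits for conditional branches, and a decoder walks the binary while consuming them. The sketch below is a toy model of that idea, not REPT's decoder. The CFG dictionary, its block names, and the replay function are all hypothetical.

```python
# Toy control-flow graph (hypothetical): block name -> (instructions, kind, successors).
# For "cond" blocks the successors are (taken target, fall-through target).
CFG = {
    "entry":   (["test rdi, rdi", "jz zero"], "cond", ("zero", "nonzero")),
    "nonzero": (["mov rax, 1", "jmp done"],   "jump", ("done",)),
    "zero":    (["xor rax, rax"],             "jump", ("done",)),
    "done":    (["ret"],                      "ret",  ()),
}


def replay(cfg, start, taken_bits):
    """Rebuild the executed instruction sequence from logged branch outcomes."""
    bits = iter(taken_bits)
    block, path = start, []
    while True:
        instrs, kind, targets = cfg[block]
        path.extend(instrs)
        if kind == "ret":
            return path
        if kind == "cond":
            # Only conditional branches consume a logged bit.
            block = targets[0] if next(bits) else targets[1]
        else:
            block = targets[0]


print(replay(CFG, "entry", [True]))
```

Note that the result contains instructions only. Register and memory values are absent, which is the "No data!" point above.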
SLIDE 8
SLIDE 9

REPT Data Recovery

  • Single-threaded execution reconstruction
  • Multi-threaded execution reconstruction
SLIDE 10

Core Dump + Instruction Sequence = Execution History?

How to recover overwritten states?
SLIDE 11

lea rbx, [g]
mov rax, 1
add rax, [rbx]
mov [rbx], rax
xor rbx, rbx

SLIDE 12

                    rax=?, rbx=?, [g]=3
lea rbx, [g]
                    rax=?, rbx=?, [g]=3
mov rax, 1
                    rax=?, rbx=?, [g]=3
add rax, [rbx]
                    rax=3, rbx=?, [g]=3
mov [rbx], rax
                    rax=3, rbx=?, [g]=3
xor rbx, rbx
                    rax=3, rbx=0, [g]=3

SLIDE 13

                    rax=?, rbx=?, [g]=3
lea rbx, [g]
                    rax=?, rbx=?, [g]=3  ->  rax=?, rbx=g, [g]=3
mov rax, 1
                    rax=?, rbx=?, [g]=3  ->  rax=1, rbx=g, [g]=3
add rax, [rbx]
                    rax=3, rbx=?, [g]=3  ->  rax=3, rbx=g, [g]=3   (4?)
mov [rbx], rax
                    rax=3, rbx=?, [g]=3
xor rbx, rbx
                    rax=3, rbx=0, [g]=3

SLIDE 14

                    rax=?, rbx=?, [g]=?
lea rbx, [g]
                    rax=?, rbx=g, [g]=?
mov rax, 1
                    rax=1, rbx=g, [g]=?
add rax, [rbx]
                    rax=3, rbx=g, [g]=?
mov [rbx], rax
                    rax=3, rbx=?, [g]=3  ->  rax=3, rbx=g, [g]=3
xor rbx, rbx
                    rax=3, rbx=0, [g]=3

SLIDE 15

                    rax=?, rbx=?, [g]=?  ->  rax=?, rbx=?, [g]=2
lea rbx, [g]
                    rax=?, rbx=g, [g]=?  ->  rax=?, rbx=g, [g]=2
mov rax, 1
                    rax=1, rbx=g, [g]=?  ->  rax=1, rbx=g, [g]=2
add rax, [rbx]
                    rax=3, rbx=g, [g]=?
mov [rbx], rax
                    rax=3, rbx=g, [g]=3
xor rbx, rbx
                    rax=3, rbx=0, [g]=3

SLIDE 16

                    rax=?, rbx=?, [g]=2
lea rbx, [g]
                    rax=?, rbx=g, [g]=2
mov rax, 1
                    rax=1, rbx=g, [g]=2
add rax, [rbx]
                    rax=3, rbx=g, [g]=?  ->  rax=3, rbx=g, [g]=2
mov [rbx], rax
                    rax=3, rbx=g, [g]=3
xor rbx, rbx
                    rax=3, rbx=0, [g]=3

SLIDE 17

                    rax=?, rbx=?, [g]=2
lea rbx, [g]
                    rax=?, rbx=g, [g]=2
mov rax, 1
                    rax=1, rbx=g, [g]=2
add rax, [rbx]
                    rax=3, rbx=g, [g]=2
mov [rbx], rax
                    rax=3, rbx=g, [g]=3
xor rbx, rbx
                    rax=3, rbx=0, [g]=3

SLIDE 18

Key Techniques

  • Forward Execution
      • Recovers states before irreversible instructions
  • Error Correction
      • Handles errors introduced by “missing” memory writes
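
The walkthrough on slides 11-17 can be mimicked with a small toy model. The sketch below is not REPT's implementation. It hard-codes the five instructions from the slides, tracks only rax, rbx, and the single memory cell [g] (called mem here), and uses hypothetical helper names (forward, backward, reconstruct). It alternates backward and forward passes, optimistically assumes that a store through an unknown address leaves [g] untouched, and undoes that assumption once forward execution shows that the store actually wrote to g.

```python
G = "g"  # symbolic address of the global g; None stands for an unknown value

# Toy encoding of the five instructions on the slides.
PROGRAM = ["lea rbx,[g]", "mov rax,1", "add rax,[rbx]", "mov [rbx],rax", "xor rbx,rbx"]


def forward(op, pre):
    """Derive the state after `op` from the state before it."""
    post = dict(pre)
    if op == "lea rbx,[g]":
        post["rbx"] = G
    elif op == "mov rax,1":
        post["rax"] = 1
    elif op == "add rax,[rbx]":
        known = pre["rbx"] == G and None not in (pre["rax"], pre["mem"])
        post["rax"] = pre["rax"] + pre["mem"] if known else None
    elif op == "mov [rbx],rax":
        post["mem"] = pre["rax"] if pre["rbx"] == G else None
    elif op == "xor rbx,rbx":
        post["rbx"] = 0
    return post


def backward(op, pre, post):
    """Derive the state before `op` from the state after it."""
    out = dict(post)
    if op in ("lea rbx,[g]", "xor rbx,rbx"):
        out["rbx"] = None  # destination overwritten: the old value is lost
    elif op == "mov rax,1":
        out["rax"] = None
    elif op == "add rax,[rbx]":
        out["rax"] = None
        if pre["rbx"] == G and post["rax"] is not None:
            if pre["rax"] is not None:  # solve rax_after = rax_before + [g]
                out["mem"] = post["rax"] - pre["rax"]
            elif post["mem"] is not None:
                out["rax"] = post["rax"] - post["mem"]
    elif op == "mov [rbx],rax":
        if G in (pre["rbx"], post["rbx"]):  # the store does not change rbx
            out["mem"] = None  # [g] was overwritten here
        # Otherwise the address is unknown: optimistically assume [g] is untouched.
    return out


def reconstruct(core_dump):
    n = len(PROGRAM)
    states = [dict(rax=None, rbx=None, mem=None) for _ in range(n)] + [dict(core_dump)]
    assumed, conflicts, changed = set(), [], True

    def merge(i, derived):
        nonlocal changed
        for reg, val in derived.items():
            if val is None:
                continue
            if states[i][reg] is None:
                states[i][reg], changed = val, True
            elif states[i][reg] != val:
                conflicts.append((i, reg, states[i][reg], val))  # keep the old value

    while changed:
        changed = False
        for i in reversed(range(n)):  # backward analysis, starting at the core dump
            op = PROGRAM[i]
            if op == "mov [rbx],rax" and G not in (states[i]["rbx"], states[i + 1]["rbx"]):
                assumed.add(i)  # remember the optimistic "no write to g" assumption
            merge(i, backward(op, states[i], states[i + 1]))
        for i in range(n):  # forward analysis
            op = PROGRAM[i]
            if op == "mov [rbx],rax" and states[i]["rbx"] == G and i in assumed:
                # Error correction: the store did hit g, so every [g] value that was
                # propagated backward across it is invalid.
                assumed.discard(i)
                for s in states[: i + 1]:
                    s["mem"] = None
                changed = True
            merge(i + 1, forward(op, states[i]))
    return states, conflicts


states, conflicts = reconstruct(dict(rax=3, rbx=0, mem=3))
for s in states:
    print(s)
```

With the core dump from the slides (rax=3, rbx=0, [g]=3), the final states match slide 17: [g]=2 at every point before the store. The conflicts list records the single mismatch at the add (rax=3 from the backward pass versus 4 from the forward pass), which corresponds to the "4?" on slide 13.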
SLIDE 19

REPT Data Recovery

  • Single-threaded execution reconstruction
  • Multi-threaded execution reconstruction
SLIDE 20

Core Dump + Instruction Sequence #1 + Instruction Sequence #2 = Execution History?

How to determine the thread interleavings?

SLIDE 21

A D

Time

B C E F G

SLIDE 22

A B C D E F G

Time

A B C D E F G

SLIDE 23

A B C D E F G

Time

A D E B F G C

10

SLIDE 24

A B C D E F G

Time

A D E B F G C

18 or 20 10

SLIDE 25

A B C D E F G

Time

A D E B F G C

10 18 or 20

SLIDE 26

Key Techniques

  • Hardware Timestamps
      • Constructs a partial order
  • Concurrent memory write detection
      • Constrains their usage to avoid propagating a wrong value
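
One way to picture these two techniques is a minimal sketch under stated assumptions: each thread's trace is cut into chunks of instructions bracketed by two timestamps, and two chunks from different threads are ordered only if their time intervals do not overlap. The Chunk record, its fields, and the function names below are hypothetical and are not taken from REPT. The sketch also only flags addresses written by two unordered chunks, whereas the real analysis may consider reads as well.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Chunk:
    """Instructions of one thread between two trace timestamps (hypothetical record)."""
    thread: int
    start: int          # timestamp at or before the first instruction
    end: int            # timestamp at or after the last instruction
    writes: frozenset   # addresses written by the chunk


def happens_before(a, b):
    """Partial order: program order within a thread, disjoint intervals across threads."""
    if a.thread == b.thread:
        return a.start < b.start
    return a.end < b.start


def concurrent(a, b):
    return not happens_before(a, b) and not happens_before(b, a)


def concurrent_writes(chunks):
    """Addresses written by two chunks whose order the timestamps cannot decide."""
    racy = set()
    for a, b in combinations(chunks, 2):
        if a.thread != b.thread and concurrent(a, b):
            racy |= a.writes & b.writes
    return racy


chunks = [
    Chunk(1, 0, 4, frozenset({"x"})),
    Chunk(2, 2, 7, frozenset({"x"})),    # overlaps the first chunk in time
    Chunk(2, 10, 12, frozenset({"y"})),
    Chunk(1, 13, 15, frozenset({"y"})),  # strictly after the previous chunk
]
print(concurrent_writes(chunks))
```

In this example only x is flagged. Both writes to y are ordered by the timestamps, so a value of y can be propagated safely, while a value of x should not be used to infer other values.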
SLIDE 27

With REPT, …

SLIDE 28

Hey, client, turn on tracing next time. I want history information!

SLIDE 29

Demo

SLIDE 30

  • 16 bugs
  • 1-5% overhead
  • 92% accuracy
  • 14 bugs

SLIDE 31

Conclusion

  • Debugging production failures is important but hard
  • REPT is a practical reverse debugging solution for production failures
      • Online hardware tracing to log the control flow with timestamps
      • Offline binary analysis to recover the data flow with high accuracy
  • REPT has been deployed on Microsoft Windows