SLIDE 1

Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay

Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas

Department of Computer Science, University of Illinois at Urbana-Champaign

SLIDE 2

Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors

Motivation: Time Travel

Allows us to visit and recreate past states and events in a computer
Wide range of uses:

  Debugging
  Security

Enabled by using Deterministic Replay of Execution

SLIDE 3

How Deterministic Replay Works

Phase I: Initial Execution (a.k.a. Recording)

  Execute and record certain non-deterministic events into log
  Sources of non-determinism: interrupts, memory access interleaving, ...

Phase II: Replay

  Restore to a previous checkpoint
  Re-execute and use log to force software down the same execution path
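The two phases can be illustrated with a toy program, shown here as a minimal sketch rather than anything from Capo itself. The function `run`, the list `log`, and the use of `random.randint` as a stand-in for a non-deterministic event are all assumptions made for illustration.

```python
import random

def run(get_input, steps=5):
    """A toy program whose final state depends on non-deterministic inputs."""
    state = 0
    for _ in range(steps):
        state = state * 31 + get_input()
    return state

# Phase I: recording. Every non-deterministic event is appended to a log.
log = []

def recording_input():
    value = random.randint(0, 9)  # stands in for an interrupt, input, etc.
    log.append(value)
    return value

recorded_result = run(recording_input)

# Phase II: replay. Restart from the same initial state (the "checkpoint")
# and feed the logged events back instead of sampling new ones.
replay_events = iter(log)
replayed_result = run(lambda: next(replay_events))

print(recorded_result == replayed_result)
```

Because the only source of non-determinism is served from the log, the replayed run follows the same path as the recorded one.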

SLIDE 4

SW-Based Deterministic Replay

Flexible, integrates well with rest of SW stack
Very slow or not applicable to multiprocessor execution:

  Software is slow at capturing memory access interleaving

[Figure: system stack (Library, Compiler, Virtual Machine, Operating System, Virtual Machine Monitor), marked as SW-based schemes]

SLIDE 5

HW-Based Deterministic Replay of Multiprocessors

HW can record interleaving of shared-memory accesses effectively:

  Small Memory Access Interleaving Log
  Little overhead

Limitation: integration with SW stack is poor

[Figure: system stack (Library, Compiler, Virtual Machine, Operating System, Virtual Machine Monitor), with hardware-based schemes and SW-based schemes marked]

SLIDE 6

Limitations of HW-Based Replay of Multiprocessors

Past proposals focused only on HW primitives for recording and replaying

  How does it integrate with the SW stack?

Cannot separate SW being recorded/replayed from the rest

  Paradox: where does the SW that manages the logs go?
  Requires complex VMM or simulator to replay execution

Can’t mix recording, replay and normal execution simultaneously in the machine

We must adapt HW-based replay systems and carefully integrate them with SW in order to make HW-based replay practical

SLIDE 7

Capo Contributions

SW-HW interface for practical HW-assisted deterministic replay

  Works with any HW-based replay system

Replay Sphere: new abstraction

  Isolates SW that is being recorded (replayed) from the rest
  Separates the responsibilities of the HW and the SW components

CapoOne: Linux-based prototype

SLIDE 8

Replay Sphere: Isolating Processes

Replay Sphere: set of threads recorded and replayed as a unit, and their address space

  Only user-mode threads run inside spheres
  Threads inside a sphere: R-threads

Replay spheres and processes:

  R-threads that share memory must run within same sphere
  Many processes can run within the same sphere

[Figure: Replay Sphere 1 (recording) and Replay Sphere 2 (replaying) above the OS, with OS threads 103, 128, 26 and 39 shown alongside R-threads 1 to 3; Replay Sphere Manager and Replay HW; CPUs 1 to 4]

SLIDE 9

Replay Sphere: Separating Responsibilities

HW:

  Records memory access interleaving of R-threads running within same sphere
  Produces per-sphere Memory Access Interleaving Log
  Enforces same memory access interleaving during replay

SW (Replay Sphere Manager):

  Logs the other sources of non-determinism that affect the sphere
  Produces per-sphere Input Log
    Includes system call return values, signals, data copied into the sphere, ...
  Injects data from log into sphere during replay
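The software side of this split can be sketched as a small Input Log that records events while recording and hands them back during replay. This is an illustrative sketch only: the class `InputLog`, its methods, and the choice of one queue per R-thread are assumptions, not the layout used by CapoOne.

```python
from collections import defaultdict, deque

class InputLog:
    """Hypothetical per-sphere Input Log, kept as one FIFO queue per R-thread."""

    def __init__(self):
        self._per_thread = defaultdict(deque)

    def record(self, r_thread_id, kind, payload):
        # Recording: remember what the sphere received from outside.
        self._per_thread[r_thread_id].append((kind, payload))

    def inject(self, r_thread_id, kind):
        # Replay: return the logged payload instead of asking the OS again.
        logged_kind, payload = self._per_thread[r_thread_id].popleft()
        if logged_kind != kind:
            raise RuntimeError("replay diverged from the recorded execution")
        return payload

log = InputLog()
log.record(1, "read", b"HI!\0")        # data copied into the sphere
log.record(2, "getpid", 4242)          # system call return value
log.record(1, "signal", "SIGUSR1")     # asynchronous event

replayed = [
    log.inject(2, "getpid"),
    log.inject(1, "read"),
    log.inject(1, "signal"),
]
```

The ordering between R-threads is not stored here; in the scheme on this slide that is the job of the hardware's Memory Access Interleaving Log.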

SLIDE 10

Other Replay Sphere Manager Responsibilities

Assign the same virtual memory addresses during recording/replay
Assign the same IDs to R-threads during recording/replay
Manage Memory Access Interleaving Log and Input Log
Manage replay HW resources

SLIDE 11

Capo’s HW Interface

Works with any HW-based replay system

Per-processor R-Thread Control Block:

  Sphere ID register
  R-Thread ID register

Per-sphere Replay Sphere Control Block:

  Mode register: specifies whether the sphere is recording or replaying
  Log pointers: insert to / remove from Memory Access Interleaving Log
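As a rough picture of the state this interface exposes, the two control blocks can be modeled as plain records. The field names (`sphere_id`, `r_thread_id`, `insert_ptr`, `remove_ptr`) and the `Mode` enum are illustrative assumptions; the slide only names the registers, not their encoding.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    RECORDING = "recording"
    REPLAYING = "replaying"

@dataclass
class RThreadControlBlock:
    """One per processor: identifies the R-thread currently running on it."""
    sphere_id: int
    r_thread_id: int

@dataclass
class ReplaySphereControlBlock:
    """One per sphere: mode plus pointers into the interleaving log."""
    mode: Mode
    insert_ptr: int = 0   # where the next entry is written while recording
    remove_ptr: int = 0   # where the next entry is read while replaying

# A processor running R-thread 7 of sphere 1, which is being recorded.
cpu0 = RThreadControlBlock(sphere_id=1, r_thread_id=7)
sphere1 = ReplaySphereControlBlock(mode=Mode.RECORDING)
sphere1.insert_ptr += 1   # the hardware logged one interleaving entry
```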

SLIDE 12

Virtualizing the Replay HW

Replay sphere manager schedules spheres into hardware contexts

[Figure: on the SW side, the Replay Sphere Manager with Sphere 1 (recording, running), Sphere 2 (replaying, running) and Sphere 3 (recording, ready), and Logs 1 to 3; on the HW side, Replay Sphere Control Blocks, each with Mode and Log Pointers]
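One way to picture this virtualization is a manager that multiplexes more spheres than there are hardware control blocks. The sketch below is hypothetical: `ReplaySphereManager`, `admit`, `release`, and the choice of two hardware slots are assumptions for illustration, not CapoOne's scheduler.

```python
from collections import deque

class ReplaySphereManager:
    """Toy manager that maps spheres onto a fixed number of HW control blocks."""

    def __init__(self, hw_slots):
        self.hw_slots = hw_slots   # assumed number of control blocks in HW
        self.running = {}          # sphere_id -> state loaded into a HW block
        self.ready = deque()       # spheres waiting for a free HW block

    def admit(self, sphere_id, mode):
        state = {"mode": mode, "log_ptr": 0}
        if len(self.running) < self.hw_slots:
            self.running[sphere_id] = state
        else:
            self.ready.append((sphere_id, state))

    def release(self, sphere_id):
        # Save the departing sphere's state, then give its slot to the
        # next ready sphere, if there is one.
        saved = self.running.pop(sphere_id)
        if self.ready:
            next_id, next_state = self.ready.popleft()
            self.running[next_id] = next_state
        return saved

rsm = ReplaySphereManager(hw_slots=2)
rsm.admit(1, "recording")
rsm.admit(2, "replaying")
rsm.admit(3, "recording")       # no free slot: sphere 3 stays ready
waiting_before = [sphere_id for sphere_id, _ in rsm.ready]
rsm.release(1)                  # sphere 3 takes over the freed slot
```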

SLIDE 13

Three Key Challenges

1. Ensuring deterministic interleaving when OS copies data into a sphere
2. Using fewer processors during replay than were used during recording
3. Emulating vs. re-executing system calls

SLIDE 14

OS Copies Data into Spheres

Problem: interleaving between OS copies and R-threads not recorded
Solution: insert copy_to_user into sphere: HW can log memory access interleaving

  copy_to_user exits sphere once copy is over

[Figure: Replay Sphere 1 (recording) with R-thread 1, R-thread 2 and copy_to_user inside the sphere; operations shown: read(&buf), X = buf[2], buf[3] = Y; buf contents shown: "BYE\0" and "HI!\0"; OS, Replay Sphere Manager and Log 1]
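Why the copy has to be visible to the logging hardware can be seen with a toy model in which the value an R-thread reads depends on whether the copy happened first. This is a sketch under assumptions: the function `replay`, the actor names, and which string is the old and which the new contents of `buf` are all guesses made for illustration.

```python
def replay(interleaving_log):
    """Apply logged accesses to buf in order; return what the reader observed."""
    buf = list("BYE\0")           # assumed old contents of buf
    observed = None
    for actor in interleaving_log:
        if actor == "copy_to_user":
            buf[:] = "HI!\0"      # the OS copy, treated as part of the sphere
        elif actor == "reader":
            observed = buf[2]     # X = buf[2]
    return observed

copy_first = replay(["copy_to_user", "reader"])
reader_first = replay(["reader", "copy_to_user"])
print(copy_first, reader_first)
```

The two orders give different values for X, so a replay that does not know where the copy fell relative to the R-thread's accesses cannot reproduce the run.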

SLIDE 15

Replaying with a Lower Processor Count

Problem: R-thread that should replay next log entry not scheduled on a CPU

Solution 1: HW detects problem and raises interrupt

  Efficient, but it requires additional HW and SW support

Solution 2: SW inspects Interleaving Log and tries to prevent problem

  Not trivial, requires changes to OS scheduler

Solution 3: Do nothing, simply wait for OS to schedule R-thread

  Simple, but can hurt performance
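The cost of Solution 3 can be estimated with a toy simulation that counts time slices in which no scheduled R-thread owns the next log entry. This is a simplified sketch: the function `replay_with_waiting`, the round-robin scheduler, and the example log are assumptions, and every R-thread named in the log is assumed to be in the run queue.

```python
from collections import deque

def replay_with_waiting(interleaving_log, r_threads, cpus):
    """Count time slices where replay stalls waiting for the right R-thread."""
    run_queue = deque(r_threads)
    pending = deque(interleaving_log)
    stalled_slices = 0
    while pending:
        # The first `cpus` entries of the run queue are on a CPU this slice.
        scheduled = list(run_queue)[:cpus]
        progressed = False
        # Consume log entries for as long as their owner is on a CPU.
        while pending and pending[0] in scheduled:
            pending.popleft()
            progressed = True
        if not progressed:
            stalled_slices += 1
        run_queue.rotate(-1)   # plain round-robin, unaware of the log
    return stalled_slices

log = [3, 3, 1]                # R-thread IDs in recorded commit order
stalls_3_cpus = replay_with_waiting(log, [1, 2, 3], cpus=3)
stalls_2_cpus = replay_with_waiting(log, [1, 2, 3], cpus=2)
stalls_1_cpu = replay_with_waiting(log, [1, 2, 3], cpus=1)
```

With as many CPUs as R-threads there are no stalls in this model; with fewer, the scheduler sometimes runs threads that cannot make progress, which is the performance cost the slide refers to.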

SLIDE 16

CapoOne: First Capo Implementation

Simulated replay HW:

  DeLorean HW system [Montesinos ISCA’08]
  Augmented with Capo’s HW interface

Modified 2.6.24 Linux kernel

  Supports replay spheres, R-threads
  New, deterministic copy_to_user

Split Replay Sphere Manager:

  User-level component based on ptrace
  Kernel-level component schedules spheres and R-threads

[Figure: Replay Sphere 1 (recording) with R-threads 1 and 2, Ubuntu Linux, and the DeLorean HW Replay System; labels shown: Replay Sphere Manager (twice), Log 1, ptrace, new_copy_to_user, Replay Sphere Control Blocks, R-thread Control Blocks]

SLIDE 17

Also in the Paper

CapoOne implementation details
Lessons learned during CapoOne’s development
Emulating vs. re-executing system calls
Using Capo with different HW-based replay systems

SLIDE 18

CapoOne Evaluation Setup

Two HW configurations

  Simulated DeLorean HW replay system (SIMICS): 4 x86 processors
  Real hardware: 4-core x86 Intel processor without DeLorean HW

SW: Ubuntu 7.10 with Replay Sphere Manager

  Modified 2.6.24 kernel

Benchmarks:

  Scientific benchmarks: SPLASH-2
  System benchmarks: Apache, Compilation

SLIDE 19

Overall Log Size

Memory Access Interleaving Log takes most of the space
Small overall log: 3.17 bits/kilo-instruction

[Figure: bar chart of log size (bits/kilo-instruction, axis ticks 1 to 4) for SPLASH2-avg, Apache-avg and Compilation-avg, split into Memory Access Interleaving Log and Input Log]
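To get a feel for the 3.17 bits/kilo-instruction figure, it can be converted into a log size for a given instruction count. The helper `log_bytes` and the one-billion-instruction example are illustrative assumptions; only the rate comes from the slide.

```python
BITS_PER_KILO_INSTRUCTION = 3.17   # overall log rate reported on the slide

def log_bytes(instructions):
    """Estimated log size in bytes for a given number of instructions."""
    kilo_instructions = instructions / 1000
    return kilo_instructions * BITS_PER_KILO_INSTRUCTION / 8

size = log_bytes(1_000_000_000)    # one billion instructions (assumed workload)
print(f"{size / 1000:.0f} kB")
```

At that rate, a billion recorded instructions need on the order of 400 kB of log.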

SLIDE 20

Recording Performance

Moderate overhead: 21% for SPLASH2 and 41% average for system apps

  Minimal timing distortion for debugging concurrency defects

[Figure: normalized execution time (axis ticks 1 to 2) for SPLASH2-avg, Apache-avg and Compilation-avg, broken down into standard execution, ptrace’s interposition overhead, and rest of Replay Sphere Manager]

SLIDE 21

Replay Performance: SPLASH-2

Emulating system calls reduces cycles during replay
Replay takes only 80% more cycles

  R-threads must wait for their turn to commit

[Figure: normalized cycles (axis ticks 1 to 2) for Record and Replay, broken down into Execution and Stall]

SLIDE 22

Conclusions

Capo enables practical replay of execution for systems with replay HW

  The Replay Sphere is a powerful abstraction
  Enables mixing recording, replay and standard execution

CapoOne: first Capo prototype

  Working system
  Good performance (21-41% recording overhead, 80% replay overhead)
  Good for debugging concurrency defects

SLIDE 23

Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay

Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas

Department of Computer Science, University of Illinois at Urbana-Champaign

SLIDE 24

CapoOne: Basic HW Operation

Two new per-processor registers: RSID, R-threadID
Arbiter now supports concurrent spheres

  Manages an Interleaving Log for each of them

[Figure: Proc 0 (Chunk A, R-threadID = 7) and Proc 1 (Chunk B, R-threadID = 3), each with RSID and R-threadID registers and a CS Log, send Commit Requests to the Arbiter and receive Ok; the Arbiter holds the Interleaving Logs, with entries 7 and 3 shown]
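The arbiter's per-sphere bookkeeping can be sketched as one commit-order list per RSID. This is a simplified model, not DeLorean's arbiter: the class `Arbiter`, its method names, and the policy of refusing out-of-turn commits during replay are assumptions for illustration.

```python
from collections import defaultdict

class Arbiter:
    """Toy arbiter keeping one interleaving log per replay sphere (RSID)."""

    def __init__(self):
        self.logs = defaultdict(list)        # RSID -> R-thread IDs in commit order
        self.replay_pos = defaultdict(int)   # RSID -> next log entry to replay

    def record_commit(self, rsid, r_thread_id):
        # Recording: grant the commit and remember who committed.
        self.logs[rsid].append(r_thread_id)
        return True

    def replay_commit(self, rsid, r_thread_id):
        # Replay: grant the commit only if this R-thread is next in the log.
        log = self.logs[rsid]
        pos = self.replay_pos[rsid]
        if pos < len(log) and log[pos] == r_thread_id:
            self.replay_pos[rsid] = pos + 1
            return True
        return False

arbiter = Arbiter()
arbiter.record_commit(rsid=1, r_thread_id=7)
arbiter.record_commit(rsid=1, r_thread_id=3)
arbiter.record_commit(rsid=2, r_thread_id=5)   # a second, concurrent sphere

# Replaying sphere 1: R-thread 3 must wait until R-thread 7 has committed.
grants = [
    arbiter.replay_commit(1, 3),
    arbiter.replay_commit(1, 7),
    arbiter.replay_commit(1, 3),
]
```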

SLIDE 25

Today’s Agenda: Towards the Perfect Replay System

DeLorean: New hardware replay engine

  Very efficient multiprocessor support
  Vastly improved log requirements

Capo: New SW-HW interface for replay

  Makes HW-based replay systems practical

Evaluation
Future work

SLIDE 26

Overall DeLorean System

Interrupt, I/O and DMA logs are common to other HW-based schemes

[Figure: Proc 0 (Chunk A) and Proc 1 (Chunk B), each with a CS Log, Interrupt Log and I/O Log; Arbiter with PI Log; DMA with DMA Log; Network]

SLIDE 27

CapoOne: HW Implementation

No need for DeLorean’s Interrupt Log, DMA Log or I/O Log

PI Log becomes the per-sphere Interleaving Log
CS Log becomes a per-R-thread Log

Chunks only have instructions from one application (or the kernel)

[Figure: the DeLorean system diagram (Proc 0 with Chunk A, Proc 1 with Chunk B, Arbiter, DMA, Network) with labels for PI Log, CS Logs, DMA Log, Interrupt Logs, I/O Logs and Interleaving Log]

SLIDE 28

Emulating vs. Re-Executing System Calls

During replay, the RSM emulates most system calls:

  RSM injects return values from Sphere Input Log, squashes outputs

Some have to be re-executed

  Thread management (clone)
  Address space modification (mprotect)

[Figure: Replay Sphere 1 (replaying) with R-threads 1 and 2; labels shown: read(), Log 1, fork(), OS, RSM, sys_fork, thread 674, R-thread 3]
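The emulate-or-re-execute decision can be sketched as a small dispatch function. This is a hypothetical sketch: `replay_syscall`, the `REEXECUTED` set (taken from the two examples on the slide), and the assumption that only emulated calls appear in the Input Log are all illustrative choices, not CapoOne's actual policy.

```python
from collections import deque

# System calls that must really run again during replay (examples from the slide).
REEXECUTED = frozenset({"clone", "mprotect"})

def replay_syscall(name, input_log, reexecute):
    """Re-execute the call, or emulate it from the sphere's Input Log."""
    if name in REEXECUTED:
        return ("re-executed", reexecute(name))
    logged_name, return_value = input_log.popleft()
    if logged_name != name:
        raise RuntimeError("replay diverged from the recorded execution")
    # Emulation: the recorded return value is injected and the call's
    # real effects outside the sphere are not performed again.
    return ("emulated", return_value)

input_log = deque([("read", 4), ("write", 4)])
kernel_calls = []

def fake_kernel(name):
    # Stand-in for actually entering the OS.
    kernel_calls.append(name)
    return 0

results = [
    replay_syscall(name, input_log, fake_kernel)
    for name in ["read", "clone", "write"]
]
```

Only `clone` reaches the stand-in kernel here; `read` and `write` are served entirely from the log.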

SLIDE 29

Implicit Dependencies

R-thread changes mapping or protection of address space, and another R-thread uses this changed address space
RSM can express these dependencies to hardware so these interactions can be recorded

[Figure: Sphere 1 with R-thread 1 (mprotect X) and R-thread 2 (while(1){ *x = *x+1 }) on CPU 1 and CPU 2, and the Page Table]