A True Hardware Read Barrier

SLIDE 1

INSTITUT FÜR NACHRICHTENVERMITTLUNG UND DATENVERARBEITUNG
INSTITUT FÜR KOMMUNIKATIONSNETZE UND RECHNERSYSTEME
Prof. Dr.-Ing. Dr. h. c. mult. P. J. Kühn
Universität Stuttgart

A True Hardware Read Barrier

Matthias Meyer
Institute of Communication Networks and Computer Engineering
University of Stuttgart, Germany
matthias.meyer@ikr.uni-stuttgart.de

International Symposium on Memory Management
June 10–11, 2006
Ottawa, Canada

SLIDE 2

Institut für Kommunikationsnetze und Rechnersysteme
Universität Stuttgart

A True Hardware Read Barrier

Outline

❐ Real-time garbage collection: The synchronization problem
❐ A hardware-supported approach
   ✗ Novel processor architecture
   ✗ Garbage collection coprocessor
   ✗ Prototype
❐ The read barrier
   ✗ Effect on mutator progress
   ✗ A closer look at the read barrier fault handler
   ✗ Novel hardware read barrier design
❐ Conclusions and further work

SLIDE 3

Real-Time Garbage Collection

The synchronization problem

❐ Mutator and GC modify graph of objects
   ➠ read or write barriers
❐ Mutator and GC access same object
   ➠ mechanisms for mutual exclusion
   ➠ or atomic processing of objects
❐ Critical regions (root set processing)
   ➠ unbounded pauses

➠ high synchronization overhead
➠ no hard real-time capabilities

[Figure labels: Application (“Mutator”), Garbage collector (GC), Root set, Heap memory]

SLIDE 4

A Novel Processor Architecture (1)

Basic idea

❐ Hide garbage collection at the assembly language level
❐ Efficiently realize garbage collection and synchronization in hardware

Precondition

❐ Knowledge of pointers and objects in hardware

Novel approach

❐ Strictly separate pointers from non-pointer data
   ✗ in the register file
   ✗ in the instruction set
   ✗ in memory

[Figure: Object Structure. Attributes (π, δ), a pointer area indexed 0 … π–1, and a data area indexed 0 … δ–1]
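The object structure on this slide can be sketched in a few lines of Python. This is a minimal model under stated assumptions, not the processor's actual encoding: the name `ObjectLayout`, byte units for π and δ, the ordering attributes / pointer area / data area, and the 8-byte attribute header (suggested by the "+ 8" term in the free-pointer update on a later slide) are all illustrative choices.

```python
from dataclasses import dataclass

# Assumption: the attribute header occupies 8 bytes, as hinted by the
# "+ 8" in the free-pointer update shown on a later slide.
ATTRIBUTE_BYTES = 8


@dataclass(frozen=True)
class ObjectLayout:
    """Hypothetical model of one object: attributes, pointer area, data area."""
    base: int   # address of the attribute header
    pi: int     # size of the pointer area in bytes (assumed unit)
    delta: int  # size of the data area in bytes (assumed unit)

    def total_size(self) -> int:
        # Header plus both areas.
        return ATTRIBUTE_BYTES + self.pi + self.delta

    def pointer_area(self) -> range:
        # Assumed to follow the header directly.
        start = self.base + ATTRIBUTE_BYTES
        return range(start, start + self.pi)

    def data_area(self) -> range:
        # Assumed to follow the pointer area.
        start = self.base + ATTRIBUTE_BYTES + self.pi
        return range(start, start + self.delta)

    def holds_pointer(self, address: int) -> bool:
        # With strict separation, the address alone tells whether a
        # field holds a pointer; no per-word tag bits are needed.
        return address in self.pointer_area()
```

Because π and δ fully describe the object, a collector modelled this way can find every pointer without type information from the compiler, which is the point of the strict separation.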

SLIDE 5

A Novel Processor Architecture (2)

Extensions to a classical RISC pipeline

[Figure: pipeline diagram. Labels: stages Fetch, Decode, Execute, Memory, Attribute; units AGU, ALU, PGU; Register Set; Instruction Cache, Data Cache, Attribute Cache; attribute fields π and δ]

❐ Separate data and pointer registers
❐ Extend pointer registers by attributes
❐ Add PGU for operations that generate pointers (allocate, copy pointer)
❐ Add attribute stage for efficient attribute access

SLIDE 6

A Novel Processor Architecture (3)

Support for concurrent compaction

[Figure labels: Fromspace, Tospace, π, δ, scan, backlink, forwarding pointer; pipeline diagram with Fetch, Decode, Execute, Memory, and Attribute stages, AGU, ALU, PGU, Register Set, Instruction Cache, Data Cache, Attribute Cache]

❐ Extend pointer register set by backlink entry
❐ Extend attribute cache by backlink entry
❐ AGU dynamically uses tospace pointer or backlink for address generation
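One way to read the last bullet is sketched below. It is an interpretation, not the documented behaviour of the AGU: it assumes that a tospace copy whose contents have not yet been transferred carries a backlink to its fromspace original, and that field accesses are redirected through that backlink until the copy is complete. The names `PointerRegister` and `effective_address` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PointerRegister:
    """Hypothetical pointer register extended by a backlink entry."""
    tospace_address: int
    # Assumption: set while the object's contents still live in fromspace,
    # cleared (None) once the collector has copied them.
    backlink: Optional[int] = None


def effective_address(register: PointerRegister, offset: int) -> int:
    """Pick the base the way the AGU might: backlink if present, else tospace."""
    if register.backlink is not None:
        base = register.backlink
    else:
        base = register.tospace_address
    return base + offset
```

Under this reading the mutator only ever holds tospace pointers, while the data it touches may still reside in fromspace.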

SLIDE 7

A Novel Processor Architecture (4)

The read barrier

❐ Two comparators check loaded pointers (hardware read barrier)
❐ Read barrier will trigger interrupt if loaded pointer refers to fromspace
❐ Interrupt handled by a dedicated garbage collection coprocessor

[Figure: pipeline diagram with Fetch, Decode, Execute, Memory, and Attribute stages, AGU, ALU, PGU, Register Set, Instruction Cache, Data Cache, Attribute Cache, extended by a Read Barrier unit and the GC Coprocessor]
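The comparator check can be sketched as a simple range test. This is a software analogy under assumptions: the slides do not say what the two comparators compare against, so the lower and upper fromspace bounds used here, the dictionary standing in for memory, and the names `refers_to_fromspace`, `load_pointer`, and `ReadBarrierFault` are illustrative.

```python
class ReadBarrierFault(Exception):
    """Stands in for the interrupt raised by the hardware read barrier."""

    def __init__(self, slot_address: int, pointer: int) -> None:
        super().__init__(
            f"fromspace pointer {pointer:#x} loaded from {slot_address:#x}"
        )
        self.slot_address = slot_address
        self.pointer = pointer


def refers_to_fromspace(pointer: int, fromspace_base: int, fromspace_limit: int) -> bool:
    # Two comparisons, mirroring the two comparators (assumed bounds check).
    return fromspace_base <= pointer < fromspace_limit


def load_pointer(memory: dict, slot_address: int,
                 fromspace_base: int, fromspace_limit: int) -> int:
    """Load a pointer and fault if it refers to fromspace."""
    pointer = memory[slot_address]
    if refers_to_fromspace(pointer, fromspace_base, fromspace_limit):
        raise ReadBarrierFault(slot_address, pointer)
    return pointer
```

In hardware both comparisons run in parallel with the load, so the check itself adds no instructions to the mutator; only a fault costs time.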

SLIDE 8

Garbage Collection Coprocessor

Features

   ✗ performs garbage collection concurrently with application processing
   ✗ low cost device, specialized for garbage collection

Integration

   ✗ tightly coupled to main processor
   ✗ realized on same device
   ✗ separate ports to memory controller

Memory interface

   ✗ no temporal locality: no cache!
   ✗ spatial locality: burst registers!

Algorithm

   ✗ based on Baker’s algorithm
   ✗ directly implemented in microcode

[Figure labels: Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 9

Prototype

[Figure: prototype board. Labels: Main Processor with on-chip GC Coprocessor; Standard SDRAM modules; Ethernet; PS/2; DVI; Parallel; Serial]

SLIDE 10

Prototype

Hardware

❐ Main processor: 3-way multiple issue, “in order”
❐ GC coprocessor: 256 × 80 bit microcode memory
❐ Synchronously operated at 25 MHz

Software

❐ Static Java compiler (bytecode to machine code)
❐ Subset of the Java class libraries

Features

❐ Low-cost fine-grained synchronization
   ✗ independent of compiler and runtime system
   ✗ no code size overhead, little runtime overhead
❐ First known system that limits any GC-related pause to max. 500 clock cycles

Question

How are the pauses distributed over time?
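For scale, the 500-cycle pause bound can be converted to wall-clock time at the prototype's 25 MHz clock. The short calculation below uses only those two figures from the slide; the symbol names are chosen here for illustration.

```latex
\[
t_{\text{pause,max}}
  = \frac{500\ \text{cycles}}{25 \times 10^{6}\ \text{cycles/s}}
  = 20\ \mu\text{s}
\]
\[
N_{1\,\text{ms}}
  = 10^{-3}\ \text{s} \times 25 \times 10^{6}\ \text{cycles/s}
  = 25\,000\ \text{cycles}
\]
```

A single pause is therefore at most 2% of a 1 ms window, so low utilization over such a window can only come from many pauses occurring close together.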

SLIDE 11

Read barrier effect on mutator progress

Experimental results

Percentage of pause cycles (in intervals of 500 clock cycles, benchmark “database”)

[Figure: two plots of the pause-cycle percentage (axis up to 100) over time: the full run from 0 s to 15 s, and a zoomed view from 3.04 s to 3.20 s with a 5 ms scale marker]

Minimum mutator utilization

   1 ms intervals      7.2%
   5 ms intervals      8.3%
   25 ms intervals    11.4%
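Minimum mutator utilization (MMU) for a window length is the smallest fraction of any window of that length in which the mutator makes progress. The sketch below computes it from a per-cycle stall trace with prefix sums. It is an illustration, not the measurement tooling behind these results; the function name `minimum_mutator_utilization` and the boolean trace format are assumptions.

```python
from itertools import accumulate
from typing import Sequence


def minimum_mutator_utilization(paused: Sequence[bool], window: int) -> float:
    """Return the MMU for `window` cycles.

    `paused` holds one flag per clock cycle (True = mutator stalled).
    Every window position is examined, so the result is exact at
    single-cycle granularity.
    """
    if not 0 < window <= len(paused):
        raise ValueError("window must be between 1 and the trace length")
    # prefix[i] = number of stalled cycles among the first i cycles
    prefix = [0, *accumulate(int(flag) for flag in paused)]
    worst_stall = max(
        prefix[start + window] - prefix[start]
        for start in range(len(paused) - window + 1)
    )
    return 1.0 - worst_stall / window


if __name__ == "__main__":
    # Toy trace: 10 running cycles, 5 stalled cycles, 10 running cycles.
    trace = [False] * 10 + [True] * 5 + [False] * 10
    for cycles in (5, 10, 25):
        print(cycles, minimum_mutator_utilization(trace, cycles))
```

Short windows are dominated by the worst cluster of pauses, which is why the MMU figures on this slide grow with the interval length.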

SLIDE 12

A closer look at the read barrier fault handler

Trigger: Processor reads fromspace pointer

[Figure labels: Fromspace, Tospace, π, δ, free; Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 13

A closer look at the read barrier fault handler

Step 1: Coprocessor reads faulting pointer

[Figure labels: Fromspace, Tospace, π, δ, free; Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 14

A closer look at the read barrier fault handler

Step 2: Coprocessor reads object attributes

[Figure labels: Fromspace, Tospace, π, δ, free; Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 15

A closer look at the read barrier fault handler

Step 3: Coprocessor advances free

free_new = free + 8 + π + δ

[Figure labels: Fromspace, Tospace, π, δ, free; Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 16

A closer look at the read barrier fault handler

Step 4: Coprocessor overwrites fromspace attributes

[Figure labels: Fromspace, Tospace, π, free_new, forwarding pointer; Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 17

A closer look at the read barrier fault handler

Step 5: Coprocessor initializes tospace attributes

[Figure labels: Fromspace, Tospace, π, δ, free_new, backlink; Main Processor, Caches, GC Coprocessor, Memory Controller]

SLIDE 18

A closer look at the read barrier fault handler

Step 6: Coprocessor updates fromspace pointer

[Figure labels: Fromspace, Tospace, π, δ, free_new; Main Processor, Caches, GC Coprocessor, Memory Controller]
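The six handler steps can be traced in a small software model. This is a sketch under assumptions, not the coprocessor's microcode: the `Heap` class, the dictionaries standing in for memory, placing the tospace copy at the old value of free, and the shortcut for an object that already carries a forwarding pointer are all illustrative choices that the slides do not spell out.

```python
from dataclasses import dataclass, field

ATTRIBUTE_BYTES = 8  # the "+ 8" term in the free-pointer update


@dataclass
class Heap:
    """Hypothetical heap model: attribute records and pointer slots by address."""
    free: int
    attributes: dict = field(default_factory=dict)  # object address -> attributes
    slots: dict = field(default_factory=dict)       # slot address -> pointer


def handle_read_barrier_fault(heap: Heap, slot_address: int) -> int:
    """Evacuate the object behind a faulting pointer and return its tospace address."""
    old_address = heap.slots[slot_address]           # step 1: read faulting pointer
    attrs = heap.attributes[old_address]             # step 2: read object attributes
    if "forwarding" in attrs:
        # Assumption: an already evacuated object only needs its slot updated.
        heap.slots[slot_address] = attrs["forwarding"]
        return attrs["forwarding"]
    new_address = heap.free                          # assumed location of the copy
    heap.free = new_address + ATTRIBUTE_BYTES + attrs["pi"] + attrs["delta"]  # step 3
    heap.attributes[old_address] = {"forwarding": new_address}                # step 4
    heap.attributes[new_address] = {                                          # step 5
        "pi": attrs["pi"],
        "delta": attrs["delta"],
        "backlink": old_address,
    }
    heap.slots[slot_address] = new_address           # step 6: update the pointer
    return new_address
```

Each numbered step touches memory at least once. In the coprocessor-based design these accesses all go through the memory controller, which helps explain why fault handling stays expensive despite the hardware barrier.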

SLIDE 19

A novel hardware read barrier design

Analysis

❐ Read barrier fault handling expensive despite hardware support
❐ Necessary to sacrifice the tospace invariant to avoid clustering?

Insights

1. Read barrier in hardware
   ... but read barrier fault handling still in software
2. Processors expensively communicate via main memory
   ... because faulting pointer local to main processor, not to garbage collector

Novel idea

Live with the clustering, save the tospace invariant

1. Increase efficiency of the handler
   ➠ Realize fault handling completely in hardware!
2. Resolve the locality issue
   ➠ Move fault handling to main processor!

SLIDE 20

A novel hardware read barrier design

Trigger: Processor reads fromspace pointer

[Figure labels: Fromspace, Tospace, π, δ, free; pipeline diagram with Fetch, Decode, Execute, Memory, and Attribute stages, AGU, ALU, PGU, Register Set, Instruction Cache, Data Cache, Attribute Cache, Read Barrier]

SLIDE 21

A novel hardware read barrier design

Step 1: Advance free, write fromspace attributes, update fromspace pointer

free_new = free + 8 + π + δ

[Figure labels: Fromspace, Tospace, π, free; pipeline diagram with Fetch, Decode, Execute, Memory, and Attribute stages, AGU, ALU, PGU, Register Set, Instruction Cache, Data Cache, Attribute Cache, Read Barrier]

SLIDE 22

A novel hardware read barrier design

Step 2: Initialize tospace attributes

[Figure labels: Fromspace, Tospace, π, δ, free_new; pipeline diagram with Fetch, Decode, Execute, Memory, and Attribute stages, AGU, ALU, PGU, Register Set, Instruction Cache, Data Cache, Attribute Cache, Read Barrier]

SLIDE 23

A novel hardware read barrier design

Experimental results

Percentage of stall cycles within intervals of 500 clock cycles (benchmark “database”)

[Figure: two plots of the stall-cycle percentage (axis up to 100) over time: the full run from 0 s to 15 s, and a zoomed view from 3.04 s to 3.20 s]

Minimum mutator utilization (slide 11 values in parentheses)

   1 ms intervals     56.8%   (7.2%)
   5 ms intervals     58.1%   (8.3%)
   25 ms intervals    62.1%  (11.4%)
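The size of the improvement is easier to judge as a ratio. The snippet below divides the new minimum mutator utilization for the "database" benchmark by the earlier figure given in parentheses; the dictionary layout and the name `improvement_factor` are illustrative, while the percentages are the ones on the slide.

```python
# (earlier design, new design) minimum mutator utilization in percent,
# benchmark "database", as listed on the slide.
MMU_PERCENT = {
    "1 ms": (7.2, 56.8),
    "5 ms": (8.3, 58.1),
    "25 ms": (11.4, 62.1),
}


def improvement_factor(before: float, after: float) -> float:
    """Ratio of new to old utilization."""
    if before <= 0:
        raise ValueError("earlier utilization must be positive")
    return after / before


if __name__ == "__main__":
    for interval, (before, after) in MMU_PERCENT.items():
        print(f"{interval}: {improvement_factor(before, after):.1f}x")
```

The ratios come out at roughly 7.9, 7.0, and 5.4 for the 1 ms, 5 ms, and 25 ms intervals, so the gain is largest for the shortest window, where clustered faults hurt most.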

SLIDE 24

A novel hardware read barrier design

Minimum mutator utilization [%] for various benchmarks and time quanta

Benchmarks: compress, cup, database, javac, javacc, jflex, jlisp
Time quanta: 1 ms, 5 ms, 25 ms

91 99 16 65 17 69 18 72 99 100 97 100 7 57 11 62 8 58 50 87 80 97 83 98 93 98 22 84 39 90 57 92 62 94 6 55 25 84 8 79 21 87

SLIDE 25

Conclusions and Further Work

The true hardware read barrier

❐ Average handling time: Less than 20 clock cycles
❐ Minimum mutator utilization for a time quantum of 1 ms: Greater than 55%
❐ Maintains elegance and simplicity of Baker’s tospace invariant

Processor architecture and GC coprocessor

❐ Exact garbage collection without support from compiler or runtime system
❐ High robustness at the machine code level
❐ Maximum pause time less than 500 clock cycles
❐ Total runtime overhead of garbage collection a few percent only

Further work

❐ Generational collector with "true hardware write barrier"
❐ Out-of-order implementation of the processor architecture

SLIDE 26

Appendix

Distribution of pause lengths

[Figure: two histograms of pause lengths; tick labels from 50 to 300 on one axis, with the other axis reaching 400000 in one plot and 60000 in the other]