SLIDE 1 Register Relocation
Flexible Contexts for Multithreading
Carl A. Waldspurger, William E. Weihl
Parallel Software Group, MIT Laboratory for Computer Science
May 18, 1993
SLIDE 2 Multithreading
Goal: tolerate long latencies
Approach: compute while waiting
Mechanism: rapid context switching
[Figure: memory and register file. Labels: Context Pointer, Switch, LoadedQ, ReadyQ, Executable Threads, contexts C0, C1, ..., Cn.]
SLIDE 3
Flexible Contexts
Thread requirements vary
  register usage is thread-dependent
  decreasing marginal benefits from more registers
Software-based approach
  application-specific partitioning
  variable-size contexts
  static or dynamic division
More resident contexts
  better utilization of scarce registers
  improve processor efficiency
SLIDE 4
Outline
Register relocation
  hardware primitive
  software support
Experiments
  remote memory references
  synchronization events
Related work
Conclusions
Future work
SLIDE 5 Register Relocation
Flexible base/offset scheme
  Base: register relocation mask (RRM)
  Offset: context-relative register numbers
Examples:
[Figure: two examples. In each, the RRM is ORed with a context-relative register number to give the absolute, relocated register number. The 7-bit register number is divided into BASE and OFFSET fields of 4 bits and 3 bits.]
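The base/offset scheme above can be sketched in a few lines. This is a minimal illustration, assuming a 128-register file (7-bit register numbers); the function name `relocate` and the specific RRM and offset values are chosen for illustration and are not taken from the slides.

```python
def relocate(rrm: int, offset: int) -> int:
    """Return the absolute register number for a context-relative register.

    Relocation is a bitwise OR of the mask (base) and the offset.
    """
    return rrm | offset


# Assumed 8-register context: 4-bit base, 3-bit offset, low 3 mask bits clear.
rrm_small = 0b1010000
print(format(relocate(rrm_small, 0b011), "07b"))   # 1010011

# Assumed 16-register context: 3-bit base, 4-bit offset, low 4 mask bits clear.
rrm_large = 0b0100000
print(format(relocate(rrm_large, 0b1101), "07b"))  # 0101101
```

Because the low-order bits of the mask are zero, the OR behaves like an addition without needing an adder in the decode path.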
SLIDE 6 Hardware Support
Register relocation mask (RRM)
special hardware register
- ⌈lg n⌉ bits for n general registers
New instruction: ldrrm R
  set RRM from low-order bits of R
  delay slots may follow
Instruction decode modifications
  bitwise OR instruction operands and RRM
  RISC fixed-field decoding
[Figure: instruction format with Opcode, Function, Reg src1, Reg src2, and Reg dest fields; each register field is ORed with the RRM to give the relocated Reg src1, Reg src2, and Reg dest.]
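A rough software model of the hardware support described on this slide is given below. It is a sketch only: the `Decoder` and `Instruction` names are hypothetical, delay slots after `ldrrm` are not modeled, and the three-operand format is assumed from the figure labels.

```python
from typing import NamedTuple


class Instruction(NamedTuple):
    opcode: str
    src1: int   # context-relative register numbers
    src2: int
    dest: int


def rrm_width(num_registers: int) -> int:
    """Bits needed for the RRM: ceil(lg n) for n general registers."""
    return (num_registers - 1).bit_length()


class Decoder:
    """Hypothetical model of the decode stage with a relocation mask."""

    def __init__(self, num_registers: int) -> None:
        self.width = rrm_width(num_registers)
        self.rrm = 0

    def ldrrm(self, reg_value: int) -> None:
        # Set the RRM from the low-order bits of the register value.
        self.rrm = reg_value & ((1 << self.width) - 1)

    def decode(self, ins: Instruction) -> Instruction:
        # OR every register operand field with the RRM.
        return ins._replace(
            src1=ins.src1 | self.rrm,
            src2=ins.src2 | self.rrm,
            dest=ins.dest | self.rrm,
        )


decoder = Decoder(128)
decoder.ldrrm(0b1010000)
print(decoder.decode(Instruction("add", 1, 2, 3)))
```

With fixed-field decoding the operand fields sit at known bit positions, so the three ORs can happen in parallel with opcode decode.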
SLIDE 7 Software Support
Context switch
Context (de)allocate
Context (un)load
[Figure: memory and register file. Labels: Allocation Bitmap, RRM, Switch, LoadedQ, ReadyQ, Executable Threads, contexts C0, C1, C2, ..., Cn.]
SLIDE 8 Context Scheduling
Managed in software
  no hardware task queues
  flexible control over policy
Sample policy
  resident context queue
  round-robin scheduling
  fast context switch: 4 to 6 cycles
[Figure: resident contexts and the active context. Labels: PC and NextRRM in each resident context, PC and RRM for the active context, Save, Restore, Links in Registers.]
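The sample policy can be pictured as a ring of resident contexts, each keeping its saved PC and the RRM of its successor. The sketch below models that ring with a dictionary; `make_ring` and `context_switch` are hypothetical names, and the dictionary stands in for the links that the slide says are kept in registers.

```python
def make_ring(entries):
    """Build a round-robin ring from a list of (rrm, pc) pairs.

    Each entry stores its saved PC and the RRM of the next resident context.
    """
    ring = {}
    for i, (rrm, pc) in enumerate(entries):
        next_rrm = entries[(i + 1) % len(entries)][0]
        ring[rrm] = {"pc": pc, "next_rrm": next_rrm}
    return ring


def context_switch(ring, active_rrm, current_pc):
    """Save the active PC, follow the NextRRM link, restore the next PC."""
    ring[active_rrm]["pc"] = current_pc          # save
    next_rrm = ring[active_rrm]["next_rrm"]      # follow link
    return next_rrm, ring[next_rrm]["pc"]        # restore


ring = make_ring([(0, 100), (32, 200), (48, 300)])
rrm, pc = context_switch(ring, 0, 104)
print(rrm, pc)   # 32 200
```

Only a PC save, a link load, an RRM update, and a jump are involved, which is consistent with a switch cost of a few cycles.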
SLIDE 9
Context Management
Implemented in software
  flexible partitioning of register file
  static or dynamic
  identical or varying sizes
Context allocation
  general-purpose dynamic routines
  search allocation bitmap
  simple shift and mask operations
  alloc 25 cycles, dealloc 5 cycles
Context loading
  save/restore exact number of registers
  single routine with multiple entry points
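One way to picture the bitmap search is shown below. It is a sketch under stated assumptions: one bitmap bit per 4-register chunk, a 128-register file, power-of-two context sizes aligned to their own size, and hypothetical function names. It does not model the cycle counts quoted above.

```python
CHUNK = 4        # registers per bitmap bit (assumed granularity)
NUM_REGS = 128   # assumed register file size


def alloc(bitmap: int, size: int):
    """Find a free, aligned block of `size` registers.

    `size` must be a power of two and at least CHUNK.
    Returns (new_bitmap, rrm), or (bitmap, None) if no block is free.
    """
    mask = (1 << (size // CHUNK)) - 1
    for base in range(0, NUM_REGS, size):        # aligned candidates only
        shifted = mask << (base // CHUNK)
        if bitmap & shifted == 0:                # all chunks free
            return bitmap | shifted, base        # base doubles as the RRM
    return bitmap, None


def dealloc(bitmap: int, rrm: int, size: int) -> int:
    """Clear the bitmap bits covering the context at `rrm`."""
    mask = ((1 << (size // CHUNK)) - 1) << (rrm // CHUNK)
    return bitmap & ~mask


bitmap = 0
bitmap, a = alloc(bitmap, 8)
bitmap, b = alloc(bitmap, 16)
bitmap, c = alloc(bitmap, 4)
print(a, b, c)   # 0 16 8
```

Alignment is what lets the block's base register number serve directly as its relocation mask.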
SLIDE 10
Compiler Support
Compiler informs runtime system
  number of registers used by thread
  computed by traversing thread call graph
Compiler protects thread contexts
  threads associated with single application
  single address space
  register and memory overwrites similar
Potential optimizations
  choose number of registers per context
  decreasing marginal benefits
  power-of-two context size constraint
  example: allocate 16 vs. 17 registers
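The 16 vs. 17 example follows from the power-of-two constraint: one extra register doubles the context size. A small sketch, assuming a minimum context size of 4 registers and a 128-register file; the helper names are hypothetical.

```python
def context_size(regs_needed: int, min_size: int = 4) -> int:
    """Round a thread's register requirement up to a power-of-two context."""
    size = min_size
    while size < regs_needed:
        size *= 2
    return size


def max_resident(register_file_size: int, regs_needed: int) -> int:
    """Number of equal-size contexts that fit in the register file."""
    return register_file_size // context_size(regs_needed)


for needed in (16, 17):
    print(needed, context_size(needed), max_resident(128, needed))
# 16 16 8
# 17 32 4
```

A compiler that can hold a thread to 16 registers instead of 17 therefore allows twice as many resident contexts in this setting.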
SLIDE 11
Experiments
Overview
  cache faults
  synchronization faults
Conventional multithreading
  fixed-size contexts: 32 regs
  zero alloc/dealloc costs
Register relocation
  variable-size contexts: 4, 8, 16, 32 regs
  conservative alloc/dealloc costs
Simulation Environment
  single multiprocessor node
  coarsely multithreaded architecture
  synthetic threads with stochastic run lengths
  Proteus simulator
SLIDE 12 Tolerating Cache Faults
Parameters
  run lengths (R) geometrically distributed
  remote memory latency constant
  contexts never unloaded
Example results
  register file size = 128
  threads require 6 to 24 registers
[Figure: efficiency (0.0 to 1.0) versus memory latency (64 to 512 cycles), with curves for R = 128, R = 32, and R = 8.]
SLIDE 13 Tolerating Synchronization Faults
Parameters
  run lengths (R) geometrically distributed
  synchronization latency exponentially distributed
  competitive two-phase unloading policy
Example results
  register file size = 128
  threads require 6 to 24 registers
[Figure: efficiency (0.0 to 1.0) versus synchronization latency (512 to 4096 cycles), with curves for R = 512, R = 128, and R = 32.]
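The slides do not spell out the two-phase unloading policy. The sketch below shows the generic competitive (ski-rental style) idea that such policies usually follow, as an assumption: a blocked context stays resident until its accumulated waiting cost equals the cost of unloading it, and is unloaded only then. The cost units and function names are hypothetical.

```python
def two_phase_cost(wait_time: int, unload_cost: int) -> int:
    """Cost paid by a two-phase policy for a wait of `wait_time` cycles.

    Phase one: stay resident for up to `unload_cost` cycles.
    Phase two: if still blocked, pay `unload_cost` to unload.
    """
    if wait_time <= unload_cost:
        return wait_time                  # event arrived during phase one
    return unload_cost + unload_cost      # waited to the threshold, then unloaded


def optimal_cost(wait_time: int, unload_cost: int) -> int:
    """Cost of a policy that knows the wait time in advance."""
    return min(wait_time, unload_cost)


# The two-phase policy never pays more than twice the offline optimum.
worst = max(
    two_phase_cost(w, 100) / optimal_cost(w, 100) for w in range(1, 1000)
)
print(worst)   # 2.0
```

The factor-of-two bound holds for any latency distribution, which is why this style of policy is a reasonable default when latencies are unpredictable.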
SLIDE 14 Experiment Discussion
Many additional experiments
  similar results
  both cache and synchronization faults
  homogeneous context sizes
Significant performance improvements
  improved processor efficiency
  better over wide range of parameters
  2× improvement for many workloads
Processor efficiency [Saavedra-Barrera 90]
[Figure: efficiency versus number of contexts. Labels: linear, gains, saturated, realistic, simple.]
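The linear and saturated regions can be illustrated with the commonly quoted simplified deterministic model of multithreaded efficiency. This is a sketch, not the simulation used in the experiments: it assumes constant run length R, latency L, and switch cost C, ignores allocation and unloading costs, and the parameter values are hypothetical.

```python
def efficiency(n_contexts: int, run_length: float, latency: float,
               switch_cost: float) -> float:
    """Processor efficiency with N resident contexts (deterministic model).

    Linear region:    N * R / (R + C + L)
    Saturated region: R / (R + C)
    """
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


# Assumed values: R = 32, L = 128, C = 6 cycles.
for n in (1, 4, 8):
    print(n, round(efficiency(n, 32, 128, 6), 3))
# 1 0.193
# 4 0.771
# 8 0.842
```

In the linear region each additional resident context adds efficiency, so fitting 8 smaller contexts instead of 4 fixed-size ones into the same register file can move a workload closer to saturation.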
SLIDE 15
Related Work
Generally inflexible, hardware-intensive
Finely multithreaded processors
  cycle-by-cycle interleaving
  HEP, MASA, Horizon, Tera, Monsoon
Coarsely multithreaded processors
  execute longer instruction blocks
  switch on high-latency operations
  APRIL, hybrid dataflow/von-Neumann
Named State Processor
  fully associative register file
  more flexible, but hardware-intensive
Base + offset register addressing
  addition flexible but expensive
  Am29000, HEP
SLIDE 16
Conclusions
Register relocation
  multiple variable-size contexts
  minimal hardware support
Significant flexibility
  software-based approach
  flexible partitioning of register file
  flexible control over scheduling
Substantial performance improvements
  better utilization of registers
  enables more resident contexts
  tolerate longer latencies, shorter run lengths
  improved processor efficiency
SLIDE 17
Extensions and Future Work
Software-only approach
  generate multiple versions of code
  use disjoint register subsets
Multiple active contexts
  select from multiple RRMs
  context-specific operands
  example: ADD C0.R3, C0.R4, C1.R6
Cache interference effects
  threads share common cache
  most interference destructive
  fine-grain parallelism shrinks working sets
  utilization vs. interference tradeoff
Arbitrary context sizes
  addition vs. OR for relocation
  efficient software support
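The addition vs. OR tradeoff can be seen with a context whose size is not a power of two. The sketch below uses a hypothetical 12-register context based at register 20: OR maps several offsets onto the same physical register, while addition keeps them distinct at the cost of an adder.

```python
def relocate_or(base: int, offset: int) -> int:
    """OR-based relocation: correct only for aligned, power-of-two contexts."""
    return base | offset


def relocate_add(base: int, offset: int) -> int:
    """Addition-based relocation: works for any base and context size."""
    return base + offset


BASE, SIZE = 20, 12   # assumed arbitrary base and context size

or_targets = {relocate_or(BASE, r) for r in range(SIZE)}
add_targets = {relocate_add(BASE, r) for r in range(SIZE)}

print(len(or_targets), len(add_targets))   # 8 12
print(relocate_or(BASE, 4), relocate_add(BASE, 4))   # 20 24
```

With OR, offsets 0 and 4 both land on register 20, so arbitrary context sizes need either addition in hardware or software that avoids the colliding offsets.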