Register Relocation: Flexible Contexts for Multithreading

SLIDE 1

Register Relocation

Flexible Contexts for Multithreading

Carl A. Waldspurger, William E. Weihl

Parallel Software Group
MIT Laboratory for Computer Science
May 18, 1993

SLIDE 2

Multithreading

Goal: tolerate long latencies
Approach: compute while waiting
Mechanism: rapid context switching

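[Figure: memory and register file. The register file holds contexts C0, C1, ..., Cn; a context pointer selects the active context on a switch. Executable threads are held in LoadedQ and ReadyQ.]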

SLIDE 3

Flexible Contexts

Thread requirements vary
  • register usage is thread-dependent
  • decreasing marginal benefits from more registers

Software-based approach
  • application-specific partitioning
  • variable-size contexts
  • static or dynamic division

More resident contexts
  • better utilization of scarce registers
  • improve processor efficiency
SLIDE 4

Outline

Register relocation
  • hardware primitive
  • software support

Experiments
  • remote memory references
  • synchronization events

Related work

Conclusions

Future work

SLIDE 5

Register Relocation

Flexible base/offset scheme
  • Base: register relocation mask (RRM)
  • Offset: context-relative register numbers

Examples:

[Figure: two worked examples with 7-bit register numbers. In each, the RRM (base) is bitwise ORed with a context-relative register number (offset) to give the absolute, relocated register. The first example uses a 4-bit base and a 3-bit offset; the second uses a 3-bit base and a 4-bit offset.]
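The OR-based relocation described on this slide can be sketched in a few lines. This is a minimal illustration, assuming a 128-register file (7-bit register numbers); the function name `relocate` and the bit patterns are illustrative choices, not values read from the slide's figure.

```python
def relocate(rrm: int, reg: int) -> int:
    """Return the absolute register number: the RRM ORed with the operand."""
    return rrm | reg


# Illustrative masks for a 128-register file (hypothetical values).
RRM_8 = 0b0101000    # 4-bit base, low 3 bits clear: an 8-register context
RRM_16 = 0b0100000   # 3-bit base, low 4 bits clear: a 16-register context

print(format(relocate(RRM_8, 0b101), "07b"))    # 0101101
print(format(relocate(RRM_16, 0b1110), "07b"))  # 0101110
```

Because the low-order bits of each mask are zero, the OR simply concatenates base and offset, with no carry chain involved.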

SLIDE 6

Hardware Support

Register relocation mask (RRM)
  • special hardware register
      • ⌈lg n⌉ bits for n general registers

New instruction: ldrrm R
  • set RRM from low-order bits of R
  • delay slots may follow

Instruction decode modifications
  • bitwise OR instruction operands and RRM
  • RISC fixed-field decoding

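[Figure: instruction decode. Each register field of the instruction (Reg src1, Reg src2, Reg dest) is ORed with the RRM to give the relocated Reg src1, Reg src2, and Reg dest; the Opcode and Function fields are not relocated.]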
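The decode step can be modeled as follows. This is a sketch under stated assumptions: the instruction layout (5-bit register fields packed below the opcode), the 128-register file, and the names `decode`, `ldrrm`, and `Decoded` are all hypothetical and chosen only to make the example concrete.

```python
from typing import NamedTuple

FIELD_BITS = 5                      # assumed width of each register operand field
FIELD_MASK = (1 << FIELD_BITS) - 1
NUM_REGS = 128                      # assumed register file size (7-bit RRM)


class Decoded(NamedTuple):
    opcode: int
    dest: int
    src1: int
    src2: int


def ldrrm(value: int) -> int:
    """Model of `ldrrm R`: the new RRM is the low-order bits of R's value."""
    return value & (NUM_REGS - 1)


def decode(instr: int, rrm: int) -> Decoded:
    """Split a hypothetical fixed-field instruction and relocate its operands."""
    src2 = instr & FIELD_MASK
    src1 = (instr >> FIELD_BITS) & FIELD_MASK
    dest = (instr >> (2 * FIELD_BITS)) & FIELD_MASK
    opcode = instr >> (3 * FIELD_BITS)
    # Operand fields sit at fixed positions, so the three ORs do not
    # depend on the opcode.
    return Decoded(opcode, dest | rrm, src1 | rrm, src2 | rrm)


rrm = ldrrm(0x1228)                               # low 7 bits: 0b0101000 (40)
instr = (1 << 15) | (3 << 10) | (4 << 5) | 6      # opcode 1, r3, r4, r6
print(decode(instr, rrm))
```

With the mask set to 40, context-relative registers 3, 4, and 6 map to absolute registers 43, 44, and 46.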

SLIDE 7

Software Support

Context switch
Context (de)allocate
Context (un)load

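[Figure: memory, register file, RRM, and allocation bitmap. The register file holds contexts C0, C1, C2, ..., Cn; the RRM selects the active context on a switch. Executable threads are held in LoadedQ and ReadyQ.]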

SLIDE 8

Context Scheduling

Managed in software
  • no hardware task queues
  • flexible control over policy

Sample policy
  • resident context queue
  • round-robin scheduling
  • fast context switch
  • 4 to 6 cycles

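[Figure: resident contexts linked through registers. Each resident context keeps a saved PC and a NextRRM link in its own registers; a switch saves the active context's PC, then restores the PC and RRM of the next context.]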
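The sample round-robin policy can be sketched as a ring of contexts linked through their own registers. The register layout here (slot 0 for the saved PC, slot 1 for NextRRM) and the helper names are assumptions made for illustration; the slide only states that the links live in registers.

```python
PC_SLOT, NEXT_SLOT = 0, 1   # assumed per-context layout for the two link registers


def link_ring(regs, rrms, entry_pcs):
    """Link resident contexts into a ring via their NextRRM registers."""
    for i, rrm in enumerate(rrms):
        regs[rrm | PC_SLOT] = entry_pcs[i]
        regs[rrm | NEXT_SLOT] = rrms[(i + 1) % len(rrms)]


def context_switch(regs, rrm, pc):
    """Save the active PC, follow NextRRM, and restore the next context's PC."""
    regs[rrm | PC_SLOT] = pc                 # save
    next_rrm = regs[rrm | NEXT_SLOT]         # value that ldrrm would load
    return next_rrm, regs[next_rrm | PC_SLOT]   # restore


regs = [0] * 32
link_ring(regs, [0, 8, 16], [100, 200, 300])    # three 8-register contexts
rrm, pc = context_switch(regs, 0, 104)          # context 0 yields at PC 104
print(rrm, pc)                                  # 8 200
```

Since every step is a register access plus one mask load, a switch of this shape needs only a handful of instructions, which is in line with the 4 to 6 cycles quoted on the slide.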

SLIDE 9

Context Management

Implemented in software
  • flexible partitioning of register file
  • static or dynamic
  • identical or varying sizes

Context allocation
  • general-purpose dynamic routines
  • search allocation bitmap
  • simple shift and mask operations
  • alloc 25 cycles, dealloc 5 cycles
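A bitmap search of the kind listed above might look like the following. This is a sketch, not the routines measured on the slide: it assumes a 128-register file, one bitmap bit per 4-register chunk, and power-of-two context sizes aligned on their own size, and the names `alloc` and `dealloc` are illustrative.

```python
NUM_REGS = 128
GRANULE = 4   # assumed smallest context size: one bitmap bit per 4 registers


def alloc(bitmap: int, size: int):
    """Find a free, size-aligned block.

    Returns (new_bitmap, rrm), or None if no block is free.
    `size` must be a power of two and a multiple of GRANULE.
    """
    chunks = size // GRANULE
    mask = (1 << chunks) - 1
    for pos in range(0, NUM_REGS // GRANULE, chunks):
        if bitmap & (mask << pos) == 0:          # shift and mask test
            return bitmap | (mask << pos), pos * GRANULE
    return None


def dealloc(bitmap: int, rrm: int, size: int) -> int:
    """Clear the bits covering the context that starts at register `rrm`."""
    mask = (1 << (size // GRANULE)) - 1
    return bitmap & ~(mask << (rrm // GRANULE))


bitmap = 0
bitmap, a = alloc(bitmap, 8)     # a == 0
bitmap, b = alloc(bitmap, 16)    # b == 16 (first 16-aligned free block)
bitmap, c = alloc(bitmap, 8)     # c == 8  (fills the gap)
print(a, b, c, hex(bitmap))      # 0 16 8 0xff
```

Keeping each block aligned on its size is what lets the block's starting register number serve directly as its RRM.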

Context loading
  • save/restore exact number of registers
  • single routine with multiple entry points
SLIDE 10

Compiler Support

Compiler informs runtime system
  • number of registers used by thread
  • computed by traversing thread call graph
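One way such a traversal could work is sketched below. It assumes, purely for illustration, that the compiler records for each procedure how many context-relative registers it touches, and that a thread needs the maximum over every procedure reachable from its entry point; the slide does not specify the actual analysis.

```python
def registers_needed(call_graph, regs_used, entry):
    """Maximum register count over all procedures reachable from `entry`."""
    seen, stack, need = set(), [entry], 0
    while stack:
        proc = stack.pop()
        if proc in seen:          # handles recursion and shared callees
            continue
        seen.add(proc)
        need = max(need, regs_used[proc])
        stack.extend(call_graph.get(proc, ()))
    return need


# Hypothetical thread: main calls f and g, and f and h call each other.
call_graph = {"main": ["f", "g"], "f": ["h"], "h": ["f"]}
regs_used = {"main": 6, "f": 10, "g": 4, "h": 14, "unrelated": 30}
print(registers_needed(call_graph, regs_used, "main"))   # 14
```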

Compiler protects thread contexts
  • threads associated with single application
  • single address space
  • register and memory overwrites similar

Potential optimizations
  • choose number of registers per context
  • decreasing marginal benefits
  • power-of-two context size constraint
  • example: allocate 16 vs. 17 registers
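The 16 vs. 17 example follows from the power-of-two constraint: a request is rounded up to the next power of two, so one extra register can double the context. A small sketch, where `context_size` and the minimum size of 4 are illustrative assumptions:

```python
def context_size(n_regs: int, minimum: int = 4) -> int:
    """Round a register requirement up to the next power-of-two context size."""
    size = minimum
    while size < n_regs:
        size *= 2
    return size


for need in (16, 17):
    size = context_size(need)
    print(need, size, size - need)   # requirement, context size, unused registers
# 16 16 0
# 17 32 15
```

A compiler that can keep a thread within 16 registers, at the price of a few spills, therefore frees half of a 32-register context for another thread.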
SLIDE 11

Experiments

Overview
  • cache faults
  • synchronization faults

Conventional multithreading
  • fixed-size contexts: 32 regs
  • zero alloc/dealloc costs

Register relocation
  • variable-size contexts: 4, 8, 16, 32 regs
  • conservative alloc/dealloc costs

Simulation Environment
  • single multiprocessor node
  • coarsely multithreaded architecture
  • synthetic threads with stochastic run lengths
  • Proteus simulator
SLIDE 12

Tolerating Cache Faults

Parameters
  • run lengths (R) geometrically distributed
  • remote memory latency constant
  • contexts never unloaded

Example results
  • register file size = 128
  • threads require 6 to 24 registers

[Figure: efficiency (0.0 to 1.0) versus memory latency (64 to 512 cycles), with curves for R = 128, R = 32, and R = 8.]
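To see why variable-size contexts help in this setting, the following sketch counts how many threads fit in a 128-register file under each scheme. The list of thread requirements is made up (within the slide's 6 to 24 range), and the count ignores alignment and fragmentation, so it is a rough illustration rather than a reproduction of the experiment.

```python
def pow2_size(n_regs: int, minimum: int = 4) -> int:
    """Smallest power-of-two context (at least `minimum`) holding n_regs."""
    size = minimum
    while size < n_regs:
        size *= 2
    return size


def resident_count(requirements, file_size, fixed_size=None):
    """Count threads that fit, in order, before the register file is full."""
    used = count = 0
    for need in requirements:
        size = fixed_size if fixed_size else pow2_size(need)
        if used + size > file_size:
            break
        used += size
        count += 1
    return count


needs = [6, 24, 12, 8, 16, 7, 10, 20]           # hypothetical workload
print(resident_count(needs, 128, fixed_size=32))  # 4
print(resident_count(needs, 128))                 # 7
```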

SLIDE 13

Tolerating Synchronization Faults

Parameters
  • run lengths (R) geometrically distributed
  • synchronization latency exponentially distributed
  • competitive two-phase unloading policy
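The slide does not spell out the unloading policy, so the sketch below shows only the generic two-phase idea it names: a blocked thread stays resident until its waiting cost matches the cost of unloading, and is unloaded after that. The cost model (one unit per waiting cycle, a single fixed `unload_cost`) and the function names are assumptions.

```python
def two_phase_cost(latency: int, unload_cost: int) -> int:
    """Cost paid by the two-phase policy for one synchronization wait."""
    if latency <= unload_cost:
        return latency            # phase 1: the wait ended while still resident
    return 2 * unload_cost        # phase 1 budget spent, then phase 2: unload


def offline_optimal_cost(latency: int, unload_cost: int) -> int:
    """Cost with perfect knowledge of the latency: pick the cheaper option."""
    return min(latency, unload_cost)


# Under this cost model the policy never pays more than twice the optimum.
for latency in (10, 100, 1000, 4096):
    print(latency, two_phase_cost(latency, 200), offline_optimal_cost(latency, 200))
```

This bounded worst case is what makes a policy of this shape "competitive": it needs no prediction of how long a synchronization wait will last.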

Example results
  • register file size = 128
  • threads require 6 to 24 registers

[Figure: efficiency (0.0 to 1.0) versus synchronization latency (512 to 4096 cycles), with curves for R = 512, R = 128, and R = 32.]

SLIDE 14

Experiment Discussion

Many additional experiments
  • similar results
  • both cache and synchronization faults
  • homogeneous context sizes

Significant performance improvements
  • improved processor efficiency
  • better over wide range of parameters
  • 2× improvement for many workloads

Processor efficiency [Saavedra-Barrera 90]

[Figure: efficiency versus number of contexts, showing a linear-gains region followed by a saturated region, with a simple and a realistic curve.]
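The "simple" curve has the shape of the usual deterministic multithreading model: efficiency grows linearly with the number of contexts until the latency is fully hidden, then saturates. The sketch below uses that textbook form; the symbol names and the choice of constant run length R, latency L, and switch cost C are assumptions, not details taken from the slide.

```python
def efficiency(n_contexts: int, run_length: float, latency: float,
               switch_cost: float) -> float:
    """Processor efficiency under a simple deterministic model.

    Linear region:    N * R / (R + C + L)
    Saturated region: R / (R + C)
    """
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


# R = 32 cycles, L = 128 cycles, C = 6 cycles (the upper switch cost quoted earlier)
for n in (1, 2, 4, 8):
    print(n, round(efficiency(n, 32, 128, 6), 3))
```

In this model extra contexts pay off only up to the saturation point, about 1 + L / (R + C) contexts, which is why fitting more resident contexts into the same register file matters most for long latencies and short run lengths.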

SLIDE 15

Related Work

Generally inflexible, hardware-intensive

Finely multithreaded processors
  • cycle-by-cycle interleaving
  • HEP, MASA, Horizon, Tera, Monsoon

Coarsely multithreaded processors
  • execute longer instruction blocks
  • switch on high-latency operations
  • APRIL, hybrid dataflow/von-Neumann

Named State Processor
  • fully associative register file
  • more flexible, but hardware-intensive

Base + offset register addressing
  • addition flexible but expensive
  • Am29000, HEP
SLIDE 16

Conclusions

Register relocation
  • multiple variable-size contexts
  • minimal hardware support

Significant flexibility
  • software-based approach
  • flexible partitioning of register file
  • flexible control over scheduling

Substantial performance improvements
  • better utilization of registers
  • enables more resident contexts
  • tolerate longer latencies, shorter run lengths
  • improved processor efficiency
SLIDE 17

Extensions and Future Work

Software-only approach
  • generate multiple versions of code
  • use disjoint register subsets

Multiple active contexts
  • select from multiple RRMs
  • context-specific operands
  • example: ADD C0.R3, C0.R4, C1.R6

Cache interference effects
  • threads share common cache
  • most interference destructive
  • fine-grain parallelism shrinks working sets
  • utilization vs. interference tradeoff

Arbitrary context sizes
  • addition vs. OR for relocation
  • efficient software support