
Register Relocation: Flexible Contexts for Multithreading
Carl A. Waldspurger, William E. Weihl
Parallel Software Group, MIT Laboratory for Computer Science
May 18, 1993

Multithreading
Goal: tolerate long latencies
Approach: compute while waiting
Mechanism: rapid context switching
[Figure: register file with contexts C0, C1, ..., Cn and a context pointer (switch); memory with executable threads (LoadedQ, ReadyQ).]

Flexible Contexts
Thread requirements vary
  • register usage is thread-dependent
  • decreasing marginal benefits from more registers
Software-based approach
  • application-specific partitioning
  • variable-size contexts
  • static or dynamic division
More resident contexts
  • better utilization of scarce registers
  • improve processor efficiency

Outline
Register relocation
  • hardware primitive
  • software support
Experiments
  • remote memory references
  • synchronization events
Related work
Conclusions
Future work

Register Relocation
Flexible base/offset scheme
  Base: register relocation mask (RRM)
  Offset: context-relative register numbers
Examples:
                                  Example 1          Example 2
  RRM                             0101000            0100000
  context-relative register         00101              01110
  (bitwise OR)
  absolute, relocated register    0101101            0101110
  BASE / OFFSET                   4 bits / 3 bits    3 bits / 4 bits
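The base/offset scheme above can be checked with a few lines of Python. This is a minimal sketch that reproduces the two bit patterns from the slide; the function name `relocate` is illustrative and not part of the original material.

```python
def relocate(rrm: int, reg: int) -> int:
    """Absolute register number: the context-relative number ORed with the RRM."""
    return rrm | reg


# The two examples from the slide: a 7-bit RRM and a 5-bit operand field.
ex1 = relocate(0b0101000, 0b00101)  # base in the top 4 bits, 3-bit offset
ex2 = relocate(0b0100000, 0b01110)  # base in the top 3 bits, 4-bit offset

print(f"{ex1:07b} {ex2:07b}")
```

Because the low-order bits of the RRM are zero wherever the offset may be nonzero, the OR behaves like a concatenation of base and offset.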

Hardware Support
Register relocation mask (RRM)
  • special hardware register
  • ⌈lg n⌉ bits for n general registers
New instruction: ldrrm R
  • set RRM from low-order bits of R
  • delay slots may follow
Instruction decode modifications
  • bitwise OR instruction operands and RRM
  • RISC fixed-field decoding
[Figure: instruction fields Opcode, Reg src1, Reg src2, Reg dest, Function; each register field is ORed with the RRM to give the relocated src1, src2, and dest registers.]
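As a rough software model of the hardware support described above, the sketch below sizes the RRM, models `ldrrm`, and ORs the RRM into each register field at decode. The class and method names are hypothetical, delay slots are ignored, and a 128-entry register file is assumed.

```python
from dataclasses import dataclass


def rrm_bits(n_registers: int) -> int:
    """ceil(lg n) bits suffice to name any of n general registers."""
    return (n_registers - 1).bit_length()


@dataclass
class Decoder:
    n_registers: int = 128  # assumed register file size
    rrm: int = 0

    def ldrrm(self, r_value: int) -> None:
        # Set the RRM from the low-order bits of the source register value.
        self.rrm = r_value & ((1 << rrm_bits(self.n_registers)) - 1)

    def decode(self, src1: int, src2: int, dest: int) -> tuple[int, int, int]:
        # Each fixed-position register field is ORed with the RRM.
        return (self.rrm | src1, self.rrm | src2, self.rrm | dest)


decoder = Decoder()
decoder.ldrrm(0xFF28)                # only the low 7 bits are kept
relocated = decoder.decode(3, 4, 5)
```

Fixed-field decoding matters here because the register fields sit at known bit positions, so the OR can be applied without first decoding the opcode.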

Software Support
Context switch
Context (de)allocate
Context (un)load
[Figure: register file with contexts C0, C1, C2, ..., Cn, selected by the RRM (switch) and tracked by an allocation bitmap; memory with executable threads (LoadedQ, ReadyQ).]

Context Scheduling
Managed in software
  • no hardware task queues
  • flexible control over policy
  • resident context queue
Sample policy
  • round-robin scheduling
  • fast context switch (4 to 6 cycles)
[Figure: resident contexts linked through registers, each holding a NextRRM and a PC; a switch saves the PC of the active context and restores the PC and RRM of the next one.]
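The round-robin policy with links kept in registers can be sketched as follows. The slot layout (`REG_PC`, `REG_NEXT`) and the `Processor` class are assumptions made for illustration; the slides only state that each resident context holds a NextRRM and a PC.

```python
REG_PC, REG_NEXT = 0, 1  # hypothetical per-context slots: saved PC and NextRRM


class Processor:
    def __init__(self, n_registers: int = 128):
        self.regs = [0] * n_registers
        self.rrm = 0
        self.pc = 0

    def reg(self, r: int) -> int:
        # Context-relative read: the operand is ORed with the RRM.
        return self.regs[self.rrm | r]

    def set_reg(self, r: int, value: int) -> None:
        self.regs[self.rrm | r] = value

    def link(self, rrms: list[int], pcs: list[int]) -> None:
        """Chain the resident contexts into a circular queue via NextRRM."""
        for i, (rrm, pc) in enumerate(zip(rrms, pcs)):
            self.regs[rrm | REG_PC] = pc
            self.regs[rrm | REG_NEXT] = rrms[(i + 1) % len(rrms)]
        self.rrm, self.pc = rrms[0], pcs[0]

    def context_switch(self) -> None:
        self.set_reg(REG_PC, self.pc)   # save PC in the outgoing context
        self.rrm = self.reg(REG_NEXT)   # models "ldrrm NextRRM"
        self.pc = self.reg(REG_PC)      # restore PC of the incoming context


cpu = Processor()
cpu.link([0, 32, 64], [100, 200, 300])
cpu.pc = 104                 # the first thread has advanced a little
cpu.context_switch()
```

The switch touches only a saved PC and one link register, which is consistent in spirit with the handful of cycles quoted on the slide, though the cycle count itself is not modeled here.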

Context Management
Implemented in software
  • flexible partitioning of register file
  • static or dynamic
  • identical or varying sizes
Context allocation
  • general-purpose dynamic routines
  • search allocation bitmap
  • simple shift and mask operations
  • alloc ≈ 25 cycles, dealloc ≈ 5 cycles
Context loading
  • save/restore exact number of registers
  • single routine with multiple entry points
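A bitmap search of the kind named above can be sketched with shifts and masks. The granularity (one bitmap bit per 4 registers), the 128-register file, and the class name are assumptions for this example; the actual routines are not shown in the slides.

```python
CHUNK = 4      # registers per bitmap bit (assumed granularity)
N_REGS = 128   # assumed register file size


class ContextAllocator:
    def __init__(self) -> None:
        self.bitmap = 0  # bit i set => registers 4i .. 4i+3 are in use

    def alloc(self, size: int):
        """Return the RRM (base register) of a free aligned block, or None.

        `size` must be a power of two and at least CHUNK.
        """
        chunks = size // CHUNK
        mask = (1 << chunks) - 1
        # Only size-aligned positions are tried, so base | offset stays valid.
        for start in range(0, N_REGS // CHUNK, chunks):
            if (self.bitmap & (mask << start)) == 0:
                self.bitmap |= mask << start
                return start * CHUNK
        return None

    def dealloc(self, rrm: int, size: int) -> None:
        mask = (1 << (size // CHUNK)) - 1
        self.bitmap &= ~(mask << (rrm // CHUNK))


allocator = ContextAllocator()
first = allocator.alloc(16)
second = allocator.alloc(8)
third = allocator.alloc(32)
```

Deallocation is a single mask-and-clear, which fits the slide's observation that it is much cheaper than allocation.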

Compiler Support
Compiler informs runtime system
  • number of registers used by thread
  • computed by traversing thread call graph
Compiler protects thread contexts
  • threads associated with single application
  • single address space
  • register and memory overwrites similar
Potential optimizations
  • choose number of registers per context
  • decreasing marginal benefits
  • power-of-two context size constraint
  • example: allocate 16 vs. 17 registers
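The "16 vs. 17 registers" example follows from the power-of-two constraint: a thread that needs one register more than a power of two is rounded up to the next size. A minimal sketch, where the function name and the minimum size of 4 are assumptions:

```python
def context_size(regs_needed: int, min_size: int = 4) -> int:
    """Smallest power-of-two context (at least min_size) that fits the thread."""
    size = min_size
    while size < regs_needed:
        size *= 2
    return size


for needed in (16, 17):
    size = context_size(needed)
    print(f"needs {needed:2d} -> context of {size:2d}, {size - needed:2d} unused")
```

This is why a compiler that can trim a thread from 17 registers to 16 halves that thread's context size.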

Experiments
Overview
  • cache faults
  • synchronization faults
Conventional multithreading
  • fixed-size contexts: 32 regs
  • zero alloc/dealloc costs
Register relocation
  • variable-size contexts: 4, 8, 16, 32 regs
  • conservative alloc/dealloc costs
Simulation Environment
  • single multiprocessor node
  • coarsely multithreaded architecture
  • synthetic threads with stochastic run lengths
  • Proteus simulator

Tolerating Cache Faults
Parameters
  • run lengths (R) geometrically distributed
  • remote memory latency constant
  • contexts never unloaded
  • register file size = 128
  • threads require 6 to 24 registers
Example results
[Figure: efficiency (0.0 to 1.0) versus memory latency (0 to 512 cycles), with curves for R = 8, R = 32, and R = 128.]

Tolerating Synchronization Faults
Parameters
  • run lengths (R) geometrically distributed
  • synchronization latency exponentially distributed
  • competitive two-phase unloading policy
  • register file size = 128
  • threads require 6 to 24 registers
Example results
[Figure: efficiency (0.0 to 1.0) versus synchronization latency (0 to 4096 cycles), with curves for R = 32, R = 128, and R = 512.]

Experiment Discussion
Many additional experiments
  • similar results
  • both cache and synchronization faults
  • homogeneous context sizes
Significant performance improvements
  • improved processor efficiency
  • better over wide range of parameters
  • 2× improvement for many workloads
Processor efficiency [Saavedra-Barrera 90]
[Figure: efficiency versus number of contexts, showing a linear region and a saturated region, "simple" and "realistic" curves, and the gains from additional contexts.]
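The linear and saturated regions in the efficiency plot can be illustrated with a simple deterministic model. The formulas below are an assumption of this sketch (a common simplification in which efficiency grows as N·R/(R+C+L) until it saturates at R/(R+C)), not something stated on the slides; the switch cost of 6 cycles and the chosen R and L are likewise illustrative.

```python
def efficiency(n_contexts: int, run_length: float,
               latency: float, switch_cost: float) -> float:
    """Simplified deterministic efficiency: linear in N until saturation."""
    linear = n_contexts * run_length / (run_length + switch_cost + latency)
    saturated = run_length / (run_length + switch_cost)
    return min(linear, saturated)


REGISTER_FILE = 128
for context_regs in (32, 8):
    n = REGISTER_FILE // context_regs   # resident contexts that fit
    e = efficiency(n, run_length=32, latency=256, switch_cost=6)
    print(f"{context_regs:2d}-register contexts: N = {n:2d}, efficiency = {e:.2f}")
```

Under these assumed numbers, four 32-register contexts stay in the linear region, while sixteen 8-register contexts reach saturation, which is the kind of gain the slide attributes to having more resident contexts.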

Related Work
Generally inflexible, hardware-intensive
Finely multithreaded processors
  • cycle-by-cycle interleaving
  • HEP, MASA, Horizon, Tera, Monsoon
Coarsely multithreaded processors
  • execute longer instruction blocks
  • switch on high-latency operations
  • APRIL, hybrid dataflow/von-Neumann
Named State Processor
  • fully associative register file
  • more flexible, but hardware-intensive
Base + offset register addressing
  • addition flexible but expensive
  • Am29000, HEP

Conclusions
Register relocation
  • multiple variable-size contexts
  • minimal hardware support
  • software-based approach
Significant flexibility
  • flexible partitioning of register file
  • flexible control over scheduling
Substantial performance improvements
  • better utilization of registers
  • enables more resident contexts
  • tolerate longer latencies, shorter run lengths
  • improved processor efficiency

Extensions and Future Work
Software-only approach
  • generate multiple versions of code
  • use disjoint register subsets
Multiple active contexts
  • select from multiple RRMs
  • context-specific operands
  • example: ADD C0.R3, C0.R4, C1.R6
Cache interference effects
  • threads share common cache
  • most interference destructive
  • fine-grain parallelism shrinks working sets
  • utilization vs. interference tradeoff
Arbitrary context sizes
  • addition vs. OR for relocation
  • efficient software support
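The "addition vs. OR" tradeoff can be seen with an unaligned base. In this sketch (function names are illustrative), OR relocation maps distinct offsets to the same register when the base has low-order bits set, whereas addition keeps them distinct at the cost of an adder in the decode path.

```python
def relocate_or(base: int, offset: int) -> int:
    return base | offset


def relocate_add(base: int, offset: int) -> int:
    return base + offset


unaligned_base = 17  # not a multiple of the context size
or_regs = {relocate_or(unaligned_base, r) for r in range(4)}
add_regs = {relocate_add(unaligned_base, r) for r in range(4)}

aligned_base = 16    # aligned: both schemes agree
same = all(relocate_or(aligned_base, r) == relocate_add(aligned_base, r)
           for r in range(4))

print(sorted(or_regs), sorted(add_regs), same)
```

With aligned power-of-two contexts the two schemes are equivalent, so addition only pays off if arbitrary context sizes are wanted.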
