Register Relocation: Flexible Contexts for Multithreading


  1. Register Relocation: Flexible Contexts for Multithreading
     Carl A. Waldspurger and William E. Weihl
     Parallel Software Group, MIT Laboratory for Computer Science
     May 18, 1993

  2. Multithreading
     Goal: tolerate long latencies
     Approach: compute while waiting
     Mechanism: rapid context switching
     [Figure: a context pointer switches among resident contexts C0..Cn in the register file; a LoadedQ and ReadyQ of executable threads live in memory]

  3. Flexible Contexts
     Thread requirements vary
     • register usage is thread-dependent
     • decreasing marginal benefits from more registers
     Software-based approach
     • application-specific partitioning
     • variable-size contexts
     • static or dynamic division
     More resident contexts
     • better utilization of scarce registers
     • improve processor efficiency

  4. Outline
     • Register relocation: hardware primitive, software support
     • Experiments: remote memory references, synchronization events
     • Related work
     • Conclusions
     • Future work

  5. Register Relocation
     Flexible base/offset scheme
     • Base: register relocation mask (RRM)
     • Offset: context-relative register numbers
     Examples: the context-relative register number is bitwise-ORed with the RRM to produce the absolute, relocated register number; one example splits the register number into a 4-bit base and 3-bit offset, the other into a 3-bit base and 4-bit offset.
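The base/offset scheme above can be sketched in a few lines of C (the function name is illustrative, not from the talk). Because each context is aligned so that its base bits and the offset bits never overlap, a bitwise OR suffices in place of an add:

```c
#include <stdint.h>

/* Combine the active context's relocation mask (base) with a small
 * context-relative register number (offset). With aligned contexts the
 * fields are disjoint, so OR produces the absolute register number. */
static inline uint8_t relocate(uint8_t rrm, uint8_t ctx_rel_reg) {
    return rrm | ctx_rel_reg;   /* absolute, relocated register */
}
```

For instance, with a 4-bit base and 3-bit offset, RRM 0101000 and context-relative register 011 relocate to absolute register 0101011.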

  6. Hardware Support
     Register relocation mask (RRM)
     • special hardware register
     • ⌈lg n⌉ bits for n general registers
     New instruction: ldrrm R
     • set RRM from low-order bits of R
     • delay slots may follow
     Instruction decode modifications
     • bitwise OR instruction operands and RRM
     • RISC fixed-field decoding
     [Figure: each register specifier in the instruction (src1, src2, dest) is ORed with the RRM to yield the relocated register numbers; opcode and function fields pass through unchanged]
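The decode modification can be sketched as follows. With RISC fixed-field decoding the three register specifiers sit at fixed bit positions, so each can be ORed with the RRM in parallel; the MIPS-style R-type field positions below are an assumption for illustration, not taken from the talk:

```c
#include <stdint.h>

/* Relocate all three 5-bit register fields of a MIPS-style R-type
 * instruction word by ORing each with the RRM (field positions assumed:
 * rs at bits 25:21, rt at 20:16, rd at 15:11). */
static uint32_t relocate_rtype(uint32_t insn, uint32_t rrm) {
    uint32_t rs = (insn >> 21) & 0x1F;
    uint32_t rt = (insn >> 16) & 0x1F;
    uint32_t rd = (insn >> 11) & 0x1F;
    /* clear the three register fields, keep opcode/function bits */
    insn &= ~((0x1Fu << 21) | (0x1Fu << 16) | (0x1Fu << 11));
    /* reinsert the relocated register numbers */
    return insn | ((rs | rrm) << 21) | ((rt | rrm) << 16) | ((rd | rrm) << 11);
}
```

In hardware this is three independent 5-bit OR arrays on the decode path, which is why the scheme adds essentially no decode latency.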

  7. Software Support
     • Context switch
     • Context (de)allocate
     • Context (un)load
     [Figure: software maintains the RRM, an allocation bitmap, and the LoadedQ/ReadyQ of executable threads; variable-size contexts C0..Cn partition the register file]

  8. Context Scheduling
     Managed in software
     • no hardware task queues
     • flexible control over policy
     Sample policy
     • resident context queue
     • round-robin scheduling
     • fast context switch: 4 to 6 cycles
     [Figure: "links in registers": each resident context holds a NextRRM and a PC in its own registers; the active context saves and restores its PC and RRM]
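The links-in-registers policy can be modeled in C as a circular queue threaded through the contexts themselves: each resident context keeps its successor's RRM and its own resume PC in two of its registers, so a switch is just a save, a mask load, and a jump, consistent with the 4 to 6 cycle figure. Structure and field names below are illustrative, not from the talk:

```c
#include <stdint.h>

/* Two registers reserved inside each resident context ("links in
 * registers"): the successor's RRM and this context's saved resume PC. */
struct context {
    uint8_t  next_rrm;  /* RRM of the next context in the resident queue */
    uint32_t pc;        /* this context's saved resume PC */
};

struct cpu {
    uint8_t  rrm;       /* active register relocation mask */
    uint32_t pc;        /* active program counter */
};

/* resident[] models the register-file slices addressed by RRM value.
 * A switch saves the active PC into the current context's own registers,
 * installs the successor's RRM, and resumes at the successor's saved PC;
 * no thread state leaves the register file. */
static void context_switch(struct cpu *cpu, struct context resident[]) {
    struct context *cur = &resident[cpu->rrm];
    cur->pc  = cpu->pc;              /* save resume PC */
    cpu->rrm = cur->next_rrm;        /* relocate to successor's context */
    cpu->pc  = resident[cpu->rrm].pc; /* resume successor */
}
```

Round-robin order falls out of the next_rrm links forming a cycle; the runtime can change policy simply by relinking them.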

  9. Context Management
     Implemented in software
     • flexible partitioning of register file: static or dynamic, identical or varying sizes
     • general-purpose dynamic routines
     Context allocation
     • search allocation bitmap
     • simple shift and mask operations
     • alloc ≈ 25 cycles, dealloc ≈ 5 cycles
     Context loading
     • save/restore exact number of registers
     • single routine with multiple entry points
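A minimal sketch of the allocation-bitmap search, assuming one bit per register (32 registers shown for brevity; the talk's register file has 128). A power-of-two context must be naturally aligned so that OR-relocation works, so the scan advances in size-register steps using only shifts and masks; all names are illustrative:

```c
#include <stdint.h>

/* Allocate a naturally aligned run of `size` registers (size a power of
 * two) in a 32-bit allocation bitmap. Returns the base register, or -1. */
static int ctx_alloc(uint32_t *bitmap, int size) {
    uint32_t mask = (size == 32) ? 0xFFFFFFFFu : ((1u << size) - 1);
    for (int base = 0; base < 32; base += size) {
        if (((*bitmap >> base) & mask) == 0) {  /* whole run free? */
            *bitmap |= mask << base;            /* mark allocated */
            return base;
        }
    }
    return -1;  /* no aligned run of that size available */
}

/* Deallocation just clears the run's bits: a few cycles, as on the slide. */
static void ctx_free(uint32_t *bitmap, int base, int size) {
    uint32_t mask = (size == 32) ? 0xFFFFFFFFu : ((1u << size) - 1);
    *bitmap &= ~(mask << base);
}
```

Note how alignment makes fragmentation-free reuse cheap: freeing a 4-register context always reopens a slot that a later 4-register allocation can take.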

  10. Compiler Support
     Compiler informs runtime system
     • number of registers used by thread
     • computed by traversing thread call graph
     Compiler protects thread contexts
     • threads associated with single application
     • single address space
     • register and memory overwrites similar
     Potential optimizations
     • choose number of registers per context
     • decreasing marginal benefits
     • power-of-two context size constraint
     • example: allocate 16 vs. 17 registers
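The power-of-two constraint means the runtime must round the compiler-reported register count up: a thread needing 17 registers gets a 32-register context, which is why trimming a thread to 16 registers matters. A small helper (mine, not from the talk) that computes the context size:

```c
/* Round a thread's register count up to the next power of two,
 * the smallest context size that satisfies the alignment constraint. */
static int ctx_size(int nregs) {
    int size = 1;
    while (size < nregs)
        size <<= 1;
    return size;
}
```

This is where the "decreasing marginal benefits" observation bites: spilling one value to shrink a thread from 17 to 16 registers halves its context size.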

  11. Experiments: Overview
     Two fault types studied
     • cache faults
     • synchronization faults
     Conventional multithreading
     • fixed-size contexts: 32 regs
     • zero alloc/dealloc costs
     Register relocation
     • variable-size contexts: 4, 8, 16, 32 regs
     • conservative alloc/dealloc costs
     Simulation environment
     • Proteus simulator
     • single multiprocessor node
     • coarsely multithreaded architecture
     • synthetic threads with stochastic run lengths

  12. Tolerating Cache Faults
     Parameters
     • run lengths (R) geometrically distributed
     • remote memory latency constant
     • contexts never unloaded
     • register file size = 128
     • threads require 6 to 24 registers
     Example results
     [Graph: processor efficiency vs. memory latency (0 to 512 cycles), with curves for R = 8, 32, and 128; efficiency rises with run length]

  13. Tolerating Synchronization Faults
     Parameters
     • run lengths (R) geometrically distributed
     • synchronization latency exponentially distributed
     • competitive two-phase unloading policy
     • register file size = 128
     • threads require 6 to 24 registers
     Example results
     [Graph: processor efficiency vs. synchronization latency (0 to 4096 cycles), with curves for R = 32, 128, and 512; efficiency rises with run length]

  14. Experiment Discussion
     Many additional experiments
     • both cache and synchronization faults
     • homogeneous context sizes
     • similar results
     Significant performance improvements
     • improved processor efficiency
     • better over wide range of parameters
     • 2× improvement for many workloads
     [Graph: processor efficiency vs. number of contexts, after [Saavedra-Barrera 90]: linear gains that saturate, shown for simple and realistic models]
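The efficiency curve's shape, linear gains followed by saturation, can be checked against a deterministic simplification of the multithreaded-processor model the slide cites from [Saavedra-Barrera 90]. The closed form below is my reconstruction under constant run length R, fault latency L, and switch cost C, not a formula from the talk:

```c
/* Deterministic efficiency model (a reconstruction, not from the talk):
 * with n resident contexts, useful work per scheduling period is n*R.
 * The period is the larger of cycling through all n contexts (latency
 * fully hidden) and one thread's run + switch + fault latency. */
static double efficiency(int n, double run, double lat, double sw) {
    double cycle_all = n * (run + sw);   /* saturated regime */
    double one_fault = run + sw + lat;   /* latency-bound regime */
    double period    = (cycle_all > one_fault) ? cycle_all : one_fault;
    return n * run / period;
}
```

With R = 32, L = 128, C = 6, efficiency grows linearly in n and then saturates at R/(R + C), matching the knee shape in the figure: more contexts help only until the latency is fully overlapped.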

  15. Related Work
     Generally inflexible, hardware-intensive
     Finely multithreaded processors
     • cycle-by-cycle interleaving
     • HEP, MASA, Horizon, Tera, Monsoon
     Coarsely multithreaded processors
     • execute longer instruction blocks
     • switch on high-latency operations
     • APRIL, hybrid dataflow/von Neumann
     Named State Processor
     • fully associative register file
     • more flexible, but hardware-intensive
     Base + offset register addressing
     • addition flexible but expensive
     • Am29000, HEP

  16. Conclusions
     Register relocation
     • software-based approach
     • minimal hardware support
     Significant flexibility
     • flexible partitioning of register file
     • multiple variable-size contexts
     • flexible control over scheduling
     • better utilization of registers
     • enables more resident contexts
     Substantial performance improvements
     • tolerate longer latencies, shorter run lengths
     • improved processor efficiency

  17. Extensions and Future Work
     Software-only approach
     • generate multiple versions of code
     • use disjoint register subsets
     • select from multiple RRMs
     Multiple active contexts
     • context-specific operands
     • example: ADD C0.R3, C0.R4, C1.R6
     Cache interference effects
     • threads share common cache
     • most interference destructive
     • fine-grain parallelism shrinks working sets
     • utilization vs. interference tradeoff
     Arbitrary context sizes
     • addition vs. OR for relocation
     • efficient software support
