
CS184c: Computer Architecture [Parallel and Multithreaded]



  1. CS184c: Computer Architecture [Parallel and Multithreaded]
     Day 7: April 24, 2001
     Threaded Abstract Machine (TAM)
     Simultaneous Multi-Threading (SMT)
     CALTECH cs184c Spring2001 -- DeHon

     Reading
     • Shared Memory
       – Focus: H&P Ch 8
         • At least read this…
       – Retrospectives
         • Valuable and short
       – ISCA papers
         • Good primary sources

     Today
     • TAM
     • SMT

     Threaded Abstract Machine

     TAM
     • Parallel Assembly Language
     • Fine-Grained Threading
     • Hybrid Dataflow
     • Scheduling Hierarchy

     TL0 Model
     • Activation Frame (like stack frame)
       – Variables
       – Synchronization
       – Thread stack (continuation vectors)
     • Heap Storage
       – I-structures

  2. TL0 Ops
     • RISC-like ALU Ops
     • FORK
     • SWITCH
     • STOP
     • POST
     • FALLOC
     • FFREE
     • SWAP

     Scheduling Hierarchy
     • Intra-frame
       – Related threads in same frame
       – Frame runs on single processor
       – Schedule together, exploit locality
         • (cache, maybe regs)
     • Inter-frame
       – Only swap when exhaust work in current frame

     Intra-Frame Scheduling
     • Simple (local) stack of pending threads
     • Fork places new PC on stack
     • STOP pops next PC off stack
     • Stack initialized with code to exit activation frame
       – Including schedule next frame
       – Save live registers

     TL0/CM5 Intra-frame
     • Fork on thread
       – Fall through 0 inst
       – Unsynch branch 3 inst
       – Successful synch 4 inst
       – Unsuccessful synch 8 inst
     • Push thread onto LCV 3-6 inst

     Fib Example
     • [look at how this turns into TL0 code]

     Multiprocessor Parallelism
     • Comes from frame allocations
     • Runtime policy where allocate frames
       – Maybe use work stealing?
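The Intra-Frame Scheduling slide describes a local stack of pending threads: FORK pushes a PC, STOP pops the next one, and the stack bottom holds the code that exits the frame. A minimal Python sketch of that discipline follows. The `Frame` class, the thread names, and the `bodies` table are hypothetical, made up for illustration; they are not TL0 syntax and ignore synchronization counters and register saving.

```python
class Frame:
    """Hypothetical activation frame with a local stack of pending threads."""

    def __init__(self):
        # The stack is initialized with the code that exits the frame.
        self.pending = ["exit_frame"]
        self.trace = []

    def fork(self, thread):
        # FORK places a new thread (its PC) on the stack.
        self.pending.append(thread)

    def stop(self):
        # STOP pops the next PC off the stack.
        return self.pending.pop()

    def run(self, entry, bodies):
        # Run threads until the exit code at the bottom of the stack is popped.
        current = entry
        while current != "exit_frame":
            self.trace.append(current)
            bodies[current](self)
            current = self.stop()
        self.trace.append("exit_frame")
        return self.trace


def t0(frame):
    frame.fork("t1")
    frame.fork("t2")


def t2(frame):
    frame.fork("t3")


def leaf(frame):
    pass  # a thread that forks nothing and just ends with STOP


def demo():
    bodies = {"t0": t0, "t1": leaf, "t2": t2, "t3": leaf}
    return Frame().run("t0", bodies)
```

Because the structure is a stack, the most recently forked thread runs first (here `t2` and its child `t3` run before `t1`), and the frame only exits once all of its pending work is exhausted.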

  3. Frame Scheduling
     • Inlets to non-active frames initiate pending thread stack (RCV)
     • First inlet may place frame on processor's runnable frame queue
     • SWAP instruction picks next frame, branches to its enter thread

     CM5 Frame Scheduling Costs
     • Inlet Posts on non-running thread
       – 10-15 instructions
     • Swap to next frame
       – 14 instructions
     • Average thread cost 7 cycles
       – Constitutes 15-30% TL0 instr

     Instruction Mix
     [figure: Culler et al. JPDC, July 1993]

     Cycle Breakdown
     [figure: Culler et al. JPDC, July 1993]

     Speedup Example
     [figure: Culler et al. JPDC, July 1993]

     Thread Stats
     • Thread lengths 3-17
     • Threads run per "quantum" 7-530
     [Culler et al. JPDC, July 1993]

  4. Great Project
     • Develop optimized µArch for TAM
       – Hardware support/architecture for single-cycle thread-switch/post

     Multithreaded Architectures

     Problem
     • Long latency of operations
       – Non-local memory fetch
       – Long latency operations (mpy, fp)
     • Wastes processor cycles while stalled
     • If processor stalls on return
       – Latency problem turns into a throughput (utilization) problem
       – CPU sits idle

     Idea
     • Run something else useful while stalled
     • In particular, another thread
       – Another PC
     • Again, use parallelism to "tolerate" latency

     HEP/µUnity/Tera
     • Provide a number of contexts
       – Copies of register file…
     • Number of contexts ≥ operation latency
       – Pipeline depth
       – Roundtrip time to main memory
     • Run each round-robin

     HEP Pipeline
     [figure: Arvind+Iannucci, DFVLR'87]
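The HEP/µUnity/Tera slide asks for a number of contexts at least equal to the operation latency, run round-robin. A small simulation can make the reason concrete. This is only a sketch under simplifying assumptions that are not in the slides: every instruction has the same fixed `latency`, a context cannot issue again until its previous instruction completes, and one instruction issues per cycle at most. The function name and parameters are hypothetical.

```python
def utilization(contexts, latency, cycles=1000):
    """Fraction of cycles in which a strictly round-robin pipeline issues."""
    ready_at = [0] * contexts  # cycle at which each context may issue again
    issued = 0
    for cycle in range(cycles):
        ctx = cycle % contexts  # strict round-robin selection
        if ready_at[ctx] <= cycle:
            issued += 1
            ready_at[ctx] = cycle + latency
        # otherwise the slot is wasted: the selected context is still waiting
    return issued / cycles
```

With a latency of 8 cycles, 8 or more contexts keep the pipeline full, 4 contexts fill half the slots, and a single context issues only once every 8 cycles. The last case is the poor single-threaded performance that strict interleaving trades for throughput.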

  5. Strict Interleaved Threading
     • Uses parallelism to get throughput
     • Potentially poor single-threaded performance
       – Increases end-to-end latency of thread

     SMT

     Can we do both?
     • Issue from multiple threads into pipeline
     • No worse than (super)scalar on single thread
     • More throughput with multiple threads
       – Fill in what would have been empty issue slots with instructions from different threads

     SuperScalar Inefficiency
     [figure: issue slots, with "Unused Slot" marked]
     Recall: limited Scalar IPC

     SMT Promise
     Fill in empty slots with other threads

     SMT Estimates (ideal)
     [figure: Tullsen et al. ISCA '95]
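The "Can we do both?" slide claims that issuing from multiple threads is no worse than a superscalar on one thread and fills otherwise empty slots when more threads are available. A toy model of a single cycle illustrates the claim. The function name and the per-thread ready counts are invented for illustration; they are not measurements from the cited papers, and the model ignores functional-unit and fetch limits.

```python
def issued_per_cycle(ready_per_thread, width):
    """Fill up to `width` issue slots, taking ready instructions thread by thread."""
    slots_left = width
    for ready in ready_per_thread:
        take = min(ready, slots_left)  # a thread cannot use more slots than remain
        slots_left -= take
    return width - slots_left


# Hypothetical 8-issue machine.
single_thread = issued_per_cycle([2], width=8)           # behaves like a superscalar
four_threads = issued_per_cycle([2, 3, 1, 2], width=8)   # other threads fill the gaps
```

With one thread the result is bounded by that thread's own ready instructions, exactly as on a superscalar; with several threads the unused slots are handed to the others until the issue width is reached.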

  6. SMT Estimates (ideal)
     [figure: Tullsen et al. ISCA '95]

     SMT uArch
     • Observation: exploit register renaming
       – Get small modifications to existing superscalar architecture

     SMT uArch
     • N.B. remarkable thing is how similar superscalar core is
     [Tullsen et al. ISCA '96]

     Stopped Here 4/24/01

     SMT uArch
     • Changes:
       – Multiple PCs
       – Control to decide how to fetch from
       – Separate return stacks per thread
       – Per-thread reorder/commit/flush/trap
       – Thread id w/ BTB
       – Larger register file
         • More things outstanding

     Performance
     [figure: Tullsen et al. ISCA '96]

  7. Optimizing: fetch freedom
     • RR=Round Robin
     • RR.X.Y
       – X – threads do fetch in cycle
       – Y – instructions fetched/thread
     [Tullsen et al. ISCA '96]

     Optimizing: Fetch Alg.
     • ICOUNT – priority to thread w/ fewest pending instrs
     • BRCOUNT
     • MISSCOUNT
     • IQPOSN – penalize threads w/ old instrs (at front of queues)
     [Tullsen et al. ISCA '96]

     Throughput Improvement
     • 8-issue superscalar
       – Achieves little over 2 instructions per cycle
     • Optimized SMT
       – Achieves 5.4 instructions per cycle on 8 threads
     • 2.5x throughput increase

     Costs
     [figure: Burns+Gaudiot HPCA'99]

     Costs
     [figure: Burns+Gaudiot HPCA'99]

     Not Done, yet…
     • Conventional SMT formulation is for coarse-grained threads
     • Combine SMT w/ TAM ?
       – Fill pipeline from multiple runnable threads in activation frame
       – ?multiple activation frames?
       – Eliminate thread switch overhead?
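The Fetch Alg. slide summarizes ICOUNT as giving priority to the thread with the fewest pending instructions, and the fetch freedom slide lets X threads fetch per cycle. The selection step might be sketched as below. The function name, the `pending` list, and the tie-break by lowest thread id are assumptions made for this sketch; the slides do not specify them.

```python
def icount_pick(pending, x):
    """Return the ids of the x threads with the fewest pending instructions.

    pending[tid] is the number of instructions thread tid has waiting
    in the front end and queues. Ties go to the lower thread id
    (an assumption, not from the slides).
    """
    order = sorted(range(len(pending)), key=lambda tid: (pending[tid], tid))
    return order[:x]
```

For example, with `pending = [12, 3, 7, 3]` and two fetching threads per cycle, threads 1 and 3 are chosen, so the thread that is already clogging the queues (thread 0) is held back.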

  8. Thought?
     • SMT reduce need for split-phase operations?

     Big Ideas
     • Primitives
       – Parallel Assembly Language
       – Threads for control
       – Synchronization (post, full-empty)
     • Latency Hiding
       – Threads, split-phase operation
     • Exploit Locality
       – Create locality
       – Scheduling quanta
