


  1. 9c.1 EE 457 Unit 9c
     Thread Level Parallelism

     9c.2 Credits
     • Some of the material in this presentation is taken from:
       – Computer Architecture: A Quantitative Approach
         • John Hennessy & David Patterson
     • Some of the material in this presentation is derived from course notes and slides from
       – Prof. Michel Dubois (USC)
       – Prof. Murali Annavaram (USC)
       – Prof. David Patterson (UC Berkeley)

     9c.3 CHIP MULTITHREADING AND MULTIPROCESSORS
     A Case for Thread-Level Parallelism

     9c.4 Motivating HW Multithread/Multicore
     • Issues that prevent us from exploiting ILP in more advanced single-core processors with deeper pipelines and OoO Execution
       – ______________ hierarchy
       – Increased ___________ with _______ clock rates
       – Increased ___________ with more advanced structures (ROBs, Issue queues, etc.)

  2. 9c.5 Memory Wall Problem
     • Processor performance is increasing much faster than memory performance
     • There is a limit to ILP!
       – If a cache miss requires several hundred clock cycles, even OoO pipelines with 10's or 100's of in-flight instructions may stall.
     [Figure: Processor-Memory Performance Gap. Processor: 55%/year; Memory: 7%/year. Source: Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003)]

     9c.6 The Problem with 5-Stage Pipeline
     • A cache miss (memory induced stall) causes computation to stall
     • A __________ in compute time yields only minimal overall speedup due to _________________ dominating compute
     [Figure: C = Compute Time, M = Memory Latency. Single-Thread Execution: C M C M C M. Single-Thread Execution (w/ ________ in compute): C M C M C M. Actual program speedup is minimal due to _________ latency. Adapted from: OpenSparc T1 Micro-architecture Specification]

     9c.7 Cache Hierarchy
     • A hierarchy of cache can help mitigate the cache miss penalty
     • L1 Cache
       – 64 KB
       – 2 cycle access time
       – Common Miss Rate ~ ___
     • L2 Cache
       – 1 MB
       – 20 cycle access time
       – Common Miss Rate ~ ___
     • Main Memory
       – _____ cycle access time
     [Figure: P, L1 Cache, L2 Cache, L3 Cache, Memory]

     9c.8 Cache Penalty Example
     • Assume an L1 hit rate of 95% and miss penalty of 20 clock cycles (assuming these misses hit in L2). What is the CPI for our typical 5 stage pipeline?
       – 95 instructions take ____ cycles to execute
       – 5 instructions take _________ cycles to execute
       – Total _____ cycles for 100 instructions = CPI of ____
       – Effective CPI = Ideal CPI + Miss Rate * Miss Penalty Cycles
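The Cache Penalty Example above can be worked through numerically. The sketch below is one way to do it, assuming an ideal CPI of 1 for the 5-stage pipeline (an assumption; the slide does not state it) and that every miss pays the full 20-cycle penalty on top of its base cycle. The function names are hypothetical.

```python
def effective_cpi(ideal_cpi: float, miss_rate: float, miss_penalty: float) -> float:
    """Effective CPI = Ideal CPI + Miss Rate * Miss Penalty Cycles."""
    return ideal_cpi + miss_rate * miss_penalty


def cycle_breakdown(instructions: int, hit_rate: float, miss_penalty: int,
                    ideal_cpi: int = 1):
    """Split total cycles into those spent on hitting and missing instructions.

    Assumes (not stated on the slide) that each instruction costs `ideal_cpi`
    cycles and each miss adds `miss_penalty` stall cycles on top of that.
    """
    hits = round(instructions * hit_rate)
    misses = instructions - hits
    hit_cycles = hits * ideal_cpi
    miss_cycles = misses * (ideal_cpi + miss_penalty)
    return hit_cycles, miss_cycles, hit_cycles + miss_cycles


if __name__ == "__main__":
    hit_cycles, miss_cycles, total = cycle_breakdown(100, 0.95, 20)
    print(hit_cycles, miss_cycles, total)   # 95 105 200
    print(total / 100)                      # 2.0
    print(effective_cpi(1.0, 0.05, 20))     # same result via the formula
```

Under these assumptions, a 5% miss rate doubles the CPI of the pipeline, which is the point of the slide.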

  3. 9c.9 Multithreading
     • Long latency events
       – Cache Miss, Exceptions, Lock (Synchronization), Long instructions such as MUL/DIV
     • Long latency events cause in-order and even OoO pipelines to be underutilized
     • Idea: Share the processor among two executing threads, switching when one hits a ___________________
       – Only penalty is flushing the pipeline.
     [Figure: Single Thread: Compute, Cache Miss, Compute, Cache Miss, ... Two Threads: one thread's Compute overlaps the other thread's Cache Miss.]

     9c.10 Case for Multithreading
     • By executing multiple threads we can keep the processor busy with ________________
     • Swap to the next thread when the current thread hits a ________________________________
     [Figure: C = Compute Time, M = Memory Latency. Threads 1 to 4 each run C M C M, staggered so one thread computes while the others wait on memory. Adapted from: OpenSparc T1 Micro-architecture Specification]

     9c.11 Non-Blocking Caches
     • Cache can service hits while fetching one or more _________________
       – Example: Pentium Pro has a non-blocking cache capable of handling ______________ ___________

     9c.12 Power
     • Power consumption decomposed into:
       – Static: Power constantly being dissipated (grows with # of transistors)
       – Dynamic: Power consumed for switching a bit (1 to 0)
     • $P_{DYN} = I_{DYN} \cdot V_{DD} \approx \frac{1}{2} C_{TOT} V_{DD}^2 f$
       – Recall, $I = C \, dV/dt$
       – $V_{DD}$ is the logic '1' voltage, $f$ = clock frequency
     • Dynamic power favors parallel processing vs. higher clock rates
       – $V_{DD}$ value is tied to $f$, so a reduction/increase in $f$ leads to a similar change in $V_{DD}$
       – Implies power is proportional to $f^3$ (a cubic savings in power if we can reduce $f$)
       – Take a core and replicate it 4x => 4x performance and ___________
       – Take a core and increase clock rate 4x => 4x performance and ________
     • Static power
       – Leakage occurs no matter what the frequency is
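The replicate-vs-overclock comparison on the Power slide follows directly from the cubic relationship. The sketch below is a rough model only: it assumes V_DD scales linearly with f (as the slide states), ignores static power, and assumes dynamic power simply adds across identical cores. The function name is hypothetical.

```python
def relative_dynamic_power(freq_scale: float, core_count: int = 1) -> float:
    """Dynamic power relative to one baseline core at the baseline frequency.

    Model: P_DYN ~ C_TOT * V_DD^2 * f per core, with V_DD assumed proportional
    to f, so each core contributes freq_scale**3. Static (leakage) power is
    ignored in this sketch.
    """
    return core_count * freq_scale ** 3


if __name__ == "__main__":
    # Four cores at the original clock rate.
    print(relative_dynamic_power(1.0, core_count=4))   # 4.0
    # One core at four times the clock rate.
    print(relative_dynamic_power(4.0, core_count=1))   # 64.0
```

Both options are credited with 4x performance on the slide (which presumes the workload parallelizes perfectly across four cores); under this model the replicated design pays about 4x the dynamic power while the overclocked design pays about 64x.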

  4. 9c.13 Temperature
     • Temperature is related to power consumption
       – Locations on the chip that burn more power will usually run hotter
         • Locations where bits toggle (register file, etc.) often will become quite hot, especially if toggling continues for a long period of time
       – Too much heat can destroy a chip
       – Can use sensors to dynamically sense temperature
     • Techniques for controlling temperature
       – External measures: Remove and spread the heat
         • Heat sinks, fans, even liquid cooled machines
       – Architectural measures
         • Throttle performance (run at slower frequencies / lower voltages)
         • Global clock gating (pause..turn off the clock)
         • None…results can be catastrophic
         • http://www.tomshardware.com/2001/09/17/hot_spot/

     9c.14 Wire Delay
     • In modern circuits wire delay (transmitting the signal) begins to _________________________ (time for gate to switch)
     • As wires get longer
       – Resistance goes up and Capacitance goes up causing longer time delays (time is proportional to R*C)
     • Dynamically scheduled, OoO processors require ___________________ for buses, forwarding, etc.
     • Simpler pipelines often lead to _________________ signal connections (wires)
     • CMP is really the only viable choice

     9c.15 IMPLEMENTING MULTITHREADING AND MULTICORE

     9c.16 Software Multithreading
     • Used since 1960's to hide I/O latency
       – Multiple processes with different virtual address spaces and process control blocks
       – On an I/O operation, state is saved and another process is given to the CPU
       – When I/O operation completes the process is rescheduled
     • On a context switch
       – Trap processor and flush pipeline
       – Save state in process control block (____________ __________________________________)
       – Restore state of another process
       – Start execution and fill pipeline
     • Very high overhead!
     • Context switch is also triggered by ___________ _________________
     [Figure: CPU (Regs, PC) and OS Scheduler. T1 = Ready, T2 = Blocked, T3 = Ready; each thread has Saved State (Regs, PC, Meta Data).]
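The context-switch steps on the Software Multithreading slide can be sketched as a toy model. Everything here is illustrative: the `PCB` and `CPU` classes, the four-register file, and the state names are assumptions made for the example, not a real OS interface, and the trap/flush/refill steps appear only as comments.

```python
from dataclasses import dataclass, field


@dataclass
class PCB:
    """Hypothetical process control block: the saved state of one process."""
    name: str
    pc: int = 0
    regs: list = field(default_factory=lambda: [0, 0, 0, 0])
    state: str = "Ready"


@dataclass
class CPU:
    """Toy CPU holding only a PC and a small register file."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0, 0, 0, 0])


def context_switch(cpu: CPU, current: PCB, nxt: PCB, new_state: str = "Blocked") -> None:
    # Step 1 (not modeled): trap the processor and flush the pipeline.
    # Step 2: save the running process's state into its PCB.
    current.pc = cpu.pc
    current.regs = list(cpu.regs)
    current.state = new_state
    # Step 3: restore the state of the next process.
    cpu.pc = nxt.pc
    cpu.regs = list(nxt.regs)
    nxt.state = "Running"
    # Step 4 (not modeled): start execution and refill the pipeline.


if __name__ == "__main__":
    cpu = CPU(pc=0x100, regs=[1, 2, 3, 4])
    t1 = PCB("T1", state="Running")
    t3 = PCB("T3", pc=0x500, regs=[9, 9, 9, 9])
    context_switch(cpu, t1, t3)          # T1 blocks on I/O, T3 takes the CPU
    print(hex(cpu.pc), cpu.regs, t1.state, t3.state)
```

Every copy in steps 2 and 3 is done by software, one register at a time, which is where the "very high overhead" comes from.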

  5. 9c.17 Hardware Multithreading
     • Run multiple threads in turn on the same core
     • Requires additional hardware for fast context switching
       – Multiple register files
       – Multiple state registers (condition codes, interrupt vector, etc.)
       – Avoids saving context manually (via software)

     9c.18 Typical CMP Organization
     • Private L1's require maintaining ___________ via snooping. Sharing L1 is not a good idea.
     • L2 is shared (1 copy of data) and thus does not require a coherency mechanism.
     • Shared bus would be a bottleneck. Use switched network (multiple simultaneous connections)
     [Figure: Chip Multi-Processor. Four processors (P), each with a private L1, connected through an Interconnect (On-Chip Network) to four L2 banks, then Main Memory.]

     9c.19 Sparc T1 Niagara
     • 8 cores each executing 4 threads called a thread group
       – Zero cycle thread switching penalty (round-robin)
       – 6 stage pipeline
     • Each core has its own L1 cache
     • Each thread has its own
       – Register file, instruction and store buffers
     • Threads share…
       – L1 cache, TLB, and execution units
     • 3 MB shared L2 Cache, 4-banks, 12-way set-associative
       – Is it a problem that it's not a power of 2? ______
     • 2005, Ex. of Fine-grained Multithreading
     [Figure: pipeline stages: Fetch, (Thread) Select, Decode, Exec., Mem., WB]

     9c.20 Sun T1 "Niagara" Block Diagram
     [Figure: block diagram. Source: http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf]
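One way to explore the "not a power of 2" question for the Niagara L2 listed above is to compute how many sets the cache has. The sketch below uses the slide's figures (3 MB, 4 banks, 12-way) plus a 64-byte line size, which is an assumption not given on the slide. The function names are hypothetical.

```python
def l2_geometry(size_bytes: int, ways: int, banks: int, line_bytes: int):
    """Return (total sets, sets per bank) for a banked set-associative cache."""
    lines = size_bytes // line_bytes     # total cache lines
    sets_total = lines // ways           # each set holds `ways` lines
    sets_per_bank = sets_total // banks  # sets are split evenly across banks
    return sets_total, sets_per_bank


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


if __name__ == "__main__":
    # 64-byte line size is assumed here, not taken from the slide.
    total, per_bank = l2_geometry(3 * 2**20, ways=12, banks=4, line_bytes=64)
    print(total, per_bank)                                   # 4096 1024
    print(is_power_of_two(total), is_power_of_two(per_bank))
```

Under that assumption the set count is a power of two, so the index is still a plain bit field of the address; the factor of 3 is absorbed by the 12-way associativity, which only affects how many tags are compared per set.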

  6. 9c.21 Sun T1 "Niagara" Pipeline
     [Figure: pipeline diagram. Source: http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf]

     9c.22 T1 Pipeline
     • Fetch stage
       – Thread select mux chooses PC
       – Access I-TLB and I-Cache
       – 2 instructions fetched per cycle
     • Thread select stage
       – Choose instructions to issue from ready threads
       – Issues based on
         • Instruction type
         • Misses
         • Resource conflicts
         • Traps and interrupts

     9c.23 T1 Pipeline
     • Decode stage
       – Accesses register file
     • Execute Stage
       – Includes ALU, shifter, MUL and DIV units
       – Forwarding Unit
     • Memory stage
       – DTLB, Data Cache, and 4 store buffers (1 per thread)
     • WB
       – Write to register file

     9c.24 Pipeline Scheduling
     • No pipeline flush on context switch (except on cache miss)
     • Full forwarding/bypassing to younger instructions of same thread
     • In case of load, wait _________ before an instruction from the same thread is issued
       – Solved _________________ issue
     • Scheduler guarantees fairness between threads by prioritizing the least recently scheduled thread
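The fairness rule on the Pipeline Scheduling slide (prioritize the least recently scheduled thread) can be sketched as a small software model. This is a behavioral illustration only, not the T1's actual select logic: readiness is reduced to one boolean per thread, ties are broken by lowest thread id, and all names are hypothetical.

```python
def select_thread(ready, last_scheduled):
    """Return the id of the ready thread scheduled longest ago, or None."""
    candidates = [t for t, is_ready in enumerate(ready) if is_ready]
    if not candidates:
        return None  # no thread can issue this cycle
    # min() returns the first minimum, so ties go to the lowest thread id.
    return min(candidates, key=lambda t: last_scheduled[t])


def run(cycles, ready_fn, n_threads=4):
    """Simulate thread selection; ready_fn(thread, cycle) -> bool."""
    last_scheduled = [-1] * n_threads  # -1 means "never scheduled yet"
    trace = []
    for cycle in range(cycles):
        ready = [ready_fn(t, cycle) for t in range(n_threads)]
        chosen = select_thread(ready, last_scheduled)
        if chosen is not None:
            last_scheduled[chosen] = cycle
        trace.append(chosen)
    return trace


if __name__ == "__main__":
    # All four threads always ready: selection degenerates to round-robin.
    print(run(8, lambda t, c: True))              # [0, 1, 2, 3, 0, 1, 2, 3]
    # Thread 1 stalled (e.g. on a miss) until cycle 6, then gets priority.
    print(run(8, lambda t, c: t != 1 or c >= 6))  # [0, 2, 3, 0, 2, 3, 1, 0]
```

When every thread is ready the policy behaves like round-robin; when a thread stalls, the others share the issue slots, and the stalled thread is picked as soon as it becomes ready again.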
