SLIDE 1

EE 457 Unit 9c

Thread Level Parallelism

SLIDE 2

Credits

  • Some of the material in this presentation is taken from:
    – Computer Architecture: A Quantitative Approach by John Hennessy & David Patterson
  • Some of the material in this presentation is derived from course notes and slides from:
    – Prof. Michel Dubois (USC)
    – Prof. Murali Annavaram (USC)
    – Prof. David Patterson (UC Berkeley)

SLIDE 3

CHIP MULTITHREADING AND MULTIPROCESSORS

A Case for Thread-Level Parallelism

SLIDE 4

Motivating HW Multithread/Multicore

  • Issues that prevent us from exploiting ILP in more advanced single-core processors with deeper pipelines and OoO execution:
    – Slow memory hierarchy
    – Increased power with higher clock rates
    – Increased delay with more advanced structures (ROBs, issue queues, etc.)

SLIDE 5

Memory Wall Problem

  • Processor performance has been increasing much faster than memory performance

[Figure: Processor-Memory Performance Gap: processor performance grows ~55%/year vs. ~7%/year for memory. Hennessy and Patterson, Computer Architecture: A Quantitative Approach (2003)]

There is a limit to ILP!

If a cache miss requires several hundred clock cycles, even OoO pipelines with tens or hundreds of in-flight instructions may stall.

SLIDE 6

The Problem with the 5-Stage Pipeline

  • A cache miss (memory-induced stall) causes computation to stall
  • A 2x speedup in compute time yields only minimal overall speedup, because memory latency dominates compute time

[Figure: single-thread execution timeline alternating compute (C) and memory latency (M) segments; even with a 2x compute speedup, actual program speedup is minimal because memory latency dominates. Adapted from: OpenSparc T1 Micro-architecture Specification]

SLIDE 7

Cache Hierarchy

  • A hierarchy of caches can help mitigate the cache-miss penalty
  • L1 Cache
    – 64 KB
    – 2-cycle access time
    – Common miss rate ~5%
  • L2 Cache
    – 1 MB
    – 20-cycle access time
    – Common miss rate ~1%
  • Main Memory
    – 300-cycle access time

[Figure: processor P backed by L1 cache, L2 cache, L3 cache, and memory]
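As a quick back-of-the-envelope, these representative numbers give an average memory access time (AMAT); a minimal sketch, treating each miss rate as local to its level (an assumption of this sketch, not stated on the slide):

```c
#include <stdio.h>

/* AMAT = L1_time + L1_miss_rate * (L2_time + L2_miss_rate * Mem_time),
 * using the slide's representative numbers; each miss rate is treated
 * as local to its level (an assumption of this sketch). */
int main(void) {
    double l1_time = 2,  l1_miss = 0.05;   /* 2-cycle L1, ~5% miss rate  */
    double l2_time = 20, l2_miss = 0.01;   /* 20-cycle L2, ~1% miss rate */
    double mem_time = 300;                 /* 300-cycle main memory      */

    double amat = l1_time + l1_miss * (l2_time + l2_miss * mem_time);
    printf("AMAT = %.2f cycles\n", amat);  /* 2 + 0.05*(20 + 3) = 3.15 */
    return 0;
}
```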

SLIDE 8

Cache Penalty Example

  • Assume an L1 hit rate of 95% and a miss penalty of 20 clock cycles (i.e., these misses hit in L2). What is the CPI for our typical 5-stage pipeline?
    – 95 instructions take 95 cycles to execute
    – 5 instructions take 105 = 5*(1+20) cycles to execute
    – Total: 200 cycles for 100 instructions = CPI of 2
    – Effective CPI = Ideal CPI + Miss Rate * Miss Penalty Cycles
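The same arithmetic in runnable form; a minimal sketch using the slide's parameters:

```c
#include <stdio.h>

/* Effective CPI = Ideal CPI + Miss Rate * Miss Penalty,
 * with the slide's parameters: 95% hit rate, 20-cycle penalty. */
int main(void) {
    double ideal_cpi    = 1.0;   /* 1 cycle/instruction when everything hits */
    double miss_rate    = 0.05;  /* 5% of instructions miss in L1 */
    double miss_penalty = 20.0;  /* cycles to satisfy the miss from L2 */

    double effective_cpi = ideal_cpi + miss_rate * miss_penalty;
    printf("Effective CPI = %.2f\n", effective_cpi);  /* prints 2.00 */
    return 0;
}
```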

SLIDE 9

Case for Multithreading

  • By executing multiple threads we can keep the processor busy with useful work
  • Swap to the next thread when the current thread hits a long-latency event (e.g., a cache miss)

[Figure: timelines for Threads 1-4, each alternating compute (C) and memory latency (M); the threads are staggered so one thread computes while the others wait on memory. Adapted from: OpenSparc T1 Micro-architecture Specification]

SLIDE 10

Multithreading

  • Long-latency events
    – Cache misses, exceptions, locks (synchronization), long instructions such as MUL/DIV
  • Long-latency events cause in-order and even OoO pipelines to be underutilized
  • Idea: share the processor between two executing threads, switching when one hits a long-latency event
    – The only penalty is flushing the pipeline

[Figure: a single thread alternates compute and cache-miss periods, leaving the core idle during each miss; with two threads, one thread's compute fills the other's cache-miss time]
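A toy utilization model makes the point numerically; the 25/75 compute/stall split below is an assumption for illustration, not a figure from any spec:

```c
#include <stdio.h>

/* Toy model: each thread alternates C compute cycles with M stall cycles.
 * One thread keeps the core busy C/(C+M) of the time; with N threads the
 * core can compute for one thread while the others wait on memory. */
int main(void) {
    const double C = 25.0, M = 75.0;       /* illustrative cycle counts */
    for (int n = 1; n <= 4; n++) {
        double demand = n * C;             /* compute demand per C+M period */
        double util = demand >= (C + M) ? 1.0 : demand / (C + M);
        printf("%d thread(s): core utilization = %3.0f%%\n", n, util * 100);
    }
    return 0;
}
```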

SLIDE 11

Non-Blocking Caches

  • The cache can service hits while fetching one or more miss requests
    – Example: the Pentium Pro has a non-blocking cache capable of handling 4 outstanding misses

SLIDE 12

Power

  • Power consumption can be decomposed into:
    – Static: power constantly being dissipated (grows with # of transistors)
    – Dynamic: power consumed by switching a bit (1 to 0)
  • PDYN = IDYN*VDD ≈ ½*CTOT*VDD²*f
    – Recall, I = C dV/dt
    – VDD is the logic '1' voltage, f = clock frequency
  • Dynamic power favors parallel processing over higher clock rates
    – VDD is tied to f, so a reduction/increase in f leads to a similar change in VDD
    – Implies power is proportional to f³ (a cubic savings in power if we can reduce f)
    – Take a core and replicate it 4x => 4x performance and 4x power
    – Take a core and increase its clock rate 4x => 4x performance and 64x power
  • Static power
    – Leakage occurs no matter what the frequency is
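A small numeric sketch of the cubic-scaling argument above; the capacitance and voltage constants are illustrative placeholders, not real chip values:

```c
#include <stdio.h>
#include <math.h>

/* If VDD scales roughly with f, then PDYN = 1/2 * C * VDD^2 * f grows
 * as f^3. All constants below are illustrative placeholders. */
int main(void) {
    const double C_tot = 1e-9;     /* switched capacitance (F), assumed    */
    const double v_per_ghz = 0.3;  /* assume VDD proportional to f (V/GHz) */

    for (double f = 1.0; f <= 4.0; f += 1.0) {
        double vdd = v_per_ghz * f;
        double p = 0.5 * C_tot * vdd * vdd * (f * 1e9);
        printf("f = %.0f GHz: P = %.3f W (%.0fx the 1 GHz power)\n",
               f, p, pow(f, 3.0));
    }
    /* Replicating a core 4x: ~4x power. Clocking one core 4x: ~64x power. */
    return 0;
}
```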

SLIDE 13

Temperature

  • Temperature is related to power consumption
    – Locations on the chip that burn more power will usually run hotter
    – Locations where bits toggle (register file, etc.) often become quite hot, especially if the toggling continues for a long period of time
    – Too much heat can destroy a chip
    – Sensors can be used to dynamically sense temperature
  • Techniques for controlling temperature
    – External measures: remove and spread the heat
      • Heat sinks, fans, even liquid-cooled machines
    – Architectural measures
      • Throttle performance (run at slower frequencies / lower voltages)
      • Global clock gating (pause: turn off the clock)
      • None… the results can be catastrophic
        http://www.tomshardware.com/2001/09/17/hot_spot/

SLIDE 14

Wire Delay

  • In modern circuits, wire delay (time to transmit a signal) begins to dominate logic delay (time for a gate to switch)
  • As wires get longer
    – Resistance and capacitance both go up, causing longer delays (delay is proportional to R*C)
  • Dynamically scheduled, OoO processors require longer wire paths for buses, forwarding, etc.
  • Simpler pipelines often lead to local, shorter signal connections (wires)
  • CMP is really the only viable choice

SLIDE 15

IMPLEMENTING MULTITHREADING AND MULTICORE

SLIDE 16

Software Multithreading

  • Used since the 1960's to hide I/O latency
    – Multiple processes with different virtual address spaces and process control blocks
    – On an I/O operation, state is saved and another process is given to the CPU
    – When the I/O operation completes, the process is rescheduled
  • On a context switch
    – Trap the processor and flush the pipeline
    – Save state in the process control block (PC, register file, interrupt vector, page table base register)
    – Restore the state of another process
    – Start execution and fill the pipeline
  • Very high overhead!
  • A context switch is also triggered by a timer for fairness

[Figure: OS scheduler multiplexing the CPU among threads T1 (ready), T2 (blocked), and T3 (ready), each with saved state: registers, PC, and metadata]
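To make the saved/restored state concrete, here is a minimal process control block sketch; the field names and widths are illustrative assumptions, not any particular OS's layout:

```c
#include <stdint.h>

/* Minimal process control block (PCB) sketch holding the state the slide
 * lists: PC, register file, interrupt vector, page table base register.
 * Field names and widths are illustrative assumptions. */
typedef enum { READY, BLOCKED, RUNNING } proc_state_t;

typedef struct {
    uint32_t     pc;               /* saved program counter           */
    uint32_t     regs[32];         /* saved general-purpose registers */
    uint32_t     interrupt_vector; /* saved interrupt vector register */
    uint32_t     page_table_base;  /* saved page table base register  */
    proc_state_t state;            /* scheduler metadata              */
} pcb_t;

/* On a context switch the OS copies CPU state into the outgoing PCB and
 * reloads it from the next READY process's PCB, entirely in software,
 * which is why the overhead is so high. */
```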

SLIDE 17

Hardware Multithreading

  • Run multiple threads on the same core with hardware support for a fast context switch
    – Multiple register files
    – Multiple state registers (condition codes, interrupt vector, etc.)
    – Avoids saving context manually (via software)

SLIDE 18

Typical Multicore (CMP) Organization

  • Can simply replicate an entire processor core to create a chip multiprocessor (CMP)

[Figure: four cores (P), each with a private L1 cache, connected through an interconnect (on-chip network) to banked, shared L2 and main memory]

Private L1's require maintaining coherency via snooping. Sharing the L1 is not a good idea. The L2 is shared (1 copy of data) and thus does not require a coherency mechanism.

A shared bus would be a bottleneck; use a switched network (multiple simultaneous connections) instead.

SLIDE 19

Sun T1 "Niagara" Block Diagram

http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf

  • Example of fine-grained multithreading

SLIDE 20

Sparc T1 Niagara

  • 8 cores, each executing 4 threads (called a thread group)
    – Zero-cycle thread-switching penalty (round-robin)
    – 6-stage pipeline
  • Each core has its own L1 cache
  • Each thread has its own
    – Register file, instruction and store buffers
  • Threads share…
    – L1 cache, TLB, and execution units
  • 3 MB shared L2 cache, 4 banks, 12-way set-associative
    – Is it a problem that it's not a power of 2? No!

[Pipeline: Fetch → (Thread) Select → Decode → Exec. → Mem. → WB]

SLIDE 21

Sun T1 "Niagara" Pipeline

http://ogun.stanford.edu/~kunle/publications/niagra_micro.pdf

SLIDE 22

T1 Pipeline

  • Fetch stage
    – Thread-select mux chooses the PC
    – Accesses the I-TLB and I-Cache
    – 2 instructions fetched per cycle
  • Thread select stage
    – Chooses instructions to issue from ready threads
    – Issues based on
      • Instruction type
      • Misses
      • Resource conflicts
      • Traps and interrupts

SLIDE 23

T1 Pipeline

  • Decode stage
    – Accesses the register file
  • Execute stage
    – Includes the ALU, shifter, and MUL and DIV units
    – Forwarding unit
  • Memory stage
    – D-TLB, data cache, and 4 store buffers (1 per thread)
  • WB stage
    – Writes to the register file

SLIDE 24

Pipeline Scheduling

  • No pipeline flush on a context switch (except on a cache miss)
  • Full forwarding/bypassing to younger instructions of the same thread
  • In the case of a load, wait 2 cycles before an instruction from the same thread is issued
    – Solves the forwarding latency issue
  • The scheduler guarantees fairness between threads by prioritizing the least recently scheduled thread

SLIDE 25

A View Without HW Multithreading

[Figure: issue slots over time for a single-threaded superscalar and for a superscalar with software multithreading; only instructions from a single thread fill the slots, with an expensive cache-miss penalty and an expensive context switch]

SLIDE 26

Types/Levels of Multithreading

  • How should we overlap and share the HW between instructions from different threads?
    – Coarse-grained multithreading: execute one thread with all HW resources until a cache miss or misprediction incurs a stall or pipeline flush, then switch to another thread
    – Fine-grained multithreading: alternate fetching instructions from a different thread each clock
    – Simultaneous multithreading: fetch and execute instructions from different threads at the same time

SLIDE 27

Levels of TLP

[Figure: issue slots over time for four schemes:
  – Superscalar: only instructions from a single thread; expensive cache-miss penalty
  – Coarse-grained MT: alternate threads when one hits a long-latency event (a stall due to a cache miss, pipeline flush, etc.)
  – Fine-grained MT: alternate threads every cycle (Sun UltraSparc T2)
  – Simultaneous multithreading (SMT): mix instructions from different threads in the same issue cycle (Intel HyperThreading, IBM Power 5)]

SLIDE 28

Fine Grained Multithreading

  • Like Sun Niagara
  • Alternates issuing instructions from different threads each cycle, provided a thread has instructions ready to execute (i.e., not stalled); see the sketch below
  • With enough threads, long-latency events can be hidden
  • Degrades single-thread performance, since each thread gets only 1 out of every N cycles if all N threads are ready
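A minimal sketch of a fine-grained, round-robin thread-select policy; this is our illustration of the idea, not the T1's actual select logic:

```c
#include <stdbool.h>

#define NTHREADS 4

/* Each cycle, pick the next ready thread after the one issued last
 * (round-robin); a thread stalled on a cache miss is skipped. */
static int last_issued = NTHREADS - 1;

int select_thread(const bool ready[NTHREADS]) {
    for (int i = 1; i <= NTHREADS; i++) {
        int t = (last_issued + i) % NTHREADS;  /* rotate from last issued */
        if (ready[t]) {
            last_issued = t;
            return t;       /* issue from thread t this cycle */
        }
    }
    return -1;              /* all threads stalled: insert a bubble */
}
```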

SLIDE 29

Coarse Grained Multithreading

  • Swaps threads on a long-latency event
  • Hardware does not have to swap threads in a single cycle (as in fine-grained multithreading) but can take a few cycles, since the current thread has hit a long-latency event anyway
  • Requires flushing the pipeline of the current thread's instructions and filling it with the new thread's
  • Better single-thread performance

SLIDE 30

ILP and TLP

  • TLP can also help ILP by providing another source of independent instructions
  • In a 3- or 4-way issue processor, better utilization can be achieved when instructions from 2 or more threads are executed simultaneously

SLIDE 31

Simultaneous Multithreading

  • Uses multiple-issue, dynamic-scheduling mechanisms to execute instructions from multiple threads at the same time, filling issue slots with as many available instructions as possible from either thread
    – Overcomes poor utilization due to cache misses or a lack of independent instructions
    – Requires HW to tag instructions based on their thread
  • Requires a greater level of hardware resources (separate register renamers and status, multiple register files, etc.)

SLIDE 32

Example

  • Intel HyperThreading Technology (HTT) is essentially SMT
  • Recent processors, including the Core i7, are multi-core, multi-threaded, multi-issue, OoO (dynamically scheduled) superscalar processors

SLIDE 33

Future of Multicore/Multithreaded

  • Multiple cores in a shared-memory configuration
  • Per-core L1 or even L2
  • Large on-chip shared cache
  • Multiple threads on each core to fight the memory wall
  • Ever-increasing numbers of on-chip threads
    – To continue to meet Moore's Law
    – CMPs with 1000's of threads envisioned
    – The only sane option from a technology perspective (i.e., out of necessity)
    – The big roadblock is parallel programming

SLIDE 34

Parallel Programming

  • Implicit parallelism via…
    – Parallelizing compilers
    – Programming frameworks (e.g., MapReduce)
  • Explicit parallelism
    – OpenMP
    – Task libraries
      • Intel Thread Building Blocks, Java Task Library
    – Native threading (Windows threads, POSIX threads); see the sketch below
    – MPI
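As a small example of explicit parallelism via native threading, a minimal POSIX threads sketch; the worker function and thread count are arbitrary choices for illustration:

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4   /* arbitrary choice for the example */

/* Each spawned thread runs this; the argument carries its id. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("thread %ld doing its share of the work\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);   /* wait for all workers to finish */
    return 0;
}
```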

SLIDE 35

BACKUP

SLIDE 36

Organization for OoO Execution

[Block diagram adapted from Prof. Michel Dubois (simplified for EE 457): I-Cache feeding an instruction queue and dispatch logic; a register status table and register file; an issue unit with per-FU queues (integer, load/store, multiply, divide); integer/branch, MUL, and DIV execution units plus the D-Cache; a TAG FIFO; results broadcast on the CDB]

SLIDE 37

Multiple Functional Units

  • We now provide multiple functional units
  • After decode, issue to a queue, stalling if the unit is busy or waiting for a data dependency to resolve

[Figure: pipeline IM → Reg → queues + functional units (ALU, MUL, DIV) → DM (cache) → Reg]

SLIDE 38

Functional Unit Latencies

Functional Unit   Latency*   Initiation Interval**
Integer ALU          0              1
FP Add               3              1
FP Mul.              6              1
FP Div.             24             25

*  Latency: required stall cycles between dependent [RAW] instructions
** Initiation interval: distance between 2 independent instructions requiring the same FU

[Figure: EX-stage alternatives: Int. ALU / addr. calc. (single stage), FP Add (stages A1-A4), Int. & FP MUL and Int. & FP DIV (stages M1-M7)]

Look ahead: the Tomasulo algorithm will help absorb the latency of different functional units and cache-miss latency by allowing other ready instructions to proceed out of order.

An added complication of out-of-order execution & completion: WAW & WAR hazards.

SLIDE 39

OoO Execution w/ ROB

  • ROB allows for OoO execution but in-order completion

[Block diagram: as before, but a ROB (reorder buffer) replaces the TAG FIFO: I-Cache with branch-prediction buffer, instruction queue, register file, issue unit with integer / L/S / multiply / divide queues, integer/branch, MUL, and DIV execution units, D-Cache with L/S and address buffers, and results broadcast on the CDB]

Exceptions? No problem.


SLIDE 43

Updated Pipeline

Functional Unit   Latency*   Initiation Interval**
Integer ALU          0              1
FP Add               3              1
FP Mul.              6              1
FP Div.             24             25

*  Latency: required stall cycles between dependent [RAW] instructions
** Initiation interval: distance between 2 independent instructions requiring the same FU

[Figure: updated pipeline: PC, I-Cache, Reg. File, and MEM stage, with EX-stage alternatives: Int. ALU / addr. calc. (single stage), FP Add (stages A1-A4), Int. & FP MUL and Int. & FP DIV (stages M1-M7)]