

SLIDE 1

Hyper-Threading:

Simultaneous Multithreading on Pentium 4

Presented by: Thomas Repantis

trep@cs.ucr.edu

CS203B-Advanced Computer Architecture, Spring 2004 – p.1/32

SLIDE 2

Overview

Multiple threads executing on a single processor without switching.

  • 1. Threads
  • 2. SMT
  • 3. Hyper-Threading on P4
  • 4. OS and Compiler Support
  • 5. Performance for Different Applications


SLIDE 3

Threads

  • Process: “A task being run by the computer.”
  • Context: Describes a process’s current state of execution (registers, flags, PC...).
  • Thread: A “light-weight” process (has its own PC and SP, but shares a single address space and global variables).
  • Each process consists of at least one thread.
  • Threads allow faster context-switching and fine-grain multitasking.
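The shared-address-space point can be sketched in a few lines of Python: each thread gets its own stack (so `local` is private), while the module-level `counter` is visible to all of them and therefore needs a lock. This is an illustrative sketch, not from the slides.

```python
import threading

counter = 0                      # global: shared by all threads (one address space)
lock = threading.Lock()

def worker(n):
    global counter
    local = 0                    # local variable: lives on this thread's own stack
    for _ in range(n):
        local += 1
    with lock:                   # updates to the shared global need synchronization
        counter += local

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                   # 4000: every thread updated the same global
```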


SLIDE 4

Single-Threaded CPU

A lot of bubbles in the instruction issue and in the pipeline!


SLIDE 5

Single-Threaded SMP

Executing processes are doubled, but bubbles are doubled as well!


SLIDE 6

Superthreaded CPU

Each issue and each pipeline stage can contain instructions of the same thread only.


SLIDE 7

Hyper-Threaded CPU (SMT)

Instructions of different threads can be scheduled on the same stage.


SLIDE 8

SMT vs TeraMTA

  • Each processor of the TeraMTA has 128 streams, each of which includes a PC and 32 registers.
  • Each stream is assigned to a thread.
  • Instructions from different streams can be pipelined on the same processor.
  • However, in the TeraMTA only a single thread is active on any given cycle.


SLIDE 9

SMT Benefits

SMT:

  • Gives the OS the illusion of several (currently two) logical processors.
  • Makes efficient use of resources.
  • Overcomes the barrier of the limited amount of ILP within just one thread.
  • Is implemented by dividing processor resources into replicated, partitioned, and shared ones.


SLIDE 10

Replicated Resources

Each logical processor has independent:

  • Instruction Pointer
  • Register Renaming Logic
  • Instruction TLB
  • Return Stack Predictor
  • Advanced Programmable Interrupt Controller
  • Other architectural registers


SLIDE 11

Partitioned Resources

Each logical processor gets exactly half of:

  • Re-order buffers (ROBs)
  • Load/Store buffers
  • Several queues (e.g. scheduling, uop (micro-operation) queues)

Partitioning prohibits a logical processor from monopolizing the resources.


SLIDE 12

Statically Partitioned Queue

Specific positions are assigned to each processor.


SLIDE 13

Dynamically Partitioned Queue

A limit is imposed to the positions each processor can use, but no specific positions are assigned.
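The distinction can be sketched with a hypothetical queue model (not from the slides): any slot may hold either logical processor's entries, but each processor is capped at a fixed occupancy, so neither can monopolize the queue.

```python
from collections import deque

class DynamicallyPartitionedQueue:
    """Hypothetical sketch of a dynamically partitioned queue: no slot is
    reserved for a particular logical processor, but each processor may
    occupy at most `limit` slots at once."""

    def __init__(self, capacity, limit):
        self.q = deque()
        self.capacity = capacity
        self.limit = limit                # per-logical-processor occupancy cap
        self.occupancy = {0: 0, 1: 0}     # slots currently held by each processor

    def enqueue(self, proc, item):
        # Reject if the queue is full or this processor has hit its cap.
        if len(self.q) >= self.capacity or self.occupancy[proc] >= self.limit:
            return False
        self.q.append((proc, item))
        self.occupancy[proc] += 1
        return True

    def dequeue(self):
        proc, item = self.q.popleft()
        self.occupancy[proc] -= 1
        return proc, item

q = DynamicallyPartitionedQueue(capacity=8, limit=5)
for i in range(6):
    ok = q.enqueue(0, f"uop{i}")   # logical processor 0 tries to take 6 slots
print(ok)                          # False: capped at 5; slots remain for processor 1
```

A statically partitioned queue would instead reserve specific slot ranges per processor; the cap-without-reservation above is what lets dynamic partitioning adapt to uneven demand.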


SLIDE 14

Shared Resources

Each logical processor shares SMT-unaware resources:

  • Execution Units
  • Microarchitectural registers (GPRs, FPRs)
  • Caches: trace cache, L1, L2, L3

Sharing:

  + Enables efficient use of resources, but...
  - Allows a thread to monopolize a resource (e.g. cache thrashing).


SLIDE 15

Pentium 4

  • 32-bit
  • 2.4 to 3.4 GHz clock frequency
  • 800 MHz system bus
  • 0.13-micron technology
  • 8KB L1 data cache, 12KB L1 instruction cache, 256KB to 1MB L2 cache, 2MB L3 cache
  • NetBurst microarchitecture (hyper-pipelined)

  • Hyper-Threading technology


SLIDE 16

Front-End Pipeline

(a) Trace Cache Hit (b) Trace Cache Miss


SLIDE 17

Out-Of-Order Execution Engine Pipeline


SLIDE 18

Implementation Goals Achieved

  • Minimal die area cost (less than 5% more die area).
  • Stall of one logical processor does not stall the other (buffering queues between pipeline logic blocks).
  • When only one thread is running, speed should be the same as without H-T (partitioned resources are dedicated to it).


SLIDE 19

Single- and Multi-Task Modes

Partitioned resources are dedicated to one of the logical processors when the other is HALTed.


SLIDE 20

Operating System Optimizations

When the OS schedules threads to logical processors it should:

  • HALT an inactive logical processor, to avoid wasting resources on idle loops (continuously checking for available work).
  • Schedule threads to logical processors on different physical processors instead of the same one (when possible), to avoid using the same physical execution resources.
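On Linux, the affinity side of this can be illustrated from user space with `os.sched_setaffinity` (a Linux-only call; this is an illustration, not part of the OS scheduler itself). Which logical CPU numbers are siblings on the same physical core is platform-specific (see `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`); the sketch simply pins the process to one logical CPU.

```python
import os

# Linux-only sketch: restrict this process to a single logical CPU.
available = os.sched_getaffinity(0)        # logical CPUs we are allowed to run on
print(sorted(available))

os.sched_setaffinity(0, {min(available)})  # pin to the lowest-numbered logical CPU
print(os.sched_getaffinity(0))
```

A scheduler applying the slide's advice would do the converse: given two runnable threads and knowledge of the sibling topology, place them on logical CPUs belonging to different physical cores.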


SLIDE 21

OS Optimizations

The Linux kernel (2.6 series) distinguishes between logical and physical processors:

  • H-T-aware passive and active load-balancing
  • H-T-aware task pickup
  • H-T-aware affinity
  • H-T-aware wakeup


SLIDE 22

Compiler Optimizations

Intel 8.0 C++ and FORTRAN compilers:

Automatic optimizations:

  • Vectorization
  • Advanced instruction selection

Programmer-controlled optimizations:

  • Insertion of Streaming-SIMD-Extensions 3 (SSE3) instructions
  • Insertion of OpenMP directives


SLIDE 23

Performance Gain from Automatic Optimizations

SPEC CPU 2000 shows significant speedup not only from H-T-specific (QxP) but even from general P4 (QxN) optimizations.


SLIDE 24

Performance Gain from Manual Optimizations

SPEC OMPM 2001 shows the speedup achieved by automatic optimizations in combination with OpenMP directives.


SLIDE 25

Thread-level Parallelism of Desktop Applications

  • Unlike server workloads, interactive desktop applications focus on response time and not on end-to-end throughput.
  • The average response time improvement on a dual- vs. a uni-processor measured 22%.
  • The application programmer has to exploit multi-threading.
  • More than 2 processors yield no great improvements.


SLIDE 26

Performance in Client-Server Applications

While H-T offers neither gain nor degradation in API-call and user-application workloads, it achieves considerable speedups in multi-threaded workloads.


SLIDE 27

Performance in File Server Workloads

Good speedups in multi-threaded workloads, whether they use both filesystem and socket calls or just socket calls.


SLIDE 28

Performance in Online Transaction Processing

A 21% performance gain in both the 1- and 2-processor cases.


SLIDE 29

Performance in Web Serving

16 to 28% performance gain.


SLIDE 30

Conclusions

  • Hyper-Threading enables thread-level parallelism by duplicating the architectural state of the processor, while sharing one set of processor execution resources.
  • When scheduling threads, the OS sees two logical processors.
  • While not providing the performance achieved by adding a second processor, Hyper-Threading can offer a 30% improvement.
  • Resource contention limits the performance benefits for certain applications.
  • Performance gains are evident in multi-threaded workloads, which are usually found in servers.


SLIDE 31

References

  • 1. D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Volume 06, Issue 01, 2002.
  • 2. D. Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism”, ISCA, 1995.
  • 3. J. Stokes, “Introduction to Multithreading, Superthreading and Hyperthreading”, Ars Technica, 2002.
  • 4. K. Smith et al., “Support for the Intel Pentium 4 Processor with Hyper-Threading Technology in Intel 8.0 Compilers”, Intel Technology Journal, Volume 08, Issue 01, 2004.
  • 5. D. Vianney, “Hyper-Threading speeds Linux”, IBM Linux developerWorks, 2003.
  • 6. J. Hennessy, D. Patterson, “Computer Architecture: A Quantitative Approach”, 3rd Edition, pp. 608–615, 2003.
  • 7. “Hyper-Threading Technology on the Intel Xeon Processor Family for Servers”, Intel White Paper, 2004.
  • 8. K. Flautner et al., “Thread-level Parallelism and Interactive Performance of Desktop Applications”, ASPLOS, 2000.
  • 9. L. Carter et al., “Performance and Programming Experience on the Tera MTA”.


SLIDE 32

Thank you!

Questions/Comments?
