

SLIDE 1

Hyper-Threading:

Simultaneous Multithreading on Pentium 4

Presented by: Thomas Repantis

trep@cs.ucr.edu

CS203B-Advanced Computer Architecture, Spring 2004 – p.1/32

SLIDE 2

Overview

Multiple threads executing on a single processor without switching.

  • 1. Threads
  • 2. SMT
  • 3. Hyper-Threading on P4
  • 4. OS and Compiler Support
  • 5. Performance for Different Applications


SLIDE 3

Threads

  • Process: “A task being run by the computer.”
  • Context: Describes a process’s current state of execution (registers, flags, PC...).
  • Thread: A “light-weight” process (has its own PC and SP, but shares a single address space and global variables).
  • Each process consists of at least one thread.
  • Threads allow faster context-switching and fine-grain multitasking.
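The shared-address-space point can be sketched in a few lines of Python: each thread gets its own stack (so `local` is private), while the module-level `counter` is visible to all of them and therefore needs a lock. This is an illustrative sketch, not from the slides.

```python
import threading

counter = 0                      # global: shared by all threads (one address space)
lock = threading.Lock()

def worker(n):
    global counter
    local = 0                    # local variable: lives on this thread's own stack
    for _ in range(n):
        local += 1
    with lock:                   # updates to the shared global need synchronization
        counter += local

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                   # 4000: every thread updated the same global
```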


SLIDE 4

Single-Threaded CPU

A lot of bubbles in the instruction issue and in the pipeline!


SLIDE 5

Single-Threaded SMP

Executing processes are doubled, but bubbles are doubled as well!


SLIDE 6

Superthreaded CPU

Each issue and each pipeline stage can contain instructions of the same thread only.


SLIDE 7

Hyper-Threaded CPU (SMT)

Instructions of different threads can be scheduled on the same stage.


SLIDE 8

SMT vs TeraMTA

  • Each processor of the TeraMTA has 128 streams, each of which includes a PC and 32 registers.
  • Each stream is assigned to a thread.
  • Instructions from different streams can be pipelined on the same processor.
  • However, in the TeraMTA only a single thread is active on any given cycle.


SLIDE 9

SMT Benefits

SMT:

  • Gives the OS the illusion of several (currently two) logical processors.
  • Makes efficient use of resources.
  • Overcomes the barrier of the limited amount of ILP within just one thread.
  • Is implemented by dividing processor resources into replicated, partitioned, and shared ones.


SLIDE 10

Replicated Resources

Each logical processor has independent:

  • Instruction Pointer
  • Register Renaming Logic
  • Instruction TLB
  • Return Stack Predictor
  • Advanced Programmable Interrupt Controller
  • Other architectural registers


SLIDE 11

Partitioned Resources

Each logical processor gets exactly half of:

  • Re-order buffers (ROBs)
  • Load/Store buffers
  • Several queues (e.g. scheduling, uop (micro-operation) queues)

Partitioning prohibits a logical processor from monopolizing the resources.


SLIDE 12

Statically Partitioned Queue

Specific positions are assigned to each processor.


SLIDE 13

Dynamically Partitioned Queue

A limit is imposed to the positions each processor can use, but no specific positions are assigned.
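The distinction can be sketched with a hypothetical queue model (not from the slides): any slot may hold either logical processor's entries, but each processor is capped at a fixed occupancy, so neither can monopolize the queue.

```python
from collections import deque

class DynamicallyPartitionedQueue:
    """Hypothetical sketch of a dynamically partitioned queue: no slot is
    reserved for a particular logical processor, but each processor may
    occupy at most `limit` slots at once."""

    def __init__(self, capacity, limit):
        self.q = deque()
        self.capacity = capacity
        self.limit = limit                # per-logical-processor occupancy cap
        self.occupancy = {0: 0, 1: 0}     # slots currently held by each processor

    def enqueue(self, proc, item):
        # Reject if the queue is full or this processor has hit its cap.
        if len(self.q) >= self.capacity or self.occupancy[proc] >= self.limit:
            return False
        self.q.append((proc, item))
        self.occupancy[proc] += 1
        return True

    def dequeue(self):
        proc, item = self.q.popleft()
        self.occupancy[proc] -= 1
        return proc, item

q = DynamicallyPartitionedQueue(capacity=8, limit=5)
for i in range(6):
    ok = q.enqueue(0, f"uop{i}")   # logical processor 0 tries to take 6 slots
print(ok)                          # False: capped at 5; slots remain for processor 1
```

A statically partitioned queue would instead reserve specific slot ranges per processor; the cap-without-reservation above is what lets dynamic partitioning adapt to uneven demand.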


SLIDE 14

Shared Resources

Each logical processor shares SMT-unaware resources:

  • Execution Units
  • Microarchitectural registers (GPRs, FPRs)
  • Caches: trace cache, L1, L2, L3

Sharing:

  + Enables efficient use of resources, but...
  - Allows a thread to monopolize a resource (e.g. cache thrashing).


SLIDE 15

Pentium 4

  • 32-bit
  • 2.4 to 3.4 GHz clock frequency
  • 800 MHz system bus
  • 0.13-micron technology
  • 8KB L1 data cache, 12KB L1 instruction cache, 256KB to 1MB L2 cache, 2MB L3 cache
  • NetBurst microarchitecture (hyper-pipelined)

  • Hyper-Threading technology


SLIDE 16

Front-End Pipeline

(a) Trace Cache Hit (b) Trace Cache Miss


SLIDE 17

Out-Of-Order Execution Engine Pipeline


SLIDE 18

Implementation Goals Achieved

  • Minimal die area cost (less than 5% more die area).
  • Stall of one logical processor does not stall the other (buffering queues between pipeline logic blocks).
  • When only one thread is running, speed should be the same as without H-T (partitioned resources are dedicated to it).


SLIDE 19

Single- and Multi-Task Modes

Partitioned resources are dedicated to one of the logical processors when the other is HALTed.


SLIDE 20

Operating System Optimizations

When the OS schedules threads to logical processors it should:

  • HALT an inactive logical processor, to avoid wasting resources on idle loops (continuously checking for available work).
  • Schedule threads to logical processors on different physical processors instead of the same one (when possible), to avoid using the same physical execution resources.
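On Linux, the affinity side of this can be illustrated from user space with `os.sched_setaffinity` (a Linux-only call; this is an illustration, not part of the OS scheduler itself). Which logical CPU numbers are siblings on the same physical core is platform-specific (see `/sys/devices/system/cpu/cpu*/topology/thread_siblings_list`); the sketch simply pins the process to one logical CPU.

```python
import os

# Linux-only sketch: restrict this process to a single logical CPU.
available = os.sched_getaffinity(0)        # logical CPUs we are allowed to run on
print(sorted(available))

os.sched_setaffinity(0, {min(available)})  # pin to the lowest-numbered logical CPU
print(os.sched_getaffinity(0))
```

A scheduler applying the slide's advice would do the converse: given two runnable threads and knowledge of the sibling topology, place them on logical CPUs belonging to different physical cores.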


SLIDE 21

OS Optimizations

The Linux kernel (2.6 series) distinguishes between logical and physical processors:

  • H-T-aware passive and active load-balancing
  • H-T-aware task pickup
  • H-T-aware affinity
  • H-T-aware wakeup


SLIDE 22

Compiler Optimizations

Intel 8.0 C++ and FORTRAN compilers:

Automatic optimizations:

  • Vectorization
  • Advanced instruction selection

Programmer-controlled optimizations:

  • Insertion of Streaming-SIMD-Extensions 3 (SSE3) instructions
  • Insertion of OpenMP directives


SLIDE 23

Performance Gain from Automatic Optimizations

SPEC CPU 2000 shows significant speedup not only from H-T-specific (QxP) but even from general P4 (QxN) optimizations.


SLIDE 24

Performance Gain from Manual Optimizations

SPEC OMPM 2001 shows the speedup achieved by automatic optimizations in combination with OpenMP directives.


SLIDE 25

Thread-level Parallelism of Desktop Applications

  • Unlike server workloads, interactive desktop applications focus on response time and not on end-to-end throughput.
  • The average response time improvement on a dual- vs. a uni-processor measured 22%.
  • The application programmer has to exploit multi-threading.
  • More than 2 processors yield no great improvements.


SLIDE 26

Performance in Client-Server Applications

While H-T offers neither gain nor degradation in API-call and user-application workloads, it achieves considerable speedups in multi-threaded workloads.


SLIDE 27

Performance in File Server Workloads

Good speedups in multi-threaded workloads, whether they use both filesystem and socket calls or just socket calls.


SLIDE 28

Performance in Online Transaction Processing

A 21% performance gain in both the 1- and 2-processor cases.


SLIDE 29

Performance in Web Serving

16 to 28% performance gain.


SLIDE 30

Conclusions

  • Hyper-Threading enables thread-level parallelism by duplicating the architectural state of the processor, while sharing one set of processor execution resources.
  • When scheduling threads, the OS sees two logical processors.
  • While not providing the performance achieved by adding a second processor, Hyper-Threading can offer a 30% improvement.
  • Resource contention limits the performance benefits for certain applications.
  • Performance gains are evident in multi-threaded workloads, which are usually found in servers.


SLIDE 31

References

  • 1. D. Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture”, Intel Technology Journal, Volume 06, Issue 01, 2002.
  • 2. D. Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism”, ISCA, 1995.
  • 3. J. Stokes, “Introduction to Multithreading, Superthreading and Hyperthreading”, Ars Technica, 2002.
  • 4. K. Smith et al., “Support for the Intel Pentium 4 Processor with Hyper-Threading Technology in Intel 8.0 Compilers”, Intel Technology Journal, Volume 08, Issue 01, 2004.
  • 5. D. Vianney, “Hyper-Threading speeds Linux”, IBM Linux developerWorks, 2003.
  • 6. J. Hennessy, D. Patterson, “Computer Architecture: A Quantitative Approach”, 3rd Edition, pp. 608–615, 2003.
  • 7. “Hyper-Threading Technology on the Intel Xeon Processor Family for Servers”, Intel White Paper, 2004.
  • 8. K. Flautner et al., “Thread-level Parallelism and Interactive Performance of Desktop Applications”, ASPLOS, 2000.
  • 9. L. Carter et al., “Performance and Programming Experience on the Tera MTA”.


SLIDE 32

Thank you!

Questions/Comments?
