SLIDE 1

www.compaq.com

Simultaneous Multithreading: Multiplying Alpha Performance

Dr. Joel Emer
Principal Member Technical Staff
Alpha Development Group
Compaq Computer Corporation

Outline

  • Alpha Processor Roadmap
  • Motivation for Introducing SMT
  • Implementation of an SMT CPU
  • Performance Estimates
  • Architectural Abstraction

SLIDE 2

Alpha Microprocessor Overview

[Figure: Alpha processor roadmap by first system ship date, 1998 to 2003, with axis labels "Higher Performance" and "Lower Cost". Parts shown: 21264 EV6 (0.35µm), 21264 EV67 (0.28µm), 21264 EV68 (0.18µm), EV7 (0.18µm), ..., EV8 (0.125µm), EV78 (0.125µm)]

EV8 Technology Overview

  • Leading edge process technology – 1.2-2.0GHz
      • 0.125µm CMOS
      • SOI-compatible
      • Cu interconnect
      • low-k dielectrics
  • Chip characteristics
      • ~1.2V Vdd
      • ~250 Million transistors
      • ~1100 signal pins in flip chip packaging

SLIDE 3

EV8 Architecture Overview

  • Enhanced out-of-order execution
  • 8-wide superscalar
  • Large on-chip L2 cache
  • Direct RAMBUS interface
  • On-chip router for system interconnect
  • Glueless, directory-based, ccNUMA for up to 512-way SMP
  • 4-way simultaneous multithreading (SMT)

Goals

  • Leadership single stream performance
  • Extra multistream performance with multithreading
      • Without major architectural changes
      • Without significant additional cost

SLIDE 4

Instruction Issue

[Figure: function unit usage over time]

Reduced function unit utilization due to dependencies

Superscalar Issue

[Figure: function unit usage over time]

Superscalar leads to more performance, but lower utilization

SLIDE 5

Predicated Issue

[Figure: function unit usage over time]

Adds to function unit utilization, but results are thrown away

Chip Multiprocessor

[Figure: function unit usage over time]

Limited utilization when only running one thread

SLIDE 6

Fine Grained Multithreading

[Figure: function unit usage over time]

Intra-thread dependencies still limit performance

Simultaneous Multithreading

[Figure: function unit usage over time]

Maximum utilization of function units by independent operations
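The issue-slot argument made across the slides above can be sketched with a toy model. This is only an illustration under stated assumptions, not the EV8 issue logic: the `READY` table, the 4-slot `WIDTH`, and the three function names are invented here, and instructions that fail to issue in a cycle are simply dropped rather than carried over.

```python
from typing import List

# READY[t][c] = number of independent instructions thread t could issue in
# cycle c. These numbers are invented for illustration only.
READY: List[List[int]] = [
    [2, 0, 1, 3],
    [1, 2, 0, 1],
    [0, 1, 2, 0],
    [1, 1, 1, 1],
]
WIDTH = 4  # issue slots per cycle (hypothetical; the slides describe an 8-wide core)


def single_thread(ready: List[List[int]], width: int, thread: int = 0) -> float:
    """Superscalar core running one thread: slots it cannot fill are lost."""
    issued = sum(min(width, n) for n in ready[thread])
    return issued / (width * len(ready[thread]))


def fine_grained(ready: List[List[int]], width: int) -> float:
    """One thread owns every slot in a given cycle, chosen round-robin."""
    cycles = len(ready[0])
    issued = sum(min(width, ready[c % len(ready)][c]) for c in range(cycles))
    return issued / (width * cycles)


def smt(ready: List[List[int]], width: int) -> float:
    """Any thread may fill any slot within the same cycle."""
    cycles = len(ready[0])
    issued = sum(min(width, sum(t[c] for t in ready)) for c in range(cycles))
    return issued / (width * cycles)


if __name__ == "__main__":
    print(single_thread(READY, WIDTH))  # 0.375
    print(fine_grained(READY, WIDTH))   # 0.4375
    print(smt(READY, WIDTH))            # 1.0
```

In this model SMT can never do worse than the other two schemes in any cycle, because the combined ready count of all threads is at least as large as that of any single thread. Real hardware also has to contend with shared caches, queues and fetch bandwidth, which this sketch ignores.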

SLIDE 7

Basic Out-of-order Pipeline

[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures PC, Icache, Register Map, Regs, Dcache, Regs; label "Thread-blind"]

SMT Pipeline

[Figure: pipeline stages Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire; structures Icache, Dcache, PC, Register Map, Regs, Regs]

SLIDE 8

Changes for SMT

  • Basic pipeline – unchanged
  • Replicated resources
      • Program counters
      • Register maps
  • Shared resources
      • Register file (size increased)
      • Instruction queue
      • First and second level caches
      • Translation buffers
      • Branch predictor
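The split between replicated and shared resources listed above can be pictured as a small data structure. The sketch below is an assumption-laden illustration, not the EV8 design: the class names, the 16-entry physical register file, and the `rename_dest` helper are all hypothetical. It shows only the idea that each thread keeps its own program counter and register map while every thread draws physical registers from one shared pool.

```python
from typing import Dict, List


class ThreadContext:
    """Per-thread (replicated) state: a program counter and a register map."""

    def __init__(self) -> None:
        self.pc: int = 0
        # architectural register number -> physical register number
        self.register_map: Dict[int, int] = {}


class SmtCore:
    """Hypothetical core: one shared physical register pool, N thread contexts."""

    def __init__(self, num_threads: int = 4, num_physical_regs: int = 16) -> None:
        self.threads: List[ThreadContext] = [ThreadContext() for _ in range(num_threads)]
        # Shared resource: free physical registers available to every thread.
        self.free_list: List[int] = list(range(num_physical_regs))

    def rename_dest(self, tid: int, arch_reg: int) -> int:
        """Give thread `tid` a fresh physical register for `arch_reg`."""
        phys = self.free_list.pop(0)
        self.threads[tid].register_map[arch_reg] = phys
        return phys


if __name__ == "__main__":
    core = SmtCore()
    # Two threads write the same architectural register; the per-thread maps
    # send them to different entries of the shared register file.
    print(core.rename_dest(0, 1), core.rename_dest(1, 1))
```

Because the thread identity is resolved when the map is consulted, later stages in such a model only ever see physical register numbers, which is one way to read the "unchanged basic pipeline" bullet.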

Multiprogrammed workload

[Figure: bar chart, vertical axis 0% to 250%; workloads SpecInt, SpecFP and Mixed Int/FP; bars for 1T, 2T, 3T and 4T]

SLIDE 9

Decomposed SPEC95 Applications

[Figure: bar chart, vertical axis 0% to 250%; applications Turb3d, Swm256 and Tomcatv; bars for 1T, 2T, 3T and 4T]

Multithreaded Applications

[Figure: bar chart, vertical axis 0% to 300%; applications Barnes, Chess, Sort and TP; bars for 1T, 2T and 4T]

SLIDE 10

Architectural Abstraction

  • 1 CPU with 4 Thread Processing Units (TPUs)
  • Shared hardware resources

[Figure: TPU0, TPU1, TPU2 and TPU3 sharing Icache, TLB, Dcache and Scache]

System Block Diagram

[Figure: nine nodes, each an EV8 with attached M and IO blocks; labels 0, 1, 2, 3]

SLIDE 11

Quiescing Idle Threads

  • Problem: Spin looping thread consumes resources
  • Solution: Provide quiescing operation that allows a TPU to sleep until a memory location changes
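A software analogy may make the quiescing idea above more concrete. The slides do not describe the actual operation, so everything here is an assumption: `WatchedWord`, `store`, and `quiesce_until_changed` are hypothetical names, and a Python condition variable stands in for whatever the hardware does. The point is only the contrast with a spin loop: the waiter blocks instead of repeatedly re-reading the location.

```python
import threading


class WatchedWord:
    """A memory word whose waiters sleep until its value changes."""

    def __init__(self, value: int = 0) -> None:
        self._value = value
        self._cond = threading.Condition()

    def store(self, value: int) -> None:
        """Write the word and wake any sleeping waiters."""
        with self._cond:
            self._value = value
            self._cond.notify_all()

    def quiesce_until_changed(self, old: int, timeout: float = 5.0) -> int:
        """Block (rather than spin) until the word differs from `old`.

        Returns the current value; after a timeout that may still be `old`.
        """
        with self._cond:
            self._cond.wait_for(lambda: self._value != old, timeout=timeout)
            return self._value


def demo() -> int:
    """One waiter sleeps on the word; the main thread changes it."""
    word = WatchedWord(0)
    seen = []
    waiter = threading.Thread(
        target=lambda: seen.append(word.quiesce_until_changed(0))
    )
    waiter.start()
    word.store(1)
    waiter.join()
    return seen[0]


if __name__ == "__main__":
    print(demo())  # 1
```

A spin loop in the same position would keep issuing loads and branches, which on an SMT core compete with the other threads for the shared issue slots and queues.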

Summary

  • Alpha will maintain single stream performance leadership
  • SMT will significantly enhance multistream performance
      • Across a wide range of applications,
      • Without significant hardware cost, and
      • Without major architectural changes

SLIDE 12

References

  • "Simultaneous Multithreading: Maximizing On-Chip Parallelism" by Tullsen, Eggers and Levy in ISCA95.
  • "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreaded Processor" by Tullsen, Eggers, Emer, Levy, Lo and Stamm in ISCA96.
  • "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading" by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997.
  • "Simultaneous Multithreading: A Platform for Next-Generation Processors" by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October 1997.