SLIDE 1

UT Austin -- Distinguished Lecture Series on Computer Architecture -- April 26, 1999 Page 1

Carnegie Mellon

“Replenishing the Microarchitecture Treasure Chest”

Prof. John Paul Shen

Electrical and Computer Engineering Department Carnegie Mellon University

Current Ph.D. Students:

1. Bryan Black
2. Yuan Chou
3. Alex Dean
4. Ryan Rakvic
5. Bob Rychlik

Current M.S. Students:

1. Candice Bechem
2. Jonathan Combs
3. Jeffrey Heid
4. Kyle Oppenheim

CMuART Members

CMuART (PhD) Alumni:

1. Ron Bianchini (FORE & CMU)
2. Mauricio Breternitz (Moto)
3. Trung Diep (Intel)
4. F. Joel Ferguson (UCSC)
5. Andrew Huang (Moto)
6. Mikko Lipasti (IBM & UW)
7. Chris Newburn (Intel)
8. Derek Noonburg (S3)
9. Scott Robinson (Intel)
10. Mike Schuette (Moto)
11. Kent Wilken (UC-Davis)
12. Andy Wolfe (S3)

SLIDE 2

Unmatched by Any Other Industry!

[John Crawford, Intel, 1993]

Doubling every 18 months (1982-1996): total of 800X

Cars travel at 44,000 MPH; get 16,000 miles/gal.
Air travel: L.A. to N.Y. in 22 seconds (MACH 800)
Wheat yield: 80,000 bushels per acre

Doubling every 24 months (1971-1996): total of 9,000X

Cars travel at 600,000 MPH; get 150,000 miles/gal.
Air travel: L.A. to N.Y. in 2 seconds (MACH 9,000)
Wheat yield: 900,000 bushels per acre

Microprocessor Performance

All originally invented in the 1960’s

Pipelining
Cache Memories
Multiple Instruction Issue
Out of Order Execution
Dataflow Machines
Vector Machines
Virtual Memory
Optimizing Compilers
Operating Systems

Leveraging the Treasure Chest

SLIDE 3

Iron Law of Processor Performance

$$
\text{Processor Performance} = \frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}
$$

Instructions/Program (code size): Architecture, Compiler Designer
Cycles/Instruction (CPI): Implementation, Processor Designer
Time/Cycle (cycle time): Realization, Chip Designer

Architecture --> Implementation --> Realization
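The three-factor decomposition above can be turned into a small calculation. The following is a minimal sketch; the function name and the workload numbers (instruction count, CPI, clock rate) are hypothetical and chosen only to show how a change in one factor propagates to execution time.

```python
def execution_time(instructions, cpi, cycle_time_s):
    """Time/Program = Instructions/Program x Cycles/Instruction x Time/Cycle."""
    return instructions * cpi * cycle_time_s


# Hypothetical workload: 1e9 instructions, CPI of 1.25, 500 MHz clock (2 ns cycle).
baseline = execution_time(1_000_000_000, 1.25, 2e-9)

# Same code size and cycle time, but the implementation halves the CPI.
improved = execution_time(1_000_000_000, 0.625, 2e-9)

speedup = baseline / improved
print(f"baseline {baseline:.3f} s, improved {improved:.3f} s, speedup {speedup:.2f}x")
```

Because the three terms multiply, a gain in one factor only helps if it is not paid back in another, for example a lower CPI bought with a longer cycle time.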

Evolution of Microprocessors

Period              1970-1979   1980-1989   1990-1999   by 2009
Transistor Count    10K-100K    100K-1M     1M-30M      1,000M
Clock Frequency     0.2-2MHz    2-20MHz     20-600MHz   10GHz
Instruction/cycle   << 0.1      0.1-0.8     0.8-2.4     10 (?)
MIPS/MFLOPS         << 1        1-20        20-1,400    100,000
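The rows of this table appear to be related: the MIPS/MFLOPS figures are roughly the product of instructions per cycle and clock frequency. The slide does not state this relation, so the sketch below treats it as an assumption and simply checks the arithmetic for the upper end of the 1990-1999 column and for the 2009 projection. The function name is hypothetical.

```python
def peak_mips(ipc, clock_mhz):
    """Millions of instructions per second = instructions/cycle x million cycles/second."""
    return ipc * clock_mhz


# Upper end of the 1990-1999 column: 2.4 instructions/cycle at 600 MHz.
mips_1999 = peak_mips(2.4, 600)

# "by 2009" projection: 10 instructions/cycle at 10 GHz (10,000 MHz).
mips_2009 = peak_mips(10, 10_000)

print(f"1990-1999 upper bound: about {mips_1999:.0f} MIPS")
print(f"by 2009 projection:    {mips_2009:.0f} MIPS")
```

Under that assumption the 2009 column is internally consistent (10 x 10,000 = 100,000), and the 1990s upper bound of 1,400 reads as a rounded 1,440.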

SLIDE 4

Strong Diminishing Returns on IPC

[Figure: four panels (m88ksim, li, ijpeg, perl), each plotting IPC (0.0 to 9.0) against issue width (4, 8, 16, 32, 64).]

Revisit previously-assumed limits and try to go beyond these limits.

1970’s - “Flynn’s Bottleneck” ..... Branch Prediction
1990’s - “Dataflow Limit” ..... Value Prediction

Looking for A Paradigm Shift

SLIDE 5

Load Value Locality:

[Figure: two panels (PowerPC, Alpha AXP) plotting Value Locality (%, 20 to 100) per benchmark for History=1 and History=16. Benchmarks: cc1-271, cc1, cjpeg, compress, doduc, eqntott, gawk, gperf, grep, hydro2d, mpeg, perl, quick, sc, swm256, tomcatv, xlisp.]

Once Upon a Time...Fall 1995

Concept of “Value Locality”

[Figure: Load Value Locality (%, 0.0 to 100.0) per benchmark, History=1 vs. History=16.]

[Figure: Register Value Locality (%, 0.0 to 100.0) per benchmark, History=1 vs. History=4.]

Benchmarks in both panels: cc1-271, cjpeg, compress, eqntott, gawk, gperf, grep, mpeg, perl, quick, sc, xlisp.
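The slides plot value locality at several history depths without spelling out how it is measured. A minimal sketch, assuming the metric is the percentage of dynamic loads whose value matches one of the last `depth` unique values produced by the same static load, could look like this. The function name, the trace format, and the sample trace are all hypothetical.

```python
from collections import OrderedDict, defaultdict


def value_locality(trace, depth):
    """Percent of dynamic loads whose value matches one of the last `depth`
    unique values seen by the same static load (keyed by its PC)."""
    history = defaultdict(OrderedDict)  # pc -> unique values, least recent first
    hits = 0
    for pc, value in trace:
        seen = history[pc]
        if value in seen:
            hits += 1
            seen.move_to_end(value)       # refresh recency of the matching value
        else:
            seen[value] = True
            if len(seen) > depth:
                seen.popitem(last=False)  # evict the least recently seen value
    return 100.0 * hits / len(trace) if trace else 0.0


# Hypothetical trace of (load PC, loaded value) pairs.
trace = [(0x40, 7), (0x40, 7), (0x40, 9), (0x40, 7),
         (0x44, 1), (0x44, 2), (0x44, 1), (0x40, 9)]

print(value_locality(trace, depth=1))   # history depth 1
print(value_locality(trace, depth=16))  # history depth 16
```

A deeper history can only raise the measured locality, which is why the History=16 bars in the charts sit at or above the History=1 bars.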

SLIDE 6

Dynamic “Pipeline Contraction”

[Figure: stacked pipeline diagrams (stage labels: Fetch, Dispatch, Rename, Op Read, Execute, Commit) illustrating pipeline contraction under Branch Prediction and Value Prediction.]

“Superspeculation” Techniques

Dependence Prediction
Reg. Value Prediction
Load Value Prediction
Mem. Alias Prediction

[Figure: pipeline diagrams annotated with the four techniques above. Stage labels: FETCH, DECODE/DISPATCH, EXEC., ADDR., TLB, MEM., COMPL.]

SLIDE 7

SPECint95 Performance (16 wide)

[Figure: Sustained IPC (0.0 to 12.0) for go, m88ksim, gcc, compress, li, ijpeg, perl, vortex, and HM. Legend: Baseline +GAg TC +2-Phase +VP +LVP +AP; 64K/2 ports, 64K/4 ports, 64K/8 ports, 64K/Infinite, Perfect.]

SPECfp95 Performance (16 wide)

[Figure: Sustained IPC (0.0 to 16.0) per Benchmark for tomcatv, swim, mgrid, applu, apsi, fpppp, and HM. Legend: Baseline +GAg TC +2-Phase +VP +LVP +AP; 64K/2 ports, 64K/4 ports, 64K/8 ports, 64K/Infinite, Perfect.]
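The "HM" column in the SPECint95 and SPECfp95 charts is taken here to mean the harmonic mean across benchmarks, which later slides label explicitly as "harmonic mean". A minimal sketch of that aggregate follows; the IPC values are hypothetical and are not read from the charts.

```python
def harmonic_mean(values):
    """Harmonic mean: n divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


# Hypothetical per-benchmark IPC values.
ipcs = [2.0, 4.0, 4.0]

hm = harmonic_mean(ipcs)
am = sum(ipcs) / len(ipcs)
print(f"harmonic mean {hm:.2f}, arithmetic mean {am:.2f}")
```

For rate metrics such as IPC, the harmonic mean weights slow benchmarks more heavily than the arithmetic mean does, so one poorly performing benchmark pulls the aggregate down noticeably.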

SLIDE 8

A Possible New Paradigm

[Figure: four panels (m88ksim, li, ijpeg, perl), each plotting IPC (0.0 to 9.0) against issue width (4, 8, 16, 32, 64).]

[Figure: Prediction Rates for VP1 and VP2 (0% to 100%), with VP1 and VP2 bars for compress, gcc, go, ijpeg, li, m88ksim, perl, vortex, SPECint, applu, fpppp, mgrid, swim, tomcatv, turb3d, SPECfp, and SPEC95. Categories: Not Predicted, Incorrect, Both Correct, Stride + Unique, FCM Unique.]

Hybrid Value Predictors
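The chart legend names two component predictors, Stride and FCM, and a "Not Predicted" category. The sketch below shows one way such a hybrid could be organized. It is not the VP1 or VP2 design from the talk: the class names, the order-2 context, the per-PC saturating confidence counters, and the threshold are all assumptions made for illustration.

```python
class StridePredictor:
    """Predicts last value + last observed stride for each static instruction."""

    def __init__(self):
        self.table = {}  # pc -> (last_value, stride)

    def predict(self, pc):
        if pc not in self.table:
            return None
        last, stride = self.table[pc]
        return last + stride

    def update(self, pc, value):
        last = self.table.get(pc, (value, 0))[0]
        self.table[pc] = (value, value - last)


class FCMPredictor:
    """Finite context method: the last `order` values select the next value."""

    def __init__(self, order=2):
        self.order = order
        self.context = {}     # pc -> tuple of the most recent values
        self.next_value = {}  # (pc, context) -> value that followed it last time

    def predict(self, pc):
        return self.next_value.get((pc, self.context.get(pc, ())))

    def update(self, pc, value):
        ctx = self.context.get(pc, ())
        self.next_value[(pc, ctx)] = value
        self.context[pc] = (ctx + (value,))[-self.order:]


class HybridPredictor:
    """Picks the component with the higher confidence; abstains below a threshold."""

    def __init__(self, threshold=2, max_conf=3):
        self.components = [StridePredictor(), FCMPredictor()]
        self.conf = {}  # pc -> [stride_conf, fcm_conf] saturating counters
        self.threshold, self.max_conf = threshold, max_conf

    def predict(self, pc):
        conf = self.conf.get(pc, [0, 0])
        best = max(range(2), key=lambda i: conf[i])
        if conf[best] < self.threshold:
            return None  # corresponds to "Not Predicted"
        return self.components[best].predict(pc)

    def update(self, pc, value):
        conf = self.conf.setdefault(pc, [0, 0])
        for i, comp in enumerate(self.components):
            correct = comp.predict(pc) == value
            conf[i] = min(conf[i] + 1, self.max_conf) if correct else 0
            comp.update(pc, value)


def count_correct(values, pc=0x100):
    """Feed one instruction's value stream through the hybrid; count correct predictions."""
    hybrid, correct = HybridPredictor(), 0
    for v in values:
        if hybrid.predict(pc) == v:
            correct += 1
        hybrid.update(pc, v)
    return correct


print(count_correct([0, 4, 8, 12, 16, 20, 24, 28]))          # constant-stride stream
print(count_correct([3, 1, 4, 3, 1, 4, 3, 1, 4, 3, 1, 4]))   # repeating non-stride pattern
```

The first stream is captured by the stride component and the second only by the context-based component, which is the motivation for combining them.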

SLIDE 9

Prediction “Usefulness”

[Figure: Usefulness Tracking Impact on IPC. IPC axis 2.00 to 2.30; bars for All, SPECfp, and SPECint, Without Tracking vs. With Tracking. Bar labels as extracted: 2.143, 2.287, 2.064, 2.288, 2.073, 2.149.]

[Figure: Usefulness Tracking Impact on Prediction Rate (0% to 100%) for All, SPECfp, and SPECint, Without Tracking vs. With Tracking. Categories: Not Predicted, Incorrect, Correct.]

Increasing Sophistication

Hybrid Predictors
Adaptive Predictors

Increasing Efficiency

Selective Prediction
Register-based Prediction
Compiler Assistance

Extensions Into Other Domains

VLIW-based Value Prediction
Dynamic Instruction Reuse
Aggressive Partial Evaluation

Current Value Prediction Landscape

SLIDE 10

Instruction Supply Problem

[Figure: processor block diagram. Labels: T-cache, I-cache, Branch Prediction, FETCH, Instruction Buffer, DECODE/, Integer, Floating-point, Media, Memory, Reorder Buffer, Store Buffer, COMPLETE, D-cache. Flows: Instruction Flow, Register Data Flow, Memory Data Flow.]

Three Major Challenges:

1. Multiple-Branch Prediction
2. Multiple Fetch Groups
3. Alignment and Collapsing

Wide-Machine Instruction Fetch

[Figure: fetch front end. Labels: I-cache, Branch Prediction, FETCH, Instruction Buffer, DECODE/DISPATCH.]

SLIDE 11

High-bandwidth Instruction Fetch

Instructions align & collapse
Trace construction
Trace cache fill
Trace predictor update
Next trace prediction
Trace cache fetch

Enhanced

Fetch Time Completion Time

Execution Core

Trace Cache Instruction Cache

Multi-branch prediction
Multi-ported instruction cache fetch

Instructions align & collapse
Branch predictor update

Execution Core

Conventional Trace Cache

[Figure: block diagram. Labels: Fetch Buffer, Completion, Trace Cache, I-Cache, Execution Core, History Hash, Fill Unit, Next Trace Pred. Signals: br., trace_id, hist.]

SLIDE 12

The Block-Based Trace Cache

[Figure: block diagram. Labels: Fetch Buffer, Final Collapse, Block Cache, I-Cache, Execution Core, Completion, Trace Table, History Hash, Rename Table, Fill Unit. Signals: trace_id, br., block_ids, pre-collapse, hist.]

Next Trace Prediction

[Figure: Trace Table. Labels: branch history, block_ids, Hash Function, trace_id, tag, index, v, ways 1, 2, ..., w, tag compare (=), Hit, Next, pred. block_ids (b_id0, b_id1, b_id2, b_id3), to the block cache.]

SLIDE 13

Replicated Block Cache

[Figure: Block Cache. Labels: block_id (n-bit), decoder, word lines, N = 2^n direct mapped cache, Instructions from the block fill unit, i1, i2, ..., ib (b inst), FA, copy-1, copy-2, copy-3, copy-4, Final Collapse, Fetch Buffer, 16.]

Block Rename Table

N = 8 entries
Block fetch address renamed to a block_id

[Figure: rename table with fields tag, block_id, v and tag comparators (=); the block fetch address is split into Tag and Index; entry rows labeled 1 to 7.]
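The slide shows a block fetch address being split into a tag and an index and renamed to a short block_id. A minimal sketch of that idea follows, assuming a direct-mapped table in which the low-order part of the address selects the entry and the remainder is the tag. The class name, the method names, and the direct-mapped organization are assumptions; the diagram's two comparators suggest the real table may be organized differently.

```python
class BlockRenameTable:
    """Maps a block fetch address to a short block_id (hypothetical layout)."""

    def __init__(self, entries=8):
        self.entries = entries
        self.tags = [None] * entries   # None doubles as a cleared valid bit
        self.block_ids = [0] * entries

    def _split(self, fetch_addr):
        # Index = low-order part of the address, tag = the remaining high-order part.
        return fetch_addr // self.entries, fetch_addr % self.entries

    def lookup(self, fetch_addr):
        """Return the block_id on a tag match, or None on a miss."""
        tag, index = self._split(fetch_addr)
        if self.tags[index] == tag:
            return self.block_ids[index]
        return None

    def rename(self, fetch_addr, block_id):
        """Install (or overwrite) the mapping for this fetch address."""
        tag, index = self._split(fetch_addr)
        self.tags[index] = tag
        self.block_ids[index] = block_id


table = BlockRenameTable(entries=8)
table.rename(0x1234, block_id=5)
print(table.lookup(0x1234))      # same address: hit
print(table.lookup(0x1234 + 8))  # same index, different tag: miss
```

The point of the renaming step is that downstream structures can be indexed with a few bits of block_id instead of a full fetch address.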

SLIDE 14

Block Cache Fragmentation

[Figure: Avg. # of inst. in fetch buffer (1 to 10) for comp, gcc, go, ijpeg, li, m88k, perl, vort, and h-m. Series, as Block size (b), Block fetch width (w): 2,8; 4,4; 6,3; 8,2; 16,1.]

Block Cache Replication

(b=6, N=4096)

[Figure: Avg. # of inst. in fetch buffer (2 to 12) per benchmark (comp, gcc, go, ijpeg, li, m88k, perl, vort). Series: w=2, w=3, w=4, w=5, w=16.]

SLIDE 15

Trace Table Size

[Figure: Hit rate (%, 10 to 100) per benchmark (comp, gcc, go, ijpeg, li, m88k, perl, vort). Series: trace table sizes 64, 256, 1k, 2k, 8k.]

Block Cache Hit Rate

[Figure: stacked bars (0% to 100%) of hit, miss, and no pred by Number of blocks fetched (1 to 4) for comp, gcc, ijpeg, go, m88k, perl, vort, and li.]

SLIDE 16

Performance vs. Block Cache Size

[Figure: four panels (compress, gcc, go, ijpeg), each plotting IPC against Block cache entries (N), with N ticks at 16, 32, 64, 128, 256, 512, 1k, 2k, 4k, 8k, 16k.]

Performance vs. Block Cache Size

[Figure: four panels (li, m88ksim, perl, vortex), each plotting IPC against Block cache entries (N), with N ticks at 16, 32, 64, 128, 256, 512, 1k, 2k, 4k, 8k, 16k.]

SLIDE 17

Aggregate IPC vs. Block Cache Size

[Figure: harmonic mean IPC (2 to 16) vs. Block cache entries (N), with N ticks at 16, 32, 64, 128, 256, 512, 1k, 2k, 4k, 8k, 16k. Curves: perfect fetch, perfect branch predict, baseline branch predict, perfect block predict, real block predict. Annotations: data dependencies/instruction window limit, branch misprediction, taken branch boundary, block cache mispredictions, block cache capacity misses & fragmentation.]

Conventional vs. Block-Based

[Figure: four panels (ijpeg, go, compress, gcc), each plotting IPC (2 to 12) against bytes (ticks at 1, 10, 100, 1000, 10000, 100000, 1000000). Series: block based, conventional.]

SLIDE 18

Conventional vs. Block-Based

[Figure: four panels (perl, vortex, m88ksim, li), each plotting IPC (2 to 12) against bytes (ticks at 1, 10, 100, 1000, 10000, 100000, 1000000). Series: block based, conventional.]

Aggregate Comparison

[Figure: harmonic mean IPC (2 to 12) against bytes (ticks at 1, 10, 100, 1000, 10000, 100000, 1000000). Series: block based, conventional.]

SLIDE 19

New Forcing Functions:

High Frequency Instruction Level Parallelism
Simultaneously optimize IPC and MHz
Silicon and Time Efficient Performance (STEP)
Increase both IPC/die_area and δ(IPC)/δ(year)
Dynamic/Static and Hardware/Software fusion
Reconcile generational latencies of H/W and S/W
Bridge compiled object code & machine executable

Where Do We Go From Here...

How to have your cake and eat it too

[Figure: SPECint95 surface plots of Average IPC (0.5 to 3.5) and MHz over Pipe Depth and Pipe Width. Axis ticks as extracted: 2, 3, 4, 6, 8, 10, 16 and 1, 2, 3, 4, 8, 16.]

SLIDE 20

Journey of Many Small Steps

[Figure: IPC Improvement over time (1990, 1995, 2000), marked with major microarchitecture generations.]

Need to Increase δ(IPC)/δ(year)
