A tour of a microprocessor museum


SLIDE 1

BAFL: Bottleneck Analysis of Fine-grain Parallelism

  • Prof. Rastislav Bodík
with Brian Fields; in part with Shai Rubin, Prof. Mark Hill, Prof. Mary Vernon

University of Wisconsin

The computer system

  • Many levels of granularity, each with unique performance problems:
  • internet WANs
  • servers
  • microprocessors
  • Our goal:
  • a quantitative approach for modern (out-of-order) processors

Who cares?

  • Architects:
  • circuit complexity
  • power consumption
  • Software engineers:
  • performance-critical software
  • Students:
  • intuition for how processors work
  • Processors:
  • understand themselves

A tour of a microprocessor museum

Tour theme: µ-architectural parallelism complicates performance understanding. Tour game: “Bottleneck Hunt.” Which instruction slowed down the execution, and by how much? More specifically, why does the following model fail?

execution time = cost(instruction_1) + … + cost(instruction_n) [cycles]
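A toy illustration (mine, not from the talk) of why the additive model fails: on a pipelined machine, independent instructions overlap, so execution time is far less than the sum of per-instruction costs.

```python
def additive_model(costs):
    """Predicted time if per-instruction costs simply added up."""
    return sum(costs)

def ideal_pipeline_time(n_instructions, n_stages):
    """Classic fill-plus-drain formula for an ideal scalar pipeline:
    the first instruction takes n_stages cycles; each later one adds 1."""
    return n_stages + (n_instructions - 1)

costs = [5] * 10                     # ten independent instructions, 5 cycles each
print(additive_model(costs))         # additive model predicts 50 cycles
print(ideal_pipeline_time(10, 5))    # overlap yields 14 cycles
```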

A tour of a microprocessor museum (0)

no parallelism

Intel 80386

  • fetch
  • decode
  • execute
  • write

A tour of a microprocessor museum (1)

scalar pipeline parallelism

Intel 80486: Fetch, Decd, Read, Exe, Mem, Write

SLIDE 2

A tour of a microprocessor museum (2)

in-order superscalar pipeline

Intel Pentium: two parallel pipelines (Fetch, Decd, Read, Exe, Mem, Write). The Bill Cosby Rule:* “You’re not a parent if you only have one child.”

*rule named by Amir Roth

A tour of a microprocessor museum (3)

out-of-order superscalar

Intel Pentium 4

A tour of a microprocessor museum (end)

out-of-order superscalar

typical buffers, queues, windows

decode buffer, reservation stations, reorder buffer (ROB), store buffer, missed loads

Processors are good at tolerating latency, but poor at deciding what to tolerate.

Microprocessors are fine-grain parallel systems, like wide-area networks:

  • queues are like routers, pipelines are like communication links
  • many (bad) events go on in parallel, their latency tolerated

Why critical path?

Outline

The model of micro-execution

  • capture both program and processor constraints

Four metrics:

  • criticality
  • slack
  • execution modes
  • cost
  • Critical path of a microexecution

Critical path misconceptions:

  • “Every ‘bad event’ is critical.”
  • branch misprediction
  • reorder-buffer stall
  • L1 cache miss
  • L2 cache miss
  • “Critical path is obvious … it contains instructions providing data for ‘bad events’.”

SLIDE 3

Modeling: why hard?

Critical path consists of:

1. instructions and data dependences
  • as in a traditional “compiler” view
2. microarchitectural resource constraints
  • branch mispredictions, finite fetch b/w, etc.

Together they describe the microexecution of a given program executing on a given machine.

How to model in a uniform way?

  • Resource dependencies

Resources constrain the dataflow execution
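Both kinds of constraints can be folded into one dependence graph. A minimal sketch (my own illustration of the F/E/C node model, not the talk's exact construction): each dynamic instruction i gets three nodes (i, F), (i, E), (i, C), and edges encode both program and machine constraints.

```python
def build_graph(n, rob_size, data_deps, mispredicted):
    """Build the microexecution dependence graph as an edge list.
    data_deps: list of (producer, consumer) instruction index pairs.
    mispredicted: set of instruction indices that are mispredicted branches."""
    edges = []
    for i in range(n):
        edges.append(((i, "F"), (i, "E")))          # fetch before execute
        edges.append(((i, "E"), (i, "C")))          # execute before commit
        if i > 0:
            edges.append(((i - 1, "F"), (i, "F")))  # in-order fetch
            edges.append(((i - 1, "C"), (i, "C")))  # in-order commit
        if i >= rob_size:
            edges.append(((i - rob_size, "C"), (i, "F")))  # finite ROB
    for p, c in data_deps:
        edges.append(((p, "E"), (c, "E")))          # program data dependence
    for b in mispredicted:
        if b + 1 < n:
            edges.append(((b, "E"), (b + 1, "F")))  # redirect fetch after misp.
    return edges
```

This is why the model is uniform: program and machine constraints are just different edge kinds in the same graph.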

Critical Path Models (1)

First, for a simple in-order machine

  • Data dependencies

(figure: dynamic instructions i1…i5, oldest to newest, linked by data-dependence edges)

Critical Path Models (2)

For an out-of-order machine

(figure: F, E, and C nodes for instructions i1…i5, oldest to newest: fetch in order, execute out of order, commit in order)

Critical Path Models (3)

OOO + finite re-order buffer

(figure: F/E/C graph with added ROB-size edges constraining fetch)

Critical Path Models (4)

OOO + finite ROB + branch misp

(figure: F/E/C graph with an edge from the mispredicted branch’s execute node to the next fetch node)

Example

(figure: F/E/C dependence graph from first to last instruction, with edge latencies in cycles)

SLIDE 4

Example

(figure: the F/E/C graph with edge latencies)

CP Length = 16 cycles ⇒ Exe Time = 16 cycles

Example

(figure: the same graph)

CP Length = 16 cycles ⇒ Exe Time = 16 cycles

What if this load is an L1 miss? (3 cycles → 12 cycles)
Example

F E C

1 1 1 1 3 2 1 1 2 1 1 4 2 1 1 1 2 1 2 1 1 4 1 1 2 1 1 2 12 1

CP Lengt h = 19 cycles ⇒ Exe Time = 19 cycles

what if t his load is an L1 miss?

(3 cycles 12 cycles)
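On the weighted graph, the CP length is just the longest path in the DAG. A sketch (the edge-list and weight-dict representation is my assumption for illustration, not the talk's data structure):

```python
from collections import defaultdict

def critical_path_length(edges, weights):
    """edges: list of (u, v); weights: dict (u, v) -> latency in cycles.
    Returns the longest-path length in the DAG = predicted execution time."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    # Kahn's algorithm for a topological order
    order, frontier = [], [n for n in nodes if indeg[n] == 0]
    while frontier:
        u = frontier.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                frontier.append(v)
    # longest path by relaxing edges in topological order
    dist = {n: 0 for n in nodes}
    for u in order:
        for v in succ[u]:
            dist[v] = max(dist[v], dist[u] + weights[(u, v)])
    return max(dist.values())

# Tiny example: A->B->C (3+2) beats A->C (4), so the CP is 5 cycles.
edges = [("A", "B"), ("B", "C"), ("A", "C")]
weights = {("A", "B"): 3, ("B", "C"): 2, ("A", "C"): 4}
print(critical_path_length(edges, weights))  # 5
```

Bumping one weight (e.g. a load going from 3 to 12 cycles) and recomputing reproduces the 16-cycle → 19-cycle jump shown above.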

Execution Modes

Three modes of execution:

  • fetch limited (F-mode)
  • execute limited (E-mode)
  • commit limited (C-mode)

(figure: F/E/C graph segments labeled F-mode, E-mode, C-mode)

Execution Modes

Entering F-mode:

  • start of program (1st instruction in program)
  • branch misprediction
  • ROB stall

Entering E-mode:

  • fetch catches up
  • ROB stall

Entering C-mode

(figure: F/E/C graph segments illustrating each transition)

Validation: can we trust our model?

(chart: execution-time reduction in cycles per cycle of latency reduced, 0.1 to 1, for crafty, eon, gcc, gzip, parser, perl, twolf, vortex, ammp, art, galgel, mesa; reducing CP latencies vs. reducing non-CP latencies)

SLIDE 5

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost

Current policies are egalitarian: all “bad” events are equally harmful.

mechanism                   question                    current policy      better policy
prediction and speculation  when to speculate?          on each prediction  only critical
prefetch                    what to prefetch?           all misses          critical pre-fetch, pre-execute
non-blocking caches         how to serve mem requests?  FIFO                critical first
OOO execution               how to schedule?            oldest first        critical first


Why a criticality predictor? Policies!

Prediction: why hard?

Three steps:

1. observe the microexecution ⇒ hard!
  • measuring edge latencies is intrusive
2. analyze to find critical path ⇒ hard!
  • graph too large to buffer
  • and topological sort too complex
3. store prediction for later use ⇒ easy!
  • store in table indexed by PC

Step 1. Observing

R1 ← R2 + R3: if the dependence into R2 is on the critical path, then the value of R2 arrived last.

critical ⇒ arrives last, but arrives last ⇏ critical (the dependence may have been resolved early).

Last-arrive edges: a CPU stethoscope

(figure: a CPU probed by last-arrive edges among F, E, and C nodes)

Implementing last-arrive edges

Observe events within the machine

  • F→E: last-arrive if data ready on fetch
  • E→E: observe arrival order of operands
  • E→C: if commit pointer is delayed; C→C otherwise
  • E→F: if branch mispredicted; C→F if ROB stall; F→F otherwise

SLIDE 6

Last-arrive edges

(figure: the example F/E/C graph with edge latencies)

Remove latencies

(figure: the same graph without weights; we do not need explicit weights)

Prune the graph

Only last-arrive edges are needed (other edges must be non-critical).

… and we’ve found the critical path!

Backward propagate along last-arrive edges.

Found CP by observing only last-arrive edges, but this still requires constructing the entire graph.
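The backward propagation step can be sketched as follows, assuming a hypothetical `last_arrive` map that records, for each node, the predecessor whose signal arrived last (None for the very first node). No edge weights are needed.

```python
def critical_path(last_arrive, last_node):
    """Walk last-arrive edges backward from the final commit node.
    last_arrive: dict node -> last-arrive predecessor (or None).
    Returns the critical path in execution order."""
    path = [last_node]
    while last_arrive.get(path[-1]) is not None:
        path.append(last_arrive[path[-1]])
    path.reverse()
    return path

# Hypothetical recorded last-arrive predecessors for a 3-instruction run:
la = {"C3": "E3", "E3": "E1", "E1": "F1", "F1": None}
print(critical_path(la, "C3"))  # ['F1', 'E1', 'E3', 'C3']
```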

Prediction: why hard?

Three steps:

1. observe the microexecution ⇒ solved!
  • measuring edge latencies is intrusive
2. analyze to find critical path
  • graph too large to buffer ⇒ hard!
  • and topological sort too complex ⇒ solved!
3. store prediction for later use ⇒ easy!
  • store in table indexed by PC

Step 2. Efficient analysis (predictor training)

CP is a “long” chain of last-arrive edges ⇒ the longer a given chain of last-arrive edges, the more likely it is part of the CP.

Algorithm: find sufficiently long last-arrive chains.

1. Plant token into a node n.
2. Propagate forward, only along last-arrive edges.
3. Check for token after several hundred cycles.
4. If token alive, n is assumed critical.
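The four steps above can be sketched in software (the hardware version propagates one step per cycle; `last_arrive_succs`, an adjacency map of last-arrive edges, is a hypothetical name for this illustration):

```python
def token_alive(last_arrive_succs, plant_node, horizon):
    """Plant a token at plant_node and propagate it forward only along
    last-arrive edges for `horizon` steps. If some copy of the token
    survives, the planted node is trained as critical."""
    frontier = {plant_node}
    for _ in range(horizon):
        frontier = {v for u in frontier
                      for v in last_arrive_succs.get(u, [])}
        if not frontier:
            return False   # token died: chain too short, assume non-critical
    return True            # token alive after the horizon: train as critical

# A long chain keeps the token alive; a short one kills it.
succs = {1: [2], 2: [3], 3: [4], 4: [5], 5: [6]}
print(token_alive(succs, 1, 4))  # True
print(token_alive(succs, 4, 4))  # False
```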

SLIDE 7

Token-passing example

  • 1. plant token
  • 2. propagate token
  • 3. is token alive?
  • 4. yes, train critical

Found CP without constructing the entire graph.

Implementation: a small SRAM array

(figure: token queue indexed by the last-arrive producer node (inst id, F/E/C), read on last-arrive, written on commit (inst id, F/E/C))

Size of SRAM: 3 bits × ROB size < 200 bytes; can simply replicate for additional tokens.

Putting it all together

(figure: the OOO core feeds last-arrive edges (producer of each retired instruction) to the token-passing analyzer on the training path; the CP prediction table, indexed by PC, answers “E-critical?” on the prediction path)

Steps to exploiting critical path: modeling, predicting, applying.

1. resource arbitration
  • case study: cluster scheduling
2. speculation control
  • case study: value prediction

Experiment Setup

Aggressive Core

  • 8-way issue, 256-entry window
  • three configurations: core split into 1, 2, or 4 clusters:
  • unclustered: 8-way, 256-entry
  • 2 clusters: each 4-way, 128-entry
  • 4 clusters: each 2-way, 64-entry

CP Predictor

  • 8 tokens (1.5 KB token-passing array)
  • 16K-entry array for storing predictions (12 KB)
  • 6-bit hysteresis

Case Study #1: Clustered architectures

(figure: steering into per-cluster issue windows, then scheduling)

  • 1. current state of the art (Base)
  • 2. Base + CP Scheduling
  • 3. Base + CP Scheduling + CP Steering
SLIDE 8

(chart: normalized IPC, 0.60 to 1.10, for eon, crafty, gcc, gzip, perl, vortex, galgel, mesa; unclustered vs. 2-cluster vs. 4-cluster)

Current State of the Art

  • Avg. clustering penalty for 4 clusters: 19%

Constant issue width, clock frequency.

(chart: normalized IPC, same benchmarks and configurations)

CP Optimizations: Base + CP Scheduling

(chart: normalized IPC, same benchmarks and configurations)

CP Optimizations: Base + CP Scheduling + CP Steering

  • Avg. clustering penalty reduced from 19% to 6%

Local vs. Global Analysis

Previous CP predictors: local stall-based predictions (HPCA ’01, ISCA ’01, by others)

(chart: speedup, -5.0% to 25.0%, for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; oldest-uncommitted vs. oldest-unissued vs. token-passing)

CP exploitation seems to require global analysis.

  • 3. One predictor, many optimizations:
  • schedule critical instructions first: up to 20% speedup
  • value predict only critical: up to 5% speedup

Criticality: contributions

  • 1. Critical path of a microexecution consists of:
  • program-induced data dependences
  • machine-induced resource dependences
  • 2. CP prediction = global run-time analysis:
  • observe last-arrive edges, analyze via token-passing, apply predictions

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost
SLIDE 9

Beyond criticality

  • Slack (definition):

the number of cycles an instruction can be slowed down before it becomes critical.

  • Slack is prevalent:

75% of dynamic instructions can be delayed by at least 5 cycles with no impact on performance (no slowdown).

  • How to compute slack?
  • in a simulator
  • in hardware

Why is slack useful?

  • Non-uniform machines:
  • resources at multiple levels of quality
  • to deal with technological constraints
  • to save power: slow/fast clusters of ALUs
  • wire delay: some caches further away
  • The problem boils down to controlling non-uniform machines:
  • goal: hide the (longer) latency of low-quality resources
  • slack lets us do this

How to compute slack?

  • On the graph:
  • two-pass topological sort
  • In the processor:
  • delay and observe: by reduction to criticality analysis
  • delay instruction i by n cycles
  • if i is not critical, then i had at least n cycles of slack
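The delay-and-observe reduction can be sketched as a search loop, assuming a hypothetical `run_with_delay(instr, n)` hook that reruns the microexecution with instruction `instr` delayed by n cycles and returns total execution time:

```python
def apparent_slack(baseline_time, run_with_delay, instr, max_delay):
    """Largest delay n that leaves total execution time unchanged;
    this is a lower bound on instr's slack."""
    slack = 0
    for n in range(1, max_delay + 1):
        if run_with_delay(instr, n) > baseline_time:
            break          # instr became critical: delay n exceeded its slack
        slack = n
    return slack

# Stand-in for a simulator: this instruction has exactly 5 cycles of slack.
fake_run = lambda instr, n: 100 if n <= 5 else 100 + (n - 5)
print(apparent_slack(100, fake_run, "i7", 10))  # 5
```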

Experiments

  • Power-aware machine:
  • two clusters:
  • fast: full frequency
  • slow: half frequency (consumes ¼ power)
  • three non-uniformities: 1. 2. 3.
  • results:
  • within 3% of the performance of two fast clusters
  • existing techniques: 10% slowdown

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost

Reconfigurable machines

  • Imagine that, to save power, you can dynamically:

1. turn on/off some ALUs
2. change their frequency

Problem: how to adapt the machine configuration to the program’s needs?

SLIDE 10

Outline

The model of micro-execution

  • capture both program and processor constraints

Four met rics:

  • criticality
  • slack
  • execution modes
  • cost

Finally, the quantitative approach

  • All boils down to computing the cost of an instruction:
  • can easily compute it from the graph, if the graph is available (in the simulator)
  • can we compute it in hardware?

A new version of the randomized algorithm?

The future

Superscalar complexity haunts:

  • not only circuit designers
  • and verification engineers
  • but also performance engineers
  • and hence also architects themselves

Critical-path instruction processing helps:

  • understand performance complexity
  • and hence also:
  • exploit existing designs better
  • lead to simpler designs

Effect of CP scheduling on future designs?

(chart: 70% to 110%, 1/2/4 clusters, No CP vs. With CP)

1) Cluster the machine:
  • 8-way machine split into 1, 2, 4 clusters
  • as in the previous experiment

2) Enlarge scheduling window:
  • 4-cluster × 2-way machine
  • vary the window size in each cluster

(chart: 80% to 180%, window sizes 32/64/128/256, No CP vs. With CP)

3) Add clusters:
  • each cluster is 2-way, 64-entry window

(chart: 80% to 240%, 1/2/4 clusters, No CP vs. With CP)

This talk is about:

Making processors smarter

  • a modern processor: strong body, weak mind
  • example: can execute instructions out of order, but does so without considering instruction cost

Making them smarter = teaching them how to find bottlenecks

  • instructions whose latency hurts
  • resources whose contention hurts

I will show how to

  • find bottlenecks (at run-time, with simple hardware), and
  • alleviate them (using existing resources, retrofitting)

Our solution:

Critical-Path Instruction Processing

  • critical-path analysis of µ-execution performance
  • critical-path prediction
  • critical-path hardware optimizations
SLIDE 11

Critical-path instruction processing tames the µ-architectural complexity behind superscalar performance:

  • find execution bottlenecks, and
  • alleviate them.

Why critical path?

CP has been used to understand bottlenecks in large-scale systems:

  • message passing and locking in shared-memory systems (Hollingsworth [IEEE PDS ’98])
  • TCP transactions (Barford and Crovella [SIGCOMM ’00])

(figure: call, arrive at barrier, leave barrier, with edge latencies 1, 3, 1)

Speculation Control: Value prediction

Optimization: value predict only critical instructions

  • removes speculations that:
  • have no benefit, but
  • may have high misspeculation recovery cost

Are critical instructions value predictable?

(chart: 0% to 50% value-prediction rate, on CP vs. off CP, for compress, gcc, m88ksim, gzip, applu, swim, wave5)

0.0% 5.0% 10.0% 15.0% 20.0% 25.0% 30.0% 35.0% 40.0% 45.0% 50.0% 55.0% gcc gzip parser perl twolf ammp art Speedup over no value prediction No CP
  • ldest -uncom m it t ed
  • ldest -unissued
token -p assin g

Focused Value Prediction (2)

  • 30%
  • 20%
  • 10%

0% 10% 20% 30% 40% 50% 60%

ammp gcc gzip parser perl twolf

No CP

  • ldest-uncommitted
  • ldest-unissued

token-passing

less confident value predictor more value mispredictions: