A Framework for Modeling and Optimization of Prescient Instruction Prefetch (PowerPoint PPT Presentation)

ACM Sigmetrics 2003, June 12, 2003. Tor M. Aamodt.


SLIDE 1

A Framework for Modeling and Optimization of Prescient Instruction Prefetch

Tor M. Aamodt †‡, Pedro Marcuello §, Paul Chow ‡, Antonio Gonzalez § Per Hammarlund ¶, Hong Wang †, John P. Shen †

†Microprocessor Research, Intel Labs; ‡Dept. of Electrical and Computer Engineering, University of Toronto; §Intel Barcelona Research Center; ¶Desktop Products Group, Intel Corp.

ACM Sigmetrics 2003 – June 12, 2003

SLIDE 2

Tor M. Aamodt

Multithreading

[Figure: single chip with multiple hardware contexts (program counters) sharing a cache]

Single chip, multiple flows of control. Question: how might a single-threaded application exploit this hardware capability?

SLIDE 3

Helper Threads

Related work on helper threads:

  • Helper threads: Chappell & Patt, Dubois & Song
  • Slices: Zilles & Sohi, Roth & Sohi
  • Data prefetch: Zilles & Sohi, Collins et al., Annavaram & Davidson, Luk, Moshovos et al., Liao et al.
  • Branch prediction: Chappell & Patt

This work: first work to study using helper threads for instruction prefetch (may also help TC pre-building).

  • Use spare thread context(s) to reduce µArch bottlenecks.
  • Typically do not need to satisfy all correctness constraints.

SLIDE 4

Existing/Proposed Techniques

  • Traditional hardware – scalability
  • Helper thread – a few “delinquent” instructions
  • Runahead – need simultaneous I & D miss

SLIDE 5

Prescient Instruction Prefetch

[Figure: timeline with a spawn point and a target in the main thread; the helper thread runs ahead over the postfix region, with the prefix, infix, and postfix regions and the covered I-cache misses marked]

SLIDE 6

Optimization of Prescient Instruction Prefetch

  • Optimization problem can be divided into two parts:
      1. Selection of SPAWN-TARGET pairs
      2. Optimization of resulting thread code, and hardware used to run it
  • This paper focuses on the first issue only

[Figure: framework overview: Stochastic Path Analysis, HW Abstraction, Optimization Algorithms, Path Expression Mappings]

SLIDE 7

HW Abstraction

  • Instruction sequencer
  • Fully associative I-cache (line size = 1 inst.)
  • Memory
  • Intra-procedural control flow = Markov chain; calls/returns paired

SLIDE 8

HW Abstraction

Prescient Instruction Prefetch

[Figure: time (cycles) vs. instructions retired, showing spawn point s, instruction i, and slack(i, s, t) for spawn-target pair (s, t)]
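The deck never defines slack in text; one plausible formalization (our assumption, not the paper's notation) is the number of cycles by which the helper's fetch of instruction i precedes the main thread's fetch of i:

```latex
% Hypothetical definition of the slack metric for spawn point s, target t:
% t_{main}(i):        cycle at which the main thread fetches instruction i
% t_{helper}(i,s,t):  cycle at which the helper spawned at s for target t fetches i
\mathrm{slack}(i, s, t) \;=\; t_{\mathrm{main}}(i) \;-\; t_{\mathrm{helper}}(i, s, t)
```

Under this reading, a prefetch is timely when the slack exceeds the memory latency, which is exactly the tradeoff the next slide discusses.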
SLIDE 9

Spawn-Target Selection Tradeoffs

  • S & T highly correlated
  • S and T should be far apart so slack is larger than memory latency.
  • S→T instruction footprint should fit in I-cache; T→S should not.

[Figure: control-flow graph of foo() with blocks a, b, c, X and edge probabilities 0.10, 0.999, 0.98]

SLIDE 10

Quantifying Tradeoffs

METRIC: Aspect Quantified
  • Reaching Probability: accuracy
  • Posteriori Probability: coverage
  • Expected Path Length: timeliness, necessity
  • Path Length Variation: timeliness
  • Path Footprint: timeliness

SLIDE 11

Spawn-Target Selection Algorithm

  • Inputs: profile data (I-cache & edge profiles), estimated helper-thread & main-thread CPI
  • Compute metrics / spawn-target value function
  • Select using greedy heuristic

[Flowchart: inputs are I-cache & edge profiling data and estimated helper-thread & main-thread CPI. Summarize procedures; partition large basic blocks; use the fast path algorithm to find path expressions and compute RP, PP, and path length mean & variance. For each procedure in a bottom-up traversal of the call graph: select the next block; select the earliest target within ½ max prefetch distance; select a set of spawn points (computing the I-cache footprint on demand). If a set of spawn points is found, output the spawn points, target, and max. prefetch distance and update the estimated number of running helper threads and the I-cache miss coverage. Done when coverage of all basic blocks is acceptable or no pair is found.]

SLIDE 12

Path Expressions

  • Regular expression describing all paths between two points.

[Figure: control-flow graph with edges A, B, C, D, E, F between points a and X]

P(a, X) = A · (C · D · E · F)* · B

Fast Path Expression Algorithm

  • [Tarjan 1981]: general approach to solving path problems efficiently.
  • Examples: solving Ax = b, shortest paths, data flow analysis.
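The idea behind path expressions can be illustrated (very loosely) by naive state elimination, which produces the same kind of regular expression; this sketch and its node and edge names are our own, not the paper's:

```python
# Naive state elimination: repeatedly remove an intermediate node k,
# replacing every path i -> k -> j by the expression R[i][k] (R[k][k])* R[k][j].
# None means "no path"; "" means the empty (epsilon) path.

def union(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return a if a == b else f"({a}|{b})"

def concat(a, b):
    if a is None or b is None:
        return None
    return a + b

def star(a):
    return "" if a in (None, "") else f"({a})*"

def path_expression(nodes, edges, src, dst):
    """edges: dict (u, v) -> edge label; returns a regex for all src->dst paths."""
    R = {(u, v): edges.get((u, v)) for u in nodes for v in nodes}
    for k in nodes:
        if k in (src, dst):
            continue
        loop = star(R[(k, k)])
        for i in nodes:
            for j in nodes:
                if k in (i, j):
                    continue
                via = concat(concat(R[(i, k)], loop), R[(k, j)])
                R[(i, j)] = union(R[(i, j)], via)
    # For simplicity this sketch assumes no dst -> src back edges.
    return concat(star(R[(src, src)]), concat(R[(src, dst)], star(R[(dst, dst)])))

# A loop (body labeled CDEF) between spawn a and target X, entered via A, exited via B:
edges = {("a", "h"): "A", ("h", "h"): "CDEF", ("h", "X"): "B"}
print(path_expression(["a", "h", "X"], edges, "a", "X"))  # A(CDEF)*B
```

A real implementation orders the eliminations cleverly (that is the point of [Tarjan 1981]); this quadratic sketch only shows where the closure, concatenation, and union operators come from.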

SLIDE 13

Example: Reaching Probability

[Figure: the foo() control-flow graph with points a and X]

P(A) = 0.98  P(B) = 0.10  P(C) = 1.00  P(D) = 0.90  P(E) = 1.00  P(F) = 0.999

Reaching-probability mappings:
  • concatenation [R1 · R2] = p·q
  • union [R1 ∪ R2] = p + q
  • closure [R1*] = 1 / (1 − p)

P(a, X) = [A · (C · D · E · F)* · B](a, X)
        = 0.98 · (1 / (1 − 1.00 · 0.90 · 1.00 · 0.999)) · 0.10
        ≈ 0.97
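The arithmetic above can be replayed by mapping each regular-expression operator to its reaching-probability rule; a minimal sketch, assuming the loop body is C·D·E·F and the exit edge is B:

```python
# Reaching-probability mappings from the slides:
# concatenation -> p * q, union -> p + q, closure -> 1 / (1 - p)
def concat(p, q):
    return p * q

def union(p, q):
    return p + q

def closure(p):
    return 1.0 / (1.0 - p)

# Edge probabilities from the example CFG
A, B, C, D, E, F = 0.98, 0.10, 1.00, 0.90, 1.00, 0.999

# P(a, X) = A . (C.D.E.F)* . B
loop_body = concat(concat(C, D), concat(E, F))
rp = concat(concat(A, closure(loop_body)), B)
print(round(rp, 2))  # 0.97
```

The high reaching probability despite the 0.10 branch comes from the closure: the 0.999 back edge makes the loop iterate many times, so the 10%-per-iteration event almost certainly happens eventually.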

SLIDE 14

Mappings

Each operation on path expressions maps to an update of three summary statistics: reaching probability (p, q), expected path length (X, Y), and path length variance (v, w):

concatenation [R1 · R2]:
  • Reaching Probability: p·q
  • Expected Path Length: X + Y
  • Path Length Variance: v + w

union [R1 ∪ R2]:
  • Reaching Probability: p + q
  • Expected Path Length: (p·X + q·Y) / (p + q)
  • Path Length Variance: (p·(v + X²) + q·(w + Y²)) / (p + q) − ((p·X + q·Y) / (p + q))²

closure [R1*]:
  • Reaching Probability: 1 / (1 − p)
  • Expected Path Length: p·X / (1 − p)
  • Path Length Variance: p·v / (1 − p) + p·X² / (1 − p)²

Decompose problem: Σ E[X | follow path p ∈ R] · P[follow path p ∈ R]
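These mappings compose as an algebra over (probability, mean, variance) triples. A sketch in code (our own formulation; the closure-variance rule is derived by treating R* as a geometric mixture, which is an assumption about the slide's intent):

```python
# Each sub-expression is summarized by a triple (p, mean, var):
# reaching probability, expected path length, path length variance.

def concat(r1, r2):
    (p, x, v), (q, y, w) = r1, r2
    return p * q, x + y, v + w

def union(r1, r2):
    (p, x, v), (q, y, w) = r1, r2
    rp = p + q
    mean = (p * x + q * y) / rp
    # Mixture variance: weighted second moment minus squared mean.
    var = (p * (v + x * x) + q * (w + y * y)) / rp - mean * mean
    return rp, mean, var

def closure(r1):
    p, x, v = r1
    rp = 1.0 / (1.0 - p)
    mean = p * x / (1.0 - p)
    # Geometric-mixture variance for the number of loop iterations.
    var = p * v / (1.0 - p) + p * x * x / (1.0 - p) ** 2
    return rp, mean, var

# Sanity check: a self-loop taken with probability 0.5, one instruction long.
# The iteration count is geometric, so the mean is 1 and the variance is 2.
print(closure((0.5, 1.0, 0.0)))  # (2.0, 1.0, 2.0)
```

The union rule matches the decomposition on the slide: the overall mean is Σ E[X | follow p ∈ R] · P[follow p ∈ R] normalized by the total weight.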

SLIDE 15

Path Footprint

F(x, y) = (1 / RP(x, y)) · Σv size(v) · RPα(x, v | ¬y) · RPβ(v, y)

(the expected I-cache footprint of the paths from x to y: each block v is weighted by its size and the probability it lies on a path from x to y)
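Read as an expectation over basic blocks, the footprint formula can be sketched as follows; path_footprint and the rp helper are hypothetical names, and the toy rp below ignores the ¬y condition for simplicity:

```python
# Expected I-cache footprint of the paths from x to y: each block v
# contributes size(v), weighted by the probability a path from x to y
# passes through v, normalized by the overall reaching probability.

def path_footprint(blocks, size, rp, x, y):
    total = sum(size[v] * rp(x, v, avoid=y) * rp(v, y) for v in blocks)
    return total / rp(x, y)

# Toy check on a straight-line region x -> v -> y (all probabilities 1):
POS = {"x": 0, "v": 1, "y": 2}

def rp(a, b, avoid=None):  # hypothetical reaching-probability helper
    return 1.0 if POS[a] <= POS[b] else 0.0

print(path_footprint(["x", "v", "y"], {"x": 64, "v": 32, "y": 32}, rp, "x", "y"))  # 128.0
```

On a deterministic straight line every block lies on the path, so the footprint is simply the sum of the block sizes; with probabilistic control flow, rarely-executed blocks contribute only a fraction of their size.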

SLIDE 16

Accuracy: vs. Monte Carlo

[Scatter plots, predicted vs. measured: Reaching Probability (94%–100%), Expected Path Length (200–400), Path Length Variation (50–100), Path Footprint (100–300)]

SLIDE 17

Accuracy: vs. Execution

[Scatter plots, predicted vs. measured: Reaching Probability (predicted 60%–100% vs. measured 94%–100%), Expected Path Length (200–400), Path Length Variation (50–100), Path Footprint (100–300)]

SLIDE 18

Spawn-Target Selection Algorithm

[Flowchart, repeated from SLIDE 11: from I-cache & edge profiling data and estimated helper-thread & main-thread CPI, summarize procedures and partition large basic blocks; for each procedure in a bottom-up traversal of the call graph, select targets and spawn points until coverage of all basic blocks is acceptable or no pair is found; output spawn points, target, and max. prefetch distance]

SLIDE 19

Selection Algorithm Details

Loop over basic blocks (ranked by E[# i-misses]):

  • 1. Select target, then select spawn:
      value(spawn) ∝ PP · RP · E[postfix size] · P[miss] · P[!evicted] · P[# ht < # ctx]
  • 2. Update coverage metrics:
      # helper threads = Σ PP(s, i) · P[still running]
      # i-cache misses -= PP(t, s) · PP(t, i) · P[still running] · (# i-cache misses)
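The value function above is a straight product of the listed factors; a sketch with hypothetical argument names (our placeholders, not the paper's API):

```python
# Greedy spawn-point scoring: the product of posteriori probability (pp),
# reaching probability (rp), the expected postfix size, the probability the
# target actually misses, the probability the prefetched lines are not
# evicted, and the probability a spare hardware context is free.

def spawn_value(pp, rp, postfix_size, p_miss, p_not_evicted, p_ctx_free):
    return pp * rp * postfix_size * p_miss * p_not_evicted * p_ctx_free

def pick_spawn(candidates):
    """candidates: list of (name, metrics dict); return the highest-value spawn."""
    return max(candidates, key=lambda c: spawn_value(**c[1]))

cands = [
    ("s1", dict(pp=0.9, rp=0.95, postfix_size=600, p_miss=0.4,
                p_not_evicted=0.8, p_ctx_free=0.7)),
    ("s2", dict(pp=0.6, rp=0.99, postfix_size=600, p_miss=0.4,
                p_not_evicted=0.9, p_ctx_free=0.7)),
]
print(pick_spawn(cands)[0])  # s1
```

Because the score is a pure product, any factor near zero (a spawn that rarely reaches the target, or a target that rarely misses) disqualifies the candidate, which matches the greedy-heuristic framing on SLIDE 11.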

SLIDE 20

Hardware Details

  • Threading: SMT processor with 2, 4, or 8 hardware contexts
  • Pipelining: in-order 12-stage pipeline
  • Fetch: 2 bundles from 1, or 1 bundle from 2 threads, prioritizing main thread; helpers ICOUNT
  • I-prefetch: next-line prefetch (triggered on miss); stream prefetch triggered by compiler hints (max. 4 outstanding prefetches per context)
  • Branch pred.: 2k-entry gshare; 256-entry 4-way assoc. BTB. Helper threads use oracle branch prediction (always follow correct path)
  • Issue: 2 bundles from 1, or 1 bundle from 2 threads; prioritize main thread, helpers round-robin
  • Caches: L1 (separate I&D): 16KB 4-way, 1-cycle; L2 (shared): 256KB 4-way, 14-cycle; L3 (shared): 3072KB 12-way, 30-cycle
  • Memory: 230 cycles; TLB miss = 30 cycles

SLIDE 21

Performance Impact

[Bar chart: speedup (1.0–1.3) for 145.fpppp, 177.mesa, 186.crafty, 252.eon, 255.vortex, and avg; configurations: ideal, 2t, 4t, 8t]

SLIDE 22

Performance Impact

[Bar chart: speedup (1.0–1.3) for 145.fpppp, 177.mesa, 186.crafty, 252.eon, 255.vortex, and avg; configurations: ideal, 4t, F, L]

SLIDE 23

Source of Remaining I-cache Misses

[Stacked bar chart: miss breakdown (0%–100%) for the 2t, 4t, 8t, F, and L configurations on 145.fpppp, 177.mesa, 186.crafty, 252.eon, and 255.vortex; categories: evicted, too-slow, no-context, no-spawn, no-target]

SLIDE 24

Helper Thread Characteristics

benchmark     static # pairs   dyn. # pairs   infix region size   postfix region size
145.fpppp           62             378528            622                 162
177.mesa            34             210519           1186                 255
186.crafty         166             560200            573                 129
252.eon            152             407516            691                 120
255.vortex        1348             438722           1032                 142

SLIDE 25

Summary

  • Limit study indicates Prescient Instruction Prefetch may yield speedups from 4.8% to 17%
  • Introduce a framework for selecting spawn-target pairs based upon statistical analysis using path expressions
  • Future work:
      • hardware & slice optimization (under review)
      • more sophisticated modeling (branch correlation, phases)
      • optimization algorithms